## Describe your model -> fine-tuned LLaMA 2
By Matt Shumer (https://twitter.com/mattshumer_)

The goal of this notebook is to experiment with a new way to make it very easy to build a task-specific model for your use-case.

First, use the best GPU available (go to Runtime -> change runtime type)

To create your model, just go to the first code cell, and describe the model you want to build in the prompt. Be descriptive and clear.

Select a temperature (high=creative, low=precise), and the number of training examples to generate to train the model. From there, just run all the cells.

You can change the model you want to fine-tune by changing `model_name` in the `Define Hyperparameters` cell.

# Data generation step

Write your prompt here. Make it as descriptive as possible!

Then, choose the temperature (between 0 and 1) to use when generating data. Lower values are great for precise tasks, like writing code, whereas larger values are better for creative tasks, like writing stories.

Finally, choose how many examples you want to generate. The more you generate, a) the longer it takes and b) the more expensive data generation will be. But generally, more examples will lead to a higher-quality model. 100 is usually the minimum to start.

In [None]:
prompt = "A model that takes in a puzzle-like reasoning-heavy question in English, and responds with a well-reasoned, step-by-step thought out response in English."
temperature = .4
number_of_examples = 100

We will save the generated data to `output_path` and the system message to `system_path` on Google Drive, so both can be reloaded in later sessions without regenerating them.

In [1]:
import csv
from google.colab import drive
drive.mount('/content/drive')
output_path = "/content/drive/MyDrive/llama2_raw_blocks.csv"
system_path = "/content/drive/MyDrive/llama2_system_prompt.txt"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


To generate 100 varied examples, we will generate prompts and responses under 10 main topics, with 10 examples each. This reduces duplication and gives us a range of distinct subproblems within the same overall scope.

In [None]:
MAIN_PROMPT = "You are generating high-quality reasoning data to train a machine learning model."

# 10 reasoning-heavy main topics
main_topics = [
    "Number theory puzzles",
    "Algebraic reasoning problems",
    "Logical deduction and truth-teller/liar puzzles",
    "Combinatorics and counting puzzles",
    "Probability-based reasoning",
    "Geometric logic and spatial puzzles",
    "Mathematical word problems",
    "Pattern recognition and sequences",
    "Optimization and constraint satisfaction puzzles",
    "Mathematical paradoxes and counterintuitive results"
]

# Improved system prompt
def build_system_prompt(model_description, topic):
    return f"""{model_description}.

The model takes as input a reasoning-heavy question in English and generates a detailed, step-by-step explanation in English.

You are now tasked with generating one such example. Focus on creating questions under the theme of: **{topic}**

For every data sample, follow this format exactly:\n
```\nprompt\n-----------\n$prompt_goes_here\n-----------\n\nresponse\n-----------\n$response_goes_here\n-----------\n```

Requirements:
- The **question** must require deep reasoning, puzzles, or critical thinking.
- The **response** must show a clear, step-by-step thought process in English.
- Each example must be **unique**, **diverse**, and **increasingly complex**.
- Make sure to use proper English grammar and academic tone.

Now generate one such prompt/response pair for the topic: '{topic}'"""
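
The plan above is 10 topics with 10 examples each, which gives 100 in total. As a minimal sketch, assuming you would rather derive the per-topic quota from `number_of_examples` than hard-code it, the split could be computed as follows. The helper name `split_quota` is hypothetical and not part of this notebook.

```python
def split_quota(total_examples, topics):
    """Distribute total_examples across topics as evenly as possible."""
    base, extra = divmod(total_examples, len(topics))
    # The first `extra` topics receive one additional example each.
    return {topic: base + (1 if i < extra else 0) for i, topic in enumerate(topics)}

quota = split_quota(100, ["number theory", "algebra", "logic"])
print(quota)
```

With 10 topics and 100 examples, every topic simply gets 10.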

Run this to generate the dataset.

In [None]:
!pip install openai==0.28

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl.metadata (13 kB)
Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.93.0
    Uninstalling openai-1.93.0:
      Successfully uninstalled openai-1.93.0
Successfully installed openai-0.28.0


In [None]:
import os
import openai
import random
import hashlib
import time
from openai.error import RateLimitError

openai.api_key = "xxxxxxx xxxx"

def generate_example(prompt_template, prev_examples, temperature=0.5):
    messages = [{"role": "system", "content": prompt_template}]

    if len(prev_examples) > 0:
        # Sample up to 5 previous examples for context
        context_examples = random.sample(prev_examples, min(5, len(prev_examples)))
        for example in context_examples:
            messages.append({"role": "assistant", "content": example})

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages,
        temperature=temperature,
        max_tokens=1300,
    )
    return response.choices[0].message['content']


def save_progress(path, examples):
    with open(path, "w", encoding="utf-8", newline='') as f:
        writer = csv.writer(f)
        for ex in examples:
            writer.writerow([ex])
    print(f"Progress saved to {path}")

def load_progress(path):
    try:
        # newline='' lets the csv module handle newlines embedded in quoted fields
        with open(path, "r", encoding="utf-8", newline='') as f:
            reader = csv.reader(f)
            loaded = [row[0] for row in reader if row]
            print(f"Loaded {len(loaded)} previously saved examples.")
            return loaded
    except FileNotFoundError:
        return []

SAVE_PATH = "/content/drive/MyDrive/llama2_gen_examples.csv"
# Main loop
def generate_all_examples():
    subtopics_per_main = 10  # examples to generate per main topic (10 topics x 10 = 100)
    temperature = 0.4  # local value; the notebook-level `temperature` is not read here

    all_examples = load_progress(SAVE_PATH)
    seen_hashes = set(hashlib.md5(ex.encode()).hexdigest() for ex in all_examples)

    examples_per_topic = {topic: 0 for topic in main_topics}

    # count existing examples per topic (assumes topics are inside example text)
    for ex in all_examples:
        for topic in main_topics:
            if topic in ex:
                examples_per_topic[topic] += 1

    for main_topic in main_topics:
        if examples_per_topic[main_topic] >= subtopics_per_main:
            print(f"Skipping completed topic: {main_topic}")
            continue

        print(f"\nMain topic: {main_topic}")
        sys_prompt = build_system_prompt(MAIN_PROMPT, main_topic)
        topic_examples = []

        while examples_per_topic[main_topic] < subtopics_per_main:
            try:
                example = generate_example(sys_prompt, topic_examples, temperature)
                ex_hash = hashlib.md5(example.encode()).hexdigest()

                if ex_hash not in seen_hashes:
                    all_examples.append(example)
                    topic_examples.append(example)
                    seen_hashes.add(ex_hash)
                    examples_per_topic[main_topic] += 1
                    print(f"Added example {len(all_examples)}")
                    if len(all_examples) % 5 == 0:
                        save_progress(SAVE_PATH, all_examples)
                else:
                    print("Duplicate detected. Retrying...")

            except RateLimitError:
                print("Rate limit hit. Sleeping for 10s...")
                time.sleep(10)

            except Exception as e:
                print(f"Error: {e}")
                time.sleep(5)

    save_progress(SAVE_PATH, all_examples)
    return all_examples
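
The loop above sleeps for a fixed 10 seconds after a rate-limit error. A common alternative is exponential backoff, where the wait doubles after each failed attempt. The sketch below is only an illustration: `call_with_backoff` and its parameters are hypothetical, and the `sleep` argument exists so the demo can run without actually waiting.

```python
import random
import time

def call_with_backoff(fn, retriable=(Exception,), max_retries=5,
                      base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on `retriable` errors with exponentially growing delays."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retriable:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # 1x, 2x, 4x, ... the base delay, plus a little random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1 * base_delay)
            sleep(delay)

# Demo with a stand-in function that fails twice before succeeding.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

delays = []  # record the requested delays instead of sleeping
result = call_with_backoff(flaky, retriable=(RuntimeError,), sleep=delays.append)
print(result, len(delays))
```

In this notebook, the wrapped call would be `generate_example(...)` with `retriable=(RateLimitError,)`.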

We generate examples while ensuring there are no exact duplicates: each generated block is hashed with MD5 and compared against the hashes of all previous examples. When a rate-limit error occurs, the loop sleeps for 10 seconds before retrying (5 seconds for any other error). Progress is saved to Google Drive after every 5 new examples, and the number of examples per topic is tracked, so that if the run stops, for example because the API credits run out, a later run can pick up from the saved examples instead of starting over.
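
An exact hash match only catches byte-identical outputs. As a sketch of a slightly stricter check, assuming that differences in letter case and whitespace should not count as a new example, the text can be normalized before hashing. `normalized_hash` and `is_new` are hypothetical helpers, not the code used in this notebook.

```python
import hashlib

def normalized_hash(text):
    """Hash text after lowercasing and collapsing runs of whitespace."""
    canonical = " ".join(text.lower().split())
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

seen = set()

def is_new(example):
    """Return True and remember the example if it has not been seen before."""
    h = normalized_hash(example)
    if h in seen:
        return False
    seen.add(h)
    return True

first = is_new("What is 2 + 2?\n")
second = is_new("what is  2 + 2?")  # differs only in case and spacing
print(first, second)
```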

In [None]:
dataset = generate_all_examples()


Main topic: Number theory puzzles
Added example 1
Added example 2
Added example 3
Added example 4
Added example 5
Progress saved to /content/drive/MyDrive/llama2_gen_examples.csv
Added example 6
Added example 7
Added example 8
Added example 9
Added example 10
Progress saved to /content/drive/MyDrive/llama2_gen_examples.csv

Main topic: Algebraic reasoning problems
Added example 11
Added example 12
Added example 13
Added example 14
Added example 15
Progress saved to /content/drive/MyDrive/llama2_gen_examples.csv
Added example 16
Added example 17
Added example 18
Added example 19
Added example 20
Progress saved to /content/drive/MyDrive/llama2_gen_examples.csv

Main topic: Logical deduction and truth-teller/liar puzzles
Added example 21
Added example 22
Added example 23
Added example 24
Added example 25
Progress saved to /content/drive/MyDrive/llama2_gen_examples.csv
Added example 26
Added example 27
Added example 28
Added example 29
Added example 30
Progress saved to /content/drive/MyD

We have now generated all 100 required examples. We will save them to a .csv file for future loading.

In [None]:
import csv

output_path = "/content/drive/MyDrive/llama2_raw_blocks.csv"

with open(output_path, mode='w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    for item in dataset:
        writer.writerow([item])  # one column, one full block

print(f"Raw dataset saved to CSV: {output_path}")

Raw dataset saved to CSV: /content/drive/MyDrive/llama2_raw_blocks.csv


In [2]:
loaded_raw_dataset = []

with open(output_path, mode='r', encoding='utf-8', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        if row:
            loaded_raw_dataset.append(row[0])  # full block as a single string

prev_examples = loaded_raw_dataset

print(f"Loaded {len(loaded_raw_dataset)} raw entries from CSV")

Loaded 100 raw entries from CSV


We also need to generate a system message.

In [None]:
def generate_system_message(prompt):

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
          {
            "role": "system",
            "content": "You will be given a high-level description of the model we are training, and from that, you will generate a simple system prompt for that model to use. "
            "Remember, you are not generating the system message for data generation -- you are generating the system message to use for inference. "
            "A good format to follow is `Given $INPUT_DATA, you will $WHAT_THE_MODEL_SHOULD_DO.`.\n\nMake it as concise as possible. Include nothing but the system prompt in your response.\n\n"
            "For example, never write: `\"$SYSTEM_PROMPT_HERE\"`.\n\nIt should be like: `$SYSTEM_PROMPT_HERE`."
          },
          {
              "role": "user",
              "content": prompt.strip(),
          }
        ],
        temperature=temperature,
        max_tokens=500,
    )

    return response.choices[0].message['content']

system_message = generate_system_message(prompt)

print(f'The system message is: `{system_message}`. Feel free to re-run this cell if you want a better result.')

with open(system_path, "w", encoding="utf-8") as f:
    f.write(system_message)

print(f"System prompt saved at: {system_path}")

The system message is: `Given a puzzle-like reasoning-heavy question in English, you will generate a well-reasoned, step-by-step thought out response in English.`. Feel free to re-run this cell if you want a better result.
System prompt saved at: /content/drive/MyDrive/llama2_system_prompt.txt


The system message was written to a .txt file above, so later sessions can reload it without calling the API again.

In [3]:
# Load the saved system message
with open("/content/drive/MyDrive/llama2_system_prompt.txt", "r", encoding="utf-8") as f:
    system_message = f.read().strip()

print(f"Loaded system message: `{system_message}`")

Loaded system message: `Given a puzzle-like reasoning-heavy question in English, you will generate a well-reasoned, step-by-step thought out response in English.`


Now let's put our examples into a dataframe and turn them into a final pair of datasets.

In [4]:
import pandas as pd

# Initialize lists to store prompts and responses
prompts = []
responses = []

# Parse out prompts and responses from examples
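for example in prev_examples:
    split_example = example.split('-----------')
    if len(split_example) < 4:
        continue  # skip blocks that do not follow the expected layout
    # Append both fields together so the two lists always stay the same length
    prompts.append(split_example[1].strip())
    responses.append(split_example[3].strip())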

# Create a DataFrame
df = pd.DataFrame({
    'prompt': prompts,
    'response': responses
})

# Remove duplicates
df = df.drop_duplicates()

print('There are ' + str(len(df)) + ' successfully-generated examples. Here are the first few:')

df.head()

There are 100 successfully-generated examples. Here are the first few:


| | prompt | response |
|---|---|---|
| 0 | A number's persistence is the number of steps ... | To find the persistence of the number 679, we ... |
| 1 | Consider a sequence where the nth term is the ... | To find the 6th term of the sequence, we first... |
| 2 | In a certain number sequence, each number afte... | To find the 6th number in this sequence, we ne... |
| 3 | Consider a sequence where the nth term is the ... | To find the 10th prime number, we first need t... |
| 4 | Consider a sequence where the nth term is the ... | To find the 7th term of the sequence, we first... |
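
The parsing step relies on the separator layout requested in the system prompt. A minimal sketch of the same split, written as a hypothetical helper `parse_block` so that rejected blocks can be counted rather than dropped silently:

```python
SEPARATOR = "-----------"

def parse_block(block):
    """Return (prompt, response) from one generated block, or None if malformed."""
    parts = block.split(SEPARATOR)
    # Expected layout: 'prompt' SEP <prompt> SEP 'response' SEP <response> SEP
    if len(parts) < 4:
        return None
    prompt, response = parts[1].strip(), parts[3].strip()
    if not prompt or not response:
        return None
    return prompt, response

blocks = [
    "prompt\n-----------\nWhat is 2 + 2?\n-----------\n\nresponse\n-----------\n2 + 2 = 4.\n-----------",
    "this block has no separators",
]
parsed = [parse_block(b) for b in blocks]
pairs = [p for p in parsed if p is not None]
print(f"parsed {len(pairs)} of {len(blocks)} blocks")
```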


Our generated data has good variety, but because it was generated topic by topic, the rows are ordered by topic, which could bias training and evaluation. It is therefore better to shuffle the entire dataset randomly, so that the train and test sets each contain a mix of topics.

In [5]:
from sklearn.utils import shuffle

# Step 1: Shuffle the entire DataFrame
df_shuffled = shuffle(df, random_state=42).reset_index(drop=True)

Split into train and test sets.

In [6]:
# Split the data into train and test sets, with 90% in the train set
train_df = df_shuffled.sample(frac=0.9, random_state=42)
test_df = df_shuffled.drop(train_df.index)

# Save the dataframes to .jsonl files
train_df.to_json('train.jsonl', orient='records', lines=True)
test_df.to_json('test.jsonl', orient='records', lines=True)
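
To make the idea concrete without pandas or scikit-learn, here is a minimal standard-library sketch of a seeded shuffle followed by a 90/10 split. The function name `shuffle_and_split` is hypothetical. The point it illustrates is that a fixed seed makes the split reproducible and that the two sets never overlap.

```python
import random

def shuffle_and_split(items, train_frac=0.9, seed=42):
    """Shuffle a copy of items with a fixed seed, then cut it into train/test."""
    rng = random.Random(seed)      # private generator, so global state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = shuffle_and_split(range(100))
print(len(train), len(test))
```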

# Install necessary libraries

We will install current versions of the required packages. Older versions no longer work and may cause CUDA issues while training the model.

In [1]:
# Install required packages with latest versions
!pip install -q accelerate peft bitsandbytes transformers trl datasets torch torchvision torchaudio

In [3]:
!pip install -q fsspec

In [7]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel, TaskType, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

## Fine Tuning the model

We will first load the dataset and put it into the prompt format the model expects. With newer package versions, the format used for LLaMA 2 has changed slightly.

In [8]:
# Load datasets
train_dataset = load_dataset('json', data_files='/content/train.jsonl', split="train")
valid_dataset = load_dataset('json', data_files='/content/test.jsonl', split="train")

# Preprocess datasets - Create both 'text' and 'completion' fields for modern SFTTrainer
def format_prompts(examples):
    texts = []
    completions = []
    for prompt, response in zip(examples['prompt'], examples['response']):
        # Full formatted text for compatibility
        full_text = f"<s>[INST] <<SYS>>\n{system_message.strip()}\n<</SYS>>\n\n{prompt} [/INST] {response}</s>"
        texts.append(full_text)
        # Just the response for completion field
        completions.append(response)
    return {'text': texts, 'completion': completions}

train_dataset_mapped = train_dataset.map(format_prompts, batched=True)
valid_dataset_mapped = valid_dataset.map(format_prompts, batched=True)

print(f"Training dataset size: {len(train_dataset_mapped)}")
print(f"Validation dataset size: {len(valid_dataset_mapped)}")

Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/90 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Training dataset size: 90
Validation dataset size: 10
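
The template used in `format_prompts` can be isolated in a small pure function, which makes it easier to keep the training and inference prompts in sync. This is a sketch under the assumption that the inference prompt is the training text without the response and closing `</s>`. The name `format_llama2` is hypothetical. Whether `<s>` should be written explicitly depends on whether the tokenizer already adds a beginning-of-sequence token, so that is worth checking for your setup.

```python
def format_llama2(system, prompt, response=None):
    """Build a Llama 2 chat-style string; include the response only for training."""
    text = f"<s>[INST] <<SYS>>\n{system.strip()}\n<</SYS>>\n\n{prompt} [/INST]"
    if response is not None:
        text += f" {response}</s>"
    return text

train_text = format_llama2("Be concise.", "What is 2 + 2?", "4")
infer_text = format_llama2("Be concise.", "What is 2 + 2?")
print(train_text)
```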


Next, we specify the base model to fine-tune and the name of the resulting custom model. We also define the parameters and configuration needed for fine-tuning.

In [9]:
model_name = "NousResearch/llama-2-7b-chat-hf"
new_model = "llama-2-7b-chat-custom"

# Quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [10]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [11]:
# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

In [12]:
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=32,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

# Apply LoRA to the model
model = get_peft_model(model, peft_config)

In [13]:
# Print trainable parameters
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}")

print_trainable_parameters(model)

trainable params: 79953920 || all params: 3580366848 || trainable%: 2.233120889404459
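
Both numbers in this output can be reproduced by hand. A LoRA adapter of rank `r` on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The sketch below assumes the published Llama 2 7B dimensions (hidden size 4096, MLP size 11008, 32 layers, vocabulary 32000), which are not stated in this notebook. It also assumes that 4-bit weights are stored two per element, which would explain why the total is about 3.58 billion and not roughly 7 billion.

```python
HIDDEN, MLP, LAYERS, VOCAB, RANK = 4096, 11008, 32, 32000, 32

# (d_in, d_out) for each module listed in target_modules
shapes = {
    "q_proj": (HIDDEN, HIDDEN), "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN), "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP), "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

# LoRA adds A (r x d_in) and B (d_out x r) for every adapted layer.
lora_params = LAYERS * sum(RANK * (d_in + d_out) for d_in, d_out in shapes.values())

linear_weights = LAYERS * sum(d_in * d_out for d_in, d_out in shapes.values())
packed_linear = linear_weights // 2   # assumption: two 4-bit values per stored element
embeddings = 2 * VOCAB * HIDDEN       # input embedding + lm_head, assumed not quantized
norms = HIDDEN * (2 * LAYERS + 1)     # two norm weights per layer + the final norm

all_params = packed_linear + embeddings + norms + lora_params
print(lora_params, all_params, 100 * lora_params / all_params)
```

Under these assumptions the totals match the printed values, so the 2.23% figure is relative to the packed parameter count, not to the full 7B model.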


In [14]:
# Training arguments optimized for chat model
training_arguments = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,  # Reduced for chat model
    gradient_accumulation_steps=4,  # Increased to maintain effective batch size
    optim="paged_adamw_32bit",
    save_steps=250,
    logging_steps=10,
    learning_rate=1e-4,  # Slightly lower for chat model
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="cosine",  # Better for chat models
    report_to="tensorboard",
    num_train_epochs=1,  # Chat models often need fewer epochs
    eval_strategy="steps",  # Changed from evaluation_strategy
    eval_steps=250,
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    remove_unused_columns=False,  # Important for SFTTrainer
)

In [15]:
# Initialize the trainer with minimal parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset_mapped,
    eval_dataset=valid_dataset_mapped,
    args=training_arguments,
)

Adding EOS to train dataset:   0%|          | 0/90 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/90 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/90 [00:00<?, ? examples/s]

Adding EOS to eval dataset:   0%|          | 0/10 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/10 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/10 [00:00<?, ? examples/s]

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Once all the required parameters and configuration are specified, it is time to train the model on the loaded dataset. Note that some of these parameters were chosen so that fine-tuning finishes quickly on a Colab instance.

In [16]:
# Start training
print("Starting training...")
trainer.train()

Starting training...


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  return fn(*args, **kwargs)


Step,Training Loss,Validation Loss


TrainOutput(global_step=23, training_loss=0.7394159358480702, metrics={'train_runtime': 576.8403, 'train_samples_per_second': 0.156, 'train_steps_per_second': 0.04, 'total_flos': 1404011492278272.0, 'train_loss': 0.7394159358480702})
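
The step count can be checked with simple arithmetic. The sketch below assumes a single GPU and that the final, partial accumulation window counts as a step, which is consistent with the reported `global_step=23`. It also shows that with `eval_steps=250` no evaluation is triggered during such a short run, which would explain the empty Validation Loss column above.

```python
import math

train_examples = 90
per_device_batch = 1
grad_accum = 4
epochs = 1
eval_steps = 250

effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs
eval_runs = total_steps // eval_steps  # evaluations fire at multiples of eval_steps

print(total_steps, eval_runs)
```

If you want validation loss during training on a dataset this small, `eval_steps` and `save_steps` would need to be well below the total step count.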

Now we will save our fine-tuned model.

In [17]:
# Save the model
trainer.model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)

('llama-2-7b-chat-custom/tokenizer_config.json',
 'llama-2-7b-chat-custom/special_tokens_map.json',
 'llama-2-7b-chat-custom/tokenizer.model',
 'llama-2-7b-chat-custom/added_tokens.json',
 'llama-2-7b-chat-custom/tokenizer.json')

We can also run a quick test to check how the model performs. As the output below shows, the model produces incoherent, repetitive tokens, so this fine-tuning run did not succeed. The training parameters need to be revisited to get better results.

In [19]:
# Test the model
logging.set_verbosity(logging.CRITICAL)
prompt = f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\nWrite a function that reverses a string. [/INST]" # replace the command here with something relevant to your task
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(prompt)
print(result[0]['generated_text'])

  return fn(*args, **kwargs)


[INST] <<SYS>>
Given a puzzle-like reasoning-heavy question in English, you will generate a well-reasoned, step-by-step thought out response in English.
<</SYS>>

Write a function that reverses a string. [/INST]  everybody Hinweis Hinweis hopefully hopefully nobody nobody sierp фев Hinweis everybody everybody Hinweis Hinweis Unterscheidung Begriffe nobody hopefully obviously nobody Unterscheidung nobody everybody nobody Unterscheidung nobody Unterscheidung hopefully hopefully everybody Hinweis фев Hinweis nobody hopefully nobody Unterscheidung nobody Hinweis Hinweis Hinweis hopefully Hinweis Hinweis everybody nobody Unterscheidung nobody everybody nobody nobody hopefully hopefully everybody nobody Hinweis everybody hopefully everybody everybody everybody nobody hopefully everybody Hinweis nobody nobody Hinweis Hinweis фев everybody everybody everybody everybody everybody Unterscheidung Unterscheidung nobody nobody nobody nobody hopefully hopefully hopefully nobody Hinweis nobody Unters

In [20]:
from transformers import pipeline

prompt = f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\nWrite a function that reverses a string. [/INST]" # replace the command here with something relevant to your task
num_new_tokens = 100  # change to the number of new tokens you want to generate

# Count the number of tokens in the prompt
num_prompt_tokens = len(tokenizer(prompt)['input_ids'])

# Calculate the maximum length for the generation
max_length = num_prompt_tokens + num_new_tokens

gen = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=max_length)
result = gen(prompt)
print(result[0]['generated_text'].replace(prompt, ''))

  surely Unterscheidung Begriffe Unterscheidung Hinweis Hinweis nobody Hinweis nobody Hinweis everybody nobody nobody everybody nobody nobody nobody Hinweis Unterscheidung nobody everybody nobody everybody hopefully nobody nobody Unterscheidung nobody nobody Hinweis Unterscheidung nobody everybody Hinweis nobody everybody hopefully everybody everybody everybody everybody Unterscheidung Hinweis Unterscheidung Unterscheidung фев everybody obviously Hinweis Hinweis Hinweis nobody nobody sierp Unterscheidung Unterscheidung Hinweis nobody Hinweis everybody nobody Unterscheidung Unterscheidung nobody hopefully nobody nobody nobody hopefully everybody nobody Hinweis hopefully hopefully nobody everybody Unterscheidung hopefully nobody hopefully obviously Hinweis nobody Hinweis nobody hopefully hopefully everybody Hinweis Unterscheidung Hinweis hopefully everybody hopefully everybody everybody everybody Unterscheidung nobody


Next, we will merge the LoRA adapter into the base model and save the result to Google Drive, so that the model can be loaded anywhere for inference.

In [21]:
#Merge and save model efficiently
print("\n=== Model Merging and Saving ===")
model_path = "/content/drive/MyDrive/llama-2-7b-custom"

try:
    # Load base model
    base_model = AutoModelForCausalLM.from_pretrained(
        model_name,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        device_map="auto"
    )

    # Load and merge LoRA
    model = PeftModel.from_pretrained(base_model, new_model)
    model = model.merge_and_unload()

    # Save tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    # Save to Google Drive
    model.save_pretrained(model_path, safe_serialization=True)
    tokenizer.save_pretrained(model_path)

    print(f"Model saved to {model_path}")


except Exception as e:
    print(f"Error: {e}")


=== Model Merging and Saving ===


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



Error: We need an `offload_dir` to dispatch this model according to this `device_map`, the following submodules need to be offloaded: base_model.model.model.layers.12, base_model.model.model.layers.13, base_model.model.model.layers.14, base_model.model.model.layers.15, base_model.model.model.layers.16, base_model.model.model.layers.17, base_model.model.model.layers.18, base_model.model.model.layers.19, base_model.model.model.layers.20, base_model.model.model.layers.21, base_model.model.model.layers.22, base_model.model.model.layers.23, base_model.model.model.layers.24, base_model.model.model.layers.25, base_model.model.model.layers.26, base_model.model.model.layers.27, base_model.model.model.layers.28, base_model.model.model.layers.29, base_model.model.model.layers.30, base_model.model.model.layers.31, base_model.model.model.norm, base_model.model.model.rotary_emb, base_model.model.lm_head.
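
The error indicates that `device_map="auto"` could not fit the whole model on the GPU and wanted to offload the listed layers to disk. One plausible cause, not confirmed by this notebook, is that the 4-bit training model and the trainer still hold GPU memory at this point. A minimal sketch of releasing that memory first, to be run after deleting your own references with `del trainer, model`; the helper name `release_gpu_memory` is hypothetical.

```python
import gc

def release_gpu_memory():
    """Run garbage collection and, if PyTorch with CUDA is present, empty its cache.

    Returns True when the CUDA cache was cleared, False otherwise.
    """
    gc.collect()  # drop unreachable Python objects that may still hold tensors
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver
    return True

cleared = release_gpu_memory()
print(f"CUDA cache cleared: {cleared}")
```

Providing an offload directory is the other route the message points to; the exact argument depends on the loading call, so check the library documentation before relying on it.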


In [22]:
#Load and test inference
print("\n=== Inference Testing ===")
try:
    # Load from Google Drive
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Create pipeline
    generator = pipeline(
        'text-generation',
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16
    )

    # Test
    prompt = "What is 2 + 2?"
    result = generator(prompt, max_length=100, do_sample=True, temperature=0.7)
    print(f"Result: {result[0]['generated_text']}")

except Exception as e:
    print(f"Inference error: {e}")

print("Done!")


=== Inference Testing ===
Inference error: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/content/drive/MyDrive/llama-2-7b-custom'. Use `repo_type` argument if needed.
Done!
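
This second error appears to follow from the failed save: nothing was written to `model_path`, and a local path that does not exist seems to be interpreted as a Hub repo id. A minimal sketch of a guard to run before loading, assuming that a saved model directory contains a `config.json` file; the helper `has_saved_model` is hypothetical.

```python
import os
import tempfile

def has_saved_model(path):
    """Return True if path is a directory containing a config.json file."""
    return os.path.isdir(path) and os.path.isfile(os.path.join(path, "config.json"))

# Demo with a temporary directory standing in for the Drive folder.
with tempfile.TemporaryDirectory() as tmp:
    before = has_saved_model(tmp)
    with open(os.path.join(tmp, "config.json"), "w", encoding="utf-8") as f:
        f.write("{}")
    after = has_saved_model(tmp)

print(before, after)
```

Checking `has_saved_model(model_path)` before the inference cell would surface the real problem (the missing save) instead of the repo-id message.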
