# Fine-Tuning a Language Model with LoRA and Dataset Creation

## Overview

This project demonstrates how to fine-tune a pretrained Large Language Model (LLM) using LoRA (Low-Rank Adaptation) on a custom dataset. The dataset is built from research paper titles and summaries: given a user query about a topic, the model is trained to reply in an assistant style with a matching paper and its summary. We’ll walk through each step, from setting up the environment to loading, formatting, and tokenizing the data, training the model, and generating text.

### Prerequisites
- Basic understanding of Python and deep learning concepts.
- GPU-enabled machine with CUDA support (NVIDIA RTX 4090 is used in this project).
- Libraries required:
    - PyTorch
    - Transformers
    - PEFT (Parameter-Efficient Fine-Tuning)
    - pandas
    - bitsandbytes (quantization backend used by `BitsAndBytesConfig`)

#### 1. Setting Up Your Environment

Ensure your machine supports CUDA and that PyTorch is installed with GPU support. We first import the required libraries and then check whether the GPU is available:

In [None]:
import os
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments
)
from peft import LoraConfig, get_peft_model, PeftModel
import pandas as pd
import re
from torch.utils.data import Dataset

In [2]:
# Check whether a CUDA-capable GPU is available
def check_gpu():
    if torch.cuda.is_available():
        print("CUDA is ready!")
        device = torch.cuda.get_device_name(0)
        print(f"{device} is ready!")
    else:
        print("CUDA is not available.")

In [3]:
check_gpu()

CUDA is ready!
NVIDIA GeForce RTX 4090 is ready!


#### 2. Model and Data Preparation

We begin by specifying the base model and loading a dataset (in this case, research papers with titles and summaries).

In [2]:
# set the model info
base_model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
new_model = "/project/models/NV-llama3.1-8b-Arxiv"
api_key = os.environ.get("HF_TOKEN")  # Read the Hugging Face access token from the environment; never hard-code it

In [5]:
# Load your data
data = pd.read_csv("ml_papers.csv")
data = data.dropna(subset=['title', 'summary']).reset_index(drop=True)

#### 3. Extracting Topics and Generating Queries

To fine-tune the model for specific user queries, we extract the topics from the paper titles and generate appropriate user queries and responses.

In [6]:
# Function to extract topics from titles
def extract_topic(title):
    title = re.sub(r"\(.*?\)|\[.*?\]", "", title)
    title = re.sub(r'[^\w\s]', '', title)
    title = title.lower()
    return title.strip()

# Generate user queries
def generate_user_query(topic):
    return f"I'm looking for papers discussing {topic}."

# Create assistant responses
def create_assistant_response(row):
    title = row['title']
    summary = row['summary']
    response = f"One paper that discusses this topic is '{title}'. {summary}"
    return response

In [7]:
data['topic'] = data['title'].apply(extract_topic)
data['instruction'] = data['topic'].apply(generate_user_query)

# generate assistant responses
data['response'] = data.apply(create_assistant_response, axis=1)
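
To see what these helpers produce, here is a small standalone run on a single title. The title string (including the `[v2]` suffix) is an illustrative assumption, not a row taken from `ml_papers.csv`; the two functions mirror the ones defined above.

```python
import re

def extract_topic(title):
    # Drop parenthesised or bracketed asides, then all punctuation, then lowercase
    title = re.sub(r"\(.*?\)|\[.*?\]", "", title)
    title = re.sub(r"[^\w\s]", "", title)
    return title.lower().strip()

def generate_user_query(topic):
    return f"I'm looking for papers discussing {topic}."

# Hypothetical title used only for illustration
sample_title = "LoRA: Low-Rank Adaptation of Large Language Models [v2]"
topic = extract_topic(sample_title)
query = generate_user_query(topic)

print(topic)  # lora lowrank adaptation of large language models
print(query)  # I'm looking for papers discussing lora lowrank adaptation of large language models.
```

Note that the punctuation filter deletes hyphens rather than replacing them with spaces, so "Low-Rank" becomes "lowrank" in the generated query.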

#### 4. Formatting the Dataset for the Model

We define custom delimiter tokens and wrap each instruction/response pair in them, so that every training example marks the user turn and the assistant turn in the same way.

In [8]:
# Define special tokens
bos_token = "<bos>"
eos_token = "<eos>"
user_start = "<user>"
user_end = "</user>"
assistant_start = "<assistant>"
assistant_end = "</assistant>"
pad_token = "<pad>"

In [9]:
# Format examples
def format_example(instruction, response):
    return f"{bos_token}\n{user_start}\n{instruction}\n{user_end}\n{assistant_start}\n{response}\n{assistant_end}\n{eos_token}"

In [10]:
data['text'] = data.apply(lambda row: format_example(row['instruction'], row['response']), axis=1)
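
As a quick check of the template, the sketch below formats one instruction/response pair and prints the result. The instruction and response strings are placeholders, not rows from the dataset; the token strings and the function body match the cells above.

```python
bos_token, eos_token = "<bos>", "<eos>"
user_start, user_end = "<user>", "</user>"
assistant_start, assistant_end = "<assistant>", "</assistant>"

def format_example(instruction, response):
    return (
        f"{bos_token}\n{user_start}\n{instruction}\n{user_end}\n"
        f"{assistant_start}\n{response}\n{assistant_end}\n{eos_token}"
    )

# Placeholder strings used only for illustration
text = format_example(
    "I'm looking for papers discussing example topic.",
    "One paper that discusses this topic is 'Example Title'. Example summary.",
)
print(text)
```

Each delimiter ends up on its own line, which is what the label masking in the next section relies on when it searches for the assistant tags.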

#### 5. Defining the Custom Dataset Class

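We create a custom Dataset class to handle tokenized inputs, attention masks, and labels. Labels are set to -100 (the index that PyTorch's cross-entropy loss ignores by default) for every token outside the assistant’s response, so only the response tokens contribute to the loss. The class reads the global `tokenizer`, `assistant_start`, and `assistant_end`, so it can only be instantiated after the tokenizer has been set up in the next section.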

In [11]:
# Create dataset
class PapersDataset(Dataset):
    def __init__(self, tokenized_data):
        self.input_ids = tokenized_data['input_ids']
        self.attention_mask = tokenized_data['attention_mask']
        self.labels = tokenized_data['input_ids'].clone()

        assistant_start_id = tokenizer.convert_tokens_to_ids(assistant_start)
        assistant_end_id = tokenizer.convert_tokens_to_ids(assistant_end)

        for i in range(len(self.labels)):
            input_ids = self.input_ids[i]
            labels = self.labels[i]

            assistant_start_positions = (input_ids == assistant_start_id).nonzero(as_tuple=True)[0]
            assistant_end_positions = (input_ids == assistant_end_id).nonzero(as_tuple=True)[0]

            if len(assistant_start_positions) > 0 and len(assistant_end_positions) > 0:
                assistant_start_pos = assistant_start_positions[0]
                assistant_end_pos = assistant_end_positions[0]

                labels[:assistant_start_pos + 1] = -100
                labels[assistant_end_pos:] = -100
            else:
                labels[:] = -100

            self.labels[i] = labels

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_mask[idx],
            'labels': self.labels[idx]
        }
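
The masking rule can be illustrated without PyTorch. The following is a minimal sketch on plain Python lists that mirrors the slicing in `PapersDataset`; the token IDs are made up for illustration (20 stands for `<assistant>`, 21 for `</assistant>`, 0 for padding), and `mask_labels` is a hypothetical helper, not part of the notebook.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_labels(input_ids, assistant_start_id, assistant_end_id):
    """Keep only the tokens strictly between the two assistant tags."""
    labels = list(input_ids)
    if assistant_start_id in input_ids and assistant_end_id in input_ids:
        start = input_ids.index(assistant_start_id)
        end = input_ids.index(assistant_end_id)
        labels[: start + 1] = [IGNORE_INDEX] * (start + 1)   # prompt and opening tag
        labels[end:] = [IGNORE_INDEX] * (len(labels) - end)  # closing tag, eos, padding
    else:
        labels = [IGNORE_INDEX] * len(labels)                # no complete response found
    return labels

# Made-up token IDs: 20 = <assistant>, 21 = </assistant>, 0 = <pad>
input_ids = [1, 10, 50, 51, 11, 20, 60, 61, 62, 21, 2, 0, 0]
labels = mask_labels(input_ids, assistant_start_id=20, assistant_end_id=21)
print(labels)
# [-100, -100, -100, -100, -100, -100, 60, 61, 62, -100, -100, -100, -100]
```

As in the class above, the closing tag and everything after it (including `<eos>`) are masked, so the loss never covers a stop token. This may be worth revisiting if generations do not terminate on their own.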

#### 6. Dataset Tokenization

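Next, we load the tokenizer and tokenize the dataset. The tokenizer converts the text into token IDs that the model can process. This section also defines the 4-bit quantization config that is used when the model is loaded in the following section.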

In [12]:
# Set up the BitsAndBytesConfig for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Load model weights in 4-bit precision
    bnb_4bit_compute_dtype=torch.float16,  # Run computations in FP16
)
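
A rough back-of-the-envelope estimate shows why 4-bit loading matters on a single consumer GPU. The parameter count below is the nominal "8B" figure and is an assumption; the estimate covers quantized weights only and ignores quantization constants, layers kept in higher precision, activations, and optimizer state.

```python
# Nominal parameter count for an "8B" model (assumption; the exact count differs slightly)
n_params = 8_000_000_000

bytes_fp16 = n_params * 2    # 16 bits = 2 bytes per weight
bytes_4bit = n_params // 2   # 4 bits = 0.5 byte per weight

gb_fp16 = bytes_fp16 / 1e9
gb_4bit = bytes_4bit / 1e9
print(f"fp16 weights: ~{gb_fp16:.0f} GB, 4-bit weights: ~{gb_4bit:.0f} GB")
# fp16 weights: ~16 GB, 4-bit weights: ~4 GB
```

The real footprint will be somewhat higher than the 4-bit figure, but the comparison explains the choice of `load_in_4bit=True` here.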

In [13]:
from transformers import AutoTokenizer

# Define your special tokens
bos_token = "<bos>"
eos_token = "<eos>"
pad_token = "<pad>"
user_start = "<user>"
user_end = "</user>"
assistant_start = "<assistant>"
assistant_end = "</assistant>"

special_tokens = {
    'bos_token': bos_token,
    'eos_token': eos_token,
    'pad_token': pad_token,
    'additional_special_tokens': [user_start, user_end, assistant_start, assistant_end]
}

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    token=api_key
)

# Add special tokens to the tokenizer
tokenizer.add_special_tokens(special_tokens)
tokenizer.pad_token = pad_token
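
The effect of `add_special_tokens` on the vocabulary size can be checked with simple arithmetic. This sketch assumes a base vocabulary of 128,256 entries for the Llama 3.1 tokenizer and that none of the seven tokens already exist in it; under those assumptions the result matches the `Embedding(128263, 4096)` shape printed later in the notebook.

```python
special_tokens = {
    "bos_token": "<bos>",
    "eos_token": "<eos>",
    "pad_token": "<pad>",
    "additional_special_tokens": ["<user>", "</user>", "<assistant>", "</assistant>"],
}

base_vocab_size = 128_256  # assumption: vocabulary size of the Llama 3.1 tokenizer
n_added = 3 + len(special_tokens["additional_special_tokens"])  # bos, eos, pad + 4 tags
new_vocab_size = base_vocab_size + n_added
print(n_added, new_vocab_size)  # 7 128263
```

In the real notebook, `tokenizer.add_special_tokens(...)` returns the number of tokens it added, and `len(tokenizer)` gives the new size that is later passed to `resize_token_embeddings`.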

In [14]:
# Tokenize data
tokenized_data = tokenizer(
    data['text'].tolist(),
    padding='longest',
    truncation=True,
    max_length=512,
    return_tensors='pt'
)
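
The interaction of `truncation`, `max_length`, and `padding='longest'` can be sketched on plain lists. This is an illustration of the idea only, not the tokenizer's implementation; `pad_batch` is a hypothetical helper, and right-side padding is assumed.

```python
def pad_batch(sequences, pad_id, max_length):
    """Truncate to max_length, then right-pad every sequence to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return input_ids, attention_mask

# Made-up token IDs; max_length=4 stands in for the 512 used above
ids, mask = pad_batch([[5, 6, 7], [8, 9], [1, 2, 3, 4, 5, 6]], pad_id=0, max_length=4)
print(ids)   # [[5, 6, 7, 0], [8, 9, 0, 0], [1, 2, 3, 4]]
print(mask)  # [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
```

One consequence for this notebook: if an example is longer than `max_length`, its closing `</assistant>` tag is cut off, and `PapersDataset` then masks the whole example, so it contributes nothing to the loss.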

In [15]:
dataset = PapersDataset(tokenized_data)

#### 7. Model Fine-Tuning with LoRA

We apply LoRA to the pretrained model for efficient fine-tuning, loading the model in 4-bit precision for memory optimization.

In [16]:
# Import the PEFT helpers needed for k-bit (quantized) training
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load model
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    token=api_key,
    quantization_config=bnb_config,
    cache_dir="/project/models",
    device_map="auto"
)

# Update model's embeddings
model.resize_token_embeddings(len(tokenizer))

# Prepare the model for k-bit training
model = prepare_model_for_kbit_training(model)

# Apply LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Ensure these modules exist in your model
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
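
To make "efficient" concrete, the sketch below counts the parameters that LoRA adds under this configuration. Each adapted linear layer gets two small matrices, `lora_A` (in_features x r) and `lora_B` (r x out_features), and their product is scaled by `lora_alpha / r`. The 4096-wide `q_proj` and the 32 decoder layers appear in the model printout later in this notebook; the 1024-wide output of `k_proj` and `v_proj` is an assumption based on the model's grouped-query attention.

```python
r, lora_alpha = 16, 32
hidden, kv_dim, n_layers = 4096, 1024, 32  # kv_dim is an assumption (grouped-query attention)

# (in_features, out_features) of each targeted projection
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
}

def lora_params(in_features, out_features, rank):
    # lora_A holds in_features * rank weights, lora_B holds rank * out_features
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
total = per_layer * n_layers
scaling = lora_alpha / r

print(per_layer)  # 425984
print(total)      # 13631488
print(scaling)    # 2.0
```

For comparison, a single frozen `q_proj` weight matrix has 4096 * 4096 = 16,777,216 entries, while its adapter adds 131,072. In the notebook itself, `model.print_trainable_parameters()` reports the actual count. Because `target_modules` lists only the attention projections, the embedding rows created for the new special tokens are not among the trainable parameters in this setup; the `modules_to_save` option of `LoraConfig` is one way to include them if that is desired.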


#### 8. Training the Model

We use the Trainer from the Hugging Face transformers library to fine-tune the model. The following code sets up training arguments and trains the model.

In [17]:
# Training arguments tuned for a single RTX 4090
from transformers import Trainer, TrainingArguments

training_arguments = TrainingArguments(
    output_dir="/project/models/NV-arxiv-llama3.1",  # Where to save results
    num_train_epochs=3,                 # Ignored here: max_steps below takes precedence
    per_device_train_batch_size=2,      # Start with 2, adjust based on memory
    gradient_accumulation_steps=5,      # Accumulate gradients to simulate a larger batch size
    fp16=True,                          # Use FP16 mixed precision for memory efficiency
    gradient_checkpointing=True,        # Enable gradient checkpointing to save memory
    gradient_checkpointing_kwargs={"use_reentrant": False},
    learning_rate=2e-5,                 # Learning rate for fine-tuning
    max_grad_norm=0.3,                  # Gradient clipping
    weight_decay=0.001,                 # Regularization
    optim="adamw_torch",                # Use the standard AdamW optimizer
    max_steps=50,                       # Train for 50 optimizer steps (overrides num_train_epochs)
    warmup_ratio=0.03,                  # Fraction of steps used for learning-rate warmup
    group_by_length=True,               # Group sequences of similar lengths to save memory
    save_steps=100,                     # Save a checkpoint every 100 steps
    logging_steps=5,                    # Log training progress every 5 steps
    report_to="none",                   # Disable experiment-tracker logging
)

In [18]:
# Trainer
trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=dataset,
    tokenizer=tokenizer
)

max_steps is given, it will override any value given in num_train_epochs


In [19]:
# Train the model
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Step | Training Loss |
|------|---------------|
| 5    | 2.1939        |
| 10   | 2.2107        |
| 15   | 2.1205        |
| 20   | 2.1231        |
| 25   | 2.0918        |
| 30   | 2.0825        |
| 35   | 2.0902        |
| 40   | 2.0204        |
| 45   | 1.9796        |
| 50   | 2.0274        |



Cannot access gated repo for url https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/resolve/main/config.json.
Access to model meta-llama/Meta-Llama-3.1-8B-Instruct is restricted. You must have access to it and be authenticated to access it. Please log in. - silently ignoring the lookup for the file config.json in meta-llama/Meta-Llama-3.1-8B-Instruct.


TrainOutput(global_step=50, training_loss=2.0940171241760255, metrics={'train_runtime': 162.3597, 'train_samples_per_second': 3.08, 'train_steps_per_second': 0.308, 'total_flos': 1.1548546301952e+16, 'train_loss': 2.0940171241760255, 'epoch': 6.25})
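
The numbers in `TrainOutput` can be cross-checked against the training arguments with a little arithmetic. The sketch below uses only values that appear in this notebook; the dataset size it prints is inferred from the reported epoch count, not read from the data.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 5
max_steps = 50
n_gpus = 1  # single RTX 4090

# Examples consumed per optimizer step, and over the whole run
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * n_gpus
samples_seen = effective_batch * max_steps

reported_epochs = 6.25       # 'epoch' in TrainOutput
reported_runtime = 162.3597  # 'train_runtime' in seconds

implied_dataset_size = samples_seen / reported_epochs
throughput = samples_seen / reported_runtime

print(effective_batch)        # 10
print(samples_seen)           # 500
print(implied_dataset_size)   # 80.0
print(round(throughput, 2))   # 3.08
```

The throughput agrees with the reported `train_samples_per_second` of 3.08. The implied size of about 80 training examples means the run passes over the same small dataset more than six times, which is worth keeping in mind when reading the loss values.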

#### 9. Saving and Loading the Model

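Once training is complete, save the LoRA adapter weights for later use. Calling `save_pretrained` on the PEFT model writes the adapter rather than a full copy of the model, so reloading requires the base model (with its embeddings resized for the added tokens) plus the saved adapter.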

In [20]:
# Save the LoRA adapter weights
model.save_pretrained("/project/models/arxiv_model")




In [21]:
from transformers import AutoModelForCausalLM

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    token=api_key
)
tokenizer.add_special_tokens(special_tokens)
tokenizer.pad_token = pad_token


# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    token=api_key,
    quantization_config=bnb_config,
    cache_dir="/project/models",
    device_map="auto"
)

# Update model's embeddings to accommodate new tokens
base_model.resize_token_embeddings(len(tokenizer))


Embedding(128263, 4096)

In [22]:
# Load the LoRA adapter weights
model = PeftModel.from_pretrained(
    base_model,
    "/project/models/arxiv_model",
    device_map="auto"
).to("cuda")

# Set model to evaluation mode
model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128263, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): 

#### 10. Generating Text

Finally, we generate text from the fine-tuned model using a sample query.

In [23]:
# Redefine format_example for inference: the prompt stops after the assistant start tag so the model writes the response
def format_example(instruction, response=""):
    return f"{bos_token}\n{user_start}\n{instruction}\n{user_end}\n{assistant_start}\n{response}"

# Prepare the input
instruction = "I am looking for a paper discussing To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning"
input_text = format_example(instruction)

In [24]:
# Tokenize the input
inputs = tokenizer(
    input_text,
    return_tensors="pt",
    truncation=True,
    max_length=512,
    padding=True
).to("cuda")

input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]


In [25]:
# Generate the response
with torch.no_grad():
    output_ids = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        top_k=50,
        repetition_penalty=1.2,
        num_return_sequences=1,
        eos_token_id=tokenizer.convert_tokens_to_ids(eos_token),
        pad_token_id=tokenizer.convert_tokens_to_ids(pad_token)
    )

# Decode the full sequence, keeping the special tokens visible
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=False)

print(generated_text)

<|begin_of_text|><bos>
<user>
I am looking for a paper discussing To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
</user>
<assistant>
This research is focused on the topic of "Chain-of-Thought" (CoT) in artificial intelligence. Specifically, it explores how this technique can assist with mathematical problems and symbolic reasoning tasks.

The researchers employed both qualitative and quantitative methods to investigate the effectiveness of chain-of-thought explanations across various AI models trained using different architectures such as LSTMs, Transformers, and CNNs. They found that these systems were able to generate helpful chains of thought during problem-solving processes but did so without leveraging human language generation capabilities.
 
To evaluate their results they designed experiments where subjects engaged in solving either real-world math questions presented by an intelligent tutoring system's interface or reading comprehension passa
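
The decoded text above still contains the prompt and the delimiter tokens. A minimal sketch for pulling out only the assistant's part is shown below; `extract_response` is a hypothetical helper, not part of the notebook, and it falls back to the end of the string when the model never emits the closing tag (as in the truncated output above).

```python
def extract_response(generated_text, start_tag="<assistant>", end_tag="</assistant>"):
    """Return the text after start_tag, cut at end_tag if the model produced one."""
    _, found, tail = generated_text.partition(start_tag)
    if not found:
        return ""  # no assistant turn in the text
    response, _, _ = tail.partition(end_tag)
    return response.strip()

# Made-up strings that follow the notebook's template
complete = "<|begin_of_text|><bos>\n<user>\nquery\n</user>\n<assistant>\nanswer text\n</assistant>\n<eos>"
cut_off = "<bos>\n<user>\nquery\n</user>\n<assistant>\npartial answ"

print(extract_response(complete))  # answer text
print(extract_response(cut_off))   # partial answ
```

In the notebook, this would be applied to `generated_text` after the `tokenizer.decode` call.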