# LLM Fine-Tuning Project with IMDB Dataset
This project demonstrates the fine-tuning of a large language model (LLM) on the IMDB dataset, leveraging the LoRA (Low-Rank Adaptation) technique to improve the model's performance on sentiment analysis tasks. The project is divided into three main phases:

**1. Initial Predictions:** Using the pre-trained model to make predictions on the IMDB dataset without any fine-tuning.

**2. Fine-Tuning with LoRA:** Applying the LoRA technique to fine-tune the model on the IMDB dataset.

**3. Evaluation:** Assessing the model's performance before and after fine-tuning to understand the impact of LoRA.

## Dataset
The IMDB dataset is a popular dataset for binary sentiment classification and is available through the Hugging Face datasets library. This dataset contains 50,000 movie reviews split evenly into 25,000 training and 25,000 test examples.


## Project Workflow
### 1. Initial Predictions:

*   Load the pre-trained model and tokenizer.
*   Make predictions on the test set of the IMDB dataset.
*   Evaluate the performance using accuracy and other relevant metrics.


### 2. Fine-Tuning with LoRA:

*   Implement the LoRA technique to fine-tune the pre-trained model on the IMDB dataset.
*   Use the training set for fine-tuning and validate the performance on the test set.


### 3. Evaluation:

*   Compare the performance of the model before and after fine-tuning.
*   Present the results with visualizations and detailed analysis.


## Getting Started
To get started with this project, clone the repository and install the required dependencies:

```
git clone https://github.com/SelahattinNazli/llm-fine-tuning.git
cd llm-fine-tuning
pip install -r requirements.txt
```


Run the initial prediction script:

```
python initial_predictions.py
```


Fine-tune the model using LoRA:

```
python fine_tune_lora.py
```


Evaluate the results:

```
python evaluate.py
```


## Dependencies


*   Python 3.7+
*   Transformers
*   Datasets
*   PEFT (Parameter-Efficient Fine-Tuning)
*   Evaluate
*   NumPy
*   Torch

## Results
Detailed results and analysis will be provided in the evaluation phase, showcasing the improvements in model performance due to the LoRA fine-tuning technique.

## Contributing
Contributions are welcome! Please fork the repository and create a pull request with your changes.

In [None]:
pip install accelerate -U



In [None]:
pip install transformers[torch]



In [None]:
pip install torch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 -f https://download.pytorch.org/whl/metal.html


In [None]:
!pip install peft transformers datasets evaluate


In [None]:
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu113


# Initial Setup and Imports
In this section, we import the necessary libraries: `transformers` for model handling, `datasets` for data loading, `peft` for parameter-efficient fine-tuning, and `evaluate` for metrics. The IMDB dataset itself is loaded in the next section.

In [None]:
from datasets import load_dataset, DatasetDict, Dataset

from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer)

from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig
import evaluate
import torch
import numpy as np

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'


# Loading and Inspecting the Dataset
In this section, we load the IMDB dataset, which will be used for both initial predictions and fine-tuning. The dataset is loaded using the `load_dataset` function from the `datasets` library.

In [None]:
from datasets import load_dataset

dataset = load_dataset("stanfordnlp/imdb")
dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

## Filtering and Renaming the Dataset

In this section, we keep only the `train` and `test` splits of the original dataset, dropping the `unsupervised` split, which is not needed for supervised fine-tuning. We then reassign the result to `dataset` so that the subsequent steps can keep using the same name.

In [None]:
# Let's create a new DatasetDict that only contains the train and test datasets
filtered_dataset = DatasetDict({
    'train': dataset['train'],
    'test': dataset['test']
})


# Rename filtered_dataset to dataset
dataset = filtered_dataset

# Check the new dataset
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})


In [None]:
# Display the fraction of training examples with label=1 (0.5 means the classes are balanced)
np.array(dataset['train']['label']).sum()/len(dataset['train']['label'])

0.5

## Model Initialization and Label Mapping
In this section, we load a pre-trained model for sentiment analysis and define the label mappings. We choose the `distilbert-base-uncased` model because it is lightweight, which speeds up both training and prediction. Alternatively, we could use the `roberta-base` model, but it is larger and thus takes longer to train.

In [None]:
# Define the model checkpoint
model_checkpoint = 'distilbert-base-uncased'
# Alternatively, you can use roberta-base, but this model is bigger and thus training will take longer
# model_checkpoint = 'roberta-base'

# Define label maps
id2label = {0: "Negative", 1: "Positive"}
label2id = {"Negative":0, "Positive":1}

# Generate classification model from the model checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=2, id2label=id2label, label2id=label2id)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Display the model architecture
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

## Tokenizer Initialization and Configuration

In [None]:
# Initialize the tokenizer with the model checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)

# Add a pad token if it doesn't exist in the tokenizer
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))


In [None]:
# Define the tokenize function
def tokenize_function(examples):
    # Extract the text from the examples
    text = examples["text"]

    # Tokenize and truncate the text to a maximum length of 512 tokens.
    # With truncation_side="left", tokens are dropped from the start of
    # long reviews, so the end of each review is kept.
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512
    )

    return tokenized_inputs
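
Setting `truncation_side = "left"` means that when a review is longer than `max_length`, tokens are removed from the start and the end of the review is kept. The toy sketch below illustrates the difference between the two sides with plain Python lists. `truncate` is a hypothetical helper written for this illustration, not the tokenizer's own implementation, and it ignores special tokens such as `[CLS]` and `[SEP]`.

```python
def truncate(token_ids, max_length, side="right"):
    """Toy illustration of truncation sides (hypothetical helper, not the tokenizer API)."""
    if len(token_ids) <= max_length:
        return list(token_ids)
    if side == "left":
        # Drop tokens from the start, keep the last max_length tokens
        return list(token_ids[-max_length:])
    # Drop tokens from the end, keep the first max_length tokens
    return list(token_ids[:max_length])


toy_ids = list(range(10))  # stand-in for a 10-token review
left_kept = truncate(toy_ids, max_length=4, side="left")
right_kept = truncate(toy_ids, max_length=4, side="right")
print(left_kept)   # [6, 7, 8, 9]
print(right_kept)  # [0, 1, 2, 3]
```

Keeping the end of a review can be a reasonable choice for sentiment data if reviewers tend to state their verdict in the closing sentences, though that is an assumption worth checking on the data.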

In [None]:
# Tokenize training and test datasets
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
})

In [None]:
# Create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
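
`DataCollatorWithPadding` pads each batch to the length of its longest example instead of padding every example to a global maximum. The following is a simplified sketch of that idea only, using a hypothetical `pad_batch` helper rather than the collator's actual code. It assumes a pad token id of 0, as suggested by `padding_idx=0` in the model printout, and the token ids are arbitrary toy values.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        input_ids.append(list(ids) + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[101, 2009, 102], [101, 2009, 2001, 2204, 102]]
padded = pad_batch(toy_batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because padding is decided per batch, short reviews grouped together cost less compute than they would if every example were padded to 512 tokens.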

## Loading the Accuracy Evaluation Metric

In [None]:
# Load the accuracy evaluation metric using the evaluate library
accuracy = evaluate.load("accuracy")


In [None]:
# Define a function to compute evaluation metrics for the model
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)

    return {"accuracy": accuracy.compute(predictions=predictions, references=labels)}
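
Note that `accuracy.compute(...)` already returns a dictionary of the form `{'accuracy': value}`, so wrapping it in another dictionary produces a nested value such as `{'accuracy': {'accuracy': 1.0}}`. This is what the Trainer's "of type <class 'dict'> ... as a scalar" warning in the training output refers to. A minimal sketch of a flat alternative is shown below. To stay self-contained it computes accuracy with plain NumPy instead of the `evaluate` metric, and `compute_metrics_flat` is a hypothetical name.

```python
import numpy as np


def compute_metrics_flat(p):
    """Return a flat {"accuracy": float} dict that can be logged as a scalar."""
    logits, labels = p
    predictions = np.argmax(logits, axis=1)
    return {"accuracy": float((predictions == labels).mean())}


# Toy logits for four examples and two classes (not produced by the model)
toy_logits = np.array([[2.0, 0.1], [0.3, 1.5], [1.2, 0.4], [0.2, 0.9]])
toy_labels = np.array([0, 1, 1, 1])
metrics = compute_metrics_flat((toy_logits, toy_labels))
print(metrics)  # {'accuracy': 0.75}
```

With the `evaluate` metric, the equivalent change would be to return `accuracy.compute(predictions=predictions, references=labels)` directly.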

## Initial Predictions with the Unfine-tuned Model

In [None]:
# Define a list of example sentences for prediction
text_list = ["It was good.", "Not a fan, don't recommed.", "Better than the first one.", "This is not worth watching even once.", "This one is a pass."]
print("Unfine-tuned model predictions:")
print("----------------------------")
for text in text_list:
    # Tokenize the input text
    inputs = tokenizer.encode(text, return_tensors="pt")
    # Compute the logits (raw model outputs)
    logits = model(inputs).logits
    # Convert logits to the predicted label
    predictions = torch.argmax(logits)

    print(text + " - " + id2label[predictions.tolist()])

Unfine-tuned model predictions:
----------------------------
It was good. - Positive
Not a fan, don't recommed. - Positive
Better than the first one. - Positive
This is not worth watching even once. - Positive
This one is a pass. - Positive
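
All five sentences receive the same label, which is in line with the earlier warning that the classification head is newly initialized. One way to see how undecided such a model is would be to convert its logits into probabilities with a softmax. The sketch below uses made-up logits (not values produced by this notebook) and a small pure-Python `softmax` helper written for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [0.02, 0.05]  # hypothetical near-equal logits
probs = softmax(toy_logits)
print([round(p, 3) for p in probs])
```

When the two probabilities are this close, the predicted label depends on tiny differences in the logits and carries little information about the sentence.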


## Configuration for LoRA Fine-Tuning

In [None]:
# Configure LoRA (Low-Rank Adaptation) for sequence classification
peft_config = LoraConfig(task_type="SEQ_CLS", # Specify the task type as sequence classification
                        r=4,                  # Rank of the low-rank decomposition
                        lora_alpha=32,        # Scaling factor for the LoRA updates
                        lora_dropout=0.01,    # Dropout rate to apply during LoRA updates
                        target_modules = ['q_lin']) # Target modules within the model to apply LoRA

In [None]:
# Display the current LoRA configuration settings
peft_config

LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type='SEQ_CLS', inference_mode=False, r=4, target_modules={'q_lin'}, lora_alpha=32, lora_dropout=0.01, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, use_dora=False, layer_replication=None)
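
For reference, the standard LoRA formulation keeps the pre-trained weight $W_0$ frozen and learns a low-rank update on top of it. With the configuration above ($r = 4$, `lora_alpha` $= 32$) and `q_lin` mapping 768 features to 768 features, the adapted projection can be written as follows, assuming the default scaling of $\alpha / r$ (the printed configuration shows `use_rslora=False`):

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad
W_0 \in \mathbb{R}^{768 \times 768},\quad
B \in \mathbb{R}^{768 \times 4},\quad
A \in \mathbb{R}^{4 \times 768},\quad
\frac{\alpha}{r} = \frac{32}{4} = 8
$$

Only $A$ and $B$ are trained. With the default initialization, $B$ starts at zero, so the adapted layer initially behaves exactly like the frozen one.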

In [None]:
# Apply the LoRA configuration to the pre-trained model
model = get_peft_model(model, peft_config)
# Print the number of trainable parameters in the model
model.print_trainable_parameters()

trainable params: 628,994 || all params: 67,584,004 || trainable%: 0.9307
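
The reported count of trainable parameters can be reproduced by hand. The sketch below assumes that, besides the LoRA matrices on the six `q_lin` layers, the classification head (`pre_classifier` and `classifier`) is also kept trainable for the `SEQ_CLS` task type. This is an inference from the fact that the arithmetic matches the printed number, not something the notebook states.

```python
hidden = 768        # DistilBERT hidden size (q_lin maps 768 -> 768)
num_layers = 6      # transformer blocks, each with one q_lin
r = 4               # LoRA rank from peft_config
num_labels = 2

# LoRA adds A (r x hidden) and B (hidden x r) to each targeted layer
lora_per_layer = r * hidden + hidden * r
lora_params = num_layers * lora_per_layer

# Assumed trainable head: weights plus biases of both linear layers
pre_classifier = hidden * hidden + hidden
classifier = hidden * num_labels + num_labels

trainable = lora_params + pre_classifier + classifier
all_params = 67_584_004  # taken from the printed output above
print(trainable)                               # 628994
print(round(100 * trainable / all_params, 4))  # 0.9307
```

Under this assumption, most of the trainable parameters sit in the classification head: the LoRA matrices themselves account for only 36,864 of them.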


In [None]:
# Define the hyperparameters for training
lr = 1e-3         # Learning rate
batch_size = 8    # Batch size for training
num_epochs = 5   # Number of training epochs

In [None]:
# Define the training arguments for the Trainer
training_args = TrainingArguments(
    output_dir= model_checkpoint + "-lora-text-classification",  # Output directory for model predictions and checkpoints
    learning_rate=lr,  # Learning rate for the optimizer
    per_device_train_batch_size=batch_size,  # Batch size for training per device
    per_device_eval_batch_size=batch_size,  # Batch size for evaluation per device
    num_train_epochs=num_epochs,  # Number of training epochs
    weight_decay=0.01,  # Weight decay to apply for regularization
    evaluation_strategy="epoch",  # Evaluation strategy to use at the end of each epoch
    save_strategy="epoch",  # Save strategy to use at the end of each epoch
    load_best_model_at_end=True,  # Load the best model when training is finished
)



In [None]:
# Initialize the Trainer with optimized settings
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"].select(range(100)),  # Use a subset for training
    eval_dataset=tokenized_dataset["test"].select(range(100)),    # Use a subset for testing
    tokenizer=tokenizer,
    data_collator=data_collator,  # This will dynamically pad examples in each batch to be equal length
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.0,{'accuracy': 1.0}
2,No log,0.0,{'accuracy': 1.0}
3,No log,0.0,{'accuracy': 1.0}
4,No log,0.0,{'accuracy': 1.0}
5,No log,0.0,{'accuracy': 1.0}


Trainer is attempting to log a value of "{'accuracy': 1.0}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.

TrainOutput(global_step=65, training_loss=0.0, metrics={'train_runtime': 1801.9529, 'train_samples_per_second': 0.277, 'train_steps_per_second': 0.036, 'total_flos': 64507640757888.0, 'train_loss': 0.0, 'epoch': 5.0})
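
The `global_step=65` value and the `No log` entries in the training table can be checked with a little arithmetic. The sketch assumes a single device, no gradient accumulation, and the `TrainingArguments` default of `logging_steps=500`, since the notebook does not set that argument.

```python
import math

num_train_examples = 100  # size of the training subset passed to the Trainer
batch_size = 8
num_epochs = 5
logging_steps = 500       # assumed default; not set explicitly in the notebook

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 13 65

# The training loss is only reported once logging_steps optimizer steps have run
will_log_training_loss = total_steps >= logging_steps
print(will_log_training_loss)  # False
```

Setting a smaller `logging_steps` (for example 10) in `TrainingArguments` would make the training loss appear in the table for a run this short.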

## Making Predictions with the Fine-tuned Model

In [None]:
# Move the model to the CPU device
model.to('cpu')

print("Fine-tuned model predictions:")
print("--------------------------")
for text in text_list:
    # Tokenize the input text and move it to the CPU device
    inputs = tokenizer.encode(text, return_tensors="pt").to("cpu")

    # Compute the logits (raw model outputs)
    logits = model(inputs).logits

    # Convert logits to predicted labels
    predictions = torch.max(logits,1).indices

    # Print the text along with its predicted label
    print(text + " - " + id2label[predictions.tolist()[0]])


Fine-tuned model predictions:
--------------------------
It was good. - Negative
Not a fan, don't recommed. - Negative
Better than the first one. - Negative
This is not worth watching even once. - Negative
This one is a pass. - Negative


## Improving Fine-Tuning Results
This project demonstrates the process of fine-tuning a language model using the IMDB dataset. While the initial results may not be optimal, the following strategies can be applied to enhance the model's performance. These strategies aim to address common issues encountered during fine-tuning, such as limited training data and model overfitting. By implementing these approaches, the model can achieve better accuracy and more reliable predictions.

## Strategies to Improve Fine-Tuning
**Increase Training Data Size**:

**Reason**: Using a small subset (e.g., 100 samples) for training limits the model's ability to learn and generalize from the data.

**Solution**: Increase the size of the training dataset. If computational resources are limited, try incrementally increasing the subset size and observing the impact on model performance.

**Balance the Dataset**:

**Reason**: An imbalanced dataset can lead to a model that is biased towards the majority class.

**Solution**: Ensure that the dataset is balanced, i.e., it has an equal number of samples for each class (Positive and Negative).
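
Balance matters for subsets as well as for the full dataset. The full training split is balanced (the label fraction computed earlier is 0.5), but `select(range(100))` takes the first 100 rows as stored. If a split happens to be ordered by label, which this notebook does not verify for IMDB, such a subset would contain a single class. That would be consistent with the 1.0 accuracy, the 0.0 loss, and the identical predictions seen above. The toy simulation below illustrates the effect with plain Python lists; the label ordering is an assumption made for illustration.

```python
import random
from collections import Counter

# Assumption for illustration: 25,000 labels stored with all
# negatives (0) first, followed by all positives (1).
labels = [0] * 12500 + [1] * 12500

# Taking the first 100 rows, as select(range(100)) does
head_subset = labels[:100]

# Drawing 100 rows at random instead
rng = random.Random(42)
shuffled_subset = rng.sample(labels, 100)

print(Counter(head_subset))      # only one class present
print(Counter(shuffled_subset))  # both classes present
```

With the `datasets` library, the analogous step would be to shuffle before selecting, for example `tokenized_dataset["train"].shuffle(seed=42).select(range(100))`.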

**Hyperparameter Tuning**:

**Reason**: Default hyperparameters may not be optimal for your specific dataset and task.

**Solution**: Experiment with different hyperparameters such as learning rate, batch size, number of epochs, and optimizer type. Use techniques like grid search or random search to find the best combination.
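
A grid search over a few hyperparameters can be sketched as below. `train_and_evaluate` is a hypothetical placeholder: in practice it would fine-tune the model with the given settings and return a validation metric. Here it returns a dummy score so that the loop runs end to end, and the candidate values are illustrative only.

```python
from itertools import product


def train_and_evaluate(lr, batch_size, num_epochs):
    """Hypothetical placeholder for a full fine-tuning run.

    Returns a dummy score with no real meaning, peaking at
    lr=1e-4, batch_size=16, num_epochs=3.
    """
    return 1.0 - abs(lr - 1e-4) * 100 - abs(batch_size - 16) * 0.001 - abs(num_epochs - 3) * 0.01


search_space = {
    "lr": [1e-3, 1e-4, 1e-5],
    "batch_size": [8, 16],
    "num_epochs": [3, 5],
}

best_score, best_config = float("-inf"), None
for values in product(*search_space.values()):
    config = dict(zip(search_space.keys(), values))
    score = train_and_evaluate(**config)
    if score > best_score:
        best_score, best_config = score, config

print(best_config)
```

The grid above has 12 combinations, each of which would be a full training run, so random search over the same space is often a cheaper starting point.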

**Data Augmentation**:

**Reason**: Data augmentation can help improve model robustness and generalization by providing more varied examples.

**Solution**: Apply data augmentation techniques such as adding noise, paraphrasing text, or using data from similar tasks to enrich the training dataset.

**Use a Pre-trained Model from a Similar Task**:

**Reason**: Starting with a pre-trained model that has been fine-tuned on a similar task can provide a better starting point.

**Solution**: Use a pre-trained model from a similar domain (e.g., sentiment analysis on a different but related dataset) and fine-tune it on your specific dataset.

**Cross-validation**:

**Reason**: Cross-validation helps in evaluating the model's performance more robustly by training and testing on different splits of the data.

**Solution**: Implement k-fold cross-validation to assess the model's performance and avoid overfitting.
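
The index bookkeeping behind k-fold cross-validation can be sketched in a few lines. `k_fold_indices` is a hypothetical helper written for this illustration. It splits the indices into contiguous folds, so the data should be shuffled beforehand.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Each pair of index lists could then be passed to `Dataset.select` to build the training and validation subsets for one fold, with a fresh model trained per fold.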

**Regularization Techniques**:

**Reason**: Regularization can help prevent overfitting by penalizing large weights.

**Solution**: Use techniques like dropout, weight decay, or L2 regularization during training.

**Early Stopping**:

**Reason**: Early stopping can prevent the model from overfitting by halting training when performance on the validation set starts to degrade.

**Solution**: Monitor the validation loss and implement early stopping based on a predefined patience parameter.
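
The patience logic behind early stopping can be sketched as follows. `early_stopping_epoch` is a hypothetical helper, and the validation losses are made-up values used only to show the behavior.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


toy_val_losses = [0.60, 0.45, 0.40, 0.42, 0.43, 0.41]
stop_epoch = early_stopping_epoch(toy_val_losses, patience=2)
print(stop_epoch)  # 5
```

In the `transformers` library, `EarlyStoppingCallback` (with its `early_stopping_patience` argument) can be passed to `Trainer` through `callbacks`. It works together with `load_best_model_at_end=True`, which this notebook already sets.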

Applied together, these strategies should make the fine-tuned model more accurate and its predictions more reliable.

## Logging into Hugging Face Hub

In [None]:
# Option 1: Login using the notebook interface
from huggingface_hub import notebook_login
notebook_login()  # Ensure the token provided has write access

# Option 2: Login using a key
# from huggingface_hub import login
# write_key = 'hf_'  # Paste your Hugging Face token here
# login(write_key)


In [None]:
hf_name = 'snazli' # Your hf username or org name
model_id = hf_name + "/" + model_checkpoint + "-lora-text-classification" # You can name the model whatever you want

In [None]:
# Save model
model.push_to_hub(model_id)

HfHubHTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/repos/create (Request ID: Root=1-66665580-78344db45119e829438523fd;a2585fa4-7fcc-4027-9b3d-ec58dd800e07)

Invalid username or password.

In [None]:
# Save trainer
trainer.push_to_hub(model_id)

HfHubHTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/repos/create (Request ID: Root=1-666655e8-5f70ee1566a8769632b9f02f;6d52b0e8-ee11-4ef5-842d-c7a4ce3b1574)

Invalid username or password.