In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import Dataset
from peft import LoraConfig, PeftConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from trl import setup_chat_format
from transformers import (AutoModelForCausalLM, 
                          AutoTokenizer, 
                          BitsAndBytesConfig, 
                          TrainingArguments, 
                          pipeline, 
                          logging)
from sklearn.metrics import (accuracy_score, 
                             classification_report, 
                             confusion_matrix)

In [3]:
device='cuda' if torch.cuda.is_available() else 'cpu'

In [4]:
from huggingface_hub import login
login(token='YOUR_API_KEY')

model_name = "meta-llama/Llama-3.2-1B-Instruct"


bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map= "auto"
)


max_seq_length = 512 #2048
tokenizer = AutoTokenizer.from_pretrained(model_name, max_seq_length=max_seq_length)
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "right"

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/dortp58/.cache/huggingface/token
Login successful


## Preparing the data and the core evaluation functions

The code in the next cells performs the following steps:

1. Reads the training and test data from archive/twitter_training.csv and archive/twitter_validation.csv and names the four columns header, entity, labels and text.
2. Drops rows with missing values and duplicate rows from the training data, and removes the header column from both sets.
3. Maps the Irrelevant label to Neutral, so that the labels match the three used in the prompt: Positive, Neutral and Negative.
4. Samples 10,000 rows from the training data.
5. Transforms the texts in the train and test data into prompts to be used by Llama: the train prompts contain the expected answer we want to fine-tune the model with, while the test prompts stop right before it.
6. Wraps the train and test prompts in the Dataset class from Hugging Face (https://huggingface.co/docs/datasets/index).

This prepares the train_data and test_data datasets to be used in our fine-tuning.

In [5]:
train_path = 'archive/twitter_training.csv'
test_path = 'archive/twitter_validation.csv'
df_train = pd.read_csv(train_path)
df_test = pd.read_csv(test_path)
df_test.columns = ['header', 'entity','labels','text']
df_train.columns = ['header', 'entity','labels','text']
df_train.dropna(inplace=True)
df_train.drop_duplicates(inplace=True)
df_train.isnull().sum()
df_train.drop(columns=['header'], inplace=True)
df_test.drop(columns=['header'], inplace=True)
df_train.replace(to_replace='Irrelevant', value='Neutral', inplace=True)
df_test.replace(to_replace='Irrelevant', value='Neutral', inplace=True)

In [6]:
df_train = df_train.sample(10000)

def generate_prompt(data_point):
    return f"""
            Analyze the sentiment of the tweet in square brackets about the entity {data_point["entity"]}, 
            determine if it is positive, neutral, or negative, and return the answer as 
            the corresponding sentiment label "Positive" or "Neutral" or "Negative".
            Classify tweets that are not relevant to the entity as "Neutral".

            [{data_point["text"]}] = {data_point["labels"]}""".strip()

def generate_test_prompt(data_point):
    return f"""
            Analyze the sentiment of the tweet in square brackets about the entity {data_point["entity"]}, 
            determine if it is positive, neutral, or negative, and return the answer as 
            the corresponding sentiment label "Positive" or "Neutral" or "Negative".
            Classify tweets that are not relevant to the entity as "Neutral".

            [{data_point["text"]}] = """.strip()


X_train = pd.DataFrame(df_train.apply(generate_prompt, axis=1), 
                       columns=["text"])

y_true = df_test['labels']
X_test = pd.DataFrame(df_test.apply(generate_test_prompt, axis=1), columns=["text"])

train_data = Dataset.from_pandas(X_train)
test_data = Dataset.from_pandas(X_test)
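
One detail of the prompt functions above: .strip() only trims the two ends of the string, so the indentation inside the triple-quoted f-string remains in every prompt and is tokenized with it. A minimal sketch of a variant that produces the same instruction text without that indentation, shown on a made-up row. The name build_prompt is hypothetical and not part of the notebook; whichever format is used for training should also be used at inference time.

```python
def build_prompt(data_point, with_label=True):
    """Return the sentiment prompt without source-code indentation.

    Hypothetical helper: with_label=True mimics generate_prompt (training),
    with_label=False mimics generate_test_prompt (inference).
    """
    entity = data_point["entity"]
    lines = [
        f"Analyze the sentiment of the tweet in square brackets about the entity {entity},",
        "determine if it is positive, neutral, or negative, and return the answer as",
        'the corresponding sentiment label "Positive" or "Neutral" or "Negative".',
        'Classify tweets that are not relevant to the entity as "Neutral".',
        "",
        f"[{data_point['text']}] = ",
    ]
    prompt = "\n".join(lines)
    if with_label:
        prompt += data_point["labels"]
    return prompt.strip()


# Made-up example row, for illustration only.
row = {"entity": "Borderlands", "text": "I love this game", "labels": "Positive"}
train_prompt = build_prompt(row)
test_prompt = build_prompt(row, with_label=False)
print(train_prompt)
```

The training prompt is the test prompt followed by the label, which is what lets the fine-tuned model complete the test prompt with a sentiment.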

In [7]:
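# Note: this rebuilds the training prompts from df_test (the validation file),
# replacing the prompts built from df_train in the previous cell.
X_train = pd.DataFrame(df_test.apply(generate_prompt, axis=1), 
                       columns=["text"])
train_data = Dataset.from_pandas(X_train)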

## Fine-tuning

In the next cells we set everything up for the fine-tuning. We configure and initialize a Supervised Fine-tuning Trainer (SFTTrainer) for training a large language model with the Parameter-Efficient Fine-Tuning (PEFT) method, which should save time because it operates on a reduced number of parameters compared to the model's overall size. PEFT refines a limited set of (additional) model parameters while keeping the majority of the pre-trained LLM parameters fixed. This significantly reduces both computational and storage costs. It also mitigates catastrophic forgetting, which often occurs during full fine-tuning of LLMs.

PEFTConfig:

The peft_config object specifies the parameters for PEFT. The following are some of the most important parameters:

* lora_alpha: The scaling factor for the LoRA update matrices; the update is multiplied by lora_alpha / r.
* lora_dropout: The dropout probability for the LoRA update matrices.
* r: The rank of the LoRA update matrices.
* bias: Which bias parameters to train. The possible values are none, all, and lora_only.
* task_type: The type of task that the model is being trained for, for example CAUSAL_LM or SEQ_CLS.
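
To make the roles of r and lora_alpha concrete, here is a minimal sketch of a LoRA-adapted linear layer in plain Python. The frozen weight W is left untouched, and the trainable update is the product of two small matrices B and A, scaled by lora_alpha / r. The toy matrix sizes and the helper names are assumptions for illustration; this is not how PEFT implements the layer internally.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha):
    """Compute y = W x + (lora_alpha / r) * B (A x), with r = number of rows of A."""
    r = len(A)
    scale = lora_alpha / r
    base = matvec(W, x)                # frozen pre-trained path
    update = matvec(B, matvec(A, x))   # low-rank trainable path
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes (assumed for illustration): 3 inputs, 2 outputs, rank r = 2.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]   # frozen weight, 2 x 3
A = [[1.0, 1.0, 1.0], [1.0, 0.0, 0.0]]   # trainable, r x 3
x = [1.0, 2.0, 3.0]

# LoRA starts with B = 0, so the adapted layer initially equals the base layer.
B_init = [[0.0, 0.0], [0.0, 0.0]]        # trainable, 2 x r
y_init = lora_forward(W, A, B_init, x, lora_alpha=0.5)

# With a non-zero B the update is scaled by lora_alpha / r = 0.25,
# the same ratio as lora_alpha=16 and r=64.
B_trained = [[4.0, 0.0], [0.0, 8.0]]
y_trained = lora_forward(W, A, B_trained, x, lora_alpha=0.5)

print(y_init)     # [7.0, 2.0]
print(y_trained)  # [13.0, 4.0]
```

Only A and B would receive gradients during fine-tuning, which is why the number of trainable parameters grows with r rather than with the size of W.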

TrainingArguments:

The training_arguments object specifies the parameters for training the model. The following are some of the most important parameters:

* output_dir: The directory where the training logs and checkpoints will be saved.
* num_train_epochs: The number of epochs to train the model for.
* per_device_train_batch_size: The number of samples in each batch on each device.
* gradient_accumulation_steps: The number of batches to accumulate gradients before updating the model parameters.
* optim: The optimizer to use for training the model.
* save_steps: The number of steps after which to save a checkpoint.
* logging_steps: The number of steps after which to log the training metrics.
* learning_rate: The learning rate for the optimizer.
* weight_decay: The weight decay parameter for the optimizer.
* fp16: Whether to use 16-bit floating-point precision.
* bf16: Whether to use BFloat16 precision.
* max_grad_norm: The maximum gradient norm.
* max_steps: The maximum number of steps to train the model for.
* warmup_ratio: The proportion of the training steps to use for warming up the learning rate.
* group_by_length: Whether to group the training samples by length.
* lr_scheduler_type: The type of learning rate scheduler to use.
* report_to: The tools to report the training metrics to.
* evaluation_strategy: The strategy for evaluating the model during training.
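
Several of these arguments interact. The sketch below assumes a single GPU and the first configuration used later in this notebook (999 examples, batch size 2, 4 accumulation steps, 5 epochs, warmup_ratio 0.1). It shows how the effective batch size, the number of optimizer steps, and a cosine schedule with linear warmup can be computed. The formulas approximate what the Trainer does and the function names are hypothetical; exact rounding may differ between library versions.

```python
import math


def optimizer_steps(n_examples, batch_size, grad_accum, epochs):
    """Approximate number of optimizer updates for a full training run."""
    batches_per_epoch = math.ceil(n_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)


def cosine_lr(step, total_steps, warmup_steps, base_lr):
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


effective_batch = 2 * 4                        # per-device batch x accumulation steps
total_steps = optimizer_steps(999, 2, 4, 5)    # 625
warmup_steps = math.ceil(0.1 * total_steps)    # 63

print(effective_batch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps, warmup_steps, 1e-4):.2e}")
```

The 625 steps computed here agree with the global_step=625 reported by the corresponding training run.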

SFTTrainer:

The SFTTrainer is a custom trainer class from the TRL library. It is used to train large language models (also using the PEFT method).

The SFTTrainer object is initialized with the following arguments:

* model: The model to be trained.
* train_dataset: The training dataset.
* eval_dataset: The evaluation dataset.
* peft_config: The PEFT configuration.
* dataset_text_field: The name of the text field in the dataset.
* tokenizer: The tokenizer to use.
* args: The training arguments.
* packing: Whether to pack the training samples.
* max_seq_length: The maximum sequence length.

Once the SFTTrainer object is initialized, it can be used to train the model by calling the train() method.

In [8]:
from sklearn.metrics import (accuracy_score, 
                             recall_score, 
                             precision_score, 
                             f1_score)

from transformers import EarlyStoppingCallback, IntervalStrategy

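def compute_metrics(p):
    # Not used below: the compute_metrics argument of the trainer is commented out.
    # Assumes p holds class logits of shape (n_samples, n_classes) and integer labels.
    pred, labels = p
    pred = np.argmax(pred, axis=1)
    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    # average="macro" is required with three sentiment classes;
    # the default average="binary" raises an error for multiclass targets.
    recall = recall_score(y_true=labels, y_pred=pred, average="macro")
    precision = precision_score(y_true=labels, y_pred=pred, average="macro")
    f1 = f1_score(y_true=labels, y_pred=pred, average="macro")
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}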

In [9]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear4bit(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-0

In [10]:
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=["self_attn.q_proj", 
                    "self_attn.k_proj", 
                    "self_attn.v_proj", 
                    "self_attn.o_proj",
                    "mlp.gate_proj", 
                    "mlp.up_proj", 
                    "mlp.down_proj",],
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

In [11]:
model.print_trainable_parameters()

trainable params: 45,088,768 || all params: 1,280,903,168 || trainable%: 3.5201
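
The trainable parameter count can be checked by hand. For a linear layer with in_features inputs and out_features outputs, a rank-r LoRA adapter adds r × (in_features + out_features) parameters. A short sketch using the layer shapes from the model summary above and r=64 from the LoraConfig (the dictionary and variable names are illustrative):

```python
r = 64          # LoRA rank from the LoraConfig
n_layers = 16   # number of decoder layers in the model summary

# (in_features, out_features) of each adapted projection in one decoder layer.
shapes = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 512),
    "v_proj": (2048, 512),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 8192),
    "up_proj": (2048, 8192),
    "down_proj": (8192, 2048),
}

# Each adapter has A (r x in_features) and B (out_features x r).
per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
trainable = n_layers * per_layer
all_params = 1_280_903_168   # "all params" as printed above

print(f"{trainable:,}")                            # 45,088,768
print(round(100 * trainable / all_params, 4))      # 3.5201
```

Both values agree with the output of print_trainable_parameters().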


In [19]:
output_dir="experiments"


training_arguments = SFTConfig(
    output_dir=output_dir,                    # directory to save and repository id
    dataset_text_field='text',
    max_seq_length=512,
    num_train_epochs=5,                       # number of training epochs
    per_device_train_batch_size=2,            # batch size per device during training
    gradient_accumulation_steps=4,            # number of batches accumulated before each optimizer update
    gradient_checkpointing=True,              # use gradient checkpointing to save memory
    optim="paged_adamw_8bit",
    save_steps=0.2,
    eval_steps=0.2,
    logging_steps=10,                         # log every 10 steps
    learning_rate=1e-4,                       # learning rate, based on QLoRA paper
    weight_decay=0.001,
    fp16=True,
    save_strategy="steps",
    save_total_limit=2,
    max_grad_norm=0.3,                        # max gradient norm based on QLoRA paper
    max_steps=-1,
    warmup_ratio=0.1,                        # warmup ratio based on QLoRA paper
    group_by_length=False,
    lr_scheduler_type="cosine",               # use cosine learning rate scheduler
    report_to="tensorboard",
    dataset_kwargs={
        "add_special_tokens": False,
        "append_concat_token": False,
    }
)

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=train_data,
    eval_dataset=test_data,
    peft_config=peft_config,
    tokenizer=tokenizer,
    packing=False,
    # compute_metrics=compute_metrics,
    # callbacks = [EarlyStoppingCallback(early_stopping_patience=3)],
)

Map: 100%|██████████████████████████| 999/999 [00:00<00:00, 24271.07 examples/s]
Map: 100%|██████████████████████████| 999/999 [00:00<00:00, 24521.49 examples/s]


In [12]:
output_dir="trained_weigths"


training_arguments = SFTConfig(
    output_dir=output_dir,                    # directory to save and repository id
    dataset_text_field='text',
    max_seq_length=512,
    num_train_epochs=1,                       # number of training epochs
    per_device_train_batch_size=2,            # batch size per device during training
    gradient_accumulation_steps=8,            # number of batches accumulated before each optimizer update
    gradient_checkpointing=True,              # use gradient checkpointing to save memory
    optim="paged_adamw_8bit",
    save_steps=0,
    logging_steps=25,                         # log every 25 steps
    learning_rate=2e-4,                       # learning rate, based on QLoRA paper
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,                        # max gradient norm based on QLoRA paper
    max_steps=-1,
    warmup_ratio=0.03,                        # warmup ratio based on QLoRA paper
    group_by_length=False,
    lr_scheduler_type="cosine",               # use cosine learning rate scheduler
    report_to="tensorboard",
    dataset_kwargs={
        "add_special_tokens": False,
        "append_concat_token": False,
    }
)

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=train_data,
    #eval_dataset=eval_data,
    peft_config=peft_config,
    tokenizer=tokenizer,
    packing=False,
    #compute_metrics=compute_metrics,
    #callbacks = [EarlyStoppingCallback(early_stopping_patience=3)],
)

Map: 100%|██████████████████████████| 999/999 [00:00<00:00, 19804.00 examples/s]


In [20]:
trainer.train()

Step,Training Loss
10,1.3352
20,1.3897
30,1.3987
40,1.3201
50,1.4345
60,1.332
70,1.3838
80,1.3396
90,1.4009
100,1.2586


TrainOutput(global_step=625, training_loss=1.1213091278076173, metrics={'train_runtime': 853.5408, 'train_samples_per_second': 5.852, 'train_steps_per_second': 0.732, 'total_flos': 3674834119557120.0, 'train_loss': 1.1213091278076173, 'epoch': 5.0})

In [60]:
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss
25,2.7803
50,1.426
75,1.3226
100,1.3219
125,1.3163
150,1.2669
175,1.2257
200,1.2616
225,1.2792
250,1.207


TrainOutput(global_step=625, training_loss=1.3028873565673829, metrics={'train_runtime': 1590.6801, 'train_samples_per_second': 6.287, 'train_steps_per_second': 0.393, 'total_flos': 6602132937351168.0, 'train_loss': 1.3028873565673829, 'epoch': 1.0})

The following code trains the model with the trainer.train() method; the trained model is saved to the output directory in a later cell. On the standard P100 GPU offered by Kaggle, the training should be quite fast.

In [16]:
# Train model
trainer.train()

Step,Training Loss
25,1.8561
50,0.985
75,0.9007
100,0.8695
125,0.8418
150,0.7648
175,0.7043
200,0.696
225,0.6818
250,0.4643


TrainOutput(global_step=560, training_loss=0.5523048922419548, metrics={'train_runtime': 5862.971, 'train_samples_per_second': 0.768, 'train_steps_per_second': 0.096, 'total_flos': 1.7183004555264e+16, 'train_loss': 0.5523048922419548, 'epoch': 4.977777777777778})

The model and the tokenizer are saved to disk for later usage.

In [22]:
# Save trained model and tokenizer
trainer.save_model()
tokenizer.save_pretrained(output_dir)

('experiments/tokenizer_config.json',
 'experiments/special_tokens_map.json',
 'experiments/tokenizer.json')

Afterwards, loading the TensorBoard extension and starting TensorBoard, pointed at the logs/runs directory (which is assumed to contain the training logs for your model), lets you see how the model fits during training.

In [23]:
%load_ext tensorboard
%tensorboard --logdir logs/runs

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6006 (pid 26054), started 0:00:32 ago. (Use '!kill 26054' to kill it.)

## Testing

The following code first predicts the sentiment labels for the test set with the predict() function and then evaluates the model's performance on the test set with the evaluate() function. The result should be strong: the run shown below reaches an overall accuracy of 0.873, with per-label accuracy between 0.837 and 0.937. The prediction of the neutral label can still be improved, yet it is impressive how much can be done with little data and some fine-tuning.
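
The predict() and evaluate() helpers are not defined in this notebook. As a rough sketch of what a prediction step could look like, the function below takes the test prompts and any text-generation callable, then maps the generated continuation to one of the three labels. The names predict_labels, extract_label, generate_fn and fake_generate are hypothetical, and falling back to "Neutral" when no label is found is an assumption. In practice generate_fn would wrap the fine-tuned model, for example through a transformers text-generation pipeline.

```python
LABELS = ("Positive", "Neutral", "Negative")


def extract_label(prompt, generated_text, default="Neutral"):
    """Map generated text to the first sentiment label it mentions."""
    # Drop the echoed prompt so label names in the instructions are ignored.
    if generated_text.startswith(prompt):
        answer = generated_text[len(prompt):]
    else:
        answer = generated_text
    answer = answer.lower()
    # Position of each label in the answer; -1 means the label is absent.
    positions = {label: answer.find(label.lower()) for label in LABELS}
    found = {label: pos for label, pos in positions.items() if pos >= 0}
    if not found:
        return default  # assumption: unparseable answers count as Neutral
    return min(found, key=found.get)


def predict_labels(prompts, generate_fn):
    """Generate a continuation for every prompt and parse its label."""
    return [extract_label(prompt, generate_fn(prompt)) for prompt in prompts]


def fake_generate(prompt):
    """Stand-in for the model: echoes the prompt and answers 'Negative'."""
    return prompt + " Negative"


demo_prompts = ['Return "Positive" or "Neutral" or "Negative".\n\n[bad game] =']
print(predict_labels(demo_prompts, fake_generate))  # ['Negative']
```

Passing the generator in as a callable keeps the parsing logic testable without loading the model.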

In [19]:
y_pred = predict(test, model, tokenizer)
evaluate(y_true, y_pred)

  0%|          | 0/900 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
100%|██████████| 900/900 [06:05<00:00,  2.47it/s]

Accuracy: 0.873
Accuracy for label 0: 0.937
Accuracy for label 1: 0.847
Accuracy for label 2: 0.837

Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.94      0.94       300
           1       0.80      0.85      0.82       300
           2       0.87      0.84      0.86       300

    accuracy                           0.87       900
   macro avg       0.88      0.87      0.87       900
weighted avg       0.88      0.87      0.87       900


Confusion Matrix:
[[281  18   1]
 [ 11 254  35]
 [  3  46 251]]
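
The numbers in this report can be recomputed from the confusion matrix alone, which is a handy sanity check. A small sketch, assuming rows are true labels and columns are predicted labels (the scikit-learn convention). Under that reading, the "accuracy for label" values above are the per-label recalls.

```python
# Confusion matrix as printed above: rows = true label, columns = predicted label.
cm = [
    [281, 18, 1],
    [11, 254, 35],
    [3, 46, 251],
]
n = len(cm)
total = sum(sum(row) for row in cm)

accuracy = sum(cm[i][i] for i in range(n)) / total
# Recall: correct predictions divided by the row total (all true examples).
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]
# Precision: correct predictions divided by the column total (all predictions).
precision = [cm[i][i] / sum(cm[j][i] for j in range(n)) for i in range(n)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

print(round(accuracy, 3))                 # 0.873
print([round(v, 3) for v in recall])      # [0.937, 0.847, 0.837]
print([round(v, 2) for v in precision])   # [0.95, 0.8, 0.87]
print([round(v, 2) for v in f1])          # [0.94, 0.82, 0.86]
```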





The following code creates a pandas DataFrame called evaluation containing the text, true labels, and predicted labels from the test set. This is especially useful for understanding the errors that the fine-tuned model makes and for getting insights into how to improve the prompt.

In [20]:
evaluation = pd.DataFrame({'text': X_test["text"], 
                           'y_true':y_true, 
                           'y_pred': y_pred},
                         )
evaluation.to_csv("test_predictions.csv", index=False)
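
To dig into the errors, one option is to count which (true, predicted) pairs occur most often among the misclassified rows. A minimal sketch on made-up labels, using only the standard library; with the evaluation DataFrame one would pass its y_true and y_pred columns instead. The function name confusion_pairs is hypothetical.

```python
from collections import Counter


def confusion_pairs(y_true, y_pred):
    """Count (true, predicted) pairs for the misclassified examples only."""
    return Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)


# Made-up labels, for illustration only.
y_true_demo = ["Positive", "Neutral", "Negative", "Neutral", "Neutral"]
y_pred_demo = ["Positive", "Positive", "Negative", "Positive", "Negative"]

pairs = confusion_pairs(y_true_demo, y_pred_demo)
for (true_label, predicted_label), count in pairs.most_common():
    print(f"{true_label} -> {predicted_label}: {count}")
```

Reading the tweets behind the most frequent pair is usually the quickest way to spot wording in the prompt that needs adjusting.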

The evaluation results are indeed good when compared to simpler benchmarks such as a Conv1D + bidirectional LSTM model: https://www.kaggle.com/code/lucamassaron/lstm-baseline-for-sentiment-analysis

Here are the results of the baseline model:

Accuracy: 0.623
Accuracy for label 0: 0.620
Accuracy for label 1: 0.590
Accuracy for label 2: 0.660

Classification Report:
              precision    recall  f1-score   support

           0       0.79      0.62      0.69       300
           1       0.61      0.59      0.60       300
           2       0.53      0.66      0.59       300

    accuracy                           0.62       900
   macro avg       0.64      0.62      0.63       900
weighted avg       0.64      0.62      0.63       900


Confusion Matrix:

[[186  39  75]
 [ 23 177 100]
 [ 27  75 198]]
 

With this testing, the fine-tuning of Llama 3.2 has reached its conclusion. Don't forget to upvote if you find the notebook useful for your projects or work!