# Fine-Tuning an LLM for Sentiment Analysis Using the Hugging Face Transformers Library

Text classification is a pattern-based approach to assigning labels to text from a fixed set of categories. A popular form of text classification is sentiment analysis, which categorizes text by sentiment, e.g. positive, neutral, or negative. It can be applied to movie reviews, political posts, or stock trading.

The goal is to use DistilBERT on the `imdb` dataset to predict whether a movie review is positive or negative.


Note: To fine-tune a model, the fine-tuning task must adhere to the pre-trained model's target task.

#### Install and import packages, and log in to the Hugging Face Hub to push the model


In [1]:
# pip install transformers datasets evaluate accelerate
'''
Logging in to the HF Hub lets you upload your fine-tuned model and tokenizer:
from huggingface_hub import notebook_login

Problem when logging in to the HF Hub:
ImportError: The `notebook_login` function can only be used in a notebook (Jupyter or Colab) and you need the `ipywidgets` module: `pip install ipywidgets`.

Solution:
open a terminal and run `huggingface-cli login`
'''

# notebook_login()


In [2]:
# Import Packages

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, TrainingArguments, Trainer
import evaluate
import numpy as np
from huggingface_hub import notebook_login



In [3]:
#### Load Dataset
data = load_dataset("imdb")
data
data['unsupervised']['text'][:10]
data['train']['text'][:10]
data['train']['label'][:10]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

#### Preprocess Dataset

In [4]:
# Establish Tokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")


In [5]:
# Define a preprocessing function and apply it with map(); with batched=True, the function receives a dictionary of columns covering a batch of rows

def preprocess_dataset(sequence):
    return tokenizer(sequence['text'], truncation=True, padding=True, max_length=50, add_special_tokens=True)
    # return tokenizer.encode(sequence['text'], return_tensors="pt", truncation=True)
tokenized_data = data.map(preprocess_dataset, batched=True)
tokenized_data
# tokenized_data['train']['text'][:3]
[len(i) for i in tokenized_data['train']['input_ids'][:3]]

Map: 100%|██████████| 25000/25000 [00:03<00:00, 7162.35 examples/s]
Map: 100%|██████████| 25000/25000 [00:03<00:00, 7506.40 examples/s]
Map: 100%|██████████| 50000/50000 [00:07<00:00, 7094.95 examples/s]


[50, 50, 50]
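To make the `[50, 50, 50]` output above more concrete, here is a toy, dependency-free sketch of what truncation, special tokens, and padding do to a sequence of token ids. It is not the tokenizer's implementation: the function `encode_fixed_length` and the example ids are hypothetical, and it always pads to `max_length`, which only matches the cell above under the assumption that at least one review in each batch reaches the 50-token limit. The `[CLS]`, `[SEP]`, and `[PAD]` ids are the ones shown in the tokenizer repr in this notebook.

```python
# Toy illustration only: a real tokenizer first splits text into subword ids.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids listed in the tokenizer repr in this notebook


def encode_fixed_length(token_ids, max_length=50):
    """Truncate, wrap in [CLS]/[SEP], then pad to exactly max_length (hypothetical helper)."""
    body = token_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 = real token
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 = padding, ignored by attention
    }


long_review = list(range(1000, 1300))   # 300 made-up token ids, gets truncated
short_review = list(range(2000, 2010))  # 10 made-up token ids, gets padded

long_enc = encode_fixed_length(long_review)
short_enc = encode_fixed_length(short_review)

print(len(long_enc["input_ids"]), len(short_enc["input_ids"]))
print(sum(short_enc["attention_mask"]))  # number of real (non-padding) tokens
```

With `max_length=50`, only 48 tokens of each review survive, so most of a long review is discarded before the model sees it.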

In [6]:
# Apply dynamic padding
# It is more efficient to pad each batch of sentences to the length of the longest sequence in that batch
# than to pad every sentence in the dataset to the length of the longest sequence in the whole dataset.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)
data_collator

DataCollatorWithPadding(tokenizer=DistilBertTokenizerFast(name_or_path='distilbert/distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}, padding=True, max_length=None, pad_to_multiple_of=None, return_tensor
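The benefit of dynamic padding can be illustrated by counting pad tokens. The sketch below is a simplified stand-in for what a padding collator does, not `DataCollatorWithPadding` itself; the helpers `pad_batch` and `count_padding` and the sequence lengths are made up for the example.

```python
PAD_ID = 0  # [PAD] id from the tokenizer repr above


def pad_batch(batch, target_len=None):
    """Pad every sequence in a batch to target_len, or to the batch's own longest sequence."""
    width = target_len if target_len is not None else max(len(seq) for seq in batch)
    return [seq + [PAD_ID] * (width - len(seq)) for seq in batch]


def count_padding(padded_batch):
    """Count how many pad tokens a padded batch contains."""
    return sum(seq.count(PAD_ID) for seq in padded_batch)


# Four hypothetical tokenized sequences of lengths 5, 8, 40 and 50 (token id 7 is arbitrary).
sequences = [[7] * 5, [7] * 8, [7] * 40, [7] * 50]
batches = [sequences[:2], sequences[2:]]

# Static padding: every batch is padded to the longest sequence in the whole dataset.
global_width = max(len(seq) for seq in sequences)
static_pads = sum(count_padding(pad_batch(b, global_width)) for b in batches)

# Dynamic padding: each batch is padded only to its own longest sequence.
dynamic_pads = sum(count_padding(pad_batch(b)) for b in batches)

print(static_pads, dynamic_pads)
```

In this toy setup the first batch only needs to be 8 tokens wide, so dynamic padding wastes far fewer positions. Dynamic padding only helps if the sequences have not already been padded to a common length during preprocessing.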

#### Define additional metrics, if needed, for evaluation on the train/test sets

In [7]:
# Define a number of metrics for the model from the "evaluate" library.

acc_score = evaluate.load("accuracy")

def compute_metrics(predicted_probs):
    logits, labels = predicted_probs
    predictions = np.argmax(logits, axis=1)
    return acc_score.compute(predictions=predictions, references=labels)
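As a rough illustration of what `compute_metrics` calculates, the following dependency-free sketch takes the argmax of each row of logits and compares it with the reference label. The function names and the example logits are hypothetical; the real function above relies on `np.argmax` and the `evaluate` accuracy metric.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Turn per-class logits into predicted class ids and score them against the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four reviews; column 0 = NEGATIVE, column 1 = POSITIVE.
logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 0.4], [1.7, 1.9]]
labels = [0, 1, 0, 1]

result = accuracy_from_logits(logits, labels)
print(result)
```

The third example is misclassified (the larger logit is in column 1 but the label is 0), so three of the four predictions are correct.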


In [8]:
# Create labels_2_ids and ids_2_labels dictionaries for easy reference when printing out the predictions

labels_2_ids = {'POSITIVE': 1, 'NEGATIVE': 0}
ids_2_labels = {1: 'POSITIVE', 0: 'NEGATIVE'}
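The two dictionaries must be exact inverses of each other: one maps label names to integer ids, the other maps ids back to names. One way to keep them consistent, shown here as a small sketch rather than a required pattern, is to write one mapping by hand and derive the other from it.

```python
# Write the id -> label mapping once...
ids_2_labels = {0: "NEGATIVE", 1: "POSITIVE"}

# ...and derive the label -> id mapping from it so the two cannot drift apart.
labels_2_ids = {label: idx for idx, label in ids_2_labels.items()}

# Round-trip check: converting a label to an id and back returns the same label.
round_trip_ok = all(ids_2_labels[labels_2_ids[label]] == label for label in labels_2_ids)
print(labels_2_ids, round_trip_ok)
```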

In [9]:
# Load the model from the HF Model Hub and set its configuration: the number of output labels and the id-to-label and label-to-id dictionaries

model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=2, id2label=ids_2_labels, label2id=labels_2_ids)
model

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

#### Training

In [10]:

# Define training arguments - only one argument, "output_dir", is required

training_args = TrainingArguments(
    output_dir="my_awesome_modelX",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    # push_to_hub=True,
)

In [11]:
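# Define the Trainer, which can include the model, train/test data, tokenizer, training arguments, data collator, and a metrics function
# Note: training_args is not passed here, so the Trainer falls back to its default arguments
# (the run logged below writes to tmp_trainer and trains for 3 epochs). Add args=training_args to use the settings defined above.

trainer = Trainer(model=model,
                  data_collator=data_collator,
                  train_dataset=tokenized_data['train'],
                  eval_dataset=tokenized_data['test'],
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics)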
trainer

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


<transformers.trainer.Trainer at 0x179fad880>

In [16]:
# Train model

trainer.train()

                                                   
Checkpoint destination directory tmp_trainer/checkpoint-500 already exists and is non-empty. Saving will proceed but saved results may be invalid.

{'loss': 0.2067, 'grad_norm': 0.07200344651937485, 'learning_rate': 4.7333333333333336e-05, 'epoch': 0.16}
{'loss': 0.249, 'grad_norm': 7.350505352020264, 'learning_rate': 4.466666666666667e-05, 'epoch': 0.32}
{'loss': 0.2426, 'grad_norm': 25.27147674560547, 'learning_rate': 4.2e-05, 'epoch': 0.48}
{'loss': 0.2375, 'grad_norm': 16.328832626342773, 'learning_rate': 3.933333333333333e-05, 'epoch': 0.64}
{'loss': 0.2468, 'grad_norm': 17.963199615478516, 'learning_rate': 3.6666666666666666e-05, 'epoch': 0.8}
{'loss': 0.2517, 'grad_norm': 26.059907913208008, 'learning_rate': 3.4000000000000007e-05, 'epoch': 0.96}
{'loss': 0.1634, 'grad_norm': 0.07462353259325027, 'learning_rate': 3.1333333333333334e-05, 'epoch': 1.12}
{'loss': 0.1189, 'grad_norm': 0.4949056804180145, 'learning_rate': 2.8666666666666668e-05, 'epoch': 1.28}
{'loss': 0.1334, 'grad_norm': 0.020918797701597214, 'learning_rate': 2.6000000000000002e-05, 'epoch': 1.44}
{'loss': 0.1277, 'grad_norm': 0.18395085632801056, 'learning_rate': 2.3333333333333336e-05, 'epoch': 1.6}
{'loss': 0.1293, 'grad_norm': 32.170372009277344, 'learning_rate': 2.0666666666666666e-05, 'epoch': 1.76}
{'loss': 0.1383, 'grad_norm': 0.07567379623651505, 'learning_rate': 1.8e-05, 'epoch': 1.92}
{'loss': 0.0823, 'grad_norm': 0.022272340953350067, 'learning_rate': 1.5333333333333334e-05, 'epoch': 2.08}
{'loss': 0.0348, 'grad_norm': 0.002939199097454548, 'learning_rate': 1.2666666666666668e-05, 'epoch': 2.24}
{'loss': 0.057, 'grad_norm': 16.094919204711914, 'learning_rate': 1e-05, 'epoch': 2.4}
{'loss': 0.0461, 'grad_norm': 0.010443528182804585, 'learning_rate': 7.333333333333334e-06, 'epoch': 2.56}
{'loss': 0.0424, 'grad_norm': 0.00723448907956481, 'learning_rate': 4.666666666666667e-06, 'epoch': 2.72}
{'loss': 0.0378, 'grad_norm': 0.008888768032193184, 'learning_rate': 2.0000000000000003e-06, 'epoch': 2.88}


                                                   
100%|██████████| 9375/9375 [30:09<00:00,  5.18it/s]

{'train_runtime': 1809.7104, 'train_samples_per_second': 41.443, 'train_steps_per_second': 5.18, 'train_loss': 0.13806326314290365, 'epoch': 3.0}





TrainOutput(global_step=9375, training_loss=0.13806326314290365, metrics={'train_runtime': 1809.7104, 'train_samples_per_second': 41.443, 'train_steps_per_second': 5.18, 'train_loss': 0.13806326314290365, 'epoch': 3.0})
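The step count and learning rates in the log above can be reproduced with a little arithmetic. The sketch below assumes a batch size of 8 on a single device, a base learning rate of 5e-5, 3 epochs, and a linear decay to zero with no warmup. These values are inferred from the log (9375 steps over 25,000 training examples, and a logged learning rate of about 4.73e-05 at step 500) and are not read from the run's actual configuration. The helper `linear_lr` is hypothetical.

```python
import math

train_examples = 25_000  # size of the imdb train split
batch_size = 8           # assumption: per-device batch size on one device
epochs = 3               # assumption: matches 'epoch': 3.0 in the final log line
base_lr = 5e-5           # assumption: starting learning rate

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs


def linear_lr(step, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero (hypothetical helper)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(steps_per_epoch, total_steps)
for step in (500, 5000, 9000):
    print(step, linear_lr(step))
```

Under these assumptions the computed values line up with the logged ones: 9375 total steps, and learning rates close to the entries logged every 500 steps.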

In [13]:
# Evaluate performance on test set 
trainer.evaluate(tokenized_data['test'])

100%|██████████| 3125/3125 [02:52<00:00, 18.07it/s]


{'eval_loss': 0.9095878005027771,
 'eval_accuracy': 0.79608,
 'eval_runtime': 172.9386,
 'eval_samples_per_second': 144.56,
 'eval_steps_per_second': 18.07,
 'epoch': 3.0}

In [14]:
# Make predictions 
test_set_predictions = trainer.predict(tokenized_data['test'])
test_set_predictions
np.argmax(test_set_predictions[0][:3], axis=1)

 60%|█████▉    | 1871/3125 [00:29<00:19, 64.70it/s]

KeyboardInterrupt: 

In [None]:
# test_set_predictions[0][:3]
for l, p in zip(tokenized_data['test']['label'][:10], np.argmax(test_set_predictions[0][:10], axis=1)):
    print(ids_2_labels[l], ids_2_labels[p])

In [None]:
# Upload the newly fine-tuned model to the HF Model Hub
# Note: Be sure to log in to HF via the terminal using the command `huggingface-cli login`

# Link to all models in profile -> https://huggingface.co/dstaples08
# Link to newly created model -> https://huggingface.co/dstaples08/tmp_trainer?text=but+the+acting+brought+the+quality+down+some.+They+should+remove+the+main+cast+otherwise+my+curiousity+for+a+sequel+is+low.
# trainer.push_to_hub()

#### Inference





After pushing the model to the Hub (make sure to use either Colab or a notebook), it is available for anyone to use for inference or further fine-tuning, simply by passing the model's path when defining a pipeline or AutoModel/AutoTokenizer. There are two ways to run inference on a fine-tuned model:

1. Use the simple pipeline() function from the transformers library

2. Replicate the results of the pipeline, i.e. use AutoTokenizer.from_pretrained("dstaples08/tmp_trainer") and AutoModelForSequenceClassification.from_pretrained("dstaples08/tmp_trainer")

text = "I really liked the movie, but the acting brought the quality down some. They should remove the main cast otherwise my curiousity for a sequel is low."

In [None]:
# 1. Use the simple pipeline() function from transformers library

from transformers import pipeline

sentiment_analysis_classifier = pipeline("sentiment-analysis", model="dstaples08/tmp_trainer")

result = sentiment_analysis_classifier(text)
result

In [None]:

# 2. Replicate the results of the pipeline, i.e. use AutoTokenizer.from_pretrained("dstaples08/tmp_trainer") and AutoModelForSequenceClassification.from_pretrained("dstaples08/tmp_trainer")

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="dstaples08/tmp_trainer")
model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path="dstaples08/tmp_trainer")
tokenizer
model
input_info = tokenizer(text, return_tensors="pt")
input_info
input_ids = input_info['input_ids']
input_ids.shape
with torch.no_grad():
    # Unpack input_ids and attention_mask as keyword arguments; gradients are not needed for inference
    model_info = model(**input_info) 
model_info
logits = model_info['logits']
logits
ids_2_labels[logits.argmax().item()]
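The argmax above gives only the predicted label. To also get a confidence value comparable to a pipeline's `score`, the logits can be passed through a softmax. The sketch below is dependency-free and uses hypothetical logits, not the output of the fine-tuned model; that the pipeline's score is a softmax over the logits is an assumption for this two-label, single-label setup.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1 (numerically stable form)."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


ids_2_labels = {0: "NEGATIVE", 1: "POSITIVE"}

example_logits = [-1.2, 2.3]  # hypothetical logits for one review
probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)

print(ids_2_labels[best], probs[best])
```

With PyTorch tensors, the equivalent would be `torch.softmax(logits, dim=-1)` applied to the `logits` tensor from the cell above.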
