# Sentiment Analysis with Hugging Face

The "Data Cleaning and EDA Notebook" covered cleaning and exploring a natural language processing (NLP) dataset. This notebook moves on to the modeling phase: the objective is to fine-tune two pretrained models so that they are better suited to our sentiment classification task.
These two models are:
- DistilBERT base uncased
- RoBERTa base

Both models are available on the Hugging Face Hub.

The dataset used here has already been cleaned.

## Installing and Importing Libraries


In [None]:
%%capture
## Install libraries (the %%capture cell magic must be the first line of the cell)
! pip install transformers
! pip install accelerate -U
! pip install --upgrade tensorflow
! pip install datasets
! pip install huggingface_hub




In [None]:
%%capture
## Load libraries
## Data handling
import os
import numpy as np
import pandas as pd

## Visualizations
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

## NLP
import torch
import transformers
from datasets import load_dataset, load_metric
from scipy.special import softmax
from torch import nn
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

## Modelling
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

## Google Drive access and Hugging Face login
from google.colab import drive
from huggingface_hub import notebook_login

# Disable Weights & Biases (W&B) logging
os.environ["WANDB_DISABLED"] = "true"


In [None]:
# Allow access to google drive
drive.mount('/content/drive')

Mounted at /content/drive


# Fine-tuning a Hugging Face Text Classification Model

## Importing dataset from my Google Drive



In [None]:
#import dataset
data_path="/content/drive/MyDrive/Sentiment analysis/clean_data.csv"

In [None]:
# Load the dataset and display some values
df = pd.read_csv(data_path)


In [None]:
## View dataset
df.head()

| | Unnamed: 0 | tweets | label | agreement |
|---|---|---|---|---|
| 0 | 0 | me amp the big homie meanboy stegman st | 0.0 | 1.0 |
| 1 | 1 | im thinking of devoting my career to proving a... | 1.0 | 1.0 |
| 2 | 2 | vaccines do not vaccinate your child | -1.0 | 1.0 |
| 3 | 3 | i mean if they immunize my kid with something ... | -1.0 | 1.0 |
| 4 | 4 | thanks to catch me performing at la nuit nyc s... | 0.0 | 1.0 |


In [None]:
## Let's remove the unnamed column and any rows with missing values
df = df.dropna()
df = df.drop("Unnamed: 0", axis=1)

## Data Splitting

I manually split the data into a training subset (the data the model learns from) and an evaluation subset (the data used to compute metric scores during training, which helps us detect problems such as overfitting). The split is stratified on `label`, so both subsets keep the same class proportions.



In [None]:
# Split the train data => {train, eval}
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])

In [None]:
#view train
train.head()

| | tweets | label | agreement |
|---|---|---|---|
| 8627 | vaccine safety and side effects for kids from ... | -1.0 | 0.333333 |
| 7506 | yellow fever vaccine not so much fun passport ... | 1.0 | 1.0 |
| 8636 | vaccine horrors medical mutilation of children... | -1.0 | 1.0 |
| 323 | the all mighty mmr music money records | 0.0 | 1.0 |
| 3254 | on average people who complain live longer rel... | 0.0 | 1.0 |


In [None]:
#view eval
eval.head()

| | tweets | label | agreement |
|---|---|---|---|
| 3681 | disneyland measles cases still trail ohio amis... | 0.0 | 1.0 |
| 9666 | shawn siegel speaks the truth about vaccines | 0.0 | 1.0 |
| 8561 | pretty cool website about the antivaccine move... | -1.0 | 0.666667 |
| 7514 | on average people who complain live longer rel... | 0.0 | 1.0 |
| 1849 | amnews aids measles and educating the public m... | 1.0 | 1.0 |


In [None]:
print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")

new dataframe shapes: train is (7979, 3), eval is (1995, 3)
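
To illustrate what `stratify` does in the split above, here is a minimal sketch on a hypothetical toy frame (the `toy` data and its 10/50/40 class mix are made up for illustration, not taken from our dataset). It checks that each class keeps the same share in both subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 negative, 50 neutral, 40 positive rows
toy = pd.DataFrame({
    "tweets": [f"tweet {i}" for i in range(100)],
    "label": [-1.0] * 10 + [0.0] * 50 + [1.0] * 40,
})

# Same call pattern as the notebook: 80/20 split, stratified on the label
toy_train, toy_eval = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)

# Relative class frequencies in each subset, ordered by label
train_share = toy_train["label"].value_counts(normalize=True).sort_index()
eval_share = toy_eval["label"].value_counts(normalize=True).sort_index()
print(train_share)
print(eval_share)
```

Without `stratify`, a small class such as -1 could end up over- or underrepresented in the evaluation subset purely by chance.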


In [None]:
# Save the split subsets
train.to_csv("/content/train_set.csv",index=False)
eval.to_csv("/content/eval_set.csv", index=False)

## Loading Datasets

In [None]:
dataset = load_dataset("csv", data_files={"train_set": "train_set.csv", "eval_set": "eval_set.csv"})



## Tokenization

In [None]:
## Name of the pretrained checkpoint to fine-tune
distil = "distilbert-base-uncased"

In [None]:
## Load the tokenizer that matches the checkpoint
tokenizer = AutoTokenizer.from_pretrained(distil)


In [None]:
# Our labels are -1, 0, 1 and we would like to map them to 0, 1, 2 respectively

def transform_labels(label):

    label = label['label']
    num = 0
    if label == -1: #'Negative'
        num = 0
    elif label == 0: #'Neutral'
        num = 1
    elif label == 1: #'Positive'
        num = 2

    return {'labels': num}

def tokenize_data(example):
    # Pad every tweet to the model's maximum length and truncate anything longer
    return tokenizer(example['tweets'], padding='max_length', truncation=True)



In [None]:
# Change the tweets to tokens that the models can exploit
dataset = dataset.map(tokenize_data, batched=True)

# Transform labels and remove the columns that are no longer needed
remove_columns = ['label', 'tweets', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)


In [None]:
dataset

DatasetDict({
    train_set: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 7979
    })
    eval_set: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1995
    })
})
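
The label mapping used above can also be written as a dictionary lookup. This is only a sketch of an alternative, and the names `LABEL_TO_ID` and `transform_labels_lookup` are hypothetical. One difference in behaviour: the `if`/`elif` version silently maps any unexpected label to 0, while a lookup raises a `KeyError`, which can surface bad rows earlier.

```python
# Hypothetical alternative to transform_labels: map -1/0/1 to 0/1/2 with a lookup table
LABEL_TO_ID = {-1: 0, 0: 1, 1: 2}  # Negative, Neutral, Positive


def transform_labels_lookup(example):
    # The CSV stores labels as floats (-1.0, 0.0, 1.0), so cast to int first
    return {"labels": LABEL_TO_ID[int(example["label"])]}


print(transform_labels_lookup({"label": -1.0}))
print(transform_labels_lookup({"label": 1.0}))
```

It would be passed to `dataset.map` in the same way as `transform_labels`.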

## Dealing with Class Imbalance


From our EDA, we saw that the -1 class (now our 0 class) is underrepresented, so in this section we address that imbalance by weighting the loss.


In [None]:
# Calculate class weights as 1 - relative class frequency, ordered by label (-1, 0, 1)
class_weights = (1 - (df["label"].value_counts().sort_index() / len(df))).values
class_weights

array([0.8962302 , 0.50972529, 0.59404452])
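
The weights above are one minus each class's share of the rows, so the rarest class gets the largest weight: a weight of about 0.896 means the -1 class makes up roughly 10% of the data. The sketch below repeats the same computation on a hypothetical toy label column (10/50/40 split, made up for illustration). `sort_index()` orders the weights by label (-1, 0, 1), which lines up with the 0, 1, 2 ids used after the label transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels: 10% negative, 50% neutral, 40% positive
toy_labels = pd.Series([-1.0] * 10 + [0.0] * 50 + [1.0] * 40)

# Relative frequency of each class, ordered by label value (-1, 0, 1)
freq = toy_labels.value_counts().sort_index() / len(toy_labels)

# Weight = 1 - frequency, so rare classes weigh more in the loss
toy_weights = (1 - freq).values
print(toy_weights)  # approximately [0.9, 0.5, 0.6]
```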

In [None]:
# Configure the training parameters, such as `num_train_epochs`:
training_args = TrainingArguments(output_dir="Kodwo-Finetuned-distilbert-model",
                                  learning_rate=1e-05,
                                  num_train_epochs=5,
                                  load_best_model_at_end=True,
                                  evaluation_strategy="steps",
                                  save_strategy="steps",
                                  push_to_hub=True)

In [None]:
# Load a pretrained model, specifying the number of labels in our dataset for fine-tuning

model = AutoModelForSequenceClassification.from_pretrained(distil, num_labels=3)


In [None]:
#ensure consistent shuffling
train_dataset = dataset['train_set'].shuffle(seed=10)
eval_dataset = dataset['eval_set'].shuffle(seed=10)


In [None]:
##login to hugging face
notebook_login()


In [None]:
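# Move the class weights to the GPU when one is available (falls back to CPU otherwise)
device = "cuda" if torch.cuda.is_available() else "cpu"
class_weights = torch.from_numpy(class_weights).float().to(device)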

In [None]:
#Define accuracy metric
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
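
As a rough illustration of what `compute_metrics` does, the sketch below computes accuracy by hand with NumPy on made-up logits. The helper `accuracy_from_logits` and the toy arrays are hypothetical and are not part of the notebook.

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    # The predicted class is the index of the largest logit in each row
    predictions = np.argmax(logits, axis=-1)
    # Accuracy is the fraction of predictions that match the reference labels
    return float((predictions == labels).mean())


toy_logits = np.array([
    [2.1, 0.3, -1.0],  # predicts class 0
    [0.1, 0.2, 1.7],   # predicts class 2
    [0.4, 1.2, 0.9],   # predicts class 1 (true label is 2)
    [1.5, 0.2, 0.1],   # predicts class 0
])
toy_labels = np.array([0, 2, 2, 0])

print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```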


In [None]:
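# Since I am leveraging class_weights, I create a custom trainer with a weighted loss
class CustomTrainer(Trainer):
    # **kwargs absorbs extra arguments that some transformers versions pass to compute_loss
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        # Forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Cross-entropy in which each example is weighted by the weight of its true class
        loss_fct = nn.CrossEntropyLoss(weight=class_weights.to(logits.device))
        loss = loss_fct(logits, labels)
        return (loss, outputs) if return_outputs else loss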
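
To make the weighted loss in `CustomTrainer` concrete, here is a NumPy sketch of weighted cross-entropy with mean reduction: each example's negative log-probability is scaled by the weight of its true class, and the sum is divided by the sum of those weights. The function name and the toy numbers are hypothetical; the real training uses `nn.CrossEntropyLoss`.

```python
import numpy as np


def weighted_cross_entropy(logits, labels, weights):
    # Numerically stable log-softmax over the class dimension
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-probability assigned to the true class of each example
    per_example = -log_probs[np.arange(len(labels)), labels]
    # Weight of each example is the weight of its true class
    w = weights[labels]
    # Weighted mean: divide by the sum of the weights, not by the batch size
    return float((w * per_example).sum() / w.sum())


toy_logits = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
toy_labels = np.array([0, 1])
toy_weights = np.array([0.9, 0.5, 0.6])  # made-up weights for illustration

weighted = weighted_cross_entropy(toy_logits, toy_labels, toy_weights)
unweighted = weighted_cross_entropy(toy_logits, toy_labels, np.ones(3))
print(weighted, unweighted)
```

With equal weights the result reduces to the ordinary mean cross-entropy; with unequal weights, mistakes on the heavily weighted (rare) class contribute more to the loss.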

In [None]:
# Instantiate my trainer
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
##training my model
trainer.train()


| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 500 | 0.8463 | 0.759303 | 0.714286 |
| 1000 | 0.731 | 0.692467 | 0.736842 |
| 1500 | 0.6175 | 0.703698 | 0.747368 |
| 2000 | 0.6318 | 0.674303 | 0.749875 |
| 2500 | 0.4903 | 0.724114 | 0.748872 |
| 3000 | 0.4907 | 0.757324 | 0.75188 |
| 3500 | 0.4136 | 0.809752 | 0.753885 |
| 4000 | 0.3975 | 0.829416 | 0.755388 |
| 4500 | 0.3568 | 0.867988 | 0.750376 |


TrainOutput(global_step=4990, training_loss=0.5325839056041771, metrics={'train_runtime': 2245.1226, 'train_samples_per_second': 17.77, 'train_steps_per_second': 2.223, 'total_flos': 5284881116052480.0, 'train_loss': 0.5325839056041771, 'epoch': 5.0})
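
The `global_step=4990` in the output can be checked with a little arithmetic. This sketch assumes the default per-device batch size of 8 on a single device and the default evaluation interval of 500 steps, neither of which the notebook overrides in `TrainingArguments`.

```python
import math

train_rows = 7979    # rows in the training subset
batch_size = 8       # assumed default per_device_train_batch_size, single device
epochs = 5           # num_train_epochs from TrainingArguments
eval_every = 500     # assumed default evaluation interval in steps

# The last, partially filled batch still counts as a step
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

# Steps at which an evaluation is logged during training
eval_steps = list(range(eval_every, total_steps + 1, eval_every))

print(steps_per_epoch, total_steps)  # 998 4990
print(eval_steps)                    # 500, 1000, ..., 4500
```

This matches the nine evaluation rows logged during training, from step 500 to step 4500.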

In [None]:
# Launch the final evaluation
trainer.evaluate()

{'eval_loss': 0.6743029356002808,
 'eval_accuracy': 0.749874686716792,
 'eval_runtime': 33.6311,
 'eval_samples_per_second': 59.32,
 'eval_steps_per_second': 7.434,
 'epoch': 5.0}
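
The final `eval_loss` of about 0.6743 equals the validation loss logged at step 2000. That is consistent with `load_best_model_at_end=True`, which, assuming the default selection metric (validation loss), reloads the checkpoint with the lowest validation loss before the final evaluation. A small sketch using the values from the training log:

```python
# Validation loss per evaluation step, copied from the training log above
validation_loss = {
    500: 0.759303, 1000: 0.692467, 1500: 0.703698,
    2000: 0.674303, 2500: 0.724114, 3000: 0.757324,
    3500: 0.809752, 4000: 0.829416, 4500: 0.867988,
}

# The best checkpoint is the one with the lowest validation loss
best_step = min(validation_loss, key=validation_loss.get)
print(best_step, validation_loss[best_step])  # 2000 0.674303
```

After step 2000 the training loss keeps falling while the validation loss rises, which is the usual sign of overfitting, even though accuracy moves very little.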

In [None]:
## push to hub
trainer.push_to_hub()

'https://huggingface.co/Kodwo11/Kodwo-Finetuned-distilbert-model/tree/main/'