# Sentiment Analysis with Hugging Face

In the "Data Cleaning and EDA Notebook," we cleaned and explored a natural language processing (NLP) dataset. In this section, we move on to the modeling phase for that dataset. Our objective is to fine-tune two pretrained models so that they are better suited to our NLP task.
These two models are:
- DistilBERT base uncased
- RoBERTa base

Both models are available on the Hugging Face Hub.

The dataset used here has already been cleaned.

## Installing and Importing Libraries


In [58]:
%%capture
## Install Libraries (the %%capture cell magic has to be the first line of the cell)
! pip install transformers
! pip install accelerate -U
! pip install --upgrade tensorflow
! pip install datasets
! pip install huggingface_hub


In [None]:
%%capture
## Load Libraries
## Data handling
import os
import numpy as np
import pandas as pd

## Visualizations
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

## NLP
import torch
import transformers
from torch import nn
from scipy.special import softmax
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

## Modelling
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from datasets import load_dataset
# Note: load_metric is deprecated in recent `datasets` releases in favor of the `evaluate` library
from datasets import load_metric

## Google Drive and Hugging Face Hub access
from google.colab import drive
from huggingface_hub import notebook_login

# Disable W&B logging
os.environ["WANDB_DISABLED"] = "true"


In [60]:
# Allow access to google drive
drive.mount('/content/drive')

# Application of Hugging Face Text Classification Model Fine-tuning

## Importing dataset from my Google Drive



In [61]:
#import dataset
data_path="/content/drive/MyDrive/Sentiment  Analysis/Sentiment-Analysis-master/zindi_challenge/data/clean_data.csv"

In [62]:
# Load the dataset and display some values
df = pd.read_csv(data_path)


In [63]:
## View dataset
df.head()

Unnamed: 0.1,Unnamed: 0,tweets,label,agreement
0,0,amp big homie meanboy stegman st,0.0,1.0
1,1,im thinking devoting career proving autism isn...,1.0,1.0
2,2,vaccine vaccinate child,-1.0,1.0
3,3,mean immunize kid something wont secretly kill...,-1.0,1.0
4,4,thanks catch performing la nuit nyc st ave sho...,0.0,1.0


In [64]:
## Drop rows with missing values, then remove the leftover "Unnamed: 0" column
df= df.dropna()
df= df.drop("Unnamed: 0", axis=1)

## Data Splitting

I manually split the data into a training subset (the data the model learns from) and an evaluation subset (the data used to compute metric scores, which helps us detect training problems such as overfitting). The split is stratified on the label so that both subsets keep the same class proportions.
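To make the idea of a stratified split concrete, here is a minimal, dependency-free sketch. It is not scikit-learn's implementation; the helper names `stratified_split` and `shares` and the toy label proportions are assumptions made up for illustration. It splits each class separately so that every class keeps roughly the same share in both subsets.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, eval_idx) with each label split in the same ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, eval_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                      # shuffle within one class
        n_eval = round(len(indices) * test_size)  # evaluation share of this class
        eval_idx.extend(indices[:n_eval])
        train_idx.extend(indices[n_eval:])
    return sorted(train_idx), sorted(eval_idx)

# Toy labels with made-up proportions: 10% negative, 50% neutral, 40% positive
toy_labels = [-1] * 10 + [0] * 50 + [1] * 40
train_idx, eval_idx = stratified_split(toy_labels)

def shares(indices):
    """Share of each label among the given row indices."""
    counts = Counter(toy_labels[i] for i in indices)
    return {label: counts[label] / len(indices) for label in sorted(counts)}

print("train:", shares(train_idx))
print("eval: ", shares(eval_idx))
```

Both printed dictionaries should show the same class shares, which is the property that `stratify=df['label']` is meant to provide for the real data.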



In [65]:
# Split the train data => {train, eval}
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])

In [66]:
#view train
train.head()

Unnamed: 0,tweets,label,agreement
8627,vaccine safety side effect kid found,-1.0,0.333333
6394,dude gotten vaccinated swag virus known kill y...,1.0,1.0
8636,vaccine horror medical mutilation child expose...,-1.0,1.0
323,mighty mmr music money record,0.0,1.0
3254,average people complain live longer releasing ...,0.0,1.0


In [67]:
#view eval
eval.head()

Unnamed: 0,tweets,label,agreement
8474,bet asked cool bandaids group health anderson,1.0,0.666667
5486,also vaccination contain hg mercury non chemis...,-1.0,0.666667
9863,amen rt good thing parent dont vaccinate kid c...,1.0,1.0
6977,never understand weird state mind would cause ...,1.0,1.0
7866,manditory cootie vaccination protect kid,1.0,0.666667


In [68]:
print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")

new dataframe shapes: train is (7976, 3), eval is (1994, 3)


In [69]:
# Save the split subsets
train.to_csv("/content/drive/MyDrive/Sentiment  Analysis/Sentiment-Analysis-master/zindi_challenge/data/train_set.csv",index=False)
eval.to_csv("/content/drive/MyDrive/Sentiment  Analysis/Sentiment-Analysis-master/zindi_challenge/data/eval_set.csv", index=False)

## Loading Datasets

In [70]:
dataset= load_dataset("csv", data_files= { "train_set":"/content/drive/MyDrive/Sentiment  Analysis/Sentiment-Analysis-master/zindi_challenge/data/train_set.csv", "eval_set":"/content/drive/MyDrive/Sentiment  Analysis/Sentiment-Analysis-master/zindi_challenge/data/eval_set.csv"})



## Tokenization

In [71]:
## Name of the pretrained checkpoint to fine-tune
robert= "roberta-base"

In [72]:
## Load the tokenizer that matches the checkpoint
tokenizer = AutoTokenizer.from_pretrained(robert)

In [73]:
# Our labels are -1, 0 and 1; we map them to 0, 1 and 2 respectively

def transform_labels(label):

    label = label['label']
    num = 0
    if label == -1: #'Negative'
        num = 0
    elif label == 0: #'Neutral'
        num = 1
    elif label == 1: #'Positive'
        num = 2

    return {'labels': num}

def tokenize_data(example):
    # Pad every tweet to the model's maximum length; truncate anything longer
    return tokenizer(example['tweets'], padding='max_length', truncation=True)

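The same label remapping can also be written as a lookup table. The sketch below is an alternative formulation, not the cell used in this notebook; the names `LABEL_TO_ID`, `ID_TO_NAME` and `transform_labels_lookup` are hypothetical.

```python
# Lookup table from the original labels to the class ids used for training
LABEL_TO_ID = {-1: 0, 0: 1, 1: 2}
ID_TO_NAME = {0: "Negative", 1: "Neutral", 2: "Positive"}

def transform_labels_lookup(example):
    # int() turns float labels such as -1.0 (as stored in the CSV) into integer keys
    return {"labels": LABEL_TO_ID[int(example["label"])]}

examples = [{"label": -1.0}, {"label": 0.0}, {"label": 1.0}]
print([transform_labels_lookup(e)["labels"] for e in examples])
```

One design difference: an unexpected label raises a `KeyError` here, whereas the `if`/`elif` version silently falls back to class 0.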


In [74]:
# Change the tweets to tokens that the models can exploit
dataset = dataset.map(tokenize_data, batched=True)

# Transform labels and remove the columns the model does not need
remove_columns = ['label', 'tweets', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)


In [75]:
dataset

DatasetDict({
    train_set: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 7976
    })
    eval_set: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1994
    })
})
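The `input_ids` and `attention_mask` columns come from the tokenizer. As a rough illustration of what `padding='max_length'` produces, here is a small sketch that pads lists of token ids by hand. The function `pad_batch`, the token ids, and the tiny `max_length` are made up for the example; the real tokenizer does considerably more (subword splitting, special tokens).

```python
def pad_batch(token_id_lists, max_length, pad_id=1):
    """Pad each sequence to max_length and build the matching attention mask."""
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        ids = ids[:max_length]                 # cut sequences that are too long
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy "tweets" that are already converted to (made-up) token ids
batch = pad_batch([[0, 713, 2], [0, 100, 200, 300, 2]], max_length=6)
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

Every row ends up with the same length, which is what lets the examples be stacked into a single tensor during training.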

## Dealing with Class Imbalance


From our EDA, we saw that the -1 class (now class 0) is under-represented, so in this section we compensate for that imbalance with class weights.


In [76]:
# Calculate class weights as 1 - (class share), ordered by label (-1, 0, 1)
class_weights= (1-(df["label"].value_counts().sort_index() /len(df))).values
class_weights

array([0.89618857, 0.50992979, 0.59388164])
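Reading the array above backwards, the weights correspond to class shares of roughly 10%, 49% and 41%, so the rarest class receives the largest weight. The sketch below reproduces the same "one minus class share" rule on a toy label list; `inverse_share_weights` is a hypothetical helper and the labels are made up.

```python
from collections import Counter

def inverse_share_weights(labels):
    """Weight of each class = 1 - (share of that class), ordered by label value."""
    counts = Counter(labels)
    total = len(labels)
    return [1 - counts[label] / total for label in sorted(counts)]

# Toy labels: 10% negative, 50% neutral, 40% positive
toy_labels = [-1] * 1 + [0] * 5 + [1] * 4
weights = inverse_share_weights(toy_labels)
print([round(w, 3) for w in weights])
```

Other weighting schemes exist (for example, weights inversely proportional to class frequency); this notebook uses the simpler rule shown here.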

In [77]:
# Configure the training parameters, such as `num_train_epochs`
training_args = TrainingArguments(output_dir="Roberta-Sentiment-Classifier",
                                  learning_rate=1e-05,
                                  num_train_epochs=5,
                                  load_best_model_at_end=True,
                                  evaluation_strategy="steps",
                                  save_strategy="steps",
                                  push_to_hub=True)
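It helps to estimate how many optimization steps these settings imply. The sketch below assumes the `TrainingArguments` defaults that are not set explicitly in this notebook: a per-device batch size of 8, a single device, no gradient accumulation, and evaluation every 500 steps. The function name `total_training_steps` is hypothetical.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Steps per epoch (last partial batch included) and total steps."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs

# 7976 training rows (from the split above), assumed batch size 8, 5 epochs
steps_per_epoch, total_steps = total_training_steps(7976, 8, 5)
print(steps_per_epoch, total_steps)

# Assumed evaluation interval of 500 steps
eval_points = list(range(500, total_steps + 1, 500))
print(eval_points)
```

Under these assumptions the total comes to 4985 steps, which agrees with the `global_step` reported in the training output further down, and the evaluation points match the nine rows of the training log.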

In [78]:
# Load a pretrained model, specifying the number of labels in our dataset for fine-tuning

model = AutoModelForSequenceClassification.from_pretrained(robert, num_labels=3)

In [79]:
#ensure consistent shuffling
train_dataset = dataset['train_set'].shuffle(seed=10)
eval_dataset = dataset['eval_set'].shuffle(seed=10)


In [80]:
##login to hugging face
notebook_login()


In [81]:
# Move the class weights to the GPU as a float tensor
class_weights = torch.from_numpy(class_weights).float().to("cuda")

In [82]:
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
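To show what `compute_metrics` calculates, here is a dependency-free sketch of the same steps: take the arg-max of each row of logits and compare it with the reference label. The helper names and the toy logits are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

toy_logits = [
    [2.0, 0.1, -1.0],   # predicted class 0
    [0.2, 0.3, 0.1],    # predicted class 1
    [-0.5, 0.0, 1.5],   # predicted class 2
    [1.0, 0.9, 0.8],    # predicted class 0, but the label is 2
]
toy_labels = [0, 1, 2, 2]
result = accuracy_from_logits(toy_logits, toy_labels)
print(result)
```

Accuracy treats all classes equally per example, so on an imbalanced dataset it can look good even when the rare class is predicted poorly.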

In [83]:
# Since I am using the class_weights, I create a custom trainer with a weighted loss
class CustomTrainer(Trainer):
    # **kwargs absorbs extra keyword arguments that newer transformers releases may pass
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        # Forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Cross-entropy in which each example counts according to its class weight
        loss_fct = nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_fct(logits, labels)
        return (loss, outputs) if return_outputs else loss
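The effect of the `weight` argument is easier to see on a tiny example. The sketch below computes a class-weighted cross-entropy by hand in plain Python, using the weighted-mean reduction that `nn.CrossEntropyLoss` applies by default (the sum of weighted losses divided by the sum of the weights of the target classes). The function name and the toy logits are assumptions made for illustration.

```python
import math

def weighted_cross_entropy(logits, labels, weights):
    """Class-weighted cross-entropy with a weighted-mean reduction."""
    weighted_sum = 0.0
    weight_total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_sum_exp - row[y]   # negative log-probability of the true class
        weighted_sum += weights[y] * nll
        weight_total += weights[y]
    return weighted_sum / weight_total

toy_logits = [[0.0, 0.0, 0.0],   # uncertain prediction, true class 0 (the rare class)
              [0.0, 5.0, 0.0]]   # confident and correct prediction, true class 1
toy_labels = [0, 1]

unweighted = weighted_cross_entropy(toy_logits, toy_labels, [1.0, 1.0, 1.0])
weighted = weighted_cross_entropy(toy_logits, toy_labels, [0.9, 0.5, 0.6])
print(round(unweighted, 4), round(weighted, 4))
```

With the larger weight on class 0, the poorly predicted rare-class example dominates the loss more than it does in the unweighted case, which is the behavior we want from the custom trainer.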

In [84]:
# Instantiate the trainer
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
## Train the model
trainer.train()


Step,Training Loss,Validation Loss,Accuracy
500,0.8369,0.861764,0.683551
1000,0.7713,0.716128,0.742227
1500,0.6962,0.708452,0.734704
2000,0.6708,0.775595,0.736209
2500,0.5986,0.714073,0.735206
3000,0.5855,0.732368,0.749248
3500,0.5113,0.779237,0.746239
4000,0.5082,0.828813,0.750251
4500,0.4676,0.863876,0.747743


TrainOutput(global_step=4985, training_loss=0.6103642500989296, metrics={'train_runtime': 4183.1222, 'train_samples_per_second': 9.534, 'train_steps_per_second': 1.192, 'total_flos': 1.049296309899264e+16, 'train_loss': 0.6103642500989296, 'epoch': 5.0})
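In the log above, the training loss keeps falling while the validation loss bottoms out early and then rises, a pattern that is consistent with overfitting. The sketch below parses the same table and picks out the step with the lowest validation loss and the step with the highest accuracy. As far as I know, with `load_best_model_at_end=True` and no `metric_for_best_model` set, the Trainer selects the checkpoint by evaluation loss, so treat the "reloaded" remark in the code as an assumption about the defaults.

```python
import csv
import io

# Training log copied from the output above
LOG = """
Step,Training Loss,Validation Loss,Accuracy
500,0.8369,0.861764,0.683551
1000,0.7713,0.716128,0.742227
1500,0.6962,0.708452,0.734704
2000,0.6708,0.775595,0.736209
2500,0.5986,0.714073,0.735206
3000,0.5855,0.732368,0.749248
3500,0.5113,0.779237,0.746239
4000,0.5082,0.828813,0.750251
4500,0.4676,0.863876,0.747743
""".strip()

rows = [
    {
        "Step": int(r["Step"]),
        "Validation Loss": float(r["Validation Loss"]),
        "Accuracy": float(r["Accuracy"]),
    }
    for r in csv.DictReader(io.StringIO(LOG))
]

best_by_loss = min(rows, key=lambda r: r["Validation Loss"])
best_by_acc = max(rows, key=lambda r: r["Accuracy"])

# Assumption: the checkpoint reloaded at the end is the one with the lowest eval loss
print("lowest validation loss:", best_by_loss)
print("highest accuracy:      ", best_by_acc)
```

The two criteria point to different checkpoints here (step 1500 by loss, step 4000 by accuracy), so the choice of selection metric matters for which model ends up being evaluated and pushed.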

In [None]:
# Launch the final evaluation
trainer.evaluate()

In [None]:
## push to hub
trainer.push_to_hub()