# Sentiment Analysis with Hugging Face

In the "Data Cleaning and EDA Notebook," we cleaned and explored a natural language processing (NLP) dataset. In this notebook we move on to the modeling phase: our objective is to fine-tune two pretrained models so that they are better suited to our sentiment classification task.
These two models are:
- DistilBERT base uncased
- RoBERTa base

Both models are available on the Hugging Face Hub.

The dataset used here has already been cleaned.

## Installing and Importing Libraries


In [89]:
# ## Install Libraries
# %%capture
# ! pip install transformers
# ! pip install accelerate -U
# ! pip install --upgrade tensorflow
# ! pip install datasets
# ! pip install huggingface_hub




In [90]:
%%capture
## Load libraries (the %%capture cell magic has to be the first line of the cell)

## Data handling
import os
import numpy as np
import pandas as pd

## Visualizations
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

## NLP
import torch
import transformers
from scipy.special import softmax
from torch import nn
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments

## Modelling
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from datasets import load_dataset
# load_metric is deprecated in newer `datasets` releases in favour of the `evaluate` library
from datasets import load_metric

## Disable W&B logging
os.environ["WANDB_DISABLED"] = "true"

## Google Drive access and Hugging Face login
from google.colab import drive
from huggingface_hub import notebook_login


In [91]:
# # Allow access to google drive
# drive.mount('/content/drive')

# Fine-tuning a Hugging Face Text Classification Model

## Importing dataset from my Google Drive



In [92]:
# Path to the cleaned dataset on Google Drive
data_path="/content/drive/MyDrive/Sentiment  Analysis/Sentiment-Analysis-master/zindi_challenge/data/clean_data.csv"

In [93]:
# Load the dataset
df = pd.read_csv(data_path)


In [94]:
## View dataset
df.head()

| Unnamed: 0.1 | Unnamed: 0 | tweets | label | agreement |
|---|---|---|---|---|
| 0 | 0 | amp big homie meanboy stegman st | 0.0 | 1.0 |
| 1 | 1 | im thinking devoting career proving autism isn... | 1.0 | 1.0 |
| 2 | 2 | vaccine vaccinate child | -1.0 | 1.0 |
| 3 | 3 | mean immunize kid something wont secretly kill... | -1.0 | 1.0 |
| 4 | 4 | thanks catch performing la nuit nyc st ave sho... | 0.0 | 1.0 |


In [95]:
## Remove any rows with missing values and the unnamed index column
df = df.dropna()
df = df.drop("Unnamed: 0", axis=1)

## Data Splitting

I manually split the data into a training subset (the data the model learns from) and an evaluation subset (the data used to compute metric scores, which helps us detect training problems such as overfitting). The split is stratified on the label, so both subsets keep the same class proportions.



In [96]:
# Split the train data => {train, eval}
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])
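To see what `stratify` does, here is a minimal sketch on a toy label array. The 10/50/40 class mix and the names `toy_labels`, `toy_ids` and `class_shares` are made up for illustration; they are not our real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical): 10 negative, 50 neutral, 40 positive
toy_labels = np.array([-1] * 10 + [0] * 50 + [1] * 40)
toy_ids = np.arange(len(toy_labels))

# Same arguments as in the cell above, applied to the toy data
train_ids, eval_ids, train_y, eval_y = train_test_split(
    toy_ids, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

def class_shares(y):
    """Return the share of each class, ordered -1, 0, 1."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

print("full :", class_shares(toy_labels))
print("train:", class_shares(train_y))
print("eval :", class_shares(eval_y))
```

All three lines should show the same class shares. Without `stratify`, a rare class can end up over- or under-represented in the evaluation subset purely by chance.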

In [97]:
#view train
train.head()

| Unnamed: 0 | tweets | label | agreement |
|---|---|---|---|
| 8627 | vaccine safety side effect kid found | -1.0 | 0.333333 |
| 6394 | dude gotten vaccinated swag virus known kill y... | 1.0 | 1.0 |
| 8636 | vaccine horror medical mutilation child expose... | -1.0 | 1.0 |
| 323 | mighty mmr music money record | 0.0 | 1.0 |
| 3254 | average people complain live longer releasing ... | 0.0 | 1.0 |


In [98]:
#view eval
eval.head()

| Unnamed: 0 | tweets | label | agreement |
|---|---|---|---|
| 8474 | bet asked cool bandaids group health anderson | 1.0 | 0.666667 |
| 5486 | also vaccination contain hg mercury non chemis... | -1.0 | 0.666667 |
| 9863 | amen rt good thing parent dont vaccinate kid c... | 1.0 | 1.0 |
| 6977 | never understand weird state mind would cause ... | 1.0 | 1.0 |
| 7866 | manditory cootie vaccination protect kid | 1.0 | 0.666667 |


In [99]:
print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")

In [100]:
# Save the split subsets
train.to_csv("/content/drive/MyDrive/Sentiment  Analysis/Sentiment-Analysis-master/zindi_challenge/data/train_set.csv",index=False)
eval.to_csv("/content/drive/MyDrive/Sentiment  Analysis/Sentiment-Analysis-master/zindi_challenge/data/eval_set.csv", index=False)

## Loading Datasets

In [101]:
dataset= load_dataset("csv", data_files= { "train_set":"/content/drive/MyDrive/Sentiment  Analysis/Sentiment-Analysis-master/zindi_challenge/data/train_set.csv", "eval_set":"/content/drive/MyDrive/Sentiment  Analysis/Sentiment-Analysis-master/zindi_challenge/data/eval_set.csv"})



## Tokenization

In [102]:
## Name of the pretrained checkpoint to fine-tune
distil= "distilbert-base-uncased"

In [103]:
## Load the tokenizer that matches the checkpoint
tokenizer = AutoTokenizer.from_pretrained(distil)

In [104]:
# Our labels are -1, 0 and 1; we map them to 0, 1 and 2 respectively

def transform_labels(label):

    label = label['label']
    num = 0
    if label == -1: #'Negative'
        num = 0
    elif label == 0: #'Neutral'
        num = 1
    elif label == 1: #'Positive'
        num = 2

    return {'labels': num}

def tokenize_data(example):
    return tokenizer(example['tweets'], padding='max_length')
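The same label mapping can also be written as a lookup table. This is only a sketch of an alternative, not what the notebook runs; `LABEL_TO_ID`, `ID_TO_NAME` and `to_class_id` are names I made up for illustration.

```python
# Same mapping as transform_labels, written as a lookup table
LABEL_TO_ID = {-1: 0, 0: 1, 1: 2}
ID_TO_NAME = {0: "negative", 1: "neutral", 2: "positive"}

def to_class_id(raw_label):
    """Map a raw label (-1, 0 or 1, also when stored as a float) to 0, 1 or 2."""
    # A float such as -1.0 finds the integer key -1, because the two compare equal
    return LABEL_TO_ID[raw_label]

for raw in (-1.0, 0.0, 1.0):
    class_id = to_class_id(raw)
    print(raw, "->", class_id, ID_TO_NAME[class_id])
```

One difference worth knowing: `transform_labels` starts from `num = 0`, so an unexpected label value would silently become class 0, whereas the lookup table raises a `KeyError`.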



In [105]:
# Convert the tweets into tokens that the model can use
dataset = dataset.map(tokenize_data, batched=True)

# Transform the labels and remove the columns we no longer need
remove_columns = ['label', 'tweets', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)
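To make the `input_ids` and `attention_mask` columns less abstract, here is a toy sketch of what `padding='max_length'` does conceptually. The token ids, the pad id of 0, the tiny `max_length` of 8 and the helper `pad_to_max_length` are all assumptions for illustration; the real tokenizer uses its own vocabulary and a much longer maximum length.

```python
def pad_to_max_length(token_ids, max_length, pad_id=0):
    """Pad a list of token ids up to max_length and build the matching attention mask."""
    # Longer sequences are left as-is, mirroring a tokenizer call without truncation
    n_pad = max(0, max_length - len(token_ids))
    return {
        "input_ids": token_ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(token_ids) + [0] * n_pad,
    }

toy_tweet = [101, 2001, 2002, 2003, 102]  # made-up ids for a short tweet
encoded = pad_to_max_length(toy_tweet, max_length=8)
print(encoded["input_ids"])       # [101, 2001, 2002, 2003, 102, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask tells the model which positions are real tokens (1) and which are padding (0). Padding every tweet to the same length keeps batching simple, at the cost of extra computation on the padding positions.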


In [106]:
dataset

DatasetDict({
    train_set: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 7976
    })
    eval_set: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1994
    })
})

## Dealing with Class Imbalance


From our EDA, we saw that the classes are imbalanced: the -1 class (now our 0 class) is heavily underrepresented. We deal with that in this section by computing class weights for the loss function.


In [107]:
# Calculate class weights as 1 - class frequency, ordered by label (-1, 0, 1)
class_weights = (1 - (df["label"].value_counts().sort_index() / len(df))).values
class_weights

array([0.89618857, 0.50992979, 0.59388164])
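The weights above are simply one minus each class's share of the data, so the rarest class gets the largest weight. Here is a minimal sketch on toy labels (the 1/5/4 class mix is made up and only roughly resembles ours), next to inverse-frequency weighting, a common alternative that this notebook does not use.

```python
import numpy as np

# Toy labels (hypothetical): 1 negative, 5 neutral, 4 positive
toy_labels = np.array([-1, 0, 0, 0, 0, 0, 1, 1, 1, 1])

classes, counts = np.unique(toy_labels, return_counts=True)  # sorted: -1, 0, 1
frequencies = counts / counts.sum()

# Scheme used in this notebook: weight = 1 - class frequency
one_minus_freq = 1 - frequencies

# Common alternative (not used here): n_samples / (n_classes * class count)
inverse_freq = counts.sum() / (len(classes) * counts)

print("1 - frequency    :", one_minus_freq)
print("inverse frequency:", inverse_freq)
```

With `1 - frequency`, the rare class is weighted less than twice as much as the most common one, so it is a fairly gentle correction. Inverse-frequency weights push much harder towards the rare class, which may or may not help depending on how noisy that class is.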

In [108]:
# Configure the training parameters, such as `num_train_epochs`
training_args = TrainingArguments(output_dir="Alvins-Finetuned-distilbert-model",
                                  learning_rate=1e-05,
                                  num_train_epochs=5,
                                  load_best_model_at_end=True,
                                  evaluation_strategy="steps",
                                  save_strategy="steps",
                                  push_to_hub=True)

In [109]:
# Load a pretrained model, specifying the number of labels in our dataset for fine-tuning

model = AutoModelForSequenceClassification.from_pretrained(distil, num_labels=3)

In [110]:
# Shuffle with a fixed seed so that the order is reproducible
train_dataset = dataset['train_set'].shuffle(seed=10)
eval_dataset = dataset['eval_set'].shuffle(seed=10)


In [111]:
# ##login to hugging face
# notebook_login()

In [112]:
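# Move the class weights to the GPU (fall back to the CPU if no GPU is available)
device = "cuda" if torch.cuda.is_available() else "cpu"
class_weights = torch.from_numpy(class_weights).float().to(device)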

In [113]:
#Define accuracy metric
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
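`compute_metrics` receives a pair of arrays, the logits and the true labels, and turns the logits into predicted classes by taking the index of the largest value in each row. A minimal sketch with made-up logits, computing accuracy by hand instead of through the loaded metric:

```python
import numpy as np

# Toy logits for 4 examples over the 3 classes (values are made up)
toy_logits = np.array([
    [2.0, 0.1, -1.0],
    [0.2, 1.5, 0.3],
    [-0.5, 0.4, 2.2],
    [0.1, 0.9, 0.8],
])
toy_labels = np.array([0, 1, 2, 2])

predictions = np.argmax(toy_logits, axis=-1)  # the highest logit in each row wins
accuracy = float((predictions == toy_labels).mean())

print(predictions)  # [0 1 2 1]
print(accuracy)     # 0.75
```

Keep in mind that accuracy on an imbalanced dataset can look reasonable even when the rare class is predicted poorly, so a per-class or F1-style metric may be worth adding.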

In [114]:
# Since I am leveraging the class weights, I create a custom trainer with a weighted loss
class CustomTrainer(Trainer):
    # **kwargs absorbs extra arguments that newer versions of Trainer may pass to compute_loss
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # cross-entropy loss with one weight per class
        loss_fct = nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_fct(logits, labels)
        return (loss, outputs) if return_outputs else loss
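To show what the class weights do inside the loss, here is a NumPy sketch of weighted cross-entropy: each example's loss is multiplied by the weight of its true class, and the sum is divided by the sum of those weights, which is how `nn.CrossEntropyLoss` behaves with a `weight` and its default mean reduction. The function name, the toy logits and the rounded weights are mine, for illustration only.

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of the per-example cross-entropy losses."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_example = -log_probs[np.arange(len(labels)), labels]
    weights = class_weights[labels]  # weight of each example's true class
    return float((weights * per_example).sum() / weights.sum())

# Two toy examples, both predicted as class 1; the first one really belongs to class 0
toy_logits = np.array([[0.0, 2.0, 0.0],
                       [0.0, 2.0, 0.0]])
toy_labels = np.array([0, 1])

plain = weighted_cross_entropy(toy_logits, toy_labels, np.ones(3))
weighted = weighted_cross_entropy(toy_logits, toy_labels, np.array([0.896, 0.510, 0.594]))
print(plain, weighted)
```

The weighted loss comes out higher than the plain one, because the mistake on the rare class 0 counts for more than the correct prediction on class 1.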

In [115]:
# Instantiate the trainer
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [116]:
##training my model
trainer.train()


| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 500 | 0.8392 | 0.832646 | 0.668506 |
| 1000 | 0.7705 | 0.742988 | 0.721665 |
| 1500 | 0.6891 | 0.746447 | 0.726179 |
| 2000 | 0.6475 | 0.755016 | 0.728686 |
| 2500 | 0.5801 | 0.756416 | 0.71013 |
| 3000 | 0.5634 | 0.779159 | 0.721163 |
| 3500 | 0.476 | 0.837475 | 0.729689 |
| 4000 | 0.4728 | 0.907385 | 0.725677 |
| 4500 | 0.428 | 0.935508 | 0.728185 |


TrainOutput(global_step=4985, training_loss=0.5893504151370126, metrics={'train_runtime': 2286.3931, 'train_samples_per_second': 17.442, 'train_steps_per_second': 2.18, 'total_flos': 5282894069637120.0, 'train_loss': 0.5893504151370126, 'epoch': 5.0})
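The `global_step=4985` in the output can be checked with a little arithmetic. The sketch below assumes the default per-device batch size of 8 and a single device, since neither is set in `TrainingArguments` above; the row count comes from the `DatasetDict` shown earlier.

```python
import math

train_rows = 7976  # rows in train_set
batch_size = 8     # assumption: default per-device train batch size, one device
epochs = 5         # num_train_epochs

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 997 4985
```

That matches the reported step count. With the evaluation running every 500 steps, which I believe is the default interval when none is given, we get the nine evaluation rows from step 500 to step 4500.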

In [117]:
# Launch the final evaluation
trainer.evaluate()

{'eval_loss': 0.7429875731468201,
 'eval_accuracy': 0.7216649949849548,
 'eval_runtime': 33.9592,
 'eval_samples_per_second': 58.718,
 'eval_steps_per_second': 7.362,
 'epoch': 5.0}
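The final `eval_loss` and `eval_accuracy` match the step 1000 row of the training log rather than the last one. That is consistent with `load_best_model_at_end=True` restoring the checkpoint with the lowest validation loss, assuming the default selection metric (the loss) since we did not set another one. A quick check on the logged values:

```python
# (step, validation loss, accuracy) copied from the training log above
eval_log = [
    (500, 0.832646, 0.668506), (1000, 0.742988, 0.721665), (1500, 0.746447, 0.726179),
    (2000, 0.755016, 0.728686), (2500, 0.756416, 0.71013), (3000, 0.779159, 0.721163),
    (3500, 0.837475, 0.729689), (4000, 0.907385, 0.725677), (4500, 0.935508, 0.728185),
]

best_by_loss = min(eval_log, key=lambda row: row[1])
best_by_accuracy = max(eval_log, key=lambda row: row[2])
print(best_by_loss[0], best_by_accuracy[0])  # 1000 3500
```

After step 1000 the training loss keeps falling while the validation loss rises, which points to overfitting. Note that selecting by accuracy instead would pick step 3500, although the gain is under one percentage point.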

In [118]:
## push to hub
trainer.push_to_hub()

'https://huggingface.co/VINAL/Alvins-Finetuned-distilbert-model/tree/main/'