# Sentiment Analysis with Hugging Face

Hugging Face is an open-source platform and provider of machine learning technologies. You can install their packages to access pre-built models, then use those models directly or fine-tune them (retrain them on your own dataset, leveraging the knowledge gained during the initial training). You can then host your trained models on the platform so that you can use them later on other devices and in other apps.

Please [go to the website and sign in](https://huggingface.co/) to access all the features of the platform.

[Read more about Text classification with Hugging Face](https://huggingface.co/tasks/text-classification)

Hugging Face models are deep-learning based, so training them requires substantial GPU compute. Use [Colab](https://colab.research.google.com/), another cloud GPU provider, or a local machine with an NVIDIA GPU.
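
Before starting a long training run, it can help to confirm that the machine exposes an NVIDIA GPU at all. The sketch below only checks whether the `nvidia-smi` tool is on the `PATH`, which is an assumption-laden proxy: the helper name `nvidia_tools_present` is hypothetical, and finding the tool does not guarantee that PyTorch can use the GPU.

```python
import shutil


def nvidia_tools_present() -> bool:
    """Return True if the `nvidia-smi` command-line tool is on the PATH."""
    return shutil.which("nvidia-smi") is not None


if nvidia_tools_present():
    print("nvidia-smi found: an NVIDIA driver appears to be installed.")
else:
    print("nvidia-smi not found: training would likely fall back to the CPU.")
```

In Colab, the GPU runtime has to be selected in the runtime settings before this check reports a GPU.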

## Fine-tuning a Hugging Face Text Classification Model

Below is a simple example with just 10 epochs of fine-tuning.

Read more about the fine-tuning concept [here](https://deeplizard.com/learn/video/5T-iXNNiwIs#:~:text=Fine%2Dtuning%20is%20a%20way,perform%20a%20second%20similar%20task.).

In [1]:
%%capture

!pip install transformers
!pip install accelerate -U
!pip install datasets
!pip install huggingface_hub

In [2]:
%%capture

import torch
from sklearn.metrics import f1_score
import pandas as pd
import numpy as np

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.model_selection import train_test_split
from transformers import pipeline
from datasets import load_dataset
import nltk
nltk.download('punkt')
from torch import nn
from transformers import TrainingArguments
from transformers import Trainer
##others
import warnings
warnings.filterwarnings("ignore")
import os
os.environ["WANDB_DISABLED"] = "true"

pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", 50)


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Importing Dataset from Google Drive**

In [4]:
data_path= "/content/drive/MyDrive/Colab Notebooks/NLP/Transformed_copy.csv"

In [5]:
##reading data
data= pd.read_csv(data_path)

In [6]:
data.head()

Unnamed: 0.1,Unnamed: 0,safe_tweet,label,agreement
0,0,amp big homie meanboy stegman st,0.0,1.0
1,1,im thinking devoting career proving autism isn...,1.0,1.0
2,2,vaccines vaccinate child,-1.0,1.0
3,3,mean immunize kid something wont secretly kill...,-1.0,1.0
4,4,thanks catch performing la nuit nyc st ave sho...,0.0,1.0


In [7]:
#Check for null values

data.isna().sum()

Unnamed: 0     0
safe_tweet    29
label          0
agreement      0
dtype: int64

In [8]:
data[data["safe_tweet"].isnull()]

Unnamed: 0.1,Unnamed: 0,safe_tweet,label,agreement
444,444,,0.0,1.0
1523,1523,,0.0,1.0
2155,2155,,0.0,1.0
2515,2515,,0.0,1.0
3062,3062,,0.0,0.666667
3204,3204,,0.0,1.0
3819,3819,,1.0,0.666667
4631,4631,,0.0,1.0
4638,4638,,0.0,1.0
4770,4770,,0.0,1.0


In [9]:
##drop Unnamed column and missing values to facilitate analysis
data= data.drop("Unnamed: 0", axis=1)

data= data.dropna()


In [10]:
data

Unnamed: 0,safe_tweet,label,agreement
0,amp big homie meanboy stegman st,0.0,1.000000
1,im thinking devoting career proving autism isn...,1.0,1.000000
2,vaccines vaccinate child,-1.0,1.000000
3,mean immunize kid something wont secretly kill...,-1.0,1.000000
4,thanks catch performing la nuit nyc st ave sho...,0.0,1.000000
...,...,...,...
9994,living time sperm used waste jenny mccarthy be...,1.0,1.000000
9995,spite measles outbreaks judge mi threatens put...,1.0,0.666667
9996,interesting trends child immunization oklahoma...,0.0,1.000000
9997,cdc says measles highest levels decades return...,0.0,1.000000


In [11]:
#Ensuring there are no null values

data.isna().sum()

safe_tweet    0
label         0
agreement     0
dtype: int64

In [12]:
# Split each tweet into a tuple of whitespace-separated tokens

data['safe_tweet'] = data['safe_tweet'].apply(lambda tweet: tuple(tweet.split()))

**Data Splitting**

In [13]:
train, eval= train_test_split(data, test_size= 0.2, stratify= data["label"])
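
The `stratify` argument keeps the share of each label roughly equal in both splits. The sketch below illustrates that idea in plain Python on toy labels. The function `stratified_split` and the toy data are hypothetical, and this is not how scikit-learn implements it internally.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, eval_idx) so each label keeps roughly the same share."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, eval_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction from every label group.
        n_eval = round(len(indices) * test_size)
        eval_idx.extend(indices[:n_eval])
        train_idx.extend(indices[n_eval:])
    return train_idx, eval_idx


# Toy labels using the same three classes as the tweets (-1, 0, 1).
toy_labels = [-1] * 10 + [0] * 50 + [1] * 40
train_idx, eval_idx = stratified_split(toy_labels)

eval_counts = Counter(toy_labels[i] for i in eval_idx)
print(len(train_idx), len(eval_idx))   # 80 20
print(sorted(eval_counts.items()))     # [(-1, 2), (0, 10), (1, 8)]
```

Each class contributes 20% of its rows to the evaluation split, so the rare `-1` class is not lost by chance.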

In [14]:
train.head()

Unnamed: 0,safe_tweet,label,agreement
3217,"(vaccine, critics, turn, defensive, measles, via)",1.0,1.0
7789,"(sum, vaccines, hn, n, flu, chemicals, no, wan...",-1.0,1.0
984,"(looking, ass, bitches, eazymix)",0.0,1.0
4136,"(market, research, group, weirdos, dawned, pro...",0.0,0.666667
1432,"(getting, twinrix, mmr, booster, apeoplee, wel...",1.0,0.666667


In [15]:
eval.head()

Unnamed: 0,safe_tweet,label,agreement
8988,"(blame, amish, measles)",0.0,0.666667
9951,"(new, years, wish, jenny, mccarthy, measles, m...",1.0,1.0
7673,"(vaccinate, dogs, children, huh, ok, ok, im, d...",1.0,1.0
6190,"(know, millions, nations, children, already, d...",-1.0,0.666667
9910,"(austerity, vaccine, crisis, parasite, pandemi...",0.0,1.0


In [16]:
print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")

new dataframe shapes: train is (7976, 3), eval is (1994, 3)


In [17]:
#saving the train and eval data to csv

train.to_csv("/content/train.csv")
eval.to_csv("/content/eval.csv")
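
One thing worth checking at this step is what happens to the tuples in `safe_tweet` once they are written to CSV. CSV has no tuple type, so each tuple is stored as its text representation, and that is the text read back later. The sketch below shows this with the standard `csv` module. It is an illustration only, on the assumption that pandas `to_csv` stringifies tuples in the same way.

```python
import csv
import io

tokens = tuple("vaccines vaccinate child".split())

# Write one row to an in-memory CSV file, then read it back.
buffer = io.StringIO()
csv.writer(buffer).writerow([tokens, 1.0])
buffer.seek(0)
text_read_back, label_read_back = next(csv.reader(buffer))

print(type(text_read_back).__name__)  # str
print(text_read_back)                 # ('vaccines', 'vaccinate', 'child')
```

If plain text is preferred for tokenization, joining the tokens with `" ".join(tokens)` before saving would avoid the parentheses and quotes.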

**Loading Datasets**

In [18]:
dataset= load_dataset("csv", data_files={"train":"train.csv", "eval":"eval.csv" }, encoding= "ISO-8859-1")

In [19]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'safe_tweet', 'label', 'agreement'],
        num_rows: 7976
    })
    eval: Dataset({
        features: ['Unnamed: 0', 'safe_tweet', 'label', 'agreement'],
        num_rows: 1994
    })
})

**Tokenization**

In [20]:
#create an instance for tokenizer
tokenizer= AutoTokenizer.from_pretrained("roberta-base")


## Preprocessing Data

In [21]:
## changing labels to 0, 1, 2 from the initial labels -1, 0, 1
def transform_labels(example):
  label = example["label"]
  num = 0  # default, also used for any unexpected label value

  if label == -1:
    num = 0
  elif label == 0:
    num = 1
  elif label == 1:
    num = 2
  return {"labels": num}

def tokenize(example):
  # pad every tweet to the tokenizer's maximum length and truncate longer ones
  return tokenizer(example["safe_tweet"], padding="max_length", truncation=True)


In [22]:
## Converting tweets to tokens for the model to work with
dataset= dataset.map(tokenize, batched= True)
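
The `padding="max_length"` and `truncation=True` options make every tokenized tweet the same length. The sketch below mimics that behaviour in plain Python. The helper `pad_or_truncate` and the token ids are made up, `pad_id=1` is assumed to match RoBERTa's padding token, and the real tokenizer also handles special tokens during truncation, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=1):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 0 marks padding positions
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([0, 100, 200, 2], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=8)

print(short_ids)   # [0, 100, 200, 2, 1, 1, 1, 1]
print(short_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
print(long_ids)    # [0, 1, 2, 3, 4, 5, 6, 7]
```

The attention mask is what lets the model ignore the padded positions.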

In [23]:
## eliminating features that are not needed for the analysis
remove_columns= ['Unnamed: 0', 'safe_tweet', 'label', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)

In [24]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 7976
    })
    eval: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1994
    })
})
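
The `labels` column holds class indices 0, 1 and 2 rather than the original -1, 0 and 1, because the classification loss expects indices starting at 0. A minimal sketch of the same remapping with a lookup table follows. The names `LABEL_TO_ID`, `ID_TO_LABEL` and `to_class_id` are hypothetical, and the reverse table is included only because it is handy when reading predictions.

```python
# Original sentiment label -> class index used for training.
LABEL_TO_ID = {-1: 0, 0: 1, 1: 2}
# Class index -> original sentiment label, for reading predictions.
ID_TO_LABEL = {class_id: label for label, class_id in LABEL_TO_ID.items()}


def to_class_id(label):
    """Map a label such as -1.0, 0.0 or 1.0 to its class index."""
    return LABEL_TO_ID[int(label)]


raw_labels = [-1.0, 0.0, 1.0, 0.0]
class_ids = [to_class_id(x) for x in raw_labels]

print(class_ids)                              # [0, 1, 2, 1]
print([ID_TO_LABEL[i] for i in class_ids])    # [-1, 0, 1, 0]
```

Unlike the `if`/`elif` version, this lookup raises a `KeyError` on an unexpected label instead of silently mapping it to class 0.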

**Modelling**

In [25]:
#loading  model and creating an instance for the classes
model= AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels= 3)

In [26]:
Roberta = 'roberta-base'

# use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = AutoModelForSequenceClassification.from_pretrained(Roberta, num_labels=3).to(device)

In [27]:
model_name = f"{Roberta}-Roberta-Model"

In [28]:
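#Setting batch size (this variable is not passed to TrainingArguments below, so the Trainer's default batch size is what is actually used)
batch_size= 16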


In [29]:

training_args = TrainingArguments(
    output_dir=model_name,
    num_train_epochs=10,
    load_best_model_at_end=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    push_to_hub=True,
)

In [30]:
##setting a shuffle seed to avoid randomization at each rerun
train_dataset= dataset['train'].shuffle(seed=10)
eval_dataset= dataset['eval'].shuffle(seed=10)

In [31]:
#Connecting to huggingface

from huggingface_hub import notebook_login

notebook_login()

In [43]:
# using f1-score as a metric score because of imbalance in the dataset

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  f1 = f1_score(labels, preds, average="weighted")
  return {"f1": f1}
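
To make the `average="weighted"` setting concrete, the sketch below computes the same kind of score by hand: one F1 per class, averaged with weights equal to the number of true examples in each class. The function `weighted_f1` and the toy predictions are illustrative only. In the notebook itself, scikit-learn's `f1_score` does this work.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += f1 * n_true
    return total / len(y_true)


y_true = [0, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 2, 2, 2]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))  # 0.8333
```

Here the per-class F1 values are 1.0, 0.8 and 0.8 with supports 1, 3 and 2, which gives 5/6.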

In [33]:
#trainer= Trainer(
    #model= model,
      #args= training_args,
      #train_dataset= train_dataset,
      #eval_dataset= eval_dataset,
      #tokenizer= tokenizer,
      #compute_metrics=compute_metrics

#)

In [34]:
#training the model

#trainer.train()

In [35]:

#trainer.evaluate()

In [36]:
#  ##pushing the trained model to Hugging Face

#trainer.push_to_hub()

In [44]:
# creating class weights
class_weights= (1-(data["label"].value_counts().sort_index() /len(data))).values
class_weights

array([0.89618857, 0.50992979, 0.59388164])
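
Each weight is one minus the share of that class in the data, so rarer classes get larger weights. The printed values imply class shares of roughly 10%, 49% and 41% for the labels -1, 0 and 1. A plain-Python sketch of the same formula on toy labels follows. The function `inverse_share_weights` and the toy data are hypothetical.

```python
from collections import Counter


def inverse_share_weights(labels):
    """Weight for each class = 1 - (class count / total count)."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: 1 - counts[cls] / n for cls in sorted(counts)}


# Toy data: 10% negative, 50% neutral, 40% positive.
toy_labels = [-1] * 1 + [0] * 5 + [1] * 4
weights = inverse_share_weights(toy_labels)

for cls, weight in weights.items():
    print(cls, round(weight, 2))
```

Sorting the classes mirrors `sort_index()` in the notebook cell, so the weights line up with class indices 0, 1 and 2.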

In [45]:
#moving class weights to the selected device (GPU if available)
class_weights= torch.from_numpy(class_weights).float().to(device)

In [46]:
##creating a custom Trainer subclass that applies the class weights in the loss
class WeightedLossTrainer(Trainer):

    # **kwargs absorbs extra arguments that some transformers versions pass to compute_loss
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits.float()
        loss_func = nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_func(logits, labels)
        return (loss, outputs) if return_outputs else loss
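
To show what the class weights change in the loss, here is a plain-Python sketch of weighted cross-entropy as a weighted mean of per-example losses, which is my understanding of how `nn.CrossEntropyLoss` behaves with `weight` set and the default mean reduction. The function `weighted_cross_entropy` and the toy logits are illustrative. The weights are copied from the array printed earlier.

```python
import math


def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of per-example cross-entropy losses."""
    weighted_sum = 0.0
    weight_total = 0.0
    for row, label in zip(logits, labels):
        # Numerically stable log of the softmax normaliser.
        m = max(row)
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        nll = log_norm - row[label]        # negative log-likelihood of the true class
        w = class_weights[label]
        weighted_sum += w * nll
        weight_total += w
    return weighted_sum / weight_total


class_weights = [0.89618857, 0.50992979, 0.59388164]
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
labels = [0, 1]

weighted = weighted_cross_entropy(logits, labels, class_weights)
unweighted = weighted_cross_entropy(logits, labels, [1.0, 1.0, 1.0])
print(weighted < unweighted)  # True: the low-loss example of the rare class counts for more
```

With equal weights the function reduces to the ordinary mean cross-entropy.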

In [40]:
#creating an instance for the training arguments with weights
#training_args = TrainingArguments(
   #output_dir= model_name,
   #num_train_epochs=10, load_best_model_at_end=True, weight_decay=0.01, evaluation_strategy="steps",save_strategy="steps",push_to_hub=True

#)

In [41]:
#train_dataset= dataset['train'].shuffle(seed=10)
#eval_dataset= dataset['eval'].shuffle(seed=10)



In [47]:
##creating an instance for the custom trainer
trainer = WeightedLossTrainer(
      model= model,
      args= training_args,
      train_dataset= train_dataset,
      eval_dataset= eval_dataset,
      tokenizer= tokenizer,
      compute_metrics=compute_metrics )

In [48]:
#Training the model with the class weights
trainer.train()

Step,Training Loss,Validation Loss,F1
500,0.916,0.883464,0.621844
1000,0.8783,0.846743,0.653129
1500,0.8769,0.858104,0.648669
2000,0.8499,0.865102,0.648757
2500,0.8734,0.890788,0.640852
3000,0.8597,0.892319,0.640937
3500,0.8987,0.899914,0.621522
4000,0.879,0.921874,0.622007
4500,0.8892,0.893602,0.622007
5000,0.8926,0.891447,0.622598



TrainOutput(global_step=9970, training_loss=0.8828675265297846, metrics={'train_runtime': 8772.8663, 'train_samples_per_second': 9.092, 'train_steps_per_second': 1.136, 'total_flos': 2.098592619798528e+16, 'train_loss': 0.8828675265297846, 'epoch': 10.0})
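
The `global_step=9970` figure can be checked with a little arithmetic. The sketch below assumes a single device and a per-device batch size of 8, which I believe is the `TrainingArguments` default when no batch size is passed. The train row count and epoch count come from the earlier cells.

```python
import math

train_rows = 7976        # rows in the train split
epochs = 10              # num_train_epochs
assumed_batch_size = 8   # assumption: default per-device batch size, one device

steps_per_epoch = math.ceil(train_rows / assumed_batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 997 9970

# For comparison, a batch size of 16 would give about half as many steps.
steps_with_16 = math.ceil(train_rows / 16) * epochs
print(steps_with_16)  # 4990
```

The match with 9970 is consistent with the `batch_size = 16` variable not being used by the training arguments.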

In [49]:

trainer.evaluate()




{'eval_loss': 0.8449752330780029,
 'eval_f1': 0.6468322092015373,
 'eval_runtime': 59.209,
 'eval_samples_per_second': 33.677,
 'eval_steps_per_second': 4.222,
 'epoch': 10.0}
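
Because `load_best_model_at_end=True` was set, the model evaluated here should be the best saved checkpoint rather than the last one. As far as I know, checkpoints are compared by validation loss when no other metric is configured. The sketch below picks the best row from the logged steps. The rows are copied from the training table, which only shows steps up to 5000, so later checkpoints are not covered.

```python
# (step, validation_loss, f1) rows copied from the training log.
log = [
    (500, 0.883464, 0.621844),
    (1000, 0.846743, 0.653129),
    (1500, 0.858104, 0.648669),
    (2000, 0.865102, 0.648757),
    (2500, 0.890788, 0.640852),
    (3000, 0.892319, 0.640937),
    (3500, 0.899914, 0.621522),
    (4000, 0.921874, 0.622007),
    (4500, 0.893602, 0.622007),
    (5000, 0.891447, 0.622598),
]

best_by_loss = min(log, key=lambda row: row[1])
best_by_f1 = max(log, key=lambda row: row[2])

print(best_by_loss)  # (1000, 0.846743, 0.653129)
print(best_by_f1)    # (1000, 0.846743, 0.653129)
```

Both criteria point to step 1000 among the logged rows. Validation loss rises after that point, which suggests that most of the 10 epochs did not improve the model.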

In [50]:
#Pushing the model to the hub
trainer.push_to_hub()


'https://huggingface.co/Enyonam/roberta-base-Roberta-Model/tree/main/'