# Sentiment Analysis with Hugging Face

Hugging Face is an open-source platform and provider of machine learning technologies. You can install its packages to access pre-built models, use them directly or fine-tune them (retrain them on your own dataset, leveraging the knowledge acquired during the original training), and then host your trained models on the platform so that you can use them later on other devices and in other apps.

Please [go to the website and sign in](https://huggingface.co/) to access all the features of the platform.

[Read more about Text classification with Hugging Face](https://huggingface.co/tasks/text-classification)

Hugging Face models are deep-learning based, so training them requires a lot of GPU compute. Please use [Colab](https://colab.research.google.com/), another cloud GPU provider, or a local machine with an NVIDIA GPU.
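Before starting a long training run, it can help to confirm that a GPU is actually visible. The snippet below is a minimal sketch: `pick_device` is a hypothetical helper name, and the import is wrapped in `try`/`except` only so the sketch also runs where PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # optional here so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Training would run on: {device}")
```

On Colab, a result of `cpu` usually means the runtime type has not been switched to a GPU runtime yet.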

## Application: Fine-tuning a Hugging Face Text Classification Model

In [1]:
%%capture

!pip install transformers
!pip install accelerate -U
!pip install datasets
!pip install huggingface_hub

In [2]:
%%capture

import torch
from sklearn.metrics import f1_score
import pandas as pd
import numpy as np

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.model_selection import train_test_split
from transformers import pipeline
from datasets import load_dataset
import nltk
nltk.download('punkt')
from torch import nn
from transformers import TrainingArguments
from transformers import Trainer
##others
import warnings
warnings.filterwarnings("ignore")
import os
os.environ["WANDB_DISABLED"] = "true"

pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", 50)


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Importing Dataset from Google Drive**

In [4]:
data_path= "/content/drive/MyDrive/Colab Notebooks/NLP/clean_copy.csv"

In [5]:
##reading data
data= pd.read_csv(data_path)

In [6]:
data.head()

,Unnamed: 0,clean_tweet,label,agreement
0,0,amp big homie meanboy stegman st,0.0,1.0
1,1,im thinking devoting career proving autism isn...,1.0,1.0
2,2,vaccines vaccinate child,-1.0,1.0
3,3,mean immunize kid something wont secretly kill...,-1.0,1.0
4,4,thanks catch performing la nuit nyc st ave sho...,0.0,1.0


In [7]:
#Check for null values

data.isna().sum()

Unnamed: 0      0
clean_tweet    29
label           0
agreement       0
dtype: int64

In [8]:
data[data["clean_tweet"].isnull()]

,Unnamed: 0,clean_tweet,label,agreement
444,444,,0.0,1.0
1523,1523,,0.0,1.0
2155,2155,,0.0,1.0
2515,2515,,0.0,1.0
3062,3062,,0.0,0.666667
3204,3204,,0.0,1.0
3819,3819,,1.0,0.666667
4631,4631,,0.0,1.0
4638,4638,,0.0,1.0
4770,4770,,0.0,1.0


In [9]:
## drop the redundant "Unnamed: 0" index column and the rows with missing tweets
data= data.drop("Unnamed: 0", axis=1)

data= data.dropna()


In [10]:
data

,clean_tweet,label,agreement
0,amp big homie meanboy stegman st,0.0,1.000000
1,im thinking devoting career proving autism isn...,1.0,1.000000
2,vaccines vaccinate child,-1.0,1.000000
3,mean immunize kid something wont secretly kill...,-1.0,1.000000
4,thanks catch performing la nuit nyc st ave sho...,0.0,1.000000
...,...,...,...
9994,living time sperm used waste jenny mccarthy be...,1.0,1.000000
9995,spite measles outbreaks judge mi threatens put...,1.0,0.666667
9996,interesting trends child immunization oklahoma...,0.0,1.000000
9997,cdc says measles highest levels decades return...,0.0,1.000000


In [11]:
#Ensuring there are no null values

data.isna().sum()

clean_tweet    0
label          0
agreement      0
dtype: int64

In [12]:
# Split each tweet into a tuple of words (when saved with to_csv, each tuple is written as its string representation)

data['clean_tweet'] = data['clean_tweet'].apply(lambda tweet: tuple(tweet.split(),))

**Data Splitting**

In [13]:
train, eval= train_test_split(data, test_size= 0.2, stratify= data["label"])
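To illustrate what `stratify=data["label"]` asks for, here is a pure-Python sketch of a stratified split. It is not how scikit-learn implements it; `stratified_split` and the toy label column are hypothetical and only show that each class keeps roughly the same share in both subsets.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with similar label proportions in both."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # per-class share of the test set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


labels = [-1] * 10 + [0] * 50 + [1] * 40  # toy, imbalanced label column
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

With a plain random split, a rare class such as `-1` could end up under-represented in the evaluation subset, which would make the evaluation scores less reliable.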

In [14]:
train.head()

,clean_tweet,label,agreement
4991,"(saw, interview, wdan, barton, heres, list, in...",0.0,1.0
7788,"(love, measles, go)",0.0,1.0
6856,"(dumb, shit, dont, vaccinate, kids)",1.0,1.0
7459,"(better, kids, vaccinated, dont, catch, hands)",1.0,1.0
3684,"(parents, question, vaccines, kids, almost, al...",-1.0,1.0


Below is a simple example with just 3 epochs of fine-tuning.

Read more about the fine-tuning concept [here](https://deeplizard.com/learn/video/5T-iXNNiwIs#:~:text=Fine%2Dtuning%20is%20a%20way,perform%20a%20second%20similar%20task.).

In [15]:
eval.head()

,clean_tweet,label,agreement
278,"(one, countries, children, missed, vaccination...",1.0,1.0
1071,"(elb, protect, children, misguided, parents, m...",1.0,1.0
716,"(peace, march, end, drone, killings, yes, poli...",1.0,0.666667
5358,"(mmr, n, da, building, band, name, ex, first, ...",0.0,1.0
2832,"(sister, parents, hoped, catch, never, caught,...",0.0,1.0


In [16]:
#saving the train and eval data to csv

train.to_csv("/content/train.csv")
eval.to_csv("/content/eval.csv")
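One thing worth knowing about this step: a CSV file stores text only, so a tuple of words is written as its string representation and comes back as a plain string. The sketch below uses the standard-library `csv` module as a stand-in for `to_csv` (an assumption made so the example runs without pandas); the row is a toy example.

```python
import ast
import csv
import io

# Toy row shaped like the notebook's data: the tweet is a tuple of words.
row = {"clean_tweet": ("love", "measles", "go"), "label": 0.0}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["clean_tweet", "label"])
writer.writeheader()
writer.writerow(row)  # non-string values are written via str()

buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded["clean_tweet"]).__name__, loaded["clean_tweet"])

# If plain text is wanted again, the tuple can be parsed and re-joined.
recovered = ast.literal_eval(loaded["clean_tweet"])
joined = " ".join(recovered)
print(joined)
```

This means the text later passed to the tokenizer includes the parentheses, quotes, and commas of the tuple representation unless it is converted back to a plain sentence first.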

**Loading Datasets**

In [17]:
dataset= load_dataset("csv", data_files={"train":"train.csv", "eval":"eval.csv" }, encoding= "ISO-8859-1")


In [18]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'clean_tweet', 'label', 'agreement'],
        num_rows: 7976
    })
    eval: Dataset({
        features: ['Unnamed: 0', 'clean_tweet', 'label', 'agreement'],
        num_rows: 1994
    })
})

**Tokenization**

In [19]:
#create an instance for tokenizer
tokenizer= AutoTokenizer.from_pretrained("roberta-base")



In [20]:
## remapping the original labels -1, 0, 1 to the class ids 0, 1, 2
LABEL_TO_ID = {-1: 0, 0: 1, 1: 2}

def transform_labels(example):
  # Any other value falls back to class 0
  return {"labels": LABEL_TO_ID.get(example["label"], 0)}

def tokenize(example):
  # Pad or truncate every tweet to the tokenizer's maximum length; `map` stores
  # plain lists, so requesting PyTorch tensors here is unnecessary
  return tokenizer(example["clean_tweet"], padding="max_length", truncation=True)
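To make `padding="max_length"` and `truncation=True` more concrete, here is a toy sketch of what they do to a list of token ids. It is not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the ids are made up, and `pad_id=1` is an assumption chosen to mirror RoBERTa's padding id. A real tokenizer also keeps the closing special token when truncating, which this sketch does not.

```python
def pad_and_truncate(token_ids, max_length, pad_id=1):
    """Cut the sequence to max_length, then right-pad it up to max_length."""
    ids = list(token_ids[:max_length])
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


short = pad_and_truncate([0, 657, 2], max_length=6)
long = pad_and_truncate([0, 11, 12, 13, 14, 15, 16, 2], max_length=6)
print(short)
print(long)
```

Every example ends up with the same length, which is what lets the `Trainer` stack them into batches. The cost is that short tweets carry a lot of padding.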


In [21]:
## Converting tweets to tokens for the model to work with
dataset= dataset.map(tokenize, batched= True)


In [22]:
## eliminating features that are not needed for the analysis
remove_columns= ['Unnamed: 0', 'clean_tweet', 'label', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)


In [23]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 7976
    })
    eval: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1994
    })
})

**Training**

In [39]:
# Configure the training parameters, such as `num_train_epochs`, the number of passes the training loop makes over the dataset
training_args = TrainingArguments("test_trainer",
                                  num_train_epochs=3,
                                  load_best_model_at_end=True,
                                  save_strategy='epoch',
                                  evaluation_strategy='epoch',  # renamed to `eval_strategy` in newer transformers releases
                                  logging_strategy='epoch',
                                  logging_steps=100,  # only used when logging_strategy='steps'
                                  per_device_train_batch_size=16,
                                  )

**Modelling**

In [40]:
# Load the pretrained model with a classification head for the 3 classes
model= AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels= 3)

In [41]:
## fixing the shuffle seed so the row order is the same on every rerun
train_dataset= dataset['train'].shuffle(seed=10)
eval_dataset= dataset['eval'].shuffle(seed=10)
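The effect of a fixed seed can be shown with Python's own `random` module, used here as a stand-in for `Dataset.shuffle` (an assumption for illustration; `shuffled` is a hypothetical helper).

```python
import random


def shuffled(items, seed):
    """Return a shuffled copy of `items` using a dedicated, seeded generator."""
    rng = random.Random(seed)
    copy = list(items)
    rng.shuffle(copy)
    return copy


rows = list(range(10))
first_run = shuffled(rows, seed=10)
second_run = shuffled(rows, seed=10)
print(first_run == second_run)  # True: same seed, same order
```

Reproducible ordering makes it easier to compare two training runs, since differences in the scores are then less likely to come from the data order alone.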

In [42]:
from huggingface_hub import notebook_login  ## connecting to my Hugging Face profile

notebook_login()


In [43]:
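# `load_metric` was removed in datasets 3.0; on newer versions use
# `evaluate.load("accuracy")` from the separate `evaluate` package instead
from datasets import load_metric
metric = load_metric("accuracy")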

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
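What `compute_metrics` calculates can be reproduced by hand: take the index of the largest logit in each row as the predicted class, then count how often it matches the label. The sketch below uses plain Python and made-up logits; `argmax` and `accuracy` are hypothetical helper names.

```python
def argmax(scores):
    """Index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(logits, labels):
    """Share of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


toy_logits = [
    [2.1, 0.3, -1.0],   # predicts class 0
    [0.1, 1.7, 0.2],    # predicts class 1
    [-0.5, 0.4, 1.9],   # predicts class 2
    [0.9, 0.8, -0.2],   # predicts class 0
]
toy_labels = [0, 1, 2, 1]
toy_accuracy = accuracy(toy_logits, toy_labels)
print(toy_accuracy)  # 0.75
```

Accuracy treats all classes equally per example, so on an imbalanced label distribution it can look good even when a rare class is mostly misclassified.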

In [44]:
# Build the Trainer from the model, the training arguments, the datasets, and the metric function
trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [45]:
#training my model

trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.7816,0.727101,0.684554
2,0.7013,0.702724,0.704112
3,0.6358,0.689734,0.723671


TrainOutput(global_step=1497, training_loss=0.706270430672543, metrics={'train_runtime': 2310.4453, 'train_samples_per_second': 10.356, 'train_steps_per_second': 0.648, 'total_flos': 6295777859395584.0, 'train_loss': 0.706270430672543, 'epoch': 3.0})
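The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch assumes a single device and no gradient accumulation (the defaults, as far as this notebook shows); the values are copied from the outputs above.

```python
import math

train_rows = 7976        # num_rows of the train split
batch_size = 16          # per_device_train_batch_size
epochs = 3               # num_train_epochs
runtime_s = 2310.4453    # train_runtime reported in TrainOutput

steps_per_epoch = math.ceil(train_rows / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
print(round(total_steps / runtime_s, 3))          # steps per second
print(round(train_rows * epochs / runtime_s, 3))  # samples per second
```

The results (499 steps per epoch, 1497 steps in total, about 0.648 steps and 10.356 samples per second) agree with `global_step`, `train_steps_per_second`, and `train_samples_per_second` in the output.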

In [46]:
# Launch the final evaluation
trainer.evaluate()

{'eval_loss': 0.6897336840629578,
 'eval_accuracy': 0.7236710130391174,
 'eval_runtime': 56.8716,
 'eval_samples_per_second': 35.061,
 'eval_steps_per_second': 4.396,
 'epoch': 3.0}
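The evaluation report can be read the same way. The sketch below assumes the default evaluation batch size of 8, since `per_device_eval_batch_size` is not set in the training arguments; the other values are copied from the output.

```python
import math

eval_rows = 1994                     # num_rows of the eval split
eval_accuracy = 0.7236710130391174   # eval_accuracy from the report
eval_runtime_s = 56.8716             # eval_runtime from the report
eval_batch_size = 8                  # assumption: default per_device_eval_batch_size

correct = round(eval_accuracy * eval_rows)          # correctly classified tweets
eval_steps = math.ceil(eval_rows / eval_batch_size)

print(correct, eval_rows - correct)
print(eval_steps, round(eval_steps / eval_runtime_s, 3))
print(round(eval_rows / eval_runtime_s, 3))
```

Under that assumption, the accuracy corresponds to 1443 of 1994 tweets classified correctly, and the 250 evaluation steps match the reported `eval_steps_per_second` of about 4.396.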

In [None]:
## pushing my trained model together with the results to the Hugging Face Hub

trainer.push_to_hub()

I manually split the data into a training subset (the data the model learns from) and an evaluation subset (the data used to compute metric scores, which helps detect training problems such as [overfitting](https://www.ibm.com/cloud/learn/overfitting)).

There are multiple ways to split the dataset; the stratified `train_test_split` call used above is only one of them.
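A simple way to use the evaluation subset against overfitting is to compare the two loss curves epoch by epoch. The sketch below applies one common rule of thumb (validation loss rising while training loss keeps falling) to the losses logged during training; `first_overfit_epoch` is a hypothetical helper, not part of the `Trainer` API.

```python
# (epoch, training loss, validation loss) copied from the training log
history = [
    (1, 0.7816, 0.727101),
    (2, 0.7013, 0.702724),
    (3, 0.6358, 0.689734),
]


def first_overfit_epoch(history):
    """Return the first epoch where validation loss rises while training loss falls."""
    for (_, prev_train, prev_val), (epoch, train, val) in zip(history, history[1:]):
        if train < prev_train and val > prev_val:
            return epoch
    return None


best_epoch = min(history, key=lambda row: row[2])[0]  # lowest validation loss
print(first_overfit_epoch(history))  # None
print(best_epoch)                    # 3
```

In this run both losses are still decreasing after 3 epochs, so the rule does not fire, and the checkpoint with the lowest validation loss is the last one.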