# Sentiment Analysis with Hugging Face

Hugging Face is an open-source platform and provider of machine learning technologies. You can install their packages to access pre-built models, then either use them directly or fine-tune them (retrain them on your own dataset, leveraging the knowledge acquired during the original training). You can then host your trained models on the platform so that you can use them later on other devices and in other apps.

Please go to the website and sign in to access all the features of the platform.

Read more about Text classification with Hugging Face

Hugging Face models are deep-learning based, so training them requires substantial GPU compute. Please use Colab, another cloud GPU provider, or a local machine with an NVIDIA GPU.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In the previous EDA/Clean Data notebook, the NLP dataset was explored and cleaned. Here, models are built from that data: two Hugging Face models, RoBERTa and DistilBERT, will be fine-tuned.

# Installations





In [2]:
%%capture
!pip install transformers
!pip install accelerate -U
!pip install datasets
!pip install huggingface_hub



### Importing Dependencies

In [3]:
## for handling the paths of my datasets
import os
from google.colab import drive

## for data handling:
import pandas as pd
import numpy as np

## modelling:
import torch
from torch import nn
from scipy.special import softmax
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from transformers import pipeline
from datasets import load_dataset
import nltk
nltk.download('punkt')

## others
import warnings
warnings.filterwarnings("ignore")
os.environ["WANDB_DISABLED"] = "true"
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**Importing dataset from Google Drive**

In [4]:
data_path=  "/content/drive/MyDrive/Colab Notebooks/NLP/Transformed_copy.csv"

In [5]:
# read data
data= pd.read_csv(data_path)


In [6]:
data.head()

,Unnamed: 0,safe_tweet,label,agreement
0,0,amp big homie meanboy stegman st,0.0,1.0
1,1,im thinking devoting career proving autism isn...,1.0,1.0
2,2,vaccines vaccinate child,-1.0,1.0
3,3,mean immunize kid something wont secretly kill...,-1.0,1.0
4,4,thanks catch performing la nuit nyc st ave sho...,0.0,1.0


In [7]:
#Check for null values

data.isna().sum()

Unnamed: 0     0
safe_tweet    29
label          0
agreement      0
dtype: int64

In [8]:
data[data["safe_tweet"].isnull()]

,Unnamed: 0,safe_tweet,label,agreement
444,444,,0.0,1.0
1523,1523,,0.0,1.0
2155,2155,,0.0,1.0
2515,2515,,0.0,1.0
3062,3062,,0.0,0.666667
3204,3204,,0.0,1.0
3819,3819,,1.0,0.666667
4631,4631,,0.0,1.0
4638,4638,,0.0,1.0
4770,4770,,0.0,1.0


In [9]:
##drop Unnamed column and missing values to facilitate analysis
data= data.dropna()
data= data.drop("Unnamed: 0", axis=1)

In [10]:
data

,safe_tweet,label,agreement
0,amp big homie meanboy stegman st,0.0,1.0
1,im thinking devoting career proving autism isn...,1.0,1.0
2,vaccines vaccinate child,-1.0,1.0
3,mean immunize kid something wont secretly kill...,-1.0,1.0
4,thanks catch performing la nuit nyc st ave sho...,0.0,1.0
5,nearly year old study mental health studies va...,1.0,0.666667
6,study kids finds link mmr vaccine autism,1.0,0.666667
7,psa vaccinate fucking kids,1.0,1.0
8,coughing extra shuttle everyone thinks measles,1.0,0.666667
9,aids vaccine created oregon health amp science...,1.0,0.666667


In [11]:
#Ensuring there are no null values

data.isna().sum()

safe_tweet    0
label         0
agreement     0
dtype: int64

In [12]:
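# Convert each tweet into a tuple of its whitespace-separated tokens.
# Note: when the data is later saved with to_csv, each tuple is written as its string representation.

data['safe_tweet'] = data['safe_tweet'].apply(lambda tweet: tuple(tweet.split()))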




In [13]:
data.head()

,safe_tweet,label,agreement
0,"(amp, big, homie, meanboy, stegman, st)",0.0,1.0
1,"(im, thinking, devoting, career, proving, auti...",1.0,1.0
2,"(vaccines, vaccinate, child)",-1.0,1.0
3,"(mean, immunize, kid, something, wont, secretl...",-1.0,1.0
4,"(thanks, catch, performing, la, nuit, nyc, st,...",0.0,1.0


**Data Splitting**


In [14]:
train, eval= train_test_split(data, test_size= 0.2, stratify= data["label"], random_state= 42)
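
The `stratify` argument keeps the label proportions the same in both splits. The pure-Python sketch below illustrates the idea on made-up labels; `stratified_split` and the toy label counts are hypothetical, and this is not scikit-learn's actual implementation.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, eval_idx) so that each class keeps its share in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, eval_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                      # shuffle within the class only
        n_eval = round(len(indices) * test_size)  # per-class evaluation quota
        eval_idx.extend(indices[:n_eval])
        train_idx.extend(indices[n_eval:])
    return train_idx, eval_idx

# Toy, deliberately imbalanced labels (not the tweet data)
toy_labels = [-1] * 10 + [0] * 50 + [1] * 40
train_idx, eval_idx = stratified_split(toy_labels)
eval_counts = Counter(toy_labels[i] for i in eval_idx)
print(len(train_idx), len(eval_idx), dict(eval_counts))
```

Each class contributes 20% of its rows to the evaluation part, so the rare class is represented in both splits instead of being left to chance.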

In [15]:
train.head()

,safe_tweet,label,agreement
8627,"(vaccine, safety, side, effects, kids, found)",-1.0,0.333333
6394,"(dude, gotten, vaccinated, swag, virus, known,...",1.0,1.0
8636,"(vaccine, horrors, medical, mutilation, childr...",-1.0,1.0
323,"(mighty, mmr, music, money, records)",0.0,1.0
3254,"(average, people, complain, live, longer, rele...",0.0,1.0


In [16]:
eval.head()

,safe_tweet,label,agreement
8474,"(bet, asked, cool, bandaids, group, health, an...",1.0,0.666667
5486,"(vaccination, contain, hg, mercury, non, chemi...",-1.0,0.666667
9863,"(amen, rt, good, thing, parents, dont, vaccina...",1.0,1.0
6977,"(never, understand, weird, state, mind, cause,...",1.0,1.0
7866,"(manditory, cootie, vaccinations, protect, kids)",1.0,0.666667


In [17]:
print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")

new dataframe shapes: train is (7976, 3), eval is (1994, 3)


In [18]:
#saving the train and eval data to csv
train.to_csv("/content/train.csv")
eval.to_csv("/content/eval.csv")
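
Because `safe_tweet` holds tuples at this point, it is worth checking what a CSV round trip does to them. The sketch below uses the standard-library `csv` module as a stand-in, on the assumption that pandas' `to_csv` stringifies object cells in the same way; the single-row example is hypothetical, with the tweet taken from the table above.

```python
import csv
import io

# Hypothetical one-row example (tweet tokens, label)
tweet = ("vaccines", "vaccinate", "child")

buffer = io.StringIO()
csv.writer(buffer).writerow([tweet, -1.0])  # non-string cells are written via str()

buffer.seek(0)
text_cell, label_cell = next(csv.reader(buffer))

print(repr(text_cell))   # the cell comes back as one string, not as a tuple
print(type(text_cell))
```

If the same holds for the files written here, the tokenizer later receives text that still contains the parentheses, quotes and commas. Joining the tokens back with `" ".join(tweet)` before saving would be one way to avoid that.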

**Load the Dataset**

In [19]:
dataset= load_dataset( "csv", data_files= { "train":"train.csv", "eval":"eval.csv"}                     )

In [20]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'safe_tweet', 'label', 'agreement'],
        num_rows: 7976
    })
    eval: Dataset({
        features: ['Unnamed: 0', 'safe_tweet', 'label', 'agreement'],
        num_rows: 1994
    })
})

**Tokenization**

In [21]:
distil= "distilbert-base-uncased"

In [22]:
#create an instance for tokenizer
distil_tokenizer= AutoTokenizer.from_pretrained(distil)

In [23]:
data["safe_tweet"].head()

0              (amp, big, homie, meanboy, stegman, st)
1    (im, thinking, devoting, career, proving, auti...
2                         (vaccines, vaccinate, child)
3    (mean, immunize, kid, something, wont, secretl...
4    (thanks, catch, performing, la, nuit, nyc, st,...
Name: safe_tweet, dtype: object

## Preprocessing Data

In [24]:
## Map the original labels -1, 0, 1 to the class ids 0, 1, 2

def transform_labels(example):
  label = example["label"]
  num = 0  # default class id, also returned if the label matches none of the cases below

  if label == -1:
    num = 0
  elif label == 0:
    num = 1
  elif label == 1:
    num = 2
  return {"label": num}

def distil_tokenize(example):
  # Pad every tweet to the model's maximum length and truncate longer ones
  return distil_tokenizer(example["safe_tweet"], padding="max_length", truncation=True)
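
The label remapping can also be written as a lookup table. The sketch below is an alternative formulation, not the notebook's method; `LABEL_TO_ID` and `transform_labels_strict` are hypothetical names. Unlike the if/elif chain, it raises an error on an unexpected label instead of silently assigning class 0.

```python
# Hypothetical lookup table: original label -> class id
LABEL_TO_ID = {-1: 0, 0: 1, 1: 2}

def transform_labels_strict(example):
    """Same mapping as transform_labels, but raises KeyError on an unknown label."""
    return {"label": LABEL_TO_ID[example["label"]]}

# The CSV stores labels as floats; -1.0 hashes equal to -1, so the lookup still works
for raw in (-1.0, 0.0, 1.0):
    print(raw, transform_labels_strict({"label": raw}))
```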


In [25]:
## Converting tweets to tokens for the model to work with and eliminating features that are not needed for the analysis

dataset= dataset.map(distil_tokenize, batched= True)
remove_columns= ['Unnamed: 0', 'safe_tweet', 'label', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)

In [26]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 7976
    })
    eval: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 1994
    })
})

**Handling Class Imbalance**

The EDA showed that the -1 class (now class 0) is under-represented. To compensate, each class is given a weight equal to one minus its share of the dataset, so that rarer classes count more in the loss.



In [27]:


class_weights= (1-(data["label"].value_counts().sort_index() /len(data))).values
class_weights


array([0.89618857, 0.50992979, 0.59388164])
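
The weighting rule can be checked by hand. The sketch below recomputes the weights in plain Python; the class counts are back-computed from the printed weights and the 9,970 rows (7,976 train plus 1,994 eval), so they are an inference rather than figures shown in the notebook, and `inverse_share_weights` is a hypothetical helper.

```python
from collections import Counter

# Class counts inferred from the printed weights (assumption, see lead-in)
counts = {-1: 1035, 0: 4886, 1: 4049}
labels = [label for label, n in counts.items() for _ in range(n)]

def inverse_share_weights(labels):
    """Weight of each class = 1 - (class count / total), ordered by label value."""
    tally = Counter(labels)
    total = len(labels)
    return [1 - tally[label] / total for label in sorted(tally)]

weights = inverse_share_weights(labels)
print([round(w, 8) for w in weights])
```

The rarest class (the original -1) ends up with the largest weight, close to 0.9, while the two larger classes receive weights between 0.5 and 0.6.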

In [28]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [29]:
## move the class weights to the selected device (the GPU when one is available)


class_weights= torch.from_numpy(class_weights).float().to(device)

In [30]:
## The model and Trainer read the targets from a column named 'labels', so the 'label' column is renamed accordingly

dataset= dataset.rename_column("label","labels")

In [31]:
model= AutoModelForSequenceClassification.from_pretrained(distil, num_labels= 3)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### 1. Fine-tuning DistilBERT with Class Weights


In [32]:
##creating an instance for the model
model= AutoModelForSequenceClassification.from_pretrained(distil, num_labels= 3)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [33]:
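## Custom Trainer subclass that applies the class weights in the loss


class WeightedLossTrainer(Trainer):
    # **kwargs absorbs extra arguments (e.g. num_items_in_batch) that newer transformers versions pass
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Take the labels out so they are not passed to the model's forward call
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits.float()
        labels = labels.long()
        # Cross-entropy in which each sample is weighted by the weight of its true class
        loss_func = nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_func(logits, labels)
        return (loss, outputs) if return_outputs else loss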
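
With its default mean reduction, `nn.CrossEntropyLoss(weight=...)` computes a weighted average: each sample's negative log-likelihood is multiplied by the weight of its true class, and the sum is divided by the sum of those weights. The pure-Python sketch below reproduces that formula on made-up logits; `sample_nll`, `weighted_cross_entropy` and the toy logits are hypothetical, and only the class weights come from the notebook.

```python
import math

def sample_nll(row, target):
    """Negative log-likelihood of the target class under softmax(row)."""
    m = max(row)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return log_z - row[target]

def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of per-sample losses, normalised by the sum of the weights used."""
    total, weight_sum = 0.0, 0.0
    for row, target in zip(logits, labels):
        total += class_weights[target] * sample_nll(row, target)
        weight_sum += class_weights[target]
    return total / weight_sum

weights = [0.89618857, 0.50992979, 0.59388164]   # printed earlier in the notebook
toy_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]  # made-up model outputs
toy_labels = [0, 2]

print(weighted_cross_entropy(toy_logits, toy_labels, weights))
print(weighted_cross_entropy(toy_logits, toy_labels, [1.0, 1.0, 1.0]))  # plain mean
```

Mistakes on the rare class therefore pull the averaged loss harder than mistakes on the common classes, which is the intended effect of the custom trainer.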

In [34]:
#f1-score will be used because there is class imbalance in the dataset

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  f1 = f1_score(labels, preds, average="weighted")
  return {"f1": f1}
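
For reference, `average="weighted"` means that a per-class F1 score is computed and the scores are averaged with each class weighted by its number of true samples. The sketch below spells this out in plain Python on made-up predictions; `weighted_f1` is a hypothetical helper intended as an illustration, not a replacement for `f1_score`.

```python
def weighted_f1(labels, preds):
    """Support-weighted mean of the per-class F1 scores."""
    total = len(labels)
    score = 0.0
    for c in sorted(set(labels)):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        support = tp + fn                  # number of true samples of class c
        score += (support / total) * f1
    return score

# Made-up labels and predictions for three classes
toy_labels = [0, 0, 1, 1, 2, 2]
toy_preds = [0, 1, 1, 1, 2, 0]
print(weighted_f1(toy_labels, toy_preds))
```

On this toy input the per-class F1 scores are 0.5, 0.8 and 2/3, each with a support of two samples, so the weighted score is their simple mean.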

In [35]:
model_name = f"{distil}-Distilbert-Model"


In [36]:
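# Batch size of 16. Note that this variable is not passed to TrainingArguments,
# so training runs with the Trainer's default per-device batch size of 8.
batch_size= 16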

In [37]:
#creating an instance for the training arguments

training_args = TrainingArguments(
    output_dir=model_name,
    num_train_epochs=10,
    load_best_model_at_end=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    push_to_hub=True,
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [38]:
## fix the shuffle seed so the row order is the same on every rerun
train_dataset= dataset['train'].shuffle(seed=10)
eval_dataset= dataset['eval'].shuffle(seed=10)

In [39]:
#making a connection to huggingface
from huggingface_hub import notebook_login

notebook_login()


In [41]:
#creating the trainer from the model, training arguments, datasets, tokenizer and metric function

trainer = WeightedLossTrainer(
      model= model,
      args= training_args,
      train_dataset= train_dataset,
      eval_dataset= eval_dataset,
      tokenizer= distil_tokenizer,
      compute_metrics=compute_metrics )

In [42]:
##training the model

trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss,F1
500,0.8395,0.796509,0.652037
1000,0.791,0.741044,0.680063
1500,0.7135,0.745159,0.692972
2000,0.6608,0.760969,0.713357
2500,0.532,0.850335,0.709773
3000,0.5301,0.812362,0.722058
3500,0.3763,1.044133,0.705323
4000,0.4136,1.267941,0.700143
4500,0.2813,1.576616,0.69985
5000,0.2986,1.531767,0.696309



TrainOutput(global_step=9970, training_loss=0.33809904138685587, metrics={'train_runtime': 4350.6287, 'train_samples_per_second': 18.333, 'train_steps_per_second': 2.292, 'total_flos': 1.056578813927424e+16, 'train_loss': 0.33809904138685587, 'epoch': 10.0})
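
The reported `global_step` gives a quick check on the batch size that was actually used. The arithmetic below assumes a single device and no gradient accumulation (neither is set in the notebook); `total_steps` is a hypothetical helper.

```python
import math

train_rows = 7976       # rows in the train split, from the notebook output
epochs = 10             # num_train_epochs
reported_steps = 9970   # global_step in TrainOutput

def total_steps(rows, batch_size, epochs):
    """Optimizer steps, assuming one device and no gradient accumulation."""
    return math.ceil(rows / batch_size) * epochs

for candidate in (8, 16):
    steps = total_steps(train_rows, candidate, epochs)
    print(candidate, steps, steps == reported_steps)
```

Under these assumptions only a batch size of 8 reproduces the 9,970 steps. That also agrees with the ratio of `train_samples_per_second` to `train_steps_per_second` (18.333 / 2.292, roughly 8).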

In [43]:
trainer.evaluate()

{'eval_loss': 0.7410444021224976,
 'eval_f1': 0.6800629363246677,
 'eval_runtime': 31.4232,
 'eval_samples_per_second': 63.456,
 'eval_steps_per_second': 7.956,
 'epoch': 10.0}
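
The evaluation reports a loss of 0.7410 and an F1 of 0.6801, which match the step-1000 row of the training log rather than the row with the highest F1. That is what one would expect if `load_best_model_at_end=True` selected the checkpoint by validation loss, which is the documented default when `metric_for_best_model` is not set. The sketch below re-reads only the rows displayed in the notebook (up to step 5000) and compares the two selection criteria; the variable names are hypothetical.

```python
# (step, training loss, validation loss, F1) rows copied from the training log
log_rows = [
    (500, 0.8395, 0.796509, 0.652037),
    (1000, 0.7910, 0.741044, 0.680063),
    (1500, 0.7135, 0.745159, 0.692972),
    (2000, 0.6608, 0.760969, 0.713357),
    (2500, 0.5320, 0.850335, 0.709773),
    (3000, 0.5301, 0.812362, 0.722058),
    (3500, 0.3763, 1.044133, 0.705323),
    (4000, 0.4136, 1.267941, 0.700143),
    (4500, 0.2813, 1.576616, 0.699850),
    (5000, 0.2986, 1.531767, 0.696309),
]

best_by_loss = min(log_rows, key=lambda row: row[2])  # lowest validation loss
best_by_f1 = max(log_rows, key=lambda row: row[3])    # highest F1

print("best by validation loss:", best_by_loss[0], best_by_loss[2], best_by_loss[3])
print("best by F1:", best_by_f1[0], best_by_f1[2], best_by_f1[3])
```

Among the displayed rows, validation loss starts rising after step 1000 while training loss keeps falling, which suggests overfitting in later epochs. If F1 is the metric that matters, passing `metric_for_best_model="f1"` to `TrainingArguments` would be one option to consider.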

In [44]:
trainer.push_to_hub()

'https://huggingface.co/Enyonam/distilbert-base-uncased-Distilbert-Model/tree/main/'