# Sentiment Analysis with Hugging Face

Hugging Face is an open-source platform and provider of machine learning technologies. You can install their packages to access pre-built models, use them directly or fine-tune them (retrain them on your own dataset, leveraging the knowledge gained during the initial training), and then host your trained models on the platform so that you can use them later on other devices and apps.

Please [go to the website and sign in](https://huggingface.co/) to access all the features of the platform.

[Read more about Text classification with Hugging Face](https://huggingface.co/tasks/text-classification)

Hugging Face models are deep-learning based, so training them requires a lot of GPU compute. Please use [Colab](https://colab.research.google.com/), another cloud GPU provider, or a local machine with an NVIDIA GPU.

## Application of Hugging Face Text Classification Model Fine-tuning

Below is a simple example with just `5 epochs of fine-tuning`.

Read more about the fine-tuning concept [here](https://deeplizard.com/learn/video/5T-iXNNiwIs#:~:text=Fine%2Dtuning%20is%20a%20way,perform%20a%20second%20similar%20task.).

In [4]:
# !pip install datasets
# !pip install transformers
# !pip install accelerate -U
# !pip install "accelerate>=0.20.1"  # quoted so the shell does not treat `>` as a redirect
# !pip install transformers[torch]

In [3]:
import os
import pandas as pd
from datasets import load_dataset
from sklearn.model_selection import train_test_split
import accelerate
import sys
from transformers import AutoTokenizer
from transformers import TrainingArguments
from transformers import AutoModelForSequenceClassification
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from transformers import Trainer
import numpy as np
from sklearn.metrics import f1_score
from datasets import load_metric
from sklearn.metrics import mean_squared_error

In [5]:
# Disable W&B
os.environ["WANDB_DISABLED"] = "true"

In [6]:
# Load the dataset and display some values
df = pd.read_csv('/content/Asset/Train.csv')

In [7]:
df.head()

| | tweet_id | safe_text | label | agreement |
|---|---|---|---|---|
| 0 | CL1KWCMY | Me &amp; The Big Homie meanboy3000 #MEANBOY #M... | 0.0 | 1.0 |
| 1 | E3303EME | I'm 100% thinking of devoting my career to pro... | 1.0 | 1.0 |
| 2 | M4IVFSMS | #whatcausesautism VACCINES, DO NOT VACCINATE Y... | -1.0 | 1.0 |
| 3 | 1DR6ROZ4 | I mean if they immunize my kid with something ... | -1.0 | 1.0 |
| 4 | J77ENIIE | Thanks to <user> Catch me performing at La Nui... | 0.0 | 1.0 |


In [8]:
df.shape

(10001, 4)

In [9]:
# `info` is a method: without the parentheses, only the bound method object is displayed
df.info()

In [10]:
# `describe` is a method: without the parentheses, only the bound method object is displayed
df.describe()

In [11]:
df.isna().sum()

tweet_id     0
safe_text    0
label        1
agreement    2
dtype: int64

In [12]:
# A way to eliminate rows containing NaN values
df = df[~df.isna().any(axis=1)]

In [13]:
df.isna().sum()

tweet_id     0
safe_text    0
label        0
agreement    0
dtype: int64

I manually split the training set into a training subset (the data the model will learn from) and an evaluation subset (the data the model will use to compute metric scores, which helps us avoid training problems such as [overfitting](https://www.ibm.com/cloud/learn/overfitting)).

There are multiple ways to split the dataset. The cell below uses scikit-learn's `train_test_split`, stratified on the `label` column so that both subsets keep the same class proportions.
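
To make the effect of stratification concrete, here is a minimal pure-Python sketch. It is not scikit-learn's actual implementation; `stratified_split` and the toy labels are hypothetical names and data introduced only for illustration. Each class is shuffled and split separately, so the evaluation subset ends up with the same class mix as the full set.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=40):
    """Return (train_idx, eval_idx), splitting every class in the same ratio."""
    rng = random.Random(seed)

    # Group row indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, eval_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_eval = round(len(indices) * test_size)
        eval_idx.extend(indices[:n_eval])
        train_idx.extend(indices[n_eval:])
    return sorted(train_idx), sorted(eval_idx)


# Toy labels (made up): 10 negative, 50 neutral, 40 positive
toy_labels = [-1.0] * 10 + [0.0] * 50 + [1.0] * 40
train_idx, eval_idx = stratified_split(toy_labels)
print(len(train_idx), len(eval_idx))
```

With these toy labels the evaluation subset holds 20% of each class, which is the property `stratify=df['label']` asks `train_test_split` to preserve.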

In [14]:
# Split the train data => {train, eval}
train, eval = train_test_split(df, test_size=0.2, random_state=40, stratify=df['label'])

In [15]:
train.head()

| | tweet_id | safe_text | label | agreement |
|---|---|---|---|---|
| 3389 | HK43EYLX | Children receive vaccinations before school: I... | 0.0 | 0.666667 |
| 8635 | 3BWLQK25 | Rabid cat's capture leads to vaccination remin... | 1.0 | 0.666667 |
| 4406 | 9KKVJ51T | BREAKING: Two suspected measles cases being mo... | 0.0 | 1.0 |
| 8624 | VOF67NGH | Mickey, Minnie and measles for nine Disneyland... | 0.0 | 0.666667 |
| 3454 | 9B1WHXX5 | Finally rt <user> California cracks down on va... | 1.0 | 0.666667 |


In [16]:
eval.head()

| | tweet_id | safe_text | label | agreement |
|---|---|---|---|---|
| 9276 | MU0AZP0A | Parenting checklist: don't vaccinate kid. Take... | 0.0 | 0.666667 |
| 8805 | QHAQXRI9 | Thoughts on World #Autism Awareness Day from a... | 0.0 | 1.0 |
| 511 | FFJQ2Q4N | <user> So the expenditure of money is comparab... | -1.0 | 0.333333 |
| 3290 | 39EE91VG | “<user> More than 500 parents arrested for not... | 1.0 | 1.0 |
| 6588 | PAKFM4QV | City of Milwaukee offers measles immunization ... | 1.0 | 0.666667 |


In [17]:
print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")

# Check for duplicates in 'train' dataset
train_duplicates = train[train.duplicated()]
print(f"Number of duplicates in 'train': {len(train_duplicates)}")

# Check for NaN values in 'train' dataset
train_nan = train.isna().sum()
print("NaN counts in 'train':")
print(train_nan)

# Check for duplicates in 'eval' dataset
eval_duplicates = eval[eval.duplicated()]
print(f"Number of duplicates in 'eval': {len(eval_duplicates)}")

# Check for NaN values in 'eval' dataset
eval_nan = eval.isna().sum()
print("NaN counts in 'eval':")
print(eval_nan)

new dataframe shapes: train is (7999, 4), eval is (2000, 4)
Number of duplicates in 'train': 0
NaN counts in 'train':
tweet_id     0
safe_text    0
label        0
agreement    0
dtype: int64
Number of duplicates in 'eval': 0
NaN counts in 'eval':
tweet_id     0
safe_text    0
label        0
agreement    0
dtype: int64


In [18]:
# Save the split subsets
train.to_csv("../content/Asset/train_subset.csv", index=False)
eval.to_csv("../content/Asset/eval_subset.csv", index=False)

In [19]:
dataset = load_dataset(
    'csv',
    data_files={
        'train': '../content/Asset/train_subset.csv',
        'eval': '../content/Asset/eval_subset.csv',
    },
    encoding="ISO-8859-1",
)


In [20]:
# Load the RoBERTa tokenizer
tokenizer = AutoTokenizer.from_pretrained('roberta-base')

In [21]:
def transform_labels(label):

    label = label['label']
    num = 0
    if label == -1: #'Negative'
        num = 0
    elif label == 0: #'Neutral'
        num = 1
    elif label == 1: #'Positive'
        num = 2

    return {'labels': num}

def tokenize_data(example):
    return tokenizer(example['safe_text'], padding='max_length')

# Change the tweets to tokens that the models can exploit
dataset = dataset.map(tokenize_data, batched=True)

# Transform labels and remove the unused columns
remove_columns = ['tweet_id', 'label', 'safe_text', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)
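
The label conversion in `transform_labels` can also be written as a dictionary lookup. The sketch below is an illustrative alternative, not the notebook's own code: `LABEL_TO_ID`, `ID_TO_NAME` and `transform_labels_lookup` are hypothetical names. One behavioural difference to be aware of: the original function silently maps any unexpected label to class 0, whereas the lookup raises a `KeyError`.

```python
# Dataset label -> model class id, following the if/elif chain in `transform_labels`
LABEL_TO_ID = {-1: 0, 0: 1, 1: 2}

# Class id -> readable name, taken from the comments in `transform_labels`
ID_TO_NAME = {0: "Negative", 1: "Neutral", 2: "Positive"}


def transform_labels_lookup(example):
    """Dictionary-based variant of `transform_labels` (hypothetical helper)."""
    # Float labels such as -1.0 hash and compare equal to the int keys
    return {"labels": LABEL_TO_ID[example["label"]]}


row = {"tweet_id": "M4IVFSMS", "label": -1.0}
class_id = transform_labels_lookup(row)["labels"]
print(class_id, ID_TO_NAME[class_id])
```

The inverse mapping is handy later on, when the model's predicted class ids need to be turned back into readable sentiment names.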

In [22]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 7999
    })
    eval: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})
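
The `input_ids` and `attention_mask` features come from the tokenizer call with `padding='max_length'`. As a rough illustration of what that padding does, here is a toy sketch. The token ids are made up, `pad_to_max_length` is a hypothetical helper, and the pad id of 1 is an assumption about `roberta-base`; the real tokenizer pads to the model's maximum length rather than to 8.

```python
def pad_to_max_length(token_ids, max_length, pad_id=1):
    """Right-pad a list of token ids and build the matching attention mask."""
    if len(token_ids) > max_length:
        raise ValueError("sequence longer than max_length; truncate it first")
    n_pad = max_length - len(token_ids)
    return {
        # Real tokens first, then padding tokens up to the fixed length
        "input_ids": token_ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(token_ids) + [0] * n_pad,
    }


# Made-up ids: 0 and 2 stand in for the start and end tokens
encoded = pad_to_max_length([0, 713, 16, 2], max_length=8)
print(encoded)
```

Because every example is padded to the same length, short tweets carry many padding positions; the attention mask is what tells the model to skip them.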

In [23]:
# dataset['train']

In [24]:
# Configure the training parameters, such as `num_train_epochs`:
# the number of times the model will repeat the training loop over the dataset
# training_args = TrainingArguments("test_trainer", num_train_epochs=2, load_best_model_at_end=True,)

training_args = TrainingArguments(
    "test_trainer",
    num_train_epochs=5,
    load_best_model_at_end=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=1000,
    save_steps=1000,
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
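
It helps to work out how many optimisation steps these arguments imply. The arithmetic below is a sketch that assumes the default per-device batch size of 8 and a single device, since neither is set in the notebook.

```python
import math

num_train_examples = 7999  # rows in the train split
num_train_epochs = 5       # from TrainingArguments
eval_steps = 1000          # from TrainingArguments
batch_size = 8             # assumption: default per-device batch size, one device

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
evaluations = total_steps // eval_steps

print(steps_per_epoch, total_steps, evaluations)
```

Under these assumptions one epoch is 1000 steps, so `eval_steps=1000` evaluates and saves roughly once per epoch, and the total of 5000 steps agrees with the `global_step=5000` reported by `trainer.train()` further down.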


In [25]:
# Load a pretrained model, specifying the number of labels in our dataset for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [26]:
train_dataset = dataset['train'].shuffle(seed=10) #.select(range(40000)) # to select a part
eval_dataset = dataset['eval'].shuffle(seed=10)

In [27]:
def compute_f1_score(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    f1 = f1_score(labels, preds, average="weighted")
    return {"f1_score": f1}
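
To show what `average="weighted"` means, here is a hand-rolled sketch of the weighted F1 score using only the standard library. `weighted_f1` is a hypothetical helper and the labels are made up; in practice scikit-learn's `f1_score` should be preferred.

```python
from collections import Counter


def weighted_f1(labels, preds):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(labels)  # number of true examples per class
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        n_pred = sum(1 for p in preds if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true
    return total / len(labels)


# Made-up labels and predictions over the three classes 0, 1, 2
labels = [0, 0, 1, 1, 1, 2]
preds = [0, 1, 1, 1, 2, 2]
score = weighted_f1(labels, preds)
print(round(score, 4))
```

Weighting by support means frequent classes dominate the score, which matters here because the sentiment classes are not equally represented.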

In [28]:
# Define a function to compute accuracy
metric = load_metric("accuracy")

def compute_accuracy(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
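
The same computation can be sketched without any library, which makes the two steps explicit: take the highest-scoring class per row of logits, then count matches. The logits below are made up, and `argmax` and `accuracy_from_logits` are hypothetical helpers.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits: one row per tweet, one column per class
toy_logits = [
    [2.1, 0.3, -1.0],
    [0.2, 1.5, 0.1],
    [-0.5, 0.4, 0.9],
    [1.2, 1.1, -0.3],
]
toy_labels = [0, 1, 2, 1]
result = accuracy_from_logits(toy_logits, toy_labels)
print(result)
```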

In [29]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_f1_score,
)

In [30]:
# Configure a new trainer for evaluation with accuracy
evaluation_trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=eval_dataset,
    compute_metrics=compute_accuracy,
)

In [31]:
# Launch the learning process: training
trainer.train()

| Step | Training Loss | Validation Loss | F1 Score |
|---|---|---|---|
| 1000 | 0.7514 | 0.729413 | 0.669932 |
| 2000 | 0.6183 | 0.736046 | 0.701759 |
| 3000 | 0.5445 | 0.713962 | 0.749628 |
| 4000 | 0.4506 | 0.77257 | 0.759663 |
| 5000 | 0.3555 | 0.996084 | 0.763361 |


TrainOutput(global_step=5000, training_loss=0.5572739318847656, metrics={'train_runtime': 4047.2848, 'train_samples_per_second': 9.882, 'train_steps_per_second': 1.235, 'total_flos': 1.052322114203136e+16, 'train_loss': 0.5572739318847656, 'epoch': 5.0})

In [32]:
# Launch the final evaluation
eval_metrics = evaluation_trainer.evaluate()

print("Evaluation metrics:", eval_metrics)

Evaluation metrics: {'eval_loss': 0.7139616012573242, 'eval_accuracy': 0.755, 'eval_runtime': 59.7844, 'eval_samples_per_second': 33.454, 'eval_steps_per_second': 4.182}
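
The reported `eval_loss` matches the validation loss logged at step 3000, which suggests that `load_best_model_at_end=True` restored that checkpoint. As far as I know, the Trainer selects the best checkpoint by evaluation loss when no other metric is configured; treat that as an assumption. The sketch below replays the selection on the logged values.

```python
# (step, training loss, validation loss, F1) copied from the training log
log = [
    (1000, 0.7514, 0.729413, 0.669932),
    (2000, 0.6183, 0.736046, 0.701759),
    (3000, 0.5445, 0.713962, 0.749628),
    (4000, 0.4506, 0.772570, 0.759663),
    (5000, 0.3555, 0.996084, 0.763361),
]

# Lowest validation loss versus highest F1 score
best_by_loss = min(log, key=lambda row: row[2])
best_by_f1 = max(log, key=lambda row: row[3])

print("best by loss:", best_by_loss[0])
print("best by F1:", best_by_f1[0])
```

The two criteria disagree: the validation loss is lowest at step 3000, while F1 keeps creeping up to step 5000. After step 3000 the training loss keeps falling while the validation loss climbs, which is the usual sign of overfitting.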


Some checkpoints of the model are automatically saved locally in `test_trainer/` during the training.

You may also upload the model to the Hugging Face platform. [Read more](https://huggingface.co/docs/hub/models-uploading)

This notebook is inspired by an article: [Fine-Tuning Bert for Tweets Classification ft. Hugging Face](https://medium.com/mlearning-ai/fine-tuning-bert-for-tweets-classification-ft-hugging-face-8afebadd5dbf)

Do not hesitate to read more and to ask questions; learning is a lifelong activity.