# Sentiment Analysis with Hugging Face

Hugging Face is an open-source platform and provider of machine learning technologies. You can install its packages to access pre-built models and either use them directly or fine-tune them (retrain them on your own dataset, leveraging the knowledge acquired during the initial training). You can then host your trained models on the platform so that you can use them later on other devices and in other apps.

Please [go to the website and sign in](https://huggingface.co/) to access all the features of the platform.

[Read more about Text classification with Hugging Face](https://huggingface.co/tasks/text-classification)

Hugging Face models are deep-learning based, so training them requires a lot of GPU compute. Please use [Colab](https://colab.research.google.com/), another cloud GPU provider, or a local machine with an NVIDIA GPU.

## Application of Hugging Face Text classification model Fine-tuning

Find below a simple example, with just 10 epochs of fine-tuning.

Read more about the fine-tuning concept [here](https://deeplizard.com/learn/video/5T-iXNNiwIs#:~:text=Fine%2Dtuning%20is%20a%20way,perform%20a%20second%20similar%20task.).

In [3]:
%%capture
!pip install datasets

In [4]:
%%capture
!pip install transformers

In [5]:
%%capture
!pip install --upgrade accelerate

In [6]:
%%capture
!pip install sentencepiece

## Importing Libraries

In [7]:
%%capture
import huggingface_hub
import os
import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.model_selection import train_test_split

from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

from sklearn.metrics import mean_squared_error

In [8]:
# Now log in to the Hugging Face Hub
huggingface_hub.notebook_login()



In [9]:
# Disable W&B (Weights & Biases) logging
os.environ["WANDB_DISABLED"] = "true"

In [10]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [11]:
!pwd

/content


In [12]:
data_path= "/content/drive/My Drive/Colab Notebooks/natural-language-processing/clean_copy.csv"

In [13]:
##reading dataset
data= pd.read_csv(data_path)

In [14]:
data.head()

Unnamed: 0.1,Unnamed: 0,clean_tweet,label,agreement
0,0,amp big homie meanboy stegman st,0.0,1.0
1,1,im thinking devoting career proving autism isn...,1.0,1.0
2,2,vaccines vaccinate child,-1.0,1.0
3,3,mean immunize kid something wont secretly kill...,-1.0,1.0
4,4,thanks catch performing la nuit nyc st ave sho...,0.0,1.0


In [15]:
##Counting missing values per column
data.isna().sum()

Unnamed: 0      0
clean_tweet    29
label           0
agreement       0
dtype: int64

In [16]:
data[data["clean_tweet"].isnull()]

Unnamed: 0.1,Unnamed: 0,clean_tweet,label,agreement
444,444,,0.0,1.0
1523,1523,,0.0,1.0
2155,2155,,0.0,1.0
2515,2515,,0.0,1.0
3062,3062,,0.0,0.666667
3204,3204,,0.0,1.0
3819,3819,,1.0,0.666667
4631,4631,,0.0,1.0
4638,4638,,0.0,1.0
4770,4770,,0.0,1.0


In [17]:
##Dropping rows with missing values, then the redundant "Unnamed: 0" column

data= data.dropna()
data= data.drop("Unnamed: 0", axis=1)

In [18]:
##before splitting I will convert each tweet row to a tuple of words since that's the acceptable format

data['clean_tweet'] = data['clean_tweet'].apply(lambda tweet: tuple(tweet.split(),))

## Splitting the dataset

In [19]:
train_set, eval_set= train_test_split(data, test_size= 0.2, stratify= data["label"])


In [20]:
train_set

Unnamed: 0,clean_tweet,label,agreement
4391,"(s, thousands, die, measles, ebola, aftermath,...",0.0,0.666667
1096,"(thats, saying, look, made, disease, vaccine, ...",0.0,1.000000
9862,"(im, im, age, measles, mumps, amp, chicken, po...",0.0,0.666667
9799,"(juss, gott, news, stanky, butt, gott, measles...",0.0,1.000000
4147,"(cdc, says, flu, shot, less, effective, health...",0.0,1.000000
...,...,...,...
693,"(plz, rt, admits, mmr, causes, inc, boys, x, b...",-1.0,1.000000
2256,"(irresponsible, ignorant, vaccinate, children,...",1.0,1.000000
7292,"(increase, measles, cases, expected, ohio, mea...",0.0,1.000000
8148,"(measles, outbreak, underscores, need, continu...",0.0,0.666667


In [21]:
##saving my train and eval set

train_set.to_csv("/content/train_set.csv")
eval_set.to_csv("/content/eval_set.csv")
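
One detail worth keeping in mind here: CSV files store every cell as text, so a tuple column written to disk comes back as the string form of the tuple rather than as a tuple. The sketch below illustrates this with Python's standard `csv` module on a made-up row (it does not use the notebook's data), and shows one possible way, `ast.literal_eval`, to recover the words. pandas' `to_csv` is expected to behave similarly for object columns, but that is worth verifying on your own data.

```python
import ast
import csv
import io

# Hypothetical row, shaped like the notebook's tuple-of-words column.
row = {"clean_tweet": ("vaccines", "vaccinate", "child"), "label": -1.0}

# Write the row to an in-memory CSV file.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["clean_tweet", "label"])
writer.writeheader()
writer.writerow(row)

# Read it back: every value is now a plain string.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded["clean_tweet"]).__name__)
print(loaded["clean_tweet"])

# One way to get the words back: parse the string, then join them.
words = ast.literal_eval(loaded["clean_tweet"])
text = " ".join(words)
print(text)
```

If the tokenizer should see plain sentences rather than the textual form of a tuple, joining the words back into a single string before tokenization is one option.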

In [22]:
train_set.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7976 entries, 4391 to 969
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   clean_tweet  7976 non-null   object 
 1   label        7976 non-null   float64
 2   agreement    7976 non-null   float64
dtypes: float64(2), object(1)
memory usage: 249.2+ KB


In [23]:
eval_set.head()

Unnamed: 0,clean_tweet,label,agreement
5358,"(mmr, n, da, building, band, name, ex, first, ...",0.0,1.0
6345,"(itch, little, clear, center, blister, say, ch...",0.0,1.0
8598,"(child, porn, measles, outbreak, amp, coverup,...",0.0,1.0
5002,"(lewis, first, editorinchief, publicly, issue,...",0.0,1.0
1578,"(come, onsomebody, come, mmr)",0.0,1.0


In [24]:
eval_set.label.unique()

array([ 0.,  1., -1.])

In [25]:
print(f"new dataframe shapes: train is {train_set.shape}, eval is {eval_set.shape}")

new dataframe shapes: train is (7976, 3), eval is (1994, 3)


## Load dataset

In [26]:
##loading the saved CSV files as a Hugging Face DatasetDict, the format expected by the tokenizer and Trainer

dataset= load_dataset("csv", data_files={"train_set":"train_set.csv", "eval_set":"eval_set.csv" }, encoding= "ISO-8859-1")


## View data

In [27]:
##dataset viewing
dataset

DatasetDict({
    train_set: Dataset({
        features: ['Unnamed: 0', 'clean_tweet', 'label', 'agreement'],
        num_rows: 7976
    })
    eval_set: Dataset({
        features: ['Unnamed: 0', 'clean_tweet', 'label', 'agreement'],
        num_rows: 1994
    })
})

## Tokenization

In [28]:
##instantiating tokenizer
tokenizer= AutoTokenizer.from_pretrained("roberta-base")


In [29]:
## the labels are -1, 0, 1 and we would like to transform them respectively into 0, 1, 2

def transform_labels(example):
  label = example["label"]
  num = 0

  if label == -1:
    num = 0
  elif label == 0:
    num = 1
  elif label == 1:
    num = 2
  return {"labels": num}

def tokenize(example):
  return tokenizer(example["clean_tweet"], padding= "max_length", truncation=True, return_tensors= "pt")
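
The label mapping in this cell can also be written as a lookup table, which keeps the label scheme in one place and fails loudly on an unexpected value instead of silently returning 0. This is a minimal sketch, not the notebook's code: `LABEL_TO_ID` and `map_label` are illustrative names.

```python
# Illustrative lookup table: original label -> class index expected by the model.
LABEL_TO_ID = {-1: 0, 0: 1, 1: 2}

def map_label(example):
    """Return the class index for one example's `label` field."""
    label = example["label"]
    # Float labels such as -1.0 match the integer keys because they
    # compare equal and hash to the same value.
    if label not in LABEL_TO_ID:
        raise ValueError(f"Unexpected label: {label!r}")
    return {"labels": LABEL_TO_ID[label]}

print(map_label({"label": -1.0}))
print(map_label({"label": 1.0}))
```

A function of this shape returns the same dictionary structure as the if/elif version, so it could be passed to `dataset.map` in the same way.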

In [30]:
##tokenizing the tweets and removing all unnecessary columns

dataset= dataset.map(tokenize, batched= True)
remove_columns= ['Unnamed: 0', 'clean_tweet', 'label', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)


In [31]:
dataset

DatasetDict({
    train_set: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 7976
    })
    eval_set: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1994
    })
})
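
To make the `input_ids` and `attention_mask` columns more concrete, here is a toy sketch of what padding to a fixed length with truncation produces. It does not call the real tokenizer: the token ids, `PAD_ID`, and `MAX_LENGTH` are made-up values for illustration, and the real RoBERTa tokenizer also adds special tokens around each sequence.

```python
PAD_ID = 1        # assumed padding id, for illustration only
MAX_LENGTH = 6    # assumed fixed length; real models use much larger values

def pad_and_mask(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Truncate or pad a list of ids and build the matching attention mask."""
    ids = list(token_ids[:max_length])   # truncation
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding            # padding up to the fixed length
    mask += [0] * padding                # 0 marks padding the model should ignore
    return {"input_ids": ids, "attention_mask": mask}

print(pad_and_mask([101, 102, 103]))
print(pad_and_mask([5, 6, 7, 8, 9, 10, 11, 12]))
```

Every example ends up with the same length, which is what allows the rows to be stacked into batches during training.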

## Training

In [32]:
# Configure the training parameters, such as `num_train_epochs`:
# the number of times the training loop passes over the whole dataset
training_args = TrainingArguments("test_trainer",
                                  num_train_epochs=10,
                                  load_best_model_at_end=True,
                                  save_strategy='epoch',
                                  evaluation_strategy='epoch',
                                  logging_strategy='epoch',
                                  logging_steps=100,
                                  per_device_train_batch_size=8,
                                  )

In [33]:
# Load a pretrained model, specifying the number of labels in our dataset for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)


In [34]:
train_dataset = dataset['train_set'].shuffle(seed=10)
eval_dataset = dataset['eval_set'].shuffle(seed=10)

In [35]:
def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  return {"rmse": mean_squared_error(labels, preds, squared=False)}
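
RMSE treats the class indices as points on a numeric scale. For a classification task it can help to also look at accuracy and at how the predictions are spread over the classes, since a metric that stays the same from epoch to epoch may indicate that the model keeps predicting a single class. The helper below is a hedged, dependency-free sketch: `summarize_predictions` is a hypothetical name and the logits are made up.

```python
from collections import Counter

def summarize_predictions(logits, labels):
    """Compute accuracy and the predicted-class distribution from raw logits."""
    # argmax over each row of logits gives the predicted class index
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return {
        "accuracy": correct / len(labels),
        "predicted_counts": dict(Counter(preds)),
    }

# Made-up logits for four examples and three classes.
toy_logits = [[0.1, 2.0, 0.3], [0.2, 1.5, 0.1], [0.0, 0.9, 0.4], [1.2, 0.3, 0.1]]
toy_labels = [1, 0, 2, 0]
print(summarize_predictions(toy_logits, toy_labels))
```

Inside `compute_metrics`, a function like this could be called with `pred.predictions` and `pred.label_ids`, and its `accuracy` value returned alongside `rmse`.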

In [36]:
##instantiating the Trainer with the model, training arguments, datasets and metric
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [37]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rmse
1,0.9681,0.957271,0.714164
2,0.9601,0.952174,0.714164
3,0.9585,0.959081,0.951428
4,0.9568,0.952042,0.714164
5,0.957,0.952174,0.714164
6,0.9557,0.967638,0.714164
7,0.955,0.951222,0.714164
8,0.9538,0.95555,0.714164
9,0.9532,0.951647,0.714164
10,0.9529,0.951276,0.714164


TrainOutput(global_step=9970, training_loss=0.9571112935975896, metrics={'train_runtime': 8147.019, 'train_samples_per_second': 9.79, 'train_steps_per_second': 1.224, 'total_flos': 2.098592619798528e+16, 'train_loss': 0.9571112935975896, 'epoch': 10.0})
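
The `global_step=9970` value in this output is consistent with the dataset and batch sizes shown earlier. A quick check, assuming a single device and no gradient accumulation (neither is stated explicitly in the notebook):

```python
import math

train_rows = 7976   # rows in train_set
batch_size = 8      # per_device_train_batch_size
epochs = 10         # num_train_epochs

# A partial last batch would still count as one step, hence the ceiling.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 997 9970
```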

In [38]:
# Launch the final evaluation
trainer.evaluate()

{'eval_loss': 0.9512220621109009,
 'eval_rmse': 0.7141639099470181,
 'eval_runtime': 59.946,
 'eval_samples_per_second': 33.263,
 'eval_steps_per_second': 4.17,
 'epoch': 10.0}
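
The final `eval_loss` matches the validation loss of epoch 7 in the training log rather than that of epoch 10. That is what one would expect from `load_best_model_at_end=True`, which selects the checkpoint by evaluation loss when no `metric_for_best_model` is given. The sketch below simply replays that selection on the logged values; it is not the Trainer's internal code.

```python
# Validation losses copied from the training log, indexed by epoch.
validation_loss = {
    1: 0.957271, 2: 0.952174, 3: 0.959081, 4: 0.952042, 5: 0.952174,
    6: 0.967638, 7: 0.951222, 8: 0.95555, 9: 0.951647, 10: 0.951276,
}

# Lower loss is better, so the best checkpoint is the epoch with the minimum.
best_epoch = min(validation_loss, key=validation_loss.get)
print(best_epoch, validation_loss[best_epoch])  # 7 0.951222
```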

## Pushing to Hugging Face

Some checkpoints of the model are automatically saved locally in `test_trainer/` during training.

You may also upload the model to the Hugging Face platform. [Read more](https://huggingface.co/docs/hub/models-uploading)

In [39]:
# Push the model and tokenizer to the Hugging Face Hub
model.push_to_hub("HerbertAIHug/finetuned_sentiment_analysis_modell")
tokenizer.push_to_hub("HerbertAIHug/finetuned_sentiment_analysis_modell")


CommitInfo(commit_url='https://huggingface.co/HerbertAIHug/finetuned_sentiment_analysis_modell/commit/bdf00be68ac79716ff835049206261a16853fcf2', commit_message='Upload tokenizer', commit_description='', oid='bdf00be68ac79716ff835049206261a16853fcf2', pr_url=None, pr_revision=None, pr_num=None)