<a href="https://colab.research.google.com/github/Gilbert-B/Natural-Language-Processing-Sentiment-Analysis-/blob/main/HuggingFace_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis with Hugging Face

Sentiment analysis is a natural language processing (NLP) technique for determining the emotional tone of a piece of text, typically by classifying it as positive, negative, or neutral. Its use cases include customer feedback analysis, brand reputation management, and social media monitoring, and it has become an increasingly popular application of machine learning in recent years.

Hugging Face maintains open-source libraries and tools for building, training, and deploying deep learning models for NLP. Two of them are used in this notebook: the Transformers library, which provides pre-trained transformer models that can be fine-tuned on a specific sentiment analysis task, and the Datasets library, which provides datasets and data-loading tools for training and evaluation. Together they make it easier for developers and researchers to build and deploy sentiment analysis models.







## Application of Hugging Face Text Classification Model Fine-tuning

In [1]:
!pip install datasets
!pip install transformers
!pip install huggingface_hub



Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting huggingface-hub<1.0.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py39-none-any.whl (132 kB)

In [35]:
import numpy as np
import os
import pandas as pd
import warnings

from transformers import Trainer
from transformers import TrainingArguments
from transformers.trainer_callback import EarlyStoppingCallback
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from datasets import load_metric
from datasets import load_dataset
from huggingface_hub import notebook_login
from huggingface_hub import Repository
from sklearn.model_selection import train_test_split



In [3]:
warnings.filterwarnings('ignore')

In [4]:
#login to HF hub
notebook_login()


In [5]:
# Disable Weights & Biases logging
os.environ["WANDB_DISABLED"] = "true"

In [6]:
# Load the dataset and display some values
df = pd.read_csv('https://raw.githubusercontent.com/Azubi-Africa/Career_Accelerator_P5-NLP/master/zindi_challenge/data/Train.csv')
df


| | tweet_id | safe_text | label | agreement |
|---|---|---|---|---|
| 0 | CL1KWCMY | Me &amp; The Big Homie meanboy3000 #MEANBOY #M... | 0.0 | 1.000000 |
| 1 | E3303EME | I'm 100% thinking of devoting my career to pro... | 1.0 | 1.000000 |
| 2 | M4IVFSMS | #whatcausesautism VACCINES, DO NOT VACCINATE Y... | -1.0 | 1.000000 |
| 3 | 1DR6ROZ4 | I mean if they immunize my kid with something ... | -1.0 | 1.000000 |
| 4 | J77ENIIE | Thanks to <user> Catch me performing at La Nui... | 0.0 | 1.000000 |
| ... | ... | ... | ... | ... |
| 9996 | IU0TIJDI | Living in a time where the sperm I used to was... | 1.0 | 1.000000 |
| 9997 | WKKPCJY6 | <user> <user> In spite of all measles outbrea... | 1.0 | 0.666667 |
| 9998 | ST3A265H | Interesting trends in child immunization in Ok... | 0.0 | 1.000000 |
| 9999 | 6Z27IJGD | CDC Says Measles Are At Highest Levels In Deca... | 0.0 | 1.000000 |


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001 entries, 0 to 10000
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   tweet_id   10001 non-null  object 
 1   safe_text  10001 non-null  object 
 2   label      10000 non-null  float64
 3   agreement  9999 non-null   float64
dtypes: float64(2), object(2)
memory usage: 312.7+ KB


In [8]:
# Eliminate rows containing NaN values
df = df.dropna()

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9999 entries, 0 to 10000
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   tweet_id   9999 non-null   object 
 1   safe_text  9999 non-null   object 
 2   label      9999 non-null   float64
 3   agreement  9999 non-null   float64
dtypes: float64(2), object(2)
memory usage: 390.6+ KB


In [10]:
#distribution of sentiments 
df["label"].value_counts()

 0.0    4908
 1.0    4053
-1.0    1038
Name: label, dtype: int64
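
The counts above show a clear class imbalance, which is worth quantifying before training. The short sketch below uses only the counts printed above (the variable names are illustrative) to compute each class's share and the accuracy that a trivial "always predict the most frequent class" model would reach. That figure is a useful baseline when judging the fine-tuned model later.

```python
# Label counts copied from the value_counts() output above
label_counts = {0.0: 4908, 1.0: 4053, -1.0: 1038}

total = sum(label_counts.values())

# Share of each class in the cleaned dataset
shares = {label: count / total for label, count in label_counts.items()}
for label, share in shares.items():
    print(f"label {label:+.0f}: {share:.1%}")

# Accuracy of always predicting the most frequent class
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_label] / total
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model worth keeping should clearly beat this baseline of roughly 0.49, and the small negative class suggests that accuracy alone may hide weak performance on negative tweets.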

## **Fine-tuning the BERT model**

In [11]:
# Split the train data => {train, eval}
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])

In [12]:
train.head()

| | tweet_id | safe_text | label | agreement |
|---|---|---|---|---|
| 9305 | YMRMEDME | Mickey's Measles has gone international <url> | 0.0 | 1.0 |
| 3907 | 5GV8NEZS | S1256 [NEW] Extends exemption from charitable ... | 0.0 | 1.0 |
| 795 | EI10PS46 | <user> your ignorance on vaccines isn't just ... | 1.0 | 0.666667 |
| 5793 | OM26E6DG | Pakistan partly suspends polio vaccination pro... | 0.0 | 1.0 |
| 3431 | NBBY86FX | In other news I've gone up like 1000 mmr | 0.0 | 1.0 |


In [13]:
eval.head()

| | tweet_id | safe_text | label | agreement |
|---|---|---|---|---|
| 6571 | R7JPIFN7 | Children's Museum of Houston to Offer Free Vac... | 1.0 | 1.0 |
| 1754 | 2DD250VN | <user> no. I was properly immunized prior to t... | 1.0 | 1.0 |
| 3325 | ESEVBTFN | <user> thx for posting vaccinations are impera... | 1.0 | 1.0 |
| 1485 | S17ZU0LC | This Baby Is Exactly Why Everyone Needs To Vac... | 1.0 | 0.666667 |
| 4175 | IIN5D33V | Meeting tonight, 8:30pm in room 322 of the stu... | 1.0 | 1.0 |


In [14]:
print(f"New Dataframe shapes: train is {train.shape}, eval is {eval.shape}")

New Dataframe shapes: train is (7999, 4), eval is (2000, 4)
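
The split above passes `stratify=df['label']`, so the train and eval sets keep roughly the same label mix as the full data. The sketch below illustrates the idea with a hand-rolled helper on toy labels. It is not scikit-learn's implementation, and `stratified_split` is a hypothetical name introduced only for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same label mix in both."""
    # Group row indices by their label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Take the same fraction from every class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels with an imbalance similar in spirit to the tweet data
labels = [0] * 50 + [1] * 40 + [-1] * 10
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))         # 80 20
print(Counter(labels[i] for i in test_idx))  # 10 neutral, 8 positive, 2 negative
```

Without stratification, a random 20% sample could by chance contain very few negative tweets, which would make the evaluation accuracy less representative.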


In [15]:
directory = r'C:\Users\GilB\OneDrive\Documents\Git Repo\NLP\Natural-Language-Processing-Project-Sentiment-Analysis\Data'

In [16]:
# Save the split subsets
train.to_csv(r'C:\Users\GilB\OneDrive\Documents\Git Repo\NLP\Natural-Language-Processing-Project-Sentiment-Analysis\Data\train_subset.csv', index=False)
eval.to_csv(r'C:\Users\GilB\OneDrive\Documents\Git Repo\NLP\Natural-Language-Processing-Project-Sentiment-Analysis\Data\eval_subset.csv', index=False)

In [17]:
dataset = load_dataset('csv',
                        data_files={'train': r'C:\Users\GilB\OneDrive\Documents\Git Repo\NLP\Natural-Language-Processing-Project-Sentiment-Analysis\Data\train_subset.csv',
                        'eval': r'C:\Users\GilB\OneDrive\Documents\Git Repo\NLP\Natural-Language-Processing-Project-Sentiment-Analysis\Data\eval_subset.csv'}, encoding = "ISO-8859-1")


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-3bc0189691ae0000/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-3bc0189691ae0000/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.



In [18]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

In [19]:
def transform_labels(example):
    # Map the original labels {-1, 0, 1} to the class ids {0, 1, 2} expected by the model
    label = example['label']
    num = 0
    if label == -1:   # Negative
        num = 0
    elif label == 0:  # Neutral
        num = 1
    elif label == 1:  # Positive
        num = 2

    return {'labels': num}

def tokenize_data(example):
    # Pad or truncate every tweet to exactly 256 tokens
    return tokenizer(example['safe_text'], padding='max_length', truncation=True, max_length=256)

# Convert the tweets into tokens that the model can use
dataset = dataset.map(tokenize_data, batched=True)

# Transform the labels and remove the columns that are no longer needed
remove_columns = ['tweet_id', 'label', 'safe_text', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)

In [20]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 7999
    })
    eval: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})
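
The `input_ids` and `attention_mask` features come from tokenizing with `padding='max_length'`, `truncation=True` and `max_length=256`. The sketch below shows, in plain Python, what padding and truncation do to a sequence of token ids. It is a simplification: a real BERT tokenizer also keeps the closing special token when truncating. The helper name `pad_or_truncate`, the example ids and the padding id `0` are assumptions made for illustration.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)                   # how much padding is needed
    input_ids = kept + [pad_id] * n_pad              # pad on the right
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Hypothetical token ids, used only for illustration
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)  # [101, 7592, 102, 0, 0, 0] [1, 1, 1, 0, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5, 6] [1, 1, 1, 1, 1, 1]
```

The attention mask lets the model ignore the padded positions, so every example can share the same length without the padding affecting the prediction.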

In [22]:
# Configure the training parameters, such as `num_train_epochs`:
# the number of times the training loop passes over the whole dataset

training_args = TrainingArguments(
    "test_trainer",
    num_train_epochs=3,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [24]:
# Load a pretrained model, specifying the number of labels in our dataset for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [25]:
# Define evaluation metrics
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
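
To make the metric concrete, the sketch below reproduces the same logic in plain Python: take the index of the largest logit as the predicted class, then compute the share of predictions that match the references. The logits and labels are made up, and `accuracy_from_logits` is a hypothetical helper, not part of any library.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

def accuracy_from_logits(logits, labels):
    """Share of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Made-up logits for four examples and three classes
logits = [
    [2.1, 0.3, -1.0],   # predicted class 0
    [0.2, 1.5, 0.1],    # predicted class 1
    [-0.5, 0.4, 1.9],   # predicted class 2
    [0.9, 0.8, 0.1],    # predicted class 0 (wrong, reference is 1)
]
labels = [0, 1, 2, 1]

result = accuracy_from_logits(logits, labels)
print(result)  # {'accuracy': 0.75}
```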
     

In [26]:
#Instantiating the training and evaluation sets 

train_dataset = dataset['train'].shuffle(seed=10) 
eval_dataset = dataset['eval'].shuffle(seed=10)

In [27]:
# Data collator that pads each batch to a common length and returns PyTorch tensors
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [29]:
# Define trainer and training arguments
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)
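
The trainer above uses `EarlyStoppingCallback(early_stopping_patience=3)` together with `num_train_epochs=3`. The sketch below is a simplified model of patience-based early stopping (it ignores any improvement threshold and is not the library's code). The function name and the loss values are made up for illustration.

```python
def epochs_run(eval_losses, patience):
    """Return how many epochs run before early stopping (or all of them)."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        # Stop once the loss has failed to improve `patience` times in a row
        if epochs_without_improvement >= patience:
            return epoch
    return len(eval_losses)

losses = [0.80, 0.70, 0.72, 0.75, 0.71, 0.74]  # made-up validation losses

print(epochs_run(losses, patience=3))      # 5
print(epochs_run(losses[:3], patience=3))  # 3
```

Under this model, a patience of 3 cannot trigger within 3 epochs, because the first evaluation always counts as an improvement. A smaller patience or more epochs would be needed for the callback to have an effect.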

In [30]:
# Train the model
trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.7326 | 0.727536 | 0.7195 |
| 2 | 0.6181 | 0.632707 | 0.752 |
| 3 | 0.4432 | 0.693158 | 0.7515 |


TrainOutput(global_step=3000, training_loss=0.6360030568440755, metrics={'train_runtime': 1142.0013, 'train_samples_per_second': 21.013, 'train_steps_per_second': 2.627, 'total_flos': 3156966342609408.0, 'train_loss': 0.6360030568440755, 'epoch': 3.0})
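
The numbers in `TrainOutput` can be checked by hand. The sketch below recomputes the step count and throughput from figures shown in the outputs above. It assumes the `Trainer` default batch size of 8 per device and a single device, since the training arguments above do not set a batch size.

```python
import math

# Values taken from the outputs above
n_train_examples = 7999
num_train_epochs = 3
train_runtime_s = 1142.0013

# Assumption: default per-device batch size of 8 on a single device
batch_size = 8

steps_per_epoch = math.ceil(n_train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
samples_per_second = n_train_examples * num_train_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(steps_per_epoch, total_steps)   # 1000 3000
print(round(samples_per_second, 3))   # 21.013
print(round(steps_per_second, 3))     # 2.627
```

These match `global_step=3000`, `train_samples_per_second` and `train_steps_per_second` in the output, which supports the batch-size assumption.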

In [31]:

# Reinstantiate the trainer for evaluation
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [32]:
# Launch the final evaluation 
trainer.evaluate()
    

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 0.6327070593833923,
 'eval_accuracy': 0.752,
 'eval_runtime': 30.5024,
 'eval_samples_per_second': 65.569,
 'eval_steps_per_second': 8.196}
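
The evaluation loss and accuracy above equal the epoch 2 row of the training table rather than the final epoch. This is consistent with `load_best_model_at_end=True`, which, as far as the configuration above shows, selects the checkpoint with the lowest validation loss. The sketch below repeats that selection on the logged values.

```python
# Per-epoch results copied from the training table above
history = [
    {"epoch": 1, "eval_loss": 0.727536, "eval_accuracy": 0.7195},
    {"epoch": 2, "eval_loss": 0.632707, "eval_accuracy": 0.752},
    {"epoch": 3, "eval_loss": 0.693158, "eval_accuracy": 0.7515},
]

# Pick the epoch with the lowest validation loss
best = min(history, key=lambda row: row["eval_loss"])
print(best["epoch"], best["eval_accuracy"])  # 2 0.752

# Number of correctly classified tweets in the 2000-example eval set
n_eval_examples = 2000
n_correct = round(best["eval_accuracy"] * n_eval_examples)
print(n_correct)  # 1504
```

Epoch 3 has a lower training loss but a higher validation loss than epoch 2, a pattern that usually indicates the model is starting to overfit.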

In [34]:
import torch

# Save the model weights (state_dict) to a file
torch.save(model.state_dict(), 'PredictSentiment.pt')


In [46]:
# Create a Repository object with the URL of your existing repository and the local directory where it will be cloned to
repo = Repository(clone_from='https://huggingface.co/GhylB/Sentiment_Analysis', local_dir='./Sentiment_Analysis')

# Push the model and tokenizer to the Hub using their own `push_to_hub` methods
model.push_to_hub("GhylB/Sentiment_Analysis")
tokenizer.push_to_hub("GhylB/Sentiment_Analysis")

/content/./Sentiment_Analysis is already a clone of https://huggingface.co/GhylB/Sentiment_Analysis. Make sure you pull the latest changes with `repo.git_pull()`.


CommitInfo(commit_url='https://huggingface.co/GhylB/Sentiment_Analysis/commit/a32dfe04fb2dd4651c6175b98c5acaf6a5964dbb', commit_message='Upload tokenizer', commit_description='', oid='a32dfe04fb2dd4651c6175b98c5acaf6a5964dbb', pr_url=None, pr_revision=None, pr_num=None)
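
Anyone who later uses the uploaded model gets raw logits back, which still need to be turned into a readable sentiment. The sketch below shows one way to do that in plain Python, using the class ids defined in `transform_labels` above (0 = negative, 1 = neutral, 2 = positive). The logits are made up, and `ID2LABEL` and `decode` are names introduced only for this example.

```python
import math

# Class ids follow the mapping used in transform_labels above
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best_id = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best_id], probs[best_id]

label, score = decode([-1.2, 0.3, 2.4])  # made-up logits for one tweet
print(label, round(score, 3))
```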

Model checkpoints are saved locally in `test_trainer/` at the end of each epoch during training.

You may also upload the model to the Hugging Face Hub. [Read more](https://huggingface.co/docs/hub/models-uploading)

This notebook is inspired by an article: [Fine-Tuning Bert for Tweets Classification ft. Hugging Face](https://medium.com/mlearning-ai/fine-tuning-bert-for-tweets-classification-ft-hugging-face-8afebadd5dbf)

Do not hesitate to read more and to ask questions; learning is a lifelong activity.