Sentiment Analysis With Hugging Face

Hugging Face is a platform for sharing machine learning models and datasets, including pretrained models for many tasks. Through its open-source transformers library, you can load these models and use them directly or fine-tune them on your own dataset. The platform also lets you host your trained models so that they can be used from other devices and applications.

Some features, such as logging in from a notebook or pushing a model to the Hub, require a Hugging Face account and an access token.

Text classification is one of the tasks these models support: a pretrained transformer is fine-tuned to assign each text a label, such as its sentiment. Fine-tuning is computationally expensive and in practice needs a GPU. Options include Google Colab, a cloud GPU provider, or a local machine with an NVIDIA GPU.
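Before fine-tuning, it can help to confirm that a GPU is actually visible to the runtime. The helper below is a minimal sketch, assuming PyTorch is the backend; `pick_device` is a hypothetical name, and the function falls back to "cpu" when PyTorch is not installed.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch is installed and sees a GPU, else "cpu"."""
    # find_spec avoids an ImportError on machines without PyTorch.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Training would run on: {pick_device()}")
```

On Colab, a result of "cpu" usually means the runtime type has not been switched to a GPU.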

This notebook fine-tunes distilbert-base-uncased to classify tweets as negative, neutral, or positive.

In [1]:
# Install the libraries used in this notebook
!pip install huggingface_hub transformers datasets gradio pipreqs
!pip install --upgrade huggingface_hub



In [3]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) Y
Token is valid (permission: read).
[1m[31mCannot authenticate throu

In [12]:
# Import libraries
import os
import uuid
import pandas as pd
import numpy as np
from scipy.special import softmax
import gradio as gr

from google.colab import drive
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    TFAutoModelForSequenceClassification,
    IntervalStrategy,
    TrainingArguments,
    EarlyStoppingCallback,
    pipeline,
    Trainer
)

In [13]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Setting up my environment

In [14]:
# Disable Weights & Biases (W&B) logging
os.environ["WANDB_DISABLED"] = "true"

In [18]:
# Load the CSV file into a DataFrame

url = "https://github.com/Azubi-Africa/Career_Accelerator_P5-NLP/raw/master/zindi_challenge/data/Train.csv"

train = pd.read_csv(url)

In [20]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001 entries, 0 to 10000
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   tweet_id   10001 non-null  object 
 1   safe_text  10001 non-null  object 
 2   label      10000 non-null  float64
 3   agreement  9999 non-null   float64
dtypes: float64(2), object(2)
memory usage: 312.7+ KB


In [21]:
train.isnull().sum()

tweet_id     0
safe_text    0
label        1
agreement    2
dtype: int64

The label and agreement columns have missing values.

In [22]:
# Show the rows that contain missing values
train[train.isna().any(axis=1)]


Unnamed: 0,tweet_id,safe_text,label,agreement
4798,RQMQ0L2A,#lawandorderSVU,,
4799,I cannot believe in this day and age some pare...,1,0.666667,
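The two rows above look like a single record broken in half: row 4798 stops after the hashtag, and row 4799 starts with tweet text in the tweet_id column, with the remaining values shifted one column to the left. One plausible cause (an assumption, since the raw file is not shown here) is an unquoted line break inside the tweet text. The sketch below reproduces that effect with the standard csv module on made-up data.

```python
import csv
import io

HEADER = "tweet_id,safe_text,label,agreement\n"

# Hypothetical raw text: the tweet contains a line break but is not quoted.
unquoted = HEADER + "ID1,#tag \nrest of the tweet,1,0.666667\n"
# The same record with the text field quoted, as the CSV format expects.
quoted = HEADER + 'ID1,"#tag \nrest of the tweet",1,0.666667\n'


def parse(raw: str) -> list[list[str]]:
    """Parse CSV text and return the data rows (header skipped)."""
    # newline="" lets the csv module handle embedded line breaks itself.
    rows = list(csv.reader(io.StringIO(raw, newline="")))
    return rows[1:]


broken_rows = parse(unquoted)
intact_rows = parse(quoted)

print(len(broken_rows), broken_rows)
print(len(intact_rows), intact_rows)
```

The unquoted version yields two short rows with the same shape as rows 4798 and 4799, while the quoted version yields one complete record.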


In [23]:
complete_text = train['safe_text'].iloc[4798]
complete_text

'#lawandorderSVU '

In [24]:
# Fill in the missing label and agreement values for row 4798
train.loc[4798, 'label'] = 0
train.loc[4798, 'agreement'] = 0.666667

# Write safe_text back by position; the value is unchanged, so this is effectively a no-op
train.iloc[4798, train.columns.get_loc('safe_text')] = complete_text

In [25]:
train.iloc[4798]

tweet_id             RQMQ0L2A
safe_text    #lawandorderSVU 
label                     0.0
agreement            0.666667
Name: 4798, dtype: object

In [26]:
# Generate a random ID to stand in for the missing tweet_id of row 4799
rand_tweet_id = str(uuid.uuid4())


In [27]:
row_index = 4799
train.loc[row_index, 'tweet_id'] = rand_tweet_id
train.loc[row_index, 'label'] = 1
train.loc[row_index, 'agreement'] = 0.666667


In [28]:
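# Note: this assigns safe_text to itself, so the column still holds '1'.
# The tweet text for this row had landed in tweet_id, which was overwritten above.
train.iloc[row_index, train.columns.get_loc('safe_text')] = train.iloc[row_index, train.columns.get_loc('safe_text')]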


In [29]:
train.iloc[4799]

tweet_id     88d58509-b9f0-4fbf-984c-884b48820d7f
safe_text                                       1
label                                         1.0
agreement                                0.666667
Name: 4799, dtype: object
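An alternative repair, sketched here on a toy frame rather than the real data, is to treat rows 4798 and 4799 as one record and fold the second back into the first. This rests on the assumption that the spilled row's fields are each shifted one column to the left; `merge_split_row` and the toy values are hypothetical. It is not what the notebook does: the cells above keep both rows, so the row counts reported later still reflect 10001 rows.

```python
import pandas as pd

# Toy frame that mimics the two damaged rows (values shortened, hypothetical).
toy = pd.DataFrame(
    {
        "tweet_id": ["AAA", "RQMQ0L2A", "rest of the tweet"],
        "safe_text": ["fine row", "#lawandorderSVU ", "1"],
        "label": [0.0, None, 0.666667],
        "agreement": [1.0, None, None],
    }
)


def merge_split_row(df: pd.DataFrame, first: int, second: int) -> pd.DataFrame:
    """Fold the spilled row `second` back into row `first` and drop it."""
    out = df.copy()
    # The spilled text landed in tweet_id, the label in safe_text,
    # and the agreement score in label (each field shifted one column left).
    out.loc[first, "safe_text"] = out.loc[first, "safe_text"] + out.loc[second, "tweet_id"]
    out.loc[first, "label"] = float(out.loc[second, "safe_text"])
    out.loc[first, "agreement"] = out.loc[second, "label"]
    return out.drop(index=second).reset_index(drop=True)


repaired = merge_split_row(toy, first=1, second=2)
print(repaired)
```

Applied to the real data, this would leave 10000 rows instead of 10001, so the split sizes shown later would change.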

In [30]:
train.duplicated().sum()

0

Splitting the dataset

In [31]:
# Split the train data => {train, eval}
train, eval = train_test_split(train, test_size=0.2, random_state=42, stratify=train['label'])
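The stratify argument keeps the label proportions the same in both splits. The sketch below is a hand-rolled illustration of that idea in plain Python, not scikit-learn's implementation; `stratified_split` and the toy label counts are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each label split in the same ratio."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test split.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical imbalanced labels: 50 negative, 30 neutral, 20 positive.
labels = [-1] * 50 + [0] * 30 + [1] * 20
train_idx, test_idx = stratified_split(labels)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Both splits keep the 5:3:2 class ratio, which a purely random split of a small or imbalanced dataset would not guarantee.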

In [32]:
train.head()

Unnamed: 0,tweet_id,safe_text,label,agreement
1641,CQDD6QLM,"New <user> ""Hey Love"" #MMR #ManyMenRecords #Yo...",0.0,1.0
3907,5GV8NEZS,S1256 [NEW] Extends exemption from charitable ...,0.0,1.0
336,I4D043ST,<user> esp when mercury free vaccines are avai...,1.0,0.666667
6861,CKX52Y8G,"My Life, Your Entertainment #YOTC #MMR @ Exoti...",0.0,1.0
720,07S3NL2T,Baby Luna is sore from her vaccines :( #poorpuppy,0.0,0.666667


In [33]:
eval.head()

Unnamed: 0,tweet_id,safe_text,label,agreement
5818,Y8PQ0BT7,So nervous... The baby's getting vaccines... (...,1.0,0.666667
7842,C9Z6JBSS,AIDS N : A malaria vaccine in children with HI...,0.0,0.666667
880,0VE4NWWQ,Measles Outbreak Hits Texas Church That Preach...,1.0,0.666667
9072,RHQRUF14,Thank you <user> for mtg with your staff. We l...,1.0,1.0
288,ZWEP2IL4,Health district offers no-cost immunizations f...,1.0,0.666667


In [34]:
print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")


new dataframe shapes: train is (8000, 4), eval is (2001, 4)


In [35]:
# Specify the directory path
directory = '/content/drive/MyDrive/Colab Notebooks/Sentiment Analysis'

# Create the directory if it does not exist
os.makedirs(directory, exist_ok=True)

# Save the dataframes as CSV files in the specified directory
train.to_csv(os.path.join(directory, "train_subset.csv"), index=False)
eval.to_csv(os.path.join(directory, "eval_subset.csv"), index=False)


In [36]:
# Load the two CSV files into a DatasetDict with 'train' and 'eval' splits
dataset = load_dataset('csv', data_files={
    'train': os.path.join(directory, 'train_subset.csv'),
    'eval': os.path.join(directory, 'eval_subset.csv')
}, encoding='ISO-8859-1')



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-f2e67038b849871f/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating eval split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-f2e67038b849871f/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [37]:
# Load the DistilBERT tokenizer
tokenizer_distilbert = AutoTokenizer.from_pretrained('distilbert-base-uncased')




In [38]:
# Define a function to transform the label values
def transform_labels(example):
    # Extract the raw label value (-1, 0, or 1)
    label = example['label']
    # Map the raw label to a class index; any unexpected value falls back to 0
    num = 0
    if label == -1: #'Negative'
        num = 0
    elif label == 0: #'Neutral'
        num = 1
    elif label == 1: #'Positive'
        num = 2
    # Return the new 'labels' column that the model expects
    return {'labels': num}


# Define a function to tokenize the text data
def tokenize_data3(example):
    # Tokenize 'safe_text', padding every example to the model's maximum length
    # and truncating anything longer
    return tokenizer_distilbert(example['safe_text'], padding='max_length', truncation=True)

# Add the integer 'labels' column
dataset_out = dataset.map(transform_labels)

# Tokenize the text in batches
dataset_distilbert = dataset_out.map(tokenize_data3, batched=True)

# Columns the model does not need
remove_columns = ['tweet_id', 'label', 'safe_text', 'agreement']

# Drop the unused columns. transform_labels runs a second time here and
# recomputes the same 'labels' values, so only the column removal has an effect.
dataset_distilbert = dataset_distilbert.map(transform_labels, remove_columns=remove_columns)

Map:   0%|          | 0/8000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2001 [00:00<?, ? examples/s]

Map:   0%|          | 0/8000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2001 [00:00<?, ? examples/s]

Map:   0%|          | 0/8000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2001 [00:00<?, ? examples/s]
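The if/elif chain in transform_labels can also be written as a lookup table. The sketch below is one possible variant, not the notebook's code: `LABEL2ID`, `ID2LABEL`, and `encode_label` are hypothetical names, and the class names come from the comments in the cell above. Unlike the original, it raises an error on unexpected values instead of silently mapping them to 0.

```python
# Same mapping as transform_labels: raw label -> class index.
LABEL2ID = {-1: 0, 0: 1, 1: 2}
ID2LABEL = {0: "Negative", 1: "Neutral", 2: "Positive"}


def encode_label(raw_label: float) -> int:
    """Map a raw label (-1, 0 or 1, possibly stored as a float) to a class index."""
    key = int(raw_label)
    # Reject fractional values (e.g. 0.666667) and labels outside the mapping.
    if key != raw_label or key not in LABEL2ID:
        raise ValueError(f"unexpected label: {raw_label!r}")
    return LABEL2ID[key]


print([encode_label(x) for x in (-1.0, 0.0, 1.0)])  # [0, 1, 2]
```

Failing loudly here would surface rows whose columns are shifted, such as a label cell that holds an agreement score.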

In [39]:
dataset

DatasetDict({
    train: Dataset({
        features: ['tweet_id', 'safe_text', 'label', 'agreement'],
        num_rows: 8000
    })
    eval: Dataset({
        features: ['tweet_id', 'safe_text', 'label', 'agreement'],
        num_rows: 2001
    })
})

In [40]:
# Quote the requirement so the shell does not treat ">" as an output redirect
!pip install "accelerate>=0.20.1"

In [41]:
# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',                          # Directory where the model checkpoints and evaluation results will be stored
    evaluation_strategy=IntervalStrategy.STEPS,      # Interval for evaluating the model during training (every specified number of steps)
    save_strategy=IntervalStrategy.STEPS,            # Interval for saving the model during training (every specified number of steps)
    save_steps=500,                                  # Number of steps between two saves
    load_best_model_at_end=True,                     # Whether to load the best model at the end of training
    num_train_epochs=10,                              # Number of training epochs
    per_device_train_batch_size=2,                   # Batch size per GPU for training
    per_device_eval_batch_size=2,                    # Batch size per GPU for evaluation
    learning_rate=3e-5,                              # Learning rate
    weight_decay=0.01,                               # Weight decay
    warmup_steps=500,                                # Number of warmup steps
    logging_steps=500,                               # Number of steps between two logs
    fp16=True,                                       # Whether to use 16-bit precision
    gradient_accumulation_steps=16,                  # Number of steps to accumulate gradients before performing an optimizer step
    dataloader_num_workers=2,                        # Number of workers to use for loading data
    #push_to_hub=True,                                # Whether to push the model checkpoints to the Hugging Face hub
    #hub_model_id="Preencez/finetuned-Sentiment-classfication-BERT-model",  # Model ID to use when pushing the model to the Hugging Face hub
)

# Alternative hub_model_id values:
#   "finetuned-Sentiment-classfication-ROBERTA-model"
#   "finetuned-Sentiment-classfication-BERT-model"
#   "finetuned-Sentiment-classfication-DISTILBERT-model"

# Define the early stopping callback
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=3,                       # Number of consecutive evaluations with no improvement before stopping training
    early_stopping_threshold=0.01,                   # Minimum change in the metric that counts as an improvement
)

# Note: Trainer only uses callbacks passed through its own callbacks argument,
# e.g. Trainer(..., callbacks=[early_stopping]). The assignment below has no effect on training.
training_args.callbacks = [early_stopping]

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
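With gradient accumulation, the optimizer updates once every 16 small batches, so the effective batch size is larger than per_device_train_batch_size suggests. The arithmetic below is a sketch that assumes a single GPU and the 8000 training rows reported earlier; `total_optimizer_steps` is a hypothetical helper, not a transformers function.

```python
def total_optimizer_steps(n_train, per_device_batch, grad_accum, epochs, n_devices=1):
    """Return (effective batch size, steps per epoch, total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    # Batches the dataloader yields per epoch (ceil division, last batch may be partial).
    batches_per_epoch = -(-n_train // (per_device_batch * n_devices))
    # One optimizer step is taken every `grad_accum` batches.
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


effective, per_epoch, total = total_optimizer_steps(
    n_train=8000, per_device_batch=2, grad_accum=16, epochs=10
)
print(effective, per_epoch, total)  # 32 250 2500
```

The total of 2500 steps agrees with the global_step reported at the end of training. It also means that evaluating and saving every 500 steps corresponds to once every two epochs.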


In [42]:
# Load a pretrained model, specifying the number of labels in our dataset for fine-tuning
model_distilbert = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.

In [43]:
# Shuffle both splits with a fixed seed for reproducibility
train_dataset_distilbert = dataset_distilbert['train'].shuffle(seed=10)

eval_dataset_distilbert = dataset_distilbert['eval'].shuffle(seed=10)


In [44]:
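def compute_metrics(eval_pred):
    # eval_pred holds the model's raw logits and the true class indices
    logits, labels = eval_pred
    # Predicted class = index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    # Root mean squared error between predicted and true class indices
    rmse = np.sqrt(np.mean((predictions - labels)**2))
    return {"rmse": rmse}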
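RMSE treats the class indices as points on a scale, so predicting Negative for a Positive tweet costs more than predicting Neutral. Accuracy and macro-averaged F1 are more common choices for classification. The function below is a sketch of those two metrics in plain NumPy; `classification_metrics` is a hypothetical name, and the logits are made up for the example.

```python
import numpy as np


def classification_metrics(logits: np.ndarray, labels: np.ndarray, num_labels: int = 3) -> dict:
    """Compute accuracy and macro-averaged F1 from raw logits."""
    predictions = np.argmax(logits, axis=-1)
    accuracy = float(np.mean(predictions == labels))

    f1_scores = []
    for cls in range(num_labels):
        tp = np.sum((predictions == cls) & (labels == cls))
        fp = np.sum((predictions == cls) & (labels != cls))
        fn = np.sum((predictions != cls) & (labels == cls))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted examples contributes 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return {"accuracy": accuracy, "macro_f1": float(np.mean(f1_scores))}


# Hypothetical logits for four examples and three classes.
logits = np.array([[2.0, 0.1, 0.1], [0.1, 2.0, 0.1], [0.1, 0.1, 2.0], [2.0, 0.1, 0.1]])
labels = np.array([0, 1, 2, 1])
print(classification_metrics(logits, labels))
```

Because it takes logits and labels in the same order as compute_metrics unpacks them, a function like this could be adapted to report these values alongside RMSE.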

In [45]:
trainer_distilbert = Trainer(
    model=model_distilbert,
    args=training_args,
    train_dataset=train_dataset_distilbert,
    eval_dataset=eval_dataset_distilbert,
    compute_metrics=compute_metrics    # Metric function called at each evaluation
)

In [46]:
# Start fine-tuning
trainer_distilbert.train()



Step,Training Loss,Validation Loss,Rmse
500,0.7362,0.578388,0.65938
1000,0.4174,0.676862,0.669534
1500,0.158,0.935468,0.648683
2000,0.0692,1.177571,0.637414
2500,0.0364,1.2415,0.649838


TrainOutput(global_step=2500, training_loss=0.28343899002075196, metrics={'train_runtime': 1920.2829, 'train_samples_per_second': 41.661, 'train_steps_per_second': 1.302, 'total_flos': 1.059758088192e+16, 'train_loss': 0.28343899002075196, 'epoch': 10.0})
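The log shows validation loss rising after step 500 while training continues to step 2500, which suggests early stopping was not active in this run. The function below is a simplified re-implementation of patience-based stopping, not the library's code, and it assumes validation loss is the monitored metric; `first_stop_step` is a hypothetical name. It estimates where a patience of 3 and a threshold of 0.01 would have ended training.

```python
def first_stop_step(eval_log, patience=3, threshold=0.01):
    """Return the step at which patience-based early stopping would trigger.

    eval_log is a list of (step, validation_loss); lower loss is better.
    Returns None when the patience budget is never exhausted.
    """
    best = None
    bad_evals = 0
    for step, loss in eval_log:
        # An evaluation counts as an improvement only if it beats the best
        # loss so far by more than the threshold.
        if best is None or (loss < best and best - loss > threshold):
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
        if bad_evals >= patience:
            return step
    return None


# Validation losses copied from the training log above.
eval_log = [(500, 0.578388), (1000, 0.676862), (1500, 0.935468), (2000, 1.177571), (2500, 1.2415)]
print(first_stop_step(eval_log))  # 2000
```

Under these assumptions training would have stopped at step 2000, with the step-500 checkpoint as the best model by validation loss.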