# Sentiment Analysis with Hugging Face

Hugging Face is an open-source community and platform provider for machine learning technologies. You can install its packages to access pre-built models, use them directly or fine-tune them (retrain them on your own dataset while leveraging the knowledge gained during the initial training), and then host your trained models on the platform so that you can use them later on other devices and in other apps.

Please [go to the website and sign in](https://huggingface.co/) to access all the features of the platform.

[Read more about Text classification with Hugging Face](https://huggingface.co/tasks/text-classification)

The Hugging Face models are deep learning based, so training them requires a lot of GPU compute. Please use [Colab](https://colab.research.google.com/), another GPU cloud provider, or a local machine with an NVIDIA GPU.

In [1]:
!pip install huggingface_hub transformers datasets gradio

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting huggingface_hub
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting datasets
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting gradio
  Downloading gradio-3.27.0-py3-none-any.whl (17.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-c

In [2]:
!huggingface-cli login

# Paste your access token at the prompt below; never store a token in the notebook itself.


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid.
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
# Import libraries
import os
import uuid
import pandas as pd
import numpy as np
from scipy.special import softmax
import gradio as gr

from google.colab import drive
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from transformers import (
    AutoTokenizer,
    AutoConfig, 
    AutoModelForSequenceClassification,
    TFAutoModelForSequenceClassification,
    IntervalStrategy,
    TrainingArguments,
    EarlyStoppingCallback,
    pipeline,
    Trainer
) 


In [4]:
drive.mount('/content/drive')

Mounted at /content/drive


## Fine-tuning a Hugging Face text classification model

Find below a simple example, with just `7 epochs of fine-tuning`.

Read more about the fine-tuning concept [here](https://deeplizard.com/learn/video/5T-iXNNiwIs#:~:text=Fine%2Dtuning%20is%20a%20way,perform%20a%20second%20similar%20task.).

The datasets package is a Python library that provides access to a large collection of datasets, including many natural language processing (NLP) datasets commonly used for research and development. The library is designed to provide easy access to these datasets, as well as a uniform interface for loading, preprocessing, and working with the data.

The datasets include a range of tasks such as text classification, question answering, named entity recognition, and sentiment analysis, and cover a variety of languages including English, Spanish, French, Chinese, and many others. Some of the popular datasets included in the package are IMDB, COCO, SQuAD, Multi30k, Wikipedia, and Amazon Reviews.

The datasets package is developed by Hugging Face, a company that specializes in NLP and provides a suite of libraries and tools for working with NLP models.




This code sets the environment variable "WANDB_DISABLED" to "true", which disables the Weights & Biases (W&B) integration. W&B is a third-party tool for tracking and visualizing the training progress of machine learning models; setting this variable tells the training code not to use it. Note that the Transformers version used here warns that this variable is deprecated in favor of the `report_to` setting (for instance `report_to="none"`).

In [5]:
# Disable W&B
os.environ["WANDB_DISABLED"] = "true"

In [6]:
# Load the dataset and display some values

# Load the CSV file into a DataFrame

url = "https://github.com/Azubi-Africa/Career_Accelerator_P5-NLP/raw/master/zindi_challenge/data/Train.csv"

df = pd.read_csv(url)


## Data quality checks

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001 entries, 0 to 10000
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   tweet_id   10001 non-null  object 
 1   safe_text  10001 non-null  object 
 2   label      10000 non-null  float64
 3   agreement  9999 non-null   float64
dtypes: float64(2), object(2)
memory usage: 312.7+ KB


In [8]:
# Count missing values per column
df.isnull().sum()

tweet_id     0
safe_text    0
label        1
agreement    2
dtype: int64

In [9]:
# Select rows with missing values
df[df.isnull().any(axis=1)]

Unnamed: 0,tweet_id,safe_text,label,agreement
4798,RQMQ0L2A,#lawandorderSVU,,
4799,I cannot believe in this day and age some pare...,1,0.666667,


In [10]:
# Extract complete text from 'safe_text' column
complete_text = df.iloc[4798]['safe_text']
complete_text

'#lawandorderSVU '

In [11]:
# Fill in the missing label and agreement values for row 4798
df.loc[4798, 'label'] = 0
df.loc[4798, 'agreement'] = 0.666667

# Write the text back with .iloc[] (this leaves safe_text unchanged)
df.iloc[4798, df.columns.get_loc('safe_text')] = complete_text


In [12]:
df.iloc[4798]

tweet_id             RQMQ0L2A
safe_text    #lawandorderSVU 
label                     0.0
agreement            0.666667
Name: 4798, dtype: object

In [13]:
# Generate a random UUID string to use as the tweet_id.
# UUIDs are often used to generate unique IDs for entities,
# track unique user sessions, or create unique file names.
rand_tweet_id = str(uuid.uuid4())

# Select the row by index and assign values to its columns
row_index = 4799
df.loc[row_index, 'tweet_id'] = rand_tweet_id
df.loc[row_index, 'label'] = 1
df.loc[row_index, 'agreement'] = 0.666667

# Note: in this row the tweet text was stored in the tweet_id column, so it is
# overwritten above. The line below copies safe_text (column 1) onto itself,
# which leaves the value '1' in place.
df.iloc[row_index, df.columns.get_loc('safe_text')] = df.iloc[row_index, 1]


In [14]:
df.iloc[4799]

tweet_id     a10264c4-1e99-49f8-9c6a-da51ee96b2c2
safe_text                                       1
label                                         1.0
agreement                                0.666667
Name: 4799, dtype: object
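The output above shows that `safe_text` for row 4799 ends up as `1`: the values of that row appear to be shifted one column to the left, so the tweet text originally sat in `tweet_id`. A minimal sketch of one way to repair such a row is to move each value one column to the right before assigning a new ID. The toy DataFrame and its values below are hypothetical and only mirror the shape of the problem.

```python
import uuid
import pandas as pd

# Hypothetical toy data: row 1 has its values shifted one column to the left.
toy = pd.DataFrame({
    "tweet_id": ["AAAA1111", "Some tweet text that landed in the wrong column"],
    "safe_text": ["A normal tweet", "1"],
    "label": [0.0, 0.666667],
    "agreement": [1.0, None],
})

row = 1
# Move values to the right, starting from the last column so nothing is lost.
toy.loc[row, "agreement"] = toy.loc[row, "label"]
toy.loc[row, "label"] = float(toy.loc[row, "safe_text"])
toy.loc[row, "safe_text"] = toy.loc[row, "tweet_id"]
# Only now replace the ID, after the text has been copied out of it.
toy.loc[row, "tweet_id"] = str(uuid.uuid4())

print(toy.loc[row])
```

The order matters: the new ID is assigned last, so the text is copied before its source column is overwritten.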

In [15]:
# Sum over fully duplicated rows; all zeros means there are none
df[df.duplicated()].sum()

tweet_id     0.0
safe_text    0.0
label        0.0
agreement    0.0
dtype: float64

I manually split the training set into a training subset (the data the model learns from) and an evaluation subset (the data used to compute metric scores, which helps detect training problems such as [overfitting](https://www.ibm.com/cloud/learn/overfitting)).

There are multiple ways to split the dataset. You'll see two commented lines showing another one.

In [16]:
# Split the train data => {train, eval}
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])
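As one of the other ways to split a dataset, a random split can also be done with pandas alone. This is only a sketch on a hypothetical toy DataFrame; unlike the `stratify` option used in the cell above, it does not preserve the label proportions in each subset.

```python
import pandas as pd

# Hypothetical toy data standing in for the tweets.
toy_df = pd.DataFrame({
    "safe_text": [f"tweet {i}" for i in range(10)],
    "label": [-1, 0, 1, 0, 1, 0, 1, -1, 0, 1],
})

# Draw 20% of the rows at random for evaluation, keep the rest for training.
eval_part = toy_df.sample(frac=0.2, random_state=42)
train_part = toy_df.drop(eval_part.index)

print(train_part.shape, eval_part.shape)
```

Fixing `random_state` keeps the split reproducible across runs.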

In [17]:
train.head()

Unnamed: 0,tweet_id,safe_text,label,agreement
1641,CQDD6QLM,"New <user> ""Hey Love"" #MMR #ManyMenRecords #Yo...",0.0,1.0
3907,5GV8NEZS,S1256 [NEW] Extends exemption from charitable ...,0.0,1.0
336,I4D043ST,<user> esp when mercury free vaccines are avai...,1.0,0.666667
6861,CKX52Y8G,"My Life, Your Entertainment #YOTC #MMR @ Exoti...",0.0,1.0
720,07S3NL2T,Baby Luna is sore from her vaccines :( #poorpuppy,0.0,0.666667


In [18]:
eval.head()

Unnamed: 0,tweet_id,safe_text,label,agreement
5818,Y8PQ0BT7,So nervous... The baby's getting vaccines... (...,1.0,0.666667
7842,C9Z6JBSS,AIDS N : A malaria vaccine in children with HI...,0.0,0.666667
880,0VE4NWWQ,Measles Outbreak Hits Texas Church That Preach...,1.0,0.666667
9072,RHQRUF14,Thank you <user> for mtg with your staff. We l...,1.0,1.0
288,ZWEP2IL4,Health district offers no-cost immunizations f...,1.0,0.666667


In [19]:
print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")

new dataframe shapes: train is (8000, 4), eval is (2001, 4)


By saving the subsets as CSV files, you can easily load them into your machine learning framework of choice (e.g., PyTorch, TensorFlow) and preprocess the data as needed for your specific task. Additionally, saving the subsets as separate files allows you to easily swap in new training or evaluation data as needed during the development process.

In [20]:
# Save the split subsets

# Define file path

file_path = "/content/drive/MyDrive/Colab Notebooks/NLP Sentiment Analysis /"

#"/content/drive/MyDrive/NLP-Sentiment-Classification "

train.to_csv(os.path.join(file_path, "train_subset.csv"), index=False)
eval.to_csv(os.path.join(file_path, "eval_subset.csv"), index=False)

In [21]:
# Load the CSV files into a dataset

from datasets import load_dataset

dataset = load_dataset('csv', data_files={
    'train': file_path + 'train_subset.csv',
    'eval': file_path + 'eval_subset.csv'
}, encoding='ISO-8859-1')

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-f625ce4de5a059ae/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-f625ce4de5a059ae/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.



Transformers is a Python library for natural language processing (NLP) developed by Hugging Face. It provides an easy-to-use interface for building and training state-of-the-art deep learning models for a variety of NLP tasks, such as text classification, named entity recognition, question answering, and more.

The transformer architecture is a type of neural network that is particularly well-suited for processing sequential data, such as natural language text. It has largely replaced the recurrent neural networks (RNNs) and convolutional neural networks (CNNs) that were previously used for many NLP tasks, and has achieved state-of-the-art performance on a wide range of benchmarks.

The Transformers library provides pre-trained transformer models that can be fine-tuned on a specific NLP task with only a small amount of task-specific data. This allows developers to easily leverage the power of transformer models for their own NLP tasks, even if they do not have access to large amounts of training data or high-performance computing resources.

A tokenizer is a component in natural language processing (NLP) that breaks down text into individual tokens, which are usually words or subwords. Tokenization is an important preprocessing step in many NLP tasks, because it converts raw text data into a format that can be easily processed by machine learning models.

There are different types of tokenizers that can be used, depending on the specific requirements of the task. Some common types include:

Word tokenizers: These tokenize text into individual words based on whitespace or punctuation.

Subword tokenizers: These tokenize text into subwords, which can be useful for handling out-of-vocabulary words or words that are rare in the training data.

Character tokenizers: These tokenize text into individual characters, which can be useful for languages that have complex orthographies or for handling misspellings.
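The three tokenizer types can be illustrated with a few lines of plain Python. This is a toy sketch, not the algorithm used by any particular library: the helper names and the tiny vocabulary are hypothetical, and the subword function is a simplified greedy longest-match scheme in the style of WordPiece, where `##` marks a piece that continues a word.

```python
import re

def word_tokenize(text):
    # Split into words and standalone punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Every character becomes its own token.
    return list(text)

def greedy_subword_tokenize(word, vocab):
    # Repeatedly take the longest vocabulary entry that matches at the current position.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no segmentation possible with this vocabulary
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"vacc", "##ine", "##s", "un", "##safe"}

print(word_tokenize("Vaccines work!"))
print(char_tokenize("hi!"))
print(greedy_subword_tokenize("vaccines", toy_vocab))
print(greedy_subword_tokenize("work", toy_vocab))
```

The subword version splits a word it has never seen as a whole into known pieces, and falls back to an unknown token only when no pieces fit.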

AutoTokenizer is used to instantiate a tokenizer. It is a class in the Transformers library that automatically selects the appropriate tokenizer for a given pre-trained model, based on the configuration files stored with the checkpoint. This is useful when working with a variety of pre-trained models, because you get the matching tokenizer without having to select one manually for each model.

In [22]:

# Instantiate the tokenizer that matches the pre-trained DistilBERT model
# with the distilbert-base-uncased configuration.
tokenizer_distilbert = AutoTokenizer.from_pretrained('distilbert-base-uncased')

Specifically, AutoTokenizer.from_pretrained() is a method in the Transformers library that loads the pre-trained tokenizer for a specific model checkpoint. In this case, it is called with the argument 'distilbert-base-uncased', the name of a pre-trained DistilBERT model, which is a smaller, distilled version of BERT trained on English text.

The uncased configuration refers to a version of the model with a lowercase vocabulary: the tokenizer lowercases the text, so "Vaccine" and "vaccine" map to the same tokens. This suits noisy text such as tweets, where capitalization is inconsistent. A cased model can be preferable when the case of words carries meaning, as in named entity recognition.

By instantiating the tokenizer for distilbert-base-uncased with AutoTokenizer.from_pretrained(), you tokenize text according to the same scheme used during pre-training. This matters when fine-tuning the pre-trained model on a specific task, because it ensures that the input data is pre-processed in the same way as the data used to train the original model.

In [23]:
# Define a function to transform the label values
def transform_labels(label):
    # Extract the label value
    label = label['label']
    # Map the label value to an integer value
    num = 0
    if label == -1: #'Negative'
        num = 0
    elif label == 0: #'Neutral'
        num = 1
    elif label == 1: #'Positive'
        num = 2
    # Return a dictionary with a single key-value pair
    return {'labels': num}


# Define a function to tokenize the text data
def tokenize_data3(example):
    # Tokenize the 'safe_text' values, padding every sequence to the model's maximum length
    return tokenizer_distilbert(example['safe_text'], padding='max_length')

# Add the integer 'labels' column, then tokenize the text in batches
dataset_out = dataset.map(transform_labels)

dataset_distilbert = dataset_out.map(tokenize_data3, batched=True)

# Define a list of column names to remove from the dataset
remove_columns = ['tweet_id', 'label', 'safe_text', 'agreement']

# Drop the columns the model does not need. transform_labels runs again here,
# which recomputes the same 'labels' values before 'label' is removed.
dataset_distilbert = dataset_distilbert.map(transform_labels, remove_columns=remove_columns)


The columns specified in remove_columns are removed from the dataset because they are not needed for the subsequent analysis or model training.

tweet_id: This column contains unique identifiers for each tweet, which are not relevant for the analysis or modeling.

label: This column contains the original label values, which have already been transformed into numerical values using the transform_labels function.

safe_text: This column contains the tweet text. It has already been tokenized and encoded into input_ids and attention_mask, so the original string is not needed for subsequent analysis or modeling.

agreement: This column indicates the level of agreement among the annotators for each tweet. While this information might be useful for some analyses, it is not necessary for the sentiment analysis task at hand.

By removing these columns, the resulting dataset is more compact and easier to work with, while retaining all the relevant information for the sentiment analysis task.

In [24]:
dataset

DatasetDict({
    train: Dataset({
        features: ['tweet_id', 'safe_text', 'label', 'agreement'],
        num_rows: 8000
    })
    eval: Dataset({
        features: ['tweet_id', 'safe_text', 'label', 'agreement'],
        num_rows: 2001
    })
})

In [25]:
# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',                          # Directory where the model checkpoints and evaluation results will be stored
    evaluation_strategy=IntervalStrategy.STEPS,      # Interval for evaluating the model during training (every specified number of steps)
    save_strategy=IntervalStrategy.STEPS,            # Interval for saving the model during training (every specified number of steps)
    save_steps=500,                                  # Number of steps between two saves
    load_best_model_at_end=True,                     # Whether to load the best model at the end of training
    num_train_epochs=7,                              # Number of training epochs
    per_device_train_batch_size=4,                   # Batch size per GPU for training
    per_device_eval_batch_size=4,                    # Batch size per GPU for evaluation
    learning_rate=3e-5,                              # Learning rate
    weight_decay=0.01,                               # Weight decay
    warmup_steps=500,                                # Number of warmup steps
    logging_steps=500,                               # Number of steps between two logs
    fp16=True,                                       # Whether to use 16-bit precision
    gradient_accumulation_steps=16,                  # Number of steps to accumulate gradients before performing an optimizer step
    dataloader_num_workers=2,                        # Number of workers to use for loading data
    push_to_hub=True,                                # Whether to push the model checkpoints to the Hugging Face hub
    hub_model_id="Abubakari/finetuned-Sentiment-classfication-DISTILBERT-model",  # Model ID to use when pushing the model to the Hugging Face hub 
)

# Alternative hub_model_id values for the other base models:
#   "finetuned-Sentiment-classfication-ROBERTA-model"
#   "finetuned-Sentiment-classfication-BERT-model"
#   "finetuned-Sentiment-classfication-DISTILBERT-model"

# Define the early stopping callback
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=3,                       # Number of consecutive evaluations with no improvement before stopping training
    early_stopping_threshold=0.01,                   # Minimum improvement in the metric for considering an improvement
)

# Note: callbacks are not part of TrainingArguments. For early stopping to take
# effect, the callback must be passed to the Trainer via callbacks=[early_stopping].


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
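The early stopping settings above combine a patience with a minimum improvement threshold. A simplified sketch of that idea in plain Python, assuming a metric where lower is better (such as validation loss); `should_stop` is a hypothetical helper written for illustration, not the library's implementation, and the loss values are made up.

```python
def should_stop(eval_losses, patience=3, threshold=0.01):
    """Return True if training would stop early on this sequence of evaluation losses."""
    best = None
    evals_without_improvement = 0
    for loss in eval_losses:
        # An evaluation counts as an improvement only if it beats the best
        # value seen so far by more than the threshold.
        if best is None or best - loss > threshold:
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return True
    return False

# Hypothetical loss curves.
print(should_stop([0.60, 0.56, 0.555, 0.556, 0.557]))  # plateau after the second evaluation
print(should_stop([0.60, 0.50, 0.40, 0.30]))           # keeps improving
```

Patience is counted in evaluations, not epochs. The training log further below reports a single evaluation at step 500 in a run of 875 steps, so with these settings a patience of 3 cannot be reached in this run.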


Explanation:

from transformers import IntervalStrategy, TrainingArguments: Importing the IntervalStrategy and TrainingArguments classes from the transformers library.

training_args = TrainingArguments(: Creating a TrainingArguments object and assigning it to the variable training_args.

output_dir='./results': Specifies the directory where the training results will be saved.

evaluation_strategy=IntervalStrategy.STEPS: Specifies how often the model will be evaluated during training. In this case, the model will be evaluated at specific intervals.

save_strategy=IntervalStrategy.STEPS: Specifies how often the model will be saved during training. In this case, the model will be saved at specific intervals.

save_steps=500: Specifies how often the model will be saved during training, in terms of the number of steps taken. In this case, the model will be saved every 500 steps.

load_best_model_at_end=True: Specifies whether to load the best checkpoint found during training once training ends. If set to False, the model keeps the weights from the final training step.

num_train_epochs=7: Specifies the number of epochs for training the model. In this case, the model will be trained for 7 epochs.

per_device_train_batch_size=4: Specifies the batch size per device for training. In this case, each training batch will contain 4 examples.

per_device_eval_batch_size=4: Specifies the batch size per device for evaluation. In this case, each evaluation batch will contain 4 examples.

gradient_accumulation_steps=16: Specifies how many batches are processed before each optimizer step. In this case, each update uses 4 × 16 = 64 examples per device.
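The batch size, gradient accumulation, and epoch settings together determine how many optimizer steps the run takes. The arithmetic below is a sketch using the values from the TrainingArguments cell and the 8000-row training subset; the single-device count is an assumption.

```python
import math

num_train_examples = 8000            # rows in the training subset
per_device_train_batch_size = 4
gradient_accumulation_steps = 16
num_train_epochs = 7
num_devices = 1                      # assumption: a single GPU

# Examples that contribute to one optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Batches the data loader yields per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(num_train_examples / (per_device_train_batch_size * num_devices))
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_updates = updates_per_epoch * num_train_epochs

print(effective_batch_size, updates_per_epoch, total_updates)  # 64 125 875
```

The total of 875 matches the `global_step=875` reported in the training output at the end of the notebook.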

In [26]:

# AutoModelForSequenceClassification is a class in the Transformers library used for sequence classification tasks,
# where the input is a sequence of text and the output is a label or category assigned to that sequence.
#
# Its benefit is that it automatically selects the appropriate model architecture
# based on the configuration of the specified checkpoint, so you can fine-tune pre-trained models
# for various sequence classification tasks without having to pick the model class manually.

# Load a pre-trained model, specifying the number of labels in our dataset for fine-tuning
model_distilbert = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)

# Sentiment analysis is a common use case for sequence classification,
# where the goal is to classify text into categories such as positive, negative, or neutral sentiment.
# Therefore, AutoModelForSequenceClassification is a suitable choice for building a sentiment analysis model with DistilBERT.

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi
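With `num_labels=3`, the classification head produces three raw scores (logits) per text. A minimal sketch of how such scores can be turned into probabilities and a sentiment name, using NumPy only. The logits are hypothetical values, and the `id2label` mapping follows the class ids assigned in `transform_labels` (0 = Negative, 1 = Neutral, 2 = Positive).

```python
import numpy as np

# Class ids as assigned by transform_labels in this notebook.
id2label = {0: "Negative", 1: "Neutral", 2: "Positive"}

def softmax(x):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.array([[-1.2, 0.3, 2.5]])      # hypothetical model output for one tweet
probs = softmax(logits)                     # each row sums to 1
pred = int(np.argmax(probs, axis=-1)[0])    # index of the most likely class

print(id2label[pred], probs.round(3))
```

Softmax does not change which class has the highest score, so `argmax` over the logits gives the same prediction; the probabilities are mainly useful for reporting confidence.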

In [27]:
# Select the 'train' subset and shuffle it with a fixed seed of 10, so the
# samples are presented to the model in a reproducible random order during training.
train_dataset_distilbert = dataset_distilbert['train'].shuffle(seed=10)

eval_dataset_distilbert = dataset_distilbert['eval'].shuffle(seed=10)

## Another way to split: shuffle a single split and select index ranges.
# Use range(int(num_rows * .8)) for the first 80% and
# range(int(num_rows * .8), num_rows) for the remaining 20%, for example:
# train_dataset = dataset['train'].shuffle(seed=10).select(range(40000))
# eval_dataset = dataset['train'].shuffle(seed=10).select(range(40000, 41000))

In [28]:
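def compute_metrics(eval_pred):
    # eval_pred holds the model logits and the true class ids
    logits, labels = eval_pred
    # Pick the highest-scoring class for each example
    predictions = np.argmax(logits, axis=-1)
    # Root mean squared error between predicted and true class ids
    rmse = np.sqrt(np.mean((predictions - labels)**2))
    return {"rmse": rmse}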
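RMSE on class ids treats the classes as ordered, so confusing Negative (0) with Positive (2) costs more than confusing Negative with Neutral (1). A sketch of how accuracy could be reported next to it, as a complementary view; `rmse_and_accuracy` is a hypothetical helper and the logits and labels are made-up values.

```python
import numpy as np

def rmse_and_accuracy(logits, labels):
    # Predicted class id is the index of the largest logit in each row.
    predictions = np.argmax(logits, axis=-1)
    rmse = float(np.sqrt(np.mean((predictions - labels) ** 2)))
    accuracy = float(np.mean(predictions == labels))
    return {"rmse": rmse, "accuracy": accuracy}

# Hypothetical logits for four tweets and their true class ids.
logits = np.array([
    [2.0, 0.1, 0.0],   # predicted class 0
    [0.1, 1.5, 0.2],   # predicted class 1
    [0.0, 0.2, 3.0],   # predicted class 2
    [1.0, 0.5, 0.1],   # predicted class 0, true class is 2
])
labels = np.array([0, 1, 2, 2])

metrics = rmse_and_accuracy(logits, labels)
print(metrics)  # {'rmse': 1.0, 'accuracy': 0.75}
```

The single mistake is off by two classes, which is why the RMSE is as large as 1.0 even though three of four predictions are correct.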


In [29]:
trainer_distilbert = Trainer(
    model=model_distilbert, 
    args=training_args, 
    train_dataset=train_dataset_distilbert, 
    eval_dataset=eval_dataset_distilbert,
    compute_metrics=compute_metrics,   # Report RMSE at every evaluation
    callbacks=[early_stopping],        # Register the early stopping callback defined earlier
)


Cloning https://huggingface.co/Abubakari/finetuned-Sentiment-classfication-DISTILBERT-model into local empty directory.



In [30]:
trainer_distilbert.train()



| Step | Training Loss | Validation Loss | RMSE |
|------|---------------|-----------------|------|
| 500 | 0.6862 | 0.564407 | 0.623943 |


TrainOutput(global_step=875, training_loss=0.4962402605329241, metrics={'train_runtime': 1081.3797, 'train_samples_per_second': 51.786, 'train_steps_per_second': 0.809, 'total_flos': 7418306617344000.0, 'train_loss': 0.4962402605329241, 'epoch': 7.0})

In [31]:
# Evaluate the model
eval_results = trainer_distilbert.evaluate()

# Create a dictionary of the evaluation results
results_dict = {
    "Model": "distilbert",
    "Loss": eval_results["eval_loss"],
    "RMSE": eval_results["eval_rmse"],
    "Runtime": eval_results["eval_runtime"],
    "Samples Per Second": eval_results["eval_samples_per_second"],
    "Steps Per Second": eval_results["eval_steps_per_second"],
    "Epoch": eval_results["epoch"]
}

# Create a pandas DataFrame from the dictionary
results_df = pd.DataFrame([results_dict])

# Print the results
print(results_df)


        Model      Loss      RMSE  Runtime  Samples Per Second  \
0  distilbert  0.564407  0.623943   12.848             155.744   

   Steps Per Second  Epoch  
0            38.994    7.0  
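
The RMSE column above is the root mean squared error between predicted and true labels. The notebook's own metric function is defined earlier and is not shown in this section, so the following is only a minimal sketch of the formula, assuming the metric compares integer class ids (0 = negative, 1 = neutral, 2 = positive). The function name `rmse` and the sample values are made up for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical class ids, not taken from the notebook's dataset
y_true = [0, 1, 2, 2]
y_pred = [0, 2, 2, 0]
print(rmse(y_true, y_pred))
```

Because the class ids are ordered, confusing "negative" with "positive" (an error of 2) is penalised more heavily than confusing either with "neutral" (an error of 1).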


In [None]:
# Select the row with the lowest RMSE, i.e. the best-performing model
best_model = results_df.loc[results_df['RMSE'].idxmin()]

print(best_model)


Model                  roberta
Loss                  0.596201
RMSE                  0.672885
Runtime                26.3948
Samples Per Second       75.81
Steps Per Second         25.27
Epoch                      1.0
Name: 1, dtype: object


In [None]:
# Find the name of the model with the lowest RMSE
best_model_name = results_df.loc[results_df['RMSE'].idxmin(), 'Model']
best_model_name

'roberta'

**Note that you should only push the best model to the Hugging Face Model Hub if you are satisfied with its performance.**

---

❌ ❌ ❌ ❌ ❌ ❌

In [34]:

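# Push the final fine-tuned model to the Hugging Face model hub.
# Note: the first positional argument of Trainer.push_to_hub is the commit message;
# the target repository is taken from the TrainingArguments (e.g. hub_model_id).
trainer_distilbert.push_to_hub("Abubakari/finetuned-Sentiment-classfication-DISTILBERT-model")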


To https://huggingface.co/Abubakari/finetuned-Sentiment-classfication-DISTILBERT-model
   caa9da4..17cf6fe  main -> main




In [37]:
tokenizer_distilbert.push_to_hub("Abubakari/finetuned-Sentiment-classfication-DISTILBERT-model")

CommitInfo(commit_url='https://huggingface.co/Abubakari/finetuned-Sentiment-classfication-DISTILBERT-model/commit/7ee79cee9fb45173eb2ad62ad0059bb7ed925aca', commit_message='Upload tokenizer', commit_description='', oid='7ee79cee9fb45173eb2ad62ad0059bb7ed925aca', pr_url=None, pr_revision=None, pr_num=None)

In [41]:
model_distilbert.push_to_hub("Abubakari/finetuned-Sentiment-classfication-DISTILBERT-model")

CommitInfo(commit_url='https://huggingface.co/Abubakari/finetuned-Sentiment-classfication-DISTILBERT-model/commit/70db80550b1eb698ffdcc7405afa1eddaca6960a', commit_message='Upload DistilBertForSequenceClassification', commit_description='', oid='70db80550b1eb698ffdcc7405afa1eddaca6960a', pr_url=None, pr_revision=None, pr_num=None)

### You can load your model from anywhere using from_pretrained!

In [43]:
from transformers import AutoTokenizer, pipeline

# Load the tokenizer that was pushed alongside the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained("Abubakari/finetuned-Sentiment-classfication-DISTILBERT-model")

# Build a text-classification pipeline around the fine-tuned model
model = pipeline("text-classification", model="Abubakari/finetuned-Sentiment-classfication-DISTILBERT-model", tokenizer=tokenizer)



Downloading (…)okenizer_config.json:   0%|          | 0.00/320 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

In [47]:
label_map = {0: "negative", 1: "neutral", 2: "positive"}

# Make predictions on some example text
result = model("I do n0t really care about these covid vaccines.")

# Map the numerical label to the corresponding class name
result[0]["label"] = label_map[int(result[0]["label"].split("_")[1])]

# Print the predicted label and score
print(result)

[{'label': 'negative', 'score': 0.3873116374015808}]
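
The label mapping in the cell above handles a single prediction. As a hedged sketch, the same idea can be wrapped in a small helper that renames every entry of a pipeline-style result list without modifying the original. The helper name `rename_labels` and the sample results are hypothetical, and the sketch assumes the pipeline returns default labels of the form `LABEL_<id>`, as the cell above does.

```python
LABEL_MAP = {0: "negative", 1: "neutral", 2: "positive"}

def rename_labels(results, label_map=LABEL_MAP):
    """Return a copy of pipeline-style results with 'LABEL_<id>' replaced by class names."""
    renamed = []
    for item in results:
        # 'LABEL_2' -> 2; rsplit guards against extra underscores in a prefix
        class_id = int(item["label"].rsplit("_", 1)[1])
        renamed.append({**item, "label": label_map[class_id]})
    return renamed

# Hand-written stand-in for pipeline output, for illustration only
fake_results = [
    {"label": "LABEL_0", "score": 0.39},
    {"label": "LABEL_2", "score": 0.91},
]
print(rename_labels(fake_results))
```

Returning a new list keeps the raw pipeline output available, which helps when the scores need to be inspected later.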


Checkpoints of the model are automatically saved locally in `test_trainer/` during training.

You may also upload the model to the Hugging Face platform. [Read more](https://huggingface.co/docs/hub/models-uploading)

This notebook is inspired by an article: [Fine-Tuning Bert for Tweets Classification ft. Hugging Face](https://medium.com/mlearning-ai/fine-tuning-bert-for-tweets-classification-ft-hugging-face-8afebadd5dbf)

Do not hesitate to read more and to ask questions; learning is a lifelong activity.

distilbert-base-uncased is a variant of the BERT model, which is a transformer-based neural network architecture designed for natural language processing (NLP) tasks such as text classification, named entity recognition, question answering, and more.

The "uncased" in the model name refers to the fact that the model was trained on lowercased text, meaning that the tokenizer will convert all text to lowercase before encoding it. The "base" refers to the size of the model, which has approximately 66 million parameters.

The main difference between distilbert-base-uncased and the original BERT model is that distilbert-base-uncased has been "distilled", or compressed, to be smaller and faster while retaining most of BERT's performance. A smaller student network was trained to reproduce the behaviour of the larger BERT teacher. The student keeps the same hidden size but has half as many transformer layers (6 instead of 12), and the token-type embeddings and the pooler are removed.

distilbert-base-uncased was pre-trained on a large corpus of English text (the same data as BERT: BookCorpus and English Wikipedia). Its training combines a distillation loss against the BERT teacher with a masked language modeling (MLM) objective. In MLM, some of the input tokens are masked, and the model is trained to predict the masked tokens from the surrounding context. This helps the model learn contextual representations of words, which are useful for many NLP tasks.
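
The masking step can be illustrated with a toy sketch. It is not the actual pre-training code: it works on whitespace-split words rather than WordPiece tokens, and it always substitutes `[MASK]`, whereas BERT-style pre-training selects about 15% of tokens and replaces only most of them with the mask token. The function name `mask_tokens` and its parameters are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"  # mask token used by BERT-style tokenizers

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly mask tokens.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original token the model would have to predict.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets

tokens = "the vaccine rollout started in many countries last week".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=42)
print(masked)   # input the model sees
print(targets)  # tokens the model is trained to recover
```

During pre-training, the loss is computed only at the masked positions, so the model has to rely on the unmasked context on both sides.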

The distilbert-base-uncased model can be fine-tuned on a variety of NLP tasks by adding a task-specific output layer on top of the pre-trained model and fine-tuning the entire network on task-specific data. This fine-tuning process allows the model to adapt to a particular task and reach competitive performance on many benchmarks, close to that of the larger BERT model.

In [None]:
import gradio as gr
from scipy.special import softmax  # assumed source of the softmax used below
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

# Path of the fine-tuned model on the Hugging Face model hub
model_path = "Abubakari/finetuned-Sentiment-classfication-model"

# Initialize the tokenizer (this assumes the model was fine-tuned from bert-base-cased)
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

# Load the configuration of the fine-tuned model
config = AutoConfig.from_pretrained(model_path)

# Load the fine-tuned model
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Define a function to preprocess the text data
def preprocess(text):
    new_text = []
    # Replace user mentions with '@user'
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        # Replace links with 'http'
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    # Join the preprocessed text
    return " ".join(new_text)

# Define a function to perform sentiment analysis on the input text
def sentiment_analysis(text):
    # Preprocess the input text
    text = preprocess(text)

    # Tokenize the input text using the pre-trained tokenizer
    encoded_input = tokenizer(text, return_tensors='pt')
    
    # Feed the tokenized input to the pre-trained model and obtain output
    output = model(**encoded_input)
    
    # Obtain the prediction scores for the output
    scores_ = output[0][0].detach().numpy()
    
    # Apply softmax activation function to obtain probability distribution over the labels
    scores_ = softmax(scores_)
    
    # Format the output dictionary with the predicted scores
    labels = ['Negative', 'Neutral', 'Positive']
    scores = {l:float(s) for (l,s) in zip(labels, scores_) }
    
    # Return the scores
    return scores

# Define a Gradio interface to interact with the model
demo = gr.Interface(
    fn=sentiment_analysis, # Function to perform sentiment analysis
    inputs=gr.Textbox(placeholder="Write your tweet here..."), # Text input field
    outputs="label", # Label component: shows the top class and the confidence of each class
    interpretation="default", # Interpretation mode (available in Gradio 3.x, removed in 4.x)
    examples=[["This is wonderful!"]]) # Example input(s) to display on the interface

# Launch the Gradio interface
demo.launch()
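
The softmax step in `sentiment_analysis` turns the raw model outputs (logits) into probabilities that sum to 1. The sketch below reproduces that step with the standard library only, so it can be run without downloading a model. The function name `softmax_list` and the logit values are made up for illustration and are not outputs of the fine-tuned model.

```python
import math

def softmax_list(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Negative", "Neutral", "Positive"]
logits = [0.2, -0.1, 1.3]  # hypothetical logits for one input text

probs = softmax_list(logits)
scores = {label: prob for label, prob in zip(labels, probs)}
top_label = max(scores, key=scores.get)

print(scores)     # dictionary in the shape returned by sentiment_analysis
print(top_label)  # class with the highest probability
```

A dictionary of this shape, mapping class names to float confidences, is what the Gradio `"label"` output expects from the wrapped function.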
