<a href="https://colab.research.google.com/github/DeeeTeeee/AZUBISTORE/blob/master/Fine_tuning_Hugging_face_text_classification_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis with Hugging Face

Hugging Face is an open-source and platform provider of machine learning technologies. You can use install their package to access some interesting pre-built models to use them directly or to fine-tune (retrain it on your dataset leveraging the prior knowledge coming with the first training), then host your trained models on the platform, so that you may use them later on other devices and apps.

Please, [go to the website and sign-in](https://huggingface.co/) to access all the features of the platform.

[Read more about Text classification with Hugging Face](https://huggingface.co/tasks/text-classification)

The Hugging face models are Deep Learning based, so will need a lot of computational GPU power to train them. Please use [Colab](https://colab.research.google.com/) to do it, or your other GPU cloud provider, or a local machine having NVIDIA GPU.

## Application of Hugging Face Text classification model Fune-tuning

Find below a simple example, with just `3 epochs of fine-tuning`.

Read more about the fine-tuning concept : [here](https://deeplizard.com/learn/video/5T-iXNNiwIs#:~:text=Fine%2Dtuning%20is%20a%20way,perform%20a%20second%20similar%20task.)

In [None]:
# #Install the datasets library
# !pip install datasets

In [70]:
# Import libraries
import os
import pandas as pd
from datasets import load_dataset
from sklearn.model_selection import train_test_split

In [71]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [72]:
# Disabe W&B
os.environ["WANDB_DISABLED"] = "true"

In [73]:
# Load the dataset and display some values
df = pd.read_csv('/content/drive/MyDrive/Natural Language Processing/zindi_challenge/data/Train.csv')

# A way to eliminate rows containing NaN values
df = df[~df.isna().any(axis=1)]


I manually split the training set to have a training subset ( a dataset the model will learn on), and an evaluation subset ( a dataset the model with use to compute metric scores to help use to avoid some training problems like [the overfitting](https://www.ibm.com/cloud/learn/overfitting) one ).

There are multiple ways to do split the dataset. You'll see two commented line showing you another one.

In [74]:
# Split the train data => {train, eval}
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])

In [None]:
# Display the first few rows of the training data
train.sample(5)

In [None]:
# Display the first few rows of the evaluation data
eval.head()

In [None]:
# Print the shapes of the new dataframes
print(f"Shape of the train dataframe: {train.shape}")
print(f"Shape of the eval dataframe: {eval.shape}")

In [78]:
# # Save splitted subsets
# train.to_csv("../data/train_subset.csv", index=False)
# eval.to_csv("../data/eval_subset.csv", index=False)
import os
import pandas as pd

# Create the data directory if it doesn't exist
if not os.path.exists("../data"):
    os.makedirs("../data")

# Save splitted subsets
train.to_csv("../data/train_subset.csv", index=False)
eval.to_csv("../data/eval_subset.csv", index=False)


In [None]:
# Load the dataset from CSV files
dataset = load_dataset('csv',
                        data_files={'train': '../data/train_subset.csv',
                        'eval': '../data/eval_subset.csv'}, encoding = "ISO-8859-1")

## VADER Seniment Scoring

We are using the NLTK's SentimentIntensityAnalyzer to get the neg/neu/pos scores of the text.

This uses a "bag of words" approach:
Stop words are removed
each word is scored and combined to a total score.

In [22]:
#!pip install nltk
#!pip install tqdm



In [None]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm

sia = SentimentIntensityAnalyzer()

In [None]:
# Assuming your eval dataframe is named "eval"
tweet_id = "R7JPIFN7"
safe_text_col = eval[eval["tweet_id"] == tweet_id]["safe_text"]
text = safe_text_col.iloc[0]  # Extract the string value from the Series

# Pass the text to sia.polarity_scores() for sentiment analysis
score = sia.polarity_scores(text)

# Print the sentiment score
print(score)


In [None]:
# Run the polarity score on the entire dataset
res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
    text = row['safe_text']
    myid = row['tweet_id']
    res[myid] = sia.polarity_scores(text)

In [None]:
#Metadata

# Assuming your dataframe is named "df" and the column is named "safe_text"
sentiment_scores = df['safe_text'].apply(lambda x: sia.polarity_scores(x))

# Create a new dataframe with the sentiment scores
score_df = pd.DataFrame(sentiment_scores.tolist())

# Print the new dataframe
score_df


In [83]:
vaders = pd.DataFrame(res).T
vaders = vaders.reset_index().rename(columns={'index': 'tweet_id'})
vaders = vaders.merge(df, how='left')

In [None]:
# Now we have sentiment score and metadata
vaders.head()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')

ax = sns.barplot(data=vaders, x='label', y='compound')
ax.set_title('Covid Tweet Label Review')
plt.show()

In [91]:

#!pip install transformers

# Rest of your code...

# Import the tokenizer from transformers library
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment')

In [None]:
# Function to transform labels
def transform_labels(label):
    label = label['label']
    num = 0
    if label == -1: #'Negative'
        num = 0
    elif label == 0: #'Neutral'
        num = 1
    elif label == 1: #'Positive'
        num = 2

    return {'labels': num}

# Function to tokenize data
def tokenize_data(example):
    return tokenizer(example['safe_text'], padding='max_length')

# Change the tweets to tokens that the models can exploit
dataset = dataset.map(tokenize_data, batched=True)

# Transform	labels and remove the useless columns
remove_columns = ['tweet_id', 'label', 'safe_text', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)

In [None]:
dataset

In [None]:
# dataset['train']

In [None]:
# !pip install transformers[torch]
#!pip install accelerate
#!pip install transformers[torch]


In [117]:
#from transformers import TrainingArguments
import transformers

# Configure the trianing parameters like `num_train_epochs`:
# the number of time the model will repeat the training loop over the dataset
# Training Arguments
#training_args = TrainingArguments("test_trainer", num_train_epochs=3000, load_best_model_at_end=True,)
# # Training Arguments
# training_args = TrainingArguments(
#     "test_trainer",
#     num_train_epochs=3000,
#     load_best_model_at_end=True,
# )
#from transformers import TrainingArguments

# Configure the training parameters like `num_train_epochs`:
# the number of times the model will repeat the training loop over the dataset
# Training Arguments
#!pip install torch

#get_ipython().system('pip install accelerate --upgrade')
# Install accelerate
# Install accelerate
#!pip install accelerate -U

# Define the training arguments
# Install accelerate
#!pip install accelerate -U

# Import transformers
import transformers

# Define the training arguments
def training_args():
    """Defines the training arguments for the sentiment analysis model."""

    args = transformers.TrainingArguments(
        output_dir="/content/drive/results",
        group_by_length=True,
        length_column_name="input_length",
        per_device_train_batch_size=24,
        gradient_accumulation_steps=2,
        evaluation_strategy="steps",
        num_train_epochs=20,
        fp16=True,
        save_steps=1000,
        save_strategy="steps",
        eval_steps=1000,
        logging_steps=1000,
        learning_rate=5e-5,
        warmup_steps=500,
        save_total_limit=3,
        load_best_model_at_end=True,
    )

    return args


# Run the training arguments
args = training_args()



# Run the training arguments
args = training_args()




ImportError: ignored

In [104]:
accelerate>=0.20.1

SyntaxError: ignored

In [None]:
from transformers import AutoModelForSequenceClassification

# Loading a pretrain model while specifying the number of labels in our dataset for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment", num_labels=3)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [None]:
# Train and Evaluation Datasets
train_dataset = dataset['train'].shuffle(seed=10) #.select(range(40000)) # to select a part
eval_dataset = dataset['eval'].shuffle(seed=10)

## other way to split the train set ... in the range you must use:
# # int(num_rows*.8 ) for [0 - 80%] and  int(num_rows*.8 ),num_rows for the 20% ([80 - 100%])
# train_dataset = dataset['train'].shuffle(seed=10).select(range(40000))
# eval_dataset = dataset['train'].shuffle(seed=10).select(range(40000, 41000))

In [None]:
# Model Training Setup
from transformers import Trainer

trainer = Trainer(
    model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset
)


# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     eval_dataset=eval_dataset
# )

In [None]:
# Launch the learning process: training
trainer.train()

***** Running training *****
  Num examples = 7999
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 3000
                                                   
  1%|          | 16/3000 [4:25:07<6:59:23,  8.43s/it] Saving model checkpoint to test_trainer/checkpoint-500
Configuration saved in test_trainer/checkpoint-500/config.json


{'loss': 0.7607, 'learning_rate': 4.166666666666667e-05, 'epoch': 0.5}


Model weights saved in test_trainer/checkpoint-500/pytorch_model.bin
                                                     
  1%|          | 16/3000 [7:16:40<6:59:23,  8.43s/it]  Saving model checkpoint to test_trainer/checkpoint-1000
Configuration saved in test_trainer/checkpoint-1000/config.json


{'loss': 0.6572, 'learning_rate': 3.3333333333333335e-05, 'epoch': 1.0}


Model weights saved in test_trainer/checkpoint-1000/pytorch_model.bin


KeyboardInterrupt: 

Don't worry the above issue, it is a `KeyboardInterrupt` that means I stopped the training to avoid taking a long time to finish.

In [None]:
import numpy as np
from datasets import load_metric

# Load the metric for evaluation
metric = load_metric("accuracy")

# Define a function to compute evaluation metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [None]:
# Initialize the Trainer object with the model, training arguments, datasets, and compute metrics function
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

In [None]:
# Launch the final evaluation
trainer.evaluate()


Downloading builder script: 4.21kB [00:00, 932kB/s]                    
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 8

[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A


{'eval_loss': 0.6274272203445435,
 'eval_accuracy': 0.7665,
 'eval_runtime': 546.3013,
 'eval_samples_per_second': 3.661,
 'eval_steps_per_second': 0.458}

Some checkpoints of the model are automatically saved locally in `test_trainer/` during the training.

You may also upload the model on the Hugging Face Platform... [Read more](https://huggingface.co/docs/hub/models-uploading)

This notebook is inspired by an article: [Fine-Tuning Bert for Tweets Classification ft. Hugging Face](https://medium.com/mlearning-ai/fine-tuning-bert-for-tweets-classification-ft-hugging-face-8afebadd5dbf)

Do not hesitaite to read more and to ask questions, the Learning is a lifelong activity.