<a href="https://colab.research.google.com/github/DeeeTeeee/AZUBISTORE/blob/master/Fine_tuning_Hugging_face_text_classification_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis with Hugging Face

Hugging Face is an open-source platform and provider of machine learning technologies. You can install its packages to access pre-built models and either use them directly or fine-tune them (that is, retrain them on your own dataset while leveraging the knowledge acquired during the initial training). You can then host your trained models on the platform so that you can use them later on other devices and in other apps.

Please [go to the website and sign in](https://huggingface.co/) to access all the features of the platform.

[Read more about Text classification with Hugging Face](https://huggingface.co/tasks/text-classification)

Hugging Face models are deep-learning based, so training them requires substantial GPU compute. Please use [Colab](https://colab.research.google.com/), another cloud GPU provider, or a local machine with an NVIDIA GPU.
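
Before starting a long run, it can help to confirm that a GPU is actually visible. The snippet below is a minimal sketch, assuming PyTorch is the backend in use; the helper name `gpu_available` is hypothetical and not part of any library.

```python
import importlib.util


def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed, so no GPU can be used through it
        return False
    import torch
    return torch.cuda.is_available()


print("GPU available:", gpu_available())
```

In Colab, if this prints `False`, the runtime type is probably set to CPU and can be changed in the runtime settings.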

## Application of Hugging Face Text classification model Fine-tuning

Find below a simple example, with just 10 epochs of fine-tuning (`num_train_epochs=10`).

Read more about the fine-tuning concept [here](https://deeplizard.com/learn/video/5T-iXNNiwIs#:~:text=Fine%2Dtuning%20is%20a%20way,perform%20a%20second%20similar%20task.).

In [76]:
# Install the required libraries
# !pip install sentencepiece
# !pip install transformers datasets

In [77]:
# Import libraries
import os

import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)



In [78]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [79]:
# Disable Weights & Biases (W&B) logging
os.environ["WANDB_DISABLED"] = "true"

In [80]:
# Load the training data
df = pd.read_csv('/content/drive/MyDrive/Natural Language Processing/zindi_challenge/data/Train.csv')

# Keep only the rows that contain no NaN values
df = df[~df.isna().any(axis=1)]


I manually split the training set into a training subset (the data the model learns from) and an evaluation subset (the data the model uses to compute metric scores, which helps detect training problems such as [overfitting](https://www.ibm.com/cloud/learn/overfitting)).

There are multiple ways to split the dataset. Further down, you'll see a few commented lines showing another one.
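
To make the idea of a stratified split concrete, here is a minimal pure-Python sketch. It is not how scikit-learn implements `train_test_split`; the function `stratified_split` and the toy data are made up for illustration. It only shows that each label keeps roughly the same share in both subsets.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, label_key, test_size=0.2, seed=42):
    """Split rows into (train, eval) while keeping label proportions similar."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train_rows, eval_rows = [], []
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        n_eval = round(len(label_rows) * test_size)
        eval_rows.extend(label_rows[:n_eval])
        train_rows.extend(label_rows[n_eval:])
    return train_rows, eval_rows


# Toy data (made up): 50 negative, 30 neutral, 20 positive examples
toy = [{"label": -1}] * 50 + [{"label": 0}] * 30 + [{"label": 1}] * 20
train_rows, eval_rows = stratified_split(toy, "label")
print(Counter(row["label"] for row in eval_rows))  # 10 negative, 6 neutral, 4 positive
```

The evaluation subset keeps the 50/30/20 ratio of the full toy set, which is what `stratify=df['label']` aims for in the cell below.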

In [81]:
# Split the train data => {train, eval}
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])

In [82]:
# Display five random rows of the training data
train.sample(5)

Unnamed: 0,tweet_id,safe_text,label,agreement
2613,9Q13B59P,#ILoveObama Because he is a Good Man in A den ...,0.0,1.0
7843,B9XJ9ICZ,“<user> Nearly two dozen medical studies prove...,-1.0,1.0
1797,4E5N4V80,<user> RT <user> Two confirmed cases of measle...,0.0,0.666667
7394,NH32H68N,"So on this Rosa de Guadalupe, the parents don'...",-1.0,0.333333
7225,URV849TL,42 measles cases linked to Disneyland <url>,0.0,1.0


In [83]:
# Display the first few rows of the evaluation data
eval.head()

Unnamed: 0,tweet_id,safe_text,label,agreement
6571,R7JPIFN7,Children's Museum of Houston to Offer Free Vac...,1.0,1.0
1754,2DD250VN,<user> no. I was properly immunized prior to t...,1.0,1.0
3325,ESEVBTFN,<user> thx for posting vaccinations are impera...,1.0,1.0
1485,S17ZU0LC,This Baby Is Exactly Why Everyone Needs To Vac...,1.0,0.666667
4175,IIN5D33V,"Meeting tonight, 8:30pm in room 322 of the stu...",1.0,1.0


In [84]:
eval.label.unique()

array([ 1., -1.,  0.])
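
The three raw label values are -1, 0 and 1, while a classification head trained with a cross-entropy loss expects class indices starting at 0. The sketch below shows one way to express the remapping as a lookup table. The names `LABEL_TO_ID` and `ID_TO_NAME` are hypothetical; the negative/neutral/positive reading follows the comments in the notebook's own `transform_labels` function.

```python
# Map the raw sentiment values to class indices 0..2
LABEL_TO_ID = {-1.0: 0, 0.0: 1, 1.0: 2}
ID_TO_NAME = {0: "negative", 1: "neutral", 2: "positive"}

raw_labels = [1.0, -1.0, 0.0]
class_ids = [LABEL_TO_ID[value] for value in raw_labels]

print(class_ids)                           # [2, 0, 1]
print([ID_TO_NAME[i] for i in class_ids])  # ['positive', 'negative', 'neutral']
```

Unlike `transform_labels`, which silently falls back to class 0, a dictionary lookup raises a `KeyError` on an unexpected label value, which can make data problems easier to spot.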

In [85]:
# Print the shapes of the new dataframes
print(f"Shape of the train dataframe: {train.shape}")
print(f"Shape of the eval dataframe: {eval.shape}")

Shape of the train dataframe: (7999, 4)
Shape of the eval dataframe: (2000, 4)


In [86]:
# Create the data directory if it doesn't exist
os.makedirs("../data", exist_ok=True)

# Save the split subsets
train.to_csv("../data/train_subset.csv", index=False)
eval.to_csv("../data/eval_subset.csv", index=False)


In [87]:
# Load the dataset from the CSV files.
# pandas writes CSV files as UTF-8 by default, so read them back with the same encoding.
dataset = load_dataset('csv',
                        data_files={'train': '../data/train_subset.csv',
                                    'eval': '../data/eval_subset.csv'},
                        encoding="utf-8")

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-0130c71d78512182/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-0130c71d78512182/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d. Subsequent calls will reuse this data.


In [88]:
eval

Unnamed: 0,tweet_id,safe_text,label,agreement
6571,R7JPIFN7,Children's Museum of Houston to Offer Free Vac...,1.0,1.000000
1754,2DD250VN,<user> no. I was properly immunized prior to t...,1.0,1.000000
3325,ESEVBTFN,<user> thx for posting vaccinations are impera...,1.0,1.000000
1485,S17ZU0LC,This Baby Is Exactly Why Everyone Needs To Vac...,1.0,0.666667
4175,IIN5D33V,"Meeting tonight, 8:30pm in room 322 of the stu...",1.0,1.000000
...,...,...,...,...
3219,J7IQVE6X,<user> Is Imus really telling people to chill ...,0.0,0.666667
2080,ICX9AF8Y,"After overseas trip, Wake resident could have ...",0.0,1.000000
8111,C9TSI9HK,How many kids at a school aren't immunized? Pl...,0.0,1.000000
9729,ZR5J41GP,Come &amp; Celebrate <user> #goodtime #mmr #dc...,0.0,0.666667


In [89]:

#!pip install transformers

# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# Other checkpoints you could try instead of "xlnet-base-cased":
# checkpoint = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
# checkpoint = "roberta-base"

# Load the tokenizer that matches the model checkpoint
tokenizer = AutoTokenizer.from_pretrained('xlnet-base-cased')
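
To see what `preprocess` does, here it is applied to a made-up tweet. The function body is copied from the cell above so the snippet runs on its own. Note that the sample rows shown earlier already contain `<user>` and `<url>` placeholders, so on this dataset the function probably changes very little, and the notebook defines it without calling it.

```python
def preprocess(text):
    """Replace user mentions and links with generic placeholders."""
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)


# Made-up example tweet
sample = "@jane_doe vaccines info at https://example.org/mmr @ noon"
print(preprocess(sample))
# @user vaccines info at http @ noon
```

The lone `@` is left untouched because of the `len(t) > 1` check.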

In [90]:
# Function to transform labels
def transform_labels(label):
    label = label['label']
    num = 0
    if label == -1: #'Negative'
        num = 0
    elif label == 0: #'Neutral'
        num = 1
    elif label == 1: #'Positive'
        num = 2

    return {'labels': num}

# Function to tokenize data
# Note: XLNet defines no maximum length, so padding='max_length' has no effect here
# (see the warning below); batches are padded later, when the Trainer is given the tokenizer.
def tokenize_data(example):
    return tokenizer(example['safe_text'], padding='max_length')

# Convert the tweets into tokens that the model can use
dataset = dataset.map(tokenize_data, batched=True)

# Transform labels and remove the columns that are no longer needed
remove_columns = ['tweet_id', 'label', 'safe_text', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)

Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.

In [91]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 7999
    })
    eval: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})
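
Because the tokenizer warning above says padding was skipped, the tokenized examples keep different lengths. When a `Trainer` receives the tokenizer, it pads each batch to the length of its longest sequence (dynamic padding). The sketch below illustrates the idea in pure Python; `pad_batch` is a hypothetical helper, and the pad id `0` and right-side padding are assumptions for illustration, not XLNet's actual settings.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every sequence in the batch to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[17, 42, 8], [5], [99, 3]]  # made-up token ids
padded = pad_batch(batch)
print(padded["input_ids"])       # [[17, 42, 8], [5, 0, 0], [99, 3, 0]]
print(padded["attention_mask"])  # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

Padding per batch rather than to one global maximum avoids spending compute on padding tokens when most tweets are short.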

In [93]:
# !pip install transformers[torch]
# !pip install "accelerate>=0.20.1"



In [94]:
from transformers import TrainingArguments

# Configure the training parameters, such as `num_train_epochs`:
# the number of times the training loop runs over the whole dataset
training_args = TrainingArguments(
    output_dir="test_trainer",
    num_train_epochs=10,
    load_best_model_at_end=True,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    logging_steps=100,
    per_device_train_batch_size=16,
)


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [95]:
from transformers import AutoModelForSequenceClassification

# Load a pretrained model, specifying the number of labels in our dataset for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=3)

Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.bias', 'lm_loss.weight']
- This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.weight', 'sequence_summary.summary.bias', 'logits_proj.weight', 'logits_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

In [96]:
# Training and evaluation datasets
train_dataset = dataset['train'].shuffle(seed=10)  # append .select(range(n)) to keep only n rows
eval_dataset = dataset['eval'].shuffle(seed=10)

# Another way to split: take both subsets from the shuffled train set,
# the first 80% of the rows for training and the remaining 20% for evaluation.
# num_rows = dataset['train'].num_rows
# shuffled = dataset['train'].shuffle(seed=10)
# train_dataset = shuffled.select(range(int(num_rows * .8)))
# eval_dataset = shuffled.select(range(int(num_rows * .8), num_rows))

In [97]:
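import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # RMSE as sqrt(MSE); `squared=False` is deprecated in recent scikit-learn
    rmse = np.sqrt(mean_squared_error(labels, predictions))
    return {"rmse": float(rmse)}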
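
To illustrate what the RMSE in `compute_metrics` measures when it is applied to class indices, here is a pure-Python sketch with made-up logits. The helper names `argmax` and `rmse_from_logits` are hypothetical.

```python
import math


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def rmse_from_logits(logits, labels):
    """Pick the most likely class per row, then compute RMSE against the labels."""
    predictions = [argmax(row) for row in logits]
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(labels))


# Made-up logits for four tweets, classes: 0 negative, 1 neutral, 2 positive
logits = [[2.0, 0.1, -1.0], [0.2, 1.5, 0.3], [0.9, 0.1, 0.4], [-0.5, 0.0, 2.2]]
labels = [0, 1, 2, 2]
print(rmse_from_logits(logits, labels))  # 1.0
```

Only the third tweet is misclassified, but predicting negative (0) for a positive (2) tweet contributes a squared error of 4. With this metric, confusing the two extremes costs more than confusing either of them with neutral.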

In [98]:
# Model Training Setup
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)


In [99]:
import numpy as np


In [100]:
# Launch the learning process: training
trainer.train()

You're using a XLNetTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Rmse
1,0.7385,0.620962,0.716589
2,0.5725,0.598116,0.630476
3,0.4262,0.765837,0.636789
4,0.3027,0.815537,0.650769
5,0.2225,0.979619,0.621691
6,0.1668,1.089235,0.614817
7,0.1204,1.248763,0.643428
8,0.0939,1.374636,0.64304
9,0.0743,1.408575,0.605392
10,0.055,1.424654,0.620886


TrainOutput(global_step=5000, training_loss=0.2772871994018555, metrics={'train_runtime': 1093.4133, 'train_samples_per_second': 73.156, 'train_steps_per_second': 4.573, 'total_flos': 2144486915325990.0, 'train_loss': 0.2772871994018555, 'epoch': 10.0})
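
The `global_step=5000` reported above can be checked with a little arithmetic. This sketch assumes a single device and no gradient accumulation; the numbers come from the dataframe shapes and training arguments shown earlier.

```python
import math

train_rows = 7999  # rows in the training subset
batch_size = 16    # per_device_train_batch_size
epochs = 10        # num_train_epochs

# The last, smaller batch still counts as one optimisation step
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 500 5000
```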

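Training ran for the full 10 epochs here. If you stop a run early to save time, the resulting `KeyboardInterrupt` traceback is nothing to worry about. Note that the validation loss is lowest at epoch 2 and rises afterwards while the training loss keeps falling, which is a sign of overfitting. Since `load_best_model_at_end=True`, the trainer reloads the checkpoint with the lowest validation loss when training ends.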
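
Which checkpoint counts as "best" depends on the metric used for the comparison. With `load_best_model_at_end=True` and no `metric_for_best_model` set, the `Trainer` compares checkpoints by validation loss. The sketch below replays that selection on the values copied from the training log above, and shows that RMSE would have picked a different epoch.

```python
# (epoch, validation loss, rmse) rows copied from the training log
history = [
    (1, 0.620962, 0.716589), (2, 0.598116, 0.630476), (3, 0.765837, 0.636789),
    (4, 0.815537, 0.650769), (5, 0.979619, 0.621691), (6, 1.089235, 0.614817),
    (7, 1.248763, 0.643428), (8, 1.374636, 0.643040), (9, 1.408575, 0.605392),
    (10, 1.424654, 0.620886),
]

best_by_loss = min(history, key=lambda row: row[1])
best_by_rmse = min(history, key=lambda row: row[2])
print("Best epoch by validation loss:", best_by_loss[0])  # 2
print("Best epoch by RMSE:", best_by_rmse[0])             # 9
```

If RMSE is the score that matters for the task, setting `metric_for_best_model="rmse"` together with `greater_is_better=False` in `TrainingArguments` would make the trainer keep the lowest-RMSE checkpoint instead.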

In [101]:
import numpy as np
import evaluate  # replaces the deprecated `datasets.load_metric`; install with: pip install evaluate

# Load the metric for evaluation
metric = evaluate.load("accuracy")

# Define a function to compute evaluation metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [102]:
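# Initialize the Trainer object with the model, training arguments, datasets, tokenizer, and metric function.
# Passing the tokenizer lets the Trainer pad each batch, which is needed because the examples were not padded.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)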

In [103]:
# Launch the final evaluation
trainer.evaluate()

In [None]:
import os

# Read the access token from an environment variable instead of hard-coding it
# (assumes HF_TOKEN was set beforehand; never commit a token to a notebook)
token = os.environ["HF_TOKEN"]
repo_name = "SentimentAnalysis"
huggingface_account = "DeeeTeeee01"
repo_id = f"{huggingface_account}/{repo_name}"

# Push the fine-tuned model and tokenizer that are already in memory
# (newer transformers releases name the token argument `token`)
model.push_to_hub(repo_id, use_auth_token=token, commit_message="Pushed model")
tokenizer.push_to_hub(repo_id, use_auth_token=token, commit_message="Pushed tokenizer")


During training, a checkpoint of the model is automatically saved locally in `test_trainer/` at the end of each epoch.

You may also upload the model to the Hugging Face platform. [Read more](https://huggingface.co/docs/hub/models-uploading)
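
When the uploaded model is later used in an app, its raw output for each tweet is a vector of three logits, one per class. The sketch below shows one way to turn such a vector into a readable sentiment with a confidence score. The helper names and the logits are made up; the class order (0 negative, 1 neutral, 2 positive) follows the label mapping used in this notebook.

```python
import math

ID_TO_NAME = {0: "negative", 1: "neutral", 2: "positive"}


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for numerical stability."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_sentiment(logits):
    """Return the most likely sentiment name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID_TO_NAME[best], probs[best]


label, confidence = to_sentiment([-1.2, 0.3, 2.5])  # made-up logits
print(label, round(confidence, 3))
```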

This notebook is inspired by an article: [Fine-Tuning Bert for Tweets Classification ft. Hugging Face](https://medium.com/mlearning-ai/fine-tuning-bert-for-tweets-classification-ft-hugging-face-8afebadd5dbf)

Do not hesitate to read more and to ask questions; learning is a lifelong activity.