Sentiment Analysis With Hugging Face

Hugging Face is a machine learning platform with an open-source ecosystem that hosts pre-trained models for many tasks. With the transformers package, you can load these models and either use them directly or fine-tune them on your own dataset. You can also host your trained models on the platform, so they can be loaded from other devices and applications.

Hosting a model on the platform requires an account, so visit the Hugging Face website and sign in before running the login step in this notebook.

Text classification is one of the tasks these models support: a deep learning model reads a piece of text and assigns it a class, such as its sentiment. Fine-tuning such a model requires substantial computational power, particularly GPU resources. You can use Colab, a GPU cloud provider, or a local machine equipped with an NVIDIA GPU to keep training and fine-tuning times reasonable.
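Before training, it can help to confirm that a GPU is actually visible to the machine. The sketch below is one minimal way to check, using only the standard library and the nvidia-smi tool that ships with NVIDIA drivers. The helper name has_nvidia_gpu is hypothetical, and the check assumes nvidia-smi is on the PATH; when PyTorch is installed, torch.cuda.is_available() is the more usual test.

```python
import shutil
import subprocess


def has_nvidia_gpu() -> bool:
    """Return True if nvidia-smi is installed and lists at least one GPU."""
    # Assumption: a working NVIDIA driver puts nvidia-smi on the PATH.
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # -L lists the detected GPUs, one per line
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("NVIDIA GPU detected:", has_nvidia_gpu())
```

If this prints False on Colab, the runtime type is probably still set to CPU.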

The rest of this notebook fine-tunes a pre-trained model for sentiment analysis. Visit the Hugging Face website to learn more about the available models.

In [None]:
!pip install transformers
!pip install datasets


Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [None]:
!pip install huggingface_hub transformers datasets gradio pipreqs

In [None]:
!pip install --upgrade huggingface_hub

In [None]:
!huggingface-cli login
# Paste your access token at the prompt; never store a token in the notebook itself
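A safer pattern than keeping a token in a cell is to read it from an environment variable. This is a minimal sketch under stated assumptions: the variable name HF_TOKEN and the helpers get_hf_token and mask are chosen here for illustration. The returned value could then be passed to huggingface_hub.login(token=...).

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read a Hugging Face access token from the environment."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before logging in.")
    return token


def mask(token: str) -> str:
    """Keep only the first four characters so the token never lands in cell output."""
    return token[:4] + "*" * max(len(token) - 4, 0)


# Placeholder value for demonstration only; it is not a real token
os.environ.setdefault("HF_TOKEN", "hf_example_placeholder")
print(mask(get_hf_token()))
```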

In [None]:
# Import libraries
import os
import uuid
import pandas as pd
import numpy as np
from scipy.special import softmax
import gradio as gr

from google.colab import drive
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    TFAutoModelForSequenceClassification,
    IntervalStrategy,
    TrainingArguments,
    EarlyStoppingCallback,
    pipeline,
    Trainer
)


In [None]:
drive.mount('/content/drive')

Setting up my environment

In [None]:
# Disable W&B
os.environ["WANDB_DISABLED"] = "true"

In [None]:
# Load the CSV file into a DataFrame
url = "https://github.com/Azubi-Africa/Career_Accelerator_P5-NLP/raw/master/zindi_challenge/data/Train.csv"

train = pd.read_csv(url)

In [None]:
train.info()

In [None]:
train.isnull().sum()

The label and agreement columns have missing values.

In [None]:
# Check the rows that have a missing value in any column
train[train.isna().any(axis=1)]


In [None]:
# Read the full text of row 4798
complete_text = train['safe_text'].iloc[4798]
complete_text

In [None]:
# Select row by index and assign values to columns
train.loc[4798, 'label'] = 0
train.loc[4798, 'agreement'] = 0.666667

# Use .iloc[] to write the text back to the safe_text column
train.iloc[4798, train.columns.get_loc('safe_text')] = complete_text

In [None]:
train.iloc[4798]

In [None]:
# Generate a random tweet ID (uuid is imported at the top of the notebook)
rand_tweet_id = str(uuid.uuid4())


In [None]:
row_index = 4799
train.loc[row_index, 'tweet_id'] = rand_tweet_id
train.loc[row_index, 'label'] = 1
train.loc[row_index, 'agreement'] = 0.666667


In [None]:
# safe_text for this row is left unchanged; display it to confirm
train.iloc[row_index, train.columns.get_loc('safe_text')]


In [None]:
train.iloc[4799]

In [None]:
train.duplicated().sum()

Splitting the dataset

In [None]:
# Split the train data => {train, eval}
train, eval = train_test_split(train, test_size=0.2, random_state=42, stratify=train['label'])
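Passing stratify=train['label'] keeps the label proportions the same in both subsets. The pure-Python sketch below illustrates that idea on toy labels. It is not how scikit-learn implements it, and stratified_split is a hypothetical helper written for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, eval_idx) so that each label is split in the same ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, eval_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                      # shuffle within each label group
        n_eval = round(len(indices) * test_size)  # same fraction for every label
        eval_idx.extend(indices[:n_eval])
        train_idx.extend(indices[n_eval:])
    return sorted(train_idx), sorted(eval_idx)


# Toy labels: 50 negative (-1), 30 neutral (0), 20 positive (1)
toy_labels = [-1] * 50 + [0] * 30 + [1] * 20
train_idx, eval_idx = stratified_split(toy_labels)

print("train:", Counter(toy_labels[i] for i in train_idx))
print("eval: ", Counter(toy_labels[i] for i in eval_idx))
```

Each label contributes 20% of its rows to the evaluation subset, so the class balance of the two subsets matches the full data.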

In [None]:
train.head()

In [None]:
eval.head()

In [None]:
print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")


In [None]:
import os

# Specify the directory path
directory = '/content/drive/MyDrive/Colab Notebooks/Sentiment Analysis'

# Create the directory if it does not exist
os.makedirs(directory, exist_ok=True)

# Save the dataframes as CSV files in the specified directory
train.to_csv(os.path.join(directory, "train_subset.csv"), index=False)
eval.to_csv(os.path.join(directory, "eval_subset.csv"), index=False)


In [None]:
from datasets import load_dataset

dataset = load_dataset('csv', data_files={
    'train': os.path.join(directory, 'train_subset.csv'),
    'eval': os.path.join(directory, 'eval_subset.csv')
}, encoding='utf-8')  # DataFrame.to_csv writes UTF-8 by default, so read the files back as UTF-8



In [None]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
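A tokenizer turns text into integer ids plus an attention mask that separates real tokens from padding. The toy sketch below mimics that output shape with a whitespace vocabulary. Everything in it is made up for illustration (the vocabulary, toy_tokenize, and max_length=8), and it does not reproduce the subword tokenization that bert-base-cased actually uses.

```python
PAD_ID, UNK_ID = 0, 1
# Hypothetical five-entry vocabulary; a real tokenizer has tens of thousands of entries
vocab = {"[PAD]": PAD_ID, "[UNK]": UNK_ID, "vaccines": 2, "are": 3, "safe": 4}


def toy_tokenize(text, max_length=8):
    """Map words to ids, then truncate or pad to exactly max_length."""
    ids = [vocab.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:max_length]          # truncate inputs that are too long
    n_pad = max_length - len(ids)   # number of padding positions to add
    return {
        "input_ids": ids + [PAD_ID] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }


encoded = toy_tokenize("Vaccines are safe today")
print(encoded["input_ids"])       # [2, 3, 4, 1, 0, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]
```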

In [None]:
# Define a function to transform the label values
def transform_labels(label):
    # Extract the label value
    label = label['label']
    # Map the label value to an integer value
    num = 0
    if label == -1: #'Negative'
        num = 0
    elif label == 0: #'Neutral'
        num = 1
    elif label == 1: #'Positive'
        num = 2
    # Return a dictionary with a single key-value pair
    return {'labels': num}

# Define a function to tokenize the text data
def tokenize_data(example):
    # Tokenize the 'safe_text' values, padding and truncating them to the model's maximum length
    return tokenizer(example['safe_text'], padding='max_length', truncation=True)

# Apply the transformation functions to the dataset using the 'map' method
# This transforms the label values and tokenizes the text data
dataset_out = dataset.map(transform_labels)

dataset_base = dataset_out.map(tokenize_data, batched=True)

# Define a list of column names to remove from the dataset
remove_columns = ['tweet_id', 'label', 'safe_text', 'agreement']

# The labels were already transformed above, so only the raw columns need to be dropped
dataset_base = dataset_base.remove_columns(remove_columns)
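The if/elif chain in transform_labels is equivalent to a dictionary lookup. A lookup can also fail loudly on an unexpected label instead of silently mapping it to 0. A minimal sketch, where LABEL_TO_ID, ID_TO_NAME, and to_class_id are hypothetical names:

```python
LABEL_TO_ID = {-1: 0, 0: 1, 1: 2}
ID_TO_NAME = {0: "Negative", 1: "Neutral", 2: "Positive"}


def to_class_id(example):
    """Map a raw label (-1, 0, 1) to a class index (0, 1, 2)."""
    raw = example["label"]
    # Labels read back from a CSV may arrive as floats such as -1.0
    if raw is None or raw != raw or int(raw) != raw or int(raw) not in LABEL_TO_ID:
        raise ValueError(f"Unexpected label: {raw!r}")
    return {"labels": LABEL_TO_ID[int(raw)]}


for raw in (-1.0, 0, 1):
    class_id = to_class_id({"label": raw})["labels"]
    print(raw, "->", class_id, ID_TO_NAME[class_id])
```

The raw != raw test catches NaN, which is how a missing label appears after loading the data with pandas.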

In [None]:
dataset

In [None]:
# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    save_strategy='epoch',           # Must match evaluation_strategy so the best checkpoint can be reloaded
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,     # Required by EarlyStoppingCallback
    metric_for_best_model='rmse',    # Key returned by compute_metrics
    greater_is_better=False,         # Lower RMSE is better
    fp16=False,            # Disable mixed-precision training
    fp16_full_eval=False,  # Disable FP16 half precision evaluation
)

# Possible hub_model_id values when pushing a model to the Hub:
# "finetuned-Sentiment-classfication-ROBERTA-model"
# "finetuned-Sentiment-classfication-BERT-model"
# "finetuned-Sentiment-classfication-DISTILBERT-model"

# Define the early stopping callback
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=3,       # Number of evaluations with no improvement before stopping training
    early_stopping_threshold=0.01,   # Minimum change in the metric that counts as an improvement
)

In [None]:
# Load a pretrained model, specifying the number of labels in our dataset for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)


In [None]:
train_dataset_base = dataset_base['train'].shuffle(seed=10) #.select(range(40000)) # to select a part
eval_dataset_base = dataset_base['eval'].shuffle(seed=10)


In [None]:
def compute_metrics(eval_pred):
    # eval_pred holds the model logits and the true class indices
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    rmse = np.sqrt(np.mean((predictions - labels)**2))
    return {"rmse": rmse}

In [None]:
trainer_base = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset_base,
    eval_dataset=eval_dataset_base,
    compute_metrics=compute_metrics,   # Report RMSE at each evaluation
    callbacks=[early_stopping],        # Callbacks are passed to the Trainer, not to TrainingArguments
)
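compute_metrics reports RMSE over the predicted class indices. This treats the classes as ordered (0 = negative, 1 = neutral, 2 = positive), so confusing negative with positive costs more than confusing negative with neutral. A small worked example in plain Python, with made-up predictions:

```python
import math


def rmse(predictions, labels):
    """Root mean squared error between predicted and true class indices."""
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


labels = [0, 1, 2, 2]

# One positive tweet predicted as negative: squared errors 0, 0, 0, 4
print(rmse([0, 1, 2, 0], labels))  # 1.0

# The same tweet predicted as neutral instead: squared errors 0, 0, 0, 1
print(rmse([0, 1, 2, 1], labels))  # 0.5
```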

In [None]:
trainer_base.train()
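Early stopping with a patience and a threshold can be pictured as the loop below. It is a simplified sketch of the idea, not the EarlyStoppingCallback source. The function stopping_epoch and the metric values are invented for illustration, and a lower metric (such as RMSE) is assumed to be better.

```python
def stopping_epoch(metrics, patience=3, threshold=0.01):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    evaluations_without_improvement = 0
    for epoch, value in enumerate(metrics, start=1):
        if best - value > threshold:
            # Improved by more than the threshold: record it and reset the counter
            best = value
            evaluations_without_improvement = 0
        else:
            evaluations_without_improvement += 1
            if evaluations_without_improvement >= patience:
                return epoch
    return None


# Epochs 3, 4, and 5 do not beat 0.70 by more than 0.01, so training stops at epoch 5
print(stopping_epoch([0.80, 0.70, 0.695, 0.70, 0.698, 0.60]))  # 5
```

With a patience of 3 and only 5 training epochs, stopping can trigger only if the metric stalls by the second evaluation.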