<a href="https://colab.research.google.com/github/Azie88/NLP-Huggingface-Covid-19-Tweet-Sentiment-Analysis/blob/main/dev/Tweet%20Sentiment%20Analysis%20Roberta.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis with HuggingFace & Colab

Deep learning now dominates NLP. Pre-trained language models such as those available through Hugging Face capture nuances of text and can be fine-tuned with relatively little effort, which makes them easy to use.

Hugging Face is an open-source platform and provider of machine learning technologies. You can install their packages to access pre-built models, use them directly or fine-tune them (retrain them on your dataset, leveraging the knowledge gained in the original training), and then host your trained models on the platform so that you can use them later on other devices and apps.


Hugging Face models are deep-learning based, so training them needs substantial GPU compute. This project uses [Google Colab](https://colab.research.google.com/) for its GPU runtime.

This project is about Natural Language Processing, specifically text classification (sentiment analysis). We will fine-tune a pre-trained text classification model from Hugging Face on a new dataset to adapt it to the task we want to solve, i.e. predicting the sentiment expressed in a tweet (neutral, positive, or negative), then create an app that uses the model and deploy the app on the Hugging Face platform.

<br>

Read more about [Text classification with Hugging Face](https://huggingface.co/tasks/text-classification)

## Business Understanding

Vaccines have lowered the risk of illness and death, and have saved countless lives around the world. Unfortunately in some countries, the 'anti-vaxxer' movement has led to lower rates of vaccination and new outbreaks of old diseases.

The COVID vaccination has been very controversial and people have mixed feelings and opinions about it. Therefore, it is important to monitor public sentiment towards vaccinations now and in the future as the COVID-19 vaccines continue to be offered to the public. The anti-vaccination sentiment could pose a serious threat to the global efforts to get COVID-19 under control in the long term.

The objective of this challenge is to develop a machine learning model to assess if a Twitter post related to COVID vaccinations is positive, neutral, or negative. This solution could help governments and other public health actors monitor public sentiment towards COVID-19 vaccinations and help improve public health policy, vaccine communication strategies, and vaccination programs across the world.

## Data Understanding

### Install Libraries and Packages

In [None]:
!pip install datasets
!pip install "accelerate>=0.20.1"
!pip install transformers[torch]
!pip install -U huggingface_hub
!pip install tokenizers --upgrade
!pip install evaluate

### Import Libraries and Packages

In [None]:
#System and data handling
import os
import re
import pandas as pd
pd.set_option('display.max_colwidth', None)
import numpy as np

#Data Preparation
from evaluate import load
from datasets import Dataset, DatasetDict

#Scikit-Learn
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

#Google Drive
from google.colab import drive

#Visualization
import matplotlib.pyplot as plt
from wordcloud import WordCloud

#Transformers
from transformers import AutoTokenizer, AutoConfig, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding

#Scores
from scipy.special import softmax

# Deep learning
import torch
from torch import nn

#Huggingface
from huggingface_hub import notebook_login

### Setup

In [None]:
#login to huggingface with access token

notebook_login()

In [None]:
# Set a fixed random seed for PyTorch on CPU
torch.manual_seed(42)

# Seed all GPUs and make cuDNN deterministic (optional)
if torch.cuda.is_available():
  torch.backends.cudnn.deterministic = True
  torch.backends.cudnn.benchmark = False
  torch.cuda.manual_seed_all(42)


In [None]:
# Connect to your google drive

drive.mount('/content/drive')

In [None]:
# Disable W&B
os.environ["WANDB_DISABLED"] = "true"

### Data Loading

In [None]:
# Load the dataset and display some values
df = pd.read_csv('/content/drive/MyDrive/Covid-19 tweet dataset/Train.csv')


In [None]:
#look at first 10 rows in train data
df.head(10)

1. **tweet_id**: Unique identifier of the tweet

2. **safe_text**: Text contained in the tweet. Some sensitive information, such as usernames and URLs, has been removed

3. **label**: Sentiment of the tweet (-1 for negative, 0 for neutral, 1 for positive)

4. **agreement**: The tweets were labeled by three people. Agreement indicates the percentage of the three reviewers that agreed on the given label. You may use this column in your training, but agreement data will not be shared for the test set.

In [None]:
#Check rows and columns
df.shape

In [None]:
#Check Data types
df.dtypes

In [None]:
#Descriptive Statistics
df.describe()

In [None]:
#Check Null values
df.isna().sum()

In [None]:
# Check the 'label' value counts
df.label.value_counts()

In [None]:
# Check for quality of 'safe_text' tweets
df.safe_text.sample(10)

## Data Preparation

1. Remove rows with NaN values.
2. Clean the `safe_text` column of Twitter handles, HTML characters, URLs and other non-alphabetic characters. Inconsistent text may hurt model performance.

In [None]:
# Eliminate rows containing NaN values
df = df[~df.isna().any(axis=1)]

In [None]:
# Check null values
df.isna().sum()

In [None]:
# Function to clean text
# Replace unwanted characters with empty string

def clean_text(text):
    # Remove URLs (http\S+ already covers https)
    text = re.sub(r'http\S+|www\S+', '', text, flags=re.MULTILINE)

    # Remove the <user> and <url> placeholders left by the dataset's anonymization
    text = re.sub(r'<user>', '', text)
    text = re.sub(r'<url>', '', text)

    # Remove special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Replace all whitespace characters with a single space
    text = re.sub(r'\s+', ' ', text)

    return text
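
As a quick sanity check, the sketch below applies the same substitutions to a made-up tweet. The sample string is hypothetical and not taken from the dataset; the function is repeated in compact form only so that the example runs on its own.

```python
import re

def clean_text(text):
    # Same substitutions as the notebook's clean_text, in compact form
    text = re.sub(r'http\S+|www\S+', '', text)   # URLs
    text = re.sub(r'<user>|<url>', '', text)     # dataset placeholders
    text = re.sub(r'[^a-zA-Z\s]', '', text)      # digits, punctuation, symbols
    return re.sub(r'\s+', ' ', text)             # collapse runs of whitespace

# Hypothetical tweet, for illustration only
sample = "<user> The #COVID19 vaccine works!! Read more: https://t.co/abc123 <url>"
cleaned = clean_text(sample)
print(repr(cleaned))
```

A leading or trailing space can remain after cleaning; calling `.strip()` on the result would remove it if that matters for your use.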

In [None]:
# Apply the clean_text function to the 'safe_text' column
df['safe_text'] = df.safe_text.apply(clean_text)

In [None]:
df.safe_text.sample(20)

In [None]:
# Check label value counts after deleting NaN values
df.label.value_counts()

The target classes are imbalanced.

### Exploratory Data Analysis

In [None]:
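# Pie chart of the 'label' column
plt.figure(figsize=(6,6))
label_names = {-1: 'Negative', 0: 'Neutral', 1: 'Positive'}
label_counts = df.label.value_counts()
# Derive slice names from the index so they always match the counts
label_counts.plot.pie(autopct='%1.2f%%', labels=[label_names.get(v, str(v)) for v in label_counts.index])
plt.legend(bbox_to_anchor=(1.5,1))
plt.show()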

Neutral and positive sentiments are more prevalent, while negative sentiments are relatively less frequent in the dataset.

In [None]:
#generate a word cloud visualization from the 'safe_text' column

all_data = ' '.join(df['safe_text'])  # join tweets directly; to_string() would also include row indices
wordcloud = WordCloud().generate(all_data)
plt.figure(figsize=(12,8))
plt.imshow(wordcloud,interpolation='bilinear')
plt.title('Word Cloud for Most Common Words')
plt.axis("off")
plt.show()

The word cloud provides a visual representation of the most frequent terms in the tweets. The size of each word in the cloud is proportional to its frequency. Let's look at how many words are in each tweet.
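
To make "proportional to its frequency" concrete, here is a minimal sketch of the counting that underlies a word cloud, using only the standard library. The three tweets are hypothetical stand-ins for `df['safe_text']`.

```python
from collections import Counter

# Hypothetical cleaned tweets, standing in for df['safe_text']
tweets = [
    "the vaccine is safe",
    "get the vaccine today",
    "measles vaccine saves lives",
]

# Count how often each word occurs across all tweets
word_counts = Counter(word for tweet in tweets for word in tweet.lower().split())
top_words = word_counts.most_common(2)
print(top_words)
```

`WordCloud` additionally filters common English stopwords by default, so its picture will differ from raw counts like these.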

In [None]:
# Number of words in each tweet in the 'safe_text' column
text_lengths = df['safe_text'].str.split().str.len()
text_lengths.value_counts().sort_values(ascending=False)

In [None]:
# Calculate the average
average_length = np.mean(text_lengths)

In [None]:
# Create a figure and axis
fig, ax = plt.subplots(figsize=(12, 6))

# Create a histogram on the axis with Matplotlib
ax.hist(text_lengths, bins=20, color="blue", edgecolor="black", alpha=0.7)

# Add average line
ax.axvline(average_length, color='red', linestyle='dashed', linewidth=2, label=f'Average: {average_length:.2f}')

ax.set_title('Histogram of Tweet Lengths')
ax.set_xlabel('Tweet Length (words)')
ax.set_ylabel('Count')
ax.legend()  # show the label of the average line

# Display the plot
plt.show()


### Train Test Split

In [None]:
# Split the train data => {train, eval}
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])

In [None]:
#preview the train subset
train.head()

In [None]:
#preview the eval subset
eval.head()

In [None]:
print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")

In [None]:
# Save split subsets
train.to_csv("/content/drive/MyDrive/Covid-19 tweet dataset/train_subset.csv", index=False)
eval.to_csv("/content/drive/MyDrive/Covid-19 tweet dataset/eval_subset.csv", index=False)

## Model Fine Tuning and Training

In [None]:
# Define pre-trained model name and instance of tokenizer from the model
MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

**Model**: Twitter-roBERTa-base for Sentiment Analysis - UPDATED (2022) [Model link on huggingface](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest)

**Description**: This is a RoBERTa-base model trained on ~124M tweets from January 2018 to December 2021, and finetuned for sentiment analysis with the TweetEval benchmark.

**Labels**:
*   Negative --> 0
*   Neutral --> 1
*   Positive --> 2
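
The mapping from the dataset's raw labels (-1, 0, 1) to these class ids can also be written as a lookup table. The sketch below is one possible way to do it, not the notebook's own method; the names `RAW_TO_ID`, `ID_TO_NAME` and `to_model_label` are hypothetical. Unlike a default fallback, it fails loudly on an unexpected value.

```python
# Raw dataset label -> model class id (matches the list above)
RAW_TO_ID = {-1: 0, 0: 1, 1: 2}
ID_TO_NAME = {0: "negative", 1: "neutral", 2: "positive"}

def to_model_label(raw_label):
    """Map a raw label to a class id, raising on anything unexpected."""
    key = int(raw_label)
    if key != raw_label or key not in RAW_TO_ID:
        raise ValueError(f"Unexpected label: {raw_label!r}")
    return RAW_TO_ID[key]

# Works for ints and for floats such as -1.0 (as pandas may store them)
names = [ID_TO_NAME[to_model_label(v)] for v in (-1.0, 0, 1)]
print(names)
```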




In [None]:
# Function to transform the labels:
# Negative -1:0
# Neutral 0:1
# Positive 1:2

def transform_labels(label):

    label = label['label']
    num = 0
    if label == -1: #'Negative'
        num = 0
    elif label == 0: #'Neutral'
        num = 1
    elif label == 1: #'Positive'
        num = 2

    return {'labels': num}

In [None]:
# Convert dataframes to datasets objects
train_dataset = Dataset.from_pandas(train)
eval_dataset = Dataset.from_pandas(eval)

# Create a DatasetDict
dataset = DatasetDict({
    'train': train_dataset,
    'eval': eval_dataset
})

In [None]:
# Function to tokenize data

def tokenize_data(example):
    return tokenizer(example['safe_text'], max_length = 128, padding='max_length', truncation=True)

In [None]:
# Change the tweets to tokens that the model can use
dataset = dataset.map(tokenize_data, batched=True)

# Transform labels and remove the columns the model does not need
remove_columns = ['tweet_id', 'label', 'safe_text', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)

In [None]:
dataset

#### Balancing Target Classes

Since our target classes are imbalanced (positive, neutral and negative don't have an equal number of samples), we want to give more weight to underrepresented classes and less weight to classes with more samples.

In [None]:
# Define the labels
labels = dataset['train']['labels']

# Apply the compute class weight function to calculate the class weight
class_weights = compute_class_weight('balanced', classes=np.unique(labels), y=labels)

The `balanced` option in `compute_class_weight` assigns each class a weight inversely proportional to its frequency, `n_samples / (n_classes * class_count)`, so rarer classes get larger weights.
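
The arithmetic behind the `balanced` heuristic can be reproduced in a few lines of plain Python. This is a sketch for illustration: the helper name `balanced_class_weights` and the label counts are hypothetical, not taken from the dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# Hypothetical distribution: 10 negative (0), 50 neutral (1), 40 positive (2)
labels = [0] * 10 + [1] * 50 + [2] * 40
weights = balanced_class_weights(labels)
for class_id, weight in weights.items():
    print(class_id, round(weight, 4))
```

With these counts the rarest class (0) receives the largest weight, which is the effect we want the loss function to see.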

In [None]:
# Preview class weights
class_weights, np.unique(labels)

In [None]:
# Define an instance of the pre-trained model with the number of labels
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)

In [None]:
# Configure the training parameters

training_args = TrainingArguments("./results",
    num_train_epochs=5, # the number of times the model will repeat the training loop over the dataset
    load_best_model_at_end=True,
    eval_strategy='epoch',
    save_strategy='epoch',)

In [None]:
# evaluation metrics

metric = load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [None]:
# Shuffle the training and validation sets with a fixed seed of 10
train_dataset = dataset['train'].shuffle(seed=10)
eval_dataset = dataset['eval'].shuffle(seed=10)

In [None]:
# Collator that pads each batch dynamically and returns PyTorch tensors
# (pass it to the Trainer via data_collator= to use it)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True, return_tensors='pt')

In [None]:
# Define Custom Trainer | Modify loss function and assign computed weights
class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        labels = inputs.get("labels")

        # Forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")

        # Ensure logits and labels have compatible shapes and labels are of integer type
        #assert logits.shape[1] == self.model.config.num_labels, f"Logits shape {logits.shape} does not match number of labels {self.model.config.num_labels}"
        #assert labels.max() < self.model.config.num_labels, f"Labels contain values outside the valid range: {labels}"
        #assert labels.dtype == torch.long, f"Labels must be of type torch.long, but got {labels.dtype}"

        # Apply class weights, placed on the same device as the logits
        class_weights_tensor = torch.tensor(class_weights, dtype=torch.float32, device=logits.device)

        # Compute custom loss
        loss_fct = nn.CrossEntropyLoss(weight=class_weights_tensor)
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
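
To make the weighted loss concrete, the following plain-Python sketch computes what `nn.CrossEntropyLoss(weight=...)` returns with its default mean reduction: each sample's negative log-likelihood is scaled by the weight of its true class, and the sum is divided by the sum of those weights. The logits and weights are hypothetical values chosen for illustration.

```python
import math

def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of per-sample negative log-likelihoods."""
    total, norm = 0.0, 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_sum_exp - row[y]  # equals -log(softmax(row)[y])
        total += weights[y] * nll
        norm += weights[y]
    return total / norm

# Hypothetical logits for two tweets (true classes 0 and 1) and class weights
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
labels = [0, 1]
unweighted = weighted_cross_entropy(logits, labels, [1.0, 1.0, 1.0])
weighted = weighted_cross_entropy(logits, labels, [3.0, 0.7, 0.8])
print(round(unweighted, 4), round(weighted, 4))
```

Because class 0 carries the largest weight here, the weighted loss is pulled towards the loss of the first sample.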

In [None]:
# Instantiate the trainer for training
c_trainer = CustomTrainer(
                  model=model,
                  args=training_args,
                  train_dataset=train_dataset,
                  eval_dataset=eval_dataset,
                  tokenizer = tokenizer,
                  compute_metrics=compute_metrics,
)

In [None]:
# Launch the learning process: training
c_trainer.train()

`Training Loss`: The training loss is decreasing with each epoch, which is a positive sign. It suggests that the model is learning and improving its predictions on the training data.

`Validation Loss`: The validation loss is relatively stable for the first 3 epochs but starts to increase thereafter. This could indicate overfitting, where the model performs well on the training data but does not generalize as effectively to the evaluation (unseen) data.

`Accuracy`: The accuracy on the validation data is around 78% in the final epoch, which is a reasonable accuracy. The model is correctly predicting sentiments for approximately 78% of the validation samples.
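
For reference, this is how an accuracy figure like the one above is derived from model outputs: take the highest-scoring class per row and compare it with the true label. The sketch uses plain Python and hypothetical logits; `accuracy_from_logits` is an illustrative name, not part of any library.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical logits for four validation tweets (columns: negative, neutral, positive)
logits = [[2.1, 0.3, -1.0], [0.2, 1.7, 0.4], [-0.5, 0.9, 1.2], [0.1, 0.8, 0.3]]
labels = [0, 1, 2, 2]
acc = accuracy_from_logits(logits, labels)
print(f"accuracy = {acc:.2f}")
```

Here three of the four predictions match, so the accuracy is 0.75. With imbalanced classes, accuracy alone can hide weak performance on the rarest class, so it is worth reading alongside the per-class counts.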

In [None]:
# Launch the final evaluation
c_trainer.evaluate()

In [None]:
# Push model and tokenizer to HF Hub
model.push_to_hub("Azie88/COVID_Vaccine_Tweet_sentiment_analysis_roberta")
tokenizer.push_to_hub("Azie88/COVID_Vaccine_Tweet_sentiment_analysis_roberta")
dataset.push_to_hub("Azie88/COVID_Vaccine_Tweet_sentiment_analysis_roberta")

This notebook is inspired by an article: [Fine-Tuning Bert for Tweets Classification ft. Hugging Face](https://medium.com/mlearning-ai/fine-tuning-bert-for-tweets-classification-ft-hugging-face-8afebadd5dbf)

## Inference
Let's test out our model with some sample text.

In [None]:
model_path = "Azie88/COVID_Vaccine_Tweet_sentiment_analysis_roberta"

tokenizer = AutoTokenizer.from_pretrained(model_path)
config = AutoConfig.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

In [None]:
# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

In [None]:
# Input preprocessing
text = "Covid vaccine is very effective"
text = preprocess(text)

In [None]:
# PyTorch-based models
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

In [None]:
print("Scores:", scores)
print("id2label Dictionary:", config.id2label)


In [None]:
config.id2label = {0: 'NEGATIVE', 1: 'NEUTRAL', 2: 'POSITIVE'}

In [None]:
# Print labels and scores
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = config.id2label[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")
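
The scoring and ranking steps above can be followed by hand on a small example. The sketch below is a plain-Python stand-in for the `softmax` and `np.argsort` calls; the logits are hypothetical and `softmax_probs` is an illustrative name chosen to avoid shadowing the SciPy import.

```python
import math

def softmax_probs(scores):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "NEGATIVE", 1: "NEUTRAL", 2: "POSITIVE"}
logits = [-1.2, 0.3, 2.4]  # hypothetical model output for one tweet

probs = softmax_probs(logits)
# Sort class ids from most to least probable
ranking = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}) {id2label[idx]} {probs[idx]:.4f}")
```

Softmax preserves the order of the logits, so the top-ranked label is always the class with the largest raw score.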