# Sentiment Analysis with Deep Learning using BERT

### About the Dataset

In today's digital age, Twitter has evolved into a vast and dynamic tapestry of human emotion. It's a platform where opinions are shared, reactions are amplified, and feelings are expressed in 280 characters or less.  This constant stream of raw, unfiltered sentiments offers a unique and valuable opportunity for data scientists - the chance to analyze and understand the emotional pulse of the world in real-time.

Traditional sentiment analysis methods, while useful, often fall short in capturing the nuances and complexities of human language. Enter BERT (Bidirectional Encoder Representations from Transformers), a revolutionary deep learning model that has redefined the landscape of Natural Language Processing (NLP).  With its ability to understand context and decipher subtle linguistic cues, BERT empowers us to move beyond simple positive/negative classifications and delve into the rich tapestry of human emotions expressed on Twitter.

This project embarks on a journey to harness the power of BERT for advanced sentiment analysis of Twitter data.  Our goal is to develop a model capable of accurately classifying tweets not just as positive or negative, but across a wider spectrum of emotions such as joy, sadness, anger, fear, and surprise.  By doing so, we aim to unlock deeper insights into public opinion, social trends, and the emotional drivers behind online conversations..

### 1. Data Loading and Exploration

In [1]:
#import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import re
import string
import torch
from tqdm.notebook import tqdm # Progress bar library
from transformers import BertTokenizer # For BERT tokenization
from torch.utils.data import TensorDataset # Efficient dataset creation
from transformers import BertForSequenceClassification # Pre-trained BERT for classification

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
import re
import string

In [3]:
# Load the dataset
path = "text.csv"
df = pd.read_csv(path)
df.head()

Unnamed: 0.1,Unnamed: 0,text,label
0,0,i just feel really helpless and heavy hearted,4
1,1,ive enjoyed being able to slouch about relax a...,0
2,2,i gave up my internship with the dmrg and am f...,4
3,3,i dont know i feel so lost,0
4,4,i am a kindergarten teacher and i am thoroughl...,4


In [4]:
df.isnull().sum()

Unnamed: 0    0
text          0
label         0
dtype: int64

In [5]:
df.dropna()

Unnamed: 0.1,Unnamed: 0,text,label
0,0,i just feel really helpless and heavy hearted,4
1,1,ive enjoyed being able to slouch about relax a...,0
2,2,i gave up my internship with the dmrg and am f...,4
3,3,i dont know i feel so lost,0
4,4,i am a kindergarten teacher and i am thoroughl...,4
...,...,...,...
416804,416804,i feel like telling these horny devils to find...,2
416805,416805,i began to realize that when i was feeling agi...,3
416806,416806,i feel very curious be why previous early dawn...,5
416807,416807,i feel that becuase of the tyranical nature of...,3


In [6]:
# Drop the Index column
df.drop('Unnamed: 0',axis=1,inplace=True)

In [7]:
# Lets create a new column 'Emotion' to store the labled emotions
df['Emotion'] = df['label']
df['Emotion'] = df['Emotion'].replace(0,'Sadness')
df['Emotion'] = df['Emotion'].replace(1,'Joy')
df['Emotion'] = df['Emotion'].replace(2,'Love')
df['Emotion'] = df['Emotion'].replace(3,'Anger')
df['Emotion'] = df['Emotion'].replace(4,'Fear')
df['Emotion'] = df['Emotion'].replace(5,'Surprise')

In [8]:
df.Emotion.iloc[:10]

0        Fear
1     Sadness
2        Fear
3     Sadness
4        Fear
5     Sadness
6        Love
7         Joy
8    Surprise
9     Sadness
Name: Emotion, dtype: object

In [9]:

# Check the distribution of sentiment labels
df.Emotion.value_counts()


Emotion
Joy         141067
Sadness     121187
Anger        57317
Fear         47712
Love         34554
Surprise     14972
Name: count, dtype: int64

**Explanation:**

* **Libraries:** We begin by importing necessary libraries:
    * `torch`: For core PyTorch functionality.
    * `tqdm`: To display progress bars during training.
    * `transformers`: For loading BERT models and tokenizer.
    * `TensorDataset`: To create PyTorch datasets efficiently.
* **Loading Data:**  We load our emotions dataset (`text.csv`).
* **Initial Exploration:**
    * `df.head()`:  Displays the first few rows of the DataFrame, allowing us to understand its structure.
    * `df.Emotione.iloc[:10]`: Prints the first 10 emotion samples, giving us a feel for the data.
    * `df.Emotion.value_counts()`: Counts occurrences of each sentiment label, helping identify potential class imbalances.


In [10]:
possible_labels = df.Emotion.unique()

label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index
label_dict

{'Fear': 0, 'Sadness': 1, 'Love': 2, 'Joy': 3, 'Surprise': 4, 'Anger': 5}

### 2. Text cleaning

In [11]:
def clean_text(text):
    """
    Cleans text for BERT model by performing the following operations:
        - Removing URLs
        - Removing emojis
        - Removing HTML tags
        - Removing special characters
        - Removing text within square brackets
        - Removing words containing numbers

    Args:
        text (str): The input text to be cleaned.

    Returns:
        str: The cleaned text.
    """
    # Remove URLs
    text = re.sub(
        r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
        '',
        text
    )
    text = re.sub('https?://\S+|www\.\S+', '', text)

    # Remove emojis
    emoji_pattern = re.compile(
        '['
        u'\U0001F600-\U0001F64F'  # emoticons
        u'\U0001F300-\U0001F5FF'  # symbols & pictographs
        u'\U0001F680-\U0001F6FF'  # transport & map symbols
        u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
        u'\U00002702-\U000027B0'
        u'\U000024C2-\U0001F251'
        ']+',
        flags=re.UNICODE
    )
    text = emoji_pattern.sub(r'', text)

    # Remove HTML tags
    text = re.sub(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});', '', text)

    # Remove text within square brackets
    text = re.sub('\[.*?\]', '', text)

    # Remove special characters, punctuation, and words containing numbers
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\w*\d\w*', '', text)

    # Remove repeated characters
    text = re.sub(r'(.)\1+', r'\1', text)

    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text

In [12]:
df['Text_clean'] = df['text'].apply(clean_text)
df.head()

Unnamed: 0,text,label,Emotion,Text_clean
0,i just feel really helpless and heavy hearted,4,Fear,i just fel realy helples and heavy hearted
1,ive enjoyed being able to slouch about relax a...,0,Sadness,ive enjoyed being able to slouch about relax a...
2,i gave up my internship with the dmrg and am f...,4,Fear,i gave up my internship with the dmrg and am f...
3,i dont know i feel so lost,0,Sadness,i dont know i fel so lost
4,i am a kindergarten teacher and i am thoroughl...,4,Fear,i am a kindergarten teacher and i am thoroughl...


### 3. Train and Validation Split

In [23]:
df = df.sample(frac=0.10, random_state=42)


In [24]:
X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.15,
    random_state=42,
    stratify=df.label.values
)

df['data_type'] = ['not_set']*df.shape[0]

df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

df.groupby(['Emotion', 'label', 'data_type']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,text,Text_clean
Emotion,label,data_type,Unnamed: 3_level_1,Unnamed: 4_level_1
Anger,3,train,1218,1218
Anger,3,val,215,215
Fear,4,train,1009,1009
Fear,4,val,178,178
Joy,1,train,3071,3071
Joy,1,val,542,542
Love,2,train,691,691
Love,2,val,122,122
Sadness,0,train,2563,2563
Sadness,0,val,452,452


**Explanation:**

* **Splitting for Training and Evaluation:**
    * We import `train_test_split` to divide our data into training and validation sets.
    * `test_size=0.15` allocates 15% for validation, leaving 85% for training.
    * `random_state=42` ensures reproducible splitting.
    * `stratify=df.label.values` maintains the original label distribution in both sets.
* **Marking Data Splits:**  We add a `data_type` column to the DataFrame to easily track which data points belong to the training and validation sets.
* **Verification:**  The `groupby` operation helps us verify that the label and emotion distributions are consistent across the training and validation splits.

### 4. BERT Tokenizer and Encoding the Data

In [25]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].Emotion.values,
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=256,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].Emotion.values,
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=256,
    return_tensors='pt'
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


In [26]:
# ...

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

# ... (similarly for validation set: input_ids_val, attention_masks_val, labels_val)
input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)


dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

**Explanation:**

* **Tokenizer:**
    * We load the BERT tokenizer (`bert-base-uncased`) and specify `do_lower_case=True` to lowercase all text.
* **Encoding:**  
    * `batch_encode_plus` efficiently encodes multiple text sequences at once.
    * `add_special_tokens=True`: Adds BERT's special tokens ([CLS], [SEP])
    * `return_attention_mask=True`:  Generates masks to handle padding.
    * `pad_to_max_length=True`:  Pads sequences to a fixed length (256 here).
    * `return_tensors='pt'`:  Returns PyTorch tensors, ready for model input.
* **Extracting Encoded Data:**
    * `input_ids_train`, `attention_masks_train`, `labels_train` are extracted from the encoding results. These are essential for training BERT.
* **TensorDataset:**  
    * We create `TensorDataset` objects to combine the encoded inputs, attention masks, and labels into a convenient format for training and validation.


### 5. BERT Pre-trained Model:

In [27]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Explanation:**

In this code snippet, a pre-trained BERT model for sequence classification is loaded using the `BertForSequenceClassification` class from the Hugging Face `transformers` library. The model is initialized with the "bert-base-uncased" pre-trained weights, specifying the number of labels for classification, and setting options to not output attentions or hidden states.


### 6. Data Loaders:

In [28]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 32

dataloader_train = DataLoader(dataset_train,
                              sampler=RandomSampler(dataset_train),
                              batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val,
                                   sampler=SequentialSampler(dataset_val),
                                   batch_size=batch_size)

**Explanation:**

Data loaders are created using the `DataLoader` class from PyTorch. Two data loaders are defined for training and validation datasets. Each data loader is initialized with a specific dataset, a sampler (RandomSampler for training and SequentialSampler for validation), and a batch size of 3.


### 7. Optimizer & Scheduler:

In [29]:
from transformers import AdamW, get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(),
                  lr=1e-5,
                  eps=1e-8)

epochs = 5

scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)



**Explanation**

The AdamW optimizer is used to optimize the parameters of the BERT model with a learning rate of 1e-5 and epsilon value of 1e-8. Additionally, a linear scheduler with warmup is set up using the `get_linear_schedule_with_warmup` function from the `transformers` library. The number of warmup steps is set to 0, and the total number of training steps is calculated based on the number of batches and epochs.


### 8. Performance Metrics:

In [30]:
from sklearn.metrics import f1_score

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}

    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')

**Explanation**

Two functions are defined for evaluating model performance metrics:
   
   - `f1_score_func`: Calculates the F1 score for the model predictions compared to the actual labels. The F1 score is computed with the 'weighted' average.
   
   - `accuracy_per_class`: Computes the accuracy per class by comparing the model predictions with the true labels. It prints the accuracy for each class based on the class index in the `label_dict`.


### 9. Training Loop

In [31]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


In [33]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

def evaluate(dataloader_val):

    model.eval()

    loss_val_total = 0
    predictions, true_vals = [], []

    for batch in dataloader_val:

        batch = tuple(b.to(device) for b in batch)

        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():
            outputs = model(**inputs)

        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)

    loss_val_avg = loss_val_total/len(dataloader_val)

    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)

    return loss_val_avg, predictions, true_vals

for epoch in tqdm(range(1, epochs+1)):

    model.train()

    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()

        batch = tuple(b.to(device) for b in batch)

        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        outputs = model(**inputs)

        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()

        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item()/len(batch))})


    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')

    tqdm.write(f'\nEpoch {epoch}')

    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')

    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')

  0%|          | 0/5 [00:00<?, ?it/s]

Epoch 1:   0%|          | 0/277 [00:00<?, ?it/s]


Epoch 1
Training loss: 0.00807078936328903
Validation loss: 0.0035456467410359457
F1 Score (Weighted): 1.0


Epoch 2:   0%|          | 0/277 [00:00<?, ?it/s]


Epoch 2
Training loss: 0.004171891453161513
Validation loss: 0.002255538937027509
F1 Score (Weighted): 1.0


Epoch 3:   0%|          | 0/277 [00:00<?, ?it/s]


Epoch 3
Training loss: 0.002958360101207284
Validation loss: 0.001773207058787954
F1 Score (Weighted): 1.0


Epoch 4:   0%|          | 0/277 [00:00<?, ?it/s]


Epoch 4
Training loss: 0.002510785521092129
Validation loss: 0.0016359419809008132
F1 Score (Weighted): 1.0


Epoch 5:   0%|          | 0/277 [00:00<?, ?it/s]


Epoch 5
Training loss: 0.002430781777964952
Validation loss: 0.0016359419809008132
F1 Score (Weighted): 1.0


**Explanation:**

* **Seed Setting:** The code starts by setting random seeds for Python, NumPy, and PyTorch. This ensures consistent results across multiple runs, which is crucial for reproducibility in research and debugging.

* **`evaluate()` function:** This function takes a data loader (`dataloader_val`) as input, sets the model to evaluation mode, and iterates over the data loader to calculate the average validation loss and obtain predictions. Key points:
    * **`model.eval()`:** Informs PyTorch that the model is in evaluation mode.  Disables dropout and batch normalization layers, which behave differently during training.
    * **`torch.no_grad()`:**  Temporarily disables gradient calculations.  Improves speed and memory efficiency during evaluation, as gradients aren't needed.
    * **Moving data to the device:** Ensures that tensors are on the correct device (CPU or GPU).
    * **Detaching and moving to the CPU:** For calculations and metrics, data is detached from the computation graph (`detach()`) and moved to the CPU (`cpu()`) for compatibility with NumPy.

* **Main Training Loop:**  Iterates over the specified number of epochs (`epochs`), training the model and evaluating its performance.
    * **`model.train()`:**  Sets the model to training mode, activating dropout and batch normalization.
    * **`model.zero_grad()`:** Clears the gradients from previous iterations. Crucial because PyTorch accumulates gradients.
    * **Gradient Clipping (`torch.nn.utils.clip_grad_norm_`)**:  Prevents exploding gradients (very large gradients that can destabilize training) by clipping them to a maximum norm.
    * **`optimizer.step()`:** Updates the model's parameters based on the calculated gradients.
    * **`scheduler.step()`:** Updates the learning rate according to the scheduler's strategy.
    * **Saving the Model (`torch.save`)**: Saves the model's state dictionary (weights and biases) after each epoch.
    * **Evaluation:** After each epoch, the model is evaluated on the validation set using the `evaluate()` function.  Training and validation losses are printed, along with the F1 score.

### 10. Loading and Evaluating the Model

In [34]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

model.to(device)

model.load_state_dict(torch.load('finetuned_BERT_epoch_1.model', map_location=torch.device('cpu')))

_, predictions, true_vals = evaluate(dataloader_validation)
accuracy_per_class(predictions, true_vals)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Class: Fear
Accuracy: 452/452

Class: Sadness
Accuracy: 542/542

Class: Love
Accuracy: 122/122

Class: Joy
Accuracy: 215/215

Class: Surprise
Accuracy: 178/178

Class: Anger
Accuracy: 54/54



**Explanation:**

* **Loading the pre-trained model and moving it to device:** Similar to before, we load a fresh BERT model and transfer it to the chosen device (CPU or GPU).

* **Loading Model Checkpoint:** The `torch.load` function loads the saved state dictionary of the trained model from the specified path. `map_location` might be needed to load the model on a different device (CPU in this case) than the one it was saved on.

* **Final Evaluation:**  The loaded model is evaluated one last time on the validation set to assess its performance using the previously defined `evaluate` and `accuracy_per_class` functions.

**Key Points and Enhancements for Intermediate/Advanced Readers:**

* **Early Stopping:** To prevent overfitting, consider implementing early stopping. Monitor the validation loss and stop training if it doesn't improve for a certain number of epochs.
* **Hyperparameter Tuning:** Experiment with different batch sizes, learning rates, optimizers (e.g., Adam, SGD), schedulers, and the number of epochs to find the best configuration for your specific dataset. Tools like Weights & Biases (wandb) can help track and visualize these experiments.
* **Model Complexity:** Try larger BERT models (e.g., `bert-large-uncased`) for potentially better performance, but be mindful of increased computational resources.
* **Data Augmentation:**  Explore techniques like back-translation or synonym replacement to augment your dataset and improve model generalization.
