# Sentiment Analysis


### BERT Model

### Notebook performed in Google Colab


In the final stage of the "Sentiment Analysis" project, we build a sentiment classifier on top of BERT (Bidirectional Encoder Representations from Transformers). The notebook uses `distilbert-base-uncased`, a smaller distilled variant of BERT.

To accomplish this, we will employ the Transformers library, which offers a convenient and versatile interface for working with various transformer-based models. In particular, we will leverage BERT's pre-trained model, which has been trained on large-scale corpora and can capture complex language patterns effectively.

In the script, we will begin by loading the dataset that was used in previous stages of the project. This dataset contains the necessary sentiment labels that we will use to train and evaluate the BERT-based sentiment classifier.

Next, we download a pre-trained checkpoint from the Hugging Face Hub. Pre-trained models are available for many languages and have been trained on massive amounts of text, which lets them learn rich representations of language.

To ensure efficient computations, it is recommended to use a GPU environment. GPUs can significantly accelerate the training process of deep learning models, such as BERT, due to their parallel processing capabilities.

Throughout the script, we will perform fine-tuning of the BERT model on our dataset. Fine-tuning involves adapting the pre-trained BERT model to our specific task of sentiment classification. This process allows the model to learn the sentiment patterns present in our dataset and make accurate predictions.

By leveraging BERT and the Transformers library, we aim for strong sentiment classification performance. Because BERT represents each word in the context of the whole sentence, it can give more accurate predictions than earlier approaches, particularly on texts where the sentiment depends on context.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install transformers==4.27.1
!pip install -U datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.27.1
  Downloading transformers-4.27.1-py3-none-any.whl (6.7 MB)
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers==4.27.1)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.27.1)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transfor

In [None]:
import transformers
transformers.__version__

'4.27.1'

In [None]:
import pandas as pd
import re  # the standard-library module is sufficient for the pattern used below

First, we load the data.

In [None]:
df = pd.read_csv('/content/drive/MyDrive/models/tweets_df_cleaned_labeled.csv', index_col = 0)
df.drop(df.columns[[0]], axis=1, inplace=True)

In [None]:
df = df.rename(columns = {'zs_prediction' : 'label'})
df

Unnamed: 0,date,author_id,text,original_tweets,tokens,cleaned_tokens,stems,lemma,label
0,2023-03-31 00:54:56+00:00,80832189,ron desantis stated honor legal requirement ex...,Ron DeSantis just stated he would not honor a ...,"['ron', 'desantis', 'stated', 'honor', 'legal'...","['ron', 'desantis', 'stated', 'honor', 'legal'...","['ron', 'desanti', 'state', 'honor', 'legal', ...","['ron', 'desantis', 'state', 'honor', 'legal',...",1
1,2023-03-31 00:54:50+00:00,2479303121,abortion completely legal constitution . .,Abortion is completely legal and in our consti...,"['abortion', 'completely', 'legal', 'constitut...","['abortion', 'completely', 'legal', 'constitut...","['abort', 'complet', 'legal', 'constitut']","['abortion', 'completely', 'legal', 'constitut...",1
2,2023-03-31 00:52:32+00:00,1526105788728475653,""" forced pregnancy "" legal term specifical...","""Forced pregnancy"" is a legal term specificall...","['""', 'forced', 'pregnancy', '""', 'legal', 'te...","['forced', 'pregnancy', 'legal', 'term', 'spec...","['forc', 'pregnanc', 'legal', 'term', 'specif'...","['force', 'pregnancy', 'legal', 'term', 'speci...",1
3,2023-03-31 00:52:09+00:00,438628988,"americans know true , 80 % believe aborti...","Americans know this IS true, which is why 80% ...","['americans', 'know', 'true', ',', '80', '%', ...","['americans', 'know', 'true', '80', 'believe',...","['american', 'know', 'true', '80', 'believ', '...","['americans', 'know', 'true', 'believe', 'abor...",0
4,2023-03-31 00:51:47+00:00,1492201362154831874,democrats want legalize abortion scruples .,Democrats that want to legalize abortion don’t...,"['democrats', 'want', 'legalize', 'abortion', ...","['democrats', 'want', 'legalize', 'abortion', ...","['democrat', 'want', 'legal', 'abort', 'scrupl']","['democrats', 'want', 'legalize', 'abortion', ...",1
...,...,...,...,...,...,...,...,...,...
23826,2023-02-23 02:42:35+00:00,1492272489455620097,"abortion 9 months legal , past trimester ill...","abortion up to 9 months shouldn’t be legal, an...","['abortion', '9', 'months', 'legal', ',', 'pas...","['abortion', 'months', 'legal', 'past', 'trime...","['abort', 'month', 'legal', 'past', 'trimest',...","['abortion', 'month', 'legal', 'past', 'trimes...",1
23827,2023-02-23 02:29:15+00:00,452462161,"scotus rules decades ago privacy , fundame...","SCOTUS also rules decades ago that privacy, be...","['scotus', 'rules', 'decades', 'ago', 'privacy...","['scotus', 'rules', 'decades', 'ago', 'privacy...","['scotu', 'rule', 'decad', 'ago', 'privaci', '...","['scotus', 'rule', 'decade', 'ago', 'privacy',...",1
23828,2023-02-23 02:27:01+00:00,1519800752549421058,yes . means obtain abortion legal .,Yes. And I will go to all means to obtain an a...,"['yes', '.', 'means', 'obtain', 'abortion', 'l...","['yes', 'means', 'obtain', 'abortion', 'legal']","['ye', 'mean', 'obtain', 'abort', 'legal']","['yes', 'mean', 'obtain', 'abortion', 'legal']",1
23829,2023-02-23 02:25:21+00:00,1445411972388847622,", injunction ( incorrectly ) placed trigger...","Also, due to the injunction (incorrectly) plac...","[',', 'injunction', '(', 'incorrectly', ')', '...","['injunction', 'incorrectly', 'placed', 'trigg...","['injunct', 'incorrectli', 'place', 'trigger',...","['injunction', 'incorrectly', 'place', 'trigge...",1


In [None]:
# Copy the two needed columns so the assignment below modifies an independent
# DataFrame (this avoids pandas' SettingWithCopyWarning)
tweets = df[['text', 'label']].copy()
# Replace every non-alphanumeric character with a space
signs = r'[^a-zA-Z0-9]'
tweets['text'] = tweets['text'].apply(lambda x: re.sub(signs, ' ', x))
tweets


Unnamed: 0,text,label
0,ron desantis stated honor legal requirement ex...,1
1,abortion completely legal constitution,1
2,forced pregnancy legal term specifical...,1
3,americans know true 80 believe aborti...,0
4,democrats want legalize abortion scruples,1
...,...,...
23826,abortion 9 months legal past trimester ill...,1
23827,scotus rules decades ago privacy fundame...,1
23828,yes means obtain abortion legal,1
23829,injunction incorrectly placed trigger...,1


Now we import the `Dataset` class from the `datasets` library and use `Dataset.from_pandas` to convert the table above into a `Dataset` object. This class is not strictly required, but it makes working with Transformers models much easier.

Next, we call `train_test_split` on the new dataset with `0.1` as the test fraction, so 10% of the rows are held out for evaluation. The result is stored in the variable that represents our final dataset (here simply `dataset`):

In [None]:
from datasets import Dataset

dataset_ = Dataset.from_pandas(tweets)
dataset = dataset_.train_test_split(0.1)

In [None]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 21447
    })
    test: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 2384
    })
})
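
The row counts printed above follow directly from the `0.1` test fraction. The sketch below reproduces that arithmetic in plain Python. It assumes the test share is rounded up and the remainder goes to training, which matches the numbers shown here; `split_sizes` is a hypothetical helper, not part of the `datasets` library.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test size.

    Assumption: the test share is rounded up and the remaining rows
    are used for training.
    """
    n_test = math.ceil(test_size * n_rows)
    n_train = n_rows - n_test
    return n_train, n_test

total_rows = 21447 + 2384  # train + test rows reported above
print(split_sizes(total_rows, 0.1))  # (21447, 2384)
```

`train_test_split` shuffles the rows before splitting, so passing its `seed` argument is a simple way to make the split reproducible between runs.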


In [None]:
dataset['train']

Dataset({
    features: ['text', 'label', '__index_level_0__'],
    num_rows: 21447
})

# Data pre-processing (preparation)

We start by choosing the pre-trained checkpoint: the `model_checkpoint` variable holds the model name `'distilbert-base-uncased'`, and `batch_size` is set to 64.

In [None]:
model_checkpoint = 'distilbert-base-uncased'
batch_size = 64

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)


In [None]:
tokenizer('This is the first sentence')

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [None]:
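def process(x):
    # Tokenize one example; truncation guards against texts longer than the model limit
    return tokenizer(x['text'], truncation=True)

# Add input_ids and attention_mask columns to both splits
train_ds = dataset['train'].map(process)
test_ds = dataset['test'].map(process)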


In [None]:
train_ds

Dataset({
    features: ['text', 'label', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 21447
})

# Training

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.w

In [None]:
!pip install --upgrade accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting accelerate
  Downloading accelerate-0.19.0-py3-none-any.whl (219 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.19.0


In [None]:
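from datasets import load_metric
import numpy as np

# GLUE SST-2 is scored with accuracy, so this metric reports {'accuracy': ...}.
# Note: `load_metric` is deprecated in newer `datasets` releases in favor of
# the separate `evaluate` library.
metric = load_metric('glue', 'sst2')

def compute_metrics(eval_preds):
  # eval_preds is a (logits, labels) pair supplied by the Trainer
  logits, labels = eval_preds
  # Pick the class with the largest logit for each example
  predictions = np.argmax(logits, axis=-1)
  return metric.compute(predictions=predictions, references=labels)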



In [None]:
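import gc
import torch  # needed here: torch is otherwise only imported in a later cell

def report_gpu():
   # Print the processes using the GPU, then release cached memory
   print(torch.cuda.list_gpu_processes())
   gc.collect()
   torch.cuda.empty_cache()

In [None]:
import gc
import torch

# Free unreferenced Python objects and cached CUDA memory before training
gc.collect()
torch.cuda.empty_cache()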

The `compute_metrics` function defined above takes `eval_preds`, which holds the model's raw outputs (logits) and the corresponding labels during evaluation. It performs the following steps:

- Extracts logits and labels from `eval_preds`.
- Uses `np.argmax` to pick, for each example, the class index with the largest logit (which is also the class with the highest probability).
- Calls `metric.compute` with the predictions and the ground-truth labels to calculate the GLUE SST-2 metric, which is accuracy.
- Returns the computed metric as a dictionary.

In summary, the code loads the metric used for the SST-2 (Stanford Sentiment Treebank) task of the GLUE benchmark and defines a `compute_metrics` function that scores the model's predictions against the labels. The same function works for any binary classifier that is evaluated with accuracy.
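
To make the steps listed above concrete, here is a minimal pure-Python sketch of the same computation: argmax over each row of logits, then the share of matching labels. The names `argmax` and `accuracy_from_logits` and the toy values are illustrative assumptions; the notebook itself relies on `np.argmax` and the loaded GLUE metric.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Mimic compute_metrics: predicted class = largest logit."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Toy values (hypothetical): four examples, two classes
toy_logits = [[2.1, -0.3], [-1.0, 0.5], [0.2, 0.1], [-0.7, 1.9]]
toy_labels = [0, 1, 1, 1]
print(accuracy_from_logits(toy_logits, toy_labels))  # {'accuracy': 0.75}
```

The third example is predicted as class 0 but labeled 1, so three of four predictions are correct.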

In [None]:
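import torch
from transformers import DistilBertForSequenceClassification, DistilBertConfig

# Optional custom configuration (left disabled, so the default dropout of 0.1 is used):
#config = DistilBertConfig.from_pretrained(model_checkpoint, num_labels=2)
#config.dropout = 0.2   # set the dropout rate to 0.2

# Re-create the model so that training starts from the pre-trained weights
num_labels = 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

from transformers import EarlyStoppingCallback

# Note: as a plain Python variable this has no effect on the CUDA allocator;
# that option is read from the PYTORCH_CUDA_ALLOC_CONF environment variable.
max_split_size_mb = 1024


args = TrainingArguments(
    f'{model_checkpoint}_sentiment_analysis',  # output directory for checkpoints
    evaluation_strategy = 'epoch',  # evaluate after every epoch
    save_strategy = 'epoch',        # must match evaluation_strategy for load_best_model_at_end
    learning_rate = 2e-5,
    per_device_train_batch_size = 32,
    per_device_eval_batch_size = 32,
    gradient_accumulation_steps=2,  # 32 x 2 = effective batch of 64 (the batch_size chosen earlier)
    num_train_epochs = 10,          # upper bound; early stopping may end training sooner
    weight_decay = 0.01,
    load_best_model_at_end = True,  # restore the best checkpoint when training ends
    metric_for_best_model = 'accuracy'
)

trainer = Trainer(
    model,
    args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    tokenizer=tokenizer,  # also used to pad each batch to a common length
    compute_metrics=compute_metrics,
    # Stop when accuracy has not improved for 3 consecutive evaluations
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)

trainer.train()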




Epoch,Training Loss,Validation Loss,Accuracy
0,No log,0.34221,0.853188
2,0.342000,0.341945,0.864513
2,0.246600,0.370493,0.85151
4,0.246600,0.400737,0.853607
4,0.158300,0.457601,0.858221


TrainOutput(global_step=1677, training_loss=0.23602707441463028, metrics={'train_runtime': 424.8314, 'train_samples_per_second': 504.836, 'train_steps_per_second': 7.885, 'total_flos': 1158008007392772.0, 'train_loss': 0.23602707441463028, 'epoch': 5.0})

1. The two `DistilBertConfig` lines in the code above are commented out, so no custom configuration is applied and the model keeps the default dropout of 0.1 (visible in the model printout in the Testing section). To train with a dropout of 0.2, these lines would have to be uncommented and the resulting `config` passed to `from_pretrained()`.

2. The model is created with `AutoModelForSequenceClassification.from_pretrained()`, which resolves `model_checkpoint` to a `DistilBertForSequenceClassification` instance with the pre-trained weights and a newly initialized classification head for 2 labels.

3. Training arguments (`TrainingArguments`) are defined, specifying settings such as the output directory, evaluation and save strategy, learning rate, batch sizes, gradient accumulation, number of epochs, weight decay, and the metric used to select the best model.

4. A `Trainer` object is created, taking the model, training arguments, training dataset (`train_ds`), evaluation dataset (`test_ds`), tokenizer, evaluation metrics function (`compute_metrics`), and an `EarlyStoppingCallback` with a patience of 3 evaluations (here, 3 epochs) as arguments.

5. The training process is initiated by calling the train() method on the Trainer object. This starts the training loop and trains the model based on the provided configurations.

In summary, the code sets up a DistilBERT model for binary sequence classification, defines training arguments, creates a Trainer object with the specified configurations, and starts the training process.
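
The batch settings in the training arguments interact: with gradient accumulation, the optimizer updates the weights only after several forward and backward passes. The short calculation below is a sketch of that arithmetic for the values used here, assuming training runs on a single GPU (`num_devices` is that assumption, not a value from the notebook).

```python
import math

per_device_train_batch_size = 32
gradient_accumulation_steps = 2
num_devices = 1  # assumption: a single Colab GPU

# Number of examples that contribute to one optimizer update
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

train_rows = 21447  # size of the training split
# Mini-batches per epoch; the last batch is smaller than 32
batches_per_epoch = math.ceil(train_rows / per_device_train_batch_size)

print(effective_batch_size)  # 64
print(batches_per_epoch)     # 671
```

This keeps the per-step memory footprint of a batch of 32 while each weight update is still based on 64 examples.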

### Summary of the results

Based on the provided code and results, we can draw the following conclusions:

1. Training ran for 5 of the configured 10 epochs; this is consistent with the early-stopping patience of 3, since accuracy did not improve during the last three evaluations.
2. The logged training loss decreased from 0.3420 to 0.1583 ("No log" in the first row only means that no training loss had been logged yet), indicating that the model kept fitting the training data.
3. The validation loss reached its minimum of 0.3419 at the second evaluation and then rose steadily to 0.4576 while the training loss kept falling, which is a typical sign of overfitting.
4. The validation accuracy varied between 0.8515 and 0.8645 and peaked at the second evaluation. Accuracy on the training set was not measured.
5. The training process took a total runtime of 424.8314 seconds, with an average of 504.836 samples per second and 7.885 steps per second.
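
Training ended after five evaluations even though `num_train_epochs` was set to 10. The sketch below replays the accuracy values from the results table through a simplified patience counter to show why. `early_stopping_point` is a hypothetical helper that only approximates what `EarlyStoppingCallback` does (it ignores the improvement threshold, for example).

```python
def early_stopping_point(scores, patience):
    """Return (stop_index, best_index) for a higher-is-better metric.

    stop_index is None if the patience is never exhausted.
    """
    best_index = 0
    bad_evals = 0
    for i, score in enumerate(scores):
        if i == 0 or score > scores[best_index]:
            best_index = i   # new best score: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1   # no improvement at this evaluation
            if bad_evals >= patience:
                return i, best_index
    return None, best_index

# Validation accuracy per evaluation, copied from the table above
accuracies = [0.853188, 0.864513, 0.85151, 0.853607, 0.858221]
stop_index, best_index = early_stopping_point(accuracies, patience=3)
print(stop_index + 1, best_index + 1)  # 5 2
```

The best score comes from the second evaluation, and the three evaluations after it do not beat it, so the patience of 3 is used up at the fifth.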

**Improvements that could be considered:**

1. Analyzing the training and validation loss curves over epochs to identify potential issues, such as overfitting or underfitting, and adjusting the model or training process accordingly.
2. Experimenting with different hyperparameter values, such as learning rate, batch size, and weight decay, to find optimal settings for improving model performance.
3. Implementing techniques like data augmentation or regularization (e.g., dropout) to improve generalization and reduce overfitting.
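4. Increasing the number of training epochs only if the validation loss and accuracy are still improving when training ends; in this run the validation loss rose after the second evaluation, so more epochs alone are unlikely to help.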
5. Collecting more diverse and representative data for training to enhance the model's ability to generalize to various scenarios.


In [None]:
trainer.evaluate()

{'eval_loss': 0.34194517135620117,
 'eval_accuracy': 0.864513422818792,
 'eval_runtime': 2.9549,
 'eval_samples_per_second': 806.802,
 'eval_steps_per_second': 25.382,
 'epoch': 5.0}

The evaluation summary shows the model's performance on the held-out test split. The evaluation loss, which measures how far the predicted class probabilities are from the true labels, is 0.3419. The evaluation accuracy is 0.8645, so about 86.45% of the instances (2,061 of 2,384) were classified correctly. Both values match the second row of the training log because `load_best_model_at_end = True` restored the checkpoint with the best accuracy. Evaluation took 2.9549 seconds, at about 806.8 samples and 25.4 steps per second. The `epoch` value of 5.0 records how long training ran before it stopped. Overall, the model reaches a reasonably high accuracy on the evaluation set.
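
For intuition about the `eval_loss` value, the sketch below computes a mean cross-entropy from logits in plain Python. It assumes the sequence-classification head uses the usual cross-entropy loss for single-label classification; the function names and toy logits are hypothetical.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true class, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

def mean_cross_entropy(batch_logits, labels):
    """Average the per-example losses, as an evaluation loop would."""
    losses = [cross_entropy(row, y) for row, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

# Toy values (hypothetical): one confident correct and one confident wrong prediction
toy_logits = [[2.0, 0.0], [0.0, 2.0]]
toy_labels = [0, 0]
print(round(mean_cross_entropy(toy_logits, toy_labels), 4))
```

A confident wrong prediction costs far more than a confident correct one earns, which is why the loss can rise while accuracy stays roughly the same.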

# Testing

In [None]:
trainer.model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [None]:
trainer.save_model('./BERT_model')

trainer.save_model('/content/drive/MyDrive/models/BERT_model')

In [None]:
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

text = "This is a happy positive example"
inputs = tokenizer(text, return_tensors='pt')
input_ids = inputs['input_ids'].to(device)
attention_mask = inputs['attention_mask'].to(device)

In [None]:
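model.eval()  # make sure dropout is disabled for inference
with torch.no_grad():  # no gradients are needed for prediction
  output = model(input_ids=input_ids, attention_mask=attention_mask)
  logits = output.logits
  predictions = torch.argmax(logits, dim=-1)  # index of the largest logit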

In [None]:
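# Assumes label 0 means negative and 1 means positive; the labels come from
# the `zs_prediction` column, so check its encoding if the output looks inverted
if predictions.item() == 0:
  print('This text is negative')
else:
  print('This text is positive')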

This text is negative
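
The argmax alone does not show how confident the model is. The sketch below converts a pair of logits into class probabilities with a softmax. The logit values are hypothetical stand-ins, not the model's actual output for the sentence above, and `softmax` is a small helper written for this illustration.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [0.4, -0.1]  # hypothetical values for classes 0 and 1
probs = softmax(example_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, [round(p, 3) for p in probs])  # 0 [0.622, 0.378]
```

A prediction with probabilities this close to each other is weak evidence either way, which is worth checking before trusting a single test sentence.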
