<a href="https://colab.research.google.com/github/LuluW8071/Text-Sentiment-Analysis/blob/main/Text_Sentiment_Analysis_using_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text-Sentiment-Analysis-using-BERT

In [1]:
# Reinstall the latest transformers release the first time this notebook runs on a GPU runtime,
# then comment these lines out
!pip uninstall transformers -y
!pip install transformers[torch]

Found existing installation: transformers 4.38.2
Uninstalling transformers-4.38.2:
  Successfully uninstalled transformers-4.38.2
Collecting transformers[torch]
  Downloading transformers-4.39.3-py3-none-any.whl (8.8 MB)
Collecting accelerate>=0.21.0 (from transformers[torch])
  Downloading accelerate-0.29.2-py3-none-any.whl (297 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch->transformers[torch])
  Using c

## 1. Download and Load the dataset

The dataset that the following script will download is a combination of the Yelp Polarity Dataset and the IMDb Movie Dataset. The Yelp Polarity Dataset has been preprocessed by selecting specific columns to create a dataset suitable for sentiment analysis. This preprocessed dataset has been merged with the IMDb Movie Dataset.

In [2]:
import gdown
import zipfile
import os

file_url = 'https://drive.google.com/uc?id=1Jp3D5gdxGrwa5dHbr4p-pECrD8wi7vik'
file_name = 'sentiment_dataset.zip'

# Download the file from Google Drive
gdown.download(file_url, file_name, quiet=False)
extract_dir = './dataset'

# Extract the zip file
with zipfile.ZipFile(file_name, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

# Remove the zip file after extraction
os.remove(file_name)
print("Files extracted successfully to:", extract_dir)

Downloading...
From (original): https://drive.google.com/uc?id=1Jp3D5gdxGrwa5dHbr4p-pECrD8wi7vik
From (redirected): https://drive.google.com/uc?id=1Jp3D5gdxGrwa5dHbr4p-pECrD8wi7vik&confirm=t&uuid=55759104-badb-4023-8282-3e65db748110
To: /content/sentiment_dataset.zip
100%|██████████| 182M/182M [00:01<00:00, 143MB/s]


Files extracted successfully to: ./dataset


In [4]:
import pandas as pd
import numpy as np

>__Note:__</br>
**BERT** (Bidirectional Encoder Representations from Transformers) can be fine-tuned on a relatively small dataset and still perform well on many tasks. The model is already pre-trained on large text corpora, so it starts with strong contextual representations, and regularization such as dropout and weight decay helps limit overfitting during fine-tuning.

>So, we can take just `5000` samples and fine-tune the model on them for our purpose.

In [34]:
import random

# Read dataset and take random 5000 samples
df = pd.read_csv("dataset/sentiment_combined.csv")
df = df.sample(n=5000, random_state=random.randint(0, 100))

# Reset the index
df.reset_index(drop=True, inplace=True)
df.head(), df.shape

(                                              review sentiment
 0  This was my first visit here after recently mo...  positive
 1  Jalape\u00f1o poppers are out of this world. N...  positive
 2  Excited about my first dining experience today...  negative
 3  Yelpers commenting on the Charlotte Area Trans...  positive
 4  I live for their nachos!!! My husband and I al...  positive,
 (5000, 2))
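
The sampling cell above draws `random_state` from `random.randint`, so each run selects a different subset of 5000 reviews. The sketch below illustrates the effect of fixing the seed instead. It uses the standard library as a stand-in for `DataFrame.sample`; the function name and the small sizes are hypothetical and chosen only for illustration.

```python
import random

def sample_indices(n_rows, n_samples, seed):
    """Pick n_samples distinct row indices using a dedicated, seeded generator."""
    rng = random.Random(seed)  # local generator, independent of the global random state
    return rng.sample(range(n_rows), n_samples)

# Hypothetical sizes for illustration; the notebook samples 5000 rows.
first = sample_indices(n_rows=100, n_samples=5, seed=42)
second = sample_indices(n_rows=100, n_samples=5, seed=42)

print(first == second)  # → True: the same seed selects the same subset
```

In pandas, the equivalent is to pass a fixed integer, for example `df.sample(n=5000, random_state=42)`.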

In [35]:
df['review'][0]

'This was my first visit here after recently moving to the Phoenix area this summer.  After living in CA, I have become accustomed to the being able to pop in local breweries for a beer and a bite...particularly in the San Diego area (incredible selection there, Stone is my favorite).  I really am trying to get into my local beers.  I enjoyed the Hope Knot IPA, so my wife surprised me for a lunch date at Four Peaks!  \\n\\nThe place was a little less polished than I imaged it would be, sort of a sports bar feel.  The food seletion looked good, typcial bar fare combined with some very unique selections.  We only order some pretzels...and I ordered the taster set so that I could explore their full line of beers.  The beer was good overall, not the best I have had.  I still think the Hop Knot is one of their best, and their ales are strong as well.  \\n\\nThe best thing about this place was the incredible service.  Despite the super casual feel, the service was fantastic.  Our server cove

In [36]:
df['sentiment'].value_counts()

sentiment
negative    2507
positive    2493
Name: count, dtype: int64

## 2. Text Pre-Processing

- Clean up the text data by removing punctuation, extra spaces, and numbers.
- Remove usernames, hashtags, URL fragments, and `<br />` tags. The NLTK stop-word list is downloaded but is not applied in this step.

In [37]:
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from collections import Counter

# Precompile regular expressions for faster preprocessing
non_word_chars_pattern = re.compile(r"[^\w\s]")
whitespace_pattern = re.compile(r"\s+")
digits_pattern = re.compile(r"\d")
username_pattern = re.compile(r"@([^\s]+)")
hashtags_pattern = re.compile(r"#\d+")
br_pattern = re.compile(r'<br\s*/?>\s*<br\s*/?>')

def preprocess_string(s):
    # Remove every character that is not a letter, digit, underscore, or whitespace
    s = non_word_chars_pattern.sub('', s)
    # Collapse each run of whitespace into a single space
    s = whitespace_pattern.sub(' ', s)
    # Remove digits
    s = digits_pattern.sub('', s)
    # Remove usernames (no effect here: "@" was already stripped above)
    s = username_pattern.sub('', s)
    # Remove numeric hashtags (no effect here: "#" was already stripped above)
    s = hashtags_pattern.sub('', s)
    # Remove doubled <br /> tags (no effect here: "<", "/", ">" were already stripped above)
    s = br_pattern.sub('', s)
    # Remove URL scheme fragments and the substring "rt" (this also hits "rt" inside words)
    s = s.replace("https", "")
    s = s.replace("http", "")
    s = s.replace("rt", "")
    s = s.replace("-", "")
    # Remove the substring "br" left over from <br /> tags (this also hits "br" inside words)
    s = s.replace("br", "")
    # Remove real newline characters (already collapsed to spaces above)
    s = s.replace("\n", "")
    return s

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [38]:
from tqdm.notebook import tqdm_notebook

preprocessed_reviews = []

# Apply preprocessing
for review in tqdm_notebook(df['review'], desc='Preprocessing'):
    preprocessed_review = preprocess_string(review)
    preprocessed_reviews.append(preprocessed_review)

# Assign the preprocessed reviews back to the 'review' column
df['review'] = preprocessed_reviews

Preprocessing:   0%|          | 0/5000 [00:00<?, ?it/s]

In [39]:
df['review'][0], df['sentiment'][0]

('This was my first visit here after recently moving to the Phoenix area this summer After living in CA I have become accustomed to the being able to pop in local eweries for a beer and a bitepaicularly in the San Diego area incredible selection there Stone is my favorite I really am trying to get into my local beers I enjoyed the Hope Knot IPA so my wife surprised me for a lunch date at Four Peaks nnThe place was a little less polished than I imaged it would be so of a spos bar feel The food seletion looked good typcial bar fare combined with some very unique selections We only order some pretzelsand I ordered the taster set so that I could explore their full line of beers The beer was good overall not the best I have had I still think the Hop Knot is one of their best and their ales are strong as well nnThe best thing about this place was the incredible service Despite the super casual feel the service was fantastic Our server covered every detail offered wonderful suggestions and ch

## 3. Mapping `sentiment` column to numeric values

In [40]:
# Map 'positive' to 1 & 'negative' to 0
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
df.head()

| | review | sentiment |
|---|---|---|
| 0 | This was my first visit here after recently mo... | 1 |
| 1 | Jalapeufo poppers are out of this world Never ... | 1 |
| 2 | Excited about my first dining experience today... | 0 |
| 3 | Yelpers commenting on the Charlotte Area Trans... | 1 |
| 4 | I live for their nachos My husband and I alway... | 1 |


## 4. Splitting the dataset into train and test sets

In [41]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['review'],
                                                    df['sentiment'],
                                                    test_size=0.2)

len(X_train), len(X_test)

(4000, 1000)

In [42]:
X_train, X_test, y_train, y_test = list(X_train), list(X_test), list(y_train), list(y_test)
X_train[:2], y_train[:2]

(['After a disappointing lunch we decided to live it up with some tapas Tapas for us at least are a little exquisite with good variety and expensive Julian Serrano about sums up these principals but even as tapas go this place is real goodnnWe ordered four dishes here which filled us up although we were looking for a light meal so it worked out We got the Croquetas chicken calamari white ceviche and papas avas Everything was delicious although the white ceviche was the clear winner while everything else was a tied for second runner up with no weak dishes to be had A great experience with a very friendly waiter as well making this the star eat of our Vegas tripnnAs long as you are prepared for the tapas experience cant go wrong with Julian Serrano',
  'Air conditioning stopped working around  pm Sunday July th After checking yelp called Legacy around  pm was told tech was just finishing a job and would probably be at my home sometime between   pm but time frame would be more like  Dave 

## 5. Preparing data using a custom Dataset

In [85]:
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments

# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

cuda


In [44]:
class data(Dataset):
  def __init__(self, encodings, labels):
    self.encodings = encodings
    self.labels = labels

  def __getitem__(self, index):
    item = {key: torch.tensor(val[index]) for key, val in self.encodings.items()}
    item['labels'] = torch.tensor(self.labels[index])
    return item

  def __len__(self):
    return len(self.labels)
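
Each item returned by the class above is a dictionary of model inputs plus a `labels` entry, which is the shape `Trainer` passes on to the model. The sketch below is a plain-Python stand-in that uses lists instead of tensors so the indexing logic can be inspected without PyTorch; `ToyDataset` and the token IDs are hypothetical.

```python
class ToyDataset:
    """Plain-Python stand-in for the Dataset above (lists instead of tensors)."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, index):
        # Take row `index` from every encoding field, then attach the label
        item = {key: val[index] for key, val in self.encodings.items()}
        item['labels'] = self.labels[index]
        return item

    def __len__(self):
        return len(self.labels)

# Hypothetical token IDs and masks for two already-padded sentences
encodings = {
    'input_ids': [[101, 2307, 102], [101, 2919, 102]],
    'attention_mask': [[1, 1, 1], [1, 1, 1]],
}
toy = ToyDataset(encodings, labels=[1, 0])
print(len(toy))  # → 2
print(toy[1])    # → {'input_ids': [101, 2919, 102], 'attention_mask': [1, 1, 1], 'labels': 0}
```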

## 6. Load PreTrained BERT Model

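**BERT** (Bidirectional Encoder Representations from Transformers) is a pre-trained language representation model developed by researchers at Google. This notebook loads `distilbert-base-uncased`, a checkpoint of DistilBERT, which is a smaller, distilled variant of BERT.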

<img src = "https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Attention_diagram_transformer.png?ssl=1">

- The BERT architecture consists of `multiple encoder transformer blocks` stacked together.
- Each transformer block includes `multi-head self-attention` and `feed-forward neural networks`.
- `Multi-head self-attention` lets BERT weigh each word against every other word in the sequence, capturing long-range dependencies effectively.
- The output of the `attention mechanism` passes through non-linear transformations in the `feed-forward neural networks`.
- `Layer normalization` and `residual connections` stabilize training and help gradients flow through each transformer block.
- `Positional embeddings` encode word order, which self-attention on its own does not capture.
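
The self-attention step in the list above can be sketched for a single head in plain Python. This is an illustration of scaled dot-product attention only: the toy vectors are hypothetical, and a real BERT layer first applies learned query, key, and value projections and runs several heads in parallel.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head, on lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three hypothetical 2-d token vectors used as queries, keys, and values
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])  # each row sums to 1
```

Each row of `weights` shows how strongly one token attends to every token in the sequence, which is what lets the model mix context from distant positions.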

>BERT is pre-trained on a large text corpus with masked language modeling and next sentence prediction. Fine-tuning adds a small task-specific head (here, a classification layer) and continues training the pre-trained weights on the target task.

### [Explanation Video on BERT](https://www.youtube.com/watch?v=6ahxPTLZxU8)

In [45]:
model_name = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)

## 7. Tokenize and Create Encoded Dataset

In [46]:
# Tokenize with truncation and padding and create dataset from tokenized data
train_encoding = tokenizer(X_train, truncation=True, padding=True)
test_encoding = tokenizer(X_test, truncation=True, padding=True)

train_dataset = data(train_encoding, y_train)
test_dataset = data(test_encoding, y_test)
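
With `truncation=True` and `padding=True`, the tokenizer cuts sequences that exceed the model's maximum length and pads the rest to the longest sequence in the batch, returning `input_ids` and an `attention_mask` that marks real tokens with 1 and padding with 0. The sketch below mimics that behaviour with a toy whitespace tokenizer. It is not WordPiece: `encode_batch`, the vocabulary, and the IDs are hypothetical.

```python
def encode_batch(texts, vocab, max_length=6, pad_id=0, unk_id=1):
    """Toy whitespace tokenizer with truncation and pad-to-longest."""
    # Map words to IDs and truncate each sequence to max_length
    ids = [[vocab.get(w, unk_id) for w in t.lower().split()][:max_length] for t in texts]
    longest = max(len(seq) for seq in ids)
    # Pad every sequence to the longest one and record which positions are real
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in ids]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in ids]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

vocab = {'the': 2, 'food': 3, 'was': 4, 'great': 5, 'bad': 6}  # hypothetical vocabulary
batch = encode_batch(['The food was great', 'Bad'], vocab)
print(batch['input_ids'])       # → [[2, 3, 4, 5], [6, 0, 0, 0]]
print(batch['attention_mask'])  # → [[1, 1, 1, 1], [1, 0, 0, 0]]
```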

## 8. Fine-Tuning BERT

Fine-tuning BERT, a pre-trained language model, allows us to adapt it to specific NLP tasks such as text classification, named entity recognition, sentiment analysis, and question answering.


<img src = "https://www.researchgate.net/publication/351386823/figure/fig4/AS:1024183752478725@1621195843655/BERT-Fine-tuning-pipeline-for-a-sample-sentiment-identification-task.jpg">

In [53]:
training_args = TrainingArguments(
  output_dir='./results',            # Directory where model checkpoints and results are saved
  num_train_epochs=3,                # Number of training epochs
  per_device_train_batch_size=32,    # Training batch size per device
  per_device_eval_batch_size=32,     # Evaluation batch size per device
  learning_rate=1e-04,               # Peak learning rate for the optimizer
  warmup_steps=500,                  # Number of warm-up steps for the learning rate scheduler
  weight_decay=0.01,                 # Weight decay coefficient for regularization
  logging_dir='./logs',              # Directory for training logs
  load_best_model_at_end=True,       # Load the best checkpoint at the end of training
  logging_steps=100,                 # Log training metrics every `logging_steps` steps
  save_steps=800,                    # Save a model checkpoint every `save_steps` steps
  evaluation_strategy="steps",       # Evaluate every `eval_steps` steps (defaults to `logging_steps`)
)
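
The interaction between `warmup_steps` and the total number of optimizer steps is worth checking. The sketch below assumes the `Trainer` default of a linear schedule with warm-up and re-implements it in plain Python; `linear_warmup_lr` is a hypothetical helper, and `total_steps=186` is taken from the `global_step` in the recorded training output of this notebook.

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warm-up to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# base_lr and warmup_steps come from the TrainingArguments cell above
for step in (0, 100, 186):
    lr = linear_warmup_lr(step, base_lr=1e-4, warmup_steps=500, total_steps=186)
    print(step, lr)
```

Under these assumptions training ends while the schedule is still warming up, so the learning rate peaks at roughly 3.7e-5 rather than the configured 1e-4. A smaller `warmup_steps` value would let the schedule reach its peak and decay.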


## 9. Train the Fine-Tuned BERT Model

In [54]:
model = DistilBertForSequenceClassification.from_pretrained(model_name)
trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=train_dataset,
                  eval_dataset=test_dataset)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [55]:
from accelerate import Accelerator

# Initialize Accelerator, then start fine-tuning
Accelerator()
trainer.train()

Step,Training Loss,Validation Loss


TrainOutput(global_step=186, training_loss=0.36029856692078294, metrics={'train_runtime': 563.2138, 'train_samples_per_second': 21.306, 'train_steps_per_second': 0.33, 'total_flos': 1576891913601024.0, 'train_loss': 0.36029856692078294, 'epoch': 2.98})
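
The recorded output reports only loss values. `Trainer` accepts a `compute_metrics` callable that receives the evaluation logits and labels and returns a dictionary of metrics. The sketch below is one possible accuracy function written without extra dependencies; the logits and labels are hypothetical, and the column order `[negative, positive]` follows the label mapping used earlier in this notebook.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels); works on nested lists or NumPy arrays."""
    logits, labels = eval_pred
    # Predicted class is the index of the largest logit in each row
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return {'accuracy': correct / len(preds)}

# Hypothetical logits for four reviews, columns = [negative, positive]
logits = [[2.1, -1.0], [0.3, 0.9], [-0.5, 1.7], [1.2, 0.4]]
labels = [0, 1, 0, 0]
print(compute_metrics((logits, labels)))  # → {'accuracy': 0.75}
```

It would be passed when constructing the trainer, for example `Trainer(..., compute_metrics=compute_metrics)`.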

## 10. Sentiment Prediction using custom text


In [105]:
# Tokenize the text, run the model, and return the predicted label with its probability
def predict_sentiment(model, tokenizer, text, device):
  model.eval()  # Disable dropout for inference
  tokenized = tokenizer(text, truncation=True, padding=True, return_tensors='pt').to(device)
  with torch.no_grad():  # Gradients are not needed for prediction
    outputs = model(**tokenized)
  probs = F.softmax(outputs.logits, dim=-1)
  preds = torch.argmax(probs, dim=-1).item()
  probs_max = probs.max().item()
  return "Positive" if preds == 1 else "Negative", probs_max
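
The probability that the function reports is the softmax of the two output logits, and the label is the index of the larger one. The sketch below works through that arithmetic in plain Python with hypothetical logits; it does not call the model.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = ['Negative', 'Positive']  # index 0 / 1, matching the notebook's label mapping

logits = [1.5, -0.9]  # hypothetical model output for one review
probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)
print(LABELS[pred], f'{probs[pred] * 100:.2f}%')
```

A logit gap of 2.4 already gives a probability above 90% for the larger class, so the reported percentage reflects the gap between the two logits more than any calibrated certainty.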

In [112]:
text = "The traffic was horrendous this morning; I was stuck in it for over an hour."
prediction, probs = predict_sentiment(model, tokenizer, text, device)
print(f'{text}\nSentiment: {prediction}\tProbability: {probs*100:.2f}%')

The traffic was horrendous this morning; I was stuck in it for over an hour.
Sentiment: Negative	Probability: 91.58%


In [115]:
text = "I was extremely disappointed with the quality of the product; it didn't meet my expectations at all."
prediction, probs = predict_sentiment(model, tokenizer, text, device)
print(f'{text}\nSentiment: {prediction}\tProbability: {probs*100:.2f}%')

I was extremely disappointed with the quality of the product; it didn't meet my expectations at all.
Sentiment: Negative	Probability: 91.96%


In [116]:
text = "The customer service at the restaurant was very good the staff went above and beyond to make us feel welcome."
prediction, probs = predict_sentiment(model, tokenizer, text, device)
print(f'{text}\nSentiment: {prediction}\tProbability: {probs*100:.2f}%')

The customer service at the restaurant was very good the staff went above and beyond to make us feel welcome.
Sentiment: Positive	Probability: 96.04%


In [117]:
text = "My recent stay at Paradise Resort was absolutely fantastic! From the moment I arrived, I was greeted with warm smiles and excellent service. The room was spacious, beautifully decorated, and spotlessly clean. I loved the breathtaking view from my balcony overlooking the pool and tropical gardens. The dining options were exceptional, and the resort's facilities were top-notch, offering everything from a fitness center to guided nature walks. Overall, Paradise Resort exceeded all my expectations, and I can't wait to return for another memorable stay!"
prediction, probs = predict_sentiment(model, tokenizer, text, device)
print(f'{text}\nSentiment: {prediction}\tProbability: {probs*100:.2f}%')

My recent stay at Paradise Resort was absolutely fantastic! From the moment I arrived, I was greeted with warm smiles and excellent service. The room was spacious, beautifully decorated, and spotlessly clean. I loved the breathtaking view from my balcony overlooking the pool and tropical gardens. The dining options were exceptional, and the resort's facilities were top-notch, offering everything from a fitness center to guided nature walks. Overall, Paradise Resort exceeded all my expectations, and I can't wait to return for another memorable stay!
Sentiment: Positive	Probability: 98.63%
