# Sentiment Analysis with BERT: IMDB Movie Reviews Dataset

###a.

In [None]:
import pandas as pd
#load a dataset into a dataframe from my google drive

from google.colab import drive
drive.mount('/content/drive')

df = pd.read_csv('/content/drive/My Drive/IMDB Dataset.csv')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
print(df.columns)
df.head(9)

Index(['review', 'sentiment'], dtype='object')


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative


In [None]:
# Clean up the dataset a bit
import re

# Remove HTML line breaks, links, user IDs, newlines and punctuation
def clean_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove HTML line-break tags (<br />) before punctuation is stripped,
    # so that the letters "br" inside ordinary words are left alone
    text = re.sub(r'<br\s*/?>', ' ', text)
    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Remove user IDs
    text = re.sub(r'@\w+', '', text)
    # Remove newlines
    text = re.sub(r'\r|\n', ' ', text)
    # Remove non-alphanumeric characters (except whitespace), including '#'
    text = re.sub(r'[^\w\s]', '', text)
    return text

df['review'] = df['review'].apply(clean_text)
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming te...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive
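
The order of the cleaning steps matters for the `<br />` tags. Once punctuation has been stripped, a tag is reduced to the bare letters `br`, and deleting every `br` also removes those letters from ordinary words. The sketch below compares the two orders on a made-up sentence; both helper functions and the sample text are hypothetical and only illustrate the effect.

```python
import re

def strip_br_then_punct(text):
    # Drop HTML line-break tags first, while the angle brackets still exist.
    text = re.sub(r'<br\s*/?>', ' ', text.lower())
    # Then remove everything that is not a word character or whitespace.
    text = re.sub(r'[^\w\s]', '', text)
    # Collapse the runs of whitespace left behind.
    return re.sub(r'\s+', ' ', text).strip()

def strip_punct_then_br(text):
    # Punctuation goes first, so '<br />' degrades to the bare letters 'br' ...
    text = re.sub(r'[^\w\s]', '', text.lower())
    # ... and deleting every 'br' also hits words such as 'brilliant'.
    text = re.sub(r'br', '', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "A brilliant film.<br /><br />Brad was great!"  # hypothetical review text
print(strip_br_then_punct(sample))   # a brilliant film brad was great
print(strip_punct_then_br(sample))   # a illiant film ad was great
```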


In [None]:
import pandas as pd

# The model's maximum input length is 128 tokens. I first split the long reviews,
# but that created 30,000 more rows, so I decided to drop reviews longer than
# 128 words instead (word count is only a rough proxy for token count).
# Define a function to count words in a review
def count_words(review):
    return len(review.split())

# Apply the function to count words in each review
df['word_count'] = df['review'].apply(count_words)

# Keep reviews of at most 128 words; .copy() returns an independent DataFrame,
# which avoids pandas' SettingWithCopyWarning when columns are added later
df_filtered = df[df['word_count'] <= 128].copy()

# Drop the temporary 'word_count' column
df_filtered = df_filtered.drop(columns=['word_count'])

# Display the filtered DataFrame
df_filtered.head()




Unnamed: 0,review,sentiment
5,probably my alltime favorite movie a story of ...,positive
8,encouraged by the positive comments about this...,negative
9,if you like original gut wrenching laughter yo...,positive
10,phil the alien is one of those quirky films wh...,negative
13,the cast played shakespeare shakespeare lost ...,negative


In [None]:
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 13816 entries, 5 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     13816 non-null  object
 1   sentiment  13816 non-null  object
dtypes: object(2)
memory usage: 323.8+ KB
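
The alternative mentioned in the cell comment, splitting long reviews instead of dropping them, could look roughly like the sketch below. It is an assumption about how such a split might be done, not the code that was actually tried: `split_into_chunks` is a hypothetical helper, and each chunk would have to inherit the label of its parent review, which is why the row count grows. Word counts are also only an approximation of subword token counts, so `truncation=True` in the tokenizer call remains necessary either way.

```python
def split_into_chunks(review, max_words=128):
    """Split a review into consecutive pieces of at most max_words words."""
    words = review.split()
    return [' '.join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# Hypothetical review consisting of 300 placeholder words
long_review = ' '.join(f'w{i}' for i in range(300))
chunks = split_into_chunks(long_review)

# One long review becomes three rows
print([len(chunk.split()) for chunk in chunks])  # [128, 128, 44]
```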


###b.

In [None]:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("finiteautomata/bertweet-base-sentiment-analysis")
model = AutoModelForSequenceClassification.from_pretrained("finiteautomata/bertweet-base-sentiment-analysis")

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def analyze_sentiment(review):
    # Tokenize and truncate the review to fit the model's max length
    tokens = tokenizer(review, truncation=True, padding='max_length', max_length=128, return_tensors='pt')
    tokens = {key: value.to(device) for key, value in tokens.items()}  # Move tokens to device

    with torch.no_grad():
        # Make prediction
        outputs = model(**tokens)

    # Get the predicted label
    logits = outputs.logits
    predicted_class = logits.argmax().item()
    labels = model.config.id2label
    return labels[predicted_class]

# Apply the function to the 'review' column
df_filtered['sentiment_model'] = df_filtered['review'].apply(analyze_sentiment)

# Display the DataFrame with sentiment
df_filtered.head()






Unnamed: 0,review,sentiment,sentiment_model
5,probably my alltime favorite movie a story of ...,positive,POS
8,encouraged by the positive comments about this...,negative,NEG
9,if you like original gut wrenching laughter yo...,positive,POS
10,phil the alien is one of those quirky films wh...,negative,NEG
13,the cast played shakespeare shakespeare lost ...,negative,NEG
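
The prediction step in `analyze_sentiment` takes the index of the largest logit and looks it up in `id2label`. The sketch below reproduces that step in plain Python and adds a softmax so the confidence of the prediction is visible. The logits are made up, and the `id2label` mapping is an assumption about the checkpoint (the real one is in `model.config.id2label`).

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed label mapping; check model.config.id2label for the real one
id2label = {0: 'NEG', 1: 'NEU', 2: 'POS'}

# Hypothetical logits for a single review
logits = [-1.2, 0.3, 2.5]
probs = softmax(logits)

# argmax over the class dimension, as logits.argmax() does for one review
predicted = max(range(len(probs)), key=probs.__getitem__)
print(id2label[predicted], round(probs[predicted], 3))
```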


In [None]:
# Evaluate the BERTweet model before fine-tuning
from sklearn.metrics import accuracy_score

# Map the model's labels onto the dataset's labels; any other label the model
# emits (e.g. a neutral class) stays unmapped and therefore counts as an error
df_filtered['sentiment_model'] = df_filtered['sentiment_model'].replace({'POS': 'positive', 'NEG': 'negative'})

# Calculate accuracy
accuracy = accuracy_score(df_filtered['sentiment'], df_filtered['sentiment_model'])
print(f'Accuracy: {accuracy * 100:.2f}%')


Accuracy: 79.10%
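
When a model can emit a label that the dataset does not contain, it is worth checking how much of the error comes from that label. The sketch below assumes the checkpoint has a third, neutral class (`NEU`) and uses made-up predictions. It computes the same accuracy as above and counts the neutral predictions separately.

```python
from collections import Counter

label_map = {'POS': 'positive', 'NEG': 'negative'}

# Hypothetical gold labels and raw model outputs
gold = ['positive', 'negative', 'positive', 'negative', 'positive']
raw_pred = ['POS', 'NEG', 'NEU', 'NEG', 'NEG']

# Unmapped labels (here 'NEU') pass through unchanged and never match a gold label
mapped = [label_map.get(p, p) for p in raw_pred]
accuracy = sum(g == p for g, p in zip(gold, mapped)) / len(gold)

counts = Counter(raw_pred)
print(f'Accuracy: {accuracy:.2f}')
print(f"NEU predictions: {counts['NEU']} of {len(raw_pred)}")
```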




###c.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder
label_encoder = LabelEncoder()
df_filtered['encoded_labels'] = label_encoder.fit_transform(df_filtered['sentiment'])
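
`LabelEncoder` sorts the classes alphabetically, so `negative` becomes 0 and `positive` becomes 1. If the pretrained classification head orders its outputs as NEG, NEU, POS (an assumption; the real mapping is in `model.config.label2id`), then index 1 is the neutral class, and fine-tuning would teach the model to emit "neutral" for positive reviews. A possible alternative is to encode the labels through the model's own mapping, sketched here with a hard-coded `label2id`:

```python
# Assumed mapping; in the notebook this would be read from model.config.label2id
label2id = {'NEG': 0, 'NEU': 1, 'POS': 2}

# Dataset label -> model label name
dataset_to_model = {'negative': 'NEG', 'positive': 'POS'}

def encode(sentiment):
    """Return the model's class index for a dataset sentiment string."""
    return label2id[dataset_to_model[sentiment]]

sentiments = ['positive', 'negative', 'positive']  # hypothetical labels
print([encode(s) for s in sentiments])  # [2, 0, 2]
```

With a pandas column, the same mapping could be applied via `df_filtered['sentiment'].map(encode)`.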




In [None]:
import torch
from torch.optim import AdamW  # the AdamW class in transformers is deprecated in favor of this one
from torch.utils.data import Dataset, DataLoader
from transformers import get_scheduler

# Step 1: Prepare the data
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }

# Training data: the cleaned reviews and their encoded labels
texts = df_filtered['review'].tolist()
labels = df_filtered['encoded_labels'].tolist()

# Create dataset and dataloader
dataset = SentimentDataset(texts, labels, tokenizer)
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)

# Step 2: Fine-tuning the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = AdamW(model.parameters(), lr=1e-5)

num_epochs = 5
num_training_steps = num_epochs * len(dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

model.train()

for epoch in range(num_epochs):
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'], labels=batch['label'])
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    # Note: this is the loss of the last batch of the epoch, not the epoch average
    print(f"Epoch {epoch + 1}: Loss = {loss.item()}")

# Step 3: Save the fine-tuned model
model.save_pretrained("fine-tuned-bertweet")
tokenizer.save_pretrained("fine-tuned-bertweet")




Epoch 1: Loss = 0.05631118640303612
Epoch 2: Loss = 0.11075017601251602
Epoch 3: Loss = 0.011303894221782684
Epoch 4: Loss = 0.12039816379547119
Epoch 5: Loss = 0.0024539364967495203
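
The values above are the loss of the final batch in each epoch, which explains why they jump up and down from one epoch to the next. A steadier signal is the mean loss over all batches of an epoch. The sketch below shows the bookkeeping with made-up batch losses; in the training loop one would append `loss.item()` to a list for each batch and average it at the end of the epoch.

```python
def epoch_mean(batch_losses):
    """Average the per-batch losses collected during one epoch."""
    return sum(batch_losses) / len(batch_losses)

# Hypothetical per-batch losses for one epoch
batch_losses = [0.40, 0.10, 0.25, 0.05]

print(f'last batch: {batch_losses[-1]:.4f}')
print(f'epoch mean: {epoch_mean(batch_losses):.4f}')
```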


('fine-tuned-bertweet/tokenizer_config.json',
 'fine-tuned-bertweet/special_tokens_map.json',
 'fine-tuned-bertweet/vocab.txt',
 'fine-tuned-bertweet/bpe.codes',
 'fine-tuned-bertweet/added_tokens.json')
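
With `num_warmup_steps=0`, the "linear" schedule used above decays the learning rate from its initial value to zero over the training steps. The sketch below writes that decay out in plain Python instead of calling the library. `linear_lr` is a hypothetical helper, and the step count assumes the 13,816 filtered reviews and a batch size of 16.

```python
import math

def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)

steps_per_epoch = math.ceil(13816 / 16)   # batches per epoch (the last one is smaller)
total_steps = 5 * steps_per_epoch         # num_epochs * len(dataloader)

for step in (0, total_steps // 2, total_steps):
    print(step, f'{linear_lr(step, 1e-5, 0, total_steps):.2e}')
```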

###d.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
N_FEATURES = 3000

def make_pipeline(_clf, _mx_feats):
    return Pipeline([('vect', CountVectorizer(max_features=_mx_feats)), ('tfidf', TfidfTransformer()), ('clf', _clf)])

svm_lin = make_pipeline(LinearSVC(dual=False, class_weight='balanced'), N_FEATURES)
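
To make the `vect` and `tfidf` steps concrete, here is a small standard-library sketch of what they compute for a toy corpus: raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` that `TfidfTransformer` uses by default, and L2 normalization of each document vector. The corpus is made up, and the whitespace tokenization is a simplification of `CountVectorizer`'s default token pattern.

```python
import math
from collections import Counter

docs = ["great movie great cast", "boring movie", "great fun"]  # toy corpus

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears
doc_freq = Counter(t for doc in tokenized for t in set(doc))

# Smoothed idf, the TfidfTransformer default (smooth_idf=True)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(doc_tokens):
    tf = Counter(doc_tokens)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    # L2-normalize so that every document vector has unit length
    return [x / norm if norm else 0.0 for x in raw]

vec = tfidf_vector(tokenized[0])
print(dict(zip(vocab, (round(x, 3) for x in vec))))
```

Rare terms such as "boring" receive a higher idf than terms that occur in several documents, and `max_features=N_FEATURES` limits the vocabulary to the most frequent terms of the corpus.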

In [None]:

from sklearn.model_selection import train_test_split
def split_eval_docs(_clf, _Xdocs, _ydocs):
    X_train, X_test, y_train, y_test = train_test_split(_Xdocs, _ydocs, test_size=0.3, random_state=42)
    _clf.fit(X_train, y_train)
    y_pred = _clf.predict(X_test)
    return y_test, y_pred, X_test

###e.

In [None]:
# Eval. SVM model from Module 8
import numpy as np
from sklearn.metrics import classification_report


plCategories = np.unique(df['sentiment'])

plCategories_mapping = {k:i for i, k in enumerate(plCategories)}


y_test, y_pred, X_test = split_eval_docs(svm_lin, df.review, df.sentiment)
print('SupportVector Machine\n' + classification_report(y_test, y_pred, target_names=plCategories))

SupportVector Machine
              precision    recall  f1-score   support

    negative       0.89      0.88      0.89      7411
    positive       0.89      0.90      0.89      7589

    accuracy                           0.89     15000
   macro avg       0.89      0.89      0.89     15000
weighted avg       0.89      0.89      0.89     15000
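
The per-class numbers in the report follow from counts of true positives, false positives and false negatives. A minimal sketch of that computation on made-up labels (`per_class_metrics` is a hypothetical helper, not part of scikit-learn):

```python
def per_class_metrics(y_true, y_pred, target):
    """Precision, recall and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical labels, not taken from the IMDB test split
toy_true = ['positive', 'positive', 'positive', 'negative', 'negative', 'negative']
toy_pred = ['positive', 'positive', 'negative', 'negative', 'negative', 'positive']

for label in ('negative', 'positive'):
    p, r, f = per_class_metrics(toy_true, toy_pred, label)
    print(f'{label:>8}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}')
```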



###f.

In [None]:
import pandas as pd
import numpy as np

# Example indices to inspect
indices = range(5, 10)

# Get the actual labels and predicted labels
y_true = y_test.iloc[indices].values
y_pred_classes = y_pred[indices]

# Get the associated X_test values
x_samples = X_test.iloc[indices].values

# Create a DataFrame
df_test = pd.DataFrame({
    'Index': indices,
    'X_test': [str(x) for x in x_samples],  # Convert arrays to strings for better display
    'Actual Label': y_true,
    'Predicted Label': y_pred_classes
})

# Display the DataFrame
df_test


Unnamed: 0,Index,X_test,Actual Label,Predicted Label
0,5,ive watched this movie on a fairly regular bas...,positive,positive
1,6,for once a story of hope highlighted over the ...,positive,positive
2,7,okay i didnt get the purgatory thing the first...,positive,negative
3,8,i was very disappointed with this series it ha...,negative,negative
4,9,the first 30 minutes of tinseltown had my fing...,negative,negative
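
Instead of inspecting a fixed range of indices, it can be more informative to look only at the reviews the model got wrong. A small sketch with made-up data; `misclassified` is a hypothetical helper, and in the notebook the inputs would be `X_test`, `y_test` and `y_pred`.

```python
def misclassified(texts, y_true, y_pred):
    """Return (position, text, actual, predicted) for every wrong prediction."""
    return [
        (i, text, actual, predicted)
        for i, (text, actual, predicted) in enumerate(zip(texts, y_true, y_pred))
        if actual != predicted
    ]

# Hypothetical samples
texts = ['loved it', 'not sure what to think', 'terrible plot']
actual = ['positive', 'positive', 'negative']
predicted = ['positive', 'negative', 'negative']

for row in misclassified(texts, actual, predicted):
    print(row)
```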


###g.

i. The model has an accuracy of 89%, meaning it correctly predicts the sentiment (positive or negative) for 89% of the 15,000 reviews in the test set. Precision is 0.89 for both the positive and negative classes. Recall is 0.88 for negative and 0.90 for positive reviews.

ii. The model performs almost equally well on positive and negative reviews, as indicated by the nearly identical precision, recall, and F1 scores for the two classes. Recall is slightly higher for positive reviews (0.90) than for negative reviews (0.88), so the model misses slightly fewer positive reviews than negative ones.

iii. SVMs can be computationally expensive, especially with large datasets. Keeping the feature space small (3,000 features) and relying on an optimized linear implementation (scikit-learn's `LinearSVC`) helped mitigate this challenge.

iv. 4 out of 5 predicted sentiments are correct. One review is predicted as negative, but its original label is positive. I read the review and could not definitely say whether it is positive or negative, so it is no wonder that the model did not classify it correctly.

v. Explore more advanced models such as BERT, RoBERTa, or GPT-based models, which have shown significant improvements on NLP tasks.

i. **Accuracy** is 0.79: 79% of instances are correctly classified. While accuracy is a useful metric, it may not always be the best indicator of model performance, especially with imbalanced datasets.
**Precision** is 0.80. High precision indicates a low false-positive rate. For the Titanic dataset, this means that when the model predicts a passenger survived, it is often correct. **Recall** and **F1 score** are both 0.79. High recall indicates that the model is good at identifying true positives; in this context, it means the model correctly identifies most of the survivors. The F1 score is the harmonic mean of precision and recall, providing a balanced measure of these two metrics.

ii. Significance: **Sex** is by far the most important feature in the model, contributing over 52% to the decision-making process. Women had a higher survival rate due to the "women and children first" evacuation protocol.
**Passenger class** is the second most important feature, contributing about 16% to the decision-making process. This suggests that socio-economic status played a significant role in survival rates, with those in higher classes (1st class) more likely to survive than those in lower classes (3rd class). **Age** contributes roughly 11% to the model's decisions. Younger passengers had a higher likelihood of survival, potentially due to the prioritization of children during evacuation. The rest of the features were not as important to the model.

iii. No challenges; the Decision Tree is straightforward.

iv. Tree Structure: The root node (Sex) represents the most important feature, and each subsequent split represents the next most important feature in that branch. The blue color represents passengers who survived, and the corresponding feature values can be seen.

v. Use ensemble methods like Random Forest or Gradient Boosting to improve model performance. These methods often outperform single decision trees by combining the strengths of multiple models.

