<a href="https://colab.research.google.com/github/BPALAN-USD/AAI-520/blob/main/AAI_520_Assignment_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Install the Required Dependencies

In [27]:
!pip install transformers kagglehub kaggle torch
!pip install --upgrade transformers




2. Import Dataset from Kaggle

In [39]:

import kagglehub
from kagglehub import KaggleDatasetAdapter
from google.colab import data_table
from transformers import BertTokenizer, BertForSequenceClassification
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler


data_table.enable_dataframe_formatter()

# Set the path to the file you'd like to load
file_path = "IMDB Dataset.csv"

# Load the latest version
df = kagglehub.dataset_load(
  KaggleDatasetAdapter.PANDAS,
  "lakshmi25npathi/imdb-dataset-of-50k-movie-reviews",
  file_path,
)

print("First 5 records:")
df = df.head(1000)
df.head()

Using Colab cache for faster access to the 'imdb-dataset-of-50k-movie-reviews' dataset.
First 5 records:


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


1. Text Pre-Processing

* Tokenize the movie reviews using the BERT tokenizer.
* Convert the tokenized reviews into input features suitable for BERT.

In [40]:

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


input_tokenized = tokenizer(
    df["review"].tolist(),
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"
)


In [41]:
from sklearn.preprocessing import LabelEncoder
labels = torch.tensor(LabelEncoder().fit_transform(df["sentiment"]))


input_ids = input_tokenized["input_ids"]
attention_mask = input_tokenized["attention_mask"]

**2. Model Training:**
- Load the pre-trained BERT model for sequence classification from the Transformers library.
- Fine-tune the BERT model on the preprocessed IMDb dataset for sentiment analysis.
- Implement training loops and loss calculation.

In [42]:
from torch.utils.data import TensorDataset

dataset = TensorDataset(input_tokenized["input_ids"], input_tokenized["attention_mask"], labels)


train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])


batch_size = 16
train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=batch_size)
val_dataloader = DataLoader(val_dataset, sampler=SequentialSampler(val_dataset), batch_size=batch_size)


In [43]:


from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup


model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)


optimizer = AdamW(model.parameters(), lr=2e-5)

epochs = 2
total_steps = len(train_dataloader) * epochs

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [44]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(epochs):
    print(f"\nEpoch {epoch+1}/{epochs}")


    model.train()
    total_train_loss = 0

    for batch in train_dataloader:
        b_input_ids, b_attention_mask, b_labels = tuple(t.to(device) for t in batch)

        model.zero_grad()

        outputs = model(
            input_ids=b_input_ids,
            attention_mask=b_attention_mask,
            labels=b_labels
        )

        loss = outputs.loss
        total_train_loss += loss.item()

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

    avg_train_loss = total_train_loss / len(train_dataloader)
    print(f"Train loss: {avg_train_loss:.4f}")


    model.eval()
    total_val_loss = 0
    correct, total = 0, 0

    for batch in val_dataloader:
        b_input_ids, b_attention_mask, b_labels = tuple(t.to(device) for t in batch)

        with torch.no_grad():
            outputs = model(
                input_ids=b_input_ids,
                attention_mask=b_attention_mask,
                labels=b_labels
            )

        loss = outputs.loss
        total_val_loss += loss.item()

        preds = torch.argmax(outputs.logits, dim=1)
        correct += (preds == b_labels).sum().item()
        total += b_labels.size(0)

    avg_val_loss = total_val_loss / len(val_dataloader)
    accuracy = correct / total
    print(f"Val loss: {avg_val_loss:.4f}, Accuracy: {accuracy:.4f}")



Epoch 1/2
Train loss: 0.6256
Val loss: 0.4635, Accuracy: 0.7900

Epoch 2/2
Train loss: 0.3584
Val loss: 0.3855, Accuracy: 0.8400


In [45]:
model.save_pretrained("./sentiment_model")
tokenizer.save_pretrained("./sentiment_model")


('./sentiment_model/tokenizer_config.json',
 './sentiment_model/special_tokens_map.json',
 './sentiment_model/vocab.txt',
 './sentiment_model/added_tokens.json')

**3. Evaluation:**
- Split the dataset into training and testing sets.
- Evaluate the trained model on the testing set using accuracy, precision, recall, and F1-score metrics.

In [46]:
from torch.utils.data import random_split


train_size = int(0.8 * len(dataset))
val_size   = int(0.1 * len(dataset))
test_size  = len(dataset) - train_size - val_size

train_dataset, val_dataset, test_dataset = random_split(dataset, [train_size, val_size, test_size])

train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=16)
val_dataloader   = DataLoader(val_dataset, sampler=SequentialSampler(val_dataset), batch_size=16)
test_dataloader  = DataLoader(test_dataset, sampler=SequentialSampler(test_dataset), batch_size=16)


In [47]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

model.eval()
all_preds = []
all_labels = []

for batch in test_dataloader:
    b_input_ids, b_attention_mask, b_labels = tuple(t.to(device) for t in batch)

    with torch.no_grad():
        outputs = model(
            input_ids=b_input_ids,
            attention_mask=b_attention_mask
        )

    preds = torch.argmax(outputs.logits, dim=1)

    all_preds.extend(preds.cpu().numpy())
    all_labels.extend(b_labels.cpu().numpy())


accuracy = accuracy_score(all_labels, all_preds)
precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="binary")

print("Evaluation Results on Test Set:")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")


Evaluation Results on Test Set:
Accuracy:  0.8600
Precision: 0.7755
Recall:    0.9268
F1-score:  0.8444


**4. Predictions:**
Use the trained model to predict sentiments for a set of sample movie reviews.

In [48]:
def predict_sentiment(model, tokenizer, texts, device="cpu"):
    model.eval()
    encodings = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt"
    ).to(device)

    with torch.no_grad():
        outputs = model(**encodings)
        preds = torch.argmax(outputs.logits, dim=1)

    return preds.cpu().numpy()


In [49]:
sample_reviews = [
    "An absolute masterpiece, every scene was breathtaking.",
    "Terrible script and poor direction ruined the experience.",
    "The soundtrack was amazing, it elevated the entire film.",
    "Mediocre at best, nothing really stood out.",
    "The visuals were stunning, but the story lacked depth.",
    "I laughed so much, the comedy was spot on!",
    "Predictable plot, I knew the ending from the start.",
    "The cast gave phenomenal performances, truly impressive.",
    "Not my type of movie, I struggled to stay interested.",
    "Thrilling from start to finish, I was on the edge of my seat.",
    "The pacing was slow and it dragged unnecessarily.",
    "One of the most emotional movies I’ve watched, very touching.",
    "The dialogue felt forced and unnatural.",
    "Great balance of action and drama, very entertaining.",
    "I couldn’t connect with the characters at all.",
    "The cinematography was top-notch, a visual delight.",
    "Too many clichés, nothing original about it.",
    "Brilliantly directed with an engaging storyline.",
    "Confusing and messy, I left more frustrated than entertained.",
    "A heartfelt story that stayed with me after watching."
]


preds = predict_sentiment(model, tokenizer, sample_reviews, device=device)

for review, pred in zip(sample_reviews, preds):
    sentiment = "positive" if pred == 1 else "negative"
    print(f"Review: {review}\nPredicted Sentiment: {sentiment}\n")


Review: An absolute masterpiece, every scene was breathtaking.
Predicted Sentiment: positive

Review: Terrible script and poor direction ruined the experience.
Predicted Sentiment: negative

Review: The soundtrack was amazing, it elevated the entire film.
Predicted Sentiment: positive

Review: Mediocre at best, nothing really stood out.
Predicted Sentiment: negative

Review: The visuals were stunning, but the story lacked depth.
Predicted Sentiment: positive

Review: I laughed so much, the comedy was spot on!
Predicted Sentiment: positive

Review: Predictable plot, I knew the ending from the start.
Predicted Sentiment: negative

Review: The cast gave phenomenal performances, truly impressive.
Predicted Sentiment: positive

Review: Not my type of movie, I struggled to stay interested.
Predicted Sentiment: negative

Review: Thrilling from start to finish, I was on the edge of my seat.
Predicted Sentiment: positive

Review: The pacing was slow and it dragged unnecessarily.
Predicted Senti

In [50]:
!jupyter nbconvert --to html AAI_520_Assignment_3_V1.ipynb

[NbConvertApp] Converting notebook AAI_520_Assignment_3_V1.ipynb to html
[NbConvertApp] Writing 330903 bytes to AAI_520_Assignment_3_V1.html
