<h2 style="text-align:center;">IMDb Movie Review Sentiment Analysis (BERT)</h2>

<h3 style="text-align:center;">Part A: NLP-Final Project</h3>

---


## 1. Introduction <a name="introduction"></a>

This project performs sentiment analysis on IMDb movie reviews using deep learning. This part fine-tunes a pre-trained BERT model (`bert-base-uncased`) to classify each review as positive or negative.

In [None]:
# Install required packages (scikit-learn is needed for train_test_split)
!pip install tensorflow transformers pandas numpy matplotlib nltk seaborn scikit-learn

In [None]:
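# tf-keras supplies the Keras 2 API that the TensorFlow classes in transformers rely on with recent TensorFlow releases
!pip install tf-keras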

### 1. Import Required Libraries

First, we import all necessary Python libraries for data processing, modeling, and visualization.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import re
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

## 2. Dataset Loading

We load the IMDb reviews dataset from a CSV file. The dataset contains movie reviews (the `review` column) and their corresponding sentiment labels (the `sentiment` column).

In [None]:
# Load dataset from CSV file
df = pd.read_csv('data_imdb.csv')
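
Before cleaning, it can help to confirm that the file has the columns the later cells rely on (`review` and `sentiment`) and that the classes are roughly balanced. The sketch below is only an illustration: `toy_df` is a hypothetical in-memory stand-in for `data_imdb.csv`, `check_dataset` is a hypothetical helper, and the label strings `positive`/`negative` are an assumption about the CSV.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv('data_imdb.csv')
toy_df = pd.DataFrame({
    "review": ["Loved it", "Terrible", "Pretty good", None],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

def check_dataset(frame, required=("review", "sentiment")):
    """Return missing columns, null counts per required column, and class counts."""
    missing = [c for c in required if c not in frame.columns]
    present = [c for c in required if c in frame.columns]
    nulls = frame[present].isna().sum().to_dict()
    counts = (
        frame["sentiment"].value_counts().to_dict()
        if "sentiment" in frame.columns else {}
    )
    return missing, nulls, counts

missing, nulls, counts = check_dataset(toy_df)
print(missing, nulls, counts)
```

A missing review would make a later call to `text.lower()` fail, since pandas reads empty cells as `NaN`, so dropping or filling such rows first is a reasonable precaution.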

## 3. Data Cleaning

We clean the text data by:
- Converting to lowercase
- Removing special characters and numbers
- Removing extra whitespace

In [None]:
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

# Apply cleaning function to review column
df['cleaned_review'] = df['review'].apply(clean_text)
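
If the reviews contain HTML line breaks such as `<br />` (an assumption about this CSV, though common in IMDb dumps), `clean_text` strips only the angle brackets and slash and leaves the letters `br` fused to the neighbouring word: `"Great!<br />Loved it 10/10"` becomes `"greatbr loved it"`. A minimal sketch of a variant that removes tags first is shown below; `clean_text_html` is a hypothetical name, not part of the notebook.

```python
import re

def clean_text_html(text):
    # Hypothetical variant of clean_text: replace HTML tags with a space first
    text = re.sub(r'<[^>]+>', ' ', text)
    # Then apply the same steps as clean_text
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return ' '.join(text.split())

cleaned_example = clean_text_html("Great!<br />Loved it 10/10")
print(cleaned_example)  # great loved it
```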

## 4. Prepare Data for BERT

We initialize the BERT tokenizer and encode each cleaned review into fixed-length model inputs: reviews longer than 128 tokens are truncated, and shorter ones are padded up to 128.

In [None]:
# Initialize BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Function to encode texts
def encode_texts(texts, max_length=128):
    return tokenizer(
        texts.tolist(),
        max_length=max_length,
        truncation=True,
        padding='max_length',
        return_tensors='tf'
    )

# Encode the cleaned reviews
encoded_data = encode_texts(df['cleaned_review'])
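
To make the effect of `truncation=True` and `padding='max_length'` concrete, here is a simplified pure-Python sketch. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are illustrative, and the real tokenizer also takes care of the special `[CLS]` and `[SEP]` tokens when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Mimic truncation=True and padding='max_length' on one list of token ids."""
    ids = token_ids[:max_length]            # truncate long sequences
    mask = [1] * len(ids)                   # 1 marks a real token
    n_pad = max_length - len(ids)
    # Pad short sequences; 0 in the mask marks padding
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 2023, 3185, 2001, 2307, 102], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The second returned list plays the role of `encoded_data['attention_mask']`, which tells the model which positions hold real tokens and which hold padding.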

## 4. Split Data into Training and Testing Sets

We prepare our data for modeling by:
- Converting sentiment labels to numerical values (0 and 1)
- Splitting the dataset into training (80%) and testing (20%) sets
- Using a fixed random state for reproducibility

In [None]:
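# Convert sentiment labels to numerical values.
# sort=True assigns codes in sorted label order instead of order of first appearance,
# so (assuming the labels are 'negative' and 'positive') negative -> 0 and positive -> 1,
# which is the mapping predict_sentiment relies on later.
df['sentiment'] = pd.factorize(df['sentiment'], sort=True)[0]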

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    encoded_data['input_ids'].numpy(),  # Convert to NumPy array before splitting
    df['sentiment'],
    test_size=0.2,
    random_state=42
)
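
The split above keeps only `input_ids`, so the model never sees the attention mask and treats padding as real tokens. One possible alternative, sketched here on hypothetical toy arrays rather than the real `encoded_data`, is to pass the mask through `train_test_split` as well so both arrays are split with the same shuffling. The `stratify` argument is an additional assumption that keeps the class ratio equal in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for encoded_data['input_ids'].numpy(),
# encoded_data['attention_mask'].numpy() and df['sentiment']
rng = np.random.default_rng(0)
input_ids = rng.integers(1, 1000, size=(10, 8))
attention_mask = np.ones((10, 8), dtype=int)
labels = np.array([0, 1] * 5)

# All arrays are shuffled and split with the same row permutation
ids_train, ids_test, mask_train, mask_test, lab_train, lab_test = train_test_split(
    input_ids, attention_mask, labels,
    test_size=0.2, random_state=42, stratify=labels,
)
print(ids_train.shape, mask_train.shape, lab_test)
```

With this layout the model inputs could be given as a dictionary such as `{'input_ids': ids_train, 'attention_mask': mask_train}`, which would also require the same change in the training and evaluation cells.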

## 5. Load Pre-trained BERT Model

We load the pre-trained BERT base model (uncased version) and adapt it for our binary classification task by:
- Using the base BERT architecture
- Adding a classification head with 2 output units (positive/negative sentiment)

In [None]:
# Load pre-trained BERT model for sequence classification
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 6. Compile the Model

We configure the model for training with:
- Adam optimizer with a small learning rate (2e-5) suitable for fine-tuning
- Sparse categorical crossentropy loss function (since we have integer labels)
- Accuracy as our evaluation metric

In [None]:
# Configure model training parameters
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

# Compile the model
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
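
As a worked illustration of what this loss measures, the NumPy sketch below computes sparse categorical cross-entropy from raw logits for integer labels. It is a simplified re-derivation for intuition, not the Keras implementation, and `sparse_ce_from_logits` is a hypothetical helper name.

```python
import numpy as np

def sparse_ce_from_logits(logits, labels):
    """Mean cross-entropy for integer labels, computed from unnormalized logits."""
    logits = np.asarray(logits, dtype=float)
    # Subtract the row maximum for numerical stability before the log-softmax
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class for each row
    return -log_probs[np.arange(len(labels)), labels].mean()

loss_uniform = sparse_ce_from_logits([[0.0, 0.0]], [1])    # equal logits
loss_confident = sparse_ce_from_logits([[0.0, 4.0]], [1])  # strongly favours class 1
print(loss_uniform, loss_confident)
```

With two classes, equal logits give a loss of ln 2 ≈ 0.693, so a training loss that stays near that value suggests the model is doing little better than guessing.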

## 7. Train the Model

We train the model with:
- Training data (`X_train`, `y_train`)
- The test set (`X_test`, `y_test`) reused as validation data
- 1 epoch (for demonstration; fine-tuning typically uses more)
- A batch size of 64

In [None]:
# Train the model
history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=1,
    batch_size=64
)
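
Because the test set doubles as validation data here, any later tuning based on the validation score would leak into the final test result. A minimal sketch of one alternative, holding out part of the training data instead, is shown below on hypothetical toy arrays; `holdout_split` and the 10% fraction are assumptions, not part of the notebook.

```python
import numpy as np

def holdout_split(features, labels, val_fraction=0.1, seed=42):
    """Shuffle row indices and carve a validation subset out of the training data."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_val = int(len(features) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return features[train_idx], labels[train_idx], features[val_idx], labels[val_idx]

# Hypothetical stand-ins for X_train and y_train
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.arange(20) % 2
X_fit, y_fit, X_val, y_val = holdout_split(toy_X, toy_y, val_fraction=0.1)
print(X_fit.shape, X_val.shape)
```

The helper indexes by position, so a pandas Series such as `y_train` would need converting with `.to_numpy()` first.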





## 8. Evaluate Model Performance

We evaluate the trained model on the test set to get:
- Test loss value
- Test accuracy score

In [None]:
# Evaluate model on test set
test_loss, test_acc = model.evaluate(X_test, y_test)

# Print test accuracy
print(f"Test Accuracy: {test_acc:.4f}")

Test Accuracy: 0.6275
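
Accuracy alone can hide a model that predicts the same class for every input, so it may be worth looking at precision and recall as well. The sketch below uses hypothetical label arrays, not this notebook's predictions, and `binary_report` is a hypothetical helper that treats class 1 as the positive class.

```python
import numpy as np

def binary_report(y_true, y_pred):
    """Accuracy, precision and recall for the positive class (label 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    tn = int(((y_true == 0) & (y_pred == 0)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# A classifier that always answers 0 still scores 60% accuracy on this toy data
report = binary_report([1, 1, 0, 0, 0], [0, 0, 0, 0, 0])
print(report)
```

In this notebook the predicted classes could plausibly be obtained with `np.argmax(model.predict(X_test).logits, axis=1)`, assuming the model output exposes `logits` as it does in the prediction function later on.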


## 9. Save the Model

We save the trained model and tokenizer for future use, which allows us to:
- Avoid retraining the model each time
- Deploy the model in production
- Share the model with others

In [None]:
# Save the trained model and tokenizer
model.save_pretrained('sentiment_bert_model')
tokenizer.save_pretrained('sentiment_bert_model')

('sentiment_bert_model/tokenizer_config.json',
 'sentiment_bert_model/special_tokens_map.json',
 'sentiment_bert_model/vocab.txt',
 'sentiment_bert_model/added_tokens.json')

## 10. Load the Saved Model

We demonstrate how to load the saved model, which is useful for:
- Making predictions without retraining
- Continuing training later
- Deploying the model in different environments

In [None]:
# Load the saved model and tokenizer
loaded_model = TFBertForSequenceClassification.from_pretrained('sentiment_bert_model')
loaded_tokenizer = BertTokenizer.from_pretrained('sentiment_bert_model')

Some layers from the model checkpoint at sentiment_bert_model were not used when initializing TFBertForSequenceClassification: ['dropout_303']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at sentiment_bert_model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


## 11. Test with Sample Data

We create a prediction function that:
1. Cleans input text
2. Tokenizes the text for BERT
3. Makes sentiment predictions
4. Returns both the prediction and confidence score

We then test this function with sample reviews.

In [None]:
def predict_sentiment(text, model, tokenizer):
    # Clean and tokenize the text
    cleaned_text = clean_text(text)
    inputs = tokenizer(
        cleaned_text,
        max_length=128,
        truncation=True,
        padding='max_length',
        return_tensors='tf'
    )

    # Make prediction
    outputs = model(inputs)
    logits = outputs.logits
    probabilities = tf.nn.softmax(logits, axis=1)
    predicted_class = tf.argmax(probabilities, axis=1).numpy()[0]

    # Get confidence score
    confidence = np.max(probabilities.numpy())

    return "Positive" if predicted_class == 1 else "Negative", confidence
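
The function above hardcodes class 1 as "Positive", which is only right if the label encoding used in training assigned 1 to positive reviews. A small sketch that makes the mapping explicit is given below; `ID2LABEL` and `decode_logits` are hypothetical names, and the mapping itself is an assumption that should be checked against the encoded labels.

```python
import numpy as np

# Assumed mapping; verify it against how the sentiment column was encoded
ID2LABEL = {0: "Negative", 1: "Positive"}

def decode_logits(logits, id2label=ID2LABEL):
    """Turn one row of logits into a (label, confidence) pair via softmax."""
    logits = np.asarray(logits, dtype=float)
    exp = np.exp(logits - logits.max())  # stable softmax
    probs = exp / exp.sum()
    idx = int(probs.argmax())
    return id2label[idx], float(probs[idx])

label, conf = decode_logits([0.5, 0.0])
print(label, round(conf, 2))
```

With two classes the confidence can never fall below 0.5, so values only slightly above that mean the two logits are close and the model is nearly undecided.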

## 12. Sample Predictions

We test our model with diverse sample reviews to:
- Verify model performance
- Show different confidence levels
- Demonstrate real-world usage

In [None]:
# Sample reviews for testing
sample_reviews = [
    "This movie was absolutely fantastic! The acting was superb.",
    "I hated this film. Worst two hours of my life.",
    "The plot was predictable but the cinematography made up for it.",
    "Not worth the money. Would not recommend to anyone.",
    "The director did an amazing job with this adaptation.",
    "Boring from start to finish. Fell asleep halfway through."
]

# Make predictions and display results
print("\nSample Predictions:")
for review in sample_reviews:
    sentiment, confidence = predict_sentiment(review, loaded_model, loaded_tokenizer)
    print(f"Review: {review[:60]}...")
    print(f"Predicted Sentiment: {sentiment} (Confidence: {confidence:.2f})")
    print("-" * 80)


Sample Predictions:
Review: This movie was absolutely fantastic! The acting was superb....
Predicted Sentiment: Negative (Confidence: 0.62)
--------------------------------------------------------------------------------
Review: I hated this film. Worst two hours of my life....
Predicted Sentiment: Negative (Confidence: 0.62)
--------------------------------------------------------------------------------
Review: The plot was predictable but the cinematography made up for ...
Predicted Sentiment: Negative (Confidence: 0.65)
--------------------------------------------------------------------------------
Review: Not worth the money. Would not recommend to anyone....
Predicted Sentiment: Negative (Confidence: 0.63)
--------------------------------------------------------------------------------
Review: The director did an amazing job with this adaptation....
Predicted Sentiment: Negative (Confidence: 0.63)
--------------------------------------------------------------------------------


---