# IMDB 50K Movie Reviews - Sentiment Classification with MLP

## Overview

In the following exercises, you will work with the IMDB 50K Movie Reviews dataset to build a sentiment analysis model using a Multi-Layer Perceptron (MLP). You will practice the essential steps of the data science pipeline: data loading, preprocessing, feature generation, and training and testing a neural network model. Complete the exercises using the `pandas`, `nltk`, `sklearn`, and `torch` libraries.


### Exercise 1: Data Loading and Exploration
**Objective**: Load the IMDB dataset and explore its structure.

1. **Load the dataset** using `pandas`. The dataset is in the `data/` folder and the file name is `imdb_dataset.zip`. **Hint: you can load zip files with pandas by passing `compression='zip'` to `pd.read_csv`**
2. **Explore the dataset** by checking for missing values and getting a summary of the data. 
    - Check the shape of the dataset.
    - Get the distribution of the sentiment labels (positive/negative reviews).
3. Print the first few reviews and their corresponding labels.


In [3]:
import modin.pandas as pd # Optimized distributed version of Pandas

# Load dataset
df = pd.read_csv('../../data/imdb_dataset.zip', compression='zip')  # Load as per your format
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [3]:
# Check for missing values
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [4]:
# Check the distribution of labels
df['sentiment'].value_counts()


sentiment
negative    25000
positive    25000
Name: count, dtype: int64
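
Step 2 also asks for the shape of the dataset and a summary of the data. A minimal sketch of those checks follows. It uses plain `pandas` and a small made-up `demo_df` (both are assumptions for illustration; the cells above use `modin.pandas` and the real file), so it runs without the dataset.

```python
import pandas as pd

# Toy stand-in for the IMDB data (hypothetical rows, not taken from the real file)
demo_df = pd.DataFrame({
    "review": ["Great film", "Terrible plot", "Loved it", "Not for me"],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

# Shape: (number of rows, number of columns)
n_rows, n_cols = demo_df.shape
print(f"Rows: {n_rows}, columns: {n_cols}")

# Missing values per column
missing_per_column = demo_df.isnull().sum()
print(missing_per_column)

# Label distribution
label_counts = demo_df["sentiment"].value_counts()
print(label_counts)

# Summary of the columns (count, unique values, most frequent value)
print(demo_df.describe())
```

On the real data, the same calls on `df` should report 50,000 rows, matching the label counts shown above.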

### Exercise 2: Splitting the Data
**Objective**: Split the data into training, validation and test sets.

1. Split the dataset into features (reviews) and labels (sentiment).
2. Use `train_test_split` from `sklearn` to split the dataset into training, validation and test sets (use a 60/20/20 split).
3. Print the sizes of the training, validation and test sets to ensure the splits were done correctly.

In [None]:
from sklearn.model_selection import train_test_split

# Split data into features (X) and labels (y)
X = df['review']
y = df['sentiment']

In [6]:
# Split into training (60%), validation (20%), and test (20%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

In [7]:
print(f"Training set size: {len(X_train)}")
print(f"Validation set size: {len(X_val)}")
print(f"Test set size: {len(X_test)}")

Training set size: 30000
Validation set size: 10000
Test set size: 10000
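
The 60/20/20 split comes from chaining two hold-out splits: the first call holds out 40% of the data, and the second call cuts that hold-out in half. The arithmetic can be sketched in plain Python. `split_sizes` is a hypothetical helper written for this illustration, and it uses simple rounding, so for sizes that do not divide evenly it may differ from `train_test_split` by one sample.

```python
def split_sizes(n_samples, first_test_size=0.4, second_test_size=0.5):
    """Return (train, val, test) sizes for two chained hold-out splits."""
    n_temp = round(n_samples * first_test_size)  # held out by the first split
    n_train = n_samples - n_temp                 # what remains for training
    n_test = round(n_temp * second_test_size)    # second split halves the hold-out
    n_val = n_temp - n_test
    return n_train, n_val, n_test


sizes = split_sizes(50_000)
print(sizes)  # (30000, 10000, 10000)
```

If keeping the positive/negative ratio identical in every subset matters, `train_test_split` also accepts a `stratify` argument (for example `stratify=y` in the first call and `stratify=y_temp` in the second).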


### Exercise 3: Text Preprocessing
**Objective**: Preprocess the text data to prepare it for feature generation.

1. Lowercase the text data. **Hint: Python has a built-in string method for this**.
2. Remove any URLs from the reviews. **Hint: you can use regular expressions for this**.
3. Remove non-word and non-whitespace characters (punctuation, special characters, etc.). **Hint: you can use regular expressions for this**.
4. Remove digits. **Hint: you can use regular expressions for this**.
5. Tokenize the reviews into individual words. **Hint: you can use the `nltk` library for this**.
6. Remove stopwords. **Hint: you can use the `nltk` library for this**.
7. Perform stemming or lemmatization. **Hint: you can use the `nltk` library for this**.
8. Apply the preprocessing steps to the training, validation, and test sets.

In [8]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import re

nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

[nltk_data] Downloading package punkt to /home/joao-
[nltk_data]     correia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/joao-
[nltk_data]     correia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
# Define a preprocessing function
def preprocess_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Replace non-word characters (punctuation, symbols) with spaces
    text = re.sub(r'\d+', '', text)  # Remove digits
    words = word_tokenize(text)  # Tokenize text
    words = [word for word in words if word not in stop_words]  # Remove stop words
    words = [stemmer.stem(word) for word in words]  # Perform stemming
    return ' '.join(words)

In [10]:
# Apply preprocessing to the training, validation, and test sets
X_train = X_train.apply(preprocess_text)
X_val = X_val.apply(preprocess_text)
X_test = X_test.apply(preprocess_text)

In [11]:
X_train.head()

18306    borrow slightli modifi titl comment say usual ...
49528    product account movi also got voic work entir ...
44745    far one worst movi ever seen life watch practi...
46827    obvious inspir seen sometim even gruesom blood...
27531    movi almost gener defin import us born earli p...
Name: review, dtype: object
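
To see what the regular-expression steps do on their own, here is a small sketch that applies them to a made-up sentence. `regex_clean` and its `strip_html` flag are hypothetical names for this illustration, and `str.split()` stands in for `word_tokenize` so the example needs no NLTK downloads. It also shows one side effect: the reviews contain `<br />` tags, and after punctuation removal each tag leaves a stray `br` token unless HTML tags are stripped first.

```python
import re


def regex_clean(text, strip_html=False):
    """Apply the lowercase/URL/punctuation/digit steps and split on whitespace."""
    text = text.lower()
    if strip_html:
        text = re.sub(r'<[^>]+>', ' ', text)    # optional: drop HTML tags such as <br />
    text = re.sub(r'http\S+|www\S+', '', text)  # remove URLs
    text = re.sub(r'\W', ' ', text)             # punctuation and symbols become spaces
    text = re.sub(r'\d+', '', text)             # remove digits
    return text.split()                         # simple whitespace tokenization


sample = "Visit http://example.com NOW!<br /><br />It's a 10/10 film."

tokens = regex_clean(sample)
tokens_no_html = regex_clean(sample, strip_html=True)

print(tokens)          # ['visit', 'now', 'br', 'br', 'it', 's', 'a', 'film']
print(tokens_no_html)  # ['visit', 'now', 'it', 's', 'a', 'film']
```

Whether the extra `br` tokens matter is a judgment call: they appear in reviews of both classes, so TF-IDF will tend to give them little weight, but stripping the tags keeps the vocabulary cleaner.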

### Exercise 4: Feature Generation (TF-IDF)
**Objective**: Convert the preprocessed text data into numerical features using TF-IDF.

1. Use the `TfidfVectorizer` from `sklearn` to convert the reviews into numerical vectors.
2. Limit the maximum number of features to 5,000 to reduce the dimensionality.
3. Fit the vectorizer on the training set and transform the training, validation, and test sets.
4. Print the shape of the transformed feature sets to confirm the conversion.

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=5000)

# Fit and transform the training data, transform the validation and test data
X_train_tfidf = vectorizer.fit_transform(X_train).toarray()
X_val_tfidf = vectorizer.transform(X_val).toarray()
X_test_tfidf = vectorizer.transform(X_test).toarray()

X_train_tfidf.shape, X_val_tfidf.shape, X_test_tfidf.shape

((30000, 5000), (10000, 5000), (10000, 5000))
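
For intuition about the numbers inside these matrices, the sketch below computes TF-IDF by hand on three toy documents. It follows the `TfidfVectorizer` defaults as I understand them (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row). The helper `tfidf_rows` and the toy documents are assumptions for this illustration, and whitespace splitting replaces the vectorizer's own tokenizer.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in documents]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears
    doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])  # each row ends up with unit length
    return vocab, rows


docs = ["good movi", "bad movi", "good good plot"]
vocab, rows = tfidf_rows(docs)

print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

Terms that appear in many documents (here `movi` and `good`) receive a lower idf than rare ones (`bad`, `plot`), which is the property that makes TF-IDF useful for separating reviews.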

### Exercise 5: Building the MLP Model (PyTorch)
**Objective**: Build a simple Multi-Layer Perceptron (MLP) for binary classification.

1. Define the MLP model using `torch.nn.Module`. The model should have:
    - An input layer that matches the size of the TF-IDF features.
    - Two hidden layers with ReLU activations.
    - A single output layer with a sigmoid activation function.
2. Print the model summary.


In [13]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define the MLP model
class MLP(nn.Module):
    def __init__(self, input_dim):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 1)  # Binary classification
        self.relu = nn.ReLU()
        
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return torch.sigmoid(x)

In [14]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Initialize the model
input_dim = X_train_tfidf.shape[1]
model = MLP(input_dim).to(device)

# Print the model summary
print(model)

Using device: cuda
MLP(
  (fc1): Linear(in_features=5000, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=1, bias=True)
  (relu): ReLU()
)
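
The printed summary lists the layers but not the number of trainable parameters. Each `Linear` layer holds `in_features * out_features` weights plus `out_features` biases, so the total can be worked out by hand. The helper `count_linear_params` below is a hypothetical name used only for this sketch.

```python
def count_linear_params(layer_shapes):
    """Sum weights (in * out) and biases (out) over fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in layer_shapes)


# (in_features, out_features) for fc1, fc2, fc3 as printed above
layer_shapes = [(5000, 128), (128, 64), (64, 1)]
total_params = count_linear_params(layer_shapes)

print(total_params)  # 648449
```

In the notebook itself, `sum(p.numel() for p in model.parameters())` should give the same figure. Almost all of the parameters sit in the first layer, which is why capping the TF-IDF vocabulary at 5,000 features keeps the model small.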


### Exercise 6: Training the Model
**Objective**: Train the MLP model on the training data.

1. Convert the TF-IDF feature matrices and labels into PyTorch tensors (the label needs to be binarized).
2. Define the loss function (`BCELoss` for binary classification) and the optimizer (`Adam`).
3. Implement a training loop to train the model for a specified number of epochs (e.g., 50).
4. Monitor the training and validation loss during training.

In [15]:
# Convert data to tensors
X_train_tensor = torch.tensor(X_train_tfidf, dtype=torch.float32).to(device)
y_train = y_train.map({'positive': 1, 'negative': 0})
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).view(-1, 1).to(device)
X_val_tensor = torch.tensor(X_val_tfidf, dtype=torch.float32).to(device)
y_val = y_val.map({'positive': 1, 'negative': 0})
y_val_tensor = torch.tensor(y_val.values, dtype=torch.float32).view(-1, 1).to(device)

In [16]:
# Define loss and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [17]:
# Training loop with validation monitoring
num_epochs = 100
best_val_loss = float('inf')

for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    
    # Forward pass on training data
    outputs = model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)
    
    # Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # Validation step (model evaluation on validation set)
    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():
        val_outputs = model(X_val_tensor)
        val_loss = criterion(val_outputs, y_val_tensor)
        val_predictions = (val_outputs > 0.5).float()
        val_accuracy = (val_predictions.eq(y_val_tensor).sum() / y_val_tensor.shape[0]).item()
    
    # Print training and validation stats
    print(f'Epoch [{epoch+1}/{num_epochs}], '
          f'Training Loss: {loss.item():.4f}, '
          f'Validation Loss: {val_loss.item():.4f}, '
          f'Validation Accuracy: {val_accuracy * 100:.2f}%')


Epoch [1/100], Training Loss: 0.6936, Validation Loss: 0.6928, Validation Accuracy: 49.90%
Epoch [2/100], Training Loss: 0.6926, Validation Loss: 0.6916, Validation Accuracy: 49.90%
Epoch [3/100], Training Loss: 0.6915, Validation Loss: 0.6901, Validation Accuracy: 49.90%
Epoch [4/100], Training Loss: 0.6899, Validation Loss: 0.6882, Validation Accuracy: 50.93%
Epoch [5/100], Training Loss: 0.6879, Validation Loss: 0.6860, Validation Accuracy: 62.51%
Epoch [6/100], Training Loss: 0.6855, Validation Loss: 0.6835, Validation Accuracy: 73.61%
Epoch [7/100], Training Loss: 0.6829, Validation Loss: 0.6807, Validation Accuracy: 79.47%
Epoch [8/100], Training Loss: 0.6799, Validation Loss: 0.6776, Validation Accuracy: 81.69%
Epoch [9/100], Training Loss: 0.6765, Validation Loss: 0.6741, Validation Accuracy: 82.44%
Epoch [10/100], Training Loss: 0.6728, Validation Loss: 0.6703, Validation Accuracy: 82.64%
Epoch [11/100], Training Loss: 0.6687, Validation Loss: 0.6661, Validation Accuracy: 82.4
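
The first epochs report a loss of about 0.693 and an accuracy near 50%. That is what binary cross-entropy gives when the model outputs roughly 0.5 for every review, because `-ln(0.5) ≈ 0.693`. The plain-Python sketch below reproduces this; `binary_cross_entropy` is a hand-written stand-in for illustration, and its `eps` clipping is a simplification of how `nn.BCELoss` guards against `log(0)`.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)


targets = [1, 0, 1, 0]

# An untrained model that predicts 0.5 everywhere
untrained_loss = binary_cross_entropy([0.5, 0.5, 0.5, 0.5], targets)
# A model that leans the right way on every example
confident_loss = binary_cross_entropy([0.9, 0.1, 0.8, 0.2], targets)

print(f"{untrained_loss:.4f}")  # 0.6931
print(f"{confident_loss:.4f}")
```

A loss that stays close to 0.693 for many epochs therefore means the model is still guessing; here it starts to fall as soon as the validation accuracy rises.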

### Exercise 7: Model Evaluation
**Objective**: Evaluate the performance of the trained model on the test data.

1. Use the trained model to make predictions on the test set.
2. Calculate the accuracy of the model on the test data.
3. Print the test accuracy.

In [18]:
# Convert test data to tensors
X_test_tensor = torch.tensor(X_test_tfidf, dtype=torch.float32).to(device)
y_test = y_test.map({'positive': 1, 'negative': 0})
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).view(-1, 1).to(device)

# Set model to evaluation mode
model.eval()

# Evaluate on the test set
with torch.no_grad():
    test_outputs = model(X_test_tensor)
    test_predictions = (test_outputs > 0.5).float()  # Convert probabilities to binary predictions
    test_accuracy = (test_predictions.eq(y_test_tensor).sum() / y_test_tensor.shape[0]).item()

print(f'Test Accuracy: {test_accuracy * 100:.2f}%')

Test Accuracy: 86.28%
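
The accuracy computation above reduces to two steps: threshold each predicted probability at 0.5, then count how often the result matches the label. A plain-Python sketch with made-up numbers follows (`accuracy_at_threshold` and the demo values are assumptions for illustration). Note that a probability of exactly 0.5 is classified as negative, because the comparison is strictly greater-than.

```python
def accuracy_at_threshold(probs, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p > threshold else 0 for p in probs]
    correct = sum(int(pred == label) for pred, label in zip(preds, labels))
    return correct / len(labels)


demo_probs = [0.91, 0.12, 0.55, 0.49, 0.50]
demo_labels = [1, 0, 0, 0, 1]

demo_accuracy = accuracy_at_threshold(demo_probs, demo_labels)
print(f"Accuracy: {demo_accuracy * 100:.2f}%")  # Accuracy: 60.00%
```

Because the dataset is balanced (25,000 reviews per class), accuracy is a reasonable headline metric here; on imbalanced data, precision and recall would be worth reporting as well.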


### Exercise 8: Saving the Trained Model

1. Save the model's `state_dict` using `torch.save()`.

2. Save the entire model, including its architecture and weights.

3. Demonstrate how to load the saved model and use it for making predictions on new data.

In [27]:
# Save the model’s state_dict (recommended)
model_path = "mlp_model.pth"
torch.save(model.state_dict(), model_path)
print(f"Model's state_dict saved at {model_path}")

Model's state_dict saved at mlp_model.pth


In [28]:
# Save the entire model (includes architecture and weights)
model_path_full = "mlp_model_full.pth"
torch.save(model, model_path_full)
print(f"Entire model saved at {model_path_full}")

Entire model saved at mlp_model_full.pth


In [29]:
# To load the saved model for inference:
# Loading the state_dict
loaded_model = MLP(input_dim).to(device)
loaded_model.load_state_dict(torch.load(model_path))
loaded_model.eval()  # Set the model to evaluation mode



MLP(
  (fc1): Linear(in_features=5000, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=1, bias=True)
  (relu): ReLU()
)

In [30]:
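# Load the entire model (a pickled object, so only load files you trust).
# PyTorch 2.6+ defaults to weights_only=True, which rejects full-model pickles; opt out explicitly.
loaded_model_full = torch.load(model_path_full, weights_only=False).to(device)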
loaded_model_full.eval()



MLP(
  (fc1): Linear(in_features=5000, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=1, bias=True)
  (relu): ReLU()
)

In [31]:
# Example: Making predictions with the loaded model
with torch.no_grad():
    example_outputs = loaded_model(X_test_tensor[:5])
    example_predictions = (example_outputs > 0.5).float()
    print(f"Predictions on new data: {example_predictions}")

Predictions on new data: tensor([[0.],
        [1.],
        [0.],
        [1.],
        [0.]], device='cuda:0')
