# IMDB 50K Movie Reviews - Sentiment Classification with MLP

## Overview

In the following exercises, you will work with the IMDB 50K Movie Reviews dataset to build a sentiment analysis model using a Multi-Layer Perceptron (MLP). You will practice essential steps in the data science pipeline: data loading, preprocessing, feature generation, and training/testing a neural network model. The exercises should be completed using the `pandas`, `nltk`, `sklearn`, and `torch` libraries.


### Exercise 1: Data Loading and Exploration
**Objective**: Load the IMDB dataset and explore its structure.

1. **Load the dataset** using `pandas`. The dataset is in the `data/` folder and the file name is `imdb_dataset.zip`. **Hint: you can load zip files with pandas by passing `compression='zip'` to `pd.read_csv`**
2. **Explore the dataset** by checking for missing values and getting a summary of the data. 
    - Check the shape of the dataset.
    - Get the distribution of the sentiment labels (positive/negative reviews).
3. Print the first few reviews and their corresponding labels.


In [1]:
import modin.pandas as pd
from torch.utils.data import TensorDataset

df = pd.read_csv('../../data/imdb_dataset.zip', compression='zip')
df

2024-10-04 12:03:27,577	INFO worker.py:1786 -- Started a local Ray instance.


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [2]:
df.shape

(50000, 2)

In [3]:
df['sentiment'].value_counts()

the groupby keys will be sorted anyway, although the 'sort=False' was passed. See the following issue for more details: https://github.com/modin-project/modin/issues/3571.


sentiment
negative    25000
positive    25000
Name: count, dtype: int64
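
The exercise also asks for a missing-value check, which the cells above do not show. A minimal sketch of one way to do it, using plain `pandas` on a tiny stand-in frame (the `sample` data below is made up for illustration and is not part of the IMDB dataset):

```python
import pandas as pd

# Tiny stand-in for the IMDB frame; the values are invented for illustration only.
sample = pd.DataFrame({
    "review": ["Great film", None, "Terrible plot"],
    "sentiment": ["positive", "negative", "negative"],
})

# Number of missing entries (NaN/None) in each column
missing_per_column = sample.isna().sum()

# Share of each label instead of raw counts
label_share = sample["sentiment"].value_counts(normalize=True)

print(missing_per_column)
print(label_share)
```

Applied to the real data, the same two calls on `df` should report the missing entries per column and the 50/50 label balance seen above.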

### Exercise 2: Splitting the Data
**Objective**: Split the data into training, validation and test sets.

1. Split the dataset into features (reviews) and labels (sentiment).
2. Use `train_test_split` from `sklearn` to split the dataset into training, validation and test sets (use a 60/20/20 split).
3. Print the sizes of the training, validation and test sets to ensure the splits were done correctly.

In [4]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, stratify=df['sentiment'])
train.shape, test.shape

((40000, 2), (10000, 2))

In [5]:
# 25% of the remaining 80% equals 20% of the full dataset
train, valid = train_test_split(train, test_size=0.25, stratify=train['sentiment'])

train.shape, valid.shape, test.shape

((30000, 2), (10000, 2), (10000, 2))
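
The two-step split can be wrapped in a small helper. The sketch below is one possible way to do it; the helper name `split_60_20_20` and the fixed `seed=42` are assumptions added here for reproducibility and are not part of the original notebook.

```python
from sklearn.model_selection import train_test_split

def split_60_20_20(X, y, seed=42):
    """Hypothetical helper: stratified 60/20/20 train/validation/test split."""
    # Step 1: hold out 20% of the data as the test set
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    # Step 2: 25% of the remaining 80% is 20% of the original data
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=seed
    )
    return (X_train, y_train), (X_valid, y_valid), (X_test, y_test)

# Toy data: 100 placeholder reviews with perfectly balanced labels
X = [f"review {i}" for i in range(100)]
y = ["positive", "negative"] * 50

(train_X, train_y), (valid_X, valid_y), (test_X, test_y) = split_60_20_20(X, y)
print(len(train_X), len(valid_X), len(test_X))
```

Passing `random_state` makes the split repeatable across runs, which helps when comparing models trained at different times.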

### Exercise 3: Text Preprocessing
**Objective**: Preprocess the text data to prepare it for feature generation.

1. Lowercase the text data. **Hint: python has a built-in string method for this**.
2. Remove any URL from the reviews. **Hint: you can use regular expressions for this**.
3. Remove characters that are neither word characters nor whitespace (punctuation, special characters, etc.). **Hint: you can use regular expressions for this**.
4. Remove digits. **Hint: you can use regular expressions for this**.
5. Tokenize the reviews into individual words. **Hint: you can use the `nltk` library for this**.
6. Remove stopwords. **Hint: you can use the `nltk` library for this**.
7. Perform stemming or lemmatization. **Hint: you can use the `nltk` library for this**.
8. Apply the preprocessing steps to the training, validation and test sets.

In [6]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import re

nltk.download('punkt_tab')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

x_train, y_train = train['review'], train['sentiment']
x_valid, y_valid = valid['review'], valid['sentiment']
x_test, y_test = test['review'], test['sentiment']

[nltk_data] Downloading package punkt_tab to /home/lcda/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /home/lcda/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
# Define a preprocessing function
def preprocess_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Remove non-word/non-whitespace characters
    text = re.sub(r'\d+', '', text)  # Remove digits
    words = word_tokenize(text)  # Tokenize text
    words = [word for word in words if word not in stop_words]  # Remove stop words
    words = [stemmer.stem(word) for word in words]  # Perform stemming
    return ' '.join(words)

In [8]:
# Apply preprocessing to the training, validation, and test sets
x_train = x_train.apply(preprocess_text)
x_valid = x_valid.apply(preprocess_text)
x_test = x_test.apply(preprocess_text)

In [9]:
x_test.head()

8478     plot certainli seem interest enough real life ...
16709    signific french titl film la naissanc de pieuv...
45245    movi noth like book br br everyth mix chang mo...
34721    watch movi time watch famili age everyon love ...
36868    thank hollywood yet anoth movi classic utterli...
Name: review, dtype: object
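
The output above still contains `br` tokens, which come from the `<br />` HTML tags in the raw reviews: removing the angle brackets leaves the tag name behind. One possible remedy, sketched below, is to strip tags before the other steps. The helper name `strip_html` is hypothetical, and the regular expression is a simple heuristic rather than a full HTML parser.

```python
import re

def strip_html(text):
    """Hypothetical helper: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

raw = "A wonderful little production. <br /><br />The filming technique is great."
cleaned = strip_html(raw)

# Tokens after tag removal; "br" no longer shows up as a word
tokens = re.findall(r"\w+", cleaned.lower())
print(cleaned)
print(tokens)
```

If used, such a step would most naturally run before the `\W` substitution inside `preprocess_text`.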

### Exercise 4: Feature Generation (TF-IDF)
**Objective**: Convert the preprocessed text data into numerical features using TF-IDF.

1. Use the `TfidfVectorizer` from `sklearn` to convert the reviews into numerical vectors.
2. Limit the maximum number of features to 5,000 to reduce the dimensionality.
3. Fit the vectorizer on the training set only, then transform the training, validation and test sets.
4. Print the shape of the transformed feature sets to confirm the conversion.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000)
tfidf_matrix_train = tfidf.fit_transform(x_train)

tfidf_matrix_train.shape

(30000, 5000)

In [11]:
tfidf_matrix_valid = tfidf.transform(x_valid)
tfidf_matrix_test = tfidf.transform(x_test)

tfidf_matrix_train.shape, tfidf_matrix_valid.shape, tfidf_matrix_test.shape

((30000, 5000), (10000, 5000), (10000, 5000))
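
To see why the vectorizer is fitted on the training set only, the following small sketch uses a made-up three-document corpus (the documents are illustrative, not taken from the dataset). Words that never occurred during fitting are simply ignored at transform time, so every split ends up with the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["good movi", "bad movi", "good act"]
new_docs = ["good plot"]  # "plot" never appears in train_docs

vec = TfidfVectorizer(max_features=5000)
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
new_matrix = vec.transform(new_docs)          # reuses them, learns nothing new

print(sorted(vec.vocabulary_))
print(train_matrix.shape, new_matrix.shape)
print(new_matrix.nnz)  # number of non-zero entries for the unseen document
```

Only `good` is known to the vectorizer in the new document, so its row has a single non-zero entry.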

### Exercise 5: Building the MLP Model (PyTorch)
**Objective**: Build a simple Multi-Layer Perceptron (MLP) for binary classification.

1. Define the MLP model using `torch.nn.Module`. The model should have:
    - An input layer that matches the size of the TF-IDF features.
    - Two hidden layers with ReLU activations.
    - A single output layer with a sigmoid activation function.
2. Print the model summary.


In [12]:
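import torch
import torch.nn as nn

class MLPClassifier(nn.Module):
    """MLP for binary sentiment classification on TF-IDF features."""

    def __init__(self, input_dim):
        super().__init__()
        self.input_dim = input_dim
        self.fc1 = nn.Linear(input_dim, 50)  # first hidden layer
        self.fc2 = nn.Linear(50, 10)         # second hidden layer
        self.fc3 = nn.Linear(10, 1)          # single output unit for binary classification
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        # Sigmoid maps the logit to a probability in (0, 1), as BCELoss expects
        return torch.sigmoid(self.fc3(x))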

In [13]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = MLPClassifier(tfidf_matrix_train.shape[1]).to(device)
model

MLPClassifier(
  (fc1): Linear(in_features=5000, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=1, bias=True)
  (relu): ReLU()
)
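
As a quick sanity check on the summary above, the number of trainable parameters can be worked out by hand: each `Linear` layer has `in_features * out_features` weights plus `out_features` biases. The sketch below does the arithmetic for the layer sizes printed above (5000, 50, 10, 1).

```python
# Layer widths taken from the printed model summary: 5000 -> 50 -> 10 -> 1
layer_sizes = [5000, 50, 10, 1]

# Weights (n_in * n_out) plus biases (n_out) for each Linear layer
per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total = sum(per_layer)

print(per_layer, total)  # [250050, 510, 11] 250571
```

Almost all parameters sit in the first layer, which is one reason the TF-IDF vocabulary was capped at 5,000 features.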

### Exercise 6: Training the Model
**Objective**: Train the MLP model on the training data.

1. Convert the TF-IDF feature matrices and labels into PyTorch tensors (the label needs to be binarized).
2. Define the loss function (`BCELoss` for binary classification) and the optimizer (`Adam`).
3. Implement a training loop to train the model for a specified number of epochs (e.g., 50).
4. Monitor the training and validation loss during training.

In [14]:
x_train = tfidf_matrix_train.toarray()
y_train = y_train.map({'positive': 1, 'negative': 0})
x_valid = tfidf_matrix_valid.toarray()
y_valid = y_valid.map({'positive': 1, 'negative': 0})
x_test = tfidf_matrix_test.toarray()
y_test = y_test.map({'positive': 1, 'negative': 0})

In [15]:
from torch.utils.data import TensorDataset 
import torch

x_train_tfidf = torch.tensor(x_train, dtype=torch.float32).to(device)
y_train = torch.tensor(y_train, dtype=torch.float32).reshape(-1, 1).to(device)

x_valid_tfidf = torch.tensor(x_valid, dtype=torch.float32).to(device)
y_valid = torch.tensor(y_valid, dtype=torch.float32).reshape(-1, 1).to(device)

x_test_tfidf = torch.tensor(x_test, dtype=torch.float32).to(device)
y_test = torch.tensor(y_test, dtype=torch.float32).reshape(-1, 1).to(device)

train = TensorDataset(x_train_tfidf, y_train)
valid = TensorDataset(x_valid_tfidf, y_valid)
test = TensorDataset(x_test_tfidf, y_test)
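
The `TensorDataset` objects created above can be fed to a `DataLoader` for mini-batch training. The notebook's loop below is full-batch; this is an alternative sketch on random stand-in data. The names `toy_model`, `toy_criterion`, `toy_optimizer`, the batch size of 16 and the three epochs are all illustrative assumptions.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random stand-in data: 64 samples, 20 features, binary labels of shape (N, 1)
features = torch.rand(64, 20)
labels = torch.randint(0, 2, (64, 1)).float()
loader = DataLoader(TensorDataset(features, labels), batch_size=16, shuffle=True)

# Small stand-in network ending in a sigmoid, so BCELoss applies
toy_model = nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
toy_criterion = nn.BCELoss()
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.001)

for epoch in range(3):
    toy_model.train()
    running_loss = 0.0
    for batch_x, batch_y in loader:
        toy_optimizer.zero_grad()
        loss = toy_criterion(toy_model(batch_x), batch_y)
        loss.backward()
        toy_optimizer.step()
        # Weight each batch loss by its size to get a per-sample mean later
        running_loss += loss.item() * batch_x.size(0)
    print(f"Epoch {epoch}: mean loss {running_loss / len(loader.dataset):.4f}")
```

Mini-batches give several weight updates per epoch, so far fewer epochs are typically needed than with one full-batch update per epoch.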

In [32]:
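# Reset the weights so that re-running this cell trains from scratch
for layer in model.children():
    if hasattr(layer, 'reset_parameters'):
        layer.reset_parameters()

criterion = torch.nn.BCELoss()  # binary cross-entropy on sigmoid outputs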
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

n_epochs = 5000

for epoch in range(n_epochs):
    model.train()
    
    outputs = model(x_train_tfidf)
    loss = criterion(outputs, y_train)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    model.eval()
    with torch.no_grad():
        val_outputs = model(x_valid_tfidf)
        val_loss = criterion(val_outputs, y_valid)
        val_predictions = (val_outputs >= 0.5).float()
        accuracy = (val_predictions.eq(y_valid)).sum() / y_valid.shape[0]
    
    print(f"Epoch: {epoch}, Loss: {loss.item()}")
    print(f"Validation Loss: {val_loss.item()}")
    print(f"Validation Accuracy: {accuracy.item()}")
    

Epoch: 0, Loss: 0.695345401763916
Validation Loss: 0.6948872804641724
Validation Accuracy: 0.5
Epoch: 1, Loss: 0.6946032643318176
Validation Loss: 0.6944329738616943
Validation Accuracy: 0.5
Epoch: 2, Loss: 0.6938134431838989
Validation Loss: 0.6939446926116943
Validation Accuracy: 0.5
Epoch: 3, Loss: 0.6929478049278259
Validation Loss: 0.6934011578559875
Validation Accuracy: 0.5
Epoch: 4, Loss: 0.69200599193573
Validation Loss: 0.6928213238716125
Validation Accuracy: 0.5
Epoch: 5, Loss: 0.6910610198974609
Validation Loss: 0.6922494173049927
Validation Accuracy: 0.5
Epoch: 6, Loss: 0.6900890469551086
Validation Loss: 0.6916968822479248
Validation Accuracy: 0.5
Epoch: 7, Loss: 0.6890773773193359
Validation Loss: 0.6911538243293762
Validation Accuracy: 0.5
Epoch: 8, Loss: 0.6880559921264648
Validation Loss: 0.6906039118766785
Validation Accuracy: 0.5
Epoch: 9, Loss: 0.6870325207710266
Validation Loss: 0.6900284886360168
Validation Accuracy: 0.5
Epoch: 10, Loss: 0.6859792470932007
...
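
The starting loss of roughly 0.695 is close to what binary cross-entropy gives when the model outputs a probability near 0.5 for every sample, namely ln(2) ≈ 0.693. The sketch below computes the per-sample loss by hand; the helper name `bce` is hypothetical and only mirrors the textbook formula.

```python
import math

def bce(p, y):
    """Binary cross-entropy for one prediction p in (0, 1) and one label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# An uninformed prediction of 0.5 costs ln(2) regardless of the true label
print(bce(0.5, 1), bce(0.5, 0), math.log(2))

# A confident, correct prediction costs much less
print(bce(0.9, 1))
```

A loss that falls only slowly from this value, together with a validation accuracy of exactly 0.5, suggests the outputs are still close to 0.5 and on the same side of the threshold for all samples during the first epochs.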

### Exercise 7: Model Evaluation
**Objective**: Evaluate the performance of the trained model on the test data.

1. Use the trained model to make predictions on the test set.
2. Calculate the accuracy of the model on the test data.
3. Print the test accuracy.

In [44]:
from sklearn.metrics import accuracy_score

model.eval()
with torch.no_grad():
    test_outputs = model(x_test_tfidf)
    test_predictions = (test_outputs >= 0.5).float()
    accuracy = (test_predictions.eq(y_test)).sum() / y_test.shape[0]
    accuracy2 = torch.sum(test_predictions == y_test) / y_test.shape[0]
    
accuracy.item(), accuracy2.item()

(0.8407999873161316, 0.8407999873161316)
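
The cell above imports `accuracy_score` without using it. As a sketch of how `sklearn.metrics` could complement the manual computation, the example below uses small made-up label lists; in the notebook the inputs would presumably be `y_test` and `test_predictions` moved to the CPU and converted to NumPy arrays.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes (order: 0, 1)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

print(acc)
print(cm)
```

The confusion matrix shows whether errors are spread evenly across both classes, which a single accuracy figure cannot reveal.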

### Exercise 8: Saving the Trained Model
**Objective**: Save the trained model and load it again for inference.

1. Save the model's state_dict using `torch.save()`. 

2. Save the entire model, including its architecture and weights.

3. Demonstrate how to load the saved model and use it for making predictions on new data.


In [46]:
torch.save(model.state_dict(), 'model.pt')

In [47]:
state_dict = torch.load('model.pt')

model = MLPClassifier(tfidf_matrix_train.shape[1]).to(device)
model.load_state_dict(state_dict)



<All keys matched successfully>

In [45]:
# save entire model
torch.save(model, 'model_full.pt')

In [None]:
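# Loading a fully pickled model requires the MLPClassifier class to be defined or importable.
# weights_only=False is needed on PyTorch 2.6+, where the default changed to True.
model = torch.load('model_full.pt', weights_only=False)
model.eval()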
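
The third task, predicting on new data with a reloaded model, is not shown above. The following is a self-contained sketch of the round trip on a small stand-in network; `TinyMLP`, the file name `tiny_model.pt`, and the random `new_features` are assumptions made for illustration.

```python
import os
import tempfile

import torch
from torch import nn

class TinyMLP(nn.Module):
    """Stand-in network; the notebook would use MLPClassifier instead."""

    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(input_dim, 4), nn.ReLU(), nn.Linear(4, 1))

    def forward(self, x):
        return torch.sigmoid(self.net(x))

torch.manual_seed(0)
trained = TinyMLP(input_dim=6)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tiny_model.pt")
    torch.save(trained.state_dict(), path)

    # Rebuild the architecture first, then load the saved weights into it
    restored = TinyMLP(input_dim=6)
    restored.load_state_dict(torch.load(path, map_location="cpu"))

restored.eval()
new_features = torch.rand(3, 6)  # stand-in for TF-IDF vectors of new reviews

with torch.no_grad():
    probabilities = restored(new_features)
    predicted_labels = (probabilities >= 0.5).long()  # 1 = positive, 0 = negative

print(probabilities.squeeze(1))
print(predicted_labels.squeeze(1))
```

In the notebook, new reviews would first go through `preprocess_text` and `tfidf.transform(...)` before being converted to a tensor, so the fitted vectorizer has to be available wherever the model is reloaded.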