# IMDB 50K Movie Reviews - Sentiment Classification with MLP

## Overview

In the following exercises, you will work with the IMDB 50K Movie Reviews dataset to build a sentiment analysis model using a Multi-Layer Perceptron (MLP). You will practice essential steps in the data science pipeline: data loading, preprocessing, feature generation, and training and testing a neural network model. The exercises should be completed using the `pandas`, `nltk`, `sklearn`, and `torch` libraries.


### Exercise 1: Data Loading and Exploration
**Objective**: Load the IMDB dataset and explore its structure.

1. **Load the dataset** using `pandas`. The dataset is in the `data/` folder and the file name is `imdb_dataset.zip`. **Hint: you can load zip files with pandas by passing `compression='zip'` to `pd.read_csv`**
2. **Explore the dataset** by checking for missing values and getting a summary of the data. 
    - Check the shape of the dataset.
    - Get the distribution of the sentiment labels (positive/negative reviews).
3. Print the first few reviews and their corresponding labels.


In [2]:
import modin.pandas as pd
df = pd.read_csv("../../data/imdb_dataset.zip", compression='zip')
df

2024-10-04 11:39:44,859	INFO worker.py:1786 -- Started a local Ray instance.


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [3]:
df.shape

(50000, 2)

In [4]:
df['sentiment'].value_counts()

the groupby keys will be sorted anyway, although the 'sort=False' was passed. See the following issue for more details: https://github.com/modin-project/modin/issues/3571.


sentiment
negative    25000
positive    25000
Name: count, dtype: int64
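
The exploration steps of Exercise 1 also ask for a missing-value check, a summary, and a look at the first few reviews. A minimal sketch of those calls, run on a tiny made-up frame (`toy` and its rows are hypothetical stand-ins, and plain `pandas` is used here instead of `modin`):

```python
import pandas as pd

# Hypothetical stand-in for the IMDB frame: same column names, made-up rows.
toy = pd.DataFrame({
    "review": ["Great film.", "Terrible plot.", None, "Loved it."],
    "sentiment": ["positive", "negative", "negative", "positive"],
})

# Count missing values per column.
missing = toy.isna().sum()
print(missing)

# Summary of the columns (count, unique, top, freq for text columns).
print(toy.describe())

# First few reviews together with their labels.
print(toy.head(3))
```

On the real dataset the same calls would be applied to `df`.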

### Exercise 2: Splitting the Data
**Objective**: Split the data into training, validation and test sets.

1. Split the dataset into features (reviews) and labels (sentiment).
2. Use `train_test_split` from `sklearn` to split the dataset into training, validation and test sets (use a 60/20/20 split).
3. Print the sizes of the training, validation and test sets to ensure the splits were done correctly.

In [5]:
X = df['review']
y = df['sentiment']

In [6]:
from sklearn.model_selection import train_test_split

x_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

In [7]:
print(f"Training set size: {len(x_train)}")
print(f"Validation set size: {len(X_val)}")
print(f"Test set size: {len(X_test)}")

Training set size: 30000
Validation set size: 10000
Test set size: 10000
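
The 60/20/20 split comes from two calls: the first holds out 40% of the data, the second cuts that holdout in half. The sketch below repeats the idea on made-up data and additionally passes `stratify`, which is an assumption on top of the cells above, to keep the class balance identical in every subset. The names `X_toy`, `y_toy`, and the `_tr`/`_va`/`_te` suffixes are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Made-up balanced toy data: 100 samples, 50 per class.
X_toy = [f"review {i}" for i in range(100)]
y_toy = ["positive" if i % 2 == 0 else "negative" for i in range(100)]

# Step 1: hold out 40% of the data.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=42, stratify=y_toy
)
# Step 2: split the holdout in half -> 20% validation, 20% test.
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42, stratify=y_hold
)

print(len(X_tr), len(X_va), len(X_te))
print(y_tr.count("positive"), y_va.count("positive"), y_te.count("positive"))
```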


### Exercise 3: Text Preprocessing
**Objective**: Preprocess the text data to prepare it for feature generation.

1. Lowercase the text data. **Hint: python has a built-in string method for this**.
2. Remove any URLs from the reviews. **Hint: you can use regular expressions for this**.
3. Remove non-word and non-whitespace characters (punctuation, special characters, etc.). **Hint: you can use regular expressions for this**.
4. Remove digits. **Hint: you can use regular expressions for this**.
5. Tokenize the reviews into individual words. **Hint: you can use the `nltk` library for this**.
6. Remove stopwords. **Hint: you can use the `nltk` library for this**.
7. Perform stemming or lemmatization. **Hint: you can use the `nltk` library for this**.
8. Apply the preprocessing steps to the training, validation and test sets.

In [8]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import re
nltk.download('punkt_tab')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\diogo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\diogo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
def preprocess_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Remove non-word/non-whitespace characters
    text = re.sub(r'\d+', '', text)  # Remove digits
    words = word_tokenize(text)  # Tokenize text
    words = [word for word in words if word not in stop_words]  # Remove stop words
    words = [stemmer.stem(word) for word in words]  # Perform stemming
    return ' '.join(words)

In [10]:
X_train = x_train.apply(preprocess_text)
X_val = X_val.apply(preprocess_text)
X_test = X_test.apply(preprocess_text)

In [11]:
X_train.head()

18306    borrow slightli modifi titl comment say usual ...
49528    product account movi also got voic work entir ...
44745    far one worst movi ever seen life watch practi...
46827    obvious inspir seen sometim even gruesom blood...
27531    movi almost gener defin import us born earli p...
Name: review, dtype: object
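
The raw reviews contain HTML line breaks such as `<br />`. Replacing non-word characters with spaces would likely leave a stray `br` token behind. The sketch below shows only the regex part of the cleaning (no tokenization, stopword removal, or stemming) with one extra step that drops HTML tags first. The tag-stripping step and the helper name `clean_text` are assumptions for illustration, not part of the exercise solution.

```python
import re

def clean_text(text):
    """Regex-only cleaning; tokenization, stopwords and stemming are left out."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)        # assumption: drop HTML tags such as <br />
    text = re.sub(r"http\S+|www\S+", "", text)  # remove URLs
    text = re.sub(r"\W", " ", text)             # punctuation and symbols -> spaces
    text = re.sub(r"\d+", "", text)             # remove digits
    return " ".join(text.split())               # collapse repeated whitespace

sample = "A wonderful little production. <br /><br />See http://example.com, 10/10!"
print(clean_text(sample))  # a wonderful little production see
```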

### Exercise 4: Feature Generation (TF-IDF)
**Objective**: Convert the preprocessed text data into numerical features using TF-IDF.

1. Use the `TfidfVectorizer` from `sklearn` to convert the reviews into numerical vectors.
2. Limit the maximum number of features to 5,000 to reduce the dimensionality.
3. Fit the vectorizer on the training set only, then transform the training, validation and test sets.
4. Print the shape of the transformed feature sets to confirm the conversion.

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=5000)
tfidf_matrix_train = tfidf.fit_transform(X_train)
tfidf_matrix_valid = tfidf.transform(X_val)
tfidf_matrix_test = tfidf.transform(X_test)

tfidf_matrix_train.shape

(30000, 5000)
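
Fitting on the training set only means the vocabulary and IDF weights come from training data, and words that first appear in validation or test reviews are ignored. A small sketch on a made-up, already "preprocessed" corpus (`train_docs`, `test_docs`, and `vec` are hypothetical names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["good movi", "bad movi", "good act good plot"]  # made-up examples
test_docs = ["good unseenword"]

vec = TfidfVectorizer(max_features=5000)
train_mat = vec.fit_transform(train_docs)  # learns vocabulary and IDF from training data only
test_mat = vec.transform(test_docs)        # reuses them; unseen words are dropped

print(sorted(vec.vocabulary_))
print(train_mat.shape, test_mat.shape)
print(test_mat.toarray())
```

Because rows are L2-normalized by default, the test row here ends up with a single non-zero entry, in the column for `good`.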

### Exercise 5: Building the MLP Model (PyTorch)
**Objective**: Build a simple Multi-Layer Perceptron (MLP) for binary classification.

1. Define the MLP model using `torch.nn.Module`. The model should have:
    - An input layer that matches the size of the TF-IDF features.
    - Two hidden layers with ReLU activations.
    - A single output layer with a sigmoid activation function.
2. Print the model summary.


In [13]:
import torch

class MLPClssifier(torch.nn.Module):
    def __init__(self, input_dim):
        super(MLPClssifier, self).__init__()
        self.input_dim = input_dim
        self.fc1 = torch.nn.Linear(input_dim, 50)  # input -> hidden layer 1
        self.fc2 = torch.nn.Linear(50, 10)  # hidden layer 1 -> hidden layer 2
        self.fc3 = torch.nn.Linear(10, 1)  # hidden layer 2 -> output
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)

        return torch.sigmoid(x)  # probability of the positive class

In [14]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = MLPClssifier(tfidf_matrix_train.shape[1]).to(device)
model

MLPClssifier(
  (fc1): Linear(in_features=5000, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=1, bias=True)
  (relu): ReLU()
)
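
Printing the module lists its layers but not how many weights it holds. One way to get a rough summary is to count parameters per tensor. The sketch below rebuilds the same layer sizes with `torch.nn.Sequential` (a stand-in named `net`, not the class defined above) so that it runs on its own:

```python
import torch

# Same layer sizes as the model above, written as nn.Sequential for brevity.
net = torch.nn.Sequential(
    torch.nn.Linear(5000, 50), torch.nn.ReLU(),
    torch.nn.Linear(50, 10), torch.nn.ReLU(),
    torch.nn.Linear(10, 1), torch.nn.Sigmoid(),
)

# One line per parameter tensor: name, shape, number of elements.
for name, p in net.named_parameters():
    print(f"{name:10s} {tuple(p.shape)!s:12s} {p.numel():>8d}")

total = sum(p.numel() for p in net.parameters() if p.requires_grad)
print("trainable parameters:", total)  # 250571
```

Almost all of the parameters sit in the first layer (5000 × 50 weights plus 50 biases).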

### Exercise 6: Training the Model
**Objective**: Train the MLP model on the training data.

1. Convert the TF-IDF feature matrices and labels into PyTorch tensors (the labels need to be binarized).
2. Define the loss function (`BCELoss` for binary classification) and the optimizer (`Adam`).
3. Implement a training loop to train the model for a specified number of epochs (e.g., 50).
4. Monitor the training and validation loss during training.

In [15]:
y_train_map = y_train.map({'positive': 1, 'negative': 0})
y_valid_map = y_val.map({'positive': 1, 'negative': 0})
y_test_map = y_test.map({'positive': 1, 'negative': 0})

In [20]:
from torch.utils.data import TensorDataset

x_train_tfidf = torch.tensor(tfidf_matrix_train.toarray(), dtype=torch.float32).to(device)
x_valid_tfidf = torch.tensor(tfidf_matrix_valid.toarray(), dtype=torch.float32).to(device)
x_test_tfidf = torch.tensor(tfidf_matrix_test.toarray(), dtype=torch.float32).to(device)

# Labels as float column vectors of shape (N, 1), matching the model output
y_train = torch.tensor(y_train_map.to_numpy(), dtype=torch.float32).reshape(-1, 1).to(device)
y_valid = torch.tensor(y_valid_map.to_numpy(), dtype=torch.float32).reshape(-1, 1).to(device)
y_test = torch.tensor(y_test_map.to_numpy(), dtype=torch.float32).reshape(-1, 1).to(device)

train = TensorDataset(x_train_tfidf, y_train)
valid = TensorDataset(x_valid_tfidf, y_valid)
test = TensorDataset(x_test_tfidf, y_test)

train, valid, test

(<torch.utils.data.dataset.TensorDataset at 0x26104788d90>,
 <torch.utils.data.dataset.TensorDataset at 0x261135a03d0>,
 <torch.utils.data.dataset.TensorDataset at 0x2646a1cb7d0>)

In [21]:
criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [35]:
from sklearn.metrics import accuracy_score

model = MLPClssifier(tfidf_matrix_train.shape[1]).to(device)
criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

n_epochs = 100

for epoch in range(n_epochs):
    model.train()

    outputs = model(x_train_tfidf)
    loss = criterion(outputs, y_train)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_outputs = model(x_valid_tfidf)
        val_loss = criterion(val_outputs, y_valid)
        val_pred = (val_outputs >= 0.5).float()  # threshold probabilities at 0.5
        accuracy = val_pred.eq(y_valid).sum() / y_valid.shape[0]
        print(f"Epoch: {epoch}, Loss: {loss.item()}")
        print(f"Validation Loss: {val_loss.item()}")
        print(f"Accuracy: {accuracy.item()}")

Epoch: 0, Loss: 0.6968812942504883
Validation Loss: 0.6966938376426697
Accuracy: 0.4989999830722809
Epoch: 1, Loss: 0.6965698003768921
Validation Loss: 0.6963880658149719
Accuracy: 0.4989999830722809
Epoch: 2, Loss: 0.6962490677833557
Validation Loss: 0.6960293054580688
Accuracy: 0.4989999830722809
Epoch: 3, Loss: 0.695872962474823
Validation Loss: 0.6956089735031128
Accuracy: 0.4989999830722809
Epoch: 4, Loss: 0.6954323053359985
Validation Loss: 0.6951379776000977
Accuracy: 0.4989999830722809
Epoch: 5, Loss: 0.6949372291564941
Validation Loss: 0.6946307420730591
Accuracy: 0.4989999830722809
Epoch: 6, Loss: 0.6944020986557007
Validation Loss: 0.6940974593162537
Accuracy: 0.4989999830722809
Epoch: 7, Loss: 0.6938371658325195
Validation Loss: 0.6935368180274963
Accuracy: 0.4989999830722809
Epoch: 8, Loss: 0.6932419538497925
Validation Loss: 0.6929364800453186
Accuracy: 0.4989999830722809
Epoch: 9, Loss: 0.6926049590110779
Validation Loss: 0.6922904253005981
Accuracy: 0.4989999830722809
...
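
The loop above performs a single full-batch gradient step per epoch, and the `TensorDataset` objects built earlier are not used. A possible alternative is mini-batch training with a `DataLoader`, sketched here on synthetic data so that it runs on its own. The data, layer sizes, learning rate, and names (`X_fake`, `y_fake`, `net`, `loader`) are all made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 256 samples, 20 features, label depends on the first feature.
X_fake = torch.randn(256, 20)
y_fake = (X_fake[:, :1] > 0).float()
loader = DataLoader(TensorDataset(X_fake, y_fake), batch_size=32, shuffle=True)

net = torch.nn.Sequential(
    torch.nn.Linear(20, 8), torch.nn.ReLU(),
    torch.nn.Linear(8, 1), torch.nn.Sigmoid(),
)
criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)

epoch_losses = []
for epoch in range(20):
    net.train()
    running = 0.0
    for xb, yb in loader:  # one optimizer step per mini-batch
        optimizer.zero_grad()
        loss = criterion(net(xb), yb)
        loss.backward()
        optimizer.step()
        running += loss.item() * xb.size(0)
    # Average loss over all samples seen in this epoch.
    epoch_losses.append(running / len(loader.dataset))

print(f"first epoch loss: {epoch_losses[0]:.4f}, last epoch loss: {epoch_losses[-1]:.4f}")
```

With mini-batches the model receives several updates per epoch, so fewer epochs are usually needed than with one full-batch step per epoch.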

### Exercise 7: Model Evaluation
**Objective**: Evaluate the performance of the trained model on the test data.

1. Use the trained model to make predictions on the test set.
2. Calculate the accuracy of the model on the test data.
3. Print the test accuracy.

In [39]:
model.eval()
with torch.no_grad():
    test_outputs = model(x_test_tfidf)
    test_predictions = (test_outputs >= 0.5).float()
    accuracy = torch.sum(test_predictions == y_test) / y_test.shape[0]

accuracy.item()

0.8575999736785889
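
Accuracy alone hides which kind of mistake the model makes. As a small extension, the counts behind precision and recall can be taken from the same thresholded predictions. The probabilities and labels below are made up for illustration:

```python
import torch

# Made-up probabilities and labels, for illustration only.
probs = torch.tensor([[0.9], [0.2], [0.6], [0.4], [0.8], [0.1]])
labels = torch.tensor([[1.0], [0.0], [0.0], [1.0], [1.0], [0.0]])

preds = (probs >= 0.5).float()  # same 0.5 threshold as in the evaluation cell

tp = ((preds == 1) & (labels == 1)).sum().item()  # true positives
fp = ((preds == 1) & (labels == 0)).sum().item()  # false positives
fn = ((preds == 0) & (labels == 1)).sum().item()  # false negatives
tn = ((preds == 0) & (labels == 0)).sum().item()  # true negatives

accuracy = (tp + tn) / labels.numel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```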

### Exercise 8: Saving the Trained Model

1. Save the model's state_dict using `torch.save()`. 

2. Save the entire model, including its architecture and weights.

3. Demonstrate how to load the saved model and use it for making predictions on new data.


In [37]:
torch.save(model.state_dict(), 'model.pt')

In [38]:
state_dict = torch.load('model.pt')
model = MLPClssifier(tfidf_matrix_train.shape[1]).to(device)
model.load_state_dict(state_dict)



<All keys matched successfully>

In [40]:
torch.save(model, 'model_full.pt')

In [41]:
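# Full-model pickles need weights_only=False on PyTorch versions where torch.load
# defaults to weights_only=True; only load files from a source you trust.
torch.load('model_full.pt', weights_only=False)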



MLPClssifier(
  (fc1): Linear(in_features=5000, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=1, bias=True)
  (relu): ReLU()
)
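
Step 3 of the exercise asks for predictions on new data. New reviews need the same fitted vectorizer as the training data, so one reasonable approach is to save the vectorizer next to the weights. The sketch below uses a made-up corpus and a hypothetical small model (`TinyMLP`) so that it runs on its own. In the notebook, the fitted `tfidf`, the trained `model`, and `preprocess_text` would take their places.

```python
import os
import pickle
import tempfile

import torch
from sklearn.feature_extraction.text import TfidfVectorizer


class TinyMLP(torch.nn.Module):
    """Hypothetical small stand-in for the classifier defined earlier."""

    def __init__(self, input_dim):
        super().__init__()
        self.fc1 = torch.nn.Linear(input_dim, 4)
        self.fc2 = torch.nn.Linear(4, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc2(torch.relu(self.fc1(x))))


# Made-up corpus; the model here is untrained, so its predictions are arbitrary.
vec = TfidfVectorizer().fit(["good movi", "bad movi", "great act"])
trained = TinyMLP(len(vec.vocabulary_))

with tempfile.TemporaryDirectory() as tmp:
    weights_path = os.path.join(tmp, "model.pt")
    vec_path = os.path.join(tmp, "tfidf.pkl")
    torch.save(trained.state_dict(), weights_path)
    with open(vec_path, "wb") as f:
        pickle.dump(vec, f)  # the vectorizer is needed again at prediction time

    # Later, e.g. in a fresh process: rebuild both pieces from disk.
    with open(vec_path, "rb") as f:
        loaded_vec = pickle.load(f)
    restored = TinyMLP(len(loaded_vec.vocabulary_))
    restored.load_state_dict(torch.load(weights_path, map_location="cpu"))

restored.eval()
new_reviews = ["good act", "bad movi"]  # real input would be preprocessed first
features = torch.tensor(loaded_vec.transform(new_reviews).toarray(), dtype=torch.float32)
with torch.no_grad():
    probs = restored(features)
    expected = trained(features)  # the restored model should match the original

labels = ["positive" if p >= 0.5 else "negative" for p in probs.squeeze(1).tolist()]
print(labels)
```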