# IMDB 50K Movie Reviews - Sentiment Classification with MLP

## Overview

### In the next exercises, you will work with the IMDB 50K Movie Reviews dataset to build a sentiment analysis model using a Multi-Layer Perceptron (MLP). You will practice essential steps in the data science pipeline such as data loading, preprocessing, feature generation, and training/testing a neural network model. This exercise should be completed using the `pandas`, `nltk`, `sklearn`, and `torch` libraries.


### Exercise 1: Data Loading and Exploration
**Objective**: Load the IMDB dataset and explore its structure.

1. **Load the dataset** using `pandas`. The dataset is in the `data/` folder and the file name is `imdb_dataset.zip`. **Hint: you can load zip files with pandas by passing `compression='zip'` tp `pd.read_csv`**
2. **Explore the dataset** by checking for missing values and getting a summary of the data. 
    - Check the shape of the dataset.
    - Get the distribution of the sentiment labels (positive/negative reviews).
3. Print the first few reviews and their corresponding labels.


In [1]:
import modin.pandas as pd

df = pd.read_csv('../../data/imdb_dataset.zip', compression='zip')
df

2024-10-04 10:00:53,692	INFO worker.py:1786 -- Started a local Ray instance.


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [2]:
df.shape

(50000, 2)

In [3]:
df['sentiment'].value_counts()

the groupby keys will be sorted anyway, although the 'sort=False' was passed. See the following issue for more details: https://github.com/modin-project/modin/issues/3571.


sentiment
negative    25000
positive    25000
Name: count, dtype: int64

### Exercise 2: Splitting the Data
**Objective**: Split the data into training, validation and test sets.

1. Split the dataset into features (reviews) and labels (sentiment).
2. Use `train_test_split` from `sklearn` to split the dataset into training, validation and test sets (use an 60/20/20 split).
3. Print the sizes of the training and test sets to ensure the splits were done correctly.

In [4]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, stratify=df['sentiment'])
train.shape, test.shape

((40000, 2), (10000, 2))

In [5]:
train, valid =train_test_split(train, test_size=0.25, stratify=train['sentiment'])

train.shape, valid.shape, test.shape

((30000, 2), (10000, 2), (10000, 2))

### Exercise 3: Text Preprocessing
**Objective**: Preprocess the text data to prepare it for feature generation.

1. Lowercase the text data. **Hint: python has a built-in string method for this**.
2. Remove any URL from the reviews. **Hint: you can use regular expressions for this**.
3. Remove non-word and non-whitespace characters (punctuation, special characters, etc.). **Hint: you can use regular expressions for this**.
4. Remove digits. **Hint: you can use regular expressions for this**.
5. Tokenize the reviews into individual words. **Hint: you can use the `nltk` library for this**.
6. Remove stopwords. **Hint: you can use the `nltk` library for this**.
7. Perform stemming or lemmatization. **Hint: you can use the `nltk` library for this**.
8. Apply the preprocessing steps to both the training, validation and test sets.

In [6]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')
nltk.download('punkt')

stemmer = PorterStemmer()

def preccess(text:str) -> str:
    text = text.lower()
    link_pattern = r'http\S+|www\S+'
    # re.sub(A, B, C)replace A with B in string C
    text = re.sub(link_pattern, '', text)
    non_alphanumeric_pattern = r'\W'
    text = re.sub(non_alphanumeric_pattern, '', text)
    digits_pattern = r'\d'
    text = re.sub(digits_pattern, '', text)
    tokenize = nltk.word_tokenize(text)
    words = [word for word in tokenize if word not in stopwords.words('english')]
    stems = [stemmer.stem(word) for word in words]
    return " ".join(stems)

x_train, y_train = train['review'], train['sentiment']
x_valid, y_valid = valid['review'], valid['sentiment']
x_test, y_test = test['review'], test['sentiment']

[nltk_data] Downloading package stopwords to /home/lcda/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/lcda/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [7]:
x_train = x_train.apply(preccess)
x_valid = x_valid.apply(preccess)
x_test = x_test.apply(preccess)

x_test

[36m(_remote_exec_multi_chain pid=12832)[0m 
[36m(_remote_exec_multi_chain pid=12832)[0m **********************************************************************
[36m(_remote_exec_multi_chain pid=12832)[0m   Resource [93mpunkt_tab[0m not found.
[36m(_remote_exec_multi_chain pid=12832)[0m   Please use the NLTK Downloader to obtain the resource:
[36m(_remote_exec_multi_chain pid=12832)[0m 
[36m(_remote_exec_multi_chain pid=12832)[0m   [31m>>> import nltk
[36m(_remote_exec_multi_chain pid=12832)[0m   >>> nltk.download('punkt_tab')
[36m(_remote_exec_multi_chain pid=12832)[0m   [0m
[36m(_remote_exec_multi_chain pid=12832)[0m   For more information see: https://www.nltk.org/data.html
[36m(_remote_exec_multi_chain pid=12832)[0m 
[36m(_remote_exec_multi_chain pid=12832)[0m   Attempted to load [93mtokenizers/punkt_tab/english/[0m
[36m(_remote_exec_multi_chain pid=12832)[0m 
[36m(_remote_exec_multi_chain pid=12832)[0m   Searched in:
[36m(_remote_exec_multi_chain pi

RayTaskError(LookupError): [36mray::remote_exec_func()[39m (pid=12833, ip=192.168.126.137)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: [36mray::_remote_exec_single_chain()[39m (pid=12818, ip=192.168.126.137)
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/modin/core/execution/ray/common/deferred_execution.py", line 643, in construct
    raise err
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/modin/core/execution/ray/common/deferred_execution.py", line 633, in construct
    obj = next(gen)
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/modin/core/execution/ray/common/deferred_execution.py", line 711, in construct_chain
    obj = cls.exec_func(fn, obj, args, kwargs)
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/modin/core/execution/ray/common/deferred_execution.py", line 605, in exec_func
    raise err
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/modin/core/execution/ray/common/deferred_execution.py", line 592, in exec_func
    return fn(obj, *args, **kwargs)
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/modin/core/dataframe/algebra/map.py", line 51, in <lambda>
    lambda x: function(x, *args, **kwargs), *call_args, **call_kwds
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/pandas/core/frame.py", line 10468, in map
    return self.apply(infer).__finalize__(self, "map")
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/pandas/core/frame.py", line 10374, in apply
    return op.apply().__finalize__(self, method="apply")
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/pandas/core/apply.py", line 916, in apply
    return self.apply_standard()
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/pandas/core/apply.py", line 1063, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/pandas/core/apply.py", line 1081, in apply_series_generator
    results[i] = self.func(v, *self.args, **self.kwargs)
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/pandas/core/frame.py", line 10466, in infer
    return x._map_values(func, na_action=na_action)
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/pandas/core/base.py", line 921, in _map_values
    return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/pandas/core/algorithms.py", line 1743, in map_array
    return lib.map_infer(values, mapper, convert=convert)
  File "lib.pyx", line 2972, in pandas._libs.lib.map_infer
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/modin/pandas/series.py", line 1268, in <lambda>
    arg(s) if pandas.isnull(s) is not True or na_action is None else s
  File "/tmp/ipykernel_12497/4137630162.py", line 20, in preccess
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/nltk/tokenize/__init__.py", line 142, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/nltk/tokenize/__init__.py", line 119, in sent_tokenize
    tokenizer = _get_punkt_tokenizer(language)
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/nltk/tokenize/__init__.py", line 105, in _get_punkt_tokenizer
    return PunktTokenizer(language)
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/nltk/tokenize/punkt.py", line 1744, in __init__
    self.load_lang(lang)
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/nltk/tokenize/punkt.py", line 1749, in load_lang
    lang_dir = find(f"tokenizers/punkt_tab/{lang}/")
  File "/home/lcda/miniconda3/envs/dl_with_pytorch/lib/python3.10/site-packages/nltk/data.py", line 579, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - '/home/lcda/nltk_data'
    - '/home/lcda/miniconda3/envs/dl_with_pytorch/nltk_data'
    - '/home/lcda/miniconda3/envs/dl_with_pytorch/share/nltk_data'
    - '/home/lcda/miniconda3/envs/dl_with_pytorch/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

### Exercise 4: Feature Generation (TF-IDF)
**Objective**: Convert the preprocessed text data into numerical features using TF-IDF.

1. Use the `TfidfVectorizer` from `sklearn` to convert the reviews into numerical vectors.
2. Limit the maximum number of features to 5,000 to reduce the dimensionality.
3. Fit the vectorizer on the training set and transform both the training and test sets.
4. Print the shape of the transformed feature sets to confirm the conversion.

### Exercise 5: Building the MLP Model (PyTorch)
**Objective**: Build a simple Multi-Layer Perceptron (MLP) for binary classification.

1. Define the MLP model using `torch.nn.Module`. The model should have:
    - An input layer that matches the size of the TF-IDF features.
    - Two hidden layers with ReLU activations.
    - A single output layer with a sigmoid activation function.
2. Print the model summary.


### Exercise 6: Training the Model
**Objective**: Train the MLP model on the training data.

1. Convert the TF-IDF feature matrices and labels into PyTorch tensors (the label needs to be binarized).
2. Define the loss function (`BCELoss` for binary classification) and the optimizer (`Adam`).
3. Implement a training loop to train the model for a specified number of epochs (e.g., 50).
4. Monitor the training and validation loss during training.

### Exercise 7: Model Evaluation
**Objective**: Evaluate the performance of the trained model on the test data.

1. Use the trained model to make predictions on the test set.
2. Calculate the accuracy of the model on the test data.
3. Print the test accuracy.

### **Exercise 8: Saving the Trained Model**

1. Save the model's state_dict using `torch.save()`. 

2. Save the entire model, including its architecture and weights.

3. Demonstrate how to load the saved model and use it for making predictions on new data.
