**Course: BA820 - Unsupervised and Unstructured ML**

**Notebook created by: Mohannad Elhamod**

**Note**: You are **ALLOWED** to use Generative AI for this notebook, but you must properly cite your usage. Be sure to review the syllabus for details on citation requirements and the consequences of failing to cite your sources correctly or simply copy-pasting without meaningful engagement.

# Movie Review Classification



In this notebook, we analyze the [IMDB Movie Reviews dataset](https://huggingface.co/datasets/ajaykarthick/imdb-movie-reviews) and perform various data analysis and machine learning tasks.



In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

Loading the dataset

In [None]:
#### DO NOT CHANGE THIS CODE ###
import pandas as pd

splits = {'train': 'train.jsonl', 'test': 'test.jsonl'}

# Load both train and test datasets
df_train = pd.read_json("hf://datasets/ajaykarthick/imdb-movie-reviews/" + splits["train"], lines=True).sample(frac=0.1, random_state=42)
df_test = pd.read_json("hf://datasets/ajaykarthick/imdb-movie-reviews/" + splits["test"], lines=True).sample(frac=0.1, random_state=42)

### ✅ Remove duplicates across both datasets but keep them separate ###
# Add a 'dataset' column to identify where each row came from
df_train["dataset"] = "train"
df_test["dataset"] = "test"

# Concatenate datasets temporarily
df_combined = pd.concat([df_train, df_test], ignore_index=True)

# Remove duplicates across both datasets based on 'review'
df_combined.drop_duplicates(subset=['review'], inplace=True, ignore_index=True)

# Split them back into train and test sets
df_train = df_combined[df_combined["dataset"] == "train"].drop(columns=["dataset"])
df_test = df_combined[df_combined["dataset"] == "test"].drop(columns=["dataset"])

# Reset indices
df_train.reset_index(drop=True, inplace=True)
df_test.reset_index(drop=True, inplace=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
df_train

Unnamed: 0,review,label
0,"Seriously, I can't imagine how anyone could fi...",1
1,I disliked Frosty returns and this one. Both o...,1
2,"This was such a terrible film, almost a comedy...",1
3,"When I first rented Batman Returns, I immediat...",0
4,I am quite the Mitchell Leisen fan so it was a...,0
...,...,...
3992,This is one of the most overlooked gems Hollyw...,0
3993,I saw only the first part of this series when ...,0
3994,David Duchovny and Michelle Forbes play a youn...,0
3995,I watched SCARECROWS because of the buzz surro...,1


In [None]:
print(df_train.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3997 entries, 0 to 3996
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  3997 non-null   object
 1   label   3997 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 62.6+ KB
None


# Questions

## Question 1: Classifying The Reviews

Your entertainment company needs a model to classify reviews as **positive** or **negative**.  

Your team debated between two approaches: a traditional **frequency-based NLP method** and a more advanced **neural network-based method**. You decided to compare them in terms of **performance, speed, and interpretability** (i.e., how well the model explains why a review is classified as positive or negative).  

Specifically, you will compare:  
- **Bag-of-Words (BoW)** approach  **(1 Point)**
- **Mean GloVe embeddings** approach **(1 Point)**

#### **Tasks:**  
- Experiment with different text-cleaning and preprocessing techniques, among other model hyperparameters, to optimize each approach.  **(1 Point)**
- Summarize your results in a table.  **(1 Point)**
- Which model performed better? Explain based on theoretical concepts discussed in class.  **(1 Point)**
- Use [`import time`](https://stackoverflow.com/questions/7370801/how-do-i-measure-elapsed-time-in-python) to measure execution time. Place the import on the first line of the cell.  
- For interpretability, you may want to use [`mglearn.tools.visualize_coefficients`](https://medium.com/towards-data-science/how-a-simple-algorithm-classifies-texts-with-moderate-accuracy-79f0cd9eb47).
- For classification, use `LogisticRegression` from `sklearn`.
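
As a rough illustration of the workflow these tasks describe, the sketch below times a Bag-of-Words pipeline end to end. The four toy reviews, the label convention (1 = negative), and the `CountVectorizer` settings are assumptions made for this example only; they are not the notebook's data or a tuned configuration.

```python
import time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus (hypothetical); in the notebook this would be df_train["review"]
# and df_train["label"].
train_texts = [
    "a wonderful and moving film",
    "great acting and a wonderful story",
    "terrible plot and awful acting",
    "an awful, boring and terrible film",
]
train_labels = [0, 0, 1, 1]  # assumed convention for this sketch: 1 = negative

start = time.perf_counter()

# Bag-of-Words: each review becomes a sparse vector of token counts.
# Lowercasing and stop-word removal are two of the preprocessing knobs to vary.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
X_train = vectorizer.fit_transform(train_texts)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_labels)

elapsed = time.perf_counter() - start

# New text must go through the SAME fitted vectorizer (transform, not fit_transform).
X_new = vectorizer.transform(["a wonderful story", "awful and boring"])
predictions = clf.predict(X_new)

print(f"vocabulary size: {len(vectorizer.vocabulary_)}, fit time: {elapsed:.4f}s")
print(predictions)
```

Timing the vectorizer and the classifier together keeps the comparison with the embedding approach fair, since both approaches spend part of their time turning text into features.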

### BoW Approach

In [None]:
import time



**Interpretability of BoW**

In [None]:
!pip install mglearn



Collecting mglearn
  Downloading mglearn-0.2.0-py2.py3-none-any.whl.metadata (628 bytes)
Downloading mglearn-0.2.0-py2.py3-none-any.whl (581 kB)
Installing collected packages: mglearn
Successfully installed mglearn-0.2.0


### Mean GloVe Approach
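
The idea behind a mean embedding is to represent a review by the average of its word vectors. The sketch below shows that step in isolation, using a small hypothetical dictionary of 3-dimensional vectors in place of a real GloVe model; `toy_vectors`, `dim`, and `mean_embedding` are names invented for this example. A gensim `KeyedVectors` object supports the same `word in model` and `model[word]` operations, so the function should carry over with little change.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" standing in for a pretrained model.
toy_vectors = {
    "great": np.array([0.9, 0.1, 0.0]),
    "movie": np.array([0.1, 0.8, 0.1]),
    "bad": np.array([-0.8, 0.1, 0.2]),
}
dim = 3


def mean_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # Out-of-vocabulary-only documents still need a fixed-size representation.
        return np.zeros(dim)
    return np.mean(known, axis=0)


doc = "great movie".split()
vec = mean_embedding(doc, toy_vectors, dim)
print(vec)
```

Stacking one such vector per review (for example with `np.vstack`) gives a dense feature matrix that can be passed to `LogisticRegression` in the same way as the Bag-of-Words matrix.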

In [None]:
import gensim.downloader as api

pretrained_model = api.load(  ) ### Fill in the appropriate model name inside the parentheses





 **Answers:**  

*Leave answer here.*

## Question 2: Clustering The Reviews


Another colleague suggests that classification may be unnecessary. *Why bother with labeled data when clustering could achieve the same result?* he asks.  

Using **K-Means clustering** with **Mean GloVe embeddings** **(1 Point)**, test whether you can automatically separate positive and negative reviews based on their content and without labels.

- Visualize the clusters using **PCA**.  **(0.5 Point)**
- Compare the clusters to the original sentiment labels.  **(1 Point)**
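
The steps above can be sketched as follows. Synthetic, well-separated blobs stand in for the mean review embeddings, so the near-perfect agreement here says nothing about how real reviews will cluster; the data, the dimensionality, and the choice of the adjusted Rand index as the comparison metric are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(42)

# Synthetic stand-in for mean embeddings: two blobs of 50 points in 10 dimensions.
X = np.vstack([
    rng.normal(-2.0, 0.5, size=(50, 10)),
    rng.normal(2.0, 0.5, size=(50, 10)),
])
y_true = np.array([0] * 50 + [1] * 50)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)

# Cluster ids are arbitrary (cluster 0 need not match label 0), so use a
# score that is invariant to relabeling instead of plain accuracy.
ari = adjusted_rand_score(y_true, cluster_ids)

# Project to 2-D for plotting; color the points by cluster_ids and by y_true
# in two side-by-side scatter plots to compare them visually.
X_2d = PCA(n_components=2).fit_transform(X)

print(f"adjusted Rand index: {ari:.3f}, projected shape: {X_2d.shape}")
```

An adjusted Rand index near 1 means the clusters reproduce the labels, while a value near 0 means the grouping is no better than chance.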



**Answer:**

*Leave answer here.*

## Question 3: Pre-trained Model Approach

After testing various approaches, another colleague suggests trying a **[state-of-the-art pre-trained model](https://huggingface.co/l3cube-pune/marathi-sentiment-political-tweets)** that has been generating a lot of buzz.  

Your task is to apply this model to movie reviews **(1 Point)** and evaluate whether it **truly outperforms** the previous approaches.

- **Explicitly state any assumptions you make.**  **(0.5 Point)**
- **Does this model actually surpass the previous models?** Support your answer with empirical results.  **(1 Point)**
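
To compare a pipeline's predictions with the dataset labels, its string labels first have to be mapped onto the dataset's 0/1 convention. The sketch below shows one way to do that on made-up outputs. The label names (`"Positive"`, `"Neutral"`, `"Negative"`), the mapping, the convention 1 = negative, and the choice to count unmapped predictions as errors are all assumptions to state explicitly; the label names a model actually uses can be checked with `pipe.model.config.id2label`.

```python
# Hypothetical outputs in the list-of-dicts shape a text-classification
# pipeline returns; the label names here are invented for illustration.
outputs = [
    {"label": "Positive", "score": 0.91},
    {"label": "Neutral", "score": 0.55},
    {"label": "Negative", "score": 0.87},
    {"label": "Positive", "score": 0.60},
]
y_true = [0, 0, 1, 1]  # assumed convention for this sketch: 1 = negative

# Assumed mapping from model labels to dataset labels; "Neutral" has no
# counterpart in a binary dataset, so it is left unmapped.
label_map = {"Positive": 0, "Negative": 1}


def to_dataset_labels(outputs, label_map):
    """Map each pipeline output to a dataset label, or None if unmapped."""
    return [label_map.get(o["label"]) for o in outputs]


y_pred = to_dataset_labels(outputs, label_map)

# Assumption: an unmapped prediction counts as an error.
correct = sum(1 for t, p in zip(y_true, y_pred) if p == t)
acc = correct / len(y_true)

# Coverage: share of reviews that received a usable positive/negative label.
coverage = sum(p is not None for p in y_pred) / len(y_pred)

print(f"accuracy: {acc:.2f}, coverage: {coverage:.2f}")
```

Reporting coverage alongside accuracy makes it visible when a low score comes from the model abstaining rather than from wrong positive or negative calls.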

**A GPU runtime type is required when solving this question.**  



In [None]:
### DO NOT CHANGE THIS CODE##
from transformers import pipeline

pipe = pipeline("text-classification", model="l3cube-pune/marathi-sentiment-political-tweets", truncation=True)

pipe(["I love this movie. It is really sooo amazing", "This is an OK movie",  "This is a bad movie"])

**Answer:**

*Leave answer here.*
