Use the following dataset - https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

## Problem 1
### Apply all the preprocessing techniques that you think are necessary

In [3]:
# Import basic libraries
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download the NLTK resources used below (only needed once per environment).
# Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

In [4]:
# Load the dataset
df = pd.read_csv('IMDB Dataset.csv')

In [5]:
# Head
df.head()

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [6]:
# Initialize lemmatizer & stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

In [7]:
# Function for text preprocessing
def preprocess_text(text):
    # Remove HTML tags
    text = re.sub(r'<.*?>', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Remove special characters and punctuation
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords and apply lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return ' '.join(tokens)
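
To see what the regex steps in `preprocess_text` do on their own, here is a minimal sketch that uses only `re`, so it needs no NLTK resources. The helper name `strip_markup_and_punctuation` and the sample sentence are made up for illustration. Deleting every non-letter also deletes apostrophes, which is why a token such as `matteis` shows up in the cleaned reviews.

```python
import re

def strip_markup_and_punctuation(text):
    """Illustrative helper: only the regex and lowercase steps of preprocess_text."""
    text = re.sub(r'<.*?>', ' ', text)    # replace HTML tags with a space
    text = text.lower()                   # normalize case
    text = re.sub(r'[^a-z\s]', '', text)  # drop everything except letters and whitespace
    return text

sample = "A wonderful little production. <br /><br />The filming isn't bad!"
cleaned_tokens = strip_markup_and_punctuation(sample).split()
print(cleaned_tokens)
```

The apostrophe in "isn't" is removed rather than replaced by a space, so the token becomes `isnt` and no longer matches any entry in the stopword list.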

In [8]:
# Apply preprocessing to each review
df['cleaned_review'] = df['review'].apply(preprocess_text)

In [9]:
# Save the preprocessed dataset (optional)
df.to_csv("IMDB_Cleaned_Dataset.csv", index=False)

# Display first few rows
print(df[['review', 'cleaned_review']].head())

                                              review  \
0  One of the other reviewers has mentioned that ...   
1  A wonderful little production. <br /><br />The...   
2  I thought this was a wonderful way to spend ti...   
3  Basically there's a family where a little boy ...   
4  Petter Mattei's "Love in the Time of Money" is...   

                                      cleaned_review  
0  one reviewer mentioned watching oz episode you...  
1  wonderful little production filming technique ...  
2  thought wonderful way spend time hot summer we...  
3  basically there family little boy jake think t...  
4  petter matteis love time money visually stunni...  


## Problem 2

### Find the number of words in the entire corpus and the total number of unique words (vocabulary) using just Python

In [13]:
from collections import Counter

# Combine all reviews into one large text
all_words = ' '.join(df['cleaned_review'])

# Split into individual words
word_list = all_words.split()

# Count total words
total_words = len(word_list)

# Count unique words (vocabulary size); a set removes duplicate words
unique_words = len(set(word_list))

print(f"Total words in corpus: {total_words}")
print(f"Unique words in vocabulary: {unique_words}")

Total words in corpus: 5948526
Unique words in vocabulary: 151053
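
The cell above imports `Counter` but never uses it. As an alternative sketch, still using just Python, a `Counter` gives the total word count, the vocabulary size, and per-word frequencies in one pass. The toy reviews and the `toy_` names below are made up for illustration; on the real data the generator would iterate over `df['cleaned_review']` instead.

```python
from collections import Counter

# Made-up stand-in for df['cleaned_review']
toy_reviews = ["good movie good cast", "bad movie"]

# Count every whitespace-separated token across all reviews
toy_word_counts = Counter(word for review in toy_reviews for word in review.split())

toy_total_words = sum(toy_word_counts.values())  # all tokens, duplicates included
toy_unique_words = len(toy_word_counts)          # distinct tokens (vocabulary size)

print(f"Total words: {toy_total_words}")
print(f"Unique words: {toy_unique_words}")
print(toy_word_counts.most_common(2))
```

This avoids joining the whole corpus into one large string first, which may matter for memory on the full 50,000 reviews.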


## Problem 3

### Apply One Hot Encoding

Applying one-hot encoding directly to the entire IMDB dataset is memory-intensive because:

✅ The dataset has 50,000 reviews.

✅ The vocabulary is huge (151,053 unique words after preprocessing, as counted in Problem 2).

✅ Each unique word becomes its own column, leading to an enormous sparse matrix.
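
No code was run for this problem. As a rough sketch of what word-level one-hot encoding looks like, the example below encodes a made-up five-token sample and then estimates the size of a dense encoding of the full corpus from the Problem 2 counts. The names `toy_tokens`, `word_to_index`, and `one_hot` are illustrative, and treating each token as one row is an assumption about how the encoding would be laid out.

```python
import numpy as np

# Made-up sample; a real run would use the tokens of the cleaned reviews
toy_tokens = "wonderful little production wonderful cast".split()

# One column per unique word, in sorted order
toy_vocab = sorted(set(toy_tokens))
word_to_index = {word: i for i, word in enumerate(toy_vocab)}

# One row per token, with a single 1 in the column of that token's word
one_hot = np.zeros((len(toy_tokens), len(toy_vocab)), dtype=np.int8)
one_hot[np.arange(len(toy_tokens)), [word_to_index[w] for w in toy_tokens]] = 1

print(toy_vocab)
print(one_hot)

# Size estimate for the full corpus, using the counts reported in Problem 2
dense_cells = 5_948_526 * 151_053  # tokens x vocabulary size
print(f"Dense cells needed: {dense_cells:,}")
```

At one byte per cell, the dense matrix for the full corpus would need close to 900 GB, which is why the encoding is only practical on a small sample or in a sparse format.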

## Problem 4

### Apply bag of words, find the vocabulary, and count how many times each word occurs

In [22]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Use the full vocabulary (no max_features cap); int8 counts save memory but
# assume no word occurs more than 127 times in a single review
vectorizer = CountVectorizer(dtype=np.int8)

# Fit and transform the cleaned reviews
bow_matrix = vectorizer.fit_transform(df['cleaned_review'])

# Convert to a dense DataFrame (at least about 7.5 GB for 50,000 x 151,029 int8 counts)
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Get word frequencies
word_counts = bow_df.sum().sort_values(ascending=False)

print(f"Vocabulary Size: {len(vectorizer.get_feature_names_out())}")
print("Top 10 Most Frequent Words:")
print(word_counts.head(10))

Vocabulary Size: 151029
Top 10 Most Frequent Words:
movie        100957
film          91574
one           53811
like          40022
time          30249
good          29030
character     27982
story         24740
even          24587
get           24517
dtype: int64


## Problem 5

### Apply bag of bi-gram and bag of tri-gram and write down your observation about the dimensionality of the vocabulary

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

# Bi-gram Model (2-word sequences)
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), dtype=np.int8)
bigram_matrix = bigram_vectorizer.fit_transform(df['cleaned_review'])
bigram_vocab_size = len(bigram_vectorizer.get_feature_names_out())

# Tri-gram Model (3-word sequences)
trigram_vectorizer = CountVectorizer(ngram_range=(3, 3), dtype=np.int8)
trigram_matrix = trigram_vectorizer.fit_transform(df['cleaned_review'])
trigram_vocab_size = len(trigram_vectorizer.get_feature_names_out())

print(f"Vocabulary Size (Bi-gram): {bigram_vocab_size}")
print(f"Vocabulary Size (Tri-gram): {trigram_vocab_size}")

Vocabulary Size (Bi-gram): 2992555
Vocabulary Size (Tri-gram): 5366114


📌 Higher-order n-grams enlarge the feature space (151,029 unigrams, 2,992,555 bi-grams, and 5,366,114 tri-grams here) → more memory usage → higher computation cost.

📌 Word sequences capture more context, but most n-grams appear in very few reviews, so the matrix also becomes sparser.
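
To make the idea of an n-gram concrete, here is a small pure-Python sketch of how bi-grams and tri-grams are formed from a token list and how their vocabularies are counted. The `ngrams` helper and the toy reviews are made up for illustration; this is not how `CountVectorizer` is implemented internally.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up stand-in for df['cleaned_review']
toy_reviews = ["good movie good cast", "good movie bad cast"]

print(ngrams(toy_reviews[0].split(), 2))
print(ngrams(toy_reviews[0].split(), 3))

# Vocabulary size for each n: the number of distinct n-grams across all reviews
toy_vocab_sizes = {
    n: len({gram for review in toy_reviews for gram in ngrams(review.split(), n)})
    for n in (1, 2, 3)
}
print(toy_vocab_sizes)
```

On a corpus this small the counts do not keep growing with n. The growth reported above most likely comes from long reviews in which the same words recombine in many different orders.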

## Problem 6

### Apply tf-idf and find out the idf scores of words, also find out the vocabulary.

✅ Gives more weight to words that appear frequently in a document but are rare in the dataset.

✅ Reduces the effect of common words (like "the", "is", etc.).

✅ Balances word frequency with uniqueness.

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the cleaned reviews
tfidf_matrix = tfidf_vectorizer.fit_transform(df['cleaned_review'])

# Get the vocabulary (all unique words)
vocabulary = tfidf_vectorizer.get_feature_names_out()
print(f"Vocabulary Size: {len(vocabulary)}")

# Get the IDF scores for each word
idf_scores = dict(zip(vocabulary, tfidf_vectorizer.idf_))

# Convert to DataFrame for better visualization
idf_df = pd.DataFrame(idf_scores.items(), columns=['Word', 'IDF_Score']).sort_values(by="IDF_Score", ascending=False)

# Display the 10 words with the highest IDF scores (the rarest words in the corpus)
print("Top 10 Most Unique Words Based on IDF Scores:")
print(idf_df.head(10))

Vocabulary Size: 151029
Top 10 Most Unique Words Based on IDF Scores:
               Word  IDF_Score
75514        lifeha  11.126651
87720    mukerjhees  11.126651
87745      mulgrews  11.126651
87743       mulford  11.126651
87742      mulelike  11.126651
87740       muldrun  11.126651
87738  mulderscully  11.126651
87735       muldayr  11.126651
87734    muldaurwho  11.126651
87732        muldar  11.126651
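
All ten words share the score 11.126651. That is the value scikit-learn's smoothed IDF formula, `ln((1 + n) / (1 + df)) + 1`, gives for a word that appears in exactly one of the 50,000 reviews. A small check, assuming the default `smooth_idf=True` setting of `TfidfVectorizer` (the helper name `smoothed_idf` is illustrative):

```python
import math

def smoothed_idf(n_documents, document_frequency):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

# A word that occurs in a single review out of 50,000
rare_word_idf = smoothed_idf(50_000, 1)
print(round(rare_word_idf, 6))

# A word that occurs in every review gets the minimum score
common_word_idf = smoothed_idf(50_000, 50_000)
print(common_word_idf)
```

The top of the list is therefore dominated by typos and fused tokens such as `mulderscully`, each seen in only one review, so a `min_df` threshold may be worth considering before interpreting the highest scores.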
