**Assignment 1: Natural Language Processing**

Movie Review Sentiment Analysis

The goal of this assignment is to implement and compare three text classification algorithms—
Naive Bayes, Logistic Regression, and Multilayer Perceptron (MLP)—on the NLTK Movie Reviews
dataset. You will explore the impact of using both raw Term Frequency (TF) and Term Frequency-
Inverse Document Frequency (TF-IDF) as feature representations.

**1. Data Preparation**
- Load the NLTK Movie Reviews Dataset

In [4]:
import nltk
# Download the NLTK movie_reviews corpus
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

**Access the Dataset**
- Once downloaded, access the data using the following code:

In [5]:
import random

from nltk.corpus import movie_reviews

# Build a list of (word list, label) pairs, one per review
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle so positive and negative reviews are interleaved rather than grouped by category
random.shuffle(documents)

**Explore the Dataset**
- Take a look at the structure of the dataset and sample reviews to understand its characteristics


In [6]:
# Print the first review and its label
print("Sample Review:", documents[0][0][:10]) # Displaying the first 10 words for brevity
print("Label:", documents[0][1])

Sample Review: ['talk', 'about', 'a', 'movie', 'that', 'seemed', 'dated', 'before', 'it', 'even']
Label: neg


Preprocess the dataset by tokenizing (use the NLTK Punkt tokenizer), stemming or lemmatizing, and removing stopwords.

- I will use lemmatization here

In [8]:
import nltk
import random
from nltk.corpus import movie_reviews, stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download necessary NLTK data
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

# Initialize lemmatizer and stop words
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Load and preprocess the data
documents = []
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        raw_text = movie_reviews.raw(fileid)
        tokens = word_tokenize(raw_text)  # Tokenize with NLTK's Punkt-based word tokenizer
        tokens = [word.lower() for word in tokens if word.isalpha()]  # Lowercase; keep alphabetic tokens only (drops punctuation and numbers)
        tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]  # Remove stopwords, then lemmatize
        documents.append((tokens, category))

# Shuffle the documents
random.shuffle(documents)

# Display a sample
print("Sample Review:", documents[0][0][:10])
print("Label:", documents[0][1])

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Sample Review: ['bob', 'happy', 'bastard', 'quickie', 'review', 'rush', 'hour', 'problem', 'hour', 'clone']
Label: pos


**2. Coverage Analysis Insights**

- Conduct a coverage analysis to identify the percentage of unique words covered by the
preprocessing steps.

- Visualize the coverage analysis with the y-axis representing coverage percentage and the
x-axis representing the id of tokens (words) ordered by frequency of occurrence. Use a line plot
for clarity.

- Discuss the insights gained from the coverage analysis. Consider questions such as:

 - How does the coverage change with the number of tokens considered?

 - At what point does the coverage seem to stabilize?

 - Are there diminishing returns in terms of coverage as the number of tokens increases?
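
One way to approach the analysis above is sketched below. It assumes "coverage" means the share of all token occurrences in the corpus accounted for by the k most frequent distinct tokens; that reading is an assumption, as are the helper names `coverage_curve`, `tokens_needed`, and `plot_coverage`. The toy corpus stands in for the preprocessed `documents` list, and the plotting helper assumes matplotlib is installed.

```python
from collections import Counter
from itertools import accumulate

def coverage_curve(token_lists):
    """Cumulative coverage (%) of all token occurrences by the top-k most frequent tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    freqs = sorted(counts.values(), reverse=True)  # index 0 = most frequent token
    total = sum(freqs)
    return [100.0 * running / total for running in accumulate(freqs)]

def tokens_needed(curve, target_pct):
    """Smallest number of top-ranked tokens whose cumulative coverage reaches target_pct."""
    for rank, pct in enumerate(curve, start=1):
        if pct >= target_pct:
            return rank
    return len(curve)

def plot_coverage(curve):
    """Line plot of coverage vs. token id; matplotlib is imported lazily (assumed installed)."""
    import matplotlib.pyplot as plt
    plt.plot(range(1, len(curve) + 1), curve)
    plt.xlabel("Token id (ranked by frequency)")
    plt.ylabel("Coverage (%)")
    plt.title("Coverage analysis")
    plt.show()

# Toy corpus standing in for the preprocessed reviews (illustrative data only)
toy_docs = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie", "bad", "acting", "bad"],
    ["good", "acting"],
]

curve = coverage_curve(toy_docs)
print([round(pct, 1) for pct in curve])
print("Tokens needed for 90% coverage:", tokens_needed(curve, 90))
```

On the real data, the same functions could be called as `coverage_curve(tokens for tokens, _ in documents)` followed by `plot_coverage(curve)`. Calling `tokens_needed` at several thresholds (for example 80, 90, and 95 percent) gives concrete numbers for the questions about stabilization and diminishing returns.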