# NLP501 - NATURAL LANGUAGE PROCESSING

# EXTRA LAB 02
## Multi-Algorithm Text Classification
## Movie Review Sentiment Analysis

- **Language:** Python 3.8+
- **Tools:** Jupyter Notebook, NumPy, NLTK, scikit-learn
- **Dataset:** IMDB Movie Reviews (NLTK `movie_reviews` corpus)
- **Objective:** Compare multiple classification algorithms


## Overview

In this exercise, you will:
1. Load and explore a real movie review dataset
2. Implement comprehensive text preprocessing
3. Extract different types of features (Bag-of-Words, TF-IDF, N-grams)
4. Train and evaluate **multiple classifiers**:
   - Naive Bayes (Multinomial & Bernoulli)
   - Logistic Regression
   - Support Vector Machine (SVM)
5. Compare performance using various metrics
6. Perform error analysis and model interpretation

## 1. Environment Setup

### 1.1. Install Required Libraries

In [None]:
# Install required packages
# !pip install numpy pandas nltk scikit-learn matplotlib seaborn wordcloud

# Download NLTK data
import nltk
nltk.download('movie_reviews')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # required by word_tokenize in newer NLTK releases
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

### 1.2. Import Libraries

Import all necessary libraries for your implementation.

In [None]:
# TODO: Import all necessary libraries
# Hints: numpy, pandas, nltk, sklearn, matplotlib, seaborn, etc.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Add more imports as needed
# YOUR CODE HERE

# Set random seed for reproducibility
np.random.seed(42)

## 2. Data Loading and Exploration

### Task 1: Load the IMDB Movie Reviews Dataset

The dataset contains 2,000 movie reviews (1,000 positive, 1,000 negative) from the NLTK `movie_reviews` corpus.

**Requirements:**
1. Load reviews from `nltk.corpus.movie_reviews`
2. Create a list of texts and corresponding labels
3. Convert to a pandas DataFrame with columns: `text`, `label`, `length`
4. Shuffle the dataset
5. Display basic statistics
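
The loading steps above can be sketched as follows. This is a minimal illustration, not the required solution: `load_reviews` and `TinyCorpus` are hypothetical names, `TinyCorpus` is a four-document stand-in so the snippet runs without downloading anything, and measuring `length` as a whitespace word count is an assumption (character count would be equally valid).

```python
import random


def load_reviews(corpus, seed=42):
    """Collect text/label/length rows from an NLTK-style categorized corpus."""
    rows = []
    for label in corpus.categories():            # 'neg' and 'pos' for movie_reviews
        for fileid in corpus.fileids(label):     # file ids belonging to that class
            text = corpus.raw(fileid)            # full review text as one string
            rows.append({
                "text": text,
                "label": label,
                "length": len(text.split()),     # assumption: length = word count
            })
    random.Random(seed).shuffle(rows)            # reproducible shuffle
    return rows


class TinyCorpus:
    """Hypothetical stand-in exposing the three corpus methods used above."""

    _docs = {
        "pos/a.txt": "a warm , funny and moving film",
        "pos/b.txt": "great cast and a clever script",
        "neg/a.txt": "dull , slow and far too long",
        "neg/b.txt": "a tedious mess",
    }

    def categories(self):
        return ["neg", "pos"]

    def fileids(self, category):
        return [f for f in self._docs if f.startswith(category + "/")]

    def raw(self, fileid):
        return self._docs[fileid]


rows = load_reviews(TinyCorpus())
print(len(rows), sorted({row["label"] for row in rows}))
```

In the notebook, `pd.DataFrame(load_reviews(movie_reviews))` should give a DataFrame with the `text`, `label`, and `length` columns, after which `df.describe()` and `df['label'].value_counts()` cover the basic statistics.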

In [None]:
# TODO: Load movie reviews dataset
# Hint: Use nltk.corpus.movie_reviews.fileids() and movie_reviews.raw(fileid)

from nltk.corpus import movie_reviews

### Task 2: Exploratory Data Analysis (EDA)

**Requirements:**
1. Display dataset shape and first few rows
2. Check class distribution (positive vs negative)
3. Calculate and display statistics:
   - Average review length
   - Min/Max review length
   - Median review length per class
4. Create visualizations:
   - Distribution of review lengths (histogram)
   - Class distribution (bar plot)
   - Box plot of lengths by class
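
One possible way to compute the statistics listed above with pandas is sketched below. The DataFrame is a toy stand-in with made-up lengths, used only so the snippet is self-contained; in the notebook you would run the same calls on your own `df`.

```python
import pandas as pd

# Toy stand-in for the review DataFrame (values are made up for illustration).
df = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg", "pos", "neg"],
    "length": [120, 80, 200, 150, 160, 100],
})

print("Shape:", df.shape)
print(df.head())

class_counts = df["label"].value_counts()                    # class distribution
overall = df["length"].agg(["mean", "min", "max"])           # overall length statistics
per_class_median = df.groupby("label")["length"].median()    # median length per class

print(class_counts)
print(overall)
print(per_class_median)
```

For the plots, `df['length'].plot.hist(bins=30)`, `class_counts.plot.bar()`, and `sns.boxplot(data=df, x='label', y='length')` are one reasonable starting point.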

In [None]:
# TODO: Perform EDA

In [None]:
# TODO: Create visualizations

## 3. Text Preprocessing

### Task 3: Implement Comprehensive Preprocessing

Create a preprocessing pipeline that:
1. Converts to lowercase
2. Removes special characters and numbers
3. Tokenizes text
4. Removes stopwords
5. Applies lemmatization (optional: use stemming instead)
6. Filters out very short tokens (< 3 characters)
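
The pipeline above can be sketched in plain Python as shown below. This is a simplified illustration under stated assumptions: `preprocess_sketch` is a hypothetical name, `STOPWORDS` is a tiny hand-picked set rather than NLTK's English list, tokenization is a whitespace split instead of `word_tokenize`, and lemmatization (step 5) is left out so the snippet runs without any NLTK data.

```python
import re

# Tiny illustrative stopword set (assumption); in the notebook you would use
# set(stopwords.words('english')) from NLTK instead.
STOPWORDS = {"this", "is", "it", "of", "i", "ve", "the", "a", "and"}


def preprocess_sketch(text, min_len=3):
    """Lowercase, strip non-letters, tokenize, drop stopwords and short tokens."""
    text = text.lower()                          # step 1: lowercase
    text = re.sub(r"[^a-z\s]", " ", text)        # step 2: drop digits and special characters
    tokens = text.split()                        # step 3: whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]   # step 4: remove stopwords
    # Step 5 (lemmatization) is omitted here; WordNetLemmatizer().lemmatize(t)
    # would be applied to each token at this point.
    tokens = [t for t in tokens if len(t) >= min_len]    # step 6: drop tokens under 3 chars
    return " ".join(tokens)


sample = "This movie is AMAZING!!! I've watched it 3 times. Best film of 2023! :)"
cleaned = preprocess_sketch(sample)
print(cleaned)  # movie amazing watched times best film
```

Note that removing the apostrophe splits "I've" into "i" and "ve", which is why both appear in the toy stopword set. With NLTK's tokenizer and stopword list the intermediate tokens will differ.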

In [None]:
# TODO: Implement preprocessing functions

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
import re
import string

def preprocess_text_basic(text):
    """
    Basic preprocessing: lowercase, tokenize, remove stopwords
    
    Args:
        text (str): Input text
    Returns:
        str: Preprocessed text (space-separated tokens)
    """
    pass

def preprocess_text_advanced(text):
    """
    Advanced preprocessing: basic steps plus number/special-character removal, lemmatization, and short-token filtering
    
    Args:
        text (str): Input text
    Returns:
        str: Preprocessed text (space-separated tokens)
    """
    pass

In [None]:
# TODO: Test your preprocessing functions
sample_text = "This movie is AMAZING!!! I've watched it 3 times. Best film of 2023! :)"

print("Original:", sample_text)
print("Basic:", preprocess_text_basic(sample_text))
print("Advanced:", preprocess_text_advanced(sample_text))

In [None]:
# TODO: Apply preprocessing to entire dataset

## 4. Feature Extraction

### Task 4: Extract Multiple Types of Features

Implement feature extraction using scikit-learn:

**A. Bag-of-Words (CountVectorizer)**
- Parameters to experiment with: `max_features`, `min_df`, `max_df`

**B. TF-IDF (TfidfVectorizer)**
- Compare with Bag-of-Words

**C. N-grams**
- Unigrams (1-gram)
- Bigrams (2-gram)
- Trigrams (3-gram)
- Combined (1,2)-grams or (1,3)-grams

**Requirements:**
1. Split data into train/test (80/20)
2. Create at least **4 different feature representations**
3. Print shape and sample features for each representation

In [None]:
# TODO: Split data into train/test sets
from sklearn.model_selection import train_test_split


In [None]:
# TODO: Feature Extraction Method 1 - Bag of Words
from sklearn.feature_extraction.text import CountVectorizer


In [None]:
# TODO: Feature Extraction Method 2 - TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer


In [None]:
# TODO: Feature Extraction Method 3 - Bigrams

In [None]:
# TODO: Feature Extraction Method 4 - Combined N-grams

## 5. Model Training and Evaluation

### Task 5: Train Multiple Classifiers

Train and evaluate the following algorithms:

1. **Multinomial Naive Bayes**
2. **Bernoulli Naive Bayes**
3. **Logistic Regression**
4. **Linear SVM (LinearSVC)**
5. **Bonus:** Random Forest or any other classifier

**For each classifier:**
- Train on ALL feature types (BoW, TF-IDF, bigrams, n-grams)
- Calculate metrics: Accuracy, Precision, Recall, F1-score
- Record training time
- Store results in a structured format (dictionary or DataFrame)
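
The per-classifier steps above could look roughly like the sketch below. `train_and_evaluate_sketch` is a hypothetical helper, the training and test sentences are invented, and `pos_label="pos"` assumes string labels as in the NLTK corpus (use `pos_label=1` if you encode labels as 0/1). Only two of the classifiers are shown to keep the example short.

```python
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB


def train_and_evaluate_sketch(model, X_train, X_test, y_train, y_test,
                              model_name, feature_name):
    """Fit one model, time the fit, and return its metrics as a dict."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start      # training time in seconds

    y_pred = model.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="binary", pos_label="pos", zero_division=0
    )
    return {
        "model": model_name,
        "features": feature_name,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "train_time_s": train_time,
    }


# Invented toy data, only to make the sketch runnable.
train_texts = [
    "great acting and a wonderful story",
    "a wonderful and moving film",
    "brilliant direction and great music",
    "boring plot and terrible acting",
    "a dull and tedious film",
    "terrible script and awful pacing",
]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
test_texts = ["great music and a wonderful film", "dull plot and awful acting"]
test_labels = ["pos", "neg"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

candidate_models = {
    "Multinomial NB": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

results = [
    train_and_evaluate_sketch(model, X_train, X_test, train_labels, test_labels,
                              name, "TF-IDF")
    for name, model in candidate_models.items()
]
for row in results:
    print(row)
```

Returning a flat dictionary per run means the collected list can be passed straight to `pd.DataFrame(results)` for the analysis in Task 6.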

In [None]:
# TODO: Import classifiers
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix
import time


In [None]:
# TODO: Create a function to train and evaluate a model

def train_and_evaluate(model, X_train, X_test, y_train, y_test, model_name, feature_name):
    """
    Train a model and return evaluation metrics
    
    Args:
        model: sklearn classifier
        X_train, X_test: feature matrices
        y_train, y_test: labels
        model_name: name of the model (str)
        feature_name: name of feature type (str)
    
    Returns:
        dict: Dictionary containing all metrics
    """
    
    pass

In [None]:
# TODO: Train all models on all feature types

results = []

models = {
    'Multinomial NB': MultinomialNB(),
    'Bernoulli NB': BernoulliNB(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Linear SVM': LinearSVC(max_iter=1000)
}

feature_sets = {
    'Bag-of-Words': (X_train_bow, X_test_bow),
    'TF-IDF': (X_train_tfidf, X_test_tfidf),
    'Bigrams': (X_train_bigram, X_test_bigram),
    'N-grams': (X_train_ngram, X_test_ngram)
}

# Loop through all combinations of models and features
# Use train_and_evaluate function
# Append results to list

# Convert results to DataFrame for easy analysis
# results_df = pd.DataFrame(results)

### Task 6: Results Analysis and Visualization

**Requirements:**
1. Create a comprehensive results table showing all metrics
2. Identify the best model for each feature type
3. Identify the best overall model
4. Create visualizations:
   - Bar chart comparing accuracy across models and features
   - Heatmap of F1-scores (models × features)
   - Training time comparison
5. Display confusion matrix for the best model
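
A possible way to organize the analysis with pandas is sketched below. The numbers in `results_df` are placeholders chosen for illustration, not measured results, and "Model A" / "Model B" are hypothetical names. Your own `results_df` from Task 5 takes their place.

```python
import pandas as pd

# Placeholder rows (NOT real results), shaped like the output of Task 5.
results_df = pd.DataFrame([
    {"model": "Model A", "features": "BoW", "accuracy": 0.50, "f1": 0.40},
    {"model": "Model A", "features": "TF-IDF", "accuracy": 0.60, "f1": 0.55},
    {"model": "Model B", "features": "BoW", "accuracy": 0.70, "f1": 0.65},
    {"model": "Model B", "features": "TF-IDF", "accuracy": 0.65, "f1": 0.60},
])

# Best model for each feature type: row with the highest accuracy per group.
best_per_feature = results_df.loc[
    results_df.groupby("features")["accuracy"].idxmax()
]

# Best overall model: single row with the highest accuracy.
best_overall = results_df.loc[results_df["accuracy"].idxmax()]

# Models x features table of F1-scores, ready for a heatmap.
f1_table = results_df.pivot(index="model", columns="features", values="f1")

print(best_per_feature[["features", "model", "accuracy"]])
print(best_overall)
print(f1_table)
```

The pivoted table can be passed to `sns.heatmap(f1_table, annot=True)` for the heatmap, and the same `pivot` call with `values='accuracy'` followed by `.plot.bar()` gives a grouped bar chart.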

In [None]:
# TODO: Display comprehensive results table


In [None]:
# TODO: Find best models


In [None]:
# TODO: Create visualization 1 - Accuracy comparison


In [None]:
# TODO: Create visualization 2 - F1-score heatmap (models × features)


In [None]:
# TODO: Display confusion matrix for best model

from sklearn.metrics import ConfusionMatrixDisplay

# Retrain best model and display confusion matrix with visualization

## 6. Error Analysis

### Task 7: Analyze Misclassifications

**Requirements:**
1. Find all misclassified examples from the best model
2. Categorize errors:
   - False Positives (predicted positive, actually negative)
   - False Negatives (predicted negative, actually positive)
3. Display at least 10 examples from each category
4. Analyze patterns:
   - Are there common words in misclassified reviews?
   - Is there a relationship between review length and errors?
5. Create visualizations:
   - Length distribution of correct vs incorrect predictions
   - Word cloud of misclassified reviews
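
The error categorization above can be sketched in plain Python as follows. `categorize_errors` and `mean_length` are hypothetical helpers, and the four texts with their labels and predictions are invented. In the notebook you would pass the test-set texts, `y_test`, and the best model's predictions.

```python
from collections import Counter


def categorize_errors(texts, y_true, y_pred, positive="pos"):
    """Split misclassified texts into false positives and false negatives."""
    false_positives, false_negatives = [], []
    for text, true, pred in zip(texts, y_true, y_pred):
        if pred == true:
            continue                              # correct prediction, not an error
        if pred == positive:
            false_positives.append(text)          # predicted positive, actually negative
        else:
            false_negatives.append(text)          # predicted negative, actually positive
    return false_positives, false_negatives


def mean_length(docs):
    """Average word count of a list of texts (0.0 for an empty list)."""
    return sum(len(d.split()) for d in docs) / len(docs) if docs else 0.0


# Invented examples, labels, and predictions for illustration only.
texts = [
    "not bad at all",
    "great idea wasted on a bad script",
    "a solid and enjoyable film",
    "dull from start to finish",
]
y_true = ["pos", "neg", "pos", "neg"]
y_pred = ["neg", "pos", "pos", "neg"]

fp, fn = categorize_errors(texts, y_true, y_pred)
wrong = fp + fn
correct = [t for t, true, pred in zip(texts, y_true, y_pred) if true == pred]

error_words = Counter(word for text in wrong for word in text.split())

print("False positives:", fp)
print("False negatives:", fn)
print("Most common words in errors:", error_words.most_common(3))
print("Mean length (wrong vs correct):", mean_length(wrong), mean_length(correct))
```

The joined error texts, `' '.join(wrong)`, can be passed to `WordCloud().generate(...)` for the word cloud, and the two length lists can be plotted as overlaid histograms.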

In [None]:
# TODO: Get predictions from best model

In [None]:
# TODO: Find and categorize misclassifications


In [None]:
# TODO: Analyze error patterns

In [None]:
# TODO: Visualization - Length distribution

In [None]:
# TODO: Visualization - Word cloud of misclassified reviews
from wordcloud import WordCloud
