# Sentiment Classifier Assignment: Logistic Regression + TF-IDF

## Introduction
In this assignment, you will build a sentiment classification model using one of the fundamental machine learning techniques: Logistic Regression. To represent text data, you will employ the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization method. This assignment will guide you through the complete pipeline, from data loading and preprocessing to model training and evaluation.

**Sentiment analysis** (also known as opinion mining) is a natural language processing (NLP) technique used to determine whether a piece of text expresses a positive, negative, or neutral opinion. It's often used to identify customer sentiment toward products, brands, or services in feedback and online conversations.

## Learning Objectives
Upon successful completion of this assignment, you should be able to:
- Load and prepare a text classification dataset.
- Apply TF-IDF vectorization to convert text into numerical features.
- Split data into training and testing sets.
- Train a Logistic Regression model for binary classification.
- Evaluate the performance of a classification model using various metrics (accuracy, precision, recall, F1-score, confusion matrix).
- Interpret model results and identify areas for improvement.

## Dataset
For this assignment, we will use a sentiment analysis dataset. A good choice is the **IMDb Movie Reviews Dataset**, which contains 50,000 movie reviews labeled as either positive or negative.

**How to get the dataset:**
You can download the dataset from Kaggle (link below). Note that `sklearn.datasets` does not include the IMDb reviews, so a manual download is required for this setup.

**If using a CSV file (recommended for this setup):**
Assume you have a CSV file named `IMDB_Dataset.csv` with at least two columns:
- `review`: The raw text of the movie review.
- `sentiment`: The sentiment label (`positive` or `negative`).

If you don't have it, you can find it on Kaggle: [IMDb Movie Reviews Dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)

**Placeholder Data (if you don't download the file):**
We will provide a small dummy dataset if `IMDB_Dataset.csv` is not found, so the code can still run for demonstration purposes.

In [None]:
import pandas as pd
import numpy as np
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download NLTK data (if not already downloaded).
# nltk.data.find raises LookupError when a resource is missing.
nltk_resources = [
    ('corpora/stopwords', 'stopwords'),
    ('corpora/wordnet', 'wordnet'),
    ('corpora/omw-1.4', 'omw-1.4'),  # Required for WordNetLemmatizer
    ('tokenizers/punkt', 'punkt'),
]
for resource_path, package in nltk_resources:
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(package)

# Load dataset
try:
    df = pd.read_csv('IMDB_Dataset.csv')
    print("Dataset loaded successfully!")
    print(f"Number of reviews: {len(df)}")
    print(df.head())
except FileNotFoundError:
    print("Error: 'IMDB_Dataset.csv' not found. Creating a dummy DataFrame for demonstration.")
    data = {
        'review': [
            "A truly fantastic movie! The acting was superb and the plot was engaging. Highly recommend.",
            "Absolutely terrible film. Boring, confusing, and a waste of time. I hated it.",
            "It was an okay movie, nothing special. Some good parts, some bad.",
            "Loved every minute of it! Great characters and a thrilling story. Five stars!",
            "Worst movie I've seen in years. Poor direction and terrible script. Avoid!",
            "The plot was a bit slow at times, but the ending made it worthwhile. Enjoyed it overall.",
            "Mediocre at best. Didn't live up to the hype at all.",
            "Brilliant! A must-watch for anyone who loves psychological thrillers.",
            "Could not finish it. Too boring and the sound quality was bad.",
            "Surprisingly good! I went in with low expectations and was pleasantly surprised."
        ],
        'sentiment': [
            'positive', 'negative', 'neutral', 'positive', 'negative',
            'positive', 'negative', 'positive', 'negative', 'positive'
        ]
    }
    df = pd.DataFrame(data)
    # For simplicity in this binary classification, let's map 'neutral' to 'negative' if present
    df['sentiment'] = df['sentiment'].replace('neutral', 'negative')
    print("Using dummy data with sentiment mapped to positive/negative.")
    print(df.head())


# Map sentiment to numerical values: positive -> 1, negative -> 0
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
print("\nSentiment mapping to numerical values completed.")
print(df['sentiment'].value_counts())


## Assignment Questions

### Question 1: Text Preprocessing
Before vectorization, raw text needs to be cleaned and normalized. Create a function `preprocess_text(text)` that performs the following steps:
1.  **Convert to Lowercase:** Convert the entire text to lowercase.
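2.  **Remove HTML Tags:** Remove any HTML tags (e.g., `<br />`, `<p>`). Use `re.sub(r'<.*?>', ' ', text)`; replacing each tag with a space (rather than an empty string) keeps the words on either side of a `<br />` from being fused together.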
3.  **Remove Punctuation:** Remove all punctuation marks. Use `str.translate` with `str.maketrans` or `re.sub`.
4.  **Remove Numbers:** Remove all numerical digits.
5.  **Remove Extra Whitespace:** Replace multiple spaces with a single space and strip leading/trailing whitespace.
6.  **Tokenization:** Tokenize the text into individual words.
7.  **Remove Stop Words:** Remove common English stop words using NLTK's `stopwords` corpus.
8.  **Lemmatization:** Apply lemmatization to each token using NLTK's `WordNetLemmatizer`. (Supplying POS tags improves lemmatization. For simplicity, you may skip POS tagging and rely on the default `pos='n'`, which treats every token as a noun.)
9.  **Join Tokens:** Join the processed tokens back into a single string, separated by spaces.

Apply this function to the `review` column of your DataFrame and store the results in a new column named `cleaned_review`.
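As a starting point, steps 1 to 5 above might be sketched with the standard library alone. This is a partial illustration, not a reference solution: the function name `basic_clean` and the sample string are hypothetical, and steps 6 to 9 (tokenization, stop-word removal, lemmatization, joining) are deliberately left out because they depend on the NLTK corpora.

```python
import re
import string


def basic_clean(text: str) -> str:
    """Steps 1-5 only: lowercase, strip HTML tags, punctuation, digits, extra whitespace."""
    text = text.lower()
    # Replace each tag with a space so words around <br /> stay separate.
    text = re.sub(r'<.*?>', ' ', text)
    # Delete every ASCII punctuation character.
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Delete digits.
    text = re.sub(r'\d+', '', text)
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r'\s+', ' ', text).strip()
    return text


sample = "Great movie!<br />Watched it 2 times."  # hypothetical sample review
print(basic_clean(sample))  # great movie watched it times
```

Your full `preprocess_text` would continue from this cleaned string with the NLTK-based steps.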

### Question 2: TF-IDF Vectorization
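TF-IDF stands for Term Frequency-Inverse Document Frequency. It's a numerical statistic that reflects how important a word is to a document in a collection or corpus. A term receives a high TF-IDF weight when it occurs frequently within a document but appears in few documents across the corpus; terms that occur in almost every document are down-weighted.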

1.  Initialize a `TfidfVectorizer` from `sklearn.feature_extraction.text`.
    * Consider setting `max_features` to limit the number of features (e.g., 5000 or 10000) to manage computational complexity, especially for large datasets. This keeps only the top `max_features` terms ordered by term frequency across the corpus (not by TF-IDF score).
    * You might also consider `ngram_range` (e.g., `(1, 2)` for unigrams and bigrams) if you want to capture word combinations, but for this assignment, start with `(1,1)` (unigrams).
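2.  Fit the `TfidfVectorizer` on your `cleaned_review` column and then transform the text data into TF-IDF features. (Fitting on the full dataset keeps this assignment simple, but it lets vocabulary and IDF statistics from the future test reviews influence the features. A stricter workflow splits first and fits the vectorizer on the training set only.)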
3.  Print the shape of the resulting TF-IDF feature matrix (`X`).

### Question 3: Train-Test Split
Split your data into training and testing sets. This is crucial to evaluate your model's performance on unseen data.

1.  Split the TF-IDF features (`X`) and the `sentiment` labels (`y`) into training and testing sets using `train_test_split` from `sklearn.model_selection`.
2.  Use a `test_size` of 0.2 (20% for testing) and a `random_state` for reproducibility (e.g., 42).
3.  Print the shapes of `X_train`, `X_test`, `y_train`, and `y_test`.
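A minimal sketch of the split, using a small stand-in array instead of the real TF-IDF matrix. The data here is invented for illustration, and the `stratify=y` argument is an optional addition (not required by the assignment) that keeps the class proportions similar in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in features and labels: 10 samples, 3 features (illustrative only).
X = np.arange(30).reshape(10, 3)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% of the samples go to the test set
    random_state=42,    # fixed seed so the split is reproducible
    stratify=y,         # optional assumption: preserve class balance
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

`train_test_split` accepts the sparse matrix returned by `TfidfVectorizer` in the same way, so no conversion to a dense array is needed.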

### Question 4: Logistic Regression Model Training
Logistic Regression is a linear model used for binary classification. Despite its name, it's a classification algorithm.

1.  Initialize a `LogisticRegression` model from `sklearn.linear_model`.
    * Set `solver='liblinear'` (well suited to small datasets; it supports both L1 and L2 regularization).
    * Set `random_state=42` for reproducibility.
    * Consider setting `max_iter` to a higher value (e.g., 1000) if you encounter convergence warnings.
2.  Train the model using your training data (`X_train`, `y_train`).
3.  Make predictions on both the training set (`y_train_pred`) and the test set (`y_test_pred`).
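The training steps might look like the following on a tiny, made-up dataset. The four example texts and the variable names `X_toy`, `preds`, and `probs` are assumptions for illustration; in the assignment you would fit on `X_train` and `y_train` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative training set (not the assignment data).
texts = [
    "great fantastic movie",
    "loved great story",
    "terrible boring film",
    "awful boring script",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

X_toy = TfidfVectorizer().fit_transform(texts)

model = LogisticRegression(solver='liblinear', random_state=42, max_iter=1000)
model.fit(X_toy, labels)

preds = model.predict(X_toy)        # hard class labels
probs = model.predict_proba(X_toy)  # one column per entry in model.classes_

print(model.classes_)
print(preds)
print(probs.round(3))
```

`predict_proba` is useful when you want to inspect how confident the model is, or to apply a decision threshold other than 0.5.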

### Question 5: Model Evaluation
Evaluate the performance of your trained model using various classification metrics.

1.  Calculate and print the **Accuracy Score** for both the training and test sets.
2.  Calculate and print the **Precision Score** for the test set.
3.  Calculate and print the **Recall Score** for the test set.
4.  Calculate and print the **F1-Score** for the test set.
5.  Generate and display the **Confusion Matrix** for the test set.
6.  Print the **Classification Report** for the test set (which conveniently provides precision, recall, f1-score, and support for each class).
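To see how the metrics relate to one another, here is a small worked example with hand-written labels. The arrays `y_true` and `y_pred` are invented for illustration; in the assignment they correspond to `y_test` and `y_test_pred`.

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
)

# Hand-written labels: 4 positives and 6 negatives (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

acc = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 7 / 10
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)   = 3 / 5
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)   = 3 / 4
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)

print(f"Accuracy:  {acc:.3f}")
print(f"Precision: {prec:.3f}")
print(f"Recall:    {rec:.3f}")
print(f"F1-score:  {f1:.3f}")
print(cm)
print(classification_report(y_true, y_pred, target_names=['negative', 'positive']))
```

Here the model makes two false positives and one false negative, which is why precision (0.6) is lower than recall (0.75) even though accuracy is 0.7.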

### Question 6: Interpretation and Analysis
Based on the evaluation metrics and your understanding of the process, answer the following questions:

1.  **Training vs. Test Accuracy:** Compare the accuracy on the training set with the accuracy on the test set. What does this comparison tell you about potential overfitting or underfitting?
2.  **Importance of Metrics:** Why is accuracy alone often not sufficient for evaluating classification models, especially with imbalanced datasets? Which metrics (precision, recall, F1-score) are more informative in such cases, and why?
3.  **Confusion Matrix Analysis:** Explain what the values in your confusion matrix represent. Which type of error (false positives or false negatives) do you think is more critical for a movie review sentiment classifier, and why?
4.  **Model Limitations and Improvements:** What are some limitations of using Logistic Regression with TF-IDF for sentiment analysis? Suggest at least two ways you could potentially improve this model (e.g., trying different preprocessing steps, other feature engineering techniques, or different models).
5.  **Most Important Features (Optional/Bonus):** If time permits, try to identify the top 10 most influential words (features) that contribute to a 'positive' sentiment and the top 10 words that contribute to a 'negative' sentiment. (Hint: Look at the `coef_` attribute of the `LogisticRegression` model and the `get_feature_names_out()` method of the `TfidfVectorizer`).

In [None]:
# Bonus: Identify top features (optional)

## Submission Guidelines
- Ensure your notebook runs without errors from top to bottom.
- Save your notebook as `your_name_sentiment_classifier_assignment.ipynb`.
- Clearly answer all questions and provide explanations where requested in Markdown cells.
- Feel free to add additional code cells or markdown cells for clarity or experimentation.