# Hyperparameter-Tuned Logistic Regression for IMDb Sentiment Analysis

Summary of the notebook **`3-Sentiment-Analysis-Scikit-Tuned-Logistic-Regression.ipynb`**

---
This notebook demonstrates a **focused sentiment analysis pipeline** using Scikit-learn, applying **hyperparameter tuning** to optimize a logistic regression model on the IMDb movie review dataset.

1. **Data Acquisition & Preprocessing**

   * Full IMDb dataset (50,000 reviews) downloaded and parsed from raw tar.gz format.
   * Reviews labeled as positive (1) or negative (0) and shuffled.

2. **Data Splitting**

   * Dataset split into training and testing sets (80/20 split).

3. **Pipeline Setup**

   * A Scikit-learn `Pipeline` combining **TF-IDF vectorization** and **Logistic Regression**.

4. **Grid Search for Hyperparameter Optimization**

   * Ran `GridSearchCV` over the TF-IDF feature settings and the logistic regression inverse regularization strength (`C`).
   * Best config: `max_features=10000`, `ngram_range=(1,2)`, `C=1`.

5. **Model Evaluation**

   * **Test Accuracy:** 0.90
   * **ROC AUC Score:** 0.9646
   * Provided full classification report and confusion matrix.

6. **Model Interpretation**

   * Extracted and displayed top positive and negative words by learned coefficients.

7. **User Interaction**

   * Developed an interactive command-line tool for live movie review classification.

---

### 📊 Results Table

| Model                       | Accuracy | File Name                                  | Notes                                     |
| --------------------------- | -------- | ------------------------------------------ | ----------------------------------------- |
| Logistic Regression (Tuned) | 0.9000   | `3-Sentiment-Analysis-Scikit-Tuned-Logistic-Regression.ipynb` | GridSearchCV-tuned; high ROC AUC (0.9646) |

# Mount Google Drive

In [1]:
# # Mount Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

# Import Libraries

In [2]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from sklearn.utils import shuffle
import textwrap

# 1. Load Data

In [3]:
# 1. Load Data
import os
import tarfile
import urllib.request

url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

# Download and extract dataset
if not os.path.exists("aclImdb"):
    urllib.request.urlretrieve(url, "aclImdb_v1.tar.gz")
    with tarfile.open("aclImdb_v1.tar.gz", "r:gz") as tar:
        tar.extractall()

# Function to read reviews
def load_imdb_data(data_dir):
    data = {"review": [], "sentiment": []}
    for label in ["pos", "neg"]:
        sentiment = 1 if label == "pos" else 0
        path = os.path.join(data_dir, label)
        for file in os.listdir(path):
            with open(os.path.join(path, file), encoding="utf-8") as f:
                data["review"].append(f.read())
                data["sentiment"].append(sentiment)
    return pd.DataFrame(data)

train_df = load_imdb_data("aclImdb/train")
test_df = load_imdb_data("aclImdb/test")
df = pd.concat([train_df, test_df])
df = shuffle(df).reset_index(drop=True)

# 2. Train-Test Split

In [4]:
# 2. Train-Test Split
X = df['review']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
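
To make the 80/20 split concrete, here is a minimal pure-Python sketch of a seeded shuffle-and-cut over row indices. The helper `split_indices` is hypothetical and is not how `train_test_split` is implemented internally; it only illustrates why fixing a seed makes the split repeatable and why 50,000 reviews yield 40,000 training and 10,000 test rows.

```python
import random


def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle indices 0..n_samples-1 and cut them into (train, test) lists.

    Hypothetical helper for illustration only; not Scikit-learn's algorithm.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(50_000)
print(len(train_idx), len(test_idx))  # 40000 10000
```

Because the seed is fixed, calling `split_indices(50_000)` again returns exactly the same partition, which is the role `random_state=42` plays in the cell above.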

# 3. Pipeline Creation

### **Brief Explanation of Each Step**

* **`Pipeline([...])`**
  *   Combines multiple preprocessing and modeling steps into a single, streamlined workflow. Ensures consistency and simplifies training and prediction.

* **`('tfidf', TfidfVectorizer())`**
  *   Handles both tokenization and vectorization internally. Converts raw text into numerical features using **Term Frequency–Inverse Document Frequency (TF-IDF)**, which reflects the importance of words relative to the entire corpus.

* **`('clf', LogisticRegression(solver='liblinear'))`**
  * Applies **Logistic Regression** as the classification algorithm.

  * `solver='liblinear'` is suitable for smaller datasets and supports **L1/L2 regularization**, helping to prevent overfitting.


In [None]:
# 3. Pipeline Creation
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),                       # does both tokenizing and vectorizing internally
    ('clf', LogisticRegression(solver='liblinear'))
])
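
To show roughly what the `tfidf` step computes, here is a toy pure-Python sketch. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each document vector, which is how the Scikit-learn documentation describes the `TfidfVectorizer` defaults. The whitespace tokenizer and the helper names `ngrams` and `tfidf` are simplifications introduced here, not the library's code.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams (joined by spaces) for n_min <= n <= n_max."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


def tfidf(docs, n_min=1, n_max=2):
    """Toy TF-IDF: raw counts times smoothed IDF, L2-normalized per document."""
    term_lists = [ngrams(d.lower().split(), n_min, n_max) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(t for terms in term_lists for t in set(terms))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for terms in term_lists:
        weights = {t: tf * idf[t] for t, tf in Counter(terms).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = ["a great movie", "a boring movie", "great acting"]
vecs = tfidf(docs)
# "boring" occurs in one document, "movie" in two, so "boring" gets more weight.
print(vecs[1]["boring"] > vecs[1]["movie"])  # True
```

With `n_max=2` the vocabulary also contains bigrams such as `"great movie"`, which is what `ngram_range=(1, 2)` enables in the real vectorizer.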

# 4. Hyperparameter Tuning

### **Brief Explanation of Each Step**

* **`grid_params`**
  * Defines a set of hyperparameter values to search over for both TF-IDF and Logistic Regression.

* **`GridSearchCV(...)`**
  * Performs cross-validated grid search to find the best combination of hyperparameters.

* **`gs.fit(X_train, y_train)`**
  * Trains models for all parameter combinations and selects the best based on validation accuracy.

* **`gs.best_params_ / gs.best_score_`**
  * Outputs the best parameters and corresponding cross-validation accuracy.

In [6]:
# 4. Hyperparameter Tuning
grid_params = {
    'tfidf__max_features': [5000, 10000],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'clf__C': [0.1, 1, 10]
}

gs = GridSearchCV(pipeline, grid_params, cv=3, n_jobs=-1, verbose=1)
gs.fit(X_train, y_train)

print("Best Parameters:", gs.best_params_)
print("Best CV Accuracy:", gs.best_score_)

Fitting 3 folds for each of 12 candidates, totalling 36 fits
Best Parameters: {'clf__C': 1, 'tfidf__max_features': 10000, 'tfidf__ngram_range': (1, 2)}
Best CV Accuracy: 0.8939499923676283


### 🔍 Line-by-Line Explanation:

#### **`Fitting 3 folds for each of 12 candidates, totalling 36 fits`**

* You're using **3-fold cross-validation** (i.e., the training set is split into 3 parts, and each is used once as a validation set while the others are used for training).
* You have **12 hyperparameter combinations** (candidates) to test.
* Therefore, **36 model fits** (12 combinations × 3 folds) are performed.
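
The arithmetic above can be checked by enumerating the grid directly. This is a small stdlib sketch that mirrors the `grid_params` dictionary from the notebook; the variable names `candidates`, `cv_folds`, and `total_fits` are introduced here for illustration.

```python
from itertools import product

grid_params = {
    'tfidf__max_features': [5000, 10000],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1, 10],
}
cv_folds = 3

# Every combination of one value per hyperparameter is one candidate.
candidates = [dict(zip(grid_params, values))
              for values in product(*grid_params.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", total_fits, "fits")  # 12 candidates, 36 fits
```

With the default `refit=True`, `GridSearchCV` then refits the winning configuration once more on the full training set; that final fit is not part of the 36.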

---

#### **`Best Parameters: {'clf__C': 1, 'tfidf__max_features': 10000, 'tfidf__ngram_range': (1, 2)}`**

* These are the best hyperparameters found:

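  * `clf__C: 1`: The **inverse** of the regularization strength for the Logistic Regression classifier (smaller `C` means stronger regularization). `C=1` is the middle value of the grid and Scikit-learn's default, balancing bias and variance.
  * `tfidf__max_features: 10000`: The TF-IDF vectorizer keeps only the 10,000 terms with the highest total frequency across the training corpus.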
  * `tfidf__ngram_range: (1, 2)`: Both unigrams and bigrams (single words and two-word phrases) are included in the features.
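
To see why `C` acts as an inverse regularization strength, it helps to write out the objective. The following is a sketch of the standard L2-penalized logistic regression objective in the form used by LIBLINEAR, assuming the default `penalty='l2'` and labels recoded as $y_i \in \{-1, +1\}$; the symbols $w$, $b$, $x_i$, and $n$ are notation introduced here.

$$
\min_{w,\,b}\; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i\,(w^{\top} x_i + b)\right)\right)
$$

`C` multiplies the data-fit term, so a large `C` lets the model fit the training reviews more closely (weaker regularization), while a small `C` lets the penalty $\tfrac{1}{2} w^{\top} w$ dominate and shrinks the coefficients.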

---

#### **`Best CV Accuracy: 0.8939499923676283`**

* The **cross-validated accuracy score** of the model using the best parameters above is approximately **89.4%**.
* This score is computed only on the training data, split into 3 folds, so the test set plays no part in it. It estimates performance on unseen data, though it can be slightly optimistic because the same folds were used to pick the winning configuration.
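
A minimal sketch of how 3-fold cross-validation partitions the 40,000 training reviews, using contiguous, unshuffled folds. The generator `kfold_indices` is a hypothetical helper; according to the Scikit-learn documentation, `GridSearchCV` with an integer `cv` and a classifier uses stratified folds, which this sketch does not reproduce.

```python
def kfold_indices(n_samples, n_splits=3):
    """Yield (train, validation) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


fold_sizes = [len(val) for _, val in kfold_indices(40_000)]
print(fold_sizes)  # [13334, 13333, 13333]
```

Each review lands in the validation fold exactly once, so every candidate is scored on all 40,000 training reviews, just never on the ones it was fitted on in that round.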

---

### ✅ Summary:

This result tells you that after testing 12 different combinations of parameters for your sentiment classification pipeline:

* The best model includes both unigrams and bigrams, limits features to 10,000, and uses `C=1` (the inverse regularization strength).
* It achieves about **89.4% accuracy** in cross-validation, which suggests it should generalize well to unseen reviews.

# 5. Evaluate on Test Set

### 🔹 **Brief Explanation of Each Step**

* **`best_model = gs.best_estimator_`**
  * Retrieves the best model found during grid search.

* **`predict(X_test)` / `predict_proba(X_test)`**
  * Makes predictions and estimates class probabilities on the test set.

* **`accuracy_score(...)`**
  * Measures overall test accuracy.

* **`confusion_matrix(...)`**
  * Shows counts of true/false positives and negatives.

* **`classification_report(...)`**
  * Displays precision, recall, F1-score, and support for each class.

* **`roc_auc_score(...)`**
  * Evaluates model's ability to distinguish between classes (higher = better).

In [7]:
# 5. Evaluate on Test Set
best_model = gs.best_estimator_
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

print("\nBest Model Test Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nROC AUC Score:", roc_auc_score(y_test, y_proba))


Best Model Test Accuracy: 0.9

Confusion Matrix:
 [[4469  531]
 [ 469 4531]]

Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.89      0.90      5000
           1       0.90      0.91      0.90      5000

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000


ROC AUC Score: 0.96458336


# 6. Evaluation of the **best-performing sentiment analysis model** on the **test dataset** of 10,000 IMDb reviews

### ✅ **1. Best Model Test Accuracy: `0.90`**

* The model correctly classified **90% of all test reviews** (both positive and negative).
* Out of 10,000 test samples, **9,000 were predicted correctly**, and **1,000 were misclassified**.

### 📉 **2. Confusion Matrix:**

```
[[4469  531]
 [ 469 4531]]
```

|                         | Predicted Negative (0) | Predicted Positive (1) |
| ----------------------- | ---------------------- | ---------------------- |
| **Actual Negative (0)** | 4469 (True Negative)   | 531 (False Positive)   |
| **Actual Positive (1)** | 469 (False Negative)   | 4531 (True Positive)   |

#### Interpretation:

* **4469 reviews** were correctly identified as negative.
* **4531 reviews** were correctly identified as positive.
* **531 negative reviews** were wrongly classified as positive.
* **469 positive reviews** were wrongly classified as negative.

### 📊 **3. Classification Report:**

| Class                | Precision | Recall | F1-Score | Support |
| -------------------- | --------- | ------ | -------- | ------- |
| 0 (Negative)         | 0.91      | 0.89   | 0.90     | 5000    |
| 1 (Positive)         | 0.90      | 0.91   | 0.90     | 5000    |
| **Overall Accuracy** |           |        | **0.90** | 10000   |

#### Metrics:

* **Precision**: Of all reviews predicted as a class, how many were correct.
* **Recall**: Of all actual reviews of a class, how many were captured.
* **F1-Score**: The harmonic mean of precision and recall — a balanced measure.
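
The figures in the classification report can be recomputed by hand from the four confusion-matrix counts. The sketch below uses only those counts; the helper `prf` and its argument names are introduced here for illustration.

```python
# Counts copied from the confusion matrix (rows = actual, columns = predicted).
tn, fp = 4469, 531
fn, tp = 469, 4531


def prf(true_hits, false_alarms, misses):
    """Return (precision, recall, f1) for one class."""
    precision = true_hits / (true_hits + false_alarms)
    recall = true_hits / (true_hits + misses)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = (tp + tn) / (tp + tn + fp + fn)
pos = prf(tp, fp, fn)  # class 1: hits are true positives
neg = prf(tn, fn, fp)  # class 0: hits are true negatives

print("accuracy:", accuracy)
print("class 1:", [round(v, 2) for v in pos])
print("class 0:", [round(v, 2) for v in neg])
```

Rounded to two decimals, these reproduce the report: 0.90 / 0.91 / 0.90 for class 1 and 0.91 / 0.89 / 0.90 for class 0.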

##### Balanced performance:

* The model performs **similarly well** on both positive and negative reviews, with F1-scores of **0.90**.
* There's **no significant bias** toward either class, which is ideal.


### 📈 **4. ROC AUC Score: `0.9646`**

* **ROC AUC** (Receiver Operating Characteristic - Area Under Curve) measures the model's ability to distinguish between the classes.
* A value of **0.9646** is **excellent** (closer to 1 means better separability).

#### What it means:

* The model has a **96.5% chance of ranking a randomly chosen positive review higher than a negative one**.
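* This shows that the predicted probabilities rank reviews well across all decision thresholds, not just at the default 0.5 cutoff. It does not by itself show that the probabilities are well calibrated.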
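
The ranking interpretation can be turned into code directly: compare every positive score with every negative score and count how often the positive one is higher. This is a brute-force sketch on made-up scores, not the notebook's predictions, and `pairwise_auc` is a hypothetical helper rather than the routine `roc_auc_score` uses.

```python
from itertools import product


def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up predicted probabilities of the positive class.
pos_scores = [0.9, 0.8, 0.4]  # reviews that are actually positive
neg_scores = [0.7, 0.3, 0.2]  # reviews that are actually negative

auc = pairwise_auc(pos_scores, neg_scores)
print(round(auc, 3))  # 0.889 (8 of 9 pairs ranked correctly)
```

Only the pair (0.4, 0.7) is ordered wrongly, so the toy AUC is 8/9. Rescaling all scores by the same monotonic function would leave it unchanged, which is why AUC measures ranking rather than calibration.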

### 🧠 Summary:

* **90% accuracy** on test data means the model generalizes well.
* **Balanced precision and recall** across both classes.
* **High ROC AUC** shows strong discriminative ability.
* Roughly 9 to 11% of each class was misclassified (531 of 5,000 negative and 469 of 5,000 positive reviews), so overall this is a **very effective sentiment classifier**.

# 7. View Top Words by Coefficient

### 🔹 **Brief Explanation of Each Step**

* **`show_top_features(...)`**
  * Displays the words with the strongest influence on the model’s predictions.

* **`vectorizer.get_feature_names_out()`**
  * Retrieves the vocabulary (all terms used in the model; here, unigrams and bigrams).

* **`classifier.coef_`**
  * Contains **model coefficients**, which indicate how much each word contributes to predicting the positive or negative class:

  * **Positive coefficients** = push prediction toward the positive class
  * **Negative coefficients** = push prediction toward the negative class

* **`np.argsort(...)`**
  * Sorts the coefficients to find the top positively and negatively weighted words.
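
The sign rule for coefficients follows from how logistic regression turns a weighted sum into a probability. The sketch below uses hypothetical weights and feature values (not the fitted coefficients) to show that a term with a positive weight raises the predicted probability of the positive class and a term with a negative weight lowers it.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for illustration; the real ones live in classifier.coef_.
weights = {"great": 2.0, "boring": -2.5}
bias = 0.0


def positive_probability(features):
    """P(positive) for a dict mapping term -> TF-IDF value."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    return sigmoid(z)


print(positive_probability({}))               # 0.5: no evidence either way
print(positive_probability({"great": 0.5}))   # above 0.5
print(positive_probability({"boring": 0.5}))  # below 0.5
```

Sorting the coefficients, as `np.argsort` does, therefore surfaces the terms that move this probability the most in each direction.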


In [None]:
# 7. View Top Words by Coefficient
def show_top_features(vectorizer, classifier, n=20):
    feature_names = np.array(vectorizer.get_feature_names_out())
    coef = classifier.coef_.flatten()
    top_pos = np.argsort(coef)[-n:]
    top_neg = np.argsort(coef)[:n]

    print("\nTop Positive Words:")
    print(feature_names[top_pos][::-1])
    print("\nTop Negative Words:")
    print(feature_names[top_neg])

show_top_features(best_model.named_steps['tfidf'], best_model.named_steps['clf'])


Top Positive Words:
['great' 'excellent' 'amazing' 'perfect' 'wonderful' 'today' 'fun' 'loved'
 'brilliant' 'hilarious' 'best' 'superb' 'the best' 'definitely'
 'enjoyable' 'especially' 'bit' 'fantastic' 'favorite' 'enjoyed']

Top Negative Words:
['worst' 'bad' 'awful' 'boring' 'the worst' 'poor' 'waste' 'terrible'
 'nothing' 'worse' 'dull' 'horrible' 'stupid' 'poorly' 'disappointing'
 'unfortunately' 'lame' 'annoying' 'disappointment' 'fails']


# 8. Interactive Sentiment Prediction Function

### 🔹 **Brief Explanation of Each Step**

* **`predict_sentiment_interactive(...)`**
  * Provides a loop for users to enter reviews and receive real-time sentiment predictions.

* **`pipeline.predict(...)`**
  * Uses the trained model to classify the sentiment (positive or negative).

* **`pipeline.predict_proba(...)`**
  * Returns class probabilities to estimate **confidence** in the prediction.

* **`textwrap.fill(...)`**
  * Neatly formats long reviews for easier reading in the console or notebook.

* **Interactive Loop**
  * Continues until the user types `'exit'`.

In [None]:
def predict_sentiment_interactive(pipeline, width=100):
    while True:
        review_text = input("\nEnter a movie review (or type 'exit' to quit): ")
        if review_text.strip().lower() == 'exit':
            print("Exiting sentiment analysis. Goodbye!")
            break

        prediction = pipeline.predict([review_text])[0]
        probability = pipeline.predict_proba([review_text])[0]

        sentiment = "Positive 😊" if prediction == 1 else "Negative 😞"
        confidence = round(max(probability) * 100, 2)

        # Wrap the text for display in notebook
        wrapped_review = textwrap.fill(review_text, width=width)

        print("\n📝 Review:")
        print(wrapped_review)
        print(f"\n✅ Sentiment: {sentiment}")
        print(f"📊 Confidence: {confidence}%")
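
For scripted checks it can be handy to have the same logic without the `input()` loop. The sketch below is a hypothetical non-interactive helper, `classify_review`, that works with any object exposing `predict_proba`. It assumes the class order is `[0, 1]`, as in this notebook; `StubModel` is a stand-in used only so the example runs without a trained pipeline.

```python
def classify_review(model, review_text):
    """Return (label, confidence_percent) for a single review.

    Assumes predict_proba returns [P(negative), P(positive)] for each input.
    """
    probabilities = list(model.predict_proba([review_text])[0])
    label = max(range(len(probabilities)), key=probabilities.__getitem__)
    confidence = round(probabilities[label] * 100, 2)
    return label, confidence


class StubModel:
    """Stand-in for a fitted pipeline; returns fixed probabilities."""

    def predict_proba(self, texts):
        return [[0.2, 0.8] if "good" in t.lower() else [0.7, 0.3] for t in texts]


print(classify_review(StubModel(), "Good film"))  # (1, 80.0)
print(classify_review(StubModel(), "Meh"))        # (0, 70.0)
```

In the notebook, the same helper could be called as `classify_review(best_model, "some review text")`. The confidence is simply the larger of the two class probabilities, exactly as in the interactive function.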

# 9. Interactive tool

### 🔹 **Brief Explanation of This Step**

* **`predict_sentiment_interactive(best_model)`**
  * Launches the interactive tool, allowing users to enter custom movie reviews and get real-time sentiment predictions with confidence scores.

In [None]:
# 🔍 Run it
predict_sentiment_interactive(best_model)


📝 Review:
HI there!

✅ Sentiment: Negative 😞
📊 Confidence: 67.32%

📝 Review:
What a fantastic mobie!

✅ Sentiment: Positive 😊
📊 Confidence: 96.07%

📝 Review:
What a fantastic movie! Are you kidding? I will not recommend it

✅ Sentiment: Positive 😊
📊 Confidence: 72.06%
