## Kaggle IMDB Dataset of 50k Reviews
## Source: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

### Libraries Imported:

- **`pandas`**: Used for data loading, manipulation, and analysis, especially in tabular form (DataFrames).

- **`re`**: A module for working with regular expressions to clean and transform text (e.g., removing special characters).

- **`nltk`**: The Natural Language Toolkit, used for processing human language data. Includes tokenization, stopword removal, and lemmatization.

- **`stopwords`** (from `nltk.corpus`): A list of common English words (like "is", "and", "the") that are typically removed to reduce noise in text data.

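- **`WordNetLemmatizer`** (from `nltk.stem`): Converts words to their base/dictionary form, helping standardize textual input. With its default noun part of speech it maps, e.g., "movies" to "movie"; verb forms such as "running" become "run" only when `pos='v'` is passed.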

- **`TfidfVectorizer`** (from `sklearn.feature_extraction.text`): Converts text data into numerical features using Term Frequency-Inverse Document Frequency (TF-IDF), useful for machine learning.

- **`train_test_split`** (from `sklearn.model_selection`): Splits the dataset into training and testing sets to evaluate model generalization.

- **`LogisticRegression`** (from `sklearn.linear_model`): A popular supervised learning algorithm for binary classification problems like sentiment analysis.

- **`MultinomialNB`** (from `sklearn.naive_bayes`): A Naive Bayes classifier designed for discrete count features, commonly used in text classification.

- **`classification_report`, `accuracy_score`, `precision_score`, `recall_score`, `f1_score`** (from `sklearn.metrics`): Evaluation metrics used to assess how well the classification model performs on the test set.


In [1]:
# pandas is used for loading, manipulating, and analyzing structured data
import pandas as pd

# re (Regular Expressions) is used for pattern matching and text cleaning
import re

# nltk (Natural Language Toolkit) is a powerful Python library for NLP tasks
import nltk

# Import the list of common English stopwords (e.g., 'the', 'and', 'is') to be removed from text
from nltk.corpus import stopwords

# WordNetLemmatizer reduces words to their dictionary form (e.g., "movies" → "movie" with the default noun part of speech)
from nltk.stem import WordNetLemmatizer

# TfidfVectorizer converts raw text into numerical features based on Term Frequency-Inverse Document Frequency
from sklearn.feature_extraction.text import TfidfVectorizer

# train_test_split splits data into training and testing sets for model evaluation
from sklearn.model_selection import train_test_split

# LogisticRegression is a simple yet powerful linear model for binary classification
from sklearn.linear_model import LogisticRegression

# MultinomialNB is a Naive Bayes classifier typically used for text classification tasks
from sklearn.naive_bayes import MultinomialNB

# Classification metrics to evaluate model performance using accuracy, precision, recall, and F1-score
from sklearn.metrics import (
    classification_report,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)


## NLTK Resource Downloads

The following code downloads essential NLTK resources required for natural language processing tasks:

- `stopwords`: A list of common words in various languages that are typically filtered out during text preprocessing.
- `wordnet`: A large lexical database of English, used for lemmatization and semantic analysis.
- `omw-1.4`: Open Multilingual WordNet, provides translations and links to WordNet in multiple languages.

In [2]:
# Download the list of stopwords (common words like 'the', 'is', etc. that are usually removed in text processing)
nltk.download('stopwords')

# Download the WordNet lexical database (used for lemmatization, synonym extraction, etc.)
nltk.download('wordnet')

# Download Open Multilingual Wordnet (helps WordNet support multiple languages)
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ak\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Ak\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Ak\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## Loading and Previewing the IMDB Dataset

We use `pandas` to load the IMDB reviews dataset stored in a CSV file and preview the first few entries:
1. **Reading CSV File**: We use `pd.read_csv()` to load the dataset from a specified file path.
2. **Viewing Data**: The `.head(10)` method returns the first 10 rows of the DataFrame, which gives a quick overview of the columns (`review`, `sentiment`) and the kind of text they contain.

In [3]:
# Load the dataset from the specified file path into a DataFrame
df = pd.read_csv(r"G:\Other computers\My Laptop\Education and Bootcamp\Internship\Developers Hub Internship\Task 2 Text Sentiment Analysis\Task_2_IMDB Dataset.csv")

# Display the first 10 rows of the DataFrame to get a quick overview
print(df.head(10))

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
5  Probably my all-time favorite movie, a story o...  positive
6  I sure would like to see a resurrection of a u...  positive
7  This show was an amazing, fresh & innovative i...  negative
8  Encouraged by the positive comments about this...  negative
9  If you like original gut wrenching laughter yo...  positive


## Text Preprocessing Function

This function performs several standard preprocessing steps on input text to prepare it for natural language processing tasks such as sentiment analysis or topic modeling.
### Preprocessing Steps:

- **Lowercasing**: Ensures uniformity by converting all characters to lowercase.
- **HTML Removal**: Cleans web-based datasets by removing HTML tags.
- **Character Filtering**: Removes symbols, punctuation, and any non-alphanumeric characters.
- **Tokenization**: Breaks the text into individual words (tokens).
- **Stopword Removal**: Filters out common, uninformative words (e.g., "the", "and", "is").
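- **Lemmatization**: Converts words to their base or dictionary form. The lemmatizer is called with its default noun part of speech, so plurals are reduced (e.g., "movies" → "movie") while verb forms such as "running" are left unchanged.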
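
The first five steps above can be traced on a single string using only the standard library. This is a minimal sketch, not the notebook's function: the sample review and the three-word `tiny_stops` set are hypothetical stand-ins (NLTK's English stopword list is much longer), and lemmatization is left out because it needs the WordNet data.

```python
import re

# Hypothetical raw review and a tiny stand-in stopword set (not NLTK's full list)
raw = "A <b>wonderful</b> little production.<br /><br />The actors were GREAT!"
tiny_stops = {"a", "the", "were"}

lowered = raw.lower()                            # lowercasing
no_html = re.sub(r"<.*?>", " ", lowered)         # HTML tags such as <br /> become spaces
letters = re.sub(r"[^a-z0-9\s]", " ", no_html)   # punctuation becomes spaces
tokens = letters.split()                         # whitespace tokenization
kept = [t for t in tokens if t not in tiny_stops]  # stopword removal

print(kept)
# ['wonderful', 'little', 'production', 'actors', 'great']
```

Replacing tags and punctuation with a space rather than an empty string keeps words on either side of a `<br />` from being glued together.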


In [4]:
# Build the stopword set and the lemmatizer once, so they are not recreated for every review
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

# Define a function for text preprocessing
def preprocess_text(text):
    # Convert all characters to lowercase to ensure uniformity
    text = text.lower()

    # Remove HTML tags using regex
    text = re.sub(r'<.*?>', ' ', text)

    # Replace any character that is not a lowercase letter, digit, or whitespace with a space
    text = re.sub(r'[^a-z0-9\s]', ' ', text)

    # Split the text into individual words (tokens)
    tokens = text.split()

    # Remove stopwords (e.g., 'the', 'and', 'is') from the list of tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]

    # Lemmatize each token; the default part of speech is noun (e.g., 'movies' → 'movie')
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]

    # Join the cleaned tokens back into a single string
    return ' '.join(tokens)


## Applying Text Preprocessing to the Dataset

We apply the `preprocess_text` function to every entry in the `review` column of the DataFrame and store the results in a new column called `clean_review`.
### Explanation:
- `df['review']`: Accesses the column containing raw review text.
- `.apply(preprocess_text)`: Applies the custom preprocessing function to each review.
- `df['clean_review']`: Stores the cleaned and normalized version of the text.



In [5]:
# Apply the text preprocessing function to the 'review' column
# This creates a new column 'clean_review' with the cleaned text
df['clean_review'] = df['review'].apply(preprocess_text)

## Encoding Sentiment Labels

We convert the sentiment labels from text to numeric values to make them suitable for machine learning models.
### Explanation:
- `df['sentiment']`: Accesses the original sentiment column containing `'positive'` or `'negative'` strings.
- `.map({'positive': 1, 'negative': 0})`: Maps each sentiment to a corresponding integer (`1` for positive, `0` for negative).
- `df['label']`: A new column storing the encoded numeric labels.
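
One thing worth checking after a `.map()` call is that every label was actually covered by the dictionary, because values without an entry become `NaN` instead of raising an error. The sketch below uses a hypothetical three-row frame named `toy` (not the IMDB data) with one deliberately mis-capitalised label to show the effect.

```python
import pandas as pd

# Hypothetical toy frame; the third label is capitalised on purpose
toy = pd.DataFrame({'sentiment': ['positive', 'negative', 'Positive']})
toy['label'] = toy['sentiment'].map({'positive': 1, 'negative': 0})

# Count labels that had no entry in the mapping
n_unmapped = int(toy['label'].isna().sum())
print(n_unmapped)  # 1: 'Positive' does not match the key 'positive'
```

Running the same `isna().sum()` check on `df['label']` would confirm that the real dataset contains only the two expected strings.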


In [6]:
# Encode the sentiment labels: 'positive' becomes 1, 'negative' becomes 0
df['label'] = df['sentiment'].map({'positive': 1, 'negative': 0})

## Splitting the Data into Training and Testing Sets

We use `train_test_split` from `scikit-learn` to divide the dataset into training and testing subsets for model evaluation.
### Explanation:
- `df['clean_review']`: Input features (preprocessed text).
- `df['label']`: Target labels (binary sentiment).
- `test_size=0.2`: Reserves 20% of the data for testing.
- `random_state=42`: Sets a seed for reproducibility.
- `stratify=df['label']`: Ensures class distribution is preserved in both training and test sets.
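
To make the effect of `stratify` concrete, here is a minimal standard-library sketch of a stratified split. It is an illustration of the idea only, not scikit-learn's implementation; the helper name `stratified_split` and the toy label list are assumptions made for this example.

```python
import random
from collections import Counter

def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so that each class contributes the same test fraction."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within the class
        n_test = round(len(indices) * test_size)   # same fraction per class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical balanced toy labels: 50 positive, 50 negative
labels = [1] * 50 + [0] * 50
train_idx, test_idx = stratified_split(labels)

test_counts = Counter(labels[i] for i in test_idx)
print(test_counts[1], test_counts[0])  # 10 10
```

Because the split is done per class, the 50/50 balance of the full list carries over to both subsets, which matches the 5000/5000 support seen in the classification reports.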


In [7]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df['clean_review'],     # Features (preprocessed reviews)
    df['label'],            # Target labels (0 or 1)
    test_size=0.2,          # 20% of the data will be used for testing
    random_state=42,        # Ensures reproducibility of the split
    stratify=df['label']    # Keeps the same proportion of classes in both sets
    )

## Feature Engineering with TF-IDF

We use `TfidfVectorizer` to convert text data into numerical feature vectors based on Term Frequency–Inverse Document Frequency (TF-IDF), which reflects the importance of words in documents.
### Explanation:
- `TfidfVectorizer(max_features=10000)`: Limits the vocabulary to the 10,000 terms with the highest term frequency across the training corpus.
- `fit_transform(X_train)`: Learns the vocabulary and transforms the training data into a TF-IDF matrix.
- `transform(X_test)`: Applies the same vocabulary to transform the test data.
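
The weighting itself can be reproduced by hand. The sketch below assumes `TfidfVectorizer`'s default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalisation per document) and applies them to a hypothetical two-document corpus; the function name `tfidf` is made up for this example.

```python
import math
from collections import Counter

docs = ["great movie great cast", "boring movie"]  # hypothetical toy corpus
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    # Raw count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in counts.items()}
    # L2-normalise so every document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the first document "great" gets the largest weight (it occurs twice and only in that document), while "movie" gets the smallest because it appears in every document.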


In [8]:
# Initialize the vectorizer with a maximum of 10,000 features
vectorizer = TfidfVectorizer(max_features=10000)

# Fit the vectorizer on the training data and transform it into TF-IDF features
X_train_tfidf = vectorizer.fit_transform(X_train)

# Transform the test data using the already-fitted vectorizer
X_test_tfidf = vectorizer.transform(X_test)

## Defining Classification Models

We define a dictionary of machine learning models to compare their performance on the text classification task.
### Explanation:
- `'Logistic Regression'`: Uses `LogisticRegression` with `max_iter=1000`, giving the solver more iterations than the default of 100 so that it is more likely to converge.
- `'Multinomial Naive Bayes'`: A probabilistic classifier designed for discrete features such as word counts; fractional TF-IDF weights, as used here, also tend to work in practice.
- The models are stored in a dictionary for easy iteration during training and evaluation.


In [9]:
# Define a dictionary of models to evaluate
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),  # Logistic Regression with increased iterations
    'Multinomial Naive Bayes': MultinomialNB()                 # Naive Bayes classifier suitable for text data
}

## Model Training, Prediction, and Evaluation

We iterate over the defined models, train each one, make predictions, and evaluate their performance using standard classification metrics.
### Explanation:
- `clf.fit(...)`: Trains the model on TF-IDF features.
- `clf.predict(...)`: Predicts labels for the test set.
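- `accuracy_score`, `precision_score`, `recall_score`, `f1_score`: Measure the quality of predictions. With their default settings, precision, recall, and F1 are computed for the positive class (label `1`).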
- `classification_report`: Provides detailed performance metrics for each class.
- `results[name]`: Stores all metrics in a dictionary for later comparison.
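
As a reminder of what these four numbers mean, the following sketch computes them from true/false positive counts without scikit-learn. The helper name `binary_metrics` and the eight toy labels are hypothetical; label `1` is treated as the positive class, as in the notebook.

```python
def binary_metrics(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0   # share of predicted positives that are right
    recall = tp / (tp + fn) if tp + fn else 0.0      # share of actual positives that are found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean of the two
    return accuracy, precision, recall, f1

# Hypothetical toy labels: 3 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

acc, prec, rec, f1 = binary_metrics(y_true, y_pred)
print(acc, prec, rec, f1)  # 0.75 0.75 0.75 0.75
```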


In [10]:
# Initialize a dictionary to store evaluation results
results = {}

# Loop through each model in the dictionary
for name, clf in models.items():
    print(f"--- {name} ---")  # Print the model name

    # Train the model on the training data
    clf.fit(X_train_tfidf, y_train)

    # Predict labels for the test data
    y_pred = clf.predict(X_test_tfidf)

    # Calculate evaluation metrics
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    # Store the results in the dictionary
    results[name] = {
        'Accuracy': acc,
        'Precision': prec,
        'Recall': rec,
        'F1-score': f1
    }

    # Print the evaluation metrics
    print(f"Accuracy : {acc:.4f}")
    print(f"Precision: {prec:.4f}")
    print(f"Recall   : {rec:.4f}")
    print(f"F1-score : {f1:.4f}")

    # Print a detailed classification report
    print("\nClassification Report:\n")
    print(classification_report(y_test, y_pred, target_names=['negative', 'positive']))

--- Logistic Regression ---
Accuracy : 0.8941
Precision: 0.8861
Recall   : 0.9044
F1-score : 0.8952

Classification Report:

              precision    recall  f1-score   support

    negative       0.90      0.88      0.89      5000
    positive       0.89      0.90      0.90      5000

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000

--- Multinomial Naive Bayes ---
Accuracy : 0.8584
Precision: 0.8568
Recall   : 0.8606
F1-score : 0.8587

Classification Report:

              precision    recall  f1-score   support

    negative       0.86      0.86      0.86      5000
    positive       0.86      0.86      0.86      5000

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000



## Summary Comparison of Model Performance

This section prints a summary of the evaluation metrics for each model, allowing quick comparison across different classifiers.
### Explanation:
- `results.items()`: Iterates through each model and its associated performance metrics.
- `metrics['Accuracy']`, `['Precision']`, `['Recall']`, `['F1-score']`: Access specific evaluation metrics.
- `:.4f`: Formats each metric to 4 decimal places for consistent and readable output.


In [11]:
# Print a header for model comparison
print("\n=== Model Comparison ===")

# Loop through the evaluation results for each model
for name, metrics in results.items():
    # Print the model's name along with its accuracy, precision, recall, and F1-score
    print(f"{name}: Accuracy={metrics['Accuracy']:.4f}, "
          f"Precision={metrics['Precision']:.4f}, "
          f"Recall={metrics['Recall']:.4f}, "
          f"F1-score={metrics['F1-score']:.4f}")



=== Model Comparison ===
Logistic Regression: Accuracy=0.8941, Precision=0.8861, Recall=0.9044, F1-score=0.8952
Multinomial Naive Bayes: Accuracy=0.8584, Precision=0.8568, Recall=0.8606, F1-score=0.8587
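
If the comparison needs to feed into later code, the best model can be picked programmatically. The sketch below copies the figures printed above into a standalone `results` dictionary so that it runs on its own; choosing F1-score as the selection criterion is an assumption, and any of the four metrics could be used instead.

```python
# Metrics copied from the comparison output above
results = {
    'Logistic Regression': {
        'Accuracy': 0.8941, 'Precision': 0.8861, 'Recall': 0.9044, 'F1-score': 0.8952},
    'Multinomial Naive Bayes': {
        'Accuracy': 0.8584, 'Precision': 0.8568, 'Recall': 0.8606, 'F1-score': 0.8587},
}

# Select the model name with the highest F1-score
best_name = max(results, key=lambda name: results[name]['F1-score'])
print(best_name)  # Logistic Regression
```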


## Prediction Function (Accepts Both Text and Choice of Model)

This function predicts the sentiment of a given text based on the selected machine learning model.
### Explanation:
- `preprocess_text(text)`: Applies the same preprocessing used for training (lowercasing, HTML and punctuation removal, stopword removal, lemmatization).
- `vectorizer.transform([clean])`: Converts the cleaned text into a TF-IDF feature vector.
- `model_name not in models`: Checks if the specified model exists in the models dictionary.
- `models[model_name].predict(vect)`: Uses the selected model to predict the sentiment based on the transformed vector.
- `return 'Positive' if pred == 1 else 'Negative'`: Returns "Positive" if the model predicts a 1 (positive sentiment) and "Negative" for 0 (negative sentiment).


In [12]:
# Prediction function that accepts both text and a choice of model
def predict_sentiment(text, model_name='Multinomial Naive Bayes'):
   
    # Preprocess the input text
    clean = preprocess_text(text)
    
    # Transform the cleaned text into a TF-IDF vector
    vect = vectorizer.transform([clean])
    
    # Check if the provided model name exists in the models dictionary
    if model_name not in models:
        raise ValueError(f"Model '{model_name}' not found. Choose from {list(models.keys())}.")
    
    # Use the chosen model to predict sentiment and return the result
    pred = models[model_name].predict(vect)[0]
    
    # Return 'Positive' if the prediction is 1, else 'Negative'
    return 'Positive' if pred == 1 else 'Negative'
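
The function above relies on the global `vectorizer` and `models` objects staying in sync. One alternative design, shown here only as a sketch, is to bundle the vectorizer and classifier with scikit-learn's `make_pipeline` so that raw strings can be passed straight to `predict`. The four-sentence corpus is hypothetical and far too small for a meaningful model; in the notebook the text would still go through `preprocess_text` first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, only to show the mechanics
texts = ["great film loved it", "wonderful acting great story",
         "terrible plot boring film", "awful boring waste of time"]
labels = [1, 1, 0, 0]

# The pipeline keeps the fitted vectorizer and classifier together
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipe.fit(texts, labels)

# predict() vectorizes the raw strings internally before classifying them
preds = pipe.predict(["great story", "boring plot"])
print(["Positive" if p == 1 else "Negative" for p in preds])
```

A fitted pipeline can also be saved and reloaded as a single object, which avoids shipping the vectorizer and the model separately.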


## Building a GUI for Sentiment Analysis

This section creates a simple graphical user interface (GUI) using Tkinter, which allows users to input a review, select a model, and analyze sentiment.
### Explanation:
- **Tkinter Setup**: Initializes the Tkinter GUI framework and creates the main application window (`root`).
- **Label for Input**: `tk.Label` is used to create a label ("Enter Review:") that explains the purpose of the text box.
- **Text Box for Input**: `tk.Text` creates a multi-line text box where users can type the review.
- **Model Selection**: A label and `OptionMenu` are used to allow users to select one of the pre-defined models from a dropdown list.
- **Grid Layout**: `grid()` positions the widgets (labels, text box, and dropdown) within the window, with padding (`padx`, `pady`) for spacing.


In [13]:
# Import necessary libraries for GUI
import tkinter as tk
from tkinter import ttk

# Create the main application window
root = tk.Tk()
root.title("Sentiment Analysis GUI")  # Set the window title

# Input label and text box for entering a review
tk.Label(root, text="Enter Review:").grid(row=0, column=0, padx=5, pady=5, sticky="w")  # Label for input
input_text = tk.Text(root, height=5, width=60)  # Text box for review input
input_text.grid(row=1, column=0, columnspan=2, padx=5, pady=5)  # Place the text box in the window

# Model selection label and dropdown menu
tk.Label(root, text="Select Model:").grid(row=2, column=0, padx=5, pady=5, sticky="w")  # Label for model selection
model_var = tk.StringVar(value=list(models.keys())[0])  # Default model is the first one in the list
model_menu = ttk.OptionMenu(root, model_var, list(models.keys())[0], *models.keys())  # Dropdown menu for model selection
model_menu.grid(row=2, column=1, padx=5, pady=5, sticky="w")  # Place the dropdown in the window

## Prediction Result Label and GUI Functionality

This section handles the prediction and result display. It creates a button that triggers the sentiment analysis based on the input text.
### Explanation:
- **`on_predict()`**: This function retrieves the user’s input, calls the sentiment prediction function, and updates the result label with the sentiment ("Positive" or "Negative").
- **`predict_button`**: Creates a button labeled "Predict Sentiment" which triggers the `on_predict()` function when clicked.
- **`result_label`**: A label that initially displays "Sentiment: " and is updated with the prediction after clicking the button.
- **`root.mainloop()`**: Starts the Tkinter event loop, making the GUI interactive.


In [14]:
# Callback that runs the prediction and updates the result label
def on_predict():
    # Get the text entered in the input box (from the first character to the end)
    text = input_text.get("1.0", tk.END).strip()
    
    # If text is entered, predict sentiment using the selected model
    if text:
        result = predict_sentiment(text, model_var.get())  # Call the prediction function
        result_label.config(text=f"Sentiment: {result}")  # Update the result label with the prediction
    else:
        result_label.config(text="Please enter text to analyze.")  # Ask for input if no text is provided

# Create the prediction button, bind it to the on_predict function
predict_button = tk.Button(root, text="Predict Sentiment", command=on_predict)
predict_button.grid(row=3, column=0, columnspan=2, padx=5, pady=5)  # Place the button in the window

# Create the result label to display sentiment prediction
result_label = tk.Label(root, text="Sentiment: ")
result_label.grid(row=4, column=0, columnspan=2, padx=5, pady=5)  # Place the label in the window

# Start the Tkinter GUI event loop
root.mainloop()
