# Sentiment Analysis on IMDB Movie Reviews
## Introduction
This notebook demonstrates a complete workflow for performing sentiment analysis on the popular IMDB movie reviews dataset.
The project covers loading and preprocessing the data, extracting meaningful features using Count Vectorizer and TF-IDF techniques, and training multiple machine learning classifiers including Naive Bayes, Logistic Regression, Decision Trees, Random Forests, and XGBoost.

By evaluating model performance through accuracy scores, this notebook helps identify the most effective methods for classifying movie reviews as positive or negative.
The modular design makes it easy to extend or adapt for similar text classification tasks.



First, we import the Python libraries and modules needed for the sentiment analysis project.  
These include libraries for data manipulation (`pandas`, `numpy`), data preprocessing, feature extraction, machine learning models, and evaluation metrics.


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score



Now we load the IMDB movie reviews dataset from a CSV file.  
We use pandas' `python` parsing engine, which is slower than the default C engine but more feature-complete, to reduce the chance of parsing errors on irregular rows.  
Additionally, we extract the first 1000 rows to create a smaller subset for quicker experimentation and save it as a new CSV file.


In [None]:
# Load the original dataset
# Using the 'python' engine for better handling of potential parsing issues
try:
    df = pd.read_csv("IMDB_Dataset.csv", engine='python')
except pd.errors.ParserError as e:  # ParserError lives in pd.errors
    print(f"Error reading IMDB_Dataset.csv: {e}")
    print("Attempting to read with different parameters or inspect the file.")


if 'df' in locals():
    # Extract the first 1000 rows
    df_top_1000 = df.head(1000)

    # Save it as a new CSV file
    df_top_1000.to_csv("IMDB_Dataset_1000.csv", index=False)

    print("Top 1000 rows saved as 'IMDB_Dataset_1000.csv'")
else:
    print("Failed to load IMDB_Dataset.csv. Cannot proceed with creating IMDB_Dataset_1000.csv")

Top 1000 rows saved as 'IMDB_Dataset_1000.csv'
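
When only the first rows are needed, `read_csv` can also stop early through its `nrows` parameter instead of loading the full file and calling `head`. The sketch below is a minimal illustration under that assumption; the in-memory CSV is a hypothetical stand-in for `IMDB_Dataset.csv`, and the row counts are scaled down.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for IMDB_Dataset.csv (5 data rows)
csv_text = "review,sentiment\n" + "".join(
    f"review {i},positive\n" for i in range(5)
)

# nrows limits how many data rows are parsed (3 here, 1000 in the notebook)
subset = pd.read_csv(io.StringIO(csv_text), engine="python", nrows=3)
print(subset.shape)
```

This avoids holding the whole dataset in memory when only a small sample is used for experimentation.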




The function `load_data` loads the 1000-row subset of the IMDB dataset saved in the previous step (`IMDB_Dataset_1000.csv`) from a CSV file.  

It selects only the relevant columns — `review` and `sentiment` — which are required for the sentiment analysis task, and returns the filtered dataframe.


In [None]:
# Function to load dataset
def load_data(filepath):
    df = pd.read_csv(filepath)  # Read from the path passed by the caller
    df = df[['review', 'sentiment']]  # Ensure only relevant columns
    print("Dataset Loaded Successfully")  # Report success only after the read succeeds
    return df



The function `preprocess_data` takes the loaded dataframe as input and prepares it for model training by:  
- Separating the features (`review` texts) and labels (`sentiment`).  
- Encoding the sentiment labels (`positive` and `negative`) into numerical format (1 and 0) using Label Encoding.  
- Splitting the data into training and testing sets with an 80-20 ratio to evaluate model performance on unseen data.


In [None]:
# Function to preprocess data and split
def preprocess_data(df):
    X = df['review']
    y = df['sentiment']

    # Apply Label Encoding to convert 'positive'/'negative' → 1/0
    label_encoder = LabelEncoder()
    y_encoded = label_encoder.fit_transform(y)

    return train_test_split(X, y_encoded, test_size=0.2, random_state=42)
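
To confirm which label maps to which integer, the fitted encoder's `classes_` attribute can be inspected: `LabelEncoder` sorts class names, so `negative` becomes 0 and `positive` becomes 1. The sketch below also passes `stratify` to `train_test_split`, which is an optional addition (not used in the function above) that keeps the class ratio equal in both splits. The toy labels and review strings are hypothetical.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels: 10 positive, 10 negative
labels = ["positive", "negative", "negative", "positive"] * 5
texts = [f"review {i}" for i in range(len(labels))]

encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# classes_ is sorted, so the index of each class is its encoded value
mapping = {str(name): index for index, name in enumerate(encoder.classes_)}
print(mapping)

# stratify=y preserves the positive/negative ratio in train and test
X_train, X_test, y_train, y_test = train_test_split(
    texts, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test), int(y_test.sum()))
```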



The function `extract_features` converts raw text data into numerical features that machine learning models can understand.  

It supports two vectorization methods:  
- **Count Vectorization**: Counts the frequency of words (excluding English stopwords).  
- **TF-IDF Vectorization**: Weighs words by their importance using Term Frequency-Inverse Document Frequency, also removing stopwords.  

The function fits the vectorizer on the training data and transforms both training and test sets accordingly.


In [None]:
# Function to extract features (CountVectorizer / TF-IDF) with stopword removal
def extract_features(X_train, X_test, method="count"):
    if method == "count":
        vectorizer = CountVectorizer(stop_words='english')  # Removing stopwords
    elif method == "tfidf":
        vectorizer = TfidfVectorizer(stop_words='english')  # Removing stopwords
    else:
        raise ValueError("Method should be 'count' or 'tfidf'")

    X_train_transformed = vectorizer.fit_transform(X_train)
    X_test_transformed = vectorizer.transform(X_test)

    return X_train_transformed, X_test_transformed



The function `train_and_evaluate` trains a specified machine learning model on the training data and evaluates its accuracy on the test data.  

Supported models include:  
- Naive Bayes  
- Logistic Regression  
- Decision Tree  
- Random Forest  
- XGBoost  

The function returns the accuracy score as a performance metric.


In [None]:
# Function to train and evaluate models
def train_and_evaluate(X_train, X_test, y_train, y_test, model_name):
    models = {
        "naive_bayes": MultinomialNB(),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(),
        "random_forest": RandomForestClassifier(n_estimators=100),
        # Recent XGBoost releases ignore use_label_encoder and report it as "not used"
        "xgboost": XGBClassifier(use_label_encoder=False, eval_metric="logloss")
    }

    model = models[model_name]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    return accuracy_score(y_test, y_pred)
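
The fit, predict, and score pattern used above can be checked on a toy corpus small enough to verify by hand. This is a minimal sketch with hypothetical sentences and labels, using Naive Bayes on raw counts; it is not part of the notebook's pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 1 = positive, 0 = negative
train_texts = [
    "great wonderful film",
    "loved this great movie",
    "terrible boring film",
    "awful terrible movie",
]
train_labels = [1, 1, 0, 0]
test_texts = ["great wonderful", "boring awful"]
test_labels = [1, 0]

# Fit the vocabulary on training text only, then reuse it for the test text
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

model = MultinomialNB()
model.fit(X_train, train_labels)
predictions = model.predict(X_test)

# Accuracy is the fraction of test labels predicted correctly
toy_accuracy = accuracy_score(test_labels, predictions)
print(predictions, toy_accuracy)
```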



The function `sentiment_analysis_pipeline` orchestrates the entire workflow from data loading to model evaluation:  
- Loads the dataset from a CSV file.  
- Preprocesses the data (including label encoding and train-test splitting).  
- Extracts features using both Count Vectorizer and TF-IDF methods.  
- Trains and evaluates five different classifiers on each feature set.  

The results (accuracy scores) of each model-feature combination are collected and returned as a pandas DataFrame for easy comparison.


In [None]:
# Pipeline function to run the entire analysis
def sentiment_analysis_pipeline(filepath):
    df = load_data(filepath)
    X_train, X_test, y_train, y_test = preprocess_data(df)  # Label encoding applied here

    methods = ["count", "tfidf"]
    models = ["naive_bayes", "logistic_regression", "decision_tree", "random_forest", "xgboost"]

    results = {}

    for method in methods:
        X_train_transformed, X_test_transformed = extract_features(X_train, X_test, method)

        for model in models:
            key = f"{method}_{model}"
            results[key] = train_and_evaluate(X_train_transformed, X_test_transformed, y_train, y_test, model)

    return pd.DataFrame(results, index=["Accuracy"]).T

# Run the full pipeline
results_df = sentiment_analysis_pipeline("IMDB_Dataset_1000.csv")
print(results_df)


Dataset Loaded Successfully


Parameters: { "use_label_encoder" } are not used.

Parameters: { "use_label_encoder" } are not used.



                           Accuracy
count_naive_bayes             0.775
count_logistic_regression     0.790
count_decision_tree           0.705
count_random_forest           0.770
count_xgboost                 0.775
tfidf_naive_bayes             0.835
tfidf_logistic_regression     0.785
tfidf_decision_tree           0.660
tfidf_random_forest           0.790
tfidf_xgboost                 0.730


## Results Interpretation

- The table shows accuracy scores for different classifiers combined with Count Vectorizer and TF-IDF feature extraction methods.  
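- TF-IDF with Naive Bayes achieved the highest accuracy (83.5%), 4.5 points ahead of the next-best combinations (79.0%).  
- Logistic Regression (79.0% with counts, 78.5% with TF-IDF) and Random Forest (77.0% and 79.0%) performed consistently across both feature extraction techniques.  
- Decision Tree had the lowest accuracy under both feature sets (70.5% with counts, 66.0% with TF-IDF), suggesting it may not be the best choice for this dataset and task.  
- These results guide the choice of model and feature extraction strategy for sentiment analysis on the IMDB dataset, but the test set holds only 200 reviews (20% of the 1000-row subset), so each review shifts accuracy by 0.5 points and small gaps between models should be treated as tentative.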
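
The ranking above can be reproduced directly from the recorded accuracies. The sketch below sorts them and adds a rough binomial standard error for the top score; the test-set size of 200 is an assumption derived from the 80-20 split of the 1000-row subset, and the standard-error formula treats each test prediction as an independent trial.

```python
import math

import pandas as pd

# Accuracies copied from the results table above
accuracy = {
    "count_naive_bayes": 0.775,
    "count_logistic_regression": 0.790,
    "count_decision_tree": 0.705,
    "count_random_forest": 0.770,
    "count_xgboost": 0.775,
    "tfidf_naive_bayes": 0.835,
    "tfidf_logistic_regression": 0.785,
    "tfidf_decision_tree": 0.660,
    "tfidf_random_forest": 0.790,
    "tfidf_xgboost": 0.730,
}

ranked = pd.Series(accuracy, name="Accuracy").sort_values(ascending=False)
best_name = ranked.index[0]
best_acc = float(ranked.iloc[0])

# Assumption: 200 test reviews (20% of 1000)
n_test = 200
std_err = math.sqrt(best_acc * (1 - best_acc) / n_test)

print(ranked)
print(f"{best_name}: {best_acc:.3f} +/- {std_err:.3f}")
```

A standard error of roughly 2.6 points suggests that differences of one or two points between models on this subset may not be meaningful; rerunning on the full dataset or with cross-validation would give a firmer comparison.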
