# **1. Import Libraries**

In [18]:
import pandas as pd
import numpy as np
import string
import re
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score, f1_score

# **2. Load and Preprocess the Data**

In [19]:
df = pd.read_csv("//kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")

# Basic preprocessing
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

def clean_text(text):
    text = text.lower()
    text = re.sub(r"<.*?>", "", text)  # remove HTML tags
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    text = re.sub(r"\d+", "", text)  # remove digits
    text = " ".join([word for word in text.split() if word not in stop_words])
    return text

df['clean_review'] = df['review'].apply(clean_text)
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})


[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# **3. Split Data and Vectorize**

In [20]:
X = df['clean_review']
y = df['sentiment']

# TF-IDF Vectorization
tfidf = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf.fit_transform(X)

# Split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# **4. Train ML Models**

**1. Logistic Regression**

In [21]:
model = LogisticRegression()

**2. Naive Bayes**

In [22]:
model = MultinomialNB()

**3. SVM**

In [23]:
model = LinearSVC()

**Train Fit Model**

In [24]:
model.fit(X_train, y_train)

# **5. Evaluate the Model**

In [25]:
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.8798
F1 Score: 0.8819485366332744

Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.87      0.88      4961
           1       0.87      0.89      0.88      5039

    accuracy                           0.88     10000
   macro avg       0.88      0.88      0.88     10000
weighted avg       0.88      0.88      0.88     10000



# **6. Optional: Interface to Test Custom Reviews**

In [26]:
def predict_sentiment(review):
    review_clean = clean_text(review)
    review_tfidf = tfidf.transform([review_clean])
    prediction = model.predict(review_tfidf)
    return "Positive" if prediction[0] == 1 else "Negative"

# Example
test_review = "This movie was absolutely fantastic!"
print("Prediction:", predict_sentiment(test_review))

Prediction: Positive


# **(Optional) 7. Save the Model**

In [27]:
import pickle

with open("sentiment_model.pkl", "wb") as file:
    pickle.dump((model, tfidf), file)

**To load:**

In [28]:
with open("sentiment_model.pkl", "rb") as file:
    model, tfidf = pickle.load(file)

# **(Optional) 8. Simple CLI Interface 🖥️**

In [15]:
while True:
    review = input("Enter your review ('quit'): ")
    if review.lower() == 'quit':
        break
    print("Sentiment:", predict_sentiment(review))


Enter your review ('quit'):  "This was a complete waste of time. The story made no sense at all


Sentiment: Negative


Enter your review ('quit'):  Not the worst film, but certainly not the best either."


Sentiment: Negative


Enter your review ('quit'):  This was a complete waste of time. The story made no sense at all."


Sentiment: Negative


Enter your review ('quit'):  One of the best films I’ve seen in years. Highly recommended!"


Sentiment: Positive


Enter your review ('quit'):  Heartwarming, funny, and beautifully acted. A must-watch."


Sentiment: Positive


Enter your review ('quit'):  quit


# **🎬 Project Title: Sentiment Analysis of Movie Reviews**

# **🎯 Objective**

The primary goal of this project is to build a machine learning model that can classify movie reviews as either positive or negative based on the text content of the reviews. This is a classic binary text classification problem within the domain of Natural Language Processing (NLP).

# **📁 Dataset Information**

Name: IMDb Dataset of 50K Movie Reviews

Source: Kaggle (lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)

Content:

50,000 movie reviews

Each review is labeled as either positive or negative

Balanced dataset (equal number of positive and negative samples)

# **🧠 Approach & Methodology**

# **1. Data Preprocessing**
Since raw text contains noise and inconsistencies, the first step is to preprocess the data:

Convert all text to lowercase

Remove HTML tags, punctuation, digits, and special characters

Remove stopwords (commonly used words with little semantic value like "the", "is", etc.)

**Tokenization and normalization**

The cleaned data ensures that the model learns the semantic meaning rather than irrelevant formatting.

# **2. Feature Extraction**

After preprocessing, the cleaned text is converted into numerical features using TF-IDF (Term Frequency–Inverse Document Frequency). TF-IDF measures how important a word is to a document relative to the entire corpus, making it well-suited for text classification tasks.

# **3. Model Training**

Several standard machine learning algorithms can be trained on the TF-IDF feature matrix. Common choices include:

Logistic Regression – efficient for binary classification

Naïve Bayes – fast and effective for text data

Support Vector Machine (SVM) – particularly good for high-dimensional text data

Each model is trained using a training dataset, typically 80% of the total data, while the remaining 20% is used for testing.

# **4. Model Evaluation**

The model’s performance is evaluated using:

Accuracy: Proportion of total correct predictions

F1-Score: Harmonic mean of precision and recall; especially useful in case of class imbalance

Classification Report: A detailed summary including precision, recall, and F1-score for each class

This evaluation determines how well the model distinguishes between positive and negative reviews.

# **💬 Testing the Model**

Once the model is trained and evaluated, it can be used to predict the sentiment of custom user-provided reviews. For instance, a user might input a sentence like “I loved the storyline but hated the acting”, and the model will return a sentiment prediction (positive or negative) based on learned patterns.

# **💡 Optional Features**

Model Saving: The trained model can be saved using pickle or joblib for reuse without retraining.

Command Line Interface: A basic text-based interface where users enter reviews and see predictions.

GUI/Web App (Optional): A Streamlit or Flask interface can be created to allow easy interaction with the model.

Word Cloud/Visualization (Optional): Display most frequent words in positive and negative reviews.

# **📌 Learning Outcomes**

Through this project, you gain experience with:

Text data cleaning and preprocessing

Feature extraction using TF-IDF

Binary classification using machine learning models

Model evaluation using proper metrics

Building a real-world sentiment classifier that can generalize to unseen reviews

# **✅ Conclusion**

This sentiment analysis project is a strong introduction to practical NLP. It combines classic ML algorithms with text preprocessing techniques and showcases how unstructured text can be transformed into actionable insights. It can serve as a foundation for more advanced tasks like emotion detection, sarcasm detection, or multi-class sentiment classification.

