**Project Overview & Dataset**

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv("IMDB Dataset.csv")

# Display the first few rows
df.head()

,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


**Data Preprocessing & Text Cleaning**

Before training our model, we need to clean the text data.  

✔ Remove HTML tags (such as `<br />`)  
✔ Remove special characters & punctuation  
✔ Convert text to lowercase  
✔ Tokenization (breaking text into words)  
✔ Remove stopwords (common words like ‘the’, ‘is’, ‘and’)  
✔ Lemmatization (reducing words to their base form)  

In [2]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Build these once; rebuilding them for every word or review is very slow
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Function to clean text
def clean_text(text):
    text = re.sub(r'<.*?>', '', text)  # Remove HTML tags
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # Replace non-letter characters with spaces
    text = text.lower()  # Convert to lowercase
    words = word_tokenize(text)  # Tokenization
    words = [word for word in words if word not in stop_words]  # Remove stopwords
    words = [lemmatizer.lemmatize(word) for word in words]  # Lemmatization
    return " ".join(words)

# Apply text cleaning
df['cleaned_review'] = df['review'].apply(clean_text)



**Feature Extraction with TF-IDF**

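TF-IDF (term frequency-inverse document frequency) turns each review into a vector of word weights, so the model can tell which words matter most in a text. A word gets a higher weight when it appears often in a particular review but in relatively few reviews overall (like ‘amazing’ in a positive review). Words that appear in almost every review (like ‘movie’) are weighted down.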
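
To make the weighting concrete, here is a minimal hand-rolled sketch on a tiny corpus. The three documents are hypothetical and not taken from the IMDB dataset. The IDF formula used, `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, is the smoothed variant described in the scikit-learn documentation as the `TfidfVectorizer` default. Treat this as an illustration of the idea rather than a replacement for the vectorizer.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already cleaned and lowercased.
docs = [
    "movie amazing amazing",
    "movie boring",
    "movie fine",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    # Raw term counts times IDF, then L2-normalised within the document.
    counts = Counter(tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the first document, ‘amazing’ ends up with a much larger weight than ‘movie’: it occurs twice there and nowhere else, while ‘movie’ occurs in every document.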

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert text data into numerical form
vectorizer = TfidfVectorizer(max_features=5000)  # Keep the 5,000 most frequent terms
X = vectorizer.fit_transform(df['cleaned_review'])

# Labels
y = df['sentiment'].map({'positive': 1, 'negative': 0})

**Training the Model**

Next, we’ll train a *Logistic Regression* model to classify movie reviews as positive (1) or negative (0).

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Split data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.8885
Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.87      0.89      4961
           1       0.88      0.90      0.89      5039

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000
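
The precision, recall, and F1 columns in the report above can be reproduced by hand from counts of true positives, false positives, and false negatives. The sketch below does that for one class. The function name `binary_report` and the toy labels are hypothetical and exist only for illustration; they are not the notebook's actual predictions.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute precision, recall, F1, and support for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}

# Hypothetical toy labels: 1 = positive, 0 = negative.
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 1, 0, 0, 0, 1, 1, 0]

report = binary_report(y_true_toy, y_pred_toy, positive=1)
print(report)
```

Precision answers "of the reviews predicted positive, how many really were?", while recall answers "of the truly positive reviews, how many did the model find?". Support is simply the number of true examples of that class.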



**Testing with New Reviews**

Let’s test our model with some custom reviews!

In [5]:
test_review = ["The movie was absolutely amazing, I loved it!"]
# Apply the same cleaning used on the training data before vectorizing
test_review_tfidf = vectorizer.transform([clean_text(r) for r in test_review])
prediction = model.predict(test_review_tfidf)
print("Sentiment:", "Positive" if prediction[0] == 1 else "Negative")

Sentiment: Positive
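
For trying several reviews at once, the steps in the cell above can be wrapped in a small helper. This is only a sketch: `predict_sentiment` is a hypothetical name, and it assumes the objects passed in behave like the notebook's fitted `vectorizer` and `model` (a `transform` method and a `predict` method that returns 1 for positive).

```python
def predict_sentiment(texts, vectorizer, model, clean=None):
    """Return 'Positive' or 'Negative' for each raw review string."""
    if clean is not None:
        # Reuse the training-time cleaning so new text matches the vocabulary.
        texts = [clean(text) for text in texts]
    features = vectorizer.transform(texts)
    return ["Positive" if label == 1 else "Negative"
            for label in model.predict(features)]
```

In this notebook it could be called as `predict_sentiment(["Great film!", "A dull, lifeless plot."], vectorizer, model, clean=clean_text)`.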
