### 📥 Load the IMDB Movie Review Dataset


In [42]:
import pandas as pd

# Load the dataset
df = pd.read_csv(r"C:\Users\Ammar\Downloads\archive (6)\IMDB Dataset.csv")


In [44]:
df


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [45]:
df.isnull().sum()


review       0
sentiment    0
dtype: int64

## 🔄 Encode Sentiment Labels 

In [46]:
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})

In [48]:
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1
...,...,...
49995,I thought this movie did a down right good job...,1
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",0
49997,I am a Catholic taught in parochial elementary...,0
49998,I'm going to have to disagree with the previou...,0


In [49]:
df["sentiment"].value_counts()

sentiment
1    25000
0    25000
Name: count, dtype: int64

# 🧽 Text Preprocessing Setup

In [50]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer



In [51]:

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Ammar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ammar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Ammar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# 🧠✨ Turning Raw Reviews into Refined Thoughts

In [52]:
def preprocess(text):
    tokens = word_tokenize(text.lower())  # Lowercase and tokenize
    tokens = [word for word in tokens if word.isalpha()]  # Keep alphabetic tokens only (drops punctuation and numbers)
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    tokens = [lemmatizer.lemmatize(word) for word in tokens]  # Lemmatize
    return ' '.join(tokens)


df['cleaned_review'] = df['review'].apply(preprocess)


In [53]:
df[['review', 'cleaned_review']]

Unnamed: 0,review,cleaned_review
0,One of the other reviewers has mentioned that ...,one reviewer mentioned watching oz episode hoo...
1,A wonderful little production. <br /><br />The...,wonderful little production br br filming tech...
2,I thought this was a wonderful way to spend ti...,thought wonderful way spend time hot summer we...
3,Basically there's a family where a little boy ...,basically family little boy jake think zombie ...
4,"Petter Mattei's ""Love in the Time of Money"" is...",petter mattei love time money visually stunnin...
...,...,...
49995,I thought this movie did a down right good job...,thought movie right good job creative original...
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",bad plot bad dialogue bad acting idiotic direc...
49997,I am a Catholic taught in parochial elementary...,catholic taught parochial elementary school nu...
49998,I'm going to have to disagree with the previou...,going disagree previous comment side maltin on...
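The cleaned text above still contains `br` tokens. They are left over from `<br />` tags in the raw reviews: once the angle brackets are tokenized away, the tag name passes the `isalpha()` filter. Below is a minimal sketch of one way to drop such tags before tokenizing, assuming a simple regex is enough for the markup in this dataset. `strip_html` is a hypothetical helper and the sample string is made up; neither is part of the notebook.

```python
import re

# Matches anything that looks like an HTML tag, e.g. <br /> or <i>.
TAG_RE = re.compile(r"<[^>]+>")

def strip_html(text):
    """Replace HTML-like tags with a space so neighbouring words stay separate."""
    return TAG_RE.sub(" ", text)

# Made-up sample in the style of the raw reviews.
sample = "Great film. <br /><br />Loved the ending."
print(strip_html(sample))
```

Calling `strip_html(text)` as the first step inside `preprocess` would keep tag names out of the TF-IDF vocabulary.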


# 📚🔍 Transforming Words into Numbers 

In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(df['cleaned_review']).toarray()
y = df['sentiment']


## 🤖📈 Training the Sentiment Classifier 

In [55]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))


Accuracy: 0.89
              precision    recall  f1-score   support

           0       0.90      0.87      0.88      4961
           1       0.88      0.90      0.89      5039

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000



### Model Evaluation: Accuracy and Classification Report


In [56]:
from sklearn.metrics import classification_report, accuracy_score

# Make predictions
y_pred = model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.8863
Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.87      0.88      4961
           1       0.88      0.90      0.89      5039

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000



### Sentiment Prediction for Custom Reviews


In [63]:
def predict_sentiment(review):
    # Clean the input the same way as the training reviews
    cleaned = preprocess(review)

    # Transform the cleaned review using the existing TF-IDF vectorizer
    review_vector = tfidf.transform([cleaned]).toarray()

    # Predict sentiment
    prediction = model.predict(review_vector)

    # Convert to label
    sentiment = "positive" if prediction[0] == 1 else "negative"
    return sentiment


#### Predicting Sentiment for a Positive Review


In [64]:
new_review = "This movie was fantastic. I loved it."
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")


The sentiment of the review is: positive


#### Predicting Sentiment for a Negative Review


In [66]:
# example usage
new_review = "This movie was ok but not that good."
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")

The sentiment of the review is: negative


# 🎬 IMDB Sentiment Analysis – Project Overview & Insights

This project focuses on building a sentiment analysis model using the IMDB Movie Reviews dataset. The goal is to classify reviews as either positive or negative using traditional Natural Language Processing (NLP) techniques and a machine learning algorithm (Logistic Regression).

---

## 📥 1. Dataset Loading

We start by loading the IMDB dataset, which contains 50,000 labeled movie reviews: 25,000 positive and 25,000 negative. Each review is paired with a sentiment label, making this a supervised binary classification problem.

---

## 🧹 2. Data Inspection & Cleaning

Before preprocessing, we check the dataset for missing values. Neither column contains nulls, so no rows need to be dropped or imputed before text processing.

---

## 🔄 3. Label Encoding

The sentiment column contains string values: "positive" or "negative". These are mapped to numeric values:
- Positive → 1  
- Negative → 0

This numerical format is required for model training.
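
One thing to watch with `.map()` is that any value missing from the mapping becomes `NaN` rather than raising an error. The sketch below illustrates this on a toy frame (an assumption standing in for the real dataset, with one deliberately mis-cased label).

```python
import pandas as pd

# Toy frame standing in for the real dataset (assumption: same column name).
toy = pd.DataFrame({"sentiment": ["positive", "negative", "Positive"]})

mapping = {"positive": 1, "negative": 0}
encoded = toy["sentiment"].map(mapping)

# .map() yields NaN for any value missing from the mapping, so a null check
# after encoding catches unexpected labels such as "Positive".
unmapped = int(encoded.isnull().sum())
print(unmapped)  # 1
```

Running the same null check on the encoded column of the real data is a cheap way to confirm that every label was mapped.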

---

## 🧽 4. Text Preprocessing

We apply several preprocessing steps using NLTK to clean the review texts:

- Convert all text to lowercase  
- Tokenize the text into words  
- Remove punctuation and non-alphabetic tokens  
- Remove stopwords (commonly used but uninformative words)  
- Apply lemmatization to reduce words to their base form  

This results in a clean, normalized version of each review suitable for vectorization.
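
One side effect of the stopword step is worth checking: NLTK's English stopword list includes negations such as "not" and "no", so "not good" and "good" can clean to the same text. The sketch below shows the effect and one possible workaround. It is a simplification under stated assumptions: `STOP_WORDS` is a small hand-written subset standing in for NLTK's list, a regex replaces `word_tokenize`, and lemmatization is omitted.

```python
import re

# Small hand-written subset standing in for NLTK's English stopword list
# (assumption: the real list is much longer but also contains "not" and "no").
STOP_WORDS = {"this", "was", "but", "that", "not", "no", "the", "is", "it", "i"}
NEGATIONS = {"not", "no"}

def clean(text, stop_words):
    """Lowercase, keep alphabetic tokens, and drop the given stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in stop_words)

review = "This movie was not good"
print(clean(review, STOP_WORDS))              # movie good
print(clean(review, STOP_WORDS - NEGATIONS))  # movie not good
```

Whether keeping negations helps on this dataset would need to be measured; the sketch only shows how the cleaned text differs.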

---

## 📚 5. Feature Extraction with TF-IDF

The cleaned review texts are transformed into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency). This technique helps represent the importance of each word in a review relative to the whole dataset.

- We cap the vocabulary at 5,000 terms (`max_features=5000`), which keeps the terms with the highest frequency across the corpus and reduces complexity.

---

## 🤖 6. Model Building – Logistic Regression

We split the data into training and testing sets (80/20 split). Then we train a Logistic Regression model on the training data. This model is widely used for binary classification tasks due to its simplicity and effectiveness.
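
In the notebook, the TF-IDF vectorizer is fitted on all reviews before the split, so the vocabulary and IDF weights also see the test reviews. A possible alternative, sketched below on toy data (an assumption standing in for `df['cleaned_review']` and `df['sentiment']`), is to split the raw texts first and wrap both steps in a scikit-learn pipeline. This also keeps the features sparse instead of converting them with `.toarray()`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy data standing in for the cleaned reviews and their labels.
texts = ["great film", "loved it", "wonderful acting", "great story",
         "terrible film", "hated it", "awful acting", "boring story"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, so no test-set
# vocabulary or IDF statistics leak into training.
clf = make_pipeline(TfidfVectorizer(max_features=5000), LogisticRegression())
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
print(len(predictions))  # 2
```

On a dataset of this size the effect on accuracy is probably small, but the pipeline form also makes later prediction on new text a single call.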

---

## 📊 7. Model Evaluation

The trained model is evaluated on the test set using:

- Accuracy score  
- Precision, recall, and F1-score (via classification report)

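The model reaches about 89% accuracy on the 10,000 held-out reviews, with precision, recall, and F1 between 0.87 and 0.90 for both classes. This suggests that traditional NLP techniques combined with logistic regression form a strong baseline for sentiment analysis.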
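
For readers who want to see where the report's numbers come from, the sketch below derives precision, recall, and F1 for the positive class from a confusion matrix. The labels are a made-up toy example, not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy ground truth and predictions (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 0, 0, 1, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, round(f1, 2))  # 0.6 0.75 0.67
```

The hand-computed values match `precision_score` and `recall_score` for the same inputs.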

---

## 🧪 8. Custom Review Prediction

We add functionality to predict the sentiment of any custom movie review provided by the user. The input is cleaned, transformed using the TF-IDF vectorizer, and passed through the trained model to generate a prediction.
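
The same idea extends to a list of reviews. The sketch below assumes the TF-IDF step and the classifier are wrapped in one scikit-learn pipeline (the notebook keeps them as separate `tfidf` and `model` objects) and trains on toy data. `predict_sentiments` is a hypothetical helper, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (illustrative only).
train_texts = ["great film", "wonderful acting", "terrible film", "awful acting"]
train_labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

def predict_sentiments(reviews, classifier):
    """Return a 'positive' or 'negative' label for each review in the list."""
    predicted = classifier.predict(reviews)
    return ["positive" if p == 1 else "negative" for p in predicted]

results = predict_sentiments(["great acting", "awful film"], clf)
print(results)
```

With the real model, each review would first go through the same cleaning function used for training before being vectorized.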

---

## 📌 Conclusion

- The model classifies review sentiment with about 89% test accuracy using traditional machine learning methods.  
- TF-IDF combined with basic text preprocessing captures enough word-level signal to distinguish between positive and negative reviews, even though it ignores word order.  
- This project can be extended by experimenting with other models (e.g., SVM, Naive Bayes, or LSTM), incorporating n-grams, or visualizing the most important words using word clouds.
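
As a sketch of the n-gram extension, `TfidfVectorizer` accepts an `ngram_range` parameter. The two toy documents below (made up for illustration) show how adding bigrams lets the vocabulary tell "not good" apart from "good".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["not good", "good"]  # toy documents

# ngram_range=(1, 1) is the default: single words only.
unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(docs)
# ngram_range=(1, 2) adds every pair of adjacent words as an extra feature.
bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(unigrams.vocabulary_))  # ['good', 'not']
print(sorted(bigrams.vocabulary_))   # ['good', 'not', 'not good']
```

Bigrams only help with negation if words like "not" survive the stopword step, and they enlarge the vocabulary, so the `max_features` cap may need revisiting.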

---
