## IMDB Sentiment Analysis Project
This notebook demonstrates sentiment classification on IMDB movie reviews using:
- Text preprocessing with NLTK
- TF-IDF feature extraction
- Random Forest classification

## 1. Importing Required Libraries
Essential packages for data processing and analysis:
- `numpy`: Numerical computing
- `pandas`: Data manipulation
- `re`: Regular expressions for text cleaning

In [1]:
import numpy as np
import pandas as pd
import re

## 2. Loading the Dataset
The IMDB dataset contains 50,000 movie reviews, each labeled `positive` or `negative` (encoded as 1 and 0 in Section 3.1):
- Source: Kaggle (https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)
- Columns: 'review' (text), 'sentiment' (label)

In [2]:
data = pd.read_csv("/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")

In [3]:
data.head()

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
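
Before preprocessing, it can help to confirm the shape, label balance, and absence of missing values. The sketch below runs on a small hypothetical frame (`toy`) instead of the real CSV so that it works anywhere; the same calls apply to the notebook's `data`.

```python
import pandas as pd

# Hypothetical stand-in for the loaded CSV (same two columns as the dataset).
toy = pd.DataFrame({
    "review": ["Great film!", "Terrible plot.", "Loved it.", "Not my thing."],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

n_rows, n_cols = toy.shape                      # number of rows and columns
label_counts = toy["sentiment"].value_counts()  # reviews per label
n_missing = int(toy.isna().sum().sum())         # total number of missing cells

print(n_rows, n_cols)
print(label_counts.to_dict())
print(n_missing)
```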


## 3. Preprocessing Pipeline

### 3.1 Label Encoding
Convert sentiment labels from strings to binary integers:
- 'positive' → 1
- 'negative' → 0

In [4]:
data['sentiment'] = data['sentiment'].map({'positive': 1, 'negative': 0})
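
`Series.map` with a dict returns `NaN` for any value that is not a key in the dict, so an unexpected label would silently become a missing value. A minimal sketch of a guard, using a hypothetical toy series (`labels`) rather than the real column:

```python
import pandas as pd

# Hypothetical labels, including one unexpected value to show the failure mode.
labels = pd.Series(["positive", "negative", "Positive"])

mapping = {"positive": 1, "negative": 0}
encoded = labels.map(mapping)

# Values absent from the mapping come back as NaN.
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)

# Normalizing case before mapping resolves this particular mismatch.
encoded_clean = labels.str.lower().map(mapping)
print(encoded_clean.tolist())
```

On the real column, an empty `unmapped` list confirms that every row received a 0 or 1.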

### 3.2 Text Cleaning
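The `clean_text` function performs, in order:
1. HTML entity unescaping (e.g. `&amp;` becomes `&`) and removal of HTML tags such as `<br />`
2. Removal of URLs, mentions (@), and hashtags (#)
3. Removal of every character other than letters and whitespace
4. Whitespace normalization and conversion to lowercase
5. Tokenization using NLTK's `TweetTokenizer`
6. Stopword removal using NLTK's English stopword list

In [5]:
import html
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

# Build these once instead of on every call.
# Needs the NLTK stopwords corpus; run nltk.download("stopwords") if it is missing.
tknzr = TweetTokenizer()
stop_words = set(stopwords.words("english"))

def clean_text(text):
    text = html.unescape(text)  # decode entities such as &amp;
    # html.unescape leaves tags in place; strip them so "<br />" does not survive as the token "br"
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"http\S+|www\S+|@\S+|#\S+", "", text)  # URLs, mentions, hashtags
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # keep letters and whitespace only
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    text = text.lower()
    tokens = tknzr.tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return " ".join(tokens)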

In [6]:
data['review'] = data['review'].apply(clean_text)
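
To see what the regex steps do to a single review, here is a standard-library-only sketch. It is not the notebook's function: `clean_text_sketch`, its `strip_tags` flag, and the tiny `STOP_WORDS` set (standing in for NLTK's list) are hypothetical names introduced for illustration. Because `html.unescape` only decodes entities, the sketch strips tags explicitly, and the flag shows what happens when that step is skipped.

```python
import html
import re

# Hypothetical stand-in for NLTK's English stopword list (illustration only).
STOP_WORDS = {"a", "the", "is", "it", "this", "was"}

def clean_text_sketch(text, strip_tags=True):
    text = html.unescape(text)                   # "&amp;" -> "&"
    if strip_tags:
        text = re.sub(r"<[^>]+>", " ", text)     # drop tags such as <br />
    text = re.sub(r"http\S+|www\S+|@\S+|#\S+", "", text)  # URLs, mentions, hashtags
    text = re.sub(r"[^a-zA-Z\s]", "", text)      # keep letters and whitespace
    text = re.sub(r"\s+", " ", text).strip().lower()
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

sample = ("A wonderful little production. <br /><br />The filming is great "
          "&amp; fun! See http://example.com #movies")

with_strip = clean_text_sketch(sample)
without_strip = clean_text_sketch(sample, strip_tags=False)
print(with_strip)
print(without_strip)  # the letters of "<br />" survive as "br" tokens
```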

### 3.3 Stemming
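The Porter stemmer strips common suffixes so that related word forms share one stem (e.g. "running" becomes "run"). Stems are not always dictionary words: "movies" becomes "movi".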

In [7]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    tknzr = TweetTokenizer()
    tokens = tknzr.tokenize(text)
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    stemmed_text = " ".join(stemmed_tokens)
    return stemmed_text

In [8]:
data['review'] = data['review'].apply(stem_text)
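
A short sketch of the stemmer on a few hypothetical sample words. `PorterStemmer` is purely algorithmic, so no NLTK corpus download is needed for this part.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical sample words; the resulting stems need not be real words.
words = ["running", "movies", "wonderful", "acting"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Because different forms collapse to a shared stem, the TF-IDF vocabulary built in the next section is smaller than it would be on unstemmed text.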

## 4. Feature Engineering with TF-IDF
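Convert the cleaned text to numerical features with `TfidfVectorizer`, configured as follows:
- `max_features=10000`: keep only the 10,000 terms with the highest frequency across the corpus
- `ngram_range=(1, 2)`: include both single words (unigrams) and adjacent word pairs (bigrams)
- `min_df=5`: ignore terms that appear in fewer than 5 documents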

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=10000,
    ngram_range=(1, 2),
    min_df=5,
)

X = vectorizer.fit_transform(data["review"])
y = data['sentiment']

## 5. Train-Test Split
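Hold out 20% of the reviews for evaluation:
- 80% training data
- 20% testing data

No `random_state` is passed to `train_test_split`, so the split, and therefore the metrics reported below, will differ slightly from run to run.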

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
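
In the cells above, the vectorizer is fitted on all reviews before the split, so test documents influence the vocabulary and IDF weights. A minimal sketch of an alternative ordering, assuming a hypothetical toy corpus: split the raw text first, then let a scikit-learn `Pipeline` fit TF-IDF on the training portion only. The `random_state` and `stratify` arguments are additions for repeatability and class balance, not settings taken from the notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews (already cleaned and stemmed) with binary labels.
texts = [
    "great film love", "wonder act great", "love stori great", "enjoy everi minut",
    "bad plot bore", "terribl act bad", "wast time bore", "aw script bad",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first; random_state makes the split repeatable and
# stratify keeps the class ratio the same in both parts.
txt_train, txt_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then the classifier.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=42)),
])
pipe.fit(txt_train, lab_train)

# Test text is transformed with the vocabulary learned from the training text.
preds = pipe.predict(txt_test)
print(len(txt_train), len(txt_test), preds.tolist())
```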

## 6. Model Training and Evaluation

### 6.1 Random Forest Classifier

In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

model = RandomForestClassifier()
_ = model.fit(X_train, y_train)

### 6.2 Model Evaluation
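Key metrics, reported per class:
- Precision: fraction of reviews predicted as the class that actually belong to it
- Recall: fraction of reviews in the class that the model identified correctly
- F1-score: harmonic mean of precision and recall
- Support: number of test reviews in the class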

In [12]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.86      0.85      5034
           1       0.86      0.84      0.85      4966

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000
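
The figures in the report can be cross-checked by hand. The sketch below recomputes the class 1 F1-score and the overall accuracy from the rounded values printed above, so the results agree only to two decimals.

```python
# Precision and recall for class 1, as printed in the report (rounded).
precision = 0.86
recall = 0.84

# F1 is the harmonic mean of precision and recall.
f1_from_report = 2 * precision * recall / (precision + recall)
print(round(f1_from_report, 2))

# Accuracy equals the support-weighted average of per-class recall.
support = {0: 5034, 1: 4966}
recall_by_class = {0: 0.86, 1: 0.84}
accuracy = sum(recall_by_class[c] * support[c] for c in support) / sum(support.values())
print(round(accuracy, 2))
```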



In [13]:
from sklearn.metrics import f1_score

f1 = f1_score(y_test, y_pred)  # binary F1 for the positive class (pos_label=1 by default)
print("F1 Score:", f1)

F1 Score: 0.849354084019937
