## Assignment - 1:
### Sentiment Anaysis on IMBD Dataset with Naive Bayes

---

- @gpt :


### 🔹 Objective  
To build a sentiment analysis model on the IMDB movie reviews dataset using **Naive Bayes classifiers** and compare their performance with different feature extraction techniques (BoW & TF-IDF).  

---

### 🔹 NLP Pipeline Steps  

1. **Data Acquisition**  
   - Dataset: IMDB movie reviews (labeled as positive/negative).  

2. **Text Preprocessing**  
   - Lowercasing text.  
   - Removing punctuation and digits.  
   - Removing stopwords.  
   - Tokenization (sentence & word level).  

3. **Feature Engineering**  
   - **Bag-of-Words (CountVectorizer)**  
   - **TF-IDF (TfidfVectorizer)**  

4. **Modeling**  
   - Trained two Naive Bayes classifiers:  
     - **MultinomialNB** → best for word frequencies / TF-IDF.  
     - **BernoulliNB** → best for binary features (word present/absent).  

5. **Evaluation Metrics**  
   - Accuracy Score.  
   - Precision, Recall, F1-score (via `classification_report`).  

---

### 🔹 Results  

| Vectorizer        | Classifier     | Accuracy | Remarks |
|-------------------|---------------|----------|---------|
| CountVectorizer   | MultinomialNB | ~86%     | Good but less than TF-IDF |
| TfidfVectorizer   | MultinomialNB | **89.19%** ✅ | Best performing model |
| CountVectorizer   | BernoulliNB   | ~82%     | Works but weaker |
| TfidfVectorizer   | BernoulliNB   | ~84%     | Better, but not best |

---

### 🔹 Final Winner  
✔ **TF-IDF + MultinomialNB** → **89.19% Accuracy**  
- Works best because TF-IDF captures **importance of words** and MultinomialNB is well-suited for frequency-based features.  

---

### 🔹 Conclusion  
- **MultinomialNB** consistently outperformed BernoulliNB.  
- **TF-IDF** features provided higher accuracy than simple Bag-of-Words.  
- Hence, the combination of **TF-IDF + MultinomialNB** is most effective for IMDB sentiment classification.  

---

In [6]:
# Step 1: Load the Dataset

import pandas as pd

#loading CSV file
df = pd.read_csv('IMDB_Dataset.csv')

#adding numeric label (0=negative, 1=positive)
df['label'] = df['sentiment'].map({'negative' : 0, 'positive':1})

print(df.shape)
df.head()

#sentiment is converted to number (0 and 1)

(50000, 3)


Unnamed: 0,review,sentiment,label
0,One of the other reviewers has mentioned that ...,positive,1
1,A wonderful little production. <br /><br />The...,positive,1
2,I thought this was a wonderful way to spend ti...,positive,1
3,Basically there's a family where a little boy ...,negative,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1


In [7]:
# Step 2: Train - Test Split
# splitting into training and validation sets

from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
print(X_train.shape, X_val.shape)

# stratify = y ensures equal distribution of positive/negative in both train & validation

(40000,) (10000,)


In [8]:
# Step 3: Text CLeaning
#converting reviews into cleaner text

# Theory : LowerCasing reduces vocab size.
#        : Remove HTML tags, URLS, punctaitons, digits (noise).
#        : Keep negations like "not" as they change sentiment.

import re, html
from nltk.corpus import stopwords
import nltk
#nltk.download('stopwords') #comented after downloading.

#to Keep negations
stop_words = set(stopwords.words('english'))
for neg in ['not', 'no', 'nor', 'never']:
    stop_words.discard(neg)
    
def clean_text(text):
    #removing html tags :
    text = re.sub('r<[^>]+>', ' ', text)
    
    #decoding HTML entities 
    text = html.unescape(text)
    
    #removing URLS :
    text = re.sub(r"http\S+|www.\S+", " ", text)
    
    #lowercasing : 
    text = text.lower()
    
    #removing punctuatuins/digits :
    text = re.sub(r"[^a-z\s]", " ", text)
    
    #removing extra spaces
    text = re.sub(r"\s+", " ",text).strip()
    
    return text


print(clean_text("I didn't LIKE this movie! <br> Visit: http://abc.com"))

#if we remove "not", nodek may misclassify "not good" as "good"
#normalised spacing after regex replacements

i didn t like this movie br visit


In [9]:
# Step 4: Feature Extraction(Bag-of-Words & TF-IDF)
# convering text to numbers using COuntVectorizer(BOW) and TfidfVectorizer.

#Theory : 
#   Bag-of-Words : counts word frequency. works wll with MultinomialNB
#   TF-IDF : scales down very common words, boosts rare but useful words

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

bow_vectorizer = CountVectorizer(preprocessor=clean_text,
                                 stop_words=stop_words,
                                 ngram_range=(1,2),   # unigrams + bigrams
                                 min_df=5, max_df=0.9)

tfidf_vectorizer = TfidfVectorizer(preprocessor=clean_text,
                                   stop_words=stop_words,
                                   ngram_range=(1,2),
                                   min_df=5, max_df=0.9)

#adding both bigrams (1,2) captures phrases like "not good".
#min_df = 5 ignores very rare words.
#max_df = 0.9 ignores extremnly frequent words.

In [None]:
# for Step 5: 

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords

stop_words = stopwords.words('english')   # list, not set
for neg in ['not','no','nor','never']:
    if neg in stop_words:
        stop_words.remove(neg)   # keep negations

bow_vectorizer = CountVectorizer(preprocessor=clean_text,
                                 stop_words=stop_words,
                                 ngram_range=(1,2),   
                                 min_df=5, max_df=0.9)

tfidf_vectorizer = TfidfVectorizer(preprocessor=clean_text,
                                   stop_words=stop_words,
                                   ngram_range=(1,2),
                                   min_df=5, max_df=0.9)

#converted stopwords to list and
#fixed tying error

In [None]:
# Step 5: Train Models (MultinomialNB, BernoulliNB)
# Theory Notes:
# - MultinomialNB: Best for word counts or TF-IDF (works on frequencies).
# - BernoulliNB: Best when features are binary (word present / absent).

from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, classification_report

def train_and_eval(vec , clf, X_train, X_val, y_train, y_val):
    # 1. Vectorize text
    X_train_vec = vec.fit_transform(X_train)
    X_val_vec = vec.transform(X_val)
    
    # 2. Train model
    clf.fit(X_train_vec, y_train)
    
    # 3. Predict on validation data
    preds = clf.predict(X_val_vec)
    
    # 4. Evaluate model
    acc = accuracy_score(y_val, preds)
    print(f"Model: {clf.__class__.__name__}, Vectorizer: {vec.__class__.__name__}, Accuracy: {acc:.4f}")
    print(classification_report(y_val, preds, digits=4))
    print("=" *80)
    

# Run experiments
train_and_eval(bow_vectorizer, MultinomialNB(), X_train, X_val, y_train, y_val)
train_and_eval(tfidf_vectorizer, MultinomialNB(), X_train, X_val, y_train, y_val)

train_and_eval(bow_vectorizer, BernoulliNB(), X_train, X_val, y_train, y_val)
train_and_eval(tfidf_vectorizer, BernoulliNB(), X_train, X_val, y_train, y_val)

# Notes:
# - MultinomialNB + TF-IDF usually gives the best performance (≈85–90%).
# - BernoulliNB is better when features are binary (word presence/absence).

Model: MultinomialNB, Vectorizer: CountVectorizer, Accuracy: 0.8849
              precision    recall  f1-score   support

           0     0.8825    0.8880    0.8853      5000
           1     0.8873    0.8818    0.8845      5000

    accuracy                         0.8849     10000
   macro avg     0.8849    0.8849    0.8849     10000
weighted avg     0.8849    0.8849    0.8849     10000

Model: MultinomialNB, Vectorizer: TfidfVectorizer, Accuracy: 0.8919
              precision    recall  f1-score   support

           0     0.8942    0.8890    0.8916      5000
           1     0.8896    0.8948    0.8922      5000

    accuracy                         0.8919     10000
   macro avg     0.8919    0.8919    0.8919     10000
weighted avg     0.8919    0.8919    0.8919     10000

Model: BernoulliNB, Vectorizer: CountVectorizer, Accuracy: 0.8889
              precision    recall  f1-score   support

           0     0.8942    0.8822    0.8882      5000
           1     0.8838    0.8956  

In [None]:
#Step 6: Summary of Results
#comparing all models side by side

#Theory : We comapare which combination of features representation (BOW/TF-IDF) and classifier works best.

results = []

for vec in [bow_vectorizer, tfidf_vectorizer]:
    for clf in [MultinomialNB(), BernoulliNB()]:
        X_train_vec = vec.fit_transform(X_train)
        X_val_vec = vec.transform(X_val)
        
        clf.fit(X_train_vec, y_train)
        preds = clf.predict(X_val_vec)
        acc = accuracy_score(y_val, preds)
        
        results.append({
            "Vectorizer": vec.__class__.__name__,
            "Classifier": clf.__class__.__name__,
            "Accuracy": acc
        })

pd.DataFrame(results)

Unnamed: 0,Vectorizer,Classifier,Accuracy
0,CountVectorizer,MultinomialNB,0.8849
1,CountVectorizer,BernoulliNB,0.8889
2,TfidfVectorizer,MultinomialNB,0.8919
3,TfidfVectorizer,BernoulliNB,0.8889


In [None]:
#TfidfVectorizer + MultinomilaNB wins with accuracy = 89.19%

#NOTEs : 
# 1) Naive Bayes : simple, fast, interpretable, assumes features independence, works well for text
# 2) MultinomialNB : Best for counts/TF_IDF, assumes word frequencies as multinomial distribution.
# 3) BernoulliNB : Binary features (present/absent), useful if only word presence matters.
# 4) BOW vs TF-IDF : BOW=rawa frequency , TF-IDF : frequency weighted by importance, usually better for long text(like IMBD reviews)
# 5) Evaluation : use accuracy and classificartion report(precision/recall/F1), important because accuracy alone may hide class imbalance.

---
- From here question number 5th starts of assignment-1 i.e. applying Logistic Regression on above data and then tune the hyperparameters of logistic regression.

#### Why Logistic Regression ? 
- Logistic Regression is a linear model that workds well for **high dimensional spares data** (like text after TF-IDF)
- Unlike Naive Bayes, it learns weights for each word feature by maximizing likelihood.
- Often gives better results with proper regularization ('C' paramter)
---

In [None]:
# Step0 : Load Dataset
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("IMDB_Dataset.csv")

#features :
X = data['review']

#convert labels to 0/1
y = data['sentiment'].map({'positive':1, 'negative':0})

In [5]:
# Step 1: Train-Test Split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [6]:
# Step 2: TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

#Fit on trainig
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

#Transform on validation
X_val_tfidf = tfidf_vectorizer.transform(X_val)

In [None]:
#Step 4: Hyperparameter Tuning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid
param_grid = {
    'C' : [0.01, 0.1, 1, 10],  
    'penalty' : ['l1', 'l2'],
    'solver' : ['liblinear']
}

#GridSearchCV
grid_search = GridSearchCV(LogisticRegression(max_iter=1000),
                           param_grid,
                           cv=3,
                           scoring='accuracy',
                           n_jobs=-1)
grid_search.fit(X_train_tfidf, y_train)


#Fit
grid_search.fit(X_train_tfidf, y_train)

#Results
print("Best Parameters:", grid_search.best_params_)
print("Best CV Accuracy:", grid_search.best_score_)

Best Parameters: {'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}
Best CV Accuracy: 0.8824499711037849


In [11]:
# Step 5: Evaluate tuned model
from sklearn.metrics import accuracy_score, classification_report


best_log_reg = grid_search.best_estimator_
y_pred_best = best_log_reg.predict(X_val_tfidf)

print("Validation Accuracy (Tuned Logistic Regression):", accuracy_score(y_val, y_pred_best))
print(classification_report(y_val, y_pred_best, digits=4))

Validation Accuracy (Tuned Logistic Regression): 0.89
              precision    recall  f1-score   support

           0     0.8960    0.8824    0.8892      5000
           1     0.8842    0.8976    0.8908      5000

    accuracy                         0.8900     10000
   macro avg     0.8901    0.8900    0.8900     10000
weighted avg     0.8901    0.8900    0.8900     10000



#### Classification report read how?
1. Accuracy
- mtlb ki kitna % model sahi samples classify kr rha h.
- lekin agar imbalanced ho dataset(e.g. 90% class-0, 10% class-1) toh sirf accuracy dekhna galat hoga
- issi liye hamesha precision, recall, f1-score bhi dekhe jate h.

2. Precision
- Out of all predictions for a class, how many were correct
- formula : TP / (TP + FP)
- example: agar model ne 100 ko "Positive" bola aur usme 90 hi actually poisitve nikle -> precision=90%
- **NOTE** High precision useful when false positive are dangerous (e.g. spam filter, fraud detection)

3. Recall
- Out of all actual instances of a class, how many did model catch 
- Formula : TP / (TP + FN)
- example : agar 100 positive cases h aur model ne 90 sahi pakde -> recall = 90%
- **NOTE** high recall useful jab false negatives costly ho(e.g. cancer detection, fraud detection)

4. F1_Score
- Harmonic mean of Precision & Recall
- Formula : 2 * (Precision * Recall) / (Precision + Recall)
- It balances both precision & recall

5. Marco vs Weighted Average 
- macro avg -> (simple average of metrics across classes)
- weighted avg -> (average weighted by number of samples per class)
---
##### What's Best/Worst? :
1. Accuracy : 
- accha indicator sirf balanced datasets m.

2. Precision VS Recall : 
- agar kaam “false positives avoid karna” hai -> precision high hona best.
- Agar “false negatives avoid karna” hai -> recall high hona best.

3. F1-Score :
- dono ko balance krta hai -> general best metric NLP tasks m.
---