<a href="https://colab.research.google.com/github/MarkoMilenovic01/Sentiment-Analysis---DL-Models-review/blob/main/Sentiment_Analysis_Classical_ML_Approach.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 🧪 Experiment Summary: Classical ML with Enhanced Preprocessing

### Dataset
- **Source**: IMDB 50,000 Movie Reviews
- **Split**: 80% training / 20% testing
- **Task**: Binary Sentiment Classification (positive vs. negative)

---

### 🔄 Preprocessing (Enhanced)
- **Contraction expansion** (e.g. "don't" → "do not")
- **Emoticon mapping** (e.g. `:)` → "smile")
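- **Removal**: HTML tags, links, and special characters; standalone numbers are replaced with a `NUM` placeholder token
- **Negation tagging**, prefixing the negator onto the next word (e.g. "not good" → "not not_good")
- **Lemmatization** (WordNet)
- **Stopword removal**, with negation words (`no`, `not`, `never`, `hardly`) retained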
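
The two negation-related steps can be sketched in isolation. This is a minimal, stdlib-only illustration rather than the notebook's full pipeline: `TOY_STOP` is a hypothetical stand-in for NLTK's English stopword list, and `tag_and_filter` is a name introduced only for this example.

```python
# Negators that get prefixed onto the next word and are exempt from stopword removal.
NEGATORS = {"no", "not", "never", "hardly"}

# Hypothetical stand-in for NLTK's English stopword list (assumption for this sketch).
TOY_STOP = {"the", "a", "is", "was", "this", "it", "no", "not"} - NEGATORS


def tag_and_filter(text: str) -> str:
    """Prefix the word after each negator, then drop stopwords."""
    words = text.lower().split()
    for i, word in enumerate(words):
        if word in NEGATORS and i + 1 < len(words):
            words[i + 1] = word + "_" + words[i + 1]
    return " ".join(w for w in words if w not in TOY_STOP)


print(tag_and_filter("The movie was not good"))  # movie not not_good
print(tag_and_filter("It is never boring"))      # never never_boring
```

Because the negator itself survives stopword removal, both the plain negator and the fused token (`not_good`) reach the vectorizer as separate features.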

---

### 🔧 TF-IDF Vectorization
- `ngram_range=(1, 2)` → unigrams and bigrams
- `min_df=2`, `max_df=0.9` → drop terms that appear in fewer than 2 documents or in more than 90% of documents
- `sublinear_tf=True` → log-scaled term frequency
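
To make these settings concrete, here is a small stdlib-only sketch of the weighting formulas. It assumes scikit-learn's documented behavior for `sublinear_tf=True` and the default `smooth_idf=True`, and it leaves out the final L2 normalization of each document vector. The function names are introduced here for illustration only.

```python
import math


def sublinear_tf(count: int) -> float:
    """Log-scaled term frequency: 1 + ln(count) for count > 0, else 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0


def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0


def keep_term(doc_freq: int, n_docs: int, min_df: int = 2, max_df: float = 0.9) -> bool:
    """Document-frequency filter: at least min_df documents, at most a max_df share."""
    return min_df <= doc_freq <= max_df * n_docs


for count in (1, 10, 100):
    print(count, round(sublinear_tf(count), 3))

# With 40,000 training reviews, a term seen in only one review is filtered out.
print(keep_term(doc_freq=1, n_docs=40_000))
```

With log scaling, a term repeated 100 times in a review gets a raw weight of about 5.6 instead of 100, which keeps long, repetitive reviews from dominating.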

---

### 📊 Results

#### Logistic Regression (`C=1.0`)
- **Accuracy**: 91.09%
- **Macro F1 Score**: 0.911

**Confusion Matrix** (rows: actual class, columns: predicted class; order: negative, positive):
 [[4513  487]
 [ 404 4596]]

---

#### Linear SVM (`C=1.0`)
- **Accuracy**: **91.64%**
- **Macro F1 Score**: **0.916**

**Confusion Matrix** (rows: actual class, columns: predicted class; order: negative, positive):
 [[4540  460]
 [ 376 4624]]
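
The accuracy and macro F1 figures above follow directly from the two confusion matrices. The sketch below recomputes them in plain Python as a sanity check; `metrics_from_confusion` is a helper name introduced here, not part of the notebook.

```python
def metrics_from_confusion(cm):
    """Return (accuracy, macro_f1) for a square confusion matrix.

    Rows are true classes and columns are predicted classes.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n_classes)) / total

    f1_scores = []
    for k in range(n_classes):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                               # true k, predicted otherwise
        fp = sum(cm[r][k] for r in range(n_classes)) - tp  # predicted k, true otherwise
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return accuracy, sum(f1_scores) / n_classes


logreg_cm = [[4513, 487], [404, 4596]]
svm_cm = [[4540, 460], [376, 4624]]

for name, cm in (("Logistic Regression", logreg_cm), ("Linear SVM", svm_cm)):
    acc, macro_f1 = metrics_from_confusion(cm)
    print(f"{name}: accuracy={acc:.4f}, macro F1={macro_f1:.4f}")
```

The gap between the two models amounts to 55 more correctly classified reviews out of 10,000 for the SVM.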


---

### 🧠 Conclusion

Even without deep learning, **TF-IDF combined with linear classifiers performs well** on IMDB sentiment analysis when paired with enhanced preprocessing, reaching roughly 91% test accuracy:

- Linear SVM slightly outperforms Logistic Regression (91.64% vs. 91.09% accuracy).
- Classical models remain strong baselines, especially when enhanced with **semantic preprocessing** like lemmatization and negation handling.

These results suggest that **clean, engineered features + robust linear models** can be competitive with more complex approaches on medium-sized datasets.


## Importing Libraries

In [None]:
import re, string
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

## Importing Dataset and Preprocessing Phase

In [None]:
dataset = pd.read_csv("IMDB Dataset.csv")
print(dataset.head())
print(len(dataset))

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
50000


In [None]:
!pip install contractions
import nltk
nltk.download('stopwords')
nltk.download('wordnet')




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
import re
import contractions
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

lemmatizer = WordNetLemmatizer()
STOP = set(stopwords.words('english')) - {"no", "not", "never", "hardly"}

EMOTICON_MAP = {
    ":)": " smile ",
    ":-)": " smile ",
    ":(": " sad ",
    ":-(": " sad ",
    ";)": " wink ",
}

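def clean_text_enhanced(text: str) -> str:
    # Expand contractions first so negations surface as separate words ("don't" -> "do not").
    text = contractions.fix(text)

    # Map emoticons to word tokens before punctuation is stripped.
    for emo, tok in EMOTICON_MAP.items():
        text = text.replace(emo, tok)

    # Drop HTML tags and URLs.
    text = re.sub(r"<.*?>|http\S+|www\.\S+", " ", text)

    # Replace standalone numbers with a placeholder token ("num" after lowercasing).
    # A bracketed "[NUM]" would lose its brackets in the next step, so emit the bare token.
    text = re.sub(r"\b\d+\b", " NUM ", text)

    # Keep letters, digits, whitespace, and basic punctuation; blank out everything else.
    text = re.sub(r"[^A-Za-z0-9\s!?.,]", " ", text)

    # Collapse repeated punctuation ("!!!" -> "!").
    text = re.sub(r"([!?.,])\1+", r"\1", text)

    # Negation tagging: prefix the word after a negator ("not good" -> "not not_good").
    words = text.split()
    for i, word in enumerate(words):
        if word.lower() in ["no", "not", "never", "hardly"] and i + 1 < len(words):
            words[i+1] = word.lower() + "_" + words[i+1]
    text = " ".join(words)

    # Lowercase, drop stopwords (negators are kept), and lemmatize what remains.
    tokens = text.lower().split()
    clean_tokens = []
    for tok in tokens:
        if tok not in STOP:
            lemma = lemmatizer.lemmatize(tok)
            clean_tokens.append(lemma)

    return " ".join(clean_tokens)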


dataset["cleaned_review"] = dataset["review"].apply(clean_text_enhanced)
dataset["encoded_sentiment"] = (dataset["sentiment"] == "positive").astype(int)


## Train Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    dataset["cleaned_review"],
    dataset["encoded_sentiment"],
    test_size=0.20,
    random_state=42,
    stratify=dataset["encoded_sentiment"],
)
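
The `stratify` argument keeps the class ratio identical in both partitions, which is why each class has a support of exactly 5,000 in the test reports. The sketch below illustrates the idea with the standard library only. It is not scikit-learn's implementation, and `stratified_split` is a hypothetical helper; the balanced 25,000/25,000 label list is inferred from the 5,000/5,000 test support.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split by the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


labels = [0] * 25000 + [1] * 25000
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))           # 40000 10000
print(sum(labels[i] for i in test_idx))        # 5000 positives in the test set
```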


## TF-IDF Feature Extraction


In [None]:
tfidf = TfidfVectorizer(
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.9,
    sublinear_tf=True,
)

X_train_vec = tfidf.fit_transform(X_train)
X_test_vec  = tfidf.transform(X_test)


## Logistic Regression

In [None]:
logreg = LogisticRegression(
    C=1.0,
    solver="liblinear",
    max_iter=1000,
    class_weight="balanced",
)
logreg.fit(X_train_vec, y_train)
y_pred_lr = logreg.predict(X_test_vec)

print("\n=== Logistic Regression ===")
print("Accuracy :", accuracy_score(y_test, y_pred_lr))
print("Macro F1 :", f1_score(y_test, y_pred_lr, average="macro"))
print(classification_report(y_test, y_pred_lr, digits=3))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred_lr))



=== Logistic Regression ===
Accuracy : 0.9109
Macro F1 : 0.9108938614781172
              precision    recall  f1-score   support

           0      0.918     0.903     0.910      5000
           1      0.904     0.919     0.912      5000

    accuracy                          0.911     10000
   macro avg      0.911     0.911     0.911     10000
weighted avg      0.911     0.911     0.911     10000

Confusion matrix:
 [[4513  487]
 [ 404 4596]]


## Support Vector Machine
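
The code cell for this section is missing from the export; only its output follows. A minimal sketch of what it may have looked like, assuming it mirrored the logistic-regression cell and used `LinearSVC` (imported at the top) with `C=1.0` as stated in the summary. Any other hyperparameters are unknown, and `evaluate_linear_svm` plus the toy corpus are introduced here only so the block runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from sklearn.svm import LinearSVC


def evaluate_linear_svm(X_train_vec, y_train, X_test_vec, y_test, C=1.0):
    """Fit a linear SVM and print the same metrics as the logistic-regression cell."""
    svm = LinearSVC(C=C)
    svm.fit(X_train_vec, y_train)
    y_pred = svm.predict(X_test_vec)

    print("\n=== Linear SVM ===")
    print("Accuracy :", accuracy_score(y_test, y_pred))
    print("Macro F1 :", f1_score(y_test, y_pred, average="macro"))
    print(classification_report(y_test, y_pred, digits=3))
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
    return svm, y_pred


# Toy corpus (assumption) so the sketch is self-contained; in the notebook the
# arguments would be X_train_vec, y_train, X_test_vec, y_test from the cells above.
toy_train = ["great film", "loved it", "wonderful acting",
             "terrible film", "hated it", "awful acting"]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_test = ["great acting", "terrible acting"]
toy_test_labels = [1, 0]

toy_tfidf = TfidfVectorizer()
svm, toy_pred = evaluate_linear_svm(
    toy_tfidf.fit_transform(toy_train), toy_labels,
    toy_tfidf.transform(toy_test), toy_test_labels,
)
```

Called with the notebook's own `X_train_vec`, `y_train`, `X_test_vec`, and `y_test`, a cell along these lines would print a report in the format shown below; the toy corpus produces different numbers.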


=== Linear SVM ===
Accuracy : 0.9164
Macro F1 : 0.9163941007677502
              precision    recall  f1-score   support

           0      0.924     0.908     0.916      5000
           1      0.910     0.925     0.917      5000

    accuracy                          0.916     10000
   macro avg      0.917     0.916     0.916     10000
weighted avg      0.917     0.916     0.916     10000

Confusion matrix:
 [[4540  460]
 [ 376 4624]]
