In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [3]:
df = pd.read_csv("cleaned_dataset.csv")

In [4]:
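# Vectorize text with TF-IDF, keeping the 3,000 most frequent terms
# NOTE: the vectorizer is fit on all rows before the split, so the vocabulary and IDF
# weights also reflect the test rows; fitting on the training split only is stricter
vectorizer = TfidfVectorizer(max_features=3000, stop_words='english')
X = vectorizer.fit_transform(df['processed_text'])
y = df['label']

# Train/test split (80/20); random_state fixes the split, not the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model (no random_state is set on the forest, so scores can vary slightly between runs)
model = RandomForestClassifier()
model.fit(X_train, y_train)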

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

                      precision    recall  f1-score   support

       Advertisement       0.78      1.00      0.88         7
        Clean Review       0.86      1.00      0.92        42
  Irrelevant Content       1.00      0.50      0.67         8
Review without Visit       1.00      0.17      0.29         6

            accuracy                           0.86        63
           macro avg       0.91      0.67      0.69        63
        weighted avg       0.88      0.86      0.82        63



# Evaluation of Random Forest

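**Random Forest:**

- Accuracy: 0.86 (54 of the 63 test reviews classified correctly).

- Macro average F1 score: 0.69, showing moderate overall performance when all four classes are weighted equally.

- Weighted average F1 score: 0.82, higher than the macro average because Clean Review (42 of the 63 test samples) dominates the weighting. The model performs better on the majority class than on the minority classes.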
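
The gap between the two averages can be checked by hand. The sketch below recomputes both from the per-class F1 scores and supports printed in the report above; because those inputs are already rounded to two decimals, the results are only expected to match the report at that precision.

```python
# Per-class (F1, support) pairs copied from the classification report above.
report = {
    "Advertisement": (0.88, 7),
    "Clean Review": (0.92, 42),
    "Irrelevant Content": (0.67, 8),
    "Review without Visit": (0.29, 6),
}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(f"macro F1:    {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
```

Clean Review contributes 42/63 of the weighted average but only 1/4 of the macro average, which is why the weak minority classes pull the macro score down much further.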

**Per Class Analysis:**

*Advertisement*
- High recall (1.00) indicates the model catches all 7 ads in the test set, but precision (0.78) shows that some non-ads are also flagged as ads. Overall performance is strong, although the sample is small.

*Clean Review*
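- Best-performing class (F1 0.92), with high precision (0.86) and perfect recall (1.00), showing the model reliably recognizes normal reviews. The imperfect precision means some reviews from other classes are being labeled as Clean Review.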

*Irrelevant Content*
- Precision is perfect (1.00), but recall drops to 0.50: every review the model labels as Irrelevant Content really is irrelevant, but it misses half of the 8 true cases.

*Review without Visit*
- Weakest class, with very low recall (0.17, about 1 of the 6 true cases) despite perfect precision (1.00), so the model rarely detects such reviews. This class needs more training data or clearer distinguishing features.
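
To make the per-class numbers more concrete, the sketch below back-computes approximate true positive, false positive and false negative counts from the precision, recall and support values in the report. These counts are inferred from rounded figures, not read from an actual confusion matrix, and the helper name `implied_counts` is only illustrative.

```python
# (precision, recall, support) per class, copied from the classification report.
report = {
    "Advertisement": (0.78, 1.00, 7),
    "Clean Review": (0.86, 1.00, 42),
    "Irrelevant Content": (1.00, 0.50, 8),
    "Review without Visit": (1.00, 0.17, 6),
}


def implied_counts(precision, recall, support):
    """Infer (TP, FP, FN) from rounded precision/recall and the class support."""
    tp = round(recall * support)        # recall = TP / support
    fn = support - tp                   # true cases the model missed
    predicted = round(tp / precision) if precision > 0 else 0  # precision = TP / predicted
    fp = predicted - tp                 # other classes wrongly given this label
    return tp, fp, fn


counts = {name: implied_counts(*values) for name, values in report.items()}

for name, (tp, fp, fn) in counts.items():
    print(f"{name:<22} TP={tp:>2}  FP={fp}  FN={fn}")
```

Under these assumptions, the reviews missed in the two weak classes reappear as false positives for Advertisement and Clean Review, which fits the pattern of perfect recall but imperfect precision on those two classes.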

In [5]:
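# Vectorize every row in df with the already-fitted vectorizer.
# These are the same labeled rows used above (including the training split),
# so confidences on them are optimistic compared with truly unseen reviews.
X_all = vectorizer.transform(df['processed_text'])

# Get predicted probabilities (one column per class)
proba = model.predict_proba(X_all)

# Class names, in the same order as the columns of proba
labels = model.classes_

# Add top prediction and its probability
df['predicted_label'] = model.predict(X_all)
df['prediction_confidence'] = proba.max(axis=1)  # highest probability for each row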
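
As a small illustration of how the label and confidence columns relate to the probability matrix, the sketch below uses a made-up `proba` array in place of the real `model.predict_proba` output. The probability values and the 0.6 review threshold are assumptions chosen for the example, not values from this dataset.

```python
import numpy as np

# Class names in the column order a fitted model would expose via model.classes_.
classes = np.array(
    ["Advertisement", "Clean Review", "Irrelevant Content", "Review without Visit"]
)

# Hypothetical predict_proba output: one row per review, one column per class.
proba = np.array([
    [0.94, 0.03, 0.02, 0.01],
    [0.05, 0.55, 0.10, 0.30],
    [0.10, 0.40, 0.15, 0.35],
])

# The predicted label is the class whose column holds the row maximum.
predicted = classes[proba.argmax(axis=1)]

# The confidence is that maximum probability itself.
confidence = proba.max(axis=1)

# Assumed threshold: rows below it could be routed to manual review.
needs_review = confidence < 0.6

for label, conf, flag in zip(predicted, confidence, needs_review):
    print(f"{label:<15} confidence={conf:.2f} needs_review={flag}")
```

The third row shows why the confidence matters: the top class wins with only 0.40, barely ahead of Review without Visit at 0.35, so the hard label alone would hide how uncertain the prediction is.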

In [6]:
# 1) Define mapping from label -> flag column name
#    Keys must match the class names in the data exactly, otherwise the flag is always 0
label_to_flagcol = {
    "Advertisement": "Advertisement_Flag",
    "Irrelevant Content": "Irrelevant_Content_Flag",
    "Review without Visit": "Rant_without_Visit_Flag",
    "Clean Review": "Clean_Review_Flag",
}

# 2) Ensure predicted labels and confidences exist on df (normally set in the previous cell)
if "predicted_label" not in df.columns:
    df["predicted_label"] = labels[proba.argmax(axis=1)]
if "prediction_confidence" not in df.columns:
    df["prediction_confidence"] = proba.max(axis=1)  # highest probability per row

# 3) Create/overwrite the 4 flag columns on **df**
for lab, col in label_to_flagcol.items():
    df[col] = (df["predicted_label"] == lab).astype(int)

# 4) For printing, select 10 random rows from df so flags are present
to_print = df.sample(10, random_state=42)[["text", "prediction_confidence"] + list(label_to_flagcol.values())]

# 5) Print nicely (flags + confidence)
for idx, row in to_print.iterrows():
    print(f"Review #{idx}")
    print(f"Text:\n{row['text']}")
    print(f"Prediction Confidence: {row['prediction_confidence']}")
    print("Flags (1=true, 0=false):")
    print(f"  Advertisement_Flag:      {row['Advertisement_Flag']}")
    print(f"  Irrelevant_Content_Flag: {row['Irrelevant_Content_Flag']}")
    print(f"  Rant_without_Visit_Flag: {row['Rant_without_Visit_Flag']}")
    print(f"  Clean_Review_Flag:       {row['Clean_Review_Flag']}")
    print("-" * 80)


Review #228
Text:
Buy 2 get 1 free pizza! www.pizzabogo.com
Prediction Confidence: 0.94
Flags (1=true, 0=false):
  Advertisement_Flag:      1
  Irrelevant_Content_Flag: 0
  Rant_without_Visit_Flag: 0
  Clean_Review_Flag:       0
--------------------------------------------------------------------------------
Review #9
Text:
Great service and reasonable prices. Recommend!
Prediction Confidence: 0.93
Flags (1=true, 0=false):
  Advertisement_Flag:      0
  Irrelevant_Content_Flag: 0
  Rant_without_Visit_Flag: 0
  Clean_Review_Flag:       1
--------------------------------------------------------------------------------
Review #57
Text:
Prices was high.
Prediction Confidence: 0.85
Flags (1=true, 0=false):
  Advertisement_Flag:      0
  Irrelevant_Content_Flag: 0
  Rant_without_Visit_Flag: 0
  Clean_Review_Flag:       1
--------------------------------------------------------------------------------
Review #60
Text:
The place is nice; the employees are good. The food is delicious but the hu

# Comparison with Logistic Regression

Although Logistic Regression achieved slightly higher overall accuracy and F1-scores in this experiment, Random Forest can still be argued to be the better model from a theoretical standpoint. It captures non-linear feature interactions that a linear model cannot, it is fairly robust to noisy high-dimensional TF-IDF features, and it achieved higher recall on the critical Advertisement class, which suggests it copes better with the class imbalance in this dataset.

In addition, Random Forest offers interpretability through feature importances and is expected to scale well as the dataset grows. On these grounds it is a more generalizable and future-proof choice for real-world deployment than Logistic Regression, which is limited to a linear decision boundary.