# Results Analysis — Baseline Models

This notebook evaluates the baseline sentiment classifiers trained in `02_model_development.ipynb`, using the test split created by the DVC pipeline.
We report standard classification metrics for each model and then inspect the misclassified reviews.

In [1]:
import pandas as pd
import joblib

test = pd.read_pickle("../data/processed/test.pkl")

X_test = test["text"].astype(str)
y_test = test["label"].astype(int)

vectorizer = joblib.load("../data/models/tfidf_vectorizer.joblib")
logreg = joblib.load("../data/models/logreg_baseline.joblib")
svm = joblib.load("../data/models/linearsvc_baseline.joblib")

X_test_vec = vectorizer.transform(X_test)


In [2]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

pred_logreg = logreg.predict(X_test_vec)

print("LogReg Accuracy:", accuracy_score(y_test, pred_logreg))
print(classification_report(y_test, pred_logreg))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred_logreg))

LogReg Accuracy: 0.88272
              precision    recall  f1-score   support

           0       0.88      0.88      0.88     12500
           1       0.88      0.88      0.88     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000

Confusion Matrix:
 [[11016  1484]
 [ 1448 11052]]


In [3]:
pred_svm = svm.predict(X_test_vec)

print("LinearSVC Accuracy:", accuracy_score(y_test, pred_svm))
print(classification_report(y_test, pred_svm))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred_svm))


LinearSVC Accuracy: 0.87556
              precision    recall  f1-score   support

           0       0.87      0.89      0.88     12500
           1       0.89      0.86      0.87     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000

Confusion Matrix:
 [[11099  1401]
 [ 1710 10790]]
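
As a sanity check, the headline figures in both reports can be recomputed by hand from the confusion matrices. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and copies the four counts from the outputs above; the helper name `rates_from_confusion` is illustrative and not part of the notebook.

```python
def rates_from_confusion(tn, fp, fn, tp):
    """Return accuracy, precision, and recall for the positive class (label 1)."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positives": fp,
        "false_negatives": fn,
    }

# Counts copied from the confusion matrices printed above: [[tn, fp], [fn, tp]]
logreg_rates = rates_from_confusion(11016, 1484, 1448, 11052)
svm_rates = rates_from_confusion(11099, 1401, 1710, 10790)

for name, rates in [("LogReg", logreg_rates), ("LinearSVC", svm_rates)]:
    print(name, {key: round(value, 5) for key, value in rates.items()})
```

Read this way, LinearSVC makes fewer false positives (1,401 vs. 1,484) but more false negatives (1,710 vs. 1,448), which is consistent with its higher precision and lower recall on class 1.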


In [4]:
# pandas and accuracy_score are already imported in earlier cells
from sklearn.metrics import precision_recall_fscore_support

def summarize(y_true, y_pred, name):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
    acc = accuracy_score(y_true, y_pred)
    return {"model": name, "accuracy": acc, "precision": p, "recall": r, "f1": f1}

summary = pd.DataFrame([
    summarize(y_test, pred_logreg, "TF-IDF + LogReg"),
    summarize(y_test, pred_svm, "TF-IDF + LinearSVC")
])

summary


Unnamed: 0,model,accuracy,precision,recall,f1
0,TF-IDF + LogReg,0.88272,0.881621,0.88416,0.882889
1,TF-IDF + LinearSVC,0.87556,0.885079,0.8632,0.874003
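
The accuracy gap between the two models is about 0.7 percentage points, so a rough sense of sampling noise is useful when reading the table. The sketch below uses a normal approximation to the binomial for each accuracy separately. That choice is an assumption made for illustration: it ignores the fact that both models are scored on the same 25,000 reviews, and a paired test such as McNemar's would be the more careful comparison. The helper name `accuracy_interval` is hypothetical.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n examples."""
    se = math.sqrt(acc * (1.0 - acc) / n)  # binomial standard error
    return acc - z * se, acc + z * se

n_test = 25000  # total support reported in the classification reports
for name, acc in [("LogReg", 0.88272), ("LinearSVC", 0.87556)]:
    low, high = accuracy_interval(acc, n_test)
    print(f"{name}: {acc:.5f} (approx. 95% interval {low:.4f} to {high:.4f})")
```

Under this approximation each interval has a half-width of roughly 0.4 percentage points, so the ranking of the two baselines should be read as suggestive rather than settled.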


In [5]:
errors = test.copy()
errors["pred_svm"] = pred_svm
errors["wrong"] = errors["pred_svm"] != errors["label"]

errors[errors["wrong"]].head(10)[["text", "label", "pred_svm"]]


Unnamed: 0,text,label,pred_svm
4,"First off let me say, If you haven't enjoyed a...",0,1
11,"Blind Date (Columbia Pictures, 1934), was a de...",0,1
20,Low budget horror movie. If you don't raise yo...,0,1
32,I'm the type of guy who loves hood movies from...,0,1
36,"Beware, My Lovely (1952) Dir: Harry Horner <br...",0,1
38,Artificial melodrama with a screenplay adapted...,0,1
79,"It's Saturday, it's raining, and I think every...",0,1
85,Art-house horror tries to use unconventional a...,0,1
87,"I bought this on VHS as ""Terror Hospital"", and...",0,1
92,"Maniratnam, who in India, is often compared wi...",0,1


In [6]:
# Ten shortest reviews misclassified by LinearSVC
svm_wrong = errors[errors["wrong"]].assign(length=lambda d: d["text"].str.len())
svm_wrong.sort_values("length").head(10)[["text", "label", "pred_svm"]]


Unnamed: 0,text,label,pred_svm
8658,"More suspenseful, more subtle, much, much more...",0,1
16042,"Very intelligent language usage of Ali, which ...",1,0
2059,My first thoughts on this film were of using s...,0,1
1761,"a real hoot, unintentionally. sidney portier's...",0,1
20958,"It's not Citizen Kane, but it does deliver. Cl...",1,0
11186,"I couldn't stop laughing, I caught this again ...",0,1
20345,I don't care if some people voted this movie t...,1,0
19817,Worst horror film ever but funniest film ever ...,1,0
24818,Worst horror film ever but funniest film ever ...,1,0
14030,"Subject Matter: Cosmology, Quantum Physics and...",1,0


In [7]:
errors = test.copy()
errors["pred_logreg"] = pred_logreg  # reuse the predictions computed earlier
errors["wrong"] = errors["pred_logreg"] != errors["label"]
errors["length"] = errors["text"].astype(str).str.len()

errors[errors["wrong"]].sort_values("length")[["text","label","pred_logreg","length"]].head(10)


Unnamed: 0,text,label,pred_logreg,length
8658,"More suspenseful, more subtle, much, much more...",0,1,61
41,Widow hires a psychopath as a handyman. Sloppy...,0,1,129
16581,This movie turned out to be better than I had ...,1,0,139
3894,You've got to be kidding. This movie sucked fo...,0,1,140
13798,This movie is based on the novel Island of dr....,1,0,157
2059,My first thoughts on this film were of using s...,0,1,158
1761,"a real hoot, unintentionally. sidney portier's...",0,1,165
20958,"It's not Citizen Kane, but it does deliver. Cl...",1,0,168
11186,"I couldn't stop laughing, I caught this again ...",0,1,175
20345,I don't care if some people voted this movie t...,1,0,176


In [8]:
errors[errors["wrong"]].sort_values("length", ascending=False)[["text","label","pred_logreg","length"]].head(10)


Unnamed: 0,text,label,pred_logreg,length
21132,There's a sign on The Lost Highway that says:<...,1,0,12988
16250,"(Some spoilers included:)<br /><br />Although,...",1,0,12930
16512,"Back in the mid/late 80s, an OAV anime by titl...",1,0,12129
13153,If anyone ever assembles a compendium on moder...,1,0,9951
22375,The Merchant of Venice 8/10<br /><br />(This r...,1,0,7126
10847,The bearings of western-style Feminism on the ...,0,1,6031
23473,Hollywood movies since the 1930s have treated ...,1,0,5994
2505,* Some spoilers *<br /><br />This movie is som...,0,1,5884
15977,This enjoyable Euro-western opens with a scene...,1,0,5871
13592,Once again I must play something of the contra...,1,0,5861


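The shortest misclassified reviews give the models very little text to work with, and several combine negative wording with a positive verdict (for example, "Worst horror film ever but funniest film ever ..."). Among the ten longest reviews that Logistic Regression gets wrong, eight are positive reviews predicted as negative. Based on reading these samples, errors appear to be common when reviews contain mixed praise and criticism, sarcasm, or heavy plot context; this is a qualitative impression, not a measured breakdown.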
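
One way to check whether review length is related to errors, rather than relying on the ten shortest and longest examples, is to bucket reviews by length and compare the error rate per bucket. The sketch below runs on a small made-up frame so that it is self-contained. The column names mirror the `errors` frame used above, while `toy`, `length_bucket`, the number of buckets, and all row values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the notebook's `errors` frame (all values are made up).
toy = pd.DataFrame({
    "text": ["bad", "great fun", "not good at all", "a" * 40,
             "b" * 80, "c" * 120, "d" * 300, "e" * 900],
    "label":       [0, 1, 0, 1, 0, 1, 1, 0],
    "pred_logreg": [0, 1, 1, 1, 0, 1, 0, 1],
})
toy["wrong"] = toy["pred_logreg"] != toy["label"]
toy["length"] = toy["text"].str.len()

# Split reviews into four equal-sized length buckets, then compute
# the error rate and the number of reviews in each bucket.
toy["length_bucket"] = pd.qcut(toy["length"], q=4, labels=["q1", "q2", "q3", "q4"])
error_by_length = (
    toy.groupby("length_bucket", observed=True)["wrong"]
       .agg(error_rate="mean", n="count")
)

# Direction of the errors: how many misclassified reviews have each true label.
error_direction = toy.loc[toy["wrong"]].groupby("label").size()

print(error_by_length)
print(error_direction)
```

Swapping `toy` for the real `errors` frame would give a per-bucket error rate over all 25,000 test reviews, which is a more direct test of the length hypothesis than reading the extremes.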

## Conclusion
- Both baselines reach about 88% accuracy on the balanced 25,000-review test set.
- LinearSVC is often a strong choice for high-dimensional sparse text features, but on this test set it scores slightly below Logistic Regression (0.876 vs. 0.883 accuracy).
- Errors commonly occur with sarcasm, mixed sentiment, or ambiguous phrasing.
- Next steps: consider hyperparameter tuning, different n-gram ranges, or a neural model baseline.
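
For the tuning and n-gram items listed as next steps, one possible starting point is a grid search over a TF-IDF plus Logistic Regression pipeline. This is a sketch only: the toy texts, the grid values, and the step names `tfidf` and `clf` are assumptions, and a real search would run on the training split from `02_model_development.ipynb`, not on the test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up training set so the example runs on its own.
train_texts = [
    "great movie, loved it", "wonderful acting and story",
    "a delightful surprise", "best film of the year",
    "terrible plot, awful acting", "boring and far too long",
    "worst movie ever made", "a complete waste of time",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier are tuned together so each CV fold
# refits TF-IDF on its own training portion.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: unigrams vs. unigrams+bigrams, and three regularization strengths.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(train_texts, train_labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Keeping the vectorizer inside the pipeline means the n-gram setting is evaluated without leaking vocabulary statistics from the validation folds.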


## Final Summary
- Baseline models trained using TF-IDF features on DVC-prepared splits.
- Best model: TF-IDF + Logistic Regression (Accuracy ≈ 0.883, balanced precision/recall).
- Errors likely arise from ambiguous language, sarcasm, and mixed sentiment.
- Model artifacts saved and tracked with DVC for reproducibility.

## Key Findings
- TF-IDF + Logistic Regression performed best (accuracy ≈ 0.883, F1 ≈ 0.883).
- LinearSVC was slightly lower (accuracy ≈ 0.876, F1 ≈ 0.874): it produced more false negatives (1,710 vs. 1,448) but slightly fewer false positives (1,401 vs. 1,484).
- Misclassifications often occur in reviews with sarcasm, mixed sentiment, or ambiguous wording.
