Upload baseline model

In [1]:
from google.colab import files
uploaded = files.upload()  # choose baseline_logreg_tfidf.joblib from your PC

Saving baseline_logreg_tfidf.joblib to baseline_logreg_tfidf.joblib


Load it

In [2]:
import joblib
baseline = joblib.load("baseline_logreg_tfidf.joblib")

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


Upload DistilBERT model

In [4]:
files.upload()


Saving distilbert_imdb_minimal.zip to distilbert_imdb_minimal.zip


Unzip

In [5]:
!unzip distilbert_imdb_minimal.zip

Archive:  distilbert_imdb_minimal.zip
   creating: distilbert_imdb_minimal/
  inflating: distilbert_imdb_minimal/tokenizer.json  
  inflating: distilbert_imdb_minimal/vocab.txt  
  inflating: distilbert_imdb_minimal/special_tokens_map.json  
  inflating: distilbert_imdb_minimal/model.safetensors  
  inflating: distilbert_imdb_minimal/tokenizer_config.json  
  inflating: distilbert_imdb_minimal/config.json  


Load

In [6]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert_imdb_minimal")
bert_model = AutoModelForSequenceClassification.from_pretrained("distilbert_imdb_minimal")
bert_model.to("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU when no GPU is attached
bert_model.eval()



DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


Rebuild the SAME validation split

In [7]:
# Download + extract IMDB dataset (again, in this new runtime)
!wget -q https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xzf aclImdb_v1.tar.gz
!rm aclImdb_v1.tar.gz

# sanity check
import os
print("Exists:", os.path.exists("aclImdb/train/pos"))
print("Train pos:", len(os.listdir("aclImdb/train/pos")))
print("Train neg:", len(os.listdir("aclImdb/train/neg")))


Exists: True
Train pos: 12500
Train neg: 12500


In [8]:
import re
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split

def clean_minimal(text):
    text = re.sub(r"<br\s*/?>", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def load_train_split():
    base = Path("aclImdb/train")
    rows = []
    for label_name, label_int in [("pos", 1), ("neg", 0)]:
        for f in (base / label_name).glob("*.txt"):
            rows.append((clean_minimal(f.read_text(encoding="utf-8", errors="ignore")), label_int))
    return pd.DataFrame(rows, columns=["text", "label"])

df = load_train_split()

idx = np.arange(len(df))
train_idx, val_idx = train_test_split(
    idx,
    test_size=0.2,
    random_state=42,
    stratify=df["label"].values
)

val_df = df.iloc[val_idx].reset_index(drop=True)
val_df.head()


Unnamed: 0,text,label
0,"With such actors as Ralph Richardson, Raymond ...",1
1,"""Mr. Harvey Lights a Candle"" is anchored by a ...",1
2,"When I saw this ""documentary"", I was disappoin...",0
3,"For starters and for the record, the term ""Nec...",0
4,This movie is an incredible piece of work. It ...,1
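The split above only matches the original one if the rows of `df` arrive in the same order as they did at training time, and `Path.glob` does not guarantee a stable file order across machines or runtimes. A minimal sketch of one way to check this, assuming you can run the same helper in the training notebook and compare the two values; `split_fingerprint` is a hypothetical name, not part of the original code:

```python
import hashlib

def split_fingerprint(texts):
    """Order-independent SHA-256 fingerprint of a collection of texts."""
    # Hash each text separately, then sort the digests so that row order
    # inside the split does not affect the result.
    digests = sorted(hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts)
    return hashlib.sha256("".join(digests).encode("ascii")).hexdigest()

# Toy demonstration: same members in a different order give the same fingerprint,
# different members give a different one.
same_a = split_fingerprint(["great film", "awful film"])
same_b = split_fingerprint(["awful film", "great film"])
other = split_fingerprint(["great film", "mediocre film"])
print(same_a == same_b, same_a == other)
```

In this notebook the call would be `split_fingerprint(val_df["text"])`. If the value differs from the one computed on the training-time validation set, the two splits do not contain the same reviews, and some "validation" rows here may have been seen during training.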


Get predictions from BOTH models

In [9]:
# baseline
val_df["baseline_pred"] = baseline.predict(val_df["text"])
val_df["baseline_prob"] = baseline.predict_proba(val_df["text"])[:, 1]

In [10]:
# DistilBERT
def bert_predict(texts, batch_size=32):
    all_preds, all_probs = [], []
    device = bert_model.device

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        enc = tokenizer(
            batch,
            return_tensors="pt",
            truncation=True,
            padding=True,
            max_length=256
        ).to(device)

        with torch.no_grad():
            logits = bert_model(**enc).logits
            probs = logits.softmax(dim=1).cpu().numpy()

        all_preds.append(probs.argmax(axis=1))
        all_probs.append(probs[:, 1])

    return np.concatenate(all_preds), np.concatenate(all_probs)

bert_preds, bert_probs = bert_predict(val_df["text"].tolist())
val_df["bert_pred"] = bert_preds
val_df["bert_prob"] = bert_probs


Create the error structure

In [11]:
val_df["baseline_correct"] = val_df["baseline_pred"] == val_df["label"]
val_df["bert_correct"] = val_df["bert_pred"] == val_df["label"]

In [12]:
bw_br = val_df[(~val_df.baseline_correct) & (val_df.bert_correct)]
br_bw = val_df[(val_df.baseline_correct) & (~val_df.bert_correct)]
both_wrong = val_df[(~val_df.baseline_correct) & (~val_df.bert_correct)]
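Before sampling from the three subsets, it helps to know how large each one is. The following is a minimal sketch on a toy frame with the same two boolean columns; `outcome_counts` and the toy data are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for val_df so the sketch runs on its own.
toy = pd.DataFrame({
    "baseline_correct": [True, True, False, False, True, False],
    "bert_correct":     [True, False, True, False, True, True],
})

def outcome_counts(df):
    """Count rows in each of the four correct/incorrect combinations."""
    b = df["baseline_correct"]
    t = df["bert_correct"]
    return {
        "both_correct": int((b & t).sum()),
        "baseline_wrong_bert_right": int((~b & t).sum()),
        "baseline_right_bert_wrong": int((b & ~t).sum()),
        "both_wrong": int((~b & ~t).sum()),
    }

counts = outcome_counts(toy)
print(counts)
```

Calling `outcome_counts(val_df)` in the notebook gives the sizes of `bw_br`, `br_bw`, and `both_wrong`, which indicates how much weight a sample of five rows from each can carry.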

Let's take a closer look

In [14]:
bw_br.sample(5)[["text", "label", "baseline_pred", "bert_pred"]]

Unnamed: 0,text,label,baseline_pred,bert_pred
2779,I completely disagree with the other comments ...,1,0,1
4836,VAMPYRES Aspect ratio: 1.85:1 Sound format: Mo...,0,1,0
3260,There is no relation at all between Fortier an...,1,0,1
736,Say what you want about Andy Milligan - but if...,1,0,1
1296,Although Bette Davis did a WONDERFUL job as Mi...,0,1,0


In [15]:
br_bw.sample(5)[["text", "label", "baseline_pred", "bert_pred"]]

Unnamed: 0,text,label,baseline_pred,bert_pred
3447,Sorry to say I have no idea what Hollywood is ...,1,1,0
4345,This movie is funny and painful at the same ti...,0,0,1
3838,If you've ever been harassed on the Undergroun...,0,0,1
4379,this movie had a lot of blood in it when the s...,0,0,1
4841,After hearing about George Orwell's prophetic ...,0,0,1


In [16]:
both_wrong.sample(5)[["text", "label", "baseline_pred", "bert_pred"]]

Unnamed: 0,text,label,baseline_pred,bert_pred
347,"This is probably the first entry in the ""Lance...",0,1,1
1367,"i expected something different:more passion,dr...",1,0,0
196,Man To Man tries hard to be a good movie: it h...,0,1,1
4387,Diana Guzman is an angry young woman. Survivin...,0,1,1
1173,When i finally had the opportunity to watch Zo...,0,1,1


The “fight strategy” report

We expect to discover things like:
Logistic Regression fails at sarcasm
BERT fails at long texts with multiple sentiment shifts
Baseline wins on short, punchy negative reviews
BERT wins on subtle emotional language
Both models struggle with “balanced sentiment” reviews
BERT misclassifies reviews with domain-specific vocabulary (old movies, musicals, etc.)
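Several of these expectations are about review length, which can be checked directly rather than by reading samples. A rough sketch, assuming whitespace-separated word counts are an acceptable proxy for length; the helper `length_by_outcome`, the group labels, and the toy rows are made up for illustration:

```python
import pandas as pd

def length_by_outcome(df):
    """Median word count per baseline/DistilBERT outcome group."""
    out = df.copy()
    # Whitespace word count as a cheap proxy for review length.
    out["n_words"] = out["text"].str.split().str.len()
    out["outcome"] = (
        out["baseline_correct"].map({True: "baseline_ok", False: "baseline_wrong"})
        + " / "
        + out["bert_correct"].map({True: "bert_ok", False: "bert_wrong"})
    )
    return out.groupby("outcome")["n_words"].median()

# Toy stand-in for val_df.
toy = pd.DataFrame({
    "text": [
        "good film",
        "bad",
        "slow start but a great ending overall",
        "not bad at all",
        "meh",
    ],
    "baseline_correct": [True, True, False, False, True],
    "bert_correct":     [True, True, True, True, False],
})

medians = length_by_outcome(toy)
print(medians)
```

Running `length_by_outcome(val_df)` would show whether the groups differ in length at all, which the long-text hypotheses above depend on.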

Baseline wrong / DistilBERT correct

In the subset where the baseline model misclassified but DistilBERT was correct, reviews were typically long, discursive, and argument-driven. These texts often contained negative language early on but resolved positively after contextual justification. Bag-of-words models fail to capture such long-range discourse structure, while transformer-based models are better suited to handling contextual shifts and contrastive reasoning.

DistilBERT wrong / Baseline correct

In the subset where DistilBERT misclassified but the baseline model was correct, reviews were often long but lacked a coherent argumentative structure. These texts frequently resembled stream-of-consciousness rants with dense lexical sentiment and significant topic drift. In such cases, sentiment polarity was conveyed primarily through the frequency of strongly polarized words rather than through structured reasoning. Frequency-based bag-of-words models benefited from this lexical density, while the transformer model appeared vulnerable to semantic dilution and input truncation effects.
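The truncation explanation can be tested by measuring how many reviews in this subset exceed the `max_length=256` used in `bert_predict`. The sketch below is one possible check; `truncated_share` is a hypothetical helper, and the whitespace counter is only a stand-in so the example runs without the model files:

```python
def truncated_share(texts, count_tokens, max_length=256):
    """Fraction of texts whose token count exceeds max_length."""
    if not texts:
        return 0.0
    over = sum(1 for t in texts if count_tokens(t) > max_length)
    return over / len(texts)

def whitespace_count(text):
    """Stand-in token counter (assumption): number of whitespace-separated words."""
    return len(text.split())

# Token counts under the stand-in counter: 300, 2, 257, 256.
sample = ["word " * 300, "short review", "word " * 257, "word " * 256]
share = truncated_share(sample, whitespace_count)
print(share)
```

In the notebook, the counter could instead be `lambda t: len(tokenizer(t)["input_ids"])`, applied to `br_bw["text"].tolist()` and to `val_df["text"].tolist()` for comparison. A clearly higher share in `br_bw` would support the truncation explanation; a similar share would point toward the other factors.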

Both models incorrect

In the subset where both the baseline and the transformer model were wrong, reviews were often genuinely mixed or neutral in tone. In many of these cases, even a human reader could not clearly assign the text to a binary positive or negative category. This suggests that these errors arise primarily from how the labels are defined (binary labels derived from ratings, with neutral sentiment excluded) rather than from insufficient model capacity.
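If these reviews really are ambiguous, one would expect both models to be less confident on them than on the rest of the validation set. A small sketch of that check, assuming distance from the 0.5 decision threshold is a reasonable confidence measure; `mean_confidence` and the numbers below are illustrative only:

```python
def mean_confidence(probs):
    """Mean distance of positive-class probabilities from the 0.5 threshold."""
    probs = list(probs)
    if not probs:
        raise ValueError("probs must not be empty")
    return sum(abs(p - 0.5) for p in probs) / len(probs)

# Made-up probabilities: confident predictions vs. near-threshold ones.
confident = mean_confidence([1.0, 0.0, 0.75, 0.25])
hesitant = mean_confidence([0.5, 0.5, 0.75, 0.25])
print(confident, hesitant)
```

In the notebook, comparing `mean_confidence(both_wrong["bert_prob"])` with `mean_confidence(val_df["bert_prob"])`, and the same for `baseline_prob`, would show whether the models hesitate on these reviews or are confidently wrong, which would call for a different explanation.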