# 04 — Evaluation & Error Analysis (Baselines)

**Dissertation context (what we are doing):**
This notebook turns the baseline experiments (TF-IDF + Logistic Regression / Naive Bayes) into *report-ready evidence*:
1) Reproducible evaluation (metrics + confusion matrix) on the validation split.
2) Error analysis to identify systematic failure modes (what the model confuses and why).
3) Clear justification for controlled baseline variants (Week 3) and transformer models (Week 4).

This supports the Method 3 research-led structure: experiments → evaluation → interpretation → justified next step.


In [1]:
from pathlib import Path
import json
import numpy as np
import pandas as pd


In [2]:
# From notebooks/04_evaluation.ipynb, repo-root results is typically ../results
RESULTS_DIR = Path("..") / "results"
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

metrics_path = RESULTS_DIR / "baseline_metrics.json"
cm_path = RESULTS_DIR / "confusion_matrix.csv"
preds_path = RESULTS_DIR / "valid_predictions.csv"

print("RESULTS_DIR:", RESULTS_DIR.resolve())
print("Exists baseline_metrics.json:", metrics_path.exists())
print("Exists confusion_matrix.csv:", cm_path.exists())
print("Exists valid_predictions.csv:", preds_path.exists())


RESULTS_DIR: /workspaces/fake-news-dissertation/results
Exists baseline_metrics.json: True
Exists confusion_matrix.csv: True
Exists valid_predictions.csv: True


In [3]:
# Load metrics JSON (if it exists)
baseline_metrics = None
if metrics_path.exists():
    with open(metrics_path, "r", encoding="utf-8") as f:
        baseline_metrics = json.load(f)
    print("Loaded baseline_metrics.json keys:", list(baseline_metrics.keys()))
else:
    print("baseline_metrics.json not found — we'll compute metrics from predictions instead.")

# Load confusion matrix CSV (if it exists)
cm_df = None
if cm_path.exists():
    cm_df = pd.read_csv(cm_path)
    print("Loaded confusion_matrix.csv shape:", cm_df.shape)
    print(cm_df.head())
else:
    print("confusion_matrix.csv not found — we'll compute it from predictions instead.")

# Load predictions CSV (required for reproducibility + error analysis)
preds_df = pd.read_csv(preds_path)
print("Loaded valid_predictions.csv shape:", preds_df.shape)
print("Columns:", list(preds_df.columns))
preds_df.head()


Loaded baseline_metrics.json keys: ['model', 'split', 'tfidf_ngram', 'stop_words', 'max_features', 'class_weight', 'accuracy', 'macro_precision', 'macro_recall', 'macro_f1', 'weighted_precision', 'weighted_recall', 'weighted_f1']
Loaded confusion_matrix.csv shape: (6, 7)
    Unnamed: 0  true  mostly-true  half-true  barely-true  false  pants-fire
0         true    38           42         43           13     32           1
1  mostly-true    46           62         71           28     43           1
2    half-true    30           54         58           35     69           2
3  barely-true    34           37         71           31     61           3
4        false    35           44         55           42     81           6
Loaded valid_predictions.csv shape: (1284, 3)
Columns: ['statement', 'true_label', 'pred_label']


Unnamed: 0,statement,true_label,pred_label
0,We have less Americans working now than in the...,barely-true,false
1,"When Obama was sworn into office, he DID NOT u...",pants-fire,true
2,Says Having organizations parading as being so...,false,false
3,Says nearly half of Oregons children are poor.,half-true,true
4,On attacks by Republicans that various program...,half-true,barely-true


In [4]:
def find_first_matching_col(df, candidates):
    for c in candidates:
        if c in df.columns:
            return c
    return None

TRUE_CANDIDATES = ["y_true", "true", "true_label", "label", "gold", "target", "actual"]
PRED_CANDIDATES = ["y_pred", "pred", "pred_label", "prediction", "predicted", "output"]

true_col = find_first_matching_col(preds_df, TRUE_CANDIDATES)
pred_col = find_first_matching_col(preds_df, PRED_CANDIDATES)

print("Detected true label column:", true_col)
print("Detected predicted label column:", pred_col)

if true_col is None or pred_col is None:
    raise ValueError(
        "Could not auto-detect label columns. "
        "Please rename columns or add your true/pred column names to TRUE_CANDIDATES/PRED_CANDIDATES."
    )

y_true = preds_df[true_col].astype(str)
y_pred = preds_df[pred_col].astype(str)

print("Unique true labels:", sorted(y_true.unique())[:20], "...")
print("Unique pred labels:", sorted(y_pred.unique())[:20], "...")


Detected true label column: true_label
Detected predicted label column: pred_label
Unique true labels: ['barely-true', 'false', 'half-true', 'mostly-true', 'pants-fire', 'true'] ...
Unique pred labels: ['barely-true', 'false', 'half-true', 'mostly-true', 'pants-fire', 'true'] ...


In [5]:
labels = sorted(set(y_true.unique()) | set(y_pred.unique()))

def confusion_matrix_np(y_true, y_pred, labels):
    idx = {lab:i for i, lab in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[idx[t], idx[p]] += 1
    return cm

cm = confusion_matrix_np(y_true, y_pred, labels)

# Accuracy
acc = np.trace(cm) / np.sum(cm) if np.sum(cm) else 0.0

# Per-class precision/recall/F1
tp = np.diag(cm)
fp = np.sum(cm, axis=0) - tp
fn = np.sum(cm, axis=1) - tp

precision = np.divide(tp, tp + fp, out=np.zeros_like(tp, dtype=float), where=(tp+fp)!=0)
recall    = np.divide(tp, tp + fn, out=np.zeros_like(tp, dtype=float), where=(tp+fn)!=0)
f1        = np.divide(2*precision*recall, precision+recall, out=np.zeros_like(tp, dtype=float), where=(precision+recall)!=0)

macro_p = float(np.mean(precision))
macro_r = float(np.mean(recall))
macro_f1 = float(np.mean(f1))

support = np.sum(cm, axis=1)
weighted_f1 = float(np.average(f1, weights=support)) if support.sum() else 0.0

computed_metrics = {
    "accuracy": acc,
    "macro_precision": macro_p,
    "macro_recall": macro_r,
    "macro_f1": macro_f1,
    "weighted_f1": weighted_f1,
    "labels_order": labels
}
computed_metrics


{'accuracy': np.float64(0.21495327102803738),
 'macro_precision': 0.22730155958184264,
 'macro_recall': 0.19937426791806,
 'macro_f1': 0.19583751645282768,
 'weighted_f1': 0.20764931818879773,
 'labels_order': ['barely-true',
  'false',
  'half-true',
  'mostly-true',
  'pants-fire',
  'true']}

In [6]:
# Save computed metrics (keeps results reproducible and report-ready)
out_metrics_path = RESULTS_DIR / "baseline_metrics_recomputed.json"
with open(out_metrics_path, "w", encoding="utf-8") as f:
    json.dump(computed_metrics, f, indent=2)

print("Saved:", out_metrics_path)


Saved: ../results/baseline_metrics_recomputed.json


## Reporting (Chapter 4)

**What these metrics mean in dissertation terms:**
- We report **macro-F1** because LIAR is multi-class and imbalanced; macro-F1 treats each class equally.
- Accuracy alone can hide poor performance on minority labels.
- This establishes a reproducible baseline benchmark to compare against controlled variants (Week 3) and transformer models (Week 4).
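
As a minimal, dependency-free sketch of why accuracy alone can mislead, the toy example below scores a degenerate classifier that always predicts the majority label. The data and the helper names `accuracy` and `macro_f1` are made up for illustration and are not part of the baseline pipeline.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): nine "false" statements and one "pants-fire"
toy_true = ["false"] * 9 + ["pants-fire"]
# Degenerate classifier that always predicts the majority label
toy_pred = ["false"] * 10

toy_acc = accuracy(toy_true, toy_pred)
toy_macro_f1 = macro_f1(toy_true, toy_pred)
print(f"accuracy={toy_acc:.3f}  macro-F1={toy_macro_f1:.3f}")
```

Accuracy looks strong here because the minority class contributes only one example, while macro-F1 is pulled down by the class that is never predicted.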


In [7]:
cm_df_recomputed = pd.DataFrame(cm, index=labels, columns=labels)

out_cm_path = RESULTS_DIR / "confusion_matrix_recomputed.csv"
cm_df_recomputed.to_csv(out_cm_path)
print("Saved:", out_cm_path)


Saved: ../results/confusion_matrix_recomputed.csv


In [8]:
errors = preds_df[y_true != y_pred].copy()
errors["true"] = y_true[y_true != y_pred].values
errors["pred"] = y_pred[y_true != y_pred].values

pair_counts = (
    errors.groupby(["true", "pred"])
    .size()
    .sort_values(ascending=False)
    .reset_index(name="count")
)

pair_counts.head(15)


Unnamed: 0,true,pred,count
0,barely-true,half-true,71
1,mostly-true,half-true,71
2,half-true,false,69
3,barely-true,false,61
4,false,half-true,55
5,half-true,mostly-true,54
6,mostly-true,true,46
7,false,mostly-true,44
8,true,half-true,43
9,mostly-true,false,43


In [9]:
# Try to detect a text column for showing examples (optional but helpful)
TEXT_CANDIDATES = ["text", "statement", "claim", "sentence", "content"]
text_col = find_first_matching_col(preds_df, TEXT_CANDIDATES)
print("Detected text column:", text_col)

# Show examples for the two most frequent confusion pairs
top_pairs = pair_counts.head(2)[["true", "pred"]].values.tolist()

def show_confusion_examples(df_errors, true_label, pred_label, n=8, text_col=None):
    subset = df_errors[(df_errors["true"] == true_label) & (df_errors["pred"] == pred_label)].copy()
    subset = subset.head(n)
    cols = []
    if text_col and text_col in subset.columns:
        cols.append(text_col)
    # Include anything useful if present (id/confidence/etc.)
    extra = [c for c in ["id", "statement_id", "confidence", "prob", "proba", "pred_prob"] if c in subset.columns]
    cols = cols + extra + ["true", "pred"]
    return subset[cols] if cols else subset

for t, p in top_pairs:
    print(f"\n=== Confusion: true={t} → pred={p} ===")
    display(show_confusion_examples(errors, t, p, n=8, text_col=text_col))



Detected text column: statement

=== Confusion: true=barely-true → pred=half-true ===


Unnamed: 0,statement,true,pred
25,"If people work and make more money, they lose ...",barely-true,half-true
33,Walker says hes for lower taxes. But Milwaukee...,barely-true,half-true
64,Says Carlos Lopez-Cantera even voiced enthusia...,barely-true,half-true
66,The CBOs latest report confirms what Republica...,barely-true,half-true
70,Says Mitt Romney once supported President Obam...,barely-true,half-true
85,Toledo is fourth in the nation behind much big...,barely-true,half-true
130,The American people will not support doing any...,barely-true,half-true
146,Pregnant women who stand for five to six hours...,barely-true,half-true



=== Confusion: true=mostly-true → pred=half-true ===


Unnamed: 0,statement,true,pred
26,"We are poised to get rid of over 1,000 more re...",mostly-true,half-true
48,The military has spent $500 million enforcing ...,mostly-true,half-true
55,There has been $5 trillion in debt added over ...,mostly-true,half-true
76,Mitt Romney has proposed cutting his own taxes...,mostly-true,half-true
86,94 percent of winning candidates in 2010 had m...,mostly-true,half-true
93,Congress will begin its recess without having ...,mostly-true,half-true
101,Democrats already agreed to a deal that Republ...,mostly-true,half-true
129,Says legislation pending in the House would ef...,mostly-true,half-true


### Error analysis notes (write as you inspect)

For the confusion **[TRUE] → [PRED]**, the model appears to fail because:
- …
- …
- …

Hypothesis (dissertation): TF-IDF relies on surface lexical cues and struggles with:
- negation / phrasing nuance (bigrams may help),
- subtle label boundaries in LIAR (adjacent classes),
- missing context/evidence beyond the statement text (transformers may help).
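
A small sketch of the negation point above, assuming simple whitespace tokenisation (the real TF-IDF vectoriser may tokenise differently). The two statements are invented for illustration, and `word_ngrams` is a hypothetical helper.

```python
def word_ngrams(text, n):
    """Return the lowercase word n-grams of a whitespace-tokenised text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Two made-up statements that differ only by a negation
a = "the bill did raise taxes"
b = "the bill did not raise taxes"

# Features present in b but not in a, for unigrams and for bigrams
unigrams_only_in_b = set(word_ngrams(b, 1)) - set(word_ngrams(a, 1))
bigrams_only_in_b = set(word_ngrams(b, 2)) - set(word_ngrams(a, 2))

print(sorted(unigrams_only_in_b))
print(sorted(bigrams_only_in_b))
```

With unigrams the only distinguishing feature is the isolated token "not"; bigrams attach the negation to its neighbouring words, which is the kind of local phrasing cue the hypothesis refers to.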


In [10]:
baseline_table = pd.DataFrame([{
    "Model": "Baseline (from valid_predictions.csv)",
    "Split": "Validation",
    "Accuracy": round(computed_metrics["accuracy"], 4),
    "Macro-Precision": round(computed_metrics["macro_precision"], 4),
    "Macro-Recall": round(computed_metrics["macro_recall"], 4),
    "Macro-F1": round(computed_metrics["macro_f1"], 4),
    "Weighted-F1": round(computed_metrics["weighted_f1"], 4),
}])

baseline_table


Unnamed: 0,Model,Split,Accuracy,Macro-Precision,Macro-Recall,Macro-F1,Weighted-F1
0,Baseline (from valid_predictions.csv),Validation,0.215,0.2273,0.1994,0.1958,0.2076


In [11]:
import os
import json
import pandas as pd
import numpy as np


PRED_PATH = os.path.join(RESULTS_DIR, "valid_predictions.csv")

assert os.path.exists(PRED_PATH), f"Missing: {PRED_PATH}"
print("Found:", PRED_PATH)


Found: ../results/valid_predictions.csv


In [12]:
preds = pd.read_csv(PRED_PATH)

required_cols = {"statement", "true_label", "pred_label"}
assert required_cols.issubset(preds.columns), f"CSV columns are: {preds.columns.tolist()}"

print("Rows:", len(preds))
print(preds.head(3))
print("\nLabel counts (true):")
print(preds["true_label"].value_counts())


Rows: 1284
                                           statement   true_label pred_label
0  We have less Americans working now than in the...  barely-true      false
1  When Obama was sworn into office, he DID NOT u...   pants-fire       true
2  Says Having organizations parading as being so...        false      false

Label counts (true):
true_label
false          263
mostly-true    251
half-true      248
barely-true    237
true           169
pants-fire     116
Name: count, dtype: int64


In [13]:
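# Use the union of true and predicted labels so reindex cannot drop a predicted class
labels = sorted(set(preds["true_label"]) | set(preds["pred_label"]))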
cm = pd.crosstab(preds["true_label"], preds["pred_label"], rownames=["true"], colnames=["pred"], dropna=False)

# Ensure full square matrix with consistent order
cm = cm.reindex(index=labels, columns=labels, fill_value=0)

cm_path = os.path.join(RESULTS_DIR, "confusion_matrix_final.csv")
cm.to_csv(cm_path)

print("Saved:", cm_path)
cm


Saved: ../results/confusion_matrix_final.csv


true \ pred,barely-true,false,half-true,mostly-true,pants-fire,true
barely-true,31,61,71,37,3,34
false,42,81,55,44,6,35
half-true,35,69,58,54,2,30
mostly-true,28,43,71,62,1,46
pants-fire,30,35,18,11,6,16
true,13,32,43,42,1,38


In [14]:
# cm is a pandas DataFrame (from pd.crosstab above); cast counts to float for the divisions below
cm_np = cm.astype(float)

tp = np.diag(cm_np)
fp = cm_np.sum(axis=0) - tp
fn = cm_np.sum(axis=1) - tp

precision = np.divide(tp, tp + fp, out=np.zeros_like(tp), where=(tp+fp)!=0)
recall    = np.divide(tp, tp + fn, out=np.zeros_like(tp), where=(tp+fn)!=0)
f1        = np.divide(2*precision*recall, precision+recall, out=np.zeros_like(tp), where=(precision+recall)!=0)
support   = cm_np.sum(axis=1)

per_class = pd.DataFrame({
    "label": labels,
    "support": support.astype(int),
    "precision": precision,
    "recall": recall,
    "f1": f1
}).sort_values("f1")

per_class_path = RESULTS_DIR / "per_class_metrics.csv"
per_class.to_csv(per_class_path, index=False)

print("Saved:", per_class_path.resolve())
per_class


Saved: /workspaces/fake-news-dissertation/results/per_class_metrics.csv


Unnamed: 0,label,support,precision,recall,f1
pants-fire,pants-fire,116,0.315789,0.051724,0.088889
barely-true,barely-true,237,0.173184,0.130802,0.149038
half-true,half-true,248,0.183544,0.233871,0.205674
true,true,169,0.190955,0.224852,0.206522
mostly-true,mostly-true,251,0.248,0.247012,0.247505
false,false,263,0.252336,0.307985,0.277397


In [15]:
errors = preds[preds["true_label"] != preds["pred_label"]].copy()

conf_pairs = (
    errors.groupby(["true_label", "pred_label"])
    .size()
    .sort_values(ascending=False)
    .reset_index(name="count")
)

conf_pairs_path = os.path.join(RESULTS_DIR, "top_confusions.csv")
conf_pairs.to_csv(conf_pairs_path, index=False)

print("Saved:", conf_pairs_path)
conf_pairs.head(15)


Saved: ../results/top_confusions.csv


Unnamed: 0,true_label,pred_label,count
0,barely-true,half-true,71
1,mostly-true,half-true,71
2,half-true,false,69
3,barely-true,false,61
4,false,half-true,55
5,half-true,mostly-true,54
6,mostly-true,true,46
7,false,mostly-true,44
8,true,half-true,43
9,mostly-true,false,43


In [16]:
def sample_confusion(true_label, pred_label, n=5, seed=42):
    subset = errors[(errors.true_label == true_label) & (errors.pred_label == pred_label)]
    if len(subset) == 0:
        return pd.DataFrame(columns=["statement", "true_label", "pred_label"])
    return subset.sample(min(n, len(subset)), random_state=seed)[["statement", "true_label", "pred_label"]]

# Take the top 3 confusion pairs and sample examples
top3 = conf_pairs.head(3)
samples = []

for _, row in top3.iterrows():
    t, p = row["true_label"], row["pred_label"]
    ex = sample_confusion(t, p, n=6)
    ex["pair"] = f"{t} -> {p}"
    samples.append(ex)

examples_df = pd.concat(samples, ignore_index=True)

examples_path = os.path.join(RESULTS_DIR, "error_examples_top3.csv")
examples_df.to_csv(examples_path, index=False)

print("Saved:", examples_path)
examples_df


Saved: ../results/error_examples_top3.csv


Unnamed: 0,statement,true_label,pred_label,pair
0,Says Ron Johnson justifies his support of trad...,barely-true,half-true,barely-true -> half-true
1,"If people work and make more money, they lose ...",barely-true,half-true,barely-true -> half-true
2,Eric Cantor took $5 million from Sheldon Adels...,barely-true,half-true,barely-true -> half-true
3,Says Mitt Romney once supported President Obam...,barely-true,half-true,barely-true -> half-true
4,Scott Walker supported the same transportation...,barely-true,half-true,barely-true -> half-true
5,Says there are a half a trillion dollars in cu...,barely-true,half-true,barely-true -> half-true
6,"As a result of Roe vs. Wade, Americas maternal...",mostly-true,half-true,mostly-true -> half-true
7,"We are poised to get rid of over 1,000 more re...",mostly-true,half-true,mostly-true -> half-true
8,I'm the first person who really took up the is...,mostly-true,half-true,mostly-true -> half-true
9,94 percent of winning candidates in 2010 had m...,mostly-true,half-true,mostly-true -> half-true


In [17]:
# Load your baseline metrics (use recomputed or original; both match)
metrics_path = os.path.join(RESULTS_DIR, "baseline_metrics.json")
if not os.path.exists(metrics_path):
    # Fall back to the metrics recomputed from valid_predictions.csv
    metrics_path = os.path.join(RESULTS_DIR, "baseline_metrics_recomputed.json")
with open(metrics_path, "r", encoding="utf-8") as f:
    m = json.load(f)

summary = f"""
Baseline (TF-IDF + Logistic Regression) on the validation set achieved accuracy={m['accuracy']:.3f} and macro-F1={m['macro_f1']:.3f}.
Per-class results show uneven performance across the six veracity categories, indicating that the model struggles with fine-grained distinctions.
Error analysis using the confusion matrix shows that misclassifications are concentrated between semantically adjacent labels (e.g., nearby truthfulness levels),
suggesting that surface-level TF-IDF features do not reliably capture the contextual cues needed for subtle veracity judgement.
These findings motivate the use of more context-aware neural language models in subsequent experiments.
""".strip()

print(summary)


Baseline (TF-IDF + Logistic Regression) on the validation set achieved accuracy=0.215 and macro-F1=0.196.
Per-class results show uneven performance across the six veracity categories, indicating that the model struggles with fine-grained distinctions.
Error analysis using the confusion matrix shows that misclassifications are concentrated between semantically adjacent labels (e.g., nearby truthfulness levels),
suggesting that surface-level TF-IDF features do not reliably capture the contextual cues needed for subtle veracity judgement.
These findings motivate the use of more context-aware neural language models in subsequent experiments.


## 4.X Baseline evaluation (validation)

The TF-IDF + Logistic Regression baseline achieved **accuracy = 0.215** and **macro-F1 = 0.196** on the validation split. Macro-F1 is reported because the LIAR dataset is multi-class and imbalanced; it weights each class equally, so weak performance on minority or difficult labels is not masked by the majority classes.
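
For reference, the per-class and macro-averaged scores computed in the cells above follow the standard definitions, written here for the $C = 6$ LIAR classes, where $TP_c$, $FP_c$ and $FN_c$ are the true-positive, false-positive and false-negative counts for class $c$ (with $F1_c = 0$ when $P_c + R_c = 0$, as in the code):

$$
P_c = \frac{TP_c}{TP_c + FP_c}, \qquad
R_c = \frac{TP_c}{TP_c + FN_c}, \qquad
F1_c = \frac{2\,P_c\,R_c}{P_c + R_c}, \qquad
\text{macro-F1} = \frac{1}{C} \sum_{c=1}^{C} F1_c
$$

As a check, averaging the six per-class F1 values in `per_class_metrics.csv` (0.089, 0.149, 0.206, 0.207, 0.248, 0.277) gives approximately 0.196, which matches the reported macro-F1.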

### 4.X.1 Per-class performance

Per-class results show uneven performance across the six veracity categories, with F1 ranging from 0.089 (**pants-fire**) to 0.277 (false). In particular, **pants-fire** exhibits very low recall (0.052, i.e. 6 of 116 statements), indicating that the baseline model rarely identifies extreme falsehood correctly. This suggests that TF-IDF features capture shallow lexical cues but fail to learn reliable patterns for rare or semantically complex classes.
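
The pants-fire figures can be traced back to the confusion matrix by hand. The sketch below copies the pants-fire row and column from the matrix printed in this notebook (alphabetical label order) and recomputes recall and precision; the variable names are illustrative only.

```python
labels = ["barely-true", "false", "half-true", "mostly-true", "pants-fire", "true"]
k = labels.index("pants-fire")

# Counts copied from the confusion matrix printed in this notebook
true_pants_fire_row = [30, 35, 18, 11, 6, 16]  # true = pants-fire, by predicted label
pred_pants_fire_col = [3, 6, 2, 1, 6, 1]       # pred = pants-fire, by true label

tp = true_pants_fire_row[k]
recall = tp / sum(true_pants_fire_row)      # share of pants-fire statements found
precision = tp / sum(pred_pants_fire_col)   # share of pants-fire predictions that are right

# Where do the missed pants-fire statements end up?
missed = sorted(
    ((n, lab) for lab, n in zip(labels, true_pants_fire_row) if lab != "pants-fire"),
    reverse=True,
)

print(f"recall={recall:.3f}  precision={precision:.3f}")
print("most common wrong predictions:", missed[:2])
```

The column total shows that the model predicts pants-fire only 19 times in 1,284 statements, so the low recall comes from the class almost never being predicted rather than from many wrong pants-fire predictions.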

### 4.X.2 Error analysis (confusions)

The confusion matrix shows that the most frequent misclassifications fall between semantically adjacent labels: barely-true → half-true and mostly-true → half-true are the two largest confusion pairs (71 cases each). This pattern suggests that a bag-of-words representation struggles to capture the contextual nuance needed for fine-grained veracity classification.

Qualitative inspection of misclassified statements (Appendix / Error Examples) suggests three baseline failure modes:
1) reliance on surface phrasing rather than evidence or context,
2) difficulty with subtle wording differences that shift truthfulness level,
3) label boundary ambiguity in the dataset (adjacent classes are difficult even for humans).

These findings motivate controlled classical variants (e.g., n-grams, class weighting) and subsequently transformer-based models that better capture context.
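
For the class-weighting variant, one common choice is the "balanced" heuristic, `weight_c = n_samples / (n_classes * count_c)`, which is the formula scikit-learn documents for `class_weight="balanced"`. The sketch below applies it to the validation true-label counts printed earlier in this notebook, purely as an illustration; in an actual experiment the weights would be derived from the training labels, whose distribution is assumed here to be similar.

```python
# True-label counts on the validation split, copied from this notebook.
# Assumption: used only to illustrate the formula; real weights should be
# computed from the training split.
label_counts = {
    "false": 263,
    "mostly-true": 251,
    "half-true": 248,
    "barely-true": 237,
    "true": 169,
    "pants-fire": 116,
}

n_samples = sum(label_counts.values())
n_classes = len(label_counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights
balanced_weights = {
    lab: n_samples / (n_classes * count) for lab, count in label_counts.items()
}

for lab, w in sorted(balanced_weights.items(), key=lambda kv: kv[1]):
    print(f"{lab:12s} {w:.3f}")
```

Under this heuristic pants-fire receives the largest weight, roughly 1.84, and false the smallest, roughly 0.81, so errors on the rarest class cost more during training.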


In [18]:
import pandas as pd
from pathlib import Path

R = Path("..") / "results"

preds = pd.read_csv(R / "valid_predictions.csv")
cm = pd.read_csv(R / "confusion_matrix_final.csv", index_col=0)
per_class = pd.read_csv(R / "per_class_metrics.csv")
top = pd.read_csv(R / "top_confusions.csv")
examples = pd.read_csv(R / "error_examples_top3.csv")

print("CHECK 1 — N predictions vs sum(confusion matrix)")
print("N predictions:", len(preds))
print("Sum confusion matrix:", cm.to_numpy().sum())

print("\nCHECK 2 — Sum(per-class support) vs N predictions")
print("Sum support:", per_class["support"].sum())
print("N predictions:", len(preds))

print("\nCHECK 3 — Errors in preds vs sum(top_confusions)")
n_errors = (preds["true_label"] != preds["pred_label"]).sum()
print("Errors from preds:", n_errors)
print("Sum top_confusions counts:", top["count"].sum())

print("\nCHECK 4 — Example pairs match the top-3 confusion pairs")
top3_pairs = set(top.head(3).apply(lambda r: f"{r['true_label']} -> {r['pred_label']}", axis=1))
example_pairs = set(examples["pair"].unique())
print("Top-3 pairs:", top3_pairs)
print("Example pairs:", example_pairs)


CHECK 1 — N predictions vs sum(confusion matrix)
N predictions: 1284
Sum confusion matrix: 1284

CHECK 2 — Sum(per-class support) vs N predictions
Sum support: 1284
N predictions: 1284

CHECK 3 — Errors in preds vs sum(top_confusions)
Errors from preds: 1008
Sum top_confusions counts: 1008

CHECK 4 — Example pairs match the top-3 confusion pairs
Top-3 pairs: {'half-true -> false', 'mostly-true -> half-true', 'barely-true -> half-true'}
Example pairs: {'half-true -> false', 'mostly-true -> half-true', 'barely-true -> half-true'}


## 4.X Baseline Evaluation (Validation)

The TF-IDF + Logistic Regression baseline provides a reproducible point of comparison for subsequent experiments. On the validation split (n=1284), the model achieved accuracy ≈ 0.215 and macro-F1 ≈ 0.196. Macro-F1 is prioritised because LIAR is a six-class problem with class imbalance; it reflects performance across all labels rather than being dominated by majority classes.
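
To put these scores in context, the sketch below computes two trivial reference points from the validation label counts printed earlier in this notebook: always predicting the most frequent label, and guessing uniformly at random. Assumption: the majority label is read off the validation counts here for simplicity; in a proper comparison it would be chosen on the training split.

```python
# True-label counts on the validation split, copied from this notebook
label_counts = {
    "false": 263,
    "mostly-true": 251,
    "half-true": 248,
    "barely-true": 237,
    "true": 169,
    "pants-fire": 116,
}

n = sum(label_counts.values())

# Reference 1: always predict the most frequent label
majority_label, majority_count = max(label_counts.items(), key=lambda kv: kv[1])
majority_acc = majority_count / n

# Reference 2: expected accuracy of uniform random guessing over six labels
uniform_acc = 1 / len(label_counts)

print(f"majority label: {majority_label}  accuracy={majority_acc:.3f}")
print(f"uniform random guessing: accuracy={uniform_acc:.3f}")
```

On these counts the majority-label reference is about 0.205, so the baseline's accuracy of 0.215 sits only slightly above it, which is worth stating alongside the headline figures.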

### 4.X.1 Confusion patterns

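Error analysis shows that the most frequent misclassifications fall between neighbouring veracity categories. The three most frequent confusions were:
- barely-true → half-true (71)
- mostly-true → half-true (71)
- half-true → false (69)

The first two are between *adjacent* labels on the LIAR scale; the third skips one level (barely-true). This indicates the baseline tends to collapse fine-grained distinctions into nearby categories, in particular over-predicting “half-true” as a middle class (316 of 1,284 predictions, against 248 true instances).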
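
How strongly errors cluster around neighbouring labels can be quantified by the ordinal distance between the true and predicted label. The sketch below assumes the usual six-point LIAR ordering (pants-fire, false, barely-true, half-true, mostly-true, true) and uses the confusion matrix printed in this notebook; `errors_by_distance` is an illustrative name.

```python
# Ordinal position of each label on the six-point truthfulness scale
scale = ["pants-fire", "false", "barely-true", "half-true", "mostly-true", "true"]
rank = {lab: i for i, lab in enumerate(scale)}

# Confusion matrix from this notebook: rows = true label, columns = predicted
# label, both in alphabetical order
labels = ["barely-true", "false", "half-true", "mostly-true", "pants-fire", "true"]
cm = [
    [31, 61, 71, 37, 3, 34],  # true = barely-true
    [42, 81, 55, 44, 6, 35],  # true = false
    [35, 69, 58, 54, 2, 30],  # true = half-true
    [28, 43, 71, 62, 1, 46],  # true = mostly-true
    [30, 35, 18, 11, 6, 16],  # true = pants-fire
    [13, 32, 43, 42, 1, 38],  # true = true
]

# Count errors by how many scale steps separate true and predicted label
errors_by_distance = {}
for i, t in enumerate(labels):
    for j, p in enumerate(labels):
        d = abs(rank[t] - rank[p])
        if d > 0:
            errors_by_distance[d] = errors_by_distance.get(d, 0) + cm[i][j]

total_errors = sum(errors_by_distance.values())
for d in sorted(errors_by_distance):
    share = errors_by_distance[d] / total_errors
    print(f"distance {d}: {errors_by_distance[d]} errors ({share:.1%})")
```

On these counts, 463 of the 1,008 errors (about 46%) are exactly one step away from the true label. Since 10 of the 30 off-diagonal cells are adjacent pairs, this is above what an even spread of errors would give, but a majority of errors still land two or more steps away, so the adjacency claim is best reported with this figure attached.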

### 4.X.2 Interpreting baseline limitations

These patterns are consistent with limitations of bag-of-words TF-IDF representations: the model relies on surface lexical cues and struggles to capture contextual nuance required for subtle truthfulness grading. This motivates controlled baseline improvements (e.g., n-grams and class weighting) and, subsequently, transformer-based models that better encode semantic context.
