# Experiment Results: Text Classification with LSTMs and GNNs

This notebook presents experiment results comparing three model architectures for text classification:
- **BiLSTM** — sequential baseline with GloVe embeddings and attention
- **TextING** — inductive GNN operating on per-document word graphs
- **TextGCN** — transductive GNN operating on a single corpus-level word-document graph

All three models are evaluated on **MR** (movie reviews, binary sentiment) and **20 Newsgroups** (20-class topic classification).

In [None]:
import mlflow
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns
import numpy as np
from pathlib import Path

# Style
sns.set_theme(style="whitegrid", font_scale=1.1)
palette = {"lstm": "#4C72B0", "texting": "#55A868", "text_gcn": "#C44E52"}
model_labels = {"lstm": "BiLSTM", "texting": "TextING", "text_gcn": "TextGCN"}
dataset_labels = {"mr": "MR", "20ng": "20NG"}
MODEL_ORDER = ["lstm", "texting", "text_gcn"]

mlflow.set_tracking_uri("mlruns")

# Load all finished runs
all_runs = mlflow.search_runs(search_all_experiments=True)
all_runs = all_runs[all_runs["status"] == "FINISHED"].copy()

# Normalize dataset names
all_runs["params.dataset_name"] = all_runs["params.dataset_name"].str.lower().replace({"ng20": "20ng"})

# Build experiment name lookup
exp_names = {e.experiment_id: e.name for e in mlflow.search_experiments()}
all_runs["experiment_name"] = all_runs["experiment_id"].map(exp_names)

print(f"Loaded {len(all_runs)} finished runs across {all_runs['experiment_name'].nunique()} experiments")
all_runs.groupby("experiment_name").size().sort_values(ascending=False)

## 1. Baseline Comparison

Each model is trained with default hyperparameters on both datasets, using the same preprocessing (stopword removal, rare-word threshold of 5) and the same train/val/test splits.

In [None]:
baseline = all_runs[all_runs["experiment_name"] == "baseline_comparison"].copy()
baseline["model"] = baseline["params.model_type"].map(model_labels)
baseline["dataset"] = baseline["params.dataset_name"].map(dataset_labels)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

for i, (metric, label) in enumerate([("metrics.test_accuracy", "Test Accuracy"), ("metrics.test_f1_macro", "Test F1 (macro)")]):
    ax = axes[i]
    data = baseline.pivot(index="model", columns="dataset", values=metric)
    # Reorder rows
    data = data.loc[[model_labels[m] for m in MODEL_ORDER]]
    data.plot(kind="bar", ax=ax, color=["#5B9BD5", "#ED7D31"], edgecolor="black", linewidth=0.5)
    ax.set_title(label, fontweight="bold")
    ax.set_xlabel("")
    ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
    ax.set_ylim(0.65, 0.85)
    ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1, decimals=0))
    # Add value labels
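    for container in ax.containers:
        # Bar heights are fractions in [0, 1], so format them explicitly as percentages
        labels = [f"{v:.1%}" for v in container.datavalues]
        ax.bar_label(container, labels=labels, label_type="edge", padding=2,
                     fontsize=8, fontweight="bold")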
    if i == 0:
        ax.legend(title="Dataset")
    else:
        ax.get_legend().remove()

fig.suptitle("Baseline Model Comparison", fontweight="bold", fontsize=14)
plt.tight_layout()
plt.show()

# Summary table
summary = baseline[["model", "dataset", "metrics.test_accuracy", "metrics.test_f1_macro"]].copy()
summary.columns = ["Model", "Dataset", "Accuracy", "F1 (macro)"]
summary = summary.sort_values(["Dataset", "Model"]).reset_index(drop=True)
summary[["Accuracy", "F1 (macro)"]] = summary[["Accuracy", "F1 (macro)"]].map(lambda x: f"{x:.2%}")
summary

**Takeaway:** Both GNN models outperform BiLSTM on both datasets. TextGCN achieves the highest accuracy, with the gap most pronounced on 20NG (+5.4 percentage points over BiLSTM). On MR the models are closer together, suggesting that graph structure provides more benefit for multi-class topic classification than for binary sentiment.
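Gaps between models are easiest to compare as absolute differences in percentage points. The helper below is a minimal sketch of that calculation; `accuracy_gap_pp` is a hypothetical name, it assumes one run per (model, dataset) pair with numeric accuracies in columns named `Model`, `Dataset`, and `Accuracy`, and the toy values are placeholders rather than results from this notebook.

```python
import pandas as pd

def accuracy_gap_pp(df: pd.DataFrame, model: str, reference: str) -> pd.Series:
    """Accuracy difference (model - reference) in percentage points, per dataset."""
    # Rows = models, columns = datasets; assumes one run per (model, dataset) pair.
    table = df.pivot(index="Model", columns="Dataset", values="Accuracy")
    return (table.loc[model] - table.loc[reference]) * 100

# Placeholder numbers for illustration only; these are NOT the notebook's results.
toy = pd.DataFrame({
    "Model": ["A", "B", "A", "B"],
    "Dataset": ["d1", "d1", "d2", "d2"],
    "Accuracy": [0.70, 0.75, 0.60, 0.50],
})
gaps = accuracy_gap_pp(toy, model="B", reference="A")
print(gaps.round(1))
```

To use it on the real table, the accuracies would need to be numeric, i.e. taken before they are formatted as percentage strings for display.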

## 2. Training Curves

Convergence behavior of each model on the baseline configuration.

In [None]:
client = mlflow.tracking.MlflowClient()

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

metric_pairs = [
    ("train_accuracy", "val_accuracy", "Accuracy"),
    ("train_loss", "val_loss", "Loss"),
]

for col, ds_name in enumerate(["mr", "20ng"]):
    ds_runs = baseline[baseline["params.dataset_name"] == ds_name]
    for row, (train_m, val_m, ylabel) in enumerate(metric_pairs):
        ax = axes[row, col]
        for _, run in ds_runs.iterrows():
            model_type = run["params.model_type"]
            color = palette[model_type]
            label = model_labels[model_type]
            run_id = run["run_id"]

            # Training metric
            hist = client.get_metric_history(run_id, train_m)
            if hist:
                steps = [h.step for h in hist]
                vals = [h.value for h in hist]
                ax.plot(steps, vals, color=color, linestyle="-", label=f"{label} (train)", alpha=0.7)

            # Validation metric
            hist = client.get_metric_history(run_id, val_m)
            if hist:
                steps = [h.step for h in hist]
                vals = [h.value for h in hist]
                ax.plot(steps, vals, color=color, linestyle="--", label=f"{label} (val)", alpha=0.9, linewidth=2)

        ax.set_xlabel("Epoch")
        ax.set_ylabel(ylabel)
        if row == 0:
            ax.set_title(f"{dataset_labels[ds_name]}", fontweight="bold", fontsize=13)
        if col == 1 and row == 0:
            ax.legend(bbox_to_anchor=(1.05, 1), loc="upper left", fontsize=8)

fig.suptitle("Training Curves — Baseline Runs", fontweight="bold", fontsize=14)
plt.tight_layout()
plt.show()

## 3. Data Efficiency

How does each model perform when trained on a fraction of the training data (10–80%)?

In [None]:
de = all_runs[all_runs["experiment_name"] == "data_efficiency"].copy()
de["train_frac"] = de["params.train_split"].astype(float)

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)

for i, ds in enumerate(["mr", "20ng"]):
    ax = axes[i]
    subset = de[de["params.dataset_name"] == ds]
    for model in MODEL_ORDER:
        m = subset[subset["params.model_type"] == model].sort_values("train_frac")
        ax.plot(m["train_frac"] * 100, m["metrics.test_accuracy"],
                marker="o", color=palette[model], label=model_labels[model], linewidth=2)
    ax.set_title(f"{dataset_labels[ds]}", fontweight="bold")
    ax.set_xlabel("Training Data (%)")
    ax.set_xticks([10, 30, 50, 70, 80])
    if i == 0:
        ax.set_ylabel("Test Accuracy")
        ax.legend()
    ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1, decimals=0))

fig.suptitle("Data Efficiency — Test Accuracy vs. Training Data Fraction", fontweight="bold", fontsize=14)
plt.tight_layout()
plt.show()

**Takeaway:** GNN models are more data-efficient than BiLSTM. On 20NG, TextGCN trained on 30% of the training data (~70% accuracy) approaches BiLSTM trained on 80% (~72%). The gap narrows on MR, where all models converge to similar accuracy. Note: TextGCN and TextING show anomalous drops at 80% on MR, likely due to unlucky random seeds or graph construction artifacts at that particular split.
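One way to check whether a drop like this is seed noise would be to repeat each configuration with several seeds and look at the mean and standard deviation per data fraction. The sketch below assumes such repeated runs exist (this notebook does not confirm that) and reuses the column names from the cell above; `aggregate_over_seeds` is a hypothetical helper and the toy values are placeholders.

```python
import pandas as pd

def aggregate_over_seeds(runs: pd.DataFrame) -> pd.DataFrame:
    """Mean, std, and run count of test accuracy per (model, dataset, fraction)."""
    keys = ["params.model_type", "params.dataset_name", "train_frac"]
    return (
        runs.groupby(keys)["metrics.test_accuracy"]
        .agg(["mean", "std", "count"])
        .reset_index()
    )

# Placeholder values for illustration only; not results from this notebook.
toy_runs = pd.DataFrame({
    "params.model_type": ["m"] * 4,
    "params.dataset_name": ["d"] * 4,
    "train_frac": [0.5, 0.5, 0.8, 0.8],
    "metrics.test_accuracy": [0.60, 0.62, 0.70, 0.50],
})
stats = aggregate_over_seeds(toy_runs)
print(stats)
```

A large standard deviation at a single fraction would point to seed variance, while a consistently low mean would point to something systematic in that split.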

## 4. Hyperparameter Sensitivity

### 4.1 BiLSTM — Hidden Dimension & Layers

In [None]:
hp_lstm = all_runs[all_runs["experiment_name"] == "hyperparams_lstm"].copy()

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Hidden dim comparison (fixed layers=2)
ax = axes[0]
hd = hp_lstm[hp_lstm["params.model_num_layers"] == "2"].copy()
hd["hidden_dim"] = hd["params.model_hidden_dim"].astype(int)
for ds in ["mr", "20ng"]:
    sub = hd[hd["params.dataset_name"] == ds].sort_values("hidden_dim")
    ax.plot(sub["hidden_dim"], sub["metrics.test_accuracy"],
            marker="o", label=dataset_labels[ds], linewidth=2)
ax.set_xlabel("Hidden Dimension")
ax.set_ylabel("Test Accuracy")
ax.set_title("Hidden Dim (layers=2)", fontweight="bold")
ax.legend()
ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1, decimals=0))

# Layers comparison (fixed hidden=128)
ax = axes[1]
nl = hp_lstm[hp_lstm["params.model_hidden_dim"] == "128"].copy()
nl["num_layers"] = nl["params.model_num_layers"].astype(int)
for ds in ["mr", "20ng"]:
    sub = nl[nl["params.dataset_name"] == ds].sort_values("num_layers")
    ax.plot(sub["num_layers"], sub["metrics.test_accuracy"],
            marker="o", label=dataset_labels[ds], linewidth=2)
ax.set_xlabel("Number of Layers")
ax.set_title("Num Layers (hidden=128)", fontweight="bold")
ax.legend()
ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1, decimals=0))
ax.set_xticks([1, 2, 3])

fig.suptitle("BiLSTM Hyperparameter Sensitivity", fontweight="bold", fontsize=14)
plt.tight_layout()
plt.show()

**BiLSTM** is relatively insensitive to hyperparameters: accuracy varies by only ~2-3 percentage points across configurations. Hidden dim 128 with 2 layers is a solid default.
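The spread quoted above can be computed directly from the run table rather than read off the plot. This is a minimal sketch under the assumption that every row is one configuration; `accuracy_spread_pp` is a hypothetical helper and the toy values are placeholders.

```python
import pandas as pd

def accuracy_spread_pp(runs: pd.DataFrame) -> pd.Series:
    """Max minus min test accuracy per dataset, in percentage points."""
    grouped = runs.groupby("params.dataset_name")["metrics.test_accuracy"]
    return (grouped.max() - grouped.min()) * 100

# Placeholder values for illustration only; not results from this notebook.
toy_runs = pd.DataFrame({
    "params.dataset_name": ["d1", "d1", "d1", "d2", "d2"],
    "metrics.test_accuracy": [0.70, 0.72, 0.71, 0.50, 0.55],
})
spread = accuracy_spread_pp(toy_runs)
print(spread.round(1))
```

Calling `accuracy_spread_pp(hp_lstm)` would give the corresponding figure for the BiLSTM sweep.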

### 4.2 TextING — GRU Steps & Window Size

In [None]:
hp_ting = all_runs[all_runs["experiment_name"] == "hyperparams_texting"].copy()

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Window size (fixed gru=3)
ax = axes[0]
ws = hp_ting[hp_ting["params.model_gru_steps"] == "3"].copy()
ws["window"] = ws["params.window_size"].astype(int)
for ds in ["mr", "20ng"]:
    sub = ws[ws["params.dataset_name"] == ds].sort_values("window")
    # Average duplicates
    sub = sub.groupby("window")["metrics.test_accuracy"].mean().reset_index()
    ax.plot(sub["window"], sub["metrics.test_accuracy"],
            marker="o", label=dataset_labels[ds], linewidth=2)
ax.set_xlabel("Window Size")
ax.set_ylabel("Test Accuracy")
ax.set_title("Window Size (GRU steps=3)", fontweight="bold")
ax.legend()
ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1, decimals=0))

# GRU steps (fixed window=2)
ax = axes[1]
gs = hp_ting[hp_ting["params.window_size"] == "2"].copy()
gs["gru_steps"] = gs["params.model_gru_steps"].astype(int)
for ds in ["mr", "20ng"]:
    sub = gs[gs["params.dataset_name"] == ds].sort_values("gru_steps")
    sub = sub.groupby("gru_steps")["metrics.test_accuracy"].mean().reset_index()
    ax.plot(sub["gru_steps"], sub["metrics.test_accuracy"],
            marker="o", label=dataset_labels[ds], linewidth=2)
ax.set_xlabel("GRU Steps")
ax.set_title("GRU Steps (window=2)", fontweight="bold")
ax.legend()
ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1, decimals=0))
ax.set_xticks([1, 2, 3, 4])

fig.suptitle("TextING Hyperparameter Sensitivity", fontweight="bold", fontsize=14)
plt.tight_layout()
plt.show()

**TextING** performs best with fewer GRU steps (1-2); more message-passing iterations don't help and may cause over-smoothing. The best window size depends on the dataset: window=4 is optimal on MR, while window=2 works best on 20NG.

### 4.3 TextGCN — Learning Rate & Hidden Dimension

In [None]:
hp_gcn = all_runs[all_runs["experiment_name"] == "hyperparams_textgcn"].copy()

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Learning rate (fixed hidden=200)
ax = axes[0]
lr = hp_gcn[hp_gcn["params.model_hidden_dim"] == "200"].copy()
lr["lr"] = lr["params.model_lr"].astype(float)
for ds in ["mr", "20ng"]:
    sub = lr[lr["params.dataset_name"] == ds].sort_values("lr")
    sub = sub.groupby("lr")["metrics.test_accuracy"].mean().reset_index()
    ax.plot(sub["lr"], sub["metrics.test_accuracy"],
            marker="o", label=dataset_labels[ds], linewidth=2)
ax.set_xlabel("Learning Rate")
ax.set_ylabel("Test Accuracy")
ax.set_title("Learning Rate (hidden=200)", fontweight="bold")
ax.set_xscale("log")
ax.legend()
ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1, decimals=0))

# Hidden dim (fixed lr=0.02)
ax = axes[1]
hd = hp_gcn[hp_gcn["params.model_lr"] == "0.02"].copy()
hd["hidden_dim"] = hd["params.model_hidden_dim"].astype(int)
for ds in ["mr", "20ng"]:
    sub = hd[hd["params.dataset_name"] == ds].sort_values("hidden_dim")
    sub = sub.groupby("hidden_dim")["metrics.test_accuracy"].mean().reset_index()
    ax.plot(sub["hidden_dim"], sub["metrics.test_accuracy"],
            marker="o", label=dataset_labels[ds], linewidth=2)
ax.set_xlabel("Hidden Dimension")
ax.set_title("Hidden Dim (lr=0.02)", fontweight="bold")
ax.legend()
ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1, decimals=0))

fig.suptitle("TextGCN Hyperparameter Sensitivity", fontweight="bold", fontsize=14)
plt.tight_layout()
plt.show()

**TextGCN** is highly sensitive to learning rate — lr=0.005 causes training to collapse (near random on 20NG). The optimal range is 0.02–0.05. Hidden dimension has less impact, with 200 being a good default.
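"Near random" means something different on each dataset: uniform guessing gives 50% on binary MR but only 5% on 20-class 20NG. The sketch below flags runs close to that chance level; `flag_collapsed` and the `margin` of 0.05 are hypothetical choices for illustration, and the toy values are placeholders.

```python
import pandas as pd

# Number of classes per dataset, from the task descriptions in the introduction.
NUM_CLASSES = {"mr": 2, "20ng": 20}

def flag_collapsed(runs: pd.DataFrame, margin: float = 0.05) -> pd.Series:
    """True where test accuracy is within `margin` of uniform-guess accuracy."""
    chance = 1.0 / runs["params.dataset_name"].map(NUM_CLASSES)
    return runs["metrics.test_accuracy"] <= chance + margin

# Placeholder values for illustration only; not results from this notebook.
toy = pd.DataFrame({
    "params.dataset_name": ["mr", "mr", "20ng", "20ng"],
    "metrics.test_accuracy": [0.52, 0.76, 0.06, 0.80],
})
flags = flag_collapsed(toy)
print(toy[flags])
```

A per-dataset chance level avoids the blind spot of a single fixed cutoff, which cannot catch a collapsed binary classifier sitting at about 50%.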

## 5. Preprocessing Impact

Effect of stopword removal and rare word filtering on TextGCN and TextING.

In [None]:
pp = all_runs[all_runs["experiment_name"] == "preprocessing_impact"].copy()
pp["stopwords"] = pp["params.remove_stopwords"].map({"True": "Removed", "False": "Kept"})
pp["rare_threshold"] = pp["params.remove_rare_words"].astype(int)

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

for row, model in enumerate(["text_gcn", "texting"]):
    for col, ds in enumerate(["mr", "20ng"]):
        ax = axes[row, col]
        sub = pp[(pp["params.model_type"] == model) & (pp["params.dataset_name"] == ds)]
        # Average duplicates
        sub = sub.groupby(["stopwords", "rare_threshold"])["metrics.test_accuracy"].mean().reset_index()
        
        pivot = sub.pivot(index="rare_threshold", columns="stopwords", values="metrics.test_accuracy")
        pivot.plot(kind="bar", ax=ax, color=["#5B9BD5", "#ED7D31"], edgecolor="black", linewidth=0.5)
        ax.set_title(f"{model_labels[model]} — {dataset_labels[ds]}", fontweight="bold")
        ax.set_xlabel("Rare Word Threshold")
        ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
        ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1, decimals=0))
        if row == 0 and col == 0:
            ax.legend(title="Stopwords")
        else:
            ax.get_legend().remove()
        if col == 0:
            ax.set_ylabel("Test Accuracy")

fig.suptitle("Preprocessing Impact on Model Performance", fontweight="bold", fontsize=14)
plt.tight_layout()
plt.show()

**Takeaway:** Preprocessing effects are inconsistent across models and datasets. TextGCN on 20NG collapses when rare-word filtering is disabled and stopwords are removed (threshold=0), likely because the oversized vocabulary creates a sparse, noisy graph. TextING is more robust to preprocessing choices. On MR, keeping stopwords generally helps the GNN models, possibly because sentiment-bearing function words (e.g., "not", "very") carry useful signal.

## 6. Ablation Studies

### 6.1 TextGCN — Window Size

In [None]:
abl_gcn = all_runs[all_runs["experiment_name"] == "ablation_textgcn"].copy()
abl_gcn["window"] = abl_gcn["params.window_size"].astype(int)

fig, ax = plt.subplots(figsize=(6, 4))
for ds in ["mr", "20ng"]:
    sub = abl_gcn[abl_gcn["params.dataset_name"] == ds].sort_values("window")
    ax.plot(sub["window"], sub["metrics.test_accuracy"],
            marker="o", label=dataset_labels[ds], linewidth=2)
ax.set_xlabel("PMI Window Size")
ax.set_ylabel("Test Accuracy")
ax.set_title("TextGCN — Effect of PMI Window Size", fontweight="bold")
ax.legend()
ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1, decimals=0))
plt.tight_layout()
plt.show()

Larger PMI window sizes improve TextGCN performance, especially on MR. Window=50 captures broader co-occurrence patterns. On 20NG, the improvement is smaller since topic-discriminative words tend to co-occur within narrow windows.
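To make the role of the window size concrete, here is a minimal sketch of sliding-window PMI as it is commonly defined for TextGCN-style word-word edges: PMI(i, j) = log(p(i, j) / (p(i) p(j))), with probabilities estimated from window counts and only positive values kept. The function names are hypothetical and the project's own graph builder may differ in detail.

```python
import math
from collections import Counter
from itertools import combinations

def sliding_windows(tokens, window_size):
    """All contiguous windows; a short document yields a single window."""
    if len(tokens) <= window_size:
        return [tokens]
    return [tokens[i:i + window_size] for i in range(len(tokens) - window_size + 1)]

def positive_pmi_edges(docs, window_size):
    """Map each word pair to its PMI, keeping only pairs with PMI > 0."""
    word_windows = Counter()  # number of windows containing each word
    pair_windows = Counter()  # number of windows containing both words
    total = 0
    for tokens in docs:
        for window in sliding_windows(tokens, window_size):
            total += 1
            unique = sorted(set(window))
            word_windows.update(unique)
            pair_windows.update(combinations(unique, 2))
    edges = {}
    for (a, b), n_ab in pair_windows.items():
        pmi = math.log(n_ab * total / (word_windows[a] * word_windows[b]))
        if pmi > 0:
            edges[(a, b)] = pmi
    return edges

# Toy corpus for illustration only.
docs = [["good", "movie", "great", "plot"], ["bad", "movie", "weak", "plot"]]
narrow = positive_pmi_edges(docs, window_size=2)
wide = positive_pmi_edges(docs, window_size=4)
print(sorted(narrow))
print(sorted(wide))
```

With a narrow window only adjacent words can be linked; a wider window lets distant words in the same document co-occur, which changes both the counts and the set of surviving edges.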

### 6.2 TextING — GRU Steps & Window Size

In [None]:
abl_ting = all_runs[all_runs["experiment_name"] == "ablation_texting"].copy()

fig, ax = plt.subplots(figsize=(7, 4))

abl_ting["config"] = "gru=" + abl_ting["params.model_gru_steps"].astype(str) + ", w=" + abl_ting["params.window_size"].astype(str)

for ds in ["mr", "20ng"]:
    sub = abl_ting[abl_ting["params.dataset_name"] == ds].sort_values("config")
    ax.bar(
        [f"{c}\n({dataset_labels[ds]})" for c in sub["config"]],
        sub["metrics.test_accuracy"],
        color="#5B9BD5" if ds == "mr" else "#ED7D31",
        edgecolor="black", linewidth=0.5, width=0.35,
        label=dataset_labels[ds]
    )

ax.set_ylabel("Test Accuracy")
ax.set_title("TextING Ablation — GRU Steps & Window Size", fontweight="bold")
ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1, decimals=0))
ax.legend()
plt.tight_layout()
plt.show()

Window=1 (no word-word edges, only self-loops) still performs surprisingly well, suggesting the GRU readout captures sufficient information even without graph structure. Reducing GRU steps to 1 maintains strong performance on 20NG.
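The window=1 case can be seen in a minimal sketch of per-document graph construction. It assumes edges connect distinct words that fall inside the same sliding window and that every word gets a self-loop; `document_edges` is a hypothetical helper, not the project's implementation.

```python
from itertools import combinations

def document_edges(tokens, window_size):
    """Undirected word-word edges for one document, plus a self-loop per word."""
    edges = {(w, w) for w in tokens}  # self-loops keep each node's own features
    for start in range(max(len(tokens) - window_size + 1, 1)):
        window = set(tokens[start:start + window_size])
        # Sorting gives each undirected pair a single canonical orientation.
        for a, b in combinations(sorted(window), 2):
            edges.add((a, b))
    return edges

# Toy document for illustration only.
tokens = ["not", "a", "good", "film"]
only_loops = document_edges(tokens, window_size=1)
with_neighbors = document_edges(tokens, window_size=2)
print(sorted(only_loops))
print(sorted(with_neighbors))
```

With `window_size=1` every window holds a single token, so no pairs are formed and message passing only ever sees each node's own state.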

## 7. Computational Cost

Training time and GPU memory for baseline configurations.

In [None]:
# Compute duration and collect system metrics for baseline runs
cost_data = []
for _, run in baseline.iterrows():
    run_info = client.get_run(run["run_id"]).info
    duration_s = (run_info.end_time - run_info.start_time) / 1000
    gpu_mem = run.get("metrics.system/gpu_0_memory_usage_megabytes")
    cost_data.append({
        "Model": model_labels[run["params.model_type"]],
        "Dataset": dataset_labels[run["params.dataset_name"]],
        "Duration (s)": round(duration_s),
        "GPU Memory (MB)": int(gpu_mem) if pd.notna(gpu_mem) else None,
        "Test Accuracy": f"{run['metrics.test_accuracy']:.2%}",
    })

cost_df = pd.DataFrame(cost_data).sort_values(["Dataset", "Model"]).reset_index(drop=True)
cost_df

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Duration
ax = axes[0]
pivot_dur = cost_df.pivot(index="Model", columns="Dataset", values="Duration (s)")
pivot_dur = pivot_dur.loc[[model_labels[m] for m in MODEL_ORDER]]
pivot_dur.plot(kind="bar", ax=ax, color=["#5B9BD5", "#ED7D31"], edgecolor="black", linewidth=0.5)
ax.set_title("Training Duration", fontweight="bold")
ax.set_ylabel("Seconds")
ax.set_xlabel("")
ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
for container in ax.containers:
    ax.bar_label(container, fmt="%ds", padding=2, fontsize=8)
ax.legend(title="Dataset")

# GPU Memory
ax = axes[1]
pivot_mem = cost_df.pivot(index="Model", columns="Dataset", values="GPU Memory (MB)")
pivot_mem = pivot_mem.loc[[model_labels[m] for m in MODEL_ORDER]]
pivot_mem.plot(kind="bar", ax=ax, color=["#5B9BD5", "#ED7D31"], edgecolor="black", linewidth=0.5)
ax.set_title("GPU Memory Usage", fontweight="bold")
ax.set_ylabel("MB")
ax.set_xlabel("")
ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
for container in ax.containers:
    labels = [f"{int(v)} MB" if not np.isnan(v) else "" for v in container.datavalues]
    ax.bar_label(container, labels=labels, padding=2, fontsize=8)
ax.get_legend().remove()

fig.suptitle("Computational Cost — Baseline Runs", fontweight="bold", fontsize=14)
plt.tight_layout()
plt.show()

**Takeaway:** TextGCN is the fastest model (4s on MR, 78s on 20NG) but uses the most GPU memory on 20NG (3.8 GB) because it holds the full corpus graph. BiLSTM is the slowest (1218s on 20NG), likely because its recurrent layers process each document token by token. TextING offers the best efficiency/accuracy tradeoff, with fast training (202s on 20NG) and moderate memory (1.3 GB).

## 8. Summary

Best results per model across all experiments.

In [None]:
# Build summary from all finished runs with valid test accuracy
valid = all_runs[
    (all_runs["params.model_type"].isin(MODEL_ORDER)) &
    (all_runs["metrics.test_accuracy"].notna()) &
    (all_runs["metrics.test_accuracy"] > 0.4)  # filter collapsed runs
].copy()

summary_rows = []
for model in MODEL_ORDER:
    for ds in ["mr", "20ng"]:
        sub = valid[(valid["params.model_type"] == model) & (valid["params.dataset_name"] == ds)]
        if len(sub) == 0:
            continue
        best = sub.loc[sub["metrics.test_accuracy"].idxmax()]
        baseline_row = baseline[
            (baseline["params.model_type"] == model) & (baseline["params.dataset_name"] == ds)
        ]
        bl_acc = baseline_row["metrics.test_accuracy"].values[0] if len(baseline_row) > 0 else None
        summary_rows.append({
            "Model": model_labels[model],
            "Dataset": dataset_labels[ds],
            "Baseline Acc": f"{bl_acc:.2%}" if bl_acc else "—",
            "Best Acc": f"{best['metrics.test_accuracy']:.2%}",
            "Best F1": f"{best['metrics.test_f1_macro']:.2%}",
            "Experiment": best["experiment_name"],
            "# Runs": len(sub),
        })

summary_df = pd.DataFrame(summary_rows)
summary_df

### Key Findings

1. **GNNs outperform BiLSTM** on both datasets, with TextGCN achieving the highest accuracy overall
2. **GNNs are more data-efficient** on 20NG, where TextGCN at 30% of the training data (~70%) approaches BiLSTM at 80% (~72%)
3. **TextGCN is sensitive to learning rate** — too low causes training collapse; BiLSTM is robust to hyperparameters
4. **Preprocessing matters for TextGCN** — rare word filtering is essential to prevent graph sparsity issues
5. **TextING offers the best tradeoff** — competitive accuracy with moderate compute and memory, plus inductive capability
6. **The 20NG gap is larger** — graph structure helps more for multi-class topic classification than binary sentiment