# PubMed RCT - Model Comparison

This notebook compares the performance of the three models trained in the previous notebooks:
1. Baseline (TF-IDF + Naive Bayes)
2. Embeddings (GloVe)
3. Deep Learning (Bi-LSTM)

**Prerequisite:** Run notebooks 02, 03, and 04 first to generate the result files.
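
The cells in this notebook read a single `test_accuracy` value from each JSON file in `../results`. As a minimal sketch of the file shape they expect, assuming each earlier notebook writes a flat JSON object (the helper name `save_results` and the value `0.5` are hypothetical placeholders, not taken from those notebooks):

```python
import json
import tempfile
from pathlib import Path


def save_results(results_dir, filename, test_accuracy):
    """Write a result file holding the one key this notebook reads (hypothetical helper)."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / filename
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"test_accuracy": float(test_accuracy)}, f, indent=2)
    return path


# Demo in a temporary directory; 0.5 is a placeholder, not a real result.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = save_results(tmp, "baseline_results.json", 0.5)
    demo_contents = json.loads(demo_path.read_text(encoding="utf-8"))
```

If the earlier notebooks store additional keys, the loading code here should be unaffected, since it only looks up `test_accuracy`.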

In [None]:
import json
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

%matplotlib inline

## Load Saved Results

In [None]:
results_dir = Path("../results")

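def load_results(filename):
    """Return the parsed JSON results in `filename`, or None if the file is missing."""
    path = results_dir / filename
    if path.exists():
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    print(f"File not found: {path}")
    return None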

baseline = load_results("baseline_results.json")
embeddings = load_results("embeddings_results.json")
bilstm = load_results("bilstm_results.json")

# Print loaded results
for name, res in [("Baseline", baseline), ("Embeddings", embeddings), ("Bi-LSTM", bilstm)]:
    if res:
        print(f"{name}: accuracy = {res['test_accuracy']:.4f}")
    else:
        print(f"{name}: not available (run the corresponding notebook first)")

## Comparison Table

In [None]:
# Pair each display label with its loaded results; skip models that are missing.
models = [
    ("Baseline (TF-IDF + NB)", baseline),
    ("Embeddings (GloVe)", embeddings),
    ("Bi-LSTM", bilstm),
]
rows = [
    {"Model": label, "Test Accuracy": res["test_accuracy"]}
    for label, res in models
    if res
]

if rows:
    df = pd.DataFrame(rows).set_index("Model")
    print(df.to_string())
else:
    print("No results to compare. Run notebooks 02-04 first.")

## Comparison Chart

In [None]:
if rows:
    fig, ax = plt.subplots(figsize=(8, 5))
    colors = ["#4c72b0", "#55a868", "#c44e52"]
    ax.bar(df.index, df["Test Accuracy"], color=colors[:len(df)], edgecolor="black")

    for i, val in enumerate(df["Test Accuracy"]):
        ax.text(i, val + 0.005, f"{val:.4f}", ha="center", fontweight="bold")

    ax.set_ylabel("Accuracy")
    ax.set_title("Test Accuracy Comparison")
    ax.set_ylim(0, 1.0)
    ax.grid(axis="y", alpha=0.3)
    plt.tight_layout()
    plt.show()

    # Best model
    best = df["Test Accuracy"].idxmax()
    print(f"Best model: {best} ({df.loc[best, 'Test Accuracy']:.4f})")

## Conclusion

The general pattern to look for in the accuracies reported above:
- The **Baseline model** (TF-IDF + Naive Bayes) provides a simple and fast reference point.
- **GloVe embeddings** capture word semantics and generally improve over the baseline.
- The **Bi-LSTM** model captures sequential dependencies and tends to perform best.
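
To check these expectations against the numbers rather than by eye, each model's gain over the baseline can be computed directly. The following is a minimal sketch, assuming accuracies are available as a plain dict; the function name `gains_over_baseline` and the example values are hypothetical placeholders, not results from this project.

```python
def gains_over_baseline(accuracies, baseline_name):
    """Return {model: (absolute_gain, error_reduction)} relative to the baseline."""
    base = accuracies[baseline_name]
    base_error = 1.0 - base
    gains = {}
    for name, acc in accuracies.items():
        if name == baseline_name:
            continue
        abs_gain = acc - base
        # Fraction of the baseline's errors that the model removes;
        # undefined when the baseline is already perfect.
        error_reduction = abs_gain / base_error if base_error > 0 else float("nan")
        gains[name] = (abs_gain, error_reduction)
    return gains


# Placeholder accuracies for illustration only.
example = {"Baseline": 0.50, "Model A": 0.75}
example_gains = gains_over_baseline(example, "Baseline")
```

With the comparison table from this notebook, the input could be built with `df["Test Accuracy"].to_dict()` and the baseline label `"Baseline (TF-IDF + NB)"`. A single test-set accuracy per model says nothing about variance, so small gaps should be read with caution.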

Possible improvements:
- Use biomedical embeddings (static BioWordVec vectors, or contextual embeddings from PubMedBERT)
- Add positional features (sentence position in abstract)
- Fine-tune a pre-trained transformer model (BERT, BioBERT)
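
The positional-feature idea can be sketched as follows. This assumes the sentences of one abstract are available as an ordered list; the function name `position_features` and the feature names are illustrative choices, not part of the earlier notebooks.

```python
def position_features(sentences):
    """Attach position-in-abstract features to each sentence of one abstract."""
    total = len(sentences)
    features = []
    for idx, text in enumerate(sentences):
        features.append({
            "text": text,
            "line_number": idx,      # zero-based index within the abstract
            "total_lines": total,    # number of sentences in the abstract
            # 0.0 for the first sentence, 1.0 for the last
            "relative_position": idx / (total - 1) if total > 1 else 0.0,
        })
    return features


# Made-up sentences for illustration only.
abstract = ["Background sentence.", "Methods sentence.", "Results sentence."]
feats = position_features(abstract)
```

These values could then be concatenated with the text representation (TF-IDF vector, averaged embeddings, or Bi-LSTM output) before the final classifier. How much this helps would need to be measured on the validation set.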