# Email Spam Detection Pipeline

This notebook demonstrates how a **single Python package** (`src/`) can
handle:

* configuration via YAML
* data loading & merging
* pipeline construction + hyper‑parameter search
* evaluation and model persistence

The core script is `run.py`; the notebook calls the same helper
functions and adds visualisation.  The package is laid out as follows:


    src
    ├── __init__.py
    ├── config.py         # load_experiment_config & helpers
    ├── data_loader.py    # download_and_merge / split
    ├── model_builder.py  # build_pipeline + build_grid_search
    └── trainer.py        # fit_and_evaluate + save_model


The experiments live in `experiments/<run_XX>/` – each run has its own folder with a config file and results.
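
To make the "configuration via YAML" step concrete, here is a minimal sketch of how a loader such as `load_experiment_config` might be organised. It is not the actual contents of `src/config.py`: the names `ExperimentConfig`, `config_from_mapping` and `load_config_sketch` are hypothetical, the use of PyYAML is an assumption, and the field names are taken from the configuration printed further down in this notebook.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, List


@dataclass
class ExperimentConfig:
    """Hypothetical container; fields mirror the printed configuration."""
    datasets: List[str]
    test_size: float
    vectorizer: str
    classifier: str
    cv_folds: int
    scoring: str
    random_state: int
    vectorizer_params: Dict[str, List[Any]]
    classifier_params: Dict[str, List[Any]]


def _tuplify(values: List[Any]) -> List[Any]:
    # YAML has no tuple type, so an n-gram range arrives as a list such as
    # [1, 2]. scikit-learn's vectorizers expect a tuple, hence the conversion.
    return [tuple(v) if isinstance(v, list) else v for v in values]


def config_from_mapping(raw: Dict[str, Any]) -> ExperimentConfig:
    """Build the config from a plain dict, e.g. the result of parsing YAML."""
    raw = dict(raw)  # shallow copy so the caller's mapping is left untouched
    raw["vectorizer_params"] = {
        key: _tuplify(values) for key, values in raw["vectorizer_params"].items()
    }
    return ExperimentConfig(**raw)


def load_config_sketch(path: Path) -> ExperimentConfig:
    """Read a YAML file and convert it (assumes PyYAML is installed)."""
    import yaml  # imported lazily so the rest of the sketch has no dependency

    with open(path, encoding="utf-8") as fh:
        return config_from_mapping(yaml.safe_load(fh))
```

Keeping the dict-to-dataclass conversion separate from the file I/O means the conversion can be checked with a plain dictionary, without touching the file system.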

In [3]:
# ---- Imports --------------------------------------------------------------
import json
from pathlib import Path
from typing import Dict, Any

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix

# Local imports – make sure the notebook is executed from the repo root
from src.config import load_experiment_config
from src.data_loader import download_and_merge, split
from src.model_builder import build_grid_search, build_pipeline
from src.trainer import fit_and_evaluate, save_model

pd.set_option("display.max_colwidth", None)        # never truncate cell contents
pd.set_option("display.max_seq_items", None)       # show every element in a list/tuple
pd.set_option("display.expand_frame_repr", False)  # do not wrap wide frames across lines

In [4]:
from dataclasses import asdict

# Path to the run you want to visualise
cfg_path = Path("experiments/TFIDF_NB/config.yaml")

cfg = load_experiment_config(cfg_path)
print("Configuration loaded:")
display(pd.Series(asdict(cfg)))

Configuration loaded:


datasets                                                                       [enron]
test_size                                                                          0.2
vectorizer                                                                       tfidf
classifier                                                               MultinomialNB
cv_folds                                                                             4
scoring                                                                       f1_macro
random_state                                                                        42
vectorizer_params    {'tfidf__min_df': [1, 3], 'tfidf__ngram_range': [(1, 1), (1, 2)]}
classifier_params                            {'MultinomialNB__alpha': [0.1, 0.3, 0.5]}
dtype: object
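
The parameter keys in this configuration follow scikit-learn's `<step>__<parameter>` convention, so the prefixes `tfidf` and `MultinomialNB` have to match the step names of the pipeline being searched. The sketch below shows how such a configuration could map onto a `Pipeline` and a `GridSearchCV`. It is an illustration built from the printed values, not the actual body of `build_pipeline` or `build_grid_search`, and the step names are inferred from the key prefixes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Values copied from the configuration printed above.
vectorizer_params = {"tfidf__min_df": [1, 3], "tfidf__ngram_range": [(1, 1), (1, 2)]}
classifier_params = {"MultinomialNB__alpha": [0.1, 0.3, 0.5]}

# Step names are assumed to equal the prefixes before "__" in the keys.
pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer()),
        ("MultinomialNB", MultinomialNB()),
    ]
)

# Both parameter dictionaries are merged into a single search grid.
param_grid = {**vectorizer_params, **classifier_params}
search = GridSearchCV(pipeline, param_grid=param_grid, cv=4, scoring="f1_macro")

# Size of the search: the product of the number of options per parameter.
n_candidates = 1
for values in param_grid.values():
    n_candidates *= len(values)

print(f"{n_candidates} candidates x 4 folds = {n_candidates * 4} fits")
```

With two values for `min_df`, two n-gram ranges and three values of `alpha`, the grid holds 12 candidates, which means 48 cross-validation fits at `cv_folds = 4`, plus one final refit on the full training set if `refit` is left at its default.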