# Hybrid DS + DL Pipeline — Project Overview

This notebook is a **read-only overview** of a hybrid AI project
that combines classical machine learning (Module 1) with deep learning / NLP (Module 2).

It explains the **engineering flow and results**; it does not retrain any models.


## Why this notebook does not train models

In this project:
- All training is executed via deterministic Python scripts (`src/`)
- Artifacts are stored under `outputs/`
- This notebook only **loads and interprets results**

This separation improves:
- Reproducibility
- Reviewability
- Engineering clarity


In [1]:
from pathlib import Path
import json
import pandas as pd


## Project Structure (Simplified)

- `src/` — executable pipeline stages
- `outputs/models/` — trained model artifacts
- `outputs/results/` — metrics and metadata (JSON)
- `outputs/reports/` — human-readable summaries
- `docs/` — project-level documentation

The notebook reads from `outputs/` only.


In [6]:
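# Resolve the repository root from the current working directory.
# If the notebook was launched from a notebooks/ folder, step up one level.
PROJECT_ROOT = Path.cwd()
if PROJECT_ROOT.name.lower() == "notebooks":
    PROJECT_ROOT = PROJECT_ROOT.parent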

RESULTS_DIR = PROJECT_ROOT / "outputs" / "results"

print("PROJECT_ROOT:", PROJECT_ROOT)
print("RESULTS_DIR:", RESULTS_DIR)

PROJECT_ROOT: c:\Users\ORENS\hybrid-ds-dl-pipeline
RESULTS_DIR: c:\Users\ORENS\hybrid-ds-dl-pipeline\outputs\results
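
Because this notebook only reads artifacts, every later cell depends on the pipeline scripts having been run first. A minimal sketch of a guarded loader follows; `load_metrics` is a hypothetical helper introduced here for illustration and is not part of the project's `src/` code.

```python
import json
from pathlib import Path


def load_metrics(results_dir: Path, filename: str) -> dict:
    """Load one metrics JSON file from the results directory.

    Hypothetical helper: raises a FileNotFoundError with an actionable
    message when the artifact has not been produced yet.
    """
    path = Path(results_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found. Run the pipeline scripts first so that "
            "outputs/results/ is populated."
        )
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

With such a helper, a cell could call `load_metrics(RESULTS_DIR, "classical_metrics.json")` and fail early with a readable message instead of a bare traceback.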


## Classical ML Baseline

The classical model serves as:
- A fast baseline
- An interpretable reference point

This step answers:
> "Do we really need Deep Learning for this task?"


In [7]:
with open(RESULTS_DIR / "classical_metrics.json", "r", encoding="utf-8") as f:
    classical_metrics = json.load(f)

classical_metrics



{'model': 'LogisticRegression',
 'accuracy': 0.85,
 'f1_score': 0.8421052631578947,
 'note': 'Trained on synthetic data (baseline placeholder)'}

## Deep Learning Escalation

Deep Learning is introduced **only after** a classical baseline exists.

This reflects real-world practice:
- Start simple
- Measure performance
- Escalate complexity only if justified


In [8]:
with open(RESULTS_DIR / "dl_metrics.json", "r", encoding="utf-8") as f:
    dl_metrics = json.load(f)

dl_metrics



{'task': 'tabular_classification',
 'model': 'MLPClassifier',
 'accuracy': 0.925,
 'f1_score': 0.9210526315789473,
 'note': 'Numeric synthetic data (no text.csv found)',
 'artifact': 'outputs/models/dl_model.joblib',
 'training_curve_figure': 'outputs\\figures\\dl_training_loss_curve.png'}
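
To read the two results side by side, the metrics can be tabulated and the gains computed directly. The sketch below copies the values from the two outputs in this notebook and uses plain Python so it runs without extra dependencies; the row labels `classical` and `deep_learning` are chosen here for illustration.

```python
# Values copied from the classical and deep learning metrics outputs.
classical_metrics = {
    "model": "LogisticRegression",
    "accuracy": 0.85,
    "f1_score": 0.8421052631578947,
}
dl_metrics = {
    "model": "MLPClassifier",
    "accuracy": 0.925,
    "f1_score": 0.9210526315789473,
}

rows = {"classical": classical_metrics, "deep_learning": dl_metrics}

# Print a fixed-width comparison table.
print(f"{'pipeline':<14} {'model':<20} {'accuracy':>8} {'f1_score':>8}")
for name, m in rows.items():
    print(f"{name:<14} {m['model']:<20} {m['accuracy']:>8.3f} {m['f1_score']:>8.3f}")

# Absolute gains of the MLP over the logistic-regression baseline.
f1_gain = dl_metrics["f1_score"] - classical_metrics["f1_score"]
accuracy_gain = dl_metrics["accuracy"] - classical_metrics["accuracy"]
print(f"F1 gain: {f1_gain:+.3f}, accuracy gain: {accuracy_gain:+.3f}")
```

Both runs report synthetic data in their `note` fields, so the gains illustrate the comparison step rather than a real-world improvement.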

In [10]:
with open(RESULTS_DIR / "final_metrics.json", "r", encoding="utf-8") as f:
    final_metrics = json.load(f)

final_metrics["winner"]

'deep_learning'

## Model Comparison Result

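The winning model is selected automatically by comparing both models
on the shared primary metric (F1-score). In this run, `deep_learning` wins
with an F1-score of 0.921 against 0.842 for the classical baseline.

Both models were trained on synthetic data (see the `note` fields in the metrics above),
so the result demonstrates the selection mechanism rather than real-world performance.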

This decision is:
- Metric-driven
- Reproducible
- Free of manual bias
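
The selection rule described above can be sketched in a few lines. The project's actual selection code is not shown in this notebook, so `select_winner`, the `candidates` layout, and the tie-breaking behavior are illustrative assumptions; only the metric values and the `deep_learning` label come from the outputs above.

```python
def select_winner(candidates: dict, primary_metric: str = "f1_score") -> str:
    """Return the name of the candidate with the highest primary metric.

    Hypothetical helper. On a tie, the candidate listed first is kept,
    so listing the simpler model first favors it when scores are equal.
    """
    return max(candidates, key=lambda name: candidates[name][primary_metric])


# Metric values copied from the notebook outputs.
candidates = {
    "classical": {"f1_score": 0.8421052631578947},
    "deep_learning": {"f1_score": 0.9210526315789473},
}

winner = select_winner(candidates)
print(winner)
```

Keeping the rule in a pure function makes the decision easy to test and to rerun on new metrics files.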


## Persistence Layer

- SQLite is used for lightweight relational storage
- Mongo-style documents are demonstrated for experiment metadata

Both are optional and kept minimal to avoid unnecessary complexity.
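
As a rough illustration of the two storage styles, the sketch below writes the metrics to an in-memory SQLite table and builds a Mongo-style document as a nested dictionary. The table name `runs`, its columns, and the document fields (including `_id`) are assumptions made for this example; the project's real schema is not shown in this notebook.

```python
import json
import sqlite3

# Metric values copied from the notebook outputs.
runs = [
    ("classical", "LogisticRegression", 0.85, 0.8421052631578947),
    ("deep_learning", "MLPClassifier", 0.925, 0.9210526315789473),
]

# Relational storage: one row per pipeline in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs ("
    "pipeline TEXT PRIMARY KEY, model TEXT NOT NULL, "
    "accuracy REAL, f1_score REAL)"
)
conn.executemany("INSERT INTO runs VALUES (?, ?, ?, ?)", runs)
conn.commit()

# Query the best pipeline by the primary metric.
best = conn.execute(
    "SELECT pipeline, f1_score FROM runs ORDER BY f1_score DESC LIMIT 1"
).fetchone()
conn.close()

# Document storage: a nested, schema-free record of the same experiment.
experiment_doc = {
    "_id": "example-run",  # hypothetical identifier
    "primary_metric": "f1_score",
    "winner": best[0],
    "candidates": {
        name: {"model": model, "accuracy": acc, "f1_score": f1}
        for name, model, acc, f1 in runs
    },
}
print(json.dumps(experiment_doc, indent=2))
```

The relational table suits fixed, queryable metrics, while the document form can absorb run-specific fields such as the `note` or `artifact` entries without a schema change.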


## Final Notes

This notebook complements the pipeline by:
- Explaining the flow
- Presenting results
- Supporting review and presentation

All critical logic lives in scripts — not notebooks.
