# 🧪 Analyst Toolkit Tutorial: Full Data Pipeline with Penguins

This interactive notebook demonstrates the complete analyst pipeline using a synthetic **Palmer Penguins** dataset generated from the [`penguin_research_toolkit` repository](https://github.com/G-Schumacher44/penguin_research_toolkit).

Each step in the pipeline is modular, YAML-configurable, and produces exports, plots, and certification-ready reports.

The toolkit is packaged with a **`pyproject.toml`** file and can be run from a script or a notebook.


### 🧰 Toolkit Architecture: 3-Way Modular Design

This pipeline is built around a flexible ETL framework with three usage modes:

- 📓 **Notebook Mode**
  - Run individual modules or the full pipeline interactively
  - Supports HTML dashboards, widgets, and live previews
  - Ideal for iterative exploration, first-pass audits, and QA workflows

- 🧵 **CLI Mode**
  - Execute the full pipeline using `run_toolkit_pipeline.py`
  - Controlled via a master YAML config
  - Exports all reports, checkpoints, and logs to disk

- 🧪 **Hybrid Mode**
  - Develop in notebooks, deploy via scripts
  - Reuse the same configs across testing and production

The toolkit handles essential data cleaning and transformation tasks, enabling analysts to focus on:
- Exploratory Data Analysis (EDA)
- Investigating anomalies and data quality issues
- Extracting actionable insights from certified data

In [1]:
# 📁 Load Configuration and Set Execution Context

from analyst_toolkit.m00_utils.config_loader import load_config

# Path to master config (modify if needed)
config_path = "config/run_toolkit_config.yaml"

# Load full configuration dictionary
config = load_config(config_path)

# Extract run-level settings
run_id = config.get("run_id", "default_run")
notebook_mode = config.get("notebook", True)

print(f"🔧 Config loaded | Run ID: {run_id} | Notebook Mode: {notebook_mode}")

🔧 Config loaded | Run ID: CLI_2_QA | Notebook Mode: True


In [2]:
# 📥 Load Raw Data from CSV

from analyst_toolkit.m00_utils.load_data import load_csv

# Load input path from the global config (or override manually)
input_path = config.get("pipeline_entry_path", "data/raw/synthetic_penguins_v3.5.csv")
print(f"📂 Loading data from: {input_path}")

# Load into DataFrame
df_raw = load_csv(input_path)

📂 Loading data from: data/raw/synthetic_penguins_v3.5.csv


### 🧪 Step 1: Run Initial Diagnostics (M01)

This module profiles the raw dataset for key structural and quality checks:
- **Memory, Shape, Dtypes**  
- **Missing Values & Skewness**
- **Duplicate Detection**
- **Sample Rows & Descriptive Stats**

✅ All results are rendered in a collapsible dashboard with exportable reports.  
You can toggle inline previews and export settings via the YAML config (`diag_config_template.yaml`).


> 🛠️ To modify thresholds or toggle sections, edit the config under `diagnostics.settings`.
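
As a rough illustration of what such a profile measures, the sketch below computes per-column missing percentages and an exact-duplicate row count in plain Python. It is independent of the toolkit: `profile_rows` and the sample rows are hypothetical, and `run_diag_pipeline` may compute these figures differently.

```python
from collections import Counter


def profile_rows(rows, columns):
    """Return shape, per-column missing stats, and exact-duplicate count.

    Hypothetical helper for illustration; "missing" means None or "".
    """
    n = len(rows)
    missing = {}
    for col in columns:
        count = sum(1 for r in rows if r.get(col) in (None, ""))
        pct = round(100 * count / n, 2) if n else 0.0
        missing[col] = {"count": count, "pct": pct}
    # Each repeat of an earlier identical row counts as one duplicate.
    seen = Counter(tuple(r.get(c) for c in columns) for r in rows)
    duplicates = sum(k - 1 for k in seen.values())
    return {
        "rows": n,
        "columns": len(columns),
        "missing": missing,
        "duplicates": duplicates,
    }


sample = [
    {"species": "Gentoo", "sex": "Male", "body_mass_g": 5890.0},
    {"species": "Gentoo", "sex": "Male", "body_mass_g": 5890.0},
    {"species": "Adelie", "sex": "", "body_mass_g": 2500.0},
    {"species": None, "sex": "Female", "body_mass_g": 4536.0},
]
report = profile_rows(sample, ["species", "sex", "body_mass_g"])
print(report["duplicates"], report["missing"]["sex"])
```
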

In [3]:
# 📊 M01: Data Diagnostics – Profile Structure & Shape

from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m00_utils.load_data import load_csv
from analyst_toolkit.m01_diagnostics.run_diag_pipeline import run_diag_pipeline

# --- Load module-specific config ---
diag_config = load_config("config/diag_config_template.yaml")
diag_cfg = diag_config.get("diagnostics", {})
notebook_mode = diag_config.get("notebook", True)
run_id = diag_config.get("run_id", "demo_run")

# --- Load raw data from path defined in config ---
input_path = diag_cfg.get("input_path")
if not input_path:
    raise ValueError("🛑 No input_path specified in diagnostics config.")
df_raw = load_csv(input_path)

# --- Run Diagnostics Module ---
df_profiled = run_diag_pipeline(
    config=diag_cfg,
    df=df_raw,
    notebook=notebook_mode,
    run_id=run_id
)

Rows,Columns
5541,15

Memory Usage
3.26 MB

Duplicate Rows,Duplicate %
1,0.02


Column,Unique Values
tag_id,2678
capture_date,1917
date_egg,1656
colony_id,19
study_name,12
island,11

Column,Dtype,Unique Values,Audit Remarks,Missing Count,Missing %
tag_id,object,2678,✅ OK,2242,40.46
species,object,5,✅ OK,166,3.0
bill length (mm),float64,1984,✅ OK,429,7.74
bill depth (mm),float64,862,✅ OK,417,7.53
flipper_length_mm,float64,1466,✅ OK,451,8.14
body_mass_g,float64,3328,✅ OK,406,7.33
age_group,object,7,✅ OK,121,2.18
sex,object,6,✅ OK,2739,49.43
colony_id,object,19,✅ OK,405,7.31
island,object,11,✅ OK,584,10.54


Metric,count,mean,std,min,25%,50%,75%,max,skew,kurtosis
bill length (mm),5112.0,45.166682,5.66641,30.63,40.51,45.95,49.36,62.64,-0.145952,-0.606829
bill depth (mm),5124.0,17.305377,2.231495,12.37,15.49,17.485,19.03,23.01,-0.111456,-0.897492
flipper_length_mm,5090.0,202.2378,14.342621,162.79,191.1,199.315,214.1,252.4,0.329099,-0.616376
body_mass_g,5135.0,3853.645265,898.232986,2376.56,3219.5,3742.0,4376.515,7378.33,0.616778,0.086446


tag_id,species,bill length (mm),bill depth (mm),flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg
,Gentoo,48.99,14.11,220.9,5890.0,Adult,Male,Torgersen North,Torgersen,2023-11-17,,PAPRI2023,Yes,2023-11-09
,Gentoo,48.99,14.11,220.9,5890.0,Adult,Male,Torgersen North,Torgersen,2023-11-17,,PAPRI2023,Yes,2023-11-09


tag_id,species,bill length (mm),bill depth (mm),flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg
,Gentoo,48.99,14.11,220.9,5890.0,Adult,Male,Torgersen North,Torgersen,2023-11-17,,PAPRI2023,Yes,2023-11-09
,Gentoo,48.99,14.11,220.9,5890.0,Adult,Male,Torgersen North,Torgersen,2023-11-17,,PAPRI2023,Yes,2023-11-09
ADE-0001,Adelie,39.55,19.92,186.2,2500.0,Chick,Male,Biscoe West,Biscoe,2024-13-03,Underweight,PAPRI2022,Yes,2022-07-20
,Gentoo,48.23,13.0,,4536.0,Adult,Female,Biscoe West,,2024-04-14,Healthy,,Yes,2024-04-12
GEN-0001,Gentoo,46.22,13.91,212.8,2500.0,Juvenile,Female,Dream South,Dream,,Underweight,PAPRI2020,Yes,2020-04-14


[Widget output: "Visual Profile" accordion; interactive plots are not rendered in this static export]

### 🛡️ Step 2: Run Schema & Content Validation (M02)

This module audits the dataset against a defined schema to catch issues early and guide cleaning steps:
- **Expected Columns & Dtypes**  
- **Allowed Categorical Values**
- **Numeric Range Checks**
- **Null Allowance (optional)**

✅ All results are displayed in a styled validation dashboard with exportable reports.  
You can define strict or flexible rules in the YAML config (`validation_config_template.yaml`).

> 🛠️ To adjust enforcement (e.g. halt-on-fail), set `fail_on_error` and update rules under `validation.schema_validation`.
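
The four rule families above can be sketched as a small, non-raising validator. This is a minimal illustration, not the toolkit's implementation: the `validate` function, the schema keys (`columns`, `allowed_values`, `ranges`), and the body-mass range are all assumptions made for the example, and the real YAML layout under `validation.schema_validation` may differ.

```python
def validate(rows, schema):
    """Collect schema issues as strings instead of raising."""
    issues = []
    expected = set(schema["columns"])
    actual = set(rows[0]) if rows else set()
    for col in sorted(expected - actual):
        issues.append(f"missing column: {col}")
    for col in sorted(actual - expected):
        issues.append(f"unexpected column: {col}")
    # Categorical check: report each distinct disallowed value once.
    for col, allowed in schema.get("allowed_values", {}).items():
        bad = sorted(
            {r[col] for r in rows
             if r.get(col) is not None and r[col] not in allowed}
        )
        issues.extend(f"{col}: invalid value {v!r}" for v in bad)
    # Range check: nulls are skipped, bounds are inclusive.
    for col, (low, high) in schema.get("ranges", {}).items():
        for r in rows:
            v = r.get(col)
            if v is not None and not low <= v <= high:
                issues.append(f"{col}: {v} outside [{low}, {high}]")
    return issues


schema = {
    "columns": ["species", "bill_length_mm", "body_mass_g"],
    "allowed_values": {"species": {"Adelie", "Chinstrap", "Gentoo"}},
    "ranges": {"body_mass_g": (2000, 8000)},  # illustrative bounds only
}
rows = [
    {"species": "Gentto", "bill length (mm)": 48.23, "body_mass_g": 4536.0},
    {"species": "Adelie", "bill length (mm)": 39.55, "body_mass_g": 2500.0},
]
for issue in validate(rows, schema):
    print(issue)
```
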

In [4]:
# 🛡️ M02: Schema & Content Validation – First Audit Pass

from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m00_utils.load_data import load_csv
from analyst_toolkit.m02_validation.run_validation_pipeline import run_validation_pipeline

# --- Load config and unpack validation settings ---
val_config = load_config("config/validation_config_template.yaml")
val_cfg = val_config.get("validation", {})
notebook_mode = val_config.get("notebook", True)
run_id = val_config.get("run_id", "demo_run")

# --- Run Validation Module ---
df_validated = run_validation_pipeline(
    config=val_cfg,
    df=df_profiled,
    notebook=notebook_mode,
    run_id=run_id
)

Validation Rule,Description,Status
Schema Conformity,Verify column names match the expected schema.,⚠️ Fail (2 issues)
Dtype Enforcement,Verify column data types match expectations.,⚠️ Fail (1 issues)
Categorical Values,Verify values in categorical columns are within an allowed set.,⚠️ Fail (7 issues)
Numeric Ranges,Verify values in numeric columns are within a defined range.,✅ Pass


Issue Type,Columns
Missing,"bill depth_mm, bill_length_mm"
Unexpected,"bill length (mm), bill depth (mm)"

Column,Expected Type,Actual Type
flipper_length_mm,int64,float64

Invalid Value,Count
adeleie,148
Gentto,145

Invalid Value,Count
short cut,70
torg,61
unknown,59
bisco,55
cormor,47
dreamland,46

Invalid Value,Count
Male,1308
Female,1227
F,83
?,74
M,61
Unknown,49

Invalid Value,Count
cormorant NW,45
invalid_colony,36
Torgersen,35
Cormorant,34
biscoe 2,34
torgersen SE,31
TORGERSEN 4,30
short point,28
/Shortcut,26
Biscoe,25

Invalid Value,Count
juvenille,58
unk,48
ADLT,47
chik,29

Invalid Value,Count
critcal ill,36
Overwight,34
under weight,33
ok,30

Invalid Value,Count
PAPR12021,60
papri2024,58
STUDY_2022,57
PP2020,48
PAPR2023,46
PAPRI20X9,37


### 🧹 Step 3: Normalize & Standardize Data (M03)

This module performs rule-based cleaning and normalization to prepare the dataset for certification:
- **Column Renaming & Type Coercion**
- **Value Mapping & Text Cleaning**
- **Fuzzy Matching & Datetime Parsing**

✅ Results are rendered in a structured dashboard with before/after comparisons and audit previews.  
All rules and output paths are controlled via the YAML config (`normalization_config_template.yaml`).

> 🛠️ To adjust cleaning logic, modify the `rules` block (e.g. `value_mappings`, `preview_columns`, etc).
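
To make the fuzzy-matching step concrete, here is a minimal sketch that maps misspelled labels onto a canonical list. It uses the standard library's `difflib` as a stand-in; the toolkit's own matcher and scoring may differ, and `fuzzy_correct` with its `cutoff` default is a hypothetical helper.

```python
from difflib import get_close_matches


def fuzzy_correct(value, canonical, cutoff=0.8):
    """Return the closest canonical label, or None if none clears the cutoff."""
    # Compare case-insensitively, then map back to the canonical spelling.
    lookup = {c.lower(): c for c in canonical}
    hits = get_close_matches(
        value.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


species = ["Adelie", "Chinstrap", "Gentoo"]
for raw in ["Gentto", "adeleie", "penguin?"]:
    print(raw, "->", fuzzy_correct(raw, species))
```

Values that return `None` are left for an explicit mapping or manual review rather than being forced onto the nearest label.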

In [5]:
# 🧹 M03: Data Normalization – Standardizing Key Fields

from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m03_normalization.run_normalization_pipeline import run_normalization_pipeline
import logging

# --- Load Config ---
config = load_config("config/normalization_config_template.yaml")
norm_cfg = config.get("normalization", {})
run_id = config.get("run_id")
notebook_mode = config.get("notebook", True)

df_normalized = run_normalization_pipeline(
    config=norm_cfg,
    df=df_validated,
    notebook=notebook_mode,
    run_id=run_id
)

Original Name,New Name
bill length (mm),bill_length_mm
bill depth (mm),bill_depth_mm

Column,Operation
clutch_completion,standardize_text
sex,standardize_text

Column,Target Type
capture_date,datetime64[ns]
date_egg,datetime64[ns]

Column,Mappings Applied
sex,7
species,1
island,1
colony_id,14
age_group,4
health_status,7
study_name,6

Column,Original,Corrected,Score
species,Gentto,Gentoo,83
species,adeleie,Adelie,92
island,bisco,Biscoe,91
island,short cut,Shortcut,94
island,dreamland,Dream,90
island,cormor,Cormorant,90
island,torg,Torgersen,90


Value,Count
,2739
MALE,1369
FEMALE,1310
UNKNOWN,123

Value,Original Count,Normalized Count
,2739,2739
Male,1308,0
Female,1227,0
F,83,0
?,74,0
M,61,0
Unknown,49,0
MALE,0,1369
FEMALE,0,1310
UNKNOWN,0,123

Value,Count
Torgersen,1405
Dream,1184
Biscoe,1084
Cormorant,715
,584
Shortcut,510
UNKNOWN,59

Value,Original Count,Normalized Count
Torgersen,1344,1405
Dream,1138,1184
Biscoe,1029,1084
Cormorant,668,715
,584,584
Shortcut,440,510
short cut,70,0
torg,61,0
unknown,59,0
bisco,55,0

Value,Count
Gentoo,1815
Adelie,1784
Chinstrap,1776
,166

Value,Original Count,Normalized Count
Chinstrap,1776,1776
Gentoo,1670,1815
Adelie,1636,1784
,166,166
adeleie,148,0
Gentto,145,0

Value,Count
Healthy,2194
Underweight,1411
Overweight,733
,554
Critical,323
Sick,296
UNKNOWN,30

Value,Original Count,Normalized Count
Healthy,2194,2194
Underweight,1378,1411
Overweight,699,733
,554,554
Unwell,296,0
Critically Ill,287,0
critcal ill,36,0
Overwight,34,0
under weight,33,0
ok,30,0

Value,Count
Torgersen North,1490
Dream South,1216
Biscoe West,1092
Cormorant East,767
Shortcut Point,511
,405
UNKNOWN,60

Value,Original Count,Normalized Count
Torgersen North,1394,1490
Dream South,1151,1216
Biscoe West,1033,1092
Cormorant East,688,767
Shortcut Point,457,511
,405,405
cormorant NW,45,0
invalid_colony,36,0
Torgersen,35,0
Cormorant,34,0

Value,Count
Adult,3822
Juvenile,1073
Chick,477
,121
UNKNOWN,48

Value,Original Count,Normalized Count
Adult,3775,3822
Juvenile,1015,1073
Chick,448,477
,121,121
juvenille,58,0
unk,48,0
ADLT,47,0
chik,29,0
UNKNOWN,0,48

Value,Count
PAPRI2020,1122
PAPRI2021,1024
PAPRI2022,916
PAPRI2023,824
PAPRI2024,803
,563
PAPRI2019,252
UNKNOWN,37

Value,Original Count,Normalized Count
PAPRI2020,1074,1122
PAPRI2021,964,1024
PAPRI2022,859,916
PAPRI2023,778,824
PAPRI2024,745,803
,563,563
PAPRI2019,252,252
PAPR12021,60,0
papri2024,58,0
STUDY_2022,57,0

Value,Count
NaT,915
2023-01-18,10
2024-05-09,10
2024-02-01,9
2023-06-12,8
2020-12-25,8
2022-11-15,8
2023-06-10,8
2023-03-22,8
2024-01-01,8

Value,Original Count,Normalized Count
,534,915
9999-99-99,39,0
error,33,0
not-a-date,30,0
2023-01-18,10,10
2024-05-09,10,10
2024-02-01,9,9
2020-12-25,8,8
2022-08-04,8,8
2022-11-15,8,8

Value,Count
NaT,836
2019-12-11,13
2019-12-27,12
2020-10-11,11
2020-07-20,11
2019-12-17,11
2019-11-25,11
2020-06-25,11
2021-04-03,10
2021-04-16,10

Value,Original Count,Normalized Count
,836,836
2019-12-11,13,13
2019-12-27,12,12
2019-11-25,11,11
2019-12-17,11,11
2020-06-25,11,11
2020-07-20,11,11
2020-10-11,11,11
2021-04-03,10,10
2021-04-16,10,10

Value,Count
yes,4314
no,764
,463

Value,Original Count,Normalized Count
Yes,4314,0
No,764,0
,463,463
yes,0,4314
no,0,764


### 🛡️ Step 4: Certification Gatekeeper (M04)

This module enforces **strict schema and content rules** and is designed to **halt the pipeline** if violations are found:
- ✅ All column names, data types, categorical values, and numeric ranges must pass
- 🛑 **`fail_on_error: true`** triggers a hard stop on validation failure

📦 This step can be run **at any point in the pipeline** — not just the end.  
Use it wherever you want to **certify a dataset snapshot** or block further execution unless data meets expectations.

✅ Results are rendered inline with full export support.  
All certification rules live in the YAML config (`certification_config_template.yaml`).

> 🛠️ Adjust gatekeeping behavior by modifying schema rules or toggling `fail_on_error`.
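
The halt-on-failure behaviour can be pictured with the following sketch. The `certify` function and `CertificationError` class are hypothetical names used only to illustrate what a `fail_on_error` switch typically does; they are not part of the toolkit's API.

```python
class CertificationError(Exception):
    """Raised when a certification pass finds violations (hypothetical)."""


def certify(issues, fail_on_error=True):
    """Return True if clean; otherwise raise or return False per the flag."""
    if not issues:
        return True
    if fail_on_error:
        summary = "; ".join(issues)
        raise CertificationError(f"{len(issues)} violation(s): {summary}")
    return False


try:
    certify(["species: invalid value 'Gentto'"])
except CertificationError as exc:
    print("halted:", exc)
```

With `fail_on_error=False` the same call returns `False`, so downstream cells keep running and the caller decides what to do with the failure.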

In [6]:
# 🛡️ M04: Certification (Strict Validation Gatekeeper; reuses the M02 validation pipeline)

import logging
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m02_validation.run_validation_pipeline import run_validation_pipeline

# --- Load Certification Config ---
config = load_config("config/certification_config_template.yaml")
cert_cfg = config.get("validation", {})
notebook_mode = config.get("notebook_mode", True)
run_id = config.get("run_id")

# --- Run Final Certification Pass ---
logging.info("🚀 Starting M04: Certification (Validation Gatekeeper)")

df_certified = run_validation_pipeline(
    config=cert_cfg,
    notebook=notebook_mode,
    df=df_normalized,
    run_id=run_id
)

Validation Rule,Description,Status
Schema Conformity,Verify column names match the expected schema.,✅ Pass
Dtype Enforcement,Verify column data types match expectations.,✅ Pass
Categorical Values,Verify values in categorical columns are within an allowed set.,✅ Pass
Numeric Ranges,Verify values in numeric columns are within a defined range.,✅ Pass


### 🧹 Step 5: Deduplication (M05)

This module identifies and handles **duplicate rows** in the dataset.

You can choose to:
- 🔍 **Flag duplicates** for review  
- ✂️ **Remove duplicates** directly (default: keep first occurrence)

✅ Configurable logic lets you define:
- Which columns to check for duplication (`subset_columns`)
- Whether to flag or drop (`mode: "flag"` or `"remove"`)
- Columns to preview (hide IDs, timestamps, etc.)

📄 Results are displayed with an inline preview and summary plots.

> 🛠️ Adjust deduplication behavior in `dups_config_template.yaml`.
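
The difference between the two modes can be sketched as follows. This is an assumption-laden illustration rather than the module's code: `dedupe` and the `is_duplicate` flag column are hypothetical names, and only the keep-first behaviour described above is modelled.

```python
def dedupe(rows, subset_columns, mode="flag"):
    """Flag or remove rows whose subset_columns repeat an earlier row."""
    if mode not in ("flag", "remove"):
        raise ValueError(f"unknown mode: {mode!r}")
    seen = set()
    out = []
    for row in rows:
        key = tuple(row.get(c) for c in subset_columns)
        is_dup = key in seen  # the first occurrence is never a duplicate
        seen.add(key)
        if mode == "flag":
            out.append({**row, "is_duplicate": is_dup})
        elif not is_dup:
            out.append(dict(row))
    return out


rows = [
    {"tag_id": "ADE-0001", "body_mass_g": 2500.0},
    {"tag_id": "ADE-0001", "body_mass_g": 2477.78},
    {"tag_id": "ADE-0013", "body_mass_g": 3224.0},
]
flagged = dedupe(rows, ["tag_id"], mode="flag")
removed = dedupe(rows, ["tag_id"], mode="remove")
print([r["is_duplicate"] for r in flagged], len(removed))
```
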

In [8]:
# ♻️ M05: Deduplication and Duplicate Handling (package: m04_duplicates)

from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m04_duplicates.run_dupes_pipeline import run_duplicates_pipeline

# --- Load Config ---
config = load_config("config/dups_config_template.yaml")
notebook_mode = config.get("notebook", True)
run_id = config.get("run_id")


# --- Run Duplicates Module (flag or remove, per `mode` in the config) ---
df_deduped = run_duplicates_pipeline(
    config=config,
    df=df_certified,
    notebook=notebook_mode,
    run_id=run_id
)

Metric,Value
Original Row Count,5541
Deduplicated Row Count,4721
Rows Removed,820

tag_id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg
,Gentoo,48.99,14.11,220.9,5890.0,Adult,MALE,Torgersen North,Torgersen,2023-11-17,,PAPRI2023,yes,2023-11-09
,Gentoo,54.95,16.16,222.7,5261.0,Adult,,Torgersen North,Torgersen,NaT,Sick,PAPRI2022,no,NaT
,Gentoo,42.93,15.76,210.2,4086.0,Juvenile,,Dream South,Dream,NaT,Healthy,PAPRI2021,yes,2021-08-31
,Chinstrap,,19.72,203.7,,Adult,,Shortcut Point,Torgersen,NaT,Underweight,PAPRI2024,yes,2024-04-06
,Gentoo,49.91,14.61,,4262.0,Adult,,Shortcut Point,Shortcut,NaT,Underweight,PAPRI2020,yes,2020-07-30
,Chinstrap,,16.73,214.0,3533.0,,FEMALE,Biscoe West,Biscoe,NaT,,PAPRI2020,yes,2020-12-05
,Gentoo,44.73,,211.1,5150.0,Adult,FEMALE,Biscoe West,Biscoe,2019-11-29,Healthy,,yes,2019-11-25
,Adelie,37.25,,206.6,2790.0,Juvenile,FEMALE,Dream South,Biscoe,NaT,Underweight,,yes,2020-08-06
,Chinstrap,44.16,19.3,188.7,4314.0,Adult,MALE,Cormorant East,Cormorant,NaT,Overweight,PAPRI2020,yes,2020-02-26
,Gentoo,41.7,13.35,,5273.0,Adult,,Biscoe West,Biscoe,NaT,,PAPRI2024,no,NaT

tag_id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg
ADE-0001,Adelie,39.55,19.92,186.2,2500.0,Chick,MALE,Biscoe West,Biscoe,NaT,Underweight,PAPRI2022,yes,2022-07-20
ADE-0001,Adelie,42.6,21.37,184.5,2477.78,Juvenile,MALE,Biscoe West,Biscoe,NaT,Healthy,PAPRI2022,yes,2022-07-20
ADE-0001,Adelie,38.7,20.78,202.74,2650.73,Juvenile,MALE,Biscoe West,Biscoe,NaT,Underweight,PAPRI2022,yes,2022-07-20
ADE-0013,Adelie,40.28,18.1,188.6,3224.0,Juvenile,,,Cormorant,NaT,,PAPRI2022,yes,2022-06-18
ADE-0013,Adelie,41.51,19.31,182.31,3322.26,Adult,,,Cormorant,NaT,Overweight,PAPRI2022,yes,2022-06-18
ADE-0049,Adelie,,18.46,185.4,3326.0,Adult,FEMALE,Shortcut Point,Shortcut,NaT,Healthy,PAPRI2024,yes,2024-08-29
ADE-0049,Adelie,,17.77,176.49,3175.64,Adult,FEMALE,Shortcut Point,Shortcut,NaT,Overweight,PAPRI2024,yes,2024-08-29
ADE-0054,Adelie,42.06,17.93,,4125.0,Adult,MALE,Biscoe West,Biscoe,NaT,Overweight,PAPRI2022,,2022-10-28
ADE-0054,Adelie,42.53,18.07,,4342.78,Adult,MALE,Biscoe West,Biscoe,NaT,Critical,PAPRI2022,,2022-10-28
ADE-0073,Adelie,41.64,17.1,192.8,2500.0,Chick,FEMALE,Torgersen North,,NaT,Overweight,PAPRI2023,yes,2023-02-24


[Widget output: "Visual Summary" accordion; interactive plots are not rendered in this static export]

In [None]:

df_deduped.shape

In [None]:
df_deduped.columns.tolist()

### 📏 Step 6: Detect Outliers (M06)

This module scans numeric columns for outliers using configurable logic:
- **Z-Score** or **IQR** methods (per column or global default)
- Adds binary flags (e.g., `*_outlier`) to the dataset if `append_flags: true`
- Skips non-numeric or excluded fields via `exclude_columns`

📊 **Interactive PlotViewer**  
If enabled, the `PlotViewer` renders **boxplots, histograms, and violin plots** inline  
— giving a fast visual summary of where anomalies occur.

📁 **What’s Exported:**
- ✅ `df_outliers_flagged`: DataFrame with new `_outlier` columns
- ✅ `detection_results`: thresholds and summary tables
- ✅ Plots: saved to `exports/plots/outliers/{run_id}/`
- ✅ Report: XLSX or CSV, based on config

> 🛠️ Configure methods, thresholds, excluded columns, and plot types in `outlier_config_template.yaml`.
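
For readers who want to see the two detection rules side by side, here is a dependency-free sketch. The function names and default thresholds (`k=1.5`, `threshold=3.0`) are assumptions chosen for illustration; the toolkit's defaults and quantile method may differ.

```python
import statistics


def iqr_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; also return the bounds."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [not lower <= v <= upper for v in values], (lower, upper)


def zscore_flags(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return [False] * len(values)  # constant column: nothing to flag
    return [abs(v - mean) / sd > threshold for v in values]


data = [10, 12, 11, 13, 12, 11, 50]
flags, bounds = iqr_flags(data)
print(flags, bounds)
print(zscore_flags(data, threshold=3.0))
```

On this seven-value sample the IQR rule flags `50`, while a z-score threshold of 3.0 does not, because the extreme value inflates the standard deviation it is measured against. That is one reason to allow the method to be chosen per column.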

In [None]:
# 📏 M06: Detect Outliers and Plot Visuals (package: m05_detect_outliers)

from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m05_detect_outliers.run_detection_pipeline import run_outlier_detection_pipeline

# --- Load Config ---
config = load_config("config/outlier_config_template.yaml")
outlier_cfg = config.get("outlier_detection", {})

# Global settings live at the top level of the config
notebook_mode = config.get("notebook", True)
run_id = config.get("run_id")  # None if the config does not set one

# df_deduped is the output of the Step 5 deduplication cell
if "df_deduped" not in locals():
    raise NameError("🛑 df_deduped is not defined. Run the Step 5 deduplication cell first.")

df_outliers_flagged, detection_results = run_outlier_detection_pipeline(
    config=outlier_cfg,
    df=df_deduped,
    notebook=notebook_mode,
    run_id=run_id
)

### 🧼 Step 7: Handle Outliers (M07)

This module applies cleanup strategies to flagged outliers from the detection step:
- Strategies include:
  - `'clip'`: Caps values to threshold bounds
  - `'median'`: Imputes using median
  - `'constant'`: Replaces with fixed value (e.g., `-999`)
  - `'none'`: Leaves values untouched (default)

⚙️ Strategy is configured per column or globally via `__default__` and `__global__`.

📁 **What’s Exported:**
- ✅ Cleaned DataFrame: `df_handled`
- ✅ Handling report (XLSX/CSV)
- ✅ Optional checkpoint joblib

> 🛠️ Adjust cleanup logic, output paths, or constant fill values in `handling_config_template.yaml`.
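
The four strategies can be sketched on a single column as shown below. This is an illustration under stated assumptions: `handle_outliers` is a hypothetical helper, the flags and bounds are taken to come from a prior detection step, and the `'median'` strategy is assumed to use the median of the non-flagged values only.

```python
import statistics


def handle_outliers(values, flags, strategy="none", bounds=None, constant=-999):
    """Apply one cleanup strategy to the flagged entries of a column."""
    pairs = list(zip(values, flags))
    if strategy == "none":
        return list(values)
    if strategy == "clip":
        lower, upper = bounds
        return [min(max(v, lower), upper) if f else v for v, f in pairs]
    if strategy == "median":
        # Assumption: the fill value ignores the flagged entries themselves.
        fill = statistics.median(v for v, f in pairs if not f)
        return [fill if f else v for v, f in pairs]
    if strategy == "constant":
        return [constant if f else v for v, f in pairs]
    raise ValueError(f"unknown strategy: {strategy!r}")


data = [10, 12, 11, 13, 12, 11, 50]
flags = [False, False, False, False, False, False, True]
for name in ("none", "clip", "median", "constant"):
    print(name, handle_outliers(data, flags, name, bounds=(8.75, 14.75)))
```
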

In [None]:
# 🧼 M07: Handle Outliers (package: m06_outlier_handling)

import logging
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m06_outlier_handling.run_handling_pipeline import run_outlier_handling_pipeline


# --- Load Outlier Handling Config ---
config = load_config("config/handling_config_template.yaml")
handling_cfg = config.get("outlier_handling", {})
run_id = config.get("run_id")
notebook_mode = config.get("notebook", True)

# Pass the entire detection_results dictionary, not its unpacked components.
df_handled = run_outlier_handling_pipeline(
    config=handling_cfg,
    df=df_outliers_flagged,
    detection_results=detection_results, # Pass the whole dictionary here
    notebook=notebook_mode,
    run_id=run_id
)


### 🔧 Step 8: Impute Missing Values (M08)

This module fills missing (`NaN`) values using a column-specific strategy:
- `'mean'`, `'median'`, or `'mode'` to infer a fill value from the column's observed values
- `'constant'` for fixed fallback values (e.g., `"UNKNOWN"` or `"1900-01-01"`)
- Strategy is configured via the `rules.strategies` section in the YAML

📊 If enabled, comparison plots show how categorical columns changed post-imputation  
(using the same PlotViewer system).

📁 **What’s Exported:**
- ✅ Imputed DataFrame: `df_imputed`
- ✅ Report: imputation log (XLSX/CSV)
- ✅ Plots: before/after comparisons (if enabled)
- ✅ Optional checkpoint joblib

> 🛠️ Configure logic and column-specific strategies in `imputation_config_template.yaml`.
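
A column-specific strategy map might behave like the sketch below. The `impute` helper and the layout of the `rules` dictionary are hypothetical and only mirror the idea of `rules.strategies`; the actual YAML keys may be named differently. Missing values are represented as `None` here.

```python
import statistics


def impute(values, strategy, constant=None):
    """Fill None entries using the chosen strategy."""
    present = [v for v in values if v is not None]
    if strategy == "constant":
        fill = constant
    elif strategy == "mean":
        fill = statistics.fmean(present)
    elif strategy == "median":
        fill = statistics.median(present)
    elif strategy == "mode":
        fill = statistics.mode(present)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]


columns = {
    "body_mass_g": [3200.0, None, 4100.0, 3800.0],
    "sex": ["MALE", None, "FEMALE", "MALE"],
    "health_status": ["Healthy", None, None, "Sick"],
}
rules = {  # hypothetical layout, one entry per column
    "body_mass_g": {"strategy": "median"},
    "sex": {"strategy": "mode"},
    "health_status": {"strategy": "constant", "constant": "UNKNOWN"},
}
imputed = {col: impute(columns[col], **rule) for col, rule in rules.items()}
print(imputed)
```
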

In [None]:
# 🔧 M08: Impute Missing Values and Plot Summary Visuals (package: m07_imputation)

import logging
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m07_imputation.run_imputation_pipeline import run_imputation_pipeline

# Load the configuration for the imputation module
config = load_config("config/imputation_config_template.yaml") 
imputation_cfg = config.get("imputation", {})
run_id = config.get("run_id")
notebook_mode = config.get("notebook", True)

df_imputed = run_imputation_pipeline(
    config=imputation_cfg,
    notebook=notebook_mode,
    df=df_handled,  # Pass the existing DataFrame here
    run_id=run_id
)

### 🧩 Behind the Scenes: Utility & Visual Modules

Several specialized support modules power the Analyst Toolkit pipeline behind the scenes.  
These are not called directly in the notebook, but are crucial to the system’s flexibility and polish:

#### 🧰 `m00_utils/`
- `config_loader.py`: Robust loader with support for environment paths and nested YAMLs
- `load_data.py`: Abstracted CSV/Joblib loader with encoding fallback
- `export_utils.py`: Modular export system for saving reports and checkpoints
- `rendering_utils.py`: Styled HTML table generator for dashboard outputs
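
As an illustration of what an encoding fallback can look like, the sketch below tries a list of encodings in order when reading a CSV. It is not the code in `load_data.py`: `read_csv_rows`, the encoding order, and the temporary demo file are assumptions made for the example.

```python
import csv
import os
import tempfile


def read_csv_rows(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return rows as dictionaries."""
    last_error = None
    for enc in encodings:
        try:
            with open(path, newline="", encoding=enc) as handle:
                return list(csv.DictReader(handle))
        except UnicodeDecodeError as exc:
            last_error = exc  # remember the failure and try the next one
    raise ValueError(f"could not decode {path!r} with {encodings}") from last_error


# Demo: a Latin-1 file that is not valid UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "penguins.csv")
    with open(path, "w", newline="", encoding="latin-1") as handle:
        handle.write("species,island\nAdélie,Biscoe\n")
    rows = read_csv_rows(path)
print(rows)
```
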

#### 📊 `m08_visuals/`
- `distributions.py`: Boxplots, histograms, and violin plots for outlier detection
- `summary_plots.py`: Heatmaps, missingness matrices, and dtype summaries
- `plot_viewer.py`: Interactive PlotViewer widget for inspecting flagged values and category shifts

> ⚙️ These modules enable notebook-mode display, CLI compatibility, YAML-driven plotting, and clean HTML export dashboards.

📁 Explore these utilities in the `/src/` directory to understand how the toolkit remains modular, extensible, and production-grade.

### 🎬 Step 9: Final Auditing and Certification (M10)

This final module performs a comprehensive audit of the cleaned dataset and applies strict quality checks before certification.

It serves as the **final quality gate** and includes:
- ✅ **Final Edits:** Drop or rename columns, coerce dtypes as needed
- ✅ **Certification Check:** Applies validation rules with `fail_on_error: true` to enforce schema, dtypes, and content requirements
- ✅ **Lifecycle Comparison:** Compares raw vs final structure, nulls, and column presence
- ✅ **Capstone Report:** Renders a complete dashboard summarizing pipeline impact and status

🛡️ If any rule is violated (e.g., unexpected nulls or schema mismatch), the system halts and logs failure details for debugging.

📁 **What’s Exported:**
- Final Audit Report (XLSX and Joblib)
- Final Certified Dataset (CSV and Joblib)
- Inline dashboard with all results

> 🛠️ Customize certification rules, null restrictions, or output paths in `final_audit_config_template.yaml`.
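
The lifecycle comparison can be pictured as a diff between the raw and final snapshots. The sketch below is a hypothetical, simplified version: `compare_lifecycle` and its output keys are invented for illustration and do not reflect the structure of the toolkit's capstone report.

```python
def compare_lifecycle(raw_rows, final_rows):
    """Summarise row counts, column changes, and null counts (raw vs final)."""
    raw_cols = list(raw_rows[0]) if raw_rows else []
    final_cols = list(final_rows[0]) if final_rows else []

    def null_count(rows, col):
        return sum(1 for r in rows if r.get(col) is None)

    shared = [c for c in raw_cols if c in final_cols]
    return {
        "rows": (len(raw_rows), len(final_rows)),
        "dropped_columns": [c for c in raw_cols if c not in final_cols],
        "added_columns": [c for c in final_cols if c not in raw_cols],
        "nulls": {
            c: (null_count(raw_rows, c), null_count(final_rows, c))
            for c in shared
        },
    }


raw = [
    {"tag_id": None, "bill length (mm)": 48.99, "sex": None},
    {"tag_id": "ADE-0001", "bill length (mm)": 39.55, "sex": "Male"},
]
final = [
    {"tag_id": "ADE-0001", "bill_length_mm": 39.55, "sex": "MALE"},
]
summary = compare_lifecycle(raw, final)
print(summary)
```
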

🎉 Once this step passes, your dataset is ready for **production use or modeling pipelines**.

In [None]:
# 🎬 M10: Final Auditing and Certification

from analyst_toolkit.m10_final_audit.final_audit_pipeline import run_final_audit_pipeline
from analyst_toolkit.m00_utils.config_loader import load_config

# --- Load Config ---
config_path = "config/final_audit_config_template.yaml"
config = load_config(config_path)

# --- Extract run-level settings ---
notebook_mode = config.get("notebook", True)
run_id = config.get("run_id")

# --- Run Final Audit ---
# The full config (not a sub-section) is passed, along with the imputed DataFrame.
df_final_clean = run_final_audit_pipeline(
    config=config,
    df=df_imputed,
    notebook=notebook_mode,
    run_id=run_id
)

---

## 🧭 What’s Next?

Congratulations — you’ve now completed a full walkthrough of the Analyst Toolkit pipeline using synthetic Palmer Penguins data!

Here are some suggested next steps:

1. 🔍 **Explore Outputs**  
   - Review the exported reports and plots in the `exports/` folder
   - Inspect final audit and certification summaries

2. 🧪 **Test with Other Datasets**  
   - Replace the penguin dataset with your own CSV in the YAML configs
   - Adjust schema, value, and range rules accordingly

3. 📓 **Use the Full Pipeline Script**  
   - Try running `run_toolkit_pipeline.py` in CLI or notebook mode for a full end-to-end execution
   - Config: `config/run_toolkit_config.yaml`

4. 🛠️ **Customize Modules**  
   - Add new modules (e.g., feature engineering, modeling)
   - Use your own diagnostic thresholds or imputation logic

5. 🚀 **Package or Deploy**  
   - Deploy the toolkit in production (Airflow, Papermill, GitHub Actions, etc.)
   - Or package it as a Python module for reuse

---