## 🧪 Analyst Toolkit Tutorial: Full Data Pipeline

This interactive notebook demonstrates the complete analyst pipeline using a synthetic **Palmer Penguins** dataset.

Each step in the pipeline is modular, YAML-configurable, and produces exports, plots, and certification-ready reports.

### 🧰 Toolkit Architecture: 3-Way Modular Design

This pipeline is built around a flexible ETL framework with three usage modes:

- 📓 **Notebook Mode**: Run individual modules or the full pipeline interactively. Ideal for exploration and QA.
- 🧵 **CLI Mode**: Execute the full pipeline using `run_toolkit_pipeline.py`, controlled via a master YAML config.
- 🧪 **Hybrid Mode**: Develop in notebooks, deploy via scripts, reusing the same configs.
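
All three modes are driven by the same master config. As a rough sketch, the dictionary below stands in for what `load_config` returns. The key names (`run_id`, `notebook`, `pipeline_entry_path`, `modules.<name>.run`, `modules.<name>.config_path`) are the ones the cells in this notebook read; the values are illustrative, and `enabled_modules` is a hypothetical helper, not part of the toolkit.

```python
# Illustrative stand-in for the dictionary returned by load_config().
# Key names mirror what the notebook cells read; values are examples only.
master_config = {
    "run_id": "demo_run_02",
    "notebook": True,
    "pipeline_entry_path": "data/raw/synthetic_penguins_v0.4.0.csv",
    "modules": {
        "diagnostics": {"run": True, "config_path": "config/diag_config.yaml"},
        "validation": {"run": True, "config_path": "config/validation_config.yaml"},
        "duplicates": {"run": False},
    },
}


def enabled_modules(config):
    """Hypothetical helper: names of modules whose 'run' flag is truthy."""
    modules = config.get("modules", {})
    return [name for name, settings in modules.items() if settings.get("run", False)]


print(enabled_modules(master_config))
```

Each module cell applies the same gate: it reads `modules.<name>.run`, skips the step when the flag is missing or false, and otherwise loads the module config from `config_path`.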

In [1]:
# 📁 1. Load Configuration and Set Execution Context

import os
from pathlib import Path
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m00_utils.load_data import load_csv
from analyst_toolkit.m01_diagnostics.run_diag_pipeline import run_diag_pipeline
from analyst_toolkit.m02_validation.run_validation_pipeline import run_validation_pipeline
from analyst_toolkit.m03_normalization.run_normalization_pipeline import run_normalization_pipeline
from analyst_toolkit.m04_duplicates.run_dupes_pipeline import run_duplicates_pipeline
from analyst_toolkit.m05_detect_outliers.run_detection_pipeline import run_outlier_detection_pipeline
from analyst_toolkit.m06_outlier_handling.run_handling_pipeline import run_outlier_handling_pipeline
from analyst_toolkit.m07_imputation.run_imputation_pipeline import run_imputation_pipeline
from analyst_toolkit.m10_final_audit.final_audit_pipeline import run_final_audit_pipeline

# --- Find Project Root ---
# This helper function makes the notebook runnable from any subdirectory
# by locating the project root based on a set of marker directories.
def find_project_root(markers=("config", "notebooks", "data")):
    """Searches upward from the current directory for marker directories to find the project root."""
    current_path = Path.cwd().resolve()
    for parent in [current_path, *current_path.parents]:
        if all((parent / marker).is_dir() for marker in markers):
            return parent
    # Fallback to current working directory if no marker is found
    print(f"⚠️ Could not find project root with markers {markers}. Using current directory.")
    return Path.cwd()

# --- Set Project Root as Working Directory ---
PROJECT_ROOT = find_project_root()
os.chdir(PROJECT_ROOT)
# This print statement now only shows the project folder name, not the full path.
print(f"📂 Project Root set to: '{PROJECT_ROOT.name}'")

# --- Load Master Config ---
# Path to master config is now relative to the project root
master_config_path = "config/run_toolkit_config.yaml"

# Load master configuration dictionary
master_config = load_config(master_config_path) # load_config will resolve the relative path
run_id = master_config.get("run_id", "default_run")
notebook_mode = master_config.get("notebook", True)

print(f"🔧 Config loaded | Run ID: {run_id} | Notebook Mode: {notebook_mode}")


📂 Project Root set to: 'dirty_birds_eda_3'
🔧 Config loaded | Run ID: demo_run_02 | Notebook Mode: True


In [2]:
# 📥 2. Load Raw Data

# Load input path from the master config
relative_input_path = master_config.get("pipeline_entry_path")
if not relative_input_path:
    raise ValueError("❌ 'pipeline_entry_path' not found in master config.")

# Since the working directory is the project root, we can use the relative path directly.
print(f"📂 Loading data from: {relative_input_path}")
df_raw = load_csv(relative_input_path)

📂 Loading data from: data/raw/synthetic_penguins_v0.4.0.csv


## 📊 M01 — Diagnostics

This module generates a profile of the raw data: shape, types, nulls, skewness, and sample rows. It's the first step in understanding your dataset's structure and quality.
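
To make the null-count part of the profile concrete, here is a minimal standard-library sketch. It is not the toolkit's implementation: `missing_summary` is a hypothetical helper, the sample rows are made up, and treating both `None` and empty strings as missing is an assumption.

```python
def missing_summary(rows, columns):
    """Count missing cells per column; None and '' are treated as missing (an assumption)."""
    total = len(rows)
    summary = {}
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) in (None, ""))
        pct = round(100 * missing / total, 2) if total else 0.0
        summary[col] = {"missing_count": missing, "missing_pct": pct}
    return summary


# Tiny made-up sample, not taken from the dataset.
rows = [
    {"tag_id": "ADE-0001", "sex": "Male"},
    {"tag_id": None, "sex": None},
    {"tag_id": None, "sex": "Female"},
    {"tag_id": "GEN-0001", "sex": ""},
]
print(missing_summary(rows, ["tag_id", "sex"]))
```

The same arithmetic gives the percentages in the profile output: 1,689 missing `tag_id` values out of 5,773 rows is 29.26%.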

In [None]:
# --- Run Diagnostics Module ---

# Check if module is enabled in the master config
module_settings = master_config.get("modules", {}).get("diagnostics", {})
if not module_settings.get("run", False):
    print("⏩ Skipping Diagnostics module as per master config.")
else:
    # Load module-specific config
    relative_config_path = module_settings.get("config_path", "config/diag_config.yaml")
    # The path is relative to the project root
    diag_config = load_config(relative_config_path)
    print(f"🚀 Running Diagnostics from '{relative_config_path}'...")

    df_profiled = run_diag_pipeline(
        config=diag_config,
        df=df_raw,
        notebook=notebook_mode,
        run_id=run_id
    )

🚀 Running Diagnostics from '{relative_config_path}'...


Rows,Columns
5773,15

Memory Usage
3.84 MB

Duplicate Rows,Duplicate %
0,0.0


Column,Unique Values
tag_id,2811
capture_date,1946
date_egg,1656
colony_id,19

Column,Dtype,Unique Values,Audit Remarks,Missing Count,Missing %
tag_id,object,2811,✅ OK,1689,29.26
species,object,5,✅ OK,171,2.96
bill_length_mm,float64,2196,✅ OK,447,7.74
bill_depth_mm,float64,1061,✅ OK,439,7.6
flipper_length_mm,float64,1736,✅ OK,473,8.19
body_mass_g,float64,3455,✅ OK,448,7.76
age_group,object,7,✅ OK,96,1.66
sex,object,6,✅ OK,2862,49.58
colony_id,object,19,✅ OK,407,7.05
island,object,11,✅ OK,524,9.08


Metric,count,mean,std,min,25%,50%,75%,max,skew,kurtosis
bill_length_mm,5326.0,45.249156,5.496079,31.92,40.6325,46.05,49.37,63.6,-0.176468,-0.69571
bill_depth_mm,5334.0,17.290041,2.208216,12.75,15.49,17.5,19.05,22.62,-0.147417,-0.959897
flipper_length_mm,5300.0,201.72405,13.958287,158.21,190.9,198.9,214.0,244.39,0.238775,-0.832208
body_mass_g,5325.0,3829.834656,835.186271,2389.96,3229.0,3729.0,4317.83,6637.71,0.503893,-0.284733


tag_id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg
,Gentoo,48.99,14.63,220.9,5890.0,Adult,Male,Torgersen North,Torgersen,2023-11-17,Healthy,PAPRI2023,Yes,2023-11-09
ADE-0001,Adelie,39.55,19.92,186.2,2500.0,Chick,Male,Biscoe West,,2022-07-31,Underweight,PAPRI2022,Yes,2022-07-20
,Gentoo,55.734313,13.0,,4536.0,Adult,Female,Biscoe West,Biscoe,2024-04-14,Healthy,,Yes,2024-04-12
GEN-0001,Gentoo,46.22,13.91,212.8,3150.0,chik,,Dream South,Dream,2020-04-27,Underweight,PAPRI2020,Yes,2020-04-14
,Chinstrap,49.02,16.22,192.2,3120.389387,ADLT,,Biscoe West,Biscoe,2022-10-03,Healthy,PAPRI2022,Yes,2022-10-02


[Interactive widget: Visual Profile]

## 🛡️ M02 — Validation (Audit Mode)

This module audits the dataset against a defined schema to catch issues early and guide cleaning steps:

- **Expected Columns & Dtypes**
- **Allowed Categorical Values**
- **Numeric Range Checks**

In this first pass, `fail_on_error` is `false`, so it reports all issues without halting the pipeline.
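
The categorical and numeric checks can be sketched as follows. This is a simplified illustration of audit-style checking, where issues are collected rather than raised. The function names, sample rows, allowed set, and numeric range are all hypothetical and do not come from the toolkit's config.

```python
from collections import Counter


def audit_categoricals(rows, allowed):
    """Return {column: Counter of values outside the allowed set}; missing values are skipped."""
    issues = {}
    for col, allowed_values in allowed.items():
        bad = Counter(
            row[col] for row in rows
            if row.get(col) is not None and row[col] not in allowed_values
        )
        if bad:
            issues[col] = bad
    return issues


def audit_ranges(rows, ranges):
    """Return {column: list of values outside the inclusive [low, high] range}."""
    issues = {}
    for col, (low, high) in ranges.items():
        bad = [
            row[col] for row in rows
            if row.get(col) is not None and not (low <= row[col] <= high)
        ]
        if bad:
            issues[col] = bad
    return issues


# Made-up rows and rule values, for illustration only.
rows = [
    {"species": "Gentoo", "bill_length_mm": 48.99},
    {"species": "Gentto", "bill_length_mm": 63.6},
    {"species": None, "bill_length_mm": None},
]
allowed = {"species": {"Adelie", "Chinstrap", "Gentoo"}}
ranges = {"bill_length_mm": (32.0, 60.0)}

print(audit_categoricals(rows, allowed))
print(audit_ranges(rows, ranges))
```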

In [None]:
# --- Run Validation Module ---

# Check if module is enabled in the master config
module_settings = master_config.get("modules", {}).get("validation", {})
if not module_settings.get("run", False):
    print("⏩ Skipping Validation module as per master config.")
elif df_profiled is None:
    print("⏩ Skipping Validation because input dataframe is None.")
else:
    # Load module-specific config
    relative_config_path = module_settings.get("config_path", "config/validation_config.yaml")
    module_config = load_config(relative_config_path)
    
    print(f"🚀 Running Validation from '{relative_config_path}'...")
    # The return value is passed to the next module as its input DataFrame
    df_valid = run_validation_pipeline(
        config=module_config,
        df=df_profiled,
        notebook=notebook_mode,
        run_id=run_id
    )

🚀 Running Validation from 'config/validation_config_autofill.yaml'...


Validation Rule,Description,Status
Schema Conformity,Verify column names match the expected schema.,✅ Pass
Dtype Enforcement,Verify column data types match expectations.,⚠️ Fail (2 issues)
Categorical Values,Verify values in categorical columns are within an allowed set.,⚠️ Fail (7 issues)
Numeric Ranges,Verify values in numeric columns are within a defined range.,⚠️ Fail (4 issues)


Column,Expected Type,Actual Type
capture_date,datetime64[ns],object
date_egg,datetime64[ns],object

Invalid Value,Count
Gentto,141
adeleie,133

Invalid Value,Count
unk,67
chik,57
juvenille,56
ADLT,38

Invalid Value,Count
F,75
M,72
?,69

Invalid Value,Count
torg,66
unknown,65
short cut,64
dreamland,58
bisco,51
cormor,47

Invalid Value,Count
Unwell,345
under weight,43
ok,39
critcal ill,35
Overwight,35

Invalid Value,Count
PP2020,58
papri2024,55
STUDY_2022,52
PAPR2023,50
PAPRI20X9,36

Invalid Value,Count
/Shortcut,43
biscoe 2,42
Biscoe,39
Cormorant,39
Unknown,39
dream,36
invalid_colony,32
Torgersen,30
torgersen SE,29
cormorant NW,29

tag_id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg
ADE-0095,Adelie,31.92,19.28,201.24,4033.01,Adult,Male,Torgersen North,Torgersen,2022-12-14,Underweight,PAPRI2020,Yes,2020-04-18
CHN-0362,Chinstrap,62.16,18.53,197.26,2674.45,Adult,,Biscoe West,cormor,2024-12-22,,PAPRI2023,Yes,2023-07-19
CHN-0852,Chinstrap,63.6,17.5,203.48,3494.88,Adult,Female,Torgersen North,Torgersen,2023-08-24,Healthy,PAPRI2023,Yes,2023-08-10

tag_id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg
ADE-0336,Adelie,40.95,21.84,186.52,3707.87,Adult,Female,Biscoe West,Biscoe,2024-05-06,Healthy,PAPRI2023,Yes,2023-05-20
ADE-0816,adeleie,38.55,21.45,187.67,2927.74,Adult,Female,Dream South,Dream,2024-12-11,Critically Ill,PAPRI2024,Yes,2024-01-03
CHN-0976,Chinstrap,52.57,21.51,,3333.46,Adult,,Cormorant East,cormor,2024-12-12,Healthy,PAPRI2023,No,
ADE-0339,Adelie,40.53,21.54,,2596.45,Juvenile,,Dream South,Dream,9999-99-99,Underweight,,Yes,2021-02-19
CHN-0339,Chinstrap,48.68,21.59,202.71,,Adult,,Torgersen North,Torgersen,2024-05-15,Healthy,PAPRI2021,Yes,2021-09-29

tag_id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg
GEN-0875,Gentoo,41.57,14.21,230.97,4828.18,Adult,,Torgersen North,Torgersen,2023-06-12,,PAPRI2020,,
GEN-0929,Gentoo,44.35,14.21,233.07,4676.68,unk,,Torgersen North,Torgersen,2023-03-17,,PAPRI2020,,
GEN-0779,Gentoo,50.34,13.99,230.38,5039.33,Adult,,Shortcut Point,,2024-05-29,Underweight,PAPRI2023,Yes,2023-03-11
GEN-0439,Gentoo,49.98,14.09,235.59,3320.16,Juvenile,Female,Shortcut Point,Shortcut,2023-03-12,Critically Ill,PAPRI2021,Yes,2021-06-18
GEN-0578,Gentto,44.25,16.12,231.34,4635.9,Adult,,Torgersen North,Torgersen,2021-07-07,Healthy,PAPRI2020,Yes,2020-06-24

tag_id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg
ADE-0001,Adelie,39.55,19.92,186.2,2500.0,Chick,Male,Biscoe West,,2022-07-31,Underweight,PAPRI2022,Yes,2022-07-20
ADE-0004,Adelie,38.96,21.0,188.7,2500.0,Chick,,Torgersen North,Torgersen,2023-01-18,Underweight,PAPRI2023,Yes,2023-01-08
,Chinstrap,45.89,18.62,192.5,2500.0,Chick,Female,Torgersen North,Torgersen,2022-01-19,critcal ill,PAPRI2022,Yes,2022-01-09
,Adelie,38.39,19.28,,2500.0,Chick,?,Torgersen,,2024-01-25,Underweight,,Yes,2024-01-22
ADE-0014,Adelie,39.67,17.16,193.4,2621.0,Chick,Unknown,Torgersen North,Torgersen,2024-06-06,Underweight,PAPRI2022,No,


## 🧹 M03 — Normalization

This module performs rule-based cleaning and standardization to prepare the dataset for certification:

- **Column Renaming & Type Coercion**
- **Value Mapping & Text Cleaning**
- **Fuzzy Matching & Datetime Parsing**

All rules and output paths are controlled via the YAML config (`normalization_config_template.yaml`).
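
A minimal sketch of the standardize, map, then fuzzy-match sequence is shown below. The notebook does not name the matcher the toolkit uses, so this example substitutes `difflib` from the standard library; its similarity ratios are on a 0 to 1 scale and are not comparable to the scores in the output. `standardize_text` here is a local stand-in named after the operation in the changelog, and `normalize_value`, the mapping, and the canonical list are hypothetical.

```python
import difflib


def standardize_text(value):
    """Lowercase and trim a string; leave missing values untouched."""
    return value.strip().lower() if isinstance(value, str) else value


def normalize_value(value, mapping, canonical, cutoff=0.6):
    """Standardize, apply an explicit mapping, then fall back to a fuzzy match."""
    cleaned = standardize_text(value)
    if not isinstance(cleaned, str):
        return cleaned
    cleaned = mapping.get(cleaned, cleaned)
    if cleaned in canonical:
        return cleaned
    # difflib is a stand-in matcher; the toolkit's own matcher may differ.
    matches = difflib.get_close_matches(cleaned, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned


# Hypothetical rules for the age_group column.
mapping = {"adlt": "adult", "unk": "unknown"}
canonical = ["adult", "juvenile", "chick", "unknown"]

for raw in ["  ADLT ", "juvenille", "chik", "unk", None]:
    print(repr(raw), "->", repr(normalize_value(raw, mapping, canonical)))
```

Explicit mappings run before fuzzy matching so that known abbreviations are resolved deterministically and only leftover misspellings rely on similarity.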

In [None]:
# --- Run Normalization Module ---

# Check if module is enabled in the master config
module_settings = master_config.get("modules", {}).get("normalization", {})
if not module_settings.get("run", False):
    print("⏩ Skipping Normalization module as per master config.")
elif df_valid is None:
    print("⏩ Skipping Normalization because input dataframe is None.")
else:
    # Load module-specific config
    relative_config_path = module_settings.get("config_path", "config/normalization_config.yaml")
    module_config = load_config(relative_config_path)
    
    print(f"🚀 Running Normalization from '{relative_config_path}'...")
    df_norm = run_normalization_pipeline(
        config=module_config,
        df=df_valid,
        notebook=notebook_mode,
        run_id=run_id
    )

🚀 Running Normalization from 'config/normalization_config.yaml'...


Original Name,New Name
tag_id,tag_id
species,species
bill_length_mm,bill_length_mm
bill_depth_mm,bill_depth_mm
flipper_length_mm,flipper_length_mm
body_mass_g,body_mass_g
age_group,age_group
sex,sex
colony_id,colony_id
island,island

Column,Operation
species,standardize_text
age_group,standardize_text
sex,standardize_text
colony_id,standardize_text
island,standardize_text
health_status,standardize_text
study_name,standardize_text
clutch_completion,standardize_text

Column,Target Type
capture_date,datetime64[ns]
date_egg,datetime64[ns]

Column,Mappings Applied
species,2
age_group,3
sex,5
island,1
colony_id,3
study_name,7
health_status,4

Column,Original,Corrected,Score
colony_id,biscoe 2,biscoe west,81
colony_id,torgersen,torgersen north,90
colony_id,/shortcut,shortcut point,90
colony_id,biscoe,biscoe west,90
colony_id,torgersen 4,torgersen north,86
colony_id,cormorant,cormorant east,90
colony_id,dream,dream south,90
colony_id,torgersen se,torgersen north,81
colony_id,cormorant nw,cormorant east,81
island,torg,torgersen,90


Value,Count
torgersen north,1537
dream south,1309
biscoe west,1161
cormorant east,765
shortcut point,523
,407
unknown,71

Value,Original Count,Normalized Count
Torgersen North,1451,0
Dream South,1220,0
Biscoe West,1080,0
Cormorant East,697,0
Shortcut Point,453,0
,407,407
/Shortcut,43,0
biscoe 2,42,0
Biscoe,39,0
Cormorant,39,0

Value,Count
,2862
male,1406
female,1356
unknown,149

Value,Original Count,Normalized Count
,2862,2862
Male,1334,0
Female,1281,0
Unknown,80,0
F,75,0
M,72,0
?,69,0
male,0,1406
female,0,1356
unknown,0,149

Value,Count
gentoo,1882
chinstrap,1878
adelie,1842
,171

Value,Original Count,Normalized Count
Chinstrap,1878,0
Gentoo,1741,0
Adelie,1709,0
,171,171
Gentto,141,0
adeleie,133,0
gentoo,0,1882
chinstrap,0,1878
adelie,0,1842

Value,Count
adult,4023
juvenile,1087
chick,500
,96
unknown,67

Value,Original Count,Normalized Count
Adult,3985,0
Juvenile,1031,0
Chick,443,0
,96,96
unk,67,0
chik,57,0
juvenille,56,0
ADLT,38,0
adult,0,4023
juvenile,0,1087

Value,Count
torgersen,1522
dream,1224
biscoe,1148
cormorant,761
shortcut,529
,524
unknown,65

Value,Original Count,Normalized Count
Torgersen,1456,0
Dream,1166,0
Biscoe,1097,0
Cormorant,714,0
,524,524
Shortcut,465,0
torg,66,0
unknown,65,65
short cut,64,0
dreamland,58,0

Value,Count
papri2020,1024
papri2024,988
papri2021,966
papri2022,963
papri2023,962
,596
papri2019,238
unknown,36

Value,Original Count,Normalized Count
PAPRI2020,966,0
PAPRI2024,933,0
PAPRI2021,918,0
PAPRI2023,912,0
PAPRI2022,911,0
,596,596
PAPRI2019,238,0
PP2020,58,0
papri2024,55,988
STUDY_2022,52,0

Value,Count
healthy,2315
underweight,1503
overweight,791
,514
unknown,345
critically ill,305

Value,Original Count,Normalized Count
Healthy,2276,0
Underweight,1460,0
Overweight,756,0
,514,514
Unwell,345,0
Critically Ill,270,0
under weight,43,0
ok,39,0
Overwight,35,0
critcal ill,35,0

Value,Count
NaT,804
2024-11-09,19
2024-11-17,16
2024-12-05,16
2024-11-29,15
2024-12-13,14
2024-12-18,14
2024-11-05,14
2024-11-24,14
2024-12-01,13

Value,Original Count,Normalized Count
,425,804
9999-99-99,46,0
not-a-date,36,0
error,28,0
2024-11-09,19,19
2024-11-17,16,16
2024-12-05,16,16
2024-11-29,15,15
2024-11-05,14,14
2024-11-24,14,14

Value,Count
NaT,894
2021-04-03,12
2020-06-25,11
2024-03-10,11
2021-04-16,11
2024-11-20,11
2021-09-25,10
2024-01-01,10
2020-02-07,10
2021-10-11,10

Value,Original Count,Normalized Count
,894,894
2021-04-03,12,12
2020-06-25,11,11
2021-04-16,11,11
2024-03-10,11,11
2024-11-20,11,11
2020-02-07,10,10
2020-11-24,10,10
2021-09-25,10,10
2021-10-11,10,10


## 🛡️ M02 — Certification Gate (Strict Mode)

This step re-uses the **Validation Module (M02)**, but with a stricter configuration to act as a quality gate. It is designed to **halt the pipeline** if violations are found:

- ✅ All column names, data types, categorical values, and numeric ranges must pass
- 🛑 **`fail_on_error: true`** triggers a hard stop on validation failure

This step certifies the cleaned dataset before proceeding to more advanced steps like outlier handling.
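
The difference between audit mode and a strict gate comes down to what happens when a check fails. The sketch below illustrates that switch under simple assumptions: `ValidationError` and `enforce_gate` are hypothetical names, and the result dictionary is an illustrative shape, not the toolkit's actual return value.

```python
class ValidationError(Exception):
    """Hypothetical exception type used only in this sketch."""


def enforce_gate(results, fail_on_error):
    """Return True if every check passed; raise in strict mode when any failed."""
    failed = [name for name, passed in results.items() if not passed]
    if failed and fail_on_error:
        raise ValidationError(f"Validation failed: {', '.join(failed)}")
    return not failed


# Illustrative pass/fail statuses, one per validation rule.
results = {"schema": True, "dtypes": True, "categoricals": True, "numeric_ranges": False}

# Audit mode: the failure is reported through the return value only.
print(enforce_gate(results, fail_on_error=False))

# Strict mode: the same failure raises and stops execution.
try:
    enforce_gate(results, fail_on_error=True)
except ValidationError as exc:
    print(f"Halted: {exc}")
```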

In [None]:
# --- Run Certification Gate (Strict Validation) ---

# Check if module is enabled in the master config
module_settings = master_config.get("modules", {}).get("validation_gatekeeper", {})
if not module_settings.get("run", False):
    print("⏩ Skipping Certification Gate as per master config.")
elif df_norm is None:
    print("⏩ Skipping Certification Gate because input dataframe is None.")
else:
    # Load module-specific config
    relative_config_path = module_settings.get("config_path", "config/certification_config.yaml")
    module_config = load_config(relative_config_path)
    
    print(f"🚀 Running Certification Gate from '{relative_config_path}'...")
    df_cert = run_validation_pipeline(
        config=module_config,
        df=df_norm,
        notebook=notebook_mode,
        run_id=run_id
    )

🚀 Running Certification Gate from 'config/certification_config_template.yaml'...


Validation Rule,Description,Status
Schema Conformity,Verify column names match the expected schema.,✅ Pass
Dtype Enforcement,Verify column data types match expectations.,✅ Pass
Categorical Values,Verify values in categorical columns are within an allowed set.,✅ Pass
Numeric Ranges,Verify values in numeric columns are within a defined range.,⚠️ Fail (4 issues)


tag_id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg
ADE-0095,adelie,31.92,19.28,201.24,4033.01,adult,male,torgersen north,torgersen,2022-12-14,underweight,papri2020,yes,2020-04-18
CHN-0362,chinstrap,62.16,18.53,197.26,2674.45,adult,,biscoe west,cormorant,2024-12-22,,papri2023,yes,2023-07-19
CHN-0852,chinstrap,63.6,17.5,203.48,3494.88,adult,female,torgersen north,torgersen,2023-08-24,healthy,papri2023,yes,2023-08-10

tag_id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg
ADE-0336,adelie,40.95,21.84,186.52,3707.87,adult,female,biscoe west,biscoe,2024-05-06,healthy,papri2023,yes,2023-05-20
ADE-0816,adelie,38.55,21.45,187.67,2927.74,adult,female,dream south,dream,2024-12-11,critically ill,papri2024,yes,2024-01-03
CHN-0976,chinstrap,52.57,21.51,,3333.46,adult,,cormorant east,cormorant,2024-12-12,healthy,papri2023,no,NaT
ADE-0339,adelie,40.53,21.54,,2596.45,juvenile,,dream south,dream,NaT,underweight,,yes,2021-02-19
CHN-0339,chinstrap,48.68,21.59,202.71,,adult,,torgersen north,torgersen,2024-05-15,healthy,papri2021,yes,2021-09-29

tag_id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg
GEN-0875,gentoo,41.57,14.21,230.97,4828.18,adult,,torgersen north,torgersen,2023-06-12,,papri2020,,NaT
GEN-0929,gentoo,44.35,14.21,233.07,4676.68,unknown,,torgersen north,torgersen,2023-03-17,,papri2020,,NaT
GEN-0779,gentoo,50.34,13.99,230.38,5039.33,adult,,shortcut point,,2024-05-29,underweight,papri2023,yes,2023-03-11
GEN-0439,gentoo,49.98,14.09,235.59,3320.16,juvenile,female,shortcut point,shortcut,2023-03-12,critically ill,papri2021,yes,2021-06-18
GEN-0578,gentoo,44.25,16.12,231.34,4635.9,adult,,torgersen north,torgersen,2021-07-07,healthy,papri2020,yes,2020-06-24

tag_id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg
ADE-0001,adelie,39.55,19.92,186.2,2500.0,chick,male,biscoe west,,2022-07-31,underweight,papri2022,yes,2022-07-20
ADE-0004,adelie,38.96,21.0,188.7,2500.0,chick,,torgersen north,torgersen,2023-01-18,underweight,papri2023,yes,2023-01-08
,chinstrap,45.89,18.62,192.5,2500.0,chick,female,torgersen north,torgersen,2022-01-19,critically ill,papri2022,yes,2022-01-09
,adelie,38.39,19.28,,2500.0,chick,unknown,torgersen north,,2024-01-25,underweight,,yes,2024-01-22
ADE-0014,adelie,39.67,17.16,193.4,2621.0,chick,unknown,torgersen north,torgersen,2024-06-06,underweight,papri2022,no,NaT


## ♻️ M04 — Deduplication

This module identifies and handles **duplicate rows** in the dataset, using the logic from `m04_duplicates`.

You can choose to:
- 🔍 **Flag duplicates** for review (`mode: "flag"`)
- ✂️ **Remove duplicates** directly (`mode: "remove"`)

The logic is configurable via `dups_config_template.yaml`, allowing you to specify which columns to check for duplication.
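
The two modes can be sketched as below, using a subset of columns as the duplicate key. This is an illustration, not the module's code: the function names are hypothetical, and flagging every member of a duplicate group (rather than only the later repeats) is an assumption. The `is_duplicate` column name matches the flag column that appears in later outputs.

```python
def _key(row, subset):
    """Build the comparison key from the chosen subset of columns."""
    return tuple(row.get(col) for col in subset)


def flag_duplicates(rows, subset):
    """Return copies of rows with is_duplicate=True when the key occurs more than once."""
    counts = {}
    for row in rows:
        counts[_key(row, subset)] = counts.get(_key(row, subset), 0) + 1
    return [{**row, "is_duplicate": counts[_key(row, subset)] > 1} for row in rows]


def remove_duplicates(rows, subset):
    """Keep the first row for each key and drop later repeats."""
    seen = set()
    kept = []
    for row in rows:
        key = _key(row, subset)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Values borrowed from the sample output; only two columns are kept for brevity.
rows = [
    {"tag_id": "ADE-0002", "body_mass_g": 2840.0},
    {"tag_id": "ADE-0002", "body_mass_g": 2846.85},
    {"tag_id": "ADE-0059", "body_mass_g": 3638.14},
]

print(flag_duplicates(rows, ["tag_id"]))
print(remove_duplicates(rows, ["tag_id"]))
```

Flagging keeps the row count unchanged, which is useful when repeated keys need manual review before anything is dropped.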

In [None]:
# --- Run Duplicates Module ---

# Check if module is enabled in the master config
module_settings = master_config.get("modules", {}).get("duplicates", {})
if not module_settings.get("run", False):
    print("⏩ Skipping Duplicates module as per master config.")
elif df_cert is None:
    print("⏩ Skipping Duplicates because input dataframe is None.")
else:
    # Load module-specific config
    relative_config_path = module_settings.get("config_path", "config/dups_config.yaml")
    module_config = load_config(relative_config_path)
    
    print(f"🚀 Running Duplicates from '{relative_config_path}'...")
    # The return value is passed to the next module as its input DataFrame
    df_duped = run_duplicates_pipeline(
        config=module_config,
        df=df_cert,
        notebook=notebook_mode,
        run_id=run_id
    )


🚀 Running Duplicates from 'config/dups_config.yaml'...


Metric,Value
Total Row Count,5773
Duplicate Rows Flagged,954

tag_id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg
ADE-0002,adelie,42.79,17.62,184.7,2840.0,juvenile,,dream south,dream,NaT,healthy,papri2024,no,NaT
ADE-0002,adelie,43.2,18.16,184.85,2846.85,adult,,dream south,dream,NaT,critically ill,papri2024,no,NaT
ADE-0050,adelie,36.13,19.34,184.1,3911.0,adult,male,,biscoe,NaT,healthy,papri2021,,2021-12-16
ADE-0050,adelie,37.46,20.01,181.75,3853.37,adult,male,,biscoe,NaT,underweight,papri2021,,2021-12-16
ADE-0059,adelie,41.9,17.67,210.07,3638.14,adult,,biscoe west,biscoe,2024-07-22,overweight,papri2021,yes,2021-02-03
ADE-0059,adelie,44.54,16.83,196.63,3732.43,adult,,biscoe west,biscoe,2024-07-22,overweight,papri2021,yes,2021-02-03
ADE-0082,adelie,40.52,16.54,190.4,3972.0,adult,male,dream south,dream,2020-06-03,healthy,,yes,2020-06-01
ADE-0082,adelie,37.18,17.76,183.44,3872.15,adult,male,dream south,dream,2020-06-03,healthy,,yes,2020-06-01
ADE-0100,adelie,37.9,17.46,187.5,3375.29,adult,female,dream south,dream,2024-04-08,overweight,papri2022,,2022-08-19
ADE-0100,adelie,41.06,19.05,190.15,3611.57,adult,female,dream south,dream,2024-04-08,overweight,papri2022,,2022-08-19


[Interactive widget: Visual Summary]

## 📏 M05 — Detect Outliers

This module (`m05_detect_outliers`) scans numeric columns for outliers using configurable logic:

- **Z-Score** or **IQR** methods (per column or global default)
- Adds binary flags (e.g., `*_outlier`) to the dataset if `append_flags: true`
- Skips non-numeric or excluded fields via `exclude_columns`

📊 If enabled, an interactive **PlotViewer** renders boxplots, histograms, and violin plots inline, giving a fast visual summary of where anomalies occur.
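
As a worked example of the IQR method, the sketch below computes Tukey fences from the `bill_length_mm` quartiles reported in the diagnostics summary above (Q1 = 40.6325, Q3 = 49.37). The multiplier `k = 1.5` is the conventional default and is assumed here. The function names are hypothetical, and returning `None` for missing values is a simplification.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Tukey fences: (Q1 - k * IQR, Q3 + k * IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, lower, upper):
    """Return one flag per value; missing values yield None instead of a flag."""
    return [None if v is None else not (lower <= v <= upper) for v in values]


# Quartiles of bill_length_mm taken from the diagnostics summary.
lower, upper = iqr_bounds(40.6325, 49.37)
print(round(lower, 5), round(upper, 5))

# 63.6 is the column maximum; the other values are for illustration.
print(flag_outliers([31.92, 63.6, None, 46.05], lower, upper))
```

With IQR = 8.7375, the fences come out at 27.52625 and 62.47625, so the column maximum of 63.6 falls outside the upper fence while the minimum of 31.92 stays inside.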

In [None]:
# --- Detect Outliers ---

# Initialize detection_results to ensure it exists for the next step
detection_results = None

# Check if module is enabled in the master config
module_settings = master_config.get("modules", {}).get("outlier_detection", {})
if not module_settings.get("run", False):
    print("⏩ Skipping Outlier Detection module as per master config.")
elif df_duped is None:
    print("⏩ Skipping Outlier Detection because input dataframe is None.")
else:
    # Load module-specific config
    relative_config_path = module_settings.get("config_path", "config/outlier_config.yaml")
    module_config = load_config(relative_config_path)
    
    print(f"🚀 Running Outlier Detection from '{relative_config_path}'...")
    # This function returns (df_with_flags, detection_results_dict)
    df_detect, detection_results = run_outlier_detection_pipeline(
        config=module_config,
        df=df_duped,
        notebook=notebook_mode,
        run_id=run_id
    )

🚀 Running Outlier Detection from 'config/outlier_config_template.yaml'...


column,method,outlier_count,lower_bound,upper_bound,outlier_examples
bill_length_mm,iqr,1,27.52625,62.47625,[63.6]


tag_id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg,is_duplicate
CHN-0852,chinstrap,63.6,17.5,203.48,3494.88,adult,female,torgersen north,torgersen,2023-08-24,healthy,papri2023,yes,2023-08-10,True


[Interactive widget: Outlier Visualizations]

## 🧼 M06 — Handle Outliers

This module (`m06_outlier_handling`) applies cleanup strategies to outliers flagged in the detection step:

- **Strategies**: `clip` (cap to bounds), `median` (impute), `constant` (fill with a fixed value), or `none`.
- **Configuration**: Apply rules globally (`__default__`) or per-column.

This step is purely for remediation and relies on the `detection_results` from the previous module.
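
The four strategies can be sketched in a few lines. This is an illustrative version under stated assumptions, not the module's code: `handle_outliers` is a hypothetical function, the bounds are passed in directly rather than read from `detection_results`, and the `median` strategy here uses the median of the in-range values.

```python
import statistics


def handle_outliers(values, lower, upper, strategy="clip", constant=None):
    """Apply one strategy to values outside [lower, upper]; missing values pass through."""
    def is_outlier(v):
        return v is not None and not (lower <= v <= upper)

    if strategy == "none":
        return list(values)
    if strategy == "clip":
        # Cap each outlier at the nearest bound.
        return [min(max(v, lower), upper) if is_outlier(v) else v for v in values]
    if strategy == "median":
        # Assumption: the replacement is the median of the in-range values.
        inliers = [v for v in values if v is not None and not is_outlier(v)]
        fill = statistics.median(inliers)
        return [fill if is_outlier(v) else v for v in values]
    if strategy == "constant":
        return [constant if is_outlier(v) else v for v in values]
    raise ValueError(f"Unknown strategy: {strategy!r}")


values = [46.05, 63.6, None, 40.0, 50.0]
lower, upper = 27.52625, 62.47625  # IQR fences for bill_length_mm

print(handle_outliers(values, lower, upper, strategy="clip"))
print(handle_outliers(values, lower, upper, strategy="median"))
```

With `clip`, the value 63.6 is capped at the upper fence of 62.47625, which matches the single capped value in the handling output.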

In [None]:
# --- Handle Outliers ---

# Check if module is enabled in the master config
module_settings = master_config.get("modules", {}).get("outlier_handling", {})
if not module_settings.get("run", False):
    print("⏩ Skipping Outlier Handling module as per master config.")
elif df_detect is None:
    print("⏩ Skipping Outlier Handling because input dataframe is None.")
elif detection_results is None:
    print("⏩ Skipping Outlier Handling because no detection results were provided from the previous step.")
else:
    # Load module-specific config
    relative_config_path = module_settings.get("config_path", "config/handling_config.yaml")
    module_config = load_config(relative_config_path)
    
    print(f"🚀 Running Outlier Handling from '{relative_config_path}'...")
    # This function returns the dataframe with outliers handled
    df_handled = run_outlier_handling_pipeline(
        config=module_config,
        df=df_detect,
        detection_results=detection_results,  # Pass results from M05
        notebook=notebook_mode,
        run_id=run_id
    )

🚀 Running Outlier Handling from 'config/handling_config_template.yaml'...


strategy,column,outliers_handled,details
clip,bill_length_mm,1,Clipped 1 values to bounds.


Column,Row_Index,Original_Value,Capped_Value
bill_length_mm,5760,63.6,62.47625


## 🔧 M07 — Impute Missing Values

This module (`m07_imputation`) fills missing (`NaN`) values using a column-specific strategy:

- **Strategies**: `mean`, `median`, `mode`, or `constant`.
- **Configuration**: Apply rules per column via `rules.strategies` in the YAML.

📊 If enabled, comparison plots show how categorical columns changed post-imputation.
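
The strategies map directly onto the standard library's `statistics` module. The sketch below is a simplified single-column illustration; `impute` is a hypothetical function and is not how the toolkit reads `rules.strategies`.

```python
import statistics


def impute(values, strategy, constant=None):
    """Fill None entries using the mean, median, or mode of observed values, or a constant."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    elif strategy == "constant":
        fill = constant
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]


print(impute([1.0, None, 3.0], "mean"))
print(impute([1.0, None, 3.0, 10.0], "median"))
print(impute(["male", None, "female", "male"], "mode"))
print(impute(["ADE-0001", None], "constant", constant="UNKNOWN"))
```

Mode imputation assigns every missing entry to the single most frequent category. In the output for this run, that moves all 2,862 missing `sex` values to `male`.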

In [None]:
# --- Run Imputation Module ---

# Check if module is enabled in the master config
module_settings = master_config.get("modules", {}).get("imputation", {})
if not module_settings.get("run", False):
    print("⏩ Skipping Imputation module as per master config.")
elif df_handled is None:
    print("⏩ Skipping Imputation because input dataframe is None.")
else:
    # Load module-specific config
    relative_config_path = module_settings.get("config_path", "config/imputation_config.yaml")
    module_config = load_config(relative_config_path)
    
    print(f"🚀 Running Imputation from '{relative_config_path}'...")
    df_imput = run_imputation_pipeline(
        config=module_config,
        df=df_handled,
        notebook=notebook_mode,
        run_id=run_id
    )

🚀 Running Imputation from 'config/imputation_config_template.yaml'...


Column,Strategy,Fill Value,Nulls Filled
bill_length_mm,mean,45.25,447
body_mass_g,mean,3829.83,448
bill_depth_mm,median,17.50,439
flipper_length_mm,median,198.90,473
sex,mode,male,2862
tag_id,constant,UNKNOWN,1689
species,constant,UNKNOWN,171
age_group,constant,UNKNOWN,96
colony_id,constant,UNKNOWN,407
island,constant,UNKNOWN,524

Column,Nulls Before,Nulls After,Nulls Filled
bill_length_mm,447,0,447
body_mass_g,448,0,448
bill_depth_mm,439,0,439
flipper_length_mm,473,0,473
sex,2862,0,2862
tag_id,1689,0,1689
species,171,0,171
age_group,96,0,96
colony_id,407,0,407
island,524,0,524


Value,Count
male,4268
female,1356
unknown,149

Value,Original Count,Imputed Count,Change
,2862,0,-2862
male,1406,4268,2862
female,1356,1356,0
unknown,149,149,0

Value,Count
UNKNOWN,1689
GEN-0578,5
CHN-0934,5
GEN-0385,4
CHN-0660,4
CHN-0738,4
CHN-0790,4
CHN-0959,4
GEN-0025,4
GEN-0140,4

Value,Original Count,Imputed Count,Change
,1689,0,-1689
CHN-0934,5,5,0
GEN-0578,5,5,0
ADE-0193,4,4,0
ADE-0244,4,4,0
ADE-0277,4,4,0
ADE-0372,4,4,0
ADE-0395,4,4,0
ADE-0437,4,4,0
ADE-0598,4,4,0

Value,Count
gentoo,1882
chinstrap,1878
adelie,1842
UNKNOWN,171

Value,Original Count,Imputed Count,Change
gentoo,1882,1882,0
chinstrap,1878,1878,0
adelie,1842,1842,0
,171,0,-171
UNKNOWN,0,171,171

Value,Count
adult,4023
juvenile,1087
chick,500
UNKNOWN,96
unknown,67

Value,Original Count,Imputed Count,Change
adult,4023,4023,0
juvenile,1087,1087,0
chick,500,500,0
,96,0,-96
unknown,67,67,0
UNKNOWN,0,96,96

Value,Count
torgersen north,1537
dream south,1309
biscoe west,1161
cormorant east,765
shortcut point,523
UNKNOWN,407
unknown,71

Value,Original Count,Imputed Count,Change
torgersen north,1537,1537,0
dream south,1309,1309,0
biscoe west,1161,1161,0
cormorant east,765,765,0
shortcut point,523,523,0
,407,0,-407
unknown,71,71,0
UNKNOWN,0,407,407

Value,Count
torgersen,1522
dream,1224
biscoe,1148
cormorant,761
shortcut,529
UNKNOWN,524
unknown,65

Value,Original Count,Imputed Count,Change
torgersen,1522,1522,0
dream,1224,1224,0
biscoe,1148,1148,0
cormorant,761,761,0
shortcut,529,529,0
,524,0,-524
unknown,65,65,0
UNKNOWN,0,524,524

Value,Count
papri2020,1024
papri2024,988
papri2021,966
papri2022,963
papri2023,962
UNKNOWN,596
papri2019,238
unknown,36

Value,Original Count,Imputed Count,Change
papri2020,1024,1024,0
papri2024,988,988,0
papri2021,966,966,0
papri2022,963,963,0
papri2023,962,962,0
,596,0,-596
papri2019,238,238,0
unknown,36,36,0
UNKNOWN,0,596,596

Value,Count
yes,4482
no,822
UNKNOWN,469

Value,Original Count,Imputed Count,Change
yes,4482,4482,0
no,822,822,0
,469,0,-469
UNKNOWN,0,469,469

Value,Count
healthy,2315
underweight,1503
overweight,791
UNKNOWN,514
unknown,345
critically ill,305

Value,Original Count,Imputed Count,Change
healthy,2315,2315,0
underweight,1503,1503,0
overweight,791,791,0
,514,0,-514
unknown,345,345,0
critically ill,305,305,0
UNKNOWN,0,514,514


Column,Remaining Nulls
bill_length_mm_iqr_outlier,447


[Interactive widget: Imputation Visualizations]

## Alternative: Full Pipeline Runner

To run the entire pipeline non-interactively, use the `run_toolkit_pipeline` function from the `analyst_toolkit` library. This suits automated scripts where step-by-step inspection is not required.

Import `run_toolkit_pipeline` and call it with the path to your master configuration file:

```python
# from analyst_toolkit.run_toolkit_pipeline import run_toolkit_pipeline
#
# This function runs all modules enabled in your 'run_toolkit_config.yaml' sequentially.
# df_final, all_results = run_toolkit_pipeline(config_path=RUN_CONFIG_PATH)
```

> **Note:** This template notebook is designed for step-by-step execution and inspection. Using the full pipeline runner will execute all steps at once and bypass the individual cell outputs in this notebook.
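
As a rough illustration of what "all modules enabled" means, the sketch below filters a master-config dictionary by each module's `run` flag. The `modules`, `run`, and `config_path` keys mirror how this notebook reads `master_config`; the module names other than `final_audit`, the example paths, and the helper `enabled_modules` are hypothetical, and the real runner may select or order modules differently.

```python
# Hypothetical master config, shaped like the dict this notebook reads with
# master_config.get("modules", {}).get("final_audit", {}).
master_config_example = {
    "modules": {
        "diagnostics": {"run": True, "config_path": "config/diag_config.yaml"},
        "imputation": {"run": False, "config_path": "config/imputation_config.yaml"},
        "final_audit": {"run": True, "config_path": "config/final_audit_config.yaml"},
    }
}


def enabled_modules(config: dict) -> list:
    """Return the names of modules whose 'run' flag is truthy, in config order."""
    modules = config.get("modules", {})
    return [name for name, settings in modules.items() if settings.get("run", False)]


print(enabled_modules(master_config_example))
# → ['diagnostics', 'final_audit']
```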

## 🎬 M10 — Final Audit & Certification

This final module (`m10_final_audit`) serves as the ultimate quality gate before exporting the cleaned dataset. It performs a comprehensive audit and applies strict certification checks.

- ✅ **Final Edits**: Drops or renames columns and coerces dtypes as needed.
- ✅ **Certification Check**: Re-runs validation rules with `fail_on_error: true` to enforce schema, dtypes, and content requirements.
- ✅ **Lifecycle Comparison**: Compares the raw vs. final dataset's structure, nulls, and column presence.
- ✅ **Capstone Report**: Renders a complete dashboard summarizing the pipeline's impact and status.

🛡️ If any rule is violated, the system halts and logs failure details for debugging. Once this step passes, your dataset is certified and ready for production use.
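
The certification check can be pictured as a schema comparison that raises instead of warning when `fail_on_error` is true. The sketch below illustrates that general technique only; it is not the `m10_final_audit` implementation, and `check_columns`, `CertificationError`, and the example column lists are hypothetical.

```python
class CertificationError(Exception):
    """Raised when a certification rule fails and fail_on_error is set."""


def check_columns(actual, expected, fail_on_error=True):
    """Report columns that are missing from, or unexpected in, the dataset."""
    actual_set, expected_set = set(actual), set(expected)
    issues = {
        "Missing": sorted(expected_set - actual_set),
        "Unexpected": sorted(actual_set - expected_set),
    }
    passed = not any(issues.values())
    if not passed and fail_on_error:
        # Strict mode: stop the run instead of returning a failed report.
        raise CertificationError(f"Schema check failed: {issues}")
    return passed, issues


expected_columns = ["tag_id", "species", "island"]
actual_columns = ["tag_id", "species", "island", "is_duplicate"]

# Lenient mode returns the report so it can be inspected.
passed, issues = check_columns(actual_columns, expected_columns, fail_on_error=False)
print(passed, issues)
# → False {'Missing': [], 'Unexpected': ['is_duplicate']}
```

In this sketch, an expected list that is empty or does not match the dataset puts every column under `Unexpected`.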

In [None]:
# --- Run Final Audit Module ---

# Check if module is enabled in the master config
module_settings = master_config.get("modules", {}).get("final_audit", {})
if not module_settings.get("run", False):
    print("⏩ Skipping Final Audit module as per master config.")
elif df_imput is None:
    print("⏩ Skipping Final Audit because input dataframe is None.")
else:
    # Load module-specific config, resolved against the project root so the
    # cell does not depend on the notebook's working directory
    relative_config_path = module_settings.get("config_path", "config/final_audit_config.yaml")
    absolute_config_path = PROJECT_ROOT / relative_config_path
    module_config = load_config(str(absolute_config_path))
    
    print(f"🚀 Running Final Audit from '{relative_config_path}'...")
    # This function returns the final, certified dataframe
    df_final = run_final_audit_pipeline(
        config=module_config,
        df=df_imput,
        notebook=notebook_mode,
        run_id=run_id
    )

🚀 Running Final Audit from 'config/final_audit_config_template.yaml'...


Issue Type,Columns
Unexpected,"sex, colony_id, health_status, capture_date, study_name, flipper_length_mm, tag_id, age_group, bill_depth_mm, clutch_completion, date_egg, body_mass_g, is_duplicate, island, bill_length_mm, species"


Metric,Value
Final Pipeline Status,❌ CERTIFICATION FAILED
Certification Rules Passed,False
Null Value Audit Passed,True

Action,Details
drop_columns,Removed: ['bill_length_mm_iqr_outlier']


Metric,Value
Initial Rows,5773
Final Rows,5773
Initial Columns,15
Final Columns,16

Column,Dtype,Unique Values,Audit Remarks,Missing Count,Missing %
tag_id,object,2812,✅ OK,0,0.0
species,object,4,✅ OK,0,0.0
bill_length_mm,float64,2197,✅ OK,0,0.0
bill_depth_mm,float64,1061,✅ OK,0,0.0
flipper_length_mm,float64,1736,✅ OK,0,0.0
body_mass_g,float64,3456,✅ OK,0,0.0
age_group,object,5,✅ OK,0,0.0
sex,object,3,✅ OK,0,0.0
colony_id,object,7,✅ OK,0,0.0
island,object,7,✅ OK,0,0.0


Metric,count,mean,std,min,25%,50%,75%,max,skew,kurtosis
bill_length_mm,5773.0,45.248945,5.278319,31.92,41.14,45.4,49.11,62.47625,-0.184926,-0.506671
bill_depth_mm,5773.0,17.306007,2.12331,12.75,15.65,17.5,18.93,22.62,-0.175695,-0.789962
flipper_length_mm,5773.0,201.492667,13.396554,158.21,191.7,198.9,212.8,244.39,0.299019,-0.635365
body_mass_g,5773.0,3829.834656,802.119748,2389.96,3272.35,3806.0,4231.0,6637.71,0.52465,-0.056235


tag_id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg,is_duplicate
UNKNOWN,gentoo,48.99,14.63,220.9,5890.0,adult,male,torgersen north,torgersen,2023-11-17,healthy,papri2023,yes,2023-11-09,False
ADE-0001,adelie,39.55,19.92,186.2,2500.0,chick,male,biscoe west,UNKNOWN,2022-07-31,underweight,papri2022,yes,2022-07-20,False
UNKNOWN,gentoo,55.734313,13.0,198.9,4536.0,adult,female,biscoe west,biscoe,2024-04-14,healthy,UNKNOWN,yes,2024-04-12,False
GEN-0001,gentoo,46.22,13.91,212.8,3150.0,chick,male,dream south,dream,2020-04-27,underweight,papri2020,yes,2020-04-14,False
UNKNOWN,chinstrap,49.02,16.22,192.2,3120.389387,adult,male,biscoe west,biscoe,2022-10-03,healthy,papri2022,yes,2022-10-02,False
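
The row and column metrics in the audit output above are the kind of summary a raw-versus-final comparison produces. A minimal pandas sketch follows, assuming both dataframes are available in memory; `lifecycle_summary` and the toy frames are hypothetical and only approximate what the toolkit reports.

```python
import pandas as pd


def lifecycle_summary(df_raw: pd.DataFrame, df_final: pd.DataFrame) -> dict:
    """Summarize structural differences between a raw and a final dataframe."""
    raw_cols, final_cols = set(df_raw.columns), set(df_final.columns)
    return {
        "Initial Rows": len(df_raw),
        "Final Rows": len(df_final),
        "Initial Columns": df_raw.shape[1],
        "Final Columns": df_final.shape[1],
        "Columns Added": sorted(final_cols - raw_cols),
        "Columns Removed": sorted(raw_cols - final_cols),
        "Initial Nulls": int(df_raw.isna().sum().sum()),
        "Final Nulls": int(df_final.isna().sum().sum()),
    }


# Toy raw frame with one null per column.
raw = pd.DataFrame(
    {"species": ["adelie", None, "gentoo"], "body_mass_g": [3700.0, None, 5890.0]}
)
# Toy final frame: nulls filled and one flag column added.
final = raw.assign(
    species=raw["species"].fillna("UNKNOWN"),
    body_mass_g=raw["body_mass_g"].fillna(raw["body_mass_g"].median()),
    is_duplicate=False,
)

summary = lifecycle_summary(raw, final)
for metric, value in summary.items():
    print(f"{metric}: {value}")
```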


In [12]:
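# 🎉 Final Certified Data Preview
# Note: a non-None df_final only means the audit returned a dataframe. Confirm that
# "Final Pipeline Status" in the audit report shows a pass before treating it as certified.
if 'df_final' in locals() and df_final is not None:
    print("✅ Pipeline complete. Displaying the first 5 rows of the final certified dataset:")
    display(df_final.head())
else:
    print("⏹️ Pipeline finished, but no final dataframe was produced (likely skipped or failed).")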

✅ Pipeline complete. Displaying the first 5 rows of the final certified dataset:


Unnamed: 0,tag_id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg,is_duplicate
0,UNKNOWN,gentoo,48.99,14.63,220.9,5890.0,adult,male,torgersen north,torgersen,2023-11-17,healthy,papri2023,yes,2023-11-09,False
1,ADE-0001,adelie,39.55,19.92,186.2,2500.0,chick,male,biscoe west,UNKNOWN,2022-07-31,underweight,papri2022,yes,2022-07-20,False
2,UNKNOWN,gentoo,55.734313,13.0,198.9,4536.0,adult,female,biscoe west,biscoe,2024-04-14,healthy,UNKNOWN,yes,2024-04-12,False
3,GEN-0001,gentoo,46.22,13.91,212.8,3150.0,chick,male,dream south,dream,2020-04-27,underweight,papri2020,yes,2020-04-14,False
4,UNKNOWN,chinstrap,49.02,16.22,192.2,3120.389387,adult,male,biscoe west,biscoe,2022-10-03,healthy,papri2022,yes,2022-10-02,False
