# 🧪 Full Pipeline Execution (Notebook Mode)

This notebook demonstrates the full pipeline execution using the master controller script `run_toolkit_pipeline.py`.

- Controlled via: `config/run_toolkit_config.yaml`
- Executes all pipeline modules in sequence (M01–M10)
- Outputs: Dashboards, reports, plots, and the final certified dataset
- ✅ Set `notebook: true` in the YAML to enable inline dashboards

>📂 Final outputs are exported to the `exports/` and `data/processed/` directories.

---

<details>
<summary><strong>📎 Notes & Use Cases</strong></summary>

**🧭 Notes**
- Fully modular pipeline execution from raw to certified clean data
- Configurable behavior using a single YAML file
- Can be executed interactively (with displays) or headlessly (silent mode)

**💼 Use Cases**
- End-to-end QA audits for new or synthetic datasets
- Validating preprocessing logic during exploratory workflows
- Certifying pipeline output before downstream modeling
- Showcasing toolkit capabilities in interviews or portfolio reviews

</details>

<details>
<summary><strong>🔁 Alternate Modes</strong></summary>

- Set `notebook: false` in the YAML to run this notebook silently (ideal for automation or CI).
- Run the pipeline as a CLI script outside notebooks with:

```bash
python run_toolkit_pipeline.py --config config/run_toolkit_config.yaml
```

</details>

```python
from analyst_toolkit.run_toolkit_pipeline import run_full_pipeline

final_df = run_full_pipeline(config_path="config/run_toolkit_config.yaml")
```

2025-09-02 00:35:14,970 - INFO - --- Loading Master Orchestration Config from config/run_toolkit_config.yaml ---
2025-09-02 00:35:14,971 - INFO - --- 🚚 Loading initial data from data/raw/synthetic_penguins_v3.5.csv ---
2025-09-02 00:35:14,977 - INFO - --- 🚀 Starting Module: DIAGNOSTICS ---


Rows,Columns
5541,15

Memory Usage
3.26 MB

Duplicate Rows,Duplicate %
1,0.02


Column,Unique Values
tag_id,2678
capture_date,1917
date_egg,1656
colony_id,19

Column,Dtype,Unique Values,Audit Remarks,Missing Count,Missing %
tag_id,object,2678,✅ OK,2242,40.46
species,object,5,✅ OK,166,3.0
bill length (mm),float64,1984,✅ OK,429,7.74
bill depth (mm),float64,862,✅ OK,417,7.53
flipper_length_mm,float64,1466,✅ OK,451,8.14
body_mass_g,float64,3328,✅ OK,406,7.33
age_group,object,7,✅ OK,121,2.18
sex,object,6,✅ OK,2739,49.43
colony_id,object,19,✅ OK,405,7.31
island,object,11,✅ OK,584,10.54
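
The `Missing %` and `Duplicate %` columns above are plain ratios over the 5,541 input rows. As a minimal sketch in plain Python (the helper name `percent_of_rows` is illustrative and not part of the toolkit), the reported figures can be reproduced from the raw counts:

```python
def percent_of_rows(count: int, total_rows: int) -> float:
    """Return count as a percentage of total_rows, rounded to 2 decimals."""
    return round(100 * count / total_rows, 2)


TOTAL_ROWS = 5541  # from the Rows,Columns table

# Missing counts copied from the diagnostics table above.
missing_counts = {"tag_id": 2242, "sex": 2739, "island": 584}

missing_pct = {col: percent_of_rows(n, TOTAL_ROWS) for col, n in missing_counts.items()}
duplicate_pct = percent_of_rows(1, TOTAL_ROWS)  # one exact duplicate row

print(missing_pct)    # {'tag_id': 40.46, 'sex': 49.43, 'island': 10.54}
print(duplicate_pct)  # 0.02
```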


Metric,count,mean,std,min,25%,50%,75%,max,skew,kurtosis
bill length (mm),5112.0,45.166682,5.66641,30.63,40.51,45.95,49.36,62.64,-0.145952,-0.606829
bill depth (mm),5124.0,17.305377,2.231495,12.37,15.49,17.485,19.03,23.01,-0.111456,-0.897492
flipper_length_mm,5090.0,202.2378,14.342621,162.79,191.1,199.315,214.1,252.4,0.329099,-0.616376
body_mass_g,5135.0,3853.645265,898.232986,2376.56,3219.5,3742.0,4376.515,7378.33,0.616778,0.086446


tag_id,species,bill length (mm),bill depth (mm),flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg
,Gentoo,48.99,14.11,220.9,5890.0,Adult,Male,Torgersen North,Torgersen,2023-11-17,,PAPRI2023,Yes,2023-11-09
,Gentoo,48.99,14.11,220.9,5890.0,Adult,Male,Torgersen North,Torgersen,2023-11-17,,PAPRI2023,Yes,2023-11-09


tag_id,species,bill length (mm),bill depth (mm),flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg
,Gentoo,48.99,14.11,220.9,5890.0,Adult,Male,Torgersen North,Torgersen,2023-11-17,,PAPRI2023,Yes,2023-11-09
,Gentoo,48.99,14.11,220.9,5890.0,Adult,Male,Torgersen North,Torgersen,2023-11-17,,PAPRI2023,Yes,2023-11-09
ADE-0001,Adelie,39.55,19.92,186.2,2500.0,Chick,Male,Biscoe West,Biscoe,2024-13-03,Underweight,PAPRI2022,Yes,2022-07-20
,Gentoo,48.23,13.0,,4536.0,Adult,Female,Biscoe West,,2024-04-14,Healthy,,Yes,2024-04-12
GEN-0001,Gentoo,46.22,13.91,212.8,2500.0,Juvenile,Female,Dream South,Dream,,Underweight,PAPRI2020,Yes,2020-04-14


*Interactive widget output: Visual Profile*

Validation Rule,Description,Status
Schema Conformity,Verify column names match the expected schema.,⚠️ Fail (2 issues)
Dtype Enforcement,Verify column data types match expectations.,⚠️ Fail (1 issues)
Categorical Values,Verify values in categorical columns are within an allowed set.,⚠️ Fail (7 issues)
Numeric Ranges,Verify values in numeric columns are within a defined range.,✅ Pass


Issue Type,Columns
Missing,"bill_length_mm, bill_depth_mm"
Unexpected,"bill length (mm), bill depth (mm)"

Column,Expected Type,Actual Type
flipper_length_mm,int64,float64

Invalid Value,Count
adeleie,148
Gentto,145

Invalid Value,Count
short cut,70
torg,61
unknown,59
bisco,55
cormor,47
dreamland,46

Invalid Value,Count
Male,1308
Female,1227
F,83
?,74
M,61
Unknown,49

Invalid Value,Count
cormorant NW,45
invalid_colony,36
Torgersen,35
Cormorant,34
biscoe 2,34
torgersen SE,31
TORGERSEN 4,30
short point,28
/Shortcut,26
Biscoe,25

Invalid Value,Count
juvenille,58
unk,48
ADLT,47
chik,29

Invalid Value,Count
critcal ill,36
Overwight,34
under weight,33
ok,30

Invalid Value,Count
PAPR12021,60
papri2024,58
STUDY_2022,57
PP2020,48
PAPR2023,46
PAPRI20X9,37
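
The `Invalid Value,Count` tables above list values that fall outside each column's allowed set. A minimal sketch of that kind of check, assuming missing values are skipped (they do not appear in the tables above) and using a hypothetical helper name and an assumed allowed set for `species`:

```python
from collections import Counter


def find_invalid_values(values, allowed):
    """Count non-missing values that are not in the allowed set."""
    return Counter(v for v in values if v is not None and v not in allowed)


# Assumed allowed set; the real one lives in the toolkit's YAML config.
ALLOWED_SPECIES = {"Adelie", "Chinstrap", "Gentoo"}

sample = ["Gentoo", "adeleie", "Adelie", None, "Gentto", "adeleie", "Chinstrap"]
invalid = find_invalid_values(sample, ALLOWED_SPECIES)

print(invalid.most_common())  # [('adeleie', 2), ('Gentto', 1)]
```

The check is case-sensitive, which is consistent with `Male` and `Female` being reported as invalid for `sex` before normalization uppercases them.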


Original Name,New Name
bill length (mm),bill_length_mm
bill depth (mm),bill_depth_mm

Column,Operation
clutch_completion,standardize_text
sex,standardize_text

Column,Target Type
capture_date,datetime64[ns]
date_egg,datetime64[ns]

Column,Mappings Applied
sex,7
species,1
island,1
colony_id,14
age_group,4
health_status,7
study_name,6

Column,Original,Corrected,Score
species,Gentto,Gentoo,83
species,adeleie,Adelie,92
island,bisco,Biscoe,91
island,short cut,Shortcut,94
island,dreamland,Dream,90
island,cormor,Cormorant,90
island,torg,Torgersen,90
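
The `Score` column above comes from fuzzy string matching against the canonical labels. The toolkit's actual scorer is not shown in this notebook, so the sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in; `best_match` and the cutoff of 80 are assumptions for illustration.

```python
from difflib import SequenceMatcher


def best_match(value, choices, cutoff=80):
    """Return (choice, score) for the closest choice, or (None, score) below cutoff.

    Scores are 0-100 similarity ratios from difflib, compared case-insensitively.
    """
    scored = [
        (round(100 * SequenceMatcher(None, value.lower(), choice.lower()).ratio()), choice)
        for choice in choices
    ]
    score, choice = max(scored)
    return (choice, score) if score >= cutoff else (None, score)


# Canonical labels taken from the normalized output in this notebook.
SPECIES = ["Adelie", "Chinstrap", "Gentoo"]
ISLANDS = ["Biscoe", "Cormorant", "Dream", "Shortcut", "Torgersen"]

print(best_match("Gentto", SPECIES))   # ('Gentoo', 83)
print(best_match("adeleie", SPECIES))  # ('Adelie', 92)
print(best_match("torg", ISLANDS))     # falls below the cutoff with a plain ratio
```

A plain ratio happens to reproduce the scores for `Gentto` and `adeleie`, but it penalizes short prefixes such as `torg`. The score of 90 reported for those rows suggests the toolkit uses a scorer with partial-match weighting, which this sketch does not attempt to replicate.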


Value,Count
,2739
MALE,1369
FEMALE,1310
UNKNOWN,123

Value,Original Count,Normalized Count
,2739,2739
Male,1308,0
Female,1227,0
F,83,0
?,74,0
M,61,0
Unknown,49,0
MALE,0,1369
FEMALE,0,1310
UNKNOWN,0,123

Value,Count
Torgersen,1405
Dream,1184
Biscoe,1084
Cormorant,715
,584
Shortcut,510
UNKNOWN,59

Value,Original Count,Normalized Count
Torgersen,1344,1405
Dream,1138,1184
Biscoe,1029,1084
Cormorant,668,715
,584,584
Shortcut,440,510
short cut,70,0
torg,61,0
unknown,59,0
bisco,55,0

Value,Count
Gentoo,1815
Adelie,1784
Chinstrap,1776
,166

Value,Original Count,Normalized Count
Chinstrap,1776,1776
Gentoo,1670,1815
Adelie,1636,1784
,166,166
adeleie,148,0
Gentto,145,0

Value,Count
Healthy,2194
Underweight,1411
Overweight,733
,554
Critical,323
Sick,296
UNKNOWN,30

Value,Original Count,Normalized Count
Healthy,2194,2194
Underweight,1378,1411
Overweight,699,733
,554,554
Unwell,296,0
Critically Ill,287,0
critcal ill,36,0
Overwight,34,0
under weight,33,0
ok,30,0

Value,Count
Torgersen North,1490
Dream South,1216
Biscoe West,1092
Cormorant East,767
Shortcut Point,511
,405
UNKNOWN,60

Value,Original Count,Normalized Count
Torgersen North,1394,1490
Dream South,1151,1216
Biscoe West,1033,1092
Cormorant East,688,767
Shortcut Point,457,511
,405,405
cormorant NW,45,0
invalid_colony,36,0
Torgersen,35,0
Cormorant,34,0

Value,Count
Adult,3822
Juvenile,1073
Chick,477
,121
UNKNOWN,48

Value,Original Count,Normalized Count
Adult,3775,3822
Juvenile,1015,1073
Chick,448,477
,121,121
juvenille,58,0
unk,48,0
ADLT,47,0
chik,29,0
UNKNOWN,0,48

Value,Count
PAPRI2020,1122
PAPRI2021,1024
PAPRI2022,916
PAPRI2023,824
PAPRI2024,803
,563
PAPRI2019,252
UNKNOWN,37

Value,Original Count,Normalized Count
PAPRI2020,1074,1122
PAPRI2021,964,1024
PAPRI2022,859,916
PAPRI2023,778,824
PAPRI2024,745,803
,563,563
PAPRI2019,252,252
PAPR12021,60,0
papri2024,58,0
STUDY_2022,57,0

Value,Count
NaT,915
2023-01-18,10
2024-05-09,10
2024-02-01,9
2023-06-12,8
2020-12-25,8
2022-11-15,8
2023-06-10,8
2023-03-22,8
2024-01-01,8

Value,Original Count,Normalized Count
,534,915
9999-99-99,39,0
error,33,0
not-a-date,30,0
2023-01-18,10,10
2024-05-09,10,10
2024-02-01,9,9
2020-12-25,8,8
2022-08-04,8,8
2022-11-15,8,8
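
In the `capture_date` comparison above, the missing count grows from 534 to 915 once the column is cast to `datetime64[ns]`: strings such as `9999-99-99`, `error`, and `not-a-date` cannot be parsed and end up as `NaT`. In pandas this behavior is typically obtained with `pd.to_datetime(..., errors="coerce")`; whether the toolkit does exactly that is an assumption. The same idea in plain Python, with `None` standing in for `NaT`:

```python
from datetime import date, datetime
from typing import Optional


def parse_date_or_none(text: Optional[str]) -> Optional[date]:
    """Parse a YYYY-MM-DD string; return None (the NaT analogue) on failure."""
    if not text:
        return None
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None


raw = ["2023-01-18", "9999-99-99", "error", "not-a-date", "2024-13-03", None]
parsed = [parse_date_or_none(value) for value in raw]
n_missing = sum(p is None for p in parsed)

print(parsed[0])   # 2023-01-18
print(n_missing)   # 5
```

Note that `2024-13-03` (month 13) is also rejected, which matches the `ADE-0001` row whose `capture_date` becomes `NaT` after normalization.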

Value,Count
NaT,836
2019-12-11,13
2019-12-27,12
2020-10-11,11
2020-07-20,11
2019-12-17,11
2019-11-25,11
2020-06-25,11
2021-04-03,10
2021-04-16,10

Value,Original Count,Normalized Count
,836,836
2019-12-11,13,13
2019-12-27,12,12
2019-11-25,11,11
2019-12-17,11,11
2020-06-25,11,11
2020-07-20,11,11
2020-10-11,11,11
2021-04-03,10,10
2021-04-16,10,10

Value,Count
yes,4314
no,764
,463

Value,Original Count,Normalized Count
Yes,4314,0
No,764,0
,463,463
yes,0,4314
no,0,764


Validation Rule,Description,Status
Schema Conformity,Verify column names match the expected schema.,✅ Pass
Dtype Enforcement,Verify column data types match expectations.,✅ Pass
Categorical Values,Verify values in categorical columns are within an allowed set.,✅ Pass
Numeric Ranges,Verify values in numeric columns are within a defined range.,✅ Pass


Metric,Value
Total Row Count,5541
Duplicate Rows Flagged,1219

tag_id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg
ADE-0001,Adelie,39.55,19.92,186.2,2500.0,Chick,MALE,Biscoe West,Biscoe,NaT,Underweight,PAPRI2022,yes,2022-07-20
ADE-0001,Adelie,42.6,21.37,184.5,2477.78,Juvenile,MALE,Biscoe West,Biscoe,NaT,Healthy,PAPRI2022,yes,2022-07-20
ADE-0001,Adelie,38.7,20.78,202.74,2650.73,Juvenile,MALE,Biscoe West,Biscoe,NaT,Underweight,PAPRI2022,yes,2022-07-20
ADE-0013,Adelie,40.28,18.1,188.6,3224.0,Juvenile,,,Cormorant,NaT,,PAPRI2022,yes,2022-06-18
ADE-0013,Adelie,41.51,19.31,182.31,3322.26,Adult,,,Cormorant,NaT,Overweight,PAPRI2022,yes,2022-06-18
ADE-0049,Adelie,,18.46,185.4,3326.0,Adult,FEMALE,Shortcut Point,Shortcut,NaT,Healthy,PAPRI2024,yes,2024-08-29
ADE-0049,Adelie,,17.77,176.49,3175.64,Adult,FEMALE,Shortcut Point,Shortcut,NaT,Overweight,PAPRI2024,yes,2024-08-29
ADE-0054,Adelie,42.06,17.93,,4125.0,Adult,MALE,Biscoe West,Biscoe,NaT,Overweight,PAPRI2022,,2022-10-28
ADE-0054,Adelie,42.53,18.07,,4342.78,Adult,MALE,Biscoe West,Biscoe,NaT,Critical,PAPRI2022,,2022-10-28
ADE-0073,Adelie,41.64,17.1,192.8,2500.0,Chick,FEMALE,Torgersen North,,NaT,Overweight,PAPRI2023,yes,2023-02-24
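
The flagged rows above share identifiers such as `tag_id` and `date_egg` but differ in their measurements, so the 1,219 flags appear to come from a key-based comparison rather than exact row equality. The key columns are not shown in this output; the sketch below assumes a hypothetical subset of `("tag_id", "date_egg")` and marks every member of a repeated key, similar in spirit to pandas `DataFrame.duplicated(subset=..., keep=False)`.

```python
from collections import Counter


def flag_duplicates(rows, subset):
    """Return one boolean per row: True when the row's key on `subset` repeats."""
    keys = [tuple(row[col] for col in subset) for row in rows]
    counts = Counter(keys)
    return [counts[key] > 1 for key in keys]


# Three rows abbreviated from the flagged-duplicates preview above.
rows = [
    {"tag_id": "ADE-0001", "date_egg": "2022-07-20", "body_mass_g": 2500.0},
    {"tag_id": "ADE-0001", "date_egg": "2022-07-20", "body_mass_g": 2477.78},
    {"tag_id": "ADE-0013", "date_egg": "2022-06-18", "body_mass_g": 3224.0},
]

# Hypothetical key columns; the real subset is defined in the YAML config.
flags = flag_duplicates(rows, subset=("tag_id", "date_egg"))
print(flags)  # [True, True, False]
```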


*Interactive widget output: Visual Summary*

column,method,outlier_count,lower_bound,upper_bound,outlier_examples
bill_length_mm,iqr,1,27.235,62.635,[62.64]
body_mass_g,zscore,18,709.829815,6997.460715,"[7000.0, 7000.0, 7000.0, 7000.0, 7000.0]"
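
The bounds in the table above can be recomputed from the summary statistics reported earlier. The IQR bounds follow the usual 1.5 × IQR rule. For the z-score method, a threshold of 3.5 standard deviations reproduces the reported bounds; that value is inferred from the numbers, not read from the config. The function names are illustrative.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Lower and upper fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def zscore_bounds(mean, std, threshold):
    """Lower and upper limits at `threshold` standard deviations from the mean."""
    return mean - threshold * std, mean + threshold * std


# Quartiles of `bill length (mm)` and mean/std of `body_mass_g` from the diagnostics.
bill_low, bill_high = iqr_bounds(40.51, 49.36)
mass_low, mass_high = zscore_bounds(3853.645265, 898.232986, threshold=3.5)

print(round(bill_low, 3), round(bill_high, 3))  # 27.235 62.635
print(round(mass_low, 2), round(mass_high, 2))  # 709.83 6997.46
```

With these bounds, the single `bill_length_mm` value of 62.64 sits just above 62.635, and the 18 `body_mass_g` values of 7000.0 sit just above 6997.46.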


tag_id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg,is_duplicate
,Gentoo,,14.41,221.9,7000.0,Adult,,Torgersen North,Torgersen,2019-10-31,Healthy,PAPRI2019,,NaT,False
,,47.68,17.62,,7000.0,Adult,,Torgersen North,Torgersen,2021-08-17,Healthy,PAPRI2021,,2021-08-14,False
GEN-0041,Gentoo,45.63,14.13,213.2,7000.0,Juvenile,FEMALE,Dream South,Dream,2021-12-02,Healthy,PAPRI2021,,2021-11-23,False
,Gentoo,46.39,13.84,206.3,7000.0,Adult,,Cormorant East,Cormorant,2022-10-26,Healthy,PAPRI2022,,2022-10-12,False
ADE-0182,Adelie,38.46,17.16,185.1,7000.0,Adult,,Dream South,Dream,2024-02-03,Overweight,PAPRI2024,yes,2024-01-31,False
,Gentoo,49.36,13.0,224.1,7000.0,Adult,,Torgersen North,Torgersen,NaT,Healthy,,no,NaT,True
,Gentoo,40.59,14.37,230.0,7000.0,Adult,MALE,,Biscoe,NaT,Healthy,PAPRI2021,yes,2021-03-25,True
GEN-0301,Gentoo,44.56,16.48,212.7,7000.0,Adult,MALE,Biscoe West,Biscoe,2022-12-12,Healthy,PAPRI2022,no,NaT,False
,Gentoo,45.16,15.57,218.4,7000.0,Adult,FEMALE,,Cormorant,2021-07-30,Healthy,PAPRI2021,yes,2021-07-17,False
GEN-0681,Gentoo,44.73,13.94,217.8,7000.0,Adult,,Torgersen North,Torgersen,NaT,Healthy,PAPRI2022,yes,2022-11-07,True


*Interactive widget output: Outlier Visualizations*

strategy,column,outliers_handled,details
clip,bill_length_mm,1,Clipped 1 values to bounds.
median,body_mass_g,18,Imputed 18 values with median (3742.00).


Column,Row_Index,Original_Value,Capped_Value
bill_length_mm,5164,62.64,62.635


Column,Strategy,Fill Value,Nulls Filled
bill_length_mm,mean,45.17,429
body_mass_g,mean,3842.08,406
bill_depth_mm,median,17.48,417
flipper_length_mm,median,199.31,451
sex,mode,MALE,2739
tag_id,constant,UNKNOWN,2242
species,constant,UNKNOWN,166
age_group,constant,UNKNOWN,121
colony_id,constant,UNKNOWN,405
island,constant,UNKNOWN,584

Column,Nulls Before,Nulls After,Nulls Filled
bill_length_mm,429,0,429
body_mass_g,406,0,406
bill_depth_mm,417,0,417
flipper_length_mm,451,0,451
sex,2739,0,2739
tag_id,2242,0,2242
species,166,0,166
age_group,121,0,121
colony_id,405,0,405
island,584,0,584
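
The four strategies in the tables above (`mean`, `median`, `mode`, `constant`) each replace nulls with a single fill value per column. A minimal sketch of that dispatch in plain Python, with `None` standing in for a null; the `impute` helper and the sample values are illustrative, not the toolkit's implementation:

```python
from statistics import mean, median, mode


def impute(values, strategy, fill_value=None):
    """Fill None entries using the given strategy; return (filled_values, fill)."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)
    elif strategy == "constant":
        fill = fill_value
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    return [fill if v is None else v for v in values], fill


filled_sex, sex_fill = impute(["MALE", "FEMALE", None, "MALE", None], "mode")
filled_depth, depth_fill = impute([15.49, None, 17.48, 19.03], "median")
filled_tag, _ = impute(["ADE-0001", None], "constant", fill_value="UNKNOWN")

print(sex_fill, depth_fill)  # MALE 17.48
print(filled_tag)            # ['ADE-0001', 'UNKNOWN']
```

In this run, mode imputation assigns all 2,739 missing `sex` values to `MALE`, raising its count from 1,369 to 4,108.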


Value,Count
MALE,4108
FEMALE,1310
UNKNOWN,123

Value,Original Count,Imputed Count,Change
,2739,0,-2739
MALE,1369,4108,2739
FEMALE,1310,1310,0
UNKNOWN,123,123,0

Value,Count
UNKNOWN,2242
GEN-0271,5
ADE-0119,4
GEN-0143,4
ADE-0176,4
GEN-0751,4
GEN-0673,4
GEN-0433,4
GEN-0902,4
GEN-0106,4

Value,Original Count,Imputed Count,Change
,2242,0,-2242
GEN-0271,5,5,0
ADE-0119,4,4,0
ADE-0176,4,4,0
ADE-0203,4,4,0
CHN-0905,4,4,0
GEN-0054,4,4,0
GEN-0106,4,4,0
GEN-0143,4,4,0
GEN-0433,4,4,0

Value,Count
Gentoo,1815
Adelie,1784
Chinstrap,1776
UNKNOWN,166

Value,Original Count,Imputed Count,Change
Gentoo,1815,1815,0
Adelie,1784,1784,0
Chinstrap,1776,1776,0
,166,0,-166
UNKNOWN,0,166,166

Value,Count
Adult,3822
Juvenile,1073
Chick,477
UNKNOWN,169

Value,Original Count,Imputed Count,Change
Adult,3822,3822,0
Juvenile,1073,1073,0
Chick,477,477,0
,121,0,-121
UNKNOWN,48,169,121

Value,Count
Torgersen North,1490
Dream South,1216
Biscoe West,1092
Cormorant East,767
Shortcut Point,511
UNKNOWN,465

Value,Original Count,Imputed Count,Change
Torgersen North,1490,1490,0
Dream South,1216,1216,0
Biscoe West,1092,1092,0
Cormorant East,767,767,0
Shortcut Point,511,511,0
,405,0,-405
UNKNOWN,60,465,405

Value,Count
Torgersen,1405
Dream,1184
Biscoe,1084
Cormorant,715
UNKNOWN,643
Shortcut,510

Value,Original Count,Imputed Count,Change
Torgersen,1405,1405,0
Dream,1184,1184,0
Biscoe,1084,1084,0
Cormorant,715,715,0
,584,0,-584
Shortcut,510,510,0
UNKNOWN,59,643,584

Value,Count
PAPRI2020,1122
PAPRI2021,1024
PAPRI2022,916
PAPRI2023,824
PAPRI2024,803
UNKNOWN,600
PAPRI2019,252

Value,Original Count,Imputed Count,Change
PAPRI2020,1122,1122,0
PAPRI2021,1024,1024,0
PAPRI2022,916,916,0
PAPRI2023,824,824,0
PAPRI2024,803,803,0
,563,0,-563
PAPRI2019,252,252,0
UNKNOWN,37,600,563

Value,Count
yes,4314
no,764
UNKNOWN,463

Value,Original Count,Imputed Count,Change
yes,4314,4314,0
no,764,764,0
,463,0,-463
UNKNOWN,0,463,463

Value,Count
Healthy,2194
Underweight,1411
Overweight,733
UNKNOWN,584
Critical,323
Sick,296

Value,Original Count,Imputed Count,Change
Healthy,2194,2194,0
Underweight,1411,1411,0
Overweight,733,733,0
,554,0,-554
Critical,323,323,0
Sick,296,296,0
UNKNOWN,30,584,554


Column,Remaining Nulls
bill_length_mm_iqr_outlier,429
body_mass_g_zscore_outlier,406


*Interactive widget output: Imputation Visualizations*

Issue Type,Columns
Unexpected,is_duplicate
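
The only schema issue reported at this stage is the `is_duplicate` flag column, which appears to be what causes the certification result below to fail. A schema conformity check of this kind reduces to two set differences. The sketch below assumes the expected schema is the 15 normalized column names; `schema_issues` is a hypothetical helper.

```python
# Assumed expected schema: the 15 columns present after normalization.
EXPECTED_COLUMNS = [
    "tag_id", "species", "bill_length_mm", "bill_depth_mm", "flipper_length_mm",
    "body_mass_g", "age_group", "sex", "colony_id", "island", "capture_date",
    "health_status", "study_name", "clutch_completion", "date_egg",
]


def schema_issues(actual_columns, expected_columns=EXPECTED_COLUMNS):
    """Report expected columns that are missing and actual columns that are unexpected."""
    actual, expected = set(actual_columns), set(expected_columns)
    return {
        "Missing": sorted(expected - actual),
        "Unexpected": sorted(actual - expected),
    }


final_columns = EXPECTED_COLUMNS + ["is_duplicate"]  # flag added during duplicate handling
issues = schema_issues(final_columns)
print(issues)  # {'Missing': [], 'Unexpected': ['is_duplicate']}
```

Under this reading, either dropping the flag column before certification or adding it to the expected schema would clear the issue; which one is appropriate depends on whether downstream consumers need the flag.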


Metric,Value
Final Pipeline Status,❌ CERTIFICATION FAILED
Certification Rules Passed,False
Null Value Audit Passed,True

Action,Details
drop_columns,"Removed: ['body_mass_g_zscore_outlier', 'bill_length_mm_iqr_outlier']"


Metric,Value
Initial Rows,5541
Final Rows,5541
Initial Columns,15
Final Columns,16

Column,Dtype,Unique Values,Audit Remarks,Missing Count,Missing %
tag_id,object,2679,✅ OK,0,0.0
species,object,4,✅ OK,0,0.0
bill_length_mm,float64,1985,✅ OK,0,0.0
bill_depth_mm,float64,863,✅ OK,0,0.0
flipper_length_mm,float64,1467,✅ OK,0,0.0
body_mass_g,float64,3324,✅ OK,0,0.0
age_group,object,4,✅ OK,0,0.0
sex,object,3,✅ OK,0,0.0
colony_id,object,6,✅ OK,0,0.0
island,object,6,✅ OK,0,0.0


Metric,count,mean,std,min,25%,50%,75%,max,skew,kurtosis
bill_length_mm,5541.0,45.166681,5.442593,30.63,40.98,45.24,49.07,62.635,-0.151954,-0.405922
bill_depth_mm,5541.0,17.318895,2.146392,12.37,15.65,17.485,18.92,23.01,-0.134672,-0.725335
flipper_length_mm,5541.0,201.999903,13.769645,162.79,191.8,199.315,213.0,252.4,0.392703,-0.397019
body_mass_g,5541.0,3842.084375,845.336672,2376.56,3264.0,3806.0,4266.0,6965.072934,0.552218,0.015302


tag_id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,age_group,sex,colony_id,island,capture_date,health_status,study_name,clutch_completion,date_egg,is_duplicate
UNKNOWN,Gentoo,48.99,14.11,220.9,5890.0,Adult,MALE,Torgersen North,Torgersen,2023-11-17,UNKNOWN,PAPRI2023,yes,2023-11-09,True
UNKNOWN,Gentoo,48.99,14.11,220.9,5890.0,Adult,MALE,Torgersen North,Torgersen,2023-11-17,UNKNOWN,PAPRI2023,yes,2023-11-09,True
ADE-0001,Adelie,39.55,19.92,186.2,2500.0,Chick,MALE,Biscoe West,Biscoe,1900-01-01,Underweight,PAPRI2022,yes,2022-07-20,True
UNKNOWN,Gentoo,48.23,13.0,199.315,4536.0,Adult,FEMALE,Biscoe West,UNKNOWN,2024-04-14,Healthy,UNKNOWN,yes,2024-04-12,False
GEN-0001,Gentoo,46.22,13.91,212.8,2500.0,Juvenile,FEMALE,Dream South,Dream,1900-01-01,Underweight,PAPRI2020,yes,2020-04-14,True


## 🛠️ Next Steps

This notebook demonstrates the full analyst pipeline using notebook mode. The following enhancements are planned or encouraged for production workflows:

#### ✅ CLI and Automation
- Use the CLI version for scheduled or automated runs:
  
  ```bash
  python run_toolkit_pipeline.py --config config/run_toolkit_config.yaml
  ```

- Integrate into GitHub Actions or cron jobs for continuous data QA
- Swap YAML configs to support different datasets or audit targets
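
For scheduled runs over several configs, the CLI call above can be wrapped in a small driver. This is a sketch under stated assumptions: the second config path is hypothetical, and the pipeline's exit-code behavior on a failed certification is not documented here, so the return code is only collected, not interpreted.

```python
import subprocess
import sys

PIPELINE_SCRIPT = "run_toolkit_pipeline.py"


def build_command(config_path):
    """Build the CLI invocation shown above for one YAML config."""
    return [sys.executable, PIPELINE_SCRIPT, "--config", config_path]


def run_pipeline(config_path):
    """Run the pipeline headlessly and return the process exit code."""
    return subprocess.run(build_command(config_path), check=False).returncode


# Hypothetical second config; only run_toolkit_config.yaml appears in this notebook.
CONFIGS = ["config/run_toolkit_config.yaml", "config/another_dataset.yaml"]

commands = [build_command(path) for path in CONFIGS]
for command in commands:
    print(" ".join(command))  # preview; call run_pipeline(path) to execute
```

Set `notebook: false` in each YAML file so that scheduled runs do not attempt to render inline dashboards.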

#### 🚀 Planned Iterations
- Add a dynamic changelog to follow data end to end.
- Extend to a namespace, and add additional modules:
  - ML Module Evaluation Suite
  - Visual EDA Suite
- Optional integration with cloud storage (GCS / S3) for inputs and outputs
- Create a streamlined CLI onboarding script (e.g., `init_pipeline.py`) to scaffold configs

#### 📦 Packaging Notes
- The toolkit is TOML-packaged and installable as a local Python module
- Follows modular design to support interactive, notebook, and script-based workflows
