# 🧪 Analyst Toolkit Tutorial: Full Data Pipeline 

This interactive notebook demonstrates the complete analyst pipeline using a synthetic **Palmer Penguins** dataset generated with the [`dirty_birds_data_generator` repository](https://github.com/G-Schumacher44/dirty_birds_data_generator).

Each step in the pipeline is modular, YAML-configurable, and produces exports, plots, and certification-ready reports.

This toolkit is packaged with a **`pyproject.toml`** file and can be run from a script or a notebook.


### 🧰 Toolkit Architecture: 3-Way Modular Design

This pipeline is built around a flexible ETL framework with three usage modes:

- 📓 **Notebook Mode**
  - Run individual modules or the full pipeline interactively
  - Supports HTML dashboards, widgets, and live previews
  - Ideal for iterative exploration, first-pass audits, and QA workflows

- 🧵 **CLI Mode**
  - Execute the full pipeline using `run_toolkit_pipeline.py`
  - Controlled via a master YAML config
  - Exports all reports, checkpoints, and logs to disk

- 🧪 **Hybrid Mode**
  - Develop in notebooks, deploy via scripts
  - Reuse the same configs across testing and production

The toolkit handles essential data cleaning and transformation tasks, enabling analysts to focus on:
- Exploratory Data Analysis (EDA)
- Investigating anomalies and data quality issues
- Extracting actionable insights from certified data

In [1]:
# 📁 Load Configuration and Set Execution Context

from analyst_toolkit.m00_utils.config_loader import load_config

# Path to master config (modify if needed)
config_path = "config/run_toolkit_config.yaml"

# Load full configuration dictionary
config = load_config(config_path)

# Extract run-level settings
run_id = config.get("run_id", "default_run")
notebook_mode = config.get("notebook", True)

print(f"🔧 Config loaded | Run ID: {run_id} | Notebook Mode: {notebook_mode}")

🔧 Config loaded | Run ID: CLI_2_QA | Notebook Mode: True
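The `.get` calls in the cell above fall back to a default when a key is missing from the config. A minimal sketch of that behaviour, using a plain dictionary as a stand-in for whatever `load_config` returns (its real structure is not shown in this notebook, so `example_config` and the `*_example` names are assumptions for illustration only):

```python
# Stand-in for the dictionary returned by load_config (structure assumed).
example_config = {"run_id": "CLI_2_QA"}  # note: no "notebook" key

# dict.get returns the stored value when the key exists...
run_id_example = example_config.get("run_id", "default_run")
# ...and the fallback value when it does not.
notebook_mode_example = example_config.get("notebook", True)

print(run_id_example, notebook_mode_example)
```

Because of the fallback, a config that omits `notebook` still runs in notebook mode.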


In [2]:
# 📥 Load Raw Data from CSV

from analyst_toolkit.m00_utils.load_data import load_csv

# Load input path from the global config (or override manually)
input_path = config.get("pipeline_entry_path", "data/raw/synthetic_penguins_v3.5.csv")
print(f"📂 Loading data from: {input_path}")

# Load into DataFrame
df_raw = load_csv(input_path)

📂 Loading data from: data/raw/synthetic_penguins_v3.5.csv


### 🧪 Step 1: Run Initial Diagnostics (M01)

This module profiles the raw dataset for key structural and quality checks:
- **Memory, Shape, Dtypes**  
- **Missing Values & Skewness**
- **Duplicate Detection**
- **Sample Rows & Descriptive Stats**

✅ All results are rendered in a collapsible dashboard with exportable reports.  
You can toggle inline previews and export settings via the YAML config (`diag_config_template.yaml`).


> 🛠️ To modify thresholds or toggle sections, edit the config under `diagnostics.settings`.
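To make the profile items listed above concrete, here is a rough sketch of how each one can be computed with plain pandas. This is not the toolkit's implementation; the `sample` frame and the `profile` dictionary layout are made up for illustration.

```python
import pandas as pd

# Tiny illustrative frame; the values are invented for this sketch.
sample = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Adelie", None],
    "body_mass_g": [3700.0, 5000.0, 3700.0, 3700.0, 9000.0],
})

profile = {
    "shape": sample.shape,                                      # (rows, columns)
    "dtypes": sample.dtypes.astype(str).to_dict(),              # column -> dtype name
    "missing": sample.isna().sum().to_dict(),                   # column -> null count
    "skewness": sample.skew(numeric_only=True).to_dict(),       # numeric columns only
    "duplicate_rows": int(sample.duplicated().sum()),           # repeats after the first
    "memory_bytes": int(sample.memory_usage(deep=True).sum()),  # total footprint
}

for name, value in profile.items():
    print(f"{name}: {value}")
```

The large `body_mass_g` value pulls the skewness positive, which is the kind of signal the later outlier steps follow up on.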

In [None]:
# 📊 M01: Data Diagnostics – Profile Structure & Shape

from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m01_diagnostics.run_diag_pipeline import run_diag_pipeline

# --- Load module-specific config ---
diag_config_full = load_config("config/diag_config_template.yaml")

# --- Run Diagnostics Module ---
# We pass the df_raw loaded in the previous step.
# The global run_id and notebook_mode are used.
df_profiled = run_diag_pipeline(
    config=diag_config_full, # Pass the full config object
    df=df_raw,
    notebook=notebook_mode,
    run_id=run_id
)

### 🛡️ Step 2: Run Schema & Content Validation (M02)

This module audits the dataset against a defined schema to catch issues early and guide cleaning steps:
- **Expected Columns & Dtypes**  
- **Allowed Categorical Values**
- **Numeric Range Checks**
- **Null Allowance (optional)**

✅ All results are displayed in a styled validation dashboard with exportable reports.  
You can define strict or flexible rules in the YAML config (`validation_config_template.yaml`).

> 🛠️ To adjust enforcement (e.g. halt-on-fail), set `fail_on_error` and update rules under `validation.schema_validation`.
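The three rule types above (expected columns, allowed categorical values, numeric ranges) can be sketched as one small function. The rule keys `expected_columns`, `allowed_values`, and `ranges` are hypothetical names chosen for this example and may not match the toolkit's YAML schema.

```python
import pandas as pd

def validate_schema(df, rules):
    """Return a list of human-readable violations (an empty list means pass)."""
    violations = []
    # 1. Expected columns must all be present.
    missing = [col for col in rules["expected_columns"] if col not in df.columns]
    violations += [f"missing column: {col}" for col in missing]
    # 2. Categorical columns may only contain allowed values (nulls are ignored here).
    for col, allowed in rules.get("allowed_values", {}).items():
        if col in df.columns:
            bad = sorted(set(df[col].dropna()) - set(allowed))
            if bad:
                violations.append(f"{col}: unexpected values {bad}")
    # 3. Numeric columns must fall inside an inclusive [low, high] range.
    for col, (low, high) in rules.get("ranges", {}).items():
        if col in df.columns:
            n_out = int((~df[col].dropna().between(low, high)).sum())
            if n_out:
                violations.append(f"{col}: {n_out} value(s) outside [{low}, {high}]")
    return violations

rules_example = {
    "expected_columns": ["species", "body_mass_g", "sex"],
    "allowed_values": {"species": ["Adelie", "Chinstrap", "Gentoo"]},
    "ranges": {"body_mass_g": (2000, 7000)},
}
df_example = pd.DataFrame({
    "species": ["Adelie", "adelie", "Gentoo"],
    "body_mass_g": [3700, 15000, 5000],
})
violations = validate_schema(df_example, rules_example)
print(violations)
```

Collecting every violation before deciding what to do with them keeps the check reusable for both a soft first audit and a strict gate.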

In [None]:
# 🛡️ M02: Schema & Content Validation – First Audit Pass

from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m02_validation.run_validation_pipeline import run_validation_pipeline

# --- Load module-specific config ---
val_config_full = load_config("config/validation_config_template.yaml")

# --- Run Validation Module ---
df_validated = run_validation_pipeline(
    config=val_config_full,
    df=df_profiled,
    notebook=notebook_mode,
    run_id=run_id
)

### 🧹 Step 3: Normalize & Standardize Data (M03)

This module performs rule-based cleaning and normalization to prepare the dataset for certification:
- **Column Renaming & Type Coercion**
- **Value Mapping & Text Cleaning**
- **Fuzzy Matching & Datetime Parsing**

✅ Results are rendered in a structured dashboard with before/after comparisons and audit previews.  
All rules and output paths are controlled via the YAML config (`normalization_config_template.yaml`).

> 🛠️ To adjust cleaning logic, modify the `rules` block (e.g. `value_mappings`, `preview_columns`, etc).
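As an illustration of the cleaning steps listed above, the sketch below chains column renaming, text cleaning, fuzzy matching, and datetime parsing. It uses `difflib` from the standard library for the fuzzy step, which is a substitute chosen for this example and not necessarily what the toolkit uses; `CANONICAL_SPECIES` and `standardize_label` are hypothetical names.

```python
import difflib

import pandas as pd

# Hypothetical canonical labels, keyed by their cleaned (lowercase) form.
CANONICAL_SPECIES = {"adelie": "Adelie", "gentoo": "Gentoo"}

def standardize_label(value, canonical=CANONICAL_SPECIES, cutoff=0.8):
    """Clean a string label, then map it to the closest canonical spelling if near enough."""
    cleaned = value.strip().lower()
    match = difflib.get_close_matches(cleaned, list(canonical), n=1, cutoff=cutoff)
    return canonical[match[0]] if match else cleaned

raw_example = pd.DataFrame({
    "Species ": [" adelie", "GENTOO ", "Adelei"],
    "capture_date": ["2021-03-04", "2021-13-40", "2021-07-19"],
})

# Column renaming: strip whitespace and lowercase the headers.
normalized_example = raw_example.rename(columns=lambda name: name.strip().lower())
# Text cleaning plus fuzzy matching on the label column (assumes string values).
normalized_example["species"] = normalized_example["species"].map(standardize_label)
# Datetime parsing: values that do not fit the format become NaT instead of raising.
normalized_example["capture_date"] = pd.to_datetime(
    normalized_example["capture_date"], format="%Y-%m-%d", errors="coerce"
)
print(normalized_example)
```

Coercing unparseable dates to `NaT` keeps the run going and leaves the bad rows visible to the certification and imputation steps.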

In [None]:
# 🧹 M03: Data Normalization – Standardizing Key Fields

import logging
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m03_normalization.run_normalization_pipeline import run_normalization_pipeline

# --- Load Config ---
norm_config_full = load_config("config/normalization_config_template.yaml")

# --- Run Normalization Module ---
# Uses df_validated from the previous step and global run_id/notebook_mode.
df_normalized = run_normalization_pipeline(
    config=norm_config_full,
    df=df_validated,
    notebook=notebook_mode,
    run_id=run_id
)

In [None]:
# Sanity check: count cells in the date columns that still equal the literal placeholder string
(df_normalized[['capture_date', 'date_egg']] == "YYYY-00-DD 00:00:SS").sum()

### 🛡️ Step 4: Certification Gate (M02)

This step re-uses the **Validation Module (M02)**, but with a stricter configuration to act as a quality gate. It is designed to **halt the pipeline** if violations are found:
- ✅ All column names, data types, categorical values, and numeric ranges must pass
- 🛑 **`fail_on_error: true`** triggers a hard stop on validation failure

📦 This step can be run **at any point in the pipeline** — not just the end.  
Use it wherever you want to **certify a dataset snapshot** or block further execution unless data meets expectations.

✅ Results are rendered inline with full export support.  
All certification rules live in the YAML config (`certification_config_template.yaml`).

> 🛠️ Adjust gatekeeping behavior by modifying schema rules or toggling `fail_on_error`.
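The halt-on-failure behaviour described above can be sketched in a few lines. `CertificationError` and `certification_gate` are hypothetical names for this example; the toolkit's own exception type and function signature are not shown in this notebook.

```python
class CertificationError(Exception):
    """Hypothetical exception type used only in this sketch."""

def certification_gate(violations, fail_on_error=True):
    """Return True on pass; on failure either raise (hard stop) or return False."""
    if not violations:
        return True
    if fail_on_error:
        # Hard stop: later cells never see uncertified data.
        raise CertificationError("; ".join(violations))
    return False

# Soft mode reports the failure and lets the caller decide what to do.
soft_result = certification_gate(["species: unexpected values"], fail_on_error=False)

# Strict mode raises, which halts a notebook run or a script at this point.
try:
    certification_gate(["species: unexpected values"])
    strict_halted = False
except CertificationError:
    strict_halted = True

print(soft_result, strict_halted)
```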

In [None]:
# 🛡️ M02: Certification (Strict Validation Gatekeeper)

import logging
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m02_validation.run_validation_pipeline import run_validation_pipeline

# --- Load Certification Config ---
cert_config_full = load_config("config/certification_config_template.yaml")

# --- Run Final Certification Pass ---
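# logging.info is hidden at the default WARNING level; basicConfig has no effect if logging is already configured.
logging.basicConfig(level=logging.INFO)
logging.info("🚀 Starting Certification Gate (re-using M02)")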

df_certified = run_validation_pipeline(
    config=cert_config_full,
    df=df_normalized,
    notebook=notebook_mode,
    run_id=run_id
)

### 🧹 Step 5: Deduplication (M04)

This module identifies and handles **duplicate rows** in the dataset, using the logic from `m04_duplicates`.

You can choose to:
- 🔍 **Flag duplicates** for review  
- ✂️ **Remove duplicates** directly (default: keep first occurrence)

✅ Configurable logic lets you define:
- Which columns to check for duplication (`subset_columns`)
- Whether to flag or drop (`mode: "flag"` or `"remove"`)
- Columns to preview (hide IDs, timestamps, etc.)

📄 Results are displayed with an inline preview and summary plots.

> 🛠️ Adjust deduplication behavior in `dups_config_template.yaml`.
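The flag and remove modes described above map closely onto `DataFrame.duplicated`. A minimal sketch, assuming a hypothetical `is_duplicate` flag column and a hypothetical `handle_duplicates` helper (neither name comes from the toolkit):

```python
import pandas as pd

def handle_duplicates(df, subset_columns=None, mode="flag"):
    """Flag or drop repeated rows, always keeping the first occurrence."""
    dup_mask = df.duplicated(subset=subset_columns, keep="first")
    if mode == "flag":
        return df.assign(is_duplicate=dup_mask)
    if mode == "remove":
        return df.loc[~dup_mask].reset_index(drop=True)
    raise ValueError(f"unknown mode: {mode!r}")

birds = pd.DataFrame({
    "tag_id": ["A1", "A1", "B2"],
    "body_mass_g": [3700, 3700, 5000],
})

flagged = handle_duplicates(birds, subset_columns=["tag_id"], mode="flag")
deduped = handle_duplicates(birds, subset_columns=["tag_id"], mode="remove")
print(flagged)
print(deduped)
```

Flagging first and removing later is the more cautious order, since the flagged rows can be reviewed before anything is dropped.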

In [None]:
# ♻️ M04: Deduplication and Duplicates Handling

from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m04_duplicates.run_dupes_pipeline import run_duplicates_pipeline
import logging

# --- Load Config ---
dupes_config_full = load_config("config/dups_config_template.yaml")


# --- Run Duplicates Module ---
df_deduped = run_duplicates_pipeline(
    config=dupes_config_full,
    df=df_certified,
    notebook=notebook_mode,
    run_id=run_id
)

### 📏 Step 6: Detect Outliers (M05)

This module (`m05_detect_outliers`) scans numeric columns for outliers using configurable logic:
- **Z-Score** or **IQR** methods (per column or global default)
- Adds binary flags (e.g., `*_outlier`) to the dataset if `append_flags: true`
- Skips non-numeric or excluded fields via `exclude_columns`

📊 **Interactive PlotViewer**  
If enabled, the `PlotViewer` renders **boxplots, histograms, and violin plots** inline  
— giving a fast visual summary of where anomalies occur.

📁 **What’s Exported:**
- ✅ `df_outliers_flagged`: DataFrame with new `_outlier` columns
- ✅ `detection_results`: thresholds and summary tables
- ✅ Plots: saved to `exports/plots/outliers/{run_id}/`
- ✅ Report: XLSX or CSV, based on config

> 🛠️ Configure methods, thresholds, excluded columns, and plot types in `outlier_config_template.yaml`.
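For reference, the two detection methods named above can be sketched as follows. The defaults used here (an IQR multiplier of 1.5 and a z-score threshold of 3.0) are common conventions assumed for this example, not values read from the toolkit's config, and the helper names are hypothetical.

```python
import pandas as pd

def iqr_flags(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; also return the bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper), (lower, upper)

def zscore_flags(series, threshold=3.0):
    """Flag values whose absolute z-score (population std) exceeds the threshold."""
    z = (series - series.mean()) / series.std(ddof=0)
    return z.abs() > threshold

mass = pd.Series([3500.0, 3600.0, 3700.0, 3800.0, 3900.0, 9000.0], name="body_mass_g")

iqr_mask, (lower, upper) = iqr_flags(mass)
z_mask = zscore_flags(mass, threshold=2.0)

# Appending the flag as a new column mirrors the `*_outlier` naming described above.
flag_frame = mass.to_frame()
flag_frame["body_mass_g_outlier"] = iqr_mask
print(flag_frame)
```

With only six values, no point can reach a z-score of 3, which is why the demo lowers the threshold to 2.0. On small samples the IQR method is usually the more practical choice.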

In [None]:
# 📏 M05: Detect Outliers and Plot Visuals

import logging
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m05_detect_outliers.run_detection_pipeline import run_outlier_detection_pipeline
from IPython.display import display

# --- Load module-specific config ---
outlier_config_full = load_config("config/outlier_config_template.yaml")

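# df_deduped is the output of the M04 Duplicates step above.
# Fail early with a clear message instead of silently skipping detection.
if 'df_deduped' not in locals():
    raise NameError("df_deduped is not defined; run the M04 Duplicates cell first.")

df_outliers_flagged, detection_results = run_outlier_detection_pipeline(
    config=outlier_config_full,
    df=df_deduped,
    notebook=notebook_mode,
    run_id=run_id
)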

### 🧼 Step 7: Handle Outliers (M06)

This module (`m06_outlier_handling`) applies cleanup strategies to flagged outliers from the detection step:
- Strategies include:
  - `'clip'`: Caps values to threshold bounds
  - `'median'`: Imputes using median
  - `'constant'`: Replaces with fixed value (e.g., `-999`)
  - `'none'`: Leaves values untouched (default)

⚙️ Strategy is configured per column or globally via `__default__` and `__global__`.

📁 **What’s Exported:**
- ✅ Cleaned DataFrame: `df_handled`
- ✅ Handling report (XLSX/CSV)
- ✅ Optional checkpoint joblib

> 🛠️ Adjust cleanup logic, output paths, or constant fill values in `handling_config_template.yaml`.
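The four strategies listed above can be sketched on a single column as follows. Two assumptions are made for this example: the `'median'` strategy computes the median from unflagged values only, and a column without its own entry falls back to `__default__`. The toolkit's actual behaviour, including how `__global__` is applied, is not shown here.

```python
import pandas as pd

def handle_outliers(series, flags, bounds, strategy="none", fill_value=None):
    """Apply one cleanup strategy to a numeric column."""
    lower, upper = bounds
    if strategy == "none":
        return series.copy()
    if strategy == "clip":
        # Caps every value to the bounds; only flagged values lie outside them.
        return series.clip(lower=lower, upper=upper)
    if strategy == "median":
        # Assumption: median is taken over the unflagged values.
        return series.mask(flags, series[~flags].median())
    if strategy == "constant":
        return series.mask(flags, fill_value)
    raise ValueError(f"unknown strategy: {strategy!r}")

def resolve_strategy(column, strategies):
    """Use the column's own strategy if present, else the `__default__` entry."""
    return strategies.get(column, strategies.get("__default__", "none"))

values = pd.Series([3500.0, 3600.0, 3700.0, 3800.0, 3900.0, 9000.0])
outlier_flags = pd.Series([False, False, False, False, False, True])
outlier_bounds = (3250.0, 4250.0)

clipped = handle_outliers(values, outlier_flags, outlier_bounds, "clip")
imputed = handle_outliers(values, outlier_flags, outlier_bounds, "median")
filled = handle_outliers(values, outlier_flags, outlier_bounds, "constant", fill_value=-999)

strategy_config = {"__default__": "none", "body_mass_g": "clip"}
print(resolve_strategy("body_mass_g", strategy_config))
print(resolve_strategy("bill_length_mm", strategy_config))
```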

In [None]:
# 🧼 M06: Handle Outliers

import logging
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m06_outlier_handling.run_handling_pipeline import run_outlier_handling_pipeline

# --- Load module-specific config ---
handling_config_full = load_config("config/handling_config_template.yaml")

# Pass the entire detection_results dictionary, not its unpacked components.
df_handled = run_outlier_handling_pipeline(
    config=handling_config_full,
    df=df_outliers_flagged,
    detection_results=detection_results, # Pass the whole dictionary here
    notebook=notebook_mode,
    run_id=run_id
)


### 🔧 Step 8: Impute Missing Values (M07)

This module (`m07_imputation`) fills missing (`NaN`) values using a column-specific strategy:
- `'mean'`, `'median'`, or `'mode'`, which fill with a statistic computed from the column's observed values (mean and median apply to numeric columns only)
- `'constant'` for fixed fallback values (e.g., `"UNKNOWN"` or `"1900-01-01"`)
- Strategy is configured via the `rules.strategies` section in the YAML

📊 If enabled, comparison plots show how categorical columns changed post-imputation  
(using the same PlotViewer system).

📁 **What’s Exported:**
- ✅ Imputed DataFrame: `df_imputed`
- ✅ Report: imputation log (XLSX/CSV)
- ✅ Plots: before/after comparisons (if enabled)
- ✅ Optional checkpoint joblib

> 🛠️ Configure logic and column-specific strategies in `imputation_config_template.yaml`.
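A column-specific strategy table like the one described above can be sketched with `fillna`. The `{"method": ..., "value": ...}` layout and the `impute` helper are hypothetical and only loosely modelled on the `rules.strategies` idea. The real YAML keys may differ.

```python
import pandas as pd

def impute(df, strategies):
    """Fill NaN values column by column according to a strategy table."""
    out = df.copy()
    for col, spec in strategies.items():
        method = spec["method"]
        if method == "mean":
            fill = out[col].mean()
        elif method == "median":
            fill = out[col].median()
        elif method == "mode":
            # Most frequent observed value; assumes the column is not entirely null.
            fill = out[col].mode(dropna=True).iloc[0]
        elif method == "constant":
            fill = spec["value"]
        else:
            raise ValueError(f"unknown method: {method!r}")
        out[col] = out[col].fillna(fill)
    return out

penguins_example = pd.DataFrame({
    "body_mass_g": [3700.0, None, 5000.0, 3900.0],
    "sex": ["male", "female", None, "female"],
    "island": ["Biscoe", None, "Dream", "Dream"],
})
strategies_example = {
    "body_mass_g": {"method": "median"},
    "sex": {"method": "mode"},
    "island": {"method": "constant", "value": "UNKNOWN"},
}
imputed_example = impute(penguins_example, strategies_example)
print(imputed_example)
```

Working on a copy leaves the input frame untouched, which makes before/after comparisons straightforward.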

In [None]:
# 🔧 M07: Impute Data and Plot Summary Visuals

import logging
from analyst_toolkit.m00_utils.config_loader import load_config
from analyst_toolkit.m07_imputation.run_imputation_pipeline import run_imputation_pipeline

# Load the configuration for the imputation module
imputation_config_full = load_config("config/imputation_config_template.yaml")

df_imputed = run_imputation_pipeline(
    config=imputation_config_full,
    notebook=notebook_mode,
    df=df_handled,  # Pass the existing DataFrame here
    run_id=run_id
)

### 🧩 Behind the Scenes: Utility & Visual Modules

Several specialized support modules power the Analyst Toolkit pipeline behind the scenes.  
These are not called directly in the notebook, but are crucial to the system’s flexibility and polish:

#### 🧰 `m00_utils/`
- `config_loader.py`: Robust loader with support for environment paths and nested YAMLs
- `load_data.py`: Abstracted CSV/Joblib loader with encoding fallback
- `export_utils.py`: Modular export system for saving reports and checkpoints
- `rendering_utils.py`: Styled HTML table generator for dashboard outputs

#### 📊 `m08_visuals/`
- `distributions.py`: Boxplots, histograms, and violin plots for outlier detection
- `summary_plots.py`: Heatmaps, missingness matrices, and dtype summaries
- `plot_viewer.py`: Interactive PlotViewer widget for inspecting flagged values and category shifts

> ⚙️ These modules enable notebook-mode display, CLI compatibility, YAML-driven plotting, and clean HTML export dashboards.

📁 Explore these utilities in the `/src/` directory to understand how the toolkit remains modular, extensible, and production-grade.
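The "encoding fallback" mentioned for `load_data.py` can be pictured roughly like this. It is a sketch of the general technique only: `read_csv_with_fallback` and its encoding list are hypothetical, and the toolkit's loader may work differently.

```python
import os
import tempfile

import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in order; return the frame and the encoding that worked."""
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding), encoding
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error

# Demo: a file written as Latin-1 is not valid UTF-8, so the first attempt fails.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "wb") as handle:
        handle.write("species\nAdélie\nGentoo\n".encode("latin-1"))
    demo_df, used_encoding = read_csv_with_fallback(demo_path)

print(used_encoding)
print(demo_df)
```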

### 🎬 Step 9: Final Auditing and Certification (M10)

This final module performs a comprehensive audit of the cleaned dataset and applies strict quality checks before certification.

It serves as the **final quality gate** and includes:
- ✅ **Final Edits:** Drop or rename columns, coerce dtypes as needed
- ✅ **Certification Check:** Applies validation rules with `fail_on_error: true` to enforce schema, dtypes, and content requirements
- ✅ **Lifecycle Comparison:** Compares raw vs final structure, nulls, and column presence
- ✅ **Capstone Report:** Renders a complete dashboard summarizing pipeline impact and status

🛡️ If any rule is violated (e.g., unexpected nulls or schema mismatch), the system halts and logs failure details for debugging.

📁 **What’s Exported:**
- Final Audit Report (XLSX and Joblib)
- Final Certified Dataset (CSV and Joblib)
- Inline dashboard with all results

> 🛠️ Customize certification rules, null restrictions, or output paths in `final_audit_config_template.yaml`.

🎉 Once this step passes, your dataset is ready for **production use or modeling pipelines**.
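The lifecycle comparison described above boils down to diffing the raw and final frames. A minimal sketch with made-up data; the `lifecycle_summary` helper and its output keys are assumptions for illustration, not the toolkit's report format.

```python
import pandas as pd

def lifecycle_summary(raw, final):
    """Compare row counts, column presence, and total nulls between two frames."""
    return {
        "rows": (len(raw), len(final)),
        "columns_dropped": sorted(set(raw.columns) - set(final.columns)),
        "columns_added": sorted(set(final.columns) - set(raw.columns)),
        "nulls": (int(raw.isna().sum().sum()), int(final.isna().sum().sum())),
    }

raw_snapshot = pd.DataFrame({
    "tag_id": ["A1", "A1", "B2"],
    "body_mass_g": [3700.0, None, 5000.0],
    "notes": ["ok", None, None],
})
final_snapshot = pd.DataFrame({
    "tag_id": ["A1", "B2"],
    "body_mass_g": [3700.0, 5000.0],
    "body_mass_g_outlier": [False, False],
})

summary = lifecycle_summary(raw_snapshot, final_snapshot)
print(summary)
```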

In [None]:
# 🎬 M10: Final Auditing and Certification

from analyst_toolkit.m10_final_audit.final_audit_pipeline import run_final_audit_pipeline
from analyst_toolkit.m00_utils.config_loader import load_config

# --- Load Config ---
final_audit_config_full = load_config("config/final_audit_config_template.yaml")

# --- Run Final Audit ---
# The final audit pipeline expects the full config dictionary, as it may perform
# validation using rules from a separate block.
df_final_clean = run_final_audit_pipeline(
    config=final_audit_config_full,
    df=df_imputed,  # Pass the existing DataFrame here
    notebook=notebook_mode,
    run_id=run_id
)

---

## 🧭 What’s Next?

Congratulations — you’ve now completed a full walkthrough of the Analyst Toolkit pipeline using synthetic Palmer Penguins data!

Here are some suggested next steps:

1. 🔍 **Explore Outputs**  
   - Review the exported reports and plots in the `exports/` folder
   - Inspect final audit and certification summaries

2. 🧪 **Test with Other Datasets**  
   - Replace the penguin dataset with your own CSV in the YAML configs
   - Adjust schema, value, and range rules accordingly

3. 📓 **Use the Full Pipeline Script**  
   - Try running `run_toolkit_pipeline.py` in CLI or notebook mode for a full end-to-end execution
   - Config: `config/run_toolkit_config.yaml`

4. 🛠️ **Customize Modules**  
   - Add new modules (e.g., feature engineering, modeling)
   - Use your own diagnostic thresholds or imputation logic

5. 🚀 **Package or Deploy**  
   - Deploy the toolkit in production (Airflow, Papermill, GitHub Actions, etc.)
   - Or package it as a Python module for reuse

---