# Fixmydata Project Walkthrough

This notebook demonstrates how the **Fixmydata** library cleans, validates, and inspects a dataset end-to-end. It also contains a ready-to-present outline for a 5–7 minute project presentation.

## 1. Overview

Fixmydata wraps common pandas cleaning and validation patterns in small, composable helpers:

- **`DataCleaner`**: remove duplicates, drop/fill missing data, trim whitespace, and drop columns.
- **`DataValidator`**: assert numeric ranges and check for empty datasets.
- **`OutlierDetector`**: filter outliers with Z-score or IQR methods while ignoring non-numeric fields.

Each helper keeps a clean `DataFrame` in memory and exposes chainable methods, so you can move quickly from messy input to validated, analysis-ready data.

In [None]:
import pandas as pd
from Fixmydata import DataCleaner, DataValidator, OutlierDetector

## 2. Load a messy sample dataset

The sample data intentionally includes:
- Duplicate IDs and leading or trailing whitespace in `city`.
- Missing `city` and `price` values.
- An obvious price outlier.

In [None]:
raw = pd.DataFrame({
    "id": [1, 1, 2, 3, 4, 5],
    "city": ["  New York", "Boston  ", "Chicago", None, "San Francisco", "Houston"],
    "price": [10.5, 9.7, 11.2, 13.0, None, 99.0],
})
raw
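
Before cleaning, it can help to confirm each of the listed issues with plain pandas. The sketch below uses only standard pandas calls and no Fixmydata API; the variable names are illustrative.

```python
import pandas as pd

raw = pd.DataFrame({
    "id": [1, 1, 2, 3, 4, 5],
    "city": ["  New York", "Boston  ", "Chicago", None, "San Francisco", "Houston"],
    "price": [10.5, 9.7, 11.2, 13.0, None, 99.0],
})

# Rows whose id repeats an earlier row
duplicate_id_rows = int(raw["id"].duplicated().sum())

# Number of missing values in each column
missing_per_column = raw.isna().sum()

# City values that change when surrounding whitespace is stripped
city = raw["city"].dropna()
padded_cities = city[city != city.str.strip()].tolist()

print("duplicate id rows:", duplicate_id_rows)
print("missing per column:", missing_per_column.to_dict())
print("cities with padding:", padded_cities)
```

This reports one repeated ID, one missing value each in `city` and `price`, and two padded city names.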

## 3. Clean the data

- Remove duplicate IDs (keep the first occurrence).
- Drop rows missing a `city` value.
- Standardize whitespace for the `city` column.
- Fill missing prices with the column median for analysis.

In [None]:
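cleaner = DataCleaner(raw)
# Keep one row per id
cleaner.remove_duplicates(subset=["id"])
# Drop rows that have no city
cleaner.drop_missing(columns=["city"])
# Strip stray whitespace from city names
cleaner.standardize_whitespace(["city"])
# Compute the median on the rows that remain, then impute missing prices
median_price = cleaner.data["price"].median()
cleaner.fill_missing("price", median_price)
clean = cleaner.data
clean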
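
As a cross-check, the four cleaning steps can be approximated in plain pandas. This is a sketch of the steps as described above, not Fixmydata's implementation; whether the library matches it in every detail (for example, index handling) is an assumption.

```python
import pandas as pd

raw = pd.DataFrame({
    "id": [1, 1, 2, 3, 4, 5],
    "city": ["  New York", "Boston  ", "Chicago", None, "San Francisco", "Houston"],
    "price": [10.5, 9.7, 11.2, 13.0, None, 99.0],
})

expected = (
    raw.drop_duplicates(subset=["id"], keep="first")   # keep the first row per id
       .dropna(subset=["city"])                        # drop rows with no city
       .assign(city=lambda df: df["city"].str.strip()) # trim surrounding whitespace
)

# Median of the remaining non-null prices, used to fill the gap
median_price = expected["price"].median()
expected["price"] = expected["price"].fillna(median_price)

print(expected)
```

Under these steps four rows remain (ids 1, 2, 4, and 5), the imputed price is the median 11.2, and the 99.0 price is still present.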

## 4. Validate the cleaned dataset

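- Confirm the dataframe is non-empty; after the cleaning steps, `city` and `price` should contain no nulls.
- Check that `price` stays within a practical range (0–50). The 99.0 outlier has not been filtered yet and lies outside this range, so this check should fail on that row.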

In [None]:
validator = DataValidator(clean)
validator.validate_non_empty()
validator.validate_range("price", 0, 50)
clean
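
The same checks can be sketched in plain pandas to see which rows a range check would catch. The frame below is the result that the cleaning steps in section 3 describe, written out by hand as an assumption; no Fixmydata calls are used.

```python
import pandas as pd

# Assumed result of the cleaning steps (hand-written for illustration)
clean = pd.DataFrame({
    "id": [1, 2, 4, 5],
    "city": ["New York", "Chicago", "San Francisco", "Houston"],
    "price": [10.5, 11.2, 11.2, 99.0],
})

# Completeness checks
is_non_empty = not clean.empty
has_nulls = bool(clean.isna().any().any())

# Rows whose price falls outside the inclusive range [0, 50]
out_of_range = clean[~clean["price"].between(0, 50)]

print("non-empty:", is_non_empty)
print("has nulls:", has_nulls)
print(out_of_range)
```

Only the row with id 5 (price 99.0) falls outside the range, which is why the range check and the outlier step target the same row.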

## 5. Filter outliers

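Use Z-score filtering (threshold 2.5) to keep only inlier rows while safely ignoring non-numeric columns. Note that in a sample of n values the largest possible absolute Z-score is (n − 1)/√n with the sample standard deviation (√(n − 1) with the population one), so with the four rows that remain after cleaning no value can exceed about 1.73 and the 99.0 price passes a 2.5 threshold. The IQR method is the better fit for a sample this small.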

In [None]:
outlier_detector = OutlierDetector(clean)
inliers = outlier_detector.z_score_outliers(threshold=2.5)
inliers
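
To see how the two methods behave on a sample this small, the sketch below computes Z-scores and IQR fences for the cleaned prices in plain pandas. The price values are the ones the cleaning steps describe (an assumption), and the 1.5 × IQR fence is the conventional choice, not necessarily the library's default.

```python
import pandas as pd

price = pd.Series([10.5, 11.2, 11.2, 99.0], name="price")

# Z-scores using the sample standard deviation (pandas default, ddof=1)
z = (price - price.mean()) / price.std()
max_abs_z = float(z.abs().max())

# Upper bound on |z| for n values with the sample standard deviation
n = len(price)
z_bound = (n - 1) / n ** 0.5

# Conventional IQR fences: 1.5 * IQR beyond the quartiles
q1, q3 = price.quantile(0.25), price.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = price[(price < lower) | (price > upper)].tolist()

print("max |z|:", max_abs_z, "bound:", z_bound)
print("IQR fences:", lower, upper)
print("IQR outliers:", iqr_outliers)
```

The largest Z-score stays just under the bound of 1.5, well below the 2.5 threshold, while the IQR fences flag 99.0 as the only outlier.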

## 6. Summary of results

- Started with duplicate IDs, whitespace issues, missing values, and a price outlier.
- Cleaned dataset now has standardized city names and imputed prices.
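- Validation checks that the dataset is non-empty and tests `price` against the 0–50 range, which the 99.0 value falls outside.
- Outlier detection is the final filter before downstream analysis. On a four-row sample a Z-score threshold of 2.5 cannot exclude the 99.0 price, so the IQR method is the more suitable choice here.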

## 7. Presentation outline (5–7 minutes)

Use this as a talk track; the suggested time for each item is shown in parentheses.

1. **Problem & goal (45s)**: Data quality slows analysis; Fixmydata packages repeatable fixes on top of pandas.
2. **Library overview (60s)**: Briefly introduce `DataCleaner`, `DataValidator`, `OutlierDetector`, and helper utilities.
3. **Workflow demo (2–3m)**:
   - Load messy sample data and show issues (duplicates, whitespace, nulls, outlier).
   - Run cleaning steps and display the cleaned frame.
   - Validate ranges and completeness; highlight guardrail errors.
   - Filter outliers with Z-score or IQR; note automatic numeric-column handling.
4. **Implementation highlights (60–90s)**: Mention small, focused classes, explicit error messages, and chainable methods that keep a copy of the data.
5. **Practical uses (45s)**: Quick pre-EDA cleanup, lightweight pipeline steps, or teaching data quality basics.
6. **Next steps (30–45s)**: Extend validators, add schema configs, or ship formal docs.

**Call to action**: Invite questions or contributions via the repository.