# Fixmydata tutorial: cleaning built-in datasets

This notebook demonstrates how to explore and clean the bundled sample datasets using the `Fixmydata` utilities. Each section mirrors a typical data quality workflow so you can adapt the snippets to your own projects.

## Prerequisites
- Install dependencies from `requirements.txt`.
- Ensure the project root is on your Python path so `Fixmydata` can be imported directly.

In [None]:
import sys
import pandas as pd
from pathlib import Path

# Ensure project root is on the Python path.
# This assumes the notebook runs from a folder one level below the project root.
ROOT = Path().resolve().parent
if str(ROOT) not in sys.path:
    sys.path.append(str(ROOT))

DATA_DIR = ROOT / 'datasets'

from Fixmydata import DataCleaner, DataValidator, OutlierDetector

## 1. Load the Titanic-style passenger data

We will use `datasets/tested.csv`, which mirrors the familiar Titanic competition data.

In [None]:
titanic_path = DATA_DIR / 'tested.csv'
titanic_df = pd.read_csv(titanic_path)

print(titanic_df.shape)
titanic_df.head()

### Inspect missing values
Before cleaning, it is useful to see which columns contain gaps.

In [None]:
titanic_df.isnull().sum().to_frame('missing_values')
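
Raw counts are easier to judge next to the share of rows they represent. The sketch below builds that report with plain pandas on a small made-up frame (`toy_df` is a stand-in for `titanic_df`, and its values are illustrative only, not taken from `tested.csv`).

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_df; the values are invented for illustration
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Fare': [7.25, 71.28, np.nan, 8.05],
    'Cabin': [None, 'C85', None, None],
})

# Count the gaps per column and express them as a percentage of all rows
missing_report = pd.DataFrame({
    'missing_values': toy_df.isnull().sum(),
    'missing_pct': (toy_df.isnull().mean() * 100).round(1),
}).sort_values('missing_pct', ascending=False)

print(missing_report)
```

Sorting by `missing_pct` puts the sparsest columns first, which helps decide whether to fill a column or replace it with a placeholder.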

## 2. Clean the passenger data
We will fill missing ages and fares with their medians, replace unknown cabins with a placeholder, and remove duplicates before validation.


In [None]:
cleaning = DataCleaner(titanic_df)

# Fill missing numeric values with summary statistics
age_median = cleaning.data['Age'].median()
fare_median = cleaning.data['Fare'].median()
cleaning.fill_missing('Age', age_median)
cleaning.fill_missing('Fare', fare_median)

# Replace cabin gaps with a clear placeholder to simplify validation
cleaning.fill_missing('Cabin', 'Unknown')

# Drop accidental duplicate rows if any
titanic_clean = cleaning.remove_duplicates()
titanic_clean.head()
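
For comparison, here is a rough plain-pandas version of the same steps. It assumes that `fill_missing` behaves like `DataFrame.fillna` on one column and that `remove_duplicates` behaves like `DataFrame.drop_duplicates`. That is a guess about `Fixmydata`, not a description of its implementation, and the toy values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the passenger data; the values are invented for illustration
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, 35.0],
    'Fare': [7.25, 8.05, np.nan, np.nan],
    'Cabin': ['C85', None, None, None],
})

# Fill numeric gaps with the column median and cabin gaps with a placeholder
filled = toy_df.fillna({
    'Age': toy_df['Age'].median(),
    'Fare': toy_df['Fare'].median(),
    'Cabin': 'Unknown',
})

# Rows that become identical after filling are dropped as duplicates
deduped = filled.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Note the ordering effect: filling first can turn previously distinct rows into duplicates, so dropping duplicates after filling removes more rows than doing it before.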


### Validate the cleaned data
`DataValidator` can assert common expectations. Here we ensure the DataFrame is non-empty and that passenger ages fall inside a reasonable range.

In [None]:
validator = DataValidator(titanic_clean)
validator.validate_non_empty()
# Column names are case-sensitive; use the same 'Age' and 'Fare' labels as in the cleaning step
validator.validate_range('Age', 0, 90)

titanic_clean[['Age', 'Fare']].describe()
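
To make the idea of a range check concrete, the sketch below shows one way such a check could be written in plain pandas. `check_range` is a hypothetical helper written for this example; it is not the `DataValidator.validate_range` implementation, and whether the real method raises or returns a flag is not assumed here.

```python
import pandas as pd

def check_range(df, column, low, high):
    """Hypothetical helper: raise ValueError if any non-null value is outside [low, high]."""
    values = df[column].dropna()
    out_of_range = values[(values < low) | (values > high)]
    if not out_of_range.empty:
        raise ValueError(
            f"{len(out_of_range)} value(s) in '{column}' outside [{low}, {high}]"
        )
    return True

# Toy frames with invented ages
good_df = pd.DataFrame({'Age': [0.5, 27.0, 64.0]})
bad_df = pd.DataFrame({'Age': [27.0, 140.0]})

range_ok = check_range(good_df, 'Age', 0, 90)

try:
    check_range(bad_df, 'Age', 0, 90)
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

Missing values are skipped in this sketch, so a range check on its own does not guarantee the column is complete.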

### Detect and remove outliers
We can use `OutlierDetector` to filter extreme values. The IQR method relies on quartiles rather than the mean and standard deviation, so it is less distorted by skewed distributions such as fares.

In [None]:
detector = OutlierDetector(titanic_clean)
titanic_iqr = detector.iqr_outliers()

print('Original rows:', len(titanic_clean))
print('Rows after IQR filtering:', len(titanic_iqr))

titanic_iqr[['Age', 'Fare']].describe()
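
The IQR rule itself is short enough to sketch by hand. The helper below, `iqr_filter`, is hypothetical and works on a single column with the conventional multiplier of 1.5. How `OutlierDetector.iqr_outliers` chooses columns and its multiplier is not shown in this notebook, so treat this as an illustration of the technique only.

```python
import pandas as pd

def iqr_filter(df, column, k=1.5):
    """Hypothetical helper: keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive and evaluates to False for missing values
    return df[df[column].between(lower, upper)]

# Toy fares with one extreme value; the numbers are invented
toy_df = pd.DataFrame({'Fare': [7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 500.0]})
filtered = iqr_filter(toy_df, 'Fare')

print('Rows before:', len(toy_df))
print('Rows after:', len(filtered))
```

Because the fences come from quartiles, the single extreme fare does not widen them, which is why it is removed while the other rows stay.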

## 3. Explore the USA housing data
The `USA Housing Dataset.csv` contains home sale information. The same cleaners can be applied to prepare the data for modeling.

In [None]:
housing_path = DATA_DIR / 'USA Housing Dataset.csv'
housing_df = pd.read_csv(housing_path)
housing_df.head()

### Clean housing records and compute quick insights
We remove any duplicate housing records, filter out Z-score outliers using a threshold of 3, and check how living area correlates with price.


In [None]:
housing_cleaner = DataCleaner(housing_df)
housing_base = housing_cleaner.remove_duplicates()

housing_detector = OutlierDetector(housing_base)
housing_no_outliers = housing_detector.z_score_outliers(threshold=3)

price_sqft_corr = housing_no_outliers['price'].corr(housing_no_outliers['sqft_living'])
print(f'Correlation between price and square footage: {price_sqft_corr:.3f}')
housing_no_outliers[['price', 'sqft_living', 'bedrooms', 'bathrooms']].describe()
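
As a point of reference, a Z-score filter on one column can be sketched as follows. `z_score_filter` is a hypothetical helper that uses the sample standard deviation. Whether `OutlierDetector.z_score_outliers` uses the sample or population standard deviation, and which columns it inspects, is not stated here, so this is an assumption-laden illustration rather than the library's method.

```python
import pandas as pd

def z_score_filter(df, column, threshold=3.0):
    """Hypothetical helper: keep rows whose absolute Z-score is at most `threshold`."""
    mean = df[column].mean()
    std = df[column].std()  # sample standard deviation (ddof=1)
    z_scores = (df[column] - mean) / std
    return df[z_scores.abs() <= threshold]

# Toy living areas: twenty ordinary homes plus one extreme entry (invented values)
toy_df = pd.DataFrame({'sqft_living': list(range(1000, 3000, 100)) + [50000]})
filtered = z_score_filter(toy_df, 'sqft_living')

print('Rows before:', len(toy_df))
print('Rows after:', len(filtered))
```

Unlike the IQR rule, the mean and standard deviation are themselves pulled by extreme values, so on very small samples a Z-score above 3 may be impossible to reach. The toy frame uses 21 rows for that reason.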
