# Fixmydata tutorial: cleaning built-in datasets

This notebook demonstrates how to explore and clean the bundled sample datasets using the `Fixmydata` utilities. Each section mirrors a typical data quality workflow so you can adapt the snippets to your own projects.

## Prerequisites
- Install dependencies from `requirements.txt`.
- Ensure the project root is on your Python path so `Fixmydata` can be imported directly.
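
If the import fails because the project root is not on the path, one option is to add it at the top of the notebook. This is a minimal sketch that assumes the notebook lives one directory below the project root (as the `../datasets/` paths used later suggest); `project_root` is an illustrative name, so adjust it to your layout.

```python
import sys
from pathlib import Path

# Assumption: the notebook's working directory is one level below the project root
project_root = Path.cwd().parent

# Prepend the root so `import Fixmydata` can be resolved from it
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))
```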

In [None]:
import pandas as pd
from Fixmydata import DataCleaner, DataValidator, OutlierDetector

## 1. Load the Titanic-style passenger data

We will use `datasets/tested.csv`, which mirrors the familiar Titanic competition data.

In [None]:
titanic_path = '../datasets/tested.csv'
titanic_df = pd.read_csv(titanic_path)

print(titanic_df.shape)
titanic_df.head()

### Inspect missing values
Before cleaning, it is useful to see which columns contain gaps.

In [None]:
titanic_df.isnull().sum().to_frame('missing_values')
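
Raw counts are easier to interpret next to the share of rows affected. The sketch below uses plain pandas on a small toy frame that stands in for `titanic_df`; the values are illustrative only and the `missing_share` column name is an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for titanic_df (values are illustrative only)
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

# Count of missing values and fraction of rows affected, per column
missing_summary = pd.DataFrame({
    "missing_values": sample.isnull().sum(),
    "missing_share": sample.isnull().mean().round(2),
})

# Show only the columns that actually contain gaps
print(missing_summary[missing_summary["missing_values"] > 0])
```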

## 2. Clean the passenger data
We will standardize column names, fill missing numeric values with the median, and remove duplicates. The `DataCleaner` instance keeps track of the working DataFrame internally.

In [None]:
cleaner = DataCleaner(titanic_df)

# Normalize headers for easier downstream processing
cleaner.standardize_columns()

# Replace missing ages and fares with their median values
cleaner.fill_missing(strategy='median', columns=['age', 'fare'])

# Drop accidental duplicate rows if any
titanic_clean = cleaner.remove_duplicates()

titanic_clean.head()
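
For readers who want to see what these steps amount to, here is a rough plain-pandas equivalent on a toy frame. It is a sketch under assumptions, not the `DataCleaner` implementation: in particular, the exact column normalization rule (trim, lowercase, spaces to underscores) is a guess.

```python
import numpy as np
import pandas as pd

# Toy frame with a duplicate row and missing numeric values (illustrative only)
raw = pd.DataFrame({
    "Passenger Id": [1, 2, 2, 3],
    "Age": [22.0, np.nan, np.nan, 40.0],
    "Fare": [10.0, 20.0, 20.0, np.nan],
})

df = raw.copy()

# Assumed normalization: trim whitespace, lowercase, replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Median imputation for the selected numeric columns
for col in ["age", "fare"]:
    df[col] = df[col].fillna(df[col].median())

# Drop rows that are exact duplicates of an earlier row
df = df.drop_duplicates().reset_index(drop=True)

print(df)
```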

### Validate the cleaned data
`DataValidator` can assert common expectations. Here we ensure the DataFrame is non-empty and that passenger ages fall inside a reasonable range.

In [None]:
validator = DataValidator(titanic_clean)
validator.validate_non_empty()
validator.validate_range('age', 0, 90)

titanic_clean[['age', 'fare']].describe()
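
A range check of this kind can be sketched in plain pandas as follows. The helper `check_range` is hypothetical, and raising `ValueError` on failure is an assumption; how `DataValidator` actually reports violations is not shown in this notebook.

```python
import pandas as pd

def check_range(df, column, low, high):
    """Raise ValueError if any non-null value in `column` lies outside [low, high]."""
    values = df[column]
    # between() is inclusive on both ends; missing values are ignored here
    out_of_range = df[~values.between(low, high) & values.notna()]
    if not out_of_range.empty:
        raise ValueError(
            f"{len(out_of_range)} value(s) in '{column}' outside [{low}, {high}]"
        )
    return True

ages = pd.DataFrame({"age": [0.5, 30.0, 76.0]})
print(check_range(ages, "age", 0, 90))
```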

### Detect and remove outliers
We can use `OutlierDetector` to filter extreme values. The IQR method relies on quartiles rather than the mean and standard deviation, so it is less sensitive to skewed distributions such as fares.

In [None]:
detector = OutlierDetector(titanic_clean)
titanic_iqr = detector.iqr_outliers()

print('Original rows:', len(titanic_clean))
print('Rows after IQR filtering:', len(titanic_iqr))

titanic_iqr[['age', 'fare']].describe()
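
The IQR rule itself is simple to write down: keep values within `k` times the interquartile range of the first and third quartiles. The sketch below applies it to a single column in plain pandas. The helper `iqr_filter` and the conventional multiplier `k=1.5` are assumptions for illustration; the columns and multiplier that `OutlierDetector` uses may differ.

```python
import pandas as pd

def iqr_filter(df, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]

# Toy fares with one extreme value (illustrative only)
fares = pd.DataFrame({"fare": [7.0, 8.0, 8.0, 9.0, 10.0, 12.0, 13.0, 500.0]})
filtered = iqr_filter(fares, "fare")

print("Rows before:", len(fares))
print("Rows after:", len(filtered))
```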

## 3. Explore the USA housing data
The `USA Housing Dataset.csv` file contains home sale information. The same cleaners can be applied to prepare the data for modeling.

In [None]:
housing_path = '../datasets/USA Housing Dataset.csv'
housing_df = pd.read_csv(housing_path)
housing_df.head()

### Clean housing records and compute quick insights
We standardize column names, fill any numeric gaps with column means, and drop duplicate rows. We then remove Z-score outliers and check the correlation between living area and sale price.

In [None]:
housing_cleaner = DataCleaner(housing_df)
housing_cleaner.standardize_columns()
housing_cleaner.fill_missing(strategy='mean')
housing_base = housing_cleaner.remove_duplicates()

housing_detector = OutlierDetector(housing_base)
housing_no_outliers = housing_detector.z_score_outliers(threshold=3)

price_sqft_corr = housing_no_outliers['price'].corr(housing_no_outliers['sqft_living'])
print(f'Correlation between price and square footage: {price_sqft_corr:.3f}')
housing_no_outliers[['price', 'sqft_living', 'bedrooms', 'bathrooms']].describe()
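
For reference, Z-score filtering on a single column can be sketched in plain pandas as below. The helper `z_score_filter` is hypothetical, and the choice of the population standard deviation (`ddof=0`) is an assumption; `OutlierDetector` may compute the score differently or across several columns.

```python
import pandas as pd

def z_score_filter(df, column, threshold=3.0):
    """Keep rows whose absolute Z-score in `column` is at most `threshold`."""
    values = df[column]
    # Assumption: population standard deviation (ddof=0)
    z = (values - values.mean()) / values.std(ddof=0)
    return df[z.abs() <= threshold]

# Toy prices: 19 typical values and one extreme value (illustrative only)
prices = pd.DataFrame({"price": [100.0] * 19 + [1000.0]})
kept = z_score_filter(prices, "price", threshold=3)

print("Rows before:", len(prices))
print("Rows after:", len(kept))
```

Note that with very small samples no point can reach a Z-score of 3, so this rule only becomes useful once a column has a reasonable number of rows.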

## Next steps
- Swap in your own CSV paths and reuse the same cleaning steps.
- Try different fill strategies (mean/median/mode) depending on the data type.
- Adjust outlier thresholds to trade off removing extreme values against discarding legitimate observations.
- Add further validation checks before training models or generating reports.
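
As one way to act on the fill-strategy suggestion above, the sketch below picks a strategy per column from its data type: the median for numeric columns and the mode for everything else. The helper `fill_by_dtype` is hypothetical and written in plain pandas; it is not part of `Fixmydata`.

```python
import numpy as np
import pandas as pd

def fill_by_dtype(df):
    """Fill numeric columns with the median and other columns with the mode."""
    filled = df.copy()
    for col in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            modes = filled[col].mode(dropna=True)
            # Leave the column untouched if it has no non-null values
            if not modes.empty:
                filled[col] = filled[col].fillna(modes.iloc[0])
    return filled

# Toy frame with one numeric and one categorical gap (illustrative only)
sample = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "embarked": ["S", "S", None],
})
result = fill_by_dtype(sample)
print(result)
```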