# Fixmydata tutorial: cleaning built-in datasets

This notebook demonstrates how to explore and clean the bundled sample datasets using the `Fixmydata` utilities. Each section mirrors a typical data quality workflow so you can adapt the snippets to your own projects.

## Prerequisites
- Install dependencies from `requirements.txt`.
- Ensure the project root is on your Python path so `Fixmydata` can be imported directly.

In [1]:
import sys
import pandas as pd
from pathlib import Path

# Ensure project root is on the Python path.
# This assumes the notebook is launched from a folder one level below the project root.
ROOT = Path().resolve().parent
if str(ROOT) not in sys.path:
    sys.path.append(str(ROOT))

DATA_DIR = ROOT / 'datasets'

from Fixmydata import DataCleaner, DataValidator, OutlierDetector

ModuleNotFoundError: No module named 'Fixmydata'
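If the import fails with a `ModuleNotFoundError` like the one above, the notebook was probably started from a directory whose parent is not the project root. The following is a minimal sketch of a more forgiving lookup. It assumes the package lives in a folder named `Fixmydata` directly under the project root; the helper names `find_project_root` and `add_to_sys_path` are hypothetical and not part of the library.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "Fixmydata") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


def add_to_sys_path(root: Path) -> None:
    """Append `root` to sys.path once, so re-running the cell does not duplicate it."""
    if str(root) not in sys.path:
        sys.path.append(str(root))
```

With these helpers, the setup cell could use `ROOT = find_project_root(Path.cwd())` followed by `add_to_sys_path(ROOT)` instead of relying on a fixed `.parent` hop.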

## 1. Load the Titanic-style passenger data

We will use `datasets/tested.csv`, which mirrors the familiar Titanic competition data.

In [None]:
titanic_path = DATA_DIR / 'tested.csv'
titanic_df = pd.read_csv(titanic_path)

print(titanic_df.shape)
titanic_df.head()

### Inspect missing values
Before cleaning, it is useful to see which columns contain gaps.

In [None]:
titanic_df.isnull().sum().to_frame('missing_values')
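Raw counts are easier to judge next to the share of rows they represent. The sketch below uses only pandas on a small made-up frame (the `toy` data and the `missing_summary` helper are illustrative, not part of `Fixmydata`) to list the columns that have gaps, with counts and percentages.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_values": counts,
        "missing_pct": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually contain gaps
    summary = summary[summary["missing_values"] > 0]
    return summary.sort_values("missing_values", ascending=False)


# Made-up rows for illustration only
toy = pd.DataFrame({
    "Age": [22.0, None, 35.0, None],
    "Fare": [7.25, 71.28, None, 8.05],
    "Sex": ["male", "female", "female", "male"],
})
print(missing_summary(toy))
```

The same helper can be called on `titanic_df` once the dataset has loaded.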

## 2. Clean the passenger data
We will standardize column names, fill missing numeric values with the median, fill categorical gaps with the mode, and remove duplicates. The `DataCleaner` instance keeps track of the working DataFrame internally.

In [3]:
cleaner = DataCleaner(titanic_df)

# Normalize headers for easier downstream processing
cleaner.standardize_columns()

# Replace missing numeric values with medians
cleaner.fill_missing(strategy='median', columns=['age', 'fare'])

# Fill categorical gaps using the most frequent values
cleaner.fill_missing(strategy='mode')

# Drop accidental duplicate rows if any
titanic_clean = cleaner.remove_duplicates()

titanic_clean.head()
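To make the four steps concrete, here is a rough plain-pandas equivalent run on a small made-up frame. It is a sketch under stated assumptions, not the `Fixmydata` implementation: header standardization is assumed to mean trimming, lowercasing, and replacing spaces with underscores, and the helper name `clean_with_pandas` is hypothetical.

```python
import pandas as pd


def clean_with_pandas(df: pd.DataFrame, median_cols=("age", "fare")) -> pd.DataFrame:
    """Approximate the cleaning steps with plain pandas (assumed behavior)."""
    out = df.copy()
    # Assumed header convention: trimmed, lowercase, spaces replaced by underscores
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    # Median fill for the chosen numeric columns (they must exist after renaming)
    for col in median_cols:
        out[col] = out[col].fillna(out[col].median())
    # Any column that still has gaps gets its most frequent value
    for col in out.columns:
        if out[col].isnull().any():
            modes = out[col].mode(dropna=True)
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])
    # Drop exact duplicate rows and renumber the index
    return out.drop_duplicates().reset_index(drop=True)


# Made-up rows for illustration only
toy = pd.DataFrame({
    "Age": [22.0, None, 30.0, 22.0],
    " Fare": [7.0, 9.0, None, 7.0],
    "Embarked": ["S", None, "S", "S"],
})
result = clean_with_pandas(toy)
print(result)
```

Running the median fill before the mode fill matters: once a numeric column has been filled with its median, the mode pass no longer touches it.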


### Validate the cleaned data
`DataValidator` can assert common expectations. Here we ensure the DataFrame is non-empty and that passenger ages fall inside a reasonable range.

In [33]:
validator = DataValidator(titanic_clean)
validator.validate_non_empty()
validator.validate_range('age', 0, 90)

titanic_clean[['age', 'fare']].describe()
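The two checks can also be expressed in a few lines of plain pandas, which is handy for seeing what a failed validation might report. This is a sketch only: the function names `check_non_empty` and `check_range` are hypothetical, and raising `ValueError` on failure is an assumption rather than documented `DataValidator` behavior.

```python
import pandas as pd


def check_non_empty(df: pd.DataFrame) -> None:
    """Raise if the frame has no rows."""
    if df.empty:
        raise ValueError("DataFrame has no rows")


def check_range(df: pd.DataFrame, column: str, low: float, high: float) -> None:
    """Raise if any non-missing value in `column` lies outside [low, high]."""
    values = df[column].dropna()
    bad = values[(values < low) | (values > high)]
    if not bad.empty:
        raise ValueError(
            f"{len(bad)} value(s) in '{column}' fall outside [{low}, {high}]: "
            f"{bad.tolist()[:5]}"
        )


# Made-up ages for illustration only
ages = pd.DataFrame({"age": [0.5, 34.0, 90.0]})
check_non_empty(ages)
check_range(ages, "age", 0, 90)  # passes silently: bounds are inclusive
```

Missing values are skipped by the range check, so it should run after the gaps have been filled.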


### Detect and remove outliers
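We can use `OutlierDetector` to filter extreme values. The IQR method derives its cutoffs from quartiles rather than from the mean and standard deviation, so a handful of extreme values barely moves them, which makes it a reasonable choice for skewed columns such as fares.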

In [36]:
detector = OutlierDetector(titanic_clean)
titanic_iqr = detector.iqr_outliers()

print('Original rows:', len(titanic_clean))
print('Rows after IQR filtering:', len(titanic_iqr))

titanic_iqr[['age', 'fare']].describe()
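For reference, the conventional IQR rule keeps values between `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR`. The sketch below applies it to a single column with plain pandas. Whether `iqr_outliers()` uses the same multiplier or which columns it inspects is not stated in this notebook, so treat the `iqr_filter` helper and its `k` parameter as assumptions.

```python
import pandas as pd


def iqr_filter(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within k * IQR of the quartiles."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends; rows with missing values are dropped
    return df[df[column].between(lower, upper)]


# Made-up fares with one extreme value, for illustration only
fares = pd.DataFrame({"fare": [7.25, 7.9, 8.05, 13.0, 26.0, 512.33]})
filtered = iqr_filter(fares, "fare")
print(len(fares), "->", len(filtered))
```

Because the fences come from the quartiles, the single extreme fare is removed without shifting the cutoff for the remaining rows.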


## 3. Explore the USA housing data
The `USA Housing Dataset.csv` contains home sale information. The same cleaners can be applied to prepare the data for modeling.

In [39]:
housing_path = DATA_DIR / 'USA Housing Dataset.csv'
housing_df = pd.read_csv(housing_path)
housing_df.head()

Unnamed: 0,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated,street,city,statezip,country
0,2014-05-09 00:00:00,376000.0,3.0,2.0,1340,1384,3.0,0,0,3,1340,0,2008,0,9245-9249 Fremont Ave N,Seattle,WA 98103,USA
1,2014-05-09 00:00:00,800000.0,4.0,3.25,3540,159430,2.0,0,0,3,3540,0,2007,0,33001 NE 24th St,Carnation,WA 98014,USA
2,2014-05-09 00:00:00,2238888.0,5.0,6.5,7270,130017,2.0,0,0,3,6420,850,2010,0,7070 270th Pl SE,Issaquah,WA 98029,USA
3,2014-05-09 00:00:00,324000.0,3.0,2.25,998,904,2.0,0,0,3,798,200,2007,0,820 NW 95th St,Seattle,WA 98117,USA
4,2014-05-10 00:00:00,549900.0,5.0,2.75,3060,7015,1.0,0,0,5,1600,1460,1979,0,10834 31st Ave SW,Seattle,WA 98146,USA


### Clean housing records and compute quick insights
We standardize column names, fill any numeric gaps with column means, and check the relationship between living area and sale price after removing Z-score outliers.

In [42]:
housing_cleaner = DataCleaner(housing_df)
housing_cleaner.standardize_columns()
housing_cleaner.fill_missing(strategy='mean')
housing_base = housing_cleaner.remove_duplicates()

housing_detector = OutlierDetector(housing_base)
housing_no_outliers = housing_detector.z_score_outliers(threshold=3)

price_sqft_corr = housing_no_outliers['price'].corr(housing_no_outliers['sqft_living'])
print(f'Correlation between price and square footage: {price_sqft_corr:.3f}')
housing_no_outliers[['price', 'sqft_living', 'bedrooms', 'bathrooms']].describe()

AttributeError: 'DataCleaner' object has no attribute 'standardize_columns'
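The traceback above suggests that the installed `DataCleaner` may not expose `standardize_columns`, so it can help to see the outlier and correlation steps without the library. The following is a plain-pandas sketch on made-up data, not the `Fixmydata` implementation. It assumes a z-score of `(x - mean) / std` using the sample standard deviation, and drops a row when any selected column exceeds the threshold in absolute value; the helper name `z_score_filter` is hypothetical.

```python
import pandas as pd


def z_score_filter(df: pd.DataFrame, columns: list, threshold: float = 3.0) -> pd.DataFrame:
    """Keep rows whose z-scores stay within `threshold` for every listed column."""
    subset = df[columns]
    z = (subset - subset.mean()) / subset.std(ddof=1)
    # NaN z-scores (missing values or zero-variance columns) count as non-outliers here
    z = z.fillna(0.0)
    keep = (z.abs() <= threshold).all(axis=1)
    return df[keep]


# Made-up listings: 19 ordinary homes plus one extreme price, for illustration only
toy_housing = pd.DataFrame({
    "sqft_living": [1000 + 50 * i for i in range(19)] + [1500],
    "price": [200_000 + 10_000 * i for i in range(19)] + [5_000_000],
})

no_outliers = z_score_filter(toy_housing, ["price", "sqft_living"], threshold=3)
corr = no_outliers["price"].corr(no_outliers["sqft_living"])
print(len(toy_housing), "->", len(no_outliers))
print(f"Correlation between price and square footage: {corr:.3f}")
```

Note that a z-score filter needs a reasonably large sample to flag anything: with very few rows, no single value can reach a z-score of 3.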