# Fixmydata

- Fixmydata is a Python package designed to help with data cleaning, validation, and outlier detection.
- It simplifies and automates common data preprocessing tasks, improving data quality and workflow efficiency for data scientists and analysts.

## Features

- **Cleaning:** Deduplicate rows, drop or fill missing values, remove columns, and trim whitespace with `DataCleaner`.
- **Validation:** Assert value ranges and check for missing or empty data with `DataValidator`.
- **Outlier filtering:** Identify inliers using Z-score or IQR methods while ignoring non-numeric columns via `OutlierDetector`.
- **Utilities:** CSV load/save helpers, column name normalization, null counting, and quick DataFrame introspection via `stats` and `utils`.


In [24]:
!pip install Fixmydata



# DataCleaner Class - Cleaning and Preprocessing Data

- The DataCleaner class provides several cleaning methods, such as:

    - Removing duplicates

    - Dropping or filling missing values

    - Standardizing column names

In [None]:
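from Fixmydata import DataCleaner

# `df` is a pandas DataFrame you have already loaded
cleaning = DataCleaner(df)
cleaned_df = cleaning.drop_missing()  # drop missing values
cleaned_df = cleaning.fill_missing("column_name", "missing_value")  # or fill gaps in one column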
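
To make the listed operations concrete, here is a rough pandas-only sketch of what they amount to. It is not Fixmydata's implementation: the toy DataFrame and the helper name `standardize_columns` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Toy data (hypothetical): one duplicate row, one missing value, untidy column names
df = pd.DataFrame({
    " First Name ": ["Ann", "Ann", "Bob", None],
    "Age": [30, 30, 25, 41],
})

def standardize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: trim, lowercase, and snake_case the column names."""
    renamed = frame.copy()
    renamed.columns = (
        renamed.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    )
    return renamed

tidy = standardize_columns(df)
deduped = tidy.drop_duplicates()                           # removing duplicates
dropped = deduped.dropna()                                 # dropping rows with missing values
filled = deduped.fillna({"first_name": "missing_value"})   # or filling them instead

print(list(tidy.columns))
print(len(deduped), len(dropped), len(filled))
```

Dropping and filling are alternatives: dropping shrinks the data, while filling keeps every row at the cost of introducing a placeholder.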

# DataValidator Class - Ensuring Data Integrity
- The DataValidator class checks if the data meets specific criteria:

    - **Range Validation**: Ensures numeric data is within a specified range.

    - **Non-Empty Validation**: Ensures no missing values in the data.

In [None]:
from Fixmydata import DataValidator

# `df`, `min_val`, and `max_val` are placeholders for your own DataFrame and bounds
validator = DataValidator(df)
validator.validate_range("column_name", min_val, max_val)
validator.validate_non_empty()
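
The two checks above can be approximated in plain pandas as follows. This is a sketch under assumptions, not the library's code: the function names `check_range` and `check_no_missing` are hypothetical, and raising `ValueError` on failure is a choice made here, since the source does not say how `DataValidator` reports problems.

```python
import pandas as pd

def check_range(frame: pd.DataFrame, column: str, min_val: float, max_val: float) -> None:
    """Hypothetical stand-in: raise if any value in `column` lies outside [min_val, max_val]."""
    # between() is inclusive on both ends; missing values count as out of range here
    out_of_range = ~frame[column].between(min_val, max_val)
    if out_of_range.any():
        raise ValueError(
            f"{int(out_of_range.sum())} value(s) in '{column}' outside [{min_val}, {max_val}]"
        )

def check_no_missing(frame: pd.DataFrame) -> None:
    """Hypothetical stand-in: raise if the frame has no rows or contains missing values."""
    if frame.empty:
        raise ValueError("DataFrame is empty")
    missing = int(frame.isna().sum().sum())
    if missing:
        raise ValueError(f"DataFrame contains {missing} missing value(s)")

df = pd.DataFrame({"age": [22, 35, 58], "fare": [7.25, 71.28, 8.05]})
check_range(df, "age", 0, 90)   # passes silently
check_no_missing(df)            # passes silently

try:
    check_range(df, "fare", 0, 50)  # one fare exceeds 50
except ValueError as err:
    message = str(err)
    print(message)
```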

# OutlierDetector Class - Identifying Outliers
- The OutlierDetector class detects outliers using two methods:

    - Z-score: Identifies outliers based on how many standard deviations a data point is from the mean.

    - Interquartile Range (IQR): Identifies outliers based on the spread of the middle 50% of the data.

In [None]:
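from Fixmydata import OutlierDetector

# `df` is a pandas DataFrame you have already loaded
detector = OutlierDetector(df)
df_without_outliers_z = detector.z_score_outliers()    # rows kept by the Z-score filter
df_without_outliers_iqr = detector.iqr_outliers()      # rows kept by the IQR filter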
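
For intuition, both filters can be sketched in a few lines of pandas. The functions below are hypothetical illustrations, not Fixmydata's code. They assume a Z-score threshold of 3, the conventional IQR multiplier of 1.5, and that a row is kept only when every numeric column passes.

```python
import pandas as pd

def zscore_inliers(frame: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Keep rows whose numeric values all lie within `threshold` standard deviations."""
    numeric = frame.select_dtypes(include="number")   # non-numeric columns are ignored
    z_scores = (numeric - numeric.mean()) / numeric.std()  # pandas uses the sample std (ddof=1)
    return frame[(z_scores.abs() <= threshold).all(axis=1)]

def iqr_inliers(frame: pd.DataFrame, multiplier: float = 1.5) -> pd.DataFrame:
    """Keep rows whose numeric values all lie within the IQR fences."""
    numeric = frame.select_dtypes(include="number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return frame[((numeric >= lower) & (numeric <= upper)).all(axis=1)]

# Toy data (hypothetical): twenty ordinary values and one extreme value
df = pd.DataFrame({
    "value": [10] * 10 + [12] * 10 + [100],
    "label": list("abcdefghijklmnopqrstu"),
})

print(len(df), len(zscore_inliers(df)), len(iqr_inliers(df)))
```

In this sketch, rows with missing numeric values fail both comparisons and are dropped, so it is safer to fill or drop gaps first.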

# Fixmydata tutorial: cleaning datasets

This notebook demonstrates how to explore and clean the bundled sample datasets using the `Fixmydata` utilities. Each section mirrors a typical data quality workflow so you can adapt the snippets to your own projects.

## Prerequisites
- Install dependencies from `requirements.txt`.
- Ensure the project root is on your Python path so `Fixmydata` can be imported directly.

In [1]:
import sys
import pandas as pd
from pathlib import Path

# Ensure the project root is on the Python path (assumes this notebook lives one level below it)
ROOT = Path().resolve().parent
if str(ROOT) not in sys.path:
    sys.path.append(str(ROOT))

DATA_DIR = ROOT / 'datasets'

from Fixmydata import DataCleaner, DataValidator, OutlierDetector

## 1. Load the Titanic-style passenger data

We will use `datasets/tested.csv`, which mirrors the familiar Titanic competition data.

In [3]:
titanic_path = DATA_DIR / 'tested.csv'
titanic_df = pd.read_csv(titanic_path)

print(titanic_df.shape)
titanic_df.head()

(418, 12)


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


### Inspect missing values
Before cleaning, it is useful to see which columns contain gaps.

In [5]:
titanic_df.isnull().sum().to_frame('missing_values')

| | missing_values |
|---|---|
| PassengerId | 0 |
| Survived | 0 |
| Pclass | 0 |
| Name | 0 |
| Sex | 0 |
| Age | 86 |
| SibSp | 0 |
| Parch | 0 |
| Ticket | 0 |
| Fare | 1 |
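
Raw counts are easier to judge next to the share of rows they represent. In the output above, 86 of 418 ages are missing, roughly 20.6%. The sketch below shows one way to build such a summary in plain pandas. The toy DataFrame and the `missing_pct` column name are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical) with gaps in two columns
df = pd.DataFrame({
    "Age": [34.5, None, 62.0, None],
    "Fare": [7.83, 7.0, None, 8.66],
    "Sex": ["male", "female", "male", "male"],
})

summary = pd.DataFrame({
    "missing_values": df.isna().sum(),               # count of gaps per column
    "missing_pct": (df.isna().mean() * 100).round(1),  # share of rows with a gap
})

# Keep only columns that actually have gaps, worst first
summary = summary[summary["missing_values"] > 0].sort_values(
    "missing_values", ascending=False
)
print(summary)
```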


## 2. Clean the passenger data
We will fill missing ages and fares with their medians, replace unknown cabins with a placeholder, and remove duplicates before validation.


In [7]:
cleaning = DataCleaner(titanic_df)

# Fill missing numeric values with summary statistics
age_median = cleaning.data['Age'].median()
fare_median = cleaning.data['Fare'].median()
cleaning.fill_missing('Age', age_median)
cleaning.fill_missing('Fare', fare_median)

# Replace cabin gaps with a clear placeholder to simplify validation
cleaning.fill_missing('Cabin', 'Unknown')

# Drop accidental duplicate rows if any
titanic_clean = cleaning.remove_duplicates()
titanic_clean.head()


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | Unknown | Q |
| 1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | Unknown | S |
| 2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | Unknown | Q |
| 3 | 895 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | Unknown | S |
| 4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | Unknown | S |
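
The median is a common choice for filling skewed columns because a few extreme values pull the mean away from the typical value. The toy series below is hypothetical and only illustrates that difference with plain pandas.

```python
import pandas as pd

# Toy fares (hypothetical): three cheap tickets, one gap, one very expensive ticket
fares = pd.Series([7.0, 8.0, 9.0, None, 500.0], name="Fare")

median_filled = fares.fillna(fares.median())  # gap becomes 8.5
mean_filled = fares.fillna(fares.mean())      # gap becomes 131.0, far from a typical fare

print(median_filled[3], mean_filled[3])
```

The same effect is visible in the passenger data, where the `describe()` summary reports a mean fare of about 35.58 against a median of about 14.45.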


### Validate the cleaned data
`DataValidator` can assert common expectations. Here we ensure the DataFrame is non-empty and that passenger ages fall inside a reasonable range.

In [9]:
validator = DataValidator(titanic_clean)
validator.validate_non_empty()
validator.validate_range('Age', 0, 90)

titanic_clean[['Age', 'Fare']].describe()

| | Age | Fare |
|---|---|---|
| count | 418.0 | 418.0 |
| mean | 29.599282 | 35.576535 |
| std | 12.70377 | 55.850103 |
| min | 0.17 | 0.0 |
| 25% | 23.0 | 7.8958 |
| 50% | 27.0 | 14.4542 |
| 75% | 35.75 | 31.471875 |
| max | 76.0 | 512.3292 |


### Detect and remove outliers
We can use `OutlierDetector` to filter extreme values. The IQR method suits skewed distributions like fares because quartiles are less sensitive to extreme values than the mean and standard deviation.

In [11]:
detector = OutlierDetector(titanic_clean)
titanic_iqr = detector.iqr_outliers()

print('Original rows:', len(titanic_clean))
print('Rows after IQR filtering:', len(titanic_iqr))

titanic_iqr[['Age', 'Fare']].describe()

Original rows: 418
Rows after IQR filtering: 281


| | Age | Fare |
|---|---|---|
| count | 281.0 | 281.0 |
| mean | 28.272242 | 15.61809 |
| std | 7.876031 | 12.818909 |
| min | 12.0 | 0.0 |
| 25% | 24.0 | 7.775 |
| 50% | 27.0 | 8.6625 |
| 75% | 30.0 | 21.0 |
| max | 54.0 | 65.0 |
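
As a worked check, the IQR fences can be computed by hand from the quartiles in the `describe()` output before filtering. The multiplier of 1.5 is an assumption (the conventional Tukey value). The source does not state which multiplier `iqr_outliers()` uses.

```python
# Quartiles (25%, 75%) copied from the describe() output before filtering
quartiles = {"Age": (23.0, 35.75), "Fare": (7.8958, 31.471875)}
MULTIPLIER = 1.5  # assumption: the conventional Tukey multiplier

fences = {}
for column, (q1, q3) in quartiles.items():
    iqr = q3 - q1
    fences[column] = (q1 - MULTIPLIER * iqr, q3 + MULTIPLIER * iqr)
    low, high = fences[column]
    print(f"{column}: keep values in [{low:.3f}, {high:.3f}]")
```

Under that assumption, Age values are kept within [3.875, 54.875] and Fare values within about [-27.468, 66.836]. The filtered maxima of 54.0 and 65.0 fall inside these bounds. The filtered Age minimum of 12.0 sits well above the lower fence, which suggests that some rows were also removed because of outliers in other numeric columns.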


## 3. Explore the USA housing data
The `USA Housing Dataset.csv` file contains home sale information. The same cleaners can be applied to prepare the data for modeling.

In [16]:
housing_path = DATA_DIR / 'USA Housing Dataset.csv'
housing_df = pd.read_csv(housing_path)
housing_df.head()

| | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-05-09 00:00:00 | 376000.0 | 3.0 | 2.0 | 1340 | 1384 | 3.0 | 0 | 0 | 3 | 1340 | 0 | 2008 | 0 | 9245-9249 Fremont Ave N | Seattle | WA 98103 | USA |
| 1 | 2014-05-09 00:00:00 | 800000.0 | 4.0 | 3.25 | 3540 | 159430 | 2.0 | 0 | 0 | 3 | 3540 | 0 | 2007 | 0 | 33001 NE 24th St | Carnation | WA 98014 | USA |
| 2 | 2014-05-09 00:00:00 | 2238888.0 | 5.0 | 6.5 | 7270 | 130017 | 2.0 | 0 | 0 | 3 | 6420 | 850 | 2010 | 0 | 7070 270th Pl SE | Issaquah | WA 98029 | USA |
| 3 | 2014-05-09 00:00:00 | 324000.0 | 3.0 | 2.25 | 998 | 904 | 2.0 | 0 | 0 | 3 | 798 | 200 | 2007 | 0 | 820 NW 95th St | Seattle | WA 98117 | USA |
| 4 | 2014-05-10 00:00:00 | 549900.0 | 5.0 | 2.75 | 3060 | 7015 | 1.0 | 0 | 0 | 5 | 1600 | 1460 | 1979 | 0 | 10834 31st Ave SW | Seattle | WA 98146 | USA |


### Clean housing records and compute quick insights
We remove any duplicate housing records, filter out Z-score outliers using a threshold of 3, and check how home size correlates with price.


In [19]:
housing_cleaner = DataCleaner(housing_df)
housing_base = housing_cleaner.remove_duplicates()

housing_detector = OutlierDetector(housing_base)
housing_no_outliers = housing_detector.z_score_outliers(threshold=3)

price_sqft_corr = housing_no_outliers['price'].corr(housing_no_outliers['sqft_living'])
print(f'Correlation between price and square footage: {price_sqft_corr:.3f}')
housing_no_outliers[['price', 'sqft_living', 'bedrooms', 'bathrooms']].describe()


Correlation between price and square footage: 0.611


| | price | sqft_living | bedrooms | bathrooms |
|---|---|---|---|---|
| count | 3805.0 | 3805.0 | 3805.0 | 3805.0 |
| mean | 499189.2 | 2019.579763 | 3.349803 | 2.09159 |
| std | 271645.9 | 786.01844 | 0.855091 | 0.70747 |
| min | 0.0 | 370.0 | 1.0 | 0.75 |
| 25% | 312891.0 | 1430.0 | 3.0 | 1.75 |
| 50% | 444845.0 | 1910.0 | 3.0 | 2.25 |
| 75% | 620000.0 | 2500.0 | 4.0 | 2.5 |
| max | 2300000.0 | 4960.0 | 6.0 | 4.5 |
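
`Series.corr` computes the Pearson correlation coefficient by default. The sketch below reproduces that calculation by hand on a small, made-up housing sample, so the numbers are illustrative and unrelated to the dataset above.

```python
import pandas as pd

# Toy sample (hypothetical values)
toy = pd.DataFrame({
    "sqft_living": [1000, 1500, 2000, 2500, 3000],
    "price": [200_000, 310_000, 390_000, 520_000, 580_000],
})
x, y = toy["sqft_living"], toy["price"]

# Pearson r: sum of co-deviations divided by the root of the product of squared deviations
x_dev, y_dev = x - x.mean(), y - y.mean()
manual = (x_dev * y_dev).sum() / ((x_dev ** 2).sum() * (y_dev ** 2).sum()) ** 0.5

builtin = y.corr(x)  # Pearson by default
print(f"manual={manual:.4f}, builtin={builtin:.4f}")
```

A coefficient near 1 indicates a strong linear relationship. The 0.611 reported above for the filtered housing data indicates a moderate positive one.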
