# 🧽 Data Cleaning — Ames Housing Dataset
This notebook prepares the dataset for modeling: it removes duplicate rows, handles missing values, log-transforms the target, and saves a cleaned file for feature engineering.


## 📥 Step 1: Reload the Raw Dataset

We begin by reloading the raw dataset so that cleaning starts from an unmodified copy.


In [1]:
import pandas as pd

df = pd.read_csv("../data/house_prices_records.csv")
df.head()

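|  | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | RFn | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | NaN | 460 | RFn | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | NaN | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | RFn | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | NaN | 2001 | 2002 | 223500 |
| 3 | 961 | NaN | NaN | No | 216 | ALQ | 540 | NaN | 642 | Unf | ... | 60.0 | 0.0 | 35 | 5 | 7 | 756 | NaN | 1915 | 1970 | 140000 |
| 4 | 1145 | NaN | 4.0 | Av | 655 | GLQ | 490 | 0.0 | 836 | RFn | ... | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | NaN | 2000 | 2000 | 250000 |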


## 📋 Step 2: Remove Duplicate Rows

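Duplicate rows give repeated observations extra weight during training and can end up on both sides of a train/test split, so we remove exact duplicates and keep the first occurrence of each.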


In [2]:
df.drop_duplicates(inplace=True)
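
Before dropping rows, it can help to record how many duplicates exist so the change in row count is visible. The following is a minimal sketch on a toy frame; the `toy` DataFrame and its values are made up for illustration and are not taken from the housing data.

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are identical.
toy = pd.DataFrame({
    "GarageArea": [548, 460, 460, 608],
    "SalePrice": [208500, 181500, 181500, 223500],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row;
# reset_index(drop=True) renumbers the remaining rows from 0.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"duplicates found: {n_duplicates}")
print(f"rows before: {len(toy)}, rows after: {len(deduped)}")
```

The same two calls can be run on `df` before the cell above to see whether the dataset contains any exact duplicates at all.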

## 🧼 Step 3: Handle Missing Values

- Drop columns where over 30% of the data is missing
- Fill missing numeric values with the column median
- Fill missing categorical values with the mode


In [3]:
# Drop columns with more than 30% missing values
threshold = 0.3
missing_fraction = df.isnull().mean()
cols_to_drop = missing_fraction[missing_fraction > threshold].index
df.drop(columns=cols_to_drop, inplace=True)

# Fill numeric columns with the median
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Fill categorical columns with the mode
cat_cols = df.select_dtypes(include="object").columns
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])
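
To see what the three rules do, here is a small self-contained sketch that applies them to a toy frame and then checks that nothing is left missing. The `toy` data and the column `mostly_missing` are hypothetical. One deliberate difference from the cell above: categorical columns are filled one at a time, which avoids an `IndexError` if there are no categorical columns or a column has no mode.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the housing file.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, 1.0, np.nan],  # 75% missing, so dropped
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],        # numeric, filled with median
    "GarageFinish": ["RFn", "RFn", None, "Unf"],      # categorical, filled with mode
})

# Rule 1: keep only columns whose missing fraction is at most the threshold.
threshold = 0.3
missing_fraction = toy.isnull().mean()
kept = toy.loc[:, missing_fraction <= threshold].copy()

# Rule 2: fill numeric columns with their own median.
num_cols = kept.select_dtypes(include="number").columns
kept[num_cols] = kept[num_cols].fillna(kept[num_cols].median())

# Rule 3: fill each non-numeric column with its most frequent value.
cat_cols = kept.select_dtypes(exclude="number").columns
for col in cat_cols:
    modes = kept[col].mode()  # mode() ignores missing values by default
    if not modes.empty:
        kept[col] = kept[col].fillna(modes.iloc[0])

# Sanity check: count the missing cells that remain.
remaining = int(kept.isnull().sum().sum())
print(f"missing cells remaining: {remaining}")
```

Running `df.isnull().sum().sum()` after the cleaning cell gives the same kind of check on the real data.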

## 🔄 Step 4: Transform Skewed Target (Optional)

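House prices are usually right-skewed: a small number of expensive homes stretch the upper tail. Taking the natural log of `SalePrice` compresses that tail, which tends to help models that are sensitive to large target values. The original column is kept alongside the new `SalePrice_log` column, and predictions made on the log scale can be converted back with `np.exp`.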


In [4]:
import numpy as np

df["SalePrice_log"] = np.log(df["SalePrice"])
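
Whether the transform is worth applying can be checked by comparing skewness before and after. A minimal sketch, assuming a hypothetical series of prices (the values below are invented, not drawn from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices: the gaps widen toward the top end.
prices = pd.Series([50000, 70000, 100000, 140000, 200000])

# Series.skew() returns sample skewness; values near 0 mean roughly symmetric.
skew_raw = prices.skew()
skew_log = np.log(prices).skew()
print(f"skew before: {skew_raw:.2f}, after log: {skew_log:.2f}")

# The transform is reversible: np.exp maps log-scale values back to prices.
recovered = np.exp(np.log(prices))
```

On the real data the equivalent comparison would be `df["SalePrice"].skew()` against `df["SalePrice_log"].skew()`.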

## 💾 Step 5: Save the Cleaned Dataset

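We export the cleaned dataset to a new CSV file in the `data/` folder. Passing `index=False` keeps the row index out of the file, so no extra unnamed column appears when it is read back. The file contains both `SalePrice` and `SalePrice_log`; whichever one is used as the target, the other should not be used as a feature.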


In [5]:
df.to_csv("../data/house_prices_cleaned.csv", index=False)
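
A quick round-trip check can confirm that the saved file reloads as expected. The sketch below writes to a temporary directory so it runs on its own; the `cleaned` frame is a hypothetical stand-in for `df`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the cleaned frame.
cleaned = pd.DataFrame({
    "GarageFinish": ["RFn", "Unf"],
    "SalePrice": [208500, 140000],
})

# Write the frame without its index, then read it straight back.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "house_prices_cleaned.csv"
    cleaned.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Checks: same dimensions, no missing values, no stray index column.
same_shape = reloaded.shape == cleaned.shape
no_missing = not reloaded.isnull().any().any()
no_index_column = "Unnamed: 0" not in reloaded.columns
print(same_shape, no_missing, no_index_column)
```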