# Data Preprocessing

This notebook cleans the raw California housing data, removes outliers, and engineers new features using the `HousingDataProcessor` class.

## Step 1: Import Libraries and Load Data

This cell imports the required libraries and loads the raw data saved by the previous notebook. It also reloads the `data_processing` module so that edits made to it since the kernel started are picked up.

In [7]:
import pandas as pd
import sys
import os
import importlib

# Add src to path
sys.path.append('../src')

# Reload the module so recent edits take effect without restarting the kernel
import data_processing
importlib.reload(data_processing)
from data_processing import HousingDataProcessor

# Load raw data
df_raw = pd.read_csv('../data/raw/california_housing_raw.csv')
print(f"Initial dataset shape: {df_raw.shape}")

Initial dataset shape: (20640, 9)


## Step 2: Process the Data

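This cell initializes the `HousingDataProcessor` and runs the complete data processing pipeline, which:
- Imputes missing values with the column median
- Removes duplicate rows
- Removes outliers using the IQR method (with `outlier_multiplier=3.0`, wider than the conventional 1.5, so only more extreme values are dropped)
- Creates engineered features (`RoomsPerBedroom`, plus `PopulationPerRoom`, `OccupancyRate`, `IncomePerPerson`, and `DistanceFromCenter` when `create_additional_features=True`)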
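The source of `HousingDataProcessor` is not shown in this notebook, so the following is only a minimal sketch of what the first three steps might look like in plain pandas. The function names `remove_iqr_outliers` and `basic_clean` and the toy values are hypothetical, and the real class may differ (for example, in which columns it checks for outliers).

```python
import pandas as pd


def remove_iqr_outliers(df: pd.DataFrame, multiplier: float = 3.0) -> pd.DataFrame:
    """Drop rows where any numeric column falls outside its IQR fences."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    # Keep a row only if every numeric value lies within its column's fences.
    within = ((numeric >= lower) & (numeric <= upper)).all(axis=1)
    return df[within]


def basic_clean(df: pd.DataFrame, multiplier: float = 3.0) -> pd.DataFrame:
    """Hypothetical stand-in for the cleaning steps: impute, dedupe, trim."""
    df = df.fillna(df.median(numeric_only=True))  # median imputation
    df = df.drop_duplicates()                     # duplicate removal
    return remove_iqr_outliers(df, multiplier)    # IQR-based outlier removal


# Toy data (made-up values, not taken from the housing dataset).
toy = pd.DataFrame({
    "x": [1.0, 2.0, 2.0, 3.0, 4.0, None, 100.0],
    "y": [10.0, 11.0, 11.0, 12.0, 13.0, 12.0, 14.0],
})
cleaned = basic_clean(toy, multiplier=3.0)
print(cleaned)
```

A larger multiplier widens the fences `[Q1 - k*IQR, Q3 + k*IQR]`, so fewer rows are treated as outliers. In this toy frame the missing `x` is filled with the median, one duplicate row is dropped, and the row with `x = 100` is removed.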

In [8]:
# Initialize processor
processor = HousingDataProcessor(df_raw)

# Run the complete processing pipeline, including the additional engineered features
# outlier_multiplier=3.0 widens the IQR fences, so fewer rows are removed than with the conventional 1.5
df_processed = processor.process(create_additional_features=True, outlier_multiplier=3.0)

print(f"\nProcessed dataset shape: {df_processed.shape}")
print("\nNew columns after feature engineering:")
print(df_processed.columns.tolist())

Removed 1379 outlier rows using IQR method (multiplier=3.0)
Created RoomsPerBedroom feature
Created PopulationPerRoom feature
Created OccupancyRate feature
Created IncomePerPerson feature
Created DistanceFromCenter feature

Processed dataset shape: (19261, 14)

New columns after feature engineering:
['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude', 'MedHouseVal', 'RoomsPerBedroom', 'PopulationPerRoom', 'OccupancyRate', 'IncomePerPerson', 'DistanceFromCenter']
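As a quick sanity check, the shapes printed above can be reconciled with the log messages. The snippet below hard-codes the numbers from the cell outputs so it runs on its own; inside the notebook, `df_raw.shape` and `df_processed.shape` could be used instead.

```python
# Shapes copied from the cell outputs in this notebook.
raw_rows, raw_cols = 20640, 9
processed_rows, processed_cols = 19261, 14

rows_removed = raw_rows - processed_rows
fraction_removed = rows_removed / raw_rows
columns_added = processed_cols - raw_cols

print(f"Rows removed: {rows_removed} ({fraction_removed:.1%})")
print(f"Columns added: {columns_added}")
```

The 1379 removed rows match the outlier message in the log and amount to about 6.7% of the raw data. The 5 added columns match the five "Created ... feature" messages.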


## Step 3: Display Processed Data Summary

This cell displays descriptive statistics for the processed dataset, showing the distribution of all features after cleaning and feature engineering.

In [9]:
# Display summary of processed data
print("Processed Data Summary:")
print(df_processed.describe())

Processed Data Summary:
             MedInc      HouseAge      AveRooms     AveBedrms    Population  \
count  19261.000000  19261.000000  19261.000000  19261.000000  19261.000000   
mean       3.826560     29.177405      5.215899      1.052809   1346.901978   
std        1.730536     12.400170      1.160431      0.079128    783.695551   
min        0.499900      1.000000      0.846154      0.730769      5.000000   
25%        2.571400     19.000000      4.413567      1.003953    804.000000   
50%        3.541700     30.000000      5.171429      1.045802   1171.000000   
75%        4.735600     37.000000      5.949886      1.093023   1697.000000   
max       11.246300     52.000000     10.581522      1.372951   4579.000000   

           AveOccup      Latitude     Longitude   MedHouseVal  \
count  19261.000000  19261.000000  19261.000000  19261.000000   
mean       2.908011     35.612217   -119.589456      2.069494   
std        0.713077      2.117110      1.990225      1.135690   
min 

## Step 4: Save Processed Data

This cell saves the processed dataset to the `data/processed/` directory, creating the directory if it does not already exist. Subsequent notebooks use this file for visualization and model training.

In [10]:
# Save processed data
os.makedirs('../data/processed', exist_ok=True)
df_processed.to_csv('../data/processed/california_housing_processed.csv', index=False)
print("Processed data saved to ../data/processed/california_housing_processed.csv")

Processed data saved to ../data/processed/california_housing_processed.csv
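One way to confirm the save step worked is to read the file back and compare it with the in-memory frame. The sketch below shows that round trip on a small toy frame written to a temporary directory. The file name `toy_processed.csv` and the values are hypothetical; in the notebook the same comparison could be run on `df_processed` and the saved CSV.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for df_processed (made-up values).
toy = pd.DataFrame({"MedInc": [3.5, 4.25], "HouseAge": [30.0, 19.0]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "processed")
    os.makedirs(out_dir, exist_ok=True)  # no error if the directory exists
    path = os.path.join(out_dir, "toy_processed.csv")

    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the CSV is read back.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises AssertionError if the reloaded frame differs from the original.
pd.testing.assert_frame_equal(toy, reloaded)
print(f"Round trip OK: {reloaded.shape}")
```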
