## Objective
The objective of this notebook is to prepare the California Housing dataset for machine learning by cleaning the data, handling missing values, creating meaningful features, and defining preprocessing strategies based on insights from the exploratory data analysis (EDA).

This notebook focuses on decision-making and justification rather than model performance.

## Data Loading and Inspection
**Tasks**
- Load the dataset.
- Inspect its shape and its first and last rows.


In [12]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [6]:
df_housing = pd.read_csv("/home/trazeure/Notebooks/House_pricing/Data/housing.csv")

In [7]:
df_housing.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [9]:
df_housing.shape

(20640, 10)

In [10]:
df_housing.iloc[0]

longitude              -122.23
latitude                 37.88
housing_median_age        41.0
total_rooms              880.0
total_bedrooms           129.0
population               322.0
households               126.0
median_income           8.3252
median_house_value    452600.0
ocean_proximity       NEAR BAY
Name: 0, dtype: object

In [11]:
df_housing.iloc[-1]

longitude             -121.24
latitude                39.37
housing_median_age       16.0
total_rooms            2785.0
total_bedrooms          616.0
population             1387.0
households              530.0
median_income          2.3886
median_house_value    89400.0
ocean_proximity        INLAND
Name: 20639, dtype: object

## 1️⃣ EDA Recap
### Purpose
Summarize the key findings from the exploratory data analysis that directly impact feature engineering decisions.

### Key Observations
- The target variable (`median_house_value`) shows a right-skewed distribution and is capped at a maximum value.
- `total_bedrooms` contains missing values that must be handled.
- Raw count features (`total_rooms`, `total_bedrooms`, `population`) scale with district size, which makes them hard to compare across districts.
- `median_income` has a strong relationship with house values.
- Geographic features (latitude and longitude) reveal spatial patterns.
- Several numerical features are highly correlated.
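
The observations above can be re-checked numerically before any feature engineering decision is made. The following is a minimal sketch, not the notebook's actual EDA code: `summarize_observations` is a hypothetical helper, and the `toy` frame holds made-up values so the block runs on its own. It assumes the column names shown in the dataset preview.

```python
import numpy as np
import pandas as pd


def summarize_observations(df: pd.DataFrame, target: str = "median_house_value") -> dict:
    """Collect quick numeric checks for the EDA observations (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    max_value = df[target].max()
    return {
        # A positive value indicates a right-skewed target.
        "target_skew": float(df[target].skew()),
        # A large share of rows sitting exactly at the maximum suggests a cap.
        "share_at_max": float((df[target] == max_value).mean()),
        # Missing values per column, keeping only columns that have any.
        "missing": {c: int(n) for c, n in df.isna().sum().items() if n > 0},
        # Pearson correlation of each numeric feature with the target.
        "corr_with_target": numeric.corr()[target].drop(target).to_dict(),
        # Correlation between two count features, as a proxy for district size.
        "rooms_households_corr": float(df["total_rooms"].corr(df["households"])),
    }


# Toy data with made-up values, used only so the sketch is self-contained.
toy = pd.DataFrame(
    {
        "total_rooms": [880.0, 7099.0, 1467.0, 1274.0, 1627.0, 2785.0],
        "total_bedrooms": [129.0, 1106.0, np.nan, 235.0, 280.0, 616.0],
        "households": [126.0, 1138.0, 177.0, 219.0, 259.0, 530.0],
        "median_income": [8.3, 8.3, 7.3, 5.6, 3.8, 2.4],
        "median_house_value": [500.0, 500.0, 350.0, 340.0, 120.0, 90.0],
    }
)
summary = summarize_observations(toy)
print(summary["missing"])
```

On the real data the same call would be `summarize_observations(df_housing)`. Returning a plain dictionary keeps each check next to the observation it supports, which suits a notebook focused on justifying decisions.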