## Objective
The objective of this notebook is to prepare the California Housing dataset for machine learning by cleaning the data, handling missing values, creating meaningful features, and defining preprocessing strategies based on insights from the exploratory data analysis (EDA).

This notebook focuses on decision-making and justification rather than model performance.

## Data Loading and Inspection
**Tasks**
- Load the dataset.


In [2]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [3]:
df_housing = pd.read_csv("/home/trazeure/Notebooks/House_pricing/Data/housing.csv")

In [4]:
df_housing.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [5]:
df_housing.shape

(20640, 10)

In [6]:
df_housing.iloc[0]

longitude              -122.23
latitude                 37.88
housing_median_age        41.0
total_rooms              880.0
total_bedrooms           129.0
population               322.0
households               126.0
median_income           8.3252
median_house_value    452600.0
ocean_proximity       NEAR BAY
Name: 0, dtype: object

In [7]:
df_housing.iloc[-1]

longitude             -121.24
latitude                39.37
housing_median_age       16.0
total_rooms            2785.0
total_bedrooms          616.0
population             1387.0
households              530.0
median_income          2.3886
median_house_value    89400.0
ocean_proximity        INLAND
Name: 20639, dtype: object

## 1. EDA Recap
### Purpose
Summarize the key findings from the exploratory data analysis that directly impact feature engineering decisions.

### Key Observations
- The target variable (`median_house_value`) is right-skewed and capped at a maximum value (about $500,000 in this dataset), so the top of the range is censored.
- `total_bedrooms` contains missing values that must be handled.
- Raw count features (rooms, bedrooms, population) depend heavily on district size.
- `median_income` has a strong relationship with house values.
- Geographic features (latitude and longitude) reveal spatial patterns.
- Several numerical features are highly correlated, notably the count features (`total_rooms`, `total_bedrooms`, `population`, `households`).
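
Because raw counts scale with district size, one common way to act on that observation is to convert them into per-household and per-room ratios. The sketch below is illustrative only: the helper `add_ratio_features` and the three new column names are assumptions rather than part of the original notebook, and the toy rows are copied from the `head()` output above.

```python
import pandas as pd

# Toy rows copied from the first three rows of the head() output.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, 1106.0, 190.0],
    "population": [322.0, 2401.0, 496.0],
    "households": [126.0, 1138.0, 177.0],
})


def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with size-normalized ratios (hypothetical column names)."""
    out = df.copy()
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    out["bedrooms_per_room"] = out["total_bedrooms"] / out["total_rooms"]
    out["population_per_household"] = out["population"] / out["households"]
    return out


toy_ratios = add_ratio_features(toy)
print(toy_ratios.round(3))
```

Working on a copy keeps the raw counts available, which makes it easier to compare the ratios against the original columns before deciding which ones to keep.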

In [10]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

# Select only the numerical columns for the imputer
housing_num = df_housing.drop("ocean_proximity", axis=1)
imputer.fit(housing_num)

# Apply the transformation
X = imputer.transform(housing_num)
df_fixed = pd.DataFrame(X, columns=housing_num.columns, index=df_housing.index)

# Add the categorical column back
df_fixed["ocean_proximity"] = df_housing["ocean_proximity"]

print("--- Null check after imputation ---")
print(df_fixed.isnull().sum())
print("-" * 30)
print("Strategy summary:")
print("1. Nulls in total_bedrooms -> Filled with the median.")
print("2. Skewed distributions -> Ready for log transformation.")
print("3. Ratios -> Ready to be computed in the next cell.")

df_fixed.head()

--- Null check after imputation ---
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
------------------------------
Strategy summary:
1. Nulls in total_bedrooms -> Filled with the median.
2. Skewed distributions -> Ready for log transformation.
3. Ratios -> Ready to be computed in the next cell.


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


## 2. Handling Missing Values
### Purpose
Define strategies to handle missing data in a way that preserves information and minimizes bias.

### Tasks
- Identify features with missing values.
- Decide whether missing values should be imputed or treated as a separate category.
- Justify the chosen imputation strategy.
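
The first task, identifying which features have missing values, can be sketched as a small per-column report. The helper name `missing_report` and the toy DataFrame are assumptions made for illustration; in the notebook the same function would be called on `df_housing`.

```python
import numpy as np
import pandas as pd

# Toy data with one missing bedroom count (illustrative values only).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """List columns that contain missing values, with counts and percentages."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": 100 * counts / len(df),
    })
    # Keep only the columns that actually have gaps, worst first.
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)


report = missing_report(toy)
print(report)
```

A low percentage of missing values in a single numerical column is the situation where simple median imputation is usually easiest to justify.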

### Considerations
- Use simple statistical imputation where appropriate.
- Avoid introducing data leakage.
- Maintain consistency across train and test sets.
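
One way to honor the leakage and consistency points above is to split the data first, learn the medians on the training split only, and reuse the fitted imputer on the test split. This is a minimal sketch on synthetic data; the split ratio, the random seeds, and the toy column values are assumptions, not choices made in the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the numerical housing columns (assumed values).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "total_rooms": rng.integers(500, 5000, size=20).astype(float),
    "total_bedrooms": rng.integers(100, 1000, size=20).astype(float),
})
toy.loc[[3, 11], "total_bedrooms"] = np.nan

# Split before fitting anything, so test rows never influence the medians.
train, test = train_test_split(toy, test_size=0.25, random_state=42)

imputer = SimpleImputer(strategy="median")
train_imp = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)
# The test split is transformed with the medians learned from the training split.
test_imp = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)

print(dict(zip(train.columns, imputer.statistics_)))
```

The learned medians are stored in `imputer.statistics_`, so the same values are applied to any later data passed to `transform`.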

In [11]:
from sklearn.impute import SimpleImputer

# Separate numerical data
housing_num = df_housing.drop("ocean_proximity", axis=1)

# Configure and fit SimpleImputer
imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num)

# Transform and reconstruct DataFrame
X = imputer.transform(housing_num)
df_housing_imputed = pd.DataFrame(X, columns=housing_num.columns, index=df_housing.index)

# Add back categorical column
df_housing_imputed["ocean_proximity"] = df_housing["ocean_proximity"]

# Verification
print(f"Missing values after imputation: {df_housing_imputed.isnull().sum().sum()}")

Missing values after imputation: 0


## 3. Feature Scaling and Transformation
### Purpose
Ensure that numerical features are in suitable formats and scales for machine learning algorithms.

### Tasks
- Identify features with large value ranges.
- Identify skewed distributions that may benefit from transformation.
- Decide which features require scaling.
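
To identify skewed distributions, one simple check is to compare the sample skewness of each feature before and after a log transform. The sketch below uses synthetic columns that only borrow names from the dataset; the distributions and seed are assumptions chosen to mimic a long-tailed count and a roughly symmetric feature.

```python
import numpy as np
import pandas as pd

# Synthetic columns: a long-tailed count and a roughly uniform feature.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "population": rng.lognormal(mean=7.0, sigma=0.8, size=1000),
    "housing_median_age": rng.uniform(1, 52, size=1000),
})

# Sample skewness per column; values far above 0 indicate a long right tail.
skew_before = toy.skew()

# log1p is defined at zero, which makes it a safe choice for count data.
skew_after = np.log1p(toy["population"]).skew()

print(skew_before.round(2))
print(round(skew_after, 2))
```

Features whose skewness drops toward zero after `log1p` are reasonable candidates for the transformation; features that are already close to symmetric can be left as they are.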

### Considerations
- Some models are sensitive to feature scale.
- Transformations should improve interpretability or model stability.
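
Since some models are sensitive to feature scale, the log transform and the scaling step can be bundled into a single preprocessing object that is fitted on training data only. The following is a sketch under stated assumptions: the column grouping, the step names, and the synthetic data are illustrative and are not taken from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic stand-ins for one skewed and one non-skewed feature.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "population": rng.lognormal(7.0, 0.8, size=40),
    "median_income": rng.uniform(1.0, 10.0, size=40),
})
train, test = train_test_split(toy, test_size=0.25, random_state=1)

# Skewed columns: log1p first, then standardize.
log_then_scale = Pipeline([
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("skewed", log_then_scale, ["population"]),
    ("plain", StandardScaler(), ["median_income"]),
])

# Means and standard deviations are learned from the training split only.
train_scaled = preprocess.fit_transform(train)
test_scaled = preprocess.transform(test)

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

Keeping the steps in one object means the same transformations, with the same fitted parameters, are applied to the test split and to any future data.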

In [12]:
import numpy as np
from sklearn.preprocessing import StandardScaler

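# 1. Log transformation for skewed features
# log1p compresses long right tails and stabilizes variance.
# Note: the columns are overwritten in place, so re-running this cell applies
# log1p a second time; re-run the imputation cell first to reset the data.
skewed_features = ["total_rooms", "total_bedrooms", "population", "households"]
df_housing_imputed[skewed_features] = np.log1p(df_housing_imputed[skewed_features])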

# 2. Standardization (Z-score scaling)
# We scale all numerical features to have mean=0 and std=1
scaler = StandardScaler()

# Select numerical columns excluding the target
num_cols = df_housing_imputed.select_dtypes(include=[np.number]).columns
num_cols = num_cols.drop("median_house_value")

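# Fit and transform (fitted on the full dataset here; once a train/test split
# exists, fit on the training split only to avoid leakage)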
df_housing_imputed[num_cols] = scaler.fit_transform(df_housing_imputed[num_cols])

# Verification
print("Scaled Numerical Features (First 5 rows):")
print(df_housing_imputed[num_cols].head())

Scaled Numerical Features (First 5 rows):
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0  -1.327835  1.052548            0.982143    -1.131133       -1.642192   
1  -1.322844  1.043185           -0.607019     1.651357        1.320043   
2  -1.332827  1.038503            1.856182    -0.450310       -1.110094   
3  -1.337818  1.038503            1.856182    -0.638257       -0.817506   
4  -1.337818  1.038503            1.856182    -0.312370       -0.576140   

   population  households  median_income  
0   -1.694943   -1.569395       2.344766  
1    1.030337    1.449251       2.332238  
2   -1.109604   -1.104849       1.782699  
3   -0.949925   -0.813343       0.932968  
4   -0.933021   -0.583469      -0.012881  
