# Day 2 – ML Pipeline & Data Preparation (California Housing, NumPy Scaling)

In this notebook we will:
1. Load the **California Housing** dataset.
2. Explore the features and target variable.
3. Split the data into an **80/20 Train–Test** split.
4. Perform a quick **Exploratory Data Analysis (EDA)**.
5. Apply **Min–Max Normalization** to [0,1] **using only NumPy**.


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

## Step 1 – Load the California Housing dataset
Each row describes a California census block group (often called a district),
and the target (`MedHouseVal`) is its median house value, in units of $100,000.


In [None]:
# Load the dataset as pandas objects (features as a DataFrame, target as a Series)
housing = fetch_california_housing(as_frame=True)
X = housing.data
y = housing.target

print("Shape of X:", X.shape)
print("Shape of y:", y.shape)


In [None]:
# Restrict the feature matrix to three columns for this exercise
cols = ['MedInc', 'AveRooms', 'HouseAge']

X = X[cols]
X

## Step 2 – Train / Test Split (80/20)
We'll keep **20%** of the data for testing and use the remaining
**80%** for training.


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
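
To make the 80/20 split concrete, the sketch below reproduces the idea with NumPy alone on toy data: shuffle the row indices with a fixed seed, then hold out 20% of them. The function name `simple_train_test_split` and the toy arrays are hypothetical. This only illustrates the concept; it is not how `train_test_split` is implemented, and it will not select the same rows.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.2, seed=42):
    """Hypothetical helper: shuffle rows, then hold out a fraction for testing."""
    X = np.asarray(X)
    y = np.asarray(y)
    n_samples = X.shape[0]
    n_test = int(np.ceil(n_samples * test_size))  # size of the held-out set

    rng = np.random.default_rng(seed)      # fixed seed makes the split reproducible
    indices = rng.permutation(n_samples)   # shuffled row indices

    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 10 rows, 3 features
X_toy = np.arange(30, dtype=float).reshape(10, 3)
y_toy = np.arange(10, dtype=float)

X_tr, X_te, y_tr, y_te = simple_train_test_split(X_toy, y_toy)
print("train rows:", X_tr.shape[0], "| test rows:", X_te.shape[0])
```

Fixing the seed plays the same role as `random_state=42` in the cell above: rerunning the cell gives the same partition, so later results stay comparable.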

## Step 3 – Quick Exploratory Data Analysis (EDA)
Before preprocessing, let's inspect some summary statistics
and check correlations of each feature with the target.


In [None]:
X_train.describe()
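
The correlation check mentioned in this step can be sketched with `DataFrame.corrwith`, which computes the Pearson correlation of each column with a Series. The toy values below are made up purely for illustration and are not taken from the dataset; `X_toy` and `y_toy` are hypothetical stand-ins for `X_train` and `y_train`.

```python
import pandas as pd

# Toy stand-ins for X_train and y_train (values are made up for illustration)
X_toy = pd.DataFrame({
    "MedInc":   [2.0, 3.5, 5.0, 6.5, 8.0],
    "HouseAge": [40.0, 15.0, 30.0, 10.0, 25.0],
})
y_toy = pd.Series([1.0, 1.8, 2.4, 3.5, 4.1], name="MedHouseVal")

# Pearson correlation of each feature column with the target,
# sorted so the most positively correlated feature comes first
corr_with_target = X_toy.corrwith(y_toy).sort_values(ascending=False)
print(corr_with_target)
```

In the notebook itself, the equivalent call would be `X_train.corrwith(y_train)`; the two objects share the same index after the split, so the rows align.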

In [None]:
# Visualize the strongest relationship: Median Income vs House Value
plt.figure(figsize=(6, 6))
plt.scatter(X_train["MedInc"], y_train, alpha=0.3)
plt.xlabel("Median Income (in $10,000s)")
plt.ylabel("Median House Value ($100,000s)")
plt.title("Median Income vs House Value")
plt.yticks([0, 2, 4, 6])
plt.show()


## Step 4 – Min–Max Normalization with NumPy
We scale each feature to the range **[0,1]** using:

\[
X_{scaled} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}
\]

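We compute the **min** and **max** of each feature from the **training set** only,
to avoid data leakage, then apply the same transformation to both the
train and test sets. As a result, test values that lie outside the training
range can scale to slightly below 0 or above 1.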


In [25]:
# Convert pandas DataFrame to NumPy arrays for manual scaling
X_train_np = X_train.to_numpy()
X_test_np  = X_test.to_numpy()

# Compute per-column min and max on training data
X_min = X_train_np.min(axis=0)
X_max = X_train_np.max(axis=0)

# Avoid division by zero if a column is constant
scale_range = np.where(X_max - X_min == 0, 1, X_max - X_min)

# Apply scaling
X_train_scaled = (X_train_np - X_min) / scale_range
X_test_scaled  = (X_test_np  - X_min) / scale_range
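
The same steps can be wrapped into two small functions so the fit-on-train, apply-to-both pattern is explicit. This is a minimal sketch on toy arrays; the helper names `min_max_fit` and `min_max_transform` are hypothetical and not part of NumPy.

```python
import numpy as np

def min_max_fit(X_train):
    """Hypothetical helper: learn per-column min and range from training data."""
    X_min = X_train.min(axis=0)
    X_range = X_train.max(axis=0) - X_min
    X_range = np.where(X_range == 0, 1, X_range)  # guard constant columns
    return X_min, X_range

def min_max_transform(X, X_min, X_range):
    """Apply the training-set statistics to any array with the same columns."""
    return (X - X_min) / X_range

# Toy data: the second column is constant in the training set
train = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
test = np.array([[2.0, 10.0], [7.0, 10.0]])

X_min, X_range = min_max_fit(train)
train_scaled = min_max_transform(train, X_min, X_range)
test_scaled = min_max_transform(test, X_min, X_range)

print(train_scaled)  # first column spans 0 to 1, constant column becomes 0
print(test_scaled)   # 7.0 exceeds the training max of 5.0, so it scales to 1.5
```

The test value of 1.5 is expected behavior rather than a bug: the scaler only guarantees the [0,1] range for the data it was fitted on.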


In [35]:
# Convert back to DataFrame with same columns & index
X_train_scaled_df = pd.DataFrame(X_train_scaled,
                                 columns=X_train.columns,
                                 index=X_train.index)

X_test_scaled_df = pd.DataFrame(X_test_scaled,
                                columns=X_test.columns,
                                index=X_test.index)

# Print a preview of the scaled training data
print("Scaled Training Data (first 5 rows):\n")
print(X_train_scaled_df.head())

# Optional: check the full shape
print("\nShape:", X_train_scaled_df.shape)


Scaled Training Data (first 5 rows):

         MedInc  AveRooms  HouseAge
14196  0.190322  0.029278  0.627451
8267   0.228452  0.025419  0.941176
17445  0.252162  0.033732  0.058824
14265  0.099488  0.022081  0.686275
2271   0.210638  0.038147  0.823529

Shape: (16512, 3)
