<a href="https://colab.research.google.com/github/Joseph89155/mlops-housing-price-prediction/blob/main/MLOPs_housing_price_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🏠 MLOps: Housing Price Prediction using the California Housing Dataset

## 📌 Project Overview

This project demonstrates how to build an end-to-end machine learning pipeline using the **California Housing dataset** to predict median house values. The goal is to apply core MLOps concepts, including data preprocessing, model training, hyperparameter tuning, cross-validation, pipeline construction, and model serialization, all within a structured and reproducible workflow.

We'll train a **K-Nearest Neighbors Regressor (KNN)** to predict housing prices and optimize its performance using `GridSearchCV`. The final trained model will be saved and optionally deployed via a lightweight API for real-time inference.

## 🎯 Objectives

- Load and explore the California Housing dataset  
- Apply preprocessing using `ColumnTransformer`
- Split the data into training and test sets (80/20)
- Use R² Score as the main evaluation metric  
- Train a `KNeighborsRegressor`  
- Apply hyperparameter tuning with `GridSearchCV`
- Use 5-fold cross-validation  
- Save the entire pipeline using `pickle` or `joblib`
- *(Optional)* Build an API using Flask or FastAPI for deployment

## 🌍 Real-World Relevance

Accurate housing price prediction is critical for a variety of real-life scenarios:
- **Real estate platforms** use these models to estimate listing prices.
- **Financial institutions** rely on such models for mortgage lending risk assessments.
- **Urban planning departments** analyze price trends to guide zoning, development, and resource allocation.
- **Investors** and **home buyers** use such insights to identify undervalued or overvalued properties.

This project not only simulates such real-world predictive systems, but also practices **MLOps principles** that ensure scalability, reproducibility, and deployment-readiness of machine learning solutions.

---

# 1: Load the California Housing Dataset

In [1]:
# Import necessary libraries
from sklearn.datasets import fetch_california_housing
import pandas as pd

In [2]:
# Load the dataset
housing = fetch_california_housing(as_frame=True)

In [3]:
# Extract features and target into a DataFrame
df = housing.frame

In [4]:
# Display the first few rows
df.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [5]:
# Display the last few rows
df.tail()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.17192,741.0,2.123209,39.43,-121.32,0.847
20639,2.3886,16.0,5.254717,1.162264,1387.0,2.616981,39.37,-121.24,0.894


# 2: Preprocessing with ColumnTransformer
The California Housing dataset has only numerical features, so for this project we'll:

 - Use StandardScaler to standardize all numerical features

 - Set up a ColumnTransformer to keep it modular and extendable

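Wrapping the scaler in a ColumnTransformer is not strictly required when every column is numeric, but it makes it straightforward to add other transformers later (for example, an encoder for categorical columns) without restructuring the pipeline.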


### Preprocessing Pipeline

In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# Get feature names (all except target)
feature_columns = housing.feature_names

In [7]:
# Define the ColumnTransformer for numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), feature_columns)
    ]
)

**Explanation:**
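 - StandardScaler rescales each feature to mean 0 and standard deviation 1, using statistics computed on the data it is fitted on. This is crucial for distance-based models like KNN, where features on larger scales would otherwise dominate the distance.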

 - Using ColumnTransformer keeps your pipeline modular and ready for mixed data types (e.g., categorical features in future datasets).
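
To make the scaling argument concrete, here is a minimal pure-Python sketch of standardization and its effect on distances. The four rows are illustrative values only, and `standardize` and `euclidean` are hypothetical helpers written for this example, not scikit-learn functions. The sketch uses the population standard deviation, which is what StandardScaler uses as well.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Two features on very different scales (illustrative values only)
med_inc = [8.3, 7.3, 5.6, 3.8]
population = [322.0, 2401.0, 496.0, 558.0]

raw_rows = list(zip(med_inc, population))
scaled_rows = list(zip(standardize(med_inc), standardize(population)))

# Before scaling, the distance is almost entirely the Population gap;
# after scaling, both features contribute on a comparable scale.
raw_distance = euclidean(raw_rows[0], raw_rows[1])
scaled_distance = euclidean(scaled_rows[0], scaled_rows[1])
print(f"raw: {raw_distance:.2f}, scaled: {scaled_distance:.2f}")
```

Without scaling, the nearest neighbors of a district would be decided almost exclusively by `Population`, regardless of how similar the other features are.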

# 3: Train/Test Split (80/20)
I’ll:

 - Separate the features X and target y

 - Perform an 80/20 split using train_test_split

 - Set a random seed for reproducibility

In [8]:
from sklearn.model_selection import train_test_split

# Features and target
X = df[feature_columns]
y = df['MedHouseVal']

In [9]:
# Train/test split (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [10]:
# Confirm the split
print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")

Training samples: 16512
Test samples: 4128


**Explanation:**
 - test_size=0.2 → 20% of the data is held out for testing.

 - random_state=42 ensures the split is reproducible across runs.

 - No stratification needed here since the target is continuous (regression problem).
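
The effect of a fixed seed can be illustrated with a simplified pure-Python stand-in. This is not how `train_test_split` is implemented internally; `seeded_split` is a hypothetical helper that only mimics the idea of shuffling row indices with a seeded generator and cutting off the test portion.

```python
import random

def seeded_split(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then cut off the test portion."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

# Same seed twice, then a different seed
train_a, test_a = seeded_split(20640)
train_b, test_b = seeded_split(20640)
train_c, test_c = seeded_split(20640, seed=0)

print(len(train_a), len(test_a))  # 16512 4128
print(test_a == test_b)           # same seed, same held-out rows
print(test_a == test_c)           # different seed, different held-out rows
```

Because the held-out rows are identical across runs with the same seed, any change in the test score can be attributed to the model rather than to a different split.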

# 4: Define the Evaluation Metric — R² Score
Since this is a regression task, I’ll use:

 - R² Score (coefficient of determination):
Measures how well the model explains the variance in the target variable.

 - R² = 1 → perfect prediction

 - R² = 0 → model predicts no better than the mean

 - R² < 0 → model is worse than simply predicting the mean

In [11]:
from sklearn.metrics import r2_score

# Example usage:
# After model.predict(X_test), use:
# r2_score(y_test, y_pred)
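
The three reference points for R² can be checked by hand. The sketch below computes R² directly from its definition, 1 - SS_res / SS_tot, on a tiny made-up target vector; `r2_manual` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def r2_manual(y_true, y_pred):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]  # mean is 2.5

perfect = r2_manual(y_true, y_true)                  # 1.0
mean_only = r2_manual(y_true, [2.5, 2.5, 2.5, 2.5])  # 0.0
worse = r2_manual(y_true, [4.0, 3.0, 2.0, 1.0])      # -3.0

print(perfect, mean_only, worse)
```

The last case shows why R² has no lower bound: predictions that are systematically wrong can produce a residual error much larger than the variance of the target.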


# 5: Build Pipeline with KNeighborsRegressor
I'll combine:

 - The preprocessor (StandardScaler inside ColumnTransformer)

 - The regressor: KNeighborsRegressor

This makes the workflow modular and production-ready: everything from preprocessing to prediction happens in one object.

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsRegressor

# Define the pipeline
knn_pipeline = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('regressor', KNeighborsRegressor())
])
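
One practical consequence of naming the steps is that each step's parameters become addressable as `<step name>__<parameter>`. The small sketch below uses a simplified pipeline (a bare StandardScaler instead of the ColumnTransformer, and the hypothetical name `demo_pipeline`) to show the naming scheme.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simplified stand-in for the pipeline above
demo_pipeline = Pipeline(steps=[
    ('preprocessing', StandardScaler()),
    ('regressor', KNeighborsRegressor()),
])

# Parameters of a step are reachable as "<step name>__<parameter>"
demo_pipeline.set_params(regressor__n_neighbors=7)
n_neighbors = demo_pipeline.named_steps['regressor'].n_neighbors
print(n_neighbors)

# The same prefixed names appear in get_params()
param_names = demo_pipeline.get_params()
print('regressor__weights' in param_names)
```

This is the reason hyperparameter grids for a pipeline use keys prefixed with the step name.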

# 6: Apply GridSearchCV for Hyperparameter Tuning
I'll tune the following hyperparameters of KNeighborsRegressor:

 - n_neighbors: [3, 5, 7, 9]

 - weights: ['uniform', 'distance']

 - p: [1, 2] → for different Minkowski distances (1 = Manhattan, 2 = Euclidean)

I'll also use:

 - GridSearchCV to exhaustively search over these combinations

 - 5-fold cross-validation, so each combination is scored on five different validation folds rather than a single split
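
To see what the `p` parameter changes, here is a minimal pure-Python sketch of the Minkowski distance for `p = 1` and `p = 2`. The helper `minkowski` and the two points are made up for illustration.

```python
def minkowski(a, b, p):
    """Minkowski distance: (sum of |x - y|^p) ** (1 / p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

point_a = (0.0, 0.0)
point_b = (3.0, 4.0)

manhattan = minkowski(point_a, point_b, p=1)  # 7.0: sum of absolute differences
euclid = minkowski(point_a, point_b, p=2)     # 5.0: straight-line distance

print(manhattan, euclid)
```

Since the two metrics can rank neighbors differently, `p` is worth including in the search rather than fixing in advance.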



In [13]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid
param_grid = {
    'regressor__n_neighbors': [3, 5, 7, 9],
    'regressor__weights': ['uniform', 'distance'],
    'regressor__p': [1, 2]
}

In [14]:
# Define GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=knn_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1,
    verbose=1
)

In [15]:
# Fit to training data
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


**What This Does:**
 - Automatically performs 5-fold cross-validation for each hyperparameter combination.

 - Uses R² score as the metric to select the best model.

 - Leverages parallel processing (n_jobs=-1) for efficiency.
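
The "16 candidates, 80 fits" figures in the log follow directly from the grid. As a rough sketch, the combinations can be enumerated with `itertools.product`; this only reproduces the counting and is not how GridSearchCV is implemented.

```python
from itertools import product

param_grid = {
    'regressor__n_neighbors': [3, 5, 7, 9],
    'regressor__weights': ['uniform', 'distance'],
    'regressor__p': [1, 2],
}

# Every combination of one value per hyperparameter
names = list(param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

n_folds = 5
total_fits = len(candidates) * n_folds

print(len(candidates))  # 16 = 4 * 2 * 2
print(total_fits)       # 80 = 16 candidates * 5 folds
```

With the default `refit=True`, GridSearchCV additionally fits the best combination once more on the full training set, which is the model used by `grid_search.predict`.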

# 7: Evaluate the Best Model
I’ll:

 - View the best hyperparameters selected by GridSearchCV

 - Make predictions on the test set

 - Calculate and print the R² score

 - Optionally, print the full best pipeline (preprocessing step and tuned regressor)

In [16]:
from sklearn.metrics import r2_score

# Best model and parameters
print("🔍 Best Hyperparameters:")
print(grid_search.best_params_)

🔍 Best Hyperparameters:
{'regressor__n_neighbors': 9, 'regressor__p': 1, 'regressor__weights': 'distance'}


In [17]:
# Predict on the test set using the best model
y_pred = grid_search.predict(X_test)

In [18]:
# Evaluate performance
r2 = r2_score(y_test, y_pred)
print(f"\n📈 R² Score on Test Set: {r2:.4f}")


📈 R² Score on Test Set: 0.7221


In [19]:
# View full best pipeline
print(grid_search.best_estimator_)

Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  ['MedInc', 'HouseAge',
                                                   'AveRooms', 'AveBedrms',
                                                   'Population', 'AveOccup',
                                                   'Latitude',
                                                   'Longitude'])])),
                ('regressor',
                 KNeighborsRegressor(n_neighbors=9, p=1, weights='distance'))])


# 8: Save the Trained Pipeline using joblib

In [20]:
import joblib

# Save the best pipeline (model + preprocessing)
joblib.dump(grid_search.best_estimator_, 'knn_housing_model.pkl')

print("✅ Model saved as 'knn_housing_model.pkl'")


✅ Model saved as 'knn_housing_model.pkl'


## Reload the Model Later

In [None]:
# Load the saved pipeline (preprocessing + model) from disk
loaded_model = joblib.load('knn_housing_model.pkl')

# Predict on the first five test rows using the loaded pipeline
loaded_model.predict(X_test[:5])
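
A quick way to gain confidence in the saved file is a round-trip check: dump, load, and compare predictions. The sketch below does this on small synthetic data with a simplified pipeline so it runs on its own. The names `demo_model`, `restored`, and `demo_model.pkl` are hypothetical and do not refer to the notebook's model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5])

demo_model = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('regressor', KNeighborsRegressor(n_neighbors=3)),
]).fit(X_demo, y_demo)

# Save to a temporary directory, then load the pipeline back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.pkl')
    joblib.dump(demo_model, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions
same_predictions = np.allclose(
    demo_model.predict(X_demo[:5]),
    restored.predict(X_demo[:5]),
)
print(same_predictions)
```

Files written by `joblib` are pickle-based, so they should only be loaded from trusted sources, ideally with the same scikit-learn version that was used to save them.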