<a href="https://colab.research.google.com/github/MarianBolous/AceGPT-v2/blob/main/california_rf_local.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# California Housing Regression with scikit‑learn & MLflow

This notebook walks through an **end‑to‑end machine‑learning workflow**:
1. Load the California Housing dataset
2. Train & tune a `RandomForestRegressor`
3. Track experiments, parameters, metrics, and artifacts in **MLflow**
4. Register the best model and (optionally) serve it locally

Works out‑of‑the‑box in a **local JupyterLab / classic Jupyter** setup.


## 0  Environment setup  
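Run the next cell if the required libraries aren’t installed yet; skip it otherwise. The pin on scikit-learn 1.4.2 satisfies `root_mean_squared_error`, imported below, which requires version 1.4 or later.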

In [1]:
# !pip install --upgrade pip
!pip install scikit-learn==1.4.2 mlflow pandas numpy matplotlib seaborn



## 1  Imports & experiment setup

In [None]:
# MLflow: experiment tracking, model logging, and signature inference
import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature

# pandas: used for the feature-importance table
import pandas as pd

# scikit-learn: data, preprocessing, model, tuning, and metrics
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, root_mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

EXPERIMENT_NAME = 'DS-Method-California-Housing'
mlflow.set_experiment(EXPERIMENT_NAME)


<Experiment: artifact_location='file:///Users/faisal/Dev/elvtr/mlruns/855472902766185579', creation_time=1748170976450, experiment_id='855472902766185579', last_update_time=1748170976450, lifecycle_stage='active', name='DS-Method-California-Housing', tags={}>

## 2  Load data & define pipeline

In [None]:
raw = fetch_california_housing(as_frame=True)
X_full, y_full = raw.data, raw.target
X_full.describe().to_csv('feature_summary.csv')

X_train, X_test, y_train, y_test = train_test_split(
    X_full, y_full, test_size=0.20, random_state=42)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('rf', RandomForestRegressor(random_state=42))
])

param_grid = {
    'rf__n_estimators': [120, 240],
    'rf__max_depth': [None, 15],
    'rf__min_samples_split': [2, 4]
}
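
The grid above determines how much work the search in the next section does. As a rough sketch using only the standard library, the combinations can be enumerated by hand; `expand_grid` and `CV_FOLDS` are illustrative names introduced here, not part of scikit-learn.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    'rf__n_estimators': [120, 240],
    'rf__max_depth': [None, 15],
    'rf__min_samples_split': [2, 4],
}
CV_FOLDS = 3  # matches cv=3 passed to GridSearchCV in the next section


def expand_grid(grid):
    """Return every parameter combination in the grid as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


candidates = expand_grid(param_grid)
# Each candidate is fitted once per fold; the final refit of the best
# candidate on the full training set is not included in this count.
total_fits = len(candidates) * CV_FOLDS

print(f'{len(candidates)} candidates x {CV_FOLDS} folds = {total_fits} fits')
# prints: 8 candidates x 3 folds = 24 fits
```

Adding a value to any list multiplies the number of candidates, so the cost of the search grows quickly with the size of the grid.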


## 3  Train, tune, evaluate, and log with MLflow

In [None]:
with mlflow.start_run(run_name='rf_regressor_local'):
    gscv = GridSearchCV(
        pipe,
        param_grid=param_grid,
        cv=3,
        scoring='neg_mean_absolute_error',
        n_jobs=-1,
        verbose=1,
    ).fit(X_train, y_train)

    best = gscv.best_estimator_
    mlflow.log_params(gscv.best_params_)

    y_pred = best.predict(X_test)
    metrics = {
        'MAE': mean_absolute_error(y_test, y_pred),
        'RMSE': root_mean_squared_error(y_test, y_pred),
        'R2': r2_score(y_test, y_pred),
    }
    mlflow.log_metrics(metrics)

    # artifacts
    fi = pd.Series(best.named_steps['rf'].feature_importances_,
                   index=X_full.columns).sort_values(ascending=False)
    fi.to_csv('feature_importance.csv')
    mlflow.log_artifact('feature_importance.csv', artifact_path='insight')
    mlflow.log_artifact('feature_summary.csv', artifact_path='eda')

    # signature & model registration
    signature = infer_signature(X_test.head(5), best.predict(X_test.head(5)))
    mlflow.sklearn.log_model(
        best,
        artifact_path='model',
        registered_model_name='CaliforniaRFRegressor',
        signature=signature,
        input_example=X_test.head(5),
    )

print('✅ Run complete — open MLflow UI to inspect.')


Fitting 3 folds for each of 8 candidates, totalling 24 fits
✅ Run complete — open MLflow UI to inspect.


Registered model 'CaliforniaRFRegressor' already exists. Creating a new version of this model...
Created version '3' of model 'CaliforniaRFRegressor'.
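
To make the three logged metrics easier to read in the MLflow UI, here is a minimal pure-Python sketch of what each one measures. The helper names `mae`, `rmse`, and `r2` and the sample values are made up for illustration; the run above uses the scikit-learn functions, and these numbers are not results from this notebook. The California Housing target is expressed in units of $100,000, so MAE and RMSE are in the same units.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: like MAE, but penalises large errors more."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total variance."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

print(f'MAE={mae(y_true, y_pred):.3f}')    # MAE=0.250
print(f'RMSE={rmse(y_true, y_pred):.3f}')  # RMSE=0.354
print(f'R2={r2(y_true, y_pred):.3f}')      # R2=0.900
```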


## 4  View results
Launch the MLflow Tracking UI from a terminal, in the directory that contains the `mlruns` folder:
```bash
mlflow ui --port 5000
```
then open **http://localhost:5000** in your browser.

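## 5  (Optional) Serve the model locally
```bash
mlflow models serve -m "models:/CaliforniaRFRegressor/2" -p 9000 --env-manager local
```
Replace `2` with the registered version you want to serve; each run of the training cell registers a new version and prints its number. `--env-manager local` serves the model from the current Python environment instead of building a separate one. Send inference requests to `http://127.0.0.1:9000/invocations`.

The command keeps running until it is interrupted, so the cell below blocks the kernel. Run it in a terminal instead if you want to send requests from this notebook.

In [None]:
! mlflow models serve -m "models:/CaliforniaRFRegressor/2" -p 9000 --env-manager local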


## 6  Simple invocation

In [None]:
!curl -X POST http://127.0.0.1:9000/invocations \
     -H 'Content-Type: application/json' \
     -d '{ \
           "dataframe_split": { \
             "columns": ["MedInc", "HouseAge", "AveRooms", "AveBedrms", \
                         "Population", "AveOccup", "Latitude", "Longitude"], \
             "data": [[3.2, 15, 6.6, 1.0, 784, 2.6, 37.88, -122.23]] \
           } \
         }'

{"predictions": [1.9872875000000008]}
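
The same request can be sent from Python. The following is a minimal sketch using only the standard library; `build_payload` and `predict` are hypothetical helper names introduced here, and the URL and column names are copied from the cells above. It assumes the scoring server from the previous section is still running on port 9000.

```python
import json
import urllib.request

# Endpoint exposed by `mlflow models serve` in the previous section.
INVOCATIONS_URL = 'http://127.0.0.1:9000/invocations'

# Column order used in the curl example above.
FEATURES = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
            'Population', 'AveOccup', 'Latitude', 'Longitude']


def build_payload(rows):
    """Wrap rows of feature values in the dataframe_split JSON layout."""
    for row in rows:
        if len(row) != len(FEATURES):
            raise ValueError(
                f'expected {len(FEATURES)} values per row, got {len(row)}')
    return {'dataframe_split': {'columns': FEATURES,
                                'data': [list(row) for row in rows]}}


def predict(rows, url=INVOCATIONS_URL, timeout=10):
    """POST rows to the scoring server and return the list of predictions."""
    body = json.dumps(build_payload(rows)).encode('utf-8')
    request = urllib.request.Request(
        url,
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())['predictions']


# Same row as the curl example; uncomment once the server is running.
# print(predict([[3.2, 15, 6.6, 1.0, 784, 2.6, 37.88, -122.23]]))
```

Validating the row length before sending catches a missing or extra feature locally, rather than relying on the server to reject the request.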