# California Housing Price Prediction (Regression)

## 🎯 Objective
Build an AutoML regression model to predict median house values using AutoGluon.

**Task**: Regression  
**Dataset**: California Housing (sklearn built-in)  
**Target**: `median_house_value`  
**Metric**: RMSE (Root Mean Squared Error)  
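
As a reference for the metric named above, RMSE can be computed by hand. The sketch below is a plain-Python illustration on toy values; the helper name `rmse` is our own and is not an AutoGluon function.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy values for illustration only (not taken from the dataset)
toy_actual = [2.0, 1.5, 3.0, 4.0]
toy_predicted = [2.5, 1.0, 3.0, 3.0]
toy_rmse = rmse(toy_actual, toy_predicted)
print(f"RMSE: {toy_rmse:.4f}")
```

Because the errors are squared before averaging, a single large miss (the last pair here) raises RMSE more than it would raise a mean absolute error.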

## 📋 What This Notebook Does
1. Install AutoGluon and dependencies
2. Load California Housing dataset from sklearn
3. Prepare features and target variable
4. Train AutoGluon predictor for regression
5. Show leaderboard and feature importance
6. Generate predictions and save artifacts

## 📦 Install Dependencies

In [None]:
!pip install -q torch torchvision torchaudio
!pip install -q autogluon scikit-learn

## 📚 Import Libraries

In [1]:
import time
import shutil
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from autogluon.tabular import TabularPredictor

# Seed NumPy's global RNG (the train/test split below sets its own random_state)
np.random.seed(42)

## 📥 Load Dataset

The California Housing dataset contains:
- **20,640 samples** from California districts
- **8 features**: Location, housing attributes, demographics
- **Target**: Median house value (in $100,000s)

In [2]:
# Load California Housing dataset
print("📥 Loading California Housing dataset...")
housing = fetch_california_housing(as_frame=True)

# Create dataframe with features and target
data = housing.frame

# Rename target to be more descriptive
data = data.rename(columns={'MedHouseVal': 'median_house_value'})

print(f"\n✅ Data loaded successfully!")
print(f"   Shape: {data.shape}")
print(f"\n📊 Dataset Info:")
print(data.info())
print(f"\n📈 Target Statistics:")
print(data['median_house_value'].describe())

📥 Loading California Housing dataset...

✅ Data loaded successfully!
   Shape: (20640, 9)

📊 Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   MedInc              20640 non-null  float64
 1   HouseAge            20640 non-null  float64
 2   AveRooms            20640 non-null  float64
 3   AveBedrms           20640 non-null  float64
 4   Population          20640 non-null  float64
 5   AveOccup            20640 non-null  float64
 6   Latitude            20640 non-null  float64
 7   Longitude           20640 non-null  float64
 8   median_house_value  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB
None

📈 Target Statistics:
count    20640.000000
mean         2.068558
std          1.153956
min          0.149990
25%          1.196000
50%          1.797000
75%          2.647250
max          5.000010
Nam

## 🔀 Train-Test Split

Split data into training and test sets:

In [3]:
# Split data (80% train, 20% test)
train, test = train_test_split(data, test_size=0.2, random_state=42)

print(f"📊 Data split:")
print(f"   Train: {train.shape[0]} samples")
print(f"   Test:  {test.shape[0]} samples")

📊 Data split:
   Train: 16512 samples
   Test:  4128 samples
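
The counts above follow from simple arithmetic. As a sketch, assuming scikit-learn rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the training set (our reading of its behaviour, worth checking against the documentation for your version):

```python
import math

n_samples = 20640   # rows in the California Housing frame
test_size = 0.2     # fraction passed to train_test_split

# Assumption: the test count is rounded up and the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"Train: {n_train} samples")
print(f"Test:  {n_test} samples")
```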


## 🎯 Set Target Label and Problem Type

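AutoGluon can infer that this is a regression problem from the continuous numeric target, and RMSE is its default regression metric. The training cell nevertheless sets both explicitly for clarity.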

In [4]:
# Define target label
LABEL = "median_house_value"

# AutoGluon will auto-detect problem type (regression)
# and use RMSE as the default metric
print(f"🎯 Target Label: {LABEL}")
print(f"📈 Metric: RMSE (auto-detected for regression)")
print(f"\n📊 Feature columns:")
feature_cols = [col for col in train.columns if col != LABEL]
print(feature_cols)

🎯 Target Label: median_house_value
📈 Metric: RMSE (auto-detected for regression)

📊 Feature columns:
['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']


## 🚀 Train AutoGluon Model

AutoGluon will:
- Treat this as a regression task (`problem_type` is set explicitly in the cell below)
- Train multiple models (LightGBM, CatBoost, Neural Networks, etc.)
- Create an ensemble of the best models
- Optimize for RMSE

In [5]:
# Create save directory with timestamp
save_dir = f"ag-{int(time.time())}-california-housing"

# Initialize predictor
predictor = TabularPredictor(
    label=LABEL,
    problem_type="regression",  # Explicitly set for clarity
    eval_metric="root_mean_squared_error",  # RMSE for regression
    path=save_dir
)

# Train the model
print("🏋️ Training AutoGluon models...")
print("This may take 10-15 minutes...\n")

predictor = predictor.fit(
    train,
    presets="medium_quality",  # Balance between speed and accuracy
    time_limit=900,            # 15 minutes (adjust as needed)
    verbosity=2                # Show detailed progress
)

print("\n✅ Training complete!")

Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.9.6
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 25.0.0: Wed Sep 17 21:42:08 PDT 2025; root:xnu-12377.1.9~141/RELEASE_ARM64_T8132
CPU Count:          10
Memory Avail:       5.59 GB / 16.00 GB (34.9%)
Disk Space Avail:   109.79 GB / 228.27 GB (48.1%)
Presets specified: ['medium_quality']
Using hyperparameters preset: hyperparameters='default'
Beginning AutoGluon training ... Time limit = 900s
AutoGluon will save models to "/Users/banbalagan/Projects/autogluon-assignment/part1-kaggle/ag-1761503699-california-housing"
Train Data Rows:    16512
Train Data Columns: 8
Label Column:       median_house_value
Problem Type:       regression
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    5643.34 MB
	Train Data (Original)  Memory Usage: 1.01 MB (0.0% of available 

🏋️ Training AutoGluon models...
This may take 10-15 minutes...



	Types of features in original data (raw dtype, special dtypes):
		('float', []) : 8 | ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', ...]
	Types of features in processed data (raw dtype, special dtypes):
		('float', []) : 8 | ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', ...]
	0.0s = Fit runtime
	8 features in original data used to generate 8 features in processed data.
	Train Data (Processed) Memory Usage: 1.01 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.06s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
	This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
	To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.1, Train Rows: 14860, Val Rows: 1652
User-specified model hyperparameters to be fit:
{
	'N

[1000]	valid_set's rmse: 0.490271
[2000]	valid_set's rmse: 0.480824
[3000]	valid_set's rmse: 0.479103
[4000]	valid_set's rmse: 0.47806
[5000]	valid_set's rmse: 0.477555
[6000]	valid_set's rmse: 0.478424


	-0.4775	 = Validation score   (-root_mean_squared_error)
	11.91s	 = Training   runtime
	0.18s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 887.75s of the 887.75s of remaining time.
	Fitting with cpus=10, gpus=0, mem=0.0/5.6 GB


[1000]	valid_set's rmse: 0.45723
[2000]	valid_set's rmse: 0.455822
[3000]	valid_set's rmse: 0.454135


	-0.454	 = Validation score   (-root_mean_squared_error)
	5.86s	 = Training   runtime
	0.08s	 = Validation runtime
Fitting model: RandomForestMSE ... Training model for up to 881.74s of the 881.74s of remaining time.
	Fitting with cpus=10, gpus=0
	-0.5298	 = Validation score   (-root_mean_squared_error)
	4.47s	 = Training   runtime
	0.06s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 877.03s of the 877.03s of remaining time.
	Fitting with cpus=10, gpus=0
	-0.4356	 = Validation score   (-root_mean_squared_error)
	30.03s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: ExtraTreesMSE ... Training model for up to 846.98s of the 846.98s of remaining time.
	Fitting with cpus=10, gpus=0
	-0.5313	 = Validation score   (-root_mean_squared_error)
	0.91s	 = Training   runtime
	0.05s	 = Validation runtime
Fitting model: NeuralNetFastAI ... Training model for up to 845.86s of the 845.85s of remaining time.
	Fitting with cpus=10, gpus=0, mem=0.0/5.3 GB
	-

[1000]	valid_set's rmse: 0.453908
[2000]	valid_set's rmse: 0.452343
[3000]	valid_set's rmse: 0.452218
[4000]	valid_set's rmse: 0.452185


	-0.4522	 = Validation score   (-root_mean_squared_error)
	32.54s	 = Training   runtime
	0.2s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.00s of the 768.15s of remaining time.
	Ensemble Weights: {'CatBoost': 0.75, 'LightGBMLarge': 0.208, 'NeuralNetTorch': 0.042}
	-0.4336	 = Validation score   (-root_mean_squared_error)
	0.01s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 131.88s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 7596.6 rows/s (1652 batch size)
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/Users/banbalagan/Projects/autogluon-assignment/part1-kaggle/ag-1761503699-california-housing")



✅ Training complete!
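
The training log reports `Ensemble Weights: {'CatBoost': 0.75, 'LightGBMLarge': 0.208, 'NeuralNetTorch': 0.042}`. Conceptually, a weighted ensemble of this kind blends the base models' predictions as a weighted average. The sketch below illustrates the idea with made-up per-model predictions; it is not AutoGluon's internal implementation, and the function name `blend` is hypothetical.

```python
# Weights copied from the training log
weights = {"CatBoost": 0.75, "LightGBMLarge": 0.208, "NeuralNetTorch": 0.042}

# Hypothetical base-model predictions for three districts (illustrative values)
base_predictions = {
    "CatBoost":       [2.10, 0.95, 4.40],
    "LightGBMLarge":  [2.30, 1.05, 4.10],
    "NeuralNetTorch": [1.90, 1.20, 4.80],
}

def blend(predictions, weights):
    """Weighted average of per-model predictions, row by row."""
    n_rows = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * predictions[name][i] for name in weights)
        for i in range(n_rows)
    ]

ensemble_prediction = blend(base_predictions, weights)
print([round(p, 4) for p in ensemble_prediction])
```

Since the weights sum to 1, each blended value stays within the range spanned by the base models' predictions for that row.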


## 📊 Model Leaderboard

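Shows all models trained and their performance. AutoGluon reports RMSE with a flipped sign, so scores closer to zero are better. Because the cell below passes `train`, the `score_test` column reflects fit to the training data and is optimistic; `score_val` (AutoGluon's internal holdout) is the fairer comparison. Pass `test` instead to score the models on held-out data.

In [6]:
# Get leaderboard
# Note: passing `train` means `score_test` is computed on training data
leaderboard = predictor.leaderboard(train, silent=True)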

print("🏆 Top 10 Models (sorted by RMSE):")
display(leaderboard.head(10))

# Save leaderboard
leaderboard.to_csv('leaderboard.csv', index=False)
print("\n💾 Saved: leaderboard.csv")

🏆 Top 10 Models (sorted by RMSE):


| | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LightGBMLarge | -0.143494 | -0.45215 | root_mean_squared_error | 1.11477 | 0.201329 | 32.544856 | 1.11477 | 0.201329 | 32.544856 | 1 | True | 9 |
| 1 | XGBoost | -0.145659 | -0.458902 | root_mean_squared_error | 0.49281 | 0.035739 | 5.426123 | 0.49281 | 0.035739 | 5.426123 | 1 | True | 7 |
| 2 | WeightedEnsemble_L2 | -0.179685 | -0.433641 | root_mean_squared_error | 1.216885 | 0.217465 | 95.516262 | 0.007099 | 0.00046 | 0.010457 | 2 | True | 10 |
| 3 | LightGBM | -0.185385 | -0.454005 | root_mean_squared_error | 0.480925 | 0.076977 | 5.860347 | 0.480925 | 0.076977 | 5.860347 | 1 | True | 2 |
| 4 | CatBoost | -0.194105 | -0.43555 | root_mean_squared_error | 0.039579 | 0.004959 | 30.027418 | 0.039579 | 0.004959 | 30.027418 | 1 | True | 4 |
| 5 | RandomForestMSE | -0.243204 | -0.529777 | root_mean_squared_error | 0.367702 | 0.063809 | 4.473133 | 0.367702 | 0.063809 | 4.473133 | 1 | True | 3 |
| 6 | ExtraTreesMSE | -0.244164 | -0.531256 | root_mean_squared_error | 0.332478 | 0.04802 | 0.911261 | 0.332478 | 0.04802 | 0.911261 | 1 | True | 5 |
| 7 | LightGBMXT | -0.277642 | -0.477455 | root_mean_squared_error | 1.187048 | 0.17717 | 11.912182 | 1.187048 | 0.17717 | 11.912182 | 1 | True | 1 |
| 8 | NeuralNetTorch | -0.43159 | -0.517004 | root_mean_squared_error | 0.055437 | 0.010717 | 32.933531 | 0.055437 | 0.010717 | 32.933531 | 1 | True | 8 |
| 9 | NeuralNetFastAI | -0.527477 | -0.546016 | root_mean_squared_error | 0.090616 | 0.013055 | 6.300323 | 0.090616 | 0.013055 | 6.300323 | 1 | True | 6 |



💾 Saved: leaderboard.csv
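
Because the leaderboard call above passes `train`, the `score_test` column measures fit to data the models have already seen, while `score_val` comes from AutoGluon's internal holdout. One way to read the table is to flip the sign back to a positive RMSE and compare the two columns. The sketch below does this for three rows copied from the table (scores rounded to four decimals); the variable names are our own.

```python
# (model, score on training data, score on validation holdout), from the table above
rows = [
    ("LightGBMLarge",       -0.1435, -0.4522),
    ("WeightedEnsemble_L2", -0.1797, -0.4336),
    ("CatBoost",            -0.1941, -0.4356),
]

# Scores are sign-flipped RMSE; negate to get positive RMSE values
summary = [
    {
        "model": name,
        "train_rmse": -train_score,
        "val_rmse": -val_score,
        "gap": (-val_score) - (-train_score),  # how much worse on unseen data
    }
    for name, train_score, val_score in rows
]

# Rank by validation RMSE (lower is better), not by the training-data score
summary.sort(key=lambda r: r["val_rmse"])
for r in summary:
    print(f"{r['model']:<20} train={r['train_rmse']:.4f} "
          f"val={r['val_rmse']:.4f} gap={r['gap']:.4f}")

best_by_validation = summary[0]["model"]
```

Ranked this way, the model with the lowest training-data RMSE is no longer first, which is why the validation column is the safer one to compare.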


## 🔍 Feature Importance

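Shows which features are most predictive of house prices, measured by permutation importance: the drop in score when one feature's values are shuffled. The cell below computes it on the training data; held-out data such as `test` generally gives a less biased estimate.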

In [7]:
# Get feature importance
feature_importance = predictor.feature_importance(train)

print("🔍 Feature Importance (all features):")
display(feature_importance)

# Save feature importance
feature_importance.to_csv('feature_importance.csv')
print("\n💾 Saved: feature_importance.csv")

Computing feature importance via permutation shuffling for 8 features using 5000 rows with 5 shuffle sets...
	17.46s	= Expected runtime (3.49s per shuffle set)
	16.62s	= Actual runtime (Completed 5 of 5 shuffle sets)


🔍 Feature Importance (all features):


| | importance | stddev | p_value | n | p99_high | p99_low |
|---|---|---|---|---|---|---|
| Latitude | 1.487997 | 0.021876 | 5.603919e-09 | 5 | 1.533039 | 1.442955 |
| Longitude | 1.392236 | 0.046787 | 1.528187e-07 | 5 | 1.488571 | 1.295901 |
| MedInc | 0.489626 | 0.038122 | 4.374629e-06 | 5 | 0.56812 | 0.411131 |
| AveOccup | 0.301257 | 0.027523 | 8.267509e-06 | 5 | 0.357926 | 0.244587 |
| AveRooms | 0.275909 | 0.026237 | 9.695745e-06 | 5 | 0.329932 | 0.221886 |
| HouseAge | 0.147702 | 0.02211 | 5.849344e-05 | 5 | 0.193227 | 0.102178 |
| Population | 0.08779 | 0.017887 | 0.000195825 | 5 | 0.124619 | 0.050961 |
| AveBedrms | 0.08762 | 0.015438 | 0.0001110185 | 5 | 0.119408 | 0.055833 |



💾 Saved: feature_importance.csv
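
The log above mentions "permutation shuffling". The general idea is to shuffle one feature column at a time and measure how much the error grows. The sketch below illustrates it on synthetic data with a fixed stand-in model; it does not reproduce AutoGluon's implementation (which, per the log, used 5000 rows and 5 shuffle sets), and all names are illustrative.

```python
import math
import random

random.seed(0)

# Synthetic data: the target depends on feature 0 only; feature 1 is noise
X = [[random.uniform(0, 10), random.uniform(0, 10)] for _ in range(200)]
y = [3.0 * row[0] for row in X]

def model(row):
    """Stand-in for a trained model: it happens to know the true relationship."""
    return 3.0 * row[0]

def rmse(rows, targets):
    """RMSE of the stand-in model's predictions against the targets."""
    return math.sqrt(sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows))

def permutation_importance(rows, targets, n_features, n_shuffles=5):
    """Mean increase in RMSE when each feature column is shuffled in turn."""
    baseline = rmse(rows, targets)
    importances = []
    for j in range(n_features):
        increases = []
        for _ in range(n_shuffles):
            column = [r[j] for r in rows]
            random.shuffle(column)
            # Rebuild the rows with only column j replaced by its shuffled values
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(rmse(shuffled, targets) - baseline)
        importances.append(sum(increases) / n_shuffles)
    return importances

importance = permutation_importance(X, y, n_features=2)
print(importance)
```

Shuffling the feature the model relies on raises the error sharply, while shuffling the ignored feature changes nothing, which is the contrast the importance table is meant to surface.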


## 📈 Model Performance on Test Set

Evaluate the best model on the held-out test set. AutoGluon prints error metrics with a flipped sign, so values closer to zero are better:

In [8]:
# Evaluate on test set
print("📊 Evaluating on test set...")
test_performance = predictor.evaluate(test)

print("\n📈 Test Set Performance:")
for metric, value in test_performance.items():
    print(f"   {metric}: {value:.4f}")

📊 Evaluating on test set...

📈 Test Set Performance:
   root_mean_squared_error: -0.4276
   mean_squared_error: -0.1828
   mean_absolute_error: -0.2746
   r2: 0.8605
   pearsonr: 0.9276
   median_absolute_error: -0.1723
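
To read the values above as ordinary errors in dollars, the error metrics can be negated and multiplied by 100,000 (the target's unit). A small sketch using the printed values; which metrics count as sign-flipped errors is our own grouping for this example.

```python
# Values copied from the test-set output above
test_performance = {
    "root_mean_squared_error": -0.4276,
    "mean_squared_error": -0.1828,
    "mean_absolute_error": -0.2746,
    "r2": 0.8605,
    "pearsonr": 0.9276,
    "median_absolute_error": -0.1723,
}

# Metrics reported with a flipped sign (errors); r2 and pearsonr are left as-is
ERROR_METRICS = {
    "root_mean_squared_error",
    "mean_squared_error",
    "mean_absolute_error",
    "median_absolute_error",
}
# Only errors in the target's own units (not squared) convert directly to dollars
DOLLAR_METRICS = ERROR_METRICS - {"mean_squared_error"}
TARGET_UNIT_USD = 100_000  # target is expressed in $100,000s

readable = {
    metric: (-value if metric in ERROR_METRICS else value)
    for metric, value in test_performance.items()
}

for metric in sorted(DOLLAR_METRICS):
    print(f"{metric}: about ${readable[metric] * TARGET_UNIT_USD:,.0f}")
```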


## 🔮 Generate Predictions

Make predictions on the test set:

In [9]:
# Generate predictions
print("🔮 Generating predictions...")
predictions = predictor.predict(test)

# Create comparison dataframe
comparison = pd.DataFrame({
    'actual': test[LABEL].values,
    'predicted': predictions.values,
    'error': test[LABEL].values - predictions.values,
    'abs_error': abs(test[LABEL].values - predictions.values)
})

# Add feature columns for context
for col in feature_cols:
    comparison[col] = test[col].values

comparison.to_csv('predictions.csv', index=False)
print("✅ Predictions generated!")
print("\n📊 Sample predictions (first 10):")
display(comparison[['actual', 'predicted', 'error', 'abs_error']].head(10))
print("\n💾 Saved: predictions.csv")

🔮 Generating predictions...
✅ Predictions generated!

📊 Sample predictions (first 10):


| | actual | predicted | error | abs_error |
|---|---|---|---|---|
| 0 | 0.477 | 0.466602 | 0.010398 | 0.010398 |
| 1 | 0.458 | 0.667731 | -0.209731 | 0.209731 |
| 2 | 5.00001 | 5.066875 | -0.066865 | 0.066865 |
| 3 | 2.186 | 2.489718 | -0.303718 | 0.303718 |
| 4 | 2.78 | 2.589109 | 0.190891 | 0.190891 |
| 5 | 1.587 | 1.63902 | -0.05202 | 0.05202 |
| 6 | 1.982 | 2.335185 | -0.353185 | 0.353185 |
| 7 | 1.575 | 1.557289 | 0.017711 | 0.017711 |
| 8 | 3.4 | 3.079656 | 0.320344 | 0.320344 |
| 9 | 4.466 | 5.015294 | -0.549294 | 0.549294 |



💾 Saved: predictions.csv
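
The `error` column is signed (actual minus predicted), so its mean indicates bias, while `abs_error` gives the typical size of a miss. As a rough illustration using only the ten rows shown above (far too few to draw conclusions about the full test set):

```python
# (actual, predicted) pairs copied from the sample rows above
sample = [
    (0.477, 0.466602), (0.458, 0.667731), (5.00001, 5.066875),
    (2.186, 2.489718), (2.78, 2.589109), (1.587, 1.63902),
    (1.982, 2.335185), (1.575, 1.557289), (3.4, 3.079656),
    (4.466, 5.015294),
]

errors = [actual - predicted for actual, predicted in sample]

# A negative mean error means the model over-predicted on average for these rows
mean_error = sum(errors) / len(errors)
# Mean absolute error ignores direction and measures the typical miss
mean_abs_error = sum(abs(e) for e in errors) / len(errors)

print(f"Mean error:          {mean_error:+.4f}")
print(f"Mean absolute error: {mean_abs_error:.4f}")
```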


## 💾 Save Model Artifacts

Package everything for download:

In [None]:
# Create model archive
print("📦 Creating model archive...")
shutil.make_archive('autogluon_model', 'zip', save_dir)

print("\n✅ All artifacts saved!")
print("\n📥 Download these files:")
print("   ✓ autogluon_model.zip     - Trained model")
print("   ✓ leaderboard.csv         - Model comparison")
print("   ✓ feature_importance.csv  - Important features")
print("   ✓ predictions.csv         - Test predictions with actuals")
print("\n💡 Use the Files panel (📁) to download")
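
To reuse the archive on another machine, it can be unpacked with `shutil.unpack_archive` and the resulting directory passed to `TabularPredictor.load`, as the training log suggests. The sketch below demonstrates only the zip round trip, using a temporary stand-in directory with a placeholder file so that it runs without a trained model; the file and directory names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # Stand-in for the AutoGluon save directory (placeholder content only)
    model_dir = tmp / "ag-example-model"
    model_dir.mkdir()
    (model_dir / "placeholder.txt").write_text("model files would live here")

    # Same call pattern as the cell above: base name, format, directory to pack
    archive_path = shutil.make_archive(str(tmp / "autogluon_model"), "zip", str(model_dir))

    # Unpack into a fresh directory, as one would after downloading the zip
    restored_dir = tmp / "restored_model"
    shutil.unpack_archive(archive_path, str(restored_dir))
    restored_files = sorted(p.name for p in restored_dir.iterdir())
    print(restored_files)

    # With a real archive, the next step would be:
    # predictor = TabularPredictor.load(str(restored_dir))
```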

## 🎓 Summary

This notebook demonstrated:
1. ✅ Loading California Housing dataset from sklearn
2. ✅ Training AutoGluon for regression task
3. ✅ Evaluating model performance (RMSE)
4. ✅ Analyzing feature importance
5. ✅ Generating predictions on test set

**Key Insights:**
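- In this run, location dominated: `Latitude` and `Longitude` had the highest permutation importance (about 1.49 and 1.39), followed by `MedInc` (median income, about 0.49)
- AutoGluon handled preprocessing, validation splitting, and model selection without manual tuning
- The weighted ensemble (`WeightedEnsemble_L2`) had the best validation score (RMSE 0.4336) and reached a test RMSE of 0.4276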

**Next Steps:**
- Try different presets (`best_quality`, `high_quality`)
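- Raise `time_limit` together with a heavier preset; this run finished in about 132 s of the 900 s budget, so the limit alone was not the constraint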
- Experiment with feature engineering (e.g., adding distance from coast)
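
As one possible starting point for the feature-engineering idea above, the sketch below computes the great-circle (haversine) distance from a district's `Latitude`/`Longitude` to a fixed reference point. True distance to the coastline would need coastline geometry, which is out of scope here; the reference point (downtown San Francisco, approximate coordinates) and the function name `haversine_km` are our own assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Assumed reference point: downtown San Francisco (approximate coordinates)
REF_LAT, REF_LON = 37.7749, -122.4194

# Example: distance from a point in central Los Angeles (approximate coordinates)
dist_la = haversine_km(34.0522, -118.2437, REF_LAT, REF_LON)
print(f"Distance to reference: {dist_la:.0f} km")
```

With the notebook's DataFrame, the same function could be applied row by row before the train-test split, for example `data.apply(lambda r: haversine_km(r['Latitude'], r['Longitude'], REF_LAT, REF_LON), axis=1)`.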