# California Housing Price Prediction (Regression)

## 🎯 Objective
Build an AutoML regression model to predict median house values using AutoGluon.

**Task**: Regression  
**Dataset**: California Housing (sklearn built-in)  
**Target**: `median_house_value`  
**Metric**: RMSE (Root Mean Squared Error)  
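
For reference, RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as the target. A minimal sketch in plain Python follows; the `rmse` helper and the toy numbers are illustrative and not part of the notebook's pipeline.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy values in the target's units ($100,000s); not real model output.
toy_actual = [2.0, 3.0, 1.0, 4.0]
toy_predicted = [2.5, 2.5, 1.5, 3.5]
toy_rmse = rmse(toy_actual, toy_predicted)
print(f"RMSE: {toy_rmse:.2f} (about ${toy_rmse * 100_000:,.0f})")
```

Because large errors are squared before averaging, RMSE penalizes a few badly mispredicted districts more heavily than many small misses.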

## 📋 What This Notebook Does
1. Install AutoGluon and dependencies
2. Load California Housing dataset from sklearn
3. Prepare features and target variable
4. Train AutoGluon predictor for regression
5. Show leaderboard and feature importance
6. Evaluate on the held-out test set
7. Generate predictions and save artifacts

## 📦 Install Dependencies

In [None]:
!pip install -q autogluon scikit-learn

## 📚 Import Libraries

In [None]:
import time
import shutil
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from autogluon.tabular import TabularPredictor

# Seed NumPy's global RNG (train_test_split below takes its own random_state)
np.random.seed(42)

## 📥 Load Dataset

The California Housing dataset contains:
- **20,640 samples** from California districts
- **8 features**: Location, housing attributes, demographics
- **Target**: Median house value (in $100,000s)

In [None]:
# Load California Housing dataset
print("📥 Loading California Housing dataset...")
housing = fetch_california_housing(as_frame=True)

# Create dataframe with features and target
data = housing.frame

# Rename target to be more descriptive
data = data.rename(columns={'MedHouseVal': 'median_house_value'})

print("\n✅ Data loaded successfully!")
print(f"   Shape: {data.shape}")
print("\n📊 Dataset Info:")
data.info()  # info() prints its report directly and returns None
print("\n📈 Target Statistics:")
print(data['median_house_value'].describe())

## 🔀 Train-Test Split

Split data into training and test sets:

In [None]:
# Split data (80% train, 20% test)
train, test = train_test_split(data, test_size=0.2, random_state=42)

print("📊 Data split:")
print(f"   Train: {train.shape[0]} samples")
print(f"   Test:  {test.shape[0]} samples")
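
Conceptually, a seeded random split shuffles the row indices once and cuts off a fixed fraction for testing. The sketch below shows that idea with the standard library only. `split_indices` is a hypothetical helper, not scikit-learn's implementation, and it will not select the same rows as `train_test_split`.

```python
import math
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and cut off a test fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]  # (train, test)

# 20,640 is the number of rows in the California Housing dataset.
train_idx, test_idx = split_indices(20_640)
print(len(train_idx), len(test_idx))
```

Fixing the seed matters because every later number (leaderboard, test RMSE) depends on which rows end up in the test set.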

## 🎯 Set Target Label and Problem Type

AutoGluon can infer that this is a regression problem from the numeric, continuous target. The training cell later in this notebook still sets the problem type explicitly for clarity.

In [None]:
# Define target label
LABEL = "median_house_value"

# AutoGluon can infer the problem type (regression) from the label,
# and RMSE is its default metric for regression
print(f"🎯 Target Label: {LABEL}")
print("📈 Metric: RMSE (AutoGluon's default for regression)")
print("\n📊 Feature columns:")
feature_cols = [col for col in train.columns if col != LABEL]
print(feature_cols)

## 🚀 Train AutoGluon Model

AutoGluon will:
- Treat the task as regression (inferred from the numeric label unless set explicitly)
- Train multiple models (LightGBM, CatBoost, Neural Networks, etc.)
- Create an ensemble of the best models
- Optimize for RMSE
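
To illustrate the ensembling step: a weighted ensemble combines each model's predictions using non-negative weights that sum to one. The sketch below uses made-up predictions from two hypothetical models and hand-picked weights. AutoGluon learns its ensemble weights from validation data, and this is not its implementation.

```python
def weighted_ensemble(predictions_by_model, weights):
    """Combine per-model prediction lists into one weighted average per row."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    n_rows = len(next(iter(predictions_by_model.values())))
    return [
        sum(weights[name] * preds[i] for name, preds in predictions_by_model.items())
        for i in range(n_rows)
    ]

# Hypothetical model names and toy predictions (in $100,000s).
toy_predictions = {
    "model_a": [2.0, 3.0, 1.0],
    "model_b": [3.0, 2.0, 2.0],
}
toy_weights = {"model_a": 0.75, "model_b": 0.25}
blended = weighted_ensemble(toy_predictions, toy_weights)
print(blended)
```

Averaging helps when the individual models make different mistakes, since their errors partly cancel.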

In [None]:
# Create save directory with timestamp
save_dir = f"ag-{int(time.time())}-california-housing"

# Initialize predictor
predictor = TabularPredictor(
    label=LABEL,
    problem_type="regression",  # Explicitly set for clarity
    eval_metric="root_mean_squared_error",  # RMSE for regression
    path=save_dir
)

# Train the model
print("🏋️ Training AutoGluon models...")
print("This may take up to 15 minutes (capped by time_limit)...\n")

predictor = predictor.fit(
    train,
    presets="medium_quality",  # Balance between speed and accuracy
    time_limit=900,            # 15 minutes (adjust as needed)
    verbosity=2                # Show detailed progress
)

print("\n✅ Training complete!")

## 📊 Model Leaderboard

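Shows all trained models and their performance. AutoGluon reports scores in higher-is-better form, so RMSE appears as a negative number and values closer to zero are better: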

In [None]:
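# Get leaderboard, scored on the held-out test set
# (scoring on the training data would give overly optimistic numbers)
leaderboard = predictor.leaderboard(test, silent=True)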

print("🏆 Top 10 Models (sorted by RMSE):")
display(leaderboard.head(10))

# Save leaderboard
leaderboard.to_csv('leaderboard.csv', index=False)
print("\n💾 Saved: leaderboard.csv")

## 🔍 Feature Importance

Shows which features are most predictive of house prices, measured by permutation importance:

In [None]:
# Get feature importance on held-out data (training data would bias the scores)
feature_importance = predictor.feature_importance(test)

print("🔍 Feature Importance (all features):")
display(feature_importance)

# Save feature importance
feature_importance.to_csv('feature_importance.csv')
print("\n💾 Saved: feature_importance.csv")
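
AutoGluon's `feature_importance` is based on permutation importance: shuffle one feature's values, re-score the model, and see how much the score degrades. Below is a rough sketch of that idea with a toy model and toy data. All names are illustrative, and AutoGluon's implementation differs in detail (for example, it repeats the shuffle and reports statistics across repeats).

```python
import math
import random

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def permutation_importance(model, rows, target, feature, seed=0):
    """Increase in RMSE after shuffling one feature column."""
    baseline = rmse(target, [model(r) for r in rows])
    shuffled_values = [r[feature] for r in rows]
    random.Random(seed).shuffle(shuffled_values)
    shuffled_rows = [{**r, feature: v} for r, v in zip(rows, shuffled_values)]
    return rmse(target, [model(r) for r in shuffled_rows]) - baseline

# Toy model that uses only "income" and ignores "noise".
def toy_model(row):
    return 0.5 * row["income"]

toy_rows = [{"income": float(i), "noise": float(i % 3)} for i in range(1, 21)]
toy_target = [0.5 * r["income"] for r in toy_rows]

for name in ("income", "noise"):
    score = permutation_importance(toy_model, toy_rows, toy_target, name)
    print(name, round(score, 3))
```

A feature the model ignores scores zero, while a feature it relies on produces a clear increase in error once shuffled.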

## 📈 Model Performance on Test Set

Evaluate the model on held-out test data:

In [None]:
# Evaluate on test set
# Note: AutoGluon reports scores as higher-is-better, so RMSE is shown negated
print("📊 Evaluating on test set...")
test_performance = predictor.evaluate(test)

print("\n📈 Test Set Performance:")
for metric, value in test_performance.items():
    print(f"   {metric}: {value:.4f}")
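
Because AutoGluon reports metrics in higher-is-better form, error metrics such as RMSE come back negated, and the value is in the target's units ($100,000s). The helper below is a small sketch that flips the sign and converts to dollars. The function name and the example dictionary are illustrative, not output from this notebook.

```python
def rmse_in_dollars(performance, key="root_mean_squared_error", unit=100_000):
    """Turn a negated RMSE into a positive dollar amount.

    Assumes the target is expressed in units of $100,000, as in this dataset.
    """
    return abs(performance[key]) * unit

# Illustrative dictionary shaped like the result of predictor.evaluate().
example_performance = {"root_mean_squared_error": -0.5}
print(f"${rmse_in_dollars(example_performance):,.0f}")
```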

## 🔮 Generate Predictions

Make predictions on the test set:

In [None]:
# Generate predictions
print("🔮 Generating predictions...")
predictions = predictor.predict(test)

# Create comparison dataframe
comparison = pd.DataFrame({
    'actual': test[LABEL].values,
    'predicted': predictions.values,
    'error': test[LABEL].values - predictions.values,
    'abs_error': abs(test[LABEL].values - predictions.values)
})

# Add feature columns for context
for col in feature_cols:
    comparison[col] = test[col].values

comparison.to_csv('predictions.csv', index=False)
print("✅ Predictions generated!")
print("\n📊 Sample predictions (first 10):")
display(comparison[['actual', 'predicted', 'error', 'abs_error']].head(10))
print("\n💾 Saved: predictions.csv")

## 💾 Save Model Artifacts

Package everything for download:

In [None]:
# Create model archive
print("📦 Creating model archive...")
shutil.make_archive('autogluon_model', 'zip', save_dir)

print("\n✅ All artifacts saved!")
print("\n📥 Download these files:")
print("   ✓ autogluon_model.zip     - Trained model")
print("   ✓ leaderboard.csv         - Model comparison")
print("   ✓ feature_importance.csv  - Important features")
print("   ✓ predictions.csv         - Test predictions with actuals")
print("\n💡 Use the Files panel (📁) to download")
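
To reuse the downloaded archive later, it has to be unpacked back into a directory first. The sketch below round-trips a stand-in directory through `shutil.make_archive` and `shutil.unpack_archive`. The directory and file names are placeholders, and no AutoGluon model is involved.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Stand-in for the AutoGluon save directory (a single placeholder file).
    model_dir = tmp / "ag-example"
    model_dir.mkdir()
    (model_dir / "placeholder.txt").write_text("model files would live here")

    # Same call shape as in the notebook: zip the directory's contents.
    archive_path = shutil.make_archive(str(tmp / "autogluon_model"), "zip", str(model_dir))

    # Later, possibly on another machine: unpack into a fresh directory.
    restored_dir = tmp / "restored_model"
    shutil.unpack_archive(archive_path, str(restored_dir))
    restored_files = sorted(p.name for p in restored_dir.iterdir())
    print(restored_files)

    # With AutoGluon installed, the predictor could then be reloaded with
    # TabularPredictor.load(str(restored_dir)).
```

Reloading is most reliable with the same AutoGluon version that trained the model.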

## 🎓 Summary

This notebook demonstrated:
1. ✅ Loading California Housing dataset from sklearn
2. ✅ Training AutoGluon for regression task
3. ✅ Evaluating model performance (RMSE)
4. ✅ Analyzing feature importance
5. ✅ Generating predictions on test set

**Key Insights:**
- The most important features are typically `MedInc` (median income) and location (`Latitude`/`Longitude`)
- AutoGluon handles preprocessing, model training, and ensembling with minimal configuration
- The weighted ensemble typically ranks at or near the top of the leaderboard

**Next Steps:**
- Try different presets (`best_quality`, `high_quality`)
- Increase `time_limit` for better results
- Experiment with feature engineering (e.g., adding distance from coast)
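
As a starting point for location-based feature engineering, the sketch below computes great-circle (haversine) distance between two coordinates. True distance from the coast needs coastline geometry, so distance to a reference city is used here as a simpler stand-in. The reference points and the `haversine_km` helper are assumptions for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

# Approximate city-centre coordinates, used only as reference points.
SAN_FRANCISCO = (37.7749, -122.4194)
LOS_ANGELES = (34.0522, -118.2437)

print(round(haversine_km(*SAN_FRANCISCO, *LOS_ANGELES)))
```

In the notebook this could be applied row-wise to the `Latitude` and `Longitude` columns before the train-test split, so that both sets receive the new feature.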