# Boston Housing Regression: Ensemble Models Comparison

This notebook compares several regression models and ensemble techniques on the **Boston Housing dataset**. The goal is to predict the median value of owner-occupied homes (`medv`) from the available features.

We will explore the following models:

- **Linear Regression** with proper handling of categorical variables (`chas` and `rad`) via One-Hot Encoding.
- **Decision Tree Regressor**, which splits data to reduce MSE at each node.
- **Bagging Regressor** using Decision Trees to reduce variance.
- **AdaBoost Regressor**, boosting weak learners to improve prediction accuracy.
- **Extra Trees Regressor**, an ensemble of randomized decision trees that reduces variance and improves robustness.
- **Voting Regressor**, which averages the predictions of Linear Regression and a Decision Tree so that each model can offset some of the other's errors.
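
As a rough illustration of the last item, the sketch below checks on synthetic data that an unweighted `VotingRegressor` predicts the plain average of its fitted base models. The data, `max_depth=3`, and the variable names are assumptions made for illustration and are not taken from this notebook.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a linear term plus a nonlinear term plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

voter = VotingRegressor(
    estimators=[
        ("linear", LinearRegression()),
        ("dt", DecisionTreeRegressor(max_depth=3, random_state=0)),
    ]
).fit(X, y)

# Average the predictions of the fitted base models by hand
base_preds = np.column_stack([est.predict(X) for est in voter.estimators_])
manual_avg = base_preds.mean(axis=1)

# With no weights given, the ensemble prediction should match the manual average
matches = np.allclose(voter.predict(X), manual_avg)
print(matches)
```

Because the output is a simple mean, a badly overfit base model still contributes half of every prediction, which is worth keeping in mind when reading the Voting Regressor's scores.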

Key steps:

1. **Data Preprocessing**: Scaling numeric features and encoding categorical features using `ColumnTransformer`.
2. **Model Training**: Using pipelines to ensure preprocessing is applied consistently where required.
3. **Cross-Validation**: 5-fold CV to estimate model performance on unseen data.
4. **Evaluation Metrics**: R² on the training set, on the held-out test set, and averaged across the CV folds, to check for overfitting and assess generalization.
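
For reference, the R² used in step 4 can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up target values (an assumption, not data from this notebook) and a hypothetical helper named `r2`; it is meant only to show how the score behaves, not to replace `r2_score`.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up target values for illustration
y_true = [20.0, 25.0, 30.0, 35.0]

perfect = r2(y_true, y_true)                     # exact predictions
baseline = r2(y_true, [27.5] * 4)                # always predict the mean
worse = r2(y_true, [35.0, 30.0, 25.0, 20.0])     # predictions in reverse order

print(perfect, baseline, worse)  # 1.0 0.0 -3.0
```

A score of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values mean worse than the mean. A training R² of exactly 1 alongside a much lower test R² is therefore a sign of overfitting.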

**Notes:**

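- Extra Trees differs from Bagging in how each tree is built: candidate **split thresholds are drawn at random** rather than optimized, a **random subset of features** can be considered at each split, and scikit-learn by default fits each tree on the full training set instead of a bootstrap sample. The extra randomization often lowers variance further.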
- All ensemble models are evaluated using consistent cross-validation to ensure fair comparison.
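
The difference between the two ensembles can be inspected directly on fitted models. The sketch below is a minimal check on synthetic data from `make_regression` (an assumption, not this notebook's data), and it relies on `BaggingRegressor` falling back to a decision tree when no base estimator is passed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, ExtraTreesRegressor

# Small synthetic regression problem (assumption, for illustration only)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Default base estimator of BaggingRegressor is a decision tree
bag = BaggingRegressor(n_estimators=5, random_state=0).fit(X, y)
ext = ExtraTreesRegressor(n_estimators=5, random_state=0).fit(X, y)

# Bagging resamples the rows; Extra Trees does not by default
bag_bootstrap = bag.bootstrap
ext_bootstrap = ext.bootstrap

# Bagged trees search for the best threshold; extra trees draw thresholds at random
bag_splitter = bag.estimators_[0].splitter
ext_splitter = ext.estimators_[0].splitter

print(bag_bootstrap, bag_splitter)
print(ext_bootstrap, ext_splitter)
```

Note that `ExtraTreesRegressor` in the code below is used with its default `max_features`, so whether features are actually subsampled at each split depends on that default in the installed scikit-learn version.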

**References:**

- Boston Housing Dataset: [UCI ML Repository](https://archive.ics.uci.edu/ml/datasets/Boston+Housing)
- Scikit-learn Ensemble Methods: [Bagging, AdaBoost, Extra Trees, Voting](https://scikit-learn.org/stable/modules/ensemble.html)


In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, AdaBoostRegressor, VotingRegressor, ExtraTreesRegressor

from sklearn.metrics import r2_score

import warnings
warnings.filterwarnings("ignore")


# Load the dataset
df = pd.read_csv(r"D:\OneDrive - greatlakes.edu.in\OFFICE\ML\Datasets/Boston.csv")

# Define features and target
X = df.drop('medv', axis=1)
y = df['medv']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=28
)

# Identify categorical and numerical features
cat_cols = ['chas', 'rad']
num_cols = [col for col in X_train.columns if col not in cat_cols]

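# Convert categorical columns to 'category' dtype
# (work on explicit copies so the assignments never act on a view of df)
X_train, X_test = X_train.copy(), X_test.copy()
X_train[cat_cols] = X_train[cat_cols].astype('category')
X_test[cat_cols] = X_test[cat_cols].astype('category')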

# Column transformer for scaling numerical and one-hot encoding categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_cols),
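        # handle_unknown='ignore' keeps a CV fold from failing if a rare
        # category is absent from its training part (encoded as all zeros)
        ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), cat_cols)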
    ]
)

# Define regression models
models = {
    "Linear Regression": make_pipeline(
        preprocessor,
        LinearRegression()
    ),
    "Decision Tree": DecisionTreeRegressor(random_state=28),
    "Bagging (DT)": BaggingRegressor(
        estimator=DecisionTreeRegressor(random_state=28),
        n_estimators=20,
        max_samples=0.8,
        oob_score=True,
        random_state=28
    ),
    "AdaBoost (DT)": AdaBoostRegressor(
        estimator=DecisionTreeRegressor(max_depth=4, random_state=28),
        n_estimators=100,
        learning_rate=0.5,
        random_state=28
    ),
    "Extra Trees": ExtraTreesRegressor(
        n_estimators=200,
        random_state=28,
        n_jobs=-1
    ),
    "Voting Regressor": VotingRegressor(
        estimators=[
            ('linear', make_pipeline(preprocessor, LinearRegression())),
            ('dt', DecisionTreeRegressor(random_state=28))
        ]
    )
}

# Set up 5-fold cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=28)

# Evaluate each model
results = []

for name, model in models.items():
    # Cross-validation on the training set (each fold fits a fresh clone)
    cv_scores = cross_val_score(
        model, X_train, y_train, cv=cv, scoring='r2', n_jobs=-1
    )

    # Fit model on full training data
    model.fit(X_train, y_train)

    # Calculate R2 on training and test sets
    train_r2 = r2_score(y_train, model.predict(X_train))
    test_r2 = r2_score(y_test, model.predict(X_test))

    
    # Store results
    results.append({
        "Model": name,
        "Mean CV R2": cv_scores.mean(),
        "Std CV R2": cv_scores.std(),
        "Train R2": train_r2,
        "Test R2": test_r2
    })

# Display results
cv_summary = pd.DataFrame(results).sort_values(by="Mean CV R2", ascending=False)
pd.set_option("display.float_format", lambda x: f"{x:.4f}")
display(cv_summary)


| | Model | Mean CV R2 | Std CV R2 | Train R2 | Test R2 |
|---|---|---|---|---|---|
| 4 | Extra Trees | 0.8823 | 0.0734 | 1.0000 | 0.7208 |
| 3 | AdaBoost (DT) | 0.8478 | 0.0886 | 0.9481 | 0.7726 |
| 2 | Bagging (DT) | 0.8270 | 0.1023 | 0.9663 | 0.7298 |
| 5 | Voting Regressor | 0.7930 | 0.1138 | 0.9452 | 0.6518 |
| 0 | Linear Regression | 0.7388 | 0.0859 | 0.7809 | 0.5684 |
| 1 | Decision Tree | 0.6823 | 0.1272 | 1.0000 | 0.4901 |
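
The Bagging model is built with `oob_score=True`, but the out-of-bag estimate is not reported in the results. The sketch below shows one way to read it after fitting. It uses synthetic data and a larger `n_estimators` than the notebook (both assumptions), so the printed value says nothing about the Boston results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

# Synthetic data (assumption, for illustration only)
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

bag = BaggingRegressor(
    n_estimators=50,   # enough trees that every row is out-of-bag at least once
    max_samples=0.8,
    oob_score=True,
    random_state=0,
).fit(X, y)

# R2 computed from predictions each row receives only from trees that did not see it
oob_r2 = bag.oob_score_
print(f"OOB R2: {oob_r2:.4f}")
```

The out-of-bag score comes from a single fit, so it can serve as an inexpensive cross-check on the cross-validated R² for bagged models.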
