# Regression Example

This notebook demonstrates basic concepts of regression problems, where the target variable is a continuous numeric value. Here, we predict the median value of homes in various districts of California.

## Imports

In [None]:
import pandas as pd
import time
from tqdm import tqdm

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

## Load the Data

In this special case, the dataset is available directly through Scikit-Learn: `fetch_california_housing` downloads it on first use and caches it locally. Convenient.

In [None]:
california_housing = fetch_california_housing()

Print the dataset description.

In [None]:
print(california_housing.DESCR)

### Transform the Data

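The loader returns the features and the target as separate arrays rather than as the DataFrame we are familiar with, so let's build one. We also have to add the target variable to the DataFrame explicitly. We name that column `MEDV`; per the dataset description, it holds the median house value in units of $100,000.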

In [None]:
df = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
df['MEDV'] = california_housing.target

## Training & Test Split

In [None]:
X = df.drop('MEDV', axis=1)
y = df['MEDV']

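# Fix random_state so the split (and the metrics below) are reproducible between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)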
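To make the idea behind the split concrete, here is a minimal pure-Python sketch of a shuffled hold-out split. The helper `shuffle_split`, its rounding rule, and the toy rows are assumptions for illustration only, not Scikit-Learn's implementation. In Scikit-Learn itself, passing `random_state` to `train_test_split` is what makes a split repeatable.

```python
import random


def shuffle_split(rows, test_size=0.2, seed=42):
    """Hypothetical helper: shuffle row indices, then cut off a test fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded, so the shuffle is repeatable
    n_test = int(round(len(rows) * test_size))  # assumed rounding rule
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))  # stand-in for 10 samples
train_a, test_a = shuffle_split(rows)
train_b, test_b = shuffle_split(rows)

print(len(train_a), len(test_a))                  # 8 2
print(train_a == train_b and test_a == test_b)    # True: same seed, same split
```

Every sample lands in exactly one of the two sets, which is what lets the test set act as unseen data during evaluation.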

## Initial Model Training

In [None]:
model = LinearRegression()
model = model.fit(X_train, y_train)
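As a rough illustration of what fitting a linear regression means, the sketch below computes the ordinary least squares line for a single feature in closed form. The helper `fit_simple_ols` and the toy data are hypothetical. `LinearRegression` solves the same least squares problem for many features at once, so this is a simplified picture rather than its actual code path.

```python
def fit_simple_ols(xs, ys):
    """Hypothetical helper: least squares slope and intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # toy data generated from y = 2x + 1

slope, intercept = fit_simple_ols(xs, ys)
print(slope, intercept)  # 2.0 1.0
```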

### Make Predictions

In [None]:
y_pred = model.predict(X_test)

## Evaluation

First, we manually inspect the actual and predicted values and the absolute error between them.

In [None]:
eval_df = X_test.copy()
eval_df["MEDV_actual"] = y_test
eval_df["MEDV_predicted"] = y_pred
eval_df["error"] = abs(eval_df["MEDV_actual"] - eval_df["MEDV_predicted"])
eval_df.sort_values(by="error")

### Individual Regression Metrics

#### Mean Absolute Error

The mean of the absolute errors over all predicted samples. It is expressed in the same units as the target variable.

In [None]:
mean_absolute_error(y_test, y_pred)
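The definition can be checked by hand on a few numbers. This is a minimal sketch with made-up toy values; the function name `mae` is a hypothetical stand-in, not the Scikit-Learn function.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values for illustration only
actual = [2.0, 3.0, 5.0, 4.0]
predicted = [2.5, 2.0, 5.0, 6.0]

# Absolute errors are 0.5, 1.0, 0.0 and 2.0, so the mean is 3.5 / 4
mae_value = mae(actual, predicted)
print(mae_value)  # 0.875
```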

#### Mean Squared Error

The mean of the squared errors over all predicted samples. Because each error is squared, large errors are penalized disproportionately more than small ones.

In [None]:
mean_squared_error(y_test, y_pred)
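A small hand calculation shows how squaring shifts the weight toward the largest error. This is a sketch on made-up toy values; the function name `mse` is a hypothetical stand-in, and the root of the result (RMSE) is included only because it brings the value back to the units of the target.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: average of (actual - predicted)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values for illustration only
actual = [2.0, 3.0, 5.0, 4.0]
predicted = [2.5, 2.0, 5.0, 6.0]

# Squared errors are 0.25, 1.0, 0.0 and 4.0, so the mean is 5.25 / 4
mse_value = mse(actual, predicted)
rmse_value = math.sqrt(mse_value)
print(mse_value)  # 1.3125
print(rmse_value)

# Share of the total contributed by the single largest error (2.0)
abs_errors = [abs(t - p) for t, p in zip(actual, predicted)]
sq_errors = [e ** 2 for e in abs_errors]
abs_share = max(abs_errors) / sum(abs_errors)
sq_share = max(sq_errors) / sum(sq_errors)
print(round(abs_share, 2), round(sq_share, 2))
```

In this toy case the largest error accounts for roughly 57% of the summed absolute error but roughly 76% of the summed squared error.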

#### R-Squared Value

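A statistical measure of the proportion of variance in the target variable that is explained by the model's predictions from the features. R-squared shows how well the regression model fits the data (*the goodness of fit*): 1.0 is a perfect fit, 0.0 is no better than always predicting the mean of the target, and the value can be negative on a test set when the model does worse than that.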

In [None]:
r2_score(y_test, y_pred)
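R-squared can be written as one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below applies that formula to made-up toy values; the function name `r_squared` is a hypothetical stand-in, not the Scikit-Learn function.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    return 1 - ss_res / ss_tot


# Toy values for illustration only
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.0, 2.5, 4.0, 5.0]

# SS_res = 0.5 and SS_tot = 10.0, so R^2 is about 0.95
r2_value = r_squared(actual, predicted)
print(round(r2_value, 4))  # 0.95

# Baseline: always predicting the mean of the target gives exactly 0
mean_baseline = [3.0] * len(actual)
baseline_r2 = r_squared(actual, mean_baseline)
print(baseline_r2)  # 0.0
```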

### Evaluating Different Regression Models

In [None]:
# Linear Regression as presented in the lecture
linear_regression = LinearRegression()

# Decision Trees for Regression:
# "criterion" parameter used to determine the quality of splits when constructing the decision tree
# - default: "squared_error"
# - alternative value: "absolute_error"
decision_tree_regression = DecisionTreeRegressor(criterion="squared_error")

# Random Forest for Regression:
# "criterion" parameter supported as above
# "n_estimators" - number of individual trees
random_forest_regression = RandomForestRegressor()

# Support Vector Machine for Regression
support_vector_regression = SVR()

regressors = [
    linear_regression,
    decision_tree_regression,
    random_forest_regression,
    support_vector_regression
]

model_metrics = []
for regressor in tqdm(regressors):
    
    # Train the regressor
    # perf_counter is a monotonic, high-resolution clock meant for timing intervals
    start_time = time.perf_counter()
    trained_model = regressor.fit(X_train, y_train)
    training_time_elapsed = time.perf_counter() - start_time
    
    # Apply trained regressor to test set
    start_time = time.perf_counter()
    predictions = trained_model.predict(X_test)
    prediction_time_elapsed = time.perf_counter() - start_time
    
    # Measure model performance
    mse = mean_squared_error(y_test, predictions)
    mae = mean_absolute_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    
    # Record model metrics
    model_metrics.append({
        "model": trained_model.__class__.__name__,
        "training_time": training_time_elapsed,
        "prediction_time": prediction_time_elapsed,
        "mse": mse,
        "mae": mae,
        "r2": r2
    })
    
# Print model metrics table
pd.DataFrame(model_metrics)
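The start/stop timing pattern used in the loop can be factored into a small helper. This is a minimal sketch: the name `timed` is hypothetical, and it assumes `time.perf_counter`, a standard library clock intended for measuring short intervals.

```python
import time


def timed(fn, *args, **kwargs):
    """Hypothetical helper: call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the loop this could wrap regressor.fit or .predict
total, seconds = timed(sum, range(1_000_000))
print(total)                      # 499999500000
print(f"{seconds:.6f} seconds")   # varies by machine
```

A single timing run is noisy, so the reported training and prediction times are best read as rough comparisons between models rather than precise benchmarks.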