---
title: "Practice Activity 7.1: Cross-Validation and Tuning"
format: 
  html:
    embed-resources: true
execute:
  echo: true
code-fold: true
author: James Compagno
jupyter: python3
---

In [None]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.compose import ColumnTransformer

In [None]:
ames = pd.read_csv("AmesHousing.csv")

In [None]:
# Separate predictors and target, then prepare the train/test split
X = ames.drop("SalePrice", axis=1)
y = ames["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
# Model Library 
model_library = {}
records = []

# Practice Activity

Once again consider four modeling options for house price:

- Using only the size and number of rooms.
- Using size, number of rooms, and building type.
- Using size and building type, and their interaction.
- Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.
    
Use cross_val_score with the pipelines you made earlier to find the cross-validated root mean squared error for each model.

Which do you prefer? Does this agree with your conclusion from earlier?
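
The cells below call `cross_val_score` three times per model, once per metric. As a possible alternative, `cross_validate` accepts a dictionary of scorers and fits each fold only once. The sketch below is a minimal illustration on synthetic data; the `demo` frame, its generated prices, and the `demo_`-prefixed names are assumptions made for the example and are not the Ames data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

# Synthetic stand-in for two Ames columns (assumption: not the real data).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Gr Liv Area": rng.uniform(500, 3000, n),
    "TotRms AbvGrd": rng.integers(3, 12, n),
})
demo_price = (
    100 * demo["Gr Liv Area"]
    + 2000 * demo["TotRms AbvGrd"]
    + rng.normal(0, 20000, n)
)

demo_pipe = Pipeline([
    ("preprocess", ColumnTransformer(
        [("num", "passthrough", ["Gr Liv Area", "TotRms AbvGrd"])],
        remainder="drop",
    )),
    ("linear_regression", LinearRegression()),
])

# One cross-validation pass that scores every fold on all three metrics.
scoring = {
    "rmse": "neg_root_mean_squared_error",
    "mse": "neg_mean_squared_error",
    "r2": "r2",
}
cv_results = cross_validate(demo_pipe, demo, demo_price, cv=5, scoring=scoring)

# The "neg_" scorers return negated errors, so flip the sign back.
demo_summary = {
    "RMSE Mean": -cv_results["test_rmse"].mean(),
    "MSE Mean": -cv_results["test_mse"].mean(),
    "R2 Mean": cv_results["test_r2"].mean(),
}
print(demo_summary)
```

Note that the mean of the per-fold RMSE values is generally a little smaller than the square root of the mean MSE, which is why the two columns in the results tables do not match exactly.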

## Using only the size and number of rooms.

In [None]:
# Model Name
model_name = "LivArea_TotRoom"

# Preprocessing
preprocess = ColumnTransformer(
    [("num", "passthrough", ["Gr Liv Area", "TotRms AbvGrd"])],
    remainder="drop",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

# Cross Validation Pipeline
LivArea_TotRoom = Pipeline([
    ("preprocess", preprocess),
    ("linear_regression", LinearRegression())
])

# Add to Library
model_library[model_name] = LivArea_TotRoom.fit(X, y)

# Metrics Calculation
rmse = cross_val_score(LivArea_TotRoom, X, y, cv=5, scoring='neg_root_mean_squared_error')
mse = cross_val_score(LivArea_TotRoom, X, y, cv=5, scoring='neg_mean_squared_error')
r2 = cross_val_score(LivArea_TotRoom, X, y, cv=5, scoring='r2')

# Metrics Storage 
records.append({
    "Model": model_name,
    "Regression Type": "Linear",
    "Split": "CV-5",
    "RMSE Mean": -rmse.mean(),
    "MSE Mean": -mse.mean(),
    "R2 Mean": r2.mean()
})

# Display
cumulative_models = (pd.DataFrame(records))
cumulative_models

Unnamed: 0,Model,Regression Type,Split,RMSE Mean,MSE Mean,R2 Mean
0,LivArea_TotRoom,Linear,CV-5,55806.326349,3136139000.0,0.504209


## Using size, number of rooms, and building type.

In [None]:
# Model Name
model_name = "LivArea_Rooms_BlgdType"

# Preprocessing
preprocess = ColumnTransformer(
    [
        ("dummify", OneHotEncoder(sparse_output=False), ["Bldg Type"]),
        ("num", "passthrough", ["Gr Liv Area", "TotRms AbvGrd"]),
    ],
    remainder="drop",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

# Cross Validation Pipeline
LivArea_Rooms_BlgdType = Pipeline([
    ("preprocess", preprocess),
    ("linear_regression", LinearRegression())
])

# Add to Library
model_library[model_name] = LivArea_Rooms_BlgdType.fit(X, y)

# Metrics Calculation 
rmse = cross_val_score(LivArea_Rooms_BlgdType, X, y, cv=5, scoring='neg_root_mean_squared_error')
mse = cross_val_score(LivArea_Rooms_BlgdType, X, y, cv=5, scoring='neg_mean_squared_error')
r2 = cross_val_score(LivArea_Rooms_BlgdType, X, y, cv=5, scoring='r2')

# Metrics Storage 
records.append({
    "Model": model_name,
    "Regression Type": "Linear",
    "Split": "CV-5",
    "RMSE Mean": -rmse.mean(),
    "MSE Mean": -mse.mean(),
    "R2 Mean": r2.mean()
})

# Display
cumulative_models = (pd.DataFrame(records))
cumulative_models

Unnamed: 0,Model,Regression Type,Split,RMSE Mean,MSE Mean,R2 Mean
0,LivArea_TotRoom,Linear,CV-5,55806.326349,3136139000.0,0.504209
1,LivArea_Rooms_BlgdType,Linear,CV-5,54168.081429,2951994000.0,0.532882


## Using size and building type, and their interaction.

In [None]:
# Model Name
model_name = "Size_Type_IntST"

# Preprocessing
preprocess = ColumnTransformer(
    [
        ("dummify", OneHotEncoder(sparse_output=False), ["Bldg Type"]),
        ("num", "passthrough", ["Gr Liv Area"]),
    ],
    remainder="drop",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

# Cross Validation Pipeline
Size_Type_IntST = Pipeline([
    ("preprocess", preprocess),
    ("interaction", PolynomialFeatures(interaction_only=True, include_bias=False)),
    ("linear_regression", LinearRegression())
])

# Add to Library
model_library[model_name] = Size_Type_IntST.fit(X, y)

# Metrics Calculation 
rmse = cross_val_score(Size_Type_IntST, X, y, cv=5, scoring='neg_root_mean_squared_error')
mse = cross_val_score(Size_Type_IntST, X, y, cv=5, scoring='neg_mean_squared_error')
r2 = cross_val_score(Size_Type_IntST, X, y, cv=5, scoring='r2')

# Metrics Storage 
records.append({
    "Model": model_name,
    "Regression Type": "Linear",
    "Split": "CV-5",
    "RMSE Mean": -rmse.mean(),
    "MSE Mean": -mse.mean(),
    "R2 Mean": r2.mean()
})

# Display
cumulative_models = (pd.DataFrame(records))
cumulative_models

Unnamed: 0,Model,Regression Type,Split,RMSE Mean,MSE Mean,R2 Mean
0,LivArea_TotRoom,Linear,CV-5,55806.326349,3136139000.0,0.504209
1,LivArea_Rooms_BlgdType,Linear,CV-5,54168.081429,2951994000.0,0.532882
2,Size_Type_IntST,Linear,CV-5,53430.921975,2871228000.0,0.544867


## Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.

In [None]:
# Model Name
model_name = "Poly5_Size_Rooms_BlgdType"

# Preprocessing
preprocess = ColumnTransformer(
    [
        ("dummify", OneHotEncoder(sparse_output=False), ["Bldg Type"]),
        ("poly_size", PolynomialFeatures(degree=5, include_bias=False), ["Gr Liv Area"]),
        ("poly_rooms", PolynomialFeatures(degree=5, include_bias=False), ["TotRms AbvGrd"]),
    ],
    remainder="drop",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

# Cross Validation Pipeline
Poly5_Size_Rooms_BlgdType = Pipeline([
    ("preprocess", preprocess),
    ("linear_regression", LinearRegression())
])

# Add to Library
model_library[model_name] = Poly5_Size_Rooms_BlgdType.fit(X, y)

# Metrics Calculation 
rmse = cross_val_score(Poly5_Size_Rooms_BlgdType, X, y, cv=5, scoring='neg_root_mean_squared_error')
mse = cross_val_score(Poly5_Size_Rooms_BlgdType, X, y, cv=5, scoring='neg_mean_squared_error')
r2 = cross_val_score(Poly5_Size_Rooms_BlgdType, X, y, cv=5, scoring='r2')

# Metrics Storage 
records.append({
    "Model": model_name,
    "Regression Type": "Linear",
    "Split": "CV-5",
    "RMSE Mean": -rmse.mean(),
    "MSE Mean": -mse.mean(),
    "R2 Mean": r2.mean()
})

# Display
cumulative_models = (pd.DataFrame(records))
cumulative_models

Unnamed: 0,Model,Regression Type,Split,RMSE Mean,MSE Mean,R2 Mean
0,LivArea_TotRoom,Linear,CV-5,55806.326349,3136139000.0,0.504209
1,LivArea_Rooms_BlgdType,Linear,CV-5,54168.081429,2951994000.0,0.532882
2,Size_Type_IntST,Linear,CV-5,53430.921975,2871228000.0,0.544867
3,Poly5_Size_Rooms_BlgdType,Linear,CV-5,56255.736345,3177429000.0,0.49714


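## Which was the best?

Size_Type_IntST is the best on all three metrics (lowest RMSE and MSE, highest R2), but only narrowly: its mean RMSE of about 53,431 is roughly 737 lower than the 54,168 of LivArea_Rooms_BlgdType.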
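
When two models are this close, it can help to look at the fold-by-fold scores rather than only the means. The sketch below is one hedged way to do that: it scores two models on identical folds and reports the spread. The data, the column names `size` and `rooms`, and the helper `fold_rmse` are all hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (assumption: illustrative only, not the Ames data).
rng = np.random.default_rng(1)
n = 300
demo_X = pd.DataFrame({
    "size": rng.uniform(500, 3000, n),
    "rooms": rng.integers(3, 12, n),
})
demo_y = 100 * demo_X["size"] + 3000 * demo_X["rooms"] + rng.normal(0, 25000, n)

# A fixed splitter guarantees both models see exactly the same folds.
folds = KFold(n_splits=5, shuffle=True, random_state=0)


def fold_rmse(columns):
    """Return the per-fold RMSE of a linear model on the given columns."""
    scores = cross_val_score(
        LinearRegression(),
        demo_X[columns],
        demo_y,
        cv=folds,
        scoring="neg_root_mean_squared_error",
    )
    return -scores


rmse_small = fold_rmse(["size"])
rmse_large = fold_rmse(["size", "rooms"])

# Paired differences: positive values mean the larger model did better on that fold.
diff = rmse_small - rmse_large

print("size only     :", rmse_small.mean(), "+/-", rmse_small.std())
print("size and rooms:", rmse_large.mean(), "+/-", rmse_large.std())
print("per-fold gain :", diff)
```

If the gap between the means is small compared with the fold-to-fold spread, the ranking of the two models should be read with caution.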

# 13.3.3 Practice Activity

Consider one hundred modeling options for house price:

- House size, trying degrees 1 through 10
- Number of rooms, trying degrees 1 through 10
- Building Type
    
Hint: The dictionary of possible values that you make to give to GridSearchCV will have two elements instead of one.

## House size, trying degrees 1 through 10

In [None]:
# Model Name
model_name = "Size_1_thru_10"
regression_type = "Linear"  

# Preprocessing
preprocess = ColumnTransformer(
    [
        ("polynomial", PolynomialFeatures(), ["Gr Liv Area"])
    ],
    remainder="drop",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

# Cross Validation Pipeline
pipe = Pipeline([
    ("preprocess", preprocess),
    ("linear_regression", LinearRegression())
])

# GridSearchCV 
degrees = {'preprocess__polynomial__degree': np.arange(1, 11)}  
gscv = GridSearchCV(pipe, degrees, cv=5, scoring='r2')
gscv.fit(X, y)

# Add best model to Library
model_library[model_name] = gscv.best_estimator_

# Get best degree
best_degree = gscv.best_params_['preprocess__polynomial__degree']

# Metrics Calculation using best model
rmse = cross_val_score(gscv.best_estimator_, X, y, cv=5, scoring='neg_root_mean_squared_error')
mse = cross_val_score(gscv.best_estimator_, X, y, cv=5, scoring='neg_mean_squared_error')
r2 = cross_val_score(gscv.best_estimator_, X, y, cv=5, scoring='r2')

# Metrics Storage 
records.append({
    "Model": model_name,
    "Regression Type": regression_type,
    "Split": "CV-5",
    "RMSE Mean": -rmse.mean(),
    "MSE Mean": -mse.mean(),
    "R2 Mean": r2.mean()
})

# Display
cumulative_models = (pd.DataFrame(records))
cumulative_models

Unnamed: 0,Model,Regression Type,Split,RMSE Mean,MSE Mean,R2 Mean
0,LivArea_TotRoom,Linear,CV-5,55806.326349,3136139000.0,0.504209
1,LivArea_Rooms_BlgdType,Linear,CV-5,54168.081429,2951994000.0,0.532882
2,Size_Type_IntST,Linear,CV-5,53430.921975,2871228000.0,0.544867
3,Poly5_Size_Rooms_BlgdType,Linear,CV-5,56255.736345,3177429000.0,0.49714
4,Size_1_thru_10,Linear,CV-5,55666.018798,3110883000.0,0.507396


## Number of rooms, trying degrees 1 through 10

In [None]:
# Model Name
model_name = "Rooms_1_thru_10"
regression_type = "Linear"  

# Preprocessing
preprocess = ColumnTransformer(
    [
        ("polynomial", PolynomialFeatures(), ["TotRms AbvGrd"])
    ],
    remainder="drop",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

# Cross Validation Pipeline
pipe = Pipeline([
    ("preprocess", preprocess),
    ("linear_regression", LinearRegression())
])

# GridSearchCV 
degrees = {'preprocess__polynomial__degree': np.arange(1, 11)}  
gscv = GridSearchCV(pipe, degrees, cv=5, scoring='r2')
gscv.fit(X, y)

# Add best model to Library
model_library[model_name] = gscv.best_estimator_

# Get best degree
best_degree = gscv.best_params_['preprocess__polynomial__degree']

# Metrics Calculation using best model
rmse = cross_val_score(gscv.best_estimator_, X, y, cv=5, scoring='neg_root_mean_squared_error')
mse = cross_val_score(gscv.best_estimator_, X, y, cv=5, scoring='neg_mean_squared_error')
r2 = cross_val_score(gscv.best_estimator_, X, y, cv=5, scoring='r2')

# Metrics Storage 
records.append({
    "Model": model_name,
    "Regression Type": regression_type,
    "Split": "CV-5",
    "RMSE Mean": -rmse.mean(),
    "MSE Mean": -mse.mean(),
    "R2 Mean": r2.mean()
})

# Display
cumulative_models = (pd.DataFrame(records))
cumulative_models

Unnamed: 0,Model,Regression Type,Split,RMSE Mean,MSE Mean,R2 Mean
0,LivArea_TotRoom,Linear,CV-5,55806.326349,3136139000.0,0.504209
1,LivArea_Rooms_BlgdType,Linear,CV-5,54168.081429,2951994000.0,0.532882
2,Size_Type_IntST,Linear,CV-5,53430.921975,2871228000.0,0.544867
3,Poly5_Size_Rooms_BlgdType,Linear,CV-5,56255.736345,3177429000.0,0.49714
4,Size_1_thru_10,Linear,CV-5,55666.018798,3110883000.0,0.507396
5,Rooms_1_thru_10,Linear,CV-5,69347.503639,4815620000.0,0.23522


## Building Type

In [None]:
# Model Name
model_name = "Building_Type"
regression_type = "Linear"  

# Preprocessing
preprocess = ColumnTransformer(
    [
        ("dummify", OneHotEncoder(sparse_output=False), ["Bldg Type"])
    ],
    remainder="drop",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

# Cross Validation Pipeline
pipe = Pipeline([
    ("preprocess", preprocess),
    ("linear_regression", LinearRegression())
])

# Add to Library
model_library[model_name] = pipe.fit(X, y)

# Metrics Calculation 
rmse = cross_val_score(pipe, X, y, cv=5, scoring='neg_root_mean_squared_error')
mse = cross_val_score(pipe, X, y, cv=5, scoring='neg_mean_squared_error')
r2 = cross_val_score(pipe, X, y, cv=5, scoring='r2')

# Metrics Storage 
records.append({
    "Model": model_name,
    "Regression Type": regression_type,
    "Split": "CV-5",
    "RMSE Mean": -rmse.mean(),
    "MSE Mean": -mse.mean(),
    "R2 Mean": r2.mean()
})

# Display
cumulative_models = (pd.DataFrame(records))
cumulative_models

Unnamed: 0,Model,Regression Type,Split,RMSE Mean,MSE Mean,R2 Mean
0,LivArea_TotRoom,Linear,CV-5,55806.326349,3136139000.0,0.504209
1,LivArea_Rooms_BlgdType,Linear,CV-5,54168.081429,2951994000.0,0.532882
2,Size_Type_IntST,Linear,CV-5,53430.921975,2871228000.0,0.544867
3,Poly5_Size_Rooms_BlgdType,Linear,CV-5,56255.736345,3177429000.0,0.49714
4,Size_1_thru_10,Linear,CV-5,55666.018798,3110883000.0,0.507396
5,Rooms_1_thru_10,Linear,CV-5,69347.503639,4815620000.0,0.23522
6,Building_Type,Linear,CV-5,78636.980412,6209124000.0,0.02075


## Questions

### Q1: Which model performed the best?

In [None]:
cumulative_models.sort_values('R2 Mean', ascending=False)

Unnamed: 0,Model,Regression Type,Split,RMSE Mean,MSE Mean,R2 Mean
2,Size_Type_IntST,Linear,CV-5,53430.921975,2871228000.0,0.544867
1,LivArea_Rooms_BlgdType,Linear,CV-5,54168.081429,2951994000.0,0.532882
4,Size_1_thru_10,Linear,CV-5,55666.018798,3110883000.0,0.507396
0,LivArea_TotRoom,Linear,CV-5,55806.326349,3136139000.0,0.504209
3,Poly5_Size_Rooms_BlgdType,Linear,CV-5,56255.736345,3177429000.0,0.49714
5,Rooms_1_thru_10,Linear,CV-5,69347.503639,4815620000.0,0.23522
6,Building_Type,Linear,CV-5,78636.980412,6209124000.0,0.02075


Size_Type_IntST (using size and building type, and their interaction) performed the best, with a mean R2 of about 0.545. This is likely because the interaction lets the effect of size on price differ by building type.

### Q2: What downsides do you see of trying all possible model options? How might you go about choosing a smaller number of tuning values to try?

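There are nearly infinite possible models once you consider tuning, so trying every option is computationally expensive, and choosing the best of many candidates raises the risk of overfitting to the validation folds. A smaller set of tuning values could be chosen by starting with a coarse grid and then refining around the values that perform best.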
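
One possible way to try fewer tuning values, offered here as a suggestion rather than a required method, is to sample a fixed number of combinations at random with `RandomizedSearchCV` instead of fitting the whole grid. The sketch below samples 15 of the 100 degree pairs on synthetic data; the `demo` frame, the generated prices, and the choice of `n_iter=15` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for two Ames columns (assumption: not the real data).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Gr Liv Area": rng.uniform(500, 3000, n),
    "TotRms AbvGrd": rng.integers(3, 12, n),
})
demo_price = (
    100 * demo["Gr Liv Area"]
    + 2000 * demo["TotRms AbvGrd"]
    + rng.normal(0, 20000, n)
)

demo_pipe = Pipeline([
    ("preprocess", ColumnTransformer(
        [
            ("poly_size", PolynomialFeatures(include_bias=False), ["Gr Liv Area"]),
            ("poly_rooms", PolynomialFeatures(include_bias=False), ["TotRms AbvGrd"]),
        ],
        remainder="drop",
    )),
    ("linear_regression", LinearRegression()),
])

# Same 10 x 10 space as a full grid, but only n_iter combinations are fitted.
param_distributions = {
    "preprocess__poly_size__degree": list(range(1, 11)),
    "preprocess__poly_rooms__degree": list(range(1, 11)),
}
demo_search = RandomizedSearchCV(
    demo_pipe,
    param_distributions,
    n_iter=15,
    cv=5,
    scoring="r2",
    random_state=0,
)
demo_search.fit(demo, demo_price)

print("combinations tried:", len(demo_search.cv_results_["params"]))
print("best degrees found:", demo_search.best_params_)
```

The trade-off is that a random sample may miss the single best combination, in exchange for far fewer model fits.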