---
title: "Practice Activity 7.1: Cross-Validation and Tuning"
format: 
  html:
    embed-resources: true
execute:
  echo: true
code-fold: true
author: James Compagno
jupyter: python3
---

In [54]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.compose import ColumnTransformer

In [43]:
ames = pd.read_csv("AmesHousing.csv")

In [44]:
# Load data and prepare train/test split
X = ames.drop("SalePrice", axis=1)
y = ames["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [45]:
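# Model library: fitted pipelines keyed by model name
model_library = {}
# One record of test metrics (RMSE, MSE, R2) per evaluated model
records = []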

# Practice Activity

Once again consider four modeling options for house price:

- Using only the size and number of rooms.
- Using size, number of rooms, and building type.
- Using size and building type, and their interaction.
- Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.
    
Use cross_val_score with the pipelines you made earlier to find the cross-validated root mean squared error for each model.

Which do you prefer? Does this agree with your conclusion from earlier?

## Using only the size and number of rooms.

In [46]:
preprocess = ColumnTransformer(
    [
        ("num", "passthrough", ["Gr Liv Area", "TotRms AbvGrd"]),
    ],
    remainder="drop",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

lr_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("linear_regression", LinearRegression())
])

lr_fitted = lr_pipeline.fit(X_train, y_train)
model_library["Size_RoomNums"] = lr_fitted 

In [47]:
y_test_pred = model_library["Size_RoomNums"].predict(X_test)
mse = mean_squared_error(y_test, y_test_pred) 
records.append({
    "Model": "Size_RoomNums", "Split": "Test",
    "RMSE": np.sqrt(mse), "MSE": mse, "R2": r2_score(y_test, y_test_pred)  
})

cumulative_models = pd.DataFrame(records)
cumulative_models

|   | Model | Split | RMSE | MSE | R2 |
|---|---|---|---|---|---|
| 0 | Size_RoomNums | Test | 51569.933624 | 2659458000.0 | 0.527276 |


## Using size, number of rooms, and building type.

In [48]:
preprocess = ColumnTransformer(
    [
        ("dummify", OneHotEncoder(sparse_output=False), ["Bldg Type"]),
        ("num", "passthrough", ["Gr Liv Area", "TotRms AbvGrd"]),
    ],
    remainder="drop",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

lr_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("linear_regression", LinearRegression())
])

lr_fitted = lr_pipeline.fit(X_train, y_train)
model_library["LivArea_Rooms_BlgdType"] = lr_fitted 

In [49]:
y_test_pred = model_library["LivArea_Rooms_BlgdType"].predict(X_test)
mse = mean_squared_error(y_test, y_test_pred) 
records.append({
    "Model": "LivArea_Rooms_BlgdType", "Split": "Test",
    "RMSE": np.sqrt(mse), "MSE": mse, "R2": r2_score(y_test, y_test_pred)  
})

cumulative_models = pd.DataFrame(records)
cumulative_models

|   | Model | Split | RMSE | MSE | R2 |
|---|---|---|---|---|---|
| 0 | Size_RoomNums | Test | 51569.933624 | 2659458000.0 | 0.527276 |
| 1 | LivArea_Rooms_BlgdType | Test | 50037.959382 | 2503797000.0 | 0.554945 |


## Using size and building type, and their interaction.

In [50]:
preprocess = ColumnTransformer(
    [
        ("dummify", OneHotEncoder(sparse_output=False), ["Bldg Type"]),
        ("num", "passthrough", ["Gr Liv Area"]),
    ],
    remainder="drop",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

lr_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("interaction", PolynomialFeatures(interaction_only=True, include_bias=False)),
    ("linear_regression", LinearRegression())
])

lr_fitted = lr_pipeline.fit(X_train, y_train)
model_library["Size_Type_IntST"] = lr_fitted

In [51]:
y_test_pred = model_library["Size_Type_IntST"].predict(X_test)
mse = mean_squared_error(y_test, y_test_pred)
records.append({
    "Model": "Size_Type_IntST", "Split": "Test",
    "RMSE": np.sqrt(mse), "MSE": mse, "R2": r2_score(y_test, y_test_pred)
})

cumulative_models = pd.DataFrame(records)
cumulative_models

|   | Model | Split | RMSE | MSE | R2 |
|---|---|---|---|---|---|
| 0 | Size_RoomNums | Test | 51569.933624 | 2659458000.0 | 0.527276 |
| 1 | LivArea_Rooms_BlgdType | Test | 50037.959382 | 2503797000.0 | 0.554945 |
| 2 | Size_Type_IntST | Test | 49719.972081 | 2472076000.0 | 0.560584 |


## Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.

In [52]:
preprocess = ColumnTransformer(
    [
        ("dummify", OneHotEncoder(sparse_output=False), ["Bldg Type"]),
        ("poly_size", PolynomialFeatures(degree=5, include_bias=False), ["Gr Liv Area"]),
        ("poly_rooms", PolynomialFeatures(degree=5, include_bias=False), ["TotRms AbvGrd"]),
    ],
    remainder="drop",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

lr_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("linear_regression", LinearRegression())
])

lr_fitted = lr_pipeline.fit(X_train, y_train)
model_library["Poly5_Size_Rooms_BlgdType"] = lr_fitted

In [53]:
y_test_pred = model_library["Poly5_Size_Rooms_BlgdType"].predict(X_test)
mse = mean_squared_error(y_test, y_test_pred)
records.append({
    "Model": "Poly5_Size_Rooms_BlgdType", "Split": "Test",
    "RMSE": np.sqrt(mse), "MSE": mse, "R2": r2_score(y_test, y_test_pred)
})

cumulative_models = pd.DataFrame(records)
cumulative_models

|   | Model | Split | RMSE | MSE | R2 |
|---|---|---|---|---|---|
| 0 | Size_RoomNums | Test | 51569.933624 | 2659458000.0 | 0.527276 |
| 1 | LivArea_Rooms_BlgdType | Test | 50037.959382 | 2503797000.0 | 0.554945 |
| 2 | Size_Type_IntST | Test | 49719.972081 | 2472076000.0 | 0.560584 |
| 3 | Poly5_Size_Rooms_BlgdType | Test | 53868.788587 | 2901846000.0 | 0.484191 |


# 13.3.3 Practice Activity

Consider one hundred modeling options for house price:

- House size, trying degrees 1 through 10
- Number of rooms, trying degrees 1 through 10
- Building Type
    
Hint: The dictionary of possible values that you make to give to GridSearchCV will have two elements instead of one.

Q1: Which model performed the best?

Q2: What downsides do you see of trying all possible model options? How might you go about choosing a smaller number of tuning values to try?