---
title: Practice Activity 7.1
author: Sneha Narayanan
format:
    html:
        toc: true
        code-fold: true
embed-resources: true
theme: "Lumen"

---

In [64]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,OneHotEncoder
import numpy as np

In [24]:
lr = LinearRegression()


ames = pd.read_csv("AmesHousing.csv")
X = ames[["Gr Liv Area", "TotRms AbvGrd"]]
y = ames["SalePrice"]



X_train, X_test, y_train, y_test = train_test_split(X, y)

X_train_s = (X_train - X_train.mean())/X_train.std()

lr_fitted = lr.fit(X_train_s, y_train)
lr_fitted.coef_

array([ 72135.32477084, -19621.69346988])

In [25]:
ames.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 82 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Order            2930 non-null   int64  
 1   PID              2930 non-null   int64  
 2   MS SubClass      2930 non-null   int64  
 3   MS Zoning        2930 non-null   object 
 4   Lot Frontage     2440 non-null   float64
 5   Lot Area         2930 non-null   int64  
 6   Street           2930 non-null   object 
 7   Alley            198 non-null    object 
 8   Lot Shape        2930 non-null   object 
 9   Land Contour     2930 non-null   object 
 10  Utilities        2930 non-null   object 
 11  Lot Config       2930 non-null   object 
 12  Land Slope       2930 non-null   object 
 13  Neighborhood     2930 non-null   object 
 14  Condition 1      2930 non-null   object 
 15  Condition 2      2930 non-null   object 
 16  Bldg Type        2930 non-null   object 
 17  House Style   

In [26]:
y_preds = lr_fitted.predict(X_test)

r2_score(y_test, y_preds)

-2138675.009717495

In [27]:
y_preds[1:5]

array([5.20193810e+07, 1.14160788e+08, 1.28463204e+08, 9.78253123e+07])

### 1.1 What went wrong here?

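> Only X_train was standardized above, so the model was fit on standardized inputs but then asked to predict from the raw X_test values. The test set must be standardized too, using the mean and standard deviation computed from X_train.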

In [28]:
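# First attempt: scale X_test with its own mean and std (In [33] redoes this with the X_train statistics)
X_test_s = (X_test - X_test.mean())/X_test.std()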

In [29]:
lr_fitted = lr.fit(X_train_s, y_train)

In [30]:
y_preds = lr_fitted.predict(X_test_s)

In [31]:
r2_score(y_test, y_preds)

0.5315048666104758

In [32]:
y_preds[1:5]

array([ 87929.55265831, 184120.69587976, 224311.52937401, 177058.11743042])

In [33]:
X_test_s = (X_test - X_train.mean())/X_train.std()
y_preds = lr_fitted.predict(X_test_s)

r2_score(y_test, y_preds)

0.5319987360146061
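The X_train-based standardization in the last cell is the same operation that StandardScaler performs when it is fit on training data and then applied to new rows. The sketch below checks this on made-up numbers (the toy values are assumptions, not the Ames data). One detail: pandas' `.std()` divides by n - 1 by default while StandardScaler divides by n, so the scaled values agree only when `ddof=0` is passed. For ordinary least squares with an intercept, that constant rescaling does not change the predictions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for X_train / X_test (made-up values, not the Ames data)
toy_train = pd.DataFrame({"Gr Liv Area": [900.0, 1500.0, 2100.0, 1200.0],
                          "TotRms AbvGrd": [5.0, 6.0, 8.0, 5.0]})
toy_test = pd.DataFrame({"Gr Liv Area": [1000.0, 1800.0],
                         "TotRms AbvGrd": [4.0, 7.0]})

# Manual version: the statistics come from the training rows only
train_mean = toy_train.mean()
train_std = toy_train.std(ddof=0)  # ddof=0 matches StandardScaler's divisor
manual_test_s = (toy_test - train_mean) / train_std

# StandardScaler: fit on train, then reuse the stored statistics on test
scaler = StandardScaler().fit(toy_train)
sklearn_test_s = scaler.transform(toy_test)

print(np.allclose(manual_test_s.to_numpy(), sklearn_test_s))
```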

## Chapter 2

In [34]:
new_house = pd.DataFrame(data = {"Gr Liv Area": [889], "TotRms AbvGrd": [6]})
new_house

|   | Gr Liv Area | TotRms AbvGrd |
|---|---|---|
| 0 | 889 | 6 |


In [35]:
new_house_s = (new_house - new_house.mean())/new_house.std()
new_house_s

|   | Gr Liv Area | TotRms AbvGrd |
|---|---|---|
| 0 | NaN | NaN |


### 1.2 What happened this time, and how can we fix it?

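> The new_house DataFrame has a single row, so new_house.mean() equals the row itself and the numerator is 0, while the sample standard deviation of a single value is undefined, so new_house.std() is NaN. Dividing by NaN gives NaN in every cell. The fix is to standardize the new house with the mean and standard deviation of X_train, the same statistics used to standardize the data in the previous question.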
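A quick check of this explanation, using the same one-row new_house frame: pandas' default sample standard deviation divides by n - 1, which is zero for a single row, so `.std()` returns NaN and the division propagates it.

```python
import pandas as pd

new_house = pd.DataFrame({"Gr Liv Area": [889], "TotRms AbvGrd": [6]})

centered = new_house - new_house.mean()   # every entry is 0 for a single row
spread = new_house.std()                  # sample std (ddof=1) of one value is NaN

print(centered)
print(spread)
print(centered / spread)                  # dividing by NaN gives NaN
```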

In [36]:
new_house_s = (new_house - X_train.mean())/X_train.std()
lr_fitted.predict(new_house_s)

array([98172.14734894])

In [39]:
lr_pipeline = Pipeline(
  [("standardize", StandardScaler()),
  ("linear_regression", LinearRegression())]
)

lr_pipeline

In [40]:
lr_pipeline_fitted = lr_pipeline.fit(X_train, y_train)

y_preds = lr_pipeline_fitted.predict(X_test)
r2_score(y_test, y_preds)

0.5319987360146061

In [41]:
lr_pipeline_fitted.predict(new_house)

array([98172.14734894])

In [44]:
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(
  [
    ("dummify", OneHotEncoder(sparse_output = False), ["Bldg Type"]),
    ("standardize", StandardScaler(), ["Gr Liv Area", "TotRms AbvGrd"])
  ],
  remainder = "drop"
)


lr_pipeline = Pipeline(
  [("preprocessing", ct),
  ("linear_regression", LinearRegression())]
)

lr_pipeline

In [45]:
X = ames.drop("SalePrice", axis = 1)
y = ames["SalePrice"]



X_train, X_test, y_train, y_test = train_test_split(X, y)

lr_fitted = lr_pipeline.fit(X_train, y_train)

In [46]:
ct_fitted = ct.fit(X_train)

ct.transform(X_train)

array([[ 1.        ,  0.        ,  0.        , ...,  0.        ,
         1.10983122,  0.35814925],
       [ 1.        ,  0.        ,  0.        , ...,  0.        ,
        -0.40673219, -0.91096994],
       [ 1.        ,  0.        ,  0.        , ...,  0.        ,
         0.09542777, -0.27641035],
       ...,
       [ 1.        ,  0.        ,  0.        , ...,  0.        ,
        -0.00137415,  0.35814925],
       [ 1.        ,  0.        ,  0.        , ...,  0.        ,
         0.35154952,  0.99270884],
       [ 1.        ,  0.        ,  0.        , ...,  0.        ,
        -0.11430972, -0.91096994]])

In [47]:
ct.transform(X_test)

array([[ 1.        ,  0.        ,  0.        , ...,  0.        ,
        -1.55020487, -1.54552953],
       [ 1.        ,  0.        ,  0.        , ...,  0.        ,
         0.23256382, -0.27641035],
       [ 1.        ,  0.        ,  0.        , ...,  0.        ,
        -0.03767487,  0.35814925],
       ...,
       [ 1.        ,  0.        ,  0.        , ...,  0.        ,
         1.6765258 ,  0.35814925],
       [ 1.        ,  0.        ,  0.        , ...,  0.        ,
         0.35759964, -1.54552953],
       [ 1.        ,  0.        ,  0.        , ...,  0.        ,
        -0.28572979, -0.27641035]])

In [51]:
# lr_fitted is the ColumnTransformer pipeline fitted in In [45]
lr_fitted.named_steps['linear_regression'].coef_

array([-1.43418958e+18, -1.43418958e+18, -1.43418958e+18, -1.43418958e+18,
       -1.43418958e+18,  6.66880000e+04, -1.10080000e+04])

In [52]:
lr_pipeline = Pipeline(
  [("preprocessing", ct),
  ("linear_regression", LinearRegression())]
).set_output(transform="pandas")


ct.fit_transform(X_train)

|   | dummify__Bldg Type_1Fam | dummify__Bldg Type_2fmCon | dummify__Bldg Type_Duplex | dummify__Bldg Type_Twnhs | dummify__Bldg Type_TwnhsE | standardize__Gr Liv Area | standardize__TotRms AbvGrd |
|---|---|---|---|---|---|---|---|
| 2590 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.109831 | 0.358149 |
| 951 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.406732 | -0.910970 |
| 2166 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.095428 | -0.276410 |
| 805 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.071514 | 0.992709 |
| 2644 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.932361 | 1.627268 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 593 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.259799 | -0.910970 |
| 383 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.797242 | 0.358149 |
| 2224 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.001374 | 0.358149 |
| 1133 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.351550 | 0.992709 |


In [54]:
from sklearn.preprocessing import PolynomialFeatures
ct_inter = ColumnTransformer(
  [
    ("interaction", PolynomialFeatures(interaction_only = True), ["Gr Liv Area", "TotRms AbvGrd"])
  ],
  remainder = "drop"
).set_output(transform = "pandas")

ct_inter.fit_transform(X_train)

|   | interaction__1 | interaction__Gr Liv Area | interaction__TotRms AbvGrd | interaction__Gr Liv Area TotRms AbvGrd |
|---|---|---|---|---|
| 2590 | 1.0 | 2039.0 | 7.0 | 14273.0 |
| 951 | 1.0 | 1287.0 | 5.0 | 6435.0 |
| 2166 | 1.0 | 1536.0 | 6.0 | 9216.0 |
| 805 | 1.0 | 2020.0 | 8.0 | 16160.0 |
| 2644 | 1.0 | 1951.0 | 9.0 | 17559.0 |
| ... | ... | ... | ... | ... |
| 593 | 1.0 | 864.0 | 5.0 | 4320.0 |
| 383 | 1.0 | 1884.0 | 7.0 | 13188.0 |
| 2224 | 1.0 | 1488.0 | 7.0 | 10416.0 |
| 1133 | 1.0 | 1663.0 | 8.0 | 13304.0 |


In [55]:
ct_dummies = ColumnTransformer(
  [("dummify", OneHotEncoder(sparse_output = False), ["Bldg Type"])],
  remainder = "passthrough"
).set_output(transform = "pandas")

ct_inter = ColumnTransformer(
  [
    ("interaction", PolynomialFeatures(interaction_only = True), ["remainder__TotRms AbvGrd", "dummify__Bldg Type_1Fam"]),
  ],
  remainder = "drop"
).set_output(transform = "pandas")

X_train_dummified = ct_dummies.fit_transform(X_train)
X_train_dummified

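|   | dummify__Bldg Type_1Fam | dummify__Bldg Type_2fmCon | dummify__Bldg Type_Duplex | dummify__Bldg Type_Twnhs | dummify__Bldg Type_TwnhsE | remainder__Order | remainder__PID | remainder__MS SubClass | remainder__MS Zoning | remainder__Lot Frontage | ... | remainder__Screen Porch | remainder__Pool Area | remainder__Pool QC | remainder__Fence | remainder__Misc Feature | remainder__Misc Val | remainder__Mo Sold | remainder__Yr Sold | remainder__Sale Type | remainder__Sale Condition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2590 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2591 | 535325460 | 30 | RL | 118.0 | ... | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2006 | COD | Abnorml |
| 951 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 952 | 914475030 | 85 | RL | 72.0 | ... | 120 | 0 | NaN | GdWo | NaN | 0 | 11 | 2009 | WD | Normal |
| 2166 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2167 | 907420110 | 60 | RL | 64.0 | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2007 | WD | Normal |
| 805 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 806 | 906223210 | 60 | RL | 80.0 | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2009 | WD | Normal |
| 2644 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2645 | 902110120 | 190 | RM | 56.0 | ... | 0 | 0 | NaN | NaN | NaN | 0 | 4 | 2006 | WD | Normal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 593 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 594 | 534200030 | 20 | RL | 62.0 | ... | 0 | 0 | NaN | GdPrv | NaN | 0 | 6 | 2009 | WD | Normal |
| 383 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 384 | 527364030 | 60 | FV | 70.0 | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2009 | WD | Normal |
| 2224 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2225 | 909428340 | 20 | RL | NaN | ... | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2007 | CWD | Normal |
| 1133 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1134 | 531373060 | 60 | RL | 67.0 | ... | 0 | 0 | NaN | NaN | NaN | 0 | 7 | 2008 | WD | Normal |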


In [56]:
ct_inter.fit_transform(X_train_dummified)

|   | interaction__1 | interaction__remainder__TotRms AbvGrd | interaction__dummify__Bldg Type_1Fam | interaction__remainder__TotRms AbvGrd dummify__Bldg Type_1Fam |
|---|---|---|---|---|
| 2590 | 1.0 | 7.0 | 1.0 | 7.0 |
| 951 | 1.0 | 5.0 | 1.0 | 5.0 |
| 2166 | 1.0 | 6.0 | 1.0 | 6.0 |
| 805 | 1.0 | 8.0 | 1.0 | 8.0 |
| 2644 | 1.0 | 9.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... |
| 593 | 1.0 | 5.0 | 1.0 | 5.0 |
| 383 | 1.0 | 7.0 | 1.0 | 7.0 |
| 2224 | 1.0 | 7.0 | 1.0 | 7.0 |
| 1133 | 1.0 | 8.0 | 1.0 | 8.0 |


# Consider four possible models for predicting house prices:

1. Using only the size and number of rooms.

2. Using size, number of rooms, and building type.

3. Using size and building type, and their interaction.

4. Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.

Set up a pipeline for each of these four models.

Then, get predictions on the test set for each of your pipelines, and compute the root mean squared error. Which model performed best?

Note: You should only use the function train_test_split() one time in your code; that is, we should be predicting on the same test set for all four models.

In [57]:
ames = pd.read_csv("AmesHousing.csv")
X = ames[["Gr Liv Area", "TotRms AbvGrd", "Bldg Type"]]
y = ames["SalePrice"]

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

## 1. Using only the size and number of rooms.

In [59]:
pipeline1 = Pipeline([
    ("preprocessing", ColumnTransformer([
        ("scaler", StandardScaler(), ["Gr Liv Area", "TotRms AbvGrd"])
    ])),
    ("linear_regression", LinearRegression())
])

## 2. Using size, number of rooms, and building type.

In [60]:
pipeline2 = Pipeline([
    ("preprocessing", ColumnTransformer([
        ("scaler", StandardScaler(), ["Gr Liv Area", "TotRms AbvGrd"]),
        ("dummify", OneHotEncoder(drop='first', sparse_output=False), ["Bldg Type"])
    ])),
    ("linear_regression", LinearRegression())
])

## 3. Using size and building type, and their interaction.

In [61]:
# Note: as written, the interaction term multiplies Gr Liv Area by TotRms AbvGrd
# (both unscaled), not Gr Liv Area by Bldg Type as the heading describes.
pipeline3 = Pipeline([
    ("preprocessing", ColumnTransformer([
        ("scaler", StandardScaler(), ["Gr Liv Area"]),
        ("dummify", OneHotEncoder(drop='first', sparse_output=False), ["Bldg Type"]),
        ("interaction", PolynomialFeatures(degree=2, interaction_only=True, include_bias=False), ["Gr Liv Area", "TotRms AbvGrd"])
    ])),
    ("linear_regression", LinearRegression())
])

## 4. Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.

In [62]:
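# Note: PolynomialFeatures(degree=5) applied to both columns together also generates
# size x rooms cross terms, and its degree-1 columns repeat the two scaled columns unscaled.
pipeline4 = Pipeline([
    ("preprocessing", ColumnTransformer([
        ("scaler", StandardScaler(), ["Gr Liv Area", "TotRms AbvGrd"]),
        ("dummify", OneHotEncoder(drop='first', sparse_output=False), ["Bldg Type"]),
        ("poly_features", PolynomialFeatures(degree=5, include_bias=False), ["Gr Liv Area", "TotRms AbvGrd"])
    ])),
    ("linear_regression", LinearRegression())
])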

In [65]:
pipelines = [pipeline1, pipeline2, pipeline3, pipeline4]
rmse_scores = []

for i, pipeline in enumerate(pipelines, 1):

    pipeline.fit(X_train, y_train)
    y_preds = pipeline.predict(X_test)

    rmse = np.sqrt(mean_squared_error(y_test, y_preds))
    rmse_scores.append((f"Model {i}", rmse))
    print(f"Model {i} RMSE: {rmse}")

Model 1 RMSE: 59261.71322786229
Model 2 RMSE: 57078.21809431249
Model 3 RMSE: 57193.19477873403
Model 4 RMSE: 61320.01793306126


In [66]:
best_model = min(rmse_scores, key=lambda x: x[1])
print("\nBest model:", best_model)


Best model: ('Model 2', 57078.21809431249)


::: {.callout-tip title="Best model"}
Model 2 has the lowest test RMSE on this split (about 57,078), which suggests that combining size, room count, and building type adds useful signal without overfitting. Model 4, with its degree-5 polynomial terms, has the highest RMSE (about 61,320), which is consistent with overfitting.
:::

# 2. Once again consider four modeling options for house price:

1. Using only the size and number of rooms.
2. Using size, number of rooms, and building type.
3. Using size and building type, and their interaction.
4. Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.

Use cross_val_score with the pipelines you made earlier to find the cross-validated root mean squared error for each model.

Which do you prefer? Does this agree with your conclusion from earlier?
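
As a point of reference, cross-validated RMSE can also be requested through scikit-learn's built-in `"neg_root_mean_squared_error"` scoring string. The sketch below uses synthetic data and a two-step pipeline (both assumptions for illustration, not the Ames data or the four pipelines above).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration, not the Ames data)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "Gr Liv Area": rng.uniform(800, 2500, size=100),
    "TotRms AbvGrd": rng.integers(3, 10, size=100),
})
toy_y = (100 * toy_X["Gr Liv Area"] + 5000 * toy_X["TotRms AbvGrd"]
         + rng.normal(0, 10000, size=100))

toy_pipeline = Pipeline([
    ("standardize", StandardScaler()),
    ("linear_regression", LinearRegression()),
])

# scikit-learn scorers follow "higher is better", so RMSE comes back negated
neg_rmse = cross_val_score(toy_pipeline, toy_X, toy_y, cv=5,
                           scoring="neg_root_mean_squared_error")
rmse_per_fold = -neg_rmse
print(rmse_per_fold.mean())
```

Because each fold refits the whole pipeline, the scaler's statistics come only from that fold's training rows.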

In [74]:
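from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, mean_squared_error
import numpy as np

# rmse_scorer is used below but its definition is missing from this export; an assumed reconstruction:
rmse_scorer = make_scorer(lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred)))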

In [75]:
pipelines = [pipeline1, pipeline2, pipeline3, pipeline4]
rmse_scores = []

for i, pipeline in enumerate(pipelines, 1):
    scores = cross_val_score(pipeline, X, y, cv=5, scoring=rmse_scorer)
    rmse_mean = np.mean(scores)
    rmse_scores.append((f"Model {i}", rmse_mean))
    print(f"Model {i} Cross-Validated RMSE: {rmse_mean:.2f}")

Model 1 Cross-Validated RMSE: 55806.33
Model 2 Cross-Validated RMSE: 54168.08
Model 3 Cross-Validated RMSE: 54135.58
Model 4 Cross-Validated RMSE: 60218.42




In [76]:
best_model = min(rmse_scores, key=lambda x: x[1])
print(f"\nBest model based on cross-validated RMSE: {best_model}")


Best model based on cross-validated RMSE: ('Model 3', 54135.58145435537)


# 3. Consider one hundred modeling options for house price:

1. House size, trying degrees 1 through 10
2. Number of rooms, trying degrees 1 through 10
3. Building Type

Hint: The dictionary of possible values that you make to give to GridSearchCV will have two elements instead of one.

Q1: Which model performed the best?

Q2: What downsides do you see of trying all possible model options? How might you go about choosing a smaller number of tuning values to try?

In [77]:
from sklearn.model_selection import GridSearchCV

ct_poly = ColumnTransformer([
    ("poly_size", PolynomialFeatures(), ["Gr Liv Area"]),
    ("poly_rooms", PolynomialFeatures(), ["TotRms AbvGrd"]),
    ("dummify", OneHotEncoder(drop='first', sparse_output=False), ["Bldg Type"])
])

lr_pipeline_poly = Pipeline([
    ("preprocessing", ct_poly),
    ("linear_regression", LinearRegression())
])

param_grid = {
    'preprocessing__poly_size__degree': np.arange(1, 11),  
    'preprocessing__poly_rooms__degree': np.arange(1, 11)   
}

gscv = GridSearchCV(lr_pipeline_poly, param_grid, cv=5, scoring='neg_root_mean_squared_error')

gscv.fit(X_train, y_train)

best_model = gscv.best_estimator_
best_params = gscv.best_params_
best_rmse = -gscv.best_score_

In [78]:
print(f"Best model parameters: {best_params}")

Best model parameters: {'preprocessing__poly_rooms__degree': 3, 'preprocessing__poly_size__degree': 4}


In [82]:
gscv = GridSearchCV(lr_pipeline_poly, param_grid, cv=5, scoring='r2')
gscv.fit(X_train, y_train)

In [84]:
best_r2 = gscv.best_score_
best_r2

0.5687643251926724

In [79]:
print(f"Best cross-validated RMSE: {best_rmse}")

Best cross-validated RMSE: 51079.05053522197


### Q1 Which model performed the best?

The best model, as indicated by GridSearchCV, uses:

1. A 3rd-degree polynomial transformation for the Number of Rooms.
2. A 4th-degree polynomial transformation for House Size.
3. Building Type included as a categorical feature.

This combination achieved the lowest cross-validated RMSE of 51,079.05, suggesting that this model configuration best balances complexity and predictive accuracy within the range of options tested.
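
To see how close the runner-up combinations are, the full grid of results can be tabulated from `cv_results_`. A minimal sketch on synthetic data with a deliberately small grid (the data, column names, and grid are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (hypothetical columns "size" and "rooms")
rng = np.random.default_rng(1)
toy_X = pd.DataFrame({
    "size": rng.uniform(0, 3, size=120),
    "rooms": rng.uniform(0, 3, size=120),
})
toy_y = 2 * toy_X["size"] ** 2 + toy_X["rooms"] + rng.normal(0, 0.5, size=120)

toy_pipeline = Pipeline([
    ("preprocessing", ColumnTransformer([
        ("poly_size", PolynomialFeatures(include_bias=False), ["size"]),
        ("poly_rooms", PolynomialFeatures(include_bias=False), ["rooms"]),
    ])),
    ("linear_regression", LinearRegression()),
])
toy_grid = {
    "preprocessing__poly_size__degree": [1, 2, 3],
    "preprocessing__poly_rooms__degree": [1, 2, 3],
}
toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=5,
                          scoring="neg_root_mean_squared_error").fit(toy_X, toy_y)

# One row per parameter combination; flip the sign to get RMSE
results = pd.DataFrame(toy_search.cv_results_)
results["mean_rmse"] = -results["mean_test_score"]
top = results.sort_values("rank_test_score")[
    ["param_preprocessing__poly_size__degree",
     "param_preprocessing__poly_rooms__degree",
     "mean_rmse", "std_test_score"]
].head(3)
print(top)
```

If the top few rows differ by less than the fold-to-fold standard deviation, the simpler of them may be the more defensible choice.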

### Q2 What downsides do you see of trying all possible model options? How might you go about choosing a smaller number of tuning values to try?

1. Computational Cost: Testing 100 combinations is time-consuming and resource-intensive, especially for larger datasets.
2. Overfitting: Higher-degree polynomials can overfit, capturing noise that doesn't generalize well.
3. Diminishing Returns: Higher degrees often offer minimal improvements, adding unnecessary complexity.
4. Interpretability: Complex models are harder to interpret, making feature relationships less clear.

### Strategies to Reduce Tuning Values

1. Coarse Grid Search: Test a few key degrees (e.g., 1, 3, 5) to gauge trends.
2. Randomized Search: Randomly sample parameter combinations to explore efficiently.
3. Incremental Refinement: Start with a small set and refine based on results.
4. Regularization: Use techniques like Ridge or Lasso to reduce overfitting with higher degrees.
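
The randomized-search strategy could look like the sketch below, which samples 15 of the 100 degree combinations instead of fitting all of them. The synthetic data, the column names `size` and `rooms`, and the choice of `n_iter=15` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (hypothetical columns, not the Ames data)
rng = np.random.default_rng(2)
toy_X = pd.DataFrame({
    "size": rng.uniform(0, 2, size=150),
    "rooms": rng.uniform(0, 2, size=150),
})
toy_y = toy_X["size"] ** 3 - toy_X["rooms"] + rng.normal(0, 0.3, size=150)

toy_pipeline = Pipeline([
    ("preprocessing", ColumnTransformer([
        ("poly_size", PolynomialFeatures(include_bias=False), ["size"]),
        ("poly_rooms", PolynomialFeatures(include_bias=False), ["rooms"]),
    ])),
    ("linear_regression", LinearRegression()),
])

# Same 10 x 10 space as the full grid, but only n_iter combinations are fitted
toy_distributions = {
    "preprocessing__poly_size__degree": list(range(1, 11)),
    "preprocessing__poly_rooms__degree": list(range(1, 11)),
}
toy_search = RandomizedSearchCV(
    toy_pipeline, toy_distributions, n_iter=15, cv=5,
    scoring="neg_root_mean_squared_error", random_state=0,
)
toy_search.fit(toy_X, toy_y)

print(toy_search.best_params_)
print(len(toy_search.cv_results_["params"]))  # number of combinations actually tried
```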