# Model Tuning

### Contents:
- [Setup & Train/Test/Split](#Setup-&-Train/Test/Split)
- [Data Transformations](#Data-Transformations)
- [Model Fitting!](#Model-Fitting!)
- [R2, RMSE, and Baseline Values - Linear Regression](#R2,-RMSE,-and-Baseline-Values---Linear-Regression)
- [Model Tuning - Regularization](#Model-Tuning---Regularization)
- [Summary](#Summary)

### Setup & Train/Test/Split
---

In [1]:
#Library Imports
import pandas as pd
import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.metrics import r2_score, mean_squared_error

In [2]:
#Read in relevant csvs
train_clean = pd.read_csv('../datasets/train_clean.csv')
validate_clean = pd.read_csv('../datasets/validate_clean.csv')

In [3]:
#Interaction terms code

train_clean['kitchen_qual * overall_qual * exter_qual'] = train_clean['kitchen_qual'] * train_clean['overall_qual'] * train_clean['exter_qual']
validate_clean['kitchen_qual * overall_qual * exter_qual'] = validate_clean['kitchen_qual'] * validate_clean['overall_qual'] * validate_clean['exter_qual']
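
Because the same product is built for both data frames, one option is to wrap it in a small helper so the two stay in sync. This is a minimal sketch, assuming the three quality columns are already numeric (for example, ordinal-encoded). The helper name `add_quality_interaction` and the toy values are hypothetical.

```python
import pandas as pd

INTERACTION = 'kitchen_qual * overall_qual * exter_qual'


def add_quality_interaction(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three-way quality interaction column added."""
    out = df.copy()
    out[INTERACTION] = out['kitchen_qual'] * out['overall_qual'] * out['exter_qual']
    return out


# Toy rows, not taken from the housing data
toy = pd.DataFrame({'kitchen_qual': [4, 3],
                    'overall_qual': [7, 5],
                    'exter_qual': [3, 3]})
toy = add_quality_interaction(toy)
print(toy[INTERACTION].tolist())
```

In the notebook this would be applied once to `train_clean` and once to `validate_clean`.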


In [4]:
#Features in use
features = ['neighborhood',
            'overall_cond',
            'bldg_type',
            'kitchen_qual',
            'central_air',
            'gr_liv_area',
            'garage_area',
            'total_bsmt_sf',
            '1st_flr_sf',
            'kitchen_qual * overall_qual * exter_qual',
            'bedroom_abvgr',
            'overall_qual',
            'exter_qual',
            'year_built']

In [5]:
#Test/Train Data
X = train_clean[features]
y = train_clean['saleprice']

#Validate Data
val = validate_clean[features]

#Train/Test/Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 24)
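
With no `test_size` or `train_size` given, `train_test_split` holds out 25% of the rows. A quick sketch of the resulting split sizes, assuming 2,051 input rows (a figure inferred from the 1,538-row training split that shows up in the baseline output, not stated in the notebook). The stand-in arrays are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: the 2,051-row count is an assumption, not read from the CSV
X_demo = np.arange(2051).reshape(-1, 1)
y_demo = np.arange(2051)

# Default split: 25% of rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=24)

print(len(X_tr), len(X_te))
```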

### Data Transformations
___

In [6]:
# Simple Imputing
si = SimpleImputer(strategy = 'most_frequent').set_output(transform = 'pandas')
imputefeatures = ['bedroom_abvgr']

X_train[imputefeatures] = si.fit_transform(X_train[imputefeatures])
X_test[imputefeatures] = si.transform(X_test[imputefeatures])
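
The imputer learns the most frequent value from the training split only and reuses it on the test split, which keeps test information out of the fit. A small sketch on made-up bedroom counts follows. Taking explicit `.copy()`s of the frames before assigning back is an addition not in the original cell; it is one way to avoid pandas' `SettingWithCopyWarning`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values, not from the housing data
train_demo = pd.DataFrame({'bedroom_abvgr': [3.0, 3.0, 2.0, np.nan]}).copy()
test_demo = pd.DataFrame({'bedroom_abvgr': [np.nan, 4.0]}).copy()

si_demo = SimpleImputer(strategy='most_frequent')
cols = ['bedroom_abvgr']

# fit_transform learns the mode (3.0) from the training rows only
train_demo[cols] = si_demo.fit_transform(train_demo[cols])
# transform reuses that same training mode for the test rows
test_demo[cols] = si_demo.transform(test_demo[cols])

print(train_demo['bedroom_abvgr'].tolist())
print(test_demo['bedroom_abvgr'].tolist())
```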

In [7]:
#Transform the data with ColumnTransformer
ohe = OneHotEncoder(drop = 'first',
                    handle_unknown = 'ignore',
                    sparse_output = False)

ctx = ColumnTransformer(
    transformers =[
        ('one_hot', ohe, ['neighborhood', 'bldg_type']),
        ('ss', StandardScaler(), ['bedroom_abvgr', '1st_flr_sf', 'garage_area', 'total_bsmt_sf'])
    ], remainder = 'passthrough',
    verbose_feature_names_out = False
)

In [8]:
#Fit and transform the training set
X_train_ctx = pd.DataFrame(ctx.fit_transform(X_train),
                           columns = ctx.get_feature_names_out())

X_test_ctx = pd.DataFrame(ctx.transform(X_test),
                           columns = ctx.get_feature_names_out())

### Model Fitting!
---

Moment of Truth

In [9]:
#Instantiate Linear Regression Model
lr = LinearRegression()

In [10]:
# Fit the Model
lr.fit(X_train_ctx, y_train)

### R2, RMSE, and Baseline Values - Linear Regression
___

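R2 values are fairly high (0.852 on the training data, 0.909 on the test data). The test score being higher than the train score points to some bias rather than overfitting. RMSE is around 30,900 for the train data and 23,000 for the test data, which may be considered high for the type of model this is.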
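
For reference, both metrics can be computed by hand. R2 is 1 minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual. A minimal sketch with made-up prices follows. The helper names `r2` and `rmse` are hypothetical.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up sale prices and predictions; every prediction is off by 10,000
y_true_demo = [200_000, 150_000, 250_000, 180_000]
y_pred_demo = [210_000, 140_000, 240_000, 190_000]

print(round(r2(y_true_demo, y_pred_demo), 4))
print(rmse(y_true_demo, y_pred_demo))
```

Because RMSE is in the target's units, the values reported below can be read directly as dollars of sale price.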

In [11]:
print(f'Train R2 value is {round(lr.score(X_train_ctx, y_train),3)}')
print(f'Test R2 value is {round(lr.score(X_test_ctx, y_test),3)}')
# Note: this cross-validates on the test split
print(f'Cross validation scores are {cross_val_score(lr, X_test_ctx, y_test)}')

Train R2 value is 0.852
Test R2 value is 0.909
Cross validation scores are [0.88891351 0.92252331 0.91099414 0.89835395 0.9138616 ]
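
The cross-validation call above runs on the held-out test split. A more common pattern is to cross-validate on the training split and keep the test split for a single final check. This is a sketch on synthetic data (generated with `make_regression`, not the housing data). All names ending in `_demo` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression problem standing in for the housing features
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=24)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=24)

lr_demo = LinearRegression()

# 5-fold cross-validation on the training split only (one R2 per fold)
cv_scores = cross_val_score(lr_demo, X_tr, y_tr, cv=5)
print(cv_scores.mean())

# Fit once on the full training split, then score the untouched test split
lr_demo.fit(X_tr, y_tr)
print(lr_demo.score(X_te, y_te))
```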


In [12]:
# y_true goes first, then the predictions (RMSE is symmetric, so the values are unchanged)
print(f'Root Mean Squared Error of Training Data: {mean_squared_error(y_train, lr.predict(X_train_ctx), squared = False)}')
print(f'Root Mean Squared Error of Test Data: {mean_squared_error(y_test, lr.predict(X_test_ctx), squared = False)}')

Root Mean Squared Error of Training Data: 30877.46332284694
Root Mean Squared Error of Test Data: 22953.24174771525


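The baseline predicts the mean sale price of the training set (about 181,821) for every house. Note that the cells below score the linear regression's predictions against that constant rather than scoring the constant against actual prices, so the very large negative R2 reflects a target with essentially no variance, and the final line repeats the linear model's own test RMSE.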

In [13]:
#Baseline Score
baseline = y_train.mean()
baseline

181820.62353706112

In [14]:
b_model = LinearRegression()

In [15]:
baseline_series = pd.Series([baseline] * len(X_train_ctx))
baseline_series

0       181820.623537
1       181820.623537
2       181820.623537
3       181820.623537
4       181820.623537
            ...      
1533    181820.623537
1534    181820.623537
1535    181820.623537
1536    181820.623537
1537    181820.623537
Length: 1538, dtype: float64

In [16]:
b_model.fit(X_train_ctx, baseline_series)

In [17]:
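# `lr` (not `b_model`) is scored here, with the constant baseline series as the target
print(f'Train R2 value is {round(lr.score(X_train_ctx, baseline_series),3)}')
# Distance between lr's training predictions and the constant baseline
print(f'Root Mean Squared Error of Training Data: {mean_squared_error(lr.predict(X_train_ctx), baseline_series, squared = False)}')
# The baseline is not involved in this line
print(f'Root Mean Squared Error of Test Data: {mean_squared_error(lr.predict(X_test_ctx), y_test, squared = False)}')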

Train R2 value is -7.196162493191385e+29
Root Mean Squared Error of Training Data: 74066.51100852448
Root Mean Squared Error of Test Data: 22953.24174771525
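
To score a mean-price baseline directly, one option is to predict the training mean for every row and compare that constant with the actual prices. By construction this gives an R2 of 0 on the training data and a value that is never above 0 on the test data. This is a minimal sketch with made-up numbers (not the housing data). The `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets standing in for y_train and y_test
y_train_demo = np.array([100.0, 200.0, 300.0])
y_test_demo = np.array([150.0, 250.0, 350.0])

# The baseline "model" is just the training mean, repeated for every row
baseline_demo = y_train_demo.mean()
train_pred = np.full(len(y_train_demo), baseline_demo)
test_pred = np.full(len(y_test_demo), baseline_demo)

# Actual values go first, the constant predictions second
train_r2 = r2_score(y_train_demo, train_pred)
test_r2 = r2_score(y_test_demo, test_pred)
test_rmse = np.sqrt(mean_squared_error(y_test_demo, test_pred))

print(train_r2, test_r2, test_rmse)
```

In the notebook, the equivalent comparison would be between `np.full(len(y_test), y_train.mean())` and `y_test`. scikit-learn's `DummyRegressor(strategy='mean')` wraps the same idea in an estimator.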


### Model Tuning - Regularization
---

#### Ridge Regression

I kept getting a best alpha of 1 when running a Ridge Regression. After talking with Hank, I switched between np.logspace and np.linspace for the alpha grid. Several iterations show that the best alpha value to use is ~0.68. However, the R2 score doesn't change much.

After running a ridge regression on my model, I don't see any improvement in model scores.
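
A best alpha that keeps landing on the same round value is sometimes a sign that it sits on the edge of the search grid, which is one reason switching between `np.logspace` and `np.linspace` can change the answer. The sketch below only compares the two kinds of grid. The edge check `on_grid_edge` and the fine-grid bounds are illustrative assumptions, not the settings used in this notebook.

```python
import numpy as np

# Coarse grid: 1,000 points spread evenly in log10 space from 1e-5 to 1e10
coarse = np.logspace(-5, 10, 1000)

# Fine grid: evenly spaced points in a narrow window (bounds are hypothetical)
fine = np.linspace(0.1, 2.0, 191)  # step of 0.01


def on_grid_edge(best_alpha, grid):
    """True if the chosen alpha is the smallest or largest candidate in the grid."""
    lo, hi = grid.min(), grid.max()
    return bool(np.isclose(best_alpha, lo, rtol=1e-9, atol=0.0)
                or np.isclose(best_alpha, hi, rtol=1e-9, atol=0.0))


# Neighbouring logspace points differ by a constant ratio, not a constant step
ratio = coarse[1] / coarse[0]
print(coarse.min(), coarse.max(), ratio)

# A grid starting at 10**0 has 1.0 as its lower edge
print(on_grid_edge(1.0, np.logspace(0, 5, 100)))
# 0.68 sits inside the fine grid, away from both edges
print(on_grid_edge(0.68, fine))
```

If the selected alpha is on an edge, widening the grid in that direction before refining is a reasonable next step.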

In [None]:
# Setting up a list of ridge alphas to check 
alphas = np.logspace(-5, 10, 1000)

#Cross validation
ridge_cv = RidgeCV(alphas = alphas, cv = 5)

#Fit the model; RidgeCV picks the best alpha by cross-validation
ridge_cv.fit(X_train_ctx, y_train)


In [None]:
ridge_cv.alpha_

In [None]:
ridge_cv.best_score_

In [None]:
print(f'Training score with ridge cv: {ridge_cv.score(X_train_ctx, y_train)}')
# Different from .best_score_ (the mean cross-validated score) because this scores on the entire training set
print(f'Testing score with ridge cv: {ridge_cv.score(X_test_ctx, y_test)}')

#### Lasso Regression

After running several iterations of np.linspace and np.logspace, the best alpha value came out to ~11.

In [None]:
#Lasso alphas to check
l_alphas = np.linspace(0.1, 100, 1000)

#Cross validation
lasso_cv = LassoCV(alphas = l_alphas)

#Fit model
lasso_cv.fit(X_train_ctx, y_train)

In [None]:
lasso_cv.alpha_

In [None]:
print(lasso_cv.score(X_train_ctx, y_train))
print(lasso_cv.score(X_test_ctx, y_test))

In [None]:
lasso_cv.coef_
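
The raw `coef_` array is easier to read when paired with the transformed column names. Lasso can also set some coefficients exactly to zero, so it is natural to list which features were dropped. This is a sketch with a made-up coefficient array standing in for `lasso_cv.coef_`. The values and the shortened column list are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for lasso_cv.coef_ and X_train_ctx.columns
coef_demo = np.array([1250.0, 0.0, -310.5, 0.0, 42.0])
cols_demo = ['gr_liv_area', 'central_air', 'overall_cond',
             'bedroom_abvgr', 'year_built']

coefs = pd.Series(coef_demo, index=cols_demo)

# Features whose coefficient was shrunk all the way to zero
dropped = coefs[coefs == 0].index.tolist()

# Remaining features, ordered by absolute coefficient size
nonzero = coefs[coefs != 0]
kept = nonzero.reindex(nonzero.abs().sort_values(ascending=False).index)

print(dropped)
print(kept)
```

In the notebook, the equivalent starting point would be `pd.Series(lasso_cv.coef_, index=X_train_ctx.columns)`.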

### Summary
___

I believe this model is a significant improvement over the baseline model. I went through various features of the dataset and used visualizations to check for relationships between the features and sale price. If I could see a relationship, I cleaned the feature data and added it to my model. I repeated the process until I could no longer generate a significant improvement in metric scores. The model has somewhat high bias, but I never reached the point of overfitting.