# Prediction of Singapore's housing price using simple regression based models
## Regression Modelling notebook

This project predicts the resale price of a house located in Singapore, using simple regression models.   
This notebook only covers building the regression models.  
The analytical solution to least squares regression is given by:  
<img src="../img/least_square_sol.png"></img>  
I will not go into the details here.  
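
As a quick numerical check of the analytical solution, here is a minimal sketch on synthetic data. It assumes the usual normal equation form, beta = (X^T X)^-1 X^T y. The data and all names (`X_demo`, `y_demo`, `beta_hat`) are made up for illustration and are not part of this project.

```python
import numpy as np

# Synthetic data (hypothetical): 100 samples, 3 features, known coefficients.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y_demo = 4.0 + X_demo @ true_beta + rng.normal(scale=0.1, size=100)

# Add a column of ones so the intercept is estimated with the other coefficients.
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Normal equation: solve (X^T X) beta = X^T y rather than inverting X^T X explicitly.
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_demo)

# Reference solution from NumPy's least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

print("normal equation:", np.round(beta_hat, 3))
print("np.linalg.lstsq:", np.round(beta_lstsq, 3))
```

Both approaches should recover values close to the intercept 4.0 and the coefficients 2.0, -1.0 and 0.5 used to generate the data.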

This notebook shows the process of building regression model(s) to predict housing prices in Singapore, which falls in the area of Machine Learning.  

# Problem Statement

The goal of this project is to build a regression model, using data contained in the [datasets](../datasets) folder. The model should be able to make an accurate prediction of the resale price (`resale_price`) of the house, for every house id (`Id`) that appeared in the [test set](../datasets/test.csv).  
Success will be evaluated with common metrics such as Mean Absolute Error (MAE) and Mean Squared Error (MSE), in addition to the R squared score.

Motivation:  
While this is a toy project for the purpose of learning, it shows the importance of prediction models.  
House owners who are looking to sell their property, property agents and those seeking to purchase a house all stand to benefit from this model.  

Contents:  
1. [Single Model](#Single-Model)   
    1.1 [Preprocessing](#Preprocessing)  
    1.2 [Linear Regression](#Linear-Regression)  
    1.3 [Lasso Regression](#Lasso-Regression)  
    1.4 [Ridge Regression](#Ridge-Regression)  
    1.5 [Elastic Net](#Elastic-Net-Regression)   
    1.6 [Model Evaluation](#Model-Evaluation)  
2. [Combined Model](#Combined-Model)  
    2.1 [Data Selection]  
    2.2 [Preprocessing]   
    2.3 [Regression Model]  
    2.4 [Grid Search Cross Validation]  
    2.5 [Model Evaluation]  

In [1]:
import functools
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, ElasticNet, ElasticNetCV, LassoCV, Lasso, Ridge, RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn import metrics

## Single Model

This section covers the most basic models.   
I will start with simple preprocessing, then build simple models and perform Grid Search Cross Validation.  
Then, the models with tuned parameters will be examined and evaluated.

In [2]:
# read in the cleaned data used in model creation
df = pd.read_csv('../datasets/reduced_train.csv')
df.head()

Unnamed: 0,id,flat_type,floor_area_sqm,flat_model,resale_price,Tranc_Year,mid,hdb_age,max_floor_lvl,planning_area,Mall_Nearest_Distance,Hawker_Nearest_Distance,mrt_nearest_distance,bus_interchange,mrt_interchange,cutoff_point
0,88471,4 ROOM,90.0,Model A,680000.0,2016,11,15,25,Kallang,1094.090418,154.753357,330.083069,0,0,224
1,122598,5 ROOM,130.0,Improved,665000.0,2012,8,34,9,Bishan,866.941448,640.151925,903.659703,1,1,232
2,170897,EXECUTIVE,144.0,Apartment,838000.0,2013,14,24,16,Bukit Batok,1459.579948,1762.082341,1334.251197,1,0,188
3,86070,4 ROOM,103.0,Model A,550000.0,2012,3,29,11,Bishan,950.175199,726.215262,907.453484,1,1,253
4,153632,4 ROOM,83.0,Simplified,298000.0,2017,2,34,4,Yishun,729.771895,1540.151439,412.343032,0,0,208


In [3]:
df.shape

(149772, 16)

In [4]:
df.isnull().sum()

id                         0
flat_type                  0
floor_area_sqm             0
flat_model                 0
resale_price               0
Tranc_Year                 0
mid                        0
hdb_age                    0
max_floor_lvl              0
planning_area              0
Mall_Nearest_Distance      0
Hawker_Nearest_Distance    0
mrt_nearest_distance       0
bus_interchange            0
mrt_interchange            0
cutoff_point               0
dtype: int64

## Preprocessing

Split the features into categorical and numerical sets.  
This is a very basic preprocessing step: the categorical features go through One Hot Encoding and the numerical features go through a Standard Scaler.

In [5]:
categorical_col = ['flat_type', 'flat_model', 'Tranc_Year', 'planning_area', 'bus_interchange', 'mrt_interchange']
numerical_col = ['floor_area_sqm', 'mid', 'hdb_age', 'max_floor_lvl', 'Mall_Nearest_Distance', 'Hawker_Nearest_Distance', 'mrt_nearest_distance', 'cutoff_point']

In [6]:
# This is the target of the prediction
y = df.resale_price

In [7]:
# These are the predictors used to predict the resale price
X = df.drop(columns=['id', 'resale_price'])

In [8]:
# make a pipeline to apply standard scaler on selected features
numeric_transformer = Pipeline(
    steps=[("simple impute", SimpleImputer(missing_values=np.nan, strategy='mean')),
           ("scaler", StandardScaler())
            ]
)

In [9]:
# make a pipeline to apply one hot encoding on selected features
categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore", drop='first')),
    ]
)

In [10]:
# the above 2 steps form the column transformer, which is used for the preprocessing step.

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numerical_col),
        ("cat", categorical_transformer, categorical_col),
    ]
)
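
To make the effect of this preprocessing concrete, here is a minimal sketch on a toy frame. The values, `toy`, `toy_preprocessor` and `toy_features` are hypothetical. `sparse_threshold=0` is set only so that the result is a dense array that is easy to print, and `handle_unknown` is left out to keep the sketch small.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (hypothetical values) with one numerical and one categorical column.
toy = pd.DataFrame({
    "floor_area_sqm": [60.0, 90.0, 110.0, 140.0],
    "flat_type": ["3 ROOM", "4 ROOM", "5 ROOM", "4 ROOM"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["floor_area_sqm"]),
        ("cat", OneHotEncoder(drop="first"), ["flat_type"]),
    ],
    sparse_threshold=0,  # always return a dense array so it is easy to inspect
)

toy_features = toy_preprocessor.fit_transform(toy)

# 1 scaled numerical column + (3 categories - 1 dropped) one hot columns
print(toy_features.shape)
print(toy_features)
```

With `drop="first"`, a categorical column with k categories contributes k - 1 columns, which is why the number of features after preprocessing is larger than the number of columns in `X`.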

Then, prepare the data used for training and testing of the model.  
As this is a regression problem, `train_test_split` cannot stratify the split on y.   
This is because the y value, the housing price, is numerical and continuous.   
Stratification needs every value to appear in both the training set and the testing set, which is impossible when many prices occur only once in the entire sample.  

In [11]:
# Split the data using train test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

With this, the data is prepared and ready to be used for training the models.  

In [12]:
X_train.head()

Unnamed: 0,flat_type,floor_area_sqm,flat_model,Tranc_Year,mid,hdb_age,max_floor_lvl,planning_area,Mall_Nearest_Distance,Hawker_Nearest_Distance,mrt_nearest_distance,bus_interchange,mrt_interchange,cutoff_point
88970,4 ROOM,104.0,Model A,2020,11,32,12,Yishun,578.64016,480.057997,1156.371509,1,0,235
129062,3 ROOM,81.0,New Generation,2019,2,43,13,Clementi,346.376514,315.873061,383.278479,1,0,231
41882,4 ROOM,101.0,Model A,2018,2,23,10,Woodlands,535.174902,1272.548649,619.800951,1,1,204
139675,4 ROOM,99.0,New Generation,2020,14,38,13,Jurong East,1449.158899,353.35574,411.382755,0,0,223
140141,5 ROOM,110.0,Improved,2018,8,20,14,Sembawang,442.246864,2414.752074,401.245937,1,0,188


## Linear Regression

This section covers the building of a simple linear regression model.  

In [13]:
# Build a pipeline for regression model
lin_reg = Pipeline(
    steps=[("preprocessor", preprocessor), ("regressor", LinearRegression())]
)

In [14]:
# Fit the model with training data
lin_reg.fit(X_train, y_train)

In [15]:
print(f"The train score is: {round(lin_reg.score(X_train, y_train), 6)}")
print(f"The test score is: {round(lin_reg.score(X_test, y_test), 6)}")

The train score is: 0.891913
The test score is: 0.891237


This is a reasonable model with very close train and test scores.   
The small gap suggests that the model is not overfitting, so this is a decent model!  
Now, I will further investigate this model using cross validation.

In [16]:
# Build a regression model with cross validation
folds = KFold(n_splits = 5, shuffle = True, random_state = 0)
scores = cross_val_score(lin_reg, X_train, y_train, scoring='r2', cv=folds)

In [17]:
scores

array([0.8913561 , 0.89231941, 0.89244822, 0.8928205 , 0.88969684])

The cross validation scores are good and very consistent, as shown by the close match between the different folds.  
Below is a simpler way to do it. Passing `cv=5` uses 5 unshuffled folds, so the scores may differ slightly from the shuffled `KFold` above.  
```python
scores = cross_val_score(lin_reg, X_train, y_train, scoring='r2', cv=5)
scores
```
Okay, this is acceptable.  
Now, time to try the other methods!

## Lasso Regression

Lasso (least absolute shrinkage and selection operator, also written LASSO) is a regression analysis method.  
It is also known as L1 regularization.  
It is able to shrink some coefficients to exactly 0.  
Lasso Regression is given by:  
<img src="../img/lasso.png"></img>  
Using my own naive understanding, lasso is solving for an optimal point.  
The solution appears where the elliptical contours of the loss touch the constraint region.   
Solutions often occur at vertices, much like in Linear Programming.  
These vertices happen to lie on the axes, where one or more coefficients are 0.  
This is why lasso can shrink some coefficients all the way to 0.  
Amazing mathematics at work!

In [18]:
# Build a pipeline for regression model
# default setting did not converge
lasso = Pipeline(
    steps=[("preprocessor", preprocessor), ("Lasso", Lasso(max_iter=10000))]
)

In [19]:
lasso.fit(X_train, y_train)

In [20]:
print(f"The train score is: {round(lasso.score(X_train, y_train), 6)}")
print(f"The test score is: {round(lasso.score(X_test, y_test), 6)}")

The train score is: 0.891912
The test score is: 0.891241


Hmmmm, this is a very close match to linear regression!  
How about a look at cross validation?  

In [32]:
lasso_CV = Pipeline(
    steps=[("preprocessor", preprocessor), ("Lasso CV", LassoCV(max_iter=10000, cv=5))]
)

In [33]:
lasso_CV.fit(X_train, y_train)

In [34]:
print(f"The train score is: {round(lasso_CV.score(X_train, y_train), 6)}")
print(f"The test score is: {round(lasso_CV.score(X_test, y_test), 6)}")

The train score is: 0.886419
The test score is: 0.886501


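This is in fact lower than simply fitting with `Lasso()`.  
Hmmm....   
`LassoCV` refits on the whole training set once it has picked an alpha, so a smaller training size is unlikely to be the cause.  
A more likely reason is that `LassoCV` chooses alpha from an automatically generated grid, whose smallest value may be well above the default `alpha=1` for a target of this scale.  
Anyway, I will continue to play with different alpha values.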
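
One way to check which alpha values `LassoCV` actually tried is to inspect its fitted attributes `alpha_` and `alphas_`. The sketch below uses synthetic data with a large target scale, loosely mimicking prices; the data and the names `X_demo`, `y_demo` and `cv_model` are hypothetical. In this notebook the equivalent check would be `lasso_CV[1].alpha_` and `lasso_CV[1].alphas_`.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data (hypothetical) with a target in the hundreds of thousands.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
coefs = np.array([50_000.0, 20_000.0, 0.0, 0.0, 5_000.0])
y_demo = 400_000 + X_demo @ coefs + rng.normal(scale=10_000, size=300)

cv_model = LassoCV(cv=5, max_iter=10000).fit(X_demo, y_demo)

# alpha_ is the value picked by cross validation, alphas_ is the grid that was searched.
print("chosen alpha:        ", cv_model.alpha_)
print("smallest alpha tried:", cv_model.alphas_.min())
print("largest alpha tried: ", cv_model.alphas_.max())
```

With the default settings, the smallest alpha in the grid is 1/1000 of the largest, so on a large target scale the grid may never get close to `alpha=1`.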

In [35]:
# Specify range of hyperparameters to tune
lasso_params = {'alpha':[1, 10, 100, 1000]}

In [36]:
# Perform grid search using GridSearchCV()
grid_search = GridSearchCV(Lasso(max_iter=20000),
                           param_grid=lasso_params,
                           cv=5,
                           verbose = 3,
                           return_train_score=True
                           )

lasso_grid_search_CV = Pipeline(
    steps=[("preprocessor", preprocessor), ("Lasso Grid Search", grid_search)]
) 

Warning: the cell below takes a long time to run (the alpha = 1 fits alone took roughly 2 minutes each), so you may not want to run it...

In [37]:
lasso_grid_search_CV.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV 1/5] END .......alpha=1;, score=(train=0.892, test=0.893) total time= 1.9min
[CV 2/5] END .......alpha=1;, score=(train=0.892, test=0.891) total time= 2.2min
[CV 3/5] END .......alpha=1;, score=(train=0.892, test=0.892) total time= 2.6min
[CV 4/5] END .......alpha=1;, score=(train=0.892, test=0.892) total time= 2.2min
[CV 5/5] END .......alpha=1;, score=(train=0.892, test=0.890) total time= 1.5min
[CV 1/5] END ......alpha=10;, score=(train=0.891, test=0.893) total time=  22.6s
[CV 2/5] END ......alpha=10;, score=(train=0.892, test=0.891) total time=  45.2s
[CV 3/5] END ......alpha=10;, score=(train=0.892, test=0.891) total time=  10.2s
[CV 4/5] END ......alpha=10;, score=(train=0.892, test=0.892) total time=  38.3s
[CV 5/5] END ......alpha=10;, score=(train=0.892, test=0.890) total time=  10.1s
[CV 1/5] END .....alpha=100;, score=(train=0.886, test=0.888) total time=   0.7s
[CV 2/5] END .....alpha=100;, score=(train=0.886,

How naive...   
I started with `lasso_params = {'alpha':[0.01, 0.1, 1, 10, 100]}`.  
All alpha values below 1 ran into convergence issues...  
For `alpha = 0.01`, the scores from grid search are around `(train=0.892, test=0.891+/-0.002)`.  
They also took 2.6 minutes to run 10000 iterations before concluding that the solution did not converge.  
This is not acceptable, as it takes too long for a simple model.  
So the final alphas to search are in the range of 1 to 1000.  
Alpha is the constant that multiplies the L1 term, controlling regularization strength.  
So the higher the alpha, the more aggressively a complex model is penalized.  
The higher the penalization, the faster the convergence: a larger alpha corresponds to a smaller constraint region, so more coefficients are likely pushed to 0 and there is less left to optimize.  
This is evident from the difference in total time used to compute the models.
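
In the cells above, the grid search sits inside the pipeline, so the preprocessor is fitted once on all of `X_train` before the folds are made. An alternative is to wrap the whole pipeline in `GridSearchCV`, so that preprocessing is refitted within each fold. The sketch below shows the idea on synthetic data. `StandardScaler` stands in for the notebook's `preprocessor`, and the data and the names `demo_pipeline` and `demo_search` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical); the notebook would use X_train, y_train.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.1, size=200)

# StandardScaler stands in for the notebook's full `preprocessor`.
demo_pipeline = Pipeline(
    steps=[("preprocessor", StandardScaler()), ("regressor", Lasso(max_iter=20000))]
)

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
demo_search = GridSearchCV(
    demo_pipeline,
    param_grid={"regressor__alpha": [0.01, 0.1, 1.0]},
    cv=5,
    return_train_score=True,
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(round(demo_search.best_score_, 4))
```

With this layout the results are read from the search object directly, for example `demo_search.cv_results_`, instead of indexing into the pipeline.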

In [38]:
lasso_grid_search_CV[1].best_score_

0.891718599782989

In [39]:
# cv results
cv_results = pd.DataFrame(lasso_grid_search_CV[1].cv_results_)
cv_results[['mean_fit_time', 'param_alpha', 'mean_test_score', 'mean_train_score']]

Unnamed: 0,mean_fit_time,param_alpha,mean_test_score,mean_train_score
0,125.67075,1,0.891719,0.891932
1,25.261951,10,0.891458,0.891685
2,0.988853,100,0.885856,0.886109
3,0.173646,1000,0.817216,0.8174


Looking at the cross validation results, alpha = 1 gives the best result.  
The result becomes worse as alpha increases.  
A better alpha may exist between 0 and 1.  
But I think the difference is minor and it doesn't warrant the time spent searching for it.  
It is also possible (very likely in fact) that alpha = 0 would give the best result.  
That is simply linear regression, as in the previous section.  
If Lasso Regression must be used, I would use alpha = 100.  
The model is solved much faster (about 1 second per fit instead of about 126 seconds), with a small loss in performance (mean test score of 0.886 against 0.892).  
At alpha = 1000 the score drops to 0.817, which is too large a loss.  

## Ridge Regression

Ridge regression is a regularized regression method that is often used on data that suffers from multicollinearity.   
This method performs L2 regularization.  
It is UNABLE to shrink coefficients to exactly 0.  
Ridge Regression is given by:  
<img src="../img/ridge.png"></img>  
Using my own naive understanding, Ridge is also solving for an optimal point.  
As with Lasso, the solution appears where the elliptical contours touch the constraint region.   
Unlike Lasso, the constraint region for Ridge is circular.   
It has no vertices on the axes, so the coefficients generally do not shrink all the way to 0.  
Amazing mathematics at work again!
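
The contrast between the L1 and L2 penalties can be seen on a small synthetic example. This is only an illustrative sketch: the data, the choice of `alpha=0.5` and the names `X_demo`, `y_demo`, `lasso_coef` and `ridge_coef` are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (hypothetical): only the first of five features drives the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

lasso_coef = Lasso(alpha=0.5).fit(X_demo, y_demo).coef_
ridge_coef = Ridge(alpha=0.5).fit(X_demo, y_demo).coef_

print("Lasso:", np.round(lasso_coef, 3))
print("Ridge:", np.round(ridge_coef, 3))
print("Lasso coefficients equal to zero:", int(np.sum(lasso_coef == 0)))
print("Ridge coefficients equal to zero:", int(np.sum(ridge_coef == 0)))
```

Lasso sets the coefficients of the irrelevant features to exactly 0, while Ridge leaves them small but non-zero.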

In [40]:
# Build a pipeline for regression model
# Ridge with default settings (alpha=1.0)
ridge = Pipeline(
    steps=[("preprocessor", preprocessor), ("Ridge", Ridge())]
)

In [41]:
ridge.fit(X_train, y_train)

In [42]:
print(f"The train score is: {round(ridge.score(X_train, y_train), 6)}")
print(f"The test score is: {round(ridge.score(X_test, y_test), 6)}")

The train score is: 0.891556
The test score is: 0.890854


It has a slightly worse performance compared to both Lasso and Linear Regression.  
Now I will investigate with CV.

In [43]:
ridge_CV = Pipeline(
    steps=[("preprocessor", preprocessor), ("Ridge CV", RidgeCV(alphas=[0.1, 1, 10, 100, 1000], store_cv_values=True, alpha_per_target=True))]
)

In [44]:
ridge_CV.fit(X_train, y_train)

In [45]:
print(f"The train score is: {round(ridge_CV.score(X_train, y_train), 6)}")
print(f"The test score is: {round(ridge_CV.score(X_test, y_test), 6)}")

The train score is: 0.891919
The test score is: 0.891239


In [46]:
cv_results= pd.DataFrame(ridge_CV[1].cv_values_, columns=['alpha=0.1', 'alpha=1', 'alpha=10', 'alpha=100', 'alpha=1000'])
cv_results.shape

(119817, 5)

In [47]:
cv_results.head()

Unnamed: 0,alpha=0.1,alpha=1,alpha=10,alpha=100,alpha=1000
0,2748284000.0,2755258000.0,2813214000.0,3193140000.0,4817755000.0
1,1418368000.0,1423038000.0,1457278000.0,1611593000.0,1377079000.0
2,19915060.0,19456130.0,15474630.0,883204.2,1783.817
3,4380845000.0,4388701000.0,4462062000.0,5059141000.0,8126840000.0
4,1690679000.0,1691574000.0,1703623000.0,1891572000.0,3635783000.0


In [48]:
ridge_CV[1].alpha_

0.1

In [49]:
ridge_CV[1].best_score_

-2222773760.7176504

In [50]:
ridge_CV[1].n_features_in_

75

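The best alpha selected is 0.1, the smallest value in the grid, which suggests that the data favours very little regularization.  
The gain over the default `Ridge()` (alpha = 1.0) is tiny, with a test score of 0.891239 against 0.890854, so the default is good enough.  
I will now move on to Elastic Net.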

## Elastic Net Regression

Elastic Net Regression is a combination of Lasso (L1) and Ridge (L2).  
Maybe that's why it is called elastic net?
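
For reference, scikit-learn's `ElasticNet` minimises the objective below, where $\alpha$ is the `alpha` parameter, $\rho$ is `l1_ratio` and $n$ is the number of samples. The symbols $w$, $\rho$ and $n$ are my own notation here, not taken from the images above.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

With $\rho = 1$ this reduces to the lasso penalty, and as $\rho$ approaches 0 it becomes a ridge-like penalty. The defaults are `alpha=1.0` and `l1_ratio=0.5`.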

In [51]:
# Build a pipeline for regression model
# ElasticNet with default settings (alpha=1.0, l1_ratio=0.5)
e_net = Pipeline(
    steps=[("preprocessor", preprocessor), ("ENet", ElasticNet())]
)

In [52]:
e_net.fit(X_train, y_train)

In [53]:
print(f"The train score is: {round(e_net.score(X_train, y_train), 6)}")
print(f"The test score is: {round(e_net.score(X_test, y_test), 6)}")

The train score is: 0.665517
The test score is: 0.665156


Wow! That is a huge drop in performance!  
Maybe cross validation can tune the hyperparameters.

In [54]:
enet_CV = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("ENet CV", ElasticNetCV(
            l1_ratio=[.1, .3, .5, .7, .9],
            max_iter=10000,  # at l1_ratio=0.9 the model is close to a lasso, which takes a long time to run
            cv=5,
        )),
    ]
)

In [55]:
enet_CV.fit(X_train, y_train)

In [56]:
print(f"The train score is: {round(enet_CV.score(X_train, y_train), 6)}")
print(f"The test score is: {round(enet_CV.score(X_test, y_test), 6)}")

The train score is: 0.160196
The test score is: 0.160537


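Wow, this is a surprisingly low score...  
A likely cause is that `ElasticNetCV` builds its own alpha grid from the data, and on a target of this scale even the smallest alpha tried may be large enough for the L2 part to shrink every coefficient heavily.  
What about grid search CV, where I choose the alphas myself?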

In [57]:
# Perform grid search using GridSearchCV()
enet_grid_search = GridSearchCV(ElasticNet(max_iter=10000),
                           param_grid={'alpha': [0.1, 1, 10],
                                       'l1_ratio':[0.1, 0.3, 0.5, 0.7, 0.9],
                                       'selection': ['cyclic', 'random'],
                                      },
                           cv=5,
                           verbose = 1,
                           return_train_score=True
                           )

enet_grid_search_CV = Pipeline(
    steps=[("preprocessor", preprocessor), ("ENet Grid Search", enet_grid_search)]
) 

In [58]:
enet_grid_search_CV.fit(X_train, y_train)

Fitting 5 folds for each of 30 candidates, totalling 150 fits


In [59]:
enet_grid_search_CV[1].best_score_

0.8630536371629208

In [60]:
# cv results
cv_results = pd.DataFrame(enet_grid_search_CV[1].cv_results_)
cv_results = cv_results.loc[cv_results['param_selection'] == 'cyclic']
cv_results[['mean_fit_time', 'param_alpha', 'param_l1_ratio', 'mean_test_score', 'mean_train_score']]

Unnamed: 0,mean_fit_time,param_alpha,param_l1_ratio,mean_test_score,mean_train_score
0,0.470074,0.1,0.1,0.780479,0.780646
2,0.477274,0.1,0.3,0.792875,0.793047
4,0.545874,0.1,0.5,0.808604,0.808781
6,0.72193,0.1,0.7,0.830033,0.830218
8,1.290726,0.1,0.9,0.863054,0.863257
10,0.182142,1.0,0.1,0.592554,0.592667
12,0.179008,1.0,0.3,0.626981,0.6271
14,0.187351,1.0,0.5,0.665405,0.665532
16,0.209069,1.0,0.7,0.709749,0.709888
18,0.323398,1.0,0.9,0.775101,0.775266


In [65]:
# New e net with better score
e_net = Pipeline(
    steps=[("preprocessor", preprocessor), ("ENet", ElasticNet(alpha=0.1, l1_ratio=0.9))]
)
e_net.fit(X_train, y_train)

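It is evident that the performance is better when the model behaves more like a lasso regressor: the score rises with `l1_ratio` and falls as alpha grows, and the best combination is alpha = 0.1 with l1_ratio = 0.9 (mean test score of 0.863).  
With this, I have completed the investigation on Elastic Net Regression.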

## Model Evaluation

Here, I will take a (slightly) closer look at the four models used.  
First, I will look at the scores, where the default is the R squared value.  
I will peek at the coefficients.  
Then I will make inferences using the models, and take a look at the MSE, RMSE and MAE.

Lasso uses a 'diamond' shaped constraint.  
Ridge uses a 'circle' shaped constraint.  
Optimality is reached where the elliptical contours first touch the constraint region.  
Because the 'diamond' has vertices on the axes, the contours often touch it at a vertex, which shrinks one or more coefficients to 0.

<img src="../img/lasso_vs_ridge.png" width="500" height="600"></img>  

Ridge, on the other hand, generally will not, as the optimal point will usually not lie on an axis; each axis represents the coefficient of one feature.  
(Unless the whole B_hat lies on an axis, resulting in a trivial solution? This is just my guess.)

Based on the scores alone, which are in fact R squared values, Lasso regression edges ahead a little, followed closely by Linear Regression, then Ridge.   
Elastic Net has the worst performance.  
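
Besides R squared, the MSE, RMSE and MAE can be computed on a held-out split with `sklearn.metrics`. The sketch below is a minimal example on synthetic data; the data and all names (`X_demo`, `demo_model`, `yd_pred` and so on) are hypothetical. In this notebook the same calls could be applied to `y_test` and, for example, `lin_reg.predict(X_test)`.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical), loosely on the scale of resale prices.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
coefs = np.array([80_000.0, -30_000.0, 10_000.0])
y_demo = 400_000 + X_demo @ coefs + rng.normal(scale=20_000, size=500)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
demo_model = LinearRegression().fit(Xd_train, yd_train)
yd_pred = demo_model.predict(Xd_test)

mse = metrics.mean_squared_error(yd_test, yd_pred)
rmse = np.sqrt(mse)  # RMSE is in the same unit as the target (dollars)
mae = metrics.mean_absolute_error(yd_test, yd_pred)
r2 = metrics.r2_score(yd_test, yd_pred)

print(f"MSE:  {mse:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"MAE:  {mae:,.0f}")
print(f"R2:   {r2:.4f}")
```

RMSE and MAE are easier to interpret than R squared here, because they are expressed in the unit of the price itself.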

Now, I will test this with Kaggle.  

In [26]:
# Import test set
test = pd.read_csv('../datasets/test.csv')



In [28]:
pred_res = lin_reg.predict(test)

In [29]:
id_col = test['id'].copy(deep=True)
res = zip(id_col,pred_res)
result = pd.DataFrame(res, columns=['id', 'Predicted'])
result.to_csv('../datasets/prediction_linear.csv', index=False)

The Root Mean Square Error (RMSE) score from the most basic linear regression model is:  
<img src="../img/lin_reg_res.png"></img> 
Looking at the score alone, it's kind of expected. It is not great, but not too bad.

In [30]:
pred_res = lasso.predict(test)

In [31]:
id_col = test['id'].copy(deep=True)
res = zip(id_col,pred_res)
result_lasso = pd.DataFrame(res, columns=['id', 'Predicted'])
result_lasso.to_csv('../datasets/prediction_lasso.csv', index=False)

The Root Mean Square Error (RMSE) score from the lasso regression model is:  
<img src="../img/lasso_res.png"></img> 
This is actually (slightly) worse off compared to linear regression.  
The difference is rather small, but one thing to keep in mind is the 'bias variance tradeoff'.  
Regularization normally adds a little bias in exchange for lower variance, so lasso would not be expected to overfit more than linear regression.  
With alpha = 1 the penalty is very weak at this price scale, so the two models are nearly identical and the small gap is probably noise.

This also shows that a high R squared doesn't mean the model is good, as the relationship in the original data may not be completely linear in the first place!  

With this in mind, the last 'One Size Fits All' model to test will be the Elastic Net model.  
This is because Elastic Net has a very different score compared to all the other models. 

In [66]:
pred_res = e_net.predict(test)

In [67]:
id_col = test['id'].copy(deep=True)
res = zip(id_col,pred_res)
result_enet = pd.DataFrame(res, columns=['id', 'Predicted'])
result_enet.to_csv('../datasets/prediction_enet.csv', index=False)

<img src="../img/enet_res.png"></img> 
The 2nd submission in the screenshot uses the original elastic net model, where all parameters are at their default values.  
The 1st submission in the screenshot uses the optimised elastic net model from grid search CV.  
It can be seen that there is a huge improvement in RMSE.  
However, both still pale in comparison to simple linear regression.  

To conclude, for this case, it seems that a simple linear regression should be the go-to model.  
This is because it is simple and fast, with good performance in terms of R squared score and RMSE score.  
This concludes the exploration on a single model. 

## Combined Model

This section explores the possibility of using a combination of models to predict the housing price.  
The reasoning is that a single model cannot capture all the unique characteristics that contribute to the housing price.  
Hence, we use more than one model to 'learn' it!  
This makes the model more complex.  
The base models will be simple linear regression, as linear regression was shown to work well in the previous section.  
It ought to work well here too, as they share the same use case.  
The drawback is that the amount of data available for each model to learn from is smaller.  

In [86]:
## failed attempt to do GridSearchCV on linear regression
### problem identified to be preprocessor can't contain less features than specified
## keeping for future investigation

# build a function for preprocessor of dataset
# def get_present_col_subset(selected_columns, df):
#     '''return the currently available columns
#     args:
#         selected_columns: list of columns to select from
#         df: dataframe being used currently
#     return:
#         subset_col: columns present in selected columns
#     Note: Code taken from user: srggrs at https://github.com/scikit-learn/scikit-learn/issues/19014
#     '''
#     # get the intersection of present and known-infrequent columns
#     present_columns = df.columns
#     return [col for col in present_columns if col in selected_columns]

In [85]:
## failed attempt to do GridSearchCV on linear regression
### problem identified to be preprocessor can't contain less features than specified

# build the preprocessor for the dataset
# preprocessor = ColumnTransformer(
#     transformers=[
#         ('num', numeric_transformer, functools.partial(get_present_col_subset, numerical_col)),
#         ('cat', categorical_transformer, functools.partial(get_present_col_subset, categorical_col))
#     ]
# )

In [83]:
## failed attempt to do GridSearchCV on linear regression
### problem identified to be preprocessor can't contain less features than specified

# # Prepare the training data to be 5 folds
# folds = KFold(n_splits = 5, shuffle = True, random_state = 0)

# # Specify range of hyperparameters to tune
# hyper_params = [{'n_features_to_select': list(range(1, len(X_train.columns)))}]

# # Build Feature ranking with recursive feature elimination
# lin_reg.fit(X_train,y_train)
# rfe = RFE(lin_reg)  

# # Perform grid search using GridSearchCV()
# lin_reg_cv = GridSearchCV(estimator = lin_reg, 
#                         param_grid = hyper_params, 
#                         scoring= 'r2', 
#                         cv = folds, 
#                         verbose = 1,
#                         return_train_score=True)      

# # Fit the model
# lin_reg_cv.fit(X_train, y_train) 
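
For the future investigation, one observation is that the commented-out grid search passes `n_features_to_select` to `lin_reg`, which is a pipeline with no such parameter, rather than to `rfe`. One possible direction, sketched below on synthetic data, is to make `RFE` a pipeline step after preprocessing and tune `rfe__n_features_to_select`. This is only a sketch under assumptions, not a fix verified on this dataset. `StandardScaler` stands in for the notebook's `preprocessor`, and the data and the names `rfe_pipeline` and `rfe_search` are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 2 informative features out of 5.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# RFE ranks the preprocessed features using the coefficients of LinearRegression.
rfe_pipeline = Pipeline(
    steps=[
        ("preprocessor", StandardScaler()),
        ("rfe", RFE(estimator=LinearRegression())),
    ]
)

# The number of features to keep is a parameter of the "rfe" step.
rfe_search = GridSearchCV(
    estimator=rfe_pipeline,
    param_grid={"rfe__n_features_to_select": [1, 2, 3, 4, 5]},
    scoring="r2",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    return_train_score=True,
)
rfe_search.fit(X_demo, y_demo)

print(rfe_search.best_params_)
print(round(rfe_search.best_score_, 4))
```

Because `RFE` runs after preprocessing, it selects among the encoded columns, so the upper bound of the grid would be the number of features after one hot encoding rather than `len(X_train.columns)`.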