## Model Fitting

#### The goal of this section is to fit regression models to the Ames dataset using default parameters.

Let's fit the following models using default parameters and analyse their scores.
- Ridge
- Lasso
- KNN (K Nearest Neighbor)
- SVM (Support Vector Machine)
- Decision Tree

##### Besides a model's hyperparameters, feature selection also affects its score, so this section also shows how changing the feature selection method changes each model's score.

In [720]:
cd ..

/home/jovyan


In [709]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import RandomizedLasso  # removed in scikit-learn 0.21
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

In [710]:
run src/load_data_2.py

In [711]:
housing_df = load_train_data()

In [712]:
clean_data(housing_df)
housing_df.shape
#housing_df.dtypes

(1423, 78)

In [713]:
scaled_features_df = scale_numeric_features(housing_df)
scaled_encoded_features_df = one_hot_encode_categorical_features(scaled_features_df)
unscaled_encoded_features_df = one_hot_encode_categorical_features(housing_df)

In [714]:
np.random.seed(125)

In [715]:
scaled_encoded_features_df.head()

Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
1,0.21343,-0.043826,-0.066639,0.336399,-0.23918,0.513091,0.431636,0.587729,0.376383,-0.180199,...,0,0,0,1,0,0,0,0,1,0
2,-0.5631,0.278067,0.055623,-0.005369,0.973575,0.074995,-0.225214,-0.413766,0.431536,-0.180199,...,0,0,0,1,0,0,0,0,1,0
3,0.21343,0.026043,0.2076,0.336399,-0.23918,0.480842,0.407467,0.551816,0.313221,-0.180199,...,0,0,0,1,0,0,0,0,1,0
4,0.323978,-0.167652,0.05062,0.336399,-0.23918,-0.937218,-0.372399,-0.413766,0.17625,-0.180199,...,0,0,0,1,1,0,0,0,0,0
5,0.21343,0.353831,0.434786,0.63786,-0.23918,0.464706,0.359091,0.697217,0.363697,-0.180199,...,0,0,0,1,0,0,0,0,1,0


In [716]:
# The target is the last column of the cleaned frame.
train_y = housing_df.iloc[:, -1]

##### EDA/Manual selected features

In [721]:
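# Note: this rebinds the name eda_selected_features from the helper function
# to its result, so re-running this cell raises a TypeError.
eda_selected_features = eda_selected_features()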
eda_selected_features

['GrLivArea',
 '1stFlrSF',
 'YearBuilt',
 'YearRemodAdd',
 'GarageYrBlt',
 'Utilities_AllPub',
 'Street_Pave',
 'Condition2_Norm',
 'RoofMatl_CompShg',
 'Heating_GasA']

##### RFE (Recursive Feature Elimination) selected features

In [717]:
rfe_selected_features = rfe_linear_selected_features(scaled_encoded_features_df, train_y, 10)
rfe_selected_features

45     LotShape_IR1
46     LotShape_IR2
47     LotShape_IR3
48     LotShape_Reg
167    ExterQual_Ex
168    ExterQual_Fa
169    ExterQual_Gd
170    ExterQual_TA
172    ExterCond_Fa
174    ExterCond_Po
Name: colnames, dtype: object
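
The helper `rfe_linear_selected_features` is defined in `src/load_data_2.py` and is not shown in this notebook. As a rough sketch of what such a helper might do, assuming it wraps scikit-learn's `RFE` around a `LinearRegression` estimator (the function name `rfe_select` and the synthetic data are hypothetical, not the author's code):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression


def rfe_select(X, y, n_features):
    """Hypothetical stand-in for rfe_linear_selected_features."""
    # RFE refits the estimator repeatedly and drops the feature with the
    # smallest coefficient magnitude until n_features remain.
    selector = RFE(LinearRegression(), n_features_to_select=n_features)
    selector.fit(X, y)
    # support_ is a boolean mask over the columns of X.
    return pd.Series(list(X.columns[selector.support_]), name="colnames")


# Synthetic data: only columns "a" and "c" drive the target.
rng = np.random.RandomState(0)
demo_X = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
demo_y = 3.0 * demo_X["a"] - 2.0 * demo_X["c"] + rng.normal(scale=0.1, size=200)

selected = rfe_select(demo_X, demo_y, 2)
print(selected.tolist())
```

Because this ranking relies on coefficient magnitude, collinear one-hot columns can receive very large coefficients in an unregularised linear model, which may be why the selection above consists entirely of dummy columns.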

##### Lasso selected features

In [718]:
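# Note: as with the EDA helper, this rebinds the function name to its result,
# so the cell cannot be re-run without reloading src/load_data_2.py.
lasso_selected_features = lasso_selected_features(scaled_encoded_features_df, train_y, 10)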
lasso_selected_features



257           GarageQual_Ex
131        RoofMatl_WdShngl
100         Condition2_PosA
189             BsmtCond_Po
78     Neighborhood_NoRidge
85     Neighborhood_StoneBr
126        RoofMatl_Membran
252    GarageType_No Garage
123          RoofStyle_Shed
154     Exterior2nd_ImStucc
Name: colnames, dtype: object
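
The `lasso_selected_features` helper is also defined outside this notebook. One plausible reading of the unsorted index in the output above is that columns are ranked by the size of their Lasso coefficients. A minimal sketch under that assumption (the name `lasso_select`, the `alpha` default, and the synthetic data are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso


def lasso_select(X, y, n_features, alpha=1.0):
    """Hypothetical stand-in for lasso_selected_features."""
    model = Lasso(alpha=alpha)
    model.fit(X, y)
    # Rank columns by absolute coefficient, largest first.
    ranked = pd.Series(np.abs(model.coef_), index=X.columns)
    ranked = ranked.sort_values(ascending=False)
    return pd.Series(list(ranked.index[:n_features]), name="colnames")


# Synthetic data: "b" has the strongest effect, then "e"; the rest are noise.
rng = np.random.RandomState(1)
demo_X = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
demo_y = 50.0 * demo_X["b"] + 20.0 * demo_X["e"] + rng.normal(size=200)

top_two = lasso_select(demo_X, demo_y, 2)
print(top_two.tolist())
```

Ranking by coefficient size is only meaningful when the features share a common scale, which is why the scaled frame is the natural input for this kind of selection.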

In [719]:
# Feature subsets for each selection method
train_scaled_lasso_selected_X = scaled_encoded_features_df[lasso_selected_features]
train_scaled_rfe_selected_X = scaled_encoded_features_df[rfe_selected_features]
train_scaled_eda_selected_X = scaled_encoded_features_df[eda_selected_features]

train_unscaled_rfe_selected_X = unscaled_encoded_features_df[rfe_selected_features]
train_unscaled_eda_selected_X = unscaled_encoded_features_df[eda_selected_features]

In [635]:
metrics_df = pd.DataFrame(columns=['Model', 'Score (EDA/Manual feature selection)', 'Score (RFE feature selection)'])

### Models


### (1) Ridge

In [636]:
ridge = Ridge()

#### (a) Using top 10 Features selected Manually via EDA

In [637]:
train_X = train_scaled_eda_selected_X

In [638]:
ridge.fit(train_X, train_y)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [639]:
ridge.predict(train_X)

array([ 217379.00825661,  173629.73553036,  224579.62312151, ...,
        228511.87605651,  138966.31701947,  158649.82107068])

In [640]:
ridge_score_eda = ridge.score(train_X, train_y)
ridge_score_eda

0.67721537043403279

#### (b) Using top 10 Features selected via RFE (Recursive Feature Elimination)

In [641]:
train_X = train_scaled_rfe_selected_X

ridge.fit(train_X, train_y)
ridge.predict(train_X)
ridge_score_rfe = ridge.score(train_X, train_y)
ridge_score_rfe

0.49615456766659749

In [642]:
metrics_df.loc[len(metrics_df)] = ['Ridge', ridge_score_eda, ridge_score_rfe]
metrics_df

Unnamed: 0,Model,Score (EDA/Manual feature selection),Score (RFE feature selection)
0,Ridge,0.677215,0.496155


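**The scores above are the coefficient of determination R^2 of each prediction: 0.677 with the EDA-selected features and 0.496 with the RFE-selected features. R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and a score can be negative when the model predicts worse than the mean of y_true. Ridge falls well short of 1.0 with either feature set, and because both scores are computed on the training data they are likely optimistic.**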
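
To make the definition of R^2 concrete, here is a small sketch that computes it by hand with NumPy. The function name `r_squared` and the toy values are illustrative only; in the notebook itself the number comes from each estimator's `score` method.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination, R^2 = 1 - u/v."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
    v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
    return 1.0 - u / v


y_true = np.array([100.0, 200.0, 300.0])

perfect = r_squared(y_true, y_true)                # exact predictions
mean_only = r_squared(y_true, np.full(3, 200.0))   # always predict the mean
off_target = r_squared(y_true, np.full(3, 150.0))  # a constant below the mean

print(perfect, mean_only, off_target)  # 1.0 0.0 -0.375
```

The third case shows how a near-constant prediction that misses the mean produces a negative score.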

### (2) Lasso

In [643]:
lasso = Lasso()

#### (a) Using top 10 Features selected Manually via EDA

In [644]:
train_X = train_scaled_eda_selected_X

In [645]:
lasso.fit(train_X, train_y)

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [646]:
lasso.predict(train_X)

array([ 217448.35934579,  173618.73594776,  224661.03225739, ...,
        228691.52124776,  138917.1300813 ,  158646.64228862])

In [647]:
lasso_score_eda = lasso.score(train_X, train_y)
lasso_score_eda

0.67723466018792733

#### (b) Using top 10 Features selected via RFE (Recursive Feature Elimination)

In [648]:
train_X = train_scaled_rfe_selected_X

In [649]:
lasso.fit(train_X, train_y)
lasso.predict(train_X)
lasso_score_rfe = lasso.score(train_X, train_y)
lasso_score_rfe

0.49650160347284944

In [650]:
metrics_df.loc[len(metrics_df)] = ['Lasso', lasso_score_eda, lasso_score_rfe]
metrics_df

Unnamed: 0,Model,Score (EDA/Manual feature selection),Score (RFE feature selection)
0,Ridge,0.677215,0.496155
1,Lasso,0.677235,0.496502


### (3) KNN (K Nearest Neighbor)

In [651]:
knn = KNeighborsRegressor()

#### (a) Using top 10 Features selected Manually via EDA

In [652]:
train_X = train_scaled_eda_selected_X

In [653]:
knn.fit(train_X, train_y)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform')

In [654]:
knn.predict(train_X)

array([ 205500.,  167400.,  215500., ...,  248400.,  145205.,  151000.])

In [655]:
knn_score_eda = knn.score(train_X, train_y)
knn_score_eda

0.84283277718501692

#### (b) Using top 10 Features selected via RFE (Recursive Feature Elimination)

In [656]:
train_X = train_scaled_rfe_selected_X

In [657]:
knn.fit(train_X, train_y)
knn.predict(train_X)
knn_score_rfe = knn.score(train_X, train_y)
knn_score_rfe

0.4844651398671761

In [658]:
metrics_df.loc[len(metrics_df)] = ['KNN', knn_score_eda, knn_score_rfe]
metrics_df

Unnamed: 0,Model,Score (EDA/Manual feature selection),Score (RFE feature selection)
0,Ridge,0.677215,0.496155
1,Lasso,0.677235,0.496502
2,KNN,0.842833,0.484465


### (4) SVM (Support Vector Machine)

In [659]:
svm = SVR()

#### (a) Using top 10 Features selected Manually via EDA

In [660]:
train_X = train_scaled_eda_selected_X

In [661]:
svm.fit(train_X, train_y)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [662]:
svm.predict(train_X)

array([ 165081.50670308,  164984.34377556,  165087.34983518, ...,
        165007.42076991,  164925.94908606,  164942.16489507])

In [663]:
svm_score_eda = svm.score(train_X, train_y)
svm_score_eda

-0.048867195597493751

#### (b) Using top 10 Features selected via RFE (Recursive Feature Elimination)

In [664]:
train_X = train_scaled_rfe_selected_X

In [665]:
svm.fit(train_X, train_y)
svm.predict(train_X)
svm_score_rfe = svm.score(train_X, train_y)
svm_score_rfe

-0.049723084661448036

In [666]:
metrics_df.loc[len(metrics_df)] = ['SVM', svm_score_eda, svm_score_rfe]
metrics_df

Unnamed: 0,Model,Score (EDA/Manual feature selection),Score (RFE feature selection)
0,Ridge,0.677215,0.496155
1,Lasso,0.677235,0.496502
2,KNN,0.842833,0.484465
3,SVM,-0.048867,-0.049723


### (5) Decision Tree

In [667]:
decision_tree = DecisionTreeRegressor()

#### (a) Using top 10 Features selected Manually via EDA

In [668]:
train_X = train_unscaled_eda_selected_X

In [669]:
decision_tree.fit(train_X, train_y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [670]:
decision_tree.predict(train_X)

array([ 208500.,  181500.,  223500., ...,  266500.,  142125.,  147500.])

In [677]:
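# An unconstrained tree normally scores close to 1.0 on its own training rows.
# The value below equals the RFE score exactly and the cell numbers jump from
# 670 to 677, so the tree may have been refit before this cell ran; re-run the
# cells in order to confirm.
decision_tree_score_eda = decision_tree.score(train_X, train_y)
decision_tree_score_eda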

0.50614943263428902

#### (b) Using top 10 Features selected via RFE (Recursive Feature Elimination)

In [678]:
train_X = train_unscaled_rfe_selected_X

In [679]:
decision_tree.fit(train_X, train_y)
decision_tree.predict(train_X)
decision_tree_score_rfe = decision_tree.score(train_X, train_y)
decision_tree_score_rfe

0.50614943263428902

In [681]:
metrics_df.loc[len(metrics_df)] = ['Decision Tree', decision_tree_score_eda, decision_tree_score_rfe]
metrics_df

Unnamed: 0,Model,Score (EDA/Manual feature selection),Score (RFE feature selection)
0,Ridge,0.677215,0.496155
1,Lasso,0.677235,0.496502
2,KNN,0.842833,0.484465
3,SVM,-0.048867,-0.049723
4,Decision Tree,0.506149,0.506149
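
Each model above follows the same fit-then-score pattern, and every score is computed on the rows the model was trained on. The sketch below shows one way the comparison could be condensed into a loop that also reports a held-out score, using `train_test_split`, which is imported at the top of the notebook. Synthetic data from `make_regression` stands in for the Ames features; the 25% split, the `random_state` values, and all variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a 10-feature training matrix and its target.
demo_X, demo_y = make_regression(
    n_samples=300, n_features=10, noise=10.0, random_state=125
)
X_fit, X_held, y_fit, y_held = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=125
)

# The same five estimators as above, all with default hyperparameters.
models = {
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "KNN": KNeighborsRegressor(),
    "SVM": SVR(),
    "Decision Tree": DecisionTreeRegressor(random_state=125),
}

rows = []
for name, model in models.items():
    model.fit(X_fit, y_fit)
    rows.append({
        "Model": name,
        "Train R^2": model.score(X_fit, y_fit),      # rows seen during fit
        "Held-out R^2": model.score(X_held, y_held),  # rows never seen
    })

comparison_df = pd.DataFrame(rows, columns=["Model", "Train R^2", "Held-out R^2"])
print(comparison_df)
```

Comparing the two columns separates models that generalise from models that memorise: an unconstrained decision tree, for example, usually scores near 1.0 on its training rows and noticeably lower on held-out rows.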
