## Model Fitting

#### The goal of this section is to fit regression models to the Ames dataset using default parameters.

Let's fit the following models using default parameters and analyse their scores.
- Ridge
- Lasso
- KNN (K Nearest Neighbor)
- SVM (Support Vector Machine)
- Decision Tree

##### Besides a model's hyperparameters, feature selection also affects its score, so this section also compares how different feature selection methods change each model's score.

In [1]:
cd ..

/home/jovyan/Ames_Housing_Data


In [2]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

In [3]:
run src/load_data_2.py

In [4]:
housing_df = load_train_data()

In [5]:
clean_data(housing_df)
housing_df.shape

(1423, 78)

#### Split features (numeric, categorical) and target

In [6]:
features, target = split_features_target(housing_df)
numerical_features, categorical_features = split_numerical_categorical(features)

#### Scale Numerical features & One Hot Encode Categorical features

In [7]:
scaled_numerical_features = log_scale_features(numerical_features)
categorical_features = one_hot_encode_features(categorical_features)

In [8]:
scaled_encoded_features_df = scaled_numerical_features.merge(categorical_features, left_index=True, right_index=True, how='left')
unscaled_encoded_features_df = numerical_features.merge(categorical_features, left_index=True, right_index=True, how='left')
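
The helpers `log_scale_features` and `one_hot_encode_features` are defined in `src/load_data_2.py` and are not shown in this notebook. As a rough sketch of what a step like this might look like, assuming a log transform followed by standardization for the numeric columns and `pd.get_dummies` for the categorical ones (the toy frame and the `sketch_*` function names are hypothetical, not the actual helpers):

```python
import numpy as np
import pandas as pd


def sketch_log_scale(numeric_df):
    """Hypothetical stand-in: log1p-transform, then standardize each column."""
    logged = np.log1p(numeric_df)
    return (logged - logged.mean()) / logged.std()


def sketch_one_hot(categorical_df):
    """Hypothetical stand-in: one 0/1 indicator column per category level."""
    return pd.get_dummies(categorical_df).astype(int)


# Toy data with an Id index, loosely shaped like the housing frame.
toy = pd.DataFrame(
    {
        "LotArea": [8450, 9600, 11250, 9550],
        "GrLivArea": [1710, 1262, 1786, 1717],
        "Street": ["Pave", "Pave", "Grvl", "Pave"],
    },
    index=pd.Index([1, 2, 3, 4], name="Id"),
)

numeric = toy.select_dtypes(include="number")
categorical = toy.select_dtypes(exclude="number")

scaled = sketch_log_scale(numeric)
encoded = sketch_one_hot(categorical)

# Join on the shared Id index, mirroring the merge used in the cell above.
combined = scaled.merge(encoded, left_index=True, right_index=True, how="left")
print(combined)
```

Because both frames share the same `Id` index, an index-based left merge keeps one row per house and simply places the indicator columns next to the scaled numeric ones.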

In [9]:
np.random.seed(125)

In [10]:
scaled_encoded_features_df.head()

Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
1,0.21343,-0.043826,-0.066639,0.336399,-0.23918,0.513091,0.431636,0.587729,0.376383,-0.180199,...,0,0,0,1,0,0,0,0,1,0
2,-0.5631,0.278067,0.055623,-0.005369,0.973575,0.074995,-0.225214,-0.413766,0.431536,-0.180199,...,0,0,0,1,0,0,0,0,1,0
3,0.21343,0.026043,0.2076,0.336399,-0.23918,0.480842,0.407467,0.551816,0.313221,-0.180199,...,0,0,0,1,0,0,0,0,1,0
4,0.323978,-0.167652,0.05062,0.336399,-0.23918,-0.937218,-0.372399,-0.413766,0.17625,-0.180199,...,0,0,0,1,1,0,0,0,0,0
5,0.21343,0.353831,0.434786,0.63786,-0.23918,0.464706,0.359091,0.697217,0.363697,-0.180199,...,0,0,0,1,0,0,0,0,1,0


##### EDA/Manual selected features

In [11]:
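# Caution: this rebinds `eda_selected_features` from the helper function to the
# list it returns, so re-running this cell raises a TypeError.
eda_selected_features = eda_selected_features()
eda_selected_features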

['GrLivArea',
 '1stFlrSF',
 'YearBuilt',
 'YearRemodAdd',
 'GarageYrBlt',
 'Utilities_AllPub',
 'Street_Pave',
 'Condition2_Norm',
 'RoofMatl_CompShg',
 'Heating_GasA']

In [12]:
eda_scaled_features_df = scaled_encoded_features_df[eda_selected_features]
eda_unscaled_features_df = unscaled_encoded_features_df[eda_selected_features]

##### RFE (Recursive Feature Elimination) selected features

In [13]:
rfe_selected_features = rfe_linear_selected_features(scaled_encoded_features_df, target, 10)
rfe_selected_features

45     LotShape_IR1
46     LotShape_IR2
47     LotShape_IR3
48     LotShape_Reg
167    ExterQual_Ex
168    ExterQual_Fa
169    ExterQual_Gd
170    ExterQual_TA
172    ExterCond_Fa
174    ExterCond_Po
Name: colnames, dtype: object

In [14]:
rfe_scaled_features_df = scaled_encoded_features_df[rfe_selected_features]
rfe_unscaled_features_df = unscaled_encoded_features_df[rfe_selected_features]
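
The helper `rfe_linear_selected_features` is not shown here. Judging by its name, it probably wraps scikit-learn's `RFE` around a linear model, but that is an assumption. A minimal sketch of that idea on synthetic data (the `feat_*` column names and the data itself are made up for illustration):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the encoded feature matrix (not the Ames data).
X, y = make_regression(
    n_samples=200, n_features=30, n_informative=10, noise=5.0, random_state=0
)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(X.shape[1])])

# RFE refits the estimator repeatedly, dropping the feature with the
# smallest coefficient magnitude each round until 10 features remain.
selector = RFE(estimator=LinearRegression(), n_features_to_select=10, step=1)
selector.fit(X, y)

# support_ is a boolean mask over the columns; ranking_ is 1 for kept features.
selected = X.columns[selector.support_]
print(list(selected))
```

Since RFE ranks features by coefficient magnitude, one possible explanation for the notebook's selection of whole dummy groups (all four `LotShape_*` levels, all four `ExterQual_*` levels) is that collinear indicator columns can receive large, unstable coefficients in an unregularized linear model. This has not been verified against the actual helper.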

##### Lasso selected features

In [15]:
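# Caution: this rebinds `lasso_selected_features` from the helper function to its
# return value, so re-running this cell raises a TypeError.
lasso_selected_features = lasso_selected_features(scaled_encoded_features_df, target, 10)
lasso_selected_features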



124        RoofMatl_ClyTile
101         Condition2_PosN
257           GarageQual_Ex
263           GarageCond_Ex
131        RoofMatl_WdShngl
100         Condition2_PosA
189             BsmtCond_Po
78     Neighborhood_NoRidge
85     Neighborhood_StoneBr
139     Exterior1st_ImStucc
Name: colnames, dtype: object

In [16]:
lasso_scaled_features_df = scaled_encoded_features_df[lasso_selected_features]
lasso_unscaled_features_df = unscaled_encoded_features_df[lasso_selected_features]
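
The `lasso_selected_features` helper is also not shown. One common way to select features with Lasso is to fit the model and keep the columns with the largest absolute coefficients. The sketch below assumes that approach and uses synthetic data (the `feat_*` names and `alpha=1.0` are illustrative choices, not taken from the helper):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in for the encoded feature matrix (not the Ames data).
X, y = make_regression(
    n_samples=200, n_features=30, n_informative=10, noise=5.0, random_state=0
)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(X.shape[1])])

# The L1 penalty shrinks weak coefficients towards (and often exactly to) zero.
lasso_model = Lasso(alpha=1.0).fit(X, y)

# Rank columns by absolute coefficient and keep the ten largest.
coef_magnitudes = pd.Series(np.abs(lasso_model.coef_), index=X.columns)
top_10 = coef_magnitudes.sort_values(ascending=False).head(10)
print(top_10)
```

Ranking by coefficient magnitude is only meaningful when the columns are on comparable scales, which is one reason to run this kind of selection on the scaled feature frame.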

In [17]:
metrics = pd.DataFrame(columns=['Model', 'Train (EDA)', 'Test (EDA)', 'Train (RFE)', 'Test (RFE)'])

### Models


### (1) Ridge

In [18]:
ridge = Ridge()

#### (a) Using top 10 Features selected Manually via EDA

In [19]:
train_X, test_X, train_y, test_y = train_test_split(eda_scaled_features_df, target, test_size = .25, random_state = 42)

In [20]:
ridge.fit(train_X, train_y)
ridge.predict(train_X)
ridge_score_eda_train = ridge.score(train_X, train_y)
ridge_score_eda_test = ridge.score(test_X, test_y)
ridge_score_eda_train, ridge_score_eda_test

(0.68482361168099393, 0.65433123875024712)

#### (b) Using top 10 Features selected via RFE (Recursive Feature Elimination)

In [21]:
train_X, test_X, train_y, test_y  = train_test_split(rfe_scaled_features_df, target, test_size = .25, random_state = 42)

In [22]:
ridge.fit(train_X, train_y)
ridge.predict(train_X)
ridge_score_rfe_train = ridge.score(train_X, train_y)
ridge_score_rfe_test = ridge.score(test_X, test_y)
ridge_score_rfe_train, ridge_score_rfe_test

(0.51792624931205156, 0.43433899260768205)

In [24]:
metrics.loc[len(metrics)] = ['Ridge', round(ridge_score_eda_train, 2), round(ridge_score_eda_test, 2),
                                         round(ridge_score_rfe_train, 2), round(ridge_score_rfe_test, 2)]

### (2) Lasso

In [25]:
lasso = Lasso()

#### (a) Using top 10 Features selected Manually via EDA

In [26]:
train_X, test_X, train_y, test_y = train_test_split(eda_scaled_features_df, target, test_size = .25, random_state = 42)

In [27]:
lasso.fit(train_X, train_y)
lasso.predict(train_X)
lasso_score_eda_train = lasso.score(train_X, train_y)
lasso_score_eda_test = lasso.score(test_X, test_y)
lasso_score_eda_train, lasso_score_eda_test

(0.68485008638450884, 0.65461926369028667)

#### (b) Using top 10 Features selected via RFE (Recursive Feature Elimination)

In [28]:
train_X, test_X, train_y, test_y = train_test_split(rfe_scaled_features_df, target, test_size = .25, random_state = 42)

In [29]:
lasso.fit(train_X, train_y)
lasso.predict(train_X)
lasso_score_rfe_train = lasso.score(train_X, train_y)
lasso_score_rfe_test = lasso.score(test_X, test_y)
lasso_score_rfe_train, lasso_score_rfe_test

(0.5182763025506183, 0.43242267807903761)

In [30]:
metrics.loc[len(metrics)] = ['Lasso', round(lasso_score_eda_train, 2), round(lasso_score_eda_test, 2),
                                         round(lasso_score_rfe_train, 2), round(lasso_score_rfe_test, 2)]

### (3) KNN (K Nearest Neighbor)

In [31]:
knn = KNeighborsRegressor()

#### (a) Using top 10 Features selected Manually via EDA

In [32]:
train_X, test_X, train_y, test_y = train_test_split(eda_scaled_features_df, target, test_size = .25, random_state = 42)

In [33]:
knn.fit(train_X, train_y)
knn.predict(train_X)
knn_score_eda_train = knn.score(train_X, train_y)
knn_score_eda_test = knn.score(test_X, test_y)
knn_score_eda_train, knn_score_eda_test

(0.84552123219891362, 0.76961933942907523)

#### (b) Using top 10 Features selected via RFE (Recursive Feature Elimination)

In [34]:
train_X, test_X, train_y, test_y = train_test_split(rfe_scaled_features_df, target, test_size = .25, random_state = 42)

In [35]:
knn.fit(train_X, train_y)
knn.predict(train_X)
knn_score_rfe_train = knn.score(train_X, train_y)
knn_score_rfe_test = knn.score(test_X, test_y)
knn_score_rfe_train, knn_score_rfe_test

(0.47617897475125959, 0.40178987770344254)

In [36]:
metrics.loc[len(metrics)] = ['KNN', round(knn_score_eda_train, 2), round(knn_score_eda_test, 2),
                                         round(knn_score_rfe_train, 2), round(knn_score_rfe_test, 2)]

### (4) SVM (Support Vector Machine)

In [37]:
svm = SVR()

#### (a) Using top 10 Features selected Manually via EDA

In [38]:
train_X, test_X, train_y, test_y = train_test_split(eda_scaled_features_df, target, test_size = .25, random_state = 42)

In [39]:
svm.fit(train_X, train_y)
svm.predict(train_X)
svm_score_eda_train = svm.score(train_X, train_y)
svm_score_eda_test = svm.score(test_X, test_y)
svm_score_eda_train, svm_score_eda_test

(-0.042919671533690362, -0.045602739436426454)

#### (b) Using top 10 Features selected via RFE (Recursive Feature Elimination)

In [40]:
train_X, test_X, train_y, test_y = train_test_split(rfe_scaled_features_df, target, test_size = .25, random_state = 42)

In [41]:
svm.fit(train_X, train_y)
svm.predict(train_X)
svm_score_rfe_train = svm.score(train_X, train_y)
svm_score_rfe_test = svm.score(test_X, test_y)
svm_score_rfe_train, svm_score_rfe_test

(-0.043677277922238522, -0.046186418718727751)

In [42]:
metrics.loc[len(metrics)] = ['SVM', round(svm_score_eda_train, 2), round(svm_score_eda_test, 2),
                                         round(svm_score_rfe_train, 2), round(svm_score_rfe_test, 2)]

### (5) Decision Tree

In [47]:
dtree = DecisionTreeRegressor()

#### (a) Using top 10 Features selected Manually via EDA

In [48]:
train_X, test_X, train_y, test_y = train_test_split(eda_unscaled_features_df, target, test_size = .25, random_state = 42)

In [49]:
dtree.fit(train_X, train_y)
dtree.predict(train_X)
dtree_score_eda_train = dtree.score(train_X, train_y)
dtree_score_eda_test = dtree.score(test_X, test_y)
dtree_score_eda_train, dtree_score_eda_test

(0.99939352097703515, 0.73630116449984961)

#### (b) Using top 10 Features selected via RFE (Recursive Feature Elimination)

In [50]:
train_X, test_X, train_y, test_y = train_test_split(rfe_unscaled_features_df, target, test_size = .25, random_state = 42)

In [51]:
dtree.fit(train_X, train_y)
dtree.predict(train_X)
dtree_score_rfe_train = dtree.score(train_X, train_y)
dtree_score_rfe_test = dtree.score(test_X, test_y)
dtree_score_rfe_train, dtree_score_rfe_test

(0.53177064702763754, 0.41969211129826983)

In [52]:
metrics.loc[len(metrics)] = ['Decision Tree', round(dtree_score_eda_train, 2), round(dtree_score_eda_test, 2),
                                         round(dtree_score_rfe_train, 2), round(dtree_score_rfe_test, 2)]
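
The five model sections above repeat the same split, fit, score, and append steps. As a sketch of how that pattern could be condensed, the loop below builds a comparable table on synthetic data. The names `feature_sets`, `models`, and `summary`, along with the two toy feature sets "A" and "B", are hypothetical and stand in for the EDA and RFE frames:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-ins for two competing feature sets (not the Ames data).
X_a, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_b = X_a[:, :3]  # a deliberately weaker subset, for comparison
feature_sets = {"A": X_a, "B": X_b}

models = {
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "KNN": KNeighborsRegressor(),
    "SVM": SVR(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
}

rows = []
for model_name, model in models.items():
    row = {"Model": model_name}
    for set_name, X in feature_sets.items():
        # Same split settings as the notebook cells: 25% test, fixed seed.
        train_X, test_X, train_y, test_y = train_test_split(
            X, y, test_size=0.25, random_state=42
        )
        model.fit(train_X, train_y)
        # score() returns R^2 for scikit-learn regressors.
        row[f"Train ({set_name})"] = round(model.score(train_X, train_y), 2)
        row[f"Test ({set_name})"] = round(model.score(test_X, test_y), 2)
    rows.append(row)

summary = pd.DataFrame(rows)
print(summary)
```

Collecting rows in a list and building the frame once avoids repeated `metrics.loc[len(metrics)]` appends, and adding a third feature set (for example the Lasso-selected columns) would only need one more dictionary entry.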

### Summarize Model Scores

In [53]:
metrics

| | Model | Train (EDA) | Test (EDA) | Train (RFE) | Test (RFE) |
|---|---|---|---|---|---|
| 0 | Ridge | 0.68 | 0.65 | 0.52 | 0.43 |
| 1 | Lasso | 0.68 | 0.65 | 0.52 | 0.43 |
| 2 | KNN | 0.85 | 0.77 | 0.48 | 0.40 |
| 3 | SVM | -0.04 | -0.05 | -0.04 | -0.05 |
| 4 | Decision Tree | 1.00 | 0.74 | 0.53 | 0.42 |


The scores above show that the EDA (manually selected) features scored better than the RFE-selected features for every model except SVM, which scored about the same with both.

Let's analyse the EDA train and test scores.

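Using the default hyperparameters:
- Decision Tree scored a near-perfect 1.0 in training but only 0.74 on the test set, which suggests it is overfitting.
    - Decision trees tend to overfit because, with the default settings (no `max_depth` limit), the tree keeps splitting until its leaves are nearly pure. The result is a long, complicated chain of decisions, and a data point reaches a prediction only by satisfying every rule along that chain, so the tree ends up fitting noise in the training data.
- KNN had the best test score (0.77), with a smaller gap between train (0.85) and test than Decision Tree.
- Ridge and Lasso performed almost identically (0.68 train, 0.65 test). Both are regularized linear models, and with only 10 features the default penalties appear to make little difference.
- SVM scored badly: its R² is slightly negative on both train and test, meaning it does worse than always predicting the mean of the target. A likely cause, not verified here, is that the default `SVR` settings (`C=1.0`, `epsilon=0.1`) are poorly matched to a target on the scale of raw sale prices.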
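
Two of the observations above, the Decision Tree's perfect training score and the SVM's negative score, can be probed on synthetic data. The sketch below is not run on the Ames data. The price-like target, the `max_depth=3` comparison, and the use of `TransformedTargetRegressor` to standardize the target are all assumptions chosen for illustration:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Synthetic price-like target in the hundreds of thousands (not the Ames data).
y = 180_000 + 60_000 * X[:, 0] + 20_000 * X[:, 1] + rng.normal(scale=5_000, size=300)

# With no depth limit the tree keeps splitting until it reproduces the
# training targets; capping max_depth forces a coarser fit.
deep_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
deep_train_r2 = deep_tree.score(X, y)
shallow_train_r2 = shallow_tree.score(X, y)

# Default SVR on the raw target versus the same SVR trained on a
# standardized target (predictions are mapped back to the original scale).
raw_svr_r2 = SVR().fit(X, y).score(X, y)
scaled_svr = TransformedTargetRegressor(regressor=SVR(), transformer=StandardScaler())
scaled_svr_r2 = scaled_svr.fit(X, y).score(X, y)

print("tree train R^2 (no depth limit):", round(deep_train_r2, 3))
print("tree train R^2 (max_depth=3):   ", round(shallow_train_r2, 3))
print("SVR train R^2 (raw target):     ", round(raw_svr_r2, 3))
print("SVR train R^2 (scaled target):  ", round(scaled_svr_r2, 3))
```

If the same pattern holds on the housing data, limiting tree depth and scaling the target (or raising `C`) would be natural first steps when tuning hyperparameters.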
