# Regression: HOUSE PRICES PREDICTION

Abstract: This project predicts the final sale price of each home from roughly 80 features that describe almost every aspect of a residential property.

The dataset comes from Kaggle (https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) and consists of two files. The training dataset has 1460 rows and 81 columns, including the target variable. The testing dataset has 1459 rows and 80 columns. A detailed description of every feature is given in the file 'data_description.txt'. In this project, we compare several models on the training dataset and use the best one to predict the target values for the testing dataset.

Here is a brief description of the features of the dataset: 

1. SalePrice: the property's sale price in dollars. (Target variable)
2. MSSubClass: The building class
3. MSZoning: The general zoning classification
4. LotFrontage: Linear feet of street connected to property
5. LotArea: Lot size in square feet
6. Street: Type of road access
7. Alley: Type of alley access
8. LotShape: General shape of property
9. LandContour: Flatness of the property
10. Utilities: Type of utilities available
11. LotConfig: Lot configuration
12. LandSlope: Slope of property
13. Neighborhood: Physical locations within Ames city limits
14. Condition1: Proximity to main road or railroad
15. Condition2: Proximity to main road or railroad (if a second is present)
16. BldgType: Type of dwelling
17. HouseStyle: Style of dwelling
18. OverallQual: Overall material and finish quality
19. OverallCond: Overall condition rating
20. YearBuilt: Original construction date
21. YearRemodAdd: Remodel date
22. RoofStyle: Type of roof
23. RoofMatl: Roof material
24. Exterior1st: Exterior covering on house
25. Exterior2nd: Exterior covering on house (if more than one material)
26. MasVnrType: Masonry veneer type
27. MasVnrArea: Masonry veneer area in square feet
28. ExterQual: Exterior material quality
29. ExterCond: Present condition of the material on the exterior
30. Foundation: Type of foundation
31. BsmtQual: Height of the basement
32. BsmtCond: General condition of the basement
33. BsmtExposure: Walkout or garden level basement walls
34. BsmtFinType1: Quality of basement finished area
35. BsmtFinSF1: Type 1 finished square feet
36. BsmtFinType2: Quality of second finished area (if present)
37. BsmtFinSF2: Type 2 finished square feet
38. BsmtUnfSF: Unfinished square feet of basement area
39. TotalBsmtSF: Total square feet of basement area
40. Heating: Type of heating
41. HeatingQC: Heating quality and condition
42. CentralAir: Central air conditioning
43. Electrical: Electrical system
44. 1stFlrSF: First Floor square feet
45. 2ndFlrSF: Second floor square feet
46. LowQualFinSF: Low quality finished square feet (all floors)
47. GrLivArea: Above grade (ground) living area square feet
48. BsmtFullBath: Basement full bathrooms
49. BsmtHalfBath: Basement half bathrooms
50. FullBath: Full bathrooms above grade
51. HalfBath: Half baths above grade
52. Bedroom: Number of bedrooms above basement level
53. Kitchen: Number of kitchens
54. KitchenQual: Kitchen quality
55. TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
56. Functional: Home functionality rating
57. Fireplaces: Number of fireplaces
58. FireplaceQu: Fireplace quality
59. GarageType: Garage location
60. GarageYrBlt: Year garage was built
61. GarageFinish: Interior finish of the garage
62. GarageCars: Size of garage in car capacity
63. GarageArea: Size of garage in square feet
64. GarageQual: Garage quality
65. GarageCond: Garage condition
66. PavedDrive: Paved driveway
67. WoodDeckSF: Wood deck area in square feet
68. OpenPorchSF: Open porch area in square feet
69. EnclosedPorch: Enclosed porch area in square feet
70. 3SsnPorch: Three season porch area in square feet
71. ScreenPorch: Screen porch area in square feet
72. PoolArea: Pool area in square feet
73. PoolQC: Pool quality
74. Fence: Fence quality
75. MiscFeature: Miscellaneous feature not covered in other categories
76. MiscVal: Value of miscellaneous feature
77. MoSold: Month Sold
78. YrSold: Year Sold
79. SaleType: Type of sale
80. SaleCondition: Condition of sale

## Import Packages

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

##  Load Data

Since we use the same datasets and the preprocessing steps as Project 1, we load the datasets that have been cleansed and transformed in Project 1.

In [2]:
train_df = pd.read_csv('train_df.csv')

In [3]:
test_df = pd.read_csv('test_df.csv')

In [4]:
test_df_forPred = pd.read_csv('houseprice_test.csv')

In [5]:
train_df.shape

(1459, 106)

In [6]:
test_df.shape

(1459, 105)

## Train test split

In [7]:
from sklearn.model_selection import train_test_split

X = train_df.drop('SalePrice', axis = 1)
y = train_df.SalePrice
X_train_org, X_test_org, y_train, y_test = train_test_split(X, y, random_state = 0)

In [8]:
print("Size of train set: {}   Size of test set: {}".format(X_train_org.shape, X_test_org.shape))

Size of train set: (1094, 105)   Size of test set: (365, 105)


After the split, the training set has 1094 rows and the test set has 365 rows, each with 105 features.
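These sizes follow from the default `test_size` of 0.25 in `train_test_split`. A small sketch of the arithmetic, assuming the usual scikit-learn behaviour of rounding the test count up and giving the remainder to the training set:

```python
import math

n_samples = 1459    # rows in train_df
test_size = 0.25    # default when neither test_size nor train_size is given

# The test count is rounded up; the training set gets the remaining rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```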

## Data scaling

In [9]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train_org)
X_test = scaler.transform(X_test_org)

In [10]:
# Note: each split is scaled with its own min and max, so y_train and y_test
# are mapped to [0, 1] independently rather than with shared training statistics
y_test = (y_test - y_test.min()) / (y_test.max() - y_test.min())
y_train = (y_train - y_train.min()) / (y_train.max() - y_train.min())
print(f'y_test metrics -> mean: {y_test.mean()}, std: {y_test.std()}')
print(f'y_train metrics -> mean: {y_train.mean()}, std: {y_train.std()}')

y_test metrics -> mean: 0.2707016845791734, std: 0.14571551732041682
y_train metrics -> mean: 0.20420521525169033, std: 0.11221269827058238


Because the data are not normally distributed, we use MinMaxScaler instead of StandardScaler to bring the features into the [0, 1] range.
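Cell 10 scales `y_train` and `y_test` with their own minimum and maximum, which puts the two targets on slightly different scales. As an alternative sketch (not what the notebook runs), the training-set range can be applied to both splits and inverted later to report predictions in dollars. The price values and the helper names `scale_target` and `unscale_target` are hypothetical:

```python
import numpy as np

# Hypothetical sale prices standing in for the raw y_train and y_test
y_train_raw = np.array([100000.0, 150000.0, 200000.0, 300000.0])
y_test_raw = np.array([120000.0, 250000.0])

# Learn the scaling range from the training target only
y_min, y_max = y_train_raw.min(), y_train_raw.max()

def scale_target(y):
    """Map prices onto the training target's [0, 1] range."""
    return (y - y_min) / (y_max - y_min)

def unscale_target(y_scaled):
    """Map scaled values back to dollars."""
    return y_scaled * (y_max - y_min) + y_min

y_train_scaled = scale_target(y_train_raw)
y_test_scaled = scale_target(y_test_raw)   # may fall outside [0, 1]

print(y_train_scaled)
print(y_test_scaled)
```

With a shared range, an RMSE measured on the scaled target can be converted back to dollars by multiplying it by `y_max - y_min`.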

##  Modeling

We use grid search to find the best hyper-parameters for each model. Because this is a regression problem, we use ``R2`` (the coefficient of determination) as the scoring metric to select the optimal model.
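For reference, R2 compares the model's squared error with the error of always predicting the mean. It equals 1 for a perfect fit, 0 for the mean predictor, and becomes negative for anything worse. A small NumPy sketch with made-up values (the helper name `r2` is illustrative; scikit-learn's `r2_score` computes the same quantity):

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [0.1, 0.2, 0.3, 0.4]
print(r2(y_true, y_true))                 # perfect fit
print(r2(y_true, [0.25] * 4))             # always predicting the mean
print(r2(y_true, [0.4, 0.3, 0.2, 0.1]))   # worse than predicting the mean
```

This is why a model can report a negative score on held-out data even when its training score is high.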

### Bagging and Pasting

We apply the Ridge model and the Lasso model with bagging and pasting methods.
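The two methods differ only in how each estimator's training subset is drawn: bagging samples rows with replacement (`bootstrap=True`), pasting samples without replacement (`bootstrap=False`). A minimal NumPy sketch of the index sampling, using a toy row count rather than the real training set:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10     # toy number of training rows
subset_size = 10   # corresponds to max_samples=1.0

# Bagging: draw row indices with replacement, so duplicates are expected
bagging_idx = rng.choice(n_samples, size=subset_size, replace=True)

# Pasting: draw without replacement, so every index is unique
pasting_idx = rng.choice(n_samples, size=subset_size, replace=False)

# Rows never drawn by the bagging sample are "out-of-bag" for that estimator
oob_idx = np.setdiff1d(np.arange(n_samples), bagging_idx)

print(sorted(bagging_idx.tolist()))
print(sorted(pasting_idx.tolist()))
print(oob_idx.tolist())
```

Note that pasting with `max_samples=1.0` hands every estimator the full training set, so the ensemble members only differ when a smaller subset size is used.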

#### 1. Ridge with Bagging and Pasting

In [11]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.ensemble import BaggingRegressor

# define the best ridge model
best_ridge = Ridge(alpha=0.1,
                   normalize=True,
                   tol=1e-06
                  )

# define the best ridge model with bagging
bag_ridge = BaggingRegressor(best_ridge,
                           bootstrap=True,
                           n_jobs=-1, 
                           random_state=0,
                           oob_score = True)

# set param_grid
param_grid = {'n_estimators':[10, 50, 100],
              'max_samples':[0.1, 1.0, 10]
             }

# Define the cv
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Define the scoring
scoring = 'r2'

# define the grid search
gscv_ridge_bag = GridSearchCV(bag_ridge, param_grid, 
                         cv=cv,
                         scoring=scoring)

# fit the grid search
gscv_ridge_bag.fit(X_train, y_train)

# Print The value of best Hyperparameters
print("Best parameters(Ridge Bagging): {}".format(gscv_ridge_bag.best_params_))

# define the best ridge model with pasting
pas_ridge = BaggingRegressor(best_ridge,
                           bootstrap=False,
                           n_jobs=-1, 
                           random_state=0)

# set param_grid
param_grid = {'n_estimators':[10, 50, 100],
              'max_samples':[0.1, 1.0, 10]
             }

# Define the cv
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Define the scoring
scoring = 'r2'

# define the grid search
gscv_ridge_pas = GridSearchCV(pas_ridge, param_grid, 
                         cv=cv,
                         scoring=scoring,
                         return_train_score=True)

# fit the grid search
gscv_ridge_pas.fit(X_train, y_train)

# Print The value of best Hyperparameters
print("Best parameters(Ridge Pasting): {}".format(gscv_ridge_pas.best_params_))


Best parameters(Ridge Bagging): {'max_samples': 0.1, 'n_estimators': 100}
Best parameters(Ridge Pasting): {'max_samples': 0.1, 'n_estimators': 10}


For the Ridge regressor we reuse the best parameters found in Project 1: {'alpha': 0.1, 'normalize': True, 'tol': 1e-06}. The output above shows the best parameters that the grid search selects for the bagging and pasting regressors.
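The tuned ensemble can also be taken straight from the grid search. The sketch below assumes synthetic data from `make_regression` in place of `X_train` and `y_train`, and a smaller hypothetical grid. With the default `refit=True`, `best_estimator_` is a `BaggingRegressor` over the base `Ridge` model, already refitted on the full training data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data standing in for X_train and y_train
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=0)

# Bagging ensemble over a plain Ridge base model
bagging = BaggingRegressor(Ridge(alpha=0.1), bootstrap=True,
                           oob_score=True, random_state=0)

search = GridSearchCV(
    bagging,
    {'n_estimators': [20, 40], 'max_samples': [0.5, 1.0]},  # hypothetical grid
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring='r2',
)
search.fit(X_demo, y_demo)

# The refitted winner: its members are Ridge models, not grid searches
best_bagging = search.best_estimator_
print(search.best_params_)
print(type(best_bagging.estimators_[0]).__name__)
print('Out-of-bag score: {:.2f}'.format(best_bagging.oob_score_))
```

Reusing `best_estimator_` avoids building a second ensemble by hand and keeps the base model of the final ensemble identical to the one that was tuned.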

In [13]:
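# define the best ridge model with best bagging
# Note: gscv_ridge_bag (and gscv_ridge_pas below) is the fitted grid search object,
# not best_ridge, so each ensemble member reruns that grid search on its own sample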
best_ridge_bag = BaggingRegressor(
                            gscv_ridge_bag,
                            max_samples=gscv_ridge_bag.best_params_['max_samples'],
                            n_estimators=gscv_ridge_bag.best_params_['n_estimators'],
                            bootstrap=True,
                            n_jobs=-1, 
                            random_state=0,
                            oob_score = True)

# define the best ridge model with best pasting
best_ridge_pas = BaggingRegressor(
                            gscv_ridge_pas,
                            max_samples=gscv_ridge_pas.best_params_['max_samples'],
                            n_estimators=gscv_ridge_pas.best_params_['n_estimators'],
                            bootstrap=False,
                            n_jobs=-1, 
                            random_state=0)

# Fit the model
best_ridge_bag.fit(X_train, y_train)
best_ridge_pas.fit(X_train, y_train)

# Bagging result
print('Ridge Bagging')
print('Train score: {:.2f}'.format(best_ridge_bag.score(X_train, y_train)))
print('Test score: {:.2f}'.format(best_ridge_bag.score(X_test, y_test)))
print('Out-of-Bag score: {:.2f}'.format(best_ridge_bag.oob_score_))

# Pasting result
print('\nRidge Pasting')
print('Train score: {:.2f}'.format(best_ridge_pas.score(X_train, y_train)))
print('Test score: {:.2f}'.format(best_ridge_pas.score(X_test, y_test)))


Ridge Bagging
Train score: 0.79
Test score: 0.51
Out-of-Bag score: 0.77

Ridge Pasting
Train score: 0.74
Test score: 0.43


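The output above shows the training and test R2 scores of the Ridge model with bagging and pasting. Bagging reaches 0.79 on the training set and 0.51 on the test set (out-of-bag score 0.77), while pasting reaches 0.74 and 0.43.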

In [14]:
# predict
y_pred_ridge_bag = best_ridge_bag.predict(X_test)
y_pred_ridge_pas = best_ridge_pas.predict(X_test)

# evaluate
from sklearn.metrics import r2_score
from sklearn import metrics

print('Ridge Bagging')
R2_ridge_bag = r2_score(y_test, y_pred_ridge_bag)
print("R2_ridge_bag: {:.2f}".format(R2_ridge_bag)) 
RMSE_ridge_bag  = np.sqrt(metrics.mean_squared_error(y_test, y_pred_ridge_bag))
print('RMSE_ridge_bag: {:.2f}'.format(RMSE_ridge_bag))

print('\nRidge Pasting')
R2_ridge_pas = r2_score(y_test, y_pred_ridge_pas)
print("R2_ridge_pas: {:.2f}".format(R2_ridge_pas)) 
RMSE_ridge_pas = np.sqrt(metrics.mean_squared_error(y_test, y_pred_ridge_pas))
print('RMSE_ridge_pas: {:.2f}'.format(RMSE_ridge_pas))


Ridge Bagging
R2_ridge_bag: 0.51
RMSE_ridge_bag: 0.10

Ridge Pasting
R2_ridge_pas: 0.43
RMSE_ridge_pas: 0.11


In [15]:
print('Ridge (Project1)')
print("R2_ridge: {:.2f}".format(0.54)) 
print('RMSE_ridge: {:.2f}'.format(0.10))

Ridge (Project1)
R2_ridge: 0.54
RMSE_ridge: 0.10


The results above compare the R2 score and the RMSE of the original Ridge model with those of its bagging and pasting variants. Bagging (R2 of 0.51) and pasting (R2 of 0.43) both score below the original model (R2 of 0.54), and neither lowers the RMSE of 0.10. Therefore, bagging and pasting do not improve the Ridge model.

#### 2. Lasso with Bagging and Pasting

In [16]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.ensemble import BaggingRegressor

# define the best lasso model
best_lasso = Lasso(alpha=0.001)

# define the best lasso model with bagging
bag_lasso = BaggingRegressor(best_lasso,
                           bootstrap=True,
                           n_jobs=-1, 
                           random_state=0,
                           oob_score = True)

# set param_grid
param_grid = {'n_estimators':[10, 50, 100],
              'max_samples':[0.1, 1.0, 10]
             }

# Define the cv
cv = KFold(n_splits=5,shuffle=True,random_state=0)

# Define the scoring
scoring = 'r2'

# define the grid search
gscv_lasso_bag = GridSearchCV(bag_lasso, param_grid, 
                         cv=cv,
                         scoring=scoring,
                         return_train_score=True)

# fit the grid search
gscv_lasso_bag.fit(X_train, y_train)

# Print The value of best Hyperparameters
print("Best parameters(Lasso Bagging): {}".format(gscv_lasso_bag.best_params_))

# define the best lasso model with pasting
pas_lasso = BaggingRegressor(best_lasso,
                           bootstrap=False,
                           n_jobs=-1, 
                           random_state=0)

# set param_grid
param_grid = {'n_estimators':[10, 50, 100],
              'max_samples':[0.1, 1.0, 10]
             }

# Define the cv
cv = KFold(n_splits=5,shuffle=True,random_state=0)

# Define the scoring
scoring = 'r2'

# define the grid search
gscv_lasso_pas = GridSearchCV(pas_lasso, param_grid, 
                         cv=cv,
                         scoring=scoring,
                         return_train_score=True)
# fit the grid search
gscv_lasso_pas.fit(X_train, y_train)

# Print The value of best Hyperparameters
print("Best parameters(Lasso Pasting): {}".format(gscv_lasso_pas.best_params_))


Best parameters(Lasso Bagging): {'max_samples': 1.0, 'n_estimators': 10}
Best parameters(Lasso Pasting): {'max_samples': 1.0, 'n_estimators': 10}


For the Lasso regressor we reuse the best parameters found in Project 1: {'alpha': 0.001}. The output above shows the best parameters that the grid search selects for the bagging and pasting regressors.

In [17]:
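# define the best lasso model with best bagging
# Note: gscv_lasso_bag (and gscv_lasso_pas below) is the fitted grid search object,
# not best_lasso, so each ensemble member reruns that grid search on its own sample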
best_lasso_bag = BaggingRegressor(
                            gscv_lasso_bag,
                            max_samples=gscv_lasso_bag.best_params_['max_samples'],
                            n_estimators=gscv_lasso_bag.best_params_['n_estimators'],
                            bootstrap=True,
                            n_jobs=-1, 
                            random_state=0,
                            oob_score = True)

# define the best lasso model with best pasting
best_lasso_pas = BaggingRegressor(
                            gscv_lasso_pas,
                            max_samples=gscv_lasso_pas.best_params_['max_samples'],
                            n_estimators=gscv_lasso_pas.best_params_['n_estimators'],
                            bootstrap=False,
                            n_jobs=-1, 
                            random_state=0)

# Fit the model
best_lasso_bag.fit(X_train, y_train)
best_lasso_pas.fit(X_train, y_train)

# Bagging result
print('Lasso Bagging')
print('Train score: {:.2f}'.format(best_lasso_bag.score(X_train, y_train)))
print('Test score: {:.2f}'.format(best_lasso_bag.score(X_test, y_test)))
print('Out-of-Bag score: {:.2f}'.format(best_lasso_bag.oob_score_))


# Pasting result
print('\nLasso Pasting')
print('Train score: {:.2f}'.format(best_lasso_pas.score(X_train, y_train)))
print('Test score: {:.2f}'.format(best_lasso_pas.score(X_test, y_test)))

Lasso Bagging
Train score: 0.81
Test score: 0.52
Out-of-Bag score: 0.76

Lasso Pasting
Train score: 0.80
Test score: 0.50


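The output above shows the training and test R2 scores of the Lasso model with bagging and pasting. Bagging reaches 0.81 on the training set and 0.52 on the test set (out-of-bag score 0.76), while pasting reaches 0.80 and 0.50.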

In [18]:
# predict
y_pred_lasso_bag = best_lasso_bag.predict(X_test)
y_pred_lasso_pas = best_lasso_pas.predict(X_test)

# evaluate
from sklearn.metrics import r2_score
from sklearn import metrics

print('Lasso Bagging')
R2_lasso_bag = r2_score(y_test, y_pred_lasso_bag)
print("R2_lasso_bag: {:.2f}".format(R2_lasso_bag)) 
RMSE_lasso_bag  = np.sqrt(metrics.mean_squared_error(y_test, y_pred_lasso_bag))
print('RMSE_lasso_bag: {:.2f}'.format(RMSE_lasso_bag))

print('\nLasso Pasting')
R2_lasso_pas = r2_score(y_test, y_pred_lasso_pas)
print("R2_lasso_pas: {:.2f}".format(R2_lasso_pas)) 
RMSE_lasso_pas = np.sqrt(metrics.mean_squared_error(y_test, y_pred_lasso_pas))
print('RMSE_lasso_pas: {:.2f}'.format(RMSE_lasso_pas))

Lasso Bagging
R2_lasso_bag: 0.52
RMSE_lasso_bag: 0.10

Lasso Pasting
R2_lasso_pas: 0.50
RMSE_lasso_pas: 0.10


In [19]:
print('Lasso (Project1)')
print("R2_lasso: {:.2f}".format(0.51)) 
print('RMSE_lasso: {:.2f}'.format(0.11))

Lasso (Project1)
R2_lasso: 0.51
RMSE_lasso: 0.11


The results above compare the R2 score and the RMSE of the original Lasso model with those of its bagging and pasting variants. Bagging raises the R2 score slightly, from 0.51 to 0.52, and lowers the RMSE from 0.11 to 0.10. Pasting also lowers the RMSE to 0.10, but its R2 score of 0.50 is marginally below the original. Therefore, bagging gives a small improvement for the Lasso model, while pasting does not clearly help.

### AdaBoost boosting 

We apply AdaBoost to the KNN regressor and to the linear regression model.

#### 1.  KNN Regressor with AdaBoost boosting

In [70]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.ensemble import AdaBoostRegressor

# define the best knn model
best_knn = KNeighborsRegressor(
    n_neighbors=8)

# define the best knn model with AdaBoost
ada_knn = AdaBoostRegressor(best_knn, 
                           learning_rate=1.0, 
                           random_state=0)

# set param_grid
param_grid = {'n_estimators':[10,50,100]}

# Define the cv
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Define the scoring
scoring = 'r2'

# define the grid search
gscv_knn_ada = GridSearchCV(ada_knn, param_grid, 
                         cv=cv,
                         scoring=scoring,
                         return_train_score=True)

# fit the grid search
gscv_knn_ada.fit(X_train, y_train)

# Print The value of best Hyperparameters
print("Best parameters(KNN AdaBoost): {}".format(gscv_knn_ada.best_params_))


Best parameters(KNN AdaBoost): {'n_estimators': 10}


For the KNN regressor we reuse the best parameters found in Project 1: {'n_neighbors': 8}. The output above shows the best parameters that the grid search selects for the AdaBoost regressor.
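Besides a grid search over `n_estimators`, the effect of the number of boosting rounds can be inspected on a single fitted model with `staged_predict`, which yields the ensemble prediction after each round. A minimal sketch, assuming synthetic data from `make_regression` rather than the house-price features:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data standing in for the scaled training and test sets
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# AdaBoost over a KNN base model, boosted for up to 20 rounds
ada = AdaBoostRegressor(KNeighborsRegressor(n_neighbors=8),
                        n_estimators=20, learning_rate=1.0, random_state=0)
ada.fit(X_tr, y_tr)

# Test R2 after each boosting round
staged_r2 = [r2_score(y_te, y_stage) for y_stage in ada.staged_predict(X_te)]
for n_rounds, score in enumerate(staged_r2, start=1):
    print(n_rounds, round(score, 3))
```

If the staged scores flatten or fall after a few rounds, a small `n_estimators` is a reasonable choice, which matches the grid search picking the smallest value here.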

In [71]:
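# define the best knn model with best AdaBoost
# Note: gscv_knn_ada is the fitted grid search object, not best_knn,
# so each boosting round reruns that grid search on its own sample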
best_knn_ada = AdaBoostRegressor(
                            gscv_knn_ada,
                            n_estimators=gscv_knn_ada.best_params_['n_estimators'],
                            learning_rate=1.0, 
                            random_state=0)


# Fit the model
best_knn_ada.fit(X_train, y_train)

# AdaBoost result
print('KNN AdaBoost')
print('Train score: {:.2f}'.format(best_knn_ada.score(X_train, y_train)))
print('Test score: {:.2f}'.format(best_knn_ada.score(X_test, y_test)))


KNN AdaBoost
Train score: 0.88
Test score: 0.41


The output above shows the training and test R2 scores of the KNN model with AdaBoost: 0.88 on the training set and 0.41 on the test set.

In [72]:
# predict
y_pred_knn_ada = best_knn_ada.predict(X_test)

# evaluate
from sklearn.metrics import r2_score
from sklearn import metrics

print('KNN AdaBoost')
R2_knn_ada = r2_score(y_test, y_pred_knn_ada)
print("R2_knn_ada: {:.2f}".format(R2_knn_ada)) 
RMSE_knn_ada  = np.sqrt(metrics.mean_squared_error(y_test, y_pred_knn_ada))
print('RMSE_knn_ada: {:.2f}'.format(RMSE_knn_ada))


KNN AdaBoost
R2_knn_ada: 0.41
RMSE_knn_ada: 0.11


In [73]:
print('KNN (Project1)')
print("R2_knn: {:.2f}".format(0.29)) 
print('RMSE_knn: {:.2f}'.format(0.12))

KNN (Project1)
R2_knn: 0.29
RMSE_knn: 0.12


The results above compare the R2 score and the RMSE of the original KNN model with those of the AdaBoost version. The R2 score increases from 0.29 to 0.41 and the RMSE decreases from 0.12 to 0.11, so the KNN model performs better with AdaBoost.

#### 2.  Linear Regression with AdaBoost boosting

In [74]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.ensemble import AdaBoostRegressor

# define the best linear model
best_linear = LinearRegression(fit_intercept=True, 
                               normalize=True)

# define the linear model with AdaBoost
ada_linear = AdaBoostRegressor(best_linear, 
                           learning_rate=1.0, 
                           random_state=0)

# set param_grid
param_grid = {'n_estimators':[10,50,100]}

# Define the cv
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Define the scoring
scoring = 'r2'

# define the grid search
gscv_linear_ada = GridSearchCV(ada_linear, param_grid, 
                         cv=cv,
                         scoring=scoring,
                         return_train_score=True)

# fit the grid search
gscv_linear_ada.fit(X_train, y_train)

# Print The value of best Hyperparameters
print("Best parameters(Linear AdaBoost): {}".format(gscv_linear_ada.best_params_))


Best parameters(Linear AdaBoost): {'n_estimators': 10}


For the linear regressor we reuse the best parameters found in Project 1: {'fit_intercept': True, 'normalize': True}. The output above shows the best parameters that the grid search selects for the AdaBoost regressor.

In [75]:
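# define the best linear model with best AdaBoost
# Note: gscv_linear_ada is the fitted grid search object, not best_linear,
# so each boosting round reruns that grid search on its own sample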
best_linear_ada = AdaBoostRegressor(
                            gscv_linear_ada,
                            n_estimators=gscv_linear_ada.best_params_['n_estimators'],
                            learning_rate=1.0, 
                            random_state=0)


# Fit the model
best_linear_ada.fit(X_train, y_train)

# AdaBoost result
print('Linear AdaBoost')
print('Train score: {:.2f}'.format(best_linear_ada.score(X_train, y_train)))
print('Test score: {:.2f}'.format(best_linear_ada.score(X_test, y_test)))


Linear AdaBoost
Train score: 0.88
Test score: 0.61


The output above shows the training and test R2 scores of the linear regression model with AdaBoost: 0.88 on the training set and 0.61 on the test set.

In [76]:
# predict
y_pred_linear_ada = best_linear_ada.predict(X_test)

# evaluate
from sklearn.metrics import r2_score
from sklearn import metrics

print('Linear AdaBoost')
R2_linear_ada = r2_score(y_test, y_pred_linear_ada)
print("R2_linear_ada: {:.2f}".format(R2_linear_ada)) 
RMSE_linear_ada  = np.sqrt(metrics.mean_squared_error(y_test, y_pred_linear_ada))
print('RMSE_linear_ada: {:.2f}'.format(RMSE_linear_ada))


Linear AdaBoost
R2_linear_ada: 0.61
RMSE_linear_ada: 0.09


In [77]:
print('Linear (Project1)')
print("R2_linear: {:.2f}".format(0.57)) 
print('RMSE_linear: {:.2f}'.format(0.10))

Linear (Project1)
R2_linear: 0.57
RMSE_linear: 0.10


The results above compare the R2 score and the RMSE of the original linear model with those of the AdaBoost version. The R2 score increases from 0.57 to 0.61 and the RMSE decreases from 0.10 to 0.09, so the linear model performs better with AdaBoost.

### Gradient boosting

In [23]:
from sklearn.ensemble import GradientBoostingRegressor

# define the Gradient boosting model
grad_reg = GradientBoostingRegressor(random_state=0)

# set param_grid
param_grid = {'max_depth':[1,2,5],
              'n_estimators':[1, 10, 100],
              'learning_rate':[0.1, 0.5, 1.0]}

# Define the cv
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Define the scoring
scoring = 'r2'

# define the grid search
gscv_grad_reg = GridSearchCV(grad_reg, param_grid, 
                         cv=cv,
                         scoring=scoring,
                         return_train_score=True)

# fit the grid search
gscv_grad_reg.fit(X_train, y_train)

# Print The value of best Hyperparameters
print("Best parameters(GradientBoosting): {}".format(gscv_grad_reg.best_params_))


Best parameters(GradientBoosting): {'learning_rate': 0.1, 'max_depth': 2, 'n_estimators': 100}


In [24]:
# define the best Gradient boosting model 
best_grad_reg = GradientBoostingRegressor(
                           max_depth=gscv_grad_reg.best_params_['max_depth'],
                           n_estimators=gscv_grad_reg.best_params_['n_estimators'],
                           learning_rate=gscv_grad_reg.best_params_['learning_rate'], 
                           random_state=0)

# Fit the model
best_grad_reg.fit(X_train, y_train)

# Gradient result
print('Gradient boosting')
print('Train score: {:.2f}'.format(best_grad_reg.score(X_train, y_train)))
print('Test score: {:.2f}'.format(best_grad_reg.score(X_test, y_test)))


Gradient boosting
Train score: 0.94
Test score: 0.56


The output above shows the training and test R2 scores of the gradient boosting model: 0.94 on the training set and 0.56 on the test set.


In [25]:
# predict
y_pred_grad_reg = best_grad_reg.predict(X_test)

# evaluate
from sklearn.metrics import r2_score
from sklearn import metrics

print('Gradient boosting')
R2_grad_reg = r2_score(y_test, y_pred_grad_reg)
print("R2_grad_reg: {:.2f}".format(R2_grad_reg)) 
RMSE_grad_reg  = np.sqrt(metrics.mean_squared_error(y_test, y_pred_grad_reg))
print('RMSE_grad_reg: {:.2f}'.format(RMSE_grad_reg))


Gradient boosting
R2_grad_reg: 0.56
RMSE_grad_reg: 0.10


The output above shows the test R2 score (0.56) and the RMSE (0.10) of the gradient boosting model.

### PCA

In [26]:
from sklearn.decomposition import PCA

pca = PCA(n_components = 0.95)
X_reduced_train = pca.fit_transform(X_train)
X_reduced_test = pca.transform(X_test)
# pca.n_components_

print('Original shape: {}'.format(X_train.shape))
print('Reduced shape: {}'.format(X_reduced_train.shape))

Original shape: (1094, 105)
Reduced shape: (1094, 48)


To preserve 95% of the variance in the data, we set n_components=0.95. After applying PCA, the number of features drops from 105 to 48.
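A float `n_components` asks PCA to keep the smallest number of components whose cumulative explained-variance ratio exceeds that threshold. The selection rule can be sketched with NumPy alone. The data below is synthetic and only illustrates the rule; it does not reproduce the 48 components found above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 rows, 6 features with very different variances
X_demo = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])

# PCA works on centered data; squared singular values are proportional
# to the variance captured by each principal component
X_centered = X_demo - X_demo.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)
explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)

# Smallest number of components whose cumulative ratio exceeds 95%
cumulative = np.cumsum(explained_ratio)
n_components = int(np.searchsorted(cumulative, 0.95, side='right') + 1)

print(n_components)
print(cumulative[n_components - 1])
```

On the real data the same information is available from the fitted object as `pca.n_components_` and `pca.explained_variance_ratio_`.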

#### 1. PCA - KNN Regressor

In [27]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Set param_grid
param_grid = {'n_neighbors': np.arange(1, 10, 1)}

# Define the model
knn = KNeighborsRegressor()

# Define the cv
cv = KFold(n_splits=5,shuffle=True,random_state=0)

# Define the scoring
scoring = 'r2'

# Use GridSearch
gscv_knn = GridSearchCV(knn, 
                        param_grid, 
                        cv=cv,
                        scoring=scoring,
                        return_train_score=True
                       )

# Fit the model
gscv_knn.fit(X_reduced_train, y_train)

#results = pd.DataFrame(gscv_knn.cv_results_)

#bestParamsRow = results.sort_values(
#    ['rank_test_score'])[results['mean_train_score']!= 1].iloc[0]['params']

print("Best hyper-parameters for PCA KNN Regressor: {}".format(gscv_knn.best_params_))

Best hyper-parameters for PCA KNN Regressor: {'n_neighbors': 8}


In [28]:
# Set the best KNN
pca_knn = KNeighborsRegressor(
    n_neighbors=gscv_knn.best_params_['n_neighbors'])

# fit the model
pca_knn.fit(X_reduced_train, y_train)

# train and test score
print('PCA KNN')
print("Train score: {:.2f}".format(pca_knn.score(X_reduced_train, y_train)))
print("Test score: {:.2f}".format(pca_knn.score(X_reduced_test, y_test)))


PCA KNN
Train score: 0.69
Test score: 0.30


The output above shows the training and test R2 scores of the PCA KNN regressor: 0.69 on the training set and 0.30 on the test set.

In [29]:
# predict
y_pred_pca_knn = pca_knn.predict(X_reduced_test)

# evaluate
from sklearn.metrics import r2_score
from sklearn import metrics

print('PCA KNN')
R2_pca_knn = r2_score(y_test, y_pred_pca_knn)
print("R2_pca_knn: {:.2f}".format(R2_pca_knn)) 
RMSE_pca_knn = np.sqrt(metrics.mean_squared_error(y_test, y_pred_pca_knn))
print('RMSE_pca_knn: {:.2f}'.format(RMSE_pca_knn))


PCA KNN
R2_pca_knn: 0.30
RMSE_pca_knn: 0.12


The output above shows the test R2 score (0.30) and the RMSE (0.12) of the PCA KNN regressor.

#### 2. PCA - Linear Regression

In [30]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold

# set param_grid
param_grid = {'fit_intercept':[True,False], 
              'normalize':[True,False]}

# define the model
linreg = LinearRegression()

# Define the cv
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Define the scoring
scoring = 'r2'

# Use GridSearch
gscv_linreg = GridSearchCV(linreg, param_grid,
                           cv=cv,
                           scoring=scoring,
                           verbose=1, 
                           return_train_score=True
                          )

# Fit the model
gscv_linreg.fit(X_reduced_train, y_train)

# Print The value of best Hyperparameters
print("Best parameters for PCA Linear Regression: {}".format(gscv_linreg.best_params_))


Fitting 5 folds for each of 4 candidates, totalling 20 fits
Best parameters for PCA Linear Regression: {'fit_intercept': True, 'normalize': True}


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    0.2s finished


In [31]:
# set best linear
pca_linreg = LinearRegression(
    normalize=gscv_linreg.best_params_['normalize'],
    fit_intercept=gscv_linreg.best_params_['fit_intercept'])

# Fit the model
pca_linreg.fit(X_reduced_train, y_train)

# train and test score
print('PCA Linear')
print("Train score: {:.2f}".format(pca_linreg.score(X_reduced_train, y_train)))
print("Test score: {:.2f}".format(pca_linreg.score(X_reduced_test, y_test)))


PCA Linear
Train score: 0.79
Test score: 0.51


The output above shows the training and test R2 scores of the PCA linear regression model: 0.79 on the training set and 0.51 on the test set.

In [32]:
# predict
y_pred_pca_linreg = pca_linreg.predict(X_reduced_test)

# evaluate
from sklearn.metrics import r2_score
from sklearn import metrics

print('PCA Linear')
R2_pca_linreg = r2_score(y_test, y_pred_pca_linreg)
print("R2_pca_linreg: {:.2f}".format(R2_pca_linreg)) 
RMSE_pca_linreg = np.sqrt(metrics.mean_squared_error(y_test, y_pred_pca_linreg))
print('RMSE_pca_linreg: {:.2f}'.format(RMSE_pca_linreg))


PCA Linear
R2_pca_linreg: 0.51
RMSE_pca_linreg: 0.10


The output above shows the test R2 score (0.51) and the RMSE (0.10) of the PCA linear regression model.

#### 3. PCA - Polynomial Regression

In [33]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold

# poly_1
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_reduced_train)
X_test_poly = poly.transform(X_reduced_test)

# set param_grid
param_grid = {'fit_intercept':[True,False], 
              'normalize':[True]}

# define the model
polyreg = LinearRegression()

# Define the cv
cv = KFold(n_splits=5,shuffle=True,random_state=0)

# Define the scoring
scoring = 'r2'

# Use GridSearch
gscv_polyreg = GridSearchCV(polyreg, 
                           param_grid,
                           cv=cv,
                           scoring=scoring
                          )

# Fit the model
gscv_polyreg.fit(X_train_poly, y_train)

# Print The value of best Hyperparameters
print("Best parameters: {}".format(gscv_polyreg.best_params_))
print("Best cross-validation score: {:.2f}".format(gscv_polyreg.best_score_))


Best parameters: {'fit_intercept': False, 'normalize': True}
Best cross-validation score: -0.60


In [34]:
# set best linear
pca_polyreg = LinearRegression(**gscv_polyreg.best_params_)

# Fit the model
pca_polyreg.fit(X_train_poly, y_train)

# train and test score
print('PCA Polynomial')
print("Train score: {:.2f}".format(pca_polyreg.score(X_train_poly, y_train)))
print("Test score: {:.2f}".format(pca_polyreg.score(X_test_poly, y_test)))


PCA Polynomial
Train score: 1.00
Test score: -2.99


The output above shows the training and test R2 scores of the PCA polynomial regression model: 1.00 on the training set and -2.99 on the test set.

In [35]:
# predict
y_pred_pca_polyreg = pca_polyreg.predict(X_test_poly)

# evaluate
from sklearn.metrics import r2_score
from sklearn import metrics

print('PCA Polynomial')
R2_pca_polyreg = r2_score(y_test, y_pred_pca_polyreg)
print("R2_pca_polyreg: {:.2f}".format(R2_pca_polyreg)) 
RMSE_pca_polyreg = np.sqrt(metrics.mean_squared_error(y_test, y_pred_pca_polyreg))
print('RMSE_pca_polyreg: {:.2f}'.format(RMSE_pca_polyreg))


PCA Polynomial
R2_pca_polyreg: -2.99
RMSE_pca_polyreg: 0.29


The output above shows the R2 score (-2.99) and the RMSE (0.29) of the PCA Polynomial Regression model on the test set.
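A negative R2 can look surprising, so here is a minimal sketch of how R2 and RMSE are defined, written in plain Python rather than with scikit-learn. The toy targets and predictions are made-up values for illustration and are not taken from the dataset.

```python
import math

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Toy values (hypothetical, for illustration only)
y_true = [1.0, 2.0, 3.0, 4.0]
mean_baseline = [2.5] * 4            # always predict the mean of y_true
wild = [4.0, 1.0, 0.0, 6.0]          # predictions worse than the mean

print(r2(y_true, mean_baseline))     # 0.0: no better than the mean
print(r2(y_true, wild))              # negative: worse than the mean
print(rmse(y_true, wild))
```

An R2 below zero therefore means the model predicts the test set worse than a constant prediction of the mean would.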

In [36]:
X_train_poly.shape

(1094, 1225)

The cross-validation score and the R2 score are negative because the PolynomialFeatures transform with degree = 2 leads to a curse-of-dimensionality problem. As shown above, the transformed training set has 1094 instances and 1225 features. Since the number of features exceeds the number of rows, the model can fit the training data exactly, which results in a training score of 1 but very poor generalization. We could reduce the degree from 2 to 1; however, we would then get the same result as linear regression. Therefore, we conclude that this polynomial model has no practical value.
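The feature count can be checked with a short calculation. The sketch below assumes the default PolynomialFeatures settings (bias column included, all interaction terms kept), for which the number of output columns is the binomial coefficient C(n + d, d); the helper name `n_poly_features` is made up for this example.

```python
from math import comb

def n_poly_features(n_features, degree, include_bias=True):
    """Number of monomials of total degree <= `degree` in `n_features` variables."""
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1

n_rows = 1094
for n_features in (48, 105):         # PCA components / original features
    n_out = n_poly_features(n_features, 2)
    print(n_features, n_out, n_out > n_rows)
# 48 inputs give 1225 columns, matching X_train_poly.shape above
```

In both cases the expanded design matrix has more columns than rows, so an unregularized linear model can interpolate the training data.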

#### 4. PCA - Ridge Regression

In [37]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

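# set param_grid
# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions, drop it from the grid and scale the inputs beforehand.
param_grid = {'alpha': [0.1, 1, 10, 100],
              'normalize': [True, False],
              'tol': [1e-06, 5e-06, 1e-05, 5e-05]
             }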

# define the model
ridge = Ridge(random_state=0)

# Define the cv
cv = KFold(n_splits=5,shuffle=True,random_state=0)

# Define the scoring
scoring = 'r2'

# define the grid search
gscv_ridge= GridSearchCV(ridge, param_grid, 
                         cv=cv,
                         scoring=scoring)

# fit the grid search
gscv_ridge.fit(X_reduced_train, y_train)

# Print The value of best Hyperparameters
print("Best parameters for PCA Ridge Regression: {}".format(gscv_ridge.best_params_))


Best parameters for PCA Ridge Regression: {'alpha': 10, 'normalize': False, 'tol': 1e-06}


In [38]:
# define the best ridge model
pca_ridge = Ridge(alpha=gscv_ridge.best_params_['alpha'],
                  normalize=gscv_ridge.best_params_['normalize'],
                  tol=gscv_ridge.best_params_['tol']
                  )

# fit the model
pca_ridge.fit(X_reduced_train, y_train)

# train and test score
print('PCA Ridge')
print("Train score: {:.2f}".format(pca_ridge.score(X_reduced_train, y_train)))
print("Test score: {:.2f}".format(pca_ridge.score(X_reduced_test, y_test)))


PCA Ridge
Train score: 0.79
Test score: 0.50


The output above shows the training score (0.79) and the test score (0.50) of the PCA Ridge model.
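To get a feel for what the Ridge grid search above costs, the sketch below enumerates the same parameter grid with `itertools.product`. It only counts candidates and fits; it does not train anything, and the variable names are chosen for this illustration.

```python
from itertools import product

# Same grid as in the Ridge cell above
param_grid = {
    'alpha': [0.1, 1, 10, 100],
    'normalize': [True, False],
    'tol': [1e-06, 5e-06, 1e-05, 5e-05],
}
n_splits = 5  # KFold(n_splits=5)

names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]

print(len(candidates))              # 32 parameter combinations
print(len(candidates) * n_splits)   # 160 cross-validation fits
```

With the default `refit=True`, GridSearchCV adds one more fit of the best candidate on the full training set.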

In [39]:
# predict
y_pred_pca_ridge = pca_ridge.predict(X_reduced_test)

# evaluate
from sklearn.metrics import r2_score
from sklearn import metrics

print('PCA Ridge')
R2_pca_ridge = r2_score(y_test, y_pred_pca_ridge)
print("R2_pca_ridge: {:.2f}".format(R2_pca_ridge)) 
RMSE_pca_ridge = np.sqrt(metrics.mean_squared_error(y_test, y_pred_pca_ridge))
print('RMSE_pca_ridge: {:.2f}'.format(RMSE_pca_ridge))


PCA Ridge
R2_pca_ridge: 0.50
RMSE_pca_ridge: 0.10


The output above shows the R2 score (0.50) and the RMSE (0.10) of the PCA Ridge model on the test set.
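As a rough illustration of what the `alpha` values in the grid do, consider ridge regression with a single feature and no intercept, where the solution has the closed form w = sum(x*y) / (sum(x*x) + alpha). The data below is a toy example, not the house price data.

```python
def ridge_1d(x, y, alpha):
    """Closed-form ridge coefficient for one feature, no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)

# Toy data generated by y = 2x (hypothetical)
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

for alpha in [0, 0.1, 1, 10, 100]:
    print(alpha, round(ridge_1d(x, y, alpha), 3))
# alpha = 0 recovers the slope 2.0; larger alpha shrinks it toward 0
```

The same shrinkage acts on every coefficient in the multi-feature case, which is why a larger `alpha` trades training fit for stability.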

#### 5. PCA - Lasso Regression

In [40]:
from sklearn.linear_model import Lasso

param_grid = {'alpha':[0.001, 0.01, 0.1, 1, 10, 100, 250, 500, 1000]}

# define the model
lasso = Lasso()

# Define the cv
cv = KFold(n_splits=5,shuffle=True,random_state=0)

# Define the scoring
scoring = 'r2'

# define the grid search
gscv_lasso= GridSearchCV(lasso, param_grid, 
                         cv=cv,
                         scoring=scoring,
                         return_train_score=True)

# fit the grid search
gscv_lasso.fit(X_reduced_train, y_train)

# Print The value of best Hyperparameters
print("Best parameters for Lasso Regression: {}".format(gscv_lasso.best_params_))


Best parameters for Lasso Regression: {'alpha': 0.001}


In [41]:
# define the best lasso model
pca_lasso = Lasso(alpha=gscv_lasso.best_params_['alpha'])

# fit the model
pca_lasso.fit(X_reduced_train, y_train)

# train and test score
print('PCA Lasso')
print("Train score: {:.2f}".format(pca_lasso.score(X_reduced_train, y_train)))
print("Test score: {:.2f}".format(pca_lasso.score(X_reduced_test, y_test)))


PCA Lasso
Train score: 0.76
Test score: 0.47


The output above shows the training score (0.76) and the test score (0.47) of the PCA Lasso model.

In [42]:
# predict
y_pred_pca_lasso = pca_lasso.predict(X_reduced_test)

# evaluate
from sklearn.metrics import r2_score
from sklearn import metrics

print('PCA Lasso')
R2_pca_lasso = r2_score(y_test, y_pred_pca_lasso)
print("R2_pca_lasso: {:.2f}".format(R2_pca_lasso)) 
RMSE_pca_lasso = np.sqrt(metrics.mean_squared_error(y_test, y_pred_pca_lasso))
print('RMSE_pca_lasso: {:.2f}'.format(RMSE_pca_lasso))


PCA Lasso
R2_pca_lasso: 0.47
RMSE_pca_lasso: 0.11


The output above shows the R2 score (0.47) and the RMSE (0.11) of the PCA Lasso model on the test set.
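The grid search picked the smallest `alpha` in the grid (0.001). A small sketch may help explain why large values do poorly: for one feature and no intercept, the Lasso objective (1/(2n)) * sum((y - w*x)^2) + alpha * |w| is solved by soft-thresholding, and once `alpha` exceeds the feature-target covariance term the coefficient becomes exactly zero. The data is a toy example.

```python
def soft_threshold(z, alpha):
    """Shrink z toward zero by alpha; return 0 inside [-alpha, alpha]."""
    if z > alpha:
        return z - alpha
    if z < -alpha:
        return z + alpha
    return 0.0

def lasso_1d(x, y, alpha):
    """Minimizer of (1/(2n)) * sum((y - w*x)^2) + alpha * |w| for one feature."""
    n = len(x)
    rho = sum(a * b for a, b in zip(x, y)) / n
    z = sum(a * a for a in x) / n
    return soft_threshold(rho, alpha) / z

# Toy data generated by y = 2x (hypothetical)
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

for alpha in [0.001, 0.1, 1, 10]:
    print(alpha, round(lasso_1d(x, y, alpha), 4))
# the coefficient is exactly 0.0 at alpha = 10
```

If the target here is scaled to a small range (the RMSE values around 0.1 suggest so), most of the larger `alpha` values in the grid would zero out every coefficient.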

#### 6. PCA - LinearSVR

In [43]:
from sklearn.svm import LinearSVR

param_grid = {'C':[0.001, 0.01, 0.1, 1, 10, 100],
              'epsilon':[0.1,0.2,0.3,0.5]}

# define the model
svr = LinearSVR(random_state=0)

# Define the cv
cv = KFold(n_splits=5,shuffle=True,random_state=0)

# Define the scoring
scoring = 'r2'

# define the grid search
gscv_svr= GridSearchCV(svr, param_grid, 
                       cv=cv,
                       scoring=scoring,
                       return_train_score=True)

# fit the grid search
gscv_svr.fit(X_reduced_train, y_train)

# Print The value of best Hyperparameters
print("Best parameters for PCA LinearSVR: {}".format(gscv_svr.best_params_))


Best parameters for PCA LinearSVR: {'C': 0.01, 'epsilon': 0.1}


In [44]:
# define the best svr model
pca_svr = LinearSVR(C=gscv_svr.best_params_['C'],
              epsilon=gscv_svr.best_params_['epsilon'])

# fit the model
pca_svr.fit(X_reduced_train, y_train)

# train and test score
print('PCA LinearSVR')
print("Train score: {:.2f}".format(pca_svr.score(X_reduced_train, y_train)))
print("Test score: {:.2f}".format(pca_svr.score(X_reduced_test, y_test)))


PCA LinearSVR
Train score: 0.75
Test score: 0.45


The output above shows the training score (0.75) and the test score (0.45) of the PCA LinearSVR model.

In [45]:
# predict
y_pred_pca_svr = pca_svr.predict(X_reduced_test)

# evaluate
from sklearn.metrics import r2_score
from sklearn import metrics

print('PCA LinearSVR')
R2_pca_svr = r2_score(y_test, y_pred_pca_svr)
print("R2_pca_svr: {:.2f}".format(R2_pca_svr)) 
RMSE_pca_svr = np.sqrt(metrics.mean_squared_error(y_test, y_pred_pca_svr))
print('RMSE_pca_svr: {:.2f}'.format(RMSE_pca_svr))


PCA LinearSVR
R2_pca_svr: 0.45
RMSE_pca_svr: 0.11


The output above shows the R2 score (0.45) and the RMSE (0.11) of the PCA LinearSVR model on the test set.
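The `epsilon` parameter tuned above controls the width of the tube inside which errors cost nothing. A minimal sketch of the epsilon-insensitive loss (the default loss of LinearSVR) is shown below; the sample values are hypothetical.

```python
def epsilon_insensitive(y_true, y_pred, epsilon):
    """Loss is zero inside the epsilon tube and grows linearly outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Hypothetical target/prediction pairs on a [0, 1] scale
pairs = [(0.50, 0.55), (0.50, 0.65), (0.50, 0.90)]
for eps in [0.1, 0.2, 0.3, 0.5]:
    losses = [round(epsilon_insensitive(t, p, eps), 2) for t, p in pairs]
    print(eps, losses)
```

If the target is scaled to [0, 1], as it is in the Prediction section, an `epsilon` of 0.3 or 0.5 ignores most errors entirely, which may explain why the smallest value in the grid (0.1) was selected.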

#### 7. PCA - SVM with linear kernel

In [46]:
from sklearn.svm import SVR

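# Note: with kernel='linear', SVR ignores degree, coef0 and gamma, so only C matters here.
# The grid is a plain dict (the trailing comma that turned it into a tuple is removed).
param_grid = {
              'C' : [0.001, 0.01, 0.1, 1, 10, 100],
              'degree' : [3, 8],
              'coef0' : [0.01, 10, 0.5],
              'gamma' : ['auto', 'scale']}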

# define the model
svr_linear_kernal = SVR(kernel='linear')

# Define the cv
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Define the scoring
scoring = 'r2'

# define the grid search
gscv_svr_kernal= GridSearchCV(svr_linear_kernal, param_grid,
                              cv=cv,
                              scoring=scoring,
                              return_train_score=True)

# fit the grid search
gscv_svr_kernal.fit(X_reduced_train, y_train)

# Print The value of best Hyperparameters
print("Best parameters for PCA SVM with linear kernel: {}".format(gscv_svr_kernal.best_params_))


Best parameters for PCA SVM with linear kernel: {'C': 0.1, 'coef0': 0.01, 'degree': 3, 'gamma': 'auto'}


In [47]:
# define the best svr model
pca_svr_linear = SVR(kernel='linear',
               C=gscv_svr_kernal.best_params_['C'],
               degree=gscv_svr_kernal.best_params_['degree'],
               coef0=gscv_svr_kernal.best_params_['coef0'],
               gamma=gscv_svr_kernal.best_params_['gamma'])

# fit the model
pca_svr_linear.fit(X_reduced_train, y_train)

# train and test score
print('PCA SVM with Linear')
print("Train score: {:.2f}".format(pca_svr_linear.score(X_reduced_train, y_train)))
print("Test score: {:.2f}".format(pca_svr_linear.score(X_reduced_test, y_test)))


PCA SVM with Linear
Train score: 0.74
Test score: 0.55


The output above shows the training score (0.74) and the test score (0.55) of the PCA SVM model with linear kernel.

In [48]:
# predict
y_pred_pca_svr_linear = pca_svr_linear.predict(X_reduced_test)

# evaluate
from sklearn.metrics import r2_score
from sklearn import metrics

print('PCA SVM with Linear')
R2_pca_svr_linear = r2_score(y_test, y_pred_pca_svr_linear)
print("R2_pca_svr_linear: {:.2f}".format(R2_pca_svr_linear)) 
RMSE_pca_svr_linear = np.sqrt(metrics.mean_squared_error(y_test, y_pred_pca_svr_linear))
print('RMSE_pca_svr_linear: {:.2f}'.format(RMSE_pca_svr_linear))


PCA SVM with Linear
R2_pca_svr_linear: 0.55
RMSE_pca_svr_linear: 0.10


The output above shows the R2 score (0.55) and the RMSE (0.10) of the PCA SVM model with linear kernel on the test set.
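The same four-parameter grid is reused for the linear, poly, and rbf kernels, but each kernel reads only some of those parameters. The sketch below counts how many genuinely different models each search evaluates; the mapping `USED_BY_KERNEL` is written for this example from the scikit-learn SVR parameter descriptions.

```python
from itertools import product

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'degree': [3, 8],
    'coef0': [0.01, 10, 0.5],
    'gamma': ['auto', 'scale'],
}

# Grid parameters that each kernel actually uses (assumption: per the SVR docs)
USED_BY_KERNEL = {
    'linear': ['C'],
    'poly': ['C', 'degree', 'coef0', 'gamma'],
    'rbf': ['C', 'gamma'],
}

def distinct_models(kernel):
    """Number of grid points that differ in a parameter the kernel uses."""
    used = USED_BY_KERNEL[kernel]
    return len(set(product(*(param_grid[name] for name in used))))

total = 1
for values in param_grid.values():
    total *= len(values)

print(total)                          # 72 grid points per search
for kernel in USED_BY_KERNEL:
    print(kernel, distinct_models(kernel))
```

For the linear kernel, 72 grid points collapse to 6 distinct models, so the reported `coef0`, `degree`, and `gamma` are simply the first tied combination rather than tuned values.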

#### 8. PCA - SVM with poly kernel

In [49]:
from sklearn.svm import SVR

# The grid is a plain dict (the trailing comma that turned it into a tuple is removed)
param_grid = {
              'C' : [0.001, 0.01, 0.1, 1, 10, 100],
              'degree' : [3, 8],
              'coef0' : [0.01, 1, 10],
              'gamma' : ['auto', 'scale']}

# define the model
svr_poly_kernal = SVR(kernel='poly')

# Define the cv
cv = KFold(n_splits=5,shuffle=True,random_state=0)

# Define the scoring
scoring = 'r2'

# define the grid search
gscv_svr_kernal= GridSearchCV(svr_poly_kernal, param_grid,
                              cv=cv,
                              scoring=scoring,
                              return_train_score=True)

# fit the grid search
gscv_svr_kernal.fit(X_reduced_train, y_train)

# Print The value of best Hyperparameters
print("Best parameters for PCA SVM with poly kernel: {}".format(gscv_svr_kernal.best_params_))


Best parameters for PCA SVM with poly kernel: {'C': 0.01, 'coef0': 10, 'degree': 3, 'gamma': 'auto'}


In [50]:
# define the best svr model
pca_svr_poly = SVR(kernel='poly',
               C=gscv_svr_kernal.best_params_['C'],
               degree=gscv_svr_kernal.best_params_['degree'],
               coef0=gscv_svr_kernal.best_params_['coef0'],
               gamma=gscv_svr_kernal.best_params_['gamma'])


# fit the model
pca_svr_poly.fit(X_reduced_train, y_train)

# train and test score
print('PCA SVM with Poly')
print("Train score: {:.2f}".format(pca_svr_poly.score(X_reduced_train, y_train)))
print("Test score: {:.2f}".format(pca_svr_poly.score(X_reduced_test, y_test)))


PCA SVM with Poly
Train score: 0.75
Test score: 0.56


The output above shows the training score (0.75) and the test score (0.56) of the PCA SVM model with poly kernel.

In [51]:
# predict
y_pred_pca_svr_poly = pca_svr_poly.predict(X_reduced_test)

# evaluate
from sklearn.metrics import r2_score
from sklearn import metrics

print('PCA SVM with Poly')
R2_pca_svr_poly = r2_score(y_test, y_pred_pca_svr_poly)
print("R2_pca_svr_poly: {:.2f}".format(R2_pca_svr_poly)) 
RMSE_pca_svr_poly = np.sqrt(metrics.mean_squared_error(y_test, y_pred_pca_svr_poly))
print('RMSE_pca_svr_poly: {:.2f}'.format(RMSE_pca_svr_poly))


PCA SVM with Poly
R2_pca_svr_poly: 0.56
RMSE_pca_svr_poly: 0.10


The output above shows the R2 score (0.56) and the RMSE (0.10) of the PCA SVM model with poly kernel on the test set.
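To make the roles of `degree`, `coef0`, and `gamma` concrete, here is a plain-Python sketch of the polynomial kernel, K(x, z) = (gamma * <x, z> + coef0) ** degree. The two input vectors are hypothetical; `gamma = 1 / n_features` mirrors what `gamma='auto'` means in scikit-learn.

```python
def poly_kernel(x, z, gamma, coef0, degree):
    """Polynomial kernel: (gamma * <x, z> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, z))
    return (gamma * dot + coef0) ** degree

# Two hypothetical samples with 2 features each
x = [1.0, 2.0]
z = [3.0, 4.0]
gamma_auto = 1 / len(x)   # gamma='auto' corresponds to 1 / n_features

print(poly_kernel(x, z, gamma=gamma_auto, coef0=10, degree=3))
print(poly_kernel(x, z, gamma=gamma_auto, coef0=0.01, degree=3))
```

A large `coef0` such as the selected value of 10 increases the weight of the lower-order terms relative to the pure degree-3 terms.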

#### 9. PCA - SVM with rbf kernel

In [52]:
from sklearn.svm import SVR

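# Note: with kernel='rbf', SVR ignores degree and coef0, so only C and gamma matter here.
# The grid is a plain dict (the trailing comma that turned it into a tuple is removed).
param_grid = {
              'C' : [0.001, 0.01, 0.1, 1, 10, 100],
              'degree' : [3, 8],
              'coef0' : [0.01, 10, 0.5],
              'gamma' : ['auto', 'scale']}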

# define the model
svr_rbf_kernal = SVR(kernel='rbf')

# Define the cv
cv = KFold(n_splits=5,shuffle=True,random_state=0)

# Define the scoring
scoring = 'r2'

# define the grid search
gscv_svr_kernal= GridSearchCV(svr_rbf_kernal, param_grid,
                              cv=cv,
                              scoring=scoring,
                              return_train_score=True)

# fit the grid search
gscv_svr_kernal.fit(X_reduced_train, y_train)

# Print The value of best Hyperparameters
print("Best parameters for PCA SVM with rbf kernel: {}".format(gscv_svr_kernal.best_params_))


Best parameters for PCA SVM with rbf kernel: {'C': 1, 'coef0': 0.01, 'degree': 3, 'gamma': 'auto'}


In [53]:
# define the best svr model
pca_svr_rbf = SVR(kernel='rbf',
               C=gscv_svr_kernal.best_params_['C'],
               degree=gscv_svr_kernal.best_params_['degree'],
               coef0=gscv_svr_kernal.best_params_['coef0'],
               gamma=gscv_svr_kernal.best_params_['gamma'])

# fit the model
pca_svr_rbf.fit(X_reduced_train, y_train)

# train and test score
print('PCA SVM with Rbf')
print("Train score: {:.2f}".format(pca_svr_rbf.score(X_reduced_train, y_train)))
print("Test score: {:.2f}".format(pca_svr_rbf.score(X_reduced_test, y_test)))


PCA SVM with Rbf
Train score: 0.76
Test score: 0.58


The output above shows the training score (0.76) and the test score (0.58) of the PCA SVM model with rbf kernel.

In [54]:
# predict
y_pred_pca_svr_rbf = pca_svr_rbf.predict(X_reduced_test)

# evaluate
from sklearn.metrics import r2_score
from sklearn import metrics

print('PCA SVM with Rbf')
R2_pca_svr_rbf = r2_score(y_test, y_pred_pca_svr_rbf)
print("R2_pca_svr_rbf: {:.2f}".format(R2_pca_svr_rbf)) 
RMSE_pca_svr_rbf = np.sqrt(metrics.mean_squared_error(y_test, y_pred_pca_svr_rbf))
print('RMSE_pca_svr_rbf: {:.2f}'.format(RMSE_pca_svr_rbf))


PCA SVM with Rbf
R2_pca_svr_rbf: 0.58
RMSE_pca_svr_rbf: 0.09


The output above shows the R2 score (0.58) and the RMSE (0.09) of the PCA SVM model with rbf kernel on the test set.
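For reference, the rbf kernel measures similarity as K(x, z) = exp(-gamma * ||x - z||^2). The sketch below uses hypothetical 3-feature vectors; with the 48 PCA components used here, `gamma='auto'` would correspond to 1/48.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Hypothetical samples
x = [0.1, 0.5, 0.9]
near = [0.2, 0.5, 0.8]
far = [0.9, 0.0, 0.1]
gamma_auto = 1 / len(x)   # gamma='auto' corresponds to 1 / n_features

print(rbf_kernel(x, x, gamma_auto))      # 1.0 for identical points
print(rbf_kernel(x, near, gamma_auto))   # close to 1
print(rbf_kernel(x, far, gamma_auto))    # smaller: points are farther apart
```

A smaller `gamma` makes the kernel decay more slowly with distance, which gives a smoother regression function.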

#### 10. PCA - Decision Tree Regressor

In [64]:
from sklearn.tree import DecisionTreeRegressor

param_grid = {'max_depth': [3, 5, 10],
              'max_features': [3, 4, 5],
             'random_state': [0]}

# define the model
dt = DecisionTreeRegressor(random_state=0)

# Define the cv
cv = KFold(n_splits=5,shuffle=True,random_state=0)

# Define the scoring
scoring = 'r2'

# define the grid search
gscv_dt = GridSearchCV(dt, param_grid,
                       cv=cv,
                       scoring=scoring,
                       return_train_score=True)

# fit the grid search
gscv_dt.fit(X_reduced_train, y_train)

# Print The value of best Hyperparameters
print("Best parameters for PCA Decision Tree Regressor: {}".format(gscv_dt.best_params_))


Best parameters for PCA Decision Tree Regressor: {'max_depth': 5, 'max_features': 4, 'random_state': 0}


In [65]:
# define the best dt model
pca_dt = DecisionTreeRegressor(
    max_depth=gscv_dt.best_params_['max_depth'],
    max_features=gscv_dt.best_params_['max_features'],
    random_state=0)

# fit the model
pca_dt.fit(X_reduced_train, y_train)

# train and test score
print('PCA Decision Tree')
print("Train score: {:.2f}".format(pca_dt.score(X_reduced_train, y_train)))
print("Test score: {:.2f}".format(pca_dt.score(X_reduced_test, y_test)))


PCA Decision Tree
Train score: 0.48
Test score: 0.02


The output above shows the training score (0.48) and the test score (0.02) of the PCA Decision Tree model.

In [57]:
# predict
y_pred_pca_dt = pca_dt.predict(X_reduced_test)

# evaluate
from sklearn.metrics import r2_score
from sklearn import metrics

print('PCA Decision Tree')
R2_pca_dt = r2_score(y_test, y_pred_pca_dt)
print("R2_pca_dt: {:.2f}".format(R2_pca_dt)) 
RMSE_pca_dt = np.sqrt(metrics.mean_squared_error(y_test, y_pred_pca_dt))
print('RMSE_pca_dt: {:.2f}'.format(RMSE_pca_dt))


PCA Decision Tree
R2_pca_dt: 0.02
RMSE_pca_dt: 0.14


The output above shows the R2 score (0.02) and the RMSE (0.14) of the PCA Decision Tree model on the test set.
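As background for the tree results, the sketch below shows how a regression tree could choose one split on a single feature: it tries every threshold between consecutive values and keeps the one with the lowest total squared error. This is a simplified illustration with toy data, not scikit-learn's implementation; the helper names are made up.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_sse) of the best single split on one feature."""
    pairs = sorted(zip(x, y))
    best = (None, sse([t for _, t in pairs]))   # no split at all
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                             # cannot split equal values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, total)
    return best

# Toy data with an obvious jump between x = 3 and x = 10
x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
print(best_split(x, y))   # (6.5, 0.0)
```

With `max_features` set to 3, 4, or 5, each split considers only a few of the 48 PCA components, which may contribute to the weak fit reported above.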

### Model Results Comparison:

In [102]:
from tabulate import tabulate

table1 = [['Algorithm','Train score', 'R2', 'RMSE',
           'Train score (After PCA)', 'R2 (After PCA)', 'RMSE (After PCA)'], 
          ['KNN', 0.69, 0.29, 0.12, 0.69, 0.30, 0.12], 
          ['Linear', 0.84, 0.57, 0.10, 0.79, 0.51, 0.10], 
          ['Polynomial', 1.00, 0.43, 0.11, 1.00, -2.99, 0.29], 
          ['Ridge', 0.82, 0.54, 0.10, 0.79, 0.50, 0.10], 
          ['Lasso', 0.80, 0.51, 0.10, 0.76, 0.47, 0.11], 
          ['LinearSVR', 0.70, 0.60, 0.09, 0.75, 0.45, 0.11],
          ['SVM with linear kernel', 0.69, 0.61, 0.09, 0.74, 0.55, 0.10], 
          ['SVM with poly kernel', 0.76, 0.62, 0.09, 0.75, 0.56, 0.10], 
          ['SVM with rbf kernel', 0.72, 0.63, 0.09, 0.76, 0.58, 0.09],
          ['Decision Tree', 0.92, 0.27, 0.12, 0.48, 0.02, 0.14]]

print("Before PCA / After PCA:\n")
print(tabulate(table1, headers='firstrow'))

Before PCA / After PCA:

Algorithm                 Train score    R2    RMSE    Train score (After PCA)    R2 (After PCA)    RMSE (After PCA)
----------------------  -------------  ----  ------  -------------------------  ----------------  ------------------
KNN                              0.69  0.29    0.12                       0.69              0.3                 0.12
Linear                           0.84  0.57    0.1                        0.79              0.51                0.1
Polynomial                       1     0.43    0.11                       1                -2.99                0.29
Ridge                            0.82  0.54    0.1                        0.79              0.5                 0.1
Lasso                            0.8   0.51    0.1                        0.76              0.47                0.11
LinearSVR                        0.7   0.6     0.09                       0.75              0.45                0.11
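SVM with linear kernel           0.69  0.61    0.09                       0.74              0.55                0.1
SVM with poly kernel             0.76  0.62    0.09                       0.75              0.56                0.1
SVM with rbf kernel              0.72  0.63    0.09                       0.76              0.58                0.09
Decision Tree                    0.92  0.27    0.12                       0.48              0.02                0.14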

The table above compares the training score, the test R2 score, and the RMSE of each model before and after applying PCA.

(1) KNN model: The R2 score is slightly better after applying PCA (0.30 versus 0.29), and the RMSE remains the same. Therefore, PCA gives the KNN model a marginal improvement at best.

(2) Linear Regression: The R2 score reduces from 0.57 to 0.51 after applying PCA. Hence, we can conclude that applying PCA does not help us find a better Linear Regression model.

(3) Polynomial Regression: Both the original polynomial model and the polynomial model with PCA overfit, as both train scores are 1.00. The training set without PCA has 1094 rows and 105 features, and the training set with PCA has 1094 rows and 48 features, so in both cases the number of features exceeds the number of rows after the PolynomialFeatures transform with degree = 2 (1225 features after PCA, as shown earlier). As mentioned before, neither polynomial regression model is useful.

(4) Ridge: The train score reduces from 0.82 to 0.79, and the R2 score reduces from 0.54 to 0.50, while the RMSE remains at 0.10. Hence, we can conclude that applying PCA does not help us find a better Ridge model.

(5) Lasso: The train score reduces from 0.80 to 0.76, the R2 score reduces from 0.51 to 0.47, and the RMSE increases from 0.1 to 0.11. Hence, we can conclude that applying PCA does not help us to find a better Lasso model.

(6) Linear SVR: Although the training score improves from 0.70 to 0.75, the R2 score reduces from 0.60 to 0.45. Additionally, the RMSE increases from 0.09 to 0.11. Hence, we can conclude that applying PCA does not help us to find a better Linear SVR model.

(7) Kernel SVM (linear): The train score improves from 0.69 to 0.74, but the R2 score reduces from 0.61 to 0.55. Also, the RMSE increases from 0.09 to 0.10. Hence, we can infer that applying PCA does not help us find a better SVM model with linear kernel.

(8) Kernel SVM (poly): The train score reduces from 0.76 to 0.75, and the R2 score reduces from 0.62 to 0.56. Also, the RMSE increases from 0.09 to 0.10. Hence, we can infer that applying PCA does not help us find a better SVM model with poly kernel.

(9) Kernel SVM (rbf): The train score improves from 0.72 to 0.76, but the R2 score reduces from 0.63 to 0.58. Hence, we can infer that applying PCA does not help us find a better SVM model with rbf kernel.

(10) Decision Tree: The train score drops from 0.92 to 0.48 and the R2 score from 0.27 to 0.02, and the RMSE increases from 0.12 to 0.14. Hence, we can infer that applying PCA does not help us find a better Decision Tree model.

Conclusion: Almost all models get a lower test score after PCA, even though some have a higher train score. PCA is typically most useful when multicollinearity exists between the features. Therefore, we may conclude that the features in this dataset are not highly correlated.
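The comparison above can also be reproduced mechanically. The sketch below copies the R2 columns from the results table and lists the change in R2 after PCA for each model; only the list name `r2_scores` is introduced for this example.

```python
# (algorithm, R2 before PCA, R2 after PCA), copied from the table above
r2_scores = [
    ('KNN', 0.29, 0.30),
    ('Linear', 0.57, 0.51),
    ('Polynomial', 0.43, -2.99),
    ('Ridge', 0.54, 0.50),
    ('Lasso', 0.51, 0.47),
    ('LinearSVR', 0.60, 0.45),
    ('SVM with linear kernel', 0.61, 0.55),
    ('SVM with poly kernel', 0.62, 0.56),
    ('SVM with rbf kernel', 0.63, 0.58),
    ('Decision Tree', 0.27, 0.02),
]

for name, before, after in r2_scores:
    print(f"{name:<24}{after - before:+.2f}")

improved = [name for name, before, after in r2_scores if after > before]
print(improved)   # ['KNN']
```

Only KNN improves, and only by 0.01, which supports the conclusion that PCA does not help on this dataset.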

### MLP

In [59]:
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV
import numpy as np

# fix random seed for reproducibility
seed = 10
np.random.seed(seed)

In [60]:
X_train.shape

(1094, 105)

In [61]:
def create_model():
    # create model: one hidden layer with 105 units and one linear output unit
    model = Sequential()
    model.add(Dense(105, input_dim=105, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model (metrics is passed as a list)
    model.compile(loss='mse', optimizer='sgd', metrics=['mse'])
    return model

gscv_model_keras = KerasRegressor(build_fn = create_model, verbose = 0)

param_grid = {'batch_size':[16, 32, 64] , 'epochs':[25, 50]}

grid_search = GridSearchCV(estimator= gscv_model_keras, 
                            param_grid = param_grid, 
                            cv = 5)

# Fit the grid search
grid_search_result = grid_search.fit(X_train, y_train)

# Print The value of best Hyperparameters
print("Best parameters: {}".format(grid_search.best_params_))

Best parameters: {'batch_size': 16, 'epochs': 50}


The best hyper-parameters for the Keras Regressor are: {'batch_size': 16, 'epochs': 50}.

In [62]:
# create model
model = Sequential()
model.add(Dense(105, input_dim=105, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))

# Compile model (metrics is passed as a list)
model.compile(loss='mse', optimizer='sgd', metrics=['mse'])

# Fit model
model.fit(X_train, y_train, 
          epochs=grid_search.best_params_['epochs'],
          batch_size=grid_search.best_params_['batch_size'],
          validation_data=(X_test, y_test)
)

from sklearn.metrics import r2_score

y_train_predict = model.predict(X_train)
y_test_predict = model.predict(X_test)

Epoch 1/50
...
Epoch 50/50


In [63]:
print('Train score: {:.2f}'.format(r2_score(y_train, y_train_predict)))
print('Test score: {:.2f}'.format(r2_score(y_test, y_test_predict)))
from sklearn.metrics import mean_squared_error
print('RMSE score: {:.2f}'.format(mean_squared_error(y_test, y_test_predict, squared=False)))

Train score: 0.80
Test score: 0.48
RMSE score: 0.10


The output above shows the training score (0.80), the test score (0.48), and the RMSE (0.10) of the deep learning model.
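One way to put the gap between the train and test scores in context is to count the trainable parameters of the network defined above (a 105-unit hidden layer followed by a single output unit) and compare that with the 1094 training rows. The helper `dense_params` is a name made up for this sketch.

```python
import math

def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

# (inputs, units) for Dense(105, input_dim=105) and Dense(1)
layers = [(105, 105), (105, 1)]
total_params = sum(dense_params(n_in, n_out) for n_in, n_out in layers)

n_train = 1094
batch_size = 16
print(total_params)                       # 11236 trainable parameters
print(total_params / n_train)             # roughly 10 parameters per row
print(math.ceil(n_train / batch_size))    # 69 gradient steps per epoch
```

With about ten parameters per training row, some overfitting is to be expected unless regularization or early stopping is added.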

## Model Selection

In [101]:
table2 = [['Algorithm','Train score', 'R2', 'RMSE'], 
          ['KNN with Adaboost', 0.88, 0.41, 0.11], 
          ['Linear with Adaboost', 0.88, 0.61, 0.09], 
          ['Ridge with Bagging', 0.79, 0.51, 0.10], 
          ['Ridge with Pasting', 0.74, 0.43, 0.11], 
          ['Lasso with Bagging', 0.81, 0.52, 0.10], 
          ['Lasso with Pasting', 0.80, 0.50, 0.10], 
          ['Gradient', 0.94, 0.56, 0.10], 
          ['deep learning', 0.80, 0.48, 0.10]]

print("Model Selection:\n")
print(tabulate(table2, headers='firstrow'))
print("\n")
print(tabulate(table1, headers='firstrow'))

Model Selection:

Algorithm               Train score    R2    RMSE
--------------------  -------------  ----  ------
KNN with Adaboost              0.88  0.41    0.11
Linear with Adaboost           0.88  0.61    0.09
Ridge with Bagging             0.79  0.51    0.1
Ridge with Pasting             0.74  0.43    0.11
Lasso with Bagging             0.81  0.52    0.1
Lasso with Pasting             0.8   0.5     0.1
Gradient                       0.94  0.56    0.1
deep learning                  0.8   0.48    0.1


Algorithm                 Train score    R2    RMSE    Train score (After PCA)    R2 (After PCA)    RMSE (After PCA)
----------------------  -------------  ----  ------  -------------------------  ----------------  ------------------
KNN                              0.69  0.29    0.12                       0.69              0.3                 0.12
Linear                           0.84  0.57    0.1                        0.79              0.51                0.1
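Polynomial                       1     0.43    0.11                       1                -2.99                0.29
Ridge                            0.82  0.54    0.1                        0.79              0.5                 0.1
Lasso                            0.8   0.51    0.1                        0.76              0.47                0.11
LinearSVR                        0.7   0.6     0.09                       0.75              0.45                0.11
SVM with linear kernel           0.69  0.61    0.09                       0.74              0.55                0.1
SVM with poly kernel             0.76  0.62    0.09                       0.75              0.56                0.1
SVM with rbf kernel              0.72  0.63    0.09                       0.76              0.58                0.09
Decision Tree                    0.92  0.27    0.12                       0.48              0.02                0.14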

There are two tables above. The first lists the models built with ensemble methods together with the deep learning model; the second lists each base model before and after applying PCA.

To find the best regression model, we choose R2 and RMSE as our evaluation metrics. The higher the R2, the better; the lower the RMSE, the more accurate the regression model. Additionally, we prefer the difference between the train score and the test score to be as small as possible.
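These selection criteria can be applied directly to the numbers in the two tables. The sketch below copies the no-PCA columns and the ensemble results, then sorts by highest R2, lowest RMSE, and smallest train/test gap, in that order; this tie-breaking order is an assumption made for the illustration.

```python
# (algorithm, train score, test R2, RMSE), copied from the two tables above
results = [
    ('KNN', 0.69, 0.29, 0.12), ('Linear', 0.84, 0.57, 0.10),
    ('Polynomial', 1.00, 0.43, 0.11), ('Ridge', 0.82, 0.54, 0.10),
    ('Lasso', 0.80, 0.51, 0.10), ('LinearSVR', 0.70, 0.60, 0.09),
    ('SVM with linear kernel', 0.69, 0.61, 0.09),
    ('SVM with poly kernel', 0.76, 0.62, 0.09),
    ('SVM with rbf kernel', 0.72, 0.63, 0.09),
    ('Decision Tree', 0.92, 0.27, 0.12),
    ('KNN with Adaboost', 0.88, 0.41, 0.11),
    ('Linear with Adaboost', 0.88, 0.61, 0.09),
    ('Ridge with Bagging', 0.79, 0.51, 0.10),
    ('Ridge with Pasting', 0.74, 0.43, 0.11),
    ('Lasso with Bagging', 0.81, 0.52, 0.10),
    ('Lasso with Pasting', 0.80, 0.50, 0.10),
    ('Gradient', 0.94, 0.56, 0.10),
    ('deep learning', 0.80, 0.48, 0.10),
]

# Highest R2 first, then lowest RMSE, then smallest train/test gap
ranked = sorted(results, key=lambda r: (-r[2], r[3], abs(r[1] - r[2])))
best = ranked[0]

for name, train, r2, rmse in ranked[:3]:
    print(f"{name:<24} R2={r2:.2f} RMSE={rmse:.2f} gap={train - r2:.2f}")
```

The top three entries are all kernel SVMs trained without PCA, and they differ by only 0.01 in R2, so the ranking among them is not very robust.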

To sum up, the ``SVM regressor (SVR) with RBF kernel``, trained without PCA, is the optimal model for prediction since it has the highest R2 score of 0.63 and a low RMSE value of 0.09.

## Prediction

In [79]:
# scale
scaler = MinMaxScaler()
## entire train set
X_scaled = scaler.fit_transform(X)
y_scaled = (y - y.min()) / (y.max() - y.min())
## entire test set
X_test_scaled = scaler.transform(test_df)

# define the model
best_svr_rbf = SVR(kernel='rbf',
               C=1,
               degree=3,
               coef0=0.01,
               gamma='auto')

# fit the model
best_svr_rbf.fit(X_scaled, y_scaled)

# train score
print("The best model - SVM classifier with RBF kernel")
print("Train score: {:.2f}".format(best_svr_rbf.score(X_scaled, y_scaled)))


The best model - SVM classifier with RBF kernel
Train score: 0.76


From Project 1, we know that the best hyper-parameters found by grid search for the SVM with RBF kernel are {'C': 1, 'coef0': 0.01, 'degree': 3, 'gamma': 'auto'}, so we use them to build the final model. Finally, we train the model on the entire training dataset and use it to predict the 1459 house prices in the test dataset.

In [80]:
# predict
predictions = best_svr_rbf.predict(X_test_scaled)

In [81]:
SalePrice_prediction = pd.DataFrame(columns=['Id', 'SalePrice'])
SalePrice_prediction['Id'] = test_df_forPred['Id']
SalePrice_prediction['SalePrice'] = predictions * (y.max() - y.min()) + y.min()
SalePrice_prediction['SalePrice'] = round(SalePrice_prediction['SalePrice'], 0)
SalePrice_prediction.head(10)

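|   | Id | SalePrice |
|---|------|----------|
| 0 | 1461 | 114767.0 |
| 1 | 1462 | 189034.0 |
| 2 | 1463 | 182759.0 |
| 3 | 1464 | 218839.0 |
| 4 | 1465 | 281192.0 |
| 5 | 1466 | 176882.0 |
| 6 | 1467 | 220317.0 |
| 7 | 1468 | 167245.0 |
| 8 | 1469 | 205072.0 |
| 9 | 1470 | 129000.0 |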


The table above shows the first 10 predicted house prices by Id, using the best model, the ``SVM regressor (SVR) with RBF kernel``. We also save all predicted sale prices to the CSV file 'SalePrice_prediction_Project2.csv'.
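Because the model is trained on a min-max scaled target, its predictions have to be mapped back to dollars, as done above with `predictions * (y.max() - y.min()) + y.min()`. The sketch below shows that round trip in plain Python; the prices are hypothetical and the helper names are made up for this example.

```python
def minmax_scale(values):
    """Scale values to [0, 1]; also return the min and max used."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values], lo, hi

def inverse_minmax(scaled, lo, hi):
    """Map scaled values back to the original units."""
    return [s * (hi - lo) + lo for s in scaled]

# Hypothetical sale prices in dollars
prices = [100000.0, 150000.0, 250000.0]
scaled, lo, hi = minmax_scale(prices)
restored = inverse_minmax(scaled, lo, hi)

print([round(s, 3) for s in scaled])
print([round(p) for p in restored])
```

The inverse transform must use the minimum and maximum of the training target, not of the predictions, otherwise the recovered prices are distorted.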

In [82]:
SalePrice_prediction.to_csv('SalePrice_prediction_Project2.csv', index=False)