# Part 6 - Advanced Regression Techniques
In this notebook we will investigate some popular advanced regression techniques:  
* XGBoost
* Random Forest
* MultiLayer Perceptron (Neural Network)
  
We will use the exact same dataset and features as before and compare the results with our Linear Regressor.  
You will be happy to learn that the same procedure for training a Linear Regressor applies to nearly all other regression models!

In [30]:
import time
import pickle
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
%matplotlib inline

## Let's load the data and remind ourselves of the contents

In [2]:
df = pd.read_csv('./data/rew_van_jan12_clean_engineered.csv')
df.head()

Unnamed: 0,bath,bed,sqft,price,area_Vancouver East,area_Vancouver West,property_type_Apt/Condo,property_type_Duplex,property_type_House,property_type_Other,...,sub_area_Shaughnessy,sub_area_South Cambie,sub_area_South Granville,sub_area_South Vancouver,sub_area_Southlands,sub_area_Southwest Marine,sub_area_University (UBC),sub_area_Victoria East,sub_area_West End,sub_area_Yaletown
0,1,1,630,479900,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,4,1904,1098800,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,2,2,841,699000,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,6,6,3344,1988800,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1210,1575000,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [3]:
features = [feature for feature in df.columns if feature != 'price']
X = df[features]
y = df['price']
X_np = X.values
y_np = y.values.reshape((len(df), 1))

In [11]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.30, random_state=123) # split 70% train, 30% validation

In [12]:
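def evaluate_model(model, X, y):
    """Print the mean squared error and the R^2 score of a fitted regressor on (X, y)."""
    y_pred = model.predict(X) # predict y values from input X
    mse = mean_squared_error(y_true=y, y_pred=y_pred)
    print("Mean Squared Error: {}".format(mse))
    # for regressors, model.score returns R^2 (coefficient of determination), not a classification accuracy
    print("Accuracy: {}%".format(model.score(X, y)*100.0))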
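
The "Accuracy" printed by `evaluate_model` comes from `model.score`, which for scikit-learn regressors is the R^2 score rather than a percentage of correct predictions. The sketch below recomputes MSE, RMSE and R^2 by hand so it is clear what the numbers mean. The helper name `regression_metrics` and the prices are made up for illustration and are not part of the notebook's data.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R^2 for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    # Residual sum of squares: how far predictions are from the targets.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: how far targets are from their own mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1.0 - ss_res / ss_tot}

# Hypothetical listing prices and predictions (illustrative values only).
y_true = [500_000, 700_000, 900_000, 1_100_000]
y_pred = [550_000, 650_000, 950_000, 1_050_000]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE is in the same units as `price`, which makes it easier to read than MSE: an MSE of about 8.5e11 corresponds to an RMSE of roughly 920,000.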

## XGBoost
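Like Random Forest, XGBoost is an ensemble of decision trees: several weak learners are combined to produce a result. The difference is that XGBoost uses gradient boosting, building trees one after another so that each new tree corrects the errors of the trees before it, whereas a Random Forest trains its trees independently and averages them.  
A particularly useful property of XGBoost (in addition to performing very well across many datasets) is that you can inspect the "feature importance" scores to get an idea of which features the model relies on for its predictions.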
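
To make the idea of combining weak learners concrete, here is a minimal sketch of boosting on a toy one-dimensional problem. It is not XGBoost's actual algorithm (which adds regularisation and other refinements). It only shows the core loop: fit a very simple model (a one-split "stump") to the current errors, add a fraction of it to the ensemble, and repeat. All names and data values here are hypothetical.

```python
# Toy 1-D data with a step-like pattern (made-up numbers).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

def mean(values):
    return sum(values) / len(values)

def mse(targets, preds):
    return mean([(t - p) ** 2 for t, p in zip(targets, preds)])

def fit_stump(xs, targets):
    """Weak learner: the single split on x that minimises squared error."""
    best = None
    # Skip the largest x so the right-hand side of the split is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=10, learning_rate=0.5):
    """Fit stumps one after another, each to the residuals of the ensemble so far."""
    base = mean(ys)                      # start from a constant prediction
    stumps = []
    preds = [base] * len(xs)
    history = [mse(ys, preds)]           # training error after each round
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history

predict, history = boost(xs, ys)
print("Training MSE before boosting:", history[0])
print("Training MSE after boosting: ", history[-1])
```

Each stump on its own is a poor model, but the training error shrinks as more of them are added. A Random Forest would instead fit each tree to the original targets on a random resample of the data and average the trees.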

Import the xgboost library and fit our regressor the same way as before:

In [33]:
from xgboost import XGBRegressor
xgb_regressor = XGBRegressor()
xgb_model = xgb_regressor.fit(X_train, y_train)
evaluate_model(xgb_model, X_val, y_val)

Mean Squared Error: 847471900537.3346
Accuracy: 77.65274738914258%


### Visualize the Feature Importance that XGBRegressor has assigned

In [34]:
# create a dataframe of feature importances
feature_importances = pd.DataFrame(columns=X.columns)
feature_importances.loc[0] = xgb_model.feature_importances_
# melt columns so we can easily sort and visualize
df_melt = pd.melt(feature_importances, value_vars=X.columns).sort_values(by='value', ascending=False)
df_melt

Unnamed: 0,variable,value
2,sqft,0.406051
1,bed,0.089172
0,bath,0.084395
3,area_Vancouver East,0.05414
21,sub_area_Coal Harbour,0.041401
47,sub_area_Shaughnessy,0.02707
56,sub_area_Yaletown,0.023885
9,property_type_Townhouse,0.022293
13,strata_type_Leasehold not prepaid-NonStrata,0.022293
52,sub_area_Southwest Marine,0.022293


### Retrain on entire dataset and save model to disk

In [100]:
xgb_model = xgb_regressor.fit(X, y)
with open('./models/xgb.pkl', 'wb') as f:
    pickle.dump(xgb_model, f)
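
A saved model is only useful if it can be loaded again, so here is a minimal sketch of the round trip. To keep it self-contained, it pickles a plain dictionary as a stand-in for the fitted model and writes to a temporary directory. The names `fake_model` and `restored` are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model: any picklable Python object round-trips the same way.
fake_model = {"name": "xgb", "params": {"n_estimators": 100}}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "xgb.pkl")

    # Save: binary write mode, as in the cell above.
    with open(model_path, "wb") as f:
        pickle.dump(fake_model, f)

    # Load: binary read mode returns an equivalent copy of the saved object.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored)
```

With a real model, the restored object is used like the original, for example `restored.predict(X_val)`. Note that `open(..., 'wb')` does not create missing directories, so `./models` has to exist before saving. Pickle files should only be loaded from trusted sources, ideally with the same library versions that wrote them.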

## Random Forest  
https://en.wikipedia.org/wiki/Random_forest

In [83]:
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor()
rf_model = rf_regressor.fit(X_train, y_train)
evaluate_model(rf_model, X_val, y_val)

Mean Squared Error: 809116075159.4926
Accuracy: 78.66416419042307%


### Retrain on entire dataset and save model to disk

In [97]:
rf_model = rf_regressor.fit(X, y)
with open('./models/random_forest.pkl', 'wb') as f:
    pickle.dump(rf_model, f)

## MultiLayer Perceptron
https://en.wikipedia.org/wiki/Multilayer_perceptron

In [107]:
from sklearn.neural_network import MLPRegressor
mlp_regressor = MLPRegressor(max_iter=20000, random_state=123, solver='lbfgs')
mlp_model = mlp_regressor.fit(X_train, y_train)
evaluate_model(mlp_model, X_val, y_val)

Mean Squared Error: 1084289075536.2954
Accuracy: 71.40804095234401%


### Retrain on entire dataset and save model to disk

In [108]:
mlp_model = mlp_regressor.fit(X, y)
with open('./models/mlp.pkl', 'wb') as f:
    pickle.dump(mlp_model, f)

## As you can see, once our data pipeline is established, it is quite easy to implement various regressors! We will 