# Predictive Models

This notebook builds two models:

1. An XGBoost model, evaluated with hold-out validation
2. A model built with my own Regressor_GradientBoost, evaluated with hold-out validation


References: 
1. https://www.kaggle.com/omarito/gridsearchcv-xgbregressor-0-556-lb
2. https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
3. https://www.datacamp.com/community/tutorials/xgboost-in-python
4. https://aiinpractice.com/xgboost-hyperparameter-tuning-with-bayesian-optimization/
5. https://github.com/fmfn/BayesianOptimization
6. https://www.kaggle.com/btyuhas/bayesian-optimization-with-xgboost/notebook

## Table of Contents

1. [Data Import & Pre-Processing](#LoadingData)
2. [XGBoost Model](#XGBoostModel)
3. [DIY Gradient Boosting](#DIYGradientBoostingRegressor)

### Import libraries

In [47]:
%matplotlib inline

import matplotlib.pylab as plt
import numpy as np
import os.path
import pickle
import sys
import pandas as pd
import re
import xgboost as xgb

from bayes_opt import BayesianOptimization
from matplotlib.pylab import rcParams
from sklearn import metrics 
from sklearn.model_selection import cross_validate, cross_val_score

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

# Parsian's modules
from src import FilePaths, Data_Properties, Make_DataSet, Regressor_GradientBoost, RegressGB_Parameters, RegressXGB_Parameters

## Data Import & Processing  <a class="anchor" id="LoadingData"></a>

### Profiling Summary:

Here are the results of the profiling done in the 'exploratory-data-analysis' Jupyter notebook:

* In total there are 66 columns and 636,984 rows.
* There are missing values in six columns:
	
| Column | Zero Values | Missing Values | % of Total Values | Total Zero + Missing Values | % Total Zero + Missing Values | Data Type |
| --- | --- | --- | --- | --- | --- | --- |
| feature_10 | 0 | 69566 | 10.9 | 69566 | 10.9 | float64 |
| feature_62 | 0 | 69566 | 10.9 | 69566 | 10.9 | float64 |
| feature_36 | 0 | 37907 | 6.0 | 37907 | 6.0 | float64 |
| feature_23 | 0 | 34567 | 5.4 | 34567 | 5.4 | float64 |
| feature_49 | 0 | 34567 | 5.4 | 34567 | 5.4 | float64 |
| feature_50 | 0 | 6743 | 1.1 | 6743 | 1.1 | float64 |

We begin by placing the raw .csv files into the ```/data/raw``` directory. The following block will:

1. Load the CSV files
2. Stitch the CSV files together
3. Impute the missing values (default: median of the column)
4. Create a training set (X_train, y_train) and a test set (X_test, y_test)
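
The four steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual `Make_DataSet` implementation: the function name `stitch_impute_split`, the `test_size` and `seed` arguments, and the choice to compute the medians from the training rows only are all hypothetical.

```python
import numpy as np
import pandas as pd


def stitch_impute_split(frames, target_col="target", test_size=0.2, seed=0):
    """Hypothetical helper: stitch frames, split them, impute with medians."""
    # Steps 1-2: stitch the already-loaded CSV contents row-wise
    data = pd.concat(frames, ignore_index=True)

    # Step 4: shuffle row positions and hold out a fraction for testing
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    n_test = int(round(test_size * len(data)))
    test, train = data.iloc[order[:n_test]], data.iloc[order[n_test:]]

    features = [c for c in data.columns if c != target_col]
    # Step 3: impute with column medians
    # (assumption: medians are computed from the training rows only)
    medians = train[features].median()
    X_tr = train[features].fillna(medians)
    X_te = test[features].fillna(medians)
    return X_tr, X_te, train[target_col], test[target_col]


# Tiny synthetic stand-in for the stitched record files
demo_frames = [
    pd.DataFrame({"feature_1": [1.0, np.nan, 3.0], "target": [0.1, 0.2, 0.3]}),
    pd.DataFrame({"feature_1": [4.0, 5.0, np.nan], "target": [0.4, 0.5, 0.6]}),
]
demo_X_train, demo_X_test, demo_y_train, demo_y_test = stitch_impute_split(
    demo_frames, test_size=1 / 3
)
print(demo_X_train.shape, demo_X_test.shape)
```

Computing the medians after the split keeps information from the test rows out of the imputed training data; whether `load_split_data()` does the same is not shown in this notebook.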

In [42]:
data_prep = Make_DataSet()
X_train, X_test, y_train, y_test = data_prep.load_split_data()



The following .csv files will get stitched: 
['1_record_diast.csv', '2_record_diast.csv', '3_record_diast.csv', '4_record_diast.csv', '5_record_diast.csv', '6_record_diast.csv', '7_record_diast.csv', '8_record_diast.csv', '9_record_diast.csv', '10_record_diast.csv', '11_record_diast.csv', '12_record_diast.csv', '13_record_diast.csv', '14_record_diast.csv', '15_record_diast.csv']
Stitching is done!
Fill null values with : median
The target is:  target
The features are:  ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20', 'feature_21', 'feature_22', 'feature_23', 'feature_24', 'feature_25', 'feature_26', 'feature_27', 'feature_28', 'feature_29', 'feature_30', 'feature_31', 'feature_32', 'feature_33', 'feature_34', 'feature_35', 'feature_36', 'feature_37', 'feature_38', 

We are going to handle the missing data differently for the XGBoost and my own DIY Gradient Boost model.

## XGBoost Model:  <a class="anchor" id="XGBoostModel"></a>

### Implement XGBRegressor

In [44]:
def fit_xgb_regressor(X, y, colsample_bytree=0.3, learning_rate=0.1,
                      max_depth=5, alpha=10, n_estimators=70, nthread=-1):
    '''
    Fits XGBoost's scikit-learn-style XGBRegressor to the data.
    Returns the fitted model for hold-out validation.
    '''
    # Pass nthread through instead of hard-coding -1
    xg_reg = xgb.XGBRegressor(colsample_bytree=colsample_bytree, learning_rate=learning_rate,
                              max_depth=max_depth, alpha=alpha, n_estimators=n_estimators,
                              nthread=nthread)
    # Let fitting errors propagate: swallowing them would leave `model` undefined
    model = xg_reg.fit(X, y)

    return model



def predict_rmse_rsqured(model, Xtest, ytest):
    '''
    Uses the input model to predict y for the given Xtest, then calculates
    the RMSE and R squared between ytest and the predictions.
    '''
    # Errors are left to propagate so the return values are always defined
    prediction = model.predict(Xtest)
    rmse = np.sqrt(metrics.mean_squared_error(ytest, prediction))
    r_squared = metrics.r2_score(ytest, prediction)
    print("RMSE: %f" % (rmse))
    print("R Squared: %f" % (r_squared))

    return prediction, rmse, r_squared



def cv_xgboost_regressor(data_matrix, params, nfold=3, num_boost_round=70, 
                         early_stopping_rounds=10, metrics="rmse", seed=123):
    cv_results = xgb.cv(dtrain=data_matrix, params=params, nfold=nfold, 
                        num_boost_round=num_boost_round, early_stopping_rounds=early_stopping_rounds, 
                        metrics=metrics, as_pandas=True, seed=seed)
    
    # head() shows the first five boosting rounds, not the five best scores
    print('First 5 rounds of cross-validation results: ', cv_results.head())
    print('Mean validation RMSE at the last boosting round: ', (cv_results['test-rmse-mean']).tail(1))
    
    return cv_results

### Build an XGBoost Regressor:

#### Hold-Out Validation:

In [51]:
# Initial parameters (identical to the defaults of fit_xgb_regressor):

colsample_bytree=0.3
learning_rate = 0.1 
max_depth = 5
alpha = 10
n_estimators = 70

In [52]:
xgb_reg_oho = fit_xgb_regressor(X_train, y_train)



In [53]:
predictions, rmse, r_squared = predict_rmse_rsqured(xgb_reg_oho, X_test, y_test)

RMSE: 4.430003
R Squared: 0.892487


To put this RMSE into context, the data profiling found the following range for the 'target' column:

- min(target) = -38.789069
- max(target) = 41.215521
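
One way to read the hold-out RMSE against this range is to express it as a share of the range. The short calculation below only reuses numbers already printed in this notebook; the variable names are illustrative.

```python
# Values copied from the cells above
rmse_holdout = 4.430003
target_min, target_max = -38.789069, 41.215521

# Width of the target interval and the RMSE relative to it
target_range = target_max - target_min
relative_rmse = rmse_holdout / target_range

print(f"Target range: {target_range:.2f}")
print(f"RMSE as a share of the range: {relative_rmse:.1%}")
```

The range spans about 80 units, so the hold-out RMSE corresponds to roughly 5.5% of it.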

### Bayesian Optimization with XGBoost

The parameter space will be optimized using a Bayesian Optimization technique:

In [54]:
train_dmatrix = xgb.DMatrix(X_train, label=y_train)
test_dmatrix = xgb.DMatrix(X_test)


In [55]:
def xgb_evaluate(max_depth, gamma, colsample_bytree):
    params = {'eval_metric': 'rmse',
              'max_depth': int(max_depth),
              'subsample': 0.8,
              'eta': 0.1,
              'gamma': gamma,
              'colsample_bytree': colsample_bytree}
    # Used around 1000 boosting rounds in the full model
    cv_result = xgb.cv(params, train_dmatrix, num_boost_round=100, nfold=3)    
    
    # Bayesian optimization only knows how to maximize, not minimize, so return the negative RMSE
    return -1.0 * cv_result['test-rmse-mean'].iloc[-1]

#### Optimization:

In [58]:
xgb_bo_training_2 = BayesianOptimization(xgb_evaluate, {'max_depth': (3,7), 
                                                      'gamma': (0, 1),
                                                      'colsample_bytree': (0.3, 0.9)})

In [59]:
xgb_bo_training_2.maximize(init_points=3, n_iter=5, acq='ei')

|   iter    |  target   | colsam... |   gamma   | max_depth |
-------------------------------------------------------------
|  1        | -5.063    |  0.8975   |  0.8271   |  3.427    |
|  2        | -3.973    |  0.824    |  0.0841   |  5.071    |
|  3        | -4.015    |  0.388    |  0.2421   |  5.16     |
|  4        | -3.135    |  0.9      |  0.0      |  7.0      |
|  5        | -3.144    |  0.9      |  1.0      |  7.0      |
|  6        | -3.161    |  0.3      |  0.4767   |  7.0      |
|  7        | -3.132    |  0.9      |  0.4669   |  7.0      |
|  8        | -3.156    |  0.3      |  0.0      |  7.0      |


In [60]:
optimized_params = xgb_bo_training_2.max['params']
optimized_params['max_depth'] = int(optimized_params['max_depth'])
optimized_params

{'colsample_bytree': 0.9, 'gamma': 0.4668672005782674, 'max_depth': 7}
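
Note that `optimized_params` holds only the three searched values, whereas `xgb_evaluate` also fixed `eta`, `subsample` and `eval_metric` during cross-validation; any setting left out of the dictionary falls back to the XGBoost default. If the final model is meant to be trained under the same settings that were cross-validated, the two sets can be merged first. A minimal sketch, in which the names `fixed_params`, `best_found` and `final_params` are hypothetical and the values are copied from the cells above:

```python
# Settings that xgb_evaluate held constant during cross-validation
fixed_params = {"eval_metric": "rmse", "subsample": 0.8, "eta": 0.1}

# Best values found by the optimizer (copied from the output above)
best_found = {
    "colsample_bytree": 0.9,
    "gamma": 0.4668672005782674,
    "max_depth": 7,
}

# Later keys win, so the searched values override any overlap
final_params = {**fixed_params, **best_found}
print(final_params)
```

Training with `final_params` instead of `optimized_params` would produce a different model, so the RMSE values reported in this notebook would no longer apply to it.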

In [61]:
model_optimized_params = xgb.train(optimized_params, train_dmatrix, num_boost_round=250)

In [62]:
y_pred_xgb = model_optimized_params.predict(test_dmatrix)
y_train_pred_xgb = model_optimized_params.predict(train_dmatrix)

print('Model Prediction on Y Test Data RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_xgb)))
print('Model Prediction on Y Train Data RMSE:', np.sqrt(metrics.mean_squared_error(y_train, y_train_pred_xgb)))

Model Prediction on Y Test Data RMSE: 1.8731620555189703
Model Prediction on Y Train Data RMSE: 1.6518342619805442


#### Save model:

In [63]:
pkl_fname = "Jupyter-XGBoost_Regressor_Model_Pickled.pkl"



with open(os.path.join(FilePaths.path_models, pkl_fname), 'wb') as file:
    pickle.dump(model_optimized_params, file)

print('Model Saved under: ', file)

Model Saved under:  <_io.BufferedWriter name='C:\\Users\\asgar\\Documents\\InterviewAssignments\\huami-interview\\models\\Jupyter-XGBoost_Regressor_Model_Pickled.pkl'>


## DIY Gradient Boosting Regressor  <a class="anchor" id="DIYGradientBoostingRegressor"></a>

### Initialize the model:

In [67]:
#Parameters
RegressGB_Parameters.ntrees = 250
RegressGB_Parameters.max_depth = 5
max_depth = RegressGB_Parameters.max_depth
ntrees = RegressGB_Parameters.ntrees
learning_rate = RegressGB_Parameters.learning_rate

print('These are hyper parameters set to train the model: ')
print('max_depth: ', max_depth)
print('ntrees: ', ntrees)
print('learning_rate: ', learning_rate)

These are hyper parameters set to train the model: 
max_depth:  5
ntrees:  250
learning_rate:  0.1


### Train a Regressor_GradientBoost model:

Fit the model using the model's decision tree method (based on scikit-learn's DecisionTreeRegressor).
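
For readers unfamiliar with the technique, here is a minimal sketch of gradient boosting for squared-error loss built on `DecisionTreeRegressor`. It is not the code of `Regressor_GradientBoost`, whose internals are not shown in this notebook; the function names `boost_fit` and `boost_predict`, their default arguments, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def boost_fit(X, y, ntrees=50, max_depth=3, learning_rate=0.1):
    """Fit shallow trees to successive residuals (squared-error loss)."""
    f0 = float(np.mean(y))            # constant initial prediction
    pred = np.full(len(y), f0)
    trees, train_rmse = [], []
    for _ in range(ntrees):
        residual = y - pred           # negative gradient of 0.5 * (y - pred)^2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
        train_rmse.append(float(np.sqrt(np.mean((y - pred) ** 2))))
    return f0, trees, train_rmse


def boost_predict(X, f0, trees, learning_rate=0.1):
    """Replay training: start from f0 and add every scaled tree output."""
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred


# Synthetic data, purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1]

f0_demo, trees_demo, rmse_demo = boost_fit(X_demo, y_demo)
print(rmse_demo[0], rmse_demo[-1])
```

The property worth noting is that `boost_predict` must use the same `f0` and `learning_rate` as training. If either differs at prediction time, the output no longer corresponds to the model that was fitted.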

In [68]:
def train_regressor_gradientboost(X_train, y_train):
    '''
    Trains a regressor based on Parsian's pa_ml_utils.Regressor_GradientBoost

    Parameters
    ----------
    X_train : A Pandas DataFrame for features

    y_train : A Pandas DataFrame for target

    Returns:
        f0 and models, as returned by Regressor_GradientBoost.boost_gradient
    '''
    try:
        print('Initiating the training process using \'pa-gb\' model')

        regressor = Regressor_GradientBoost(features_df = X_train,
                                            target = y_train,
                                            max_depth = RegressGB_Parameters.max_depth,
                                            ntrees = RegressGB_Parameters.ntrees,
                                            learning_rate = RegressGB_Parameters.learning_rate)

        f0, models, training_rmse = regressor.boost_gradient(X_train, y_train)
        print('Training completed')

    except Exception as e:
        print('modeller.py Model_Trainer.train_regressor_gradientboost(): ', e)
        # Re-raise: otherwise the return below would hit undefined f0/models
        raise

    return f0, models

In [69]:
f0, models = train_regressor_gradientboost(X_train, y_train)

Initiating the training process using 'pa-gb' model
RMSE at first prediction:  13.515627188810816
RMSE for tree #0 is: 12.485939437579958
RMSE for tree #1 is: 11.579720790341586
RMSE for tree #2 is: 10.78342861420375
RMSE for tree #3 is: 10.085339499878417
RMSE for tree #4 is: 9.475547415495809
RMSE for tree #5 is: 8.941390445559952
RMSE for tree #6 is: 8.474762111073325
RMSE for tree #7 is: 8.071758154138564
RMSE for tree #8 is: 7.711799416401938
RMSE for tree #9 is: 7.394097540510946
RMSE for tree #10 is: 7.124403786713729
RMSE for tree #11 is: 6.896506465826787
RMSE for tree #12 is: 6.688255426918239
RMSE for tree #13 is: 6.515810619025053
RMSE for tree #14 is: 6.358874639850079
RMSE for tree #15 is: 6.222790412855806
RMSE for tree #16 is: 6.093898958071987
RMSE for tree #17 is: 5.980941872890951
RMSE for tree #18 is: 5.876440397941171
RMSE for tree #19 is: 5.795758705727741
RMSE for tree #20 is: 5.718405388793373
RMSE for tree #21 is: 5.64608381828004
RMSE for tree #22 is: 5.593104

RMSE for tree #199 is: 3.311221112355766
RMSE for tree #200 is: 3.3086945467326285
RMSE for tree #201 is: 3.3027306095511304
RMSE for tree #202 is: 3.2982219545722002
RMSE for tree #203 is: 3.294260477757347
RMSE for tree #204 is: 3.2907258912189503
RMSE for tree #205 is: 3.2839119097399347
RMSE for tree #206 is: 3.2779516545244074
RMSE for tree #207 is: 3.2754487721770977
RMSE for tree #208 is: 3.266925339431294
RMSE for tree #209 is: 3.2608122677484643
RMSE for tree #210 is: 3.255909798702393
RMSE for tree #211 is: 3.251006381542861
RMSE for tree #212 is: 3.247743926751302
RMSE for tree #213 is: 3.2437989662872777
RMSE for tree #214 is: 3.24112913280755
RMSE for tree #215 is: 3.2373367043377117
RMSE for tree #216 is: 3.232885160430778
RMSE for tree #217 is: 3.2296678032175294
RMSE for tree #218 is: 3.225912296537506
RMSE for tree #219 is: 3.2212635361685433
RMSE for tree #220 is: 3.218254940135668
RMSE for tree #221 is: 3.216058414839148
RMSE for tree #222 is: 3.2128627849998215
RMSE

### Predict using the Regressor_GradientBoost model:


Using the model trained above, we now predict on the test set:

In [70]:
def predict_regressor_gradientboost(X_test, y_test, f0, reg_gb_models):
    '''
    Generates a prediction based on Parsian's
    pa_ml_utils.Regressor_GradientBoost model

    Parameters
    ----------
    X_test : A Pandas DataFrame for features

    y_test : A Pandas Series for target

    Returns:
        Predictions based on reg_gb_models generated from
        pa_ml_utils.Regressor_GradientBoost, and their RMSE against y_test
    '''
    try:
        print('Initiating the prediction process using \'pa-gb\' model')

        regressor = Regressor_GradientBoost(X_test, y_test)
        prediction = regressor.predict(X_test, f0, reg_gb_models)
        rmse = regressor.rmse(y_test, prediction)

        print('Prediction completed')

    except Exception as e:
        print('modeller.py Model_Predictor.predict_regressor_gradientboost(): ', e)
        # Re-raise: otherwise the lines below would hit undefined prediction/rmse
        raise

    print("The RMSE for the Regressor Gradient Boost model: ", rmse)
    return prediction, rmse

In [71]:
predict_regressor_gradientboost(X_test, y_test, f0, models)

Initiating the prediction process using 'pa-gb' model
Prediction completed
The RMSE for the Regressor Gradient Boost model:  12.738929306363474


(array([ 3.07704991,  1.45827648, 11.97125205, ..., 25.35474394,
        32.84679981, 10.64980486]), 12.738929306363474)

### Results:

The XGBoost model, after some hyperparameter optimization, predicted the target with a test RMSE of 1.87 (training RMSE: 1.65).

In comparison, my model, without any hyperparameter tuning, produced a test RMSE of 12.74, even though its training RMSE had already dropped to about 3.2 by tree #222. A gap this large suggests the model is not generalizing to the test set, so cross-validation and hyperparameter tuning should be used to ensure we are not overfitting.
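
The comparison can be made a little more concrete by looking at the ratio of test to training RMSE and at the test RMSE relative to the target range. The sketch below only rearranges numbers printed earlier in this notebook. One assumption to keep in mind: the training RMSE used for my model is the last value visible in the truncated log (tree #222), not necessarily that of the final tree.

```python
# (train RMSE, test RMSE) copied from the outputs in this notebook.
# Assumption: the DIY training value is the last one visible in the
# truncated log (tree #222), not necessarily the final tree's RMSE.
results = {
    "XGBoost (optimized)": (1.6518342619805442, 1.8731620555189703),
    "Regressor_GradientBoost": (3.2128627849998215, 12.738929306363474),
}
target_range = 41.215521 - (-38.789069)

for name, (train_rmse, test_rmse) in results.items():
    gap = test_rmse / train_rmse        # close to 1 means similar train/test error
    share = test_rmse / target_range    # test error relative to the target range
    print(f"{name}: test/train = {gap:.2f}, test RMSE / range = {share:.1%}")
```

A ratio close to 1 indicates that the error on unseen data is similar to the training error, while a ratio far above 1 is a common sign that the model, or the way predictions are generated from it, deserves a closer look.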