# Predictive Models

This notebook builds two predictive models:

1. An XGBoost model, evaluated with hold-out validation
2. A model using my own Regressor_GradientBoost, evaluated with hold-out validation


References: 
1. https://www.kaggle.com/omarito/gridsearchcv-xgbregressor-0-556-lb
2. https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
3. https://www.datacamp.com/community/tutorials/xgboost-in-python
4. https://aiinpractice.com/xgboost-hyperparameter-tuning-with-bayesian-optimization/
5. https://github.com/fmfn/BayesianOptimization
6. https://www.kaggle.com/btyuhas/bayesian-optimization-with-xgboost/notebook

## Table of Contents

1. [Data Import & Pre-Processing](#LoadingData)
2. [XGBoost Model](#XGBoostModel)
3. [DIY Gradient Boosting](#DIYGradientBoostingRegressor)

### Import libraries

In [24]:
%matplotlib inline

import matplotlib.pylab as plt
import numpy as np
import os.path
import pickle
import sys
import pandas as pd
import re
import xgboost as xgb

from bayes_opt import BayesianOptimization
from matplotlib.pylab import rcParams
from sklearn import metrics 
from sklearn.model_selection import cross_validate, cross_val_score

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

# Parsian's modules
from src import FilePaths, Data_Properties, Make_DataSet, Regressor_GradientBoost, RegressGB_Parameters

## Data Import & Pre-Processing  <a class="anchor" id="LoadingData"></a>

### Profiling Summary:

Here are the results of the profiling done in the 'exploratory-data-analysis' Jupyter notebook:

* In total there are 66 columns and 636,984 rows.
* There are missing values in six columns:
	
| Column | Zero Values | Missing Values | Missing (% of Total Rows) | Zero + Missing Values | Zero + Missing (% of Total Rows) | Data Type |
| --- | --- | --- | --- | --- | --- | --- |
| feature_10 | 0 | 69566 | 10.9 | 69566 | 10.9 | float64 |
| feature_62 | 0 | 69566 | 10.9 | 69566 | 10.9 | float64 |
| feature_36 | 0 | 37907 | 6.0 | 37907 | 6.0 | float64 |
| feature_23 | 0 | 34567 | 5.4 | 34567 | 5.4 | float64 |
| feature_49 | 0 | 34567 | 5.4 | 34567 | 5.4 | float64 |
| feature_50 | 0 | 6743 | 1.1 | 6743 | 1.1 | float64 |

We begin by placing the raw .csv files into the ```/data/raw``` directory. The following block will:

1. Load the CSV files
2. Stitch the CSV files together
3. Impute the missing values (default: median of the column)
4. Create the training set (X_train, y_train) and the testing set (X_test, y_test)

In [2]:
data_prep = Make_DataSet()
X_train, X_test, y_train, y_test = data_prep.load_split_data()



The following .csv files will get stitched: 
['1_record_diast.csv', '2_record_diast.csv']
Stitching is done!
Fill null values with : median
The target is:  target
The features are:  ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20', 'feature_21', 'feature_22', 'feature_23', 'feature_24', 'feature_25', 'feature_26', 'feature_27', 'feature_28', 'feature_29', 'feature_30', 'feature_31', 'feature_32', 'feature_33', 'feature_34', 'feature_35', 'feature_36', 'feature_37', 'feature_38', 'feature_39', 'feature_40', 'feature_41', 'feature_42', 'feature_43', 'feature_44', 'feature_45', 'feature_46', 'feature_47', 'feature_48', 'feature_49', 'feature_50', 'feature_51', 'feature_52', 'feature_53', 'feature_54', 'feature_55', 'feature_56', 'feature_57', 'feature_58', 'feature_59'
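
The imputation and splitting steps can be illustrated with a minimal standard-library sketch. It is not the code inside `Make_DataSet`: the helper names `impute_median` and `hold_out_split` and the 20% test fraction are assumptions made for illustration only.

```python
import math
import random
import statistics

def impute_median(column):
    """Replace NaN entries with the median of the observed values."""
    observed = [v for v in column if not math.isnan(v)]
    fill = statistics.median(observed)
    return [fill if math.isnan(v) else v for v in column]

def hold_out_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the row indices once, then split them into train and test rows."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))  # assumed fraction, for illustration
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

feature = [1.0, float("nan"), 3.0, 10.0, float("nan")]
filled = impute_median(feature)
print(filled)  # [1.0, 3.0, 3.0, 10.0, 3.0]

train_rows, test_rows = hold_out_split(list(range(10)))
print(len(train_rows), len(test_rows))  # 8 2
```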

We are going to handle the missing data differently for the XGBoost and my own DIY Gradient Boost model.

## XGBoost Model:  <a class="anchor" id="XGBoostModel"></a>

### Create Data Matrix for XGBoost's Cross Validation Analysis

In [10]:
data_dmatrix = xgb.DMatrix(data=X_train, label=y_train)

### Implement XGBRegressor

In [3]:
def fit_xgb_regressor(X, y, colsample_bytree=0.3, learning_rate=0.1,
                      max_depth=5, alpha=10, n_estimators=70, nthread=-1):
    '''
    Fits XGBoost's scikit-learn style XGBRegressor to the data.
    Returns the fitted model for hold-out validation.
    '''
    xg_reg = xgb.XGBRegressor(colsample_bytree=colsample_bytree, learning_rate=learning_rate,
                              max_depth=max_depth, alpha=alpha, n_estimators=n_estimators,
                              nthread=nthread)
    # Fitting errors are left to propagate: catching them here would leave no model to return.
    model = xg_reg.fit(X, y)

    return model



def predict_rmse_rsqured(model, Xtest, ytest):
    '''
    Uses the input model to predict y for a given Xtest, and calculates the RMSE
    and R squared between ytest and the predictions.
    '''
    # Errors are left to propagate: swallowing them would make the return below fail.
    prediction = model.predict(Xtest)
    rmse = np.sqrt(metrics.mean_squared_error(ytest, prediction))
    r_squared = metrics.r2_score(ytest, prediction)
    print("RMSE: %f" % (rmse))
    print("R Squared: %f" % (r_squared))

    return prediction, rmse, r_squared



def cv_xgboost_regressor(data_matrix, params, nfold=3, num_boost_round=70, 
                         early_stopping_rounds=10, metrics="rmse", seed=123):
    '''
    Runs XGBoost's built-in k-fold cross validation and returns the per-round results.
    '''
    cv_results = xgb.cv(dtrain=data_matrix, params=params, nfold=nfold, 
                        num_boost_round=num_boost_round, early_stopping_rounds=early_stopping_rounds, 
                        metrics=metrics, as_pandas=True, seed=seed)
    
    # head() shows the first five boosting rounds, not the five best ones
    print('First 5 boosting rounds of the cross validation:\n', cv_results.head())
    print('Mean validation RMSE at the last boosting round: ', cv_results['test-rmse-mean'].iloc[-1])
    
    return cv_results
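
As a rough illustration of what `nfold=3` and the `test-rmse-mean` column refer to, the sketch below runs a 3-fold cross validation by hand. It is not how `xgb.cv` is implemented: the fold assignment is a simple interleaving, and the "model" is a placeholder that predicts the mean of the training folds. All helper names are hypothetical.

```python
import math

def k_fold_indices(n_rows, nfold=3):
    """Yield (train_idx, test_idx) pairs; every row is held out exactly once."""
    folds = [list(range(i, n_rows, nfold)) for i in range(nfold)]
    for k in range(nfold):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fold_rmses = []
held_out = []
for train_idx, test_idx in k_fold_indices(len(y), nfold=3):
    # Placeholder model: predict the mean of the training folds
    mean_pred = sum(y[i] for i in train_idx) / len(train_idx)
    fold_rmses.append(rmse([y[i] for i in test_idx], [mean_pred] * len(test_idx)))
    held_out.extend(test_idx)

# Analogue of 'test-rmse-mean': the validation RMSE averaged over the folds
test_rmse_mean = sum(fold_rmses) / len(fold_rmses)
print(fold_rmses, test_rmse_mean)
```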

### Build an XGBoost Regressor:

#### Hold-Out Validation:

In [4]:
xgb_reg_oho = fit_xgb_regressor(X_train, y_train)




In [5]:
predictions, rmse, r_squared = predict_rmse_rsqured(xgb_reg_oho, X_test, y_test)

RMSE: 2.089808
R Squared: 0.979786
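
For reference, the two reported metrics can be written out by hand. This is a minimal sketch of the standard definitions, RMSE = sqrt(mean((y - ŷ)²)) and R² = 1 - SS_res / SS_tot, on made-up numbers; the notebook itself relies on `sklearn.metrics` for these.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up values, only to exercise the two functions
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

example_rmse = rmse(y_true, y_pred)
example_r2 = r_squared(y_true, y_pred)
print(example_rmse, example_r2)
```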


From our data profiling, the 'target' column spans roughly 80 units, which gives a scale for judging the RMSE above:

- min(target) = -38.789069
- max(target) = 41.215521
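
One way to put the hold-out RMSE in context is to divide it by the target range. The arithmetic below uses only the numbers reported in this notebook; the variable names are illustrative.

```python
rmse = 2.089808           # hold-out RMSE reported above
target_min = -38.789069   # from the data profiling
target_max = 41.215521

target_range = target_max - target_min
relative_rmse = rmse / target_range
print(f"range = {target_range:.2f}, RMSE / range = {relative_rmse:.2%}")
# prints: range = 80.00, RMSE / range = 2.61%
```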

### Bayesian Optimization with XGBoost

The hyperparameters max_depth, gamma and colsample_bytree are tuned with Bayesian optimization, using the cross-validated RMSE as the objective:

In [7]:
train_dmatrix = xgb.DMatrix(X_train, label=y_train)
test_dmatrix = xgb.DMatrix(X_test)


In [8]:
def xgb_evaluate(max_depth, gamma, colsample_bytree):
    params = {'eval_metric': 'rmse',
              'max_depth': int(max_depth),
              'subsample': 0.8,
              'eta': 0.1,
              'gamma': gamma,
              'colsample_bytree': colsample_bytree}
    # 100 boosting rounds keep each evaluation cheap; the final model below is trained with 250
    cv_result = xgb.cv(params, train_dmatrix, num_boost_round=100, nfold=3)    
    
    # Bayesian optimization only knows how to maximize, not minimize, so return the negative RMSE
    return -1.0 * cv_result['test-rmse-mean'].iloc[-1]
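
The two conventions used in `xgb_evaluate`, casting a float proposal to an integer and returning the negative loss so that a maximizer can be used, can be shown without XGBoost. In the sketch below a plain random search stands in for `BayesianOptimization`, and `toy_loss` is a hypothetical substitute for the cross-validated RMSE, so only the interface is comparable, not the search strategy.

```python
import random

def toy_loss(max_depth, gamma):
    """Hypothetical stand-in for a cross-validated RMSE (lower is better)."""
    return (max_depth - 6) ** 2 + 0.5 * gamma + 1.0

def objective(max_depth, gamma):
    # The optimizer proposes floats, so integer parameters are cast first
    depth = int(max_depth)
    # Maximizing the negative loss is the same as minimizing the loss
    return -1.0 * toy_loss(depth, gamma)

def random_search(func, bounds, n_trials=50, seed=0):
    """Plain random search that keeps the candidate with the largest score."""
    rng = random.Random(seed)
    best = {"target": float("-inf"), "params": None}
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
        score = func(**params)
        if score > best["target"]:
            best = {"target": score, "params": params}
    return best

best = random_search(objective, {"max_depth": (3, 7), "gamma": (0, 1)})
# Cast the winning max_depth the same way the objective did
best["params"]["max_depth"] = int(best["params"]["max_depth"])
print(best)
```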

#### First Attempt:

In [9]:
xgb_bo_training = BayesianOptimization(xgb_evaluate, {'max_depth': (3,5), 
                                                      'gamma': (0, 1),
                                                      'colsample_bytree': (0.3, 0.5)})

In [10]:
xgb_bo_training.maximize(init_points=3, n_iter=5, acq='ei')

|   iter    |  target   | colsam... |   gamma   | max_depth |
-------------------------------------------------------------
|  1        | -2.556    |  0.3677   |  0.2054   |  3.337    |
|  2        | -2.234    |  0.3053   |  0.3433   |  4.797    |
|  3        | -2.591    |  0.3072   |  0.2065   |  3.152    |
|  4        | -1.937    |  0.3      |  1.0      |  5.0      |
|  5        | -1.869    |  0.5      |  1.0      |  5.0      |
|  6        | -1.869    |  0.5      |  1.0      |  5.0      |
|  7        | -1.869    |  0.5      |  1.0      |  5.0      |
|  8        | -1.869    |  0.5      |  1.0      |  5.0      |


#### Second Attempt:

In [11]:
xgb_bo_training_2 = BayesianOptimization(xgb_evaluate, {'max_depth': (3,7), 
                                                      'gamma': (0, 1),
                                                      'colsample_bytree': (0.3, 0.9)})

In [12]:
xgb_bo_training_2.maximize(init_points=3, n_iter=5, acq='ei')

|   iter    |  target   | colsam... |   gamma   | max_depth |
-------------------------------------------------------------
|  1        | -2.467    |  0.8756   |  0.3938   |  3.815    |
|  2        | -1.861    |  0.7061   |  0.06226  |  5.155    |
|  3        | -1.611    |  0.5314   |  0.9767   |  6.232    |
|  4        | -1.431    |  0.3      |  0.0      |  7.0      |
|  5        | -1.401    |  0.9      |  0.0      |  7.0      |
|  6        | -1.4      |  0.9      |  1.0      |  7.0      |
|  7        | -1.395    |  0.9      |  0.5177   |  7.0      |
|  8        | -1.43     |  0.3      |  1.0      |  7.0      |


In [13]:
optimized_params = xgb_bo_training_2.max['params']
optimized_params['max_depth'] = int(optimized_params['max_depth'])
optimized_params

{'colsample_bytree': 0.9, 'gamma': 0.5176518530657455, 'max_depth': 7}
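
Note that this dictionary holds only the three searched parameters. The values fixed inside `xgb_evaluate` (`eval_metric`, `subsample`, `eta`) are not part of it, so a model trained from `optimized_params` alone falls back to XGBoost's defaults for those settings. If the aim is to train with the same settings that were cross-validated, one option is to merge the two sets first. This is a sketch; `fixed_params`, `tuned_params` and `final_params` are hypothetical names.

```python
# Values copied from xgb_evaluate (fixed) and from the optimizer output above (tuned)
fixed_params = {"eval_metric": "rmse", "subsample": 0.8, "eta": 0.1}
tuned_params = {"colsample_bytree": 0.9, "gamma": 0.5176518530657455, "max_depth": 7}

# Later keys win on conflict, so tuned values would override fixed ones if they overlapped
final_params = {**fixed_params, **tuned_params}
print(final_params)
```

`final_params` could then be passed to `xgb.train` in place of `optimized_params`.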

In [14]:
model_optimized_params = xgb.train(optimized_params, train_dmatrix, num_boost_round=250)

In [15]:
y_pred_xgb = model_optimized_params.predict(test_dmatrix)
y_train_pred_xgb = model_optimized_params.predict(train_dmatrix)

print('Model Prediction on Y Test Data RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_xgb)))
print('Model Prediction on Y Train Data RMSE:', np.sqrt(metrics.mean_squared_error(y_train, y_train_pred_xgb)))

Model Prediction on Y Test Data RMSE: 0.7852291472417188
Model Prediction on Y Train Data RMSE: 0.4499462852907251


#### Save model:

In [17]:
pkl_fname = "XGBoost_Regressor_Model_Pickled.pkl"

# Reuse pkl_fname instead of repeating the file name
with open(os.path.join(FilePaths.path_models, pkl_fname), 'wb') as file:
    pickle.dump(model_optimized_params, file)

print('Model Saved under: ', file)

Model Saved under:  <_io.BufferedWriter name='C:\\Users\\asgar\\Documents\\InterviewAssignments\\huami-interview\\models\\XGBoost_Regressor_Model_Pickled.pkl'>
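
Reading the model back is the mirror image of the save step. The sketch below shows the round trip with a placeholder object and a temporary directory instead of the trained Booster and `FilePaths.path_models`, so it runs without the project files.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the trained model object
model = {"name": "placeholder for the trained Booster"}

with tempfile.TemporaryDirectory() as models_dir:
    pkl_path = os.path.join(models_dir, "XGBoost_Regressor_Model_Pickled.pkl")

    # Save: binary write mode
    with open(pkl_path, "wb") as file:
        pickle.dump(model, file)

    # Load: binary read mode
    with open(pkl_path, "rb") as file:
        restored = pickle.load(file)

print(restored == model)  # True
```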


## DIY Gradient Boosting Regressor  <a class="anchor" id="DIYGradientBoostingRegressor"></a>

### Initialize the model:

In [33]:
#Parameters
RegressGB_Parameters.ntrees = 20
max_depth = RegressGB_Parameters.max_depth
ntrees = RegressGB_Parameters.ntrees
learning_rate = RegressGB_Parameters.learning_rate

print('These are hyper parameters set to train the model: ')
print('max_depth: ', max_depth)
print('ntrees: ', ntrees)
print('learning_rate: ', learning_rate)

These are hyper parameters set to train the model: 
max_depth:  2
ntrees:  20
learning_rate:  0.1


### Train a Regressor_GradientBoost model:

Fit the model using its decision tree method (based on Scikit-learn's DecisionTreeRegressor):

In [34]:
def train_regressor_gradientboost(X_train, y_train):
    '''
    Trains a regressor based on Parsian's pa_ml_utils.Regressor_GradientBoost

    Parameters
    ----------
    X_train : A Pandas DataFrame for features

    y_train : A Pandas DataFrame for target

    Returns:
        The f0 and models values returned by Regressor_GradientBoost.boost_gradient

    '''
    try:
        print('Initiating the training process using \'pa-gb\' model')

        regressor = Regressor_GradientBoost(features_df = X_train, 
                                            target = y_train,
                                            max_depth = RegressGB_Parameters.max_depth, 
                                            ntrees = RegressGB_Parameters.ntrees, 
                                            learning_rate = RegressGB_Parameters.learning_rate)

        f0, models, training_rmse = regressor.boost_gradient(X_train, y_train)
        print('Training completed')

    except Exception as e:
        print('modeller.py Model_Trainer.train_regressor_gradientboost(): ',e)
        # Re-raise so a failed run does not fall through to an undefined return value
        raise

    return f0, models 

In [35]:
f0, models = train_regressor_gradientboost(X_train, y_train)

Initiating the training process using 'pa-gb' model
RMSE at first prediction:  14.664271679060443
RMSE for tree #0 is: 13.391159114254439
RMSE for tree #1 is: 12.258264577646218
RMSE for tree #2 is: 11.242761813387368
RMSE for tree #3 is: 10.337511945266398
RMSE for tree #4 is: 9.528834549051984
RMSE for tree #5 is: 8.7984266207075
RMSE for tree #6 is: 8.158532061912522
RMSE for tree #7 is: 7.587525664343998
RMSE for tree #8 is: 7.0756400429759445
RMSE for tree #9 is: 6.622592722475333
RMSE for tree #10 is: 6.22578034759417
RMSE for tree #11 is: 5.87069055985789
RMSE for tree #12 is: 5.563028020753392
RMSE for tree #13 is: 5.292628346405744
RMSE for tree #14 is: 5.053040195041686
RMSE for tree #15 is: 4.846294597175786
RMSE for tree #16 is: 4.664026103785394
RMSE for tree #17 is: 4.505730496796824
RMSE for tree #18 is: 4.36898626075419
RMSE for tree #19 is: 4.249182999171462
Training completed
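
The log above, a first prediction followed by a steadily shrinking RMSE per tree, is what a least-squares gradient boosting loop typically produces: start from a constant, fit each new tree to the current residuals, and add a fraction (the learning rate) of its prediction. The sketch below illustrates that loop in plain Python. It is not the code of `Regressor_GradientBoost`: it uses a single feature, depth-1 stumps instead of `DecisionTreeRegressor`, and assumes that `f0` is the mean of the target.

```python
import math

def fit_stump(x, residuals):
    """Depth-1 regression tree on one feature: pick the threshold with the lowest squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return [left_mean if xi <= threshold else right_mean for xi in x]

def rmse(y, pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y))

def boost_gradient(x, y, ntrees=20, learning_rate=0.1):
    f0 = sum(y) / len(y)                 # assumed constant first prediction: the target mean
    pred = [f0] * len(y)
    models, history = [], [rmse(y, pred)]
    for _ in range(ntrees):
        # Residuals are the negative gradient of the squared-error loss (up to a constant factor)
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        pred = [pi + learning_rate * ui for pi, ui in zip(pred, predict_stump(stump, x))]
        models.append(stump)
        history.append(rmse(y, pred))
    return f0, models, history

def predict(x, f0, models, learning_rate=0.1):
    """Replay the same additive updates on new inputs."""
    pred = [f0] * len(x)
    for stump in models:
        pred = [pi + learning_rate * ui for pi, ui in zip(pred, predict_stump(stump, x))]
    return pred

# Made-up toy data
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]
f0, models, history = boost_gradient(x, y)
print("RMSE at first prediction:", history[0])
print("RMSE after the last tree:", history[-1])
```

With a learning rate of 0.1 each tree removes only a small part of the remaining error, which is why 20 trees still leave a sizeable RMSE in a run like the one logged above.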


### Predict using the Regressor_GradientBoost model:


Using the model trained above, we now predict on the test set:

In [39]:
def predict_regressor_gradientboost(X_test, y_test, f0, reg_gb_models):
    '''
    Generates a prediction based on Parsian's 
    pa_ml_utils.Regressor_GradientBoost model

    Parameters
    ----------
    X_test : A Pandas DataFrame for features

    y_test : A Pandas Series for target

    Returns:
        Predictions based on reg_gb_models generated from pa_ml_utils.Regressor_GradientBoost,
        and the RMSE between y_test and those predictions

    '''
    try:
        print('Initiating the prediction process using \'pa-gb\' model')

        regressor = Regressor_GradientBoost(X_test, y_test)
        prediction = regressor.predict(X_test, f0, reg_gb_models)
        rmse = regressor.rmse(y_test, prediction)

        print('Prediction completed')

    except Exception as e:
        print('modeller.py Model_Predictor.predict_regressor_gradientboost(): ',e)
        # Re-raise so a failed run does not fall through to undefined results
        raise

    print("The RMSE for the Regressor Gradient Boost model: ", rmse)
    return prediction, rmse   

In [40]:
predict_regressor_gradientboost(X_test, y_test, f0, models)

Initiating the prediction process using 'pa-gb' model
Prediction completed
The RMSE for the Regressor Gradient Boost model:  5.051405898852547


(array([-1.5589667 , 20.86878526,  7.82935102, ...,  2.32709653,
         0.53741474, -6.44573397]), 5.051405898852547)