# Predict Hazard Ratings for a Maintenance Project

## Model Building and Hyperparameter Tuning

> The dataset contains count variables and the response variable follows a Poisson distribution, so we will try models that support a Poisson objective: HistGradientBoostingRegressor, PoissonRegressor, XGBRegressor with the `count:poisson` objective, and TweedieRegressor. For comparison, we will also try Lasso, DecisionTreeRegressor, and RandomForestRegressor.
> We will evaluate each model on MAE (mean absolute error).
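
As a reminder of what the metric measures, here is a minimal sketch that computes MAE by hand and with `mean_absolute_error`. The arrays `y_true` and `y_hat` are made-up values for illustration, not data from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical hazard scores and predictions, for illustration only.
y_true = np.array([1, 4, 7, 10])
y_hat = np.array([2.0, 4.0, 5.0, 13.0])

# MAE is the mean of the absolute residuals: (1 + 0 + 2 + 3) / 4.
mae_manual = np.mean(np.abs(y_true - y_hat))
mae_sklearn = mean_absolute_error(y_true, y_hat)
print(mae_manual, mae_sklearn)
```

Because MAE is expressed in the units of the target, predictions made on a transformed scale have to be mapped back before scoring.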

In [11]:
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error
# Needed only on scikit-learn < 1.0, where HistGradientBoostingRegressor was still experimental
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import TweedieRegressor
import xgboost as xg

> Import the pickle file created in the feature transformation step

In [7]:
filename = 'transformed_all_data_hazard'
# The context manager closes the file even if loading fails
with open(filename, 'rb') as infile:
    all_data = pickle.load(infile)

In [8]:
all_data.shape 

(50999, 23)

> Separate the original train and test data

In [9]:
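# Pass axis by keyword: the positional form drop(label, 1) is not accepted by recent pandas versions
train = all_data[all_data['data'] == 'train'].drop('data', axis=1)
test = all_data[all_data['data'] == 'test'].drop(['data', 'Hazard'], axis=1)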
train.shape, test.shape

((40799, 22), (10200, 21))
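
The split by the `data` flag can be sketched on a toy frame. The column names `data` and `Hazard` come from the notebook; the `feature` column, the values, and the assumption that test rows carry a missing `Hazard` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the combined frame; values are made up.
toy = pd.DataFrame({
    "data": ["train", "train", "test"],
    "Hazard": [1.0, 4.0, float("nan")],
    "feature": [0.1, 0.2, 0.3],
})

# Build the mask once, then drop the bookkeeping columns from each part.
is_train = toy["data"] == "train"
toy_train = toy.loc[is_train].drop(columns="data")
toy_test = toy.loc[~is_train].drop(columns=["data", "Hazard"])
print(toy_train.shape, toy_test.shape)
```

The training part keeps one more column than the test part (the target), which matches the 22 versus 21 columns reported above.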

> Separating training and validation dataset

In [12]:
x_train, x_test, y_train, y_test = train_test_split(train.drop('Hazard', axis=1), train['Hazard'], test_size=0.33, random_state=2)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((27335, 21), (13464, 21), (27335,), (13464,))

> As we saw in the EDA, the dependent variable (Hazard) is skewed, so we apply a square-root transformation to the training target

In [13]:
y_train = y_train**(1/2)
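
The notebook transforms the target by hand and squares the predictions later. One possible alternative, not used by the author, is scikit-learn's `TransformedTargetRegressor`, which applies the transform at fit time and its inverse at predict time. The sketch below uses synthetic data (`X`, `y`) and a `Lasso` base model purely as assumptions for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Lasso

# Synthetic non-negative counts standing in for the Hazard target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.poisson(lam=np.exp(0.3 * X[:, 0] + 1.0))

# The wrapper fits Lasso on sqrt(y) and squares its predictions on the way out.
wrapped = TransformedTargetRegressor(
    regressor=Lasso(alpha=0.01),
    func=np.sqrt,
    inverse_func=np.square,
)
wrapped.fit(X, y)
pred = wrapped.predict(X)  # already on the original Hazard scale
print(pred[:5])
```

With this wrapper the back-transformation cannot be forgotten, and the untransformed target can be passed directly to cross-validation utilities.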

> We will create an estimator function to compare the performance of different models, select the one with the best score, and then tune its hyperparameters.

In [26]:
def estimator(x, y, z):
    """Fit each candidate model on (x, y) and print its MAE on z.

    Note: the true targets are read from the notebook-level y_test.
    """
    models = {
        'Lasso': Lasso(fit_intercept=True, alpha=0.01),
        'Hist GradientBooster': HistGradientBoostingRegressor(loss='poisson', max_leaf_nodes=128),
        'Poisson Regressor': PoissonRegressor(alpha=1e-12, max_iter=900),
        'DecisionTree Regressor': DecisionTreeRegressor(random_state=2),
        'XGB Regressor': xg.XGBRegressor(objective='count:poisson', n_estimators=100, seed=123),
        'Tweedie Regressor': TweedieRegressor(power=1, alpha=0.5, link='log', max_iter=700),
        'RandomForest Regressor': RandomForestRegressor(random_state=2),
    }
    for name, model in models.items():
        print(name)
        model.fit(x, y)
        y_pred = model.predict(z)**2  # reverse the square-root transformation
        mae = mean_absolute_error(y_test, y_pred)
        print('mae', mae, '\n')

> Fit each model on the training split and evaluate it on the validation split with the estimator function

In [27]:
estimator(x_train,y_train,x_test)

Lasso
mae 2.7000140912385038 

Hist GradientBooster
mae 2.6768701196442013 

Poisson Regressor
mae 2.6897116279036455 

DecisionTree Regressor
mae 3.739501707154711 

XGB Regressor
mae 2.6832680113201897 

Tweedie Regressor
mae 2.7017446092022634 

RandomForest Regressor
mae 2.822988564584935 



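> HistGradientBoostingRegressor (MAE 2.677) and XGBRegressor (MAE 2.683) achieve the lowest MAE, narrowly ahead of the linear models (about 2.69 to 2.70) and clearly ahead of RandomForest (2.82) and DecisionTree (3.74). We will do hyperparameter tuning with GridSearchCV on XGBRegressor.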

In [18]:
param = {
        'learning_rate': [0.01, 0.1],
        'max_depth': [3, 5, 7, 10],
        'min_child_weight': [1, 3, 5],
        'subsample': [0.5, 0.7],
        'colsample_bytree': [0.5, 0.7],
        'n_estimators' : [100, 200, 500],
        'objective': ['count:poisson']
    }

gsearch = GridSearchCV(estimator = xg.XGBRegressor(),param_grid=param, scoring='neg_mean_absolute_error',
                       cv = 5,n_jobs = -1,verbose = 1)

gsearch.fit(x_train, y_train)

Fitting 5 folds for each of 288 candidates, totalling 1440 fits


GridSearchCV(cv=5,
             estimator=XGBRegressor(base_score=None, booster=None,
                                    colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None, gamma=None,
                                    gpu_id=None, importance_type='gain',
                                    interaction_constraints=None,
                                    learning_rate=None, max_delta_step=None,
                                    max_depth=None, min_child_weight=None,
                                    missing=nan, monotone_constraints=None,
                                    n_estimators=100, n_jobs=...
                                    reg_alpha=None, reg_lambda=None,
                                    scale_pos_weight=None, subsample=None,
                                    tree_method=None, validate_parameters=None,
                                    verbosity=None),
             n_
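
The candidate count reported in the fitting log can be checked without running the search. This sketch repeats the `param` grid from the cell above and counts its combinations with `ParameterGrid`; the variable names `n_candidates` and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the GridSearchCV cell.
param = {
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5, 7, 10],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.5, 0.7],
    'colsample_bytree': [0.5, 0.7],
    'n_estimators': [100, 200, 500],
    'objective': ['count:poisson'],
}

# 2 * 4 * 3 * 2 * 2 * 3 * 1 combinations, each fitted once per CV fold.
n_candidates = len(ParameterGrid(param))
n_fits = n_candidates * 5  # cv=5
print(n_candidates, n_fits)  # 288 1440
```

Counting candidates up front is a cheap way to estimate the cost of a grid before committing to an exhaustive search.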

In [19]:
gsearch.best_estimator_

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.7, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.01, max_delta_step=0.699999988, max_depth=7,
             min_child_weight=3, missing=nan, monotone_constraints='()',
             n_estimators=500, n_jobs=8, num_parallel_tree=1,
             objective='count:poisson', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=None, subsample=0.5,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [25]:
model = xg.XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.7, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.01, max_delta_step=0.699999988, max_depth=7,
             min_child_weight=3, monotone_constraints='()',
             n_estimators=500, n_jobs=8, num_parallel_tree=1,
             objective='count:poisson', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=None, subsample=0.5,
             tree_method='exact', validate_parameters=1, verbosity=None)

model.fit(x_train, y_train)
y_pred = model.predict(x_test)**2 #reversing square root transformation
mae = mean_absolute_error(y_test, y_pred.round())
mae

2.641934046345811

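> With the tuned parameters the MAE improves to about 2.642, compared with 2.683 for the default XGBRegressor. Note that this score is computed on rounded predictions, whereas the earlier comparison used unrounded ones. We can now fit the model on the whole training dataset and predict on the test dataset.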
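
The tuned score is computed on `y_pred.round()`, while the `estimator` function scores unrounded predictions. A minimal sketch with made-up numbers (`y_true` and `y_hat` are hypothetical) shows that rounding alone can change MAE, so both scores should use the same post-processing before they are compared.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up integer targets and fractional predictions.
y_true = np.array([1, 1, 4, 7])
y_hat = np.array([1.4, 1.3, 3.6, 6.8])

mae_raw = mean_absolute_error(y_true, y_hat)
mae_rounded = mean_absolute_error(y_true, y_hat.round())

# In this toy case every prediction rounds to the true value, so MAE drops.
print(mae_raw, mae_rounded)
```

Rounding does not always help: a prediction that rounds away from the true integer increases the error, so the effect depends on the data.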

In [None]:
train['Hazard'] = train['Hazard']**(1/2) #Square-root transformation

In [28]:
model.fit(train.drop('Hazard',axis=1), train['Hazard'])
y_pred = model.predict(test)**2 #reversing Square-root transformation
len(y_pred)

10200