## 0. Python imports & setup

For learning purposes, each library is imported in the section where it is first used...

## 1. Data loading

In [24]:
import pandas as pd

* incidents: labeled data we can use for training and testing

In [110]:
incidents = pd.read_csv('../data/processed/pipelines_incident_for_modelling.csv')

As you can see, there are both categorical and numerical columns...

## 2. Data Exploration

Although the dataset was already explored in the notebook data_analysis_report, we are going to check for null values, since many scikit-learn estimators do not accept them...

In [26]:
incidents.isna().sum()

SIGNIFICANT                          0
SERIOUS                              0
REPORT_NUMBER                        0
OPERATOR_ID                          0
OPERATOR_NAME                        0
PIPE_FACILITY_TYPE                7647
LOCATION_STATE_ABBREVIATION        174
LOCATION_COUNTY_NAME              1292
LOCATION_CITY_NAME                1123
ON_OFF_SHORE                      4576
ITEM_INVOLVED                     7582
INSTALLATION_YEAR                 8306
CAUSE                                0
MAP_CAUSE                            0
MAP_SUBCAUSE                         0
FATAL                                0
INJURE                               0
TOTAL_COST                           0
TOTAL_COST_IN84                      0
TOTAL_COST_CURRENT                   0
COMMODITY_RELEASED_TYPE           4359
CLASS_LOCATION_TYPE              11031
UNINTENTIONAL_RELEASE_BBLS        5670
RECOVERED_BBLS                    8261
IGNITE_IND                        6752
EXPLODE_IND              
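Raw null counts are hard to judge without the number of rows. A minimal sketch of expressing missingness as a share per column follows; the `toy` DataFrame and its values are made up for illustration and are not taken from the incidents file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only to illustrate the call pattern
toy = pd.DataFrame({
    "FATAL": [0.0, 1.0, 0.0, 0.0],
    "ON_OFF_SHORE": ["ONSHORE", np.nan, np.nan, "OFFSHORE"],
    "RECOVERED_BBLS": [np.nan, np.nan, np.nan, 2.5],
})

# isna() marks missing cells; mean() turns the boolean marks into a share per column
null_share = toy.isna().mean().sort_values(ascending=False)
print(null_share)
```

The same call on `incidents` would rank the columns by how much imputation they need.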

In [111]:
incidents['TOTAL_COST_CURRENT'].mean()

710634.1773311384

Let's check the data types of the categorical and numerical columns:

In [27]:
incidents.dtypes

SIGNIFICANT                       object
SERIOUS                           object
REPORT_NUMBER                      int64
OPERATOR_ID                        int64
OPERATOR_NAME                     object
PIPE_FACILITY_TYPE                object
LOCATION_STATE_ABBREVIATION       object
LOCATION_COUNTY_NAME              object
LOCATION_CITY_NAME                object
ON_OFF_SHORE                      object
ITEM_INVOLVED                     object
INSTALLATION_YEAR                 object
CAUSE                             object
MAP_CAUSE                         object
MAP_SUBCAUSE                      object
FATAL                            float64
INJURE                           float64
TOTAL_COST                       float64
TOTAL_COST_IN84                  float64
TOTAL_COST_CURRENT               float64
COMMODITY_RELEASED_TYPE           object
CLASS_LOCATION_TYPE              float64
UNINTENTIONAL_RELEASE_BBLS       float64
RECOVERED_BBLS                   float64
IGNITE_IND      

## 3. Preprocessing

In this section we will see how to use scikit-learn's Pipeline and ColumnTransformer, one of the best practices for composing preprocessing and modeling in a single, elegant object... Pay attention, as it can be hard to understand at first...

In [28]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

* https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
* https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Let's identify numerical and categorical features...

In [112]:
CAT_FEATS = ['SIGNIFICANT' , 'SERIOUS'] #, 'PIPE_FACILITY_TYPE','INCIDENT_AREA_TYPE', 'LOCATION_STATE_ABBREVIATION','ON_OFF_SHORE', 
           #  'ITEM_INVOLVED', 'COMMODITY_RELEASED_TYPE', 'IGNITE_IND', 'EXPLODE_IND', 'UNDER_CATHODIC_PROTECTION_IND', 'SYSTEM_TYPE', 
            #'MATERIAL_INVOLVED', 'PIPE_SPECIFICATION', 'INCIDENT_AREA_TYPE']
NUM_FEATS = ['FATAL', 'INJURE', 'UNINTENTIONAL_RELEASE_BBLS', 'ACCIDENT_PSIG', 'MOP_PSIG', 'RECOVERED_BBLS', 'PIPE_DIAMETER', 'WT_STEEL',
            'PIPE_SMYS', 'EX_HYDROTEST_PRESSURE', 'MANUFACTURED_YEAR', 'NORMAL_PSIG', 'ACCOMPANYING_LIQUID']
FEATS = NUM_FEATS + CAT_FEATS
TARGET = 'TOTAL_COST_CURRENT'

Let's define a preprocessing transformer for numerical columns...

In [130]:
numeric_transformer = \
Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')), 
                ('scaler', StandardScaler())])
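To see what the numeric transformer does, here is a small sketch on a single made-up column (the names `toy_numeric` and `X_toy` are illustrative only). The missing value is first replaced by the column mean and the column is then standardized.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same two steps as the numeric transformer above, rebuilt here so the example is self-contained
toy_numeric = Pipeline(steps=[("imputer", SimpleImputer(strategy="mean")),
                              ("scaler", StandardScaler())])

# One hypothetical numeric column with a missing value
X_toy = np.array([[1.0], [3.0], [np.nan]])

# The NaN becomes the column mean (2.0); scaling then centers the column at 0 with unit variance
X_out = toy_numeric.fit_transform(X_toy)
print(X_out.ravel())
```

Because imputed entries sit exactly at the column mean, they end up at roughly zero after scaling. That would be consistent with near-zero values such as 4.24e-17 in the `fit_transform` output shown later in this section.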

Let's define a preprocessing transformer for categorical columns...

In [131]:
categorical_transformer = \
Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                ('onehot', OneHotEncoder(handle_unknown='ignore'))])
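The sketch below illustrates, on made-up values, the two behaviours this transformer relies on: missing entries become their own `'missing'` category, and a category unseen during fitting is encoded as all zeros instead of raising an error. The arrays `X_fit` and `X_new` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

toy_categorical = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Hypothetical training column with one missing entry
X_fit = np.array([["YES"], ["NO"], [np.nan]], dtype=object)
# Hypothetical new row with a category that was never seen during fit
X_new = np.array([["MAYBE"]], dtype=object)

# OneHotEncoder returns a sparse matrix by default, so convert it for printing
encoded_fit = toy_categorical.fit_transform(X_fit).toarray()
encoded_new = toy_categorical.transform(X_new).toarray()

print(encoded_fit)  # three one-hot columns: two real categories plus 'missing'
print(encoded_new)  # a row of zeros for the unseen category
```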

Let's join these transformers using a `ColumnTransformer`:

In [132]:
preprocessor = \
ColumnTransformer(transformers=[('num', numeric_transformer, NUM_FEATS),
                                ('cat', categorical_transformer, CAT_FEATS)])

In [133]:
preprocessor.fit_transform(incidents)

array([[-9.07617949e-02, -2.61389579e-02, -7.46885424e-02, ...,
         1.00000000e+00,  1.00000000e+00,  0.00000000e+00],
       [-9.07617949e-02, -2.61389579e-02, -1.79565554e-01, ...,
         1.00000000e+00,  1.00000000e+00,  0.00000000e+00],
       [-9.07617949e-02, -2.61389579e-02, -1.46937150e-01, ...,
         1.00000000e+00,  1.00000000e+00,  0.00000000e+00],
       ...,
       [ 2.51525890e+00, -2.61389579e-02,  4.23933717e-17, ...,
         1.00000000e+00,  0.00000000e+00,  1.00000000e+00],
       [ 2.51525890e+00, -2.61389579e-02,  4.23933717e-17, ...,
         1.00000000e+00,  0.00000000e+00,  1.00000000e+00],
       [-9.07617949e-02, -2.61389579e-02,  4.23933717e-17, ...,
         1.00000000e+00,  1.00000000e+00,  0.00000000e+00]])

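`OneHotEncoder` produces sparse output by default, so a `ColumnTransformer` can return a sparse matrix that needs converting with `.todense()`. Here, however, the result above is already a dense NumPy array, so the call below fails: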

In [82]:
preprocessor.fit_transform(incidents).todense()

AttributeError: 'numpy.ndarray' object has no attribute 'todense'

Inspecting the full preprocessor output:

At least in this case, the convenience of a single preprocessor comes at the cost of interpretability: the transformed DataFrame loses the original column names...

In [59]:
# fit_transform already returns a dense ndarray here, so no .todense() call is needed
pd.DataFrame(data=preprocessor.fit_transform(incidents))

## 4. Train a simple model

First, let's train a simple model using a holdout (train/test) split...

In [134]:
from sklearn.model_selection import train_test_split

In [135]:
incidents_train, incidents_test = train_test_split(incidents)

In [136]:
print(incidents_train.shape)
print(incidents_test.shape)

(14019, 51)
(4673, 51)
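The shapes above correspond to the default 75% / 25% split. Since `train_test_split` shuffles the rows, the split differs on every run unless a seed is fixed. A minimal sketch on made-up data follows; the seed value 42 and the `toy` frame are arbitrary assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with 8 rows
toy = pd.DataFrame({"x": range(8), "y": range(8)})

# Fixing random_state makes the shuffle, and therefore the split, reproducible
first_train, first_test = train_test_split(toy, test_size=0.25, random_state=42)
second_train, second_test = train_test_split(toy, test_size=0.25, random_state=42)

print(first_train.shape, first_test.shape)
print(first_test.index.equals(second_test.index))
```

A fixed seed makes train and test errors comparable across notebook runs.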


Let's choose a model from the scikit-learn cheat sheet: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

In [137]:
from sklearn.ensemble import RandomForestRegressor

rfr_model = Pipeline(steps=[('preprocessor', preprocessor),
                       ('regressor', RandomForestRegressor())])

In [138]:
rfr_model.fit(incidents_train[FEATS], incidents_train[TARGET])

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['FATAL', 'INJURE',
                                                   'UNINTENTIONAL_RELEASE_BBLS',
                                                   'ACCIDENT_PSIG', 'MOP_PSIG',
                                                   'RECOVERED_BBLS',
                                                   'PIPE_DIAMETER', 'WT_STEEL',
                                                   'PIPE_SMYS',
                                                   'EX_HYDROTEST_PRESSURE',
                                                   'MANUFACTUR

### 4.1 Train a LightGBM model

In [122]:
from lightgbm import LGBMRegressor

lgbm_model= Pipeline(steps=[('preprocessor', preprocessor),
                       ('regressor', LGBMRegressor())])

In [123]:
lgbm_model.fit(incidents_train[FEATS], incidents_train[TARGET]);

### 4.2 Train a combined model

In [90]:
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import VotingRegressor

r1 = LGBMRegressor()
r2 = RandomForestRegressor()

combined_model= Pipeline(steps=[('preprocessor', preprocessor),
                       ('regressor', VotingRegressor([('lgbm', r1), ('rf', r2)]))])

In [91]:
combined_model.fit(incidents_train[FEATS], incidents_train[TARGET]);
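With default (equal) weights, `VotingRegressor` predicts the plain average of its fitted sub-models. The sketch below checks this on synthetic data. It uses `LinearRegression` and `DecisionTreeRegressor` as stand-ins so the example runs without LightGBM; these are not the models combined above.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=50)

voter = VotingRegressor([("lin", LinearRegression()),
                         ("tree", DecisionTreeRegressor(random_state=0))])
voter.fit(X_toy, y_toy)

# estimators_ holds the fitted clones; average their predictions by hand
individual = np.column_stack([est.predict(X_toy) for est in voter.estimators_])
manual_average = individual.mean(axis=1)

print(np.allclose(voter.predict(X_toy), manual_average))
```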

## 5. Check model performance on test and train data

In [139]:
from sklearn.metrics import mean_squared_error

In [140]:
y_test = combined_model.predict(incidents_test[FEATS])
y_train = combined_model.predict(incidents_train[FEATS])

ValueError: X has 15 features, but ColumnTransformer is expecting 16 features as input.

In [126]:
print(f"test error: {mean_squared_error(y_pred=y_test, y_true=incidents_test[TARGET], squared=False)}")
print(f"train error: {mean_squared_error(y_pred=y_train, y_true=incidents_train[TARGET], squared=False)}")

test error: 16833428.478696957
train error: 18697255.35959016
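The `squared=False` argument used above has been deprecated in recent scikit-learn releases. A version-independent way to obtain the same root mean squared error is to take the square root of the MSE yourself, sketched here on made-up numbers (`y_true` and `y_hat` are hypothetical).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_true, y_hat))
print(rmse)
```

RMSE is expressed in the units of the target, so the errors above can be compared directly with the mean of `TOTAL_COST_CURRENT` computed in section 2.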


## 6. Check model performance using cross validation

In [141]:
from sklearn.model_selection import cross_val_score

In [142]:
scores = cross_val_score(lgbm_model, 
                         incidents[FEATS], 
                         incidents[TARGET], 
                         scoring='neg_root_mean_squared_error', 
                         cv=10, n_jobs=-1)

In [143]:
import numpy as np
np.mean(-scores)

12306332.05072886
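scikit-learn scorers follow a "higher is better" convention, which is why the RMSE scorer returns negative values and the scores are negated before averaging. The sketch below shows the same pattern on synthetic data, using `Ridge` as a lightweight stand-in for the LightGBM pipeline. It also reports the spread across folds, which indicates how stable the estimate is.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data, for illustration only
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# One negative-RMSE score per fold
neg_rmse = cross_val_score(Ridge(), X_toy, y_toy,
                           scoring="neg_root_mean_squared_error", cv=5)

# Flip the sign to get ordinary RMSE values, then summarize them
rmse_per_fold = -neg_rmse
print(rmse_per_fold.mean(), rmse_per_fold.std())
```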

## 7. Optimize model using randomized search

In [106]:
from sklearn.model_selection import RandomizedSearchCV

In [108]:
lgbm_param_grid = {'regressor__num_leaves': (20, 100),
                   'regressor__n_estimators': (20, 500),
                   'regressor__learning_rate': (0.05, 0.3),
                  'regressor__feature_fraction': (0.1, 0.9),
                  'regressor__bagging_fraction': (0.8, 1),
                  'regressor__max_depth': (15, 25),
                  'regressor__min_split_gain': (0.001, 0.1),
                  'regressor__min_child_weight': (10, 50),
                'preprocessor__num__imputer__strategy': ['mean', 'median']}  # the imputer lives under the 'preprocessor' step, not under 'regressor'

rfr_param_grid = {'regressor__n_estimators': [512],
                 'regressor__max_depth': [16],
                 'preprocessor__num__imputer__strategy': ['mean']}

combined_param_grid = {
                        'regressor__lgbm__num_leaves': (20, 100),
                        'regressor__lgbm__n_estimators': (20, 500),
                        'regressor__lgbm__learning_rate': (0.05, 0.3),
                        'regressor__lgbm__feature_fraction': (0.1, 0.9),
                        'regressor__lgbm__bagging_fraction': (0.8, 1),
                        'regressor__lgbm__max_depth': (15, 25),
                        'regressor__lgbm__min_split_gain': (0.001, 0.1),
                        'regressor__lgbm__min_child_weight': (10, 50),
                        'regressor__rf__n_estimators': [512],
                        'regressor__rf__max_depth': [16],
                        'preprocessor__num__imputer__strategy': ['mean']
}


grid_search = RandomizedSearchCV(combined_model, 
                                 combined_param_grid, 
                                 cv=5, 
                                 verbose=2, 
                                 scoring='neg_root_mean_squared_error', 
                                 n_jobs=-1,
                                 n_iter=10)

grid_search.fit(incidents[FEATS], incidents[TARGET])

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END preprocessor__num__imputer__strategy=mean, regressor__lgbm__bagging_fraction=0.8, regressor__lgbm__feature_fraction=0.9, regressor__lgbm__learning_rate=0.3, regressor__lgbm__max_depth=25, regressor__lgbm__min_child_weight=10, regressor__lgbm__min_split_gain=0.001, regressor__lgbm__n_estimators=500, regressor__lgbm__num_leaves=100, regressor__rf__max_depth=16, regressor__rf__n_estimators=512; total time=  23.2s
[CV] END preprocessor__num__imputer__strategy=mean, regressor__lgbm__bagging_fraction=0.8, regressor__lgbm__feature_fraction=0.9, regressor__lgbm__learning_rate=0.3, regressor__lgbm__max_depth=25, regressor__lgbm__min_child_weight=10, regressor__lgbm__min_split_gain=0.001, regressor__lgbm__n_estimators=500, regressor__lgbm__num_leaves=100, regressor__rf__max_depth=16, regressor__rf__n_estimators=512; total time=  21.8s
[CV] END preprocessor__num__imputer__strategy=mean, regressor__lgbm__bagging_fraction=0.8, re

KeyboardInterrupt: 
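`RandomizedSearchCV` treats a tuple or list as a set of discrete choices, so an entry such as `(20, 100)` yields only the values 20 and 100, as the log above suggests. To sample from a range, a distribution object with an `rvs` method can be passed instead. The sketch below shows this with `scipy.stats` on synthetic data and a plain `RandomForestRegressor`; the ranges, the seed, and the small search budget are arbitrary assumptions chosen so it runs quickly.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, for illustration only
X_toy, y_toy = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)

param_distributions = {
    "n_estimators": randint(10, 50),    # integers in [10, 50)
    "max_depth": randint(2, 8),         # integers in [2, 8)
    "max_features": uniform(0.3, 0.7),  # floats in [0.3, 1.0] (loc=0.3, scale=0.7)
}

search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                            param_distributions,
                            n_iter=5, cv=3,
                            scoring="neg_root_mean_squared_error",
                            random_state=0)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

The same idea applies to pipeline parameters: only the keys change, for example `regressor__lgbm__num_leaves` in place of `n_estimators`.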

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_score_

## 8. Prepare submission

In [208]:
y_pred = grid_search.predict(diamonds_predict[FEATS])

In [209]:
submission_df = pd.DataFrame({'id': diamonds_predict['id'], 'price': y_pred})

In [210]:
submission_df.head()

   id        price
0   0  2907.639331
1   1  5632.615022
2   2  9471.212337
3   3  4055.583619
4   4  1639.378447


In [211]:
submission_df.describe()

                id         price
count      13485.0       13485.0
mean        6742.0   3952.955395
std    3892.928525   3945.681812
min            0.0    339.102748
25%         3371.0    945.682904
50%         6742.0   2466.956301
75%        10113.0   5304.791594
max        13484.0  17899.397882


In [212]:
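# assign the clipped values back instead of relying on inplace modification of a column accessor
submission_df['price'] = submission_df['price'].clip(0, 20000)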

In [213]:
submission_df.to_csv('../submissions_kaggle/submission_combined_3.csv', index=False)