# Model Building
In this stage, you will build several machine learning models on the cleaned data set and attempt to train one that performs better than a baseline. What counts as "better" depends on your data set and on the metric you choose.
## Imports

In [2]:
import os
import sys

import numpy as np
import pandas as pd

import lightgbm as lgb
from sklearn.model_selection import train_test_split, StratifiedKFold

In [3]:
src_path = os.path.abspath('../src/')
sys.path.append(src_path)

from data_modeling import *
from data_cleaning import *

In [4]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [5]:
# to allow for all variables to be displayed in jupyter
pd.set_option('display.max_rows', 600)
pd.set_option('display.max_columns', 500)

In [6]:
SEED = 123

## Functions
For your convenience, we have included a few pre-written functions, which you might find useful in your model building. They are by no means necessary, but feel free to use any or all of them.

### score_classification
score_classification takes the predicted results from a model and scores them on a broad set of classification metrics. It also reports the confusion matrix.

Parameters:
- y_train: (1d array-like) The correct y values for the training data set
- y_train_pred: (1d array-like) The predicted y values from the training data set
- y_test: (1d array-like) The correct y values for the test data set
- y_test_pred: (1d array-like) The predicted y values from the test data set

This function uses [sklearn](https://scikit-learn.org/stable/modules/classes.html).metrics to calculate each score. The required functions are imported inside the function.
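
The real score_classification lives in `../src/` and is not shown here. As a rough illustration only, a function with the same four parameters might look like the sketch below; the name `score_classification_sketch`, the choice of metrics, and the returned dictionary are all assumptions, not the project's implementation.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)


def score_classification_sketch(y_train, y_train_pred, y_test, y_test_pred):
    """Hypothetical stand-in for score_classification: returns a dict of scores."""
    results = {}
    for name, y_true, y_pred in [("train", y_train, y_train_pred),
                                 ("test", y_test, y_test_pred)]:
        results[name] = {
            "accuracy": accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred, zero_division=0),
            "recall": recall_score(y_true, y_pred, zero_division=0),
            "f1": f1_score(y_true, y_pred, zero_division=0),
            "confusion_matrix": confusion_matrix(y_true, y_pred),
        }
    return results


# Tiny hand-made example: 1 = fraud, 0 = not fraud.
y_true = np.array([0, 0, 0, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0])
scores = score_classification_sketch(y_true, y_pred, y_true, y_pred)
print(scores["test"]["confusion_matrix"])
```

Scoring the training and test predictions side by side makes overfitting easier to spot: a large gap between the two sets of scores is a warning sign.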

### downsample
Takes a dataframe and the name (string) of its target column and [downsamples](https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data) the majority class to equal the minority class.

Parameters:
- df: a Pandas DataFrame containing the data to be downsampled
- target: string. The name of the target variable.

This function uses the Python libraries [Pandas](https://pandas.pydata.org/docs/reference/index.html) (pd), which has been imported above, and [resample](https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html) from the [sklearn](https://scikit-learn.org/stable/modules/classes.html) library, which is imported inside the function.
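
For readers who want to see the idea without opening `../src/`, here is a minimal sketch of downsampling with `sklearn.utils.resample`. The function name `downsample_sketch`, the `random_state` argument, and the toy frame are assumptions made for illustration; the sketch also assumes a binary target with unequal class counts.

```python
import pandas as pd
from sklearn.utils import resample


def downsample_sketch(df, target, random_state=0):
    """Hypothetical version of downsample for a binary target column."""
    counts = df[target].value_counts()
    minority_label, majority_label = counts.idxmin(), counts.idxmax()
    minority = df[df[target] == minority_label]
    majority = df[df[target] == majority_label]
    # Sample the majority class without replacement down to the minority size.
    majority_down = resample(majority, replace=False,
                             n_samples=len(minority),
                             random_state=random_state)
    return pd.concat([majority_down, minority]).reset_index(drop=True)


toy = pd.DataFrame({"amount": range(10),
                    "isfraud": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})
balanced = downsample_sketch(toy, "isfraud")
print(balanced["isfraud"].value_counts())
```

If you downsample, it is usually safer to do so on the training portion only, so that the test set keeps the original class balance.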

### scaled_model_search 
Takes a list of scalers and models, along with test-train split data, and runs a search over every possible combination of scaler and model. It prints out the best result. Currently the metric used is accuracy, but it would be simple enough to change depending on the situation.

Parameters:
- scalers: a list of initialized scaler objects (ex: scalers = [StandardScaler(), RobustScaler(), QuantileTransformer(random_state = 42)])
- models: a list of initialized model objects (ex: models = [LogisticRegression(), ExtraTreesClassifier(random_state = 42), RandomForestClassifier(random_state = 42)])
- X_train: DataFrame containing the training data set without the target variable
- y_train: DataFrame containing the target variable for the training data.
- X_test: DataFrame containing the test data set without the target variable
- y_test: DataFrame containing the target variable for the test data.

This function uses the [sklearn](https://scikit-learn.org/stable/modules/classes.html) function [accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) as a metric to compare the models, and it has been imported inside the function. It also uses [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) from [sklearn](https://scikit-learn.org/stable/modules/classes.html), which has been imported inside the function.
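
The search described above can be sketched as follows. This is not the project's scaled_model_search: the function name, the returned tuple, and the synthetic data from `make_classification` are assumptions for illustration, and the sketch returns its results rather than printing them.

```python
from itertools import product

from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier


def scaled_model_search_sketch(scalers, models, X_train, y_train, X_test, y_test):
    """Hypothetical search over every (scaler, model) pair, scored by accuracy."""
    results = []
    for scaler, model in product(scalers, models):
        # Clone so each pipeline gets fresh, unfitted estimators.
        pipe = Pipeline([("scaler", clone(scaler)), ("model", clone(model))])
        pipe.fit(X_train, y_train)
        acc = accuracy_score(y_test, pipe.predict(X_test))
        results.append((type(scaler).__name__, type(model).__name__, acc))
    return max(results, key=lambda r: r[2]), results


X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
best, all_results = scaled_model_search_sketch(
    [StandardScaler(), RobustScaler()],
    [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=42)],
    X_train, y_train, X_test, y_test)
print(best)
```

Swapping `accuracy_score` for another metric is a one-line change, which matters for imbalanced targets where accuracy can be misleading.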

## Data
Read in the clean data set from your data_preparation notebook. It should be ready for some preliminary model-building by now, but you should consider your variables and decide if you want to use all of them to train a model. You should have a clear reason for excluding any variables. Also consider time-series data (if applicable to your set). If you have data from multiple years, should you train and test on each year individually? Train on one year and test on another?
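
If your data are ordered in time, one option is to train on earlier rows and test on later ones instead of splitting at random. The sketch below uses a made-up frame and assumes `transactiondt` is a time offset; the 80/20 cutoff is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy frame; `transactiondt` is assumed to be a time offset.
toy = pd.DataFrame({
    "transactiondt": [86400, 86401, 90000, 120000, 150000,
                      200000, 250000, 300000, 350000, 400000],
    "isfraud": [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Sort by time and hold out the most recent 20% of rows as the test set.
toy = toy.sort_values("transactiondt").reset_index(drop=True)
cutoff = int(len(toy) * 0.8)
train_df, test_df = toy.iloc[:cutoff], toy.iloc[cutoff:]

print(len(train_df), len(test_df))
print(train_df["transactiondt"].max() < test_df["transactiondt"].min())
```

A split like this mimics how the model would be used in practice, where predictions are always made on data newer than anything seen in training.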

In [41]:
data_path = '../data/interim/'
training_csv = 'ngv_modeling.csv'
training_pickle = 'ngv_training_only_downcast_modeling.pkl'

data_df = pd.read_csv(data_path + training_csv)
# data_df = pd.read_pickle(data_path + training_pickle)

In [8]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 590540 entries, 0 to 590539
Columns: 442 entries, transactionid to DeviceInfo_alpha_trunc
dtypes: float32(422), int16(14), int32(2), int8(4)
memory usage: 973.2 MB


In [7]:
data_df.head()

Unnamed: 0,transactionid,isfraud,transactiondt,transactionamt,productcd,card1,card2,card3,card4,card5,card6,addr1,addr2,dist1,dist2,p_emaildomain,r_emaildomain,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,d1,d2,d3,d4,d5,d6,d7,d8,d9,d10,d11,d12,d13,d14,d15,m1,m2,m3,m4,m5,m6,m7,m8,m9,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,v29,v30,v31,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v58,v59,v60,v61,v62,v63,v64,v65,v66,v67,v68,v69,v70,v71,v72,v73,v74,v75,v76,v77,v78,v79,v80,v81,v82,v83,v84,v85,v86,v87,v88,v89,v90,v91,v92,v93,v94,v95,v96,v97,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v123,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v151,v152,v153,v154,v155,v156,v157,v158,v159,v160,v161,v162,v163,v164,v165,v166,v167,v168,v169,v170,v171,v172,v173,v174,v175,v176,v177,v178,v179,v180,v181,v182,v183,v184,v185,v186,v187,v188,v189,v190,v191,v192,v193,v194,v195,v196,v197,v198,v199,v200,v201,v202,v203,v204,v205,v206,v207,v208,v209,v210,v211,v212,v213,v214,v215,v216,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v233,v234,v235,v236,v237,v238,v239,v240,v241,v242,v243,v244,v245,v246,v247,v248,v249,v250,v251,v252,v253,v254,v255,v256,v257,v258,v259,v260,v261,v262,v263,v264,v265,v266,v267,v268,v269,v270,v271,v272,v273,v274,v275,v276,v277,v278,v279,v280,v281,v282,v283,v284,v285,v286,v287,v288,v289,v290,v291,v292,v293,v294,v295,v296,v297,v298,v299,v300,v301,v302,v303,v304,v305,v306,v307,v308,v309,v310,v311,v312,v313,v314,v315,v316,v317,v318,v319,v320,v321,v322,v323,v324,v325,v326,v327,v328,v329,v330,v331,v332,v333,v334,v335,v336,v337,v338,v339,transactionamt_long_dec,DeviceInfo,DeviceType,id_01,id_02,id_03,id_04,id_05,id_06,id_07,id_08,id_09,id_10,id_11,id_12,id_13,id_14,id_15,id_16,id_17,id_18,id_19,id_20,id_21,id_22,id_23,id_24,id_25,id_26,id_27,id_28,id_29,id_30,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,id_30_name,id_31_bin_many,id_31_bin_few,id_33_prod,id_33_ratio,DeviceInfo_trunc,DeviceInfo_alpha_trunc
0,2987000,0,86400,68.5,4,13926,,150.0,0.0,142.0,2.0,315.0,87.0,19.0,,,,1,1,0,0,0,1,0,0,1,0,2,0,1,1,14.0,,13.0,,,,,,,13.0,13.0,,,,0.0,1.0,1.0,1.0,1.0,0.0,1.0,,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,117.0,0.0,0.0,0.0,0.0,0.0,117.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,117.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2987001,0,86401,29.0,4,2755,404.0,150.0,2.0,102.0,2.0,325.0,87.0,,,58.0,,1,1,0,0,0,1,0,0,0,0,1,0,1,1,0.0,,,0.0,,,,,,0.0,,,,,0.0,,,,2.0,1.0,1.0,,,,,,,,,,,,,,,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2987002,0,86469,59.0,4,4663,490.0,150.0,3.0,166.0,3.0,330.0,87.0,287.0,,51.0,,1,1,0,0,0,1,0,0,1,0,1,0,1,1,0.0,,,0.0,,,,,,0.0,315.0,,,,315.0,1.0,1.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,2987003,0,86499,50.0,4,18132,567.0,150.0,2.0,117.0,3.0,476.0,87.0,,,57.0,,2,5,0,0,0,4,0,0,1,0,1,0,25,1,112.0,112.0,0.0,94.0,0.0,,,,,84.0,,,,,111.0,,,,2.0,1.0,0.0,,,,,,,,,,,,,,,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,48.0,28.0,0.0,10.0,4.0,1.0,38.0,24.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,50.0,1758.0,925.0,0.0,354.0,135.0,50.0,1404.0,790.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,28.0,0.0,0.0,0.0,0.0,10.0,0.0,4.0,0.0,0.0,1.0,1.0,1.0,1.0,38.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,50.0,1758.0,925.0,0.0,354.0,0.0,135.0,0.0,0.0,0.0,50.0,1404.0,790.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,2987004,0,86506,50.0,1,4497,514.0,150.0,2.0,102.0,2.0,420.0,87.0,,,58.0,,1,1,0,0,0,1,0,1,0,1,1,0,1,1,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,18.0,140.0,0.0,0.0,0.0,0.0,1803.0,49.0,64.0,0.0,0.0,0.0,0.0,0.0,0.0,15557.990234,169690.796875,0.0,0.0,0.0,515.0,5155.0,2840.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,7.6e-05,0.0,0.0,70787.0,,,,,,,,,100.0,0.0,,-480.0,1.0,0.0,166.0,,542.0,144.0,,,,,,,,0.0,0.0,0.037014,0.007563,32.0,0.007423,3.0,1.0,0.0,1.0,1.0,6.0,0.014457,4.0,3300.0,2.1,0.022686,0.022686


In [42]:
data_df.dtypes

transactionid                int64
isfraud                      int64
transactiondt                int64
transactionamt             float64
productcd                    int64
card1                        int64
card2                      float64
card3                      float64
card4                      float64
card5                      float64
card6                      float64
addr1                      float64
addr2                      float64
dist1                      float64
dist2                      float64
p_emaildomain              float64
r_emaildomain              float64
c1                         float64
c2                         float64
c3                         float64
c4                         float64
c5                         float64
c6                         float64
c7                         float64
c8                         float64
c9                         float64
c10                        float64
c11                        float64
c12                 

In [None]:
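def reduce_df_memory(df):
    """Downcast numeric columns in place to their smallest dtype; return df."""
    for col in df.columns:
        try:
            # Integer-valued columns (no NaN) can be stored as small ints.
            if (df[col].astype(np.int64, errors='ignore') == df[col]).all():
                df[col] = pd.to_numeric(df[col], downcast='integer')
            else:
                df[col] = pd.to_numeric(df[col], downcast='float')
        except (TypeError, ValueError):
            # Leave non-numeric columns unchanged.
            continue
    return df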

In [46]:
def float_to_int(col):
    if (col.astype(np.int64, errors='ignore') == col).all():
        return pd.to_numeric(col, downcast='integer')
    else:
        return pd.to_numeric(col, downcast='float')

In [47]:
#reducing memory and downcasting numerical datatypes
for col in data_df.columns:
    data_df[col] = float_to_int(data_df[col])
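
The downcasting loop above relies on `pd.to_numeric(..., downcast=...)`. The toy example below, with made-up column names, shows which dtypes that call picks in three typical cases. One caveat worth keeping in mind: float32 holds only about seven significant digits, so very large or very precise values can lose precision when downcast.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "whole_floats": [1.0, 2.0, 3.0],         # integer-valued, no missing values
    "with_missing": [1.0, np.nan, 3.0],      # NaN forces a float dtype
    "big_ids": [2987000, 2987001, 2987002],  # too large for int16
})

before = toy.memory_usage(deep=True).sum()
toy["whole_floats"] = pd.to_numeric(toy["whole_floats"], downcast="integer")
toy["with_missing"] = pd.to_numeric(toy["with_missing"], downcast="float")
toy["big_ids"] = pd.to_numeric(toy["big_ids"], downcast="integer")
after = toy.memory_usage(deep=True).sum()

print(toy.dtypes)
print(before, "->", after)
```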

In [48]:
data_df.dtypes

transactionid                int32
isfraud                       int8
transactiondt                int32
transactionamt             float32
productcd                     int8
card1                        int16
card2                      float32
card3                      float32
card4                      float32
card5                      float32
card6                      float32
addr1                      float32
addr2                      float32
dist1                      float32
dist2                      float32
p_emaildomain              float32
r_emaildomain              float32
c1                           int16
c2                           int16
c3                            int8
c4                           int16
c5                           int16
c6                           int16
c7                           int16
c8                           int16
c9                           int16
c10                          int16
c11                          int16
c12                 

In [11]:
# data_df.head()

In [95]:
# data_df.dtypes

isfraud                       int8
transactionamt             float32
productcd                     int8
card1                        int16
card2                      float32
card3                      float32
card4                      float32
card5                      float32
card6                      float32
addr1                      float32
addr2                      float32
dist1                      float32
dist2                      float32
p_emaildomain              float32
r_emaildomain              float32
c1                           int16
c2                           int16
c3                            int8
c4                           int16
c5                           int16
c6                           int16
c7                           int16
c8                           int16
c9                           int16
c10                          int16
c11                          int16
c12                          int16
c13                          int16
c14                 

In [13]:
# data_df.to_pickle(data_path + training_pickle)

In [14]:
# these columns either have separate encoded versions of their data, so the originals are excluded from modeling,
# or they are simply excluded because they are not used as features (e.g. the transaction ID and the raw time offset)
cols_to_drop = ['transactionid', 'transactiondt']

In [15]:
data_df = data_df.drop(cols_to_drop, axis=1)

In [16]:
data_df.head()

Unnamed: 0,isfraud,transactionamt,productcd,card1,card2,card3,card4,card5,card6,addr1,addr2,dist1,dist2,p_emaildomain,r_emaildomain,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,d1,d2,d3,d4,d5,d6,d7,d8,d9,d10,d11,d12,d13,d14,d15,m1,m2,m3,m4,m5,m6,m7,m8,m9,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,v29,v30,v31,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v58,v59,v60,v61,v62,v63,v64,v65,v66,v67,v68,v69,v70,v71,v72,v73,v74,v75,v76,v77,v78,v79,v80,v81,v82,v83,v84,v85,v86,v87,v88,v89,v90,v91,v92,v93,v94,v95,v96,v97,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v123,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v151,v152,v153,v154,v155,v156,v157,v158,v159,v160,v161,v162,v163,v164,v165,v166,v167,v168,v169,v170,v171,v172,v173,v174,v175,v176,v177,v178,v179,v180,v181,v182,v183,v184,v185,v186,v187,v188,v189,v190,v191,v192,v193,v194,v195,v196,v197,v198,v199,v200,v201,v202,v203,v204,v205,v206,v207,v208,v209,v210,v211,v212,v213,v214,v215,v216,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v233,v234,v235,v236,v237,v238,v239,v240,v241,v242,v243,v244,v245,v246,v247,v248,v249,v250,v251,v252,v253,v254,v255,v256,v257,v258,v259,v260,v261,v262,v263,v264,v265,v266,v267,v268,v269,v270,v271,v272,v273,v274,v275,v276,v277,v278,v279,v280,v281,v282,v283,v284,v285,v286,v287,v288,v289,v290,v291,v292,v293,v294,v295,v296,v297,v298,v299,v300,v301,v302,v303,v304,v305,v306,v307,v308,v309,v310,v311,v312,v313,v314,v315,v316,v317,v318,v319,v320,v321,v322,v323,v324,v325,v326,v327,v328,v329,v330,v331,v332,v333,v334,v335,v336,v337,v338,v339,transactionamt_long_dec,DeviceInfo,DeviceType,id_01,id_02,id_03,id_04,id_05,id_06,id_07,id_08,id_09,id_10,id_11,id_12,id_13,id_14,id_15,id_16,id_17,id_18,id_19,id_20,id_21,id_22,id_23,id_24,id_25,id_26,id_27,id_28,id_29,id_30,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,id_30_name,id_31_bin_many,id_31_bin_few,id_33_prod,id_33_ratio,DeviceInfo_trunc,DeviceInfo_alpha_trunc
0,0,68.5,4,13926,,150.0,0.0,142.0,2.0,315.0,87.0,19.0,,,,1,1,0,0,0,1,0,0,1,0,2,0,1,1,14.0,,13.0,,,,,,,13.0,13.0,,,,0.0,1.0,1.0,1.0,1.0,0.0,1.0,,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,117.0,0.0,0.0,0.0,0.0,0.0,117.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,117.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,0,29.0,4,2755,404.0,150.0,2.0,102.0,2.0,325.0,87.0,,,58.0,,1,1,0,0,0,1,0,0,0,0,1,0,1,1,0.0,,,0.0,,,,,,0.0,,,,,0.0,,,,2.0,1.0,1.0,,,,,,,,,,,,,,,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,0,59.0,4,4663,490.0,150.0,3.0,166.0,3.0,330.0,87.0,287.0,,51.0,,1,1,0,0,0,1,0,0,1,0,1,0,1,1,0.0,,,0.0,,,,,,0.0,315.0,,,,315.0,1.0,1.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,0,50.0,4,18132,567.0,150.0,2.0,117.0,3.0,476.0,87.0,,,57.0,,2,5,0,0,0,4,0,0,1,0,1,0,25,1,112.0,112.0,0.0,94.0,0.0,,,,,84.0,,,,,111.0,,,,2.0,1.0,0.0,,,,,,,,,,,,,,,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,48.0,28.0,0.0,10.0,4.0,1.0,38.0,24.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,50.0,1758.0,925.0,0.0,354.0,135.0,50.0,1404.0,790.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,28.0,0.0,0.0,0.0,0.0,10.0,0.0,4.0,0.0,0.0,1.0,1.0,1.0,1.0,38.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,50.0,1758.0,925.0,0.0,354.0,0.0,135.0,0.0,0.0,0.0,50.0,1404.0,790.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,0,50.0,1,4497,514.0,150.0,2.0,102.0,2.0,420.0,87.0,,,58.0,,1,1,0,0,0,1,0,1,0,1,1,0,1,1,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,18.0,140.0,0.0,0.0,0.0,0.0,1803.0,49.0,64.0,0.0,0.0,0.0,0.0,0.0,0.0,15557.990234,169690.796875,0.0,0.0,0.0,515.0,5155.0,2840.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,7.6e-05,0.0,0.0,70787.0,,,,,,,,,100.0,0.0,,-480.0,1.0,0.0,166.0,,542.0,144.0,,,,,,,,0.0,0.0,0.037014,0.007563,32.0,0.007423,3.0,1.0,0.0,1.0,1.0,6.0,0.014457,4.0,3300.0,2.1,0.022686,0.022686


In [30]:
binary_cats = ['m1', 'm2', 'm3', 'm5', 'm6', 'm7',
               'm8', 'm9', 'DeviceType', 'id_12',
               'id_16', 'id_27', 'id_28', 'id_29',
               'id_35', 'id_36', 'id_37', 'id_38',
               'transactionamt_long_dec']
num_cats = ['card1', 'card2', 'card3', 'card5',
            'addr1', 'addr2', 'id_13', 'id_14',
            'id_17', 'id_18', 'id_19', 'id_20',
            'id_21', 'id_22', 'id_24', 'id_25',
            'id_26', 'id_32', 'id_33_prod', 'id_33_ratio']
freq_cats = ['DeviceInfo', 'DeviceInfo_trunc',
             'DeviceInfo_alpha_trunc', 'id_30',
             'id_31', 'id_31_bin_many', 'id_33']
# these will be treated as categories in baseline
label_cats = ['productcd', 'card4', 'card6', 'm4',
              'p_emaildomain', 'r_emaildomain',
              'id_15', 'id_23', 'id_30_name',
              'id_31_bin_few']

## Data Splitting
Once you have an idea of how you plan to use the data, split your data into train and test groups or, if you prefer a more complicated approach, multiple folds. 
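
With an imbalanced target, stratifying the split keeps the class ratio roughly the same in every subset. The sketch below checks this on synthetic data; the 5% positive rate and the random features are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(123)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 50 + [0] * 950)  # 5% positive class, a made-up imbalance

# Single stratified hold-out split.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123)
print(y_tr.mean(), y_val.mean())

# Stratified folds: each validation fold should carry a similar positive rate.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
fold_rates = [y[val_idx].mean() for _, val_idx in skf.split(X, y)]
print(fold_rates)
```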

In [18]:
X_train = data_df.drop(['isfraud'], axis=1)
y_train = data_df['isfraud'].values

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train,
                                            test_size = 0.2,
                                            stratify = y_train,
                                            random_state = SEED)

In [None]:
# alternative single train/test split; uses data_df, the frame loaded above
X = data_df.drop(['isfraud'], axis=1)
y = data_df['isfraud'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y, random_state = 1)

In [19]:
cv_folds = StratifiedKFold(5, random_state=SEED, shuffle=True)

## Baseline Model
Before anything else, let's build a baseline model. This will serve as a "sanity check" for everything that comes after. Choose a simple model and, without any preprocessing or tuning, train it on the training set. How well does it perform on the test set?
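
It can also help to know the floor that any baseline has to beat. The sketch below is an illustration on synthetic data, not part of this notebook's pipeline: it scores a classifier that always predicts the class prior. The 3.5% positive rate is a made-up figure.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(123)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 35 + [0] * 965)  # made-up 3.5% positive rate

# A "model" that ignores the features and always predicts the class prior.
dummy = DummyClassifier(strategy="prior").fit(X, y)
acc = accuracy_score(y, dummy.predict(X))
auc = roc_auc_score(y, dummy.predict_proba(X)[:, 1])
print(acc, auc)
```

The dummy model reaches high accuracy simply because the negative class dominates, while its AUC sits at chance level. This is one reason a ranking metric such as AUC is a more informative yardstick for an imbalanced target.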

In [20]:
feats = X_tr.columns.to_list()
cats = label_cats

In [13]:
lgb_tr_cv = lgb.Dataset(X_tr, y_tr, feature_name = feats, categorical_feature = cats)
lgb_val_cv = lgb.Dataset(X_val, y_val, feature_name = feats, categorical_feature = cats, reference=lgb_tr_cv)

In [14]:
lgb_tr = lgb.Dataset(X_tr, y_tr, feature_name = feats, categorical_feature = cats)
lgb_val = lgb.Dataset(X_val, y_val, feature_name = feats, categorical_feature = cats, reference=lgb_tr)

In [21]:
params = {
        'objective': 'binary',
        'boosting': 'gbdt',
        'learning_rate': 0.05 ,
        'verbose': 0,
        'num_leaves': 2**8,
        'bagging_fraction': 0.95,
        'bagging_freq': 1,
        'bagging_seed': 1,
        'feature_fraction': 0.9,
        'feature_fraction_seed': 1,
        'max_bin': 256,
        'num_rounds': 500,
        'metric': 'auc',
        'seed': SEED
    }

In [21]:
lgb_base_cv = lgb.cv(params, lgb_tr_cv, folds=cv_folds)



In [22]:
lgb_base_cv['auc-mean'][-1]

0.9692237459112262

In [23]:
lgb_base_model = lgb.train(params, train_set = lgb_tr, valid_sets = lgb_val, verbose_eval=50)



[50]	valid_0's auc: 0.938133
[100]	valid_0's auc: 0.958435
[150]	valid_0's auc: 0.964522
[200]	valid_0's auc: 0.967698
[250]	valid_0's auc: 0.969781
[300]	valid_0's auc: 0.971124
[350]	valid_0's auc: 0.971727
[400]	valid_0's auc: 0.972522
[450]	valid_0's auc: 0.973132
[500]	valid_0's auc: 0.973682


In [24]:
base_model_feat_imp = dict(zip(lgb_base_model.feature_name(),lgb_base_model.feature_importance()))
base_model_feat_imp = pd.DataFrame.from_dict(base_model_feat_imp, orient='index', columns=['feature_importance'])
base_model_feat_imp = base_model_feat_imp.sort_values(by='feature_importance', ascending=False)
base_model_feat_imp

Unnamed: 0,feature_importance
card1,9044
transactionamt,8230
card2,7609
addr1,7142
dist1,3664
card5,3061
d15,2994
d4,2374
c13,2359
id_02,2286


## Model Improvement
Now you can work on improving on the baseline. There's no linear approach to this process and the steps you take will depend on the data. Below are some steps that are commonly used in building robust models. You can use any, all, or only some of them, and you are encouraged to add your own steps for your specific data set.

As you go through this process, keep in mind all that you learned during the data understanding phase and consider the following questions:
- What sort of model should you train? (e.g., classification, regression, a neural network?)
- Given the distribution of your data, the presence or absence of missing data, and various other factors, is there a particular model (or ensemble) that you think will work well? (e.g., RandomForest, ExtraTrees, SVM...?)
- Depending on what sort of model you train and what your data look like, you may find different evaluation metrics useful. How can you be certain that you have the most well-rounded view of how well your model is performing? What metric or metrics will best capture your model priorities (and what are your model priorities)?
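
To make the question about metrics concrete, the sketch below scores one set of made-up predictions with four different metrics. The labels, scores, and the 0.5 threshold are invented for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             recall_score, roc_auc_score)

# Ten made-up transactions, two of them fraudulent.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.45, 0.90])

# Threshold-dependent metrics need hard labels.
y_pred = (scores >= 0.5).astype(int)
acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)

# Ranking metrics work directly on the scores.
auc = roc_auc_score(y_true, scores)
ap = average_precision_score(y_true, scores)

print(acc, rec, auc, ap)
```

Accuracy and recall depend on the chosen threshold, while AUC and average precision summarize how well the scores rank positives above negatives. Looking at several of these together gives a more rounded view than any single number.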

### Scaling
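Some models assume that the data are normally distributed, and their performance suffers when that assumption fails. Many models, particularly distance-based and gradient-based ones, also suffer when variables have vastly different scales, whereas tree-based models such as the LightGBM baseline above are largely insensitive to feature scaling. Do you need to scale your data? If so, how should you go about doing so?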
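
If you do scale, a common practice is to fit the scaler on the training data only and reuse those statistics on the test data, so that no information leaks from the test set. A minimal sketch with made-up features on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)
# Two made-up features on very different scales.
X_train = np.column_stack([rng.normal(100_000, 20_000, 200),
                           rng.normal(0.5, 0.1, 200)])
X_test = np.column_stack([rng.normal(100_000, 20_000, 50),
                          rng.normal(0.5, 0.1, 50)])

scaler = StandardScaler().fit(X_train)     # statistics come from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuse the training statistics

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
```

Wrapping the scaler and model in a scikit-learn `Pipeline` applies the same discipline automatically inside cross-validation.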

### Feature Selection and Engineering
Are all of your variables necessary, or do you have a lot of them taking up time and computing power without adding much to model building? Can some variables be combined to make a better model? Are variables linearly related to your target variable, or would it be worthwhile to include some polynomial features? 
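
As a small illustration of polynomial features, the sketch below expands two made-up columns into their degree-2 terms with scikit-learn's `PolynomialFeatures`. For inputs `[a, b]` the output columns are `a`, `b`, `a^2`, `a*b`, and `b^2`.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# include_bias=False leaves out the constant column of ones.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(X_poly)
```

The number of generated columns grows quickly with the number of inputs, so on a wide data set it is usually worth restricting this to a handful of candidate variables.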

#### Correlation Based Feature Elimination of V Columns

In [None]:
# commented out this code and hardcoded results for simplicity moving forward
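
The commented-out cells in this subsection call project helpers (`null_counts`, `ngv_reduce_feature_by_corr`) whose implementations are not shown here. As a generic illustration of correlation-based elimination, and not a reconstruction of those helpers, the sketch below applies a simple greedy rule: keep a column unless it correlates above a threshold with a column already kept. The function name, the rule, and the toy columns are all assumptions.

```python
import numpy as np
import pandas as pd


def drop_correlated_sketch(df, threshold=0.9):
    """Greedy rule (an assumption): keep a column unless its absolute
    correlation with an already-kept column exceeds `threshold`."""
    corr = df.corr().abs()
    keeps, drops = [], []
    for col in df.columns:
        if any(corr.loc[col, kept] > threshold for kept in keeps):
            drops.append(col)
        else:
            keeps.append(col)
    return keeps, drops


rng = np.random.default_rng(123)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "v_a": base,
    "v_b": base * 2 + rng.normal(scale=0.01, size=500),  # near-duplicate of v_a
    "v_c": rng.normal(size=500),                          # independent noise
})
keeps, drops = drop_correlated_sketch(toy, threshold=0.9)
print(keeps, drops)
```

The result of a greedy rule depends on column order, so a more careful version might decide which member of a correlated pair to keep based on missingness or feature importance.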

In [50]:
# v_cols = []
# for col in data_df.columns:
#     if col[0] == 'v':
#         v_cols.append(col)
# v_cols

In [24]:
# null_grouped = null_counts(data_df[v_cols]).groupby(by='null_fraction')
# v_col_groups = []
# for name, group in null_grouped:
#     v_col_groups.append(group.index.to_list())

In [41]:
# v_keeps, v_drops, v_corr_dfs = set(), set(), []
# for x in range(len(v_col_groups)):
#     temp_keeps, temp_drops, temp_corr_df = ngv_reduce_feature_by_corr(data_df[v_cols], v_col_groups[x], 0.9, [], [], [])
#     v_keeps.update(temp_keeps)
#     v_drops.update(temp_drops)
#     v_corr_dfs.append(temp_corr_df)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [48]:
# print(len(v_keeps))
# print(v_keeps)

182


In [52]:
# print(len(v_drops))
# print(v_drops)

157


In [22]:
# 0.95 Threshold
v_drops_95 = ['v10', 'v102', 'v106', 'v126', 'v127', 'v128',
           'v132', 'v133', 'v134', 'v143', 'v144', 'v145',
           'v148', 'v149', 'v15', 'v150', 'v151', 'v153',
           'v155', 'v157', 'v162', 'v167', 'v168', 'v17',
           'v177', 'v178', 'v179', 'v190', 'v191', 'v192',
           'v193', 'v202', 'v203', 'v204', 'v21', 'v212',
           'v213', 'v217', 'v218', 'v219', 'v221', 'v225',
           'v231', 'v232', 'v233', 'v237', 'v242', 'v248',
           'v249', 'v250', 'v254', 'v255', 'v266', 'v27',
           'v271', 'v272', 'v279', 'v280', 'v29', 'v292',
           'v294', 'v299', 'v302', 'v306', 'v307', 'v308',
           'v31', 'v316', 'v317', 'v318', 'v324', 'v329',
           'v33', 'v330', 'v331', 'v332', 'v333', 'v34',
           'v42', 'v48',  'v51', 'v57', 'v59', 'v69',
           'v71', 'v74',  'v80', 'v90', 'v92', 'v95',
           'v96', 'v97']
# 0.90 Threshold
v_drops = ['v301', 'v323', 'v336', 'v21', 'v329', 'v316',
           'v299', 'v265', 'v51', 'v168', 'v182', 'v57',
           'v328', 'v279', 'v63', 'v292', 'v162', 'v139',
           'v126', 'v15', 'v196', 'v163', 'v179', 'v84',
           'v150', 'v233', 'v156', 'v318', 'v181', 'v249',
           'v327', 'v154', 'v213', 'v33', 'v254', 'v128',
           'v306', 'v17', 'v104', 'v222', 'v34', 'v255',
           'v330', 'v153', 'v151', 'v149', 'v127', 'v308',
           'v143', 'v74', 'v200', 'v193', 'v266', 'v277',
           'v48', 'v324', 'v320', 'v111', 'v134', 'v295',
           'v212', 'v339', 'v93', 'v217', 'v152', 'v302',
           'v271', 'v280', 'v29', 'v133', 'v321', 'v155',
           'v27', 'v219', 'v146', 'v71', 'v186', 'v237',
           'v248', 'v148', 'v102', 'v190', 'v178', 'v298',
           'v246', 'v94', 'v105', 'v278', 'v331', 'v177',
           'v144', 'v235', 'v236', 'v238', 'v145', 'v76',
           'v69', 'v250', 'v332', 'v304', 'v317', 'v137',
           'v225', 'v191', 'v294', 'v192', 'v204', 'v106',
           'v72', 'v202', 'v275', 'v96', 'v245', 'v160',
           'v59', 'v92', 'v10', 'v97', 'v243', 'v95', 'v232',
           'v335', 'v19', 'v4', 'v80', 'v136', 'v31', 'v231',
           'v338', 'v221', 'v272', 'v13', 'v42', 'v183',
           'v307', 'v32', 'v44', 'v333', 'v194', 'v36',
           'v218', 'v132', 'v216', 'v242', 'v22', 'v103',
           'v229', 'v90', 'v157', 'v64', 'v73', 'v39',
           'v54', 'v315', 'v203', 'v195', 'v167']

In [65]:
lgb_tr_cv_dv_95 = lgb.Dataset(X_tr.drop(columns=v_drops_95),
                           y_tr,
                           feature_name = X_tr.drop(columns=v_drops_95).columns.to_list(),
                           categorical_feature = cats)
lgb_val_cv_dv_95 = lgb.Dataset(X_val.drop(columns=v_drops_95),
                            y_val,
                            feature_name = X_val.drop(columns=v_drops_95).columns.to_list(),
                            categorical_feature = cats,
                            reference=lgb_tr_cv_dv_95)

In [66]:
lgb_base_cv_dv_95 = lgb.cv(params, lgb_tr_cv_dv_95, folds=cv_folds)



0.95 Correlation Threshold  
Removed 92 features and the CV AUC increased marginally, so I'll try a more aggressive threshold.

In [67]:
lgb_base_cv_dv_95['auc-mean'][-1]
# 0.9695813989085756

0.9695813989085756

In [44]:
lgb_tr_cv_dv = lgb.Dataset(X_tr.drop(columns=v_drops),
                           y_tr,
                           feature_name = X_tr.drop(columns=v_drops).columns.to_list(),
                           categorical_feature = cats)
lgb_val_cv_dv = lgb.Dataset(X_val.drop(columns=v_drops),
                            y_val,
                            feature_name = X_val.drop(columns=v_drops).columns.to_list(),
                            categorical_feature = cats,
                            reference=lgb_tr_cv_dv)

In [45]:
lgb_base_cv_dv = lgb.cv(params, lgb_tr_cv_dv, folds=cv_folds)



0.9 Correlation Threshold  
Removed 157 features and the CV AUC decreased marginally (0.96958 to 0.96953), so I'll accept that change, especially since it will speed up future modeling. If pure AUC performance were the goal, I'd probably choose the 0.95 threshold, but I want a faster model to aid further feature selection.

In [46]:
lgb_base_cv_dv['auc-mean'][-1]
# 0.9695310437093434

0.9695310437093434

#### Categorical vs. Numerical Feature Treatment

In [23]:
def ngv_label_encode(label_col_list, input_df):
    """Return a copy of input_df with each listed column integer-encoded by frequency rank."""
    label_dicts = {}
    df = input_df.copy()
    for cat in label_col_list:
        # unique values ordered from least to most frequent, so the rarest value maps to 0
        temp_keys = list(df[cat].value_counts().index)[::-1]
        temp_values = range(len(temp_keys))
        temp_dict = dict(zip(temp_keys, temp_values))
        # the mapping is recorded per column but is not currently returned
        label_dicts.update({cat: temp_dict})
        # performs the encoding
        df[cat] = df[cat].map(temp_dict)
        df[cat] = pd.to_numeric(df[cat], downcast='integer')
    return df
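
To illustrate the mapping this kind of encoding produces, here is a small standalone sketch on a single made-up column (`card_type` and its values are hypothetical). It mirrors the frequency-rank idea above rather than calling the function itself: the rarest value gets 0 and the most frequent value gets the largest integer.

```python
import pandas as pd

# toy column (hypothetical): 'a' appears 3 times, 'b' twice, 'c' once
toy = pd.DataFrame({'card_type': ['a', 'a', 'a', 'b', 'b', 'c']})

# value_counts sorts from most to least frequent; reverse it so the rarest comes first
counts = toy['card_type'].value_counts()
mapping = {key: rank for rank, key in enumerate(counts.index[::-1])}
encoded = toy['card_type'].map(mapping)

print(mapping)
print(encoded.tolist())
```

Values tied in frequency receive adjacent ranks in whatever order `value_counts` returns them, and missing values stay missing after `map`, which also prevents the integer downcast for that column.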

Results Before Encoding during Iterations  
none  
0.969105682640335  
  
labeled  
0.9695310437093434  
  
numerical  
0.9723631124811638  
  
frequency  
0.9689063094206409  

In [31]:
cat_iter_list = [[], label_cats, num_cats, freq_cats,
                 label_cats+num_cats, label_cats+freq_cats, num_cats+freq_cats, label_cats+num_cats+freq_cats]
cat_type = ['none', 'labeled', 'numerical', 'frequency',
            'labeled_numerical', 'labeled_frequency', 'numerical_frequency', 'all']
cat_perf = []
i = 0
# drop correlation eliminated features
X_train_cv = X_train.drop(columns=v_drops)
y_train_cv = y_train
for cat_grp in cat_iter_list:
    # encode appropriate columns before splitting
    X_cv_iter = ngv_label_encode(cat_grp, X_train_cv)
    # split encoded columns
    X_tr_cv, X_val_cv, y_tr_cv, y_val_cv = train_test_split(X_cv_iter, y_train_cv,
                                                            test_size = 0.2,
                                                            stratify = y_train,
                                                            random_state = SEED)
    # create LightGBM datasets
    lgb_tr_cv_cat = lgb.Dataset(X_tr_cv,
                                y_tr_cv,
                                feature_name = X_tr_cv.columns.to_list(),
                                categorical_feature = cat_grp)
    lgb_val_cv_cat = lgb.Dataset(X_val_cv,
                                 y_val_cv,
                                 feature_name = X_val_cv.columns.to_list(),
                                 categorical_feature = cat_grp,
                                 reference=lgb_tr_cv_cat)
    # cross validate
    lgb_base_cv_cat = lgb.cv(params, lgb_tr_cv_cat, folds=cv_folds)
    # print loop status and results
    print(cat_type[i])
    print(lgb_base_cv_cat['auc-mean'][-1])
    cat_perf.append(lgb_base_cv_cat['auc-mean'][-1])
    i += 1
cat_type_aucs = dict(zip(cat_type, cat_perf))
cat_type_aucs



none
0.969105682640335




labeled
0.9695497630904963




numerical
0.9723687716111333




frequency
0.9685367736288052




labeled_numerical
0.9729421274350688




labeled_frequency
0.9689209934970583




numerical_frequency
0.9727340056210968




all
0.9731643144394357


{'none': 0.969105682640335,
 'labeled': 0.9695497630904963,
 'numerical': 0.9723687716111333,
 'frequency': 0.9685367736288052,
 'labeled_numerical': 0.9729421274350688,
 'labeled_frequency': 0.9689209934970583,
 'numerical_frequency': 0.9727340056210968,
 'all': 0.9731643144394357}

Training with the correlation drops and all categorical encodings applied.  
Note: I may have made changes that break the reproducibility of this code and the resulting list, so the full result is pasted below until I can verify it.

In [48]:
all_cats = label_cats+num_cats+freq_cats
X_tr_nw = X_tr.drop(columns=v_drops_95)
X_val_nw = X_val.drop(columns=v_drops_95)
feats_nw = X_tr_nw.columns.to_list()

lgb_tr_nw = lgb.Dataset(X_tr_nw,
                        y_tr,
                        feature_name = feats_nw,
                        categorical_feature = all_cats)
lgb_val_nw = lgb.Dataset(X_val_nw,
                         y_val,
                         feature_name = feats_nw,
                         categorical_feature = all_cats,
                         reference=lgb_tr_nw)

In [49]:
lgb_nw_model = lgb.train(params_nw,
                         train_set = lgb_tr_nw,
                         valid_sets = lgb_val_nw,
                         verbose_eval=100)



Training until validation scores don't improve for 100 rounds
[100]	valid_0's auc: 0.950992
[200]	valid_0's auc: 0.965065
[300]	valid_0's auc: 0.970533
[400]	valid_0's auc: 0.973379
[500]	valid_0's auc: 0.974899
[600]	valid_0's auc: 0.975922
[700]	valid_0's auc: 0.976605
[800]	valid_0's auc: 0.977032
[900]	valid_0's auc: 0.977387
[1000]	valid_0's auc: 0.977584
[1100]	valid_0's auc: 0.977731
[1200]	valid_0's auc: 0.977806
[1300]	valid_0's auc: 0.977915
[1400]	valid_0's auc: 0.977977
[1500]	valid_0's auc: 0.977978
Early stopping, best iteration is:
[1449]	valid_0's auc: 0.978022


In [50]:
nw_model_feat_imp = dict(zip(lgb_nw_model.feature_name(),lgb_nw_model.feature_importance()))
nw_model_feat_imp = pd.DataFrame.from_dict(nw_model_feat_imp, orient='index', columns=['feature_importance'])
nw_model_feat_imp = nw_model_feat_imp.sort_values(by='feature_importance', ascending=False)
nw_model_feat_imp

feature,feature_importance
transactionamt,39797
card1,32760
dist1,14486
addr1,13487
id_02,11221
d15,9724
card2,8557
d4,7453
c13,7048
d8,6200


In [None]:
	feature_importance
transactionamt	39797
card1	32760
dist1	14486
addr1	13487
id_02	11221
d15	9724
card2	8557
d4	7453
c13	7048
d8	6200
d10	6024
dist2	5755
d2	5648
d11	5468
m4	5051
id_20	4822
c1	4765
d1	4733
m5	4715
m6	4705
id_06	4411
d5	4232
id_19	4161
id_05	4093
d3	3902
id_01	3711
c2	3688
d9	3493
d14	3254
c14	3221
p_emaildomain	3061
c6	3024
c11	2943
m8	2889
m7	2454
m9	2395
c9	2373
v313	2193
m3	1934
v310	1715
c5	1592
v87	1522
d6	1435
m2	1378
v314	1362
v315	1329
c8	1278
v78	1255
v49	1246
v75	1245
v45	1240
v130	1235
d12	1232
d13	1221
v82	1218
id_38	1208
v83	1193
c10	1192
c12	1143
id_33_prod	1134
v61	1116
v11	1070
v76	1034
DeviceType	1034
v62	1027
v283	1020
v53	1011
v44	996
v38	994
v312	991
v86	969
v35	961
v54	921
v12	902
v36	858
d7	842
v131	828
id_13	818
v264	815
v5	813
DeviceInfo	772
v20	759
v311	757
v4	749
v91	745
v282	745
v55	742
v77	742
v13	723
v19	718
v309	700
v67	673
v136	671
r_emaildomain	665
v70	663
v37	662
id_09	656
v222	654
v56	651
v52	644
v291	642
v129	642
v281	639
v30	633
v320	613
v24	604
transactionamt_long_dec	601
v261	597
v23	592
v66	581
id_11	568
v267	561
c4	557
id_15	546
v165	539
v274	537
card5	528
v277	451
v285	451
id_03	450
v270	446
v265	445
v263	421
id_37	420
v234	419
v73	414
v25	414
v124	412
v94	399
v295	396
v135	394
v258	391
v81	390
v39	383
v3	370
v245	362
v139	361
v208	354
v296	352
v187	351
c7	351
v47	349
v166	342
v156	339
v7	336
v99	336
card3	328
v79	326
v160	326
v40	324
v260	322
v211	317
v26	316
v229	315
v43	315
v268	310
v293	308
id_16	306
v220	304
v164	303
v6	301
v207	297
v137	292
v215	286
v60	286
id_31_bin_few	285
id_32	275
v259	274
v257	272
v243	271
v273	265
id_34	264
v209	264
v319	263
v152	263
v58	263
v276	258
v321	254
v290	241
v262	241
v159	236
v278	233
v63	231
id_36	230
v300	230
productcd	229
v64	225
card6	220
v323	219
id_07	215
v275	213
id_08	213
v171	213
v2	212
v170	211
v85	208
v123	207
v289	206
v214	198
v200	194
v210	193
v206	188
v16	184
v46	184
v216	183
v201	182
v169	177
v189	177
v224	172
v205	170
v101	164
id_04	164
v103	160
v253	152
v251	151
v140	151
v301	150
v125	150
v288	148
v303	144
c3	144
v256	140
v9	140
v105	139
v188	138
v115	138
v244	136
v8	133
v228	131
v338	127
v72	124
v180	123
v18	122
v84	121
id_28	116
v286	116
v246	115
v223	110
v50	108
v199	108
v238	105
m1	99
v176	99
v230	96
v227	95
id_33_ratio	93
v109	90
v287	87
v32	85
v93	85
v298	85
v161	83
v239	83
v335	80
v147	80
v182	79
v158	73
v304	72
id_12	71
v184	70
v22	67
v247	64
v100	63
v186	62
id_21	61
v226	58
v154	58
v198	57
DeviceInfo_trunc	57
v235	56
v197	56
v337	55
v146	53
v339	53
v236	52
v322	51
id_30_name	51
v98	49
v116	48
v114	48
id_29	47
id_10	47
v297	47
v284	45
v104	41
id_33	39
v172	38
v163	36
id_23	36
v110	36
id_18	36
v326	35
v181	32
v336	31
v185	30
v108	29
v174	25
id_25	25
v183	25
v195	22
v328	20
v252	18
id_35	18
v175	18
v194	16
v121	16
v173	16
v334	15
v196	14
v112	12
v141	11
DeviceInfo_alpha_trunc	11
v111	10
id_17	10
v138	10
v327	10
v240	8
id_31	8
id_26	7
id_30	7
id_27	6
v113	6
v1	5
v269	5
v122	5
v14	4
v41	3
v120	3
v142	2
v65	1
card4	1
v117	1
v88	1
v118	1
addr2	0
id_31_bin_many	0
v241	0
v89	0
v28	0
v119	0
v325	0
id_24	0
v107	0
v68	0
v305	0
id_14	0
id_22	0

In [59]:
params_quick = {
             'objective': 'binary',
             'boosting': 'gbdt',
             'n_jobs': -1,            
             'learning_rate': 0.02,
             'verbose': 0,
             'num_leaves': 2**8,
             'bagging_fraction': 0.95,
             'bagging_freq': 1,
             'bagging_seed': 1,
             'feature_fraction': 0.9,
             'feature_fraction_seed': 1,
             'max_bin': 256,
             'num_iterations': 100,
             'early_stopping_rounds': 100,
             'metric': 'auc',
             'seed': SEED
    }

In [81]:
imp_grp_list = [0, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 75, 100]
imp_perf = []
i = 0
# start from the full training set; correlation and importance drops are applied inside the loop
X_train_cv = X_train
y_train_cv = y_train

for imp_grp in imp_grp_list:
    # split encoded columns
    X_tr_cv_imp, X_val_cv_imp, y_tr_cv_imp, y_val_cv_imp = train_test_split(X_train_cv, y_train_cv,
                                                                            test_size = 0.2,
                                                                            stratify = y_train,
                                                                            random_state = SEED)
    # dropping least important features
    unimportant_drops = nw_model_feat_imp.loc[nw_model_feat_imp['feature_importance'] < imp_grp].index.to_list()
    iter_drop_cols = v_drops_95+unimportant_drops
    X_tr_cv_imp = X_tr_cv_imp.drop(columns=iter_drop_cols)
    X_val_cv_imp = X_val_cv_imp.drop(columns=iter_drop_cols)
    # ensuring dropped categories are removed from list of category features
    imp_cat_grp = list(set(all_cats).difference(iter_drop_cols))
    # create LightGBM datasets
    lgb_tr_cv_imp = lgb.Dataset(X_tr_cv_imp,
                                y_tr_cv_imp,
                                feature_name = X_tr_cv_imp.columns.to_list(),
                                categorical_feature = imp_cat_grp)
    lgb_val_cv_imp = lgb.Dataset(X_val_cv_imp,
                                 y_val_cv_imp,
                                 feature_name = X_val_cv_imp.columns.to_list(),
                                 categorical_feature = imp_cat_grp,
                                 reference=lgb_tr_cv_imp)
    # cross validate
    lgb_cv_imp = lgb.cv(params_quick, lgb_tr_cv_imp, folds=cv_folds)
    # print loop status and results
    print(imp_grp)
    print(lgb_cv_imp['auc-mean'][-1])
    imp_perf.append(lgb_cv_imp['auc-mean'][-1])
    i += 1
imp_type_aucs = dict(zip(imp_grp_list, imp_perf))
imp_type_aucs



0
0.9476311854548924




10
0.9477062431258446




15
0.9474218249836805




20
0.948183220725498




25
0.9481474596159345




30
0.9478517866288202




35
0.9476198042330873




40
0.9482226657201203




45
0.9474530523797201




50
0.9478227381001927




60
0.948110278456241




75
0.947913982087601




100
0.9477800493719973


{0: 0.9476311854548924,
 10: 0.9477062431258446,
 15: 0.9474218249836805,
 20: 0.948183220725498,
 25: 0.9481474596159345,
 30: 0.9478517866288202,
 35: 0.9476198042330873,
 40: 0.9482226657201203,
 45: 0.9474530523797201,
 50: 0.9478227381001927,
 60: 0.948110278456241,
 75: 0.947913982087601,
 100: 0.9477800493719973}
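
The drop step inside the loop above can be sketched in isolation. The four importance values below are copied from the pasted list earlier in this section. The categorical list `cats_toy` is a made-up subset for illustration. The list comprehension is a design choice for this sketch: unlike `set.difference`, it keeps the remaining categoricals in their original order.

```python
import pandas as pd

# a few rows from the importance table above
feat_imp = pd.DataFrame(
    {'feature_importance': [39797, 25, 18, 0]},
    index=['transactionamt', 'id_25', 'id_35', 'addr2'])

threshold = 20
# features whose importance falls strictly below the threshold
unimportant = feat_imp.loc[feat_imp['feature_importance'] < threshold].index.to_list()

# hypothetical categorical list; remove any categorical that was just dropped
cats_toy = ['id_35', 'id_25']
remaining_cats = [c for c in cats_toy if c not in unimportant]

print(unimportant, remaining_cats)
```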

### Hyperparameter Tuning
Once you have a model that is performing decently well, you'll want to adjust the hyperparameters to improve performance.
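
One possible starting point is a small randomized search scored on AUC with stratified folds. The sketch below is an assumption-laden illustration, not the procedure used in this notebook: it swaps in scikit-learn's `GradientBoostingClassifier` as a stand-in for LightGBM so that it runs on its own, and it uses synthetic data from `make_classification`. The parameter grid is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

SEED = 123

# synthetic, imbalanced toy data (hypothetical stand-in for the real training set)
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   weights=[0.8, 0.2], random_state=SEED)

# hypothetical search space; real ranges depend on the model and data
param_distributions = {
    'learning_rate': [0.02, 0.05, 0.1],
    'max_depth': [2, 3, 4],
    'subsample': [0.8, 0.95, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=SEED),
    param_distributions=param_distributions,
    n_iter=4,                      # number of sampled parameter combinations
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=SEED),
    random_state=SEED,
)
search.fit(X_toy, y_toy)
print(search.best_params_, search.best_score_)
```

Keeping the folds and seed fixed across candidates makes the scores comparable, which matters when the differences between settings are small.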

### Additional Tuning, Processing, or Model-Improvement
What else can you do to improve your model from the baseline?

## Outcome
At the end of this notebook, you should have a model that is performing better than the baseline model. You should be able to explain what steps you took to train this model and why each one was chosen.

In [83]:
params_pfn = {
             'objective': 'binary',
             'boosting': 'gbdt',
             'n_jobs': -1,            
             'learning_rate': 0.02,
             'verbose': 0,
             'num_leaves': 2**8,
             'bagging_fraction': 0.95,
             'bagging_freq': 1,
             'bagging_seed': 1,
             'feature_fraction': 0.9,
             'feature_fraction_seed': 1,
             'max_bin': 256,
             'num_iterations': 10000,
             'early_stopping_rounds': 100,
             'metric': 'auc',
             'seed': SEED
    }

In [85]:
X_train_pfn = data_df.drop(['isfraud'], axis=1)
y_train_pfn = data_df['isfraud'].values

all_cats = label_cats+num_cats+freq_cats
unimportant_drops = nw_model_feat_imp.loc[nw_model_feat_imp['feature_importance'] < 20].index.to_list()
drop_cols_pfn = v_drops_95+unimportant_drops
# ensuring dropped categories are removed from list of category features
cats_pfn = list(set(all_cats).difference(drop_cols_pfn))

X_train_pfn = X_train_pfn.drop(columns=drop_cols_pfn)
X_train_pfn = ngv_label_encode(cats_pfn, X_train_pfn)
feats_pfn = X_train_pfn.columns.to_list()

X_tr_pfn, X_val_pfn, y_tr_pfn, y_val_pfn = train_test_split(X_train_pfn, y_train_pfn,
                                                            test_size = 0.2,
                                                            stratify = y_train_pfn,
                                                            random_state = SEED)

In [86]:
lgb_tr_pfn = lgb.Dataset(X_tr_pfn,
                         y_tr_pfn,
                         feature_name = feats_pfn,
                         categorical_feature = cats_pfn)
lgb_val_pfn = lgb.Dataset(X_val_pfn,
                          y_val_pfn,
                          feature_name = feats_pfn,
                          categorical_feature = cats_pfn,
                          reference=lgb_tr_pfn)

In [87]:
lgb_pfn_model = lgb.train(params_pfn,
                          train_set = lgb_tr_pfn,
                          valid_sets = lgb_val_pfn,
                          verbose_eval=100)



Training until validation scores don't improve for 100 rounds
[100]	valid_0's auc: 0.952405
[200]	valid_0's auc: 0.965524
[300]	valid_0's auc: 0.971226
[400]	valid_0's auc: 0.973875
[500]	valid_0's auc: 0.975499
[600]	valid_0's auc: 0.976431
[700]	valid_0's auc: 0.977076
[800]	valid_0's auc: 0.977524
[900]	valid_0's auc: 0.977801
[1000]	valid_0's auc: 0.977938
[1100]	valid_0's auc: 0.978098
[1200]	valid_0's auc: 0.978242
[1300]	valid_0's auc: 0.978297
[1400]	valid_0's auc: 0.978331
[1500]	valid_0's auc: 0.978384
[1600]	valid_0's auc: 0.978372
Early stopping, best iteration is:
[1525]	valid_0's auc: 0.978415


In [88]:
pfn_model_feat_imp = dict(zip(lgb_pfn_model.feature_name(),lgb_pfn_model.feature_importance()))
pfn_model_feat_imp = pd.DataFrame.from_dict(pfn_model_feat_imp, orient='index', columns=['feature_importance'])
pfn_model_feat_imp = pfn_model_feat_imp.sort_values(by='feature_importance', ascending=False)
pfn_model_feat_imp

feature,feature_importance
transactionamt,42758
card1,32945
dist1,16289
addr1,13543
d15,10716
id_02,10348
card2,8926
d4,7921
c13,7920
d10,6722


In [92]:
print(nw_model_feat_imp.loc[nw_model_feat_imp['feature_importance'] < 20].index.to_list())

['v252', 'id_35', 'v175', 'v194', 'v121', 'v173', 'v334', 'v196', 'v112', 'v141', 'DeviceInfo_alpha_trunc', 'v111', 'id_17', 'v138', 'v327', 'v240', 'id_31', 'id_26', 'id_30', 'id_27', 'v113', 'v1', 'v269', 'v122', 'v14', 'v41', 'v120', 'v142', 'v65', 'card4', 'v117', 'v88', 'v118', 'addr2', 'id_31_bin_many', 'v241', 'v89', 'v28', 'v119', 'v325', 'id_24', 'v107', 'v68', 'v305', 'id_14', 'id_22']


I'll drop the features whose importance was below 20 in the original model, plus the 4 additional features whose importance falls below 20 in the final model.

In [93]:
unimportant_feats = ['v252', 'id_35', 'v175', 'v194', 'v121',
                     'v173', 'v334', 'v196', 'v112', 'v141',
                     'DeviceInfo_alpha_trunc', 'v111', 'id_17',
                     'v138', 'v327', 'v240', 'id_31', 'id_26',
                     'id_30', 'id_27', 'v113', 'v1', 'v269',
                     'v122', 'v14', 'v41', 'v120', 'v142', 'v65',
                     'card4', 'v117', 'v88', 'v118', 'addr2',
                     'id_31_bin_many', 'v241', 'v89', 'v28',
                     'v119', 'v325', 'id_24', 'v107', 'v68',
                     'v305', 'id_14', 'id_22', 'v328', 'id_25',
                     'id_30_name', 'id_33_ratio']