# Basic Tabular Mining Tips

---
In this short tutorial, we demontrate how to perform a simple data mining task on the "Kicked!" dataset.

We will be covering:



1.   Data Preprocessing using Pandas.
2.   Some basic ideas in feature engineering.
3.   Basic parameter tuning for tree models. 



In [157]:
!pip install lightgbm xgboost catboost category-encoders sklearn pandas==1.1.5



In [158]:
!git clone https://gitee.com/qingge_dada/dataset.git

fatal: destination path 'dataset' already exists and is not an empty directory.


In [159]:
import pandas as pd
import numpy as np

'''
x_train = pd.read_csv('./cardashians/xtrain.csv', engine='python')
x_test = pd.read_csv('./cardashians/xtest.csv', engine='python')

y_train = pd.read_csv('./cardashians/ytrain.csv', engine='python')
y_test = pd.read_csv('./cardashians/ytest.csv', engine='python')
'''
train = pd.read_csv('./dataset/train_final.csv', engine='python')
test = pd.read_csv('./dataset/test_final.csv', engine='python')

In [161]:
TRAIN_IDX= train.shape[0]
TEST_IDX = TRAIN_IDX + test.shape[0]

In [162]:
# x = pd.concat([x_train, x_test], axis=0)
# y = pd.concat([y_train, y_test], axis=0)
# axis=0 沿着每行的竖直方向
# axis=1 沿着每列的水平方向
# 合并
data = pd.concat([train, test], axis=0)

In [163]:
data.columns.to_list()

['continuous_annual_inc',
 'continuous_annual_inc_joint',
 'continuous_delinq_2yrs',
 'continuous_dti',
 'continuous_dti_joint',
 'continuous_fico_range_high',
 'continuous_fico_range_low',
 'continuous_funded_amnt',
 'continuous_funded_amnt_inv',
 'continuous_inq_last_6mths',
 'continuous_installment',
 'continuous_int_rate',
 'continuous_last_fico_range_high',
 'continuous_last_fico_range_low',
 'continuous_loan_amnt',
 'loan_status',
 'continuous_mths_since_last_delinq',
 'continuous_mths_since_last_major_derog',
 'continuous_mths_since_last_record',
 'continuous_open_acc',
 'continuous_pub_rec',
 'discrete_addr_state_1_one_hot',
 'discrete_addr_state_2_one_hot',
 'discrete_addr_state_3_one_hot',
 'discrete_addr_state_4_one_hot',
 'discrete_addr_state_5_one_hot',
 'discrete_addr_state_6_one_hot',
 'discrete_addr_state_7_one_hot',
 'discrete_addr_state_8_one_hot',
 'discrete_addr_state_9_one_hot',
 'discrete_addr_state_10_one_hot',
 'discrete_addr_state_11_one_hot',
 'discrete_addr_s

选取'continuous_inq_last_6mths'这个衍生变量

## Basic Data Manipulation

Let us see how we can do some basic data preprocessing

In [164]:
data['continuous_inq_last_6mths'].unique()

array([1, 4, 0, 3, 2, 5])

In [165]:
data['continuous_inq_last_6mths'].value_counts()

0    59654
1    26696
2     9209
3     3218
4      933
5      290
Name: continuous_inq_last_6mths, dtype: int64

In [166]:
train = data.iloc[:TRAIN_IDX, :]
test = data.iloc[TRAIN_IDX:TEST_IDX, :]

In [167]:
import lightgbm as lgb
train_dataset = lgb.Dataset(train.drop(columns='continuous_inq_last_6mths'), train['continuous_inq_last_6mths'])
test_dataset = lgb.Dataset(test.drop(columns='continuous_inq_last_6mths'), test['continuous_inq_last_6mths'])

In [168]:
param = {'num_leaves': 31, 'objective': 'binary', 'metric':'binary_error'}
# binary_error:误差率
num_round = 4000

In [169]:
model = lgb.train(param, train_dataset, num_boost_round=num_round, valid_sets=[train_dataset, test_dataset])

[1]	training's binary_error: 0.41344	valid_1's binary_error: 0.39348
[2]	training's binary_error: 0.41344	valid_1's binary_error: 0.39348
[3]	training's binary_error: 0.41	valid_1's binary_error: 0.39074
[4]	training's binary_error: 0.39362	valid_1's binary_error: 0.37906
[5]	training's binary_error: 0.38106	valid_1's binary_error: 0.36738
[6]	training's binary_error: 0.37056	valid_1's binary_error: 0.3612
[7]	training's binary_error: 0.363	valid_1's binary_error: 0.3561
[8]	training's binary_error: 0.3588	valid_1's binary_error: 0.35232
[9]	training's binary_error: 0.35488	valid_1's binary_error: 0.35072
[10]	training's binary_error: 0.35082	valid_1's binary_error: 0.34896
[11]	training's binary_error: 0.34738	valid_1's binary_error: 0.34706
[12]	training's binary_error: 0.34476	valid_1's binary_error: 0.34616
[13]	training's binary_error: 0.34272	valid_1's binary_error: 0.3458
[14]	training's binary_error: 0.34148	valid_1's binary_error: 0.3454
[15]	training's binary_error: 0.34096	v

## A Wrapper

In [170]:
import io
import multiprocessing
from contextlib import redirect_stdout
from copy import deepcopy
from dataclasses import dataclass, asdict
import hyperopt.pyll
from hyperopt import fmin, tpe, hp
import numpy as np
import lightgbm as lgb
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
import torch

import copy
cpu_count = 4
use_gpu = False
@dataclass
class LGBOpt:
    num_threads: any = hp.choice('num_threads', [cpu_count])
    num_leaves: any = hp.choice('num_leaves', [64])
    metric: any = hp.choice('metric', ['binary_error'])
    num_round: any = hp.choice('num_rounds', [1000])
    objective: any = hp.choice('objective', ['binary'])
    learning_rate: any = hp.uniform('learning_rate', 0.01, 0.1)
    feature_fraction: any = hp.uniform('feature_fraction', 0.5, 1.0)
    bagging_fraction: any = hp.uniform('bagging_fraction', 0.8, 1.0)
    device_type: any = hp.choice('device_tpye', ['gpu']) if use_gpu else hp.choice('device_type',
                                                                                   ['cpu'])
    boosting: any = hp.choice('boosting', ['gbdt']) # gbdt, dart, goss仅需选择一个即可
    extra_trees: any = hp.choice('extra_tress', [False, True])
    drop_rate: any = hp.uniform('drop_rate', 0, 0.2)
    uniform_drop: any = hp.choice('uniform_drop', [True, False])
    lambda_l1: any = hp.uniform('lambda_l1', 0, 10)  # TODO: Check range
    lambda_l2: any = hp.uniform('lambda_l2', 0, 10)  # TODO: Check range
    min_gain_to_split: any = hp.uniform('min_gain_to_split', 0, 1)  # TODO: Check range
    min_data_in_bin = hp.choice('min_data_in_bin', [3, 5, 10, 15, 20, 50])

    @staticmethod
    def get_common_params():
        return {'num_thread': 4, 'num_leaves': 12, 'metric': 'binary', 'objective': 'binary',
                'num_round': 1000, 'learning_rate': 0.02, 'feature_fraction': 0.8, 'bagging_fraction': 0.8}
    

In [171]:
class FitterBase(object):
    def __init__(self, label, metric, max_eval=100, opt=None):
        self.label = label
        self.metric = metric
        self.opt_params = dict()
        self.max_eval = max_eval
        self.opt = opt

    def get_loss(self, y, y_pred):
        if self.metric == 'error':
            return 1 - accuracy_score(y, y_pred)
        elif self.metric == 'precision':
            return 1 - precision_score(y, y_pred)
        elif self.metric == 'recall':
            return 1 - recall_score(y, y_pred)
        elif self.metric == 'macro_f1':
            return 1 - f1_score(y, y_pred, average='macro')# macro f1需要先计算出每一个类别的准召及其f1 score，然后通过求均值得到在整个样本上的f1 score
        elif self.metric == 'micro_f1':
            return 1 - f1_score(y, y_pred, average='micro') # micro f1不需要区分类别，直接使用总体样本的准召计算f1 score
        elif self.metric == 'auc':  # TODO: Add a warning checking if y_predict is all [0, 1], it should be probability
            return 1 - roc_auc_score(loan_status, y_pred)
        else:
            raise Exception("Not implemented yet.")


In [172]:
class LGBFitter(FitterBase):
    def __init__(self, label='label', metric='error', opt: LGBOpt = None, max_eval=100):
        super(LGBFitter, self).__init__(label, metric, max_eval)
        if opt is not None:
            self.opt = opt
        else:
            self.opt = LGBOpt()
        self.best_round = None
        self.clf = None

    def train(self, train_df, eval_df, params=None, use_best_eval=True):
        self.best_round = None
        dtrain = lgb.Dataset(train_df.drop(columns=[self.label]), train_df[self.label])
        deval = lgb.Dataset(eval_df.drop(columns=[self.label]), eval_df[self.label])
        evallist = [dtrain, deval]
        if params is None:
            use_params = deepcopy(self.opt_params)
        else:
            use_params = deepcopy(params)

        num_round = use_params.pop('num_round')
        if use_best_eval:
            with io.StringIO() as buf, redirect_stdout(buf):
                self.clf = lgb.train(use_params, dtrain, num_round, valid_sets=evallist)
                output = buf.getvalue().split("\n")
            min_error = np.inf
            min_index = 0
            for idx in range(len(output) - 1):
                if len(output[idx].split("\t")) == 3:
                    temp = float(output[idx].split("\t")[2].split(":")[1])
                    if min_error > temp:
                        min_error = temp
                        min_index = int(output[idx].split("\t")[0][1:-1])
            print("The minimum is attained in round %d" % (min_index + 1))
            self.best_round = min_index + 1
            return output
        else:
            with io.StringIO() as buf, redirect_stdout(buf):
                self.clf = lgb.train(use_params, dtrain, num_round, valid_sets=evallist)
                output = buf.getvalue().split("\n")
            self.best_round = num_round
            return output

    def search(self, train_df, eval_df, use_best_eval=True):
        self.opt_params = dict()

        def train_impl(params):
            self.train(train_df, eval_df, params, use_best_eval)
            if self.metric == 'auc':
                y_pred = self.clf.predict(eval_df.drop(columns=[self.label]), num_iteration=self.best_round)
            else:
                y_pred = (self.clf.predict(eval_df.drop(columns=[self.label]),
                                           num_iteration=self.best_round) > 0.5).astype(int)
            return self.get_loss(eval_df[self.label], y_pred)

        self.opt_params = fmin(train_impl, asdict(self.opt), algo=tpe.suggest, max_evals=self.max_eval)

    def search_k_fold(self, k_fold, data, use_best_eval=True):
        self.opt_params = dict()

        def train_impl_nfold(params):
            loss = list()
            for train_id, eval_id in k_fold.split(data):
                train_df = data.loc[train_id]
                eval_df = data.loc[eval_id]
                self.train(train_df, eval_df, params, use_best_eval)
                if self.metric == 'auc':
                    y_pred = self.clf.predict(eval_df.drop(columns=[self.label]), num_iteration=self.best_round)
                else:
                    y_pred = (self.clf.predict(eval_df.drop(columns=[self.label]),
                                               num_iteration=self.best_round) > 0.5).astype(int)
                loss.append(self.get_loss(eval_df[self.label], y_pred))
            return np.mean(loss)

        self.opt_params = fmin(train_impl_nfold, asdict(self.opt), algo=tpe.suggest, max_evals=self.max_eval)

    def train_k_fold(self, k_fold, train_data, test_data, params=None, drop_test_y=True, use_best_eval=True):
        acc_result = list()
        train_pred = np.empty(train_data.shape[0])
        test_pred = np.empty(test_data.shape[0])
        if drop_test_y:
            dtest = test_data.drop(columns=self.label)
        else:
            dtest = test_data

        models = list()
        for train_id, eval_id in k_fold.split(train_data):
            train_df = train_data.loc[train_id]
            eval_df = train_data.loc[eval_id]
            self.train(train_df, eval_df, params, use_best_eval)
            models.append(copy.deepcopy(self.clf))
            train_pred[eval_id] = self.clf.predict(eval_df.drop(columns=self.label), num_iteration=self.best_round)
            if self.metric == 'auc':
                y_pred = self.clf.predict(eval_df.drop(columns=[self.label]), num_iteration=self.best_round)
            else:
                y_pred = (self.clf.predict(eval_df.drop(columns=[self.label]),
                                           num_iteration=self.best_round) > 0.5).astype(int)
            acc_result.append(self.get_loss(eval_df[self.label], y_pred))
            test_pred += self.clf.predict(dtest, num_iteration=self.best_round)
        test_pred /= k_fold.n_splits
        return train_pred, test_pred, acc_result, models
        

In [173]:
fitter = LGBFitter(label='continuous_inq_last_6mths')

In [174]:
params = {'num_thread': 4, 'num_leaves': 11, 'metric': 'binary', 'objective': 'binary',
                'num_round': 1000, 'learning_rate': 0.01, 'feature_fraction': 0.9, 'bagging_fraction': 0.8}

In [175]:
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5)

In [176]:
fitter.train_k_fold(kfold, train, test, params = params)
# 输出五个为验证集上的准确性，需要取平均值

The minimum is attained in round 1001
The minimum is attained in round 1001
The minimum is attained in round 995
The minimum is attained in round 999
The minimum is attained in round 999


(array([0.63530431, 0.58613266, 0.35265928, ..., 0.59968545, 0.47987152,
        0.12780825]),
 array([ 0.20013908,  0.3210313 , -0.38068642, ...,  0.06407503,
         0.1767113 ,  0.37202055]),
 [0.43400000000000005, 0.4293, 0.42390000000000005, 0.4294, 0.4232],
 [<lightgbm.basic.Booster at 0x7f64830338d0>,
  <lightgbm.basic.Booster at 0x7f6480559690>,
  <lightgbm.basic.Booster at 0x7f6483125950>,
  <lightgbm.basic.Booster at 0x7f6483235950>,
  <lightgbm.basic.Booster at 0x7f6480534310>])

Pipeline原本的模型效果

In [177]:
fitter1 = LGBFitter(label='loan_status')

In [178]:
fitter1.train_k_fold(kfold, train, test, params = params)

The minimum is attained in round 957
The minimum is attained in round 992
The minimum is attained in round 787
The minimum is attained in round 986
The minimum is attained in round 662


(array([0.55960484, 0.98604018, 0.98314924, ..., 0.98740098, 0.99764042,
        0.99497735]),
 array([0.93337419, 0.99709494, 0.97935958, ..., 0.97376311, 0.99435467,
        0.99420776]),
 [0.0716,
  0.08079999999999998,
  0.08360000000000001,
  0.08440000000000003,
  0.0796],
 [<lightgbm.basic.Booster at 0x7f64832fdc10>,
  <lightgbm.basic.Booster at 0x7f6480559e50>,
  <lightgbm.basic.Booster at 0x7f6480678150>,
  <lightgbm.basic.Booster at 0x7f6483044c10>,
  <lightgbm.basic.Booster at 0x7f6483044e50>])

# 上述结果为LGBFitter(FitterBase)类中metric='error'的结果，通过观察，acc_result在衍生变量加入后比原模型分数更高。

# 而metric还有其他几个指标macro_f1与macro_f1，其实验结果都截图放置于文档中。