# 美国运通-信用卡逾期风险预测

#### 写在前言

该代码文件记录了我在参加这一次kaggle竞赛的代码历程，当然最终效果也只进入20%，无法和kaggle金牌大佬比较，但整体代码也可以给初次参加kaggle竞赛或是想从凡人角度了解该比赛的小伙伴们一些参考。觉得不对的欢迎大家指正批评，我也会一直保持学习的态度去努力。金牌大佬代码链接我会附在说明文档中，供学有余力的同学浏xin览shang。

### 下载可用包

由于我租用了外部服务器（本地服务器完全跑不动），这几个包是重要的建模工具，需要自行手动下载。

In [1]:
#pip install xgboost
#pip install lightgbm
#pip install catboost

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting xgboost
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/e4/ed/8e2a7ae4e856f4887afc0beee897088ed8dbbc1b19b0f49971019939452a/xgboost-1.6.1-py3-none-manylinux2014_x86_64.whl (192.9 MB)
[K     |████████████████████████████████| 192.9 MB 74.0 MB/s eta 0:00:01   |▏                               | 757 kB 27.9 MB/s eta 0:00:07     |████████▋                       | 51.8 MB 27.9 MB/s eta 0:00:06     |█████████                       | 54.5 MB 27.9 MB/s eta 0:00:05     |██████████▍                     | 62.6 MB 52.9 MB/s eta 0:00:03
Installing collected packages: xgboost
Successfully installed xgboost-1.6.1
Note: you may need to restart the kernel to use updated packages.


### 1. 调包

In [1]:
import gc
import warnings
warnings.filterwarnings('ignore')
import scipy as sp
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
from tqdm.auto import tqdm
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
from itertools import combinations
import joblib
import catboost as cat
from catboost import CatBoostClassifier
import xgboost as xgb
from lightgbm import LGBMClassifier

## 2. read data

In [8]:
%%time
train = pd.read_parquet('/home/featurize/data/train.parquet')

CPU times: user 7.5 s, sys: 6.91 s, total: 14.4 s
Wall time: 7.17 s


In [9]:
train.shape

(5531451, 190)

In [9]:
train_labels = pd.read_csv("/home/featurize/work/train_labels.csv")
sub = pd.read_csv('/home/featurize/work/sample_submission.csv')

In [10]:
features = train.drop(['customer_ID', 'S_2'], axis = 1).columns.to_list()


In [11]:
cat_features = ["B_30","B_38", "D_114", "D_116", "D_117","D_120", "D_126", "D_63", "D_64", "D_66", "D_68"]
num_features = [col for col in features if col not in cat_features]

In [4]:
%%time
test = pd.read_parquet('/home/featurize/data/test.parquet')

CPU times: user 14.9 s, sys: 14.3 s, total: 29.2 s
Wall time: 11.5 s


#### 2.1. 测试用例

租服务器太贵，每次用全量样本运行模型耗时5h+，同时也会面临一些代码层面的错误而浪费一时间。时间就是金钱！！所以我会先写个demo，保证代码层面没错，再来运行全部代码。

In [None]:
## DEMO 
train = train.loc[0:1000,]
test = test.loc[0:1000,]

## 3. 特征工程及衍生变量

### 3.1. 处理时间列

在处理的时候写成function，同时处理 train 和test，这样在后续建模中也较为方便。

In [12]:
def proprocess_time(df):
    df['S_2'] = pd.to_datetime(df['S_2'])
    df['S_2_dayofweek'] = df['S_2'].dt.weekday
    df['S_2_dayofmonth'] = df['S_2'].dt.day
    return df

In [13]:
%%time
train = proprocess_time(train)
train.shape

CPU times: user 1.45 s, sys: 160 ms, total: 1.61 s
Wall time: 1.61 s


(5531451, 192)

In [14]:
%%time
test = proprocess_time(test)
test.shape

CPU times: user 2.95 s, sys: 337 ms, total: 3.29 s
Wall time: 3.28 s


(11363762, 192)

###  3.2.处理 numerical features

In [15]:
def preprocess_numeric(df):
    train_num_agg = df.groupby("customer_ID")[num_features].agg(['mean', 'std', 'min', 'max', 'last'])
    train_num_agg.columns = ['_'.join(x) for x in train_num_agg.columns]
    train_num_agg.reset_index(inplace = True)
    return train_num_agg

In [19]:
%%time
train_num_agg = preprocess_numeric(train)

CPU times: user 39.2 s, sys: 15.5 s, total: 54.8 s
Wall time: 55.7 s


In [16]:
%%time
test_num_agg = preprocess_numeric(test)

CPU times: user 1min 20s, sys: 35.5 s, total: 1min 56s
Wall time: 1min 56s


### 3.3.处理 categorical features

In [17]:
def preprocess_cat(df):
    train_cat_agg = df.groupby("customer_ID")[cat_features].agg(['count', 'last', 'nunique'])
    train_cat_agg.columns = ['_'.join(x) for x in train_cat_agg.columns]
    train_cat_agg.reset_index(inplace = True) 
    return train_cat_agg

In [24]:
train_cat_agg = preprocess_cat(train)

In [18]:
cat_features = ["B_30","B_38", "D_114", "D_116", "D_117","D_120", "D_126", "D_63", "D_64", "D_66", "D_68"]
test_cat_agg = preprocess_cat(test)

### 3.4. 记录衍生后的特征总数

In [28]:
train_num_agg.shape,train_cat_agg.shape

((458913, 886), (458913, 34))

### 3.5. 数据合并

由于我们是分批处理numerical 和 categorical，因此，在入模之前，我们还需对所有的的变量合并，确保数据的完整度。

In [19]:
def merge(train_num_agg,train_cat_agg):
    train = train_num_agg.merge(train_cat_agg, how = 'inner', on = 'customer_ID').merge(train_labels, how = 'inner', on = 'customer_ID')
    return train

In [31]:
%%time
train = merge(train_num_agg,train_cat_agg)

CPU times: user 1min 8s, sys: 7.08 s, total: 1min 16s
Wall time: 1min 15s


In [20]:
%%time
test = pd.merge(test_num_agg,test_cat_agg, how = 'inner',on = 'customer_ID')

CPU times: user 2min 9s, sys: 2min 44s, total: 4min 53s
Wall time: 4min 52s


In [33]:
test_num_agg.shape,test_cat_agg.shape

((924621, 886), (924621, 34))

**Attention** 运行内存容量有限，随写随删，不然内存耗尽会崩。

In [37]:
del train_num_agg, train_cat_agg

In [21]:
del test_num_agg, test_cat_agg

### 3.6. categorical features enconding & nemerical features 去噪音化


In [22]:
def encode_round(df):
    cat_features = ["B_30","B_38", "D_114", "D_116", "D_117","D_120", "D_126", "D_63", "D_64", "D_66", "D_68"]
    cat_features = [f"{cf}_last" for cf in cat_features]
    for cat_col in cat_features:
        encoder = LabelEncoder()
        df[cat_col] = encoder.fit_transform(df[cat_col])
        
    num_cols = list(df.dtypes[(df.dtypes == 'float32') | (df.dtypes == 'float64')].index)
    num_cols = [col for col in num_cols if 'last' in col]
    for col in num_cols:
        df[col + '_round2'] = df[col].round(2)
    return df

In [40]:
trian = encode_round(train)
train.shape

(458913, 1013)

In [23]:
test = encode_round(test)
test.shape

(924621, 1012)

## 4. 判别方程

该部分判断条件由美国运通所决定，因此无需过多浏览，实际业务也并不会使用该类判断。

In [24]:
def lgb_amex(y_pred, y_true):
    return 'amex', amex_metric_np(y_pred,y_true.get_label()), True

def xgb_amex(y_pred, y_true):
    return 'amex', amex_metric_np(y_pred,y_true.get_label())

# Created by https://www.kaggle.com/yunchonggan
# https://www.kaggle.com/competitions/amex-default-prediction/discussion/328020
def amex_metric_np(preds: np.ndarray, target: np.ndarray) -> float:
    indices = np.argsort(preds)[::-1]
    preds, target = preds[indices], target[indices]

    weight = 20.0 - target * 19.0
    cum_norm_weight = (weight / weight.sum()).cumsum()
    four_pct_mask = cum_norm_weight <= 0.04
    d = np.sum(target[four_pct_mask]) / np.sum(target)

    weighted_target = target * weight
    lorentz = (weighted_target / weighted_target.sum()).cumsum()
    gini = ((lorentz - cum_norm_weight) * weight).sum()

    n_pos = np.sum(target)
    n_neg = target.shape[0] - n_pos
    gini_max = 10 * n_neg * (n_pos + 20 * n_neg - 19) / (n_pos + 20 * n_neg)

    g = gini / gini_max
    return 0.5 * (g + d)

# we still need the official metric since the faster version above is slightly off
def amex_metric(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:

    def top_four_percent_captured(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        four_pct_cutoff = int(0.04 * df['weight'].sum())
        df['weight_cumsum'] = df['weight'].cumsum()
        df_cutoff = df.loc[df['weight_cumsum'] <= four_pct_cutoff]
        return (df_cutoff['target'] == 1).sum() / (df['target'] == 1).sum()
        
    def weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        df['random'] = (df['weight'] / df['weight'].sum()).cumsum()
        total_pos = (df['target'] * df['weight']).sum()
        df['cum_pos_found'] = (df['target'] * df['weight']).cumsum()
        df['lorentz'] = df['cum_pos_found'] / total_pos
        df['gini'] = (df['lorentz'] - df['random']) * df['weight']
        return df['gini'].sum()

    def normalized_weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        y_true_pred = y_true.rename(columns={'target': 'prediction'})
        return weighted_gini(y_true, y_pred) / weighted_gini(y_true, y_true_pred)

    g = normalized_weighted_gini(y_true, y_pred)
    d = top_four_percent_captured(y_true, y_pred)

    return 0.5 * (g + d)

def amex_metric_mod_lgbm(y_pred: np.ndarray, data: lgb.Dataset):

    y_true = data.get_label()
    labels     = np.transpose(np.array([y_true, y_pred]))
    labels     = labels[labels[:, 1].argsort()[::-1]]
    weights    = np.where(labels[:,0]==0, 20, 1)
    cut_vals   = labels[np.cumsum(weights) <= int(0.04 * np.sum(weights))]
    top_four   = np.sum(cut_vals[:,0]) / np.sum(labels[:,0])

    gini = [0,0]
    for i in [1,0]:
        labels         = np.transpose(np.array([y_true, y_pred]))
        labels         = labels[labels[:, i].argsort()[::-1]]
        weight         = np.where(labels[:,0]==0, 20, 1)
        weight_random  = np.cumsum(weight / np.sum(weight))
        total_pos      = np.sum(labels[:, 0] *  weight)
        cum_pos_found  = np.cumsum(labels[:, 0] * weight)
        lorentz        = cum_pos_found / total_pos
        gini[i]        = np.sum((lorentz - weight_random) * weight)

    return 'AMEX', 0.5 * (gini[1]/gini[0]+ top_four), True

## 5. 开始建模

### 5.1.xgb

In [25]:
def xgb_train(x, y, xt, yt):
    print("-----------xgb starts training-----------")
    print("# of features:", x.shape[1])
    assert x.shape[1] == xt.shape[1]
    dtrain = xgb.DMatrix(data=x, label=y)
    dvalid = xgb.DMatrix(data=xt, label=yt)
    params = {
            'objective': 'binary:logistic', 
            'tree_method': 'auto', 
            'max_depth': 7,
            'subsample':0.88,
            'colsample_bytree': 0.5,
            'gamma':1.5,
            'min_child_weight':8,
            'lambda':70,
            'eta':0.03,
#             'scale_pos_weight': scale_pos_weight,
    }
    watchlist = [(dtrain, 'train'), (dvalid, 'eval')]
    bst = xgb.train(params, dtrain=dtrain,
                num_boost_round=5000,evals=watchlist,
                early_stopping_rounds=100, feval=xgb_amex, maximize=True,
                verbose_eval=100)
    print('best ntree_limit:', bst.best_ntree_limit)
    print('best score:', bst.best_score)
    return bst.predict(dtrain, iteration_range=(0,bst.best_ntree_limit)), bst.predict(dvalid, iteration_range=(0,bst.best_ntree_limit)), bst

### 5.2. lgbm-gbdt

In [26]:
def lgb_train(x, y, xt, yt):
    print("----------lgb starts training----------")
    print("# of features:", x.shape[1])
    assert x.shape[1] == xt.shape[1]
    # lgb_train = lgb.Dataset(x.to_pandas(), y.to_pandas())
    # lgb_eval = lgb.Dataset(xt.to_pandas(), yt.to_pandas(), reference=lgb_train)
    lgb_train = lgb.Dataset(x, y)
    lgb_eval = lgb.Dataset(xt, yt, reference=lgb_train)
    params = {
        'objective': 'binary',
        'metric': 'binary_logloss',
        'boosting':'gbdt',
        'seed': 42,
        'num_leaves': 100,
        'learning_rate': 0.01,
        'feature_fraction': 0.20,
        'bagging_freq': 10,
        'bagging_fraction': 0.50,
        'n_jobs': -1,
        'lambda_l2': 2,
        'min_data_in_leaf': 40
       
    }
    gbm = lgb.train(params,
                lgb_train,
                num_boost_round=5000,
                valid_sets=[lgb_train, lgb_eval],
                early_stopping_rounds=100,feval=amex_metric_mod_lgbm, 
                verbose_eval=100,)


    print('best iterations:', gbm.best_iteration)
    print('best score:', gbm.best_score)
    return gbm.predict(x, num_iteration =gbm.best_iteration),gbm.predict(xt, num_iteration =gbm.best_iteration), gbm

### 5.3. catgboost

In [27]:
def cat_train(x, y, xt, yt):
    print("-----------catboost starts training-----------")
    print("# of features:", x.shape[1])
    assert x.shape[1] == xt.shape[1]
    cat_train = cat.Pool(x, y)
    cat_eval = cat.Pool(xt, yt)
    
    clf = CatBoostRegressor(iterations=5000, 
                             task_type='CPU',
                             bagging_temperature = 0.2,
                             od_type='Iter',
                             metric_period = 50,
                             od_wait=20)
    clf.fit(cat_train, eval_set=cat_eval, verbose=100,early_stopping_rounds=100)
    return  clf.predict(cat_eval), clf

### 5.4. 训练模型

由于数据集量极大，因此整个模型过程，该部分耗时最久，普通GPU加载下，还需6h左右。但此处并非决定模型好坏的关键因素。只要特征衍生以及入模数据处理的好，模型效果总不会差。

In [61]:
%%time
not_used = ['customer_ID','target']
not_used = [i for i in not_used if i in train.columns]
msgs = {}
folds = 5
score = 0


for i in range(folds):
    print(f"==============Folds {i}===============")
    mask = train['cid']%folds == i
    tr,va = train[~mask], train[mask]

    x, y = tr.drop(not_used, axis=1), tr['target']
    xt, yt = va.drop(not_used, axis=1), va['target']
    features = len(x.columns)
    
    xp, yp, bst = xgb_train(x, y, xt, yt)
    bst.save_model(f'models/xgb_{i}.json')

    x = tr.drop(not_used, axis=1)
    xt = va.drop(not_used, axis=1)
    
    xp2,yp2,gbm = lgb_train(x, y, xt, yt)
    gbm.save_model(f'models/lgb_{i}.json')
    
    yp3,cats = cat_train(x, y, xt, yt)
    cats.save_model(f'models/cat_{i}.json')

    preds = yp * 0.35+yp2 * 0.45
    amex_score = amex_metric(pd.DataFrame({'target':yt.values}), 
                                    pd.DataFrame({'prediction':preds}))
    msg = f"Fold {i} amex {amex_score:.4f}"
    print(msg)
    score += amex_score
    del tr,va,x,y
    del xt,yt,cats,gbm
    _ = gc.collect()
score /= folds
print(f"Average amex score: {score:.4f}")

-----------xgb starts training-----------
# of features: 1011
[0]	train-logloss:0.67371	train-amex:0.70659	eval-logloss:0.67380	eval-amex:0.69917
[99]	train-logloss:0.24242	train-amex:0.78084	eval-logloss:0.24837	eval-amex:0.76634
best ntree_limit: 100
best score: 0.766344
----------lgb starts training----------
# of features: 1011
[LightGBM] [Info] Number of positive: 95258, number of negative: 271872
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 163950
[LightGBM] [Info] Number of data points in the train set: 367130, number of used features: 1003
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.259467 -> initscore=-1.048742
[LightGBM] [Info] Start training from score -1.048742
Training until validation scores don't improve for 100 rounds
[100]	training's binary_logloss: 0.320411	training's AMEX: 0.772166	valid_1's binary_logloss: 0.322061	valid_1's AMEX: 0.760757
[200]	training's binary_logloss: 0.254833	training's AMEX: 0.783088	valid_1's bi



0:	learn: 0.3662659	test: 0.3651642	best: 0.3651642 (0)	total: 229ms	remaining: 45.6s
100:	learn: 0.2609307	test: 0.2666794	best: 0.2666794 (100)	total: 14.6s	remaining: 14.3s
199:	learn: 0.2549217	test: 0.2660373	best: 0.2660103 (181)	total: 28.2s	remaining: 0us

bestTest = 0.2660102522
bestIteration = 181

Shrink model to first 182 iterations.
Fold 0 amex 0.7720
-----------xgb starts training-----------
# of features: 1011
[0]	train-logloss:0.67373	train-amex:0.70413	eval-logloss:0.67383	eval-amex:0.69728
[99]	train-logloss:0.24209	train-amex:0.78083	eval-logloss:0.24921	eval-amex:0.76798
best ntree_limit: 98
best score: 0.768142
----------lgb starts training----------
# of features: 1011
[LightGBM] [Info] Number of positive: 95167, number of negative: 271963
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 163932
[LightGBM] [Info] Number of data points in the train set: 367130, number of used features: 1003
[LightGBM] [Info] [binary:BoostFromSco



0:	learn: 0.3656816	test: 0.3651761	best: 0.3651761 (0)	total: 179ms	remaining: 35.6s
100:	learn: 0.2610076	test: 0.2663813	best: 0.2663813 (100)	total: 14.8s	remaining: 14.5s
199:	learn: 0.2548046	test: 0.2657349	best: 0.2656664 (171)	total: 28.4s	remaining: 0us

bestTest = 0.265666387
bestIteration = 171

Shrink model to first 172 iterations.
Fold 1 amex 0.7738
-----------xgb starts training-----------
# of features: 1011
[0]	train-logloss:0.67373	train-amex:0.71082	eval-logloss:0.67386	eval-amex:0.69898
[99]	train-logloss:0.24215	train-amex:0.78214	eval-logloss:0.24937	eval-amex:0.76709
best ntree_limit: 100
best score: 0.767091
----------lgb starts training----------
# of features: 1011
[LightGBM] [Info] Number of positive: 94985, number of negative: 272145
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 163966
[LightGBM] [Info] Number of data points in the train set: 367130, number of used features: 1003
[LightGBM] [Info] [binary:BoostFromSco



0:	learn: 0.3651172	test: 0.3658629	best: 0.3658629 (0)	total: 169ms	remaining: 33.6s
100:	learn: 0.2611801	test: 0.2659755	best: 0.2659755 (100)	total: 14.6s	remaining: 14.4s
199:	learn: 0.2551166	test: 0.2652512	best: 0.2651741 (180)	total: 28.1s	remaining: 0us

bestTest = 0.2651741029
bestIteration = 180

Shrink model to first 181 iterations.
Fold 2 amex 0.7747
-----------xgb starts training-----------
# of features: 1011
[0]	train-logloss:0.67372	train-amex:0.70762	eval-logloss:0.67379	eval-amex:0.70326
[99]	train-logloss:0.24246	train-amex:0.78028	eval-logloss:0.24842	eval-amex:0.76984
best ntree_limit: 99
best score: 0.769901
----------lgb starts training----------
# of features: 1011
[LightGBM] [Info] Number of positive: 94856, number of negative: 272275
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 163855
[LightGBM] [Info] Number of data points in the train set: 367131, number of used features: 1003
[LightGBM] [Info] [binary:BoostFromSco



0:	learn: 0.3649734	test: 0.3658084	best: 0.3658084 (0)	total: 188ms	remaining: 37.5s
100:	learn: 0.2611294	test: 0.2659639	best: 0.2659576 (99)	total: 14.6s	remaining: 14.4s
199:	learn: 0.2550633	test: 0.2655921	best: 0.2654803 (172)	total: 28.1s	remaining: 0us

bestTest = 0.2654802937
bestIteration = 172

Shrink model to first 173 iterations.
Fold 3 amex 0.7763
-----------xgb starts training-----------
# of features: 1011
[0]	train-logloss:0.67374	train-amex:0.70511	eval-logloss:0.67385	eval-amex:0.69711
[99]	train-logloss:0.24223	train-amex:0.78210	eval-logloss:0.24887	eval-amex:0.76612
best ntree_limit: 100
best score: 0.766116
----------lgb starts training----------
# of features: 1011
[LightGBM] [Info] Number of positive: 95046, number of negative: 272085
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 163930
[LightGBM] [Info] Number of data points in the train set: 367131, number of used features: 1003
[LightGBM] [Info] [binary:BoostFromSco



0:	learn: 0.3647172	test: 0.3648478	best: 0.3648478 (0)	total: 164ms	remaining: 32.6s
100:	learn: 0.2610071	test: 0.2662727	best: 0.2662727 (100)	total: 14.7s	remaining: 14.4s
199:	learn: 0.2550472	test: 0.2656445	best: 0.2655963 (167)	total: 28.2s	remaining: 0us

bestTest = 0.2655963142
bestIteration = 167

Shrink model to first 168 iterations.
Fold 4 amex 0.7722
Average amex score: 0.7738
CPU times: user 5h 51min 52s, sys: 1min 9s, total: 5h 53min 1s
Wall time: 34min 56s


### 5.3.预测模型

In [28]:
yps = []
preds = 0
folds = 5
not_used = ['customer_ID']
for i in range(folds):
    bst = xgb.Booster()
    bst.load_model(f'models/xgb_{i}.json')
    dx = xgb.DMatrix(test.drop(not_used,axis =1))
    
    gbm = lgb.Booster(model_file = f'models/lgb_{i}.json')
    dx2 = test.drop(not_used,axis = 1)
    
    cats = CatBoostRegressor()
    cats.load_model(f'models/cat_{i}.json')
    
    yp = bst.predict(dx,iteration_range = (0,bst.best_ntree_limit))
    yp2 = gbm.predict(dx2,num_iteration = gbm.best_iteration)
    yp3 = cats.predict(dx2)
    
    preds += (yp*0.35 + yp2*0.45 + yp3 * 0.25)
yps.append(preds/folds)

In [36]:
sub['prediction'] = yps[0]

## 6. 结果输出


In [38]:
df_submit = sub

In [60]:
df_submit.to_csv("/home/featurize/work/submission_3.csv", index = False)

In [61]:
tmp = pd.read_csv('/home/featurize/work/submission_3.csv')
tmp

Unnamed: 0,customer_ID,prediction
0,00000469ba478561f23a92a868bd366de6f6527a684c9a...,0.028926
1,00001bf2e77ff879fab36aa4fac689b9ba411dae63ae39...,0.000817
2,0000210045da4f81e5f122c6bde5c2a617d03eef67f82c...,0.052601
3,00003b41e58ede33b8daf61ab56d9952f17c9ad1c3976c...,0.246161
4,00004b22eaeeeb0ec976890c1d9bfc14fd9427e98c4ee9...,0.841315
...,...,...
924616,ffff952c631f2c911b8a2a8ca56ea6e656309a83d2f64c...,0.017092
924617,ffffcf5df59e5e0bba2a5ac4578a34e2b5aa64a1546cd3...,0.785878
924618,ffffd61f098cc056dbd7d2a21380c4804bbfe60856f475...,0.359771
924619,ffffddef1fc3643ea179c93245b68dca0f36941cd83977...,0.233254


## 7. 总结

该代码简单介绍了整体分类预测模型如何开始与收尾。当然如果想要效果达到非常好的程度，还有很多关键事情可以做。例如特征衍生量是否可以增加，数据去噪是否可以完善，融合模型是否可以有更多的选择，以及融合模型时更多权重的选择。在关于该代码的说明文档中，也会有我关于该竞赛的反思和思考，大家有兴趣可以观看。这份代码仅作为当初我提交的一个版本，还有诸多版本就不一一上传（其余版本在租用服务器平台，早已欠费呜呜呜）就不一一展示。