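## Digit recognition

* Supervised learning;
* Classification problem;
* Training set: the digit label plus the pixel value at each position;
* Test set: the pixel value at each position only;
* Submission format: ImageId,Label;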

## Environment setup

In [51]:
import os,sys,time
from datetime import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier
from sklearn import model_selection
from sklearn.feature_selection import VarianceThreshold
import joblib  # sklearn.externals.joblib was deprecated in scikit-learn 0.21 and removed in later releases
from sklearn.metrics import accuracy_score
from sklearn.ensemble import VotingClassifier

import warnings
warnings.filterwarnings('ignore')

plt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus']=False

SEED = 9291

%matplotlib inline

## Loading and inspecting the data, memory optimization

In [2]:
def concat_df(train_data, test_data):
    # Returns a concatenated df of training and test set on axis 0
    return pd.concat([train_data, test_data], sort=True).reset_index(drop=True)

def divide_df(all_data, train_size=42000):
    # Returns divided dfs of training and test set
    return all_data.loc[:train_size-1], all_data.loc[train_size:].drop(['label'], axis=1)

### Loading

In [63]:
train_data = pd.read_csv('input/train.csv')
print(train_data.info(memory_usage='deep'))

test_data = pd.read_csv('input/test.csv')
print(test_data.info(memory_usage='deep'))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42000 entries, 0 to 41999
Columns: 785 entries, label to pixel783
dtypes: int64(785)
memory usage: 251.5 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28000 entries, 0 to 27999
Columns: 784 entries, pixel0 to pixel783
dtypes: int64(784)
memory usage: 167.5 MB
None


784 = 28×28, i.e. each digit image is 28×28 pixels;

In [64]:
all_data = concat_df(train_data, test_data)

### Memory optimization

In [65]:
all_data.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Columns: 785 entries, label to pixel99
dtypes: float64(1), int64(784)
memory usage: 419.2 MB


In [67]:
all_data = all_data.apply(pd.to_numeric,downcast='unsigned')
all_data.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Columns: 785 entries, label to pixel99
dtypes: float64(1), uint8(784)
memory usage: 52.9 MB
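The roughly 8x reduction follows from the element size: `int64` stores 8 bytes per value and `uint8` stores 1, and pixel intensities fit in the range 0 to 255. The following is a minimal sketch on a small synthetic array; the array `pixels` and its shape are illustrative assumptions, not the competition data.

```python
import numpy as np

# Hypothetical stand-in for the pixel columns: 1,000 images of 28*28 pixels.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(1000, 784), dtype=np.int64)

# Pixel intensities lie in [0, 255], so they fit in an unsigned 8-bit integer.
assert pixels.min() >= 0 and pixels.max() <= 255
pixels_small = pixels.astype(np.uint8)

bytes_before = pixels.nbytes        # 8 bytes per value
bytes_after = pixels_small.nbytes   # 1 byte per value
ratio = bytes_before / bytes_after
print(bytes_before, bytes_after, ratio)
```

The `label` column stays `float64` in the output above because the test rows have no label, and `NaN` cannot be stored in an unsigned integer column.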


### Preview

In [68]:
all_data.sample(10)

index,label,pixel0,pixel1,pixel10,pixel100,pixel101,pixel102,pixel103,pixel104,pixel105,...,pixel90,pixel91,pixel92,pixel93,pixel94,pixel95,pixel96,pixel97,pixel98,pixel99
33985,0.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
58418,,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
42508,,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
53895,,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8479,0.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
50074,,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21810,0.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
66228,,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
67079,,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18457,0.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [69]:
train_data[['label','pixel0']].groupby('label').count()

label,pixel0
0,4132
1,4684
2,4177
3,4351
4,4072
5,3795
6,4137
7,4401
8,4063
9,4188


The number of training samples per digit varies little (3,795 to 4,684, roughly 4,000 each), so the classes are fairly balanced;

## Visualizing features

Compare the standard deviation within a single class against that of several classes pooled together;

In [7]:
train_data.groupby('label').std().mean(axis=1)

label
0    44.810023
1    24.077799
2    47.097188
3    43.168668
4    40.773742
5    44.821169
6    40.953832
7    38.045724
8    42.534838
9    37.799291
dtype: float64

In [8]:
train_data[train_data['label']%4==0].std().mean()

48.453327804548906

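The pooled subset (labels 0, 4 and 8) has a mean std of 48.45, higher than that of every single class (24.08 to 47.10), which suggests that the classes differ noticeably from one another;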
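The same comparison can be reproduced on synthetic data. The sketch below assumes two hypothetical classes whose pixels share the same noise level but have different means; pooling them raises the per-pixel standard deviation because the between-class difference is added to the within-class spread.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical classes: same noise level, different mean intensity per pixel.
class_a = rng.normal(loc=50.0, scale=10.0, size=(2000, 16))
class_b = rng.normal(loc=150.0, scale=10.0, size=(2000, 16))

# Mean per-pixel standard deviation within each class.
std_a = class_a.std(axis=0).mean()
std_b = class_b.std(axis=0).mean()

# Mean per-pixel standard deviation of the pooled classes.
std_mixed = np.vstack([class_a, class_b]).std(axis=0).mean()

print(std_a, std_b, std_mixed)
```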

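## Data preprocessing

- Missing-value handling;
- Outlier handling;

## Feature engineering

Likewise, because the inputs are raw pixel values, conventional feature construction and selection are hard to apply directly. Columns that are entirely 0 can still be removed, since they carry no information (this is the simplest form of variance filtering);

0. Variance filtering;
1. Each sample is an N\*M image flattened into pixels, so the number of non-zero pixels in each row and in each column can be counted;
2. The total number of non-zero pixels;

### Per-row count of non-zero pixels in the 28×28 image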

In [9]:
# 28 rows
for i in range(28):
    cols = [col for col in all_data.columns if col != 'label' and col.find('pixel')!=-1 and int(int(col.replace('pixel',''))/28)==i]
    all_data['row_'+str(i+1)] = all_data[cols].apply(lambda x: sum([0 if t==0 else 1 for t in x]), axis=1)

all_data[[col for col in all_data.columns if col.find('row_')!=-1]].sample(3)

index,row_1,row_2,row_3,row_4,row_5,row_6,row_7,row_8,row_9,row_10,...,row_19,row_20,row_21,row_22,row_23,row_24,row_25,row_26,row_27,row_28
49986,0,0,0,0,0,8,11,13,14,14,...,7,8,10,14,15,14,9,0,0,0
24794,0,0,0,0,3,5,6,6,6,7,...,5,5,6,6,5,4,0,0,0,0
6156,0,0,0,0,0,4,5,9,10,11,...,6,7,6,6,6,5,4,0,0,0


### Per-column count of non-zero pixels in the 28×28 image

In [10]:
# 28 columns
for i in range(28):
    cols = [col for col in all_data.columns if col != 'label' and col.find('pixel')!=-1 and int(int(col.replace('pixel',''))%28)==i]
    all_data['col_'+str(i+1)] = all_data[cols].apply(lambda x: sum([0 if t==0 else 1 for t in x]), axis=1)

all_data[[col for col in all_data.columns if col.find('col_')!=-1]].sample(3)

index,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,col_10,...,col_19,col_20,col_21,col_22,col_23,col_24,col_25,col_26,col_27,col_28
43781,0,0,0,0,0,0,5,11,13,13,...,12,10,4,4,4,4,5,4,0,0
38617,0,0,0,0,0,0,0,0,2,9,...,9,7,5,0,0,0,0,0,0,0
24110,0,0,0,0,0,0,3,6,7,8,...,18,16,11,5,0,0,0,0,0,0


### Total count of non-zero pixels in the 28×28 image

In [11]:
# 28×28
cols = [col for col in all_data.columns if col != 'label' and col.find('pixel')!=-1]
all_data['all_image'] = all_data[cols].apply(lambda x: sum([0 if t==0 else 1 for t in x]), axis=1)

all_data[['all_image','label']].sample(10)

index,all_image,label
13844,142,2.0
53171,190,
42628,158,
40523,147,2.0
9296,195,5.0
15910,106,7.0
7124,180,3.0
50194,178,
59302,181,
19559,133,4.0
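The row-wise `apply` with a Python lambda is slow on 70,000 rows. A vectorized alternative is sketched below. It assumes the pixel columns are arranged in their natural order `pixel0` ... `pixel783` (the concatenated frame above sorts them lexicographically, so they would need reordering first); the array `flat` is a hypothetical stand-in for those columns.

```python
import numpy as np

# Hypothetical stand-in: 3 images, each flattened to 784 pixels in
# row-major order (pixel index = row * 28 + column).
rng = np.random.default_rng(0)
flat = rng.integers(0, 256, size=(3, 784))
flat[flat < 200] = 0  # make most pixels zero, like the digit images

images = flat.reshape(-1, 28, 28)   # (n_images, row, column)
nonzero = images != 0

row_counts = nonzero.sum(axis=2)         # shape (n, 28): non-zero pixels per row
col_counts = nonzero.sum(axis=1)         # shape (n, 28): non-zero pixels per column
total_counts = nonzero.sum(axis=(1, 2))  # shape (n,): non-zero pixels per image

print(row_counts.shape, col_counts.shape, total_counts)
```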


### Dropping all-zero columns

In [12]:
all_zero_columns = []
for col in all_data.columns:
    if all_data[col].max() <= 0:
        all_zero_columns.append(col)
print(all_zero_columns)

['pixel0', 'pixel1', 'pixel10', 'pixel11', 'pixel111', 'pixel112', 'pixel140', 'pixel16', 'pixel168', 'pixel17', 'pixel18', 'pixel19', 'pixel2', 'pixel20', 'pixel21', 'pixel22', 'pixel23', 'pixel24', 'pixel25', 'pixel26', 'pixel27', 'pixel28', 'pixel29', 'pixel3', 'pixel30', 'pixel31', 'pixel4', 'pixel476', 'pixel5', 'pixel52', 'pixel53', 'pixel54', 'pixel55', 'pixel56', 'pixel560', 'pixel57', 'pixel6', 'pixel644', 'pixel671', 'pixel672', 'pixel673', 'pixel699', 'pixel7', 'pixel700', 'pixel701', 'pixel727', 'pixel728', 'pixel729', 'pixel730', 'pixel754', 'pixel755', 'pixel756', 'pixel757', 'pixel758', 'pixel759', 'pixel780', 'pixel781', 'pixel782', 'pixel783', 'pixel8', 'pixel82', 'pixel83', 'pixel84', 'pixel85', 'pixel9']


In [13]:
all_data.drop(all_zero_columns, axis=1, inplace=True)
all_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Columns: 777 entries, label to all_image
dtypes: float64(1), int64(776)
memory usage: 415.0 MB


### Variance filtering

In [14]:
print(len(all_data.columns))

777


In [15]:
# Drop features whose variance is below the threshold 0.8*(1-0.8) = 0.16
threshold = .8*(1-.8)
all_data_without_target = all_data.drop(['label'], axis=1)
selector = VarianceThreshold(threshold=threshold).fit(all_data_without_target)

for feature,variance in zip(list(all_data_without_target.columns),selector.variances_):
    print('{} {} {}'.format(feature, variance, variance>=threshold))
    if variance < threshold:
        all_data.drop(feature, axis=1, inplace=True)

pixel100 2475.481296482449 True
pixel101 2220.1317095426534 True
pixel102 1804.1300272653061 True
pixel103 1309.3671964081632 True
pixel104 785.4129124171432 True
pixel105 430.52276553061216 True
pixel106 192.84726855102042 True
pixel107 76.41804083244898 True
pixel108 28.15855856632653 True
pixel109 3.8632425124489793 True
pixel110 0.6202490691836736 True
pixel113 0.020628276734693873 False
pixel114 0.38702286346938763 True
pixel115 1.0269039997959186 True
pixel116 12.143777370612247 True
pixel117 61.49446267346941 True
pixel118 172.4656764538776 True
pixel119 435.5346495640818 True
pixel12 0.19365390285714287 True
pixel120 884.0021983477551 True
pixel121 1587.8409816326532 True
pixel122 2537.9635518824493 True
pixel123 3670.89860730102 True
pixel124 4890.936816959999 True
pixel125 6140.921724489795 True
pixel126 7094.081583301222 True
pixel127 7531.423256632653 True
pixel128 7344.328434194081 True
pixel129 6585.10996676551 True
pixel13 1.5881263469387745 True
pixel130 5414.4543337712

pixel369 1974.8855712848974 True
pixel37 10.510540112448979 True
pixel370 4692.95610354102 True
pixel371 8005.213757541022 True
pixel372 10754.806793142041 True
pixel373 12014.480619632446 True
pixel374 12128.86238114102 True
pixel375 11731.442853689592 True
pixel376 11304.28728678347 True
pixel377 11830.432641206324 True
pixel378 12969.903320783471 True
pixel379 12479.44901297959 True
pixel38 23.23604999489796 True
pixel380 12087.085742986937 True
pixel381 12587.963386102037 True
pixel382 12454.500613137756 True
pixel383 10959.7183092049 True
pixel384 8647.812102836735 True
pixel385 6515.206592398164 True
pixel386 4569.008698546735 True
pixel387 2690.75079453694 True
pixel388 963.1855183477551 True
pixel389 99.72491508387752 True
pixel39 32.23003302673469 True
pixel390 11.711273564081637 True
pixel391 1.8461989583673473 True
pixel392 0.1824116797959184 True
pixel393 0.9142214234693881 True
pixel394 8.511735489795921 True
pixel395 79.97662020489796 True
pixel396 547.5707133614285 True


pixel621 1970.4870579671433 True
pixel622 4168.232704718163 True
pixel623 6930.808301799184 True
pixel624 9564.186832651227 True
pixel625 11464.688623118165 True
pixel626 12460.10354120408 True
pixel627 12753.306463672654 True
pixel628 12660.36841366347 True
pixel629 12577.749244732857 True
pixel63 24.762014479795916 True
pixel630 12604.029859118162 True
pixel631 12412.13873355102 True
pixel632 11553.769541382653 True
pixel633 9882.895007165509 True
pixel634 7512.823833387755 True
pixel635 5081.1612385663275 True
pixel636 3054.5680615510205 True
pixel637 1665.6298620277553 True
pixel638 843.5866986938776 True
pixel639 396.69344399 True
pixel64 50.76802486204081 True
pixel640 139.45026295836735 True
pixel641 27.025442857142856 True
pixel642 2.1533157869387765 True
pixel643 0.07405608489795915 False
pixel645 0.008228453877551025 False
pixel646 7.481199872448978 True
pixel647 63.84880808959183 True
pixel648 292.3340230153061 True
pixel649 873.5662843834697 True
pixel65 96.20013102836734 T

In [16]:
print(len(all_data.columns))

739
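The threshold `.8*(1-.8)` is the variance of a 0/1 feature that takes one value 80% of the time. For pixel intensities between 0 and 255 it simply acts as a small cutoff, and here it reduces the frame from 777 to 739 columns. The sketch below reproduces the filter with plain NumPy on a hypothetical 4-column matrix, using the same rule as the loop above (drop a feature if its variance is below the threshold).

```python
import numpy as np

threshold = 0.8 * (1 - 0.8)  # about 0.16

# Hypothetical feature matrix: 5 samples, 4 features.
X = np.array([
    [0, 0.0,  10, 0],
    [0, 0.0, 200, 1],
    [0, 0.0,  35, 0],
    [0, 0.0,  90, 1],
    [0, 0.5, 255, 1],
], dtype=float)

variances = X.var(axis=0)      # population variance per column
keep = variances >= threshold  # drop a column if its variance is below the threshold
X_filtered = X[:, keep]

print(variances, keep, X_filtered.shape)
```

The first column is constant and the second barely varies, so both are dropped; the last two are kept.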


### Data transformation (TODO)

## Splitting the dataset

In [17]:
train_data, test_data = divide_df(all_data)

In [18]:
from sklearn.model_selection import train_test_split

x,y = train_data.drop(['label'], axis=1),train_data['label']
x_train,x_valid,y_train,y_valid = train_test_split(x,y,test_size=0.3,random_state=0)

x_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29400 entries, 26437 to 2732
Columns: 738 entries, pixel100 to all_image
dtypes: int64(738)
memory usage: 165.8 MB


## Model building and tuning

In [27]:
train_x = train_data.drop('label', axis=1)
train_y = train_data['label']

In [30]:
K = 10
kf = model_selection.KFold(n_splits=K, shuffle=True, random_state=SEED)

def cv_score(model, X=train_x, y=train_y):
    return model_selection.cross_val_score(model, X, y, cv=kf)
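With `K = 10` and 42,000 training rows, each fold holds out 4,200 rows and trains on the remaining 37,800. The sketch below shows the index bookkeeping with NumPy only. It illustrates shuffled K-fold splitting in general and is not scikit-learn's implementation, so the actual fold membership will differ; the helper name `kfold_indices` is hypothetical.

```python
import numpy as np

def kfold_indices(n_samples, n_splits, seed):
    """Yield (train_idx, valid_idx) pairs for shuffled K-fold splitting."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_splits)
    for i in range(n_splits):
        valid_idx = folds[i]
        # Training indices are all folds except the held-out one.
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, valid_idx

splits = list(kfold_indices(n_samples=42000, n_splits=10, seed=9291))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```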

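### kNN - 0.9659

The dataset is neither small nor very large, so the simplest approach is kNN. kNN suits this problem because all features can be assumed to carry equal weight, the classes are balanced, and the classes differ substantially from one another. The result is already good, but further improvement is hard: apart from changing k or the distance metric (which effectively reweights the features), there is little left to tune. kNN is also slow to run, so further gains require switching to other algorithms;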

In [33]:
from sklearn.neighbors import KNeighborsClassifier as kNN

    knn = kNN()
    score = cv_score(knn, X=train_x, y=train_y)
    print('Accuracy mean:'+str(score.mean())+', std:'+str(score.std()))
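The two tuning knobs mentioned above, `k` and the distance metric, are visible in a brute-force implementation. This is a minimal NumPy sketch with hypothetical names (`knn_predict` and the toy arrays); it is not scikit-learn's `KNeighborsClassifier`, whose defaults are `n_neighbors=5` and Minkowski distance with `p=2`.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5, p=2):
    """Brute-force kNN: majority label among the k nearest training points
    under the Minkowski distance of order p."""
    preds = []
    for q in X_query:
        # Distance from the query to every training row.
        dist = (np.abs(X_train - q) ** p).sum(axis=1) ** (1.0 / p)
        nearest = np.argsort(dist)[:k]
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Toy data: two well-separated clusters labelled 0 and 1.
X_train = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.5, 0.5], [9.5, 9.5]])

result = knn_predict(X_train, y_train, X_query, k=3)
print(result)
```

Every query computes a distance to every training row, which is why prediction time grows with the size of the training set.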

### Multiple models

The best score is XGBoost at about 0.972; the worst is AdaBoost at about 0.850;

In [34]:
models = {
    'Adaboost':ensemble.AdaBoostClassifier(n_estimators=1000, learning_rate=.05), 
    'RandomForest':ensemble.RandomForestClassifier(n_estimators=200, n_jobs=-1, max_features = "sqrt", min_samples_split = 5),
    'kNN':kNN(),
    'XGBClassifier':XGBClassifier(n_estimators=500, n_jobs=-1),
    'LogisticRegression':linear_model.LogisticRegressionCV()
}

predictions = {}
scores = {}

for name, model in models.items():
    start = datetime.now()
    print('[{}] Running {}'.format(start, name))
    
    model.fit(train_x, train_y)
    # Predictions on the rows used for fitting (in-sample); they feed the ensembling below
    predictions[name] = model.predict(train_x)
    
    score = cv_score(model, X=train_x, y=train_y)
    scores[name] = (score.mean(), score.std())
    
    end = datetime.now()
    
    print('[{}] Finished Running {} in {:.2f}s'.format(end, name, (end - start).total_seconds()))
    print('[{}] {} Mean Accuracy: {:.6f} / Std: {:.6f}\n'.format(datetime.now(), name, scores[name][0], scores[name][1]))

[2019-09-23 19:36:55.357693] Running kNN
[2019-09-23 20:55:55.433154] Finished Running kNN in 4740.08s
[2019-09-23 20:55:55.433274] kNN Mean Accuracy: 0.967643 / Std: 0.001728

[2019-09-23 20:55:55.433312] Running XGBClassifier
[2019-09-24 00:03:41.470045] Finished Running XGBClassifier in 11266.04s
[2019-09-24 00:03:41.476668] XGBClassifier Mean Accuracy: 0.972405 / Std: 0.002427

[2019-09-24 00:03:41.476758] Running RandomForest
[2019-09-24 00:07:45.673236] Finished Running RandomForest in 244.20s
[2019-09-24 00:07:45.673542] RandomForest Mean Accuracy: 0.966024 / Std: 0.002736

[2019-09-24 00:07:45.673666] Running LogisticRegression
[2019-09-24 02:07:50.018648] Finished Running LogisticRegression in 7204.34s
[2019-09-24 02:07:50.018838] LogisticRegression Mean Accuracy: 0.912524 / Std: 0.005139

[2019-09-24 02:07:50.018911] Running Adaboost
[2019-09-24 03:31:16.127161] Finished Running Adaboost in 5006.11s
[2019-09-24 03:31:16.127279] Adaboost Mean Accuracy: 0.850048 / Std: 0.006464

In [36]:
for k,v in models.items():
    joblib.dump(v, 'model/'+k+'.model')

### Model ensembling

- Voting;
- Weighted voting;
- Meta-model blending;

#### Defining the voting function

In [46]:
def vota(results, weights=None):
    # results: one sequence of predictions per model, all of the same length
    results = [list(r) for r in results]
    vota_results = []
    if weights is None:
        weights = [1]*len(results)
    for i in range(len(results[0])):
        votas = {}
        for j in range(len(results)):
            votas[results[j][i]] = votas.get(results[j][i], 0)+weights[j] # simple voting: all weights are 1, so this adds 1; weighted voting adds the model's weight
        vota_results.append(int(sorted(votas.items(), key=lambda vota:vota[1])[-1][0]))
    return vota_results
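To make the tallying concrete, the following is a minimal self-contained sketch of the same idea on hand-written predictions. The function name `weighted_vote` and the toy inputs are hypothetical; the logic mirrors `vota`: each model adds its weight to the label it predicts, and the label with the largest total wins.

```python
from collections import defaultdict

def weighted_vote(results, weights=None):
    """results: list of per-model prediction lists, all of equal length."""
    if weights is None:
        weights = [1] * len(results)
    fused = []
    for i in range(len(results[0])):
        tally = defaultdict(float)
        for preds, w in zip(results, weights):
            tally[preds[i]] += w
        # The label with the largest accumulated weight wins.
        fused.append(max(tally, key=tally.get))
    return fused

# Three hypothetical models predicting four samples.
results = [
    [3, 7, 1, 9],
    [3, 1, 1, 4],
    [8, 1, 2, 4],
]
equal = weighted_vote(results)                         # equal weights
dominated = weighted_vote(results, weights=[5, 1, 1])  # first model dominates
print(equal, dominated)
```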

#### Simple voting - 0.992

In [52]:
simple_vota_result = vota(list(predictions.values()))
print('Simple vota score:'+str(accuracy_score(train_y, simple_vota_result)))

Simple vota score:0.992


#### Weighted voting - 0.9937

In [53]:
weights = [score[0]/sum([s[0] for s in scores.values()]) for score in scores.values()]
simple_vota_result = vota(list(predictions.values()), weights=weights)
print('With weight vota score:'+str(accuracy_score(train_y, simple_vota_result)))

With weight vota score:0.9937142857142857
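Plugging in the cross-validation means from the run log shows how the weights come out. Because the accuracies are close, the normalized weights are nearly uniform (about 0.18 to 0.21), which is consistent with the small gain over simple voting.

```python
# Mean CV accuracies copied from the run log above.
scores = {
    'kNN': 0.967643,
    'XGBClassifier': 0.972405,
    'RandomForest': 0.966024,
    'LogisticRegression': 0.912524,
    'Adaboost': 0.850048,
}

# Each weight is the model's accuracy divided by the sum of all accuracies.
total = sum(scores.values())
weights = {name: acc / total for name, acc in scores.items()}

for name, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print('{:<20s} {:.4f}'.format(name, w))
```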


#### Meta-model blending - 0.9997

A RandomForestClassifier serves as the meta-model: a blended model should generalize better, and a random forest is also easy to tune;

In [54]:
#model_data
model_x = pd.DataFrame(predictions)
model_x.sample(5)

index,Adaboost,LogisticRegression,RandomForest,XGBClassifier,kNN
40303,8.0,8.0,8.0,8.0,8.0
6154,2.0,2.0,2.0,2.0,2.0
28716,8.0,8.0,2.0,2.0,2.0
26384,9.0,9.0,9.0,9.0,9.0
38222,8.0,8.0,8.0,8.0,8.0


In [70]:
meta_rf = ensemble.RandomForestClassifier(n_estimators=1000, n_jobs=-1, max_features = 4)
meta_score = cv_score(meta_rf, model_x, train_y)
print('Meta model score mean:'+str(meta_score.mean())+', std:'+str(meta_score.std()))

Meta model score mean:0.9998333333333334, std:0.00021428571428573352


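#### Ensembling summary

Of the three approaches, meta-model blending still scores highest, so the submission file is generated with it. Note that all three scores are computed from base-model predictions on the same rows the models were fitted on, so they overstate the accuracy to expect on unseen data;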
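Because the base models are fitted and then asked to predict the same rows, the meta-model sees predictions that are better than they would be on unseen data. A common remedy is to build the meta-features from out-of-fold predictions. The sketch below illustrates that idea only: it swaps in a hypothetical nearest-centroid classifier and synthetic data so that it runs with NumPy alone, and it is not the method used in this notebook.

```python
import numpy as np

def fit_centroids(X, y):
    """Hypothetical base model: one mean vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_centroids(model, X):
    """Assign each row to the class with the nearest centroid."""
    classes, centroids = model
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]

def out_of_fold_predictions(X, y, n_splits=5, seed=0):
    """Each row is predicted by a model that never saw it during fitting."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    oof = np.empty(len(X), dtype=y.dtype)
    for i, valid_idx in enumerate(folds):
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        model = fit_centroids(X[train_idx], y[train_idx])
        oof[valid_idx] = predict_centroids(model, X[valid_idx])
    return oof

# Synthetic two-class data with overlapping clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(1.5, 1.0, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

oof = out_of_fold_predictions(X, y)
print('out-of-fold accuracy:', (oof == y).mean())
```

One such out-of-fold column per base model would then replace the in-sample columns as input to the meta-model.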

In [57]:
test_predictions = {}
for k,v in models.items():
    test_predictions[k] = v.predict(test_data)

In [59]:
model_test = pd.DataFrame(test_predictions)

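## Generating and submitting results

1. V-1.0 (0917): no preprocessing and no feature engineering apart from dropping the all-zero columns; kNN trained and used for prediction with default parameters;
    - Score: 0.966;
    - Rank: top 81%;
2. V-2.0 (0921): added three kinds of new features: non-zero count per row, non-zero count per column, and total non-zero count;
    - Score: 0.966;
    - Rank: top 81%;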

    result = knn.predict(test_data)
    result[:10]

In [71]:
meta_rf = ensemble.RandomForestClassifier(n_estimators=1000, n_jobs=-1, max_features = 4)
meta_rf.fit(model_x, train_y)
result = meta_rf.predict(model_test)

In [72]:
[idx-41999 for idx in list(test_data.index)]

[-41999,
 -41998,
 -41997,
 -41996,
 -41995,
 ...

The IDs are negative here because `test_data` was indexed from 0 when this cell ran (it had been reloaded from `input/test.csv`). The offset of 41999 only yields 1 to 28000 when `test_data` comes from `divide_df`, whose index starts at 42000;

In [73]:
# ImageId runs from 1 to 28000 in test-set order, independent of the DataFrame index
pd.DataFrame({'ImageId':np.arange(1, len(result)+1), 'Label':[int(label) for label in list(result)]}).to_csv('output/submission-digit-0928-V-3.0-with-meta-blend-models-maxfeature4.csv', index=False)