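## Digit Recognition

* Supervised learning
* A classification problem
* Training set: the digit label plus the pixel value at each position
* Test set: the pixel value at each position only
* Submission format: ImageId,Label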

## Environment setup

In [1]:
import os,sys,time
from datetime import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier
from sklearn import model_selection
from sklearn.feature_selection import VarianceThreshold

import warnings
warnings.filterwarnings('ignore')

plt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus']=False

%matplotlib inline

## Loading and inspecting the data

In [2]:
def concat_df(train_data, test_data):
    # Returns a concatenated df of training and test set on axis 0
    return pd.concat([train_data, test_data], sort=True).reset_index(drop=True)

def divide_df(all_data, train_size=42000):
    # Returns divided dfs of training and test set
    return all_data.loc[:train_size-1], all_data.loc[train_size:].drop(['label'], axis=1)

In [3]:
train_data = pd.read_csv('input/train.csv')
print(train_data.info())

test_data = pd.read_csv('input/test.csv')
print(test_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42000 entries, 0 to 41999
Columns: 785 entries, label to pixel783
dtypes: int64(785)
memory usage: 251.5 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28000 entries, 0 to 27999
Columns: 784 entries, pixel0 to pixel783
dtypes: int64(784)
memory usage: 167.5 MB
None


784 = 28 × 28, i.e. each digit image is 28 × 28 pixels.
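As a minimal sketch of that layout, the snippet below maps a flat pixel index to a (row, column) position and splits a flat list into 28 rows. It assumes row-major order, which is what the `/28` and `%28` arithmetic later in this notebook implies; the helper names `pixel_position` and `to_grid` are illustrative only.

```python
SIDE = 28  # image width and height in pixels

def pixel_position(k, side=SIDE):
    """Map a flat pixel index k (as in the column name 'pixel<k>') to (row, col)."""
    return divmod(k, side)

def to_grid(flat_pixels, side=SIDE):
    """Split a flat list of side*side values into a list of rows."""
    if len(flat_pixels) != side * side:
        raise ValueError("expected %d pixel values" % (side * side))
    return [flat_pixels[r * side:(r + 1) * side] for r in range(side)]

flat = list(range(SIDE * SIDE))  # stand-in for the 784 pixel columns of one sample
grid = to_grid(flat)
print(pixel_position(0), pixel_position(27), pixel_position(28), pixel_position(783))
# (0, 0) (0, 27) (1, 0) (27, 27)
```

With a real sample, `flat` would be the 784 pixel values of one row of the CSV, in column order `pixel0` to `pixel783`.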

In [4]:
all_data = concat_df(train_data, test_data)

In [5]:
all_data.sample(10)

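| | label | pixel0 | pixel1 | pixel10 | pixel100 | pixel101 | pixel102 | pixel103 | pixel104 | pixel105 | ... | pixel90 | pixel91 | pixel92 | pixel93 | pixel94 | pixel95 | pixel96 | pixel97 | pixel98 | pixel99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22405 | 1.0 | 0 | 0 | 0 | 0 | 171 | 156 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 49389 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 50085 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5201 | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 62190 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 57629 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 18985 | 3.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 60113 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 41352 | 3.0 | 0 | 0 | 0 | 163 | 254 | 254 | 216 | 139 | 28 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 68020 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |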


In [6]:
train_data[['label','pixel0']].groupby('label').count()

| label | pixel0 |
|---|---|
| 0 | 4132 |
| 1 | 4684 |
| 2 | 4177 |
| 3 | 4351 |
| 4 | 4072 |
| 5 | 3795 |
| 6 | 4137 |
| 7 | 4401 |
| 8 | 4063 |
| 9 | 4188 |


The number of training samples per digit varies little, at roughly 4000 each, so the classes are fairly balanced.

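## Visualizing the features

Compare the standard deviation within each single class against the standard deviation of a subset that mixes several classes.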

In [7]:
train_data.groupby('label').std().mean(axis=1)

label
0    44.810023
1    24.077799
2    47.097188
3    43.168668
4    40.773742
5    44.821169
6    40.953832
7    38.045724
8    42.534838
9    37.799291
dtype: float64

In [8]:
train_data[train_data['label']%4==0].std().mean()

48.453327804548906

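The std of the mixed subset (labels 0, 4 and 8, selected by `label % 4 == 0`) is higher than that of every single class, which suggests that the classes differ clearly from one another.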
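The reasoning behind this comparison can be sketched on a single pixel: if two classes have different typical intensities at that pixel, pooling them adds the between-class spread on top of the within-class spread. The intensities below are hypothetical toy values, not taken from the dataset; `stdev` is the sample standard deviation, which matches the pandas default (`ddof=1`).

```python
from statistics import stdev

# Hypothetical intensities of one pixel in two different digit classes
class_a = [0, 0, 10, 5, 0, 8]               # pixel is mostly dark for this digit
class_b = [250, 240, 255, 230, 245, 250]    # pixel is mostly bright for this digit

std_a = stdev(class_a)
std_b = stdev(class_b)
std_mixed = stdev(class_a + class_b)        # pooled samples of both classes

print(round(std_a, 2), round(std_b, 2), round(std_mixed, 2))
```

The pooled value is much larger than either within-class value here because the class means are far apart; when classes overlap heavily, the increase is correspondingly small.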

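## Data preprocessing

The data have no missing values and consist of raw pixels, which are hard to preprocess directly, so this step is skipped for now. Better options may be considered later.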

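## Feature engineering

Likewise, because the data are raw pixels, it is hard to construct or select features directly. Columns that are zero everywhere can be removed, though, since they clearly cannot help (this is a simple form of variance filtering).

0. Variance filtering;
1. Each sample holds the pixels of an N\*M image, so the number of non-zero pixels in each row and in each column can be counted;
2. The total number of non-zero pixels;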
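The three count features in this list can be sketched for a single image in plain Python. This is an illustrative helper under the assumption of a row-major 28 × 28 layout; `nonzero_count_features` and the toy image are hypothetical and not part of the notebook.

```python
SIDE = 28

def nonzero_count_features(flat_pixels, side=SIDE):
    """Return (row_counts, col_counts, total) of non-zero pixels for one image."""
    row_counts = [0] * side
    col_counts = [0] * side
    for k, value in enumerate(flat_pixels):
        if value != 0:
            r, c = divmod(k, side)  # row-major: index = row * side + col
            row_counts[r] += 1
            col_counts[c] += 1
    return row_counts, col_counts, sum(row_counts)

# Hypothetical toy image: a vertical stroke in column 14 covering rows 5 to 24
image = [0] * (SIDE * SIDE)
for r in range(5, 25):
    image[r * SIDE + 14] = 255

rows, cols, total = nonzero_count_features(image)
print(rows[4], rows[5], cols[14], total)  # 0 1 20 20
```

A vertical stroke like this gives many rows with a count of 1 and a single column with a large count, which is the kind of shape information these features are meant to capture.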

### Non-zero pixel count per row of the 28*28 image

In [9]:
# 28 rows: pixel k belongs to row k // 28
for i in range(28):
    cols = [col for col in all_data.columns if col.startswith('pixel') and int(col.replace('pixel',''))//28==i]
    # Count the non-zero pixels in this row (vectorized, same result as a per-row apply)
    all_data['row_'+str(i+1)] = (all_data[cols] != 0).sum(axis=1)

all_data[[col for col in all_data.columns if col.find('row_')!=-1]].sample(3)

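| | row_1 | row_2 | row_3 | row_4 | row_5 | row_6 | row_7 | row_8 | row_9 | row_10 | ... | row_19 | row_20 | row_21 | row_22 | row_23 | row_24 | row_25 | row_26 | row_27 | row_28 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 42768 | 0 | 0 | 0 | 0 | 4 | 4 | 5 | 5 | 5 | 6 | ... | 5 | 5 | 5 | 5 | 5 | 5 | 0 | 0 | 0 | 0 |
| 32671 | 0 | 0 | 0 | 0 | 0 | 8 | 10 | 10 | 9 | 5 | ... | 4 | 4 | 5 | 8 | 10 | 9 | 7 | 0 | 0 | 0 |
| 53406 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 12 | ... | 12 | 12 | 10 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |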


### Non-zero pixel count per column of the 28*28 image

In [10]:
# 28 columns: pixel k belongs to column k % 28
for i in range(28):
    cols = [col for col in all_data.columns if col.startswith('pixel') and int(col.replace('pixel',''))%28==i]
    # Count the non-zero pixels in this column (vectorized, same result as a per-row apply)
    all_data['col_'+str(i+1)] = (all_data[cols] != 0).sum(axis=1)

all_data[[col for col in all_data.columns if col.find('col_')!=-1]].sample(3)

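| | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 | col_10 | ... | col_19 | col_20 | col_21 | col_22 | col_23 | col_24 | col_25 | col_26 | col_27 | col_28 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 67832 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 7 | ... | 11 | 6 | 4 | 3 | 2 | 1 | 0 | 0 | 0 | 0 |
| 24538 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 12 | 13 | ... | 12 | 13 | 11 | 6 | 0 | 0 | 0 | 0 | 0 | 0 |
| 339 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 7 | ... | 8 | 5 | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |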


### Total non-zero pixel count of the 28*28 image

In [11]:
# 28×28: count the non-zero pixels over the whole image
cols = [col for col in all_data.columns if col.startswith('pixel')]
all_data['all_image'] = (all_data[cols] != 0).sum(axis=1)

all_data[['all_image','label']].sample(10)

| | all_image | label |
|---|---|---|
| 37400 | 161 | 2.0 |
| 15676 | 187 | 3.0 |
| 28778 | 151 | 0.0 |
| 5160 | 102 | 1.0 |
| 51526 | 162 | NaN |
| 45716 | 71 | NaN |
| 34948 | 156 | 0.0 |
| 26037 | 189 | 8.0 |
| 11215 | 158 | 6.0 |
| 29719 | 206 | 6.0 |


### Dropping the all-zero columns

In [12]:
all_zero_columns = []
for col in all_data.columns:
    if 0 >= all_data[col].max():
        all_zero_columns.append(col)
print(all_zero_columns)

['pixel0', 'pixel1', 'pixel10', 'pixel11', 'pixel111', 'pixel112', 'pixel140', 'pixel16', 'pixel168', 'pixel17', 'pixel18', 'pixel19', 'pixel2', 'pixel20', 'pixel21', 'pixel22', 'pixel23', 'pixel24', 'pixel25', 'pixel26', 'pixel27', 'pixel28', 'pixel29', 'pixel3', 'pixel30', 'pixel31', 'pixel4', 'pixel476', 'pixel5', 'pixel52', 'pixel53', 'pixel54', 'pixel55', 'pixel56', 'pixel560', 'pixel57', 'pixel6', 'pixel644', 'pixel671', 'pixel672', 'pixel673', 'pixel699', 'pixel7', 'pixel700', 'pixel701', 'pixel727', 'pixel728', 'pixel729', 'pixel730', 'pixel754', 'pixel755', 'pixel756', 'pixel757', 'pixel758', 'pixel759', 'pixel780', 'pixel781', 'pixel782', 'pixel783', 'pixel8', 'pixel82', 'pixel83', 'pixel84', 'pixel85', 'pixel9']


In [13]:
all_data.drop(all_zero_columns, axis=1, inplace=True)
all_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Columns: 777 entries, label to all_image
dtypes: float64(1), int64(776)
memory usage: 415.0 MB


### Variance filtering

In [14]:
print(len(all_data.columns))

777


In [15]:
# Drop features whose variance is below 0.8 * (1 - 0.8) = 0.16
threshold = .8*(1-.8)
all_data_without_target = all_data.drop(['label'], axis=1)
selector = VarianceThreshold(threshold=threshold).fit(all_data_without_target)

for feature,variance in zip(list(all_data_without_target.columns),selector.variances_):
    print(feature, variance, variance>=threshold)
    if variance < threshold:
        all_data.drop(feature, axis=1, inplace=True)

pixel100 2475.481296482449 True
pixel101 2220.1317095426534 True
pixel102 1804.1300272653061 True
pixel103 1309.3671964081632 True
pixel104 785.4129124171432 True
pixel105 430.52276553061216 True
pixel106 192.84726855102042 True
pixel107 76.41804083244898 True
pixel108 28.15855856632653 True
pixel109 3.8632425124489793 True
pixel110 0.6202490691836736 True
pixel113 0.020628276734693873 False
pixel114 0.38702286346938763 True
pixel115 1.0269039997959186 True
pixel116 12.143777370612247 True
pixel117 61.49446267346941 True
pixel118 172.4656764538776 True
pixel119 435.5346495640818 True
pixel12 0.19365390285714287 True
pixel120 884.0021983477551 True
pixel121 1587.8409816326532 True
pixel122 2537.9635518824493 True
pixel123 3670.89860730102 True
pixel124 4890.936816959999 True
pixel125 6140.921724489795 True
pixel126 7094.081583301222 True
pixel127 7531.423256632653 True
pixel128 7344.328434194081 True
pixel129 6585.10996676551 True
pixel13 1.5881263469387745 True
pixel130 5414.4543337712

pixel366 22.15616342122449 True
pixel367 108.81089142530611 True
pixel368 528.8752097885714 True
pixel369 1974.8855712848974 True
pixel37 10.510540112448979 True
pixel370 4692.95610354102 True
pixel371 8005.213757541022 True
pixel372 10754.806793142041 True
pixel373 12014.480619632446 True
pixel374 12128.86238114102 True
pixel375 11731.442853689592 True
pixel376 11304.28728678347 True
pixel377 11830.432641206324 True
pixel378 12969.903320783471 True
pixel379 12479.44901297959 True
pixel38 23.23604999489796 True
pixel380 12087.085742986937 True
pixel381 12587.963386102037 True
pixel382 12454.500613137756 True
pixel383 10959.7183092049 True
pixel384 8647.812102836735 True
pixel385 6515.206592398164 True
pixel386 4569.008698546735 True
pixel387 2690.75079453694 True
pixel388 963.1855183477551 True
pixel389 99.72491508387752 True
pixel39 32.23003302673469 True
pixel390 11.711273564081637 True
pixel391 1.8461989583673473 True
pixel392 0.1824116797959184 True
pixel393 0.9142214234693881 True

pixel617 0.02298500387755101 False
pixel618 28.264718569591835 True
pixel619 190.64470892061226 True
pixel62 9.400016024285712 True
pixel620 702.9368368928572 True
pixel621 1970.4870579671433 True
pixel622 4168.232704718163 True
pixel623 6930.808301799184 True
pixel624 9564.186832651227 True
pixel625 11464.688623118165 True
pixel626 12460.10354120408 True
pixel627 12753.306463672654 True
pixel628 12660.36841366347 True
pixel629 12577.749244732857 True
pixel63 24.762014479795916 True
pixel630 12604.029859118162 True
pixel631 12412.13873355102 True
pixel632 11553.769541382653 True
pixel633 9882.895007165509 True
pixel634 7512.823833387755 True
pixel635 5081.1612385663275 True
pixel636 3054.5680615510205 True
pixel637 1665.6298620277553 True
pixel638 843.5866986938776 True
pixel639 396.69344399 True
pixel64 50.76802486204081 True
pixel640 139.45026295836735 True
pixel641 27.025442857142856 True
pixel642 2.1533157869387765 True
pixel643 0.07405608489795915 False
pixel645 0.0082284538775510

In [16]:
print(len(all_data.columns))

739
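The threshold `.8*(1-.8)` has the form p(1 - p), which is the variance of a 0/1 feature that equals 1 in a fraction p of the samples. The sketch below illustrates the filtering rule used in the loop above (keep a feature when its variance is at least the threshold) on hypothetical toy columns; it uses the population variance, which is assumed here to be what `variances_` reports.

```python
from statistics import pvariance

def variance_filter(columns, threshold):
    """Keep only columns whose population variance is >= threshold.

    columns: dict mapping a feature name to its list of values.
    """
    return {name: values for name, values in columns.items()
            if pvariance(values) >= threshold}

# A 0/1 feature that is 1 in a fraction p of the samples has variance p * (1 - p)
p = 0.8
threshold = p * (1 - p)  # about 0.16

toy = {
    "mostly_constant": [0] * 9 + [1],      # p = 0.1, variance 0.09
    "balanced":        [0] * 5 + [1] * 5,  # p = 0.5, variance 0.25
    "all_zero":        [0] * 10,           # variance 0
}
kept = variance_filter(toy, threshold)
print(sorted(kept))  # ['balanced']
```

Pixel values range from 0 to 255 rather than 0 to 1, so on this dataset a threshold of 0.16 only removes pixels that are almost always zero.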


## Splitting the dataset

In [17]:
train_data, test_data = divide_df(all_data)

In [18]:
from sklearn.model_selection import train_test_split

x,y = train_data.drop(['label'], axis=1),train_data['label']
x_train,x_valid,y_train,y_valid = train_test_split(x,y,test_size=0.3,random_state=0)

x_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29400 entries, 26437 to 2732
Columns: 738 entries, pixel100 to all_image
dtypes: int64(738)
memory usage: 165.8 MB


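## Model building and tuning

### kNN - 0.9659

The dataset is not small, but it is not large either, so the simplest approach is kNN. kNN suits this problem because, for now, all features can be assumed to carry equal weight, the classes are balanced, and the classes differ clearly from one another. The result is already good, but further gains are hard to get: with kNN there is little to tune beyond k and the distance formula (which effectively changes the feature weights). kNN also takes a long time to run, so further improvement requires switching to other algorithms.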

from sklearn.neighbors import KNeighborsClassifier as kNN

knn = kNN()
knn.fit(x_train, y_train)
score = knn.score(x_valid, y_valid)
print('Accuracy:'+str(score))
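The two tuning knobs mentioned for kNN, the value of k and how distance enters the vote, can be illustrated with a minimal pure-Python sketch. It is not the scikit-learn implementation; `knn_predict` and the 2-D toy points are hypothetical. In scikit-learn the corresponding `KNeighborsClassifier` parameters are `n_neighbors` and `weights` (`'uniform'` or `'distance'`).

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+

def knn_predict(train_x, train_y, query, k=5, weighted=False):
    """Classify `query` by a vote among its k nearest training points.

    weighted=False: each neighbour has one vote (uniform weights).
    weighted=True: each neighbour votes with weight 1 / distance.
    """
    neighbours = sorted(
        (dist(x, query), label) for x, label in zip(train_x, train_y)
    )[:k]
    votes = defaultdict(float)
    for d, label in neighbours:
        if weighted:
            votes[label] += 1.0 / (d + 1e-12)  # guard against division by zero
        else:
            votes[label] += 1.0
    return max(votes, key=votes.get)

# Hypothetical toy data: two points of class 'a', a looser cluster of class 'b'
train_x = [(0.0, 0.0), (1.0, 1.0), (3.0, 3.0), (3.0, 4.0), (4.0, 3.0)]
train_y = ['a', 'a', 'b', 'b', 'b']
query = (1.5, 1.5)

print(knn_predict(train_x, train_y, query, k=1))                 # a
print(knn_predict(train_x, train_y, query, k=5))                 # b
print(knn_predict(train_x, train_y, query, k=5, weighted=True))  # a
```

With k=5 and uniform votes the three distant 'b' points outvote the nearby 'a' points, while distance weighting restores 'a'; this shows why both settings are worth searching over on the validation set.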

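### Multiple models

The multi-model cross-validation run logged below was interrupted before it finished, so the best and worst models (xxx and yyy) are still to be filled in. Among the models that did finish, RandomForestClassifier scored highest (0.962619) and GaussianNB lowest (0.573460). As before, no single model fits all of the data, so the models are to be fused in a later step.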

In [None]:
#Machine Learning Algorithm (MLA) Selection and Initialization
MLA = [
    #Ensemble Methods
    ensemble.AdaBoostClassifier(n_estimators=1000, learning_rate=.1),
    #ensemble.GradientBoostingClassifier(n_estimators=200),
    ensemble.RandomForestClassifier(n_estimators=200, n_jobs=-1, max_features = "sqrt", min_samples_split = 5),

    #Gaussian Processes
    #gaussian_process.GaussianProcessClassifier(),
    
    kNN(),
    
    #GLM
    linear_model.LogisticRegressionCV(),
    #linear_model.SGDClassifier(),
    #linear_model.Perceptron(),
    
    #SVM
    #svm.SVC(probability=True),
    
    #Discriminant Analysis
    #discriminant_analysis.LinearDiscriminantAnalysis(),
    
    #xgboost: http://xgboost.readthedocs.io/en/latest/model.html
    XGBClassifier(n_estimators=200, n_jobs=-1)
    ]



cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0 )

MLA_columns = ['MLA Name', 'MLA Parameters','MLA Train Accuracy Mean', 
               'MLA Test Accuracy Mean', 'MLA Test Accuracy 3*STD' ,'MLA Time']
MLA_compare = pd.DataFrame(columns = MLA_columns)

Target = 'label'
MLA_predict = train_data[[Target]].copy()  # DataFrame, so each model's predictions can be added as a column

row_index = 0
for alg in MLA:
    start = datetime.now()
    print('[{}] Running {}'.format(start, alg.__class__.__name__))

    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
    MLA_compare.loc[row_index, 'MLA Index'] = row_index
    MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())
    
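    # return_train_score=True is needed for cv_results['train_score'] in recent scikit-learn versions
    cv_results = model_selection.cross_validate(alg, train_data.drop(Target, axis=1), train_data[Target], cv=cv_split, return_train_score=True)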

    MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()
    MLA_compare.loc[row_index, 'MLA Train Accuracy Mean'] = cv_results['train_score'].mean()
    MLA_compare.loc[row_index, 'MLA Test Accuracy Mean'] = cv_results['test_score'].mean()   
    MLA_compare.loc[row_index, 'MLA Test Accuracy 3*STD'] = cv_results['test_score'].std()*3   #let's know the worst that can happen!
    

    alg.fit(train_data.drop(Target, axis=1), train_data[Target])
    MLA_predict[MLA_name] = alg.predict(train_data.drop(Target, axis=1))
    
    row_index+=1
    
    end = datetime.now()
    print('[{}] Finished Running {} in {:.2f}s'.format(end, MLA_name, (end - start).total_seconds()))
    print('[{}] {} Mean Accuracy: {:.6f} / Std: {:.6f}\n'.format(datetime.now(), MLA_name, cv_results['test_score'].mean(), cv_results['test_score'].std()*3))

    
MLA_compare.sort_values(by = ['MLA Test Accuracy Mean'], ascending = False, inplace = True)
MLA_compare

[2019-09-22 22:32:11.455143] Running AdaBoostClassifier
[2019-09-22 23:03:06.315650] Finished Running AdaBoostClassifier in 1854.86s
[2019-09-22 23:03:06.315776] AdaBoostClassifier Mean Accuracy: 0.668857 / Std: 0.189710

[2019-09-22 23:03:06.315913] Running RandomForestClassifier
[2019-09-22 23:05:56.862900] Finished Running RandomForestClassifier in 170.55s
[2019-09-22 23:05:56.863309] RandomForestClassifier Mean Accuracy: 0.962619 / Std: 0.004616

[2019-09-22 23:05:56.863742] Running LogisticRegressionCV
[2019-09-23 00:30:02.866444] Finished Running LogisticRegressionCV in 5046.00s
[2019-09-23 00:30:02.866642] LogisticRegressionCV Mean Accuracy: 0.908452 / Std: 0.005342

[2019-09-23 00:30:02.866879] Running GaussianNB
[2019-09-23 00:30:25.139166] Finished Running GaussianNB in 22.27s
[2019-09-23 00:30:25.139294] GaussianNB Mean Accuracy: 0.573460 / Std: 0.028961

[2019-09-23 00:30:25.139434] Running SVC


In [None]:
1/0

### Model fusion (TODO)

- Voting;
- Weighted fusion;
- Meta-model fusion (stacking);
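Since the fusion step is still a TODO, the following is only a sketch of what the first two options could look like, operating on lists of predicted labels. The function names, the per-model predictions, and the weights are all hypothetical; in practice the weights might come from each model's validation accuracy. Meta-model fusion needs a trained second-level model and is not sketched here.

```python
from collections import Counter, defaultdict

def majority_vote(predictions):
    """predictions: list of per-model label lists; returns the most common label per sample."""
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*predictions)]

def weighted_vote(predictions, weights):
    """Like majority_vote, but model i's vote counts with weight weights[i]."""
    fused = []
    for sample in zip(*predictions):
        score = defaultdict(float)
        for label, w in zip(sample, weights):
            score[label] += w
        fused.append(max(score, key=score.get))
    return fused

# Hypothetical predictions of three models on four images
model_a = [3, 7, 1, 9]
model_b = [3, 1, 1, 4]
model_c = [8, 1, 2, 4]

print(majority_vote([model_a, model_b, model_c]))                   # [3, 1, 1, 4]
print(weighted_vote([model_a, model_b, model_c], [0.6, 0.2, 0.2]))  # [3, 7, 1, 9]
```

The second call shows the effect of the weights: a model holding more than half of the total weight overrides the other two whenever they disagree with it.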

#### Simple voting

#### Weighted fusion

#### Meta-model fusion

- AdaBoostClassifier
- RandomForestClassifier
- XGBClassifier
- SVC

In [None]:
#model_data

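## Generating and submitting results

1. V-1.0 (0917): no preprocessing and no feature engineering apart from dropping the all-zero columns; trained and predicted with kNN using default parameters;
    - Score: 0.966;
    - Rank: Top 81%;
2. V-2.0 (0921): added three kinds of new features: non-zero count per row, non-zero count per column, and total non-zero count;
    - Score: 0.966;
    - Rank: Top 81%;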

In [None]:
result = knn.predict(test_data)
result[:10]

In [None]:
[idx-41999 for idx in list(test_data.index)]

In [None]:
pd.DataFrame({'ImageId':[idx-41999 for idx in list(test_data.index)], 'Label':[int(label) for label in list(result)]}).to_csv('output/submission-digit-0921-V-2.0-with-new-feature.csv', index=False)
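The `idx-41999` expression turns the index of the concatenated frame into a 1-based ImageId: the test rows start at index 42000, so the first one becomes 1. A minimal standard-library sketch of the same mapping is shown below; `submission_rows` is a hypothetical helper and the predictions are made-up values.

```python
import csv
import io

TRAIN_SIZE = 42000  # number of training rows placed before the test rows

def submission_rows(test_index, predictions, train_size=TRAIN_SIZE):
    """Yield (ImageId, Label) pairs; ImageId is 1-based within the test set."""
    for idx, label in zip(test_index, predictions):
        yield idx - train_size + 1, int(label)

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["ImageId", "Label"])
writer.writerows(submission_rows([42000, 42001, 42002], [2.0, 0.0, 9.0]))
print(buffer.getvalue())
```

Casting the label to `int` matters because the `label` column became float after concatenation with the unlabeled test rows.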