# Linear Classifier

## Benign/Malignant Tumor Prediction

In [3]:
# Import pandas and numpy
import pandas as pd
import numpy as np

# The structure of the dataset must be known in advance; build the list of column names here
column_names=['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion',
              'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']

# Read the data from the internet with pandas.read_csv
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data',
                   names = column_names)

# The dataset marks missing values with '?'; replace them with the standard missing-value marker (NaN)
# and drop every row that is missing a value in any column
data = data.replace(to_replace = '?', value = np.nan)
data = data.dropna(how = 'any')

# Show the number of rows and columns in data
data.shape

(683, 11)
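The cleaning step can be tried on a small table first. The sketch below uses a made-up three-row DataFrame (the values are assumptions for illustration, not rows from the real dataset) and the same `replace` and `dropna` calls. It also converts the cleaned column to integers, since a column that once held `'?'` is read in as strings.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one row has '?' in 'Bare Nuclei'
toy = pd.DataFrame({
    'Sample code number': [1, 2, 3],
    'Bare Nuclei': ['1', '?', '10'],
    'Class': [2, 4, 4],
})

# Replace '?' with NaN, drop rows with any NaN, then cast the string column to int
cleaned = (toy.replace(to_replace='?', value=np.nan)
              .dropna(how='any')
              .astype({'Bare Nuclei': int}))

print(cleaned.shape)                    # the row containing '?' is gone
print(cleaned['Bare Nuclei'].tolist())  # the remaining values are now integers
```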

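Recap and next step:

- After cleaning, 683 samples without missing values remain. There are 9 feature dimensions, each already quantized to a value between 1 and 10. The first column (sample code number) and the last column (class label) are not features.


- Next, the data is split. A common choice is a 1:3 ratio of test to training samples (25% test, 75% training).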

In [6]:
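# Prepare the training data: split it with train_test_split, randomly sampling 25% for testing.
# sklearn.cross_validation was removed in scikit-learn 0.20; the function now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split as tts
X_train, X_test, y_train, y_test = tts(data[column_names[1:10]], data[column_names[10]], test_size = 0.25, random_state = 33)
# random_state seeds the random shuffling, so the same split is reproduced on every run.

# Check the size and class distribution of the training samples
print(y_train.value_counts())

# Then the size and class distribution of the test samples
print(y_test.value_counts())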

2    344
4    168
Name: Class, dtype: int64
2    100
4     71
Name: Class, dtype: int64


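The output above means:

- There are 512 training samples: 344 benign (class 2) and 168 malignant (class 4).


- There are 171 test samples: 100 benign and 71 malignant.

Next, two methods are applied: Logistic Regression and parameter estimation by stochastic gradient descent.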
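The role of `random_state` can be illustrated without scikit-learn. The sketch below is not scikit-learn's implementation; `split_indices` is a hypothetical helper that shuffles row indices with a seeded generator and cuts off the test share, rounding the test size up (an assumption that happens to reproduce the 512/171 counts above).

```python
import math
import random

def split_indices(n_samples, test_size, seed):
    """Hypothetical helper: shuffle indices with a seeded RNG, then split them."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # same seed -> same shuffle
    n_test = math.ceil(n_samples * test_size) # round the test share up
    return indices[n_test:], indices[:n_test]

train_a, test_a = split_indices(683, 0.25, seed=33)
train_b, test_b = split_indices(683, 0.25, seed=33)
train_c, test_c = split_indices(683, 0.25, seed=34)

print(len(train_a), len(test_a))  # sizes of the two parts
print(test_a == test_b)           # same seed: identical split
print(test_a == test_c)           # different seed: a different split
```

Fixing the seed matters when results are compared across runs: any change in accuracy then comes from the model, not from a different partition of the data.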

In [10]:
# Import StandardScaler from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler

# Import LogisticRegression and SGDClassifier from sklearn.linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

# Standardize the data so that every feature dimension has mean 0 and variance 1.
# This keeps features with large values from dominating the prediction.
ss = StandardScaler()
# Fit the scaler on the training set only, then apply the same transformation to the test set
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

# Initialize LogisticRegression and SGDClassifier.
# Note: SGDClassifier defaults to the hinge loss (a linear SVM), not the logistic loss.
lr = LogisticRegression()
sgdc = SGDClassifier()

# Train both models and predict on the test set
lr.fit(X_train, y_train)
lr_y_predict = lr.predict(X_test)
sgdc.fit(X_train, y_train)
sgdc_y_predict = sgdc.predict(X_test)
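
What standardization computes can be sketched with plain numpy. The arrays below are made-up numbers, not the tumor data. The mean and standard deviation come from the training rows only and are then reused for the test rows, which is the pattern of calling `fit_transform` on the training set and `transform` on the test set.

```python
import numpy as np

# Toy data (hypothetical values): two features on very different scales
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
test = np.array([[3.0, 20.0]])

# Statistics are estimated from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0))  # approximately 0 for each feature
print(train_scaled.std(axis=0))   # approximately 1 for each feature
print(test_scaled)
```

Estimating the statistics again on the test set would put the two sets on different scales and let information about the test data influence preprocessing.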

- With training finished, the results can be analyzed below.


- Besides accuracy, three other evaluation metrics are reported: precision, recall, and F1-score.
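
The three metrics follow directly from the counts of true positives, false positives, and false negatives. In the sketch below, `precision_recall_f1` is a hypothetical helper, and the counts for the Malignant class are an assumption: they were chosen to be consistent with the rounded Logistic Regression report in this notebook, not read from an actual confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Hypothetical helper computing the three metrics for one class."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Assumed counts, treating Malignant as the positive class (inferred, not measured)
tp, fp, fn, tn = 67, 1, 4, 99

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(precision, 2), round(recall, 2), round(f1, 2))
print(accuracy)
```

For tumor screening, recall on the malignant class deserves particular attention, because a false negative is a missed malignant tumor.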

In [23]:
# Import classification_report from sklearn.metrics
from sklearn.metrics import classification_report

# Accuracy from the model's built-in score method
print('Accuracy of Logistic Regression Classifier:', lr.score(X_test, y_test))

# Precision, recall, and F1-score for each class
print(classification_report(y_test, lr_y_predict, target_names = ['Benign', 'Malignant']))

Accuracy of Logistic Regression Classifier: 0.9707602339181286
             precision    recall  f1-score   support

     Benign       0.96      0.99      0.98       100
  Malignant       0.99      0.94      0.96        71

avg / total       0.97      0.97      0.97       171



In [24]:
# Accuracy from the model's built-in score method
print('Accuracy of SGD Classifier:', sgdc.score(X_test, y_test))

# Precision, recall, and F1-score for each class
print(classification_report(y_test, sgdc_y_predict, target_names = ['Benign', 'Malignant']))

Accuracy of SGD Classifier: 0.9649122807017544
             precision    recall  f1-score   support

     Benign       0.96      0.98      0.97       100
  Malignant       0.97      0.94      0.96        71

avg / total       0.97      0.96      0.96       171



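From the results above:

- Logistic Regression is slightly more accurate on the test data (0.971 vs. 0.965). A likely reason is that LogisticRegression runs an iterative solver over the full training set until it converges (logistic regression has no closed-form solution), whereas SGDClassifier estimates the parameters from noisy updates computed on one sample at a time.


- In general, LogisticRegression takes longer to train on large datasets but tends to be somewhat more accurate, while SGDClassifier trades a little accuracy for speed and scales better to very large datasets.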
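
The difference between full-batch and stochastic parameter estimation can be sketched in numpy. This is a stand-in, not what scikit-learn runs internally: `fit_batch` uses plain gradient descent on the logistic loss, `fit_sgd` updates on one sample at a time, and the two-cluster data, learning rate, and epoch counts are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_batch(X, y, lr=0.1, epochs=200):
    """Gradient descent: each step uses the gradient over ALL samples."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w

def fit_sgd(X, y, lr=0.1, epochs=20, seed=0):
    """Stochastic gradient descent: each step uses ONE randomly ordered sample."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            grad = (sigmoid(X[i] @ w) - y[i]) * X[i]
            w -= lr * grad
    return w

def accuracy(w, X, y):
    return np.mean((sigmoid(X @ w) >= 0.5) == y)

# Synthetic data (assumption): two Gaussian clusters centered at (-2, -2) and (2, 2)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-2.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

acc_batch = accuracy(fit_batch(X, y), X, y)  # accuracy on the training data itself
acc_sgd = accuracy(fit_sgd(X, y), X, y)
print(acc_batch, acc_sgd)
```

Each batch step costs a pass over the whole dataset, while each stochastic step touches a single row. That is why stochastic updates scale to data that is too large to process in full at every iteration, at the price of noisier parameter estimates.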