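We have covered how decision trees work, how to apply them, and how to tune their parameters.

Earlier we also studied logistic regression and support vector machines (SVM).

Now use the wine dataset as the object of study: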

```Python
from sklearn import datasets
wine = datasets.load_wine()
```

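Assignment requirements:

* Inspect the wine data
* Scale the features (the solution uses standardization)
* Build classification models on the wine data with different algorithms (split the data into training and test sets)
* Train multiple times and average the scores
* Compare and reflect on the differences among the three machine learning algorithms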



# Differences between classification algorithms

In [1]:
import numpy as np 
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

from sklearn import datasets

## 1. Load the data

In [3]:
wine = datasets.load_wine() 

'feature_names': ['alcohol',
  'malic_acid',
  'ash',
  'alcalinity_of_ash',
  'magnesium',
  'total_phenols',
  'flavanoids',
  'nonflavanoid_phenols',
  'proanthocyanins',
  'color_intensity',
  'hue',
  'od280/od315_of_diluted_wines',
  'proline']

In [4]:
from sklearn.model_selection import train_test_split

X = wine['data']
y = wine['target']

In [5]:
np.set_printoptions(suppress=True)
X

array([[  14.23,    1.71,    2.43, ...,    1.04,    3.92, 1065.  ],
       [  13.2 ,    1.78,    2.14, ...,    1.05,    3.4 , 1050.  ],
       [  13.16,    2.36,    2.67, ...,    1.03,    3.17, 1185.  ],
       ...,
       [  13.27,    4.28,    2.26, ...,    0.59,    1.56,  835.  ],
       [  13.17,    2.59,    2.37, ...,    0.6 ,    1.62,  840.  ],
       [  14.13,    4.1 ,    2.74, ...,    0.61,    1.6 ,  560.  ]])
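The first assignment item asks for a look at the data itself. A minimal sketch of such an inspection is below; the variable names `class_counts` and `feature_ranges` are introduced here for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn import datasets

wine = datasets.load_wine()
X, y = wine['data'], wine['target']

# Number of samples and number of features
print(X.shape)

# Number of samples in each of the three classes
class_counts = np.bincount(y)
print(dict(zip(wine['target_names'], class_counts)))

# Per-feature minimum and maximum: the features live on very different scales
feature_ranges = X.max(axis=0) - X.min(axis=0)
for name, lo, hi in zip(wine['feature_names'], X.min(axis=0), X.max(axis=0)):
    print(f'{name:30s} min={lo:8.2f} max={hi:8.2f}')
```

The large gap between the range of `proline` and that of the other features is the motivation for the scaling step that follows.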

## 2. Feature scaling (standardization)

In [6]:
from sklearn.preprocessing import StandardScaler

standard = StandardScaler()
X = standard.fit_transform(X)
X

array([[ 1.51861254, -0.5622498 ,  0.23205254, ...,  0.36217728,
         1.84791957,  1.01300893],
       [ 0.24628963, -0.49941338, -0.82799632, ...,  0.40605066,
         1.1134493 ,  0.96524152],
       [ 0.19687903,  0.02123125,  1.10933436, ...,  0.31830389,
         0.78858745,  1.39514818],
       ...,
       [ 0.33275817,  1.74474449, -0.38935541, ..., -1.61212515,
        -1.48544548,  0.28057537],
       [ 0.20923168,  0.22769377,  0.01273209, ..., -1.56825176,
        -1.40069891,  0.29649784],
       [ 1.39508604,  1.58316512,  1.36520822, ..., -1.52437837,
        -1.42894777, -0.59516041]])
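To confirm what `StandardScaler` does, one can check that every column ends up with mean close to 0 and standard deviation close to 1. The sketch below also shows an alternative that the notebook does not use: wrapping the scaler and the model in a pipeline so that the scaling statistics are learned from the training split only. The names `X_raw`, `X_std`, and `pipe` are assumptions made for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y_all = datasets.load_wine(return_X_y=True)

# After standardization each column has mean ~0 and standard deviation ~1
X_std = StandardScaler().fit_transform(X_raw)
print(X_std.mean(axis=0).round(6))
print(X_std.std(axis=0).round(6))

# Alternative: fit the scaler on the training split only, inside a pipeline
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_all, test_size=0.2, random_state=0, stratify=y_all)
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)          # scaler statistics come from X_tr only
pipe_score = pipe.score(X_te, y_te)
print(pipe_score)
```

Scaling the full dataset before splitting, as the notebook does, lets test-set statistics influence the transform. On a dataset this small the effect is likely minor, but the pipeline form avoids the question.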

## 3. LogisticRegression 

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [9]:
%%time
score = 0
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2) 
    lr = LogisticRegression()
    lr.fit(X_train,y_train)
    s = lr.score(X_test, y_test)
    score += s/100 # accumulate the average over the 100 iterations
print('Mean LR score over repeated runs:',score)

Mean LR score over repeated runs: 0.9769444444444446
CPU times: total: 719 ms
Wall time: 730 ms
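The manual loop above reports only the mean, and its result changes on every run because the splits are unseeded. As a hedged alternative, the same "repeat and average" idea can be expressed with `ShuffleSplit` and `cross_val_score`, which also gives the spread of the scores. This is a sketch, not the notebook's method; `random_state=0` is an arbitrary choice.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.preprocessing import StandardScaler

X_raw, y_all = datasets.load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X_raw)

# 100 random 80/20 splits; the fixed random_state makes the run repeatable
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
scores = cross_val_score(LogisticRegression(), X_std, y_all, cv=cv)

print('mean accuracy:', scores.mean())
print('std of accuracy:', scores.std())
```

Reporting the standard deviation alongside the mean helps judge whether a difference between two algorithms is larger than the split-to-split noise.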


## 4. SVC

In [11]:
%%time
score = 0
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2) 
    model = SVC()
    model.fit(X_train,y_train)
    s = model.score(X_test, y_test)
    score += s/100 # accumulate the average over the 100 iterations
print('Mean SVC score over repeated runs:',score)

Mean SVC score over repeated runs: 0.9786111111111112
CPU times: total: 250 ms
Wall time: 322 ms


## 5. 决策树

In [12]:
%%time
score = 0
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2) 
    model = DecisionTreeClassifier()
    model.fit(X_train,y_train)
    s = model.score(X_test, y_test)
    score += s/100 # accumulate the average over the 100 iterations
print('Mean decision tree score over repeated runs:',score)

Mean decision tree score over repeated runs: 0.902777777777777
CPU times: total: 312 ms
Wall time: 356 ms


## 6. Summary and comparison of the algorithms

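- Decision trees are insensitive to feature scaling, because each split thresholds a single feature
- Logistic regression: without scaling, accuracy drops and running time increases, since the solver needs more iterations to converge
- SVC (support vector machine): without scaling, accuracy drops sharply, because the default RBF kernel depends on distances that large-valued features such as proline dominate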
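One way to check how each model reacts to scaling is to score it on the raw and on the standardized features with identical splits. The sketch below assumes 20 seeded `ShuffleSplit` repetitions, which is a choice made for this example rather than the notebook's setup; exact numbers will vary with the scikit-learn version.

```python
import warnings
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

warnings.filterwarnings('ignore')  # hide convergence warnings on unscaled data

X_raw, y_all = datasets.load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X_raw)

models = {
    'DecisionTree': DecisionTreeClassifier(random_state=0),
    'LogisticRegression': LogisticRegression(),
    'SVC': SVC(),
}
# An integer random_state yields the same splits for every call below
cv = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)

results = {}
for name, model in models.items():
    raw = cross_val_score(model, X_raw, y_all, cv=cv).mean()
    std = cross_val_score(model, X_std, y_all, cv=cv).mean()
    results[name] = (raw, std)
    print(f'{name:20s} unscaled={raw:.3f} standardized={std:.3f}')
```

Because both runs of each model see the same splits, any gap between the two columns can be attributed to scaling rather than to a lucky split.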

In [13]:
model = DecisionTreeClassifier()
model.fit(X_train,y_train)
model.feature_importances_

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.14926013, 0.        , 0.        , 0.35875944,
       0.        , 0.        , 0.49198043])
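The raw importance array is easier to read when paired with the feature names. A small sketch follows, assuming a tree fitted on the full dataset with `random_state=0`; its values will therefore differ from the output above, which came from one random training split.

```python
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
tree = DecisionTreeClassifier(random_state=0)
tree.fit(wine['data'], wine['target'])

importances = tree.feature_importances_

# List the features the tree actually used, most important first
order = np.argsort(importances)[::-1]
for idx in order:
    if importances[idx] > 0:
        print(f"{wine['feature_names'][idx]:30s} {importances[idx]:.3f}")
```

The importances are non-negative and sum to 1, so each value can be read as that feature's share of the total impurity reduction.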

In [14]:
model = LogisticRegression()
model.fit(X_train,y_train)
model.coef_  # the returned coefficients do not directly indicate feature importance

array([[ 0.68347141,  0.17471074,  0.20895509, -0.8001233 , -0.05713168,
         0.13885993,  0.61436679, -0.26941507,  0.22493812,  0.25171584,
         0.23125522,  0.60200664,  1.05496944],
       [-0.88638539, -0.37532645, -0.63110906,  0.6175745 ,  0.02793696,
         0.08962394,  0.32787812,  0.2601362 ,  0.21339307, -0.99315641,
         0.65919895,  0.03637603, -1.03241368],
       [ 0.20291398,  0.20061571,  0.42215396,  0.1825488 ,  0.02919472,
        -0.22848387, -0.94224491,  0.00927887, -0.4383312 ,  0.74144058,
        -0.89045417, -0.63838267, -0.02255576]])

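- Of the three models, only the decision tree exposes feature_importances_
- The coefficients returned by LR (one row per class) do not directly indicate whether a feature is important
- In linear regression, the absolute value of a coefficient can indicate feature importance, provided the features are on a comparable scale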
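For models without feature_importances_, a model-agnostic option is permutation importance from `sklearn.inspection`: shuffle one feature at a time and measure how much the test score drops. The sketch below applies it to an SVC; the 70/30 split, `n_repeats=20`, and the seeds are assumptions made for this example and are not taken from the notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = datasets.load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(
    wine['data'], wine['target'], test_size=0.3, random_state=0,
    stratify=wine['target'])

# Scale inside the pipeline so the SVC sees standardized features
svc = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)

# Mean drop in test accuracy when each feature is shuffled 20 times
perm = permutation_importance(svc, X_te, y_te, n_repeats=20, random_state=0)

# Show the five features whose shuffling hurts accuracy the most
for idx in np.argsort(perm.importances_mean)[::-1][:5]:
    print(f"{wine['feature_names'][idx]:30s} "
          f"{perm.importances_mean[idx]:.3f} +/- {perm.importances_std[idx]:.3f}")
```

Since the same procedure works for any fitted estimator, it allows a like-for-like comparison of which features the three models rely on.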