<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#简单概念回顾" data-toc-modified-id="简单概念回顾-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Quick concept review</a></span></li><li><span><a href="#Sklearn的设计概述" data-toc-modified-id="Sklearn的设计概述-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Overview of sklearn's design</a></span></li><li><span><a href="#机器学习流程" data-toc-modified-id="机器学习流程-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Machine learning workflow</a></span></li><li><span><a href="#简单的sklearn-API套路" data-toc-modified-id="简单的sklearn-API套路-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>The basic sklearn API pattern</a></span></li><li><span><a href="#Preparing-data" data-toc-modified-id="Preparing-data-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Preparing data</a></span></li><li><span><a href="#数据处理（上）" data-toc-modified-id="数据处理（上）-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Data preprocessing (Part 1)</a></span><ul class="toc-item"><li><span><a href="#Standardization,-or-mean-removal-and-variance-scaling" data-toc-modified-id="Standardization,-or-mean-removal-and-variance-scaling-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Standardization, or mean removal and variance scaling</a></span></li><li><span><a href="#Normalization" data-toc-modified-id="Normalization-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Normalization</a></span></li><li><span><a href="#Binarization（离散化）" data-toc-modified-id="Binarization（离散化）-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>Binarization (discretization)</a></span></li><li><span><a href="#Encoding-categorical-features" data-toc-modified-id="Encoding-categorical-features-6.4"><span class="toc-item-num">6.4&nbsp;&nbsp;</span>Encoding categorical features</a></span></li></ul></li></ul></div>

# sklearn Basics, Part 1

Quick-start reference documentation:  
  
>  https://sklearn.apachecn.org/docs/0.21.3/      
>  https://sklearn.apachecn.org/docs/0.21.3/50.html

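## Quick concept review

**Supervised vs. unsupervised learning**  
> * The key difference is whether the data comes with labels  
> * Industrial applications mostly rely on supervised learning

**Classification and regression tasks**  
> * If a linear model does the job, prefer it to a nonlinear one (nonlinear models overfit more easily and cost more to compute)  

**Model evaluation**  
> * accuracy: rarely used on its own; misleading when classes are imbalanced  
> * recall and precision: there is a trade-off between the two  
> * F1-score: the harmonic mean of precision and recall, balancing both  
> * AUC: area under the ROC curve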
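
To make the metrics above concrete, here is a minimal sketch using `sklearn.metrics`. The labels and scores are hand-written toy values (an assumption for illustration, not results from any dataset in these notes). It also shows how a baseline that always predicts the majority class can reach 80% accuracy on imbalanced data while its recall is zero.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Toy labels: 8 negatives and 2 positives (imbalanced on purpose)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Hypothetical model scores; predictions use a 0.5 threshold
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.35])
y_pred = (y_score > 0.5).astype(int)

acc = accuracy_score(y_true, y_pred)      # share of correct predictions
prec = precision_score(y_true, y_pred)    # TP / (TP + FP)
rec = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)             # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)      # uses scores, not thresholded labels

# Baseline that always predicts the majority class
y_majority = np.zeros_like(y_true)
acc_majority = accuracy_score(y_true, y_majority)
rec_majority = recall_score(y_true, y_majority)

print(acc, prec, rec, f1, auc)
print(acc_majority, rec_majority)
```

Both the toy model and the majority baseline score 0.8 on accuracy here, which is why precision, recall, F1 and AUC are the more informative numbers when classes are imbalanced.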

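**Feature processing (feature engineering)**  
> * The core factor that determines how well a model performs  
> * Depends on domain (business) experience  
> * Requires familiarity with the relevant tools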

## Overview of sklearn's design

**Official documentation:**  
https://scikit-learn.org/stable/

* Classification
* Regression
* Clustering
* Dimensionality reduction
* Model selection
* Preprocessing

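## Machine learning workflow

* **Get the data**
> Web scraping  
> Databases  
> Data files (csv, excel, txt)  
* **Process the data**
> Text processing  
> Putting features on a consistent scale  
> Dimensionality reduction  
* **Build the model**
> Classification  
> Regression  
> Clustering  
* **Evaluate the model**
> Hyperparameter tuning  
> Comparing which model is better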

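## The basic sklearn API pattern

* fit: train the model (for a transformer, learn the statistics the transformation needs)  
* transform: apply the fitted transformation to the data and return the transformed data  
* predict: return the model's predictions  
* predict_proba: return predicted class probabilities  
* score: the estimator's default metric (accuracy for classifiers, R² for regressors); the default accuracy is rarely what you want, so a metric such as F1 is usually computed instead  
* get_params: get the estimator's parameters  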
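
The method list above can be sketched end to end as follows. This is a minimal illustration on synthetic data from `make_classification`; the choice of `LogisticRegression`, the dataset sizes and the `random_state` values are assumptions made for the example, not part of the original notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Transformer: fit learns statistics on the training set, transform applies them
scaler = StandardScaler()
scaler.fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Estimator: fit trains, predict / predict_proba produce outputs
clf = LogisticRegression()
clf.fit(X_train_s, y_train)
y_pred = clf.predict(X_test_s)           # hard class labels
y_proba = clf.predict_proba(X_test_s)    # one probability column per class

acc = clf.score(X_test_s, y_test)        # default metric for classifiers: accuracy
f1 = f1_score(y_test, y_pred)            # computed separately when F1 is preferred
params = clf.get_params()                # dict of constructor parameters

print(acc, f1)
print(params["C"])
```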

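## Preparing data

**Splitting the dataset:**  
> * Training data (70%)  
> * Validation data
> * Testing data (30%)  

In practice, the split is usually **not fully random: data that has already happened (earlier in time) is used as the training set to predict future (later) data**. Otherwise, training on future data to predict the past leaks information that would not have been available at prediction time, which is not legitimate and tends to make evaluation results look better than they really are.  
  
Other split criteria are also used, for example splitting by region.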
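
A time-ordered split like the one described above could look like the following sketch. The DataFrame, its column names (`date`, `feature`, `target`) and the 70/30 cut are hypothetical, chosen only to illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical daily event log; all column names are made up for this example
rng = np.random.default_rng(0)
events = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=100, freq="D"),
    "feature": rng.normal(size=100),
    "target": rng.integers(0, 2, size=100),
})

# Sort chronologically, then cut once: the earliest 70% for training,
# the most recent 30% for testing
events = events.sort_values("date").reset_index(drop=True)
cut = int(len(events) * 0.7)
train = events.iloc[:cut]
test = events.iloc[cut:]

# Every training row precedes every test row, so no future data leaks in
print(train["date"].max(), test["date"].min())
```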

## Data preprocessing (Part 1)

Datasets: [ML DATASETS](http://archive.ics.uci.edu/ml/)

### Standardization, or mean removal and variance scaling

In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, **_then scale it by dividing non-constant features by their standard deviation_**.

For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

[Should I normalize/standardize/rescale the data](http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html "Should I normalize/standardize/rescale the data")

[**StandardScaler**](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)

The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set.
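
As a small check of what "reapply the same transformation" means, the sketch below compares `StandardScaler` with the manual formula `(x - mean) / std`, using the mean and standard deviation of the training set only. The arrays are toy values invented for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative values only)
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_test = np.array([[4.0, 30.0]])

ss = StandardScaler().fit(X_train)

# Statistics come from the training set; std is the population std (ddof=0)
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# The test set is transformed with the training statistics, not its own
manual_test = (X_test - mu) / sigma
sk_test = ss.transform(X_test)

print(ss.mean_, ss.scale_)
print(sk_test, manual_test)
```

Because the test set is scaled with training statistics, its transformed columns generally have a mean close to, but not exactly, 0 and a standard deviation close to, but not exactly, 1.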


[**MinMaxScaler**](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler)

An alternative is to scale each feature to lie between a given minimum and maximum value, often zero and one (MinMaxScaler), or so that the maximum absolute value of each feature is scaled to unit size (MaxAbsScaler).
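
The two scalings described above can be written out by hand and compared with scikit-learn. This is a minimal sketch on toy values; `MaxAbsScaler` is the scikit-learn class for the "maximum absolute value" variant.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler

# Toy data (illustrative values only)
X = np.array([[-2.0, 10.0], [0.0, 20.0], [4.0, 50.0]])

# Min-max scaling with the default feature_range=(0, 1):
# (x - min) / (max - min), computed per column
X_mm = MinMaxScaler().fit_transform(X)
manual_mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Max-abs scaling: x / max(|x|), computed per column; zeros stay zeros
X_ma = MaxAbsScaler().fit_transform(X)
manual_ma = X / np.abs(X).max(axis=0)

print(X_mm)
print(X_ma)
```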

In [1]:
import numpy as np 
import pandas as pd 

In [2]:
df  = pd.read_csv("forestfires.csv")
df.head()

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0


In [7]:
df1 = df.loc[:,"FFMC":"rain"]
df1.head()

Unnamed: 0,FFMC,DMC,DC,ISI,temp,RH,wind,rain
0,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0
1,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0
2,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0
3,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2
4,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0


In [13]:
from sklearn.model_selection import train_test_split

# 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(df1.astype(float), df['area'], test_size=0.3)

In [15]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

ss.fit(X_train.astype(float))

X_train_ss = ss.transform(X_train)
X_test_ss = ss.transform(X_test)

In [16]:
X_test_ss

array([[ 0.28679864,  0.71681755,  0.48046958, ...,  0.83158581,
         0.25320833, -0.06207709],
       [ 0.19722059,  0.00979696,  0.90069328, ..., -0.20254633,
        -1.21410535, -0.06207709],
       [ 0.2151362 ,  0.10417301,  0.49278237, ..., -0.02005242,
         0.25320833, -0.06207709],
       ...,
       [ 0.05389571, -1.05393312, -1.7787293 , ..., -0.14171503,
        -1.70320991, -0.06207709],
       [ 0.4122079 , -0.55326016,  0.69812798, ..., -0.38504024,
        -0.01851642, -0.06207709],
       [ 0.37637668, -0.98035179,  0.61392436, ..., -0.56753414,
        -1.70320991, -0.06207709]])

In [17]:
print(X_test.shape, X_test_ss.shape)

(156, 8) (156, 8)


In [19]:
X_test_ss.mean(axis=0)

array([ 0.08639647,  0.16353231,  0.14095374,  0.01202239,  0.0823483 ,
       -0.0083541 , -0.02966411,  0.00596895])

In [20]:
X_test_ss.std(axis=0)

array([0.95695186, 1.06717786, 0.939732  , 0.85642884, 0.96857897,
       0.97203551, 0.90614731, 0.45269417])

In [21]:
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()

mms.fit(X_train)

MinMaxScaler(copy=True, feature_range=(0, 1))

In [22]:
mms.transform(X_train)

array([[0.94064516, 0.35527223, 0.5476721 , ..., 0.44705882, 0.15555556,
        0.        ],
       [0.9316129 , 0.43211578, 0.79582503, ..., 0.6       , 0.3       ,
        0.        ],
       [0.96516129, 0.51068229, 0.84519761, ..., 0.28235294, 0.5       ,
        0.        ],
       ...,
       [0.94967742, 0.30254997, 0.57194793, ..., 0.14117647, 0.3       ,
        0.        ],
       [0.90451613, 0.50379049, 0.71138736, ..., 0.6       , 0.55555556,
        0.        ],
       [0.94193548, 0.11957271, 0.08549314, ..., 0.14117647, 0.55555556,
        0.        ]])

###  Normalization

Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

This assumption is the base of the Vector Space Model often used in text classification and clustering contexts.

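The Normalizer class also exposes fit, transform and the other usual transformer API methods, but fit does nothing meaningful for it: normalization rescales each sample independently, so there are no statistics to learn across samples. The methods exist only so that a Normalizer can be passed to sklearn APIs such as Pipeline with a consistent interface.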
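
The sketch below illustrates both points on toy data: a Normalizer fitted on unrelated data gives the same result as dividing each row by its own L2 norm, and the object drops into a pipeline like any other transformer. The arrays, labels and the `LogisticRegression` final step are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Toy samples and labels (illustrative values only)
X = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0], [0.0, 2.0]])
y = np.array([0, 1, 0, 1])

# Fit on completely unrelated data: nothing is learned from it
norm = Normalizer(norm="l2").fit(np.array([[100.0, -5.0]]))
X_norm = norm.transform(X)

# Manual equivalent: divide each row by its own L2 norm
manual = X / np.linalg.norm(X, axis=1, keepdims=True)

# Same fit/transform interface, so it can be a pipeline step
pipe = make_pipeline(Normalizer(), LogisticRegression())
pipe.fit(X, y)
pred = pipe.predict(X)

print(X_norm)
print(pred)
```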

In [23]:
from sklearn.preprocessing import Normalizer

norm = Normalizer()

In [28]:
norm.fit(X_train)
X_train_norm = norm.transform(X_train)
X_train_norm

array([[0.18392939, 0.20922972, 0.95358151, ..., 0.10642203, 0.00361433,
        0.        ],
       [0.12852563, 0.1788613 , 0.97065837, ..., 0.09331894, 0.00438316,
        0.        ],
       [0.12449794, 0.19879724, 0.97015183, ..., 0.05192962, 0.00652449,
        0.        ],
       ...,
       [0.17983167, 0.17320732, 0.9655967 , ..., 0.05260515, 0.00603985,
        0.        ],
       [0.1383677 , 0.2295221 , 0.95751074, ..., 0.10284086, 0.00841425,
        0.        ],
       [0.69786466, 0.2724488 , 0.61491237, ..., 0.20547814, 0.04109563,
        0.        ]])

In [40]:
sum(np.square(X_train_norm[1]))

1.0

### Binarization (discretization)

Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for downstream probabilistic estimators that assume the input data is distributed according to a multivariate Bernoulli distribution. For instance, this is the case for `sklearn.neural_network.BernoulliRBM`.

In [32]:
df['DC'].mean()

547.9400386847191

In [34]:
from sklearn.preprocessing import Binarizer

bi = Binarizer(threshold=548)  # threshold is about the mean of DC; pass it by keyword
DC_bi = bi.fit_transform(df[['DC']])
DC_bi

array([[0.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [1.],
       [1.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [1.],
       [0.],
       [1.],
       [1.],
       [0.],
       [0.],

In [35]:
df['DC_bi'] = DC_bi[:, 0]

In [36]:
df.head()

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area,DC_bi
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0,0.0
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0,1.0
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0,1.0
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0,0.0
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0,0.0


Return indices of half-open bins to which each value of x belongs.
```python
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)
```  
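When `bins` is an integer, `pandas.cut` splits the range of `x` into that many equal-width bins. For quantile-based bins, use `pandas.qcut` instead.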
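
A quick way to see the difference between equal-width and quantile-based binning is to run `pandas.cut` and `pandas.qcut` on a small skewed series. The values below are toy data chosen for the example.

```python
import pandas as pd

# Toy series with one large outlier (illustrative values only)
s = pd.Series([1, 2, 3, 4, 100])

# pd.cut with an integer: equal-width bins over the range of the data
equal_width = pd.cut(s, bins=2)

# pd.qcut: bins chosen from quantiles, so counts are as even as possible
equal_freq = pd.qcut(s, q=2)

width_counts = equal_width.value_counts().tolist()
freq_counts = equal_freq.value_counts().tolist()

print(equal_width.value_counts())
print(equal_freq.value_counts())
```

With equal-width bins the outlier ends up alone in its bin (counts of 4 and 1), while the quantile-based bins split the data 3 and 2.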

In [37]:
pd.cut(df['DC'], 5)

0       (7.047, 178.44]
1      (519.52, 690.06]
2      (519.52, 690.06]
3       (7.047, 178.44]
4       (7.047, 178.44]
5      (348.98, 519.52]
6      (348.98, 519.52]
7      (519.52, 690.06]
8       (690.06, 860.6]
9       (690.06, 860.6]
10      (690.06, 860.6]
11      (690.06, 860.6]
12     (519.52, 690.06]
13     (519.52, 690.06]
14      (690.06, 860.6]
15      (690.06, 860.6]
16      (7.047, 178.44]
17     (519.52, 690.06]
18      (7.047, 178.44]
19      (7.047, 178.44]
20      (690.06, 860.6]
21      (690.06, 860.6]
22     (178.44, 348.98]
23     (519.52, 690.06]
24     (519.52, 690.06]
25     (519.52, 690.06]
26     (519.52, 690.06]
27     (519.52, 690.06]
28      (690.06, 860.6]
29      (690.06, 860.6]
             ...       
487    (519.52, 690.06]
488    (519.52, 690.06]
489    (519.52, 690.06]
490    (519.52, 690.06]
491    (519.52, 690.06]
492    (519.52, 690.06]
493    (519.52, 690.06]
494    (519.52, 690.06]
495    (519.52, 690.06]
496    (519.52, 690.06]
497    (519.52, 

### Encoding categorical features

We could encode categorical features as integers, but such an integer representation cannot be used directly with scikit-learn estimators, as these expect continuous input and would interpret the categories as being ordered, which is often not desired.

One possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is implemented in [OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder). This estimator transforms each categorical feature with m possible values into m binary features, with only one active.
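
A minimal one-hot encoding sketch is shown below. It assumes a scikit-learn version in which `OneHotEncoder` accepts string categories directly (0.20 or later) and uses a toy `month` column; calling `.toarray()` on the default sparse output avoids depending on version-specific constructor arguments.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with m = 3 distinct values (illustrative only)
month = np.array([["mar"], ["oct"], ["oct"], ["aug"]])

# handle_unknown="ignore" encodes unseen categories as an all-zero row
enc = OneHotEncoder(handle_unknown="ignore")
onehot = enc.fit_transform(month).toarray()

# Learned categories, sorted: one output column per category
categories = list(enc.categories_[0])

# A category that was not present during fit
unseen = enc.transform(np.array([["dec"]])).toarray()

print(categories)
print(onehot)
print(unseen)
```

Each row has exactly one active column, matching the "m binary features, with only one active" description.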

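**OneHotEncoder** (generally not used)
```python
# Signature from older scikit-learn releases; newer releases replace
# n_values and categorical_features with a `categories` parameter
class sklearn.preprocessing.OneHotEncoder(n_values='auto', categorical_features='all', dtype=<type 'numpy.float64'>, 
                                          sparse=True, handle_unknown='error')
```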

`pandas.get_dummies` converts categorical variables into dummy/indicator variables:
```python
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)
```

In [39]:
modelData = pd.get_dummies(data=df, columns=['month','day'])
modelData.head()

Unnamed: 0,X,Y,FFMC,DMC,DC,ISI,temp,RH,wind,rain,...,month_nov,month_oct,month_sep,day_fri,day_mon,day_sat,day_sun,day_thu,day_tue,day_wed
0,7,5,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,...,0,0,0,1,0,0,0,0,0,0
1,7,4,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,...,0,1,0,0,0,0,0,0,1,0
2,7,4,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,...,0,1,0,0,0,1,0,0,0,0
3,8,6,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,...,0,0,0,1,0,0,0,0,0,0
4,8,6,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,...,0,0,0,0,0,0,1,0,0,0
