# Making Predictions with Scikit-Learn
### Scikit-Learn provides support in three areas.
1. Getting data: ***sklearn.datasets***
2. Preparing data: ***sklearn.preprocessing***
3. Machine learning: ***sklearn Estimator API***

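There are many ways to obtain data (files, databases, web scraping, Kaggle Datasets, and so on).<br>
The simplest is to import one of the datasets built into sklearn. Because they are readily at hand and need no download, they are usually called **toy datasets**.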
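
As a minimal sketch of what loading a toy dataset looks like, the snippet below loads two of the bundled datasets and prints their shapes. The choice of `load_wine` as a second dataset is only for illustration and is not used elsewhere in this notebook.

```python
from sklearn import datasets

# Each load_* function returns a Bunch, a dict-like object shipped with scikit-learn.
iris = datasets.load_iris()
wine = datasets.load_wine()  # second dataset chosen only for illustration

for name, bunch in [("iris", iris), ("wine", wine)]:
    # data has shape (n_samples, n_features); target has shape (n_samples,)
    print(name, bunch.data.shape, bunch.target.shape)
```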

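# Basic workflow

* Load the data and pre-process it
* Split into training and test sets
* Fit the model
* Predict
* Evaluate (the score may be an error value, an accuracy, or another metric)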
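
The five steps above can be sketched end to end as follows. This is only an illustrative outline: the choice of `LogisticRegression`, the `max_iter` value, the fixed `random_state`, and the `stratify` argument are assumptions made for the example, not something this notebook prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Load the data (no pre-processing in this sketch)
X, y = load_iris(return_X_y=True)

# 2. Split into training and test sets (random_state fixed for repeatability)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# 3. Fit the model (LogisticRegression is an assumed choice of estimator)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 4. Predict on data the model has not seen
y_pred = model.predict(X_test)

# 5. Evaluate; accuracy is one possible score among many
acc = accuracy_score(y_test, y_pred)
print(f"accuracy: {acc:.3f}")
```

Any estimator that follows the Estimator API (`fit`, then `predict`) could be swapped in at step 3 without changing the other steps.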


In [1]:
%matplotlib inline

from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


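## Loading the Iris dataset and pre-processing

The Iris Flowers dataset

This project uses the Iris Data Set. Each sample has 4 features and 1 class label. The dataset contains 3 classes with 50 samples each, 150 samples in total.

Attribute information:

    sepal length (cm)
    sepal width (cm)
    petal length (cm)
    petal width (cm)
    class:
        Iris Setosa
        Iris Versicolour
        Iris Virginica

The features are all numeric and share the same unit (centimeters).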

![Iris Flowers](images/iris_data.PNG)


In [4]:
iris = datasets.load_iris()  
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

* Print the keys of iris and the file location
* Look at the first 10 rows
* Check the data types
* Print the labeled class data

In [23]:
print(iris.keys())
print(iris["filename"])
# print(iris["data"])

# print(iris.data[0: 10])
# print(iris["target_names"])
# print(type(iris["target"]))
print(iris.feature_names)

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
iris.csv
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [36]:
# we only take the first two features.
X = iris.data[:,0:2]
Y = iris.target

In [22]:
# Build a pandas DataFrame (this step is optional)
x = pd.DataFrame(iris.data, columns=iris['feature_names'])
x.head(10)

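| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 |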


In [27]:
print("target_names: "+ str(iris.target_names))

target_names: ['setosa' 'versicolor' 'virginica']


In [26]:
# Create the target column and its data
y = pd.DataFrame(iris.target, columns=["target"])
y.head()

| | target |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


In [37]:
# Combine the feature columns and the target column
iris_data = pd.concat([x, y], axis=1)

# Double brackets [[...]] pass a list of column names and return a DataFrame
iris_data = iris_data[["sepal length (cm)", "petal length (cm)", "target"]]
iris_data.head(10)

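| | sepal length (cm) | petal length (cm) | target |
|---|---|---|---|
| 0 | 5.1 | 1.4 | 0 |
| 1 | 4.9 | 1.4 | 0 |
| 2 | 4.7 | 1.3 | 0 |
| 3 | 4.6 | 1.5 | 0 |
| 4 | 5.0 | 1.4 | 0 |
| 5 | 5.4 | 1.7 | 0 |
| 6 | 4.6 | 1.4 | 0 |
| 7 | 5.0 | 1.5 | 0 |
| 8 | 4.4 | 1.4 | 0 |
| 9 | 4.9 | 1.5 | 0 |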


In [33]:
# Keep only the rows whose target is 0 or 1
iris_data = iris_data[iris_data["target"].isin([0, 1])]
iris_data

print(iris.data.size/ len(iris.feature_names))

150.0
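
As an alternative to concatenating the features and target by hand, `load_iris` accepts `as_frame=True` and then exposes a combined DataFrame. The sketch below reproduces the same column selection and class filter as the cells above; the variable names `iris_bunch`, `iris_frame`, and `subset` are made up for this example.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays.
iris_bunch = load_iris(as_frame=True)
iris_frame = iris_bunch.frame  # four feature columns plus a "target" column

# Same selection as the cells above: two features, classes 0 and 1 only.
subset = iris_frame.loc[
    iris_frame["target"].isin([0, 1]),
    ["sepal length (cm)", "petal length (cm)", "target"],
]
print(subset.shape)
```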


## Splitting into training and test sets
> train_test_split()

In [45]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(iris_data[["sepal length (cm)", "petal length (cm)"]], iris_data["target"], test_size=0.3)

In [50]:
X_train.head()
X_train.columns
# X_train.shape

Index(['sepal length (cm)', 'petal length (cm)'], dtype='object')

In [42]:
X_test.head()
X_test.shape

(45, 2)

In [43]:
Y_train.head()
Y_train.shape

(105,)

In [44]:
Y_test.head()
Y_test.shape

(45,)
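
The split above is random, so the rows in each set change from run to run. A minimal sketch of a repeatable, class-balanced split is shown below; the `random_state=42` value and the use of `stratify` are assumptions added for illustration and are not part of the cells above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify=y keeps class proportions equal in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_tr.shape, X_te.shape)  # sizes of the training and test feature matrices
print(np.bincount(y_te))       # number of test samples per class
```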

# Appendix 

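>Normalization and standardization serve a similar purpose.<br>
Both pre-process the data so that values fall into a common numeric range, so that no feature is treated differently during modeling merely because of its scale.<br> 
* Normalization usually restricts the data to a chosen range, typically [0, 1], which removes the effect of the features' units on the model.<br> 
* Standardization usually means rescaling the data so that its mean is 0 and its variance is 1. It does not change the shape of the distribution.<br> 

Both therefore operate on the data to remove feature-importance bias caused by differences in numeric scale.<br>
Scaled data can also speed up training and help the algorithm converge.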
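
To make the difference concrete, here is a small sketch that applies both transformations to the same array with plain NumPy. The values in `data` are hypothetical and were chosen only so that the two columns sit on very different scales.

```python
import numpy as np

# Two features on very different scales (hypothetical values, for illustration only).
data = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 500.0]])

# Min-max normalization: each column is mapped onto [0, 1].
normalized = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))

# Standardization: each column is rescaled to mean 0 and standard deviation 1.
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

print(normalized)
print(standardized)
```

After either transformation the two columns become identical, which shows that the original difference between them was purely one of scale.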

### Standardization (z-score)
    Compute the mean and standard deviation on the training set, then reapply the same transformation to the test set.

In [24]:
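def norm_stats(dfs):
    # Per-column statistics; axis=0 is explicit because np.min/np.mean on a
    # DataFrame may reduce over both axes in newer pandas versions.
    minimum = dfs.min(axis=0)
    maximum = dfs.max(axis=0)
    mu = dfs.mean(axis=0)
    sigma = dfs.std(axis=0, ddof=0)  # population std, matching StandardScaler
    return (minimum, maximum, mu, sigma)


def z_score(col, stats):
    # stats is the tuple returned by norm_stats: (min, max, mean, std)
    m, M, mu, s = stats
    df = pd.DataFrame()
    for c in col.columns:
        df[c] = (col[c]-mu[c])/s[c]
    return df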

In [25]:
stats = norm_stats(X_train)
arr_x_train = np.array(z_score(X_train, stats))
arr_y_train = np.array(Y_train)
arr_x_train[:5]

array([[-0.72966298, -0.79523439],
       [-0.72966298, -1.07013023],
       [ 1.27990064,  1.19776044],
       [ 0.6100461 , -1.07013023],
       [-0.56219935, -0.72651043]])

## Using sklearn

In [26]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler().fit(X_train)  #Compute the statistics to be used for later scaling.
print(sc.mean_)  #mean
print(sc.scale_) #standard deviation

[5.43571429 2.75714286]
[0.59714457 1.4550966 ]


In [27]:
#transform: (x-u)/std.
X_train_std = sc.transform(X_train)
X_train_std[:5]

array([[-0.72966298, -0.79523439],
       [-0.72966298, -1.07013023],
       [ 1.27990064,  1.19776044],
       [ 0.6100461 , -1.07013023],
       [-0.56219935, -0.72651043]])

The scaler instance can then be used on new data to transform it the same way it did on the training set:

In [30]:
X_test_std = sc.transform(X_test)
print(X_test_std[:10])

[[-0.63038672 -1.16023591]
 [-0.07416314  0.43177178]
 [ 0.20394865 -1.16023591]
 [-0.07416314  0.50413577]
 [ 0.76017222  1.15541164]
 [-0.90849851 -1.23259989]
 [-0.63038672 -0.07477612]
 [-1.0475544  -1.08787192]
 [ 0.20394865  0.28704381]
 [-0.49133083  0.57649975]]


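You can also use the fit_transform method (i.e., fit and then transform) on the training set. The test set should only be transformed with the statistics learned from the training set, never refit.

In [31]:
X_train_std = sc.fit_transform(X_train)  
X_test_std = sc.transform(X_test)  # reuse the training-set mean and std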
print(X_test_std[:10])


[[-0.63038672 -1.16023591]
 [-0.07416314  0.43177178]
 [ 0.20394865 -1.16023591]
 [-0.07416314  0.50413577]
 [ 0.76017222  1.15541164]
 [-0.90849851 -1.23259989]
 [-0.63038672 -0.07477612]
 [-1.0475544  -1.08787192]
 [ 0.20394865  0.28704381]
 [-0.49133083  0.57649975]]


In [32]:
print('mean of X_train_std:',np.round(X_train_std.mean(),4))
print('std of X_train_std:',X_train_std.std())

mean of X_train_std: -0.0
std of X_train_std: 1.0
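
The check above averages over every value in the array. Since the scaler works column by column, a per-feature check with `axis=0` can be more informative. The sketch below is self-contained and uses all four Iris features with an assumed `random_state=0`, so its numbers will not match the outputs shown in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, _, _ = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit on the training set only, then apply the same statistics to both sets.
scaler = StandardScaler().fit(X_tr)
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)

# Training columns have mean 0 and std 1 up to floating-point error.
print(X_tr_std.mean(axis=0).round(4), X_tr_std.std(axis=0).round(4))
# Test columns are only approximately centered, because they reuse training statistics.
print(X_te_std.mean(axis=0).round(4), X_te_std.std(axis=0).round(4))
```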


## Min-Max Normalization
    Transforms features by scaling each feature to a given range.
    The transformation is given by:

    X' = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    X -> N-dimensional data
    


In [63]:
x1 = np.random.normal(50, 6, 100)  # np.random.normal(mu, sigma, size)
y1 = np.random.normal(5, 0.5, 100)

print(x1)

x2 = np.random.normal(30,6,100)
y2 = np.random.normal(4,0.5,100)
plt.scatter(x1,y1,c='b',marker='s',s=20,alpha=0.8)
plt.scatter(x2,y2,c='r', marker='^', s=20, alpha=0.8)

# print(np.sum(x1)/len(x1))
# print(np.sum(x2)/len(x2))

[49.88493079 52.7088562  55.55285175 53.82305737 50.96570385 55.7497334
 49.4225766  51.04056129 47.30057095 50.58736511 53.7730684  56.32531077
 45.17129707 46.14540794 51.33601086 52.28575761 41.96874667 47.70662464
 48.70404554 44.16726195 45.20987765 59.19454318 46.28710387 48.11130801
 47.04047251 57.0598223  50.95730898 52.19908634 42.24502196 55.18923704
 53.39046974 40.38916776 45.12370265 41.64590809 45.21702454 55.45469634
 44.69057909 54.13519604 39.55740647 46.39299489 61.27265336 49.3685069
 43.76885404 41.07044258 57.84913533 46.82298535 53.4323372  41.77297492
 53.11438969 55.94987926 63.08082157 59.47818655 53.87294771 54.21746192
 50.55388849 57.30381466 54.26940809 41.54169712 52.09070649 53.99544364
 60.23275253 47.38259296 52.23474885 49.63742709 34.60531846 36.31352822
 56.50814488 56.24637727 51.3130093  48.90562613 58.68651841 43.81869111
 55.11998362 53.26329295 45.1798781  49.11483365 59.24725946 52.57599815
 48.96183591 51.80890917 44.56310468 51.56136262 47.3

<matplotlib.collections.PathCollection at 0x21907a9c550>

In [53]:
x_val = np.concatenate((x1,x2))
y_val = np.concatenate((y1,y2))

x_val.shape

(200,)

In [54]:
def minmax_norm(X):
    return (X - X.min(axis=0)) / ((X.max(axis=0) - X.min(axis=0)))

In [55]:
minmax_norm(x_val[:10])

array([1.        , 0.756846  , 0.88263726, 0.4476152 , 0.        ,
       0.47439901, 0.69771518, 0.47429877, 0.86937855, 0.6201845 ])

In [61]:
from sklearn.preprocessing import MinMaxScaler
x_val=x_val.reshape(-1, 1) # reshape from 1-D to 2-D
print(x_val)
scaler = MinMaxScaler().fit(x_val)  # default range 0~1
print(scaler.data_max_) # show the maximum
print(scaler.data_min_) # show the minimum
print(scaler.transform(x_val)[:10])

[[59.96846003]
 [53.95542173]
 [57.06615677]
 [46.30834772]
 [35.23911857]
 [46.97069359]
 [52.49315539]
 [46.96821488]
 [56.7382777 ]
 [50.57587294]
 [56.70580519]
 [55.94654576]
 [38.78857615]
 [51.81083029]
 [53.98590809]
 [45.27412425]
 [57.01276897]
 [53.3352217 ]
 [50.54990055]
 [56.51719688]
 [37.70651684]
 [64.53470766]
 [50.69223141]
 [53.69904124]
 [50.51712637]
 [50.45765909]
 [45.60920573]
 [64.22603533]
 [48.00820567]
 [52.37824408]
 [37.36550578]
 [41.00832215]
 [49.03441779]
 [46.71795938]
 [49.73431965]
 [42.5931952 ]
 [53.74811723]
 [58.20232428]
 [55.39059129]
 [45.30719274]
 [44.22407478]
 [49.09621042]
 [47.2686243 ]
 [43.05227044]
 [44.65148164]
 [55.67881226]
 [48.35360122]
 [48.00750748]
 [48.08773506]
 [43.79732506]
 [38.94973809]
 [47.1016599 ]
 [44.13706883]
 [50.89762092]
 [50.82148911]
 [46.11550126]
 [44.15331049]
 [58.51509055]
 [44.93570377]
 [58.42481551]
 [44.95713186]
 [55.70700274]
 [62.08489517]
 [50.80581802]
 [55.31038323]
 [50.41058243]
 [50.45617