参考: [100-Days-Of-ML-Code](https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master/Code/Day%2011%20K-NN.md)

## Step1 Importing the libraries

In [3]:
import numpy as np
import pandas as pd

## Step2 Imporing dataset

In [4]:
dataset = pd.read_csv('datasets/Data.csv')
print(dataset)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


### loc与iloc的区别: iloc是通过索引号进行查询, 而loc是通过序号进行查询

In [5]:
dataset.iloc[0]  # dateset.iloc 是根据索引号进行查询

Country      France
Age              44
Salary        72000
Purchased        No
Name: 0, dtype: object

In [6]:
dataset.iloc[ : , 1]  # ','前是行，后是列

0    44.0
1    27.0
2    30.0
3    38.0
4    40.0
5    35.0
6     NaN
7    48.0
8    50.0
9    37.0
Name: Age, dtype: float64

In [8]:
X = dataset.iloc[ : , :-1].values  # .values表示只会取值
Y = dataset.iloc[ : , 3].values
print(X)
print(Y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Step3 Handling the missing data

In [9]:
from sklearn.preprocessing import Imputer  # Imputer 用于缺省值处理
imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)  # 参数全部是default值
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])  # return new numpy ndarray
print(X)  # 缺省值以该列的平均值代替

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


## Step4 Encoding categorical data
- Encoding 主要是为了将非数值型数据转换为数值型数据

In [10]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x = LabelEncoder()
X[:, 0] = label_encoder_x.fit_transform(X[:, 0])
print(X)  # 将标签分为 [0:n-1]

[[0 44.0 72000.0]
 [2 27.0 48000.0]
 [1 30.0 54000.0]
 [2 38.0 61000.0]
 [1 40.0 63777.77777777778]
 [0 35.0 58000.0]
 [2 38.77777777777778 52000.0]
 [0 48.0 79000.0]
 [1 50.0 83000.0]
 [0 37.0 67000.0]]


### Creating a dummy variable

In [11]:
one_hot_encoder = OneHotEncoder(categorical_features=[0])  # 指定one_hot的列, default='all' 是全部
X = one_hot_encoder.fit_transform(X).toarray()  # fit and transform then to numpy ndarray
print(X)  # 第一列值范围 [0:2]

[[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
  7.20000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
  4.80000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01
  5.40000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
  6.10000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
  6.37777778e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01
  5.80000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01
  5.20000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01
  7.90000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01
  8.30000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01
  6.70000000e+04]]


In [12]:
laber_encoder = LabelEncoder()
Y = laber_encoder.fit_transform(Y)
print(Y)  # 将y进行encoding

[0 1 0 0 1 1 0 1 0 1]


## Step5 Splitting the datasets into Training sets and Test sets

In [167]:
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)  # random_state是随机数种子，如果不为1，每次随机都不一样

## Step6 Feature Scaling
- 因为有些分类方法随着数值大小而受影响，所以需要特征收敛到例如`[0, 1]`范围内

In [171]:
from sklearn.preprocessing import StandardScale
sc_x = StandardScaler()  # 将特征去均值，向中心0收缩`
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.fit_transform(x_test)

In [172]:
x_train

array([[-1.        ,  1.        , -1.        ,  2.64575131, -0.77459667,
         0.26306757,  0.12381479],
       [ 1.        , -1.        ,  1.        , -0.37796447, -0.77459667,
        -0.25350148,  0.46175632],
       [-1.        ,  1.        , -1.        , -0.37796447,  1.29099445,
        -1.97539832, -1.53093341],
       [-1.        ,  1.        , -1.        , -0.37796447,  1.29099445,
         0.05261351, -1.11141978],
       [ 1.        , -1.        ,  1.        , -0.37796447, -0.77459667,
         1.64058505,  1.7202972 ],
       [-1.        ,  1.        , -1.        , -0.37796447,  1.29099445,
        -0.0813118 , -0.16751412],
       [ 1.        , -1.        ,  1.        , -0.37796447, -0.77459667,
         0.95182631,  0.98614835],
       [ 1.        , -1.        ,  1.        , -0.37796447, -0.77459667,
        -0.59788085, -0.48214934]])

In [173]:
x_test

array([[ 0.,  0.,  0.,  0.,  0., -1., -1.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  1.]])