## 数据分析第一步：数据预处理

### 1. 导入依赖库

In [2]:
import numpy as np
import pandas as pd

### 2. 导入数据集

导入常见的csv数据类型文件，通过Pandas的read_csv方法，返回DataFrame对象

In [29]:
dataset = pd.read_csv('data/Data.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values

print(X)
print('\n')
print(Y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


### 3. 处理缺失的数据

处理数据中的缺失数据（值为NaN），常见的处理形式如采用均值法填充

In [30]:
from sklearn.preprocessing import Imputer

imputer = Imputer(missing_values = "NaN", strategy = "mean", axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


### 4. 编码类别数据

对类别型数据进行数字编码

In [35]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:, 0])

print(X[:, 0])

[0 2 1 2 1 0 2 0 1 0]


#### 创建dummy变量

将指定属性下的类别数值作为信息的属性，属性的值为0或者1

In [27]:
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)

print(X)
print('\n')
print(Y)

[[1.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  0.00000000e+00 4.40000000e+01 7.20000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
  1.00000000e+00 2.70000000e+01 4.80000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 1.00000000e+00
  0.00000000e+00 3.00000000e+01 5.40000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
  1.00000000e+00 3.80000000e+01 6.10000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 1.00000000e+00
  0.00000000e+00 4.00000000e+01 6.37777778e+04]
 [1.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  0.00000000e+00 3.50000000e+01 5.80000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
  1.00000000e+00 3.87777778e+01 5.20000000e+04]
 [1.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  0.00000000e+00 4.80000000e+01 7.90000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 1.00000000e+00
  0.00000000e+00 5.00000000e+01 8.30000000e+04]
 [1.000000

### 5. 划分训练数据集与测试数据集

按照一定比例将原始数据集分为训练数据和测试数据，通常比例为4:1

In [28]:
from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)



### 6. 特征缩放

将数据的值维度进行缩放，加快收敛速率，通常使用Min-Max标准化方法或者Z-score标准化

In [33]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.fit_transform(X_test)

Done!