# GitHub 100-Days-Of-ML-Code Project

* Personal study notes
* Note that some of the libraries used here may have been deprecated; check the official scikit-learn site: [sklearn](https://scikit-learn.org/stable/index.html "title")

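# Day 1: Data Preprocessing

* Import the required libraries: NumPy and pandas
* Import the dataset: read the .csv dataset with the read_csv method from pandas
* ~~Handle missing data: Imputer class from sklearn.preprocessing (the Imputer class used in the original project is deprecated)~~
* Handle missing data: SimpleImputer class from sklearn.impute
* Encode categorical data: LabelEncoder class from sklearn.preprocessing
* ~~Training and test sets: train_test_split() method from sklearn.cross_validation~~
* Training and test sets: train_test_split() method from sklearn.model_selection
* Feature scaling: standardization (z-score normalization) with the StandardScaler class from sklearn.preprocessing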
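The steps listed above can also be chained into a single preprocessing object. The following is only a sketch, not the approach used in these notes: the `ColumnTransformer` and `Pipeline` wiring and the step names (`"impute"`, `"scale"`, `"num"`, `"cat"`) are assumptions, and the data is retyped inline from the cell outputs in these notes instead of being read from `Data.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# The ten rows shown in these notes, retyped inline so the sketch is self-contained.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

# Numeric columns: fill missing values with the column mean, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical column: one-hot encode Country only.
preprocess = ColumnTransformer([
    ("num", numeric, ["Age", "Salary"]),
    ("cat", OneHotEncoder(), ["Country"]),
])

X_prepared = preprocess.fit_transform(df)
print(X_prepared.shape)  # 2 scaled numeric columns + 3 one-hot columns
```

In practice the transformer would normally be fitted on the training split only and then applied to the test split; it is fitted on all ten rows here purely to keep the sketch short.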

## 1: Import the libraries

In [1]:
import pandas as pd
import numpy as np

## 2: Import the dataset

In [2]:
dataset = pd.read_csv('./dataset/Data.csv')

In [3]:
dataset.head()

|   | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |


In [4]:
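# .copy() makes X independent of dataset, so later in-place edits do not trigger SettingWithCopyWarning
X = dataset.iloc[:, :-1].copy()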
X.head()

|   | Country | Age | Salary |
|---|---|---|---|
| 0 | France | 44.0 | 72000.0 |
| 1 | Spain | 27.0 | 48000.0 |
| 2 | Germany | 30.0 | 54000.0 |
| 3 | Spain | 38.0 | 61000.0 |
| 4 | Germany | 40.0 | NaN |


In [5]:
Y = dataset.iloc[:, 3]
Y.head()

0     No
1    Yes
2     No
3     No
4    Yes
Name: Purchased, dtype: object
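Positional slicing with `iloc` depends on the column order. As a hedged alternative (not what the notes above do), the same split can be written with column names. The small DataFrame below is retyped from the `head()` output so the sketch runs on its own.

```python
import numpy as np
import pandas as pd

# First five rows of the dataset, retyped from the head() output above.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, np.nan],
    "Purchased": ["No", "Yes", "No", "No", "Yes"],
})

# Features: every column except the target.
X = df.drop(columns="Purchased")
# Target: the Purchased column, selected by name instead of by position 3.
Y = df["Purchased"]

print(list(X.columns))
print(Y.name)
```

Both forms select the same data here; the name-based form keeps working if columns are reordered.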

## 3: Handle missing data

The Imputer class is deprecated (and has been removed from recent scikit-learn releases)

In [6]:
# from sklearn.preprocessing import Imputer
# imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
# imputer = imputer.fit(X.iloc[:, 1:3])
# X.iloc[:, 1:3] = imputer.transform(X.iloc[:, 1:3])

In [7]:
X.iloc[:, 1:3].isnull()

|   | Age | Salary |
|---|---|---|
| 0 | False | False |
| 1 | False | False |
| 2 | False | False |
| 3 | False | False |
| 4 | False | True |
| 5 | False | False |
| 6 | True | False |
| 7 | False | False |
| 8 | False | False |
| 9 | False | False |


In [8]:
X

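|   | Country | Age | Salary |
|---|---|---|---|
| 0 | France | 44.0 | 72000.0 |
| 1 | Spain | 27.0 | 48000.0 |
| 2 | Germany | 30.0 | 54000.0 |
| 3 | Spain | 38.0 | 61000.0 |
| 4 | Germany | 40.0 | NaN |
| 5 | France | 35.0 | 58000.0 |
| 6 | Spain | NaN | 52000.0 |
| 7 | France | 48.0 | 79000.0 |
| 8 | Germany | 50.0 | 83000.0 |
| 9 | France | 37.0 | 67000.0 |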


In [9]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp = imp.fit(X.iloc[:, 1:3])
X.iloc[:, 1:3] = imp.transform(X.iloc[:, 1:3])

In [10]:
X

|   | Country | Age | Salary |
|---|---|---|---|
| 0 | France | 44.0 | 72000.0 |
| 1 | Spain | 27.0 | 48000.0 |
| 2 | Germany | 30.0 | 54000.0 |
| 3 | Spain | 38.0 | 61000.0 |
| 4 | Germany | 40.0 | 63777.777778 |
| 5 | France | 35.0 | 58000.0 |
| 6 | Spain | 38.777778 | 52000.0 |
| 7 | France | 48.0 | 79000.0 |
| 8 | Germany | 50.0 | 83000.0 |
| 9 | France | 37.0 | 67000.0 |
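The two filled-in values can be checked by hand: with `strategy='mean'`, each missing entry is replaced by the mean of the non-missing values in its column. A minimal NumPy sketch of that arithmetic, with the column values retyped from the table above:

```python
import numpy as np

# Age and Salary columns before imputation (NaN marks the missing entries).
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37])
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000])

# Column means over the nine non-missing values.
age_mean = np.nanmean(age)        # 349 / 9
salary_mean = np.nanmean(salary)  # 574000 / 9

# Replace each NaN with its column mean, as mean imputation does.
age_filled = np.where(np.isnan(age), age_mean, age)
salary_filled = np.where(np.isnan(salary), salary_mean, salary)

print(round(age_mean, 6), round(salary_mean, 6))
```

The printed means match the values that appear in rows 6 and 4 of the imputed table.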


## 4: Encode categorical data

In [11]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
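# Has LabelEncoder been replaced by OrdinalEncoder? No: LabelEncoder is meant for target labels (y); OrdinalEncoder is the counterpart for feature columns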
labelencoder_X = LabelEncoder()

In [12]:
X.iloc[:, 0] = labelencoder_X.fit_transform(X.iloc[:, 0])

In [13]:
X

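|   | Country | Age | Salary |
|---|---|---|---|
| 0 | 0 | 44.0 | 72000.0 |
| 1 | 2 | 27.0 | 48000.0 |
| 2 | 1 | 30.0 | 54000.0 |
| 3 | 2 | 38.0 | 61000.0 |
| 4 | 1 | 40.0 | 63777.777778 |
| 5 | 0 | 35.0 | 58000.0 |
| 6 | 2 | 38.777778 | 52000.0 |
| 7 | 0 | 48.0 | 79000.0 |
| 8 | 1 | 50.0 | 83000.0 |
| 9 | 0 | 37.0 | 67000.0 |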
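For feature columns, `OrdinalEncoder` gives the same integer codes as the `LabelEncoder` call above, but it accepts 2D input directly. The sketch below is an alternative and not the code these notes ran; the one-column DataFrame is retyped from the table.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# The Country column, retyped from the table above.
countries = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
})

encoder = OrdinalEncoder()
# fit_transform expects 2D input and returns a 2D float array of shape (10, 1).
codes = encoder.fit_transform(countries)

# Categories are sorted alphabetically, so France=0, Germany=1, Spain=2.
print(encoder.categories_[0])
print(codes.ravel())
```

Note that these integer codes imply an ordering (France < Germany < Spain) that has no real meaning for countries, which is why dummy variables are usually preferred for nominal features.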


### Create dummy variables

In [14]:
X_array = np.array(X)

In [15]:
X_array = X_array.astype(int)

In [16]:
X_array

array([[    0,    44, 72000],
       [    2,    27, 48000],
       [    1,    30, 54000],
       [    2,    38, 61000],
       [    1,    40, 63777],
       [    0,    35, 58000],
       [    2,    38, 52000],
       [    0,    48, 79000],
       [    1,    50, 83000],
       [    0,    37, 67000]])

In [17]:
type(X_array)

numpy.ndarray

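# Note: OneHotEncoder encodes every column of X_array here, not only Country
onehotencoder = OneHotEncoder()
onehotencoder.fit(X_array)
onehotencoder.transform(X_array).toarray()

[The first 3 columns encode the first feature (Country), the next 9 encode the second feature (Age, which has only 9 distinct values after the integer cast because 38 appears twice), and the last 10 encode the third feature (Salary)]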
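Encoding all three columns turns Age and Salary into categories as well, which is rarely what is wanted. A hedged sketch of one-hot encoding only the Country codes follows; the input array is retyped from the first column of `X_array`, and restricting the encoder to that column is an assumption about the intent, not what the cell above does.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Integer Country codes from the first column of X_array (0=France, 1=Germany, 2=Spain).
country_codes = np.array([0, 2, 1, 2, 1, 0, 2, 0, 1, 0]).reshape(-1, 1)

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
dummies = encoder.fit_transform(country_codes).toarray()

print(dummies.shape)  # one column per country
print(dummies[:3])
```

The resulting three columns can then be stacked next to the numeric Age and Salary columns, for example with `np.hstack`.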

In [18]:
labelencoder_Y = LabelEncoder()
labelencoder_Y.fit_transform(Y)

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
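To see which label each integer stands for, the fitted encoder exposes `classes_`, and `inverse_transform` maps codes back to labels. A small sketch, with the `Purchased` values retyped from the encoded output above:

```python
from sklearn.preprocessing import LabelEncoder

# Purchased column for all ten rows, consistent with the encoded array above.
purchased = ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]

encoder = LabelEncoder()
codes = encoder.fit_transform(purchased)

# classes_ is sorted, so index 0 is "No" and index 1 is "Yes".
print(encoder.classes_)
# inverse_transform recovers the original string labels.
labels = encoder.inverse_transform(codes)
print(list(labels))
```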

## 5: Split the dataset into training and test sets

In [19]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
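With 10 rows and `test_size=0.2`, the split keeps 8 rows for training and 2 for testing. The sketch below shows this on placeholder arrays (the `np.arange` data is hypothetical and stands in for the real `X` and `Y`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the dataset: 10 rows, 3 features.
X = np.arange(30).reshape(10, 3)
Y = np.arange(10)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# 20% of 10 rows go to the test set; the rest go to the training set.
print(X_train.shape, X_test.shape)
print(Y_train.shape, Y_test.shape)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set on every run.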

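## 6: Feature scaling

* For each feature (column), subtract its mean and then divide by its standard deviation. As a result, every column is centered around 0 and has a variance of 1.
* $ \frac{x_i - \mu}{\sigma}$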
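Standardization can be reproduced by hand to confirm that the divisor is the standard deviation. The sketch below compares a manual z-score with `StandardScaler` on a few Age and Salary rows taken from the dataset; `StandardScaler` uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Four (Age, Salary) rows from the dataset, used as a small example.
X = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
])

# Manual z-score: subtract the column mean, divide by the column standard deviation.
mu = X.mean(axis=0)
sigma = X.std(axis=0)  # population standard deviation (ddof=0)
manual = (X - mu) / sigma

# The same transformation with scikit-learn.
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(manual.mean(axis=0), manual.std(axis=0))
```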

In [20]:
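from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
# fit() returns the fitted scaler itself, so X_train_sc is the scaler object, not scaled data
X_train_sc = sc_X.fit(X_train)
# Scale the test set with the mean and variance learned from the training set
X_test_sc = X_train_sc.transform(X_test)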



In [21]:
X_train_sc.mean_

array([8.75000000e-01, 3.84722222e+01, 6.25972222e+04])

In [22]:
X_train_sc.var_

array([8.59375000e-01, 3.37276235e+01, 9.09133873e+07])

In [23]:
X_test_sc

array([[ 0.13483997, -1.45882927, -0.90166297],
       [ 0.13483997,  1.98496442,  2.13981082]])
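The scaled test values can be checked against the printed `mean_` and `var_`. In the sketch below, the two unscaled test rows are an assumption: they are inferred to be rows 2 and 8 of the encoded table (Germany, 30, 54000 and Germany, 50, 83000) because those rows reproduce the output above.

```python
import numpy as np

# Statistics learned from the training set, copied from the outputs above.
mean_ = np.array([8.75000000e-01, 3.84722222e+01, 6.25972222e+04])
var_ = np.array([8.59375000e-01, 3.37276235e+01, 9.09133873e+07])

# Assumed unscaled test rows (Country code, Age, Salary); inferred, not printed in the notes.
X_test = np.array([
    [1.0, 30.0, 54000.0],
    [1.0, 50.0, 83000.0],
])

# z-score with the training statistics: divide by the standard deviation, sqrt(var_).
X_test_scaled = (X_test - mean_) / np.sqrt(var_)
print(X_test_scaled)
```

Dividing by `var_` instead of `np.sqrt(var_)` would not reproduce the printed array, which confirms that the scaler divides by the standard deviation.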