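Boston House Price Prediction Case Study: Feature Engineering
Through data exploration we have learned the characteristics of the dataset. Next we carry out feature engineering: the necessary preprocessing and encoding of the raw features, so that the transformed features meet the model's requirements.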

# 1. Import the required packages

In [68]:
# -*- coding:utf-8 -*-
import sys
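# Matrix operations
import numpy as np
# CSV file handling and SQL-like data manipulation
import pandas as pd

# Plotting (e.g. histograms) to visualize the data and results
import matplotlib.pyplot as plt
# A wrapper around matplotlib that makes plotting easier and more convenient
import seaborn as sns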


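# 2. Load the data
This dataset is simple enough to be fed directly into a regression model. In general, however, we first apply the necessary encoding and processing to the raw features (feature engineering) and then feed the encoded features into the model.
Garbage In, Garbage Out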

In [69]:
# Boston house price prediction
data_path = "/home/tianyu/software/IDEAProjects/machine_learning/data/boston_housing.csv"

# Read the data
df = pd.read_csv(data_path)
# Show the first five rows to get a sense of each column (feature)
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15,396.9,4.98,24.0
1,0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17,396.9,9.14,21.6
2,0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17,392.83,4.03,34.7
3,0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18,394.63,2.94,33.4
4,0.06905,0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18,396.9,5.33,36.2


In [70]:
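# Overall information about the data:
# number of samples, number of columns, column names, non-null counts, and data types
# X must be numeric to be easy to process; string columns have to be encoded before they can be fed to a model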
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null int64
INDUS      506 non-null float64
CHAS       506 non-null int64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null int64
TAX        506 non-null int64
PTRATIO    506 non-null int64
B          506 non-null float64
LSTAT      506 non-null float64
MEDV       506 non-null float64
dtypes: float64(9), int64(5)
memory usage: 55.5 KB


# 3. Feature engineering
## 3.1 Data denoising

In [71]:
# Drop samples with y greater than or equal to 50 (keep those below 50)
df = df[df.MEDV < 50]
print(df.shape)
df.info()

(490, 14)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 490 entries, 0 to 505
Data columns (total 14 columns):
CRIM       490 non-null float64
ZN         490 non-null int64
INDUS      490 non-null float64
CHAS       490 non-null int64
NOX        490 non-null float64
RM         490 non-null float64
AGE        490 non-null float64
DIS        490 non-null float64
RAD        490 non-null int64
TAX        490 non-null int64
PTRATIO    490 non-null int64
B          490 non-null float64
LSTAT      490 non-null float64
MEDV       490 non-null float64
dtypes: float64(9), int64(5)
memory usage: 57.4 KB


16 samples were removed (506 - 490); the data still has 14 columns (13 input features plus the label MEDV)
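
The filtering step can be reproduced on a toy frame. This is a minimal sketch, assuming the reason for the cut is that MEDV looks capped at 50 (the notebook itself does not state why); the `toy` data and variable names are hypothetical and only illustrate the boolean-mask pattern used above.

```python
import pandas as pd

# Hypothetical toy data standing in for the MEDV column (values are made up).
toy = pd.DataFrame({"MEDV": [24.0, 50.0, 21.6, 50.0, 34.7]})

# Count how many rows sit at or above the threshold before dropping them.
n_capped = int((toy["MEDV"] >= 50).sum())

# Keep only rows strictly below the threshold; original index labels are preserved.
filtered = toy[toy["MEDV"] < 50]

print(n_capped)              # number of rows that will be removed
print(filtered.shape)        # remaining rows and columns
print(list(filtered.index))  # the index now has gaps
```

Note that the index keeps its original labels after filtering, which is why the `df.info()` output above still reports entries "0 to 505" for 490 rows.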

## 3.2 Separating features and labels
Separate the input features X and the label y from the original data

In [150]:
y = df['MEDV']
X = df.drop('MEDV', axis = 1)
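# Try a log transform of y (the median house price) and predict the log-transformed price
# A log transform is often suitable for price-like variables
# log1p(y) = log(y + 1)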
log_y =np.log1p(y)
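
If a model is trained on the log-transformed price, its predictions have to be mapped back to the original scale. The sketch below assumes `np.expm1` is used as the inverse of `np.log1p`; the names `y_demo`, `log_y_demo` and `y_back` are illustrative and not part of the notebook.

```python
import numpy as np

# First three MEDV values from the table shown earlier.
y_demo = np.array([24.0, 21.6, 34.7])

log_y_demo = np.log1p(y_demo)   # log(1 + y)
y_back = np.expm1(log_y_demo)   # exp(v) - 1, the inverse of log1p

print(np.round(log_y_demo, 6))
print(np.allclose(y_back, y_demo))
```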

In [151]:
type(y)

pandas.core.series.Series

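## 3.3 Encoding discrete features
Discrete features can be one-hot encoded: a discrete feature with K possible values becomes K binary (0/1) features, exactly one of which is 1 (the "hot" one) while the rest are 0. One-hot encoding can be done with pandas' get_dummies function (dummy encoding) or with scikit-learn's OneHotEncoder class.

When applied to a whole DataFrame, get_dummies by default encodes only non-numeric columns (dtype "object" or category), so a numeric column has to be cast first or passed on its own. In older scikit-learn versions (before 0.20), OneHotEncoder required integer input, so strings first had to be converted to integers with LabelEncoder; LabelEncoder outputs a 1-D array whereas OneHotEncoder expects 2-D input, so a reshape is needed between the two. More recent versions accept string categories directly.

If the training and test data are not available at the same time, use OneHotEncoder: fit the encoder on the training set, then encode both the training and test sets with it. get_dummies derives its columns from the DataFrame it is given, so it is only suitable when there is a single dataset.

Further reading: "[scikit-learn] 特征二值化编码函数的一些坑" (pitfalls of the feature binarization/encoding functions), https://ask.hellobi.com/blog/DataMiner/4897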
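
The difference between the two approaches shows up as soon as the test data contains a category that was absent from the training data. The following is a minimal sketch, assuming a reasonably recent scikit-learn; the `train`/`test` frames and their values are hypothetical, and `handle_unknown="ignore"` is one possible policy for unseen categories, not something this notebook prescribes.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train/test split of a RAD-like column (values are made up).
train = pd.DataFrame({"RAD": [1, 2, 3, 24]})
test = pd.DataFrame({"RAD": [2, 5]})   # 5 never appears in the training data

# Fit on the training data only; unseen categories become all-zero rows.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["RAD"]])                # input must be 2-D, hence the double brackets

train_ohe = enc.transform(train[["RAD"]]).toarray()
test_ohe = enc.transform(test[["RAD"]]).toarray()

print(enc.categories_[0])  # categories learned from the training data
print(test_ohe)            # same number of columns as train_ohe

# get_dummies derives its columns from whatever data it is given,
# so the two frames end up with different columns.
train_cols = list(pd.get_dummies(train["RAD"], prefix="RAD").columns)
test_cols = list(pd.get_dummies(test["RAD"], prefix="RAD").columns)
print(train_cols)
print(test_cols)
```

With the fitted encoder, both sets share the same column layout, whereas the two `get_dummies` calls produce column sets that do not match.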

In [152]:
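# RAD is an index of accessibility to radial highways; although it is stored as a number it is really an index, so try encoding it as a discrete/categorical feature
X['RAD'] = X['RAD'].astype("object")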
X_cat = X['RAD']
X.columns

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT'],
      dtype='object')

In [153]:
pd.DataFrame(X_cat).columns

Index(['RAD'], dtype='object')

In [154]:
# One-hot encoding
X_cat = pd.get_dummies(X_cat, prefix="RAD")

X = X.drop('RAD', axis = 1)
# Feature names, used when saving the feature engineering result
feat_names = X.columns
X_cat.head()

Unnamed: 0,RAD_1,RAD_2,RAD_3,RAD_4,RAD_5,RAD_6,RAD_7,RAD_8,RAD_24
0,1,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0
3,0,0,1,0,0,0,0,0,0
4,0,0,1,0,0,0,0,0,0


In [155]:
X_cat.columns

Index(['RAD_1', 'RAD_2', 'RAD_3', 'RAD_4', 'RAD_5', 'RAD_6', 'RAD_7', 'RAD_8',
       'RAD_24'],
      dtype='object')

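## 3.4 Preprocessing numeric features
In principle, numeric features can be fed to the model directly.
Points to note:
Consider the value range and unit of each feature dimension, and rescale the data so that it becomes dimensionless (unit-free).
Preprocessing utilities provided by scikit-learn: http://scikit-learn.org/stable/modules/preprocessing.html

All features in the Boston housing dataset are numeric. Data exploration showed that their scales differ widely, so the data should be standardized. Standardization prevents large differences in raw feature scales from producing weights on incomparable scales, which would make it impossible to compare the importance of the features. In addition, some optimization algorithms (such as stochastic gradient descent and its variants) are only guaranteed to converge well when the features are on similar scales.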
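
What standardization computes can be checked by hand. The sketch below uses made-up arrays (`X_train`, `X_test` are hypothetical) and compares `StandardScaler` with the formula (x - mean) / std, where std is the population standard deviation; it also shows the usual pattern of fitting on training data and reusing the learned parameters on test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (values are made up).
X_train = np.array([[0.006, 296.0], [0.027, 242.0], [0.069, 222.0]])
X_test = np.array([[0.032, 250.0]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean/std from the training data
X_test_std = scaler.transform(X_test)        # reuse the same mean/std

# StandardScaler uses the population standard deviation (ddof=0), like np.std.
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(X_train_std, manual))
print(X_train_std.mean(axis=0).round(6), X_train_std.std(axis=0).round(6))
```

After the transform each training column has mean 0 and standard deviation 1, so the learned weights become comparable across features.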

##### Feature standardization

In [156]:
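# Standardize the data
# Two commonly used scalers: StandardScaler and MinMaxScaler
from sklearn.preprocessing import StandardScaler

# Initialize separate scalers for the features and the target
ss_X = StandardScaler()
ss_y = StandardScaler()

ss_log_y = StandardScaler()

# Standardize the features (and, optionally, the target)
# fit learns the scaler parameters (mean and standard deviation); transform applies them.
# With a train/test split, fit on the training data only and then transform both sets.
X = ss_X.fit_transform(X)
y = np.array(y)
# Standardizing y is optional
# Its benefit is that the weights w stay in a similar range across problems, which also limits the range of the regularization parameter
#y = ss_y.fit_transform(y.reshape(-1, 1))
#log_y = ss_log_y.fit_transform(np.array(log_y).reshape(-1, 1))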


In [140]:
X.shape

(490, 12)

In [157]:
X_cat.shape

(490, 9)

In [158]:
X_cat.head()

Unnamed: 0,RAD_1,RAD_2,RAD_3,RAD_4,RAD_5,RAD_6,RAD_7,RAD_8,RAD_24
0,1,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0
3,0,0,1,0,0,0,0,0,0
4,0,0,1,0,0,0,0,0,0


In [159]:
y.shape

(490,)

In [160]:
pd.DataFrame(y).head()

Unnamed: 0,0
0,24.0
1,21.6
2,34.7
3,33.4
4,36.2


In [161]:
log_y.shape

(490,)

In [114]:
data = pd.DataFrame(X)
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,-0.417401,0.302696,-1.291856,-0.250812,-0.139895,0.50504,-0.109432,0.121208,-0.667101,-1.415179,0.443535,-1.122679
1,-0.414992,-0.486479,-0.593329,-0.250812,-0.731821,0.269017,0.377488,0.53733,-0.988734,-0.516361,0.443535,-0.534772
2,-0.414995,-0.486479,-0.593329,-0.250812,-0.731821,1.439934,-0.255152,0.53733,-0.988734,-0.516361,0.399553,-1.256937
3,-0.414412,-0.486479,-1.310933,-0.250812,-0.826186,1.153335,-0.798939,1.056878,-1.107857,-0.066953,0.419005,-1.41098
4,-0.410202,-0.486479,-1.310933,-0.250812,-0.826186,1.381694,-0.50039,1.056878,-1.107857,-0.066953,0.443535,-1.073216


In [121]:
df.index

Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            496, 497, 498, 499, 500, 501, 502, 503, 504, 505],
           dtype='int64', length=490)

# 4. Save the feature engineering result to a file for use in machine learning

In [163]:
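fe_data = pd.DataFrame(data = X, columns = feat_names, index = df.index)
# Only fe_data is passed to concat here, so the one-hot RAD columns in X_cat are not included;
# pd.concat([fe_data, X_cat], axis = 1) would add them
fe_data = pd.concat([fe_data], axis = 1, ignore_index = False)
#print(fe_data.head())
print(fe_data.shape)
# Add the labels y
fe_data['MEDV'] = y
fe_data['LOGMEDV'] = log_y


# Save the result to a file
fe_data.to_csv("FE_boston_housing_prics.csv", index = False)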

(490, 12)
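
The cell above passes only `fe_data` to `pd.concat`, so the saved file holds the 12 standardized numeric features without the one-hot RAD columns. If those columns are wanted as well, row alignment matters because the index has gaps after filtering. This is a minimal sketch with hypothetical stand-ins (`idx`, `X_num`, `X_cat_demo` and the values are made up), not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the objects in this notebook (values are made up).
idx = pd.Index([0, 1, 3])                                  # gapped index, as after filtering
X_num = np.array([[0.1, -1.2], [0.4, 0.3], [-0.5, 0.9]])   # scaled numeric features
X_cat_demo = pd.get_dummies(pd.Series([1, 2, 1], index=idx), prefix="RAD")

# Reattach the original index so that concat aligns rows by label.
num_df = pd.DataFrame(X_num, columns=["CRIM", "ZN"], index=idx)
combined = pd.concat([num_df, X_cat_demo], axis=1)

n_missing = int(combined.isna().sum().sum())

print(combined.shape)
print(list(combined.columns))
print(n_missing)   # stays at zero when the two indexes line up
```

Building the numeric frame without `index=idx` would give it a fresh 0..n-1 index, and `concat` would then introduce missing values wherever the labels differ.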


In [164]:
fe_data.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,TAX,PTRATIO,B,LSTAT,MEDV,LOGMEDV
0,-0.417401,0.302696,-1.291856,-0.250812,-0.139895,0.50504,-0.109432,0.121208,-0.667101,-1.415179,0.443535,-1.122679,24.0,3.218876
1,-0.414992,-0.486479,-0.593329,-0.250812,-0.731821,0.269017,0.377488,0.53733,-0.988734,-0.516361,0.443535,-0.534772,21.6,3.11795
2,-0.414995,-0.486479,-0.593329,-0.250812,-0.731821,1.439934,-0.255152,0.53733,-0.988734,-0.516361,0.399553,-1.256937,34.7,3.575151
3,-0.414412,-0.486479,-1.310933,-0.250812,-0.826186,1.153335,-0.798939,1.056878,-1.107857,-0.066953,0.419005,-1.41098,33.4,3.538057
4,-0.410202,-0.486479,-1.310933,-0.250812,-0.826186,1.381694,-0.50039,1.056878,-1.107857,-0.066953,0.443535,-1.073216,36.2,3.616309


In [137]:
fe_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 490 entries, 0 to 505
Data columns (total 12 columns):
CRIM       490 non-null float64
ZN         490 non-null float64
INDUS      490 non-null float64
CHAS       490 non-null float64
NOX        490 non-null float64
RM         490 non-null float64
AGE        490 non-null float64
DIS        490 non-null float64
TAX        490 non-null float64
PTRATIO    490 non-null float64
B          490 non-null float64
LSTAT      490 non-null float64
dtypes: float64(12)
memory usage: 69.8 KB


In [138]:
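# Without parentheses this displays the bound method (as below); call fe_data.describe() for summary statistics
fe_data.describe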

<bound method NDFrame.describe of          CRIM        ZN     INDUS      CHAS       NOX        RM       AGE  \
0   -0.417401  0.302696 -1.291856 -0.250812 -0.139895  0.505040 -0.109432   
1   -0.414992 -0.486479 -0.593329 -0.250812 -0.731821  0.269017  0.377488   
2   -0.414995 -0.486479 -0.593329 -0.250812 -0.731821  1.439934 -0.255152   
3   -0.414412 -0.486479 -1.310933 -0.250812 -0.826186  1.153335 -0.798939   
4   -0.410202 -0.486479 -1.310933 -0.250812 -0.826186  1.381694 -0.500390   
..        ...       ...       ...       ...       ...       ...       ...   
501 -0.410939 -0.486479  0.119873 -0.250812  0.160357  0.532627  0.029180   
502 -0.412931 -0.486479  0.119873 -0.250812  0.160357 -0.192299  0.299297   
503 -0.411153 -0.486479  0.119873 -0.250812  0.160357  1.119618  0.807542   
504 -0.405549 -0.486479  0.119873 -0.250812  0.160357  0.840682  0.747121   
505 -0.412685 -0.486479  0.119873 -0.250812  0.160357 -0.330234  0.445017   

          DIS       TAX   PTRATIO        