# Case Study: Boston Housing Price Prediction

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

## 1. Getting the Data
### 1.1 The Boston housing data is built into sklearn and can be obtained with load_boston()

In [2]:
boston = load_boston()


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h
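
The slicing in the warning text above depends on the layout of the raw file, where each record is split over two physical lines: 11 values on the first and 3 on the second, the last of which is the target. The following is a minimal sketch of that reshaping on a tiny made-up array. The array `raw` and its values are hypothetical stand-ins for `raw_df.values`, not real data.

```python
import numpy as np

# Hypothetical stand-in for raw_df.values: two records, each split over two rows.
# Even rows carry 11 values; odd rows carry 3 values and are NaN-padded to width 11.
nan = np.nan
raw = np.array([
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [11, 12, 99] + [nan] * 8,
    [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
    [31, 32, 77] + [nan] * 8,
], dtype=float)

# Same slicing as in the warning text: 11 features from the even rows,
# plus the first 2 values of the odd rows, give 13 features per record.
data = np.hstack([raw[::2, :], raw[1::2, :2]])
# The third value on each odd row is the target.
target = raw[1::2, 2]

print(data.shape)   # (2, 13)
print(target)       # [99. 77.]
```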

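##### Feature meanings
CRIM: per capita crime rate by town.<br/>
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.<br/>
INDUS: proportion of non-retail business land per town.<br/>
CHAS: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise).<br/>
NOX: nitric oxides concentration.<br/>
RM: average number of rooms per dwelling.<br/>
AGE: proportion of owner-occupied units built before 1940.<br/>
DIS: weighted distances to five Boston employment centres.<br/>
RAD: index of accessibility to radial highways.<br/>
TAX: full-value property tax rate per 10,000 USD.<br/>
PTRATIO: pupil-teacher ratio by town.<br/>
B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town.<br/>
LSTAT: proportion of the population of lower status.<br/>
MEDV: median value of owner-occupied homes, in thousands of US dollars.<br/>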

In [3]:
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [4]:
# type(boston)
print(boston)

{'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        5.6400e+00],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
        6.4800e+00],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        7.8800e+00]]), 'target': array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
       18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
       15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
       13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
       21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
       35.4, 24.7, 31.6, 23.3, 19.6, 1

##### Extract the features X and the labels y

In [5]:
X = boston.data
y = boston.target

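### 1.2 Reading from a file
In most cases the data is stored in a file, such as an Excel spreadsheet, so we can also read the data from a file. This is usually done with pandas.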

In [6]:
import pandas as pd

In [7]:
df = pd.read_excel('data2/boston.xls')
df

FileNotFoundError: [Errno 2] No such file or directory: 'data2/boston.xls'

In [None]:
X = df[df.columns[0:-1]]
y = df[df.columns[-1]]
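
Since the Excel file is not available here, the feature/label split can be illustrated on a small in-memory table instead. This is a sketch using positional indexing with `iloc`. The frame `df_demo` and its values are hypothetical, and the only assumption carried over from the cell above is that the label sits in the last column.

```python
import pandas as pd

# Hypothetical miniature table: the last column (MEDV) is the label.
df_demo = pd.DataFrame({
    "RM": [6.5, 6.4, 7.1],
    "LSTAT": [4.9, 9.1, 4.0],
    "MEDV": [24.0, 21.6, 34.7],
})

X_demo = df_demo.iloc[:, :-1]   # every column except the last
y_demo = df_demo.iloc[:, -1]    # the last column only

print(X_demo.shape)  # (3, 2)
print(y_demo.name)   # MEDV
```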


## 2. Data Preprocessing (Data Cleaning)


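The data we obtain may have problems such as the following:
  - Missing values
  - Incorrect values, such as age = 200
  - Inconsistent data, e.g. grade codes that are sometimes "1, 2, 3" and sometimes "A, B, C"
  - Duplicate records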
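
For datasets that do need cleaning, the four cases above might be handled roughly as follows with pandas. This is only a sketch: the table `raw`, the plausible age range of 0 to 120, and the mapping from letter codes to numeric codes are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical records showing the four problems listed above.
raw = pd.DataFrame({
    "age":   [25, 200, np.nan, 31, 31],
    "grade": ["1", "A", "2", "C", "C"],
})

clean = raw.drop_duplicates()                  # duplicate records
clean = clean.dropna(subset=["age"])           # missing values
clean = clean[clean["age"].between(0, 120)]    # implausible values (range is an assumption)
# Map the letter codes onto the numeric scheme so one encoding is used throughout.
clean = clean.assign(grade=clean["grade"].replace({"A": "1", "B": "2", "C": "3"}))

print(clean)
```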
  
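$\color{red}{\text{Note: this course focuses on machine learning algorithms, and the Boston housing data has already been cleaned, so no preprocessing code is needed in this part.}}$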

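## 3. Data Analysis and Visualization

$\color{red}{\text{Note: this course focuses on machine learning algorithms, not data analysis, so the data analysis and visualization part is skipped.}}$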

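## 4. Choosing a Suitable Machine Learning Model

This is a house price prediction problem, that is, a regression task with a continuous target. Linear regression is well suited to such problems, so we choose a linear regression model. The code below uses ridge regression, which is linear regression with an L2 penalty whose strength is set by alpha.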

In [None]:
model = linear_model.Ridge(alpha=0.1)
model.fit(X,y)
y_hat = model.predict(X)
y_hat
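
To see what alpha does, recall that ridge regression minimizes $\|y - Xw\|^2 + \alpha\|w\|^2$. On centered data, where the intercept can be left out, this has the closed-form solution $w = (X^\top X + \alpha I)^{-1} X^\top y$. The sketch below evaluates that formula on synthetic data. `ridge_closed_form` is a hypothetical helper written for illustration, not the solver that sklearn uses.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y on centered data (no intercept term)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = Xc.shape[1]
    A = Xc.T @ Xc + alpha * np.eye(n_features)
    return np.linalg.solve(A, Xc.T @ yc)

# Synthetic data with known coefficients plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

# A larger alpha shrinks the coefficient vector toward zero.
norms = []
for alpha in [0.0, 1.0, 100.0]:
    w = ridge_closed_form(X_demo, y_demo, alpha)
    norms.append(np.linalg.norm(w))
    print(alpha, norms[-1])
```

A very small alpha behaves like ordinary least squares, while a very large alpha pushes all coefficients toward zero, which is why alpha has to be tuned.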

How do we choose the parameter alpha?

## 5. Training the Model (Using Cross-Validation to Select Suitable Parameters)


In [None]:
# Hold out 20% of the samples as a test set; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
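
Without a fixed seed, every run of the split produces a different partition, so the scores reported later will vary from run to run. A minimal sketch of a reproducible split, using synthetic stand-in data and an arbitrarily chosen `random_state` value:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 13 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 13))
y_demo = rng.normal(size=100)

# test_size=0.2 holds out 20% of the samples; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)  # (80, 13) (20, 13)
```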

In [None]:
from sklearn.model_selection import GridSearchCV


In [None]:
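ridge_model = linear_model.Ridge()
# Candidate values to search over. Note: Ridge accepts `normalize` only in scikit-learn < 1.2.
param = {'alpha': [0.01, 0.03, 0.05, 0.07, 0.1, 0.5, 0.8, 1], 'normalize': [True, False]}
# 5-fold cross-validation; scores are negated MSE, so larger (closer to 0) is better.
gsearch = GridSearchCV(estimator=ridge_model, param_grid=param, cv=5, scoring='neg_mean_squared_error')
gsearch.fit(X_train, y_train)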

In [None]:
gsearch.best_params_,gsearch.best_score_
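
In scikit-learn 1.2 and later, `Ridge` no longer accepts the `normalize` argument. One common replacement, sketched here on synthetic data, is to put a `StandardScaler` in front of the model in a `Pipeline` and search over alpha only. This is not equivalent to `normalize=True`, which divided the centered features by their L2 norm rather than their standard deviation, so the selected alpha values are not directly comparable. The step names `scale` and `ridge` are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data.
X_demo, y_demo = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])
# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"ridge__alpha": [0.01, 0.03, 0.05, 0.07, 0.1, 0.5, 0.8, 1]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

# best_score_ is the negated MSE, so values closer to zero are better.
print(search.best_params_, search.best_score_)
```

Because the scaler is fitted inside each cross-validation fold, the validation folds do not leak into the scaling statistics.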

## 6. Model Evaluation

In [None]:
final_model = linear_model.Ridge(alpha=0.01,normalize=True)
final_model.fit(X_train,y_train)

y_train_hat = final_model.predict(X_train)
y_test_hat = final_model.predict(X_test)

print("train-MSE=",mean_squared_error(y_train,y_train_hat))
print("test-MSE=",mean_squared_error(y_test,y_test_hat))
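
The notebook imports both `mean_squared_error` and `mean_absolute_error` but only reports the MSE. As a sketch of what these metrics compute, the following evaluates them by hand with NumPy on a few made-up values. The arrays `y_true` and `y_pred` are hypothetical.

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)       # mean squared error
mae = np.mean(np.abs(errors))    # mean absolute error
rmse = np.sqrt(mse)              # root of the MSE, in the same units as the target

print(mse, mae)  # 0.375 0.5
```

A test MSE that is much larger than the training MSE suggests overfitting; the MAE is less sensitive to a few large errors than the MSE.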

## 7. Deployment

1. Saving the model

In [None]:
# sklearn.externals.joblib was removed from scikit-learn; use the standalone joblib package.
import joblib
joblib.dump(final_model, "house_train_model.m")

2. Loading the model

In [None]:
load_model = joblib.load("house_train_model.m")

In [None]:
load_model.predict(X_test)
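
The save and load steps can be checked end to end with a small round trip. The sketch below fits a model on synthetic data, writes it to a temporary directory, reads it back, and compares predictions. The file name `demo_model.joblib` and the data are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with known coefficients.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
y_demo = X_demo @ np.array([1.0, 2.0, -1.0, 0.5])

model_demo = Ridge(alpha=0.1).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    joblib.dump(model_demo, path)
    restored = joblib.load(path)

# The restored model reproduces the original predictions.
same = np.allclose(model_demo.predict(X_demo), restored.predict(X_demo))
print(same)  # True
```

Files written by `joblib.dump` are pickle-based, so they should only be loaded from trusted sources, ideally with the same scikit-learn version that produced them.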