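Regression is one of the most powerful tools in statistics. Supervised learning algorithms in machine learning fall into two groups, classification and regression: classification predicts discrete values, and regression predicts continuous values. Put simply, given a set of points D, regression fits a function to the points so that the error between the points and the fitted function is as small as possible. Here the error is the difference between the predicted y value and the true y value. Simply summing these differences would let positive and negative differences cancel each other out, so the squared error is used instead (the least squares method).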
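To make the cancellation argument concrete, here is a minimal sketch in plain Python. The point set is made up for illustration and the helper name `fit_line` is hypothetical; neither comes from the housing example. It fits a line with the closed-form least-squares solution for a single feature, then compares the sum of signed errors with the sum of squared errors.

```python
# Illustrative only: a tiny made-up point set, not the housing data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a * x + b for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

a, b = fit_line(xs, ys)
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]

signed_sum = sum(residuals)                   # positive and negative errors cancel
squared_sum = sum(r ** 2 for r in residuals)  # squared errors cannot cancel

print(f"slope={a:.3f}, intercept={b:.3f}")
print(f"sum of signed errors:  {signed_sum:.6f}")
print(f"sum of squared errors: {squared_sum:.6f}")
```

The signed sum comes out at essentially zero even though no point lies exactly on the line, which is why it is useless as a measure of fit. The squared sum stays positive.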

Here we use the Boston housing dataset that ships with scikit-learn to train a model, and then use the model to predict house prices.

In [4]:
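# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release.
from sklearn.datasets import load_boston
boston = load_boston()
x = boston.data      # feature matrix
y = boston.target    # target values (house prices)
x.shape              # not displayed: a cell only echoes its last expression
boston.feature_names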

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

In [5]:
# Split the dataset into a training set and a test set
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=3)
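As a rough illustration of what `test_size=0.2` means for this dataset, the sketch below works out the split sizes in plain Python and builds a reproducible shuffled split of row indices. It assumes the Boston dataset's 506 rows and assumes the test count is rounded up. The shuffling here is not scikit-learn's, so the rows selected will differ from those `train_test_split` picks with `random_state=3`.

```python
import math
import random

n_samples = 506   # number of rows in the Boston housing dataset
test_size = 0.2

# Assumption: the test count is rounded up and the remainder goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Shuffle row indices with a fixed seed so the split is reproducible.
# This is NOT scikit-learn's shuffling; only the sizes are comparable.
indices = list(range(n_samples))
random.Random(3).shuffle(indices)
test_idx = indices[:n_test]
train_idx = indices[n_test:]

print(n_train, n_test)  # prints: 404 102
```

Fixing the seed matters for the same reason `random_state=3` does in the cell above: without it, every run scores the model on a different test set and the results are not comparable.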

In [11]:
import time
from sklearn.linear_model import LinearRegression
model = LinearRegression()
start = time.perf_counter()  # time.clock() was removed in Python 3.8
model.fit(x_train, y_train)
train_score = model.score(x_train, y_train)  # R^2 score on the training set
cv_score = model.score(x_test, y_test)       # R^2 score on the test set
print('elapsed:{0:.2f},train_score:{1:0.2f},cv_score:{2:.2f}'.format(time.perf_counter()-start,train_score,cv_score))

elapsed:0.00,train_score:0.72,cv_score:0.79
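For a regressor such as `LinearRegression`, `score` returns the coefficient of determination R^2 rather than a classification-style accuracy. The sketch below computes R^2 by hand in plain Python to show how to read the numbers above. The function name `r2_score_manual` and the sample values are made up for illustration.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration, not taken from the housing data.
y_true = [3.0, 5.0, 7.0, 9.0]
perfect = [3.0, 5.0, 7.0, 9.0]     # predictions match exactly
mean_only = [6.0, 6.0, 6.0, 6.0]   # always predict the mean of y_true
close = [2.5, 5.5, 7.5, 8.5]       # each prediction is off by 0.5

print(r2_score_manual(y_true, perfect))    # 1.0
print(r2_score_manual(y_true, mean_only))  # 0.0
print(r2_score_manual(y_true, close))      # 0.95
```

A score of 1.0 is a perfect fit and 0.0 is no better than always predicting the mean, so a training score of 0.72 leaves a good share of the variance unexplained.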


Model optimization

In [14]:
# The feature ranges differ widely, so the data is normalized.
# The model underfits. Option 1: find more input features; option 2: add polynomial features.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures  # generates polynomial features
from sklearn.pipeline import Pipeline
def polynomial_model(degree=1):
    polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
    # Note: the normalize parameter was deprecated in scikit-learn 1.0 and removed in 1.2.
    linear_regression = LinearRegression(normalize=True)
    pipeline = Pipeline([('polynomial_features', polynomial_features), ('linear_regression', linear_regression)])
    return pipeline
model = polynomial_model(degree=2)
start = time.perf_counter()  # time.clock() was removed in Python 3.8
model.fit(x_train, y_train)
train_score = model.score(x_train, y_train)
cv_score = model.score(x_test, y_test)
print('elapsed:{0:.2f},train_score:{1:0.2f},cv_score:{2:.2f}'.format(time.perf_counter()-start,train_score,cv_score))

elapsed:0.04,train_score:0.93,cv_score:0.86
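To see what the degree-2 expansion adds, the sketch below enumerates the terms in plain Python. It is an illustrative enumeration, not the `PolynomialFeatures` implementation, and the helper name `polynomial_terms` is hypothetical. With `include_bias=False` there is no constant column, so only monomials of total degree 1 up to `degree` are listed.

```python
from itertools import combinations_with_replacement

def polynomial_terms(names, degree):
    """List all monomials of total degree 1..degree (no bias column)."""
    terms = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(names, d):
            terms.append("*".join(combo))
    return terms

print(polynomial_terms(["a", "b"], degree=2))
# ['a', 'b', 'a*a', 'a*b', 'b*b']

boston_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
                'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
print(len(polynomial_terms(boston_names, degree=2)))  # 104
```

The 13 original features become 104 inputs: the 13 linear terms plus 91 squares and pairwise products. That extra capacity lets the model fit the training set more closely, which matches the rise in training score from 0.72 to 0.93.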
