# Multiple Linear Regression

#### Simple linear regression: find the line that minimizes the sum of squared distances between the points and the line
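
As a rough illustration of the idea above, the sketch below fits a single-feature line with the closed-form least-squares formulas. The data is made up for the example, and "distance" is taken to mean the vertical distance between each point and the line, which is the usual convention for least squares.

```python
import numpy as np

# Toy data, made up for illustration: y is roughly 2*x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Closed-form least-squares estimates for a single feature.
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Sum of squared vertical distances between the points and the fitted line.
sse = np.sum((y - (slope * x + intercept)) ** 2)

# Any other line, for example one with a slightly different slope,
# gives a larger sum of squares.
sse_other = np.sum((y - ((slope + 0.1) * x + intercept)) ** 2)

print('slope:', slope, 'intercept:', intercept)
print('SSE (fitted line):', sse)
print('SSE (perturbed line):', sse_other)
```

The fitted slope and intercept should land close to the values used to generate the data, and the perturbed line should always score worse.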

## Multiple linear regression with Sklearn
Regress on several features at once.


In [2]:
import pandas as pd
df = pd.read_excel('波士顿房价.xlsx')
df

Unnamed: 0,犯罪率,豪宅比例,商务地产比例,河畔住宅,一氧化氮浓度,每户平均人数,老房比例,到就业中心距离,高速公路便利性,财产税率,师生比例,非洲裔比例指数,低收入者比例,同类房屋中位数价格
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0

Column names in English, left to right after the row index: crime rate, proportion of large-lot homes, proportion of business property, riverside home, nitric oxide concentration, average persons per household, proportion of old houses, distance to employment centers, highway accessibility, property tax rate, pupil-teacher ratio, index of African-American population proportion, proportion of low-income residents, median price of comparable homes.


In [4]:
# Predict the house price from the first 13 features
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

x_train,x_test,y_train,y_test = train_test_split(
    df.iloc[:,:-1],df.iloc[:,-1], random_state=42)

model = LinearRegression()
model.fit(x_train,y_train)
y_predict = model.predict(x_test)


### How accurate is it? Evaluating a regression model

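**MSE (mean squared error)** measures the gap between the true values and the predictions: sum the squared differences between each true value and its prediction, then divide by the number of samples. The larger the value, the larger the error and the less accurate the model. Because the errors are squared, MSE is in squared units of the target; taking the square root (RMSE) gives an error in the same units as the target, which is easier to read as a typical prediction error.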

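The magnitude of MSE depends on the scale of the target values in the dataset, so it is hard to compare models across different datasets using MSE alone.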
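
To make the scale dependence concrete, here is a small sketch. The true values are borrowed from the first five rows of the table above; the predictions and the factor of 1000 are made up for illustration. Rescaling the target multiplies MSE by the square of the factor, while R^2 stays the same.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# True values from the first five rows; predictions are made up.
y_true = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
y_pred = np.array([25.0, 20.0, 33.0, 35.0, 34.0])

# The same quantities expressed in a unit 1000 times smaller.
scale = 1000
y_true_scaled = y_true * scale
y_pred_scaled = y_pred * scale

mse = mean_squared_error(y_true, y_pred)
mse_scaled = mean_squared_error(y_true_scaled, y_pred_scaled)
r2 = r2_score(y_true, y_pred)
r2_scaled = r2_score(y_true_scaled, y_pred_scaled)

print('MSE (original units):', mse)
print('MSE (rescaled units):', mse_scaled)  # larger by a factor of scale**2
print('R^2 (original units):', r2)
print('R^2 (rescaled units):', r2_scaled)   # unchanged by the rescaling
```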

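**R^2 (coefficient of determination)** ranges from negative infinity up to +1.
* When the predictions exactly equal the true values, R^2 = 1. The model is very accurate.
* When every prediction equals the mean of the true values, R^2 = 0. The model is no better than always predicting the mean.
* When the predictions are essentially random, R^2 can be negative. The model is very poor.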
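
The three cases above can be checked with a short sketch that computes R^2 by hand as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The targets here are randomly generated stand-ins, not the housing data, and `r2_by_hand` is a hypothetical helper name.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
y_true = rng.normal(loc=22.0, scale=9.0, size=200)  # made-up targets

perfect = y_true.copy()                                   # predictions equal the truth
mean_only = np.full_like(y_true, y_true.mean())           # always predict the mean
random_guess = rng.normal(loc=22.0, scale=9.0, size=200)  # unrelated to y_true

scores = {}
for name, y_pred in [('perfect', perfect),
                     ('mean only', mean_only),
                     ('random', random_guess)]:
    mse = np.mean((y_true - y_pred) ** 2)
    rmse = np.sqrt(mse)
    scores[name] = r2_by_hand(y_true, y_pred)
    print(f'{name:>9}  MSE={mse:8.3f}  RMSE={rmse:6.3f}  R^2={scores[name]:6.3f}')

# The hand-written formula should agree with sklearn's r2_score.
print(np.isclose(scores['random'], r2_score(y_true, random_guess)))
```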


In [5]:
from sklearn.metrics import mean_squared_error, r2_score

print('MSE: ', mean_squared_error(y_test, y_predict))
print('R^2: ', r2_score(y_test, y_predict))

MSE:  22.098694827098146
R^2:  0.6844267283527108


In [7]:
# Inspect the coefficients of the multiple regression model
print('Coefficient: ',model.coef_)
print('Intercept: ', model.intercept_)

Coefficient:  [-1.28322638e-01  2.95517751e-02  4.88590934e-02  2.77350326e+00
 -1.62388292e+01  4.36875476e+00 -9.24808158e-03 -1.40086668e+00
  2.57761243e-01 -9.95694820e-03 -9.23122944e-01  1.31854199e-02
 -5.17639519e-01]
Intercept:  29.83642016383877
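
A bare coefficient array is hard to read, so one option is to pair each coefficient with its column name. The sketch below does this on synthetic data with hypothetical column names (`feat_a`, `feat_b`, `feat_c`) and also checks that a prediction is the features multiplied by `coef_`, plus `intercept_`. Keep in mind that coefficient sizes depend on each feature's units, so they are not directly comparable across unscaled features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the housing table; column names are hypothetical.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)),
                 columns=['feat_a', 'feat_b', 'feat_c'])
noise = rng.normal(scale=0.1, size=100)
y = 3.0 * X['feat_a'] - 2.0 * X['feat_b'] + 0.5 * X['feat_c'] + 10.0 + noise

model = LinearRegression().fit(X, y)

# Pair each coefficient with its column name for easier reading.
coef_table = pd.Series(model.coef_, index=X.columns)
print(coef_table)
print('Intercept: ', model.intercept_)

# A prediction is the dot product of the features with coef_, plus intercept_.
manual = X.to_numpy() @ model.coef_ + model.intercept_
print(np.allclose(manual, model.predict(X)))
```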


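### SGD Regression
Regression fitted by stochastic gradient descent.
On very large datasets it can train faster, though accuracy may drop somewhat.

**Note: the feature values must be standardized.**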

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

x_train,x_test,y_train,y_test = train_test_split(
    df.iloc[:,:-1],df.iloc[:,-1], random_state=42)

# Fit the scaler on the training set only, then reuse its mean and variance
# for the test set so that no test-set statistics leak into preprocessing.
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# This cell fits a RandomForestRegressor (a tree ensemble), not sklearn's SGDRegressor.
# No random_state is set, so the scores vary slightly from run to run.
model = RandomForestRegressor()
model.fit(x_train, y_train)

y_predict = model.predict(x_test)

# Score the model's predictions against the true test-set prices
print('MSE: ', mean_squared_error(y_test, y_predict))
print('R2: ', r2_score(y_test, y_predict))


MSE:  12.513086464566937
R2:  0.8213109115753453
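
The cell above fits a `RandomForestRegressor`. For the SGD regression that this section's heading describes, a minimal sketch might look like the following. It assumes sklearn's `SGDRegressor` and uses synthetic data from `make_regression` in place of the housing table, so its scores say nothing about the housing data. Wrapping the scaler and the model in a pipeline is one way to make sure the scaler is fitted on the training data only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the housing table (13 features, as in the notebook).
X, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The pipeline fits the scaler on the training data only, then reuses
# those statistics when it transforms the test data inside predict().
sgd_model = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=42),
)
sgd_model.fit(x_train, y_train)
sgd_predict = sgd_model.predict(x_test)

sgd_mse = mean_squared_error(y_test, sgd_predict)
sgd_r2 = r2_score(y_test, sgd_predict)
print('MSE: ', sgd_mse)
print('R2: ', sgd_r2)
```

To try this on the housing data, the same pipeline could be fitted on the `x_train` and `y_train` split from `df` instead of the synthetic arrays.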
