# Multiple Linear Regression / Log-Linear Regression (choose one)

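## I. Multiple Linear Regression
In this part you will complete a multiple linear regression. We first walk through 10-fold cross-validation of a simple (single-feature) linear regression with sklearn; you can then follow the same pattern for the multiple linear regression.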

### 1. Load the data

In [38]:
import numpy as np

In [39]:
import pandas as pd

# Load the data
data = pd.read_csv('data/kaggle_house_price_prediction/kaggle_hourse_price_train.csv')

# Drop features (columns) that contain missing values
data.dropna(axis = 1, inplace = True)

# Keep only the integer-valued (int64) features
data = data[[col for col in data.dtypes.index if data.dtypes[col] == 'int64']]
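
The two filtering steps can be tried on a small made-up table first. The sketch below uses a hypothetical three-column frame (not the Kaggle data) and `select_dtypes`, a pandas method that should select the same `int64` columns as the list comprehension above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (all values are made up).
toy = pd.DataFrame({
    'LotArea': np.array([8450, 9600, 11250], dtype='int64'),  # complete integer column
    'LotFrontage': [65.0, np.nan, 68.0],                      # contains a missing value
    'Street': ['Pave', 'Pave', 'Grvl'],                       # non-numeric column
})

# Drop every column that has at least one missing value.
toy_complete = toy.dropna(axis=1)

# Keep only the int64 columns.
toy_int = toy_complete.select_dtypes(include='int64')

print(list(toy_int.columns))
```

Only `LotArea` survives both filters: `LotFrontage` is dropped for its missing value and `Street` for its dtype.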

In [40]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 35 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
Fireplaces       1460 non-null int64
GarageCars       1460 non-null int64
Garag

### 2. Import the model

In [41]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict

### 3. 10-fold cross-validation of simple linear regression with sklearn

#### Create the model

In [42]:
model = LinearRegression()

#### Select the data

In [43]:
features = ['LotArea']
x = data[features]
y = data['SalePrice']

#### Make 10-fold cross-validated predictions

In [44]:
prediction = cross_val_predict(model, x, y, cv = 10)

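The ten folds are taken in row order: the first 10% of the rows serve as the test set first, then the rows from 10% to 20%, and so on. The predictions for the ten folds are then concatenated in the same order and returned.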
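
To make the fold order concrete, the following sketch reproduces `cross_val_predict(..., cv = 10)` by hand on synthetic data. It assumes that, for a regressor and an integer `cv`, sklearn falls back to `KFold` without shuffling; the data and variable names are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic data: 100 rows, one feature, a noisy linear target.
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 1)
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)

# Predictions produced by sklearn in one call.
auto_pred = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=10)

# The same thing by hand: fit on nine folds, predict the held-out fold,
# and write each prediction back at the row's original position.
manual_pred = np.empty_like(y_demo)
folds = list(KFold(n_splits=10).split(X_demo))
for train_idx, test_idx in folds:
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_pred[test_idx] = fold_model.predict(X_demo[test_idx])

first_test_idx = folds[0][1]
print(first_test_idx)                       # rows held out in the first fold
print(np.allclose(auto_pred, manual_pred))  # whether both approaches agree
```

Because the folds follow row order, any ordering in the data file (for example by date or by neighbourhood) carries over into the folds.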

In [45]:
prediction.shape

(1460,)

### 4. Compute evaluation metrics

#### MAE

In [46]:
mean_absolute_error(data['SalePrice'], prediction)

55394.441952448942

#### RMSE

In [47]:
mean_squared_error(data['SalePrice'], prediction) ** 0.5

77868.513377524141
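
Both metrics are averages over the prediction errors: $\mathrm{MAE} = \frac{1}{n}\sum_i |y_i - \hat{y}_i|$ and $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$. The sketch below checks these formulas against the sklearn functions on four made-up prices (not taken from the data set).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true prices and predictions.
y_true = np.array([200000.0, 150000.0, 320000.0, 180000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 185000.0])

errors = y_true - y_pred                      # [-10000, 10000, 20000, -5000]
mae_manual = np.mean(np.abs(errors))          # 45000 / 4 = 11250.0
rmse_manual = np.sqrt(np.mean(errors ** 2))   # sqrt(6.25e8 / 4) = 12500.0

# The sklearn functions take (y_true, y_pred) in that order.
mae_sklearn = mean_absolute_error(y_true, y_pred)
rmse_sklearn = mean_squared_error(y_true, y_pred) ** 0.5

print(mae_manual, mae_sklearn)
print(rmse_manual, rmse_sklearn)
```

RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest, because squaring weights large errors more heavily.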

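### 5. Choose several feature combinations, fit a multiple linear regression for each, and compare the MAE and RMSE of the resulting models under 10-fold cross-validation (at least 3 combinations)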

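###### Extension: polynomial regression (an extension of simple linear regression). Try transforming some of the features, for example feeding their squares and cubes into the model as extra features, and observe how the model's predictive performance changes
###### Hint: for multiple linear regression, just add the names of other features to the `features` list above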

In [48]:
# Model 1
myFeatures1 = ['LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd']
x1 = data[myFeatures1]

prediction1 = cross_val_predict(model, x1, y, cv = 10)
prediction1.shape

# MAE
print('Model 1 MAE:', mean_absolute_error(data['SalePrice'], prediction1))
# RMSE
print('Model 1 RMSE:', mean_squared_error(data['SalePrice'], prediction1) ** 0.5)


# Model 2
myFeatures2 = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF']
x2 = data[myFeatures2]

prediction2 = cross_val_predict(model, x2, y, cv = 10)
prediction2.shape

# MAE
print('Model 2 MAE:', mean_absolute_error(data['SalePrice'], prediction2))
# RMSE
print('Model 2 RMSE:', mean_squared_error(data['SalePrice'], prediction2) ** 0.5)


# Model 3
myFeatures3 = ['GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
               'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
               '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']
x3 = data[myFeatures3]

prediction3 = cross_val_predict(model, x3, y, cv = 10)
prediction3.shape

# MAE
print('Model 3 MAE:', mean_absolute_error(data['SalePrice'], prediction3))
# RMSE
print('Model 3 RMSE:', mean_squared_error(data['SalePrice'], prediction3) ** 0.5)

Model 1 MAE: 30992.3710513
Model 1 RMSE: 46314.1393576
Model 2 MAE: 31058.3565668
Model 2 RMSE: 50023.271828
Model 3 MAE: 28436.5724298
Model 3 RMSE: 43120.1334739


###### Double-click here to fill in
1. Features used by model 1: 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd'
2. Features used by model 2: 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF'
3. Features used by model 3: 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold'


Model | MAE | RMSE
-|-|-
Model 1 | 30992.3710513 | 46314.1393576
Model 2 | 31058.3565668 | 50023.271828
Model 3 | 28436.5724298 | 43120.1334739
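
The three model blocks repeat the same steps, so the comparison could also be written as one loop. In the sketch below, `evaluate_feature_sets` is a hypothetical helper name introduced here, and the frame it runs on is synthetic; with the real data one would pass `data`, `'SalePrice'`, and the three feature lists instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_predict


def evaluate_feature_sets(frame, target, feature_sets, cv=10):
    """Return one row of cross-validated MAE/RMSE per named feature set."""
    rows = []
    for name, features in feature_sets.items():
        pred = cross_val_predict(LinearRegression(), frame[features], frame[target], cv=cv)
        rows.append({
            'model': name,
            'MAE': mean_absolute_error(frame[target], pred),
            'RMSE': mean_squared_error(frame[target], pred) ** 0.5,
        })
    return pd.DataFrame(rows).set_index('model')


# Synthetic stand-in for the housing data: y depends on a and b, not on c.
rng = np.random.RandomState(0)
demo = pd.DataFrame(rng.rand(200, 3), columns=['a', 'b', 'c'])
demo['y'] = 2.0 * demo['a'] + 5.0 * demo['b'] + rng.normal(scale=0.1, size=200)

results = evaluate_feature_sets(demo, 'y', {
    'a only': ['a'],
    'a + b': ['a', 'b'],
    'a + b + c': ['a', 'b', 'c'],
})
print(results)
```

Collecting the results in a DataFrame gives the comparison table directly, which avoids copying numbers from the printed output by hand.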

#### Polynomial regression

In [49]:
myFeatures4 = ['LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
               'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF']
x4_ = data[myFeatures4]
x4 = x4_.copy()

# Add the squared terms
for feature in myFeatures4:
    x4[feature + '_square'] = [i ** 2 for i in x4_[feature]]

# Add the cubic terms
for feature in myFeatures4:
    x4[feature + '_cubic'] = [i ** 3 for i in x4_[feature]]

# print(x4)

# The commented-out lines below do this with sklearn instead, but the data type gets converted to int and precision is lost
# from sklearn.preprocessing import PolynomialFeatures
# poly_features = PolynomialFeatures(degree=2, include_bias=False)
# X_poly = poly_features.fit_transform(x4)
# print(X_poly)


prediction4 = cross_val_predict(model, x4, y, cv = 10)

prediction4.shape

# MAE
print('Model 4 MAE:', mean_absolute_error(data['SalePrice'], prediction4))
# RMSE
print('Model 4 RMSE:', mean_squared_error(data['SalePrice'], prediction4) ** 0.5)


Model 4 MAE: 19950.3441752
Model 4 RMSE: 66376.5116672
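
As an alternative to building the power columns by hand, the sketch below wraps `PolynomialFeatures` and `LinearRegression` in a pipeline, so the expansion is applied inside each cross-validation fold. It runs on synthetic one-feature data, not the housing set. With more than one input feature, `PolynomialFeatures` also generates cross terms, so it is not an exact substitute for the loops above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a cubic relationship between x and y.
rng = np.random.RandomState(0)
x_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = 0.5 * x_demo[:, 0] ** 3 - x_demo[:, 0] + rng.normal(scale=0.5, size=200)

# Cross-validated RMSE for polynomial degrees 1 to 3.
rmse_by_degree = {}
for degree in (1, 2, 3):
    poly_model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    pred = cross_val_predict(poly_model, x_demo, y_demo, cv=10)
    rmse_by_degree[degree] = mean_squared_error(y_demo, pred) ** 0.5

print(rmse_by_degree)
```

On this synthetic data the degree-3 model fits best because the target really is cubic. On the housing data the picture is mixed: Model 4 has a lower MAE than Models 1 to 3 but a higher RMSE, which is consistent with a small number of very large errors.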
