# Multiple Linear Regression / Log-Linear Regression (Choose One)

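## I. Multiple Linear Regression
In this part you will implement multiple linear regression. We first walk through 10-fold cross-validation for simple (single-feature) linear regression with sklearn; you can then follow the same pattern for the multiple-feature case.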

### 1. Load the data

In [None]:
import pandas as pd

In [None]:
# Load the data
data = pd.read_csv('data/advertising/advertising.csv')


In [None]:
data.head()

### 2. Import the model and metrics

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict

### 3. 10-fold cross-validation for simple linear regression with sklearn

#### Create the model

In [None]:
model = LinearRegression()

#### Select the data

In [None]:
features = ['TV']
x = data[features]
y = data['Sales']

#### Predict with 10-fold cross-validation

In [None]:
prediction = cross_val_predict(model, x, y, cv = 10)

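With an integer `cv` and a regressor, this 10-fold cross-validation splits the data in order, without shuffling: the first 10% of the rows form the first test set, the next fold tests the rows from 10% to 20%, and so on. The predictions from the ten folds are then concatenated in the original row order and returned, so `prediction` has one entry per row of `data`.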
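
The sequential splitting can be checked by hand. The following is a minimal sketch, assuming that `cross_val_predict` with `cv=10` and a regressor behaves like an unshuffled `KFold(n_splits=10)`. The arrays `X_demo` and `y_demo` are made-up synthetic stand-ins, not the advertising data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in data (hypothetical, not the advertising dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 300, size=(200, 1))
y_demo = 7.0 + 0.05 * X_demo[:, 0] + rng.normal(0, 1.5, size=200)

# Unshuffled 10-fold split: fold 0 tests rows 0-19, fold 1 tests rows 20-39, ...
kf = KFold(n_splits=10, shuffle=False)
manual_pred = np.empty_like(y_demo)
for train_idx, test_idx in kf.split(X_demo):
    # Fit on the other nine folds, predict the held-out fold
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_pred[test_idx] = fold_model.predict(X_demo[test_idx])

# The library call should reproduce the manual loop
auto_pred = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=10)
print(np.allclose(manual_pred, auto_pred))
```

If the rows of a dataset are sorted in some meaningful way, sequential folds can be unrepresentative; in that case a `KFold(n_splits=10, shuffle=True, random_state=0)` object can be passed as `cv` instead of the integer.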

In [None]:
prediction.shape

### 4. Compute the evaluation metrics

#### MAE

In [None]:
mean_absolute_error(data['Sales'], prediction)

#### RMSE

In [None]:
mean_squared_error(data['Sales'], prediction) ** 0.5
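
To make the two metrics concrete, here is a small sketch that computes MAE and RMSE by hand with NumPy and compares them with the sklearn functions. The four-element arrays `y_true_demo` and `y_pred_demo` are invented for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Tiny made-up example (hypothetical values)
y_true_demo = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_demo = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true_demo - y_pred_demo           # [0.5, 0.0, -1.5, -1.0]
mae_manual = np.mean(np.abs(errors))         # mean of absolute errors
rmse_manual = np.sqrt(np.mean(errors ** 2))  # square root of the mean squared error

# sklearn metrics take (y_true, y_pred) in that order
mae_sklearn = mean_absolute_error(y_true_demo, y_pred_demo)
rmse_sklearn = mean_squared_error(y_true_demo, y_pred_demo) ** 0.5

print(mae_manual, mae_sklearn)
print(rmse_manual, rmse_sklearn)
```

Because RMSE squares the errors before averaging, it weights large errors more heavily than MAE does, and it is never smaller than MAE on the same predictions.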

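### 5. Choose several feature combinations, fit a multiple linear regression for each, and compare the MAE and RMSE of the resulting models under 10-fold cross-validation (at least 3 combinations)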

In [None]:
# YOUR CODE HERE
features1 = ['TV', 'Radio']
x1 = data[features1]
y1 = data['Sales']
prediction1 = cross_val_predict(model, x1, y1, cv=10)
mae = mean_absolute_error(y1, prediction1)
rmse = mean_squared_error(y1, prediction1) ** 0.5
print("Model-1:\nMAE:", mae, "\nRMSE:", rmse)

In [None]:
# YOUR CODE HERE
features2 = ['TV', 'Newspaper']
x2 = data[features2]
y2 = data['Sales']
prediction2 = cross_val_predict(model, x2, y2, cv=10)
mae = mean_absolute_error(y2, prediction2)
rmse = mean_squared_error(y2, prediction2) ** 0.5
print("Model-2:\nMAE:", mae, "\nRMSE:", rmse)

In [None]:
# YOUR CODE HERE
features3 = ['Newspaper', 'Radio']
x3 = data[features3]
y3 = data['Sales']
prediction3 = cross_val_predict(model, x3, y3, cv=10)
mae = mean_absolute_error(y3, prediction3)
rmse = mean_squared_error(y3, prediction3) ** 0.5
print("Model-3:\nMAE:", mae, "\nRMSE:", rmse)

In [None]:
# YOUR CODE HERE
features4 = ['TV', 'Radio', 'Newspaper']
x4 = data[features4]
y4 = data['Sales']
prediction4 = cross_val_predict(model, x4, y4, cv=10)
mae = mean_absolute_error(y4, prediction4)
rmse = mean_squared_error(y4, prediction4) ** 0.5
print("Model-4:\nMAE:", mae, "\nRMSE:", rmse)

###### Double-click here to fill in
1. Features used by Model 1: 'TV', 'Radio'
2. Features used by Model 2: 'TV', 'Newspaper'
3. Features used by Model 3: 'Newspaper', 'Radio'
4. Features used by Model 4: 'TV', 'Radio', 'Newspaper'

Model|MAE|RMSE
-|-|-
Model 1 | 1.2621936422519535 | 1.6818990518213202
Model 2 | 1.776707605359968 | 2.2492610419402657
Model 3 | 4.3163332911165995 | 5.042150499491994
Model 4 | 1.2644541807760776 | 1.685774006914065
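
The four cells above repeat the same steps, so the comparison can also be written as a loop. The following is one possible sketch, not the required solution: the helper `evaluate_features` and the synthetic `demo` DataFrame are hypothetical names introduced here, and the synthetic numbers will not match the table above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in with the same column names (hypothetical data)
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    'TV': rng.uniform(0, 300, n),
    'Radio': rng.uniform(0, 50, n),
    'Newspaper': rng.uniform(0, 100, n),
})
demo['Sales'] = 3 + 0.045 * demo['TV'] + 0.19 * demo['Radio'] + rng.normal(0, 1.5, n)

def evaluate_features(df, features, target='Sales', cv=10):
    """Return (MAE, RMSE) of a linear model under k-fold cross-validation."""
    pred = cross_val_predict(LinearRegression(), df[features], df[target], cv=cv)
    mae = mean_absolute_error(df[target], pred)
    rmse = mean_squared_error(df[target], pred) ** 0.5
    return mae, rmse

feature_sets = {
    'Model 1': ['TV', 'Radio'],
    'Model 2': ['TV', 'Newspaper'],
    'Model 3': ['Newspaper', 'Radio'],
    'Model 4': ['TV', 'Radio', 'Newspaper'],
}

# Collect one row per feature combination
rows = []
for name, feats in feature_sets.items():
    mae, rmse = evaluate_features(demo, feats)
    rows.append({'Model': name, 'Features': ', '.join(feats), 'MAE': mae, 'RMSE': rmse})

results = pd.DataFrame(rows)
print(results)
```

To run this on the real data, `demo` would be replaced by the `data` DataFrame loaded in step 1.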

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Use degree-2 polynomial features of TV and Radio (squares and the TV*Radio interaction)
poly_features2 = PolynomialFeatures(degree=2, include_bias=False)
x_poly2 = poly_features2.fit_transform(x1)
prediction5 = cross_val_predict(model, x_poly2, y1, cv=10)
mae = mean_absolute_error(y1, prediction5)
rmse = mean_squared_error(y1, prediction5) ** 0.5
print("Model-5 (degree 2):\nMAE:", mae, "\nRMSE:", rmse)

In [None]:
# Use degree-3 polynomial features of TV and Radio
poly_features3 = PolynomialFeatures(degree=3, include_bias=False)
x_poly3 = poly_features3.fit_transform(x1)
prediction6 = cross_val_predict(model, x_poly3, y1, cv=10)
mae = mean_absolute_error(y1, prediction6)
rmse = mean_squared_error(y1, prediction6) ** 0.5
print("Model-6 (degree 3):\nMAE:", mae, "\nRMSE:", rmse)

5. Features used by Model 5: degree-2 polynomial terms of 'TV', 'Radio'
6. Features used by Model 6: degree-3 polynomial terms of 'TV', 'Radio'

Model|MAE|RMSE
-|-|-
Model 5 | 1.0295180853375854 | 1.3890476306368063
Model 6 | 0.9900309987172708 | 1.3714669912830202
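
The polynomial expansion above is applied to the whole dataset before cross-validation. That is harmless here because `PolynomialFeatures` learns no statistics from the data, but wrapping the steps in a pipeline keeps the pattern safe if a data-dependent step such as scaling is added later. The following is a sketch of that alternative, using sklearn's `make_pipeline` on synthetic stand-in arrays (`X_syn` and `y_syn` are hypothetical).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the (TV, Radio) -> Sales data (hypothetical)
rng = np.random.default_rng(1)
X_syn = rng.uniform(0, 50, size=(200, 2))
y_syn = 3 + 0.1 * X_syn[:, 0] + 0.02 * X_syn[:, 0] * X_syn[:, 1] + rng.normal(0, 1.0, 200)

# Approach used above: expand once, then cross-validate
X_syn_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_syn)
pred_expanded = cross_val_predict(LinearRegression(), X_syn_poly, y_syn, cv=10)

# Pipeline alternative: the expansion is refit inside every fold
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
pred_pipeline = cross_val_predict(poly_model, X_syn, y_syn, cv=10)

# Both routes should give the same predictions for this stateless transform
print(np.allclose(pred_expanded, pred_pipeline))
```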