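## Ridge Regression
The main problem with linear regression is its sensitivity to outliers. Real-world data collection often produces erroneous measurements. Linear regression uses ordinary least squares, which minimizes the sum of squared errors, so a single outlier with a large residual can have a disproportionate effect on the fitted model.  
In other words, ordinary least squares takes every data point into account when building the model. Ridge regression addresses this by adding a regularization term to the objective. The coefficient of that term (`alpha` in the code below) controls how strongly large weights are penalized, and so limits how far outliers can pull the model.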
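
As a point of reference, the two objectives can be written out explicitly. The notation is an assumption made for illustration: $x_i$ and $y_i$ are the $i$-th input and target, $w$ is the weight vector, $b$ is the intercept, and $\alpha \ge 0$ is the regularization strength, corresponding to the `alpha` argument of scikit-learn's `Ridge`.

$$
\text{OLS:}\quad \min_{w,\,b}\ \sum_{i=1}^{n} \left(y_i - w^\top x_i - b\right)^2
$$

$$
\text{Ridge:}\quad \min_{w,\,b}\ \sum_{i=1}^{n} \left(y_i - w^\top x_i - b\right)^2 + \alpha \lVert w \rVert_2^2
$$

With $\alpha = 0$ the ridge objective reduces to ordinary least squares. Larger values of $\alpha$ shrink $w$ toward zero.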

In [1]:
import sys
import numpy as np

In [2]:
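# filename = sys.argv[1]  # alternative: take the path from the command line
filename = "data_multivar.txt"
X = []
y = []
# Each line is comma-separated: feature values first, target value last
with open(filename, 'r') as f:
    for line in f:
        data = [float(i) for i in line.split(',')]
        xt, yt = data[:-1], data[-1]
        X.append(xt)
        y.append(yt)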

In [3]:
# Train/test split
num_training = int(0.8*len(X))
num_test = len(X) - num_training

# Training data
X_train = np.array(X[:num_training])
y_train = np.array(y[:num_training])

# Test data
X_test = np.array(X[num_training:])
y_test = np.array(y[num_training:])
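
The manual 80/20 split above can also be expressed with scikit-learn's `train_test_split`. The following is a minimal sketch on hypothetical toy data (`X_demo`, `y_demo` are not from `data_multivar.txt`). It assumes `shuffle=False`, so that the order of the rows is preserved as in the slicing approach. For sample counts where 80% is not a whole number, the two approaches may round differently by one sample.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical, for illustration only): 10 samples, 3 features
X_demo = np.arange(30, dtype=float).reshape(10, 3)
y_demo = np.arange(10, dtype=float)

# shuffle=False keeps the original order: first 80% for training, last 20% for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)

# Manual split, done the same way as in the cell above
n_train = int(0.8 * len(X_demo))
print(np.array_equal(X_tr, X_demo[:n_train]), np.array_equal(X_te, X_demo[n_train:]))
```

Leaving `shuffle` at its default of `True` and fixing `random_state` would instead draw a reproducible random split, which is usually preferable when the rows of the file are ordered in some way.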

In [4]:
from sklearn import linear_model
import sklearn.metrics as sm

ridge_regressor = linear_model.Ridge(alpha=0.01, fit_intercept=True, max_iter=10000)
ridge_regressor.fit(X_train, y_train)

y_test_pred_ridge = ridge_regressor.predict(X_test)
print("Mean absolute error:", round(sm.mean_absolute_error(y_test, y_test_pred_ridge), 2))
print("Mean squared error:", round(sm.mean_squared_error(y_test, y_test_pred_ridge), 2))
print("Median absolute error:", round(sm.median_absolute_error(y_test, y_test_pred_ridge), 2))
print("Explained variance score:", round(sm.explained_variance_score(y_test, y_test_pred_ridge), 2))
print("R2 score:", round(sm.r2_score(y_test, y_test_pred_ridge), 2))

Mean absolute error: 3.95
Mean squared error: 23.15
Median absolute error: 3.69
Explained variance score: 0.84
R2 score: 0.83


In [5]:
linear_regressor = linear_model.LinearRegression()
linear_regressor.fit(X_train, y_train)

y_test_pred = linear_regressor.predict(X_test)
print("Mean absolute error:", round(sm.mean_absolute_error(y_test, y_test_pred), 2))
print("Mean squared error:", round(sm.mean_squared_error(y_test, y_test_pred), 2))
print("Median absolute error:", round(sm.median_absolute_error(y_test, y_test_pred), 2))
print("Explained variance score:", round(sm.explained_variance_score(y_test, y_test_pred), 2))
print("R2 score:", round(sm.r2_score(y_test, y_test_pred), 2))

Mean absolute error: 3.95
Mean squared error: 23.15
Median absolute error: 3.69
Explained variance score: 0.84
R2 score: 0.83
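
The ridge and ordinary least squares metrics above agree to two decimal places, most likely because `alpha=0.01` is a very weak penalty. The sketch below illustrates how `alpha` affects the fitted coefficients. It uses synthetic data (`X_demo`, `y_demo`, and `true_w` are assumptions for illustration, not the contents of `data_multivar.txt`), and `coef_norm` is a hypothetical helper defined only for this example.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data (an assumption for illustration, not data_multivar.txt)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_w + rng.normal(scale=0.5, size=100)

def coef_norm(alpha):
    """Fit Ridge with the given alpha and return the L2 norm of its coefficients."""
    model = linear_model.Ridge(alpha=alpha, fit_intercept=True)
    model.fit(X_demo, y_demo)
    return float(np.linalg.norm(model.coef_))

# Baseline: ordinary least squares (no penalty)
ols = linear_model.LinearRegression().fit(X_demo, y_demo)
ols_norm = float(np.linalg.norm(ols.coef_))

alphas = [0.01, 1.0, 100.0, 10000.0]
norms = [coef_norm(a) for a in alphas]

print(f"OLS           : ||coef|| = {ols_norm:.4f}")
for a, n in zip(alphas, norms):
    print(f"alpha={a:>8}: ||coef|| = {n:.4f}")
```

On data like this, the coefficient norm at `alpha=0.01` is nearly the same as for ordinary least squares and shrinks steadily as `alpha` grows. To see a difference between the two regressors, a larger `alpha` is needed.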


In [11]:
from sklearn.preprocessing import PolynomialFeatures
# Set the degree of the polynomial (15 here)
polynomial = PolynomialFeatures(degree=15)
x_train_transformed = polynomial.fit_transform(X_train)
datapoint = np.array([0.39, 2.78, 7.11]).reshape(1, -1)
# Reuse the fitted transformer on the new point instead of refitting it
poly_datapoint = polynomial.transform(datapoint)

poly_linear_model = linear_model.LinearRegression()
poly_linear_model.fit(x_train_transformed, y_train)
print("Linear regression:", linear_regressor.predict(datapoint))
print("Polynomial regression:", poly_linear_model.predict(poly_datapoint))

Linear regression: [-11.0587295]
Polynomial regression: [-8.06664984]
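
The feature expansion and the regression can also be chained with scikit-learn's `make_pipeline`, so that new data points are transformed automatically at prediction time. The following is a minimal sketch on synthetic data. `X_demo`, `y_demo`, `new_point`, and the choice of `degree=2` are assumptions for illustration, not values taken from the notebook above.

```python
import numpy as np
from sklearn import linear_model
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): 50 samples, 3 features
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(50, 3))
y_demo = X_demo[:, 0] ** 2 - 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=50)

# Pipeline: expand to polynomial features, then fit a linear model on them
poly_pipeline = make_pipeline(
    PolynomialFeatures(degree=2),
    linear_model.LinearRegression(),
)
poly_pipeline.fit(X_demo, y_demo)

# The pipeline applies the polynomial transform to the raw point itself
new_point = np.array([[0.1, -0.3, 0.5]])
pipeline_pred = poly_pipeline.predict(new_point)

# Equivalent manual version: fit_transform on training data, transform on new data
poly = PolynomialFeatures(degree=2)
manual_model = linear_model.LinearRegression().fit(poly.fit_transform(X_demo), y_demo)
manual_pred = manual_model.predict(poly.transform(new_point))

print("Pipeline:", pipeline_pred)
print("Manual:  ", manual_pred)
```

Both paths should give the same prediction. The pipeline form removes the risk of forgetting to transform a new data point, or of transforming it with a differently configured `PolynomialFeatures` object.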
