## LinearRegression概述

线性回归是利用回归分析方式来确定多种变量之间的依赖关系：
y = WX+e 其中W为ShapeSize=[1，N]的权重参数 e为预测误差 y为实际值

## 线性回归原理图
线性回归实际上是求出一个N维的超平面，使所有样本数据都在其上下一定范围内浮动。
<img src="LinearRegression.png" >

## 线性回归的要求
```
1. 自变量和因变量之间存在线性关系，可以通过绘制散点图矩阵进行考察
2. 自变量和自变量之间相互独立，即自变量之间协方差为0
3. 误差服从期望为0独立的正态分布
4. 误差大小不会随着变量的变化而变化，即方差齐性
```
```
基本条件的原因分析：
y =[θ0，,θ1，θ2，θ3....][X1,X2,X3....] + e 其中特征向量X为因变量，W变量权重，e为预测误差，为固定值，其中e为所有样本预测误差的均值，为一个固定值。自变量和隐变量应该线性相关，否则通过线性表达式则无法诠释样本的分布；自变量和自变量之间相互独立，否则不能进行线性相加；误差服从期望为0独立的正态分布，否则无法满足最大似然估计的条件；误差大小满足齐性，即e是一个固定值，否则不能满足线性方程求解。
```

## 最大似然估计值
```
最大似然估计值是求出所有样本预测值和真实值之间的误差之和最小，其表示方法为所有样本预测误差通过正态分布N-(0,u)概率密度的计算后得到概率，然后所有概率的乘积即可得到似然估计值，当所有样本的预测误差都趋近于0，则似然估计值最大，似然估计最大，则表示权重值越正确。
```
<img src='./maximum.jpg' style='zoom:60%'>

```
最大似然估计为目标求解函数，在学习的过程中通过梯度下降方式，使目标函数值最大，通过梯度求导，求得每一步的梯度变化方向，然后乘以学习率，最终求出梯度更新尺度。

其中梯度数学表达式的物理含义是：所有误差值和对应特征元乘积的期望值
```
<img src='./grad.png' style='zoom:60%'>

```
值得注意的是，为了防止过拟合，即使最后学习的W权重能够有更大泛化能力，会在权重更新时加上正则化惩罚项
```

###  数据处理

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
features_labels = ['accommodates','bedrooms','bathrooms','beds','price','minimum_nights','maximum_nights','number_of_reviews']
features=['accommodates','bedrooms','bathrooms','beds','minimum_nights','maximum_nights','number_of_reviews']

dc_listings = pd.read_csv('listings.csv')

dc_listings = dc_listings[features_labels]

dc_listings['price'] = dc_listings.price.str.replace("\$|,",'').astype(float) #字符串转换

dc_listings = dc_listings[dc_listings.price < 500][dc_listings.price > 100]

max_price = dc_listings.price.max()

min_price = dc_listings.price.min()

dc_listings['price'] = (dc_listings['price']-min_price)/(max_price - min_price) #对价格进行归一化处理

dc_listings = dc_listings.dropna()    #去掉空值


dc_listings[features] = StandardScaler().fit_transform(dc_listings[features]) #数据标准化处理

dc_listings['pad_one'] = 1

features.append('pad_one')

normalized_listings = dc_listings   #将指定特征数据取出

norm_train_df = normalized_listings.copy().iloc[0:2792]
norm_test_df = normalized_listings.copy().iloc[2792:]

  if sys.path[0] == '':


In [2]:
norm_train_data = norm_train_df[features]
norm_train_label = norm_train_df['price']
norm_test_data = norm_test_df[features]
norm_test_label = norm_test_df['price']

In [3]:
norm_train_data.head(5)

Unnamed: 0,accommodates,bedrooms,bathrooms,beds,minimum_nights,maximum_nights,number_of_reviews,pad_one
0,0.106575,-0.414744,-0.536678,0.107118,-0.308069,-0.022402,-0.541295,1
1,1.101111,1.803475,2.841286,0.975726,-0.083765,-0.022425,1.835919,1
17,-0.887961,-0.414744,-0.536678,0.107118,-0.308069,-0.022402,-0.541295,1
19,-0.887961,-0.414744,-0.536678,-0.761489,-0.308069,-0.022402,-0.541295,1
21,-0.887961,-0.414744,-0.536678,-0.761489,-0.308069,-0.022402,-0.541295,1


## 1. 通过自己代码实现线性回归预测

In [4]:
norm_train_data.values

array([[ 1.06575111e-01, -4.14744088e-01, -5.36677939e-01, ...,
        -2.24023659e-02, -5.41295306e-01,  1.00000000e+00],
       [ 1.10111130e+00,  1.80347496e+00,  2.84128555e+00, ...,
        -2.24252374e-02,  1.83591931e+00,  1.00000000e+00],
       [-8.87961075e-01, -4.14744088e-01, -5.36677939e-01, ...,
        -2.24023659e-02, -5.41295306e-01,  1.00000000e+00],
       ...,
       [ 1.10111130e+00,  6.94365435e-01, -5.36677939e-01, ...,
         4.48325375e+01, -1.02424915e-01,  1.00000000e+00],
       [-8.87961075e-01, -4.14744088e-01, -5.36677939e-01, ...,
        -2.24254254e-02,  1.21418626e+00,  1.00000000e+00],
       [-3.90692982e-01, -1.52385361e+00, -5.36677939e-01, ...,
        -2.24023659e-02, -5.04722773e-01,  1.00000000e+00]])

In [5]:
import numpy as np

In [6]:
def loss_fun(predicts,labels):
    loss = labels.flatten() -  predicts.flatten() 
    return loss

In [7]:
def grad_get(predicts,labels,train_data,theta,lr):
    
    loss = loss_fun(predicts,labels.values)
    grad = np.zeros(len(theta))
    for i in range(0,len(theta)):
        grad_tmp = 0
        for j in range(0,len(train_data)):
            grad_tmp = grad_tmp + train_data.values[j][i]*loss[i]
        grad[i] = lr*grad_tmp/(len(train_data))
    
    return grad      

In [8]:
def weight_update(theta,grad):
    theta = theta - grad.reshape(8,1)
    return theta

In [9]:
steps = 100
theta = np.ones((8,1))   #定义权重参数
lr = 0.001                #定义学习率
for step in range(steps):
    predicts =  np.dot(norm_train_data.values,theta)
    grad = grad_get(predicts,norm_train_label,norm_train_data,theta,lr)
    theta = weight_update(theta,grad)
    
    if step % 10 == 0:
        print(loss_fun(predicts,norm_train_label.values).sum()/len(norm_train_data))

-0.8070436718650437
-0.8327970458352778
-0.8588091155428623
-0.8850824796175768
-0.9116197627927495
-0.9384236161674704
-0.9654967174714386
-0.9928417713324695
-1.0204615095466902
-1.0483586913514484


## 2. 使用SKLEARN库实现线性回归

In [10]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

boston = datasets.load_boston()
x, y = boston.data, boston.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=10010)
reg = LinearRegression()
reg.fit(x_train, y_train)
y_predict = reg.predict(x_test)
print(mean_squared_error(y_test, y_predict))

15.78937682781807


In [17]:
print(x_train.shape)
print(y_train.shape)
print(x_train)
y_train

(379, 13)
(379,)
[[4.62960e-01 0.00000e+00 6.20000e+00 ... 1.74000e+01 3.76140e+02
  5.25000e+00]
 [1.44760e-01 0.00000e+00 1.00100e+01 ... 1.78000e+01 3.91500e+02
  1.36100e+01]
 [1.12658e+00 0.00000e+00 1.95800e+01 ... 1.47000e+01 3.43280e+02
  1.21200e+01]
 ...
 [9.76170e-01 0.00000e+00 2.18900e+01 ... 2.12000e+01 2.62760e+02
  1.73100e+01]
 [9.55770e-01 0.00000e+00 8.14000e+00 ... 2.10000e+01 3.06380e+02
  1.72800e+01]
 [3.67822e+00 0.00000e+00 1.81000e+01 ... 2.02000e+01 3.80790e+02
  1.01900e+01]]


array([31.7, 19.3, 15.3, 23.7, 22. , 23.9, 23.8, 38.7, 10.2, 21. , 18.2,
       19.8, 13. , 11.8, 23.9, 20. , 19.1, 24.7, 34.9, 37.3, 18.7, 19.2,
       26.6, 15.6,  8.5, 17.7,  5. , 20.9, 16.7, 17.5, 20.7, 50. ,  8.4,
       23.9, 25. , 33.2, 20.1,  8.8, 27.5, 12.1, 24.4, 21.9, 19.7, 18.1,
       17.9, 19.8, 17. , 43.5, 19.3, 23.7, 32.2, 20.3, 13.5, 50. , 14.1,
       23. , 26.2, 25.3, 11.3, 17.8, 20.5, 21.2, 17.8, 17.2, 25. , 18.6,
       15.2, 36.2, 19.5, 20.6, 37.2, 23.2, 24.4, 12.6, 27.9, 21.2, 50. ,
       19.9, 24.8, 25. , 29.9, 41.7, 23.1, 12.7, 43.1, 14.9, 22.6, 33.1,
        9.5, 27.1, 24.1, 14.2, 29.6, 14.1, 32.5, 18.5, 29.1, 24.4, 18.5,
       24.3, 50. , 20.6, 50. ,  7. , 19.4, 22.6, 33.1, 32. , 31. , 24.3,
       10.2, 31.1, 17.4, 23.8, 23.9, 15.6, 20. , 34.9, 50. , 18.4, 15.2,
       30.8, 13.4, 50. , 20.1, 22.8, 20.8, 22.6, 16.8, 21.4, 16.6, 22. ,
       19.6, 24.8, 17.2, 15. , 22.5,  7.5, 18.5, 18.7, 21.2, 27.9, 50. ,
       17.4, 19.6, 19.6, 23.5, 24.3, 50. , 45.4, 20

## 3.使用pytorch实现线性回归模型

In [13]:
import torch 
import torch.optim as optim
import torch.nn as nn
import warnings

warnings.filterwarnings("ignore")
%matplotlib inline

In [21]:
class XLinearRegression(nn.Module):
    def __init__(self,InputChanels,OutPutChanels):
        super().__init__()
        self.out = nn.Linear(InputChanels,OutPutChanels)
    def forward(self,x):
        x = self.out(x)
        return x

In [15]:
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

boston = datasets.load_boston()
x, y = boston.data, boston.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=10010)

def  DataGet(x_train, x_test, y_train, y_test,bs):
    x_train, x_test, y_train, y_test = map(torch.tensor,(x_train, x_test, y_train, y_test))
    train_ds = TensorDataset(x_train, y_train)
    train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)
    valid_ds = TensorDataset(x_test, y_test)
    valid_dl = DataLoader(valid_ds, batch_size=bs)
    return train_dl,valid_dl

In [16]:
import torch.nn.functional as F

def GetModelWithOptAndLossFunc(InputChanels,OutPutChanels):
    model = XLinearRegression(InputChanels,OutPutChanels)
    opt = optim.Adam(model.parameters(),lr=0.001)
    loss_func = F.cross_entropy
    return model,opt,loss_func

In [17]:
def loss_batch(model, loss_func, xb, yb, opt=None):
    loss = loss_func(model(xb), yb)
    if opt is not None:
        loss.backward()
        opt.step()
        opt.zero_grad()
    return loss.item(), len(xb)

In [18]:
def model_fit(steps, model, loss_func, opt, train_dl, valid_dl):
    for step in range(steps):
        model.train()
        for xb,yb in train_dl:
            print(xb.shape)
            loss_batch(model, loss_func, xb, yb, opt)
        
        model.eval()
        with torch.no_grad():
            losses, nums = zip(
                *[loss_batch(model, loss_func, xb, yb) for xb, yb in valid_dl])
        val_loss = np.sum(np.multiply(losses, nums)) / np.sum(nums)
        print('当前step:'+str(step), '验证集损失：'+str(val_loss))

In [19]:
bs = len(x_train)
steps = 1
InputChanels = x_train.shape[1]
OutPutChanels = 1
train_dl, valid_dl = DataGet(x_train, x_test, y_train, y_test,bs)
model, opt,loss_func = GetModelWithOptAndLossFunc(InputChanels,OutPutChanels)
model_fit(steps, model, loss_func, opt, train_dl, valid_dl)

AttributeError: 'LinearRegression' object has no attribute 'parameters'

In [22]:
InputChanels = x_train.shape[1]
OutPutChanels = 1
Net = XLinearRegression(InputChanels,OutPutChanels)
print(Net)

XLinearRegression(
  (out): Linear(in_features=13, out_features=1, bias=True)
)
