## Linear Regression
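#### Background:
Linear regression is a classic machine learning model with a simple underlying algorithm: the regression equation can be obtained either by solving for the parameters directly with least squares or by fitting them with stochastic gradient descent.
This project uses the Boston housing price dataset to explore how linear regression works and what to watch out for when applying it.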
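
The direct least-squares route mentioned above can be sketched as follows. This is a minimal illustration on synthetic data (the arrays, `true_w`, and `true_b` are made up for the example and are not part of the Boston dataset); it assumes the intercept is handled by appending a column of ones.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the Boston dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([[2.0], [-1.0], [0.5]])
true_b = 4.0
y = X @ true_w + true_b + 0.01 * rng.normal(size=(200, 1))

# Append a column of ones so the intercept is estimated as one more weight
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Least-squares solution; lstsq avoids explicitly inverting X^T X
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = theta[:-1], theta[-1, 0]
print(w_hat.ravel(), b_hat)
```

With little noise, the recovered weights and intercept should land close to the values used to generate the data.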

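#### Goals:
1. Implement linear regression from scratch to understand the algorithm and where it could be improved
2. See how matrix operations improve model efficiency in multiple linear regression
3. Understand the similarities, differences, and suitable use cases of the different encodings for categorical features
4. Explore the role and limitations of feature standardization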
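
To make the second goal concrete, here is a small sketch comparing a nested Python loop with a single matrix-vector product for computing predictions. The feature matrix, weights, and function names (`predict_loop`, `predict_vectorized`) are hypothetical; the exact timings depend on the machine.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 13))   # hypothetical feature matrix
w = rng.normal(size=13)           # hypothetical weights
b = 0.5

def predict_loop(X, w, b):
    # One Python-level multiply-add per feature per sample
    out = np.empty(X.shape[0])
    for i in range(X.shape[0]):
        total = b
        for j in range(X.shape[1]):
            total += X[i, j] * w[j]
        out[i] = total
    return out

def predict_vectorized(X, w, b):
    # A single matrix-vector product handled by NumPy's compiled code
    return X @ w + b

t0 = time.perf_counter()
y_loop = predict_loop(X, w, b)
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
y_vec = predict_vectorized(X, w, b)
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.4f}s, vectorized: {t_vec:.6f}s")
```

Both functions return the same predictions; the vectorized version is typically much faster because the inner loops run in compiled code.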

In [42]:
# Environment setup
import numpy as np
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"     # Display the result of every expression in a cell, not only the last one

### Data Preparation

In [43]:
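# Load the data
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2
from sklearn.datasets import load_boston
data = load_boston()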
# data

In [44]:
# Explore the data
df = pd.DataFrame(data['data'])
df.columns = data['feature_names']
df.info()
df.head()
print(data['DESCR'])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
dtypes: float64(13)
memory usage: 51.5 KB
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information

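| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |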


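After reviewing the dataset description:
1. The dataset has 13 features and 506 samples, with no missing values;
2. Of the 13 features, 11 are numeric and 2 are categorical (CHAS, RAD).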



### Pre-processing

In [45]:
df_cat = df[['CHAS','RAD']].astype('int').astype('category')
df_num = df.drop(columns=['CHAS','RAD'])
# print(df_cat.shape); print(df_num.shape)

In [46]:
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder()
nar_cat = onehot.fit_transform(df_cat).toarray()
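
As a rough comparison of categorical encodings, the sketch below contrasts integer (ordinal) codes with one-hot encoding on a toy column. The values are made up and only stand in for a feature such as RAD; `drop='first'` is shown as one common option, not something the cell above uses.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column standing in for a categorical feature such as RAD (values are made up)
toy = pd.DataFrame({'RAD': [1, 2, 24, 2, 1]})

# Ordinal (integer) codes: one column, but implies an order and spacing between levels
codes = toy['RAD'].astype('category').cat.codes.to_numpy()

# One-hot: one column per level, no implied order
full = OneHotEncoder().fit_transform(toy).toarray()

# One-hot with the first level dropped: the remaining columns no longer sum to a
# constant, which avoids collinearity with the intercept in a linear model
reduced = OneHotEncoder(drop='first').fit_transform(toy).toarray()

print(codes)
print(full.shape, reduced.shape)
```

Integer codes are compact but let a linear model treat the levels as evenly spaced numbers; one-hot encoding removes that assumption at the cost of more columns.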

In [47]:
df_num.describe().loc[['mean','std'],:]

| | CRIM | ZN | INDUS | NOX | RM | AGE | DIS | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | 3.613524 | 11.363636 | 11.136779 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 408.237154 | 18.455534 | 356.674032 | 12.653063 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 168.537116 | 2.164946 | 91.294864 | 7.141062 |


In [48]:
from sklearn.preprocessing import StandardScaler
standardized = StandardScaler()
nar_num = standardized.fit_transform(df_num)
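
What `StandardScaler` computes can be checked by hand. The sketch below uses made-up features on very different scales and also shows one common way to avoid leaking test-set statistics, namely fitting the scaler on the training rows only; the 80/20 slice is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up features on very different scales
raw = np.column_stack([rng.normal(400, 170, 100), rng.normal(0.5, 0.1, 100)])

scaled = StandardScaler().fit_transform(raw)
# StandardScaler uses the population standard deviation (ddof=0), like np.std's default
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)
print(np.allclose(scaled, manual))

# Fit on the training rows only, then apply the same transform to the test rows
train, test = raw[:80], raw[80:]
scaler = StandardScaler().fit(train)
train_s, test_s = scaler.transform(train), scaler.transform(test)
print(train_s.mean(axis=0), test_s.mean(axis=0))
```

Note that pandas' `describe()` reports the sample standard deviation (ddof=1), so its `std` row differs slightly from the value the scaler divides by.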

In [49]:
X = np.concatenate((nar_num, nar_cat), axis=1)
y = data['target'].reshape(len(data['target']),1)

In [50]:
def data_split(n_samples, test_ratio=0.2, val_ratio=0, seed=None):
    # Shuffle the row indices once so that X and y can share the same split
    index = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(n_samples*(1-val_ratio-test_ratio))
    n_val_end = int(n_samples*(1-test_ratio))
    # Returned in the order: train, test, validation
    return index[:n_train], index[n_val_end:], index[n_train:n_val_end]

# Splitting X and y with independent shuffles would pair each row with the wrong
# target, so the same indices are applied to both arrays
train_index,test_index,_ = data_split(len(X),test_ratio=0.2)
X_train,X_test = X[train_index],X[test_index]
y_train,y_test = y[train_index],y[test_index]


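1. Python's built-in sum() iterates over axis 0 of the input array and adds the sub-arrays together, so for a 2-D array it returns an array rather than a scalar.
2. np.sum(), with its default axis=None, sums over all elements and returns the scalar total needed for the loss.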
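
The difference between the two is easy to see on a small 2-D array (the values below are arbitrary):

```python
import numpy as np

residual = np.array([[1.0, 2.0],
                     [3.0, 4.0]])

builtin_total = sum(residual)            # adds the rows together: array([4., 6.])
numpy_total = np.sum(residual)           # adds every element: 10.0
per_column = np.sum(residual, axis=0)    # same values as the built-in result

print(builtin_total, numpy_total, per_column)
```

For an `(n, 1)` residual such as `y - y_hat`, the built-in `sum()` therefore returns an array of shape `(1,)` instead of a plain scalar.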

In [120]:
def LinReg_train(X,y,num_epochs,lr):
    # Initialization
    loss=[]
    n = X.shape[0]
    W = np.random.normal(0,1,(1,X.shape[1]))
    b = 0
    # Training
    for i in range(num_epochs):
        y_hat = np.dot(X,W.T)+b
        # Loss: half the mean squared error, consistent with the gradients below
        ls = np.sum((y-y_hat)**2)/(2*n)
        # ls = np.dot((y-y_hat).T,(y-y_hat))/(2*n)
        loss.append(ls)
        # Optimization (gradient descent update of the model parameters)
        W = W-lr*(-np.dot((y-y_hat).T,X)/n)
        # dL/db = -mean(y - y_hat), so b is updated with the same sign as W
        b = b-lr*(-np.mean(y-y_hat))
    return loss, W, b

def LinReg_price(X,y,W,b):
    y_hat = np.dot(X,W.T)+b
    # Loss: half the mean squared error
    ls = np.sum((y-y_hat)**2)/(2*X.shape[0])
    # ls = np.dot((y-y_hat).T,(y-y_hat))/(2*X.shape[0])
    return y_hat, ls

In [129]:
loss_CV,W,b = LinReg_train(X_train,y_train,100,0.03)
y_hat,loss = LinReg_price(X_test,y_test,W,b)
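
Beyond the raw loss value, the test predictions could be summarized with more interpretable metrics. The helpers below (`rmse`, `r2_score_manual`) are hypothetical names, and the toy arrays only stand in for `y_test` and `y_hat`; this is a sketch, not part of the original notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error, in the same units as the target
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_score_manual(y_true, y_pred):
    # 1 - SS_res / SS_tot: share of the target's variance explained by the model
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1 - ss_res / ss_tot)

# Toy values standing in for y_test and y_hat (made up for illustration)
y_true = np.array([[3.0], [5.0], [7.0], [9.0]])
y_pred = np.array([[2.0], [5.0], [8.0], [9.0]])
print(rmse(y_true, y_pred), r2_score_manual(y_true, y_pred))
```

In the notebook these would be called as `rmse(y_test, y_hat)` and `r2_score_manual(y_test, y_hat)`, since both arrays have shape `(n, 1)`.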

