## Linear Models

### For regression
- Linear regression
- Ridge regression
- Lasso regression  
...  

### For binary classification
- Logistic regression  
...

### Linear Regression

$$\hat{y} = w^T x + b = w_1x_1 + ... + w_dx_d + b$$
$$ x = (x_1, ..., x_d) \in \mathbb{R}^d, \ y, \hat{y} \in \mathbb{R}$$


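- When the bias is absorbed into the weights as $w_0$, each input $x_i$ is augmented with a constant 1 and becomes a $(d+1) \times 1$ vector
- The goal of training is to find the $w^*$ that minimizes the training error (the cost function)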

### Training
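- Use the squared error as the loss  
    $L(y, \hat{y}) = (\hat{y} - y)^2$, so that $J(w) = MSE_{train}$  

    $J(w) = \frac{1}{n}||Xw - y||^2_2$  

    - Why $Xw$: $X$ is a matrix (samples x features) and $w$ is a column vector, so the product $wX$ has mismatched dimensions and cannot be computed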

### Optimization

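- Closed-form solution -> find the point where the gradient is zero
- $J(w)$ is quadratic in $w$ -> it is convex -> the optimum can be written in closed form (as long as $X^TX$ is invertible).

- $X$ and $y$ are fixed data, so differentiate with respect to the variable $w$ and set the result to zero  

    $w^* = (X^TX)^{-1}X^Ty$  
    - $w^*$ is obtained directly by computation, with no iteration
    - Trained model  
    $f(x) = w^{*T}x$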
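
The closed-form solution above can be checked numerically. The following is a minimal sketch on synthetic data; the data, `true_w`, and `true_b` are made-up values for illustration. It solves the linear system $(X^TX)w = X^Ty$ rather than forming the inverse explicitly, which is usually the more stable choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X_raw = rng.normal(size=(n, d))
true_w = np.array([1.5, -2.0, 0.5])   # illustrative ground-truth weights
true_b = 0.7                          # illustrative ground-truth bias
y = X_raw @ true_w + true_b + 0.1 * rng.normal(size=n)

# Prepend a column of ones so the bias is absorbed as w_0.
X = np.hstack([np.ones((n, 1)), X_raw])       # shape (n, d + 1)

# Normal equation: solve (X^T X) w = X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)    # shape (d + 1,)

# Cross-check against NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

mse = np.mean((X @ w_star - y) ** 2)
print("w* =", w_star)
print("train MSE =", mse)
```

The first entry of `w_star` plays the role of the bias $b$, and the remaining entries are $w_1, ..., w_d$.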

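### Probabilistic Interpretation of Linear Regression
- A statistical interpretation of linear regression

- Assume $y$ follows the normal distribution $N(\hat{y}, \sigma^2)$ with $\hat{y} = w^Tx$

- $p(y|x;w) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{(y-w^Tx)^2}{2\sigma^2})$  => the Gaussian density

- Maximum likelihood estimation (with respect to $w$)  

    $$w^* = \arg\max_w \prod_{(x_i, y_i) \in D} p(y_i | x_i;w)$$

    Taking the log turns the product into a sum without changing the maximizer:

    $$\mathbf{w}^* = \arg\max_{\mathbf{w}} \left[-\frac{n}{2} \log (2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{(x_i,y_i)\in D} \big( y_i - \mathbf{w}^T x_i \big)^2\right]$$

    => Only the squared-error sum depends on $w$, so maximizing the likelihood is the same as minimizing MSE. This gives a statistical justification for MSE as the loss of linear regression under the Gaussian noise assumption:  
    $w$ is adjusted so that $\hat{y}$ best represents $y$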
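
As a small numeric check of this equivalence, the sketch below evaluates the negative log-likelihood and the MSE on synthetic data and confirms that they differ only by a constant and a positive scale factor. The data and the noise level `sigma2` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 2.0, -1.0]) + 0.3 * rng.normal(size=n)
sigma2 = 0.3 ** 2   # noise variance, assumed known for this sketch

def neg_log_likelihood(w):
    """Negative Gaussian log-likelihood of the data under weights w."""
    resid = y - X @ w
    return 0.5 * n * np.log(2 * np.pi * sigma2) + resid @ resid / (2 * sigma2)

def mse(w):
    return np.mean((X @ w - y) ** 2)

w_ols = np.linalg.solve(X.T @ X, X.T @ y)            # least-squares solution
w_other = w_ols + np.array([0.1, -0.2, 0.05])        # any other weight vector

# NLL(w) = const + n / (2 * sigma^2) * MSE(w), so both share one minimizer.
const = 0.5 * n * np.log(2 * np.pi * sigma2)
for w in (w_ols, w_other):
    assert np.isclose(neg_log_likelihood(w), const + n / (2 * sigma2) * mse(w))

print("NLL at w_ols:  ", neg_log_likelihood(w_ols))
print("NLL at w_other:", neg_log_likelihood(w_other))
```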

### wave dataset

In [1]:
import mglearn
from sklearn.model_selection import train_test_split


X, y = mglearn.datasets.make_wave(n_samples = 60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [2]:
from sklearn.linear_model import LinearRegression


reg = LinearRegression()
reg.fit(X_train, y_train)

| LinearRegression parameter | Value |
|---|---|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |


In [9]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate(x_data, y_data, model, mode='train'):
    """Print MAE, RMSE, and R^2 of `model` on the given data split."""
    y_hat = model.predict(x_data)
    print(f'{mode} MAE: {mean_absolute_error(y_data, y_hat):.5f}')
    print(f'{mode} RMSE: {mean_squared_error(y_data, y_hat) ** 0.5:.5f}')
    print(f'{mode} R_square: {r2_score(y_data, y_hat):.5f}')
    
evaluate(X_train, y_train, reg)
evaluate(X_test, y_test, reg, 'test')

train MAE: 0.41790
train RMSE: 0.50591
train R_square: 0.67006
test MAE: 0.49590
test RMSE: 0.62968
test R_square: 0.65779


### Ridge & Lasso

- Problem: minimizing the loss alone does nothing to keep the weights $w$ small  
=> add one more criterion (a penalty on $w$) to the loss function so that the weights shrink together with the loss


- Ridge: L2 regularization

    - $\tilde{J}(w) = MSE_{train} + \alpha||w||^2_2 \ = \ \frac{1}{n}||Xw - y||^2_2 + \alpha||w||^2_2$

- Lasso: L1 regularization

    - $\tilde{J}(w) = MSE_{train} + \alpha||w||_1 \ = \ \frac{1}{n}||Xw - y||^2_2 + \alpha||w||_1$


- $\alpha$ is a hyperparameter that controls the regularization strength
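
To see how $\alpha$ shrinks the weights, here is a minimal NumPy sketch of the ridge objective written above. Setting its gradient to zero gives $(X^TX + n\alpha I)w = X^Ty$. The helper `ridge_closed_form` and the synthetic data are illustrative assumptions, and the intercept is left out for brevity. Note that scikit-learn's `Ridge` documents its objective without the $\frac{1}{n}$ factor, so its `alpha` is on a different scale than the one used here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

def ridge_closed_form(X, y, alpha):
    """Minimize (1/n)||Xw - y||^2 + alpha * ||w||_2^2 (no intercept)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * alpha * np.eye(d), X.T @ y)

alphas = [0.0, 0.1, 1.0, 10.0]
norms = [np.linalg.norm(ridge_closed_form(X, y, a)) for a in alphas]
for a, nrm in zip(alphas, norms):
    print(f"alpha={a:<5} ||w||_2 = {nrm:.4f}")
```

With `alpha=0` the result is the ordinary least-squares solution, and the norm of $w$ decreases as `alpha` grows.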

In [10]:
import mglearn
from sklearn.model_selection import train_test_split


X, y = mglearn.datasets.make_wave(n_samples = 60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
# Ridge(L2)

from sklearn.linear_model import Ridge


reg = Ridge(alpha=1)
reg.fit(X_train, y_train)

| Ridge parameter | Value |
|---|---|
| alpha | 1 |
| fit_intercept | True |
| copy_X | True |
| max_iter | None |
| tol | 0.0001 |
| solver | 'auto' |
| positive | False |
| random_state | None |


In [13]:
evaluate(X_train, y_train, reg)
evaluate(X_test, y_test, reg, 'test')

train MAE: 0.41790
train RMSE: 0.50591
train R_square: 0.67006
test MAE: 0.49590
test RMSE: 0.62968
test R_square: 0.65779


In [None]:
import pandas as pd


training_r2 = []
test_r2 = []
alpha_settings = [0, 0.1, 1, 10]


for alpha in alpha_settings:
    # build the model
    reg = Ridge(alpha=alpha)
    reg.fit(X_train, y_train)
    
    # r2 on the training set
    y_train_hat = reg.predict(X_train)
    training_r2.append(r2_score(y_train, y_train_hat))   
     
    # r2 on the test set (generalization)
    y_test_hat = reg.predict(X_test)
    test_r2.append(r2_score(y_test, y_test_hat))
    
pd.DataFrame({"alpha": alpha_settings, "train R_square": training_r2, "test R_square": test_r2})

| | alpha | train R_square | test R_square |
|---|---|---|---|
| 0 | 0.0 | 0.670089 | 0.659337 |
| 1 | 0.1 | 0.670089 | 0.659183 |
| 2 | 1.0 | 0.67006 | 0.657795 |
| 3 | 10.0 | 0.667496 | 0.643293 |


In [16]:
# Lasso(L1)
from sklearn.linear_model import Lasso


reg = Lasso(alpha=1)
reg.fit(X_train, y_train)

| Lasso parameter | Value |
|---|---|
| alpha | 1 |
| fit_intercept | True |
| precompute | False |
| copy_X | True |
| max_iter | 1000 |
| tol | 0.0001 |
| warm_start | False |
| positive | False |
| random_state | None |
| selection | 'cyclic' |


In [17]:
evaluate(X_train, y_train, reg)
evaluate(X_test, y_test, reg, 'test')

train MAE: 0.63884
train RMSE: 0.74459
train R_square: 0.28529
test MAE: 0.80635
test RMSE: 0.93987
test R_square: 0.23760


In [19]:
import pandas as pd


training_r2 = []
test_r2 = []
alpha_settings = [0.0001, 0.001, 0.01, 0.1, 1] 


for alpha in alpha_settings:
    # build the model
    reg = Lasso(alpha=alpha)
    reg.fit(X_train, y_train)
    
    # r2 on the training set
    y_train_hat = reg.predict(X_train)
    training_r2.append(r2_score(y_train, y_train_hat))   
     
    # r2 on the test set (generalization)
    y_test_hat = reg.predict(X_test)
    test_r2.append(r2_score(y_test, y_test_hat))
    
pd.DataFrame({"alpha": alpha_settings, "train R_square": training_r2, "test R_square": test_r2})

| | alpha | train R_square | test R_square |
|---|---|---|---|
| 0 | 0.0001 | 0.670089 | 0.659319 |
| 1 | 0.001 | 0.670089 | 0.659161 |
| 2 | 0.01 | 0.670051 | 0.65756 |
| 3 | 0.1 | 0.666241 | 0.639351 |
| 4 | 1.0 | 0.285288 | 0.237597 |


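With L1 regularization, depending on alpha, some coefficients become exactly 0, so the corresponding features are ignored. In effect this performs feature selection.  

The reason is that the L1 constraint region is diamond-shaped: the contours of the loss function often touch it at a corner lying on an axis, where one of the coordinates is zero.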
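
The wave dataset has a single feature, so this effect is not visible there. The sketch below uses synthetic data with 10 features, of which only the first two influence `y` (an assumption made purely for illustration), and compares the coefficients of `Lasso` and `Ridge`. The `alpha` values are arbitrary, and the two estimators scale their penalties differently in scikit-learn, so only the pattern of zeros is meant to be compared.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features actually influence y.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("zero coefficients (Lasso):", int(np.sum(lasso.coef_ == 0)))
print("zero coefficients (Ridge):", int(np.sum(ridge.coef_ == 0)))
```

Lasso is expected to set the coefficients of the uninformative features to exactly zero, while Ridge only shrinks them toward zero.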

### Logistic Regression

- Find a linear function that can separate two classes  
=> pass its output through the sigmoid function and predict 1 if the result is at least 0.5, otherwise 0


$$\hat{y} = \sigma(w^Tx + b) = \frac{1}{1 + exp(-w^Tx - b)}$$

### Training

- Use binary cross-entropy as the loss function
- The loss has to reward the correct answer for both labels, so it is written such that substituting $y_i = 0$ or $y_i = 1$ makes one of the two terms vanish

$$J(w) = \frac{1}{n} \sum_{(x_i, y_i) \in D} L(y_i, \hat{y}_i) = \frac{1}{n} \sum_{(x_i, y_i) \in D} \ [-y_i \log\hat{y}_i - (1 - y_i)\log(1-\hat{y}_i)]$$
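
A minimal NumPy sketch of the sigmoid and the binary cross-entropy loss may make the "one term vanishes" behaviour concrete. The example probabilities are made up, and the `eps` clipping is an added safeguard against `log(0)` that is not part of the formula itself.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean of -y*log(y_hat) - (1-y)*log(1-y_hat) over all samples."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    return np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1, 0])
mostly_right = np.array([0.9, 0.1, 0.8, 0.2])   # high probability on the true label
mostly_wrong = np.array([0.1, 0.9, 0.2, 0.8])   # high probability on the wrong label

print("sigmoid(0) =", sigmoid(0.0))
print("loss (mostly right):", binary_cross_entropy(y, mostly_right))
print("loss (mostly wrong):", binary_cross_entropy(y, mostly_wrong))
```

For a sample with $y_i = 1$ only the $-\log\hat{y}_i$ term contributes, and for $y_i = 0$ only $-\log(1-\hat{y}_i)$ does, so confident wrong predictions are penalized heavily.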

### Optimization

- There is no closed-form solution here, so the optimum cannot be computed directly  
-> repeat the gradient descent update below until convergence


$$w := w - \epsilon \nabla_{\mathbf{w}} J(w)$$

$$w_j := w_j - \epsilon \, \frac{\partial}{\partial w_j} J(\mathbf{w}), \quad \forall \, w_j \in \mathbf{w}$$


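- The minimum lies where the slope flattens out (the gradient is zero), so each step moves $w_j$ downhill by the slope of $J(w)$ at the current point multiplied by the learning rate $\epsilon$


- This update rule comes from the Taylor series of $J$ around $\theta_0$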
$$
J(\boldsymbol{\theta})
= J(\boldsymbol{\theta}_0)
+ (\boldsymbol{\theta} - \boldsymbol{\theta}_0)^{T} \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}_0)
+ \tfrac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}_0)^{T}
  \nabla_{\boldsymbol{\theta}}^{2} J(\boldsymbol{\theta}_0) (\boldsymbol{\theta} - \boldsymbol{\theta}_0)
+ \cdots
$$  

- Keeping only the first-order term gives the approximation
$$
J(\boldsymbol{\theta}) \;\simeq\;
J(\boldsymbol{\theta}_0) +
(\boldsymbol{\theta}-\boldsymbol{\theta}_0)^{\top}
\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}_0)
$$

- $\theta$ is taken very close to $\theta_0$ ($\theta \to \theta_0$), so the first-order approximation holds  
-> training must move in a direction where $J(\theta)$ becomes smaller than $J(\theta_0)$. 

So move $J(\theta_0)$ to the left-hand side and require the difference to be negative:  
$$
J(\boldsymbol{\theta}) - J(\boldsymbol{\theta}_0)
\simeq
(\boldsymbol{\theta}-\boldsymbol{\theta}_0)^{\top}
\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}_0) < 0
$$

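- Choosing $\boldsymbol{\theta} - \boldsymbol{\theta}_0$ proportional to the negative gradient satisfies the inequality, because the inner product becomes $-\epsilon\|\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}_0)\|^2 < 0$ whenever the gradient is nonzero. The proportionality constant $\epsilon$ is the learning rate, which completes the update rule  
- Starting from $\theta_0$, repeat this update until convergence

$$
\boldsymbol{\theta} - \boldsymbol{\theta}_0 \;\propto\; -\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}_0)
$$
$$
\boldsymbol{\theta} - \boldsymbol{\theta}_0 = -\epsilon \,\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}_0),
\quad \epsilon > 0
$$
$$
\boldsymbol{\theta} = \boldsymbol{\theta}_0 - \epsilon \,\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}_0),
\quad \epsilon > 0
$$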
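
The update rule can be sketched directly in NumPy for logistic regression. This is an illustrative implementation under assumptions: the two-blob data, the learning rate `epsilon = 0.1`, and the fixed 500 iterations are arbitrary choices, and a fixed iteration count stands in for a real convergence check. It relies on the fact that the gradient of the mean binary cross-entropy is $\frac{1}{n}X^T(\sigma(Xw) - y)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(w, X, y):
    """Mean binary cross-entropy of logistic regression with weights w."""
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
n = 200
# Two overlapping Gaussian blobs, one per class (synthetic data).
X_raw = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, 2)),
                   rng.normal(+1.0, 1.0, size=(n // 2, 2))])
y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])
X = np.hstack([np.ones((n, 1)), X_raw])   # bias absorbed as w_0

w = np.zeros(X.shape[1])    # starting point theta_0
epsilon = 0.1               # learning rate
losses = [bce(w, X, y)]
for _ in range(500):
    grad = X.T @ (sigmoid(X @ w) - y) / n   # gradient of the mean BCE
    w = w - epsilon * grad                  # theta = theta_0 - epsilon * grad
    losses.append(bce(w, X, y))

accuracy = np.mean((sigmoid(X @ w) >= 0.5) == y)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}, train accuracy: {accuracy:.3f}")
```

With a small enough learning rate the loss decreases at every step, which is exactly the condition $J(\theta) - J(\theta_0) < 0$ from the derivation.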


In [20]:
import mglearn
from sklearn.model_selection import train_test_split


X, y = mglearn.datasets.make_forge()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [22]:
# Logistic regression (binary classification)
from sklearn.linear_model import LogisticRegression


clf = LogisticRegression()
clf.fit(X_train, y_train)

y_test_hat = clf.predict(X_test)
print(y_test)
print(y_test_hat)

[1 0 1 0 1 1 0]
[1 0 1 0 1 0 0]


In [23]:
from sklearn.metrics import accuracy_score


y_train_hat = clf.predict(X_train)
print('train accuracy: %.5f'%accuracy_score(y_train, y_train_hat))
y_test_hat = clf.predict(X_test)
print('test accuracy: %.5f'%accuracy_score(y_test, y_test_hat))

train accuracy: 0.94737
test accuracy: 0.85714


In [24]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

In [26]:
import pandas as pd


training_accuracy = []
test_accuracy = []
C_settings = [0.01, 0.1, 1, 10, 100, 1000, 10000]

for C in C_settings:
    # build the model
    clf = LogisticRegression(C=C)
    clf.fit(X_train, y_train)
    
    # accuracy on the training set
    y_train_hat = clf.predict(X_train)
    training_accuracy.append(accuracy_score(y_train, y_train_hat))
    
    # accuracy on the test set (generalization)
    y_test_hat = clf.predict(X_test)
    test_accuracy.append(accuracy_score(y_test, y_test_hat))
    
pd.DataFrame({"C": C_settings, "train accuracy": training_accuracy, "test accuracy": test_accuracy})

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

| | C | train accuracy | test accuracy |
|---|---|---|---|
| 0 | 0.01 | 0.934272 | 0.93007 |
| 1 | 0.1 | 0.941315 | 0.951049 |
| 2 | 1.0 | 0.946009 | 0.958042 |
| 3 | 10.0 | 0.960094 | 0.958042 |
| 4 | 100.0 | 0.948357 | 0.958042 |
| 5 | 1000.0 | 0.943662 | 0.944056 |
| 6 | 10000.0 | 0.946009 | 0.958042 |
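
The convergence warning suggests scaling the data and raising `max_iter`. One possible way to follow that advice is sketched below with a `StandardScaler` pipeline. The choices `C=1` and `max_iter=1000` are assumptions for illustration, not tuned values, and the resulting accuracies are not directly comparable with the unscaled runs above. In scikit-learn, `C` is the inverse of the regularization strength, so a smaller `C` means stronger regularization.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Standardize the features first, then fit logistic regression.
# The scaler is fitted on the training split only, inside the pipeline.
clf = make_pipeline(StandardScaler(), LogisticRegression(C=1, max_iter=1000))
clf.fit(X_train, y_train)

train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
print(f"train accuracy: {train_acc:.5f}")
print(f"test accuracy: {test_acc:.5f}")
```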


### Multi-class

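- OVR (one-vs-rest) approach  
: train one binary classifier per class that separates it from all the other classes  
-> the class whose classifier gives the highest score becomes the predicted class


- OVO (one-vs-one) approach  
: train a classifier for each pair of classes, then decide the prediction by voting over their outputs

- Softmax regression  
: compute a score for each class, then convert the scores into probabilities with softmax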
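
The softmax step can be sketched in a few lines of NumPy. The weights `W`, biases `b`, and input `x` below are hypothetical values for 3 classes and 2 features, chosen only to show the computation. Subtracting the maximum score before exponentiating is a common numerical-stability trick that leaves the result unchanged.

```python
import numpy as np

def softmax(scores):
    """Convert class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)  # for stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical weights for 3 classes and 2 features (illustrative values only).
W = np.array([[ 1.0, -1.0],
              [ 0.0,  0.5],
              [-1.0,  1.0]])
b = np.array([0.0, 0.1, -0.1])
x = np.array([2.0, 1.0])

scores = W @ x + b          # one linear score per class
probs = softmax(scores)
print("scores:", scores)
print("probabilities:", np.round(probs, 3))
print("predicted class:", int(np.argmax(probs)))
```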

In [27]:
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split


X, y = make_blobs(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
# OVR
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression


clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X_train, y_train)

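| OneVsRestClassifier parameter | Value |
|---|---|
| estimator | LogisticRegression() |
| n_jobs | None |
| verbose | 0 |

| LogisticRegression parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 100 |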

In [29]:
y_test_hat = clf.predict(X_test)
print(y_test)
print(y_test_hat)

[1 0 0 2 2 1 2 0 2 0 2 0 1 0 1 2 2 0 2 1 0 2 1 2 1]
[1 0 0 2 2 1 2 0 2 0 2 0 1 0 1 2 2 0 2 1 0 2 1 2 1]
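
For comparison, the OVO approach described earlier can be tried on the same blobs with scikit-learn's `OneVsOneClassifier`. This is a sketch that mirrors the OVR cell and is not part of the original run. With $K$ classes, OVO trains $K(K-1)/2$ pairwise classifiers.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier

X, y = make_blobs(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One binary logistic regression per pair of classes; prediction by voting.
ovo = OneVsOneClassifier(LogisticRegression())
ovo.fit(X_train, y_train)

n_pairwise = len(ovo.estimators_)
test_acc = ovo.score(X_test, y_test)
print("number of pairwise classifiers:", n_pairwise)
print("test accuracy:", test_acc)
```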
