# Assignment 1

Explore solutions of a linear regression model with MSE loss.  
Investigate how regularization affects the solution.  
In this toy example we use simulated data and take $|| \hat w - w||_2$ as the quality metric (the distance between the found solution and the ground truth).  
In tasks 1-4 you are allowed to use only `numpy`.

In [26]:
import sklearn
print(sklearn.__version__)
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

0.23.2


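For a symmetric positive definite matrix $A$ (such as $X^T X$ when $X$ has full column rank), the condition number is 
$$ k(A) = \frac {\lambda_{max}(A)} {\lambda_{min}(A)}$$
where  
$\lambda_{max}$ is the largest eigenvalue of $A$  
$\lambda_{min}$ is the smallest eigenvalue of $A$  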
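
To make the definition concrete, here is a minimal sketch on hypothetical toy data (`X_demo` is not part of the assignment). It checks that the eigenvalue ratio of $X^T X$ agrees with `np.linalg.cond`, and that both equal the squared ratio of the extreme singular values of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))     # hypothetical toy design matrix
A = X_demo.T @ X_demo                 # symmetric positive definite Gram matrix

# eigvalsh returns the real eigenvalues of a symmetric matrix in ascending order
eigvals = np.linalg.eigvalsh(A)
k_eig = eigvals[-1] / eigvals[0]

# np.linalg.cond uses the 2-norm by default: ratio of extreme singular values
k_np = np.linalg.cond(A)

# singular values of X are the square roots of the eigenvalues of X^T X
sing = np.linalg.svd(X_demo, compute_uv=False)
k_svd = (sing[0] / sing[-1]) ** 2

print(k_eig, k_np, k_svd)
```

Because $k(X^T X)$ is the square of the condition number of $X$, forming $X^T X$ explicitly makes an ill-conditioned problem considerably worse.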


In [27]:
X, y, coef = make_regression(n_samples=1000, 
                             n_features=1000, 
                             n_informative=1000, 
                             n_targets=1, 
                             bias=0.0, 
                             effective_rank=10, 
                             tail_strength=0.5, 
                             noise=0.1, 
                             shuffle=True, coef=True, random_state=42)

print('k(A)', np.linalg.cond(X.T.dot(X)))

scaler = StandardScaler()
X = scaler.fit_transform(X)
coef = scaler.inverse_transform(coef.reshape(1,-1))

k(A) 1902233113.0210264


## Task 1 (2 points)
Implement analytic solution for linear regression with MSE loss.

In [31]:
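def solve(X, y):
    """
    @return: weights of the linear model
    """
    # Normal equations: w = (X^T X)^{-1} X^T y
    w = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
    return w

# print() returns None, so test the boolean itself rather than print's result.
ok = np.linalg.norm(solve(X, y) - coef) < 400
print(ok)
if ok:
    print('success!')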

False
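
The check above fails, which is plausible given that $k(X^T X)$ is about $10^9$: an explicit inverse of such a matrix is numerically fragile. As a hedged sketch (not the assignment's reference solution), two alternatives that avoid the explicit inverse are shown below on hypothetical, well-conditioned toy data. The names `solve_normal_eq` and `solve_lstsq` are illustrative only, and nothing here claims that either variant passes the threshold on the assignment data.

```python
import numpy as np

def solve_normal_eq(X, y):
    """Solve (X^T X) w = X^T y as a linear system instead of inverting X^T X."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def solve_lstsq(X, y):
    """Least-squares solution computed from X directly (SVD-based)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Hypothetical noise-free toy problem with a known ground truth
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y_demo = X_demo @ w_true

w_ne = solve_normal_eq(X_demo, y_demo)
w_ls = solve_lstsq(X_demo, y_demo)
print(np.linalg.norm(w_ne - w_true), np.linalg.norm(w_ls - w_true))
```

`np.linalg.lstsq` works with $X$ rather than $X^T X$, so it does not square the condition number.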


## Task 2 (2 points)
Implement the analytic solution for linear regression with MSE loss and $L_2$ regularization.  
Plot the dependence of $|| \hat w - w||_2$ on the regularization coefficient $\alpha$.
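
A short derivation of the closed form, assuming the loss is the unnormalized sum of squared errors plus the penalty (with a $1/n$ factor in front of the squared error, $\alpha$ is simply rescaled by $n$):
$$ L(w) = ||Xw - y||_2^2 + \alpha ||w||_2^2 $$
$$ \nabla L(w) = 2 X^T (Xw - y) + 2 \alpha w = 0 $$
$$ \hat w = (X^T X + \alpha I)^{-1} X^T y $$
Setting $\alpha = 0$ recovers the unregularized solution of Task 1. For $\alpha > 0$ the eigenvalues of $X^T X + \alpha I$ are $\lambda_i + \alpha$, so its condition number is $\frac{\lambda_{max} + \alpha}{\lambda_{min} + \alpha}$, which is smaller than $k(X^T X)$.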

In [29]:
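def solve(X, y, alpha):
    # Ridge normal equations: w = (X^T X + alpha * I)^{-1} X^T y
    w = np.linalg.inv(alpha * np.eye(X.shape[1]) + X.T.dot(X)).dot(X.T).dot(y)
    return w

# print() returns None, so test the boolean itself rather than print's result.
ok = np.linalg.norm(solve(X, y, 0.1) - coef) < 10
print(ok)
if ok:
    print('success!')

True
success!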
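
The task also asks for the dependence of the error on $\alpha$. The sketch below shows one way to collect those numbers, using a hypothetical ill-conditioned toy problem built with `numpy` only. `ridge_solve`, `X_demo`, `w_true` and the grid of `alphas` are assumptions for illustration, not values from the assignment.

```python
import numpy as np

def ridge_solve(X, y, alpha):
    """Closed-form ridge solution via a linear solve."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
n, d = 200, 20

# Build X = U diag(sing) V^T with rapidly decaying singular values
U, _ = np.linalg.qr(rng.normal(size=(n, d)))   # orthonormal columns, shape (n, d)
V, _ = np.linalg.qr(rng.normal(size=(d, d)))   # orthogonal matrix, shape (d, d)
sing = np.logspace(0, -4, d)
X_demo = (U * sing) @ V.T

w_true = rng.normal(size=d)
y_demo = X_demo @ w_true + 0.01 * rng.normal(size=n)

# Sweep alpha on a log grid and record the distance to the ground truth
alphas = np.logspace(-8, 2, 11)
errors = np.array([
    np.linalg.norm(ridge_solve(X_demo, y_demo, a) - w_true) for a in alphas
])

for a, e in zip(alphas, errors):
    print(f"alpha={a:.0e}  error={e:.3f}")
```

The pairs `(alphas, errors)` can then be plotted with a logarithmic $\alpha$ axis. On the assignment data the same loop would call the `solve(X, y, alpha)` defined above and compare against `coef`.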


## Task 3 (2 points)
Implement Full Gradient Descent solution for linear MSE regression with $L_2$ regularization.  
Use gradient norm for stopping criterion.

In [30]:
def solve(X, y, alpha, max_iter, tol):
    """
    @param tol: value for stopping criterion
    @param max_iter: max number of iterations
    @return: weights of the linear model
    """
    # TODO
    w = ...
    return w

# print() returns None, so test the boolean itself rather than print's result.
ok = np.linalg.norm(solve(X, y, 0.99, 1000, 0.001) - coef) < 10
print(ok)
if ok:
    print('success!')

TypeError: unsupported operand type(s) for -: 'ellipsis' and 'float'
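
The cell above is still the unfilled template, hence the `TypeError`. As a hedged sketch of what full gradient descent for this problem could look like, the code below minimizes $||Xw - y||_2^2 + \alpha ||w||_2^2$ with a fixed step and stops when the gradient norm drops below `tol`. The name `gd_ridge`, the zero initialization and the step size $1/L$ are assumptions, not requirements of the assignment, and the toy data is hypothetical.

```python
import numpy as np

def gd_ridge(X, y, alpha, max_iter, tol):
    """Full gradient descent for ||Xw - y||^2 + alpha * ||w||^2 (illustrative sketch)."""
    w = np.zeros(X.shape[1])
    # The gradient is Lipschitz with constant L = 2 * (sigma_max(X)^2 + alpha);
    # a fixed step of 1 / L guarantees a monotone decrease of the loss.
    lipschitz = 2.0 * (np.linalg.norm(X, 2) ** 2 + alpha)
    step = 1.0 / lipschitz
    for _ in range(max_iter):
        grad = 2.0 * (X.T @ (X @ w - y) + alpha * w)
        if np.linalg.norm(grad) < tol:   # stopping criterion on the gradient norm
            break
        w = w - step * grad
    return w

# Hypothetical well-conditioned toy problem to compare with the closed form
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w_gd = gd_ridge(X_demo, y_demo, alpha=0.5, max_iter=1000, tol=1e-8)
w_closed = np.linalg.solve(X_demo.T @ X_demo + 0.5 * np.eye(3), X_demo.T @ y_demo)
print(np.linalg.norm(w_gd - w_closed))
```

The convergence rate of this scheme depends on the condition number of $X^T X + \alpha I$, so on the ill-conditioned assignment data 1000 iterations may not be enough to reach the tolerance.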

## Task 4 (2 points)
Which parameter of `make_regression` affects the condition number of $X^T X$ the most? Why?    
Tweak the `make_regression` call to generate problems with different condition numbers.  
Plot the dependence of $||\hat w - w||_2$ for the analytic solution from Task 1 on the condition number of $X^T X$.  
Use a log scale for the condition numbers in the plot.

In [43]:
"""
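    The `effective_rank` parameter has the largest effect. It sets the 
    approximate number of singular vectors needed to explain most of the 
    input data by linear combinations, so for a small `effective_rank` most 
    singular values of X are close to zero. The smallest eigenvalue of X^T X 
    is then tiny compared with the largest one and the condition number is huge. 
    With `effective_rank=None` the input is well conditioned.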
"""

import matplotlib.pyplot as plt

def solve(X, y):
    """Analytic least-squares solution from Task 1."""
    return np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

# Vary effective_rank to obtain problems with different condition numbers.
# The grid below is an illustrative choice; None gives a well-conditioned X.
ranks = [10, 50, 100, 200, 500, None]
conds, errors = [], []
for rank in ranks:
    X, y, coef = make_regression(n_samples=1000, 
                                 n_features=1000, 
                                 n_informative=1000, 
                                 n_targets=1, 
                                 bias=0.0, 
                                 effective_rank=rank, 
                                 tail_strength=0.5, 
                                 noise=0.1, 
                                 shuffle=True, coef=True, random_state=42)
    # One point per generated problem: condition number vs. distance to the truth
    conds.append(np.linalg.cond(X.T.dot(X)))
    errors.append(np.linalg.norm(solve(X, y) - coef))

plt.plot(conds, errors, 'o')
plt.xscale('log')
plt.xlabel('condition number of $X^T X$')
plt.ylabel(r'$||\hat w - w||_2$')
plt.show()

## Task 5 (2 points)
How does switching the `StandardScaler` transformation on and off affect the quality of the solutions in tasks 1-3?  
How is it connected with the $L_2$ norm?
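
One way to probe the connection with the $L_2$ penalty is sketched below on hypothetical toy data (`ridge_solve`, `X_demo` and the per-feature `scale` are illustrative assumptions). It rescales the features and compares predictions: the unregularized fit is unaffected by the rescaling, while the ridge fit changes, because the penalty treats all weights equally regardless of the units of their features.

```python
import numpy as np

def ridge_solve(X, y, alpha):
    """Closed-form solution of ||Xw - y||^2 + alpha * ||w||^2 (alpha=0 gives OLS)."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Hypothetical change of units: each feature is multiplied by a different factor
scale = np.array([1.0, 100.0, 0.01])
X_rescaled = X_demo * scale

# Without regularization the predictions do not depend on the feature scale
pred_ols = X_demo @ ridge_solve(X_demo, y_demo, 0.0)
pred_ols_rescaled = X_rescaled @ ridge_solve(X_rescaled, y_demo, 0.0)
ols_gap = np.max(np.abs(pred_ols - pred_ols_rescaled))

# With an L2 penalty the same rescaling changes the fitted model
pred_ridge = X_demo @ ridge_solve(X_demo, y_demo, 10.0)
pred_ridge_rescaled = X_rescaled @ ridge_solve(X_rescaled, y_demo, 10.0)
ridge_gap = np.max(np.abs(pred_ridge - pred_ridge_rescaled))

print(ols_gap, ridge_gap)
```

Standardizing the features puts them on a common scale, so a single $\alpha$ shrinks all weights comparably.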

In [None]:
"""
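Without StandardScaler the quality of the solution drops significantly, 
while the solution with L2 regularization has a smaller error. 
The L2 penalty is not scale-invariant: it penalizes every weight equally, 
so rescaling a feature changes how strongly that feature is shrunk.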
"""