### <u>Basic Linear Regression</u>
Given a labeled data set $D_{train}=\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb R^d$, $y_i \in \mathbb R$, and $n \gg d$, we predict $\hat y_i = w^Tx_i$ for each $x_i$, where the optimal $w$ has to be "learned" via linear regression.


##### <u>Imports & Reading Input</u> 

In [37]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

df_train = pd.read_csv('train.csv')
# column 1 holds the labels, columns 2 onward the features (column 0 is not used)
y_train = df_train.iloc[:, 1].values
X_train = df_train.iloc[:, 2:].values
n, d = X_train.shape

##### <u>Linear Regression</u>
The problem reads as follows: $$\hat w = \arg \min_{w \in \mathbb R^d} \frac 1 {2n} \|y - Xw\|_2^2$$The gradient of the loss function $L(w) = \frac 1 {2n} \| y - Xw \|_2^2$ is $\nabla_w L(w) = \frac 1 n (X^TX w - X^Ty)$. Note that we could simply solve the normal equations $X^TXw = X^Ty$; however, we use gradient descent for the sake of entertainment. Since the loss function is convex, GD converges to a global minimum, provided the step size $\alpha$ is small enough.<br>
Recall the update formula for GD with momentum, where $\alpha$ is the step size and $\beta$ the momentum coefficient: $$w^{t+1} = w^t + \beta (w^t - w^{t-1}) - \alpha \nabla L(w^t)$$
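
For reference, the gradient quoted above follows from expanding the squared norm. This is the standard derivation for $L(w) = \frac 1 {2n} \| y - Xw \|_2^2$ (the objective of the minimization problem) and uses only the symbols already defined: $$\begin{aligned} L(w) &= \frac 1 {2n} (y - Xw)^T (y - Xw) = \frac 1 {2n} \left( y^Ty - 2 w^T X^T y + w^T X^T X w \right) \\ \nabla_w L(w) &= \frac 1 {2n} \left( -2 X^T y + 2 X^T X w \right) = \frac 1 n \left( X^T X w - X^T y \right) \end{aligned}$$Setting $\nabla_w L(w) = 0$ gives the normal equations $X^TXw = X^Ty$. The Hessian $\frac 1 n X^TX$ is positive semidefinite, which is why the loss is convex.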

In [38]:
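alpha, beta = 1e-6, 1e-7  # step size and momentum coefficient
wt = np.zeros(d)   # previous iterate w^{t-1}
wt1 = np.zeros(d)  # current iterate w^t

for i in range(100):
    # gradient of (1 / 2n) * ||y - Xw||^2 at the current iterate w^t
    grad = (1 / n) * (X_train.T @ X_train @ wt1 - X_train.T @ y_train)
    # momentum step: w^{t+1} = w^t + beta * (w^t - w^{t-1}) - alpha * grad
    wt2 = wt1 + beta * (wt1 - wt) - alpha * grad
    wt = np.copy(wt1)
    wt1 = np.copy(wt2)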

w = wt2
print('w =', w)

w = [0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]


We observe that $w \approx \frac 1 {10} [1, \dots, 1]^T$, meaning that the scalar product $w^Tx_i$ corresponds to the mean of the components of $x_i$.
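
As a cross-check of the normal equations mentioned above, here is a minimal sketch on synthetic data. The data is an assumption standing in for `train.csv`: each label is simply the mean of the $d = 10$ components of its row, which is what the learned $w$ suggests about the real data. The names `X`, `y`, `w_closed` and `grad_at_solution` are illustrative and do not come from the notebook.

```python
import numpy as np

# Synthetic stand-in for train.csv (assumption): y_i is the mean of the row x_i.
rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.normal(size=(n, d))
y = X.mean(axis=1)

# Least-squares solution of min_w ||y - Xw||_2^2. When X has full column rank,
# this is the unique solution of the normal equations X^T X w = X^T y.
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient (1 / n) * (X^T X w - X^T y) should vanish at the solution.
grad_at_solution = (X.T @ (X @ w_closed) - X.T @ y) / n

print('w_closed =', np.round(w_closed, 6))
print('max |grad| =', np.abs(grad_at_solution).max())
```

On this synthetic data the closed-form solution is $\frac 1 {10} [1, \dots, 1]^T$ up to floating-point error, the same shape of solution that gradient descent reports above.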

##### <u>Testing & Training Error</u>
Given a single $x_i \in \mathbb R^d$, we predict $\hat y_i = w^Tx_i \in \mathbb R$. Equivalently, given the training matrix $X_{train}$, we predict $\hat y = X_{train}w \in \mathbb R^n$. We define the training error as the Root Mean Squared Error: $$\text{RMSE} = \sqrt {\frac 1 n \sum_{i=1}^n (y_i - w^T x_i)^2}$$On the training set $(X_{train}, y_{train})$ this is easily computed in matrix form: $$\text{RMSE} = \sqrt {\frac 1 n \| y_{train} - X_{train} w \|_2^2}$$

In [39]:
RMSE = mean_squared_error(y_train, X_train @ w) ** 0.5
print('RMSE =', RMSE)

RMSE = 5.913107707840612e-13
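
The sum form and the norm form of the RMSE above are the same quantity. A small sketch with plain NumPy illustrates this, assuming synthetic data with added noise so that the error is not zero. All names here (`X`, `y`, `w`, `rmse_sum`, `rmse_norm`) are illustrative and separate from the notebook's variables.

```python
import numpy as np

# Synthetic data (assumption): noisy labels around X @ w so that RMSE > 0.
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w = np.full(d, 0.1)
y = X @ w + rng.normal(scale=0.5, size=n)

residual = y - X @ w

# Sum form: sqrt((1 / n) * sum_i (y_i - w^T x_i)^2)
rmse_sum = np.sqrt(np.mean(residual ** 2))

# Norm form: sqrt((1 / n) * ||y - Xw||_2^2) = ||y - Xw||_2 / sqrt(n)
rmse_norm = np.linalg.norm(residual) / np.sqrt(n)

print('sum form :', rmse_sum)
print('norm form:', rmse_norm)
```

With its default arguments, `mean_squared_error(y, X @ w) ** 0.5` as used in the cell above computes this same value.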


##### <u>Making Predictions</u>
Given the unlabeled data set $D_{test}=\{x_i\}_{i=1}^m$, where $x_i \in \mathbb R^d$, we make the prediction $y_{pred} = X_{test}w \in \mathbb R^m$. The output is saved in `output.csv`.

In [40]:
df_test = pd.read_csv('test.csv')
X_test = df_test.iloc[:, 1:].values

y_pred = X_test @ w

np.savetxt('output.csv', y_pred, header='Prediction', fmt='%f')
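
One detail of `np.savetxt` is worth knowing when reading `output.csv` back: by default the `header` string is written with the prefix `# `, because the `comments` parameter defaults to `'# '`. The sketch below shows both behaviors on a few hypothetical predictions (`y_demo` is made up for illustration) and writes to a temporary directory rather than to `output.csv`.

```python
import os
import tempfile

import numpy as np

y_demo = np.array([0.5, 1.25, 2.0])  # hypothetical predictions, not real output

with tempfile.TemporaryDirectory() as tmp:
    default_path = os.path.join(tmp, 'default.csv')
    plain_path = os.path.join(tmp, 'plain.csv')

    # Default behavior: the header line is prefixed with comments='# '.
    np.savetxt(default_path, y_demo, header='Prediction', fmt='%f')

    # With comments='' the header is written verbatim.
    np.savetxt(plain_path, y_demo, header='Prediction', fmt='%f', comments='')

    with open(default_path) as f:
        default_lines = f.read().splitlines()
    with open(plain_path) as f:
        plain_lines = f.read().splitlines()

print(default_lines[0])  # '# Prediction'
print(plain_lines[0])    # 'Prediction'
```

Whether the consumer of `output.csv` expects a bare `Prediction` header is not stated here. If it does, passing `comments=''` is one option.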