# Lab 1: Teacher-Student estimation problem

We study the generalization error dynamics in a shallow linear neural network receiving $n$-dimensional inputs. We consider a standard student-teacher formulation.


Import the libraries

In [None]:
import torch
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
seed = 79790
torch.manual_seed(seed) # set the seed of the random generator

<torch._C.Generator at 0x7f3174add6d0>

## Teacher model

The teacher implements a noisy linear mapping from $N$ inputs $x_i\in\mathbb{R}^n$ to scalar outputs $y_i\in\mathbb{R}$:
\begin{equation}
y_i=x_i^T\beta+\varepsilon_i=\sum_{j=1}^{n}x_{i,j}\beta_j+\varepsilon_i,\:i=1,\ldots,N.
\end{equation}
We assume that the inputs $x_{i,j}$ are drawn i.i.d. from a Gaussian with mean zero and variance $\frac{1}{n}$, so that each example has an expected squared norm of one: $\mathbb{E}({\left\lVert x_i\right\rVert}^2_2)= 1$.

In matrix form, we get
\begin{equation}\label{eq:teacher-mechanism}
y = S\beta+\varepsilon
\end{equation}
where the $i$-th row of $S \in \mathbb{R}^{N\times n}$ is $x_i^T$, $y=(y_1,\ldots,y_N) \in \mathbb{R}^{N}$ and $\varepsilon=(\varepsilon_1,\ldots,\varepsilon_N) \in \mathbb{R}^{N}$. We assume that $N>n$ and that $S$ is a full column rank matrix.

In the equation above, $\varepsilon$ denotes noise in the teacher’s output. We model the noise $\varepsilon_i$ and the teacher weights $\beta_j$ as drawn i.i.d. from zero-mean Gaussian distributions with variances $\sigma^2_{\varepsilon}$ and $\sigma^2_{\beta}$, respectively.



### Question 1: Code the teacher data generation mechanism in PyTorch. Store the training data $S$ and $y$ in PyTorch tensors. Use $n=4$, $N=20$, $\sigma_{\beta}=0.7$, and $\sigma_{\varepsilon}=1.5$.

In [None]:
# Write your code here.

Teacher parameters

In [None]:
N = 20  # number of training samples
n = 4  # dimension of the input
sigma_beta = 0.7  # standard deviation of the teacher weights
sigma_epsilon = 1.5  # standard deviation of the output noise
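As an illustration of the teacher mechanism $y = S\beta+\varepsilon$ with these parameters, here is a minimal sketch. The helper name `generate_teacher_data` and the seed value are assumptions made for this example, not part of the lab statement.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, chosen only for this sketch

N, n = 20, 4           # number of training samples, input dimension
sigma_beta = 0.7       # standard deviation of the teacher weights
sigma_epsilon = 1.5    # standard deviation of the output noise


def generate_teacher_data(num_samples, beta, sigma_epsilon):
    """Hypothetical helper: return (S, y) with y = S beta + eps."""
    dim = beta.shape[0]
    # Inputs x_{i,j} ~ N(0, 1/n), so that E||x_i||^2 = 1.
    S = torch.randn(num_samples, dim) / dim ** 0.5
    # Noise eps_i ~ N(0, sigma_epsilon^2).
    eps = sigma_epsilon * torch.randn(num_samples)
    y = S @ beta + eps
    return S, y


# Teacher weights beta_j ~ N(0, sigma_beta^2).
beta = sigma_beta * torch.randn(n)
S, y = generate_teacher_data(N, beta, sigma_epsilon)

print(S.shape, y.shape)
```

The same helper can be reused with a larger `num_samples` to draw an evaluation set from the same teacher.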


## Student model

The student network with weight vector $w \in \mathbb{R}^{n}$ is trained on examples $x_i$ generated by a teacher network:

$$
\hat{y}_i=x_i^\top w.
$$

**Training the student model**

The student network is trained using the dataset $\{y,S\}$ to accurately predict outputs for novel inputs $x \in \mathbb{R}^n$.
The student is a shallow linear network, such that the student’s prediction $\hat{y}\in\mathbb{R}$ is simply $\hat{y}=x^Tw$. 

To learn its parameters, the student network will attempt to minimize the mean squared error on the $N$ training samples using gradient descent. The training error is
\begin{equation}\label{eq:train-error}
E_r(w)=\frac{1}{N}\sum_{i=1}^{N}{( y_i -  \hat{y}_i)}^2=\frac{1}{N}\sum_{i=1}^{N}{( y_i -  x_i^T w)}^2=\frac{1}{N}{\left\lVert y -  Sw \right\rVert}^2_2.
\end{equation}
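The following sketch evaluates the training error both as a mean of squared residuals and as $\frac{1}{N}\lVert y - Sw\rVert_2^2$, then checks the autograd gradient against the matrix expression $\frac{2}{N}S^T(Sw-y)$ obtained by differentiating the quadratic form. The data are random placeholders and the function name `training_error` is an assumption for this example.

```python
import torch

torch.manual_seed(0)  # arbitrary seed for this sketch

N, n = 20, 4
S = torch.randn(N, n) / n ** 0.5         # placeholder design matrix
y = torch.randn(N)                        # placeholder targets
w = torch.randn(n, requires_grad=True)    # arbitrary student weights


def training_error(w, S, y):
    """E_r(w) = (1/N) * ||y - S w||_2^2."""
    return torch.sum((y - S @ w) ** 2) / S.shape[0]


E_r = training_error(w, S, y)
E_r_mean = torch.mean((y - S @ w) ** 2)   # same quantity, written as a mean

# Gradient from autograd.
E_r.backward()

# Gradient in matrix form: (2/N) * S^T (S w - y).
grad_matrix = (2.0 / N) * (S.T @ (S @ w.detach() - y))

print(torch.allclose(E_r, E_r_mean))
print(torch.allclose(w.grad, grad_matrix, atol=1e-5))
```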



### Question 2: Write $E_r(w)$ in the form 
\begin{equation}
E_r(w)=\frac{1}{N}{\left\lVert y -  Sw \right\rVert}^2_2.
\end{equation}

Write your answer here.

### Question 3: Calculate the gradient of $E_r(w)$ with respect to $w$. Write the gradient in matrix form.

Write your answer here.

### Question 4: What is the minimum value of $E_r(w)$? This value is denoted $E_r^*$.

Write your answer here.

### Question 5: Compute the optimal weights $w^*$ such that $E_r(w^*)=E_r^*$, and compute the value of $E_r^*$.

In [None]:
# Write your code here.

## Generalization error

We will study the generalization error
\begin{equation}
E_g(t)=E_g(w(t))=\mathbb{E}_{X,Y}{\left(Y-X^Tw(t)\right)}^2=\mathbb{E}_{X,\varepsilon}{\left(X^T\beta+\varepsilon-X^Tw(t)\right)}^2,
\end{equation}
where $w(t)$ is the student weight estimated at time $t$ during the gradient descent and $\varepsilon$ is a random value following a Gaussian distribution with zero mean and variance $\sigma^2_{\varepsilon}$.
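Because the expectation is taken over fresh inputs and noise, it can be approximated by averaging over a large sample drawn from the teacher. The sketch below does this for an arbitrary weight vector and compares the result with the expression $\frac{1}{n}\lVert\beta-w\rVert_2^2+\sigma^2_{\varepsilon}$, which follows from $\mathbb{E}(XX^T)=\frac{1}{n}I$ and the independence of $X$ and $\varepsilon$. The function names and the sample size are assumptions for this example.

```python
import torch

torch.manual_seed(0)  # arbitrary seed for this sketch

n = 4
sigma_beta, sigma_epsilon = 0.7, 1.5
beta = sigma_beta * torch.randn(n)   # teacher weights
w = 0.2 * torch.randn(n)             # arbitrary student weights


def monte_carlo_generalization_error(w, beta, sigma_epsilon, num_samples=200_000):
    """Estimate E[(Y - X^T w)^2] by sampling fresh teacher data."""
    dim = beta.shape[0]
    X = torch.randn(num_samples, dim) / dim ** 0.5
    Y = X @ beta + sigma_epsilon * torch.randn(num_samples)
    return torch.mean((Y - X @ w) ** 2).item()


def closed_form_generalization_error(w, beta, sigma_epsilon):
    """(1/n) * ||beta - w||^2 + sigma_epsilon^2, using E[X X^T] = I / n."""
    dim = beta.shape[0]
    return (torch.sum((beta - w) ** 2) / dim + sigma_epsilon ** 2).item()


E_mc = monte_carlo_generalization_error(w, beta, sigma_epsilon)
E_cf = closed_form_generalization_error(w, beta, sigma_epsilon)
print(E_mc, E_cf)
```

The two values should agree up to Monte Carlo fluctuations, which shrink as `num_samples` grows.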



### Question 6: What is the oracle error $E_\infty=\mathbb{E}_{X,Y}{\left(Y-X^Tw(t)\right)}^2$ when $w(t)=\beta$?

Write your answer here.

### Question 7: Generate an evaluation dataset with 10 000 teacher samples. This dataset will be used to compute the evaluation error of the trained student network during the training.

In [None]:
# Write your code here.

## Gradient descent

We will use the full-batch gradient descent algorithm to minimize $E_r(w)$:
\begin{equation}
w_{k+1}=w_{k}-\lambda \nabla E_r(w_{k})
\end{equation}
where $\lambda>0$ is a small constant learning rate.
We assume that the starting weights $w_i(0)$ are drawn i.i.d. from a Gaussian with mean zero and variance $\sigma^2_0$.
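The update rule above can be sketched with a bias-free `torch.nn.Linear` layer and `torch.optim.SGD` applied to the whole training set at every step. This is one possible setup under stated assumptions, not the reference solution: the seed, the variable names, and the values $\sigma_0=0.2$, learning rate $0.01$ and 2 500 iterations are choices made for this example.

```python
import torch

torch.manual_seed(0)  # arbitrary seed for this sketch

N, n = 20, 4
sigma_beta, sigma_epsilon, sigma_0 = 0.7, 1.5, 0.2
learning_rate, num_iterations = 0.01, 2500

# Teacher data y = S beta + eps.
beta = sigma_beta * torch.randn(n)
S = torch.randn(N, n) / n ** 0.5
y = S @ beta + sigma_epsilon * torch.randn(N)

# Student: one fully connected layer without bias, weights ~ N(0, sigma_0^2).
student = torch.nn.Linear(n, 1, bias=False)
torch.nn.init.normal_(student.weight, mean=0.0, std=sigma_0)

optimizer = torch.optim.SGD(student.parameters(), lr=learning_rate)
loss_fn = torch.nn.MSELoss()  # mean reduction: (1/N) * sum of squared residuals

train_errors = []
for _ in range(num_iterations):
    optimizer.zero_grad()
    # Full batch: the loss equals E_r(w_k) on the whole training set.
    loss = loss_fn(student(S).squeeze(1), y)
    loss.backward()
    optimizer.step()  # w_{k+1} = w_k - lr * grad E_r(w_k)
    train_errors.append(loss.item())

print(train_errors[0], train_errors[-1])
```

Since the whole dataset is used at each step, SGD here reduces to plain gradient descent; the evaluation error can be tracked in the same loop under `torch.no_grad()`.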


### Question 8: Implement a shallow neural network in PyTorch to learn $w$ by minimizing the training error $E_r(w)$. The network has a single fully connected layer with no bias. Use the SGD optimizer from the library "torch.optim". Use $\sigma_{0}=0.2$, a learning rate close to $0.01$, and about 2 500 iterations. Compute both the training error (on the training dataset) and the evaluation error (on the evaluation dataset).

In [None]:
# Write your code here.

### Question 9: Plot the training error $E_r(t)=E_r(w(t))$ as a function of the gradient descent iterates $t=1,2,\ldots$. You can use the libraries "matplotlib" and "numpy" to plot the error. Plot on the same graph the constant oracle error $E_\infty$ and the optimal training error $E_r^*$.

In [None]:
# Write your code here.

## Trajectory of the gradient descent

As shown in the lecture, the gradient descent trajectory is approximated by the solution of 
\begin{equation}
	\tau\, \dot{w}(t) = S^Ty-S^TSw(t)
\end{equation}
with $\tau=\frac{N}{2}$.

By using an appropriate change of variables, we get $n$ uncoupled differential equations
$$
\tau \dot{z}_i(t)=(\delta_i-z_i(t)) \lambda_i+\gamma_i \sqrt{\lambda_i}
$$

Their solutions are

$$
z_i(t)=\delta_i+(z_i(0)-\delta_i)e^{-\frac{\lambda_i}{\tau}t}+\frac{\gamma_i}{\sqrt{\lambda_i}}(1-e^{-\frac{\lambda_i}{\tau}t})
$$

where $S^\top S=V\Lambda V^\top$, $z=V^\top w$, $\delta=V^\top \beta$ and $\gamma=\Lambda^{-\frac{1}{2}}V^\top S^\top\varepsilon$. The diagonal elements of $\Lambda$ are the eigenvalues $\lambda_i$, which should not be confused with the learning rate $\lambda$.

All the notations are defined in the lecture.
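To make the change of variables concrete, the sketch below diagonalizes $S^\top S$ with `torch.linalg.eigh`, evaluates the solutions $z_i(t)$ at the times $t=k\lambda$ (with $\lambda$ the learning rate), maps them back through $w(t)=Vz(t)$, and compares the result with discrete gradient descent iterates. The seed, the variable names (`step` for the learning rate, `num_iterations`) and the use of double precision are assumptions for this example.

```python
import torch

torch.manual_seed(0)  # arbitrary seed for this sketch
dtype = torch.float64

N, n = 20, 4
sigma_beta, sigma_epsilon, sigma_0 = 0.7, 1.5, 0.2
step, num_iterations = 0.01, 2500  # learning rate, number of iterations

beta = sigma_beta * torch.randn(n, dtype=dtype)
S = torch.randn(N, n, dtype=dtype) / n ** 0.5
eps = sigma_epsilon * torch.randn(N, dtype=dtype)
y = S @ beta + eps
w0 = sigma_0 * torch.randn(n, dtype=dtype)

# Eigendecomposition S^T S = V Lambda V^T (eigenvalues in ascending order).
eigvals, V = torch.linalg.eigh(S.T @ S)
tau = N / 2
delta = V.T @ beta
gamma = (V.T @ S.T @ eps) / eigvals.sqrt()
z0 = V.T @ w0

# Times t = k * step for k = 0, ..., num_iterations.
t = step * torch.arange(num_iterations + 1, dtype=dtype)
decay = torch.exp(-eigvals[None, :] * t[:, None] / tau)  # shape (K + 1, n)
z = delta + (z0 - delta) * decay + gamma / eigvals.sqrt() * (1 - decay)
w_analytic = z @ V.T  # row k holds w(t_k) = V z(t_k)

# Discrete gradient descent on E_r for comparison.
w = w0.clone()
w_gd = [w.clone()]
for _ in range(num_iterations):
    w = w - step * (2.0 / N) * (S.T @ (S @ w - y))
    w_gd.append(w.clone())
w_gd = torch.stack(w_gd)

# Limit of the analytic solution versus the least-squares solution.
w_limit = V @ (delta + gamma / eigvals.sqrt())
w_star = torch.linalg.lstsq(S, y.unsqueeze(1)).solution.squeeze(1)

max_gap = (w_gd - w_analytic).abs().max().item()
print(max_gap)
```

The gap between the two trajectories comes from replacing the discrete updates by a continuous-time flow and is expected to shrink with the learning rate.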

### Question 10: Compute the $n$ functions $z_i(t)$ for $t=k \lambda$, $k=0,1,\ldots,K$, where $\lambda$ is the learning rate and $K$ is the number of gradient descent iterations.
	

In [None]:
# Write your code here.

### Question 11: Compute the coupled trajectories $w(t)=Vz(t)$.

In [None]:
# Write your code here.

### Question 12: Plot the empirical learned weights together with the approximated analytic solution $w(t)$ as a function of the gradient descent iterates $t=1,2,\ldots$. Plot on the same graph the true coefficients $\beta$, the optimal training weights $w^*$ and the initial weight $w(0)$.

In [None]:
# Write your code here.