In [None]:
import numpy as np
from matplotlib import pyplot as plt

## Part 1 : errors in least-square regression

### Experiment

We work on data obtained in the TP « Banc mécanique » that was graciously shared with us. We consider a glider on an air-cushion bench. We measure its speed $v$ and we want to evaluate the laminar friction $f=-\beta v$ it undergoes. We perform a least-squares regression of $f$ on $v$ to compute $\beta$.

We denote by `l` the length of the glider ; by `m` its mass ; by `L` the distance between the two points of measurement.

In the following cell the experimental measurements can be loaded. There are two setups ; you can choose one of the two files.

In [None]:
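m = 0.142  # mass of the glider
L = 0.79   # distance between the two points of measurement

# Setup 1
donnees = np.loadtxt("ex_11_données.csv", delimiter=',')
l = 0.091  # length of the glider

# Setup 2 : run as is, it overrides setup 1 ; comment out these two lines to keep setup 1
donnees = np.loadtxt("ex_11_donnéesBis.csv", delimiter=',')
l = 0.09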

The experiment is repeated $n$ times ($n$ is about twenty) with various initial speeds. For the $\mu$th repetition the measurements consist of four time intervals $t_{\mu,1}$, $t_{\mu,2}$, $t_{\mu,3}$ and $t_{\mu,4}$. They correspond to the times the glider takes to go through the optical gates on the bench. In the following cell we unpack these. Print a few of them.

In [None]:
t1, t2, t3, t4 = donnees[:,0], donnees[:,1], donnees[:,2], donnees[:,3]
t1

For each $\mu$ the speeds at these points are $v_{\mu,a}=\mathrm{l}/t_{\mu,a}$ (the length of the glider divided by the duration). We define an average speed $\bar v_\mu=(v_{\mu,1}+v_{\mu,2}+v_{\mu,3}+v_{\mu,4})/4$. The average force is $\bar f_\mu=\frac{\mathrm{m}}{\mathrm{L}}(v_{\mu,1}^2-v_{\mu,2}^2+v_{\mu,3}^2-v_{\mu,4}^2)/4$.

Compute $\bar v_\mu$ and $\bar f_\mu$ ; plot the points $(\bar v_\mu, \bar f_\mu)$ as dots.
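As a minimal sketch of this step, the two formulas above can be written with NumPy array arithmetic. The gate times below are hypothetical stand-ins for the columns of `donnees`, and the helper name `mean_speed_and_force` is an assumption made for illustration.

```python
import numpy as np

# Hypothetical gate times, one entry per repetition; the real values
# come from the four columns of `donnees`.
t1 = np.array([0.100, 0.200, 0.300])
t2 = np.array([0.102, 0.203, 0.304])
t3 = np.array([0.104, 0.206, 0.308])
t4 = np.array([0.106, 0.209, 0.312])
l, m, L = 0.091, 0.142, 0.79  # glider length, mass, distance between gates

def mean_speed_and_force(t1, t2, t3, t4, l, m, L):
    """Return (v_bar, f_bar) per repetition, following the formulas above."""
    v1, v2, v3, v4 = l / t1, l / t2, l / t3, l / t4
    v_bar = (v1 + v2 + v3 + v4) / 4
    f_bar = m / L * (v1**2 - v2**2 + v3**2 - v4**2) / 4
    return v_bar, f_bar

v_bar, f_bar = mean_speed_and_force(t1, t2, t3, t4, l, m, L)
```

The points can then be drawn as dots with, for instance, `plt.plot(v_bar, f_bar, '.')`.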

### Least squares

Perform the linear regression $\bar f=\beta\bar v+c$ for the scalars $\beta$ and $c$. Use the least-squares formula valid in arbitrary dimension : you should first create a matrix $X$ by stacking a column of 1s with the column of $\bar v$ (check that your resulting matrix is correct) ; the fit is then $\bar f=Xw$ with
$$X=\pmatrix{1 & \bar v_1 \\ \vdots & \vdots \\ 1 & \bar v_n} \quad\mathrm{and}\quad w=\pmatrix{c \\ \beta}\ .$$
We write $\hat w=(\hat c, \hat\beta)^T$ the computed estimator.
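The construction of $X$ and the computation of $\hat w$ could look like the sketch below. It runs on synthetic data (an assumption, so that the snippet is self-contained) and solves the normal equations $X^TX\,\hat w=X^T\bar f$ with `np.linalg.solve` rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical synthetic data standing in for (v_bar, f_bar).
v_bar = np.linspace(0.2, 1.0, 20)
f_bar = 0.05 * v_bar + 0.001 + rng.normal(0.0, 1e-4, size=v_bar.size)

# Design matrix: a column of ones (intercept c) next to the column of v_bar.
X = np.column_stack([np.ones_like(v_bar), v_bar])

# Solve the normal equations (X^T X) w = X^T f.
w_hat = np.linalg.solve(X.T @ X, X.T @ f_bar)
c_hat, beta_hat = w_hat
f_hat = X @ w_hat  # predicted forces
```

`np.linalg.lstsq(X, f_bar, rcond=None)[0]` should return the same coefficients and can serve as a cross-check.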

In [None]:
X = ...


Plot the predicted $\hat f=X\hat w=\hat\beta\bar v+\hat c$ vs $\bar v$ (as a continuous line) as well as the experimental datapoints (as dots).

Compute the error on the estimated $\hat\beta$ with the formula of the course
$$\hat\Delta^2=\frac{1}{n}\sum_\mu^n(\bar f_\mu-\hat f_\mu)^2$$
$$\mathrm{Var}(\hat w_i)=\hat\Delta^2[(X^TX)^{-1}]_{ii}$$
Print the resulting estimate with a nice formatting such that $\beta=\mathtt{x.xx±0.0y}$.
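A possible way to evaluate these two formulas and to format the result is sketched here. The data is synthetic and the function name `ols_with_errors` is hypothetical; the number of printed digits is a choice to adapt to the actual magnitude of $\hat\beta$.

```python
import numpy as np

def ols_with_errors(v_bar, f_bar):
    """Least-squares fit f = c + beta * v with the standard errors of (c, beta)."""
    X = np.column_stack([np.ones_like(v_bar), v_bar])
    XtX_inv = np.linalg.inv(X.T @ X)
    w_hat = XtX_inv @ X.T @ f_bar
    residuals = f_bar - X @ w_hat
    delta2_hat = np.mean(residuals**2)  # (1/n) * sum of squared residuals
    sigma_w = np.sqrt(delta2_hat * np.diag(XtX_inv))
    return w_hat, sigma_w

rng = np.random.default_rng(1)
# Hypothetical synthetic data standing in for (v_bar, f_bar).
v_bar = np.linspace(0.2, 1.0, 20)
f_bar = 0.05 * v_bar + rng.normal(0.0, 2e-3, size=v_bar.size)

w_hat, sigma_w = ols_with_errors(v_bar, f_bar)
report = f"beta = {w_hat[1]:.3f} ± {sigma_w[1]:.3f}"
print(report)
```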

### Error bars and weighted least squares

We have errors $\Delta_\mathtt{l}$, $\Delta_\mathtt{m}$, $\Delta_\mathtt{L}$ and $\Delta_t$ on the measurements `l`, `m`, `L` and $t_a$. They have been estimated to the following values.

In [None]:
delta_l = 0.001
delta_m = 0.001
delta_L = 0.001
delta_t = 0.001

We want an estimate of the error on each datapoint. Propagate the errors on $\bar v$ and $\bar f$ with the most generic formula to show that
$$\Delta_{\bar f} = \bar f\left(\frac{\Delta_\mathtt{m}}{\mathtt{m}}+\frac{\Delta_\mathtt{L}}{\mathtt{L}}\right)+2\bar f\left(\Delta_t\left(\frac{1}{t_1}+\frac{1}{t_2}+\frac{1}{t_3}+\frac{1}{t_4}\right)+4\frac{\Delta_\mathtt{l}}{\mathtt{l}}\right)\ .$$
Compute $\Delta_{\bar f}$ for each datapoint.
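Once the formula is established, evaluating it for every repetition is a direct elementwise computation. The sketch below transcribes the expression above term by term; the function name `delta_f_bar` is an assumption, and $\bar f$ is assumed positive so that the result can be used as an error bar.

```python
import numpy as np

def delta_f_bar(f_bar, t1, t2, t3, t4, l, m, L,
                delta_l, delta_m, delta_L, delta_t):
    """Evaluate the propagated error quoted above, elementwise over repetitions."""
    rel_mL = delta_m / m + delta_L / L                      # mass and distance terms
    rel_t = delta_t * (1 / t1 + 1 / t2 + 1 / t3 + 1 / t4)   # timing terms
    rel_l = 4 * delta_l / l                                 # glider length term
    return f_bar * rel_mL + 2 * f_bar * (rel_t + rel_l)

# Hypothetical values for two repetitions.
t = np.array([0.10, 0.20])
example = delta_f_bar(np.array([0.02, 0.01]), t, 1.02 * t, 1.04 * t, 1.06 * t,
                      0.091, 0.142, 0.79, 0.001, 0.001, 0.001, 0.001)
```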

We do not consider the errors on $\bar v$. Perform the linear regression $\bar f=\beta\bar v+c$ with weighted least squares as seen in the course (part 10.2 in the lecture notes). You can use `np.diag` to construct the diagonal matrix $\Omega$ (eq. 10.26) from the different $\Delta_{\bar f}$.
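A sketch of the weighted fit is given below. It assumes that $\Omega$ is the diagonal matrix of inverse variances $1/\Delta_{\bar f,\mu}^2$ (the lecture notes are the reference for eq. 10.26) and uses the usual weighted estimator $\hat w=(X^T\Omega X)^{-1}X^T\Omega\bar f$. The function name is hypothetical.

```python
import numpy as np

def weighted_least_squares(v_bar, f_bar, delta_f):
    """Weighted fit f = c + beta * v; returns (w_hat, standard errors)."""
    X = np.column_stack([np.ones_like(v_bar), v_bar])
    # Assumption: Omega holds the inverse variances 1 / delta_f**2 on its diagonal.
    O = np.diag(1.0 / delta_f**2)
    cov = np.linalg.inv(X.T @ O @ X)
    w_hat = cov @ X.T @ O @ f_bar
    return w_hat, np.sqrt(np.diag(cov))

# Noise-free hypothetical data: the fit should recover c = 1 and beta = 2.
v = np.linspace(0.0, 1.0, 10)
w_hat, sigma_w = weighted_least_squares(v, 1.0 + 2.0 * v, np.linspace(0.1, 0.5, 10))
```

Points with small error bars get large weights, so they pull the fitted line more strongly than imprecise points.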

In [None]:
X = ...
O = ...


Plot the regression. For the datapoints add errorbars on the dots with `plt.errorbar(..., yerr=...)`.

Compute the error on the estimated $\hat\beta$ with the formula of the course
$$\mathrm{Var}(\hat w_i)=[(X^T\Omega X)^{-1}]_{ii}\ .$$
Print the resulting estimation.

We can check that our result is consistent. Compute
$$\mathcal L(\hat w)=\frac{1}{n}\sum_\mu^n\frac{1}{\Delta_{\bar f, \mu}^2}(\bar f_\mu-\hat f_\mu)^2$$
and check that $\mathcal L(\hat w)\approx 1$.

Redo the previous step dividing or multiplying the estimated errors $\Delta_\mathtt{l}$, $\Delta_\mathtt{m}$, $\Delta_\mathtt{L}$ and $\Delta_t$ by a factor of five. What do you observe for $\mathcal L(\hat w)$ ?
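The consistency check can be packaged in a small function so that it is easy to rerun with rescaled error bars. This is a sketch on synthetic data, assuming weights $1/\Delta_{\bar f,\mu}^2$; here the error bars are rescaled directly, which mimics rescaling the four input errors since $\Delta_{\bar f}$ is linear in them.

```python
import numpy as np

def reduced_loss(v_bar, f_bar, delta_f):
    """Weighted fit followed by the normalised loss L(w_hat) defined above."""
    X = np.column_stack([np.ones_like(v_bar), v_bar])
    weights = 1.0 / delta_f**2
    XtO = X.T * weights  # same as X.T @ np.diag(weights)
    w_hat = np.linalg.solve(XtO @ X, XtO @ f_bar)
    return np.mean(weights * (f_bar - X @ w_hat) ** 2)

rng = np.random.default_rng(2)
v_bar = np.linspace(0.2, 1.0, 20)
delta_f = np.full(v_bar.size, 2e-3)  # hypothetical error bars
f_bar = 0.05 * v_bar + rng.normal(0.0, delta_f)

loss = reduced_loss(v_bar, f_bar, delta_f)
loss_large_errors = reduced_loss(v_bar, f_bar, 5 * delta_f)
loss_small_errors = reduced_loss(v_bar, f_bar, delta_f / 5)
```

Comparing the three values shows how the check reacts to over- or under-estimated error bars.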

## Part 2 : toward high dimension

We want to study how these estimators behave as the dimension, i.e. the number of parameters, grows. In this part we use an artificial dataset so that we can compare our estimates to a ground truth while varying $D$ and $N$. We will see that, even if we know exactly how the data was generated, a larger dimension $D$ requires a much larger number $N$ of datapoints for the estimators to be accurate.

### Setting
Data : we have a train set $X\in\mathbb R^{N\times D}$ and a test set $X'\in\mathbb R^{N'\times D}$. $N$ is the size of the train set and $N'$ the size of the test set. We take the components $X_{\mu i}, X_{\mu i}'$ of the data drawn according to independent centered Gaussians.

There is a teacher $w^*\in\mathbb R^D$ that outputs labels $y_\mu=X_\mu^Tw^*+\epsilon_\mu$ and $y_\mu'=X_\mu^{'T}w^*+\epsilon'_\mu$, where the $\epsilon_\mu, \epsilon_\mu' \sim\mathcal N(0,\Delta^2)$ are independent and account for noise or experimental errors. We take the components of $w^*$ drawn according to independent standard Gaussians. $w^*$ is the ground truth ; it is not observed. We can measure how close our empirical estimator is to it.

We perform linear regression : our estimator of the coefficients is $\hat w=(X^TX)^{-1}X^Ty$. The predicted labels are then $\hat y_\mu=X_\mu^T\hat w$ and $\hat y_\mu'=X_\mu^{'T}\hat w$.

We can study different errors : the train error $E_\mathrm{train} = \frac{1}{N}\sum_\mu^N(y_\mu-\hat y_\mu)^2$ (how well data is adjusted), the test error $E_\mathrm{test} = \frac{1}{N'}\sum_\mu^{N'}(y_\mu'-\hat y_\mu')^2$ (how well unseen data will be fitted), as well as the true error on the coefficients (the mean square error) $\mathrm{MSE} = \frac{1}{D}\sum_i^D(w^*_i-\hat w_i)^2$ (how well the hidden parameters are reconstructed).

$E_\mathrm{train}$ is an estimator of $E_\mathrm{test}$, $\hat\Delta^2=\frac{1}{N}\sum_\mu^N(y_\mu-\hat y_\mu)^2=E_\mathrm{train}$ an estimator of $\Delta^2$ and $\widehat{\mathrm{MSE}}=\hat\Delta^2\frac{1}{D}\sum_i^D(X^TX)^{-1}_{ii}$ an estimator of $\mathrm{MSE}$.
We want to study how good they are, depending on the number of datapoints $N$ and the dimension $D$ of the problem.
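To make these definitions concrete, here is a sketch that draws one dataset from the teacher model and evaluates the four quantities. The function name `teacher_experiment` is hypothetical, and the feature variance $1/D$ is taken from the exercise statement further down.

```python
import numpy as np

def teacher_experiment(N, D, Ntest, Delta, rng):
    """Draw one teacher dataset and return (E_train, E_test, MSE, MSE_hat)."""
    # Assumption: features have variance 1/D, as specified in the exercise below.
    X = rng.normal(0.0, 1.0 / np.sqrt(D), size=(N, D))
    Xtest = rng.normal(0.0, 1.0 / np.sqrt(D), size=(Ntest, D))
    w_star = rng.normal(size=D)
    y = X @ w_star + rng.normal(0.0, Delta, size=N)
    ytest = Xtest @ w_star + rng.normal(0.0, Delta, size=Ntest)

    XtX_inv = np.linalg.inv(X.T @ X)
    w_hat = XtX_inv @ X.T @ y
    E_train = np.mean((y - X @ w_hat) ** 2)
    E_test = np.mean((ytest - Xtest @ w_hat) ** 2)
    mse = np.mean((w_star - w_hat) ** 2)
    mse_hat = E_train * np.mean(np.diag(XtX_inv))
    return E_train, E_test, mse, mse_hat

errors = teacher_experiment(N=2000, D=5, Ntest=10**4, Delta=1.5,
                            rng=np.random.default_rng(0))
```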

We will always take $N'$ fixed and large so that the empirical $E_\mathrm{test}$ concentrates around the true test error.

In part 1 we had $\hat w=(c,\beta)^T$, $D=2$ and $N\approx 20$.

In [None]:
Ntest = int(10**4)
Delta = 1.5

### low-dimensional : D=5

In the following cell, for $N$ going from 10 to $N_\mathrm{max}=5000$ :
- generate the data $X$ and the train labels $y_\mu$ according to the teacher model ;
- compute the least-squares estimator $\hat w$ on $X$ and $y_\mu$ and the predicted train and test labels $\hat y_\mu$ and $\hat y_\mu'$ ;
- compute the train and test errors $E_\mathrm{train}$ and $E_\mathrm{test}$ as well as $\widehat{\mathrm{MSE}}$ and $\mathrm{MSE}$. Store these values in lists.

The $X_{\mu,i}$ and the $X'_{\mu,i}$ are drawn according to a centered normal of variance $1/D$. The $w_i^*$ are drawn according to a standard normal. $w^*$ and $X'$ are fixed at the beginning. $X$ should not vary in the sense that going from $N$ to $N+1$ only one datapoint $x\in\mathbb R^D$ is added to $X$. We study the behaviour of the estimators while increasing the number of datapoints we use to compute them.

(hint : if you prefer, you can generate the $N_\mathrm{max}$ datapoints `Xtot` first and, in the loop, select the right ones with `X = Xtot[:N,:]`. To generate the different $N$ you can use `np.geomspace(..., dtype=int)` ; take about a hundred values of $N$.)
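The loop described in the hint could be organised as in the following sketch. All variable names other than `Xtot` are assumptions; `np.unique` is used because `np.geomspace` with `dtype=int` can repeat small values.

```python
import numpy as np

rng = np.random.default_rng(0)
D, Nmax, Ntest, Delta = 5, 5000, 10**4, 1.5

# Teacher, test set and full train set are drawn once, before the loop.
w_star = rng.normal(size=D)
Xtest = rng.normal(0.0, 1.0 / np.sqrt(D), size=(Ntest, D))
ytest = Xtest @ w_star + rng.normal(0.0, Delta, size=Ntest)
Xtot = rng.normal(0.0, 1.0 / np.sqrt(D), size=(Nmax, D))
ytot = Xtot @ w_star + rng.normal(0.0, Delta, size=Nmax)

Ns = np.unique(np.geomspace(10, Nmax, 100, dtype=int))
E_train, E_test, mse, mse_hat = [], [], [], []
for N in Ns:
    X, y = Xtot[:N], ytot[:N]  # growing N only appends datapoints
    XtX_inv = np.linalg.inv(X.T @ X)
    w_hat = XtX_inv @ X.T @ y
    E_train.append(np.mean((y - X @ w_hat) ** 2))
    E_test.append(np.mean((ytest - Xtest @ w_hat) ** 2))
    mse.append(np.mean((w_star - w_hat) ** 2))
    mse_hat.append(E_train[-1] * np.mean(np.diag(XtX_inv)))
```

Drawing the labels `ytot` once, together with `Xtot`, keeps the noise of each datapoint fixed as $N$ grows, so successive values of $N$ really reuse the same data.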

In [None]:
D = 5
Nmax = 5000

Xtest = np.random.normal(...
w = np.random.normal(...

In a figure plot the curves $E_\mathrm{train}$ and $E_\mathrm{test}$ vs $N$. Add the line $E=\Delta^2$.

In another figure plot the curves $\widehat{\mathrm{MSE}}$ and $\mathrm{MSE}$ vs $N$. Add the curve $N\mapsto 1/N$.

You should use a logarithmic axis for $N$ and the MSEs.

Comment on these graphs (you can re-run your code once or twice) :
- at small $N$, does the train error overestimate or underestimate the actual error ?
- is $\hat\Delta^2$ a consistent estimator ?
- is $\widehat{\mathrm{MSE}}$ a consistent estimator ?
- what is its convergence rate ?

### high-dimensional : D=200

Redo the previous questions for $D=200$. Take $N$ going from 250 to 10000 (use only fifty values).

In [None]:
D = 200
Nmax = 10000

Xtest = np.random.normal(...
w = np.random.normal(...

Comments :
- observe the convergence of the estimators ;
- is $N=1000$ still large enough ?

### Scaling limit

For larger $D$ we need a larger $N$ for the estimators to be close to the quantities they estimate. We wonder how these two numbers are related : how the limit $N\to\infty$ and $D\to\infty$ should be taken, and how large $N$ must be with respect to $D$ for the estimators to converge. We suppose there is a scaling relation $N\sim D^\nu$ with $\nu$ an exponent to be determined. This means that if one multiplies $D$ by $\alpha$, one has to take $N\alpha^\nu$ samples for the estimators to be as good.

Draw the previous curves (you may focus on the MSEs) for two different $D$ (take $D=100$ and $D=\alpha\times100$ with $\alpha=2$) and find how to rescale $N$ with respect to $\alpha$ so that the curves collapse (i.e. draw MSE vs $N/\alpha^\nu$). What is the scaling exponent $\nu$ between $N$ and $D$ ?
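One way to set up the collapse test is sketched below: compute the MSE curve for both dimensions on the same grid of $N$, then rescale the abscissa of the larger-$D$ curve with a few trial exponents. The helper `mse_curve`, the shortened grid and the list of candidate exponents are all assumptions made for illustration.

```python
import numpy as np

def mse_curve(D, Ns, Delta, rng):
    """True MSE on the coefficients for each N in Ns, at dimension D."""
    Nmax = int(Ns.max())
    w_star = rng.normal(size=D)
    Xtot = rng.normal(0.0, 1.0 / np.sqrt(D), size=(Nmax, D))
    ytot = Xtot @ w_star + rng.normal(0.0, Delta, size=Nmax)
    out = []
    for N in Ns:
        w_hat = np.linalg.lstsq(Xtot[:N], ytot[:N], rcond=None)[0]
        out.append(np.mean((w_star - w_hat) ** 2))
    return np.array(out)

rng = np.random.default_rng(0)
alpha, Delta = 2, 1.5
Ns = np.arange(250, 2000, 250)  # shortened grid to keep the sketch fast
mse_D = mse_curve(100, Ns, Delta, rng)
mse_alphaD = mse_curve(alpha * 100, Ns, Delta, rng)

# Hypothetical trial exponents; plot mse_alphaD against each rescaled axis.
rescaled = {nu: Ns / alpha**nu for nu in (0.5, 1.0, 2.0)}
```

Plotting `mse_D` against `Ns` and `mse_alphaD` against `rescaled[nu]` on logarithmic axes, for each trial `nu`, shows which exponent superimposes the two curves.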

This is the thermodynamic limit of this model !

In [None]:
alpha = 2

Nmax = 10000
Ns = np.arange(250, Nmax, 50)

...

plt.plot(Ns, ...
plt.plot(...Ns..., ...