# **<div align="center">Optimisation stochastique : TP1 - Régression linéaire et gradient stochastique</div>**

## **3. Questions préliminaires**

### **Q1.**

On a :

$$
h : \underbrace{\mathcal{X}}_{\mathbb{R}^d} \rightarrow \underbrace{\mathcal{Y}}_{\mathbb{R}}
$$


Fonction perte :

$$
l_w :   \begin{cases}
            \mathcal{X} \times \mathcal{Y} \to \mathbb{R} \\
            (x,y) \mapsto (\langle w,y \rangle - y)^2
        \end{cases}
$$

Famille de prédicteurs :

$$
\mathcal{H} = \displaystyle\left\{x \mapsto \langle w,x \rangle | w \in \mathbb{R}^d\right\}
$$

### **Q2.**

Loi conditionnelle : 

$$
\begin{align*}
\mathbb{P}\displaystyle\left( Y = y | X = x \right) & = \mathbb{P}\displaystyle\left( \langle \theta , X \rangle + B = y | X = x \right) \\
& = \mathbb{P}\displaystyle\left( \langle \theta , x \rangle + B = y \right) \\
& = \mathbb{P}\displaystyle\left( B = y - \langle \theta , x \rangle \right)
\end{align*}
$$


D'où :

$$
Y \, |_{X = x} \sim \mathcal{N}(\langle \theta, x \rangle , \sigma^2)
$$

### **Q3.**

La définition du risque moyen est la suivante :

$$
\begin{align*}
E(w)    & = \mathbb{E}_{(X,Y)}\displaystyle\left[ l(X,Y) \right] \\
        & = \mathbb{E}\displaystyle\left[ \cfrac{1}{2} \left( \langle w , X \rangle - Y \right)^2 \right] \\
        & = \cfrac{1}{2} ~ \mathbb{E}\displaystyle\left[\displaystyle\left( \langle w , X \rangle - \langle \theta , X \rangle - B \right)^2 \right] \\
        & = \cfrac{1}{2} ~ \mathbb{E}\displaystyle\left[\displaystyle\left( \langle w - \theta , X \rangle - B \right)^2 \right] \\
        & = \cfrac{1}{2} ~ \mathbb{E}\displaystyle\left[\displaystyle\ \langle w - \theta , X \rangle^2\right]
        - \underbrace{\mathbb{E}\displaystyle\left[\displaystyle\ \langle w - \theta , X \rangle B \right]}_{\mathbb{E}\displaystyle\left[ \langle w - \theta , X \rangle \right] \times \underbrace{\mathbb{E}\displaystyle\left[ B \right]}_{= 0}}
        + \cfrac{1}{2} ~ \underbrace{\mathbb{E}\displaystyle\left[ B^2 \right]}_{\sigma^2} \\
        & = \cfrac{1}{2} ~ \mathbb{E}\displaystyle\left[\displaystyle\ \langle w - \theta , X \rangle^2\right] + \cfrac{1}{2} \sigma^2 \\
\end{align*}
$$

Si X a une densité $p_{_X}$, on a alors :

$$
\mathbb{E}(w) = \cfrac{1}{2} ~ \displaystyle \int_{\mathbb{R}^d} \langle w - \theta , X \rangle p_{_X}(x) dx + \cfrac{\sigma^2}{2}
$$

### **Q4.**

On a :

$$
\mathbb{E}\displaystyle\left[\displaystyle\ \langle w - \theta , X \rangle^2\right] \geq 0
$$

Un prédicteur optimal est : $w^{\star} = 0$. On a alors $\mathbb{E}(w^{\star}) = \cfrac{\sigma^2}{2} \,$. 

Il n'est pas unique si $X$ est nul presque partout ($ \, \mathbb{P}(X = 0) = 1 \,$).

### **Q5.**


## **6. Travail pratique**

In [2]:
import torch

import random as rd
import numpy as np

In [3]:
print(torch.cuda.is_available())

True


In [16]:
def generate_data(d,n,theta,sigma):
    X = np.random.randn(d,n)
    Y = np.matmul(np.transpose(X),theta) + sigma*np.random.randn(n)
    return X,Y

array([-3.76570739,  4.67280347,  3.86167722,  8.58920485, -3.51323898,
        3.93095327, -3.36838932, -2.544462  , -0.67645544,  5.24480346,
       -2.70745982, -4.60617125, -5.38578864,  5.75616354, -2.0412249 ,
        2.44826593,  3.10302158,  9.99477231,  1.76763518,  1.16062022,
       -4.83016094, -3.52965873,  1.37322509, -0.23446181,  0.39930644,
       -4.8875014 , -8.45085066, -2.1957016 ,  8.14505118,  3.40589174,
       -2.98670361, -1.01431193,  7.17928449,  1.92985714, -2.13237198,
       -3.07821022,  2.0917825 ,  3.82285259, -5.02217396,  2.47162524,
        5.57693628, -9.69453277, -2.44276644, -0.56850356, -1.40422365,
       -1.25214409, -1.19003494, -2.36007237,  0.58950022, -1.99268879,
        1.22979164,  1.30897483,  2.43995968, -0.58695669,  3.41617312,
       -2.44874101,  0.56037076, -1.29464579,  2.30387324, -3.33431032,
        1.04094271,  3.79914914, -0.32639537,  1.14690949, -2.70263203,
       -4.03628739,  2.11439167,  5.80281355,  2.27872578,  2.71

In [51]:
def E(w,theta,sigma):
    return 0.5*np.linalg.norm(w - theta) + (sigma**2)/2 # pour une matrice Espérance[X * transpos(X)] qui vaut I_n

#E([1,1])

In [32]:
def En(w, X, Y):
    return (0.5/len(X)) * np.linalg.norm(np.matmul(np.transpose(X),w) - Y**2) # TODO : vérifier qu'on a pris le bon np.size (len)

In [65]:
def grad_En(w, X, Y):
    return (1/X.shape[1]) * np.matmul(X, np.matmul(np.transpose(X),w) - Y)

In [66]:
def grad_sto_En(w, X, Y, n_batch):
    ind_sto = np.random.randint(low=0, high=X.shape[1], size = n_batch)
    X_sto = X[:, ind_sto]
    Y_sto = Y[ind_sto]
    return grad_En(w, X_sto, Y_sto)

In [67]:
def eval_lipsch_grad_En(X):
    return np.norm((1/len(X)) * X * np.transpose(X)) # TODO : à revoir ici

In [68]:
# Constantes

d = 5
n = 100

theta = np.ones(d)
sigma = 3.0

In [56]:
# SCRIPT de test

# generate
X_test, Y_test = generate_data(d,n,theta,sigma)
print("X_test")
print(X_test)
print("Y_test")
print(Y_test)

print("\n")
print("Taille de X : \n")
print(X_test.shape)
print("\nTaille de Y : \n")
print(Y_test.shape)

w_test = np.random.randn(d)

X_test
[[-1.14346777e+00 -6.90706627e-02 -2.11024004e+00 -1.16495371e+00
  -1.89661964e+00 -1.52254060e+00  3.17543629e-01 -1.18590527e+00
  -1.42063127e-01  1.64331108e+00 -9.12533853e-01  4.49304439e-01
   1.59949586e+00 -2.20518550e-01  1.55956437e-01  9.11126796e-01
  -1.06781158e+00 -1.68469832e+00 -5.37059419e-01  8.37589543e-01
  -6.70764950e-01  4.59835205e-01  1.30213817e+00 -2.12093373e+00
   1.51436199e+00 -5.63448548e-01 -2.04175400e+00 -9.71668758e-01
  -7.82844900e-02  5.56262850e-01  5.54539840e-03 -5.86866697e-01
   1.23038380e-01  4.82545633e-02  4.20838933e-01  1.35675467e+00
   1.55457751e+00 -4.72673440e-01 -4.43392959e-01  8.81811611e-01
   5.08170345e-01  3.68657544e-01  9.36561746e-02 -4.60447648e-02
  -1.10192058e-02  1.01806851e+00  9.49066711e-02  9.34536567e-01
   2.02584079e+00 -2.18358728e-01  2.24627339e-01 -1.07484255e+00
   1.29635090e+00 -8.05264651e-01  1.41887890e+00 -1.06432962e+00
  -1.61222563e+00 -2.28776589e-01 -1.01210773e+00  9.60989020e-01
  -

In [62]:
X_test.shape

(5, 100)

In [70]:
# E
E_test = E(w_test, theta, sigma)
print("E(w_test) \npour w_test = " + str(w) + " : \nE(w_test) = " + str(E_test))
print("\n")

# En
En_test = En(w_test,X_test,Y_test)
print("En(w_test, X_test, Y_test) \npour w_test = " + str(w) + " : \nEn(w_test) = " + str(En_test))
print("\n")

# grad_En
grad_En_test = grad_En(w_test,X_test,Y_test)
print("grad_En(w_test, X_test, Y_test) \npour w_test = " + str(w) + " : \ngrad_En(w_test) = " + str(grad_En_test))
print("\n")

# grad_sto_En
n_batch_test = int(len(X_test)/2)
grad_sto_En_test = grad_sto_En(w_test,X_test,Y_test, n_batch_test)
print("grad_sto_En(w_test, X_test, Y_test, n_batch_test) \npour w_test = " + str(w) + " \net n_batch_test = " + str(n_batch_test) + " : \ngrad_En(w_test) = " + str(grad_sto_En_test))
print("\n")

E(w_test) 
pour w_test = [-1.81838282 -0.0976258   0.38685562 -0.30024755  1.18349656] : 
E(w_test) = 6.437600556809734


En(w_test, X_test, Y_test) 
pour w_test = [-1.81838282 -0.0976258   0.38685562 -0.30024755  1.18349656] : 
En(w_test) = 28.8911345136127


grad_En(w_test, X_test, Y_test) 
pour w_test = [-1.81838282 -0.0976258   0.38685562 -0.30024755  1.18349656] : 
grad_En(w_test) = [-0.86688747 -3.8094186  -1.25336252 -2.09331015  0.27333147]


grad_sto_En(w_test, X_test, Y_test, n_batch_test) 
pour w_test = [-1.81838282 -0.0976258   0.38685562 -0.30024755  1.18349656] 
et n_batch_test = 2 : 
grad_En(w_test) = [-0.66443261 -0.18405129 -2.71122961 -0.95434784 -3.00378472]


