# ***TP3 - Clovis Lechien***

## **Sommaire**
1. [Utils](#Utils)
2. [SGD](#SGD)
3. [RMSProp](#RMSProp)
4. [Adagrad](#Adagrad)
5. [Adam](#Adam)
6. [AdamW](#AdamW)
7. [Evaluations](#Evaluations)
    1. Evaluation des Optimiseurs
    2. Réseau de Neurones
10. [Schedulers](#Schedulers)

## **Mise en contexte**
Pour chaque implémentation d'optimiseur et de lr scheduler, j'ai pris la décision d'implémenter 2 versions *"from-scratch"* :
1. Utilisation de torch : j'ai utilisé l'autograd de torch pour la première version
2. Vanilla : j'ai repris la classe Tensor en backward differentiation ainsi que les fonctions usuelles associées pour implémenter cette version *vraiment* ***from-scratch***.

Ces versions seront utilisées tout au long du notebook pour comparer les résultats et vérifier que mes implémentations sont bonnes.

J'ai également mis en place des tests intermédiaires orientés uniquement sur l'implémentation torch pour chaque optimiseur afin de vérifier leur bon fonctionnement avant la partie [Evaluations](#Evaluations).

# ***Imports des librairies nécessaires***

In [138]:
import numpy as np
import torch                         # Utilisée pour l'implémentation des Optimizers torch "from-scratch"
import torch.nn as nn                # Utilisée pour l'implémentation des Optimizers torch "from-scratch"
from torch.optim import Optimizer    # Utilisée pour l'implémentation des Optimizers torch "from-scratch"

In [3]:
# Génération du jeu de données linéaire
np.random.seed(0)
n_samples = 100
x_linear = np.linspace(-10, 10, n_samples)
y_linear = 3 * x_linear + 5 + np.random.normal(0, 2, n_samples)

 # Génération du jeu de données non linéaire
y_nonlinear = 0.5 * x_linear **2 - 4 * x_linear + np.random.normal(0 ,5 ,n_samples)

# ***Custom Tensor class***

In [4]:
class Tensor:

    """ stores a single scalar Tensor and its gradient """

    def __init__(self, data, _children=(), _op=''):

        self.data = data
        self.grad = 0.0
        # internal variables used for autograd graph construction
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op # the op that produced this node, for graphviz / debugging / etc

    def __add__(self, other):
        other = other if isinstance(other, Tensor) else Tensor(other)

        out = Tensor(self.data + other.data, (self, other), '+')

        def _backward():
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward

        out._prev = set([self, other])
        return out

    def __mul__(self, other):

        other = other if isinstance(other, Tensor) else Tensor(other)

        out = Tensor(self.data * other.data, [self, other], '*')

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward

        return out

    def __pow__(self, other):

        assert isinstance(other, (int, float)), "only supporting int/float powers for now"

        out = Tensor(self.data**other, (self,), f'**{other}')

        def _backward():
            self.grad += (other * self.data**(other-1)) * out.grad

        out._backward = _backward

        return out

    def relu(self):
        # FIXME: implement relu
        pass

    def build_topo(self, visited=None, topo=None):
        if self not in visited:
            visited.add(self)
            for child in self._prev:
                child.build_topo(visited=visited, topo=topo)
            topo.append(self)
        return topo

    def backward(self):
        # topological order all of the children in the graph
        topo = []
        visited = set()
        topo = self.build_topo(topo=topo, visited=visited)

        # go one variable at a time and apply the chain rule to get its gradient
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

    def __neg__(self): # -self
        return self * -1

    def __radd__(self, other): # other + self
        return self + other

    def __sub__(self, other): # self - other
        return self + (-other)

    def __rsub__(self, other): # other - self
        return other + (-self)

    def __rmul__(self, other): # other * self
        return self * other

    def __truediv__(self, other): # self / other
        return self * other**-1

    def __rtruediv__(self, other): # other / self
        return other * self**-1

    def __repr__(self):
        return f"Tensor(data={self.data}, grad={self.grad})"

## ***Custom operations***

In [5]:
def log_d(dual_number: Tensor):
    out = Tensor(np.log(dual_number.data), (dual_number,), 'log')

    def _backward():
        dual_number.grad += (1 / dual_number.data) * out.grad

    out._backward = _backward
    return out

def exp_d(dual_number: Tensor):
    out = Tensor(np.exp(dual_number.data), (dual_number,), 'exp')

    def _backward():
        dual_number.grad += np.exp(dual_number.data) * out.grad

    out._backward = _backward
    return out

def sin_d(dual_number: Tensor):
    out = Tensor(np.sin(dual_number.data), (dual_number,), 'sin')

    def _backward():
        dual_number.grad += np.cos(dual_number.data) * out.grad

    out._backward = _backward
    return out

def cos_d(dual_number: Tensor):
    out = Tensor(np.cos(dual_number.data), (dual_number,), 'cos')

    def _backward():
        dual_number.grad += -np.sin(dual_number.data) * out.grad

    out._backward = _backward
    return out

def sigmoid_d(dual_number: Tensor):
    sig = 1 / (1 + np.exp(-dual_number.data))
    out = Tensor(sig, (dual_number,), 'sigmoid')

    def _backward():
        dual_number.grad += sig * (1 - sig) * out.grad

    out._backward = _backward
    return out

def tanh_d(dual_number: Tensor):
    tanh = np.tanh(dual_number.data)
    out = Tensor(tanh, (dual_number,), 'tanh')

    def _backward():
        dual_number.grad += (1 - tanh**2) * out.grad

    out._backward = _backward
    return out

def tan_d(dual_number: Tensor):
    out = Tensor(np.tan(dual_number.data), (dual_number,), 'tan')

    def _backward():
        dual_number.grad += (1 / np.cos(dual_number.data)**2) * out.grad

    out._backward = _backward
    return out

def sqrt_d(dual_number: Tensor):
    out = Tensor(np.sqrt(dual_number.data), (dual_number,), 'sqrt')

    def _backward():
        dual_number.grad += (0.5 / np.sqrt(dual_number.data)) * out.grad

    out._backward = _backward
    return out


def pow_d(dual_number: Tensor, power: int):
    out = Tensor(dual_number.data**power, (dual_number,), f'pow{power}')

    def _backward():
        dual_number.grad += (power * dual_number.data**(power-1)) * out.grad

    out._backward = _backward
    return out

def softmax_d(dual_number: Tensor):
    e = np.exp(dual_number.data - np.max(dual_number.data))
    out = Tensor(e / np.sum(e), (dual_number,), 'softmax')

    def _backward():
        for i in range(len(dual_number.data)):
            for j in range(len(dual_number.data)):
                if i == j:
                    dual_number.grad[i] += out.data[i] * (1 - out.data[i]) * out.grad[i]
                else:
                    dual_number.grad[i] += -out.data[i] * out.data[j] * out.grad[j]

    out._backward = _backward
    return out

# ***Utils<a name="Utils"></a>***

```python
def optimizer_testing_loop(parameters : dict[str,])
```
Cette fonction est responsable des tests intermédiaires mentionnés en haut de page.
Elle mimique le comportement d'une boucle d'entraînement.
Elle est utilisée uniquement pour tester l'implémentation des optimiseurs implémentés à l'aide de l'autograd de torch.

Exemple d'utilisation :
```python
model = nn.Linear(1, 1)

linear_parameters = {
    'model': model,
    'criterion': nn.MSELoss(),
    'optimizer': SGD_torch(model.parameters(), learning_rate=0.015),
    'x_tensor': torch.from_numpy(x_linear).float().view(-1, 1),
    'y_tensor': torch.from_numpy(y_nonlinear).float().view(-1, 1),
    'epochs': 100
}

optimizer_testing_loop(linear_parameters)
```

---

```python
def check_diffs(a : list[float], b : list[float], tol : float = 1e-4)
```
Cette fonction sert à comparer les deux listes passées en paramètres et de vérifier que la différence absolue entre deux valeurs respectives est inférieure au seuil précisé.

Exemple d'utilisation :
```python
check_diffs(losses_A, losses_B, tol=1e-4)
```

In [110]:
def optimizer_testing_loop(parameters : dict[str,]):
    model = parameters['model']

    criterion = parameters['criterion']
    optimizer = parameters['optimizer']

    x_tensor = parameters['x_tensor']
    y_tensor = parameters['y_tensor']

    epochs = parameters['epochs']
    for epoch in range(epochs):
        optimizer.zero_grad()
        predictions = model(x_tensor)
        loss = criterion(predictions, y_tensor)
        loss.backward()
        optimizer.step()

        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch + 1}, Loss: {loss.item()}")

    for name, param in model.named_parameters():
        print(f"{name}: {param.data}")


def check_diffs(a : list[float], b : list[float], tol : float = 1e-4):
    res = np.allclose(a, b, atol=tol)
    if res:
        print(f"All elements between\n{a}\nand\n{b}\nare close within a tolerance of {tol}")
    else:
        print("Test failed")

# ***SGD<a name="SGD"></a>***

## **Implementation de SGD**

```python
class SGD_torch(Optimizer)
```
TODO: expliquer l'implémentation

Exemple d'utilisation :
```python
sgd = SGD_torch(model.parameters(), learning_rate=0.015)
```

---

```python
class SGD_custom
```
TODO: expliquer l'implémentation

Exemple d'utilisation :
```python
A = Tensor(1.)
B = Tensor(69.)
C = Tensor(420.)
sgd = SGD_custom([A, B, C], learning_rate=0.015)
```

In [7]:
class SGD_torch(Optimizer):
    def __init__(self, params, learning_rate=0.015):
        hyperparams = {'lr': learning_rate}
        super().__init__(params=params, defaults=hyperparams)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr = group['lr']

            for theta_t in group['params']:
                if theta_t.grad is None:
                    continue

                theta_t -= lr * theta_t.grad


class SGD_custom:
    def __init__(self, params, learning_rate=0.015):
        self.params = params
        self.learning_rate = learning_rate

    def step(self):
        for param in self.params:
            if param.grad is not None:
                param.data -= self.learning_rate * param.grad

    def zero_grad(self):
        for param in self.params:
            param.grad = 0.0

## **Test de SGD_torch**

In [14]:
model = nn.Linear(1, 1)
linear_parameters = {
    'model': model,
    'criterion': nn.MSELoss(),
    'optimizer': SGD_torch(model.parameters(), learning_rate=0.015),
    'x_tensor': torch.from_numpy(x_linear).float().view(-1, 1),
    'y_tensor': torch.from_numpy(y_linear).float().view(-1, 1),
    'epochs': 100
}

optimizer_testing_loop(linear_parameters)

Epoch 10, Loss: 16.184354782104492
Epoch 20, Loss: 10.640948295593262
Epoch 30, Loss: 7.626477241516113
Epoch 40, Loss: 5.987224102020264
Epoch 50, Loss: 5.095807075500488
Epoch 60, Loss: 4.611059188842773
Epoch 70, Loss: 4.347457408905029
Epoch 80, Loss: 4.204110622406006
Epoch 90, Loss: 4.12615966796875
Epoch 100, Loss: 4.083771228790283
weight: tensor([[2.9703]])
bias: tensor([4.9016])


In [24]:
model = nn.Linear(1, 1)
linear_parameters = {
    'model': model,
    'criterion': nn.MSELoss(),
    'optimizer': SGD_torch(model.parameters(), learning_rate=0.015),
    'x_tensor': torch.from_numpy(x_linear).float().view(-1, 1),
    'y_tensor': torch.from_numpy(y_nonlinear).float().view(-1, 1),
    'epochs': 100
}

optimizer_testing_loop(linear_parameters)

Epoch 10, Loss: 467.55975341796875
Epoch 20, Loss: 380.6451416015625
Epoch 30, Loss: 333.38153076171875
Epoch 40, Loss: 307.6798095703125
Epoch 50, Loss: 293.703369140625
Epoch 60, Loss: 286.1030578613281
Epoch 70, Loss: 281.97003173828125
Epoch 80, Loss: 279.7225341796875
Epoch 90, Loss: 278.5003662109375
Epoch 100, Loss: 277.83575439453125
weight: tensor([[-4.1445]])
bias: tensor([16.5501])


# ***RMSProp<a name="RMSProp"></a>***

## **Implementation de RMSProp**

```python
class RMSProp_torch(Optimizer)
```
TODO: expliquer l'implémentation

Exemple d'utilisation :
```python
rms = RMSProp_torch(model.parameters(), learning_rate=0.05, decay=0.5)
```

---

```python
class RMSProp_custom
```
TODO: expliquer l'implémentation

Exemple d'utilisation :
```python
A = Tensor(1.)
B = Tensor(69.)
C = Tensor(420.)
rms = RMSProp_custom([A, B, C], learning_rate=0.05, decay=0.5)
```

In [53]:
class RMSProp_torch(Optimizer):
    def __init__(self, params, learning_rate=0.05, decay=0.5):
        hyperparams = {'lr': learning_rate, 'decay': decay}
        super().__init__(params=params, defaults=hyperparams)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            decay = group['decay']
            lr = group['lr']

            for theta_t in group['params']:
                if theta_t.grad is None:
                    continue

                state = self.state[theta_t]
                if 'square_avg' not in state:
                    state['square_avg'] = torch.zeros_like(theta_t)

                square_avg = state['square_avg']
                square_avg = decay * square_avg + (1 - decay) * (theta_t.grad ** 2)
                state['square_avg'] = square_avg

                theta_t -= lr * theta_t.grad / square_avg.sqrt()


class RMSProp_custom:
    def __init__(self, params, learning_rate=0.05, decay=0.5):
        self.params = params
        self.learning_rate = learning_rate
        self.decay = decay
        self.state = {param: {'square_avg': Tensor(0.0)} for param in params}

    def step(self):
        for theta_t in self.params:
            if theta_t.grad is None:
                continue

            state = self.state[theta_t]

            square_avg = state['square_avg']
            square_avg.data = self.decay * square_avg.data + (1 - self.decay) * (theta_t.grad ** 2)
            state['square_avg'] = square_avg

            theta_t.data -= self.learning_rate * theta_t.grad / np.sqrt(square_avg.data)

    def zero_grad(self):
        for param in self.params:
            param.grad = 0.0

## **Test de RMSProp_torch**

In [54]:
model = nn.Linear(1, 1)
linear_parameters = {
    'model': model,
    'criterion': nn.MSELoss(),
    'optimizer': RMSProp_torch(model.parameters(), learning_rate=0.05, decay=0.5),
    'x_tensor': torch.from_numpy(x_linear).float().view(-1, 1),
    'y_tensor': torch.from_numpy(y_linear).float().view(-1, 1),
    'epochs': 100
}

optimizer_testing_loop(linear_parameters)

Epoch 10, Loss: 180.6050567626953
Epoch 20, Loss: 112.90521240234375
Epoch 30, Loss: 62.201744079589844
Epoch 40, Loss: 28.272367477416992
Epoch 50, Loss: 10.602396011352539
Epoch 60, Loss: 6.3285231590271
Epoch 70, Loss: 5.107961654663086
Epoch 80, Loss: 4.370968341827393
Epoch 90, Loss: 4.07150936126709
Epoch 100, Loss: 4.054510593414307
weight: tensor([[2.9453]])
bias: tensor([5.0998])


In [55]:
model = nn.Linear(1, 1)
linear_parameters = {
    'model': model,
    'criterion': nn.MSELoss(),
    'optimizer': RMSProp_torch(model.parameters(), learning_rate=0.05, decay=0.5),
    'x_tensor': torch.from_numpy(x_linear).float().view(-1, 1),
    'y_tensor': torch.from_numpy(y_nonlinear).float().view(-1, 1),
    'epochs': 100
}

optimizer_testing_loop(linear_parameters)

Epoch 10, Loss: 1101.2764892578125
Epoch 20, Loss: 959.2235717773438
Epoch 30, Loss: 834.4452514648438
Epoch 40, Loss: 726.88427734375
Epoch 50, Loss: 636.4846801757812
Epoch 60, Loss: 563.16259765625
Epoch 70, Loss: 506.7803955078125
Epoch 80, Loss: 467.0693664550781
Epoch 90, Loss: 443.2617492675781
Epoch 100, Loss: 429.9840087890625
weight: tensor([[-4.1391]])
bias: tensor([5.0963])


# ***Adagrad<a name="Adagrad"></a>***

## **Implementation de Adagrad**

```python
class Adagrad_torch(Optimizer)
```
TODO: expliquer l'implémentation

Exemple d'utilisation :
```python
adagrad = Adagrad_torch(model.parameters(), learning_rate=0.9)
```

---

```python
class Adagrad_custom
```
TODO: expliquer l'implémentation

Exemple d'utilisation :
```python
A = Tensor(1.)
B = Tensor(69.)
C = Tensor(420.)
adagrad = Adagrad_custom([A, B, C], learning_rate=0.9)
```

In [66]:
class Adagrad_torch(Optimizer):
    def __init__(self, params, learning_rate=0.9):
        hyperparams = {'lr': learning_rate}
        super().__init__(params=params, defaults=hyperparams)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr = group['lr']

            for theta_t in group['params']:
                if theta_t.grad is None:
                    continue

                state = self.state[theta_t]
                if 'sum_squared_grads' not in state:
                    state['sum_squared_grads'] = torch.zeros_like(theta_t)

                sum_squared_grads = state['sum_squared_grads']
                sum_squared_grads += theta_t.grad ** 2
                state['sum_squared_grads'] = sum_squared_grads

                adjusted_lr = lr / sum_squared_grads.sqrt()

                theta_t -= adjusted_lr * theta_t.grad


class Adagrad_custom:
    def __init__(self, params, learning_rate=0.9):
        self.params = params
        self.learning_rate = learning_rate
        self.state = {param: {'sum_squared_grads': Tensor(0.0)} for param in params}

    def step(self):
        for theta_t in self.params:
            if theta_t.grad is None:
                continue

            state = self.state[theta_t]

            sum_squared_grads = state['sum_squared_grads']
            sum_squared_grads.data += theta_t.grad ** 2
            state['sum_squared_grads'] = sum_squared_grads

            adjusted_lr = self.learning_rate / np.sqrt(sum_squared_grads.data)

            theta_t.data -= adjusted_lr * theta_t.grad

    def zero_grad(self):
        for param in self.params:
            param.grad = 0.0

## **Test de Adagrad_torch**

In [67]:
model = nn.Linear(1, 1)
linear_parameters = {
    'model': model,
    'criterion': nn.MSELoss(),
    'optimizer': Adagrad_torch(model.parameters(), learning_rate=0.9),
    'x_tensor': torch.from_numpy(x_linear).float().view(-1, 1),
    'y_tensor': torch.from_numpy(y_linear).float().view(-1, 1),
    'epochs': 100
}

optimizer_testing_loop(linear_parameters)

Epoch 10, Loss: 7.353924751281738
Epoch 20, Loss: 4.580388069152832
Epoch 30, Loss: 4.131178855895996
Epoch 40, Loss: 4.050938129425049
Epoch 50, Loss: 4.036445617675781
Epoch 60, Loss: 4.033823013305664
Epoch 70, Loss: 4.033348560333252
Epoch 80, Loss: 4.033262729644775
Epoch 90, Loss: 4.033247470855713
Epoch 100, Loss: 4.0332441329956055
weight: tensor([[2.9703]])
bias: tensor([5.1189])


In [68]:
model = nn.Linear(1, 1)
linear_parameters = {
    'model': model,
    'criterion': nn.MSELoss(),
    'optimizer': Adagrad_torch(model.parameters(), learning_rate=0.9),
    'x_tensor': torch.from_numpy(x_linear).float().view(-1, 1),
    'y_tensor': torch.from_numpy(y_nonlinear).float().view(-1, 1),
    'epochs': 100
}

optimizer_testing_loop(linear_parameters)

Epoch 10, Loss: 437.4356689453125
Epoch 20, Loss: 384.6959228515625
Epoch 30, Loss: 356.72442626953125
Epoch 40, Loss: 337.80859375
Epoch 50, Loss: 324.18017578125
Epoch 60, Loss: 314.0190124511719
Epoch 70, Loss: 306.27508544921875
Epoch 80, Loss: 300.28387451171875
Epoch 90, Loss: 295.5985412597656
Epoch 100, Loss: 291.9052734375
weight: tensor([[-4.1445]])
bias: tensor([13.6006])


# ***Adam<a name="Adam"></a>***

## **Implementation de Adam**

```python
class Adam_torch(Optimizer)
```
TODO: expliquer l'implémentation

Exemple d'utilisation :
```python
adam = Adam_torch(model.parameters(), learning_rate=0.25, beta1=0.9, beta2=0.999, epsilon=1e-8)
```

---

```python
class Adam_custom
```
TODO: expliquer l'implémentation

Exemple d'utilisation :
```python
A = Tensor(1.)
B = Tensor(69.)
C = Tensor(420.)
adam = Adam_custom([A, B, C], learning_rate=0.25, beta1=0.9, beta2=0.999, epsilon=1e-8)
```

In [82]:
class Adam_torch(Optimizer):
    def __init__(self, params, learning_rate=0.25, beta1=0.9, beta2=0.999, epsilon=1e-8):
        hyperparams = {'lr': learning_rate, 'beta1': beta1, 'beta2': beta2, 'epsilon': epsilon}
        super().__init__(params=params, defaults=hyperparams)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr = group['lr']
            beta1 = group['beta1']
            beta2 = group['beta2']
            epsilon = group['epsilon']

            for theta_t in group['params']:
                if theta_t.grad is None:
                    continue

                state = self.state[theta_t]
                if 'm' not in state: # Moment d'ordre 1
                    state['m'] = torch.zeros_like(theta_t)
                if 'v' not in state: # Moment d'ordre 2
                    state['v'] = torch.zeros_like(theta_t)
                if 't' not in state: # Temps
                    state['t'] = 0

                # Premier Moment
                m = state['m']
                m_t = beta1 * m + (1 - beta1) * theta_t.grad
                state['m'] = m_t

                # Second Moment
                v = state['v']
                v_t = beta2 * v + (1 - beta2) * theta_t.grad ** 2
                state['v'] = v_t

                # Temps
                t = state['t'] + 1
                state['t'] = t

                # Correction des biais
                m_hat = m_t / (1 - beta1 ** t)
                v_hat = v_t / (1 - beta2 ** t)

                theta_t -= lr * m_hat / (v_hat.sqrt() + epsilon)


class Adam_custom:
    def __init__(self, params, learning_rate=0.25, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.params = params
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.state = {param: {'m': Tensor(0.0), 'v': Tensor(0.0), 't': 0} for param in params}

    def step(self):
        for theta_t in self.params:
            if theta_t.grad is None:
                continue

            state = self.state[theta_t]

            # Premier Moment
            m = state['m']
            m_t = self.beta1 * m.data + (1 - self.beta1) * theta_t.grad
            state['m'].data = m_t

            # Second Moment
            v = state['v']
            v_t = self.beta2 * v.data + (1 - self.beta2) * theta_t.grad ** 2
            state['v'].data = v_t

            # Temps
            t = state['t'] + 1
            state['t'] = t

            # Correction des biais
            m_hat = m_t / (1 - self.beta1 ** t)
            v_hat = v_t / (1 - self.beta2 ** t)

            theta_t.data -= self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)

    def zero_grad(self):
        for param in self.params:
            param.grad = 0.0

## **Test de Adam_torch**

In [83]:
model = nn.Linear(1, 1)
linear_parameters = {
    'model': model,
    'criterion': nn.MSELoss(),
    'optimizer': Adam_torch(model.parameters(), learning_rate=0.25, beta1=0.9, beta2=0.999, epsilon=1e-8),
    'x_tensor': torch.from_numpy(x_linear).float().view(-1, 1),
    'y_tensor': torch.from_numpy(y_linear).float().view(-1, 1),
    'epochs': 100
}

optimizer_testing_loop(linear_parameters)

Epoch 10, Loss: 18.474105834960938
Epoch 20, Loss: 17.985536575317383
Epoch 30, Loss: 4.627365589141846
Epoch 40, Loss: 5.256790637969971
Epoch 50, Loss: 4.720978260040283
Epoch 60, Loss: 4.059943675994873
Epoch 70, Loss: 4.084419250488281
Epoch 80, Loss: 4.069782257080078
Epoch 90, Loss: 4.037606239318848
Epoch 100, Loss: 4.033542156219482
weight: tensor([[2.9711]])
bias: tensor([5.1368])


In [84]:
model = nn.Linear(1, 1)
linear_parameters = {
    'model': model,
    'criterion': nn.MSELoss(),
    'optimizer': Adam_torch(model.parameters(), learning_rate=0.25, beta1=0.9, beta2=0.999, epsilon=1e-8),
    'x_tensor': torch.from_numpy(x_linear).float().view(-1, 1),
    'y_tensor': torch.from_numpy(y_nonlinear).float().view(-1, 1),
    'epochs': 100
}

optimizer_testing_loop(linear_parameters)

Epoch 10, Loss: 627.2946166992188
Epoch 20, Loss: 427.98394775390625
Epoch 30, Loss: 393.12933349609375
Epoch 40, Loss: 345.67926025390625
Epoch 50, Loss: 314.45587158203125
Epoch 60, Loss: 298.862548828125
Epoch 70, Loss: 287.8588562011719
Epoch 80, Loss: 282.3156433105469
Epoch 90, Loss: 279.3364562988281
Epoch 100, Loss: 277.9549865722656
weight: tensor([[-4.1269]])
bias: tensor([16.5093])


# ***AdamW<a name="AdamW"></a>***

## ***Implémentation de AdamW***

```python
class AdamW_torch(Optimizer)
```
TODO: expliquer l'implémentation

Exemple d'utilisation :
```python
adamw = AdamW_torch(model.parameters(), learning_rate=0.25, beta1=0.9, beta2=0.999, epsilon=1e-8, weight_decay=0.01)
```

---

```python
class AdamW_custom
```
TODO: expliquer l'implémentation

Exemple d'utilisation :
```python
A = Tensor(1.)
B = Tensor(69.)
C = Tensor(420.)
adamw = AdamW_custom([A, B, C], learning_rate=0.25, beta1=0.9, beta2=0.999, epsilon=1e-8, weight_decay=0.01)
```

In [92]:
class AdamW_torch(Optimizer):
    def __init__(self, params, learning_rate=0.25, beta1=0.9, beta2=0.999, epsilon=1e-8, weight_decay=0.01):
        hyperparams = {'lr': learning_rate, 'beta1': beta1, 'beta2': beta2, 'epsilon': epsilon, 'weight_decay': weight_decay}
        super().__init__(params=params, defaults=hyperparams)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr = group['lr']
            beta1 = group['beta1']
            beta2 = group['beta2']
            epsilon = group['epsilon']
            weight_decay = group['weight_decay']

            for theta_t in group['params']:
                if theta_t.grad is None:
                    continue

                state = self.state[theta_t]
                if 'm' not in state: # Moment d'ordre 1
                    state['m'] = torch.zeros_like(theta_t)
                if 'v' not in state: # Moment d'ordre 2
                    state['v'] = torch.zeros_like(theta_t)
                if 't' not in state: # Temps
                    state['t'] = 0

                # Premier Moment
                m = state['m']
                m_t = beta1 * m + (1 - beta1) * theta_t.grad
                state['m'] = m_t

                # Second Moment
                v = state['v']
                v_t = beta2 * v + (1 - beta2) * theta_t.grad ** 2
                state['v'] = v_t

                # Temps
                t = state['t'] + 1
                state['t'] = t

                # Correction des biais
                m_hat = m_t / (1 - beta1 ** t)
                v_hat = v_t / (1 - beta2 ** t)

                theta_t -= lr * m_hat / (np.sqrt(v_hat) + epsilon) - lr * weight_decay * theta_t


class AdamW_custom:
    def __init__(self, params, learning_rate=0.25, beta1=0.9, beta2=0.999, epsilon=1e-8, weight_decay=0.01):
        self.params = params
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.weight_decay = weight_decay
        self.state = {param: {'m': Tensor(0.0), 'v': Tensor(0.0), 't': 0} for param in params}

    def step(self):
        for theta_t in self.params:
            if theta_t.grad is None:
                continue

            state = self.state[theta_t]

            # Premier Moment
            m = state['m']
            m_t = self.beta1 * m.data + (1 - self.beta1) * theta_t.grad
            state['m'].data = m_t

            # Second Moment
            v = state['v']
            v_t = self.beta2 * v.data + (1 - self.beta2) * theta_t.grad ** 2
            state['v'].data = v_t

            # Temps
            t = state['t'] + 1
            state['t'] = t

            # Correction des biais
            m_hat = m_t / (1 - self.beta1 ** t)
            v_hat = v_t / (1 - self.beta2 ** t)

            theta_t.data -= self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon) - self.learning_rate * self.weight_decay * theta_t.data

    def zero_grad(self):
        for param in self.params:
            param.grad = 0.0

## ***Test de AdamW***

In [93]:
model = nn.Linear(1, 1)
linear_parameters = {
    'model': model,
    'criterion': nn.MSELoss(),
    'optimizer': AdamW_torch(model.parameters(), learning_rate=0.25, beta1=0.9, beta2=0.999, epsilon=1e-8, weight_decay=0.01),
    'x_tensor': torch.from_numpy(x_linear).float().view(-1, 1),
    'y_tensor': torch.from_numpy(y_linear).float().view(-1, 1),
    'epochs': 100
}

optimizer_testing_loop(linear_parameters)

Epoch 10, Loss: 97.443359375
Epoch 20, Loss: 6.934173583984375
Epoch 30, Loss: 25.01974868774414
Epoch 40, Loss: 7.812467575073242
Epoch 50, Loss: 4.84361457824707
Epoch 60, Loss: 4.524322509765625
Epoch 70, Loss: 4.178420066833496
Epoch 80, Loss: 4.323681831359863
Epoch 90, Loss: 4.06089973449707
Epoch 100, Loss: 4.044063568115234
weight: tensor([[2.9760]])
bias: tensor([5.2175])


In [94]:
model = nn.Linear(1, 1)
linear_parameters = {
    'model': model,
    'criterion': nn.MSELoss(),
    'optimizer': AdamW_torch(model.parameters(), learning_rate=0.25, beta1=0.9, beta2=0.999, epsilon=1e-8, weight_decay=0.01),
    'x_tensor': torch.from_numpy(x_linear).float().view(-1, 1),
    'y_tensor': torch.from_numpy(y_nonlinear).float().view(-1, 1),
    'epochs': 100
}

optimizer_testing_loop(linear_parameters)

Epoch 10, Loss: 579.9456787109375
Epoch 20, Loss: 454.680908203125
Epoch 30, Loss: 410.2872314453125
Epoch 40, Loss: 348.4303894042969
Epoch 50, Loss: 317.8584899902344
Epoch 60, Loss: 297.220947265625
Epoch 70, Loss: 286.00286865234375
Epoch 80, Loss: 279.979248046875
Epoch 90, Loss: 277.4600524902344
Epoch 100, Loss: 277.0702209472656
weight: tensor([[-4.1685]])
bias: tensor([17.5873])


# ***Evaluations<a name="Evaluations"></a>***

## ***Evaluation des Optimiseurs***

In [95]:
def f(x : torch.Tensor | Tensor):
    return (x - 2) ** 2


def f_nonconvexe(x : torch.Tensor | Tensor):
    return 3*x ** 2 - 2*x

In [96]:
def eval_optim(x : torch.Tensor | Tensor, convexe : bool = True, scheduler : bool = False):
    if convexe:
        print(f"Optimisation de la fonction convexe f(x) = (x - 2)²")
        y = f(x)
    else:
        print(f"Optimisation de la fonction non convexe f(x) = 3x² - 2x")
        y = f_nonconvexe(x)

    y.backward()
    if isinstance(x, torch.Tensor):
        print(f"Gradient de f en x={x.item()}: x.grad={x.grad.item()}")
    else:
        print(f"Gradient de f en x={x.data}: x.grad={x.grad}")

    resulting_x = []
    resulting_fx = []

    if isinstance(x, torch.Tensor):
        optimizers_torch = [
            SGD_torch,
            RMSProp_torch,
            Adagrad_torch,
            Adam_torch,
            AdamW_torch
        ]

        for optimizer in optimizers_torch:
            optimizer = optimizer([x])
            if scheduler:
                scheduler = LRSchedulerOnPlateauTorch(optimizer, initial_lr=0.01, patience=5, factor=0.5, min_lr=1e-6, mode='min', threshold=1e-4)
            for i in range(100):
                optimizer.zero_grad()
                if convexe:
                    y = f(x)
                else:
                    y = f_nonconvexe(x)
                y.backward()
                optimizer.step()
                if scheduler:
                    scheduler.step(y)
            print(f"Optimiseur {optimizer.__class__.__name__}: x={x.item()}, f(x)={f(x).item()}")
            resulting_x.append(x.item())
            resulting_fx.append(f(x).item())

    else:
        optimizers_custom = [
            SGD_custom,
            RMSProp_custom,
            Adagrad_custom,
            Adam_custom,
            AdamW_custom
        ]

        for optimizer in optimizers_custom:
            optimizer = optimizer([x])
            if scheduler:
                scheduler = LRSchedulerOnPlateauCustom(optimizer, initial_lr=0.01, patience=5, factor=0.5, min_lr=1e-6, mode='min', threshold=1e-4)
            for i in range(100):
                optimizer.zero_grad()
                if convexe:
                    y = f(x)
                else:
                    y = f_nonconvexe(x)
                y.backward()
                optimizer.step()
                if scheduler:
                    scheduler.step(y)
            print(f"Optimiseur {optimizer.__class__.__name__}: x={x.data}, f(x)={f(x).data}")
            resulting_x.append(x.data)
            resulting_fx.append(f(x).data)

    return resulting_x, resulting_fx

In [107]:
x = torch.tensor([1.], requires_grad=True)
conv_torch_x, conv_torch_fx = eval_optim(x, convexe=True)
print()
nonconv_torch_x, nonconv_torch_fx = eval_optim(x, convexe=False)

Optimisation de la fonction convexe f(x) = (x - 2)²
Gradient de f en x=1.0: x.grad=-2.0
Optimiseur SGD_torch: x=1.8673806190490723, f(x)=0.017587900161743164
Optimiseur RMSProp_torch: x=2.0249998569488525, f(x)=0.000624992826487869
Optimiseur Adagrad_torch: x=2.0, f(x)=0.0
Optimiseur Adam_torch: x=2.0, f(x)=0.0
Optimiseur AdamW_torch: x=2.0017940998077393, f(x)=3.218794063286623e-06

Optimisation de la fonction non convexe f(x) = 3x² - 2x
Gradient de f en x=2.0017940998077393: x.grad=10.012619018554688
Optimiseur SGD_torch: x=0.33676183223724365, f(x)=2.7663612365722656
Optimiseur RMSProp_torch: x=0.3583333492279053, f(x)=2.6950693130493164
Optimiseur Adagrad_torch: x=0.3333333432674408, f(x)=2.777777671813965
Optimiseur Adam_torch: x=0.3333333432674408, f(x)=2.777777671813965
Optimiseur AdamW_torch: x=0.3331911861896515, f(x)=2.7782516479492188


In [108]:
x = Tensor(1.)
conv_custom_x, conv_custom_fx = eval_optim(x, convexe=True)
print()
nonconv_custom_x, nonconv_custom_fx = eval_optim(x, convexe=False)

Optimisation de la fonction convexe f(x) = (x - 2)²
Gradient de f en x=1.0: x.grad=-2.0
Optimiseur SGD_custom: x=1.8673804441052466, f(x)=0.017587946605721615
Optimiseur RMSProp_custom: x=2.025, f(x)=0.0006249999999999956
Optimiseur Adagrad_custom: x=2.0, f(x)=0.0
Optimiseur Adam_custom: x=2.0, f(x)=0.0
Optimiseur AdamW_custom: x=2.001794181739699, f(x)=3.2190881150698366e-06

Optimisation de la fonction non convexe f(x) = 3x² - 2x
Gradient de f en x=2.001794181739699: x.grad=10.012619192458853
Optimiseur SGD_custom: x=0.33676181143633, f(x)=2.7663612718965584
Optimiseur RMSProp_custom: x=0.3583333333333334, f(x)=2.695069444444444
Optimiseur Adagrad_custom: x=0.3333333333333333, f(x)=2.777777777777778
Optimiseur Adam_custom: x=0.3333333333333333, f(x)=2.777777777777778
Optimiseur AdamW_custom: x=0.3331912013267588, f(x)=2.7782515713345335


In [112]:
check_diffs(conv_torch_x, conv_custom_x, tol=1e-4)
print("\n\n")
check_diffs(conv_torch_fx, conv_custom_fx, tol=1e-4)

All elements between
[1.8673806190490723, 2.0249998569488525, 2.0, 2.0, 2.0017940998077393]
and
[1.8673804441052466, 2.025, 2.0, 2.0, 2.001794181739699]
are close within a tolerance of 0.0001



All elements between
[0.017587900161743164, 0.000624992826487869, 0.0, 0.0, 3.218794063286623e-06]
and
[0.017587946605721615, 0.0006249999999999956, 0.0, 0.0, 3.2190881150698366e-06]
are close within a tolerance of 0.0001


## ***Réseau de Neurones***

In [113]:
def func_nn(x, W1, b1, W2, b2):
    h1 = W1 * x + b1
    y = W2 * h1 + b2
    return y


def mse(y, y_hat):
    return (y - y_hat) ** 2

In [114]:
def eval_nn_optim(scheduler : bool = False, custom : bool = True):
    results = []

    if not custom:
        optimizers = [
            SGD_torch,
            RMSProp_torch,
            Adagrad_torch,
            Adam_torch,
            AdamW_torch
        ]

        for optimizer in optimizers:
            W1 = torch.tensor([1.], requires_grad=True)
            b1 = torch.tensor([1.], requires_grad=True)
            W2 = torch.tensor([1.], requires_grad=True)
            b2 = torch.tensor([1.], requires_grad=True)

            x = torch.tensor([1.], requires_grad=True)
            y = torch.tensor([10.])

            optimizer = optimizer([W1, b1, W2, b2])

            if scheduler:
                scheduler = LRSchedulerOnPlateauTorch(optimizer, initial_lr=0.01, patience=5, factor=0.5, min_lr=1e-6, mode='min', threshold=1e-4)

            for i in range(100):
                optimizer.zero_grad()

                y_hat = func_nn(x, W1, b1, W2, b2)
                loss = mse(y, y_hat)

                loss.backward()
                optimizer.step()

                if scheduler:
                    scheduler.step(loss)

            print(f"Optimiseur {optimizer.__class__.__name__}:\nW1={W1.item()}, b1={b1.item()}, W2={W2.item()}, b2={b2.item()}")
            results.append([W1.item(), b1.item(), W2.item(), b2.item()])

    else:
        optimizers = [
            SGD_custom,
            RMSProp_custom,
            Adagrad_custom,
            Adam_custom,
            AdamW_custom
        ]

        for optimizer in optimizers:
            W1 = Tensor(1.)
            b1 = Tensor(1.)
            W2 = Tensor(1.)
            b2 = Tensor(1.)

            x = Tensor(1.)
            y = Tensor(10.)

            optimizer = optimizer([W1, b1, W2, b2])

            if scheduler:
                scheduler = LRSchedulerOnPlateauCustom(optimizer, initial_lr=0.01, patience=5, factor=0.5, min_lr=1e-6, mode='min', threshold=1e-4)

            for i in range(100):
                optimizer.zero_grad()

                y_hat = func_nn(x, W1, b1, W2, b2)
                loss = mse(y, y_hat)

                loss.backward()
                optimizer.step()

                if scheduler:
                    scheduler.step(loss)

            print(f"Optimiseur {optimizer.__class__.__name__}:\nW1={W1.data}, b1={b1.data}, W2={W2.data}, b2={b2.data}")
            results.append([W1.data, b1.data, W2.data, b2.data])

    return results

In [115]:
torch_nn = eval_nn_optim(scheduler=False, custom=False)

Optimiseur SGD_torch:
W1=1.7965515851974487, b1=1.7965515851974487, W2=2.356534481048584, b2=1.532727837562561
Optimiseur RMSProp_torch:
W1=1.975722074508667, b1=1.975722074508667, W2=1.975722074508667, b2=1.9654841423034668
Optimiseur Adagrad_torch:
W1=2.0004096031188965, b1=2.0004096031188965, W2=2.0004096031188965, b2=1.99672532081604
Optimiseur Adam_torch:
W1=1.8886557817459106, b1=1.8886557817459106, W2=1.8886557817459106, b2=2.8375625610351562
Optimiseur AdamW_torch:
W1=1.8559832572937012, b1=1.8559832572937012, W2=1.8559832572937012, b2=3.1020750999450684


In [116]:
custom_nn = eval_nn_optim(scheduler=False, custom=True)

Optimiseur SGD_custom:
W1=1.7965517126874235, b1=1.7965517126874235, W2=2.3565344500454675, b2=1.532727995527799
Optimiseur RMSProp_custom:
W1=1.975721791587977, b1=1.975721791587977, W2=1.975721791587977, b2=1.965486512797642
Optimiseur Adagrad_custom:
W1=2.000409291478893, b1=2.000409291478893, W2=2.000409291478893, b2=1.996725333555097
Optimiseur Adam_custom:
W1=1.8886558814890224, b1=1.8886558814890224, W2=1.8886558820516521, b2=2.8375626842072093
Optimiseur AdamW_custom:
W1=1.8559833420114675, b1=1.8559833420114675, W2=1.855983342325199, b2=3.1020747059033416


In [117]:
check_diffs(torch_nn, custom_nn, tol=1e-4)

All elements between
[[1.7965515851974487, 1.7965515851974487, 2.356534481048584, 1.532727837562561], [1.975722074508667, 1.975722074508667, 1.975722074508667, 1.9654841423034668], [2.0004096031188965, 2.0004096031188965, 2.0004096031188965, 1.99672532081604], [1.8886557817459106, 1.8886557817459106, 1.8886557817459106, 2.8375625610351562], [1.8559832572937012, 1.8559832572937012, 1.8559832572937012, 3.1020750999450684]]
and
[[1.7965517126874235, 1.7965517126874235, 2.3565344500454675, 1.532727995527799], [1.975721791587977, 1.975721791587977, 1.975721791587977, 1.965486512797642], [2.000409291478893, 2.000409291478893, 2.000409291478893, 1.996725333555097], [1.8886558814890224, 1.8886558814890224, 1.8886558820516521, 2.8375626842072093], [1.8559833420114675, 1.8559833420114675, 1.855983342325199, 3.1020747059033416]]
are close within a tolerance of 0.0001


# ***Schedulers<a name="Schedulers"></a>***

## **Implementation de LRScheduler**

```python
class LRSchedulerTorch
```
```python
class LRSchedulerCustom
```
TODO: expliquer l'implémentation

---

```python
class LRSchedulerOnPlateauTorch(LRSchedulerTorch)
```
```python
class LRSchedulerOnPlateauCustom(LRSchedulerCustom)
```
TODO: expliquer l'implémentation

Exemple d'utilisation :
```python
scheduler = LRSchedulerOnPlateauTorch(optimizer, initial_lr=0.01, patience=5, factor=0.5, min_lr=1e-6, mode='min', threshold=1e-4)

scheduler = LRSchedulerOnPlateauCustom(optimizer, initial_lr=0.01, patience=5, factor=0.5, min_lr=1e-6, mode='min', threshold=1e-4)
```

In [118]:
class LRSchedulerTorch:
    def __init__(self, optimizer, initial_lr):
        self.optimizer = optimizer
        self.initial_lr = initial_lr

    def get_lr(self):
        return self.optimizer.param_groups[0]['lr']

    def set_lr(self, lr):
        for group in self.optimizer.param_groups:
            group['lr'] = lr


class LRSchedulerCustom:
    def __init__(self, optimizer, initial_lr):
        self.optimizer = optimizer
        self.initial_lr = initial_lr

    def get_lr(self):
        return self.optimizer.learning_rate

    def set_lr(self, lr):
        self.optimizer.learning_rate = lr

## **Implementation de LRSchedulerOnPlateau**

In [119]:
class LRSchedulerOnPlateauTorch(LRSchedulerTorch):
    def __init__(self, optimizer, initial_lr, patience=10, factor=0.1, min_lr=1e-6, mode='min', threshold=1e-4):
        super().__init__(optimizer, initial_lr)
        self.patience = patience
        self.factor = factor
        self.min_lr = min_lr
        self.mode = mode
        self.threshold = threshold

        self.best_value = None
        self.num_bad_epochs = 0

    def step(self, current_value):
        if self.best_value is None:
            self.best_value = current_value
            return

        if self.mode == 'min':
            improvement = self.best_value - current_value
        elif self.mode == 'max':
            improvement = current_value - self.best_value
        else:
            raise ValueError("Mode must be either 'min' (minimize) or 'max' (maximize).")

        if isinstance(improvement, Tensor):
            if improvement.data > self.threshold:
                self.best_value = current_value
                self.num_bad_epochs = 0
            else:
                self.num_bad_epochs += 1
        else:
            if improvement > self.threshold:
                self.best_value = current_value
                self.num_bad_epochs = 0
            else:
                self.num_bad_epochs += 1

        if self.num_bad_epochs >= self.patience:
            self.reduce_lr()

    def reduce_lr(self):
        current_lr = self.get_lr()
        new_lr = max(current_lr * self.factor, self.min_lr)
        if new_lr < current_lr:
            print(f"Reducing learning rate: {current_lr:.6f} -> {new_lr:.6f}")
            self.set_lr(new_lr)
        self.num_bad_epochs = 0


class LRSchedulerOnPlateauCustom(LRSchedulerCustom):
    def __init__(self, optimizer, initial_lr, patience=10, factor=0.1, min_lr=1e-6, mode='min', threshold=1e-4):
        super().__init__(optimizer, initial_lr)
        self.patience = patience
        self.factor = factor
        self.min_lr = min_lr
        self.mode = mode
        self.threshold = threshold

        self.best_value = None
        self.num_bad_epochs = 0

    def step(self, current_value):
        if self.best_value is None:
            self.best_value = current_value
            return

        if self.mode == 'min':
            improvement = self.best_value - current_value
        elif self.mode == 'max':
            improvement = current_value - self.best_value
        else:
            raise ValueError("Mode must be either 'min' (minimize) or 'max' (maximize).")

        if improvement.data > self.threshold:
            self.best_value = current_value
            self.num_bad_epochs = 0
        else:
            self.num_bad_epochs += 1

        if self.num_bad_epochs >= self.patience:
            self.reduce_lr()

    def reduce_lr(self):
        current_lr = self.get_lr()
        new_lr = max(current_lr * self.factor, self.min_lr)
        if new_lr < current_lr:
            print(f"Reducing learning rate: {current_lr:.6f} -> {new_lr:.6f}")
            self.set_lr(new_lr)
        self.num_bad_epochs = 0

## **Test de LRSchedulerOnPlateau**

In [132]:
x = torch.tensor([69.], requires_grad=True)
conv_torch_scheduler_x, conv_torch_scheduler_fx = eval_optim(x, convexe=True, scheduler=True)
print()
nonconv_torch_scheduler_x, nonconv_torch_scheduler_fx = eval_optim(x, convexe=False, scheduler=True)

Optimisation de la fonction convexe f(x) = (x - 2)²
Gradient de f en x=69.0: x.grad=134.0
Optimiseur SGD_torch: x=10.885512351989746, f(x)=78.95233154296875
Optimiseur RMSProp_torch: x=5.890719413757324, f(x)=15.137697219848633
Reducing learning rate: 0.900000 -> 0.450000
Reducing learning rate: 0.450000 -> 0.225000
Reducing learning rate: 0.225000 -> 0.112500
Reducing learning rate: 0.112500 -> 0.056250
Reducing learning rate: 0.056250 -> 0.028125
Reducing learning rate: 0.028125 -> 0.014063
Reducing learning rate: 0.014063 -> 0.007031
Reducing learning rate: 0.007031 -> 0.003516
Reducing learning rate: 0.003516 -> 0.001758
Reducing learning rate: 0.001758 -> 0.000879
Reducing learning rate: 0.000879 -> 0.000439
Optimiseur Adagrad_torch: x=2.0020194053649902, f(x)=4.077997800777666e-06
Reducing learning rate: 0.250000 -> 0.125000
Reducing learning rate: 0.125000 -> 0.062500
Reducing learning rate: 0.062500 -> 0.031250
Reducing learning rate: 0.031250 -> 0.015625
Reducing learning rate

In [133]:
x = Tensor(69.)
conv_custom_scheduler_x, conv_custom_scheduler_fx = eval_optim(x, convexe=True, scheduler=True)
print()
nonconv_custom_scheduler_x, nonconv_custom_scheduler_fx = eval_optim(x, convexe=False, scheduler=True)

Optimisation de la fonction convexe f(x) = (x - 2)²
Gradient de f en x=69.0: x.grad=134.0
Optimiseur SGD_custom: x=10.885510244948467, f(x)=78.95229231308416
Optimiseur RMSProp_custom: x=5.890716009174875, f(x)=15.13767106404967
Reducing learning rate: 0.900000 -> 0.450000
Reducing learning rate: 0.450000 -> 0.225000
Reducing learning rate: 0.225000 -> 0.112500
Reducing learning rate: 0.112500 -> 0.056250
Reducing learning rate: 0.056250 -> 0.028125
Reducing learning rate: 0.028125 -> 0.014063
Reducing learning rate: 0.014063 -> 0.007031
Reducing learning rate: 0.007031 -> 0.003516
Reducing learning rate: 0.003516 -> 0.001758
Reducing learning rate: 0.001758 -> 0.000879
Reducing learning rate: 0.000879 -> 0.000439
Optimiseur Adagrad_custom: x=2.0020189655822986, f(x)=4.076222022506476e-06
Reducing learning rate: 0.250000 -> 0.125000
Reducing learning rate: 0.125000 -> 0.062500
Reducing learning rate: 0.062500 -> 0.031250
Reducing learning rate: 0.031250 -> 0.015625
Reducing learning ra

In [134]:
check_diffs(conv_torch_scheduler_x, conv_custom_scheduler_x, tol=1e-4)
print()
check_diffs(conv_torch_scheduler_fx, conv_custom_scheduler_fx, tol=1e-4)

All elements between
[10.885512351989746, 5.890719413757324, 2.0020194053649902, 2.013941764831543, 2.010690689086914]
and
[10.885510244948467, 5.890716009174875, 2.0020189655822986, 2.0139409069240797, 2.0106905128327615]
are close within a tolerance of 0.0001

All elements between
[78.95233154296875, 15.137697219848633, 4.077997800777666e-06, 0.00019437281298451126, 0.00011429082951508462]
and
[78.95229231308416, 15.13767106404967, 4.076222022506476e-06, 0.00019434888586585268, 0.00011428706462743738]
are close within a tolerance of 0.0001


In [135]:
torch_scheduler_nn = eval_nn_optim(scheduler=True, custom=False)

Reducing learning rate: 0.010000 -> 0.005000
Reducing learning rate: 0.005000 -> 0.002500
Reducing learning rate: 0.002500 -> 0.001250
Reducing learning rate: 0.001250 -> 0.000625
Reducing learning rate: 0.000625 -> 0.000313
Reducing learning rate: 0.000313 -> 0.000156
Reducing learning rate: 0.000156 -> 0.000078
Reducing learning rate: 0.000078 -> 0.000039
Reducing learning rate: 0.000039 -> 0.000020
Reducing learning rate: 0.000020 -> 0.000010
Reducing learning rate: 0.000010 -> 0.000005
Reducing learning rate: 0.000005 -> 0.000002
Reducing learning rate: 0.000002 -> 0.000001
Reducing learning rate: 0.000001 -> 0.000001
Optimiseur SGD_torch:
W1=1.796550989151001, b1=1.796550989151001, W2=2.3565328121185303, b2=1.5327274799346924
Reducing learning rate: 0.050000 -> 0.025000
Reducing learning rate: 0.025000 -> 0.012500
Reducing learning rate: 0.012500 -> 0.006250
Reducing learning rate: 0.006250 -> 0.003125
Reducing learning rate: 0.003125 -> 0.001563
Reducing learning rate: 0.001563 -

In [136]:
custom_scheduler_nn = eval_nn_optim(scheduler=True, custom=True)

Reducing learning rate: 0.010000 -> 0.005000
Reducing learning rate: 0.005000 -> 0.002500
Reducing learning rate: 0.002500 -> 0.001250
Reducing learning rate: 0.001250 -> 0.000625
Reducing learning rate: 0.000625 -> 0.000313
Reducing learning rate: 0.000313 -> 0.000156
Reducing learning rate: 0.000156 -> 0.000078
Reducing learning rate: 0.000078 -> 0.000039
Reducing learning rate: 0.000039 -> 0.000020
Reducing learning rate: 0.000020 -> 0.000010
Reducing learning rate: 0.000010 -> 0.000005
Reducing learning rate: 0.000005 -> 0.000002
Reducing learning rate: 0.000002 -> 0.000001
Reducing learning rate: 0.000001 -> 0.000001
Optimiseur SGD_custom:
W1=1.7965510145299812, b1=1.7965510145299812, W2=2.3565333855349313, b2=1.5327276992600714
Reducing learning rate: 0.050000 -> 0.025000
Reducing learning rate: 0.025000 -> 0.012500
Reducing learning rate: 0.012500 -> 0.006250
Reducing learning rate: 0.006250 -> 0.003125
Reducing learning rate: 0.003125 -> 0.001563
Reducing learning rate: 0.00156

In [137]:
check_diffs(torch_scheduler_nn, custom_scheduler_nn, tol=1e-4)

All elements between
[[1.796550989151001, 1.796550989151001, 2.3565328121185303, 1.5327274799346924], [2.0043885707855225, 2.0043885707855225, 2.0043885707855225, 1.9648537635803223], [2.0004093647003174, 2.0004093647003174, 2.0004093647003174, 1.99672532081604], [1.9199641942977905, 1.9199641942977905, 1.9199641942977905, 2.3495137691497803], [1.8980488777160645, 1.8980488777160645, 1.8980488777160645, 2.3067123889923096]]
and
[[1.7965510145299812, 1.7965510145299812, 2.3565333855349313, 1.5327276992600714], [2.004388154342763, 2.004388154342763, 2.004388154342763, 1.964856315463248], [2.000409302209841, 2.000409302209841, 2.000409302209841, 1.9967252473852117], [1.9199642223144788, 1.9199642223144788, 1.9199642227924367, 2.3495129287059933], [1.8980487189606503, 1.8980487189606503, 1.8980487194008169, 2.306712029160739]]
are close within a tolerance of 0.0001
