# Illustration of the Forward-Backward-Forward (FBF) Algorithm
## Min-max problem with box constraints (linear classifier model)

The FBF algorithm was originally formulated for monotone inclusions and finds application in variational inequality problems (VIPs).
VIPs also cover the class of zero-sum games (min-max problems with a shared objective in two variables):

$\min\limits_{x \in H} \max\limits_{y \in G} F(x, y)$

This type of problem fits the Wasserstein-GAN formulation with weight clipping (see https://arxiv.org/abs/1701.07875), where $x$ and $y$ are parametrisations of the generator and discriminator networks, respectively, and the constraint set $G$ is a $d$-dimensional cube.

The algorithm applied to this specific setting looks as follows:

$u_k = P_H \left[ x_k - \alpha \nabla_x F(x_k, y_k)\right]$

$v_k = P_G \left[ y_k + \alpha \nabla_y F(x_k, y_k)\right]$

$x_{k+1} = u_k - \alpha \nabla_x F(u_k, v_k) + \alpha \nabla_x F(x_k, y_k)$

$y_{k+1} = v_k + \alpha \nabla_y F(u_k, v_k) - \alpha \nabla_y F(x_k, y_k)$

We have proved convergence of the FBF method if $F(x, y)$ is differentiable, convex in $x$ and concave in $y$, and the constraint sets $H$ and $G$ are nonempty, closed and convex. This is a well-established result.
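
As a minimal sketch of the four updates above, the following plain-Python snippet runs them on the bilinear toy objective $F(x, y) = x y$ with $H = G = [-1, 1]$. The objective, the box, the step size $\alpha = 0.5$ and the helper names `clip` and `fbf_bilinear` are illustrative assumptions and are not taken from the optimiser class used in this notebook.

```python
def clip(value, low, high):
    """Euclidean projection of a scalar onto the interval [low, high]."""
    return max(low, min(high, value))


def fbf_bilinear(x, y, alpha=0.5, iterations=200):
    """Run FBF on F(x, y) = x * y with x and y constrained to [-1, 1]."""
    for _ in range(iterations):
        # gradients of F at the current point: dF/dx = y, dF/dy = x
        gx, gy = y, x
        # forward-backward: gradient step followed by projection
        u = clip(x - alpha * gx, -1.0, 1.0)
        v = clip(y + alpha * gy, -1.0, 1.0)
        # gradients of F at the intermediate point (u, v)
        gu, gv = v, u
        # second forward step: correct with the difference of the gradients
        x = u - alpha * gu + alpha * gx
        y = v + alpha * gv - alpha * gy
    return x, y


x_final, y_final = fbf_bilinear(1.0, 1.0)
print(x_final, y_final)  # both values end up very close to the saddle point (0, 0)
```

Without constraints, simultaneous gradient descent-ascent with a constant step size spirals away from the saddle point on this objective, which makes it a convenient test case for the additional forward step.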

In the absence of a constraint set (and thus of a projection), we get the so-called "extra-gradient method" (for its application to GANs, see https://arxiv.org/abs/1802.10551).
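
To see why, replace $P_H$ by the identity in the updates above, so that $u_k = x_k - \alpha \nabla_x F(x_k, y_k)$. Substituting this into the $x$-update gives (the $y$-update is analogous):

$x_{k+1} = x_k - \alpha \nabla_x F(x_k, y_k) - \alpha \nabla_x F(u_k, v_k) + \alpha \nabla_x F(x_k, y_k) = x_k - \alpha \nabla_x F(u_k, v_k)$

This is a step from $x_k$ that uses the gradient evaluated at the extrapolated point $(u_k, v_k)$.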

The implementation of one step (e.g., to get from $x_{k}$ to $x_{k+1}$) is split into two phases:

1. "extrapolation":
    1. compute update (either via SGD or Adam)
    2. do descent step
    3. store update (e.g., $- \alpha \nabla_x F(x_k, y_k)$)

2. "step":
    1. compute update (either via SGD or Adam)
    2. do descent step and subtract stored update
    
Note: The projection (for a $d$-dimensional cube this means "weight clipping") is done directly in the executable training file, e.g., "train_fbfadam.py", and is not included in the optimiser class.
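
The two phases and the external projection can be sketched in plain Python as follows. `TwoPhaseSGD`, `clip_params` and `grad_fn` are hypothetical names chosen for illustration. The sketch covers plain SGD without inertia on a made-up quadratic objective and is not the code of the actual optimiser class.

```python
class TwoPhaseSGD:
    """Toy two-phase optimiser on a list of floats (hypothetical sketch)."""

    def __init__(self, params, lr):
        self.params = params      # list of floats, modified in place
        self.lr = lr
        self.updates_copy = []    # updates stored during extrapolation

    def extrapolation(self, grads):
        """Phase 1: descent step, remembering the applied update."""
        self.updates_copy = [-self.lr * g for g in grads]
        for i, u in enumerate(self.updates_copy):
            self.params[i] += u

    def step(self, grads):
        """Phase 2: descent step at the new point minus the stored update."""
        for i, (g, u) in enumerate(zip(grads, self.updates_copy)):
            self.params[i] += -self.lr * g - u
        self.updates_copy = []


def clip_params(params, clip):
    """Projection onto the cube [-clip, clip]^d, done outside the optimiser."""
    for i, p in enumerate(params):
        params[i] = max(-clip, min(clip, p))


def grad_fn(params):
    """Gradient of the toy objective f(p) = sum(p_i^2)."""
    return [2.0 * p for p in params]


params = [0.4, -0.1]
opt = TwoPhaseSGD(params, lr=0.1)

opt.extrapolation(grad_fn(params))   # u = x - lr * grad(x)
clip_params(params, clip=0.25)       # projection between the two phases
opt.step(grad_fn(params))            # x_new = u - lr * grad(u) + lr * grad(x)
print(params)
```

Keeping the projection outside the class mirrors the note above: the optimiser only needs the gradients, while the training script decides which constraint set to enforce.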

The purpose of this notebook is to illustrate the implementation of the FBF method, in particular the two key methods of the FBF optimiser class, `extrapolation()` and `step()`. This is done for "FBFSGD" ("FBFAdam" works in a similar fashion).
The illustration uses a linear classifier model (one fully connected layer without activation) and is shown for one component only, since the algorithm does the same in both components apart from the opposite sign of the objective function.

In [1]:
import torch
import torch.nn as nn

In [2]:
# define toy instance of a neural network
class LinClas(nn.Module):
    def __init__(self):
        super(LinClas, self).__init__()
        self.fc = nn.Linear(5, 1)

    def forward(self, x):
        x = self.fc(x)
        return x

In [3]:
# define toy example of a loss function
def loss(x):
    return x*x

In [4]:
# print function for optimiser (weights, gradients, and stored copies of updates and old parameters)
def print_opt():
    for group in opt.param_groups:
        for p in group['params']:
            print(f"Weights\n{p}")
            print(f"Gradient\n{p.grad}\n")
    print(f"Updates_Copy:\n{opt.updates_copy}\n")
    print(f"Old_Params_Copy:\n{opt.old_params_copy}\n")

In [5]:
# radius of d-dimensional cube
clip = 0.25

# input of neural network (whole batch)
inp = torch.Tensor([-0.1, 0., 0.1, 0.2, 0.3])

# step size = learning rate
lr = 0.1

# inertia parameter (used in the step phase, see Iteration 2)
inertia = 0.25

In [6]:
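# analytic gradient of the loss (w . inp + b)^2 with respect to the weights w
def grad(output):
    return 2*output*inp

# detached copy of the current weight matrix (first parameter of the optimiser)
def weights():
    return opt.param_groups[0]["params"][0].data.clone().detach()

# update of the weight matrix stored by the last extrapolation
def update():
    return opt.updates_copy[0]

# copy of the weight matrix stored by the previous step
def old_params():
    return opt.old_params_copy[0]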

First we instantiate a fully connected (1-layer) neural network and have a look at the initial weights.

In [7]:
A = LinClas()
A.state_dict()

OrderedDict([('fc.weight',
              tensor([[-0.3501,  0.3995,  0.4176,  0.3505, -0.0472]])),
             ('fc.bias', tensor([-0.0214]))])

Now we import the FBF optimiser class "FBFSGD" and set up an instance with a given step size (= "lr") and inertia parameter. To keep track of what happens, nothing fancy (e.g., "Momentum" or "Nesterov") is specified.
To check that all the parameters of the network are tracked, we call `print_opt()`.

In [8]:
from optim import FBFSGD
opt = FBFSGD(A.parameters(), lr = lr, inertia = inertia)
print_opt()
print(opt.inertia)

Weights
Parameter containing:
tensor([[-0.3501,  0.3995,  0.4176,  0.3505, -0.0472]], requires_grad=True)
Gradient
None

Weights
Parameter containing:
tensor([-0.0214], requires_grad=True)
Gradient
None

Updates_Copy:
[]

Old_Params_Copy:
[]

0.25


## Iteration 1

### Computation of gradient

Compute output of network with respect to input.

In [9]:
outp = A(inp)
print(f"Output: {outp}")
lc_loss = loss(outp)
print(f"Loss: {lc_loss}")

Output: tensor([0.1113], grad_fn=<AddBackward0>)
Loss: tensor([0.0124], grad_fn=<MulBackward0>)


Clear any gradients that were previously stored.

In [10]:
opt.zero_grad()
print_opt()

Weights
Parameter containing:
tensor([[-0.3501,  0.3995,  0.4176,  0.3505, -0.0472]], requires_grad=True)
Gradient
None

Weights
Parameter containing:
tensor([-0.0214], requires_grad=True)
Gradient
None

Updates_Copy:
[]

Old_Params_Copy:
[]



Backpropagate the loss through the network to get gradients with respect to each weight.

In [11]:
lc_loss.backward()
print_opt()

Weights
Parameter containing:
tensor([[-0.3501,  0.3995,  0.4176,  0.3505, -0.0472]], requires_grad=True)
Gradient
tensor([[-0.0223,  0.0000,  0.0223,  0.0445,  0.0668]])

Weights
Parameter containing:
tensor([-0.0214], requires_grad=True)
Gradient
tensor([0.2226])

Updates_Copy:
[]

Old_Params_Copy:
[]



In [12]:
grad(outp)

tensor([-0.0223,  0.0000,  0.0223,  0.0445,  0.0668], grad_fn=<MulBackward0>)
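
The helper `grad` reproduces the autograd result because, for the loss $(w \cdot \text{inp} + b)^2$, the chain rule gives $2 \cdot \text{output} \cdot \text{inp}$ for the weights and $2 \cdot \text{output}$ for the bias. As a quick sanity check without PyTorch, the sketch below recomputes both from the rounded initial weights printed above. The variable names are illustrative.

```python
# rounded initial weights and bias as printed by A.state_dict()
w = [-0.3501, 0.3995, 0.4176, 0.3505, -0.0472]
b = -0.0214
inp = [-0.1, 0.0, 0.1, 0.2, 0.3]

# forward pass of the linear layer: output = w . inp + b
output = sum(wi * xi for wi, xi in zip(w, inp)) + b

# chain rule for loss = output**2
grad_w = [2 * output * xi for xi in inp]
grad_b = 2 * output

print(round(output, 4))
print([round(g, 4) for g in grad_w])
print(round(grad_b, 4))
```

Up to rounding, the three printed quantities should match the output, the weight gradient and the bias gradient shown in the cells above.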

### Extrapolation

In [13]:
weights() - lr*grad(outp)

tensor([[-0.3479,  0.3995,  0.4154,  0.3460, -0.0539]], grad_fn=<SubBackward0>)

In [14]:
opt.extrapolation()
print_opt()

Weights
Parameter containing:
tensor([[-0.3479,  0.3995,  0.4154,  0.3460, -0.0539]], requires_grad=True)
Gradient
tensor([[-0.0223,  0.0000,  0.0223,  0.0445,  0.0668]])

Weights
Parameter containing:
tensor([-0.0437], requires_grad=True)
Gradient
tensor([0.2226])

Updates_Copy:
[tensor([[ 0.0022, -0.0000, -0.0022, -0.0045, -0.0067]]), tensor([-0.0223])]

Old_Params_Copy:
[]



### Projection

In [15]:
for p in A.parameters():
    p.data.clamp_(-clip, clip)
print_opt()

Weights
Parameter containing:
tensor([[-0.2500,  0.2500,  0.2500,  0.2500, -0.0539]], requires_grad=True)
Gradient
tensor([[-0.0223,  0.0000,  0.0223,  0.0445,  0.0668]])

Weights
Parameter containing:
tensor([-0.0437], requires_grad=True)
Gradient
tensor([0.2226])

Updates_Copy:
[tensor([[ 0.0022, -0.0000, -0.0022, -0.0045, -0.0067]]), tensor([-0.0223])]

Old_Params_Copy:
[]
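
Clamping every entry to $[-\text{clip}, \text{clip}]$ is the Euclidean projection $P_G$ onto the cube, because the projection onto a box acts coordinate-wise. A minimal plain-Python sketch (the helper name `project_box` is an assumption) applied to the extrapolated weights printed above:

```python
def project_box(values, radius):
    """Coordinate-wise projection onto the cube [-radius, radius]^d."""
    return [max(-radius, min(radius, v)) for v in values]


extrapolated = [-0.3479, 0.3995, 0.4154, 0.3460, -0.0539]
projected = project_box(extrapolated, 0.25)
print(projected)
```

Entries that already lie inside the cube, such as the last weight and the bias, are left unchanged.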



### Computation of gradient

Compute output of network with respect to input.

In [16]:
outp = A(inp)
print(f"Output: {outp}")
lc_loss = loss(outp)
print(f"Loss: {lc_loss}")

Output: tensor([0.0402], grad_fn=<AddBackward0>)
Loss: tensor([0.0016], grad_fn=<MulBackward0>)


Clear any gradients that were previously stored.

In [17]:
opt.zero_grad()
print_opt()

Weights
Parameter containing:
tensor([[-0.2500,  0.2500,  0.2500,  0.2500, -0.0539]], requires_grad=True)
Gradient
tensor([[0., 0., 0., 0., 0.]])

Weights
Parameter containing:
tensor([-0.0437], requires_grad=True)
Gradient
tensor([0.])

Updates_Copy:
[tensor([[ 0.0022, -0.0000, -0.0022, -0.0045, -0.0067]]), tensor([-0.0223])]

Old_Params_Copy:
[]



Backpropagate the loss through the network to get gradients with respect to each weight.

In [18]:
lc_loss.backward()
print_opt()

Weights
Parameter containing:
tensor([[-0.2500,  0.2500,  0.2500,  0.2500, -0.0539]], requires_grad=True)
Gradient
tensor([[-0.0080,  0.0000,  0.0080,  0.0161,  0.0241]])

Weights
Parameter containing:
tensor([-0.0437], requires_grad=True)
Gradient
tensor([0.0803])

Updates_Copy:
[tensor([[ 0.0022, -0.0000, -0.0022, -0.0045, -0.0067]]), tensor([-0.0223])]

Old_Params_Copy:
[]



In [19]:
grad(outp)

tensor([-0.0080,  0.0000,  0.0080,  0.0161,  0.0241], grad_fn=<MulBackward0>)

### Step

In [20]:
weights() - lr*grad(outp) - update()

tensor([[-0.2514,  0.2500,  0.2514,  0.2528, -0.0496]], grad_fn=<SubBackward0>)

In [21]:
opt.step()
print_opt()

Weights
Parameter containing:
tensor([[-0.2514,  0.2500,  0.2514,  0.2528, -0.0496]], requires_grad=True)
Gradient
tensor([[-0.0080,  0.0000,  0.0080,  0.0161,  0.0241]])

Weights
Parameter containing:
tensor([-0.0294], requires_grad=True)
Gradient
tensor([0.0803])

Updates_Copy:
[]

Old_Params_Copy:
[tensor([[-0.2514,  0.2500,  0.2514,  0.2528, -0.0496]]), tensor([-0.0294])]



## Iteration 2

### Computation of gradient

Compute output of network with respect to input.

In [22]:
outp = A(inp)
print(f"Output: {outp}")
lc_loss = loss(outp)
print(f"Loss: {lc_loss}")

Output: tensor([0.0565], grad_fn=<AddBackward0>)
Loss: tensor([0.0032], grad_fn=<MulBackward0>)


Clear any gradients that were previously stored.

In [23]:
opt.zero_grad()
print_opt()

Weights
Parameter containing:
tensor([[-0.2514,  0.2500,  0.2514,  0.2528, -0.0496]], requires_grad=True)
Gradient
tensor([[0., 0., 0., 0., 0.]])

Weights
Parameter containing:
tensor([-0.0294], requires_grad=True)
Gradient
tensor([0.])

Updates_Copy:
[]

Old_Params_Copy:
[tensor([[-0.2514,  0.2500,  0.2514,  0.2528, -0.0496]]), tensor([-0.0294])]



Backpropagate the loss through the network to get gradients with respect to each weight.

In [24]:
lc_loss.backward()
print_opt()

Weights
Parameter containing:
tensor([[-0.2514,  0.2500,  0.2514,  0.2528, -0.0496]], requires_grad=True)
Gradient
tensor([[-0.0113,  0.0000,  0.0113,  0.0226,  0.0339]])

Weights
Parameter containing:
tensor([-0.0294], requires_grad=True)
Gradient
tensor([0.1130])

Updates_Copy:
[]

Old_Params_Copy:
[tensor([[-0.2514,  0.2500,  0.2514,  0.2528, -0.0496]]), tensor([-0.0294])]



In [25]:
grad(outp)

tensor([-0.0113,  0.0000,  0.0113,  0.0226,  0.0339], grad_fn=<MulBackward0>)

### Extrapolation

In [26]:
weights() - lr*grad(outp)

tensor([[-0.2503,  0.2500,  0.2503,  0.2506, -0.0530]], grad_fn=<SubBackward0>)

In [27]:
opt.extrapolation()
print_opt()

Weights
Parameter containing:
tensor([[-0.2503,  0.2500,  0.2503,  0.2506, -0.0530]], requires_grad=True)
Gradient
tensor([[-0.0113,  0.0000,  0.0113,  0.0226,  0.0339]])

Weights
Parameter containing:
tensor([-0.0408], requires_grad=True)
Gradient
tensor([0.1130])

Updates_Copy:
[tensor([[ 0.0011, -0.0000, -0.0011, -0.0023, -0.0034]]), tensor([-0.0113])]

Old_Params_Copy:
[tensor([[-0.2514,  0.2500,  0.2514,  0.2528, -0.0496]]), tensor([-0.0294])]



### Projection

In [28]:
for p in A.parameters():
    p.data.clamp_(-clip, clip)
print_opt()

Weights
Parameter containing:
tensor([[-0.2500,  0.2500,  0.2500,  0.2500, -0.0530]], requires_grad=True)
Gradient
tensor([[-0.0113,  0.0000,  0.0113,  0.0226,  0.0339]])

Weights
Parameter containing:
tensor([-0.0408], requires_grad=True)
Gradient
tensor([0.1130])

Updates_Copy:
[tensor([[ 0.0011, -0.0000, -0.0011, -0.0023, -0.0034]]), tensor([-0.0113])]

Old_Params_Copy:
[tensor([[-0.2514,  0.2500,  0.2514,  0.2528, -0.0496]]), tensor([-0.0294])]



### Computation of gradient

Compute output of network with respect to input.

In [29]:
outp = A(inp)
print(f"Output: {outp}")
lc_loss = loss(outp)
print(f"Loss: {lc_loss}")

Output: tensor([0.0433], grad_fn=<AddBackward0>)
Loss: tensor([0.0019], grad_fn=<MulBackward0>)


Clear any gradients that were previously stored.

In [30]:
opt.zero_grad()
print_opt()

Weights
Parameter containing:
tensor([[-0.2500,  0.2500,  0.2500,  0.2500, -0.0530]], requires_grad=True)
Gradient
tensor([[0., 0., 0., 0., 0.]])

Weights
Parameter containing:
tensor([-0.0408], requires_grad=True)
Gradient
tensor([0.])

Updates_Copy:
[tensor([[ 0.0011, -0.0000, -0.0011, -0.0023, -0.0034]]), tensor([-0.0113])]

Old_Params_Copy:
[tensor([[-0.2514,  0.2500,  0.2514,  0.2528, -0.0496]]), tensor([-0.0294])]



Backpropagate the loss through the network to get gradients with respect to each weight.

In [31]:
lc_loss.backward()
print_opt()

Weights
Parameter containing:
tensor([[-0.2500,  0.2500,  0.2500,  0.2500, -0.0530]], requires_grad=True)
Gradient
tensor([[-0.0087,  0.0000,  0.0087,  0.0173,  0.0260]])

Weights
Parameter containing:
tensor([-0.0408], requires_grad=True)
Gradient
tensor([0.0867])

Updates_Copy:
[tensor([[ 0.0011, -0.0000, -0.0011, -0.0023, -0.0034]]), tensor([-0.0113])]

Old_Params_Copy:
[tensor([[-0.2514,  0.2500,  0.2514,  0.2528, -0.0496]]), tensor([-0.0294])]



In [32]:
grad(outp)

tensor([-0.0087,  0.0000,  0.0087,  0.0173,  0.0260], grad_fn=<MulBackward0>)

### Step

In [33]:
w = weights() - lr*grad(outp) - update()
w

tensor([[-0.2503,  0.2500,  0.2503,  0.2505, -0.0522]], grad_fn=<SubBackward0>)

In [34]:
(1+inertia)*w - inertia*old_params()

tensor([[-0.2500,  0.2500,  0.2500,  0.2499, -0.0528]], grad_fn=<SubBackward0>)

In [35]:
opt.step()
print_opt()

Weights
Parameter containing:
tensor([[-0.2500,  0.2500,  0.2500,  0.2499, -0.0528]], requires_grad=True)
Gradient
tensor([[-0.0087,  0.0000,  0.0087,  0.0173,  0.0260]])

Weights
Parameter containing:
tensor([-0.0403], requires_grad=True)
Gradient
tensor([0.0867])

Updates_Copy:
[]

Old_Params_Copy:
[tensor([[-0.2503,  0.2500,  0.2503,  0.2505, -0.0522]]), tensor([-0.0381])]
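
Judging from cells In [33] to In [35], `step()` appears to first form the plain FBF result `w`, then extrapolate it away from the previously stored result with weight `inertia`, and finally store the new `w` (not the inertial result) in `old_params_copy`. The plain-Python sketch below replays this arithmetic on the rounded values printed above. It is an inference from the outputs, not the code of `FBFSGD`, and the function name `inertial_combine` is made up for illustration.

```python
def inertial_combine(w, old_w, inertia):
    """Return (1 + inertia) * w - inertia * old_w, entry by entry."""
    return [(1 + inertia) * wi - inertia * oi for wi, oi in zip(w, old_w)]


inertia = 0.25
w = [-0.2503, 0.2500, 0.2503, 0.2505, -0.0522]       # plain FBF result of In [33]
old_w = [-0.2514, 0.2500, 0.2514, 0.2528, -0.0496]   # Old_Params_Copy before the step

new_weights = inertial_combine(w, old_w, inertia)
print([round(v, 4) for v in new_weights])
```

In the first iteration there is no stored previous result yet, which is consistent with `opt.step()` returning the plain FBF result there.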



## Iteration 3

### Computation of gradient

Compute output of network with respect to input.

In [36]:
outp = A(inp)
print(f"Output: {outp}")
lc_loss = loss(outp)
print(f"Loss: {lc_loss}")

Output: tensor([0.0438], grad_fn=<AddBackward0>)
Loss: tensor([0.0019], grad_fn=<MulBackward0>)


Clear any gradients that were previously stored.

In [37]:
opt.zero_grad()
print_opt()

Weights
Parameter containing:
tensor([[-0.2500,  0.2500,  0.2500,  0.2499, -0.0528]], requires_grad=True)
Gradient
tensor([[0., 0., 0., 0., 0.]])

Weights
Parameter containing:
tensor([-0.0403], requires_grad=True)
Gradient
tensor([0.])

Updates_Copy:
[]

Old_Params_Copy:
[tensor([[-0.2503,  0.2500,  0.2503,  0.2505, -0.0522]]), tensor([-0.0381])]



Backpropagate the loss through the network to get gradients with respect to each weight.

In [38]:
lc_loss.backward()
print_opt()

Weights
Parameter containing:
tensor([[-0.2500,  0.2500,  0.2500,  0.2499, -0.0528]], requires_grad=True)
Gradient
tensor([[-0.0088,  0.0000,  0.0088,  0.0175,  0.0263]])

Weights
Parameter containing:
tensor([-0.0403], requires_grad=True)
Gradient
tensor([0.0877])

Updates_Copy:
[]

Old_Params_Copy:
[tensor([[-0.2503,  0.2500,  0.2503,  0.2505, -0.0522]]), tensor([-0.0381])]



In [39]:
grad(outp)

tensor([-0.0088,  0.0000,  0.0088,  0.0175,  0.0263], grad_fn=<MulBackward0>)

### Extrapolation

In [40]:
weights() - lr*grad(outp)

tensor([[-0.2491,  0.2500,  0.2491,  0.2482, -0.0555]], grad_fn=<SubBackward0>)

In [41]:
opt.extrapolation()
print_opt()

Weights
Parameter containing:
tensor([[-0.2491,  0.2500,  0.2491,  0.2482, -0.0555]], requires_grad=True)
Gradient
tensor([[-0.0088,  0.0000,  0.0088,  0.0175,  0.0263]])

Weights
Parameter containing:
tensor([-0.0491], requires_grad=True)
Gradient
tensor([0.0877])

Updates_Copy:
[tensor([[ 0.0009, -0.0000, -0.0009, -0.0018, -0.0026]]), tensor([-0.0088])]

Old_Params_Copy:
[tensor([[-0.2503,  0.2500,  0.2503,  0.2505, -0.0522]]), tensor([-0.0381])]



### Projection

In [42]:
for p in A.parameters():
    p.data.clamp_(-clip, clip)
print_opt()

Weights
Parameter containing:
tensor([[-0.2491,  0.2500,  0.2491,  0.2482, -0.0555]], requires_grad=True)
Gradient
tensor([[-0.0088,  0.0000,  0.0088,  0.0175,  0.0263]])

Weights
Parameter containing:
tensor([-0.0491], requires_grad=True)
Gradient
tensor([0.0877])

Updates_Copy:
[tensor([[ 0.0009, -0.0000, -0.0009, -0.0018, -0.0026]]), tensor([-0.0088])]

Old_Params_Copy:
[tensor([[-0.2503,  0.2500,  0.2503,  0.2505, -0.0522]]), tensor([-0.0381])]



### Computation of gradient

Compute output of network with respect to input.

In [43]:
outp = A(inp)
print(f"Output: {outp}")
lc_loss = loss(outp)
print(f"Loss: {lc_loss}")

Output: tensor([0.0338], grad_fn=<AddBackward0>)
Loss: tensor([0.0011], grad_fn=<MulBackward0>)


Clear any gradients that were previously stored.

In [44]:
opt.zero_grad()
print_opt()

Weights
Parameter containing:
tensor([[-0.2491,  0.2500,  0.2491,  0.2482, -0.0555]], requires_grad=True)
Gradient
tensor([[0., 0., 0., 0., 0.]])

Weights
Parameter containing:
tensor([-0.0491], requires_grad=True)
Gradient
tensor([0.])

Updates_Copy:
[tensor([[ 0.0009, -0.0000, -0.0009, -0.0018, -0.0026]]), tensor([-0.0088])]

Old_Params_Copy:
[tensor([[-0.2503,  0.2500,  0.2503,  0.2505, -0.0522]]), tensor([-0.0381])]



Backpropagate the loss through the network to get gradients with respect to each weight.

In [45]:
lc_loss.backward()
print_opt()

Weights
Parameter containing:
tensor([[-0.2491,  0.2500,  0.2491,  0.2482, -0.0555]], requires_grad=True)
Gradient
tensor([[-0.0068,  0.0000,  0.0068,  0.0135,  0.0203]])

Weights
Parameter containing:
tensor([-0.0491], requires_grad=True)
Gradient
tensor([0.0675])

Updates_Copy:
[tensor([[ 0.0009, -0.0000, -0.0009, -0.0018, -0.0026]]), tensor([-0.0088])]

Old_Params_Copy:
[tensor([[-0.2503,  0.2500,  0.2503,  0.2505, -0.0522]]), tensor([-0.0381])]



In [46]:
grad(outp)

tensor([-0.0068,  0.0000,  0.0068,  0.0135,  0.0203], grad_fn=<MulBackward0>)

### Step

In [47]:
w = weights() - lr*grad(outp) - update()
w

tensor([[-0.2493,  0.2500,  0.2493,  0.2486, -0.0549]], grad_fn=<SubBackward0>)

In [48]:
(1+inertia)*w - inertia*old_params()

tensor([[-0.2491,  0.2500,  0.2491,  0.2481, -0.0555]], grad_fn=<SubBackward0>)

In [49]:
opt.step()
print_opt()

Weights
Parameter containing:
tensor([[-0.2491,  0.2500,  0.2491,  0.2481, -0.0555]], requires_grad=True)
Gradient
tensor([[-0.0068,  0.0000,  0.0068,  0.0135,  0.0203]])

Weights
Parameter containing:
tensor([-0.0493], requires_grad=True)
Gradient
tensor([0.0675])

Updates_Copy:
[]

Old_Params_Copy:
[tensor([[-0.2493,  0.2500,  0.2493,  0.2486, -0.0549]]), tensor([-0.0470])]
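
The three iterations above can be condensed into one loop. The following plain-Python sketch is a re-creation inferred from the printed outputs (rounded initial weights, hand-written gradient `2*output*inp`, clamping between the two phases, inertia applied from the second iteration on). It is not the code of `optim.FBFSGD`, and all names in it are illustrative assumptions.

```python
lr, inertia, clip = 0.1, 0.25, 0.25
inp = [-0.1, 0.0, 0.1, 0.2, 0.3]

# parameters as one flat list: five weights followed by the bias
params = [-0.3501, 0.3995, 0.4176, 0.3505, -0.0472, -0.0214]


def gradient(p):
    """Gradient of (w . inp + b)^2 with respect to the weights and the bias."""
    output = sum(wi * xi for wi, xi in zip(p[:5], inp)) + p[5]
    return [2 * output * xi for xi in inp] + [2 * output]


old_params = None  # plays the role of old_params_copy (empty before the first step)
for iteration in range(3):
    # extrapolation: descent step, keeping a copy of the applied update
    updates = [-lr * g for g in gradient(params)]
    params = [p + u for p, u in zip(params, updates)]

    # projection onto the cube (weight clipping)
    params = [max(-clip, min(clip, p)) for p in params]

    # step: descent step at the projected point minus the stored update
    w = [p - lr * g - u for p, g, u in zip(params, gradient(params), updates)]

    # inertial combination with the previous plain result, if there is one
    if old_params is None:
        params = w
    else:
        params = [(1 + inertia) * wi - inertia * oi for wi, oi in zip(w, old_params)]
    old_params = w

    print(iteration + 1, [round(p, 4) for p in params])
```

The printed values should agree with the notebook outputs up to the rounding of the initial weights. Note that the result of the step phase is not projected again, so an iterate may lie slightly outside the cube, as happens after iteration 1.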

