# Assignment 2: Training the Fully Recurrent Network



## Exercise 1: Data generation

There are two classes, both occurring with probability 0.5. There is one input unit. Only the first sequence element conveys relevant information about the class. Sequence elements at positions $t > 1$ stem from a Gaussian with mean zero and variance 0.2. The first sequence element is 1.0 (-1.0) for class 1 (2). The target at the sequence end is 1.0 (0.0) for class 1 (2).

Write a function `generate_data` that takes an integer `T` as argument which represents the sequence length. Seed the `numpy` random generator with the number `0xDEADBEEF`. Implement the [Python3 generator](https://docs.python.org/3/glossary.html#term-generator) pattern and produce data in the way described above. The input sequences should have the shape `(T, 1)` and the target values should have the shape `(1,)`.

In [1]:
%matplotlib inline
import numpy as np
from scipy.special import expit as sigmoid
import matplotlib.pyplot as plt

class FullyRecurrentNetwork(object):
    def __init__(self, D, I, K):
        self.W = np.random.uniform(-0.01, 0.01, (I, D))
        self.R = np.random.uniform(-0.01, 0.01, (I, I))
        self.V = np.random.uniform(-0.01, 0.01, (K, I))
        
    def forward(self, x, y):
        # helper function for numerically stable loss
        def f(z):
            return np.log1p(np.exp(-np.absolute(z))) + np.maximum(0, z)
        
        # infer dims
        T, D = x.shape
        K, I = self.V.shape

        # init result arrays
        self.x = x
        self.y = y
        self.a = np.zeros((T, I))

        # iterate forward in time 
        # trick: access model.a[-1] in first iteration
        for t in range(T):
            self.a[t] = np.tanh(self.W @ x[t] + self.R @ self.a[t-1])
    
        self.z = self.V @ self.a[T-1] # logits at the last time step
        
        return y * f(-self.z) + (1-y) * f(self.z)

T, D, I, K = 10, 3, 5, 1
model = FullyRecurrentNetwork(D, I, K)
model.forward(np.random.uniform(-1, 1, (T, D)), 1)

def generate_data(T):
    ########## YOUR SOLUTION HERE ##########
    
    
    # setting seed:
    np.random.seed(0xDEADBEEF)
    
    while True: # infinite loop: every next() call resumes here and yields a new sample
        
        # draw the class with probability 0.5 each (a single Bernoulli trial);
        # the first sequence element is 1.0 for class 1 and -1.0 for class 2:
        first_ele = np.random.binomial(1, p=0.5, size=1).astype('float')
        first_ele[np.where(first_ele==0)] = -1.0

        # setting target:
        if first_ele[0]==1:
            target = np.array([1], dtype='float')
        else:
            target = np.array([0], dtype='float')


        # generate remaining values which come from a gaussian distribution:
        mu = 0
        sigma = np.sqrt(0.2) # std

        rest_ele = np.random.normal(mu,sigma,T-1)

        # Combine first element with random generated elements:
        input_sample = np.append(first_ele, rest_ele).reshape(T,1)


        # generator = A function which returns a generator iterator. 
        # It looks like a normal function except that it contains yield expressions for 
        # producing a series of values usable in a for-loop or that can be retrieved 
        # one at a time with the next() function.
        yield input_sample, target 

data = generate_data(2)


In [8]:
one_data_sample = next(data) # each next() call runs the generator up to its next yield
# and returns one (input, target) pair. Calling next(data) repeatedly, e.g. in a
# for-loop, produces as many samples as needed.
one_data_sample

(array([[ 1.        ],
        [-0.13333295]]),
 array([1.]))
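
The properties required in Exercise 1 can be checked empirically. The following is a minimal sketch, not the notebook's own function: `generate_data_sketch` and `summarize` are hypothetical names, and the sketch assumes a local `np.random.default_rng` generator instead of seeding the global NumPy state, so its samples differ from the output shown above.

```python
import numpy as np

def generate_data_sketch(T, seed=0xDEADBEEF):
    """Hypothetical variant of generate_data that uses a local random generator."""
    rng = np.random.default_rng(seed)
    while True:
        label = rng.integers(0, 2)                 # 1 -> class 1, 0 -> class 2
        x = rng.normal(0.0, np.sqrt(0.2), size=(T, 1))
        x[0, 0] = 1.0 if label == 1 else -1.0      # only the first element is informative
        yield x, np.array([float(label)])

def summarize(T, n_samples=2000):
    """Draw n_samples sequences and stack inputs and targets."""
    data = generate_data_sketch(T)
    samples = [next(data) for _ in range(n_samples)]
    xs = np.stack([x for x, _ in samples])         # shape (n_samples, T, 1)
    ys = np.stack([y for _, y in samples])         # shape (n_samples, 1)
    return xs, ys

xs, ys = summarize(T=5)
print("input shape:", xs.shape[1:], "target shape:", ys.shape[1:])
print("fraction of class 1:", ys.mean())
print("variance of the noise elements:", xs[:, 1:, 0].var())
```

The class fraction should be close to 0.5 and the noise variance close to 0.2. Both are estimates from a finite sample, so small deviations are expected.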

## Exercise 2: Gradients for the network parameters
Compute gradients of the total loss 
$$
L = \sum_{t=1}^T L(t), \quad \text{where} \quad L(t) = L(z(t), y(t))
$$
w.r.t. the weights of the fully recurrent network. To this end, find the derivative of the loss w.r.t. the logits and hidden pre-activations first, i.e., 
$$
\psi^\top(t) = \frac{\partial L}{\partial z(t)} \quad \text{and} \quad \delta^\top(t) = \frac{\partial L}{\partial s(t)}.
$$
With the help of these intermediate results you should be able to compute the gradients w.r.t. the weights, i.e., $\nabla_W L, \nabla_R L, \nabla_V L$. 

*Hint: Take a look at the computational graph from the previous assignment to see the functional dependencies.*

*Remark: Although we only have one label at the end of the sequence, we consider the more general case of evaluating a loss at every time step in this exercise (many-to-many mapping).*

########## YOUR SOLUTION HERE ##########

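1. Get $\psi^\top(t) = \frac{\partial L}{\partial z(t)}$. Only the summand $L(t)$ depends on $z(t)$, so the sum over time can be ignored. With the numerically stable loss from assignment 1, exercise 2, written as $L(z, y) = -y z + \log(1 + e^{-|z|}) + \max(0, z)$, we get
$$
\psi^\top(t) = \frac{\partial L(z(t),y(t))}{\partial z(t)} = -y(t) - \frac{\operatorname{sign}(z(t))}{e^{|z(t)|} + 1} + \mathbb{1}[z(t) > 0]
$$
For $z(t) > 0$ as well as for $z(t) < 0$ this simplifies to the solution:
$$
\psi^\top(t) = \sigma(z(t)) - y(t) = \hat{y}(t) - y(t) \quad \text{with} \quad \hat{y}(t) = \sigma(z(t))
$$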
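
The result $\psi^\top(t) = \sigma(z(t)) - y(t)$ can be compared against a central difference of the loss. This is a minimal sketch: `stable_loss` and `psi` are hypothetical helper names, and the test logits and `eps` are arbitrary choices.

```python
import numpy as np

def stable_loss(z, y):
    # binary cross-entropy on the logit z, written as softplus(z) - y * z
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(0.0, z) - y * z

def psi(z, y):
    # analytic derivative of the loss w.r.t. the logit: sigmoid(z) - y
    return 1.0 / (1.0 + np.exp(-z)) - y

z = np.array([-3.0, -0.5, 0.7, 4.0])   # logits on both sides of zero
eps = 1e-6
for y in (0.0, 1.0):
    numeric = (stable_loss(z + eps, y) - stable_loss(z - eps, y)) / (2 * eps)
    print("y =", y, "max abs difference:", np.max(np.abs(numeric - psi(z, y))))
```

Negative logits are included on purpose, because the sign of $z(t)$ is exactly where a derivative of the $|z|$ term can go wrong.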

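2. Get $\delta^\top(t) = \frac{\partial L}{\partial s(t)}$. The activation $a(t)$ affects the loss through the logits $z(t)$ and through the next pre-activation $s(t+1)$:
$$
\begin{aligned}
\frac{\partial L}{\partial s(t)}
&= \frac{\partial L}{\partial a(t)} \cdot \frac{\partial a(t)}{\partial s(t)}
= \left(\frac{\partial L(z(t),y(t))}{\partial a(t)}
+ \frac{\partial L}{\partial s(t+1)} \frac{\partial s(t+1)}{\partial a(t)}\right)
\cdot \frac{\partial a(t)}{\partial s(t)} \\
&= \left(\frac{\partial L(z(t),y(t))}{\partial z(t)} \frac{\partial z(t)}{\partial a(t)}
+ \delta^\top(t+1) \frac{\partial (W x(t+1) + R a(t))}{\partial a(t)}\right)
\cdot \operatorname{diag}(\tanh'(s(t))) \\
&= \left(\psi^\top(t) \frac{\partial V a(t)}{\partial a(t)}
+ \delta^\top(t+1) R\right)
\cdot \operatorname{diag}(\tanh'(s(t))) \\
&= \left(\psi^\top(t) V + \delta^\top(t+1) R\right)
\cdot \operatorname{diag}(1 - \tanh^2(s(t)))
\end{aligned}
$$
The recursion starts at the sequence end with $\delta^\top(T+1) = 0$. Because $\psi^\top(t)$ and $\delta^\top(t)$ are row vectors, $V$ and $R$ appear without a transpose.

Solution: we use diag() because $\tanh$ is applied elementwise. $a_i(t)$ depends only on $s_i(t)$, so all off-diagonal entries of the Jacobian $\frac{\partial a(t)}{\partial s(t)}$ are zero.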
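
The diagonal structure of $\frac{\partial a(t)}{\partial s(t)}$ can be illustrated numerically. The sketch below uses an arbitrary random vector `s` and a hypothetical row vector `row` that stands in for the bracketed term of the recursion. It also shows that multiplying by the diagonal matrix is the same as an elementwise product.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
s = rng.normal(size=n)
eps = 1e-6

# numerical Jacobian of a = tanh(s): column j holds the derivative w.r.t. s_j
jac = np.zeros((n, n))
for j in range(n):
    step = np.zeros(n)
    step[j] = eps
    jac[:, j] = (np.tanh(s + step) - np.tanh(s - step)) / (2 * eps)

analytic = np.diag(1 - np.tanh(s) ** 2)
print("max abs difference to diag(1 - tanh^2):", np.max(np.abs(jac - analytic)))

# a row vector times a diagonal matrix equals an elementwise product
row = rng.normal(size=n)   # stands in for psi^T V + delta^T(t+1) R
via_diag = row @ analytic
via_elementwise = row * (1 - np.tanh(s) ** 2)
print("same result:", np.allclose(via_diag, via_elementwise))
```

In code, the elementwise form avoids building an $I \times I$ matrix in every time step.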

3. Get $\nabla_V L$. Since $z(t) = V a(t)$, the entry $V_{ki}$ influences $L(t)$ only through $z_k(t)$:
$$
\frac{\partial L(t)}{\partial V_{ki}}
= \frac{\partial L(z(t),y(t))}{\partial z_k(t)} \cdot \frac{\partial z_k(t)}{\partial V_{ki}}
= \psi_k(t) \, a_i(t),
\quad \text{i.e.} \quad
\nabla_V L(t) = \psi(t) \, a^\top(t) = \psi(t) \tanh(s(t))^\top
$$
Sum that up over time:
$$
\nabla_V L = \sum_{t=1}^T \psi(t) \, a^\top(t)
$$

4. Get $\nabla_R L$. Since $s(t) = W x(t) + R a(t-1)$, the entry $R_{ij}$ influences $s_i(t)$ directly with factor $a_j(t-1)$:
$$
\frac{\partial L}{\partial s_i(t)} \cdot \frac{\partial s_i(t)}{\partial R_{ij}}
= \frac{\partial L}{\partial s_i(t)} \cdot \frac{\partial (W x(t) + R a(t-1))_i}{\partial R_{ij}}
= \delta_i(t) \, a_j(t-1)
$$
Sum that up over time, where the term for $t = 1$ vanishes because $a(0) = 0$:
$$
\nabla_R L = \sum_{t=1}^T \delta(t) \, a^\top(t-1)
$$

5. Get $\nabla_W L$. Since $s(t) = W x(t) + R a(t-1)$, the entry $W_{id}$ influences $s_i(t)$ directly with factor $x_d(t)$:
$$
\frac{\partial L}{\partial s_i(t)} \cdot \frac{\partial s_i(t)}{\partial W_{id}}
= \frac{\partial L}{\partial s_i(t)} \cdot \frac{\partial (W x(t) + R a(t-1))_i}{\partial W_{id}}
= \delta_i(t) \, x_d(t)
$$
Sum that up over time:
$$
\nabla_W L = \sum_{t=1}^T \delta(t) \, x^\top(t)
$$
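
The many-to-many formulas for $\psi$, $\delta$ and the three weight gradients can be sketched in NumPy and compared against central differences. This is an illustration under assumptions, not the assignment's reference code: the function names `forward`, `gradients` and `numeric`, the sizes, the random weights and the random targets are all made up for the check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W, R, V, x, y):
    """Total loss with one binary cross-entropy term per time step."""
    T, I = x.shape[0], R.shape[0]
    a = np.zeros((T, I))
    prev = np.zeros(I)                            # a(0) = 0
    for t in range(T):
        a[t] = np.tanh(W @ x[t] + R @ prev)
        prev = a[t]
    z = a @ V.T                                   # logits, shape (T, K)
    loss = np.sum(np.logaddexp(0.0, z) - y * z)   # softplus(z) - y * z, summed over t
    return loss, a, z

def gradients(W, R, V, x, y):
    _, a, z = forward(W, R, V, x, y)
    T, I = a.shape
    psi = sigmoid(z) - y                          # row t holds psi^T(t)
    delta = np.zeros((T, I))
    upstream = np.zeros(I)                        # delta^T(t+1) R, zero at the sequence end
    for t in reversed(range(T)):
        delta[t] = (psi[t] @ V + upstream) * (1 - a[t] ** 2)
        upstream = delta[t] @ R
    a_prev = np.vstack([np.zeros((1, I)), a[:-1]])    # row t holds a(t-1)
    return delta.T @ x, delta.T @ a_prev, psi.T @ a   # dW, dR, dV

rng = np.random.default_rng(1)
T, D, I, K = 6, 2, 4, 1
W, R, V = (rng.normal(scale=0.5, size=shape) for shape in [(I, D), (I, I), (K, I)])
x = rng.normal(size=(T, D))
y = rng.integers(0, 2, size=(T, K)).astype(float)
dW, dR, dV = gradients(W, R, V, x, y)

def numeric(param, eps=1e-6):
    """Central differences, perturbing one entry of param at a time."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(*param.shape):
        old = param[idx]
        param[idx] = old + eps
        plus = forward(W, R, V, x, y)[0]
        param[idx] = old - eps
        minus = forward(W, R, V, x, y)[0]
        param[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

for name, analytic, param in [("dW", dW, W), ("dR", dR, R), ("dV", dV, V)]:
    print(name, "max abs difference:", np.max(np.abs(analytic - numeric(param))))
```

The matrix products `delta.T @ x`, `delta.T @ a_prev` and `psi.T @ a` compute the sums over time of the outer products in one step each.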

## Exercise 3: The backward pass
Write a function `backward` that takes a model `self` as argument. The function should compute the gradients of the loss with respect to all model parameters and store them to `self.dW`, `self.dR`, `self.dV`, respectively. 

In [None]:
def backward(self):
    ########## YOUR SOLUTION HERE ##########
    T, I = self.a.shape

    # dz: forward() evaluates the loss only at the last time step (many-to-one),
    # so there is a single psi = dL/dz = sigmoid(z) - y of shape (K,)
    dz = sigmoid(self.z) - self.y

    # a2: activations shifted by one time step, a2[t] = a(t-1) with a(0) = 0
    a2 = np.concatenate((np.zeros((1, I)), self.a[:-1]), axis=0)

    # ds: delta(t) = dL/ds(t), filled backward in time.
    # tanh'(s(t)) = 1 - a(t)**2 is applied elementwise, which equals the
    # multiplication with np.diag(1 - a(t)**2)
    ds = np.zeros((T, I))
    ds[-1] = (dz @ self.V) * (1 - self.a[-1]**2)
    for t in reversed(range(T-1)):
        ds[t] = (ds[t+1] @ self.R) * (1 - self.a[t]**2)

    self.dz = dz
    self.ds = ds

    # dV: outer product of psi and the last activation, shape (K, I)
    self.dV = np.outer(dz, self.a[-1])

    # dR: sum over time of the outer products delta(t) a(t-1)^T, shape (I, I)
    self.dR = ds.T @ a2

    # dW: sum over time of the outer products delta(t) x(t)^T, shape (I, D)
    self.dW = ds.T @ self.x

FullyRecurrentNetwork.backward = backward
model.backward()

In [None]:
# Solution:
def backward(self):
    ########## YOUR SOLUTION HERE ##########
    T, D = self.x.shape
    K, I = self.V.shape
    
    self.dW, self.dR, self.dV = np.zeros_like(self.W),np.zeros_like(self.R),np.zeros_like(self.V)
    delta = np.zeros((T,I))
    
    psi_T = sigmoid(self.z) - self.y
    self.dV = psi_T[:,None] @ self.a[T-1][None,:]
    
    for t in reversed(range(T)):
        if t == T-1:
            delta[t] = psi_T @ self.V @ np.diag(1-self.a[t]**2)
        else:
            delta[t] = delta[t+1] @ self.R @ np.diag(1-self.a[t]**2)
            
        self.dW += delta[t][:,None] @ self.x[t][None,:]
        
        if t != 0:
            self.dR += delta[t][:,None] @ self.a[t-1][None,:]

FullyRecurrentNetwork.backward = backward
model.backward()

## Exercise 4: Gradient checking
Write a function `grad_check` that takes a model `self`, a float `eps` and another float `thresh` as arguments and computes the numerical gradients of the model parameters according to the approximation
$$
f'(x) \approx \frac{f(x + \varepsilon) - f(x - \varepsilon)}{2 \varepsilon}.
$$
If any of the analytical gradients are farther than `thresh` away from the numerical gradients the function should throw an error. 

In [None]:
def grad_check(self, eps, thresh):
    ########## YOUR SOLUTION HERE ##########

    # numeric part: perturb one weight at a time and re-run the whole forward pass,
    # so that the central difference covers the full path from the weight to the loss
    def numeric_deri(weights):
        numeric_grad = np.zeros_like(weights)
        for idx in np.ndindex(*weights.shape):
            original = weights[idx]
            weights[idx] = original + eps
            loss_plus = np.sum(self.forward(self.x, self.y))
            weights[idx] = original - eps
            loss_minus = np.sum(self.forward(self.x, self.y))
            weights[idx] = original # restore the weight
            numeric_grad[idx] = (loss_plus - loss_minus) / (2*eps)
        return numeric_grad

    numeric_dV = numeric_deri(self.V)
    numeric_dR = numeric_deri(self.R)
    numeric_dW = numeric_deri(self.W)

    # the perturbed forward passes overwrote self.a and self.z, recompute them:
    self.forward(self.x, self.y)

    # analytic part:
    analytic_dV = self.dV
    analytic_dR = self.dR
    analytic_dW = self.dW

    # comparison:
    check_dV = np.allclose(numeric_dV, analytic_dV, atol=thresh)
    # atol: absolute tolerance (np.allclose additionally applies its default
    # relative tolerance rtol=1e-5)
    check_dR = np.allclose(numeric_dR, analytic_dR, atol=thresh)
    check_dW = np.allclose(numeric_dW, analytic_dW, atol=thresh)

    if not check_dV:
        raise Exception('Numeric and analytic dV differ too strongly!')

    if not check_dR:
        raise Exception('Numeric and analytic dR differ too strongly!')

    if not check_dW:
        raise Exception('Numeric and analytic dW differ too strongly!')

FullyRecurrentNetwork.grad_check = grad_check
model.grad_check(1e-7, 1e-7)

In [None]:
# Solution:
def my_numeric_grad(self,weight_matrix, eps):

    numeric_dWRV = np.zeros_like(weight_matrix) # the numeric gradient (like the
    # analytic one) has the same shape as the corresponding weight_matrix
        
    for row_ind,row in enumerate(weight_matrix):
        for col_ind,col in enumerate(row):
            
            # first part numeric formula:
            weight_matrix[row_ind,col_ind] += eps
            loss_f_numeric_one = self.forward(self.x,self.y)

            # second part numeric formula:
            weight_matrix[row_ind,col_ind] -= 2*eps
            loss_f_numeric_two = self.forward(self.x,self.y)
            
            # restore the original value of the weight:
            weight_matrix[row_ind,col_ind] += eps

            # forward() returns an array of shape (K,), np.sum turns it into a scalar:
            numeric_dWRV[row_ind,col_ind] = np.sum(loss_f_numeric_one-loss_f_numeric_two) / (2*eps)

    return numeric_dWRV

In [None]:
# Solution:
def grad_check(self, eps, thresh):
    ########## YOUR SOLUTION HERE ##########
    

    # numeric_part:
    numeric_dV = self.my_numeric_grad(self.V, eps)
    numeric_dW = self.my_numeric_grad(self.W, eps)
    numeric_dR = self.my_numeric_grad(self.R, eps)
    
    
    # analytic part:
    analytic_dV = self.dV
    analytic_dR = self.dR
    analytic_dW = self.dW
    
    # comparison:   
    check_dV = np.allclose(numeric_dV, analytic_dV, atol=thresh) 
    # atol: compare the absolute difference between two values.
    check_dR = np.allclose(numeric_dR, analytic_dR, atol=thresh) 
    check_dW =  np.allclose(numeric_dW, analytic_dW, atol=thresh) 
    
    if not check_dV:
        raise Exception('Numeric and analytic dV differ too strongly!')

    if not check_dR:
        raise Exception('Numeric and analytic dR differ too strongly!')
    
    if not check_dW:
        raise Exception('Numeric and analytic dW differ too strongly!')
    
FullyRecurrentNetwork.my_numeric_grad = my_numeric_grad
FullyRecurrentNetwork.grad_check = grad_check
model.grad_check(1e-7, 1e-7)
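
The choice of `eps` matters for the check. The truncation error of the central difference shrinks with $\varepsilon^2$, while the floating point rounding error grows with $1/\varepsilon$. The sketch below illustrates this on a scalar function with a known derivative. The function, the evaluation point `x0` and the tested step sizes are arbitrary assumptions.

```python
import numpy as np

def central_difference(f, x, eps):
    # f'(x) is approximated by (f(x + eps) - f(x - eps)) / (2 * eps)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def tanh_prime(x):
    # exact derivative of tanh, used as ground truth
    return 1 - np.tanh(x) ** 2

x0 = 0.3
errors = {}
for eps in (1e-1, 1e-3, 1e-5, 1e-7):
    errors[eps] = abs(central_difference(np.tanh, x0, eps) - tanh_prime(x0))
    print(f"eps={eps:.0e}  abs error={errors[eps]:.2e}")
```

For large `eps` the error is dominated by truncation, for very small `eps` by rounding. A threshold for the gradient check should therefore be chosen with the step size in mind.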

## Exercise 5: Parameter update

Write a function `update` that takes a model `self` and a float argument `eta`, which represents the learning rate. The method should implement the gradient descent update rule $\theta \gets \theta - \eta \nabla_{\theta}L$ for all model parameters $\theta$.

In [None]:
def update(self, eta):
    ########## YOUR SOLUTION HERE ##########

    self.V = self.V - eta * self.dV # self.V -= eta * self.dV 
    self.R = self.R - eta * self.dR
    self.W = self.W - eta * self.dW

    return

FullyRecurrentNetwork.update = update
model.update(0.001)
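
The update rule $\theta \gets \theta - \eta \nabla_{\theta}L$ can be illustrated on a toy problem that is independent of the network. This is a minimal sketch with an assumed quadratic loss $L(\theta) = \frac{1}{2}\|\theta\|_2^2$, whose gradient is $\theta$ itself. `gradient_step` is a hypothetical helper name, and the learning rate and step count are arbitrary.

```python
import numpy as np

def gradient_step(theta, grad, eta):
    # one gradient descent update: theta <- theta - eta * grad
    return theta - eta * grad

theta = np.array([2.0, -1.0])
losses = []
for _ in range(50):
    losses.append(0.5 * np.sum(theta ** 2))      # L(theta) = 0.5 * ||theta||^2
    theta = gradient_step(theta, theta, eta=0.1) # the gradient of L is theta

print("first loss:", losses[0], "last loss:", losses[-1])
```

With `eta=0.1` every step scales $\theta$ by 0.9, so the loss decreases monotonically. For this toy loss a learning rate above 2 would make the iterates diverge.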

## Exercise 6: Network training

Train the fully recurrent network with 32 hidden units. Start with input sequences of length one and tune the learning rate and the number of update steps. Then increase the sequence length by one and tune the hyperparameters again. What is the maximal sequence length for which the fully recurrent network can achieve a performance that is better than random? Visualize your results. 

In [None]:
########## YOUR SOLUTION HERE ##########
FullyRecurrentNetwork.backward = backward
FullyRecurrentNetwork.update = update

T_big,epoch_size = 5,10
best_lr,best_updating,best_loss = np.zeros((epoch_size,T_big)),np.zeros(T_big,),np.ones((epoch_size,T_big))*float('inf')
I, K =  32, 1 # x with NxD is 1D # I as number of hidden units

for ind_T,T in enumerate(range(1,T_big+1)):

    loss_old = float('inf')
    one_data_sample = next(generate_data(T)) # generate_data reseeds numpy on every call,
    # so the sample for T repeats the sample for T-1 and appends one new element
    x, y = one_data_sample


    for update_number in [1,2,3]: # update in every / every second / every third epoch

        for lr in [0.0005,0.01,0.05,0.1]:
            
            D = 1
            model2 = FullyRecurrentNetwork(D, I, K)

            for epoch in range(0,epoch_size):

                #for t in range(T):
                loss_f = model2.forward(x, y[0])
                
                if epoch % update_number == 0:
                    model2.backward() # compute the gradients
                    model2.update(lr) # gradient descent step on the weights


                if best_loss[epoch,ind_T] > loss_f[0]:
                    best_lr[epoch,ind_T] = lr
                    best_loss[epoch,ind_T] = loss_f[0]
                    
        if best_loss[epoch,ind_T] == loss_f[0]:
            best_updating[ind_T] = update_number

In [None]:
epoch_range = [epoch for epoch in range(0,epoch_size)]
legend_list = []

for ind,col in enumerate(best_loss.T):
    plt.plot(epoch_range,col)
    legend_list.append(f'T={ind+1}')
    
plt.legend(legend_list)
plt.title('best loss scores per epoch for different T')
plt.xlabel('epochs')
plt.ylabel('loss')
plt.show()

In [None]:
for ind,col in enumerate(best_lr.T):
    plt.plot(epoch_range,col)

plt.legend(legend_list)
plt.title('best learning rate values per epoch for different T')
plt.xlabel('epochs')
plt.ylabel('learning rate')
plt.show()

In [None]:
best_updating = best_updating.astype('int')
plt.scatter([t for t in range(1,T_big+1)],best_updating)

plt.title('best values for how often to update weights for different T')
plt.xlabel('T')
plt.ylabel('update in every n-th epoch (n)')
plt.show()
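
To answer whether the network is better than random, the accuracy on samples that were not used for training is more informative than the training loss on a single sample. The following is a hedged, self-contained sketch and not the procedure used in the cells above: `sample`, `run` and `train_and_evaluate` are hypothetical names, one fresh sequence is drawn per update, and `eta`, `steps` and `n_test` are untuned assumptions.

```python
import numpy as np

def sample(rng, T):
    """One sequence of length T and its target, as described in Exercise 1."""
    label = rng.integers(0, 2)
    x = rng.normal(0.0, np.sqrt(0.2), size=(T, 1))
    x[0, 0] = 1.0 if label == 1 else -1.0
    return x, float(label)

def run(W, R, V, x):
    """Forward pass, returns all activations and the final logit."""
    a = np.zeros((x.shape[0], R.shape[0]))
    prev = np.zeros(R.shape[0])
    for t in range(x.shape[0]):
        a[t] = np.tanh(W @ x[t] + R @ prev)
        prev = a[t]
    return a, (V @ a[-1])[0]

def train_and_evaluate(T, eta=0.1, steps=500, n_test=200, hidden=32, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.uniform(-0.01, 0.01, (hidden, 1))
    R = rng.uniform(-0.01, 0.01, (hidden, hidden))
    V = rng.uniform(-0.01, 0.01, (1, hidden))
    for _ in range(steps):
        x, y = sample(rng, T)                     # one fresh sequence per update
        a, z = run(W, R, V, x)
        psi = 1.0 / (1.0 + np.exp(-z)) - y        # dL/dz at the sequence end
        dV = psi * a[-1][None, :]
        dW, dR = np.zeros_like(W), np.zeros_like(R)
        delta = psi * V[0] * (1 - a[-1] ** 2)     # delta(T)
        for t in reversed(range(T)):
            dW += np.outer(delta, x[t])
            if t > 0:
                dR += np.outer(delta, a[t - 1])
                delta = (delta @ R) * (1 - a[t - 1] ** 2)   # delta(t-1)
        W -= eta * dW
        R -= eta * dR
        V -= eta * dV
    hits = 0
    for _ in range(n_test):                       # evaluate on unseen samples
        x, y = sample(rng, T)
        hits += int((run(W, R, V, x)[1] > 0) == (y == 1.0))
    return hits / n_test

accuracies = {T: train_and_evaluate(T) for T in (1, 2, 3)}
for T, acc in accuracies.items():
    print(f"T={T}: accuracy on unseen samples = {acc:.2f} (chance level: 0.50)")
```

An accuracy that stays close to 0.5 for a given `T` indicates that the network did not pick up the information in the first sequence element. The hyperparameters would still have to be tuned per sequence length as the exercise asks.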

## Exercise 7: The Vanishing Gradient Problem

Analyze why the network is incapable of learning long-term dependencies. Show that $\|\frac{\partial a(T)}{\partial a(1)}\|_2 \leq \|R\|_2^{T-1}$ , where $\|\cdot\|_2$ is the spectral norm, and discuss how that affects the propagation of error signals through the time dimension of the network. 

*Hint: Use the fact that the spectral norm is submultiplicative for square matrices, i.e. $\|AB\|_2 \leq \|A\|_2\|B\|_2$ if $A$ and $B$ are both square.*

########## YOUR SOLUTION HERE ##########
$$
\frac{\partial a(t)}{\partial a(t-1)} = \frac{\partial \tanh(s(t))}{\partial a(t-1)}
= \frac{\partial \tanh(s(t))}{\partial s(t)} \frac{\partial s(t)}{\partial a(t-1)}
= \operatorname{diag}(1 - \tanh^2(s(t))) \, R
$$

It follows from the chain rule:
$$
\frac{\partial a(T)}{\partial a(1)}
= \frac{\partial a(T)}{\partial a(T-1)} \frac{\partial a(T-1)}{\partial a(T-2)} \cdots \frac{\partial a(2)}{\partial a(1)}
= \prod_{t'=0}^{(T-1)-1} \frac{\partial a(T-t')}{\partial a(T-1-t')}
= \prod_{t'=0}^{(T-1)-1} \operatorname{diag}(1 - \tanh^2(s(T-t'))) \, R
$$
The product has $T-1$ factors, ordered from left to right with increasing $t'$.

Side note: $0 < 1 - \tanh^2(s) \leq 1$, i.e. the derivative of $\tanh$ never amplifies a signal, and many multiplications like in the product above can only keep or shrink it.

Solution start <br>
The matrices $\operatorname{diag}(\cdot)$ and $R$ do not commute, so the product cannot be rearranged into $R^{T-1}$ times a product of scalars. Instead, the norm rules are applied to the single factors of the product above. <br>
Solution end

Spectral norm: Knowledge derived from https://de.wikipedia.org/wiki/Spektralnorm . <br>
The spectral norm of a matrix $A$ is the square root of the largest eigenvalue of $A^\top A$, i.e. the largest singular value of $A$:
$$
\det(\lambda I - A^\top A) = 0 \;\Rightarrow\; \lambda_1, \lambda_2, \dots
\qquad
\|A\|_2 = \sqrt{\max_i \lambda_i}
$$
For a diagonal matrix this is the largest absolute diagonal entry, therefore
$$
\|\operatorname{diag}(1 - \tanh^2(s(t)))\|_2 = \max_i \left(1 - \tanh^2(s_i(t))\right) \leq 1
$$

Furthermore, for square matrices the spectral norm is submultiplicative (see the hint and https://de.wikipedia.org/wiki/Spektralnorm ):
$$
\|A B\|_2 \leq \|A\|_2 \|B\|_2
$$
All factors of the product are square matrices of shape $I \times I$, so the rule can be applied repeatedly.

It follows:
$$
\left\|\frac{\partial a(T)}{\partial a(1)}\right\|_2
\leq \prod_{t=2}^{T} \|\operatorname{diag}(1 - \tanh^2(s(t)))\|_2 \, \|R\|_2
\leq \prod_{t=2}^{T} \|R\|_2
= \|R\|_2^{T-1}
\qquad \square
$$

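Vanishing gradients: If $\|R\|_2 < 1$, the bound $\|R\|_2^{T-1}$ shrinks exponentially with the sequence length $T$. This is the case for the small initial weights used here: the entries of $R$ lie in $[-0.01, 0.01]$, so for 32 hidden units $\|R\|_2 \leq \|R\|_F \leq 0.32$. From the inequality it follows that $\left\|\frac{\partial a(T)}{\partial a(1)}\right\|_2$ is at least as small, and therefore all entries of $\frac{\partial a(T)}{\partial a(1)}$ are very small as well. The error signal from the loss at the sequence end reaches the first time step only through this Jacobian, so $\delta(1)$, and with it the gradient contribution of the first input (the only informative one in our task), almost vanishes. The weights then receive nearly no learning signal about the long-term dependency, which is the vanishing gradient problem.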
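
The inequality can be observed numerically by multiplying the step Jacobians $\operatorname{diag}(1 - a(t)^2) \, R$ along one sequence. This is a sketch under assumptions: weights are drawn like in `FullyRecurrentNetwork.__init__` but with a local random generator, the sequence length `T = 20` is arbitrary, and the input is one sample of class 1.

```python
import numpy as np

rng = np.random.default_rng(0)
I, T = 32, 20
W = rng.uniform(-0.01, 0.01, (I, 1))
R = rng.uniform(-0.01, 0.01, (I, I))
x = rng.normal(0.0, np.sqrt(0.2), size=(T, 1))
x[0, 0] = 1.0                       # first element of a class 1 sequence

# forward pass, row t of a holds a(t+1) in the 1-based notation of the text
a, prev = np.zeros((T, I)), np.zeros(I)
for t in range(T):
    a[t] = np.tanh(W @ x[t] + R @ prev)
    prev = a[t]

R_norm = np.linalg.norm(R, 2)       # spectral norm = largest singular value
jacobian = np.eye(I)                # d a(1) / d a(1)
norms, bounds = [], []
for t in range(1, T):
    # multiply with d a(t+1) / d a(t) = diag(1 - a(t+1)^2) R from the left
    jacobian = np.diag(1 - a[t] ** 2) @ R @ jacobian
    norms.append(np.linalg.norm(jacobian, 2))
    bounds.append(R_norm ** t)

for steps in (1, 5, 10, 19):
    print(f"{steps:2d} steps: norm={norms[steps - 1]:.3e}  bound={bounds[steps - 1]:.3e}")
```

Each printed norm stays below its bound, and both decay exponentially with the number of time steps. This is the reason why the error signal from the sequence end barely reaches the first input.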

Notebook done by Nina Braunmiller k11923286