# (a)
*What are the different forms of sequence mapping allowed by recurrent neural networks? Give an example application for each form.*

 - Many-to-One: Time-series classification
 - One-to-Many: Image captioning
 - Many-to-Many (shape of input and output the same): Named entity recognition
 - Many-to-Many (shape of input and output not the same): Language translation
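
The four forms above can be illustrated with a toy scalar RNN. This is only a sketch: the function names, the weights `w_h` and `w_x`, and the choice to feed each output back as the next input in the One-to-Many case are assumptions made for illustration, not part of the course material.

```python
import math

def step(h, x, w_h=0.5, w_x=1.0):
    """One recurrent update of a scalar tanh cell (weights are illustrative)."""
    return math.tanh(w_h * h + w_x * x)

def many_to_one(xs):
    """Consume the whole sequence and return only the last hidden state."""
    h = 0.0
    for x in xs:
        h = step(h, x)
    return h

def many_to_many_aligned(xs):
    """Return one output per input element (same length as the input)."""
    h, outputs = 0.0, []
    for x in xs:
        h = step(h, x)
        outputs.append(h)
    return outputs

def one_to_many(x, length):
    """Unroll `length` outputs from a single input by feeding outputs back in."""
    h, outputs = step(0.0, x), []
    for _ in range(length):
        outputs.append(h)
        h = step(h, h)  # previous output becomes the next input
    return outputs

def many_to_many_unaligned(xs, length):
    """Encode the input to one state, then decode a sequence of another length."""
    return one_to_many(many_to_one(xs), length)

xs = [0.1, 0.2, 0.3]
print(many_to_one(xs))                     # a single value
print(len(many_to_many_aligned(xs)))       # as many outputs as inputs
print(len(one_to_many(0.5, 4)))            # output length chosen freely
print(len(many_to_many_unaligned(xs, 5)))  # output length differs from input
```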

# (b)
*Compute the number of parameters to be trained for a two-layer SimpleRNN and softmax with hidden state dimensions 32 and 64, respectively, 10 classes to classify in the softmax and inputs given by sequences of length 100 and each element a vector of dimension 30.*


$$
(32\times 30 + 32 \times 32 + 32) + (64\times32 +64\times 64 + 64) + (10\times64 + 10) = 8874
$$
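
The same count can be reproduced with a few lines of plain Python. The helper names `simple_rnn_params` and `dense_params` are hypothetical and only mirror the three terms of the formula above; note that the sequence length 100 does not enter the count, because the weights are shared across time steps.

```python
def simple_rnn_params(input_dim, units):
    """Input kernel + recurrent kernel + bias of one SimpleRNN layer."""
    return units * input_dim + units * units + units

def dense_params(input_dim, units):
    """Weight matrix + bias of the dense (softmax) layer."""
    return units * input_dim + units

layer_1 = simple_rnn_params(input_dim=30, units=32)
layer_2 = simple_rnn_params(input_dim=32, units=64)
softmax = dense_params(input_dim=64, units=10)
total = layer_1 + layer_2 + softmax

print(layer_1, layer_2, softmax, total)  # 2016 6208 650 8874
```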



In [1]:
import tensorflow.keras as keras
model = keras.models.Sequential()
model.add(keras.layers.SimpleRNN(32, input_shape=(100, 30), 
                    return_sequences=True))
model.add(keras.layers.SimpleRNN(64, 
                    return_sequences=False))
model.add(keras.layers.Dense(10))
model.add(keras.layers.Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', 
              metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
simple_rnn (SimpleRNN)       (None, 100, 32)           2016      
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 64)                6208      
_________________________________________________________________
dense (Dense)                (None, 10)                650       
_________________________________________________________________
activation (Activation)      (None, 10)                0         
=================================================================
Total params: 8,874
Trainable params: 8,874
Non-trainable params: 0
_________________________________________________________________


# (c)
*Why is gradient clipping needed more for long sentences than for short ones?*

As we have seen in class, computing the gradient of the loss with respect to the weight matrix $W_h$ involves a product of Jacobians whose number of factors grows with the time index $t$:

$$
\frac{\partial L_{\mathrm{ce}}}{\partial W_{h}}=\frac{\partial L_{\mathrm{ce}}}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial h_{t}} \sum_{s=1}^{t+1}\left(\prod_{\tau=s}^{t} \frac{\partial h_{\tau}}{\partial h_{\tau-1}}\right) \frac{\partial h_{s-1}}{\partial W_{h}}
$$

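The longer the sequence, the more factors $\partial h_{\tau} / \partial h_{\tau-1}$ this product contains. If their norms are larger than 1, the product grows exponentially with the sequence length, so long sequences are more likely to produce an exploding gradient and hence to need gradient clipping.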
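
The effect of sequence length on the Jacobian product, and what clipping does about it, can be sketched in plain Python. The example assumes a simplified linear scalar recurrence $h_t = w_h h_{t-1}$, so that each factor equals $w_h$; the names `jacobian_product` and `clip_by_norm` and the value $w_h = 1.2$ are illustrative choices, not taken from the course.

```python
import math

def jacobian_product(w_h, steps):
    """|d h_t / d h_0| for the linear scalar recurrence h_t = w_h * h_{t-1}."""
    return abs(w_h) ** steps

def clip_by_norm(grad, max_norm):
    """Rescale `grad` so that its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

short_seq = jacobian_product(1.2, steps=10)    # short sentence
long_seq = jacobian_product(1.2, steps=100)    # long sentence
print(f"10 steps: {short_seq:.2f}, 100 steps: {long_seq:.2e}")

# Clipping keeps the direction of the gradient but bounds its length.
clipped = clip_by_norm([long_seq, -long_seq], max_norm=1.0)
print(clipped)
```

In Keras, the same kind of clipping is usually requested through the `clipnorm` or `clipvalue` arguments of the optimizer rather than implemented by hand.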


# (d)
*Describe why SimpleRNNs have problems in learning long-term dependencies.*

When computing the gradient for the weights of an RNN with sequence length $n$ in a Many-to-One setting, the gradient contribution of, e.g., the operation $W_x \cdot x_0$ to the total gradient of the weight matrix $W_x$ will be very small compared to that of the operation $W_x \cdot x_n$. The reason is backpropagation through time, as explained in the answer above: the contribution of an early time step is multiplied by many factors $\partial h_{\tau} / \partial h_{\tau-1}$, which shrink it whenever their norms are below 1 (vanishing gradient). Hence the network can hardly learn how the hidden state at time step $n$, which is then used for further processing, depends on the input at time step $0$.
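
A small numerical sketch makes the size of this effect visible. It assumes a scalar tanh RNN $h_t = \tanh(w_h h_{t-1} + w_x x_t)$ with $h_0 = 0$ and inputs indexed $x_1, \dots, x_n$; the function name `input_influence`, the weights, and the sine-wave input are all illustrative assumptions.

```python
import math

def input_influence(w_h, w_x, xs):
    """Return |d h_n / d x_s| for every input position s of a scalar tanh RNN."""
    # Forward pass: h_t = tanh(w_h * h_{t-1} + w_x * x_t), with h_0 = 0.
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w_h * hs[-1] + w_x * x))

    n = len(xs)
    influence = [0.0] * n
    # Backward pass: carry d h_n / d h_t from t = n down to t = 1.
    dhn_dht = 1.0
    for t in range(n, 0, -1):
        local = 1.0 - hs[t] ** 2           # derivative of tanh at step t
        influence[t - 1] = abs(dhn_dht * local * w_x)
        dhn_dht *= local * w_h             # now equals d h_n / d h_{t-1}
    return influence

xs = [math.sin(t) for t in range(1, 51)]   # 50 illustrative inputs
influence = input_influence(w_h=0.5, w_x=1.0, xs=xs)
print(f"first input: {influence[0]:.3e}, last input: {influence[-1]:.3e}")
```

With these settings every factor in the product is at most 0.5 in absolute value, so the influence of the first input on the final hidden state is many orders of magnitude smaller than that of the last input.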

# (e)
*How can you define a generative system? Describe the two approaches seen in class to build generative systems with RNNs.*

 - Many-to-One: We predict the next token $x_{t}$ at time step $t$ from a fixed window of the $n$ previous tokens $[x_{t-n},...,x_{t-1}]$.
 - Many-to-Many: We predict the next token at each time step of the window, i.e. the shifted window $[x_{t-n+1},...,x_{t}]$, from a fixed window of the $n$ previous tokens $[x_{t-n},...,x_{t-1}]$.

The two approaches differ only at training time. When inferring predictions, the models are equivalent.
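
How the training pairs differ between the two approaches can be sketched as follows. The helper names `many_to_one_pairs` and `many_to_many_pairs` and the character-level toy sequence are assumptions for illustration; in the Many-to-Many case the target is taken to be the input window shifted by one token.

```python
def many_to_one_pairs(tokens, n):
    """(window of n tokens, single next token) for every position t."""
    return [(tokens[t - n:t], tokens[t]) for t in range(n, len(tokens))]

def many_to_many_pairs(tokens, n):
    """(window of n tokens, same window shifted by one token)."""
    return [(tokens[t - n:t], tokens[t - n + 1:t + 1])
            for t in range(n, len(tokens))]

tokens = list("hello")

for window, target in many_to_one_pairs(tokens, n=3):
    print(window, "->", target)

for window, target in many_to_many_pairs(tokens, n=3):
    print(window, "->", target)
```

In both cases the last element of the target is the same next token, which is the only prediction used when generating.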
