In [1]:
import numpy as np

In [5]:
help(np.random.choice)

Help on built-in function choice:

choice(...) method of numpy.random.mtrand.RandomState instance
    choice(a, size=None, replace=True, p=None)
    
    Generates a random sample from a given 1-D array
    
    .. versionadded:: 1.7.0
    
    .. note::
        New code should use the ``choice`` method of a ``default_rng()``
        instance instead; please see the :ref:`random-quick-start`.
    
    Parameters
    ----------
    a : 1-D array-like or int
        If an ndarray, a random sample is generated from its elements.
        If an int, the random sample is generated as if it were ``np.arange(a)``
    size : int or tuple of ints, optional
        Output shape.  If the given shape is, e.g., ``(m, n, k)``, then
        ``m * n * k`` samples are drawn.  Default is None, in which case a
        single value is returned.
    replace : boolean, optional
        Whether the sample is with or without replacement. Default is True,
        meaning that a value of ``a`` can be selected mu

## Sampling Novel Sequences (video 7)

What is the purpose of sampling novel sequences?
Oh, it's like having the model learn Shakespeare and then having it generate ("sample") sentences ("sequences").
That is, "sampling from the model that you've trained"
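A minimal sketch of what "sampling from the model" could look like, using `choice` from a `default_rng()` instance as the help text above recommends. `step_fn` is a hypothetical stand-in for one forward step of a trained RNN, and `toy_step` is a made-up model that returns a fixed distribution so the example runs on its own. The point is that each token is drawn at random from the predicted distribution and then fed back in as the next input.

```python
import numpy as np

def sample_sequence(step_fn, vocab_size, eos_index, max_len=50, seed=0):
    """Sample token indices one at a time from a language model.

    step_fn(prev_index, state) -> (probs, new_state) is a hypothetical
    interface standing in for one forward step of the trained RNN.
    """
    rng = np.random.default_rng(seed)
    state, prev, tokens = None, None, []
    for _ in range(max_len):
        probs, state = step_fn(prev, state)
        # Draw from the distribution instead of taking the argmax.
        idx = int(rng.choice(vocab_size, p=probs))
        tokens.append(idx)
        if idx == eos_index:
            break
        prev = idx  # the sampled token becomes the next input
    return tokens

# Toy "model" (assumption): ignores its input, fixed distribution over 3 tokens.
def toy_step(prev_index, state):
    return np.array([0.5, 0.3, 0.2]), state

tokens = sample_sequence(toy_step, vocab_size=3, eos_index=2)
print(tokens)
```

With a real model, `step_fn` would run the RNN cell on the previous token and hidden state and return the softmax output.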

One advantage of character-level language model: you will not encounter unknown elements in a sequence
But they are more computationally expensive to train

## Vanishing Gradient with RNNs (video 8)

A problem with RNNs. Why?
Long-term dependencies; e.g., "The *cat* ... *was* full" vs. "The *cats* ... *were* full", where the later word (was/were) depends on whether the earlier word (cat) was pluralized, so we have to go *way* back in the sequence to find out.

Recall the **vanishing gradient problem**: the gradient has a hard time propagating back to change the weights of early layers (for an RNN, early time steps).
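A rough numerical illustration of why the gradient vanishes, under a simplifying assumption: the recurrence is treated as linear, so going back one time step just multiplies the gradient by $W^T$. The matrix `W` here is random and rescaled so its largest singular value is 0.5 (an arbitrary choice for the demo). A tanh derivative, which is at most 1, would only shrink things further.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
W = rng.standard_normal((n, n))
W *= 0.5 / np.linalg.norm(W, 2)  # rescale: largest singular value is now 0.5

grad = np.ones(n)  # pretend gradient arriving at the last time step
norms = []
for t in range(30):
    grad = W.T @ grad  # one step back through time (activation derivative ignored)
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])  # the norm shrinks by at least half per step
```

If the largest singular value were above 1 instead, the same loop could blow up, which is the exploding gradient case.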


## Gated Recurrent Unit (GRU) (video 9)

Modification to RNN hidden layer that helps with the vanishing gradient problem.

Typical hidden layer:

$$a^{\langle t \rangle} = g(W_a[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_a)$$

GRU unit has new variable, c, the **memory cell**. $c^{\langle t \rangle} = a^{\langle t \rangle}$ (for GRUs, the memory cell is equal to the activation)

Then $\tilde{c} = \tanh \dots$ is a candidate for replacing c. Whether it actually replaces c is decided by the update gate $\Gamma_u$ ("u" for update). It is a sigmoid output, so strictly it lies in $(0,1)$, but it is usually close to 0 or 1, so think of it as $\Gamma_u \in \{0,1\}$. (Gamma is another "shape" analogy, like the Rho algorithms -- think of a gated fence.)
e.g., c could store whether the subject is singular or plural (1 or 0, say), and $\Gamma_u$ decides when to overwrite it, such as when a new subject appears. This way, elements farther down the sequence can read c rather than going all the way back in the sequence.

Note that c is a vector (as are $\Gamma$ and $\tilde{c}$, all of the same dimension)

3 main equations

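$\begin{align}
\tilde{c}^{\langle t \rangle} &= \tanh(W_c[c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c)\\
\Gamma_u &= \sigma(W_u[c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)\\
c^{\langle t \rangle} &= \Gamma_u * \tilde{c}^{\langle t \rangle} + (1 - \Gamma_u) * c^{\langle t-1 \rangle}
\end{align}$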

But wouldn't we need one c per subject (i.e. sentence)?

Note: $\Gamma$ is the **gate** of the **Gated** Recurrent Unit
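A sketch of one step of the simplified GRU (update gate only, no relevance gate $\Gamma_r$), assuming the usual form: candidate from a tanh, gate from a sigmoid, and the new c as an elementwise mix of candidate and old value. The parameter names (`Wc`, `bc`, `Wu`, `bu`) and the sizes are assumptions for the demo. The biases are pushed to -20 and +20 to force the gate closed and open.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x, Wc, bc, Wu, bu):
    """One step of a simplified GRU (update gate only)."""
    concat = np.concatenate([c_prev, x])
    c_tilde = np.tanh(Wc @ concat + bc)               # candidate memory value
    gamma_u = sigmoid(Wu @ concat + bu)               # update gate, elementwise in (0, 1)
    c = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev  # elementwise mix
    return c, gamma_u

rng = np.random.default_rng(0)
n_c, n_x = 4, 3
Wc = rng.standard_normal((n_c, n_c + n_x))
bc = np.zeros(n_c)
Wu = np.zeros((n_c, n_c + n_x))
c_prev = np.array([0.5, -0.3, 0.9, 0.0])
x = rng.standard_normal(n_x)

# Gate forced (almost) closed: the memory cell is carried over unchanged.
c_keep, _ = gru_step(c_prev, x, Wc, bc, Wu, bu=np.full(n_c, -20.0))
# Gate forced (almost) open: the memory cell is overwritten by the candidate.
c_new, _ = gru_step(c_prev, x, Wc, bc, Wu, bu=np.full(n_c, 20.0))
```

When the gate stays near 0 over many steps, c is copied forward almost exactly, which is how the unit keeps information (and gradient) alive across long gaps.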

## 10: Long Short Term Memory (LSTM)

Alternative to GRU units; that is, another way to learn long-range connections within a sequence.

We still have a memory cell, but the candidate is defined differently (note the $a^{\langle t-1 \rangle}$ in the defining equation). More importantly, **the gates are different**: there are three gates instead of two, namely the "update" gate, the "forget" gate, and the "output" gate.
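A sketch of one LSTM step, assuming the standard (no peephole) form: all three gates and the candidate are computed from $a^{\langle t-1 \rangle}$ and $x^{\langle t \rangle}$, the update and forget gates are separate (rather than $\Gamma_u$ and $1 - \Gamma_u$), and the activation is the gated tanh of the memory cell. The dictionary keys (`"Wc"`, `"Wu"`, ...) and the sizes are made-up names for the demo.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x, p):
    """One LSTM step; p maps hypothetical names like "Wu", "bu" to parameters."""
    concat = np.concatenate([a_prev, x])
    c_tilde = np.tanh(p["Wc"] @ concat + p["bc"])  # candidate (uses a_prev, not c_prev)
    gamma_u = sigmoid(p["Wu"] @ concat + p["bu"])  # update gate
    gamma_f = sigmoid(p["Wf"] @ concat + p["bf"])  # forget gate
    gamma_o = sigmoid(p["Wo"] @ concat + p["bo"])  # output gate
    c = gamma_u * c_tilde + gamma_f * c_prev       # separate gates for new vs. old memory
    a = gamma_o * np.tanh(c)                       # activation is no longer equal to c
    return a, c

rng = np.random.default_rng(0)
n_a, n_x = 4, 3
params = {}
for gate in "cufo":
    params["W" + gate] = rng.standard_normal((n_a, n_a + n_x))
    params["b" + gate] = np.zeros(n_a)

a, c = lstm_step(np.zeros(n_a), np.zeros(n_a), rng.standard_normal(n_x), params)
```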

Long short term memory unit vs GRU unit:
- LSTM is more powerful and more flexible (having three gates helps)
- LSTM units actually came first (see the 1997 paper)
- GRU units are simpler
    - they scale better and are less computationally heavy
- No widespread consensus on which is better; neither is universally superior
- In general, LSTMs are more likely to be the default choice
- GRUs have been gaining momentum

Common variations:
- have gates depend on $c^{\langle t-1 \rangle}$ as well as $a^{\langle t-1 \rangle}$ ("peephole connection")

Whole purpose of GRUs and LSTMs: capturing long-range dependencies within sequences

## 11: Bidirectional RNN

"Getting information from the future" What a cool way to put it.

Every RNN that we've seen so far (even those using GRUs or LSTMs) has been forward direction only.

BRNNs define an acyclic graph

Note this all takes place in forwardprop, even though we are going in both directions

But how is it implemented? Is the order opposite for the reversed green arrow a's in Ng's drawing? i.e., is $a^{\langle 1 \rangle}$ really $a^{\langle T_x \rangle}$?

What is the implementation?? I need to see it. (Check Aggarwal or Chollet (10.4.3))
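One plausible way to implement it, as a sketch rather than what any particular library does: run one plain RNN over the sequence left to right, run a second RNN (with its own weights) over the reversed sequence, flip the second list of activations back into time order, and compute each output from the pair of activations at that position. All names and sizes here are assumptions for the demo. Under this scheme the backward activations keep their time index; they are simply computed starting from position $T_x$, so the backward activation at position 1 is the last one computed.

```python
import numpy as np

def rnn_pass(xs, W, b, n_a):
    """Plain tanh RNN over xs (in the order given); returns one activation per step."""
    a = np.zeros(n_a)
    out = []
    for x in xs:
        a = np.tanh(W @ np.concatenate([a, x]) + b)
        out.append(a)
    return out

def brnn_forward(xs, Wf, bf, Wb, bb, Wy, by, n_a):
    fwd = rnn_pass(xs, Wf, bf, n_a)              # left to right
    bwd = rnn_pass(xs[::-1], Wb, bb, n_a)[::-1]  # right to left, flipped back to time order
    # Output at each position uses both the forward and backward activations.
    return [Wy @ np.concatenate([f, bk]) + by for f, bk in zip(fwd, bwd)]

rng = np.random.default_rng(0)
T, n_x, n_a, n_y = 3, 2, 4, 1
xs = rng.standard_normal((T, n_x))
Wf = 0.5 * rng.standard_normal((n_a, n_a + n_x)); bf = np.zeros(n_a)
Wb = 0.5 * rng.standard_normal((n_a, n_a + n_x)); bb = np.zeros(n_a)
Wy = rng.standard_normal((n_y, 2 * n_a)); by = np.zeros(n_y)

ys = brnn_forward(xs, Wf, bf, Wb, bb, Wy, by, n_a)

# Change only the LAST input and recompute: the FIRST output moves too,
# which is the "information from the future" part.
xs_changed = xs.copy()
xs_changed[-1] += 1.0
ys_changed = brnn_forward(xs_changed, Wf, bf, Wb, bb, Wy, by, n_a)
first_output_changed = not np.allclose(ys[0], ys_changed[0])
```

Both passes are still forward propagation; the second one just walks the sequence in the other direction.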

Disadvantage: You need the entire sequence of data before running the system; that is, you cannot, e.g., translate "word-by-word" as someone is speaking -- only when the entire sentence has been spoken.




## 12: Deep RNNs

Stacking layers to the architectures introduced above

Oh! What we've been doing up to now has only been done on a single layer. Didn't get that until now.

So the RNNs are not communicating between layers, but between what? The neurons in a layer, I think? 

Because of the temporal dimension (each layer also passes its activations forward through time, from one position in the sequence to the next), RNNs are generally not as deep as typical feed-forward neural networks.

Can have blocks that are GRUs, LSTMs, or BRNNs.
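A sketch of stacking, assuming plain tanh RNN blocks (GRU or LSTM blocks would slot in the same way): within a layer, the activation is passed from one time step to the next; between layers, the whole sequence of activations from layer $l-1$ becomes the input sequence of layer $l$. The layer sizes and the `(W, b)` tuple layout are assumptions for the demo.

```python
import numpy as np

def deep_rnn_forward(xs, layers):
    """layers is a list of (W, b) pairs, bottom layer first."""
    seq = xs
    for W, b in layers:
        a = np.zeros(b.shape[0])
        out = []
        for inp in seq:
            # Horizontal link: a from the previous time step in this layer.
            # Vertical link: inp from the same time step in the layer below.
            a = np.tanh(W @ np.concatenate([a, inp]) + b)
            out.append(a)
        seq = out  # this layer's activations feed the next layer up
    return seq

rng = np.random.default_rng(0)
T, n_x = 5, 3
sizes = [4, 2]  # hidden units per layer (arbitrary)
xs = rng.standard_normal((T, n_x))

layers, n_in = [], n_x
for n_a in sizes:
    layers.append((rng.standard_normal((n_a, n_a + n_in)), np.zeros(n_a)))
    n_in = n_a

top = deep_rnn_forward(xs, layers)  # activations of the top layer, one per time step
```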

## Quiz Week 1

Attempt:

1. The number in angle brackets denotes the position in the sequence ("word"); the number in parentheses denotes the training example number
2. Judging from the slides, the architecture diagram is not appropriate when Tx>Ty, but I don't see why it wouldn't be if we just set the missing inputs equal to zero, isn't that one thing we discussed in the lectures?
3. Classification will always have a single output, so image classification and sentiment classification are many-to-one models <font color='red'>Image classification is an example of one-to-one architecture</font>
4. At time t we are calculating the probability of $y^{<t>}$, given that we know $y^{<t-1>}, \dots, y^{<1>}$
5. Confused about this one and how sampling novel sentences works. I think that it is true, but I cannot justify my belief. <font color='red'>False. The probabilities output by the RNN are not used to pick the highest probability word and the ground-truth word from the training set is not the input to the next time-step.</font>
6. I do not know how to answer this. If you find that your weights and activations are taking NaN values, then it could be caused by a vanishing gradient (as specified in lecture, this is a prevalent problem in RNNs), but it might also be due to other factors, right? <font color='red'>Yeah, shoulda answered "False"...</font>
7. As I understood it, in an LSTM unit $\Gamma_u$ has the same size as $c^{<t>}$, which has the same size as $a^{<t>}$, or am I confusing this with GRU units? Is it not the same for both?
8. I need to understand this question better. First of all, I didn't even know about the second gate, $\Gamma_r$ in a GRU unit. For now I will guess that removing $\Gamma_r$ will not cause vanishing gradient problems, but I really cannot justify this.
9. Is this as simple as it looks? The update gate plays similar roles in both GRU and LSTM, and the forget gate in LSTM plays a role similar to $1 - \Gamma_u$ in GRU
10. Unidirectional is better, for the reason stated (we only care about the weather from the past -- never mind that we can't get the weather of the future as data)

70% 

Notes on the problems:




Extra resources:

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

