# Sequence models

### (video 1) Sequence problems
Examples of sequence data
- speech recognition: x(audio = sequence) => y(transcript = sequence)
- music generation: x(nothing, or a single integer referring to a style) => y(sequence)
- sentiment classification: x(phrase = sequence) => y(stars)
- DNA sequence analysis: x(DNA code = sequence) => y (which part corresponds to a protein = sequence)
- machine translation: x(phrase = sequence) => y(translation = sequence)
- video activity recognition: x(video frames = sequence) => y(activity)
- NER (named entity recognition): x(phrase = sequence) => y(entities)

These problems can be addressed as supervised learning with labeled data (X, Y) as the training set
=> but there are different types of sequence problems:
- both X and Y are sequences
- X and Y can have the same or different lengths
- only one of X or Y is a sequence

### (video 2) Notations to define sequence problems
NER example: identify people's names in a sentence


|  | word | word | word | word | word | word | word | word | word |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| input x: | Harry | Potter | and | Hermione | Granger | invented | a | new | spell |
| input features: | x<sup>&lt;1&gt;</sup> | x<sup>&lt;2&gt;</sup>  | ... | ... | x<sup>&lt;t&gt;</sup>  | ... | ... | x<sup>&lt;8&gt;</sup>  | x<sup>&lt;9&gt;</sup>  |
| output y1: | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| output: | y<sup>&lt;1&gt;</sup> | y<sup>&lt;2&gt;</sup>  | ... | ... | y<sup>&lt;t&gt;</sup>  | ... | ... | y<sup>&lt;8&gt;</sup> | y<sup>&lt;9&gt;</sup>  |

y1 – not the best representation, since it doesn't tell you where people's names start and end

the input is 9 words => we'll have 9 sets of features representing these
- x<sup>&lt;t&gt;</sup> – features of the word at position t (some position in the middle of the sequence)
- *t* implies that these are temporal sequences (but it will be used regardless of the type of sequence)
- T<sub>x</sub> – length of the input sequence = 9
- T<sub>y</sub> – length of the output sequence = 9
- T<sub>x</sub> and T<sub>y</sub> can be different
        
- X<sup>(i)</sup> – i-th training example, in this case, this particular phrase
- x<sup>(i)&lt;t&gt; </sup> – t-th element of an input sequence of i-th training example   
- T<sub>x</sub><sup>(i)</sup> – the length of an input sequence of i-th training example  
- y<sup>(i)&lt;t&gt; </sup> – t-th element of an output sequence of i-th training example 
- T<sub>y</sub><sup>(i)</sup> – the length of an output sequence of i-th training example
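
A small sketch of how this notation maps onto plain Python lists. The second training example is made up for illustration, and the code uses 0-based indexing while the notation is 1-based:

```python
# Each training example is a list of words; labels mark words that are part of a person's name.
X = [
    "Harry Potter and Hermione Granger invented a new spell".split(),
    "Hermione smiled".split(),  # made-up second example, not from the notes
]
Y = [
    [1, 1, 0, 1, 1, 0, 0, 0, 0],
    [1, 0],
]

i, t = 1, 4                  # 1-based indices, as in the notation
x_i_t = X[i - 1][t - 1]      # x^(i)<t>: 4th word of the 1st example
y_i_t = Y[i - 1][t - 1]      # y^(i)<t>: its label
T_x = [len(x) for x in X]    # T_x^(i) for every example
T_y = [len(y) for y in Y]    # T_y^(i) for every example

print(x_i_t, y_i_t)  # Hermione 1
print(T_x, T_y)      # [9, 2] [9, 2]
```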
    
**Representation of individual words of the sentence**
- You come up with a vocabulary = dictionary
- Typical sizes are 30-50k words; around 100k is not uncommon either (the example here uses 10k)
    
| words | number |
| --- | --- |
|a| 1 |
|aaron| 2 |
|...| ... |
|and| 367 |
|...| ... |
|harry| 4075 |
|...| ... |
|potter| 6830 |
|...| ... |
|zulu| 10000 |

- one way to build this dictionary: take the 10k most frequent words, e.g. in your training set
- then use a one-hot representation to encode each of these words:
    - x<sup>&lt;1&gt;</sup> = vector [0, ..., 1, ..., 0] with a single 1 in the 4075-th position (the index of "harry")

The *goal*: given this representation for X, learn a mapping with a sequence model to the target output y. This is treated as a supervised learning problem, given labeled data with both x and y.
    
If you encounter a word that is not in your vocabulary, you map it to a special token (a fake word) called *Unknown* = UNK
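
A minimal sketch of the one-hot encoding with an unknown-word fallback. The five-word vocabulary and the helper name `one_hot` are assumptions for illustration (a real vocabulary would hold around 10k words), and indices in the code are 0-based, unlike the 1-based positions in the table:

```python
import numpy as np

# Toy vocabulary (hypothetical); the last entry is the unknown-word token.
vocab = ["a", "and", "harry", "potter", "<UNK>"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot column vector for `word`; unseen words map to <UNK>."""
    index = word_to_index.get(word.lower(), word_to_index["<UNK>"])
    vector = np.zeros((len(vocab), 1))
    vector[index, 0] = 1.0
    return vector

sentence = "Harry Potter and Hermione".split()
x = [one_hot(word) for word in sentence]  # x[t - 1] plays the role of x<t>

print(x[0].ravel())  # [0. 0. 1. 0. 0.]  -> "harry"
print(x[3].ravel())  # [0. 0. 0. 0. 1.]  -> "hermione" is not in the vocabulary
```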

### (video 3) Recurrent Neural Network (RNN) model

**Why not standard network?**

9 input words = [x<sup>&lt;1&gt;</sup>, ..., x<sup>&lt;t&gt;</sup>, ..., x<sup>&lt;T<sub>x</sub>&gt;</sup>]

feed them into a standard NN with a few layers => 
output of 9 values, each 0 or 1 = [y<sup>&lt;1&gt;</sup>, ..., y<sup>&lt;t&gt;</sup>, ..., y<sup>&lt;T<sub>y</sub>&gt;</sup>] 
    
which tell you whether each of those words is part of a person's name

But this doesn't work well, because:
- Inputs and outputs can have different lengths in different examples (and a maximum sentence length may not even be known)
- It doesn't share features learned across different positions of the text: learning that "Harry" in the first position is part of a name doesn't automatically carry over to "Harry" in another position

Similarly to a CNN 
- you want things learned for one part of the input to generalize quickly to other parts (in a CNN: image, in a sequence model: sequence)
- a better representation will also let you reduce the number of parameters in your model

If we think of a regular network: each x<sup>&lt;t&gt;</sup> is a 10k-dimensional one-hot vector => the total input size would be 10k x the maximum number of words in a phrase => the weight matrix of the first layer would be huge
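
A back-of-the-envelope count to make "huge" concrete. The maximum sentence length of 9 (the example phrase) and the 100 hidden units are assumptions for illustration only; the comparison is with a recurrent layer of the same width that reuses one input matrix and one recurrent matrix at every position:

```python
vocab_size = 10_000   # length of each one-hot vector
max_words = 9         # assumed maximum sentence length (the example phrase)
hidden_units = 100    # assumed width of the first hidden layer

# Standard network: all one-hot vectors are concatenated into one long input.
dense_input_size = vocab_size * max_words
dense_first_layer_weights = dense_input_size * hidden_units

# Recurrent layer: one (100 x 10k) input matrix plus one (100 x 100) recurrent
# matrix, shared across all positions, so the count does not grow with length.
rnn_weights = hidden_units * vocab_size + hidden_units * hidden_units

print(dense_input_size)           # 90000
print(dense_first_layer_weights)  # 9000000
print(rnn_weights)                # 1010000
```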

**What is RNN?**
- We read the sentence from left to right
- We take the first word x<sup>&lt;1&gt;</sup> and feed it to a NN layer, which tries to predict an output y_hat<sup>&lt;1&gt;</sup>: whether the word is part of a person's name or not
- When the RNN goes on to read the second word x<sup>&lt;2&gt;</sup>, 
    - instead of predicting y_hat<sup>&lt;2&gt;</sup> from x<sup>&lt;2&gt;</sup> alone, 
    - it also gets some information from the computation of step 1
    - in particular: an **activation** value from time step 1 is passed to time step 2: a<sup>&lt;1&gt;</sup>
- and so on for each next word: input x<sup>&lt;t&gt;</sup> => y_hat<sup>&lt;t&gt;</sup>
- until the last one: input x<sup>&lt;T<sub>x</sub>&gt;</sup> => y_hat<sup>&lt;T<sub>y</sub>&gt;</sup>
- In this example, T<sub>x</sub> = T<sub>y</sub>, but the architecture will change a bit if T<sub>x</sub> ≠ T<sub>y</sub>
- At the beginning we pass an activation a<sup>&lt;0&gt;</sup>, which can be 
    - randomly initialized
    - a vector of zeros
    
Representation of RNN: 

x<sup>&lt;t&gt;</sup> => [ooooooooo] (loop + shaded box = time delay of one step) => y<sup>&lt;t&gt;</sup>

**Parameters**
- The recurrent neural network scans through the data from left to right. 
- The parameters it uses for each time step are shared  
    - w<sub>ax</sub> = governs the connection from the input x<sup>&lt;t&gt;</sup> to the hidden layer
        - The second index, "x", means that w<sub>ax</sub> is multiplied by some x-like quantity
        - The first index, "a", means that it is used to compute some a-like quantity (see equations below)
    - w<sub>aa</sub> = governs the horizontal connections between activations
    - w<sub>ya</sub> = governs the output predictions

So, to make a prediction for y_hat<sup>&lt;3&gt;</sup>, this RNN gets information not only from x<sup>&lt;3&gt;</sup> but also from x<sup>&lt;1&gt;</sup> and x<sup>&lt;2&gt;</sup>, because that information is passed horizontally through the activations

One **weakness** of this RNN is that it only uses information from earlier in the sequence to make a prediction, not from the words that come later. This can be overcome with a bidirectional RNN

**What are the calculations this RNN does?**
- a<sup>&lt;0&gt;</sup> = 0 (vector of zeros)
- a<sup>&lt;1&gt;</sup> = g(w<sub>aa</sub> * a<sup>&lt;0&gt;</sup> + w<sub>ax</sub> * x<sup>&lt;1&gt;</sup> + b<sub>a</sub>)
    - g = activation function
        - often tanh
        - sometimes ReLU (which can help mitigate the vanishing gradient problem)
    - b<sub>a</sub> = bias    
- y_hat<sup>&lt;1&gt;</sup> = f(w<sub>ya</sub> * a<sup>&lt;1&gt;</sup> + b<sub>y</sub>)
    - f = activation function (same or another), depends on what the output is
        - binary classification problem (NER) = sigmoid activation function
        - k-way classification problem = softmax
    - b<sub>y</sub> = bias
    
More generally 
- a<sup>&lt;t&gt;</sup> = g(w<sub>aa</sub> * a<sup>&lt;t-1&gt;</sup> + w<sub>ax</sub> * x<sup>&lt;t&gt;</sup> + b<sub>a</sub>)
- y_hat<sup>&lt;t&gt;</sup> = f(w<sub>ya</sub> * a<sup>&lt;t&gt;</sup> + b<sub>y</sub>)
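
The two general equations can be sketched as a forward pass in NumPy. This is an illustration under assumptions, not a reference implementation: g is taken to be tanh and f sigmoid (the binary NER case), the sizes are toy values, the weights are random, and the function name `rnn_forward` is made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, a0, w_aa, w_ax, w_ya, b_a, b_y):
    """Apply the recurrence to a list of column vectors xs = [x<1>, ..., x<Tx>]."""
    a_prev = a0
    activations, predictions = [], []
    for x_t in xs:
        # a<t> = tanh(w_aa a<t-1> + w_ax x<t> + b_a)
        a_t = np.tanh(w_aa @ a_prev + w_ax @ x_t + b_a)
        # y_hat<t> = sigmoid(w_ya a<t> + b_y)
        y_hat_t = sigmoid(w_ya @ a_t + b_y)
        activations.append(a_t)
        predictions.append(y_hat_t)
        a_prev = a_t  # the activation is passed on to the next time step
    return activations, predictions

# Toy sizes (assumed): 5-word vocabulary, 4 hidden units, 3 time steps.
rng = np.random.default_rng(0)
n_x, n_a = 5, 4
w_aa = rng.normal(scale=0.1, size=(n_a, n_a))
w_ax = rng.normal(scale=0.1, size=(n_a, n_x))
w_ya = rng.normal(scale=0.1, size=(1, n_a))
b_a = np.zeros((n_a, 1))
b_y = np.zeros((1, 1))

xs = [np.eye(n_x)[:, [i]] for i in (2, 3, 1)]  # three one-hot column vectors
a0 = np.zeros((n_a, 1))                        # a<0> = vector of zeros
activations, predictions = rnn_forward(xs, a0, w_aa, w_ax, w_ya, b_a, b_y)
```

Note that the same five parameter arrays are used at every step of the loop, which is exactly the parameter sharing described above.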

Simplification of these two equations
- a<sup>&lt;t&gt;</sup> = g (w<sub>a</sub> * [a<sup>&lt;t-1&gt;</sup>, x<sup>&lt;t&gt;</sup>]+ b<sub>a</sub>)
    - w<sub>a</sub> is a matrix defined as the horizontal stack (side by side) of the matrices [w<sub>aa</sub> | w<sub>ax</sub>]
    - if dim(a) = 100 and dim(x) = 10k => 
        - dim(w<sub>aa</sub>) = (100, 100)
        - dim(w<sub>ax</sub>) = (100, 10k)
        - dim(w<sub>a</sub>) = (100, 10100)
    - [a<sup>&lt;t-1&gt;</sup>, x<sup>&lt;t&gt;</sup>] is the two vectors stacked vertically 
        - dim(a<sup>&lt;t-1&gt;</sup>) = (100, 1)
        - dim(x<sup>&lt;t&gt;</sup>) = (10k, 1)
        - dim([a<sup>&lt;t-1&gt;</sup>, x<sup>&lt;t&gt;</sup>]) = (10100, 1)
    - The advantage of this notation is that rather than carrying around two parameter matrices, w<sub>aa</sub> and w<sub>ax</sub>, we can compress them into a single parameter matrix w<sub>a</sub>
- y_hat<sup>&lt;t&gt;</sup> = f(w<sub>y</sub> * a<sup>&lt;t&gt;</sup> + b<sub>y</sub>) (similarly)
    - the subscripts of w<sub>y</sub> and b<sub>y</sub> denote what type of output quantity we're computing
        - w<sub>y</sub> is the weight matrix for computing a y-like quantity
        - w<sub>a</sub> and b<sub>a</sub> above are the parameters for computing an a-like (activation) quantity
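
A quick numerical check that the stacked form gives the same result as the two separate products, using the dimensions from the notes. The random values are placeholders; only the shapes and the equality matter here:

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x = 100, 10_000

w_aa = rng.normal(size=(n_a, n_a))
w_ax = rng.normal(size=(n_a, n_x))
a_prev = rng.normal(size=(n_a, 1))
x_t = np.zeros((n_x, 1))
x_t[4074, 0] = 1.0  # one-hot vector; 0-based index for the 4075-th position

w_a = np.hstack([w_aa, w_ax])        # matrices side by side
stacked = np.vstack([a_prev, x_t])   # vectors on top of each other

separate = w_aa @ a_prev + w_ax @ x_t
combined = w_a @ stacked

print(w_a.shape, stacked.shape)         # (100, 10100) (10100, 1)
print(np.allclose(separate, combined))  # True
```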

### (video 4) Back propagation through time

**Element-wise (individual time steps) loss function**

- A certain word in the sequence is part of a person's name: y<sup>&lt;t&gt;</sup> = 1
- The NN outputs some probability of that word being part of a person's name: y_hat<sup>&lt;t&gt;</sup> = 0.1
- Loss = the standard logistic regression loss, also called the cross-entropy loss

L<sup>&lt;t&gt;</sup>(y_hat<sup>&lt;t&gt;</sup>, y<sup>&lt;t&gt;</sup>) = - y<sup>&lt;t&gt;</sup> * log(y_hat<sup>&lt;t&gt;</sup>) - (1-y<sup>&lt;t&gt;</sup>) * log(1-y_hat<sup>&lt;t&gt;</sup>)
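
Plugging the example numbers into this formula, as a small sketch. The function name `step_loss` and the second prediction of 0.9 are made up for comparison:

```python
import math

def step_loss(y_hat, y):
    """Cross-entropy loss for one time step with a binary label y."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

loss_bad = step_loss(0.1, 1)   # label 1, predicted probability 0.1 -> about 2.30
loss_good = step_loss(0.9, 1)  # label 1, predicted probability 0.9 -> about 0.105

print(round(loss_bad, 3), round(loss_good, 3))  # 2.303 0.105
```

A confident wrong prediction is penalized far more than a confident correct one.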

**Overall loss**

L(y_hat, y) = Σ<sub>t = 1..T<sub>y</sub></sub> L<sup>&lt;t&gt;</sup>(y_hat<sup>&lt;t&gt;</sup>, y<sup>&lt;t&gt;</sup>)
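
The sum over time steps, sketched with NumPy for the 9-word example. The labels come from the table earlier in the notes; the predicted probabilities are made up:

```python
import numpy as np

y = np.array([1, 1, 0, 1, 1, 0, 0, 0, 0])                        # labels from the example
y_hat = np.array([0.9, 0.8, 0.1, 0.7, 0.6, 0.2, 0.1, 0.1, 0.3])  # made-up predictions

# Element-wise loss L<t> for every time step at once.
step_losses = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# Overall loss: sum of the per-step losses over t = 1..T_y.
total_loss = step_losses.sum()
```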

Backprop requires doing the computations in the opposite direction to the forward pass. That lets you compute all the quantities needed to take derivatives with respect to the parameters and then update the parameters using gradient descent

In the backpropagation procedure, the most significant message (the most significant recursive calculation) is the one passed between activations, from right to left, i.e. backwards in time. This is what gives the algorithm its name, "backpropagation through time".
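
A compact sketch of backpropagation through time for the RNN equations in these notes. It assumes g = tanh, f = sigmoid and the cross-entropy loss; the function names, toy sizes, random weights and the learning rate of 0.1 are all made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, xs, ys, a0):
    """Forward pass; returns the total loss plus cached activations and predictions."""
    w_aa, w_ax, w_ya, b_a, b_y = params
    activations, predictions, loss = [a0], [], 0.0
    for x_t, y_t in zip(xs, ys):
        a_t = np.tanh(w_aa @ activations[-1] + w_ax @ x_t + b_a)
        y_hat_t = sigmoid(w_ya @ a_t + b_y)
        loss += (-y_t * np.log(y_hat_t) - (1 - y_t) * np.log(1 - y_hat_t)).item()
        activations.append(a_t)
        predictions.append(y_hat_t)
    return loss, activations, predictions

def backward(params, xs, ys, activations, predictions):
    """Backpropagation through time: walk the steps from the last one to the first."""
    w_aa, w_ax, w_ya, b_a, b_y = params
    grads = [np.zeros_like(p) for p in params]  # same order as params
    da_next = np.zeros_like(activations[0])     # gradient arriving from step t+1
    for t in reversed(range(len(xs))):
        a_t, a_prev = activations[t + 1], activations[t]
        dz_y = predictions[t] - ys[t]           # sigmoid combined with cross-entropy
        da = w_ya.T @ dz_y + da_next            # from this step's output and from the future
        dz_a = (1.0 - a_t ** 2) * da            # derivative of tanh
        grads[0] += dz_a @ a_prev.T             # gradient for w_aa
        grads[1] += dz_a @ xs[t].T              # gradient for w_ax
        grads[2] += dz_y @ a_t.T                # gradient for w_ya
        grads[3] += dz_a                        # gradient for b_a
        grads[4] += dz_y                        # gradient for b_y
        da_next = w_aa.T @ dz_a                 # message passed back to step t-1
    return grads

rng = np.random.default_rng(0)
n_x, n_a = 5, 4
params = [
    rng.normal(scale=0.5, size=(n_a, n_a)),  # w_aa
    rng.normal(scale=0.5, size=(n_a, n_x)),  # w_ax
    rng.normal(scale=0.5, size=(1, n_a)),    # w_ya
    np.zeros((n_a, 1)),                      # b_a
    np.zeros((1, 1)),                        # b_y
]
xs = [np.eye(n_x)[:, [i]] for i in (2, 3, 1)]  # three one-hot inputs
ys = [1, 1, 0]                                 # made-up labels
a0 = np.zeros((n_a, 1))

loss, activations, predictions = forward(params, xs, ys, a0)
grads = backward(params, xs, ys, activations, predictions)

# One gradient-descent update (the learning rate is arbitrary).
new_params = [p - 0.1 * g for p, g in zip(params, grads)]
```

The line that computes `da_next` is the recursive message between activations: the gradient at step t depends on the gradient already computed at step t+1.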
