# Week 3: Sequence to sequence architectures

### (video 1) Basic models

Translation:
* Encoder network (RNN based on GRU or LSTM units)
    * Input: word seq 
    * Output: vector representation
* Decoder network
    * Input: vector representation
    * Output: translated seq

<img src='./Images/W3_01.png' style="width: 60%"></img>

Image captioning: the same logic
* Encoder: e.g. pre-trained AlexNet, without the final softmax unit
    * Input: an image fed into a convolutional network
    * Output: 4,096-dimensional feature vector that represents the image (a picture of a cat in the example)
* Decoder network
    * Input: vector representation
    * Output: generated caption, one word at a time
    
<img src='./Images/W3_02.png' style="width: 60%"></img>

But compared to plain decoder models (= seq generation = language models) you may want
* the most likely translation, not a random translation
* the best caption, not a randomly chosen caption


### (video 2) Picking the Most Likely Sentence

Regular language models vs seq to seq
* A language model allows you to estimate the probability of a sentence. You can also use it to generate novel sentences
* Machine translation can be seen as building a conditional language model.
    * the decoder network looks pretty much identical to the language model
    * instead of always starting with a vector of all zeros, it has an encoder network that computes a representation of the input sentence, which the decoder takes as its input
    * instead of modeling the probability of any sentence, it now models the probability of the output English translation, conditioned on the input French sentence.

<img src='./Images/W3_03.png' style="width: 60%"></img>

* what you do not want is to sample outputs at random
* what you would like is to find the English sentence, y, that maximizes that conditional probability P(y|x).
* The most common algorithm for doing this is called **beam** search (see next video)

<img src='./Images/W3_04.png' style="width: 60%"></img>

But why not use greedy search?
* Greedy search is an algorithm from computer science which says
    * to generate the first word, just pick whatever is the most likely first word according to your conditional language model
    * then pick the most likely second word given the first word and the conditional language model, and so on
* what you would really like is to pick the entire sequence of words that maximizes the joint probability of the **whole** sentence
    * the word "going" may be the most probable next word given the previous words (it is very common in English), but the translation that uses it is less accurate

<img src='./Images/W3_05.png' style="width: 60%"></img>

* Plus, the total number of combinations of words in the English sentence is exponentially large
* So, if you have just 10,000 words in a dictionary and you are considering translations that are up to ten words long,
* => then there are 10,000^10 possible sentences that are ten words long
* it is impossible to rate them all => which is why the most common thing to do is to use an **approximate** search algorithm
    * it will try to pick the sentence, y, that maximizes that conditional probability, though it won't always succeed
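
To make the difference between greedy and exact search concrete, here is a minimal sketch. The toy probability table `TOY_MODEL`, the six-word vocabulary and the fixed sentence length are all made up for illustration; a real system would get these probabilities from the decoder network.

```python
from itertools import product

# Made-up toy probabilities P(next word | prefix); not from any trained model.
TOY_MODEL = {
    (): {"jane": 1.0},
    ("jane",): {"is": 0.55, "visits": 0.45},
    ("jane", "is"): {"going": 0.6, "visiting": 0.4},
    ("jane", "visits"): {"africa": 0.9, "going": 0.1},
}
VOCAB = ["jane", "is", "visits", "going", "visiting", "africa"]
LENGTH = 3  # fixed sentence length, to keep the example tiny


def sentence_prob(words):
    """Joint probability of a whole sentence under the toy model."""
    prob = 1.0
    for i, word in enumerate(words):
        prob *= TOY_MODEL.get(tuple(words[:i]), {}).get(word, 0.0)
    return prob


def greedy_decode():
    """Pick the single most likely word at every step."""
    prefix = ()
    for _ in range(LENGTH):
        options = TOY_MODEL[prefix]
        prefix += (max(options, key=options.get),)
    return prefix


def exhaustive_decode():
    """Score every possible sentence; only feasible for tiny vocabularies."""
    return max(product(VOCAB, repeat=LENGTH), key=sentence_prob)


greedy = greedy_decode()
best = exhaustive_decode()
print(greedy, sentence_prob(greedy))  # greedy commits to "is" too early
print(best, sentence_prob(best))      # the whole-sentence maximum
print(10_000 ** 10)                   # search space size for the 10,000-word example
```

With only 6 words and 3 positions the exhaustive search scores 216 sentences; with 10,000 words and 10 positions it would have to score 10^40, which is why an approximate search is used instead.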

### (video 3) Beam Search

**First step**
* The first thing beam search has to do is pick the first word of the English translation
    * Take a vocabulary of 10,000 words (ignoring capitalization)
    * First step: compute the probability of the first output word y<sup>1</sup> given the input sentence x (the French sentence vector), P(y<sup>1</sup>|x)
    * whereas greedy search picks only the single most likely word and moves on, beam search considers multiple alternatives
    * The algorithm has a parameter B = beam width (the number of alternatives kept at each step; B = 3 in the example)
 
<img src='./Images/W3_06.png' style="width: 60%"></img>

**Second step**
* For each of these three choices consider what the second word should be
* So, to evaluate the probability of the second word, it will use
    * the French sentence input
    * the first word
* => we try to maximize the probability of the first two words together, P(y<sup>1</sup>, y<sup>2</sup>|x) = P(y<sup>1</sup>|x) · P(y<sup>2</sup>|x, y<sup>1</sup>), not just of the second word
* We consider 3 × 10,000 = 30,000 possibilities
* And we pick the 3 most probable combinations and carry them on to the next step of beam search (some of the first words can thus be rejected)

<img src='./Images/W3_07.png' style="width: 60%"></img>

**NB:** because the beam width is equal to three, at every step you instantiate three copies of the network to evaluate these partial sentence fragments and their outputs.

**Third step**

<img src='./Images/W3_08.png' style="width: 60%"></img>

Hopefully, the outcome of this process is that, adding one word at a time, beam search decides that "Jane visits Africa in September" is the best output, terminated by the end-of-sentence symbol.

**NB:** If B=1 => the algorithm becomes greedy search
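
The steps above can be sketched in a few lines of Python. This is a simplified illustration, not a production decoder: the table `TOY_MODEL` holds made-up probabilities that stand in for the decoder RNN, the helper `next_word_probs` is hypothetical, and the search runs for a fixed number of steps instead of stopping at an end-of-sentence symbol.

```python
import math

# Made-up toy distribution P(next word | prefix), a stand-in for the decoder RNN.
TOY_MODEL = {
    (): {"jane": 0.6, "in": 0.4},
    ("jane",): {"is": 0.55, "visits": 0.45},
    ("in",): {"september": 1.0},
    ("jane", "is"): {"going": 0.6, "visiting": 0.4},
    ("jane", "visits"): {"africa": 0.9, "going": 0.1},
    ("in", "september"): {"jane": 0.5, "africa": 0.5},
}


def next_word_probs(prefix):
    """Return P(word | prefix); unknown prefixes fall back to an end symbol."""
    return TOY_MODEL.get(prefix, {"<eos>": 1.0})


def beam_search(beam_width, num_steps=3):
    """Keep the `beam_width` most probable partial sentences at every step."""
    beams = [((), 0.0)]  # list of (prefix, log probability)
    for _ in range(num_steps):
        candidates = []
        for prefix, logp in beams:
            # Extend every kept prefix by every possible next word.
            for word, p in next_word_probs(prefix).items():
                candidates.append((prefix + (word,), logp + math.log(p)))
        # Keep only the B most probable combinations.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]


print(beam_search(beam_width=1))  # B = 1 behaves like greedy search
print(beam_search(beam_width=3))  # B = 3 recovers a more probable sentence
```

In this toy table, B = 1 follows "jane is going", while B = 3 keeps "jane visits" alive long enough to find the more probable "jane visits africa".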

### (video 4) Refinements to Beam Search

Some small changes that make beam search work even better

**Length normalization**
* Beam search maximizes the probability of the output sequence given the input sentence
* This probability can be written as the product of the probabilities of each output word given all previous output words and the input vector
    * All these probabilities are numbers much less than 1
    * Their product can be too small to be stored accurately in floating-point representation (numerical underflow)
    * So we take the log of the product, which becomes the sum of the logs of its factors
* Since log is a monotonically increasing function, maximizing log P(y|x) gives the same result as maximizing P(y|x)
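
A small sketch of the underflow problem, using an exaggerated, made-up sequence of 400 words that each have probability 0.001:

```python
import math

# 400 made-up per-word probabilities (an exaggerated, very unlikely sequence).
word_probs = [1e-3] * 400

product = 1.0
for p in word_probs:
    product *= p  # the running product underflows to 0.0 in double precision

# The sum of logs stays an ordinary, well-behaved float.
log_sum = sum(math.log(p) for p in word_probs)

print(product)  # the information about the sequence is lost
print(log_sum)  # still usable for comparing candidate sequences
```

Once the product has underflowed, two different candidates both score 0.0 and can no longer be ranked, whereas their log sums remain comparable.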

There's one other change to the `argmax` of the sum of logs that makes the machine translation algorithm work even better.
* if you have a very long sentence, the probability of that sentence is going to be low because you're multiplying many terms that are all less than 1
* Thus this objective has the undesirable effect that it tends to unnaturally prefer very short translations
* the same is true for the log of the probability.
    * Values of probability are always less than or equal to one
    * => logs are always negative or zero
    * so the more terms you add together, the more negative the sum becomes
* So you can normalize this by the number of words in the translation, T<sub>y</sub> => this significantly reduces the penalty for outputting longer translations
    * Sometimes you use a softer approach to normalization, dividing by T<sub>y</sub>^alpha, e.g. alpha = 0.7 (alpha = 1 is full normalization by length, alpha = 0 is no normalization)
    * There isn't a great theoretical justification for it, but people have found this works well
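
A minimal sketch of the length-normalized objective. The function name `normalized_score` and the two probability lists are made up for illustration; alpha = 0.7 is the value mentioned above.

```python
import math


def normalized_score(word_probs, alpha=0.7):
    """Sum of log probabilities divided by T_y ** alpha."""
    t_y = len(word_probs)
    return sum(math.log(p) for p in word_probs) / (t_y ** alpha)


# Made-up per-word probabilities for a short and a longer candidate.
short = [0.3, 0.3]
longer = [0.5] * 5

# alpha = 0 means no normalization: the raw log probability prefers the short one.
print(normalized_score(short, alpha=0), normalized_score(longer, alpha=0))
# With alpha = 0.7 the longer candidate is no longer penalized just for its length.
print(normalized_score(short), normalized_score(longer))
```

The longer candidate has more confident individual words, but without normalization it loses simply because it has more terms in the sum.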

<img src='./Images/W3_09.png' style="width: 60%"></img>

Recap:
* as you run beam search you see a lot of sentences with
    * length equal to 1,
    * length equal to 2,
    * up to 30, if we run beam search for up to 30 steps
* You keep track of the top three possibilities for each of these possible sentence lengths 1, 2, 3, 4, and so on up to 30
* Then you score all of these output sentences with the normalized log-likelihood objective (the last sum on the image above), computed only on the sentences you have seen
* You pick the one with the highest value

**Beam search discussion**

Implementation details: how to choose B:
* If B is large
    * You consider a lot of different options => better result
    * But search is slower and uses more memory
* If B is small: vice versa

* In production B might be around 10.
* 100 is considered very large for a production system
* For research papers: may be 1000

Compared to exact search algorithms (breadth-first / depth-first search), beam search runs much faster but is not guaranteed to find the exact maximum of the argmax objective.

Beam Search is a widely used algorithm in many production and commercial systems

<img src='./Images/W3_10.png' style="width: 60%"></img>

### (video 5) Error Analysis in Beam Search

You sometimes wonder: is beam search working well enough? There are some simple things you can compute to get guidance on whether you need to work on improving your search algorithm.

* how error analysis interacts with beam search and how you can figure out
    * whether it is the beam search algorithm that's causing problems
    * or whether it is your RNN model that is causing problems
* Notation
    * y* – good translation provided by a human
    * y_hat – bad translation produced by the algorithm

2 components in the model
* RNN: seq to seq (encoder + decoder), which computes P(y|x)
* Beam search, which searches for the output y that maximizes P(y|x)

It's always tempting to
* collect more training data, which never hurts
* increase the beam width, which never hurts, or pretty much never hurts

It turns out that the most useful thing to do at this point is to compute, using your RNN model,
* P(y*|x)
* and P(y_hat|x)
* then see which of these two is bigger
    * Depending on which of the two cases holds, you can more clearly ascribe this particular error to either the RNN or the beam search algorithm
    

<img src='./Images/W3_11.png' style="width: 60%"></img>

Here is the logic behind this
* Case 1: P(y*|x) > P(y_hat|x)
    * Beam search's job was to find the most probable translation, and it failed: it chose y_hat although y* has a higher probability => beam search is at fault
* Case 2: P(y*|x) ≤ P(y_hat|x)
    * y* is a better translation, but according to the RNN P(y*|x) is smaller => the RNN model is at fault
    * There are some subtleties pertaining to length normalization that are glossed over here.
        * If you are using some sort of length normalization,
        * instead of evaluating these probabilities,
        * you should be evaluating the optimization objective that takes length normalization into account
    
<img src='./Images/W3_12.png' style="width: 60%"></img>

**Error analysis process**

* You go through the development set and find the mistakes that the algorithm made
* you can then carry out error analysis to figure out what fraction of errors are due to beam search versus the RNN model
    * if you find that beam search is responsible for a lot of errors, then it may be worth working on increasing the beam width
    * if you find that the RNN model is at fault, then you could do a deeper layer of analysis to try to figure out
        * if you want to add regularization, 
        * or get more training data, 
        * or try a different network architecture, 
        * or something else
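
The bookkeeping for this process can be sketched as follows. The probability pairs in `dev_errors` are invented for illustration; in practice each pair would be P(y*|x) and P(y_hat|x) computed by your RNN for one mistranslated dev-set example (or the length-normalized objective, if you use one).

```python
from collections import Counter


def attribute_error(p_star, p_hat):
    """Blame beam search if the RNN prefers the human translation y*, else the RNN."""
    return "beam search" if p_star > p_hat else "RNN"


# Hypothetical dev-set errors: (P(y*|x), P(y_hat|x)) as scored by the RNN.
dev_errors = [(2e-10, 1e-10), (1e-10, 3e-10), (5e-9, 1e-9), (9e-11, 4e-11)]

counts = Counter(attribute_error(p_star, p_hat) for p_star, p_hat in dev_errors)
fractions = {cause: n / len(dev_errors) for cause, n in counts.items()}
print(fractions)
```

A large beam search fraction suggests trying a larger beam width; a large RNN fraction suggests working on the model itself.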

<img src='./Images/W3_13.png' style="width: 60%"></img>

This particular error analysis process is very useful whenever you have
* an approximate optimization algorithm, such as beam search that is working to optimize some sort of objective,
* and some sort of cost function that is output by a learning algorithm, such as a sequence-to-sequence model 

### (video 6) BLEU score (bilingual evaluation understudy)

* One of the challenges of machine translation: there could be multiple equally good translations
* How do you evaluate a machine translation system in this case? (= how to measure accuracy?)
* Convention: BLEU score

What the BLEU score does is
* given a machine-generated translation,
* it allows you to automatically compute a score that measures how good that machine translation is.
    * The intuition is the following:
        * we look at the machine-generated output and
        * see if the words (and n-grams) it generates appear in at least one of the human-generated references
        * these human-generated references are provided as part of the test set
        
<img src='./Images/W3_14.png' style="width: 60%"></img>

Let's look at a somewhat extreme example:
* Machine translation output: "the" × 7 ("the the the the the the the")
* One way to measure how good the machine translation output is:
    * look at each of the words in the output and
    * see if it appears in the references
    * this is called the **precision** of the machine translation output
    * **precision** = what fraction of the words in the MT output also appear in the references
        * there are seven words in the machine translation output.
        * And every one of these 7 words appears in both references
        * So this output has a precision of 7/7. It looks like great precision, although the translation is terrible
* **Modified precision:** we give each word credit only up to the maximum number of times it appears in any single reference sentence
    * [number of times the word appears in the prediction, clipped to the max number of times it appears in any one reference] / [number of times it appears in the prediction]
        * In Reference 1, the word "the" appears twice.
        * In Reference 2, the word "the" appears just once.
        * 2 is bigger than 1, so the word "the" gets credit up to twice => modified precision = 2/7
* In the BLEU score, you don't want to just look at isolated words. You also want to look at pairs of words (bigrams)
    * Denominator = number of bigrams in the translated sentence
    * Numerator = sum of clipped bigram counts (each bigram's count clipped to the max number of times it appears in any one reference)
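
A minimal sketch of modified n-gram precision. The function names are made up, and the two reference sentences and the bigram candidate are assumed to be the ones from the lecture slides (they match the counts of "the" given above).

```python
from collections import Counter


def ngrams(tokens, n):
    """All consecutive n-word tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram count divided by the total n-gram count of the candidate."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    ref_counts = [Counter(ngrams(ref, n)) for ref in references]
    clipped = 0
    for gram, count in cand_counts.items():
        # Credit each n-gram at most as often as it appears in any one reference.
        max_in_refs = max(counts[gram] for counts in ref_counts)
        clipped += min(count, max_in_refs)
    return clipped / sum(cand_counts.values())


# Assumed example sentences (lower-cased, whitespace-tokenized).
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
p1 = modified_precision(["the"] * 7, refs, 1)
p2 = modified_precision("the cat the cat on the mat".split(), refs, 2)
print(p1)  # 2 clipped matches out of 7 unigrams
print(p2)  # 4 clipped matches out of 6 bigrams
```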

<img src='./Images/W3_15.png' style="width: 60%"></img>
<img src='./Images/W3_16.png' style="width: 60%"></img>

* In the final BLEU score we can take unigrams, bigrams, trigrams, etc. into account all together
    * if the MT output is exactly the same as either Reference 1 or Reference 2, then all of these values p1, p2 and so on will be equal to 1.0.
    
**Combined BLEU score**
* you sum p1, p2, p3, p4 (the standard BLEU definition sums their logs, log p1 ... log p4),
* divide by 4,
* and take exp of this
* and adjust this (multiply) by the BP factor = brevity penalty
    * if you output a very short translation, it is easier to get high precision
    * BP penalizes short translations
        * = 1 if MT_output_length > ref_output_length
        * = exp(1 - ref_output_length/MT_output_length) otherwise
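
A sketch of the combination step, taking the precisions as given. One assumption to flag: this follows the standard BLEU definition, which averages the logs of p1..p4 (a geometric mean) before taking exp. The function names are made up for illustration.

```python
import math


def brevity_penalty(mt_len, ref_len):
    """1 for outputs longer than the reference, exp(1 - ref/MT) otherwise."""
    return 1.0 if mt_len > ref_len else math.exp(1 - ref_len / mt_len)


def combined_bleu(precisions, mt_len, ref_len):
    """BP * exp(mean(log p_n)), assuming the standard geometric-mean form."""
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; the geometric mean is zero
    log_avg = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(mt_len, ref_len) * math.exp(log_avg)


print(combined_bleu([1.0, 1.0, 1.0, 1.0], mt_len=7, ref_len=6))  # perfect match
print(combined_bleu([0.8, 0.6, 0.4, 0.2], mt_len=5, ref_len=10))  # short output is penalized
```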

<img src='./Images/W3_17.png' style="width: 60%"></img>

BLEU score is a useful single real number evaluation metric to use 
* whenever you want your algorithm to generate a piece of text
* and you want to see whether it has a similar meaning to a reference piece of text generated by humans.

### (video 7) Attention Model Intuition

The Attention algorithm makes Encoder-Decoder models work better

Classical Encoder-Decoder models
* Input: very long sentence
* Encoder:
    * Encode it
    * Memorize it in the activations
* Decoder
    * Generate the translation
  
The way a human translator would translate this sentence is 
* not to first read the whole French sentence and then memorize the whole thing and then regurgitate an English sentence from scratch. 
* Instead, 
    * he/she would read the first part of it, 
    * maybe generate part of the translation. 
    * Look at the second part, generate a few more words, 
    * look at a few more words, generate a few more words and so on
    * because it's just really difficult to memorize the whole long sentence like that
* The Encoder-Decoder architecture works quite well for short sentences, so we might achieve a relatively high BLEU score, but for very long sentences, maybe longer than 30 or 40 words, the performance comes down. It is difficult for a NN to memorize long sentences too
* With attention models the BLEU score does not drop for long sentences

<img src='./Images/W3_18.png' style="width: 60%"></img>

**Intuition**
* Input: Jane visite l'Afrique en septembre
* Bidirectional RNN: to compute a set of features for each of the input words
    * Since it is not a word-to-word translation, we get rid of the Y's (the first RNN produces no outputs)
    * But for each of the words (positions in the sentence) we compute a rich set of features which take into account the surrounding words, etc.
* We are going to use another RNN to generate the English translation
    * we denote the activation (hidden state) of this RNN by S, not to be confused with A for the first RNN
    * We hope the first output will be "Jane"
    * The question is, which part of the French sentence should be taken into account to generate the first word?
        * The attention model computes a set of attention weights
        * alpha_1_1 – the weight that denotes how much attention the first unit of the second RNN should pay to the first input word in order to correctly generate the first word of the output
    * => together all the alphas describe the context of the input we should take into account to generate the first word of the output
        * each alpha_t_t'
            * t – the timestep of the output = the English word
            * t' – the timestep of the input = the French word
        * depends on the forward and backward activations at timestep t' of the first RNN + the state of the previous step of the second RNN (= s<sup>t-1</sup>)
       
<img src='./Images/W3_19.png' style="width: 60%"></img>

### (video 8) Attention Model

* to simplify the notation
    * at every time step the bidirectional RNN computes features from the forward pass and from the backward pass
    * a<sup>t'</sup> is used to represent both of these **concatenated** together
    * a<sup>t'</sup> = concatenated (a_forward<sup>t'</sup>, a_backward<sup>t'</sup>)
* Single direction RNN to generate translation
    * Its first step depends on
        * s<sup>0</sup>
        * context C<sup>1</sup>, which depends on the attention weights alpha_1_t'
    * the context is a weighted sum of the features from the different time steps (activations from the first RNN), weighted by these attention weights. More formally, the attention weights satisfy
        * they are all non-negative (zero or positive) and
        * they sum to one
    * alpha_t_t' is the amount of attention that y_t should pay to a_t'.
       
<img src='./Images/W3_20.png' style="width: 60%"></img>

**Computing attention alpha<sup>t,t'</sup>** 
* Compute exp(e<sup>t,t'</sup>)
* Use softmax (divide by the sum of exp(e<sup>t,t'</sup>) over all t') to make sure that these weights sum to one if you sum over t'.
* How to compute the scores e<sup>t,t'</sup>? One way is to use a small NN
    * Input
        * s<sup>t-1</sup> (the previous state of the second RNN)
        * a<sup>t'</sup> (the features of the first RNN at timestep t')
    * Intuition: if you want to decide how much attention to pay to the activation at t', it seems like it should depend
        * mostly on your own hidden state activation from the previous time step
        * and on each of the input words' features
        * But we don't know what the function is. So we just train a very small neural network to learn whatever this function should be (backprop + gradient descent)
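
A minimal sketch of turning scores into attention weights and a context vector for one output step t. The scores in `energies` are made-up numbers standing in for the output of the small NN, and the two-dimensional `encoder_features` stand in for the a<sup>t'</sup> vectors; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Exponentiate and normalize so the weights are non-negative and sum to one."""
    m = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(energies, encoder_features):
    """Return (alphas, context): context is the alpha-weighted sum of the features."""
    alphas = softmax(energies)
    dim = len(encoder_features[0])
    context = [
        sum(alpha * feature[i] for alpha, feature in zip(alphas, encoder_features))
        for i in range(dim)
    ]
    return alphas, context


# Made-up scores e<t,t'> for three input positions and toy 2-d features a<t'>.
energies = [2.0, 0.0, 0.0]
encoder_features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
alphas, context = attention_context(energies, encoder_features)
print(alphas)   # most of the weight goes to the first input position
print(context)  # so the context is dominated by the first feature vector
```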

<img src='./Images/W3_21.png' style="width: 60%"></img>

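**Downside** of this algorithm: quadratic cost to run
* Tx input words, Ty output words => total number of attention weights is Tx × Ty
* In machine translation applications, where neither the input nor the output sentences are usually that long, this cost is acceptable
* There is also research work on reducing this cost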

**The summary:**
* First (bidirectional) RNN
    * input: x<sup>t'</sup>
    * activations a<sup>t'</sup>
* Small regular NN computes the score e<sup>t,t'</sup>
    * it depends on
        * s<sup>t-1</sup> (the previous state of the second RNN)
        * a<sup>t'</sup> (the activations of the first RNN at timestep t')
    * and is used to calculate the attention (alpha<sup>t,t'</sup>) that y<sup>t</sup> should pay to a<sup>t'</sup> by applying the softmax function
* This attention is used to calculate the context C<sup>t</sup> (sum over t' of alpha<sup>t,t'</sup> · a<sup>t'</sup>)
* Second regular RNN
    * takes as input
        * previous predicted word y_hat<sup>t-1</sup>
        * previous state s<sup>t-1</sup>
        * Context C<sup>t</sup>
    * calculates new state s<sup>t</sup>
    * outputs y_hat<sup>t</sup>
    
    
The same approach works in image captioning

In the exercise you get to implement the attention for the date normalization problem.

We can also look at a visualization of the attention weights: for corresponding input and output words, the attention weights tend to be high

<img src='./Images/W3_22.png' style="width: 60%"></img>

# Speech recognition - Audio data

### (video 9) Speech Recognition

Problem: given an audio clip, x, your job is to automatically find a text transcript, y
* audio clip = air pressure against time
* the human ear has physical structures that measure the intensity of different frequencies,
    * => a common pre-processing step for audio data is to take your raw audio clip and generate a **spectrogram**
        * the horizontal axis is time,
        * the vertical axis is frequency,
        * and the intensity of the colors shows the amount of energy
    * = how loud the sound is at different frequencies at different times
* once upon a time, speech recognition systems used to be built using **phonemes** = hand-engineered basic units of sound
* With deep learning, phoneme representations became unnecessary
    * instead, you can build systems that input an audio clip and directly output a transcript
    * one of the things that made this possible was going to much larger data sets.
        * Academic datasets for speech recognition: 300-3000h
        * Commercial datasets: 10k-100k hours

<img src='./Images/W3_23.png' style="width: 60%"></img>

**How to build a speech recognition system?**

**First method: as discussed above**
* You take different timeframes of an audio input 
* Build the attention model outputting the transcript

<img src='./Images/W3_24.png' style="width: 60%"></img>

**Second method: CTC cost for speech recognition**

CTC = Connectionist temporal classification
* We're going to use a NN with an equal number of inputs x and outputs y,
    * the slide shows a simple (uni-directional) RNN, but in practice this will usually
        * be a bidirectional LSTM
        * or bidirectional GRU
        * and usually a deeper model
    * The number of timesteps is very big
        * if you have 10 seconds of audio
        * and your features come at 100 hertz,
        * so 100 samples per second,
        * then a 10 second audio clip ends up with a thousand inputs
* The CTC cost function allows the RNN to generate an output like this
    * `ttt_h_eee___ ___qqq__`
    * _ = blank character ≠ space character
    * this is considered to be a correct output for "the q"
* the basic rule for the CTC cost function is to collapse repeated characters not separated by "blank"
    * the phrase "the quick brown fox", including spaces, actually has 19 characters,
    * and the NN can still represent this 19-character output with a thousand output values, because it is allowed to insert blanks and repeated characters
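
The collapsing rule can be sketched in a few lines. This only illustrates how a raw output string maps to a transcript, not the CTC cost function itself; the name `ctc_collapse` and the use of `_` as the blank symbol are conventions chosen for this example.

```python
from itertools import groupby

BLANK = "_"  # blank symbol, distinct from the space character


def ctc_collapse(raw_output):
    """Merge runs of repeated characters, then drop the blanks."""
    merged = (char for char, _ in groupby(raw_output))
    return "".join(char for char in merged if char != BLANK)


print(ctc_collapse("ttt_h_eee___ ___qqq__"))  # the q
print(ctc_collapse("hh_ee_ll_ll_oo"))         # hello
```

The second call shows why the blank is needed: a blank between the two runs of "l" keeps the double letter from being merged into one.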

<img src='./Images/W3_25.png' style="width: 60%"></img>

### (video 10) Trigger Word Detection

Easier to do, and needs a smaller amount of data than speech recognition
* Alexa
* Siri
* Google Home

Trigger the recognition by using a particular word.
* This area is still evolving, there is no wide consensus on the best approach to trigger word detection
* Only one example is shown
* As seen previously:
    * take an audio clip,
    * compute spectrogram features,
    * this generates audio features x1, x2, x3, ...
    * pass them through an RNN
* what remains to be done is to define the target labels y
    * you can set the target labels to 0 for everything before the point where someone finishes saying the trigger word
    * and right after that set the target label to 1.

<img src='./Images/W3_26.png' style="width: 60%"></img>

This could work reasonably well. But
* One slight disadvantage is that it creates a very imbalanced training set, with a lot more 0s than 1s.
* one other thing you could do (this is a little bit of a hack, but it can make the model a little easier to train):
    * instead of setting only a single time step to output 1,
    * you can make it output a few 1s for several time steps
    * or for a fixed period of time before reverting back to 0
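
A minimal sketch of building such target labels. The function name, the time-step indices and the default of 3 ones are assumptions made for illustration, not values from the course.

```python
def make_labels(num_steps, trigger_end_steps, ones_duration=3):
    """Label 0 everywhere, then `ones_duration` 1s right after each trigger word ends."""
    labels = [0] * num_steps
    for end in trigger_end_steps:
        # Clip at the end of the clip so we never write past the last time step.
        for t in range(end + 1, min(end + 1 + ones_duration, num_steps)):
            labels[t] = 1
    return labels


# Hypothetical clip of 12 time steps where the trigger word ends at steps 3 and 9.
single = make_labels(12, [3, 9], ones_duration=1)
widened = make_labels(12, [3, 9], ones_duration=2)
print(single)   # one 1 per trigger word: very imbalanced
print(widened)  # a few 1s per trigger word: slightly less imbalanced
```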

# Quiz

<img src='./Images/Q3_1.png' style="width: 80%"></img>
<img src='./Images/Q3_2.png' style="width: 80%"></img>
<img src='./Images/Q3_3.png' style="width: 80%"></img>
<img src='./Images/Q3_4.png' style="width: 80%"></img>