# Deep Neural Networks...

### There are tons of Deep Neural Network architectures! But some are better than others for specific problems.
<br/><img src="images/book.jpg" width = 70%>


<img src="images/neural-networks.png" width = 100%>

# So let's pick a use case to focus on the architectures that are useful... Language Translation!

<img src="images/google%20translate.PNG" width = 100%>

https://translate.google.ca/

### So... let's break down some of the architectures and preprocessing we need before we get started with translation and natural language processing.

## Recurrent Neural Networks (RNN) for Generating Natural Language 
<br/>
<img src="images/RNN-NLP.gif" width="100%">
<br/>
<br/>

# Why RNNs?
### An RNN treats each word of a sentence as a separate input arriving at time ‘t’. Alongside the input at time ‘t’, it also takes its own activation from time ‘t-1’ as an input. Because every word is tied to a time step and the previous activation is carried forward, the network keeps a memory of past inputs and can better interpret the sentence as a whole.
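### The recurrence described above can be sketched in a few lines of NumPy. This is a minimal, untrained toy, not a production RNN: the names `rnn_forward`, `W_xh`, `W_hh` and `b_h`, the `tanh` activation, and the toy sizes are all assumptions made for illustration.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence of word vectors, one time step per word."""
    h = np.zeros(W_hh.shape[0])      # activation before the first word
    hidden_states = []
    for x_t in inputs:               # x_t is the word vector at time t
        # The new activation mixes the current word with the activation from t-1
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hidden_states.append(h)
    return np.stack(hidden_states)   # shape: (num_words, hidden_dim)

rng = np.random.default_rng(0)
embed_dim, hidden_dim, num_words = 4, 3, 5            # toy sizes (assumed)
sentence = rng.normal(size=(num_words, embed_dim))    # stand-in for word embeddings
W_xh = rng.normal(scale=0.5, size=(hidden_dim, embed_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

states = rnn_forward(sentence, W_xh, W_hh, b_h)
print(states.shape)  # (5, 3): one hidden state per word
```

### Each row of `states` depends on every earlier word through `h`, which is the "memory" the text refers to.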

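### However, RNNs have weaknesses. They handle short-term memory well (a sentence, for example) but struggle to retain information over longer sequences. This is known as the Vanishing Gradient Problem: the training signal shrinks as it is passed back through many time steps.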
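### A rough way to see the vanishing gradient is to track how much the last activation of a one-unit RNN depends on the first one. The sketch below is a toy under stated assumptions: a scalar recurrent weight `w`, a constant input `x`, and a `tanh` activation, none of which come from the slides.

```python
import numpy as np

def gradient_wrt_first_state(num_steps, w=0.9, x=0.5):
    """Derivative of the final activation with respect to the initial one
    for the scalar recurrence h_t = tanh(w * h_{t-1} + x)."""
    h, grad = 0.0, 1.0
    for _ in range(num_steps):
        h = np.tanh(w * h + x)
        # Chain rule: each step multiplies the gradient by w * (1 - h_t^2)
        grad *= w * (1.0 - h ** 2)
    return grad

g5 = gradient_wrt_first_state(5)
g50 = gradient_wrt_first_state(50)
print(g5, g50)  # the second value is many orders of magnitude smaller
```

### Every factor in the product is smaller than 1 here, so the influence of early words fades quickly as the sequence grows.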

# Language Translation:
## A simple Sequence 2 Sequence architecture using RNNs
<br/>
<img src="images/sec2sec.jpeg" width="100%">

### For a sentence, each word is fed into our RNN along with the hidden state produced by the previous word. Once the last word has been read, the "Encoder" has compressed the whole sentence into a single vector (the Context Vector). That vector is then fed into the "Decoder" RNN, which performs the reverse operation: it unrolls the vector word by word, but in a different language.
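### The encoder/decoder flow can be sketched as follows. This is only a shape-level illustration with random, untrained weights, so the "translation" it prints is meaningless. All names (`encode`, `decode`, `E_src`, `E_tgt`, `W_out`), the vocabulary sizes, the start token id and the greedy `argmax` decoding are assumptions, not something the slides specify.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, embed_dim = 8, 6
src_vocab, tgt_vocab = 10, 12                      # toy vocabulary sizes (assumed)

# Randomly initialised (untrained) parameters, for illustration only
E_src = rng.normal(size=(src_vocab, embed_dim))    # source word embeddings
E_tgt = rng.normal(size=(tgt_vocab, embed_dim))    # target word embeddings
enc_Wx = rng.normal(size=(hidden_dim, embed_dim))
enc_Wh = rng.normal(size=(hidden_dim, hidden_dim))
dec_Wx = rng.normal(size=(hidden_dim, embed_dim))
dec_Wh = rng.normal(size=(hidden_dim, hidden_dim))
W_out = rng.normal(size=(tgt_vocab, hidden_dim))   # hidden state -> target word scores

def encode(src_ids):
    """Read the source sentence word by word; the last hidden state is the context vector."""
    h = np.zeros(hidden_dim)
    for i in src_ids:
        h = np.tanh(enc_Wx @ E_src[i] + enc_Wh @ h)
    return h

def decode(context, max_len=5, start_id=0):
    """Start from the context vector and emit one target word id per step (greedy)."""
    h, token, output = context, start_id, []
    for _ in range(max_len):
        h = np.tanh(dec_Wx @ E_tgt[token] + dec_Wh @ h)
        token = int(np.argmax(W_out @ h))          # pick the highest-scoring word
        output.append(token)
    return output

context = encode([3, 1, 4, 1, 5])                  # hypothetical source word ids
translation = decode(context)
print(context.shape, len(translation))  # (8,) 5
```

### Note that the decoder only ever sees `context`, a single fixed-length vector, no matter how long the source sentence is.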

### This Encoder/Decoder Architecture is called Sequence 2 Sequence

### These RNN cells are usually LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit) cells, which we will learn about in the Recurrent Neural Network chapter!

<img src="images/seq2seq.gif" width="100%">

<img src="images/seq2seq2.gif" width="100%">

# Adding Attention
### A critical and apparent disadvantage of this fixed-length context vector design is its incapability of remembering long sentences. Often it has forgotten the first part once it completes processing the whole input. (https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html#whats-wrong-with-seq2seq-model)

### To deal with this fixed-length bottleneck and give the network more memory, we add a mechanism called Attention! (http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf).

### The attention mechanism was born to help memorize long source sentences in neural machine translation (NMT). Rather than building a single context vector out of the encoder’s last hidden state, the secret sauce invented by attention is to create shortcuts between the context vector and the entire source input. The weights of these shortcut connections are customizable for each output element.
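### One common way to write these shortcut connections down is shown here. The notation is an assumption and is not taken from the slides: $h_i$ is the encoder hidden state for source word $i$, $s_{t-1}$ is the previous decoder state, and $\mathrm{score}$ stands for any function that rates how well the two match.

$$
e_{t,i} = \mathrm{score}(s_{t-1}, h_i), \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j} \exp(e_{t,j})}, \qquad
c_t = \sum_{i} \alpha_{t,i}\, h_i
$$

### The weights $\alpha_{t,i}$ are recomputed for every output position $t$, which is what makes the context vector $c_t$ customizable for each output element.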


<img src="images/attention.gif" width="100%">

https://towardsdatascience.com/attn-illustrated-attention-5ec4ad276ee3

# Attention Mechanism in Detail:
<br/>

## Step 1: Our Normal Sequence 2 Sequence RNN Network

<img src="images/1.gif" width="100%">

<br/>

## Step 2: Compute an Attention Score for each Input Word

<img src="images/2.gif" width="100%">

<br/>

## Step 3: Apply Softmax to Turn the Scores into Weights

<img src="images/3.gif" width="100%">

<br/>

## Step 4: Multiply each Encoder Hidden State by its Weight

<img src="images/4.gif" width="100%">

<br/>

## Step 5: Take the Weighted Sum of the Encoder Hidden States

<img src="images/5.gif" width="100%">

<br/>

## Step 6: Feed the Sum into our Decoder so it Can Draw on Every Word of the Input, Earlier or Later

<img src="images/6.gif" width="100%">
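### The steps above can be sketched for a single decoder step in NumPy. This is a minimal sketch under stated assumptions: the score is a plain dot product between the decoder state and each encoder hidden state (the slides do not say which score function is used), and the names `attention_context`, `encoder_states` and `decoder_state`, along with the toy numbers, are made up for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(encoder_states, decoder_state):
    """Compute one attention context vector.

    encoder_states: (num_source_words, hidden_dim), one row per input word
    decoder_state:  (hidden_dim,), the current decoder hidden state
    """
    # Score each input word (assumed: dot product with the decoder state)
    scores = encoder_states @ decoder_state
    # Softmax turns the scores into weights that sum to 1
    weights = softmax(scores)
    # Scale each encoder hidden state by its weight
    weighted_states = weights[:, None] * encoder_states
    # Sum the weighted states into a single context vector
    context = weighted_states.sum(axis=0)
    return context, weights

# Toy example: four source words, hidden size 3
encoder_states = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0],
                           [0.0, 0.0, 1.0],
                           [1.0, 1.0, 0.0]])
decoder_state = np.array([2.0, 0.0, 0.0])

context, weights = attention_context(encoder_states, decoder_state)
print(weights)   # the first and last words receive the largest weights
print(context)   # this vector is what gets fed to the decoder
```

### In the final step, `context` would be combined with the decoder's own input (for example by concatenation, which is again an assumption) before the decoder predicts the next word.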


# Google's Neural Machine Translation System (AKA Google Translate)
<img src="images/google-nmt.png" width="100%">