## Theory of RNNs: A Base for Building Chatbots


When we form a sentence, each word we choose depends on the words that came before it, and that is what makes the sentence understandable. An RNN works the same way: it has persistence, or memory, which stores information about the previous words so that the next word can depend on them.

An RNN generalises this idea by reasoning about each new word in light of the words before it, which a plain feed-forward neural network cannot do because it has no memory of earlier inputs.

Recurrent neural nets are networks with loops in them, which allow information to persist. An RNN can also be thought of as a chain of copies of the same network, each passing information to its successor.

<img src="./images/rnn_unroll.png"></img>

The best part about RNNs is that a model can be built with different constraints: it can take inputs of different sizes and produce outputs of different sizes, unlike networks whose input and output sizes are fixed by their layers. There are several architectures of the model, namely:

1. One To One
2. Many To Many
3. One To Many
4. Many To One
5. Many To Many with the same number of inputs and outputs

<img src="./images/arch.png"></img>

In this figure, the vectors running from left to right hold the state of the RNN, the bottom layer is the input layer, and the top layer is the output layer. This variety of architectures is mainly used to analyse textual data.


## RNN Computation

RNNs depend not only on the current input but also on values from the past. An RNN accepts an input vector and gives an output vector, using the history of values stored in its state.

`rnn=RNN()`

`y=rnn.step(x)`

The `RNN` class has some internal state, which gets updated every time `step` is called.

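In general, `step` can be implemented as follows:

    import numpy as np

    class RNN:
        # W_hh, W_xh, W_hy and the hidden state h are attributes
        # that are assumed to be set up elsewhere (e.g. in __init__)
        def step(self, x):
            # update the hidden state
            self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
            # compute the output vector
            y = np.dot(self.W_hy, self.h)
            return y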

Here:

1. `W_hh` is the weight matrix applied to the previous hidden state
2. `W_xh` is the weight matrix applied to the current input
3. `W_hy` is the weight matrix that maps the hidden state to the output

The hidden state self.h is initialized with the zero vector. The np.tanh (hyperbolic tangent) function implements a non-linearity that squashes the activations to the range [-1, 1].
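
To make the class above runnable, here is a minimal sketch with a constructor added. The class name `VanillaRNN`, the layer sizes, the random seed and the `0.01` weight scale are all illustrative assumptions, not part of the original code; only `step` and the zero-initialised hidden state come from the text.

```python
import numpy as np

class VanillaRNN:
    """Same step() as above, plus a constructor (sizes and init scale are assumptions)."""

    def __init__(self, input_size, hidden_size, output_size, seed=0):
        rng = np.random.default_rng(seed)
        # Small random weights; the 0.01 scale is an illustrative choice.
        self.W_xh = rng.standard_normal((hidden_size, input_size)) * 0.01
        self.W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.01
        self.W_hy = rng.standard_normal((output_size, hidden_size)) * 0.01
        # The hidden state starts as the zero vector.
        self.h = np.zeros(hidden_size)

    def step(self, x):
        # Update the hidden state from the previous state and the current input.
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        # Compute the output vector from the new hidden state.
        return np.dot(self.W_hy, self.h)

rnn = VanillaRNN(input_size=4, hidden_size=8, output_size=3)
sequence = np.eye(4)  # four one-hot input vectors, one per time step
outputs = [rnn.step(x) for x in sequence]
```

Because `self.h` is carried over between calls, each output depends on all the inputs seen so far, not just the current one.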


There are two terms inside of the tanh: one is based on the previous hidden state and one is based on the current input. In numpy np.dot is matrix multiplication. The two intermediates interact with addition, and then get squashed by the tanh into the new state vector.
The math notation for the hidden state update is:
<img src="./images/formula1.png"></img>
where tanh is applied elementwise.
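
For reference, the two lines of the `step` function can be transcribed into LaTeX as follows. The symbols $h_t$, $x_t$ and $y_t$ (hidden state, input and output at time step $t$) are naming assumptions chosen to match the code; the weight matrices are the ones defined above.

$$h_t = \tanh\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right)$$

$$y_t = W_{hy}\, h_t$$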

RNNs can also be stacked. For instance, a 2-layer recurrent network can be formed as follows:

`y1 = rnn1.step(x)`

`y = rnn2.step(y1)`

In other words we have two separate RNNs: one RNN is receiving the input vectors and the second RNN is receiving the output of the first RNN as its input. Neither of these RNNs knows or cares about this: it’s all just vectors coming in and going out, and some gradients flowing through each module during backpropagation.
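
A minimal sketch of this stacking, assuming a small helper class `TinyRNN` (a hypothetical name) with illustrative sizes and random weights. The only requirement is that the output size of the first layer matches the input size of the second.

```python
import numpy as np

class TinyRNN:
    """Minimal RNN layer; the name and the weight initialisation are illustrative."""

    def __init__(self, input_size, hidden_size, output_size, rng):
        self.W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1
        self.W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
        self.W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1
        self.h = np.zeros(hidden_size)

    def step(self, x):
        # Same update as the single-layer RNN: new state, then output.
        self.h = np.tanh(self.W_hh @ self.h + self.W_xh @ x)
        return self.W_hy @ self.h

rng = np.random.default_rng(0)
rnn1 = TinyRNN(input_size=4, hidden_size=8, output_size=6, rng=rng)
rnn2 = TinyRNN(input_size=6, hidden_size=8, output_size=3, rng=rng)

x = np.array([1.0, 0.0, 0.0, 0.0])
y1 = rnn1.step(x)   # the first layer sees the raw input
y = rnn2.step(y1)   # the second layer sees the first layer's output
```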

I’d like to briefly mention that in practice most of us use a slightly different formulation than what I presented above called a Long Short-Term Memory (LSTM) network. The LSTM is a particular type of recurrent network that works slightly better in practice, owing to its more powerful update equation and some appealing backpropagation dynamics. I won’t go into details, but everything I’ve said about RNNs stays exactly the same, except the mathematical form for computing the update (the line self.h = … ) gets a little more complicated. 

## LSTM NETWORKS

Long Short-Term Memory networks are a special kind of RNN, capable of learning long-term dependencies.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

<img src="./images/LSTM.png"></img>

LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

<img src="./images/LSTM2.png"></img>
<img src="./images/LSTM_no.png"></img>

### The Core Idea Behind LSTMs
The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.
<img src="./images/gate.png"></img>
The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.


The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”
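
A gate can be sketched in a few lines. In a real LSTM the values fed to the sigmoid come from a learned layer; here they are hand-picked (an assumption made purely for illustration) so that the effect of "closed", "half open" and "open" is easy to see.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def gate(signal, pre_activation):
    # Pointwise multiplication: each component of the signal is scaled
    # by a number between 0 (blocked) and 1 (let through).
    return sigmoid(pre_activation) * signal

signal = np.array([2.0, -1.0, 0.5])
pre_activation = np.array([-20.0, 0.0, 20.0])  # closed, half open, open
gated = gate(signal, pre_activation)
# gated is approximately [0.0, -0.5, 0.5]
```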

An LSTM has three of these gates, to protect and control the cell state.



### Step-by-Step LSTM Walk Through
The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at $h_{t-1}$ and $x_t$, and outputs a number between 0 and 1 for each number in the cell state $C_{t-1}$. A 1 represents “completely keep this” while a 0 represents “completely get rid of this.”
Let’s go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.
<img src="./images/LSTM_step1.png"></img>
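
The forget gate can be sketched as a sigmoid layer over the concatenation of the previous hidden state and the current input. The parameter names `W_f` and `b_f`, the sizes and the random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    # Look at the previous hidden state and the current input together.
    concat = np.concatenate([h_prev, x_t])
    # One number between 0 and 1 for each entry of the cell state.
    return sigmoid(W_f @ concat + b_f)

hidden_size, input_size = 3, 2
rng = np.random.default_rng(0)
W_f = rng.standard_normal((hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

f_t = forget_gate(np.zeros(hidden_size), np.ones(input_size), W_f, b_f)
```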

The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.

In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.
<img src="./images/LSTM_step2.png"></img>
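
A sketch of these two parts, under the same assumptions as before: the parameter names (`W_i`, `b_i`, `W_c`, `b_c`), the sizes and the random weights are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate_and_candidate(h_prev, x_t, W_i, b_i, W_c, b_c):
    concat = np.concatenate([h_prev, x_t])
    # Sigmoid layer: which entries of the cell state to update (0 to 1).
    i_t = sigmoid(W_i @ concat + b_i)
    # Tanh layer: candidate values that could be written (-1 to 1).
    c_tilde = np.tanh(W_c @ concat + b_c)
    return i_t, c_tilde

hidden_size, input_size = 3, 2
rng = np.random.default_rng(0)
shape = (hidden_size, hidden_size + input_size)
W_i, W_c = rng.standard_normal(shape), rng.standard_normal(shape)
b_i, b_c = np.zeros(hidden_size), np.zeros(hidden_size)

i_t, c_tilde = input_gate_and_candidate(
    np.zeros(hidden_size), np.ones(input_size), W_i, b_i, W_c, b_c
)
```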


It’s now time to update the old cell state, $C_{t-1}$, into the new cell state $C_t$. The previous steps already decided what to do, we just need to actually do it.

We multiply the old state by $f_t$, forgetting the things we decided to forget earlier. Then we add $i_t * \tilde{C}_t$. These are the new candidate values, scaled by how much we decided to update each state value.

In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.
<img src="./images/LSTM_step3.png"></img>
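
The update itself is a single line of elementwise arithmetic. The gate values below are hand-picked (an assumption for illustration) to show the three cases: keep the old value, replace it, and blend the two.

```python
import numpy as np

c_prev  = np.array([1.0, -2.0, 0.5])   # old cell state
f_t     = np.array([1.0,  0.0, 0.5])   # forget gate: keep, drop, keep half
i_t     = np.array([0.0,  1.0, 0.5])   # input gate: ignore, write, write half
c_tilde = np.array([0.9,  0.3, -0.4])  # candidate values

# Forget part of the old state, then add the scaled candidates.
c_t = f_t * c_prev + i_t * c_tilde
# c_t is [1.0, 0.3, 0.05]
```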


Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.

<img src="./images/LSTM_step4.png"></img>
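
Putting the four steps together gives one full LSTM time step. This is a sketch under stated assumptions, not a reference implementation: the function name `lstm_step`, the `params` dictionary layout, the sizes and the random weights are all made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    concat = np.concatenate([h_prev, x_t])
    # Step 1: forget gate decides what to drop from the cell state.
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])
    # Step 2: input gate and candidate values decide what to add.
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])
    # Step 3: update the cell state.
    c_t = f_t * c_prev + i_t * c_tilde
    # Step 4: output gate filters a tanh-squashed copy of the cell state.
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
params = {}
for name in ("f", "i", "c", "o"):
    params["W_" + name] = rng.standard_normal((hidden_size, hidden_size + input_size)) * 0.1
    params["b_" + name] = np.zeros(hidden_size)

# Run the cell over a sequence of five random input vectors.
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.standard_normal((5, input_size)):
    h, c = lstm_step(x_t, h, c, params)
```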


# Chatbots


## Padding

Before training, we work on the dataset to convert the variable length sequences into fixed length sequences, by padding. We use a few special symbols to fill in the sequence.

1.    EOS : End of sentence
2.    PAD : Filler
3.    GO : Start decoding
4.    UNK : Unknown; word not in vocabulary

Consider the following query-response pair.

    Q : How are you?
    A : I am fine.

Assuming that we would like our sentences (queries and responses) to be of fixed length, 10, this pair will be converted to:

    Q : [ PAD, PAD, PAD, PAD, PAD, PAD, “?”, “you”, “are”, “How” ]
    A : [ GO, “I”, “am”, “fine”, “.”, EOS, PAD, PAD, PAD, PAD ]
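
The conversion above can be sketched as two small helpers. The function names are hypothetical, and two details are inferred from the example rather than stated in the text: the query is reversed and padded on the left, while the response is wrapped in GO and EOS and padded on the right. The sketch assumes the sentence fits into the target length.

```python
PAD, GO, EOS = "PAD", "GO", "EOS"

def pad_query(tokens, length):
    """Reverse the query and left-pad it, matching the example above."""
    reversed_tokens = list(reversed(tokens))
    return [PAD] * (length - len(reversed_tokens)) + reversed_tokens

def pad_response(tokens, length):
    """Wrap the response in GO ... EOS and right-pad it."""
    wrapped = [GO] + list(tokens) + [EOS]
    return wrapped + [PAD] * (length - len(wrapped))

q = pad_query(["How", "are", "you", "?"], 10)
a = pad_response(["I", "am", "fine", "."], 10)
```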


## Bucketing

Bucketing means putting sentences into buckets of different sizes, so that short sentences do not have to be padded to the length of the longest one. Consider this list of buckets: [ (5,10), (10,15), (20,25), (40,50) ]. If the length of a query is 4 and the length of its response is 4 (as in our previous example), we put this pair in the bucket (5,10). The query will be padded to length 5 and the response will be padded to length 10. While running the model (training or predicting), we use a different model for each bucket, compatible with the lengths of query and response. All these models share the same parameters and hence function exactly the same way.

If we are using the bucket (5,10), our sentences will be encoded to :

    Q : [ PAD, “?”, “you”, “are”, “How” ]
    A : [ GO, “I”, “am”, “fine”, “.”, EOS, PAD, PAD, PAD, PAD ]
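
Choosing a bucket can be sketched as picking the first one that is large enough. The function name `choose_bucket` is hypothetical, and reserving two extra response slots for GO and EOS is an assumption based on the padded example above.

```python
BUCKETS = [(5, 10), (10, 15), (20, 25), (40, 50)]

def choose_bucket(query_len, response_len, buckets=BUCKETS):
    # The response needs two extra slots for GO and EOS (an assumption
    # based on the padded example above).
    for q_size, r_size in buckets:
        if query_len <= q_size and response_len + 2 <= r_size:
            return (q_size, r_size)
    return None  # too long for every bucket

bucket = choose_bucket(4, 4)  # the "How are you?" pair
```

Pairs that fit no bucket return `None` here; whether to drop or truncate them is a design choice the text does not cover.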


## Word Embedding

Word embedding is a technique for learning dense representations of words in a low-dimensional vector space. Each word can be seen as a point in this space, represented by a fixed-length vector. Semantic relations between words are captured by this technique. The word vectors have some interesting properties.

Word embedding is typically done in the first layer of the network, the embedding layer, which maps a word (given as its index in the vocabulary) to a dense vector of a given size. In the seq2seq model, the weights of the embedding layer are jointly trained with the other parameters of the model.
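
At its core, an embedding layer is a lookup into a matrix with one row per vocabulary word. The sketch below assumes a toy vocabulary, an embedding size of 4 and random (untrained) weights; in a real model the rows would be learned during training.

```python
import numpy as np

vocab = ["PAD", "GO", "EOS", "UNK", "how", "are", "you", "?"]
word_to_index = {word: i for i, word in enumerate(vocab)}

embedding_size = 4
rng = np.random.default_rng(0)
# One row per word; random here, trained jointly with the model in practice.
embedding_matrix = rng.standard_normal((len(vocab), embedding_size)) * 0.1

def embed(tokens):
    unk = word_to_index["UNK"]
    # Words outside the vocabulary fall back to the UNK index.
    indices = [word_to_index.get(token, unk) for token in tokens]
    return embedding_matrix[indices]

vectors = embed(["how", "are", "you", "today"])  # "today" is not in the vocabulary
```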

## Attention Mechanism

One of the limitations of the seq2seq framework is that all the information in the input sentence has to be encoded into a single fixed-length vector, the context. As the length of the sequence gets larger, we start losing a considerable amount of information. This is why the basic seq2seq model doesn’t work well in decoding long sequences. The attention mechanism allows the decoder to selectively look at the input sequence while decoding. This takes the pressure off the encoder to encode every useful piece of information from the input.

<img src="./images/attention1.png"></img>
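
One way to sketch "selectively looking at the input" is to score every encoder state against the current decoder state, turn the scores into weights with a softmax, and take the weighted average. The text does not say which scoring function is used; the plain dot product below is an assumption, as are the function names, sizes and random vectors.

```python
import numpy as np

def softmax(scores):
    # Subtract the maximum for numerical stability.
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

def attend(decoder_state, encoder_states):
    # One score per input position (dot-product scoring, an assumption).
    scores = encoder_states @ decoder_state
    # Weights are positive and sum to 1.
    weights = softmax(scores)
    # Context vector: weighted average of the encoder states.
    context = weights @ encoder_states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.standard_normal((5, 8))  # 5 input positions, state size 8
decoder_state = rng.standard_normal(8)
context, weights = attend(decoder_state, encoder_states)
```

The context is recomputed at every decoding step, so the decoder is no longer limited to one fixed-length summary of the whole input.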