# **Sequential data**

In [None]:
# Part one: Learning from sequences
# Part two: RNNs
# Part three: LSTMs
# Part four: CNNs for sequential data
# Part five: ELMo, a case study

# **Part one: Learning from sequences**

In [None]:
#What kind of sequential data can we expect? The simplest case is probably one-dimensional numeric sequential data: a time series. We can also have n-dimensional numeric data, or the data can be symbolic,
#where at every time step we are given a symbol from a fixed vocabulary. The prime example is language, which can be viewed as a symbolic sequence in two ways: we can break it up into
#words, in which case we have a very large vocabulary and one word per time step, or we can view language as a sequence of characters, which gives a much smaller vocabulary and a much longer
#sequence. Datasets generally come in one of two types: either a single sequence or a set of sequences.
#One useful way to split such data into train and test sets is walk-forward validation: the model is always trained on the past and evaluated on data that comes later in time.
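#Walk-forward validation can be sketched as follows. This is only a minimal illustration: the function name walk_forward_splits and its parameters (n_folds, test_size) are made up for this
#example, and the training window is assumed to grow with every fold (an "expanding window"); a fixed-size sliding window is an equally valid variant.

```python
def walk_forward_splits(n_samples, n_folds, test_size):
    """Yield (train_indices, test_indices) pairs for walk-forward validation.

    Each fold trains on everything before its test window, and the test
    window moves forward in time from fold to fold.
    """
    first_test_start = n_samples - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        test_start = first_test_start + fold * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


series = [10, 11, 13, 12, 15, 16, 18, 17, 19, 21]  # toy time series
splits = list(walk_forward_splits(len(series), n_folds=3, test_size=2))
for train_idx, test_idx in splits:
    # Every training index is earlier in time than every test index.
    print("train:", train_idx, "test:", test_idx)
```

#The point of the design is that no fold ever trains on data that lies in the future of its test window, which a random shuffle split would not guarantee.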
#In short, sequences consist of numbers, vectors or symbols,
#and datasets consist of either one sequence per instance or a single sequence of instances.
#sequence models: operate on inputs of different lengths (using the same weights).
#input: raw sequence data
#output: classification, regression, token prediction, sequence-to-sequence.
#layers: sequence-to-sequence. But what is a sequence-to-sequence layer? It is a layer that takes as input a sequence of t vectors and produces as output another sequence of t vectors.
#The input and output dimensions may differ, but the length of the sequence is the same in both cases. We could generalize this from vectors to tensors, but for this lecture we will
#stick to vectors. Again, the defining property of a sequence layer is that the same layer with the same weights can be applied to sequences of different lengths.

#IF YOU WANT TO THINK OF A CONCRETE EXAMPLE, JUST TAKE A CONVOLUTION; JUST THINK OF A ONE-DIMENSIONAL CONVOLUTION IN PLACE OF SEQUENCE TO SEQUENCE LAYER. 
#The first important property that sequence-to-sequence layers may or may not have is causality: a causal layer can only look backward in the sequence, so the output at time t depends only on inputs at times up to t.
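#To make the convolution example and the idea of causality concrete, here is a small NumPy sketch. The kernel values and the helper name causal_conv1d are invented for illustration, and the sequence
#is assumed to hold one scalar per time step. As is usual in deep learning, the "convolution" is computed without flipping the kernel.

```python
import numpy as np


def causal_conv1d(x, kernel):
    """Causal 1D convolution over a sequence of scalars.

    Output t depends only on x[t - k + 1 .. t]. The sequence is left-padded
    with zeros so that the output has the same length as the input.
    """
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])


kernel = np.array([0.25, 0.25, 0.5])   # hypothetical weights, shared across positions

short = np.array([1.0, 2.0, 3.0, 4.0])
long = np.arange(10, dtype=float)

y_short = causal_conv1d(short, kernel)  # same weights, sequence length 4
y_long = causal_conv1d(long, kernel)    # same weights, sequence length 10

# Causality check: changing a future input leaves earlier outputs untouched.
changed = short.copy()
changed[3] = 100.0
y_changed = causal_conv1d(changed, kernel)
print(np.allclose(y_short[:3], y_changed[:3]))
```

#The same three weights handle both sequence lengths, and only the last output changes when the last input changes.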
#If our data has discrete inputs, we first need to turn it into a sequence of vectors. One way to do that is with one-hot vectors, whose length equals the number of options.
#Another approach is to create embedding vectors: for every element in our set of options, that is, every token in our vocabulary, we create a vector of parameters that represents that token.
#These embedding vectors are parameters, so all of their entries are learned during the training of the neural network.
#Embedding vectors are not specific to sequence learning; we will see them in other settings as well.
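#The two encodings can be compared side by side in NumPy. The vocabulary, the embedding dimension of 5 and the random initialisation are assumptions made for this sketch; in a real model the embedding
#matrix would be updated by gradient descent along with the other weights.

```python
import numpy as np

vocab = ["the", "cat", "sat", "mat"]          # toy vocabulary (hypothetical)
token_to_id = {tok: i for i, tok in enumerate(vocab)}

sentence = ["the", "cat", "sat"]
ids = np.array([token_to_id[tok] for tok in sentence])

# One-hot: one vector per token, as long as the vocabulary.
one_hot = np.eye(len(vocab))[ids]              # shape (3, 4)

# Embeddings: one parameter vector per token. Here the matrix is only
# randomly initialised; training would adjust its entries.
rng = np.random.default_rng(0)
embedding_dim = 5
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))
embedded = embedding_matrix[ids]               # shape (3, 5)

# Looking up a row equals multiplying the one-hot vector by the matrix.
print(np.allclose(one_hot @ embedding_matrix, embedded))
```

#Seen this way, an embedding layer is a linear layer applied to one-hot inputs, implemented as a cheap table lookup.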

In [None]:
#Model configurations: 
# Sequence-to-sequence: POS tagging, machine translation, robot control, generation
# Sequence-to-label: Classification, regression
# Label-to-sequence: generative models
# Label+seq-to-sequence: Teacher forcing
###To recap: sequence-to-sequence models are defined by a set of fixed weights that can be applied to variable-length inputs. We have three instances: RNNs, CNNs and self-attention. Embeddings, padding, masking and packing
#help us pre-process our data and feed it to a deep learning model. These models are versatile: we can set them up as seq-to-seq, label-to-seq or seq-to-label, and train them autoregressively,
#with teacher forcing, and more.
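#Padding and masking, mentioned in the recap, can be sketched in plain Python. The helper pad_batch is hypothetical, and the sketch assumes that token id 0 is reserved for padding;
#deep learning libraries provide their own utilities for this.

```python
def pad_batch(sequences, pad_value=0):
    """Pad variable-length sequences to a common length and build a mask.

    Returns (padded, mask), where mask[i][t] is 1 for a real token and 0 for padding.
    """
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_value] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


batch = [[5, 2, 9], [7], [3, 3, 8, 1]]   # token ids of three sequences (made up)
padded, mask = pad_batch(batch)
for row, m in zip(padded, mask):
    print(row, m)
```

#The mask lets later steps, such as the loss computation, ignore the padded positions so that they do not influence training.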

In [None]:
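###RNNs: an RNN is basically any neural network that has a cycle in it.
###How do we train RNNs? By unrolling: the cycle is unrolled over the time steps, which gives a feed-forward network with shared weights that can be trained with ordinary backpropagation.
###Note the following: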
#     -RNNs are sequence-to-sequence layers(shared weights, variable length)
#     -RNNs are causal: Only backwards connections
#     -Potentially unbounded memory (in theory; in practice it is limited by vanishing/exploding gradients): if we unroll over the whole sequence, there is a computational connection between the first element of the input sequence and the last element of the output sequence, no
#      matter how long the sequence is. This is where RNNs differ from CNNs, which have a finite receptive field: with
#      one convolutional layer and a kernel of size three, any output can depend on only three of the inputs, and it cannot look arbitrarily far along the input sequence. The RNN, in contrast,
#      can always look arbitrarily far back in the input sequence. The price we pay for this potentially long memory is that RNNs are quite slow to evaluate, because they must be processed
#      sequentially: to compute the fourth hidden state h4 we first need h3, to compute h3 we need h2, and so on.
#      ==> We cannot compute these four hidden states in parallel. In a CNN, each of the four outputs can be handed to its own thread, which computes that output in parallel
#      from the weights of the convolution and the inputs; the outputs do not need to refer to each other. This makes RNNs slower than most other neural network layers.
#      (We are comparing single layers here: one RNN layer unrolled over four time steps, and one Conv1D layer with four inputs and four outputs.)
###The vanishing gradient problem can be reduced by replacing the sigmoids with ReLUs, by making sure the weight matrices are properly initialized, and perhaps by adding the occasional normalization step
# in between. But in the late 90s, when recurrent neural networks were very popular, those options were not available yet, and people came up with a very different solution: the LSTM.
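#Unrolling and the sequential dependency between hidden states can be sketched in NumPy. The notes do not fix a particular recurrence, so this assumes a simple Elman-style update
#h_t = tanh(W_x x_t + W_h h_{t-1} + b); the names rnn_forward, W_x, W_h and b and all sizes are illustrative.

```python
import numpy as np


def rnn_forward(xs, W_x, W_h, b):
    """Unroll a simple recurrent layer over a sequence.

    xs has shape (t, input_dim); returns hidden states of shape (t, hidden_dim).
    """
    h = np.zeros(W_h.shape[0])            # initial hidden state h_0
    hidden_states = []
    for x in xs:                          # sequential: h_t needs h_{t-1}
        h = np.tanh(W_x @ x + W_h @ h + b)
        hidden_states.append(h)
    return np.stack(hidden_states)


rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

# The same weights are applied to sequences of different lengths.
h_short = rnn_forward(rng.normal(size=(5, input_dim)), W_x, W_h, b)
h_long = rnn_forward(rng.normal(size=(12, input_dim)), W_x, W_h, b)
print(h_short.shape, h_long.shape)
```

#The Python loop cannot be replaced by one parallel operation over time steps, which is exactly the slowness described above; the causal convolution, by contrast, computes every output independently.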


In [None]:
####CNNs for sequential data
https://www.youtube.com/watch?v=rT77lBfAZm4&list=PLIXJ-Sacf8u7756f8QFM_FNZQxdJov8f4&index=4&ab_channel=DLVU


In [None]:
#Part one: Self-attention: The basic sequence to sequence operation that drives all transformer models.
#Part two: Transformers: We will look at how to build this self-attention up into a complete transformer model.
#Part three: Famous transformers: We will look at some famous examples of transformer models, and at the finer details of how they are constructed and how they were trained.
#Part four: Advanced tricks: Techniques that are being studied to improve the performance of transformer models in various ways.

# **PART ONE: SELF-ATTENTION**

In [None]:
#Before this, we talked for the first time about sequence-to-sequence layers: neural network layers that take as input a sequence of tensors, usually a sequence of vectors, and produce a
#sequence of vectors as output, where both sequences have the same length. The direction in which the sequence extends is called "the time direction".
##RECAP:
#Defining property: Can handle sequences of different lengths with the same parameters.
#Versatile: label-to-sequence, sequence-to-label, sequence-to-sequence, autoregressive training.
#Causal or non-causal: causal models can only look backward.

#The aim of self-attention as a sequence-to-sequence layer is to give us the best of both worlds: parallel computation (like a CNN) and long dependencies (like an RNN: the ability to look at any point of the sequence
#before or after the current output).
#Simple self-attention: the basic idea.
#Practical self-attention: adding some bells and whistles.

In [None]:
#At heart, self-attention is a very simple operation. A self-attention layer has a sequence of input vectors and a sequence of output vectors, and the basic operation that produces any given output vector
#is simply a weighted sum over the input vectors: y_i = Σ_j w_ij x_j. For every output we have one weight per input (six weights in this example, since we have 6 inputs). We perform the weighted sum over the input vectors, and that is
#our output vector. We do this for every output, and that gives us our sequence of outputs.
#The trick that makes this special is that w_ij is not a parameter of the model, but a derived value that we compute from the input.

#To compute w_ij, we first calculate the raw weight w'_ij = transpose(x_i) x_j, and then apply a softmax over j: w_ij = exp(w'_ij) / Σ_j exp(w'_ij).
#If we do this naively, it involves a lot of loops, and we don't like loops in deep learning: we want to vectorize our operations. Fortunately, this operation is very easy to vectorize.
# To compute the raw weights for all positions i and j at once, we compute one large matrix of all dot products of the inputs with themselves. With the input vectors as the columns of X, the matrix W' = transpose(X) X
# contains the dot product of every input vector with every other input vector. To get W = softmax(W'), we apply the softmax to this matrix row-wise, so that the elements of every row are positive
#and sum to 1. Finally, we multiply this weight matrix with our input matrix, transpose(Y) = W transpose(X), which gives us the matrix Y containing all the weighted sums, computed in one matrix multiplication.
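#The vectorized computation can be sketched in a few lines of NumPy. One assumption to note: here the input vectors are stored as the rows of X (shape t by k), which is the common array layout,
#so the raw weights are X @ X.T and the output is W @ X. The function name simple_self_attention is made up for this sketch.

```python
import numpy as np


def simple_self_attention(X):
    """Parameter-free self-attention. X has shape (t, k): one input vector per row."""
    raw = X @ X.T                                    # raw[i, j] = x_i . x_j
    raw = raw - raw.max(axis=1, keepdims=True)       # numerical stability only
    weights = np.exp(raw)
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ X, weights                      # y_i = sum_j w_ij x_j


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))        # 6 input vectors of dimension 4 (random toy data)
Y, W = simple_self_attention(X)
print(Y.shape)                     # one output vector per input vector
print(W.sum(axis=1))               # every row of W sums to 1
```

#Subtracting the row maximum before exponentiating does not change the softmax result; it only avoids overflow for large dot products.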

###To properly understand what is happening here, I'd like to point out a few things that we may not have noticed immediately.
#The first is that in this simple version of self-attention, the weight from an input vector to the output at the same position, that is, the weight w_ii from x_i to y_i, is usually the largest,
#because this weight is defined by the dot product of a vector with itself, which is usually higher than the dot product of that vector with some other vector. This is not a big problem, and
#we will allow it to change later. It just means that simple self-attention essentially keeps every input vector the same, but mixes in a little of the values of the other input
#vectors according to the weights. We will add a few simple mechanisms later that allow us to change this behavior if necessary.

#Note also that simple self-attention like this has no parameters: there are no numbers we can set to change the behavior of this sequence-to-sequence layer. Its behavior is entirely driven
#by whatever mechanism generates the input vectors. For instance, if we take one embedding layer and stick one simple self-attention layer on top of it, then the embedding layer entirely drives the behavior
#of the model.
#Thirdly, and this is probably one of the bigger reasons why self-attention works so well, note that it is fundamentally a linear operation: the whole of self-attention is one matrix multiplication of W by X
#resulting in Y. Of course W is itself derived from the values of X, so there are two paths from X to Y. Through transpose(Y) = W transpose(X) the operation is linear and the gradients do not vanish. Through W = softmax(transpose(X) X)
# we get a non-linearity, at the price of potentially vanishing gradients. In this way we get the best of both worlds, separated into two parts of the computation graph (see page 7, middle):
#a linear operation with non-vanishing gradients, and a non-linear operation with potentially vanishing gradients.

#Another bonus: self-attention has no problem looking far back into the sequence. If we contrast it with the recurrent neural network, we see that in the RNN,
#the further back we go into the sequence, the more computation steps there are between an input vector and an output vector. That is not the case for self-attention: the number of steps between
#an input and an output is the same for every pair of positions in the sequence. This is because at heart, self-attention is really more of a set model than a sequence model. As we have
#set it up, simple self-attention has no access to the sequential structure of the input. We will fix that later by encoding the sequential structure into the embeddings, but that is something
#for the next part. For now, we will treat this self-attention operation as more of a set-to-set layer than a sequence-to-sequence layer.
#Another way of saying this is that self-attention is permutation equivariant: if we permute (shuffle) the input sequence, it makes no difference whether we first permute and then apply
#self-attention, or first apply self-attention and then permute; we get the same result. So, that is the basic operation of self-attention.
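#Permutation equivariance is easy to check numerically. The sketch below repeats a minimal simple self-attention (input vectors as rows of X, an assumption of this example) on random toy data
#and compares "permute, then attend" with "attend, then permute".

```python
import numpy as np


def simple_self_attention(X):
    """Parameter-free self-attention. X has shape (t, k): one input vector per row."""
    raw = X @ X.T
    weights = np.exp(raw - raw.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ X


rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))        # 6 input vectors of dimension 4
perm = rng.permutation(6)          # a random reordering of the 6 positions

permute_then_attend = simple_self_attention(X[perm])
attend_then_permute = simple_self_attention(X)[perm]

# Both orders give the same outputs, up to floating-point rounding.
is_equivariant = np.allclose(permute_then_attend, attend_then_permute)
print(is_equivariant)
```

#Each output moves together with its input, which is why the layer by itself cannot tell in which order the inputs arrived.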

#Before we move on, it pays to build a little intuition for why this works so well. One of the big reasons is the power of the dot product.
#Take the example of users and movies, with a "likes" relation between them. Our job is to predict which other movies these users might like. One thing we might do is collect feature vectors for each
#of these users and for each of these movies.
#For instance, user u: ____________
#                      |@@|[][]|//|    @@: likes thrillers, [][]: likes action, //: likes comedy.
#                      ------------
#And the same for movie m: ##: thriller, {}{}: action, \\: comedy.
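#The notes break off here, so the following is only an assumed continuation of the example: given the remark about the power of the dot product, a natural way to score a user against a movie is the dot product
#of their feature vectors. All feature values and names below are made up for illustration.

```python
def dot(u, m):
    """Dot product of two equal-length feature vectors."""
    return sum(ui * mi for ui, mi in zip(u, m))


# Feature order: (thriller, action, comedy). Values are hypothetical:
# positive means "likes" / "has this property", negative means the opposite.
user = [-1.0, 0.5, 1.0]            # dislikes thrillers, mildly likes action, likes comedy
movies = {
    "comedy_movie": [0.0, 0.2, 1.0],
    "thriller_movie": [1.0, 0.3, -0.5],
}

# Matching signs in a feature push the score up, opposing signs push it down.
scores = {name: dot(user, feats) for name, feats in movies.items()}
for name, score in scores.items():
    print(name, round(score, 2))
```

#Under this assumption, a large positive score means the user's preferences and the movie's properties agree on most features, which is the same quantity simple self-attention uses as its raw weight.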
  