<a href="https://colab.research.google.com/github/NourSoltani/ML-Learn/blob/main/Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sequential data**

In [None]:
# Part one: Learning from sequences
# Part two: RNNs
# Part three: LSTMs
# Part four: CNNs for sequential data
# Part five: ELMo, a case study

# **Part one: Learning from sequences**

In [None]:
#So: what kind of sequential data can we expect? The simplest case is probably numeric one-dimensional sequential data: a time series. We can also have numeric n-dimensional data, or a dataset can be symbolic,
#where at every time step we are given a symbol from a fixed vocabulary. The prime example of this is probably language, which can be viewed as a symbolic sequence in two ways: we can break it up into
#words, in which case we have a very large vocabulary and at each time step we are given one word, or we can view language as a sequence of characters, which gives us a much smaller vocabulary and a much longer
#sequence. Datasets generally come in one of two types: either a single sequence or a set of sequences.
#One interesting way to split such data into train and test sets is walk-forward validation: always train on the past and test on what comes next, then move the split point forward and repeat.
#So sequences: consisting of numbers, vectors or symbols.
#and Dataset: consisting of a sequence per instance, or a sequence of instances
#sequence models: operate on inputs of different lengths (using the same weights).
#input: raw sequence data
#output: classification, regression, token prediction, sequence-to-sequence.
#layers: sequence-to-sequence. But what's a sequence-to-sequence layer? It's a layer that takes as input a sequence of t vectors and produces as output another sequence of t vectors.
#The input and output dimensions may be different, but the length of the sequence is the same in both cases. We can generalize this from vectors to tensors if we want to, but for this lecture we will
#stick to vectors. And again, the defining property of a sequence layer is that the same layer with the same weights can be applied to sequences of different lengths.
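The walk-forward validation mentioned earlier in this cell can be sketched in a few lines of plain Python. This is only a minimal illustration with an expanding training window; the function name `walk_forward_splits` and its parameters are made up for this example and do not come from the lecture.

```python
def walk_forward_splits(n_samples, n_test, n_folds):
    """Yield (train_indices, test_indices) pairs for an expanding-window split.

    Hypothetical helper for illustration: every fold trains on all time steps
    before the split point and tests on the n_test steps right after it.
    """
    first_test_start = n_samples - n_folds * n_test
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        test_start = first_test_start + fold * n_test
        train = list(range(0, test_start))                    # everything before the split point
        test = list(range(test_start, test_start + n_test))   # the block right after it
        yield train, test


splits = list(walk_forward_splits(n_samples=10, n_test=2, n_folds=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

The point of the design is that no test index ever precedes a training index, so the model is never evaluated on data that lies in its own past.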

#IF YOU WANT TO THINK OF A CONCRETE EXAMPLE, JUST TAKE A CONVOLUTION; JUST THINK OF A ONE-DIMENSIONAL CONVOLUTION IN PLACE OF SEQUENCE TO SEQUENCE LAYER. 
#The first important property that sequence-to-sequence layers may or may not have is causality: a causal layer is a layer that can only look backward in the sequence.
#First, if our data has discrete inputs, then we need to turn that into a sequence of vectors. One way to do that is with one-hot vectors (with the same length as the number of options).
#Another approach is to create embedding vectors: for every element in our set of options, i.e. every token in our vocabulary, we create a vector of parameters that represents that object.
#These embedding vectors contain parameters, so all of the numbers in them are learned during the training process of the NN.
#Embedding vectors are not specific just to sequence learning; we'll see them in some other settings as well. 
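To make the ideas in this cell concrete, here is a small PyTorch sketch that embeds a symbolic sequence and passes it through a causal one-dimensional convolution. The sizes, the left-padding trick used to obtain causality, and the helper name `causal_layer` are assumptions chosen for illustration, not something prescribed by the lecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, emb_dim, out_dim, kernel = 5, 3, 4, 3   # toy sizes, chosen for illustration

embedding = nn.Embedding(vocab_size, emb_dim)       # one learnable vector per token
conv = nn.Conv1d(emb_dim, out_dim, kernel_size=kernel)


def causal_layer(tokens):
    """Map a 1-D tensor of t token indices to a (t, out_dim) tensor of output vectors."""
    x = embedding(tokens)                    # (t, emb_dim)
    x = x.t().unsqueeze(0)                   # (1, emb_dim, t), the layout Conv1d expects
    x = F.pad(x, (kernel - 1, 0))            # pad on the left only, so no output sees the future
    return conv(x).squeeze(0).t()            # (t, out_dim)


tokens = torch.tensor([0, 3, 3, 1, 4])

# An embedding lookup equals a one-hot vector multiplied by the embedding matrix.
one_hot = F.one_hot(tokens, num_classes=vocab_size).float()
lookup_matches_one_hot = torch.allclose(embedding(tokens), one_hot @ embedding.weight)

# The same weights handle sequences of different lengths.
out_short = causal_layer(tokens[:3])         # (3, out_dim)
out_long = causal_layer(tokens)              # (5, out_dim)

# Causality: the first three outputs do not depend on the later tokens.
is_causal = torch.allclose(out_short, out_long[:3], atol=1e-6)
print(out_short.shape, out_long.shape, lookup_matches_one_hot, is_causal)
```

Padding symmetrically instead of only on the left would keep the output length the same but make the layer non-causal.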

In [None]:
#Model configurations: 
# Sequence-to-sequence: POS tagging, machine translation, robot control, generation
# Sequence-to-label: Classification, regression
# Label-to-sequence: generative models
# Label+seq-to-sequence: Teacher forcing
###To recap: Sequence-to-sequence models are defined by a set of fixed weights that can be applied to variable-length inputs. We have seen three instances (RNN, CNN, self-attention). Embeddings, padding, masking, and packing
# can help us pre-process our data and feed it to a DL model, and we've seen how versatile these seq-to-seq models can be: we can train them seq-to-seq, label-to-seq, or seq-to-label, with autoregressive training,
# teacher forcing, and more.

In [None]:
###RNNs: RNN is basically a name for any NN that has a cycle in it.
###How to train RNNs? Unrolling. 
###Note of the following:
#     -RNNs are sequence-to-sequence layers(shared weights, variable length)
#     -RNNs are causal: Only backwards connections
#     -Potentially unbounded memory (in theory; in practice it is limited by vanishing/exploding gradients): if we unroll over the whole sequence, there's a computational connection between the first element of the input sequence and the last element of the output sequence, no
#      matter how long the sequence is. This is where they differ from CNNs, for instance, because CNNs have a finite receptive field: if we have
#      one convolutional layer with a size-three kernel, then any output of this CNN can only depend on three of the inputs, and it cannot look infinitely far along the input sequence, in contrast to the RNN,
#      which can always look infinitely far back in the input sequence. The price we pay for this potentially long memory is that RNNs are quite slow to evaluate, because they need to be processed
#      sequentially: in order to evaluate the fourth hidden state h4, for instance, we first need to evaluate the hidden state h3, and in order to evaluate that one we need to evaluate h2, and so on.
#      ==> We cannot evaluate these four hidden states in parallel, in contrast to a CNN, where each of the four outputs can be given to a thread which computes that particular output in parallel,
#     based on the weights of the convolution and the inputs. The outputs don't need to refer to each other in order to know their own value, and this makes RNNs a little bit slower than most other neural network layers.
#     (We are talking here about just one layer in each case: one RNN layer unrolled over 4 time steps, and one Conv1D layer with 4 inputs and 4 outputs.)
###We can solve the problem of vanishing gradient by replacing the sigmoids by ReLUs, and by making sure the weight matrices are properly initialized, and perhaps even by adding the occasional normalization step
# in between, but in the late 90s when these recurrent neural networks were very popular, those options weren't available yet, and instead people came up with a very different solution, which is known as LSTM.
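The unrolling described in this cell can be sketched as an explicit loop over time steps. The block below assumes a basic Elman-style recurrence with a tanh non-linearity and randomly initialised weights; these are illustrative choices, and `rnn_unrolled` is a hypothetical name rather than a library function.

```python
import torch

torch.manual_seed(0)
in_dim, hid_dim = 3, 4                        # toy sizes, chosen for illustration
W_x = torch.randn(hid_dim, in_dim) * 0.5      # input-to-hidden weights
W_h = torch.randn(hid_dim, hid_dim) * 0.5     # hidden-to-hidden weights (the "cycle")
b = torch.zeros(hid_dim)


def rnn_unrolled(xs):
    """Apply one recurrent layer to a (t, in_dim) sequence and return (t, hid_dim)."""
    h = torch.zeros(hid_dim)                  # initial hidden state
    outputs = []
    for x_t in xs:                            # inherently sequential: h_t needs h_{t-1}
        h = torch.tanh(W_x @ x_t + W_h @ h + b)
        outputs.append(h)
    return torch.stack(outputs)


out_short = rnn_unrolled(torch.randn(4, in_dim))   # same weights, sequence length 4
out_long = rnn_unrolled(torch.randn(9, in_dim))    # same weights, sequence length 9
print(out_short.shape, out_long.shape)
```

The loop makes both properties from the notes visible: the layer is causal because step t only sees earlier inputs, and it cannot be parallelised over time because each hidden state depends on the previous one.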


In [None]:
####CNNs for sequential data
# https://www.youtube.com/watch?v=rT77lBfAZm4&list=PLIXJ-Sacf8u7756f8QFM_FNZQxdJov8f4&index=4&ab_channel=DLVU


In [None]:
#Part one: Self-attention: The basic sequence to sequence operation that drives all transformer models.
#Part two: Transformers: We will look how to build up this self-attention into a complete transformer model
#Part three: Famous transformers: We will look at some famous examples of transformer models, and we'll look at the finer details of how they're constructed and how they were trained.
#Part four: Advanced tricks: that are being studied to improve the performance of transformer models in various ways.

# **PART ONE: SELF-ATTENTION**

In [None]:
#Before this, we talked for the first time about sequence-to-sequence layers: these are neural network layers that take as input a sequence of tensors, usually a sequence of vectors, and produce a
#sequence of vectors as output as well, where both sequences have the same length. The direction in which the sequence extends is called "the time direction".
##RECAP:
#Defining property: Can handle sequences of different lengths with the same parameters.
#Versatile: label-to-sequence, sequence-to-label, sequence-to-sequence, autoregressive training.
#Causal or non-causal: causal models can only look backward.

#The aim of self-attention as a sequence-to-sequence layer is to give us the best of both worlds: parallel computation (like a CNN) and long dependencies (like an RNN), with the ability to look at any point of the sequence
#before or after the current output.
#There's simple self-attention: the basic idea
#Practical self-attention: adding some bells and whistles

In [None]:
#At heart, self-attention is a very simple operation. We have a self-attention layer with a sequence of input vectors and a sequence of output vectors, and the basic operation that produces any given output vector
#is simply a weighted sum over the input vectors: yi = Σj Wij Xj. For every output we have a set of six weights in this case, since we have 6 inputs. We simply perform a weighted sum over the input vectors, and that's
#our output vector; we do this for every output, and that gives us our sequence of outputs.
#Now the trick that makes this special is that Wij is not a parameter of the model, but a derived value that we compute from the input.

#To compute Wij, we calculate the raw weight W'ij = transpose(Xi).Xj, and then we apply a softmax over j: Wij = exp(W'ij) / Σj' exp(W'ij')
#Of course, if we do this naively it involves a lot of loops, and we don't like loops in deep learning: we want to vectorize our operations. Fortunately, this operation is very easy to vectorize.
# We do that as follows: to compute the raw weights for all positions i and j, we stack the input vectors as the rows of a matrix X and compute the matrix of all dot products of X with itself: W' = X.transpose(X).
# This matrix contains the dot product of every input vector with every other input vector. To apply the softmax (W = softmax(W')), we simply apply it to the matrix row-wise, so that the elements of every row are positive
#and sum to 1. Then we simply multiply this weight matrix with our input matrix X to give us the matrix Y = WX, which contains all the weighted sums computed in one matrix multiplication.
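The vectorized computation described above fits in three lines of PyTorch. The sketch below assumes the input vectors are stored as the rows of `x`; the function name `simple_self_attention` and the toy sizes are illustrative.

```python
import torch
import torch.nn.functional as F


def simple_self_attention(x):
    """x: (t, k) matrix whose rows are the input vectors. Returns (y, w)."""
    raw = x @ x.t()                 # (t, t): raw[i, j] is the dot product of x_i and x_j
    w = F.softmax(raw, dim=1)       # row-wise softmax: each row is positive and sums to 1
    y = w @ x                       # (t, k): y_i is the weighted sum over all x_j
    return y, w


torch.manual_seed(0)
x = torch.randn(6, 4)               # 6 input vectors of dimension 4 (toy sizes)
y, w = simple_self_attention(x)
print(y.shape, w.sum(dim=1))
```

Note that nothing in the function is learnable, which matches the observation in the notes that simple self-attention has no parameters of its own.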

###To properly understand what's happening here, I'd like to point out a few things that we may not have noticed immediately.
#The first is that in this particular version of self-attention (simple self-attention), the weight from an input vector at one position to the same position, i.e. the weight Wii from Xi to Yi, is usually the biggest,
#because this weight is defined by the dot product of a vector with itself, and that's usually a higher value than the dot product of a vector with some other vector. This is not a big problem, but
#we'll allow this to change later. It just means that what this simple self-attention is doing is essentially keeping every input vector the same, but mixing in a little bit of the values of the other input
#vectors according to these weights. We'll add a few simple mechanisms later that will allow us to change this behavior if necessary.

#Note also that a simple self-attention like this has no parameters: There's nothing we can do; No numbers we can set to change the behavior of the sequence-to-sequence layer; Its behavior is entirely driven
#by whatever mechanism generates these input vectors: So for instance, if we take one embedding layer and stick one simple self-attention layer on top of it, then the embedding layer entirely drives the behavior
#of the model.
#Thirdly, and this is probably one of the bigger reasons why self-attention works so well, note that it is fundamentally a linear operation: the whole of self-attention is one matrix multiplication of W by X
#resulting in Y. Of course W is itself derived from the values of X, so there are two paths from X to Y: a linear one through Y = WX, with non-vanishing gradients, and a non-linear one through W = softmax(X.transpose(X)) (with the input vectors as the rows of X),
# where we get a non-linearity at the price of potentially vanishing gradients. In this way, we get the best of both worlds, separated into two parts of the computation graph (look page 7-middle):
#a linear operation with non-vanishing gradients, and a non-linear operation with potentially vanishing gradients.

#Another bonus: Note that self-attention has no problem looking far back into the sequence. In fact, if we contrast it with the recurrent neural network, then we see that in the recurrent neural network, 
#the further back we go into the sequence, the more computation steps there are between an input vector and an output vector. That's not the case for self-attention: At every point in the sequence, there are 
#as many steps between that input point and any of the output points as at any other point in the sequence. This is because at heart, self-attention is really more of a set model than a sequence model. As we've
#set up this simple self-attention, the model has no access to the sequential structure of the input. We'll fix that later by encoding the sequential structure into the embeddings, but that's something
#for the next part. For now, we'll just look at this self-attention operation as more of a set-to-set layer than a sequence-to-sequence layer.
#Another way of saying this is that self-attention is permutation equivariant: if we permute or shuffle the input sequence, then it makes no difference whether we first permute and then apply the
#self-attention or first apply self-attention and then permute: we get the same result. So, that's the basic operation of self-attention.
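Permutation equivariance is easy to check numerically. The sketch below re-declares a parameter-free simple self-attention (input vectors as rows, an assumption of this example) and compares the two orders of operations on random data.

```python
import torch
import torch.nn.functional as F


def simple_self_attention(x):
    """Parameter-free self-attention on a (t, k) matrix of row vectors."""
    w = F.softmax(x @ x.t(), dim=1)
    return w @ x


torch.manual_seed(0)
x = torch.randn(6, 4)                      # toy input: 6 vectors of dimension 4
perm = torch.randperm(6)                   # a random shuffle of the 6 positions

permute_then_attend = simple_self_attention(x[perm])
attend_then_permute = simple_self_attention(x)[perm]

# Equivariance: both orders give the same result (up to floating-point error).
equivariant = torch.allclose(permute_then_attend, attend_then_permute, atol=1e-5)
print(equivariant)
```

If the outputs are pooled afterwards, for example summed over positions, the shuffle disappears entirely, which is why a simple classifier built this way cannot see word order.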

#Now before we move on, it pays to build a little bit of intuition for why this works so well. And one of the big reasons why it works is the power of the dot product.
#We have the example of users and movies, and the relation "likes" between them. Our job is to predict which other movies these users might like. One thing we might do is to collect feature vectors for each
#of these users and for each of these movies.
#For instance; user u ____________
#                     |@@|[][]|//|    @@: likes thriller [][]: likes action, and //: likes comedy. 
#                     ------------  
#And the same for movie m: ##: Thriller    {}{}: Action, and \\:comedy 
#If we collect the feature vectors like this, then we can simply take the dot product to get a good prediction of how much the user will like the movie: score=u1*m1+u2*m2+u3*m3 (each term
#is just a feature of the user multiplied by the corresponding feature of the movie). The first thing to notice here is that the dot product very intuitively takes into account the signs of the features.
#For instance, if the user likes thrillers and a movie is a thriller, then these 2 values will multiply and increase the score. But if the user doesn't like thrillers, so their feature vector has a negative value
#at that position, and the movie is not a thriller, or is un-thriller to the extent that it also has a negative value at that position of its feature vector, then the two minuses will cancel out, and
#the score will also increase.
#Secondly, the magnitudes of the values in the feature vectors behave very naturally: if a user is fairly ambivalent about thrillers, then that part of the feature vector will be close to zero, so that feature contributes little to the score in either direction.
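The user/movie scoring above can be worked through with made-up numbers. The feature values below are invented purely for illustration (order: thriller, action, comedy), and `score` is a hypothetical helper that is just a plain dot product.

```python
def score(user, movie):
    """Dot product of two equally long feature vectors."""
    return sum(u * m for u, m in zip(user, movie))


# Feature order: (thriller, action, comedy). All values are invented for illustration.
user = [-1.0, 0.1, 0.8]                   # dislikes thrillers, ambivalent on action, likes comedy
thriller_movie = [1.0, 0.2, -0.5]         # very much a thriller, not funny
anti_thriller_comedy = [-0.9, 0.0, 0.9]   # decidedly not a thriller, very funny

print(score(user, thriller_movie))        # negative: opposite signs on the thriller feature
print(score(user, anti_thriller_comedy))  # positive: the two minuses on thriller cancel out
```

The near-zero action value of the user shows the magnitude effect: whatever the movie's action value is, that term barely moves the score.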


In [None]:
###AND IF WE DON'T HAVE FEATURE VECTORS LIKE THIS, OR WE DON'T FEEL LIKE COLLECTING THEM, THEN WE CAN JUST LEARN EMBEDDING VECTORS INSTEAD OF FEATURE VECTORS, AND THAT'S THE BASIC MECHANISM BEHIND A LOT OF
#RECOMMENDER SYSTEMS.
#So how does this mechanism work in a self-attention model? (look page8-top): We have inputs --> Embedding layer --> Embeddings --> Simple self-attention --> output sequence --> Global max pooling
#This is a simple classification model with a simple self-attention layer. The model has 2 layers: one embedding layer that transforms the input words into input vectors, and one simple self-attention
#layer which produces an output sequence. The vectors in the output sequence are pooled together to give us a single vector, from which we perform the classification.
#Now if we did this without the self-attention layer, we would essentially have a model where each word can only contribute to the output score independently of every other word. This is known as a bag-of-words
#model. For instance, if we have as input "The restaurant was not too terrible", the word "terrible" would probably cause us to predict that this review is negative. In order to see that it might
#actually be a positive review, we need to recognize that the meaning of the word "terrible" is moderated, and in fact inverted, by the presence of the word "not", and this is what self-attention can do for us.
#So in this case, what we would hope is that the model learns that the words "not" and "terrible" interact in an important way for this task: we would hope that the embedding vector for
#the word "not" is learned in such a way that it has a large dot product with the embedding vector for the word "terrible", so that if the two occur together in a sentence, we can lower the probability of "terrible"
#having a meaning that contributes negatively to the score, because there's a possibility that the word "not" occurs in a way that inverts the meaning of the word "terrible". We can't be sure, of course, with one
#self-attention layer, but with the features we will add later, and with a larger stack of self-attention layers, that problem can be solved as well.

In [None]:
#There are extra features that we can add to self-attention to make it a little bit more flexible and a little bit more powerful, and we will look at three different features:
###1)Scaled dot product
###2)key, value, and query transformations
###3) multi-head attention: which essentially boils down to applying multiple self-attentions in parallel.

####First: scaled self-attention. The problem we're trying to solve here is that as the dimensionality of the input vectors grows, so does the average size of the dot product, and that growth is by a factor of
#the square root of k, where k is the input dimension.
# W'ij = [transpose(Xi).Xj] / √k
#So if we divide by the square root of k, we normalize the average dot product, which keeps the weights within a range where we
#don't suffer from vanishing gradients in the softmax operation. This is a very simple trick that can really help learning.
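The square-root growth can be observed empirically. The sketch below assumes vector components drawn independently from a standard normal distribution, which is an idealisation and not a property of real embeddings; under that assumption the spread of the dot products grows like √k and the scaled version stays near 1.

```python
import torch

torch.manual_seed(0)
n = 5000                                     # number of random vector pairs per dimension
raw_std, scaled_std = {}, {}
for k in (4, 64, 256):
    a, b = torch.randn(n, k), torch.randn(n, k)
    dots = (a * b).sum(dim=1)                # n dot products of k-dimensional vectors
    raw_std[k] = dots.std().item()           # grows roughly like sqrt(k)
    scaled_std[k] = (dots / k ** 0.5).std().item()   # stays roughly constant
    print(k, round(raw_std[k], 2), round(scaled_std[k], 2))
```

Large raw values push the softmax towards a one-hot output, where its gradient is close to zero, which is the vanishing-gradient problem the scaling is meant to avoid.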

####For the second feature, we need to recognize that every vector in a self-attention operation occurs in 3 different roles. First, as a vector that is used in the weighted sum that ultimately provides the
#output: we call that the value. Second, as the input vector that corresponds to the current output and is matched against every other input vector: this is called the query. And third, as the vector that the query is
#matched against, which is called the key. These names derive from a way of thinking about this mechanism as a kind of soft version of a dictionary, where the key, the query and the value are all vectors
#of the same size. Instead of having a query that matches only one key, every key matches the query to some extent, as determined by their dot product, and instead of returning a single value, that of the
#key that matches the query, we return a mixture of all values with the softmax-normalized dot products as the mixture weights. In this way of looking at our mechanism, self-attention is just an attention mechanism
#with keys, queries, and values all coming from the same set, and that's where the name self-attention comes from.
#So, to make self-attention a little bit more powerful, we can introduce some transformations for these three different roles, so that even though we are using the same vector in all three roles, it can behave
#differently depending on what role it's taking on. We do this simply by linear transformations:
#for every role we introduce a weight matrix and an associated bias, and we compute a key vector by passing the input vector through the key transformation, a query vector by passing the input
#vector through the query transformation, and the same for the value.
#ki= Kxi+bk
#qi=Qxi+bq
#vi=Vxi+bv

#So this makes the self-attention operation a little bit more flexible in what it can do. Note also that because we've now introduced these transformations, the self-attention operation has
#some parameters: the operation itself now also has some numbers that we can change to influence the behavior of the layer.
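As a rough sketch of how the three transformations fit into the operation, the module below applies one linear layer per role and then the scaled dot-product self-attention. The class name `KQVSelfAttention` and the toy sizes are assumptions for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KQVSelfAttention(nn.Module):
    """Single-head self-attention with learned key, query and value transformations."""

    def __init__(self, k):
        super().__init__()
        self.to_keys = nn.Linear(k, k)       # k_i = K x_i + b_k
        self.to_queries = nn.Linear(k, k)    # q_i = Q x_i + b_q
        self.to_values = nn.Linear(k, k)     # v_i = V x_i + b_v

    def forward(self, x):                    # x: (t, k), input vectors as rows
        keys, queries, values = self.to_keys(x), self.to_queries(x), self.to_values(x)
        raw = queries @ keys.t() / x.shape[1] ** 0.5   # (t, t) scaled dot products
        w = F.softmax(raw, dim=1)            # each query's weights over all keys
        return w @ values                    # (t, k) mixture of the values


torch.manual_seed(0)
layer = KQVSelfAttention(k=4)
out = layer(torch.randn(6, 4))
n_params = sum(p.numel() for p in layer.parameters())
print(out.shape, n_params)
```

With k = 4 the layer has three 4x4 weight matrices and three bias vectors, so the self-attention operation itself is now trainable.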

In [None]:
###TWO-HEAD SELF-ATTENTION
#The final feature we will add is what's called multi-head attention. The necessity for multi-head attention derives from the idea that different words relate to each other by different relations.
#So for instance, in the sentence "The restaurant was not too terrible", the word "terrible" relates to the words "not" and "too" in that they moderate and invert the meaning of the word "terrible":
#the presence of the words "not" and "too" changes the meaning of the word "terrible". But the relation between the words "restaurant" and "terrible" is completely different:
#the word "terrible" describes a property of the restaurant. Now, in order to allow the network to model all these different kinds of relations in one self-attention operation, we split the self-attention
#into different heads, which are basically self-attention layers applied in parallel. In practice that looks like this: we start with an input sequence and pass it through some linear operations
#to decrease its dimensionality. Here we have a two-head self-attention, so we pass the input through two projections, w1 and w2, down to a lower dimensionality, and each result is fed to a separate self-attention.
#So we have self-attention 1 and self-attention 2, each with their own key, query and value transformations. We get two sequences of vectors out, which we concatenate and pass through a final output transformation to give us
#the output sequence.

In [None]:
# ###MULTI-HEAD SELF-ATTENTION

# Now note that there are two different ways of implementing this. As drawn in the previous slide, we first multiply each input vector x by a projection matrix (the gray matrix in the slide), which turns it
# into a vector that is split into two, one input for each head. The input for each head is then multiplied by another matrix to produce its key, by another matrix to produce its
# query, and by another matrix to produce its value. But since these two operations are applied in sequence and are both linear, we can also multiply them together to produce one
# equivalent linear operation. If we do that for each of the three matrices (one for the key, one for the query and one for the value), what we get is three different matrices
# that immediately produce the keys, the queries and the values that go to the different heads. What this shows us is that if we implement multi-head self-attention in this way, and we ignore the w_o transformation
# that is applied after concatenation, the number of parameters for single-head self-attention and for multi-head self-attention is the same. So in this sense, we are not adding a lot more parameters by
# splitting the self-attention up into separate heads, and that's the last feature we wanted to add.
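The second implementation described above, with one combined matrix per role whose output is cut into heads, might look like the following sketch. The names `MultiHeadSelfAttention`, `unify` (standing in for the w_o transformation) and `count_kqv_params` are hypothetical, and biases on the key, query and value projections are left out for simplicity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, k, heads):
        super().__init__()
        assert k % heads == 0, "k must be divisible by the number of heads"
        self.heads = heads
        # One (k x k) matrix per role produces the keys/queries/values for all heads at once.
        self.to_keys = nn.Linear(k, k, bias=False)
        self.to_queries = nn.Linear(k, k, bias=False)
        self.to_values = nn.Linear(k, k, bias=False)
        self.unify = nn.Linear(k, k)         # output transformation after concatenation

    def forward(self, x):                    # x: (b, t, k)
        b, t, k = x.shape
        h, s = self.heads, k // self.heads

        def split(z):                        # (b, t, k) -> (b, h, t, s): one chunk per head
            return z.view(b, t, h, s).transpose(1, 2)

        q, kk, v = split(self.to_queries(x)), split(self.to_keys(x)), split(self.to_values(x))
        raw = q @ kk.transpose(-2, -1) / s ** 0.5        # (b, h, t, t), scaled per head
        w = F.softmax(raw, dim=-1)
        out = (w @ v).transpose(1, 2).reshape(b, t, k)   # concatenate the heads again
        return self.unify(out)


def count_kqv_params(module):
    """Count parameters, ignoring the output transformation."""
    return sum(p.numel() for name, p in module.named_parameters() if not name.startswith("unify"))


torch.manual_seed(0)
single, multi = MultiHeadSelfAttention(8, heads=1), MultiHeadSelfAttention(8, heads=4)
out = multi(torch.randn(2, 5, 8))
print(out.shape, count_kqv_params(single), count_kqv_params(multi))
```

The two parameter counts printed at the end are equal, which illustrates the point made in the notes that splitting into heads does not add parameters when the output transformation is ignored.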

In [None]:
# To recap: we've introduced self-attention, which is a sequence-to-sequence layer that allows for parallel computation and unbounded long-term memory. It's fundamentally a set-to-set layer: it has no access to
# the sequential structure of the input, which is something that we need to solve later on, and a large part of its behavior comes from the parameters upstream.
# Now, the real power of self-attention comes primarily from its simplicity and its cheapness to compute. This means that we can stack a lot of self-attention operations together
# and build very large and deep models with them. These are called transformer models, and that's what we're going to be discussing in the next video.


# **Transformers**

In [None]:
#To get from a self-attention layer to a full-fledged model, we need to repeat it a number of times in a controlled fashion. If we do that, we get what's called a transformer model, and that's what we are going
#to talk about in this video.
##Transformer: Any sequence-based model that primarily uses self-attention to propagate information along the time dimension. We can add some other features and some other types of layers, but the main layer
#that's responsible for propagating information along the time dimension will be the self-attention.
#We'll limit ourselves to sequence models in this lecture, but there are now transformers in other domains as well. For instance, there are image transformers and graph transformers.
#The basic idea there is that our input consists of a set of basic units that are connected by some structure: in the case of images the units are pixels, connected by the pixel
#grid, and in the case of graphs they are graph nodes, connected by the topology of the graph.
###The idea of any transformer is that it's a model that primarily uses self-attention to propagate information between these basic units of our instances along the structure that we are given: along the pixel
#grid or along the graph. But as we have said, in this lecture we'll limit ourselves to sequence models.


##More broadly: Any model that primarily uses self-attention to propagate information between the basic units of our instances.
#pixels -> Image transformer
#graph nodes ->Graph transformer.


#The main strategy that people tend to use to build transformers is to define a transformer block, which is a set of operations that are wired together in a certain way, and to then repeat that transformer block
#a number of times. So, the exact architecture of transformer block differs from model to model, but in most cases it looks something like this:
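class Block(nn.Module):
  # The sub-layers (self.layernorm, self.attention, self.linear) are assumed to be created in __init__, which is omitted here.
  def forward(self,x):
    y=self.layernorm(x) #The input sequence is fed through a normalization known as layer normalization
    y=self.attention(y) #self-attention
    x=x+y #residual connection around the self-attention

    y=self.layernorm(x) #layer normalization
    y=self.linear(y) #feed-forward
    return x+y #residual connection around the feed-forward layer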
#The feed-forward layer here operates with the same parameters on every token of the input sequence in isolation; This means that, as we said before, the only operation that propagates information along the 
#time dimension is the self-attention, and the other 3 operations operate on every input token in isolation.
#An example of a sequence to label transformer is: Input embeddings ->Transformer block -> Transformer block -> Transformer block -> Output sequence -> global sum/avg/max pooling.
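A runnable version of such a block and of the sequence-to-label pipeline could look like the sketch below. It makes several assumptions that are not in the notes: PyTorch's built-in `nn.MultiheadAttention` stands in for the self-attention layer, each normalization gets its own `nn.LayerNorm`, the feed-forward part is a two-layer ReLU network four times as wide as the embedding, and the pooling is a mean. The class names `SketchBlock` and `SketchClassifier` are hypothetical.

```python
import torch
import torch.nn as nn


class SketchBlock(nn.Module):
    def __init__(self, emb, heads=4, ff_mult=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb)
        self.attention = nn.MultiheadAttention(emb, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(emb)
        self.ff = nn.Sequential(             # applied to every token in isolation
            nn.Linear(emb, ff_mult * emb), nn.ReLU(), nn.Linear(ff_mult * emb, emb)
        )

    def forward(self, x):                    # x: (b, t, emb)
        y = self.norm1(x)
        y, _ = self.attention(y, y, y, need_weights=False)   # keys, queries, values all from y
        x = x + y                            # residual connection
        y = self.ff(self.norm2(x))
        return x + y                         # residual connection


class SketchClassifier(nn.Module):
    """Input embeddings -> transformer blocks -> global pooling -> class logits."""

    def __init__(self, vocab_size, emb, depth, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.blocks = nn.Sequential(*[SketchBlock(emb) for _ in range(depth)])
        self.to_logits = nn.Linear(emb, num_classes)

    def forward(self, tokens):               # tokens: (b, t) integer indices
        x = self.blocks(self.embed(tokens))  # (b, t, emb)
        return self.to_logits(x.mean(dim=1)) # pool over the time dimension


torch.manual_seed(0)
model = SketchClassifier(vocab_size=20, emb=16, depth=3, num_classes=2)
logits = model(torch.randint(0, 20, (2, 7)))
print(logits.shape)
```

Because no position information is added here, this sketch still has the order-blindness discussed next.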
#A problem that we need to deal with is the lack of sequential structure in the self-attention, since obviously the meaning of a sentence often depends on the exact ordering of the words. If we feed a sentence
#through a simple classification transformer to do, for instance, "sentiment classification", what we see is that the output vectors will be the same for these two sentences: "This restaurant is not a real restaurant,
#it's a filthy burger joint" and "This is not a filthy burger joint, it's a real restaurant".
#So, we need to break this equivariance, and we need to tell the transformer about the structure of the input sequence; We do so by communicating the position of the input tokens, and we'll look at three
#ways of doing this: position embedding/position encoding/ and relative positions

#####The simplest of these are positional embeddings: In positional embeddings just like we assigned an embedding vector to every word in our vocabulary, we also assign an embedding vector to every position in our
#sequence from one to however long we expect our sequences to be, and then we can just sum these two together for every word in our input sequence: For instance, we have the sequence "the man pets the cat again",
#the word "the" occurs twice, but the two occurrences result in different input vectors, because in the first case it's summed with the position embedding for position 1, and in the second occurrence it's summed with
#the position embedding for position 4. ##AND WE CAN DO THIS AT EVERY BLOCK, OR WE CAN DO IT JUST ONCE.
#This is very easy to implement, but the drawback is that we're basically giving our transformer model a fixed maximum length: if we encounter, after training, a sequence that's longer than the longest sequence
#that we encountered during training, then the position embeddings for the end of that sequence will not have been trained, and therefore we cannot expect good performance on such sequences.
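The position-embedding idea can be sketched with two `nn.Embedding` tables, one for tokens and one for positions. The toy vocabulary, the maximum length of 16 and the embedding size are assumptions for illustration; note that positions are 0-based in the code, so positions 1 and 4 from the text become indices 0 and 3.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = {"the": 0, "man": 1, "pets": 2, "cat": 3, "again": 4}   # toy vocabulary
max_len, emb = 16, 8                       # fixed maximum length and embedding size (assumed)

token_emb = nn.Embedding(len(vocab), emb)  # one learned vector per word
pos_emb = nn.Embedding(max_len, emb)       # one learned vector per position

sentence = "the man pets the cat again".split()
tokens = torch.tensor([vocab[word] for word in sentence])
positions = torch.arange(len(sentence))    # 0, 1, ..., 5

x = token_emb(tokens) + pos_emb(positions) # (6, emb) input vectors for the transformer

same_word_vector = torch.equal(token_emb(tokens)[0], token_emb(tokens)[3])
same_input_vector = torch.allclose(x[0], x[3])
print(same_word_vector, same_input_vector)
```

The fixed-length drawback is visible directly: asking `pos_emb` for an index of `max_len` or higher raises an error, and indices that never occurred during training would hold untrained vectors.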

####An approach that generalizes a little better, at least in theory, is that of position encodings. Here we take the same principle, representing the positions in our sequence by vectors, but here they are not
#embedding vectors: they are not learned, they are simply fixed constants that we define beforehand. The trick is to define these vectors by a series of functions, one per dimension, that follow a
#predictable pattern. So theoretically, and ideally, a transformer would generalize a little bit better to longer sequences if position encodings are used. The drawback is that it's a little bit more difficult to
#implement, and there are a few more ad-hoc choices to make in exactly which position encodings you use.
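The notes do not say which functions to use, so as one possible instance the sketch below uses the sinusoidal encoding from "Attention is all you need": sines and cosines of geometrically spaced frequencies, one pair per two dimensions. The function name `sinusoidal_encoding` and the sizes are assumptions for illustration.

```python
import math
import torch


def sinusoidal_encoding(length, emb):
    """Return a (length, emb) matrix of fixed, non-learned position encodings."""
    assert emb % 2 == 0, "this sketch assumes an even embedding size"
    positions = torch.arange(length, dtype=torch.float32).unsqueeze(1)     # (length, 1)
    even_dims = torch.arange(0, emb, 2, dtype=torch.float32)               # 0, 2, 4, ...
    freqs = torch.exp(-math.log(10000.0) * even_dims / emb)                # 10000^(-2i/emb)
    enc = torch.zeros(length, emb)
    enc[:, 0::2] = torch.sin(positions * freqs)   # even dimensions
    enc[:, 1::2] = torch.cos(positions * freqs)   # odd dimensions
    return enc


enc = sinusoidal_encoding(10, 8)
print(enc.shape)
```

Since the encoding is a fixed function of the position, it can be evaluated for any length, including lengths never seen in training; whether the model then actually generalizes is a separate, empirical question.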

In [None]:
#So, to restate: we have position embeddings, which are easy to implement and flexible, but don't offer any generalization beyond the sequence lengths observed during training.
#Position encodings: which are slightly harder, but offer the possibility at least of a little bit more generalization.
#And then relative positions, which we can use with both embeddings and encodings, but they must be implemented by adapting the self-attention itself, so they're not as easy as just adding a bunch of vectors
#to our input vectors.

# **Pytorch Transformers from Scratch (Attention is all you need)**

In [None]:
# We have an encoder and a decoder. The encoder is built from transformer blocks, each made up of: Multi-Head Attention -> Add & Norm -> Feed Forward -> Add & Norm, where the multi-head attention is basically built from self-attention.
# The decoder is also built from a transformer block, with a masked multi-head attention and an "Add & Norm" before it.
# The transformer network is permutation equivariant, i.e. it has no notion of token order by itself. That's why we add the "Positional encoding".
# What made transformers so great is that all operations can be done in parallel, in contrast to sequence models like RNNs or LSTMs. But this really big strength comes with one problem,
# which shows up in translation, where we have a target translated text. This translated text is all sent into the decoder at the same time. Let's say that the first element is a <start> token and the next element is
# the first translated word. The first output that we want from the decoder then corresponds to the second element of the target sentence that we send in. So, if we allowed the decoder to see all of this information,
# the task would be trivially easy: it would just learn to copy the provided target translation, a simple mapping, and not really learn anything about translating text. So, what we do is mask the
# target input to the decoder, so that the first output of the decoder only has access to the first element, the second output only has access to the first and second inputs to the decoder, and so on.
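The masking described above is commonly implemented with a lower-triangular matrix applied to the raw attention scores before the softmax. The sketch below uses random scores for a target of length 4 as a stand-in for real query-key dot products; the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
t = 4                                                    # target length (toy value)
raw = torch.randn(t, t)                                  # stand-in for raw attention scores

mask = torch.tril(torch.ones(t, t, dtype=torch.bool))    # True where key position j <= query position i
masked = raw.masked_fill(~mask, float("-inf"))           # future positions get a score of -inf
w = F.softmax(masked, dim=1)                             # -inf becomes a weight of exactly 0

print(w)
```

Row i of `w` has non-zero weights only for positions up to i, so the first output sees only the first element, the second output the first two, and so on, while all rows are still computed in parallel.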


In [6]:
import torch
import torch.nn as nn

#Attention Mechanism
#We inherit from nn.Module
class SelfAttention(nn.Module):
  #We have an embedding and we're going to split it into different parts; the number of parts we split it into is called "heads".
  #So if we have, for instance, embed_size equal to 256 and heads equal to 8, then we're going to split it into 8 parts of size 32.
  def __init__(self,embed_size, heads):
    # The attributes that we have in our class are: embed_size/ heads/ head_dim/ values/ keys/ queries/ fc_out: these last four are layers.
    #And the inputs are: embed_size and heads.
    super(SelfAttention,self).__init__()
    self.embed_size=embed_size
    self.heads=heads
    self.head_dim=embed_size // heads
    assert(self.head_dim * heads == embed_size), "Embed size needs to be div by heads"
    # The assert keyword is used when debugging code. The assert keyword lets you test if a condition in your code returns True, if not, the program will raise an AssertionError (In our case: return "Embed size need to...")



    # We define the linear layers that we're going to send our values, keys, and queries through.
    self.values = nn.Linear(self.head_dim, self.head_dim, bias=False) #The first two arguments are the size of each input sample and the size of each output sample respectively; with bias=False the layer does not learn an additive bias (default: True).
    self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
    self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
    # After we concatenate the heads, we apply a fully connected output layer:
    self.fc_out = nn.Linear(heads*self.head_dim, embed_size)  #heads*self.head_dim is equal to embed_size


  def forward(self, values, keys, query, mask):
    # The first thing is to get the number of training examples
    N = query.shape[0] #how many examples we send in at the same time (the batch size)
    value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]
    # Depending on where the attention mechanism is used, these lengths correspond to the source sentence length or the target sentence length. Since we don't know
    # in advance where the mechanism is used (in the encoder, or in which part of the decoder), we keep them abstract and simply call them value length, key length, and query
    # length; in practice they always correspond to the source sentence length or the target sentence length.


    #Split embedding into self.heads pieces
    values=values.reshape(N, value_len, self.heads, self.head_dim) #The last two dimensions, self.heads and self.head_dim, are where we split: what was a single dimension of size embed_size
    #is now self.heads by self.head_dim.
    #We do the same thing for the keys and the queries.
    keys=keys.reshape(N, key_len, self.heads, self.head_dim)
    queries=query.reshape(N, query_len, self.heads, self.head_dim) #query_len, not key_len: the two can differ in encoder-decoder attention
    #Send each head through the linear layers defined in __init__ (each maps head_dim to head_dim); without these lines the layers would never be used.
    values=self.values(values)
    keys=self.keys(keys)
    queries=self.queries(queries)
    #Next we multiply the queries with the keys; the output of that is called "energy".

    energy=torch.einsum("nqhd,nkhd->nhqk",[queries,keys]) #einsum handles the matrix multiplication when we have several other dimensions. In the index string,
    #n is the batch size, q the query length, h the number of heads, k the key length, and d the head dimension.
    #What comes after the -> is the output shape.


    #queries shape: (N, query_len, heads, head_dim)
    #keys shape: (N, key_len, heads, head_dim)
    #energy shape: (N, heads, query_len, key_len)
    #Say the query length is the target sentence length and the key length is the source sentence length. Then energy
    #says, for each word in our target (query), how much we should pay attention to each word in our input, the
    #source sentence (key).

    if mask is not None:
      #Reaching this branch means that a mask was passed in.
      energy = energy.masked_fill(mask == 0, float("-1e20")) #Where the mask element is 0, we want
      #to shut that position off so that it doesn't influence any other position.
      #As we saw previously, the mask for the target is a lower-triangular matrix (used in the masked multi-head attention).
      #Closing a position means replacing its energy with what is essentially -infinity; for numerical
      #reasons we use a very large negative value (-1e20) instead, which softmax turns into a weight that is numerically zero.
    

    #We run this through softmax, following the equation: Attention(Q,K,V) = softmax(Q K^T / √d_k) V
    #Note: in the equation d_k is the per-head key dimension (head_dim here), whereas this code scales by the square root of embed_size.
    attention = torch.softmax(energy / (self.embed_size ** (1/2)), dim=3)
    #dim=3 means that we normalize across the key length.
    #What that means depends (as we have been saying) on where the attention mechanism is used. Say
    #key_len is the source sentence length and query_len is the target sentence length: then the attention scores
    #are normalized to sum to one across the source sentence. So if, for example, the first score
    #is 0.8, we're paying 80% attention to the first word in the source sentence.
    

    #Now we multiply the attention with the values, again using einsum.
    #In the index string below, l is the dimension we multiply across: the key length and the value length always match, so both are l.
    #We also want to do the concatenation of the heads, which we can do right after the einsum by reshaping:
    #flattening the last two dimensions concatenates the heads.
    out = torch.einsum("nhql,nlhd->nqhd",[attention,values]).reshape(
        N, query_len, self.heads*self.head_dim
    )

    # attention shape: (N, heads, query_len, key_len)
    # values shape: (N, value_len, heads, head_dim)
    # after einsum: (N, query_len, heads, head_dim), then flatten the last two dimensions

    out=self.fc_out(out) #Lastly we just send it through fc_out; this won't change the dimension,
    #since fc_out just maps embed_size to embed_size.
    return out
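
To make the two einsum calls above less opaque, here is a small sketch that checks them against explicit permutes and batched matrix multiplications. The sizes are hypothetical, the tensors are random, and the sketch scales by the square root of `head_dim` (the d_k of the equation), which is an assumption of this example and differs from the class above.

```python
import torch

torch.manual_seed(0)
# Hypothetical sizes, chosen only for illustration.
N, query_len, key_len, heads, head_dim = 2, 3, 5, 4, 8

queries = torch.randn(N, query_len, heads, head_dim)
keys = torch.randn(N, key_len, heads, head_dim)
values = torch.randn(N, key_len, heads, head_dim)

# Scores via einsum: for every batch element and head, dot each query with each key.
energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)

# The same computation with explicit permutes and a batched matmul.
q = queries.permute(0, 2, 1, 3)      # (N, heads, query_len, head_dim)
k = keys.permute(0, 2, 3, 1)         # (N, heads, head_dim, key_len)
energy_matmul = torch.matmul(q, k)   # (N, heads, query_len, key_len)

# Normalize across the key dimension (dim=3).
attention = torch.softmax(energy / head_dim ** 0.5, dim=3)

# Weighted sum of the values via einsum, then the matmul equivalent.
out = torch.einsum("nhql,nlhd->nqhd", attention, values)
v = values.permute(0, 2, 1, 3)       # (N, heads, key_len, head_dim)
out_matmul = torch.matmul(attention, v).permute(0, 2, 1, 3)

print(energy.shape, out.shape)
print(torch.allclose(energy, energy_matmul, atol=1e-5))
print(torch.allclose(out, out_matmul, atol=1e-5))
```

Reshaping `out` to `(N, query_len, heads * head_dim)` would then concatenate the heads, as in the class.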

In [7]:
#Now that we have the attention, this is going to be a lot easier: we just create the TransformerBlock.
#The architecture is as follows: Multi-Head Attention -> Add & Norm -> Feed Forward -> Add & Norm.
class TransformerBlock(nn.Module):
  def __init__(self, embed_size, heads, dropout, forward_expansion):
    super(TransformerBlock,self).__init__()
    self.attention = SelfAttention(embed_size,heads) #This represents the Multi-Head Attention

    self.norm1 = nn.LayerNorm(embed_size)
    self.norm2 = nn.LayerNorm(embed_size) #LayerNorm and BatchNorm are very similar, except that BatchNorm computes its statistics across the batch and then
    #normalizes, whereas LayerNorm computes the mean and variance over the embedding dimension separately for every single example
    #(and every position), so it does not depend on the other examples in the batch.
    
    self.feed_forward = nn.Sequential(
        nn.Linear(embed_size, forward_expansion*embed_size),
        nn.ReLU(),
        nn.Linear(forward_expansion*embed_size, embed_size)
    )
    self.dropout = nn.Dropout(dropout)



  def forward(self,value, key,query,mask):
    attention=self.attention(value, key, query, mask)
     
    x=self.dropout(self.norm1(attention+query)) #Why did we write "attention+query"? For the skip (residual) connection.
    forward=self.feed_forward(x)
    out= self.dropout(self.norm2(forward+x))
    return out
    
#And now we're going to stick this together and form both the encoder and the decoder.
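
The difference between the two normalizations mentioned in the comments can be checked directly. A minimal sketch, assuming random inputs and hypothetical sizes: it shows that `nn.LayerNorm` computes the mean and variance over the last (embedding) dimension, separately for every token of every example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical sizes; the input is shifted and scaled so it is clearly not normalized.
N, seq_len, embed_size = 2, 3, 8
x = torch.randn(N, seq_len, embed_size) * 5 + 2

layer_norm = nn.LayerNorm(embed_size)
y = layer_norm(x)

# Statistics of the output along the embedding dimension, one value per token.
per_token_mean = y.mean(dim=-1)
per_token_var = y.var(dim=-1, unbiased=False)

# The same normalization written by hand (the learnable scale and shift
# start at 1 and 0, so they do not change the result here).
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + layer_norm.eps)

print(per_token_mean)
print(per_token_var)
print(torch.allclose(y, y_manual, atol=1e-5))
```

Because every token is normalized on its own, the result for one example does not change when the rest of the batch changes.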

In [4]:
#ARCHITECTURE OF ENCODER: Inputs -> (+) Positional Encoding -> TransformerBlock * Nx
#We set the hyperparameters of the model in __init__.
#Why src_vocab_size? Because now the model does the embedding and all of those things as well.

#max_length is related to the positional embedding: the positional embedding depends on the position, so we need to pass in
#the maximum sentence length.
class Encoder(nn.Module):
  def __init__(
      self,
      src_vocab_size,
      embed_size,
      num_layers,
      heads,
      device,
      forward_expansion,
      dropout,
      max_length):
    super(Encoder,self).__init__()
    
    #embed_size, device, word_embedding, position_embedding, layers, and dropout are the attributes of Encoder:
    self.embed_size=embed_size
    self.device=device
    self.word_embedding  = nn.Embedding(src_vocab_size, embed_size)
    self.position_embedding = nn.Embedding(max_length, embed_size)
    
    #nn.ModuleList holds the num_layers TransformerBlocks together so that they are all registered as submodules:
    self.layers = nn.ModuleList(
        [
            TransformerBlock(
                embed_size,
                heads,
                dropout=dropout,
                forward_expansion=forward_expansion,
            )
            for _ in range(num_layers)
        ]
    )
    self.dropout= nn.Dropout(dropout)
  

  #Now we're ready for the forward part: we send in just one input to forward, together with
  #a mask.
  def forward(self, x,mask):
    N,seq_length= x.shape
    positions = torch.arange(0,seq_length).expand(N, seq_length).to(self.device)

    out = self.dropout(self.word_embedding(x)+ self.position_embedding(positions))
    #The only thing that makes the model aware of positions is position_embedding (to which we send the positions).
    
    for layer in self.layers:
      out = layer(out, out, out, mask)
    return out
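
The position handling in `Encoder.forward` can be looked at in isolation. A minimal sketch, assuming hypothetical sizes and random token ids: it builds the `positions` tensor the same way and adds the two embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical sizes, chosen only for illustration.
vocab_size, max_length, embed_size = 10, 100, 16
N, seq_length = 2, 5

word_embedding = nn.Embedding(vocab_size, embed_size)
position_embedding = nn.Embedding(max_length, embed_size)

# A batch of random token ids standing in for real sentences.
x = torch.randint(0, vocab_size, (N, seq_length))

# One row 0, 1, ..., seq_length-1, repeated for every example in the batch.
positions = torch.arange(0, seq_length).expand(N, seq_length)

# Each token vector is the sum of its word embedding and its position embedding.
out = word_embedding(x) + position_embedding(positions)

print(positions)
print(out.shape)
```

Since `position_embedding` has only `max_length` rows, a sequence longer than `max_length` would raise an index error.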

In [None]:
class DecoderBlock(nn.Module):
  def __init__(self, embed_size, heads, forward_expansion, dropout, device):
    super(DecoderBlock, self).__init__()
    self.attention = SelfAttention(embed_size,heads)
    self.norm = nn.LayerNorm(embed_size)
    self.transformer_block=TransformerBlock(
        embed_size, heads, dropout, forward_expansion
    )
    self.dropout =nn.Dropout(dropout)
  
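  def forward(self, x, value, key, src_mask, trg_mask):
    #Masked self-attention over the target, followed by the skip connection and normalization.
    attention = self.attention(x,x,x,trg_mask)
    query=self.dropout(self.norm(attention+x))
    #The values and keys come from the encoder output; the query comes from the decoder.
    out = self.transformer_block(value, key, query, src_mask)
    return out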

In [None]:
class Decoder(nn.Module):
    def __init__(
        self,
        trg_vocab_size,
        embed_size,
        num_layers,
        heads,
        forward_expansion,
        dropout,
        device,
        max_length,
    ):
        super(Decoder, self).__init__()
        self.device = device
        self.word_embedding = nn.Embedding(trg_vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)

        self.layers = nn.ModuleList(
            [
                DecoderBlock(embed_size, heads, forward_expansion, dropout, device)
                for _ in range(num_layers)
            ]
        )
        self.fc_out = nn.Linear(embed_size, trg_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, src_mask, trg_mask):
        N, seq_length = x.shape
        positions = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)
        x = self.dropout((self.word_embedding(x) + self.position_embedding(positions)))

        for layer in self.layers:
            x = layer(x, enc_out, enc_out, src_mask, trg_mask)

        out = self.fc_out(x)

        return out


In [None]:
class Transformer(nn.Module):
    def __init__(
        self,
        src_vocab_size,
        trg_vocab_size,
        src_pad_idx,
        trg_pad_idx,
        embed_size=512,
        num_layers=6,
        forward_expansion=4,
        heads=8,
        dropout=0,
        device="cpu",
        max_length=100,
    ):

        super(Transformer, self).__init__()

        self.encoder = Encoder(
            src_vocab_size,
            embed_size,
            num_layers,
            heads,
            device,
            forward_expansion,
            dropout,
            max_length,
        )

        self.decoder = Decoder(
            trg_vocab_size,
            embed_size,
            num_layers,
            heads,
            forward_expansion,
            dropout,
            device,
            max_length,
        )

        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.device = device

    def make_src_mask(self, src):
        src_mask = (src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)
        # (N, 1, 1, src_len)
        return src_mask.to(self.device)

    def make_trg_mask(self, trg):
        N, trg_len = trg.shape
        trg_mask = torch.tril(torch.ones((trg_len, trg_len))).expand(
            N, 1, trg_len, trg_len
        )

        return trg_mask.to(self.device)

    def forward(self, src, trg):
        src_mask = self.make_src_mask(src)
        trg_mask = self.make_trg_mask(trg)
        enc_src = self.encoder(src, src_mask)
        out = self.decoder(trg, enc_src, src_mask, trg_mask)
        return out
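
The padding mask built by `make_src_mask` can be tried on its own. A minimal sketch, assuming a hypothetical batch in which index 0 is the padding token and the raw scores are uniform: it shows how the `(N, 1, 1, src_len)` mask broadcasts over the heads and the query positions.

```python
import torch

# Hypothetical batch: the first sentence ends with two padding tokens (index 0).
src_pad_idx = 0
src = torch.tensor([[1, 5, 6, 0, 0], [1, 8, 7, 3, 2]])

# True where there is a real token, False where there is padding.
src_mask = (src != src_pad_idx).unsqueeze(1).unsqueeze(2)  # (N, 1, 1, src_len)

# Uniform stand-in scores with shape (N, heads, query_len, key_len).
heads, query_len = 2, 5
energy = torch.zeros(src.shape[0], heads, query_len, src.shape[1])

# The mask broadcasts over heads and query positions; padded keys are shut off.
masked = energy.masked_fill(src_mask == 0, float("-1e20"))
attention = torch.softmax(masked, dim=3)

print(src_mask.shape)
print(attention[0, 0, 0])
print(attention[1, 0, 0])
```

In the first sentence the weight is shared among the three real tokens only; in the second, which has no padding, all five tokens share it.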


In [None]:
if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(device)

    x = torch.tensor([[1, 5, 6, 4, 3, 9, 5, 2, 0], [1, 8, 7, 3, 4, 5, 6, 7, 2]]).to(
        device
    )
    trg = torch.tensor([[1, 7, 4, 3, 5, 9, 2, 0], [1, 5, 6, 2, 4, 7, 6, 2]]).to(device)

    src_pad_idx = 0
    trg_pad_idx = 0
    src_vocab_size = 10
    trg_vocab_size = 10
    model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx, device=device).to(
        device
    )
    out = model(x, trg[:, :-1])  # the decoder input is the target without its last token
    print(out.shape)
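
The `trg[:, :-1]` slice ties back to the masking discussion: the decoder input is the target shifted by one, so each output position is compared with the next token. The notebook does not show a training step, so the following is only a sketch under assumptions: the logits are random stand-ins for the model output, and cross-entropy with the padding index ignored is one common choice of loss, not something taken from the code above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
trg = torch.tensor([[1, 7, 4, 3, 5, 9, 2, 0], [1, 5, 6, 2, 4, 7, 6, 2]])
trg_vocab_size, trg_pad_idx = 10, 0

decoder_input = trg[:, :-1]  # what the decoder sees
targets = trg[:, 1:]         # what it should predict at each position

# Random logits standing in for model(x, decoder_input), shape (N, trg_len - 1, vocab).
logits = torch.randn(decoder_input.shape[0], decoder_input.shape[1], trg_vocab_size)

# Flatten batch and time, and skip positions whose target is padding (assumed choice).
loss = F.cross_entropy(
    logits.reshape(-1, trg_vocab_size),
    targets.reshape(-1),
    ignore_index=trg_pad_idx,
)

print(decoder_input.shape, targets.shape)
print(loss.item())
```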