### - What is feature extraction and what different techniques can be used ?
Feature extraction transforms textual data into numerical data. After cleaning and normalizing the text, we need to convert it into numerical features for modeling, because a machine learning model cannot compute on raw text. We therefore build a numerical representation of the words or documents, which is easy for the computer to process. Common techniques are: <br>
1. Bag of Words using CountVectorizer 
2. TF-IDF using TfidfVectorizer
3. Word Embeddings using Word2Vec, GloVe, FastText etc.


### 1. Bag of Words using CountVectorizer : <br>
<b>Theoretical Part</b> : Bag of words is a simple and flexible way of extracting features from documents. It is a representation of text that describes the occurrence of words within a document. We only keep track of word counts and ignore grammatical details and word order. It is called a “bag of words” because any information about the order or structure of words in the document is discarded. The model is only concerned with whether (and how often) known words occur in the document, not where in the document they occur. <br>
<b>Practical Part</b> : <br>
A) CountVectorizer takes as input a list of texts (usually sentences or documents) and first builds a vocabulary, stored as a Python dictionary, by assigning an index starting from 0 to each token/word. We can limit the vocabulary size to N, in which case only the N most frequent tokens are included. We can also set other parameters, for example the minimum number of documents in which a token must appear to become part of the vocabulary. Each token/word has its own column in the document-term matrix, and the index of the token in the vocabulary is the number of that column. <br>
B) It then builds the document-term matrix, either by counting the occurrences of each token (word) in each document (sentence) or by setting 1 if the token is contained in the document and 0 otherwise. Each row corresponds to a specific document/sentence and each column corresponds to a specific token/word. Note : Using the occurrence counts instead of binary values is generally the better approach, since counts capture more detailed information about the data. <br>
<b>Note : </b> Since some documents may contain many more words/terms than others, we usually use the normalized term frequency, which divides the word count by the total number of words in that document. 
<img src='images/tf.png' width="400" height="400"> 
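
The two steps above (build the vocabulary, then fill the document-term matrix) can be sketched in plain Python. This is only an illustration of the idea, not the scikit-learn implementation: the helper names `build_vocabulary` and `document_term_matrix` are made up for this example, and tokenization is assumed to be a simple lowercase whitespace split.

```python
from collections import Counter

def build_vocabulary(documents, max_features=None, min_df=1):
    """Map each kept token to a column index (hypothetical helper)."""
    term_counts = Counter()   # total occurrences of each token
    doc_counts = Counter()    # number of documents containing each token
    for doc in documents:
        tokens = doc.lower().split()
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    # Drop tokens that appear in fewer than min_df documents.
    kept = [t for t in term_counts if doc_counts[t] >= min_df]
    # Most frequent first; ties broken alphabetically for a stable order.
    kept.sort(key=lambda t: (-term_counts[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    # Column indices start at 0 and follow alphabetical order.
    return {token: index for index, token in enumerate(sorted(kept))}

def document_term_matrix(documents, vocabulary, binary=False, normalize=False):
    """One row per document, one column per vocabulary token."""
    matrix = []
    for doc in documents:
        tokens = doc.lower().split()
        counts = Counter(t for t in tokens if t in vocabulary)
        row = [0.0] * len(vocabulary)
        for token, count in counts.items():
            value = 1.0 if binary else float(count)
            if normalize and not binary:
                value = count / len(tokens)   # normalized term frequency
            row[vocabulary[token]] = value
        matrix.append(row)
    return matrix

docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)
counts = document_term_matrix(docs, vocab)
term_frequencies = document_term_matrix(docs, vocab, normalize=True)
```

In scikit-learn, `CountVectorizer` offers comparable options through its `max_features`, `min_df` and `binary` parameters.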


### 2. TF-IDF using TfidfVectorizer (Term Frequency – Inverse Document Frequency) : <br>
<b>Theoretical Part</b> : <br>
TF-IDF is short for term frequency-inverse document frequency. It’s designed to reflect how important a word is to a document in a collection or corpus. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. <br>
<b>Practical Part</b> : <br>
Formula :
<img src='images/img4.png' width="400" height="400"> 
If dfx is small (token x appears in only a few documents), then N/dfx is large, so log(N/dfx) is large and tf-idf(x,y) is high. A token with a high tf-idf value in a document is therefore very important for that document: it appears often in that document but only rarely in the other documents.
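
A minimal sketch of this computation, assuming the formula is tf-idf(x,y) = tf(x,y) * log(N/dfx) with tf taken as the normalized term frequency. The function name `tf_idf` and the whitespace tokenization are choices made for this example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {token: tf-idf} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # df[x]: number of documents that contain token x
    df = Counter(token for tokens in tokenized for token in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            token: (count / len(tokens)) * math.log(n_docs / df[token])
            for token, count in counts.items()
        })
    return scores

docs = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(docs)
# "the" occurs in every document, so log(N / df) = log(1) = 0.
# "dog" occurs in one document only, so it gets the highest weight.
```

Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed idf and L2-normalizes each row, so its values differ from this plain formula.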

### 3. Word Embedding <br>
<b>Theoretical Part</b> : Word embedding is a learned representation of text in which each word is represented as a real-valued vector in a space of much lower dimension than the vocabulary size. The advantage of word embeddings is that they preserve context and capture relationships between words, so similar words can be detected more accurately. Texts/sentences or words/tokens that occur in similar contexts will have similar numerical representations. Word embeddings have several implementations, such as Word2Vec, GloVe, FastText etc. The number of embedding dimensions is a hyperparameter of the NN. In general, a higher value lets the vectors capture more information from the text data, at the cost of more parameters to train.

Mean Embedding : Word embeddings are parameters that are learnt during training of the NN. For a long document, the Embedding layer outputs one vector per token, so the layers that follow would need a large number of parameters to process that output. Therefore a Mean Embedding layer is usually added after the Embedding layer in the NN architecture; it reduces the size of the representation by calculating the mean of each dimension over all token embeddings of the document. If the embedded document is a matrix of sequence_length x nr_dimensions, the output of the Mean layer is 1 x nr_dimensions. In contrast to the Embedding layer, this layer has no trainable parameters. For the same purpose we could also use pooling layers such as AveragePooling1D. <br>
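
As a rough NumPy illustration of the mean embedding idea, assuming the mean is taken over the token positions of one document: the embedding matrix here is random (in a real NN it would be learned), and index 0 is assumed to be the padding index.

```python
import numpy as np

rng = np.random.default_rng(0)
vocabulary_size, nr_dimensions = 10, 4

# Lookup table with one row (embedding vector) per token index.
# Random here; in a NN these values are the trainable parameters.
embedding_matrix = rng.normal(size=(vocabulary_size, nr_dimensions))

# One padded document as a sequence of token indices (0 = padding, assumed).
sequence = np.array([3, 7, 1, 0, 0])

embedded = embedding_matrix[sequence]     # shape: (5, nr_dimensions)
mean_embedding = embedded.mean(axis=0)    # shape: (nr_dimensions,)

# Variant that ignores the padding positions when averaging.
mask = sequence != 0
masked_mean = embedded[mask].mean(axis=0)
```

The mean itself has nothing to learn, which is why such a layer adds no trainable parameters.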


There are several word embedding methods, which can be divided into two major categories : <b>Context-independent</b> and <b>Context-dependent</b>

✅ Context-independent methods learn a single, fixed representation for each word, without considering the word’s context.

<b>Bag-of-words</b>: This method represents a text, such as a sentence or a document, as the bag of its words, disregarding grammar and even word order but keeping multiplicity.

<b>TF-IDF</b>: This computes an importance score by taking the term’s frequency (TF) and multiplying it by the term’s inverse document frequency (IDF).

<b>Word2Vec</b>: Shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words, using either of two model architectures: continuous bag-of-words (CBOW) or continuous skip-gram.

<b>GloVe</b>: This performs training on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

<b>FastText</b>: This method embeds words by treating each word as being composed of character n-grams instead of as a whole word. This feature enables it to learn not only rare words but also out-of-vocabulary words. <br>

✅ Context-dependent methods learn different embeddings for the same word based on its context.

<b>ELMo</b>: learns contextualized word representations based on a neural language model with a character-based encoding layer and two BiLSTM layers.

<b>CoVe</b>: uses a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation to contextualize word vectors.

<b>BERT</b>: This is a transformer-based language representation model trained on a large cross-domain corpus, which uses a masked language model to predict words that are randomly masked in a sequence.

<b>XLM</b>: Another transformer-based model, which is pretrained using next token prediction, masked language modeling and a translation objective.

<b>RoBERTa</b>: This is built on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

<b>ALBERT</b>: This applies parameter-reduction techniques to lower the memory consumption and increase the training speed of BERT.




<b>Practical Part</b> : <br>
A) First we apply a Tokenizer to build a vocabulary by assigning a number/index to each token/word. <br>
B) We convert each document/text into a sequence of indices based on the vocabulary entries of its tokens. <br>
C) We apply padding by adding 0 at the beginning or end of each document's numerical representation so that all sequences have the same length (the Embedding Layer in a NN expects the sequences in a batch to have the same length). <br>
D) We build a NN with an Embedding Layer, which takes as input the numerical representation of each document (the sequences), followed by an AveragePooling1D Layer and some other layers such as a Dense Layer. <br>
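
Steps A) to C) can be sketched in plain Python as follows. The helper names are hypothetical and only loosely modeled on typical tokenizer utilities; index 0 is assumed to be reserved for padding, and unknown tokens are simply skipped.

```python
from collections import Counter

def fit_vocabulary(texts):
    """A) Assign an index to each token; index 0 is reserved for padding."""
    counts = Counter(token for text in texts for token in text.lower().split())
    # Most frequent tokens get the smallest indices; ties sorted alphabetically.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {token: index for index, token in enumerate(ordered, start=1)}

def texts_to_sequences(texts, vocabulary):
    """B) Replace each known token by its index; unknown tokens are skipped."""
    return [[vocabulary[t] for t in text.lower().split() if t in vocabulary]
            for text in texts]

def pad_sequences(sequences, max_len, padding="post"):
    """C) Pad with 0 (or truncate) so that every sequence has length max_len."""
    padded = []
    for seq in sequences:
        seq = seq[:max_len]
        zeros = [0] * (max_len - len(seq))
        padded.append(seq + zeros if padding == "post" else zeros + seq)
    return padded

texts = ["i love nlp", "i love deep learning models"]
vocab = fit_vocabulary(texts)
sequences = texts_to_sequences(texts, vocab)
padded = pad_sequences(sequences, max_len=5)
```

The padded sequences all have the same length and could then be fed to an Embedding Layer (step D).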

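Comparison : CountVectorizer and TF-IDF are both bag-of-words methods that ignore word order and context; TF-IDF additionally weights each token by how informative it is across the corpus. Word embeddings capture the context and similarity of words and are, as a result, generally the better approach for most NLP problems.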

### - How does a traditional Seq2Seq model work ?

A Seq2Seq model takes a sequence of items (words, letters, time series etc) and outputs another sequence of items (the lengths of the input and output sequences may differ). It is composed of an ENCODER and a DECODER. The encoder captures the context of the input sequence, stores it as a vector in a hidden state and sends it to the decoder, which then produces the output sequence. Since the task is sequence based, both the Encoder and the Decoder tend to use RNN, LSTM or GRU layers. The hidden state vector can be of any size, in most cases a power of 2.

<b>The Encoder</b> is composed of an Embedding Layer followed by one or more LSTM Layers. The Embedding Layer takes as input the tokenized words (numerical representation), outputs the word embeddings (a vector representation of each token/word in the training set) and sends them to the LSTM units. Each LSTM unit takes as input the word embedding from the Embedding Layer and the hidden state of the previous LSTM unit. The final hidden state vector contains information about the encoded input sequence. <br>
<b>The Decoder</b>, like the Encoder, usually has an Embedding Layer as its first layer, which processes the word generated/predicted in the previous step, followed by RNN/LSTM/GRU layers. The last layer is a Softmax Layer, which outputs a probability for each token in the target dictionary. Each state/unit takes as input the previous state (LSTM/RNN/GRU) and the previously generated/predicted word/token in order to generate/predict the next word/token.


This is what the Encoder and Decoder usually look like. (The hidden state vector is in the middle, between the Encoder and the Decoder.)

<img src='images/img5.PNG' width="600" height="600">  <br>

### - What is the main disadvantage or limitation of traditional Seq2Seq model ?

The main disadvantage of the traditional Seq2Seq model is the so-called Information Bottleneck. Since Seq2Seq uses a fixed-length hidden state, only a fixed amount of information can pass from the encoder to the decoder. The model is therefore not effective for long sequences: the hidden state cannot hold all the relevant information, which causes the decoder to generate poor predictions. Attention models are used to solve this problem.

### - How do Attention Models work ? 

First, to solve the problem of the fixed-length hidden state, we use the hidden states of every time step. This gives a better and more efficient way of encoding the input sequence and does not force the model to memorize the entire input sequence in a single hidden state. To predict the first words, the Decoder only needs some particular information, mostly from the first hidden states, and not all of them, so it weights the hidden states based on their importance. For this purpose it uses a state from the Decoder, which contains information about the output predictions of the decoder so far. The model can compare this state with each state of the Encoder and determine which Encoder states are the most relevant. Based on this, it gives more weight to the more important states and less weight to the other ones. <br>
<b>Practical Part :</b> <br>
The Attention Model is composed of : 1) Encoder and Pre-Attention Decoder, 2) Attention Mechanism and 3) Decoder. <br>
1) <b>The Encoder</b> takes the input sequence, processes it using an Embedding Layer and LSTM/GRU/RNN layers just like the traditional Encoder, and outputs the hidden states for each time step (these hidden states are used as keys and values). The <b>Pre-Attention Decoder</b> takes the target sequence and processes it using a ShiftRight Layer, which shifts the target sequence to the right to add the start token, an Embedding Layer and LSTM/GRU/RNN layers just like a traditional Encoder, and outputs the hidden states for each time step (these hidden states are used as queries). Both the Encoder and the Pre-Attention Decoder send their hidden states to the Attention Mechanism. The computations of the Encoder and the Pre-Attention Decoder can be done in parallel since they do not depend on each other. <br>
2) <b>The Attention Mechanism</b>  <br>
<b>Keys</b> are the words in the source language, represented as numerical vectors, for example embedding vectors (as a matrix), while <b>Queries</b> are the words from the target language, represented as numerical vectors, for example embedding vectors (as a matrix). <b>Values</b> are the vectors that are associated with the keys (here the value and key matrices are the same). The Alignment/Weights matrix contains scores that show how similar a word from the source language is to a word from the target language (how well they match). For a specific query we look for the most similar keys using these alignment scores.<br> 
The Attention Layer takes as input the keys, queries and values. The Attention Mechanism first uses the hidden states that serve as keys and the hidden states that serve as queries to compute the alignments. These are scores that indicate how important each state of the Encoder (word from the source language) is for a specific Decoder state (word from the target language, not the predicted word); the higher the score, the more important the state. The scores are computed as the matrix multiplication of the queries matrix and the transposed keys matrix. Then it divides each score by the square root of the key dimension (scaling), which keeps the scores in a range where the Softmax still produces useful gradients. Then, using Softmax, it converts these scores into weights/probabilities which range from 0 to 1 and sum up to 1 (a probability distribution). The sum of the products of each Encoder state (values matrix) and its corresponding weight gives the Context Vector (weighted sum) for that hidden state of the Decoder (word from the target language). This is the so-called Scaled Dot-Product Attention (we use vectorization when manipulating these matrices). The Context Vector is what the Attention Layer outputs for each query; it contains the relevant information about the input sequence (encoder stage) with respect to that specific query from the target language. These vectors are then fed to the Decoder. <br>
3) <b>The Decoder </b> takes as input the Context Vectors and the Target Sequence to compute the predicted sequence. It includes LSTM/GRU/RNN layers, a Dense Layer and a LogSoftmax Layer, which outputs the log probabilities for each token/word in the target dictionary. Based on these log probabilities it generates the predicted sequence. <br>

This is what the Query, Key and Value matrices look like : 
<img src='images/img14.png' width="600" height="600">  <br>


This is what the Scaled Dot-Product Attention Mechanism looks like : 
<img src='images/img13.png' width="600" height="600">  <br>
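
A small NumPy sketch of Scaled Dot-Product Attention may help here. It assumes the encoder hidden states are used directly as keys and values and the pre-attention decoder states as queries; the random matrices only stand in for real hidden states.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(queries, keys, values):
    """queries: (n_q, d_k), keys: (n_k, d_k), values: (n_k, d_v)."""
    d_k = keys.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)   # alignment scores, (n_q, n_k)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    context = weights @ values                  # context vectors, (n_q, d_v)
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))   # 5 source positions: keys and values
decoder_states = rng.normal(size=(3, 8))   # 3 target positions: queries
context, weights = scaled_dot_product_attention(
    decoder_states, encoder_states, encoder_states)
```

Each row of `weights` is the probability distribution over the source positions for one query, and each row of `context` is the corresponding weighted sum of the values.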

This is what the entire Attention Model looks like : 
<img src='images/img6.png' width="600" height="600"> 


<img src='images/img18.png' width="600" height="600"> <br>




### - What is Teacher Forcing regarding Transformers ?
In a traditional Seq2Seq model (first picture below), when decoding, we feed the previously predicted token to the next decoder unit as input in order to generate the next output/predicted token. We calculate the loss by comparing the output of each decoder unit to the corresponding label/token, and we sum up these losses to compute the total loss of the model. The problem is that the model may make errors in the early decoding steps, and the predictions that are computed based on earlier predictions then get worse and worse. To avoid this problem, during training we can use the decoder's ground truth tokens (labels) as input for each decoder unit. Even if the model makes a wrong prediction, the next word's prediction is computed based on the correct previous token (the model pretends it made the right prediction). As a result, training becomes faster and more stable, which improves the performance of the model. There are different variations of this strategy, for example: in later stages of training we can use the decoder's output again as input to the next decoder unit.
 
<b>Traditional Seq2Seq Model :</b>

<img src='images/img1.png' width="600" height="600">  <br>

<b> Traditional Seq2Seq Model using Teacher Forcing :</b>
<img src='images/img2.png' width="600" height="600"> 
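
The difference between the two pictures comes down to which token is fed into each decoder step. The sketch below is a simplification under stated assumptions: `decoder_inputs` is a made-up helper, the predictions are given as a fixed list (in a real decoder each prediction depends on the inputs chosen so far), and `teacher_forcing_ratio` is an assumed name for the probability of using the ground truth.

```python
import random

def decoder_inputs(targets, predictions, start_token="<sos>",
                   teacher_forcing_ratio=1.0, rng=None):
    """Build the token that is fed to the decoder at each step.

    Step t receives the token from step t-1: the ground-truth token with
    probability teacher_forcing_ratio, otherwise the model's own prediction.
    """
    rng = rng or random.Random(0)
    inputs = [start_token]
    for truth, predicted in zip(targets[:-1], predictions[:-1]):
        use_truth = rng.random() < teacher_forcing_ratio
        inputs.append(truth if use_truth else predicted)
    return inputs

targets = ["i", "am", "hungry", "<eos>"]
predictions = ["i", "is", "hungry", "<eos>"]   # hypothetical model outputs

with_forcing = decoder_inputs(targets, predictions, teacher_forcing_ratio=1.0)
without_forcing = decoder_inputs(targets, predictions, teacher_forcing_ratio=0.0)
```

With a ratio of 1.0 the wrong prediction "is" never reaches the next step; with 0.0 it is fed back in, as in the traditional model. Values in between correspond to the mixed variations mentioned above.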



### - Which metrics are used to evaluate Neural Machine Translation Systems ? 
1. BLEU Score (similar to Precision)
2. ROUGE-N (similar to Recall) <br>
3. F1-Score (harmonic mean of Precision and Recall)


1. <b>BLEU Score (similar to Precision)</b> <br> 
The BLEU Score is computed as the number of overlapping unigrams between the prediction and the target/reference divided by the number of words in the prediction. 
In other words, it is the number of words in the Predicted Sequence that appear in the Reference Translations (Reference Translations are different human translations (labels) of a particular sentence in a particular language) divided by the total number of words in the Predicted Sequence. (The modified version removes a word from the reference translations once it has been matched, so repeated words are not counted more often than they occur in a reference, and is therefore a better evaluation metric.) This is the unigram form; the full BLEU metric also combines the precisions of longer n-grams and applies a brevity penalty for short predictions. The main drawback of the BLEU Score is that it does not consider semantic meaning and sentence structure. For example the Predicted Sequence "Ate i was hungry because" would have a perfect unigram BLEU Score if all these words appear in the Reference Translations. <br>

2. <b>ROUGE-N (similar to Recall)</b> <br>
ROUGE-N is computed as the number of overlapping n-grams (unigrams for N = 1) between the prediction and the target/reference divided by the number of words in the reference. 
In other words, for each Reference Translation we compute the number of words in that translation that appear in the Predicted Sequence divided by the total number of words in that Reference Translation. Then we take the maximum of these values, which is the ROUGE-N Score.

3. <b>F1-Score</b> <br>
Since it may be difficult to select the best model by looking at two scores (BLEU and ROUGE-N) at once, the F1-Score is a single metric which computes the harmonic mean of the BLEU and ROUGE-N scores. A model only gets a high F1-Score if both scores are high. <br>
Formula : F1-Score = (2 * ROUGE-N * BLEU) / (ROUGE-N + BLEU)
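
The three metrics can be sketched for the simplified unigram case described above. This is not the full BLEU or ROUGE definition (no higher-order n-grams, no brevity penalty); the function names are chosen for this example only.

```python
from collections import Counter

def unigram_precision(prediction, references):
    """Clipped unigram precision (the simplified BLEU described above)."""
    pred_counts = Counter(prediction.lower().split())
    # For each word, the largest count found in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref.lower().split()).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    overlap = sum(min(count, max_ref_counts[word])
                  for word, count in pred_counts.items())
    return overlap / sum(pred_counts.values())

def unigram_recall(prediction, references):
    """Best unigram recall over all references (the simplified ROUGE-1)."""
    pred_counts = Counter(prediction.lower().split())
    best = 0.0
    for ref in references:
        ref_counts = Counter(ref.lower().split())
        overlap = sum(min(count, pred_counts[word])
                      for word, count in ref_counts.items())
        best = max(best, overlap / sum(ref_counts.values()))
    return best

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

references = ["i ate because i was hungry"]
prediction = "ate i was hungry because"
p = unigram_precision(prediction, references)
r = unigram_recall(prediction, references)
f1 = f1_score(p, r)
```

The scrambled prediction still reaches a precision of 1.0, which illustrates the drawback that word order is ignored.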


### - What are different techniques to perform decoding when generating the translated words in Attention / Seq2Seq models ?
1. <b>Greedy Decoding.</b> We select the most probable word at each time step. But the best word at each step may not lead to the best overall sequence, and the model tends to select very common words, for example Target : "I am hungry because I did not eat anything today", Prediction : "I am am am ...". It can work fine for shorter sequences, but it is limited by its inability to look further ahead at the following words.
2. <b>Random Sampling.</b> We choose the predicted word at each time step randomly, but we give more weight to more probable words and less weight to less probable words to reduce the randomness. Temperature is a parameter that we adjust to allow more or less randomness in sampling. It is typically chosen between 0 and 1, indicating low to high randomness (values above 1 flatten the distribution even further). A lower temperature setting leads to a more confident and conservative network, while a higher temperature setting leads to a more excited and random network.
3. <b>Beam Search.</b> At each time step we calculate the probability of the candidate sequences and keep the <b>B</b> most probable ones. The probability of a sequence "word1 ... wordn"  = P(word1)P(word2|word1)P(word3|word2 word1) ... P(wordn|wordn-1 ... word1). Beam Search has two disadvantages: it penalizes long sentences, since we multiply probability values which are lower than 1 (we can compensate by normalizing the probabilities by the sentence length), and it is computationally expensive, since it calculates many conditional probabilities. If <b>B=1</b> we have Greedy Decoding.
4. <b>Minimum Bayes Risk</b> <br>
A) We generate several candidate translations. B) We assign a similarity to every pair of candidate translations using a similarity score (such as ROUGE-N) and calculate the average similarity for each candidate translation. C) We select the candidate with the highest average similarity score. <br>
So the objective is as follows : 
<img src='images/img7.png' width="400" height="400"> 
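
The decoding strategies above can be sketched on a toy example. Everything here is an assumption made for illustration: `toy_model` is a hand-written lookup table standing in for a real decoder, `next_token_probs(prefix)` is a hypothetical model interface, and `word_overlap` is a simple stand-in for ROUGE-N.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Lower temperature sharpens the distribution, higher flattens it."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def beam_search(next_token_probs, beam_width, max_len, eos="<eos>"):
    """Keep the beam_width most probable partial sequences at each step.

    next_token_probs(prefix) returns {token: P(token | prefix)}.
    Scores are summed log-probabilities.
    """
    beams = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))   # finished, keep as is
                continue
            for token, prob in next_token_probs(prefix).items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Length normalization counters the penalty on longer sequences.
        candidates.sort(key=lambda c: c[1] / len(c[0]), reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]

def minimum_bayes_risk(candidates, similarity):
    """Pick the candidate with the highest average similarity to the others."""
    def average_similarity(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(similarity(candidates[i], o) for o in others) / len(others)
    best = max(range(len(candidates)), key=average_similarity)
    return candidates[best]

def toy_model(prefix):
    """Hand-written conditional probabilities (not a trained model)."""
    table = {
        (): {"i": 0.6, "we": 0.4},
        ("i",): {"am": 0.5, "was": 0.5},
        ("we",): {"are": 0.9, "were": 0.1},
    }
    return table.get(prefix, {"<eos>": 1.0})

def word_overlap(a, b):
    """Jaccard overlap of word sets, a simple stand-in for ROUGE-N."""
    set_a, set_b = set(a.split()), set(b.split())
    return len(set_a & set_b) / len(set_a | set_b)

greedy = beam_search(toy_model, beam_width=1, max_len=3)   # B=1 is greedy decoding
beam = beam_search(toy_model, beam_width=2, max_len=3)
sharp = softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.5)
flat = softmax_with_temperature([2.0, 1.0, 0.0], temperature=1.0)
consensus = minimum_bayes_risk(
    ["i am hungry", "i am very hungry", "i am starving"], word_overlap)
```

In this toy table the greedy choice "i" (0.6) leads to a sequence with probability 0.3, while the beam with B=2 keeps "we" alive and finds "we are" with probability 0.36.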


### - What are the main problems that RNNs face in NLP and how to solve it  ?
One of the biggest issues with RNNs is that they rely on sequential computation and do not allow parallel computing. That means that, in order for the model to process the word "you", it first has to go through the preceding words such as "are" and only then "you". This becomes critical at longer sequence lengths, because if the text is too long, then A) it takes a long time to process it sequentially, B) a good amount of the information mentioned earlier in the text is lost as we approach the end, and 
C) Vanishing Gradient: when we back-propagate, the gradients can become really small and as a result the model does not learn much. 

Transformers solve these issues. They are based on attention and don't require any sequential computation per layer; only a single step is needed. Additionally, the number of gradient steps that need to be taken from the last output to the first input in a transformer is just one, whereas for RNNs the number of steps increases with longer sequences. As a result, transformers don't suffer from the vanishing gradient problems that are related to the length of the sequences.

### - What are different types of Attention Mechanism in Attention models ?
1. Encoder-Decoder Attention.
The Queries come from one sentence (target sentence) and the Keys and Values come from another sentence (source sentence). It outputs a weight matrix where each weight corresponds to a specific Query and Key (it shows how similar they are). <br>
<img src='images/img8.png' width="600" height="600"> <br>

2. Self Attention
Queries, Keys and Values come from the same sentence, so each word attends to every other word in the sequence. It is used to get the meaning of a word within the sentence and to learn a contextual numerical representation of each word within the sentence. <br>
<img src='images/img9.png' width="600" height="600">  <br>

3. Masked Self Attention
Queries, Keys and Values come from the same sentence, but Queries do not attend to future positions. This Attention Mechanism is present in the Decoder and ensures that the prediction at each position depends only on the previous words (known outputs). It sets the scores of future positions to minus infinity, which leads to weights of 0 in the weight matrix after the Softmax. <br>
<img src='images/img10.png' width="600" height="600">  <br>
<img src='images/img11.png' width="600" height="600"> <br>
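
A minimal NumPy sketch of Masked Self Attention, under the simplifying assumption that the input vectors are used directly as queries, keys and values (a real model would first apply learned linear projections):

```python
import numpy as np

def causal_self_attention(x):
    """x: (seq_len, d). Queries, keys and values all come from x."""
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)                       # (seq_len, seq_len)
    # True above the diagonal marks the future positions.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Softmax over the keys; exp(-inf) = 0 gives future positions zero weight.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))   # 4 positions, 8 dimensions (random stand-ins)
context, weights = causal_self_attention(tokens)
```

The resulting weight matrix is lower triangular: the first position can only attend to itself, the second to the first two positions, and so on.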


### - What is Multi-Head Attention and how does it work ?
* In Multi-Head Attention we apply the Attention Mechanism in parallel to multiple sets of matrices (Queries, Keys, Values matrices).
* The number of times we apply the Attention Mechanism is the number of heads in the model. For example, in a model with 2 heads we need 2 sets of Queries, Keys and Values matrices.
* The goal of Multi-Head Attention is to let the model learn multiple relationships between the words of the query and key matrices by using different sets of representations (a different context vector per head for each query).
* Since the heads are computed in parallel, it has a similar computational cost to Single-Head Attention <br>

<b>Practical Part :</b> <br>
1. Multi-Head Attention takes as input the matrices Keys, Queries, Values (one set of matrices per head). The number of columns in these matrices is the embedding size or dimension. The number of rows is the number of words in the source sequence (Keys and Values) or in the target sequence (Queries).
2. We apply a linear transformation to each Keys/Queries/Values matrix using a Weight Matrix. The number of rows in each Weight Matrix is equal to the embedding dimension. The number of columns is a hyperparameter that we can choose. It is recommended to use d_v = embedding dimension / nr of heads. This choice ensures that the computational cost of Multi-Head Attention does not greatly exceed that of Single-Head Attention. As output we get a linearly transformed matrix for each Keys/Queries/Values matrix, which has d_v columns and the same number of rows as the corresponding Queries/Keys/Values matrix. Each row in these matrices is the projected representation of one word.
3. We apply the Attention Mechanism (Scaled Dot-Product Attention) to each set. We get a single matrix for each set of Keys/Queries/Values. This matrix has a number of columns equal to d_v and a number of rows equal to the number of rows in the Query matrix. In this matrix each row represents the Context Vector of one query.
4. We concatenate the matrices output by the Attention Heads column-wise into one single matrix, which means we concatenate the different context vectors of each query into a single context vector for that query. This matrix has a number of columns equal to nr of heads * d_v and a number of rows equal to the number of rows in the Query matrix.
5. We apply a linear transformation to this concatenated matrix using a Weight Matrix. The output matrix has a number of columns equal to the embedding dimension and a number of rows equal to the number of rows in the Query matrix.

This picture shows what Multi-Head Attention looks like : 
<img src='images/img12.png' width="600" height="600"> <br>

<img src='images/img15.png' width="600" height="600"> <br>
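
The steps above can be sketched in NumPy. This is an illustration under assumptions, not a reference implementation: the weight matrices are random instead of learned, and each full `(d_model, d_model)` projection matrix is used as the column-wise concatenation of the per-head weight matrices of size `(d_model, d_v)`.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(queries, keys, values, weights, n_heads):
    """queries: (n_q, d_model); keys, values: (n_k, d_model).

    weights holds the projection matrices W_q, W_k, W_v, W_o,
    each of shape (d_model, d_model).
    """
    d_model = queries.shape[-1]
    d_head = d_model // n_heads                  # d_v = d_model / n_heads

    def split_heads(x, w):
        # Linear projection, then (n, d_model) -> (n_heads, n, d_head).
        return (x @ w).reshape(x.shape[0], n_heads, d_head).transpose(1, 0, 2)

    q = split_heads(queries, weights["W_q"])
    k = split_heads(keys, weights["W_k"])
    v = split_heads(values, weights["W_v"])

    # Scaled dot-product attention, computed for all heads at once.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n_q, n_k)
    context = softmax(scores) @ v                          # (n_heads, n_q, d_head)

    # Concatenate the heads column-wise, then apply the output projection.
    concat = context.transpose(1, 0, 2).reshape(queries.shape[0], d_model)
    return concat @ weights["W_o"]                         # (n_q, d_model)

rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
weights = {name: rng.normal(size=(d_model, d_model))
           for name in ("W_q", "W_k", "W_v", "W_o")}
source = rng.normal(size=(5, d_model))   # keys and values
target = rng.normal(size=(3, d_model))   # queries
output = multi_head_attention(target, source, source, weights, n_heads)
```

The output has one row per query and `d_model` columns, so it can be added to the layer input in a residual connection.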


### - How does the entire Transformer Decoder work ? 
1. First we perform Embedding for each word in the input sequence.
2. We perform positional encoding. These are vectors (learned or fixed) that are added to the embedding vectors.
3. Then we apply the Attention Mechanism using Multi-Head Attention, followed by ADD, which means adding the input of the Attention Layer (the positional input embeddings) to the output of the Attention Layer, and then NORMALIZE the result, which stabilizes and speeds up training.
4. We apply a Feed-Forward Layer using ReLU, followed by ADD, which means adding the input of the Feed-Forward Layer (the output of the Attention Layer) to the output of the Feed-Forward Layer, and then NORMALIZE the result, again to stabilize and speed up training.
5. The Attention (Multi-Head Attention) and Feed-Forward Layers (steps 3 and 4) are repeated n times (hyperparameter).
6. We use a Dense Layer to process the encoded input, which performs a linear transformation.
7. In the end we use a Softmax Layer, which outputs probabilities.
8. It also includes a module which computes the cross-entropy loss.

This is what the entire Decoder of the Attention Model looks like : 
<img src='images/img16.png' width="600" height="600"> <br>


### - What is Positional Encoding ?
* Positional Encoding is the process of encoding each token's position in the sequence. This is important because Transformers don't use RNNs and therefore have no built-in notion of order, but word order is important in any language. 
* For each word/token a positional encoding vector is built. Then it is added to the word's Embedding so that for each word/token we have information about its position in the sequence. Positional encoding vectors can be parameters that are learnt just like Embeddings, or they can be fixed, for example computed with sine and cosine functions as in the original Transformer.
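
As an illustration, here is a NumPy sketch of the fixed sinusoidal encoding used in the original Transformer, which is an alternative to learned position vectors. The function name is chosen for this example, `d_model` is assumed to be even, and the token embeddings are random stand-ins.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed encodings: sin on even dimensions, cos on odd dimensions."""
    positions = np.arange(seq_len)[:, np.newaxis]          # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]    # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
token_embeddings = rng.normal(size=(seq_len, d_model))
# The positional encoding is added element-wise to the embeddings.
model_input = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```

Each position gets a distinct vector, so two occurrences of the same word at different positions no longer have identical inputs.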

### - Compare Word2Vec, Glove, FastText, BlazingText with BERT.
<b>Similarities :</b> <br>
Both BlazingText and BERT generate word embeddings.
<br>
<b>Differences :</b> <br>
1. BlazingText operates at the word level (input of BlazingText) whereas BERT operates at the sentence level (input of BERT).
2. Using the bidirectional nature of the transformer architecture, BERT can capture the context of a word (in both directions), while BlazingText learns word-level embeddings for all the words that are included in the training corpus but does not look at contextual information in specific sentences. Example 1 : the word "dress". The embedding that is generated by BlazingText for the word dress is for example [.02, 0.1, .003, 0.2]. Regardless of where the word dress appears in a sentence or in which sentence (context) it appears, the embedding generated for it is always the same, which means BlazingText does not capture the context of the word dress in a sentence or across different sentences. <br>
Example 2 : Take these two sentences: "I love the dress." and "I love the dress, but not the price." The context of the word dress is different in these two sentences. BERT can take into consideration the words that come before the word dress as well as the words that follow it. Using this bidirectional nature of the transformer architecture, BERT is able to capture the context. The embeddings (token + segment + position) that are generated for the word dress in these two sentences will be different. However, the length of the embeddings in these two sentences is fixed. So BERT generates contextualized word embeddings.
3. BERT is built on the Transformer architecture while BlazingText is based on the Word2Vec architecture. <br>



### - Compare and explain the EVOLUTION of algorithms : Word2Vec, Glove, FastText, Transformers, BlazingText, ELMo, GPT, BERT and T5 <br>

Evolution of Text Analysis algorithms (T5 is the last introduced model)  : 
<img src='images/models.PNG' width="600" height="600"> 

<b>Word2Vec (Word Embedding)</b> <br>
* Word2Vec converts text (words/tokens) into numerical vectors called word embeddings, for example in a 300-dimensional space (the dimension is a hyperparameter set manually by the developer), such that words with similar meaning have a small Euclidean distance (they are close to each other in that space) and a high cosine similarity. 
* The model learns these word embeddings from a particular local dataset, or we can use pre-trained word embeddings.
* The model architecture is based upon shallow two-layer neural networks (fully connected layers). To create the embeddings, Word2Vec makes use of : a) Continuous Bag of Words (CBOW), which predicts the current word from a fixed window of surrounding context words, and b) continuous skip-gram, which uses the current word to predict the surrounding fixed window of context words. 
* Disadvantage 1: One challenge with Word2Vec is that it runs into so-called out-of-vocabulary issues, because its vocabulary is limited (the pre-trained model contains only three million words). The vocabulary is the set of known words that the model learned in the training phase. Out-of-vocabulary words are words that were not present in the text dataset the model was trained on. If a word is not found in the vocabulary, no learned embedding is available for it and it is assigned a zero vector, which basically discards the word.
* Disadvantage 2: It also uses a fixed window, which limits the surrounding words that can be relevant to predict the current word (if we take more surrounding words into account, we have more contextual information).
* Disadvantage 3 : Word2Vec learns word-level embeddings for all the words that are included in the training corpus but does not look at contextual information in specific sentences. For example, it would generate the same embedding for the word "fair" in both of these sentences: "He didn't receive fair treatment" and "Fun fair in New York City this summer". But words can have different meanings in different contexts.
<img src='images/cbw.PNG' width="500" height="500"> 
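
To make the difference between CBOW and skip-gram concrete, the sketch below only generates the training examples that each variant would learn from; it does not train a network. The function name `training_pairs` is made up for this example.

```python
def training_pairs(tokens, window=2):
    """Return (cbow_examples, skipgram_examples) for one tokenized sentence."""
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        # CBOW: the context words are the input, the center word is the target.
        cbow.append((context, center))
        # Skip-gram: the center word is the input, each context word a target.
        skipgram.extend((center, word) for word in context)
    return cbow, skipgram

sentence = "the quick brown fox jumps".split()
cbow, skipgram = training_pairs(sentence, window=1)
```

With `window=1`, the word "brown" yields the CBOW example `(["quick", "fox"], "brown")` and the skip-gram examples `("brown", "quick")` and `("brown", "fox")`.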

<b>Glove (Word Embedding)</b> <br>
* GloVe (Global Vectors for Word Representation) is an unsupervised algorithm, created at Stanford University, that learns word vectors from the co-occurrence matrix of words in a corpus, so that words which are used together or in similar ways are drawn close to each other and unrelated words are pushed apart. Pre-trained vectors that were learned on a huge corpus are available.  
* The main advantage of GloVe over Word2Vec is that GloVe is trained on aggregated global co-occurrence statistics of the whole corpus instead of only on local context windows. In addition, we do not have to train the vectors on our own (possibly small) dataset: we can use the vectors that Stanford University pre-trained on huge corpora, which usually gives better results even if our own dataset is small. Pre-trained GloVe vectors are available that were trained on a corpus of around 6 billion tokens and also cover general-use characters like commas, braces and semicolons.
* GloVe learns word-level embeddings for all the words that are included in the training corpus but does not look at contextual information in specific sentences. For example, it would generate the same embedding for the word "fair" in both of these sentences: "He didn't receive fair treatment" and "Fun fair in New York City this summer". But words can have different meanings in different contexts.

<b>FastText (Word Embedding)</b> <br>
* FastText is an extension of Word2Vec. Like Word2Vec, it is based on the CBOW and skip-gram models (unsupervised word embedding generation). In addition, it can also be used for text classification (supervised).
* It breaks each word into character n-grams of length n (example given in the picture). It creates an embedding for each n-gram of the word, and the embedding of the entire word is the aggregation of the embeddings of the n-grams within the word. This helps with the out-of-vocabulary issue of Word2Vec, since the probability of having seen n-grams such as "am" or "ama" in the corpus is much higher than that of having seen a rare whole word. As a result, FastText can represent far more words than a fixed word-level vocabulary such as the three million entries of the pre-trained Word2Vec model, because the n-grams of each word are now part of the vocabulary.
* Disadvantage: FastText learns one word-level embedding for each word in the training corpus and does not look at the context of a specific sentence. For example, it generates the same embedding for the word "fair" in both "He didn't receive fair treatment" and "Fun fair in New York City this summer", even though words can have different meanings in different contexts.
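
A minimal sketch of the n-gram idea, assuming boundary markers `<` and `>` around each word and averaging as the aggregation step. The helper names `char_ngrams` and `word_vector` and the 2-dimensional n-gram vectors are hypothetical and only serve to show how an unseen word can still receive a vector.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return character n-grams of `word`, wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def word_vector(word, ngram_vectors, dim, n=3):
    """Average the vectors of all known n-grams; unknown n-grams are skipped."""
    known = [ngram_vectors[g] for g in char_ngrams(word, n, n) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


# Hypothetical n-gram vectors, as if learned from a word like "amazing".
ngram_vectors = {"<am": [1.0, 0.0], "ama": [0.0, 1.0], "maz": [1.0, 1.0]}

print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
print(word_vector("amaze", ngram_vectors, dim=2))  # built from shared n-grams
```

The word "amaze" was never seen as a whole, but three of its trigrams were, so it still gets a non-zero vector.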

<b>Transformers </b> <br>
* Transformers are based on the attention mechanism. In NLP terms, attention maps each word of the model's output to the words in the input sequence and assigns them weights depending on their importance for the predicted word. The self-attention mechanism in the Transformer architecture focuses on capturing the relationships between all words in the input sequence, which significantly improves the accuracy of natural language tasks such as machine translation.
<img src='images/imgx.PNG' width="500" height="500"> 
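
The weighting step of self-attention can be sketched in plain Python. This is a deliberately reduced version: queries, keys, and values are all the raw token vectors (a real Transformer applies learned projections and multiple heads first), and the 2-dimensional token vectors are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned projections)."""
    d = len(x[0])
    outputs, weights = [], []
    for query in x:
        # Similarity of this token to every token in the sequence, scaled by sqrt(d).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all token vectors.
        outputs.append([sum(w[j] * x[j][k] for j in range(len(x))) for k in range(d)])
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)

for row in weights:
    print([round(w, 3) for w in row])  # each row sums to 1
```

Each row of `weights` says how much one token attends to every token in the sequence, including itself.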

<b>BlazingText</b><br>
* BlazingText implements the FastText algorithms in an accelerated form. To give an idea of the scope of the acceleration: using BlazingText, you can train a model on more than a billion words in a couple of minutes using a multi-core CPU or GPU. It was developed by AWS.
<img src='images/b.PNG' width="500" height="500"> 

<b>ELMo</b><br>
* In ELMo, word vectors are learned by a deep bidirectional LSTM language model. ELMo combines a forward and a backward language model and is thus able to better capture syntax and semantics across different linguistic contexts.
* Although ELMo is bidirectional, it suffers from some issues, such as difficulty capturing longer-term dependencies.

<b>GPT</b><br>
* GPT models are based on the Transformer architecture.
* Training has two steps: first, GPT learns a language model from a large unlabeled text corpus by predicting the next token; second, GPT performs a supervised learning step with labeled data to learn a specific NLP task such as text classification.
* GPT is trained on, and can only use, context from left to right, which is often referred to as unidirectional.
* Transformers like GPT help with the problem of capturing longer-term dependencies, but since GPT was still unidirectional, BERT was then introduced, which stands for Bidirectional Encoder Representations from Transformers.

<b>BERT</b><br>
* BERT, or Bidirectional Encoder Representations from Transformers, is truly bidirectional, in contrast to GPT. In the unsupervised training step, BERT learns representations from unlabeled text using the left-to-right and right-to-left context jointly. This novel approach created interest in BERT across the industry and has led to many variations of BERT models, some of which focus on a specific language such as French, German, or Spanish. There are also BERT models that focus on a specific text domain, such as scientific or financial text.
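
The difference between unidirectional (GPT) and bidirectional (BERT) context can be pictured as an attention mask that states which positions each token is allowed to look at. The two helper functions below are hypothetical names for this sketch, not part of any library.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


for row in causal_mask(4):
    print(["x" if allowed else "." for allowed in row])

print()

for row in bidirectional_mask(4):
    print(["x" if allowed else "." for allowed in row])
```

In the causal mask, the first token sees only itself and the last token sees everything before it; in the bidirectional mask, every token sees the full sentence.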

<b>T5</b> <br>
* T5 (Text-to-Text Transfer Transformer) makes use of transfer learning and uses the same model for many tasks, such as text classification and machine translation.
<img src='images/t5.PNG' width="500" height="500"> 


### - Explain how BERT works.
Reference Article : http://jalammar.github.io/illustrated-bert/ <br>
* BERT stands for Bidirectional Encoder Representations from Transformers. There are two steps in the BERT framework: <b> language model (pre-training) and fine-tuning (task-specific head)</b>. 
* BERT_BASE (BERT base model) consists of : 12 Layers (12 Transformer blocks), 12 Attention heads and 110 million parameters.
* During <b>pre-training</b>, we build a language model from a corpus, which means the model learns the language of the corpus that we feed into it. The model is trained on unlabeled data (unsupervised) over two pre-training tasks, and it learns representations using the left-to-right and right-to-left context jointly. The pre-training tasks are Masked Language Modelling (MLM) and Next Sentence Prediction (NSP). In NSP, we give 2 sentences as input and the model predicts true if sentence 2 follows sentence 1. In MLM, we input an incomplete sentence and ask BERT to complete it (we mask some words as unknown and let the model predict those words based on the surrounding words). The model is trained on these 2 tasks in order to generate contextualized word embeddings.
* For <b>fine-tuning (task-specific head)</b>, we start from the pre-trained BERT model and adapt it to a specific task, for example text classification or question answering, usually with a small labeled dataset. We initialize the model with the pre-trained weights and then either further optimize the weights together with the task-specific head that the BERT architecture already provides, or we add some fully connected layers, dropout layers, etc. to the network's head and optimize these layers.
<img src='images/myimg.png' width="600" height="600"> <br>

* How <b>Masked Language Modelling (MLM)</b> works: 
1) Tokenize the text (a set of sentences). Just like we usually would with transformers, we begin with text tokenization.
2) Create a labels tensor. We are training our model here, so we need a labels tensor to calculate the loss against and to optimize towards. The labels tensor is simply input_ids, so all we need to do is make a copy.
3) Mask tokens in input_ids. Now that we have created a copy of input_ids for the labels, we can go ahead and mask a random selection of tokens. The BERT paper selects each token with a 15% probability during pre-training, with a few additional rules (of the selected tokens, 80% are replaced with the mask token, 10% with a random token, and 10% are left unchanged). We will use a simplified version of this and assign each token a 15% probability of being masked.
4) Calculate the loss. We pass the input_ids and labels tensors through our BERT model and calculate the loss between the model's predictions and the labels. Using this loss, we compute the gradients and update the model weights.
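
Steps 2 and 3 can be sketched in plain Python, assuming the simplified rule of masking each token with 15% probability. The IDs 101 and 102 come from the text; 103 for the mask token and 0 for padding are assumed from the `bert-base-uncased` vocabulary, and `mask_tokens` plus the example IDs are hypothetical.

```python
import random

# 101 and 102 are given in the text; 103 ([MASK]) and 0 ([PAD]) are assumed
# from the bert-base-uncased vocabulary.
CLS_ID, SEP_ID, MASK_ID, PAD_ID = 101, 102, 103, 0
SPECIAL_IDS = {CLS_ID, SEP_ID, PAD_ID}


def mask_tokens(input_ids, mask_prob=0.15, seed=0):
    """Return (masked_ids, labels); special tokens are never masked."""
    rng = random.Random(seed)
    labels = list(input_ids)  # step 2: labels are a copy of input_ids
    masked = [
        MASK_ID if tok not in SPECIAL_IDS and rng.random() < mask_prob else tok
        for tok in input_ids
    ]  # step 3: replace a random selection of tokens with the mask token
    return masked, labels


# Toy sequence: [CLS], four ordinary token IDs, [SEP], two padding positions.
input_ids = [101, 2293, 2023, 2003, 2307, 102, 0, 0]
masked_ids, labels = mask_tokens(input_ids)

print(masked_ids)
print(labels)
```

The model then receives `masked_ids` as input and is trained to reproduce `labels` at the masked positions.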

<img src='images/bert1.png' width="600" height="600"> <br>
<img src='images/bert3.png' width="600" height="600"> <br> 
* How Next Sentence Prediction (NSP) works: BERT is given pairs of sentences. For 50 percent of the pairs, the second sentence is the actual next sentence; for the other 50 percent, it is replaced with a random sentence from the corpus. BERT then predicts whether the two sentences are a valid sentence pair or not, and its weights are corrected when it predicts incorrectly. Both training tasks (MLM and NSP) are performed together, and their losses are combined into a single loss for the combined training effort. This results in a more robust model capable of performing word-level and sentence-level predictive tasks. Also note that this pre-training step is implemented as unsupervised learning: the input data is a large collection of unlabeled text. 
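
A minimal sketch of how NSP training pairs could be built from an ordered list of sentences. The function name `make_nsp_pairs` and the toy sentences are assumptions; for simplicity the random sentence is drawn from the same list, whereas the real pre-training data draws it from a much larger corpus.

```python
import random


def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) examples; needs at least 3 sentences."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: keep the true next sentence.
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # Negative example: any sentence except sentence_a and the true next one.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), 0))
    return pairs


sentences = [
    "I went to the store.",
    "I bought some milk.",
    "Then I walked home.",
    "It started to rain.",
]
pairs = make_nsp_pairs(sentences)

for sentence_a, sentence_b, is_next in pairs:
    print(is_next, "|", sentence_a, "->", sentence_b)
```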

* The objective functions that we are trying to minimize for MLM and next sentence prediction are as follows. We simply add these 2 losses to obtain one single loss, which is then minimized.
<img src='images/bert4.png' width="600" height="600"> <br> 
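
Written out, and assuming the two losses are simply added, the combined objective can be sketched as below. The notation is introduced here for illustration: $M$ is the set of masked positions, $\tilde{x}$ the masked input, $x_i$ the original token at position $i$, and $y$ the is-next label for the sentence pair $(A, B)$.

$$
\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}, \qquad
\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log p(x_i \mid \tilde{x}), \qquad
\mathcal{L}_{\text{NSP}} = -\log p(y \mid A, B)
$$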

* Fine-Tuning the BERT Model: Given the pre-trained BERT model (transfer learning), we can fine-tune it for our specific task using the weights that were learnt during pre-training, for example question answering, text classification (sentiment analysis), or text summarization, by feeding the input in the same way the model was trained. The model expects 2 sentences as input. In the case of text classification there is only one sentence, so the second sentence is replaced by a symbol that indicates there is no sentence.
<img src='images/bert5.png' width="600" height="600"> <br>
<img src='images/bert6.png' width="600" height="600"> <br>


<b>How are BERT Embeddings (not to be confused with word embeddings) / BERT Input IDs generated ?</b>
1. Apply WordPiece tokenization to the raw sentence (the input can also consist of several sentences, which is common for NLP tasks such as question answering). WordPiece tokenization segments sentences into tokens based on the vocabulary of the pre-trained model. The number of tokens per input is limited by max_length (512 for BERT-base; for other BERT models we should check the max_length that the model accepts), and each token is later represented by a vector of dimension 768.
2. Add special tokens: CLS (101) is a special token that comes at the front of each input, which may include several sentences, and SEP (102) is the separator token that indicates the end of one sentence and the beginning of another.
3. Determine the token ID of each individual token based on the vocabulary (dictionary), which maps each token to a specific number (note: the vocabulary comes with the pre-trained model and is used unchanged to encode the training, validation, and test sets). Here, the token CLS gets the ID 101 because that is the position of CLS in the vocabulary. Similarly, the token love gets the ID 2293, the token this gets the ID 2023, and so on. Each token ID is then mapped to a 768-dimensional token embedding.
4. The segment embedding becomes much more important when there are multiple sentences in the input sequence. A segment ID of 0 means that a token belongs to the first sentence in the sequence, and a segment ID of 1 means that it belongs to the second sentence. Here I have only one sentence in my input sequence, so all individual tokens get a segment ID of 0.
5. The position embedding encodes the index position of each individual token in the input sequence. Here, my input sequence consists of four tokens, so the positions are assigned with a zero-based index: the position of the token CLS is 0, the position of the token love is 1, and so on.
6. Finally, we sum the position, segment, and token embeddings that have been previously determined (element-wise, for each of the 768 dimensions).
7. To give all sequences the same length, we perform padding by adding 0s at the end so that the new length equals max_length. Without padding, the final embedding of this training example has the shape (1, 4, 768): one example, four tokens, and 768 dimensions per token. We should keep in mind that all training examples are encoded in this way.
<img src='images/bertemb.png' width="600" height="600"> <br>
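
The steps above can be sketched with toy sizes. Everything here is an illustrative assumption: the hidden size is 8 instead of 768, max_length is 6 instead of 512, the embedding tables hold random numbers instead of learned weights, and the small `VOCAB` only reuses the IDs quoted above (plus 0 for padding).

```python
import random

HIDDEN = 8    # toy size; BERT-base uses 768
MAX_LEN = 6   # toy size; BERT-base accepts up to 512 tokens
VOCAB = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "this": 2023, "love": 2293}

rng = random.Random(0)


def random_table(keys):
    """Stand-in for a learned embedding table: one random vector per key."""
    return {k: [rng.uniform(-1, 1) for _ in range(HIDDEN)] for k in keys}


token_table = random_table(VOCAB.values())
segment_table = random_table([0, 1])
position_table = random_table(range(MAX_LEN))


def encode(tokens):
    """Add special tokens, map to IDs, pad to MAX_LEN (assumes the input fits)."""
    ids = [VOCAB["[CLS]"]] + [VOCAB[t] for t in tokens] + [VOCAB["[SEP]"]]
    ids += [VOCAB["[PAD]"]] * (MAX_LEN - len(ids))
    segments = [0] * MAX_LEN  # single sentence, so every token is in segment 0
    return ids, segments


def embed(ids, segments):
    """Sum token, segment, and position embeddings element-wise per token."""
    return [
        [t + s + p for t, s, p in zip(token_table[i], segment_table[seg], position_table[pos])]
        for pos, (i, seg) in enumerate(zip(ids, segments))
    ]


input_ids, segment_ids = encode(["love", "this"])
embedded = embed(input_ids, segment_ids)

print(input_ids)                         # [101, 2293, 2023, 102, 0, 0]
print(len(embedded), len(embedded[0]))   # 6 8
```

With the real model sizes, the same procedure yields one 768-dimensional vector per token position.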

### - How Tokenization in NLP works and different types of Tokenizers ?
* A tokenizer converts raw text data into numerical data; its objective is to find a meaningful numerical representation of the text data.
* Tokenizers include : Word-based tokenizer, character-based tokenizer, subword-based tokenizer

1. Word-Based Tokenizer
* Word-based Tokenizer splits the text into words based on white space
* Advantage : We map each word to a single number, so a single number holds a lot of information, since one word carries a lot of contextual and semantic information.
* Disadvantage : huge vocabulary size due to the large number of words in a language.
* Disadvantage : similar words (dog and dogs) get different numbers in the vocabulary and are therefore treated by our model as if they had unrelated meanings. This results in a loss of meaning across similar words.
* Disadvantage : We usually limit the size of the vocabulary, for example to the 10k most frequent words. But this assigns a special OOV index to all out-of-vocabulary tokens and thus maps different words to the same index, which results in information loss.
2. Character-based Tokenizer
* Splits text into characters, where we map each character to an index in the vocabulary. To get the meaning of a word we have to consider many characters.
* Advantage over word-based : the vocabulary of character entries is much smaller than the vocabulary of a word-based tokenizer because the number of characters in a language is much lower than the number of words.
* Advantage over word-based : fewer out-of-vocabulary words, because with characters we can represent any word in the language.
* Disadvantage : a character does not hold as much information as a word (this is not true for all languages).
* Disadvantage : the sequences (sentences) are translated into a large number of tokens to be processed by our model. We usually limit the size of our sequences (max_length), and this may lead to information loss.
3. Subword based tokenizer
* A subword-based tokenizer is used to find a middle ground between word-based and character-based tokenization. Frequently used words should not be split into subwords, and rare words should be decomposed into meaningful subwords.
* For example, dog should not be split, while dogs should be split into dog and s. Tokenization should be split into token and ization, which both carry meaning. ization is a suffix (end of word) and is labeled by BERT, for example, as ##ization. Different models use different markers for the start and end of a word.
* Most models obtaining state-of-the-art results in English today use some kind of subword-based tokenizer. Different algorithms for subword-based tokenization are : WordPiece (used by BERT, DistilBERT), Unigram (used by ALBERT, XLNet), byte-pair encoding (used by GPT-2, RoBERTa)
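
The three tokenizer types can be compared side by side in a small sketch. The function names and the tiny vocabularies are hypothetical; the subword function uses a greedy longest-match-first split in the style of WordPiece, with `##` marking pieces that continue a word, and is not the tokenizer of any specific library.

```python
def word_tokenize(text, vocab, oov_id=0):
    """Word-based: split on whitespace, map unknown words to a shared OOV id."""
    return [vocab.get(w, oov_id) for w in text.lower().split()]


def char_tokenize(text):
    """Character-based: every character becomes its own token."""
    return list(text)


def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Subword-based: greedy longest-match-first split, WordPiece style."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation of a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no known piece fits, give up on the whole word
        pieces.append(match)
        start = end
    return pieces


word_vocab = {"[OOV]": 0, "the": 1, "dog": 2, "barks": 3}
subword_vocab = {"dog", "token", "##s", "##ization"}

print(word_tokenize("the dog barks", word_vocab))         # [1, 2, 3]
print(word_tokenize("the dogs bark", word_vocab))         # [1, 0, 0]
print(char_tokenize("dog"))                               # ['d', 'o', 'g']
print(wordpiece_tokenize("dogs", subword_vocab))          # ['dog', '##s']
print(wordpiece_tokenize("tokenization", subword_vocab))  # ['token', '##ization']
```

The second call shows the information loss of the word-based approach: "dogs" and "bark" collapse into the same OOV index, while the subword tokenizer still recovers "dog" from "dogs".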