# Overview
An NLP pipeline works in 3 main parts: tokenization, embedding, and finally architectures which perform inference (namely RNNs and Transformers).

# Tokenization

__Tokenization__ is the process of turning text into tokens. In the simplest scheme, each word is one token, represented as a one-hot array in which the 1 sits at the index where that word (or other piece of text) appears in the vocabulary. This is a very oversimplified way of doing tokenization: modern tokenizers account for punctuation, grammar, and compound word structure (such as words ending in "-ing" or "-ify").
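As a minimal sketch of the one-hot idea, assuming a tiny made-up vocabulary (the `vocab` list and the `one_hot` helper are illustrative names, not part of any library):

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list of zeros with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

sentence = "the cat sat".split()
encoded = [one_hot(word) for word in sentence]
print(encoded[1])  # "cat" sits at index 1 -> [0, 1, 0, 0, 0]
```

Note that a word missing from `vocab` raises a `KeyError` here, which is exactly the out-of-vocabulary problem that more advanced tokenizers try to avoid.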

One of these main tokenization methods is called __subword tokenization__, where words can be broken down into subwords. For example, `"cats"` could be turned into two tokens `[cat, ##s]`, where the `##` marks a piece that continues the previous token rather than starting a new word. This avoids words that can't be represented because they aren't in your dictionary, since you are able to mix and match word stems, prefixes, and suffixes.
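One common way to split a word into subwords is greedy longest-match-first lookup, which is roughly how WordPiece-style tokenizers work at inference time. The sketch below assumes a tiny hand-written vocabulary; `subword_vocab` and `subword_tokenize` are hypothetical names for illustration, and real tokenizers learn their vocabulary from data.

```python
# Hypothetical toy subword vocabulary; "##" marks a piece that continues a word.
subword_vocab = {"cat", "walk", "token", "##s", "##ing", "##ize", "[UNK]"}

def subword_tokenize(word):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # not at the start of the word
            if candidate in subword_vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

print(subword_tokenize("cats"))     # ['cat', '##s']
print(subword_tokenize("walking"))  # ['walk', '##ing']
```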

# Embedding

__Embedding__ is the process of turning the tokens produced by tokenization into vectors that capture the meaning of the pieces of text they represent, which is what lets a machine actually work with words. Each dimension of the embedding vector encodes some learned quantitative aspect of the token, so the full vector gives a machine learning architecture the information it needs about each token to analyze some text. Classic pretrained word embeddings typically have up to 300 dimensions, with some having as few as 50.

These output vectors are typically generated by looking up rows (or columns, depending on the layout) in a huge embedding matrix that holds the embedding vector of each word in the vocabulary. A problem with having one big embedding matrix is that different meanings of the same word are represented by the same vector, without taking their contextual meaning into account. Pretrained embedding matrices, such as Word2Vec and GloVe, can be downloaded from the internet. However, all of the big milestone models (such as GPT-3 or BERT) have their own tokenizer and embedding layer built in, so we don't have to manually download Word2Vec.
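A minimal sketch of the lookup, assuming one row per vocabulary word and random numbers standing in for trained embeddings (`embedding_matrix` and `embed` are illustrative names):

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

embedding_dim = 4  # tiny for readability; real embeddings are much wider
rng = np.random.default_rng(0)
# Random values stand in for trained embeddings (an assumption of this sketch).
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

def embed(tokens):
    """Look up the embedding row for each token."""
    indices = [word_to_index[token] for token in tokens]
    return embedding_matrix[indices]

vectors = embed(["the", "cat", "sat"])
print(vectors.shape)  # (3, 4)

# Looking up a row is the same as multiplying a one-hot vector by the matrix.
one_hot_cat = np.eye(len(vocab))[word_to_index["cat"]]
print(np.allclose(one_hot_cat @ embedding_matrix, vectors[1]))  # True
```

Because the lookup depends only on the token's index, a word gets the same vector in every sentence, which is the context problem described above.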

# RNNs + Sequential Models

Before Transformer models were introduced in 2017, __RNNs__, or Recurrent Neural Networks, were the big thing in NLP. RNNs take data that represents a sequence of items, such as a sequence of tokens in a piece of text, and keep a state that is updated as each item in that sequence is iterated through. The basic pseudocode for an RNN is as follows:
```
state = initial_state
for word in words:
    # the RNN cell f combines the current input with the previous state
    state = f(word, state)
```
Bidirectional RNNs traverse the sequence along 2 paths, one reading it forwards and one reading it backwards. The two hidden states are computed independently and then combined at each position.
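The loop above can be made concrete with a vanilla (tanh) RNN cell. This is only a sketch: the weights are random rather than trained, the names `rnn_cell` and `run_rnn` are illustrative, and the bidirectional part reuses the same weights for both directions to stay short, whereas a real bidirectional RNN would normally learn separate weights for each direction.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3

# Randomly initialised weights stand in for trained parameters (an assumption).
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def rnn_cell(x, state):
    """One step of a vanilla RNN: new state from the current input and old state."""
    return np.tanh(W_x @ x + W_h @ state + b)

def run_rnn(inputs):
    """Return the state after every step of the sequence."""
    state = np.zeros(hidden_dim)
    states = []
    for x in inputs:
        state = rnn_cell(x, state)
        states.append(state)
    return np.stack(states)

inputs = rng.normal(size=(5, input_dim))  # 5 embedding vectors

forward = run_rnn(inputs)
# Backward pass: read the sequence in reverse, then flip back to align positions.
backward = run_rnn(inputs[::-1])[::-1]
bidirectional = np.concatenate([forward, backward], axis=1)
print(bidirectional.shape)  # (5, 6)
```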

The function that actually generates a new state is called the "RNN cell". There are different types of these cells, the most prominent of which is the LSTM, or long short-term memory. LSTM cells are more complicated, using gates and several more complex mathematical operations to try to learn long-term dependencies in text. Essentially, LSTM networks update their memory each step of the way, depending on:
* a) information that it wants to add to memory from the current input, determined by the _input gate_;
* b) information that it wants to remove from memory because it is no longer relevant, determined by the _forget gate_;
* c) which parts of the updated memory it wants to expose as the output (hidden state) for the current step, determined by the _output gate_.
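The three gates can be written out as a single LSTM step. This sketch follows the standard LSTM formulation, in which the output gate controls how much of the memory is exposed as the hidden state; the weights are random placeholders and names such as `lstm_step` and `params` are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: returns the new hidden state h and memory (cell state) c."""
    concat = np.concatenate([h_prev, x])
    f = sigmoid(params["W_f"] @ concat + params["b_f"])  # forget gate
    i = sigmoid(params["W_i"] @ concat + params["b_i"])  # input gate
    o = sigmoid(params["W_o"] @ concat + params["b_o"])  # output gate
    c_candidate = np.tanh(params["W_c"] @ concat + params["b_c"])
    c = f * c_prev + i * c_candidate  # drop old info, then add new info
    h = o * np.tanh(c)                # expose part of the memory as output
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
params = {}
for name in ("f", "i", "o", "c"):
    # Random weights stand in for trained parameters (an assumption).
    params[f"W_{name}"] = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim + input_dim))
    params[f"b_{name}"] = np.zeros(hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)  # (3,) (3,)
```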

GRU cells are similar to LSTM cells but are simpler and more computationally efficient, which makes them a good fit for scenarios where moment-to-moment inference is needed. They only have 2 gates:
* a) _Update gate_, which determines how much of the memory should be overwritten with new information from the current input and how much of the old memory should be kept.
* b) _Reset gate_, which determines how much of the old memory is used when computing that new information, and how much is reset (ignored).
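A single GRU step can be sketched in the same way. Sign conventions for the update gate differ between papers and libraries; this sketch assumes the convention where a gate value near 1 means "take the new candidate". The weights are random placeholders and `gru_step` is an illustrative name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, params):
    """One GRU step: returns the new hidden state."""
    concat = np.concatenate([h_prev, x])
    z = sigmoid(params["W_z"] @ concat + params["b_z"])  # update gate
    r = sigmoid(params["W_r"] @ concat + params["b_r"])  # reset gate
    # The reset gate scales how much old memory feeds the candidate state.
    gated = np.concatenate([r * h_prev, x])
    h_candidate = np.tanh(params["W_h"] @ gated + params["b_h"])
    # The update gate blends the old memory with the candidate.
    return (1.0 - z) * h_prev + z * h_candidate

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
params = {}
for name in ("z", "r", "h"):
    # Random weights stand in for trained parameters (an assumption).
    params[f"W_{name}"] = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim + input_dim))
    params[f"b_{name}"] = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):
    h = gru_step(x, h, params)
print(h.shape)  # (3,)
```

Compared with the LSTM, there is no separate memory vector: the hidden state doubles as the memory, which is where the savings in computation come from.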

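The problem with RNNs that Transformers solve is that RNNs struggle with long-term memory: even with LSTM cells they are limited, because information from early tokens has to survive every intermediate state update. RNNs also have to process tokens one at a time, while Transformers process all tokens in parallel.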

# Transformers

Transformer models are the new workhorses of modern NLP, introduced in 2017 with the paper "Attention Is All You Need." Transformer models implement one vital element: __attention mechanisms__. Their job is to learn long-range (global) features and to decide what components of the input sequence of embedding vectors contribute to the output vector.

Attention works in this way: 
* Each token has 3 vectors associated with it: query, key, and value. The query represents what the current word is "looking for" (basically, which other words it depends on to form meaning); the key represents what the current word has to offer to other words' queries; the value is the information that the word actually passes along when it is attended to.
* The transformer takes the query vector for each word and computes its dot product with the key vector of every word in the sequence (including itself). Each of these dot products is an _alignment score_ measuring how much the query and the key match, which corresponds to how much "attention" we want to give that word. In the original paper the scores are also divided by $\sqrt{d_k}$, the square root of the key dimension, to keep them in a stable range.
* These alignment scores are squashed by a softmax function so that each lies between 0 and 1 and they sum to 1.
* We then take a weighted sum over all the value vectors, using these softmax outputs as the weights, and use the result as the new representation of the current word.
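The steps above can be sketched as single-head self-attention in NumPy. The projection matrices `W_q`, `W_k`, and `W_v` are random placeholders for learned weights, the function names are illustrative, and the division by the square root of the key dimension follows the "Attention Is All You Need" paper.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the max keeps the exponentials stable."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X holds one embedding per row; returns new representations and weights."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # query, key, value per token
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # alignment scores, scaled
    weights = softmax(scores)                 # each row sums to 1
    Z = weights @ V                           # weighted sum of value vectors
    return Z, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 5
X = rng.normal(size=(seq_len, d_model))
# Random projections stand in for learned weights (an assumption).
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Z, weights = self_attention(X, W_q, W_k, W_v)
print(Z.shape)        # (4, 5)
print(weights.shape)  # (4, 4): one weight per (query word, key word) pair
```

Row `i` of `weights` says how much word `i` attends to every word in the sequence, and row `i` of `Z` is the resulting representation of word `i`.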

What this essentially does is capture how the relevant parts of the sentence relate to each word, by computing a new representation Z of the text. These outputs, updated to reflect what each word means in context, are then passed forward into a traditional feed-forward neural network. However, since attention is mathematically similar in its operations to a feed-forward layer, some approaches forgo the feed-forward network and use only attention layers.

__Multi-head self-attention__ applies the attention mechanism multiple times to the same sequence in a single pass, each time with different query, key, and value matrices. This produces several different representations Z (the multiple heads), which are concatenated and multiplied by a learned matrix to combine them into one. This is like having multiple different people read some text or watch a video, each paying attention to different parts, and then combining all of their analyses into one.
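A minimal sketch of combining heads, assuming the usual scheme of concatenating the per-head outputs and projecting them with one more matrix (called `W_o` here). All weights are random placeholders and the function names are illustrative.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for a single head."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, heads, W_o):
    """Run every head on the same input, concatenate, then project."""
    outputs = [attention_head(X, *head) for head in heads]
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads

X = rng.normal(size=(seq_len, d_model))
# Each head gets its own (W_q, W_k, W_v); random values stand in for learned ones.
heads = [
    tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
    for _ in range(num_heads)
]
W_o = rng.normal(size=(num_heads * d_head, d_model))

out = multi_head_attention(X, heads, W_o)
print(out.shape)  # (4, 8)
```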

__Adaptive attention span__ makes each token attend only over the last few tokens before it, reducing the number of large calculations that have to be performed. The number of previous tokens that the attention layer looks at is called the _attention span_. With the adaptive attention span technique, this span is not fixed by hand: it is learned dynamically as the model trains.
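One way to make the span learnable is a soft mask that falls from 1 to 0 as the distance to a previous token grows past the span, as proposed in the adaptive attention span paper. The sketch below assumes that soft-mask form; the `span` value would be a trained parameter in practice, and `ramp` (how gradually the mask falls off) and both function names are illustrative.

```python
def span_mask(distance, span, ramp=2.0):
    """Soft mask: 1 inside the span, falling linearly to 0 over `ramp` tokens."""
    return min(max((ramp + span - distance) / ramp, 0.0), 1.0)

def masked_attention_weights(exp_scores, span, ramp=2.0):
    """exp_scores[d] is exp(score) for the token d positions back."""
    masked = [span_mask(d, span, ramp) * s for d, s in enumerate(exp_scores)]
    total = sum(masked)
    return [m / total for m in masked]

masks = [span_mask(d, span=3.0) for d in range(8)]
print(masks)  # [1.0, 1.0, 1.0, 1.0, 0.5, 0.0, 0.0, 0.0]

# With equal scores, tokens beyond the span receive zero attention.
weights = masked_attention_weights([1.0] * 8, span=3.0)
print(weights[5:])  # [0.0, 0.0, 0.0]
```

Because the mask is a smooth function of `span`, the span can be adjusted by gradient descent along with the rest of the model.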

All of these things make Transformers work very well, even better than RNNs for NLP, since they can attend over an entire piece of text at once. However, attention requires memory that grows as $n^2$ in the sequence length $n$, and Transformers can be more expensive to train than RNNs, so in some cases where high precision isn't needed, RNNs are still seen.
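To get a feel for the $n^2$ growth, the short calculation below counts the bytes needed to store one attention weight matrix, assuming 4-byte (32-bit float) values and one matrix per head per layer. The helper name is illustrative.

```python
def attention_matrix_bytes(seq_len, bytes_per_value=4):
    """Bytes for one seq_len x seq_len attention weight matrix."""
    return seq_len * seq_len * bytes_per_value

for n in (512, 1024, 4096):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n={n}: {mib:.0f} MiB")
# n=512: 1 MiB
# n=1024: 4 MiB
# n=4096: 64 MiB
```

Doubling the sequence length quadruples the memory, and the total is multiplied again by the number of heads and layers.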

# ELMo
ELMo is an embedding approach that uses contextualized word representations, generating different embeddings for the same word depending on the context it appears in. These word representations are character-based, similar to fastText embeddings, so ELMo models can handle out-of-vocabulary (OOV) tokens that weren't seen in training. Contextualized word representations improved results on a wide range of NLP tasks.

# BERT
A combination of ideas from ULMFiT (pretraining a language model and then fine-tuning it for any other NLP task) and ELMo (contextualized word embeddings) led to the creation of BERT, a large pretrained language model which shattered records for NLP tasks. It was a culmination of many NLP advances, and was very important for NLP researchers since it was trained on huge amounts of data and could be fine-tuned by anyone in the world for their specific tasks. It also used masked bidirectional pretraining, where some words in the sentence are masked and the model has to predict those words from the context on both sides.
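The masking step of this pretraining objective can be sketched as below. This is a simplification: BERT selects about 15% of tokens, but it does not always replace a selected token with `[MASK]` (some are swapped for a random token or left unchanged), and that detail is omitted here. The `mask_tokens` helper is an illustrative name.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK]; return inputs and targets."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(token)  # the model is trained to predict this token
        else:
            masked.append(token)
            labels.append(None)   # no prediction needed at this position
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

During training, the loss is computed only at positions where `labels` is not `None`, so the model learns to fill in the blanks using the words on both sides.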

Models like XLNet, RoBERTa, and more were variations on this BERT idea. New models entered the fray from OpenAI, including GPT-1, GPT-2, GPT-3, and ChatGPT.