# Introduction to Representing Text With Contextual Embeddings

Before we dive into building our own question answering model, the first step is to understand how state-of-the-art (SOTA) PyTorch models represent text with contextual embeddings.

At this point you may be wondering what the term 'contextual embedding' even means. Don't worry: by the end of this module that will be abundantly clear. To better understand the concept, let's first take a step back and look at the problem of text representation and some of the different approaches that have culminated in the current state of the art.

Later, in the third module, we will look at sample PyTorch code that uses the HuggingFace transformers library and contextual embeddings to make our question answering model work.

# What is Text Representation?

If you are here, you probably learned at some point how to read, write and process language. Computers represent textual characters as numbers, using encodings such as ASCII or UTF-8, and map those numbers to glyphs in a font on your screen.

![Ascii Code](images/ASCII.png)
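As a small illustration of this mapping, the sketch below uses Python's built-in `ord` and `str.encode` to show the numbers behind a short string. The sample string is an arbitrary choice made for this example.

```python
# Each character is stored as a number (its Unicode code point).
text = "Hi!"
code_points = [ord(ch) for ch in text]
print(code_points)  # [72, 105, 33]

# UTF-8 turns code points into bytes; ASCII characters take one byte each.
utf8_bytes = text.encode("utf-8")
print(list(utf8_bytes))  # [72, 105, 33]

# Characters outside ASCII need more than one byte in UTF-8.
e_acute = "\u00e9"  # the letter e with an acute accent
print(len(e_acute.encode("utf-8")))  # 2
```

These numbers identify characters, but they say nothing about what a word means, which is why further representation is needed.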

You and I understand what the letters mapped to these codes **represent** and how the characters come together to form the words of this sentence. Computers by themselves have no such understanding. We therefore need a mechanism to represent **text features**, such as words and characters, in a numerical form in order to train language models.

# How do we Represent Text with Computers?

There are many different approaches to representing textual features in a format that can be modeled with machine learning. We've already mentioned that contextual embeddings are the state of the art, but before we explain how contextual embeddings work, let's take a brief tour of the more traditional and early neural approaches to representing text.




## Bag of Words Text Representation

Bag of Words (BoW) vectors are the most commonly used traditional text representation. Each word or n-gram is assigned a vector index, and the value at that index records whether (0 or 1), or how many times, it occurs in a given document.

![bow image here](images/bow.png) 
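To make the indexing idea concrete, here is a minimal pure-Python sketch of a binary bag of words. The tokenizer (lowercased alphabetic runs) and the two toy documents are simplifying assumptions made for this example, not part of any library.

```python
import re

def tokenize(text):
    # Assumption: lowercase the text and keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

toy_docs = ["I like hot dogs.", "The dog ran fast."]

# Build the vocabulary: one vector index per distinct word.
vocab = sorted({tok for doc in toy_docs for tok in tokenize(doc)})
index = {word: i for i, word in enumerate(vocab)}

def binary_bow(text):
    vec = [0] * len(vocab)
    for tok in tokenize(text):
        if tok in index:  # words outside the vocabulary are ignored
            vec[index[tok]] = 1
    return vec

print(vocab)
print(binary_bow("The hot dog"))
```

Note that word order is lost: any sentence containing the same known words produces the same vector.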

Below is an example of how to generate a bag of words representation using the scikit-learn Python library:

In [None]:
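from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: each string is one document.
corpus = [
    'I like hot dogs.',
    'The dog ran fast.',
    'Its hot outside.',
]

# Learn the vocabulary from the corpus, then encode a new sentence.
# CountVectorizer stores word counts; pass binary=True for 0/1 indicators.
vectorizer = CountVectorizer()
vectorizer.fit_transform(corpus)
vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()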



## Bigram, Trigram and N-Gram BoWs

One limitation of the bag of words approach is that some words are part of multi-word expressions. For example, the expression 'hot dog' has a completely different meaning than the words 'hot' and 'dog' on their own. If we count them as the same, it can confuse our model.

To address this, n-gram representations are often used in document classification, where the frequency of each word, bigram (word pair) or trigram (word triple) is a useful feature for training classifiers.
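The idea of an n-gram can be sketched in a few lines of plain Python. The helper name `ngrams` and the whitespace tokenization are assumptions made for this illustration.

```python
def ngrams(tokens, n):
    # Slide a window of n tokens across the sequence.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample_tokens = "i like hot dogs".split()
unigrams = ngrams(sample_tokens, 1)
bigrams = ngrams(sample_tokens, 2)

print(unigrams)  # ['i', 'like', 'hot', 'dogs']
print(bigrams)   # ['i like', 'like hot', 'hot dogs']
```

With bigrams in the vocabulary, 'hot dogs' gets its own feature, separate from 'hot' and 'dogs'.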

Below is an example of how to generate a bigram bag of words representation using the scikit-learn Python library:


In [None]:
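# ngram_range=(1, 2) keeps single words and adds every pair of adjacent words.
# This token_pattern also keeps one-letter tokens such as 'a' and 'I'.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
bigram_vectorizer.fit_transform(corpus)
bigram_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()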


In BoW, word occurrences are weighted evenly, regardless of how frequently or in what context they occur. However, in most NLP tasks some words are more relevant than others.



## Term Frequency Inverse Document Frequency (TF-IDF)

TF-IDF, short for term frequency-inverse document frequency, is a variation of bag of words in which, instead of a binary 0 or 1 indicating the appearance of an n-gram in a document, the TF-IDF value is used. The TF-IDF value is a numerical statistic that reflects how prominent a word or n-gram is in a document within a collection. It increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently than others.

![tfidf image here](images/tfidf.png)
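The weighting can be worked through by hand. The sketch below assumes the textbook form, raw term count multiplied by `log(N / df)`, and uses three invented toy documents. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each vector by default, so its numbers will differ from these.

```python
import math

# Toy, pre-tokenized documents invented for this example.
tfidf_docs = [
    ["hot", "dogs", "are", "hot"],
    ["the", "dog", "ran"],
    ["hot", "day"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term)                    # times the term occurs in this document
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    if df == 0:
        return 0.0
    idf = math.log(len(docs) / df)          # rarer terms get a larger weight
    return tf * idf

hot_score = tf_idf("hot", tfidf_docs[0], tfidf_docs)
dogs_score = tf_idf("dogs", tfidf_docs[0], tfidf_docs)
print(round(hot_score, 3), round(dogs_score, 3))
```

Here 'hot' occurs twice in the first document but also appears in another document, so under this formula it ends up with a lower score than 'dogs', which occurs once but only in this document.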

However, even though TF-IDF representations give different words a frequency-based weight, they are unable to represent meaning or word order. As the famous linguist J. R. Firth said in 1935, “The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously.”


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Same unigram and bigram features as before, weighted by TF-IDF instead of raw counts.
vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectorizer.fit_transform(corpus)
vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()


## Traditional Distributional Embeddings with Mutual Information

Distributional embeddings enable word vectors to encapsulate context. Each embedding vector is built from the mutual information a word shares with other words in a given corpus. Mutual information can be computed from global co-occurrence frequencies or restricted to a given window, defined either sequentially or by dependency edges.

![distributional matrix image here](images/dist_matrix.png)
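As a rough sketch of this idea, the code below counts co-occurrences within a sequential window and scores a word pair with pointwise mutual information (PMI). PMI, the window size of 1 and the toy sentences are all assumptions chosen for this illustration; the text above does not fix a particular formula.

```python
import math
from collections import Counter

toy_sentences = [
    ["hot", "dog", "stand"],
    ["hot", "summer", "day"],
    ["dog", "ran", "fast"],
]
window = 1  # assumed symmetric window: one word to the left and right

pair_counts = Counter()  # (word, context word) -> count
word_counts = Counter()  # word -> number of pairs it takes part in
for sent in toy_sentences:
    for i, word in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pair_counts[(word, sent[j])] += 1
                word_counts[word] += 1

total_pairs = sum(pair_counts.values())

def pmi(word, context):
    joint = pair_counts[(word, context)] / total_pairs
    if joint == 0:
        return float("-inf")  # the pair never co-occurs
    p_word = word_counts[word] / total_pairs
    p_context = word_counts[context] / total_pairs
    return math.log2(joint / (p_word * p_context))

print(pmi("hot", "dog"))
print(pmi("hot", "ran"))
```

A row of such scores, one per context word, is one way to form the embedding vector for a word.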

While these distributional methods were comparable to later neural embeddings, they were difficult to implement, computationally expensive and not widely used in industry.



## Pretrained Embeddings: Word2Vec and Variants

As opposed to traditional distributional models, neural embeddings such as Word2Vec are learned by training a neural language model to minimize a loss function on tasks that map to language understanding. This process of training models on large collections of text to extract word representations is called pre-training.

One of the first successful neural pre-training techniques for text representation was Word2Vec.

There are two main architectures that are used to produce a distributed representation of words:

 - Continuous bag-of-words (CBOW) - In the continuous bag-of-words architecture, the model uses the surrounding window of context words to predict the current word.
 - Continuous skip-gram - In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words.

CBOW is faster while skip-gram is slower but does a better job of representing infrequent words.
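The difference between the two architectures comes down to which side of the (word, context) relationship is the input and which is the prediction target. The sketch below only builds the training examples each one would see; the helper names and the window size of 1 are assumptions for this illustration, and no model is trained.

```python
def context_window(tokens, i, window):
    # Words up to `window` positions to the left and right of position i.
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

def cbow_examples(tokens, window=1):
    # CBOW: input is the list of context words, target is the current word.
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window=1):
    # Skip-gram: input is the current word, target is one context word per pair.
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]

sentence_tokens = ["the", "dog", "ran", "fast"]
for example in cbow_examples(sentence_tokens):
    print("CBOW     ", example)
for example in skipgram_examples(sentence_tokens):
    print("skip-gram", example)
```

Skip-gram produces one training pair per context word, so it generates more examples from the same text than CBOW does.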

![word2vec image](images/word2vec.png)

Both CBOW and skip-gram are “predictive” embeddings, in that they only take local contexts into account. Word2Vec does not take advantage of global context. FastText builds on Word2Vec by learning vector representations for each word and for the character n-grams found within each word. The values of the representations are then averaged into one vector at each training step. While this adds a lot of additional computation to pre-training, it enables word embeddings to encode sub-word information.
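To see what sub-word information looks like, the sketch below extracts character n-grams from a word, wrapping it in `<` and `>` boundary markers as FastText does. The helper name and the n-gram lengths chosen here are assumptions for this example.

```python
def char_ngrams(word, n_min=3, n_max=4):
    # Boundary markers let prefixes and suffixes be told apart from inner n-grams.
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]

play_grams = char_ngrams("play", 3, 3)
playing_grams = char_ngrams("playing", 3, 3)
shared = sorted(set(play_grams) & set(playing_grams))

print(play_grams)  # ['<pl', 'pla', 'lay', 'ay>']
print(shared)      # ['<pl', 'lay', 'pla']
```

Because 'play' and 'playing' share several n-grams, their vectors share components, which also makes it possible to build a vector for a word that never appeared in training.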

Another method, GloVe, by contrast leverages the same intuition behind the co-occurrence matrix used by the traditional distributional embeddings above, but learns dense word vectors by fitting a weighted least-squares objective to the co-occurrence counts rather than using the counts directly.

Below we use the Gensim API to explore pretrained Word2Vec, FastText and GloVe embeddings and find the words most similar to 'play':


In [None]:
import gensim.downloader as api

# The pretrained vectors are downloaded on first use; this is a large file.
w2v = api.load('word2vec-google-news-300')
print(w2v.most_similar('play'))

In [None]:
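# gensim.models.wrappers was removed in Gensim 4; load_facebook_vectors reads fastText .bin files.
from gensim.models.fasttext import load_facebook_vectors

# Assumption: the pretrained model has already been downloaded to 'wiki.simple.bin'.
fast_text = load_facebook_vectors('wiki.simple.bin')
print(fast_text.most_similar('play'))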

In [None]:
glove = api.load("glove-twitter-25")
print(glove.most_similar('play'))


One key limitation of traditional pretrained embedding representations such as Word2Vec is the problem of word sense disambiguation. While pretrained embeddings can capture some of the meaning of words in context, every possible meaning of a word is encoded into the same embedding. This can cause problems in downstream models, since many words, such as 'play', have different meanings depending on the context in which they are used.

For example, the word 'play' in the sentence
- I went to a [play] at the theater.

does not mean the same thing as the word 'play' in the sentence
- John wants to [play] with his friends.

The pretrained embeddings above represent both of these meanings of the word 'play' in the same embedding. Contextual embedding methods were developed to address this challenge of disambiguation and contributed to the massive leap forward in natural language processing applications.



## Contextual Embeddings

To address the challenge of word sense disambiguation, a new approach emerged: pretraining models on large amounts of data and using the pretrained models to generate contextual embeddings. It was spearheaded by models such as ULMFiT and ELMo, and later BERT.

![elmo](images/elmo.png)

Below we use spaCy's transformers API to experiment with contextual embeddings.


In [None]:
# Note: this model name comes from early spacy-transformers releases built for spaCy 2.x;
# newer releases may not provide it.
!pip install spacy-transformers
!python -m spacy download "en_trf_bertbaseuncased_lg"

import spacy

nlp = spacy.load("en_trf_bertbaseuncased_lg")
doc1 = nlp("I went to a play.")
doc2 = nlp("John wants to play a game.")
doc3 = nlp("John went to a show.")

# Token indices: doc1[4] is 'play' (noun), doc2[3] is 'play' (verb), doc3[4] is 'show'.
print("Similarity between the two words 'play' in doc1 and doc2:", doc1[4].similarity(doc2[3]))
print("Similarity between doc1 'play' and doc3 'show':", doc1[4].similarity(doc3[4]))
print("Similarity between doc2 'play' and doc3 'show':", doc2[3].similarity(doc3[4]))

ULMFiT and ELMo generate an embedding for a word based on the context in which it appears, producing a slightly different embedding for each occurrence. This allows a downstream model to better disambiguate between the senses of a word such as 'play'. On their release, these models enabled near-instant state-of-the-art results in many downstream tasks, including tasks such as co-reference resolution that were previously not as viable for practical use in NLP.
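The contrast between a static lookup and a context-dependent embedding can be shown with a deliberately simple toy. The 2-d vectors below are invented for illustration, and the "contextual" function just blends a word's vector with the mean of its neighbours. It is a stand-in for the idea only, not how ULMFiT, ELMo or BERT compute their embeddings; those models use learned recurrent or attention layers.

```python
# Hypothetical 2-d static vectors, invented purely for illustration.
STATIC = {"play": [0.5, 0.5], "theater": [1.0, 0.0], "friends": [0.0, 1.0]}

def static_embed(tokens, i):
    # A lookup table: the surrounding words are never consulted.
    return STATIC.get(tokens[i], [0.0, 0.0])

def toy_contextual_embed(tokens, i, mix=0.5):
    # Blend the word's own vector with the mean vector of the other words.
    others = [static_embed(tokens, j) for j in range(len(tokens)) if j != i]
    mean = [sum(dim) / len(others) for dim in zip(*others)]
    own = static_embed(tokens, i)
    return [(1 - mix) * o + mix * m for o, m in zip(own, mean)]

s1 = "i went to a play at the theater".split()
s2 = "john wants to play with his friends".split()
i1, i2 = s1.index("play"), s2.index("play")

print(static_embed(s1, i1) == static_embed(s2, i2))  # True
print(toy_contextual_embed(s1, i1))
print(toy_contextual_embed(s2, i2))
```

The static lookup returns the same vector for 'play' in both sentences, while the toy contextual version shifts it toward 'theater' in one sentence and toward 'friends' in the other.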

This has been called the ImageNet moment of NLP. More recent transformer-based models such as BERT build on this development, using attention-based Transformers instead of bidirectional RNNs to encode context. If you are unfamiliar with terms such as Transformers and RNNs, do not worry: in the next module we will walk through the progression of NLP models, culminating in the current state-of-the-art models in NLP with PyTorch.
