# Word2Vec and Word Embedding 📃→🔢

As we saw in the introduction to NLP, machine learning models (deep learning models included) are mathematical functions that transform numeric inputs into numeric outputs. Data scientists therefore need ways to turn the data they work with into numbers before they can train their models. Earlier in the course, we learned to work with corpora of texts and to transform texts into rows of a DataFrame where each word in the vocabulary may be represented by its frequency in the text, or by its Tf-idf, for example.

## What will you learn in this course? 🧐🧐

This lecture focuses on alternative ways of turning text data into numeric inputs for machine learning models. Here is the outline:

* Representing text as numbers
  * Indexing
  * Tf-idf
  * One-hot-encoding
  * Word Embedding
* Going further for big datasets
  * Word2Vec

## Representing text as numbers 📃→🔢

There are many ways to convert text data into numbers. In this section we present different techniques and discuss their advantages and shortcomings.

Texts are essentially strings of characters. The first step of any text preprocessing is to identify meaningful units in the text. The smallest units we usually consider are the words of the text, also called uni-grams, but we can also choose to consider larger units such as bi-grams (pairs of consecutive words) or longer sequences. The meaningful units we choose to study will be called **tokens** in the rest of the lecture.
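
To make the notion of tokens concrete, here is a minimal sketch of how uni-grams and bi-grams could be extracted from a text. The helper names `tokenize` and `ngrams` are hypothetical, and splitting on whitespace is a simplifying assumption: real tokenizers also handle punctuation, contractions and so on.

```python
def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "The cat sat on the mat"
unigrams = tokenize(text)
bigrams = ngrams(unigrams, 2)

print(unigrams)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(bigrams)   # [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```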

### Indexing 📃→7️⃣

Once we have decomposed each text into tokens, we need to convert the tokens into numbers. One way to do this is by simply indexing each unique token with a unique identifying number. This can be explained with the figure below:

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/M08-DeepLearning/NLP/indexing.png)
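
As a rough illustration of indexing, the sketch below builds a vocabulary from two toy texts and replaces each token by its integer index. The variable names are hypothetical, and starting the indices at 1 (leaving 0 free, for example for padding) is a common convention assumed here, not a requirement.

```python
texts = ["the cat sat", "the dog sat down"]

# Build the vocabulary: each new token receives the next free integer.
# Index 0 is deliberately left unused (assumption: it is reserved for padding).
token_to_index = {}
for text in texts:
    for token in text.split():
        if token not in token_to_index:
            token_to_index[token] = len(token_to_index) + 1

# Replace every token by its index, keeping the order of the tokens.
encoded_texts = [[token_to_index[token] for token in text.split()] for text in texts]

print(token_to_index)  # {'the': 1, 'cat': 2, 'sat': 3, 'dog': 4, 'down': 5}
print(encoded_texts)   # [[1, 2, 3], [1, 4, 3, 5]]
```

Note that the indices only reflect the order in which tokens were first seen: `dog` (4) is not "closer" to `sat` (3) than to `cat` (2) in any meaningful sense.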

#### Pros ➕
* Memory friendly
* Computation friendly

#### Cons ➖
* The original feature space has as many dimensions as there are unique tokens in the texts, while the representation space is 1-dimensional (extreme dimensionality reduction = loss of information)
* The integer-encoding is arbitrary (it does not capture any relationship between words).
* An integer-encoding can be challenging for a model to interpret. A linear classifier, for example, learns a single weight for each feature. Because there is no relationship between the similarity of any two words and the similarity of their encodings, this feature-weight combination is not meaningful.

### Tf-idf 📃→⚖️

Tf-idf consists in replacing the tokens in a text by their Tf-idf score (term frequency - inverse document frequency), which is the product of the term frequency $\frac{\#occurrences\;of\;token\;in\;text}{\#tokens\;total\;in\;text}$ and the inverse document frequency $\log\left(\frac{\#texts}{\#texts\;where\;token\;appears}\right)$.
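
The definition above can be sketched in a few lines of plain Python. This is only an illustration on a made-up three-text corpus, with hypothetical names (`tokenized`, `document_frequency`, `tf_idf`). Library implementations often use smoothed or normalized variants of the formula, so their values may differ from the ones computed here.

```python
import math
from collections import Counter

texts = [
    "the cat sat on the mat",
    "the dog sat",
    "the bird flew away",
]
tokenized = [text.split() for text in texts]

# Document frequency: number of texts in which each token appears at least once.
document_frequency = Counter(token for tokens in tokenized for token in set(tokens))


def tf_idf(token, tokens):
    """Tf-idf of `token` within one tokenized text, following the formula above."""
    tf = tokens.count(token) / len(tokens)
    idf = math.log(len(tokenized) / document_frequency[token])
    return tf * idf


# One row per text, one column per token of the vocabulary.
vocabulary = sorted(document_frequency)
tfidf_rows = [[tf_idf(token, tokens) for token in vocabulary] for tokens in tokenized]

print(tf_idf("the", tokenized[0]))  # 0.0, because "the" appears in every text
print(tf_idf("cat", tokenized[0]))  # positive, because "cat" appears in one text only
```

A token that appears in every text gets a score of zero, which is how Tf-idf downweights very common words.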

#### Pros ➕
* Represents tokens with numbers that reflect their relative importance in each text.
* Each text becomes a row where each column represents the importance of a single token in the vocabulary, which makes the data compatible with classic ML models.

#### Cons ➖
* Memory and computation intensive: each text is represented by a vector the size of the vocabulary, which can be very large.
* Removes the sequential nature of text data (you lose the order in which the tokens appear)

### One-hot-encoding 📃→0️⃣0️⃣0️⃣1️⃣0️⃣0️⃣0️⃣0️⃣0️⃣0️⃣
One-hot encoding represents each token with a binary vector that has as many elements as there are tokens in the vocabulary. Each position corresponds to one token of the vocabulary: the value is one at the index of the token being encoded and zero everywhere else.

One-hot encoding may be represented like this:

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/M08-DeepLearning/NLP/one-hot-encoding.png)
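
The following sketch shows one possible way to one-hot encode a short sequence of tokens in plain Python. The names `one_hot` and `token_to_index` are hypothetical, and sorting the vocabulary alphabetically is an arbitrary choice made only to get a reproducible ordering.

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Alphabetical vocabulary: ['cat', 'mat', 'on', 'sat', 'the']
vocabulary = sorted(set(tokens))
token_to_index = {token: index for index, token in enumerate(vocabulary)}


def one_hot(token):
    """Return a vector of zeros with a single one at the token's index."""
    vector = [0] * len(vocabulary)
    vector[token_to_index[token]] = 1
    return vector


# The text becomes a sequence of vectors, so the order of the tokens is preserved.
encoded = [one_hot(token) for token in tokens]

print(one_hot("cat"))  # [1, 0, 0, 0, 0]
print(one_hot("the"))  # [0, 0, 0, 0, 1]
```

With a realistic vocabulary of tens of thousands of tokens, each of these vectors would contain a single one and tens of thousands of zeros.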

#### Pros ➕
* Keeps the sequential nature of data
* No arbitrary hierarchy introduced between words

#### Cons ➖
* Highly inefficient in terms of both memory and computation, because the representation is sparse: each vector is as long as the vocabulary and contains a single non-zero value

### Word Embedding 📃→🔢

Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. An embedding is a dense vector of floating point values (the length of the vector is a parameter you specify). The values of the embedding are not specified manually: they are trainable parameters (weights learned by the model during training, in the same way a model learns weights for a dense layer). It is common to see word embeddings ranging from 8 dimensions (for small datasets) up to 1024 dimensions when working with large datasets. A higher dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.

A word embedding can be represented in the following way:

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/M08-DeepLearning/NLP/embedding.png)
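
Conceptually, an embedding is a lookup table: one row of floating point values per token. The sketch below illustrates this with a randomly initialized matrix in plain Python. All names (`embedding_matrix`, `embed`, `cosine_similarity`) are hypothetical, and no training happens here; in a real model the rows would be adjusted during training, which is what makes similar words end up with similar vectors.

```python
import math
import random

random.seed(0)  # fixed seed so the random initialization is reproducible

vocabulary = ["cat", "dog", "mat", "sat", "the"]
token_to_index = {token: index for index, token in enumerate(vocabulary)}
embedding_dim = 4  # length of each embedding vector, chosen by the user

# One row per token, initialized at random (assumption: uniform values in [-1, 1]).
embedding_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in vocabulary
]


def embed(tokens):
    """Replace each token by its row in the embedding matrix."""
    return [embedding_matrix[token_to_index[token]] for token in tokens]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, a usual way to compare embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sequence = embed(["the", "cat", "sat"])
print(len(sequence), len(sequence[0]))  # 3 4
```

Each token costs only `embedding_dim` numbers instead of one number per vocabulary entry. With untrained, random rows the cosine similarity between two different words carries no meaning yet.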

#### Pros ➕
* Computationally and memory efficient
* Compatible with the sequential nature of data
* No need to determine the values in the embedding manually

#### Cons ➖
* Training the embedding parameters can take a long time for large datasets

## Word2Vec for big datasets 

More sophisticated techniques exist for learning word embeddings on large datasets. One of them is called Word2Vec.

Word2Vec is not a single algorithm; rather, it is a family of model architectures and optimizations that can be used to learn word embeddings from large datasets. Embeddings learned through Word2Vec have proven to be successful on a variety of downstream natural language processing tasks.

Let's describe two different popular techniques:

### Continuous Bag-of-Words Model

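The Continuous Bag-of-Words model aims at predicting the middle word given the context formed by the surrounding words. It is called bag-of-words because the order of the words in the context does not matter: the context is treated as an unordered collection of words. The number of surrounding words chosen to form the context is generally small.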
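
To give an idea of what the training data for such a model could look like, here is a small sketch that builds (context, target) examples from a token list. The function name `cbow_pairs` and the `window_size` parameter are hypothetical, and the sketch only prepares the examples; it does not train any model.

```python
def cbow_pairs(tokens, window_size):
    """Build (context, target) examples: the context words are used to predict the target."""
    pairs = []
    for position, target in enumerate(tokens):
        start = max(0, position - window_size)
        # Up to `window_size` words on each side of the target, the target itself excluded.
        context = tokens[start:position] + tokens[position + 1:position + 1 + window_size]
        pairs.append((context, target))
    return pairs


tokens = "the quick brown fox jumps".split()
for context, target in cbow_pairs(tokens, window_size=2):
    print(context, "->", target)
# ['quick', 'brown'] -> the
# ['the', 'brown', 'fox'] -> quick
# ['the', 'quick', 'fox', 'jumps'] -> brown
# ['quick', 'brown', 'jumps'] -> fox
# ['brown', 'fox'] -> jumps
```

In a typical CBOW setup, the embeddings of the context words are averaged or summed before the prediction, which is why their order has no influence.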


### Continuous Skip-gram Model 

The Skip-gram model tries to predict the words that appear within a certain range before or after a given word. The input is a single word and the model predicts the surrounding words. In a way, it works in the opposite direction to the Continuous Bag-of-Words model.

We will put the Skip-gram model into practice.
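
As an illustration, the sketch below generates the (input word, word to predict) pairs that a Skip-gram model could be trained on. The function name `skipgram_pairs` and the `window_size` parameter are hypothetical. Practical implementations usually add refinements such as negative sampling, which are left out here.

```python
def skipgram_pairs(tokens, window_size):
    """Build (center, context) pairs: the center word is used to predict each neighbour."""
    pairs = []
    for position, center in enumerate(tokens):
        start = max(0, position - window_size)
        end = min(len(tokens), position + window_size + 1)
        for context_position in range(start, end):
            if context_position != position:  # a word is not its own context
                pairs.append((center, tokens[context_position]))
    return pairs


tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window_size=1)
print(pairs)
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```

Each pair is one training example: the first element is the model input and the second is the surrounding word it should predict.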

## Resources 📚📚

* [Video lecture on word embedding](https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture#:~:text=An%20embedding%20is%20a%20relatively,like%20sparse%20vectors%20representing%20words.)
* [Another introduction to word embedding](https://machinelearningmastery.com/what-are-word-embeddings/)
* [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
* [Distributed Representations of Words and Phrases and their Compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
