# Feature Extraction and Embeddings

[1. Bag of words](#1)<br>
[2. Term Frequency Inverse Document Frequency (TF-IDF)](#2)<br>
[3. One-hot encoding (OHE)](#3)<br>
[4. Word embeddings](#4)<br>
[5. Word2Vec](#5)<br>
[6. GloVe](#6)<br>
[7. Embeddings for deep learning](#7)<br>
[8. t-SNE](#8)<br>

Once our text has been cleaned and normalized, we need to turn it into features that can be used for modeling. In this section, we will cover methods for doing just that!

## <a id='1'>1: Bag-of-Words (BoW)</a>

The BoW model treats each document as an unordered collection, or bag, of words.

A **document** is the *unit of text* you want to analyze. As an example, if you wanted to compare essays submitted by students for plagiarism, each essay would be a **document**.

To obtain a BoW from a piece of raw text, you simply apply appropriate text processing (cleaning, normalizing, splitting into words, stemming, etc.) and then treat the resulting **tokens** as an *unordered* collection or set:

$$\text{Little House on the Prairie} \rightarrow \{\text{"littl", "hous", "prairi"}\}$$
$$\text{Mary had a Little Lamb} \rightarrow \{\text{"mari", "littl", "lamb"}\}$$
$$\text{The Silence of the Lambs} \rightarrow \{\text{"silenc", "lamb"}\}$$
$$\text{Twinkle Twinkle Little Star} \rightarrow \{\text{"twinkl", "littl", "star"}\}$$

But keeping these as separate sets is very inefficient. They are of different sizes, may contain different words, and are hard to compare. What if a word occurs multiple times in a document? A more useful approach is turning each document into a vector of numbers representing how many times each word occurs in a document.

A *set of documents* is known as a **corpus**. A corpus gives the context for the vectors to be calculated.

**First:** collect all the unique words present in your corpus to form your vocabulary

<img src="assets/images/img_01.png" width=700 align='center'>

**Second:** arrange these words in some order so that they form the vector element positions, or columns of a table, and assume each document is a row

<img src="assets/images/img_02.png" width=700 align='center'>

**Third:** count the number of occurrences of each word in each document and enter the value in the respective column, to create a **document-term matrix (DTM)**

<img src="assets/images/img_03.png" width=700 align='center'>

The DTM illustrates the relationship between documents in rows and terms in columns. Each element in the DTM can be interpreted as a **term frequency**.
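
As a rough illustration of the three steps above, here is a minimal pure-Python sketch that builds a DTM for the four example titles. The token lists are assumed to be the output of the text-processing step (with "twinkl" kept twice, since counts matter here). The names `corpus`, `vocabulary`, and `document_term_matrix` are invented for this example, and the alphabetical column order is an arbitrary choice.

```python
from collections import Counter

# Assumed output of text processing for the four example titles.
corpus = [
    ["littl", "hous", "prairi"],
    ["mari", "littl", "lamb"],
    ["silenc", "lamb"],
    ["twinkl", "twinkl", "littl", "star"],
]

# First: collect the unique words in the corpus.
# Second: fix an order for them (alphabetical here) to define the columns.
vocabulary = sorted({token for doc in corpus for token in doc})


def document_term_matrix(docs, vocab):
    """Third: one row per document, one count per vocabulary word."""
    rows = []
    for doc in docs:
        counts = Counter(doc)  # missing words count as 0
        rows.append([counts[word] for word in vocab])
    return rows


dtm = document_term_matrix(corpus, vocabulary)

print(vocabulary)
for row in dtm:
    print(row)
```

Each printed row is the term-frequency vector of one document, so the last row contains a 2 in the "twinkl" column.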

Now, you can compare the similarity of two documents by evaluating how many terms they share and how frequently those terms occur in each document. A more mathematical approach is to take the **dot product** (i.e. the sum of the products of corresponding elements) of the two vectors to get a numerical representation of that similarity:

<img src="assets/images/img_04.png" width=700 align='center'>

The greater the dot product, the more similar the vectors, and thus the documents, are! The dot product has a flaw, however: it only captures the overlap between the two vectors and is unaffected by values they do not have in common. So, pairs that are very different can end up with the same dot product as pairs that are very similar. An alternative measure is **cosine similarity**!

**Cosine similarity** is the dot product of two vectors divided by the product of their magnitudes, or Euclidean norms:
$$\text{cos}\left(\theta\right)=\frac{a \cdot b}{||a|| \cdot ||b||} = \frac{1}{\sqrt{3}\cdot\sqrt{3}}=\frac{1}{3}$$

If you think of these vectors as arrows in $n$-dimensional space, then this is equal to the **cosine of the angle $\theta$ between them**.

Identical vectors have a cosine similarity of 1, orthogonal (or independent, nothing in common) vectors have a cosine similarity of 0, and vectors that are exactly opposite have a cosine similarity of -1. Since word counts are never negative, the value for two BoW vectors always falls between 0 and 1.

$$\text{cos}\left(\theta\right)\in [-1,1]$$
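
To make the difference between the two measures concrete, here is a small sketch in plain Python. The first pair of vectors assumes the titles "Little House on the Prairie" and "Mary had a Little Lamb" encoded over an alphabetically ordered vocabulary; the second set of vectors is made up purely to show the flaw of the dot product. The function names are illustrative.

```python
import math


def dot(a, b):
    """Sum of the products of corresponding elements."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return dot(a, b) / (norm_a * norm_b)


# Assumed column order: hous, lamb, littl, mari, prairi, silenc, star, twinkl
little_house = [1, 0, 1, 0, 1, 0, 0, 0]
mary_lamb = [0, 1, 1, 1, 0, 0, 0, 0]
print(dot(little_house, mary_lamb))                # 1 (only "littl" is shared)
print(cosine_similarity(little_house, mary_lamb))  # approximately 1/3

# Made-up vectors: both pairs have the same dot product ...
identical_a, identical_b = [1, 1, 0, 0], [1, 1, 0, 0]
different_a, different_b = [1, 1, 5, 0], [1, 1, 0, 5]
print(dot(identical_a, identical_b), dot(different_a, different_b))  # 2 2

# ... but cosine similarity tells them apart.
print(cosine_similarity(identical_a, identical_b))  # approximately 1.0
print(cosine_similarity(different_a, different_b))  # approximately 0.074
```

Dividing by the norms is what lets the non-overlapping values (the 5s in the second pair) pull the score down.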

## <a id='2'>2: Term Frequency Inverse Document Frequency (TF-IDF)</a>

One limitation of the BoW approach is that it treats every word as being equally important. Intuitively, however, we know that some words occur frequently across the whole corpus, so they do little to distinguish one document from another. We can compensate for this by *counting the number of documents in which each word occurs*. This is called the **document frequency**.

<img src="assets/images/img_05.png" width=700 align='center'>

Now, we can use this to weight the terms in the DTM:

<img src="assets/images/img_06.png" width=700 align='center'>

This gives us a metric that is proportional to the frequency of occurrence of a term in a document, but *inversely proportional to the number of documents it appears in*. It highlights words that are more distinctive of a document and is thus better for characterizing the document.

The **TF-IDF** is simply the product of two weights very similar to what we've seen so far. A commonly used form of TF-IDF defines *term frequency* as the <u>raw count of a term $t$ in a document $d$ divided by the total number of terms in $d$</u>, and *inverse document frequency* as the <u>logarithm of the total number of documents in the collection $D$ divided by the number of documents in which $t$ is present</u>. There are alternatives that normalize or smooth the resulting values, or guard against edge cases such as division by zero.

<img src="assets/images/img_07.png" width=700 align='center'>
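
The definition above can be sketched in a few lines of plain Python. This is only one possible reading of it: the natural logarithm is an assumption (the text does not fix a base), the function names are invented for this example, and no smoothing is applied.

```python
import math
from collections import Counter

# Assumed processed tokens for the four example titles.
corpus = [
    ["littl", "hous", "prairi"],
    ["mari", "littl", "lamb"],
    ["silenc", "lamb"],
    ["twinkl", "twinkl", "littl", "star"],
]


def term_frequency(term, doc):
    """Raw count of `term` in `doc` divided by the number of terms in `doc`."""
    if not doc:
        return 0.0
    return Counter(doc)[term] / len(doc)


def document_frequency(term, docs):
    """Number of documents in which `term` is present."""
    return sum(1 for doc in docs if term in doc)


def inverse_document_frequency(term, docs):
    """log(total number of documents / number of documents containing `term`)."""
    df = document_frequency(term, docs)
    if df == 0:
        raise ValueError(f"{term!r} does not occur in the corpus")
    return math.log(len(docs) / df)  # natural log: an assumption


def tf_idf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)


twinkle = corpus[3]
print(tf_idf("twinkl", twinkle, corpus))  # frequent in this document, rare elsewhere
print(tf_idf("littl", twinkle, corpus))   # appears in 3 of the 4 documents
```

Here "twinkl" gets a much higher weight than "littl" for the last document, which matches the intuition that it characterizes that document better.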

## <a id='3'>3: One-Hot Encoding (OHE)</a>

If we treat each word as its own class, we can assign it a vector that has a one in a single predetermined position for that word and zeros everywhere else:

<img src="assets/images/img_08.png" width=700 align='center'>

This is very similar to the BoW idea, except that we keep a single word in each bag and build a vector for it.
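
A short sketch of that relationship, assuming the same eight-word vocabulary as the BoW example in alphabetical order (the function names are made up for illustration): each word gets a one-hot vector, and adding up the one-hot vectors of a document's tokens gives its BoW vector.

```python
# Assumed vocabulary; a word's position in the list is its "hot" index.
vocabulary = ["hous", "lamb", "littl", "mari", "prairi", "silenc", "star", "twinkl"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Vector with a 1 at the word's position and 0 everywhere else."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector


def bag_of_words(tokens):
    """Element-wise sum of the one-hot vectors of all tokens in a document."""
    total = [0] * len(vocabulary)
    for token in tokens:
        total = [t + o for t, o in zip(total, one_hot(token))]
    return total


print(one_hot("lamb"))                          # [0, 1, 0, 0, 0, 0, 0, 0]
print(bag_of_words(["mari", "littl", "lamb"]))  # [0, 1, 1, 1, 0, 0, 0, 0]
```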

## <a id='4'>4: Word Embeddings</a>

OHE works in many situations, but breaks down when we have a large vocabulary to deal with, because the size of our word representation grows with the number of unique words in the corpus. Another approach addresses this by limiting the word representation to a fixed-size vector.

In other words, we want to find an embedding for each word in a vector space and we want it to exhibit some desired properties. For example, if two words are similar in meaning, they should be closer together than two words that are dissimilar.

<img src="assets/images/img_09.png" width=700 align='center'>

If two pairs of words have a similar difference in their meaning, they should be approximately equally separated in the embedding space.

<img src="assets/images/img_10.png" width=700 align='center'>

We could use such a representation for a variety of purposes, such as finding synonyms and analogies, identifying concepts around which words are clustered, or classifying words as positive, negative, or neutral. By combining word vectors, we can come up with another way of representing documents as well.
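
The sketch below illustrates these properties with hand-picked two-dimensional toy vectors. The numbers are not learned from any data; they are chosen only so that the arithmetic is easy to follow, and real embeddings have many more dimensions with no such clean interpretation. Averaging is just one simple, assumed way of combining word vectors into a document vector.

```python
import math

# Hand-picked toy vectors (NOT learned from data).
# Axis 0 loosely stands for "royalty", axis 1 for "gender".
embeddings = {
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
}


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_word(vector, table):
    """Word whose embedding is closest to `vector`."""
    return min(table, key=lambda word: euclidean_distance(table[word], vector))


def document_vector(tokens, table):
    """Combine word vectors by averaging them element-wise."""
    vectors = [table[token] for token in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Pairs with a similar difference in meaning are equally separated.
print(subtract(embeddings["king"], embeddings["man"]))     # [1.0, 0.0]
print(subtract(embeddings["queen"], embeddings["woman"]))  # [1.0, 0.0]

# Analogy: king - man + woman lands on queen.
target = add(subtract(embeddings["king"], embeddings["man"]), embeddings["woman"])
print(nearest_word(target, embeddings))  # queen

# A fixed-size document representation built from word vectors.
print(document_vector(["king", "queen"], embeddings))  # [1.0, 0.0]
```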

## <a id='5'>5: Word2Vec</a>

Word2Vec is one of the most popular examples of word embeddings used in practice. As the name indicates, it transforms words to vectors.

The core idea of Word2Vec: a model that is able to **predict a given word, given neighboring words**, or vice versa, **predict neighboring words for a given word**, is likely to capture the contextual meanings of words very well.

Two flavors of Word2Vec models:
1. Continuous Bag-of-Words (CBoW)
2. Continuous Skip-gram

<img src="assets/images/img_11.png" width=700 align='center'>

### Skip-gram Model:

Pick a word from a sentence, convert it to an OHE vector, and feed it into a probabilistic model (typically a neural network) that is designed to predict a few surrounding words, i.e. its context. Repeat this over many words, minimizing a loss function, until the model predicts context words as well as it can.
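
To see what the training data for such a model might look like, here is a minimal sketch that turns a tokenized sentence into (center word, context word) pairs. The `window` parameter (how many neighbors on each side count as context) and the function name are assumptions made for this example; the model itself and the loss are left out.

```python
def skipgram_pairs(tokens, window=2):
    """Build (center, context) pairs: the model sees `center` and should predict `context`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["mary", "had", "a", "little", "lamb"]
for center, context in skipgram_pairs(sentence, window=1):
    print(center, "->", context)
```

With `window=1`, every interior word yields two pairs (left and right neighbor) and the first and last words yield one each.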

If you take an intermediate representation like a hidden layer in the neural network, the outputs of that layer for a given word become the corresponding word vector.

<img src="assets/images/img_12.png" width=700 align='center'>
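
One way to see why the hidden layer output can serve as the word vector: assuming a hidden layer with no activation function, multiplying a one-hot input by the input weight matrix simply selects one row of that matrix. The sketch below uses a three-word vocabulary and arbitrary weights chosen for illustration (not trained values); all names are hypothetical.

```python
def one_hot(index, size):
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


def vector_matrix_product(vector, matrix):
    """Row vector of length V times a V x N matrix gives a row vector of length N."""
    n_cols = len(matrix[0])
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(matrix)))
        for j in range(n_cols)
    ]


vocabulary = ["littl", "lamb", "star"]

# Hypothetical input-to-hidden weights: one row per word (V = 3),
# one column per embedding dimension (N = 2). Values are arbitrary.
hidden_weights = [
    [0.2, -0.4],
    [0.7, 0.1],
    [-0.3, 0.9],
]

word_index = vocabulary.index("lamb")
hidden_output = vector_matrix_product(
    one_hot(word_index, len(vocabulary)), hidden_weights
)

print(hidden_output)               # the hidden layer output for "lamb"
print(hidden_weights[word_index])  # the same values, read directly from the matrix
```

This is also why trained vectors can be stored in a lookup table: once the weights are learned, the multiplication can be replaced by reading out a row.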

The CBoW variation uses a similar strategy, but in the opposite direction: it predicts a word from its neighboring words.
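
A minimal sketch of how CBoW training examples might be formed, assuming a symmetric context `window` (a made-up parameter name): each example pairs the list of neighboring words with the word they surround.

```python
def cbow_examples(tokens, window=2):
    """Build (context words, center) examples: the model sees the context and predicts the center."""
    examples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + window + 1]
        context = left + right
        if context:  # skip one-word sentences with no context
            examples.append((context, center))
    return examples


sentence = ["mary", "had", "a", "little", "lamb"]
for context, center in cbow_examples(sentence, window=1):
    print(context, "->", center)
```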

Word2Vec properties:
* Yields a robust representation of word meaning, because the meaning of each word is distributed throughout the vector (i.e. distributed representation)
* The size of the vector is up to you and is independent of the vocabulary size: it remains constant no matter how many terms you train on. You can think of it as a tradeoff between performance and complexity
* Once you have pre-trained a large set of word vectors, you can use them efficiently without having to transform again and again (i.e. train once, store in a lookup table)
* They are ready to be used in deep learning architectures, e.g. as the input vectors for recurrent neural networks (RNNs) (i.e. deep learning ready)
  * It is possible to use RNNs to learn even better word embeddings

There are other optimizations that can further reduce model and training complexity, such as representing the output words using Hierarchical Softmax, computing the loss using Sparse Cross Entropy, etc.