# Notebook 5 - Feature engineering


<!-- ## 9. Features in textual data -->

Text is only a way of transcribing our thoughts - we use it to communicate with other people... and computers! However, if we give a computer raw textual data, it has no idea how to interpret this sequence of characters (actually a sequence of bits). Text written in English makes sense to us because we know how to interpret it: we look for features we were taught are meaningful (single words, multi-word phrases, idioms), recall what they mean, and join them together to get the author's full message.

If we want a computer to "understand" the text, we first need to tell it which features it should look for. The process of developing these features is called **feature engineering**. So what can be a feature? Different applications call for different features. Let's say we want the computer to classify reviews as positive or negative. We may create a feature which is the count of all occurrences of the word *"good"*: the higher the count, the more "positive" the review is. We could also add a feature that expresses the length of the review (e.g. assuming that a longer review indicates a positive sentiment). However, *"good"* is not the only word indicating positive sentiment, so we would also need to manually add more word-features ("nice", "cool" etc.). This seems like a bad idea, because every word we did not think of is ignored.
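The hand-crafted features described above can be sketched as follows. This is only an illustration: the word list `POSITIVE_WORDS`, the `tokenize` helper and the example review are assumptions made up for this sketch, not part of any library.

```python
import re

# Hypothetical, hand-picked list of positive words (illustration only).
POSITIVE_WORDS = {"good", "nice", "cool"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def handcrafted_features(review):
    """Return a few manually designed features for one review."""
    tokens = tokenize(review)
    return {
        # Number of occurrences of the single word "good".
        "count_good": tokens.count("good"),
        # Number of tokens that appear in our hand-picked word list.
        "count_positive": sum(token in POSITIVE_WORDS for token in tokens),
        # Length of the review measured in tokens.
        "length": len(tokens),
    }

features = handcrafted_features("Good food, good service and a nice view.")
print(features)
```

Every word outside `POSITIVE_WORDS` contributes nothing except to the length, which is exactly the weakness of this approach.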

Instead, we can try to express the **meaning of each word** in a document. Then, we can represent a document as a set of meaningful words.

## 10. Vectorizers

The standard way to represent the meaning of a word in NLP is a vector. We can also represent whole documents using vectors.

### 10.1. Classical one-hot vectors

We have already seen them in previous notebooks. A one-hot vector can be used to represent a document from the corpus based on the words the document contains.
Let's say we have a corpus C of 3 documents d<sub>1</sub>, d<sub>2</sub>, d<sub>3</sub>.

| document number | document content |
|-----------------|------------------|
| d<sub>1</sub>   | "i like apples"  |
| d<sub>2</sub>   | "we like dogs"   |
| d<sub>3</sub>   | "we and i and dogs" |

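Now, we can associate each word with a position in a vector: ["i", "like", "apples", "we", "dogs", "and"], and represent each document as a vector that has 1 at the positions of the words it contains and 0 elsewhere (this is the element-wise maximum of the one-hot vectors of its words, i.e. max pooling).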

| document number | document vector |
|-----------------|------------------|
| d<sub>1</sub>   | [1, 1, 1, 0, 0, 0]  |
| d<sub>2</sub>   | [0, 1, 0, 1, 1, 0]  |
| d<sub>3</sub>   | [1, 0, 0, 1, 1, 1]  |
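A minimal sketch of how these binary document vectors could be computed for the corpus C. The helper name `binary_vector` is an assumption of this sketch, and splitting on whitespace is enough only because the toy documents are already lowercase and unpunctuated.

```python
corpus = {
    "d1": "i like apples",
    "d2": "we like dogs",
    "d3": "we and i and dogs",
}

# Vocabulary in the same order as in the text above.
vocabulary = ["i", "like", "apples", "we", "dogs", "and"]

def binary_vector(document, vocabulary):
    """Return 1 for every vocabulary word present in the document, else 0."""
    words = set(document.split())
    return [1 if word in words else 0 for word in vocabulary]

binary_vectors = {
    name: binary_vector(text, vocabulary) for name, text in corpus.items()
}
for name, vector in binary_vectors.items():
    print(name, vector)
```

Note that "and" occurs twice in d<sub>3</sub> but still contributes only a 1: this representation records presence, not counts.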

Ok, done! But how can we learn the meanings of these words? Let's build the term-document matrix!



### 10.2. Term-document matrix

The term-document matrix represents word occurrences in the documents of the corpus. It shows how many times each word occurs in each document. Let's visualize it using our corpus C.

|          | d<sub>1</sub> | d<sub>2</sub> | d<sub>3</sub> |
|----------|---|---|---|
| "i"      | 1 | 0 | 1 |
| "like"   | 1 | 1 | 0 |
| "apples" | 1 | 0 | 0 |
| "we"     | 0 | 1 | 1 |
| "dogs"   | 0 | 1 | 1 |
| "and"    | 0 | 0 | 2 |

If you read the columns, you get the vector representing each document. For example, the document d<sub>3</sub> is represented by the vector [1, 0, 0, 1, 1, 2].
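The matrix above can be rebuilt with a few lines of plain Python. This is a sketch under the assumption that whitespace splitting is a good enough tokenizer for the toy corpus; the names `term_document` and `document_vector` are made up for this example.

```python
from collections import Counter

corpus = {
    "d1": "i like apples",
    "d2": "we like dogs",
    "d3": "we and i and dogs",
}
vocabulary = ["i", "like", "apples", "we", "dogs", "and"]

# One Counter of word counts per document.
counts = {name: Counter(text.split()) for name, text in corpus.items()}

# Rows are terms, columns are documents in the order d1, d2, d3.
term_document = {
    term: [counts[name][term] for name in corpus] for term in vocabulary
}

def document_vector(name):
    """Read one column of the matrix: the count vector of a document."""
    column = list(corpus).index(name)
    return [term_document[term][column] for term in vocabulary]

for term, row in term_document.items():
    print(f"{term:>7}", row)
print("d3 ->", document_vector("d3"))
```

Reading a row of `term_document` gives the vector of a word, and reading a column gives the vector of a document.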

Now, let's look at another, larger term-document matrix for a corpus of 4 Shakespeare plays and 4 selected words from them. Note how the word counts differ between the comedies (As You Like It and Twelfth Night) and the other plays (Julius Caesar and Henry V).

<div style="text-align:center"><img src="res/pic1.png" alt="" width="800"/></div>

#### Documents similarity

We can use these counts to determine whether two documents are similar. The intuitive approach is that **two documents are similar if they have similar vectors** - in other words, they contain similar words.

Again, we can represent each document by its word counts, which gives a 4-dimensional vector. Since a 4-dimensional plot is impossible to draw, let's compare the documents using only the counts of the words "battle" and "fool", i.e. 2-dimensional vectors.

<div style="text-align:center"><img src="res/pic2.png" alt="" width="600"/></div>

As you can see, these two words discriminate the documents very well! As expected, the comedies contain many "fools" and hardly any "battles", hence, based on the words "battle" and "fool", the documents "As You Like It" and "Twelfth Night" are similar. The same applies to "Henry V" and "Julius Caesar", where the number of "battles" is higher than the number of "fools". Of course, in a real implementation we would use vectors of length |V| (the vocabulary size) to represent a document.


#### Words similarity

Interestingly, we can also read the **row vectors**! We can use them to represent the **meaning of each word**.

<div style="text-align:center"><img src="res/pic3.png" alt="" width="800"/></div>

For example, we can express the word "fool" using the vector [36, 58, 1, 4] and the word "wit" using the vector [20, 15, 2, 3]. Here we can apply the same intuition as for documents: **two words are similar when they have similar vectors** - they tend to occur in similar documents. So in theory, the word "fool" [36, 58, 1, 4] should be more similar to "wit" [20, 15, 2, 3] than to "battle" [1, 0, 7, 13], and indeed it is!





### 10.3. Term-term matrix

Another way to represent **word meanings** is to use the **term-term matrix**. In this matrix both rows and columns are labeled with words from the corpus vocabulary. To construct the matrix, for each word *w* we look at its context. We can define the context as the *k* previous and *k* next words (usually *k* is small, around 5).

For example, take the text "At the university I work on data science. My computer performance is quite low." and a context of +/- 4 words around the word "computer". The context words are: "on", "data", "science", "My", "performance", "is", "quite", "low". Now, every pair of the word *w* and one of its context words gets +1 in the term-term matrix. If we take every word from V, look up its context, and record the count of each co-occurrence, we obtain the word-word co-occurrence matrix.
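The counting procedure described above can be sketched like this. The helpers `tokenize`, `context_window` and `term_term_counts` are names invented for this sketch, and the window is allowed to cross the sentence boundary, just as in the example above.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split the text into alphabetic tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)

def context_window(tokens, position, k):
    """Return up to k tokens before and up to k tokens after a position."""
    left = tokens[max(0, position - k):position]
    right = tokens[position + 1:position + 1 + k]
    return left + right

def term_term_counts(tokens, k):
    """Count how often each word appears within +/- k tokens of another."""
    counts = defaultdict(lambda: defaultdict(int))
    for position, word in enumerate(tokens):
        for context_word in context_window(tokens, position, k):
            counts[word][context_word] += 1
    return counts

text = ("At the university I work on data science. "
        "My computer performance is quite low.")
tokens = tokenize(text)

computer_context = context_window(tokens, tokens.index("computer"), k=4)
print(computer_context)

counts = term_term_counts(tokens, k=4)
print(dict(counts["computer"]))
```

Because every word is visited once as the center word, the resulting counts are symmetric: `counts["computer"]["data"]` equals `counts["data"]["computer"]`.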

The intuition behind this is that if two words co-occur (are nearby in text), their meanings are similar. Let's look at a subset of the term-term matrix for the Wikipedia corpus.

<div style="text-align:center"><img src="res/pic4.png" alt="" width="800"/></div>

As you can see, the word "digital" co-occurs more often with "computer" (1670) than with "sugar" (4). On the other hand, "cherry" is seen together with "pie" (442) much more often than with "computer" (2). In other words, "cherry" is more similar to "pie" than to "computer", which makes sense.


### 10.4 and 1/2 Cosine similarity measure

Since word and document similarities will be defined by the similarities of their vectors, we need to define a **measure of vector similarity**, which takes two vectors as input and returns a number describing how similar they are. Visually, the intuition is that two vectors are most similar when they point in the same direction, and totally different when they point in exactly opposite directions. But how do we define it mathematically for multidimensional vectors?

The most common similarity metric is the **cosine of the angle between two vectors**. Intuitively, if the angle between two vectors is small (the cosine is large), then the two vectors point in a similar direction. If the angle between them is large (the cosine is small), then the vectors point in different directions.

The cosine of two vectors *v* and *w* is given by the following equation:

<div style="text-align:center"><img src="res/eq1.png" alt="" width="400"/></div>

where |v| means the length of the vector *v*. However, why do we divide by these lengths? Vectors differ not only in direction but also in length. Since the vector elements represent word counts in documents (term-document matrix) or word co-occurrences (term-term matrix), frequent words will have much longer vectors than rare words. Thus, we need to **normalize** their lengths so they don't affect the measure - we want to know how much two vectors differ regardless of their lengths.
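As a quick check of the cosine measure, here is a small sketch in plain Python that applies it to the word vectors for "fool", "wit" and "battle" quoted earlier. It assumes the usual definition, the dot product divided by the product of the two vector lengths; the function name `cosine` is just a choice for this sketch.

```python
import math

def cosine(v, w):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    length_v = math.sqrt(sum(a * a for a in v))
    length_w = math.sqrt(sum(b * b for b in w))
    # Dividing by both lengths removes the effect of vector magnitude.
    return dot / (length_v * length_w)

# Row vectors from the Shakespeare term-document matrix.
fool = [36, 58, 1, 4]
wit = [20, 15, 2, 3]
battle = [1, 0, 7, 13]

print("fool vs wit:   ", round(cosine(fool, wit), 3))
print("fool vs battle:", round(cosine(fool, battle), 3))

# Scaling a vector changes its length but not its direction.
wit_doubled = [2 * x for x in wit]
print("fool vs 2*wit: ", round(cosine(fool, wit_doubled), 3))
```

The last line illustrates the normalization: doubling every count in "wit" leaves the cosine unchanged, so a word is not judged more or less similar merely because it is more frequent.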

### 10.4. TF-IDF

Both term-document and term-term matrices are based on word frequencies. Because of this, very frequent words like "the" or "he" do not discriminate well, because they are too general. Hence, we need different methods, which penalize overly frequent words and discriminate well. Let's introduce the first one, based on the term-document matrix.

TF-IDF (Term Frequency - Inverse Document Frequency) has the same format as the term-document matrix, but its values are calculated in two steps:

**Term Frequency** of a term *t* in a document *d* is simply the logarithm of one plus the number of times *t* occurs in *d*:

<div style="text-align:center"><img src="res/eq2.png" alt="" width="300"/></div>

We use a logarithm to "punish" words which occur very frequently. The 1 inside the logarithm is added because a word with no occurrences would otherwise require the logarithm of 0, which is undefined.

The next part of TF-IDF is called **Inverse Document Frequency** and emphasizes words that are rare (and may therefore discriminate well). The document frequency of a term *t* is simply the number of documents in which *t* occurs. Since we are interested in its inverse (we need rare words!), IDF is calculated as:

<div style="text-align:center"><img src="res/eq3.png" alt="" width="200"/></div>

The complete TF-IDF weighted value for a term *t* and a document *d* is given by the product of both TF and IDF:

<div style="text-align:center"><img src="res/eq4.png" alt="" width="200"/></div>
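The two steps can be sketched on the toy corpus C from section 10.1. The exact variant is an assumption of this sketch: it uses tf = log10(count + 1) and idf = log10(N / df), where N is the number of documents, which is one common formulation; the base of the logarithm in particular is a choice made here.

```python
import math
from collections import Counter

corpus = {
    "d1": "i like apples",
    "d2": "we like dogs",
    "d3": "we and i and dogs",
}
counts = {name: Counter(text.split()) for name, text in corpus.items()}
n_documents = len(corpus)

def tf(term, name):
    """Log-scaled count of the term in one document."""
    return math.log10(counts[name][term] + 1)

def df(term):
    """Number of documents that contain the term at least once."""
    return sum(1 for name in corpus if counts[name][term] > 0)

def idf(term):
    """Higher for terms found in fewer documents (term must occur somewhere)."""
    return math.log10(n_documents / df(term))

def tf_idf(term, name):
    """Product of the two parts for one term and one document."""
    return tf(term, name) * idf(term)

for term in ["apples", "like", "and"]:
    row = [round(tf_idf(term, name), 3) for name in corpus]
    print(f"{term:>7}", row)
```

A term that occurred in every document would get idf = log10(1) = 0, so its TF-IDF weight would vanish no matter how often it appears.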


### 10.5. PPMI

The equivalent of TF-IDF for the term-term matrix is **Positive Pointwise Mutual Information**. The fundamental idea behind this method is to measure how much more often two words occur together than we would expect if they were independent. If they actually co-occur less often than expected, we do not take this into account and simply replace the value with 0 (hence it's called *Positive*). The formula for the PPMI of a word *w* in the context *c* is given by:

<div style="text-align:center"><img src="res/eq5.png" alt="" width="400"/></div>

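It has been found that PPMI performs better when the context probability is raised to the power of *α* (with α = 0.75). This slightly increases the probability assigned to rare contexts and so lowers their otherwise inflated PPMI values.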
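A minimal sketch of PPMI with optional α smoothing. It assumes the standard definition, PPMI(w, c) = max(log2(P(w, c) / (P(w) P(c))), 0), with the smoothed context probability P_α(c) = count(c)^α / Σ count(c')^α. The co-occurrence counts below are toy numbers invented for illustration, not taken from any corpus.

```python
import math

# Toy co-occurrence counts invented for illustration: counts[word][context].
counts = {
    "cherry": {"pie": 8, "computer": 1},
    "digital": {"pie": 1, "computer": 10},
}

def ppmi_matrix(counts, alpha=1.0):
    """Compute PPMI for every (word, context) pair; alpha=1 means no smoothing."""
    total = sum(sum(row.values()) for row in counts.values())
    word_totals = {word: sum(row.values()) for word, row in counts.items()}
    context_totals = {}
    for row in counts.values():
        for context, n in row.items():
            context_totals[context] = context_totals.get(context, 0) + n
    # Normalizer for the smoothed context distribution P_alpha(c).
    smoothed_total = sum(n ** alpha for n in context_totals.values())

    result = {}
    for word, row in counts.items():
        result[word] = {}
        for context, n in row.items():
            p_joint = n / total
            p_word = word_totals[word] / total
            p_context = context_totals[context] ** alpha / smoothed_total
            if n > 0:
                pmi = math.log2(p_joint / (p_word * p_context))
            else:
                pmi = float("-inf")  # never co-occur: clipped to 0 below
            result[word][context] = max(pmi, 0.0)
    return result

plain = ppmi_matrix(counts)
smoothed = ppmi_matrix(counts, alpha=0.75)
print("cherry/pie     :", round(plain["cherry"]["pie"], 3))
print("cherry/computer:", round(plain["cherry"]["computer"], 3))
print("cherry/pie with alpha=0.75:", round(smoothed["cherry"]["pie"], 3))
```

In this toy data "pie" is the rarer context, so smoothing raises its probability and lowers the PPMI of the pair ("cherry", "pie") a little, while pairs that co-occur less often than chance stay at 0.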

## 11. Word embeddings

### 11.1. Uncontextual vs contextual

### 11.2. word2vec

### 11.3. GloVe