[back](./01-nlp-data-cleaning.ipynb)

---
## `Count Vectorization and TFIDF`

In [1]:
from sklearn.metrics.pairwise import cosine_similarity


### `Count Vectorizer`

Vocabulary:

```python
['drop', 'help', 'jeff', 'octopus', 'polic', 'said', 'sandwich', 'sandwichless', 'sob', 'stole']
```

Ideally, we want to assign simple values to words. Continuing with the data from the previous section, each document is a list of words, and each word is a *token*.  
Count vectorization is a *bag-of-words* approach: it ignores word order and only counts how often each vocabulary word appears in a document.

In the previous section we ended up with a list of token lists. We can now convert each token list into a list of numbers, one number per vocabulary word.  
Below is how those numerical lists look for the example from the previous section.  

```python
['jeff', 'stole', 'octopus', 'sandwich']
# [0, 0, 1, 1, 0, 0, 1, 0, 0, 1]
['help', 'sob', 'sandwichless']
# [0, 1, 0, 0, 0, 0, 0, 1, 1, 0]
['drop', 'sandwich', 'said', 'sandwich', 'polic']
# [1, 0, 0, 0, 1, 1, 2, 0, 0, 0]
```

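Each position in a vector corresponds to the position of a word in the vocabulary list, and the value at that position is the number of times that token occurs in the document. For example, `sandwich` appears twice in the third document, so its position holds a $2$.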
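
The mapping described above can be sketched in plain Python. This is a minimal illustration, assuming the vocabulary is simply the sorted set of all tokens; the helper name `count_vector` is made up for this example and is not from the previous section.

```python
from collections import Counter

docs = [
    ["jeff", "stole", "octopus", "sandwich"],
    ["help", "sob", "sandwichless"],
    ["drop", "sandwich", "said", "sandwich", "polic"],
]

# Assumption: the vocabulary is every distinct token, sorted alphabetically.
vocabulary = sorted({token for doc in docs for token in doc})

def count_vector(doc, vocabulary):
    """Return one count per vocabulary entry, in vocabulary order."""
    counts = Counter(doc)
    return [counts[token] for token in vocabulary]

count_vectors = [count_vector(doc, vocabulary) for doc in docs]
for doc, vector in zip(docs, count_vectors):
    print(doc, vector)
```

Tokens that do not occur in a document get a count of $0$, because a `Counter` returns $0$ for missing keys.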

### `TFIDF`

**TFIDF** combines two terms:
- **TF**: Term Frequency
- **IDF**: Inverse Document Frequency

#### `Term Frequency`

$$TF_{word,document} = \frac{\#\_of\_times\_word\_appears\_in\_document}{total\_\#\_of\_words\_in\_document}$$

```python
['jeff', 'stole', 'octopus', 'sandwich']
[0, 0, 1/4, 1/4, 0, 0, 1/4, 0, 0, 1/4]

['help', 'sob', 'sandwichless']
[0, 1/3, 0, 0, 0, 0, 0, 1/3, 1/3, 0]

['drop', 'sandwich', 'said', 'sandwich', 'polic']
[1/5, 0, 0, 0, 1/5, 1/5, 2/5, 0, 0, 0]
```

Meaning, in our first **TF** list, the token `jeff` appears $1$ time in a document that contains $4$ tokens in total, hence we see $1/4$.
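
As a rough sketch of the formula, term frequency divides each count by the length of the document. The example below uses `fractions.Fraction` only to keep the values exact, and the function name `term_frequency` is illustrative.

```python
from collections import Counter
from fractions import Fraction

vocabulary = ["drop", "help", "jeff", "octopus", "polic", "said",
              "sandwich", "sandwichless", "sob", "stole"]

def term_frequency(doc, vocabulary):
    """Count of each vocabulary word divided by the number of tokens in doc."""
    counts = Counter(doc)
    return [Fraction(counts[token], len(doc)) for token in vocabulary]

third_doc = ["drop", "sandwich", "said", "sandwich", "polic"]
tf_third = term_frequency(third_doc, vocabulary)
print([str(value) for value in tf_third])
```

Because every token of the document is in the vocabulary, the term frequencies of one document add up to $1$.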

#### `Document Frequency`

$$DF_{word} = \frac{\#\_of\_documents\_containing\_word}{total\_\#\_of\_documents}$$

Vocabulary:

```python
['drop', 'help', 'jeff', 'octopus', 'polic', 'said', 'sandwich', 'sandwichless', 'sob', 'stole']
```

Document frequency for each word:

```python
[1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 2/3, 1/3, 1/3, 1/3]
```
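
A small sketch of how these fractions could be computed, assuming the three tokenized documents from the example. The name `document_frequency` is hypothetical.

```python
from fractions import Fraction

docs = [
    ["jeff", "stole", "octopus", "sandwich"],
    ["help", "sob", "sandwichless"],
    ["drop", "sandwich", "said", "sandwich", "polic"],
]
vocabulary = sorted({token for doc in docs for token in doc})

def document_frequency(docs, vocabulary):
    """Fraction of documents that contain each vocabulary word at least once."""
    n_docs = len(docs)
    return [
        Fraction(sum(token in doc for doc in docs), n_docs)
        for token in vocabulary
    ]

df = document_frequency(docs, vocabulary)
print([str(value) for value in df])
```

Note that a repeated word counts only once per document: `sandwich` occurs three times in the corpus but in two documents, so its document frequency is $2/3$.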

#### `Inverse Document Frequency`

$$IDF_{word}=\log\left(\frac{total\_\#\_of\_documents}{\#\_of\_documents\_containing\_word}\right)$$

Vocabulary:

```python
['drop', 'help', 'jeff', 'octopus', 'polic', 'said', 'sandwich', 'sandwichless', 'sob', 'stole']
```

IDF for each word (using the natural logarithm, so $\ln(3) \approx 1.099$ and $\ln(3/2) \approx 0.405$):
```python
[1.099, 1.099, 1.099, 1.099, 1.099, 1.099, 0.405, 1.099, 1.099, 1.099]
```
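
The values above can be reproduced with a short sketch, assuming the natural logarithm (`math.log`) and the same three documents. The function name `inverse_document_frequency` is illustrative.

```python
import math

docs = [
    ["jeff", "stole", "octopus", "sandwich"],
    ["help", "sob", "sandwichless"],
    ["drop", "sandwich", "said", "sandwich", "polic"],
]
vocabulary = sorted({token for doc in docs for token in doc})

def inverse_document_frequency(docs, vocabulary):
    """log(total documents / documents containing the word) per vocabulary word."""
    n_docs = len(docs)
    idf = []
    for token in vocabulary:
        n_containing = sum(token in doc for doc in docs)
        idf.append(math.log(n_docs / n_containing))
    return idf

idf = inverse_document_frequency(docs, vocabulary)
print([round(value, 3) for value in idf])
```

Words that appear in fewer documents get a larger IDF, so `sandwich`, which appears in two of the three documents, is weighted lowest.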

#### `TFIDF`

Vocabulary:

```python
['drop', 'help', 'jeff', 'octopus', 'polic', 'said', 'sandwich', 'sandwichless', 'sob', 'stole']
```

$TF \times IDF$ for each document:

```python
['jeff', 'stole', 'octopus', 'sandwich']
[0, 0, 0.275, 0.275, 0, 0, 0.101, 0, 0, 0.275]

['help', 'sob', 'sandwichless']
[0, 0.366, 0, 0, 0, 0, 0, 0.366, 0.366, 0]

['drop', 'sandwich', 'said', 'sandwich', 'polic']
[0.22, 0, 0, 0, 0.22, 0.22, 0.162, 0, 0, 0]
```
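
Putting the two pieces together, here is a minimal end-to-end sketch that multiplies each term frequency by the matching IDF. It assumes the plain formulas from this section (natural logarithm, no smoothing, no normalization), and the helper name `tfidf_vectors` is made up for the example.

```python
import math
from collections import Counter

docs = [
    ["jeff", "stole", "octopus", "sandwich"],
    ["help", "sob", "sandwichless"],
    ["drop", "sandwich", "said", "sandwich", "polic"],
]

def tfidf_vectors(docs):
    """Return (vocabulary, one TF * IDF vector per document)."""
    vocabulary = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # IDF: log(total documents / documents containing the word).
    idf = [
        math.log(n_docs / sum(token in doc for doc in docs))
        for token in vocabulary
    ]
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # TF: count of the word divided by the document length.
        tf = [counts[token] / len(doc) for token in vocabulary]
        vectors.append([t * i for t, i in zip(tf, idf)])
    return vocabulary, vectors

vocabulary, tfidf = tfidf_vectors(docs)
for vector in tfidf:
    print([round(value, 3) for value in vector])
```

scikit-learn's `TfidfVectorizer` would most likely not reproduce these exact numbers with its default settings, since by default it uses raw counts for TF, a smoothed IDF, and L2-normalizes each row.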

### `Conclusion`

Now that we have turned our *DOCUMENTS* into *VECTORS*, we can feed them into whatever machine learning algorithm we want!  
We can also compare them with whatever similarity measure we please, for example cosine similarity:

In [2]:
cosine_similarity([[0, 0, 0.275, 0.275, 0, 0, 0.101, 0, 0, 0.275],  [0.22, 0, 0, 0, 0.22, 0.22, 0.162, 0, 0, 0]])


array([[1.        , 0.08115802],
       [0.08115802, 1.        ]])

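The function `cosine_similarity()` takes a list of vectors and returns the matrix of their pairwise similarities. Here we passed the **TFIDF** vector of the first document and the **TFIDF** vector of the third document. The off-diagonal value, about $0.08$, is their similarity; it is small because the two documents share only the token `sandwich`.  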

Next, we'll look at the *cosine similarity* of the second document's vector with the third. These two documents have no tokens in common, so we expect a similarity of $0$.

In [3]:
cosine_similarity([[0, 0.366, 0, 0, 0, 0, 0, 0.366, 0.366, 0],  [0.22, 0, 0, 0, 0.22, 0.22, 0.162, 0, 0, 0]])

array([[1., 0.],
       [0., 1.]])
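
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The plain-Python sketch below recomputes the two off-diagonal values shown above; the helper `cosine` is illustrative and is not part of scikit-learn.

```python
import math

def cosine(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

first = [0, 0, 0.275, 0.275, 0, 0, 0.101, 0, 0, 0.275]
second = [0, 0.366, 0, 0, 0, 0, 0, 0.366, 0.366, 0]
third = [0.22, 0, 0, 0, 0.22, 0.22, 0.162, 0, 0, 0]

sim_first_third = cosine(first, third)
sim_second_third = cosine(second, third)
print(round(sim_first_third, 4), sim_second_third)
```

Only the `sandwich` position is non-zero in both the first and third vectors, so it is the only term contributing to their dot product. The second and third vectors have no non-zero position in common, so their dot product, and therefore their similarity, is exactly $0$.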


---
[next]()