# TF-IDF Exercise

You have the following (tokenized) document collection:

|id | words|
|---|------|
| 1 | aunt, aunt, uncle, like, cake|
| 2 | uncle, like, cake|
| 3 | cake, taste, sweet|
| 4 | aunt, sweet, like|
| 5| cake, cake, bake, uncle, sweet|

1. Determine the Document Term Matrix.

2. Calculate all the TF-IDF values. (For simplicity, use the raw, unnormalized term frequency tf.)

### Solution

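Document Term Matrix (raw term counts per document; the last column gives each term's IDF, computed as $\ln(N/\mathrm{df})$ with $N=5$ documents):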

| term |  D1| D2|D3|D4|D5| IDF|
|------|----|---|--|--|--|---|
| aunt| 2|0|0|1|0|0.9162907|
| uncle|1|1|0|0|1|0.5108256| 
| like|1|1|0|1|0|0.5108256| 
| cake|1|1|1|0|2|0.2231436|
| taste|0|0|1|0|0|1.609438|
| sweet|0|0|1|1|1|0.5108256|
| bake|0|0|0|0|1|1.609438|




TF-IDF Values (term count multiplied by the term's IDF):

| term |  D1| D2|D3|D4|D5|
|------|----|---|--|--|--|
| aunt|1.832581 |0|0|0.9162907|0|
| uncle|0.5108256|0.5108256|0|0|0.5108256| 
| like|0.5108256|0.5108256|0|0.5108256|0| 
| cake|0.2231436|0.2231436|0.2231436|0|0.4462872|
| taste|0|0|1.609438|0|0|
| sweet|0|0|0.5108256|0.5108256|0.5108256|
| bake|0|0|0|0|1.609438|
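
The two tables above can be reproduced with a few lines of NumPy. This is a minimal sketch, assuming raw term counts for tf and the natural-log IDF $\ln(N/\mathrm{df})$ that matches the values in the tables; the names `docs`, `vocab`, `counts`, `idf`, and `tfidf` are chosen for illustration only.

```python
import numpy as np

# Tokenized collection from the exercise (document ids 1..5).
docs = [
    ["aunt", "aunt", "uncle", "like", "cake"],
    ["uncle", "like", "cake"],
    ["cake", "taste", "sweet"],
    ["aunt", "sweet", "like"],
    ["cake", "cake", "bake", "uncle", "sweet"],
]

# Vocabulary in the same row order as the solution tables.
vocab = ["aunt", "uncle", "like", "cake", "taste", "sweet", "bake"]

# Term counts: rows are terms, columns are documents D1..D5.
counts = np.array([[doc.count(term) for doc in docs] for term in vocab])

# Document frequency: in how many documents does each term occur?
df = (counts > 0).sum(axis=1)

# IDF with the natural logarithm, idf(t) = ln(N / df(t)).
idf = np.log(len(docs) / df)

# TF-IDF: scale each term's row of counts by that term's IDF.
tfidf = counts * idf[:, np.newaxis]

print(np.round(idf, 4))
print(np.round(tfidf, 4))
```

Other IDF variants (a different logarithm base, or smoothing terms) change the absolute numbers but follow the same pattern.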



# A2 - Application Retrieval

We can use the TF-IDF table to determine the similarity of two documents. Each document can be represented as the vector of its TF-IDF-weighted words. The cosine of the angle between two such vectors is often used as a similarity measure.

When, for example, a user searches for "car insurance", you can interpret the query as a new document and translate it into a vector over your vocabulary. Calculate the cosine similarity with every document in your corpus and return the best match.

$$ \cos(\vec{q},\vec{d}) = \frac{\vec{q}\cdot\vec{d}}{\lVert\vec{q}\rVert\cdot\lVert\vec{d}\rVert}$$
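
As a minimal sketch, the formula translates directly into NumPy. The helper name `cosine_similarity` is an illustrative choice, and returning 0.0 for an all-zero vector is an assumption made here to avoid a division by zero (for example, when no query word is in the vocabulary).

```python
import numpy as np

def cosine_similarity(q, d):
    """Cosine of the angle between the vectors q and d."""
    q = np.asarray(q, dtype=float)
    d = np.asarray(d, dtype=float)
    denom = np.linalg.norm(q) * np.linalg.norm(d)
    if denom == 0.0:
        # Assumption: a zero vector is treated as having no similarity.
        return 0.0
    return float(q @ d / denom)

print(cosine_similarity([1, 0], [2, 0]))  # same direction: 1.0
print(cosine_similarity([1, 0], [0, 3]))  # orthogonal: 0.0
```

Because the vectors are normalized by their lengths, only their direction matters: scaling a document vector by a constant does not change its similarity to the query.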



Which document would the queries $Q_1=[uncle, aunt, like]$ and $Q_2=[uncle, aunt, cake]$ return?

(To generate a query vector, create a vector that has the same dimension as the vocabulary, with 1 where a word occurs in the query and 0 everywhere else. The solution then weights each entry by the IDF of the corresponding term.)

Consider only the documents $D_1$ and $D_4$.

### Solution

$\vec{Q_1} = (1,1,1,0,0,0,0)^T \rightarrow \vec{Q_1} = (0.9162907,0.5108256,0.5108256,0,0,0,0)^T$

$\vec{Q_2} = (1,1,0,1,0,0,0)^T \rightarrow \vec{Q_2} = (0.9162907,0.5108256,0,0.2231436,0,0,0)^T$

$\vec{D_1} = (1.832581,0.5108256,0.5108256,0.2231436,0,0,0)^T$

$\vec{D_4} = (0.9162907,0,0.5108256,0,0,0.5108256,0)^T$

Similarities:

|| D1| D4|
|----|---|---|
|Q1| 0.9515| 0.8083|
|Q2| 0.9359| 0.6709|

Both queries therefore return $D_1$.

In [11]:
import numpy as np
# Vector components follow the vocabulary order: aunt, uncle, like, cake, taste, sweet, bake
q1 = np.array((0.9162907,0.5108256,0.5108256,0,0,0,0))
q2 = np.array((0.9162907,0.5108256,0,0.2231436,0,0,0))
d1 = np.array((1.832581,0.5108256,0.5108256,0.2231436,0,0,0))
# "aunt" occurs once in document 4, so its weight is 1 * 0.9162907
d4 = np.array((0.9162907,0,0.5108256,0,0,0.5108256,0))

# Cosine similarity of each query with each document, printed to four decimals
print(f"{q1@d1 / np.linalg.norm(q1) / np.linalg.norm(d1):.4f}")
print(f"{q1@d4 / np.linalg.norm(q1) / np.linalg.norm(d4):.4f}")
print(f"{q2@d1 / np.linalg.norm(q2) / np.linalg.norm(d1):.4f}")
print(f"{q2@d4 / np.linalg.norm(q2) / np.linalg.norm(d4):.4f}")

0.9515
0.8083
0.9359
0.6709
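
The same comparison can be run against the whole collection rather than only $D_1$ and $D_4$. The following sketch hard-codes the term counts and the rounded IDF values from the solution tables, builds IDF-weighted query vectors in the same way as the solution, and ranks all five documents. The names `rank_documents`, `counts`, and `idf` are illustrative and not part of the exercise.

```python
import numpy as np

vocab = ["aunt", "uncle", "like", "cake", "taste", "sweet", "bake"]

# Term counts from the document term matrix (rows: terms, columns: D1..D5).
counts = np.array([
    [2, 0, 0, 1, 0],  # aunt
    [1, 1, 0, 0, 1],  # uncle
    [1, 1, 0, 1, 0],  # like
    [1, 1, 1, 0, 2],  # cake
    [0, 0, 1, 0, 0],  # taste
    [0, 0, 1, 1, 1],  # sweet
    [0, 0, 0, 0, 1],  # bake
])
idf = np.array([0.9162907, 0.5108256, 0.5108256, 0.2231436,
                1.609438, 0.5108256, 1.609438])
tfidf = counts * idf[:, np.newaxis]  # one column per document

def rank_documents(query_words):
    """Return (document id, cosine similarity) pairs, best match first.

    Assumes at least one query word is in the vocabulary; otherwise the
    query vector is all zeros and the division below is undefined.
    """
    # Binary query vector over the vocabulary, weighted by IDF.
    q = np.array([float(term in query_words) for term in vocab]) * idf
    # Cosine similarity of q with every document column at once.
    sims = (q @ tfidf) / (np.linalg.norm(q) * np.linalg.norm(tfidf, axis=0))
    order = np.argsort(-sims)
    return [(int(i) + 1, float(sims[i])) for i in order]

for query in (["uncle", "aunt", "like"], ["uncle", "aunt", "cake"]):
    best_id, best_sim = rank_documents(query)[0]
    print(query, "-> D%d (%.4f)" % (best_id, best_sim))
```

Under these assumptions both queries rank $D_1$ first, also when all five documents are considered.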
