# TF-IDF Exercise

You have the following (tokenized) document collection:

|id | words|
|---|------|
| 1 | like, like, fruit, fly, fly|
| 2 | bee, wasp, like|
| 3 | fruit, fly|
| 4 | bee, wasp, fruit|
| 5 | fruit, fly, fly, fly|

1. Determine the Document Term Matrix.

2. Calculate all TF-IDF values. For simplicity, use max-normalization, so that the most frequent term of every document has relative term frequency 1. This is equivalent to the K-smoothing from the lecture with $K=0$.

### Solution

Document term matrix (raw counts):

| term | D1 | D2 | D3 | D4 | D5 |
|------|----|----|----|----|----|
| bee  | 0 | 1 | 0 | 1 | 0 |
| wasp | 0 | 1 | 0 | 1 | 0 |
| like | 2 | 1 | 0 | 0 | 0 |
| fruit| 1 | 0 | 1 | 1 | 1 |
| fly  | 2 | 0 | 1 | 0 | 3 |
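
The counting step can be sketched in a few lines of Python. This is only an illustration, not the lecture's reference solution: the names `docs`, `vocab`, `counts` and `dtm` are chosen here, and the term order simply follows the rows of the table.

```python
from collections import Counter

# Tokenized documents; D5 uses the counts shown in the matrix above
# (fruit: 1, fly: 3).
docs = {
    "D1": ["like", "like", "fruit", "fly", "fly"],
    "D2": ["bee", "wasp", "like"],
    "D3": ["fruit", "fly"],
    "D4": ["bee", "wasp", "fruit"],
    "D5": ["fruit", "fly", "fly", "fly"],
}
vocab = ["bee", "wasp", "like", "fruit", "fly"]  # row order of the table

# Count the terms of each document; a Counter returns 0 for missing terms.
counts = {name: Counter(tokens) for name, tokens in docs.items()}

# One row per term, one column per document.
dtm = [[counts[name][term] for name in docs] for term in vocab]

for term, row in zip(vocab, dtm):
    print(f"{term:>5}: {row}")
```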


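Relative term frequencies (max-normalized) and IDF values, with $\mathrm{IDF}(t) = \log_{10}(N / \mathrm{df}_t)$ and $N = 5$ documents:

| term | D1 | D2 | D3 | D4 | D5 | IDF |
|------|----|----|----|----|----|-----|
| bee  | 0 | 1 | 0 | 1 | 0 | 0.398 |
| wasp | 0 | 1 | 0 | 1 | 0 | 0.398 |
| like | 1 | 1 | 0 | 0 | 0 | 0.398 |
| fruit| 0.5 | 0 | 1 | 1 | 0.333 | 0.097 |
| fly  | 1 | 0 | 1 | 0 | 1 | 0.222 |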




TF-IDF values (relative term frequency times IDF):

| term |  D1| D2|D3|D4|D5|
|------|----|---|--|--|--|
| bee  | 0  |0.398|0|0.398|0|
| wasp |0   |0.398|0|0.398|0|
| like |0.398   |0.398 |0|0|0|
| fruit|0.048   |0|0.097|0.097|0.032|
| fly  |0.222|0|0.222|0|0.222|
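
The three steps (max-normalized TF, IDF, and their product) can be reproduced with NumPy. The following is a sketch under stated assumptions: the count matrix is typed in from the document term matrix above, the variable names are chosen here, and the base-10 logarithm is assumed because it reproduces the IDF values in the table.

```python
import numpy as np

# Raw counts from the document term matrix above
# (rows: bee, wasp, like, fruit, fly; columns: D1..D5).
counts = np.array([
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [2, 1, 0, 0, 0],
    [1, 0, 1, 1, 1],
    [2, 0, 1, 0, 3],
], dtype=float)

n_docs = counts.shape[1]

# Max-normalized term frequency: divide each column by its largest count.
tf = counts / counts.max(axis=0)

# Document frequency: number of documents that contain each term.
df = (counts > 0).sum(axis=1)

# IDF with a base-10 logarithm (assumption, see the text above).
idf = np.log10(n_docs / df)

# Scale every row (term) of the TF matrix by that term's IDF.
tfidf = tf * idf[:, np.newaxis]

print(np.round(idf, 3))
print(np.round(tfidf, 3))
```

Rounded to three decimals, the printed values agree with the IDF column and the TF-IDF table above.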



# A2 - Application Retrieval

We can use the TF-IDF table to determine the similarity of two documents. Each document is represented as the vector of its TF-IDF-weighted terms, and the cosine of the angle between two such vectors is commonly used as a similarity measure.

When a user searches for, say, "fruit fly", you can interpret the query as a new document and translate it into a vector over your vocabulary. Then calculate the cosine similarity with every document in your corpus and return the best match.

$$ \cos(\vec{q},\vec{d}) = \frac{\vec{q}\cdot\vec{d}}{\lVert\vec{q}\rVert\cdot\lVert\vec{d}\rVert}$$
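
The mapping from a query to a vector and the cosine formula can be sketched in Python as follows. This is a minimal sketch, not part of the exercise: `query_to_vector` and `cosine_similarity` are hypothetical helper names, the query is encoded as a binary vector, the vocabulary order (bee, wasp, like, fruit, fly) is an assumption, and returning 0.0 for an all-zero vector is a design choice made here.

```python
import numpy as np

vocab = ["bee", "wasp", "like", "fruit", "fly"]  # assumed vocabulary order


def query_to_vector(query_tokens, vocab):
    """Binary bag-of-words vector: 1 if the term occurs in the query, else 0."""
    present = set(query_tokens)
    return np.array([1.0 if term in present else 0.0 for term in vocab])


def cosine_similarity(q, d):
    """Cosine of the angle between q and d; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(q) * np.linalg.norm(d)
    return float(q @ d / denom) if denom > 0 else 0.0


q = query_to_vector(["fruit", "fly"], vocab)
print(q)  # [0. 0. 0. 1. 1.]
```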


Which document would each of the queries $Q_1=[\text{fruit}, \text{fly}]$ and $Q_2=[\text{bee}, \text{fly}]$ return?

What are the limitations and problems of this simple approach?


### Solution

With the vocabulary order (bee, wasp, like, fruit, fly), $Q_1$ becomes the binary vector

$\vec{Q}_1 = (0,0,0,1,1)^T$

Similarities:

| D1| D2| D3 | D4 | D5|
|---|---|----|----|---|
| 0.417| 0 | 0.931 | 0.120| 0.801|

Document D3 would be returned.


$\vec{Q}_2 = (1,0,0,0,1)^T$

Similarities:

| D1| D2| D3 | D4 | D5|
|---|---|----|----|---|
| 0.342| 0.408| 0.648 | 0.493| 0.700|

Document D5 would be returned.
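
As a worked example, the similarity between $\vec{Q}_2$ and D5 can be computed by hand from the TF-IDF table, assuming the vocabulary order (bee, wasp, like, fruit, fly) and a binary query vector:

$$ \cos(\vec{Q}_2,\vec{D}_5) = \frac{1\cdot 0 + 1\cdot 0.222}{\sqrt{1^2+1^2}\cdot\sqrt{0.032^2+0.222^2}} \approx \frac{0.222}{1.4142\cdot 0.2243} \approx 0.700$$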

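The similarities for $Q_2$ can be checked with NumPy:

```python
import numpy as np

# Binary query vector for Q2 = [bee, fly]
# (vocabulary order: bee, wasp, like, fruit, fly)
q = np.array([1, 0, 0, 0, 1])

# TF-IDF vectors of D1..D5, one row per document
d = np.array([[0, 0, 0.398, 0.048, 0.222],
              [0.398, 0.398, 0.398, 0, 0],
              [0, 0, 0, 0.097, 0.222],
              [0.398, 0.398, 0, 0.097, 0],
              [0, 0, 0, 0.032, 0.222]])

# Cosine similarity of the query with every document
d @ q / np.linalg.norm(q) / np.linalg.norm(d, axis=-1)
```

```
array([0.34255996, 0.40824829, 0.64795497, 0.49273655, 0.69987334])
```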

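Problems:

- Out-of-vocabulary (OOV) words: query terms that do not occur in the corpus have no vector component and are ignored.
- Memory (technical problem): every vector has one dimension per vocabulary term, so the matrix grows quickly with the corpus.
- No disambiguation: different senses of a word (e.g. "fly" as an insect and as a verb) share one dimension.
- No stemming or handling of word forms: "fly" and "flies" are treated as unrelated terms.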
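
The OOV and word-form problems can be made concrete with a short sketch. The helper `to_vector` is a hypothetical name introduced for illustration; it assumes a binary query encoding and the vocabulary order (bee, wasp, like, fruit, fly).

```python
import numpy as np

vocab = ["bee", "wasp", "like", "fruit", "fly"]  # assumed vocabulary order


def to_vector(tokens):
    # Terms outside the vocabulary are silently dropped.
    return np.array([float(term in tokens) for term in vocab])


# "flies" is a word form of "fly", but it is not in the vocabulary.
plural = to_vector(["fruit", "flies"])

# "mosquito" is entirely out-of-vocabulary.
unknown = to_vector(["mosquito"])

print(plural)                   # [0. 0. 0. 1. 0.]
print(np.linalg.norm(unknown))  # 0.0
```

The query "fruit flies" degenerates to "fruit" alone, and the query "mosquito" yields a zero vector, for which the cosine similarity is undefined because its norm is 0.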