# Words, concepts and search

You are given a collection of texts that form a dataset. Each line is a separate document. For each test we will answer the following 2 questions:
- Which of the documents is the closest to a given query?
- How many concepts (i.e. principal components) are enough to represent these texts?

Thus your result (answer) will consist of 2 integers, separated by a space: `doc_id concept_count`.

## Let's consider the test example:

`input.txt`

```
c d b.
d e a.
a b c.
a b c d.
d c a b.
a c.      # <--- the last one is the query
```

## Let's do vectorization
Reuse this code in your solutions.

In [115]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

inp = """c d b.
d e a.
a b c.
a b c d.
d c a b.
a c.""".split('\n')

dataset, query = inp[:-1], inp[-1]

vect = TfidfVectorizer(
            analyzer='word',
            stop_words=None,
            token_pattern=r"(?u)\b\w+\b"    # (?u)\b\w\w+\b -- default pattern: (?u) -- unicode modifier, \b -- word border, \w\w+ = 2+ letters
)
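tdm = vect.fit_transform(dataset).toarray()    # dense ndarray; .todense() would return the legacy np.matrix type
print("Vocabulary:", list(vect.get_feature_names_out()))    # get_feature_names() was removed in scikit-learn 1.2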
print(tdm)

print("\nIs it normed?\n")
print(tdm @ tdm.T)

Vocabulary: ['a', 'b', 'c', 'd', 'e']
[[0.         0.57735027 0.57735027 0.57735027 0.        ]
 [0.440627   0.         0.         0.440627   0.78210977]
 [0.57735027 0.57735027 0.57735027 0.         0.        ]
 [0.5        0.5        0.5        0.5        0.        ]
 [0.5        0.5        0.5        0.5        0.        ]]

Is it normed?

[[1.         0.25439612 0.66666667 0.8660254  0.8660254 ]
 [0.25439612 1.         0.25439612 0.440627   0.440627  ]
 [0.66666667 0.25439612 1.         0.8660254  0.8660254 ]
 [0.8660254  0.440627   0.8660254  1.         1.        ]
 [0.8660254  0.440627   0.8660254  1.         1.        ]]
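The diagonal of `tdm @ tdm.T` is all ones, which means every row has unit L2 norm (`TfidfVectorizer` uses `norm='l2'` by default). That is why the off-diagonal entries can be read directly as pairwise cosine similarities. A minimal sketch of the same check, using the rows copied by hand from the printed matrix above instead of re-running the vectorizer:

```python
import numpy as np

# Rows copied from the printed TF-IDF matrix above (rounded to 8 decimals).
tdm = np.array([
    [0.0,        0.57735027, 0.57735027, 0.57735027, 0.0],
    [0.440627,   0.0,        0.0,        0.440627,   0.78210977],
    [0.57735027, 0.57735027, 0.57735027, 0.0,        0.0],
    [0.5,        0.5,        0.5,        0.5,        0.0],
    [0.5,        0.5,        0.5,        0.5,        0.0],
])

# L2 norm of every document vector; a tolerance is needed because of the rounding.
row_norms = np.linalg.norm(tdm, axis=1)
print(row_norms)
print("All rows have unit norm:", np.allclose(row_norms, 1.0, atol=1e-6))
```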


### So, we are ready to answer question 1.
Which of the documents is the closest to a given query?

In [123]:
# oops

Cosine similarities of query and dataset:
 [[0.40824829]
 [0.31157034]
 [0.81649658]
 [0.70710678]
 [0.70710678]]
Best match index: 2
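One possible way to obtain these numbers, as a sketch rather than the reference solution: vectorize the query with the already fitted vectorizer and multiply. Since both the documents and the query are L2-normalized, the dot product is the cosine similarity. The variable names `q`, `sims` and `best` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = ["c d b.", "d e a.", "a b c.", "a b c d.", "d c a b."]
query = "a c."

vect = TfidfVectorizer(analyzer='word', stop_words=None, token_pattern=r"(?u)\b\w+\b")
tdm = vect.fit_transform(dataset).toarray()

# transform, not fit_transform: reuse the vocabulary and IDF weights learned from the dataset.
q = vect.transform([query]).toarray()

# Rows of tdm and q have unit L2 norm, so the dot product equals the cosine similarity.
sims = tdm @ q.T
best = int(np.argmax(sims))

print("Cosine similarities of query and dataset:\n", sims)
print("Best match index:", best)
```

Note that the index is 0-based, so `2` refers to the third line of `input.txt`.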


### Time for question 2.
How many concepts are enough to express our dataset?

In other words, how many principal components do we need to preserve at least `0.95` of the variance?

**NB: when doing PCA, think about how the method works! Can you just take the data and run PCA? Will it change the cosine metric? Can you do this without changing the coordinate origin?**

In [124]:
# oops

If we take 0 components, variance = 0
If we take 1 components, variance = 0.7506742465708932
If we take 2 components, variance = 0.9205411757396144
If we take 3 components, variance = 0.987207842406281
If we take 4 components, variance = 1.0
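The printed ratios are consistent with one particular reading of the hint, which is assumed here: take the SVD of the TF-IDF matrix without centering it (centering would move the origin and so change the cosines), and measure the "variance" of the first `k` components as the share of squared singular values they hold. The sketch below follows that assumption; it is not necessarily the code that produced the output above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = ["c d b.", "d e a.", "a b c.", "a b c d.", "d c a b."]

vect = TfidfVectorizer(analyzer='word', stop_words=None, token_pattern=r"(?u)\b\w+\b")
tdm = vect.fit_transform(dataset).toarray()

# SVD of the uncentered matrix; singular values are returned in descending order.
s = np.linalg.svd(tdm, compute_uv=False)

# Cumulative share of the squared singular values held by the first i components.
ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
for i, r in enumerate(ratio, start=1):
    print(f"If we take {i} components, variance = {r}")

# Smallest number of components whose cumulative share reaches the 0.95 threshold.
k = int(np.searchsorted(ratio, 0.95) + 1)
print("Components needed:", k)
```

The last singular value is numerically zero because documents 3 and 4 (0-based) have identical TF-IDF vectors, which is why 4 components already capture everything.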


In [129]:
k = 3
# oops

Cosine similarities of query and dataset (after reduction to 3 PC):
 [[0.41915156]
 [0.31571551]
 [0.82739985]
 [0.69640892]
 [0.69640892]]


Almost the same! The reduction to 3 components barely changes the similarities, and the best match is still document 2.
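A sketch of how such a check could look, under the same uncentered-SVD assumption: project both the documents and the query onto the first `k` right singular vectors and take dot products. The printed values appear to correspond to these raw dot products, without re-normalizing the projected vectors; a strict cosine would additionally divide by the projected norms. The names `basis`, `docs_k` and `q_k` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = ["c d b.", "d e a.", "a b c.", "a b c d.", "d c a b."]
query = "a c."

vect = TfidfVectorizer(analyzer='word', stop_words=None, token_pattern=r"(?u)\b\w+\b")
tdm = vect.fit_transform(dataset).toarray()
q = vect.transform([query]).toarray()

k = 3

# Rows of vt are the principal directions of the uncentered matrix (term space).
_, _, vt = np.linalg.svd(tdm, full_matrices=False)
basis = vt[:k].T                  # shape: (n_terms, k)

# Coordinates of the documents and the query in the k-dimensional concept space.
docs_k = tdm @ basis
q_k = q @ basis

# Dot products in the reduced space (no re-normalization, see the note above).
sims_k = docs_k @ q_k.T
print(f"Similarities of query and dataset (after reduction to {k} PC):\n", sims_k)
print("Best match index:", int(np.argmax(sims_k)))
```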

### And the answer is...

Thus, looking at the results of the previous blocks, we can state:
    
`output.txt`
```
2 3
```