# 1. Evolving vector-space model
This part is devoted to using ML models for information retrieval and text classification.

### **Searching in the curious facts database**

The facts dataset is available [here](https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt); take a look. We want you to retrieve facts **relevant to the query** (whatever that means). For example, you type "good mood" and learn that Cherophobia is the fear of fun. The idea is to use document vectors. However, instead of building tf-idf vectors and reducing their dimensionality, this time we want to obtain fixed-size document vectors from a pretrained ML model.

## 1.1. Use neural networks to embed sentences

Use any model you like, from doc2vec up to Transformers. Some options:


- [USE (Universal Sentence Encoder) in spacy 2](https://spacy.io/universe/project/spacy-universal-sentence-encoder) (`!pip install spacy-universal-sentence-encoder`)
- [Sentence BERT in spacy 2](https://spacy.io/universe/project/spacy-sentence-bert) (`!pip install spacy-sentence-bert`)
- [Pretrained 🤗 Transformers](https://huggingface.co/transformers/pretrained_models.html)
- [Spacy 3 transformers](https://spacy.io/usage/embeddings-transformers#transformers-installation)
- [doc2vec pretrained](https://github.com/jhlau/doc2vec)
- [Some more sentence transformers](https://www.sbert.net/docs/quickstart.html) <-- the simplest option.

Install the dependencies first:

In [None]:
!pip install #...

And then import the library:

In [None]:
#TODO add your code here

## 1.2. Write a function that computes a sentence embedding

Implement a function that returns a fixed-size vector for a given sentence.
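One possible shape for such a function is sketched below, assuming the `sentence-transformers` package from the list above is installed. The model name `all-MiniLM-L6-v2`, the helper `load_model`, and the optional `model` argument are illustrative assumptions, not part of the assignment; any object with an `encode` method would work.

```python
_model = None  # cached model, loaded on first use


def load_model(name="all-MiniLM-L6-v2"):
    """Load the model once and reuse it (the model name is an assumption)."""
    global _model
    if _model is None:
        # Imported lazily so this cell can be defined before the package is used.
        from sentence_transformers import SentenceTransformer
        _model = SentenceTransformer(name)
    return _model


def embed(text, model=None):
    """Return a fixed-size vector for `text`.

    `model` is any object with an `encode(text)` method; by default the
    cached sentence-transformers model is used.
    """
    if model is None:
        model = load_model()
    return model.encode(text)
```

Caching the model matters because loading it is far slower than encoding a single sentence.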

In [None]:
def embed(text):
    #TODO return a fixed-sized vector of embedding
    return []

In [None]:
assert len(
            embed("Some random text")
        ) == len(
            embed("Folks, here's a story about Minnie the Moocher. She was a lowdown hoochie coocher. She was the roughest, toughest frail, but Minnie had a heart as big as a whale")
        ), "Length should match"

In [None]:
from scipy.spatial.distance import cosine

assert abs(cosine(
            embed("some text for testing"), 
            embed("some text for testing")
        )) < 1e-4, "Embeddings should be equal"

assert abs(cosine(
            embed("Cats eat mice."), 
            embed("Terminator is an autonomous cyborg, typically humanoid, originally conceived as a virtually indestructible soldier, infiltrator, and assassin.")
        )) > 0.2, "Embeddings should be far"

## 1.3. Reading the data

Now, let's read the facts dataset. Download it from the URL given above and read it into a list of sentences.
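A minimal sketch of the download step using only the standard library follows. It assumes the file holds one fact per line and that blank lines can be skipped; `parse_facts` and `read_facts` are hypothetical helper names. If the length check on the dataset fails, the line-splitting assumption is the first thing to revisit.

```python
from urllib.request import urlopen


def parse_facts(text):
    """Split raw text into non-empty, stripped lines (one fact per line is assumed)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def read_facts(url):
    """Download the file at `url` and parse it into a list of facts."""
    with urlopen(url) as response:
        return parse_facts(response.read().decode("utf-8"))
```

With these helpers, `facts = read_facts(url)` would fill the list.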

In [None]:
url = "https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt"

#TODO read facts into a list of facts
facts = []

In [None]:
print(*facts[:5], sep='\n')

assert len(facts) == 159
assert ('our lovely little planet') in facts[0]

## 1.4. Transforming sentences to vectors

Transform the list of facts to `numpy.array` of vectors corresponding to each document (`sent_vecs`), inferring them from the model.
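One way to build the matrix is to embed each fact and stack the results row by row. The sketch below assumes every embedding has the same length; `embed_all` is a hypothetical helper that takes the embedding function as an argument.

```python
import numpy as np


def embed_all(texts, embed_fn):
    """Stack one embedding per text into an array of shape (len(texts), dim)."""
    # np.vstack needs at least one row, so `texts` must not be empty.
    return np.vstack([np.asarray(embed_fn(t), dtype=float) for t in texts])
```

For example, `sent_vecs = embed_all(facts, embed)`. Many libraries can also encode a whole list in one batched call, which is usually faster than a Python loop.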

In [None]:
#TODO infer vectors

In [None]:
assert sent_vecs.shape[0] == len(facts)

## 1.5. Find closest

Now find 5 facts which are closest to the query using cosine similarity.
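The ranking step can be sketched with plain NumPy as below. It assumes `doc_vecs` is a 2-D array with one non-zero vector per row; `top_k` is a hypothetical helper name.

```python
import numpy as np


def top_k(query_vec, doc_vecs, k=5):
    """Return indices of the k rows most cosine-similar to `query_vec`, best first."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity: dot product divided by the product of the norms.
    sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    # argsort is ascending, so negate to put the highest similarity first.
    return np.argsort(-sims)[:k]
```

Calling `top_k(embed(query), sent_vecs)` would then give the indices to look up in `facts`. Note that `scipy.spatial.distance.cosine` returns a distance (1 minus the similarity), so sorting by it would need ascending order instead.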

In [31]:
#TODO output closest facts to the query
query = "good mood"

print("Results for query:", query)
for k in #...
    print("\t", facts[k])

Results for query: good mood
	 67. The chance of you dying on the way to get lottery tickets is actually greater than your chance of winning.
	 84. You are 1% shorter in the evening than in the morning
	 57. Gorillas burp when they are happy
	 116. Male dogs lift their legs when they are urinating for a reason. They are trying to leave their mark higher so that it gives off the message that they are tall and intimidating.
	 10. If you believe that you're truly one in a million, there are still approximately 7,184 more people out there just like you.
	 60. It is considered good luck in Japan when a sumo wrestler makes your baby cry.


## 1.6. Recommend 5 facts for each of the queries in the following query bucket
```
good mood
gorilla
woman
earth
japan
people
math
```

Recommend 5 facts for each of the queries. Write your code below.

In [37]:
# write your code or computations here

good mood
	 60. It is considered good luck in Japan when a sumo wrestler makes your baby cry.
	 10. If you believe that you're truly one in a million, there are still approximately 7,184 more people out there just like you.
	 116. Male dogs lift their legs when they are urinating for a reason. They are trying to leave their mark higher so that it gives off the message that they are tall and intimidating.
	 57. Gorillas burp when they are happy
	 84. You are 1% shorter in the evening than in the morning
gorilla
	 106. The male ostrich can roar just like a lion.
	 85. The elephant is the only mammal that can't jump!
	 107. Mountain lions can whistle.
	 57. Gorillas burp when they are happy
	 139. Beetles taste like apples, wasps like pine nuts, and worms like fried bacon.
woman
	 131. If a pregnant woman has organ damage, the baby in her womb sends stem cells to help repair the organ.
	 65. A Swedish woman lost her wedding ring, and found it 16 years later- growing on a carrot in her gar

## 1.7. [BONUS] Write your own relevance assessments and compute DCG@5
Compare your results across the group. Which embedding model performs best?
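For reference, one common definition is DCG@k = Σ rel_i / log2(i + 1) over the first k ranked positions i = 1..k. The sketch below assumes that variant (another popular one uses 2^rel_i − 1 in the numerator); `dcg_at_k` is a hypothetical name, and `rels` is the list of relevance grades in ranked order.

```python
import math


def dcg_at_k(rels, k=5):
    """DCG@k: sum of rel_i / log2(i + 1) over the first k positions (1-based)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))
```

The logarithmic discount means a relevant fact at rank 1 counts in full, while the same fact at rank 3 counts only half.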

In [None]:
#TODO add assessment
assessment = []

def dcg(rels):
    #TODO compute DCG@5
    return res

sum([dcg(row) for row in assessment]) / len(assessment)