# Vector space with ML

This lab is devoted to using ML models for information retrieval and text classification.

**Searching in the curious facts database**

The facts dataset is available [here](https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt); take a look. We want you to retrieve facts **relevant to the query** (whatever that means). For example, you type "good mood" and learn that Cherophobia is the fear of fun. The idea is to use document vectors. However, instead of building tf-idf vectors and reducing their dimensionality, this time we want to obtain fixed-size document vectors from an ML model.

## 1. Use neural networks to embed sentences

Use any model you like, from doc2vec up to Transformers. Provide all code, dependencies, and installation requirements.


- [USE (Universal Sentence Encoder) in spacy 2](https://spacy.io/universe/project/spacy-universal-sentence-encoder) (`!pip install spacy-universal-sentence-encoder`)
- [Sentence BERT in spacy 2](https://spacy.io/universe/project/spacy-sentence-bert) (`!pip install spacy-sentence-bert`)
- [Pretrained 🤗 Transformers](https://huggingface.co/transformers/pretrained_models.html)
- [Spacy 3 transformers](https://spacy.io/usage/embeddings-transformers#transformers-installation)
- [doc2vec pretrained](https://github.com/jhlau/doc2vec)
- [Some more sentence transformers](https://www.sbert.net/docs/quickstart.html)
- [Even fasttext can do a sentence embedding](https://fasttext.cc/docs/en/python-module.html#model-object)

Put dependency installation, download instructions, and so on here, together with their outputs.

In [1]:
!pip install fasttext



And then use the library to download (and load) the model:

NB: model downloading may take time (depending on the model hosting). If you think it may take a long time, ask your TA for assistance with binaries.

In [2]:
# THIS BLOCK TAKES ~ 1h
import fasttext, fasttext.util
fasttext.util.download_model('en', if_exists='ignore')
fasttext.load_model('cc.en.300.bin')



<fasttext.FastText._FastText at 0x7f91b1e7b460>

## 2. Write a function that prepares embedding of arbitrary queries

Write a function that returns a fixed-size embedding vector for an arbitrary text.

In [3]:
import fasttext
import numpy as np
ft = fasttext.load_model('cc.en.300.bin')



In [4]:
def embed(text):
    return ft.get_sentence_vector(text)
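
To give some intuition for what `get_sentence_vector` returns: as far as we understand, fastText roughly averages the L2-normalized vectors of the whitespace-separated words. The sketch below imitates that with a hypothetical three-word lookup table (`toy_word_vectors` and `toy_sentence_vector` are illustration-only names, not part of fastText). Note that the averaged vector is generally not unit-length, which matters once we compare vectors.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for a real model's lookup table.
toy_word_vectors = {
    "cats": np.array([3.0, 0.0, 0.0]),
    "eat": np.array([0.0, 2.0, 0.0]),
    "mice": np.array([0.0, 0.0, 5.0]),
}

def toy_sentence_vector(text, word_vectors, dim=3):
    """Average the L2-normalized vectors of the whitespace-separated words."""
    total = np.zeros(dim)
    count = 0
    for word in text.split():
        vec = word_vectors.get(word)
        if vec is None:
            continue  # a real fastText model would build a vector from subwords instead
        norm = np.linalg.norm(vec)
        if norm > 0:
            total += vec / norm
            count += 1
    # Texts with no known words fall back to the zero vector.
    return total / count if count else total

v = toy_sentence_vector("cats eat mice", toy_word_vectors)
print(v.shape, np.linalg.norm(v))
```

Whatever the length of the input text, the result has the same shape, which is exactly the property the assertions in the next cell check.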

Here we check that embeddings of different texts have the same shape.

In [5]:
assert embed(
            "Some random text"
        ).shape == \
        embed(
            "Folks, here's a story about Minnie the Moocher. "
            "She was a lowdown hoochie coocher. "
            "She was the roughest, toughest frail, "
            "but Minnie had a heart as big as a whale"
        ).shape, "Shape should match"

NB: here we check DISTANCE, not similarity. Thus identical texts should produce a result close to 0, and unrelated texts a noticeably larger one.

In [6]:
from scipy.spatial.distance import cosine

assert abs(cosine(
            embed("some text for testing"), 
            embed("some text for testing")
        )) < 1e-4, "Embedding should match"

assert abs(cosine(
            embed("Cats eat mice."), 
            embed("Terminator is an autonomous cyborg, typically humanoid, originally conceived as a virtually indestructible soldier, infiltrator, and assassin.")
        )) > 0.2, "Embeddings should be far"
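
The relation between cosine distance and cosine similarity can be sketched in a few lines of NumPy. The helper names and toy vectors below are assumptions made for illustration; the distance follows the same `1 - similarity` convention as `scipy.spatial.distance.cosine`.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 2 for opposite directions."""
    return 1.0 - cosine_similarity(a, b)

# Toy vectors (not real embeddings).
same = cosine_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # identical direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])       # unrelated direction
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])        # opposite direction
print(same, orthogonal, opposite)
```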

## 3. Read the data

Now, let's read the facts dataset. Download it from the URL above and read it into a list of sentences.

In [7]:
import requests
url = "https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt"

# Read the facts into a list; each fact (one line of the file) is a separate element.
facts = requests.get(url).text.split('\n')
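
If the downloaded text ever contains blank lines or a trailing newline, a plain `split('\n')` would produce empty "facts". A minimal sketch of a more defensive parser, shown on a small inline string so that it runs without network access (`parse_facts` and `sample_text` are hypothetical names used only here):

```python
def parse_facts(raw_text):
    """Split raw text into non-empty, whitespace-trimmed lines."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]

# Stand-in for the downloaded file content.
sample_text = "1. First fact.\n\n2. Second fact.\n"
parsed = parse_facts(sample_text)
print(parsed)
```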

In [8]:
print(*facts[:5], sep='\n')

assert len(facts) == 159
assert ('our lovely little planet') in facts[0]

1. If you somehow found a way to extract all of the gold from the bubbling core of our lovely little planet, you would be able to cover all of the land in a layer of gold up to your knees.
2. McDonalds calls frequent buyers of their food "heavy users."
3. The average person spends 6 months of their lifetime waiting on a red light to turn green.
4. The largest recorded snowflake was in Keogh, MT during year 1887, and was 15 inches wide.
5. You burn more calories sleeping than you do watching television.


## 4. Transform sentences to vectors

Transform the list of facts to `numpy.array` of vectors corresponding to each document (`sent_vecs`), inferring them from the model we just loaded.

In [9]:
import numpy as np

# Infer one vector per fact and stack them into a 2-D array (one row per fact).
sent_vecs = np.array([embed(f) for f in facts])

In [10]:
assert sent_vecs.shape[0] == len(facts)

## 5. Find closest to the query

Now find the 5 facts that are closest to the query using the cosine measure.

### 5.1. Closest search

In [11]:
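def find_k_closest(query, dataset, k=10):
    # Rank rows by their dot product with the query. This equals the cosine
    # ranking when all rows of `dataset` have the same norm.
    # Returns k indices in ascending order of score (best match last).
    return np.argsort(dataset @ query)[-k:]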
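
`find_k_closest` above ranks by raw dot product, which can differ from the cosine ranking when document vectors have different lengths. A possible cosine-based variant is sketched below; `find_k_closest_cosine` and the toy matrix are assumptions for illustration, not part of the lab template. Unlike the function above, it returns the best match first.

```python
import numpy as np

def find_k_closest_cosine(query, dataset, k=10):
    """Indices of the k rows most cosine-similar to `query`, best match first."""
    dataset = np.asarray(dataset, dtype=float)
    query = np.asarray(query, dtype=float)
    # Guard against zero vectors (e.g. an empty line) to avoid division by zero.
    row_norms = np.linalg.norm(dataset, axis=1)
    row_norms[row_norms == 0] = 1.0
    query_norm = np.linalg.norm(query) or 1.0
    sims = (dataset @ query) / (row_norms * query_norm)
    # argsort is ascending, so negate the scores to get the best matches first.
    return np.argsort(-sims)[:k]

# Toy data: row 3 points exactly along the query but is very short.
toy = np.array([[10.0, 1.0], [1.0, 1.0], [0.0, 3.0], [0.1, 0.0]])
toy_query = np.array([1.0, 0.0])
top = find_k_closest_cosine(toy_query, toy, k=2)
print(top)
```

On this toy data the dot product would rank the long row 0 first and miss row 3 entirely, while the cosine variant puts row 3 on top.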

### 5.2. Use your function

In [12]:
query = "good mood"
query_vec = embed(query)

print("Results for query:", query)
print()
for k in find_k_closest(query_vec, sent_vecs, 5):
    print("\t", facts[k])

Results for query: good mood

	 84. You are 1% shorter in the evening than in the morning
	 57. Gorillas burp when they are happy
	 116. Male dogs lift their legs when they are urinating for a reason. They are trying to leave their mark higher so that it gives off the message that they are tall and intimidating.
	 10. If you believe that you're truly one in a million, there are still approximately 7,184 more people out there just like you.
	 60. It is considered good luck in Japan when a sumo wrestler makes your baby cry.


## 6. Measure DCG@5 for the following query bucket
```
good mood
gorilla
woman
earth
japan
people
math
```

Recommend 5 facts for each of the queries. Write your code below.

In [13]:
bucket = """good mood
gorilla
woman
earth
japan
people
math""".split('\n')

for term in bucket:
    print(term)
    for k in find_k_closest(embed(term), sent_vecs, k=5)[::-1]:
        print("\t", facts[k])

good mood
	 60. It is considered good luck in Japan when a sumo wrestler makes your baby cry.
	 10. If you believe that you're truly one in a million, there are still approximately 7,184 more people out there just like you.
	 116. Male dogs lift their legs when they are urinating for a reason. They are trying to leave their mark higher so that it gives off the message that they are tall and intimidating.
	 57. Gorillas burp when they are happy
	 84. You are 1% shorter in the evening than in the morning
gorilla
	 106. The male ostrich can roar just like a lion.
	 85. The elephant is the only mammal that can't jump!
	 107. Mountain lions can whistle.
	 57. Gorillas burp when they are happy
	 139. Beetles taste like apples, wasps like pine nuts, and worms like fried bacon.
woman
	 131. If a pregnant woman has organ damage, the baby in her womb sends stem cells to help repair the organ.
	 65. A Swedish woman lost her wedding ring, and found it 16 years later- growing on a carrot in her gar

## 7. [BONUS] Write your own relevance assessments and compute DCG@5
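
For reference, the `dcg` function in the cell below corresponds to the usual definition, where $rel_i$ is the relevance label of the result at 1-based rank $i$ and $k = 5$:

$$
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}, \qquad
\mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
$$

Here $\mathrm{IDCG@}k$ is the DCG of an ideal ranking. The cell below takes a list of five relevant results as that ideal and normalizes the mean DCG over all queries by the mean IDCG.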

In [14]:
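# One row per query, one binary relevance label per returned fact (in ranked order).
assessments = [
    [0, 0, 0, 1, 0], # good mood
    [0, 0, 0, 1, 0], # gorilla
    [1, 1, 0, 0, 0], # woman
    [1, 1, 0, 1, 0], # earth
    [1, 0, 0, 0, 1], # japan
    [1, 1, 1, 1, 1], # people
    [1, 1, 0, 0, 0]  # math
]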

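# Upper bound used for normalization: all 5 results are relevant for every query.
optimal = [[1] * 5] * 7

def dcg(rels):
    # Discounted cumulative gain: the result at 0-based position i is discounted by log2(i + 2).
    from math import log
    s = 0
    for i, rel in enumerate(rels):
        s += rel / log(1 + i + 1, 2)
    return s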

dcg5 = sum([dcg(row) for row in assessments]) / len(assessments)
idcg5 = sum([dcg(row) for row in optimal]) / len(optimal)

print(f"DCG@5 = {dcg5:.4f}")
print(f"IDCG@5 = {idcg5:.4f}")
print(f"nDCG@5 = {dcg5 / idcg5:.4f}")

DCG@5 = 1.5029
IDCG@5 = 2.9485
nDCG@5 = 0.5097
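
An alternative that is sometimes used is to normalize each query separately, taking the ideal ranking to be the same labels sorted in descending order. The sketch below assumes that simplification (it ignores relevant facts that were not retrieved at all), and `dcg_at_k` / `ndcg_at_k` are hypothetical helper names.

```python
from math import log2

def dcg_at_k(rels, k=5):
    """DCG over the first k labels; rank is 1-based, discount is log2(rank + 1)."""
    return sum(rel / log2(rank + 1) for rank, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k=5):
    """DCG divided by the DCG of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Labels for the "good mood" query from the assessments above.
example = [0, 0, 0, 1, 0]
print(f"nDCG@5 = {ndcg_at_k(example):.4f}")
```

With this normalization a query whose relevant facts are already ranked first scores 1.0 even if only two of the five results are relevant, so the numbers are not directly comparable with the averaged figure printed above.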
