# Embedding Model Task
In this notebook, you will work with embedding models to convert text into vector representations. Using these embeddings, you will perform a similarity search, identifying the sentences most similar to a given target sentence. Finally, you will interpret the results in light of the semantic meaning of the sentences. Refer to [the README](README.md) for more detailed guidance on how to approach this task.

## Task 1: Defining the Data
In this task, your target sentence will be:

>***A polar bear's fur is actually transparent, and not white (as is commonly believed).***

A list of miscellaneous sentences is provided in the `data.txt` file. This will serve as the dataset of texts from which you will identify the closest match to the sentence above. Start by loading the data from this file and splitting it into individual sentences.
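
As a starting point, here is a minimal sketch of one way to read the file. It assumes that `data.txt` stores one sentence per line, which may not match the actual layout of the file; if the sentences are separated differently, the splitting step will need to change. The helper name `load_sentences` is hypothetical and used only for illustration.

```python
from pathlib import Path


def load_sentences(path):
    """Return the non-empty lines of a text file, stripped of whitespace.

    Assumption: the file holds one sentence per line.
    """
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Example usage (assumes data.txt sits next to the notebook):
# sentences = load_sentences("data.txt")
# print(len(sentences), "sentences loaded")
```

Dropping blank lines and stripping whitespace keeps empty strings out of the dataset, so they are never passed to the embedding model later on.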

In [None]:
# Task 1: Load the data from data.txt. The target sentence has been
# defined here for your convenience.
target = "A polar bear's fur is actually transparent, and not white (as is commonly believed)."


## Task 2: Embedding the Sentences
With the sentences loaded, you can now produce vector representations of them using a text embedding model. Examples of such models can be found on the Hugging Face website as [feature extraction models](https://huggingface.co/models?pipeline_tag=feature-extraction&sort=downloads&search=embed) or [sentence similarity models](https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads), or on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) (Massive Text Embedding Benchmark) leaderboard.

Your task in this part is to choose an appropriate model and use it to produce vector embeddings for both the sentences from `data.txt` and the target sentence.
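
The sketch below shows one possible shape for this step. It assumes a model object that exposes an `encode` method accepting a list of strings, as the `SentenceTransformer` class from the `sentence-transformers` library does. The helper name `embed_texts` and the model named in the comments are illustrative choices, not requirements of the task.

```python
def embed_texts(model, sentences, target):
    """Embed the dataset sentences and the target with the same model.

    Assumption: `model.encode` takes a list of strings and returns one
    vector per string, in the same order.
    """
    vectors = model.encode(list(sentences) + [target])
    # The last vector belongs to the target; the rest are the dataset.
    return vectors[:-1], vectors[-1]


# Example usage (requires `pip install sentence-transformers`; the model
# name is one common choice, picked here purely for illustration):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# sentence_vecs, target_vec = embed_texts(model, sentences, target)
```

Whatever model you choose, embed the target and the dataset with the same one, since vectors produced by different models are not comparable.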

In [None]:
# Task 2: Embed the sentences using an appropriate model.


## Task 3: Calculating Sentence Similarity
With these vector embeddings, you can now determine which sentences are most similar to the target. To do this, you will need a distance metric that quantifies how similar two sentences are.

Your task is to choose an appropriate distance metric and calculate it between the target sentence and every sentence from `data.txt`. Using these distances, display the five sentences most similar to the target, along with their calculated scores. Please state explicitly which distance metric you chose in the markdown cell after the next code cell.
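
For illustration, the sketch below ranks sentences by cosine similarity, which is one common choice among several (Euclidean distance and dot product are others). Note that cosine similarity is a similarity rather than a distance: higher values mean more similar, and the corresponding cosine distance is 1 minus the similarity. The function names are hypothetical, and the code assumes each embedding is a non-zero sequence of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_similar(target_vec, sentence_vecs, sentences, k=5):
    """Return the k (score, sentence) pairs with the highest similarity."""
    scored = [
        (cosine_similarity(target_vec, vec), sentence)
        for vec, sentence in zip(sentence_vecs, sentences)
    ]
    # Sort by score only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Example usage with embeddings computed earlier:
# for score, sentence in top_k_similar(target_vec, sentence_vecs, sentences):
#     print(f"{score:.3f}  {sentence}")
```

If you use a true distance instead, remember to sort in ascending order, since smaller distances indicate closer sentences.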

In [None]:
# Task 3: Calculate Sentence Similarities between the target
# sentence and every sentence from the data.txt file. Use
# these similarities to display the five most similar sentences
# to the target.



#### Distance metric: 

## Task 4: Explaining Results
Another researcher performed a similar task and obtained the following results, using their own choice of model and metric (here, higher scores indicate more similar sentences):

|Sentence|Similarity Score|
|--------|--------|
|The fur of a polar bear is transparent, not white.|0.91|
|Polar bears are renowned for their white fur.|0.78|
|A polar bear's skin is black underneath its fur.|0.74|
|Fish are a type of animal that do not have fur.|0.68|
|Grizzly bears are carnivorous mammals with brown fur.|0.66|

Do you notice anything interesting about how the semantic meaning of these sentences compares with that of the target? Specifically, are you surprised that any of these sentences received a high similarity score? What does this tell you about the suitability of text embeddings for fact-checking? Please put your answer in the markdown cell below.

### Response:
