<a href="https://colab.research.google.com/github/CoronaWhy/team-literature-review/blob/master/tlr/faiss_document_similarity_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document Similarity Search


This notebook shows how to run a document similarity search on the CORD-19 dataset using [FAISS](https://github.com/facebookresearch/faiss). It is based on the [Christine Chen & Coronawhy Task Ties Team](https://www.kaggle.com/crispyc/coronawhy-task-ties-patient-descriptions#Code) submission to round 2 of the [CORD-19 Research Challenge](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge).


## Requirements

### Python requirements

The dependencies required by this code are listed in the `requirements.txt` file and can be installed with:

```
pip install -r requirements.txt
```

### Vector embeddings

In order to perform the similarity search we need the documents transformed into vector embeddings.
We will use the [embeddings](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge?select=cord_19_embeddings) provided by the authors of the CORD-19 dataset.

If you have your [Kaggle credentials](https://github.com/Kaggle/kaggle-api#api-credentials) set up, the embeddings file is downloaded automatically the first time you run the search and stored for later use. This may take some time, as the file is 4 GB. Alternatively, you can download the file by hand and pass its path as an argument.
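
To make the idea of searching over vector embeddings concrete, here is a minimal brute-force sketch in pure Python. It is not the code this notebook runs: FAISS performs this kind of nearest-neighbour lookup far more efficiently, and the choice of Euclidean distance, the function names, and the toy data are all assumptions made for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_documents(query_uid, embeddings, num_results=5):
    """Return the uids closest to query_uid, nearest first (query excluded)."""
    query = embeddings[query_uid]
    scored = [
        (l2_distance(query, vector), uid)
        for uid, vector in embeddings.items()
        if uid != query_uid
    ]
    scored.sort()  # smallest distance first
    return [uid for _, uid in scored[:num_results]]


# Toy 3-dimensional embeddings keyed by made-up uids (hypothetical data).
toy_embeddings = {
    'doc_a': [1.0, 0.0, 0.0],
    'doc_b': [0.9, 0.1, 0.0],
    'doc_c': [0.0, 1.0, 0.0],
    'doc_d': [0.0, 0.0, 2.0],
}

print(nearest_documents('doc_a', toy_embeddings, num_results=2))
```

Comparing the query against every stored vector scales linearly with the size of the collection, which is why a dedicated index is preferable for a dataset the size of CORD-19.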


## Quickstart

```python
from document_similarity_search import document_similarity_search

# CORD-19 uids of the documents to search from
uids = [
    '02tnwd4m',
    'byp2eqhd',
]
# Titles of the same documents, in the same order as `uids`
titles = [
    'Nitric oxide: a pro-inflammatory mediator in lung disease?',
    'Immune pathways and defence mechanisms in honey bees Apis mellifera',
]

# Number of similar documents to return
num_results = 5
document_similarity_search(uids=uids, titles=titles, num_results=num_results)
```

## What's next?

### MongoDB

To retrieve the articles, the CORD-19 dataset must be available in a MongoDB database.

By default we use the one in the [Coronawhy Infrastructure](https://www.coronawhy.org/services), but you can use your own by changing the `mongodb` section in the `setup.cfg` file.
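
As a rough illustration of how a `mongodb` section in an INI-style file such as `setup.cfg` can be read, the sketch below uses Python's standard `configparser`. The option names (`host`, `port`, `database`) and their values are assumptions, not the keys this project actually defines; check the shipped `setup.cfg` for the real ones.

```python
import configparser

# Hypothetical contents of setup.cfg; the option names are assumptions.
EXAMPLE_CFG = """
[mongodb]
host = localhost
port = 27017
database = cord19
"""


def read_mongodb_settings(cfg_text):
    """Parse the [mongodb] section of an INI-style string into a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    section = parser['mongodb']
    return {
        'host': section.get('host'),
        'port': section.getint('port'),  # converted from text to int
        'database': section.get('database'),
    }


settings = read_mongodb_settings(EXAMPLE_CFG)
print(settings)
```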


### Embeddings

You can use your own set of embeddings to build the search index; just pass it as an argument:

```python
import pandas as pd
from document_similarity_search import document_similarity_search

# CORD-19 uids of the documents to search from
uids = [
    '02tnwd4m',
    'byp2eqhd',
]
# Titles of the same documents, in the same order as `uids`
titles = [
    'Nitric oxide: a pro-inflammatory mediator in lung disease?',
    'Immune pathways and defence mechanisms in honey bees Apis mellifera',
]
num_results = 5

# Load the custom embeddings and use them instead of the default ones
embeddings = pd.read_csv('my_embeddings.csv')
document_similarity_search(uids=uids, titles=titles, num_results=num_results, embeddings=embeddings)
```
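
Before passing in a custom file, it can help to check that every row has the same number of components. The sketch below assumes a layout of one document per row, with the uid in the first column and the vector components in the remaining columns. The layout `document_similarity_search` actually expects is not specified here, so treat this layout, the `load_embeddings` helper, and the sample data as hypothetical.

```python
import csv
import io

# Two made-up rows standing in for a much larger file (hypothetical data).
SAMPLE_CSV = "doc_a,0.1,0.2,0.3\ndoc_b,0.4,0.5,0.6\n"


def load_embeddings(handle):
    """Parse rows of `uid, x1, x2, ...` into a dict of uid -> list of floats."""
    embeddings = {}
    dimension = None
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        uid, values = row[0], [float(v) for v in row[1:]]
        if dimension is None:
            dimension = len(values)  # first row fixes the expected size
        elif len(values) != dimension:
            raise ValueError(
                f'{uid}: expected {dimension} values, got {len(values)}'
            )
        embeddings[uid] = values
    return embeddings


embeddings = load_embeddings(io.StringIO(SAMPLE_CSV))
print(len(embeddings), len(embeddings['doc_a']))
```

To check a real file, open it with `open('my_embeddings.csv', newline='')` and pass the handle in place of the `io.StringIO` object.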
