<a href="https://colab.research.google.com/github/alofgran/coronawhy_lit_review_tool/blob/master/faiss_document_similarity_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **FAISS Document Similarity Search for CORD-19 dataset**

This notebook draws from [Christine Chen's work](https://www.kaggle.com/crispyc/coronawhy-task-ties-patient-descriptions) on the Kaggle round 2 competition regarding the CORD-19 dataset.  

**Data requirements**
* document [embeddings](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge?select=cord_19_embeddings) (currently document embeddings derived from SPECTER)


**Input:** `k`, plus either an article identifier or an article title (not both)
* `k` - the number of similar articles desired
* an article identifier (`cord_uid`)
* an article title


**Output:**
* a list of the IDs of `k` similar articles

# New API Format

The code below shows how to run the same FAISS document similarity search using the new API.

In [1]:
import pandas as pd

from faiss_document_similarity_search import get_uids, get_faiss_artifacts, document_similarity_search

In [2]:
uids = ['02tnwd4m', '8zchiykl', 'gdsfkw1b', 'byp2eqhd', '7gk8uzo0', '0mtmodmo']

titles = [
    'Nitric oxide: a pro-inflammatory mediator in lung disease?',
    'The 21st International Symposium on Intensive Care and Emergency Medicine, Brussels, Belgium, 20-23 March 2001',
    'Protein secretion in Lactococcus lactis: an efficient way to increase the overall heterologous protein production',
    'Immune pathways and defence mechanisms in honey bees Apis mellifera',
    'Species-specific evolution of immune receptor tyrosine based activation motif-containing CEACAM1-related immune receptors in the dog',
    'Novel, Divergent Simian Hemorrhagic Fever Viruses in a Wild Ugandan Red Colobus Monkey Discovered Using Direct Pyrosequencing'
]

In [4]:
EMBEDDINGS_PATH = None
assert EMBEDDINGS_PATH is not None, 'EMBEDDINGS_PATH must contain the path to the embeddings file'

embeddings = pd.read_csv(EMBEDDINGS_PATH, header=None, index_col=0)

AssertionError: EMBEDDINGS_PATH must contain the path to the embeddings file

NameError: name 'EMBEDDINGS_PATH' is not defined

In [3]:
cord_uids = get_uids(titles, uids)

NameError: name 'titles' is not defined

In [3]:
index, query_vec = get_faiss_artifacts(embeddings, cord_uids)
results = document_similarity_search(index, query_vec, embeddings, cord_uids, k=6)

NameError: name 'EMBEDDINGS_FILE' is not defined

## **Imports & Data Sources**

In [1]:
#!pip install faiss-cpu
import pandas as pd
import numpy as np
import os
import faiss

#Connecting to MongoDB
# !pip install pymongo
import pymongo

In [2]:
#Setup embeddings filepath
import os
from google.colab import drive

drive.mount('/content/gdrive', force_remount=True)
root = os.getcwd()
download_destination = 'gdrive/My Drive/COVID-19/lit_review_tool'
cwd = os.path.join(root, download_destination)
os.chdir(cwd)
print('Current working directory: ', os.getcwd())

ModuleNotFoundError: No module named 'google'

In [3]:
#Import embeddings
def read_cord_19_embeddings(filename):                                      
    # emb_path = '/'.join(('embeddings', filename))
    emb_path = '/'.join((cwd, 'cord_19_embeddings', filename))
    print('emb_path: ', emb_path)
    emb = pd.read_csv(emb_path, header = None, index_col = 0)
    print(emb.head())
    return emb

filename = 'cord_19_embeddings_2020-07-31.csv'
emb = read_cord_19_embeddings(filename)
emb.shape

NameError: name 'cwd' is not defined

## **Get metadata from MongoDB**
This connection replaces the metadata download used in the original notebook. No additional data (e.g. full text or abstracts) appears to be needed to reproduce the original notebook's processing.

In [3]:
READ_ONLY_USER = 'coronawhyguest'
READ_ONLY_PASS = 'coro901na'
DATABASE = 'cord-19'
MONGO_HOST = 'mongodb.coronawhy.org'
cord_version = 'v22'

In [6]:
def get_collection_mongo(host, user, password, database, collection):
    URI = f'mongodb://{user}:{password}@{host}'
    client = pymongo.MongoClient(URI)
    db = client.get_database(database)
    return db[collection]

def get_coronawhy_cord(version='v22'):
    return get_collection_mongo(MONGO_HOST, READ_ONLY_USER, READ_ONLY_PASS, DATABASE, version)
    

In [7]:
c = get_coronawhy_cord()
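
`get_collection_mongo` interpolates the username and password straight into the connection URI. That works for the credentials above, but a MongoDB URI requires reserved characters such as `@`, `:` or `/` in credentials to be percent-encoded. The following is a minimal sketch of one way to handle that; `build_mongo_uri` and the example host and password are hypothetical and not part of the original notebook.

```python
from urllib.parse import quote_plus

def build_mongo_uri(host, user, password):
    """Build a mongodb:// URI with percent-encoded credentials (hypothetical helper)."""
    return f'mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}'

# Made-up host and password, chosen only to show the escaping
uri = build_mongo_uri('mongodb.example.org', 'guest', 'p@ss:word/1')
print(uri)
```

The resulting string can be passed to `pymongo.MongoClient` in the same way as the `URI` variable in `get_collection_mongo`.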

In [4]:
#Setup link to CoronaWhy's MongoDB

# Read-only credentials to CoronaWhy MongoDB service
mongouser = 'coronawhyguest'
mongopass = 'coro901na'
cord_version = 'v22'

# mongo_URI = 'mongodb://{}:{}@mongodb.coronawhy.org'.format(mongouser, mongopass)
mongo_URI = 'mongodb://cord19-rw:coronaWhy2020@mongodb.coronawhy.org'
client = pymongo.MongoClient(mongo_URI)
db = client.get_database('cord19')
print('Existing collections: ', db.list_collection_names())
collection = db[cord_version]

#Test query
# pd.DataFrame(collection.find({'title': "Nitric oxide: a pro-inflammatory mediator in lung disease?"})).head()  # should return 2 results (as of 8/3/2020)

ServerSelectionTimeoutError: mongodb.coronawhy.org:27017: [Errno -2] Name or service not known, Timeout: 30s, Topology Description: <TopologyDescription id: 5f72756c2d030c9e5121b937, topology_type: Single, servers: [<ServerDescription ('mongodb.coronawhy.org', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('mongodb.coronawhy.org:27017: [Errno -2] Name or service not known')>]>

In [None]:
# Creating a matrix to store article embeddings 
xb = np.ascontiguousarray(emb).astype(np.float32)
# Assigning dimension for the vector space
d = xb.shape[1]
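
Because a flat index keeps every vector uncompressed, the size of `xb` gives a rough idea of the memory the index will need. The sketch below does that arithmetic for the 204823 rows reported when the index is built; the dimension of 768 is an assumption (it is the size of SPECTER embeddings, but the notebook never prints `d`), and `flat_index_bytes` is a hypothetical helper.

```python
def flat_index_bytes(n_vectors, dim, bytes_per_value=4):
    """Approximate storage for n_vectors uncompressed float32 vectors of length dim."""
    return n_vectors * dim * bytes_per_value

n_vectors = 204823  # row count reported by index.ntotal in this notebook
dim = 768           # assumption: SPECTER embedding size, not printed in the notebook

size_bytes = flat_index_bytes(n_vectors, dim)
size_mib = size_bytes / 2**20
print(f'{size_bytes} bytes, about {size_mib:.0f} MiB')
```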

## **Creating faiss search index**

In [None]:
# Building the index
index = faiss.IndexFlatIP(d)  # IndexFlatIP: exact search using the inner product of the vectors
print('Index training complete: ', index.is_trained)

# With L2-normalized vectors, the inner product (the IP in IndexFlatIP) becomes cosine similarity
faiss.normalize_L2(xb)
index.add(xb)  # Adding vectors to the index

print('Total rows in index: ', index.ntotal)

Index training complete:  True
Total rows in index:  204823
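
To see why normalizing before an inner-product search yields cosine similarity, it can help to reproduce the computation by brute force on toy data. The following is a dependency-free sketch, not the FAISS implementation; `l2_normalize`, `inner_product`, `brute_force_top_k` and the toy vectors are all hypothetical names and values introduced for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged in this sketch."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def brute_force_top_k(query, rows, k):
    """Return (scores, positions) of the k rows most similar to the query.

    Both sides are normalized first, so each inner product is a cosine similarity.
    """
    q = l2_normalize(query)
    scored = [(inner_product(q, l2_normalize(row)), pos) for pos, row in enumerate(rows)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]
    return [score for score, _ in top], [pos for _, pos in top]

# Toy 2-D "embeddings": same direction as the query, 45 degrees off, orthogonal, opposite
toy_rows = [[1.0, 0.0], [2.0, 2.0], [0.0, 3.0], [-1.0, 0.0]]
scores, positions = brute_force_top_k([3.0, 0.0], toy_rows, k=2)
print(scores, positions)
```

Vector length has no effect on the ranking here: only direction matters, which is the property the normalization step is meant to provide.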


**Create example data**

The examples below are provided for ease of testing. Accepting an article ID or title in this way could eventually be a feature in a search engine.

In [None]:
cord_uid_examples = ['02tnwd4m', '8zchiykl', 'gdsfkw1b', 'byp2eqhd', '7gk8uzo0', '0mtmodmo']

title_examples = ['Nitric oxide: a pro-inflammatory mediator in lung disease?',
                  'The 21st International Symposium on Intensive Care and Emergency Medicine, Brussels, Belgium, 20-23 March 2001',
                  'Protein secretion in Lactococcus lactis: an efficient way to increase the overall heterologous protein production',
                  'Immune pathways and defence mechanisms in honey bees Apis mellifera',
                  'Species-specific evolution of immune receptor tyrosine based activation motif-containing CEACAM1-related immune receptors in the dog',
                  'Novel, Divergent Simian Hemorrhagic Fever Viruses in a Wild Ugandan Red Colobus Monkey Discovered Using Direct Pyrosequencing']

In [None]:
# TWO OPTIONS:

# 1) Feed in `cord_uid` directly (in list format)
#    No need to search MongoDB for `cord_uid`; feed this straight into the similarity search

# 2) Feed in a title and look up its `cord_uid`
def get_cord_uid_for_title(study_title):
    # Search by exact title; the projection returns only `cord_uid` (plus MongoDB's default `_id`)
    cord_uids = list(collection.find({'title': str(study_title)}, {'cord_uid': 1}))
    return cord_uids

In [None]:
#Get `cord_uid` ONLY
result_cord_uids = {result['cord_uid'] for result in get_cord_uid_for_title(title_examples[0])}  # drops the `_id` field
result_cord_uids

{'02tnwd4m'}
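
The two input options can also be folded into a single lookup step. The sketch below uses an in-memory list of records in place of the MongoDB collection so that it runs without a database; `resolve_cord_uids` and `toy_records` are hypothetical, and the duplicated record mirrors the note above that a title query can return more than one document.

```python
def resolve_cord_uids(query, metadata_records):
    """Treat `query` as a cord_uid if one matches, otherwise as an exact title.

    Returns a set, so duplicate metadata records collapse to one cord_uid.
    """
    known_uids = {rec['cord_uid'] for rec in metadata_records}
    if query in known_uids:
        return {query}
    return {rec['cord_uid'] for rec in metadata_records if rec['title'] == query}

# In-memory stand-in for the metadata collection (titles and IDs taken from this notebook)
toy_records = [
    {'cord_uid': '02tnwd4m', 'title': 'Nitric oxide: a pro-inflammatory mediator in lung disease?'},
    {'cord_uid': '02tnwd4m', 'title': 'Nitric oxide: a pro-inflammatory mediator in lung disease?'},
    {'cord_uid': '6v0y6xsa', 'title': 'Nitric oxide and virus infection'},
]

by_uid = resolve_cord_uids('02tnwd4m', toy_records)
by_title = resolve_cord_uids('Nitric oxide: a pro-inflammatory mediator in lung disease?', toy_records)
print(by_uid, by_title)
```

An empty set signals that neither an ID nor a title matched, which a caller would need to handle before building a query vector.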

In [None]:
# Prepare query vector (`.loc` is given a list because newer pandas versions reject a set indexer)
query_vec = np.ascontiguousarray(emb.loc[list(result_cord_uids)]).reshape(1,-1).astype(np.float32)
faiss.normalize_L2(query_vec)

## **Run FAISS search**

Remember, this is a document similarity search.

In [None]:
k = 6

def document_similarity_search(query_vec, k, return_cord_uid=False, return_metadata=False):
    # D holds the similarity scores, I the row positions of the k nearest vectors
    D, I = index.search(query_vec, k)
    similar_id_list = I.tolist()[0]  # local list, so repeated calls do not accumulate results
    # Map row positions back to cord_uids, dropping the query article itself
    similar_cord_uid_list = [cid for cid in emb.iloc[similar_id_list].index if cid not in result_cord_uids]
    if return_cord_uid:
        print('Articles similar to {}: '.format(result_cord_uids), similar_cord_uid_list, '\n')
    if return_metadata:
        # For the example query, the last two IDs ('hkrljpn3', 'ocu597fg') are apparently not present in MongoDB
        mongo_results = pd.DataFrame(collection.find({'cord_uid': {'$in': similar_cord_uid_list}}))
        return mongo_results

In [None]:
document_similarity_search(query_vec, k, return_cord_uid=True, return_metadata=True) #the last two ('hkrljpn3', 'ocu597fg') are apparently not present in MongoDB

Articles similar to {'02tnwd4m'}:  ['bzub2kkv', '6v0y6xsa', 'ka676pli', 'hkrljpn3', 'ocu597fg'] 



Unnamed: 0,_id,who_covidence_id,source_x,pmcid,pubmed_id,license,publish_time,authors,journal,mag_id,arxiv_id,s2_id,year,path,cord_uid,title,abstract,body_text,tables,body_rows
0,5ec666e250ceb4d90ad6af2e,~,PMC,PMC2327086,11106932.0,green-oa,2000-11-01,"Akaike, T; Maeda, H",Immunology,~,~,~,2000,document_parses/pdf_json/bd1d562cb24b73a74830c...,6v0y6xsa,Nitric oxide and virus infection,[{'text': 'Nitric oxide (NO) has complex and d...,[{'text': 'Free radical species with oxygen-or...,"[[bd1d562cb24b73a74830ca23e2c0cd02e609c8cb, FI...","[{'cord_uid': '6v0y6xsa', 'section': 'title', ..."
1,5ec6751e50ceb4d90ad6c8a4,~,PMC,PMC7095984,22581364.0,no-cc,2012-05-13,"Carnesecchi, Stéphanie; Pache, Jean-Claude; Ba...",Cell Mol Life Sci,~,~,~,2012,document_parses/pdf_json/d0f04973f4636e11301f6...,bzub2kkv,NOX enzymes: potential target for the treatmen...,[{'text': 'Acute lung injury (ALI) and its mor...,[{'text': 'Acute lung injury (ALI) and acute r...,"[[d0f04973f4636e11301f61e6dbcde5f9dd612e5b, FI...","[{'cord_uid': 'bzub2kkv', 'section': 'title', ..."
2,5ec676a150ceb4d90ad6cb5f,~,PMC,PMC7102088,14720072.0,no-cc,2012-09-08,"Pease, James E.; Sabroe, Ian",Am J Respir Med,~,~,~,2012,document_parses/pdf_json/0267688090041b062bc54...,ka676pli,The Role of Interleukin-8 and its Receptors in...,[{'text': 'Neutrophils have been implicated in...,[{'text': 'The discovery of chemokines and the...,"[[0267688090041b062bc540979f8e3d88bc671ff7, FI...","[{'cord_uid': 'ka676pli', 'section': 'title', ..."
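
Only three of the five similar articles come back with metadata. A small check like the one below makes such gaps explicit instead of leaving them to be spotted by eye. It is a sketch over plain Python data, using the IDs from the output above; `split_found_and_missing` is a hypothetical helper.

```python
def split_found_and_missing(requested_uids, metadata_records):
    """Separate requested cord_uids into those with metadata and those without."""
    found = {rec['cord_uid'] for rec in metadata_records}
    present = [uid for uid in requested_uids if uid in found]
    missing = [uid for uid in requested_uids if uid not in found]
    return present, missing

# IDs returned by the similarity search, and the records MongoDB returned for them
requested = ['bzub2kkv', '6v0y6xsa', 'ka676pli', 'hkrljpn3', 'ocu597fg']
returned = [{'cord_uid': '6v0y6xsa'}, {'cord_uid': 'bzub2kkv'}, {'cord_uid': 'ka676pli'}]

present, missing = split_found_and_missing(requested, returned)
print('With metadata:', present)
print('Missing from metadata:', missing)
```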
