In [1]:
import kagglehub
cord19_data = kagglehub.dataset_download('fatma98/cord19createdataframe')

print('Data source import complete.')


Data source import complete.


# Find Similar Documents From Scientific Corpus Using Deep Learning With SciBERT       
This kernel walks through semantic similarity search over documents using cosine similarity and k-nearest neighbors (k-NN).

# Introduction  
When reading an interesting article, you might want to find similar articles among a large number of candidate publications. Doing this manually does not scale. Why not take advantage of Artificial Intelligence to solve this problem?
In this article, you will learn to use SciBERT and cosine similarity to find the articles closest in meaning to a specific query.  

# Approach    
Here are the different steps performed
* Data extraction and cleaning   
* Data Processing
    * Load the pretrained model  
    * Vectorize documents
     
* Semantic Similarity search
    * Cosine Similarity   
    * k-NN with Faiss

# Useful Libraries

In [20]:
"""
Data Loading and other libraries
"""
import warnings
import pandas as pd
import numpy as np
from tqdm import tqdm

"""
Transformer libraries used to load the pretrained model and preprocess the data
"""
!pip install keras -q
import torch
from keras.preprocessing.sequence import pad_sequences
from transformers import BertTokenizer,  AutoModelForSequenceClassification

"""
Similarity search section: cosine similarity and Faiss (the Facebook AI Research similarity search library)
"""
from sklearn.metrics.pairwise import cosine_similarity
!pip install faiss-gpu -q # only needed the first time you run the notebook
import faiss

In [21]:
warnings.filterwarnings("ignore")

# About the data   
- CORD-19 is a resource of over 59,000 scholarly articles, including over 48,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.
- It is downloadable from Kaggle.
- Further details about the dataset can be found on this [page](https://www.kaggle.com/danielwolffram/discovid-ai-a-search-and-recommendation-engine/data).

#### Data loading

In [22]:
data = pd.read_csv(f"{cord19_data}/cord19_df.csv")
print("Data Shape: {}".format(data.shape))

Data Shape: (47110, 16)


In [23]:
# Percentage of missing column values
percent_missing = data.isnull().sum() * 100 / len(data)
percent_missing

Unnamed: 0,0
paper_id,0.0
body_text,0.0
methods,58.439822
results,63.984292
source,10.365103
title,10.439397
doi,12.604543
abstract,12.838039
publish_time,10.365103
authors,11.676926


There are 47110 articles overall, each one having 16 columns.

**Note**: We will focus our analysis on the **abstract** column for simplicity. About 12.8% of the abstracts are missing, so we drop those rows in the next cell. You could use other text columns instead, such as **body_text**, which has no missing values; it is up to you.
We will also work on a random sample of 100 articles in order to speed up processing.

In [24]:
# remove articles with missing abstract
data = data.dropna(subset = ['abstract'])
data = data.reset_index(drop = True)
percent_missing = data.isnull().sum() * 100 / len(data)
print("Data Shape: {}".format(data.shape))
percent_missing

Data Shape: (41062, 16)


Unnamed: 0,0
paper_id,0.0
body_text,0.0
methods,53.796698
results,59.466173
source,8.645463
title,8.723394
doi,11.095417
abstract,0.0
publish_time,8.645463
authors,9.432078


In [25]:
# Show the first n words (100 by default) of the abstract of total_number random articles
def show_random_articles(total_number, df, n=100):

    # Draw total_number random articles
    sampled_articles = df.sample(total_number)

    # Print each sampled article; iterating over the sample itself avoids
    # mixing index labels with positional lookups
    for idx, row in sampled_articles.iterrows():
        print("Article #{}".format(idx))
        print(" --> Title: {}".format(row["title"]))
        print(" --> Abstract: {} ...".format(" ".join(row["abstract"].split()[:n])))
        print("\n")

# Show 3 random articles
show_random_articles(3, data)

Article #8229
 --> Title: The Identification of a Calmodulin-Binding Domain within the Cytoplasmic Tail of Angiotensin-Converting Enzyme-2
 --> Abstract: Angiotensin-converting enzyme (ACE)-2 is a homolog of the well-characterized plasma membrane-bound angiotensin-converting enzyme. ACE2 is thought to play a critical role in regulating heart function, and in 2003, ACE2 was identified as a functional receptor for severe acute respiratory syndrome coronavirus. We have recently shown that like ACE, ACE2 undergoes ectodomain shedding and that this shedding event is up-regulated by phorbol esters. In the present study, we used gel shift assays to demonstrate that calmodulin, an intracellular calcium-binding protein implicated in the regulation of other ectodomain shedding events, binds a 16-amino acid synthetic peptide corresponding to residues 762–777 within the ...


Article #30814
 --> Title: Are virus infections triggers for autoimmune disease?
 --> Abstract: Abstract Viruses have been 

# Data Processing & Vectorization     
The data processing aims to vectorize the articles' abstracts so that we can perform the similarity analysis. Since we are dealing with scientific documents, we will use the SciBERT model and tokenizer to generate an embedding for each article from its text.  
SciBERT is a language model pretrained on scientific text. You can find more information about it on [Semantic Scholar](https://www.semanticscholar.org/paper/SciBERT%3A-A-Pretrained-Language-Model-for-Scientific-Beltagy-Lo/5e98fe2163640da8ab9695b9ee9c433bb30f5353).   
Here is how we proceed:  

## Load model artifacts   
Load the pretrained model & tokenizer. When loading the pretrained model, we need to set the output_hidden_states to True so that we can extract the embeddings.  

In [26]:
# Get the SciBERT pretrained model path from Allen AI repo
pretrained_model = 'allenai/scibert_scivocab_uncased'

# Get the tokenizer from the previous path
sciBERT_tokenizer = BertTokenizer.from_pretrained(pretrained_model,
                                          do_lower_case=True)

# Get the model
model = AutoModelForSequenceClassification.from_pretrained(pretrained_model,
                                                          output_attentions=False,
                                                          output_hidden_states=True)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Transform text data to embeddings   
The function *convert_single_abstract_to_embedding* is largely based on Chris McCormick's [BERT Word Embeddings Tutorial](https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/#3-extracting-embeddings).

It creates an embedding for a given text using the pretrained SciBERT model, taking the last-layer hidden state of the [CLS] token as the document vector.

In [27]:
def convert_single_abstract_to_embedding(tokenizer, model, in_text, MAX_LEN = 510):

    input_ids = tokenizer.encode(
                        in_text,
                        add_special_tokens = True,
                        max_length = MAX_LEN,
                        truncation = True, # explicitly truncate texts longer than MAX_LEN
                   )

    results = pad_sequences([input_ids], maxlen=MAX_LEN, dtype="long",
                              truncating="post", padding="post")

    # Remove the outer list.
    input_ids = results[0]

    # Create attention masks
    attention_mask = [int(i>0) for i in input_ids]

    # Convert to tensors.
    input_ids = torch.tensor(input_ids)
    attention_mask = torch.tensor(attention_mask)

    # Add an extra dimension for the "batch" (even though there is only one
    # input in this batch.)
    input_ids = input_ids.unsqueeze(0)
    attention_mask = attention_mask.unsqueeze(0)

    # Put the model in "evaluation" mode, meaning feed-forward operation.
    model.eval()

    #input_ids = input_ids.to(device)
    #attention_mask = attention_mask.to(device)

    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers.
    with torch.no_grad():
        logits, encoded_layers = model(
                                    input_ids = input_ids,
                                    token_type_ids = None,
                                    attention_mask = attention_mask,
                                    return_dict=False)

    layer_i = 12 # The last BERT layer before the classifier.
    batch_i = 0 # Only one input in the batch.
    token_i = 0 # The first token, corresponding to [CLS]

    # Extract the embedding.
    embedding = encoded_layers[layer_i][batch_i][token_i]

    # Move to the CPU and convert to numpy ndarray.
    embedding = embedding.detach().cpu().numpy()

    return(embedding)
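
The padding and attention-mask step in the function above can be illustrated in isolation. The following is a minimal sketch in plain Python, not the notebook's actual code path: `pad_and_mask` is a hypothetical helper that imitates `pad_sequences(..., truncating="post", padding="post")`, and the token ids are made up for illustration.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or pad a list of token ids to max_len and build the attention mask."""
    # "post" truncation: keep the first max_len ids and drop the tail.
    clipped = list(token_ids[:max_len])
    # "post" padding: append pad_id until the sequence is exactly max_len long.
    padded = clipped + [pad_id] * (max_len - len(clipped))
    # 1 marks a real token, 0 marks padding (same idea as int(i > 0) above).
    mask = [int(tok != pad_id) for tok in padded]
    return padded, mask


# Made-up token ids, for illustration only.
ids, mask = pad_and_mask([102, 2057, 8153, 103], max_len=8)
print(ids)   # [102, 2057, 8153, 103, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The mask tells the model to ignore the padded positions, so a short abstract and a long one can share the same fixed input length.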

### Test on a single text data  
Here we test the function on the "abstract" field of the article at position 30 (the 31st row). You can choose whatever position you want, as long as it exists in the data.

In [28]:
input_abstract = data.abstract.iloc[30]

# Use the model and tokenizer to generate an embedding for the input_abstract
abstract_embedding = convert_single_abstract_to_embedding(sciBERT_tokenizer, model, input_abstract)

print('Embedding shape: {}'.format(abstract_embedding.shape))

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Embedding shape: (768,)


**(768,)** means that the embedding is composed of 768 values. Now that the conversion from text to embedding works, we can apply it to all of our text data. Before that, we remove some columns from the data so that the final query results contain fewer columns. Also, I selected only 100 articles for the analysis, so that the overall processing does not become too time-consuming.

In [29]:
def get_min_viable_data(df, sample_size=100):

    # Columns we do not need for the analysis
    useless_cols = ['methods', 'results', 'source', 'doi',
           'body_text', 'publish_time', 'authors', 'journal', 'arxiv_id',
           'publish_year', 'is_covid19', 'study_design']

    # Return a new DataFrame instead of modifying the caller's one in place
    df = df.drop(columns=useless_cols)

    # Running the analysis on the whole dataset takes too much time, so we take
    # a random subset of sample_size observations in order to speed up processing.
    df = df.sample(sample_size)

    return df

In [30]:
def convert_overall_text_to_embedding(df):

    # The list of all the embeddings
    embeddings = []

    # Get all the abstracts from the DataFrame passed in (not the global one)
    overall_text_data = df.abstract.values

    # Loop over all the abstracts and get their embeddings
    for abstract in tqdm(overall_text_data):

        # Get the embedding
        embedding = convert_single_abstract_to_embedding(sciBERT_tokenizer, model, abstract)

        # Add it to the list
        embeddings.append(embedding)

    print("Conversion Done!")

    return embeddings

In [31]:
"""
# This task can take a lot of time depending on the sample_size value
in the "get_min_viable_data" function
"""
data = get_min_viable_data(data)
embeddings = convert_overall_text_to_embedding(data)

100%|██████████| 100/100 [03:35<00:00,  2.15s/it]

Conversion Done!





In [32]:
# Create a new column that will contain the embedding of each abstract
def create_final_embeddings(df, embeddings):

    df["embeddings"] = embeddings
    df["embeddings"] = df["embeddings"].apply(lambda emb: np.array(emb))
    df["embeddings"] = df["embeddings"].apply(lambda emb: emb.reshape(1, -1))

    return df

In [33]:
data = create_final_embeddings(data, embeddings)
data.head(3)

Unnamed: 0,paper_id,title,abstract,url,embeddings
7357,efd8b5be438ef34572b5b3abccb17ec7ee87a395,Chapter 3 Clinical Biochemistry and Hematology,The sparse information available for multiple ...,https://doi.org/10.1016/b978-0-12-380920-9.000...,"[[-0.5259601, -1.1708553, -0.71766114, -0.4543..."
14980,44a44adc5bb0f2e966d989c5cf6e21dae8dad268,Human coronaviruses in severe acute respirator...,Acute viral respiratory infections (AVRI) are ...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,"[[-0.8665741, -0.11519499, -0.97434264, 1.1653..."
14054,1e83e99dc7592ff422754a5ecb16127e44503d1d,Guidelines for Health Organizations: European ...,"In Europe, the rate of noninvasive ventilation...",https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,"[[-0.29794416, -0.67595786, -0.9586915, 0.2749..."


# Similarity Search   
Each abstract now has a corresponding embedding. We can perform the similarity analysis between a given ***query*** vector and all the embedding vectors. The scope of this article is limited to:  
- Cosine similarity, which measures the cosine of the angle between two vectors     
- k-Nearest Neighbor (k-NN) search
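
To make the first measure concrete, here is a minimal sketch of cosine similarity in plain Python. The notebook itself relies on scikit-learn's `cosine_similarity`; the helper `cosine_sim` and the three toy vectors below are hypothetical and only serve to illustrate the formula (dot product divided by the product of the norms).

```python
import math


def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, for illustration only.
query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # a scaled copy of the query
orthogonal = [0.0, 0.0, 3.0]       # shares no component with the query

print(cosine_sim(query, same_direction))  # approximately 1.0
print(cosine_sim(query, orthogonal))      # 0.0
```

Because only the direction matters, a vector and a scaled copy of it are treated as maximally similar, which is not the case with a distance such as L2.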

## Utility functions

In [34]:
def process_query(query_text):
    """
    Create a vector for the given query and reshape it to (1, dim) for similarity search
    """

    query_vect = convert_single_abstract_to_embedding(sciBERT_tokenizer, model, query_text)
    query_vect = np.array(query_vect)
    query_vect = query_vect.reshape(1, -1)
    return query_vect


def get_top_N_articles_cosine(query_text, data, top_N=5):
    """
    Retrieve the top_N (5 by default) articles most similar to the query
    """
    query_vect = process_query(query_text)
    relevant_cols = ["title", "abstract", "url", "cos_sim"]

    # Run the similarity search
    data["cos_sim"] = data["embeddings"].apply(lambda x: cosine_similarity(query_vect, x))
    data["cos_sim"] = data["cos_sim"].apply(lambda x: x[0][0])

    # Sort the cosine similarity column in descending order.
    # We start at 1 to skip the first row: the query used here is itself an
    # article from `data`, so its similarity with itself is always 1.
    most_similar_articles = data.sort_values(by='cos_sim', ascending=False)[1:top_N+1]

    return most_similar_articles[relevant_cols]

### Similarity Search with Cosine

In [35]:
query_text_test = data.iloc[0].abstract

top_articles = get_top_N_articles_cosine(query_text_test, data)

In [36]:
top_articles

Unnamed: 0,title,abstract,url,cos_sim
25816,Chapter 52 Gerbils,The chapter provides an overview of the taxono...,https://doi.org/10.1016/b978-0-12-380920-9.000...,0.875164
34498,Perspectives on Immunoglobulins in Colostrum a...,Immunoglobulins form an important component of...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,0.802031
28252,"Contamination, Disinfection, and Cross-Coloniz...",Despite documentation that the inanimate hospi...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,0.798631
13335,Influenza and Viral Pneumonia,Influenza and other respiratory viruses are co...,https://doi.org/10.1016/j.ccm.2018.07.005,0.775425
26174,Picornavirus inhibitors,Abstract Picornaviruses are among the best und...,https://doi.org/10.1016/0163-7258(94)90040-x,0.774164


In [37]:
data.iloc[0].abstract

'The sparse information available for multiple less commonly used species have been compiled with references. Clinical problems in animals are usually associated with changes in the blood and urine and core values are recommended for research, safety, and toxicity studies in laboratory animals. The core panel includes glucose, blood urea nitrogen, creatinine, total protein, albumin globulin, calcium, sodium, potassium, chloride, and total cholesterol; hematology tests includes total leukocyte count, erythrocyte count, erythrocyte morphology, platelet count, hemoglobin concentration, hematocrit or packed cell volume, mean corpuscular hemoglobin, and mean corpuscular hemoglobin concentration. Urinalysis tests are also recommended. Stress associated with restraint and handling may cause changes; anesthesia may produce changes in clinical biochemistry or hematology parameters, including decreased hematocrit, hemoglobin level, and red blood cell count. Techniques for blood and urine collect

In [38]:
top_articles.iloc[0].abstract

'The chapter provides an overview of the taxonomy, history, and origin of the Mongolian gerbil. It provides quick, easy-to-use information on anatomy, physiology, and behavior as well as management and husbandry practices. It assists in the humane care and use of gerbils by including details on basic experimental methods for investigators, veterinary care, and the commonly seen diseases. Some more common uses of gerbils in research are described.'

In [39]:
top_articles.iloc[1].abstract

'Immunoglobulins form an important component of the immunological activity found in milk and colostrum. They are central to the immunological link that occurs when the mother transfers passive immunity to the offspring. The mechanism of transfer varies among mammalian species. Cattle provide a readily available immune rich colostrum and milk in large quantities, making those secretions important potential sources of immune products that may benefit humans. Immune milk is a term used to describe a range of products of the bovine mammary gland that have been tested against several human diseases. The use of colostrum or milk as a source of immunoglobulins, whether intended for the neonate of the species producing the secretion or for a different species, can be viewed in the context of the types of immunoglobulins in the secretion, the mechanisms by which the immunoglobulins are secreted, and the mechanisms by which the neonate or adult consuming the milk then gains immunological benefit

In [40]:
top_articles.iloc[2].abstract

'Despite documentation that the inanimate hospital environment (e.g., surfaces and medical equipment) becomes contaminated with nosocomial pathogens, the data that suggest that contaminated fomites lead to nosocomial infections do so indirectly. Pathogens for which there is more-compelling evidence of survival in environmental reservoirs include Clostridium difficile, vancomycin-resistant enterococci, and methicillin-resistant Staphylococcus aureus, and pathogens for which there is evidence of probable survival in environmental reservoirs include norovirus, influenza virus, severe acute respiratory syndrome—associated coronavirus, and Candida species. Strategies to reduce the rates of nosocomial infection with these pathogens should conform to established guidelines, with an emphasis on thorough environmental cleaning and use of Environmental Protection Agency—approved detergent-disinfectants.'

In [41]:
top_articles.iloc[3].abstract

'Influenza and other respiratory viruses are commonly identified in patients with community-acquired pneumonia, hospital-acquired pneumonia, and in immunocompromised patients with pneumonia. Clinically, it is difficult to differentiate viral from bacterial pneumonia. Similarly, the radiological findings of viral infection are nonspecific. The advent of polymerase chain reaction testing has enormously facilitated the identification of respiratory viruses, which has important implications for infection control measures and treatment. Currently, treatment options for patients with viral infection are limited, but there is ongoing research on the development and clinical testing of new treatment regimens and strategies.'

### Similarity Search Using KNN with Faiss   

Faiss is a library developed by [Facebook AI Research](https://research.facebook.com/research-areas/facebook-ai-research-fair/). According to its [wiki](https://github.com/facebookresearch/faiss/wiki),
> Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM

Here are the steps to build the search engine using the previously built embeddings  
- create the flat index: a flat index stores the vectors as they are, without compression, and compares the query against every one of them. This index uses the L2 (Euclidean) distance to measure how close the query vector is to each of the vectors (embeddings).
- add all the vectors to the index
- define the number **K** of similar documents we want
- run the similarity search  
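
The steps above can be sketched without Faiss to show what an exhaustive L2 search computes. This is a plain-Python illustration under stated assumptions, not Faiss's implementation: `squared_l2` and `brute_force_knn` are hypothetical helpers, the toy vectors are made up, and the sketch returns squared distances because, to my understanding, that is what `IndexFlatL2` reports.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def brute_force_knn(vectors, query, k):
    """Compare the query with every stored vector and keep the k closest."""
    # (distance, position) pairs, one per stored vector.
    scored = [(squared_l2(vec, query), pos) for pos, vec in enumerate(vectors)]
    # Smallest distance first; ties are broken by position.
    scored.sort()
    top = scored[:k]
    distances = [d for d, _ in top]
    positions = [p for _, p in top]
    return distances, positions


# Toy 2-D "embeddings", for illustration only.
vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [0.0, 2.0]]
D_toy, I_toy = brute_force_knn(vectors, query=[0.0, 0.0], k=3)
print(I_toy)  # [0, 1, 3]
print(D_toy)  # [0.0, 2.0, 4.0]
```

A flat index trades speed for exactness: every vector is visited, so the result is the true nearest neighbors rather than an approximation.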

In [42]:
embedding_dimension = len(embeddings[0])

In [43]:
indexFlatL2 = faiss.IndexFlatL2(embedding_dimension)

# Convert the embeddings list of vectors into a 2D array.
vectors = np.stack(embeddings)

indexFlatL2.add(vectors)

In [44]:
print("Total Added Number of Vectors: {}".format(indexFlatL2.ntotal))

Total Added Number of Vectors: 100


### Perform Query    
We will use the same query as previously. Change it to another one if you want.  

In [45]:
# Get query vector
query_text = data.iloc[0].abstract
query_vector = process_query(query_text)

K = 5

# Run the search
D, I = indexFlatL2.search(query_vector, K)

In [46]:
I # positions in the index of the most similar articles

array([[ 0, 11, 10, 91, 67]])

In [47]:
D # distances to those articles (IndexFlatL2 reports squared L2 distances)

array([[  0.     , 130.24078, 205.51802, 209.41898, 231.22496]],
      dtype=float32)

**Note**:  
I deliberately broke the process down into separate steps to make each one easy to follow. You can also put everything together into a single function.  
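
One possible way to group the steps is sketched below. `search_similar` is a hypothetical helper, not part of Faiss or of this notebook; it assumes only that the index exposes `search(query, k)` returning a `(D, I)` pair, as `indexFlatL2` does. `ToyIndex` is a made-up stand-in with hard-coded results so that the sketch runs without Faiss installed.

```python
def search_similar(index, query_vector, abstracts, k=5):
    """Return (position, abstract, distance) tuples for the k nearest neighbors.

    `abstracts` must be a positional sequence aligned with the vectors
    that were added to the index.
    """
    distances, positions = index.search(query_vector, k)
    results = []
    # Row 0 holds the results for the single query vector.
    for dist, pos in zip(distances[0], positions[0]):
        # Assumption: a negative position means "no neighbor found" and is skipped.
        if pos < 0:
            continue
        results.append((int(pos), abstracts[int(pos)], float(dist)))
    return results


class ToyIndex:
    """Stand-in for a Faiss index, returning hard-coded results."""

    def search(self, query_vector, k):
        return [[0.0, 1.5, 4.0][:k]], [[0, 2, 1][:k]]


toy_abstracts = ["abstract a", "abstract b", "abstract c"]
for pos, text, dist in search_similar(ToyIndex(), [[0.0, 0.0]], toy_abstracts, k=3):
    print(pos, text, dist)
```

In the notebook, the equivalent call would be along the lines of `search_similar(indexFlatL2, query_vector, data.abstract.tolist(), k=5)`.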

In [48]:
for i in range(I.shape[1]):

    article_index = I[0, i]

    abstract = data.iloc[article_index].abstract
    print("** Article #{} **".format(article_index))
    print("** --> Abstract : \n{}**".format(abstract))
    print("** --> L2 Distance: %.2f**" % D[0, i])
    print("\n")

** Article #0 **
** --> Abstract : 
The sparse information available for multiple less commonly used species have been compiled with references. Clinical problems in animals are usually associated with changes in the blood and urine and core values are recommended for research, safety, and toxicity studies in laboratory animals. The core panel includes glucose, blood urea nitrogen, creatinine, total protein, albumin globulin, calcium, sodium, potassium, chloride, and total cholesterol; hematology tests includes total leukocyte count, erythrocyte count, erythrocyte morphology, platelet count, hemoglobin concentration, hematocrit or packed cell volume, mean corpuscular hemoglobin, and mean corpuscular hemoglobin concentration. Urinalysis tests are also recommended. Stress associated with restraint and handling may cause changes; anesthesia may produce changes in clinical biochemistry or hematology parameters, including decreased hematocrit, hemoglobin level, and red blood cell count. Tec

**Observation**   
- The lower the distance, the more similar the article is to the query.   
- The first document has a distance of 0, which means it is identical to the query. This is expected, because the query was compared with itself.
- We can simply remove it from the analysis.
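
Removing the self-match can be sketched as a small filter over one row of the search results. `drop_self_match` is a hypothetical helper; the values are copied from the `D` and `I` outputs above, where the query is the article at position 0.

```python
def drop_self_match(distances, positions, query_position):
    """Remove the query's own entry from one row of search results."""
    kept = [(d, p) for d, p in zip(distances, positions) if p != query_position]
    kept_distances = [d for d, _ in kept]
    kept_positions = [p for _, p in kept]
    return kept_distances, kept_positions


# One row of results, copied from the outputs of D and I above.
D_row = [0.0, 130.24078, 205.51802, 209.41898, 231.22496]
I_row = [0, 11, 10, 91, 67]

D_kept, I_kept = drop_self_match(D_row, I_row, query_position=0)
print(I_kept)  # [11, 10, 91, 67]
print(D_kept)  # [130.24078, 205.51802, 209.41898, 231.22496]
```

In the notebook this would be applied to `D[0].tolist()` and `I[0].tolist()`. Since the filter drops one result, searching for K + 1 neighbors keeps K articles after filtering.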