# What is known about transmission, incubation, and environmental stability
## COVID-19 Open Research Dataset Challenge (CORD-19)

### Task Details
What is known about transmission, incubation, and environmental stability? What do we know about natural history, transmission, and diagnostics for the virus? What have we learned about infection prevention and control?

The first question we need to ask is what we mean by transmission, incubation, and environmental stability -- or, rather, what should a computer understand when we ask this? We can go about encoding this information in several ways: 1) keywords for analysis in some kind of TF-IDF format, probably including a list of synonyms that we would need to develop by hand, 2) high-dimensional word vectors via word2vec or GloVe, or 3) heavy but state-of-the-art transformer models for vectorization.

Keywords probably aren't going to give us the robust results we're looking for, because typical pre-processing methods remove all sorts of punctuation and numbers, but these are really important in biomedical texts! We could skip the pre-processing except for removing stop words, but we'd still need to address the fact that keywords have synonyms, and we'd need to hand-write these. But there may be an easier way to get better results without all the hassle. 
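To make the pre-processing problem concrete, here's a minimal sketch of a typical keyword-style cleaner. The function `naive_clean` and the sample sentence are made up for illustration; they aren't part of the pipeline used later in this notebook.

```python
import string

def naive_clean(text):
    """Typical keyword pre-processing: lowercase, then drop punctuation and digits."""
    text = text.lower()
    table = str.maketrans("", "", string.punctuation + string.digits)
    return text.translate(table).split()

sample = "SARS-CoV-2 (2019-nCoV) virions are 60 to 140 nm in diameter."
tokens = naive_clean(sample)
print(tokens)
# ['sarscov', 'ncov', 'virions', 'are', 'to', 'nm', 'in', 'diameter']
```

The virus names get mangled and the size range disappears entirely, which is exactly the kind of information we care about in biomedical text.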

I propose method 2: spaCy is a popular NLP package that's blazingly fast and has (mostly) everything we need to process the text. It'll break sentences apart, lemmatize, and even provide vectors for us. spaCy vectors are somewhat simplistic because the vector of several tokens is just the average of the vectors of each token individually -- so we may not get state-of-the-art results. But we'll get them fast, and we'll know if we need to change something up!
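Here's a tiny sketch of what "the average of the token vectors" means in practice. The 3-dimensional `toy_vectors` are invented for illustration (real model vectors are much larger), and `average_vector` is a hypothetical helper, not a spaCy function.

```python
# Toy 3-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "viral": [1.0, 0.0, 2.0],
    "shedding": [3.0, 2.0, 0.0],
}

def average_vector(tokens, vectors):
    """Element-wise mean of the vectors of the given tokens."""
    rows = [vectors[t] for t in tokens]
    return [sum(col) / len(rows) for col in zip(*rows)]

phrase = average_vector(["viral", "shedding"], toy_vectors)
reversed_phrase = average_vector(["shedding", "viral"], toy_vectors)
print(phrase)                     # [2.0, 1.0, 1.0]
print(phrase == reversed_phrase)  # True
```

Notice that word order is lost: both orderings produce the same vector. That's part of why averaged vectors are fast but not state of the art.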

# Part 1: Getting the data ready
The data is formatted in many files as json; let's put it into something easier to work with for NLP. 

In [None]:
#!pip uninstall spacy -y
#!pip install spacy[cuda100] scispacy spacy_langdetect scipy https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.3/en_core_sci_lg-0.2.3.tar.gz
#!pip install -U spacy[cuda100]
#!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.3/en_core_sci_lg-0.2.3.tar.gz

In [1]:
import pandas as pd 
import os
import numpy as np
import scispacy
import json
import spacy
from tqdm.notebook import tqdm
from scipy.spatial import distance
import ipywidgets as widgets
from scispacy.abbreviation import AbbreviationDetector
from spacy_langdetect import LanguageDetector
# UMLS linking will find concepts in the text, and link them to UMLS. 
from scispacy.umls_linking import UmlsEntityLinker
import time

# Time for NLP!

Let's load our language model. Based on the type of text we'll be dealing with, we want something that's been pretrained on biomedical texts, as the vocabulary and statistical distribution of words are much different from, say, the news or Wikipedia articles. Luckily, there are already pre-trained models for spaCy, so let's load the largest one we can!

In [2]:
#nlp = spacy.load("en_core_sci_lg")
nlp = spacy.load("en_core_sci_lg", disable=["tagger"])

# We also need to detect language, or else we'll be parsing non-english text 
# as if it were English. 
nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)

# Add the abbreviation pipe to the spacy pipeline. Only need to run this once.
abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)

# Our linker will look up named entities/concepts in the UMLS graph and normalize
# the data for us. 
linker = UmlsEntityLinker(resolve_abbreviations=True)
nlp.add_pipe(linker)



### Adding a vector for COVID-19

One last thing. COVID-19 is a new word, and doesn't exist in the vocabulary for our spaCy model. We'll need to add it manually; let's try setting it to equal the average vector of words that should represent what COVID-19 refers to, and see if that works. I'm not an expert so I just took definitions from Wikipedia and the etiology section of https://onlinelibrary.wiley.com/doi/full/10.1002/jmv.25740. There's a much better way of doing this (fine-tuning the model on our corpus) but I have no idea how to do this in spaCy...

In [3]:
from spacy.vocab import Vocab
new_vector = nlp(
               """Single‐stranded RNA virus, belongs to subgenus 
                   Sarbecovirus of the genus Betacoronavirus.5 Particles 
                   contain spike and envelope, virions are spherical, oval, or pleomorphic 
                   with diameters of approximately 60 to 140 nm.
                   Also known as severe acute respiratory syndrome coronavirus 2, 
                   previously known by the provisional name 2019 novel coronavirus 
                   (2019-nCoV), is a positive-sense single-stranded RNA virus. It is 
                   contagious in humans and is the cause of the ongoing pandemic of 
                   coronavirus disease 2019 that has been designated a 
                   Public Health Emergency of International Concern""").vector

vector_data = {"COVID-19": new_vector,
               "2019-nCoV": new_vector,
               "SARS-CoV-2": new_vector}

# Register the same vector for every name of the virus in the model's own vocab.
for word, vector in vector_data.items():
    nlp.vocab.set_vector(word, vector)

### Sanity Check
Alright, let's check if this works.

In [4]:
print(
    nlp("COVID-19").similarity(nlp("novel coronavirus")), "\n",
    nlp("SARS-CoV-2").similarity(nlp("severe acute respiratory syndrome")), "\n",
    nlp("COVID-19").similarity(nlp("sickness caused by a new virus")))

0.5324297344523997 
 0.34796126970622626 
 0.7016356811120861


I guess we'll find out if that's good enough for our purposes! Let's save it so other people can use it!

In [5]:
#nlp.to_disk('/home/acorn/Documents/covid-19-en_lg')

Some of the texts are particularly long, so we need to increase the max_length attribute of nlp to more than 1.25 million characters. The alternative would be cutting the length of the article or dropping it entirely (I believe there's some sort of anomaly with this particular article), but we'll keep it for now.

In [6]:
nlp.max_length=2000000
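If raising max_length ever becomes a memory problem, the "cut the article" alternative could look something like the sketch below. `split_long_text` is a hypothetical helper (not something used in this notebook), and it assumes paragraphs are separated by blank lines.

```python
def split_long_text(text, max_chars):
    """Split text into chunks of at most max_chars, preferring paragraph breaks."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single paragraph longer than the limit gets a hard cut.
        while len(paragraph) > max_chars:
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        current = paragraph
    if current:
        chunks.append(current)
    return chunks

article = "aaaa\n\nbbbb\n\n" + "c" * 25
chunks = split_long_text(article, max_chars=10)
print([len(c) for c in chunks])  # [10, 10, 10, 5]
```

The catch is that character offsets would then be relative to each chunk rather than the full article, so they'd need to be shifted back before indexing into the original text.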

Next, we want to replace all abbreviations with their long forms. This is important for semantic indexing because the model has probably seen words like "Multiple sclerosis" but may have seen the abbreviation "MS" in different contexts. That means their vector representations are different, and we don't want that! 

The abbreviation detector we added to our scispaCy pipeline earlier finds these short form/long form pairs for us. Let's see what it detects in the first article.

In [10]:
doc = nlp(df.iloc[0].text)

print("Abbreviation", "\t", "Definition")
for abrv in doc._.abbreviations[0:10]:
	print(f"{abrv} \t ({abrv.start_char}, {abrv.end_char}) {abrv._.long_form}")

Abbreviation 	 Definition
PACS 	 (1064, 1068) picture archiving and communication system
FOV 	 (2867, 2870) field of view
GGO 	 (4432, 4435) ground glass opacity
Detect COVID-19 	 (5459, 5474) Detect COVID-19
Detect COVID-19 	 (8757, 8772) Detect COVID-19
Detect COVID-19 	 (5508, 5523) Detect COVID-19
Detect COVID-19 	 (9108, 9123) Detect COVID-19
Detect COVID-19 	 (16052, 16067) Detect COVID-19
Detect COVID-19 	 (5751, 5766) Detect COVID-19
Detect COVID-19 	 (5595, 5610) Detect COVID-19


Notice we get some weird results towards the end if you print **all** of them (lots of a's being converted to at's), but we can ignore that for now. If we need to remove stop words later, we can.
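To show the replacement idea without loading a model, here's a minimal sketch that works on plain character offsets. `expand_abbreviations`, the sample sentence, and the hand-written spans are all made up for illustration; in the notebook the spans come from `doc._.abbreviations`.

```python
def expand_abbreviations(text, spans, min_length=5):
    """Replace each (start, end, long_form) span in text with its long form.

    Long forms of min_length characters or fewer are skipped, so spurious
    matches such as "a" -> "at" leave the original text untouched.
    """
    pieces, cursor = [], 0
    for start, end, long_form in sorted(spans):
        pieces.append(text[cursor:start])
        if len(long_form) > min_length:
            pieces.append(long_form)
        else:
            pieces.append(text[start:end])
        cursor = end
    pieces.append(text[cursor:])  # keep the text after the last abbreviation
    return "".join(pieces)

text = "The FOV was fixed and a GGO was seen."
spans = [(24, 27, "ground glass opacity"), (4, 7, "field of view"), (22, 23, "at")]
expanded = expand_abbreviations(text, spans)
print(expanded)
# The field of view was fixed and a ground glass opacity was seen.
```

The length check is what keeps the weird "a" to "at" results out of the rewritten text.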

### Making the Vector DataFrames
Appending to a DataFrame row by row gets slower as the data grows, because df.append copies the entire object every time. Instead, the following will take an article's text, break it into sentences, vectorize each sentence (using scispaCy's pre-trained word2vec model), and collect the results in plain lists. Finally, the lists for each article are loaded as a DataFrame and appended to a CSV file.

So here's the real meat of our pre-processing. This is really heavy because it processes line-by-line and then generates a lot of metadata (entities, vectors). We can break it into pieces later depending on the task we want to use this information for, but querying lines is a lot more useful than querying whole documents when you want to know about something specific like seroconversion, spike proteins, or something else. Once you identify lines of interest, you can generate more data about the actual document, since each line will be indexed with document, start and end character, entities, vectors, and language.
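The "write the header once, then append batches of rows" pattern can be sketched with just the standard library. The reduced `FIELDS` list, the `append_rows` helper, and the sample rows are assumptions for illustration; the point is that a fixed, ordered column list keeps the header and the appended rows lined up.

```python
import csv
import io

# Reduced, illustrative column list. An ordered list (not a set) fixes the layout.
FIELDS = ["_id", "sentence", "startChar", "endChar"]

def append_rows(handle, rows, write_header=False):
    """Append dict rows to an open CSV handle, always in FIELDS order."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    if write_header:
        writer.writeheader()
    writer.writerows(rows)

buffer = io.StringIO()  # stands in for a file opened in append mode
append_rows(buffer, [], write_header=True)
append_rows(buffer, [{"_id": "doc1", "sentence": "First sentence.", "startChar": 0, "endChar": 15}])
# Key order inside a row doesn't matter; the writer follows FIELDS.
append_rows(buffer, [{"endChar": 32, "startChar": 16, "sentence": "Second sentence.", "_id": "doc1"}])

lines = buffer.getvalue().splitlines()
print(lines)
```

Each batch costs only as much as its own rows, no matter how much has already been written.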

In [13]:
def df_cleaner(df):
    df.fillna("Empty", inplace=True) # If we leave floats (NaN), spaCy will break.
    for i in df.index:
        for j in range(len(df.columns)):
            if " q q" in df.iloc[i,j]:
                df.iloc[i,j] = df.iloc[i,j].replace(" q q","") # Some articles are filled with " q q q q q q q q q"
                
def pipeline(df, column, dataType, filename):
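    # Use an ordered list (a set has no guaranteed order) so the header matches the rows appended below.
    create = pd.DataFrame(columns=["_id","language","section","sentence","startChar","endChar","entities","lemma","w2vVector"])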
    create.to_csv(filename + ".csv", index=False)
    
    docs = nlp.pipe(df[column].astype(str))
    i = -1
    for doc in tqdm(docs):
        languages = []
        start_chars = []
        end_chars = []
        entities = []
        sentences = []
        vectors = []
        _ids = []
        columns = []
        lemmas = []
        i = i + 1
        if len(doc._.abbreviations) > 0 and doc._.language["language"] == "en":
            doc._.abbreviations.sort()
            join_list = []
            start = 0
            for abbrev in doc._.abbreviations:
                join_list.append(str(doc.text[start:abbrev.start_char]))
                if len(abbrev._.long_form) > 5: #Increase length so "a" and "an" don't get un-abbreviated
                    join_list.append(str(abbrev._.long_form))
                else:
                    join_list.append(str(doc.text[abbrev.start_char:abbrev.end_char]))
                start = abbrev.end_char
            # Keep whatever follows the last abbreviation, then rebuild the text.
            join_list.append(str(doc.text[start:]))
            new_text = "".join(join_list)
            # We have new text. Re-nlp the doc for further processing!
            doc = nlp(new_text)
        
        sents = [sent for sent in doc.sents]
        if doc._.language["language"] == "en" and len(doc.text) > 5:
            for sent in sents:
                languages.append(doc._.language["language"])
                sentences.append(sent.text)
                vectors.append(sent.vector)
                start_chars.append(sent.start_char)
                end_chars.append(sent.end_char)
                doc_ents = []
                for ent in sent.ents: 
                    if len(ent._.umls_ents) > 0:
                        poss = linker.umls.cui_to_entity[ent._.umls_ents[0][0]].canonical_name
                        doc_ents.append(poss)
                entities.append(doc_ents)
                _ids.append(df.iloc[i,0])
                if dataType == "tables":
                    columns.append(df.iloc[i]["figure"])
                elif dataType == "text":
                    columns.append(column)
                # Lemmatize only the current sentence, not the whole document on every row.
                lemmatized_doc = " ".join([token.lemma_ for token in sent])
                lemmas.append(lemmatized_doc)
        else: 
            start_chars.append(0)
            end_chars.append(len(doc.text))
            entities.append("Non-English")
            sentences.append(doc.text)
            vectors.append(np.zeros(200))
            _ids.append(df.iloc[i,0])
            languages.append(doc._.language["language"])
            if dataType == "tables":
                columns.append(df.iloc[i]["figure"])
            elif dataType == "text":
                columns.append(column)
            lemmas.append("Non-English")
            
        rows = pd.DataFrame(data={"_id": _ids, "language": languages, "section": columns, "sentence": sentences, 
            "startChar": start_chars, "endChar": end_chars, "entities": entities, "lemma": lemmas, "w2vVector": vectors})
        rows.to_csv(filename + '.csv', mode='a', header=False, index=False)

In [14]:
columns_to_process = ["text"]
files = [f for f in os.listdir("./unnabreviated_parts/") if f.startswith("unna") and not f.endswith("csv")]
for f in tqdm(files):
    f = "./unnabreviated_parts/" + f
    df = pd.read_csv(f)
    pipeline(df=df, column="text", dataType="text", filename=f)
    del df
    os.remove(f)

MemoryError: Unable to allocate 3.89 GiB for an array with shape (1019544, 8, 64, 2) and data type float32

## Lemmatized Text

Just in case we need it, let's do some text cleaning and include that in a different column. Lemmatization normalizes data so that when you're creating word clouds or simplified TF-IDF, the number of dimensions you're dealing with is significantly reduced. It's also nice to remove words that don't contribute much meaning, but do note that removing stop words can make neural models less accurate depending on the task you're using them for.
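Here's a quick sketch of why lemmatization shrinks the number of dimensions. The `toy_lemmas` table and the sample sentence are hand-written for illustration; spaCy works out lemmas on its own via `token.lemma_`.

```python
# Hand-written lemma table for illustration only.
toy_lemmas = {
    "infected": "infect", "infects": "infect", "infecting": "infect",
    "viruses": "virus", "cells": "cell",
}

def lemmatize(tokens, lemmas):
    """Map each token to its lemma, leaving unknown tokens unchanged."""
    return [lemmas.get(tok, tok) for tok in tokens]

tokens = "the virus infects cells and infected cells release viruses".split()
lemmatized = lemmatize(tokens, toy_lemmas)
print(len(set(tokens)), len(set(lemmatized)))  # 8 6
```

Eight distinct word forms collapse to six, and on a real corpus the reduction adds up quickly.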

#### This is included in the pipeline() function above, but expanded here if you only need this functionality. 

In [None]:
def lemmatize_my_text(df, column):
    lemma_column = []
    for i in df.index:
        if df.iloc[i]["language"] == "en":
            doc = nlp(str(df.iloc[i][column]), disable=["ner","linker", "language_detector"])
            lemmatized_doc = " ".join([token.lemma_ for token in doc])
            lemma_column.append(lemmatized_doc)
        else: 
            lemma_column.append("Non-English")
    return lemma_column

In [9]:
df

Unnamed: 0,_id,citations,title,abstract,text
0,0e38333bff68345492526fd39b70d1b18969cb83,['Clinical features of patients infected with ...,Deep Learning-based Detection for COVID-19 fro...,Accurate and rapid diagnosis of COVID-19 suspe...,"huge amount of efforts for radiologists, which..."
1,4ce668fe6eee9f59ed5ad0dfc0e9787777acd3be,['Effect of exacerbation on quality of life in...,Anti-microbial immunity is impaired in COPD pa...,[],Chronic obstructive pulmonary disease (COPD) i...
2,12920263b2846c61d6f2b6105189367a8ff1bc39,"['Genetic improvement of litter size in pigs',...",A diallel of the mouse Collaborative Cross fou...,Reproductive success in the eight founder stra...,A critical part of fitness is optimizing the b...
3,af266fac8970a7960e96630a67d91bec5dda0335,['Handbook of infectious disease data analysis...,Estimating the generation interval for COVID-1...,Estimating key infectious disease parameters f...,In order to plan intervention strategies aimed...
4,57a86b3acd182c955877afd4792719a4e9a0ac32,['A novel coronavirus outbreak of global healt...,Cross-sectional Study Affiliations,"Background: So far, the psychological impact o...","Since December 2019, coronavirus disease 2019 ..."
...,...,...,...,...,...
196,ba3efcd6b74e55327fd7db470d824fc18943f30e,['Potential for global spread of a novel coron...,Evaluating the impact of international airline...,Global airline networks play a key role in the...,results indicate multiple countries (many with...
197,5daceb492bad28a61fc43c4be83a164042d8c430,"['∆3 cells, as indicated. The experiment was 8...",Recombinant rotaviruses rescued by reverse gen...,Rotavirus (RV,"To rescue recombinant RV strain SA11 (rRV-WT),..."
198,1dd898b5ca1ae70ec0e3cad89fc87a165002a99e,['Pathology of US porcine epidemic diarrhea vi...,Using heterogeneity in the population structur...,"In 2013, U.S. swine producers were confronted ...",The 2013 emergence of porcine epidemic diarrho...
199,1aa3e788fc6b03c147e37200c3b011ca7a289a5c,['Mechanism of nucleic acid unwinding by SARS-...,Dark Proteome of Newly Emerged SARS-CoV-2 in C...,Recently emerged Wuhan's novel coronavirus des...,World health organization third of genome code...


In [None]:
df = pd.concat([pd.read_csv(f) for f in files])
timestamp = time.strftime("%Y%m%d")
df.to_csv(f"covid_TitleAbstract_processed-{timestamp}.csv", index=False)

Let's go ahead and do the same for the tables; we want to be able to search them too! 

## Asking the right questions

Our model isn't trained to answer questions. It's trained to represent words based on their statistical distribution in relation to other words -- one way of measuring semantic similarity, or closeness in meaning. Instead of feeding our model questions, we should be making statements and trying to find statements that are semantically similar within the corpus. 

So, let's do a bit of editing on the original questions asked. 

In [None]:
queries = """Range of incubation periods for the disease in humans (and how this varies across age and health status) and how long individuals are contagious, even after recovery.
Prevalence of asymptomatic shedding and transmission. 
Prevalence of asymptomatic shedding and transmission in children, infants, and young people.
Seasonality of transmission of the virus. Times when the virus is transmitted, i.e. winter, summer, autumn, spring, during cold weather, or in different climates.
Physical science of the coronavirus (e.g., charge distribution, adhesion to hydrophilic or hydrophobic surfaces, environmental survival to inform decontamination efforts for affected areas and provide information about viral shedding).
Persistence and stability on a multitude of substrates and sources (nasal discharge, sputum, urine, fecal matter, blood, bodily fluids and secretions).
Persistence of virus on surfaces of different materials (copper, stainless steel, plastic).
Natural history of the virus and shedding of it from an infected person.
Implementation of diagnostics and products to improve clinical processes.
Disease models, including animal models for infection, disease and transmission.
Tools and studies to monitor phenotypic change and potential adaptation of the virus.
Immune response and immunity to the virus.
Effectiveness of movement control strategies to prevent secondary transmission in health care and community settings.
Effectiveness of personal protective equipment (PPE) and its usefulness to reduce risk of transmission in health care and community settings.
Role of the environment in transmission."""
queries = queries.splitlines()
queries_df = pd.DataFrame(data=[{"query":query} for query in queries])

The following cell will vectorize each of our query sentences, and store those vectors in a DataFrame for us. That way, we have a numeric representation of the meaning of our queries, and they're conveniently indexed by number. Later, we can use these index numbers for cross-reference. 

In [None]:
query_vector_list = []
for i in tqdm(range(len(queries))):
    doc = nlp(queries[i])
    vec = doc.vector
    query_vector_list.append({"_id": f"query_{i}", "vector": vec})
    
query_vector_df = pd.DataFrame(data=query_vector_list)
#query_vector_df.to_csv(f"query_vecs-{time.strftime('%Y%m%d')}.csv", index=False)

## How do we measure similarity? 

We can calculate the cosine distance between every query vector and every sentence vector in the corpus. Cosine distance (1 minus cosine similarity) is a pretty typical measurement in NLP for comparing two vectors: it only cares about the angle between them, so a smaller distance means a closer match. There are other measurements (Euclidean distance, for example, also takes magnitude into consideration). We'll go with cosine distance for now.
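For intuition, here's a pure-Python sketch of the two measurements on made-up 3-dimensional vectors. The helper names `cosine_distance` and `euclidean_distance` are mine; the notebook itself relies on scipy's `cdist` with the "cosine" metric.

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle between u and v); ignores vector length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance; sensitive to vector length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # twice as long, same angle
orthogonal = [0.0, 0.0, 3.0]

print(cosine_distance(query, same_direction))     # approximately 0
print(cosine_distance(query, orthogonal))         # 1.0
print(euclidean_distance(query, same_direction))  # clearly non-zero
```

A vector pointing the same way but twice as long is a perfect match under cosine distance, while Euclidean distance treats it as far away.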

In [None]:
distances = distance.cdist([value for value in query_vector_df["vector"]], [value for value in vector_df["vector"].values], "cosine")

Let's save this as a searchable df. We'll keep "vector_df" on the backburner for now.

In [None]:
w2v_searchable_df = vector_df.drop(columns=["vector"])

In [None]:
# Create a column with cosine distances for each query vs the sentence
for i in range(len(queries)):
    w2v_searchable_df[f"query_{i}_distance"] = distances[i]
#w2v_searchable_df.to_csv(f"covid_w2v_searchable-{time.strftime('%Y%m%d')}.csv", index=False)

# Part 2: Getting Results

Now we can start to use our cosine distance measurements! It's essentially like a search engine, but without the need for keywords, as long as you have a good idea what you're looking for.



#### Run this cell below if you've already got the data from Part 1. 

In [None]:
#w2v_searchable_df = pd.read_csv("covid_w2v_searchable.csv")
df = pd.read_csv("covid_data_full-03212020.csv")
#queries_df = pd.read_csv("queries.csv")
#vector_df = pd.read_csv("covid_vectors.csv")

In [None]:
#df.drop(columns=["entities"],inplace=True)
df

Let's check out our top results, and see if this was useful! Here we pass over each query's distance column, sorting in ascending order (the smallest distance is the most similar), and printing the top 20 hits. Are the results what we expect?
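The ranking step on its own can be sketched with the standard library. The sentences and distances below are invented placeholders, and `top_hits` is a hypothetical helper; the notebook does the same thing with `sort_values` on a DataFrame.

```python
import heapq

def top_hits(distances, sentences, n):
    """Return the n (distance, sentence) pairs with the smallest distance."""
    return heapq.nsmallest(n, zip(distances, sentences))

# Invented placeholder data, not results from the corpus.
sentences = ["sentence A", "sentence B", "sentence C", "sentence D"]
distances = [0.42, 0.07, 0.91, 0.15]

hits = top_hits(distances, sentences, n=2)
for rank, (dist, sent) in enumerate(hits, start=1):
    print(f"Rank {rank}: {sent} (distance {dist})")
# Rank 1: sentence B (distance 0.07)
# Rank 2: sentence D (distance 0.15)
```

Since these are distances, rank 1 is the smallest value, not the largest.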

In [None]:
for i in range(len(queries)):
    columnName = f"query_{i}_distance"
    context = w2v_searchable_df.sort_values(by=columnName)[["_id","start_span","end_span"]][:20]
    ix = context["_id"].to_list()
    spans1 = context["start_span"].to_list()
    spans2 = context["end_span"].to_list()
    print(queries[i] + "\n")
    for j in range(len(context.index)):
        print(f"Rank {j+1}: " + str(df[df["_id"] == ix[j]].iloc[0]["text"])[spans1[j]:spans2[j]] + "\n")

Hey, this doesn't look half bad! 

Since I'm just a humble linguist and I don't know much about virology, I can look at the above and learn a great deal in a short amount of time. I assume a biologist or expert in the field would have a better idea of what to look for. My code isn't super useful for them, so... 

Alright, let's turn it into a function so we don't have all this free-flying code all over the place. 

In [None]:
def find_these(queries, n):
    
    # queries : a list of sentences, preferably statements as described in Part 1. 
    #           The goal is to make these statements similar to something you're looking for.
    # n : The number of top hits you'd like to print out per query.
    query_vector_list = []
    for i in tqdm(range(len(queries))):
        doc = nlp(queries[i])
        vec = doc.vector
        query_vector_list.append({"_id": f"query_{i}", "vector": vec})
    
    
    query_vector_df = pd.DataFrame(data=query_vector_list)
    
    distances = distance.cdist([value for value in query_vector_df["vector"]], [value for value in vector_df["vector"].values], "cosine")
    
    for i in range(len(queries)):
        w2v_searchable_df[f"temp_query_{i}_distance"] = distances[i]
    
    for i in range(len(queries)):
        columnName = f"temp_query_{i}_distance"
        context = w2v_searchable_df.sort_values(by=columnName)[["_id","start_span","end_span"]][:n]
        ix = context["_id"].to_list()
        spans1 = context["start_span"].to_list()
        spans2 = context["end_span"].to_list()
        print(queries[i] + "\n")
        for j in range(len(context.index)):
            print(f"Rank {j+1}: " + str(df[df["_id"] == ix[j]].iloc[0]["text"])[spans1[j]:spans2[j]] + "\n")
            
    ## Cleanup so we're not making giant DataFrames
    w2v_searchable_df.drop(columns=[column for column in w2v_searchable_df.columns if column.startswith("temp")],inplace=True)

# Try it out! 

Define your own queries, and see what you find. Make sure to use statements and not questions. My friends tried asking it questions, and it came up with some really weird results because I think the really long article has a comments section... 

Anyway, enjoy!

In [None]:
# A list of your statements to find things similar to in the corpus.
queries = [
    "Seasonality of transmission of the virus. Times when the virus is transmitted, i.e. winter, summer, autumn, spring, during cold weather, or in different climates."
]

# An integer, the number of top results you want to see.
n = 10

In [None]:
find_these(queries=queries, n=n)

In [None]:
print(max([i.split("q") for i in df["text"]], key=len))

In [None]:
df

# Part 3: The to-do list

There's still a lot of work to be done. 

* Implement something that replaces all the initialisms and acronyms with their full word forms.
    * See: https://pypi.org/project/scispacy/
* Implement entity linking to canonicalize data and increase consistency across vectorized outputs. 
    * See: UmlsEntityLinker
* Change the dictionary parsers to include information like authors, affiliations, and other metadata I initially left out for the sake of simplicity. 