# Basic Description

The current implementation recursively searches the folder for JSON files only, converts them to a dataframe, and processes the texts in parallel. Processing details are below.

* Text is first labeled by language. If it is English:

* Text acronyms are expanded, e.g. ADD --> Attention Deficit Disorder. This is done using the acronym expansion module in scispaCy (see their homepage for documentation).

* Concepts (general NER) in the text are linked to the Unified Medical Language System (UMLS) and canonicalized. The first alias for the entity is appended to the UMLS column.

* Text is non-destructively lemmatized: stop words are kept and punctuation is not deleted. For TF-IDF or other algorithms that depend on tokenization, you'll need to run a filter over this column for dimensionality reduction and cleaner text. This means Sars-Covid-19 stays Sars-Covid-19 as a single token. If you need to match drug names, you can do a full-text search on the "sentence" column, attempt to match tokens in the UMLS column, or match NER results in the DRUG column.

* A second NER pass is run using four NER-specific models from scispaCy: "en_ner_craft_md", "en_ner_jnlpba_md", "en_ner_bc5cdr_md", and "en_ner_bionlp13cg_md". For more information, please see their homepage.
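
Because the lemma column keeps every token, a filtering pass is needed before TF-IDF. The following is a minimal sketch, assuming each row's lemmas are a list of strings. The `filter_lemmas` helper and the tiny stop-word set are illustrative and not part of this notebook.

```python
import string

# Illustrative stop-word set; a real run would use a fuller list.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "be"}

def filter_lemmas(lemmas):
    """Drop stop words and punctuation-only tokens, lowercase the rest."""
    kept = []
    for lemma in lemmas:
        token = lemma.lower()
        if token in STOP_WORDS:
            continue
        # Skip tokens made up entirely of punctuation, e.g. "," or "--".
        if all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept

lemmas = ["Sars-Covid-19", "be", "a", "zoonotic", "disease", "."]
print(filter_lemmas(lemmas))
```

Hyphenated tokens such as Sars-Covid-19 survive the filter because only tokens that consist entirely of punctuation are dropped.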


## A note on the Extraction class, and section labels

* The extraction class needs to be edited to read the metadata file and choose files accordingly. Right now, this is at the top of our priority list for tasks in #datasets, and if you can help with this please PM Brandon Eychaner. 

* Section labels are _messy_. There are more than 250,000 unique section labels in the JSONs alone. I listed the top 1000 section labels by count and took the obvious ones, and mapped them in the "filter_dict" variable to account for the majority of important sections. This is an area of ongoing work. 
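
The label survey described above can be approximated with a frequency count. This is a rough sketch, assuming the raw labels are available as a list of strings; `top_section_labels` is a hypothetical helper, not the code that produced "filter_dict".

```python
from collections import Counter

def top_section_labels(labels, n=1000):
    """Return the n most frequent labels after lowercasing and trimming."""
    counts = Counter(
        label.strip().lower() for label in labels if label.strip()
    )
    return counts.most_common(n)

raw_labels = ["Introduction", "introduction ", "METHODS", "Methods",
              "Results", "methods", ""]
for label, count in top_section_labels(raw_labels, n=3):
    print(label, count)
```

Lowercasing and trimming before counting merges trivial variants, which leaves fewer distinct labels to map by hand.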


In [1]:
import pandas as pd 
import os
import numpy as np
import scispacy
import json
import spacy
from tqdm.notebook import tqdm
from scipy.spatial import distance
import ipywidgets as widgets
from scispacy.abbreviation import AbbreviationDetector
from spacy_langdetect import LanguageDetector
# UMLS linking will find concepts in the text, and link them to UMLS. 
from scispacy.umls_linking import UmlsEntityLinker
import time
from spacy.vocab import Vocab
from multiprocessing import Process, Queue, Manager
from multiprocessing.pool import Pool
from functools import partial

In [2]:
# Returns a [paper_id, section, text] list that's easy to parse in pandas.
def extract_title_from_json(js):
    
    # For text mining purposes, we're only interested in 4 fields:
    # abstract, paper_id (for ease of indexing), title, and body text.
    # This function handles the title; the others are handled below.
    json_dict = [
            js["paper_id"],
            "title",
            js["metadata"]["title"],
            ]
    return json_dict

# Returns a list of [paper_id, figure_id, text] rows that's easy to parse in pandas. For tables! :D
def extract_tables_from_json(js):
    json_dict_list = []
    # Figures contain useful information. Since NLP doesn't handle images and tables,
    # we can leverage this text data in lieu of visual data.
    for figure in list(js["ref_entries"].keys()):
        json_dict = [
            js["paper_id"],
            figure,
            js["ref_entries"][figure]["text"]]
        json_dict_list.append(json_dict)
    return json_dict_list

def extract_abstract_from_json(js):
    
    # In this particular dataset, some abstracts have multiple sections,
    # with ["abstract"][1] or later representing keywords or extra info. 
    # We only want to keep [0]["text"] in these cases. 
    abstract = js.get("abstract", "")
    if isinstance(abstract, list):
        # An empty list means the paper has no abstract.
        text = abstract[0]["text"] if len(abstract) > 0 else ""
    else:
        # Otherwise ["abstract"] is plain text and we can grab it whole.
        text = abstract
    return [js["paper_id"], "abstract", text]

# Kudos and thanks to @Imran for creating this amazing iterator <3 <3 <3 

class Extraction:
    def __init__(self,data_dir='/home/acorn/Documents/covid/'):
        self.map2file = self.create_map2file(data_dir)
    def create_map2file(self,data_dir):
        map2file = dict()
        for dirname, _, filenames in os.walk(data_dir):
            for filename in filenames:
                name = filename.split('.')
                if len(name) > 1 and name[1] == 'json':
                    map2file[name[0]] = os.path.join(dirname, filename)
        return map2file
    def prep_data(self,file_list=None):
        '''
        Generator yielding one [paper_id, section_name, text] row at a time:
        title, abstract, each body-text section, then figure and table captions.
        '''
        if file_list is None:
            files = list(self.map2file)
        else:
            files = file_list
        for file_id in files:
            with open(self.map2file[file_id]) as paperjs:
                jsfile = json.load(paperjs)
            yield extract_title_from_json(jsfile)
            yield extract_abstract_from_json(jsfile)
            for section in jsfile['body_text']:
                yield [file_id,section['section'],section['text']]
            for row in extract_tables_from_json(jsfile):
                yield row
                    
filter_dict = {
    "discussion": ["conclusions","conclusion",'| discussion', "discussion",  'concluding remarks',
                   'discussion and conclusions','conclusion:', 'discussion and conclusion',
                   'conclusions:', 'outcomes', 'conclusions and perspectives', 
                   'conclusions and future perspectives', 'conclusions and future directions'],
    "results": ['executive summary', 'result', 'summary','results','results and discussion','results:',
                'comment',"findings"],
    "introduction": ['introduction', 'background', 'i. introduction','supporting information','| introduction'],
    "methods": ['methods','method','statistical methods','materials','materials and methods',
                'data collection','the study','study design','experimental design','objective',
                'objectives','procedures','data collection and analysis', 'methodology',
                'material and methods','the model','experimental procedures','main text',],
    "statistics": ['data analysis','statistical analysis', 'analysis','statistical analyses', 
                   'statistics','data','measures'],
    "clinical": ['diagnosis', 'diagnostic features', "differential diagnoses", 'classical signs','prognosis', 'clinical signs', 'pathogenesis',
                 'etiology','differential diagnosis','clinical features', 'case report', 'clinical findings',
                 'clinical presentation'],
    'treatment': ['treatment', 'interventions'],
    "prevention": ['epidemiology','risk factors'],
    "subjects": ['demographics','samples','subjects', 'study population','control','patients', 
               'participants','patient characteristics'],
    "animals": ['animals','animal models'],
    "abstract": ["abstract", 'a b s t r a c t','author summary'], 
    "review": ['review','literature review','keywords']}

def invert_dict(d): 
    inverse = dict() 
    for key in d: 
        # Go through the list that is saved in the dict:
        for item in d[key]:
            # Check if in the inverted dict the key exists
            if item not in inverse: 
                # If not create a new list
                inverse[item] = [key] 
            else: 
                inverse[item].append(key) 
    return inverse
inverted_dict = invert_dict(filter_dict)
    
def get_section_name(text):
    if len(text) == 0:
        return(text)
    text = text.lower()
    if text in inverted_dict.keys():
        return(inverted_dict[text][0])
    else:
        if "case" in text or "study" in text: 
            return("methods")
        elif "clinic" in text:
            return("clinical")
        elif "stat" in text:
            return("statistics")
        elif "intro" in text or "backg" in text:
            return("introduction")
        elif "data" in text:
            return("statistics")
        elif "discuss" in text:
            return("discussion")
        elif "patient" in text:
            return("subjects")
        else: 
            return(text)

def init_nlp():
    nlp = spacy.load("/home/acorn/Downloads/en_core_sci_lg-0.2.4/en_core_sci_lg/en_core_sci_lg-0.2.4/", disable=["tagger"])
    nlp.max_length=2000000

    # We also need to detect language, or else we'll be parsing non-english text 
    # as if it were English. 
    nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)

    # Add the abbreviation pipe to the spacy pipeline. Only need to run this once.
    abbreviation_pipe = AbbreviationDetector(nlp)
    nlp.add_pipe(abbreviation_pipe)

    # Our linker will look up named entities/concepts in the UMLS graph and normalize
    # the data for us. 
    linker = UmlsEntityLinker(resolve_abbreviations=True)
    nlp.add_pipe(linker)
    
    new_vector = nlp(
               """Positive-sense single‐stranded ribonucleic acid virus, subgenus 
                   sarbecovirus of the genus Betacoronavirus. 
                   Also known as severe acute respiratory syndrome coronavirus 2, 
                   also known by 2019 novel coronavirus. It is 
                   contagious in humans and is the cause of the ongoing pandemic of 
                   coronavirus disease. Coronavirus disease 2019 is a zoonotic infectious 
                   disease.""").vector

    vector_data = {"COVID-19": new_vector,
               "2019-nCoV": new_vector,
               "SARS-CoV-2": new_vector}

    vocab = Vocab()
    for word, vector in vector_data.items():
        nlp.vocab.set_vector(word, vector)
    
    return(nlp, linker)
def init_ner():
    models = ["en_ner_craft_md", "en_ner_jnlpba_md","en_ner_bc5cdr_md","en_ner_bionlp13cg_md"]
    nlps = [spacy.load(model) for model in models]
    return(nlps)

def gather_everything(data_dir):
    ex = Extraction(data_dir=data_dir)
    df_iter = ex.prep_data(None) 
    df_list = list(df_iter)
    df = pd.DataFrame(columns=["paper_id","section","text"], data=df_list)
    df["section"] = [get_section_name(i) for i in df["section"]]
    return(df)

def pipeline(df):
    languages = []
    start_chars = []
    end_chars = []
    entities = []
    sentences = []
    lemmas = []
    vectors = []
    _ids = []
    columns = []
    
    nlp, linker = init_nlp()
    nlps = init_ner()
    
    scispacy_ent_types = ['GGP', 'SO', 'TAXON', 'CHEBI', 'GO', 'CL', 'DNA', 'CELL_TYPE', 'CELL_LINE', 'RNA', 'PROTEIN', 
                          'DISEASE', 'CHEMICAL', 'CANCER', 'ORGAN', 'TISSUE', 'ORGANISM', 'CELL', 'AMINO_ACID',
                          'GENE_OR_GENE_PRODUCT', 'SIMPLE_CHEMICAL', 'ANATOMICAL_SYSTEM', 'IMMATERIAL_ANATOMICAL_ENTITY',
                          'MULTI-TISSUE_STRUCTURE', 'DEVELOPING_ANATOMICAL_STRUCTURE', 'ORGANISM_SUBDIVISION',
                          'CELLULAR_COMPONENT', 'PATHOLOGICAL_FORMATION']
    
    for i in tqdm(range(len(df))):
        doc = nlp(str(df.iloc[i]["text"]))
        sents = [sent for sent in doc.sents]

        if len(doc._.abbreviations) > 0 and doc._.language["language"] == "en":
            doc._.abbreviations.sort()
            join_list = []
            start = 0
            for abbrev in doc._.abbreviations:
                join_list.append(str(doc.text[start:abbrev.start_char]))
                if len(abbrev._.long_form) > 5: #Increase length so "a" and "an" don't get un-abbreviated
                    join_list.append(str(abbrev._.long_form))
                else:
                    join_list.append(str(doc.text[abbrev.start_char:abbrev.end_char]))
                start = abbrev.end_char
            # Keep the text that follows the last abbreviation.
            join_list.append(str(doc.text[start:]))
            # Reassign fixed body text to article in df.
            new_text = "".join(join_list)
            # We have new text. Re-nlp the doc for further processing!
            doc = nlp(new_text)

        if doc._.language["language"] == "en" and len(doc.text) > 5:
            sents = [sent for sent in doc.sents]
            for sent in sents:
                languages.append(doc._.language["language"])
                sentences.append(sent.text)
                vectors.append(sent.vector)
                lemmas.append([token.lemma_ for token in sent])
                doc_ents = []
                for ent in sent.ents: 
                    if len(ent._.umls_ents) > 0:
                        poss = linker.umls.cui_to_entity[ent._.umls_ents[0][0]].canonical_name
                        doc_ents.append(poss)
                entities.append(doc_ents)
                _ids.append(df.iloc[i]["paper_id"])
                columns.append(df.iloc[i]["section"])
        else: 
            entities.append("[]")
            sentences.append(doc.text)
            vectors.append(np.zeros(200))
            lemmas.append("[]")
            _ids.append(df.iloc[i,0])
            languages.append(doc._.language["language"])
            columns.append(df.iloc[i]["section"])

    new_df = pd.DataFrame(data={"paper_id": _ids, "language": languages,
                                "section": columns, "sentence": sentences,
                                "lemma": lemmas, "UMLS": entities, "w2vVector": vectors})
    for col in scispacy_ent_types:
        new_df[col] = "[]"
    for j in tqdm(new_df.index):
        if new_df.iloc[j]["language"] == "en":
            for nlp in nlps:
                doc = nlp(str(new_df.iloc[j]["sentence"]))
                keys = list(set([ent.label_ for ent in doc.ents]))
                for key in keys:

                    # Some entity types are present in the model, but not in the documentation! 
                    # In that case, we'll just automatically add it to the df. 
                    if key not in new_df.columns:
                        new_df[key] = "[]"

                    values = [ent.text for ent in doc.ents if ent.label_ == key]
                    new_df.at[j,key] = values
    sentence_id = [new_df.iloc[i]["paper_id"][0:5] + str(i) + new_df.iloc[i]["paper_id"][-5:] for i in new_df.index]
    new_df["sentence_id"] = sentence_id
                    
    new_df.to_csv("/home/acorn/Documents/covid/df_parts/" + new_df.iloc[0]["paper_id"] + ".complete", index=False)
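
The abbreviation-expansion step inside `pipeline` is easier to follow on plain strings. The sketch below is a simplified stand-in, assuming abbreviations arrive as non-overlapping `(start, end, long_form)` character spans. `expand_abbreviations` is a hypothetical helper and does not use spaCy.

```python
def expand_abbreviations(text, abbreviations, min_long_form_len=6):
    """Replace each (start, end, long_form) span in text with its long form.

    Spans are character offsets into text and must not overlap.
    Long forms shorter than min_long_form_len are left unexpanded.
    """
    pieces = []
    cursor = 0
    for start, end, long_form in sorted(abbreviations):
        # Copy the untouched text between the previous span and this one.
        pieces.append(text[cursor:start])
        if len(long_form) >= min_long_form_len:
            pieces.append(long_form)
        else:
            pieces.append(text[start:end])
        cursor = end
    # Copy whatever follows the last abbreviation.
    pieces.append(text[cursor:])
    return "".join(pieces)

text = "Patients with ADD were excluded."
print(expand_abbreviations(text, [(14, 17, "Attention Deficit Disorder")]))
```

Sorting the spans first lets the text be rebuilt in a single left-to-right pass.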

In [None]:
df = gather_everything("/home/acorn/Documents/covid/v7/")
df.to_csv("dataset_v7.csv", index=False)
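
The Extraction class does not yet consult the metadata file. One possible approach is to build the `file_list` argument of `prep_data` from it. The sketch below is only an illustration: the column names `sha` and `source_x`, the `"; "` separator, and the `select_file_ids` helper are all assumptions about the metadata layout, not something this notebook defines.

```python
import csv
import io

def select_file_ids(metadata_file, wanted_sources):
    """Return the ids of rows whose source is in wanted_sources.

    Assumes (hypothetically) a CSV with "sha" and "source_x" columns.
    """
    file_ids = []
    for row in csv.DictReader(metadata_file):
        if row["source_x"] in wanted_sources and row["sha"]:
            # Assumption: one row may list several ids separated by ";".
            file_ids.extend(s.strip() for s in row["sha"].split(";"))
    return file_ids

# Tiny in-memory stand-in for a real metadata file.
sample = io.StringIO(
    "sha,source_x,title\n"
    "aaa111,PMC,Paper one\n"
    ",Elsevier,Paper without full text\n"
    "bbb222; ccc333,PMC,Paper three\n"
)
print(select_file_ids(sample, {"PMC"}))
```

The resulting ids could then be passed as `file_list`, after dropping any that are missing from `map2file`, since `prep_data` looks each id up there directly.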

In [4]:
def parallelize_dataframe(df, func, n_cores=6):
    # One chunk per worker, so the split follows n_cores.
    df_split = np.array_split(df, n_cores)
    pool = Pool(n_cores)
    list(tqdm(pool.imap_unordered(func, df_split), total=len(df_split)))
    pool.close()
    pool.join()
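
`np.array_split` produces contiguous chunks whose sizes differ by at most one row, with the larger chunks first. A pure-Python sketch of that splitting rule, using a hypothetical `split_evenly` helper on a plain list:

```python
def split_evenly(items, n_chunks):
    """Split items into n_chunks contiguous parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

rows = list(range(10))
print([len(chunk) for chunk in split_evenly(rows, 4)])
```

Since every worker receives nearly the same number of rows, uneven runtimes come from text length rather than from the split itself.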

In [5]:
parallelize_dataframe(df, pipeline, n_cores=6)


### Put it all together

In [None]:
directory = "/home/acorn/Documents/covid/df_parts/"
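# ignore_index gives the combined frame a clean 0..n-1 index, which the
# later index-aligned assignment of sentence_id relies on.
df = pd.concat([pd.read_csv(os.path.join(directory, f)) for f in os.listdir(directory) if f.endswith("complete")],
               ignore_index=True)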
sentence_id = df["sentence_id"]

# Cleanup

The following code is simply cleanup after the extraction process. First, we'll save the text data in a JSON file. Next, we'll save the vector data separately, because it's large and most people won't be using it.
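
The entity columns hold Python lists, which come back as plain strings after a CSV round trip, so they need to be parsed before export. A small sketch of that effect, using the standard `csv` module instead of pandas; the column name and values are made up for illustration.

```python
import ast
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["sentence_id", "DISEASE"])
# The list is written out as its string representation.
writer.writerow(["abc0", ["fever", "cough"]])

buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["DISEASE"]
# literal_eval safely turns the string back into a list.
parsed = ast.literal_eval(raw)
print(type(raw).__name__, type(parsed).__name__, parsed)
```

`ast.literal_eval` only accepts Python literals, which makes it a safer choice than `eval` for this kind of parsing.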

In [34]:
import ast
# Parse the stringified lists in the entity columns back into Python lists.
for col in df.columns[8:]: 
    if col != "sentence_id":
        df[col] = [ast.literal_eval(i) for i in df[col].to_list()]
vectors = df["w2vVector"]
df.drop(columns=["w2vVector"], inplace=True)
df.to_json("v7_text.json", orient="records")

In [73]:
result = [i for i in vectors.apply(lambda x: 
                       np.fromstring(
                           x.replace('\n','')
                            .replace('[','')
                            .replace(']','')
                            .replace('  ',' '), sep=' '))]
vec_df = pd.DataFrame.from_records(result)

In [76]:
vec_df["sentence_id"] = sentence_id

In [77]:
vec_df.to_json("v7_vectors.json", orient="records")