# Browsing CoronaWhy annotated papers from CORD19 collections (v19)

Our annotation pipeline produces one JSON file for each source file from the CORD19 collection. The JSON output file has the following structure:

* `paper_id`: ID from CORD19, like `"000eec3f1e93c3792454ac59415c928ce3a6b4ad"`
    
    
* `abstract`: list of abstract text IDs, if available, like:

    `["d04b9df0-ad8e-11ea-9703-98fa9b740088", "d0511bee-ad8e-11ea-9180-98fa9b740088","d06e40d4-ad8e-11ea-ba29-98fa9b740088"]`
	
    
* `text_body`: a list of paper sections' dictionaries, by which we can browse the whole document. Each mini-dictionary has two keys: `section_id` and `section_name`:

	`{"section_id":"d08fd14a-ad8e-11ea-9a80-98fa9b740088", "section_name":"introduction"}`
      
      
* `language`: the language code of the paper. For non-English papers (`fr`, `it`, `jp`) it should be accompanied by an `original_text` entry:
   `{"language":"fr"}`
   
   
* `original_text`: the original source text, which was subsequently translated and annotated with English language models.

    `"original_text":"Abstract\nReçu et accept…arge de ces infections."`


* `tables`: tables extracted from the source text


* IDs of sections and sentences: every remaining top-level key is the ID of a section or of a sentence.
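
Because metadata keys and ID keys sit side by side at the top level, it can help to separate them before browsing. The following is a minimal sketch, assuming the six metadata keys listed above are the only non-ID keys; `split_keys` and the toy paper with its made-up IDs are hypothetical and not part of the pipeline.

```python
# Top-level keys that carry paper-level metadata, as listed above.
METADATA_KEYS = {"paper_id", "abstract", "text_body", "language", "original_text", "tables"}


def split_keys(paper):
    """Separate metadata entries from the entries keyed by section or sentence ID."""
    metadata = {key: value for key, value in paper.items() if key in METADATA_KEYS}
    id_entries = {key: value for key, value in paper.items() if key not in METADATA_KEYS}
    return metadata, id_entries


# Toy paper with made-up IDs (hypothetical, for illustration only).
toy_paper = {
    "paper_id": "toy-paper",
    "abstract": [],
    "text_body": [{"section_id": "sec-1", "section_name": "introduction"}],
    "language": "en",
    "sec-1": ["sent-1"],
    "sent-1": {"sentence_id": "sent-1", "tokens": "A toy sentence.", "lemmas": ["toy", "sentence"]},
}

metadata, id_entries = split_keys(toy_paper)
print(sorted(metadata))
print(sorted(id_entries))
```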


## Browsing a paper's sections

The JSON dictionary has the following top-level keys:

In [5]:
# Adapt this line to the path of your preprocessed files
exemplary_json_file = r"C:\Users\lga\eclipse-workspace\Covid19\FirstAttempt\preprocessed\00a0ab182dc01b6c2e737dfae585f050dcf9a7a5.json"

import json
with open(exemplary_json_file, encoding="utf8") as f:
    file_dict = json.load(f)

for single_key in file_dict.keys():
    print(single_key)

paper_id
abstract
text_body
language
original_text
tables
ffe44c74-ad8b-11ea-8c79-98fa9b740088
000dceb0-ad8c-11ea-8759-98fa9b740088
00346990-ad8c-11ea-ab16-98fa9b740088
005beeca-ad8c-11ea-8c96-98fa9b740088
0082d7e8-ad8c-11ea-bf3b-98fa9b740088
00a9e7ca-ad8c-11ea-b5d9-98fa9b740088
00d11ec2-ad8c-11ea-a9a6-98fa9b740088
00f807ba-ad8c-11ea-b762-98fa9b740088
011f3ec2-ad8c-11ea-abc4-98fa9b740088
014675dc-ad8c-11ea-8e09-98fa9b740088
016d85f6-ad8c-11ea-82ae-98fa9b740088
0194bce2-ad8c-11ea-a46a-98fa9b740088
01bba5d2-ad8c-11ea-a0ee-98fa9b740088
01e4b1ae-ad8c-11ea-962a-98fa9b740088
020b257a-ad8c-11ea-9f35-98fa9b740088
0231e750-ad8c-11ea-84bf-98fa9b740088
02596c7a-ad8c-11ea-91e9-98fa9b740088
02802e54-ad8c-11ea-bcaf-98fa9b740088
02a76562-ad8c-11ea-bf27-98fa9b740088
ffb4b248-ad8b-11ea-b46a-98fa9b740088
02db6db8-ad8c-11ea-9e21-98fa9b740088
02ce2742-ad8c-11ea-88ec-98fa9b740088
02ee7f28-ad8c-11ea-bd22-98fa9b740088
02f9f0cc-ad8c-11ea-8b9b-98fa9b740088
03045112-ad8c-11ea-97a8-98fa9b740088
030ed9ba-ad8c-11e

As we see, the paper dictionary has a couple of self-evident keys, like:

In [6]:
print(file_dict["paper_id"])
print(file_dict["language"])

00a0ab182dc01b6c2e737dfae585f050dcf9a7a5
en


The paper ID corresponds to the paper ID in the CORD19 collection.

`abstract` is a list of IDs and `text_body` is a list of section dictionaries; with these we can navigate directly through the file:

In [9]:
print(file_dict["abstract"])

[]


In [10]:
for single_entry in file_dict["text_body"]:
    print(single_entry)

{'section_id': 'ffb4b248-ad8b-11ea-b46a-98fa9b740088', 'section_name': 'brief history of the localised epidemic.'}
{'section_id': '02ce2742-ad8c-11ea-88ec-98fa9b740088', 'section_name': 'middle east respiratory syndrome (mers)'}
{'section_id': '02df65d2-ad8c-11ea-98bb-98fa9b740088', 'section_name': 'middle east respiratory syndrome (mers)'}
{'section_id': '032f0ab8-ad8c-11ea-8e4d-98fa9b740088', 'section_name': 'middle east respiratory syndrome (mers)'}
{'section_id': '03c3a968-ad8c-11ea-b39b-98fa9b740088', 'section_name': 'middle east respiratory syndrome (mers)'}
{'section_id': '03d07c12-ad8c-11ea-805f-98fa9b740088', 'section_name': 'middle east respiratory syndrome (mers)'}
{'section_id': '041a2f66-ad8c-11ea-affd-98fa9b740088', 'section_name': 'the middle east respiratory syndrome coronavirus (mers-cov)'}
{'section_id': '043f437a-ad8c-11ea-be3d-98fa9b740088', 'section_name': 'the viral genome'}
{'section_id': '06182210-ad8c-11ea-9a3a-98fa9b740088', 'section_name': 'the viral genome'}

Each `text_body` entry has two keys: `section_id`, by which we can find the list of sentences of a particular section, and `section_name`, which may have been normalised (introduction, methods, results, etc.).
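
In the listing above, the same section name appears under several section IDs, so grouping the IDs by name can make navigation easier. This is a minimal sketch under that observation; `group_sections_by_name` and the toy `text_body` with made-up IDs are hypothetical.

```python
from collections import defaultdict


def group_sections_by_name(text_body):
    """Map each section name to the section IDs that carry it, in document order."""
    grouped = defaultdict(list)
    for entry in text_body:
        grouped[entry["section_name"]].append(entry["section_id"])
    return dict(grouped)


# Toy text_body with made-up IDs (hypothetical, for illustration only).
toy_text_body = [
    {"section_id": "sec-1", "section_name": "introduction"},
    {"section_id": "sec-2", "section_name": "methods"},
    {"section_id": "sec-3", "section_name": "methods"},
]

sections_by_name = group_sections_by_name(toy_text_body)
print(sections_by_name)
```

With a real file, `group_sections_by_name(file_dict["text_body"])` would be the corresponding call.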

With the `section_id` we retrieve a list of sentences, or more precisely their IDs, for example:

In [18]:
some_example_section = file_dict["text_body"][3]["section_id"]
print(some_example_section)

032f0ab8-ad8c-11ea-8e4d-98fa9b740088


In [19]:
# Look up the sentence IDs of the section selected in the previous cell
some_sentence_ids = file_dict[some_example_section]
print(some_sentence_ids)

['0337e594-ad8c-11ea-949b-98fa9b740088', '0349bfe2-ad8c-11ea-b26b-98fa9b740088', '035b24e4-ad8c-11ea-a07b-98fa9b740088', '036c8a24-ad8c-11ea-bf94-98fa9b740088', '037e165c-ad8c-11ea-ac06-98fa9b740088', '038f7e3e-ad8c-11ea-928f-98fa9b740088', '03a0df2c-ad8c-11ea-bde1-98fa9b740088', '03b24458-ad8c-11ea-bf17-98fa9b740088']


In [21]:
for single_sent_id in some_sentence_ids:
    print(file_dict[single_sent_id].keys())

dict_keys(['sentence_id', 'tokens', 'lemmas', 'umls', 'umls_ids', 'sent2vec', 'GGP', 'TAXON', 'PROTEIN', 'DISEASE', 'CHEMICAL', 'ORGANISM', 'GENE_OR_GENE_PRODUCT', 'SIMPLE_CHEMICAL'])
dict_keys(['sentence_id', 'tokens', 'lemmas', 'umls', 'umls_ids', 'sent2vec', 'GGP', 'TAXON', 'PROTEIN', 'DISEASE', 'CHEMICAL', 'ORGANISM', 'GENE_OR_GENE_PRODUCT', 'SIMPLE_CHEMICAL'])
dict_keys(['sentence_id', 'tokens', 'lemmas', 'umls', 'umls_ids', 'sent2vec', 'GGP', 'TAXON', 'PROTEIN', 'DISEASE', 'CHEMICAL', 'ORGANISM', 'GENE_OR_GENE_PRODUCT', 'SIMPLE_CHEMICAL'])
dict_keys(['sentence_id', 'tokens', 'lemmas', 'umls', 'umls_ids', 'sent2vec', 'GGP', 'TAXON', 'PROTEIN', 'DISEASE', 'CHEMICAL', 'ORGANISM', 'GENE_OR_GENE_PRODUCT', 'SIMPLE_CHEMICAL'])
dict_keys(['sentence_id', 'tokens', 'lemmas', 'umls', 'umls_ids', 'sent2vec', 'GGP', 'TAXON', 'PROTEIN', 'DISEASE', 'CHEMICAL', 'ORGANISM', 'GENE_OR_GENE_PRODUCT', 'SIMPLE_CHEMICAL'])
dict_keys(['sentence_id', 'tokens', 'lemmas', 'umls', 'umls_ids', 'sent2vec', 'G

The sentence dictionary has multiple keys, which we discuss below.

In [26]:
print(file_dict[some_sentence_ids[0]].keys())

dict_keys(['sentence_id', 'tokens', 'lemmas', 'umls', 'umls_ids', 'sent2vec', 'GGP', 'TAXON', 'PROTEIN', 'DISEASE', 'CHEMICAL', 'ORGANISM', 'GENE_OR_GENE_PRODUCT', 'SIMPLE_CHEMICAL'])


Each sentence has obligatory keys, such as `sentence_id`, `tokens` and `lemmas`. Others, such as `GGP`, `PROTEIN` or `DISEASE`, come from different SciSpacy language models; a key appears in the sentence dictionary if the corresponding model identified such an entity.
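
Since the entity keys are optional, indexing them directly could raise a `KeyError` for sentences in which a model found nothing. The sketch below shows one defensive way to read them, assuming the eight entity labels seen in the output above; `get_entities` and the toy sentences are hypothetical.

```python
# Entity labels taken from the sentence keys printed above.
ENTITY_KEYS = [
    "GGP", "TAXON", "PROTEIN", "DISEASE",
    "CHEMICAL", "ORGANISM", "GENE_OR_GENE_PRODUCT", "SIMPLE_CHEMICAL",
]


def get_entities(sentence):
    """Return only the entity labels that are present in this sentence dictionary."""
    return {key: sentence[key] for key in ENTITY_KEYS if key in sentence}


# Toy sentences with made-up content (hypothetical, for illustration only).
toy_sentence = {
    "sentence_id": "sent-1",
    "tokens": "Fever was reported.",
    "lemmas": ["fever", "report"],
    "DISEASE": ["fever"],
}
toy_sentence_without_entities = {"sentence_id": "sent-2", "tokens": "See below.", "lemmas": ["see"]}

print(get_entities(toy_sentence))
print(get_entities(toy_sentence_without_entities))
```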

The `tokens` key holds the original sentence in raw, non-annotated form:

In [27]:
print(file_dict[some_sentence_ids[0]]["tokens"])

The mean incubation period in a study of 47 cases was 5.2 days, with 95% of cases having shown symptoms within 12.4 days (Assiri et al., 2013a) .
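
As the output above shows, `tokens` holds the raw sentence as a single string, so the text of a whole section can be rebuilt by joining its sentences. This is a minimal sketch, assuming every sentence ID listed under a section is also a top-level key; `section_text` and the toy paper with made-up IDs are hypothetical.

```python
def section_text(paper, section_id):
    """Join the raw sentences of one section into a single string."""
    sentence_ids = paper[section_id]
    return " ".join(paper[sentence_id]["tokens"] for sentence_id in sentence_ids)


# Toy paper with made-up IDs (hypothetical, for illustration only).
toy_paper = {
    "sec-1": ["sent-1", "sent-2"],
    "sent-1": {"sentence_id": "sent-1", "tokens": "First toy sentence."},
    "sent-2": {"sentence_id": "sent-2", "tokens": "Second toy sentence."},
}

print(section_text(toy_paper, "sec-1"))
```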


In the lemmatised form, we leave out function words:

In [28]:
print(file_dict[some_sentence_ids[0]]["lemmas"])

['mean', 'incubation', 'period', 'study', 'case', 'day', 'case', 'have', 'show', 'symptom', 'day', 'assiri', 'et', 'al.', '2013a']


Additionally, we provide UMLS forms and UMLS IDs:

In [29]:
print(file_dict[some_sentence_ids[0]]["umls"])

['oral incubation', 'Clinical Trial Period', 'Study Object', 'Case (situation)', 'day', 'Case (situation)', 'Symptoms aspect', 'day']


In [30]:
print(file_dict[some_sentence_ids[0]]["umls_ids"])

['C2752975', 'C2347804', 'C1705923', 'C0868928', 'C0439228', 'C0868928', 'C0683368', 'C0439228']
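
The two lists above have the same length, and repeated names line up with repeated IDs, which suggests they are aligned by position. Under that assumption (it is not stated by the pipeline), the names and IDs can be paired as in this sketch, which reuses the values printed above; the variable names are hypothetical.

```python
# Values copied from the two outputs above.
umls = ['oral incubation', 'Clinical Trial Period', 'Study Object', 'Case (situation)',
        'day', 'Case (situation)', 'Symptoms aspect', 'day']
umls_ids = ['C2752975', 'C2347804', 'C1705923', 'C0868928',
            'C0439228', 'C0868928', 'C0683368', 'C0439228']

# Assumption: the i-th ID belongs to the i-th UMLS form.
concepts = list(zip(umls_ids, umls))

# Collapse repeated mentions into one entry per concept ID.
concept_names = dict(concepts)

for concept_id, name in concept_names.items():
    print(concept_id, name, sep=": ")
```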


For each sentence we also provide a sent2vec vector from the SciSpacy model, computed as the average of the vectors of the sentence's words (without function words).

In [31]:
print(file_dict[some_sentence_ids[0]]["sent2vec"])

[0.44271278381347656, 0.9108926057815552, 2.5878982543945312, -0.04971019923686981, -2.7051639556884766, -2.092285633087158, -1.4935014247894287, 1.1537986993789673, -2.2723801136016846, -1.2489807605743408, 0.7622494101524353, -1.940104603767395, 1.4137200117111206, 0.024024873971939087, 0.4279756546020508, -0.3336946368217468, -0.2650260627269745, 0.31964194774627686, 1.7251719236373901, -0.040540069341659546, 0.10278214514255524, 0.850865364074707, -1.9847979545593262, -3.844939708709717, -1.4155571460723877, -0.46970459818840027, 0.7050877809524536, -0.04150650277733803, -1.8214622735977173, 0.9951539039611816, 2.3983359336853027, -0.5730693340301514, 1.008719563484192, -3.589824676513672, 0.8545082211494446, -0.5724735260009766, -0.24429459869861603, 2.4197657108306885, -3.476309299468994, 0.5884241461753845, -1.6153475046157837, 0.03435332700610161, 0.07721948623657227, 0.2923336625099182, 2.0886449813842773, 1.3625032901763916, -1.4790934324264526, -0.3812466859817505, 1.4828133

For simplicity, the vector is stored as a plain list, but it can easily be converted to a NumPy array or a PyTorch tensor.

In [32]:
print(len(file_dict[some_sentence_ids[0]]["sent2vec"]))

200
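
One typical use of such vectors is comparing two sentences by cosine similarity. The sketch below uses only the standard library so that it runs without extra dependencies; `cosine_similarity` and the short toy vectors are hypothetical, and real `sent2vec` lists would have 200 entries as shown above.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two equally long lists of floats (0.0 if either is all zeros)."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Short toy vectors (hypothetical, for illustration only).
same_direction = cosine_similarity([1.0, 0.0, 1.0], [2.0, 0.0, 2.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

With a real file, the two arguments would be the `sent2vec` lists of two sentence dictionaries.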


In [35]:
other_keys = [ 'GGP', 'TAXON', 'PROTEIN', 'DISEASE', 'CHEMICAL', 'ORGANISM', 'GENE_OR_GENE_PRODUCT', 'SIMPLE_CHEMICAL']

for other_dict_key in other_keys:
    print(other_dict_key , file_dict[some_sentence_ids[0]][other_dict_key], sep=": ")

GGP: ['MERS-CoV']
TAXON: ['virus']
PROTEIN: ['MERS-CoV']
DISEASE: ['absence of disease']
CHEMICAL: ['Al-Gethamy']
ORGANISM: ['HCW']
GENE_OR_GENE_PRODUCT: ['LRT']
SIMPLE_CHEMICAL: ['Al-Gethamy']


For non-English texts, the key `original_text` holds the whole source text in its original language, in this case French:

In [37]:
non_english_file = r"C:\Users\lga\eclipse-workspace\Covid19\FirstAttempt\preprocessed\000eec3f1e93c3792454ac59415c928ce3a6b4ad.json"


with open(non_english_file, encoding="utf8") as f:
    non_english_file_dict = json.load(f)

In [39]:
print(non_english_file_dict["original_text"])

Abstract
Reçu et accepté le 7 février 2004
Abstract
Les infections virales respiratoires communautaires sont fréquentes et le plus souvent bénignes. Beaucoup d'agents différents comme les virus influenza, ou para-influenza, le virus respiratoire syncitial, les rhinovirus, coronavirus, adénovirus et les herpès virus peuvent être isolés chez les patients immunocompétents. Parmi ces virus, le cytomégalovirus (CMV) peut être responsable de pneumonie nosocomiale en réanimation. Le diagnostic des infections virales est difficile car les signes cliniques sont non spécifiques et l'isolement du virus responsable difficile. Cependant, une symptomatologie clinique associant fièvre, myalgies, céphalées, pharyngite est fréquente dans les infections à Inflenza qui peuvent aboutir à des tableaux sévères. Enfin, le virus plus récent responsable d'infection respiratoire est un virus nouvellement découvert de la famille des coronavirus, le SRAS-CoV qui a été responsable d'une épidémie d'infections respi
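
When processing many files, it may be convenient to fetch the source text only for non-English papers. The English example earlier also lists an `original_text` key, so this sketch decides by `language` rather than by the presence of the key. It assumes English papers are marked `en`, as in the output above; `source_text` and the toy dictionaries are hypothetical.

```python
def source_text(paper):
    """Return the original-language text of a non-English paper, or None otherwise."""
    if paper.get("language", "en") == "en":
        return None
    return paper.get("original_text")


# Toy papers with made-up content (hypothetical, for illustration only).
toy_french_paper = {"language": "fr", "original_text": "Texte source."}
toy_english_paper = {"language": "en"}

print(source_text(toy_french_paper))
print(source_text(toy_english_paper))
```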