# Data Exploration

    Creation date : 2020-04-03 (Friday)
    Creator       : Stanislav Schmidt <stanislav.schmidt@epfl.ch>

- Thorough data exploration
- Statistics
    - missing values
    - data structure
    - text lengths
    - metadata
- Data ingestion/reading
- Data cleaning
- Proper data representation / data class

## Imports

In [78]:
from collections import Counter, OrderedDict
from functools import reduce
import pathlib
import textwrap
import json

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from tqdm.notebook import tqdm

import ipywidgets as widgets

## Read Data

Note that the dataset is constantly being updated. Therefore it makes sense to keep the local copy of the data up to date.

In [2]:
data_dir = pathlib.Path("../data/2020-04-03/")
data_dir.exists()

True

### Contents of the data folder

In [3]:
print("Folders:")
for f in sorted(data_dir.iterdir()):
    if f.is_dir():
        print(f)
print()

print("Files:")
for f in sorted(data_dir.iterdir()):
    if not f.is_dir():
        print(f)

Folders:
../data/2020-04-03/biorxiv_medrxiv
../data/2020-04-03/comm_use_subset
../data/2020-04-03/custom_license
../data/2020-04-03/noncomm_use_subset

Files:
../data/2020-04-03/COVID.DATA.LIC.AGMT.pdf
../data/2020-04-03/json_schema.txt
../data/2020-04-03/metadata.csv
../data/2020-04-03/metadata.readme


- The 4 folders contain JSON files, which represent the publications for which the full text PDF was available.
- `COVID.DATA.LIC.AGMT.pdf` is a license agreement for the dataset
- `json_schema.txt` contains the structure of all JSON files
- `metadata.csv` is a table with the metadata of all samples in the dataset. There are two types of entries:
    - Those for which the full text was available. These entries will have a corresponding JSON file
    - Those for which the full text was not available. This is the only place where such entries will appear.
- `metadata.readme` is a short description of the `metadata.csv` file and a changelog


### JSON files

In [10]:
n_json = len(list(data_dir.rglob("*.json")))
print("Total number of JSON files:", n_json)

Total number of JSON files: 33375


Read JSON files into a list

In [11]:
json_files = []

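for f in tqdm(data_dir.rglob("*.json"), total=n_json):
    # Use a context manager so that every file handle is closed again
    with open(f) as fh:
        json_file = json.load(fh)
    # The file name (without extension) must match the paper ID
    assert json_file['paper_id'] == f.stem
    json_files.append(json_file)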

HBox(children=(FloatProgress(value=0.0, max=33375.0), HTML(value='')))




### `metadata.readme`

In [12]:
with open(data_dir / 'metadata.readme', 'r') as f:
    print(f.read())

(1) Metadata for papers from these sources are combined: CZI, PMC, BioRxiv/MedRxiv. (total records 29500)
	- CZI 1236 records
	- PMC 27337
	- bioRxiv 566
	- medRxiv 361
(2) 17K of the paper records have PDFs and the hash of the PDFs are in 'sha'
(3) For PMC sourced papers, one paper's metadata can be associated with one or more PDFs/shas under that paper - a PDF/sha correponding to the main article, and possibly additional PDF/shas corresponding to supporting materials for the article.
(4)	13K of the PDFs were processed with fulltext ('has_full_text'=True)
(5) Various 'keys' are populated with the metadata:
	- 'pmcid': populated for all PMC paper records (27337 non null)
	- 'doi': populated for all BioRxiv/MedRxiv paper records and most of the other records (26357 non null)
	- 'WHO #Covidence': populated for all CZI records and none of the other records (1236 non null)
	- 'pubmed_id': populated for some of the records
	- 'Microsoft Academic Paper ID': populated for some of the records


### `json_schema.txt`

In [13]:
with open(data_dir / 'json_schema.txt', 'r') as f:
    print(f.read())

# JSON schema of full text documents


{
    "paper_id": <str>,                      # 40-character sha1 of the PDF
    "metadata": {
        "title": <str>,
        "authors": [                        # list of author dicts, in order
            {
                "first": <str>,
                "middle": <list of str>,
                "last": <str>,
                "suffix": <str>,
                "affiliation": <dict>,
                "email": <str>
            },
            ...
        ],
        "abstract": [                       # list of paragraphs in the abstract
            {
                "text": <str>,
                "cite_spans": [             # list of character indices of inline citations
                                            # e.g. citation "[7]" occurs at positions 151-154 in "text"
                                            #      linked to bibliography entry BIBREF3
                    {
                        "start": 151,
                        "end": 1

#### Top-Level Keys

In fact, the file `json_schema.txt` above suggests a slightly wrong structure. The top-level entries should be the following:

In [60]:
json_files[0].keys()

dict_keys(['paper_id', 'metadata', 'abstract', 'body_text', 'bib_entries', 'ref_entries', 'back_matter'])

Are we sure there are no other top-level keys? Check:

In [61]:
top_level_keys = reduce(
    lambda s1, s2: s1.union(s2),
    (json_file.keys() for json_file in json_files),
    set())

top_level_keys

{'abstract',
 'back_matter',
 'bib_entries',
 'body_text',
 'metadata',
 'paper_id',
 'ref_entries'}

So it's really just these seven top-level entries and nothing more.

A JSON file has thus the following structure:

```python
json_file {
    'paper_id': str,
    'metadata': dict,
    'abstract': [text_chunk_1, text_chunk_2, ...],
    'body_text': [text_chunk_1, text_chunk_2, ...],
    'back_matter': [text_chunk_1, text_chunk_2, ...],
    'ref_entries': dict,
    'bib_entries': dict,
}
```

A `text_chunk` represents a block of text, usually a paragraph, and has the following structure:

```python
text_chunk {
    'section': str,
    'text': str,
    'cite_spans': [ref_1, ref_2, ...],
    'ref_spans': [ref_1, ref_2, ...],
}
```

A `ref` represents an inline reference within the text. It can be one of two things:

1. A reference to a figure or table within the text. In this case the corresponding item is found in `ref_entries`
2. A citation reference. In this case the corresponding item is found in `bib_entries`

A `ref` has thus the following structure:

```python
ref {
    'start': int,
    'end': int,
    'text': str,
    'ref_id': str,
}
```

The `ref`s were probably extracted from PDFs by looking for hyperlinks. Thus `start` and `end` represent the position in the text where the hyperlink was located. `text` is the text of the hyperlink. `ref_id` is the identifier of the item the hyperlink was pointing to, and the corresponding item can be found either in `bib_entries` or in `ref_entries`, depending on the reference type described above.
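
The lookup described above can be sketched as follows. This is a minimal illustration rather than part of the original analysis: the helper name `resolve_refs` and the toy chunk are hypothetical, and it assumes that `start`/`end` are character offsets with an exclusive `end` and that an unresolved `ref_id` may be missing or `None`.

```python
def resolve_refs(text_chunk, bib_entries, ref_entries):
    """Pair every inline reference of a text chunk with the entry it points to."""
    # Citations are looked up in bib_entries, figure/table references in ref_entries
    spans = [(span, bib_entries) for span in text_chunk.get('cite_spans', [])]
    spans += [(span, ref_entries) for span in text_chunk.get('ref_spans', [])]

    resolved = []
    for span, entries in spans:
        # Slice the chunk text with the stored offsets (end is exclusive)
        surface = text_chunk['text'][span['start']:span['end']]
        ref_id = span.get('ref_id')
        # Assumption: unresolved references simply yield None here
        resolved.append((surface, ref_id, entries.get(ref_id)))
    return resolved


# Hypothetical toy data that mimics the structure of a real text chunk
text = 'Plaque is a known cause (1) as shown in Fig. 2.'
cite_start = text.index('(1)')
fig_start = text.index('Fig. 2')
chunk = {
    'section': 'Introduction',
    'text': text,
    'cite_spans': [{'start': cite_start, 'end': cite_start + 3,
                    'text': '(1)', 'ref_id': 'BIBREF0'}],
    'ref_spans': [{'start': fig_start, 'end': fig_start + 6,
                   'text': 'Fig. 2', 'ref_id': 'FIGREF1'}],
}
bib_entries = {'BIBREF0': {'title': 'Some cited paper'}}
ref_entries = {'FIGREF1': {'text': 'Some caption', 'type': 'figure'}}

resolved = resolve_refs(chunk, bib_entries, ref_entries)
for surface, ref_id, target in resolved:
    print(surface, '->', ref_id, '->', target)
```

Comparing the sliced `surface` with the stored `text` field of each span is a cheap way to detect offset errors in the parsed files.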

#### Paper ID

The paper ID is just the SHA of the corresponding PDF file, saved as a string.

For example:

In [18]:
json_files[0]['paper_id']

'4dfec82a8f515375c1dcc8e570d37b1aa33a591c'

Make sure all `paper_id` entries are strings

In [19]:
all(isinstance(json_file['paper_id'], str) for json_file in json_files)

True

Let's check that all SHAs are unique

In [27]:
Counter(json_file['paper_id'] for json_file in json_files).most_common(1)

[('4dfec82a8f515375c1dcc8e570d37b1aa33a591c', 1)]

#### Metadata

#### Abstract

Preview the abstract field. As mentioned before it is a list of text chunks

In [64]:
json_files[0]['abstract']

[{'text': 'T lymphocyte cells, including regulatory T (Treg) and T helper 17 cells, have important roles in the human periodontium. However, the basis for Treg cytokine expression in various compartments of the periodontium remains unclear. The aim of the present study was to investigate the expression of interleukin (IL)-35 in the peripheral blood mononuclear cells (PBMCs) and periodontal tissues of patients with chronic periodontitis (CP), with a view to understanding its role in this disease, and ultimately providing improved treatments. Peripheral blood, periodontal tissues and gingival crevicular fluids (GCFs) were collected from patients with CP or impacted teeth, the latter serving as healthy controls. The expression levels of IL-35 subunit mRNAs in PBMCs and periodontal tissues were determined using reverse transcription-quantitative polymerase chain reaction, while the IL-35 protein expression in GCFs and sera was quantified by ELISA. The relative expression of IL-35 subunit m

Is there always just one abstract entry? Check this by counting the abstract entries in all JSON files

In [65]:
Counter(len(json_file['abstract']) for json_file in json_files)

Counter({1: 16704,
         0: 8548,
         2: 3456,
         3: 2108,
         5: 567,
         4: 1357,
         8: 96,
         7: 149,
         6: 259,
         10: 23,
         12: 9,
         9: 43,
         11: 20,
         31: 1,
         17: 4,
         29: 2,
         14: 5,
         15: 4,
         13: 6,
         21: 1,
         16: 3,
         20: 3,
         27: 1,
         22: 2,
         19: 1,
         28: 1,
         25: 1,
         18: 1})

Far from it!

Are all those entries paragraphs of the same "Abstract" section?

In [66]:
Counter(
    len(set(text_chunk['section'] for text_chunk in json_file['abstract']))
    for json_file in json_files)

Counter({1: 24827, 0: 8548})

What are the names for the abstract section that occur?

In [67]:
set(text_chunk['section']
    for json_file in json_files
    for text_chunk in json_file['abstract'])

{'Abstract'}

So, to summarise:
- There are publications with and without an abstract
- Abstracts can contain multiple paragraphs
- All paragraphs in an abstract are always assigned to the section named "Abstract"
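
Given these findings, the abstract of a publication can be assembled by joining its paragraphs. The following is a minimal sketch; the helper name `abstract_text` and the toy inputs are hypothetical, and the choice to return an empty string for publications without an abstract is an assumption.

```python
def abstract_text(json_file, paragraph_sep='\n\n'):
    """Join all abstract paragraphs; returns '' if there is no abstract."""
    return paragraph_sep.join(chunk['text'] for chunk in json_file['abstract'])


# Hypothetical toy inputs with the same layout as the parsed JSON files
with_abstract = {'abstract': [
    {'section': 'Abstract', 'text': 'First paragraph.'},
    {'section': 'Abstract', 'text': 'Second paragraph.'},
]}
without_abstract = {'abstract': []}

print(repr(abstract_text(with_abstract)))
print(repr(abstract_text(without_abstract)))
```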

#### Body Text

Preview the contents of the `body_text` field. It is a list of text chunks

In [74]:
json_files[0]['body_text'][:3]

[{'text': 'Chronic periodontitis (CP) is an infectious disease that affects the periodontium and gradually destroys periodontal tissues (1) . Bacterial plaque is a well-known cause of CP, which stimulates a local inflammatory response and activation of the innate immune system (1, 2) . This eventually results in the characteristic pathology of periodontal disease, the main clinical features of which are advancing gingival inflammation, irreversible alveolar bone loss, and the loosening and/or loss of teeth (3) . Numerous studies (4, 5) have highlighted the role of T lymphocyte cells in periodontitis; in particular, T lymphocyte phenotype and function are important in the susceptibility, onset and severity of periodontitis (6) .',
  'cite_spans': [{'start': 125,
    'end': 128,
    'text': '(1)',
    'ref_id': 'BIBREF0'},
   {'start': 267, 'end': 270, 'text': '(1,', 'ref_id': 'BIBREF0'},
   {'start': 271, 'end': 273, 'text': '2)', 'ref_id': 'BIBREF1'},
   {'start': 501, 'end': 504, 'tex

Are there any JSON files with empty `body_text`?

In [76]:
any(len(json_file['body_text']) == 0 for json_file in json_files)

False

What are the section names of all text chunks in a given JSON file?

In [102]:
@widgets.interact(
    idx=widgets.IntSlider(
        value=0,
        min=0,
        max=len(json_files) - 1,
        description="JSON file",
        continuous_update=False,
    ),
)
def _(idx):
    json_file = json_files[idx]
    print(json_file['paper_id'])
    print('---')
    for text_chunk in json_file['body_text']:
        section_name = text_chunk['section']
        print(section_name or '<empty>')

interactive(children=(IntSlider(value=0, continuous_update=False, description='JSON file', max=33374), Output(…

Observations

- Most section titles seem to make sense
- Sometimes there are text chunks with empty section titles
- Some section titles seem to have been misidentified. Some common misidentifications are
    - Page headers
    - Page numbers
    - Parts of normal text
    - Gibberish text
- It seems there is no distinction between sections, sub-sections, etc. (Seen in documents where sections are in all-caps, but the sub-sections aren't)
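
Some of these misidentifications could be flagged automatically. The sketch below is only a rough heuristic and not something that was validated on the dataset: the function name `looks_suspicious`, the category labels and the `max_words` threshold are all hypothetical choices.

```python
import re


def looks_suspicious(section_name, max_words=15):
    """Return a short reason if a section title looks misidentified, else None."""
    name = section_name.strip()
    if not name:
        return 'empty'
    # Titles consisting only of digits are most likely page numbers
    if re.fullmatch(r'\d+', name):
        return 'page number'
    # Very long titles are often page headers or parts of normal text
    if len(name.split()) > max_words:
        return 'long title'
    return None


examples = [
    'Background',
    '',
    '62',
    'Halton et al. A systematic review of community-based interventions '
    'for emerging zoonotic infectious diseases in Southeast Asia '
    '© the authors 2013 Page 18',
]
for title in examples:
    print(repr(title[:40]), '->', looks_suspicious(title))
```

Such a filter would miss gibberish titles and short page headers, so it can only complement manual inspection with the widget.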

#### Back Matter

As with the abstract and the body text, back matters are collections of text chunks. It seems that many JSON files have empty back matters.

In [97]:
Counter(len(json_file['back_matter']) for json_file in json_files)

Counter({1: 13569,
         2: 5354,
         0: 11114,
         4: 667,
         3: 1748,
         10: 16,
         17: 3,
         6: 244,
         13: 7,
         5: 330,
         26: 1,
         7: 165,
         8: 71,
         16: 6,
         15: 7,
         12: 11,
         9: 17,
         48: 1,
         23: 3,
         11: 7,
         20: 3,
         42: 1,
         33: 1,
         18: 5,
         153: 1,
         19: 2,
         21: 2,
         24: 3,
         89: 1,
         32: 2,
         14: 4,
         22: 2,
         40: 1,
         55: 2,
         29: 1,
         209: 1,
         70: 1,
         37: 1})

In [103]:
@widgets.interact(
    idx=widgets.IntSlider(
        value=0,
        min=0,
        max=len(json_files) - 1,
        description="JSON file",
        continuous_update=False,
    ),
)
def _(idx):
    json_file = json_files[idx]
    print(json_file['paper_id'])
    print('---')
    for text_chunk in json_file['back_matter']:
        section_name = text_chunk['section']
        print(section_name or '<empty>')

interactive(children=(IntSlider(value=0, continuous_update=False, description='JSON file', max=33374), Output(…

Observations

- Most back matters are either empty or contain acknowledgements
- Some contain supplementary material
- Some contain an appendix / annex / notes

#### Ref Entries

Sample ref entry

In [108]:
json_files[0]['ref_entries']

{'FIGREF0': {'text': 'Foxp3, IL-12p35 and EBi3 mRNAs in periodontal tissues. (A) Foxp3, (B) IL-12p35 and (C) EBi3 mRNA expression in tissues of patients from the healthy control and CP groups. Healthy and periodontal tissue biopsies were collected from patients, mRNA extracted and reverse transcription-quantitative polymerase chain reaction performed using primers for Foxp3, IL-12p35 and EBi3 mRNAs. Data are expressed as the mean + standard error of the mean for each group (n=20). * P<0.05 vs. control. Foxp3, Forkhead box P3; IL-12p35, interleukin 12 p35 subunit; EBi3, Epstein-Barr virus-induced 3; CP, chronic periodontitis.',
  'latex': None,
  'type': 'figure'},
 'FIGREF1': {'text': 'Foxp3, IL-12p35 and EBi3 mRNAs in PBMCs from patients. (A) Foxp3, (B) IL-12p35 and (C) EBi3 mRNA expression in PBMCs of patients from the healthy control and CP groups. Following the extraction of mRNA, reverse transcription-quantitative polymerase chain reaction was performed using primers for Foxp3, IL

Note that some table/figure captions may actually contain considerable text.

All ref entries seem to be either figures or tables, as can be explored with the widget below.

In [109]:
@widgets.interact(
    idx=widgets.IntSlider(
        value=0,
        min=0,
        max=len(json_files) - 1,
        description="JSON file",
        continuous_update=False,
    ),
)
def _(idx):
    json_file = json_files[idx]
    print(json_file['paper_id'])
    print('---')
    for key in json_file['ref_entries']:
        print(key)

interactive(children=(IntSlider(value=0, continuous_update=False, description='JSON file', max=33374), Output(…

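We can verify the item types in `ref_entries` by checking the `type` field. Note that the counts recorded below were produced with the two `for` clauses in the opposite order, so they reflect the entries of a single file counted once per JSON file (both numbers are multiples of 33375) rather than dataset-wide totals.

In [110]:
Counter(entry['type']
        for json_file in json_files
        for entry in json_file['ref_entries'].values())

Counter({'figure': 2770125, 'table': 1335000})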

#### Bib Entries

### `metadata.csv`

In [154]:
df_metadata = pd.read_csv(data_dir / 'metadata.csv', dtype=str)

df_metadata['has_full_text'] = df_metadata['has_full_text'].map(
    {'True': True, 'False': False})
df_metadata['publish_time'] = pd.to_datetime(
    df_metadata['publish_time'],
    dayfirst=False)

In [153]:
df_metadata.shape

(45774, 17)

In [70]:
df_metadata.head()

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,full_text_file,url
0,vho70jcx,f056da9c64fbf00a4645ae326e8a4339d015d155,biorxiv,SIANN: Strain Identification by Alignment to N...,10.1101/001727,,,biorxiv,Next-generation sequencing is increasingly bei...,2014-01-10,Samuel Minot; Stephen D Turner; Krista L Ternu...,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/001727
1,i9tbix2v,daf32e013d325a6feb80e83d15aabc64a48fae33,biorxiv,Spatial epidemiology of networked metapopulati...,10.1101/003889,,,biorxiv,An emerging disease is one infectious epidemic...,2014-06-04,Lin WANG; Xiang Li,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/003889
2,62gfisc6,f33c6d94b0efaa198f8f3f20e644625fa3fe10d2,biorxiv,Sequencing of the human IG light chain loci fr...,10.1101/006866,,,biorxiv,Germline variation at immunoglobulin gene (IG)...,2014-07-03,Corey T Watson; Karyn Meltz Steinberg; Tina A ...,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/006866
3,058r9486,4da8a87e614373d56070ed272487451266dce919,biorxiv,Bayesian mixture analysis for metagenomic comm...,10.1101/007476,,,biorxiv,Deep sequencing of clinical samples is now an ...,2014-07-25,Sofia Morfopoulou; Vincent Plagnol,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/007476
4,wich35l7,eccef80cfbe078235df22398f195d5db462d8000,biorxiv,Mapping a viral phylogeny onto outbreak trees ...,10.1101/010389,,,biorxiv,Developing methods to reconstruct transmission...,2014-11-11,Stephen P Velsko; Jonathan E Allen,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/010389


In [71]:
df_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45774 entries, 0 to 45773
Data columns (total 17 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   cord_uid                     45774 non-null  object        
 1   sha                          31753 non-null  object        
 2   source_x                     45774 non-null  object        
 3   title                        45617 non-null  object        
 4   doi                          42440 non-null  object        
 5   pmcid                        26243 non-null  object        
 6   pubmed_id                    34641 non-null  object        
 7   license                      45774 non-null  object        
 8   abstract                     37913 non-null  object        
 9   publish_time                 45765 non-null  datetime64[ns]
 10  authors                      43774 non-null  object        
 11  journal                      41707 non-nu
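
The non-null counts above can be turned into a compact missing-value summary. The following is a minimal sketch using a toy frame; the helper name `missing_value_report` and its column names are hypothetical, and for the real data it would be called as `missing_value_report(df_metadata)`.

```python
import pandas as pd


def missing_value_report(df):
    """Count and percentage of missing values per column, worst first."""
    report = pd.DataFrame({
        'n_missing': df.isna().sum(),
        'pct_missing': (df.isna().mean() * 100).round(1),
    })
    return report.sort_values('n_missing', ascending=False)


# Hypothetical toy frame standing in for df_metadata
toy = pd.DataFrame({
    'sha': ['a', None, 'c', None],
    'title': ['t1', 't2', 't3', 't4'],
})
report = missing_value_report(toy)
print(report)
```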

## Inspect Contents of JSON Files

### Printing Contents

In [114]:
def text_chunks_to_sections(all_chunks):
    sections = OrderedDict()
    
    for chunk in all_chunks:
        section_name = chunk['section']
        if section_name not in sections:
            sections[section_name] = []
        sections[section_name].append({
            k: v for k, v in chunk.items()
            if k != 'section'
        })
    
    return sections

In [115]:
np.where([len(json_file['abstract']) == 31 for json_file in json_files])

(array([3864]),)

In [116]:
json_file = json_files[3864]

abstract_sections = text_chunks_to_sections(json_file['abstract'])
body_text_sections = text_chunks_to_sections(json_file['body_text'])
back_matter_sections = text_chunks_to_sections(json_file['back_matter'])

print(abstract_sections.keys())
print(body_text_sections.keys())
print(back_matter_sections.keys())

odict_keys(['Abstract'])
odict_keys(['Background', 'Surveillance interventions', 'Results for interventions targeted at Nipah virus', 'Methodological quality of the studies', 'Nipah virus -Review findings', 'Halton et al. A systematic review of community-based interventions for emerging zoonotic infectious diseases in Southeast Asia © the authors 2013 Page 18', 'Prevention and control interventions', 'Contextual factors', 'Summary', 'Results for interventions targeted at dengue', 'Education interventions', 'Environmental control', 'Biological control', 'Chemical control', 'Contextual information', 'Study context', 'Models of behaviour change', 'Dengue knowledge', 'Perceived importance of dengue', 'Perceived effectiveness of the intervention', 'Community input, ownership and involvement', '62', 'Use of schools to deliver education activities', 'Acceptability of the intervention', 'Meta-analysis', 'Household index', 'Container index', 'Breteau index', 'Larval population number', 'Larval 

In [118]:
def wrap_text(text, width=80, indent=4):
    lines = textwrap.wrap(text, width=width - indent)
    filler = ' ' * indent
    lines = [filler + line for line in lines]
    return '\n'.join(lines)
    
print(wrap_text(json_files[0]['abstract'][0]['text']))

    T lymphocyte cells, including regulatory T (Treg) and T helper 17 cells,
    have important roles in the human periodontium. However, the basis for Treg
    cytokine expression in various compartments of the periodontium remains
    unclear. The aim of the present study was to investigate the expression of
    interleukin (IL)-35 in the peripheral blood mononuclear cells (PBMCs) and
    periodontal tissues of patients with chronic periodontitis (CP), with a view
    to understanding its role in this disease, and ultimately providing improved
    treatments. Peripheral blood, periodontal tissues and gingival crevicular
    fluids (GCFs) were collected from patients with CP or impacted teeth, the
    latter serving as healthy controls. The expression levels of IL-35 subunit
    mRNAs in PBMCs and periodontal tissues were determined using reverse
    transcription-quantitative polymerase chain reaction, while the IL-35
    protein expression in GCFs and sera was quantified by ELISA. T

In [None]:
def print_section(section_title, section_chunks, paragraph_sep='\n\n'):
    print('-' * 80)
    print(section_title)
    print('-' * 80)
    
    paragraphs = [wrap_text(chunk['text']) for chunk in section_chunks]
    text = paragraph_sep.join(paragraphs)
    print(text)

In [122]:
for section_title, section_chunks in abstract_sections.items():
    print_section(section_title, section_chunks)

--------------------------------------------------------------------------------
Abstract
--------------------------------------------------------------------------------
    Background Southeast Asia has been at the epicentre of recent epidemics of
    emerging and re-emerging zoonotic diseases. Community-based surveillance and
    control interventions have been heavily promoted but the most effective
    interventions have not been identified.

    This review evaluated evidence for the effectiveness of community-based
    surveillance interventions at monitoring and identifying emerging infectious
    disease; the effectiveness of community-based control interventions at
    reducing rates of emerging infectious disease; and contextual factors that
    influence intervention effectiveness.

    Inclusion criteria

    Communities in Brunei, Cambodia, Indonesia, Laos, Malaysia, Myanmar, the
    Philippines, Singapore, Thailand and Viet Nam.

    Non-pharmaceutical, non-vaccine, and 

In [123]:
for section_title, section_chunks in back_matter_sections.items():
    print_section(section_title, section_chunks)

--------------------------------------------------------------------------------
Acknowledgements
--------------------------------------------------------------------------------
    This research was funded by the Australian Agency for International
    Development (AusAID). The research was commissioned as part of a joint call
    for systematic reviews with the Department for International Development
    (DFID) and the International Initiative for Impact Evaluation (3ie). The
    views expressed are those of the authors and not necessarily those of the
    Commonwealth of Australia. The Commonwealth of Australia accepts no
    responsibility for any loss, damage or injury
--------------------------------------------------------------------------------
Rabiesarticles
--------------------------------------------------------------------------------
    Akoso BT. Rabies in animals in Indonesia. In: Rabies control in Asia, Dodet
    B, Meslin FX, editors, 2001. p


## JSON File Statistics

## JSON Files vs. `metadata.csv`

### Duplicate SHAs

In [181]:
all_json_shas = set(json_file['paper_id'] for json_file in json_files)
all_metadata_shas = set(df_metadata['sha'].values)

In [183]:
print("Are JSON file SHAs unique?", len(all_json_shas) == n_json)
# Ignore missing SHAs, which would otherwise make the set smaller than the table
print("Are all metadata SHAs unique?", df_metadata['sha'].dropna().is_unique)

Are JSON file SHAs unique? True
Are all metadata SHAs unique? False


All JSON files' SHAs are unique, but the metadata file has duplicates. How many duplicates are there?

In [234]:
sha_duplicates_in_metadata = Counter(df_metadata['sha'])
sha_duplicates_in_metadata = {
    k: v for k, v in sha_duplicates_in_metadata.items()
    if v > 1 and isinstance(k, str)}
sha_duplicates_in_metadata

{'58be092086c74c58e9067121a6ba4836468e7ec3': 2,
 '45e40b2d7d973ed5c9798da613fb3cfa4427e2e2': 2,
 'f3aafdecdc43a3f57e58cf6dcea038b1834a953e': 2,
 '9ce0a6cfd53840cd985f7a1439708c7a48bb7f23': 3,
 'ba4afe00e152de82121a4445aed52c46833d927e': 2,
 '4644c32551fb23aa873a7738ecc8d777bd49877e': 4}
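
The same duplicate count can also be obtained with pandas alone. This is a minimal sketch on toy data; the helper name `duplicated_values` is hypothetical, and for the real data it would be called as `duplicated_values(df_metadata['sha'])`.

```python
import pandas as pd


def duplicated_values(series):
    """Map every value that occurs more than once to its number of occurrences."""
    # Missing values are dropped so that they are not reported as duplicates
    counts = series.dropna().value_counts()
    return counts[counts > 1].to_dict()


# Hypothetical toy column standing in for df_metadata['sha']
toy_shas = pd.Series(['a', 'b', 'a', None, None, 'c', 'a'])
duplicates = duplicated_values(toy_shas)
print(duplicates)
```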

After manually inspecting the documents corresponding to the SHA with the most duplicates (`4644c32551fb23aa873a7738ecc8d777bd49877e`, URLs are in the metadata file), we found that all four entries correspond to very short one-paragraph texts that fit on the same PDF page. This explains why they all have the same SHA, which is the hash of that PDF.

In [197]:
df_metadata[df_metadata['sha'] == '4644c32551fb23aa873a7738ecc8d777bd49877e'][['sha', 'title', 'url']]

Unnamed: 0,sha,title,url
41027,4644c32551fb23aa873a7738ecc8d777bd49877e,PIV-33 Detection of oseltamivir-resistant infl...,https://doi.org/10.1016/s1386-6532(09)70129-5
41028,4644c32551fb23aa873a7738ecc8d777bd49877e,PIV-34 A fast procedure for the detection of t...,https://doi.org/10.1016/s1386-6532(09)70130-1
41029,4644c32551fb23aa873a7738ecc8d777bd49877e,PIV-35 Evaluation of two newly developed QIAsy...,https://doi.org/10.1016/s1386-6532(09)70131-3
41030,4644c32551fb23aa873a7738ecc8d777bd49877e,PIV-36 Performance of the Qiagen Resplex II ve...,https://doi.org/10.1016/s1386-6532(09)70132-5


Are any of the duplicate SHAs found in the JSON files?

In [236]:
{sha: (sha in all_json_shas) for sha in sha_duplicates_in_metadata}

{'58be092086c74c58e9067121a6ba4836468e7ec3': True,
 '45e40b2d7d973ed5c9798da613fb3cfa4427e2e2': True,
 'f3aafdecdc43a3f57e58cf6dcea038b1834a953e': True,
 '9ce0a6cfd53840cd985f7a1439708c7a48bb7f23': True,
 'ba4afe00e152de82121a4445aed52c46833d927e': True,
 '4644c32551fb23aa873a7738ecc8d777bd49877e': True}

OK, they all are. So do the corresponding JSON files contain all of the corresponding articles? Check the JSON file whose SHA has 4 duplicates in the metadata file.

In [263]:
sha = '4644c32551fb23aa873a7738ecc8d777bd49877e'

sha_idx = np.where([json_file['paper_id'] == sha for json_file in json_files])[0][0]
json_file = json_files[sha_idx]

title = json_file['metadata']['title']


print(wrap_text(title, indent=0))
for part in ['abstract', 'body_text', 'back_matter']:
    print('=' * 80)
    print(part)
    print('=' * 80)
    sections = text_chunks_to_sections(json_file[part])
    for sec_name, sec_chunks in sections.items():
        print_section(sec_name, sec_chunks)

PIV-33 Detection of oseltamivir-resistant influenza A(H1N1) viruses with H274Y
mutation during 2007-2008 influenza season from central and eastern part of
Turkey
abstract
body_text
--------------------------------------------------------------------------------
S26
--------------------------------------------------------------------------------
    S15-S61 A. Carhan *, N. Albayrak, A.B. Altas, Y. Uyar. Refik Saydam National
    Health Agency, Virology Reference and Research Laboratory, Turkey In the
    beginning of 2007-2008 Northern Hemisphere influenza season, the frequency
    of influenza A(H1N1) viruses bearing a previously defined oseltamivir
    resistance conferring amino acid change from Histidine to Tyrosine at
    position 274 (H274Y) in neuraminidase (NA) gene increased dramatically. The
    overall frequency of oseltamivir resistance in A(H1N1) strains from Europe
    was 25%, although it varied between countries, with Norway detecting the
    highest proportion (67%), an

By comparing with the actual PDF we see that the first 3 entries from the metadata file were parsed into the corresponding JSON file, while only the title of the fourth one was taken and appended to the main text of the JSON.

The lesson learnt is that the parsing of the PDFs might not be 100% reliable, and inconsistencies between the JSON files and the metadata file might occur.

### Overlap between JSON files and metadata file

In [156]:
n_metadata = len(df_metadata)
n_has_full_text = np.sum(df_metadata["has_full_text"] == True)
n_json = len(json_files)

print("Number of JSON files:", n_json)
print("Number of entries in metadata.csv:", n_metadata)
print("  has_full_text = True :", n_has_full_text)
print("  has_full_text = False:", n_metadata - n_has_full_text)

Number of JSON files: 33375
Number of entries in metadata.csv: 45774
  has_full_text = True : 31753
  has_full_text = False: 14021


There seems to be a mismatch between the JSON files and the entries in the metadata file that have `has_full_text=True`. Let's check what exactly is happening

In [163]:
sha_in_json = df_metadata['sha'].map(lambda sha: sha in all_json_shas)
has_full_text = df_metadata['has_full_text']

Are all SHAs of JSON files in the metadata file?

In [176]:
all_json_shas

{'48ed8f0e9592a06e0d5cb09acdd6397fb7ffba15',
 '0b35b06902127f13fe98623b233abddea9e80f16',
 '4d754b08444d19e6df8559afad357bb13be89bd8',
 '794631ba2517f1c98a655013cbe22169f67b70c9',
 'c97895cef4c4d3d1e78a477e397cc258f2c1640c',
 'b099c55c3b7797a03f19d6ff8c78e568672a6066',
 '0e803f9a98199d4c1e322f933e9943db653460be',
 'df61f934c991b4b4b08e1ec28804812ec0629ea1',
 '7a43d0241aa1e131a19e83d48ce65ba5664c4bb3',
 '0898b22b38bdf9962d526c96baa1693f135d16a6',
 'e4a3696c1433c42badbd9cce9ec2721be357e02d',
 '07c4d4d483be850eb24ecb722746f27e74d2b217',
 '5b0440b7d2a6ed9c1c93a2f7f928dc1cb2a856ea',
 '216a12289d34dd8c0febe000a1de70c1c8b634c8',
 'b0718d5c8888216c95fa19d7a79fd709da2c3ff4',
 '10879a331911aff539b074549e2913baa6bec50e',
 'bad8f12dc3e97b6539e35b4de35790b12cc32a48',
 '9aeeefa3b03f5d9a49a7af22a1524eb320a5bc88',
 '23c57f7251a9c115446b642916b659018f527763',
 '46c5f4a01496ab4fe89b1dfba7b7f94445766640',
 '2eaa2a813ac6ff1fc901cf8eadc3b44e13d71d6a',
 '6b9671552c1cebfbdec4ccb22fc444ea890ae9dd',
 'a347a23c

In [180]:
shas_in_metadata = [sha for sha in all_json_shas if sha in df_metadata['sha'].values]
shas_not_in_metadata = [sha for sha in all_json_shas if sha not in df_metadata['sha'].values]

print(f"Out of all {n_json} JSON files we have:")
print(f"  {len(shas_in_metadata)} in metadata.csv")
print(f"  {len(shas_not_in_metadata)} not in metadata.csv")

KeyboardInterrupt: 
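
The cell above was interrupted, presumably because `sha in df_metadata['sha'].values` scans the whole array for every one of the JSON SHAs. A set-based lookup avoids this. The sketch below uses toy data and a hypothetical helper name `split_by_membership`; like the interrupted cell it only matches exact SHA strings.

```python
import pandas as pd


def split_by_membership(json_shas, metadata_sha_column):
    """Split the JSON SHAs into those found / not found in the metadata column."""
    # Build the lookup set once; membership tests on a set take constant time
    metadata_shas = set(metadata_sha_column.dropna())
    json_shas = set(json_shas)
    found = sorted(json_shas & metadata_shas)
    missing = sorted(json_shas - metadata_shas)
    return found, missing


# Hypothetical toy data standing in for all_json_shas and df_metadata['sha']
toy_json_shas = {'aaa', 'bbb', 'ccc'}
toy_metadata = pd.Series(['aaa', None, 'ccc', 'ddd'])

found, missing = split_by_membership(toy_json_shas, toy_metadata)
print(f"Out of all {len(toy_json_shas)} JSON files we have:")
print(f"  {len(found)} in metadata.csv")
print(f"  {len(missing)} not in metadata.csv")
```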

In [164]:
df_metadata['sha'][sha_in_json & ~has_full_text]

Series([], Name: sha, dtype: object)

In [165]:
df_metadata['sha'][sha_in_json & has_full_text]

0        f056da9c64fbf00a4645ae326e8a4339d015d155
1        daf32e013d325a6feb80e83d15aabc64a48fae33
2        f33c6d94b0efaa198f8f3f20e644625fa3fe10d2
3        4da8a87e614373d56070ed272487451266dce919
4        eccef80cfbe078235df22398f195d5db462d8000
                           ...                   
45764    a186e1e74616d4936c8de93c42a857c4cb9d1edf
45765    efd9f0bbc3ac52b299b2799aa8d72cd9a5b55ccf
45768    f81692543d3e35858911cea48c298bfa23b20bc6
45769    289deae0b2050aa259a05ba84565a4df82fa099a
45770    21a4369f83891bf6975dd916c0aa495d5df8709e
Name: sha, Length: 30206, dtype: object

In [166]:
df_metadata['sha'][~sha_in_json & has_full_text]

2399     e9c78584c08ba79d735e150eff98297eb57f12dd; 4a22...
2482     bd92cbae7179f07d59d1ce4d7ca96e37ebb40ec9; 7526...
2589     2bd6e33d92632dfcba4056a2d7355ced5b7ab1fd; 6fe7...
2648     4eb6e165ee705e2ae2a24ed2d4e67da42831ff4a; d4f0...
2650     daee7f7d31f4bf1c0ef883bcd6c124b6e94cbee7; ad9a...
                               ...                        
45749    d14208a77004363b34b3cf0b7d08fd0d121e12a4; 7f7c...
45751    1b5064ad9a2828b30813ac5634d98b5da2f1d3d9; 0391...
45766    f9d941d30a663db32ceabe367cf36b6f3c2c744c; 1f19...
45767    889ba9338ea71cd42c3bc675db30a1928d487f43; d38e...
45773    3369a14e1d116943f48b3a33597796c9802de279; f523...
Name: sha, Length: 1547, dtype: object
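
The truncated values above suggest that these `sha` cells hold several SHAs separated by a semicolon, which would match point (3) of `metadata.readme` (one paper can be associated with more than one PDF). Under that assumption, the column can be split into one row per SHA before matching against the JSON files. The helper name `explode_shas` and the toy frame are hypothetical.

```python
import pandas as pd


def explode_shas(df, sha_column='sha', sep=';'):
    """Return a copy of df with one row per individual SHA."""
    # Rows without any SHA cannot be matched to a JSON file, so drop them
    out = df.dropna(subset=[sha_column]).copy()
    out[sha_column] = out[sha_column].str.split(sep)
    out = out.explode(sha_column)
    # Remove the whitespace that follows the separator
    out[sha_column] = out[sha_column].str.strip()
    return out


# Hypothetical toy frame standing in for df_metadata
toy = pd.DataFrame({
    'cord_uid': ['u1', 'u2', 'u3'],
    'sha': ['aaa; bbb', 'ccc', None],
})
exploded = explode_shas(toy)
print(exploded)

# Membership can then be tested per individual SHA
toy_json_shas = {'bbb', 'ccc'}
exploded['sha_in_json'] = exploded['sha'].isin(toy_json_shas)
print(exploded.groupby('cord_uid')['sha_in_json'].any())
```

Grouping by `cord_uid` afterwards tells, for each metadata entry, whether at least one of its SHAs has a JSON file.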