# Mining COVID-19 Kaggle competition scientific papers to build an understanding of viruses
## Part 1. Exploring the data

Coronaviruses have been around for decades, but few, if any, have been as deadly and as easily spread as the one behind COVID-19. Earlier this year, the Allen Institute for AI (AI2), together with a consortium of research institutes and the White House, curated a corpus of scientific papers on coronaviruses published since the 19th century. They offered a Kaggle competition to analyze the corpus and answer questions about different aspects of the virus, such as how it spreads or how it affects living organisms.

In this notebook we will explore and clean up the dataset published as part of this competition. The dataset is maintained by AI2 and refreshed daily; you can download the newest release here: [https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html](https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html). In this installment we will use the data from 2020-08-24, which I have already uploaded to our S3 bucket.

# Imports

Let's start with imports.

In [None]:
import cudf
import pandas as pd
import json
import s3fs

import matplotlib.pyplot as plt

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Get the data
Let's read the data. The metadata table contains a list of all articles released as part of this corpus. 

In [None]:
data_dir = 's3://bsql/data/covid'

metadata = cudf.read_csv(f'{data_dir}/metadata.csv', storage_options={'anon': True})
metadata.head()

Here's the list of all the columns.

In [None]:
metadata.columns

Most of the columns are IDs, such as cord_uid or doi. Of the more interesting ones, we will focus on

* **title**, **abstract** and **authors**, which we will use to check for duplicates
* **publish_time**, which shows when the article was published
* **source_x**, which shows the source the article originated from
* **journal**, which lists the journal the article was published in
* **pdf_json_files**, which shows the location of the file with the body text of the paper; we will use this to see if we have any missing files so we can exclude them.

## BlazingSQL
We will use BlazingSQL alongside cuDF throughout the notebook, so let's start the `BlazingContext` and create the `metadata` table.

In [None]:
from blazingsql import BlazingContext

# Start up BlazingSQL
bc = BlazingContext()

# Create table from CSV
bc.create_table('metadata', metadata)

# Data exploration
Let's get familiar with the dataset.

### Article count
I normally start by counting the number of rows I have in any dataset.

In [None]:
print(f'Total number of articles: {len(metadata):,}.')

### Source
Next, let's explore the sources, i.e. the repositories where these articles are hosted.

In [None]:
%%time
bc.sql('''
    SELECT source_x
        , COUNT(*) AS cnt 
    FROM metadata 
    GROUP BY source_x 
    ORDER BY cnt DESC 
    LIMIT 10
''')
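
If you want to check what this aggregation does without a GPU, here is a minimal sketch using only the Python standard library. The `sources` list is made-up toy data, not values or counts from the corpus.

```python
from collections import Counter

# Toy stand-in for the source_x column (hypothetical values, not real counts)
sources = ['PMC', 'Medline', 'PMC', 'WHO', 'PMC', 'Medline']

# Equivalent of GROUP BY source_x ... ORDER BY cnt DESC LIMIT 10
top_sources = Counter(sources).most_common(10)

for source, cnt in top_sources:
    print(f'{source}: {cnt}')
```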

### Journals
Next is journals -- let's check the list of journals these papers were published in.

In [None]:
metadata.groupby(by='journal').agg({'cord_uid': 'count'}).sort_values(by='cord_uid', ascending=False).head(10)

### Publication years
Around 2003 there was a big spike in the number of papers published on coronaviruses, which coincides with the SARS-CoV-1 outbreak.

In [None]:
# Take the first four-digit run in publish_time as the publication year
metadata['year_published'] = metadata.publish_time.str.extract(r'([0-9]{4})')[0].astype('int16')
bc.create_table('metadata', metadata)

(
    metadata
    .query('year_published < 2020')
    .groupby(by='year_published')
    .agg({'source_x': 'count'})
    .to_pandas()
    .plot(kind='bar', figsize=(18,9))
)
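
The year extraction relies on `publish_time` starting with a four-digit year. Below is a sketch of the same idea in plain Python, assuming the column mixes full dates and bare years; the sample values and the `extract_year` helper are hypothetical, and the pattern only accepts digits.

```python
import re

# Hypothetical publish_time values (assumed formats: full date, bare year, missing)
publish_times = ['2003-05-12', '2020', '2019-11-30', None]

YEAR_PATTERN = re.compile(r'([0-9]{4})')

def extract_year(value):
    """Return the first four-digit run as an int, or None if there is none."""
    if value is None:
        return None
    match = YEAR_PATTERN.search(value)
    return int(match.group(1)) if match else None

years = [extract_year(v) for v in publish_times]
print(years)
```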

We initially left 2020 out of the chart. Here's why...

In [None]:
(
    metadata
    .groupby(by='year_published')
    .agg({'source_x': 'count'})
    .to_pandas()
    .plot(kind='bar', figsize=(18,9))
)

# Data cleanup

It is now time to look at the data itself. The first thing I like to do with any dataset is to check for missing observations.

## Missing values

In [None]:
cols = list(metadata.columns)[2:-3]

query_missing = 'SELECT '
query_missing += '\n    ,'.join([f'CASE WHEN {col} IS NULL THEN 1 ELSE 0 END AS {col}_miss' for col in cols])
query_missing += '\nFROM metadata'

query_unions = (
    '\nUNION ALL \n    '
    .join([
        f"SELECT '{col}_miss' AS miss_flag, "
        f"{col}_miss AS FLAG, COUNT(*) AS CNT "
        f"FROM missing_flags GROUP BY {col}_miss" 
        for col in cols
    ])
)

bc.create_table('missing_flags', bc.sql(query_missing))
bc.create_table('missing_summary', bc.sql(query_unions))

row_cnt = float(len(metadata))

bc.sql(f'''
    SELECT *
        , CNT / {row_cnt} * 100.0 AS MISS_PCNT
    FROM missing_summary 
    WHERE FLAG = 1
    ORDER BY MISS_PCNT DESC
''')
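
To make the three steps easier to follow (flag, count, convert to percentages), here is a small sketch that runs the same query pattern on a toy table. It uses `sqlite3` from the standard library as a stand-in for BlazingSQL, and the table contents are made up.

```python
import sqlite3

# sqlite3 stands in for BlazingSQL here; the rows are toy data
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE metadata (title TEXT, abstract TEXT)')
con.executemany(
    'INSERT INTO metadata VALUES (?, ?)',
    [('A', 'x'), ('B', None), (None, None), ('D', 'y')],
)

cols = ['title', 'abstract']

# Step 1: one 0/1 flag per column
query_missing = 'SELECT ' + ', '.join(
    f'CASE WHEN {col} IS NULL THEN 1 ELSE 0 END AS {col}_miss' for col in cols
) + ' FROM metadata'
con.execute(f'CREATE TABLE missing_flags AS {query_missing}')

# Step 2: count rows per flag value, one SELECT per column, glued with UNION ALL
query_unions = ' UNION ALL '.join(
    f"SELECT '{col}_miss' AS miss_flag, {col}_miss AS flag, COUNT(*) AS cnt "
    f"FROM missing_flags GROUP BY {col}_miss"
    for col in cols
)

row_cnt = con.execute('SELECT COUNT(*) FROM metadata').fetchone()[0]

# Step 3: keep only the "missing" rows and turn counts into percentages
summary = con.execute(
    f'SELECT miss_flag, cnt * 100.0 / {row_cnt} AS miss_pcnt '
    f'FROM ({query_unions}) WHERE flag = 1 ORDER BY miss_pcnt DESC'
).fetchall()
print(summary)
```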

Here is the query we use to generate the `_miss` flags.

In [None]:
print(query_missing)

The query to create the aggregates is here:

In [None]:
print(query_unions)

Since almost 60% of the values in pdf_json_files are missing, let's remove the observations that have no value in this column.

In [None]:
bc.create_table('metadata_no_missing', bc.sql('SELECT * FROM metadata WHERE pdf_json_files IS NOT NULL'))
bc.sql('SELECT COUNT(*) FROM metadata_no_missing')

Let's rerun our query on the dataset with the missing observations removed to check.

In [None]:
query_missing = 'SELECT '
query_missing += '\n    ,'.join([f'CASE WHEN {col} IS NULL THEN 1 ELSE 0 END AS {col}_miss' for col in cols])
query_missing += '\nFROM metadata_no_missing'

query_unions = (
    '\nUNION ALL \n    '
    .join([
        f"SELECT '{col}_miss' AS miss_flag, "
        f"{col}_miss AS FLAG, COUNT(*) AS CNT "
        f"FROM missing_flags GROUP BY {col}_miss" 
        for col in cols
    ])
)

bc.create_table('missing_flags', bc.sql(query_missing))
bc.create_table('missing_summary', bc.sql(query_unions))

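# Percentages should now be relative to the filtered table, not the original one
row_cnt = float(len(bc.sql('SELECT cord_uid FROM metadata_no_missing')))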

bc.sql(f'''
    SELECT *
        , CNT / {row_cnt} * 100.0 AS MISS_PCNT
    FROM missing_summary 
    WHERE FLAG = 1
    ORDER BY MISS_PCNT DESC
''')

Looks like we can continue with this.

# Duplicates
Now is a good time to check whether we have any duplicates in our data.

In [None]:
cols_to_check_for_dupes = ['url', 'title', 'doi', 'abstract', 'pdf_json_files', 'pmc_json_files']

col = cols_to_check_for_dupes[5]

select_pattern = '{c}_DIST, {c}_MISS, {c}_DIST + {c}_MISS AS {c}_TTL, CNT - {c}_DIST - {c}_MISS AS {c}_DUPES'
count_pattern  = 'COUNT(DISTINCT {c}) AS {c}_DIST, SUM(CASE WHEN {c} IS NULL THEN 1 ELSE 0 END) AS {c}_MISS'

query = 'SELECT CNT\n    , '
query += '\n    , '.join([select_pattern.format(c=c) for c in cols_to_check_for_dupes])
query += '\nFROM (\n    SELECT COUNT(*) AS CNT, '
query += '\n        ,'.join([count_pattern.format(c=c) for c in cols_to_check_for_dupes])
query += '\n    FROM metadata_no_missing\n) AS A'

bc.sql(query)
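
The generated query counts duplicates as total rows minus distinct values minus missing values. Here is a sketch of that arithmetic for a single column, again using `sqlite3` as a stand-in for BlazingSQL; the `papers` table and its DOI values are made up.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE papers (doi TEXT)')
# Six toy rows: three distinct DOIs, one of them repeated once, two missing
con.executemany(
    'INSERT INTO papers VALUES (?)',
    [('d1',), ('d2',), ('d2',), ('d3',), (None,), (None,)],
)

# COUNT(DISTINCT ...) ignores NULLs, so missing values are counted separately
cnt, dist, miss, dupes = con.execute('''
    SELECT CNT, doi_DIST, doi_MISS, CNT - doi_DIST - doi_MISS AS doi_DUPES
    FROM (
        SELECT COUNT(*) AS CNT
            , COUNT(DISTINCT doi) AS doi_DIST
            , SUM(CASE WHEN doi IS NULL THEN 1 ELSE 0 END) AS doi_MISS
        FROM papers
    ) AS A
''').fetchone()
print(cnt, dist, miss, dupes)
```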

In [None]:
bc.create_table('duplicates', bc.sql(query))

query_dupes = 'SELECT '
query_dupes += ','.join(f'{c}_DUPES' for c in cols_to_check_for_dupes)
query_dupes += ' FROM duplicates'
bc.sql(query_dupes)

## Links to pdf JSON files
URLs have no duplicates and the counts for the other columns are small. For `pdf_json_files`, let's see which rows these are.

In [None]:
bc.sql('SELECT pdf_json_files FROM metadata_no_missing GROUP BY pdf_json_files HAVING COUNT(*) > 1')

Looks like one of the links is replicated 3 times. Let's check if the other fields are duplicates as well.

In [None]:
bc.sql('''
    SELECT A.*
    FROM metadata_no_missing AS A
    INNER JOIN (
        SELECT pdf_json_files FROM metadata_no_missing GROUP BY pdf_json_files HAVING COUNT(*) > 1
    ) AS B
    ON A.pdf_json_files = B.pdf_json_files
    ORDER BY A.pdf_json_files
''')

The top 3 articles seem to be simply mislabeled: their pmc_json_files point to different XML files and they appear to be 3 different articles. The remaining ones seem to be duplicated entries. Since the number is low, we remove them all.

In [None]:
metadata_json_clean = bc.sql('''
    SELECT A.*
    FROM metadata_no_missing AS A
    INNER JOIN (
        SELECT pdf_json_files FROM metadata_no_missing GROUP BY pdf_json_files HAVING COUNT(*) = 1
    ) AS B
    ON A.pdf_json_files = B.pdf_json_files
    ORDER BY A.pdf_json_files
''')
                             
bc.create_table('metadata_json_clean', metadata_json_clean)
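
Note that this query drops every row whose link appears more than once, rather than keeping one copy. A minimal sketch of the same logic in plain Python, with made-up identifiers and paths:

```python
from collections import Counter

# Toy records: (cord_uid, pdf_json_files); identifiers and paths are made up
records = [
    ('a1', 'pdf_json/one.json'),
    ('a2', 'pdf_json/two.json'),
    ('a3', 'pdf_json/two.json'),
    ('a4', 'pdf_json/three.json'),
]

# Equivalent of GROUP BY pdf_json_files HAVING COUNT(*) = 1
link_counts = Counter(link for _, link in records)
unique_links = {link for link, n in link_counts.items() if n == 1}

# Equivalent of the INNER JOIN plus ORDER BY: rows with a repeated link are dropped entirely
clean = sorted((r for r in records if r[1] in unique_links), key=lambda r: r[1])
print(clean)
```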

## Title duplicates
We do have some title duplicates. Let's check them out.

In [None]:
bc.sql('''SELECT A.title
        , A.doi
        , A.abstract
        , A.authors
        , A.journal
        , A.year_published
    FROM metadata_json_clean AS A
    INNER JOIN (
        SELECT title FROM metadata_json_clean GROUP BY title HAVING COUNT(*) > 1
    ) AS B
    ON A.title = B.title
    ORDER BY A.title
    LIMIT 10''')

You can see that these are truly duplicated records: they differ in the doi identifier, but the titles, abstracts and authors mostly match. However, since there are over 1,200 duplicated records, we would rather not drop all of them; instead, we will use the `.drop_duplicates()` method from cuDF to retain one record from each group of duplicates.

In [None]:
metadata_title_clean = metadata_json_clean.drop_duplicates(subset=['title'])
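
Unlike the previous step, this one keeps one record per title. The sketch below shows that behaviour in plain Python, assuming the default keep-first semantics of `.drop_duplicates()`; the titles and DOIs are made up.

```python
# Toy records; titles and DOIs are made up
records = [
    {'title': 'Paper A', 'doi': '10.0/1'},
    {'title': 'Paper B', 'doi': '10.0/2'},
    {'title': 'Paper A', 'doi': '10.0/3'},
]

seen_titles = set()
deduplicated = []
for record in records:
    if record['title'] in seen_titles:
        continue  # a record with this title was already kept
    seen_titles.add(record['title'])
    deduplicated.append(record)

print([r['doi'] for r in deduplicated])
```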

Now we should have no duplicated titles.

In [None]:
bc.create_table('metadata_title_clean', metadata_title_clean)
bc.sql('SELECT title FROM metadata_title_clean GROUP BY title HAVING COUNT(*) > 1')

# Missing JSON files
Final check -- are there any files in the folder that are not listed in the metadata file, and vice versa?

In [None]:
pdf_json = f'{data_dir}/document_parses/pdf_json'

fs = s3fs.S3FileSystem(anon=True)
files = ['/'.join(e.split('/')[-3:]) for e in fs.ls(pdf_json)]

bc.create_table('files', cudf.DataFrame(files, columns=['pdf_json_files']))

bc.create_table('pdf_files_meta', bc.sql('''
    SELECT A.pdf_json_files AS meta_pdf_json
        , B.pdf_json_files AS folder_pdf_json
        , CASE WHEN A.pdf_json_files IS NULL THEN 1 ELSE 0 END meta_pdf_json_missing
        , CASE WHEN B.pdf_json_files IS NULL THEN 1 ELSE 0 END folder_pdf_json_missing
    FROM metadata_title_clean AS A
    FULL OUTER JOIN files AS B
        ON A.pdf_json_files = B.pdf_json_files
'''))

bc.sql('''
    SELECT meta_pdf_json_missing
        , folder_pdf_json_missing
        , COUNT(*) AS CNT
    FROM pdf_files_meta
    GROUP BY meta_pdf_json_missing
        , folder_pdf_json_missing
''')
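
The full outer join above is essentially a comparison of two sets of paths. Here is a small sketch of the same check with Python sets, including the trimming of each listed path to its last three components. All bucket names and file names are hypothetical.

```python
# Hypothetical listing, in the shape an S3 file system listing might return
listing = [
    'bucket/data/covid/document_parses/pdf_json/aaa.json',
    'bucket/data/covid/document_parses/pdf_json/bbb.json',
]
# Keep the last three path components so the values match the metadata column
files = {'/'.join(path.split('/')[-3:]) for path in listing}

# Hypothetical pdf_json_files values from the metadata table
meta = {'document_parses/pdf_json/bbb.json', 'document_parses/pdf_json/ccc.json'}

in_both = meta & files            # both flags equal 0
only_in_metadata = meta - files   # folder_pdf_json_missing = 1
only_in_folder = files - meta     # meta_pdf_json_missing = 1

print(len(in_both), len(only_in_metadata), len(only_in_folder))
```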

Quite surprisingly, we have 12.5k files that are linked in the metadata.csv file but cannot be found on disk, and 5k files that are present on disk but are not referenced in the metadata.csv file. In this case, I decided to drop all the missing files, keeping only the 87,438 files that I can find both in the metadata.csv file and on disk.

In [None]:
bc.create_table('metadata_pdf_json_clean', bc.sql('''
    SELECT A.*
    FROM metadata_title_clean AS A
    INNER JOIN (
        SELECT meta_pdf_json
        FROM pdf_files_meta
        WHERE meta_pdf_json_missing = 0
            AND folder_pdf_json_missing = 0
    ) AS B
        ON A.pdf_json_files = B.meta_pdf_json
'''))

bc.sql('SELECT COUNT(*) AS CNT FROM metadata_pdf_json_clean')

Finally, let's save the clean metadata file.

In [None]:
clean_metadata = bc.sql('SELECT * FROM metadata_pdf_json_clean')
clean_metadata.to_csv('metadata_clean.csv', index=False)