# Mining COVID-19 Kaggle competition scientific papers to build an understanding of viruses
## Part 2. Processing and featurizing data

Working from the clean metadata file, in this notebook we will featurize a subset of the JSON files that we downloaded from the AI2 S3 repository.

# Imports

In [None]:
import cudf
import pandas as pd
import json
import re
import cupy
import s3fs

pd.options.display.max_rows = 100

# Read and process the JSON files
All the data is located in our S3 bucket: `s3://bsql/data/covid`. The clean metadata file, however, we saved in the root of this notebook's folder, so we read it from there.

In [None]:
data_dir = 's3://bsql/data/covid'
fs = s3fs.S3FileSystem(anon=True)
metadata_clean = cudf.read_csv('metadata_clean.csv')

The process below is lengthy. The format of the JSON files forces us to loop through them one by one, read each with the `json` package, and only then extract the information we are interested in. You can adjust `batch_size` and `read_subset` to control how much time you spend, or copy the data locally to process it; we default to 300 papers since this should not take more than 15-30 seconds.

In [None]:
%%time
articles_list = []
batch_size = 300

read_subset = True
paper_cnt = batch_size if read_subset else len(metadata_clean)

for i in range(0, paper_cnt, batch_size):
    print(f'Processing articles {i}:{i+batch_size}')
    files = [f'{data_dir}/{f}' for f in metadata_clean.iloc[i:i+batch_size,:]['pdf_json_files'].to_array()]

    papers = []

    for f in files:
        with fs.open(f, 'r') as ff:
            json_read = json.loads(ff.read())

        for paragraph, s in enumerate(json_read['body_text']):
            papers.append((
                json_read['paper_id']  #### key: SHA
                , s['section']         #### section title
                , paragraph            #### paragraph number (named so it does not shadow the batch index i)
                , s['text']            #### text of the paragraph
            ))

    articles_list.append(
        cudf.DataFrame(
            papers
            , columns=['sha', 'section', 'paragraph', 'text']
        )
    )
    
    del papers
    del files
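
To make the per-file step concrete, here is a minimal sketch of how one paper is flattened into rows. The JSON string is a hypothetical, heavily trimmed stand-in for a real file that keeps only the keys the loop above relies on (`paper_id`, `body_text`, `section`, `text`); `flatten_paper` is an illustrative helper, not part of the notebook.

```python
import json

# Hypothetical, heavily trimmed stand-in for one CORD-19 JSON file;
# only the keys used by the loop above are included.
sample_json = '''
{
  "paper_id": "abc123",
  "body_text": [
    {"section": "Introduction", "text": "Coronaviruses are enveloped RNA viruses."},
    {"section": "Methods", "text": "Samples were collected in 2019."}
  ]
}
'''

def flatten_paper(raw):
    """Turn one paper's JSON string into (sha, section, paragraph, text) rows."""
    doc = json.loads(raw)
    return [
        (doc['paper_id'], part['section'], idx, part['text'])
        for idx, part in enumerate(doc['body_text'])
    ]

rows = flatten_paper(sample_json)
for row in rows:
    print(row)
```

Each paragraph becomes its own row, which is why the final frame holds many more records than there are papers.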

Now we can concatenate all the small cudf DataFrames into one.

In [None]:
articles = (
    cudf.concat(
        articles_list
    ).reset_index(drop=True)
)

In [None]:
print(f'We read {len(articles):,} paragraphs.')

# Data featurization

The first step in featurizing our dataset is to create a vocabulary file. The vocabulary needs to follow the format expected by the BERT models.

## Build vocabulary

First, we simply tokenize the strings into words, lower-case them, and remove the punctuation and digits we don't need. The `tokenize()` method splits each string on whitespace and puts every tokenized word in a `cudf.Series`. Next, we aggregate and count the occurrences of each word.

In [None]:
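def tokenize_articles(frame, col):
    temp = frame[col].str.tokenize().to_frame()   #### one row per whitespace-delimited word
    temp['text'] = temp['text'].str.lower()
    #### raw string, so the backslashes reach the regex engine untouched
    temp['text'] = temp['text'].str.replace(r'[\.?,#"$!;:=\(\)\-\+0-9]', '')
    temp['counter'] = 1
    return temp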

min_count = 50

token_counts = (
    tokenize_articles(articles, 'text')
    .groupby('text')
    .count()
    .reset_index()
    .sort_values(by='counter', ascending=False)
    .query(f'counter > {min_count}')
)

token_counts = token_counts.to_pandas()

print(f'Number of distinct tokens kept: {len(token_counts):,}')
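
For readers without a GPU at hand, the same tokenize, normalize, and count logic can be sketched in plain Python. This is only an approximation of the cuDF pipeline above: it assumes `str.split()` matches the whitespace splitting done by `tokenize()`, and `count_tokens` and the toy paragraphs are made up for illustration.

```python
import re
from collections import Counter

# Same character class as in tokenize_articles() above.
PUNCT_AND_DIGITS = re.compile(r'[\.?,#"$!;:=\(\)\-\+0-9]')

def count_tokens(paragraphs, min_count=0):
    """Whitespace-split, lower-case, strip punctuation/digits, then count."""
    counts = Counter()
    for paragraph in paragraphs:
        for word in paragraph.split():
            counts[PUNCT_AND_DIGITS.sub('', word.lower())] += 1
    # Keep only tokens seen more than min_count times, most frequent first.
    return [(tok, n) for tok, n in counts.most_common() if n > min_count]

toy = ['The virus spreads.', 'The virus mutates, in 2020!']
toy_counts = count_tokens(toy, min_count=1)
print(toy_counts)
```

Note that a word made up entirely of stripped characters, such as `2020!`, collapses to an empty string and is counted like any other token.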

Let's see what this looks like.

In [None]:
token_counts.head()

We don't want the space, so let's remove the record with index `0`.

In [None]:
#### drop by label: after sorting by count the index is no longer in positional order
token_counts = token_counts.drop(index=0)

To create the final vocabulary we will use the `SubwordTextEncoder` from this repository: https://github.com/kwonmha/bert-vocab-builder/. The version we use is slightly modified to remove the dependency on TensorFlow.

The algorithm scans the words and iteratively builds a vocabulary of the longest subwords that the original words can be subdivided into.
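
To give a feel for what "subdividing a word into the longest subwords" means, here is a minimal sketch of a greedy longest-match split against a fixed toy vocabulary. It is not the `SubwordTextEncoder` implementation and it does not build a vocabulary; `split_into_subwords`, the toy vocabulary, and the single-character fallback are all assumptions made for illustration.

```python
def split_into_subwords(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`.

    Single characters are always accepted as a fallback so the split
    never fails; a real encoder handles unknown characters differently.
    """
    pieces, start = [], 0
    while start < len(word):
        # Try the longest candidate first and shrink until one matches.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab or end - start == 1:
                pieces.append(piece)
                start = end
                break
    return pieces

toy_vocab = {'corona', 'virus', 'viral', 'anti', 'es'}
print(split_into_subwords('coronaviruses', toy_vocab))
print(split_into_subwords('antiviral', toy_vocab))
```

With this toy vocabulary, `coronaviruses` splits into `corona`, `virus`, `es`, so a rare word is represented by pieces that also occur in more common words.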

In [None]:
from scripts import text_encoder

sw = text_encoder.SubwordTextEncoder()

The `SubwordTextEncoder` expects a dictionary with keys being the words and the values being the word counts.

In [None]:
token_counts_dict = dict(token_counts.to_dict('split')['data'])

sw.build_from_token_counts(
      token_counts_dict
      , 50
      , 1)
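
The `to_dict('split')` conversion above may look cryptic, so here is a small sketch of the structure it produces and why `dict()` can consume it directly. The dictionary below is written by hand to mirror the shape pandas returns for a two-column frame; the words and counts are made up.

```python
# Shape of pandas' `to_dict('split')` output for a two-column frame
# (the words and counts here are made up for illustration).
split_view = {
    'index': [1, 2, 3],
    'columns': ['text', 'counter'],
    'data': [['the', 1200], ['virus', 950], ['protein', 400]],
}

# dict() consumes the [word, count] pairs directly.
word_counts = dict(split_view['data'])
print(word_counts)
```

This shortcut only works because the frame has exactly two columns; with more columns each row would no longer be a key-value pair.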

Let's sort the subtokens we got and write them to the vocabulary file.

In [None]:
vocab = (
    cudf.Series(sw._all_subtoken_strings)
    .sort_values()
    .reset_index(drop=True)
)

with open('vocabulary.txt', 'w') as f:
    f.writelines([f'{item}\n' for item in list(vocab.to_array())])
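
BERT-style vocabulary files are plain text with one token per line, and a token's ID is its zero-based line number. As a minimal sketch of that convention, the snippet below writes a tiny made-up vocabulary to a temporary file and reads it back into a token-to-ID mapping; `load_vocab` and the toy tokens are illustrative and say nothing about what the generated `vocabulary.txt` actually contains.

```python
import os
import tempfile

def load_vocab(path):
    """Map each token to its line number, one token per line."""
    with open(path, encoding='utf-8') as f:
        return {line.rstrip('\n'): idx for idx, line in enumerate(f)}

# Write a tiny made-up vocabulary so the example runs on its own.
toy_tokens = ['[PAD]', '[UNK]', 'corona', 'virus']
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, 'toy_vocabulary.txt')
    with open(toy_path, 'w', encoding='utf-8') as f:
        f.writelines(f'{tok}\n' for tok in toy_tokens)
    token_to_id = load_vocab(toy_path)

print(token_to_id)
```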

## Build the hash version of the vocabulary
The `subword_tokenize` method requires a hashed version of the vocabulary in order to tokenize text into the representation BERT expects. The script from CLX achieves that: https://github.com/rapidsai/clx/blob/80d3198dfe54bef704d177404873d2312a77f2c9/python/clx/analytics/perfect_hash.py.

In [None]:
from scripts import perfect_hash

perfect_hash.hash_vocab(
    'vocabulary.txt'
    , 'vocabulary_hash.txt'
    , False
)

# Tokenize text
Now we are ready to tokenize the text.

In [None]:
def subword_tokenize(frame):
    num_strings = len(frame.text)
    num_bytes = frame.text.str.byte_count().sum()

    tokens, attention = frame.text.str.lower().str.subword_tokenize(
        'vocabulary_hash.txt'          #### hashed vocabulary file
        , 256                          #### maximum length of a sequence
        , 256                          #### stride
        , max_num_strings=num_strings  #### maximum number of strings to return
        , max_num_chars=num_bytes      #### maximum number of characters
        , max_rows_tensor=num_strings  #### maximum number of rows
        , do_lower=True                #### if True the original text will be lower-cased before encoding
        , do_truncate=True             #### if True the strings will be truncated or padded to the maximum length
    )[:2]
    
    temp = cudf.DataFrame()
    temp['tokens'] = tokens
    temp['attention'] = attention
    
    return temp

In [None]:
tokenized = subword_tokenize(articles)

In [None]:
tokenized.head()

Let's check how many tokens we get from the 300 articles we read.

In [None]:
tokens_cnt = len(tokenized)
articles_cnt = len(articles)

print(f'There are {tokens_cnt:,} tokens in the dataset.')

Since every paragraph is truncated or padded to exactly 256 tokens, dividing the total number of tokens by 256 should give us the total number of paragraphs in our corpus.

In [None]:
assert tokens_cnt == 256 * articles_cnt   #### integer comparison, no float division
print(f'Number of paragraphs derived from tokens: {tokens_cnt // 256:,}, actual number of paragraphs: {articles_cnt:,}')
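
The check above works because truncation and padding give every paragraph the same fixed width. Here is a minimal pure-Python sketch of that idea, assuming a padding ID of `0` and made-up token IDs; `pad_or_truncate` is a hypothetical helper and not how cuDF implements `subword_tokenize`.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (ids, attention_mask), each exactly max_len long."""
    kept = token_ids[:max_len]
    padding = max_len - len(kept)
    return kept + [pad_id] * padding, [1] * len(kept) + [0] * padding

max_len = 8  # the notebook uses 256; 8 keeps the output readable
toy_paragraphs = [[5, 9, 2], list(range(1, 13))]  # made-up token IDs

flat_tokens, flat_attention = [], []
for ids in toy_paragraphs:
    tokens, mask = pad_or_truncate(ids, max_len)
    flat_tokens.extend(tokens)
    flat_attention.extend(mask)

# Every paragraph contributes exactly max_len entries to the flat output.
assert len(flat_tokens) == max_len * len(toy_paragraphs)
print(flat_tokens)
print(flat_attention)
```

The attention mask marks real tokens with `1` and padding with `0`, so a downstream model can ignore the padded positions.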