# Essential toolkit to work with CORD-19 🔭

### A collection of plug-and-play Python functions for working efficiently with the CORD-19 dataset.

**Motivation**
 - Given the large number of notebooks (1300+), it is hard to find useful code snippets.
 - To avoid reinventing the wheel.
 - To accelerate discovery and development.
 - Universal tools: whatever your task is, you probably need to go through a fairly standard pipeline.
    
**Notebook principles**
 - Each function is independent, single-scoped and well-documented.
 - The notebook outputs two datasets: **cord19.csv** (all data in a single dataframe; you can assume it contains the latest version) and **cord19_metadata_only.csv** (the same, without the full research text).
 - By a member of the Kaggle community, for the community. Help me by telling me what you need, and I will do my best to develop any function you may need.
    
**Updates and requests**
 - If this notebook receives any attention, it will be updated and enhanced with new features. Your feedback on what you need or what is unclear is important to me. Also, if you find it useful, please upvote it: with such a large collection of notebooks, it is hard to get noticed.
 - If you have a code snippet that you think others could use as well, write it in a comment and I will integrate it into the notebook!
 
**Disclaimer**
 - Work in progress, thank you for your understanding.
 - Some of the functions make use of a Python package I developed. It is called [Texthero](https://github.com/jbesomi/texthero/) and it is still in an early beta version. If you find a bug or have suggestions for new functionality, just leave a comment here or open an issue on GitHub and I will be glad to implement it!

#### Install and import packages


In [None]:
# Install and import texthero
!pip install texthero -q
import texthero as hero

# Import the other packages
import numpy as np 
import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns
sns.set(color_codes=True)

from pathlib import Path
import glob
import json

# 1. get_data()

**Description**

Return the CORD-19 dataset in a Pandas DataFrame.

**Arguments**

- `metadata_only` (`False` by default)
    If `True`, return only the metadata 


**Acknowledgment**

[xhlulu](https://www.kaggle.com/xhlulu) has released a great [notebook](https://www.kaggle.com/xhlulu/cord-19-eda-parse-json-and-generate-clean-csv/output) that parses and cleans the JSON data. The output of his kernel is four CSV datasets ready to be used for further analysis. The `get_data()` function makes use of this output; if you use it, you need to import the notebook. The fastest approach is probably to _fork_ this notebook.

**Code**

In [None]:
# Load the four cleaned CSV files and concatenate them

def get_data(metadata_only=False):
    """
    Return the CORD-19 dataset as a Pandas DataFrame.
    
    Parameters
    ----------
    metadata_only : bool (False by default)
        When True, return only the metadata Pandas DataFrame.
    
    """
    if metadata_only:
        return pd.read_csv("/kaggle/input/CORD-19-research-challenge/metadata.csv")
    
    CLEAN_DATA_PATH = Path("../input/cord-19-eda-parse-json-and-generate-clean-csv/")

    biorxiv_df = pd.read_csv(CLEAN_DATA_PATH / "biorxiv_clean.csv")
    biorxiv_df['source'] = 'biorxiv'

    pmc_df = pd.read_csv(CLEAN_DATA_PATH / "clean_pmc.csv")
    pmc_df['source'] = 'pmc'

    comm_use_df = pd.read_csv(CLEAN_DATA_PATH / "clean_comm_use.csv")
    comm_use_df['source'] = 'comm_use'

    noncomm_use_df = pd.read_csv(CLEAN_DATA_PATH / "clean_noncomm_use.csv")
    noncomm_use_df['source'] = 'noncomm_use'

    papers_df = pd.concat(
        [biorxiv_df,pmc_df, comm_use_df, noncomm_use_df], axis=0
    ).reset_index(drop=True)

    
    return papers_df

**Usage and examples**

In [None]:
papers_df = get_data()
papers_df.head()

Check shape:

In [None]:
papers_df.shape
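
Because `get_data()` tags every row with a `source` column before concatenating, the number of rows contributed by each source is easy to check. The sketch below reproduces the same tag-and-concatenate pattern on two tiny made-up frames; the `toy_` names and their contents are purely illustrative and are not part of the dataset.

```python
import pandas as pd

# Two tiny, made-up frames standing in for the cleaned CSV files.
toy_biorxiv = pd.DataFrame({"paper_id": ["a1", "a2"], "title": ["T1", "T2"]})
toy_pmc = pd.DataFrame({"paper_id": ["b1"], "title": ["T3"]})

# Tag each frame with its origin, as get_data() does.
toy_biorxiv["source"] = "biorxiv"
toy_pmc["source"] = "pmc"

# Stack the frames and rebuild a clean 0..n-1 index.
toy_papers = pd.concat([toy_biorxiv, toy_pmc], axis=0).reset_index(drop=True)

# Number of rows contributed by each source.
source_counts = toy_papers["source"].value_counts()
print(source_counts)
```

On the real data, the same check would be `papers_df['source'].value_counts()`.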

# 2. tfidf()

**Description**

Return the TF-IDF representation for the CORD-19 dataset. Vectors are computed on the concatenation of the specified text `columns`.

**Arguments**

- `df`: pd.DataFrame with at least one text column
- `columns`: list of column names used to compute TF-IDF (`['text', 'abstract']` by default)
- `dim`: dimension of the vector space (256 by default)
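
To make the role of `dim` more concrete, here is a minimal sketch of TF-IDF with a capped vocabulary. It uses scikit-learn's `TfidfVectorizer` as an independent illustration (this is an assumption for the example, not a statement about how Texthero works internally), and the mini corpus is made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus, for illustration only.
toy_docs = [
    "coronavirus spreads through respiratory droplets",
    "vaccine trials for coronavirus are ongoing",
    "respiratory symptoms include cough and fever",
]

# max_features keeps only the most frequent terms,
# which caps the dimension of every document vector.
vectorizer = TfidfVectorizer(max_features=5)
toy_tfidf = vectorizer.fit_transform(toy_docs)

print(toy_tfidf.shape)  # (number of documents, at most max_features)
print(sorted(vectorizer.vocabulary_))  # the terms that were kept
```

A smaller dimension gives faster downstream computation at the cost of ignoring rarer terms.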

**Code**

In [None]:
def tfidf(df, columns=['text', 'abstract'], dim=256):
    """Return a Pandas Series of TF-IDF vectors computed on the given columns."""
    if len(columns) == 0:
        raise ValueError("columns argument must be a list with at least one value.")
    
    # Merge all text columns, separated by a space, without modifying df
    content = df[columns[0]]
    
    for col in columns[1:]:
        content = content + " " + df[col]
        
    # Drop missing values (a row is NA as soon as one of its columns is NA)
    if content.isna().sum() > 0:
        print("Warning. The dataset contains NA. They will be dropped for TF-IDF computation.")
        content = content.dropna()
    
    # Compute TF-IDF
    return content.pipe(hero.do_tfidf, max_features=dim)

**Usage and examples**

Example 1: return a Pandas Series of TF-IDF.

In [None]:
sample_df = papers_df.sample(1000)
tfidf_s = tfidf(sample_df, columns=['abstract'])
tfidf_s.head()

Example 2: add the column to the current dataframe

In [None]:
sample_df = papers_df.sample(5000)
sample_df['tfidf'] = tfidf(sample_df, columns=['abstract'])
sample_df.head(2)

Since some of the abstracts were missing (NA), the TF-IDF vector has not been computed for those rows.

In [None]:
sample_df['tfidf'].isna().sum()
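
The missing values come from index alignment: `tfidf()` drops the NA rows before computing the vectors, and assigning the shorter Series back to the dataframe leaves NaN wherever an index label is absent. A minimal sketch with made-up data, where a string-length computation stands in for the TF-IDF step:

```python
import pandas as pd

toy_df = pd.DataFrame({"abstract": ["some text", None, "more text"]})

# Stand-in for tfidf(): compute something only on the non-missing rows.
toy_vectors = toy_df["abstract"].dropna().str.len()

# Assignment aligns on the index, so the dropped row (index 1) becomes NaN.
toy_df["tfidf"] = toy_vectors
print(toy_df["tfidf"].isna().sum())
```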

# 3. pca()

**Description**

Reduce the vector space to two dimensions to visualize the CORD-19 corpus.

**Argument**

- s: pd.Series containing one vector (a list of values) per element.
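
As a rough illustration of what the reduction does, the sketch below projects a Series of made-up vectors onto two dimensions with scikit-learn's `PCA`. Using scikit-learn here is an assumption made for the example; the notebook itself relies on `hero.do_pca`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Made-up Series with one 3-dimensional vector per element.
toy_s = pd.Series([
    [1.0, 0.0, 2.0],
    [0.5, 1.5, 0.0],
    [2.0, 1.0, 1.0],
    [0.0, 0.5, 3.0],
])

# Stack the lists into a (n_samples, n_features) matrix.
matrix = np.array(toy_s.tolist())

# Project onto the two directions of largest variance.
toy_2d = PCA(n_components=2).fit_transform(matrix)

# Back to a Series of 2-element lists, keeping the original index.
toy_pca = pd.Series(toy_2d.tolist(), index=toy_s.index)
print(toy_pca)
```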

**Code**

In [None]:
def pca(s):
    return hero.do_pca(s)

**Usage and examples**

In [None]:
sample_df = sample_df.dropna(how='any')
sample_df['pca'] = sample_df['tfidf'].pipe(pca)
sample_df['pca'].head()

# 4. show_pca()

**Description**

Show the reduced vector space as a scatter plot.

**Arguments**

- df: pd.DataFrame
- pca_col: name of the pre-computed PCA column
- color_col: (optional) color each dot according to the label in this column
- title: (optional) title of the plot

**Code**

In [None]:
def show_pca(df, pca_col, color_col=None, title=""):
    return hero.scatterplot(df, pca_col, color=color_col, title=title)

**Usage and examples**

In [None]:
title = "Vector space representation of CORD-19"
show_pca(sample_df, 'pca', title=title)

# 5. kmeans()

**Description**

Compute K-means on the given series and return the cluster labels.

**Argument**

- s: pd.Series
- n_clusters: int
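
For intuition about the returned labels, here is a small sketch using scikit-learn's `KMeans` on two well-separated groups of made-up points. It is an illustrative stand-in, not the Texthero implementation; `n_init` and `random_state` are set only to make the run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated, made-up groups of 2-D points.
toy_points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
])

# Fit K-means and get one integer cluster label per point.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = model.fit_predict(toy_points)
print(toy_labels)
```

The label values themselves are arbitrary; only the grouping matters, which is why the notebook casts them to strings before using them as plot colors.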

**Code**

In [None]:
def kmeans(s, n_clusters):
    return hero.do_kmeans(s, n_clusters=n_clusters)

**Usage and examples**

In [None]:
title = "Vector space representation of CORD-19 with K-means"
sample_df['kmeans'] = kmeans(sample_df['tfidf'], 20)
sample_df['kmeans'] = sample_df['kmeans'].astype(str)  # for a nicer visualization
show_pca(sample_df, 'pca', color_col='kmeans', title=title)

# 6. topic_modeling()

Next version. Stay tuned.

**Save output**

In [None]:
papers_df.to_csv("cord19.csv", index=False)
get_data(metadata_only=True).to_csv("cord19_metadata_only.csv", index=False)

#### Temporary conclusion: thank you for reading it all. I hope it will be useful to many of you!