# COVID research: empirical vs. non-empirical



This notebook prepares the Dimensions COVID-19 publications data for model building and analysis. The data file is too large to store on GitHub, so it is kept in the local `data` folder and excluded from version control via `.gitignore`.

Dimensions created the data file using the following query:

>"2019-nCoV" OR "COVID-19" OR “SARS-CoV-2” OR "HCoV-2019" OR "hcov" OR "NCOVID-19" OR "severe acute respiratory syndrome coronavirus 2" OR "severe acute respiratory syndrome corona virus 2" OR “coronavirus disease 2019” OR (("coronavirus" OR "corona virus") AND (Wuhan OR China OR novel))

The data export used here is dated 2021-07-03 (the date in the file name read in below) and can be obtained from [Dimensions](https://dimensions.figshare.com/articles/dataset/Dimensions_COVID-19_publications_datasets_and_clinical_trials/11961063) via Figshare.



## Workspace set-up

In [1]:
import pandas as pd
import numpy as np
from datetime import date



In [2]:
# Missing values function I grabbed from Kaggle
# https://www.kaggle.com/parulpandey/a-guide-to-handling-missing-values-in-python

def missing_values_table(df):
    # Total missing values per column
    mis_val = df.isnull().sum()

    # Percentage of missing values per column
    mis_val_percent = 100 * mis_val / len(df)

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Keep only columns with missing values, sorted by percentage missing (descending)
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns
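
As a quick illustration of what the helper returns, here is a minimal sketch on a made-up four-row frame. The `toy` data and its values are assumptions for demonstration only, and the function is restated in condensed form (without the printed summary) so the example runs on its own.

```python
import numpy as np
import pandas as pd


def missing_values_table(df):
    """Condensed restatement of the helper above (no printed summary)."""
    counts = df.isnull().sum()
    table = pd.DataFrame({
        "Missing Values": counts,
        "% of Total Values": 100 * counts / len(df),
    })
    # Keep only columns that have at least one missing value
    table = table[table["Missing Values"] != 0]
    return table.sort_values("% of Total Values", ascending=False).round(1)


# Hypothetical toy data: 2 of 4 abstracts and 1 of 4 journals are missing
toy = pd.DataFrame({
    "ID": ["pub.1", "pub.2", "pub.3", "pub.4"],
    "Abstract": ["text", np.nan, np.nan, "text"],
    "Journal": ["A", "B", np.nan, "C"],
})

summary = missing_values_table(toy)
print(summary)
```

Columns without missing values (`ID` here) are dropped from the summary, and the remaining rows are ordered from most to least missing.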

In [3]:
# Read in data file
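# lineterminator='\n' makes the parser end rows only at '\n',
# so any '\r' characters inside text fields are kept as part of the field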
df = pd.read_csv('/Users/brian/Coding/COVID-research/data/dimensions-covid19-export-2021-07-03-h06-38-27_publications.csv', lineterminator='\n')
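
One plausible reason for passing `lineterminator='\n'` (an assumption; the notebook does not state it) is that free-text fields such as abstracts can contain bare carriage returns, which the default parser may otherwise read as row breaks. The sketch below uses made-up inline data, not the Dimensions file, to show the carriage return being preserved inside a field.

```python
import io

import pandas as pd

# Hypothetical two-record CSV; the first abstract contains a bare '\r'
raw = "ID,Abstract\npub.1,first part\rsecond part\npub.2,another abstract\n"

# Only '\n' ends a row here, so the '\r' stays inside the Abstract field
toy = pd.read_csv(io.StringIO(raw), lineterminator="\n")

print(toy.shape)
print(repr(toy.loc[0, "Abstract"]))
```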

## Data preparation and screening


In [4]:
print(df.columns, df.shape)

Index(['Date added', 'Publication ID', 'DOI', 'PMID', 'PMCID', 'Title',
       'Abstract', 'Source title', 'Source UID', 'Publisher', 'MeSH terms',
       'Publication Date', 'PubYear', 'Volume', 'Issue', 'Pagination',
       'Open Access', 'Publication Type', 'Authors', 'Corresponding Authors',
       'Authors Affiliations', 'Research Organizations - standardized',
       'GRID IDs', 'City of Research organization',
       'Country of Research organization', 'Funder',
       'UIDs of supporting grants', 'Times cited', 'Altmetric',
       'Source Linkout', 'Dimensions URL'],
      dtype='object') (507443, 31)


In [5]:
# Keep only the variables needed for the analysis
df = df[['Publication ID', 'Title', 'Abstract',
         'Source title', 'Publication Date', 'PubYear', 'Publication Type']]

df.columns

Index(['Publication ID', 'Title', 'Abstract', 'Source title',
       'Publication Date', 'PubYear', 'Publication Type'],
      dtype='object')

In [9]:
df = df.rename(columns={"Publication ID":"ID", "Source title": "Journal", 
                        "Publication Date":"PubDate", "Publication Type": "PubType"})
df.columns

Index(['ID', 'Title', 'Abstract', 'Journal', 'PubDate', 'PubYear', 'PubType'], dtype='object')

In [10]:
# Obtain the unique publication types
pd.unique(df.PubType)

array(['article', 'chapter', 'preprint', 'monograph', 'proceeding',
       'book'], dtype=object)

In [11]:
# Reduce data to only 'articles'
df_articles = df[df["PubType"] == 'article']
print(df_articles.shape)

(406564, 7)
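
The filter keeps 406,564 of the 507,443 records, roughly 80%. A minimal sketch of how the breakdown by publication type could be inspected before filtering, using a made-up frame (the `toy` data is an assumption, not the Dimensions export):

```python
import pandas as pd

# Hypothetical toy data with the same PubType column as the notebook
toy = pd.DataFrame({
    "ID": ["pub.1", "pub.2", "pub.3", "pub.4", "pub.5"],
    "PubType": ["article", "preprint", "article", "chapter", "article"],
})

# Count records per publication type
type_counts = toy["PubType"].value_counts()
print(type_counts)

# Keep only articles; .copy() gives an independent frame for later edits
toy_articles = toy[toy["PubType"] == "article"].copy()
print(toy_articles.shape)
```

Taking a copy is a design choice rather than a requirement: it avoids pandas warnings if new columns are later assigned to the filtered frame.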


In [12]:
missing = missing_values_table(df_articles)
missing

Your selected dataframe has 7 columns.
There are 2 columns that have missing values.


| | Missing Values | % of Total Values |
|---|---|---|
| Abstract | 122823 | 30.2 |
| Journal | 10006 | 2.5 |
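
The reported percentages follow directly from the counts above and the 406,564 article rows. A short arithmetic check, using only numbers shown in this notebook's output (the variable names are illustrative):

```python
# Counts taken from the notebook output above
n_articles = 406_564
missing_counts = {"Abstract": 122_823, "Journal": 10_006}

# Percentage of article rows missing each field, rounded to one decimal
pct_missing = {
    col: round(100 * n / n_articles, 1) for col, n in missing_counts.items()
}
print(pct_missing)

# Number of articles that do have an abstract
n_with_abstract = n_articles - missing_counts["Abstract"]
print(n_with_abstract)
```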
