# COVID-19 Thematic tagging with Regular Expressions

***UPDATE: Most of the ideas in this Notebook can be accessed via the [covid19_tools](https://www.kaggle.com/ajrwhite/covid19-tools) utility script, which you can import into your own Notebook, and is being updated more frequently than this.***

Goal: tag papers with themes (e.g. `tag_disease_covid19` or `tag_risk_smoking`) _using handcrafted rules_ based on synonyms and related terms, searching the title and abstract fields of the **metadata.csv** file.

The chart below shows how multiple synonyms for Covid-19 become a single boolean field in the metadata:

In [None]:
import plotly.express as px
import plotly.graph_objects as go
    
hardcoded_data_for_intro_chart = {
    'covid': 1231,
    '2019 ncov': 576,
    'sars cov 2': 501,
    r'coronavirus 2\b': 154,
    'coronavirus 2019': 61,
    'wuhan coronavirus': 13,
    'coronavirus disease 19': 12,
    'ncov 2019': 10,
    'wuhan pneumonia': 7,
    '2019ncov': 6,
    'wuhan virus': 3,
    r'2019n cov\b': 2,
    r'2019 n cov\b': 2,
    r'\bn cov 2019': 0
}

title = 'Covid19 synonyms in title / abstract metadata<br><i>Hover over dots for exact values</i>'
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=list(hardcoded_data_for_intro_chart.values())[::-1],
    y=list(hardcoded_data_for_intro_chart.keys())[::-1],
    marker=dict(color="crimson", size=12),
    mode="markers",
    name='Synonyms'
))

fig.add_trace(go.Scatter(
    x=[2105],
    y=['ncov 2019'],
    marker=dict(color='blue', size=20),
    mode='markers',
    text='tag_disease_covid19',
    name='tag_disease_covid19'
))

fig.update_layout(title=title,
              xaxis_title='Counts',
              yaxis_title='Regular Expressions')
fig.show()

## Contents

1. [Motivation](#Motivation) - Why filter papers with regular expressions?
2. **[Diseases and conditions](#Diseases)** - Does paper discuss Covid-19, SARS, MERS, etc.?
3. **[Research Design](#Design)** - Is research design indicated in the abstract?
4. **[Potential risk factors](#Risks)** - Does paper indicate characteristics, comorbidities?
5. **[Immunity and vaccinations](#Immunity)** - Does paper discuss immunity and / or vaccines?
6. **[Geographies](#Geographies)** - Does paper cover specific continents, countries, etc?
7. **[Climate](#Climate)** - Does paper cover issues relating to climate and weather?
8. **[Transmission](#Transmission)** - Does paper mention transmission routes / rates?
9. [Output](#Output) - File outputs
10. [Filtering Tool](#Filtering-tool) - TODO


## Motivation

**Unfocused dataset**: Dataset contains >44k papers, but most of them aren't specifically about Covid-19.

**Inconsistent terminology**: Terminology for Covid-19 wasn't standardised \[[1](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/technical-guidance/naming-the-coronavirus-disease-%28covid-2019%29-and-the-virus-that-causes-it)\] \[[2](https://qz.com/1820422/coronavirus-why-wont-who-use-the-name-sars-cov-2/)\] when papers first emerged. 

**Handcrafted features can outperform inferred features**: Domain-specific handcrafted synonym lists can tag papers more efficiently than generic topic modelling approaches.

**Faster filtering on metadata**: Most of these themes can be extracted from the metadata (title and abstract). We can filter on these tags to identify useful papers for more involved analysis.


Click on the **Code** button below to see the code that imports libraries and loads the data.

In [None]:
# Data libraries
import pandas as pd
import re
import pycountry

# Visualisation libraries
import plotly.express as px
import plotly.graph_objects as go

%matplotlib inline

pd.set_option('display.max_columns', 500)

# Load data
metadata_file = '../input/CORD-19-research-challenge/metadata.csv'
df = pd.read_csv(metadata_file,
                 dtype={'Microsoft Academic Paper ID': str,
                        'pubmed_id': str})

def doi_url(d):
    if d.startswith('http'):
        return d
    elif d.startswith('doi.org'):
        return f'http://{d}'
    else:
        return f'http://doi.org/{d}'
    
df.doi = df.doi.fillna('').apply(doi_url)

print(f'loaded DataFrame with {len(df)} records')

In [None]:
# Helper function for filtering df on abstract + title substring
def abstract_title_filter(search_string):
    return (df.abstract.str.lower().str.replace('-', ' ').str.contains(search_string, na=False) |
            df.title.str.lower().str.replace('-', ' ').str.contains(search_string, na=False))
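
To see what the hyphen replacement buys us, here is a small sketch that uses only the standard library. `normalise` and `matches` are hypothetical helpers that mimic, for a single string, what the filter above does for a whole column; they are not part of the notebook.

```python
import re

def normalise(text):
    """Lower-case the text and turn hyphens into spaces."""
    return text.lower().replace('-', ' ')

def matches(pattern, text):
    """Return True if the regex pattern occurs anywhere in the normalised text."""
    return re.search(pattern, normalise(text)) is not None

examples = [
    'SARS-CoV-2 infection in adults',   # hyphenated form
    'Sars CoV 2 infection in adults',   # spaced form
    'SARSCoV2 infection in adults',     # run-together form
]
results = [matches('sars cov 2', t) for t in examples]
print(results)
```

One search string covers the hyphenated and spaced spellings, but a run-together spelling still needs its own synonym.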

In [None]:
# Helper function for Cleveland dot plot visualisation of count data
def dotplot(input_series, title, x_label='Count', y_label='Regex'):
    subtitle = '<br><i>Hover over dots for exact values</i>'
    fig = go.Figure()
    fig.add_trace(go.Scatter(
        x=input_series.sort_values(),
        y=input_series.sort_values().index.values,
        marker=dict(color="crimson", size=12),
        mode="markers",
        name="Count",
    ))
    fig.update_layout(title=f'{title}{subtitle}',
                  xaxis_title=x_label,
                  yaxis_title=y_label)
    fig.show()

In [None]:
# Helper function which counts synonyms and adds tag column to DF
def count_and_tag(df: pd.DataFrame,
                  synonym_list: list,
                  tag_suffix: str) -> (pd.DataFrame, pd.Series):
    counts = {}
    df[f'tag_{tag_suffix}'] = False
    for s in synonym_list:
        synonym_filter = abstract_title_filter(s)
        counts[s] = sum(synonym_filter)
        df.loc[synonym_filter, f'tag_{tag_suffix}'] = True
    return df, pd.Series(counts)
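
The tagging logic can be illustrated on a toy table. This is a self-contained sketch, assuming made-up titles and abstracts; `toy`, `toy_filter` and `toy_synonyms` are hypothetical names used only here. Each synonym gets its own count, while the tag column is the logical OR across all synonyms.

```python
import pandas as pd

# Three invented records; the second has no abstract
toy = pd.DataFrame({
    'title': ['COVID-19 in Wuhan',
              'A 2019-nCoV case report',
              'Bovine coronavirus in calves'],
    'abstract': ['Early SARS-CoV-2 cases.', None, 'No human disease studied.'],
})
toy_synonyms = ['covid', '2019 ncov', 'sars cov 2']

def toy_filter(frame, pattern):
    """Boolean mask: pattern found in the normalised title or abstract."""
    def col_mask(col):
        return (frame[col].str.lower()
                          .str.replace('-', ' ', regex=False)
                          .str.contains(pattern, na=False))
    return col_mask('title') | col_mask('abstract')

toy['tag_disease_covid19'] = False
toy_counts = {}
for s in toy_synonyms:
    mask = toy_filter(toy, s)
    toy_counts[s] = int(mask.sum())           # per-synonym count
    toy.loc[mask, 'tag_disease_covid19'] = True  # OR into the single tag

print(toy_counts)
print(toy['tag_disease_covid19'].tolist())
```

The first record matches two synonyms but is tagged once, which is why the tag total can be smaller than the sum of the synonym counts.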

In [None]:
# Function for printing out key passage of abstract based on key terms
def print_key_phrases(df, key_terms, n=5, chars=300):
    for ind, item in enumerate(df[:n].itertuples()):
        print(f'{ind+1} of {len(df)}')
        print(item.title)
        print('[ ' + item.doi + ' ]')
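        try:
            i = len(item.abstract)
            for kt in key_terms:
                kt = kt.replace(r'\b', '')  # strip word-boundary tokens for literal search
                term_loc = item.abstract.lower().find(kt)
                if term_loc != -1:
                    i = min(i, term_loc)
            if i < len(item.abstract):
                # Start 30 characters before the first hit, clamped at 0 so
                # the slice cannot wrap around to the end of the abstract
                start = max(i - 30, 0)
                print('    "' + item.abstract[start:start+chars] + '"')
            else:
                print('    "' + item.abstract[:chars] + '"')
        except (TypeError, AttributeError):  # missing abstract is NaN (a float)
            print('NO ABSTRACT')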
        print('---')
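
The windowing idea behind the key-phrase printout can be isolated into a small pure function, which is easier to test than a function that prints. This is a sketch under assumptions: `key_phrase_snippet` and its `lead` parameter are hypothetical names, and the key terms are treated as plain substrings rather than regular expressions.

```python
def key_phrase_snippet(abstract, key_terms, chars=300, lead=30):
    """Return up to `chars` characters starting `lead` characters before the
    earliest key term, or the start of the abstract if no term is found."""
    lowered = abstract.lower()
    hits = [lowered.find(t) for t in key_terms]
    hits = [h for h in hits if h != -1]
    if not hits:
        return abstract[:chars]
    start = max(min(hits) - lead, 0)  # never start before the first character
    return abstract[start:start + chars]

text = 'Smoking history was recorded for all patients admitted with pneumonia.'
snippet_early = key_phrase_snippet(text, ['smoking'], chars=20)
snippet_late = key_phrase_snippet(text, ['pneumonia'], chars=20, lead=5)
print(snippet_early)
print(snippet_late)
```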

# Diseases

- Covid-19
- Severe Acute Respiratory Syndrome (SARS)
- Middle East Respiratory Syndrome (MERS)
- Coronaviruses
- Acute Respiratory Distress Syndrome (ARDS)

## Covid-19

We are looking for papers that specifically refer to the recent outbreak, known variously as Covid-19, SARS-CoV-2, 2019-nCoV, Wuhan Pneumonia, novel coronavirus.

See: https://en.wikipedia.org/wiki/Coronavirus_disease_2019

In [None]:
covid19_synonyms = ['covid',
                    'coronavirus disease 19',
                    'sars cov 2', # Note that search function replaces '-' with ' '
                    '2019 ncov',
                    '2019ncov',
                    r'2019 n cov\b',
                    r'2019n cov\b',
                    'ncov 2019',
                    r'\bn cov 2019',
                    'coronavirus 2019',
                    'wuhan pneumonia',
                    'wuhan virus',
                    'wuhan coronavirus',
                    r'coronavirus 2\b']
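
The trailing `\b` on some of these patterns matters. As a quick standard-library sketch (the titles are invented and `hit` is a hypothetical helper), compare the bounded and unbounded versions of the last synonym:

```python
import re

def hit(pattern, text):
    """Apply the pattern to lower-cased text with hyphens replaced by spaces."""
    normalised = text.lower().replace('-', ' ')
    return re.search(pattern, normalised) is not None

titles = [
    'Severe acute respiratory syndrome coronavirus 2',
    'Human coronavirus 229E in infants',
    'Coronavirus 2019 outbreak response',
]

bounded = [hit(r'coronavirus 2\b', t) for t in titles]
unbounded = [hit('coronavirus 2', t) for t in titles]
print(bounded)
print(unbounded)
```

Without the word boundary, the pattern would also pick up the unrelated human coronavirus 229E. The "coronavirus 2019" spelling is handled by its own entry in the list.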

In [None]:
df, covid19_counts = count_and_tag(df, covid19_synonyms, 'disease_covid19')

In [None]:
covid19_counts.sort_values(ascending=False)

In [None]:
dotplot(covid19_counts, 'Covid-19 synonyms in title / abstract metadata')

In [None]:
novel_corona_filter = (abstract_title_filter('novel corona') &
                       df.publish_time.str.startswith('2020', na=False))
print(f'novel corona (published 2020): {sum(novel_corona_filter)}')
df.loc[novel_corona_filter, 'tag_disease_covid19'] = True

In [None]:
df.tag_disease_covid19.value_counts()

In [None]:
# SENSE CHECK: Confirm these all published 2020 (or missing date)
df[df.tag_disease_covid19].publish_time.str.slice(0, 4).value_counts(dropna=False)

In [None]:
# Fix the earlier papers that are about something else
df.loc[df.tag_disease_covid19 & ~df.publish_time.str.startswith('2020', na=True),
       'tag_disease_covid19'] = False
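
The `na` argument decides what happens to papers with no publication date. A minimal sketch on an invented Series shows the difference; with `na=True`, missing dates count as "2020" and therefore keep their tag after the negation above.

```python
import pandas as pd

# Invented publish_time values, including one missing date
dates = pd.Series(['2020-02-11', '2014-06-01', None, '2020'])

keep_missing = dates.str.startswith('2020', na=True).tolist()
drop_missing = dates.str.startswith('2020', na=False).tolist()
print(keep_missing)
print(drop_missing)
```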

## Severe Acute Respiratory Syndrome (SARS)

SARS typically means the related coronavirus that caused an outbreak in 2003, although Covid-19 is sometimes referred to with a SARS name.

See: https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome_coronavirus

In [None]:
sars_synonyms = [r'\bsars\b',
                 'severe acute respiratory syndrome']

In [None]:
df, sars_counts = count_and_tag(df, sars_synonyms, 'disease_sars')

In [None]:
sars_counts

In [None]:
df.tag_disease_sars.value_counts()

In [None]:
df.groupby('tag_disease_covid19').tag_disease_sars.value_counts()

## Middle East Respiratory Syndrome (MERS)

See: https://en.wikipedia.org/wiki/Middle_East_respiratory_syndrome

In [None]:
mers_synonyms = [r'\bmers\b',
                 'middle east respiratory syndrome']

In [None]:
df, mers_counts = count_and_tag(df, mers_synonyms, 'disease_mers')

In [None]:
mers_counts

In [None]:
df.tag_disease_mers.value_counts()

In [None]:
df.groupby('tag_disease_covid19').tag_disease_mers.value_counts()

## Coronaviruses

**IMPORTANT: This tag needs more work.**

Coronaviruses are a group of related viruses that cause disease in mammals and birds.

See: https://en.wikipedia.org/wiki/Coronavirus

In [None]:
corona_synonyms = ['corona', r'\bcov\b']

In [None]:
df, corona_counts = count_and_tag(df, corona_synonyms, 'disease_corona')

In [None]:
corona_counts

In [None]:
df.tag_disease_corona.value_counts()

In [None]:
df.groupby('tag_disease_covid19').tag_disease_corona.value_counts()

## Acute Respiratory Distress Syndrome (ARDS)

ARDS is a possible consequence of Covid-19 infection.

See: https://en.wikipedia.org/wiki/Acute_respiratory_distress_syndrome

In [None]:
ards_synonyms = ['acute respiratory distress syndrome',
                 r'\bards\b']

In [None]:
df, ards_counts = count_and_tag(df, ards_synonyms, 'disease_ards')

In [None]:
ards_counts

In [None]:
df.tag_disease_ards.value_counts()

In [None]:
n = (df.tag_disease_covid19 & df.tag_disease_ards).sum()
print(f'There are {n} papers on Covid-19 and ARDS.')

# Design

Research design (thanks to Savanna Reid for input on these):

- risk factor analysis
    - retrospective cohort
    - cross-sectional case-control
    - prospective case-control
    - matched case-control
    - medical records review
    - seroprevalence survey
    - syndromic surveillance
- time series analysis
    - survival analysis

In [None]:
riskfac_synonyms = [
    'risk factor analysis',
    'retrospective cohort',
    'cross sectional case control',
    'prospective case control',
    'matched case control',
    'medical records review',
    'seroprevalence survey',
    'syndromic surveillance'
]
df, riskfac_counts = count_and_tag(df, riskfac_synonyms, 'design_riskfac')
dotplot(riskfac_counts, 'Risk factor analysis synonyms in title / abstract metadata')

In [None]:
riskfac_counts.sort_values(ascending=False)

In [None]:
n = (df.tag_disease_covid19 & df.tag_design_riskfac).sum()
print(f'There are {n} papers on Covid-19 with a Risk Factor Analysis research design.')

# Risks

Potential risk factors:

- Generic risk factors
- _Demographic_:
    - Age
    - Sex
    - Bodyweight
    - Blood type
    - Ethnicity (TODO)
- _Behavioural_:
    - Smoking
    - Occupation (TODO)
    - Animal contact (TODO)
    - Social activity (TODO)
- _Pre-existing conditions_:
    - Diabetes
    - Hypertension
    - Immunodeficiency (general)
    - Cancer (general)
    - Chronic respiratory disease (general - inc. asthma, bronchitis)
    - Asthma
    - Cardiovascular disease (TODO)
    - Chronic respiratory disease / bronchitis (TODO)
    - Cerebral infarction (TODO)

See _Estimation of risk factors for COVID-19 mortality - preliminary results_, https://doi.org/10.1101/2020.02.24.20027268

## Generic risk factors

Look for text that indicates that risk factors are assessed in the paper.

In [None]:
risk_factor_synonyms = ['risk factor',
                        'risk model',
                        'risk by',
                        'comorbidity',
                        'comorbidities',
                        'coexisting condition',
                        'co existing condition',
                        'clinical characteristics',
                        'clinical features',
                        'demographic characteristics',
                        'demographic features',
                        'behavioural characteristics',
                        'behavioural features',
                        'behavioral characteristics',
                        'behavioral features',
                        'predictive model',
                        'prediction model',
                        'univariate', # implies analysis of risk factors
                        'multivariate', # implies analysis of risk factors
                        'multivariable',
                        'univariable',
                        'odds ratio', # typically mentioned in model report
                        'confidence interval', # typically mentioned in model report
                        'logistic regression',
                        'regression model',
                        'factors predict',
                        'factors which predict',
                        'factors that predict',
                        'factors associated with',
                        'underlying disease',
                        'underlying condition']
df, risk_generic_counts = count_and_tag(df, risk_factor_synonyms, 'risk_generic')
dotplot(risk_generic_counts,
        'Count of generic risk factor indicated in title / abstract')

In [None]:
risk_generic_counts.sort_values(ascending=False)

In [None]:
n = (df.tag_disease_covid19 & df.tag_risk_generic).sum()
print(f'There are {n} papers on Covid-19 and generic risk factors.')

Printing out 5 examples, and key text from the Abstract.

In [None]:
print_key_phrases(df[df.tag_disease_covid19 & df.tag_risk_generic],
                  risk_factor_synonyms)

## Demographic risk factors

## Age

In [None]:
age_synonyms = ['median age',
                'mean age',
                'average age',
                'elderly',
                r'\baged\b',
                r'\bold',
                'young',
                'teenager',
                'adult',
                'child'
               ]
df, age_counts = count_and_tag(df, age_synonyms, 'risk_age')
dotplot(age_counts, 'Age synonyms in title / abstract metadata')

In [None]:
age_counts.sort_values(ascending=False)

In [None]:
n = (df.tag_disease_covid19 & df.tag_risk_age).sum()
print(f'There are {n} papers on Covid-19 and age.')

## Sex

e.g. _Sex difference and smoking predisposition in patients with COVID-19_, https://doi.org/10.1016/S2213-2600(20)30117-X

In [None]:
sex_synonyms = ['sex',
                'gender',
                r'\bmale\b',
                r'\bfemale\b',
                r'\bmales\b',
                r'\bfemales\b',
                r'\bmen\b',
                r'\bwomen\b'
               ]
df, sex_counts = count_and_tag(df, sex_synonyms, 'risk_sex')
dotplot(sex_counts, 'Sex / gender synonyms in title / abstract metadata')

In [None]:
sex_counts.sort_values(ascending=False)

In [None]:
n = (df.tag_disease_covid19 & df.tag_risk_sex).sum()
print(f'There are {n} papers on Covid-19 and sex / gender.')

## Bodyweight

Obesity and related problems (e.g. diabetes, hypertension) have been widely suggested as risk factors, e.g. _The confluence of the COVID19 pandemic with the obesity epidemic_, https://doi.org/10.1136/bmj.m810

In [None]:
bodyweight_synonyms = [
    'overweight',
    'over weight',
    'obese',
    'obesity',
    'bodyweight',
    'body weight',
    r'\bbmi\b',
    'body mass',
    'body fat',
    'bodyfat',
    'kilograms',
    r'\bkg\b', # e.g. 70 kg
    r'\dkg\b'  # e.g. 70kg
]
df, bodyweight_counts = count_and_tag(df, bodyweight_synonyms, 'risk_bodyweight')
dotplot(bodyweight_counts, 'Bodyweight synonyms in title / abstract data')

In [None]:
bodyweight_counts.sort_values(ascending=False)

In [None]:
n = (df.tag_disease_covid19 & df.tag_risk_bodyweight).sum()
print(f'There are {n} papers on Covid-19 and bodyweight')

In [None]:
print_key_phrases(df[df.tag_disease_covid19 & df.tag_risk_bodyweight],
                  bodyweight_synonyms)

## Smoking

e.g. _Sex difference and smoking predisposition in patients with COVID-19_,  https://doi.org/10.1016/S2213-2600(20)30117-X

- smoking
- smoke(rs)
- cigarette(s)
- cigar(s)
- e-cigarette(s)
- cannabis / marijuana / thc

In [None]:
smoking_synonyms = ['smoking',
                    'smoke',
                    'cigar', # this picks up cigar, cigarette, e-cigarette, etc.
                    'nicotine',
                    'cannabis',
                    'marijuana']
df, smoking_counts = count_and_tag(df, smoking_synonyms, 'risk_smoking')
dotplot(smoking_counts, 'Smoking synonym counts in title / abstract metadata')

In [None]:
smoking_counts.sort_values(ascending=False)

In [None]:
df.groupby('tag_disease_covid19').tag_risk_smoking.value_counts()

In [None]:
n = (df.tag_disease_covid19 & df.tag_risk_smoking).sum()
print(f'tag_disease_covid19 x tag_risk_smoking currently returns {n} papers')

In [None]:
print_key_phrases(df[df.tag_disease_covid19 & df.tag_risk_smoking],
                  smoking_synonyms, n=12)

## Diabetes

- Type I Diabetes
- Type II Diabetes

In [None]:
diabetes_synonyms = [
    'diabet', # picks up diabetes, diabetic, etc.
    'insulin', # any paper mentioning insulin likely to be relevant
    'blood sugar',
    'blood glucose',
    'ketoacidosis',
    'hyperglycemi', # picks up hyperglycemia and hyperglycemic
]
df, diabetes_counts = count_and_tag(df, diabetes_synonyms, 'risk_diabetes')
dotplot(diabetes_counts, 'Diabetes synonym counts in title / abstract metadata')

In [None]:
diabetes_counts.sort_values(ascending=False)

In [None]:
n = (df.tag_disease_covid19 & df.tag_risk_diabetes).sum()
print(f'There are {n} papers on Covid-19 and diabetes')

In [None]:
print_key_phrases(df[df.tag_disease_covid19 & df.tag_risk_diabetes],
                  diabetes_synonyms, n=49)

## Hypertension

In [None]:
hypertension_synonyms = [
    'hypertension',
    'blood pressure',
    r'\bhbp\b', # HBP = high blood pressure
    r'\bhtn\b' # HTN = hypertension
]
df, hypertension_counts = count_and_tag(df, hypertension_synonyms, 'risk_hypertension')
dotplot(hypertension_counts, 'Hypertension synonyms in title / abstract metadata')

In [None]:
hypertension_counts.sort_values(ascending=False)

In [None]:
n = (df.tag_disease_covid19 & df.tag_risk_hypertension).sum()
print(f'There are {n} papers on Covid-19 and hypertension')

## Immunodeficiency

Immunodeficiency (e.g. HIV / AIDS, side effect of chemotherapy, etc.) may be important.

In [None]:
immunodeficiency_synonyms = [
    'immune deficiency',
    'immunodeficiency',
    r'\bhiv\b',
    r'\baids\b',
    'granulocyte deficiency',
    'hypogammaglobulinemia',
    'asplenia',
    'dysfunction of the spleen',
    'spleen dysfunction',
    'complement deficiency',
    'neutropenia',
    'neutropaenia', # alternate spelling
    'cell deficiency' # e.g. T cell deficiency, B cell deficiency
]
df, immunodeficiency_counts = count_and_tag(df,
                                            immunodeficiency_synonyms,
                                            'risk_immunodeficiency')
dotplot(immunodeficiency_counts, 'Immunodeficiency synonyms in title / abstract metadata')

In [None]:
immunodeficiency_counts.sort_values(ascending=False)

In [None]:
n = (df.tag_disease_covid19 & df.tag_risk_immunodeficiency).sum()
print(f'tag_disease_covid19 x tag_risk_immunodeficiency currently returns {n} papers')

In [None]:
df[df.tag_disease_covid19 & df.tag_risk_immunodeficiency].head()

## Cancer

In [None]:
cancer_synonyms = [
    'cancer',
    'malignant tumour',
    'malignant tumor',
    'melanoma',
    'leukemia',
    'leukaemia',
    'chemotherapy',
    'radiotherapy',
    'radiation therapy',
    'lymphoma',
    'sarcoma',
    'carcinoma',
    'blastoma',
    'oncolog'
]
df, cancer_counts = count_and_tag(df, cancer_synonyms, 'risk_cancer')
dotplot(cancer_counts, 'Cancer synonyms in title / abstract metadata')

In [None]:
cancer_counts.sort_values(ascending=False)

In [None]:
n = (df.tag_disease_covid19 & df.tag_risk_cancer).sum()
print(f'There are {n} papers on Covid-19 and cancer')

## Chronic respiratory disease

In [None]:
chronicresp_synonyms = [
    'chronic respiratory disease',
    'asthma',
    'chronic obstructive pulmonary disease',
    r'\bcopd',
    'chronic bronchitis',
    'emphysema'
]
df, chronicresp_counts = count_and_tag(df, chronicresp_synonyms, 'risk_chronicresp')
dotplot(chronicresp_counts, 'Chronic respiratory disease terms in title / abstract metadata')

In [None]:
chronicresp_counts.sort_values(ascending=False)

In [None]:
n = (df.tag_disease_covid19 & df.tag_risk_chronicresp).sum()
print(f'There are {n} papers on Covid-19 and chronic respiratory disease')

In [None]:
print_key_phrases(df[df.tag_disease_covid19 & df.tag_risk_chronicresp],
                  chronicresp_synonyms, n=15)

## Asthma

In [None]:
# Only really one term for asthma
df, asthma_counts = count_and_tag(df, ['asthma'], 'risk_asthma')
asthma_counts

In [None]:
n = (df.tag_disease_covid19 & df.tag_risk_asthma).sum()
print(f'There are {n} papers on Covid-19 and asthma')

In [None]:
print_key_phrases(df[df.tag_disease_covid19 & df.tag_risk_asthma],
                  ['asthma'])

# Immunity

Looking for terms which indicate factors relating to vaccination and immunity.

## Generic immunity / vaccination

Papers which mention generic themes relating to immunity / vaccination. (As the research develops, we may extend this section to include specific lines of research relating to immunity / vaccination.)

In [None]:
immunity_synonyms = [
    'immunity',
    r'\bvaccin',
    'inoculat', # picks up inoculate, inoculation, etc.
    'innoculat' # common misspelling
]
df, immunity_counts = count_and_tag(df, immunity_synonyms, 'immunity_generic')
immunity_counts

In [None]:
n = (df.tag_disease_covid19 & df.tag_immunity_generic).sum()
print(f'There are {n} papers on Covid-19 and immunity / vaccines')

In [None]:
print('Intersection of tag_disease_covid19, tag_risk_generic & tag_immunity_generic')
print('=' * 76)
print_key_phrases(df[df.tag_disease_covid19 &
                     df.tag_risk_generic &
                     df.tag_immunity_generic],
                  risk_factor_synonyms + immunity_synonyms)

In [None]:
tag_columns = df.columns[df.columns.str.startswith('tag_')].tolist()

# Geographies

**IMPORTANT: This section is still under development, as the focus so far has been on Risk Factors and Research Design.**

- Continents (inc. continental regions)
- Countries
- Key regions of countries
- Key cities

## Continents

These search strings include continents and subregions of continents, with particular focus on countries where initial outbreaks have been studied (e.g. China, Korea, Japan, Iran, Italy).

In [None]:
# Note that this section needs more work - have been focusing on later sections
continental_regions = {
    'asia': 'asia|china|korea|japan|hubei|wuhan|malaysia|singapore|hong kong',
    'east_asia': 'east asia|china|korea|japan|hubei|wuhan|hong kong',
    'south_asia': 'south asia|india|pakistan|bangladesh|sri lanka',
    'se_asia': r'south east asia|\bse asia|malaysia|thailand|indonesia|vietnam|cambodia|viet nam',
    'europe': 'europe|italy|france|spain|germany|austria|switzerland|united kingdom|ireland',
    'africa': 'africa|kenya',
    'middle_east': r'middle east|gulf states|saudi arabia|\buae\b|iran|persian',
    'south_america': 'south america|latin america|brazil|argentina',
    'north_america': r'north america|\busa\b|united states|canada|caribbean',
    'australasia': 'australia|new zealand|oceania|australasia|south pacific'
}

counts = {}
for cr, s in continental_regions.items():
    con_filter = abstract_title_filter(s)
    counts[cr] = sum(con_filter)
    df[f'tag_continent_{cr}'] = con_filter
counts = pd.Series(counts)
dotplot(counts, 'Continent counts in title / abstract metadata')
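
Patterns that mix alternation with `\b` need care in Python. The sketch below, using only the standard library and invented text, shows two pitfalls: in a plain string literal `\b` is a backspace character rather than a word boundary, and a short term with no boundaries can match inside longer words.

```python
import re

plain = '\buae\b'   # plain literal: '\b' is a backspace (ASCII 8)
raw = r'\buae\b'    # raw literal: '\b' is the regex word-boundary token

text = 'cases reported in the uae and qatar'
plain_hit = re.search(plain, text) is not None
raw_hit = re.search(raw, text) is not None
print(len(plain), len(raw))
print(plain_hit, raw_hit)

# A short unbounded term can match inside unrelated words
loose_hit = re.search('usa', 'thousands of cases') is not None
strict_hit = re.search(r'\busa\b', 'thousands of cases') is not None
print(loose_hit, strict_hit)
```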

## Countries

We will just use countries that appear at least 50 times in the dataset. The threshold can be adjusted to get more detail.

_**IMPORTANT**: This takes several minutes to run. Skip if not important. PyCountry uses official names which are different from commonly used names - need to fix this._

_TO DO: Add in country subregions (e.g. Hubei -> China, Lombardy -> Italy)_

In [None]:
### THIS SECTION TAKES A LONG TIME TO RUN SO COMMENTED OUT WHILE DEVELOPING
# MIN_PAPERS_PER_COUNTRY = 50
# counts = {}

# for i, country in enumerate(pycountry.countries):
#     if i % 20 == 0:
#         print(f'Checking country {i} ({country.name})')
#     country_filter = abstract_title_filter(r'\b' + re.escape(country.name.lower()) + r'\b')
#     n = sum(country_filter)
#     if n >= MIN_PAPERS_PER_COUNTRY:
#         counts[country.name] = n
#         df.loc[country_filter, f'tag_country_{country.alpha_3.lower()}'] = True
#         df[f'tag_country_{country.alpha_3.lower()}'].fillna(False, inplace=True)
# counts = pd.Series(counts)
# plt.figure(figsize=(5,7))
# dotplot(counts, 'Country counts in title / abstract metadata')
# df.groupby('tag_disease_covid19').tag_country_chn.value_counts()
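
One possible way to handle both the naming mismatch and the subregion TODO is a hand-maintained lookup from ISO 3166 alpha-3 code to search terms. The sketch below is an illustration only: `COUNTRY_TERMS`, `country_pattern` and `tag_countries` are hypothetical names, the term lists are examples rather than a vetted gazetteer, and it deliberately avoids PyCountry so it runs on its own.

```python
import re

# Hypothetical override table: alpha-3 code -> common names and subregions
COUNTRY_TERMS = {
    'chn': ['china', 'hubei', 'wuhan'],
    'ita': ['italy', 'lombardy'],
    'irn': ['iran'],
    'kor': ['south korea', 'republic of korea'],
    'vnm': ['vietnam', 'viet nam'],
}

def country_pattern(terms):
    """Build one word-bounded alternation from a list of literal terms."""
    return r'\b(?:' + '|'.join(re.escape(t) for t in terms) + r')\b'

def tag_countries(text):
    """Return the sorted alpha-3 codes whose terms appear in the text."""
    normalised = text.lower().replace('-', ' ')
    return sorted(code for code, terms in COUNTRY_TERMS.items()
                  if re.search(country_pattern(terms), normalised))

tags = tag_countries('Early transmission dynamics in Hubei and Lombardy')
print(tags)
```

Building one alternation per country also means one pass over the text per country instead of one per name, which may help with the long run time.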

# Climate

Climate has been hypothesised as a factor in the spread of Covid-19.

In [None]:
climate_synonyms = [
    'climate',
    'weather',
    'humid',
    'sunlight',
    'air temperature',
    'meteorolog', # picks up meteorology, meteorological, meteorologist
    'climatolog', # as above
    'dry environment',
    'damp environment',
    'moist environment',
    'wet environment',
    'hot environment',
    'cold environment',
    'cool environment'
]
df, climate_counts = count_and_tag(df, climate_synonyms, 'climate_generic')
dotplot(climate_counts, 'Climate synonyms by title / abstract metadata')

In [None]:
climate_counts.sort_values(ascending=False)

In [None]:
n = (df.tag_disease_covid19 & df.tag_climate_generic).sum()
print(f'There are {n} papers on Covid-19 and climate:\n')
print_key_phrases(df[df.tag_disease_covid19 & df.tag_climate_generic],
                  climate_synonyms, n=n)

# Transmission

## Transmission / incubation generic

In [None]:
transmission_synonyms = [
    'transmiss', # Picks up 'transmission' and 'transmissibility'
    'transmitted',
    'incubation',
    'environmental stability',
    'airborne',
    'via contact',
    'human to human',
    'through droplets',
    'through secretions',
    r'\broute',
    'exportation'
]
df, transmission_counts = count_and_tag(df, transmission_synonyms, 'transmission_generic')
dotplot(transmission_counts, 'Transmission / incubation synonyms by title / abstract metadata')

In [None]:
transmission_counts.sort_values(ascending=False)

In [None]:
n = (df.tag_disease_covid19 & df.tag_transmission_generic).sum()
print(f'There are {n} papers on Covid-19 and transmission / incubation / environmental stability')
print('\nThis entire dataset is exported to thematic_tagging_output_transmission.csv')

## Reproduction rates ($R$ / $R_0$)

- Basic reproduction rate ($R_0$)
- Effective reproduction rate ($R$)

In [None]:
repr_synonyms = [
    r'reproduction \(r\)',
    'reproduction rate',
    'reproductive rate',
    '{r}_0',
    r'\br0\b',
    r'\br_0',
    '{r_0}',
    r'\b{r}',
    r'\br naught',
    r'\br zero'
]
df, repr_counts = count_and_tag(df,repr_synonyms, 'transmission_repr')
dotplot(repr_counts, 'R<sub>0</sub> synonyms by title / abstract metadata')
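
It can be useful to check a handful of patterns against sample sentences before trusting the counts. The sketch below uses a subset of the patterns above with invented sentences; `first_match` is a hypothetical helper.

```python
import re

r0_patterns = [r'\br0\b', r'\br_0', 'reproduction rate', 'reproductive rate']

sentences = [
    'We estimate R0 at 2.2 in the early phase',
    'The reproductive rate fell after the lockdown',
    'The basic reproduction number was 2.2',
]

def first_match(text):
    """Return the first pattern that matches the normalised text, else None."""
    lowered = text.lower().replace('-', ' ')
    for p in r0_patterns:
        if re.search(p, lowered):
            return p
    return None

matched = [first_match(s) for s in sentences]
print(matched)
```

The last sentence is not matched by this subset, so a phrase such as "reproduction number" would need its own entry if those papers should be tagged.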

In [None]:
repr_counts.sort_values(ascending=False)

In [None]:
n = (df.tag_disease_covid19 & df.tag_transmission_repr).sum()
print(f'There are {n} papers on Covid-19 and R or R_0')
print('=')
print_key_phrases(df[df.tag_disease_covid19 & df.tag_transmission_repr], 
                  repr_synonyms, n=52, chars=500)

In [None]:
# DATA_FOLDER = '../input/CORD-19-research-challenge'

# import json
# import os

# json_list = []

# for row in df[df.tag_disease_covid19 &
#               df.tag_transmission_repr & 
#               df.has_full_text].itertuples():
#     filename = f'{row.sha}.json'
#     sources = ['biorxiv_medrxiv', 'comm_use_subset',
#                'custom_license', 'noncomm_use_subset']
#     for source in sources:
#         if filename in os.listdir(os.path.join(DATA_FOLDER, source, source)):
#             with open(os.path.join(DATA_FOLDER, source, source, filename), 'rb') as f:
#                 json_list.append(json.load(f))

In [None]:
# candidate_sections = [
#     'results',
#     'conclusion',
#     'conclusions',
#     'reproduction',
#     'r_0',
#     'r0',
#     'reproductive'
# ]

In [None]:
# for i, item in enumerate(json_list):
#     print(i)
#     body_text = item['body_text']
#     for sub_item in body_text:
#         found = False
#         for cs in candidate_sections:
#             if cs in sub_item['section'].lower():
#                 found = True
#         if found:
#             print(sub_item['section'])
#             print(sub_item['text'])
#             print()
#     print()

In [None]:
# for i, item in enumerate(json_list):
#     print(i)
#     body_text = item['body_text']
#     for sub_item in body_text:
#         if sub_item['section'] in ['Methods and Results', 'Results', 'Conclusions']:
#             print(sub_item['text'])
#     print()

# Output

## Covid-19 papers only

In [None]:
filename = 'thematic_tagging_output_covid19_only.csv'
print(f'Outputting {df.tag_disease_covid19.sum()} records to {filename}')
df[df.tag_disease_covid19].to_csv(filename, index=False)

## Covid-19 papers x questions

### Risk factors

In [None]:
file_filter = df.tag_disease_covid19 & df.tag_risk_generic
filename = 'thematic_tagging_output_risk_factors.csv'
print(f'Outputting {file_filter.sum()} records to {filename}')
df[file_filter].to_csv(filename, index=False)

### Diabetes

In [None]:
file_filter = df.tag_disease_covid19 & df.tag_risk_diabetes
filename = 'thematic_tagging_output_risk_diabetes.csv'
print(f'Outputting {file_filter.sum()} records to {filename}')
df[file_filter].to_csv(filename, index=False)

### Smoking

In [None]:
file_filter = df.tag_disease_covid19 & df.tag_risk_smoking
filename = 'thematic_tagging_output_risk_smoking.csv'
print(f'Outputting {file_filter.sum()} records to {filename}')
df[file_filter].to_csv(filename, index=False)

### Climate

In [None]:
file_filter = df.tag_disease_covid19 & df.tag_climate_generic
filename = 'thematic_tagging_output_climate.csv'
print(f'Outputting {file_filter.sum()} records to {filename}')
df[file_filter].to_csv(filename, index=False)

### Transmission / incubation

In [None]:
file_filter = df.tag_disease_covid19 & df.tag_transmission_generic
filename = 'thematic_tagging_output_transmission.csv'
print(f'Outputting {file_filter.sum()} records to {filename}')
df[file_filter].to_csv(filename, index=False)

### $R$ / $R_0$

In [None]:
file_filter = df.tag_disease_covid19 & df.tag_transmission_repr
filename = 'thematic_tagging_output_repr.csv'
print(f'Outputting {file_filter.sum()} records to {filename}')
df[file_filter].to_csv(filename, index=False)

## Full dataset

In [None]:
filename = 'thematic_tagging_output_full.csv'
print(f'Outputting {len(df)} records to {filename}')
df.to_csv(filename, index=False)

# Filtering tool

TO DO: Add tool for filtering on tags.
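
As a starting point, a filtering tool could be as simple as a function that combines boolean tag columns. This is a minimal sketch, not a finished tool: `filter_by_tags` is a hypothetical name, and the demo frame is invented, although its column names follow the `tag_` convention used in this Notebook.

```python
import pandas as pd

def filter_by_tags(frame, require=(), exclude=()):
    """Rows where every `require` tag is True and every `exclude` tag is False."""
    mask = pd.Series(True, index=frame.index)
    for tag in require:
        mask &= frame[tag]
    for tag in exclude:
        mask &= ~frame[tag]
    return frame[mask]

# Invented demo data
demo = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'tag_disease_covid19': [True, True, False],
    'tag_risk_smoking': [True, False, True],
    'tag_disease_sars': [False, False, True],
})

selected = filter_by_tags(demo,
                          require=['tag_disease_covid19', 'tag_risk_smoking'],
                          exclude=['tag_disease_sars'])
print(selected['title'].tolist())
```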