# Sharing of models

These results aim to answer the research questions below.

## Aim and research questions:

The overarching research aim is to determine to what extent authors of DES health studies share their models and, where models are shared, how this is done.

### Primary research questions:

1. What proportion of DES healthcare papers share their models and code?
2. What proportion of these papers use Free and Open Source Simulation (FOSS) software, and of these how many share their models?
3. What proportion of these papers tackle COVID-19 and share their models?
4. Do these metrics vary by the type of article: journal paper, full conference paper or book chapter?
5. How have these metrics changed over the years covered by the study?
6. What proportion of studies make use of a reporting guideline?

## Imports 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Constants

In [2]:
FILE_NAME = 'https://raw.githubusercontent.com/TomMonks/' \
    + 'des_sharing_lit_review/main/data/share_sim_data_extract.zip'
COLS_TO_KEEP = [2, 3, 4, 5, 6, 7, 10, 11, 44, 45, 46, 47, 
                48, 49, 50, 51, 52, 52, 53, 54, 55]

## Read and clean dataset

The dataset is a CSV file that has been extracted from a Zotero library (TODO: INSERT Zotero library link).  The following data were then extracted from each paper:

* study_included - has the study been included in the final analysis
* model_code_available - is the model made publicly available in some manner
* reporting_guidelines_mention - have reporting guidelines been mentioned or explicitly cited.
* covid - is DES being used to tackle covid-19 
* sim_software - name of simulation software or programming language if stated.
* foss_sim - free and open source simulation software? 0/1
* model_archive - name of archive if used
* model_repo - name of model repo if used
* model_journal_supp - what is stored in the journal supplementary material 
* model_personal_org - name of personal or organisational website if used
* model_platform - name of cloud platform used (e.g. Binder or Anylogic cloud)
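
The field list above can double as a quick sanity check on a loaded frame. The following is a minimal sketch only: `missing_fields` is a hypothetical helper that is not part of the original notebook, and the toy frame holds made-up values.

```python
import pandas as pd

# Field names copied from the list above.
EXPECTED_FIELDS = [
    'study_included', 'model_code_available', 'reporting_guidelines_mention',
    'covid', 'sim_software', 'foss_sim', 'model_archive', 'model_repo',
    'model_journal_supp', 'model_personal_org', 'model_platform',
]


def missing_fields(df, expected=EXPECTED_FIELDS):
    '''Return the expected field names absent from df (hypothetical helper).'''
    return [c for c in expected if c not in df.columns]


# Toy frame with made-up values; only two of the eleven fields are present.
toy = pd.DataFrame({'study_included': [1, 0], 'covid': [0, 1]})
missing = missing_fields(toy)
print(missing)
```

An empty list would indicate that every extracted field survived the trimming step.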

### Cleaning helper functions

In [3]:
def trim_columns(df):
    '''
    Remove fields that are not needed for the clean
    analysis dataset.
    
    Uses the COLS_TO_KEEP constant list.
    
    Params:
    -------
    df - pd.DataFrame
        The raw data
    
    '''
    return df[df.columns[COLS_TO_KEEP]]
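
As an illustration of how positional selection behaves, the sketch below applies the same indexing pattern to a toy frame. The column names and positions are made up. One point worth noting is that a repeated position produces a repeated column; `COLS_TO_KEEP` lists 52 twice, and whether that is intended is not stated in the source.

```python
import pandas as pd

# Toy frame; names and values are illustrative only.
toy = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'd': [4]})

# Hypothetical positions, with position 2 deliberately repeated.
keep = [0, 2, 2]

# Same pattern as trim_columns: positions -> labels -> selection.
by_label = toy[toy.columns[keep]]

# Equivalent purely positional selection.
by_position = toy.iloc[:, keep]

print(list(by_label.columns))
```

Both forms keep the repeated column, so the trimmed frame has as many columns as there are entries in the position list.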

In [4]:
def cols_to_lower(df):
    new_cols = [c.lower() for c in df.columns]
    df.columns = new_cols
    return df

### Main load and clean function

In [5]:
def load_clean_dataset(file_name):
    '''
    Loads a cleaned version of the dataset
    
    1.  Trims the columns to only those relevant to the analysis
    2.  Renames selected columns (e.g. 'Item Type' to 'item_type')
    3.  Converts all column names to lower case
    4.  Recodes selected labels (e.g. 'bookSection' to 'book')
    5.  Converts relevant cols to Categorical data type
    6.  Performs remaining type conversions.
    '''
    labels = {'Item Type': 'item_type',
               'Publication Year': 'pub_yr',
               'Publication Title': 'pub_title'}

    type_conversions = {'pub_yr': 'int'}
    
    recoded_types = {'item_type': {'bookSection':'book'},
                     'reporting_guidelines_mention': {'ISPOR-SMDM': 'ISPOR',
                                                      '0': 'None'}}

    clean = (pd.read_csv(file_name)
             .pipe(trim_columns)
             .rename(columns=labels) 
             .pipe(cols_to_lower)
             .replace(recoded_types)
             .assign(study_included=lambda x: 
                         pd.Categorical(x['study_included']),
                     model_code_available=lambda x: 
                         pd.Categorical(x['model_code_available']),
                     reporting_guidelines_mention=lambda x: 
                         pd.Categorical(x['reporting_guidelines_mention']),
                     covid=lambda x: pd.Categorical(x['covid']),
                     foss_sim=lambda x: pd.Categorical(x['foss_sim']),
                     item_type=lambda x: pd.Categorical(x['item_type']))
            .astype(type_conversions)
            
    )

    return clean
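
The recoding and categorical conversion steps can be tried on a toy frame. This is a sketch under the assumption that the raw values are strings, as the `'0'` key in `recoded_types` suggests; the rows are made up.

```python
import pandas as pd

# Toy rows with made-up values; both columns hold strings.
toy = pd.DataFrame({
    'item_type': ['bookSection', 'journalArticle', 'conferencePaper'],
    'reporting_guidelines_mention': ['ISPOR-SMDM', '0', 'STRESS'],
})

# Nested dict: outer key is the column, inner dict maps old -> new value.
recoded = {'item_type': {'bookSection': 'book'},
           'reporting_guidelines_mention': {'ISPOR-SMDM': 'ISPOR',
                                            '0': 'None'}}

out = (toy.replace(recoded)
          .assign(item_type=lambda x: pd.Categorical(x['item_type'])))

print(out)
```

Because the inner mapping is keyed on the string `'0'`, an integer `0` in the raw data would not be recoded to `'None'`.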

In [6]:
clean = load_clean_dataset(FILE_NAME)
clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 484 entries, 0 to 483
Data columns (total 21 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   key                           484 non-null    object  
 1   item_type                     484 non-null    category
 2   pub_yr                        484 non-null    int64   
 3   author                        483 non-null    object  
 4   title                         484 non-null    object  
 5   pub_title                     460 non-null    object  
 6   doi                           429 non-null    object  
 7   url                           402 non-null    object  
 8   study_included                484 non-null    category
 9   model_code_available          430 non-null    category
 10  reporting_guidelines_mention  429 non-null    category
 11  covid                         432 non-null    category
 12  sim_software                  431 non-null    obje

In [7]:
def high_level_review_summary(df, name='None'):
    '''A simple high level summary of the review.
    
    Returns a dict containing simple high level counts
    and percentages in the data
    
    Params:
    -------
    df: pd.DataFrame 
        A cleaned dataset.  Could be overall or subgroups/categories
        
    Returns:
    --------
        dict 
    '''
    results = {}
    included = df[df['study_included'] == 1]
    available = included[included['model_code_available'] == 1]
    results['n_included'] = len(included[included['study_included'] == 1])
    results['n_foss'] = len(included[included['foss_sim'] == '1'])
    results['n_covid'] = len(included[included['covid'] == 1])
    results['n_avail'] = len(included[included['model_code_available'] == 1])
    results['n_foss_avail'] = len(available[available['foss_sim'] == '1'])
    results['n_covid_avail'] = len(available[available['covid'] == 1])
    results['per_foss'] = results['n_foss'] / results['n_included']
    results['per_covid'] = results['n_covid'] / results['n_included']
    results['per_avail'] = results['n_avail'] / results['n_included']
    results['per_foss_avail'] = results['n_foss_avail'] / results['n_foss']
    results['per_covid_avail'] = results['n_covid_avail'] / results['n_covid']
    results['reporting_guide'] = len(included[included['reporting_guidelines_mention'] != 'None'])
    results['per_reporting_guide'] = results['reporting_guide'] / results['n_included']
    return pd.Series(results, name=name)
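
To make the counting logic concrete, here is a small worked example on made-up rows. It mirrors the filters used above, including the comparison of `foss_sim` against the string `'1'`; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Five made-up studies; the last one is excluded from the review.
toy = pd.DataFrame({
    'study_included': [1, 1, 1, 1, 0],
    'model_code_available': [1, 0, 0, 1, 1],
    'foss_sim': ['1', '0', '1', '0', '1'],  # stored as strings
})

included = toy[toy['study_included'] == 1]
available = included[included['model_code_available'] == 1]

n_included = len(included)
n_avail = len(available)
n_foss = int((included['foss_sim'] == '1').sum())
n_foss_avail = int((available['foss_sim'] == '1').sum())

per_avail = n_avail / n_included          # share of included models shared
per_foss_avail = n_foss_avail / n_foss    # share of FOSS models shared

print(n_included, n_avail, n_foss, n_foss_avail, per_avail, per_foss_avail)
```

Note that the denominators differ: `per_avail` is relative to all included studies, while `per_foss_avail` is relative to included FOSS studies only.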

In [8]:
def analysis_by_item_type(df_clean, decimals=4):
    '''
    Conducts a high level analysis by item type: journal, conference, book
    + overall.
    
    Params:
    -------
    df_clean: pd.DataFrame
        Assumes a cleaned version of the dataset.
    
    Returns: 
    -------
    pd.DataFrame
        Containing the result summary
        
    '''
    overall_results = high_level_review_summary(df_clean, 'overall')
    article_type_results = []
    article_types = df_clean['item_type'].unique().tolist()
    for article_type in article_types:
        subset = df_clean[df_clean['item_type'] == article_type]
        article_type_results.append(high_level_review_summary(subset, name=article_type))
    article_type_results = [overall_results] + article_type_results
    return pd.DataFrame(article_type_results).T.round(decimals)
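
The final line relies on how pandas builds a frame from a list of named `Series`. A minimal sketch with made-up numbers shows the shape before and after the transpose:

```python
import pandas as pd

# Two made-up summaries; the Series name becomes the group label.
overall = pd.Series({'n_included': 10, 'per_avail': 0.2}, name='overall')
book = pd.Series({'n_included': 4, 'per_avail': 0.25}, name='book')

# Each Series becomes a row; transposing puts metrics on the rows
# and groups on the columns.
table = pd.DataFrame([overall, book]).T

print(table)
```

Since counts and proportions share a column after the transpose, the counts are stored as floats, which is why the results tables show values such as `423.0`.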


In [9]:
def analysis_by_year(df_clean, decimals=4):
    '''
    Conducts a high level analysis by year of publication
    2019-2022
    
    Params:
    -------
    df_clean: pd.DataFrame
        Assumes a cleaned version of the dataset.
    
    Returns: 
    -------
    pd.DataFrame
        Containing the result summary
        
    '''
    overall_results = high_level_review_summary(df_clean, 'overall')
    year_results = []
    years = df_clean['pub_yr'].unique().tolist()
    for year in years:
        subset = df_clean[df_clean['pub_yr'] == year]
        year_results.append(high_level_review_summary(subset, name=str(year)))
    year_results = [overall_results] + year_results
    year_results = pd.DataFrame(year_results).T.round(decimals)
    return year_results[sorted(year_results.columns.tolist())]
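
The column ordering in the last line depends on plain string sorting. A short sketch, using labels like those produced by `str(year)`:

```python
# Column labels are strings, as produced by name=str(year) above.
columns = ['overall', '2019', '2021', '2020', '2022']

# Digits sort before lowercase letters, so the years come first
# in ascending order and 'overall' ends up last.
ordered = sorted(columns)

print(ordered)
```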

In [10]:
# analysis of reporting guidelines

In [11]:
reporting_guidelines = clean['reporting_guidelines_mention'].unique().tolist()
reporting_guidelines

['None',
 'STRESS',
 nan,
 'ISPOR',
 'CHEERS',
 'Zhang et al.',
 'ODD',
 'SQUIRE',
 'Sanders et al.']

In [12]:
def reporting_guideline_summary(df_clean):
    '''
    Counts of included studies by reporting guideline mentioned,
    with each count as a percentage of all included studies.
    '''
    included = df_clean[df_clean['study_included'] == 1]
    report_guidelines = included[included['reporting_guidelines_mention'] != 'None']
    counts = report_guidelines.groupby(['reporting_guidelines_mention'])['key'].count().sort_values(ascending=False)
    percentages = counts / len(included)

    summary = pd.concat([counts, (percentages * 100).round(1)], axis=1)
    summary.columns = ['n', '% of included']
    summary = summary.drop('None', axis=0)
    return summary.sort_values(by=['n'], ascending=False)

## Formatting tables

In [13]:
def format_table2(summary):
    '''
    Create a formatted table 2 of results for manuscript.
    '''
    total_rows = ['n_included', 'n_covid', 'n_foss']
    avail_rows = ['n_avail', 'n_covid_avail', 'n_foss_avail']
    per_rows = ['per_avail', 'per_covid_avail', 'per_foss_avail']
    new_cols_titles = ['metric', 'overall', 'shared', 'per']
       
    # only work with the overall column
    selected_cols = ['overall'] # , 'journalArticle', 'conferencePaper', 'book']
    overall = summary[selected_cols]
    
    # total number of papers
    totals = overall.loc[total_rows]
    totals = totals.reset_index()
    totals['overall'] = totals['overall'].map('{:,.0f}'.format)
    
    # no. models that are available from the total
    shared = overall.loc[avail_rows]
    shared = shared.reset_index()
    
    # percentage of papers 
    per = overall.loc[per_rows]
    per = per.reset_index()
    # scale only the numeric column (the index labels are strings)
    per['overall'] = per['overall'] * 100
        
    # construct table and format columns in n (%) format
    t1 = pd.concat([totals, shared['overall'], per['overall']], \
                   axis=1, ignore_index=True)

    t1.columns = new_cols_titles
    
    # escaped backslash keeps the LaTeX-ready '\%' in the label
    t1['shared n (\\%)'] = t1['shared'].map('{:,.0f}'.format) \
        + ' (' + t1['per'].map('{:,.1f}'.format) + ')'
    
    #t1['overall'] = t1['overall'].map('{:,.0f}')
    
    to_drop = ['shared', 'per']
    t1 = t1.drop(to_drop, axis=1)
    t1.iat[0, 0] = 'Total'
    t1.iat[1, 0] = 'COVID-19'
    t1.iat[2, 0] = 'FOSS'
    t1 = t1.set_index('metric')
    return t1
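
The "n (%)" column is built by formatting two numeric columns and joining them as strings. The sketch below isolates that step on made-up numbers. It spells the LaTeX-ready label as a raw string, which is one way (an assumption about preference, not the notebook's own choice) to keep the backslash without relying on an unrecognised escape sequence.

```python
import pandas as pd

# Hypothetical counts and percentages, not taken from the dataset.
toy = pd.DataFrame({'shared': [12.0, 3.0], 'per': [25.0, 7.5]},
                   index=['Total', 'FOSS'])

# Raw string: the backslash is kept literally for later LaTeX export.
label = r'shared n (\%)'

# Format counts with no decimals and percentages with one decimal.
toy[label] = (toy['shared'].map('{:,.0f}'.format)
              + ' (' + toy['per'].map('{:,.1f}'.format) + ')')

print(toy[label].tolist())
```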
    

## Results

In [14]:
summary_table = analysis_by_item_type(clean)
summary_table

| | overall | journalArticle | book | conferencePaper |
|---|---|---|---|---|
| n_included | 423.0 | 334.0 | 22.0 | 67.0 |
| n_foss | 80.0 | 64.0 | 6.0 | 10.0 |
| n_covid | 52.0 | 42.0 | 1.0 | 9.0 |
| n_avail | 39.0 | 35.0 | 1.0 | 3.0 |
| n_foss_avail | 24.0 | 21.0 | 1.0 | 2.0 |
| n_covid_avail | 14.0 | 11.0 | 0.0 | 3.0 |
| per_foss | 0.1891 | 0.1916 | 0.2727 | 0.1493 |
| per_covid | 0.1229 | 0.1257 | 0.0455 | 0.1343 |
| per_avail | 0.0922 | 0.1048 | 0.0455 | 0.0448 |
| per_foss_avail | 0.3 | 0.3281 | 0.1667 | 0.2 |


### Table 2

In the manuscript, Table 2 provides a simple high-level summary of the results.

In [15]:
table2 = format_table2(summary_table)
table2

| metric | overall | shared n (\%) |
|---|---|---|
| Total | 423 | 39 (9.2) |
| COVID-19 | 52 | 14 (26.9) |
| FOSS | 80 | 24 (30.0) |
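
The percentages in Table 2 can be reproduced by hand from the counts shown. The following sketch uses only the numbers printed above; the variable names are illustrative.

```python
# Counts copied from Table 2: (total papers, papers sharing a model).
table2_counts = {
    'Total': (423, 39),
    'COVID-19': (52, 14),
    'FOSS': (80, 24),
}

# Percentage shared within each row, rounded to one decimal place.
percent_shared = {
    row: round(100 * shared / total, 1)
    for row, (total, shared) in table2_counts.items()
}

print(percent_shared)
```

Each percentage uses its own row total as the denominator, so the COVID-19 and FOSS figures are shares within those subgroups rather than shares of all 423 included studies.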


####  Table 2 LaTeX

In [16]:
print(table2.style.to_latex())

\begin{tabular}{lll}
 & overall & shared n (\%) \\
metric &  &  \\
Total & 423 & 39 (9.2) \\
COVID-19 & 52 & 14 (26.9) \\
FOSS & 80 & 24 (30.0) \\
\end{tabular}



In [17]:
table3 = analysis_by_year(clean)
table3

| | 2019 | 2020 | 2021 | 2022 | overall |
|---|---|---|---|---|---|
| n_included | 104.0 | 113.0 | 124.0 | 82.0 | 423.0 |
| n_foss | 15.0 | 17.0 | 29.0 | 19.0 | 80.0 |
| n_covid | 1.0 | 9.0 | 30.0 | 12.0 | 52.0 |
| n_avail | 5.0 | 11.0 | 14.0 | 9.0 | 39.0 |
| n_foss_avail | 5.0 | 5.0 | 9.0 | 5.0 | 24.0 |
| n_covid_avail | 0.0 | 4.0 | 10.0 | 0.0 | 14.0 |
| per_foss | 0.1442 | 0.1504 | 0.2339 | 0.2317 | 0.1891 |
| per_covid | 0.0096 | 0.0796 | 0.2419 | 0.1463 | 0.1229 |
| per_avail | 0.0481 | 0.0973 | 0.1129 | 0.1098 | 0.0922 |
| per_foss_avail | 0.3333 | 0.2941 | 0.3103 | 0.2632 | 0.3 |
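
A quick consistency check on the table by year is that the yearly counts sum to the overall column, and that the yearly proportions follow from the counts. The sketch below uses only values printed in the table; the variable names are illustrative.

```python
# Yearly counts (2019, 2020, 2021, 2022) and the overall value,
# copied from the table above.
yearly = {
    'n_included': ([104, 113, 124, 82], 423),
    'n_foss': ([15, 17, 29, 19], 80),
    'n_covid': ([1, 9, 30, 12], 52),
    'n_avail': ([5, 11, 14, 9], 39),
    'n_foss_avail': ([5, 5, 9, 5], 24),
    'n_covid_avail': ([0, 4, 10, 0], 14),
}

# True where the four yearly counts add up to the overall count.
sums_match = {name: sum(by_year) == overall
              for name, (by_year, overall) in yearly.items()}

# Recompute per_avail for each year from n_avail / n_included.
per_avail = [round(a / n, 4)
             for a, n in zip(yearly['n_avail'][0], yearly['n_included'][0])]

print(sums_match)
print(per_avail)
```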


In [18]:
reporting_guideline_summary(clean)

| reporting_guidelines_mention | n | % of included |
|---|---|---|
| ISPOR | 35 | 8.3 |
| STRESS | 13 | 3.1 |
| CHEERS | 8 | 1.9 |
| ODD | 1 | 0.2 |
| SQUIRE | 1 | 0.2 |
| Sanders et al. | 1 | 0.2 |
| Zhang et al. | 1 | 0.2 |
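
The percentages above use the 423 included studies as the denominator. The sketch below recomputes them from the printed counts and adds the combined share across all listed guidelines; it assumes each study is recorded against a single guideline, which the one-value-per-row coding suggests but the source does not state.

```python
# Counts copied from the reporting guideline table above.
guideline_counts = {
    'ISPOR': 35, 'STRESS': 13, 'CHEERS': 8, 'ODD': 1,
    'SQUIRE': 1, 'Sanders et al.': 1, 'Zhang et al.': 1,
}
N_INCLUDED = 423  # overall n_included from the results tables

# Percentage of included studies per guideline, one decimal place.
percent = {name: round(100 * n / N_INCLUDED, 1)
           for name, n in guideline_counts.items()}

# Combined count and share across all listed guidelines.
total_with_guideline = sum(guideline_counts.values())
percent_any = round(100 * total_with_guideline / N_INCLUDED, 1)

print(percent)
print(total_with_guideline, percent_any)
```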
