# DES model sharing in healthcare: an analysis of the published literature 2019-2022

These results accompany Harper and Monks (2023).

> An early version of these results (2022 part) was presented at the WSC2022 conference for the panel on **Grand Challenges in Healthcare Simulation**.  Thanks go to Prof Christine Currie for her support.

## Aim and research questions:

The overarching research aim is to determine to what extent authors of DES health studies share their models and, where models are shared, how this is done.

### Primary research questions:

1. What proportion of DES healthcare papers share their models and code?
2. What proportion of these papers use Free and Open Source Simulation software, and of these how many share their models?
3. What proportion of these papers tackle COVID-19, and of these what proportion share their models?
4. Do these metrics vary by the type of article: journal paper, full conference paper or book chapter?
5. How have these metrics changed over the four years of the study (2019-2022)?
6. What proportion of studies make use of a reporting guideline?

## Review and data extraction process summary

The review searched both Scopus and PubMed for relevant papers.  Scopus was searched for 'discrete-event simulation' and 'health' (or 'healthcare'), while PubMed was simply searched for 'discrete-event simulation'.  We include journal articles, full conference papers and book chapters in our review (results are broken down by type).

Following deduplication of the Scopus and PubMed articles, we screened the title and abstract of each paper and excluded papers that were not DES.  We included both standard DES models and hybrid models that include a DES component, such as DES combined with agent-based simulation (ABS) or system dynamics (SD).
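The text does not say how the deduplication was carried out. As a minimal sketch only, one common approach is to match records first on DOI and then on a normalised title. The two small data frames, their column names, and the `normalise_title` and `deduplicate` helpers below are hypothetical and are not part of the review's own code.

```python
import pandas as pd

# Hypothetical exports standing in for the Scopus and PubMed search results.
scopus = pd.DataFrame({
    "title": ["A DES model of an emergency department",
              "Simulating outpatient clinics"],
    "doi": ["10.1000/ABC123", None],
})
pubmed = pd.DataFrame({
    "title": ["A DES Model of an Emergency Department.",
              "Stroke pathway simulation"],
    "doi": ["10.1000/abc123", "10.1000/xyz789"],
})


def normalise_title(title: pd.Series) -> pd.Series:
    """Lower-case titles and strip punctuation and extra whitespace."""
    return (title.str.lower()
                 .str.replace(r"[^a-z0-9 ]", "", regex=True)
                 .str.replace(r"\s+", " ", regex=True)
                 .str.strip())


def deduplicate(records: pd.DataFrame) -> pd.DataFrame:
    """Drop records sharing a DOI, then records sharing a normalised title."""
    records = records.assign(
        doi_key=records["doi"].str.lower(),
        title_key=normalise_title(records["title"]),
    )
    # Only compare DOIs where one is present; missing DOIs are never matched.
    has_doi = records["doi_key"].notna()
    dup_doi = has_doi & records.duplicated(subset="doi_key", keep="first")
    records = records[~dup_doi]
    records = records.drop_duplicates(subset="title_key", keep="first")
    return records.drop(columns=["doi_key", "title_key"]).reset_index(drop=True)


combined = pd.concat([scopus, pubmed], ignore_index=True)
unique_records = deduplicate(combined)
print(len(combined), "->", len(unique_records))
```

Automatic matching of this kind would normally be followed by a manual check, since titles can differ slightly between databases.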

Where possible we viewed papers at the publisher's website so we could identify, download, and access any supplementary material or information that may not be directly included in the article PDF. If an article built on and cited previously published work/models, we followed up the cited paper in an attempt to complete data extraction.

## Imports 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Constants

In [2]:
FILE_NAME = 'share_sim_data_extract.csv'
COLS_TO_KEEP = [2, 3, 4, 5, 6, 7, 10, 11, 44, 45, 46, 47, 
                48, 49, 50, 51, 52, 52, 53, 54, 55]

## Read and clean dataset

The dataset is a CSV file that has been extracted from a Zotero library (TODO: INSERT Zotero library link).  The following data were then extracted from each paper:

* study_included - has the study been included in the final analysis
* model_code_available - is the model made publicly available in some manner
* reporting_guidelines_mention - have reporting guidelines been mentioned, explicitly cited or used.
* covid - is DES being used to tackle covid-19 
* sim_software - name of simulation software or programming language if stated.
* foss_sim - free and open source simulation software? 0/1
* model_archive - name of archive if used
* model_repo - name of model repo if used
* model_journal_supp - what is stored in the journal supplementary material 
* model_personal_org - name of personal or organisational website if used
* model_platform - name of cloud platform used (e.g. Binder or Anylogic cloud)
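Before analysis it can help to confirm that the flag fields listed above contain only the expected codes. The sketch below assumes the four flag fields are stored as the strings '0' and '1', as the comparisons in the analysis code suggest. The `find_invalid_codes` helper and the toy records are hypothetical.

```python
import pandas as pd

# Fields assumed to be binary flags stored as the strings '0'/'1'.
BINARY_FIELDS = ["study_included", "model_code_available", "covid", "foss_sim"]


def find_invalid_codes(df: pd.DataFrame, fields=BINARY_FIELDS) -> dict:
    """Return {field: [unexpected values]} for fields not coded '0'/'1'.

    Missing values are ignored, on the assumption that some records
    legitimately have no extracted data.
    """
    problems = {}
    for field in fields:
        values = df[field].dropna().astype(str)
        unexpected = sorted(set(values) - {"0", "1"})
        if unexpected:
            problems[field] = unexpected
    return problems


# Toy records (hypothetical, not taken from the review dataset).
toy = pd.DataFrame({
    "study_included": ["1", "1", "0"],
    "model_code_available": ["1", "0", None],
    "covid": ["0", "yes", None],
    "foss_sim": ["1", "0", None],
})
print(find_invalid_codes(toy))
```

An empty dictionary means every flag field contains only the expected codes.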

### Cleaning helper functions

In [3]:
def trim_columns(df):
    '''
    Remove fields that are not needed for the clean
    analysis dataset.
    
    Uses the COLS_TO_KEEP constant list.
    
    Params:
    -------
    df - pd.DataFrame
        The raw data
    
    '''
    return df[df.columns[COLS_TO_KEEP]]

In [4]:
def cols_to_lower(df):
    '''Convert all column names to lower case (modifies df in place).'''
    new_cols = [c.lower() for c in df.columns]
    df.columns = new_cols
    return df

### Main load and clean function

In [5]:
def load_clean_dataset(file_name):
    '''
    Loads a cleaned version of the dataset
    
    1.  Drops row 1 which contains example data
    2.  Trims the columns to only those relevant to the analysis
    3.  Renames the columns that contain spaces (e.g. 'Item Type' -> 'item_type')
    4.  Converts all column names to lower case
    5.  Recodes selected values (e.g. 'bookSection' -> 'book')
    6.  Converts relevant cols to Categorical data type
    7.  Performs remaining type conversions.
    '''
    labels = {'Item Type': 'item_type',
               'Publication Year': 'pub_yr',
               'Publication Title': 'pub_title'}

    type_conversions = {'pub_yr': 'int'}
    
    recoded_types = {'item_type': {'bookSection':'book'},
                     'reporting_guidelines_mention': {'ISPOR-SMDM': 'ISPOR',
                                                      '0': 'None'}}

    clean = (pd.read_csv(file_name)
             .drop(0, axis=0)
             .pipe(trim_columns)
             .rename(columns=labels) 
             .pipe(cols_to_lower)
             .replace(recoded_types)
             .assign(study_included=lambda x: pd.Categorical(x['study_included']),
                     model_code_available=lambda x: pd.Categorical(x['model_code_available']),
                     reporting_guidelines_mention=lambda x: pd.Categorical(x['reporting_guidelines_mention']),
                     covid=lambda x: pd.Categorical(x['covid']),
                     foss_sim=lambda x: pd.Categorical(x['foss_sim']),
                     item_type=lambda x: pd.Categorical(x['item_type']))
             .astype(type_conversions)
            )

    return clean

In [6]:
clean = load_clean_dataset(FILE_NAME)
clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 484 entries, 1 to 484
Data columns (total 21 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   key                           484 non-null    object  
 1   item_type                     484 non-null    category
 2   pub_yr                        484 non-null    int64   
 3   author                        483 non-null    object  
 4   title                         484 non-null    object  
 5   pub_title                     460 non-null    object  
 6   doi                           429 non-null    object  
 7   url                           402 non-null    object  
 8   study_included                484 non-null    category
 9   model_code_available          429 non-null    category
 10  reporting_guidelines_mention  429 non-null    category
 11  covid                         431 non-null    category
 12  sim_software                  430 non-null    object  

In [7]:
def high_level_review_summary(df, name='None'):
    '''A simple high level summary of the review.
    
    Returns a dict containing simple high level counts
    and percentages in the data
    
    Params:
    -------
    df: pd.DataFrame 
        A cleaned dataset.  Could be overall or subgroups/categories
        
    Returns:
    --------
        dict 
    '''
    results = {}
    included = df[df['study_included'] == '1']
    available = included[included['model_code_available'] == '1']
    # 'included' and 'available' are already filtered, so count them directly
    results['n_included'] = len(included)
    results['n_foss'] = len(included[included['foss_sim'] == '1'])
    results['n_covid'] = len(included[included['covid'] == '1'])
    results['n_avail'] = len(available)
    results['n_foss_avail'] = len(available[available['foss_sim'] == '1'])
    results['n_covid_avail'] = len(available[available['covid'] == '1'])
    results['per_foss'] = results['n_foss'] / results['n_included']
    results['per_covid'] = results['n_covid'] / results['n_included']
    results['per_avail'] = results['n_avail'] / results['n_included']
    results['per_foss_avail'] = results['n_foss_avail'] / results['n_foss']
    results['per_covid_avail'] = results['n_covid_avail'] / results['n_covid']
    results['reporting_guide'] = len(included[included['reporting_guidelines_mention'] != 'None'])
    results['per_reporting_guide'] = results['reporting_guide'] / results['n_included']
    return pd.Series(results, name=name)
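The summary above uses two different denominators: `per_avail` is a share of all included studies, while `per_foss_avail` is a share of FOSS studies only. The following sketch illustrates that counting logic with boolean masks on a toy data frame. The records are hypothetical and the code is an illustration, not a replacement for the function above.

```python
import pandas as pd

# Toy records (hypothetical); flags are '0'/'1' strings as in the cleaned data.
toy = pd.DataFrame({
    "study_included":       ["1", "1", "1", "1", "0"],
    "model_code_available": ["1", "0", "1", "0", "0"],
    "foss_sim":             ["1", "1", "0", "0", "1"],
})

# Excluded studies are removed before anything is counted.
included = toy[toy["study_included"] == "1"]
is_foss = included["foss_sim"] == "1"
is_avail = included["model_code_available"] == "1"

n_included = len(included)
n_foss = int(is_foss.sum())
n_avail = int(is_avail.sum())
n_foss_avail = int((is_foss & is_avail).sum())

per_avail = n_avail / n_included          # denominator: all included studies
per_foss_avail = n_foss_avail / n_foss    # denominator: FOSS studies only
print(per_avail, per_foss_avail)
```

As in the function above, a subgroup with no FOSS studies would make the second division fail, so a guard would be needed for very small subsets.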

In [8]:
def analysis_by_item_type(df_clean, decimals=2):
    '''
    Conducts a high level analysis by item type: journal, conference, book
    + overall.
    
    Params:
    -------
    df_clean: pd.DataFrame
        Assumes a cleaned version of the dataset.
    
    Returns: 
    -------
    pd.DataFrame
        Containing the result summary
        
    '''
    overall_results = high_level_review_summary(df_clean, 'overall')
    article_type_results = []
    article_types = df_clean['item_type'].unique().tolist()
    for article_type in article_types:
        subset = df_clean[df_clean['item_type'] == article_type]
        article_type_results.append(high_level_review_summary(subset, name=article_type))
    article_type_results = [overall_results] + article_type_results
    return pd.DataFrame(article_type_results).T.round(decimals)


In [9]:
def analysis_by_year(df_clean, decimals=2):
    '''
    Conducts a high level analysis by year of publication
    2019-2022
    
    Params:
    -------
    df_clean: pd.DataFrame
        Assumes a cleaned version of the dataset.
    
    Returns: 
    -------
    pd.DataFrame
        Containing the result summary
        
    '''
    overall_results = high_level_review_summary(df_clean, 'overall')
    year_results = []
    years = df_clean['pub_yr'].unique().tolist()
    for year in years:
        subset = df_clean[df_clean['pub_yr'] == year]
        year_results.append(high_level_review_summary(subset, name=str(year)))
    year_results = [overall_results] + year_results
    year_results = pd.DataFrame(year_results).T.round(decimals)
    return year_results[sorted(year_results.columns.tolist())]
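The loop over years can also be expressed with a pandas `groupby`. The sketch below shows that alternative for the availability metric only, using hypothetical toy records. It does not produce the 'overall' column that the function above adds.

```python
import pandas as pd

# Toy records (hypothetical); flags are '0'/'1' strings as in the cleaned data.
toy = pd.DataFrame({
    "pub_yr":               [2019, 2019, 2020, 2020, 2020],
    "study_included":       ["1", "1", "1", "1", "0"],
    "model_code_available": ["0", "1", "1", "1", "0"],
})

included = toy[toy["study_included"] == "1"]
by_year = (included
           .assign(avail=included["model_code_available"] == "1")
           .groupby("pub_yr")["avail"]
           # size = studies per year, sum = shared models, mean = proportion
           .agg(n_included="size", n_avail="sum", per_avail="mean")
           .round(2))
print(by_year)
```

The explicit loop in the function above is easier to extend with extra metrics, while the grouped version is more compact when only a few metrics are needed.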

In [10]:
# analysis of reporting guidelines

In [11]:
reporting_guidelines = clean['reporting_guidelines_mention'].unique().tolist()
reporting_guidelines

['None',
 'STRESS',
 nan,
 'ISPOR',
 'CHEERS',
 'Zhang et al.',
 'ODD',
 'SQUIRE',
 'Sanders et al.']

In [12]:
def reporting_guideline_summary(df_clean):
    '''
    Number and percentage of included studies that mention
    each reporting guideline.
    '''
    included = df_clean[df_clean['study_included'] == '1']
    report_guidelines = included[included['reporting_guidelines_mention'] != 'None']
    counts = report_guidelines.groupby(['reporting_guidelines_mention'])['key'].count().sort_values(ascending=False)
    percentages = counts / len(included)

    summary = pd.concat([counts, (percentages * 100).round(1)], axis=1)
    summary.columns = ['n', '% of included']
    # 'None' can remain as an empty category level; drop it if present
    summary = summary.drop('None', axis=0, errors='ignore')
    return summary.sort_values(by=['n'], ascending=False)

## Results

In [13]:
table1 = analysis_by_item_type(clean)
table2 = analysis_by_year(clean)
table1

| | overall | journalArticle | book | conferencePaper |
|---|---|---|---|---|
| n_included | 422.0 | 333.0 | 22.0 | 67.0 |
| n_foss | 79.0 | 64.0 | 5.0 | 10.0 |
| n_covid | 52.0 | 42.0 | 1.0 | 9.0 |
| n_avail | 41.0 | 37.0 | 1.0 | 3.0 |
| n_foss_avail | 24.0 | 22.0 | 0.0 | 2.0 |
| n_covid_avail | 15.0 | 12.0 | 0.0 | 3.0 |
| per_foss | 0.19 | 0.19 | 0.23 | 0.15 |
| per_covid | 0.12 | 0.13 | 0.05 | 0.13 |
| per_avail | 0.1 | 0.11 | 0.05 | 0.04 |
| per_foss_avail | 0.3 | 0.34 | 0.0 | 0.2 |


In [14]:
table2

| | 2019 | 2020 | 2021 | 2022 | overall |
|---|---|---|---|---|---|
| n_included | 104.0 | 112.0 | 124.0 | 82.0 | 422.0 |
| n_foss | 14.0 | 17.0 | 29.0 | 19.0 | 79.0 |
| n_covid | 1.0 | 9.0 | 30.0 | 12.0 | 52.0 |
| n_avail | 5.0 | 11.0 | 15.0 | 10.0 | 41.0 |
| n_foss_avail | 4.0 | 5.0 | 9.0 | 6.0 | 24.0 |
| n_covid_avail | 0.0 | 4.0 | 11.0 | 0.0 | 15.0 |
| per_foss | 0.13 | 0.15 | 0.23 | 0.23 | 0.19 |
| per_covid | 0.01 | 0.08 | 0.24 | 0.15 | 0.12 |
| per_avail | 0.05 | 0.1 | 0.12 | 0.12 | 0.1 |
| per_foss_avail | 0.29 | 0.29 | 0.31 | 0.32 | 0.3 |
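As a cross-check on the arithmetic, the proportions in the by-year table above can be recomputed from the counts alone. The short sketch below copies the counts from that table and is only an illustration of how the `per_*` rows relate to the `n_*` rows.

```python
# Counts copied from the by-year results table above.
counts = {
    "2019": {"n_included": 104, "n_foss": 14, "n_avail": 5, "n_foss_avail": 4},
    "2020": {"n_included": 112, "n_foss": 17, "n_avail": 11, "n_foss_avail": 5},
    "2021": {"n_included": 124, "n_foss": 29, "n_avail": 15, "n_foss_avail": 9},
    "2022": {"n_included": 82, "n_foss": 19, "n_avail": 10, "n_foss_avail": 6},
}

# The overall column is the sum of the yearly counts.
overall = {key: sum(year[key] for year in counts.values())
           for key in ["n_included", "n_foss", "n_avail", "n_foss_avail"]}
counts["overall"] = overall

proportions = {}
for label, c in counts.items():
    proportions[label] = {
        "per_foss": round(c["n_foss"] / c["n_included"], 2),
        "per_avail": round(c["n_avail"] / c["n_included"], 2),
        # Note the different denominator: FOSS studies only.
        "per_foss_avail": round(c["n_foss_avail"] / c["n_foss"], 2),
    }

for label, p in proportions.items():
    print(label, p)
```

For example, the overall `per_avail` of 0.1 is 41 / 422 rounded to two decimal places.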


In [15]:
reporting_guideline_summary(clean)

| reporting_guidelines_mention | n | % of included |
|---|---|---|
| ISPOR | 35 | 8.3 |
| STRESS | 13 | 3.1 |
| CHEERS | 8 | 1.9 |
| ODD | 1 | 0.2 |
| SQUIRE | 1 | 0.2 |
| Sanders et al. | 1 | 0.2 |
| Zhang et al. | 1 | 0.2 |
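The percentages in the reporting guideline table can be reproduced from the counts and the 422 included studies reported in the earlier result tables. The sketch below also sums the rows to give the number of studies mentioning any listed guideline. This assumes each study is coded with a single guideline, which follows from the data being held in one column.

```python
# Counts copied from the reporting guideline table above; 422 is the number
# of included studies reported in the earlier result tables.
N_INCLUDED = 422
guideline_counts = {
    "ISPOR": 35, "STRESS": 13, "CHEERS": 8, "ODD": 1,
    "SQUIRE": 1, "Sanders et al.": 1, "Zhang et al.": 1,
}

# Percentage of included studies per guideline, rounded as in the table.
percentages = {name: round(100 * n / N_INCLUDED, 1)
               for name, n in guideline_counts.items()}

# Studies mentioning any of the listed guidelines (assumes one per study).
n_any = sum(guideline_counts.values())
per_any = round(100 * n_any / N_INCLUDED, 1)

print(percentages)
print(n_any, per_any)
```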
