# Dataset pre-processing

This notebook provides an overview of the code used to read in and clean the data extracted during the review.

The data set is held in a zip-compressed CSV file that was exported from a Zotero library (TODO: INSERT Zotero library link). The following fields were then extracted from each paper:

* `study_included` - has the study been included in the final analysis
* `model_code_available` - is the model made publicly available in some manner
* `reporting_guidelines_mention` - have reporting guidelines been mentioned or explicitly cited.
* `covid` - is DES being used to tackle COVID-19
* `sim_software` - name of simulation software or programming language if stated.
* `foss_sim` - free and open source simulation software? 0/1
* `model_archive` - name of archive if used
* `model_repo` - name of model repo if used
* `model_journal_supp` - what is stored in the journal supplementary material 
* `model_personal_org` - name of personal or organisational website if used
* `model_platform` - name of cloud platform used (e.g. Binder or Anylogic cloud)
* `excluded_reason` - One of four reasons that the study was excluded.
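
The field list above can double as a simple schema check. The following is a minimal sketch, assuming the cleaned `DataFrame` uses exactly these names as column labels; the helper `missing_fields` and the toy frame are hypothetical and are not part of the original notebook.

```python
import pandas as pd

# Field names copied from the list above.
EXTRACTED_FIELDS = [
    'study_included', 'model_code_available',
    'reporting_guidelines_mention', 'covid', 'sim_software', 'foss_sim',
    'model_archive', 'model_repo', 'model_journal_supp',
    'model_personal_org', 'model_platform', 'excluded_reason',
]


def missing_fields(df, expected=EXTRACTED_FIELDS):
    '''Hypothetical helper: return expected field names absent from df.'''
    return [name for name in expected if name not in df.columns]


# Toy frame (invented values) that contains only two of the fields.
toy = pd.DataFrame({'study_included': [1, 0], 'covid': [0, 0]})
print(missing_fields(toy))
```

An empty list would indicate that every extracted field is present.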

## 1. Imports

In [1]:
import pandas as pd
import numpy as np

## 2. Constants

In [2]:
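# remote copy of the dataset hosted on GitHub
FILE_NAME = 'https://raw.githubusercontent.com/TomMonks/' \
    + 'des_sharing_lit_review/main/data/share_sim_data_extract.zip'

# local copy; this second assignment overrides the remote URL above
FILE_NAME = '../../data/share_sim_data_extract.zip'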

# used to drop redundant manuscript fields output by Zotero
# e.g. keywords and abstracts.
COLS_TO_KEEP = [2, 3, 4, 5, 6, 7, 10, 11, 44, 45, 46, 47, 
                48, 49, 50, 51, 52, 52, 53, 54, 55, 57]

## 3. Function to read and clean dataset

The reading and clean-up of the dataset is implemented using `pandas`.

### 3.1 Cleaning helper functions

Two supporting functions are defined for the main routine.  These trim redundant columns and convert all column names to lower case.

In [3]:
def trim_columns(df):
    '''
    Remove fields that are not needed for the clean
    analysis dataset.
    
    Uses the COLS_TO_KEEP constant list.
    
    Params:
    -------
    df - pd.DataFrame
        The raw data
    
    Returns:
    --------
    pd.DataFrame
    
    '''
    return df[df.columns[COLS_TO_KEEP]]
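
To illustrate what `trim_columns` does, here is a small sketch on a toy frame. The column names and the `keep` positions are made up for the example; only the selection pattern mirrors the function above. It also shows that `DataFrame.iloc` gives the same result, and that a position listed twice (as 52 is in `COLS_TO_KEEP`) is selected twice.

```python
import pandas as pd

# Toy stand-in for the raw Zotero export (column names are invented).
raw = pd.DataFrame({'Key': ['A1'], 'Abstract': ['...'],
                    'Title': ['T'], 'Tags': ['x']})
keep = [0, 2]  # hypothetical positions, analogous to COLS_TO_KEEP

# Pattern used in trim_columns: positions -> labels -> selection.
by_label = raw[raw.columns[keep]]

# Purely positional selection of the same columns.
by_position = raw.iloc[:, keep]

print(list(by_label.columns))  # ['Key', 'Title']

# A repeated position produces a repeated column.
repeated = raw.iloc[:, [0, 2, 2]]
print(list(repeated.columns))  # ['Key', 'Title', 'Title']
```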

In [4]:
def cols_to_lower(df):
    '''
    Convert all column names in a dataframe to lower case
    
    Params:
    ------
    df - pandas.DataFrame
    
    Returns:
    -------
    pandas.DataFrame
    '''
    new_cols = [c.lower() for c in df.columns]
    df.columns = new_cols
    return df
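
Note that `cols_to_lower` assigns to `df.columns`, so it also changes the `DataFrame` passed in. Where that side effect is unwanted, a non-mutating variant could look like the sketch below; `cols_to_lower_copy` is a hypothetical name and is not part of the original notebook.

```python
import pandas as pd


def cols_to_lower_copy(df):
    '''Hypothetical variant: returns a new DataFrame, input unchanged.'''
    return df.rename(columns=str.lower)


raw = pd.DataFrame({'Key': [1], 'DOI': ['x']})
lowered = cols_to_lower_copy(raw)

print(list(lowered.columns))  # ['key', 'doi']
print(list(raw.columns))      # ['Key', 'DOI']
```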

### 3.2 Main load and clean function

The main function uses pandas method chaining: each step (`pipe`, `rename`, `replace`, `assign`) returns a `DataFrame` that is passed to the next.
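
As a minimal sketch of the chaining pattern on toy data: the helper `drop_empty_rows` and the values are invented for illustration, while the renaming and recoding steps mirror those used in the main function.

```python
import pandas as pd


def drop_empty_rows(df):
    '''Hypothetical helper: remove rows where every field is missing.'''
    return df.dropna(how='all')


toy = pd.DataFrame({'Item Type': ['bookSection', None],
                    'Publication Year': [2020, None]})

result = (toy
          # .pipe passes the DataFrame to a user-defined function
          .pipe(drop_empty_rows)
          .rename(columns={'Item Type': 'item_type',
                           'Publication Year': 'pub_yr'})
          # nested dict: {column: {old_value: new_value}}
          .replace({'item_type': {'bookSection': 'book'}})
          .assign(item_type=lambda x: pd.Categorical(x['item_type'])))

print(result)
```

Each step returns a new `DataFrame`, so no intermediate variables are needed and the order of operations reads top to bottom.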

In [7]:
def load_clean_dataset(file_name):
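    '''
    Loads a cleaned version of the dataset
    
    1.  Trims the columns to only those relevant to the analysis
    2.  Renames the Zotero columns that contain spaces to short
        snake_case labels
    3.  Converts all column names to lower case
    4.  Recodes selected values (e.g. 'bookSection' to 'book')
    5.  Converts relevant cols to Categorical data type
    
    The remaining type conversion (pub_yr to int) is currently disabled.
    
    Params:
    -------
    file_name - str
        Path or URL of the CSV file (may be zip compressed)
    
    Returns:
    --------
    pd.DataFrame
    '''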
    labels = {'Item Type': 'item_type',
               'Publication Year': 'pub_yr',
               'Publication Title': 'pub_title'}

    type_conversions = {'pub_yr': 'int'}
    
    recoded_types = {'item_type': {'bookSection':'book'},
                     'reporting_guidelines_mention': {'ISPOR-SMDM': 'ISPOR',
                                                      '0': 'None'}}

    clean = (pd.read_csv(file_name)
             .pipe(trim_columns)
             .rename(columns=labels) 
             .pipe(cols_to_lower)
             .replace(recoded_types)
             .assign(study_included=lambda x: 
                         pd.Categorical(x['study_included']),
                     model_code_available=lambda x: 
                         pd.Categorical(x['model_code_available']),
                     reporting_guidelines_mention=lambda x: 
                         pd.Categorical(x['reporting_guidelines_mention']),
                     covid=lambda x: pd.Categorical(x['covid']),
                     foss_sim=lambda x: pd.Categorical(x['foss_sim']),
                     item_type=lambda x: pd.Categorical(x['item_type']))
             # .astype(type_conversions) is left disabled: pub_yr contains
             # missing values, so a plain cast to 'int' would raise
    )

    return clean
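
The `info()` output in section 4 shows `pub_yr` as `float64` with 561 non-null values out of 566, which is why a plain cast to `int` would fail. One possible option, sketched here on toy data and not taken from the original notebook, is the pandas nullable integer dtype `'Int64'`, which keeps missing values as `<NA>`.

```python
import numpy as np
import pandas as pd

# Toy column with a missing publication year (values are invented).
toy = pd.DataFrame({'pub_yr': [2019.0, np.nan, 2021.0]})

# A plain integer cast raises because NaN has no int representation.
try:
    toy.astype({'pub_yr': 'int'})
    plain_int_failed = False
except (ValueError, TypeError):
    plain_int_failed = True

# The nullable integer dtype stores the gap as <NA> instead.
converted = toy.astype({'pub_yr': 'Int64'})

print(plain_int_failed)
print(converted['pub_yr'].dtype)
print(converted)
```

Under this assumption, `type_conversions` could map `pub_yr` to `'Int64'` rather than `'int'`.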

## 4. Example read and clean

Here we run the preprocessing of the main dataset, examine the `DataFrame` information, and then peek at the head and tail.

In [8]:
clean = load_clean_dataset(FILE_NAME)
clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 566 entries, 0 to 565
Data columns (total 22 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   key                           566 non-null    object  
 1   item_type                     566 non-null    category
 2   pub_yr                        561 non-null    float64 
 3   author                        565 non-null    object  
 4   title                         566 non-null    object  
 5   pub_title                     539 non-null    object  
 6   doi                           498 non-null    object  
 7   url                           442 non-null    object  
 8   study_included                566 non-null    category
 9   model_code_available          492 non-null    category
 10  reporting_guidelines_mention  492 non-null    category
 11  covid                         494 non-null    category
 12  sim_software                  493 non-null    object  

In [None]:
clean.head(2)

In [None]:
clean.tail(2)
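
Beyond `info()`, `head()` and `tail()`, a couple of quick summaries can help confirm that the clean-up behaved as expected. The sketch below uses a toy frame with invented values; the same calls could be applied to `clean`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned dataset (values are invented).
toy = pd.DataFrame({
    'study_included': pd.Categorical([1, 1, 0]),
    'covid': pd.Categorical([0, np.nan, np.nan]),
    'doi': ['10.1/a', None, '10.1/c'],
})

# Number of missing values per column, largest first.
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)

# Frequency of each category, including any missing values.
counts = toy['study_included'].value_counts(dropna=False)
print(counts)
```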