# Introduction


Contributors: Kyle G Young, Sally Shrapnel,...

The pyISARICBasics package is designed to provide a simple introduction to exploring and analysing the ISARIC dataset. We suggest first running the tutorial on the September dataset so that your results match the tutorial outputs. Once you are comfortable using the methods, the tutorial will work with any iteration of the dataset.

The dataset comprises individual Domains:

SA = Clinical and Adverse Events 

MB = Microbiology Specimen 

LB = Laboratory Results 

HO = Healthcare Encounters 

DM = Demographics

IN = Treatments and Interventions 

RS = Disease Response and Clinical Classification 

SV = Subject Visits 

RP = Reproductive System Findings 

PO = Pregnancy Outcomes 

DS = Disposition 

ER = Environmental Risk 

IE = Inclusion/Exclusion Criteria 

TI = 

VS = Vital Signs 

SC = Subject Characteristics 


This package contains a Domain class that loads an individual Domain, together with several methods and functions to explore and analyse the data within that domain. Data are stored as pandas DataFrames, and the functions use the open-source pandas library (https://pandas.pydata.org) to facilitate data analysis, visualisation and manipulation.

The package also provides functionality to load the DataFrames into SQLite for easy browsing using, for example, DB Browser (https://sqlitebrowser.org).

## Getting set up
### 1. Set file paths to data
Set DATA_DIRECTORY to the directory that contains your raw ISARIC .csv files, and use DATABASE_FILE to name the SQLite database.

In [None]:
DATA_DIRECTORY = "tests/Tutorial_data"
DATABASE_FILE = "test_db.sqlite"

### 2. Import the Domain class and key functions from the pyISARICBasics package

In [None]:
# Requires the pyISARICBasics package to be installed (e.g. with pip)
from pyISARICBasics.domain import Domain
from pyISARICBasics.functions import csv_to_sqlite, df_to_sqlite

### 3. Convert CSV files to SQLite database

The first step in our data exploration / analysis is to convert all of our raw .csv files to a SQLite database. This is useful for browsing with the application DB Browser (https://sqlitebrowser.org).

Unfortunately, reading and writing full SQLite tables into memory as a DataFrame is not particularly efficient in Python 3. However, the following function also creates auxiliary .pickle files that contain serialised versions of the pandas DataFrame objects; loading these files is much more efficient. Generating the initial database can take some time (approximately 20 minutes on a laptop). We suggest you let this run and, in the meantime, read through the pyISARICBasics documentation (https://kyleyoung1997.github.io/pyISARICBasics/index.html).

In [None]:
csv_to_sqlite(DATA_DIRECTORY, DATABASE_FILE)
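Besides DB Browser, a SQLite database can also be inspected from Python with the standard-library sqlite3 module. The sketch below is only an illustration: it uses an in-memory database with a hypothetical two-row SA table as a stand-in for DATABASE_FILE, and the table names that csv_to_sqlite actually creates are an assumption here.

```python
import sqlite3

# In-memory stand-in for DATABASE_FILE; the table and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SA (USUBJID TEXT, SAMODIFY TEXT)")
conn.executemany(
    "INSERT INTO SA VALUES (?, ?)",
    [("P001", "TREMOR"), ("P002", "COUGH")],
)
conn.commit()

# sqlite_master lists every table stored in the database.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
row_count = conn.execute("SELECT COUNT(*) FROM SA").fetchone()[0]
print(tables, row_count)
conn.close()
```

With a real database, replacing ":memory:" with DATABASE_FILE and skipping the CREATE/INSERT statements would list whatever tables are present.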

## Exploring an example Domain

For this example, we will use the SA domain, which contains clinical and adverse events.

The Domain class takes three arguments: Domain(domain, data_directory, num_rows).

1. domain (string): The name of the domain we wish to load, e.g. "SA"
2. data_directory (string): A path to the directory containing the raw ISARIC .csv files (the previous steps should set this up)
3. num_rows (int): An optional argument that specifies how many rows of data we wish to load. To load all the data, leave this blank or specify num_rows = None

Some of the ISARIC domains contain a large number of rows. If you wish to perform a quick exploration of the dataset or test individual functions, it can be useful to only load a subset of rows. This is achieved using the third argument, e.g. num_rows = 20. 


In [None]:
SA = Domain("SA", DATA_DIRECTORY, num_rows = None)
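How Domain applies num_rows internally is not shown in this tutorial. Assuming it behaves like the nrows argument of pandas.read_csv, the effect of loading only a subset of rows can be sketched on a hypothetical miniature CSV:

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for a raw SA.csv file.
raw_csv = io.StringIO(
    "USUBJID,SATERM,SAOCCUR\n"
    "P001,Tremor,Y\n"
    "P001,Cough,N\n"
    "P002,Fever,Y\n"
    "P003,Cough,Y\n"
)

# nrows limits how many data rows are parsed; nrows=None would read them all.
subset = pd.read_csv(raw_csv, nrows=2)
print(subset.shape)
```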

### 1. List the columns of the SA domain

The Domain.columns() method prints a list of the columns in the current domain.


All the columns in UPPERCASE are unaltered from the original SA.csv file. 

We also have one extra column, 'status', which converts the outcomes from ISARIC / SDTM format into a simple "Y", "N" or "U" (yes, no or unknown).

We will use the convention of lower case for columns like 'status' that have been derived or created here.

Some important columns from the original ISARIC data are:

    SATERM, INTRT, LBTEST, HOTERM - contain the verbatim, non-standardised wording of an event
    xxOCCUR - signifies whether an event occurred or not
    xxPRESP - a value of 'y' in this column indicates that the event was prespecified on the CRF, while 'n' or missing indicates a spontaneous (or free-text) entry
    xxSTDY - gives the day of an event (relative to admission day)
    
The 'status' column indicates whether an event occurred based on the combination of values in xxPRESP and xxOCCUR as follows: 

| xxPRESP | xxOCCUR | status |
|---------|---------|--------|
| NA      | NA      | Y      |
| NA      | Y       | U      |
| N       | Y       | N      |
| U       | Y       | U      |
| Y       | NA      | Y      |
| Y       | Y       | Y      |


Source code and documentation for this function can be viewed at (https://kyleyoung1997.github.io/pyISARICBasics/domain.html#pyISARICBasics.domain.Domain.process_occur) 

In [None]:
SA.columns()

### 2. Explore missingness in each column:

When columns are empty, or have very high missingness, it can be useful to remove them from the dataframe.
As individual patients will usually be associated with multiple rows, it can also be useful to identify the number of unique patients.

In [None]:
SA.table_missingness()
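The two quantities discussed above, per-column missingness and the number of unique patients, can also be computed directly with pandas. This is a minimal sketch on a hypothetical toy dataframe; it is not the implementation of table_missingness, whose output format may differ.

```python
import pandas as pd

# Hypothetical toy rows: one patient (P001) appears on two rows.
toy = pd.DataFrame(
    {
        "USUBJID": ["P001", "P001", "P002", "P003"],
        "SAMODIFY": ["TREMOR", "COUGH", "TREMOR", None],
        "SASTDY": [None, 2.0, None, None],
    }
)

# isna() marks missing cells; the column mean is the fraction missing.
missing_pct = toy.isna().mean() * 100
unique_patients = toy["USUBJID"].nunique()

print(missing_pct)
print(unique_patients)
```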

### 3. Exclude columns with high missingness
Excluding these columns from our dataframe frees up memory and makes computations more time-efficient.

In [None]:
SA.exclude_columns(['SASCAT', "SASTAT", "SAREASND", "SALOC", "SATPT", "SATPTREF", "SASTRF", "SAEVINTX", "SARPOC"])
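Rather than typing the column names by hand, the columns to exclude could be chosen with a missingness threshold. The sketch below assumes a hypothetical toy dataframe and an arbitrary cut-off of 70%; it uses plain pandas rather than the Domain class.

```python
import pandas as pd

# Hypothetical toy data: SALOC is empty and SATPT is mostly empty.
toy = pd.DataFrame(
    {
        "USUBJID": ["P001", "P002", "P003", "P004"],
        "SAMODIFY": ["TREMOR", "COUGH", None, "FEVER"],
        "SALOC": [None, None, None, None],
        "SATPT": [None, None, None, "DAY 1"],
    }
)

THRESHOLD = 0.7  # illustrative cut-off: drop columns more than 70% missing

missing_fraction = toy.isna().mean()
cols_to_drop = missing_fraction[missing_fraction > THRESHOLD].index.tolist()
trimmed = toy.drop(columns=cols_to_drop)

print(cols_to_drop)
print(list(trimmed.columns))
```

A list built this way could then be passed to SA.exclude_columns in place of the hand-written one.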

### 4. Provide a list of the variables contained within each column.

We can use the Domain.column_events method to identify the variables contained within each column. 

In [None]:
SA.column_events("SACAT")

We can see SACAT (SA Category) has only 9 distinct variables.

In [None]:
SA.column_events("SAMODIFY")

We can see SAMODIFY (SA modified term) has many distinct variables.

### 5. Identifying variable missingness
We can now identify the missingness for a specific variable. For example, if we are interested in 'TREMOR' from the SAMODIFY column:



In [None]:
SA.table_missingness("SAMODIFY", "TREMOR")

This output displays the missingness for the 12,272 rows where SAMODIFY contains TREMOR. Of the 677,926 unique patients in the SA domain, 11,974 have an entry for TREMOR. None of the 12,272 rows containing TREMOR has an associated day (SASTDY) on which the event occurred.

### 6. Visualising the new DataFrame that we have filtered by SAMODIFY and TREMOR: 

The Domain.select_variables_from_column() method returns a pandas DataFrame, so any function in the pandas library can be used to further filter this dataframe.

In [None]:
SA.select_variables_from_column("SAMODIFY", "TREMOR")

### 7. Create a list of relevant columns 
We can also create a list of specific columns that we're interested in: 

In [None]:
cols_of_interest = ["USUBJID", "SASTDY", "SAMODIFY", "SAPRESP", "SAOCCUR", 'status']
SA.select_variables_from_column("SAMODIFY", "TREMOR")[cols_of_interest]
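Because the result is an ordinary DataFrame, the usual pandas filtering idioms apply. The sketch below uses a hypothetical toy dataframe in place of the real SA data and shows roughly equivalent filtering with isin, followed by a further filter on 'status':

```python
import pandas as pd

# Hypothetical toy stand-in for the SA dataframe.
toy = pd.DataFrame(
    {
        "USUBJID": ["P001", "P002", "P003"],
        "SAMODIFY": ["TREMOR", "COUGH", "TREMOR"],
        "SASTDY": [1.0, 3.0, None],
        "status": ["Y", "N", "U"],
    }
)

# Keep rows whose SAMODIFY value is in the list of variables of interest.
tremor = toy[toy["SAMODIFY"].isin(["TREMOR"])]

# Further restrict to confirmed events and a few columns of interest.
confirmed = tremor.loc[tremor["status"] == "Y", ["USUBJID", "SASTDY", "status"]]
print(confirmed)
```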

### 8. Print row counts for each column.


In [None]:
SA.column_summary("SAMODIFY")

### 9. Print row counts + status for each column

We can print a summary of the variables in each column as well as the 'status' variable. 

In [None]:
SA.column_summary("SAMODIFY", status = True)
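A comparable summary can be produced with plain pandas. This sketch, on a hypothetical toy dataframe, counts rows per variable with value_counts and breaks them down by 'status' with pandas.crosstab; the exact layout printed by column_summary is not reproduced here.

```python
import pandas as pd

# Hypothetical toy stand-in for the SA dataframe.
toy = pd.DataFrame(
    {
        "SAMODIFY": ["ASTHMA", "ASTHMA", "STROKE", "ASTHMA", "STROKE"],
        "status": ["Y", "N", "Y", "Y", "U"],
    }
)

# Row counts per variable.
counts = toy["SAMODIFY"].value_counts()

# Row counts per variable, split by status (Y / N / U).
by_status = pd.crosstab(toy["SAMODIFY"], toy["status"])

print(counts)
print(by_status)
```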

We can also specify a subset of variables:

In [None]:
SA.column_summary("SAMODIFY", "ASTHMA", "STROKE", "TUBERCULOSIS", status = True)

### 10. Saving the modified dataframe as a sqlite table: 

If we want to browse our new filtered dataframe (or access it later), we can save it into a SQLite table
(note this takes some time for large domains such as SA and IN).

In [None]:
SA.save_to_sqlite("SA_tutorial_modified", DATA_DIRECTORY, DATABASE_FILE)

This creates a new table in our existing SQLite database as well as a .pickle file for quicker reading and writing in Python.

# Free Text Searches
For most variables in the ISARIC dataset, the xxMODIFY column contains a standardised event name. However, xxTERM contains some spontaneously recorded events that are not recorded in xxMODIFY.

For example, we can search the SA domain for terms that might be relevant to kidney stones (for which there is no standardised variable in the 'SAMODIFY' column). We use the Domain.free_text_search() method and enter any search terms as strings separated by commas. The method searches for these terms in the relevant column and returns a dataframe with the result.

Note that the Domain.free_text_search( ) method searches to see if our search terms are substrings of any raw terms. For example searching "Kidney" would return rows containing "Acute Kidney Injury" as well as "Kidney Stones". 
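The substring matching described above can be sketched with pandas string methods. The helper below, substring_search, is hypothetical and is not the package's implementation; in particular, the case-insensitive matching and the choice of the SATERM column are assumptions made for illustration.

```python
import re

import pandas as pd

# Hypothetical toy stand-in for the free-text column of the SA domain.
toy = pd.DataFrame(
    {
        "USUBJID": ["P001", "P002", "P003", "P004"],
        "SATERM": ["Acute Kidney Injury", "kidney stones", "Renal calculi (left)", None],
    }
)


def substring_search(frame, column, *terms):
    """Return rows where any term is a case-insensitive substring of column."""
    # re.escape keeps characters such as parentheses from acting as regex syntax.
    pattern = "|".join(re.escape(term) for term in terms)
    mask = frame[column].str.contains(pattern, case=False, regex=True, na=False)
    return frame[mask]


hits = substring_search(toy, "SATERM", "kidney stones", "renal calculi")
broad = substring_search(toy, "SATERM", "kidney")
print(hits)
print(broad)
```

As in the example above, the broader term "kidney" also picks up "Acute Kidney Injury", so more specific search terms give cleaner results.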

In [None]:
stones_frame = SA.free_text_search("kidney stones", "nephrolithiasis", "renal calculi")

In [None]:
stones_frame

We found 271 free-text entries that are relevant to kidney stones. Note that the values of SAPRESP and SAOCCUR are both NaN (missing). This indicates that the entry was made spontaneously (i.e. it was not prespecified on the CRF).

# Vaccination Status Example

The following example uses the functionality we have used thus far to retrieve the vaccination status of patients. 

We will load the IN domain as this contains information about vaccinations. Note we first delete the SA domain from memory to save some space. 

In [None]:
del SA

In [None]:
IN = Domain("IN", DATA_DIRECTORY)

We then inspect the columns:

In [None]:
IN.columns()

Most of those columns are not relevant to vaccination status, so we will keep only the relevant columns.

In [None]:
relevant_cols = ['USUBJID', 'INTRT', 'INMODIFY', 'INPRESP', 'INOCCUR', 'INREFID', 'INSTDY', 'status']
IN.include_columns(relevant_cols)

We can now look at "INMODIFY" to ascertain which variables are relevant to COVID-19 vaccination.

In [None]:
IN.column_summary("INMODIFY", status = True)

We can then take a closer look at only those variables related to COVID-19 Vaccination: 

In [None]:
IN.column_summary("INMODIFY", 'COVID-19 VACCINATION',
                                            'COVID-19 VACCINE PFIZER-BIONTECH',
                                            'COVID-19 VACCINE JANSSENS (JOHNSON AND JOHNSON)',
                                            'COVID-19 VACCINE TYPE UNKNOWN',
                                            'COVID-19 VACCINE ASTRAZENECA/UNIVERSITY OF OXFORD',
                                            'COVID-19 VACCINE CANSINBIO', 
                                            'COVID-19 VACCINE SPUTNIK V',
                                            'COVID-19 VACCINE SINOPHARM', 
                                            'COVID-19 VACCINE MODERNA',
                                            'COVID-19 VACCINE SINOVAC', 
                                            'COVID-19 VACCINE COVAXIN',
                                            status = True)

We can also use our list of INMODIFY variables to return a dataframe with only entries relevant to vaccination: 

In [None]:
covid_vacc = IN.select_variables_from_column("INMODIFY", 'COVID-19 VACCINATION', 
                                            'COVID-19 VACCINE PFIZER-BIONTECH',
                                            'COVID-19 VACCINE JANSSENS (JOHNSON AND JOHNSON)',
                                            'COVID-19 VACCINE TYPE UNKNOWN',
                                            'COVID-19 VACCINE ASTRAZENECA/UNIVERSITY OF OXFORD',
                                            'COVID-19 VACCINE CANSINBIO', 
                                            'COVID-19 VACCINE SPUTNIK V',
                                            'COVID-19 VACCINE SINOPHARM', 
                                            'COVID-19 VACCINE MODERNA',
                                            'COVID-19 VACCINE SINOVAC', 
                                            'COVID-19 VACCINE COVAXIN')

We can then take a look at our filtered dataframe: 

In [None]:
covid_vacc

In [None]:
covid_vacc.INTRT.unique()

In [None]:
covid_vacc.INMODIFY.unique()

In [None]:
covid_vacc.INREFID.unique()

In [None]:
covid_vacc.status.value_counts()

Great! So now what do we do if we want to save this DataFrame to access it later?

We can use the function df_to_sqlite() which saves a DataFrame into the sqlite database created earlier and as a .pickle which we can load quickly into Python.

In [None]:
df_to_sqlite(covid_vacc, "vacc_status", DATA_DIRECTORY, DATABASE_FILE)

As you can see, the function returns True, meaning the write was successful.
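For comparison, a similar round trip can be done with pandas and the standard-library sqlite3 module alone. This is a sketch under stated assumptions: the dataframe is a hypothetical toy version of covid_vacc, an in-memory database stands in for DATABASE_FILE, and the .pickle file that df_to_sqlite also writes is not reproduced.

```python
import sqlite3

import pandas as pd

# Hypothetical toy stand-in for the covid_vacc dataframe.
covid_vacc_toy = pd.DataFrame(
    {
        "USUBJID": ["P001", "P002"],
        "INMODIFY": ["COVID-19 VACCINATION", "COVID-19 VACCINE MODERNA"],
        "status": ["Y", "N"],
    }
)

# In-memory database used here instead of DATABASE_FILE.
conn = sqlite3.connect(":memory:")

# Write the dataframe to a table, then read it back with a SQL query.
covid_vacc_toy.to_sql("vacc_status", conn, index=False, if_exists="replace")
restored = pd.read_sql_query("SELECT * FROM vacc_status", conn)
conn.close()

print(restored)
```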