# pyISARICBasics Tutorial

Author: Kyle G Young

This tutorial introduces the user to the ISARIC dataset and provides an overview of some basic data exploration tools that can be used for each domain. 

The package includes functions to read and write data from the raw .csv files, and a Domain class to load and explore a specific domain. The package relies heavily on Pandas (https://pandas.pydata.org).




In this tutorial we will create a sqlite database and do some data exploration and analysis.  

## Sqlite Database Creation

We first set some global variables. DATA_DIRECTORY is the path to the directory that contains the raw ISARIC .csv files, and DATABASE_FILE is the name we want to give the sqlite database. The sqlite database will be created inside the directory specified by DATA_DIRECTORY.

In [None]:
DATA_DIRECTORY = "path_to_data"
DATABASE_FILE = "data.sqlite"

We now import the Domain Class and some useful functions from the pyISARICBasics package. 

In [None]:
# Assumes pyISARICBasics has already been installed (e.g. with pip)
from pyISARICBasics.domain import Domain
from pyISARICBasics.functions import csv_to_sqlite, df_to_sqlite

The first step in our data exploration / analysis is to convert all of our raw .csv's to a sqlite database. This is useful for browsing with the application DB Browser (https://sqlitebrowser.org).

Unfortunately, reading and writing full sqlite tables to and from a DataFrame is not particularly efficient in Python 3. However, the following function also creates auxiliary .pickle files that contain serialised versions of the Pandas DataFrame objects; loading these files is much more efficient. Generating the initial database can take some time (approximately 20 minutes on a laptop). We suggest you let this run and, in the meantime, read through the pyISARICBasics documentation: (https://kyleyoung1997.github.io/pyISARICBasics/index.html)

In [None]:
csv_to_sqlite(DATA_DIRECTORY, DATABASE_FILE)
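
To give a feel for what a CSV-to-sqlite conversion involves, here is a minimal sketch using only the Python standard library. It is not the package's implementation: the helper `load_csv_into_sqlite`, the toy file and its values are all hypothetical, and it stores every value as text for simplicity.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def load_csv_into_sqlite(csv_path, db_path, table_name):
    """Copy one CSV file into a sqlite table (hypothetical helper, all values stored as text)."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = list(reader)

    # Quote column names so that unusual headers remain valid SQL identifiers.
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)

    connection = sqlite3.connect(db_path)
    try:
        connection.execute(f'DROP TABLE IF EXISTS "{table_name}"')
        connection.execute(f'CREATE TABLE "{table_name}" ({columns})')
        connection.executemany(
            f'INSERT INTO "{table_name}" VALUES ({placeholders})', rows
        )
        connection.commit()
    finally:
        connection.close()
    return len(rows)


# Toy data standing in for a raw domain file (the values are invented).
demo_dir = Path(tempfile.mkdtemp())
demo_csv = demo_dir / "SA.csv"
demo_csv.write_text("USUBJID,SAMODIFY\nP1,ASTHMA\nP2,HYPERTENSION\n")
demo_db = demo_dir / "demo.sqlite"

rows_written = load_csv_into_sqlite(demo_csv, demo_db, "SA")
```

The resulting file can be opened in DB Browser in the same way as the database produced by csv_to_sqlite().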

## Data Exploration using the SA domain

Let's load the SA domain as an example: 

The Domain class takes three arguments: Domain(domain, data_directory, num_rows). 

1. domain: (string): specifying the name of the domain we wish to load e.g. "SA"
2. data_directory: (string): A path to the directory containing the raw ISARIC .csv's (if you've been following along you should have set this up above) 
3. num_rows: (int): An optional argument that can be used to specify how many rows of data we wish to load. If we wish to load all the data we can leave this blank or specify num_rows = None

Some of the ISARIC domains contain a large number of rows. If you're just exploring the dataset or testing functions, it might be useful to load only a subset of rows.


In [None]:
SA = Domain("SA", DATA_DIRECTORY, num_rows = None)
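
For comparison, plain Pandas offers the same kind of row limit through the `nrows` argument of `read_csv`. The sketch below uses an invented in-memory CSV; whether Domain uses `read_csv` internally is an assumption, not something this tutorial confirms.

```python
import io

import pandas as pd

# Invented stand-in for a raw domain file.
raw_csv = io.StringIO(
    "USUBJID,SAMODIFY\nP1,ASTHMA\nP2,HYPERTENSION\nP3,STROKE\n"
)

# nrows limits how many data rows are parsed; nrows=None would read everything.
preview = pd.read_csv(raw_csv, nrows=2)
```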

Let's look at the columns in this domain:

In [None]:
SA.columns()

All the columns in UPPERCASE are unaltered from the original SA csv file. We also have one extra column, 'status', which converts the outcomes from ISARIC / SDTM format into a simple "Y", "N" or "U" (yes, no or unknown). We will use the convention of lower case for the names of any columns that we create or derive ourselves.

Some important columns from the original ISARIC data are:
    
    xxTERM - Contains the verbatim non-standardised wording of an event
    xxOCCUR - Helps to determine whether an event occurred or not
    xxPRESP - a value of 'y' in this column indicates that the event was prespecified on the CRF, while 'n' or missing indicates a spontaneous (or free-text) entry
    xxSTDY - Gives the day of an event (relative to admission day) 
    
The 'status' column indicates whether an event occurred based on the combination of values in xxPRESP and xxOCCUR as follows: 

| xxOCCUR | xxPRESP | status |
|---------|---------|--------|
| NA      | NA      | Y      |
| NA      | Y       | U      |
| N       | Y       | N      |
| U       | Y       | U      |
| Y       | NA      | Y      |
| Y       | Y       | Y      |


Source code and documentation for this function can be viewed at (https://kyleyoung1997.github.io/pyISARICBasics/domain.html#pyISARICBasics.domain.Domain.process_occur) 
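
A lookup like the one in the table above can be expressed as a dictionary keyed on the pair of values in the table's first two columns. This is only an illustrative sketch, not the package's process_occur code: the names `STATUS_LOOKUP` and `lookup_status` are hypothetical, `None` stands for NA, and the fallback of "U" for combinations the table does not list is our own assumption.

```python
# Keys are (value in first column, value in second column) of the table above;
# None stands for NA (missing).
STATUS_LOOKUP = {
    (None, None): "Y",
    (None, "Y"): "U",
    ("N", "Y"): "N",
    ("U", "Y"): "U",
    ("Y", None): "Y",
    ("Y", "Y"): "Y",
}


def lookup_status(first, second, default="U"):
    """Return the status for one row.

    Combinations that the table does not list fall back to `default`
    (an assumption made for this sketch).
    """
    return STATUS_LOOKUP.get((first, second), default)


example_statuses = [lookup_status(None, None), lookup_status("N", "Y")]
```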

Now that we know what the columns in our table are, it is useful to look at the missingness in each column:

In [None]:
SA.table_missingness()

This method prints out the number of rows in each column that have missing values, as well as the total number of rows in the domain.
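
In plain Pandas, the same kind of summary can be obtained with `isna().sum()`. The sketch below uses a small invented DataFrame and is meant only to show the idea, not to reproduce the exact output of table_missingness().

```python
import pandas as pd

# Invented toy data; None marks a missing value.
toy = pd.DataFrame(
    {
        "USUBJID": ["P1", "P2", "P3"],
        "SAMODIFY": ["ASTHMA", None, "STROKE"],
        "SALOC": [None, None, None],
    }
)

# Number of missing values in each column, and the total number of rows.
missing_per_column = toy.isna().sum()
total_rows = len(toy)
```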

As you can see, a large number of columns have high missingness. We can choose to exclude some of these columns from our DataFrame to free up memory and speed up computations.

In [None]:
SA.exclude_columns(["SASCAT", "SASTAT", "SAREASND", "SALOC", "SATPT", "SATPTREF", "SASTRF", "SAEVINTX", "SARPOC"])
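
To see why dropping columns saves memory, here is a small sketch in plain Pandas with invented data. Note that `DataFrame.drop` returns a new DataFrame and leaves the original untouched; whether exclude_columns() works in place or on a copy is not something this sketch assumes.

```python
import pandas as pd

# Invented toy data with two entirely empty columns.
toy = pd.DataFrame(
    {
        "USUBJID": ["P1", "P2", "P3"],
        "SAMODIFY": ["ASTHMA", "HYPERTENSION", "STROKE"],
        "SALOC": [None, None, None],
        "SATPT": [None, None, None],
    }
)

bytes_before = toy.memory_usage(deep=True).sum()
trimmed = toy.drop(columns=["SALOC", "SATPT"])
bytes_after = trimmed.memory_usage(deep=True).sum()
```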

We can use the following method to display all events in a given column:

In [None]:
SA.column_events("SAMODIFY")

We can now look at the table missingness while filtering on a specific variable. For example if we are interested in 'HYPERTENSION' we can examine the missingness for only those rows where there is an entry for "HYPERTENSION":


In [None]:
SA.table_missingness("SAMODIFY", "HYPERTENSION")

This output displays the missingness for the 632,964 rows where SAMODIFY contains HYPERTENSION. Of the 677,926 unique patients in the SA domain, 631,160 have an entry for HYPERTENSION.
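
The row count and the patient count differ because one patient can contribute more than one row. The sketch below shows this with invented data in plain Pandas; it assumes an exact match on SAMODIFY, which may not be how table_missingness() filters internally.

```python
import pandas as pd

# Invented toy data: patient P1 has two HYPERTENSION rows.
toy = pd.DataFrame(
    {
        "USUBJID": ["P1", "P1", "P2", "P3"],
        "SAMODIFY": ["HYPERTENSION", "HYPERTENSION", "HYPERTENSION", "ASTHMA"],
        "SASTDY": [1.0, None, 3.0, 2.0],
    }
)

hypertension_rows = toy[toy["SAMODIFY"] == "HYPERTENSION"]

rows_matched = len(hypertension_rows)                      # rows in the subset
patients_matched = hypertension_rows["USUBJID"].nunique()  # distinct patients
missing_in_subset = hypertension_rows.isna().sum()         # missingness per column
```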

Now let's take a closer look at the filtered DataFrame: 

In [None]:
SA.select_variable_from_column("SAMODIFY", "HYPERTENSION")

It's worth noting that this method returns a Pandas DataFrame, so we can use anything in the Pandas library to filter it further. For instance, if we create a list of the columns we're interested in, we can use it to display only those columns:

In [None]:
cols_of_interest = ["USUBJID", "SASTDY", "SAMODIFY", "SAPRESP", "SAOCCUR", 'status']
SA.select_variable_from_column("SAMODIFY", "HYPERTENSION")[cols_of_interest]

When we select only these columns the relationship between SAPRESP, SAOCCUR and status becomes a little more evident too.

We can also print a summary of counts for each column. For example SAMODIFY:

In [None]:
SA.column_summary("SAMODIFY")

We can also use this method to show the proportion of each variable by adding proportions = True, as seen below. Note that the proportions displayed are the number of rows containing a variable over the total number of rows in SAMODIFY. That is, they are independent of the 'status' variable.


In [None]:
SA.column_summary("SAMODIFY", proportions = True)

However, this just gives us the counts / proportions of events that are recorded, without any information on the status of the event (e.g. Y, N or U). Setting status = True extracts this information:

In [None]:
SA.column_summary("SAMODIFY", status = True)

We can also restrict the summary to specific variables by passing their names:

In [None]:
SA.column_summary("SAMODIFY", "ASTHMA", "STROKE", "TUBERCULOSIS", status = True)
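
The three kinds of summary shown above (counts, proportions, and counts split by status) have close equivalents in plain Pandas. The following sketch uses invented data and is not meant to reproduce the exact layout of column_summary().

```python
import pandas as pd

# Invented toy data.
toy = pd.DataFrame(
    {
        "SAMODIFY": ["ASTHMA", "ASTHMA", "STROKE", "ASTHMA"],
        "status": ["Y", "N", "U", "Y"],
    }
)

# Rows per variable, regardless of status.
counts = toy["SAMODIFY"].value_counts()

# The same counts divided by the total number of rows.
proportions = toy["SAMODIFY"].value_counts(normalize=True)

# Rows per variable, split by status; absent combinations are filled with 0.
by_status = pd.crosstab(toy["SAMODIFY"], toy["status"])
```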

Now we should save our modified DataFrame (with the added status variable) back into a sqlite table, so that we can browse it or access it later.
(Note that this takes some time for large domains such as SA and IN.)

In [None]:
SA.save_to_sqlite("SA_tutorial_modified", DATA_DIRECTORY, DATABASE_FILE)

This creates a new table in our existing sqlite database, as well as a .pickle file for quicker reading and writing in Python.

## Free Text Variables
For most variables in the ISARIC dataset, the xxMODIFY column contains a standardised event name. However, for some spontaneously recorded events this might not be the case, and in some instances it can be worthwhile checking these entries.

In this example we are going to search the SA domain for terms that might be relevant to kidney stones (for which there is no standardised variable in the 'SAMODIFY' column). We use the Domain.free_text_search() method. We can enter any terms we wish to search for as strings separated by commas. The method then searches for any of these terms in the relevant column and returns a DataFrame with the result.

It is worth noting that the Domain.free_text_search() method searches to see if our search terms are substrings of any raw terms. For example searching "Kidney" would return rows containing "Acute Kidney Injury" as well as "Kidney Stones". 
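
Substring matching of this kind can be sketched in plain Pandas with `Series.str.contains`. The helper `substring_search` below is hypothetical and is not the package's implementation. In particular, the choice of the SATERM column and the case-insensitive matching are assumptions made for this example.

```python
import re

import pandas as pd


def substring_search(frame, column, *terms):
    """Return the rows whose `column` contains any of `terms` as a substring.

    Hypothetical helper: matching is case-insensitive and missing values
    never match (both are assumptions made for this sketch).
    """
    # Escape each term so that characters such as "(" are matched literally.
    pattern = "|".join(re.escape(term) for term in terms)
    mask = frame[column].str.contains(pattern, case=False, na=False, regex=True)
    return frame[mask]


# Invented toy data.
toy = pd.DataFrame(
    {"SATERM": ["Acute Kidney Injury", "kidney stones", "Asthma", None]}
)

kidney_rows = substring_search(toy, "SATERM", "Kidney")
stone_rows = substring_search(toy, "SATERM", "kidney stones", "nephrolithiasis")
```

Searching for "Kidney" alone matches both kidney-related rows, which is why more specific terms are usually the better choice.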

In [None]:
stones_frame = SA.free_text_search("kidney stones", "nephrolithiasis", "renal calculi")

In [None]:
stones_frame

So we found 271 free text entries that are relevant to kidney stones. Note that the value of SAPRESP is NaN (missing), as is the value of SAOCCUR. This indicates that the entry was made spontaneously (i.e. not indicated on the CRF).

## Vaccination Status Example

Now that we have introduced the basic functionality of our package, we are going to give an example of using it to retrieve the vaccination status of patients.

In this example we need to load the IN domain as this contains information about vaccinations (note we first delete the SA domain from memory to save some space). 

In [None]:
del SA

In [None]:
IN = Domain("IN", DATA_DIRECTORY)

We then inspect the columns:

In [None]:
IN.columns()

Most of these columns are not relevant to vaccination status, so we're going to keep only the relevant ones.

In [None]:
relevant_cols = ['USUBJID', 'INTRT', 'INMODIFY', 'INPRESP', 'INOCCUR', 'INREFID', 'INSTDY', 'status']
IN.include_columns(relevant_cols)

While there are derived values for COVID-19 vaccination status in the 'INMODIFY' column, they contain different values depending on the type of vaccination received. Instead we are going to search the 'INTRT' column with a variety of free-text search terms to ensure we get as many COVID-19 vaccination events as possible, including those events that do not contain a value in the standardised column. 

In [None]:
covid_vacc = IN.free_text_search("COVID-19 Vaccine", "ASTRAZENECA", "PFIZER", "COVISHIELD",
                                 "SINOVAC", "COVID-19 VACCINATION", "RECEIVED A COVID-19 VACCIN")

So we found 559,420 rows that are relevant to COVID-19 Vaccination status in the IN domain. Taking a closer look at what the result looks like:

In [None]:
covid_vacc.head(5)

Let's look at the unique values for each column using some functionality from Pandas. Each column in a Pandas DataFrame is stored as a Series. We can access the Series directly using 'df.colname', and then find the unique values contained in that column with the .unique() method.

In [None]:
covid_vacc.INTRT.unique()

In [None]:
covid_vacc.INMODIFY.unique()

In [None]:
covid_vacc.INREFID.unique()

We can also look at the counts in the 'status' variable: 

In [None]:
covid_vacc.status.value_counts()
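
As a quick illustration of these Pandas features on invented data: attribute access (`df.colname`) and bracket access (`df["colname"]`) give the same Series, `.unique()` lists values in order of first appearance, and `.value_counts()` tallies them.

```python
import pandas as pd

# Invented toy data.
toy = pd.DataFrame(
    {
        "INTRT": ["PFIZER", "ASTRAZENECA", "PFIZER"],
        "status": ["Y", "Y", "U"],
    }
)

# Attribute access and bracket access return the same column.
same_series = toy.INTRT.equals(toy["INTRT"])

unique_treatments = list(toy.INTRT.unique())
status_counts = toy.status.value_counts()
```

Bracket access is the safer habit when a column name clashes with a DataFrame method or contains spaces.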

Great! So now what do we do if we want to save this DataFrame to access it later?

We can use the function df_to_sqlite() which saves a DataFrame into the sqlite database created earlier and as a .pickle which we can load quickly into Python.

In [None]:
df_to_sqlite(covid_vacc, "vacc_status", DATA_DIRECTORY, DATABASE_FILE)

As you can see, the function returns True, meaning the write has been successful.
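
For readers curious about what saving a DataFrame to both sqlite and pickle might involve, here is a minimal sketch in plain Pandas. The helper `save_frame` is hypothetical and is not the package's df_to_sqlite(). The pickle file name, the choice to replace an existing table, and the returned True are all assumptions made for this example.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd


def save_frame(frame, table_name, directory, database_file):
    """Write `frame` to a sqlite table and to a pickle file (hypothetical helper)."""
    directory = Path(directory)

    connection = sqlite3.connect(directory / database_file)
    try:
        # Replace the table if it already exists (an assumption of this sketch).
        frame.to_sql(table_name, connection, if_exists="replace", index=False)
    finally:
        connection.close()

    # The pickle copy can later be reloaded quickly with pd.read_pickle.
    frame.to_pickle(directory / f"{table_name}.pickle")
    return True


# Invented toy data written to a temporary directory.
demo_dir = tempfile.mkdtemp()
toy = pd.DataFrame({"USUBJID": ["P1", "P2"], "status": ["Y", "U"]})

saved = save_frame(toy, "vacc_status", demo_dir, "demo.sqlite")
reloaded = pd.read_pickle(Path(demo_dir) / "vacc_status.pickle")
```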