# PhenEx Study Tutorial
This page shows how to use PhenEx to:
1. Connect to a Snowflake Database
2. Work with OMOP data
3. Create a simple cohort
4. View cohort summary statistics

First, make sure that your PhenEx version is up to date.

In [None]:
# For updating PhenEx to latest released version
# !pip install -Uq PhenEx

# For using local source code version of PhenEx
# %pip install -e /Users/ahartens/src/PhenEx

In [None]:
import ibis
ibis.options.interactive = True

## Set Snowflake Credentials
PhenEx needs to connect to a Snowflake backend and therefore needs your login credentials. There are two ways to provide them: (1) explicitly in the notebook or (2) using a .env (dotenv) file. We show both, but you only need one.
### Method 1: set credentials explicitly

In [None]:
import os

# authentication
os.environ.update({
    'SNOWFLAKE_ACCOUNT':'phrwdstore.us-east-1',
    'SNOWFLAKE_WAREHOUSE':'COMPUTE_WH',
    'SNOWFLAKE_ROLE':'RWDSTORE_PROJECTS_IEG_RW',
    'SNOWFLAKE_USER':'alexander.hartenstein@bayer.com', # ENTER YOUR SNOWFLAKE USERNAME HERE
})

### Method 2: use a .env file
You can also specify these using a dotenv file (https://github.com/motdotla/dotenv), loaded in Python with the python-dotenv package. One advantage is that sensitive credential information stays out of your Jupyter notebook.

In [None]:
from dotenv import load_dotenv
load_dotenv()

If you see True above, Python was able to find and load your environment file.
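For reference, a .env file is a plain-text file of `KEY=value` lines. The sketch below writes a sample file with placeholder values to a temporary directory and reports which of the four variable names from Method 1 are missing. The parser is a deliberately simplified stand-in for `load_dotenv` (it ignores quoting and variable expansion), and `REQUIRED_VARS`, `parse_env_text`, and `missing_vars` are hypothetical helper names, not part of PhenEx or python-dotenv.

```python
import tempfile
from pathlib import Path

# Variable names taken from Method 1 above.
REQUIRED_VARS = [
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_WAREHOUSE",
    "SNOWFLAKE_ROLE",
    "SNOWFLAKE_USER",
]

# Placeholder values only; replace them with your own settings.
SAMPLE_ENV = """\
# Snowflake settings read by the connector
SNOWFLAKE_ACCOUNT=my_account.us-east-1
SNOWFLAKE_WAREHOUSE=COMPUTE_WH
SNOWFLAKE_USER=first.last@example.com
"""


def parse_env_text(text):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_vars(env, required=REQUIRED_VARS):
    """Return the required names that are absent or empty in env."""
    return [name for name in required if not env.get(name)]


with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(SAMPLE_ENV)
    parsed = parse_env_text(env_path.read_text())

# The sample file deliberately omits SNOWFLAKE_ROLE.
print(missing_vars(parsed))
```

In the notebook, the same check can be run against the real environment after `load_dotenv()` by calling `missing_vars(os.environ)`, which avoids printing any credential values.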



## Connect to the database

We will now establish a connection to Snowflake using a SnowflakeConnector; these connectors will use your environment variables (set above) for login credentials.

At this point we must define two databases in Snowflake:
1. Source: the Snowflake location where input data to PhenEx should come from
2. Destination (dest): the Snowflake location where output data from PhenEx should be written. The destination will be created if it does not exist.

Run this cell to connect to these databases; this cell will open up two browser tabs (if you're using browser authentication). After those pages load (wait for them to say completed!), close them and return to this notebook.

In [None]:
%%capture
from phenex.ibis_connect import SnowflakeConnector

con = SnowflakeConnector(
    SNOWFLAKE_SOURCE_DATABASE = 'OPTUM_CLAIMS_OMOP.CDM',
    SNOWFLAKE_DEST_DATABASE = 'PROJECTS_IEG.GMEMA_PHENEX_DEV_TEST'
)

Notice that both of these locations can also be specified using environment variables (as we did in Method 1/2 for credentials), and vice versa: credentials can be passed to a connector as keyword arguments rather than being kept in the .env file. However, because credentials generally remain the same between projects while database locations are project dependent, it is best practice to define database locations with the connector.


## Tell PhenEx about the input data structure

PhenEx is designed to be data model agnostic, i.e. it does not require you to transform your data. However, PhenEx does need to know a little about the structure of the input data in order to help us make phenotypes and cohorts.

This means PhenEx needs to know in which table and column to find information such as patient ID, year of birth, diagnosis events, etc. This information is generally present in all real-world data (RWD) sources, but each data source (1) organizes it differently and (2) can use different column names.

When using a new data source, we need to onboard that database (once!) for usage with PhenEx (i.e. tell it about table structure and column names). Go to the [tutorial on onboarding a new database](/2_Onboarding_New_Database.ipynb) to learn how to onboard a database.
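To make the idea of onboarding concrete, here is a minimal plain-Python sketch of a column mapping. It is not PhenEx's implementation: the canonical names (`PERSON_ID`, `YEAR_OF_BIRTH`), the second source's column names, and the helper `to_canonical` are assumptions chosen for illustration. Only `person_id` and `year_of_birth` are real OMOP column names.

```python
# Two toy sources that store the same facts under different column names.
source_a_rows = [{"person_id": 1, "year_of_birth": 1950}]   # OMOP-style names
source_b_rows = [{"PATID": "A7", "BIRTH_YR": 1962}]          # made-up names

# Hypothetical per-source mappings: canonical name -> source column name.
SOURCE_A_COLUMNS = {"PERSON_ID": "person_id", "YEAR_OF_BIRTH": "year_of_birth"}
SOURCE_B_COLUMNS = {"PERSON_ID": "PATID", "YEAR_OF_BIRTH": "BIRTH_YR"}


def to_canonical(rows, column_map):
    """Return copies of rows keyed by canonical names instead of source names."""
    return [
        {canonical: row[source] for canonical, source in column_map.items()}
        for row in rows
    ]


# After mapping, downstream code can treat both sources identically.
patients = (
    to_canonical(source_a_rows, SOURCE_A_COLUMNS)
    + to_canonical(source_b_rows, SOURCE_B_COLUMNS)
)
print(patients)
```

The mapping is written once per data source, which is why onboarding only has to happen once.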

For the purposes of this tutorial, we will be using OMOP data, which is already onboarded and available in the PhenEx library. All we have to do is import the OMOPDomains and then get the mapped tables.

In [None]:
from phenex.mappers import OMOPDomains
omop_mapped_tables = OMOPDomains.get_mapped_tables(con)

### Looking at input data
PhenEx bundles all input data from a single data source into a Python dictionary, in this case in the variable called omop_mapped_tables. After this step, we no longer need to deal with input data directly; everything for this data source is available in the omop_mapped_tables dictionary.

The dictionary keys are the names of the 'domains' within OMOP. Let's look at the domains available:

In [None]:
list(omop_mapped_tables.keys())

We see that there are several domains. From now on, we use these keys to tell PhenEx which table to use.

We can also explore these tables interactively; just access the value (table) of the domain you are interested in.

In [None]:
omop_mapped_tables['PERSON']

# Building a cohort
We are now ready to build a cohort using the OMOP data we now have available to PhenEx.

## Step 1 : Define an Entry criterion
The entry criterion is the phenotype that defines the index date of your cohort. 

**Note on index dates:** The concept of an index date comes from prospective clinical trials; put simply, it is the date on which the patient enters the clinical trial, i.e. day 0 of data collection. With real-world data sources, we generally perform retrospective studies on data that was recorded in the past. Regardless, it is standard practice in observational studies to define an index date for each patient, and the index date is defined by some medical event or phenotypic feature of each patient.

Here we will create a cohort that has an index date set at the 'date of first instance of atrial fibrillation diagnosis' for each patient. See the CodelistPhenotype tutorial to learn more about how to define codelist phenotypes.

In [None]:
from phenex.phenotypes.codelist_phenotype import CodelistPhenotype
from phenex.codelists.codelists import Codelist

af_codelist = Codelist([313217])
entry = CodelistPhenotype(
    name='af',
    domain='CONDITION_OCCURRENCE',
    codelist=af_codelist,
    use_code_type=False,
    return_date='first',
)
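Conceptually, a codelist phenotype with `return_date='first'` keeps, for each patient, the earliest event whose code is in the codelist. The sketch below shows that logic on toy data in plain Python. It is an illustration only, not PhenEx internals; the rows, the unrelated code 999999, and the helper `first_event_dates` are made up.

```python
from datetime import date

AF_CONCEPT_IDS = {313217}  # same code as in af_codelist above

# Toy rows standing in for CONDITION_OCCURRENCE: (person_id, code, event_date).
events = [
    (1, 313217, date(2020, 5, 1)),
    (1, 313217, date(2019, 3, 15)),
    (2, 999999, date(2021, 1, 10)),  # unrelated, made-up code
    (3, 313217, date(2022, 7, 4)),
]


def first_event_dates(rows, codes):
    """Map each person to the earliest date of an event whose code is in codes."""
    first = {}
    for person_id, code, event_date in rows:
        if code not in codes:
            continue
        if person_id not in first or event_date < first[person_id]:
            first[person_id] = event_date
    return first


# Patient 2 has no matching code and therefore gets no index date.
index_dates = first_event_dates(events, AF_CONCEPT_IDS)
print(index_dates)
```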

Once we've created our phenotype, we're ready to move on. However, if you want to, you can already execute and see the output of a single phenotype (though this is not necessary). This is helpful for sanity checking the construction of your phenotypes, or seeing if any patients at all fulfill the phenotypic criteria you entered.

In [None]:
entry.execute(omop_mapped_tables)
entry.table.head(5).to_pandas()

## Step 2 : Define inclusion criteria (optional)
Next we define additional phenotypic features a patient must have in order to be part of our cohort. These usually appear as a list in study definitions. For example, here we require that patients be 18 or older at index.

In PhenEx, we create a phenotype for each inclusion criterion using the provided phenotype classes, and then collect these phenotypes in a list. We will later pass this list to the cohort.

In [None]:
from phenex.phenotypes.age_phenotype import AgePhenotype
from phenex.filters import GreaterThanOrEqualTo

age_ge18 = AgePhenotype(anchor_phenotype=entry, min_age=GreaterThanOrEqualTo(18))

Remember that we can check the results of a phenotype directly (though this is NOT required).

In [None]:
age_ge18.execute(omop_mapped_tables)

Finally, create the list of inclusion phenotypes:

In [None]:
inclusions = [age_ge18]
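The age criterion can be pictured as a per-patient comparison of age at index against a threshold. The sketch below uses toy data and approximates age as index year minus birth year, which is an assumption made for simplicity; PhenEx's own age calculation may use more precise birth dates.

```python
from datetime import date

# Toy inputs: year of birth and index date per person.
year_of_birth = {1: 1950, 2: 2005, 3: 2004}
index_dates = {
    1: date(2019, 3, 15),
    2: date(2021, 1, 10),
    3: date(2022, 7, 4),
}

MIN_AGE = 18  # mirrors GreaterThanOrEqualTo(18) above


def age_at_index(person_id):
    """Approximate age in years as index year minus birth year."""
    return index_dates[person_id].year - year_of_birth[person_id]


ages = {pid: age_at_index(pid) for pid in index_dates}
meets_age_criterion = {pid: age >= MIN_AGE for pid, age in ages.items()}

print(ages)
print(meets_age_criterion)
```

Note that the comparison is inclusive, so a patient who is exactly 18 at index satisfies the criterion.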

## Step 3 : Define exclusion criteria (optional)
Cohort definitions often have a list of things patients should **not** have to be considered part of our study cohort. These are exclusion criteria. PhenEx handles these similarly to inclusion criteria; simply create the individual phenotypes that patients should not have and bundle them together in a list called 'exclusions'.

Here we create a slightly more complicated phenotype: we exclude all patients who had an emergency room visit for myocardial infarction (MI) in the 90 days before their index date.

In [None]:

from phenex.filters.value import Value
from phenex.filters.categorical_filter import CategoricalFilter
from phenex.filters.relative_time_range_filter import RelativeTimeRangeFilter

# define 'emergency room visit'
emergency_room = CategoricalFilter(
    column_name='VISIT_DETAIL_SOURCE_VALUE',
    allowed_values=['22'],
    domain='VISIT_DETAIL'
)

# define time period pre-index to search for MI code
preindex = RelativeTimeRangeFilter(max_days=Value('<', 90), anchor_phenotype=entry)

# define MI codes of interest
mi_codelist = Codelist([49601007])

# create exclusion phenotype
mi_emergency_preindex = CodelistPhenotype(
    name='mi_emergency_preindex',
    domain='CONDITION_OCCURRENCE',
    codelist=mi_codelist,  # use the MI codelist, not the AF codelist
    use_code_type=False,
    return_date='first',
    categorical_filter=emergency_room,
    relative_time_range=preindex
)

As before, we can run this immediately for sanity checking (not required).

In [None]:
mi_emergency_preindex.execute(omop_mapped_tables)
mi_emergency_preindex.table.head(5).to_pandas()

Create the final list of exclusion criteria:

In [None]:
exclusions = [mi_emergency_preindex]
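The relative time range used above can be read as a condition on the number of days between an event and the index date. The sketch below spells that out in plain Python. The upper bound mirrors `Value('<', 90)`; the lower bound of 0 days (events on the index date count as pre-index) is an assumption for this illustration, and `in_preindex_window` is a hypothetical helper, not a PhenEx function.

```python
from datetime import date

MAX_DAYS = 90  # mirrors Value('<', 90) above


def in_preindex_window(event_date, index_date, max_days=MAX_DAYS):
    """True if the event falls 0 to max_days - 1 days before the index date."""
    days_before_index = (index_date - event_date).days
    return 0 <= days_before_index < max_days


index_date = date(2020, 6, 1)

print(in_preindex_window(date(2020, 5, 1), index_date))   # 31 days before index
print(in_preindex_window(date(2020, 3, 3), index_date))   # exactly 90 days before
print(in_preindex_window(date(2020, 6, 15), index_date))  # after index
```

Because the comparison is strict, an event exactly 90 days before index falls outside the window; using `'<='` instead would include it.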

## Step 4 : Define baseline characteristics (optional)
We are often interested in characterizing patients at index date. For example, we could be interested in knowing how old they are at index. 

In PhenEx, we simply create a list of phenotypes, identically to inclusion and exclusion criteria.

**Note:** inclusion criteria, exclusion criteria, and baseline characteristics are defined at or before the index date; they should not use 'future' data (i.e. data after the index date). PhenEx does NOT currently check whether future data is being used. It is up to the user to design phenotypes with appropriate relative time range filters that only look at the pre-index period.
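Because this check is left to the user, it can help to verify it on the output of a phenotype. The sketch below flags events dated after a patient's index date. The data is a toy example and `future_events` is a hypothetical helper; in practice the rows would come from the executed phenotype's table.

```python
from datetime import date

index_dates = {1: date(2020, 6, 1), 2: date(2021, 2, 1)}

# Toy events that a baseline phenotype matched: (person_id, event_date).
matched_events = [
    (1, date(2020, 5, 20)),  # before index: allowed
    (1, date(2020, 6, 1)),   # on the index date: allowed
    (2, date(2021, 3, 10)),  # after index: future data
]


def future_events(rows, index_by_person):
    """Return the rows dated strictly after that person's index date."""
    return [
        (person_id, event_date)
        for person_id, event_date in rows
        if event_date > index_by_person[person_id]
    ]


leaks = future_events(matched_events, index_dates)
print(leaks)
```

An empty result means the phenotype only used data at or before index for these patients.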

In this example, we are simply interested in the age of patients at index date.

In [None]:
from phenex.phenotypes.age_phenotype import AgePhenotype

age = AgePhenotype(anchor_phenotype=entry)
characteristics = [age]

In [None]:
age.execute(omop_mapped_tables)

## Step 5: Build the cohort
In this step we take all the pieces we defined above and put them together into a cohort. Simply instantiate a cohort, give it a name, and pass it the entry criterion, inclusions, exclusions, and baseline characteristic phenotypes defined above.

In [None]:
from phenex.phenotypes.cohort import Cohort

cohort = Cohort(
    name = 'af',
    entry_criterion=entry,
    inclusions=inclusions,
    exclusions=exclusions,
    characteristics=characteristics
)

We execute the full cohort, passing it the tables it requires.

In [None]:
cohort.execute(omop_mapped_tables)

## Viewing cohort summary
After execution of the cohort, PhenEx has created a number of standard tables and can produce some standard readouts.

### Tables created
1. index table : contains patients that fulfill all inclusion and exclusion criteria, with index date
2. inclusion table : a feature table with only inclusion criteria (rows are patients, columns are inclusion criteria)
3. exclusion table : a feature table with only exclusion criteria (rows are patients, columns are exclusion criteria)
4. characteristics table : a feature table of all baseline characteristics
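The relationship between the feature tables and the index table can be sketched with toy data: a patient stays in the cohort only if every inclusion criterion is true and no exclusion criterion is true. The column names and the helper `final_cohort` below are illustrative assumptions, not the actual layout of PhenEx's tables.

```python
# Toy feature tables: one entry per patient who met the entry criterion.
inclusion_table = {
    1: {"age_ge18": True},
    2: {"age_ge18": False},
    3: {"age_ge18": True},
}
exclusion_table = {
    1: {"mi_emergency_preindex": False},
    2: {"mi_emergency_preindex": False},
    3: {"mi_emergency_preindex": True},
}


def final_cohort(inclusions, exclusions):
    """Keep patients who meet every inclusion and no exclusion criterion."""
    return sorted(
        pid
        for pid in inclusions
        if all(inclusions[pid].values())
        and not any(exclusions.get(pid, {}).values())
    )


# Patient 2 fails the inclusion criterion; patient 3 meets an exclusion criterion.
index_patients = final_cohort(inclusion_table, exclusion_table)
print(index_patients)
```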

### Reports can be used to produce readouts

In [None]:
cohort.characteristics_table.head(5).to_pandas()

In [None]:
cohort.table1
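The exact layout of `cohort.table1` is not shown in this tutorial. As a rough idea of the kind of readout a baseline summary contains, the sketch below computes a few statistics for a numeric characteristic using only the standard library. The ages are toy values and the `summary` dictionary is an illustrative format, not PhenEx output.

```python
from statistics import mean, median

# Toy characteristics column: age at index for five cohort members.
ages = [54, 61, 67, 72, 80]

summary = {
    "N": len(ages),
    "mean": round(mean(ages), 1),
    "median": median(ages),
    "min": min(ages),
    "max": max(ages),
}

for statistic, value in summary.items():
    print(f"{statistic}: {value}")
```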