# Data exploration

This notebook explores the provided data to discover interesting conditions to work with.

First, we imported the libraries used:

In [45]:
from sqlite3 import connect
import numpy as np
import pandas as pd
import os
import re

Then, we set the paths of the tables relevant to our study (Patients, Encounters, Conditions and Procedures) and opened an in-memory SQLite database to hold them.

In [46]:
conditions_path = '../data/raw/synthea/scenario1/conditions.csv'
encounters_path = '../data/raw/synthea/scenario1/encounters.csv'
patients_path = '../data/raw/synthea/scenario1/patients.csv'
procedures_path = '../data/raw/synthea/scenario1/procedures.csv'

In [47]:
conn = connect(':memory:')

The first rows of the Patients table can be seen below. It contains personal data for each synthetic patient, including birth and death dates, which will be used to infer death prognoses. The table also shows that scenario1 contains a total of 1174 patients.

In [48]:
patients_df = pd.read_csv(patients_path)
patients_df.to_sql('patients', conn, if_exists='replace')
patients_df

Unnamed: 0,Id,BIRTHDATE,DEATHDATE,SSN,DRIVERS,PASSPORT,PREFIX,FIRST,LAST,SUFFIX,...,BIRTHPLACE,ADDRESS,CITY,STATE,COUNTY,ZIP,LAT,LON,HEALTHCARE_EXPENSES,HEALTHCARE_COVERAGE
0,7b3c738d-3f86-58e3-450e-4018521d192f,2021-08-11,,999-71-7790,,,,Svetlana462,O'Hara248,,...,Boston Massachusetts US,971 Ullrich Grove Suite 53,Boston,Massachusetts,Suffolk County,2135.0,42.304260,-71.045262,2.295091e+04,0.0000
1,c40c2c75-13c9-8e4a-047f-573ae1330157,2020-03-03,,999-49-5505,,,,Rosalyn434,Christiansen251,,...,Lawrence Massachusetts US,485 Senger Route Apt 34,Waltham,Massachusetts,Middlesex County,2451.0,42.340439,-71.206815,5.476973e+04,2624.1225
2,39e76039-522c-61f1-d961-e03dce5f0bb2,1998-11-21,2003-07-17,999-36-5991,,,,Lashawnda573,Daniel959,,...,North Reading Massachusetts US,944 Witting Passage,Everett,Massachusetts,Middlesex County,2149.0,42.412900,-71.009282,1.391483e+05,932.7450
3,ff981e00-4004-44b4-48ba-0bfa62440bb3,2009-07-29,,999-39-5303,,,,Ulysses632,Rice937,,...,Scituate Massachusetts US,414 Rempel Harbor,Cohasset,Massachusetts,Norfolk County,,42.214770,-70.814553,3.393595e+05,90.6000
4,648dbf91-4334-a7e2-2cfe-72e88dbd4c15,2000-12-04,,999-77-8366,S99996514,X32384432X,Mr.,Tory770,Harvey63,,...,Boston Massachusetts US,368 Pouros Ramp Apt 46,Haverhill,Massachusetts,Essex County,1832.0,42.779225,-71.055842,5.432674e+05,2173.6425
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1169,c815ffa4-2917-2c09-1569-e90d5a89eeb2,1933-05-23,1948-05-19,999-84-8049,,,,Olen518,Hammes673,,...,Arlington Massachusetts US,541 Lakin Promenade,Northampton,Massachusetts,Hampshire County,1053.0,42.284637,-72.589066,4.060467e+04,7632.8415
1170,b7fda58e-8eb1-a2d0-8fbf-57afefce34c6,1933-05-23,2003-12-13,999-57-9348,S99975005,X56712306X,Mr.,Edward499,Cole117,,...,Montague Massachusetts US,498 Shields Trafficway,Northampton,Massachusetts,Hampshire County,,42.354777,-72.593976,1.164761e+06,163012.4570
1171,1b46a593-3b14-53dd-1d2b-b59866e09a18,1933-05-23,2021-08-23,999-98-9276,S99942794,X86239063X,Mr.,Antoine384,Brekke496,,...,Andover Massachusetts US,1093 McClure Village Apt 71,Northampton,Massachusetts,Hampshire County,1062.0,42.405194,-72.755758,5.698680e+05,151240.1710
1172,5eeeb1e8-73c4-f7fa-cbb6-be359624f38e,1933-05-23,1975-02-25,999-47-8049,S99934661,X24881672X,Mr.,Britt177,Breitenberg711,,...,Hopkinton Massachusetts US,572 Sipes Ramp Apt 94,Northampton,Massachusetts,Hampshire County,1053.0,42.288684,-72.671682,1.087981e+05,128412.6970
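As a minimal sketch of how the birth and death dates could feed a death prognosis, the snippet below derives a deceased flag and an approximate age at death. The two rows are copied from the preview above; the `DECEASED` and `AGE_AT_DEATH` column names are assumptions made for illustration and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Two rows copied from the Patients preview above (one deceased, one alive).
sample = pd.DataFrame({
    'Id': ['39e76039-522c-61f1-d961-e03dce5f0bb2',
           'ff981e00-4004-44b4-48ba-0bfa62440bb3'],
    'BIRTHDATE': ['1998-11-21', '2009-07-29'],
    'DEATHDATE': ['2003-07-17', None],
})

birth = pd.to_datetime(sample['BIRTHDATE'])
death = pd.to_datetime(sample['DEATHDATE'])  # missing dates become NaT

# Hypothetical feature names, not used elsewhere in the notebook.
sample['DECEASED'] = death.notna()
sample['AGE_AT_DEATH'] = (death - birth).dt.days / 365.25  # NaN for living patients

print(sample[['DECEASED', 'AGE_AT_DEATH']])
```

Dividing by 365.25 only approximates the age in years, which should be enough for exploration.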


The next table is the Conditions table. It contains the description of each condition of a patient, as well as the standardized code of that condition, for each encounter (a visit of a patient to the medical care unit, either for an emergency or for a consultation). These can also be interesting as variables for prognostic inference.

In [49]:
conds_df = pd.read_csv(conditions_path)
conds_df.to_sql('conditions', conn, if_exists='replace')
conds_df.head()

Unnamed: 0,START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION
0,2021-12-26,2022-01-07,c40c2c75-13c9-8e4a-047f-573ae1330157,862287d1-a2eb-8116-de42-e76de6163260,195662009,Acute viral pharyngitis (disorder)
1,1999-05-12,,39e76039-522c-61f1-d961-e03dce5f0bb2,00681d7e-e9bb-392b-6ccb-209cf46e2ec3,128613002,Seizure disorder
2,1999-05-12,,39e76039-522c-61f1-d961-e03dce5f0bb2,00681d7e-e9bb-392b-6ccb-209cf46e2ec3,703151001,History of single seizure (situation)
3,2003-03-04,2003-03-17,39e76039-522c-61f1-d961-e03dce5f0bb2,7ac71954-b14b-f259-f032-bc07d5f4eaa6,43878008,Streptococcal sore throat (disorder)
4,2003-07-17,,39e76039-522c-61f1-d961-e03dce5f0bb2,705cabcc-6c94-1804-ee22-0e0b944ed62f,262574004,Bullet wound


Here we saved the counts of each condition in this scenario to the file `conditions_type_counts.csv` for visualization.

In [50]:
conds_df[['CODE', 'DESCRIPTION']].value_counts().to_csv('../data/interim/scenario1/conditions_type_counts.csv')
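To illustrate what the saved file holds, here is a small sketch on toy rows taken from the Conditions preview above. Naming the count column `COUNT` through `reset_index` is an assumption for readability; the cell above writes the counts without renaming them.

```python
import pandas as pd

# Toy rows: descriptions and codes are taken from the Conditions preview above.
toy = pd.DataFrame({
    'CODE': [195662009, 128613002, 195662009],
    'DESCRIPTION': [
        'Acute viral pharyngitis (disorder)',
        'Seizure disorder',
        'Acute viral pharyngitis (disorder)',
    ],
})

# value_counts on two columns counts each distinct (CODE, DESCRIPTION) pair,
# sorted from most to least frequent.
counts = toy[['CODE', 'DESCRIPTION']].value_counts().reset_index(name='COUNT')
print(counts)
```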

Upon visualizing the above file, we could extract some sets of conditions we found interesting.

The first set we explored was licit and illicit drug abuse, namely the conditions whose descriptions include the words `drug`, `alcohol`, `opioid` or `tobacco`. We found a total of 356 condition records.

In [51]:
pd.read_sql(
    '''
    SELECT * FROM conditions
    WHERE DESCRIPTION LIKE '%drug%' OR DESCRIPTION LIKE '%alcohol%' OR DESCRIPTION LIKE '%opioid%' OR DESCRIPTION LIKE '%tobacco%'
    '''
    , conn
)

Unnamed: 0,index,START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION
0,238,1997-06-03,1999-06-15,04e9ac90-2279-d293-523f-838b39d1e177,20917154-d6c5-6c59-6532-26aecd1165aa,10939881000119104,Unhealthy alcohol drinking behavior (finding)
1,292,2020-09-22,2021-09-28,5f2e1751-1689-2b18-f3df-1cf4a05b5439,e7d4fd73-34ae-a5f3-16a5-4c9ee897f8e9,10939881000119104,Unhealthy alcohol drinking behavior (finding)
2,711,2002-12-27,2014-02-14,53b07e50-426d-43bb-42d9-d175c2fb29f6,2b6ffb87-d675-a040-17e4-61d4a385af70,361055000,Misuses drugs (finding)
3,766,2022-04-01,,53b07e50-426d-43bb-42d9-d175c2fb29f6,7fc3b70a-fbb7-1b7d-6cc1-8ccf00a2da18,10939881000119104,Unhealthy alcohol drinking behavior (finding)
4,789,2021-11-26,,ea3c8180-450a-de48-559b-9d753d346507,3d40b5e1-2b58-b08b-5142-0dcc98c1147c,10939881000119104,Unhealthy alcohol drinking behavior (finding)
...,...,...,...,...,...,...,...
351,36139,1952-06-30,,c20cd05e-a058-b63b-aa3d-38325afdd690,6b443522-a92c-286c-4058-61b0bfc5c488,5602001,Opioid abuse (disorder)
352,36140,1952-06-30,,c20cd05e-a058-b63b-aa3d-38325afdd690,6b443522-a92c-286c-4058-61b0bfc5c488,449868002,Smokes tobacco daily
353,36241,2005-09-27,2009-10-20,2fd7bfe7-8c26-8be1-b074-efd95829f978,c684d175-c5e0-da6d-5946-147ef37623a1,10939881000119104,Unhealthy alcohol drinking behavior (finding)
354,36285,1995-08-01,1997-08-12,b7fda58e-8eb1-a2d0-8fbf-57afefce34c6,051bc787-cf22-09b9-bc4e-3a6f86c877b6,10939881000119104,Unhealthy alcohol drinking behavior (finding)
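The query above relies on SQLite's `LIKE`, which is case-insensitive for ASCII characters by default, so `%opioid%` also matches `Opioid abuse (disorder)`. The sketch below counts the matches of each keyword separately on a toy table, using only the standard `sqlite3` module. The table contents and the `keyword_counts` name are made up for illustration.

```python
import sqlite3

demo = sqlite3.connect(':memory:')
demo.execute('CREATE TABLE conditions (PATIENT TEXT, DESCRIPTION TEXT)')
demo.executemany(
    'INSERT INTO conditions VALUES (?, ?)',
    [
        ('p1', 'Unhealthy alcohol drinking behavior (finding)'),
        ('p2', 'Misuses drugs (finding)'),
        ('p3', 'Opioid abuse (disorder)'),
        ('p3', 'Smokes tobacco daily'),
        ('p4', 'Seizure disorder'),
    ],
)

keyword_counts = {}
for kw in ['drug', 'alcohol', 'opioid', 'tobacco']:
    # The pattern is passed as a bound parameter, not formatted into the SQL text.
    (n,) = demo.execute(
        'SELECT COUNT(*) FROM conditions WHERE DESCRIPTION LIKE ?',
        (f'%{kw}%',),
    ).fetchone()
    keyword_counts[kw] = n

print(keyword_counts)
```

Per-keyword counts make it easier to see which word contributes most of the matches before combining them with `OR`.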


We also checked for the word `lung`, which we found to match suspected or diagnosed lung cancer of different types, as seen below. Only 39 condition records are available.

In [52]:
pd.read_sql(
    '''
    SELECT * FROM conditions
    WHERE DESCRIPTION LIKE '%lung%'
    '''
    , conn
)

Unnamed: 0,index,START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION
0,4129,1998-09-18,,f0f321af-2b86-0e6f-de91-338279526b1d,c429e617-1318-4a48-1b27-a0958392f073,162573006,Suspected lung cancer (situation)
1,4130,1998-09-29,,f0f321af-2b86-0e6f-de91-338279526b1d,8afae70c-cfa6-3211-c050-2cdf8bbfe4b1,254637007,Non-small cell lung cancer (disorder)
2,4131,1998-10-01,,f0f321af-2b86-0e6f-de91-338279526b1d,071deb68-d763-535a-7d3e-546ec16bee01,424132000,Non-small cell carcinoma of lung TNM stage 1 ...
3,8080,1992-11-20,,958c3318-1cfe-1122-f4a3-bcf0fabaed05,4be89d3b-ba2f-5f22-b436-9e86ea670b9c,162573006,Suspected lung cancer (situation)
4,8081,1992-12-01,,958c3318-1cfe-1122-f4a3-bcf0fabaed05,98989e2e-2dc0-6f73-1574-6b96578aa45f,254637007,Non-small cell lung cancer (disorder)
5,8082,1992-12-02,,958c3318-1cfe-1122-f4a3-bcf0fabaed05,7bb9637a-25d2-53b9-0fbc-8f1bf10dee87,424132000,Non-small cell carcinoma of lung TNM stage 1 ...
6,14561,2016-05-24,,4e3e9e9c-ab95-c628-6fb6-ddecbd25ac95,d4bb0c88-5b8c-2340-ff4f-69e0692dc7a8,162573006,Suspected lung cancer (situation)
7,14562,2016-06-03,,4e3e9e9c-ab95-c628-6fb6-ddecbd25ac95,c3773393-bcea-98f2-7f62-c235a1b7ab00,254637007,Non-small cell lung cancer (disorder)
8,14563,2016-06-06,,4e3e9e9c-ab95-c628-6fb6-ddecbd25ac95,b87803bd-e8d6-f86c-9daf-2e5b3f72cc69,424132000,Non-small cell carcinoma of lung TNM stage 1 ...
9,15812,1987-08-11,,bfa71154-31e9-c514-2fbc-61c5ee112549,e8b55758-1d9f-5aa2-8fef-fd5abe83d9a9,162573006,Suspected lung cancer (situation)
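One way to check which descriptions a keyword actually matches is to group the matching rows by description. This is a sketch on a toy table, assuming the same `PATIENT` and `DESCRIPTION` columns as the Conditions table; the rows themselves are invented.

```python
import sqlite3

demo = sqlite3.connect(':memory:')
demo.execute('CREATE TABLE conditions (PATIENT TEXT, DESCRIPTION TEXT)')
demo.executemany(
    'INSERT INTO conditions VALUES (?, ?)',
    [
        ('p1', 'Suspected lung cancer (situation)'),
        ('p1', 'Non-small cell lung cancer (disorder)'),
        ('p2', 'Suspected lung cancer (situation)'),
        ('p3', 'Seizure disorder'),
    ],
)

# One row per distinct description, with the number of records and patients.
lung_summary = demo.execute(
    '''
    SELECT DESCRIPTION, COUNT(*) AS n_records, COUNT(DISTINCT PATIENT) AS n_patients
    FROM conditions
    WHERE DESCRIPTION LIKE '%lung%'
    GROUP BY DESCRIPTION
    ORDER BY n_records DESC, DESCRIPTION
    '''
).fetchall()

for row in lung_summary:
    print(row)
```

If a description unrelated to cancer appeared in this summary, the keyword would need to be narrowed.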


In this next step, we built per-patient features from three groups of conditions: those related to drug abuse (shown above, along with `overdose`), those related to domestic violence (`violence` and `partner` were found in the counts table to be keywords for those) and those related to anxiety (`anxiety`, `stress` and `depression`). For `overdose`, `partner` and `violence`, instead of only setting a flag, we stored the number of matching conditions for that patient, since it is logical to assume that each occurrence of such a condition corresponds to an episode.

In [53]:
def match_substring(sub_str, s):
    return sub_str.lower() in s.lower()

conds_df['OPIOID'] = conds_df['DESCRIPTION'].apply(lambda x: match_substring('opioid', x))
conds_df['ALCOHOL'] = conds_df['DESCRIPTION'].apply(lambda x: match_substring('alcohol', x))
conds_df['MISUSE'] = conds_df['DESCRIPTION'].apply(lambda x: match_substring('misuse', x))
conds_df['OVERDOSE'] = conds_df['DESCRIPTION'].apply(lambda x: match_substring('overdose', x))
conds_df['TOBACCO'] = conds_df['DESCRIPTION'].apply(lambda x: match_substring('tobacco', x))
conds_df['VIOLENCE'] = conds_df['DESCRIPTION'].apply(lambda x: match_substring('violence', x))
conds_df['PARTNER'] = conds_df['DESCRIPTION'].apply(lambda x: match_substring('partner abuse', x))
conds_df['DEPRESSION'] = conds_df['DESCRIPTION'].apply(lambda x: match_substring('depression', x))
conds_df['ANXIETY'] = conds_df['DESCRIPTION'].apply(lambda x: match_substring('anxiety', x))
conds_df['STRESS'] = conds_df['DESCRIPTION'].apply(lambda x: match_substring('stress', x))

drug_conds_df = conds_df.sort_values(by='START').groupby('PATIENT').aggregate({
    'OPIOID': lambda x: np.sum(x)>0,
    'ALCOHOL': lambda x: np.sum(x)>0,
    'MISUSE': lambda x: np.sum(x)>0,
    'OVERDOSE': lambda x: np.sum(x),
    'TOBACCO': lambda x: np.sum(x)>0,
    'VIOLENCE': lambda x: np.sum(x),
    'PARTNER': lambda x: np.sum(x),
    'DEPRESSION': lambda x: np.sum(x)>0,
    'ANXIETY': lambda x: np.sum(x)>0,
    'STRESS': lambda x: np.sum(x)>0,
})
drug_conds_df.to_sql('drug_conditions', conn, if_exists='replace')
drug_conds_df



PATIENT,OPIOID,ALCOHOL,MISUSE,OVERDOSE,TOBACCO,VIOLENCE,PARTNER,DEPRESSION,ANXIETY,STRESS
0000c650-47ba-eaf7-c2e6-ee06774a3d95,False,False,False,0,False,0,0,False,False,True
009c30e1-5fa4-f50c-dfa5-bf3092116bbb,False,False,False,0,False,0,1,False,False,True
00c2e638-e098-f9ff-3610-5af6f565616d,False,False,False,0,False,0,0,False,False,False
00fe47e4-40df-207e-e357-789d2aca44c1,False,False,False,0,False,0,0,False,False,True
014e204c-bd87-a7fd-0ee5-86296391a44d,False,False,False,0,False,0,0,False,False,True
...,...,...,...,...,...,...,...,...,...,...
fdaa221b-c4ee-7609-2cfc-1f034b15b3ef,False,False,False,0,False,0,0,False,True,True
fe0b3504-c5df-4691-055d-34ad7002a5d1,False,False,False,1,False,0,1,False,False,True
ff34b909-0ce6-1d5e-6b6d-9b96c75f08bb,False,False,False,0,False,2,2,False,False,True
ff981e00-4004-44b4-48ba-0bfa62440bb3,False,False,False,0,False,0,0,False,False,False
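The flag-versus-count distinction can also be expressed with pandas string methods and named aggregations instead of `apply` with lambdas. The following is a sketch of an alternative, not the code used above; the toy descriptions and the `flag_keywords` / `count_keywords` dictionaries are assumptions made for illustration.

```python
import pandas as pd

# Toy condition rows; the descriptions are illustrative only.
toy = pd.DataFrame({
    'PATIENT': ['p1', 'p1', 'p2', 'p2', 'p3'],
    'DESCRIPTION': [
        'Victim of intimate partner abuse (finding)',
        'Victim of intimate partner abuse (finding)',
        'Stress (finding)',
        'Opioid abuse (disorder)',
        'Seizure disorder',
    ],
})

flag_keywords = {'OPIOID': 'opioid', 'STRESS': 'stress'}  # presence only
count_keywords = {'PARTNER': 'partner abuse'}             # one row per episode

# Case-insensitive literal substring match, one boolean column per keyword.
for col, kw in {**flag_keywords, **count_keywords}.items():
    toy[col] = toy['DESCRIPTION'].str.contains(kw, case=False, regex=False)

# 'any' yields a flag per patient, 'sum' yields the number of matching rows.
agg_spec = {col: 'any' for col in flag_keywords}
agg_spec.update({col: 'sum' for col in count_keywords})
per_patient = toy.groupby('PATIENT').agg(agg_spec)
print(per_patient)
```

Keeping the two keyword dictionaries separate makes it explicit which columns are flags and which are episode counts.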


After generating that table, we computed the maximum value of each column. A single patient has at most 7 occurrences of violence-related conditions and 10 of partner-related conditions.

In [54]:
drug_conds_df.max(0)

OPIOID        True
ALCOHOL       True
MISUSE        True
OVERDOSE         1
TOBACCO       True
VIOLENCE         7
PARTNER         10
DEPRESSION    True
ANXIETY       True
STRESS        True
dtype: object

The number of patients having each of those conditions is shown below. Few patients have `depression` conditions, but other conditions, like `stress`, are common.

In [55]:
(drug_conds_df>0).sum(0)

OPIOID         20
ALCOHOL       164
MISUSE         82
OVERDOSE       49
TOBACCO         6
VIOLENCE      347
PARTNER       453
DEPRESSION      5
ANXIETY       111
STRESS        881
dtype: int64

The same was done for lung-related conditions. Only 13 patients were found to have such conditions.

In [56]:
def match_substring(sub_str, s):
    return sub_str.lower() in s.lower()

conds_df['LUNG'] = conds_df['DESCRIPTION'].apply(lambda x: match_substring('lung', x))

lung_conds_df = conds_df.sort_values(by='START').groupby('PATIENT').aggregate({
    'LUNG': lambda x: np.sum(x)>0,
})
lung_conds_df.to_sql('lung_conditions', conn, if_exists='replace')
lung_conds_df.head()



PATIENT,LUNG
0000c650-47ba-eaf7-c2e6-ee06774a3d95,False
009c30e1-5fa4-f50c-dfa5-bf3092116bbb,False
00c2e638-e098-f9ff-3610-5af6f565616d,False
00fe47e4-40df-207e-e357-789d2aca44c1,False
014e204c-bd87-a7fd-0ee5-86296391a44d,False


In [57]:
lung_conds_df['LUNG'].sum()

13

The same was done for heart-related conditions, namely `Cardiac Arrest` and `Myocardial Infarction`. Since the flags are aggregated per patient, the sums below count patients: 47 for the first condition and 30 for the second.

In [58]:
def match_substring(sub_str, s):
    return sub_str.lower() in s.lower()

conds_df['ARREST'] = conds_df['DESCRIPTION'].apply(lambda x: match_substring('Cardiac Arrest', x))
conds_df['INFARCTION'] = conds_df['DESCRIPTION'].apply(lambda x: match_substring('Myocardial Infarction', x))

cardiac_conds_df = conds_df.sort_values(by='START').groupby('PATIENT').aggregate({
    'ARREST': lambda x: np.sum(x)>0,
    'INFARCTION': lambda x: np.sum(x)>0,
})
cardiac_conds_df.to_sql('cardiac_conditions', conn, if_exists='replace')
cardiac_conds_df.head()



PATIENT,ARREST,INFARCTION
0000c650-47ba-eaf7-c2e6-ee06774a3d95,False,False
009c30e1-5fa4-f50c-dfa5-bf3092116bbb,False,False
00c2e638-e098-f9ff-3610-5af6f565616d,False,False
00fe47e4-40df-207e-e357-789d2aca44c1,False,False
014e204c-bd87-a7fd-0ee5-86296391a44d,False,False


In [59]:
cardiac_conds_df.sum(0)


ARREST        47
INFARCTION    30
dtype: int64
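Counting condition records and counting patients can give different numbers when a patient has the same condition more than once. The sketch below shows both counts on invented rows; the `n_condition_rows` and `n_patients` names are hypothetical.

```python
import pandas as pd

# Toy rows: patient p1 has two cardiac arrest records.
toy = pd.DataFrame({
    'PATIENT': ['p1', 'p1', 'p2', 'p3'],
    'DESCRIPTION': [
        'Cardiac Arrest',
        'Cardiac Arrest',
        'Myocardial Infarction',
        'Cardiac Arrest',
    ],
})

mask = toy['DESCRIPTION'].str.contains('cardiac arrest', case=False, regex=False)

n_condition_rows = int(mask.sum())               # records matching the keyword
n_patients = toy.loc[mask, 'PATIENT'].nunique()  # distinct patients with a match

print(n_condition_rows, n_patients)
```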

For concussion-related conditions, 92 patients were found.

In [60]:
def match_substring(sub_str, s):
    return sub_str.lower() in s.lower()

conds_df['CONCUSSION'] = conds_df['DESCRIPTION'].apply(lambda x: match_substring('concussion', x))

concussion_conds_df = conds_df.sort_values(by='START').groupby('PATIENT').aggregate({
    'CONCUSSION': lambda x: np.sum(x)>0,
})
concussion_conds_df.to_sql('concussion_conditions', conn, if_exists='replace')
concussion_conds_df.head()



PATIENT,CONCUSSION
0000c650-47ba-eaf7-c2e6-ee06774a3d95,False
009c30e1-5fa4-f50c-dfa5-bf3092116bbb,False
00c2e638-e098-f9ff-3610-5af6f565616d,False
00fe47e4-40df-207e-e357-789d2aca44c1,False
014e204c-bd87-a7fd-0ee5-86296391a44d,False


In [61]:
concussion_conds_df.sum(0)

CONCUSSION    92
dtype: int64

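The next step was to count, for each set of conditions of interest, the patients with a registered death date. For the stress/drug set, many of the deceased patients had stress or domestic violence conditions: 128 had a stress condition, and the `VIOLENCE` and `PARTNER` sums (88 and 158) are also high. Note that these two columns hold episode counts, so their sums are episode totals among deceased patients rather than patient counts.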

In [62]:
patient_drugs = pd.read_sql(
    '''
    SELECT p.Id, p.birthdate, p.deathdate, d.OPIOID, d.ALCOHOL, d.MISUSE, d.OVERDOSE, d.TOBACCO, d.VIOLENCE, d.PARTNER, d.DEPRESSION, d.ANXIETY, d.STRESS
    FROM patients p
    JOIN drug_conditions d on p.Id = d.patient
    ''',
    conn
)
patient_drugs[~pd.isna(patient_drugs['DEATHDATE'])].sum(0)

Id            39e76039-522c-61f1-d961-e03dce5f0bb20d679725-4...
BIRTHDATE     1998-11-211973-04-201960-12-271973-04-201952-0...
DEATHDATE     2003-07-172005-08-192000-06-212011-12-161955-0...
OPIOID                                                        7
ALCOHOL                                                      31
MISUSE                                                       11
OVERDOSE                                                     10
TOBACCO                                                       2
VIOLENCE                                                     88
PARTNER                                                     158
DEPRESSION                                                    0
ANXIETY                                                      15
STRESS                                                      128
dtype: object
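Because `VIOLENCE` and `PARTNER` store episode counts, summing them over deceased patients adds up episodes. A sketch of how patients and episodes could be counted separately in SQL is given below, on a toy database built with the standard `sqlite3` module. The table contents and the result column names are assumptions for illustration.

```python
import sqlite3

demo = sqlite3.connect(':memory:')
demo.executescript(
    '''
    CREATE TABLE patients (Id TEXT, DEATHDATE TEXT);
    CREATE TABLE drug_conditions (PATIENT TEXT, VIOLENCE INTEGER);
    INSERT INTO patients VALUES ('p1', '2003-07-17'), ('p2', NULL), ('p3', '2011-12-16');
    INSERT INTO drug_conditions VALUES ('p1', 2), ('p2', 1), ('p3', 0);
    '''
)

# Comparisons evaluate to 0 or 1 in SQLite, so SUM over them counts rows.
violence_stats = demo.execute(
    '''
    SELECT
        SUM(d.VIOLENCE > 0) AS patients_with_violence,
        SUM(d.VIOLENCE > 0 AND p.DEATHDATE IS NOT NULL) AS deceased_with_violence,
        SUM(CASE WHEN p.DEATHDATE IS NOT NULL THEN d.VIOLENCE ELSE 0 END)
            AS episodes_among_deceased
    FROM patients p
    JOIN drug_conditions d ON p.Id = d.PATIENT
    '''
).fetchone()

print(violence_stats)
```

Dividing the number of deceased patients with a condition by the number of patients with that condition gives a mortality share that is comparable across conditions.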

On the other hand, only 13 patients with lung cancer conditions died, that is, all of the patients in this set.

In [16]:
patient_lungs = pd.read_sql(
    '''
    SELECT p.Id, p.birthdate, p.deathdate, l.LUNG
    FROM patients p
    JOIN lung_conditions l on p.Id = l.patient
    ''',
    conn
)
patient_lungs[~pd.isna(patient_lungs['DEATHDATE'])].sum(0)

Id           39e76039-522c-61f1-d961-e03dce5f0bb20d679725-4...
BIRTHDATE    1998-11-211973-04-201960-12-271973-04-201952-0...
DEATHDATE    2003-07-172005-08-192000-06-212011-12-161955-0...
LUNG                                                        13
dtype: object

Also, 15 patients who had a cardiac arrest and 25 who had a myocardial infarction have died.

In [18]:
patient_cardiacs = pd.read_sql(
    '''
    SELECT p.Id, p.birthdate, p.deathdate, l.ARREST, l.INFARCTION
    FROM patients p
    JOIN cardiac_conditions l on p.Id = l.patient
    ''',
    conn
)
patient_cardiacs[~pd.isna(patient_cardiacs['DEATHDATE'])].sum(0)

Id            39e76039-522c-61f1-d961-e03dce5f0bb20d679725-4...
BIRTHDATE     1998-11-211973-04-201960-12-271973-04-201952-0...
DEATHDATE     2003-07-172005-08-192000-06-212011-12-161955-0...
ARREST                                                       15
INFARCTION                                                   25
dtype: object

It can also be seen that 14 of the patients who had a concussion died.

In [20]:
patient_concussions = pd.read_sql(
    '''
    SELECT p.Id, p.birthdate, p.deathdate, l.CONCUSSION
    FROM patients p
    JOIN concussion_conditions l on p.Id = l.patient
    ''',
    conn
)
patient_concussions[~pd.isna(patient_concussions['DEATHDATE'])].sum(0)

Id            39e76039-522c-61f1-d961-e03dce5f0bb20d679725-4...
BIRTHDATE     1998-11-211973-04-201960-12-271973-04-201952-0...
DEATHDATE     2003-07-172005-08-192000-06-212011-12-161955-0...
CONCUSSION                                                   14
dtype: object

In [27]:
# Export every condition of the deceased patients for later inspection.
death_conditions = pd.read_sql(
    '''
    SELECT p.Id, p.birthdate, p.deathdate, c.DESCRIPTION, c.START
    FROM patients p
    JOIN conditions c on p.Id = c.patient
    ''',
    conn
)
death_conditions[~pd.isna(death_conditions['DEATHDATE'])].to_csv('../data/interim/scenario1/death_conditions.csv')