# Data Preparation 2

This notebook is the second data-preparation step. The admission Height and Weight measurements available for each ICUSTAY_ID are read from the CHARTEVENTS.db database created in the first notebook (DataPreparation_1) and loaded into pandas DataFrames, in order to fill the corresponding features in the main file MAIN_DF.csv. These features are selected by ITEMID. The Body Mass Index (BMI) is then computed from these two features.

This example illustrates a process that is automated further in the next notebook. The goal is to build a rich DataFrame of features for each admission.

In [1]:
import pandas as pd
import numpy as np
import sqlite3

Load the file created in DataPreparation_1.

In [2]:
admissions=pd.read_csv('MAIN_DF.csv')
admissions=admissions.drop(['Unnamed: 0'], axis=1)
admissions=admissions.reset_index(drop=True)
print(admissions.shape)
print('unique SUBJECT_ID:', admissions.SUBJECT_ID.nunique())
print('unique HADM_ID   :', admissions.HADM_ID.nunique())
print('unique ICUSTAY_ID:', admissions.ICUSTAY_ID.nunique())
admissions.head()

(48989, 25)
unique SUBJECT_ID: 36659
unique HADM_ID   : 46273
unique ICUSTAY_ID: 48989


Unnamed: 0,SUBJECT_ID,HADM_ID,ICUSTAY_ID,GENDER,AGE_AD,ADMITTIME,DISCHTIME,ADMISSION_TYPE,ADMISSION_LOCATION,DISCHARGE_LOCATION,...,LAST_CAREUNIT,FIRST_WARDID,LAST_WARDID,INTIME,OUTTIME,LOS,LOS_C,TIMEDELTA,TIMEDELTA_C,HOSPITAL_EXPIRE_FLAG
0,3,145834,211552,M,76.53,2101-10-20 19:08:00,2101-10-31 13:58:00,EMERGENCY,EMERGENCY ROOM ADMIT,SNF,...,MICU,12,12,2101-10-20 19:10:11,2101-10-26 20:43:09,6.0646,>4,10.78,>10,0
1,4,185777,294638,F,47.84,2191-03-16 00:28:00,2191-03-23 18:41:00,EMERGENCY,EMERGENCY ROOM ADMIT,HOME WITH HOME IV PROVIDR,...,MICU,52,52,2191-03-16 00:29:31,2191-03-17 16:46:31,1.6785,1-2,7.76,6-10,0
2,6,107064,228232,F,65.94,2175-05-30 07:15:00,2175-06-15 16:00:00,ELECTIVE,PHYS REFERRAL/NORMAL DELI,HOME HEALTH CARE,...,SICU,33,33,2175-05-30 21:30:54,2175-06-03 13:39:54,3.6729,2-4,16.36,>10,0
3,9,150750,220597,M,41.79,2149-11-09 13:06:00,2149-11-14 10:15:00,EMERGENCY,EMERGENCY ROOM ADMIT,DEAD/EXPIRED,...,MICU,15,15,2149-11-09 13:07:02,2149-11-14 20:52:14,5.3231,>4,4.88,3-6,1
4,11,194540,229441,F,50.15,2178-04-16 06:18:00,2178-05-11 19:00:00,EMERGENCY,EMERGENCY ROOM ADMIT,HOME HEALTH CARE,...,SICU,57,57,2178-04-16 06:19:32,2178-04-17 20:21:05,1.5844,1-2,25.53,>10,0


Define the connection to the database from which the desired instances are read.

In [3]:
connex = sqlite3.connect("data/CHARTEVENTS.db")

In [4]:
cur = connex.cursor()

Extraction of height in centimeters.

The D_ITEMS.csv file links each ITEMID to the measurement (variable or feature) it represents.

The ITEMID for height in centimeters is 226730.
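
As a rough sketch of how such a lookup could be done, the snippet below searches a small in-memory stand-in for D_ITEMS.csv by label. The `LABEL` column name and the label strings are assumptions made for illustration and are not copied from the real file; only the two ITEMID values 226730 and 226512 come from this notebook.

```python
import pandas as pd

# Toy stand-in for D_ITEMS.csv. The LABEL strings are illustrative assumptions.
d_items = pd.DataFrame({
    "ITEMID": [226730, 226512, 220045],
    "LABEL": ["Height (cm)", "Admission Weight (Kg)", "Heart Rate"],
})

# Case-insensitive search on the label to find candidate ITEMIDs.
mask = d_items["LABEL"].str.contains("height", case=False)
height_items = d_items.loc[mask, ["ITEMID", "LABEL"]]
print(height_items)
```

With the real file, the toy DataFrame would be replaced by `pd.read_csv('D_ITEMS.csv')`, and the candidate rows should be checked by eye before an ITEMID is chosen.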

In [5]:
ids = [226730]  # ITEMID for height in centimeters
ids = [str(item_id) for item_id in ids]  # avoid shadowing the built-in id()
str_matching = "(" + ",".join(ids) + ")"  # Build the SQL IN (...) list
print(str_matching)

(226730)


In [6]:
sql = "SELECT * FROM CHARTEVENTS_DB WHERE ITEMID IN " + str_matching + ";"
print('String of SQL   :', sql)
print('Object Execution:', cur.execute(sql))

String of SQL   : SELECT * FROM CHARTEVENTS_DB WHERE ITEMID IN (226730);
Object Execution: <sqlite3.Cursor object at 0x7f954a47f260>
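
The query above is built by string concatenation, which is safe here because the ITEMID list is hard-coded. A possible alternative, sketched below on an in-memory toy database, is to let sqlite3 bind the values through `?` placeholders. Only the table name CHARTEVENTS_DB and the column names come from this notebook; the reduced three-column schema and the variable names are assumptions.

```python
import sqlite3
import pandas as pd

# In-memory toy database with a reduced, assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CHARTEVENTS_DB (ICUSTAY_ID REAL, ITEMID INTEGER, VALUE REAL)"
)
conn.executemany(
    "INSERT INTO CHARTEVENTS_DB VALUES (?, ?, ?)",
    [(290505.0, 226730, 165.0), (290505.0, 226512, 74.5), (241249.0, 226730, 180.0)],
)

item_ids = [226730]
placeholders = ",".join("?" for _ in item_ids)  # one "?" per ITEMID
sql = f"SELECT * FROM CHARTEVENTS_DB WHERE ITEMID IN ({placeholders});"

# pandas passes params through to the sqlite3 driver, which binds the values.
df_height = pd.read_sql_query(sql, conn, params=item_ids)
print(df_height.shape)
conn.close()
```

The same pattern extends to several ITEMIDs, since the placeholder string grows with the list.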


In [7]:
df_226730 = pd.read_sql_query(sql, connex)
print(df_226730.shape)
df_226730.head()

(12015, 15)


Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,ICUSTAY_ID,ITEMID,CHARTTIME,STORETIME,CGID,VALUE,VALUENUM,VALUEUOM,WARNING,ERROR,RESULTSTATUS,STOPPED
0,443,34,144319,290505.0,226730,2191-02-23 11:25:00,2191-02-23 11:25:00,15173.0,165.0,165.0,cm,0,0,,
1,1981,36,165660,241249.0,226730,2134-05-16 10:58:00,2134-05-16 10:58:00,16223.0,180.0,180.0,cm,0,0,,
2,3010,107,174162,264253.0,226730,2122-05-15 23:40:00,2122-05-15 23:40:00,17114.0,168.0,168.0,cm,0,0,,
3,3406,109,147469,253139.0,226730,2141-06-11 21:27:00,2141-06-11 21:27:00,17248.0,152.0,152.0,cm,0,0,,
4,4328,109,131345,243978.0,226730,2141-09-05 22:15:00,2141-09-05 22:15:00,19937.0,150.0,150.0,cm,0,0,,


In [9]:
print('unique ICUSTAY_ID:', df_226730.ICUSTAY_ID.nunique())

unique ICUSTAY_ID: 12011


In [10]:
df_226730.isnull().sum()

ROW_ID              0
SUBJECT_ID          0
HADM_ID             0
ICUSTAY_ID          3
ITEMID              0
CHARTTIME           0
STORETIME           0
CGID                0
VALUE               0
VALUENUM            0
VALUEUOM            0
ERROR               0
RESULTSTATUS    12015
STOPPED         12015
dtype: int64

In [11]:
df_226730=df_226730[df_226730['WARNING']==0]
df_226730=df_226730[df_226730['ERROR']==0]
df_226730=df_226730.drop(['ROW_ID', 'ITEMID', 'STORETIME', 'CGID', 'VALUENUM', 'VALUEUOM', 'WARNING', 
                                'ERROR', 'RESULTSTATUS', 'STOPPED'], axis=1)

In [12]:
df_226730=df_226730.rename({'CHARTTIME':'T_HEIGHT', 'VALUE':'HEIGHT'}, axis='columns')
print(df_226730.shape)
df_226730.head()

(12015, 5)


Unnamed: 0,SUBJECT_ID,HADM_ID,ICUSTAY_ID,T_HEIGHT,HEIGHT
0,34,144319,290505.0,2191-02-23 11:25:00,165.0
1,36,165660,241249.0,2134-05-16 10:58:00,180.0
2,107,174162,264253.0,2122-05-15 23:40:00,168.0
3,109,147469,253139.0,2141-06-11 21:27:00,152.0
4,109,131345,243978.0,2141-09-05 22:15:00,150.0


Merge with the main table using a left join, so that every row of the main table is kept. The height table has no value for some ICUSTAY_ID, and those rows get NaN in the new columns.

In [13]:
admissions=pd.merge(admissions,df_226730,how='left',on=['SUBJECT_ID', 'HADM_ID','ICUSTAY_ID'])

In [14]:
admissions.shape

(48989, 27)
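
To make the effect of `how='left'` concrete, here is a minimal sketch on toy data (the rows are made up for illustration, not taken from MAIN_DF.csv). The `indicator=True` option adds a `_merge` column showing which rows found a match.

```python
import pandas as pd

keys = ["SUBJECT_ID", "HADM_ID", "ICUSTAY_ID"]

# Toy main table with three ICU stays.
main = pd.DataFrame({
    "SUBJECT_ID": [3, 34, 36],
    "HADM_ID": [145834, 144319, 165660],
    "ICUSTAY_ID": [211552, 290505, 241249],
})
# Toy height table: no measurement for the first stay.
height = pd.DataFrame({
    "SUBJECT_ID": [34, 36],
    "HADM_ID": [144319, 165660],
    "ICUSTAY_ID": [290505, 241249],
    "HEIGHT": [165.0, 180.0],
})

# Left join keeps all rows of main; unmatched rows get NaN in HEIGHT.
merged = pd.merge(main, height, how="left", on=keys, indicator=True)
print(merged[["ICUSTAY_ID", "HEIGHT", "_merge"]])
```

Rows flagged `left_only` are the stays without a height measurement, which is what the NaN counts in the next cell reflect.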

In [15]:
admissions.isnull().sum()

SUBJECT_ID                  0
HADM_ID                     0
ICUSTAY_ID                  0
GENDER                      0
AGE_AD                      0
ADMITTIME                   0
DISCHTIME                   0
ADMISSION_TYPE              0
ADMISSION_LOCATION          0
DISCHARGE_LOCATION          0
INSURANCE                   0
ETHNICITY                   0
DIAGNOSIS                   0
DBSOURCE                    0
FIRST_CAREUNIT              0
LAST_CAREUNIT               0
FIRST_WARDID                0
LAST_WARDID                 0
INTIME                      0
OUTTIME                     0
LOS                         0
LOS_C                       0
TIMEDELTA                   0
TIMEDELTA_C                 0
HOSPITAL_EXPIRE_FLAG        0
T_HEIGHT                37951
HEIGHT                  37951
dtype: int64

Extraction of admission weight in kilograms (ITEMID 226512).

In [16]:
ids = [226512]  # ITEMID for admission weight in kilograms
ids = [str(item_id) for item_id in ids]  # avoid shadowing the built-in id()
str_matching = "(" + ",".join(ids) + ")"  # Build the SQL IN (...) list
print(str_matching)

(226512)


In [17]:
sql = "SELECT * FROM CHARTEVENTS_DB WHERE ITEMID IN " + str_matching + ";"
print('String of SQL   :', sql)
print('Object Execution:', cur.execute(sql))

String of SQL   : SELECT * FROM CHARTEVENTS_DB WHERE ITEMID IN (226512);
Object Execution: <sqlite3.Cursor object at 0x7f954a47f260>


In [18]:
df_226512 = pd.read_sql_query(sql, connex)
print(df_226512.shape)
df_226512.head()

(22604, 15)


Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,ICUSTAY_ID,ITEMID,CHARTTIME,STORETIME,CGID,VALUE,VALUENUM,VALUEUOM,WARNING,ERROR,RESULTSTATUS,STOPPED
0,355,34,144319,290505.0,226512,2191-02-23 07:44:00,2191-02-23 07:44:00,17741.0,74.5,74.5,kg,0,0,,
1,8,23,124321,234044.0,226512,2157-10-21 12:15:00,2157-10-21 12:15:00,16978.0,66.8,66.8,kg,0,0,,
2,1978,36,165660,241249.0,226512,2134-05-16 10:58:00,2134-05-16 10:58:00,16223.0,106.2,106.2,kg,0,0,,
3,2029,85,112077,291697.0,226512,2167-07-25 21:31:00,2167-07-25 21:31:00,21050.0,98.0,98.0,kg,0,0,,
4,2505,107,182383,252542.0,226512,2121-12-01 05:54:00,2121-12-01 05:54:00,16526.0,88.6,88.6,kg,0,0,,


In [19]:
df_226512.isnull().sum()

ROW_ID              0
SUBJECT_ID          0
HADM_ID             0
ICUSTAY_ID          5
ITEMID              0
CHARTTIME           0
STORETIME           0
CGID                0
VALUE               0
VALUENUM            0
VALUEUOM            0
ERROR               0
RESULTSTATUS    22604
STOPPED         22604
dtype: int64

In [20]:
df_226512=df_226512[df_226512['WARNING']==0]
df_226512=df_226512[df_226512['ERROR']==0]
df_226512=df_226512.drop(['ROW_ID', 'ITEMID', 'STORETIME', 'CGID', 'VALUENUM', 'VALUEUOM', 'WARNING', 
                                'ERROR', 'RESULTSTATUS', 'STOPPED'], axis=1)

In [21]:
df_226512=df_226512.rename({'CHARTTIME':'TIME_M_WEIGHT', 'VALUE':'WEIGHT'}, axis='columns')
print(df_226512.shape)
df_226512.head()

(22604, 5)


Unnamed: 0,SUBJECT_ID,HADM_ID,ICUSTAY_ID,TIME_M_WEIGHT,WEIGHT
0,34,144319,290505.0,2191-02-23 07:44:00,74.5
1,23,124321,234044.0,2157-10-21 12:15:00,66.8
2,36,165660,241249.0,2134-05-16 10:58:00,106.2
3,85,112077,291697.0,2167-07-25 21:31:00,98.0
4,107,182383,252542.0,2121-12-01 05:54:00,88.6


In [22]:
admissions=pd.merge(admissions,df_226512,how='left',on=['SUBJECT_ID', 'HADM_ID', 'ICUSTAY_ID'])

In [23]:
admissions['BMI']=np.round((admissions['WEIGHT']/(admissions['HEIGHT']*admissions['HEIGHT']/10000)), 2)
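
The expression above is the usual BMI definition, weight in kilograms divided by the square of height in meters; dividing the squared height in centimeters by 10000 performs the unit conversion. A small worked sketch follows, where the helper name `bmi_from_cm` is hypothetical and the input values (50.0 kg, 142.0 cm) are those of one SUBJECT_ID 109 row in this notebook.

```python
import numpy as np

def bmi_from_cm(weight_kg, height_cm):
    """BMI in kg/m^2 from weight in kg and height in cm, rounded to 2 decimals."""
    height_m = height_cm / 100.0  # centimeters to meters
    return np.round(weight_kg / (height_m * height_m), 2)

print(bmi_from_cm(50.0, 142.0))    # 50.0 kg at 142.0 cm
print(bmi_from_cm(66.8, np.nan))   # a missing height propagates as NaN
```

Because NaN propagates through the division, BMI is missing whenever either measurement is missing, which is why the BMI column has at least as many nulls as HEIGHT or WEIGHT.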

In [24]:
print(admissions.shape)
admissions.head()

(48990, 30)


Unnamed: 0,SUBJECT_ID,HADM_ID,ICUSTAY_ID,GENDER,AGE_AD,ADMITTIME,DISCHTIME,ADMISSION_TYPE,ADMISSION_LOCATION,DISCHARGE_LOCATION,...,LOS,LOS_C,TIMEDELTA,TIMEDELTA_C,HOSPITAL_EXPIRE_FLAG,T_HEIGHT,HEIGHT,TIME_M_WEIGHT,WEIGHT,BMI
0,3,145834,211552,M,76.53,2101-10-20 19:08:00,2101-10-31 13:58:00,EMERGENCY,EMERGENCY ROOM ADMIT,SNF,...,6.0646,>4,10.78,>10,0,,,,,
1,4,185777,294638,F,47.84,2191-03-16 00:28:00,2191-03-23 18:41:00,EMERGENCY,EMERGENCY ROOM ADMIT,HOME WITH HOME IV PROVIDR,...,1.6785,1-2,7.76,6-10,0,,,,,
2,6,107064,228232,F,65.94,2175-05-30 07:15:00,2175-06-15 16:00:00,ELECTIVE,PHYS REFERRAL/NORMAL DELI,HOME HEALTH CARE,...,3.6729,2-4,16.36,>10,0,,,,,
3,9,150750,220597,M,41.79,2149-11-09 13:06:00,2149-11-14 10:15:00,EMERGENCY,EMERGENCY ROOM ADMIT,DEAD/EXPIRED,...,5.3231,>4,4.88,3-6,1,,,,,
4,11,194540,229441,F,50.15,2178-04-16 06:18:00,2178-05-11 19:00:00,EMERGENCY,EMERGENCY ROOM ADMIT,HOME HEALTH CARE,...,1.5844,1-2,25.53,>10,0,,,,,
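
The table now has 48990 rows, one more than the 48989 rows of the main file. This is consistent with at least one key matching more than one measurement row, since a left merge repeats the left row for every match. One possible guard, sketched on toy data below, is to keep a single measurement per key before merging and to ask pandas to check uniqueness with `validate`. Keeping the earliest measurement is an assumption made for illustration, not a choice made in this notebook.

```python
import pandas as pd

keys = ["SUBJECT_ID", "HADM_ID", "ICUSTAY_ID"]

main = pd.DataFrame({
    "SUBJECT_ID": [1, 2],
    "HADM_ID": [10, 20],
    "ICUSTAY_ID": [100, 200],
})
# Toy weights: ICU stay 100 has two measurements.
weight = pd.DataFrame({
    "SUBJECT_ID": [1, 1],
    "HADM_ID": [10, 10],
    "ICUSTAY_ID": [100, 100],
    "TIME_M_WEIGHT": ["2100-01-01 08:00:00", "2100-01-02 08:00:00"],
    "WEIGHT": [70.0, 71.5],
})

# Naive left merge: the stay with two measurements appears twice.
naive = pd.merge(main, weight, how="left", on=keys)

# Keep the earliest measurement per key (ISO timestamps sort chronologically).
first_weight = (
    weight.sort_values("TIME_M_WEIGHT")
          .drop_duplicates(subset=keys, keep="first")
)
# validate raises a MergeError if either side still has duplicate keys.
safe = pd.merge(main, first_weight, how="left", on=keys, validate="one_to_one")
print(len(naive), len(safe))
```

Checking `admissions.duplicated(subset=['ICUSTAY_ID']).sum()` after each merge is another simple way to spot the extra row.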


In [25]:
admissions[admissions['SUBJECT_ID']==23].head()

Unnamed: 0,SUBJECT_ID,HADM_ID,ICUSTAY_ID,GENDER,AGE_AD,ADMITTIME,DISCHTIME,ADMISSION_TYPE,ADMISSION_LOCATION,DISCHARGE_LOCATION,...,LOS,LOS_C,TIMEDELTA,TIMEDELTA_C,HOSPITAL_EXPIRE_FLAG,T_HEIGHT,HEIGHT,TIME_M_WEIGHT,WEIGHT,BMI
15,23,152223,227807,M,71.13,2153-09-03 07:15:00,2153-09-08 19:10:00,ELECTIVE,PHYS REFERRAL/NORMAL DELI,HOME HEALTH CARE,...,1.2641,1-2,5.5,3-6,0,,,,,
16,23,124321,234044,M,75.25,2157-10-18 19:34:00,2157-10-25 14:00:00,EMERGENCY,TRANSFER FROM HOSP/EXTRAM,HOME HEALTH CARE,...,1.1862,1-2,6.77,6-10,0,,,2157-10-21 12:15:00,66.8,


In [26]:
admissions[admissions['SUBJECT_ID']==109].head()

Unnamed: 0,SUBJECT_ID,HADM_ID,ICUSTAY_ID,GENDER,AGE_AD,ADMITTIME,DISCHTIME,ADMISSION_TYPE,ADMISSION_LOCATION,DISCHARGE_LOCATION,...,LOS,LOS_C,TIMEDELTA,TIMEDELTA_C,HOSPITAL_EXPIRE_FLAG,T_HEIGHT,HEIGHT,TIME_M_WEIGHT,WEIGHT,BMI
87,109,155726,254406,F,25.02,2142-08-13 04:03:00,2142-08-16 18:17:00,EMERGENCY,EMERGENCY ROOM ADMIT,HOME,...,1.744,1-2,3.59,3-6,0,,,2142-08-13 06:07:00,43.7,
88,109,125288,257134,F,24.28,2141-11-18 14:00:00,2141-11-23 16:42:00,EMERGENCY,EMERGENCY ROOM ADMIT,HOME,...,1.3151,1-2,5.11,3-6,0,,,2141-11-18 16:43:00,49.0,
89,109,126055,236124,F,24.19,2141-10-13 23:10:00,2141-11-03 18:45:00,EMERGENCY,EMERGENCY ROOM ADMIT,HOME,...,11.9014,>4,20.82,>10,0,,,2141-10-14 00:58:00,49.7,
90,109,172335,262652,F,24.12,2141-09-18 10:32:00,2141-09-24 13:53:00,EMERGENCY,EMERGENCY ROOM ADMIT,HOME HEALTH CARE,...,2.0418,2-4,6.14,6-10,0,2141-09-20 23:27:00,142.0,2141-09-20 23:27:00,50.0,24.8
91,109,113189,291270,F,24.52,2142-02-14 10:42:00,2142-02-17 18:15:00,EMERGENCY,EMERGENCY ROOM ADMIT,HOME,...,1.1068,1-2,3.31,3-6,0,,,2142-02-14 11:53:00,46.6,


In [27]:
admissions.isnull().sum()

SUBJECT_ID                  0
HADM_ID                     0
ICUSTAY_ID                  0
GENDER                      0
AGE_AD                      0
ADMITTIME                   0
DISCHTIME                   0
ADMISSION_TYPE              0
ADMISSION_LOCATION          0
DISCHARGE_LOCATION          0
INSURANCE                   0
ETHNICITY                   0
DIAGNOSIS                   0
DBSOURCE                    0
FIRST_CAREUNIT              0
LAST_CAREUNIT               0
FIRST_WARDID                0
LAST_WARDID                 0
INTIME                      0
OUTTIME                     0
LOS                         0
LOS_C                       0
TIMEDELTA                   0
TIMEDELTA_C                 0
HOSPITAL_EXPIRE_FLAG        0
T_HEIGHT                37952
HEIGHT                  37952
TIME_M_WEIGHT           27662
WEIGHT                  27662
BMI                     38045
dtype: int64

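Group by SUBJECT_ID and replace each patient's missing HEIGHT and WEIGHT values with the mean of that patient's available measurements, then recompute BMI. Patients with no measurement at all keep NaN.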

In [28]:
admissions['WEIGHT'] = admissions['WEIGHT'].groupby(admissions['SUBJECT_ID']).transform(lambda x: x.fillna(x.mean()))
admissions['HEIGHT'] = admissions['HEIGHT'].groupby(admissions['SUBJECT_ID']).transform(lambda x: x.fillna(x.mean()))
admissions['BMI']    = np.round((admissions['WEIGHT']/(admissions['HEIGHT']*admissions['HEIGHT']/10000)), 2)
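
The per-patient fill can be seen in isolation in the sketch below. The toy values and the `HEIGHT_FILLED` column name are assumptions made for illustration; the notebook overwrites HEIGHT and WEIGHT in place.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "SUBJECT_ID": [109, 109, 109, 23, 23, 3],
    "HEIGHT": [142.0, np.nan, 150.0, np.nan, np.nan, np.nan],
})

# Within each patient, fill missing values with that patient's own mean.
toy["HEIGHT_FILLED"] = (
    toy.groupby("SUBJECT_ID")["HEIGHT"]
       .transform(lambda s: s.fillna(s.mean()))
)
print(toy)
```

The patient with two measurements gets the gap filled with their mean, while patients whose values are all missing have a NaN mean and therefore stay NaN. This matches the null counts in the next cell, which drop but do not reach zero.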

In [29]:
admissions.isnull().sum()

SUBJECT_ID                  0
HADM_ID                     0
ICUSTAY_ID                  0
GENDER                      0
AGE_AD                      0
ADMITTIME                   0
DISCHTIME                   0
ADMISSION_TYPE              0
ADMISSION_LOCATION          0
DISCHARGE_LOCATION          0
INSURANCE                   0
ETHNICITY                   0
DIAGNOSIS                   0
DBSOURCE                    0
FIRST_CAREUNIT              0
LAST_CAREUNIT               0
FIRST_WARDID                0
LAST_WARDID                 0
INTIME                      0
OUTTIME                     0
LOS                         0
LOS_C                       0
TIMEDELTA                   0
TIMEDELTA_C                 0
HOSPITAL_EXPIRE_FLAG        0
T_HEIGHT                37952
HEIGHT                  33397
TIME_M_WEIGHT           27662
WEIGHT                  24260
BMI                     33398
dtype: int64

In [30]:
admissions[admissions['SUBJECT_ID']==23].head()

Unnamed: 0,SUBJECT_ID,HADM_ID,ICUSTAY_ID,GENDER,AGE_AD,ADMITTIME,DISCHTIME,ADMISSION_TYPE,ADMISSION_LOCATION,DISCHARGE_LOCATION,...,LOS,LOS_C,TIMEDELTA,TIMEDELTA_C,HOSPITAL_EXPIRE_FLAG,T_HEIGHT,HEIGHT,TIME_M_WEIGHT,WEIGHT,BMI
15,23,152223,227807,M,71.13,2153-09-03 07:15:00,2153-09-08 19:10:00,ELECTIVE,PHYS REFERRAL/NORMAL DELI,HOME HEALTH CARE,...,1.2641,1-2,5.5,3-6,0,,,,66.8,
16,23,124321,234044,M,75.25,2157-10-18 19:34:00,2157-10-25 14:00:00,EMERGENCY,TRANSFER FROM HOSP/EXTRAM,HOME HEALTH CARE,...,1.1862,1-2,6.77,6-10,0,,,2157-10-21 12:15:00,66.8,


In [35]:
admissions[admissions['SUBJECT_ID']==109].head()

Unnamed: 0,SUBJECT_ID,HADM_ID,ICUSTAY_ID,GENDER,AGE_AD,ADMITTIME,DISCHTIME,ADMISSION_TYPE,ADMISSION_LOCATION,DISCHARGE_LOCATION,...,LOS,LOS_C,TIMEDELTA,TIMEDELTA_C,HOSPITAL_EXPIRE_FLAG,TIME_M_HEIGHT,HEIGHT,TIME_M_WEIGHT,WEIGHT,BMI
87,109,155726,254406,F,25.02,2142-08-13 04:03:00,2142-08-16 18:17:00,EMERGENCY,EMERGENCY ROOM ADMIT,HOME,...,1.744,1-2,3.59,3-6,0,,150.8,2142-08-13 06:07:00,43.7,19.22
88,109,125288,257134,F,24.28,2141-11-18 14:00:00,2141-11-23 16:42:00,EMERGENCY,EMERGENCY ROOM ADMIT,HOME,...,1.3151,1-2,5.11,3-6,0,,150.8,2141-11-18 16:43:00,49.0,21.55
89,109,126055,236124,F,24.19,2141-10-13 23:10:00,2141-11-03 18:45:00,EMERGENCY,EMERGENCY ROOM ADMIT,HOME,...,11.9014,>4,20.82,>10,0,,150.8,2141-10-14 00:58:00,49.7,21.86
90,109,172335,262652,F,24.12,2141-09-18 10:32:00,2141-09-24 13:53:00,EMERGENCY,EMERGENCY ROOM ADMIT,HOME HEALTH CARE,...,2.0418,2-4,6.14,6-10,0,2141-09-20 23:27:00,142.0,2141-09-20 23:27:00,50.0,24.8
91,109,113189,291270,F,24.52,2142-02-14 10:42:00,2142-02-17 18:15:00,EMERGENCY,EMERGENCY ROOM ADMIT,HOME,...,1.1068,1-2,3.31,3-6,0,,150.8,2142-02-14 11:53:00,46.6,20.49


In [31]:
admissions.to_csv('MAIN_DF_2.csv')