# Data selection

In [1]:
# Import
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
from pandas_profiling import ProfileReport
import os
%load_ext autoreload
%autoreload 2

## 1. Load data

In [2]:
# Load file with descriptions of data variables
data_description_folder = './data/'
data_description = pd.read_csv(data_description_folder + 'IDDO_SDTM_Data-Dictionary_v3.0_2022-10-06.csv', sep=';', encoding="ISO-8859-1")
data_description.columns = data_description.columns.str.replace(' ', '_')
data_description.head(3)

Unnamed: 0,Domain,Domain_Name,Variable_Name,Variable_Label,Variable_Type,Variable_Definition,Controlled_Terminology?
0,AU,Audiometry Test Results,STUDYID,Study Identifier,character,This variable contains the unique identifier f...,
1,AU,Audiometry Test Results,DOMAIN,Domain Abbreviation,character,This variable contains the two-character abbre...,Y
2,AU,Audiometry Test Results,USUBJID,Unique Subject Identifier,character,This variable contains the unique subject iden...,


In [3]:
# Load data
df_list = []
data_folder = './data/DATA_2022-09-01/'
print('Names of DataFrames ---> Description :')
for f in os.listdir(data_folder):
    if f != 'IN_2022-09-01.csv':
        df_name_str = f.split('_')[0]
        df_list.append(df_name_str)
        df_name = df_name_str
        locals()[df_name] = pd.read_csv(data_folder + f, sep=',', low_memory=False) 
        locals()[df_name].name = df_name_str
        print(df_name_str, '---> ', data_description.loc[data_description.Domain==df_name_str, 'Domain_Name'].iloc[0])
    else:
        #df_name_str = f.split('_')[0]
        #df_list.append(df_name_str)
        #df_name = df_name_str
        #mylist = []
        #for chunk in  pd.read_csv(data_folder + f, sep=',', low_memory=False, chunksize=2000):
            #mylist.append(chunk)
        #locals()[df_name] = pd.concat(mylist, axis= 0)
        #locals()[df_name].name = df_name_str
        #del mylist
        print('error')

Names of DataFrames ---> Description :
DM --->  Demographics
DS --->  Disposition
ER --->  Environmental Risk
HO --->  Healthcare Encounters
IE --->  Inclusion/Exclusion Criteria
error
LB --->  Laboratory Results
MB --->  Microbiology Specimen
PO --->  Pregnancy Outcomes
RELREC --->  Related Records
RP --->  Reproductive System Findings
RS --->  Disease Response and Clinical Classification
SA --->  Clinical and Adverse Events
SC --->  Subject Characteristics
SV --->  Subject Visits
TI --->  Trial Inclusion Exclusion Criteria
TS --->  Trial Summary
TV --->  Trial Visits
VS --->  Vital Signs

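The loop above creates one variable per file through `locals()` and attaches a `.name` attribute to each DataFrame. A minimal alternative sketch, assuming the same `<DOMAIN>_<date>.csv` file naming, keeps the tables in a dictionary keyed by domain. The function `load_domains` and its `skip` parameter are hypothetical names introduced here for illustration; they are not part of the notebook.

```python
import os
import pandas as pd

def load_domains(folder, skip=()):
    """Read every '<DOMAIN>_<date>.csv' file in `folder` into a dict keyed by domain."""
    domains = {}
    for filename in sorted(os.listdir(folder)):
        if not filename.endswith(".csv"):
            continue
        # The domain abbreviation is the part of the file name before the first underscore.
        domain = filename.split("_")[0]
        if domain in skip:
            continue
        domains[domain] = pd.read_csv(os.path.join(folder, filename), low_memory=False)
    return domains
```

With this sketch, `domains = load_domains('./data/DATA_2022-09-01/', skip={'IN'})` would give access to the demographics table as `domains['DM']`, and the dictionary key plays the role of the `.name` attribute.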

## 2. Select data

#### For DM:

In [11]:
# See variables for DM
pd.set_option("display.max_colwidth", 1000)
data_description.loc[data_description.Domain==DM.name, ['Variable_Name', 'Variable_Type', 'Variable_Definition']]

Unnamed: 0,Variable_Name,Variable_Type,Variable_Definition
130,STUDYID,character,This variable contains the unique identifier for a study. This is the main key/identifier for all domains in the IDDO Data Repository  every domain table will have the STUDYID identifier.
131,DOMAIN,character,This variable contains the two-character abbreviation for the domain.
132,USUBJID,character,"This variable contains the unique subject identifier for a study. This is a secondary key/identifier for all subject-level domains in the IDDO Data Repository  every domain table containing subject-level information (i.e., all but the Trial Domains) will have the USUBJID identifier. This variable will identify unique subjects in the repository."
133,SUBJID,character,This variable contains the unique subject identifier provided by the data contributor.
134,RFSTDTC,character,"This variable describes the date and time of the start of the Subject Reference Period. The Subject Reference Period is defined by IDDO as starting with the subject's first study encounter and ending with the subject's final study encounter. RFSTDTC corresponds with the time and date of the subject's first study encounter (e.g., screening, enrollment, admission). This date will be used to calculate the relative days in the --DY?, --STDY, --ENDY variables. This date and time will be provided in ISO 8601 format. This variable will be blank for submissions that do not provide this initial date. All of the derived variables will also be blank since they are all calculated based on RFSTDTC."
135,DTHDTC,character,"This variable describes the date and time of the collection of the observation, administration of a test or collection of a specimen. This date and time will be provided in ISO 8601 format."
136,DTHFL,character,"This variable contains information about whether the subject died during the study period. The variable is expected to be null if the choice is not ""Yes""."
137,SITEID,character,This variable contains information about the study site.
138,INVID,character,This variable contains the unique investigator identifier of the data contributor. This may be used for COVID-19 data where many separate investigators have contributed data to a single large study.
139,INVNAM,character,This variable contains the clinical trial registry number associated with the subjects record.


In [12]:
# Combine AGE and AGETXT

print('Values for AGETXT: ', DM['AGETXT'].unique())
print('Min and Max for AGE: ', DM['AGE'].min(), DM['AGE'].max())

DM.loc[DM.AGETXT=='95+', 'AGE'] = 96. # We make the choice to put 96 for 95+

print('Min and Max for AGE after: ', DM['AGE'].min(), DM['AGE'].max())

Values for AGETXT:  [nan '95+']
Min and Max for AGE:  -70.0 94.089996
Min and Max for AGE after:  -70.0 96.0


In [16]:
# Investigate and replace negative values for AGE

print('Values for AGEU: ', DM['AGEU'].unique())
print('Number of negative values for AGE: ', (DM['AGE']<0).sum())
print('Number of NA for AGE: ', DM['AGE'].isna().sum())

DM.loc[DM.AGE<0, 'AGE'] = np.nan # We make the choice to replace negative values by NA

print('Number of NA for AGE after: ', DM['AGE'].isna().sum())

Values for AGEU:  ['YEARS' nan 'MONTHS' 'DAYS']
Number of negative values for AGE:  8
Number of NA for AGE:  22493
Number of NA for AGE after:  22501


In [30]:
# Combine AGE and AGEU

print('Values for AGEU: ', DM['AGEU'].unique())
print('Number of NA for AGEU when AGE is not NA: ', DM.loc[DM.AGE.notna(), 'AGEU'].isna().sum())
print('Values for AGE when AGEU is NA: ', DM.loc[DM.AGEU.isna(), 'AGE'].unique())
print('Min and Max for AGE: ', DM['AGE'].min(), DM['AGE'].max())

# We make the choice to keep the YEARS unit
DM.loc[DM.AGEU=='MONTHS', 'AGE'] /= 12
DM.loc[DM.AGEU=='DAYS', 'AGE'] /= 365

# TODO: what should we do when AGEU is NA?

Values for AGEU:  ['YEARS' nan 'MONTHS' 'DAYS']
Number of NA for AGEU when AGE is not NA:  593
Values for AGE when AGEU is NA:  [51. 59. 46. 74. 61. 70. 72. 50. 79. 42. 49. 38. 77. 78. 68. 58. 62. 53.
 56. nan 66. 71. 44. 52. 87. 80. 37. 41. 64. 67. 73. 27. 40. 63. 85. 88.
 91. 92. 60. 45. 76. 86. 83. 47. 57. 69. 54. 65. 55. 81. 29. 48. 43. 89.
 39. 75. 25. 36. 84. 28. 32. 33. 96. 90.  0. 24. 35. 23. 19.  3.  5. 34.
 22. 31. 20.  8. 17. 13.  4.  7. 12. 15. 16. 30. 18. 82.]
Min and Max for AGE:  0.0 96.0
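The in-place `/=` conversion above divides the same rows again if the cell is re-run, because AGEU is not updated afterwards. The sketch below is one hedged way to make the conversion repeatable: it returns a new Series and leaves the input untouched. The helper name `age_in_years` and the flag `assume_years_if_unit_missing` are hypothetical, and treating a missing AGEU as years is an assumption, not a decision taken in this notebook.

```python
import numpy as np
import pandas as pd

# Divisors that convert each AGEU unit to years (365 days per year, as in the cell above).
UNIT_TO_YEARS_DIVISOR = {"YEARS": 1.0, "MONTHS": 12.0, "DAYS": 365.0}

def age_in_years(dm, assume_years_if_unit_missing=True):
    """Return AGE converted to years as a new Series; `dm` itself is not modified."""
    divisor = dm["AGEU"].map(UNIT_TO_YEARS_DIVISOR)
    if assume_years_if_unit_missing:
        # ASSUMPTION (open TODO in the notebook): a missing unit is read as years.
        divisor = divisor.fillna(1.0)
    # Rows whose unit is still unknown get NaN instead of a silently wrong age.
    return dm["AGE"] / divisor

# Toy data, not taken from the DM table.
example = pd.DataFrame({
    "AGE": [24.0, 730.0, 50.0, 61.0],
    "AGEU": ["MONTHS", "DAYS", "YEARS", np.nan],
})
print(age_in_years(example).tolist())
```

Passing `assume_years_if_unit_missing=False` instead yields NaN for rows without a unit, which may be the safer choice until the TODO is resolved.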


In [32]:
# Investigate ARMCD and ARM
print('Values for ARMCD: ', DM['ARMCD'].unique())
print('Values for ARM: ', DM['ARM'].unique())

Values for ARMCD:  ['PER CLIN GUIDE']
Values for ARM:  ['Per Clinical Guidelines']
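ARMCD and ARM each hold a single value, so they carry no information for selection. Rather than printing `unique()` column by column, such columns could be listed in one pass. This is a small sketch using toy data; `constant_columns` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def constant_columns(df):
    """Return the columns holding at most one distinct value (NaN counts as a value)."""
    n_distinct = df.nunique(dropna=False)
    return n_distinct[n_distinct <= 1].index.tolist()

# Toy data, not taken from the DM table.
example = pd.DataFrame({
    "ARMCD": ["PER CLIN GUIDE"] * 3,
    "ALL_NA": [np.nan] * 3,
    "SEX": ["M", "F", "M"],
})
print(constant_columns(example))
```

Because `dropna=False` is used, a column that is entirely NaN is reported as constant too, while a column mixing one value with NaN is not.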


In [35]:
# Check the existence of columns in DM
DM.columns # We can see that DMDTC does not exist

Index(['STUDYID', 'DOMAIN', 'USUBJID', 'RFSTDTC', 'DTHFL', 'INVID', 'AGE',
       'AGETXT', 'AGEU', 'SEX', 'RACE', 'ETHNIC', 'ARMCD', 'ARM', 'COUNTRY',
       'DMDY'],
      dtype='object')
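The comparison between the data dictionary and the columns actually present can also be made explicit with set differences. The sketch below assumes the dictionary has the `Domain` and `Variable_Name` columns shown earlier; `compare_with_dictionary` is a hypothetical helper and the example uses toy data.

```python
import pandas as pd

def compare_with_dictionary(data_description, df, domain):
    """Compare the variables described for `domain` with the columns present in `df`."""
    described = set(
        data_description.loc[data_description["Domain"] == domain, "Variable_Name"]
    )
    actual = set(df.columns)
    return {
        "described_but_absent": sorted(described - actual),
        "present_but_undescribed": sorted(actual - described),
    }

# Toy dictionary and table, for illustration only.
toy_description = pd.DataFrame({
    "Domain": ["DM", "DM", "DM", "DS"],
    "Variable_Name": ["STUDYID", "USUBJID", "DMDTC", "DSSEQ"],
})
toy_dm = pd.DataFrame(columns=["STUDYID", "USUBJID", "AGE"])
print(compare_with_dictionary(toy_description, toy_dm, "DM"))
```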

**=> To conclude, for DM, we can keep: USUBJID, DTHFL, AGE, SEX, RACE, ETHNIC, COUNTRY, DMDY:**

In [36]:
# Keep only some columns for DM
DM = DM[['USUBJID', 'DTHFL', 'AGE', 'SEX', 'RACE', 'ETHNIC', 'COUNTRY', 'DMDY']]
DM.columns

Index(['USUBJID', 'DTHFL', 'AGE', 'SEX', 'RACE', 'ETHNIC', 'COUNTRY', 'DMDY'], dtype='object')

In [42]:
DM.shape

(844451, 8)

#### For DS:

In [39]:
# See variables for DS
pd.set_option("display.max_colwidth", 1000)
data_description.loc[data_description.Domain==DS.name, ['Variable_Name', 'Variable_Type', 'Variable_Definition']]

Unnamed: 0,Variable_Name,Variable_Type,Variable_Definition
157,STUDYID,character,This variable contains the unique identifier for a study. This is the main key/identifier for all domains in the IDDO Data Repository  every domain table will have the STUDYID identifier.
158,DOMAIN,character,This variable contains the two-character abbreviation for the domain.
159,USUBJID,character,"This variable contains the unique subject identifier for a study. This is a secondary key/identifier for all subject-level domains in the IDDO Data Repository  every domain table containing subject-level information (i.e., all but the Trial Domains) will have the USUBJID identifier. This variable will identify unique subjects in the repository."
160,DSSEQ,number,"This variable contains a sequence number to ensure uniqueness of subject records within the domain. Each observation (each recorded as a separate row in the domain) will have a unique number within each subject, e.g., a subject with 10 observations will have 10 rows and each row is numbered sequentially from 1-10."
161,DSTERM,character,This variable contains the verbatim wording of the event as provided by the Data Contributor.
162,DSMODIFY,character,This variable contains a modification of the verbatim wording of the event. This is used to capture IDDO-defined standardised terms of the event.
163,DSDECOD,character,This variable contains a dictionary-derived text description of the event. This is defined by CDISC Controlled Terminology and IDDO Controlled Terminology. More details can be found in the IDDO Implementation Guide.
164,VISITNUM,number,This variable contains a number designating the planned clinical encounter number. This is a numeric version of the visit described in VISIT? and it is used for sorting.
165,VISIT,character,This variable contains the protocol-defined text description of the planned clinical encounter number (as defined in the Trial Visits (TV) Domain).
166,VISITDY,number,This variable contains a number designating the Study Day of the planned clinical encounter. This is also a numeric version of the visit described in VISIT? and can be used for sorting.


In [56]:
# Columns really in DS
DS.columns

Index(['STUDYID', 'DOMAIN', 'USUBJID', 'DSSEQ', 'DSTERM', 'DSMODIFY',
       'DSDECOD', 'VISITNUM', 'VISIT', 'VISITDY', 'DSDY', 'DSSTDY', 'DSEVINTX',
       'DSCDSTDY'],
      dtype='object')

In [40]:
# Investigate DSMODIFY
print('Values for DSMODIFY: ', DS['DSMODIFY'].unique())

Values for DSMODIFY:  [nan]


In [41]:
# Investigate DSTERM and DSDECOD
print('Values for DSTERM: ', DS['DSTERM'].unique())
print('Values for DSDECOD: ', DS['DSDECOD'].unique())
# We make the choice to keep only DSDECOD

Values for DSTERM:  ['DISCHARGED ALIVE' 'DEATH' 'TRANSFER TO OTHER FACILITY'
 'MISSING IN DATABASE' 'UNKNOWN' 'HOSPITALIZATION' 'PALLIATIVE DISCHARGE'
 'Death' 'Hospitalisation' 'Discharged alive' 'Transfer to other facility'
 'Medically fit for discharge (COVID-19 resolved) but remains in hospital for other reason'
 'Palliative discharge' 'Unknown'
 'Ongoing health care needs relating to this admission for COVID-19'
 'Ongoing health care needs NOT related to COVID episode'
 'Discharged alive expected to survive' 'DISCHARGED'
 'CURRENTLY HOSPITALISED' 'DISCHARGE' 'TRANSFERRED TO ANOTHER FACILITY'
 'ALIVE' 'DECEASED' 'Forwarding to home'
 'Transfer to another health care facility' 'Hospitalization'
 'Transfer to the Health District' 'Discharge against medical advice'
 'Palliative care' 'QUARANTINE CENTER'
 'TRANSFER TO OTHER HOSPITAL/FACILITY' 'LONG TERM CARE FACILITY'
 'DEATH IN HOSPITAL' 'TRANSFERRED TO ANOTHER UNIT' 'HOSPITAL DISCHARGE'
 'DISCHARGE WITH PALLIATIVE CARE' 'HOSPITALIZED

In [45]:
# Investigate VISITNUM, VISIT and VISITDY

print('Values for VISITNUM: ', DS['VISITNUM'].unique())
print('Values for VISIT: ', DS['VISIT'].unique())
print('Values for VISITDY: ', DS['VISITDY'].unique())

print('%NA for VISITNUM: ', DS['VISITNUM'].isna().sum()/DS.shape[0]*100)
print('%NA for VISIT: ', DS['VISIT'].isna().sum()/DS.shape[0]*100)
print('%NA for VISITDY: ', DS['VISITDY'].isna().sum()/DS.shape[0]*100)

Values for VISITNUM:  [nan  1.  2.]
Values for VISIT:  [nan 'Day 0' 'Week 2 Day 14']
Values for VISITDY:  [nan  1. 15.]
%NA for VISITNUM:  99.82092346266369
%NA for VISIT:  99.82092346266369
%NA for VISITDY:  99.82092346266369


In [46]:
# Investigate DSDY, DSEVINTX and DSCDSTDY

print('Values for DSDY: ', DS['DSDY'].unique())
print('Values for DSEVINTX: ', DS['DSEVINTX'].unique())
print('Values for DSCDSTDY: ', DS['DSCDSTDY'].unique())

print('%NA for DSDY: ', DS['DSDY'].isna().sum()/DS.shape[0]*100)
print('%NA for DSEVINTX: ', DS['DSEVINTX'].isna().sum()/DS.shape[0]*100)
print('%NA for DSCDSTDY: ', DS['DSCDSTDY'].isna().sum()/DS.shape[0]*100)

Values for DSDY:  [nan 78. 75. 29. 28. 24. 17. 32. 25. 21. 22. 30. 50. 38. 33. 36. 31. 45.
 80. 35. 59. 63. 34. 27. 26. 39. 49. 53. 16. 19. 20. 37. 72. 46. 11. 40.
 23. 48. 42. 54. 14. 90. 41. 51. 13.  7. 12.]
Values for DSEVINTX:  [nan 'AT ANY TIME AFTER DISCHARGE']
Values for DSCDSTDY:  [ nan  16.   2.  20.   8.   5.   4.   1.   9.  14.  25.  51.  18.  19.
  24.  12.  55.   6.  13.  23.  28. 109.   7.  11.  21.   3.]
%NA for DSDY:  99.87887781936428
%NA for DSEVINTX:  99.83814002888019
%NA for DSCDSTDY:  99.99211917743611
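The percentage of missing values is computed above one column at a time. A compact sketch that produces the same figure for every column at once is shown here on toy data; `percent_missing` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def percent_missing(df, columns=None):
    """Percentage of NaN values per column, sorted from most to least missing."""
    subset = df if columns is None else df[columns]
    # The mean of a boolean mask is the fraction of True values.
    return (subset.isna().mean() * 100).sort_values(ascending=False)

# Toy data, not taken from the DS table.
example = pd.DataFrame({
    "VISITNUM": [np.nan, np.nan, np.nan, 1.0],
    "DSDECOD": ["DEATH", "DISCHARGED", None, "DEATH"],
})
print(percent_missing(example))
```

Called as `percent_missing(DS, ['DSDY', 'DSEVINTX', 'DSCDSTDY'])`, it would be expected to reproduce the three percentages printed above.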


**=> To conclude, for DS, we can keep: USUBJID, DSSEQ and DSDECOD:**

In [58]:
# Keep only some columns for DS
DS = DS[['USUBJID', 'DSSEQ', 'DSDECOD']]
DS.columns

Index(['USUBJID', 'DSSEQ', 'DSDECOD'], dtype='object')

In [59]:
DS.shape

(824787, 3)

#### For ER:

In [61]:
# See variables for ER
pd.set_option("display.max_colwidth", 1000)
data_description.loc[data_description.Domain==ER.name, ['Variable_Name', 'Variable_Type', 'Variable_Definition']]

Unnamed: 0,Variable_Name,Variable_Type,Variable_Definition
180,STUDYID,character,This variable contains the unique identifier for a study. This is the main key/identifier for all domains in the IDDO Data Repository  every domain table will have the STUDYID identifier.
181,DOMAIN,character,This variable contains the two-character abbreviation for the domain.
182,USUBJID,character,"This variable contains the unique subject identifier for a study. This is a secondary key/identifier for all subject-level domains in the IDDO Data Repository  every domain table containing subject-level information (i.e., all but the Trial Domains) will have the USUBJID identifier. This variable will identify unique subjects in the repository."
183,ERSEQ,number,"This variable contains a sequence number to ensure uniqueness of subject records within the domain. Each observation (each recorded as a separate row in the domain) will have a unique number within each subject, e.g., a subject with 10 observations will have 10 rows and each row is numbered sequentially from 1-10."
184,ERTERM,character,This variable contains the verbatim wording of the event as provided by the Data Contributor.
185,ERMODIFY,character,This variable contains a modification of the verbatim wording of the event. This is used to capture IDDO-defined standardised terms of the event.
186,ERCAT,character,This variable contains a categorization of the observation.
187,ERSCAT,character,This variable contains a further categorization of the observation.
188,ERPRESP,character,"This variable identifies whether an observation was pre-specified on the CRF. Values are null for spontaneously reported events, i.e. those collected as free-text verbatim terms."
189,EROCCUR,character,This variable identifies whether or not a pre-specified event has occurred. Values are null for spontaneously reported events.


In [62]:
# Columns really in ER
ER.columns

Index(['STUDYID', 'DOMAIN', 'USUBJID', 'ERSEQ', 'ERTERM', 'ERCAT', 'ERPRESP',
       'EROCCUR', 'ERSTAT', 'ERREASND', 'VISITNUM', 'VISIT', 'VISITDY', 'ERDY',
       'ERSTDY', 'ERENDY', 'EREVINTX', 'ERCNTRY'],
      dtype='object')

In [64]:
# Investigate ERCAT
print('Values for ERCAT: ', ER['ERCAT'].unique())

Values for ERCAT:  ['COVID-19 RISK FACTOR']


**=> To conclude, for ER, we keep nothing because the table has no decoded (standardized) version of the event.**

#### For HO:

In [69]:
# See variables for HO
pd.set_option("display.max_colwidth", 1000)
data_description.loc[data_description.Domain==HO.name, ['Variable_Name', 'Variable_Type', 'Variable_Definition']]

Unnamed: 0,Variable_Name,Variable_Type,Variable_Definition
208,STUDYID,character,This variable contains the unique identifier for a study. This is the main key/identifier for all domains in the IDDO Data Repository  every domain table will have the STUDYID identifier.
209,DOMAIN,character,This variable contains the two-character abbreviation for the domain.
210,USUBJID,character,"This variable contains the unique subject identifier for a study. This is a secondary key/identifier for all subject-level domains in the IDDO Data Repository  every domain table containing subject-level information (i.e., all but the Trial Domains) will have the USUBJID identifier. This variable will identify unique subjects in the repository."
211,HOSEQ,number,"This variable contains a sequence number to ensure uniqueness of subject records within the domain. Each observation (each recorded as a separate row in the domain) will have a unique number within each subject, e.g., a subject with 10 observations will have 10 rows and each row is numbered sequentially from 1-10."
212,HOREFID,character,"This variable contains an identifier to distinguish duplicate findings, collections or events occurring within the same time period where no other timing information is available. This variable is used to make the rows' timing details unique."
213,HOTERM,character,This variable contains the verbatim wording of the event as provided by the Data Contributor.
214,HODECOD,character,This variable contains a dictionary-derived text description of the event. This is defined by CDISC Controlled Terminology and IDDO Controlled Terminology. More details can be found in the IDDO Implementation Guide.
215,HOCAT,character,This variable contains a categorization of the observation.
216,HOPRESP,character,"This variable identifies whether an observation was pre-specified on the CRF. Values are null for spontaneously reported events, i.e. those collected as free-text verbatim terms."
217,HOOCCUR,character,This variable identifies whether or not a pre-specified event has occurred. Values are null for spontaneously reported events.


In [70]:
# Columns really in HO
HO.columns

Index(['STUDYID', 'DOMAIN', 'USUBJID', 'HOSEQ', 'HOREFID', 'HOTERM', 'HODECOD',
       'HOCAT', 'HOPRESP', 'HOOCCUR', 'HOSTAT', 'HOREASND', 'HOPATT',
       'VISITNUM', 'VISIT', 'VISITDY', 'HODY', 'HOSTDY', 'HOENDY', 'HODUR',
       'HOSTRF', 'HOEVINTX', 'HOCDSTDY', 'HODISOUT', 'SELFCARE', 'HOINDC'],
      dtype='object')

In [71]:
# Investigate HODECOD
print('Values for HODECOD: ', HO['HODECOD'].unique())

Values for HODECOD:  ['HOSPITAL' 'INTENSIVE CARE UNIT']


**=> To conclude, for HO, we can keep: USUBJID, HOSEQ and HODECOD:**

In [72]:
# Keep only some columns for HO
HO = HO[['USUBJID', 'HOSEQ', 'HODECOD']]
HO.columns

Index(['USUBJID', 'HOSEQ', 'HODECOD'], dtype='object')

In [73]:
HO.shape

(2336786, 3)

#### For IE:

In [75]:
# See variables for IE
pd.set_option("display.max_colwidth", 1000)
data_description.loc[data_description.Domain==IE.name, ['Variable_Name', 'Variable_Type', 'Variable_Definition']]

Unnamed: 0,Variable_Name,Variable_Type,Variable_Definition
239,STUDYID,character,This variable contains the unique identifier for a study. This is the main key/identifier for all domains in the IDDO Data Repository  every domain table will have the STUDYID identifier.
240,DOMAIN,character,This variable contains the two-character abbreviation for the domain.
241,USUBJID,character,"This variable contains the unique subject identifier for a study. This is a secondary key/identifier for all subject-level domains in the IDDO Data Repository  every domain table containing subject-level information (i.e., all but the Trial Domains) will have the USUBJID identifier. This variable will identify unique subjects in the repository."
242,IESEQ,number,"This variable contains a sequence number to ensure uniqueness of subject records within the domain. Each observation (each recorded as a separate row in the domain) will have a unique number within each subject, e.g., a subject with 10 observations will have 10 rows and each row is numbered sequentially from 1-10."
243,IETESTCD,character,This variable identifies the shortened code for the name of the test or examination performed.
244,IETEST,character,This variable identifies the name of the test or examination performed.
245,IECAT,character,This variable contains a categorization of the observation.
246,IEORRES,character,"This variable contains the result of the test or examination performed as provided by the data contributor. The original data can be either numericm e.g. ""503"" or string, e.g., ""Positive""."
247,IESTRESC,character,"This variable contains the converted, standardized result of the test or examination performed. The data can be either numeric, e.g. ""503"", or string, e.g. ""Positive"", and is stored as a string in the repository. The standard units and conversion formulas (as applicable) are described in detail in the IDDO Implementation Guide in the relevant sections about the variable --STRESU."
248,IESTAT,character,This variable contains information about the status of the observation  specifically that it was not completed when it was expected to have been. This column should be empty when there is a value in the --OCCUR (Events and Interventions Domains) or --ORRES (Findings Domains) variables.


In [76]:
# Columns really in IE
IE.columns

Index(['STUDYID', 'DOMAIN', 'USUBJID', 'IESEQ', 'IETESTCD', 'IETEST', 'IECAT',
       'IEORRES', 'IESTRESC', 'IESTAT', 'IEREASND', 'IEDY'],
      dtype='object')

**=> To conclude, for IE, we can keep: USUBJID, IESEQ, IETESTCD, IESTRESC:**

In [77]:
# Keep only some columns for IE
IE = IE[['USUBJID', 'IESEQ', 'IETESTCD', 'IESTRESC']]
IE.columns

Index(['USUBJID', 'IESEQ', 'IETESTCD', 'IESTRESC'], dtype='object')

In [78]:
IE.shape

(948185, 4)

#### For IN:

TODO

#### For LB:

In [80]:
# See variables for LB
pd.set_option("display.max_colwidth", 1000)
data_description.loc[data_description.Domain==LB.name, ['Variable_Name', 'Variable_Type', 'Variable_Definition']]

Unnamed: 0,Variable_Name,Variable_Type,Variable_Definition
317,STUDYID,character,This variable contains the unique identifier for a study. This is the main key/identifier for all domains in the IDDO Data Repository  every domain table will have the STUDYID identifier.
318,DOMAIN,character,This variable contains the two-character abbreviation for the domain.
319,USUBJID,character,"This variable contains the unique subject identifier for a study. This is a secondary key/identifier for all subject-level domains in the IDDO Data Repository  every domain table containing subject-level information (i.e., all but the Trial Domains) will have the USUBJID identifier. This variable will identify unique subjects in the repository."
320,LBSEQ,number,"This variable contains a sequence number to ensure uniqueness of subject records within the domain. Each observation (each recorded as a separate row in the domain) will have a unique number within each subject, e.g., a subject with 10 observations will have 10 rows and each row is numbered sequentially from 1-10."
321,LBREFID,character,"This variable contains an identifier to distinguish duplicate findings, collections or events occurring within the same time period where no other timing information is available. This variable is used to make the rows' timing details unique."
322,LBTESTCD,character,This variable identifies the shortened code for the name of the test or examination performed. This is defined by CDISC Controlled Terminology.
323,LBTEST,character,This variable identifies the name of the test or examination performed. This is defined by CDISC Controlled Terminology.
324,LBCAT,character,This variable contains a categorization of the observation.
325,LBSCAT,character,This variable contains a further categorization of the observation.
326,LBORRES,character,"This variable contains the result of the test or examination performed as provided by the data contributor. The original data can be either numericm e.g. ""503"" or string, e.g., ""Positive""."


In [81]:
# Columns really in LB
LB.columns

Index(['STUDYID', 'DOMAIN', 'USUBJID', 'LBSEQ', 'LBTESTCD', 'LBTEST', 'LBCAT',
       'LBSCAT', 'LBORRES', 'LBORRESU', 'LBSTRESC', 'LBSTRESN', 'LBSTRESU',
       'LBSTAT', 'LBREASND', 'LBSPEC', 'LBMETHOD', 'VISITNUM', 'VISIT',
       'VISITDY', 'LBDY', 'LBSTRF', 'LBEVINTX', 'LBCDSTDY'],
      dtype='object')

TODO: group values and units?

**=> To conclude, for LB, we can keep: USUBJID, LBSEQ, LBTESTCD, LBORRES, LBORRESU, LBSTRESC, LBSTRESN, LBSTRESU:**

In [82]:
# Keep only some columns for LB
LB = LB[['USUBJID', 'LBSEQ', 'LBTESTCD', 'LBORRES', 'LBORRESU', 'LBSTRESC', 'LBSTRESN', 'LBSTRESU']]
LB.columns

Index(['USUBJID', 'LBSEQ', 'LBTESTCD', 'LBORRES', 'LBORRESU', 'LBSTRESC',
       'LBSTRESN', 'LBSTRESU'],
      dtype='object')

In [83]:
LB.shape

(7330863, 8)

#### For MB:

In [3]:
data_folder = './data/DATA_2022-09-01/'
MB = pd.read_csv(data_folder + 'MB_2022-09-01.csv', sep=',', low_memory=False) 
MB.name = 'MB'

In [4]:
# See variables for MB
pd.set_option("display.max_colwidth", 1000)
data_description.loc[data_description.Domain==MB.name, ['Variable_Name', 'Variable_Type', 'Variable_Definition']]

Unnamed: 0,Variable_Name,Variable_Type,Variable_Definition
356,STUDYID,character,This variable contains the unique identifier for a study. This is the main key/identifier for all domains in the IDDO Data Repository  every domain table will have the STUDYID identifier.
357,DOMAIN,character,This variable contains the two-character abbreviation for the domain.
358,USUBJID,character,"This variable contains the unique subject identifier for a study. This is a secondary key/identifier for all subject-level domains in the IDDO Data Repository  every domain table containing subject-level information (i.e., all but the Trial Domains) will have the USUBJID identifier. This variable will identify unique subjects in the repository."
359,MBSEQ,number,"This variable contains a sequence number to ensure uniqueness of subject records within the domain. Each observation (each recorded as a separate row in the domain) will have a unique number within each subject, e.g., a subject with 10 observations will have 10 rows and each row is numbered sequentially from 1-10."
360,MBGRPID,character,This variable contains an identifier used to link together a block of related records and distinguish duplicate findings or events that occur within the same time period where no other timing information is available. This variable is used to make the rows' timing details unique.
361,MBREFID,character,"This variable contains an identifier to distinguish duplicate findings, collections or events occurring within the same time period where no other timing information is available. This variable is used to make the rows' timing details unique."
362,MBTESTCD,character,This variable identifies the shortened code for the name of the test or examination performed. This is defined by CDISC Controlled Terminology.
363,MBTEST,character,This variable identifies the name of the test or examination performed. This is defined by CDISC Controlled Terminology.
364,MBMODIFY,character,This variable contains a modification of the result as provided by the data contributor. This is used for the capture of the non-standard grade scale corresponding to the result contained in --ORRES.
365,MBTSTDTL,character,This variable describes the test type.


In [5]:
# Columns really in MB
MB.columns

Index(['STUDYID', 'DOMAIN', 'USUBJID', 'ERSEQ', 'ERTERM', 'ERCAT', 'ERPRESP',
       'EROCCUR', 'ERSTAT', 'ERREASND', 'VISITNUM', 'VISIT', 'VISITDY', 'ERDY',
       'ERSTDY', 'ERENDY', 'EREVINTX', 'ERCNTRY'],
      dtype='object')

The file provided for MB is wrong: its columns are those of the ER table, saved under the MB name.

**=> To conclude, for MB, we can keep nothing because it is the wrong file.**
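A mislabeled file like this one could be caught right after loading. The sketch below checks two things: that the table has a `<DOMAIN>SEQ` column and that its `DOMAIN` column, when present, contains only the expected abbreviation. Both checks are assumptions about the file layout; tables without a sequence column (DM and RELREC in this dataset) would need the first check relaxed. `domain_file_problems` is a hypothetical helper, and the toy table only imitates the situation seen here.

```python
import pandas as pd

def domain_file_problems(df, expected_domain):
    """List reasons why `df` does not look like the `expected_domain` table (empty = fine)."""
    problems = []
    seq_column = f"{expected_domain}SEQ"
    if seq_column not in df.columns:
        problems.append(f"missing column {seq_column}")
    if "DOMAIN" in df.columns:
        found = sorted(str(value) for value in df["DOMAIN"].dropna().unique())
        if found != [expected_domain]:
            problems.append(f"DOMAIN column contains {found}")
    return problems

# Toy table with ER-style columns, checked as if it had been loaded as MB.
er_like = pd.DataFrame({
    "STUDYID": ["S1", "S1"],
    "DOMAIN": ["ER", "ER"],
    "USUBJID": ["S1-001", "S1-002"],
    "ERSEQ": [1, 1],
})
print(domain_file_problems(er_like, "MB"))
```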

#### For PO:

In [6]:
data_folder = './data/DATA_2022-09-01/'
PO = pd.read_csv(data_folder + 'PO_2022-09-01.csv', sep=',', low_memory=False) 
PO.name = 'PO'

In [7]:
# See variables for PO
pd.set_option("display.max_colwidth", 1000)
data_description.loc[data_description.Domain==PO.name, ['Variable_Name', 'Variable_Type', 'Variable_Definition']]

Unnamed: 0,Variable_Name,Variable_Type,Variable_Definition
519,STUDYID,character,This variable contains the unique identifier for a study. This is the main key/identifier for all domains in the IDDO Data Repository  every domain table will have the STUDYID identifier.
520,DOMAIN,character,This variable contains the two-character abbreviation for the domain.
521,USUBJID,character,"This variable contains the unique subject identifier for a study. This is a secondary key/identifier for all subject-level domains in the IDDO Data Repository  every domain table containing subject-level information (i.e., all but the Trial Domains) will have the USUBJID identifier. This variable will identify unique subjects in the repository."
522,POSEQ,number,"This variable contains a sequence number to ensure uniqueness of subject records within the domain. Each observation (each recorded as a separate row in the domain) will have a unique number within each subject, e.g., a subject with 10 observations will have 10 rows and each row is numbered sequentially from 1-10."
523,POTERM,character,This variable contains the verbatim wording of the event as provided by the Data Contributor.
524,POMODIFY,character,This variable contains a modification of the verbatim wording of the event. This is used to capture IDDO-defined standardised terms of the event.
525,POCAT,character,This variable contains a categorization of the observation.
526,POPRESP,character,"This variable identifies whether an observation was pre-specified on the CRF. Values are null for spontaneously reported events, i.e. those collected as free-text verbatim terms."
527,POOCCUR,character,This variable identifies whether or not a pre-specified event has occurred. Values are null for spontaneously reported events.
528,POSTAT,character,This variable contains information about the status of the observation  specifically that it was not completed when it was expected to have been. This column should be empty when there is a value in the --OCCUR (Events and Interventions Domains) or --ORRES (Findings Domains) variables.


In [8]:
# Columns really in PO
PO.columns

Index(['STUDYID', 'DOMAIN', 'USUBJID', 'POSEQ', 'POTERM', 'POCAT', 'VISITNUM',
       'VISIT', 'VISITDY', 'PODY', 'POSTDY', 'POENDY'],
      dtype='object')

In [9]:
# Investigate POTERM
print('Values for POTERM: ', PO['POTERM'].unique())

Values for POTERM:  ['Live Birth' 'Pregnancy' 'UNKNOWN' 'Stillbirth' 'Born Alive']


The only information that matters here is whether the person is pregnant, since the outcome of the delivery is always the same. This information is already in the RP table, so nothing is kept from PO.

**=> To conclude, for PO, we can keep nothing.**

#### For RELREC:

In [13]:
data_folder = './data/DATA_2022-09-01/'
RELREC = pd.read_csv(data_folder + 'RELREC_2022-09-01.csv', sep=',', low_memory=False) 
RELREC.name = 'RELREC'

In [14]:
# See variables for RELREC
pd.set_option("display.max_colwidth", 1000)
data_description.loc[data_description.Domain==RELREC.name, ['Variable_Name', 'Variable_Type', 'Variable_Definition']]

Unnamed: 0,Variable_Name,Variable_Type,Variable_Definition
600,STUDYID,character,This variable contains the unique identifier for a study. This is the main key/identifier for all domains in the IDDO Data Repository: every domain table will have the STUDYID identifier.
601,USUBJID,character,"This variable contains the unique subject identifier for a study. This is a secondary key/identifier for all subject-level domains in the IDDO Data Repository: every domain table containing subject-level information (i.e., all but the Trial Domains) will have the USUBJID identifier. This variable will identify unique subjects in the repository."
602,RSUBJID,character,This variable contains the related unique subject identifier from the same or another study.
603,RELREC,character,This variable contains the patient relationship type between the identifier contained in USUBJID and the identifier contained in RSUBJID. This is defined by IDDO Controlled Terminology.


In [15]:
# Columns really in RELREC
RELREC.columns

Index(['STUDYID', 'USUBJID', 'RSUBJID', 'RELREC'], dtype='object')

In [16]:
# Investigate var
print('Values for var: ', RELREC['RELREC'].unique())

Values for var:  ['SAME']


**=> To conclude, for RELREC, we can keep nothing.**
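
RELREC is dropped because its relationship column holds a single value. The same test can be applied to any table. This is a minimal sketch on hypothetical toy data, and `constant_columns` is an illustrative helper name, not something defined in the notebook:

```python
import pandas as pd

def constant_columns(df):
    """List columns whose values are all identical (missing values count as a value)."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Toy stand-in for the real RELREC table (illustrative values only)
relrec_demo = pd.DataFrame({
    "USUBJID": ["A1", "A2", "A3"],
    "RSUBJID": ["B1", "B2", "B3"],
    "RELREC": ["SAME", "SAME", "SAME"],
})

print(constant_columns(relrec_demo))
```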

#### For RP:

In [53]:
data_folder = './data/DATA_2022-09-01/'
RP = pd.read_csv(data_folder + 'RP_2022-09-01.csv', sep=',', low_memory=False) 
RP.name = 'RP'

In [18]:
# See variables for RP
pd.set_option("display.max_colwidth", 1000)
data_description.loc[data_description.Domain==RP.name, ['Variable_Name', 'Variable_Type', 'Variable_Definition']]

Unnamed: 0,Variable_Name,Variable_Type,Variable_Definition
604,STUDYID,character,This variable contains the unique identifier for a study. This is the main key/identifier for all domains in the IDDO Data Repository: every domain table will have the STUDYID identifier.
605,DOMAIN,character,This variable contains the two-character abbreviation for the domain.
606,USUBJID,character,"This variable contains the unique subject identifier for a study. This is a secondary key/identifier for all subject-level domains in the IDDO Data Repository: every domain table containing subject-level information (i.e., all but the Trial Domains) will have the USUBJID identifier. This variable will identify unique subjects in the repository."
607,RPSEQ,number,"This variable contains a sequence number to ensure uniqueness of subject records within the domain. Each observation (each recorded as a separate row in the domain) will have a unique number within each subject, e.g., a subject with 10 observations will have 10 rows and each row is numbered sequentially from 1-10."
608,RPTESTCD,character,This variable identifies the shortened code for the name of the test or examination performed. This is defined by CDISC Controlled Terminology.
609,RPTEST,character,This variable identifies the name of the test or examination performed. This is defined by CDISC Controlled Terminology.
610,RPCAT,character,This variable contains a categorization of the observation.
611,RPSCAT,character,This variable contains a further categorization of the observation.
612,RPORRES,character,"This variable contains the result of the test or examination performed as provided by the data contributor. The original data can be either numeric, e.g., ""503"" or string, e.g., ""Positive""."
613,RPORRESU,character,This variable contains the unit for the result of the test or examination performed as provided by the data contributor. This is defined by CDISC Controlled Terminology.


In [19]:
# Columns really in RP
RP.columns

Index(['STUDYID', 'DOMAIN', 'USUBJID', 'RPSEQ', 'RPTESTCD', 'RPTEST',
       'RPORRES', 'RPORRESU', 'RPSTRESC', 'RPSTRESN', 'RPSTRESU', 'RPSTAT',
       'RPREASND', 'VISITNUM', 'VISIT', 'VISITDY', 'RPDY'],
      dtype='object')

In [20]:
# Investigate RPSTRESC
print('Values for RPSTRESC: ', RP['RPSTRESC'].unique())

Values for RPSTRESC:  [nan 'N' '27' 'Y' 'U' '40' '36' '25' '30' '39' '37' '22' '31' '20' '24'
 '26' '3' '28' '32' '12' '38' '33' '23' '9' '8' '13' '29' '18' '41' '35'
 '34.00' '37.00' '38.00' '33.00' '40.00' '7.00' '41.00' '28.00' '32.00'
 '00.06' '25.00' '38.01' '40.02' '40.04' '40.01' '37.02' '34.06' '39.00'
 '8.00' '41.60' '35.00' '39.04' '30.00' '36.00' '27.00' '20.05' '35.50'
 '40.20' '27.03' '12.00' '37.10' '36.10' '39.20' '39.50' '38.50' '38.80'
 '38.60' '6.00' '20.00' '9.00' '29.00' '22.00' '38.10' '33.60' '18.00'
 '35.40' '22.20' '31.00' '37.60' '38.40' '40.30' '29.60' '0.39' '34.85'
 '37.85' '41.15' '21.85' '37.30' '24.00' '41.57' '34.60' '39.03' '39.30'
 '23.09' '28.20' '37.35' '38.15' '33.70' '33.40' '26.00' '38.30' '40.50'
 '04.00' '11.00' '24.01' '36.06' '10.00' '19.00' '22.02' '36.50' '23.00'
 '06.00' '0.40' '0.46' '39.10' '32.40' '34.10' '15.00' '09.00' '34.01'
 '38.75' '37.06' '41.06' '41.05' '36.04' '39.06' '39.05' '42.00' '41.01'
 '33.06' '41.03' '17.03' '38.02' '36.02' '

Numeric values can be replaced by 'Y' (yes), because in those cases the person is indeed pregnant. 'U' (unknown) can be replaced by NaN. The per-subject observation numbering can then be dropped, since all we keep is whether the person is pregnant or not:

In [54]:
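def containsNumber(value):
    # True if the string contains at least one digit
    return any(char.isdigit() for char in value)

def replace(elm):
    # Any value containing a digit is recoded as pregnant ('Y')
    if isinstance(elm, str) and containsNumber(elm):
        return 'Y'
    # 'U' (unknown) is treated as missing
    elif elm == 'U':
        return np.nan
    # 'Y', 'N' and NaN are left unchanged
    else:
        return elm

RP['RPSTRESC'] = RP['RPSTRESC'].apply(replace)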

print('Values for RPSTRESC: ', RP['RPSTRESC'].unique())

Values for RPSTRESC:  [nan 'N' 'Y']
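
The same recoding can be written without a Python-level loop. This is a minimal sketch of a vectorised alternative, not the notebook's method. `recode_pregnancy` and the toy series are hypothetical names used for illustration:

```python
import numpy as np
import pandas as pd

def recode_pregnancy(series):
    """Map any string containing a digit to 'Y' and 'U' to NaN; leave the rest unchanged."""
    # na=False makes missing or non-string entries count as "no digit"
    has_digit = series.str.contains(r"\d", regex=True, na=False)
    recoded = series.mask(has_digit, "Y")
    # mask() without a replacement value fills with NaN
    return recoded.mask(recoded == "U")

demo = pd.Series([np.nan, "N", "27", "Y", "U", "34.00"])
result = recode_pregnancy(demo)
print(result.tolist())
```

On a table of several hundred thousand rows, the string accessor usually runs faster than `apply` with a custom function, though the difference should be measured rather than assumed.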


In [55]:
# Keep only the first observation of each subject (RPSEQ numbers rows from 1 within a subject)
RP = RP.loc[RP['RPSEQ'] == 1]
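
Filtering on `RPSEQ == 1` is meant to leave one row per subject. Before using `USUBJID` as a join key, this can be confirmed with a quick check. The sketch below runs on hypothetical toy data:

```python
import pandas as pd

# Toy stand-in for the real RP table (illustrative values only)
rp_demo = pd.DataFrame({
    "USUBJID": ["A1", "A1", "A2", "A3", "A3"],
    "RPSEQ": [1, 2, 1, 1, 2],
    "RPSTRESC": ["Y", "Y", "N", "Y", None],
})

first_obs = rp_demo.loc[rp_demo["RPSEQ"] == 1]
# is_unique is True only when no subject identifier appears twice
one_row_per_subject = first_obs["USUBJID"].is_unique
print(len(first_obs), one_row_per_subject)
```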

**=> To conclude, for RP, we can keep: USUBJID and RPSTRESC:**

In [56]:
# Keep only some columns for RP
RP = RP[['USUBJID', 'RPSTRESC']]
RP.columns

Index(['USUBJID', 'RPSTRESC'], dtype='object')

In [57]:
RP.shape

(325298, 2)

#### For RS:

In [None]:
data_folder = './data/DATA_2022-09-01/'
RS = pd.read_csv(data_folder + 'RS_2022-09-01.csv', sep=',', low_memory=False) 
RS.name = 'RS'

In [None]:
# See variables for RS
pd.set_option("display.max_colwidth", 1000)
data_description.loc[data_description.Domain==RS.name, ['Variable_Name', 'Variable_Type', 'Variable_Definition']]

In [None]:
# Columns really in RS
RS.columns

In [None]:
# Investigate var
print('Values for var: ', RS['var'].unique())

**=> To conclude, for RS, we can keep: USUBJID:**

In [None]:
# Keep only some columns for RS
RS = RS[['USUBJID', 'fill']]
RS.columns

In [None]:
RS.shape

#### For SA:

In [None]:
data_folder = './data/DATA_2022-09-01/'
SA = pd.read_csv(data_folder + 'SA_2022-09-01.csv', sep=',', low_memory=False) 
SA.name = 'SA'

In [None]:
# See variables for SA
pd.set_option("display.max_colwidth", 1000)
data_description.loc[data_description.Domain==SA.name, ['Variable_Name', 'Variable_Type', 'Variable_Definition']]

In [None]:
# Columns really in SA
SA.columns

In [None]:
# Investigate var
print('Values for var: ', SA['var'].unique())

**=> To conclude, for SA, we can keep: USUBJID:**

In [None]:
# Keep only some columns for SA
SA = SA[['USUBJID', 'fill']]
SA.columns

In [None]:
SA.shape

#### For SC:

In [None]:
data_folder = './data/DATA_2022-09-01/'
SC = pd.read_csv(data_folder + 'SC_2022-09-01.csv', sep=',', low_memory=False) 
SC.name = 'SC'

In [None]:
# See variables for SC
pd.set_option("display.max_colwidth", 1000)
data_description.loc[data_description.Domain==SC.name, ['Variable_Name', 'Variable_Type', 'Variable_Definition']]

In [None]:
# Columns really in SC
SC.columns

In [None]:
# Investigate var
print('Values for var: ', SC['var'].unique())

**=> To conclude, for SC, we can keep: USUBJID:**

In [None]:
# Keep only some columns for SC
SC = SC[['USUBJID', 'fill']]
SC.columns

In [None]:
SC.shape

#### For SV:

In [None]:
data_folder = './data/DATA_2022-09-01/'
SV = pd.read_csv(data_folder + 'SV_2022-09-01.csv', sep=',', low_memory=False) 
SV.name = 'SV'

In [None]:
# See variables for SV
pd.set_option("display.max_colwidth", 1000)
data_description.loc[data_description.Domain==SV.name, ['Variable_Name', 'Variable_Type', 'Variable_Definition']]

In [None]:
# Columns really in SV
SV.columns

In [None]:
# Investigate var
print('Values for var: ', SV['var'].unique())

**=> To conclude, for SV, we can keep: USUBJID:**

In [None]:
# Keep only some columns for SV
SV = SV[['USUBJID', 'fill']]
SV.columns

In [None]:
SV.shape

#### For TI:

In [None]:
data_folder = './data/DATA_2022-09-01/'
TI = pd.read_csv(data_folder + 'TI_2022-09-01.csv', sep=',', low_memory=False) 
TI.name = 'TI'

In [None]:
# See variables for TI
pd.set_option("display.max_colwidth", 1000)
data_description.loc[data_description.Domain==TI.name, ['Variable_Name', 'Variable_Type', 'Variable_Definition']]

In [None]:
# Columns really in TI
TI.columns

In [None]:
# Investigate var
print('Values for var: ', TI['var'].unique())

**=> To conclude, for TI, we can keep: USUBJID:**

In [None]:
# Keep only some columns for TI
TI = TI[['USUBJID', 'fill']]
TI.columns

In [None]:
TI.shape

#### For TS:

In [None]:
data_folder = './data/DATA_2022-09-01/'
TS = pd.read_csv(data_folder + 'TS_2022-09-01.csv', sep=',', low_memory=False) 
TS.name = 'TS'

In [None]:
# See variables for TS
pd.set_option("display.max_colwidth", 1000)
data_description.loc[data_description.Domain==TS.name, ['Variable_Name', 'Variable_Type', 'Variable_Definition']]

In [None]:
# Columns really in TS
TS.columns

In [None]:
# Investigate var
print('Values for var: ', TS['var'].unique())

**=> To conclude, for TS, we can keep: USUBJID:**

In [None]:
# Keep only some columns for TS
TS = TS[['USUBJID', 'fill']]
TS.columns

In [None]:
TS.shape

#### For TV:

In [None]:
data_folder = './data/DATA_2022-09-01/'
TV = pd.read_csv(data_folder + 'TV_2022-09-01.csv', sep=',', low_memory=False) 
TV.name = 'TV'

In [None]:
# See variables for TV
pd.set_option("display.max_colwidth", 1000)
data_description.loc[data_description.Domain==TV.name, ['Variable_Name', 'Variable_Type', 'Variable_Definition']]

In [None]:
# Columns really in TV
TV.columns

In [None]:
# Investigate var
print('Values for var: ', TV['var'].unique())

**=> To conclude, for TV, we can keep: USUBJID:**

In [None]:
# Keep only some columns for TV
TV = TV[['USUBJID', 'fill']]
TV.columns


In [None]:
TV.shape

#### For VS:

In [None]:
data_folder = './data/DATA_2022-09-01/'
VS = pd.read_csv(data_folder + 'VS_2022-09-01.csv', sep=',', low_memory=False) 
VS.name = 'VS'

In [None]:
# See variables for VS
pd.set_option("display.max_colwidth", 1000)
data_description.loc[data_description.Domain==VS.name, ['Variable_Name', 'Variable_Type', 'Variable_Definition']]

In [None]:
# Columns really in VS
VS.columns

In [None]:
# Investigate var
print('Values for var: ', VS['var'].unique())

**=> To conclude, for VS, we can keep: USUBJID:**

In [None]:
# Keep only some columns for VS
VS = VS[['USUBJID', 'fill']]
VS.columns

In [None]:
VS.shape
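
The sections above repeat the same load step for every domain. The sketch below factors it into a helper, assuming files are named `<DOMAIN>_2022-09-01.csv` as in the cells above. `load_domain` and its parameters are hypothetical names, and storing the name in `DataFrame.attrs` is a substitution for the notebook's `.name` attribute:

```python
import os
import tempfile
import pandas as pd

def load_domain(domain, data_folder="./data/DATA_2022-09-01/", date="2022-09-01"):
    """Read one domain table and record its name in `DataFrame.attrs`."""
    path = os.path.join(data_folder, f"{domain}_{date}.csv")
    df = pd.read_csv(path, sep=",", low_memory=False)
    df.attrs["name"] = domain  # attrs cannot clash with a column called 'name'
    return df

# Demonstration on a throwaway file so the sketch runs without the real data
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "VS_2022-09-01.csv")
    pd.DataFrame({"USUBJID": ["A1", "A2"]}).to_csv(demo_path, index=False)
    vs_demo = load_domain("VS", data_folder=tmp)

print(vs_demo.attrs["name"], list(vs_demo.columns))
```

With the real files in place, a call such as `VS = load_domain('VS')` would replace the three-line load cell of each section.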

### 2.3 EDA

In [13]:
# CHOOSE YOUR DATAFRAME :
df_profile = DS

In [14]:
# See variables of the DataFrame
pd.set_option("display.max_colwidth", 1000)
df_profile_var = data_description.loc[data_description.Domain==df_profile.name, ['Variable_Name', 'Variable_Type', 'Variable_Definition']]
df_profile_var.to_csv("./eda/" + df_profile.name + "_variables-description.csv", index=False)
df_profile_var

Unnamed: 0,Variable_Name,Variable_Type,Variable_Definition
157,STUDYID,character,This variable contains the unique identifier for a study. This is the main key/identifier for all domains in the IDDO Data Repository: every domain table will have the STUDYID identifier.
158,DOMAIN,character,This variable contains the two-character abbreviation for the domain.
159,USUBJID,character,"This variable contains the unique subject identifier for a study. This is a secondary key/identifier for all subject-level domains in the IDDO Data Repository: every domain table containing subject-level information (i.e., all but the Trial Domains) will have the USUBJID identifier. This variable will identify unique subjects in the repository."
160,DSSEQ,number,"This variable contains a sequence number to ensure uniqueness of subject records within the domain. Each observation (each recorded as a separate row in the domain) will have a unique number within each subject, e.g., a subject with 10 observations will have 10 rows and each row is numbered sequentially from 1-10."
161,DSTERM,character,This variable contains the verbatim wording of the event as provided by the Data Contributor.
162,DSMODIFY,character,This variable contains a modification of the verbatim wording of the event. This is used to capture IDDO-defined standardised terms of the event.
163,DSDECOD,character,This variable contains a dictionary-derived text description of the event. This is defined by CDISC Controlled Terminology and IDDO Controlled Terminology. More details can be found in the IDDO Implementation Guide.
164,VISITNUM,number,This variable contains a number designating the planned clinical encounter number. This is a numeric version of the visit described in VISIT and it is used for sorting.
165,VISIT,character,This variable contains the protocol-defined text description of the planned clinical encounter number (as defined in the Trial Visits (TV) Domain).
166,VISITDY,number,This variable contains a number designating the Study Day of the planned clinical encounter. This is also a numeric version of the visit described in VISIT and can be used for sorting.

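`to_csv` does not create missing folders, so the export in the cell above fails if `./eda/` does not exist yet. The sketch below creates the folder first. `export_description` is a hypothetical helper name, and the demonstration writes to a temporary directory rather than the real `./eda/`:

```python
from pathlib import Path
import tempfile
import pandas as pd

def export_description(description, name, out_dir="./eda"):
    """Write a variable-description table to <out_dir>/<name>_variables-description.csv."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # no error if the folder already exists
    target = out_path / f"{name}_variables-description.csv"
    description.to_csv(target, index=False)
    return target

# Demonstration in a throwaway directory (illustrative values only)
demo = pd.DataFrame({"Variable_Name": ["STUDYID"], "Variable_Type": ["character"]})
with tempfile.TemporaryDirectory() as tmp:
    written = export_description(demo, "DS", out_dir=Path(tmp) / "eda")
    written_exists = written.exists()
    written_name = written.name

print(written_name, written_exists)
```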

In [15]:
# EDA on the DataFrame
profile = ProfileReport(df_profile, title="Pandas Profiling Report") # Choose the dataframe you want
profile.to_widgets()
profile.to_file("./eda/" + df_profile.name + ".html") # Write the name of the dataframe


#### Comments:

##### DM :

...