# Data Cleaning Strategy
For this project, we will use the CTN0027 dataset from the Clinical Trials Network.<br>
The dataset and all its documentation are available at the following website:<br>
https://datashare.nida.nih.gov/study/nida-ctn-0027<br>

The main challenge with this dataset is that it is de-identified and will have to be<br>
manually labeled.  There are 65 unique tables to choose from, with hundreds of<br>
feature columns.  Below, we list the tables and describe the columns that will be<br>
cleaned and merged into a high-quality dataset suitable for machine learning<br>
and for extracting meaningful insights.<br>

## Data Cleaning Process
We will try to keep things simple and employ a process driven by reusable functions.<br>
For each table, we will follow these steps:<br>
1. Load the data
2. Identify columns that require labels
3. Apply labels to columns
4. Drop columns that are not needed
5. Create imputation strategy for missing values
6. Apply transformations to values where required
7. Feature Engineering (if necessary)
8. Flatten Dataframes (encode week of treatment into columns, where applicable)
9. Merge with other tables

## List of Reusable Functions
| Name of Function | Description |
| ---------------- | ----------- |
| load_dataframes | Loads data from files specified in a DataFrame and assigns each to a DataFrame, returned in a dictionary with keys being the variable names. |
| clean_df | Clean the given DataFrame by dropping unnecessary columns, renaming columns, and reordering columns. |
| backfill_nulls | Backfill null values in the given columns with the last non-null value. |
| flatten_dataframe | Flattens a dataframe by creating separate dataframes for each week of clinical data, renaming columns with the corresponding week number, and merging all dataframes into one. |
| merge_dfs | Merge the given list of DataFrames into one DataFrame. |
| uds_features | Creates metrics used to measure outcomes from opiate test data. |
| markdown_table_to_df | Converts a Markdown table string into a pandas DataFrame. |

## Tables to be cleaned
| File Name | Table Name | Variable | Description |
| :--- | :--- | :--- | :--- |
| T_FRRSA.csv | Research Session Attendance | RSA | Records attendance for each week of treatment |
| T_FRDEM.csv | Demographics | DEM | Sex, Ethnicity, Race |
| T_FRUDSAB.csv | Urine Drug Screen | UDS | Drug tests for 9 different substances, taken weekly for 24 weeks |
| T_FRDSM.csv | DSM-IV Diagnosis | DSM | Tracks 6 different conditions |
| T_FRMDH.csv | Medical and Psychiatric History | MDH | Tracks 24 different conditions |
| T_FRPEX.csv | Physical Exam | PEX | Tracks 12 different physical observations |
| T_FRPBC.csv | Pregnancy and Birth Control | PBC | Tracks 2 different conditions |
| T_FRTFB.csv | Timeline Follow Back Survey | TFB | Surveys for self-reported drug use, collected every 4 weeks, covering the previous 30 days of use |
| T_FRABZ.csv | Alcohol Breathalyzer | ABZ | Breathalyzer test for alcohol, taken weekly for 24 weeks |
| T_FRDOS.csv | Dose Record | DOS | Records the dose of medication taken each week for 24 weeks |
| SAE.csv | Serious Adverse Events | SAE | Records any serious adverse events that occur during the study |
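
The table above is later read back in as a CSV data dictionary. A minimal sketch of how a Markdown table could be turned into a DataFrame is shown below; `markdown_table_to_df_sketch` is a hypothetical stand-in for the `markdown_table_to_df` helper, whose real implementation is not shown here, and the two-row table is a shortened copy used only for illustration.

```python
import pandas as pd


def markdown_table_to_df_sketch(md):
    """Parse a pipe-delimited Markdown table into a DataFrame of strings."""
    lines = [ln.strip() for ln in md.strip().splitlines() if ln.strip()]
    # drop the outer pipes, then split each row into trimmed cells
    rows = [[cell.strip() for cell in ln.strip("|").split("|")] for ln in lines]
    header, body = rows[0], rows[2:]  # rows[1] is the | :--- | separator row
    return pd.DataFrame(body, columns=header)


md = """
| File Name | Table Name | Variable | Description |
| :--- | :--- | :--- | :--- |
| T_FRRSA.csv | Research Session Attendance | RSA | Weekly attendance |
| T_FRDEM.csv | Demographics | DEM | Sex, Ethnicity, Race |
"""
table = markdown_table_to_df_sketch(md)
print(table)
```

Calling `table.to_csv(..., index=False)` on the result would produce a file that `pd.read_csv` can load again.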

In [1]:
import pandas as pd # data manipulation library
import numpy as np # numerical computing library
import matplotlib.pyplot as plt # data visualization library
import seaborn as sns # advanced data visualization library
import helper # custom functions I created to clean and plot data

import warnings
warnings.filterwarnings('ignore')

In [2]:
# load the data dictionary, which contains file names and descriptions

# We turned the markdown table above into a CSV file.  It provides the file names
# and variable names, which we pass as input arguments to the bulk data-loading function.
data_dict = pd.read_csv('data_dict_draft.csv')

# identify the file path where unlabeled data is stored
path = '../unlabeled_data/'

# call load_dataframes function to load all 11 files at once
new_dfs = helper.load_dataframes(data_dict, path)

# assign the dataframes to variables
for key in new_dfs.keys():
    # assign the key to the dataframe
    globals()[key] = new_dfs[key]


rsa loaded successfully.
dem loaded successfully.
uds loaded successfully.
dsm loaded successfully.
mdh loaded successfully.
pex loaded successfully.
pbc loaded successfully.
tfb loaded successfully.
abz loaded successfully.
dos loaded successfully.
sae loaded successfully.
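
For readers without access to `helper`, here is a minimal sketch of what a bulk loader of this kind could look like. It assumes the data dictionary has `File Name` and `Variable` columns, as in the table above; the function name `load_dataframes_sketch`, those column names, and the temporary toy file are all assumptions for illustration.

```python
import os
import tempfile

import pandas as pd


def load_dataframes_sketch(data_dict, path):
    """Read each listed file and key the result by its lowercase variable name."""
    dfs = {}
    for file_name, variable in zip(data_dict["File Name"], data_dict["Variable"]):
        key = variable.strip().lower()
        dfs[key] = pd.read_csv(os.path.join(path, file_name.strip()))
        print(f"{key} loaded successfully.")
    return dfs


# toy demonstration: a temporary directory stands in for the real data folder
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"patdeid": [1, 2], "RSA001": [1, 0]}).to_csv(
        os.path.join(tmp, "T_FRRSA.csv"), index=False
    )
    toy_dict = pd.DataFrame({"File Name": ["T_FRRSA.csv"], "Variable": ["RSA"]})
    loaded = load_dataframes_sketch(toy_dict, tmp)

# reading frames straight from the dictionary avoids writing into globals()
rsa_toy = loaded["rsa"]
```

Keeping the frames in the returned dictionary is an alternative to assigning them through `globals()`; it makes it explicit where each variable comes from.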


### Transform Attendance Table

In [3]:
# we will define the columns and labels that we need for each df and then transform the data

# set parameters for transformation
rsa_cols = ['patdeid','VISIT','RSA001']
rsa_labels = {'RSA001':'attendance'}

# the helper function will transform the data
rsa = helper.clean_df(rsa, rsa_cols, rsa_labels)

# fill nulls with 0, marking no attendance
rsa['attendance'] = rsa['attendance'].fillna(0)

# remove the followup visits from the main clinical data weeks 0 - 24
rsa = rsa[~rsa['VISIT'].isin([28, 32])]

# remove duplicate rows
rsa = rsa.drop_duplicates(subset=['patdeid', 'VISIT'], keep='first')

# observe shape and sample 5 observations
print(rsa.shape)
display(rsa.sample(5))

(24217, 3)


Unnamed: 0,patdeid,VISIT,attendance
14021,1012,9,1.0
16336,1179,4,1.0
4176,292,18,0.0
15495,1116,3,1.0
1158,90,10,1.0


In [4]:
# set parameters to flatten the df
start = 0 # include data starting from week 0
end = 24 # finish at week 24
step = 1 # include data for every week

# call function to flatten dataframe
rsa_flat = helper.flatten_dataframe(rsa, start, end, step)

# fill nulls with 0 for no attendance
rsa_flat = rsa_flat.fillna(0)

# visually inspect the data
rsa_flat

Unnamed: 0,patdeid,attendance_0,attendance_1,attendance_2,attendance_3,attendance_4,attendance_5,attendance_6,attendance_7,attendance_8,...,attendance_15,attendance_16,attendance_17,attendance_18,attendance_19,attendance_20,attendance_21,attendance_22,attendance_23,attendance_24
0,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,4,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0
4,5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1915,1930,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1916,1931,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0
1917,1932,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1918,1933,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
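
The flattening step turns one row per patient per visit into one row per patient, with a column for each week. The sketch below reaches a similar shape with `DataFrame.pivot` rather than the week-by-week merge described for `flatten_dataframe`; the function name `flatten_sketch`, its default arguments, and the toy data are assumptions, and the real helper may differ in column order and edge cases.

```python
import pandas as pd


def flatten_sketch(df, start, end, step, key="patdeid", visit="VISIT"):
    """Pivot long per-visit rows into one row per patient with <feature>_<week> columns."""
    weeks = range(start, end + 1, step)
    features = [c for c in df.columns if c not in (key, visit)]
    # requires unique (patient, visit) pairs, hence the earlier drop_duplicates
    wide = df.pivot(index=key, columns=visit, values=features)
    # keep only the requested weeks; weeks with no rows at all become NaN columns
    wanted = pd.MultiIndex.from_tuples([(f, w) for w in weeks for f in features])
    wide = wide.reindex(columns=wanted)
    wide.columns = [f"{f}_{w}" for f, w in wide.columns]
    return wide.reset_index()


# toy data, invented for illustration
long = pd.DataFrame({
    "patdeid": [1, 1, 2],
    "VISIT": [0, 1, 0],
    "attendance": [1.0, 1.0, 1.0],
})
wide = flatten_sketch(long, start=0, end=2, step=1)
print(wide)
```

Visits with no row for a patient come out as NaN, which is why the notebook fills nulls after flattening.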


### Feature Engineering
We will create a feature for treatment dropout, an important metric.<br>
Patients with no recorded attendance in the final 4 weeks of treatment<br>
will be considered to have dropped out of treatment.

In [5]:
# create feature for treatment dropout
# treatment dropout is defined as no recorded attendance in the final 4 weeks of treatment (weeks 21-24)
final_weeks = [f'attendance_{week}' for week in range(21, 25)]
rsa_flat['dropout'] = (
                        rsa_flat[final_weeks] # select the final 4 weeks by column name
                       .sum(axis=1) # sum the values 
                       .apply(lambda x: 1 if x == 0 else 0) # if the sum is 0, then the patient dropped out
                       )

print('The dropout ratio is:')
display(rsa_flat.dropout.value_counts(normalize=True))

The dropout ratio is:


dropout
1    0.620313
0    0.379688
Name: proportion, dtype: float64

In [6]:
# create separate dropout df to merge later, using patdeid as primary key
rsa_dropout = rsa_flat[['patdeid', 'dropout']]

rsa_dropout

Unnamed: 0,patdeid,dropout
0,1,0
1,2,0
2,3,0
3,4,0
4,5,1
...,...,...
1915,1930,1
1916,1931,0
1917,1932,1
1918,1933,1


### Transform Demographics Table

In [7]:
# set parameters for transformation
dem_cols = ['patdeid','DEM002','DEM003A','DEM003B1','DEM003B2','DEM004A','DEM004B',     
            'DEM004C','DEM004E','DEM004F']
dem_labels = {'DEM002':'gender','DEM003A':'spanish_origin','DEM003B1':'mexican',        
              'DEM003B2':'puerto_rican','DEM004A':'amer_indian','DEM004B':'asian','DEM004C':'black','DEM004E':'white','DEM004F':'other_dem'}

# the helper function will clean and transform the data
dem = helper.clean_df(dem, dem_cols, dem_labels)

# fill missing values with 0
dem.fillna(0, inplace=True)

# we will need to change the values to binary
# for every column after the first two (patdeid, gender): if the value is > 0, change to 1, else 0
for col in dem.columns[2:]:
    dem[col] = dem[col].apply(lambda x: 1 if x > 0 else 0)

print('dem dataframe with shape of', dem.shape, 'has been cleaned ')
display(dem.sample(5))

dem dataframe with shape of (1920, 10) has been cleaned 


Unnamed: 0,patdeid,gender,spanish_origin,mexican,puerto_rican,amer_indian,asian,black,white,other_dem
386,390,2.0,1,0,0,0,0,0,1,0
721,727,1.0,1,0,0,0,0,0,1,0
320,324,2.0,1,0,0,0,0,0,1,0
1904,1918,2.0,1,0,0,0,0,0,1,0
1683,1695,1.0,1,0,0,0,0,0,1,0
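
The same binarization can also be written without a Python-level loop. The snippet below is a small sketch on invented toy data (the frame `dem_toy` and its values are assumptions) showing a vectorized comparison that maps positive values to 1 and everything else, including missing values, to 0.

```python
import pandas as pd

# toy demographics frame, values invented for illustration
dem_toy = pd.DataFrame({
    "patdeid": [1, 2, 3],
    "gender": [1.0, 2.0, 1.0],
    "white": [1.0, 0.0, None],
    "black": [0.0, 2.0, 0.0],
})

flag_cols = dem_toy.columns[2:]  # skip patdeid and gender
# NaN > 0 evaluates to False, so missing values also become 0
dem_toy[flag_cols] = (dem_toy[flag_cols] > 0).astype(int)
print(dem_toy)
```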


### Transform Urine Drug Screen Table
This table contains the data for most of the outcome metrics.<br>
The feature engineering for those metrics follows at the end of this table transformation.<br>

In [8]:
# set parameters for transformation
uds_cols = ['patdeid','VISIT', 'UDS005', 'UDS006', 'UDS007', 'UDS008', 'UDS009', 'UDS010', 'UDS011', 'UDS012', 
            'UDS013']
uds_labels = {'UDS005':'test_Amphetamines', 'UDS006':'test_Benzodiazepines','UDS007':'test_Methadone', 
              'UDS008':'test_Oxycodone', 'UDS009':'test_Cocaine', 'UDS010':'test_Methamphetamine', 'UDS011':'test_Opiate300', 'UDS012':'test_Cannabinoids', 'UDS013':'test_Propoxyphene'}

# the helper function will clean and transform the data
uds = helper.clean_df(uds, uds_cols, uds_labels)

print('Dataframe uds with shape of', uds.shape, 'has been cleaned')
display(uds)


Dataframe uds with shape of (24930, 11) has been cleaned


Unnamed: 0,patdeid,VISIT,test_Propoxyphene,test_Amphetamines,test_Cannabinoids,test_Benzodiazepines,test_Methadone,test_Oxycodone,test_Cocaine,test_Methamphetamine,test_Opiate300
0,1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,1,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1,4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
24925,1931,24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24926,1931,32,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
24927,1932,0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
24928,1933,0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0


In [9]:
# dataframe is ready to be flattened

# set params for flattening
start = 0 # include data starting from week 0
end = 24 # finish at week 24
step = 1 # include data for every week

# call function to flatten dataframe
uds_flat = helper.flatten_dataframe(uds, start, end, step)

# fill missing values with 1, so that a missed test counts as a positive test
uds_flat.fillna(1, inplace=True)

# visually inspect the data
print('The clinical data was added in the form of',uds_flat.shape[1],'features')
print('Which includes tests for 9 different substances over weeks 0 through 24')
display(uds_flat.sample(5))

The clinical data was added in the form of 226 features
Which includes tests for 9 different substances over weeks 0 through 24


Unnamed: 0,patdeid,test_Propoxyphene_0,test_Amphetamines_0,test_Cannabinoids_0,test_Benzodiazepines_0,test_Methadone_0,test_Oxycodone_0,test_Cocaine_0,test_Methamphetamine_0,test_Opiate300_0,...,test_Opiate300_23,test_Propoxyphene_24,test_Amphetamines_24,test_Cannabinoids_24,test_Benzodiazepines_24,test_Methadone_24,test_Oxycodone_24,test_Cocaine_24,test_Methamphetamine_24,test_Opiate300_24
1213,1224,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1373,1387,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
584,590,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
69,70,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
434,440,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0


### Feature Engineering
We will engineer the following features:<br>
- `responder` - a binary feature indicating whether the patient had negative drug tests for the final 4 weeks of treatment
- `TNT` - total negative tests, a measure of clinical benefit: the count of negative tests over 24 weeks
- `CNT` - consecutive negative tests, a measure of clinical benefit: the count of consecutive negative tests over 24 weeks
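
To make the three metrics concrete, here is a small worked sketch on invented data. It assumes one column per week with 0 for a negative test and 1 for a positive or missed test, and it reads "consecutive negative tests" as the longest unbroken run of negatives. The names `uds_metrics_sketch` and `longest_negative_run` are hypothetical, and the real `uds_features` helper may define the metrics differently, for example in which weeks it counts.

```python
import pandas as pd


def longest_negative_run(results):
    """Length of the longest streak of negative (0) results in a sequence."""
    best = current = 0
    for r in results:
        current = current + 1 if r == 0 else 0
        best = max(best, current)
    return best


def uds_metrics_sketch(weekly, final_weeks=4):
    """weekly: one row per patient, one column per week; 0 = negative, 1 = positive or missed."""
    negative = weekly.eq(0)
    out = pd.DataFrame(index=weekly.index)
    out["TNT"] = negative.sum(axis=1)  # total negative tests
    out["CNT"] = weekly.apply(lambda row: longest_negative_run(row.tolist()), axis=1)
    # responder: every test in the final weeks is negative
    out["responder"] = negative.iloc[:, -final_weeks:].all(axis=1).astype(int)
    return out


# two toy patients over six weeks, invented for illustration
toy = pd.DataFrame(
    [[1, 0, 0, 0, 0, 0], [1, 0, 1, 0, 0, 1]],
    columns=[0, 1, 2, 3, 4, 5],
)
metrics = uds_metrics_sketch(toy)
print(metrics)
```

In this toy case the first patient has 5 negative tests in a single run and ends with four negatives, so is a responder; the second has 3 negatives, a longest run of 2, and a positive test in the final week.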

In [10]:
# call the helper function to create the UDS features
uds_flat = helper.uds_features(uds_flat)

# visually inspect the data
print('A total of',uds_flat.shape[1],'features were created from the UDS data')
display(uds_flat)

A total of 29 features were created from the UDS data


Unnamed: 0,patdeid,0,1,2,3,4,5,6,7,8,...,18,19,20,21,22,23,24,TNT,CNT,responder
0,1,1.0,0.0,0.0,0.0,0.0,-5.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20,8,1
1,2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,1.0,1.0,1.0,1.0,1.0,6,4,0
2,3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0
3,4,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,...,1.0,1.0,1.0,0.0,1.0,1.0,0.0,4,1,0
4,5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1912,1930,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0
1913,1931,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,-5.0,1.0,0.0,1.0,0.0,16,8,0
1914,1932,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0
1915,1933,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0


In [11]:
# inspect the class imbalance for the outcome variable
uds_flat.responder.value_counts()

responder
0    1640
1     277
Name: count, dtype: int64

### Transform DSM-IV Diagnosis Table

In [12]:
# set params for transformation
dsm_cols = ['patdeid','DSMOPI','DSMAL','DSMAM','DSMCA','DSMCO','DSMSE']
dsm_labels = {'DSMOPI':'dsm_opiates','DSMAL':'dsm_alcohol','DSMAM':'dsm_amphetamine',
              'DSMCA':'dsm_cannabis','DSMCO':'dsm_cocaine','DSMSE':'dsm_sedative'}

# call the helper function to clean the data
dsm = helper.clean_df(dsm, dsm_cols, dsm_labels)

# fill nulls with 0, where the patient has no confirmed diagnosis
dsm.fillna(0, inplace=True)

# change values to binary
for col in dsm.columns[1:]:
    dsm[col] = dsm[col].apply(lambda x: 1 if x > 0 else 0)

print('Dataframe dsm with shape of', dsm.shape, 'has been cleaned')
display(dsm[:5])

Dataframe dsm with shape of (1889, 7) has been cleaned


Unnamed: 0,patdeid,dsm_cannabis,dsm_cocaine,dsm_sedative,dsm_opiates,dsm_alcohol,dsm_amphetamine
0,1,1,1,1,1,1,1
1,2,1,1,1,1,1,1
2,3,1,1,1,1,1,1
3,4,1,1,1,1,1,1
4,5,0,0,0,0,0,0


### Transform Medical and Psychiatric History Table
We will track 18 different medical conditions.

In [13]:
# set parameters for transformation
mdh_cols = ['patdeid','MDH001','MDH002','MDH003','MDH004','MDH005','MDH006','MDH007','MDH008','MDH009',
            'MDH010','MDH011A','MDH011B','MDH012','MDH013','MDH014','MDH015','MDH016','MDH017']
mdh_labels = {'MDH001':'mds_head_injury','MDH002':'mds_allergies','MDH003':'mds_liver_problems',
                'MDH004':'mds_kidney_problems','MDH005':'mds_gi_problems','MDH006':'mds_thyroid_problems',
                'MDH007':'mds_heart_condition','MDH008':'mds_asthma','MDH009':'mds_hypertension',
                'MDH010':'mds_skin_disease','MDH011A':'mds_opi_withdrawal','MDH011B':'mds_alc_withdrawal',
                'MDH012':'mds_schizophrenia','MDH013':'mds_major_depressive_disorder',
                'MDH014':'mds_bipolar_disorder','MDH015':'mds_anxiety_disorder','MDH016':'mds_sig_neurological_damage','MDH017':'mds_epilepsy'}

# call the helper function to clean the data
mdh = helper.clean_df(mdh, mdh_cols, mdh_labels)

# imputation strategy, fill missing values with 0, indicates no diagnosis
mdh.fillna(0, inplace=True)

# visually inspect the data
print('Dataframe mdh with shape of', mdh.shape, 'has been cleaned')
display(mdh[:5])

Dataframe mdh with shape of (1869, 19) has been cleaned


Unnamed: 0,patdeid,mds_liver_problems,mds_kidney_problems,mds_alc_withdrawal,mds_schizophrenia,mds_major_depressive_disorder,mds_bipolar_disorder,mds_anxiety_disorder,mds_sig_neurological_damage,mds_allergies,mds_gi_problems,mds_thyroid_problems,mds_heart_condition,mds_asthma,mds_hypertension,mds_skin_disease,mds_head_injury,mds_opi_withdrawal,mds_epilepsy
0,1,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,4,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Transform the PEX (Physical Exam) Table

In [14]:
# set params to clean cols
pex_cols = ['patdeid','PEX001A','PEX002A','PEX003A','PEX004A','PEX005A','PEX006A','PEX007A',
            'PEX008A','PEX009A','PEX010A','PEX011A','PEX012A']
pex_labels = {'PEX001A':'pex_gen_appearance','PEX002A':'pex_head_neck','PEX003A':'pex_ears_nose_throat',
              'PEX004A':'pex_cardio','PEX005A':'pex_lymph_nodes','PEX006A':'pex_respiratory',
              'PEX007A':'pex_musculoskeletal','PEX008A':'pex_gi_system','PEX009A':'pex_extremeties',
              'PEX010A':'pex_neurological','PEX011A':'pex_skin','PEX012A':'pex_other'}
              
# call the helper function to clean the data
pex = helper.clean_df(pex, pex_cols, pex_labels)

# remove duplicate rows
pex = pex.drop_duplicates(subset=['patdeid'], keep='first')

# imputation strategy: fill missing values with 9, which indicates no diagnosis
pex.fillna(9, inplace=True)

# visually inspect the data
print('Dataframe pex with shape of', pex.shape, 'has been cleaned')
display(pex[:5])

Dataframe pex with shape of (1869, 13) has been cleaned


Unnamed: 0,patdeid,pex_lymph_nodes,pex_other,pex_respiratory,pex_musculoskeletal,pex_gi_system,pex_extremeties,pex_neurological,pex_gen_appearance,pex_ears_nose_throat,pex_head_neck,pex_cardio,pex_skin
0,1,1.0,9.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,2,1.0,9.0,1.0,1.0,2.0,1.0,1.0,2.0,2.0,1.0,1.0,2.0
4,3,1.0,9.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
6,4,1.0,9.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
8,5,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0


In [None]:
pbc_cols = []
pbc_labels = {} 

In [None]:
tfb_cols = []
tfb_labels = {}

In [None]:
abz_cols = []
abz_labels = {}

In [None]:
dos_cols = []   
dos_labels = {}   