# Data Cleaning Strategy
For this project, we will use the CTN0027 Dataset from the clinical trial network.<br>
The dataset and all it's documentation are avaialble at the following website:<br>
https://datashare.nida.nih.gov/study/nida-ctn-0027<br>

The main challenge for this data set is that it is de-itentified and will have to be<br>
manually labeled.  There are 65 unique tables to choose from, with hundreds of<br>
columns of features.  We will create a list below of the tables and description<br>
of columns to be cleaned and merged into a high quality dataset, appropriate for <br>
machine learning, to extract meaningful insights.<br>

## Data Cleaning Process
We will try to keep things simple and employ a process driven by reusable functions.<br>
For each table we will follow the following steps:<br>
1. Load the data
2. Identify columns that require labels
3. Apply labels to columns
4. Drop columns that are not needed
5. Create imputation strategy for missing values
3. Feature Engineering (if necessary)
4. Flatten Dataframes (encode week of treatment into columns, where applicable)
4. Merge with other tables

## Tables to be cleaned
| File Name | Table Name | Variable |Description |
| :--- | :--- | :--- | :--- |
| T_FRRSA.csv | Research Session Attendance|RSA |Records attendence for each week of treatment |
| T_FRDEM.csv | Demographics|DEM |Sex, Ethnicity, Race |
| T_FRUDSAB.csv | Urine Drug Screen| UDS  |Drug test for 8 different drug classes, taken weekly for 24 weeks |
| T_FRDSM.csv | DSM-IV Diagnosis|DSM |Tracks 6 different conditions|
| T_FRMDH.csv | Medical and Psychiatric History|MDH |Tracks 24 different Conditions|
| T_FRPEX.csv | Physical Exam|PEX |Tracks 12 different physical observations|
| T_FRPBC.csv | Pregnancy and Birth Control|PBC |Tracks 2 different conditions|
| T_FRTFB.csv | Timeline Follow Back Survey|TFB |Surveys for self reported drug use, collected every 4 weeks, includes previous 30 days of use|
| T_FRABZ.csv | Alcohol Breathalyzer |ABZ |Breathalyzer test for alcohol, taken weekly for 24 weeks|
|T_FRDOS.csv | Dose Record |DOS |Records the dose of medication taken each week for 24 weeks|
|SAE.csv | Serious Adverse Events |SAE |Records any serious adverse events that occur during the study|

In [1]:
import pandas as pd # data manipulation library
import numpy as np # numerical computing library
import matplotlib.pyplot as plt # data visualization library
import seaborn as sns # advanced data visualization library
import helper # custom fuctions I created to clean and plot data

import warnings
warnings.filterwarnings('ignore')

In [2]:
# load the data dictionary, which contains file names and descriptions

# we can use this to load all the files at once, it contains file names and variable descriptions
data_dict = pd.read_csv('data_dict_draft.csv')

# identify the file path where unlabeled data is stored
path = '../unlabeled_data/'

#call load_dataframes function to load all 10 files at once
new_dfs = helper.load_dataframes(data_dict, path)

# assign the dataframes to variables
for key in new_dfs.keys():
    # assign the key to the dataframe
    globals()[key] = new_dfs[key]


rsa loaded successfully.
dem loaded successfully.
uds loaded successfully.
dsm loaded successfully.
mdh loaded successfully.
pex loaded successfully.
pbc loaded successfully.
tfb loaded successfully.
abz loaded successfully.
dos loaded successfully.
sae loaded successfully.


### Transform Attendence Table

In [3]:
# we will define the columns and labels that we need for each df and then transform the data

# set parameters for transformation
rsa_cols = ['patdeid','VISIT','RSA001']
rsa_labels = {'RSA001':'attendance'}

# the helper function will transform the data
rsa = helper.clean_df(rsa, rsa_cols, rsa_labels)

# fill nulls with 0, marking no attendance
rsa['attendance'] = rsa['attendance'].fillna(0)

# remove the followup visits from the main clinical data weeks 0 - 24
rsa = rsa[~rsa['VISIT'].isin([28, 32])]

# remove duplicate rows
rsa = rsa.drop_duplicates(subset=['patdeid', 'VISIT'], keep='first')

# observe shape and sample 5 observations
print(rsa.shape)
display(rsa.sample(5))

(24217, 3)


Unnamed: 0,patdeid,VISIT,attendance
21403,1525,13,1.0
20213,1444,19,0.0
42,2,13,1.0
20608,1469,20,1.0
22584,1612,6,1.0


In [4]:
# set parameters to flatten the df
start = 0
end = 24
step = 1

# call function to flatten dataframe
rsa_flat = helper.flatten_dataframe(rsa, start, end, step)

# fill nulls with 0 for no attendance
rsa_flat = rsa_flat.fillna(0)

rsa_flat

Unnamed: 0,patdeid,attendance_0,attendance_1,attendance_2,attendance_3,attendance_4,attendance_5,attendance_6,attendance_7,attendance_8,...,attendance_15,attendance_16,attendance_17,attendance_18,attendance_19,attendance_20,attendance_21,attendance_22,attendance_23,attendance_24
0,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,4,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0
4,5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1915,1930,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1916,1931,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0
1917,1932,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1918,1933,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Feature Engineering
We will create the feature for treatment dropout, an important metric.<br>
Patients who do not show attendence for the final 4 weeks of treatment<br>
they will be considered to have dropped out of the study

In [5]:
# create feature for treatment dropout
# treatment dropout is defined when patient misses attendance for final 4 weeks of treatment
rsa_flat['dropout'] = (
                        rsa_flat 
                       .iloc[:,22:25] # look at the final 4 columns
                       .sum(axis=1) # sum the values 
                       .apply(lambda x: 1 if x == 0 else 0) # if the sum is 0, then the patient dropped out
                       )

print('The dropout ratio is:')
display(rsa_flat.dropout.value_counts(normalize=True))

The dropout ratio is:


dropout
1    0.620313
0    0.379688
Name: proportion, dtype: float64

In [6]:
# create separate dropout df to merge later, using patedeid as primary key
rsa_dropout = rsa_flat[['patdeid', 'dropout']]

rsa_dropout

Unnamed: 0,patdeid,dropout
0,1,0
1,2,0
2,3,0
3,4,0
4,5,1
...,...,...
1915,1930,1
1916,1931,0
1917,1932,1
1918,1933,1


### Transform Demographics Table

In [7]:
# set parameters for transformation
dem_cols = ['patdeid','DEM002','DEM003A','DEM003B1','DEM003B2','DEM004A','DEM004B','DEM004C','DEM004E','DEM004F']
dem_labels = {'DEM002':'gender','DEM003A':'spanish_origin','DEM003B1':'mexican','DEM003B2':'puerto_rican',
              'DEM004A':'amer_indian','DEM004B':'asian','DEM004C':'black','DEM004E':'white','DEM004F':'other_dem'}

# the helper function will clean and transform the data
dem = helper.clean_df(dem, dem_cols, dem_labels)

# fill missing values with 0
dem.fillna(0, inplace=True)

# we will need to change the values to binary
# for all columns afer 2, if values are > 0, change to 1 else 0
for col in dem.columns[2:]:
    dem[col] = dem[col].apply(lambda x: 1 if x > 0 else 0)

print('dem dataframe with shape of', dem.shape, 'has been cleaned ')
display(dem.sample(5))

dem dataframe with shape of (1920, 10) has been cleaned 


Unnamed: 0,patdeid,gender,spanish_origin,mexican,puerto_rican,amer_indian,asian,black,white,other_dem
1898,1912,2.0,1,0,0,0,1,0,0,0
621,627,2.0,1,0,0,0,0,0,1,0
1697,1709,1.0,1,0,0,0,0,0,1,0
1733,1745,1.0,1,0,0,0,0,1,0,0
389,393,2.0,1,0,0,0,0,0,1,0


### Transform Urine Drug Screen Table
This table contains the data for most of the outcome metrics<br>
There will be feature engineering for the following:<br>
- responder - a binary feature indicating if the patient has a negative drug test for the final 4 weeks of treatment
- tnt - total negative tests, a measure of clinical benefit, count of negative tests over 24 weeks
- cnt - concsecutive negative tests, a measure of clinical benefit, count of consecutive negative tests over 24 weeks



In [None]:

uds_cols = []
uds_labels = {}

dsm_cols = []
dsm_labels = {}

mdh_cols = []
mdh_labels = {}

pex_cols = []
pex_labels = {}

pbc_cols = []
pbc_labels = {} 

tfb_cols = []
tfb_labels = {}

abz_cols = []
abz_labels = {}

dos_cols = []   
dos_labels = {}               