# Data Cleaning Strategy
For this project, we will use the CTN0027 dataset from the Clinical Trials Network.<br>
The dataset and all its documentation are available at the following website:<br>
https://datashare.nida.nih.gov/study/nida-ctn-0027<br>

Two challenges shape the cleaning approach:
- The large de-identified dataset must be labeled manually, which is time consuming and error prone
- The data is high dimensional and requires bespoke transformations to fit into machine learning models

## Tables to be cleaned
| File Name | Table Name | Variable |Description | Process Applied |
| :--- | :--- | :--- | :--- | :--- |
| T_FRRSA.csv | Research Session Attendance|RSA |Records attendance for each week of treatment | Clean, Flatten, Feature Extraction, Merge |
| T_FRDEM.csv | Demographics|DEM |Sex, Ethnicity, Race | Clean, Merge |
| T_FRUDSAB.csv | Urine Drug Screen| UDS  |Drug test for 9 different drug classes, taken weekly for 24 weeks | Clean, Flatten, Feature Extraction, Merge |
| T_FRDSM.csv | DSM-IV Diagnosis|DSM |Tracks clinical diagnosis for substance use disorder, in accordance with DSM guidelines| Clean, Merge |
| T_FRMDH.csv | Medical and Psychiatric History|MDH |Tracks medical and psychiatric history of 24 different Conditions| Clean, Merge |
| T_FRPEX.csv | Physical Exam|PEX |Tracks the appearance and condition of patients for 12 different physical observations| Clean, Merge |
| T_FRPBC_BL.csv | Pregnancy and Birth Control|PBC |Pregnancy test taken once per month on weeks 0, 4, 8, 12, 16, 20, 24| Clean, Merge |
| T_FRTFB.csv | Timeline Follow Back Survey|TFB |Surveys of self-reported drug use, collected every 4 weeks at weeks 0, 4, 8, 12, 16, 20, 24; each survey covers the previous 30 days of use| Clean, Aggregate, Flatten, Merge |
| T_FRDOS.csv | Dose Record |DOS |Records the medication, average weekly dose and week of treatment| Clean, Aggregate, Feature Extraction, Flatten, Merge |


## Data Cleaning Process
We will keep things simple and employ a process driven by reusable functions<br>
to **improve data quality**, **reduce time to market** and **reduce human error**.<br>
<br> 
For each table we will follow these steps:<br>
1. Load the data
2. Identify columns that require labels
3. Apply labels to columns
4. Drop columns that are not needed
5. Create imputation strategy for missing values
6. Apply transformations to values where required
7. Feature Engineering (if necessary)
8. Flatten Dataframes (encode week of treatment into columns, where applicable)
9. Merge with other tables
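
Steps 2 through 4 could be sketched as follows. This is a minimal illustration, not the code in `helper`: the function name `clean_df_sketch` and the toy columns are assumptions, and only the argument order (dataframe, columns to keep, label mapping) mirrors how `helper.clean_df` is called in this notebook.

```python
import pandas as pd

def clean_df_sketch(df, cols, labels):
    """Hypothetical stand-in for helper.clean_df: keep, rename, reorder."""
    # keep only the requested columns and apply readable labels
    out = df[cols].rename(columns=labels)
    # move the key columns to the front when they are present
    keys = [c for c in ('patdeid', 'VISIT') if c in out.columns]
    rest = [c for c in out.columns if c not in keys]
    return out[keys + rest]

# toy input, invented for illustration only
raw = pd.DataFrame({
    'RSA001': [1.0, None],
    'VISIT': [0, 1],
    'patdeid': [1, 1],
    'unused': ['x', 'y'],
})
cleaned = clean_df_sketch(raw, ['patdeid', 'VISIT', 'RSA001'], {'RSA001': 'attendance'})
print(cleaned.columns.tolist())
```

Keeping the column list and the label mapping as plain arguments means each table only needs two small parameter definitions before the shared function is called.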

## List of Reusable Functions
| Name of Function | Description | 
| ---------------- | ----------- |
| clean_df | Clean the given DataFrame by dropping unnecessary columns, renaming columns, and reordering columns. |
| flatten_dataframe | Creates features by combining the VISIT column with each clinical data point (see example below), so that each patient occupies a single row. The raw data holds up to 25 rows per patient (one per week of treatment), but the model accepts only one row per patient, so all clinical data must be encoded into columns. We transform the data by creating a separate dataframe for each week of treatment, encoding the week into the column names of each dataframe, and merging them into a single dataset with week-level treatment detail. The transformation is complex, but justified by the expected improvement in machine learning accuracy. |
| merge_dfs | Merge the given list of DataFrames into one DataFrame. |
| uds_features | Creates 4 new features which are metrics used to measure outcomes from opiate test data. |
| med_features | Creates 2 new features for medication dose to enrich dataset and improve accuracy in machine learning|

### Example of How Flattening Works
![flatten](../images/flatten.png)
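
The same idea can be sketched in a few lines of pandas. This is an assumption about how such a function might work, not the implementation in `helper`: it uses a single `pivot` instead of one dataframe per week, the name `flatten_sketch` is hypothetical, and it requires each (`patdeid`, `VISIT`) pair to appear only once.

```python
import pandas as pd

def flatten_sketch(df, start, end, step):
    """Hypothetical stand-in for helper.flatten_dataframe."""
    # keep only the weeks of interest, e.g. 0, 4, 8, ... when step is 4
    weeks = list(range(start, end + 1, step))
    df = df[df['VISIT'].isin(weeks)]
    value_cols = [c for c in df.columns if c not in ('patdeid', 'VISIT')]
    # one row per patient, one column per (feature, week) pair
    wide = df.pivot(index='patdeid', columns='VISIT', values=value_cols)
    # flatten the (feature, week) MultiIndex into names such as attendance_4
    wide.columns = [f'{feature}_{week}' for feature, week in wide.columns]
    return wide.reset_index()

# toy long-format data, invented for illustration only
long_df = pd.DataFrame({
    'patdeid': [1, 1, 2],
    'VISIT': [0, 4, 0],
    'attendance': [1.0, 0.0, 1.0],
})
flat = flatten_sketch(long_df, start=0, end=4, step=4)
print(flat)
```

Patient 2 has no week 4 record, so `attendance_4` is null for that row; this is why each flattened table in the notebook is followed by an explicit imputation step.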




### Import Required Libraries

In [22]:
import pandas as pd # data manipulation library
import numpy as np # numerical computing library
import matplotlib.pyplot as plt # data visualization library
import seaborn as sns # advanced data visualization library
import helper # custom functions I created to clean and plot data

import warnings
warnings.filterwarnings('ignore')

### Load the Data
We will load 9 files from the de-identified dataset

In [23]:
# define parameters to load data

# define the path to the data
data_path = '../unlabeled_data/'

# define the names of the files to load
file_names = ['T_FRRSA.csv', 'T_FRDEM.csv','T_FRUDSAB.csv',
              'T_FRDSM.csv','T_FRMDH.csv','T_FRPEX.csv',
              'T_FRPBC.csv','T_FRTFB.csv','T_FRDOS.csv']

# define the names of the variables for the dataframes
variables = ['rsa', 'dem', 'uds', 'dsm', 'mdh', 'pex', 
             'pbc', 'tfb', 'dos']

# create a loop to iterate through the files and load them into the notebook
for file_name, variable in zip(file_names, variables):
    globals()[variable] = pd.read_csv(data_path + file_name)
    print(f"{variable} shape: {globals()[variable].shape}") # print the shape of the dataframes

rsa shape: (27029, 12)
dem shape: (1920, 43)
uds shape: (24930, 66)
dsm shape: (1889, 26)
mdh shape: (1869, 89)
pex shape: (2779, 33)
pbc shape: (2691, 20)
tfb shape: (100518, 56)
dos shape: (160908, 19)


### Transform Attendance Table
- This table establishes the patient population and will serve as the primary table
- All subsequent tables will use a LEFT JOIN to add clinical data as columns to each patient ID
- This table requires feature engineering for `attendance` and `dropout` variables

In [24]:
# we will define the columns and labels that we need for each df and then transform the data

# set parameters for transformation
rsa_cols = ['patdeid','VISIT','RSA001']
rsa_labels = {'RSA001':'attendance'}

# the helper function will transform the data
rsa = helper.clean_df(rsa, rsa_cols, rsa_labels)

# fill nulls with 0, marking no attendance
rsa['attendance'] = rsa['attendance'].fillna(0)

# remove the followup visits from the main clinical data weeks 0 - 24
rsa = rsa[~rsa['VISIT'].isin([28, 32])]

# remove duplicate rows
rsa = rsa.drop_duplicates(subset=['patdeid', 'VISIT'], keep='first')

# observe the shape and preview the data
print(rsa.shape)
display(rsa)

(24217, 3)


Unnamed: 0,patdeid,VISIT,attendance
0,1,0,1.0
2,1,1,1.0
3,1,2,1.0
4,1,3,1.0
5,1,4,1.0
...,...,...,...
27022,1931,23,0.0
27023,1931,24,1.0
27026,1932,0,1.0
27027,1933,0,1.0


### Feature Engineering
Capture sessions attended per patient

In [25]:
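# create df with the number of recorded visits for each patient
# note: size() counts every row, including visits with attendance 0; sum() would count attended sessions only
attendence = rsa.groupby('patdeid')['attendance'].size().to_frame('attendance').reset_index()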

attendence

Unnamed: 0,patdeid,attendance
0,1,25
1,2,25
2,3,25
3,4,25
4,5,1
...,...,...
1915,1930,1
1916,1931,25
1917,1932,1
1918,1933,1


In [26]:
# set parameters to flatten the df
start = 0 # include data starting from week 0
end = 24 # finish at week 24
step = 1 # include data for every week

# call function to flatten dataframe
rsa_flat = helper.flatten_dataframe(rsa, start, end, step)

# fill nulls with 0 for no attendance
rsa_flat = rsa_flat.fillna(0)

# visually inspect the data
rsa_flat

Unnamed: 0,patdeid,attendance_0,attendance_1,attendance_2,attendance_3,attendance_4,attendance_5,attendance_6,attendance_7,attendance_8,...,attendance_15,attendance_16,attendance_17,attendance_18,attendance_19,attendance_20,attendance_21,attendance_22,attendance_23,attendance_24
0,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,4,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0
4,5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1915,1930,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1916,1931,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0
1917,1932,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1918,1933,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Feature Engineering
We will create the feature for treatment dropout, an important metric.<br>
Patients who show no attendance for the final 4 weeks of treatment<br>
will be considered to have dropped out of treatment.

In [27]:
# create feature for treatment dropout
# treatment dropout is defined as missing attendance for the final 4 weeks of treatment
rsa_flat['dropout'] = (
    rsa_flat
    .iloc[:, 22:25] # positions 22-24 are attendance_21 to attendance_23 (3 columns); 22:26 would also include week 24
    .sum(axis=1) # sum the values
    .apply(lambda x: 1 if x == 0 else 0) # if the sum is 0, then the patient dropped out
)

print('The dropout ratio is:')
display(rsa_flat.dropout.value_counts(normalize=True))

The dropout ratio is:


dropout
1    0.620313
0    0.379688
Name: proportion, dtype: float64
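
Selecting the final weeks by column name rather than by position makes the dropout rule easier to audit. The sketch below is a variant on toy data, assuming the flattened columns are named `attendance_21` through `attendance_24`; the values are invented for illustration.

```python
import pandas as pd

# toy flattened attendance data, invented for illustration only
toy = pd.DataFrame({
    'patdeid': [1, 2, 3],
    'attendance_21': [1.0, 0.0, 0.0],
    'attendance_22': [1.0, 0.0, 0.0],
    'attendance_23': [0.0, 0.0, 0.0],
    'attendance_24': [1.0, 0.0, 1.0],
})

# name the four final weeks explicitly instead of relying on column positions
final_weeks = [f'attendance_{week}' for week in range(21, 25)]

# a patient dropped out when none of the final four sessions was attended
toy['dropout'] = (toy[final_weeks].sum(axis=1) == 0).astype(int)
print(toy[['patdeid', 'dropout']])
```

Patient 3 attends only in week 24 and is therefore not counted as a dropout under the four-week definition.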

### Transform Demographics Table
This table came with inconsistent numbering conventions.<br>
We will have to manually transform the values for each feature.


In [28]:
# set parameters for transformation
dem_cols = ['patdeid','DEM002','DEM003A','DEM004A','DEM004B','DEM004C','DEM004D','DEM004E',
            'DEM004F','DEM004G','DEM004H']
dem_labels = {'DEM002':'dem_gender','DEM003A':'dem_ethnicity','DEM004A':'dem_race_amer_ind',
              'DEM004B':'dem_race_asian','DEM004C':'dem_race_black','DEM004D':'dem_race_pacific_islander',
             'DEM004E':'dem_race_white','DEM004F':'dem_race_other','DEM004G':'dem_race_no_answer',
                'DEM004H':'dem_race_unknown'}

# the helper function will clean and transform the data
dem = helper.clean_df(dem, dem_cols, dem_labels)

# map the coded values to labels: ethnicity and gender become text strings,
# and each race column uses its own code (2 through 8) for a positive answer, which is recoded to 1
for col in dem.columns:
    if col =='dem_ethnicity':
        dem[col] = dem[col].replace({1:'spanish_origin',2:'not_spanish_origin'})
    if col =='dem_gender':
        dem[col] = dem[col].replace({1:'male',2:'female'})
    if col =='dem_race_asian':
        dem[col] = dem[col].replace({2:1})
    if col=='dem_race_black':
        dem[col] = dem[col].replace({3:1})
    if col=='dem_race_pacific_islander':
        dem[col] = dem[col].replace({4:1})
    if col=='dem_race_white':
        dem[col] = dem[col].replace({5:1})
    if col=='dem_race_other':
        dem[col] = dem[col].replace({6:1})
    if col=='dem_race_no_answer':
        dem[col] = dem[col].replace({7:1})
    if col=='dem_race_unknown':
        dem[col] = dem[col].replace({8:1})

# imputation strategy: 0 for missing values, purpose is for counts of dem data
dem = dem.fillna(0)

# review the data
dem

Unnamed: 0,patdeid,dem_gender,dem_ethnicity,dem_race_no_answer,dem_race_unknown,dem_race_amer_ind,dem_race_asian,dem_race_black,dem_race_pacific_islander,dem_race_white,dem_race_other
0,1,male,not_spanish_origin,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,2,male,not_spanish_origin,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,3,male,not_spanish_origin,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,4,female,not_spanish_origin,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,5,male,not_spanish_origin,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
1915,1930,female,not_spanish_origin,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1916,1931,male,not_spanish_origin,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1917,1932,female,not_spanish_origin,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1918,1933,male,not_spanish_origin,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
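
The per-column recoding can also be driven by a single dictionary. The sketch below is an alternative under one stated assumption: each race column holds either its own positive code or a null. Unlike `replace`, the comparison turns every other value into 0, so it is not a drop-in substitute if other codes occur. The toy values are invented.

```python
import pandas as pd

# positive-answer code for each race column, taken from the mapping used in this notebook
race_codes = {
    'dem_race_asian': 2,
    'dem_race_black': 3,
    'dem_race_pacific_islander': 4,
    'dem_race_white': 5,
    'dem_race_other': 6,
    'dem_race_no_answer': 7,
    'dem_race_unknown': 8,
}

# toy data, invented for illustration only
toy = pd.DataFrame({
    'patdeid': [1, 2],
    'dem_race_asian': [2.0, None],
    'dem_race_white': [None, 5.0],
})

# turn each column into a 0/1 flag: 1 when the column's own code is present
for col, code in race_codes.items():
    if col in toy.columns:
        toy[col] = (toy[col] == code).astype(int)

print(toy)
```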


### Transform Urine Drug Screen Table
This table contains the data for most of the outcome metrics.<br>
The outcome features are engineered at the end of this table transformation.<br>

In [None]:
# set parameters for transformation
uds_cols = ['patdeid','VISIT', 'UDS005', 'UDS006', 'UDS007', 'UDS008', 'UDS009', 'UDS010', 'UDS011', 'UDS012', 
            'UDS013']
uds_labels = {'UDS005':'test_Amphetamines', 'UDS006':'test_Benzodiazepines','UDS007':'test_Methadone', 
              'UDS008':'test_Oxycodone', 'UDS009':'test_Cocaine', 'UDS010':'test_Methamphetamine', 'UDS011':'test_Opiate300', 'UDS012':'test_Cannabinoids', 'UDS013':'test_Propoxyphene'}

# the helper function will clean and transform the data
uds = helper.clean_df(uds, uds_cols, uds_labels)

print('Dataframe uds with shape of', uds.shape, 'has been cleaned')
display(uds)


In [None]:
# dataframe is ready to be flattened

# set params for flattening
start = 0 # include data starting from week 0
end = 24 # finish at week 24
step = 1 # include data for every week

# call function to flatten dataframe
uds_flat = helper.flatten_dataframe(uds, start, end, step)

# fill missing values with 1, which is a binary value for positive test
uds_flat.fillna(1, inplace=True)

# visually inspect the data
print('The clinical data was added in the form of',uds_flat.shape[1],'features')
print('Which includes tests for 9 different drug classes over weeks 0 to 24')
display(uds_flat)

### Feature Engineering
The following metrics will be created to assess treatment success:<br>
- `TNT` - (numeric) - total negative tests, a measure of clinical benefit, count of negative tests over 24 weeks
- `NTR` - (float) - negative test rate, a measure of clinical benefit, percentage of negative tests over 24 weeks
- `CNT` - (numeric) - consecutive negative tests, a measure of clinical benefit, count of consecutive negative tests over 24 weeks
- `responder` - (binary) - indicating if the patient responds to treatment, by testing negative for opiates for the final 4 weeks of treatment
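
The four metrics above could be computed along these lines. This is a sketch under stated assumptions, not the code in `helper.uds_features`: it assumes opiate results live in columns named `test_Opiate300_<week>`, that 0 means negative and 1 means positive, that `NTR` is expressed as a fraction, and that `CNT` is the longest unbroken run of negative tests. The function names and toy data are hypothetical.

```python
import pandas as pd

def longest_run(flags):
    """Length of the longest unbroken run of True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best

def uds_features_sketch(flat, weeks=range(0, 25), prefix='test_Opiate300_'):
    """Hypothetical stand-in for helper.uds_features."""
    cols = [f'{prefix}{week}' for week in weeks]
    negative = flat[cols].eq(0)  # True where the test was negative
    out = pd.DataFrame({'patdeid': flat['patdeid']})
    out['TNT'] = negative.sum(axis=1)                # total negative tests
    out['NTR'] = out['TNT'] / len(cols)              # negative test rate, as a fraction
    out['CNT'] = negative.apply(longest_run, axis=1) # longest run of negative tests
    # responder: negative in each of the final 4 weeks
    out['responder'] = negative[cols[-4:]].all(axis=1).astype(int)
    return out

# toy data covering weeks 0-5 only, invented for illustration
results = {1: [1, 0, 0, 0, 0, 0], 2: [0, 0, 1, 0, 1, 0]}
toy = pd.DataFrame(
    [[pid] + values for pid, values in results.items()],
    columns=['patdeid'] + [f'test_Opiate300_{week}' for week in range(6)],
)
features = uds_features_sketch(toy, weeks=range(6))
print(features)
```

Because missing tests are imputed as positive before this step, a patient who stops attending accrues no further negative tests and cannot qualify as a responder.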


In [None]:
# call the helper function to create the UDS features
uds_features = helper.uds_features(uds_flat)

# isolate the features df for merge with the clinical data
uds_features = uds_features[['patdeid','TNT','NTR','CNT','responder']]

print('The UDS features have been created with shape of', uds_features.shape)
display(uds_features)

### Transform DSM-IV Diagnosis Table
The values for these features are mapped as follows:<br>
<br>
1 = Dependence<br>
2 = Abuse<br>
3 = No Diagnosis<br>
<br>
These features will require one-hot encoding in the data pipelines later on.<br>
We will label the values as text strings, so that they can appear<br>
in the column names of the final dataset. The text strings will also help<br>
with analysis.<br>


In [None]:
# set params for transformation
dsm_cols = ['patdeid','DSMOPI','DSMAL','DSMAM','DSMCA','DSMCO','DSMSE']
dsm_labels = {'DSMOPI':'dsm_opiates','DSMAL':'dsm_alcohol','DSMAM':'dsm_amphetamine',
              'DSMCA':'dsm_cannabis','DSMCO':'dsm_cocaine','DSMSE':'dsm_sedative'}

# call the helper function to clean the data
dsm = helper.clean_df(dsm, dsm_cols, dsm_labels)

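# convert cols to numeric
dsm = dsm.apply(pd.to_numeric, errors='coerce')

# convert diagnosis values to text strings as follows:
# 1 - dependence, 2 - abuse, 3 - no diagnosis, null - not present
# restrict the mapping to the diagnosis columns so that patdeid values 1, 2 and 3 are not overwritten
diag_cols = [col for col in dsm.columns if col != 'patdeid']
dsm[diag_cols] = dsm[diag_cols].replace({1:'dependence', 2:'abuse', 3:'no_diagnosis'})

# fill nulls with not_present, where patient does not confirm diagnosis
dsm[diag_cols] = dsm[diag_cols].fillna('not_present')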

print('Dataframe dsm with shape of', dsm.shape, 'has been cleaned')
display(dsm[:5])

### Transform Medical and Psychiatric History Table
We will track 18 different medical conditions

In [None]:
# set parameters for transformation
mdh_cols = ['patdeid','MDH001','MDH002','MDH003','MDH004','MDH005','MDH006','MDH007','MDH008','MDH009',
            'MDH010','MDH011A','MDH011B','MDH012','MDH013','MDH014','MDH015','MDH016','MDH017']
mdh_labels = {'MDH001':'mdh_head_injury','MDH002':'mdh_allergies','MDH003':'mdh_liver_problems',
                'MDH004':'mdh_kidney_problems','MDH005':'mdh_gi_problems','MDH006':'mdh_thyroid_problems',
                'MDH007':'mdh_heart_condition','MDH008':'mdh_asthma','MDH009':'mdh_hypertension',
                'MDH010':'mdh_skin_disease','MDH011A':'mdh_opi_withdrawal','MDH011B':'mdh_alc_withdrawal',
                'MDH012':'mdh_schizophrenia','MDH013':'mdh_major_depressive_disorder',
                'MDH014':'mdh_bipolar_disorder','MDH015':'mdh_anxiety_disorder','MDH016':'mdh_sig_neurological_damage','MDH017':'mdh_epilepsy'}

# call the helper function to clean the data
mdh = helper.clean_df(mdh, mdh_cols, mdh_labels)

# map values to txt strings, 0 = no_history, 1 = yes_history, 9 = not_evaluated, skip the first column
for col in mdh.columns[1:]:
    mdh[col] = mdh[col].map({0:'no_history', 1:'yes_history', 9:'not_evaluated'})

# fill the nulls (missing values and unmapped codes) with not_evaluated
mdh = mdh.fillna('not_evaluated')


# visually inspect the data
print('Dataframe mdh with shape of', mdh.shape, 'has been cleaned')
display(mdh[:5])

### Transform the PEX (Physical Exam) Table

In [None]:
# set params to clean cols
pex_cols = ['patdeid','PEX001A','PEX002A','PEX003A','PEX004A','PEX005A','PEX006A','PEX007A',
            'PEX008A','PEX009A','PEX010A','PEX011A','PEX012A','VISIT']
pex_labels = {'PEX001A':'pex_gen_appearance','PEX002A':'pex_head_neck','PEX003A':'pex_ears_nose_throat',
              'PEX004A':'pex_cardio','PEX005A':'pex_lymph_nodes','PEX006A':'pex_respiratory',
              'PEX007A':'pex_musculoskeletal','PEX008A':'pex_gi_system','PEX009A':'pex_extremeties',
              'PEX010A':'pex_neurological','PEX011A':'pex_skin','PEX012A':'pex_other'}

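# this dataset includes data from visits BASELINE and 24, we are only interested in BASELINE
# copy the filtered rows so later assignments do not act on a view of the original dataframe
pex = pex.loc[pex.VISIT=='BASELINE'].copy()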
              
# call the helper function to clean the data
pex = helper.clean_df(pex, pex_cols, pex_labels)

# map values to strings, 0 = normal, 1 = abnormal, 9 = not_evaluated
for col in pex.columns[2:]:
    pex[col] = pex[col].map({0:'normal', 1:'abnormal', 9:'not_evaluated'})

# imputation strategy: treat nulls like code 9 and label them not_evaluated
pex.fillna('not_evaluated', inplace=True)

# drop the visit column
pex.drop(columns='VISIT', inplace=True)

# visually inspect the data
print('Dataframe pex with shape of', pex.shape, 'has been cleaned')
display(pex)

### Transform the Pregnancy and Birth Control Table

In [None]:
# define parameters for cleaning
pbc_cols = ['patdeid','VISIT','PBC003']
pbc_labels = {'PBC003':'pbc_test_result'} 

# call the helper function to clean the data
pbc = helper.clean_df(pbc, pbc_cols, pbc_labels)

# remove the follow-up visits (weeks 28 and 32), keeping the main clinical period through week 24
pbc = pbc[~pbc['VISIT'].isin([28, 32])]

# imputation strategy: fill missing values with 0
pbc = pbc.fillna(0)

pbc

In [None]:
# flatten the pbc data

# set parameters for flattening
start = 0 # include data starting from week 0
end = 24 # finish at week 24
step = 4 # include data for every 4 weeks

pbc_flat = helper.flatten_dataframe(pbc, start, end, step)

pbc_flat


### Transform the TFB (Timeline Follow Back Survey) Table
- This table has an issue with multiple rows per patient
- Each report of drug use is recorded in a new row
- We will aggregate the data to a single row per patient
- After the aggregation, the table will be flattened, to encode the survey, drug class and week collected, in each column
- Surveys are collected once a month and reflect the previous 30 days of drug use

In [None]:
# define parameters for cleaning
tfb_cols = ['patdeid','VISIT','TFB001A','TFB002A','TFB003A','TFB004A','TFB005A','TFB006A','TFB007A',
            'TFB008A','TFB009A','TFB010A']
tfb_labels = {'TFB001A':'survey_alcohol','TFB002A':'survey_cannabis','TFB003A':'survey_cocaine',    
              'TFB010A':'survey_oxycodone','TFB009A':'survey_methadone','TFB004A':'survey_amphetamine','TFB005A':'survey_methamphetamine','TFB006A':'survey_opiates','TFB007A':'survey_benzodiazepines','TFB008A':'survey_propoxyphene'}

# call the helper function to clean the data
tfb = helper.clean_df(tfb, tfb_cols, tfb_labels)

# visually inspect the data
print('Shape of cleaned tfb dataframe is', tfb.shape)
display(tfb[:5])

In [None]:
# aggregate rows by patient and visit, sum all records of drug use

# create index
index = ['patdeid','VISIT']

# create aggregation dictionary, omit the first two columns, they do not require aggregation
agg_dict = {col:'sum' for col in tfb.columns[2:]}

# aggregate the data, we will apply sum to all instances of reported use to give the total use for the period
tfb_agg = tfb.groupby(index).agg(agg_dict).reset_index()

# visually inspect the data
print('Aggregated tfb dataframe contains', tfb_agg.shape[0],'rows, coming from', tfb.shape[0],'rows')
display(tfb_agg[:5])

In [None]:
# flatten the dataframe

# set parameters to flatten survey data
start = 0 # include data starting from week 0
end = 24 # finish at week 24
step = 4 # include data for every 4 weeks

# call function to flatten dataframe
tfb_flat = helper.flatten_dataframe(tfb_agg, start, end, step)

# imputation strategy: fill missing values with 0, indicates no drug use
tfb_flat.fillna(0, inplace=True)

# visualize the data
print('Flattened dataframe contains', tfb_flat.shape[1]-1,'features')
display(tfb_flat)

### Transform the Medication Dose Table
- This table has an issue with multiple rows per patient
- Each dose of medication is recorded as a row
- This means that if a patient received 7 doses of medication, there will be 7 rows for that patient
- This needs to be consolidated into a single row per patient
- For total_dose with null values, we will treat that as a no show or 0 dose

In [None]:
# set parameters for cleaning the dataframe
dos_cols = ['patdeid','VISIT','DOS002','DOS005']   
dos_labels = {'DOS002':'medication','DOS005':'total_dose'}

# call the helper function to clean the data
dos = helper.clean_df(dos, dos_cols, dos_labels)

# Imputation strategy: forward fill, then backfill, missing values in medication and total dose
dos['medication'] = dos['medication'].ffill().bfill()
dos['total_dose'] = dos['total_dose'].ffill().bfill()

# observe the data
print('The medication dataframe contains', dos.shape[0],'rows that must be aggregated')
display(dos)

In [None]:
# aggregate columns 

# create index
index = ['patdeid','VISIT','medication']
# create aggregation dictionary
agg_dict = {col:'sum' for col in dos.columns[3:]}

# aggregate the data, we will add daily doses to create the weekly dose total, collapsing multiple rows per patient and visit
dos_agg = dos.groupby(index).agg(agg_dict).reset_index()

# create df with patdeid and medication to merge later, this will help make analysis easier
medication = dos[['patdeid', 'medication']].drop_duplicates(subset=['patdeid'], keep='first').reset_index(drop=True)

# visualize the data
print('Total rows in the aggregated dataframe:', dos_agg.shape[0],'from', dos.shape[0],'rows')
dos_agg


### Feature Engineering
Create separate dose columns for buprenorphine and methadone, so that each medication can be analyzed on its own.
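
One way to split the dose by medication is sketched below. It is not the code in `helper.med_features`: the medication coding (1 for methadone, 2 for buprenorphine), the output column names and the toy doses are all assumptions made for illustration, and the dataset documentation should be checked for the real coding.

```python
import pandas as pd

# assumed coding of the medication column; verify against the dataset documentation
MEDICATION_CODES = {1: 'meds_methadone', 2: 'meds_buprenorphine'}

def med_features_sketch(df):
    """Hypothetical stand-in for helper.med_features."""
    out = df.copy()
    for code, col in MEDICATION_CODES.items():
        # the dose goes to the matching medication column, 0 elsewhere
        out[col] = out['total_dose'].where(out['medication'] == code, 0)
    return out.drop(columns=['medication', 'total_dose'])

# toy aggregated dose data, invented for illustration only
toy = pd.DataFrame({
    'patdeid': [1, 2],
    'VISIT': [0, 0],
    'medication': [1.0, 2.0],
    'total_dose': [90.0, 16.0],
})
dose_features = med_features_sketch(toy)
print(dose_features)
```

Keeping one dose column per medication avoids mixing two dose scales in a single feature.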

In [None]:
# feature engineering

# call helper function to create features from the medication data
dos_agg = helper.med_features(dos_agg)

# visually inspect the data
print('The aggregated dataframe contains', dos_agg.shape[1]-2,'features')
display(dos_agg)

In [None]:
# flatten the dataframe

# set parameters to flatten the dataframe
start = 0 # include data starting from week 0
end = 24 # finish at week 24
step = 1 # include data for every week

# call function to flatten dataframe
dos_flat = helper.flatten_dataframe(dos_agg, start, end, step)

# imputation strategy: nulls come post merge, these were visits for patients who dropped out, fill with 0
dos_flat.fillna(0, inplace=True)

print('The flattened dataframe contains', dos_flat.shape[1]-1,'features')
display(dos_flat)

### Merge All Tables into a Single Dataset

In [None]:
# set parameters for merge

# Define the dataframes to merge
dfs = [rsa_flat, dos_flat, uds_flat, tfb_flat, 
       pbc_flat, uds_features, dem, dsm, mdh, 
       pex, medication, attendence]

# Initialize merged_df with the first DataFrame in the list
merged_df = dfs[0]

# Merge the dfs above using left merge on 'patdeid'
for df in dfs[1:]:  # Start from the second item in the list
    merged_df = pd.merge(merged_df, df, on='patdeid', how='left')

# some rows were duplicated from one:many merge, they will be dropped
merged_df = merged_df.drop_duplicates(subset=['patdeid'], keep='first')

# Print the shape of the final dataframe
print('The final table includes', merged_df.shape[1]-1, 'features for', merged_df.shape[0], 'patients in treatment')

merged_df
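
Duplicated rows after a merge usually mean that one of the tables still holds more than one row per patient. As a sketch of an alternative, the merge can be asked to fail fast instead of dropping duplicates afterwards, using the `validate` argument of `DataFrame.merge`. The function name `merge_on_patient` and the toy frames are hypothetical.

```python
from functools import reduce

import pandas as pd

def merge_on_patient(frames):
    """Left-join every frame onto the first; raise if any patient appears twice."""
    return reduce(
        lambda left, right: left.merge(
            right, on='patdeid', how='left', validate='one_to_one'
        ),
        frames,
    )

# toy frames, invented for illustration only
base = pd.DataFrame({'patdeid': [1, 2, 3]})
dem_toy = pd.DataFrame({'patdeid': [1, 2], 'dem_gender': ['male', 'female']})
dose_toy = pd.DataFrame({'patdeid': [1, 3], 'meds_methadone_0': [90.0, 60.0]})

merged = merge_on_patient([base, dem_toy, dose_toy])
print(merged)
```

With `validate='one_to_one'`, pandas raises a `MergeError` when either side has repeated `patdeid` values, which points directly at the table that still needs aggregation.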

### Analyze Null Values

In [None]:
# show all rows from the function call
pd.set_option('display.max_rows', None)
merged_df.isnull().sum()

### Imputation Strategy

In [None]:
# for the medication column we will forward fill and backfill the nulls
# these patients dropped out, however, we would like to closely track their meds where possible
merged_df.medication = merged_df.medication.ffill().bfill()

# for columns that have 'meds' in the column name, forward fill and backfill nulls
# these columns hold the medication dose
for col in merged_df.columns:
    if 'meds' in col:
        merged_df[col] = merged_df[col].ffill().bfill()

# for the survey and pbc columns, the nulls are just patients without data, can be set to 0
# for the pex, dsm and mdh columns, nulls come from patients who dropped from treatment
# and are filled with 'not_evaluated'

# create list with prefix of columns to fill for zero value
cols1 = ['survey','pbc']
for col in merged_df.columns:
    if any(x in col for x in cols1):
        merged_df[col] = merged_df[col].fillna(0)

# set nulls in the pex, dsm and mdh columns to not_evaluated
cols2 = ['pex', 'dsm', 'mdh']
for col in merged_df.columns:
    if any(x in col for x in cols2):
        merged_df[col] = merged_df[col].fillna('not_evaluated')

In [None]:
# delete rows for incomplete patient profiles -  334, 1003 and 1006
merged_df = merged_df[~merged_df['patdeid'].isin([334, 1003, 1006])]

In [None]:
merged_df.isnull().sum()

In [None]:
# save to data folder in csv
merged_df.to_csv('../data/merged_data.csv', index=False)