# Data pre-processing and augmentation

This notebook outlines the data pre-processing and augmentation steps.

All of this functionality is available in `data_utils.py`.

## Calculate the normalized patient timeline

* Normalize each patient's timeline so that time 0 is the date they joined the study.

* Save to `data/bs_normed_full.xls`

* Bin data per year for each patient

* Save to `data/bs_normed_binned.xls`



In [3]:
import pandas as pd

# The first two columns (patient_id, Date) form the MultiIndex
basmi_df = pd.read_excel('../data/clean_basmi.xls', index_col=[0, 1])

# Turn the Drug column into a binary indicator (1 if a drug is recorded, 0 otherwise)
basmi_df['Drug_Indicator'] = basmi_df['Drug'].notnull().astype(int)
basmi_df = basmi_df.drop(columns='Drug')


### 2. Bin data per year for each patient
If a patient has multiple measurements within a year, take the mean of those measurements.

Each patient then has a single score for every year in which they were measured.

In [4]:
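def get_norm_years(group):
    """Return the whole years elapsed since the first visit, one value per row of a patient's data."""
    dates = group.index.get_level_values('Date')
    start_date = dates.min()
    # Use 365-day years and truncate to whole years
    return [(date - start_date).days // 365 for date in dates]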

# Get the normalized patient timeline.
# transform is applied to the BS column only, so it returns a single column aligned with basmi_df.
basmi_df['norm_years'] = basmi_df.groupby(level=0)['BS'].transform(get_norm_years)

# Bin data per year for each patient
agg_bs_df = basmi_df.groupby(['patient_id','norm_years']).agg({'BS': 'mean'}).reset_index(level=1)
# Round floats to 2 digits
agg_bs_df = agg_bs_df.round(2)

agg_bs_df.head()

| patient_id | norm_years | BS |
|---|---|---|
| 40 | 0.0 | 2.96 |
| 40 | 1.0 | 3.0 |
| 40 | 2.0 | 3.2 |
| 40 | 3.0 | 3.1 |
| 40 | 4.0 | 3.4 |
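
The binning step above can be checked on a small example. This is a minimal sketch with made-up data: `toy_df` and its values are hypothetical, and it uses flat columns rather than the notebook's `(patient_id, Date)` MultiIndex to keep the example short. It assumes the same 365-day year used in `get_norm_years`.

```python
import pandas as pd

# Hypothetical toy data: one patient with three visits, two of them in the first year.
toy_df = pd.DataFrame(
    {
        "patient_id": [1, 1, 1],
        "Date": pd.to_datetime(["2010-01-01", "2010-07-01", "2011-03-01"]),
        "BS": [2.0, 3.0, 4.0],
    }
)

# Whole 365-day years elapsed since each patient's first visit.
first_visit = toy_df.groupby("patient_id")["Date"].transform("min")
toy_df["norm_years"] = (toy_df["Date"] - first_visit).dt.days // 365

# One mean BS score per patient per normalized year.
toy_binned = toy_df.groupby(["patient_id", "norm_years"])["BS"].mean().round(2)
print(toy_binned)
```

The two visits in year 0 collapse into their mean (2.5), while the single visit in year 1 is kept as is.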


#### Impute the missing values
If a patient missed a year, impute the missing value by assuming a linear progression.

The progression rate is the difference between the BS scores immediately before and after the gap, divided by the number of years between them.
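
Written out, the rule reads as follows. The symbols are introduced here for illustration and do not appear in `data_utils.py`: $t_1 < t_2$ are two consecutive observed years, and $t$ is a missing year between them.

$$
r = \frac{BS_{t_2} - BS_{t_1}}{t_2 - t_1}, \qquad \widehat{BS}_t = BS_{t_1} + r\,(t - t_1), \quad t_1 < t < t_2
$$

As a made-up example, a patient with $BS_1 = 3.0$ and $BS_4 = 3.6$ and no scores for years 2 and 3 has $r = (3.6 - 3.0) / (4 - 1) = 0.2$, so the imputed scores are $\widehat{BS}_2 = 3.2$ and $\widehat{BS}_3 = 3.4$.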

In [5]:
# Impute missing values
fixed_dfs = []
for id, df in agg_bs_df.groupby('patient_id'):
    
    years = df['norm_years']

    bs_scores = df['BS'] 

    rate_of_change = (bs_scores.shift(-1) - bs_scores) / (years.shift(-1) - years)
    
    if df.shape[0] <= 1:
        fixed_df = pd.DataFrame({'BS': bs_scores, 'norm_years': years, 'patient_id': id})
    else:
        bs_scores.index = years
        rate_of_change.index = years

        # Full range of years the patient was in the study, including the final year
        # (the stop of a RangeIndex is exclusive, hence the + 1)
        years_range = pd.RangeIndex(0, stop=int(max(years)) + 1)

        fixed_data = []
        last_obs = None
        for year in years_range:
            # If we have data for this year, record it as the last observation
            # and add its BS score to the fixed data
            if year in years.values:
                last_obs = (bs_scores.loc[year], rate_of_change.loc[year])
                fixed_data.append(last_obs[0])

            # Else, make a new observation by adding the rate of change to the last BS score we had
            # and updating the last observation to this new observation keeping the rate of change the same
            else:
                new_obs = last_obs[0] + last_obs[1]
                fixed_data.append(new_obs)
                last_obs = (new_obs, last_obs[1])

        fixed_df = pd.DataFrame({'BS': fixed_data, 'norm_years': years_range})
        fixed_df['patient_id'] = id
    
    fixed_dfs.append(fixed_df)
    
fixed_bs_df = pd.concat(fixed_dfs)
fixed_bs_df = fixed_bs_df.set_index('patient_id')
fixed_bs_df.head()

| patient_id | BS | norm_years |
|---|---|---|
| 40 | 2.96 | 0.0 |
| 40 | 3.0 | 1.0 |
| 40 | 3.2 | 2.0 |
| 40 | 3.1 | 3.0 |
| 40 | 3.4 | 4.0 |
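
As a cross-check on the imputation loop above, the same linear fill can be sketched with pandas' built-in `reindex` and `interpolate`. This is an alternative under the assumption that plain linear interpolation between the neighbouring observed years is what is wanted. It is not the code in `data_utils.py`, and the `toy` data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: a patient observed in years 0, 1 and 4 (years 2 and 3 are missing).
toy = pd.DataFrame(
    {"patient_id": [7, 7, 7], "norm_years": [0, 1, 4], "BS": [2.0, 3.0, 3.6]}
)

filled_parts = []
for pid, group in toy.groupby("patient_id"):
    scores = group.set_index("norm_years")["BS"]
    # Every year from the first to the last observed year, inclusive.
    full_years = pd.RangeIndex(
        int(scores.index.min()), int(scores.index.max()) + 1, name="norm_years"
    )
    # Missing years become NaN after reindexing and are then filled linearly.
    part = scores.reindex(full_years).interpolate(method="linear").reset_index()
    part["patient_id"] = pid
    filled_parts.append(part)

toy_filled = pd.concat(filled_parts, ignore_index=True).round(2)
print(toy_filled)
```

Because the reindexed years are consecutive integers, the default linear interpolation takes equal steps between the observed scores, which matches adding a constant yearly rate of change.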
