# Predictions of WG, CV and night PPGR from dietary and personal features

In this notebook I build predictive models for three measurements: wakeup glucose (WG), the coefficient of variation (CV) of night glucose, and the night PPGR.
The predictive features are daily nutritional data, personal data (age, gender, BMI, waist circumference) and blood tests:
- CRP
- lipid profile including triglycerides, HDL, LDL, cholesterol, cholesterol/HDL, Triglycerides/HDL 
- creatinine for kidney function
- AST, ALT, GGT, alkaline phosphatase for liver function

## Imports

In [116]:
import numpy as np
import pandas as pd
from LabData.DataLoaders.CGMLoader import CGMLoader
from LabData.DataLoaders.DietLoggingLoader import DietLoggingLoader
from LabData.DataLoaders.SubjectLoader import SubjectLoader
from LabData.DataLoaders.BodyMeasuresLoader import BodyMeasuresLoader
from LabData.DataLoaders.BloodTestsLoader import BloodTestsLoader
import datetime
%matplotlib inline
cgml = CGMLoader()
dll = DietLoggingLoader()

import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')

import seaborn as sns

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import mean_squared_error

## Preparing predictive features

### Personal features of PNP3 participants

Age and gender are in the SubjectLoader. Gender map: 1 - male, 0 - female. 

The body measurements include systolic and diastolic blood pressure, weight, BMI, hips, waist and height.
From the blood tests I can take everything besides A1C, fructosamine, fasting glucose and insulin.

Columns to exclude from the blood tests: 'bt__hba1c', 'bt__glucose', 'bt__fructosamine',  'bt__insulin'

Since the measurements were taken several times, I take the average over all non-zero and non-NaN values per person. This first checks whether the predictions work at all: if they do not, a more sophisticated approach is unlikely to help. If the predictions work well, this method can be refined later.
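
To illustrate why zeros are treated as missing before averaging, here is a minimal sketch on made-up data (the participants `A` and `B` and their values are hypothetical, not taken from the study). A zero entered for a skipped measurement pulls the naive mean down, while a NaN is simply ignored by `mean()`.

```python
import numpy as np
import pandas as pd

# Toy data: two visits for each of two hypothetical participants
toy = pd.DataFrame(
    {
        "RegistrationCode": ["A", "A", "B", "B"],
        "weight": [70.0, 0.0, 80.0, 82.0],    # 0.0 stands for "not measured"
        "waist": [90.0, np.nan, 100.0, 0.0],  # NaN and 0.0 both mean "missing"
    }
)

# Naive mean: zeros are counted as real measurements
naive_means = toy.groupby("RegistrationCode").mean()

# Mean after turning zeros into NaN: missing visits are skipped
clean_means = toy.replace(0, np.nan).groupby("RegistrationCode").mean()

print(naive_means)
print(clean_means)
```

For participant `A` the naive weight mean is halved by the zero entry, whereas the cleaned mean uses only the single valid measurement.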

In [153]:
sl = SubjectLoader()
participants = sl.get_data(study_ids=3).df

In [6]:
bml = BodyMeasuresLoader()
body_meas = bml.get_data(study_ids=3).df

In [79]:
btl = BloodTestsLoader()
blood_tests = btl.get_data(study_ids=3).df

In [143]:
def calc_mean_per_person(df, nans_limit=30):
    
    """Keeps only the columns with enough measurements in them and calculates the means per person"""
    
    # df = df.dropna(axis=1, how='all')
    # Zeros disturb the correct calculation of the mean, NaNs do not
    df = df.replace(0, np.nan)
    df = df.reset_index()
    df_means = df.drop(columns='Date').groupby('RegistrationCode').mean()
    # Some columns include too many NaN values, nans_limit was determined manually
    sum_nans = df_means.isnull().sum().rename('sum').to_frame()
    too_many_nans = sum_nans[sum_nans['sum'] > nans_limit].index
    df_means = df_means.drop(columns=too_many_nans)
    df_means = df_means.dropna(axis=0, how='any')
    
    return df_means

In [150]:
body_meas_means = calc_mean_per_person(body_meas)

blood_tests_means = calc_mean_per_person(blood_tests)

blood_tests_means = blood_tests_means.drop(columns=['bt__hba1c', 'bt__glucose', 'bt__fructosamine',  'bt__insulin'])

In [151]:
blood_tests_means.columns

Index(['bt__creatinine', 'bt__mchc', 'bt__crp_hs', 'bt__hdl_cholesterol',
       'bt__rdw', 'bt__lymphocytes_%', 'bt__monocytes_%', 'bt__rbc',
       'bt__hemoglobin', 'bt__triglycerides', 'bt__ast_got', 'bt__mch',
       'bt__alt_gpt', 'bt__mean_platelet_volume', 'bt__eosinophils_%',
       'bt__wbc', 'bt__basophils_%', 'bt__total_cholesterol', 'bt__mcv',
       'bt__neutrophils_%', 'bt__crp_synthetic', 'bt__platelets', 'bt__hct',
       'bt__ldl_cholesterol', 'bt__tsh', 'bt__albumin'],
      dtype='object')

In [154]:
def merge_all_personal_data(bt_means, bm_means, participants):
    bm_bt = pd.merge(bm_means, bt_means, on='RegistrationCode')
    participants = participants.reset_index('Date')
    pers_data = pd.merge(bm_bt, participants[['age', 'gender']], on='RegistrationCode')
    return pers_data

In [155]:
pers_data = merge_all_personal_data(blood_tests_means, body_meas_means, participants)

In [156]:
pers_data.head()

RegistrationCode,weight,body_fat,hips,sitting_blood_pressure_diastolic,bmi,height,trunk_fat,bmr,waist,sitting_blood_pressure_pulse_rate,...,bt__mcv,bt__neutrophils_%,bt__crp_synthetic,bt__platelets,bt__hct,bt__ldl_cholesterol,bt__tsh,bt__albumin,age,gender
111527,72.5,40.722223,111.0,88.6,31.204312,152.444444,37.477779,1335.555556,90.5,66.2,...,82.0,60.333333,1.766526,336.5,38.0,91.5,1.21,4.766667,49.0,0.0
117111,108.637498,32.628571,118.333333,89.25,34.287811,178.0,35.085715,2189.857143,116.166667,85.25,...,92.0,50.825,0.672996,347.0,46.425,201.0,1.84,5.0,49.0,1.0
126092,59.944444,33.275,98.0,94.5,23.415799,160.0,30.125,1200.375,80.666667,70.0,...,88.833333,53.05,0.451477,244.333333,39.116667,96.0,1.6,4.866667,58.0,0.0
12752,95.355555,26.4125,113.0,97.75,27.268596,187.0,29.0375,2049.625,100.333333,60.75,...,93.666667,57.466667,0.437412,232.0,44.616667,148.5,0.59,4.833333,58.0,1.0
130279,86.32,26.522222,104.875,77.0,31.145671,166.5,28.711111,1839.333333,101.75,68.2,...,95.666667,53.483333,-0.540084,209.5,46.166667,99.75,1.03,4.633333,58.0,1.0


In [157]:
pers_data.shape

(226, 40)

### Dietary features

In [61]:
def make_hourly_log(nutrient_list, study_ids=3, min_cal_per_day=1000):
    
    """Build a DataFrame from the raw logdf with RegistrationCode and Day as index and nutritional data aggregated on an hourly basis.
    The resulting DataFrame has a column MultiIndex structure (hour_of_the_day, nutrient).
    """
    carbs_cal_per_gram = 4
    fat_cal_per_gram = 9
    prot_cal_per_gram = 4
    
    log = dll.get_data(study_ids=study_ids).df
    logdf = dll.add_nutrients(log, nutrient_list)
    logdf = dll.squeeze_log(logdf)
    logdf = logdf.reset_index()
    logdf['Day'] = logdf['Date'].dt.date
    # Add 1 day to the day column for later correct merge with the features to predict (CV and WG)
    logdf['Day'] = logdf['Day'] + datetime.timedelta(days=1)
    # Identify days with a sufficient log: at least min_cal_per_day kcal in total
    totaldaylog = logdf.drop(columns=['meal_type']).groupby(['RegistrationCode', 'Day']).sum()
    totaldaylog = totaldaylog[totaldaylog['energy_kcal'] >= min_cal_per_day]
    days_to_keep = totaldaylog.index
    logdf = logdf.set_index(['RegistrationCode', 'Day'])
    logdf = logdf.loc[days_to_keep]
    logdf['hour'] = logdf['Date'].dt.hour
    # Adding Date to index for correct dropping of the 0 kcal rows
    logdf = logdf.set_index('Date', append=True)
    # Drop rows with 0 energy (should be water or tea)
    logdf = logdf.drop(logdf[logdf['energy_kcal'] == 0].index)
    logdf = logdf.reset_index().groupby(['RegistrationCode', 'Day', 'hour']).sum()
    # Add additional features
    logdf['carbs/lipids'] = logdf['carbohydrate_g'] / logdf['totallipid_g']
    logdf['caloric%carbs'] = logdf['carbohydrate_g'] * carbs_cal_per_gram / logdf['energy_kcal']
    logdf['caloric%fat'] = logdf['totallipid_g'] * fat_cal_per_gram / logdf['energy_kcal']

    # Arrange a data frame in a column multiindex format
    hourly_log = logdf.drop(columns='score').stack().unstack(level=2).unstack(level=-1)
    hourly_log = hourly_log.replace(np.nan, 0)
    
    return hourly_log
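
The reshaping step at the end of `make_hourly_log` can be hard to follow, so here is a minimal sketch of the same `stack` / `unstack` chain on a toy table (the participant `A` and all values are made up for illustration). It turns one row per (person, day, hour) into one row per (person, day) with a column MultiIndex of (hour, nutrient); hours without any logged food end up as 0.

```python
import pandas as pd

# Toy hourly log: one row per (RegistrationCode, Day, hour)
toy = pd.DataFrame(
    {
        "RegistrationCode": ["A", "A", "A"],
        "Day": ["2017-10-06", "2017-10-06", "2017-10-07"],
        "hour": [9, 20, 9],
        "energy_kcal": [200.0, 500.0, 300.0],
        "protein_g": [10.0, 30.0, 15.0],
    }
).set_index(["RegistrationCode", "Day", "hour"])

# stack():            nutrients move from the columns into the index
# unstack(level=2):   the 'hour' level moves into the columns
# unstack(level=-1):  the nutrient level joins it -> columns are (hour, nutrient)
wide = toy.stack().unstack(level=2).unstack(level=-1).fillna(0)

print(wide)
```

Each row of `wide` is one person/day, which is the shape needed for merging with per-day targets.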

In [60]:
nutrient_list = ['caffeine_mg', 'carbohydrate_g', 'energy_kcal', 'protein_g', 'sodium_mg', 'sugarstotal_g', 'totaldietaryfiber_g', 'totallipid_g']
hourly_log = make_hourly_log(nutrient_list)

40665 person/days in total

In [48]:
hourly_log.head()

,hour,0,0,0,0,0,0,0,0,0,0,...,23,23,23,23,23,23,23,23,23,23
RegistrationCode,Day,weight,caffeine_mg,carbohydrate_g,energy_kcal,protein_g,sodium_mg,sugarstotal_g,totaldietaryfiber_g,totallipid_g,carbs/lipids,...,carbohydrate_g,energy_kcal,protein_g,sodium_mg,sugarstotal_g,totaldietaryfiber_g,totallipid_g,carbs/lipids,caloric%carbs,caloric%fat
111527,2017-10-06,,,,,,,,,,,...,13.55,95.0,1.475,45.75,0.0,0.5575,3.875,3.496774,0.570526,0.367105
111527,2017-10-07,,,,,,,,,,,...,0.0,2.5,0.3,24.2,0.0,0.0,0.05,0.0,0.0,0.18
111527,2017-10-08,,,,,,,,,,,...,,,,,,,,,,
111527,2017-11-05,,,,,,,,,,,...,,,,,,,,,,
111527,2017-11-06,,,,,,,,,,,...,,,,,,,,,,


In [50]:
hourly_log.loc[:, ([22,23], slice(None))]

,hour,22,22,22,22,22,22,22,22,22,22,...,23,23,23,23,23,23,23,23,23,23
RegistrationCode,Day,weight,caffeine_mg,carbohydrate_g,energy_kcal,protein_g,sodium_mg,sugarstotal_g,totaldietaryfiber_g,totallipid_g,carbs/lipids,...,carbohydrate_g,energy_kcal,protein_g,sodium_mg,sugarstotal_g,totaldietaryfiber_g,totallipid_g,carbs/lipids,caloric%carbs,caloric%fat
111527,2017-10-06,250.0,112.5,0.0,2.5,0.3,5.0,0.0,0.0,0.05,0.0,...,13.55,95.0,1.475,45.75,0.0,0.5575,3.875,3.496774,0.570526,0.367105
111527,2017-10-07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.00,2.5,0.300,24.20,0.0,0.0000,0.050,0.000000,0.000000,0.180000
111527,2017-10-08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.00,0.0,0.000,0.00,0.0,0.0000,0.000,0.000000,0.000000,0.000000
111527,2017-11-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.00,0.0,0.000,0.00,0.0,0.0000,0.000,0.000000,0.000000,0.000000
111527,2017-11-06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.00,0.0,0.000,0.00,0.0,0.0000,0.000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
997735,2019-11-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.00,0.0,0.000,0.00,0.0,0.0000,0.000,0.000000,0.000000,0.000000
997735,2019-11-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.00,0.0,0.000,0.00,0.0,0.0000,0.000,0.000000,0.000000,0.000000
997735,2020-01-26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.00,0.0,0.000,0.00,0.0,0.0000,0.000,0.000000,0.000000,0.000000
997735,2020-01-27,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.00,0.0,0.000,0.00,0.0,0.0000,0.000,0.000000,0.000000,0.000000


## Features to predict

### Wakeup glucose

In [62]:
logdf.head()

RegistrationCode,Day,hour,weight,score,caffeine_mg,carbohydrate_g,energy_kcal,protein_g,sodium_mg,sugarstotal_g,totaldietaryfiber_g,totallipid_g,carbs/lipids,caloric%carbs,caloric%fat
111527,2017-10-06,9,490.0,0.0,112.5,0.0,2.5,0.3,9.8,0.0,0.0,0.05,0.0,0.0,0.18
111527,2017-10-06,11,770.0,0.0,50.4,21.583,225.057,10.984,375.995,0.0,0.56,9.946,2.170018,0.383601,0.397739
111527,2017-10-06,14,605.200001,0.0,0.0,50.999878,740.724492,35.415213,2039.291605,0.0,9.136548,43.672731,1.167774,0.275405,0.530635
111527,2017-10-06,17,296.0,0.0,108.0,18.3058,136.5,4.1704,236.36,2.5584,0.602,4.5964,3.982639,0.536434,0.303059
111527,2017-10-06,20,274.0,0.0,0.0,44.8486,425.31,21.0482,718.93,0.079,0.6386,17.0054,2.637315,0.421797,0.359852


In [63]:
def calculate_wakeup_glucose_pnp3(time_between=(5, 7), study_ids=3):

    """Calculates wakeup glucose in the interval given in time_between 
    depending on the breakfast time"""
    
    # Get the cgm df and combine it with adjusted glucose 
    cgmdf = cgml.get_data(study_ids=study_ids).df
    cgmdf = cgml._remove_first_day_of_connections(cgmdf)
    cgmdf = cgmdf.reset_index()
    cgmdf['hour'] = cgmdf.Date.dt.hour
    cgmdf = cgmdf.set_index('Date')
    cgmdf.index = cgmdf.index.tz_localize(None)
    adj_gluc = pd.read_json('/home/elming/Cache/adj_gl.json')
    adj_gluc['ConnectionID'] = adj_gluc['ConnectionID'].astype(str)
    adj_gluc['GlucoseTimestamp'] = pd.to_datetime(adj_gluc['GlucoseTimestamp'])
    adj_gluc = adj_gluc.rename(columns={'GlucoseTimestamp':'Date'})
    adj_gluc = adj_gluc.set_index(['ConnectionID', 'Date'])
    cgm_adj = pd.merge(cgmdf, adj_gluc['GlucoseAdj50N13_Mm'], on=['ConnectionID', 'Date'])
    cgm_adj = cgm_adj.rename(columns={'GlucoseAdj50N13_Mm':'GlucoseAdj'})

    #  Get the log df 
    log = dll.get_data(study_ids=study_ids).df
    logdf = dll.add_nutrients(log, ['energy_kcal'])
    logdf = dll.squeeze_log(logdf)
    logdf = logdf.reset_index()
    logdf['Day'] = logdf['Date'].dt.date
    
    # Filter out beverages with 0 kcal
    logdf = logdf[logdf['energy_kcal'] != 0]
    
    # Filter out days with first meals earlier than 6 am 
    firstmeals = pd.DataFrame(logdf.groupby(['RegistrationCode', 'Day'])['Date'].first().rename('breakfast_ts'))
    firstmeals = firstmeals[(firstmeals['breakfast_ts'].dt.time > datetime.time(6, 0, 0))]
    
    # Dtype handling. After groupby 'Day' is an object, but I need it to be datetime
    firstmeals = firstmeals.reset_index('Day')
    firstmeals['Day'] = pd.to_datetime(firstmeals['Day'])
    firstmeals = firstmeals.set_index('Day', append=True)
    cgm_adj['Day'] = cgm_adj.index.date
    cgm_adj = cgm_adj.set_index(['RegistrationCode', 'Day'])
    
    # Get cgm and firstmeals ts in one df
    cgm_fm = pd.merge(cgm_adj, firstmeals, on=['RegistrationCode', 'Day'])
    
    # Leave cgm timestamps between 5 and 7 only
    cgm_fm = cgm_fm[(cgm_fm['hour'] >= time_between[0]) & (cgm_fm['hour'] < time_between[1])]
    
    # If breakfast was between 6 and 7 then wakeup glucose is a mean value between 5 and 6, otherwise between 6 and 7
    cgm_fm = cgm_fm[((cgm_fm['hour'] == time_between[0]) & (cgm_fm['breakfast_ts'].dt.hour == time_between[0] + 1)) | 
                    ((cgm_fm['hour'] == time_between[0] + 1) & (cgm_fm['breakfast_ts'].dt.hour >= time_between[1]))]
    wakeup_glucose = pd.DataFrame(cgm_fm.reset_index().groupby(['RegistrationCode', 'Day', 'hour'])['GlucoseAdj'].mean().rename(
                                'wakeup_glucose'))
    wakeup_glucose = wakeup_glucose.reset_index('hour').drop(columns='hour')
    
    return wakeup_glucose
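
The rule for choosing the wakeup window depends on the breakfast time, which is easy to lose among the DataFrame filters. Below is a minimal sketch of that rule as a standalone function. The helper name `wakeup_window_hour` is hypothetical and is not part of the notebook or of `LabData`; it only mirrors the conditions used above.

```python
import datetime

def wakeup_window_hour(breakfast_ts, time_between=(5, 7)):
    """Return the clock hour whose CGM readings are averaged into wakeup
    glucose, or None if the day is excluded (hypothetical helper)."""
    start, end = time_between
    # Days whose first meal is at or before 06:00 are dropped, as in the filter above
    if breakfast_ts.time() <= datetime.time(6, 0, 0):
        return None
    # Breakfast during the hour after `start`: use the readings from `start`
    if breakfast_ts.hour == start + 1:
        return start
    # Breakfast at `end` or later: use the readings from the hour before `end`
    if breakfast_ts.hour >= end:
        return start + 1
    return None

print(wakeup_window_hour(datetime.datetime(2017, 11, 8, 6, 30)))  # breakfast at 06:30
print(wakeup_window_hour(datetime.datetime(2017, 11, 8, 8, 15)))  # breakfast at 08:15
print(wakeup_window_hour(datetime.datetime(2017, 11, 8, 5, 45)))  # breakfast at 05:45
```

With the default `time_between`, a 06:30 breakfast selects the 05:00 to 06:00 readings, an 08:15 breakfast selects the 06:00 to 07:00 readings, and a 05:45 breakfast excludes the day.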

In [64]:
wakeup_glucose = calculate_wakeup_glucose_pnp3()

In [67]:
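# The 'hour' level is already dropped inside calculate_wakeup_glucose_pnp3(),
# so repeating the step here would fail because that index level no longer exists.
# wakeup_glucose = wakeup_glucose.reset_index('hour').drop(columns='hour')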

In [71]:
wakeup_glucose.head()

RegistrationCode,Day,wakeup_glucose
111527,2017-11-08,103.071429
111527,2017-11-10,98.071429
111527,2017-11-11,101.321429
111527,2017-11-12,107.071429
111527,2017-11-13,106.071429


### Night CV

In [68]:
def add_gluc_adj_pnp3(study_ids=3):    

    """Get the cgm df and combine it with adjusted glucose"""
     
    cgmdf = cgml.get_data(study_ids=study_ids).df
    cgmdf = cgml._remove_first_day_of_connections(cgmdf)
    cgmdf = cgmdf.reset_index()
    cgmdf['hour'] = cgmdf.Date.dt.hour
    cgmdf = cgmdf.set_index('Date')
    cgmdf.index = cgmdf.index.tz_localize(None)
    adj_gluc = pd.read_json('/home/elming/Cache/adj_gl.json')
    adj_gluc['ConnectionID'] = adj_gluc['ConnectionID'].astype(str)
    adj_gluc['GlucoseTimestamp'] = pd.to_datetime(adj_gluc['GlucoseTimestamp'])
    adj_gluc = adj_gluc.rename(columns={'GlucoseTimestamp':'Date'})
    # adj_gluc = adj_gluc.set_index(['ConnectionID', 'Date'])
    cgm_adj = pd.merge(cgmdf, adj_gluc[['GlucoseAdj50N13_Mm', 'Date', 'ConnectionID']], on=['ConnectionID', 'Date'])
    cgm_adj = cgm_adj.rename(columns={'GlucoseAdj50N13_Mm':'GlucoseAdj'})

    return cgm_adj

In [69]:
cgm_adj = add_gluc_adj_pnp3()

In [87]:
def filter_by_time(df, start, end):
    
    """
    The function filters the cgm entries between certain hours of the day. 
    The output is a dataframe containing the entries between start and end hour of each day.
    :param df: CGM DataFrame
    :param start: string of a type 'hh:mm'
    :param end: string of a type 'hh:mm'
    :return: filtered DataFrame with the index set to 'RegistrationCode' and 'Day'
    """

    df = df.set_index('Date')
    # Copy the slice so that adding the 'Day' column does not trigger a chained-assignment warning
    filtered_df = df.between_time(start, end).copy()
    filtered_df['Day'] = filtered_df.index.date

    # Remove nights with fewer than 20 observations
    count = filtered_df.groupby(['RegistrationCode', 'Day'])['GlucoseAdj'].count()
    rc_days_to_keep = count[count >= 20].index
    filtered_df = filtered_df.set_index(['RegistrationCode', 'Day'])
    filtered_df = filtered_df.loc[rc_days_to_keep]
    
    return filtered_df
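
As a quick check of how the night window behaves, here is a minimal sketch on synthetic readings (one hypothetical participant `A`, one reading every 15 minutes, constant glucose). Note that `between_time` includes both endpoints by default, so a reading at exactly 06:00 still counts as part of the night.

```python
import pandas as pd

# Synthetic CGM trace: 30 readings, 15 minutes apart, from 00:00 to 07:15
idx = pd.date_range("2017-11-08 00:00", periods=30, freq="15min")
toy = pd.DataFrame({"RegistrationCode": "A", "GlucoseAdj": 100.0}, index=idx)

# Keep only the night readings; copy to allow adding a column safely
night = toy.between_time("00:00", "06:00").copy()
night["Day"] = night.index.date

# Number of night readings per person and day, used for the minimum-count filter
counts = night.groupby(["RegistrationCode", "Day"])["GlucoseAdj"].count()
nights_to_keep = counts[counts >= 20].index

print(counts)
```

A full night at 15-minute resolution gives 25 readings in this window, so a threshold of 20 tolerates a few missing values.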

In [88]:
night_cgm_adj = filter_by_time(cgm_adj, '00:00', '06:00')

In [92]:
def count_stats(cgm_df):

    # Named aggregation yields flat 'mean' and 'std' columns directly
    # (passing a dict to a SeriesGroupBy is deprecated in pandas)
    stats = cgm_df.groupby(['RegistrationCode', 'Day'])['GlucoseAdj'].agg(mean='mean', std='std')
    stats['CV'] = stats['std'] / stats['mean']
    
    return stats
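
For reference, the night CV is the standard deviation divided by the mean of the glucose readings of one night. Here is a minimal sketch on made-up values (participants `A` and `B` are hypothetical), using pandas named aggregation, which is what the deprecation notice for dict-based `agg` recommends.

```python
import pandas as pd

# Toy night readings for two hypothetical participants
toy = pd.DataFrame(
    {
        "RegistrationCode": ["A"] * 4 + ["B"] * 4,
        "Day": ["2017-11-08"] * 8,
        "GlucoseAdj": [90.0, 100.0, 110.0, 100.0, 80.0, 80.0, 80.0, 80.0],
    }
)

# One row per night with the mean and the sample standard deviation (ddof=1)
stats = toy.groupby(["RegistrationCode", "Day"])["GlucoseAdj"].agg(mean="mean", std="std")

# Coefficient of variation: relative spread of the night glucose
stats["CV"] = stats["std"] / stats["mean"]

print(stats)
```

A perfectly flat night, like the one of participant `B`, has a CV of 0, while larger swings around the same mean increase it.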

In [93]:
night_cv = count_stats(night_cgm_adj)


In [94]:
night_cv

RegistrationCode,Day,mean,std,CV
111527,2017-11-08,96.529762,6.188904,0.064114
111527,2017-11-09,99.696429,6.641847,0.066621
111527,2017-11-10,92.321429,5.350742,0.057958
111527,2017-11-11,111.196429,11.293755,0.101566
111527,2017-11-12,93.071429,2.570738,0.027621
...,...,...,...,...
997735,2019-11-02,99.005952,5.417196,0.054716
997735,2019-11-03,100.047619,3.522186,0.035205
997735,2019-11-04,97.422619,10.195818,0.104656
997735,2019-11-05,89.339286,7.488034,0.083816


### Night PPGR

In [95]:
night_cgm_adj.head()

RegistrationCode,Day,ConnectionID,GlucoseValue,PPGR,hour,GlucoseAdj
111527,2017-11-08,1926,100.0,0.0,0,98.571429
111527,2017-11-08,1926,97.0,0.0,0,95.571429
111527,2017-11-08,1926,96.0,0.0,0,94.571429
111527,2017-11-08,1926,93.0,1.75,0,91.571429
111527,2017-11-08,1926,90.0,9.25,1,88.571429


In [113]:
def calc_mean_night_ppgr(night_cgm, min_timepoints=12):
    
    """Calculates mean PPGR for the nights with at least min_timepoints values available (non-NaNs)"""
    
    ppgr_count = pd.DataFrame(night_cgm.reset_index().groupby(['RegistrationCode', 'Day'])['PPGR'].count())
    index_to_keep = ppgr_count[ppgr_count['PPGR'] >= min_timepoints].index
    night_cgm = night_cgm.loc[index_to_keep]
    ppgr = pd.DataFrame(night_cgm.reset_index().groupby(['RegistrationCode', 'Day'])['PPGR'].mean())
    
    return ppgr

In [114]:
ppgr = calc_mean_night_ppgr(night_cgm_adj)

In [115]:
ppgr

RegistrationCode,Day,PPGR
111527,2017-11-08,9.416667
111527,2017-11-09,8.161458
111527,2017-11-10,9.114583
111527,2017-11-11,6.250000
111527,2017-11-12,5.479167
...,...,...
997735,2019-11-02,5.017857
997735,2019-11-03,2.554688
997735,2019-11-04,2.073529
997735,2019-11-05,1.007812


## Running predictions

### Preparing joint DataFrames