# This Notebook shows where the group 5431 patients with pre-diabetes come from and where the 190 patients with diabetes at the start of the study come from within this group that need to be cut out of the study (patients diagnosed with pre-diabetes before the measurements taken with an epistart date, but were diagnosed with diabetes at the measurement date by the csv files). These 190 patients are taken out of every notebook at some point. The remaining 5241 patients are combined with the 4513 pre-diabetic patients found from a diagnosis from 2018 or later in the clinical-biological model to create the total cohort of 9754 patients.

First we import the packages for this notebook.

In [1]:
import numpy as np
import pandas as pd
import glob
import os
import dask.dataframe as dd
import shap
import time

First, we import the information with the metabolomics data.

In [2]:
prediabetes_metabolomics = pd.read_csv('prediabetics_with_metabolomic_and_telomere_length_data')
prediabetes_metabolomics = prediabetes_metabolomics.drop(columns = 'Unnamed: 0')
prediabetes_metabolomics

Unnamed: 0,eid,22190-0.0,22190-1.0,22190-2.0,22191-0.0,22191-1.0,22191-2.0,22192-0.0,22192-1.0,22192-2.0,...,23864-1.0,23867-0.0,23868-0.0,23869-0.0,23870-0.0,23871-0.0,23871-1.0,23876-0.0,23878-0.0,23878-1.0
0,1000330,0.769934,,,0.868105,,,0.351325,,,...,,,,,,,,,,
1,1000789,,,,,,,,,,...,,,,,,,,,,
2,1000809,0.728637,,,0.749579,,,-0.598856,,,...,,,,,,,,,,
3,1001783,0.613309,,,0.742680,,,-0.658705,,,...,,,,,,,,,,
4,1002432,0.794364,,,0.933980,,,0.824742,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9938,6024145,0.706609,,,0.782405,,,-0.321438,,,...,,,,,,,,,,
9939,6024268,0.870288,,,0.938913,,,0.858840,,,...,,,,,,,,,,
9940,6024393,0.667180,,,0.737532,,,-0.703726,,,...,,,,,,,,,,
9941,6024501,,,,,,,,,,...,,,,,,,,,,


First, we import our dataframe of all patients we will use in our model which we created in Model 34 and 34.1

In [3]:
all_prediabetes_for_model = pd.read_csv('prediabetic_patients_with_additional_4513_patients_from_unknown_patients_who_have_ICD_code_since_2018')
all_prediabetes_for_model = all_prediabetes_for_model.drop(columns = ['Unnamed: 0'])
all_prediabetes_for_model

Unnamed: 0,eid,target
0,4754980,0
1,3990434,0
2,4341307,0
3,2189918,0
4,1384041,0
...,...,...
9939,6020998,0
9940,6022016,0
9941,6024268,0
9942,6024393,0


Below we import the dataframe that has all the features from the UKBiobank with less than half of the values in the feature as NaN values. As we have seen before, we drop the first column which is just an identity column brought over from the SCU python.

In [4]:
prediabetes_with_features = pd.read_csv('all_features_prediabetic_patients_with_additional_4513_patients_from_unknown_patients_who_have_ICD_code_since_2018')
prediabetes_with_features = prediabetes_with_features.drop(columns = ['Unnamed: 0'])
prediabetes_with_features

Columns (42,43,213,317,318,319,320,321,322,323,324,325,326,327,328,329,330,2610,2611,2612,2613,2614,2615,2616,2617,2618,2619,2847,2851,2855,2859,2862,2864,2866,2867,2868,2869,2870,2871,2872,2895,2896,2897,2898,2912,2913,2914,2915,2964,3085,3086,3087,3088,3089,3090,3091,3092,3093,3094,3095,3096,3097,3098,3099,3100,3101,3102,3103,3104,3105,3106,3107,3108,3109,3110,3111,3112,3113,3114,3115,3116,3117,3124,3125,3126,3127,3128,3129,3130,3157,3158,3159,3160,3161,3162,3163,3164,3165,3166,3167,3168,3169,3170,3171,3172,3173,3174,3175,3176,3177,3178,3179,3180,3181,3182,3183,3184,3185,3186,3187,3188,3189,3190,3191,3192,3193,3211,3212,3213,3214,3215,3216,3217,3218,3219,3220,3233,3263,3264,3265,3266,3267,3268,3269,3270,3271,3272,3273,3274,3275,3276,3277,3278,3279,3280,3281,3282,3283,3284,3285,3286,3287,3288,3289,3290,3291,3292,3293,3294,3295,3296,3297,3298,3299,3300,3301,3302,3303,3304,3305,3306,3307,3308,3309,3310,3311,3312,3313,3314,3315,3316,3317,3318,3319,3320,3321,3322,3323,3324,3325,3326,3327,

Unnamed: 0,eid,21-0.0,21-1.0,21-2.0,31-0.0,34-0.0,35-0.0,35-1.0,35-2.0,36-0.0,...,40021-7.0,40021-8.0,40021-9.0,40021-10.0,40021-11.0,40021-12.0,40021-13.0,40021-14.0,40021-15.0,40021-16.0
0,4863545,1.0,,,1,1942,1.0,,,1056.0,...,,,,,,,,,,
1,3440845,1.0,,,0,1950,1.0,,,3157.0,...,,,,,,,,,,
2,1191090,1.0,,,0,1954,1.0,,,1044.0,...,,,,,,,,,,
3,3087664,1.0,,,1,1944,1.0,,,1040.0,...,,,,,,,,,,
4,2195757,1.0,,,0,1958,1.0,,,1039.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9939,5612713,1.0,,,1,1958,1.0,,,3156.0,...,,,,,,,,,,
9940,6022397,1.0,,,1,1945,1.0,,,2818.0,...,,,,,,,,,,
9941,5011216,1.0,,,0,1947,1.0,,,3155.0,...,,,,,,,,,,
9942,3482443,1.0,,,1,1950,1.0,,,1045.0,...,,,,,,,,,,


Below we show some information about the total dataframe. We find that there are mostly float data types in our dataframe, but also a few integer and a bunch of object data types.

In [5]:
prediabetes_with_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9944 entries, 0 to 9943
Columns: 12711 entries, eid to 40021-16.0
dtypes: float64(10792), int64(6), object(1913)
memory usage: 964.3+ MB


Below we show that there are many columns in which duplicate data is recorded. This may be due to patients requesting not to have their data used in the study after it was originally recorded and posted for use. After these patients decided not to share their data, their information was removed in a column in a different text file. The higher the number of the csv file (e.g. whole_file_26867 < whole_file_42385) the less patients kept in the columns. We need to account for this request of removal and not use patients who decide not to share their information as ethical researchers. Since both columns are kept, they are differentiated by x (representing column from lower number of csv file) and y (representing column from higher number of csv file which we want to keep) in the pandas dataframe. Therefore, we want to get rid of all columns ending in x and keep all columns ending in y. Below we search for all columns ending in x. and print all the column names so that we can easily drop them from the dataframe.

In [6]:
spike_cols = [col for col in prediabetes_with_features.columns if 'x' in col]
print(spike_cols)
print(len(spike_cols))

['84-0.0_x', '84-0.1_x', '84-0.2_x', '84-0.3_x', '84-0.4_x', '84-0.5_x', '84-1.0_x', '84-1.1_x', '84-1.2_x', '84-1.3_x', '84-1.4_x', '84-1.5_x', '84-2.0_x', '84-2.1_x', '84-2.2_x', '84-2.3_x', '84-2.4_x', '84-2.5_x', '87-0.0_x', '87-0.1_x', '87-0.2_x', '87-0.3_x', '87-0.4_x', '87-0.5_x', '87-0.6_x', '87-0.7_x', '87-0.8_x', '87-0.9_x', '87-0.10_x', '87-0.11_x', '87-0.12_x', '87-0.13_x', '87-0.14_x', '87-0.15_x', '87-0.16_x', '87-0.17_x', '87-0.18_x', '87-0.19_x', '87-0.20_x', '87-0.21_x', '87-0.22_x', '87-0.23_x', '87-0.24_x', '87-0.25_x', '87-0.26_x', '87-0.27_x', '87-0.28_x', '87-0.29_x', '87-0.30_x', '87-0.31_x', '87-0.32_x', '87-0.33_x', '87-1.0_x', '87-1.1_x', '87-1.2_x', '87-1.3_x', '87-1.4_x', '87-1.5_x', '87-1.6_x', '87-1.7_x', '87-1.8_x', '87-1.9_x', '87-1.10_x', '87-1.11_x', '87-1.12_x', '87-1.13_x', '87-1.14_x', '87-1.15_x', '87-1.16_x', '87-1.17_x', '87-1.18_x', '87-1.19_x', '87-1.20_x', '87-1.21_x', '87-1.22_x', '87-1.23_x', '87-1.24_x', '87-1.25_x', '87-1.26_x', '87-1.27_x


1652


Now we want to delete these columns from the final dataframe, shown below.

In [7]:
prediabetes_with_features = prediabetes_with_features.drop(columns = ['84-0.0_x', '84-0.1_x', '84-0.2_x', '84-0.3_x', '84-0.4_x', '84-0.5_x', '84-1.0_x', '84-1.1_x', '84-1.2_x', '84-1.3_x', '84-1.4_x', '84-1.5_x', '84-2.0_x', '84-2.1_x', '84-2.2_x', '84-2.3_x', '84-2.4_x', '84-2.5_x', '87-0.0_x', '87-0.1_x', '87-0.2_x', '87-0.3_x', '87-0.4_x', '87-0.5_x', '87-0.6_x', '87-0.7_x', '87-0.8_x', '87-0.9_x', '87-0.10_x', '87-0.11_x', '87-0.12_x', '87-0.13_x', '87-0.14_x', '87-0.15_x', '87-0.16_x', '87-0.17_x', '87-0.18_x', '87-0.19_x', '87-0.20_x', '87-0.21_x', '87-0.22_x', '87-0.23_x', '87-0.24_x', '87-0.25_x', '87-0.26_x', '87-0.27_x', '87-0.28_x', '87-0.29_x', '87-0.30_x', '87-0.31_x', '87-0.32_x', '87-0.33_x', '87-1.0_x', '87-1.1_x', '87-1.2_x', '87-1.3_x', '87-1.4_x', '87-1.5_x', '87-1.6_x', '87-1.7_x', '87-1.8_x', '87-1.9_x', '87-1.10_x', '87-1.11_x', '87-1.12_x', '87-1.13_x', '87-1.14_x', '87-1.15_x', '87-1.16_x', '87-1.17_x', '87-1.18_x', '87-1.19_x', '87-1.20_x', '87-1.21_x', '87-1.22_x', '87-1.23_x', '87-1.24_x', '87-1.25_x', '87-1.26_x', '87-1.27_x', '87-1.28_x', '87-1.29_x', '87-1.30_x', '87-1.31_x', '87-1.32_x', '87-1.33_x', '87-2.0_x', '87-2.1_x', '87-2.2_x', '87-2.3_x', '87-2.4_x', '87-2.5_x', '87-2.6_x', '87-2.7_x', '87-2.8_x', '87-2.9_x', '87-2.10_x', '87-2.11_x', '87-2.12_x', '87-2.13_x', '87-2.14_x', '87-2.15_x', '87-2.16_x', '87-2.17_x', '87-2.18_x', '87-2.19_x', '87-2.20_x', '87-2.21_x', '87-2.22_x', '87-2.23_x', '87-2.24_x', '87-2.25_x', '87-2.26_x', '87-2.27_x', '87-2.28_x', '87-2.29_x', '87-2.30_x', '87-2.31_x', '87-2.32_x', '87-2.33_x', '134-0.0_x', '134-1.0_x', '134-2.0_x', '135-0.0_x', '135-1.0_x', '135-2.0_x', '1259-0.0_x', '1259-1.0_x', '1259-2.0_x', '1558-0.0_x', '1558-1.0_x', '1558-2.0_x', '1647-0.0_x', '1647-1.0_x', '1647-2.0_x', '1677-0.0_x', '1677-1.0_x', '1677-2.0_x', '1687-0.0_x', '1687-1.0_x', '1687-2.0_x', '1697-0.0_x', '1697-1.0_x', '1697-2.0_x', '1707-0.0_x', '1707-1.0_x', '1707-2.0_x', '1767-0.0_x', '1767-1.0_x', '1767-2.0_x', '1777-0.0_x', '1777-1.0_x', '1777-2.0_x', '1787-0.0_x', '1787-1.0_x', '1787-2.0_x', '1797-0.0_x', '1797-1.0_x', '1797-2.0_x', '1807-0.0_x', '1807-1.0_x', '1807-2.0_x', '1835-0.0_x', '1835-1.0_x', '1835-2.0_x', '1845-0.0_x', '1845-1.0_x', '1845-2.0_x', '1873-0.0_x', '1873-1.0_x', '1873-2.0_x', '1883-0.0_x', '1883-1.0_x', '1883-2.0_x', '2946-0.0_x', '2946-1.0_x', '2946-2.0_x', '3140-0.0_x', '3140-1.0_x', '3140-2.0_x', '3526-0.0_x', '3526-1.0_x', '3526-2.0_x', '3912-0.0_x', '3912-1.0_x', '3912-2.0_x', '3942-0.0_x', '3942-1.0_x', '3942-2.0_x', '3972-0.0_x', '3972-1.0_x', '3972-2.0_x', '3982-0.0_x', '3982-1.0_x', '3982-2.0_x', '4501-0.0_x', '4501-1.0_x', '4501-2.0_x', '5057-0.0_x', '5057-1.0_x', '5057-2.0_x', '5364-0.0_x', '5364-1.0_x', '5364-2.0_x', '20001-0.0_x', '20001-0.1_x', '20001-0.2_x', '20001-0.3_x', '20001-0.4_x', '20001-0.5_x', '20001-1.0_x', '20001-1.1_x', '20001-1.2_x', '20001-1.3_x', '20001-1.4_x', '20001-1.5_x', '20001-2.0_x', '20001-2.1_x', '20001-2.2_x', '20001-2.3_x', '20001-2.4_x', '20001-2.5_x', '20002-0.0_x', '20002-0.1_x', '20002-0.2_x', '20002-0.3_x', '20002-0.4_x', '20002-0.5_x', '20002-0.6_x', '20002-0.7_x', '20002-0.8_x', '20002-0.9_x', '20002-0.10_x', '20002-0.11_x', '20002-0.12_x', '20002-0.13_x', '20002-0.14_x', '20002-0.15_x', '20002-0.16_x', '20002-0.17_x', '20002-0.18_x', '20002-0.19_x', '20002-0.20_x', '20002-0.21_x', '20002-0.22_x', '20002-0.23_x', '20002-0.24_x', '20002-0.25_x', '20002-0.26_x', '20002-0.27_x', '20002-0.28_x', '20002-0.29_x', '20002-0.30_x', '20002-0.31_x', '20002-0.32_x', '20002-0.33_x', '20002-1.0_x', '20002-1.1_x', '20002-1.2_x', '20002-1.3_x', '20002-1.4_x', '20002-1.5_x', '20002-1.6_x', '20002-1.7_x', '20002-1.8_x', '20002-1.9_x', '20002-1.10_x', '20002-1.11_x', '20002-1.12_x', '20002-1.13_x', '20002-1.14_x', '20002-1.15_x', '20002-1.16_x', '20002-1.17_x', '20002-1.18_x', '20002-1.19_x', '20002-1.20_x', '20002-1.21_x', '20002-1.22_x', '20002-1.23_x', '20002-1.24_x', '20002-1.25_x', '20002-1.26_x', '20002-1.27_x', '20002-1.28_x', '20002-1.29_x', '20002-1.30_x', '20002-1.31_x', '20002-1.32_x', '20002-1.33_x', '20002-2.0_x', '20002-2.1_x', '20002-2.2_x', '20002-2.3_x', '20002-2.4_x', '20002-2.5_x', '20002-2.6_x', '20002-2.7_x', '20002-2.8_x', '20002-2.9_x', '20002-2.10_x', '20002-2.11_x', '20002-2.12_x', '20002-2.13_x', '20002-2.14_x', '20002-2.15_x', '20002-2.16_x', '20002-2.17_x', '20002-2.18_x', '20002-2.19_x', '20002-2.20_x', '20002-2.21_x', '20002-2.22_x', '20002-2.23_x', '20002-2.24_x', '20002-2.25_x', '20002-2.26_x', '20002-2.27_x', '20002-2.28_x', '20002-2.29_x', '20002-2.30_x', '20002-2.31_x', '20002-2.32_x', '20002-2.33_x', '20006-0.0_x', '20006-0.1_x', '20006-0.2_x', '20006-0.3_x', '20006-0.4_x', '20006-0.5_x', '20006-1.0_x', '20006-1.1_x', '20006-1.2_x', '20006-1.3_x', '20006-1.4_x', '20006-1.5_x', '20006-2.0_x', '20006-2.1_x', '20006-2.2_x', '20006-2.3_x', '20006-2.4_x', '20006-2.5_x', '20007-0.0_x', '20007-0.1_x', '20007-0.2_x', '20007-0.3_x', '20007-0.4_x', '20007-0.5_x', '20007-1.0_x', '20007-1.1_x', '20007-1.2_x', '20007-1.3_x', '20007-1.4_x', '20007-1.5_x', '20007-2.0_x', '20007-2.1_x', '20007-2.2_x', '20007-2.3_x', '20007-2.4_x', '20007-2.5_x', '20008-0.0_x', '20008-0.1_x', '20008-0.2_x', '20008-0.3_x', '20008-0.4_x', '20008-0.5_x', '20008-0.6_x', '20008-0.7_x', '20008-0.8_x', '20008-0.9_x', '20008-0.10_x', '20008-0.11_x', '20008-0.12_x', '20008-0.13_x', '20008-0.14_x', '20008-0.15_x', '20008-0.16_x', '20008-0.17_x', '20008-0.18_x', '20008-0.19_x', '20008-0.20_x', '20008-0.21_x', '20008-0.22_x', '20008-0.23_x', '20008-0.24_x', '20008-0.25_x', '20008-0.26_x', '20008-0.27_x', '20008-0.28_x', '20008-0.29_x', '20008-0.30_x', '20008-0.31_x', '20008-0.32_x', '20008-0.33_x', '20008-1.0_x', '20008-1.1_x', '20008-1.2_x', '20008-1.3_x', '20008-1.4_x', '20008-1.5_x', '20008-1.6_x', '20008-1.7_x', '20008-1.8_x', '20008-1.9_x', '20008-1.10_x', '20008-1.11_x', '20008-1.12_x', '20008-1.13_x', '20008-1.14_x', '20008-1.15_x', '20008-1.16_x', '20008-1.17_x', '20008-1.18_x', '20008-1.19_x', '20008-1.20_x', '20008-1.21_x', '20008-1.22_x', '20008-1.23_x', '20008-1.24_x', '20008-1.25_x', '20008-1.26_x', '20008-1.27_x', '20008-1.28_x', '20008-1.29_x', '20008-1.30_x', '20008-1.31_x', '20008-1.32_x', '20008-1.33_x', '20008-2.0_x', '20008-2.1_x', '20008-2.2_x', '20008-2.3_x', '20008-2.4_x', '20008-2.5_x', '20008-2.6_x', '20008-2.7_x', '20008-2.8_x', '20008-2.9_x', '20008-2.10_x', '20008-2.11_x', '20008-2.12_x', '20008-2.13_x', '20008-2.14_x', '20008-2.15_x', '20008-2.16_x', '20008-2.17_x', '20008-2.18_x', '20008-2.19_x', '20008-2.20_x', '20008-2.21_x', '20008-2.22_x', '20008-2.23_x', '20008-2.24_x', '20008-2.25_x', '20008-2.26_x', '20008-2.27_x', '20008-2.28_x', '20008-2.29_x', '20008-2.30_x', '20008-2.31_x', '20008-2.32_x', '20008-2.33_x', '20009-0.0_x', '20009-0.1_x', '20009-0.2_x', '20009-0.3_x', '20009-0.4_x', '20009-0.5_x', '20009-0.6_x', '20009-0.7_x', '20009-0.8_x', '20009-0.9_x', '20009-0.10_x', '20009-0.11_x', '20009-0.12_x', '20009-0.13_x', '20009-0.14_x', '20009-0.15_x', '20009-0.16_x', '20009-0.17_x', '20009-0.18_x', '20009-0.19_x', '20009-0.20_x', '20009-0.21_x', '20009-0.22_x', '20009-0.23_x', '20009-0.24_x', '20009-0.25_x', '20009-0.26_x', '20009-0.27_x', '20009-0.28_x', '20009-0.29_x', '20009-0.30_x', '20009-0.31_x', '20009-0.32_x', '20009-0.33_x', '20009-1.0_x', '20009-1.1_x', '20009-1.2_x', '20009-1.3_x', '20009-1.4_x', '20009-1.5_x', '20009-1.6_x', '20009-1.7_x', '20009-1.8_x', '20009-1.9_x', '20009-1.10_x', '20009-1.11_x', '20009-1.12_x', '20009-1.13_x', '20009-1.14_x', '20009-1.15_x', '20009-1.16_x', '20009-1.17_x', '20009-1.18_x', '20009-1.19_x', '20009-1.20_x', '20009-1.21_x', '20009-1.22_x', '20009-1.23_x', '20009-1.24_x', '20009-1.25_x', '20009-1.26_x', '20009-1.27_x', '20009-1.28_x', '20009-1.29_x', '20009-1.30_x', '20009-1.31_x', '20009-1.32_x', '20009-1.33_x', '20009-2.0_x', '20009-2.1_x', '20009-2.2_x', '20009-2.3_x', '20009-2.4_x', '20009-2.5_x', '20009-2.6_x', '20009-2.7_x', '20009-2.8_x', '20009-2.9_x', '20009-2.10_x', '20009-2.11_x', '20009-2.12_x', '20009-2.13_x', '20009-2.14_x', '20009-2.15_x', '20009-2.16_x', '20009-2.17_x', '20009-2.18_x', '20009-2.19_x', '20009-2.20_x', '20009-2.21_x', '20009-2.22_x', '20009-2.23_x', '20009-2.24_x', '20009-2.25_x', '20009-2.26_x', '20009-2.27_x', '20009-2.28_x', '20009-2.29_x', '20009-2.30_x', '20009-2.31_x', '20009-2.32_x', '20009-2.33_x', '20012-0.0_x', '20012-0.1_x', '20012-0.2_x', '20012-0.3_x', '20012-0.4_x', '20012-0.5_x', '20012-1.0_x', '20012-1.1_x', '20012-1.2_x', '20012-1.3_x', '20012-1.4_x', '20012-1.5_x', '20012-2.0_x', '20012-2.1_x', '20012-2.2_x', '20012-2.3_x', '20012-2.4_x', '20012-2.5_x', '20013-0.0_x', '20013-0.1_x', '20013-0.2_x', '20013-0.3_x', '20013-0.4_x', '20013-0.5_x', '20013-0.6_x', '20013-0.7_x', '20013-0.8_x', '20013-0.9_x', '20013-0.10_x', '20013-0.11_x', '20013-0.12_x', '20013-0.13_x', '20013-0.14_x', '20013-0.15_x', '20013-0.16_x', '20013-0.17_x', '20013-0.18_x', '20013-0.19_x', '20013-0.20_x', '20013-0.21_x', '20013-0.22_x', '20013-0.23_x', '20013-0.24_x', '20013-0.25_x', '20013-0.26_x', '20013-0.27_x', '20013-0.28_x', '20013-0.29_x', '20013-0.30_x', '20013-0.31_x', '20013-0.32_x', '20013-0.33_x', '20013-1.0_x', '20013-1.1_x', '20013-1.2_x', '20013-1.3_x', '20013-1.4_x', '20013-1.5_x', '20013-1.6_x', '20013-1.7_x', '20013-1.8_x', '20013-1.9_x', '20013-1.10_x', '20013-1.11_x', '20013-1.12_x', '20013-1.13_x', '20013-1.14_x', '20013-1.15_x', '20013-1.16_x', '20013-1.17_x', '20013-1.18_x', '20013-1.19_x', '20013-1.20_x', '20013-1.21_x', '20013-1.22_x', '20013-1.23_x', '20013-1.24_x', '20013-1.25_x', '20013-1.26_x', '20013-1.27_x', '20013-1.28_x', '20013-1.29_x', '20013-1.30_x', '20013-1.31_x', '20013-1.32_x', '20013-1.33_x', '20013-2.0_x', '20013-2.1_x', '20013-2.2_x', '20013-2.3_x', '20013-2.4_x', '20013-2.5_x', '20013-2.6_x', '20013-2.7_x', '20013-2.8_x', '20013-2.9_x', '20013-2.10_x', '20013-2.11_x', '20013-2.12_x', '20013-2.13_x', '20013-2.14_x', '20013-2.15_x', '20013-2.16_x', '20013-2.17_x', '20013-2.18_x', '20013-2.19_x', '20013-2.20_x', '20013-2.21_x', '20013-2.22_x', '20013-2.23_x', '20013-2.24_x', '20013-2.25_x', '20013-2.26_x', '20013-2.27_x', '20013-2.28_x', '20013-2.29_x', '20013-2.30_x', '20013-2.31_x', '20013-2.32_x', '20013-2.33_x', '20107-0.0_x', '20107-0.1_x', '20107-0.2_x', '20107-0.3_x', '20107-0.4_x', '20107-0.5_x', '20107-0.6_x', '20107-0.7_x', '20107-0.8_x', '20107-0.9_x', '20107-1.0_x', '20107-1.1_x', '20107-1.2_x', '20107-1.3_x', '20107-1.4_x', '20107-1.5_x', '20107-1.6_x', '20107-1.7_x', '20107-1.8_x', '20107-1.9_x', '20107-2.0_x', '20107-2.1_x', '20107-2.2_x', '20107-2.3_x', '20107-2.4_x', '20107-2.5_x', '20107-2.6_x', '20107-2.7_x', '20107-2.8_x', '20107-2.9_x', '20110-0.0_x', '20110-0.1_x', '20110-0.2_x', '20110-0.3_x', '20110-0.4_x', '20110-0.5_x', '20110-0.6_x', '20110-0.7_x', '20110-0.8_x', '20110-0.9_x', '20110-0.10_x', '20110-1.0_x', '20110-1.1_x', '20110-1.2_x', '20110-1.3_x', '20110-1.4_x', '20110-1.5_x', '20110-1.6_x', '20110-1.7_x', '20110-1.8_x', '20110-1.9_x', '20110-1.10_x', '20110-2.0_x', '20110-2.1_x', '20110-2.2_x', '20110-2.3_x', '20110-2.4_x', '20110-2.5_x', '20110-2.6_x', '20110-2.7_x', '20110-2.8_x', '20110-2.9_x', '20110-2.10_x', '20111-0.0_x', '20111-0.1_x', '20111-0.2_x', '20111-0.3_x', '20111-0.4_x', '20111-0.5_x', '20111-0.6_x', '20111-0.7_x', '20111-0.8_x', '20111-0.9_x', '20111-0.10_x', '20111-0.11_x', '20111-1.0_x', '20111-1.1_x', '20111-1.2_x', '20111-1.3_x', '20111-1.4_x', '20111-1.5_x', '20111-1.6_x', '20111-1.7_x', '20111-1.8_x', '20111-1.9_x', '20111-1.10_x', '20111-1.11_x', '20111-2.0_x', '20111-2.1_x', '20111-2.2_x', '20111-2.3_x', '20111-2.4_x', '20111-2.5_x', '20111-2.6_x', '20111-2.7_x', '20111-2.8_x', '20111-2.9_x', '20111-2.10_x', '20111-2.11_x', '20112-0.0_x', '20112-0.1_x', '20112-0.2_x', '20112-0.3_x', '20112-0.4_x', '20112-0.5_x', '20112-0.6_x', '20112-1.0_x', '20112-1.1_x', '20112-1.2_x', '20112-1.3_x', '20112-1.4_x', '20112-1.5_x', '20112-1.6_x', '20112-2.0_x', '20112-2.1_x', '20112-2.2_x', '20112-2.3_x', '20112-2.4_x', '20112-2.5_x', '20112-2.6_x', '20113-0.0_x', '20113-0.1_x', '20113-0.2_x', '20113-0.3_x', '20113-0.4_x', '20113-0.5_x', '20113-1.0_x', '20113-1.1_x', '20113-1.2_x', '20113-1.3_x', '20113-1.4_x', '20113-1.5_x', '20113-2.0_x', '20113-2.1_x', '20113-2.2_x', '20113-2.3_x', '20113-2.4_x', '20113-2.5_x', '20114-0.0_x', '20114-0.1_x', '20114-0.2_x', '20114-0.3_x', '20114-0.4_x', '20114-0.5_x', '20114-0.6_x', '20114-1.0_x', '20114-1.1_x', '20114-1.2_x', '20114-1.3_x', '20114-1.4_x', '20114-1.5_x', '20114-1.6_x', '20114-2.0_x', '20114-2.1_x', '20114-2.2_x', '20114-2.3_x', '20114-2.4_x', '20114-2.5_x', '20114-2.6_x', '20116-0.0_x', '20116-1.0_x', '20116-2.0_x', '20117-0.0_x', '20117-1.0_x', '20117-2.0_x', '20160-0.0_x', '20160-1.0_x', '20160-2.0_x', '20161-0.0_x', '20161-1.0_x', '20161-2.0_x', '30000-0.0_x', '30000-1.0_x', '30000-2.0_x', '30010-0.0_x', '30010-1.0_x', '30010-2.0_x', '30020-0.0_x', '30020-1.0_x', '30020-2.0_x', '30030-0.0_x', '30030-1.0_x', '30030-2.0_x', '30040-0.0_x', '30040-1.0_x', '30040-2.0_x', '30050-0.0_x', '30050-1.0_x', '30050-2.0_x', '30060-0.0_x', '30060-1.0_x', '30060-2.0_x', '30070-0.0_x', '30070-1.0_x', '30070-2.0_x', '30080-0.0_x', '30080-1.0_x', '30080-2.0_x', '30090-0.0_x', '30090-1.0_x', '30090-2.0_x', '30100-0.0_x', '30100-1.0_x', '30100-2.0_x', '30110-0.0_x', '30110-1.0_x', '30110-2.0_x', '30120-0.0_x', '30120-1.0_x', '30120-2.0_x', '30130-0.0_x', '30130-1.0_x', '30130-2.0_x', '30140-0.0_x', '30140-1.0_x', '30140-2.0_x', '30150-0.0_x', '30150-1.0_x', '30150-2.0_x', '30160-0.0_x', '30160-1.0_x', '30160-2.0_x', '30170-0.0_x', '30170-1.0_x', '30170-2.0_x', '30180-0.0_x', '30180-1.0_x', '30180-2.0_x', '30190-0.0_x', '30190-1.0_x', '30190-2.0_x', '30200-0.0_x', '30200-1.0_x', '30200-2.0_x', '30210-0.0_x', '30210-1.0_x', '30210-2.0_x', '30220-0.0_x', '30220-1.0_x', '30220-2.0_x', '30230-0.0_x', '30230-1.0_x', '30230-2.0_x', '30240-0.0_x', '30240-1.0_x', '30240-2.0_x', '30250-0.0_x', '30250-1.0_x', '30250-2.0_x', '30260-0.0_x', '30260-1.0_x', '30260-2.0_x', '30270-0.0_x', '30270-1.0_x', '30270-2.0_x', '30280-0.0_x', '30280-1.0_x', '30280-2.0_x', '30290-0.0_x', '30290-1.0_x', '30290-2.0_x', '30300-0.0_x', '30300-1.0_x', '30300-2.0_x', '30502-0.0_x', '30502-1.0_x', '30503-0.0_x', '30503-1.0_x', '30512-0.0_x', '30512-1.0_x', '30513-0.0_x', '30513-1.0_x', '30522-0.0_x', '30522-1.0_x', '30523-0.0_x', '30523-1.0_x', '30532-0.0_x', '30532-1.0_x', '30533-0.0_x', '30533-1.0_x', '40005-0.0_x', '40005-1.0_x', '40005-2.0_x', '40005-3.0_x', '40005-4.0_x', '40005-5.0_x', '40005-6.0_x', '40005-7.0_x', '40005-8.0_x', '40005-9.0_x', '40005-10.0_x', '40005-11.0_x', '40005-12.0_x', '40005-13.0_x', '40005-14.0_x', '40005-15.0_x', '40005-16.0_x', '40006-0.0_x', '40006-1.0_x', '40006-2.0_x', '40006-3.0_x', '40006-4.0_x', '40006-5.0_x', '40006-6.0_x', '40006-7.0_x', '40006-8.0_x', '40006-9.0_x', '40006-10.0_x', '40006-11.0_x', '40006-12.0_x', '40006-13.0_x', '40006-14.0_x', '40006-15.0_x', '40006-16.0_x', '40008-0.0_x', '40008-1.0_x', '40008-2.0_x', '40008-3.0_x', '40008-4.0_x', '40008-5.0_x', '40008-6.0_x', '40008-7.0_x', '40008-8.0_x', '40008-9.0_x', '40008-10.0_x', '40008-11.0_x', '40008-12.0_x', '40008-13.0_x', '40008-14.0_x', '40008-15.0_x', '40008-16.0_x', '40009-0.0_x', '40011-0.0_x', '40011-1.0_x', '40011-2.0_x', '40011-3.0_x', '40011-4.0_x', '40011-5.0_x', '40011-6.0_x', '40011-7.0_x', '40011-8.0_x', '40011-9.0_x', '40011-10.0_x', '40011-11.0_x', '40011-12.0_x', '40011-13.0_x', '40011-14.0_x', '40011-15.0_x', '40011-16.0_x', '40012-0.0_x', '40012-1.0_x', '40012-2.0_x', '40012-3.0_x', '40012-4.0_x', '40012-5.0_x', '40012-6.0_x', '40012-7.0_x', '40012-8.0_x', '40012-9.0_x', '40012-10.0_x', '40012-11.0_x', '40012-12.0_x', '40012-13.0_x', '40012-14.0_x', '40012-15.0_x', '40012-16.0_x', '40013-0.0_x', '40013-1.0_x', '40013-2.0_x', '40013-3.0_x', '40013-4.0_x', '40013-5.0_x', '40013-6.0_x', '40013-7.0_x', '40013-8.0_x', '40013-9.0_x', '40013-10.0_x', '40013-11.0_x', '40013-12.0_x', '40013-13.0_x', '40013-14.0_x', '40019-0.0_x', '40019-1.0_x', '40019-2.0_x', '40019-3.0_x', '40019-4.0_x', '40019-5.0_x', '40019-6.0_x', '40019-7.0_x', '40019-8.0_x', '40019-9.0_x', '40019-10.0_x', '40019-11.0_x', '40019-12.0_x', '40019-13.0_x', '40019-14.0_x', '40019-15.0_x', '40019-16.0_x', '41200-0.0_x', '41200-0.1_x', '41200-0.2_x', '41200-0.3_x', '41200-0.4_x', '41200-0.5_x', '41200-0.6_x', '41200-0.7_x', '41200-0.8_x', '41200-0.9_x', '41200-0.10_x', '41200-0.11_x', '41200-0.12_x', '41200-0.13_x', '41200-0.14_x', '41200-0.15_x', '41200-0.16_x', '41200-0.17_x', '41200-0.18_x', '41200-0.19_x', '41200-0.20_x', '41200-0.21_x', '41200-0.22_x', '41200-0.23_x', '41200-0.24_x', '41200-0.25_x', '41200-0.26_x', '41200-0.27_x', '41200-0.28_x', '41200-0.29_x', '41200-0.30_x', '41200-0.31_x', '41200-0.32_x', '41200-0.33_x', '41200-0.34_x', '41200-0.35_x', '41200-0.36_x', '41200-0.37_x', '41200-0.38_x', '41200-0.39_x', '41200-0.40_x', '41200-0.41_x', '41200-0.42_x', '41200-0.43_x', '41200-0.44_x', '41200-0.45_x', '41200-0.46_x', '41200-0.47_x', '41200-0.48_x', '41210-0.0_x', '41210-0.1_x', '41210-0.2_x', '41210-0.3_x', '41210-0.4_x', '41210-0.5_x', '41210-0.6_x', '41210-0.7_x', '41210-0.8_x', '41210-0.9_x', '41210-0.10_x', '41210-0.11_x', '41210-0.12_x', '41210-0.13_x', '41210-0.14_x', '41210-0.15_x', '41210-0.16_x', '41210-0.17_x', '41210-0.18_x', '41210-0.19_x', '41210-0.20_x', '41210-0.21_x', '41210-0.22_x', '41210-0.23_x', '41210-0.24_x', '41210-0.25_x', '41210-0.26_x', '41210-0.27_x', '41210-0.28_x', '41210-0.29_x', '41210-0.30_x', '41210-0.31_x', '41210-0.32_x', '41210-0.33_x', '41210-0.34_x', '41210-0.35_x', '41210-0.36_x', '41210-0.37_x', '41210-0.38_x', '41210-0.39_x', '41210-0.40_x', '41210-0.41_x', '41210-0.42_x', '41210-0.43_x', '41210-0.44_x', '41210-0.45_x', '41210-0.46_x', '41210-0.47_x', '41210-0.48_x', '41210-0.49_x', '41210-0.50_x', '41210-0.51_x', '41210-0.52_x', '41210-0.53_x', '41210-0.54_x', '41210-0.55_x', '41210-0.56_x', '41210-0.57_x', '41210-0.58_x', '41210-0.59_x', '41210-0.60_x', '41210-0.61_x', '41210-0.62_x', '41210-0.63_x', '41210-0.64_x', '41210-0.65_x', '41210-0.66_x', '41210-0.67_x', '41210-0.68_x', '41210-0.69_x', '41210-0.70_x', '41210-0.71_x', '41210-0.72_x', '41210-0.73_x', '41210-0.74_x', '41210-0.75_x', '41210-0.76_x', '41210-0.77_x', '41210-0.78_x', '41210-0.79_x', '41210-0.80_x', '41210-0.81_x', '41210-0.82_x', '41210-0.83_x', '41210-0.84_x', '41210-0.85_x', '864-0.0_x', '864-1.0_x', '864-2.0_x', '874-0.0_x', '874-1.0_x', '874-2.0_x', '884-0.0_x', '884-1.0_x', '884-2.0_x', '894-0.0_x', '894-1.0_x', '894-2.0_x', '904-0.0_x', '904-1.0_x', '904-2.0_x', '914-0.0_x', '914-1.0_x', '914-2.0_x', '924-0.0_x', '924-1.0_x', '924-2.0_x', '943-0.0_x', '943-1.0_x', '943-2.0_x', '971-0.0_x', '971-1.0_x', '971-2.0_x', '981-0.0_x', '981-1.0_x', '981-2.0_x', '991-0.0_x', '991-1.0_x', '991-2.0_x', '1001-0.0_x', '1001-1.0_x', '1001-2.0_x', '1011-0.0_x', '1011-1.0_x', '1011-2.0_x', '1021-0.0_x', '1021-1.0_x', '1021-2.0_x', '1050-0.0_x', '1050-1.0_x', '1050-2.0_x', '1060-0.0_x', '1060-1.0_x', '1060-2.0_x', '1070-0.0_x', '1070-1.0_x', '1070-2.0_x', '1080-0.0_x', '1080-1.0_x', '1080-2.0_x', '1090-0.0_x', '1090-1.0_x', '1090-2.0_x', '1100-0.0_x', '1100-1.0_x', '1100-2.0_x', '1160-0.0_x', '1160-1.0_x', '1160-2.0_x', '1170-0.0_x', '1170-1.0_x', '1170-2.0_x', '1180-0.0_x', '1180-1.0_x', '1180-2.0_x', '1190-0.0_x', '1190-1.0_x', '1190-2.0_x', '1200-0.0_x', '1200-1.0_x', '1200-2.0_x', '1210-0.0_x', '1210-1.0_x', '1210-2.0_x', '1220-0.0_x', '1220-1.0_x', '1220-2.0_x', '1239-0.0_x', '1239-1.0_x', '1239-2.0_x', '1249-0.0_x', '1249-1.0_x', '1249-2.0_x', '1269-0.0_x', '1269-1.0_x', '1269-2.0_x', '1279-0.0_x', '1279-1.0_x', '1279-2.0_x', '1289-0.0_x', '1289-1.0_x', '1289-2.0_x', '1299-0.0_x', '1299-1.0_x', '1299-2.0_x', '1309-0.0_x', '1309-1.0_x', '1309-2.0_x', '1319-0.0_x', '1319-1.0_x', '1319-2.0_x', '1329-0.0_x', '1329-1.0_x', '1329-2.0_x', '1339-0.0_x', '1339-1.0_x', '1339-2.0_x', '1349-0.0_x', '1349-1.0_x', '1349-2.0_x', '1359-0.0_x', '1359-1.0_x', '1359-2.0_x', '1369-0.0_x', '1369-1.0_x', '1369-2.0_x', '1379-0.0_x', '1379-1.0_x', '1379-2.0_x', '1389-0.0_x', '1389-1.0_x', '1389-2.0_x', '1408-0.0_x', '1408-1.0_x', '1408-2.0_x', '1418-0.0_x', '1418-1.0_x', '1418-2.0_x', '1428-0.0_x', '1428-1.0_x', '1428-2.0_x', '1438-0.0_x', '1438-1.0_x', '1438-2.0_x', '1448-0.0_x', '1448-1.0_x', '1448-2.0_x', '1458-0.0_x', '1458-1.0_x', '1458-2.0_x', '1468-0.0_x', '1468-1.0_x', '1468-2.0_x', '1478-0.0_x', '1478-1.0_x', '1478-2.0_x', '1488-0.0_x', '1488-1.0_x', '1488-2.0_x', '1498-0.0_x', '1498-1.0_x', '1498-2.0_x', '1508-0.0_x', '1508-1.0_x', '1508-2.0_x', '1518-0.0_x', '1518-1.0_x', '1518-2.0_x', '1528-0.0_x', '1528-1.0_x', '1528-2.0_x', '1538-0.0_x', '1538-1.0_x', '1538-2.0_x', '1548-0.0_x', '1548-1.0_x', '1548-2.0_x', '1568-0.0_x', '1568-1.0_x', '1568-2.0_x', '1578-0.0_x', '1578-1.0_x', '1578-2.0_x', '1588-0.0_x', '1588-1.0_x', '1588-2.0_x', '1598-0.0_x', '1598-1.0_x', '1598-2.0_x', '1608-0.0_x', '1608-1.0_x', '1608-2.0_x', '1618-0.0_x', '1618-1.0_x', '1618-2.0_x', '1628-0.0_x', '1628-1.0_x', '1628-2.0_x', '1717-0.0_x', '1717-1.0_x', '1717-2.0_x', '1727-0.0_x', '1727-1.0_x', '1727-2.0_x', '1737-0.0_x', '1737-1.0_x', '1737-2.0_x', '1747-0.0_x', '1747-1.0_x', '1747-2.0_x', '1757-0.0_x', '1757-1.0_x', '1757-2.0_x', '2129-0.0_x', '2129-1.0_x', '2129-2.0_x', '2139-0.0_x', '2139-1.0_x', '2139-2.0_x', '2149-0.0_x', '2149-1.0_x', '2149-2.0_x', '2159-0.0_x', '2159-1.0_x', '2159-2.0_x', '2267-0.0_x', '2267-1.0_x', '2267-2.0_x', '2277-0.0_x', '2277-1.0_x', '2277-2.0_x', '2624-0.0_x', '2624-1.0_x', '2624-2.0_x', '2634-0.0_x', '2634-1.0_x', '2634-2.0_x', '2644-0.0_x', '2644-1.0_x', '2644-2.0_x', '2654-0.0_x', '2654-1.0_x', '2654-2.0_x', '2664-0.0_x', '2664-1.0_x', '2664-2.0_x', '2714-0.0_x', '2714-1.0_x', '2714-2.0_x', '2724-0.0_x', '2724-1.0_x', '2724-2.0_x', '2867-0.0_x', '2867-1.0_x', '2867-2.0_x', '2877-0.0_x', '2877-1.0_x', '2877-2.0_x', '2887-0.0_x', '2887-1.0_x', '2887-2.0_x', '2897-0.0_x', '2897-1.0_x', '2897-2.0_x', '2907-0.0_x', '2907-1.0_x', '2907-2.0_x', '2926-0.0_x', '2926-1.0_x', '2926-2.0_x', '2936-0.0_x', '2936-1.0_x', '2936-2.0_x', '3436-0.0_x', '3436-1.0_x', '3436-2.0_x', '3446-0.0_x', '3446-1.0_x', '3446-2.0_x', '3456-0.0_x', '3456-1.0_x', '3456-2.0_x', '3466-0.0_x', '3466-1.0_x', '3466-2.0_x', '3476-0.0_x', '3476-1.0_x', '3476-2.0_x', '3486-0.0_x', '3486-1.0_x', '3486-2.0_x', '3496-0.0_x', '3496-1.0_x', '3496-2.0_x', '3506-0.0_x', '3506-1.0_x', '3506-2.0_x', '3637-0.0_x', '3637-1.0_x', '3637-2.0_x', '3647-0.0_x', '3647-1.0_x', '3647-2.0_x', '3669-0.0_x', '3669-1.0_x', '3669-2.0_x', '3680-0.0_x', '3680-1.0_x', '3680-2.0_x', '3700-0.0_x', '3700-1.0_x', '3700-2.0_x', '3710-0.0_x', '3710-1.0_x', '3710-2.0_x', '3720-0.0_x', '3720-1.0_x', '3720-2.0_x', '3731-0.0_x', '3731-1.0_x', '3731-2.0_x', '3859-0.0_x', '3859-1.0_x', '3859-2.0_x', '4407-0.0_x', '4407-1.0_x', '4407-2.0_x', '4418-0.0_x', '4418-1.0_x', '4418-2.0_x', '4429-0.0_x', '4429-1.0_x', '4429-2.0_x', '4440-0.0_x', '4440-1.0_x', '4440-2.0_x', '4451-0.0_x', '4451-1.0_x', '4451-2.0_x', '4462-0.0_x', '4462-1.0_x', '4462-2.0_x', '5959-0.0_x', '5959-1.0_x', '5959-2.0_x', '6144-0.0_x', '6144-0.1_x', '6144-0.2_x', '6144-0.3_x', '6144-1.0_x', '6144-1.1_x', '6144-1.2_x', '6144-1.3_x', '6144-2.0_x', '6144-2.1_x', '6144-2.2_x', '6144-2.3_x', '6157-0.0_x', '6157-0.1_x', '6157-0.2_x', '6157-0.3_x', '6157-1.0_x', '6157-1.1_x', '6157-1.2_x', '6157-1.3_x', '6157-2.0_x', '6157-2.1_x', '6157-2.2_x', '6157-2.3_x', '6158-0.0_x', '6158-0.1_x', '6158-0.2_x', '6158-0.3_x', '6158-1.0_x', '6158-1.1_x', '6158-1.2_x', '6158-1.3_x', '6158-2.0_x', '6158-2.1_x', '6158-2.2_x', '6158-2.3_x', '6162-0.0_x', '6162-0.1_x', '6162-0.2_x', '6162-0.3_x', '6162-1.0_x', '6162-1.1_x', '6162-1.2_x', '6162-1.3_x', '6162-2.0_x', '6162-2.1_x', '6162-2.2_x', '6162-2.3_x', '6164-0.0_x', '6164-0.1_x', '6164-0.2_x', '6164-0.3_x', '6164-0.4_x', '6164-1.0_x', '6164-1.1_x', '6164-1.2_x', '6164-1.3_x', '6164-1.4_x', '6164-2.0_x', '6164-2.1_x', '6164-2.2_x', '6164-2.3_x', '6164-2.4_x', '6183-0.0_x', '6183-1.0_x', '6183-2.0_x', '6194-0.0_x', '6194-1.0_x', '6194-2.0_x', '10115-0.0_x', '10767-0.0_x', '10776-0.0_x', '10818-0.0_x', '10827-0.0_x', '10853-0.0_x', '10855-0.0_x', '10855-0.1_x', '10855-0.2_x', '10855-0.3_x', '10895-0.0_x', '10912-0.0_x', '10953-0.0_x', '10962-0.0_x', '10971-0.0_x', '20162-0.0_x', '20162-1.0_x', '20162-2.0_x', '22032-0.0_x', '22033-0.0_x', '22034-0.0_x', '22035-0.0_x', '22036-0.0_x', '22037-0.0_x', '22038-0.0_x', '22039-0.0_x', '22040-0.0_x'])
prediabetes_with_features

Unnamed: 0,eid,21-0.0,21-1.0,21-2.0,31-0.0,34-0.0,35-0.0,35-1.0,35-2.0,36-0.0,...,40021-7.0,40021-8.0,40021-9.0,40021-10.0,40021-11.0,40021-12.0,40021-13.0,40021-14.0,40021-15.0,40021-16.0
0,4863545,1.0,,,1,1942,1.0,,,1056.0,...,,,,,,,,,,
1,3440845,1.0,,,0,1950,1.0,,,3157.0,...,,,,,,,,,,
2,1191090,1.0,,,0,1954,1.0,,,1044.0,...,,,,,,,,,,
3,3087664,1.0,,,1,1944,1.0,,,1040.0,...,,,,,,,,,,
4,2195757,1.0,,,0,1958,1.0,,,1039.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9939,5612713,1.0,,,1,1958,1.0,,,3156.0,...,,,,,,,,,,
9940,6022397,1.0,,,1,1945,1.0,,,2818.0,...,,,,,,,,,,
9941,5011216,1.0,,,0,1947,1.0,,,3155.0,...,,,,,,,,,,
9942,3482443,1.0,,,1,1950,1.0,,,1045.0,...,,,,,,,,,,


Below we show that the shape has decreased by the number of columns with an x at the end (1652).

In [8]:
prediabetes_with_features.shape

(9944, 11059)

The next step is to rename all the columns ending in y such that we drop the "_y" pattern from them so that all columns are kept in the same form (UDI number, with exception of eid)

In [9]:
prediabetes_with_features.columns = prediabetes_with_features.columns.str.replace('_y','')
prediabetes_with_features

Unnamed: 0,eid,21-0.0,21-1.0,21-2.0,31-0.0,34-0.0,35-0.0,35-1.0,35-2.0,36-0.0,...,40021-7.0,40021-8.0,40021-9.0,40021-10.0,40021-11.0,40021-12.0,40021-13.0,40021-14.0,40021-15.0,40021-16.0
0,4863545,1.0,,,1,1942,1.0,,,1056.0,...,,,,,,,,,,
1,3440845,1.0,,,0,1950,1.0,,,3157.0,...,,,,,,,,,,
2,1191090,1.0,,,0,1954,1.0,,,1044.0,...,,,,,,,,,,
3,3087664,1.0,,,1,1944,1.0,,,1040.0,...,,,,,,,,,,
4,2195757,1.0,,,0,1958,1.0,,,1039.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9939,5612713,1.0,,,1,1958,1.0,,,3156.0,...,,,,,,,,,,
9940,6022397,1.0,,,1,1945,1.0,,,2818.0,...,,,,,,,,,,
9941,5011216,1.0,,,0,1947,1.0,,,3155.0,...,,,,,,,,,,
9942,3482443,1.0,,,1,1950,1.0,,,1045.0,...,,,,,,,,,,


Below we show all patients who are diagnosed with diabetes when their measurements are taken and we need to cut them out of the dataframe.

In [10]:
already_diabetic_patients_at_measurement = prediabetes_with_features[prediabetes_with_features['2443-0.0'] == 1]
already_diabetic_patients_at_measurement

Unnamed: 0,eid,21-0.0,21-1.0,21-2.0,31-0.0,34-0.0,35-0.0,35-1.0,35-2.0,36-0.0,...,40021-7.0,40021-8.0,40021-9.0,40021-10.0,40021-11.0,40021-12.0,40021-13.0,40021-14.0,40021-15.0,40021-16.0
14,2425377,1.0,,,0,1946,1.0,,,2818.0,...,,,,,,,,,,
56,5512282,3.0,,,1,1940,1.0,,,1044.0,...,,,,,,,,,,
61,5724786,1.0,,,1,1940,1.0,,,1057.0,...,,,,,,,,,,
200,4256846,3.0,,1.0,1,1955,1.0,,1.0,1039.0,...,,,,,,,,,,
222,2189918,1.0,,,0,1961,1.0,,,1039.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9640,1665367,1.0,,,1,1946,1.0,,,1053.0,...,,,,,,,,,,
9675,2739978,1.0,,,0,1941,1.0,,,1056.0,...,,,,,,,,,,
9689,5684156,3.0,,,0,1940,1.0,,,1047.0,...,,,,,,,,,,
9779,5209790,1.0,,,1,1950,1.0,,,2815.0,...,,,,,,,,,,


Below we make a list of the eid numbers.

In [11]:
dfToList_already_diabetic_at_start = already_diabetic_patients_at_measurement['eid'].tolist()
dfToList_already_diabetic_at_start

[2425377,
 5512282,
 5724786,
 4256846,
 2189918,
 2546801,
 1706511,
 4387781,
 3462831,
 2817041,
 1072495,
 2268893,
 4683307,
 2748729,
 4623980,
 5605075,
 5275213,
 1146097,
 4821929,
 5415675,
 2429507,
 3265844,
 4578103,
 1202644,
 2586223,
 2763479,
 3593711,
 2480161,
 1526262,
 2268137,
 4861030,
 2238033,
 4496514,
 1731265,
 3829166,
 2883116,
 5708973,
 2693446,
 4246045,
 3857759,
 2484797,
 2994186,
 4745503,
 1338136,
 5905564,
 1336221,
 4270479,
 3342350,
 2225944,
 1328378,
 2500901,
 2850077,
 2219400,
 2203109,
 4690101,
 3264433,
 1637230,
 1457680,
 5083386,
 5011490,
 2591687,
 1404125,
 1936742,
 4779249,
 2839824,
 3230080,
 3669628,
 4983552,
 1067255,
 1895718,
 2648322,
 1424752,
 2693862,
 1117612,
 4157944,
 3836870,
 5402929,
 5903604,
 3415657,
 3676477,
 1105902,
 2650506,
 4791134,
 2862531,
 4876864,
 3348953,
 3392380,
 2233450,
 4729030,
 2191774,
 5311017,
 4641634,
 5878402,
 2031987,
 4366316,
 1340066,
 2344091,
 1568527,
 2997821,
 5665933,


# Now that we have all 190 patients who need to be cut from the dataframe, we show the construction of the dataframe of 5431 patients. First we show that the prediabetic patients we found are mostly not overlapping with the currently defined patients we classified as prediabetic.

Below we import all patients who have an ICD code of prediabetic.

In [12]:
prediabetes_with_dates = pd.read_csv('patients_with_prediabetes_with_dates')
prediabetes_with_dates = prediabetes_with_dates.drop(columns = ['Unnamed: 0'])
prediabetes_with_dates

Unnamed: 0,eid,ins_index,arr_index,level,diag_icd9,diag_icd9_nb,diag_icd10,diag_icd10_nb,epistart
0,1006294,37,3,2,,,R739,,12/02/2018
1,1007537,0,3,2,,,R730,,06/06/2019
2,1008393,2,1,2,,,R739,,26/04/2011
3,1010613,0,0,1,,,R739,,07/06/2010
4,1011449,12,5,2,,,R730,,03/02/2020
...,...,...,...,...,...,...,...,...,...
3794,6017167,3,3,2,,,R730,,09/12/2019
3795,6019270,5,9,2,,,R730,,24/09/2020
3796,6020217,7,12,2,,,R730,,19/08/2020
3797,6021943,1,3,2,,,R739,,11/08/2020


Below we show the number of actual patients in this new sample group.

In [13]:
prediabetes_with_dates.drop_duplicates(subset = 'eid')

Unnamed: 0,eid,ins_index,arr_index,level,diag_icd9,diag_icd9_nb,diag_icd10,diag_icd10_nb,epistart
0,1006294,37,3,2,,,R739,,12/02/2018
1,1007537,0,3,2,,,R730,,06/06/2019
2,1008393,2,1,2,,,R739,,26/04/2011
3,1010613,0,0,1,,,R739,,07/06/2010
4,1011449,12,5,2,,,R730,,03/02/2020
...,...,...,...,...,...,...,...,...,...
3794,6017167,3,3,2,,,R730,,09/12/2019
3795,6019270,5,9,2,,,R730,,24/09/2020
3796,6020217,7,12,2,,,R730,,19/08/2020
3797,6021943,1,3,2,,,R739,,11/08/2020


Below we make a list of all eid numbers for these patients.

In [14]:
dfToList_prediabetes_with_dates = prediabetes_with_dates['eid'].tolist()
dfToList_prediabetes_with_dates

[1006294,
 1007537,
 1008393,
 1010613,
 1011449,
 1011449,
 1011449,
 1011449,
 1011449,
 1011449,
 1011962,
 1011962,
 1013036,
 1013036,
 1013036,
 1014004,
 1017269,
 1019396,
 1019396,
 1019396,
 1019396,
 1020786,
 1020786,
 1021189,
 1022728,
 1024680,
 1025738,
 1025738,
 1026846,
 1029208,
 1029688,
 1030702,
 1032950,
 1040790,
 1044718,
 1044919,
 1047778,
 1048696,
 1049423,
 1050516,
 1050516,
 1050516,
 1050674,
 1051881,
 1053505,
 1053846,
 1055445,
 1055591,
 1055677,
 1055794,
 1058522,
 1062014,
 1067255,
 1067321,
 1067321,
 1067321,
 1070410,
 1070616,
 1072495,
 1073132,
 1074343,
 1075479,
 1076801,
 1078042,
 1078042,
 1078115,
 1079176,
 1086312,
 1087841,
 1088760,
 1088760,
 1092549,
 1092549,
 1094789,
 1094789,
 1095516,
 1095516,
 1096278,
 1096278,
 1097144,
 1097522,
 1099911,
 1102492,
 1102726,
 1105902,
 1105902,
 1106809,
 1106809,
 1111155,
 1111802,
 1116010,
 1116102,
 1116102,
 1117612,
 1119987,
 1120236,
 1121406,
 1121406,
 1121406,
 1121406,


Below we import all patients who are involved in our cohorts, whether unknown or known to be diabetic or not progress to diabetes or neither.

In [15]:
prediabetes_for_feature_selection = pd.read_csv('prediabetes_df_after_data_manipulation_ready_for_feature_selection_with_episodes_data_keeping_all_prediabetic_patients')
prediabetes_for_feature_selection = prediabetes_for_feature_selection.drop(columns = 'Unnamed: 0')
prediabetes_for_feature_selection

Unnamed: 0,eid,31-0.0,48-0.0,49-0.0,50-0.0,51-0.0,102-0.0,102-0.1,137-0.0,189-0.0,...,6144-0.0_1.0,6144-0.0_2.0,6144-0.0_3.0,6144-0.0_4.0,6144-0.0_5.0,22035-0.0_0.0,22035-0.0_1.0,22036-0.0_0.0,22036-0.0_1.0,target
0,1797967,1,0.569697,0.588506,0.856436,0.867470,0.411043,0.429530,0.000000,0.452587,...,0,0,0,1,0,0,1,0,1,0
1,1945240,1,0.678788,0.620690,0.871287,0.867470,0.435583,0.469799,0.407407,0.462753,...,0,0,0,0,1,1,0,0,1,0
2,1589923,1,0.509091,0.517241,0.871287,0.765060,0.429448,0.476510,0.037037,0.830111,...,0,0,0,0,1,0,1,0,1,0
3,4899746,0,0.587879,0.643678,0.772277,0.740964,0.441718,0.630872,0.037037,0.555037,...,0,0,0,0,1,1,0,0,1,0
4,4863545,1,0.600000,0.551724,0.787129,0.740964,0.417178,0.429530,0.037037,0.656086,...,0,0,0,0,1,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17195,3241124,1,0.690909,0.637931,0.871287,0.873494,0.601227,0.664430,0.074074,0.387725,...,0,0,0,0,1,1,0,1,0,0
17196,6022397,1,0.545455,0.551724,0.841584,0.849398,0.380368,0.395973,0.185185,0.620813,...,0,0,0,1,0,1,0,1,0,1
17197,5011216,0,0.600000,0.655172,0.787129,0.746988,0.447853,0.476510,0.000000,0.512179,...,0,0,0,0,1,0,1,0,1,1
17198,3482443,1,0.527273,0.545977,0.821782,0.837349,0.435583,0.469799,0.148148,0.460704,...,0,0,0,0,1,0,0,0,0,0


Below we show that only a few of the patients with known prediabetes ICD codes are found to be part of our cohort. This could be good if we can find more patients not progressing to diabetes within this group.

In [16]:
prediabetes_for_feature_selection_with_dates = prediabetes_for_feature_selection[prediabetes_for_feature_selection.eid.isin(dfToList_prediabetes_with_dates)]
prediabetes_for_feature_selection_with_dates

Unnamed: 0,eid,31-0.0,48-0.0,49-0.0,50-0.0,51-0.0,102-0.0,102-0.1,137-0.0,189-0.0,...,6144-0.0_1.0,6144-0.0_2.0,6144-0.0_3.0,6144-0.0_4.0,6144-0.0_5.0,22035-0.0_0.0,22035-0.0_1.0,22036-0.0_0.0,22036-0.0_1.0,target
133,5490470,0,0.460606,0.626437,0.806931,0.777108,0.417178,0.449664,0.111111,0.314701,...,0,0,0,0,1,0,0,0,0,0
204,5432688,1,0.569697,0.545977,0.876238,0.867470,0.435583,0.469799,0.185185,0.745740,...,0,0,0,0,1,1,0,0,1,0
210,2479180,1,0.648485,0.626437,0.861386,0.831325,0.521472,0.563758,0.851852,0.374262,...,0,0,0,1,0,1,0,1,0,1
247,3990434,0,0.618182,0.626437,0.841584,0.843373,0.411043,0.449664,0.037037,0.635502,...,0,0,0,0,1,0,1,0,1,0
445,3656363,1,0.690909,0.672414,0.866337,0.867470,0.368098,0.395973,0.185185,0.265942,...,0,0,0,0,1,1,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17031,2362858,0,0.490909,0.614943,0.752475,0.765060,0.368098,0.469799,0.185185,0.861580,...,0,0,0,0,1,1,0,1,0,0
17046,3280304,1,0.575758,0.534483,0.841584,0.843373,0.282209,0.308725,0.037037,0.066784,...,0,0,0,0,1,0,0,0,0,0
17051,1917396,0,0.672727,0.764368,0.816832,0.825301,0.484663,0.557047,0.148148,0.288891,...,0,0,0,1,0,1,0,1,0,1
17082,3468020,1,0.557576,0.568966,0.871287,0.867470,0.361963,0.389262,0.148148,0.372911,...,0,0,0,0,1,1,0,0,1,0


# Next we need to classify all prediabetic patients 

Below we import our dataframe so that we can find the patients who have prediabetes to start the study.

In [17]:
merged_for_prediabetes_before_drop_of_diabetics = pd.read_csv('merged_prediabetes_information')
merged_for_prediabetes_before_drop_of_diabetics

Unnamed: 0.1,Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,74-0.0,74-1.0,74-2.0,2443-0.0,2443-1.0,...,20002-2.24,20002-2.25,20002-2.26,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
0,0,2075783,2008-06-06,,,3.0,,,0.0,,...,,,,,,,,,,
1,1,4345775,2008-06-13,,,5.0,,,0.0,,...,,,,,,,,,,
2,2,5686018,2008-02-09,2012-09-24,,5.0,5.0,,0.0,0.0,...,,,,,,,,,,
3,3,3907457,2008-01-31,,,4.0,,,0.0,,...,,,,,,,,,,
4,4,3160513,2008-09-08,,2017-12-06,3.0,,5.0,0.0,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
502499,502499,2054859,2007-11-03,,,2.0,,,0.0,,...,,,,,,,,,,
502500,502500,3268102,2009-08-14,,,5.0,,,0.0,,...,,,,,,,,,,
502501,502501,2793268,2009-10-15,,,3.0,,,1.0,,...,,,,,,,,,,
502502,502502,4922013,2008-10-18,,,2.0,,,0.0,,...,,,,,,,,,,


Below we rename some columns so that we know exactly what they are.

In [18]:
merged_for_prediabetes_before_drop_of_diabetics = merged_for_prediabetes_before_drop_of_diabetics.rename(columns={'74-0.0' : 'Fasting Time 1', '74-1.0' : 'Fasting Time 2', '74-2.0' : 'Fasting Time 3', '2443-0.0' : 'Diabetes Diagnosed 1', '2443-1.0' : 'Diabetes Diagnosed 2', '2443-2.0' : 'Diabetes Diagnosed 3', '2976-0.0' : 'Age Diabetes Diagnosed 1', '2976-1.0' : 'Age Diabetes Diagnosed 2', '2976-2.0' : 'Age Diabetes Diagnosed 3', '4041-0.0' : 'Gestational Diabetes Only 1', '4041-1.0' : 'Gestational Diabetes Only 2', '4041-2.0' : 'Gestational Diabetes Only 3',  '30740-0.0': "Blood Glucose Level 1", '30740-1.0': "Blood Glucose level 2", '30741-0.0': "Glucose Assay Date 1", '30741-1.0': "Glucose Assay Date 2", '30750-0.0': "HbA1c 1", '30750-1.0': "HbA1c 2", '30751-0.0': "HbA1c Assay Date 1", '30751-1.0': "HbA1c Assay Date 2"})
merged_for_prediabetes_before_drop_of_diabetics = merged_for_prediabetes_before_drop_of_diabetics.drop('Unnamed: 0', axis = 1)
merged_for_prediabetes_before_drop_of_diabetics

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,Diabetes Diagnosed 1,Diabetes Diagnosed 2,Diabetes Diagnosed 3,...,20002-2.24,20002-2.25,20002-2.26,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
0,2075783,2008-06-06,,,3.0,,,0.0,,,...,,,,,,,,,,
1,4345775,2008-06-13,,,5.0,,,0.0,,,...,,,,,,,,,,
2,5686018,2008-02-09,2012-09-24,,5.0,5.0,,0.0,0.0,,...,,,,,,,,,,
3,3907457,2008-01-31,,,4.0,,,0.0,,,...,,,,,,,,,,
4,3160513,2008-09-08,,2017-12-06,3.0,,5.0,0.0,,0.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
502499,2054859,2007-11-03,,,2.0,,,0.0,,,...,,,,,,,,,,
502500,3268102,2009-08-14,,,5.0,,,0.0,,,...,,,,,,,,,,
502501,2793268,2009-10-15,,,3.0,,,1.0,,,...,,,,,,,,,,
502502,4922013,2008-10-18,,,2.0,,,0.0,,,...,,,,,,,,,,


First let us look at all the NaN values in each column.

In [19]:
merged_for_prediabetes_before_drop_of_diabetics.isnull().sum()

eid                    0
53-0.0                 0
53-1.0            482159
53-2.0            466961
Fasting Time 1      1233
                   ...  
20002-2.29        502504
20002-2.30        502504
20002-2.31        502504
20002-2.32        502504
20002-2.33        502504
Length: 126, dtype: int64

Next we cut out all patients who start with diabetes using the CSV features.

In [20]:
diabetics_to_start = merged_for_prediabetes_before_drop_of_diabetics[merged_for_prediabetes_before_drop_of_diabetics['Diabetes Diagnosed 1'] == 1]
diabetics_to_start

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,Diabetes Diagnosed 1,Diabetes Diagnosed 2,Diabetes Diagnosed 3,...,20002-2.24,20002-2.25,20002-2.26,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
33,5649227,2008-08-22,,,2.0,,,1.0,,,...,,,,,,,,,,
39,1691705,2009-10-23,,,2.0,,,1.0,,,...,,,,,,,,,,
105,2126827,2009-11-07,,2018-11-02,1.0,,4.0,1.0,,1.0,...,,,,,,,,,,
128,5687965,2008-08-27,,,4.0,,,1.0,,,...,,,,,,,,,,
166,2897908,2009-01-21,,,3.0,,,1.0,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
502475,1689615,2008-04-01,,,2.0,,,1.0,,,...,,,,,,,,,,
502479,3475115,2010-06-25,,,3.0,,,1.0,,,...,,,,,,,,,,
502485,3697041,2010-02-10,,,6.0,,,1.0,,,...,,,,,,,,,,
502493,3117536,2009-03-03,,,4.0,,,1.0,,,...,,,,,,,,,,


Below we make a list of all eid numbers for these patients.

In [21]:
dfToList_diabetics = diabetics_to_start['eid'].tolist()
dfToList_diabetics

[5649227,
 1691705,
 2126827,
 5687965,
 2897908,
 1950918,
 1962969,
 4254769,
 1174378,
 1388426,
 1582704,
 2061291,
 5303248,
 4353264,
 2840946,
 3578446,
 1575721,
 5807592,
 2862855,
 4344728,
 4391471,
 4458528,
 3095269,
 5868354,
 5684651,
 5899666,
 4941661,
 4735962,
 4891261,
 5940031,
 2558105,
 4773788,
 2499664,
 4400186,
 1224008,
 5859584,
 1139473,
 4846778,
 2453406,
 4136206,
 4914684,
 2425377,
 5020756,
 2113926,
 4542234,
 1759061,
 5537524,
 4894695,
 1450495,
 2068483,
 4566504,
 2792357,
 3639765,
 4878111,
 1722384,
 5936322,
 2607439,
 3399365,
 4165791,
 2261532,
 6010666,
 4873486,
 3608059,
 2904410,
 2475192,
 2735746,
 1800702,
 2799911,
 1177676,
 4804595,
 1284448,
 5196141,
 3140614,
 5438784,
 4579495,
 5734452,
 4992498,
 5274468,
 5890189,
 4841749,
 2740867,
 2256248,
 3517257,
 2738097,
 3860309,
 1784170,
 5229202,
 3278338,
 4289573,
 2100096,
 2880157,
 5520599,
 5029655,
 3447293,
 5504650,
 6012089,
 3582706,
 1525548,
 2931980,
 1663067,


Below we filter all diabetics from the dataframe of all patients.

In [22]:
merged_for_prediabetes = merged_for_prediabetes_before_drop_of_diabetics[~merged_for_prediabetes_before_drop_of_diabetics.eid.isin(dfToList_diabetics)]
merged_for_prediabetes

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,Diabetes Diagnosed 1,Diabetes Diagnosed 2,Diabetes Diagnosed 3,...,20002-2.24,20002-2.25,20002-2.26,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
0,2075783,2008-06-06,,,3.0,,,0.0,,,...,,,,,,,,,,
1,4345775,2008-06-13,,,5.0,,,0.0,,,...,,,,,,,,,,
2,5686018,2008-02-09,2012-09-24,,5.0,5.0,,0.0,0.0,,...,,,,,,,,,,
3,3907457,2008-01-31,,,4.0,,,0.0,,,...,,,,,,,,,,
4,3160513,2008-09-08,,2017-12-06,3.0,,5.0,0.0,,0.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
502498,3820845,2009-11-21,,,2.0,,,0.0,,,...,,,,,,,,,,
502499,2054859,2007-11-03,,,2.0,,,0.0,,,...,,,,,,,,,,
502500,3268102,2009-08-14,,,5.0,,,0.0,,,...,,,,,,,,,,
502502,4922013,2008-10-18,,,2.0,,,0.0,,,...,,,,,,,,,,


Below we find all patients with an HbA1c between 42 and 47 to begin the study, which is the prediabetic range. 

In [23]:
prediabetic_hba1c_at_start = merged_for_prediabetes[((merged_for_prediabetes['HbA1c 1'] >= 42) & (merged_for_prediabetes['HbA1c 1'] <= 47))]
prediabetic_hba1c_at_start

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,Diabetes Diagnosed 1,Diabetes Diagnosed 2,Diabetes Diagnosed 3,...,20002-2.24,20002-2.25,20002-2.26,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
10,1797967,2008-09-10,,,3.0,,,0.0,,,...,,,,,,,,,,
75,1945240,2008-02-23,,,4.0,,,0.0,,,...,,,,,,,,,,
79,1589923,2010-07-01,,,4.0,,,0.0,,,...,,,,,,,,,,
87,4899746,2009-11-10,,,4.0,,,-1.0,,,...,,,,,,,,,,
111,4863545,2009-10-29,,,5.0,,,0.0,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
502273,3241124,2010-03-29,,,7.0,,,0.0,,,...,,,,,,,,,,
502333,6022397,2007-12-20,,,3.0,,,0.0,,,...,,,,,,,,,,
502391,5011216,2009-11-26,,,3.0,,,0.0,,,...,,,,,,,,,,
502421,3482443,2008-06-05,,,2.0,,,0.0,,,...,,,,,,,,,,


Below we specify all prediabetic patients using the fasting blood glucose test. These patients have fasted for 8 or more hours before having their blood drawn. Therefore, we search for these patients only. 

In [24]:
patients_fasting = merged_for_prediabetes[merged_for_prediabetes['Fasting Time 1'] >= 8]
patients_fasting

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,Diabetes Diagnosed 1,Diabetes Diagnosed 2,Diabetes Diagnosed 3,...,20002-2.24,20002-2.25,20002-2.26,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
24,2840264,2010-01-28,,,10.0,,,0.0,,,...,,,,,,,,,,
25,5130146,2009-03-10,,,15.0,,,0.0,,,...,,,,,,,,,,
43,4766696,2008-06-07,,,12.0,,,0.0,,,...,,,,,,,,,,
53,1263576,2009-01-15,,,16.0,,,0.0,,,...,,,,,,,,,,
59,5711475,2008-10-07,,,8.0,,,0.0,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
502223,1481100,2010-01-23,,,11.0,,,0.0,,,...,,,,,,,,,,
502237,4341434,2009-06-15,,,10.0,,,0.0,,,...,,,,,,,,,,
502255,5612713,2008-04-17,,,14.0,,,-1.0,,,...,,,,,,,,,,
502304,4081094,2010-03-11,,,16.0,,,0.0,,,...,,,,,,,,,,


Below we show all patients who started the analysis with a Blood Glucose Level measured in the prediabetic range given by the results of a fasting blood sugar test.

In [25]:
prediabetic_blood_glucose_at_start = patients_fasting[((patients_fasting['Blood Glucose Level 1'] >= 5.6) & (patients_fasting['Blood Glucose Level 1'] <= 7))]
prediabetic_blood_glucose_at_start

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,Diabetes Diagnosed 1,Diabetes Diagnosed 2,Diabetes Diagnosed 3,...,20002-2.24,20002-2.25,20002-2.26,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
309,2371061,2007-08-04,,,14.0,,,0.0,,,...,,,,,,,,,,
584,3288614,2009-01-15,,,12.0,,,0.0,,,...,,,,,,,,,,
941,2570774,2008-07-18,,,8.0,,,0.0,,,...,,,,,,,,,,
1125,4512530,2009-04-01,,,12.0,,,0.0,,,...,,,,,,,,,,
1443,1230667,2008-01-26,,,21.0,,,0.0,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501899,3774941,2009-11-12,,,13.0,,,0.0,,,...,,,,,,,,,,
501938,3624451,2009-05-07,,,13.0,,,0.0,,,...,,,,,,,,,,
501989,6022522,2008-02-01,,,15.0,,,0.0,,,...,,,,,,,,,,
502025,2043876,2010-06-05,,,13.0,,,0.0,,,...,,,,,,,,,,


# HbA1c

We still have to filter out all patients who self-diagnose themselves with diabetes but we classified as prediabetic because Scott Kulm found that both self-diagnosis and diagnosis by a doctor do not have statistically significant results and therefore patients who self-diagnose themselves with diabetes will be considered as having diabetes. We use the columns below to mark patients who self-diagnose themselves with diabetes. 

In [26]:
spike_cols = [col for col in prediabetic_hba1c_at_start.columns if '20002-' in col]
print(spike_cols)

['20002-0.0', '20002-0.1', '20002-0.2', '20002-0.3', '20002-0.4', '20002-0.5', '20002-0.6', '20002-0.7', '20002-0.8', '20002-0.9', '20002-0.10', '20002-0.11', '20002-0.12', '20002-0.13', '20002-0.14', '20002-0.15', '20002-0.16', '20002-0.17', '20002-0.18', '20002-0.19', '20002-0.20', '20002-0.21', '20002-0.22', '20002-0.23', '20002-0.24', '20002-0.25', '20002-0.26', '20002-0.27', '20002-0.28', '20002-0.29', '20002-0.30', '20002-0.31', '20002-0.32', '20002-0.33', '20002-1.0', '20002-1.1', '20002-1.2', '20002-1.3', '20002-1.4', '20002-1.5', '20002-1.6', '20002-1.7', '20002-1.8', '20002-1.9', '20002-1.10', '20002-1.11', '20002-1.12', '20002-1.13', '20002-1.14', '20002-1.15', '20002-1.16', '20002-1.17', '20002-1.18', '20002-1.19', '20002-1.20', '20002-1.21', '20002-1.22', '20002-1.23', '20002-1.24', '20002-1.25', '20002-1.26', '20002-1.27', '20002-1.28', '20002-1.29', '20002-1.30', '20002-1.31', '20002-1.32', '20002-1.33', '20002-2.0', '20002-2.1', '20002-2.2', '20002-2.3', '20002-2.4', '2

Next we create a dataframe containing only these patients.

In [27]:
possible_diabetes_self_diagnosis_hba1c = prediabetic_hba1c_at_start[['20002-0.0', '20002-0.1', '20002-0.2', '20002-0.3', '20002-0.4', '20002-0.5', '20002-0.6', '20002-0.7', '20002-0.8', '20002-0.9', '20002-0.10', '20002-0.11', '20002-0.12', '20002-0.13', '20002-0.14', '20002-0.15', '20002-0.16', '20002-0.17', '20002-0.18', '20002-0.19', '20002-0.20', '20002-0.21', '20002-0.22', '20002-0.23', '20002-0.24', '20002-0.25', '20002-0.26', '20002-0.27', '20002-0.28', '20002-0.29', '20002-0.30', '20002-0.31', '20002-0.32', '20002-0.33', '20002-1.0', '20002-1.1', '20002-1.2', '20002-1.3', '20002-1.4', '20002-1.5', '20002-1.6', '20002-1.7', '20002-1.8', '20002-1.9', '20002-1.10', '20002-1.11', '20002-1.12', '20002-1.13', '20002-1.14', '20002-1.15', '20002-1.16', '20002-1.17', '20002-1.18', '20002-1.19', '20002-1.20', '20002-1.21', '20002-1.22', '20002-1.23', '20002-1.24', '20002-1.25', '20002-1.26', '20002-1.27', '20002-1.28', '20002-1.29', '20002-1.30', '20002-1.31', '20002-1.32', '20002-1.33', '20002-2.0', '20002-2.1', '20002-2.2', '20002-2.3', '20002-2.4', '20002-2.5', '20002-2.6', '20002-2.7', '20002-2.8', '20002-2.9', '20002-2.10', '20002-2.11', '20002-2.12', '20002-2.13', '20002-2.14', '20002-2.15', '20002-2.16', '20002-2.17', '20002-2.18', '20002-2.19', '20002-2.20', '20002-2.21', '20002-2.22', '20002-2.23', '20002-2.24', '20002-2.25', '20002-2.26', '20002-2.27', '20002-2.28', '20002-2.29', '20002-2.30', '20002-2.31', '20002-2.32', '20002-2.33']]
possible_diabetes_self_diagnosis_hba1c

Unnamed: 0,20002-0.0,20002-0.1,20002-0.2,20002-0.3,20002-0.4,20002-0.5,20002-0.6,20002-0.7,20002-0.8,20002-0.9,...,20002-2.24,20002-2.25,20002-2.26,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
10,1065.0,1473.0,,,,,,,,,...,,,,,,,,,,
75,1074.0,1065.0,1465.0,1453.0,1207.0,,,,,,...,,,,,,,,,,
79,,,,,,,,,,,...,,,,,,,,,,
87,1330.0,1436.0,,,,,,,,,...,,,,,,,,,,
111,99999.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
502273,1065.0,1111.0,1473.0,,,,,,,,...,,,,,,,,,,
502333,1074.0,1065.0,1458.0,1474.0,1294.0,1471.0,,,,,...,,,,,,,,,,
502391,1162.0,1351.0,1541.0,,,,,,,,...,,,,,,,,,,
502421,1065.0,1093.0,,,,,,,,,...,,,,,,,,,,


Next we show all columns with the value 1223 which is the code for Type 2 diabetes. All columns marked True contain at least one value which is 1223. Therefore, we will continue our analysis with these columns.

In [28]:
pd.options.display.max_rows = 110
possible_diabetes_self_diagnosis_hba1c.isin([1223]).any()

20002-0.0      True
20002-0.1      True
20002-0.2      True
20002-0.3     False
20002-0.4      True
20002-0.5      True
20002-0.6     False
20002-0.7     False
20002-0.8     False
20002-0.9     False
20002-0.10    False
20002-0.11    False
20002-0.12    False
20002-0.13    False
20002-0.14    False
20002-0.15    False
20002-0.16    False
20002-0.17    False
20002-0.18    False
20002-0.19    False
20002-0.20    False
20002-0.21    False
20002-0.22    False
20002-0.23    False
20002-0.24    False
20002-0.25    False
20002-0.26    False
20002-0.27    False
20002-0.28    False
20002-0.29    False
20002-0.30    False
20002-0.31    False
20002-0.32    False
20002-0.33    False
20002-1.0      True
20002-1.1     False
20002-1.2     False
20002-1.3     False
20002-1.4     False
20002-1.5     False
20002-1.6     False
20002-1.7     False
20002-1.8     False
20002-1.9     False
20002-1.10    False
20002-1.11    False
20002-1.12    False
20002-1.13    False
20002-1.14    False
20002-1.15    False


Next we search for patients who self-diagnose themselves with diabetes.

In [29]:
pd.options.display.max_rows = 10
self_diagnosed_diabetics_hba1c = prediabetic_hba1c_at_start[(prediabetic_hba1c_at_start['20002-0.0'] == 1223) | (prediabetic_hba1c_at_start['20002-0.1'] == 1223) | (prediabetic_hba1c_at_start['20002-0.2'] == 1223) | (prediabetic_hba1c_at_start['20002-0.4'] == 1223) | (prediabetic_hba1c_at_start['20002-0.5'] == 1223) | (prediabetic_hba1c_at_start['20002-1.0'] == 1223) | (prediabetic_hba1c_at_start['20002-2.0'] == 1223)]
self_diagnosed_diabetics_hba1c

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,Diabetes Diagnosed 1,Diabetes Diagnosed 2,Diabetes Diagnosed 3,...,20002-2.24,20002-2.25,20002-2.26,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
38080,4834004,2008-08-29,,,6.0,,,0.0,,,...,,,,,,,,,,
39719,3350008,2008-02-25,2013-05-12,2014-11-30,2.0,4.0,6.0,0.0,1.0,1.0,...,,,,,,,,,,
45472,1475776,2007-05-03,,,4.0,,,-1.0,,,...,,,,,,,,,,
61053,4777233,2010-02-02,2012-09-13,2016-04-21,3.0,3.0,4.0,0.0,0.0,1.0,...,,,,,,,,,,
63548,4468202,2007-10-12,,2018-11-04,11.0,,13.0,0.0,,1.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
472264,4892284,2009-12-09,,,3.0,,,0.0,,,...,,,,,,,,,,
472561,5766529,2008-04-04,2013-05-17,2019-01-22,2.0,0.0,5.0,0.0,0.0,1.0,...,,,,,,,,,,
477977,4218081,2009-10-23,,2015-11-01,4.0,,3.0,0.0,,1.0,...,,,,,,,,,,
488914,4490894,2009-01-28,,2015-05-09,5.0,,3.0,0.0,,1.0,...,,,,,,,,,,


# Different groups of patients we need to search for:

1. Patients who are not diagnosed with diabetes at starting time but are self-diagnosed at starting time - cut them from dataframe
2. Patients who are diagnosed with diabetes at starting time by doctor - cut from dataframe
3. Patients who are not diagnosed with diabetes ever, but do self-diagnose themselves after the starting date (in -1.X or -2.X columns) - Keep them in the dataframe
4. Patients who are diagnosed with diabetes by doctor after first tests - keep them in the dataframe

Next we search for all prediabetic symptom patients without being diagnosed by a doctor with diabetes. There are 42 prediabetic patients who are not initially diagnosed with diabetes by a doctor, but who have self-diagnosed themselves with diabetes at some point.

In [30]:
only_self_diagnosed_diabetics_hba1c = self_diagnosed_diabetics_hba1c[self_diagnosed_diabetics_hba1c['Diabetes Diagnosed 1'] == 0]
only_self_diagnosed_diabetics_hba1c

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,Diabetes Diagnosed 1,Diabetes Diagnosed 2,Diabetes Diagnosed 3,...,20002-2.24,20002-2.25,20002-2.26,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
38080,4834004,2008-08-29,,,6.0,,,0.0,,,...,,,,,,,,,,
39719,3350008,2008-02-25,2013-05-12,2014-11-30,2.0,4.0,6.0,0.0,1.0,1.0,...,,,,,,,,,,
61053,4777233,2010-02-02,2012-09-13,2016-04-21,3.0,3.0,4.0,0.0,0.0,1.0,...,,,,,,,,,,
63548,4468202,2007-10-12,,2018-11-04,11.0,,13.0,0.0,,1.0,...,,,,,,,,,,
93423,1909046,2008-06-14,2013-05-19,2017-08-05,5.0,5.0,3.0,0.0,1.0,1.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
472264,4892284,2009-12-09,,,3.0,,,0.0,,,...,,,,,,,,,,
472561,5766529,2008-04-04,2013-05-17,2019-01-22,2.0,0.0,5.0,0.0,0.0,1.0,...,,,,,,,,,,
477977,4218081,2009-10-23,,2015-11-01,4.0,,3.0,0.0,,1.0,...,,,,,,,,,,
488914,4490894,2009-01-28,,2015-05-09,5.0,,3.0,0.0,,1.0,...,,,,,,,,,,


Next we have to see when these patients self-diganosed themselves with diabetes. If they were self-diagnosed in the 20002-0.X columns, we must remove them from the dataframe. Otherwise, they would have progressed to having diabetes and we will have to check if these patients were also diagnosed by a doctor for diabetes.

Below we show all patients who self-diagnose themselves with diabetes, but are not diagnosed with diabetes by a doctor at the start. We will still cut these patients from the dataframe as if they were true diabetic patients.

In [31]:
diabetes_at_start_hba1c = only_self_diagnosed_diabetics_hba1c[((only_self_diagnosed_diabetics_hba1c['20002-0.0'] == 1223) | (only_self_diagnosed_diabetics_hba1c['20002-0.1'] == 1223) | (only_self_diagnosed_diabetics_hba1c['20002-0.2'] == 1223) | (only_self_diagnosed_diabetics_hba1c['20002-0.3'] == 1223) | (only_self_diagnosed_diabetics_hba1c['20002-0.4'] == 1223) | (only_self_diagnosed_diabetics_hba1c['20002-0.5'] == 1223) | (only_self_diagnosed_diabetics_hba1c['20002-0.6'] == 1223)) & (only_self_diagnosed_diabetics_hba1c['Diabetes Diagnosed 1'] == 0)]
diabetes_at_start_hba1c

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,Diabetes Diagnosed 1,Diabetes Diagnosed 2,Diabetes Diagnosed 3,...,20002-2.24,20002-2.25,20002-2.26,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
38080,4834004,2008-08-29,,,6.0,,,0.0,,,...,,,,,,,,,,
126144,4722780,2007-10-25,,,2.0,,,0.0,,,...,,,,,,,,,,
238127,5849058,2009-05-14,,,15.0,,,0.0,,,...,,,,,,,,,,
274299,4134911,2009-07-23,,,3.0,,,0.0,,,...,,,,,,,,,,
285511,1187842,2008-02-29,,,2.0,,,0.0,,,...,,,,,,,,,,
360583,4008901,2007-10-04,,,3.0,,,0.0,,,...,,,,,,,,,,
421487,2990620,2008-01-23,,,2.0,,,0.0,,,...,,,,,,,,,,
472264,4892284,2009-12-09,,,3.0,,,0.0,,,...,,,,,,,,,,


Below we create a list of the patient eid numbers so that we can cut them from the dataframe.

In [32]:
dfToList_diabetes_at_start_hba1c = diabetes_at_start_hba1c['eid'].tolist()
dfToList_diabetes_at_start_hba1c

[4834004, 4722780, 5849058, 4134911, 1187842, 4008901, 2990620, 4892284]

Now we filter these patients from the final dataframe of prediabetic patients.

In [33]:
prediabetic_hba1c_at_start = prediabetic_hba1c_at_start[~prediabetic_hba1c_at_start.eid.isin(dfToList_diabetes_at_start_hba1c)]
prediabetic_hba1c_at_start

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,Diabetes Diagnosed 1,Diabetes Diagnosed 2,Diabetes Diagnosed 3,...,20002-2.24,20002-2.25,20002-2.26,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
10,1797967,2008-09-10,,,3.0,,,0.0,,,...,,,,,,,,,,
75,1945240,2008-02-23,,,4.0,,,0.0,,,...,,,,,,,,,,
79,1589923,2010-07-01,,,4.0,,,0.0,,,...,,,,,,,,,,
87,4899746,2009-11-10,,,4.0,,,-1.0,,,...,,,,,,,,,,
111,4863545,2009-10-29,,,5.0,,,0.0,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
502273,3241124,2010-03-29,,,7.0,,,0.0,,,...,,,,,,,,,,
502333,6022397,2007-12-20,,,3.0,,,0.0,,,...,,,,,,,,,,
502391,5011216,2009-11-26,,,3.0,,,0.0,,,...,,,,,,,,,,
502421,3482443,2008-06-05,,,2.0,,,0.0,,,...,,,,,,,,,,


# Extra progression to diabetes labels in next cell

Below we show that the remaining patients who self-diagnose themselves do so after the first tests are performed. Therefore, we can label these patients as progressing to diabetes even if they are not diagnosed by a doctor for diabetes.

In [34]:
pd.options.display.max_columns = 125
pd.options.display.max_rows = 100
self_diagnosed_diabetics_some_not_doctor_diagnosed_hba1c = only_self_diagnosed_diabetics_hba1c[(only_self_diagnosed_diabetics_hba1c['20002-1.0'] == 1223) | (only_self_diagnosed_diabetics_hba1c['20002-2.0'] == 1223)]
self_diagnosed_diabetics_some_not_doctor_diagnosed_hba1c

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,Diabetes Diagnosed 1,Diabetes Diagnosed 2,Diabetes Diagnosed 3,Age Diabetes Diagnosed 1,Age Diabetes Diagnosed 2,Age Diabetes Diagnosed 3,Gestational Diabetes Only 1,Gestational Diabetes Only 2,Gestational Diabetes Only 3,Blood Glucose Level 1,Blood Glucose level 2,Glucose Assay Date 1,Glucose Assay Date 2,HbA1c 1,HbA1c 2,HbA1c Assay Date 1,HbA1c Assay Date 2,20002-0.0,20002-0.1,20002-0.2,20002-0.3,20002-0.4,20002-0.5,20002-0.6,20002-0.7,20002-0.8,20002-0.9,20002-0.10,20002-0.11,20002-0.12,20002-0.13,20002-0.14,20002-0.15,20002-0.16,20002-0.17,20002-0.18,20002-0.19,20002-0.20,20002-0.21,20002-0.22,20002-0.23,20002-0.24,20002-0.25,20002-0.26,20002-0.27,20002-0.28,20002-0.29,20002-0.30,20002-0.31,20002-0.32,20002-0.33,20002-1.0,20002-1.1,20002-1.2,20002-1.3,...,20002-1.6,20002-1.7,20002-1.8,20002-1.9,20002-1.10,20002-1.11,20002-1.12,20002-1.13,20002-1.14,20002-1.15,20002-1.16,20002-1.17,20002-1.18,20002-1.19,20002-1.20,20002-1.21,20002-1.22,20002-1.23,20002-1.24,20002-1.25,20002-1.26,20002-1.27,20002-1.28,20002-1.29,20002-1.30,20002-1.31,20002-1.32,20002-1.33,20002-2.0,20002-2.1,20002-2.2,20002-2.3,20002-2.4,20002-2.5,20002-2.6,20002-2.7,20002-2.8,20002-2.9,20002-2.10,20002-2.11,20002-2.12,20002-2.13,20002-2.14,20002-2.15,20002-2.16,20002-2.17,20002-2.18,20002-2.19,20002-2.20,20002-2.21,20002-2.22,20002-2.23,20002-2.24,20002-2.25,20002-2.26,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
39719,3350008,2008-02-25,2013-05-12,2014-11-30,2.0,4.0,6.0,0.0,1.0,1.0,,55.0,55.0,,,,6.489,,2017-03-06,2017-05-26,43.7,,2015-02-26,,1473.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1220.0,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1223.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
61053,4777233,2010-02-02,2012-09-13,2016-04-21,3.0,3.0,4.0,0.0,0.0,1.0,,,69.0,,,0.0,5.009,,2017-05-07,2017-08-12,42.7,46.3,2016-03-08,2015-10-05,1452.0,1154.0,1473.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1633.0,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1223.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
63548,4468202,2007-10-12,,2018-11-04,11.0,,13.0,0.0,,1.0,,,51.0,,,,5.164,,2016-03-01,,46.0,,2014-12-10,,1081.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1223.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
93423,1909046,2008-06-14,2013-05-19,2017-08-05,5.0,5.0,3.0,0.0,1.0,1.0,,52.0,53.0,,0.0,0.0,4.922,,2016-12-09,2017-09-07,44.4,40.9,2015-09-23,2016-03-08,1065.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1220.0,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1223.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
113012,5753685,2008-02-19,,2018-11-04,4.0,,4.0,0.0,,-3.0,,,,,,,5.802,,2017-06-27,,42.6,,2015-06-25,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1223.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
118464,5539124,2009-03-16,,2017-08-18,2.0,,5.0,0.0,,1.0,,,69.0,,,0.0,,,2017-09-05,,43.7,,2016-01-30,,1074.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1223.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
122228,2678130,2009-06-26,,2018-11-26,3.0,,2.0,0.0,,0.0,,,,,,,4.712,,2016-12-11,,44.2,,2015-11-23,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1223.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
143285,3089987,2008-05-24,,2017-04-02,4.0,,3.0,0.0,,1.0,,,55.0,,,,4.276,,2017-06-28,,43.5,,2015-08-28,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1223.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
180539,2943986,2008-07-26,2013-04-21,2018-11-14,3.0,6.0,8.0,0.0,1.0,,,66.0,,,0.0,,5.486,5.826,2015-12-20,2016-02-08,43.6,48.8,2017-03-09,2015-09-07,1065.0,1385.0,1402.0,1138.0,1294.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1065.0,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1223.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
197734,3716387,2009-10-12,2013-03-12,2018-11-20,3.0,4.0,2.0,0.0,0.0,1.0,,,71.0,,,,4.829,5.442,2016-02-14,2016-05-01,44.6,44.9,2015-09-22,2016-03-10,1162.0,1165.0,1598.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1165.0,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1223.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


Below we show that all patients who self-diagnosed themselves with diabetes after the first reading also did not get diagnosed by a doctor for diabetes at the first reading. We keep these patients in the dataframe.

In [35]:
pd.options.display.max_rows = 10
pd.options.display.max_columns = 10
only_self_diagnosed_diabetics_hba1c_progressing_to_diabetes = only_self_diagnosed_diabetics_hba1c[((only_self_diagnosed_diabetics_hba1c['20002-1.0'] == 1223) | (only_self_diagnosed_diabetics_hba1c['20002-2.0'] == 1223)) & (only_self_diagnosed_diabetics_hba1c['Diabetes Diagnosed 1'] == 0)]
only_self_diagnosed_diabetics_hba1c_progressing_to_diabetes

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,...,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
39719,3350008,2008-02-25,2013-05-12,2014-11-30,2.0,...,,,,,
61053,4777233,2010-02-02,2012-09-13,2016-04-21,3.0,...,,,,,
63548,4468202,2007-10-12,,2018-11-04,11.0,...,,,,,
93423,1909046,2008-06-14,2013-05-19,2017-08-05,5.0,...,,,,,
113012,5753685,2008-02-19,,2018-11-04,4.0,...,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
469135,3157047,2010-04-27,,2018-10-11,5.0,...,,,,,
472561,5766529,2008-04-04,2013-05-17,2019-01-22,2.0,...,,,,,
477977,4218081,2009-10-23,,2015-11-01,4.0,...,,,,,
488914,4490894,2009-01-28,,2015-05-09,5.0,...,,,,,


Below we keep a list of the patients above as progressing to diabetes for future labeling.

In [36]:
dfToList_diabetes_progression_without_doctor_diagnosis_hba1c = only_self_diagnosed_diabetics_hba1c_progressing_to_diabetes['eid'].tolist()
dfToList_diabetes_progression_without_doctor_diagnosis_hba1c

[3350008,
 4777233,
 4468202,
 1909046,
 5753685,
 5539124,
 2678130,
 3089987,
 2943986,
 3716387,
 3048175,
 5377572,
 5363883,
 5592357,
 5867769,
 5514018,
 4843057,
 5326222,
 2134812,
 3817376,
 4039474,
 2376480,
 1984811,
 2255700,
 3383131,
 2920397,
 3606720,
 4757304,
 3856044,
 3157047,
 5766529,
 4218081,
 4490894,
 4288393]

Now that we have defined our prediabetic patients, we can import another dataframe that contains all the diabetes diagnoses for patients. We again have a column we need to drop that comes over with the SCU index.

In [37]:
patients_with_type2_diabetes_with_dates = pd.read_csv('patients_with_type2_diabetes_with_dates')
patients_with_type2_diabetes_with_dates = patients_with_type2_diabetes_with_dates.drop('Unnamed: 0', axis = 1)
patients_with_type2_diabetes_with_dates

Unnamed: 0,eid,ins_index,arr_index,level,diag_icd9,diag_icd9_nb,diag_icd10,diag_icd10_nb,epistart
0,1000402,3,1,2,,,E119,,28/06/2018
1,1000402,1,1,2,,,E119,,12/03/2008
2,1000402,0,3,2,,,E119,,16/07/2008
3,1000610,6,3,2,,,E119,,11/05/2005
4,1000610,0,5,2,,,E112,,06/10/2010
...,...,...,...,...,...,...,...,...,...
250364,6025367,12,2,2,,,E119,,02/11/2017
250365,6025367,11,2,2,,,E119,,07/09/2017
250366,6025367,6,4,2,,,E119,,09/04/2015
250367,6025367,5,2,2,,,E119,,31/12/2014


Below we show the data types of each column.

In [38]:
patients_with_type2_diabetes_with_dates.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250369 entries, 0 to 250368
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   eid            250369 non-null  int64  
 1   ins_index      250369 non-null  int64  
 2   arr_index      250369 non-null  int64  
 3   level          250369 non-null  int64  
 4   diag_icd9      0 non-null       float64
 5   diag_icd9_nb   0 non-null       float64
 6   diag_icd10     250369 non-null  object 
 7   diag_icd10_nb  4064 non-null    float64
 8   epistart       250349 non-null  object 
dtypes: float64(3), int64(4), object(2)
memory usage: 17.2+ MB


Next we change epistart to datetime format from object. We see that the date now has the format of year-month-day.

In [39]:
patients_with_type2_diabetes_with_dates['epistart'] = pd.to_datetime(patients_with_type2_diabetes_with_dates['epistart'])
patients_with_type2_diabetes_with_dates

Unnamed: 0,eid,ins_index,arr_index,level,diag_icd9,diag_icd9_nb,diag_icd10,diag_icd10_nb,epistart
0,1000402,3,1,2,,,E119,,2018-06-28
1,1000402,1,1,2,,,E119,,2008-12-03
2,1000402,0,3,2,,,E119,,2008-07-16
3,1000610,6,3,2,,,E119,,2005-11-05
4,1000610,0,5,2,,,E112,,2010-06-10
...,...,...,...,...,...,...,...,...,...
250364,6025367,12,2,2,,,E119,,2017-02-11
250365,6025367,11,2,2,,,E119,,2017-07-09
250366,6025367,6,4,2,,,E119,,2015-09-04
250367,6025367,5,2,2,,,E119,,2014-12-31


Below we look at how many NaN values are in the epistart date. We see that it is only 20.

In [40]:
patients_with_type2_diabetes_with_dates.isnull().sum()

eid                   0
ins_index             0
arr_index             0
level                 0
diag_icd9        250369
diag_icd9_nb     250369
diag_icd10            0
diag_icd10_nb    246305
epistart             20
dtype: int64

Next we want to merge our values with those of the prediabetic patients. This will tell us if there are any patients we missed who developed diabetes, or possibly patients we should not account for who had diabetes before the measurements for prediabetes.

In [41]:
pd.options.display.max_columns = 14
prediabetic_hba1c_at_start_finding_more_diabetes = prediabetic_hba1c_at_start.merge(patients_with_type2_diabetes_with_dates[['eid', 'epistart']], on = 'eid')
prediabetic_hba1c_at_start_finding_more_diabetes

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart
0,5100710,2009-01-20,,,3.0,,,...,,,,,,,2016-01-18
1,4647495,2008-12-05,,,3.0,,,...,,,,,,,2018-12-03
2,5494714,2010-04-14,,,3.0,,,...,,,,,,,2017-07-20
3,5494714,2010-04-14,,,3.0,,,...,,,,,,,2017-08-29
4,5494714,2010-04-14,,,3.0,,,...,,,,,,,2019-05-25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18718,6022397,2007-12-20,,,3.0,,,...,,,,,,,2019-07-26
18719,6022397,2007-12-20,,,3.0,,,...,,,,,,,2019-05-11
18720,6022397,2007-12-20,,,3.0,,,...,,,,,,,2019-04-25
18721,6022397,2007-12-20,,,3.0,,,...,,,,,,,2020-02-26


Below we show the data types of the columns.

In [42]:
prediabetic_hba1c_at_start_finding_more_diabetes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18723 entries, 0 to 18722
Columns: 127 entries, eid to epistart
dtypes: datetime64[ns](1), float64(118), int64(1), object(7)
memory usage: 18.3+ MB


Our next goal is to find the patients who start off with prediabetic conditions, which is everyone in the above dataframe, and find out all the patients who are diagnosed with diabetes at a later date, both those who do and do not have a doctor diagnosis of diabetes as well. However, this will give us patients who developed diabetes from prediabetes that have NaN values after their original readings for the doctor diagnosed diabetes columns. This is why this method is very powerful and gives us many more targets than by just using the columns and nothing else. First we have to change 53-0.0, the date of attending the assessment centre, to datetime data type. WE USE THE DATES FROM 53-0.0 BECAUSE THIS REPRESENTS THE DATE WHEN THE ACTUAL BLOOD SAMPLES WERE TAKEN! THE GLUCOSE ASSAY DATES AND HBA1C ASSAY DATES DO NOT CORRESPOND TO THE ACTUAL TIME WHEN THE BLOOD WAS DRAWN USED TO TEST FOR THESE THINGS!

In [43]:
prediabetic_hba1c_at_start_finding_more_diabetes['53-0.0'] = pd.to_datetime(prediabetic_hba1c_at_start_finding_more_diabetes['53-0.0'])
prediabetic_hba1c_at_start_finding_more_diabetes

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart
0,5100710,2009-01-20,,,3.0,,,...,,,,,,,2016-01-18
1,4647495,2008-12-05,,,3.0,,,...,,,,,,,2018-12-03
2,5494714,2010-04-14,,,3.0,,,...,,,,,,,2017-07-20
3,5494714,2010-04-14,,,3.0,,,...,,,,,,,2017-08-29
4,5494714,2010-04-14,,,3.0,,,...,,,,,,,2019-05-25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18718,6022397,2007-12-20,,,3.0,,,...,,,,,,,2019-07-26
18719,6022397,2007-12-20,,,3.0,,,...,,,,,,,2019-05-11
18720,6022397,2007-12-20,,,3.0,,,...,,,,,,,2019-04-25
18721,6022397,2007-12-20,,,3.0,,,...,,,,,,,2020-02-26


First we need to drop duplicate eid numbers so that only the first epistart date is left which will be compared with the dates used to classify all the patients as prediabetic.

In [44]:
prediabetic_hba1c_at_start_finding_more_diabetes_sorted = prediabetic_hba1c_at_start_finding_more_diabetes.sort_values(by = ['eid', 'epistart'])
prediabetic_hba1c_at_start_finding_more_diabetes_sorted

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart
11464,1002432,2008-10-28,,,2.0,,,...,,,,,,,2020-02-03
15791,1003245,2009-05-09,,,3.0,,,...,,,,,,,2012-05-11
15790,1003245,2009-05-09,,,3.0,,,...,,,,,,,2013-08-22
14042,1005686,2010-06-18,,,4.0,,,...,,,,,,,2019-04-11
18403,1005936,2009-07-09,,,3.0,,,...,,,,,,,2016-10-05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18198,6025367,2008-10-25,,,3.0,,,...,,,,,,,2014-10-20
18197,6025367,2008-10-25,,,3.0,,,...,,,,,,,2014-12-31
18196,6025367,2008-10-25,,,3.0,,,...,,,,,,,2015-09-04
18194,6025367,2008-10-25,,,3.0,,,...,,,,,,,2017-02-11


Next we drop the rows by eid number keeping only the first eid value such that the first date of diabetes diagnosis is the only date remaining to be used for date comparisons.

In [45]:
prediabetic_hba1c_at_start_finding_more_diabetes_sorted_only_one_date = prediabetic_hba1c_at_start_finding_more_diabetes_sorted.drop_duplicates(subset = 'eid', keep = 'first')
prediabetic_hba1c_at_start_finding_more_diabetes_sorted_only_one_date

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart
11464,1002432,2008-10-28,,,2.0,,,...,,,,,,,2020-02-03
15791,1003245,2009-05-09,,,3.0,,,...,,,,,,,2012-05-11
14042,1005686,2010-06-18,,,4.0,,,...,,,,,,,2019-04-11
18403,1005936,2009-07-09,,,3.0,,,...,,,,,,,2016-10-05
18006,1010949,2008-03-07,,,3.0,,,...,,,,,,,2014-05-31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10304,6021245,2010-06-28,,,3.0,,,...,,,,,,,2019-01-15
18689,6021297,2008-03-14,,,4.0,,,...,,,,,,,2016-12-28
4925,6021801,2007-10-18,,,5.0,,,...,,,,,,,2014-03-24
18717,6022397,2007-12-20,,,3.0,,,...,,,,,,,2019-04-24


Now that we have both the original assessment date when the patients were diagnosed as prediabetic and the date when the patients were diagnosed with diabetes, all we have to do is mark the patients who developed diabetes after the first assessment when they only had prediabetes. Below we create a column that marks True if the patient develops diabetes (epistart is later than the original assessment date) and false if the patient had diabetes before the original classification as a prediabetic.

In [46]:
prediabetic_hba1c_at_start_finding_more_diabetes_sorted_only_one_date['target'] = prediabetic_hba1c_at_start_finding_more_diabetes_sorted_only_one_date['53-0.0'] < prediabetic_hba1c_at_start_finding_more_diabetes_sorted_only_one_date['epistart']
prediabetic_hba1c_at_start_finding_more_diabetes_sorted_only_one_date


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart,target
11464,1002432,2008-10-28,,,2.0,,,...,,,,,,2020-02-03,True
15791,1003245,2009-05-09,,,3.0,,,...,,,,,,2012-05-11,True
14042,1005686,2010-06-18,,,4.0,,,...,,,,,,2019-04-11,True
18403,1005936,2009-07-09,,,3.0,,,...,,,,,,2016-10-05,True
18006,1010949,2008-03-07,,,3.0,,,...,,,,,,2014-05-31,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10304,6021245,2010-06-28,,,3.0,,,...,,,,,,2019-01-15,True
18689,6021297,2008-03-14,,,4.0,,,...,,,,,,2016-12-28,True
4925,6021801,2007-10-18,,,5.0,,,...,,,,,,2014-03-24,True
18717,6022397,2007-12-20,,,3.0,,,...,,,,,,2019-04-24,True


Below we convert our test column to binary.

In [47]:
prediabetic_hba1c_at_start_finding_more_diabetes_sorted_only_one_date['target'] = prediabetic_hba1c_at_start_finding_more_diabetes_sorted_only_one_date["target"].astype(int)
prediabetic_hba1c_at_start_finding_more_diabetes_sorted_only_one_date


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart,target
11464,1002432,2008-10-28,,,2.0,,,...,,,,,,2020-02-03,1
15791,1003245,2009-05-09,,,3.0,,,...,,,,,,2012-05-11,1
14042,1005686,2010-06-18,,,4.0,,,...,,,,,,2019-04-11,1
18403,1005936,2009-07-09,,,3.0,,,...,,,,,,2016-10-05,1
18006,1010949,2008-03-07,,,3.0,,,...,,,,,,2014-05-31,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10304,6021245,2010-06-28,,,3.0,,,...,,,,,,2019-01-15,1
18689,6021297,2008-03-14,,,4.0,,,...,,,,,,2016-12-28,1
4925,6021801,2007-10-18,,,5.0,,,...,,,,,,2014-03-24,1
18717,6022397,2007-12-20,,,3.0,,,...,,,,,,2019-04-24,1


Below we show the number of patients who had the HbA1c test that classified them as prediabetic which occurred before the epistart diabetes diagnosis of the patients. 

In [48]:
prediabetic_hba1c_at_start_finding_more_diabetes_sorted_only_one_date_before = prediabetic_hba1c_at_start_finding_more_diabetes_sorted_only_one_date[(prediabetic_hba1c_at_start_finding_more_diabetes_sorted_only_one_date['target'] == 1)]
prediabetic_hba1c_at_start_finding_more_diabetes_sorted_only_one_date_before

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart,target
11464,1002432,2008-10-28,,,2.0,,,...,,,,,,2020-02-03,1
15791,1003245,2009-05-09,,,3.0,,,...,,,,,,2012-05-11,1
14042,1005686,2010-06-18,,,4.0,,,...,,,,,,2019-04-11,1
18403,1005936,2009-07-09,,,3.0,,,...,,,,,,2016-10-05,1
18006,1010949,2008-03-07,,,3.0,,,...,,,,,,2014-05-31,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10304,6021245,2010-06-28,,,3.0,,,...,,,,,,2019-01-15,1
18689,6021297,2008-03-14,,,4.0,,,...,,,,,,2016-12-28,1
4925,6021801,2007-10-18,,,5.0,,,...,,,,,,2014-03-24,1
18717,6022397,2007-12-20,,,3.0,,,...,,,,,,2019-04-24,1


Below we create a list of these patients to remove these from the final dataframe since we already know what label to give them.

In [49]:
dfToList_diabetes_progression_hba1c = prediabetic_hba1c_at_start_finding_more_diabetes_sorted_only_one_date_before['eid'].tolist()
dfToList_diabetes_progression_hba1c

[1002432,
 1003245,
 1005686,
 1005936,
 1010949,
 1015296,
 1015458,
 1017443,
 1018119,
 1023747,
 1025329,
 1025738,
 1026392,
 1026995,
 1028632,
 1029631,
 1030454,
 1030597,
 1030664,
 1031126,
 1035809,
 1039510,
 1041652,
 1043112,
 1044679,
 1045572,
 1045764,
 1048428,
 1049177,
 1049710,
 1050272,
 1050906,
 1051949,
 1053340,
 1053530,
 1053984,
 1054030,
 1054983,
 1056771,
 1056980,
 1058047,
 1059072,
 1059157,
 1059778,
 1062113,
 1063541,
 1063856,
 1064008,
 1067304,
 1067408,
 1068154,
 1068684,
 1071227,
 1071847,
 1072550,
 1073047,
 1073266,
 1074213,
 1074377,
 1075927,
 1076087,
 1077768,
 1080623,
 1082118,
 1082423,
 1082529,
 1085404,
 1086687,
 1087122,
 1090411,
 1090737,
 1091279,
 1093438,
 1094700,
 1095304,
 1097020,
 1097522,
 1098087,
 1098102,
 1099336,
 1100345,
 1101247,
 1101705,
 1102744,
 1106188,
 1106386,
 1107695,
 1109310,
 1110064,
 1110221,
 1110458,
 1112001,
 1112966,
 1113401,
 1116252,
 1116486,
 1117478,
 1117673,
 1117850,
 1120584,


Next we need to cut out all patients who were diagnosed with diabetes before the diagnosis of prediabetes.

In [50]:
prediabetic_hba1c_at_start_finding_more_diabetes_sorted_only_one_date_after = prediabetic_hba1c_at_start_finding_more_diabetes_sorted_only_one_date[(prediabetic_hba1c_at_start_finding_more_diabetes_sorted_only_one_date['target'] == 0)]
prediabetic_hba1c_at_start_finding_more_diabetes_sorted_only_one_date_after

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart,target
7598,1029593,2008-01-21,,,3.0,,,...,,,,,,2006-01-07,0
15174,1080360,2010-03-16,,,5.0,,,...,,,,,,2008-08-09,0
18495,1204304,2010-03-05,,,4.0,,,...,,,,,,2007-10-15,0
312,1284851,2009-05-08,,,5.0,,,...,,,,,,2006-09-26,0
1501,1475776,2007-05-03,,,4.0,,,...,,,,,,2006-08-26,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11482,5600407,2008-03-01,,,15.0,,,...,,,,,,2007-09-10,0
8157,5648986,2009-12-14,,,7.0,,,...,,,,,,2006-09-05,0
15802,5695285,2009-01-15,,,6.0,,,...,,,,,,2004-08-23,0
11212,5748920,2009-04-17,,,4.0,,,...,,,,,,1999-11-25,0


Below we create a list of these patients to cut from the final dataframe of prediabetic patients.

In [51]:
dfToList_diabetes_before_prediabetes_hba1c = prediabetic_hba1c_at_start_finding_more_diabetes_sorted_only_one_date_after['eid'].tolist()
dfToList_diabetes_before_prediabetes_hba1c

[1029593,
 1080360,
 1204304,
 1284851,
 1475776,
 1482752,
 1607238,
 1623018,
 1705802,
 1720561,
 1723662,
 1835391,
 1947359,
 1953923,
 1965662,
 1998242,
 2047421,
 2063362,
 2079581,
 2176698,
 2413252,
 2559375,
 2670084,
 2782601,
 2828334,
 2854289,
 2866396,
 2908158,
 2926539,
 3035370,
 3105423,
 3107344,
 3145357,
 3147296,
 3189406,
 3456135,
 3493310,
 3527962,
 3630434,
 3649707,
 3690324,
 3693175,
 3693397,
 3726498,
 3809334,
 3846318,
 4053299,
 4164957,
 4168081,
 4284639,
 4505939,
 4771990,
 4926234,
 4948784,
 4997442,
 5045813,
 5246871,
 5346094,
 5600407,
 5648986,
 5695285,
 5748920,
 5899216]

# Blood Glucose

Next we create a dataframe containing only these patients.

In [52]:
possible_diabetes_self_diagnosis_blood_glucose = prediabetic_blood_glucose_at_start[['20002-0.0', '20002-0.1', '20002-0.2', '20002-0.3', '20002-0.4', '20002-0.5', '20002-0.6', '20002-0.7', '20002-0.8', '20002-0.9', '20002-0.10', '20002-0.11', '20002-0.12', '20002-0.13', '20002-0.14', '20002-0.15', '20002-0.16', '20002-0.17', '20002-0.18', '20002-0.19', '20002-0.20', '20002-0.21', '20002-0.22', '20002-0.23', '20002-0.24', '20002-0.25', '20002-0.26', '20002-0.27', '20002-0.28', '20002-0.29', '20002-0.30', '20002-0.31', '20002-0.32', '20002-0.33', '20002-1.0', '20002-1.1', '20002-1.2', '20002-1.3', '20002-1.4', '20002-1.5', '20002-1.6', '20002-1.7', '20002-1.8', '20002-1.9', '20002-1.10', '20002-1.11', '20002-1.12', '20002-1.13', '20002-1.14', '20002-1.15', '20002-1.16', '20002-1.17', '20002-1.18', '20002-1.19', '20002-1.20', '20002-1.21', '20002-1.22', '20002-1.23', '20002-1.24', '20002-1.25', '20002-1.26', '20002-1.27', '20002-1.28', '20002-1.29', '20002-1.30', '20002-1.31', '20002-1.32', '20002-1.33', '20002-2.0', '20002-2.1', '20002-2.2', '20002-2.3', '20002-2.4', '20002-2.5', '20002-2.6', '20002-2.7', '20002-2.8', '20002-2.9', '20002-2.10', '20002-2.11', '20002-2.12', '20002-2.13', '20002-2.14', '20002-2.15', '20002-2.16', '20002-2.17', '20002-2.18', '20002-2.19', '20002-2.20', '20002-2.21', '20002-2.22', '20002-2.23', '20002-2.24', '20002-2.25', '20002-2.26', '20002-2.27', '20002-2.28', '20002-2.29', '20002-2.30', '20002-2.31', '20002-2.32', '20002-2.33']]
possible_diabetes_self_diagnosis_blood_glucose

Unnamed: 0,20002-0.0,20002-0.1,20002-0.2,20002-0.3,20002-0.4,20002-0.5,20002-0.6,...,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
309,1453.0,,,,,,,...,,,,,,,
584,,,,,,,,...,,,,,,,
941,1078.0,,,,,,,...,,,,,,,
1125,1065.0,1224.0,,,,,,...,,,,,,,
1443,,,,,,,,...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501899,1065.0,1462.0,1377.0,1197.0,1516.0,1458.0,,...,,,,,,,
501938,1111.0,1452.0,,,,,,...,,,,,,,
501989,,,,,,,,...,,,,,,,
502025,1111.0,1452.0,1387.0,1297.0,,,,...,,,,,,,


Next we show all columns with the value 1223 which is the code for Type 2 diabetes. All columns marked True contain at least one value which is 1223. Therefore, we will continue our analysis with these columns.

In [53]:
pd.options.display.max_rows = 110
possible_diabetes_self_diagnosis_blood_glucose.isin([1223]).any()

20002-0.0     False
20002-0.1     False
20002-0.2      True
20002-0.3     False
20002-0.4     False
20002-0.5     False
20002-0.6     False
20002-0.7     False
20002-0.8     False
20002-0.9     False
20002-0.10    False
20002-0.11    False
20002-0.12    False
20002-0.13    False
20002-0.14    False
20002-0.15    False
20002-0.16    False
20002-0.17    False
20002-0.18    False
20002-0.19    False
20002-0.20    False
20002-0.21    False
20002-0.22    False
20002-0.23    False
20002-0.24    False
20002-0.25    False
20002-0.26    False
20002-0.27    False
20002-0.28    False
20002-0.29    False
20002-0.30    False
20002-0.31    False
20002-0.32    False
20002-0.33    False
20002-1.0     False
20002-1.1     False
20002-1.2     False
20002-1.3     False
20002-1.4     False
20002-1.5     False
20002-1.6     False
20002-1.7     False
20002-1.8     False
20002-1.9     False
20002-1.10    False
20002-1.11    False
20002-1.12    False
20002-1.13    False
20002-1.14    False
20002-1.15    False


Next we search for patients who self-diagnose themselves with diabetes.

In [54]:
pd.options.display.max_rows = 10
self_diagnosed_diabetics_blood_glucose = prediabetic_blood_glucose_at_start[(prediabetic_blood_glucose_at_start['20002-0.2'] == 1223) | (prediabetic_blood_glucose_at_start['20002-2.0'] == 1223)]
self_diagnosed_diabetics_blood_glucose

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
197820,1239607,2007-12-22,2012-08-29,2017-05-20,15.0,7.0,3.0,...,,,,,,,
238127,5849058,2009-05-14,,,15.0,,,...,,,,,,,


# Different groups of patients we need to search for:

1. Patients who are not diagnosed with diabetes at starting time but are self-diagnosed at starting time - cut them from dataframe
2. Patients who are diagnosed with diabetes at starting time by doctor - cut from dataframe
3. Patients who are not diagnosed with diabetes ever, but do self-diagnose themselves after the starting date (in -1.X or -2.X columns) - Keep them in the dataframe
4. Patients who are diagnosed with diabetes by doctor after first tests - keep them in the dataframe

Next we search for all prediabetic symptom patients without being diagnosed by a doctor with diabetes. There are 2 prediabetic patients who are not initially diagnosed with diabetes by a doctor, but who have self-diagnosed themselves with diabetes at some point.

In [55]:
only_self_diagnosed_diabetics_blood_glucose = self_diagnosed_diabetics_blood_glucose[self_diagnosed_diabetics_blood_glucose['Diabetes Diagnosed 1'] == 0]
only_self_diagnosed_diabetics_blood_glucose

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
197820,1239607,2007-12-22,2012-08-29,2017-05-20,15.0,7.0,3.0,...,,,,,,,
238127,5849058,2009-05-14,,,15.0,,,...,,,,,,,


Next we have to see when these patients self-diganosed themselves with diabetes. If they were self-diagnosed in the 20002-0.X columns, we must remove them from the dataframe. Otherwise, they would have progressed to having diabetes and we will have to check if these patients were also diagnosed by a doctor for diabetes.

Below we show all patients who self-diagnose themselves with diabetes, but are not diagnosed with diabetes by a doctor at the start. We will still cut these patients from the dataframe as if they were true diabetic patients.

In [56]:
diabetes_at_start_blood_glucose = only_self_diagnosed_diabetics_blood_glucose[((only_self_diagnosed_diabetics_blood_glucose['20002-0.0'] == 1223) | (only_self_diagnosed_diabetics_blood_glucose['20002-0.1'] == 1223) | (only_self_diagnosed_diabetics_blood_glucose['20002-0.2'] == 1223) | (only_self_diagnosed_diabetics_blood_glucose['20002-0.3'] == 1223) | (only_self_diagnosed_diabetics_blood_glucose['20002-0.4'] == 1223) | (only_self_diagnosed_diabetics_blood_glucose['20002-0.5'] == 1223) | (only_self_diagnosed_diabetics_blood_glucose['20002-0.6'] == 1223)) & (only_self_diagnosed_diabetics_blood_glucose['Diabetes Diagnosed 1'] == 0)]
diabetes_at_start_blood_glucose

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
238127,5849058,2009-05-14,,,15.0,,,...,,,,,,,


Below we create a list of the patient eid numbers so that we can cut them from the dataframe.

In [57]:
dfToList_diabetes_at_start_blood_glucose = diabetes_at_start_blood_glucose['eid'].tolist()
dfToList_diabetes_at_start_blood_glucose

[5849058]

Now we filter these patients from the final dataframe of prediabetic patients.

In [58]:
prediabetic_blood_glucose_at_start = prediabetic_blood_glucose_at_start[~prediabetic_blood_glucose_at_start.eid.isin(dfToList_diabetes_at_start_blood_glucose)]
prediabetic_blood_glucose_at_start

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
309,2371061,2007-08-04,,,14.0,,,...,,,,,,,
584,3288614,2009-01-15,,,12.0,,,...,,,,,,,
941,2570774,2008-07-18,,,8.0,,,...,,,,,,,
1125,4512530,2009-04-01,,,12.0,,,...,,,,,,,
1443,1230667,2008-01-26,,,21.0,,,...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501899,3774941,2009-11-12,,,13.0,,,...,,,,,,,
501938,3624451,2009-05-07,,,13.0,,,...,,,,,,,
501989,6022522,2008-02-01,,,15.0,,,...,,,,,,,
502025,2043876,2010-06-05,,,13.0,,,...,,,,,,,


# Extra progression to diabetes labels in next cell

Below we show that the remaining patients who self-diagnose themselves do so after the first tests are performed. Therefore, we can label these patients as progressing to diabetes even if they are not diagnosed by a doctor for diabetes.

In [59]:
pd.options.display.max_columns = 125
pd.options.display.max_rows = 100
self_diagnosed_diabetics_some_not_doctor_diagnosed_blood_glucose = only_self_diagnosed_diabetics_blood_glucose[(only_self_diagnosed_diabetics_blood_glucose['20002-1.0'] == 1223) | (only_self_diagnosed_diabetics_blood_glucose['20002-2.0'] == 1223)]
self_diagnosed_diabetics_some_not_doctor_diagnosed_blood_glucose
# 1 patient in total

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,Diabetes Diagnosed 1,Diabetes Diagnosed 2,Diabetes Diagnosed 3,Age Diabetes Diagnosed 1,Age Diabetes Diagnosed 2,Age Diabetes Diagnosed 3,Gestational Diabetes Only 1,Gestational Diabetes Only 2,Gestational Diabetes Only 3,Blood Glucose Level 1,Blood Glucose level 2,Glucose Assay Date 1,Glucose Assay Date 2,HbA1c 1,HbA1c 2,HbA1c Assay Date 1,HbA1c Assay Date 2,20002-0.0,20002-0.1,20002-0.2,20002-0.3,20002-0.4,20002-0.5,20002-0.6,20002-0.7,20002-0.8,20002-0.9,20002-0.10,20002-0.11,20002-0.12,20002-0.13,20002-0.14,20002-0.15,20002-0.16,20002-0.17,20002-0.18,20002-0.19,20002-0.20,20002-0.21,20002-0.22,20002-0.23,20002-0.24,20002-0.25,20002-0.26,20002-0.27,20002-0.28,20002-0.29,20002-0.30,20002-0.31,20002-0.32,20002-0.33,20002-1.0,20002-1.1,20002-1.2,20002-1.3,...,20002-1.6,20002-1.7,20002-1.8,20002-1.9,20002-1.10,20002-1.11,20002-1.12,20002-1.13,20002-1.14,20002-1.15,20002-1.16,20002-1.17,20002-1.18,20002-1.19,20002-1.20,20002-1.21,20002-1.22,20002-1.23,20002-1.24,20002-1.25,20002-1.26,20002-1.27,20002-1.28,20002-1.29,20002-1.30,20002-1.31,20002-1.32,20002-1.33,20002-2.0,20002-2.1,20002-2.2,20002-2.3,20002-2.4,20002-2.5,20002-2.6,20002-2.7,20002-2.8,20002-2.9,20002-2.10,20002-2.11,20002-2.12,20002-2.13,20002-2.14,20002-2.15,20002-2.16,20002-2.17,20002-2.18,20002-2.19,20002-2.20,20002-2.21,20002-2.22,20002-2.23,20002-2.24,20002-2.25,20002-2.26,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
197820,1239607,2007-12-22,2012-08-29,2017-05-20,15.0,7.0,3.0,0.0,0.0,1.0,,,49.0,,,-2.0,5.633,4.487,2016-01-21,2017-01-06,36.4,,2015-08-28,,1111.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1111.0,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1223.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


Below we show that all patients who self-diagnosed themselves with diabetes after the first reading also did not get diagnosed by a doctor for diabetes at the first reading. We keep these patients in the dataframe and will label them as progressing to diabetes.

In [60]:
pd.options.display.max_rows = 10
pd.options.display.max_columns = 10
only_self_diagnosed_diabetics_blood_glucose_progressing_to_diabetes = only_self_diagnosed_diabetics_blood_glucose[((only_self_diagnosed_diabetics_blood_glucose['20002-1.0'] == 1223) | (only_self_diagnosed_diabetics_blood_glucose['20002-2.0'] == 1223)) & (only_self_diagnosed_diabetics_blood_glucose['Diabetes Diagnosed 1'] == 0)]
only_self_diagnosed_diabetics_blood_glucose_progressing_to_diabetes

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,...,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
197820,1239607,2007-12-22,2012-08-29,2017-05-20,15.0,...,,,,,


Below we keep a list of the patients above as progressing to diabetes for future labeling.

In [61]:
dfToList_diabetes_progression_without_doctor_diagnosis_blood_glucose = only_self_diagnosed_diabetics_blood_glucose_progressing_to_diabetes['eid'].tolist()
dfToList_diabetes_progression_without_doctor_diagnosis_blood_glucose

[1239607]

Next we want to merge our values with those of the prediabetic patients. This will tell us if there are any patients we missed who developed diabetes, or possibly patients we should not account for who had diabetes before the measurements for prediabetes. 

In [62]:
pd.options.display.max_columns = 14
prediabetic_blood_glucose_at_start_finding_more_diabetes = prediabetic_blood_glucose_at_start.merge(patients_with_type2_diabetes_with_dates[['eid', 'epistart']], on = 'eid')
prediabetic_blood_glucose_at_start_finding_more_diabetes

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart
0,3374375,2007-12-18,,,12.0,,,...,,,,,,,2016-11-03
1,3970193,2008-02-21,,,10.0,,,...,,,,,,,2016-08-18
2,3970193,2008-02-21,,,10.0,,,...,,,,,,,2018-07-02
3,1949485,2008-11-03,,,12.0,,,...,,,,,,,2019-08-20
4,2942409,2008-07-10,,,13.0,,,...,,,,,,,2019-04-26
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1302,4210556,2010-03-19,,,8.0,,,...,,,,,,,2012-07-17
1303,4210556,2010-03-19,,,8.0,,,...,,,,,,,2012-12-18
1304,4210556,2010-03-19,,,8.0,,,...,,,,,,,2012-08-17
1305,5612713,2008-04-17,,,14.0,,,...,,,,,,,2019-09-23


Below we show the data types of the columns.

In [63]:
prediabetic_blood_glucose_at_start_finding_more_diabetes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1307 entries, 0 to 1306
Columns: 127 entries, eid to epistart
dtypes: datetime64[ns](1), float64(118), int64(1), object(7)
memory usage: 1.3+ MB


Our next goal is to find the patients who start off with prediabetic conditions, which is everyone in the above dataframe, and find out all the patients who are diagnosed with diabetes at a later date, both those who do and do not have a doctor diagnosis of diabetes as well. However, this will give us patients who developed diabetes from prediabetes that have NaN values after their original readings for the doctor diagnosed diabetes columns. This is why this method is very powerful and gives us many more targets than by just using the columns and nothing else. First we have to change 53-0.0, the date of attending the assessment centre, to datetime data type.

In [64]:
prediabetic_blood_glucose_at_start_finding_more_diabetes['53-0.0'] = pd.to_datetime(prediabetic_blood_glucose_at_start_finding_more_diabetes['53-0.0'])
prediabetic_blood_glucose_at_start_finding_more_diabetes

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart
0,3374375,2007-12-18,,,12.0,,,...,,,,,,,2016-11-03
1,3970193,2008-02-21,,,10.0,,,...,,,,,,,2016-08-18
2,3970193,2008-02-21,,,10.0,,,...,,,,,,,2018-07-02
3,1949485,2008-11-03,,,12.0,,,...,,,,,,,2019-08-20
4,2942409,2008-07-10,,,13.0,,,...,,,,,,,2019-04-26
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1302,4210556,2010-03-19,,,8.0,,,...,,,,,,,2012-07-17
1303,4210556,2010-03-19,,,8.0,,,...,,,,,,,2012-12-18
1304,4210556,2010-03-19,,,8.0,,,...,,,,,,,2012-08-17
1305,5612713,2008-04-17,,,14.0,,,...,,,,,,,2019-09-23


We need to drop duplicate eid numbers so that only the first epistart date is left which will be compared with the dates used to classify all the patients as prediabetic. To do this we first sort the dataframe by eid number and increasing dates within an eid number.

In [65]:
prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted = prediabetic_blood_glucose_at_start_finding_more_diabetes.sort_values(by = ['eid', 'epistart'])
prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart
968,1019343,2010-01-08,,,12.0,,,...,,,,,,,2012-11-15
965,1019343,2010-01-08,,,12.0,,,...,,,,,,,2016-05-13
961,1019343,2010-01-08,,,12.0,,,...,,,,,,,2016-05-16
966,1019343,2010-01-08,,,12.0,,,...,,,,,,,2016-05-21
969,1019343,2010-01-08,,,12.0,,,...,,,,,,,2016-05-21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
733,5875904,2009-08-29,,,8.0,,,...,,,,,,,2020-06-23
698,5935187,2008-05-15,,,16.0,,,...,,,,,,,2018-10-19
257,5949556,2009-04-17,,,12.0,,,...,,,,,,,2019-02-27
735,5984794,2008-04-24,,,11.0,,,...,,,,,,,2019-03-27


Next we drop the rows by eid number keeping only the first eid value such that the first date of diabetes diagnosis is the only date remaining to be used for date comparisons.

In [66]:
prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted_only_one_date = prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted.drop_duplicates(subset = 'eid', keep = 'first')
prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted_only_one_date

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart
968,1019343,2010-01-08,,,12.0,,,...,,,,,,,2012-11-15
998,1024217,2008-02-28,,,17.0,,,...,,,,,,,2012-08-09
1207,1073047,2008-06-04,,,12.0,,,...,,,,,,,2018-03-05
183,1083708,2008-03-13,,,13.0,,,...,,,,,,,2014-07-31
1097,1091279,2008-04-17,,,16.0,,,...,,,,,,,2018-12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
849,5871602,2009-06-01,,2015-05-24,16.0,,4.0,...,,,,,,,2012-01-28
733,5875904,2009-08-29,,,8.0,,,...,,,,,,,2020-06-23
698,5935187,2008-05-15,,,16.0,,,...,,,,,,,2018-10-19
257,5949556,2009-04-17,,,12.0,,,...,,,,,,,2019-02-27


Now that we have both the original assessment date when the patients were diagnosed as prediabetic and the date when the patients were diagnosed with diabetes, all we have to do is mark the patients who developed diabetes after the first assessment when they only had prediabetes. Below we create a column that marks True if the patient develops diabetes (epistart is later than the original assessment date) and false if the patient had diabetes before the original classification as a prediabetic.

In [67]:
prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted_only_one_date['target'] = prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted_only_one_date['53-0.0'] < prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted_only_one_date['epistart']
prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted_only_one_date


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart,target
968,1019343,2010-01-08,,,12.0,,,...,,,,,,2012-11-15,True
998,1024217,2008-02-28,,,17.0,,,...,,,,,,2012-08-09,True
1207,1073047,2008-06-04,,,12.0,,,...,,,,,,2018-03-05,True
183,1083708,2008-03-13,,,13.0,,,...,,,,,,2014-07-31,True
1097,1091279,2008-04-17,,,16.0,,,...,,,,,,2018-12-20,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
849,5871602,2009-06-01,,2015-05-24,16.0,,4.0,...,,,,,,2012-01-28,True
733,5875904,2009-08-29,,,8.0,,,...,,,,,,2020-06-23,True
698,5935187,2008-05-15,,,16.0,,,...,,,,,,2018-10-19,True
257,5949556,2009-04-17,,,12.0,,,...,,,,,,2019-02-27,True


Below we convert our test column to binary.

In [68]:
prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted_only_one_date['target'] = prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted_only_one_date["target"].astype(int)
prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted_only_one_date


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart,target
968,1019343,2010-01-08,,,12.0,,,...,,,,,,2012-11-15,1
998,1024217,2008-02-28,,,17.0,,,...,,,,,,2012-08-09,1
1207,1073047,2008-06-04,,,12.0,,,...,,,,,,2018-03-05,1
183,1083708,2008-03-13,,,13.0,,,...,,,,,,2014-07-31,1
1097,1091279,2008-04-17,,,16.0,,,...,,,,,,2018-12-20,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
849,5871602,2009-06-01,,2015-05-24,16.0,,4.0,...,,,,,,2012-01-28,1
733,5875904,2009-08-29,,,8.0,,,...,,,,,,2020-06-23,1
698,5935187,2008-05-15,,,16.0,,,...,,,,,,2018-10-19,1
257,5949556,2009-04-17,,,12.0,,,...,,,,,,2019-02-27,1


Below we show the number of patients who had the Blood Glucose test that classified them as prediabetic which occurred before the epistart diabetes diagnosis of the patients.

In [69]:
prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted_only_one_date_before = prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted_only_one_date[(prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted_only_one_date['target'] == 1)]
prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted_only_one_date_before

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart,target
968,1019343,2010-01-08,,,12.0,,,...,,,,,,2012-11-15,1
998,1024217,2008-02-28,,,17.0,,,...,,,,,,2012-08-09,1
1207,1073047,2008-06-04,,,12.0,,,...,,,,,,2018-03-05,1
183,1083708,2008-03-13,,,13.0,,,...,,,,,,2014-07-31,1
1097,1091279,2008-04-17,,,16.0,,,...,,,,,,2018-12-20,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
849,5871602,2009-06-01,,2015-05-24,16.0,,4.0,...,,,,,,2012-01-28,1
733,5875904,2009-08-29,,,8.0,,,...,,,,,,2020-06-23,1
698,5935187,2008-05-15,,,16.0,,,...,,,,,,2018-10-19,1
257,5949556,2009-04-17,,,12.0,,,...,,,,,,2019-02-27,1


Below we create a list of these patients to remove these from the final dataframe since we already know what label to give them.

In [70]:
dfToList_diabetes_progression_blood_glucose = prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted_only_one_date_before['eid'].tolist()
dfToList_diabetes_progression_blood_glucose

[1019343,
 1024217,
 1073047,
 1083708,
 1091279,
 1094263,
 1123902,
 1142652,
 1169859,
 1174134,
 1198884,
 1207400,
 1225907,
 1239607,
 1281634,
 1297087,
 1318774,
 1325843,
 1328712,
 1340839,
 1342774,
 1350285,
 1425217,
 1436687,
 1441283,
 1521232,
 1528249,
 1542832,
 1586770,
 1630865,
 1643573,
 1651687,
 1665277,
 1667505,
 1701842,
 1766718,
 1769889,
 1796867,
 1800310,
 1808423,
 1810335,
 1815010,
 1826307,
 1834497,
 1857301,
 1862331,
 1894070,
 1903099,
 1932013,
 1938564,
 1947431,
 1949485,
 1979597,
 1996323,
 2001834,
 2002264,
 2026276,
 2042956,
 2045731,
 2060719,
 2076825,
 2078075,
 2084541,
 2092116,
 2100300,
 2131772,
 2133533,
 2135705,
 2145700,
 2171118,
 2195193,
 2221306,
 2222094,
 2253027,
 2253221,
 2278027,
 2278189,
 2302471,
 2325718,
 2355832,
 2370543,
 2386324,
 2432745,
 2447171,
 2448692,
 2449416,
 2462401,
 2469595,
 2508863,
 2513181,
 2565441,
 2570120,
 2573584,
 2590357,
 2604062,
 2613076,
 2614142,
 2627975,
 2631643,
 2661566,


Below we show the patients who were diagnosed with diabetes before their blood was drawn. These patients must be cut from the total dataframe.

In [71]:
prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted_only_one_date_after = prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted_only_one_date[(prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted_only_one_date['target'] == 0)]
prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted_only_one_date_after

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart,target
1156,3105423,2008-09-20,,,12.0,,,...,,,,,,2008-07-11,0
372,3703097,2009-06-24,,,20.0,,,...,,,,,,2008-08-19,0
708,3723020,2008-09-29,,,14.0,,,...,,,,,,2004-01-15,0
1217,4001962,2009-07-13,,,9.0,,,...,,,,,,2005-05-12,0
1152,4137297,2010-06-18,,,10.0,,,...,,,,,,2010-05-13,0
403,4612704,2007-10-19,,,15.0,,,...,,,,,,2006-09-29,0
731,5600407,2008-03-01,,,15.0,,,...,,,,,,2007-09-10,0


Below we create a list of the patients who must be cut from the final dataframe of prediabetic patients.

In [72]:
dfToList_diabetes_before_prediabetes_blood_glucose = prediabetic_blood_glucose_at_start_finding_more_diabetes_sorted_only_one_date_after['eid'].tolist()
dfToList_diabetes_before_prediabetes_blood_glucose

[3105423, 3703097, 3723020, 4001962, 4137297, 4612704, 5600407]

Below we import our dataframe so that we can find the patients who have prediabetes to start the study.

In [73]:
merged_for_prediabetes_before_drop_of_diabetics = pd.read_csv('merged_prediabetes_information')
merged_for_prediabetes_before_drop_of_diabetics

Unnamed: 0.1,Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,74-0.0,74-1.0,...,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
0,0,2075783,2008-06-06,,,3.0,,...,,,,,,,
1,1,4345775,2008-06-13,,,5.0,,...,,,,,,,
2,2,5686018,2008-02-09,2012-09-24,,5.0,5.0,...,,,,,,,
3,3,3907457,2008-01-31,,,4.0,,...,,,,,,,
4,4,3160513,2008-09-08,,2017-12-06,3.0,,...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
502499,502499,2054859,2007-11-03,,,2.0,,...,,,,,,,
502500,502500,3268102,2009-08-14,,,5.0,,...,,,,,,,
502501,502501,2793268,2009-10-15,,,3.0,,...,,,,,,,
502502,502502,4922013,2008-10-18,,,2.0,,...,,,,,,,


Below we rename some columns so that we know exactly what they are.

In [74]:
merged_for_prediabetes_before_drop_of_diabetics = merged_for_prediabetes_before_drop_of_diabetics.rename(columns={'74-0.0' : 'Fasting Time 1', '74-1.0' : 'Fasting Time 2', '74-2.0' : 'Fasting Time 3', '2443-0.0' : 'Diabetes Diagnosed 1', '2443-1.0' : 'Diabetes Diagnosed 2', '2443-2.0' : 'Diabetes Diagnosed 3', '2976-0.0' : 'Age Diabetes Diagnosed 1', '2976-1.0' : 'Age Diabetes Diagnosed 2', '2976-2.0' : 'Age Diabetes Diagnosed 3', '4041-0.0' : 'Gestational Diabetes Only 1', '4041-1.0' : 'Gestational Diabetes Only 2', '4041-2.0' : 'Gestational Diabetes Only 3',  '30740-0.0': "Blood Glucose Level 1", '30740-1.0': "Blood Glucose level 2", '30741-0.0': "Glucose Assay Date 1", '30741-1.0': "Glucose Assay Date 2", '30750-0.0': "HbA1c 1", '30750-1.0': "HbA1c 2", '30751-0.0': "HbA1c Assay Date 1", '30751-1.0': "HbA1c Assay Date 2"})
merged_for_prediabetes_before_drop_of_diabetics = merged_for_prediabetes_before_drop_of_diabetics.drop('Unnamed: 0', axis = 1)
merged_for_prediabetes_before_drop_of_diabetics

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
0,2075783,2008-06-06,,,3.0,,,...,,,,,,,
1,4345775,2008-06-13,,,5.0,,,...,,,,,,,
2,5686018,2008-02-09,2012-09-24,,5.0,5.0,,...,,,,,,,
3,3907457,2008-01-31,,,4.0,,,...,,,,,,,
4,3160513,2008-09-08,,2017-12-06,3.0,,5.0,...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
502499,2054859,2007-11-03,,,2.0,,,...,,,,,,,
502500,3268102,2009-08-14,,,5.0,,,...,,,,,,,
502501,2793268,2009-10-15,,,3.0,,,...,,,,,,,
502502,4922013,2008-10-18,,,2.0,,,...,,,,,,,


Below we import all patients who have an ICD code of prediabetic.

In [75]:
prediabetes_with_dates = pd.read_csv('patients_with_prediabetes_with_dates')
prediabetes_with_dates = prediabetes_with_dates.drop(columns = ['Unnamed: 0'])
prediabetes_with_dates

Unnamed: 0,eid,ins_index,arr_index,level,diag_icd9,diag_icd9_nb,diag_icd10,diag_icd10_nb,epistart
0,1006294,37,3,2,,,R739,,12/02/2018
1,1007537,0,3,2,,,R730,,06/06/2019
2,1008393,2,1,2,,,R739,,26/04/2011
3,1010613,0,0,1,,,R739,,07/06/2010
4,1011449,12,5,2,,,R730,,03/02/2020
...,...,...,...,...,...,...,...,...,...
3794,6017167,3,3,2,,,R730,,09/12/2019
3795,6019270,5,9,2,,,R730,,24/09/2020
3796,6020217,7,12,2,,,R730,,19/08/2020
3797,6021943,1,3,2,,,R739,,11/08/2020


Below we show the number of actual patients in this new sample group.

In [76]:
prediabetes_with_dates.drop_duplicates(subset = 'eid')

Unnamed: 0,eid,ins_index,arr_index,level,diag_icd9,diag_icd9_nb,diag_icd10,diag_icd10_nb,epistart
0,1006294,37,3,2,,,R739,,12/02/2018
1,1007537,0,3,2,,,R730,,06/06/2019
2,1008393,2,1,2,,,R739,,26/04/2011
3,1010613,0,0,1,,,R739,,07/06/2010
4,1011449,12,5,2,,,R730,,03/02/2020
...,...,...,...,...,...,...,...,...,...
3794,6017167,3,3,2,,,R730,,09/12/2019
3795,6019270,5,9,2,,,R730,,24/09/2020
3796,6020217,7,12,2,,,R730,,19/08/2020
3797,6021943,1,3,2,,,R739,,11/08/2020


Below we make a list of all eid numbers for these patients.

In [77]:
dfToList_prediabetes_with_dates = prediabetes_with_dates['eid'].tolist()
dfToList_prediabetes_with_dates

[1006294,
 1007537,
 1008393,
 1010613,
 1011449,
 1011449,
 1011449,
 1011449,
 1011449,
 1011449,
 1011962,
 1011962,
 1013036,
 1013036,
 1013036,
 1014004,
 1017269,
 1019396,
 1019396,
 1019396,
 1019396,
 1020786,
 1020786,
 1021189,
 1022728,
 1024680,
 1025738,
 1025738,
 1026846,
 1029208,
 1029688,
 1030702,
 1032950,
 1040790,
 1044718,
 1044919,
 1047778,
 1048696,
 1049423,
 1050516,
 1050516,
 1050516,
 1050674,
 1051881,
 1053505,
 1053846,
 1055445,
 1055591,
 1055677,
 1055794,
 1058522,
 1062014,
 1067255,
 1067321,
 1067321,
 1067321,
 1070410,
 1070616,
 1072495,
 1073132,
 1074343,
 1075479,
 1076801,
 1078042,
 1078042,
 1078115,
 1079176,
 1086312,
 1087841,
 1088760,
 1088760,
 1092549,
 1092549,
 1094789,
 1094789,
 1095516,
 1095516,
 1096278,
 1096278,
 1097144,
 1097522,
 1099911,
 1102492,
 1102726,
 1105902,
 1105902,
 1106809,
 1106809,
 1111155,
 1111802,
 1116010,
 1116102,
 1116102,
 1117612,
 1119987,
 1120236,
 1121406,
 1121406,
 1121406,
 1121406,


Below we show all the patients with prediabetic icd codes with the dates when they attended the assessment center.

In [78]:
patients_from_icd = merged_for_prediabetes_before_drop_of_diabetics[merged_for_prediabetes_before_drop_of_diabetics.eid.isin(dfToList_prediabetes_with_dates)]
patients_from_icd

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
488,2863808,2008-07-04,,,3.0,,,...,,,,,,,
715,2425377,2008-07-19,,,2.0,,,...,,,,,,,
862,5402036,2009-08-24,,,4.0,,,...,,,,,,,
955,2290883,2009-08-19,,,4.0,,,...,,,,,,,
1030,3399365,2010-01-21,,,3.0,,,...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501671,3960736,2007-08-02,,,2.0,,,...,,,,,,,
501749,5785587,2008-12-17,,,2.0,,,...,,,,,,,
502100,5744339,2008-02-08,,2018-06-11,15.0,,3.0,...,,,,,,,
502109,5839815,2008-09-05,,,3.0,,,...,,,,,,,


Below we change the column name of the epistart column for the dataframe for the new prediabetic patients using the ICD codes.

In [79]:
prediabetes_with_dates = prediabetes_with_dates.rename(columns = {'epistart':'epistart_pre'})
prediabetes_with_dates

Unnamed: 0,eid,ins_index,arr_index,level,diag_icd9,diag_icd9_nb,diag_icd10,diag_icd10_nb,epistart_pre
0,1006294,37,3,2,,,R739,,12/02/2018
1,1007537,0,3,2,,,R730,,06/06/2019
2,1008393,2,1,2,,,R739,,26/04/2011
3,1010613,0,0,1,,,R739,,07/06/2010
4,1011449,12,5,2,,,R730,,03/02/2020
...,...,...,...,...,...,...,...,...,...
3794,6017167,3,3,2,,,R730,,09/12/2019
3795,6019270,5,9,2,,,R730,,24/09/2020
3796,6020217,7,12,2,,,R730,,19/08/2020
3797,6021943,1,3,2,,,R739,,11/08/2020


Below we change the new column to date time format.

In [80]:
prediabetes_with_dates['epistart_pre'] = pd.to_datetime(prediabetes_with_dates['epistart_pre'])
prediabetes_with_dates

Unnamed: 0,eid,ins_index,arr_index,level,diag_icd9,diag_icd9_nb,diag_icd10,diag_icd10_nb,epistart_pre
0,1006294,37,3,2,,,R739,,2018-12-02
1,1007537,0,3,2,,,R730,,2019-06-06
2,1008393,2,1,2,,,R739,,2011-04-26
3,1010613,0,0,1,,,R739,,2010-07-06
4,1011449,12,5,2,,,R730,,2020-03-02
...,...,...,...,...,...,...,...,...,...
3794,6017167,3,3,2,,,R730,,2019-09-12
3795,6019270,5,9,2,,,R730,,2020-09-24
3796,6020217,7,12,2,,,R730,,2020-08-19
3797,6021943,1,3,2,,,R739,,2020-11-08


Our next goal is to find the patients who start off with prediabetic conditions, and find out all the patients who are diagnosed with diabetes at a later date, both those who do and do not have a doctor diagnosis of diabetes as well. This is done by comparing the dates when the prediabetic diagnosis is given, the diabetic diagnosis is given, and the date attending the assessment center is given. Below we change the date attending the assessment center column to datetime data type.

In [81]:
patients_from_icd['53-0.0'] = pd.to_datetime(patients_from_icd['53-0.0'])
patients_from_icd


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
488,2863808,2008-07-04,,,3.0,,,...,,,,,,,
715,2425377,2008-07-19,,,2.0,,,...,,,,,,,
862,5402036,2009-08-24,,,4.0,,,...,,,,,,,
955,2290883,2009-08-19,,,4.0,,,...,,,,,,,
1030,3399365,2010-01-21,,,3.0,,,...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501671,3960736,2007-08-02,,,2.0,,,...,,,,,,,
501749,5785587,2008-12-17,,,2.0,,,...,,,,,,,
502100,5744339,2008-02-08,,2018-06-11,15.0,,3.0,...,,,,,,,
502109,5839815,2008-09-05,,,3.0,,,...,,,,,,,


Next we merge the new prediabetes dataframe epistart column with the above dataframe.

In [82]:
patients_from_icd_with_prediabetes_dates = patients_from_icd.merge(prediabetes_with_dates[['eid', 'epistart_pre']], on = 'eid')
patients_from_icd_with_prediabetes_dates

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre
0,2863808,2008-07-04,,,3.0,,,...,,,,,,,2015-05-27
1,2425377,2008-07-19,,,2.0,,,...,,,,,,,2008-07-03
2,5402036,2009-08-24,,,4.0,,,...,,,,,,,2020-03-09
3,2290883,2009-08-19,,,4.0,,,...,,,,,,,2014-07-04
4,3399365,2010-01-21,,,3.0,,,...,,,,,,,2016-04-16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3794,5785587,2008-12-17,,,2.0,,,...,,,,,,,1997-03-08
3795,5744339,2008-02-08,,2018-06-11,15.0,,3.0,...,,,,,,,2018-11-21
3796,5744339,2008-02-08,,2018-06-11,15.0,,3.0,...,,,,,,,2019-07-31
3797,5839815,2008-09-05,,,3.0,,,...,,,,,,,2019-01-29


Below we keep only patients who we know were prediabetic before their measurements were taken.

In [83]:
patients_from_icd_with_prediabetes_dates_before_test = patients_from_icd_with_prediabetes_dates[patients_from_icd_with_prediabetes_dates['53-0.0'] >= patients_from_icd_with_prediabetes_dates['epistart_pre']]
patients_from_icd_with_prediabetes_dates_before_test

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre
1,2425377,2008-07-19,,,2.0,,,...,,,,,,,2008-07-03
12,5512282,2008-09-16,,,4.0,,,...,,,,,,,2001-06-29
15,5724786,2008-01-07,,,4.0,,,...,,,,,,,2006-05-08
18,4754980,2009-02-11,,,3.0,,,...,,,,,,,2000-04-20
21,1350792,2009-09-14,,,2.0,,,...,,,,,,,2007-05-11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3761,4447186,2007-12-10,,,4.0,,,...,,,,,,,2002-05-15
3766,1076801,2009-04-17,,,3.0,,,...,,,,,,,2003-01-12
3773,3065620,2008-10-11,,,4.0,,,...,,,,,,,2003-03-06
3793,3960736,2007-08-02,,,2.0,,,...,,,,,,,2003-01-04


Below we show the number of patients known to have prediabetes before the measurements at the testing center.

In [84]:
patients_from_icd_with_prediabetes_dates_before_test_before_test = patients_from_icd_with_prediabetes_dates_before_test.drop_duplicates(subset = 'eid')
patients_from_icd_with_prediabetes_dates_before_test_before_test

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre
1,2425377,2008-07-19,,,2.0,,,...,,,,,,,2008-07-03
12,5512282,2008-09-16,,,4.0,,,...,,,,,,,2001-06-29
15,5724786,2008-01-07,,,4.0,,,...,,,,,,,2006-05-08
18,4754980,2009-02-11,,,3.0,,,...,,,,,,,2000-04-20
21,1350792,2009-09-14,,,2.0,,,...,,,,,,,2007-05-11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3755,4447186,2007-12-10,,,4.0,,,...,,,,,,,2003-02-25
3766,1076801,2009-04-17,,,3.0,,,...,,,,,,,2003-01-12
3773,3065620,2008-10-11,,,4.0,,,...,,,,,,,2003-03-06
3793,3960736,2007-08-02,,,2.0,,,...,,,,,,,2003-01-04


Below we keep all self-diagnosis columns.

In [85]:
possible_diabetes_self_diagnosis_from_icd = patients_from_icd_with_prediabetes_dates_before_test[['20002-0.0', '20002-0.1', '20002-0.2', '20002-0.3', '20002-0.4', '20002-0.5', '20002-0.6', '20002-0.7', '20002-0.8', '20002-0.9', '20002-0.10', '20002-0.11', '20002-0.12', '20002-0.13', '20002-0.14', '20002-0.15', '20002-0.16', '20002-0.17', '20002-0.18', '20002-0.19', '20002-0.20', '20002-0.21', '20002-0.22', '20002-0.23', '20002-0.24', '20002-0.25', '20002-0.26', '20002-0.27', '20002-0.28', '20002-0.29', '20002-0.30', '20002-0.31', '20002-0.32', '20002-0.33', '20002-1.0', '20002-1.1', '20002-1.2', '20002-1.3', '20002-1.4', '20002-1.5', '20002-1.6', '20002-1.7', '20002-1.8', '20002-1.9', '20002-1.10', '20002-1.11', '20002-1.12', '20002-1.13', '20002-1.14', '20002-1.15', '20002-1.16', '20002-1.17', '20002-1.18', '20002-1.19', '20002-1.20', '20002-1.21', '20002-1.22', '20002-1.23', '20002-1.24', '20002-1.25', '20002-1.26', '20002-1.27', '20002-1.28', '20002-1.29', '20002-1.30', '20002-1.31', '20002-1.32', '20002-1.33', '20002-2.0', '20002-2.1', '20002-2.2', '20002-2.3', '20002-2.4', '20002-2.5', '20002-2.6', '20002-2.7', '20002-2.8', '20002-2.9', '20002-2.10', '20002-2.11', '20002-2.12', '20002-2.13', '20002-2.14', '20002-2.15', '20002-2.16', '20002-2.17', '20002-2.18', '20002-2.19', '20002-2.20', '20002-2.21', '20002-2.22', '20002-2.23', '20002-2.24', '20002-2.25', '20002-2.26', '20002-2.27', '20002-2.28', '20002-2.29', '20002-2.30', '20002-2.31', '20002-2.32', '20002-2.33']]
possible_diabetes_self_diagnosis_from_icd

Unnamed: 0,20002-0.0,20002-0.1,20002-0.2,20002-0.3,20002-0.4,20002-0.5,20002-0.6,...,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
1,1220.0,1078.0,1473.0,,,,,...,,,,,,,
12,1111.0,1220.0,1074.0,1081.0,1072.0,1163.0,1240.0,...,,,,,,,
15,1065.0,1220.0,1452.0,,,,,...,,,,,,,
18,1351.0,,,,,,,...,,,,,,,
21,1065.0,1224.0,1352.0,1662.0,1515.0,1465.0,,...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3761,1081.0,1065.0,1111.0,1264.0,1222.0,1473.0,,...,,,,,,,
3766,1065.0,1220.0,1465.0,1331.0,,,,...,,,,,,,
3773,1111.0,,,,,,,...,,,,,,,
3793,1065.0,1226.0,1440.0,1440.0,,,,...,,,,,,,


Next we show all columns with the value 1223 which is the code for Type 2 diabetes. All columns marked True contain at least one value which is 1223. Therefore, we will continue our analysis with these columns.

In [86]:
pd.options.display.max_rows = 110
possible_diabetes_self_diagnosis_from_icd.isin([1223]).any()

20002-0.0      True
20002-0.1      True
20002-0.2      True
20002-0.3      True
20002-0.4      True
20002-0.5      True
20002-0.6     False
20002-0.7     False
20002-0.8     False
20002-0.9     False
20002-0.10    False
20002-0.11    False
20002-0.12    False
20002-0.13    False
20002-0.14    False
20002-0.15    False
20002-0.16    False
20002-0.17    False
20002-0.18    False
20002-0.19    False
20002-0.20    False
20002-0.21    False
20002-0.22    False
20002-0.23    False
20002-0.24    False
20002-0.25    False
20002-0.26    False
20002-0.27    False
20002-0.28    False
20002-0.29    False
20002-0.30    False
20002-0.31    False
20002-0.32    False
20002-0.33    False
20002-1.0     False
20002-1.1     False
20002-1.2     False
20002-1.3     False
20002-1.4     False
20002-1.5     False
20002-1.6     False
20002-1.7     False
20002-1.8     False
20002-1.9     False
20002-1.10    False
20002-1.11    False
20002-1.12    False
20002-1.13    False
20002-1.14    False
20002-1.15    False


Next we search for patients who self-diagnose themselves with diabetes.

In [87]:
pd.options.display.max_rows = 10
self_diagnosed_diabetics_from_icd = patients_from_icd_with_prediabetes_dates_before_test[(patients_from_icd_with_prediabetes_dates_before_test['20002-0.0'] == 1223) | (patients_from_icd_with_prediabetes_dates_before_test['20002-0.1'] == 1223) | (patients_from_icd_with_prediabetes_dates_before_test['20002-0.2'] == 1223) | (patients_from_icd_with_prediabetes_dates_before_test['20002-0.3'] == 1223) | (patients_from_icd_with_prediabetes_dates_before_test['20002-0.4'] == 1223) | (patients_from_icd['20002-0.5'] == 1223) | (patients_from_icd_with_prediabetes_dates_before_test['20002-2.0'] == 1223)]
self_diagnosed_diabetics_from_icd

Boolean Series key will be reindexed to match DataFrame index.


Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre
540,4578103,2008-07-08,,,3.0,,,...,,,,,,,2005-02-03
864,2693446,2009-12-08,,,2.0,,,...,,,,,,,2009-04-30
1057,1328378,2009-07-22,,,5.0,,,...,,,,,,,2000-10-11
1276,5011490,2008-11-29,2013-05-18,,3.0,2.0,,...,,,,,,,2001-12-11
1391,2839824,2007-11-27,,,6.0,,,...,,,,,,,2003-01-13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3323,4908590,2009-09-10,,,6.0,,,...,,,,,,,2005-09-10
3442,1657339,2009-05-20,,,5.0,,,...,,,,,,,2008-08-13
3523,4424579,2007-08-20,,,4.0,,,...,,,,,,,2001-05-30
3524,4424579,2007-08-20,,,4.0,,,...,,,,,,,2001-05-21


# Different groups of patients we need to search for:

1. Patients who are not diagnosed with diabetes at starting time but are self-diagnosed at starting time - cut them from dataframe
2. Patients who are diagnosed with diabetes at starting time by doctor - cut from dataframe
3. Patients who are not diagnosed with diabetes ever, but do self-diagnose themselves after the starting date (in -1.X or -2.X columns) - Keep them in the dataframe
4. Patients who are diagnosed with diabetes by doctor after first tests - keep them in the dataframe

Next we search for all prediabetic symptom patients without being diagnosed by a doctor with diabetes. There are 2 prediabetic patients who are not initially diagnosed with diabetes by a doctor, but who have self-diagnosed themselves with diabetes at some point.

In [88]:
only_self_diagnosed_diabetics_from_icd = self_diagnosed_diabetics_from_icd[self_diagnosed_diabetics_from_icd['Diabetes Diagnosed 1'] == 0]
only_self_diagnosed_diabetics_from_icd

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre
1733,5112251,2009-10-19,,,2.0,,,...,,,,,,,2008-09-27
2251,5068971,2008-03-19,,2018-08-06,2.0,,1.0,...,,,,,,,2003-06-06
2332,2656765,2009-01-09,,,4.0,,,...,,,,,,,2008-01-12


Next we have to see when these patients self-diganosed themselves with diabetes. If they were self-diagnosed in the 20002-0.X columns, we must remove them from the dataframe. Otherwise, they would have progressed to having diabetes and we will have to check if these patients were also diagnosed by a doctor for diabetes.

Below we show all patients who self-diagnose themselves with diabetes, but are not diagnosed with diabetes by a doctor at the start. We will still cut these patients from the dataframe as if they were true diabetic patients.

In [89]:
diabetes_at_start_from_icd = only_self_diagnosed_diabetics_from_icd[((only_self_diagnosed_diabetics_from_icd['20002-0.0'] == 1223) | (only_self_diagnosed_diabetics_from_icd['20002-0.1'] == 1223) | (only_self_diagnosed_diabetics_from_icd['20002-0.2'] == 1223) | (only_self_diagnosed_diabetics_from_icd['20002-0.3'] == 1223) | (only_self_diagnosed_diabetics_from_icd['20002-0.4'] == 1223) | (only_self_diagnosed_diabetics_from_icd['20002-0.5'] == 1223) | (only_self_diagnosed_diabetics_from_icd['20002-0.6'] == 1223)) & (only_self_diagnosed_diabetics_from_icd['Diabetes Diagnosed 1'] == 0)]
diabetes_at_start_from_icd

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre
1733,5112251,2009-10-19,,,2.0,,,...,,,,,,,2008-09-27
2332,2656765,2009-01-09,,,4.0,,,...,,,,,,,2008-01-12


Below we create a list of the patient eid numbers so that we can cut them from the dataframe.

In [90]:
dfToList_diabetes_at_start_from_icd = diabetes_at_start_from_icd['eid'].tolist()
dfToList_diabetes_at_start_from_icd

[5112251, 2656765]

Now we filter these patients from the final dataframe of prediabetic patients.

In [91]:
patients_from_icd_with_prediabetes_dates_before_test = patients_from_icd_with_prediabetes_dates_before_test[~patients_from_icd_with_prediabetes_dates_before_test.eid.isin(dfToList_diabetes_at_start_from_icd)]
patients_from_icd_with_prediabetes_dates_before_test

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre
1,2425377,2008-07-19,,,2.0,,,...,,,,,,,2008-07-03
12,5512282,2008-09-16,,,4.0,,,...,,,,,,,2001-06-29
15,5724786,2008-01-07,,,4.0,,,...,,,,,,,2006-05-08
18,4754980,2009-02-11,,,3.0,,,...,,,,,,,2000-04-20
21,1350792,2009-09-14,,,2.0,,,...,,,,,,,2007-05-11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3761,4447186,2007-12-10,,,4.0,,,...,,,,,,,2002-05-15
3766,1076801,2009-04-17,,,3.0,,,...,,,,,,,2003-01-12
3773,3065620,2008-10-11,,,4.0,,,...,,,,,,,2003-03-06
3793,3960736,2007-08-02,,,2.0,,,...,,,,,,,2003-01-04


# MARKED WITH 190 PATIENTS WITH DIABETES - SHOWS THAT ALL PATIENTS WITH DIABETES AT MEASUREMENT TIME WERE FROM THIS POPULATION OF PEOPLE. THIS HAS BEEN CHANGED IN OUR PIE CHART OF PATIENTS FOR THE PUBLICATION.

Below we cut out all patients who are already diabetic.

In [92]:
print(len(patients_from_icd_with_prediabetes_dates_before_test.drop_duplicates(subset = 'eid')))
test = patients_from_icd_with_prediabetes_dates_before_test.drop_duplicates(subset = 'eid')[~patients_from_icd_with_prediabetes_dates_before_test.drop_duplicates(subset = 'eid').eid.isin(dfToList_already_diabetic_at_start)]
test

495


Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre
18,4754980,2009-02-11,,,3.0,,,...,,,,,,,2000-04-20
21,1350792,2009-09-14,,,2.0,,,...,,,,,,,2007-05-11
50,2479180,2007-12-17,,,4.0,,,...,,,,,,,2007-07-11
63,3990434,2009-07-09,,,3.0,,,...,,,,,,,2005-07-14
93,4341307,2008-03-19,,,6.0,,,...,,,,,,,2008-02-05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3716,4571529,2008-10-31,,,1.0,,,...,,,,,,,2007-08-22
3766,1076801,2009-04-17,,,3.0,,,...,,,,,,,2003-01-12
3773,3065620,2008-10-11,,,4.0,,,...,,,,,,,2003-03-06
3793,3960736,2007-08-02,,,2.0,,,...,,,,,,,2003-01-04


# Extra progression to diabetes labels in next cell

Below we show that the remaining patients who self-diagnose themselves do so after the first tests are performed. Therefore, we can label these patients as progressing to diabetes even if they are not diagnosed by a doctor for diabetes.

In [93]:
pd.options.display.max_columns = 125
pd.options.display.max_rows = 100
self_diagnosed_diabetics_some_not_doctor_diagnosed_from_icd = only_self_diagnosed_diabetics_from_icd[(only_self_diagnosed_diabetics_from_icd['20002-2.0'] == 1223)]
self_diagnosed_diabetics_some_not_doctor_diagnosed_from_icd
# 1 patient in total

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,Diabetes Diagnosed 1,Diabetes Diagnosed 2,Diabetes Diagnosed 3,Age Diabetes Diagnosed 1,Age Diabetes Diagnosed 2,Age Diabetes Diagnosed 3,Gestational Diabetes Only 1,Gestational Diabetes Only 2,Gestational Diabetes Only 3,Blood Glucose Level 1,Blood Glucose level 2,Glucose Assay Date 1,Glucose Assay Date 2,HbA1c 1,HbA1c 2,HbA1c Assay Date 1,HbA1c Assay Date 2,20002-0.0,20002-0.1,20002-0.2,20002-0.3,20002-0.4,20002-0.5,20002-0.6,20002-0.7,20002-0.8,20002-0.9,20002-0.10,20002-0.11,20002-0.12,20002-0.13,20002-0.14,20002-0.15,20002-0.16,20002-0.17,20002-0.18,20002-0.19,20002-0.20,20002-0.21,20002-0.22,20002-0.23,20002-0.24,20002-0.25,20002-0.26,20002-0.27,20002-0.28,20002-0.29,20002-0.30,20002-0.31,20002-0.32,20002-0.33,20002-1.0,20002-1.1,20002-1.2,20002-1.3,...,20002-1.7,20002-1.8,20002-1.9,20002-1.10,20002-1.11,20002-1.12,20002-1.13,20002-1.14,20002-1.15,20002-1.16,20002-1.17,20002-1.18,20002-1.19,20002-1.20,20002-1.21,20002-1.22,20002-1.23,20002-1.24,20002-1.25,20002-1.26,20002-1.27,20002-1.28,20002-1.29,20002-1.30,20002-1.31,20002-1.32,20002-1.33,20002-2.0,20002-2.1,20002-2.2,20002-2.3,20002-2.4,20002-2.5,20002-2.6,20002-2.7,20002-2.8,20002-2.9,20002-2.10,20002-2.11,20002-2.12,20002-2.13,20002-2.14,20002-2.15,20002-2.16,20002-2.17,20002-2.18,20002-2.19,20002-2.20,20002-2.21,20002-2.22,20002-2.23,20002-2.24,20002-2.25,20002-2.26,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre
2251,5068971,2008-03-19,,2018-08-06,2.0,,1.0,0.0,,1.0,,,53.0,,,0.0,5.572,,2017-04-18,,40.3,,2017-05-24,,1065.0,1463.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,1223.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2003-06-06


Below we show that all patients who self-diagnosed themselves with diabetes after the first reading also did not get diagnosed by a doctor for diabetes at the first reading. We keep these patients in the dataframe and will label them as progressing to diabetes.

In [94]:
pd.options.display.max_rows = 10
pd.options.display.max_columns = 10
only_self_diagnosed_diabetics_from_icd_progressing_to_diabetes = only_self_diagnosed_diabetics_from_icd[(only_self_diagnosed_diabetics_from_icd['20002-2.0'] == 1223) & (only_self_diagnosed_diabetics_from_icd['Diabetes Diagnosed 1'] == 0)]
only_self_diagnosed_diabetics_from_icd_progressing_to_diabetes

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,...,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre
2251,5068971,2008-03-19,,2018-08-06,2.0,...,,,,,2003-06-06


Below we keep a list of the patients above as progressing to diabetes for future labeling.

In [95]:
dfToList_diabetes_progression_without_doctor_diagnosis_from_icd = only_self_diagnosed_diabetics_from_icd_progressing_to_diabetes['eid'].tolist()
dfToList_diabetes_progression_without_doctor_diagnosis_from_icd

[5068971]

Now that we have defined our prediabetic patients, we can import another dataframe that contains all the diabetes diagnoses for patients. We again have a column we need to drop that comes over with the SCU index.

In [96]:
patients_with_type2_diabetes_with_dates = pd.read_csv('patients_with_type2_diabetes_with_dates')
patients_with_type2_diabetes_with_dates = patients_with_type2_diabetes_with_dates.drop('Unnamed: 0', axis = 1)
patients_with_type2_diabetes_with_dates

Unnamed: 0,eid,ins_index,arr_index,level,diag_icd9,diag_icd9_nb,diag_icd10,diag_icd10_nb,epistart
0,1000402,3,1,2,,,E119,,28/06/2018
1,1000402,1,1,2,,,E119,,12/03/2008
2,1000402,0,3,2,,,E119,,16/07/2008
3,1000610,6,3,2,,,E119,,11/05/2005
4,1000610,0,5,2,,,E112,,06/10/2010
...,...,...,...,...,...,...,...,...,...
250364,6025367,12,2,2,,,E119,,02/11/2017
250365,6025367,11,2,2,,,E119,,07/09/2017
250366,6025367,6,4,2,,,E119,,09/04/2015
250367,6025367,5,2,2,,,E119,,31/12/2014


Next we change epistart to datetime format from object. We see that the date now has the format of year-month-day.

In [97]:
patients_with_type2_diabetes_with_dates['epistart'] = pd.to_datetime(patients_with_type2_diabetes_with_dates['epistart'])
patients_with_type2_diabetes_with_dates

Unnamed: 0,eid,ins_index,arr_index,level,diag_icd9,diag_icd9_nb,diag_icd10,diag_icd10_nb,epistart
0,1000402,3,1,2,,,E119,,2018-06-28
1,1000402,1,1,2,,,E119,,2008-12-03
2,1000402,0,3,2,,,E119,,2008-07-16
3,1000610,6,3,2,,,E119,,2005-11-05
4,1000610,0,5,2,,,E112,,2010-06-10
...,...,...,...,...,...,...,...,...,...
250364,6025367,12,2,2,,,E119,,2017-02-11
250365,6025367,11,2,2,,,E119,,2017-07-09
250366,6025367,6,4,2,,,E119,,2015-09-04
250367,6025367,5,2,2,,,E119,,2014-12-31


Next we want to merge our values with those of the prediabetic patients. This will tell us if there are any patients we missed who developed diabetes, or possibly patients we should not account for who had diabetes before the measurements for prediabetes. 

In [98]:
pd.options.display.max_columns = 14
prediabetic_from_icd_at_start_finding_more_diabetes = patients_from_icd_with_prediabetes_dates_before_test.merge(patients_with_type2_diabetes_with_dates[['eid', 'epistart']], on = 'eid')
prediabetic_from_icd_at_start_finding_more_diabetes

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre,epistart
0,2425377,2008-07-19,,,2.0,,,...,,,,,,2008-07-03,2008-12-06
1,2425377,2008-07-19,,,2.0,,,...,,,,,,2008-07-03,2008-12-06
2,2425377,2008-07-19,,,2.0,,,...,,,,,,2008-07-03,2011-04-10
3,2425377,2008-07-19,,,2.0,,,...,,,,,,2008-07-03,2016-09-24
4,2425377,2008-07-19,,,2.0,,,...,,,,,,2008-07-03,2014-06-16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2962,1076801,2009-04-17,,,3.0,,,...,,,,,,2003-01-12,2020-03-29
2963,1076801,2009-04-17,,,3.0,,,...,,,,,,2003-01-12,2019-11-25
2964,1076801,2009-04-17,,,3.0,,,...,,,,,,2003-01-12,2020-03-30
2965,1076801,2009-04-17,,,3.0,,,...,,,,,,2003-01-12,2011-06-09


Below we show the data types of the columns.

In [99]:
prediabetic_from_icd_at_start_finding_more_diabetes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2967 entries, 0 to 2966
Columns: 128 entries, eid to epistart
dtypes: datetime64[ns](3), float64(118), int64(1), object(6)
memory usage: 2.9+ MB


Our next goal is to find the patients who start off with prediabetic conditions, and find out all the patients who are diagnosed with diabetes at a later date, both those who do and do not have a doctor diagnosis of diabetes as well. This is done by comparing the dates when the prediabetic diagnosis is given, the diabetic diagnosis is given, and the date attending the assessment center is given. Below we change the date attending the assessment center column to datetime data type.

In [100]:
prediabetic_from_icd_at_start_finding_more_diabetes['53-0.0'] = pd.to_datetime(prediabetic_from_icd_at_start_finding_more_diabetes['53-0.0'])
prediabetic_from_icd_at_start_finding_more_diabetes

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre,epistart
0,2425377,2008-07-19,,,2.0,,,...,,,,,,2008-07-03,2008-12-06
1,2425377,2008-07-19,,,2.0,,,...,,,,,,2008-07-03,2008-12-06
2,2425377,2008-07-19,,,2.0,,,...,,,,,,2008-07-03,2011-04-10
3,2425377,2008-07-19,,,2.0,,,...,,,,,,2008-07-03,2016-09-24
4,2425377,2008-07-19,,,2.0,,,...,,,,,,2008-07-03,2014-06-16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2962,1076801,2009-04-17,,,3.0,,,...,,,,,,2003-01-12,2020-03-29
2963,1076801,2009-04-17,,,3.0,,,...,,,,,,2003-01-12,2019-11-25
2964,1076801,2009-04-17,,,3.0,,,...,,,,,,2003-01-12,2020-03-30
2965,1076801,2009-04-17,,,3.0,,,...,,,,,,2003-01-12,2011-06-09


We need to drop duplicate eid numbers so that only the first epistart date is left which will be compared with the dates used to classify all the patients as prediabetic. To do this we first sort the dataframe by eid number and increasing dates within an eid number for epistart and epistart_pre.

In [101]:
prediabetic_from_icd_at_start_finding_more_diabetes_sorted = prediabetic_from_icd_at_start_finding_more_diabetes.sort_values(by = ['eid', 'epistart', 'epistart_pre'])
prediabetic_from_icd_at_start_finding_more_diabetes_sorted

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre,epistart
61,1025738,2008-11-18,,,6.0,,,...,,,,,,2004-05-19,2017-07-06
60,1025738,2008-11-18,,,6.0,,,...,,,,,,2004-05-20,2017-07-06
974,1044919,2007-08-01,,,4.0,,,...,,,,,,2005-12-13,2018-09-11
2055,1055677,2009-02-09,,,4.0,,,...,,,,,,2006-09-26,2007-06-17
2058,1055677,2009-02-09,,,4.0,,,...,,,,,,2006-09-26,2007-06-17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2521,5983540,2009-06-02,,,3.0,,,...,,,,,,2009-01-22,2010-09-15
2519,5983540,2009-06-02,,,3.0,,,...,,,,,,2009-01-22,2011-06-11
2520,5983540,2009-06-02,,,3.0,,,...,,,,,,2009-01-22,2011-08-11
2517,5983540,2009-06-02,,,3.0,,,...,,,,,,2009-01-22,2011-11-22


Next we drop the rows by eid number keeping only the first eid value such that the first date of diabetes diagnosis is the only date remaining to be used for date comparisons.

In [102]:
prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date = prediabetic_from_icd_at_start_finding_more_diabetes_sorted.drop_duplicates(subset = 'eid', keep = 'first')
prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre,epistart
61,1025738,2008-11-18,,,6.0,,,...,,,,,,2004-05-19,2017-07-06
974,1044919,2007-08-01,,,4.0,,,...,,,,,,2005-12-13,2018-09-11
2055,1055677,2009-02-09,,,4.0,,,...,,,,,,2006-09-26,2007-06-17
1095,1055794,2010-03-12,,,4.0,,,...,,,,,,2009-12-10,2009-10-23
1291,1067255,2010-06-19,,,5.0,,,...,,,,,,2005-01-12,2011-06-24
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
660,5905564,2009-08-24,,,3.0,,,...,,,,,,2008-12-31,2014-03-06
217,5946732,2009-06-23,,,4.0,,,...,,,,,,2005-02-08,2013-05-30
310,5969205,2009-08-07,,,4.0,,,...,,,,,,2007-02-26,2013-02-05
2272,5969895,2010-03-11,,,1.0,,,...,,,,,,2002-07-19,2007-08-25


Now that we have both the original assessment date when the patients were diagnosed as prediabetic and the date when the patients were diagnosed with diabetes, all we have to do is mark the patients who developed diabetes after the first assessment when they only had prediabetes. Below we create a column that marks True if the patient develops diabetes (epistart is later than the prediabetes epistart date) and false if the patient had diabetes before the original classification as a prediabetic.

In [103]:
prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date['target'] = prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date['epistart_pre'] < prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date['epistart']
prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre,epistart,target
61,1025738,2008-11-18,,,6.0,,,...,,,,,2004-05-19,2017-07-06,True
974,1044919,2007-08-01,,,4.0,,,...,,,,,2005-12-13,2018-09-11,True
2055,1055677,2009-02-09,,,4.0,,,...,,,,,2006-09-26,2007-06-17,True
1095,1055794,2010-03-12,,,4.0,,,...,,,,,2009-12-10,2009-10-23,False
1291,1067255,2010-06-19,,,5.0,,,...,,,,,2005-01-12,2011-06-24,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
660,5905564,2009-08-24,,,3.0,,,...,,,,,2008-12-31,2014-03-06,True
217,5946732,2009-06-23,,,4.0,,,...,,,,,2005-02-08,2013-05-30,True
310,5969205,2009-08-07,,,4.0,,,...,,,,,2007-02-26,2013-02-05,True
2272,5969895,2010-03-11,,,1.0,,,...,,,,,2002-07-19,2007-08-25,True


Below we convert our test column to binary.

In [104]:
prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date['target'] = prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date["target"].astype(int)
prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre,epistart,target
61,1025738,2008-11-18,,,6.0,,,...,,,,,2004-05-19,2017-07-06,1
974,1044919,2007-08-01,,,4.0,,,...,,,,,2005-12-13,2018-09-11,1
2055,1055677,2009-02-09,,,4.0,,,...,,,,,2006-09-26,2007-06-17,1
1095,1055794,2010-03-12,,,4.0,,,...,,,,,2009-12-10,2009-10-23,0
1291,1067255,2010-06-19,,,5.0,,,...,,,,,2005-01-12,2011-06-24,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
660,5905564,2009-08-24,,,3.0,,,...,,,,,2008-12-31,2014-03-06,1
217,5946732,2009-06-23,,,4.0,,,...,,,,,2005-02-08,2013-05-30,1
310,5969205,2009-08-07,,,4.0,,,...,,,,,2007-02-26,2013-02-05,1
2272,5969895,2010-03-11,,,1.0,,,...,,,,,2002-07-19,2007-08-25,1


Below we show the number of patients who had the ICD codes that classified them as prediabetic which occurred before the epistart diabetes diagnosis of the patients.

In [105]:
prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date_before = prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date[(prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date['target'] == 1)]
prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date_before

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre,epistart,target
61,1025738,2008-11-18,,,6.0,,,...,,,,,2004-05-19,2017-07-06,1
974,1044919,2007-08-01,,,4.0,,,...,,,,,2005-12-13,2018-09-11,1
2055,1055677,2009-02-09,,,4.0,,,...,,,,,2006-09-26,2007-06-17,1
1291,1067255,2010-06-19,,,5.0,,,...,,,,,2005-01-12,2011-06-24,1
157,1072495,2009-02-26,,,2.0,,,...,,,,,2006-02-05,2014-04-02,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
660,5905564,2009-08-24,,,3.0,,,...,,,,,2008-12-31,2014-03-06,1
217,5946732,2009-06-23,,,4.0,,,...,,,,,2005-02-08,2013-05-30,1
310,5969205,2009-08-07,,,4.0,,,...,,,,,2007-02-26,2013-02-05,1
2272,5969895,2010-03-11,,,1.0,,,...,,,,,2002-07-19,2007-08-25,1


# MARKED BELOW ARE 147 PATIENTS WITH DIABETES WHO WE HAD PROGRESSING TO DIABETES

Below we cut out all patients who are already diabetic.

In [106]:
print(len(prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date_before.drop_duplicates(subset = 'eid')))
test = prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date_before.drop_duplicates(subset = 'eid')[~prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date_before.drop_duplicates(subset = 'eid').eid.isin(dfToList_already_diabetic_at_start)]
test

199


Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre,epistart,target
61,1025738,2008-11-18,,,6.0,,,...,,,,,2004-05-19,2017-07-06,1
974,1044919,2007-08-01,,,4.0,,,...,,,,,2005-12-13,2018-09-11,1
64,1097522,2010-06-09,,,5.0,,,...,,,,,2006-06-26,2018-01-31,1
145,1162102,2008-09-08,,,14.0,,,...,,,,,2002-07-17,2017-04-12,1
801,1345140,2008-06-17,,,6.0,,,...,,,,,2007-03-13,2016-11-05,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1919,5600407,2008-03-01,,,15.0,,,...,,,,,2005-09-25,2007-09-10,1
1991,5765638,2008-07-07,,,6.0,,,...,,,,,2001-01-22,2017-04-12,1
2584,5905443,2006-04-19,,,6.0,,,...,,,,,2004-03-14,2013-07-30,1
217,5946732,2009-06-23,,,4.0,,,...,,,,,2005-02-08,2013-05-30,1


Below we create a list of these patients to remove these from the final dataframe since we already know what label to give them.

In [107]:
dfToList_diabetes_progression_from_icd = prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date_before['eid'].tolist()
dfToList_diabetes_progression_from_icd

[1025738,
 1044919,
 1055677,
 1067255,
 1072495,
 1097522,
 1102492,
 1117612,
 1146097,
 1162102,
 1202644,
 1328378,
 1336221,
 1338136,
 1340066,
 1345140,
 1350792,
 1395277,
 1404125,
 1423963,
 1424752,
 1436188,
 1457680,
 1476060,
 1478693,
 1494306,
 1516312,
 1568527,
 1609289,
 1637230,
 1657339,
 1665367,
 1706511,
 1718079,
 1731265,
 1796260,
 1801554,
 1829255,
 1858957,
 1872446,
 1895718,
 1926625,
 1936742,
 1943774,
 1954071,
 1963418,
 1998845,
 2031987,
 2064376,
 2191774,
 2191996,
 2203109,
 2219400,
 2225944,
 2233450,
 2233800,
 2268137,
 2268893,
 2325762,
 2335695,
 2343298,
 2344091,
 2364302,
 2425377,
 2429507,
 2479180,
 2480161,
 2484797,
 2486950,
 2500901,
 2533918,
 2551317,
 2586223,
 2591687,
 2595175,
 2632123,
 2648322,
 2650506,
 2666552,
 2684632,
 2693446,
 2693862,
 2699084,
 2739978,
 2745944,
 2748729,
 2763479,
 2817041,
 2840954,
 2847047,
 2850077,
 2894270,
 2908413,
 2961619,
 2997821,
 3027047,
 3054818,
 3056879,
 3070446,
 3094464,


Below we show the patients who were diagnosed with diabetes before their prediabetic diagnosis. These patients must be cut from the total dataframe.

In [108]:
prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date_after = prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date[(prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date['target'] == 0)]
prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date_after

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre,epistart,target
1095,1055794,2010-03-12,,,4.0,,,...,,,,,2009-12-10,2009-10-23,0
2960,1076801,2009-04-17,,,3.0,,,...,,,,,2003-01-12,2003-01-12,0
2875,1097144,2010-01-08,,,4.0,,,...,,,,,2005-12-08,2001-12-15,0
1623,1154973,2009-02-25,,,2.0,,,...,,,,,2006-09-25,2003-08-05,0
40,1279042,2009-09-22,,,4.0,,,...,,,,,2005-03-21,2005-03-21,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1875,5446165,2008-06-04,,,3.0,,,...,,,,,2002-06-20,2002-06-20,0
1061,5469241,2009-09-09,,,6.0,,,...,,,,,1998-07-05,1998-07-05,0
619,5709199,2008-01-18,,,3.0,,,...,,,,,2005-10-13,1999-06-28,0
1792,5719990,2009-03-23,,,3.0,,,...,,,,,2002-04-08,2002-03-22,0


Below we create a list of the patients who must be cut from the final dataframe of prediabetic patients.

In [109]:
dfToList_diabetes_before_prediabetes_from_icd = prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date_after['eid'].tolist()
dfToList_diabetes_before_prediabetes_from_icd

[1055794,
 1076801,
 1097144,
 1154973,
 1279042,
 1357422,
 1463906,
 1763405,
 1799202,
 1847804,
 2142353,
 2198369,
 2259101,
 2359100,
 2607995,
 2757921,
 2788914,
 2791771,
 2840113,
 2869088,
 2896574,
 3003304,
 3042220,
 3054716,
 3084681,
 3138702,
 3160888,
 3472185,
 3583678,
 3607394,
 3752154,
 3767691,
 3822945,
 3830044,
 3924052,
 4048637,
 4235103,
 4507199,
 4541068,
 4836696,
 5007002,
 5204509,
 5271125,
 5313260,
 5432873,
 5446165,
 5469241,
 5709199,
 5719990,
 5746353]

Below we show the final number of patients we will add to the cohort of patients not progressing to diabetes. We believe this is justified because these patients were classified by medical professionals to have prediabetes and will therefore be closely monitored for diabetes development. If they were to develop diabetes, they would have an ICD code for it and it would most likely be known.

In [110]:
patients_from_icd_no_diabetes = patients_from_icd_with_prediabetes_dates_before_test[~patients_from_icd_with_prediabetes_dates_before_test.eid.isin(dfToList_diabetes_before_prediabetes_from_icd)]
patients_from_icd_no_diabetes = patients_from_icd_no_diabetes[~patients_from_icd_no_diabetes.eid.isin(dfToList_diabetes_progression_from_icd)]
patients_from_icd_no_diabetes = patients_from_icd_no_diabetes.drop_duplicates(subset = 'eid')
patients_from_icd_no_diabetes

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,epistart_pre
18,4754980,2009-02-11,,,3.0,,,...,,,,,,,2000-04-20
63,3990434,2009-07-09,,,3.0,,,...,,,,,,,2005-07-14
93,4341307,2008-03-19,,,6.0,,,...,,,,,,,2008-02-05
98,2189918,2009-06-23,,,5.0,,,...,,,,,,,2002-05-27
115,1384041,2007-09-07,,,5.0,,,...,,,,,,,2004-04-26
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3713,3682872,2009-10-07,2013-04-27,,2.0,8.0,,...,,,,,,,2007-06-25
3716,4571529,2008-10-31,,,1.0,,,...,,,,,,,2007-08-22
3773,3065620,2008-10-11,,,4.0,,,...,,,,,,,2003-03-06
3793,3960736,2007-08-02,,,2.0,,,...,,,,,,,2003-01-04


Below we save the dataframe above for exportation and use in new dataframes.

In [111]:
###patients_from_icd_no_diabetes.to_csv(path_or_buf = 'all_not_progressing_to_diabetic_icd_code_classified_patients')

Below we make a list of the eid numbers for these patients.

In [112]:
dfToList_prediabetes_not_progressing_to_diabetes = patients_from_icd_no_diabetes['eid'].tolist()
dfToList_prediabetes_not_progressing_to_diabetes

[4754980,
 3990434,
 4341307,
 2189918,
 1384041,
 2668919,
 4731399,
 5176915,
 2546801,
 4074087,
 5610428,
 4291108,
 1087841,
 2400159,
 1333339,
 2651683,
 3021843,
 3741116,
 2300870,
 5300315,
 3131032,
 5054628,
 3848034,
 3814179,
 4041516,
 3244213,
 2153925,
 5275213,
 4247282,
 3005969,
 5415675,
 2666156,
 1157575,
 3787794,
 1721346,
 4125910,
 4772707,
 2629558,
 4329061,
 3593711,
 3706865,
 3468606,
 1526262,
 2315761,
 1532705,
 3762664,
 1696778,
 2238033,
 2412934,
 2802052,
 3586593,
 2883116,
 1245115,
 3993670,
 3857759,
 2994186,
 4054334,
 1656085,
 5985181,
 1099911,
 1877610,
 1414086,
 5495249,
 2771587,
 3615764,
 5176828,
 4968900,
 3132679,
 2319445,
 3274237,
 2935906,
 1443994,
 1931859,
 5665291,
 3387357,
 3133140,
 4065319,
 5083386,
 2979692,
 2262386,
 2024396,
 2455237,
 3631850,
 2839824,
 5273096,
 3230080,
 4893226,
 3554949,
 3245699,
 2997509,
 4757559,
 5817242,
 3422921,
 2337129,
 1204343,
 2857922,
 3406342,
 5040694,
 1642615,
 5402929,


# We have found the following information now:

Targets who progressed to diabetes from prediabetic state:
1. Lists of prediabetic patients who self-diagnosed themselves for having diabetes after the first measurements were taken for both blood glucose (dfToList_diabetes_progression_without_doctor_diagnosis_blood_glucose) and hba1c (dfToList_diabetes_progression_without_doctor_diagnosis_hba1c) and icd codes (dfToList_diabetes_progression_without_doctor_diagnosis_from_icd) defined prediabetics.
2. Lists of prediabetic patients who progressed to a diabetic state for both patients classified as prediabetic using the blood sugar (dfToList_diabetes_progression_blood_glucose) and hba1c (dfToList_diabetes_progression_hba1c) tests and icd codes (dfToList_diabetes_progression_from_icd). 

Patients needed to be cut from the final dataframe:
3. Patients who were diagnosed with diabetes before blood samples were taken from them to determine if they had prediabetes for hba1c (dfToList_diabetes_before_prediabetes_hba1c) and blood glucose (dfToList_diabetes_before_prediabetes_blood_glucose) and (dfToList_diabetes_before_prediabetes_from_icd).
4. Lists for patients who were self-diagnosed with diabetes at the start of the study and must be cut from the dataframe for both hba1c (dfToList_diabetes_at_start_hba1c) and blood glucose (dfToList_diabetes_at_start_blood_glucose) and icd codes (dfToList_diabetes_at_start_from_icd).

Below we find how many patients falling in the HbA1c prediabetes range OR falling within the Blood Glucose prediabetes range. We see that this results in many more patients in the prediabetic range than when we only keep patients with BOTH conditions.

In [113]:
prediabetic_all_at_start = merged_for_prediabetes[((merged_for_prediabetes['HbA1c 1'] >= 42) & (merged_for_prediabetes['HbA1c 1'] <= 47)) | ((patients_fasting['Blood Glucose Level 1'] >= 5.6) & (patients_fasting['Blood Glucose Level 1'] <= 7))]
prediabetic_all_at_start

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
10,1797967,2008-09-10,,,3.0,,,...,,,,,,,
75,1945240,2008-02-23,,,4.0,,,...,,,,,,,
79,1589923,2010-07-01,,,4.0,,,...,,,,,,,
87,4899746,2009-11-10,,,4.0,,,...,,,,,,,
111,4863545,2009-10-29,,,5.0,,,...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
502273,3241124,2010-03-29,,,7.0,,,...,,,,,,,
502333,6022397,2007-12-20,,,3.0,,,...,,,,,,,
502391,5011216,2009-11-26,,,3.0,,,...,,,,,,,
502421,3482443,2008-06-05,,,2.0,,,...,,,,,,,


Next we filter the patients which we found to be self-diagnosed with diabetes at the start of the study since we make the assumption that these patients do in fact have diabetes.

In [114]:
prediabetic_all_at_start = prediabetic_all_at_start[~prediabetic_all_at_start.eid.isin(dfToList_diabetes_at_start_hba1c)]
prediabetic_all_at_start = prediabetic_all_at_start[~prediabetic_all_at_start.eid.isin(dfToList_diabetes_at_start_blood_glucose)]
prediabetic_all_at_start

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
10,1797967,2008-09-10,,,3.0,,,...,,,,,,,
75,1945240,2008-02-23,,,4.0,,,...,,,,,,,
79,1589923,2010-07-01,,,4.0,,,...,,,,,,,
87,4899746,2009-11-10,,,4.0,,,...,,,,,,,
111,4863545,2009-10-29,,,5.0,,,...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
502273,3241124,2010-03-29,,,7.0,,,...,,,,,,,
502333,6022397,2007-12-20,,,3.0,,,...,,,,,,,
502391,5011216,2009-11-26,,,3.0,,,...,,,,,,,
502421,3482443,2008-06-05,,,2.0,,,...,,,,,,,


Next we must filter out all patients who were diagnosed with diabetes, before we classified them as prediabetic, using the ICD numbers E110-E119 corresponding to type 2 diabetes.

In [115]:
prediabetic_all_at_start = prediabetic_all_at_start[~prediabetic_all_at_start.eid.isin(dfToList_diabetes_before_prediabetes_hba1c)]
prediabetic_all_at_start = prediabetic_all_at_start[~prediabetic_all_at_start.eid.isin(dfToList_diabetes_before_prediabetes_blood_glucose)]
prediabetic_all_at_start

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
10,1797967,2008-09-10,,,3.0,,,...,,,,,,,
75,1945240,2008-02-23,,,4.0,,,...,,,,,,,
79,1589923,2010-07-01,,,4.0,,,...,,,,,,,
87,4899746,2009-11-10,,,4.0,,,...,,,,,,,
111,4863545,2009-10-29,,,5.0,,,...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
502273,3241124,2010-03-29,,,7.0,,,...,,,,,,,
502333,6022397,2007-12-20,,,3.0,,,...,,,,,,,
502391,5011216,2009-11-26,,,3.0,,,...,,,,,,,
502421,3482443,2008-06-05,,,2.0,,,...,,,,,,,


Below we cut out all prediabetic patients defined from the hba1c and blood glucose tests who developed diabetes since we already have their labels. 

In [116]:
prediabetic_all_at_start_without_known_diabetes_progressors = prediabetic_all_at_start[~prediabetic_all_at_start.eid.isin(dfToList_diabetes_progression_hba1c)]
prediabetic_all_at_start_without_known_diabetes_progressors = prediabetic_all_at_start[~prediabetic_all_at_start.eid.isin(dfToList_diabetes_progression_blood_glucose)]
prediabetic_all_at_start_without_known_diabetes_progressors

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
10,1797967,2008-09-10,,,3.0,,,...,,,,,,,
75,1945240,2008-02-23,,,4.0,,,...,,,,,,,
79,1589923,2010-07-01,,,4.0,,,...,,,,,,,
87,4899746,2009-11-10,,,4.0,,,...,,,,,,,
111,4863545,2009-10-29,,,5.0,,,...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
502273,3241124,2010-03-29,,,7.0,,,...,,,,,,,
502333,6022397,2007-12-20,,,3.0,,,...,,,,,,,
502391,5011216,2009-11-26,,,3.0,,,...,,,,,,,
502421,3482443,2008-06-05,,,2.0,,,...,,,,,,,


We now have a dataframe in which we do not know if the remaining prediabetic patients progress to diabetes or not. This study requires follow up assessments in order to be able to say whether a patient developed diabetes or not. Therefore, we need to filter for all patients who had a value in either the second or third Diabetes Diagnosed column. To answer this question, we must only look for patients who have values in the Diabetes Diagnosed 1 and either the Diabetes Diagnosed 2 or Diabetes Diagnosed 3 columns since this will tell us if a patient progresses to diabetes or not. This tells us exactly the response we want to know: Whether or not the patient developed diabetes after first being diagnosed with prediabetes. Below we show that there are 1,131 patients corresponding to this specification.

In [117]:
prediabetic_all_at_start_without_known_diabetes_progressors_all_possible_to_classify = prediabetic_all_at_start_without_known_diabetes_progressors[prediabetic_all_at_start_without_known_diabetes_progressors['Diabetes Diagnosed 2'].notnull() | prediabetic_all_at_start_without_known_diabetes_progressors['Diabetes Diagnosed 3'].notnull()]
prediabetic_all_at_start_without_known_diabetes_progressors_all_possible_to_classify

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
956,5809754,2008-05-03,2013-04-15,,2.0,2.0,,...,,,,,,,
1207,3650787,2010-02-25,2012-11-21,,2.0,4.0,,...,,,,,,,
1410,3527899,2010-01-23,2012-12-15,,3.0,5.0,,...,,,,,,,
2301,2802011,2010-03-31,2013-04-24,,4.0,5.0,,...,,,,,,,
2734,5165708,2008-12-08,,2017-08-04,4.0,,2.0,...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501209,1967943,2008-01-24,,2017-05-04,8.0,,4.0,...,,,,,,,
501297,1519651,2010-07-12,,2015-11-04,4.0,,3.0,...,,,,,,,
501491,4255872,2009-06-24,,2018-08-11,2.0,,2.0,...,,,,,,,
501600,6024145,2008-12-15,2013-01-15,,1.0,5.0,,...,,,,,,,


Next we want to filter out all patients that have gestational diabetes only (Gestational Diabetes Only 1 = 1.0 or Gestational Diabetes Only 2 = 1.0). The reason for this is because we do not consider gestational diabetes because the driving factor for this is pregnancy, which means that other features do not have as much influence on these patients and they may bias the model.

In [118]:
prediabetic_all_at_start_without_known_diabetes_progressors_all_possible_to_classify = prediabetic_all_at_start_without_known_diabetes_progressors_all_possible_to_classify.drop(prediabetic_all_at_start_without_known_diabetes_progressors_all_possible_to_classify[(prediabetic_all_at_start_without_known_diabetes_progressors_all_possible_to_classify['Gestational Diabetes Only 1'] == 1.0) | (prediabetic_all_at_start_without_known_diabetes_progressors_all_possible_to_classify['Gestational Diabetes Only 2'] == 1.0)].index)
prediabetic_all_at_start_without_known_diabetes_progressors_all_possible_to_classify

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
956,5809754,2008-05-03,2013-04-15,,2.0,2.0,,...,,,,,,,
1207,3650787,2010-02-25,2012-11-21,,2.0,4.0,,...,,,,,,,
1410,3527899,2010-01-23,2012-12-15,,3.0,5.0,,...,,,,,,,
2301,2802011,2010-03-31,2013-04-24,,4.0,5.0,,...,,,,,,,
2734,5165708,2008-12-08,,2017-08-04,4.0,,2.0,...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501209,1967943,2008-01-24,,2017-05-04,8.0,,4.0,...,,,,,,,
501297,1519651,2010-07-12,,2015-11-04,4.0,,3.0,...,,,,,,,
501491,4255872,2009-06-24,,2018-08-11,2.0,,2.0,...,,,,,,,
501600,6024145,2008-12-15,2013-01-15,,1.0,5.0,,...,,,,,,,


### Remember that our target variables are:

1. Prediabetic patients who develop diabetes
2. Prediabetic patients who do not develop diabetes

Therefore, the next step is to filter those that are diagnosed with diabetes later on by searching for a 1.0 in the diabetes diagnosed 2 and diabetes diagnosed 3 columns. We see that we have a total of 300 patients who develop diabetes. Therefore, we have 1130 - 300 = 830 patients who do not develop diabetes. Our goal is to find why these groups are different.

In [119]:
developed_diabetes = prediabetic_all_at_start_without_known_diabetes_progressors_all_possible_to_classify[(prediabetic_all_at_start_without_known_diabetes_progressors_all_possible_to_classify['Diabetes Diagnosed 2'] == 1.0) | (prediabetic_all_at_start_without_known_diabetes_progressors_all_possible_to_classify['Diabetes Diagnosed 3'] == 1.0)]
developed_diabetes

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
1207,3650787,2010-02-25,2012-11-21,,2.0,4.0,,...,,,,,,,
1410,3527899,2010-01-23,2012-12-15,,3.0,5.0,,...,,,,,,,
8584,3672453,2008-08-14,2013-03-08,,6.0,2.0,,...,,,,,,,
12049,3466795,2007-04-20,2012-10-19,,3.0,2.0,,...,,,,,,,
13537,3806269,2008-04-05,,2016-02-22,4.0,,3.0,...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
497033,3005111,2010-05-17,,2016-02-21,4.0,,2.0,...,,,,,,,
497139,5473950,2008-02-06,,2018-08-27,3.0,,8.0,...,,,,,,,
499313,1449914,2008-02-06,2013-04-14,,5.0,5.0,,...,,,,,,,
501209,1967943,2008-01-24,,2017-05-04,8.0,,4.0,...,,,,,,,


Below we make a list of these reports which we will use to cut them from the dataframe so that the only remaining patients will not have developed diabetes.

In [120]:
dfToList_developed_diabetes_doctor_diagnosed = developed_diabetes['eid'].tolist()
dfToList_developed_diabetes_doctor_diagnosed

[3650787,
 3527899,
 3672453,
 3466795,
 3806269,
 1953832,
 5013729,
 4638333,
 3899441,
 1101050,
 1230060,
 1333339,
 2455352,
 5733962,
 5342346,
 5747493,
 3350008,
 3554431,
 4268856,
 4616298,
 3408534,
 4308316,
 5378414,
 2115441,
 3557587,
 2905618,
 3713275,
 2239274,
 4661761,
 1146655,
 4781782,
 1874554,
 4777233,
 3962483,
 4226583,
 3370182,
 2330395,
 1093062,
 4468202,
 5679675,
 3817334,
 3443880,
 3860455,
 1052039,
 5867217,
 3360161,
 2671183,
 4971240,
 1574877,
 4927310,
 1461063,
 4985579,
 1303461,
 4451985,
 2172975,
 1909046,
 5314428,
 3221592,
 3294877,
 2617656,
 3205845,
 5656635,
 1771695,
 5821615,
 1175507,
 3487187,
 1945509,
 3255652,
 5269964,
 1087380,
 5539124,
 3742688,
 5502647,
 2263614,
 1465629,
 2766259,
 3084584,
 3737542,
 1877610,
 4733206,
 5812310,
 2054766,
 5308696,
 5638591,
 4727805,
 3130710,
 2646796,
 2023718,
 4448221,
 3089987,
 4380642,
 4311574,
 5235652,
 2651780,
 2345323,
 5328184,
 3656859,
 3259630,
 3814312,
 3327373,


Below we cut the eid values from the dataframe for patients who did develop diabetes in order to get all prediabetic patients who did not develop diabetes and this was confirmed by a later checkup date. We see that the number of patients in the dataframe below added with the number of patients found to develop diabetes gives the number of total prediabetic patients (830 + 300 = 1130). This is a good check to know we have not messed up.

In [121]:
did_not_develop_diabetes_total = prediabetic_all_at_start_without_known_diabetes_progressors_all_possible_to_classify[~prediabetic_all_at_start_without_known_diabetes_progressors_all_possible_to_classify.eid.isin(dfToList_developed_diabetes_doctor_diagnosed)]
did_not_develop_diabetes_total

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
956,5809754,2008-05-03,2013-04-15,,2.0,2.0,,...,,,,,,,
2301,2802011,2010-03-31,2013-04-24,,4.0,5.0,,...,,,,,,,
2734,5165708,2008-12-08,,2017-08-04,4.0,,2.0,...,,,,,,,
2808,1482078,2008-05-08,2012-08-11,,3.0,5.0,,...,,,,,,,
2979,5350822,2007-12-10,2012-09-26,2014-10-25,3.0,6.0,3.0,...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501096,3734872,2007-08-29,2013-05-22,,16.0,4.0,,...,,,,,,,
501179,3896205,2009-08-14,,2016-01-15,4.0,,1.0,...,,,,,,,
501297,1519651,2010-07-12,,2015-11-04,4.0,,3.0,...,,,,,,,
501491,4255872,2009-06-24,,2018-08-11,2.0,,2.0,...,,,,,,,


The final step is to label all of our dataframes as either patients with prediabetes developing diabetes (1) or not developing diabetes (0). I think the best way to do this is to combine all lists of patients who progressed to diabetes first and then make a dataframe using these eid numbers from the original merged patient dataframe. Then we can label them. We already have the dataframe above of prediabetic patients who do not progress to diabetes so we will just label them without doing anything to them. Finally, we can concatenate the dataframes together. First we combine the lists of prediabetic patients who developed diabetes.

In [122]:
dfToList_diabetes_progression_hba1c.extend(dfToList_diabetes_progression_blood_glucose)
dfToList_diabetes_progression_hba1c.extend(dfToList_developed_diabetes_doctor_diagnosed)
len(dfToList_diabetes_progression_hba1c)

4506

Next we use the eid numbers in this list to create a new dataframe containing all patients who developed diabetes. Notice that the number of patients is not the total number of rows because there were some duplicates involved in the list.

In [123]:
all_developed_diabetes_total = merged_for_prediabetes[merged_for_prediabetes.eid.isin(dfToList_diabetes_progression_hba1c)]
all_developed_diabetes_total

Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.27,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33
380,5100710,2009-01-20,,,3.0,,,...,,,,,,,
394,4647495,2008-12-05,,,3.0,,,...,,,,,,,
482,5494714,2010-04-14,,,3.0,,,...,,,,,,,
561,3319119,2008-02-09,,,3.0,,,...,,,,,,,
610,1814566,2008-05-27,,,4.0,,,...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
502010,3939961,2007-08-02,,,5.0,,,...,,,,,,,
502226,2215660,2008-06-10,,,3.0,,,...,,,,,,,
502255,5612713,2008-04-17,,,14.0,,,...,,,,,,,
502333,6022397,2007-12-20,,,3.0,,,...,,,,,,,


Next we label these patients with a target variable (1).

In [124]:
all_developed_diabetes_total['target'] = 1
all_developed_diabetes_total


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,target
380,5100710,2009-01-20,,,3.0,,,...,,,,,,,1
394,4647495,2008-12-05,,,3.0,,,...,,,,,,,1
482,5494714,2010-04-14,,,3.0,,,...,,,,,,,1
561,3319119,2008-02-09,,,3.0,,,...,,,,,,,1
610,1814566,2008-05-27,,,4.0,,,...,,,,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
502010,3939961,2007-08-02,,,5.0,,,...,,,,,,,1
502226,2215660,2008-06-10,,,3.0,,,...,,,,,,,1
502255,5612713,2008-04-17,,,14.0,,,...,,,,,,,1
502333,6022397,2007-12-20,,,3.0,,,...,,,,,,,1


Below we do the same (add a 0) for our dataframe for prediabetic patients that did not progress to a diabetic state.

In [125]:
did_not_develop_diabetes_total['target'] = 0
did_not_develop_diabetes_total


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,eid,53-0.0,53-1.0,53-2.0,Fasting Time 1,Fasting Time 2,Fasting Time 3,...,20002-2.28,20002-2.29,20002-2.30,20002-2.31,20002-2.32,20002-2.33,target
956,5809754,2008-05-03,2013-04-15,,2.0,2.0,,...,,,,,,,0
2301,2802011,2010-03-31,2013-04-24,,4.0,5.0,,...,,,,,,,0
2734,5165708,2008-12-08,,2017-08-04,4.0,,2.0,...,,,,,,,0
2808,1482078,2008-05-08,2012-08-11,,3.0,5.0,,...,,,,,,,0
2979,5350822,2007-12-10,2012-09-26,2014-10-25,3.0,6.0,3.0,...,,,,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501096,3734872,2007-08-29,2013-05-22,,16.0,4.0,,...,,,,,,,0
501179,3896205,2009-08-14,,2016-01-15,4.0,,1.0,...,,,,,,,0
501297,1519651,2010-07-12,,2015-11-04,4.0,,3.0,...,,,,,,,0
501491,4255872,2009-06-24,,2018-08-11,2.0,,2.0,...,,,,,,,0


Next we want to do the same thing but with our new prediabetic patients from the ICD codes. First we filter out all patients who had diabetes before prediabetes and at the time of the assessment center measurements.

In [126]:
prediabetic_all_at_start_for_icd = prediabetes_with_dates[~prediabetes_with_dates.eid.isin(dfToList_diabetes_at_start_from_icd)]
prediabetic_all_at_start_for_icd = prediabetic_all_at_start_for_icd[~prediabetic_all_at_start_for_icd.eid.isin(dfToList_diabetes_before_prediabetes_from_icd)]
prediabetic_all_at_start_for_icd

Unnamed: 0,eid,ins_index,arr_index,level,diag_icd9,diag_icd9_nb,diag_icd10,diag_icd10_nb,epistart_pre
0,1006294,37,3,2,,,R739,,2018-12-02
1,1007537,0,3,2,,,R730,,2019-06-06
2,1008393,2,1,2,,,R739,,2011-04-26
3,1010613,0,0,1,,,R739,,2010-07-06
4,1011449,12,5,2,,,R730,,2020-03-02
...,...,...,...,...,...,...,...,...,...
3794,6017167,3,3,2,,,R730,,2019-09-12
3795,6019270,5,9,2,,,R730,,2020-09-24
3796,6020217,7,12,2,,,R730,,2020-08-19
3797,6021943,1,3,2,,,R739,,2020-11-08


Below we take only the patients who we know progressed to diabetes that were diagnosed with prediabetes prior.

In [127]:
prediabetic_all_at_start_for_icd_progress_to_diabetes = prediabetes_with_dates[prediabetes_with_dates.eid.isin(dfToList_diabetes_progression_without_doctor_diagnosis_from_icd) | prediabetes_with_dates.eid.isin(dfToList_diabetes_progression_from_icd)]
prediabetic_all_at_start_for_icd_progress_to_diabetes
# 199 patients in here after dropping duplicates on eid

Unnamed: 0,eid,ins_index,arr_index,level,diag_icd9,diag_icd9_nb,diag_icd10,diag_icd10_nb,epistart_pre
26,1025738,9,2,2,,,R739,,2004-05-20
27,1025738,6,3,2,,,R739,,2004-05-19
35,1044919,0,0,1,,,R739,,2005-12-13
48,1055677,14,0,1,,,R739,,2006-09-26
52,1067255,1,0,1,,,R730,,2005-01-12
...,...,...,...,...,...,...,...,...,...
3699,5905564,30,1,2,,,R739,,2014-06-06
3736,5946732,1,0,1,,,R739,,2005-02-08
3752,5969205,0,0,1,,,R739,,2007-02-26
3753,5969895,2,2,2,,,R739,,2002-07-19


Below we show the dataframe for all remaining prediabetic patients who do not progress to diabetes.

In [128]:
prediabetic_patients_all_at_start_for_icd_progress_to_diabetes = prediabetic_all_at_start_for_icd_progress_to_diabetes.drop_duplicates(subset = 'eid')
prediabetic_patients_all_at_start_for_icd_progress_to_diabetes

Unnamed: 0,eid,ins_index,arr_index,level,diag_icd9,diag_icd9_nb,diag_icd10,diag_icd10_nb,epistart_pre
26,1025738,9,2,2,,,R739,,2004-05-20
35,1044919,0,0,1,,,R739,,2005-12-13
48,1055677,14,0,1,,,R739,,2006-09-26
52,1067255,1,0,1,,,R730,,2005-01-12
58,1072495,0,0,1,,,R730,,2006-02-05
...,...,...,...,...,...,...,...,...,...
3694,5905564,4,0,1,,,R739,,2008-12-31
3736,5946732,1,0,1,,,R739,,2005-02-08
3752,5969205,0,0,1,,,R739,,2007-02-26
3753,5969895,2,2,2,,,R739,,2002-07-19


First, we create a combined dataframe for all prediabetic patients we found not to progress to diabetes.

In [129]:
patients_from_icd_no_diabetes['target'] = 0
did_not_develop_diabetes_total['target'] = 0
patients_from_icd_no_diabetes_final_small = patients_from_icd_no_diabetes[['eid', 'target']]
did_not_develop_diabetes_total_small = did_not_develop_diabetes_total[['eid', 'target']]
prediabetics_not_progressing_to_diabetes_total = patients_from_icd_no_diabetes_final_small.append(did_not_develop_diabetes_total_small)
prediabetics_not_progressing_to_diabetes_total = prediabetics_not_progressing_to_diabetes_total.drop_duplicates(subset = 'eid')
prediabetics_not_progressing_to_diabetes_total


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,eid,target
18,4754980,0
63,3990434,0
93,4341307,0
98,2189918,0
115,1384041,0
...,...,...
501096,3734872,0
501179,3896205,0
501297,1519651,0
501491,4255872,0


Below we create a dataframe for all prediabetic patients who did progress to diabetes.

In [130]:
prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date_before['target'] = 1
all_developed_diabetes_total['target'] = 1
patients_from_icd_diabetes_final_small = prediabetic_from_icd_at_start_finding_more_diabetes_sorted_only_one_date_before[['eid', 'target']]
all_developed_diabetes_total_small = all_developed_diabetes_total[['eid', 'target']]
prediabetics_progressing_to_diabetes_total = patients_from_icd_diabetes_final_small.append(all_developed_diabetes_total_small)
prediabetics_progressing_to_diabetes_total = prediabetics_progressing_to_diabetes_total.drop_duplicates(subset = 'eid')
prediabetics_progressing_to_diabetes_total


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,eid,target
61,1025738,1
974,1044919,1
2055,1055677,1
1291,1067255,1
157,1072495,1
...,...,...
502010,3939961,1
502226,2215660,1
502255,5612713,1
502333,6022397,1


Next we combine the dataframes so that we get one final group of prediabetic patients who are classified as either progressing to diabetes or not.

In [131]:
final_prediabetic_dataframe_labeled = prediabetics_not_progressing_to_diabetes_total.append(prediabetics_progressing_to_diabetes_total)
final_prediabetic_dataframe_labeled

Unnamed: 0,eid,target
18,4754980,0
63,3990434,0
93,4341307,0
98,2189918,0
115,1384041,0
...,...,...
502010,3939961,1
502226,2215660,1
502255,5612713,1
502333,6022397,1


Next we drop the duplicate eid numbers found in the dataframe. This is possible because some of the prediabetic patients who had a value of 0 in the Diabetes Diagnosed 2 or Diabetes Diagnosed 3 columns actually were diagnosed with diabetes using the ICD files. Therefore, we drop duplicates and keep the last entry because our dataframe is structured so that the later entries are for the patients who did develop diabetes, and this characteristic is correct for these patients.

In [132]:
final_prediabetic_dataframe_labeled_no_duplicates = final_prediabetic_dataframe_labeled.drop_duplicates(subset = 'eid', keep = 'last')
final_prediabetic_dataframe_labeled_no_duplicates

Unnamed: 0,eid,target
18,4754980,0
63,3990434,0
93,4341307,0
98,2189918,0
115,1384041,0
...,...,...
502010,3939961,1
502226,2215660,1
502255,5612713,1
502333,6022397,1


Below we cut out all patients who are already diabetic.

In [133]:
final_prediabetic_dataframe_labeled_no_duplicates_correct = final_prediabetic_dataframe_labeled_no_duplicates[~final_prediabetic_dataframe_labeled_no_duplicates.eid.isin(dfToList_already_diabetic_at_start)]
final_prediabetic_dataframe_labeled_no_duplicates_correct

Unnamed: 0,eid,target
18,4754980,0
63,3990434,0
93,4341307,0
115,1384041,0
130,2668919,0
...,...,...
502010,3939961,1
502226,2215660,1
502255,5612713,1
502333,6022397,1
