This notebook was developed by Gautam Machiraju and Conor Corbin, modified by Minh Nguyen

### Description:
- Use 2015 - 2017 as the training data to compute value distributions split into 10 quantiles (deciles)
- Use these distributions to assign new validation and test values to quantile bins

Input: 
- `6_9_coh4_feature_values`: used cohort4

Outputs: `...coh4...`
- `6_10_binned_labs_vitals_train`:
    - value distributions are computed on 2015 - 2017 (training data) and used to bin the 2018 validation data
    - used to train models and select hyperparameters on the validation data
    - test data after 2018 (2019 and part of 2020) is left unused
- `6_10_binned_labs_vitals_test`:
    - after the model hyperparameters have been selected, the value distributions are recomputed on 2015 - 2018
    - test data after 2018 (2019 and part of 2020) is binned based on these distributions
    - the binned test data is used for the final prediction and evaluation of model performance

In [1]:
import os 
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

In [2]:
from google.cloud import bigquery
from google.cloud.bigquery import dbapi

## Use the correct credentials path depending on whether you are on Nero or a local machine
# use Ctrl + Insert to copy and Shift + Insert to paste

# for Nero:
# os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/minh084/.config/gcloud/application_default_credentials.json' 
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/jupyter/.config/gcloud/application_default_credentials.json'

# for local computer:
# os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'C:\Users\User\AppData\Roaming\gcloud\application_default_credentials.json' 

##set correct Nero project
os.environ['GCLOUD_PROJECT'] = 'som-nero-phi-jonc101' 

In [3]:
valdir = "../../OutputTD/6_validation"
cohortdir = "../../OutputTD/1_cohort"
featuredir = "../../OutputTD/2_features"

pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

### Train And Apply Featurizer - change for new validation data after March 2020! 

#### Depending on which dataset used for train/val or train_val/test, choose different time split
- To train and select model hyperparameters, use 2015 - 2017 (< 2018) for the binning distribution --> **NEW: use 2015 - 2018**
- Once a final model with hyperparameters is selected, for the final prediction on an unseen test set, use 2015 - 2018 (< 2019) for the binning distribution --> **NEW: use 2015 - 03/2020**
- With the new validation data, all the old data can be used in this process, without holding out 2019 - 03/2020 as the final test set

In [4]:
# OLD DATA up to 03/2020
df = pd.read_csv(os.path.join(featuredir, "2_7_coh4_feature_values.csv"))

# need this if read from local file, if pulled from BQ, already in datetime format
df["admit_time"] = pd.to_datetime(df["admit_time"]) 

print(len(df)) # 3,085,046



3085046


In [5]:
# check to see if any feature is null
df[df['features'].isnull()]

Unnamed: 0,anon_id,pat_enc_csn_id_coded,admit_time,first_label,death_24hr_recent_label,death_24hr_max_label,feature_type,features,values,time


In [6]:
# df.loc[df['pat_enc_csn_id_coded'] == 131250899044]

In [7]:
df = df[df['feature_type'].isin(['labs', 'vitals'])]
df["time"] = pd.to_datetime(df["time"])

print(len(df)) #2337386

2337386


In [8]:
df.head(5)

Unnamed: 0,anon_id,pat_enc_csn_id_coded,admit_time,first_label,death_24hr_recent_label,death_24hr_max_label,feature_type,features,values,time
747660,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,0,0,0,vitals,Temp,36.9,2019-08-31 10:14:00+00:00
747661,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,0,0,0,vitals,DBP,62.0,2019-08-31 12:00:00+00:00
747662,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,0,0,0,vitals,SBP,124.0,2019-08-31 12:00:00+00:00
747663,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,0,0,0,vitals,Pulse,102.0,2019-08-31 12:00:00+00:00
747664,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,0,0,0,vitals,RR,11.0,2019-08-31 12:00:00+00:00


In [9]:
# calculate time difference in hours between admit time and the time of each vital/lab
# (positive values mean the observation was recorded before admission)
df['difftime'] = round((df['admit_time'] - df['time']).dt.total_seconds()/3600, 1)

# check for observations more than 24 hours before admission time
print(len(df.loc[df['difftime'] > 24])) # 40070
df[['difftime']].describe()

40070


Unnamed: 0,difftime
count,2337386.0
mean,4.232521
std,91.9164
min,0.0
25%,1.3
50%,2.6
75%,4.6
max,140140.5


In [10]:
# remove those observations:
df = df.loc[df['difftime'] <= 24]
print(len(df)) #2297316

# check for coh4:
2337386 - 40070

2297316


2297316

### Process Continuous Features
Create a function that "trains" the binning featurizer (computes the distribution of values) on a subset of the data. This is important because we want to build the distribution from the training set only and then apply the bin featurizer to the test set (this prevents leakage).

Then create a function that "applies" the trained featurizer on a set of data. 
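
As a minimal illustration of the "train" step, the sketch below computes percent ranks and decile bins on a toy series (the 20 values are made up for illustration). It uses the same `rank(pct=True)` and `int(x * 10)` logic as the featurizer in the next cell, and shows why a single largest value can land in an extra bin 10.

```python
import pandas as pd

# Toy "training" values for one feature (hypothetical numbers, 1.0 .. 20.0)
values = pd.Series([float(v) for v in range(1, 21)])

# Percent rank of each value within the feature (rank / number of rows)
percentiles = values.rank(pct=True)

# Truncate percentile * 10 to get a bin label
bins = percentiles.apply(lambda x: int(x * 10))

# The unique maximum has percentile 1.0, so it gets its own bin 10
print(bins.value_counts().sort_index())
```

When the maximum is tied across several rows, the averaged rank is below 1.0 and bin 10 does not appear, which is why some features end up with bins 0 to 9 and others with 0 to 10.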


In [11]:
def convert_to_dict(look_up_table):
    """Converts df look up table to dictionary for faster look up later"""
    bin_val_dict = {}
    for feature in look_up_table['features'].unique():
        feature_bin_vals = look_up_table[look_up_table['features'] == feature]
        for _bin in feature_bin_vals['bins'].unique():
            if feature not in bin_val_dict:
                bin_val_dict[feature] = {}
                bin_val_dict[feature]['min'] = []
                bin_val_dict[feature]['max'] = []

            min_val_for_bin = feature_bin_vals[feature_bin_vals['bins'] == _bin]['values']['min'].values[0]
            max_val_for_bin = feature_bin_vals[feature_bin_vals['bins'] == _bin]['values']['max'].values[0]

            bin_val_dict[feature]['min'].append(min_val_for_bin)
            bin_val_dict[feature]['max'].append(max_val_for_bin)
    return bin_val_dict

    
def train_featurizer(df_train):
    """
    Compute percent_ranks and generates a look up table of min and max bin values
    Input : long form dataframe with features and values where values are the continuous values of labs / vitals
    Output: look up table - dict of dict of lists (key1 = feature_name, key2 = max or min, values = lists of values)
    """
    # Compute percentiles and bins
    df_train['percentiles'] = df_train.groupby('features')['values'].transform(lambda x: x.rank(pct=True))
    df_train['bins'] = df_train['percentiles'].apply(lambda x: int(x * 10))
    
    # Generate look-up table and convert it to a dictionary structure
    look_up_table_df = df_train.groupby(['features', 'bins']).agg({'values' : ['min', 'max']}).reset_index()
    look_up_table = convert_to_dict(look_up_table_df)
    
    ### Sanity Check. Ensure that min vector for each feature is strictly increasing (no ties!)
    # Should be the case because ties are given same percentile rank in default pandas rank function
    for feature in look_up_table:
        mins = look_up_table[feature]['min']
        for i in range(len(mins)-1):
            assert mins[i] < mins[i+1]
    
    return look_up_table


def apply_featurizer(df, look_up_table):
    
    def get_appropriate_bin(feature, value, look_up_table):
        """Takes in feature, value and look up table and returns appropriate bin

        Quick Note: For some features, we do not have 10 bins.  This happens when we have many many ties in the 
        percent rank - and the percent rank alg returns ties as the average rank within that tie. So for instance
        we're trying to break each feature up into deciles where each bin covers range of 10% of the examples. But if more
        than 10% of the examples take on 1 value - then bins can be skipped. This shouldn't really be a problem
        for downstream tasks - just something to be aware of. This also means 'bins' and 'bins_applied' won't have
        perfect overlap in features that end up having less than 10 bins

        """
        mins = look_up_table[feature]['min']
        for i in range(len(mins) - 1):
            # If value is smaller than min value of smallest bin (in test time) - then return 0 (smallest bin)
            if i == 0 and value < mins[i]:
                return i

            if value >= mins[i] and value < mins[i+1] :
                return i

        # Then in last bin
        return len(mins)-1
    
    df['bins_applied'] = df[['features', 'values']].apply(
        lambda x: get_appropriate_bin(x['features'], x['values'], look_up_table), axis=1)
    
    return df
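
The row-wise `apply` above can be slow on millions of rows. A possible vectorized alternative, sketched here under the assumption that the `min` list for a feature is sorted in increasing order (which the sanity check in `train_featurizer` asserts), is `np.searchsorted`. The helper name `bin_with_searchsorted` and the edge values are hypothetical and not part of the original notebook.

```python
import numpy as np

def bin_with_searchsorted(values, mins):
    """Return, for each value, the index of the last lower edge that is <= value."""
    mins = np.asarray(mins, dtype=float)
    values = np.asarray(values, dtype=float)
    # side='right' counts how many lower edges are <= value
    idx = np.searchsorted(mins, values, side='right') - 1
    # values below the smallest edge fall into bin 0;
    # values at or above the largest edge fall into the last bin
    return np.clip(idx, 0, len(mins) - 1)

mins = [0.9, 2.5, 2.9, 3.2]  # hypothetical lower edges for one feature
example_bins = bin_with_searchsorted([0.5, 0.9, 2.7, 3.2, 9.9], mins)
print(example_bins)
```

Like `get_appropriate_bin`, this returns the position in the `min` list rather than the original bin label, so the two differ for features where a bin was skipped.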

### Train And Apply Featurizer - change for new validation data after March 2020! 

#### Depending on which dataset used for train/val or train_val/test, choose different time split
- To train and select model hyperparameters, use 2015 - 2017 (< 2018) for the binning distribution --> **NEW: use 2015 - 2018**
- Once a final model with hyperparameters is selected, for the final prediction on an unseen test set, use 2015 - 2018 (< 2019) for the binning distribution --> **NEW: use 2015 - 03/2020**
- With the new validation data, all the old data can be used in this process, without holding out 2019 - 03/2020 as the final test set
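
The time split described above can be sketched on a toy frame as follows. The column name `admit_time` matches the notebook; the dates, values and the `cutoff_year` variable are hypothetical and only illustrate how a year cutoff separates the rows used to fit the bins from the rows that are only binned.

```python
import pandas as pd

toy = pd.DataFrame({
    "admit_time": pd.to_datetime(
        ["2017-05-01", "2018-11-30", "2019-02-14", "2020-03-01"], utc=True),
    "values": [1.0, 2.0, 3.0, 4.0],
})

cutoff_year = 2019  # hypothetical cutoff: fit bins on admissions before this year

train_mask = toy["admit_time"].dt.year < cutoff_year
toy_train = toy[train_mask].copy()     # rows used to compute the distributions
toy_holdout = toy[~train_mask].copy()  # rows that are only assigned to bins

# Every row ends up in exactly one of the two subsets
print(len(toy_train), len(toy_holdout))
```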

In [12]:
# for train/val VS. train_val/test:
# df_train =  df[(df['admit_time'].dt.year < 2019)] # old 2018
# df_train =  df[(df['admit_time'].dt.year < 2021)] # old 2019

In [13]:
# FOR TRAINING/VALIDATION purposes - choosing hyperparameters,
# compute the decile distribution only on the train set
# (.copy() avoids SettingWithCopyWarning when train_featurizer adds columns)
df_train = df[(df['admit_time'].dt.year < 2019)].copy()

In [14]:
# continue here
look_up_table = train_featurizer(df_train)
df_featurized = apply_featurizer(df, look_up_table)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [15]:
# sanity check of the total cohort size with time splitting
# Change the years
print(len(df))
print(len(df[(df['admit_time'].dt.year < 2019)]))
print(len(df[(df['admit_time'].dt.year == 2019)]))
print(len(df[(df['admit_time'].dt.year > 2019)]))
len(df[(df['admit_time'].dt.year < 2019)]) + len(df[(df['admit_time'].dt.year == 2019)]) + len(df[(df['admit_time'].dt.year > 2019)])

2297316
1678328
500057
118931


2297316

### Quick Sanity Check
For features that have 10 bins from 0 to 9 - `bins` should be same as `bins_applied`

In [16]:
df_train = apply_featurizer(df_train, look_up_table)
look_up_table_df = df_train.groupby(['features', 'bins']).agg({'values' : ['min', 'max']}).reset_index()

# Collect features that have exactly the 10 bins 0-9 (no extra bin 10).
# Iterate over feature names, not over the columns of the look-up table.
features_with_0_9_bins = []
for feature in look_up_table_df['features'].unique():
    feature_bins = look_up_table_df[look_up_table_df['features'] == feature]['bins'].values
    if len(feature_bins) == 10 and 10 not in feature_bins:
        features_with_0_9_bins.append(feature)

# For those features, the trained bin and the applied bin must agree
for feature in features_with_0_9_bins:
    df_test = df_train[df_train['features'] == feature]  # compare against the variable, not the string 'feature'
    for b_real, b_computed in zip(df_test['bins'].values, df_test['bins_applied'].values):
        assert b_real == b_computed

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [17]:
# checked against R 2.9_
len(df_train) # 1678328 --old coh4: 1254522 vs 1714603 

1678328

### Little bit of house cleaning
Create new feature names that reflect which bin the value belongs in

In [18]:
columns = ['anon_id', 'pat_enc_csn_id_coded', 'admit_time', 'feature_type', 'features', 'values', 'bins_applied']
df_new = df_featurized[columns].copy()  # copy so the renaming below does not modify a slice

In [19]:
df_new['features'] = ['_'.join([x, str(y)]) for x, y in zip(df_new['features'].values, df_new['bins_applied'].values)] 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [20]:
len(df_new) # coh4: 2297316 (2337386 before removing observations > 24 hours before admission)

2297316

### Get Counts representation
Group by patient, CSN, and feature name (with the bin value appended to the feature name), and set the value to the number of times that particular feature appears for that CSN id.
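
A minimal sketch of this counts representation on a toy frame is shown below. The column names match the notebook; the ids and values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "anon_id": ["A", "A", "A", "B"],
    "pat_enc_csn_id_coded": [1, 1, 1, 2],
    "features": ["Temp_3", "Temp_3", "SBP_4", "Temp_2"],  # feature name + bin
    "values": [36.9, 37.0, 124.0, 36.5],
})

# One row per (patient, encounter, binned feature); 'values' becomes the occurrence count
counts = (toy.groupby(["anon_id", "pat_enc_csn_id_coded", "features"])
             .agg({"values": "count"})
             .reset_index())

# The counts must add up to the number of original rows
assert counts["values"].sum() == len(toy)
print(counts)
```

Here encounter 1 has two `Temp_3` readings, so that row gets a count of 2 while the others get 1.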

In [21]:
df_final = df_new.groupby(['anon_id', 'pat_enc_csn_id_coded', 'features']).agg(
    {'admit_time' : 'first',
     'feature_type' : 'first',
     'values' : 'count'}).reset_index()

columns = ['anon_id', 'pat_enc_csn_id_coded', 'admit_time', 'feature_type', 'features', 'values']
df_final = df_final[columns] # reorder columns
 
df_final['feature_type'] = [x + '_results' if x == 'labs' else x for x in df_final['feature_type'].values]

#### Depending on which dataset used for train/val or train_val/test, name: _train vs _test
- Rename feature_type to reflect the training set used: `..._train` or `..._test`
- 'vitals_train' means everything up to 2017 (**NEW- 2018**) used. (train vs val)
- 'vitals_test' means everything up to 2018 (**NEW - 02/2020**) used. (train + val vs test)

In [22]:
# for train/val to select hyperparameters, binned distributions on train/val: 
# df_train =  df[(df['admit_time'].dt.year < 2018)]
df_final['feature_type'] = [x + '_train' for x in df_final['feature_type'].values]

In [23]:
print(len(df_final)) # old - updated coh4 1825210(train) vs 1826919 (train_val)
print(len(df)) # updated coh4 2297316 same

1826919
2297316


In [24]:
# Sanity check - the sum of the counts should equal the length of the original dataframe
assert df_final['values'].sum() == len(df)

In [25]:
df_final.head(10)

Unnamed: 0,anon_id,pat_enc_csn_id_coded,admit_time,feature_type,features,values
0,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,labs_results_train,ALB_3,1
1,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,labs_results_train,ALK_7,1
2,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,labs_results_train,ALT_0,1
3,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,labs_results_train,AST_1,1
4,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,labs_results_train,AnionGap_9,1
5,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,labs_results_train,BUN_8,1
6,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,labs_results_train,Base_0,2
7,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,labs_results_train,Basos_3,1
8,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,labs_results_train,CO2_0,1
9,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,labs_results_train,Ca_2,1


### SAVE, depending on which cohort and which dataset

#### cohort4, train/val, train_val/test

In [26]:
# for training to select hyperparameters: df_train =  df[(df['admit_time'].dt.year < 2019)] - new
df_final.to_csv(os.path.join(valdir, '6_10_coh4_binned_labs_vitals_train.csv'), index=False)

### ReDO with TRAIN_VAL/TEST

- First, format the new data (04/2020 - 2021)
- Remove the `bins_applied` column from the old data
- Concatenate the two datasets into one for re-binning

In [27]:
# new data from 04/2020 - 2021
# USE cohort4, read from local storage
df21 = pd.read_csv(os.path.join(valdir, "6_9_coh4_feature_values.csv"))

# need this if read from local file, if pulled from BQ, already in datetime format
df21["admit_time"] = pd.to_datetime(df21["admit_time"]) 

print(len(df21)) # 1175680

# check if any feature is null (check the new data, df21, not the old df)
df21[df21['features'].isnull()] # NONE

# extract vitals and labs
df21 = df21[df21['feature_type'].isin(['labs', 'vitals'])]
df21["time"] = pd.to_datetime(df21["time"])

print(len(df21)) # 895452
# calculate time difference in hours between admit time and time of vitals and labs,
df21['difftime'] = round((df21['admit_time'] - df21['time']).dt.total_seconds()/3600, 1)

# check for observations more than 24 hours before admission time
print(len(df21.loc[df21['difftime'] > 24])) # 6752

# remove those observations:
df21 = df21.loc[df21['difftime'] <= 24]
print(len(df21)) # 888700

df21[['difftime']].describe()



1175680
895452
6752
888700


Unnamed: 0,difftime
count,888700.0
mean,2.863723
std,2.468526
min,0.0
25%,1.2
50%,2.3
75%,3.8
max,24.0


In [28]:
# old data up to 03/2020
print(len(df)) #2297316
df.head(3)

2297316


Unnamed: 0,anon_id,pat_enc_csn_id_coded,admit_time,first_label,death_24hr_recent_label,death_24hr_max_label,feature_type,features,values,time,difftime,bins_applied
747660,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,0,0,0,vitals,Temp,36.9,2019-08-31 10:14:00+00:00,2.6,3
747661,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,0,0,0,vitals,DBP,62.0,2019-08-31 12:00:00+00:00,0.9,2
747662,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,0,0,0,vitals,SBP,124.0,2019-08-31 12:00:00+00:00,0.9,4


In [29]:
df0 = df.drop(columns=["bins_applied"])
df0.head(1)

Unnamed: 0,anon_id,pat_enc_csn_id_coded,admit_time,first_label,death_24hr_recent_label,death_24hr_max_label,feature_type,features,values,time,difftime
747660,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,0,0,0,vitals,Temp,36.9,2019-08-31 10:14:00+00:00,2.6


In [30]:
# new data from 04/2020 - 2021
print(len(df21)) # 888700
df21.head(1)

888700


Unnamed: 0,anon_id,pat_enc_csn_id_coded,admit_time,first_label,death_24hr_recent_label,death_24hr_max_label,feature_type,features,values,time,difftime
280228,JC1000116,131295313275,2020-09-29 22:45:00+00:00,0.0,0,0,vitals,Temp,36.6,2020-09-29 15:55:00+00:00,6.8


In [31]:
### Concat 2 data frames into 1

df = pd.concat([df0,df21])
print(len(df0)) # 2297316
print(len(df21)) # 888700
print(len(df)) # 3186016

df.head(1)

2297316
888700
3186016


Unnamed: 0,anon_id,pat_enc_csn_id_coded,admit_time,first_label,death_24hr_recent_label,death_24hr_max_label,feature_type,features,values,time,difftime
747660,JC29f8ad2,131274729058,2019-08-31 12:52:00+00:00,0.0,0,0,vitals,Temp,36.9,2019-08-31 10:14:00+00:00,2.6


In [32]:
features0 = df0.features.unique()
features0.sort()
print(len(features0))
features0

49


array(['ALB', 'ALK', 'ALT', 'AST', 'AnionGap', 'BUN', 'Base', 'Basos',
       'CO2', 'Ca', 'Cl', 'Cr', 'DBP', 'Eos', 'Glob', 'Glucose', 'HCO3_a',
       'HCO3_v', 'Hct', 'Hgb', 'INR', 'K', 'Lactate', 'Lymp', 'MCH',
       'Mono', 'Na', 'Neut', 'O2sat_a', 'O2sat_v', 'PO2_a', 'PO2_v', 'PT',
       'Platelet', 'Pulse', 'RDW', 'RR', 'SBP', 'TBili', 'TCO2_a',
       'TProtein', 'Temp', 'Trop', 'WBC', 'eGFR', 'pCO2_a', 'pCO2_v',
       'pH_a', 'pH_v'], dtype=object)

In [33]:
features21 = df21.features.unique()
features21.sort()
print(len(features21))
features21

49


array(['ALB', 'ALK', 'ALT', 'AST', 'AnionGap', 'BUN', 'Base', 'Basos',
       'CO2', 'Ca', 'Cl', 'Cr', 'DBP', 'Eos', 'Glob', 'Glucose', 'HCO3_a',
       'HCO3_v', 'Hct', 'Hgb', 'INR', 'K', 'Lactate', 'Lymp', 'MCH',
       'Mono', 'Na', 'Neut', 'O2sat_a', 'O2sat_v', 'PO2_a', 'PO2_v', 'PT',
       'Platelet', 'Pulse', 'RDW', 'RR', 'SBP', 'TBili', 'TCO2_a',
       'TProtein', 'Temp', 'Trop', 'WBC', 'eGFR', 'pCO2_a', 'pCO2_v',
       'pH_a', 'pH_v'], dtype=object)

In [34]:
### OR 2nd time on TRAIN_VAL/TEST
# AFTER training and choosing hyperparameters, RE-DO the DECILE distribution on the full train_val set (NEW: 2015 - 03/2020)
# df_train =  df[(df['admit_time'].dt.year < XXX)] # all old data up to 03/2020
df_train = df0 # all old data up to 03/2020
print(len(df_train)) # 2297316

# continue here
look_up_table = train_featurizer(df_train)
df_featurized = apply_featurizer(df, look_up_table)  # NOTE, df here is the combined df0 and df21

# sanity check
df_train = apply_featurizer(df_train, look_up_table)
look_up_table_df = df_train.groupby(['features', 'bins']).agg({'values' : ['min', 'max']}).reset_index()

# Collect features that have exactly the 10 bins 0-9 (no extra bin 10).
# Iterate over feature names, not over the columns of the look-up table.
features_with_0_9_bins = []
for feature in look_up_table_df['features'].unique():
    feature_bins = look_up_table_df[look_up_table_df['features'] == feature]['bins'].values
    if len(feature_bins) == 10 and 10 not in feature_bins:
        features_with_0_9_bins.append(feature)

for feature in features_with_0_9_bins:
    df_test = df_train[df_train['features'] == feature]  # compare against the variable, not the string 'feature'
    for b_real, b_computed in zip(df_test['bins'].values, df_test['bins_applied'].values):
        assert b_real == b_computed

# cleaning - 
# create new feature names that reflect which bin the value belongs in
columns = ['anon_id', 'pat_enc_csn_id_coded', 'admit_time', 'feature_type', 'features', 'values', 'bins_applied']
df_new = df_featurized[columns].copy()  # copy so the renaming below does not modify a slice

df_new['features'] = ['_'.join([x, str(y)]) for x, y in zip(df_new['features'].values, df_new['bins_applied'].values)] 

print(len(df_new)) #3186016

# get counts representation
df_final = df_new.groupby(['anon_id', 'pat_enc_csn_id_coded', 'features']).agg(
    {'admit_time' : 'first',
     'feature_type' : 'first',
     'values' : 'count'}).reset_index()

columns = ['anon_id', 'pat_enc_csn_id_coded', 'admit_time', 'feature_type', 'features', 'values']
df_final = df_final[columns] # reorder columns

df_final['feature_type'] = [x + '_results' if x == 'labs' else x for x in df_final['feature_type'].values]

# OR 2nd time on train-val/test: rename feature type to reflect train or test
df_final['feature_type'] = [x + '_test' for x in df_final['feature_type'].values]

print(len(df_final)) # 2534674
print(len(df)) # 3186016

# sanity check
assert df_final['values'].sum() == len(df)

### OR 2nd time on train-val/test
df_final.to_csv(os.path.join(valdir, '6_10_coh4_binned_labs_vitals_test.csv'), index=False)

2297316


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


3186016
2534674
3186016


### View

In [36]:
for feature in look_up_table_df['features'].unique():
    print(look_up_table_df[look_up_table_df['features'] == feature])

   features bins values     
                    min  max
0       ALB    0    0.9  2.4
1       ALB    1    2.5  2.8
2       ALB    2    2.9  3.1
3       ALB    3    3.2  3.3
4       ALB    4    3.4  3.5
5       ALB    5    3.6  3.7
6       ALB    6    3.8  3.8
7       ALB    7    3.9  4.0
8       ALB    8    4.1  4.3
9       ALB    9    4.4  6.4
10      ALB   10    6.7  6.7
   features bins values        
                    min     max
11      ALK    0   14.0    56.0
12      ALK    1   57.0    65.0
13      ALK    2   66.0    73.0
14      ALK    3   74.0    82.0
15      ALK    4   83.0    91.0
16      ALK    5   92.0   102.0
17      ALK    6  103.0   117.0
18      ALK    7  118.0   144.0
19      ALK    8  145.0   211.0
20      ALK    9  212.0  3500.0
   features bins  values        
                     min     max
21      ALT    0     5.0    13.0
22      ALT    1    14.0    17.0
23      ALT    2    18.0    20.0
24      ALT    3    21.0    23.0
25      ALT    4    24.0    27.0
26      