In [3]:
import altair as alt
import pandas as pd


In [4]:
data = pd.read_csv("diabetic_data.csv")
data.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),?,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),?,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),?,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),?,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO


This dataset is very large and has more rows than Altair allows by default (more than 5000). Trying to plot the data as is raises the 'max rows' error.
We can get around this by using Pandas to filter and aggregate the data.
Let's look only at rows with an admission_source_id of 7, which means the patient was admitted via the hospital's ER.
I also only want to look at patients who were then admitted as inpatients and whose time_in_hospital exceeded one week, so we are looking for a discharge_disposition_id of 1 and time_in_hospital values larger than 7.

In [5]:
filt = (data['admission_source_id'] == 7) & (data['discharge_disposition_id'] == 1) & (data['time_in_hospital'] > 7)
data = data.loc[filt]
data.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
10,28236,89869032,AfricanAmerican,Female,[40-50),?,1,1,7,9,...,No,Steady,No,No,No,No,No,No,Yes,>30
71,881016,55152216,Caucasian,Male,[50-60),?,1,1,7,12,...,No,Up,No,No,No,No,No,Ch,Yes,>30
106,1445010,23807808,Other,Female,[50-60),?,1,1,7,9,...,No,Up,No,No,No,No,No,Ch,Yes,NO
135,2292606,53848278,AfricanAmerican,Female,[70-80),?,6,1,7,13,...,No,No,No,No,No,No,No,No,Yes,>30
142,2309268,77475465,Caucasian,Female,[80-90),?,6,1,7,9,...,No,Steady,No,No,No,No,No,No,Yes,>30
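
Before plotting, it can help to confirm that the filtered frame actually fits under the row limit. The sketch below applies the same three conditions to a small made-up frame; the rows are illustrative and not taken from diabetic_data.csv, and `MAX_ROWS` is a hypothetical constant that simply mirrors the 5000-row figure mentioned above.

```python
import pandas as pd

MAX_ROWS = 5000  # assumption: the default row limit mentioned in the text

# Illustrative rows only; not taken from diabetic_data.csv
toy = pd.DataFrame({
    "admission_source_id":      [7, 7, 1, 7, 7],
    "discharge_disposition_id": [1, 1, 1, 25, 1],
    "time_in_hospital":         [9, 3, 12, 10, 8],
})

# Same three conditions as the notebook cell, combined with &
mask = (
    (toy["admission_source_id"] == 7)
    & (toy["discharge_disposition_id"] == 1)
    & (toy["time_in_hospital"] > 7)
)
subset = toy.loc[mask]

# True when the filtered frame is small enough to plot directly
fits = len(subset) <= MAX_ROWS
print(len(subset), fits)
```

On the real data the same check would be `len(data) <= MAX_ROWS` after the filter cell has run.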


As the values in the 'weight' column are only question marks, I'd like to delete that column. The question marks are strings rather than missing values, so dropna() would leave them in place; we drop the column by name instead.

In [6]:
data = data.drop(columns=['weight'])
data.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
10,28236,89869032,AfricanAmerican,Female,[40-50),1,1,7,9,...,No,Steady,No,No,No,No,No,No,Yes,>30
71,881016,55152216,Caucasian,Male,[50-60),1,1,7,12,...,No,Up,No,No,No,No,No,Ch,Yes,>30
106,1445010,23807808,Other,Female,[50-60),1,1,7,9,...,No,Up,No,No,No,No,No,Ch,Yes,NO
135,2292606,53848278,AfricanAmerican,Female,[70-80),6,1,7,13,...,No,No,No,No,No,No,No,No,Yes,>30
142,2309268,77475465,Caucasian,Female,[80-90),6,1,7,9,...,No,Steady,No,No,No,No,No,No,Yes,>30
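
The '?' entries in this dataset are plain strings rather than NaN, so Pandas' missing-value tools do not see them. A minimal sketch of how one might measure the share of placeholders per column and drop the columns that contain nothing else; the frame below is made up for illustration, and the names `placeholder_share` and `all_placeholder` are hypothetical.

```python
import pandas as pd

# Illustrative rows only; not taken from diabetic_data.csv
toy = pd.DataFrame({
    "race":   ["Caucasian", "?", "Other", "Caucasian"],
    "weight": ["?", "?", "?", "?"],
    "gender": ["Female", "Male", "Female", "Male"],
})

# '?' is a string, so isna() finds no missing values at all
n_missing = int(toy.isna().sum().sum())

# Fraction of '?' placeholders in each column
placeholder_share = (toy == "?").mean()

# Columns that consist entirely of placeholders
all_placeholder = placeholder_share[placeholder_share == 1.0].index.tolist()

# Drop those columns by name
cleaned = toy.drop(columns=all_placeholder)
print(placeholder_share)
print(cleaned.columns.tolist())
```

A threshold below 1.0 could be used instead if a column is mostly, but not entirely, placeholders.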


Now that we have cleaned the data, we can plot it using Altair. First, we determine the type of each variable we want to plot.
I'd like to visualise the correlation between time in hospital (quantitative data) and number of lab procedures (also quantitative), alongside gender distribution (nominal data, shown by colour).

In [57]:
# We only consider hospital stays longer than a week, so the x-axis domain starts at 8 rather than 0.
alt.Chart(data.reset_index()).mark_point(clip=True).encode(
    x=alt.X('time_in_hospital:Q', scale=alt.Scale(domain=(8, 14))),
    y='num_lab_procedures:Q',
    color='gender:N',
).interactive()
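
Aggregating, mentioned earlier as the other way to reduce the row count, is not used in this notebook. A rough sketch of what it could look like for these three columns, assuming we wanted the mean number of lab procedures per length of stay and gender; the rows are made up for illustration and `summary` is a hypothetical name.

```python
import pandas as pd

# Illustrative rows only; not taken from diabetic_data.csv
toy = pd.DataFrame({
    "time_in_hospital":   [8, 8, 8, 9, 9],
    "gender":             ["Female", "Female", "Male", "Male", "Male"],
    "num_lab_procedures": [40, 60, 50, 30, 50],
})

# One row per (time_in_hospital, gender) pair, holding the group mean
summary = (
    toy.groupby(["time_in_hospital", "gender"], as_index=False)["num_lab_procedures"]
       .mean()
)
print(summary)
```

The aggregated frame has at most one row per combination of stay length and gender, so it stays far below the row limit regardless of how large the original data is.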


dashboard = two interactive graphs