In [140]:
import altair as alt
import pandas as pd


In [141]:
data = pd.read_csv("diabetic_data.csv")
data.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),?,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),?,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),?,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),?,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO


This dataset is very large and has more rows than Altair allows by default (i.e. more than 5000). Trying to plot the data as is throws the 'max rows' error (MaxRowsError).
We can get around this by using Pandas to prepare and filter the data.
Let's only look at rows with an admission_source_id of 7. This means that the patients were admitted via the hospital's ER.
I also only want to look at patients who were then admitted as inpatients and whose time_in_hospital exceeds one week, so we are looking for a discharge_disposition_id of 1 and time_in_hospital values larger than 7.
I am also only interested in patients who are receiving insulin.

In [142]:
filt = (data['admission_source_id'] == 7) & (data['discharge_disposition_id'] == 1) & (data['time_in_hospital'] > 7) & (data['insulin'] != 'No')
data = data.loc[filt]
data.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
10,28236,89869032,AfricanAmerican,Female,[40-50),?,1,1,7,9,...,No,Steady,No,No,No,No,No,No,Yes,>30
71,881016,55152216,Caucasian,Male,[50-60),?,1,1,7,12,...,No,Up,No,No,No,No,No,Ch,Yes,>30
106,1445010,23807808,Other,Female,[50-60),?,1,1,7,9,...,No,Up,No,No,No,No,No,Ch,Yes,NO
142,2309268,77475465,Caucasian,Female,[80-90),?,6,1,7,9,...,No,Steady,No,No,No,No,No,No,Yes,>30
382,3616710,73884501,Caucasian,Male,[40-50),?,1,1,7,12,...,No,Steady,No,No,No,No,No,No,Yes,NO
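As a quick sanity check, the same kind of boolean mask can be applied to a small synthetic frame and the row count compared with Altair's default limit. This is only a sketch: the `toy` frame and its values are made up for illustration, and `ALTAIR_MAX_ROWS` is a name I am introducing here for the 5000-row default.

```python
import pandas as pd

ALTAIR_MAX_ROWS = 5000  # assumed name for Altair's default row limit

# Hypothetical stand-in for the real dataset: 6000 synthetic rows.
toy = pd.DataFrame({
    "admission_source_id": [7, 1] * 3000,
    "discharge_disposition_id": [1] * 6000,
    "time_in_hospital": [9, 3, 12] * 2000,
    "insulin": ["Up", "No", "Steady", "Down"] * 1500,
})

# Same four conditions as the filter above, combined with & into one mask.
mask = (
    (toy["admission_source_id"] == 7)
    & (toy["discharge_disposition_id"] == 1)
    & (toy["time_in_hospital"] > 7)
    & (toy["insulin"] != "No")
)
filtered = toy.loc[mask]

rows_before = len(toy)
rows_after = len(filtered)
fits_limit = rows_after <= ALTAIR_MAX_ROWS
print(rows_before, rows_after, fits_limit)
```

On the real data, comparing `len(data)` before and after the filter in the same way shows whether the filtered frame is small enough to plot.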


As the values shown in the 'weight' column are only question marks, I'd like to delete that column, along with any other columns that contain no values.

In [143]:
data = data.dropna()
data.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
10,28236,89869032,AfricanAmerican,Female,[40-50),?,1,1,7,9,...,No,Steady,No,No,No,No,No,No,Yes,>30
71,881016,55152216,Caucasian,Male,[50-60),?,1,1,7,12,...,No,Up,No,No,No,No,No,Ch,Yes,>30
106,1445010,23807808,Other,Female,[50-60),?,1,1,7,9,...,No,Up,No,No,No,No,No,Ch,Yes,NO
142,2309268,77475465,Caucasian,Female,[80-90),?,6,1,7,9,...,No,Steady,No,No,No,No,No,No,Yes,>30
382,3616710,73884501,Caucasian,Male,[40-50),?,1,1,7,12,...,No,Steady,No,No,No,No,No,No,Yes,NO
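Note that the 'weight' column is still present in the output above: `dropna()` only removes rows containing NaN, and this file marks missing values with the string '?', which Pandas does not treat as missing. A minimal sketch of one way to drop columns that consist only of '?' follows. The `toy` frame is a hypothetical stand-in, and whether 'weight' is entirely '?' in the real filtered data is an assumption that would need checking.

```python
import pandas as pd

# Hypothetical stand-in for the filtered data; values are illustrative only.
toy = pd.DataFrame({
    "age": ["[40-50)", "[50-60)", "[80-90)"],
    "weight": ["?", "?", "?"],           # placeholder in every row
    "race": ["Caucasian", "?", "Other"],  # placeholder in one row only
    "time_in_hospital": [9, 12, 9],
})

is_missing = toy.eq("?")                 # True wherever the placeholder appears
empty_columns = is_missing.all(axis=0)   # columns made up only of "?"
cleaned = toy.loc[:, ~empty_columns]     # keep every other column

print(list(cleaned.columns))
```

Columns that are only partly '?' (such as 'race' in the toy frame) are kept, so this removes only the columns that carry no information at all.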


Now that we have cleaned the data, we can plot it using Altair. First, we determine the type of data we want to model.
I'd like to visualise the correlation between time in hospital (quantitative data) and number of lab procedures (also quantitative), alongside age distribution (ordinal data, shown by colour).

In [144]:
brush = alt.selection_interval()

points = alt.Chart(data.reset_index()).mark_point(clip=True).encode(
    #we are only considering hospital stays over a week, hence we need to adjust the x-axis so its domain starts at 8 rather than 0:
    alt.X('time_in_hospital:Q',
        scale=alt.Scale(domain=(8, 14))
    ),
    y = 'num_lab_procedures:Q',
    color = alt.condition(brush, 'age:O', alt.value('lightgray'))
    ).add_selection(
        brush
    ).interactive()

#we create a bar chart to show the distribution of age groups
bars = alt.Chart(data).mark_bar().encode(
    y='age:O',
    color='age:O',
    x='count(age):Q'
).transform_filter(
    brush
)
points & bars



Notes on my design choices: 
I was interested in finding out the relationship between the number of lab procedures per patient and time spent in hospital and how this might correlate with the patient's age. 
I was working with quantitative data on both axes, and every data point falls on a tick value of the x-axis, so I initially decided to use a bar chart. However, I was keen to use a bar chart for my linked graph showing the number of patients per age group, so to avoid having two similar-looking graphs, I decided to use points as marks for my first chart. I believe this works well as it is clear and legible and integrates well with the colour encoding showing the age groups.

Insights gained:
Patients in the age groups from 50 to 80 are the most represented overall.
The number of lab procedures tops out at around 100, with only very few data points exceeding that value. Similarly, there are very few data points near the lowest number of lab procedures (approaching zero). The mean number of lab procedures is around 50, with patients between the ages of 60 and 70 being the most represented group here as well.
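Readings like these are taken from the chart by eye, so it can help to confirm them numerically. The sketch below shows the Pandas calls I would use. The `toy` frame holds made-up numbers purely for illustration, not values from the real dataset.

```python
import pandas as pd

# Hypothetical values, for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "age": ["[60-70)", "[60-70)", "[50-60)", "[70-80)", "[60-70)", "[40-50)"],
    "num_lab_procedures": [48, 55, 62, 37, 51, 47],
})

mean_procedures = toy["num_lab_procedures"].mean()  # average per encounter
max_procedures = toy["num_lab_procedures"].max()    # upper end of the range
age_counts = toy["age"].value_counts()              # rows per age group
most_common_age = age_counts.idxmax()               # largest age group

print(mean_procedures, max_procedures, most_common_age)
```

Running the same four lines on `data` instead of `toy` gives the figures for the actual cohort.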


In the next graph, I want to illustrate the correlation between number of lab procedures (quantitative) and change in insulin doses (ordinal), alongside gender and ethnicity (both nominal).

In [145]:
brush2 = alt.selection_interval()

ticks = alt.Chart(data).mark_tick().encode(
    x = 'insulin:O',
    y = 'num_lab_procedures:Q',
    color = alt.condition(brush2, 'gender:N', alt.value('lightgray'))
    ).add_selection(
        brush2
    ).properties(
    width=300,
    height=200
    ).interactive()

#we create a bar chart to show the number of patients per race, coloured by gender
bars2 = alt.Chart(data).mark_bar().encode(
    y='race:N',
    color='gender:N',
    x='count(race):Q'
).transform_filter(
    brush2
)
ticks & bars2


Notes on my design choices: 
I was interested in finding out the relationship between the number of lab procedures per patient and any change in their insulin regimen, e.g. does a larger number of blood and urine tests correlate with a decrease in insulin? I was also interested in any gender- or ethnicity-related differences. 
I was working with quantitative data on the y-axis and ordinal data on the x-axis and used tick marks to illustrate the individual data points, which really come into their own when zooming into the graph and selecting an area for the linked graph.

Insights gained:
With regard to gender distribution across the individual ethnicities (of which Caucasian and African-American were the most represented groups), we find that the genders were roughly equally represented. However, there were more female than male African-American patients in the cohort, with a ratio of roughly 60:40.
The highest numbers of lab procedures across all insulin groups were recorded for male patients. Women of all ethnicities received fewer lab procedures.
It is notable that the percentage of African-Americans increases with the number of lab procedures. 
Overall, there does not seem to be a clear correlation between change in insulin regimen and number of lab procedures.
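The gender ratios and the comparison across insulin groups can also be checked in Pandas rather than read off the linked charts. This is a rough sketch using `pd.crosstab` and `groupby`. The `toy` frame is hypothetical, and its numbers were chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical values, for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "race": ["Caucasian", "Caucasian", "AfricanAmerican", "AfricanAmerican",
             "AfricanAmerican", "Caucasian", "AfricanAmerican", "AfricanAmerican"],
    "gender": ["Female", "Male", "Female", "Female",
               "Male", "Male", "Female", "Male"],
    "insulin": ["Up", "Down", "Steady", "Up",
                "Down", "Steady", "Up", "Steady"],
    "num_lab_procedures": [40, 50, 60, 44, 54, 58, 48, 62],
})

# Share of each gender within each race (each row sums to 1).
gender_share = pd.crosstab(toy["race"], toy["gender"], normalize="index")

# Median number of lab procedures per insulin group.
procedures_by_insulin = toy.groupby("insulin")["num_lab_procedures"].median()

print(gender_share)
print(procedures_by_insulin)
```

If the medians per insulin group come out close to one another on the real data, that would support the reading that there is no clear relationship between insulin change and number of lab procedures.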