<a href="https://www.kaggle.com/drjohnwagner/parallel-coordinates-in-heart-disease-prediction?scriptVersionId=84515027" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## <center> Parallel Coordinates in Heart Disease Prediction

I am new to Kaggle--though not to data science--and as a result I've been perusing competitions, datasets and a fair number of notebooks for a few weeks now and thought it was about time to jump in and get my feet wet with a notebook of my own.

One thing I've noticed is that not that many people seem to use parallel coordinate plots in their data exploration and visualisations. Since I've been a big fan of parallel coordinates ever since I met Alfred Inselberg back in the late 1980s, around the time he began popularising them, I thought I might use that as a backdrop for my first notebook. This is where that led.

Also, I have a passion for solving puzzles, especially in healthcare and life sciences...

>> **Daniel:** *You want to talk hypocrisy; What about you? You act like you don't care about anyone, but here you are, saving lives.*  
>> **House:** *Solving Puzzles. Saving lives is just collateral damages.*
>
> *Unfaithful*, **House**, Season 5, Episode 15

So I thought I'd incorporate a bit of a medical show inspired theme. Just for fun. I hope you enjoy.

NB: I had hoped to include some medical show memes and so forth, but quickly ran out of space and had to downsize on the images to preserve room for the code and visualisations. But I hope you have fun with what remains!

In [1]:
import numpy as np
import pandas as pd
import plotly.io as pio
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.utils import shuffle
from sklearn.preprocessing import LabelEncoder
from IPython.display import display, HTML
import ipywidgets

# Seed for random number generation...
# Set to None to seed the random number
# generator with a random seed...
SEED = 12903614

# Colorblindness friendly colours...
# It is important to make our work
# as accessible as possible...
COLORMAP = ["#005AB5", "#DC3220"]

# Template settings for plotly...
layout_axis = dict(
    mirror=True,
    ticks="outside",
    showline=True,
    title_standoff = 5,
    showgrid = True,
)
pio.templates["DrJohnWagner"] = go.layout.Template(
    layout_xaxis = layout_axis,
    layout_yaxis = layout_axis,
    layout_title_font_size = 18,
    layout_font_size = 16,
)
pio.templates.default = "simple_white+DrJohnWagner"

## Bring in the patients!

>> **Foreman:** *OK, we can all stare at each other, or we can investigate what caused the heart failure. Just the heart failure.*
>
> *Safe*, **House**, Season 2, Episode 16

OK, let's stop staring and dig right in! Time to open the patient files!

In [2]:
# Loading the data from the csv file...
df = pd.read_csv('/kaggle/input/heart-failure-prediction/heart.csv')
df.head()

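|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |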


>> The patient says, "Doctor, it hurts when I do this."  
>> The doctor says, "Then don't do that!"
>
> Henny Youngman

Do you notice anything different about the columns `ST_Slope` and `Oldpeak`? Every other column is named in PascalCase (`ChestPainType`, `RestingBP` and so on), but these two are not. This is an egregious naming faux pas: it is simply confusing, and it will trip up you and anyone else who uses your data or code over and over again.

<div class="alert alert-block alert-warning">
It hurts when people do that. So don't do that!
</div>

Let's fix that immediately so that it doesn't propagate. We'll change them to column names consistent with the others: `STSlope` and `OldPeak`.

In [3]:
# Fix the egregious column naming error...
df = df.rename(columns = {"ST_Slope": "STSlope", "Oldpeak": "OldPeak"})

# Always test these things...
assert "STSlope" in df.columns, "Ruh roh! ST_Slope is still terribly mistaken!"
assert "OldPeak" in df.columns, "Ruh roh! Oldpeak is still terribly mistaken!"

df.head()

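|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | OldPeak | STSlope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |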


## Time to examine the patients!
Is exam room one open? Let's wheel these patients in there and have a look.

With the column names fixed, we now have 12 consistently named data columns, including 5 categorical columns (`Sex`, `ChestPainType`, `RestingECG`, `ExerciseAngina`, and `STSlope`) and 7 numerical columns (`Age`, `RestingBP`, `Cholesterol`, `FastingBS`, `MaxHR`, `OldPeak`, and `HeartDisease`).

Note, however, that `FastingBS` and `HeartDisease` are numerical columns that take only two discrete values each. As a result, we may treat them as either categorical or numerical.

Now I know what you're thinking. Fasting blood sugar is a glucose test that measures the levels of sugar (typically in mg/dl) in your blood after fasting overnight. Why isn't it a float? Good question. I was going to cover what some of these measures are and why they're important, but [Duncan McKinnon](https://www.kaggle.com/duncankmckinnon) has already done that: [Explanation of ECG Attributes in Data](https://www.kaggle.com/ronitf/heart-disease-uci/discussion/151898).

Moving on...let's extract the categorical and numerical columns, as they will come in handy later. We will also define some labels for plots. Note that some of the columns will be displayed in abbreviated form, e.g. "Chest Pain" instead of "Chest Pain Type".

In [4]:
# Break the columns into two groupings...
categorical_columns = [column for column in df.columns if df[column].dtypes =='O']
numerical_columns = [column for column in df.columns if df[column].dtypes != 'O']

# Now that we've got the column names fixed
# let's define some labels for later...
LABELS = {
    "Sex": "Sex",
    "Age": "Age",
    "MaxHR": "Max HR",
    "OldPeak": "Old Peak",
    "STSlope": "ST Slope",
    "RestingBP": "Rest. BP",
    "FastingBS": "Fast. BS",
    "RestingECG": "Rest. ECG",
    "Cholesterol": "Cholesterol",
    "HeartDisease": "Heart Disease",
    "ChestPainType": "Chest Pain",
    "ExerciseAngina": "Ex. Angina",
    # For playing with classes...
    "AgeClass": "Age Class",
    "CholesterolClass": "Cholesterol Class",
    "RestingBPClass": "Resting BP Class",
}
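
As an aside, pandas can also split columns by type with `DataFrame.select_dtypes`, which avoids comparing against the `'O'` dtype code by hand. This is only a sketch: `toy`, `toy_numerical` and `toy_categorical` are made-up names, and the tiny frame just reuses a few values from the first two patient records.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real dataset...
toy = pd.DataFrame({
    "Age": [40, 49],
    "Sex": ["M", "F"],
    "OldPeak": [0.0, 1.0],
    "STSlope": ["Up", "Flat"],
})

# Numeric dtypes (int, float) go in one list,
# everything else goes in the other...
toy_numerical = list(toy.select_dtypes(include = "number").columns)
toy_categorical = list(toy.select_dtypes(exclude = "number").columns)

print(toy_numerical)
print(toy_categorical)
```

Either approach should give the same two groupings on this dataset; which one you prefer is mostly a matter of taste.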

## In a world full of Richard Webbers be Cristina Yang
Let's get a quick overview of all of the patients on the ward.

>> **Richard:** *What do we need to know in order to do our jobs and not simply be mechanics?*  
>> **Owen:** *How are we...we're not being mechanics!*  
>> **Richard:** *You're treating this patient like a sack of organs on a table.*  
>
> *The Room Where It Happens*, **Grey's Anatomy**, Season 13, Episode 8

Now if this were Grey's Anatomy, some of you might think of yourselves as Richard Webber: brilliant, talented and respected but accustomed to playing by the rules, medically speaking. As a result, you might be inclined to follow standard medical practice and throw the patients' vitals into `matplotlib` bar charts, scatter plots, heatmaps, etc.

Now this isn't Grey's Anatomy, and we're not Richard Webber, but if it were, I'd rather be Cristina Yang. She's a doctor doctor! PhD from Berkeley. MD from Stanford. And a Smith grad to boot. And she likes to break the rules. Sure, she's arrogant and condescending--and has the *worst* bedside manner--but Cristina would never waste her time with `matplotlib` or bar charts. No way! She'd pour a glass of wine, roll up her sleeves, `pip install plotly` and start coding up something a little less conventional: a parallel coordinates plot.

Parallel coordinate plots allow us to plot all of our patients and (numerical) columns on one plot. Unlike Cartesian coordinate plots, where coordinate axes are perpendicular to one another, in parallel coordinate plots the coordinate axes are placed in parallel. And datapoints, which would be represented as points in Cartesian coordinates, are represented as polylines in parallel coordinates with vertices on the parallel axes.
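
To make the idea concrete, here is a minimal pure-Python sketch of how one record becomes a polyline. It is a conceptual illustration only, not how plotly does it internally: `to_polyline` is a hypothetical helper, axis `i` is placed at `x = i`, and each axis is scaled independently to the range 0 to 1. The values are the numerical columns of the first two patient records.

```python
# Axis names and values for the first two patient records...
axes = ["Age", "RestingBP", "Cholesterol", "FastingBS", "MaxHR", "OldPeak", "HeartDisease"]
patients = [
    [40, 140, 289, 0, 172, 0.0, 0],
    [49, 160, 180, 0, 156, 1.0, 1],
]

def to_polyline(record, records):
    """Return one (x, y) vertex per axis for a single record."""
    vertices = []
    for i, value in enumerate(record):
        column = [r[i] for r in records]
        low, high = min(column), max(column)
        # Scale each axis independently to [0, 1]; a constant
        # axis has no range, so park its vertex in the middle...
        y = 0.5 if high == low else (value - low) / (high - low)
        vertices.append((i, y))
    return vertices

for record in patients:
    print(to_polyline(record, patients))
```

Each record yields seven vertices, one per axis, and joining them left to right gives that patient's polyline.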

We begin by examining the numerical columns of our first two patients:

In [5]:
display(df.iloc[0:2, :][numerical_columns])

|   | Age | RestingBP | Cholesterol | FastingBS | MaxHR | OldPeak | HeartDisease |
|---|---|---|---|---|---|---|---|
| 0 | 40 | 140 | 289 | 0 | 172 | 0.0 | 0 |
| 1 | 49 | 160 | 180 | 0 | 156 | 1.0 | 1 |


Note that the first patient does not have heart disease, but the second does, and that while the two share the same fasting blood sugar (`FastingBS`) value, all of their other vitals are different. Thus they will appear very different on almost any plot. That's perfect for getting a feel for parallel coordinates!

Also, neither of these patients has lupus. Jus' sayin'.

>> **Foreman:** *You stash your drugs in a lupus textbook.*  
>> **House:** *It's never lupus.*
>
> *Finding Judas*, **House**, Season 3, Episode 9

That's right. It's never lupus.

Moving on...this is what those two sets of vitals look like plotted in parallel coordinates:

In [6]:
# We will make a lot of parallel coordinates plots
# so let's wrap the calls in a simple function that
# creates a plotly.graph_objects.Parcoords object
# to display our data in parallel coordinates...
def create_parallel_coordinates_plot(df, color, columns):
    return go.Parcoords(
        dimensions = list([
            dict(
                values = df[column],
                label = LABELS[column],
                name = column,
            ) for column in columns
        ]),
        line = dict(
            color = df[color],
            colorscale = COLORMAP,
            showscale = False
        ),
        labelfont = dict( size = 16 ),
        tickfont = dict( size = 16 ),
    )

def create_layout(title):
    return go.Layout(
        title = go.layout.Title(
            text = title,
            x = 0.5,
            xanchor = "center"
        )
    )

# Wrap the plotly.graph_objects.Parcoords object
# in a plotly.graph_objects.Figure object...
fig_one = go.Figure(
    create_parallel_coordinates_plot(
        df.iloc[0:2, :], "HeartDisease", numerical_columns
    ),
    layout = create_layout("Parallel Coordinates Plot of Patients One and Two"),
)
# And show() the figure!
fig_one.show()


In this plot, the lines are coloured by the value of `HeartDisease`. As a result, the first patient is represented by the blue polyline, which intersects the `HeartDisease` axis at 0 (no heart disease), the `OldPeak` axis at 0, the `MaxHR` axis at 172 and so forth. Conversely, the second patient is represented by the red polyline, which intersects the `HeartDisease` axis at 1 (has heart disease), the `OldPeak` axis at 1, the `MaxHR` axis at 156 and so on. Note that both patients have the value 0 for `FastingBS`, but do not share any other values.

To convince yourself you understand parallel coordinate plots now, compare the values where these two polylines intersect each of these axes with the values in the two patient records above.

Now let's look at a parallel coordinates plot of every other patient (to reduce the clutter) starting with the second patient, with datapoints again coloured by `HeartDisease`:

In [7]:
# Plot the numerical columns of the dataset
# colour coded by HeartDisease with red = true...
# Plot every other row starting at one to
# reduce data for better visibility...
fig_two = go.Figure(
    create_parallel_coordinates_plot(
        df.iloc[1::2, :], "HeartDisease", numerical_columns
    ),
    layout = create_layout("Parallel Coordinates Plot of Odd Numbered Patients"),
)

# Show the plot!
fig_two.show()


## Code Blue! Get the crash cart!

Now we're cooking with gas! Oh. Wait. We're Gregory House. Um...now we're diagnosing with technology? Moving on...

The first thing we see--immediately--is that a number of our patients are dead due to extremely low blood pressure (one polyline intersects the `RestingBP` axis at 0) and/or cholesterol (a few dozen polylines intersect the `Cholesterol` axis at 0).

>> **Cox:** *I actually had my physical last week, and while my cholesterol was low, my blood pressure was through the roof. Needless to say, my physician was stumped. But now, thank God, you've helped to solve that riddle, because the instant I heard your shrill voice whining about a "teeny-weeny problem," oh, it took every ounce of self-restraint I had to keep blood from shooting out my ears.*  
>> **Elliot:** *Doesn't it seem like in the time that it took you to say all that, you could have just helped me instead?*  
>> **Dr. Cox:** *Well, yes, it does, but here, that's what makes it delicious.*
>
> *My First Step*, **Scrubs**, Season 2, Episode 7

I guess the problem could be blood shooting out of their ears, but more likely these values are simply this dataset's way of recording missing values.

Rather than calling the medical examiner to transport them to the morgue, let's get the crash cart (STAT!) and bring their values up.

We will set those values to random samples drawn from a normal distribution whose mean and standard deviation match the rest of the values in those columns:

In [8]:
def set_column_value_to_normal_distribution(df, column, value):
    # Compute the column's mean and standard deviation
    # after removing rows whose column matches value...
    mean_value = df[df[column] != value][column].mean()
    std_value  = df[df[column] != value][column].std()
    # Create a random number generator...
    rng = np.random.default_rng(SEED)
    # Now set the column of those rows to a
    # random sample from a normal distribution...
    df[column] = df[column].apply(
        lambda x : rng.normal(mean_value, std_value) if x == value else x
    )
    return df

df = set_column_value_to_normal_distribution(df, "RestingBP"  , 0)
df = set_column_value_to_normal_distribution(df, "Cholesterol", 0)

# Always test...
assert len(df[df["RestingBP"  ] == 0]) == 0, "Ruh roh! One or more patients has crashed again!"
assert len(df[df["Cholesterol"] == 0]) == 0, "Ruh roh! One or more patients has crashed again!"
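
The helper above calls the random number generator once per row through `apply`. A vectorised variant, offered as a sketch rather than a drop-in replacement, could draw all of the replacements in one call. `impute_sentinel_with_normal` and the `toy` frame are hypothetical, the replacement values may not match those produced above, and nothing here stops a draw from landing on an implausible value, so clip or redraw if that matters.

```python
import numpy as np
import pandas as pd

def impute_sentinel_with_normal(df, column, sentinel = 0, seed = None):
    """Replace sentinel entries with draws from a fitted normal."""
    out = df.copy()
    missing = out[column] == sentinel
    observed = out.loc[~missing, column]
    rng = np.random.default_rng(seed)
    # One draw per missing entry, all generated in a single call...
    draws = rng.normal(observed.mean(), observed.std(), size = int(missing.sum()))
    # Cast first so float draws fit into an integer column...
    out[column] = out[column].astype(float)
    out.loc[missing, column] = draws
    return out

# Hypothetical frame with two "missing" (zero) entries...
toy = pd.DataFrame({"Cholesterol": [289, 0, 283, 214, 0, 195]})
fixed = impute_sentinel_with_normal(toy, "Cholesterol", sentinel = 0, seed = 12903614)
print(fixed)
```

Working on a copy leaves the input frame untouched, which makes it easier to compare the before and after.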

Now that we have dealt with these "missing values" and our patients are stable again, let's get another set of vitals.

In [9]:
# Let's shuffle the data to ensure
# we get a random selection of data
# (seeded so the result is reproducible)...
df = shuffle(df, random_state = SEED)

# Regenerate the figure...
fig_three = go.Figure(
    create_parallel_coordinates_plot(
        df.iloc[1::2, :], "HeartDisease", numerical_columns
    ),
    layout = create_layout("Parallel Coordinates Plot of Odd Numbered Patients Redux"),
)

# And show and tell time!
fig_three.show()

## Everybody Lies

>> **House:** *It's a basic truth of the human condition that everybody lies. The only variable is about what.*
>
> *Three Stories*, **House**, Season 1, Episode 21

Let's see what we can quickly and easily learn about our patients from this one plot:
1. The majority of patients with heart disease are between the ages of 50 and 65. This is because heart disease tends to occur and be diagnosed by about age 50 and, unfortunately, often starts leading to death around the age of 65. That said, old age does not seem to be as strongly discriminating as younger age.
1. The mean OldPeak value for patients with heart disease is probably lower than the mean OldPeak value for patients with no heart disease. I also see something similar for MaxHR.
1. About half(?) of patients with heart disease have a fasting blood sugar value of 1, but most patients who do not have heart disease have a fasting blood sugar value of 0.
1. I see something that I am very familiar with after many years analysing health data...  
My blood pressure is extremely stable. It is almost always 106 mm Hg systolic unless I've been doing something to raise it, like walking or exercising or chasing my cat around the house. But if you look in my health records, you'll conclude that it is very consistently 110 mm Hg. The proper conclusion, though, is that my doctor is lazy when recording my blood pressure. It's somehow easier to write 110 than 106, I guess.  
But then again, 4 mm Hg doesn't make that much of a difference. And my doctor knows that.  
In any case, you can easily see from the parallel coordinates plot that blood pressure is often rounded off to the nearest 10 mm Hg. We'll look at this next.
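
Eyeballed observations like these are worth confirming numerically. A minimal sketch, assuming nothing beyond pandas: group by `HeartDisease` and compare the group means. The frame here (`sample`, a made-up name) holds only the five rows printed earlier, so its numbers say nothing reliable about the full dataset. On the real data you would run the same `groupby` on `df`.

```python
import pandas as pd

# The five rows shown by df.head() earlier in the notebook...
sample = pd.DataFrame({
    "MaxHR": [172, 156, 98, 108, 122],
    "OldPeak": [0.0, 1.0, 0.0, 1.5, 0.0],
    "FastingBS": [0, 0, 0, 0, 0],
    "HeartDisease": [0, 1, 0, 1, 0],
})

# Mean of each column, split by heart disease status...
group_means = sample.groupby("HeartDisease")[["MaxHR", "OldPeak", "FastingBS"]].mean()
print(group_means)
```

If the table and the picture disagree, trust the table and go back to the picture to see what fooled your eye.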

Note that plotly plots are interactive, and in the case of parallel coordinate plots, you can do something really useful from a data exploration standpoint: click on any axis, hold the mouse down while you drag (you'll see a thick pink line appear) and then release. The pink line is a slider. Slide it up and down. See what happens? This constraint (plotly calls them constraints) acts like a filter, or data selector. You can create as many constraints on as many axes as you like, even multiple constraints on a single axis. When you're done with a constraint, just click on it and it will disappear. Go ahead and play with them.

Moving on...just as an aside, let's examine that last point above:

In [10]:
min_bp, max_bp = int(np.min(df["RestingBP"])), int(np.max(df["RestingBP"]))

fig_four = go.Figure(
    layout = create_layout("Histogram of Resting Blood Pressure"),
)
fig_four.add_trace(
    go.Histogram(
        x = df[df["HeartDisease"] == 0]["RestingBP"],
        name = "Heart Disease Absent",
        marker_color = COLORMAP[0],
        xbins = dict(
            start = min_bp - 0.5,
            end = max_bp + 0.5,
            size = 1.0
        )
    )
)
fig_four.add_trace(
    go.Histogram(
        x = df[df["HeartDisease"] == 1]["RestingBP"],
        name = "Heart Disease Present",
        marker_color = COLORMAP[1],
        xbins = dict(
            start = min_bp - 0.5,
            end = max_bp + 0.5,
            size = 1.0
        )
    )
)

fig_four.update_layout(
    barmode = "stack",
    xaxis_tickvals = np.arange(min_bp, max_bp + 1, 10),
    xaxis_title_text = "Resting Systolic Blood Pressure (mm Hg)",
    yaxis_title_text = "Number of Patients",
    legend=dict(
        orientation = "h",
        yanchor = "bottom",
        y = 1.02,
        xanchor = "center",
        x = 0.5
    )
)

# Show the plot!
fig_four.show()

Clearly there is a rounding tendency (read: bias) when recording blood pressures and the effect is larger for patients with heart disease. This is no surprise. If you work with health data regularly you'll see this over and over again and quickly come to realise blood pressure values are extremely imprecise and unreliable.

Either that or a lot of people were injured by turtles.

>> According to hospital insurance codes, there are 9 different ways you can be injured by turtles.
>
> Wall Street Journal

So a word to the wise (health data scientist): if you build a model whose power depends heavily on blood pressure, especially small differences in blood pressure, you're probably doing something wrong.
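
One quick way to put a number on the rounding is to tally the final digit of each reading. If values were recorded without any rounding preference, you would expect the ten digits to appear in roughly similar proportions (that expectation is my assumption, not a law). A sketch using the five resting blood pressures shown earlier, with hypothetical names `readings` and `last_digits`:

```python
from collections import Counter

# Resting blood pressures from the five rows shown earlier...
readings = [140, 160, 130, 138, 150]

# Tally the final digit of each reading...
last_digits = Counter(value % 10 for value in readings)
share_ending_in_zero = last_digits[0] / len(readings)

print(last_digits)
print(share_ending_in_zero)
```

On the full dataset you would apply the same tally to the originally recorded, integer-valued readings rather than the imputed ones.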

As [Davina](https://www.kaggle.com/daviniagun) pointed out in the notebook [Heart Failure Machine Learning](https://www.kaggle.com/daviniagun/heart-failure-machine-learning) this also happens with heart rate:

In [11]:
min_hr, max_hr = int(np.min(df["MaxHR"])), int(np.max(df["MaxHR"]))

fig_four_hr = go.Figure(
    layout = create_layout("Histogram of Maximum Heart Rate"),
)
fig_four_hr.add_trace(
    go.Histogram(
        x = df[df["HeartDisease"] == 0]["MaxHR"],
        name = "Heart Disease Absent",
        marker_color = COLORMAP[0],
        xbins = dict(
            start = min_hr - 0.5,
            end = max_hr + 0.5,
            size = 1.0
        )
    )
)
fig_four_hr.add_trace(
    go.Histogram(
        x = df[df["HeartDisease"] == 1]["MaxHR"],
        name = "Heart Disease Present",
        marker_color = COLORMAP[1],
        xbins = dict(
            start = min_hr - 0.5,
            end = max_hr + 0.5,
            size = 1.0
        )
    )
)

fig_four_hr.update_layout(
    barmode = "stack",
    xaxis_tickvals = np.arange(min_hr, max_hr + 1, 10),
    xaxis_title_text = "Maximum Heart Rate (per minute)",
    yaxis_title_text = "Number of Patients",
    legend=dict(
        orientation = "h",
        yanchor = "bottom",
        y = 1.02,
        xanchor = "center",
        x = 0.5
    )
)

# Show the plot!
fig_four_hr.show()

Moving on...

See how much you can learn about a dataset from a parallel coordinates plot? Try doing that examining one or two columns at a time with bar charts or scatter plots!

Now before we forget, let's treat `FastingBS` as a categorical column, and while we're at it, let's add `HeartDisease` to the categorical columns list.

In [12]:
# Let's move FastingBS from numerical_columns
# to categorical_columns and add HeartDisease
# to categorical_columns...
if "FastingBS" not in categorical_columns:
    categorical_columns.append("FastingBS")
if "HeartDisease" not in categorical_columns:
    categorical_columns.append("HeartDisease")
if "FastingBS" in numerical_columns:
    numerical_columns.remove("FastingBS")

assert "FastingBS" not in numerical_columns, "FastingBS not removed from numerical columns..."
assert "FastingBS" in categorical_columns, "FastingBS not added to categorical columns..."
assert "HeartDisease" in categorical_columns, "HeartDisease not added to categorical columns..."

## Get the EEG, the BP monitor, and the AVV.

>> **Obstetrician 1:** *Get the EEG, the BP monitor, and the AVV.*  
>> **Obstetrician 2:** *And get the machine that goes 'ping!'.*  
>> **Obstetrician 1:** *And get the most expensive machine - in case the Administrator comes.*
>
> *The Meaning of Life*, Monty Python

Now, what about categorical columns? Why can't we plot them in a parallel coordinates plot?

That's a fair question but one with an easy answer: we can, we just have to convert the categorical values to numbers. We could probably even add in the EEG, the BP monitor and the AVV if we wanted.
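
For the curious, the conversion itself is simple. Scikit-learn's `LabelEncoder` sorts the distinct values and numbers them from zero, and the pure-Python sketch below mimics that on the chest pain types from the five rows shown earlier (`codes` and `encoded` are made-up names):

```python
# Chest pain types from the five rows shown earlier...
chest_pain = ["ATA", "NAP", "ATA", "ASY", "NAP"]

# Sort the distinct categories and number them from zero...
codes = {category: code for code, category in enumerate(sorted(set(chest_pain)))}
encoded = [codes[category] for category in chest_pain]

print(codes)
print(encoded)
```

Keep in mind that the resulting order is alphabetical, so where a category lands on its axis carries no medical meaning.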

But as we might guess based upon `FastingBS`, parallel coordinate plots with (many) axes with small numbers of discrete values are often difficult to interpret.

Let's take a look:

In [13]:
df_copy = df.copy()
label_encoder = LabelEncoder()
for column in ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "STSlope"]:
    df_copy[column] = label_encoder.fit_transform(df_copy[column])

fig_five = go.Figure(
    create_parallel_coordinates_plot(
        df_copy[1::2], "HeartDisease", list(df_copy)
    ),
    layout = create_layout("Parallel Coordinates Plot of Odd Numbered Patients With Categorical Variables"),
)

# Show the plot!
fig_five.show()

But again...

>> The patient says, "Doctor, it hurts when I do this."  
>> The doctor says, "Then don't do that!"
>
> Henny Youngman

The problem is that the vertices of polylines going through categorical values are all plotted on top of each other, making it impossible to see the *distributions of values* on categorical axes. Put another way, you can't visually follow polylines through categorical axes. This is even more problematic when you have two categorical axes next to each other, for example `Sex` and `ChestPainType`; `FastingBS` and `RestingECG`; and `STSlope` and `HeartDisease`.

Fortunately, there is a different way of plotting polylines with categorical variables: parallel category plots. Let's take a look:

In [14]:
# We will make a lot of parallel categories plots
# so let's wrap the calls in a simple function...
def create_parallel_categoriess_plot(df, color, columns, labels = LABELS):
    return go.Parcats(
        line = dict(color = df[color], colorscale = COLORMAP, showscale = False),
        dimensions = list([
            dict(
                label = labels[column],
                values = df[column],
                categoryorder = "category ascending"
            ) for column in columns
        ]),
        labelfont = dict( size = 16 ),
        tickfont = dict( size = 16 ),
    )

fig_six = go.Figure(
    create_parallel_categoriess_plot(df, "HeartDisease", categorical_columns),
    layout = create_layout("Parallel Categories Plot of All Patients"),
)

# Show the plot!
fig_six.show()

Parallel category plots "spread out" the polylines while also grouping records with similar categorical values. That reminds me...

>>There may be said to be two classes of people in the world; those who constantly divide the people of the world into two classes, and those who do not.
>
> Robert Benchley

But wait! There's more!

Again, plotly plots are interactive, and parallel category plots are no exception, though their interactions differ from parallel coordinate plots: you can hover! Hover over the categories on each axis. Then hover over the polylines between axes. You'll see groups of polylines--that is, groups of patient records--highlighted. This is extremely useful, and reminiscent of decision tree representations of data.
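
Roughly speaking, the grouping comes down to counting: records that share the same pair of categories on two neighbouring axes travel together, and the band gets wider as the count grows. This is a simplified picture (plotly also splits bands by colour and by the full path across all axes), sketched here with the `Sex` and `ChestPainType` values from the five rows shown earlier and a hypothetical `ribbons` counter:

```python
from collections import Counter

# Sex and chest pain type for the five rows shown earlier...
sex = ["M", "F", "M", "F", "M"]
chest_pain = ["ATA", "NAP", "ATA", "ASY", "NAP"]

# Count how many records travel between each pair of categories...
ribbons = Counter(zip(sex, chest_pain))

for (left, right), count in sorted(ribbons.items()):
    print(f"{left} -> {right}: {count}")
```

These are the same counts you see when you hover over a band in the plot.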

## You’re Very Sick!

As an aside, you might be tempted to ask, "I understand we can't add categorical values to a parallel coordinates plot without converting them to numerical values, and I understand why that's generally not very useful, but could we maybe add numerical values to parallel category plots? After all, we did that for FastingBS, though really it was only technically a numerical value...but isn't it worth a try?"

Good question! Let's find out! `Age` is a numerical value that can easily be thought of as a category. Let's try that:

In [15]:
fig_seven = go.Figure(
    create_parallel_categoriess_plot(
        df, "HeartDisease", ["Age"] + categorical_columns
    ),
    layout = create_layout("Parallel Categories Plot of All Patients and Age"),
)

# Show the plot!
fig_seven.show()

Ruh roh!

>> Doctor to patient: “You’re very sick! I like that in a patient.”
>
> P.C. Vey

This plot is quite sick, but nobody likes that in a data visualisation. And I trust you can diagnose the problem. The problem is that parallel category plots don't scale well to large numbers of categories. Sure, the labels are a mess, but that can be dealt with to a large extent by turning them off and relying on hovering to supply the missing labels. The real problem is that they lose their ability to group records.

But that's not the only way to turn `Age` into a categorical variable. Let's try age classes instead:

In [16]:
def create_categories_from_values(df, column, resolution = 10, offset = 0):
    return df[column].apply(
        lambda x : str(
            int(resolution*((x - offset) // resolution) + offset)
        ) + " - " + str(
            int(resolution*(1 + (x - offset) // resolution) + offset)
        )
    )

df_copy = df.copy()
df_copy["AgeClass" ] = create_categories_from_values(df_copy, "Age", 10, 5)

fig_eight = go.Figure(
    create_parallel_categoriess_plot(
        df_copy, "HeartDisease", ["AgeClass"] + categorical_columns
    ),
    layout = create_layout("Parallel Categories Plot of All Patients and Age Classes"),
)

# Turn off the colorbar...it's useless...
fig_eight.update_layout(coloraxis_showscale = False)
# Show the plot!
fig_eight.show()

Noice!

>> Look! It's moving. It's alive. It's alive...it's alive, it's moving, it's alive, it's alive, it's alive, it's alive, IT'S ALIVE!
>
> Henry Frankenstein, Frankenstein, 1931

Yessssss! Bolstered by this success, let's keep going. Let's add `Cholesterol` and `RestingBP` next.

But before we do, check out how many men--and how few women--between the ages of 65 and 75 have heart disease. SMH

Moving on...adding `Cholesterol` and `RestingBP`...

In [17]:
df_copy["CholesterolClass" ] = create_categories_from_values(df_copy, "Cholesterol", 50, 0)
df_copy["RestingBPClass" ] = create_categories_from_values(df_copy, "RestingBP", 10, 5)

fig_nine = go.Figure(
    create_parallel_categoriess_plot(
        df_copy, "HeartDisease", ["AgeClass", "CholesterolClass", "RestingBPClass"] + categorical_columns
    ),
    layout = create_layout("Parallel Categories Plot of All Patients and Age, Cholesterol and RestingBP Classes"),
)

# Turn off the colorbar...it's useless...
fig_nine.update_layout(coloraxis_showscale = False)
# Show the plot!
fig_nine.show()

This sort of works but not well. There are simply too many classes, and decreasing the numbers makes the classes less informative. Moreover, the more dimensions we add, the smaller the groups of patients that are clustered together. Non-uniform classes would help here, like combining all `Cholesterol` values between 300 and 650 into a single class, and all `RestingBP` values less than 105 into one class, and all values above 165 into another. But the limitations of this approach should be clear. Use it when it works, but don't expect it to work very often.

>> The key to happiness in life is low expectations.
>
> You guessed it. Me.
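
For what it's worth, the non-uniform classes suggested above could be sketched with `pandas.cut`. Treat this as an illustration under assumptions: the lower bin edges are made up, only the wide 300 to 650 class comes from the discussion above, and the values are the five cholesterol readings shown earlier.

```python
import pandas as pd

# Cholesterol values from the five rows shown earlier...
cholesterol = pd.Series([289, 180, 283, 214, 195])

# Hypothetical edges: the lower ones are made up for illustration,
# while the wide 300 to 650 class follows the suggestion above...
edges = [0, 200, 250, 300, 650]
labels = ["0 - 200", "200 - 250", "250 - 300", "300 - 650"]

# right = False makes each class include its lower edge only...
cholesterol_class = pd.cut(cholesterol, bins = edges, labels = labels, right = False)
print(cholesterol_class.astype(str).tolist())
```

Note that with `right = False` a value sitting exactly on the top edge would fall outside every class, so pick the last edge a little above the largest value you expect.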

Next let's move some columns around in our two column lists so that in what comes next we don't have to specify the order the axes should be drawn in, since our plotting helpers draw the axes in the order of the list they are given:

In [18]:
categorical_columns.remove("HeartDisease")
categorical_columns = ["HeartDisease"] + categorical_columns
numerical_columns.remove("Age")
numerical_columns.remove("HeartDisease")
numerical_columns = numerical_columns + ["Age", "HeartDisease"]

And now, without further ado--and brilliantly funny, perfectly selected, medically related quotes--we generate our final parallel coordinates visualisation...the Mother Of All Parallel Coordinates Visualisations...

In [19]:
# Forgive the use of ipywidgets for laying out
# figures side by side...it's the only way.
def update_layout(fig, left = 20, right = 40, width = 650):
    fig.update_layout(
        margin = dict(l = left, r = right),
        autosize = False,
        width = width,
        height = 650,
    )
    return fig

output_left = ipywidgets.Output()
with output_left:
    fig_ten = go.Figure(
        create_parallel_categoriess_plot(df, "HeartDisease", categorical_columns),
        layout = create_layout("Parallel Categories Plot of All Patients Final"),
    )
    fig_ten = update_layout(fig_ten, 5, 0, 650)
    # Show the plot!
    fig_ten.show()
    
output_right = ipywidgets.Output()
with output_right:
    fig_eleven = go.Figure(
        create_parallel_coordinates_plot(df, "HeartDisease", numerical_columns),
        layout = create_layout("Parallel Coordinates Plot of All Patients Final"),
    )
    fig_eleven = update_layout(fig_eleven, 40, 60, 600)
    # Show the plot!
    fig_eleven.show()

ipywidgets.HBox([output_left, output_right], layout = ipywidgets.Layout(width = "1600px", flex_flow = "row wrap"))



And there you have it. All of the columns in the dataset displayed in one (er, one, two-in-one) interactive visualisation.

Thank you for attending my TED Talk.

Questions? Comments? Threats? Funny anecdotes? Bitingly sarcastic remarks? Clever observations? Ad hominem attacks? Throw them in the comments!

## Parallel Coordinates: The Postmortem

Well, as the old joke goes, the surgery was a success but the patient died.

If you are viewing this in preview mode on Kaggle, then the Mother Of All Parallel Coordinates Visualisations, which may or may not be displayed above, was a success but may have died in preview mode. A service will be held in the Kaggle chapel on Saturday afternoon. We ask that donations be sent to World Heart Foundation in the name of The Mother Of All Parallel Coordinates Visualisations.

If, however, you open this notebook and run it, the visualisation above should be resuscitated and display properly. The issue is that The Mother Of All Parallel Coordinates Visualisations is about 1600 pixels wide (their physician recommended diet and exercise, but did they listen? Nnnooo...I mean, who listens to their physician?) and Kaggle's previewer simply can't display anything that wide. It could also be that Kaggle's preview mode can't handle the `ipywidgets.HBox` used to display two `plotly` plots side by side. Either way, the surgery was a success but the patient died.

However, if you have a little imagination, and can mentally transpose two vertically arranged plots, you can still experience The Mother Of All Parallel Coordinates Visualisations here...provided it does not also fail to display:

In [20]:
fig_twelve = go.Figure(
    create_parallel_categoriess_plot(df, "HeartDisease", categorical_columns),
    layout = create_layout("Parallel Categories Plot of All Patients Final"),
)

# Show the plot!
fig_twelve.show()

fig_thirteen = go.Figure(
    create_parallel_coordinates_plot(df, "HeartDisease", numerical_columns),
    layout = create_layout("Parallel Coordinates Plot of All Patients Final"),
)

# Show the plot!
fig_thirteen.show()