Before we begin, we will change a few settings to make the notebook look a bit prettier.

In [1]:
%%html
<style> body {font-family: "Calibri", cursive, sans-serif;} </style>
<style>.sankey .node {font-family: "Calibri";}</style>

<img src="https://iknl.nl/images/default-source/images/.png?sfvrsn=3" align="right">

# Visualizing flow of patients in Oncoguide
<sup>Arturo Moncada-Torres</sup>

At [IKNL](<https://iknl.nl/>), we work every day to continuously improve
oncological and palliative care of the Dutch population. We have developed
[Oncoguide](<https://www.iknl.nl/oncologische-zorg/oncoguide/>), a tool that 
supports healthcare professionals and patients in making the best decisions
for their treatment. Oncoguide provides a graphical representation of clinical
guidelines for patient therapy in the shape of decision trees.

In this Jupyter notebook, we generate Sankey diagrams representing the flow of patients
through the decision trees. Specifically, we are interested in the number of
patients who were treated according to the guidelines and the number of
patients who were not. We do so using [floWeaver](<https://github.com/ricklupton/floweaver>).

## Preliminaries

In [2]:
import os
import pathlib
import pandas as pd

import floweaver as fw
from ipysankeywidget import SankeyWidget

Set up working directory

In [3]:
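# Note: this call changes to the directory we are already in, so it has no effect.
# The relative paths below are resolved against the notebook's working directory.
os.chdir(os.getcwd())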

Path definition and verification

In [4]:
PATH_DATA = pathlib.Path(r'../data/')
PATH_RESULTS = pathlib.Path(r'../results/')

if not PATH_DATA.exists():
    # PATH_DATA must contain the files flow_patients.csv and tree_elements.csv
    raise FileNotFoundError(f"Data directory not found: {PATH_DATA.resolve()}")

# Create the results directory if it does not exist yet
PATH_RESULTS.mkdir(parents=True, exist_ok=True)
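
The check above only verifies that the data directory exists, not that the two required files are in it. A minimal sketch of a stricter check follows; the helper `missing_files` and the constant `REQUIRED_FILES` are names introduced here for illustration, and the demo runs against a temporary directory rather than the real `PATH_DATA`.

```python
import pathlib
import tempfile

# File names taken from the comment in the cell above
REQUIRED_FILES = ('flow_patients.csv', 'tree_elements.csv')

def missing_files(directory, required=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`."""
    directory = pathlib.Path(directory)
    return [name for name in required if not (directory / name).is_file()]

# Demo on a throwaway directory that contains only one of the two files
with tempfile.TemporaryDirectory() as tmp:
    (pathlib.Path(tmp) / 'flow_patients.csv').touch()
    missing = missing_files(tmp)

print(missing)
```

In the notebook itself, one could call `missing_files(PATH_DATA)` and raise an error if the returned list is not empty.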

## The data

A decision tree is made of nodes. Each node represents a decision point. Based on their (disease) characteristics, a patient travels through the tree until they reach an end node (or leaf), which represents a (suggested) treatment.

The data is very simple. Each row states the number of patients (`value`) that went from one node (`source`) to another one (`target`). Furthermore, `value_good` is the number of those patients that were supposed to be there *according to the guideline*, so it is never larger than `value` in the rows shown below. The `description` column gives a brief explanation of the connection for that row.

In [5]:
df = pd.read_csv(PATH_DATA/'flow_patients.csv', low_memory=False)
df

Unnamed: 0,source,target,description,value,value_good
0,35408,35420,Levensverwachting,243,148
1,35406,35417,Kans op lymfklier-betrokkenheid is {result},327,319
2,35406,35428,Kans op lymfklier-betrokkenheid is {result},913,547
3,35416,35425,Levensverwachting,66,57
4,35409,35418,cT,1618,1543
5,35413,35421,Aantal positieve biopten,1069,986
6,35416,35406,Levensverwachting,1240,866
7,35414,35410,Kans op lymfklier-betrokkenheid is {result},7,0
8,35407,35416,EAU/ESTRO Risicogroep,1306,923
9,35407,35413,EAU/ESTRO Risicogroep,1735,1640
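
To make the relation between `value` and `value_good` concrete, here is a small sketch that computes the share of patients on each connection who were there according to the guideline. It uses three rows copied from the table above (rows 0, 4 and 7); the frame `links` and the column `share_good` are names introduced only for this example.

```python
import pandas as pd

# Three rows copied from the table above (rows 0, 4 and 7)
links = pd.DataFrame({
    'source': [35408, 35409, 35414],
    'target': [35420, 35418, 35410],
    'value': [243, 1618, 7],
    'value_good': [148, 1543, 0],
})

# Fraction of patients on each connection that followed the guideline
links['share_good'] = links['value_good'] / links['value']
print(links)
```

A share close to 1 means almost everyone on that connection followed the guideline, while the last row shows a connection where nobody did.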


Now, we will find the trunk, branches, and leaves of this tree. The trunk is the element that appears as source, but never as target. The leaves are the elements that appear as target, but never as source. Branches are the elements that are neither trunk nor leaves.

In [6]:
source_set = set(df['source'])
target_set = set(df['target'])

In [7]:
trunk = source_set - target_set
trunk

{35409}

In [8]:
leaves = target_set - source_set
leaves

{35410, 35415, 35417, 35418, 35420, 35421, 35425, 35428, 35429}

In [9]:
branches = (source_set | target_set) - trunk - leaves
branches

{35406, 35407, 35408, 35413, 35414, 35416}
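
A quick sanity check that fits this structure: every patient who enters a branch should also leave it, so the total inflow and outflow of a branch should match. The sketch below tests this on five rows copied from the table above (the connections touching nodes 35416 and 35406); `links`, `inflow`, `outflow`, `inner` and `balance` are names introduced here, and the full dataset may of course behave differently.

```python
import pandas as pd

# Five rows copied from the table above (rows 8, 3, 6, 1 and 2)
links = pd.DataFrame({
    'source': [35407, 35416, 35416, 35406, 35406],
    'target': [35416, 35425, 35406, 35417, 35428],
    'value': [1306, 66, 1240, 327, 913],
})

# Total number of patients arriving at and leaving from each node
inflow = links.groupby('target')['value'].sum()
outflow = links.groupby('source')['value'].sum()

# Only nodes that appear both as target and as source in this subset
inner = inflow.index.intersection(outflow.index)
balance = inflow.loc[inner] - outflow.loc[inner]
print(balance)
```

A non-zero entry in `balance` would point to patients that appear or disappear at a branch.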

## Generating a simple Sankey diagram
We can start by creating a quick (and a bit dirty) Sankey diagram:

In [10]:
SankeyWidget(links=df.to_dict('records')).auto_save_png(filename=str(PATH_RESULTS/'sankey_simple.png'))

SankeyWidget(links=[{'source': 35408, 'target': 35420, 'description': 'Levensverwachting', 'value': 243, 'valu…
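
The call above relies on `df.to_dict('records')`, which turns the dataframe into a list with one dictionary per row. That is the shape `SankeyWidget` expects for its `links` argument, as the output above shows. A tiny sketch with two rows from our data (the frame name `links_df` is only for this example):

```python
import pandas as pd

# Two rows from the data, reduced to the columns a link needs
links_df = pd.DataFrame({
    'source': [35408, 35406],
    'target': [35420, 35417],
    'value': [243, 327],
})

# One dictionary per row, keyed by column name
links = links_df.to_dict('records')
print(links)
```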

## Enhancing the Sankey diagram

Not bad as a first try, but maybe we can do better. Let's explicitly define the number of "bad" patients (patients that, *according to the guideline*, shouldn't be there).

In [11]:
df['value_bad'] = df['value'] - df['value_good']

Now, let's generate a new dataframe in which each connection has one row per `type` (`good` or `bad`) and a single `value` column (needed for the Sankey diagram definition).

In [12]:
df_good = df.drop(columns=['value', 'value_bad'])
df_good.rename(columns={'value_good':'value'}, inplace=True)
df_good['type'] = 'good'

df_bad = df.drop(columns=['value', 'value_good'])
df_bad.rename(columns={'value_bad':'value'}, inplace=True)
df_bad['type'] = 'bad'

In [13]:
df = pd.concat([df_good, df_bad], axis=0, ignore_index=True)
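
The same reshaping can also be written with `DataFrame.melt`. This is only an alternative sketch, not what the notebook uses; it runs on two rows from our data (with `value_bad` already computed as `value - value_good`), and `df_wide` and `df_long` are names introduced for the example.

```python
import pandas as pd

# Two rows from the data; value_bad = value - value_good (243 - 148 and 7 - 0)
df_wide = pd.DataFrame({
    'source': [35408, 35414],
    'target': [35420, 35410],
    'value_good': [148, 0],
    'value_bad': [95, 7],
})

# Stack value_good and value_bad into a single value column
df_long = df_wide.melt(
    id_vars=['source', 'target'],
    value_vars=['value_good', 'value_bad'],
    var_name='type',
    value_name='value',
)

# Turn 'value_good' / 'value_bad' into 'good' / 'bad'
df_long['type'] = df_long['type'].str.replace('value_', '', regex=False)
print(df_long)
```

As with the `concat` version, all the `good` rows come first, followed by the `bad` rows.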

We will add a column with the color of each link:

In [14]:
def pick_color(row):
    # Green for patients treated according to the guideline, red for the rest
    if row['type']=='good':
        return '#ABD66A'
    elif row['type']=='bad':
        return '#E96245'

df['color'] = df.apply(pick_color, axis=1)

The dataframe in its final form looks like this:

In [15]:
df

Unnamed: 0,source,target,description,value,type,color
0,35408,35420,Levensverwachting,148,good,#ABD66A
1,35406,35417,Kans op lymfklier-betrokkenheid is {result},319,good,#ABD66A
2,35406,35428,Kans op lymfklier-betrokkenheid is {result},547,good,#ABD66A
3,35416,35425,Levensverwachting,57,good,#ABD66A
4,35409,35418,cT,1543,good,#ABD66A
5,35413,35421,Aantal positieve biopten,986,good,#ABD66A
6,35416,35406,Levensverwachting,866,good,#ABD66A
7,35414,35410,Kans op lymfklier-betrokkenheid is {result},0,good,#ABD66A
8,35407,35416,EAU/ESTRO Risicogroep,923,good,#ABD66A
9,35407,35413,EAU/ESTRO Risicogroep,1640,good,#ABD66A


Now, we will define a few options to enhance the looks of our Sankey diagram. First, I want to make sure that all paths have the same length (i.e., that all the leaves end at the same place). I can do so by defining `rank_sets`:

In [16]:
rank_sets = [{'type': 'same', 'nodes': list(leaves)}]

Furthermore, I would like to add an indicator to show that the leaves are all treatments. I can do that by grouping the corresponding nodes:

In [17]:
groups = [{'id': 'treatments', 'title': 'Treatments', 'nodes': list(leaves)}]
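
One small detail: `leaves` is a set, so `list(leaves)` comes out in arbitrary order. The diagram should not depend on that order as far as I can tell, but if you want the printed widget definition to look the same on every run, you can sort the nodes first. A sketch, with the leaves copied from the output above and `leaf_nodes` as a name introduced here:

```python
# Leaves as found earlier in the notebook
leaves = {35410, 35415, 35417, 35418, 35420, 35421, 35425, 35428, 35429}

# Sort once so both options list the nodes in the same, stable order
leaf_nodes = sorted(leaves)

rank_sets = [{'type': 'same', 'nodes': leaf_nodes}]
groups = [{'id': 'treatments', 'title': 'Treatments', 'nodes': leaf_nodes}]
print(rank_sets)
print(groups)
```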

In [18]:
SankeyWidget(links=df.to_dict('records'), 
             rank_sets=rank_sets, 
             groups=groups,
             align_link_types=False).auto_save_png(filename=str(PATH_RESULTS/'sankey_enhanced.png'))

SankeyWidget(groups=[{'id': 'treatments', 'title': 'Treatments', 'nodes': [35425, 35428, 35429, 35410, 35415, …

Ah, much nicer! I am happy with how this Sankey diagram looks now. However, you might be interested in showing the number of patients that correspond to each flow. This can be done with the option `linkLabelFormat`:

In [19]:
SankeyWidget(links=df.to_dict('records'), 
             rank_sets=rank_sets, 
             groups=groups,
             linkLabelFormat='.0f', 
             align_link_types=False).auto_save_png(filename=str(PATH_RESULTS/'sankey_enhanced_numbers.png'))

SankeyWidget(groups=[{'id': 'treatments', 'title': 'Treatments', 'nodes': [35425, 35428, 35429, 35410, 35415, …
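
A note on the format string `'.0f'`: I am assuming here that the widget hands it to d3-format on the JavaScript side, whose specifiers are modeled on Python's format mini-language. Under that assumption, Python's built-in `format` gives a quick preview of what the labels will look like. The values below are examples, and `labels` and `labels_grouped` are names introduced for the sketch.

```python
# Example link values (the first one is taken from our data)
values = [1543, 1618.0, 986.4]

# '.0f' means fixed-point notation with zero decimals
labels = [format(v, '.0f') for v in values]

# ',.0f' additionally groups thousands with a comma
labels_grouped = [format(v, ',.0f') for v in values]

print(labels)
print(labels_grouped)
```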

Personally, I don't like it that much, since it starts looking a bit cluttered.

Just for fun, we can add some extra columns to our dataframe to give meaningful names to `source` and `target` (and make the Sankey diagram easier to interpret):

In [20]:
# Read the node descriptions once, instead of re-reading the file on every lookup
df_elements = pd.read_csv(PATH_DATA/'tree_elements.csv', index_col=0, low_memory=False)

def number_to_name(number):
    # Return the first 15 characters of the node's description
    name = df_elements.loc[number, 'description']
    return name[0:15]

df['source_name'] = df['source'].map(number_to_name)
df['target_name'] = df['target'].map(number_to_name)
df

Unnamed: 0,source,target,description,value,type,color,source_name,target_name
0,35408,35420,Levensverwachting,148,good,#ABD66A,Levensverwachti,Waakzaam afwach
1,35406,35417,Kans op lymfklier-betrokkenheid is {result},319,good,#ABD66A,Kans op lymfkli,Radicale prosta
2,35406,35428,Kans op lymfklier-betrokkenheid is {result},547,good,#ABD66A,Kans op lymfkli,Radicale prosta
3,35416,35425,Levensverwachting,57,good,#ABD66A,Levensverwachti,Waakzaam afwach
4,35409,35418,cT,1543,good,#ABD66A,cT,Radicale prosta
5,35413,35421,Aantal positieve biopten,986,good,#ABD66A,Aantal positiev,Actieve surveil
6,35416,35406,Levensverwachting,866,good,#ABD66A,Levensverwachti,Kans op lymfkli
7,35414,35410,Kans op lymfklier-betrokkenheid is {result},0,good,#ABD66A,Kans op lymfkli,Uitwendige radi
8,35407,35416,EAU/ESTRO Risicogroep,923,good,#ABD66A,EAU/ESTRO Risic,Levensverwachti
9,35407,35413,EAU/ESTRO Risicogroep,1640,good,#ABD66A,EAU/ESTRO Risic,Aantal positiev
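
The same lookup can be done without a helper function, by mapping the node numbers through a Series of shortened descriptions. The sketch below uses a made-up stand-in for `tree_elements.csv` (the node numbers and descriptions are hypothetical, since the real file is not shown here); `short_names` and `links` are names introduced for the example.

```python
import pandas as pd

# Hypothetical stand-in for tree_elements.csv, indexed by node number
df_elements = pd.DataFrame(
    {'description': ['Example decision node', 'Example treatment leaf']},
    index=[1, 2],
)

# Hypothetical single link between the two nodes
links = pd.DataFrame({'source': [1], 'target': [2], 'value': [10]})

# Shorten every description to 15 characters, then look up by node number
short_names = df_elements['description'].str[:15]
links['source_name'] = links['source'].map(short_names)
links['target_name'] = links['target'].map(short_names)
print(links)
```

One difference to keep in mind: mapping through a Series gives `NaN` for node numbers that are missing from the elements table, whereas `df_elements.loc[...]` raises a `KeyError`.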
