# Visualise data using a Sankey diagram

When several variables are involved, one compact visualization that shows how they relate often beats a set of separate visualizations, one for each variable. For example, when you visualize high-dimensional numerical data, a single parallel coordinates plot can be more useful than multiple bar charts (one for each numerical variable). A Sankey diagram plays a similar role for flows between categories: each link connects a source node to a target node, and its width is proportional to the quantity it carries.

In [226]:
import plotly.graph_objects as go

source = [ 0, 0, 0, 1, 1]
target = [ 2, 3, 4, 4, 5]
value  = [ 8, 2, 4, 2, 8]

NODES = dict(pad  = 20, thickness = 10,
            line  = dict(color = "black", width = 0.5),
            label = ["A1", "A2", "B1", "B2", "B3", "B4"],
            color = ["red", "red", "blue", "blue", "blue", "blue"],)

LINKS = dict(source = source, target = target, value = value, 
            label = ["A1-B1", "A1-B2", "A1-B3", "A2-B3", "A2-B4"],
            color = ["pink", "cyan", "green", "grey", "magenta"],)

data = go.Sankey(node = NODES, link = LINKS)

fig = go.Figure(data)

fig.show()
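
The `source` and `target` lists in the cell above hold positions in the node `label` list rather than the node names, which gets hard to read as the diagram grows. A minimal sketch of one way to derive those index lists from named flows follows. The helper name `flows_to_indices` and the `(source, target, value)` tuple layout are assumptions made for illustration; they are not part of plotly.

```python
def flows_to_indices(labels, flows):
    """Translate (source_label, target_label, value) tuples into the
    parallel index/value lists used by a Sankey link definition."""
    # Map every node label to its position in the label list
    index_of = {label: i for i, label in enumerate(labels)}
    source = [index_of[s] for s, _, _ in flows]
    target = [index_of[t] for _, t, _ in flows]
    value = [v for _, _, v in flows]
    return source, target, value


labels = ["A1", "A2", "B1", "B2", "B3", "B4"]
flows = [("A1", "B1", 8), ("A1", "B2", 2), ("A1", "B3", 4),
         ("A2", "B3", 2), ("A2", "B4", 8)]

source, target, value = flows_to_indices(labels, flows)
print(source)  # [0, 0, 0, 1, 1]
print(target)  # [2, 3, 4, 4, 5]
print(value)   # [8, 2, 4, 2, 8]
```

The three lists match the hand-written ones in the cell above and can be passed to the `LINKS` dictionary unchanged.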

In [227]:
import pandas as pd
import matplotlib.pyplot as plt

import warnings
warnings.simplefilter("ignore")

## Read and process the dataset
Let's use the [Olympics 2021](https://www.kaggle.com/arjunprasadsarkhel/2021-olympics-in-tokyo) dataset to illustrate the use of Sankey diagrams.

In [229]:
df_medals = pd.read_excel("data/Medals.xlsx")
print(df_medals.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93 entries, 0 to 92
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Rank           93 non-null     int64  
 1   Team/NOC       93 non-null     object 
 2   Gold           93 non-null     int64  
 3   Silver         93 non-null     int64  
 4   Bronze         93 non-null     int64  
 5   Total          93 non-null     int64  
 6   Rank by Total  93 non-null     int64  
 7   Unnamed: 7     0 non-null      float64
 8   Unnamed: 8     1 non-null      float64
dtypes: float64(2), int64(6), object(1)
memory usage: 6.7+ KB
None


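- The rank, country, and medal columns have no missing values, so no missing-data handling is needed for them. The two unnamed columns (`Unnamed: 7` and `Unnamed: 8`) are almost entirely empty and are dropped in the next step.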

In [230]:
df_medals.rename(columns={'Team/NOC':'Country', 'Total': 'Total Medals', 'Gold':'Gold Medals', 'Silver': 'Silver Medals', 'Bronze': 'Bronze Medals'}, inplace=True)
df_medals.drop(columns=['Unnamed: 7','Unnamed: 8','Rank by Total'], inplace=True)
df_medals

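| | Rank | Country | Gold Medals | Silver Medals | Bronze Medals | Total Medals |
|---|---|---|---|---|---|---|
| 0 | 1 | United States of America | 39 | 41 | 33 | 113 |
| 1 | 2 | People's Republic of China | 38 | 32 | 18 | 88 |
| 2 | 3 | Japan | 27 | 14 | 17 | 58 |
| 3 | 4 | Great Britain | 22 | 21 | 22 | 65 |
| 4 | 5 | ROC | 20 | 28 | 23 | 71 |
| ... | ... | ... | ... | ... | ... | ... |
| 88 | 86 | Ghana | 0 | 0 | 1 | 1 |
| 89 | 86 | Grenada | 0 | 0 | 1 | 1 |
| 90 | 86 | Kuwait | 0 | 0 | 1 | 1 |
| 91 | 86 | Republic of Moldova | 0 | 0 | 1 | 1 |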


### Display a basic Sankey diagram

In [265]:
NODES = dict( #           0                               1                          2        3       4           5
            label = ["United States of America", "People's Republic of China",   "Japan", "Gold", "Silver", "Bronze"],
            color = ["seagreen",                 "dodgerblue",                  "orange", "gold", "silver", "brown" ],)

LINKS = dict(   source = [  0,  0,  0,  1,  1,  1,  2,  2,  2], # The origin or the source nodes of the link
                target = [  3,  4,  5,  3,  4,  5,  3,  4,  5], # The destination or the target nodes of the link
                value =  [ 39, 41, 33, 38, 32, 18, 27, 14, 17], # The width (quantity) of the links 
                # Color of the links 
                # Target Node:    3-Gold          4 -Silver        5-Bronze           
                color =     [   "lightgreen",   "lightgreen",   "lightgreen",      # Source Node: 0 - United States of America
                                "lightskyblue", "lightskyblue", "lightskyblue",    # Source Node: 1 - People's Republic of China
                                "bisque",       "bisque",       "bisque"],)        # Source Node: 2 - Japan

data = go.Sankey(node = NODES, link = LINKS)
fig = go.Figure(data)
fig.update_layout(title="Olympics - 2021: Country &  Medals",  font_size=16)
fig.show()

# Colors are at https://developer.mozilla.org/en-US/docs/Web/CSS/color_value

### Adjust the order of the nodes and the width of the plot

In [267]:
NODES = dict( #           0                               1                          2        3       4           5
            label = ["United States of America", "People's Republic of China",   "Japan", "Gold", "Silver", "Bronze"],
            color = [                "seagreen",                 "dodgerblue",  "orange", "gold", "silver", "brown" ],
            x     = [                         0,                            0,         0,    0.5,      0.5,      0.5],
            y     = [                         0,                          0.5,         1,    0.1,      0.5,        1],)
data = go.Sankey(node = NODES, link = LINKS)
fig = go.Figure(data)
fig.update_layout(title="Olympics - 2021: Country &  Medals",  font_size=16)
fig.show()
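
The `x` and `y` values above are typed by hand, which works for three nodes per column but becomes tedious with more. Below is a small sketch that spreads any number of nodes evenly along one axis. The helper `spread_positions` and its `margin` parameter are hypothetical names introduced here. The margin reflects an assumption that node positions are normalized and are best kept slightly inside the 0 to 1 range.

```python
def spread_positions(n, margin=0.01):
    """Return n evenly spaced positions between margin and 1 - margin."""
    if n < 1:
        return []
    if n == 1:
        return [0.5]  # a single node sits in the middle
    step = (1 - 2 * margin) / (n - 1)
    return [round(margin + i * step, 3) for i in range(n)]


# Vertical positions for three country nodes and three medal nodes
country_y = spread_positions(3)
medal_y = spread_positions(3)
node_y = country_y + medal_y
print(node_y)
```

The resulting list could replace the hand-written `y` entry in `NODES`. The same helper can be called with a different count when more countries are added.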

### Improve the format of the hoverlabels

In [271]:
NODES = dict( #           0                               1                          2        3       4           5
            label = ["United States of America", "People's Republic of China",   "Japan", "Gold", "Silver", "Bronze"],
            color = [                "seagreen",                 "dodgerblue",  "orange", "gold", "silver", "brown" ],
            x     = [                         0,                            0,         0,    0.5,      0.5,      0.5],
            y     = [                         0,                          0.5,         1,    0.1,      0.5,        1],
            hovertemplate=" ",)

LINK_LABELS = []
for country in ["USA","China","Japan"]:
    for medal in ["Gold","Silver","Bronze"]:
        LINK_LABELS.append(f"{country}-{medal}")
LINKS = dict(   source = [  0,  0,  0,  1,  1,  1,  2,  2,  2], # The origin or the source nodes of the link
                target = [  3,  4,  5,  3,  4,  5,  3,  4,  5], # The destination or the target nodes of the link
                value =  [ 39, 41, 33, 38, 32, 18, 27, 14, 17], # The width (quantity) of the links 
                # Color of the links 
                # Target Node:    3-Gold          4 -Silver        5-Bronze           
                color =     [   "lightgreen",   "lightgreen",   "lightgreen",      # Source Node: 0 - United States of America
                                "lightskyblue", "lightskyblue", "lightskyblue",    # Source Node: 1 - People's Republic of China
                                "bisque",       "bisque",       "bisque"],         # Source Node: 2 - Japan
                label = LINK_LABELS, hovertemplate="%{label}",)

data = go.Sankey(node = NODES, link = LINKS)
fig = go.Figure(data)
fig.update_layout(title="Olympics - 2021: Country &  Medals",  font_size=16)
fig.update_traces( valueformat='3d', valuesuffix=' Medals', selector=dict(type='sankey'))
fig.update_layout(hoverlabel=dict(bgcolor="lightgray",font_size=16,font_family="Rockwell"))
fig.show()

### Generalize for any number of countries

In [232]:
NUM_COUNTRIES = 5
X_POS, Y_POS = 0.5, 1/(NUM_COUNTRIES-1)
NODE_COLORS = ["seagreen", "dodgerblue", "orange", "palevioletred", "darkcyan"]
LINK_COLORS = ["lightgreen", "lightskyblue", "bisque", "pink", "lightcyan"]

source = []
node_x_pos, node_y_pos = [], []
node_labels, node_colors = [], NODE_COLORS[0:NUM_COUNTRIES]
link_labels, link_colors, link_values = [], [], [] 

# FIRST set of links and nodes
for i in range(NUM_COUNTRIES):
    source.extend([i]*3)
    node_x_pos.append(0.01)
    node_y_pos.append(round(i*Y_POS+0.01,2))
    country = df_medals['Country'][i]
    node_labels.append(country) 
    for medal in ["Gold", "Silver", "Bronze"]:
        link_labels.append(f"{country}-{medal}")
        link_values.append(df_medals[f"{medal} Medals"][i])
    link_colors.extend([LINK_COLORS[i]]*3)

source_last = max(source)+1
target = [ source_last, source_last+1, source_last+2] * NUM_COUNTRIES
target_last = max(target)+1

node_labels.extend(["Gold", "Silver", "Bronze"])
node_colors.extend(["gold", "silver", "brown"])
node_x_pos.extend([X_POS, X_POS, X_POS])
node_y_pos.extend([0.01, 0.5, 1])

# LAST set of links and nodes
source.extend([ source_last, source_last+1, source_last+2])
target.extend([target_last]*3)
node_labels.extend(["Total Medals"])
node_colors.extend(["grey"])
node_x_pos.extend([X_POS+0.25])
node_y_pos.extend([0.5])

for medal in ["Gold","Silver","Bronze"]:
    link_labels.append(f"{medal}")
    # Sum each medal type over the selected countries only
    link_values.append(df_medals[f"{medal} Medals"][:NUM_COUNTRIES].sum())
link_colors.extend(["gold", "silver", "brown"])

# Debug output: uncomment the lines below to inspect the other generated lists
# print("source", source, source_last); print("target", target)
print("node_labels", node_labels)
print("node_x_pos", node_x_pos); print("node_y_pos", node_y_pos)
# print("node_colors", node_colors)
# print("link_labels", link_labels); print("link_values", link_values)

# Display the figure
NODES = dict(pad  = 20, thickness = 20, line  = dict(color = "lightslategrey", width = 0.5),hovertemplate=" ",
            label = node_labels, color = node_colors,
            x = node_x_pos, y = node_y_pos, )
LINKS = dict(source = source, target = target, value = link_values, 
            label = link_labels, color = link_colors,
            hovertemplate="%{label}",)
data = go.Sankey( arrangement='snap', node = NODES, link = LINKS)
fig = go.Figure(data)
fig.update_traces( valueformat='3d', valuesuffix=' Medals', selector=dict(type='sankey'))
fig.update_layout(title="Olympics - 2021: Country &  Medals",  font_size=16)
fig.update_layout(hoverlabel=dict( bgcolor="grey", font_size=14, font_family="Rockwell"))
fig.show()

node_labels ['United States of America', "People's Republic of China", 'Japan', 'Great Britain', 'ROC', 'Gold', 'Silver', 'Bronze', 'Total Medals']
node_x_pos [0.01, 0.01, 0.01, 0.01, 0.01, 0.5, 0.5, 0.5, 0.75]
node_y_pos [0.01, 0.26, 0.51, 0.76, 1.01, 0.01, 0.5, 1, 0.5]
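
The generalized cell above builds two layers of links: one from each country to each medal type, and one from each medal type to a single total node. The same index arithmetic can be sketched as a standalone function. The name `build_links` and the list-of-dicts input are assumptions made for illustration, and the three rows are copied from the table earlier in this notebook.

```python
MEDALS = ["Gold", "Silver", "Bronze"]


def build_links(rows, medals=MEDALS):
    """Build node labels and link lists for country -> medal -> total flows.

    rows: list of dicts with a "Country" key and one "<medal> Medals"
    key per medal type.
    """
    n = len(rows)
    labels = [row["Country"] for row in rows] + list(medals) + ["Total Medals"]
    total_idx = n + len(medals)  # index of the final "Total Medals" node
    source, target, value = [], [], []
    # First layer: one link from every country to every medal node
    for i, row in enumerate(rows):
        for j, medal in enumerate(medals):
            source.append(i)
            target.append(n + j)
            value.append(row[f"{medal} Medals"])
    # Second layer: one link from every medal node to the total node
    for j, medal in enumerate(medals):
        source.append(n + j)
        target.append(total_idx)
        value.append(sum(row[f"{medal} Medals"] for row in rows))
    return labels, source, target, value


rows = [
    {"Country": "United States of America",
     "Gold Medals": 39, "Silver Medals": 41, "Bronze Medals": 33},
    {"Country": "People's Republic of China",
     "Gold Medals": 38, "Silver Medals": 32, "Bronze Medals": 18},
    {"Country": "Japan",
     "Gold Medals": 27, "Silver Medals": 14, "Bronze Medals": 17},
]

labels, source, target, value = build_links(rows)
print(source)  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 4, 5]
print(target)  # [3, 4, 5, 3, 4, 5, 3, 4, 5, 6, 6, 6]
print(value)   # [39, 41, 33, 38, 32, 18, 27, 14, 17, 104, 87, 68]
```

With the DataFrame from this notebook, rows in this shape could be obtained with something like `df_medals.head(NUM_COUNTRIES).to_dict("records")`.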


* What is a Sankey chart?: https://olihawkins.com/visualisation/8
* Colors are at https://developer.mozilla.org/en-US/docs/Web/CSS/color_value

### Interactivity
For huge datasets, Parallel Coordinates plots tend to get cluttered. In such cases, interactivity comes to our rescue: it lets us filter out or highlight certain sections of the data. The order of the axes can also be adjusted so that patterns or correlations across variables emerge.

The plotly parallel coordinates plots support interactions such as:
- Drag along an axis to filter regions.
- Drag the axis names across the plot to rearrange variables.