## Module 4, Activity 5: Alluvial (Sankey) Plots

Let's finish this Module by looking at how to create alluvial (or Sankey) plots with Plotly. Sankey plots are great for visualising flows from one set of nodes to another. We'll use Plotly to make a simple Sankey plot of University of California, Berkeley admissions data from 1973. The dataset is a classic example of Simpson's paradox, where a trend appears in subsets of a dataset but disappears or reverses when the subsets are combined. If you're interested, you can read more [here](https://kharshit.github.io/blog/2017/09/01/simson%27s-paradox). Let's get started by loading our libraries and the dataset.

In [20]:
# import packages/libraries
import plotly
import plotly.graph_objects as go
import pandas as pd
import numpy as np

# load dataset
df = pd.read_csv("data/UCB_data.csv")
df

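| Unnamed: 0 | Admit | Gender | Dept | Freq |
|---|---|---|---|---|
| 0 | Admitted | Male | A | 512 |
| 1 | Rejected | Male | A | 313 |
| 2 | Admitted | Female | A | 89 |
| 3 | Rejected | Female | A | 19 |
| 4 | Admitted | Male | B | 353 |
| 5 | Rejected | Male | B | 207 |
| 6 | Admitted | Female | B | 17 |
| 7 | Rejected | Female | B | 8 |
| 8 | Admitted | Male | C | 120 |
| 9 | Rejected | Male | C | 205 |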
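
One way to look for Simpson's paradox is to compare admission rates within each department against the rates with departments pooled. The sketch below assumes only the first eight rows shown above (Departments A and B), typed in by hand; `subset`, `by_dept` and `pooled` are hypothetical names. With just two departments it illustrates the calculation rather than reproducing the paradox. Running the same steps on the full `df` gives the real comparison.

```python
import pandas as pd

# First eight rows of the table above (Departments A and B only), entered by hand
subset = pd.DataFrame({
    "Admit":  ["Admitted", "Rejected"] * 4,
    "Gender": ["Male", "Male", "Female", "Female"] * 2,
    "Dept":   ["A"] * 4 + ["B"] * 4,
    "Freq":   [512, 313, 89, 19, 353, 207, 17, 8],
})

# Admission rate for each gender within each department
by_dept = subset.pivot_table(index=["Dept", "Gender"], columns="Admit",
                             values="Freq", aggfunc="sum")
by_dept["rate"] = by_dept["Admitted"] / (by_dept["Admitted"] + by_dept["Rejected"])

# Admission rate for each gender with the departments pooled together
pooled = subset.pivot_table(index="Gender", columns="Admit",
                            values="Freq", aggfunc="sum")
pooled["rate"] = pooled["Admitted"] / (pooled["Admitted"] + pooled["Rejected"])

print(by_dept["rate"].round(3))
print(pooled["rate"].round(3))
```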


Let's make a Sankey plot showing the flow of male and female applicants to each department, and whether they were admitted or rejected. We'll then look at colouring admissions and rejections by gender.

In terms of coding, the hardest part of constructing an alluvial plot is getting your data in the right format. 

In this activity, we'll have three sets of nodes: Department, Gender and Admission outcome (admitted or rejected). To build a Sankey plot we need three things: the source and destination (or target) of each flow, and the magnitude of the flow. In Plotly, sources and targets are encoded as integer indices into a list of node labels. So, we need to reshape our dataset into these three variables before we can plot it; the magnitudes come from the counts we already have in the Freq column. The code below, which does this for us, has been copied from [here](https://stackoverflow.com/questions/70293723/how-do-i-make-a-simple-multi-level-sankey-diagram-with-plotly). We've commented each step of the code, but if you're unclear on any step, take your time to understand what is happening. Of course, there are many different ways to do this, so feel free to try constructing an alternative approach (or finding one online).
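
To make the encoding concrete, here is a minimal sketch in plain Python. It assumes a toy set of four flows from Departments A and B to each gender, with totals summed from the first eight rows of the table above. The names `flows`, `toy_labels` and `index_of` are hypothetical and are not used elsewhere in this activity.

```python
# Toy flows: (source label, target label, number of applicants)
flows = [
    ("A", "Male", 512 + 313),
    ("A", "Female", 89 + 19),
    ("B", "Male", 353 + 207),
    ("B", "Female", 17 + 8),
]

# Each distinct node label gets an integer index: its position in toy_labels
toy_labels = sorted({name for src, tgt, _ in flows for name in (src, tgt)})
index_of = {label: i for i, label in enumerate(toy_labels)}

# Translate the labels of each flow into indices
toy_sources = [index_of[src] for src, _, _ in flows]
toy_targets = [index_of[tgt] for _, tgt, _ in flows]
toy_counts = [count for _, _, count in flows]

print(toy_labels)   # ['A', 'B', 'Female', 'Male']
print(toy_sources)  # [0, 0, 1, 1]
print(toy_targets)  # [3, 2, 3, 2]
print(toy_counts)   # [825, 108, 560, 25]
```

Reading the first entries together: a flow of 825 applicants runs from node 0 (A) to node 3 (Male).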

In [37]:
nodes = ['Dept', 'Gender',  'Admit'] # Note the order of the nodes here.

newDf = pd.DataFrame() # Create an empty data frame
for i in range(len(nodes)-1): # Loop over the nodes
    tempDf = df[[nodes[i],nodes[i+1],'Freq']] # Select the current node, the next node in 'nodes' and the Freq column
    tempDf.columns = ['source','target','count'] # Rename the above tempDf variables to 'source', 'target' and 'count'
    newDf = pd.concat([newDf,tempDf]) # Append (or concatenate) tempDf to newDf (which will be empty on the first loop)

newDf = newDf.groupby(['source','target']).agg({'count':'sum'}).reset_index() # Combine all replicated 'source' and 'target' rows, but add their 'count' values together

label_list = list(np.unique(df[nodes].values)) # Pull out the unique labels for our nodes (np.unique returns them sorted alphabetically)
sources = newDf['source'].apply(lambda x: label_list.index(x)) # Create source vector, converting each from label to corresponding number using apply function
targets = newDf['target'].apply(lambda x: label_list.index(x)) # Create target vector, converting each from label to corresponding number using apply function
counts = newDf['count'] # Create count vector, to give width of each flow
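
As one possible alternative to the loop above, the sketch below builds the same kind of link table with a list comprehension and a dictionary lookup, then checks that each stage of the plot carries every applicant exactly once. It assumes only the first eight rows of the dataset (Departments A and B), entered by hand; `df_small`, `node_cols`, `links` and `index_of` are hypothetical names chosen so they don't overwrite the variables used in this activity.

```python
import pandas as pd

# First eight rows of the dataset (Departments A and B only), entered by hand
df_small = pd.DataFrame({
    "Admit":  ["Admitted", "Rejected"] * 4,
    "Gender": ["Male", "Male", "Female", "Female"] * 2,
    "Dept":   ["A"] * 4 + ["B"] * 4,
    "Freq":   [512, 313, 89, 19, 353, 207, 17, 8],
})
node_cols = ["Dept", "Gender", "Admit"]

# One source/target/count frame per pair of adjacent node sets
pairs = [
    df_small[[src, tgt, "Freq"]].rename(
        columns={src: "source", tgt: "target", "Freq": "count"})
    for src, tgt in zip(node_cols[:-1], node_cols[1:])
]
# Stack the frames and add up the counts of repeated source/target pairs
links = pd.concat(pairs).groupby(["source", "target"], as_index=False)["count"].sum()

# Sorted labels, and a dictionary from each label to its index
labels = sorted(set(df_small[node_cols].to_numpy().ravel()))
index_of = {label: i for i, label in enumerate(labels)}
links["source_idx"] = links["source"].map(index_of)
links["target_idx"] = links["target"].map(index_of)

# Sanity check: both stages should sum to the total number of applicants
dept_stage = links.loc[links["source"].isin(df_small["Dept"]), "count"].sum()
gender_stage = links.loc[links["source"].isin(df_small["Gender"]), "count"].sum()
assert dept_stage == gender_stage == df_small["Freq"].sum()
```

A dictionary lookup avoids rescanning the label list for every row, which matters little here but helps with many nodes.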

We now have vectors of sources, targets and counts, plus a label_list to decode the numeric source and target vectors in our Sankey plot, so we're ready to plot. We'll use the [**Sankey**](https://plotly.github.io/plotly.py-docs/generated/plotly.graph_objects.Sankey.html) trace from Plotly's graph_objects module.

In [43]:
fig = go.Figure(go.Sankey(
    node = dict( # Dictionary of variables to define our nodes
        label = label_list, # labels corresponding to indices in source and target vectors
    ),
    link = dict( # Dictionary of variables to define our links
        source = sources, # indices into label_list (sorted alphabetically), eg, 0 = A, 1 = Admitted, 2 = B, ...
        target = targets, # same as above
        value = counts # value sets the flow volume for each connection
    )))

fig.update_layout(title="")

fig.show()

Notice that the nodes in each column are not in alphabetical order. This is because, by default, the Sankey plot orders them by flow size, in descending order.

**Exercise:** 
1) Hover your mouse over different parts of the figure above. How many females applied for a place in Department C? How many males across all Departments had their applications rejected?\
2) Add an informative title to the plot, using the title argument of fig.update_layout(title="").\
3) Experiment with the order of the nodes. You'll need to go back and change the order of the nodes list in the code section that gave us the label_list, source, target and count vectors.

Finally, what if we wanted our flow lines to have the same colour as their source node? We would first need to define a vector of colours corresponding to our labels. That vector would colour our nodes. Then, to colour the flow lines by the colour of their source node, we would need to create another vector of colours, the same length as our source vector.

In [39]:
## Vector of colours for each node label, in the same order as label_list
lab_cols = ["red", "orange", "yellow", "green", "darkgrey", "purple", "grey", "blue", "cyan", "pink"]

## Vector of colours for each flow line (where each flow line is coloured by its source node colour)
flow_cols = newDf['source'].apply(lambda x: lab_cols[label_list.index(x)])

**Exercise:** In your own words, briefly explain what the flow_cols line is doing in the code above.

In [40]:
fig = go.Figure(data=[go.Sankey(
    node = dict( # Dictionary of variables to define our nodes
        color = lab_cols, # label colours
        label = label_list, # labels corresponding to indices in source and target vectors
    ),
    link = dict( # Dictionary of variables to define our links
        color = flow_cols, # flow colours
        source = sources, # indices into label_list (sorted alphabetically), eg, 0 = A, 1 = Admitted, 2 = B, ...
        target = targets, # same as above
        value = counts # value sets the flow volume for each connection
    ))])

fig.show()

**Exercise:** 
1) Colour the two terminal nodes (Rejected or Admitted) black. HINT: You'll need to modify lab_cols.\
2) Colour each flow line in our Sankey plot by the colour of its target node.