# Sankey diagram with Highcharts and Python

&#9733;<i>Thomas Roca, PhD, Data Officer, French Development Agency  &#9733;</i>
*Version July 2017* v1

---

The next Highcharts release will include many new features, among them Sankey diagrams!
An early release was shared for testing purposes - thanks [@MusMekh](https://twitter.com/MusMekh)!
You can find it on this [JSFiddle](http://jsfiddle.net/gh/get/library/pure/highcharts/highcharts/tree/samples/highcharts/studies/sankey-diagram/).

This folder contains a Python script that cleans and prepares the data, then writes the HTML file hosting the diagram.

For this example we use data from the United Nations High Commissioner for Refugees, more precisely the [population statistics](http://popstats.unhcr.org/en/overview), which record the number of Syrian refugees and asylum seekers in 2016 and their destination.

The extracted data cover displaced people classified as 'Refugees (incl. refugee-like situations)' or 'Asylum-seekers (pending cases)' who moved from Syria to any other country. The corresponding database is available as a CSV in this [GitHub folder](https://github.com/ThomasRoca/Sankey-graph-highchart-python).

For readability, we display only the countries that received more than 1,000 refugees or asylum seekers.

## I. Organizing the data

What we have is a CSV and what we need is something like this:

    [
      ["Syrian Arab Rep.", "Rep. of Korea", 1120.0],
      ["Syrian Arab Rep.", "Morocco", 3242.0],
      ["Syrian Arab Rep.", "Libya", 19508.0],
       //....
     ]
     
To proceed, we will read the CSV and write our own data file in the expected format.
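
As a minimal sketch of that target format, the standard `json` module can serialise a list of `[from, to, weight]` rows into a JavaScript assignment. The rows are the three sample rows shown above, and `to_js_variable` is a hypothetical helper name, not part of the original script.

```python
import json


def to_js_variable(name, rows):
    """Return a JavaScript statement assigning `rows` to the variable `name`."""
    # json.dumps escapes quotes inside country names and emits no trailing comma
    return "var " + name + " = " + json.dumps(rows) + ";"


# Sample rows copied from the expected format above
rows = [
    ["Syrian Arab Rep.", "Rep. of Korea", 1120.0],
    ["Syrian Arab Rep.", "Morocco", 3242.0],
    ["Syrian Arab Rep.", "Libya", 19508.0],
]

js_line = to_js_variable("dataUNHCR", rows)
print(js_line)
```

Writing `js_line` to `data.js` would give the page a `dataUNHCR` variable in the shape the chart expects.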

In [71]:
# Only pandas and numpy are needed for the data preparation
import pandas as pd
import numpy as np

# Read data
datafile='unhcr_popstats_export_persons_of_concern_2017_07_26_171219.csv'
folder="https://raw.githubusercontent.com/ThomasRoca/Sankey-graph-highchart-python/master/"
dataset = pd.read_csv(folder+datafile, encoding='latin1', skiprows=3)

# The raw file contains "*" where a value is not applicable or not available.
dataset['Refugees (incl. refugee-like situations)']=dataset['Refugees (incl. refugee-like situations)'].replace('*',np.nan)
dataset['Asylum-seekers (pending cases)']=dataset['Asylum-seekers (pending cases)'].replace('*',np.nan)

dataset=dataset.sort_values(by=['Origin'], ascending=False)

# What we want here is the count of asylum seekers and refugees, so we sum these two columns.
# NaN and 0 carry different information, but for our purpose we proceed this way:
# a number plus NaN gives NaN, so such rows fail the "> 1000" test below and are left out.
dataset['Refugees (incl. refugee-like situations)']=dataset['Refugees (incl. refugee-like situations)'].astype(float)
dataset['Asylum-seekers (pending cases)']=dataset['Asylum-seekers (pending cases)'].astype(float)
dataset['Total']=dataset['Asylum-seekers (pending cases)']+dataset['Refugees (incl. refugee-like situations)']

# We organize the data as ['from', 'to', 'weight'] and store the corresponding matrix in an external file (data.js)
file = open("data.js", "w")
file.write("var dataUNHCR= [")
# Loop over the dataframe and keep flows above 1,000 people whose origin is
# Syria or Iraq, skipping flows between these two countries
for row in range(len(dataset.index)):
    origin = dataset["Origin"].iloc[row]
    destination = dataset["Country / territory of asylum/residence"].iloc[row]
    total = dataset["Total"].iloc[row]
    from_syria = origin == 'Syrian Arab Rep.' and destination != 'Iraq'
    from_iraq = origin == 'Iraq' and destination != 'Syrian Arab Rep.'
    if total > 1000 and (from_syria or from_iraq):
        file.write('["' + origin + '","' + destination + '",' + str(total) + '],')

file.write("]")
file.close()
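
The same cleaning and filtering can also be expressed without an explicit loop. The sketch below is an alternative, not the method used in this script. It runs on a small made-up table (the figures are illustrative, not UNHCR data) with shortened, hypothetical column names, and relies on `pd.to_numeric(errors="coerce")` to turn `*` into NaN.

```python
import pandas as pd

# Made-up rows for illustration only; these are not UNHCR figures
toy = pd.DataFrame({
    "Origin": ["Syrian Arab Rep.", "Syrian Arab Rep.", "Syrian Arab Rep.",
               "Iraq", "Iraq", "Iraq"],
    "Destination": ["Morocco", "Iraq", "Libya",
                    "Jordan", "Syrian Arab Rep.", "Germany"],
    "Refugees": ["3000", "5000", "*", "2000", "4000", "1200"],
    "Asylum-seekers": ["242", "100", "1500", "*", "50", "300"],
})

# "*" cannot be parsed as a number, so errors="coerce" turns it into NaN
refugees = pd.to_numeric(toy["Refugees"], errors="coerce")
seekers = pd.to_numeric(toy["Asylum-seekers"], errors="coerce")

# Plain "+" propagates NaN: a row with one missing value gets a NaN total
toy["Total"] = refugees + seekers

# Same conditions as the loop: Syria or Iraq as origin, no flows between the two
from_syria = (toy["Origin"] == "Syrian Arab Rep.") & (toy["Destination"] != "Iraq")
from_iraq = (toy["Origin"] == "Iraq") & (toy["Destination"] != "Syrian Arab Rep.")

# A NaN total compares as False against the threshold, so those rows drop out
selected = toy[(toy["Total"] > 1000) & (from_syria | from_iraq)]

rows = selected[["Origin", "Destination", "Total"]].values.tolist()
print(rows)
```

If a missing value should count as zero instead, `refugees.add(seekers, fill_value=0)` is one option, but it would change which rows pass the threshold.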

## II. Write the Sankey Diagram

We now have a dataset (data.js) to feed the diagram, so we can write the HTML file that contains the data visualization.
To simplify the code, we stored the Highcharts JavaScript code of the Sankey diagram in a separate file, "sankey.js".
The page loads this file, along with the data file, through `<script>` tags.
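
Chart settings such as the title can be injected into the page with the standard library's `string.Template`. This is only a sketch under assumptions: the `$title` and `$data_var` placeholders and the shortened page skeleton are hypothetical, and the page written below hard-codes these values instead.

```python
from string import Template

# Shortened, hypothetical page skeleton; $title and $data_var are placeholders
page = Template('''<div id="container"></div>
<script>
Highcharts.chart('container', {
    title: { text: '$title' },
    series: [{
        keys: ['from', 'to', 'weight'],
        data: $data_var,
        type: 'sankey'
    }]
});
</script>
''')

# substitute() raises KeyError if a placeholder is left without a value
html_snippet = page.substitute(
    title="Refugees and asylum seekers by destination in 2016",
    data_var="dataUNHCR",
)
print(html_snippet)
```

Any literal `$` in the page would have to be written as `$$` so that `Template` does not treat it as a placeholder.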

In [68]:
from string import Template
from IPython.display import HTML
import codecs

html= '''
<!DOCTYPE html>
<html>
<head>
  <meta http-equiv="content-type" content="text/html; charset=UTF-8">
   <style type="text/css">
    #container {
    width: 800px;
    height: 800px;
    margin: 1em auto;
    border: 0px ;
}

#csv {
display: none;
}
  </style>

  <title>Highcharts Demo</title>
 
</head>

<body>
<script src="https://code.highcharts.com/highcharts.js"></script>
<script src="https://code.highcharts.com/modules/exporting.js"></script>
<script src="sankey.js"></script>
<script src="data.js"></script>
    
<div id="container"></div>

<script type='text/javascript'>//<![CDATA[

Highcharts.chart('container', {
    title: {
        text: 'Highcharts Sankey <br>Refugees and asylum seekers from Syria by destination in 2016'
    },
    subtitle: {
        text: 'Data source: UNHCR Population Statistics Reference Database 2017'
    },
    series: [{
        keys: ['from', 'to', 'weight'],
        data: dataUNHCR,
        type: 'sankey',
        name: 'Refugees and asylum seekers'
    }]
});

//]]> 

</script>
</body>
</html>
'''
    
# The with block closes the file automatically, so no explicit close() is needed
with codecs.open("dataviz.html", "w", "utf-8-sig") as f:
    f.write(html)

In [None]:
HTML('<iframe src="http://stats4dev.com/dataviz/sankey/dataviz.html" scrolling="no" frameborder="0" width="100%" height="875px"></iframe>')

<iframe src="http://stats4dev.com/dataviz/sankey/dataviz.html" scrolling="no" frameborder="0" width="100%" height="875px"></iframe>