# IngoScholtes/kdd2018-tutorial

Fetching contributors…
Cannot retrieve contributors at this time
101 lines (67 sloc) 3.62 KB
 #%% import markdown from IPython.core.display import display, HTML def md(str): display(HTML(markdown.markdown(str + "
"))) #%% md(""" ## Sparse flight data A key question for the generation of sparse state networks is _how_ sparse. If we lump all state nodes with each physical node, we lose all higher-order information and may underfit. On the other hand, keeping all second-order state nodes may overfit. In this tutorial we will generate second-order state networks from path data and from there generate multiple sparse networks with different number of (lumped) state nodes and evaluate the result with Infomap """) #%% md(""" ### Generate training and validation sets To get a bigger network, we can merge the flight path data from the four quarters (`"data/air2015_{q}_paths.net" for q in [1,2,3,4]`). But to evaluate the goodness of fit, we can split each path randomly in either a _training_ or a _validation_ set and write a path data file for each of the data set. **TODO:** - Write a function that merges all paths of the year and writes it to a _training_ paths file with 50% chance and to a _validation_ paths file otherwise. Skip the '*vertices' section """) #%% In [1] # TODO: Fill code here #%% md(""" #### Generate state networks from paths **TODO:** - Use Infomap to generate second-order state networks from the two paths data files. """) #%% In [2] # TODO: Fill code here #%% md(""" ### Generate _sparse_ state networks Here we will generate multiple lumped state networks with different amount of state nodes. A simple way is to parameterise this with a cluster rate \$r\$ going from 0.1 to 1, where `n_clusters = max(1, int(r * numStateNodes)`. For convenience, you can just send in the argument `clusterRate` to `clusterStateNodes` to achieve this, instead of the cluster function in the previous tutorial. **TODO:** - Read in the training network with `StateNetwork` - Calculate entropy rate - Cluster the network for all cluster rates \$r\$ in for example `np.linspace(0.1, 1, 10)`. - Save the number of lumped state nodes and the lumped entropy rate """) #%% In [3] # TODO: Fill code here #%% md(""" #### How much information do we lose as we reduce the number of state nodes? **TODO:** - Plot the entropy rate against the number of state nodes - Check that the entropy rates approaches the original one and coincides at cluster rate \$r = 1\$ """) #%% In [4] # TODO: Fill code here #%% md(""" Note that the original number of state nodes can be much larger than the maximum in the lumped state networks due to dangling nodes which are lumped implicitly. """) #%% md(""" ### Validate with Infomap The goal here is to calculate the codelength for the validation network, given the different partitions found on the lumped training networks. **TODO:** - Run Infomap on all lumped state networks and write a `.clu` file for each and store codelength - Run Infomap on the validation network but with cluster data from external file for all `.clu` files generated from the lumped networks and store the codelength - Plot the training and validation codelengths against the number of state nodes and check if there is an optimum that balances underfit and overfit Note: Use the `--two-level` flag to Infomap here to find an optimum two-level solution that will be stored in the `.clu` files. These can be read into infomap with the `--cluster-data [clusters.clu]` option. Add `--no-infomap` to not continue the clustering algorithm after the initial clusters has been incorporated. Then the codelength after the run will be the codelength for the specified input clustering. """) #%% In [12] # TODO: Fill code here #%% In [13] # TODO: Fill code here