# MGT - 416 Network Analytics - Final project report - CFF Railway Network
### Clément Catajar & Cedric Cook
Due December 12, 2017

## Abstract

Epidemiology helps us understand how a new virus or bacterium spreads across a country and what impact it has. Being able to predict the spread of a given virus can inform political, medical, economic and sociological decisions: if governments know which cities are critical, they can adopt policies to limit or respond to an epidemic.

Therefore, in this report, we explore the CFF railway and bus network to study how a virus spreads and to identify the most influential spreaders. We analyze this phenomenon with a modified SIR model: starting from the basic SIR infection model, we assign each stop a probability of infection that is proportional to the population of the city and depends on the type of stop (bus or train), and we determine the most influential spreaders among all the cities in Switzerland.

This project aims to answer the following questions:
- What percentage of Switzerland's population is reached when a virus starts spreading from a given Swiss city?
- Which starting cities are the most critical for an infection?

The work on these questions was done in several parts. First, we acquired all the data for the network and prepared it for further analysis. The analysis itself then had two main parts: we first computed the basic and centrality measures in order to understand the network's behavior, and we then built a modified SIR model to evaluate the spread of an infection in the network.

TO DO !! 
Main results, answer to the questions and conclusion

### Table of Contents

1. <a href='#Abstract'>Abstract</a>
2. <a href='#Introduction'>Introduction</a>
3. <a href='#Data'>Data Acquisition and Preparation</a>
4. <a href='#OverviewAnalysis'>Overview of Analysis</a>
5. <a href='#Analysis'>Analysis</a>
6. <a href='#Results'>Results and Interpretation</a>
7. <a href='#Conclusion'>Conclusion</a>
8. <a href='#Appendices'>Appendices</a>

<a id='Introduction'></a>
## Introduction

Understanding how a virus behaves within a country is one of the key topics for public health policy and epidemic prevention. Moreover, public transport such as trains and buses is one of the common ways for an infection to spread across a country. Therefore, in this report, we analyze the CFF railway and bus network in Switzerland to study how an infection spreads. We will try to answer the following questions:
- What percentage of Switzerland's population is reached when a virus starts spreading from a given Swiss city?
- Which starting cities are the most critical for an infection?

To address these questions, we first present how we acquired and prepared the data. We then give an overview of the analysis and carry it out, and finally we interpret the results and answer the questions.

<a id='Data'></a>
## Data acquisition and preparation

### Data acquisition

The data used to generate the network were taken from https://opentransportdata.swiss/fr/datasetList. This portal provides transportation data for Switzerland. For this analysis we used the train and bus schedules for the coming year, 2018.

Different files are available in this dataset. The structure of this dataset is explained by the following UML Diagram: 

<img src="UML Data.png">

The useful files in this dataset are the following:
- The file "routes" describes the type of each route (Bus, InterRegio, InterCity, etc.)
- The file "trips" maps each trip (journey) to a route
- The file "stops" contains the data on all the stops (bus stops and train stations) in Switzerland
- The file "stop times" contains the sequence of stops for each trip and the scheduled time of each train and bus

For simplicity and because our analysis does not take into account time factors, we will not consider the data on the dates and the times for each trip. 

### Data preparation

All the data cleaning and preparation can be found in the "Data Cleaning Final Project" notebook. Please note that this notebook takes a long time to run because of the size of the dataset (several hours for some cells).
The work done to prepare the data for the analysis is described hereafter:

- __Identification of the relevant routes, trips and stops__

In this first part, we cleaned the data and kept only the routes, trips and stops corresponding to buses or trains. Following the UML diagram in the previous section, we first select the route_id values of the relevant route types, then extract the trip_id values belonging to these routes, and finally extract the stops served by these trips.
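
The selection chain described above (routes, then trips, then stops) can be sketched as follows. This is a minimal illustration and not the code of the cleaning notebook: the column names follow the usual GTFS convention, and the inline data, the `RELEVANT_TYPES` set and the helper names are assumptions made for the example.

```python
import csv
import io

# Tiny inline stand-ins for the "routes", "trips" and "stop times" files.
# Column names follow the usual GTFS convention and are assumptions here.
ROUTES_CSV = """route_id,route_desc
r1,InterCity
r2,Bus
r3,Tram
"""
TRIPS_CSV = """trip_id,route_id
t1,r1
t2,r2
t3,r3
"""
STOP_TIMES_CSV = """trip_id,stop_id,stop_sequence
t1,A,1
t1,B,2
t2,B,1
t2,C,2
t3,D,1
t3,E,2
"""

# Hypothetical set of route types treated as "train or bus".
RELEVANT_TYPES = {"InterCity", "InterRegio", "Bus"}


def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def select_relevant(routes, trips, stop_times, relevant_types):
    """Follow routes -> trips -> stops and keep only the relevant IDs."""
    route_ids = {r["route_id"] for r in routes if r["route_desc"] in relevant_types}
    trip_ids = {t["trip_id"] for t in trips if t["route_id"] in route_ids}
    stop_ids = {s["stop_id"] for s in stop_times if s["trip_id"] in trip_ids}
    return route_ids, trip_ids, stop_ids


route_ids, trip_ids, stop_ids = select_relevant(
    read_rows(ROUTES_CSV),
    read_rows(TRIPS_CSV),
    read_rows(STOP_TIMES_CSV),
    RELEVANT_TYPES,
)
print(sorted(route_ids), sorted(trip_ids), sorted(stop_ids))
```

Using sets for the intermediate IDs keeps each membership test cheap, which matters for a dataset of this size.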

- __Identification of trips sequence__

The second part of the preparation identified the sequence of stops for each trip. To do this, we analyzed the "stop_times" file and extracted the ordered list of stops for each trip.
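
As an illustration of this step, the sketch below groups stop-time rows by trip and orders them by their sequence number. The row layout (`trip_id`, `stop_id`, `stop_sequence`) is the usual GTFS one and is assumed here, and the function names are hypothetical.

```python
from collections import defaultdict


def stop_sequences(stop_time_rows):
    """Group (trip_id, stop_id, stop_sequence) rows into ordered stop lists."""
    by_trip = defaultdict(list)
    for row in stop_time_rows:
        by_trip[row["trip_id"]].append((int(row["stop_sequence"]), row["stop_id"]))
    # Sort numerically: sequence numbers read from a CSV file are strings,
    # and "10" would sort before "2" if compared as text.
    return {trip: [stop for _, stop in sorted(pairs)] for trip, pairs in by_trip.items()}


def consecutive_pairs(sequence):
    """Return the (from_stop, to_stop) pairs of consecutive stops in a trip."""
    return list(zip(sequence, sequence[1:]))


# Rows deliberately given out of order to show that sorting is needed.
rows = [
    {"trip_id": "t1", "stop_id": "B", "stop_sequence": "2"},
    {"trip_id": "t1", "stop_id": "A", "stop_sequence": "1"},
    {"trip_id": "t1", "stop_id": "C", "stop_sequence": "10"},
    {"trip_id": "t2", "stop_id": "C", "stop_sequence": "1"},
    {"trip_id": "t2", "stop_id": "B", "stop_sequence": "2"},
]
sequences = stop_sequences(rows)
print(sequences)
print(consecutive_pairs(sequences["t1"]))
```

The consecutive pairs are what later become the edges of the network.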

- __Population preparation__

Ideally, in a transport network we would use journey data to determine how many people board or leave a train at a given stop, and then use these numbers as probabilities in the epidemic model. Regrettably, this data is not available, since it is not really recorded, so we had to resort to an alternative method: matching stops with population data.

From [admin.ch](https://www.pxweb.bfs.admin.ch/pxweb/en/px-x-0102020000_401/px-x-0102020000_401/px-x-0102020000_401.px/table/tableViewLayout2/?rxid=ad5c6be1-7da0-49f6-834d-1b346f731e91) we gathered census data for each Swiss commune.

This data is used for two purposes:
1. Find the largest Swiss cities, with more than 24'000 inhabitants. In a radius of 5km around these cities, remove all bus trips that are entirely contained within this radius. The purpose of this purging is to treat cities as one blob instead of a tiny web of many small stops. Please refer to the discussion for comments on this point.
2. Match train stops with population data, which was possible for a bit less than half of the train stops. This lets us use the population of a city, relative to the population of the largest city, as the probability that the infection spreads to that stop when passing by.
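
Both uses of the population data can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the notebook's implementation: distances are computed with the haversine formula, a trip is dropped only if every stop lies within the radius, and the coordinates and population figures are made-up placeholders.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def trip_inside_city(trip_stops, city_center, radius_km=5.0):
    """True if every stop of the trip lies within radius_km of the city centre."""
    return all(
        haversine_km(lat, lon, city_center[0], city_center[1]) <= radius_km
        for lat, lon in trip_stops
    )


def relative_population(population, largest_population):
    """Population relative to the largest city, used as an infection probability."""
    return population / largest_population


# Placeholder coordinates (latitude, longitude), not taken from the dataset.
city_center = (46.52, 6.63)
local_bus_trip = [(46.52, 6.63), (46.53, 6.64)]      # stays near the centre
regional_bus_trip = [(46.52, 6.63), (46.20, 6.14)]   # leaves the city

print(trip_inside_city(local_bus_trip, city_center))     # candidate for removal
print(trip_inside_city(regional_bus_trip, city_center))  # kept
print(relative_population(100_000, 400_000))             # made-up populations
```

With this definition the largest city gets probability 1 and every other matched stop gets a value between 0 and 1.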


- __Network generation with attributes__

The last part of the data preparation was the generation of the network with all its attributes. To do this, we first create all the nodes from the relevant stops identified earlier, with the attributes 'Longitude', 'Latitude', 'Population' and 'NodeType'. The NodeType attribute has only two possible values: Train or Bus. The Population attribute is set to -1 if the stop is not linked to a known city, to 0 if it is a bus stop, and to the actual population of the city for that city's train stop. The edges are then generated from the previously identified sequence of stops for each trip.
Finally, the graph is saved in a .gml file.
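
The encoding of the Population attribute can be illustrated with plain dictionaries. This is a sketch only: the input layout, the helper names and the sample values are hypothetical, and the order of the checks (bus stops first, then unmatched train stops) is an assumption about how the three cases are told apart.

```python
def population_attribute(node_type, matched_population):
    """Encode the Population attribute of a stop.

    Assumed precedence: bus stops get 0, train stops without a matched
    city get -1, and matched train stops get the city's population.
    """
    if node_type == "Bus":
        return 0
    if matched_population is None:
        return -1
    return matched_population


def build_nodes(stops, population_by_stop):
    """Build a node-attribute dictionary keyed by stop ID."""
    nodes = {}
    for stop in stops:
        matched = population_by_stop.get(stop["stop_id"])
        nodes[stop["stop_id"]] = {
            "Longitude": stop["lon"],
            "Latitude": stop["lat"],
            "NodeType": stop["type"],
            "Population": population_attribute(stop["type"], matched),
        }
    return nodes


# Hypothetical stops and a hypothetical stop-to-population match.
stops = [
    {"stop_id": "train_1", "lat": 46.52, "lon": 6.63, "type": "Train"},
    {"stop_id": "train_2", "lat": 46.60, "lon": 6.70, "type": "Train"},
    {"stop_id": "bus_1", "lat": 46.53, "lon": 6.64, "type": "Bus"},
]
population_by_stop = {"train_1": 140_000}

nodes = build_nodes(stops, population_by_stop)
for stop_id, attributes in nodes.items():
    print(stop_id, attributes)
```

A dictionary of this shape can then be handed to a graph library as node attributes before the edges are added.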

<a id='OverviewAnalysis'></a>
## Overview of analysis

With all the data cleaned and prepared, we can now start our analysis. Its aim is first to understand how the network is structured and which nodes are the most important, and then to answer the following questions:
- What percentage of Switzerland's population is reached when a virus starts spreading from a given Swiss city?
- Which starting cities are the most critical for an infection?

The analysis available in the next section will have the following structure:
- First we will import the graph and print the basic information about it
- We will then look at the different communities and treat the communities themselves as nodes
- We will look at different basic measures on the network
- We will look at centrality measures (in-degree, out-degree, closeness, betweenness, PageRank)
- We will build our Infection Model

#### The infection model

In our model, a given stop or list of stops starts out infected (i.e. a person carrying the disease is at this stop). We then use probabilities to model how other stops in the network get infected, in the same way that a common disease would be carried through normal public transport.

Therefore we chose the SIR model. However, the NDlib SIR model only has model-wide `beta` and `gamma` parameters, which represent the global probability that a node gets infected and the global probability that an infected node gets removed. Since we wanted to use a per-node probability of infection, we modified the NDlib library slightly to move the `beta` parameter to the node level. The modified library is available here: [NDLib Modified](https://github.com/CedricCook/ndlib).
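
To make the idea of a node-level `beta` concrete, here is a small standalone sketch of an SIR update with per-node infection probabilities. It is not the code of the modified library. It assumes a synchronous update, an undirected adjacency list, and that a susceptible node with at least one infected neighbour becomes infected with its own probability `beta[node]`. All names are hypothetical.

```python
import random


def sir_step(adjacency, status, beta, gamma, rng):
    """One synchronous SIR update; status maps node -> "S", "I" or "R"."""
    new_status = dict(status)
    for node, state in status.items():
        if state == "S":
            has_infected_neighbor = any(status[n] == "I" for n in adjacency[node])
            # Per-node infection probability instead of a model-wide beta.
            if has_infected_neighbor and rng.random() < beta[node]:
                new_status[node] = "I"
        elif state == "I":
            # gamma stays a single, model-wide removal probability.
            if rng.random() < gamma:
                new_status[node] = "R"
    return new_status


def run_sir(adjacency, seeds, beta, gamma, iterations, seed=0):
    """Run the model from the given seed nodes and return the final status."""
    rng = random.Random(seed)
    status = {node: ("I" if node in seeds else "S") for node in adjacency}
    for _ in range(iterations):
        status = sir_step(adjacency, status, beta, gamma, rng)
    return status


def reached(status):
    """Nodes that were infected at some point (currently infected or removed)."""
    return {node for node, state in status.items() if state != "S"}


# A four-stop line A - B - C - D with made-up per-node probabilities.
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
beta = {"A": 1.0, "B": 0.5, "C": 0.1, "D": 0.1}

final = run_sir(adjacency, seeds={"A"}, beta=beta, gamma=0.2, iterations=20)
print(final, reached(final))
```

Running the model once per candidate starting stop and comparing the set of reached nodes is one way to rank starting cities.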

<a id='Analysis'></a>
## Analysis

CopyPaste Code here

<a id='Results'></a>
## Results and interpretation

ToDo 

## Discussion

### Trains? Busses? Both?

For this project it was possible to consider train data, bus data, or both. In the midterm that Clément did before this project, he focused only on train data. The train data set has around 2200 nodes and 2500 edges. Upon further inspection we found that the network resembles a tree, which makes it less interesting for advanced analysis. Mainly for this reason we decided to also include the bus stops and routes for the final project, bringing the totals to around 18'000 nodes and 22'000 edges. The bus routes do increase the average degree a little, especially around the main train stations, but the rest of the routes are purely linear as well, so overall the difference is modest.

Our objective for this project was to observe how an infection spreads throughout the country of Switzerland, and we considered the spread at the neighbourhood (intra city) level to be not so relevant. Therefore we removed all bus lines that start and end within a given radius of a large city. In hindsight this proved a very costly process, and was not very beneficial at all to the final results.

### A spatial infection or a dynamic infection?

We chose to use an adapted SIR model. In our version of the SIR model, each node has a probability of getting infected that depends on its population, provided that one of its neighbors is infected. This means that the epidemic can easily die out in an early iteration, even though in the real world a person is likely to stay on the transport network for at least a few stops.

### Edge weights

When constructing the graph, we append a trip ID to an edge if the edge already exists in the graph. We have no further concept of edge weight. We could have used the number of times an edge is added as its weight, and then scaled the probability of the epidemic spreading to another node by the weight of the edge leading to that node.
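
The weighting idea can be sketched as follows. The example counts how many trips use each edge and scales a node's infection probability linearly by the edge weight relative to the heaviest edge. The linear scaling rule and all names are assumptions made for illustration, since this variant was not implemented in the project.

```python
from collections import defaultdict


def collect_edges(sequences):
    """Map each directed edge (from_stop, to_stop) to the trip IDs using it."""
    edge_trips = defaultdict(list)
    for trip_id, stops in sequences.items():
        for edge in zip(stops, stops[1:]):
            # Append the trip ID whether or not the edge was seen before.
            edge_trips[edge].append(trip_id)
    return dict(edge_trips)


def edge_weights(edge_trips):
    """Use the number of trips on an edge as its weight."""
    return {edge: len(trips) for edge, trips in edge_trips.items()}


def scaled_probability(base_probability, weight, max_weight):
    """Scale an infection probability by the relative weight of the edge.

    Assumed rule: linear scaling, so the heaviest edge keeps the base value.
    """
    return base_probability * weight / max_weight


# Hypothetical trips and their ordered stops.
sequences = {
    "t1": ["A", "B", "C"],
    "t2": ["A", "B"],
    "t3": ["B", "C"],
    "t4": ["A", "B"],
}
edge_trips = collect_edges(sequences)
weights = edge_weights(edge_trips)
max_weight = max(weights.values())

print(weights)
print(scaled_probability(0.6, weights[("B", "C")], max_weight))
```

Other scalings, such as a logarithm of the trip count, would dampen the effect of very busy edges and could be substituted in `scaled_probability`.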

<a id='Conclusion'></a>
## Conclusion

<a id='Appendices'></a>
## Appendices

### Project structure

ToDo