# Living in the City of Eindhoven

Capstone project by Peter van Liesdonk

## Introduction

For the last few years I have been living in the city of Eindhoven, a small provincial town in the south of the Netherlands that is mostly known
for its high-tech industry. For this reason it is sometimes called the Silicon Valley of the Netherlands.
I like living here, and I live in a great neighbourhood. Unfortunately I cannot stay: I'm currently residing in a social housing
project, and I'm getting too rich to stay here. I know: luxury problem.

So where in this city will I buy a new house? 

To figure this out I want to compare the various neighbourhoods in Eindhoven to see how similar they are, and finally figure out if there is a neighbourhood similar to mine where I could also live.
I will do this based on the different venues and amenities available in the direct vicinity of each neighbourhood.

As in the course, I'd like to cluster neighbourhoods on similarity and show them on a map. Then I'd also like to create a decision tree that can show me the most important factors for choosing a certain neighbourhood.

## Data

To know more about Eindhoven I need as much data as possible. I found the following interesting datasets:

* A list of neighbourhoods in Eindhoven at [https://data.eindhoven.nl/explore/dataset/buurten/export/]. This includes
  * Name of neighbourhood (buurt), residential areas (wijken) and boroughs (stadsdeel) within Eindhoven,
  * their geographic coordinates,
  * their borders in GeoJSON format. 
* A table of key figures about the various neighbourhoods [https://opendata.cbs.nl/statline/#/CBS/nl/dataset/84286NED/table?ts=1546775064672], which includes things like
  * population,
  * population density, 
  * area,
  * number of houses.
* A table of distances to popular amenities [https://opendata.cbs.nl/statline/#/CBS/nl/dataset/80306ned/table?ts=1546776064982], including distances like:
  * distance to park or forest,
  * distance to supermarket.

A particular difficulty in using these datasets might be matching up the tables, as the neighbourhoods have no official names.


In addition, I will use Foursquare to find popular venues close to each of the neighbourhoods. Using Foursquare might mean that I do not actually need the last of the tables above. Another difficulty might be the sparsity of information on Foursquare, since Foursquare is not very popular in the Netherlands.


## Methodology


### Exploring the dataset

We start by importing the first dataset, a list of neighbourhoods and their geographic positions from [https://data.eindhoven.nl/explore/dataset/buurten/export/]. After loading it into pandas and cleaning up some of the column names, we get a dataset that looks as follows:

<img src="figs/buurten.png"/>

Importing the key figures dataset from [https://opendata.cbs.nl/statline/#/CBS/nl/dataset/84286NED/table?ts=1546775064672] gives us the following columns:

`['Wijken en buurten', 'Gemeentenaam', 'Soort regio', 'Codering',
       'Indelingswijziging wijken en buurten', 'Inwoners',
       'Inwoners 15 tot 25 jaar', 'Inwoners Westers totaal',
       'Inwoners Nederlandse Antillen en Aruba', 'Eenpersoonshuishoudens',
       'Bevolkingsdichtheid', 'Woningvoorraad', 'Percentage meergezinswoning',
       'Personenauto's; brandstof benzine', 'Motorfietsen', 'Oppervlakte',
       'Mate van stedelijkheid', 'Omgevingsadressendichtheid'] `
       
This requires some more cleanup. 

First of all, we only want neighbourhoods (buurten), while this dataset also contains aggregated rows for larger regions. These can be removed by filtering on the 'Soort regio' column.

Secondly, not all data is relevant. We decide to keep only the following columns, since they seem to be the most relevant to the problem at hand:

* Wijken en buurten: the neighbourhood name.
* Bevolkingsdichtheid: population density.
* Mate van stedelijkheid: urbanization, how city-like the neighbourhood is.
* Omgevingsadressendichtheid: how many addresses per square km.
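
The two cleanup steps above can be sketched in pandas as follows. This is a minimal sketch, not the exact notebook code: the rows are made-up stand-ins, and I am assuming that neighbourhood rows carry the value `Buurt` in the 'Soort regio' column, possibly padded with whitespace. Only the column names come from the table above.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the CBS table.
key_figures = pd.DataFrame({
    "Wijken en buurten": ["Eindhoven", "Wijk X", "Buurt A", "Buurt B"],
    "Soort regio": ["Gemeente", "Wijk", "Buurt ", "Buurt"],  # note the padding
    "Inwoners": [1000, 400, 250, 150],
    "Bevolkingsdichtheid": [2500, 3000, 5200, 3100],
    "Mate van stedelijkheid": [2, 2, 1, 2],
    "Omgevingsadressendichtheid": [2000, 2400, 3900, 2100],
})

relevant_columns = [
    "Wijken en buurten",
    "Bevolkingsdichtheid",
    "Mate van stedelijkheid",
    "Omgevingsadressendichtheid",
]

# Assumption: neighbourhood rows are labelled "Buurt" (after stripping padding).
is_neighbourhood = key_figures["Soort regio"].str.strip() == "Buurt"

# Keep only neighbourhood rows and only the relevant columns.
neighbourhoods = (
    key_figures.loc[is_neighbourhood, relevant_columns]
    .reset_index(drop=True)
)
print(neighbourhoods)
```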

After cleanup, the following rows still contain NaN values:

<img src="figs/nans.png"/>

We know all of these are industrial areas, and replace the NaNs with 0.
When we merge with the previous dataset, we get the following result:

<img src="figs/kerncijfers.png"/>
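
Replacing the NaNs and merging the two tables could look roughly like this. The neighbourhood names, coordinates and figures are hypothetical, and joining on the stripped name is an assumption on my part; the real tables may need more careful matching.

```python
import pandas as pd

# Hypothetical stand-in for the municipal neighbourhood list.
buurten = pd.DataFrame({
    "buurt": ["Buurt A", "Buurt B", "Industrieterrein C"],
    "lat": [51.44, 51.45, 51.46],
    "lon": [5.47, 5.48, 5.49],
})

# Hypothetical key figures; NaN marks an area without residents.
key_figures = pd.DataFrame({
    "Wijken en buurten": ["Buurt A", "Buurt B ", "Industrieterrein C"],
    "Bevolkingsdichtheid": [5200.0, 3100.0, float("nan")],
})

# Areas without residents get a population density of 0.
key_figures = key_figures.fillna({"Bevolkingsdichtheid": 0})

# Normalise the join keys so stray whitespace does not break the match.
buurten["buurt"] = buurten["buurt"].str.strip()
key_figures["Wijken en buurten"] = key_figures["Wijken en buurten"].str.strip()

merged = buurten.merge(
    key_figures,
    left_on="buurt",
    right_on="Wijken en buurten",
    how="left",
    validate="one_to_one",  # fail loudly if a name matches more than once
).drop(columns="Wijken en buurten")

# Any NaN left here would point at a neighbourhood that found no match.
unmatched = merged[merged["Bevolkingsdichtheid"].isna()]
print(merged)
```

A left join keeps every neighbourhood from the first table, so names that fail to match show up as NaN rows in `unmatched` instead of silently disappearing.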

### Putting Eindhoven on a map

We can now compile a simple choropleth map of the neighbourhoods, coloured by population density.

This map immediately shows something interesting: four neighbourhoods with a population density of zero.
This makes sense: one is an airport, two are industrial areas and the last one consists only of nature.

<img src="figs/choropleth.png"/>

### Venues from Foursquare

Next, we want to get a list of interesting venues from Foursquare. Since we are deciding in which neighbourhood we want to live, it doesn't actually matter whether a venue lies inside the neighbourhood we are searching for, only that it is close by. As such, we can use a RADIUS restriction. We set it to 500 m, since that is approximately the maximum distance I'm willing to walk.

This gives us a dataframe of venues as follows. This frame contains many duplicates, which are venues that fall within the radius of multiple neighbourhoods.

<img src="figs/venues.png"/>
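
To illustrate why the radius restriction produces duplicates, here is a small sketch that is independent of Foursquare itself. It assumes each neighbourhood is represented by a single centre point and uses the haversine formula for distances; the coordinates and venue names are made up.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres
RADIUS_M = 500              # walking distance used in the text


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Hypothetical neighbourhood centres and venues.
centres = {"Buurt A": (51.4400, 5.4800), "Buurt B": (51.4400, 5.4900)}
venues = {"Cafe X": (51.4400, 5.4850), "Park Y": (51.4400, 5.4950)}

# One row per (neighbourhood, venue) pair within walking distance.
rows = [
    (buurt, venue)
    for buurt, (clat, clon) in centres.items()
    for venue, (vlat, vlon) in venues.items()
    if haversine_m(clat, clon, vlat, vlon) <= RADIUS_M
]
print(rows)
```

In this sketch "Cafe X" sits between the two centres and therefore appears once for each neighbourhood, which is exactly the kind of duplicate seen in the venues dataframe.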

We can put all the found venues on the map as follows:

<img src="figs/allvenues.png"/>

### Clustering neighbourhoods

Next, we want to cluster all of the neighbourhoods. We will be using a K-means clustering algorithm.
In order to do this, we first turn the venues dataframe into one-hot notation and sum how often each venue occurs for each neighbourhood. To this we add population density, urbanization and housing density figures.
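
The feature construction described above can be sketched as follows. The column names `Neighbourhood` and `Venue Category` and all values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical venues frame: one row per (neighbourhood, venue) pair.
venues = pd.DataFrame({
    "Neighbourhood": ["Buurt A", "Buurt A", "Buurt A", "Buurt B"],
    "Venue Category": ["Cafe", "Cafe", "Supermarket", "Park"],
})

# One-hot encode the category, then count occurrences per neighbourhood.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
counts = onehot.groupby(venues["Neighbourhood"]).sum()

# Hypothetical key figures, indexed by neighbourhood name.
key_figures = pd.DataFrame(
    {
        "Bevolkingsdichtheid": [5200, 0],
        "Mate van stedelijkheid": [1, 5],
        "Omgevingsadressendichtheid": [3900, 100],
    },
    index=pd.Index(["Buurt A", "Buurt B"], name="Neighbourhood"),
)

# Final feature matrix: venue counts plus the key figures.
features = counts.join(key_figures)
print(features)
```

One thing to keep in mind with such a matrix: k-means works on Euclidean distances, so a column with values in the thousands (like population density) will dominate small venue counts unless the features are rescaled first.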

To find an optimal value for $k$, we run the clustering algorithm for various values and plot the silhouette score.

<img src="figs/kmeans.png"/>

We immediately see there is no clear choice for $k$. Choosing $k=2$ (the global maximum) doesn't give any real insight. We decide to go for $k=11$, which is a local maximum and might provide interesting insights.
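
The search over $k$ described above can be sketched with scikit-learn as follows. The data here are three synthetic, well-separated blobs rather than the real neighbourhood features, so the silhouette curve has a much clearer peak than the one in the plot.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the feature matrix: three well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=centre, scale=0.3, size=(20, 2))
    for centre in ([0, 0], [5, 5], [0, 5])
])

# Fit k-means for a range of k and record the mean silhouette score.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("highest score at k =", best_k)
```

The silhouette score lies between -1 and 1, and higher means tighter, better-separated clusters. On real data the curve is often flat or noisy, which is why picking a local maximum can be a reasonable judgment call.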

This results in the following clustering:

<img src="figs/clustermap.png"/>

### Decision Tree

Finally, we want to build a decision tree in order to figure out in which type of neighbourhood I should live.

We train a decision tree using the existence of certain venues as training data in order to predict the cluster labels we found above.
This gives the following result:

<img src="figs/decision.svg"/>
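
Training such a tree could look roughly like this. It is a sketch with made-up venue counts and cluster labels; the `max_depth` setting is an arbitrary choice to keep the tree readable, not a value taken from the analysis.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical venue counts per neighbourhood (rows) and category (columns).
counts = pd.DataFrame({
    "Cafe": [3, 2, 0, 0, 1, 0],
    "Park": [0, 0, 1, 2, 0, 1],
    "Supermarket": [1, 0, 1, 0, 2, 1],
})

# Hypothetical cluster labels, as produced by the clustering step.
cluster_labels = [0, 0, 1, 1, 0, 1]

# "Existence of a venue" means a binary feature: present (1) or absent (0).
presence = (counts > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(presence, cluster_labels)

# Text rendering of the learned rules, plus the weight of each feature.
print(export_text(tree, feature_names=list(presence.columns)))
importances = pd.Series(tree.feature_importances_, index=presence.columns)
print(importances.sort_values(ascending=False))
```

The feature importances give a quick ranking of which venue types separate the clusters most strongly, which is the "most important factors" question from the introduction.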

## Results

## Discussion


## Conclusion
