# Edinburgh: Where to move?
<hr/>

<img src="Edinburgh.jpg">

## Introduction
This project considers potential areas for a family to move to in the city of Edinburgh, Scotland. Areas are recommended based on their similarity to the area the family currently lives in, using k-Means clustering on the relative presence of different types of venues returned by the Foursquare API. The intended audience is real estate agents in Edinburgh, who may use the resulting clusters to recommend particular areas to move to.

## Data
The list of Edinburgh postcode areas (outcodes) can be found on the following Wikipedia page
https://en.wikipedia.org/wiki/EH_postcode_area which I will scrape using the Beautiful Soup package. The top rows of that table are shown below:

|Postcode District|Post Town|Coverage|Local Authority Area(s)|
|:--|:--|:--|:--|
|EH1|Edinburgh|Mostly consists of Edinburgh's Old Town. Also hosts the old GPO building (at EH1 1AA) and the areas immediately to the north of this are also included, that is St. James Centre and the areas down Leith Street and Broughton Street.|   |
|EH2|Edinburgh|The New Town and central commercial area of Edinburgh which includes Princes Street.|   |
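
A minimal sketch of how a table like this could be scraped with Beautiful Soup. The function name `parse_outcode_table`, the assumption that the table carries the `wikitable` class, and the shortened sample HTML are illustrative and are not taken from the project's own code.

```python
from bs4 import BeautifulSoup


def parse_outcode_table(html):
    """Return one dict per data row of the first 'wikitable' in the page."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the postcode districts sit in the first table with class "wikitable".
    table = soup.find("table", class_="wikitable")
    if table is None:
        return []
    rows = table.find_all("tr")
    if not rows:
        return []
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]
    records = []
    for row in rows[1:]:
        cells = [cell.get_text(" ", strip=True) for cell in row.find_all(["th", "td"])]
        # Skip rows whose cell count does not match the header (e.g. merged cells).
        if len(cells) == len(headers):
            records.append(dict(zip(headers, cells)))
    return records


# Small stand-in for the downloaded page, so the sketch runs without network access.
sample_html = """
<table class="wikitable">
  <tr><th>Postcode District</th><th>Post Town</th><th>Coverage</th></tr>
  <tr><td>EH1</td><td>Edinburgh</td><td>Mostly consists of Edinburgh's Old Town.</td></tr>
  <tr><td>EH2</td><td>Edinburgh</td><td>The New Town and central commercial area.</td></tr>
</table>
"""

records = parse_outcode_table(sample_html)
print([record["Postcode District"] for record in records])
```

For the real page, the HTML string would come from downloading the Wikipedia URL first, for example with `urllib.request.urlopen`.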

I will then use the Lookup Outward Code API on Postcodes.io (http://postcodes.io/) to find the latitude and longitude of these locations. An example of the output is below:

`{'outcode': 'EH7',
 'longitude': -3.16464211780104,
 'latitude': 55.9602252041884,
 'northings': 674747,
 'eastings': 327387,
 'admin_district': ['City of Edinburgh'],
 'parish': [],
 'admin_county': [],
 'admin_ward': ['Leith Walk',
  'Inverleith',
  'Craigentinny/Duddingston',
  'City Centre'],
 'country': ['Scotland']}`
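
The lookup could be sketched as below, using only the standard library. It assumes the API is served from `https://api.postcodes.io/outcodes/<outcode>` and wraps the record shown above in a `result` field; the helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API_ROOT = "https://api.postcodes.io"


def outcode_url(outcode):
    """Build the lookup URL for one outward code, e.g. 'EH7'."""
    return f"{API_ROOT}/outcodes/{quote(outcode.strip().upper())}"


def extract_coordinates(payload):
    """Pull (latitude, longitude) out of a decoded API response, or None if absent."""
    result = payload.get("result")
    if not result:
        return None
    return result["latitude"], result["longitude"]


def lookup_outcode(outcode):
    """Fetch one outcode over the network and return its coordinates."""
    with urlopen(outcode_url(outcode), timeout=10) as response:
        return extract_coordinates(json.load(response))


# Offline check using the EH7 values from the example output above.
sample = {
    "status": 200,
    "result": {
        "outcode": "EH7",
        "longitude": -3.16464211780104,
        "latitude": 55.9602252041884,
    },
}
print(outcode_url("eh7"))
print(extract_coordinates(sample))
```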

I will then use the Foursquare API to return venues around these locations. These venues will then be used to cluster the areas using the k-Means algorithm in the scikit-learn package.

## Methodology
### Data collection and cleaning
I used the Beautiful Soup package to scrape the Edinburgh 'outcodes' from the Wikipedia page. I then used the Postcodes.io outcodes API to get the latitude and longitude of those outcodes. I saw that there were some special postcodes, e.g. the one for the Scottish Parliament. These were not areas that someone could live in, so I removed them by filtering on the phrase "Special postcode" appearing in the 'coverage' column. I then used the Foursquare API with the explore parameter and the latitude and longitude of each outcode to get the venues around each postcode.
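
The filtering and venue-collection steps might look roughly like this. The sketch assumes the legacy Foursquare v2 `venues/explore` endpoint and its `response.groups[].items[].venue` response layout; the `radius` and `version` defaults, the helper names and the sample rows are all illustrative assumptions rather than the values used in the project.

```python
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def drop_special_postcodes(records):
    """Keep only rows whose coverage text does not mention 'Special postcode'."""
    return [r for r in records
            if "special postcode" not in r.get("Coverage", "").lower()]


def explore_url(lat, lng, client_id, client_secret,
                radius=1000, limit=100, version="20180605"):
    """Build an explore request around one point (radius/version are assumed values)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


def flatten_venues(outcode, payload):
    """Turn one decoded explore response into flat rows tagged with the search outcode."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories") or [{}]
            rows.append({
                "outcode": outcode,
                "venue": venue["name"],
                "lat": venue["location"]["lat"],
                "lng": venue["location"]["lng"],
                "category": categories[0].get("name"),
            })
    return rows


# Illustrative inputs only; the venue below is made up.
areas = drop_special_postcodes([
    {"Postcode District": "EH7", "Coverage": "Residential and commercial area."},
    {"Postcode District": "EH99", "Coverage": "Special postcode for the Scottish Parliament."},
])
url = explore_url(55.96, -3.16, "ID", "SECRET")
sample_payload = {"response": {"groups": [{"items": [{"venue": {
    "name": "Example Cafe",
    "location": {"lat": 55.96, "lng": -3.16},
    "categories": [{"name": "Café"}],
}}]}]}}
rows = flatten_venues("EH7", sample_payload)
print(areas, rows)
```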

### Exploratory data analysis
I counted the number of venue results by area giving the following results:

|Outcode|Number of results|
|:---|:---|
|EH1|100|
|EH2|100|
|EH3|100|
|EH4|5|
|EH5|20|
|EH6|98|
|EH7|60|
|EH8|40|
|EH9|49|
|EH10|28|
|EH11|33|
|EH12|36|
|EH13|10|
|EH14|6|
|EH15|26|
|EH16|20|
|EH17|14|
|EH28|4|
|EH29|5|
|EH30|31|

The densely populated city-centre postcodes (EH1, EH2 and EH3) each returned the maximum number of venues (100). I was concerned that there could be duplicate venues, because each venue was allocated to an area based solely on the radius around the search location, so neighbouring searches could return the same venue.

I also counted the unique venue categories per outcode:

|Outcode|Unique venue categories|
|:---|:---|
|EH1|53|
|EH2|61|
|EH3|56|
|EH4|5|
|EH5|16|
|EH6|51|
|EH7|29|
|EH8|28|
|EH9|27|
|EH10|19|
|EH11|24|
|EH12|24|
|EH13|9|
|EH14|6|
|EH15|24|
|EH16|15|
|EH17|8|
|EH28|4|
|EH29|5|
|EH30|19|

I noted the high number of unique categories as a proportion of the total venues in each area. Looking at the data, I could see many subtypes of restaurant (e.g. American Restaurant, Argentinian Restaurant) as well as many different ways of describing cafes. I was concerned that this could distort the clustering by suggesting greater differences between areas than is useful for this purpose.
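
The two counts above, and the proportion of unique categories, can be computed with a short helper. This is a sketch in plain Python; the row layout (`outcode`, `category`) and the toy rows are assumptions for illustration, not the project's data.

```python
from collections import defaultdict


def summarise_by_outcode(rows):
    """Count venues and distinct categories per outcode, plus their ratio."""
    totals = defaultdict(int)
    categories = defaultdict(set)
    for row in rows:
        totals[row["outcode"]] += 1
        categories[row["outcode"]].add(row["category"])
    return {
        outcode: {
            "venues": totals[outcode],
            "unique_categories": len(categories[outcode]),
            "ratio": len(categories[outcode]) / totals[outcode],
        }
        for outcode in totals
    }


# Toy rows for two hypothetical areas, "A" and "B".
toy_rows = [
    {"outcode": "A", "category": "Pub"},
    {"outcode": "A", "category": "Pub"},
    {"outcode": "A", "category": "Café"},
    {"outcode": "A", "category": "Park"},
    {"outcode": "B", "category": "Hotel"},
    {"outcode": "B", "category": "Supermarket"},
]
summary = summarise_by_outcode(toy_rows)
print(summary)
```

Applied to the tables above, the same ratio is 53/100 for EH1 but 5/5 for EH4, which is the pattern that prompted the concern.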

I decided to proceed with clustering using the k-Means algorithm from the scikit-learn package, with 5 clusters, and then to consider the above concerns in light of the result. I examined the cluster sizes and got the following numbers:

|Cluster|Number of areas|
|:---|:---|
|1|3|
|2|11|
|3|3|
|4|1|
|5|2|
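
As a rough illustration of this step, the sketch below builds a relative-frequency feature matrix (the share of each venue category within an area) and clusters it with scikit-learn's `KMeans`. The toy areas "A" to "D", the helper name and the use of 2 clusters instead of 5 are assumptions chosen to keep the example small.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans


def relative_frequencies(rows):
    """Return (outcodes, matrix) where each matrix row holds category shares summing to 1."""
    outcodes = sorted({r["outcode"] for r in rows})
    categories = sorted({r["category"] for r in rows})
    counts = np.zeros((len(outcodes), len(categories)))
    for r in rows:
        counts[outcodes.index(r["outcode"]), categories.index(r["category"])] += 1
    return outcodes, counts / counts.sum(axis=1, keepdims=True)


# Hypothetical areas: A and B are pub-heavy, C and D are park-heavy.
toy_rows = (
    [{"outcode": "A", "category": c} for c in ["Pub", "Pub", "Café"]]
    + [{"outcode": "B", "category": c} for c in ["Pub", "Pub", "Pub"]]
    + [{"outcode": "C", "category": c} for c in ["Park", "Park", "Café"]]
    + [{"outcode": "D", "category": c} for c in ["Park", "Park", "Park"]]
)

outcodes, features = relative_frequencies(toy_rows)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
labels = dict(zip(outcodes, model.labels_.tolist()))
cluster_sizes = Counter(labels.values())
print(labels)
print(sorted(cluster_sizes.values()))
```

Counting the labels in this way gives a cluster-size table like the one above. Cluster numbers from k-Means are arbitrary, so only the grouping is meaningful.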

The fact that there was one big cluster did reinforce the above concerns. In addition, the resulting map showed that most of the city was in cluster 2 and it was only areas on the periphery of the city that were clustered differently.

<img src="model4a.png">

That suggested this model may not provide insights of interest to real estate agents.

### Revised model
I made two revisions to the model. To resolve my concerns about the number of unique categories, I recoded the venue categories to phrase them more generally. I did this by exporting the list of unique categories and manually assigning each one a more general category name. This mapping was then merged with the original data.
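
A sketch of the recoding step, assuming the manual mapping is held as a dictionary. The general names chosen here ("Restaurant", "Cafe") are illustrative and are not the project's actual mapping.

```python
# Hypothetical excerpt of a hand-made mapping from specific to general categories.
GENERAL_CATEGORY = {
    "American Restaurant": "Restaurant",
    "Argentinian Restaurant": "Restaurant",
    "Coffee Shop": "Cafe",
    "Café": "Cafe",
}


def recode(rows, mapping):
    """Return copies of the rows with the category replaced by its general name.

    Categories missing from the mapping are left unchanged.
    """
    return [
        {**row, "category": mapping.get(row["category"], row["category"])}
        for row in rows
    ]


venues = [
    {"venue": "Example Grill", "category": "American Restaurant"},
    {"venue": "Example Steakhouse", "category": "Argentinian Restaurant"},
    {"venue": "Example Beans", "category": "Coffee Shop"},
    {"venue": "Example Green", "category": "Park"},
]
recoded = recode(venues, GENERAL_CATEGORY)
print(sorted({row["category"] for row in recoded}))
```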

In order to resolve my concern about venues appearing in multiple areas (and that venues were included in postcodes in which they weren't present) I decided to set the postcode based on the venue location rather than the search location. I again did this with the Postcodes.io outcodes API. I then dropped duplicate locations. I then reran the clustering algorithm.
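
One way to sketch this step is shown below. It assumes the Postcodes.io reverse lookup at `https://api.postcodes.io/outcodes?lon=...&lat=...` returns a `result` list ordered nearest first, and that a venue is identified by its name and coordinates; the helper names and sample rows are hypothetical.

```python
from urllib.parse import urlencode


def nearest_outcode_url(lat, lng):
    """Build a reverse-lookup URL for the outcodes nearest to a venue's coordinates."""
    return "https://api.postcodes.io/outcodes?" + urlencode({"lon": lng, "lat": lat})


def first_outcode(payload):
    """Take the first outcode from a decoded response (assumed to be the nearest)."""
    results = payload.get("result") or []
    return results[0]["outcode"] if results else None


def drop_duplicate_venues(rows):
    """Keep the first occurrence of each venue, identified by name and coordinates."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["venue"], row["lat"], row["lng"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# The same made-up venue returned by two neighbouring searches.
rows = [
    {"venue": "Example Cafe", "lat": 55.95, "lng": -3.19, "outcode": "EH1"},
    {"venue": "Example Cafe", "lat": 55.95, "lng": -3.19, "outcode": "EH1"},
    {"venue": "Example Pub", "lat": 55.96, "lng": -3.17, "outcode": "EH7"},
]
deduplicated = drop_duplicate_venues(rows)
print(nearest_outcode_url(55.95, -3.19))
print(len(deduplicated))
```

Reassigning the outcode before dropping duplicates means a venue found by several searches ends up as identical rows, which is what allows the duplicates to be removed.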



## Results 
After re-running the clustering algorithm I got the following cluster sizes:

|Cluster|Number of areas|
|:---|:---|
|1|7|
|2|4|
|3|2|
|4|6|
|5|1|

I also got the following map of the clusters:

<img src="model4c.png">

|Cluster|Colour|
|:---|:---|
|1|Red|
|2|Purple|
|3|Blue|
|4|Green|
|5|Orange|

### Resulting clusters
Below are the resulting clusters, which are examined in the Discussion section below.

#### Cluster 1
<img src="cluster1.png">

#### Cluster 2
<img src="cluster2.png">

#### Cluster 3
<img src="cluster3.png">

#### Cluster 4
<img src="cluster4.png">

#### Cluster 5
<img src="cluster5.png">

## Discussion 
The resulting clusters demonstrate the benefit of k-Means clustering for advising homeowners looking to move within or to Edinburgh. It has divided the city into 5 clusters, each with fairly clear differences:

#### Cluster 1
This cluster contains areas located just outside the centre of town. Its mix of coffee shops and parks (as well as quite a few pubs) suggests these areas may be suitable for families with young children.

#### Cluster 2
Cluster 2 is the other cluster surrounding the centre of town. It has plenty of places to buy groceries, good transport links and gyms or sports fields. 

#### Cluster 3
Cluster 3 contains two areas at the edge of Edinburgh. It also has a high number of places to buy groceries, as well as markets, hostels and hotels.

#### Cluster 4
This cluster is located right in the centre of town, with many pubs and restaurants. These locations may suit the active social lives of young people.

#### Cluster 5
This cluster contains only the EH28 (Newbridge/Ratho) postcode, which shows the distinct character of that area. Its most common venue category is hotel, and it is possible that this distinct character is influenced by the nearby airport.

One use of this clustering is to provide inspiration. For example, if someone is interested in moving to Marchmont (EH9) on the Southside of Edinburgh, you might suggest the EH6 postcode in the north of Edinburgh, because both are in cluster 4. Greater potential could be achieved by using the Zoopla API to find average house prices in these areas, allowing you to narrow down areas both by similarity and by price.
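
The suggestion step amounts to a same-cluster lookup, sketched below. The `labels` dictionary contains only the assignments stated in this discussion (EH6 and EH9 in cluster 4, EH28 in cluster 5); the function name is hypothetical.

```python
def similar_areas(outcode, labels):
    """Return the other outcodes that share a cluster with the given outcode."""
    cluster = labels[outcode]
    return sorted(o for o, c in labels.items() if c == cluster and o != outcode)


# Partial cluster assignments taken from the discussion above.
labels = {"EH6": 4, "EH9": 4, "EH28": 5}

print(similar_areas("EH9", labels))   # ['EH6']
print(similar_areas("EH28", labels))  # []
```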

## Conclusion
This project demonstrates the potential of k-Means clustering in real estate. These clusters can be used to get a reality check on assumptions about Edinburgh areas and provide some inspiration for areas that could be suitable for your clients.  