# Capstone Project - Battle of Neighborhoods
### Applied Data Science Capstone by IBM/Coursera
#### by Hans-Joachim Steinort

## Table of contents <a name="table_of_contents"></a>
* [Introduction: Business Problem](#introduction)
* [Data Acquisition](#data_aquisition)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)
* [Literature](#literatur)

## Introduction: Business Problem <a name="introduction"></a> 
[Back to top](#table_of_contents)

As an emerging start-up we are looking for a suitable city in which to open our headquarters. Of course we have the ambition to choose a city with international flair, a city renowned all around the world, something like New York City or Tokyo. Unfortunately we are bound to the south of Germany. So no London, Paris or Singapore for us, not even Berlin. Luckily we are located near a city that until 2005 proclaimed itself a "world city with heart" - Munich [[1]](#l_1).

To verify this claim we will use the powers of data science to analyse the neighbourhoods of Munich and compare them to Manhattan, NYC as our baseline. If we find similarities we can safely choose Munich for our headquarters and still have the international flair we are looking for.

## Data Acquisition<a name="data_aquisition"></a>
[Back to top](#table_of_contents)

Based on our requirements the following factors will influence our decision:
* office rent and availability
* the distance to the nearest international airport / main train station
* the classification of the neighbourhood based on the venues in direct vicinity (so our employees do not have to leave early if they want to go out)

Necessary data will be extracted/generated from/by:
* **Real estate reports** ([[2]](#l_2), [[3]](#l_3))
* **GeoPy**
* **Foursquare API**

The data acquisition part consists of
* [Office Rent Comparison](#rent_comparison)
* [Airport and Train Station Distance](#air_train)
* [Munich Segmentation](#segmentation)
* [Munich Segment Data Acquisition](#segment_data)

### Office Rent Comparison<a name="rent_comparison"></a>
[Back to Data](#data_aquisition)

At first glance Munich is comparatively cheap next to Manhattan. According to [[2]](#l_2) the **availability** of offices in Manhattan was **9.8% in Q3 2019** with a reported **rent** of **79.77 $/SF/YR**. To get comparable results we have to convert this value to the German standard of euro (€) per square meter (SQM) per month (M).

To get from SF to SQM we multiply the value by the factor of 10.764:

79.77 * 10.764 = 858.64 $/SQM/YR

To get the monthly rent we then divide the value by 12:

858.64 / 12 = 71.55 $/SQM/M

To get the monthly rent in euro we multiply by a generous estimate of the 2019 average dollar/euro exchange rate of 0.9 €/$:

71.55 * 0.9 = 64.40 €/SQM/M

If we take a look at the office market report of Munich in [[3]](#l_3) we might find our first discouragement. Even if we choose our office in the most expensive area of Munich, the city center, we only manage to pay a **shattering 36% less** than the Manhattan average.

**Manhattan, NYC: 64.40 €/SQM/M**

**City Center, Munich: 41.00 €/SQM/M**

So no flexing with the height of our rent... But at least we have found the area that we want to compare to the neighbourhoods of Manhattan.
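The conversion and the comparison above can be reproduced in a few lines. This is only a sketch: the function and constant names are made up for this illustration, while the factor 10.764, the exchange rate of 0.9 €/$ and both rent figures are the values quoted in the text.

```python
SQFT_PER_SQM = 10.764   # square feet per square meter
MONTHS_PER_YEAR = 12

def usd_sf_year_to_eur_sqm_month(rent_usd_sf_year, eur_per_usd=0.9):
    """Convert an annual rent in $/SF into a monthly rent in €/SQM."""
    usd_sqm_year = rent_usd_sf_year * SQFT_PER_SQM
    usd_sqm_month = usd_sqm_year / MONTHS_PER_YEAR
    return usd_sqm_month * eur_per_usd

manhattan_rent = usd_sf_year_to_eur_sqm_month(79.77)   # Manhattan average from [2]
munich_rent = 41.00                                     # Munich city center from [3]
saving = 1 - munich_rent / manhattan_rent               # relative difference to Manhattan

print("Manhattan: %.2f €/SQM/M" % manhattan_rent)
print("Munich is %.0f%% cheaper" % (saving * 100))
```

Keeping the exchange rate as a parameter makes it easy to see how sensitive the comparison is to that estimate.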

### Airport and Train Station Distance<a name="air_train"></a>
[Back to Data](#data_aquisition)

Being internationally active we need fast access to an international airport, because sailing across the Atlantic for each and every meeting is too time-consuming. Yet, trying our best to reduce our environmental footprint, we plan to travel inside Germany by train. For this purpose a nearby train station (within 10 minutes by bike, i.e. about 5 km) would be beneficial.

Both distances will be roughly measured as a straight line derived from the latitude and longitude values of the respective neighbourhood/area centers.

In [68]:
# imports for retrieving latitudes and longitudes
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="my-application")

In [2]:
# imports for measuring distances between latitudes and longitudes
from math import sin, cos, sqrt, atan2, radians

R = 6373.0 # approximate radius of earth [km]

#### Manhattan, NYC

As main international airport of NYC we assume **John F. Kennedy Airport (JFK)**. As main train station of Manhattan we assume the **New York Pennsylvania Station**.

In [3]:
# get location of the center of Manhattan

location_Manhattan = geolocator.geocode("Manhattan NYC")

print(location_Manhattan.address)
print("")

latitude_Manhattan = location_Manhattan.latitude
longitude_Manhattan = location_Manhattan.longitude


print((latitude_Manhattan, longitude_Manhattan))

Manhattan, New York County, New York, United States of America

(40.7896239, -73.9598939)


In [4]:
# get location of JFK

location_JFK = geolocator.geocode("JFK NYC")

print(location_JFK.address)
print("")

latitude_JFK = location_JFK.latitude
longitude_JFK = location_JFK.longitude

print((latitude_JFK, longitude_JFK))

John F. Kennedy International Airport, 167th Street, Rochdale Village, Queens, Queens County, New York, 11430, United States of America

(40.642947899999996, -73.7793733748521)


In [5]:
# get location of Penn Station

location_Penn = geolocator.geocode("Pennsylvania Station NYC")

print(location_Penn.address)
print("")

latitude_Penn = location_Penn.latitude
longitude_Penn = location_Penn.longitude

print((latitude_Penn, longitude_Penn))

Pennsylvania Station, 234, West 31st Street, Chelsea, Manhattan Community Board 5, Manhattan, New York County, New York, 10001, United States of America

(40.7502382, -73.9928111)


In [6]:
# get airport distance [km]

rLat_1 = radians(latitude_JFK)
rLat_2 = radians(latitude_Manhattan)
rLong_1 = radians(longitude_JFK)
rLong_2 = radians(longitude_Manhattan)

dLat = rLat_1 - rLat_2
dLong = rLong_1 - rLong_2

a = sin(dLat / 2)**2 + cos(rLat_1) * cos(rLat_2) * sin(dLong / 2)**2
c = 2 * atan2(sqrt(a), sqrt(1-a))

distance = R * c

print("Distance: %.2f" % distance, "km")

Distance: 22.31 km


In [7]:
# get train station distance [km]

rLat_1 = radians(latitude_Penn)
rLat_2 = radians(latitude_Manhattan)
rLong_1 = radians(longitude_Penn)
rLong_2 = radians(longitude_Manhattan)

dLat = rLat_1 - rLat_2
dLong = rLong_1 - rLong_2

a = sin(dLat / 2)**2 + cos(rLat_1) * cos(rLat_2) * sin(dLong / 2)**2
c = 2 * atan2(sqrt(a), sqrt(1-a))

distance = R * c

print("Distance: %.2f" % distance, "km")

Distance: 5.18 km


Our **benchmark** for an office in Manhattan is an approximate airport distance of **22.31 km** and a train station distance of **5.18 km**.
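The distance cells in this section all repeat the same haversine computation with different inputs. As a sketch, the formula could be wrapped in one helper so that each pair of coordinates is passed explicitly. The name `haversine_km` is hypothetical; the radius is the same approximate value `R` used above, and the coordinates are the ones printed by the geocoder.

```python
from math import radians, sin, cos, sqrt, atan2

EARTH_RADIUS_KM = 6373.0  # same approximate radius as R above

def haversine_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance between two points given in decimal degrees."""
    rlat1, rlon1, rlat2, rlon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = rlat2 - rlat1
    dlon = rlon2 - rlon1
    a = sin(dlat / 2) ** 2 + cos(rlat1) * cos(rlat2) * sin(dlon / 2) ** 2
    return radius_km * 2 * atan2(sqrt(a), sqrt(1 - a))

# coordinates as returned by the geocoder in the cells above
manhattan_center = (40.7896239, -73.9598939)
penn_station = (40.7502382, -73.9928111)

station_distance = haversine_km(*manhattan_center, *penn_station)
print("Distance: %.2f km" % station_distance)
```

Passing latitude and longitude together as one tuple per place reduces the risk of mixing up the coordinates of two different locations.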

#### City Center, Munich, Germany

As the main international airport of Munich we assume **Franz Josef Strauss Airport (MUC)**. As the main train station of Munich we assume **Munich Main Station (MMS)**.

In [8]:
# get location of the center of Munich

location_Munich = geolocator.geocode("Center Munich Germany")

print(location_Munich.address)
print("")

latitude_Munich = location_Munich.latitude
longitude_Munich = location_Munich.longitude


print((latitude_Munich, longitude_Munich))

München, Bayern, Deutschland

(48.1371079, 11.5753822)


In [9]:
# get location of MUC

location_MUC = geolocator.geocode("MUC Airport Munich Germany")

print(location_MUC.address)
print("")

latitude_MUC = location_MUC.latitude
longitude_MUC = location_MUC.longitude

print((latitude_MUC, longitude_MUC))

Flughafen München, Lohstraße, Schwaigermoos, Oberding, Oberding (VGem), Landkreis Erding, Bayern, 85445, Deutschland

(48.35376735, 11.778011507058581)


In [10]:
# get location of MMS

location_MMS = geolocator.geocode("Munich Main Station Germany")

print(location_MMS.address)
print("")

latitude_MMS = location_MMS.latitude
longitude_MMS = location_MMS.longitude

print((latitude_MMS, longitude_MMS))

München Hauptbahnhof, Bezirksteil Ludwigsvorstadt-Kliniken, Stadtbezirk 02 Ludwigsvorstadt-Isarvorstadt, München, Bayern, 80335, Deutschland

(48.1407138, 11.556312490222421)


In [11]:
# get airport distance [km]

rLat_1 = radians(latitude_MUC)
rLat_2 = radians(latitude_Munich)
rLong_1 = radians(longitude_MUC)
rLong_2 = radians(longitude_Munich)

dLat = rLat_1 - rLat_2
dLong = rLong_1 - rLong_2

a = sin(dLat / 2)**2 + cos(rLat_1) * cos(rLat_2) * sin(dLong / 2)**2
c = 2 * atan2(sqrt(a), sqrt(1-a))

distance = R * c

print("Distance: %.2f" % distance, "km")

Distance: 28.39 km


In [12]:
# get train station distance [km]

rLat_1 = radians(latitude_MMS)
rLat_2 = radians(latitude_Munich)
rLong_1 = radians(longitude_MMS)
rLong_2 = radians(longitude_Munich)

dLat = rLat_1 - rLat_2
dLong = rLong_1 - rLong_2

a = sin(dLat / 2)**2 + cos(rLat_1) * cos(rLat_2) * sin(dLong / 2)**2
c = 2 * atan2(sqrt(a), sqrt(1-a))

distance = R * c

print("Distance: %.2f" % distance, "km")

Distance: 1.47 km


With an approximate distance of **28.39 km** the airport in Munich is slightly farther away from the desired office area than JFK is from Manhattan, but this is absolutely within our tolerance, especially because the distance from the center to the main train station is only **1.47 km**.

### Munich Segmentation<a name="segmentation"></a>
[Back to Data](#data_aquisition)

Due to the fact that we have no predefined segmentation of the city center of Munich (compared to the different neighbourhoods in Manhattan), we have to build our own segmentation.

For this purpose we create an **equally spaced grid within a ~4km radius of the center of Munich**. Each artificial neighbourhood will have a radius of 200 m.

The calculation will take place within a Cartesian 2D coordinate system. The conversion between the WGS84 spherical coordinate system and the UTM Cartesian coordinate system is inspired by the Capstone Example [[4]](#l_4).

In [13]:
import shapely
from pyproj import Proj
import math

In [14]:
def projection():
    # UTM zone 32 (northern hemisphere) covers Munich
    return Proj("+proj=utm +zone=32 +ellps=WGS84 +datum=WGS84 +units=m +no_defs")

def lonlat_to_xy(lon, lat):
    proj_xy = projection()
    x, y = proj_xy(lon, lat)
    
    return x, y

def xy_to_lonlat(x, y):
    proj_lonlat = projection()
    lon, lat = proj_lonlat(x, y, inverse=True)
    
    return lon, lat

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    
    return math.sqrt(dx*dx + dy*dy)

In [15]:
munich_center_x, munich_center_y = lonlat_to_xy(longitude_Munich, latitude_Munich)

print('Munich city center UTMx={}, UTMy={}'.format(munich_center_x, munich_center_y))

Munich city center UTMx=691595.3564082251, UTMy=5334747.274427182


In the next step we build a **hexagonal grid** equally spaced around the city center of Munich.

In [16]:
y_off = math.sqrt(3) / 2     # vertical offset
n_cnt_x = 21                   # neighbourhood horizontal counter 
n_off_y = int(n_cnt_x/y_off)     # neighbourhood vertical counter

circle_radius = 4000 #[m]
center_distance = 400 #[m]

x_min = munich_center_x - circle_radius
x_step = center_distance
y_min = munich_center_y - circle_radius - (n_off_y * y_off * center_distance - circle_radius * 2) / 2
y_step = center_distance * y_off

latitudes = []
longitudes = []
distances_from_center = []
X = []
Y = []

for i in range (0, n_off_y):
    y = y_min + i * y_step
    x_off = (center_distance / 2) if i%2 == 0 else 0
    for j in range (0, n_cnt_x):
        x = x_min + j * x_step + x_off
        distance_from_center = calc_xy_distance(munich_center_x, munich_center_y, x, y)
        if (distance_from_center <= circle_radius + 1):
            lon, lat = xy_to_lonlat(x, y)
            longitudes.append(lon)
            latitudes.append(lat)
            distances_from_center.append(distance_from_center)
            X.append(x)
            Y.append(y)
            
print(len(latitudes), 'neighbourhood centers generated.')

364 neighbourhood centers generated.
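The geometry behind the grid can be shown in isolation. In a hexagonal layout with center distance d, neighbouring rows are d * sqrt(3) / 2 apart and every second row is shifted sideways by d / 2, so that each center has the same distance d to all of its neighbours. The following is a simplified sketch in a local Cartesian system centered on (0, 0); the function `hex_grid` is hypothetical and lays the rows out symmetrically around the center, so its point count does not have to match the cell above.

```python
import math

def hex_grid(center_x, center_y, radius, spacing):
    """Return the (x, y) centers of a hexagonal grid that lie inside a circle."""
    row_height = spacing * math.sqrt(3) / 2      # vertical distance between rows
    n_rows = int(radius / row_height)
    n_cols = int(radius / spacing) + 1
    points = []
    for i in range(-n_rows, n_rows + 1):
        y = center_y + i * row_height
        x_shift = spacing / 2 if i % 2 else 0.0  # odd rows are shifted by half a step
        for j in range(-n_cols, n_cols + 1):
            x = center_x + j * spacing + x_shift
            if math.hypot(x - center_x, y - center_y) <= radius:
                points.append((x, y))
    return points

grid = hex_grid(0.0, 0.0, radius=4000, spacing=400)
print(len(grid), "grid centers")
```

Because the nearest neighbours are all exactly one spacing apart, circles with half that radius touch each other without overlapping.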


To verify our segmentation we generate a map with folium.

In [67]:
import folium

In [18]:
center_Munich = [location_Munich.latitude, location_Munich.longitude]
map_Munich = folium.Map(location=center_Munich, zoom_start=13)
folium.Marker(center_Munich, popup='City Center Munich').add_to(map_Munich)
for lat, lon in zip(latitudes, longitudes):
    folium.Circle([lat, lon], radius=center_distance/2, color='green', fill=False).add_to(map_Munich)

map_Munich

For further analysis we add our newly created neighbourhoods to a pandas dataframe.

In [19]:
import pandas as pd

centerIDs = []
for i in range(0, len(latitudes)):
    centerIDs.append('C_' + str(i))

munich_data = pd.DataFrame({'Center IDs':centerIDs,
                            'Center Latitudes':latitudes,
                            'Center Longitudes':longitudes})
munich_data.head()

Unnamed: 0,Center IDs,Center Latitudes,Center Longitudes
0,C_0,48.103219,11.557567
1,C_1,48.103099,11.562935
2,C_2,48.10298,11.568302
3,C_3,48.102859,11.57367
4,C_4,48.102739,11.579037


### Munich Segment Data Acquisition<a name="segment_data"></a>
[Back to Data](#data_aquisition)

To create a baseline which we can compare to the clusters of the Manhattan neighbourhood clustering from the lab in week 3, we have to reproduce the venue acquisition with the **Foursquare API** for our artificially created neighbourhood centers.

To achieve this we make use of the getNearbyVenues function from the lab [[5]](#l_5).

In [20]:
import requests

In [21]:
# Credentials hidden for sharing

In [25]:
def getNearbyVenues(ids, latitudes, longitudes, limit=100, radius=500):
    
    venues_list=[]
    for center_id, lat, lng in zip(ids, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
              CLIENT_ID, 
              CLIENT_SECRET, 
              VERSION, 
              lat, 
              lng, 
              radius, 
              limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            center_id,
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-center lists into one dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Center',
                             'Center Latitude', 
                             'Center Longitude', 
                             'Venue', 
                             'Venue Latitude', 
                             'Venue Longitude', 
                             'Venue Category']
    
    return nearby_venues
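The row extraction inside the function above can be tried without any network access. The sketch below works on a hand-written sample that only mimics the fields accessed above (`venue`, `name`, `location`, `categories`); the sample values and the helper `extract_venue_rows` are made up for this illustration. As an assumption beyond the original function, it also tolerates a venue without any category instead of failing on it.

```python
def extract_venue_rows(center_id, lat, lng, items):
    """Turn the 'items' list of one response into flat tuples, one per venue."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # assumption: a venue may come without a category, then we store None
        category = categories[0]['name'] if categories else None
        rows.append((center_id, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# made-up sample shaped like the fields used above
sample_items = [
    {'venue': {'name': 'Example Beer Garden',
               'location': {'lat': 48.1029, 'lng': 11.5581},
               'categories': [{'name': 'Beer Garden'}]}},
    {'venue': {'name': 'Example Spot Without Category',
               'location': {'lat': 48.1033, 'lng': 11.5624},
               'categories': []}},
]

rows = extract_venue_rows('C_0', 48.103219, 11.557567, sample_items)
for row in rows:
    print(row)
```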

In [26]:
venue_limit = 100
search_radius = center_distance / 2

munich_venues = getNearbyVenues(ids=munich_data['Center IDs'],
                                latitudes=munich_data['Center Latitudes'],
                                longitudes=munich_data['Center Longitudes'],
                                limit=venue_limit,
                                radius=search_radius)

In [32]:
print(munich_venues.shape)
munich_venues.head()

(2561, 7)


Unnamed: 0,Center,Center Latitude,Center Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,C_0,48.103219,11.557567,Isar Alm zum Gartl,48.102888,11.558148,Beer Garden
1,C_1,48.103099,11.562935,Reitclub Isartal,48.103324,11.562401,Stables
2,C_1,48.103099,11.562935,Dog's Academy Hundeschule,48.10361,11.562719,Dog Run
3,C_2,48.10298,11.568302,Therapie- und Trainingszentrum München Harlaching,48.103578,11.568077,Gym / Fitness Center
4,C_2,48.10298,11.568302,H Kurzstraße,48.1044,11.569698,Tram Station


This concludes our data acquisition phase. As a last step we pickle the retrieved venue list so that we do not have to call the Foursquare API over and over again.

In [33]:
import pickle

munich_venues.to_pickle('./Munich_Venues.pkl')
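A common way to make use of such a stored file is a small "load or compute" helper that only runs the expensive step when no cached result exists yet. The following is a minimal sketch with the standard library; `load_or_compute`, `expensive_query` and the file name are hypothetical, and the query function is only a stand-in for the Foursquare request loop.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the pickled object at `path`; compute and store it first if it is missing."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result

calls = []

def expensive_query():
    # stand-in for the API requests; records how often it really runs
    calls.append(1)
    return [("C_0", "Beer Garden")]

cache_file = os.path.join(tempfile.mkdtemp(), "venues_cache.pkl")
first = load_or_compute(cache_file, expensive_query)    # runs the query and writes the file
second = load_or_compute(cache_file, expensive_query)   # served from the file
print(len(calls), "query executed")
```

Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.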

## Methodology <a name="methodology"></a>
[Back to top](#table_of_contents)

After collecting the venues of each area around the city center of Munich we will want to find out if there are neighbourhoods/areas that are comparable to the ones in Manhattan.

To achieve this we analyse the areas by applying **one-hot encoding** to the venue categories and **grouping** the venues by area, which yields the relative frequency of each venue category per area. For Manhattan we reuse the data and the **k-means clustering algorithm** from the already mentioned lab [[5]](#l_5).

To check if the neighbourhoods/areas of Munich fit into the clusters of Manhattan we **merge the venue data of Manhattan and Munich** and then run the clustering algorithm. After that we will proceed with a visual comparison of the maps of Munich and Manhattan to **compare the clusters** and (hopefully) find the area in which we will try to rent our new headquarters.
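Why one-hot encoding followed by a group-wise mean yields frequencies can be seen on a tiny example. The sketch below uses plain Python instead of pandas so that every step is visible; the sample rows are made up and the variable names are only illustrative.

```python
from collections import defaultdict

# made-up sample: (center, venue category)
venue_rows = [
    ("C_1", "Stables"),
    ("C_1", "Dog Run"),
    ("C_2", "Tram Station"),
    ("C_2", "Tram Station"),
    ("C_2", "Beer Garden"),
    ("C_2", "Gym / Fitness Center"),
]

categories = sorted({category for _, category in venue_rows})

def one_hot(category):
    """Encode one venue as a 0/1 vector over all known categories."""
    return [1 if category == known else 0 for known in categories]

# group the one-hot vectors by center ...
grouped = defaultdict(list)
for center, category in venue_rows:
    grouped[center].append(one_hot(category))

# ... and average them column-wise: the mean of 0/1 values is a relative frequency
frequencies = {
    center: [sum(column) / len(vectors) for column in zip(*vectors)]
    for center, vectors in grouped.items()
}

for center, row in frequencies.items():
    print(center, dict(zip(categories, row)))
```

Each row sums to 1, so areas with many venues and areas with few venues become comparable, which is what the clustering step relies on.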

## Analysis <a name="analysis"></a>
[Back to top](#table_of_contents)

The analysis part consists of
* [Data Preparation](#data_prep)
* [Cluster Comparison](#cluster)

### Data Preparation<a name="data_prep"></a>
[Back to Analysis](#analysis)

In [37]:
# one-hot encoding
munich_onehot = pd.get_dummies(munich_venues[['Venue Category']], prefix="", prefix_sep="")
munich_onehot['Center'] = munich_venues['Center']

fixed_columns = [munich_onehot.columns[-1]] + list(munich_onehot.columns[:-1])
munich_onehot = munich_onehot[fixed_columns]

munich_grouped = munich_onehot.groupby('Center').mean().reset_index()
munich_grouped.head()

Unnamed: 0,Center,Accessories Store,Afghan Restaurant,African Restaurant,American Restaurant,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Volleyball Court,Water Park,Waterfall,Wine Bar,Wine Shop,Women's Store,Xinjiang Restaurant,Yoga Studio
0,C_0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,C_1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,C_10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,C_100,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,C_101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [130]:
munich_data.rename(columns={'Center IDs':'Center', 'Center Latitudes':'Latitude', 'Center Longitudes':'Longitude'}, inplace=True)
munich_merged = munich_data
munich_merged = munich_merged.join(munich_grouped.set_index('Center'), on='Center')
munich_merged.head()

Unnamed: 0,Center,Latitude,Longitude,Accessories Store,Afghan Restaurant,African Restaurant,American Restaurant,Arcade,Argentinian Restaurant,Art Gallery,...,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Volleyball Court,Water Park,Waterfall,Wine Bar,Wine Shop,Women's Store,Xinjiang Restaurant,Yoga Studio
0,C_0,48.103219,11.557567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,C_1,48.103099,11.562935,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,C_2,48.10298,11.568302,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,C_3,48.102859,11.57367,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,C_4,48.102739,11.579037,,,,,,,,...,,,,,,,,,,


In [40]:
# get Manhattan venue data
manhattan_grouped = pd.read_pickle("./Manhattan_Venues_Grouped.pkl")
manhattan_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,...,Vietnamese Restaurant,Volleyball Court,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Battery Park City,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.03,0.0
1,Carnegie Hill,0.0,0.0,0.0,0.0,0.010101,0.0,0.0,0.0,0.010101,...,0.020202,0.0,0.0,0.0,0.0,0.010101,0.030303,0.0,0.010101,0.030303
2,Central Harlem,0.0,0.0,0.0,0.045455,0.045455,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Chelsea,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.0
4,Chinatown,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,...,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01


In [44]:
# get Manhattan general data
manhattan_data = pd.read_pickle("./Manhattan_Data.pkl")
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


To match the format of both grouped lists we have to rename the ID column of the manhattan_grouped dataframe and merge it with the latitude and longitude values from the manhattan_data dataframe.

In [42]:
manhattan_grouped.rename(columns={'Neighborhood':'Center'}, inplace=True)
manhattan_grouped.head()

Unnamed: 0,Center,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,...,Vietnamese Restaurant,Volleyball Court,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Battery Park City,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.03,0.0
1,Carnegie Hill,0.0,0.0,0.0,0.0,0.010101,0.0,0.0,0.0,0.010101,...,0.020202,0.0,0.0,0.0,0.0,0.010101,0.030303,0.0,0.010101,0.030303
2,Central Harlem,0.0,0.0,0.0,0.045455,0.045455,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Chelsea,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.0
4,Chinatown,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,...,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01


In [49]:
del manhattan_data['Borough']
manhattan_data.rename(columns={'Neighborhood':'Center'}, inplace=True)
manhattan_data.head()

Unnamed: 0,Center,Latitude,Longitude
0,Marble Hill,40.876551,-73.91066
1,Chinatown,40.715618,-73.994279
2,Washington Heights,40.851903,-73.9369
3,Inwood,40.867684,-73.92121
4,Hamilton Heights,40.823604,-73.949688


In [50]:
manhattan_merged = manhattan_data
manhattan_merged = manhattan_merged.join(manhattan_grouped.set_index('Center'), on='Center')
manhattan_merged.head()

Unnamed: 0,Center,Latitude,Longitude,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arcade,...,Vietnamese Restaurant,Volleyball Court,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Marble Hill,40.876551,-73.91066,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478
1,Chinatown,40.715618,-73.994279,0.0,0.0,0.0,0.0,0.04,0.0,0.0,...,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01
2,Washington Heights,40.851903,-73.9369,0.011765,0.0,0.0,0.0,0.011765,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.011765,0.023529,0.0,0.011765,0.0
3,Inwood,40.867684,-73.92121,0.0,0.0,0.0,0.0,0.033898,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.033898,0.016949,0.0,0.0,0.016949
4,Hamilton Heights,40.823604,-73.949688,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.016667,0.0,0.0,0.0,0.033333


In [128]:
print('Manhattan_Merged has the shape of:', manhattan_merged.shape)
print('Munich_Merged has the shape of:', munich_merged.shape)

Manhattan_Merged has the shape of: (40, 337)
Munich_Merged has the shape of: (364, 283)


In [125]:
# Merge Munich and Manhattan data
# => columns (venue categories) present in only one dataframe are added and filled with NaN for the other
manhattan_munich_grouped = pd.concat([manhattan_merged, munich_merged], sort=False)
manhattan_munich_grouped.head()

Unnamed: 0,Center,Latitude,Longitude,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arcade,...,Taverna,Theme Park,Tibetan Restaurant,Track,Tram Station,Trattoria/Osteria,Tunnel,Water Park,Waterfall,Xinjiang Restaurant
0,Marble Hill,40.876551,-73.91066,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,...,,,,,,,,,,
1,Chinatown,40.715618,-73.994279,0.0,0.0,0.0,0.0,0.04,0.0,0.0,...,,,,,,,,,,
2,Washington Heights,40.851903,-73.9369,0.011765,0.0,0.0,0.0,0.011765,0.0,0.0,...,,,,,,,,,,
3,Inwood,40.867684,-73.92121,0.0,0.0,0.0,0.0,0.033898,0.0,0.0,...,,,,,,,,,,
4,Hamilton Heights,40.823604,-73.949688,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,


In [126]:
manhattan_munich_grouped.shape

(404, 401)

Now the two one-hot encoded dataframes are merged. All matching columns are concatenated; columns that exist in one dataframe but not in the other are added along the horizontal axis and filled with NaN for the rows of the other dataframe. After filling these NaN values with 0.0 we have the desired **manhattan_munich_grouped** dataframe to which we can fit our k-means clustering algorithm.
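The printed shapes fit this description: 337 + 283 - 401 = 219 columns occur in both dataframes, the rest occur in only one of them. The alignment itself can be sketched with plain dictionaries; the helper `stack_with_union_columns` and the two tiny tables are made up for this illustration and only mimic what the dataframe merge does.

```python
def stack_with_union_columns(table_a, table_b, fill_value=0.0):
    """Stack two {row: {column: value}} tables over the union of their columns."""
    columns = sorted({name
                      for table in (table_a, table_b)
                      for row in table.values()
                      for name in row})
    rows = {}
    for table in (table_a, table_b):
        for key, row in table.items():
            # a category that a city never reported counts as frequency 0
            rows[key] = [row.get(name, fill_value) for name in columns]
    return columns, rows

# made-up miniature versions of the two frequency tables
manhattan_sample = {"Chelsea": {"Wine Shop": 0.02, "Yoga Studio": 0.0}}
munich_sample = {"C_359": {"Tram Station": 0.125, "Wine Shop": 0.0}}

columns, rows = stack_with_union_columns(manhattan_sample, munich_sample)
print(columns)
for key, values in rows.items():
    print(key, values)
```

Filling with 0.0 is a sensible choice here because a missing category means that no venue of that kind was found in the area.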

In [131]:
manhattan_munich_grouped.fillna(0, inplace=True)
manhattan_munich_grouped.tail()

Unnamed: 0,Center,Latitude,Longitude,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arcade,...,Taverna,Theme Park,Tibetan Restaurant,Track,Tram Station,Trattoria/Osteria,Tunnel,Water Park,Waterfall,Xinjiang Restaurant
359,C_359,48.171477,11.571723,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0
360,C_360,48.171356,11.577098,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
361,C_361,48.171235,11.582472,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
362,C_362,48.171115,11.587847,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
363,C_363,48.170993,11.593222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Cluster Comparison<a name="cluster"></a>
[Back to Analysis](#analysis)

We run k-means to cluster our neighbourhoods/centers into **10 clusters**.

In [64]:
from sklearn.cluster import KMeans

In [161]:
kclusters = 10

# drop columns unnecessary for clustering
manhattan_munich_grouped_clustering = manhattan_munich_grouped.drop(columns=['Center', 'Latitude', 'Longitude'])

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_munich_grouped_clustering)
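For readers who want to see what the fit does conceptually, here is a toy version of the classic Lloyd iteration behind k-means. It is not scikit-learn's implementation (which, among other things, uses a smarter initialisation by default); the function names, the random start from data points and the six sample points are assumptions made for this sketch.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iterations=100, seed=0):
    """Lloyd iteration: assign points to the nearest centroid, then move the centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]   # start from k randomly chosen data points
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:   # assignments are stable, so we are done
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:            # keep the old centroid if a cluster ran empty
                centroids[c] = [sum(values) / len(members) for values in zip(*members)]
    return labels, centroids

# two clearly separated toy groups
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

In the notebook the points are the venue frequency vectors, so two areas end up in the same cluster when their mixes of venue categories are similar.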

After fitting the model we add the cluster labels to our previously created dataframe.

In [163]:
# copy first so that the cell can be re-run without hitting an existing 'Clusters' column
manhattan_munich_clustered = manhattan_munich_grouped.copy()
manhattan_munich_clustered.insert(3, 'Clusters', kmeans.labels_)
manhattan_munich_clustered.head()

Unnamed: 0,Center,Latitude,Longitude,Clusters,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,...,Taverna,Theme Park,Tibetan Restaurant,Track,Tram Station,Trattoria/Osteria,Tunnel,Water Park,Waterfall,Xinjiang Restaurant
0,Marble Hill,40.876551,-73.91066,0,0.0,0.0,0.0,0.0,0.043478,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Chinatown,40.715618,-73.994279,0,0.0,0.0,0.0,0.0,0.04,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Washington Heights,40.851903,-73.9369,0,0.011765,0.0,0.0,0.0,0.011765,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Inwood,40.867684,-73.92121,0,0.0,0.0,0.0,0.0,0.033898,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Hamilton Heights,40.823604,-73.949688,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


To visually compare our clusters we create **two maps** - one for **Manhattan including the clustered neighbourhoods** and one for **Munich containing the clustered artificial neighbourhood centers**.

In [69]:
manhattan = geolocator.geocode('Manhattan, NY')

In [78]:
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

def create_markers(map_object, data_object, k_clusters):
    # set color scheme
    x = np.arange(k_clusters)
    ys = [i + x + (i * x)**2 for i in range(k_clusters)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]
    
    # add markers to map
    markers_colors = []
    for lat, lon, poi, cluster in zip(data_object['Latitude'], data_object['Longitude'], data_object['Center'], data_object['Clusters']):
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker([lat, lon],
                            radius=5,
                            popup=label,
                            color=rainbow[cluster-1],
                            fill=True,
                            fill_color=rainbow[cluster-1],
                            fill_opacity=0.7).add_to(map_object)
    
    return map_object

In [164]:
manhattan_clusters = folium.Map(location=[manhattan.latitude, manhattan.longitude], zoom_start=11)
manhattan_clusters = create_markers(manhattan_clusters, manhattan_munich_clustered[0:40], kclusters)
manhattan_clusters

In [165]:
munich_clusters = folium.Map(location=[location_Munich.latitude, location_Munich.longitude], zoom_start=12)
munich_clusters = create_markers(munich_clusters, manhattan_munich_clustered[40:], kclusters)
munich_clusters

In [171]:
munich_like_manhattan = manhattan_munich_clustered[41:]
munich_like_manhattan[munich_like_manhattan.Clusters==0].count()["Center"]

159

## Results and Discussion <a name="results"></a>
[Back to top](#table_of_contents)

As the results show, the k-means algorithm has assigned the whole of Manhattan to cluster '0', whereas Munich is spread across many more clusters. Nevertheless, **159** of our 364 artificial neighbourhood centers in Munich fall into that same cluster.
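To put this count into perspective it helps to look at shares rather than absolute numbers. The sketch below shows one way to compute the share of each cluster from a list of labels and then applies the same arithmetic to the figures above; the helper `cluster_shares` and the toy labels are made up for this illustration.

```python
from collections import Counter

def cluster_shares(labels):
    """Return the fraction of items per cluster label."""
    counts = Counter(labels)
    total = len(labels)
    return {cluster: count / total for cluster, count in sorted(counts.items())}

toy_labels = [0, 0, 3, 0, 7, 3, 0, 0]   # made-up labels for eight centers
toy_shares = cluster_shares(toy_labels)
print(toy_shares)

# figures from this notebook: 159 of the 364 Munich centers are in cluster 0
share_cluster_0 = 159 / 364
print("%.1f%% of the Munich centers share Manhattan's cluster" % (share_cluster_0 * 100))
```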

To possibly find better solutions, e.g. a more granular clustering of Manhattan, we could try the following options:
* divide Munich into larger circles (currently our search areas in Munich are half the size of the ones in Manhattan, which could skew our results)
* fit our algorithm with a different (smaller or larger) number of clusters
* use a different clustering algorithm like _Fuzzy k-means_ or _K-harmonic means_ [[6]](#l_6)

Nevertheless we got a good first impression in our search for the place of our new headquarters and can modify the data and algorithms in this notebook accordingly if we want to run further analyses before we decide on a new office.

## Conclusion <a name="conclusion"></a>
[Back to top](#table_of_contents)

In our search for an office for the headquarters of our start-up we wanted an area comparable to the neighbourhoods of a world city like New York, namely the borough of Manhattan. We searched for such an area in a nearby city that once called itself a "world city with heart" - Munich.

After a first high-level comparison between Munich and Manhattan with regard to expected office rent and the accessibility of airport and train connections, we analyzed the venues around our potential office locations to provide our employees with the best world city flair we can possibly find. For this we used the Foursquare API to retrieve the venue data.

To achieve comparability between Munich and Manhattan we merged the data we retrieved with the data we already had from a lab belonging to the IBM Data Science Capstone Project. We fed this merged data to a k-means clustering algorithm and projected our clusters onto maps of Manhattan and Munich to compare both.

In the end we found that around 44% of our artificial neighbourhood centers in the city center of Munich fall into the same cluster as Manhattan. Combined with our first high-level comparison we can confidently say that the claim of Munich being a world city is not too far-fetched and that it is an acceptable alternative to opening our headquarters in Manhattan, NY.

## Literature <a name="literatur"></a>
[Back to top](#table_of_contents)

[1] <a name="l_1"></a>https://www.muenchen.de/rathaus/Stadtverwaltung/Referat-fuer-Arbeit-und-Wirtschaft/Presse/Muenchen-mag-dich-Munich-loves-you.html

[2] <a name="l_2"></a>Colliers International - Manhattan Office Market Report 3Q 2019

[3] <a name="l_3"></a>JLL - Office Market Profile Munich 4th Quarter 2019

[4] <a name="l_4"></a>https://cocl.us/coursera_capstone_notebook

[5] <a name="l_5"></a>CognitiveClass.ai - Segmenting and Clustering Neighborhoods in New York City

[6] <a name="l_6"></a>Hamerly and Elkan - Alternatives to the k-means algorithm that find better clusterings, ACM CIKM Proceedings, 2002