In [1]:
import pandas as pd
import numpy as np
import requests
import json, urllib
from pandas import json_normalize  # the pandas.io.json import path is deprecated
import folium
from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

# Introduction

The company's office is located in a New York neighbourhood, but for business reasons the stakeholders have decided to move it to Berlin. Employees are used to the infrastructure of their New York neighbourhood and want the same level of comfort in Berlin.  
Target audience: company stakeholders and employees.  
Problem: find the Berlin neighbourhood that is most similar to the company's current New York neighbourhood.  
Motivation: keep the same infrastructure and the same level of comfort.  
To make the decision, we use Foursquare location data for the neighbourhoods in both cities.

# Data selection

We take the Foursquare location data for the neighbourhoods in both cities. We gather data on the venues in each neighbourhood and preprocess it so that every neighbourhood is described by the mean frequency of each venue category; we then cluster the neighbourhoods of both cities together.  
The final dataset describes the venues located in each neighbourhood, which is what the clustering is based on.
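
As a rough illustration of the preprocessing described above, the sketch below turns a list of venues into per-neighbourhood category frequencies. The sample data and the helper name `category_frequencies` are made up for illustration; the notebook itself does this with pandas one-hot encoding followed by a group-by mean.

```python
from collections import Counter, defaultdict

# Hypothetical sample: one (neighbourhood, venue category) pair per venue.
venues = [
    ("A", "Cafe"), ("A", "Cafe"), ("A", "Park"), ("A", "Bakery"),
    ("B", "Park"), ("B", "Gym"),
]

def category_frequencies(pairs):
    """Return {neighbourhood: {category: share of that neighbourhood's venues}}."""
    counts = defaultdict(Counter)
    for neighbourhood, category in pairs:
        counts[neighbourhood][category] += 1
    return {
        n: {cat: k / sum(c.values()) for cat, k in c.items()}
        for n, c in counts.items()
    }

profiles = category_frequencies(venues)
print(profiles["A"])  # {'Cafe': 0.5, 'Park': 0.25, 'Bakery': 0.25}
```

Each neighbourhood's shares sum to 1, so neighbourhoods with very different venue counts remain comparable.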

# Methodology section

## Part 1. New York

First, let's get the data on New York.

In [2]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [3]:
neighborhoods_data = newyork_data['features']

In [4]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
ny_neighborhoods = pd.DataFrame(columns=column_names)

In [5]:
ny_neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


In [6]:
# collect one row per neighborhood, then build the dataframe in one go
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

ny_neighborhoods = pd.DataFrame(rows, columns=column_names)

In [8]:
ny_neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585
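
For readers without the `newyork_data.json` file, here is a minimal sketch of the same extraction on an inline sample. The sample only mirrors the fields the loop above reads (`properties.borough`, `properties.name`, `geometry.coordinates`); the helper name `features_to_rows` is hypothetical.

```python
# Inline sample shaped like the entries of newyork_data['features'];
# the two rows reuse values shown in the table above.
sample_features = [
    {"properties": {"borough": "Bronx", "name": "Wakefield"},
     "geometry": {"coordinates": [-73.847201, 40.894705]}},
    {"properties": {"borough": "Bronx", "name": "Co-op City"},
     "geometry": {"coordinates": [-73.829939, 40.874294]}},
]

def features_to_rows(features):
    """Flatten GeoJSON-style features into plain dict rows."""
    rows = []
    for feature in features:
        # GeoJSON order is longitude first, then latitude
        lon, lat = feature["geometry"]["coordinates"][:2]
        rows.append({
            "Borough": feature["properties"]["borough"],
            "Neighborhood": feature["properties"]["name"],
            "Latitude": lat,
            "Longitude": lon,
        })
    return rows

rows = features_to_rows(sample_features)
print(rows[0])
```

A list of such rows can be passed straight to `pd.DataFrame(rows)`.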


Now let's use Foursquare to get information about venues. As an example, suppose the company is located in Queens.

In [17]:
queens_data = ny_neighborhoods[ny_neighborhoods['Borough'] == 'Queens'].reset_index(drop=True)

In [18]:
CLIENT_ID = '4D2PETAWGZF1JPYNPRGXAWDT1OTSN3Q1AIF5EOQMKJZPELAS'
CLIENT_SECRET = 'GOGVEQQGMZCYUD5IZD10GQ4IOPOOZSIOPTVSXTRBQF5UWWOU'
VERSION = '20180605'
LIMIT = 100  # assumed value: LIMIT is used below but its defining cell is not shown here

In [20]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
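
The function above indexes directly into `response`, `groups[0]`, and `categories[0]`, so an error response or a venue without a category would raise an exception. The sketch below shows one defensive way to flatten a response. The payload shape is assumed from the function above, not taken from the Foursquare documentation, and `extract_venues` is a hypothetical helper.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten an explore-style response into rows; tolerate missing parts."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # fall back to None when a venue has no category
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows

# Hypothetical payload with one categorised and one uncategorised venue.
payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe X", "location": {"lat": 1.0, "lng": 2.0},
               "categories": [{"name": "Cafe"}]}},
    {"venue": {"name": "Unknown", "location": {"lat": 1.1, "lng": 2.1},
               "categories": []}},
]}]}}

rows = extract_venues(payload, "Astoria", 40.0, -73.0)
empty = extract_venues({}, "Astoria", 40.0, -73.0)
```

Rows with a `None` category can then be dropped or kept as their own category before one-hot encoding.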

In [22]:
queens_venues = getNearbyVenues(names=queens_data['Neighborhood'],
                                   latitudes=queens_data['Latitude'],
                                   longitudes=queens_data['Longitude']
                                  )

Astoria
Woodside
Jackson Heights
Elmhurst
Howard Beach
Corona
Forest Hills
Kew Gardens
Richmond Hill
Flushing
Long Island City
Sunnyside
East Elmhurst
Maspeth
Ridgewood
Glendale
Rego Park
Woodhaven
Ozone Park
South Ozone Park
College Point
Whitestone
Bayside
Auburndale
Little Neck
Douglaston
Glen Oaks
Bellerose
Kew Gardens Hills
Fresh Meadows
Briarwood
Jamaica Center
Oakland Gardens
Queens Village
Hollis
South Jamaica
St. Albans
Rochdale
Springfield Gardens
Cambria Heights
Rosedale
Far Rockaway
Broad Channel
Breezy Point
Steinway
Beechhurst
Bay Terrace
Edgemere
Arverne
Rockaway Beach
Neponsit
Murray Hill
Floral Park
Holliswood
Jamaica Estates
Queensboro Hill
Hillcrest
Ravenswood
Lindenwood
Laurelton
Lefrak City
Belle Harbor
Rockaway Park
Somerville
Brookville
Bellaire
North Corona
Forest Hills Gardens
Jamaica Hills
Utopia
Pomonok
Astoria Heights
Hunters Point
Sunnyside Gardens
Blissville
Roxbury
Middle Village
Malba
Hammels
Bayswater
Queensbridge


Build the features: one-hot encode the venue categories, then average them per neighborhood.

In [23]:
# one hot encoding
queens_onehot = pd.get_dummies(queens_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
queens_onehot['Neighborhood'] = queens_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [queens_onehot.columns[-1]] + list(queens_onehot.columns[:-1])
queens_onehot = queens_onehot[fixed_columns]

queens_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Afghan Restaurant,American Restaurant,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Arts & Entertainment,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Weight Loss Center,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [24]:
queens_grouped = queens_onehot.groupby('Neighborhood').mean().reset_index()
queens_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Afghan Restaurant,American Restaurant,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Weight Loss Center,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Arverne,0.0,0.000000,0.0000,0.000000,0.000000,0.0,0.0,0.0,0.0,...,0.00,0.000000,0.0,0.0,0.0,0.000000,0.00,0.055556,0.0,0.000000
1,Astoria,0.0,0.000000,0.0000,0.010000,0.000000,0.0,0.0,0.0,0.0,...,0.01,0.000000,0.0,0.0,0.0,0.000000,0.00,0.010000,0.0,0.000000
2,Astoria Heights,0.0,0.000000,0.0000,0.000000,0.000000,0.0,0.0,0.0,0.0,...,0.00,0.000000,0.0,0.0,0.0,0.000000,0.00,0.000000,0.0,0.000000
3,Auburndale,0.0,0.000000,0.0000,0.055556,0.000000,0.0,0.0,0.0,0.0,...,0.00,0.000000,0.0,0.0,0.0,0.000000,0.00,0.000000,0.0,0.000000
4,Bay Terrace,0.0,0.027027,0.0000,0.054054,0.000000,0.0,0.0,0.0,0.0,...,0.00,0.027027,0.0,0.0,0.0,0.027027,0.00,0.000000,0.0,0.054054
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76,Sunnyside Gardens,0.0,0.000000,0.0000,0.030000,0.000000,0.0,0.0,0.0,0.0,...,0.00,0.010000,0.0,0.0,0.0,0.000000,0.01,0.000000,0.0,0.000000
77,Utopia,0.0,0.000000,0.0625,0.000000,0.000000,0.0,0.0,0.0,0.0,...,0.00,0.000000,0.0,0.0,0.0,0.000000,0.00,0.000000,0.0,0.000000
78,Whitestone,0.0,0.000000,0.0000,0.000000,0.000000,0.0,0.0,0.0,0.0,...,0.00,0.000000,0.0,0.0,0.0,0.000000,0.00,0.000000,0.0,0.000000
79,Woodhaven,0.0,0.000000,0.0000,0.000000,0.038462,0.0,0.0,0.0,0.0,...,0.00,0.000000,0.0,0.0,0.0,0.000000,0.00,0.000000,0.0,0.000000


## Part 2. Berlin

In [38]:
url = 'https://en.wikipedia.org/wiki/Boroughs_and_neighborhoods_of_Berlin'
df = pd.read_html(url)

In [50]:
borough = ['Mitte', 'Friedrichshain-Kreuzberg', 'Pankow', 'Charlottenburg-Wilmersdorf',
       'Spandau', 'Steglitz-Zehlendorf', 'Tempelhof-Schöneberg', 'Neukölln', 'Treptow-Köpenick',
       'Marzahn-Hellersdorf', 'Lichtenberg', 'Reinickendorf']

In [64]:
frames = []
for i, name in enumerate(borough):
    # tables 2..13 of the page hold the localities of each borough, in order
    localities = df[2 + i][['Locality']].copy()  # copy avoids SettingWithCopyWarning
    localities['Borough'] = name
    frames.append(localities)
df1 = pd.concat(frames)


In [68]:
berlin_data = df1.reset_index(drop=True)

In [70]:
berlin_data.columns = ['Neighborhood', 'Borough']

In [71]:
berlin_data

Unnamed: 0,Neighborhood,Borough
0,(0101) Mitte,Mitte
1,(0102) Moabit,Mitte
2,(0103) Hansaviertel,Mitte
3,(0104) Tiergarten,Mitte
4,(0105) Wedding,Mitte
...,...,...
91,(1207) Waidmannslust,Reinickendorf
92,(1208) Lübars,Reinickendorf
93,(1209) Wittenau,Reinickendorf
94,(1210) Märkisches Viertel,Reinickendorf


#### Get coordinates of neighborhoods (we do it in batches like this because the library doesn't work any other way)

In [95]:
lat = []
long = []

for a in list(berlin_data['Neighborhood'])[:30]:
    address = 'Berlin,' + a.split(')')[-1]

    geolocator = Nominatim(user_agent="ny_explorer")
    location = geolocator.geocode(address)
    lat.append(location.latitude)
    long.append(location.longitude)

In [96]:
for a in list(berlin_data['Neighborhood'])[30:60]:
    address = 'Berlin,' + a.split(')')[-1]

    geolocator = Nominatim(user_agent="ny_explorer")
    location = geolocator.geocode(address)
    lat.append(location.latitude)
    long.append(location.longitude)

In [97]:
for a in list(berlin_data['Neighborhood'])[60:90]:
    address = 'Berlin,' + a.split(')')[-1]

    geolocator = Nominatim(user_agent="ny_explorer")
    location = geolocator.geocode(address)
    lat.append(location.latitude)
    long.append(location.longitude)

In [98]:
for a in list(berlin_data['Neighborhood'])[90:97]:
    address = 'Berlin,' + a.split(')')[-1]

    geolocator = Nominatim(user_agent="ny_explorer")
    location = geolocator.geocode(address)
    lat.append(location.latitude)
    long.append(location.longitude)
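
The four cells above repeat the same loop over different slices, and `location.latitude` raises an error if the geocoder returns `None` for an address. The sketch below shows one way to fold this into a single helper that records failed lookups. The geocoder is passed in as a function, and the stub `fake_geocode` is purely hypothetical, used so the example runs without network access; with geopy one would pass `geolocator.geocode` instead.

```python
from types import SimpleNamespace

def geocode_all(names, geocode, city="Berlin"):
    """Return (coords, missing): coords maps name -> (latitude, longitude)."""
    coords, missing = {}, []
    for name in names:
        locality = name.split(")")[-1].strip()  # "(0101) Mitte" -> "Mitte"
        location = geocode(f"{city}, {locality}")
        if location is None:
            missing.append(name)  # keep track instead of crashing
        else:
            coords[name] = (location.latitude, location.longitude)
    return coords, missing

def fake_geocode(address):
    """Stand-in for a real geocoder; knows a single address."""
    known = {"Berlin, Mitte": (52.51769, 13.402376)}
    hit = known.get(address)
    return SimpleNamespace(latitude=hit[0], longitude=hit[1]) if hit else None

coords, missing = geocode_all(["(0101) Mitte", "(9999) Nowhere"], fake_geocode)
print(coords, missing)
```

Any names in `missing` can be retried or filled in by hand before the coordinates are attached to the dataframe.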

Add the latitude and longitude to the dataframe.

In [100]:
berlin_data['Latitude'] = lat
berlin_data['Longitude'] = long

In [101]:
berlin_data.head()

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude
0,(0101) Mitte,Mitte,52.51769,13.402376
1,(0102) Moabit,Mitte,52.530102,13.342542
2,(0103) Hansaviertel,Mitte,52.519123,13.341872
3,(0104) Tiergarten,Mitte,52.509778,13.35726
4,(0105) Wedding,Mitte,52.550123,13.34197


Repeat the same process for Berlin.

In [103]:
Berlin_venues = getNearbyVenues(names=berlin_data['Neighborhood'],
                                   latitudes=berlin_data['Latitude'],
                                   longitudes=berlin_data['Longitude']
                                  )

(0101) Mitte
(0102) Moabit
(0103) Hansaviertel
(0104) Tiergarten
(0105) Wedding
(0106) Gesundbrunnen
(0201) Friedrichshain
(0202) Kreuzberg
(0301) Prenzlauer Berg
(0302) Weißensee
(0303) Blankenburg
(0304) Heinersdorf
(0305) Karow
(0306) Stadtrandsiedlung Malchow
(0307) Pankow
(0308) Blankenfelde
(0309) Buch
(0310) Französisch Buchholz
(0311) Niederschönhausen
(0312) Rosenthal
(0313) Wilhelmsruh
(0401) Charlottenburg
(0402) Wilmersdorf
(0403) Schmargendorf
(0404) Grunewald
(0405) Westend
(0406) Charlottenburg-Nord
(0407) Halensee
(0501) Spandau
(0502) Haselhorst
(0503) Siemensstadt
(0504) Staaken
(0505) Gatow
(0506) Kladow
(0507) Hakenfelde
(0508) Falkenhagener Feld
(0509) Wilhelmstadt
(0601) Steglitz
(0602) Lichterfelde
(0603) Lankwitz
(0604) Zehlendorf
(0605) Dahlem
(0606) Nikolassee
(0607) Wannsee
(0701) Schöneberg
(0702) Friedenau
(0703) Tempelhof
(0704) Mariendorf
(0705) Marienfelde
(0706) Lichtenrade
(0801) Neukölln
(0802) Britz
(0803) Buckow
(0804) Rudow
(0805) Gropiusstadt
(090

In [107]:
# one hot encoding
berlin_onehot = pd.get_dummies(Berlin_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
berlin_onehot['Neighborhood'] = Berlin_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [berlin_onehot.columns[-1]] + list(berlin_onehot.columns[:-1])
berlin_onehot = berlin_onehot[fixed_columns]

berlin_onehot.head()

Unnamed: 0,Zoo Exhibit,ATM,African Restaurant,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Asian Restaurant,Austrian Restaurant,Auto Dealership,...,Vietnamese Restaurant,Vineyard,Volleyball Court,Warehouse Store,Water Park,Waterfront,Windmill,Wine Bar,Wine Shop,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [108]:
berlin_grouped = berlin_onehot.groupby('Neighborhood').mean().reset_index()
berlin_grouped

Unnamed: 0,Neighborhood,Zoo Exhibit,ATM,African Restaurant,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Asian Restaurant,Austrian Restaurant,...,Vietnamese Restaurant,Vineyard,Volleyball Court,Warehouse Store,Water Park,Waterfront,Windmill,Wine Bar,Wine Shop,Yoga Studio
0,(0101) Mitte,0.0,0.0,0.0,0.000000,0.0,0.04,0.020000,0.000000,0.000000,...,0.020000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0
1,(0102) Moabit,0.0,0.0,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.015152,...,0.015152,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0
2,(0103) Hansaviertel,0.0,0.0,0.0,0.000000,0.0,0.00,0.074074,0.000000,0.000000,...,0.000000,0.0,0.0,0.000000,0.0,0.037037,0.0,0.0,0.037037,0.0
3,(0104) Tiergarten,0.0,0.0,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.000000,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0
4,(0105) Wedding,0.0,0.0,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.000000,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88,(1207) Waidmannslust,0.0,0.0,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.000000,...,0.000000,0.0,0.0,0.076923,0.0,0.000000,0.0,0.0,0.000000,0.0
89,(1208) Lübars,0.0,0.0,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.000000,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0
90,(1209) Wittenau,0.0,0.0,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.000000,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0
91,(1210) Märkisches Viertel,0.0,0.0,0.0,0.083333,0.0,0.00,0.000000,0.083333,0.000000,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0


## Part 3. Clustering

Finally, let's concatenate the two dataframes and cluster the neighborhoods.

In [194]:
# DataFrame.append was removed in pandas 2.0; sort=True keeps the columns in alphabetical order
final_df = pd.concat([queens_grouped, berlin_grouped], sort=True)


In [195]:
final_df.fillna(value=0, inplace=True)

Move the Neighborhood column to the front.

In [196]:
col = list(final_df.columns)
n = col.index('Neighborhood')
newcol = [col[n]] + col[:n] + col[n + 1:]
final_df = final_df[newcol]

In [197]:
final_df.head()

Unnamed: 0,Neighborhood,ATM,Accessories Store,Afghan Restaurant,African Restaurant,American Restaurant,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,...,Water Park,Waterfront,Weight Loss Center,Windmill,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,Arverne,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0
1,Astoria,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0
2,Astoria Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Auburndale,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Bay Terrace,0.0,0.027027,0.0,0.0,0.054054,0.0,0.0,0.0,0.0,...,0.0,0.0,0.027027,0.0,0.0,0.0,0.0,0.054054,0.0,0.0


Let's cluster

In [198]:
# set number of clusters
kclusters = 20

final_df_clustering = final_df.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(final_df_clustering)

In [199]:
# add clustering labels
final_df.insert(0, 'Cluster Labels', kmeans.labels_)

In [200]:
final_df

Unnamed: 0,Cluster Labels,Neighborhood,ATM,Accessories Store,Afghan Restaurant,African Restaurant,American Restaurant,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Water Park,Waterfront,Weight Loss Center,Windmill,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,1,Arverne,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.055556,0.0,0.000000,0.0,0.0
1,1,Astoria,0.0,0.000000,0.0,0.0,0.010000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.010000,0.0,0.000000,0.0,0.0
2,1,Astoria Heights,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
3,1,Auburndale,0.0,0.000000,0.0,0.0,0.055556,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
4,1,Bay Terrace,0.0,0.027027,0.0,0.0,0.054054,0.0,0.0,0.0,...,0.0,0.0,0.027027,0.0,0.0,0.000000,0.0,0.054054,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88,3,(1207) Waidmannslust,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
89,12,(1208) Lübars,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
90,3,(1209) Wittenau,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
91,1,(1210) Märkisches Viertel,0.0,0.000000,0.0,0.0,0.083333,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0


Let's suppose that the company is located in 'Cambria Heights' and check where it could move.

In [201]:
# First, check the cluster
cluster_num = final_df[final_df['Neighborhood'] == 'Cambria Heights']['Cluster Labels'].values[0]
cluster_num

17

In [202]:
final_df[final_df['Cluster Labels'] == cluster_num]

Unnamed: 0,Cluster Labels,Neighborhood,ATM,Accessories Store,Afghan Restaurant,African Restaurant,American Restaurant,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Water Park,Waterfront,Weight Loss Center,Windmill,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
16,17,Cambria Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
43,17,Laurelton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
73,17,St. Albans,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We can see that no Berlin neighborhood falls into the same cluster. But what if the company is located in South Ozone Park?

In [215]:
cluster_num = final_df[final_df['Neighborhood'] == 'South Ozone Park']['Cluster Labels'].values[0]
cluster_num

11

In [216]:
final_df[final_df['Cluster Labels'] == cluster_num]

Unnamed: 0,Cluster Labels,Neighborhood,ATM,Accessories Store,Afghan Restaurant,African Restaurant,American Restaurant,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Water Park,Waterfront,Weight Loss Center,Windmill,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
71,11,South Ozone Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
40,11,(0605) Dahlem,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Great! There is one Neighborhood in Berlin that matches.
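
The two lookups above can be wrapped in a small helper. The sketch below works on a plain mapping from neighborhood name to cluster label, filled with the labels shown in the outputs above. It assumes that Berlin names start with a `(NNNN)` code while New York names do not, as in this dataset; `berlin_matches` is a hypothetical name.

```python
# Cluster labels taken from the outputs above.
labels = {
    "Cambria Heights": 17,
    "Laurelton": 17,
    "St. Albans": 17,
    "South Ozone Park": 11,
    "(0605) Dahlem": 11,
}

def berlin_matches(labels, neighborhood):
    """Return the Berlin neighborhoods that share a cluster with `neighborhood`."""
    target = labels[neighborhood]
    return sorted(
        name for name, label in labels.items()
        # Berlin rows are recognised by their "(NNNN)" prefix (dataset-specific assumption)
        if label == target and name.startswith("(")
    )

print(berlin_matches(labels, "South Ozone Park"))  # ['(0605) Dahlem']
print(berlin_matches(labels, "Cambria Heights"))   # []
```

An empty list means the cluster contains no Berlin neighborhood, which is the Cambria Heights case.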

We will not build a map for these neighborhoods, because the two cities are too far apart.

Let's look at the clusters in more detail.

In [217]:
final_df[final_df['Cluster Labels'] == 1]

Unnamed: 0,Cluster Labels,Neighborhood,ATM,Accessories Store,Afghan Restaurant,African Restaurant,American Restaurant,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Water Park,Waterfront,Weight Loss Center,Windmill,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,1,Arverne,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.055556,0.0,0.000000,0.0,0.0
1,1,Astoria,0.0,0.000000,0.0,0.0,0.010000,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.010000,0.0,0.000000,0.0,0.0
2,1,Astoria Heights,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
3,1,Auburndale,0.0,0.000000,0.0,0.0,0.055556,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
4,1,Bay Terrace,0.0,0.027027,0.0,0.0,0.054054,0.000000,0.0,0.0,...,0.0,0.0,0.027027,0.0,0.0,0.000000,0.0,0.054054,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80,1,Woodside,0.0,0.000000,0.0,0.0,0.036585,0.012195,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
30,1,(0503) Siemensstadt,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
57,1,(0904) Johannisthal,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0
71,1,(1005) Hellersdorf,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0


Cluster 1 contains both Berlin and New York neighborhoods.

In [220]:
final_df[final_df['Cluster Labels'] == 3]

Unnamed: 0,Cluster Labels,Neighborhood,ATM,Accessories Store,Afghan Restaurant,African Restaurant,American Restaurant,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Water Park,Waterfront,Weight Loss Center,Windmill,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,3,(0101) Mitte,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,3,(0102) Moabit,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,(0103) Hansaviertel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.037037,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0
3,3,(0104) Tiergarten,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3,(0105) Wedding,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,3,(0106) Gesundbrunnen,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,3,(0201) Friedrichshain,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.017544,...,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0
7,3,(0202) Kreuzberg,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.017241,...,0.0,0.034483,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0
8,3,(0301) Prenzlauer Berg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.020408,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,3,(0302) Weißensee,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Cluster 3 contains only Berlin neighborhoods.

Other clusters are less interesting
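
Rather than inspecting clusters one at a time, their composition by city can be tabulated. The sketch below is a minimal version on a hand-picked subset of the labels shown above. It again assumes that Berlin names are the ones starting with `(`; `cluster_composition` is a hypothetical helper.

```python
from collections import defaultdict

# Subset of the labels shown in the outputs above.
labels = {
    "Arverne": 1, "Astoria": 1, "(0503) Siemensstadt": 1,
    "(0101) Mitte": 3, "(0102) Moabit": 3,
    "Cambria Heights": 17,
    "South Ozone Park": 11, "(0605) Dahlem": 11,
}

def cluster_composition(labels):
    """Count Berlin and New York neighborhoods in each cluster."""
    composition = defaultdict(lambda: {"Berlin": 0, "New York": 0})
    for name, label in labels.items():
        city = "Berlin" if name.startswith("(") else "New York"
        composition[label][city] += 1
    return dict(composition)

composition = cluster_composition(labels)
mixed = sorted(l for l, c in composition.items() if c["Berlin"] and c["New York"])
berlin_only = sorted(l for l, c in composition.items() if c["Berlin"] and not c["New York"])
print(mixed, berlin_only)  # [1, 11] [3]
```

Only the mixed clusters can produce a cross-city recommendation. On the full dataframe, `pd.crosstab` over the cluster labels and a city column would likely give the same table.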

# Results

As a result of this project, we achieved the following:  
1) We gathered data on Berlin and New York neighborhoods from Foursquare;  
2) We clustered the neighborhoods by their venue category frequencies;  
3) For a given New York neighborhood, we can tell whether Berlin has a similar one.

# Discussion

We noticed the following:  
1) If a New York neighborhood belongs to cluster 1, it is easy to find a matching neighborhood in Berlin;  
2) Some Berlin neighborhoods are quite different from the New York ones, and therefore form cluster 3 on their own;  
3) Even though the cities are very different, we can still find some similar neighborhoods.  
  
We recommend:  
1) If the company is located in 'Cambria Heights': no Berlin neighborhood is similar to it, so the clustering gives no reason to prefer one over another;  
2) If the company is located in 'South Ozone Park': move to (0605) Dahlem, which is a one-to-one match.  
  
We can make a similar recommendation for any other given neighborhood.

# Conclusion

In this project we clustered the neighborhoods of two different cities in order to recommend a Berlin neighborhood for the company to move to.  
We now have a full picture of the situation and can make a recommendation for any given neighborhood in New York or Berlin.  
  
Possible further development:   
  
1) Gather more information about each neighborhood to make the clustering more accurate;  
2) Derive new, more complex features from the gathered ones for the same reason.