# Please view this notebook in https://nbviewer.jupyter.org to see the maps!

# IBM Capstone Project - The Battle of the Neighborhoods

This notebook will be used for the final course in this IBM series.


The following CSV file lists the 4-digit part of each Dutch zipcode together with its geographical coordinates:
https://github.com/bobdenotter/4pp/blob/master/4pp.csv
I have imported this file to my project assets.

The following article explains how you can work with a file that you uploaded to your project assets:
https://medium.com/@snehalgawas/working-with-ibm-cloud-object-storage-in-python-fe0ba8667d5f

You can find a quick tutorial on importing a CSV file in Pandas here:
https://towardsdatascience.com/how-to-read-csv-file-using-pandas-ab1f5e7e7b58


In [1]:
# The code was removed by Watson Studio for sharing.

In [2]:
from ibm_botocore.client import Config
import ibm_boto3

cos = ibm_boto3.client(service_name='s3',
    ibm_api_key_id=credentials['IBM_API_KEY_ID'],
    ibm_service_instance_id=credentials['IAM_SERVICE_ID'],
    ibm_auth_endpoint=credentials['IBM_AUTH_ENDPOINT'],
    config=Config(signature_version='oauth'),
    endpoint_url=credentials['ENDPOINT'])

In [3]:
cos.download_file(Bucket=credentials['BUCKET'],Key=credentials['FILE'],Filename=credentials['FILE'])

In [4]:
import os

os.listdir("./")

['.virtual_documents', '4pp.csv']

In [5]:
import pandas as pd

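#the CSV file is comma separated
#note: error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) skips malformed lines instead
df = pd.read_csv(credentials['FILE'], sep=',', on_bad_lines='skip')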

In [6]:
df.head()

Unnamed: 0,id,postcode,woonplaats,alternatieve_schrijfwijzen,gemeente,provincie,netnummer,latitude,longitude,soort
0,1,1000,Amsterdam,,Amsterdam,Noord-Holland,20,52.336243,4.869444,Postbus
1,2,1001,Amsterdam,,Amsterdam,Noord-Holland,20,52.36424,4.883358,Postbus
2,3,1002,Amsterdam,,Amsterdam,Noord-Holland,20,52.36424,4.883358,Onbekend
3,4,1003,Amsterdam,,Amsterdam,Noord-Holland,20,52.36424,4.883358,Onbekend
4,5,1005,Amsterdam,,Amsterdam,Noord-Holland,20,52.36424,4.883358,Postbus


Examining the table shows that the 4-digit zipcodes of Amsterdam all fall in the range 1000-1099. The column 'soort' (type) indicates how a zipcode is used: as a street address ('Adres'), as a PO box ('Postbus'), or unknown ('Onbekend'). We will only use the zipcodes that are used for addresses.

Finally, we only need the postcode, latitude and longitude columns for our purposes.

The following post explains nicely how this can be done in a single operation in Pandas, much like a SQL statement:
https://stackoverflow.com/questions/48035493/pandas-select-rows-and-columns-based-on-boolean-condition

In [7]:
neighborhoods = df.loc[(df['postcode'] < 1100) & (df['soort'].str.contains('Adres')), ['postcode', 'latitude', 'longitude']]

neighborhoods.head()

Unnamed: 0,postcode,latitude,longitude
9,1011,52.372976,4.903957
10,1012,52.373386,4.894064
11,1013,52.396789,4.876607
12,1014,52.392305,4.855884
13,1015,52.379093,4.885109


In [8]:
neighborhoods.shape

(73, 3)
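
As a small illustration of the single-step selection used above, the sketch below applies the same kind of `.loc` filter to a toy frame. The rows are made up for illustration (only the column names and the 'Adres'/'Postbus' labels follow the table above), and `na=False` is an optional safeguard that I am assuming is useful if 'soort' ever contains missing values:

```python
import pandas as pd

# Toy data: made-up rows that only mimic the columns of 4pp.csv
toy = pd.DataFrame({
    'postcode': [1000, 1011, 1012, 1101, 1013],
    'latitude': [52.34, 52.37, 52.37, 52.31, 52.40],
    'longitude': [4.87, 4.90, 4.89, 4.95, 4.88],
    'soort': ['Postbus', 'Adres', 'Adres', 'Adres', None],
})

# Row condition and column list in one .loc call;
# na=False treats a missing 'soort' as "no match" instead of propagating NaN
mask = (toy['postcode'] < 1100) & toy['soort'].str.contains('Adres', na=False)
selected = toy.loc[mask, ['postcode', 'latitude', 'longitude']]

print(selected)
```

Only the two rows that are both below 1100 and marked 'Adres' survive the filter.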

In [9]:
#Optional: sample half of the zipcodes to reduce the number of areas, since I think the zipcodes are too dense for the area being examined
#neighborhoods = neighborhoods.sample(frac = 0.5)
#neighborhoods.shape

There are 73 zipcodes.


## Explore and cluster the neighborhoods in Amsterdam

Use the geopy library to get the latitude and longitude values of Amsterdam:

In [10]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

In [11]:
#I am using a central location in Amsterdam, which is Dam Square
address = 'de Dam, Amsterdam, Netherlands'

geolocator = Nominatim(user_agent='amsterdam_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))

The geographical coordinates of de Dam, Amsterdam, Netherlands are 52.3988783, 4.932138698513164.
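
The coordinates printed above can be sanity-checked against the table: postcode 1012, which this notebook later treats as a central zipcode, is listed at 52.373386, 4.894064. The sketch below computes the great-circle distance between the two points with the haversine formula. The function name and the mean Earth radius of 6371 km are my own choices, not part of the original notebook. If the distance is larger than expected, it may be worth refining the address string passed to the geocoder.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0  # mean Earth radius (assumed constant)
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

# Geocoded result printed above vs. the coordinates of postcode 1012 from the table
geocoded = (52.3988783, 4.932138698513164)
postcode_1012 = (52.373386, 4.894064)

distance = haversine_km(*geocoded, *postcode_1012)
print('Distance between the two points: {:.1f} km'.format(distance))
```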


Create a map of Amsterdam with the neighborhoods superimposed on top:

In [12]:
!pip install folium 
import folium # map rendering library



In [13]:
#create a map of Amsterdam using the latitude and longitude values
map_amsterdam = folium.Map(location=[latitude, longitude], zoom_start=11)

#add markers to map
for lat, lng, pcode in zip(neighborhoods['latitude'], neighborhoods['longitude'], neighborhoods['postcode']):
    label = '{}'.format(pcode)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_amsterdam)

map_amsterdam

Now we will use the Foursquare API to explore the neighborhoods and segment them. First we set the Foursquare credentials:

In [14]:
# The code was removed by Watson Studio for sharing.

To check that the API works, we start with a single neighborhood. Looking at the map, 1012 is a central zipcode, so let's get the venues within a radius of 300 meters of its coordinates.

In [15]:
radius = 300
neighborhood_latitude = 52.373386
neighborhood_longitude = 4.894064

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, neighborhood_longitude,ACCESS_TOKEN, VERSION, radius, LIMIT)

Send the request and get the results:

In [16]:
import requests  

results = requests.get(url).json()

#Uncomment the next line if you want to inspect the JSON
#results
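
Building the query string by hand is easy to get wrong, so here is a hedged sketch that assembles the same kind of request parameters with `urllib.parse.urlencode`. The function name `build_explore_url`, the default `limit`, and all credential and version values are placeholders of mine; the real values live in the hidden credentials cell.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=300, limit=100):
    """Assemble a venues/explore URL from named parameters (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode escapes special characters, e.g. the comma in 'll'
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Placeholder credentials and version, for illustration only
example_url = build_explore_url('MY_ID', 'MY_SECRET', '20180605', 52.373386, 4.894064)
print(example_url)
```

With `requests`, the same dictionary can also be passed directly via the `params` argument of `requests.get`, which performs the encoding for you.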

The venue information is in the items key, so we reuse the code that cleans the JSON and structures it into a pandas dataframe:

In [17]:
#define function that extracts the category of the venue
def get_category_type(row):
    #the category list sits under 'categories' or, after flattening, under 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [18]:
venues = results['response']['groups'][0]['items']

nearby_venues = pd.json_normalize(venues) #flatten JSON

#filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

#filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis = 1)

#clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Proeflokaal De Drie Fleschjes,Bar,52.374203,4.892239
1,Wynand Fockink,Liquor Store,52.372301,4.895253
2,Rob Wigboldus Vishandel,Snack Place,52.374144,4.893967
3,Scheltema,Bookstore,52.372205,4.893175
4,Kaagman & Kortekaas,French Restaurant,52.374878,4.892455
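
To make the flattening step easier to follow without calling the API, the sketch below runs the same steps on a hand-written list shaped like the `items` entries. The venues in it are invented, and the second one deliberately has no category to show the `None` branch.

```python
import pandas as pd

# Invented items that only mimic the structure of results['response']['groups'][0]['items']
items = [
    {'venue': {'name': 'Example Bar',
               'categories': [{'name': 'Bar'}],
               'location': {'lat': 52.3740, 'lng': 4.8920}}},
    {'venue': {'name': 'Uncategorized Place',
               'categories': [],
               'location': {'lat': 52.3725, 'lng': 4.8950}}},
]

flat = pd.json_normalize(items)  # nested dicts become dotted column names
wanted = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
flat = flat.loc[:, wanted].copy()

# keep the name of the first category, or None when the list is empty
flat['venue.categories'] = flat['venue.categories'].apply(
    lambda cats: cats[0]['name'] if len(cats) > 0 else None)

# strip the 'venue.' and 'venue.location.' prefixes
flat.columns = [col.split('.')[-1] for col in flat.columns]
print(flat)
```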


In [19]:
print('{} venues were returned by FourSquare.'.format(nearby_venues.shape[0]))

100 venues were returned by FourSquare.


## Explore neighborhoods in Amsterdam

All the steps above are combined into a single function so we can repeat this process for all neighborhoods in Amsterdam:

In [20]:
def getNearbyVenues(names, latitudes, longitudes, radius=300):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
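
Two spots in the function above are somewhat fragile: `v['venue']['categories'][0]` raises an `IndexError` for a venue without categories, and indexing the JSON with `["response"]['groups'][0]` can fail if the API returns an error payload instead of results. A possible defensive variant of the extraction step is sketched below. `extract_venue_rows` is a hypothetical helper name, and both sample payloads are invented.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Turn one explore payload into row tuples, tolerating missing pieces."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # fall back to None when a venue has no category at all
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Invented payloads: one normal response and one error-style response
ok_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe', 'categories': [],
               'location': {'lat': 52.37, 'lng': 4.90}}}]}]}}
error_payload = {'meta': {'code': 429}, 'response': {}}

ok_rows = extract_venue_rows(1011, 52.372976, 4.903957, ok_payload)
error_rows = extract_venue_rows(1011, 52.372976, 4.903957, error_payload)
print(ok_rows)
print(error_rows)
```

A postcode whose request fails then simply contributes no rows instead of aborting the whole loop.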

In [21]:
amsterdam_venues = getNearbyVenues(names=neighborhoods['postcode'],
                                 latitudes=neighborhoods['latitude'],
                                 longitudes=neighborhoods['longitude']
                                )

1011
1012
1013
1014
1015
1016
1017
1018
1019
1021
1022
1023
1024
1025
1026
1027
1028
1031
1032
1033
1034
1035
1036
1037
1041
1042
1043
1044
1045
1046
1047
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1071
1072
1073
1074
1075
1076
1077
1078
1079
1081
1082
1083
1086
1087
1091
1092
1093
1094
1095
1096
1097
1098
1099


Check the resulting dataframe:

In [22]:
print(amsterdam_venues.shape)
amsterdam_venues.head()

(945, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,1011,52.372976,4.903957,OCHA,52.374024,4.901683,Thai Restaurant
1,1011,52.372976,4.903957,HPS,52.371683,4.907673,Cocktail Bar
2,1011,52.372976,4.903957,Restaurant Gebr. Hartering,52.371577,4.907455,Restaurant
3,1011,52.372976,4.903957,A-Fusion,52.373235,4.900254,Sushi Restaurant
4,1011,52.372976,4.903957,Slagerij Vet,52.374011,4.900319,Butcher


Let's check the distribution of venue categories, look for duplicate venue names, and count the venues per neighborhood:

In [23]:
#Let's check the distribution of venue category
category_totals = amsterdam_venues.groupby(['Venue Category'], sort=False)['Venue'].count().reset_index(name = 'Count')
category_totals.head()

Unnamed: 0,Venue Category,Count
0,Thai Restaurant,13
1,Cocktail Bar,10
2,Restaurant,40
3,Sushi Restaurant,9
4,Butcher,1


In [24]:
#how many categories have more than one entry
category_toptotals = category_totals.loc[category_totals['Count'] > 1]
category_toptotals.shape

(126, 2)

In [25]:
#select only amsterdam_venues where the category is in the list provided above
#amsterdam_venues = pd.merge(amsterdam_venues, category_toptotals, on=["Venue Category"])
#amsterdam_venues.head()
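
The commented-out merge above would keep only venues whose category occurs more than once. A sketch of an alternative that avoids the extra `Count` column is shown below on invented data, using `groupby(...).transform('count')`:

```python
import pandas as pd

# Invented venues; only the column names follow amsterdam_venues
venues = pd.DataFrame({
    'Venue': ['A', 'B', 'C', 'D'],
    'Venue Category': ['Bar', 'Bar', 'Butcher', 'Restaurant'],
})

# number of venues sharing each row's category, aligned with the original rows
per_category = venues.groupby('Venue Category')['Venue'].transform('count')
frequent = venues[per_category > 1]
print(frequent)
```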

In [26]:
#visual check on duplicate venues
venue_totals = amsterdam_venues['Venue'].value_counts()
venue_totals.head(25)

Albert Heijn                        14
Kruidvat                             9
Febo                                 6
Action                               4
Vegan Junk Food Bar                  3
Coffee Company                       3
Massimo Gelato                       3
Lidl                                 3
Blokker                              3
Buongiorno Espressobar               2
Restaurant Sinne                     2
Coffeecompany                        2
Restaurant Surya                     2
Café Maxwell                         2
Marqt                                2
Mama Dough                           2
Henry's Bar                          2
Sir Hummus                           2
Dirk van den Broek                   2
Run2Day                              2
Batoni Khinkali                      2
Bagels & Beans                       2
The Breakfast Club                   2
Vascobelo                            2
All the Luck in the World | Oost     2
Name: Venue, dtype: int64

In [27]:
amsterdam_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
1011,13,13,13,13,13,13
1012,54,54,54,54,54,54
1013,5,5,5,5,5,5
1014,4,4,4,4,4,4
1015,50,50,50,50,50,50
...,...,...,...,...,...,...
1095,5,5,5,5,5,5
1096,6,6,6,6,6,6
1097,4,4,4,4,4,4
1098,5,5,5,5,5,5


How many unique categories are there among all the returned venues?

In [28]:
print('There are {} unique categories.'.format(len(amsterdam_venues['Venue Category'].unique())))

There are 201 unique categories.


## Analyze each neighborhood

In [29]:
# one hot encoding
amsterdam_onehot = pd.get_dummies(amsterdam_venues[['Venue Category']], prefix='', prefix_sep='')

# add zipcode column to dataframe
amsterdam_onehot.insert(0, 'postcode', amsterdam_venues['Neighborhood'], True)

amsterdam_onehot.head()

Unnamed: 0,postcode,Adult Boutique,African Restaurant,American Restaurant,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Turkish Restaurant,Udon Restaurant,VR Cafe,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo Exhibit
0,1011,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1011,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1011,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1011,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1011,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [30]:
# check the dataframe size
amsterdam_onehot.shape

(945, 202)

Let's group rows by neighborhood, taking the mean of the frequency of occurrence for each category:

In [31]:
amsterdam_grouped = amsterdam_onehot.groupby('postcode').mean().reset_index()
amsterdam_grouped

Unnamed: 0,postcode,Adult Boutique,African Restaurant,American Restaurant,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Turkish Restaurant,Udon Restaurant,VR Cafe,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo Exhibit
0,1011,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.076923,0.00,0.0,0.00,0.0,0.0
1,1012,0.0,0.0,0.0,0.018519,0.018519,0.018519,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.00,0.0,0.00,0.0,0.0
2,1013,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.00,0.0,0.00,0.0,0.0
3,1014,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.00,0.0,0.00,0.0,0.0
4,1015,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.02,0.0,0.02,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62,1095,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.00,0.0,0.00,0.0,0.0
63,1096,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.00,0.0,0.00,0.0,0.0
64,1097,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.00,0.0,0.00,0.0,0.0
65,1098,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.00,0.0,0.00,0.0,0.0


In [32]:
# Confirm size:
amsterdam_grouped.shape

(67, 202)
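
To see why the grouped mean can be read as a frequency, consider this toy example with invented venues: a postcode with two venues, one of which is a bar, ends up with 0.5 in the 'Bar' column. `dtype=float` is passed explicitly because newer pandas versions return boolean dummies by default.

```python
import pandas as pd

# Invented venues; only the column names follow amsterdam_venues
toy_venues = pd.DataFrame({
    'Neighborhood': [1011, 1011, 1012],
    'Venue Category': ['Bar', 'Bakery', 'Bar'],
})

# one column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy_venues[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'postcode', toy_venues['Neighborhood'])

# the mean of the 0/1 indicators is the share of venues in each category
grouped = onehot.groupby('postcode').mean().reset_index()
print(grouped)
```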

In [33]:
# Print each neighborhood with top 5 categories:
num_top_venues = 5

for pcode in amsterdam_grouped['postcode']:
    print("----"+str(pcode)+"----")
    temp = amsterdam_grouped[amsterdam_grouped['postcode'] == pcode].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----1011----
             venue  freq
0  Thai Restaurant  0.15
1    Deli / Bodega  0.15
2       Restaurant  0.08
3     Cocktail Bar  0.08
4          Butcher  0.08


----1012----
                  venue  freq
0                   Bar  0.09
1                 Hotel  0.07
2  Marijuana Dispensary  0.06
3           Coffee Shop  0.06
4             Hotel Bar  0.04


----1013----
                        venue  freq
0                  Restaurant   0.2
1            Business Service   0.2
2  Modern European Restaurant   0.2
3                     Theater   0.2
4                  Soup Place   0.2


----1014----
         venue  freq
0  Medical Lab  0.25
1       Bakery  0.25
2     Bus Stop  0.25
3    Nightclub  0.25
4         Pool  0.00


----1015----
                venue  freq
0                 Bar  0.20
1  Italian Restaurant  0.12
2      Sandwich Place  0.08
3                Café  0.06
4              Market  0.04


----1016----
                venue  freq
0  Italian Restaurant  0.07
1          Resta

Put into pandas dataframe:

In [34]:
import numpy as np

# function to sort venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

# create df

num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['postcode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['postcode'] = amsterdam_grouped['postcode']

for ind in np.arange(amsterdam_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(amsterdam_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,1011,Deli / Bodega,Thai Restaurant,Sushi Restaurant,Chinese Restaurant,Gay Bar
1,1012,Bar,Hotel,Coffee Shop,Marijuana Dispensary,Dessert Shop
2,1013,Modern European Restaurant,Restaurant,Business Service,Theater,Soup Place
3,1014,Bus Stop,Nightclub,Medical Lab,Bakery,Zoo Exhibit
4,1015,Bar,Italian Restaurant,Sandwich Place,Café,Coffee Shop
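
Note that the printed top-5 lists and the table above do not always agree (compare postcode 1011): many categories share the same frequency, and the default sort does not guarantee an order among ties. If a reproducible order matters, one option (my suggestion, not something the notebook does) is a stable sort, sketched here on an invented row:

```python
import pandas as pd

# Invented frequencies with a three-way tie at 0.25
row = pd.Series({'Bakery': 0.25, 'Bar': 0.50, 'Bus Stop': 0.25, 'Nightclub': 0.25, 'Pool': 0.0})

# mergesort is stable; sorting the negated values in ascending order gives a
# descending ranking in which tied categories keep their original column order
ranked = (-row).sort_values(kind='mergesort')
top3 = list(ranked.index.values[0:3])
print(top3)
```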


## Cluster Neighborhoods

Run k-means to cluster the neighborhoods into 7 clusters. Please note that I ran the exercise with different values for k (2, 3, 5, 7), but this did not seem to have a large impact on the results.

In [35]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

In [36]:
# set number of clusters
kclusters = 7

#drop the postcode column for clustering purposes
amsterdam_grouped_clustering = amsterdam_grouped.drop(columns='postcode')

#uncomment the next line to check amsterdam_grouped_clustering
#amsterdam_grouped_clustering.head()

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(amsterdam_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:50] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 2, 0, 0, 0, 4, 0, 2,
       5, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0], dtype=int32)
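
Rather than judging k by eye, one common heuristic is the silhouette score. The sketch below is an illustration on synthetic blobs, not a result for the Amsterdam data. In this notebook you would pass `amsterdam_grouped_clustering` instead of `X`, and there is no guarantee that the score singles out a clear winner there.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs (not the Amsterdam data)
rng = np.random.RandomState(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = {}
for k in (2, 3, 5, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # values closer to 1 indicate compact, well-separated clusters
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print('Best k by silhouette score:', best_k)
```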

Create a new dataframe that includes the cluster as well as the top 5 venues for each neighborhood:

In [37]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

amsterdam_merged = neighborhoods

# join neighborhoods_venues_sorted onto the neighborhoods dataframe to add the cluster label and top venues to each postcode
amsterdam_merged = amsterdam_merged.join(neighborhoods_venues_sorted.set_index(['postcode']), on=['postcode'])

amsterdam_merged.head() # check the last columns!

Unnamed: 0,postcode,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
9,1011,52.372976,4.903957,0.0,Deli / Bodega,Thai Restaurant,Sushi Restaurant,Chinese Restaurant,Gay Bar
10,1012,52.373386,4.894064,0.0,Bar,Hotel,Coffee Shop,Marijuana Dispensary,Dessert Shop
11,1013,52.396789,4.876607,0.0,Modern European Restaurant,Restaurant,Business Service,Theater,Soup Place
12,1014,52.392305,4.855884,0.0,Bus Stop,Nightclub,Medical Lab,Bakery,Zoo Exhibit
13,1015,52.379093,4.885109,0.0,Bar,Italian Restaurant,Sandwich Place,Café,Coffee Shop


In [38]:
#following an error message I checked the forum: rows with NaN values cause an error
#postcodes for which no venues were returned are missing from neighborhoods_venues_sorted, so the join leaves NaN in their rows
#uncomment the next lines to check which rows contain NaN values

#amsterdam_merged_nan = amsterdam_merged[amsterdam_merged.isna().any(axis=1)]
#print (amsterdam_merged_nan)

#Indeed, several postcodes (for example 1024 and 1045) have NaN values: the shape drops from 73 to 67 rows

amsterdam_merged.dropna(inplace=True)

amsterdam_merged.shape

(67, 9)
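
The float labels (0.0, 2.0, ...) and the need for `dropna` both come from the left join: postcodes without venues have no matching row, so pandas fills their columns with NaN and upcasts the integer labels to float. Below is a toy illustration with invented values, including a cast back to integers after dropping the incomplete rows (an optional step the notebook does not take):

```python
import pandas as pd

# Invented frames: postcode 1024 has no cluster label
areas = pd.DataFrame({'postcode': [1011, 1012, 1024]})
labels = pd.DataFrame({'postcode': [1011, 1012], 'Cluster Labels': [0, 3]})

# left join on postcode, as in the cell above
merged = areas.join(labels.set_index('postcode'), on='postcode')
print(merged['Cluster Labels'].dtype)  # a float dtype, because 1024 has no label

# drop the incomplete rows and restore integer labels
cleaned = merged.dropna().copy()
cleaned['Cluster Labels'] = cleaned['Cluster Labels'].astype(int)
print(cleaned)
```

With integer labels, the later `int(cluster)` casts when indexing the color list would no longer be needed.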

Visualize the resulting clusters:

In [39]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [40]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(amsterdam_merged['latitude'], amsterdam_merged['longitude'], amsterdam_merged['postcode'], amsterdam_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        #the cluster labels are floats after the join and list indices must be integers, so cast to int (solution found on the forum)
        #color=rainbow[cluster-1],
        color = rainbow[int(cluster)-1],
        fill=True,
        #fill_color=rainbow[cluster-1],
        fill_color = rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Examine clusters:

I have tried different values of k, but my general observation remains that the Amsterdam data is not as diverse as the NY data. As a result, one cluster is "dominant", with only a handful of neighborhoods placed in other clusters, as can be seen in the results. Three of the clusters outside the dominant "red" cluster contain only a single neighborhood.

In [41]:
#check the distribution of zipcodes across clusters
amsterdam_merged['Cluster Labels'].value_counts()

0.0    57
2.0     3
5.0     2
3.0     2
1.0     1
6.0     1
4.0     1
Name: Cluster Labels, dtype: int64
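
To put a number on "dominant", the counts printed above can be turned into shares. The dictionary below simply copies the output of the previous cell; in the notebook itself, `value_counts(normalize=True)` would give the same shares directly.

```python
# Cluster sizes copied from the value_counts() output above
cluster_sizes = {0: 57, 2: 3, 5: 2, 3: 2, 1: 1, 6: 1, 4: 1}

total = sum(cluster_sizes.values())
dominant_share = cluster_sizes[0] / total
singletons = sorted(c for c, n in cluster_sizes.items() if n == 1)

print('{} of {} neighborhoods ({:.0%}) fall in cluster 0'.format(
    cluster_sizes[0], total, dominant_share))
print('Clusters with a single neighborhood:', singletons)
```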