# Final Capstone Project

### Introduction

The region of Flanders, Belgium, is one of the most densely populated regions in Western Europe and an economic hub whose citizens enjoy a high GDP per capita. In this exercise, we compare the neighborhoods of three different cities to understand how similar they are, so that we can choose which city, and which neighborhood within it, we could potentially move to.

The three cities are:
- Antwerpen, the most populous city of the region. It is the largest of the three and hosts the second biggest harbour in Europe
- Leuven, the second biggest of the three and a cultural hub
- Genk, a provincial city, much smaller in terms of population

### Data

Two data sets were used, both from the official Belgian statistics open data repository, Statbel (https://statbel.fgov.be/en/open-data?category=All):
- Statistical sectors 2020 https://statbel.fgov.be/en/open-data/statistical-sectors-2020 --> This is the smallest administrative subdivision
- Population by statistical sector 2020 https://statbel.fgov.be/en/open-data/population-statistical-sector-8

At the start of this exercise, it is unclear whether Population by statistical sector will be needed; nevertheless, it was decided to include it.

The data was prepared in QGIS and Excel as follows:
1. Change the coordinate system to the geographic coordinate system WGS 84 (EPSG:4326) by saving the layer with a different coordinate system (Save Layer As), so that the coordinates come in decimal degrees
2. In Excel, save Population by statistical sector 2020 as CSV
3. Join the attribute table of the imported CSV with that of the Statistical Sectors
4. Calculate the centroids of the Statistical Sectors, so we get points instead of polygons
5. Calculate the geometry attributes of the centroids
6. Create separate data sets for the cities of interest (Antwerpen, Leuven, Genk)
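
The centroid step above was done in QGIS. As a rough illustration of what such a calculation involves, here is a minimal pure-Python sketch based on the shoelace formula. It treats longitude and latitude as planar coordinates, which is only an approximation that is reasonable for small sectors. The function name `polygon_centroid` and the sample square are hypothetical and are not taken from the Statbel data.

```python
def polygon_centroid(vertices):
    """Return the (x, y) centroid of a simple polygon given as a list of (x, y) vertices."""
    area2 = 0.0  # twice the signed area of the polygon
    cx = 0.0
    cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon with zero area")
    # centroid = sum / (6 * area) = sum / (3 * area2)
    return cx / (3.0 * area2), cy / (3.0 * area2)


# hypothetical square sector in (lon, lat); not real data
square = [(4.40, 51.22), (4.41, 51.22), (4.41, 51.23), (4.40, 51.23)]
centroid = polygon_centroid(square)
print(centroid)
```

For real sector polygons, a GIS tool such as QGIS (as used here) also handles holes and multi-part geometries, which this sketch does not.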

For the clustering, the Foursquare API will be used to extract the venues in the neighborhoods.

For visualization purposes, OpenStreetMap data is used for the background maps.

Pre-processing of the data will be done in this notebook and is therefore explained in the methodology section.

### Methodology

We start by importing the required libraries, including geopandas to read the shapefiles.

In [1]:
import pandas as pd
import geopandas
import matplotlib.pyplot as plt
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium
import requests
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import numpy as np

We choose a central location in Flanders, Putte, so that all three cities are visible on the map at the same time.

In [2]:
address = 'Putte, Belgium'

geolocator = Nominatim(user_agent="ca_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))

The geographical coordinates of Putte, Belgium are 51.0570823, 4.6310473.


Next, we import the shapefiles of the three cities with geopandas. Initially we had 5 cities, but Foursquare did not allow downloading that much data without a subscription.

In [3]:
Antw = geopandas.read_file('./data/GIS files/NeighborhoodsAntwerpen.shp')
Leuv = geopandas.read_file('./data/GIS files/NeighborhoodsLeuven.shp')
#Gent = geopandas.read_file('./data/GIS files/NeighborhoodsGent.shp')
Genk = geopandas.read_file('./data/GIS files/NeighborhoodsGenk.shp')
#Brus = geopandas.read_file('./data/GIS files/NeighborhoodsBrussels.shp')

We concatenate all geodataframes so that we can cluster them together afterwards and compare neighborhoods regardless of the city. To include more cities, we could extend the Cities_list list.

In [4]:
Cities_list = [Antw, Leuv, Genk]
Cities = pd.concat(Cities_list)

Since many columns contain information that is not relevant here, we drop them.

In [5]:
drop_columns = ['T_SEC_FR', 'T_SEC_DE', 'T_NIS6_FR', 'CNIS5_2020', 'T_MUN_FR', 'T_MUN_DE', 'C_ARRD',
       'T_ARRD_NL', 'T_ARRD_FR', 'T_ARRD_DE', 'C_PROVI', 'T_PROVI_NL',
       'T_PROVI_FR', 'T_PROVI_DE', 'C_REGIO', 'T_REGIO_NL', 'T_REGIO_FR',
       'T_REGIO_DE', 'C_COUNTRY', 'NUTS1', 'NUTS2', 'NUTS3', 'OPENDATA_S', 'OPENDATA_1', 'OPENDATA_2',
       'OPENDATA_3', 'OPENDATA_4', 'OPENDATA_5', 'OPENDATA_6', 'OPENDATA_7']
Cities.drop(columns = drop_columns, inplace=True)
Cities

Unnamed: 0,CS01012020,T_SEC_NL,C_NIS6,T_NIS6_NL,T_MUN_NL,M_AREA_HA,M_PERI_M,Density,OPENDATA_8,lon,lat,geometry
0,1100212MQ,STABROEK,110021,WIJZIGING VAN GEMEENTEGRENS,Antwerpen,81.007781,5228.0,0,0,4.341878,51.329844,POINT (4.34188 51.32984)
1,11002A00-,ANTWERPEN KERN - OUDE STAD (SPAANSE WALLEN ),11002A,1-2-3-4 ADMINISTR. WIJK OF DISTRICT,Antwerpen,22.257464,2208.0,125,2788,4.400889,51.222503,POINT (4.40089 51.22250)
2,11002A01-,KLAPDORP - BROUWERSVLIET,11002A,1-2-3-4 ADMINISTR. WIJK OF DISTRICT,Antwerpen,13.816001,1753.0,115,1595,4.403601,51.225234,POINT (4.40360 51.22523)
3,11002A02-,GROENPLAATS (SPAANSE WALLEN),11002A,1-2-3-4 ADMINISTR. WIJK OF DISTRICT,Antwerpen,10.901108,1343.0,49,537,4.402588,51.219115,POINT (4.40259 51.21911)
4,11002A03-,HOOGSTRAAT (SPAANSE WALLEN),11002A,1-2-3-4 ADMINISTR. WIJK OF DISTRICT,Antwerpen,9.244515,1212.0,114,1053,4.398159,51.219379,POINT (4.39816 51.21938)
...,...,...,...,...,...,...,...,...,...,...,...,...
48,71016B242,NIEUWE KEMPEN,71016B,WATERSCHEI - ZWARTBERG,Genk,91.773174,6146.0,20,1845,5.491861,51.012518,POINT (5.49186 51.01252)
49,71016B273,WOLFSBERG,71016B,WATERSCHEI - ZWARTBERG,Genk,115.502537,7052.0,0,12,5.478330,51.004559,POINT (5.47833 51.00456)
50,71016B29-,ZWARTBERG-KOOLMIJN-VLIEGVELD,71016B,WATERSCHEI - ZWARTBERG,Genk,252.703010,10414.0,0,83,5.515443,51.013665,POINT (5.51544 51.01367)
51,71016B2AA,ZWARTBERG-ZUID,71016B,WATERSCHEI - ZWARTBERG,Genk,48.027172,3245.0,39,1857,5.502997,51.007247,POINT (5.50300 51.00725)


After dropping them, we rename the remaining columns for clarity

In [6]:
rename_columns = ['Code1', 'Name', 'Code2', 'Name2', 'Municipality', 'Area_ha', 'Peri_m', 'Density', 'Population', 'Lon', 'Lat', 'Geometry']
Cities.columns = rename_columns

Time to visualize the neighborhoods on the map.

In [7]:
# create map using latitude and longitude values
map_city = folium.Map(location=[latitude, longitude], zoom_start=9)

# add markers to map
for lat, lon, Code1, Name in zip(Cities['Lat'], Cities['Lon'], Cities['Code1'], Cities['Name']):
    label = '{}, {}'.format(Code1, Name)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=0.5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7,
        parse_html=False).add_to(map_city)  
    
map_city

Next, we implement the code for downloading the venues. We chose a radius of 200 m, which seems appropriate given the distance between neighborhoods, so that there is not much overlap.
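
The overlap between search circles can be checked with a great-circle distance between neighboring centroids. The sketch below is one possible way to do this, using the haversine formula with a mean Earth radius of 6,371 km; the helper names `haversine_m` and `search_circles_overlap` are hypothetical. The two sample points are centroids from the Cities table shown earlier.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in decimal degrees."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def search_circles_overlap(p1, p2, radius_m=200):
    """True if two search circles of equal radius around p1 and p2 intersect."""
    return haversine_m(p1[0], p1[1], p2[0], p2[1]) < 2 * radius_m


# (lat, lon) of two adjacent sectors in the Antwerpen city centre
oude_stad = (51.222503, 4.400889)
klapdorp = (51.225234, 4.403601)

distance = haversine_m(*oude_stad, *klapdorp)
overlap = search_circles_overlap(oude_stad, klapdorp)
print(round(distance), overlap)
```

These two centroids lie less than 400 m apart, so their 200 m circles still intersect; some overlap in the dense city centre is therefore to be expected, while larger outlying sectors are further apart.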

In [8]:
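import os

# credentials are read from the environment rather than hard-coded in the notebook;
# FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are assumed variable names
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value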

In [9]:
def getNearbyVenues(names, latitudes, longitudes, radius=200):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
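
The function above indexes straight into the response, so a request that returns no groups, or a venue without categories, would raise an exception and stop the whole download. Below is a minimal sketch of a more tolerant parsing step. The helper name `extract_venue_rows` and the sample payloads are hypothetical; the payload shape is only inferred from the keys the function above accesses and has not been checked against the Foursquare documentation.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Flatten one explore-style response into rows, tolerating missing pieces."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        # venues without any category get None instead of raising IndexError
        category = categories[0].get("name") if categories else None
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows


# hypothetical payload mimicking the structure used by getNearbyVenues
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Grote Markt",
               "location": {"lat": 51.221163, "lng": 4.39981},
               "categories": [{"name": "Plaza"}]}},
    {"venue": {"name": "Uncategorised place",
               "location": {"lat": 51.2212, "lng": 4.4},
               "categories": []}},
]}]}}

rows = extract_venue_rows("ANTWERPEN KERN", 51.222503, 4.400889, sample)
# hypothetical error-style payload without a "response" key
empty = extract_venue_rows("NOWHERE", 0.0, 0.0, {"meta": {"code": 429}})
print(len(rows), empty)
```

Rows with a `None` category could then be dropped or labelled before the one-hot encoding step.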

In [10]:
city_venues = getNearbyVenues(names=Cities['Name'],
                                   latitudes=Cities['Lat'],
                                   longitudes=Cities['Lon']
                                  )

STABROEK
ANTWERPEN KERN - OUDE STAD (SPAANSE WALLEN )
KLAPDORP - BROUWERSVLIET
GROENPLAATS (SPAANSE WALLEN)
HOOGSTRAAT (SPAANSE WALLEN)
OUDAAN (SPAANSE WALLEN)
GEVANGENIS (SPAANSE WALLEN)
SCHELDEKADEN NOORD
MEIR - LEYSSTRAAT (SP. WALLEN)
KIPDORP-ST.-JACOBS (SP.WALLEN)
K.N.S.-NAT. BANK (SP. WALLEN)
STADSWAAG-BEGIJNHOF(SP.WALLEN)
TABAKSVEST (SPAANSE WALLEN)
HESSENHUIS (SPAANSE WALLEN)
ST.-ANDRIES (SPAANSE WALLEN)
ST.-MICHIELSKAAI (SP. WALLEN)
LINKEROEVER-ZUID (LINKEROEVER)
LINKEROEVER - STATION
LINKEROEVER-NOORD
ST.-ANNA (LINKEROEVER)
THOENETLAAN (LINKEROEVER)
GLORIANTLAAN (LINKEROEVER)
GALGEWEEL (LINKEROEVER)
ST.-ANNABOS (LINKEROEVER )
BLANCEFLOERLAAN  (LINKEROEVER)
CHARLES DE COSTERLAAN(L.OEVER)
STATIEKWARTIER (STATIONSWIJK)
ATHENEUM (STATIONSWIJK)
DE CONINCPLEIN-Z.(STATIONSW.)
OFFERANDESTRAAT(STATIONSWIJK )
PROVINCIESTRAAT(STATIONSWIJK )
PELIKAANSTRAAT (STATIONSWIJK )
STATION - ZOO (STATIONSWIJK )
STADSPARK (STATIONSWIJK )
JEZUITENCOLLEGE(STATIONSWIJK )
DAMBRUGGESTRAAT-N.(STATIONSW.)


Below, a bit of exploratory analysis of the resulting dataframe

In [11]:
print(city_venues.shape)
city_venues.head()

(2463, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,STABROEK,51.329844,4.341878,kaai 734 zuidnatie,51.329007,4.343253,Port
1,STABROEK,51.329844,4.341878,Becomar Zuidnatie,51.330475,4.343896,Harbor / Marina
2,ANTWERPEN KERN - OUDE STAD (SPAANSE WALLEN ),51.222503,4.400889,Grote Markt,51.221163,4.39981,Plaza
3,ANTWERPEN KERN - OUDE STAD (SPAANSE WALLEN ),51.222503,4.400889,Bia Mara,51.220894,4.400189,Fish & Chips Shop
4,ANTWERPEN KERN - OUDE STAD (SPAANSE WALLEN ),51.222503,4.400889,De Pottekijker,51.221097,4.40187,Belgian Restaurant


In [12]:
city_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
'T EILANDJE,20,20,20,20,20,20
'T FORT,3,3,3,3,3,3
'T LAAR,4,4,4,4,4,4
'T MESTPUTTEKE,9,9,9,9,9,9
ALBERTPARK (OOSTWIJK),8,8,8,8,8,8
...,...,...,...,...,...,...
ZWAAIKOM,12,12,12,12,12,12
ZWAANTJES,4,4,4,4,4,4
ZWARTBERG - TUINWIJK-NOORD,5,5,5,5,5,5
ZWARTBERG-KOOLMIJN-VLIEGVELD,3,3,3,3,3,3


In [13]:
print('There are {} unique categories.'.format(len(city_venues['Venue Category'].unique())))
print(city_venues['Venue Category'].unique())

There are 305 unique categories.
['Port' 'Harbor / Marina' 'Plaza' 'Fish & Chips Shop' 'Belgian Restaurant'
 'Coffee Shop' 'French Restaurant' 'Bookstore' 'Italian Restaurant'
 'Steakhouse' 'Church' 'Restaurant' 'Fishing Store' 'Hotel'
 'Adult Boutique' 'History Museum' 'Gay Bar' 'Thai Restaurant'
 'Ethiopian Restaurant' 'Café' 'Soup Place' 'Empanada Restaurant' 'Bar'
 'American Restaurant' 'Health & Beauty Service' 'Nightclub' 'Waterfront'
 'Snack Place' 'Asian Restaurant' 'Greek Restaurant' 'Juice Bar'
 'Beer Bar' 'Frozen Yogurt Shop' 'Indonesian Restaurant' 'Cocktail Bar'
 'Kitchen Supply Store' 'Clothing Store' 'Deli / Bodega' 'Brasserie'
 'Vegetarian / Vegan Restaurant' 'Gourmet Shop' 'Sushi Restaurant'
 'Perfume Shop' 'Burger Joint' 'Gaming Cafe' 'Camera Store' 'Optical Shop'
 'Dessert Shop' 'Cupcake Shop' 'Falafel Restaurant' 'Pie Shop'
 'Taco Place' 'Wine Bar' 'Ice Cream Shop' 'Art Museum'
 'Argentinian Restaurant' 'Comfort Food Restaurant' 'Breakfast Spot'
 'Mexican Restauran

We start preparing the dataset for the clustering with one-hot encoding of the venue categories.

In [14]:
# one hot encoding
city_onehot = pd.get_dummies(city_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
city_onehot['Neighborhood'] = city_venues['Neighborhood'] 

city_onehot.head()

Unnamed: 0,ATM,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Aquarium,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Vietnamese Restaurant,Volleyball Court,Waterfront,Weight Loss Center,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
# move neighborhood column to the first column
fixed_columns = [city_onehot.columns[-1]] + list(city_onehot.columns[:-1])
city_onehot = city_onehot[fixed_columns]

city_onehot.head()

Unnamed: 0,Zoo Exhibit,ATM,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Aquarium,Argentinian Restaurant,Art Gallery,Art Museum,...,Video Store,Vietnamese Restaurant,Volleyball Court,Waterfront,Weight Loss Center,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
city_grouped = city_onehot.groupby('Neighborhood').mean().reset_index()
city_grouped

Unnamed: 0,Neighborhood,Zoo Exhibit,ATM,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Aquarium,Argentinian Restaurant,Art Gallery,...,Video Store,Vietnamese Restaurant,Volleyball Court,Waterfront,Weight Loss Center,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo
0,'T EILANDJE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0
1,'T FORT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0
2,'T LAAR,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0
3,'T MESTPUTTEKE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,...,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0
4,ALBERTPARK (OOSTWIJK),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
381,ZWAAIKOM,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0
382,ZWAANTJES,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0
383,ZWARTBERG - TUINWIJK-NOORD,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0
384,ZWARTBERG-KOOLMIJN-VLIEGVELD,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0
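
Taking the mean of the one-hot columns per neighborhood amounts to computing the share of each venue category among that neighborhood's venues. The pure-Python sketch below shows that equivalence on a handful of rows; the helper name `category_shares` is hypothetical, and the rows are an illustrative reconstruction for two sectors with 3 and 4 venues, not the downloaded data itself.

```python
from collections import Counter, defaultdict


def category_shares(venue_rows):
    """Map each neighborhood to {category: share of its venues}."""
    counts = defaultdict(Counter)
    for neighborhood, category in venue_rows:
        counts[neighborhood][category] += 1
    shares = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        shares[neighborhood] = {cat: n / total for cat, n in counter.items()}
    return shares


# illustrative (neighborhood, venue category) pairs
rows = [
    ("'T FORT", "Trail"), ("'T FORT", "Park"), ("'T FORT", "Cocktail Bar"),
    ("'T LAAR", "Bar"), ("'T LAAR", "Bakery"), ("'T LAAR", "Dog Run"),
    ("'T LAAR", "Athletics & Sports"),
]

shares = category_shares(rows)
for hood, dist in shares.items():
    print(hood, {cat: round(share, 2) for cat, share in dist.items()})
```

One consequence of this representation is that a neighborhood with a single venue gets a share of 1.0 for that category, so sparsely covered sectors can look very distinctive to the clustering.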


Let's explore the top 5 venues per neighborhood

In [17]:
num_top_venues = 5

for hood in city_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = city_grouped[city_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----'T EILANDJE----
            venue  freq
0     Coffee Shop  0.15
1  Sandwich Place  0.10
2    Cocktail Bar  0.10
3             Pub  0.05
4      Restaurant  0.05


----'T FORT----
          venue  freq
0         Trail  0.33
1          Park  0.33
2  Cocktail Bar  0.33
3   Music Store  0.00
4        Notary  0.00


----'T LAAR----
                venue  freq
0                 Bar  0.25
1              Bakery  0.25
2             Dog Run  0.25
3  Athletics & Sports  0.25
4         Zoo Exhibit  0.00


----'T MESTPUTTEKE----
              venue  freq
0    Ice Cream Shop  0.11
1       Art Gallery  0.11
2              Park  0.11
3            Bakery  0.11
4  Doner Restaurant  0.11


----ALBERTPARK (OOSTWIJK)----
              venue  freq
0              Park  0.25
1       Coffee Shop  0.12
2  Insurance Office  0.12
3               Gym  0.12
4       Supermarket  0.12


----ALSEMBERG----
                  venue  freq
0           Music Store   1.0
1           Zoo Exhibit   0.0
2             Nightcl

In [18]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [19]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third use the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = city_grouped['Neighborhood']

for ind in np.arange(city_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(city_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,'T EILANDJE,Coffee Shop,Sandwich Place,Cocktail Bar,Gastropub,Pub,Laser Tag,Event Space,Restaurant,Sushi Restaurant,Supermarket
1,'T FORT,Trail,Park,Cocktail Bar,Zoo,Film Studio,Event Space,Factory,Falafel Restaurant,Farm,Farmers Market
2,'T LAAR,Bar,Dog Run,Bakery,Athletics & Sports,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Film Studio,Financial or Legal Service
3,'T MESTPUTTEKE,Greek Restaurant,Restaurant,Ice Cream Shop,Pool,Pizza Place,Doner Restaurant,Art Gallery,Bakery,Park,Ethiopian Restaurant
4,ALBERTPARK (OOSTWIJK),Park,Supermarket,Gym,Insurance Office,Coffee Shop,Tram Station,Dog Run,Zoo,Fast Food Restaurant,Factory


In [20]:
city_grouped.head()

Unnamed: 0,Neighborhood,Zoo Exhibit,ATM,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Aquarium,Argentinian Restaurant,Art Gallery,...,Video Store,Vietnamese Restaurant,Volleyball Court,Waterfront,Weight Loss Center,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo
0,'T EILANDJE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0
1,'T FORT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,'T LAAR,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,'T MESTPUTTEKE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,ALBERTPARK (OOSTWIJK),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Finally, the clustering. After a bit of trial and error, we set the number of clusters to 8. This is somewhat high, because some neighborhoods, such as the harbour or parks, are not residential at all. With fewer clusters, all residential neighborhoods would be grouped together and the result would be of little interest for this exercise.

In [21]:
# set number of clusters
kclusters = 8

city_grouped_clustering = city_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(city_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[:]

array([5, 4, 0, 5, 7, 7, 7, 2, 5, 7, 0, 5, 7, 7, 5, 7, 7, 7, 5, 5, 6, 5,
       5, 5, 0, 7, 5, 7, 0, 7, 7, 5, 7, 5, 5, 7, 7, 7, 7, 7, 4, 5, 7, 7,
       5, 7, 7, 7, 5, 4, 5, 5, 5, 5, 5, 7, 5, 5, 7, 0, 7, 5, 7, 7, 7, 5,
       7, 0, 5, 0, 0, 7, 5, 3, 7, 7, 2, 7, 5, 5, 7, 2, 7, 5, 7, 7, 7, 0,
       7, 7, 7, 7, 7, 4, 7, 7, 7, 5, 5, 0, 7, 7, 4, 5, 7, 7, 5, 7, 5, 5,
       0, 5, 5, 5, 5, 7, 2, 5, 7, 7, 7, 5, 2, 5, 4, 2, 4, 0, 2, 7, 5, 2,
       5, 5, 5, 7, 7, 1, 7, 4, 7, 5, 5, 7, 5, 0, 7, 7, 5, 7, 7, 5, 7, 5,
       5, 7, 7, 7, 7, 7, 7, 7, 5, 5, 7, 5, 7, 5, 5, 1, 3, 1, 0, 0, 0, 7,
       7, 7, 5, 5, 7, 7, 6, 7, 0, 7, 0, 5, 2, 0, 7, 5, 5, 7, 5, 7, 0, 5,
       0, 5, 0, 0, 5, 5, 5, 5, 5, 7, 5, 7, 5, 7, 7, 5, 7, 6, 1, 5, 5, 7,
       7, 0, 1, 5, 6, 7, 7, 7, 3, 1, 7, 5, 1, 7, 5, 5, 7, 7, 7, 7, 5, 7,
       5, 0, 5, 2, 7, 0, 5, 0, 7, 6, 5, 7, 5, 5, 0, 5, 2, 7, 6, 2, 5, 5,
       1, 5, 5, 7, 1, 5, 7, 7, 5, 5, 7, 5, 7, 7, 7, 5, 7, 5, 7, 5, 7, 0,
       5, 7, 7, 5, 5, 7, 7, 5, 7, 2, 7, 5, 7, 5, 5,
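
To see how unevenly the neighborhoods are spread over the clusters, the labels can simply be tallied. The sketch below does this for the first row of the label array printed above only; in the notebook, the same tally could be run on the full `kmeans.labels_` array.

```python
from collections import Counter

# first row of the label array printed above
labels = [5, 4, 0, 5, 7, 7, 7, 2, 5, 7, 0, 5, 7, 7, 5, 7, 7, 7, 5, 5, 6, 5]

sizes = Counter(labels)
# list clusters from largest to smallest
for cluster, n in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print("cluster {}: {} neighborhoods ({:.0%})".format(cluster, n, n / len(labels)))
```

In this sample, clusters 5 and 7 hold most of the neighborhoods, while the remaining labels appear only once or twice, which is consistent with a few non-residential sectors being split off into small clusters of their own.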

In [22]:
# drop a previous 'Cluster Labels' column, if any, so the cell below can be rerun
neighborhoods_venues_sorted.drop(columns=['Cluster Labels'], inplace=True, errors='ignore')

In [23]:
# add clustering labels (this insert fails if a 'Cluster Labels' column already exists)
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

city_merged = Cities

# join the sorted venues onto the Cities table so each neighborhood keeps its latitude/longitude
city_merged = city_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Name')

city_merged.head() # check the last columns!

Unnamed: 0,Code1,Name,Code2,Name2,Municipality,Area_ha,Peri_m,Density,Population,Lon,...,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1100212MQ,STABROEK,110021,WIJZIGING VAN GEMEENTEGRENS,Antwerpen,81.007781,5228.0,0,0,4.341878,...,Port,Harbor / Marina,Zoo,Fast Food Restaurant,Ethiopian Restaurant,Event Space,Factory,Falafel Restaurant,Farm,Farmers Market
1,11002A00-,ANTWERPEN KERN - OUDE STAD (SPAANSE WALLEN ),11002A,1-2-3-4 ADMINISTR. WIJK OF DISTRICT,Antwerpen,22.257464,2208.0,125,2788,4.400889,...,Belgian Restaurant,Restaurant,Plaza,Coffee Shop,Italian Restaurant,Harbor / Marina,Bookstore,Ethiopian Restaurant,Gay Bar,Café
2,11002A01-,KLAPDORP - BROUWERSVLIET,11002A,1-2-3-4 ADMINISTR. WIJK OF DISTRICT,Antwerpen,13.816001,1753.0,115,1595,4.403601,...,Gay Bar,Bar,Nightclub,Coffee Shop,Health & Beauty Service,Café,Empanada Restaurant,Greek Restaurant,Asian Restaurant,Snack Place
3,11002A02-,GROENPLAATS (SPAANSE WALLEN),11002A,1-2-3-4 ADMINISTR. WIJK OF DISTRICT,Antwerpen,10.901108,1343.0,49,537,4.402588,...,Coffee Shop,Cocktail Bar,Juice Bar,Bar,Bookstore,Plaza,Brasserie,Beer Bar,Gaming Cafe,Sushi Restaurant
4,11002A03-,HOOGSTRAAT (SPAANSE WALLEN),11002A,1-2-3-4 ADMINISTR. WIJK OF DISTRICT,Antwerpen,9.244515,1212.0,114,1053,4.398159,...,Cocktail Bar,Bar,Falafel Restaurant,Beer Bar,Ice Cream Shop,Hotel,Coffee Shop,French Restaurant,Bistro,Restaurant


In [24]:
city_merged_nonan = city_merged.dropna(subset=['Cluster Labels'])

### Results

In the next three maps, we can see each city separately

In [25]:
address = 'Antwerpen, Belgium'

geolocator = Nominatim(user_agent="ca_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(city_merged_nonan['Lat'], city_merged_nonan['Lon'], city_merged_nonan['Name'], city_merged_nonan['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The geographical coordinates of Antwerpen, Belgium are 51.2211097, 4.3997081.


In [26]:
address = 'Leuven, Belgium'

geolocator = Nominatim(user_agent="ca_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(city_merged_nonan['Lat'], city_merged_nonan['Lon'], city_merged_nonan['Name'], city_merged_nonan['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The geographical coordinates of Leuven, Belgium are 50.879202, 4.7011675.


In [27]:
address = 'Genk, Belgium'

geolocator = Nominatim(user_agent="ca_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(city_merged_nonan['Lat'], city_merged_nonan['Lon'], city_merged_nonan['Name'], city_merged_nonan['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The geographical coordinates of Genk, Belgium are 50.9654864, 5.5001456.


For future reference only, below is the code for inspecting the results of each cluster individually.

In [34]:
city_merged_nonan.loc[city_merged_nonan['Cluster Labels'] == 5, city_merged_nonan.columns[[1] + list(range(5, city_merged_nonan.shape[1]))]].head()

Unnamed: 0,Name,Area_ha,Peri_m,Density,Population,Lon,Lat,Geometry,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,ANTWERPEN KERN - OUDE STAD (SPAANSE WALLEN ),22.257464,2208.0,125,2788,4.400889,51.222503,POINT (4.40089 51.22250),5.0,Belgian Restaurant,Restaurant,Plaza,Coffee Shop,Italian Restaurant,Harbor / Marina,Bookstore,Ethiopian Restaurant,Gay Bar,Café
2,KLAPDORP - BROUWERSVLIET,13.816001,1753.0,115,1595,4.403601,51.225234,POINT (4.40360 51.22523),5.0,Gay Bar,Bar,Nightclub,Coffee Shop,Health & Beauty Service,Café,Empanada Restaurant,Greek Restaurant,Asian Restaurant,Snack Place
3,GROENPLAATS (SPAANSE WALLEN),10.901108,1343.0,49,537,4.402588,51.219115,POINT (4.40259 51.21911),5.0,Coffee Shop,Cocktail Bar,Juice Bar,Bar,Bookstore,Plaza,Brasserie,Beer Bar,Gaming Cafe,Sushi Restaurant
4,HOOGSTRAAT (SPAANSE WALLEN),9.244515,1212.0,114,1053,4.398159,51.219379,POINT (4.39816 51.21938),5.0,Cocktail Bar,Bar,Falafel Restaurant,Beer Bar,Ice Cream Shop,Hotel,Coffee Shop,French Restaurant,Bistro,Restaurant
5,OUDAAN (SPAANSE WALLEN),14.882418,1935.0,103,1530,4.403102,51.214563,POINT (4.40310 51.21456),5.0,Coffee Shop,Gastropub,Thrift / Vintage Store,Sporting Goods Shop,Bed & Breakfast,Bar,Bakery,French Restaurant,Jewelry Store,Pub


In [35]:
city_merged_nonan.loc[city_merged_nonan['Cluster Labels'] == 7, city_merged_nonan.columns[[1] + list(range(5, city_merged_nonan.shape[1]))]].head()

Unnamed: 0,Name,Area_ha,Peri_m,Density,Population,Lon,Lat,Geometry,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,STABROEK,81.007781,5228.0,0,0,4.341878,51.329844,POINT (4.34188 51.32984),7.0,Port,Harbor / Marina,Zoo,Fast Food Restaurant,Ethiopian Restaurant,Event Space,Factory,Falafel Restaurant,Farm,Farmers Market
17,LINKEROEVER - STATION,40.792798,2623.0,48,1946,4.383326,51.217985,POINT (4.38333 51.21799),7.0,Playground,Doner Restaurant,Cocktail Bar,Zoo,Financial or Legal Service,Factory,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant
18,LINKEROEVER-NOORD,43.680652,2651.0,89,3896,4.382045,51.230408,POINT (4.38205 51.23041),7.0,Garden,Financial or Legal Service,Event Space,Factory,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Film Studio,Zoo
19,ST.-ANNA (LINKEROEVER),62.554485,5274.0,1,49,4.386638,51.233987,POINT (4.38664 51.23399),7.0,Campground,Trail,Playground,Forest,Bike Rental / Bike Share,Stables,Creperie,Ice Cream Shop,Dance Studio,Cultural Center
20,THOENETLAAN (LINKEROEVER),27.691246,2226.0,57,1569,4.389088,51.230675,POINT (4.38909 51.23067),7.0,Tennis Court,Gym,Music Venue,Bike Rental / Bike Share,Zoo,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Film Studio


### Discussion

There is a clear separation between the more central locations (cluster 5) in the two larger cities, Antwerpen and Leuven, and the neighborhoods outside the city center (cluster 7). This second cluster (7) is predominant in the smaller provincial city of Genk, which suggests that living in Genk is likely to be similar to living on the rim of the larger cities.

Perhaps the clustering is not clear enough. This could be improved by grouping the venue categories further, for instance all restaurants together, all cultural venues together, all sport venues together, and so on. That could give a better view of the different neighborhoods and potentially require fewer clusters.
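
As a rough illustration of that idea, the sketch below maps detailed venue categories to a few coarse groups by keyword. The group names, the keywords, and the helper `coarse_group` are all hypothetical choices made for this example; they were not used in the analysis above, and simple substring matching will misclassify some categories.

```python
# hypothetical keyword-to-group mapping; groups and keywords are illustrative only
GROUP_KEYWORDS = [
    ("Food", ("restaurant", "joint", "place", "steakhouse", "brasserie", "bakery")),
    ("Drinks", ("bar", "pub", "café", "coffee shop", "nightclub")),
    ("Culture", ("museum", "gallery", "theater", "church")),
    ("Sports", ("gym", "court", "athletics", "pool", "trail")),
    ("Shopping", ("store", "shop", "boutique", "supermarket")),
]


def coarse_group(category):
    """Return the first group whose keyword occurs in the category name, else 'Other'."""
    lowered = category.lower()
    for group, keywords in GROUP_KEYWORDS:
        if any(keyword in lowered for keyword in keywords):
            return group
    return "Other"


examples = ["Belgian Restaurant", "Cocktail Bar", "Art Museum",
            "Tennis Court", "Bookstore", "Port"]
grouped = {category: coarse_group(category) for category in examples}
for category, group in grouped.items():
    print("{:20s} -> {}".format(category, group))
```

Applying such a mapping to the 'Venue Category' column before the one-hot encoding would shrink the feature space from several hundred columns to a handful, at the cost of losing detail within each group.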

### Conclusion

We can conclude that the clustering of the three cities makes sense. While we needed a large number of clusters to get meaningful results, once we focus on the general pattern rather than on specific neighborhoods, the method appears valid: it shows that Genk neighborhoods are more similar to those on the outer rim of the larger cities of Antwerpen and Leuven.

Date: 05/05/2021