Cluster Analysis of municipalities in the commuter belt around Brussels, Belgium

The data is taken from Foursquare - with a focus on shops and stores in the various municipalities

Part 1: I use pandas to read a table hosted on GitHub into a data frame, which I then clean

Note: the GitHub resource in question lists all of the municipalities in Belgium, with their postal codes, names, and geographical coordinates

Note: in the first few code blocks, I import all the libraries I need for all three parts, based on the New York assignment we studied in the course

In [162]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
import json5 # library to handle JSON files

In [163]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas 1.0 and later)

In [164]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [165]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

In [166]:
import folium # map rendering library

In [167]:
print('Libraries imported.')

Libraries imported.


In [168]:
# display options so as to view data frames in full
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

The data comes from GitHub user jief - special thanks

In [169]:
url = 'https://raw.githubusercontent.com/jief/zipcode-belgium/master/zipcode-belgium.csv'

In [170]:
xx=pd.read_csv(url, header=None) # read data frame from the designated url

In [171]:
xx.head(10)

Unnamed: 0,0,1,2,3
0,1000,Bruxelles,4.351697,50.846557
1,1020,Laeken,4.348713,50.883392
2,1030,Schaerbeek,4.373712,50.867604
3,1040,Etterbeek,4.38951,50.836851
4,1050,Ixelles,4.381571,50.822285
5,1060,Saint-Gilles,4.345668,50.826741
6,1070,Anderlecht,4.31234,50.838141
7,1080,Molenbeek-Saint-Jean,4.322778,50.854355
8,1081,Koekelberg,4.325708,50.862263
9,1082,Berchem-Sainte-Agathe,4.292702,50.863984


In [172]:
xx.columns=['Postal','Municipality','Long','Lat']

In [173]:
xx.head(10)

Unnamed: 0,Postal,Municipality,Long,Lat
0,1000,Bruxelles,4.351697,50.846557
1,1020,Laeken,4.348713,50.883392
2,1030,Schaerbeek,4.373712,50.867604
3,1040,Etterbeek,4.38951,50.836851
4,1050,Ixelles,4.381571,50.822285
5,1060,Saint-Gilles,4.345668,50.826741
6,1070,Anderlecht,4.31234,50.838141
7,1080,Molenbeek-Saint-Jean,4.322778,50.854355
8,1081,Koekelberg,4.325708,50.862263
9,1082,Berchem-Sainte-Agathe,4.292702,50.863984
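
As a sketch of an alternative, the column names could be assigned while reading rather than in a separate step. The snippet below assumes the file is comma-separated with no header row, as the default `read_csv` call above implies, and uses three rows copied from the table instead of the live URL. The names `sample_csv` and `sample` are mine, for illustration only.

```python
import io
import pandas as pd

# Three rows copied from the table above, in the same headerless layout.
sample_csv = """1000,Bruxelles,4.351697,50.846557
1020,Laeken,4.348713,50.883392
1030,Schaerbeek,4.373712,50.867604
"""

# names= assigns the column labels during parsing, so no separate rename is needed.
sample = pd.read_csv(
    io.StringIO(sample_csv),
    header=None,
    names=['Postal', 'Municipality', 'Long', 'Lat'],
)
print(sample.dtypes)
```

Passing the URL in place of the `io.StringIO` object should behave the same way for the full file.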


In [174]:
xx.shape

(2757, 4)

I now define the municipality of Brussels as the centre - because later I will compute distances from that centre

In [175]:
centre=xx[xx['Municipality'].str.contains('Bruxelles')]
centre

Unnamed: 0,Postal,Municipality,Long,Lat
0,1000,Bruxelles,4.351697,50.846557


I will use the Haversine distance function to compute distances from each municipality to Brussels, in km

In [176]:
# source of code
# https://towardsdatascience.com/heres-how-to-calculate-distance-between-2-geolocations-in-python-93ecab5bbba4

def haverdist(lat1, lon1, lat2, lon2):
   r = 6371
   phi1 = np.radians(lat1)
   phi2 = np.radians(lat2)
   delta_phi = np.radians(lat2 - lat1)
   delta_lambda = np.radians(lon2 - lon1)
   a = np.sin(delta_phi / 2)**2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2)**2
   res = r * (2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a)))
   return np.round(res, 2)

p1lat=float(centre['Lat'].iloc[0]) # take the single matching row explicitly
p1long=float(centre['Long'].iloc[0])

def haver_df(row):
    p2lat=float(row['Lat'])
    p2long=float(row['Long'])
    dist=haverdist(p1lat,p1long,p2lat,p2long)
    return dist

xx['Dist']=xx.apply(haver_df, axis=1)

xx.head(10)

Unnamed: 0,Postal,Municipality,Long,Lat,Dist
0,1000,Bruxelles,4.351697,50.846557,0.0
1,1020,Laeken,4.348713,50.883392,4.1
2,1030,Schaerbeek,4.373712,50.867604,2.8
3,1040,Etterbeek,4.38951,50.836851,2.87
4,1050,Ixelles,4.381571,50.822285,3.42
5,1060,Saint-Gilles,4.345668,50.826741,2.24
6,1070,Anderlecht,4.31234,50.838141,2.92
7,1080,Molenbeek-Saint-Jean,4.322778,50.854355,2.21
8,1081,Koekelberg,4.325708,50.862263,2.53
9,1082,Berchem-Sainte-Agathe,4.292702,50.863984,4.57
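
Because the formula only uses NumPy functions, it can also be applied to whole columns at once instead of row by row. The following is a minimal sketch of that idea, not the code used for the results above. The function name `haversine_km` and the small `towns` frame are my own; the coordinates are copied from the first rows of the table.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km; lat2 and lon2 may be scalars or arrays."""
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = phi2 - phi1
    dlambda = np.radians(lon2) - np.radians(lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlambda / 2) ** 2
    return 2 * radius_km * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

towns = pd.DataFrame({
    'Municipality': ['Bruxelles', 'Laeken', 'Schaerbeek'],
    'Long': [4.351697, 4.348713, 4.373712],
    'Lat': [50.846557, 50.883392, 50.867604],
})

# Coordinates of the centre (Bruxelles), taken from the table above.
centre_lat, centre_long = 50.846557, 4.351697

# One call computes the distance for every row.
towns['Dist'] = haversine_km(centre_lat, centre_long, towns['Lat'], towns['Long']).round(2)
print(towns)
```

On a table of a few thousand rows the difference in speed is small, so this is mainly a matter of brevity.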


My focus is on the 'commuter belt' around Brussels, which I define as being between 7 and 20 km from Brussels

In [177]:
max_dist=20
min_dist=7
region=xx[xx['Dist']<max_dist]
region=region[region['Dist']>min_dist]
region.shape

(114, 5)
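
The two-step filter can also be expressed as a single range test. This is a sketch on a four-row frame whose distances are copied from the tables in this notebook; `demo` and `belt` are illustrative names, and the string form of `inclusive` assumes pandas 1.3 or later.

```python
import pandas as pd

demo = pd.DataFrame({
    'Municipality': ['Laeken', 'Waterloo', 'La Hulpe', 'Tubize'],
    'Dist': [4.10, 14.73, 15.64, 19.96],
})

min_dist, max_dist = 7, 20

# inclusive='neither' reproduces the strict > and < comparisons used above.
belt = demo[demo['Dist'].between(min_dist, max_dist, inclusive='neither')]
print(belt['Municipality'].tolist())
```

Laeken falls inside the 7 km inner limit and is excluded, while the other three municipalities are kept.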

In [178]:
region.head(10)

Unnamed: 0,Postal,Municipality,Long,Lat,Dist
25,1310,La Hulpe,4.479654,50.731505,15.64
41,1330,Rixensart,4.52729,50.713355,19.28
42,1331,Rosières,4.546311,50.73713,18.31
43,1332,Genval,4.497139,50.720745,17.33
87,1380,Ohain,4.450553,50.695114,18.22
99,1410,Waterloo,4.397805,50.717356,14.73
100,1420,Braine-L'alleud,4.354815,50.694094,16.95
110,1440,Braine-Le-Château,4.266669,50.680882,19.37
111,1440,Wauthier-Braine,4.313304,50.680832,18.62
132,1480,Clabecq,4.221413,50.689374,19.73


Now that I have my region of focus, I will get data on venues in this region from Foursquare

Define credentials for Foursquare requests

In [179]:
# replace the placeholders with your own Foursquare credentials; real keys should not be left in a shared notebook
CLIENT_ID = 'YOUR_CLIENT_ID'
CLIENT_SECRET = 'YOUR_CLIENT_SECRET'
VERSION = '20180605' # Foursquare API version

In [180]:
# as in the course example, define max no of venues to be extracted per request, and define geog. radius of requests
LIMIT=250
radius_4sq=1200

In [181]:
def getNearbyVenues(names, latitudes, longitudes, rad=radius_4sq):
    
    venues_list=[]
    print("Processing each neighborhood...")
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            rad, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Municipality', 
                  'Mun Lat', 
                  'Mun Long', 
                  'Venue', 
                  'Venue Lat', 
                  'Venue Long', 
                  'Venue Cat']
    
    return(nearby_venues)
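
The function above indexes straight into the response, so a reply without `groups`, or a venue without a category, would raise an error partway through the loop. As a hedged sketch, the parsing step could be separated out and made tolerant of missing fields. The helper `extract_venues` and the hand-written `fake_payload` are hypothetical; the payload only mimics the nesting that the function above relies on and is not real API output.

```python
def extract_venues(payload, name, lat, lng):
    """Return one tuple per venue found in an explore-style response dict."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        # Fall back to an empty category when the list is missing or empty.
        categories = venue.get('categories') or [{}]
        rows.append((
            name, lat, lng,
            venue.get('name'),
            location.get('lat'),
            location.get('lng'),
            categories[0].get('name'),
        ))
    return rows

# Hand-written payload with the same nesting as above (illustrative values only).
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Demo Bakery',
               'location': {'lat': 50.73, 'lng': 4.48},
               'categories': [{'name': 'Bakery'}]}},
    {'venue': {'name': 'Uncategorised Place',
               'location': {'lat': 50.74, 'lng': 4.49},
               'categories': []}},
]}]}}

rows = extract_venues(fake_payload, 'La Hulpe', 50.731505, 4.479654)
empty = extract_venues({}, 'La Hulpe', 50.731505, 4.479654)
print(len(rows), len(empty))
```

A venue with no category ends up with `None` in the last field, which a later `dropna` or filter can handle, and an empty reply simply contributes no rows.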

In [182]:
print("Getting venue information by municipality")
venues = getNearbyVenues(names=region['Municipality'],
                                   latitudes=region['Lat'],
                                   longitudes=region['Long']
                                  )
print("Venue information has been obtained!")

Getting venue information by municipality
Processing each neighborhood...
La Hulpe
Rixensart
Rosières
Genval
Ohain
Waterloo
Braine-L'alleud
Braine-Le-Château
Wauthier-Braine
Clabecq
Tubize
Halle
Buizingen
Lembeek
Hoeilaart
Oudenaken
Sint-Laureins-Berchem
Sint-Pieters-Leeuw
Ruisbroek
Vlezenbeek
Drogenbos
Linkebeek
Rhode-Saint-Genèse
Beersel
Lot
Alsemberg
Dworp
Huizingen
Bogaarden
Pepingen
Elingen
Beert
Bellingen
Sint-Martens-Bodegem
Sint-Ulriks-Kapelle
Itterbeek
Schepdaal
Asse
Bekkerzeel
Kobbegem
Mollem
Relegem
Ternat
Wambeek
Sint-Katherina-Lombeek
Mazenzele
Opwijk
Gaasbeek
Lennik
Sint-Kwintens-Lennik
Sint-Martens-Lennik
Gooik
Kester
Leerbeek
Onze-Lieve-Vrouw-Lombeek
Pamel
Roosdaal
Strijtem
Borchtlombeek
Liedekerke
Wemmel
Brussegem
Merchtem
Affligem
Essene
Hekelgem
Teralfene
Peutie
Vilvoorde
Melsbroek
Perk
Steenokkerzeel
Machelen
Diegem
Londerzeel
Steenhuffel
Grimbergen
Humbeek
Beigem
Strombeek-Bever
Meise
Wolvertem
Kapelle-Op-Den-Bos
Nieuwenrode
Ramsdonk
Kampenhout
Nederokkerzeel
Nosse

In [183]:
print(venues.shape)
venues.head(12)

(3718, 7)


Unnamed: 0,Municipality,Mun Lat,Mun Long,Venue,Venue Lat,Venue Long,Venue Cat
0,La Hulpe,50.731505,4.479654,Barbavin,50.730077,4.480356,French Restaurant
1,La Hulpe,50.731505,4.479654,Le 20 Heures Vin,50.730868,4.486052,Wine Bar
2,La Hulpe,50.731505,4.479654,Les Tartes de Françoise,50.730594,4.482041,Bakery
3,La Hulpe,50.731505,4.479654,Nanoo's,50.730851,4.485829,French Restaurant
4,La Hulpe,50.731505,4.479654,S'eat,50.727353,4.487566,Snack Place
5,La Hulpe,50.731505,4.479654,"Giot, createur de saveurs",50.730434,4.482144,Gourmet Shop
6,La Hulpe,50.731505,4.479654,Chez Clément,50.726785,4.486916,French Restaurant
7,La Hulpe,50.731505,4.479654,Jardin du Sud,50.730507,4.482617,Asian Restaurant
8,La Hulpe,50.731505,4.479654,Le Solitaire,50.730995,4.487079,Cheese Shop
9,La Hulpe,50.731505,4.479654,L'heure du sushi,50.731721,4.49002,Sushi Restaurant


I will now restrict the dataset to include only those venue types that are shops or stores

In [184]:
df2=venues[venues['Venue Cat'].str.contains('shop|store', case=False)].reset_index(drop=True)
# remove duplicate venues (a few venues are recorded twice, in two neighbouring municipalities)
df2.drop_duplicates(subset='Venue', inplace=True)
# show head
df2.head(12)

Unnamed: 0,Municipality,Mun Lat,Mun Long,Venue,Venue Lat,Venue Long,Venue Cat
0,La Hulpe,50.731505,4.479654,"Giot, createur de saveurs",50.730434,4.482144,Gourmet Shop
1,La Hulpe,50.731505,4.479654,Le Solitaire,50.730995,4.487079,Cheese Shop
2,La Hulpe,50.731505,4.479654,Tom & Co,50.728375,4.487826,Pet Store
3,La Hulpe,50.731505,4.479654,La Mazerine,50.727351,4.486791,Shopping Mall
4,La Hulpe,50.731505,4.479654,L'Atelier de Gepetto,50.727124,4.486855,Toy / Game Store
5,Rixensart,50.713355,4.52729,Full Time,50.714093,4.530708,Bookstore
6,Rosières,50.73713,4.546311,Hunting Lodge,50.733528,4.556831,Hobby Shop
7,Rosières,50.73713,4.546311,Golf & Country Pro Shop,50.731756,4.558747,Sporting Goods Shop
9,Genval,50.720745,4.497139,Biostory,50.722534,4.51396,Health Food Store
11,Genval,50.720745,4.497139,Carrefour Express,50.72674,4.505277,Convenience Store


In [185]:
df2.shape

(451, 7)
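
One caveat with de-duplicating on the venue name alone is that separate branches of a chain share a name and would be collapsed into one row. A possible refinement, shown here only as a sketch, is to treat a venue as a duplicate when its name and coordinates both match. In the toy frame below the first row is taken from the table above; the second row stands for the same venue picked up from a neighbouring municipality, and the third is an invented second branch.

```python
import pandas as pd

shops = pd.DataFrame({
    'Municipality': ['Genval', 'Rixensart', 'Waterloo'],
    'Venue': ['Carrefour Express', 'Carrefour Express', 'Carrefour Express'],
    'Venue Lat': [50.72674, 50.72674, 50.71800],    # third value is invented
    'Venue Long': [4.505277, 4.505277, 4.39900],    # third value is invented
})

# Name only: every branch of the chain is reduced to a single row.
by_name = shops.drop_duplicates(subset='Venue')

# Name plus coordinates: only the true repeat is removed.
by_place = shops.drop_duplicates(subset=['Venue', 'Venue Lat', 'Venue Long'])

print(len(by_name), len(by_place))
```

Switching to the stricter rule would change the row count reported above, so the downstream results would need to be regenerated.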

In [186]:
# one hot encoding - same as in the course
onehot = pd.get_dummies(df2[['Venue Cat']], prefix="", prefix_sep="")

# add municipality column back to dataframe
onehot['Municipality'] = df2['Municipality'] 

# move municipality column to the first column
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]

onehot.head(10)

Unnamed: 0,Municipality,Accessories Store,Antique Shop,Arts & Crafts Store,Auto Workshop,Automotive Shop,Baby Store,Bagel Shop,Beer Store,Big Box Store,Bike Shop,Board Shop,Bookstore,Bridal Shop,Camera Store,Candy Store,Carpet Store,Cheese Shop,Chocolate Shop,Clothing Store,Coffee Shop,Convenience Store,Cosmetics Shop,Cupcake Shop,Department Store,Dessert Shop,Discount Store,Electronics Store,Fabric Shop,Fishing Store,Flower Shop,Food & Drink Shop,Fruit & Vegetable Store,Furniture / Home Store,Gift Shop,Gourmet Shop,Grocery Store,Hardware Store,Health Food Store,Hobby Shop,Ice Cream Shop,Jewelry Store,Kids Store,Lighting Store,Lingerie Store,Liquor Store,Medical Supply Store,Men's Store,Miscellaneous Shop,Mobile Phone Shop,Mobility Store,Motorcycle Shop,Music Store,Optical Shop,Other Repair Shop,Outdoor Supply Store,Paper / Office Supplies Store,Pet Store,Print Shop,Salon / Barbershop,Shoe Store,Shop & Service,Shopping Mall,Smoke Shop,Souvenir Shop,Sporting Goods Shop,Stationery Store,Tailor Shop,Thrift / Vintage Store,Toy / Game Store,Vape Store,Video Store,Wine Shop,Women's Store
0,La Hulpe,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,La Hulpe,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,La Hulpe,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,La Hulpe,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,La Hulpe,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
5,Rixensart,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,Rosières,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,Rosières,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
9,Genval,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
11,Genval,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [187]:
grouped = onehot.groupby('Municipality').mean().reset_index()

grouped.head(10)

Unnamed: 0,Municipality,Accessories Store,Antique Shop,Arts & Crafts Store,Auto Workshop,Automotive Shop,Baby Store,Bagel Shop,Beer Store,Big Box Store,Bike Shop,Board Shop,Bookstore,Bridal Shop,Camera Store,Candy Store,Carpet Store,Cheese Shop,Chocolate Shop,Clothing Store,Coffee Shop,Convenience Store,Cosmetics Shop,Cupcake Shop,Department Store,Dessert Shop,Discount Store,Electronics Store,Fabric Shop,Fishing Store,Flower Shop,Food & Drink Shop,Fruit & Vegetable Store,Furniture / Home Store,Gift Shop,Gourmet Shop,Grocery Store,Hardware Store,Health Food Store,Hobby Shop,Ice Cream Shop,Jewelry Store,Kids Store,Lighting Store,Lingerie Store,Liquor Store,Medical Supply Store,Men's Store,Miscellaneous Shop,Mobile Phone Shop,Mobility Store,Motorcycle Shop,Music Store,Optical Shop,Other Repair Shop,Outdoor Supply Store,Paper / Office Supplies Store,Pet Store,Print Shop,Salon / Barbershop,Shoe Store,Shop & Service,Shopping Mall,Smoke Shop,Souvenir Shop,Sporting Goods Shop,Stationery Store,Tailor Shop,Thrift / Vintage Store,Toy / Game Store,Vape Store,Video Store,Wine Shop,Women's Store
0,Affligem,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Alsemberg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Asse,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.272727,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.090909,0.0,0.181818,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Baardegem,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Beersel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Beigem,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0
6,Bekkerzeel,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Bellingen,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Bertem,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.285714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0
9,Borchtlombeek,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
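
The one-hot encoding followed by a group mean yields, for each municipality, the share of its shops that fall in each category. As a sketch on a five-row sample taken from the shop table above, the same shares can be obtained with a normalised cross-tabulation. The names `shops`, `via_dummies` and `via_crosstab` are illustrative.

```python
import pandas as pd

shops = pd.DataFrame({
    'Municipality': ['La Hulpe', 'La Hulpe', 'La Hulpe', 'Genval', 'Genval'],
    'Venue Cat': ['Gourmet Shop', 'Cheese Shop', 'Pet Store',
                  'Health Food Store', 'Convenience Store'],
})

# Route 1: one-hot encode, then average per municipality.
onehot_demo = pd.get_dummies(shops['Venue Cat']).astype(float)
onehot_demo.insert(0, 'Municipality', shops['Municipality'])
via_dummies = onehot_demo.groupby('Municipality').mean()

# Route 2: count category occurrences per municipality and divide by the row total.
via_crosstab = pd.crosstab(shops['Municipality'], shops['Venue Cat'], normalize='index')

print(via_crosstab.round(2))
```

Each row sums to 1, which is worth keeping in mind when reading the clusters: a municipality with two shops and one with fifty are weighted equally.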


Define a function that returns the most common venue types for a municipality, sorted in descending order of frequency

In [188]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [189]:
num_top_venues = 4

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Municipality']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Type'.format(ind+1, indicators[ind]))
    except IndexError: # no 'st'/'nd'/'rd' suffix left, so fall back to 'th'
        columns.append('{}th Type'.format(ind+1))

# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['Municipality'] = grouped['Municipality']

for ind in np.arange(grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(grouped.iloc[ind, :], num_top_venues)
    
venues_sorted.shape

(100, 5)
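
When a municipality has fewer shop categories than `num_top_venues`, sorting the whole row fills the remaining slots with categories whose share is zero. The sketch below shows one way to list only categories that actually occur, using `nlargest`. The helper `top_categories` is hypothetical, and the shares are rounded values for Asse and Beersel taken from the grouped table above.

```python
import pandas as pd

shares = pd.DataFrame(
    {'Clothing Store': [0.27, 0.00],
     'Flower Shop': [0.09, 0.40],
     'Furniture / Home Store': [0.18, 0.00],
     'Convenience Store': [0.00, 0.20]},
    index=['Asse', 'Beersel'],
)

def top_categories(row, n):
    """Names of up to n categories with a non-zero share, largest first."""
    nonzero = row[row > 0]
    return nonzero.nlargest(n).index.tolist()

top = {name: top_categories(row, 3) for name, row in shares.iterrows()}
print(top)
```

In this sample Beersel has only two categories with a non-zero share, so its list is shorter than three rather than being padded.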

And now the cluster analysis using k-means

In [190]:
# set number of clusters
kclusters = 5

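grouped_clustering = grouped.drop(columns='Municipality')

# run k-means clustering (n_init is set explicitly because its default differs between scikit-learn versions)
kmeans = KMeans(n_clusters=kclusters, n_init=10, random_state=0).fit(grouped_clustering)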
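
The number of clusters is fixed at 5 here. One common way to sanity-check such a choice is to compare silhouette scores over a range of values of k. The following is only a sketch of that technique on synthetic data (three well-separated blobs standing in for the share matrix). It is not a result for the Brussels data, and the names `features`, `scores` and `best_k` are mine.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the share matrix: three tight blobs in 4 dimensions.
centres = np.array([[0, 0, 0, 0], [5, 5, 0, 0], [0, 0, 5, 5]], dtype=float)
features = np.vstack([c + rng.normal(scale=0.3, size=(20, 4)) for c in centres])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    # Mean silhouette: closer to 1 means tighter, better separated clusters.
    scores[k] = silhouette_score(features, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

Running the same loop on `grouped_clustering` would give a rough indication of whether 5 is a reasonable choice, although sparse share vectors often produce low silhouette values for every k.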


Create a new dataframe that includes the cluster as well as the top 4 venue types for each municipality

In [191]:
# add clustering labels
venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

venues_sorted.head(8)

Unnamed: 0,Cluster Labels,Municipality,1st Type,2nd Type,3rd Type,4th Type
0,0,Affligem,Coffee Shop,Ice Cream Shop,Flower Shop,Electronics Store
1,1,Alsemberg,Furniture / Home Store,Print Shop,Cosmetics Shop,Dessert Shop
2,1,Asse,Clothing Store,Furniture / Home Store,Shop & Service,Food & Drink Shop
3,1,Baardegem,Food & Drink Shop,Liquor Store,Flower Shop,Electronics Store
4,0,Beersel,Flower Shop,Convenience Store,Cosmetics Shop,Food & Drink Shop
5,0,Beigem,Wine Shop,Flower Shop,Electronics Store,Convenience Store
6,1,Bekkerzeel,Outdoor Supply Store,Antique Shop,Furniture / Home Store,Discount Store
7,4,Bellingen,Fruit & Vegetable Store,Cosmetics Shop,Women's Store,Discount Store


In [192]:
final_df = region

# merge
final_df = final_df.join(venues_sorted.set_index('Municipality'), on='Municipality')

# get rid of rows (municipalities) for which data is not available (in case this occurs)
final_df.dropna(axis=0,inplace=True)

final_df.reset_index(inplace=True)

final_df.head(10)

Unnamed: 0,index,Postal,Municipality,Long,Lat,Dist,Cluster Labels,1st Type,2nd Type,3rd Type,4th Type
0,25,1310,La Hulpe,4.479654,50.731505,15.64,1.0,Cheese Shop,Shopping Mall,Gourmet Shop,Toy / Game Store
1,41,1330,Rixensart,4.52729,50.713355,19.28,2.0,Bookstore,Women's Store,Clothing Store,Convenience Store
2,42,1331,Rosières,4.546311,50.73713,18.31,1.0,Hobby Shop,Sporting Goods Shop,Women's Store,Discount Store
3,43,1332,Genval,4.497139,50.720745,17.33,1.0,Discount Store,Health Food Store,Convenience Store,Motorcycle Shop
4,99,1410,Waterloo,4.397805,50.717356,14.73,1.0,Clothing Store,Electronics Store,Bookstore,Sporting Goods Shop
5,100,1420,Braine-L'alleud,4.354815,50.694094,16.95,1.0,Furniture / Home Store,Women's Store,Health Food Store,Department Store
6,110,1440,Braine-Le-Château,4.266669,50.680882,19.37,1.0,Cheese Shop,Video Store,Convenience Store,Flower Shop
7,111,1440,Wauthier-Braine,4.313304,50.680832,18.62,1.0,Furniture / Home Store,Convenience Store,Tailor Shop,Electronics Store
8,132,1480,Clabecq,4.221413,50.689374,19.73,1.0,Thrift / Vintage Store,Ice Cream Shop,Flower Shop,Department Store
9,135,1480,Tubize,4.204696,50.69302,19.96,4.0,Cosmetics Shop,Shoe Store,Women's Store,Discount Store
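
The cluster labels appear as 1.0, 2.0 and so on because the join first introduces missing values for municipalities with no shop data (Ohain, for example, is in the region table but not in the final one), which forces the column to float before those rows are dropped. As a sketch, an inner merge keeps only matching rows from the start, so the labels stay integers. The two small frames below are illustrative and use values from the tables above.

```python
import pandas as pd

region_demo = pd.DataFrame({
    'Municipality': ['La Hulpe', 'Ohain', 'Waterloo'],
    'Dist': [15.64, 18.22, 14.73],
})
labels_demo = pd.DataFrame({
    'Municipality': ['La Hulpe', 'Waterloo'],
    'Cluster Labels': [1, 1],
})

# Left join: Ohain gets NaN, so the label column becomes float.
left = region_demo.join(labels_demo.set_index('Municipality'), on='Municipality')

# Inner merge: unmatched rows never enter the result, so the integer dtype survives.
inner = region_demo.merge(labels_demo, on='Municipality', how='inner')

print(left['Cluster Labels'].dtype, inner['Cluster Labels'].dtype)
```

The inner merge also removes the need for the separate `dropna` step, at the cost of no longer seeing which municipalities were left out.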


Create map with clusters

In [193]:
# create map with clusters - centred on the central point (Brussels) that I chose earlier

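c_lat=centre['Lat'].iloc[0] # scalar latitude of Brussels (folium expects numbers, not a Series)
c_long=centre['Long'].iloc[0] # scalar longitude of Brussels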

map_clusters = folium.Map(location=[c_lat, c_long], zoom_start=10)

# set color scheme for the clusters: one evenly spaced rainbow colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(final_df['Lat'], final_df['Long'], final_df['Municipality'], final_df['Cluster Labels']):
    cluster = int(cluster) # labels are floats after the join; use them as zero-based colour indices
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters.save("belmap2.html")
print("belmap2 saved")
map_clusters

belmap2 saved
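
To read the saved map it helps to know which colour belongs to which cluster. The snippet below is a small sketch that builds the palette on its own and prints a label-to-colour lookup. It assumes the palette is indexed directly by the zero-based k-means label; subtracting one from a zero-based label would instead send label 0 to the last colour through negative indexing. The names `palette` and `colour_of` are mine.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# One evenly spaced rainbow colour per cluster, converted to hex strings.
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# Lookup from zero-based cluster label to marker colour.
colour_of = {label: palette[label] for label in range(kclusters)}
print(colour_of)
```

The printed dictionary can serve as a simple legend alongside the HTML file.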
