# Where to Locate a New Italian Restaurant in Antwerpen, Belgium
## Coursera Capstone Project

*Created by: Anna Sukhareva, data scientist  
Antwerpen, Belgium  
Date: November 04, 2019  
Contacts: anna@linefeed.be*  
***

### Table of Contents
Stage 1 : Business Understanding  
Stage 2 : Analytic Approach  
Stage 3 : Data Requirements  
Stage 4 : Data Collection & Data Understanding  
Stage 5 : Data Preparation  
Stage 6 : Modeling & Evaluation  
Stage 7 : Conclusions
***

### Stage 1 : Business Understanding
**Problem:**  
Location influences the success or failure of a restaurant in a host of ways, from attracting enough initial customer interest to being convenient to visit. The restaurant's location is also interrelated with other factors, such as the immediate surroundings of the site, its accessibility, and how competitive the neighborhood is.  
To choose the location for a new restaurant, researching the surrounding businesses is a must, in order to answer the following questions:
+ Is there enough room for a new restaurant?   
+ What are the local trends in that area?   
+ How does the location work for surrounding businesses, and what impact will that have on the performance of a new business?

**Question:**  
Can we cluster similar areas of Antwerpen and make a profile of each area?

### Stage 2 : Analytic Approach
As the question requires clustering, a clustering model will be built with the K-Means method.    
To evaluate the model's performance, we'll use the inertia (elbow) method. 
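
Inertia is the sum of squared distances from each point to its nearest cluster centre; K-Means tries to minimise it, and it always shrinks as clusters are added. The sketch below computes it by hand in plain Python. The function name `inertia` and the toy points are illustrative assumptions and are not taken from the notebook's data.

```python
def inertia(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        # squared distance to the closest centroid
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c))
            for c in centroids
        )
    return total


# Toy 2-D data: two tight pairs of points, far apart (made-up values).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_cluster = inertia(points, [(5.0, 1.0)])
two_clusters = inertia(points, [(0.0, 1.0), (10.0, 1.0)])
print(one_cluster, two_clusters)
```

Going from one centre to two makes the inertia drop sharply here, which is the kind of bend the elbow method looks for.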

### Stage 3 : Data Requirements
**Data content:** To answer the question we need a map of Antwerp, information about the city's boroughs and neighborhoods, and its venues with their geographical coordinates.  
**Data sources:** We're going to gather all the information from open sources (boroughs and neighborhoods: Wikipedia; venues and their locations: [Foursquare](https://foursquare.com/)).  

### Stage 4 :  Data Collection & Data Understanding
Importing libraries:

In [1]:
! pip install folium



In [2]:
! pip install geopy

Collecting geopy
  Downloading https://files.pythonhosted.org/packages/80/93/d384479da0ead712bdaf697a8399c13a9a89bd856ada5a27d462fb45e47b/geopy-1.20.0-py2.py3-none-any.whl (100kB)
Collecting geographiclib<2,>=1.49 (from geopy)
  Downloading https://files.pythonhosted.org/packages/8b/62/26ec95a98ba64299163199e95ad1b0e34ad3f4e176e221c40245f211e425/geographiclib-1.50-py3-none-any.whl
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-1.50 geopy-1.20.0


In [3]:
import pandas as pd
import numpy as np 

# Querying the Foursquare API
import json
from pandas.io.json import json_normalize
from geopy.geocoders import Nominatim


# web scraping
#import lxml 
import html5lib
import requests
import io
from IPython.display import display_html

#visualization
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
import seaborn
from IPython.display import Image 
from IPython.core.display import HTML 


# Machine Learning
from sklearn.cluster import KMeans

#map
import folium
  
print('Libraries imported.')

Libraries imported.


<a id='item1'></a>

##### Preparing dataset: Antwerpen boroughs, neighborhoods and coordinates
Getting Antwerpen boroughs and neighborhoods

In [4]:
# Borough column
# data from here : https://portaal-stadantwerpen.opendata.arcgis.com/datasets/district/data

borough = pd.read_csv('districtscsv.csv')
borough.districtnaam = borough.districtnaam.str.capitalize() 
borough = borough[['districtnaam']]
borough.columns = ['Borough']
borough.replace({'Borough' : 'Berendrecht_zandvliet_lillo'}, 'Berendrecht', inplace=True)
borough

Unnamed: 0,Borough
0,Wilrijk
1,Hoboken
2,Berchem
3,Borgerhout
4,Deurne
5,Merksem
6,Antwerpen
7,Berendrecht
8,Ekeren


In [5]:
# Antwerpen Neighborhood
#from Wikipedia https://en.wikipedia.org/wiki/Antwerp_(district)

antwerpen_districts = ['Antwerpen Noord', 'Brederode', 'Centraal Station', 'Den Dam', 'Eilandje, Haringrode', 'Harmonie', 'Historisch Centrum', 'Kiel', 'Linkeroever', 'Luchtbal', 'Rozemaai', 'Schoonbroek', 'Markgrave', 'Meir', 'Middelheim', 'Schipperskwartier', 'Sint-Andries', 'Stadspark', 'Tentoonstellingswijk', 'Theaterbuurt', 'Universiteitswijk', 'Zurenborg', 'Zuid']
neighborhood = pd.DataFrame(antwerpen_districts,  columns=['Neighborhood']) 
neighborhood['Borough'] = 'Antwerpen'
neighborhood

Unnamed: 0,Neighborhood,Borough
0,Antwerpen Noord,Antwerpen
1,Brederode,Antwerpen
2,Centraal Station,Antwerpen
3,Den Dam,Antwerpen
4,"Eilandje, Haringrode",Antwerpen
5,Harmonie,Antwerpen
6,Historisch Centrum,Antwerpen
7,Kiel,Antwerpen
8,Linkeroever,Antwerpen
9,Luchtbal,Antwerpen


In [None]:
# join 
antwerpen = pd.merge(borough, neighborhood, on='Borough', how='outer')
antwerpen.loc[antwerpen['Neighborhood'].isnull(),'Neighborhood'] = antwerpen['Borough']

antwerpen['latitude'] = ''
antwerpen['longitude'] = ''

antwerpen

Getting the latitude and longitude of Antwerpen:

In [7]:
from geopy.geocoders import Nominatim

In [12]:
geolocator = Nominatim(user_agent="anwerp_app")
location = geolocator.geocode('Belgium, Antwerpen', timeout=15)
latitude = location.latitude
longitude = location.longitude

In [13]:
print('The geographical coordinates of Antwerpen are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Antwerpen are 51.2211097, 4.3997081.


Developing a function to get coordinates using the Geopy library:

In [16]:
def getGeo(address):
    """Return the row with latitude and longitude filled in for one neighborhood (0, 0 if not found)."""
    
    geolocator = Nominatim(user_agent="anwerp_app")
    location = geolocator.geocode('Antwerpen, ' + address.Neighborhood , timeout=15)
    if location is None:
    #    print('address not found:' + address.Neighborhood)
        address.latitude = 0
        address.longitude = 0
    else:
    #    print('address found:' + address.Neighborhood)
        address.latitude = location.latitude
        address.longitude = location.longitude
    
    return address
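
Geocoding services can time out or return nothing, so a small wrapper that retries and reports a miss as `None` (rather than the `0` sentinel used above) can make missing coordinates easier to spot. The sketch below is one possible shape for that, not the notebook's method: `geocode_with_retry`, `fake_geocode` and `KNOWN` are hypothetical names, and the stand-in geocoder replaces the real `geolocator.geocode` call so the example runs offline.

```python
import time
from collections import namedtuple

# Minimal stand-in for the object a geocoder returns (assumption for this sketch).
Location = namedtuple("Location", ["latitude", "longitude"])


def geocode_with_retry(query, geocode, retries=3, pause=1.0):
    """Return (latitude, longitude) for `query`, or None if nothing is found.

    `geocode` is any callable that takes a string and returns an object with
    .latitude and .longitude, or None when the address is unknown.
    """
    for _ in range(retries):
        try:
            location = geocode(query)
        except Exception:
            # Treat any error as transient: wait, then try again.
            time.sleep(pause)
            continue
        if location is None:
            return None
        return (location.latitude, location.longitude)
    return None


# Offline stand-in for geolocator.geocode; the coordinates are the ones
# entered by hand for Eilandje later in the notebook.
KNOWN = {"Antwerpen, Eilandje": Location(51.2353, 4.4099)}


def fake_geocode(query):
    return KNOWN.get(query)


found = geocode_with_retry("Antwerpen, Eilandje", fake_geocode, pause=0)
missing = geocode_with_retry("Antwerpen, Nowhere", fake_geocode, pause=0)
print(found, missing)
```

In the notebook the second argument would be `geolocator.geocode`, and rows whose result is `None` would be the ones to fill in manually.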


In [None]:
# for every address in the dataframe, add the geolocation using the getGeo function
neighborhood = antwerpen.apply(lambda row: getGeo(row), axis=1)
neighborhood

Checking for missing coordinates:

In [None]:
# Checking for 0
missed_coordinates = neighborhood.loc[neighborhood['latitude'] == 0]
missed_coordinates

In [None]:
neighborhood.drop(missed_coordinates.index, inplace=True)
neighborhood

In [None]:
missed = pd.DataFrame([['Antwerpen', 'Eilandje', 51.2353, 4.4099],
                       ['Antwerpen', 'Historisch Centrum', 51.2212, 4.3998],
                       ['Antwerpen', 'Universiteitswijk', 51.2235, 4.4096]], columns= ['Borough', 'Neighborhood', 'latitude', 'longitude'])

missed

In [36]:
neighborhood = pd.concat([neighborhood, missed], axis = 0)
neighborhood.sort_values(by=['Borough'], inplace=True)
neighborhood.reset_index(drop=True, inplace=True)
neighborhood

Unnamed: 0,Borough,Neighborhood,latitude,longitude
0,Antwerpen,Schoonbroek,51.279933,4.409993
1,Antwerpen,Eilandje,51.2353,4.4099
2,Antwerpen,Zuid,51.199941,4.390926
3,Antwerpen,Zurenborg,51.206853,4.430287
4,Antwerpen,Theaterbuurt,51.214919,4.409325
5,Antwerpen,Tentoonstellingswijk,51.190651,4.388902
6,Antwerpen,Stadspark,51.21248,4.414373
7,Antwerpen,Sint-Andries,51.216174,4.398647
8,Antwerpen,Schipperskwartier,51.225922,4.404064
9,Antwerpen,Middelheim,51.180541,4.413492


In [None]:
neighborhood.dtypes

In [None]:
type(neighborhood)

In [39]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(antwerpen['Borough'].unique()),
        antwerpen.shape[0]
    )
)

The dataframe has 9 boroughs and 31 neighborhoods.


#### Visualization: Map of Antwerpen, with all Boroughs and Neighborhoods.

In [26]:
map_antwerp = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
# loop variables get their own names so the `borough` and `neighborhood` dataframes are not overwritten
for lat, lng, borough_name, hood_name in zip(neighborhood['latitude'], neighborhood['longitude'], neighborhood['Borough'], neighborhood['Neighborhood']):
    label = '{}, {}'.format(hood_name, borough_name)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_antwerp)  
    
map_antwerp

Slicing the original dataframe: 

In [40]:
type(neighborhood)

pandas.core.frame.DataFrame

In [41]:
antwerpen_data = neighborhood.loc[neighborhood.Borough == 'Antwerpen'].reset_index(drop=True)
antwerpen_data

Unnamed: 0,Borough,Neighborhood,latitude,longitude
0,Antwerpen,Schoonbroek,51.279933,4.409993
1,Antwerpen,Eilandje,51.2353,4.4099
2,Antwerpen,Zuid,51.199941,4.390926
3,Antwerpen,Zurenborg,51.206853,4.430287
4,Antwerpen,Theaterbuurt,51.214919,4.409325
5,Antwerpen,Tentoonstellingswijk,51.190651,4.388902
6,Antwerpen,Stadspark,51.21248,4.414373
7,Antwerpen,Sint-Andries,51.216174,4.398647
8,Antwerpen,Schipperskwartier,51.225922,4.404064
9,Antwerpen,Middelheim,51.180541,4.413492


In [None]:
antwerpen_data_ = antwerpen_data.copy()
antwerpen_data_

### Visualization: Antwerpen Borough and its neighborhoods:

Let's visualize the Antwerpen district and the neighborhoods in it.

In [43]:
# create map of Antwerpen using latitude and longitude values
map_antwerp = folium.Map(location=[latitude, longitude], zoom_start=13)

# add markers to map to plot neighborhoods
for lat, lng, label in zip(antwerpen_data['latitude'], antwerpen_data['longitude'], antwerpen_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_antwerp)  
    
map_antwerp

In [44]:
type(antwerpen_data)

pandas.core.frame.DataFrame

### Getting info about all venues, using Foursquare API

Getting venues information about **the first Neighborhood**

In [45]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID'  # placeholder: keep real credentials out of shared notebooks
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'  # placeholder
VERSION = '20191031'
LIMIT = 100

Get the neighborhood's name, latitude and longitude values:

In [46]:
neighborhood_latitude = antwerpen_data.loc[0, 'latitude'] 
neighborhood_longitude = antwerpen_data.loc[0, 'longitude']
neighborhood_name = antwerpen_data.loc[0, 'Neighborhood']

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, neighborhood_latitude, neighborhood_longitude))

Latitude and longitude values of Schoonbroek are 51.2799334, 4.4099928.


Getting top 100 venues that are in a radius of 500 meters:

In [47]:
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=YOUR_FOURSQUARE_CLIENT_ID&client_secret=YOUR_FOURSQUARE_CLIENT_SECRET&v=20191031&ll=51.2799334,4.4099928&radius=500&limit=100'
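
As an alternative to formatting the query string by hand, the parameters can be kept in a dictionary and encoded with the standard library. This is only a sketch: `explore_url` is a hypothetical helper name, the credentials are placeholders, and the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode


def explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the venues/explore request URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode escapes special characters (the comma in `ll` becomes %2C).
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


url = explore_url(
    "YOUR_FOURSQUARE_CLIENT_ID",      # placeholder
    "YOUR_FOURSQUARE_CLIENT_SECRET",  # placeholder
    "20191031",
    51.2799334,
    4.4099928,
)
print(url)
```

Keeping the parameters in one dictionary also makes it easy to change `radius` or `limit` per call without touching the URL template.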

In [None]:
# Sending the GET request
results = requests.get(url).json()
results

Cleaning the result and structuring it into a pandas dataframe: 

In [49]:
# Getting the category types, using a custom function
def get_category_type(row):
    """Extract the category name of the venue."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
    

# Cleaning the JSON from the request and structuring it into a pandas dataframe:

venues = results['response']['groups'][0]['items']
# flatten JSON
nearby_venues = json_normalize(venues) 
# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]
# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
#clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,'t Aperoke,Bar,51.27867,4.410173
1,Zwembad De Schinde,Pool,51.279817,4.413643
2,Putten van Ekeren,Park,51.282588,4.404558
3,Halte Ekeren Akkerstraat,Bus Stop,51.27901,4.410192
4,Halte Antwerpen Oorderseweg,Bus Stop,51.281993,4.411528


Result: how many venues were returned by Foursquare?

In [50]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

11 venues were returned by Foursquare.


<a id='item2'></a>

### Getting venues information about **ALL Neighborhoods in Antwerpen Borough**:

Creating a custom function to repeat the same process for all the neighborhoods in Antwerpen Borough:

In [51]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """The function makes a request to Foursquare API, row by row in given dataframe, returns dataframe"""
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
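
The list comprehension inside `getNearbyVenues` reads `categories[0]` directly, which would fail for a venue that comes back with an empty category list. A more tolerant row builder could look like the sketch below. `venue_rows` is a hypothetical helper, and the `items` list is hand-written to imitate the shape of `["response"]["groups"][0]["items"]`; the second venue is invented for illustration.

```python
def venue_rows(name, lat, lng, items):
    """Turn Foursquare 'items' into flat tuples, using None when a venue has no category."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((
            name,
            lat,
            lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hand-written sample in the shape of the API response (second venue is made up).
items = [
    {"venue": {"name": "'t Aperoke",
               "location": {"lat": 51.27867, "lng": 4.410173},
               "categories": [{"name": "Bar"}]}},
    {"venue": {"name": "Venue without category",
               "location": {"lat": 51.28, "lng": 4.41},
               "categories": []}},
]

rows = venue_rows("Schoonbroek", 51.279933, 4.409993, items)
for row in rows:
    print(row)
```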

Calling the function:

In [52]:
antwerpen_venues = getNearbyVenues(names=antwerpen_data['Neighborhood'], latitudes=antwerpen_data['latitude'], longitudes=antwerpen_data['longitude'])
antwerpen_venues.head()

Schoonbroek
Eilandje
Zuid
Zurenborg
Theaterbuurt
Tentoonstellingswijk
Stadspark
Sint-Andries
Schipperskwartier
Middelheim
Meir
Markgrave
Historisch Centrum
Rozemaai
Universiteitswijk
Linkeroever
Kiel
Harmonie
Den Dam
Centraal Station
Brederode
Antwerpen Noord
Luchtbal


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Schoonbroek,51.279933,4.409993,'t Aperoke,51.27867,4.410173,Bar
1,Schoonbroek,51.279933,4.409993,Zwembad De Schinde,51.279817,4.413643,Pool
2,Schoonbroek,51.279933,4.409993,Putten van Ekeren,51.282588,4.404558,Park
3,Schoonbroek,51.279933,4.409993,Halte Ekeren Akkerstraat,51.27901,4.410192,Bus Stop
4,Schoonbroek,51.279933,4.409993,Halte Antwerpen Oorderseweg,51.281993,4.411528,Bus Stop


Checking the size of the resulting dataframe:

In [53]:
print(antwerpen_venues.shape)

(1139, 7)


How many venues were returned for each neighborhood?

In [54]:
antwerpen_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Antwerpen Noord,1,1,1,1,1,1
Brederode,39,39,39,39,39,39
Centraal Station,64,64,64,64,64,64
Den Dam,100,100,100,100,100,100
Eilandje,43,43,43,43,43,43
Harmonie,23,23,23,23,23,23
Historisch Centrum,100,100,100,100,100,100
Kiel,26,26,26,26,26,26
Linkeroever,16,16,16,16,16,16
Luchtbal,23,23,23,23,23,23


How many unique categories can be curated from all the returned venues?

In [55]:
print('There are {} unique categories.'.format(len(antwerpen_venues['Venue Category'].unique())))

There are 196 unique categories.


<a id='item3'></a>

## Analyzing each neighborhood
Creating dummy variables:

In [56]:
# one hot encoding
antwerpen_onehot = pd.get_dummies(antwerpen_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
antwerpen_onehot['Neighborhood'] = antwerpen_venues['Neighborhood'] 
# move neighborhood column to the first column
fixed_columns = [antwerpen_onehot.columns[-1]] + list(antwerpen_onehot.columns[:-1])
antwerpen_onehot = antwerpen_onehot[fixed_columns]

antwerpen_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,American Restaurant,Aquarium,Argentinian Restaurant,Art Gallery,Art Museum,Asian Restaurant,Athletics & Sports,Auto Garage,...,Used Bookstore,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Waterfront,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Schoonbroek,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Schoonbroek,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Schoonbroek,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Schoonbroek,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Schoonbroek,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Examining the new dataframe size:

In [57]:
antwerpen_onehot.shape

(1139, 197)

Grouping rows by neighborhood and taking the **mean of the frequency of occurrence** of each category: 

In [58]:
antwerpen_grouped = antwerpen_onehot.groupby('Neighborhood').mean().reset_index()
antwerpen_grouped

Unnamed: 0,Neighborhood,Accessories Store,American Restaurant,Aquarium,Argentinian Restaurant,Art Gallery,Art Museum,Asian Restaurant,Athletics & Sports,Auto Garage,...,Used Bookstore,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Waterfront,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Antwerpen Noord,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Brederode,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.0,0.0,...,0.0,0.0,0.0,0.0,0.025641,0.0,0.0,0.025641,0.0,0.0
2,Centraal Station,0.0,0.0,0.015625,0.015625,0.0,0.0,0.078125,0.0,0.0,...,0.015625,0.015625,0.0,0.0,0.0,0.0,0.0,0.0,0.015625,0.078125
3,Den Dam,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.0,...,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0
4,Eilandje,0.0,0.0,0.0,0.0,0.023256,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.023256,0.0,0.0,0.0,0.0,0.0
5,Harmonie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Historisch Centrum,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.0,...,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0
7,Kiel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Linkeroever,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Luchtbal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
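
Taking the mean of one-hot columns per neighborhood is the same as computing each category's share of that neighborhood's venues. The plain-Python sketch below shows this equivalence without pandas. `category_frequencies` is a hypothetical helper, and the input is only the five Schoonbroek rows shown in the `head()` output earlier, so the shares are illustrative and not the notebook's full values.

```python
from collections import Counter, defaultdict


def category_frequencies(venues):
    """Map each neighborhood to {category: share of that neighborhood's venues}."""
    counts = defaultdict(Counter)
    for hood, category in venues:
        counts[hood][category] += 1
    return {
        hood: {cat: n / sum(c.values()) for cat, n in c.items()}
        for hood, c in counts.items()
    }


# (neighborhood, category) pairs from the first five rows of antwerpen_venues.
venues = [
    ("Schoonbroek", "Bar"),
    ("Schoonbroek", "Pool"),
    ("Schoonbroek", "Park"),
    ("Schoonbroek", "Bus Stop"),
    ("Schoonbroek", "Bus Stop"),
]

freq = category_frequencies(venues)
print(freq["Schoonbroek"])
```

With two bus stops among five venues, the "Bus Stop" share is 2/5, exactly what the mean of a 0/1 column over those five rows would give.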


Confirming the new size : 

In [59]:
antwerpen_grouped.shape

(23, 197)

### Printing each neighborhood along with the top 5 most common venues

In [60]:
num_top_venues = 5

for hood in antwerpen_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = antwerpen_grouped[antwerpen_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Antwerpen Noord----
           venue  freq
0  Train Station   1.0
1           Park   0.0
2    Music Venue   0.0
3      Nightclub   0.0
4   Noodle House   0.0


----Brederode----
                       venue  freq
0                Coffee Shop  0.13
1                        Bar  0.08
2                        Pub  0.08
3                   Friterie  0.05
4  Middle Eastern Restaurant  0.05


----Centraal Station----
                venue  freq
0  Italian Restaurant  0.08
1    Asian Restaurant  0.08
2         Zoo Exhibit  0.08
3     Thai Restaurant  0.06
4         Coffee Shop  0.06


----Den Dam----
               venue  freq
0                Bar  0.11
1        Coffee Shop  0.08
2       Cocktail Bar  0.07
3  French Restaurant  0.05
4              Plaza  0.05


----Eilandje----
            venue  freq
0       Nightclub  0.30
1  Sandwich Place  0.07
2     Coffee Shop  0.07
3      Restaurant  0.07
4    Cocktail Bar  0.05


----Harmonie----
            venue  freq
0  Sandwich Place  0.13
1  

Creating a dataframe :

In [61]:
# Developing a custom function to sort the venues in descending order

def return_most_common_venues(row, num_top_venues):
    """Return the names of the most frequent venue categories in a row, in descending order."""
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Create the new dataframe and display the top 10 venues for each neighborhood:

In [62]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = antwerpen_grouped['Neighborhood']

for ind in np.arange(antwerpen_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(antwerpen_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Antwerpen Noord,Train Station,Zoo Exhibit,Donut Shop,Flower Shop,Flea Market,Fishing Store,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Falafel Restaurant
1,Brederode,Coffee Shop,Bar,Pub,Friterie,Middle Eastern Restaurant,Pizza Place,Café,Restaurant,Bus Stop,Tailor Shop
2,Centraal Station,Zoo Exhibit,Asian Restaurant,Italian Restaurant,Thai Restaurant,Coffee Shop,Sandwich Place,Hotel,Bakery,Grocery Store,Supermarket
3,Den Dam,Bar,Coffee Shop,Cocktail Bar,French Restaurant,Restaurant,Plaza,Italian Restaurant,Belgian Restaurant,Fish & Chips Shop,Sushi Restaurant
4,Eilandje,Nightclub,Sandwich Place,Coffee Shop,Restaurant,Beach Bar,Cocktail Bar,Theater,Bar,Brewery,French Restaurant
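
The column names above are built with a three-element suffix list plus a fallback to "th", which is fine for the top 10. If the number of top venues ever grew past 20, a small ordinal helper would keep the labels correct. The sketch below is one way to write it; `ordinal` is a hypothetical name, not something defined in the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the other teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


num_top_venues = 10
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```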


<a id='item4'></a>

### Stage 6 : Modeling & Evaluation - Clustering Neighborhoods

Starting with 5 clusters:

In [63]:
kclusters = 5
antwerpen_grouped_clustering = antwerpen_grouped.drop(columns='Neighborhood')

# model + fit
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(antwerpen_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 0, 3, 0, 3, 3, 0, 3, 3, 2], dtype=int32)

### Looking for best k:

In [None]:
ks = [1,2,3,4,5,6]
inertias = []

for k in ks:
    # fit one model per candidate number of clusters
    model = KMeans(n_clusters=k, random_state=0)
    model.fit(antwerpen_grouped_clustering)
    # append the inertia to the list of inertias
    inertias.append(model.inertia_)

# plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
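
Reading the elbow off a plot is a judgment call. One simple heuristic, sketched below, picks the k where the inertia curve bends most sharply, measured by the largest second difference. `elbow_k` is a hypothetical helper, it assumes the candidate values of k are evenly spaced, and the inertia values are made up for illustration; they are not results from this notebook.

```python
def elbow_k(ks, inertias):
    """Return the k at which the drop in inertia slows down the most."""
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("need at least three (k, inertia) pairs of equal length")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        drop_before = inertias[i - 1] - inertias[i]
        drop_after = inertias[i] - inertias[i + 1]
        bend = drop_before - drop_after  # discrete second difference
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


ks = [1, 2, 3, 4, 5, 6]
inertias = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]  # made-up values
print(elbow_k(ks, inertias))
```

The heuristic is only a starting point; with few, sparse neighborhoods the curve may have no clear bend, and the choice of k should still be checked against how interpretable the clusters are.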

Creating a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [80]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

antwerpen_merged = antwerpen_data

# join antwerpen_data (latitude/longitude) with the cluster labels and top venues of each neighborhood
antwerpen_merged = antwerpen_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

antwerpen_merged.head() 

Unnamed: 0,Borough,Neighborhood,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Antwerpen,Schoonbroek,51.279933,4.409993,2,Soccer Field,Bus Stop,Health & Beauty Service,Sports Club,Bar,Park,Basketball Court,College Classroom,Pool,Design Studio
1,Antwerpen,Eilandje,51.2353,4.4099,3,Nightclub,Sandwich Place,Coffee Shop,Restaurant,Beach Bar,Cocktail Bar,Theater,Bar,Brewery,French Restaurant
2,Antwerpen,Zuid,51.199941,4.390926,0,Bar,Pizza Place,Restaurant,Coffee Shop,Moving Target,Train Station,Supermarket,Doner Restaurant,Sandwich Place,Gourmet Shop
3,Antwerpen,Zurenborg,51.206853,4.430287,0,Bar,Restaurant,Bistro,Pizza Place,Bakery,Italian Restaurant,Gastropub,Moroccan Restaurant,Pub,Indian Restaurant
4,Antwerpen,Theaterbuurt,51.214919,4.409325,3,Boutique,Clothing Store,Coffee Shop,Theater,Gastropub,Bakery,Men's Store,Pub,Plaza,Bar


### Visualization of clusters :

In [81]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(antwerpen_merged['latitude'], antwerpen_merged['longitude'], antwerpen_merged['Neighborhood'], antwerpen_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Legend :

cluster 0 - red   
cluster 1 - purple  
cluster 2 - blue  
cluster 3 - green  
cluster 4 - orange
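
The legend follows from the `rainbow[cluster-1]` indexing in the map cell: cluster 0 uses index -1, which in Python wraps around to the last colour in the list. The sketch below reproduces that mapping with colour names instead of hex codes. The `palette` names are an assumption, an approximate description of the five colours that `cm.rainbow` yields from low to high.

```python
# Approximate colour names, in the order the rainbow colormap produces them
# for five evenly spaced values (assumption for this sketch).
palette = ["purple", "blue", "green", "orange", "red"]

# Same indexing as the map cell: label - 1, so label 0 wraps to the last entry.
legend = {cluster: palette[cluster - 1] for cluster in range(len(palette))}

for cluster, colour in legend.items():
    print("cluster {} - {}".format(cluster, colour))
```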

### Stage 7 : Conclusions

Let's examine the clusters:

#### Cluster 0, red

In [82]:
antwerpen_merged.loc[antwerpen_merged['Cluster Labels'] == 0, antwerpen_merged.columns[[1] + list(range(5, antwerpen_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Zuid,Bar,Pizza Place,Restaurant,Coffee Shop,Moving Target,Train Station,Supermarket,Doner Restaurant,Sandwich Place,Gourmet Shop
3,Zurenborg,Bar,Restaurant,Bistro,Pizza Place,Bakery,Italian Restaurant,Gastropub,Moroccan Restaurant,Pub,Indian Restaurant
5,Tentoonstellingswijk,Park,Bakery,Grocery Store,Pharmacy,Supermarket,Belgian Restaurant,Restaurant,Notary,Friterie,Bar
8,Schipperskwartier,Bar,Restaurant,Coffee Shop,Italian Restaurant,Gay Bar,Friterie,French Restaurant,Asian Restaurant,Sandwich Place,Scenic Lookout
11,Markgrave,Sandwich Place,Italian Restaurant,Supermarket,Belgian Restaurant,Restaurant,Thai Restaurant,Bar,Breakfast Spot,French Restaurant,Friterie
12,Historisch Centrum,Bar,Coffee Shop,Cocktail Bar,Restaurant,French Restaurant,Plaza,Italian Restaurant,Belgian Restaurant,BBQ Joint,Fish & Chips Shop
14,Universiteitswijk,Bar,Coffee Shop,Sandwich Place,Asian Restaurant,Soup Place,Pub,Plaza,Spanish Restaurant,Salad Place,French Restaurant
18,Den Dam,Bar,Coffee Shop,Cocktail Bar,French Restaurant,Restaurant,Plaza,Italian Restaurant,Belgian Restaurant,Fish & Chips Shop,Sushi Restaurant
20,Brederode,Coffee Shop,Bar,Pub,Friterie,Middle Eastern Restaurant,Pizza Place,Café,Restaurant,Bus Stop,Tailor Shop


Conclusion:
    
+ Includes 9 neighborhoods. 
+ Business environment: the most popular venue categories are bars, coffee shops and restaurants. 
+ Competitors: high-competition area; competitors come from different horeca types (coffee bars, bars, restaurants, fast food restaurants). Italian restaurants appear in the top-10 venue categories of 5 neighborhoods. 
+ Accessibility: foot traffic.

**The cluster doesn't look appropriate.**
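
The count of neighborhoods with an Italian restaurant in their top 10 can be checked programmatically rather than by eye. The sketch below does this in plain Python, using the nine rows of the Cluster 0 table above copied by hand; `neighborhoods_with` is a hypothetical helper name.

```python
# Top-10 venue categories per neighborhood, copied from the Cluster 0 table.
cluster_0_top10 = {
    "Zuid": ["Bar", "Pizza Place", "Restaurant", "Coffee Shop", "Moving Target",
             "Train Station", "Supermarket", "Doner Restaurant", "Sandwich Place", "Gourmet Shop"],
    "Zurenborg": ["Bar", "Restaurant", "Bistro", "Pizza Place", "Bakery",
                  "Italian Restaurant", "Gastropub", "Moroccan Restaurant", "Pub", "Indian Restaurant"],
    "Tentoonstellingswijk": ["Park", "Bakery", "Grocery Store", "Pharmacy", "Supermarket",
                             "Belgian Restaurant", "Restaurant", "Notary", "Friterie", "Bar"],
    "Schipperskwartier": ["Bar", "Restaurant", "Coffee Shop", "Italian Restaurant", "Gay Bar",
                          "Friterie", "French Restaurant", "Asian Restaurant", "Sandwich Place", "Scenic Lookout"],
    "Markgrave": ["Sandwich Place", "Italian Restaurant", "Supermarket", "Belgian Restaurant", "Restaurant",
                  "Thai Restaurant", "Bar", "Breakfast Spot", "French Restaurant", "Friterie"],
    "Historisch Centrum": ["Bar", "Coffee Shop", "Cocktail Bar", "Restaurant", "French Restaurant",
                           "Plaza", "Italian Restaurant", "Belgian Restaurant", "BBQ Joint", "Fish & Chips Shop"],
    "Universiteitswijk": ["Bar", "Coffee Shop", "Sandwich Place", "Asian Restaurant", "Soup Place",
                          "Pub", "Plaza", "Spanish Restaurant", "Salad Place", "French Restaurant"],
    "Den Dam": ["Bar", "Coffee Shop", "Cocktail Bar", "French Restaurant", "Restaurant",
                "Plaza", "Italian Restaurant", "Belgian Restaurant", "Fish & Chips Shop", "Sushi Restaurant"],
    "Brederode": ["Coffee Shop", "Bar", "Pub", "Friterie", "Middle Eastern Restaurant",
                  "Pizza Place", "Café", "Restaurant", "Bus Stop", "Tailor Shop"],
}


def neighborhoods_with(category, top_lists):
    """Return the neighborhoods whose top-venue list contains `category`, sorted by name."""
    return sorted(hood for hood, cats in top_lists.items() if category in cats)


italian = neighborhoods_with("Italian Restaurant", cluster_0_top10)
print(len(cluster_0_top10), "neighborhoods,", len(italian), "with an Italian restaurant in the top 10")
print(italian)
```

In the notebook itself the same check could be run on `antwerpen_merged` directly; the dictionary here only stands in for that dataframe.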

#### Cluster 1, purple

In [83]:
antwerpen_merged.loc[antwerpen_merged['Cluster Labels'] == 1, antwerpen_merged.columns[[1] + list(range(5, antwerpen_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
13,Rozemaai,Sports Club,Pharmacy,Bus Stop,Auto Garage,Light Rail Station,Deli / Bodega,Empanada Restaurant,Flea Market,Fishing Store,Fish & Chips Shop


Conclusion:
    
+ Includes 1 neighborhood ('Rozemaai'). 
+ Business environment: the most popular categories are sports clubs, pharmacies and bus stops; we can consider the cluster 'uptown'. 
+ Competitors: low-competition area. 
+ Accessibility: very good; a light rail station and a wide network of bus stops.

**The cluster doesn't look appropriate.**


#### Cluster 2, blue

In [84]:
antwerpen_merged.loc[antwerpen_merged['Cluster Labels'] == 2, antwerpen_merged.columns[[1] + list(range(5, antwerpen_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Schoonbroek,Soccer Field,Bus Stop,Health & Beauty Service,Sports Club,Bar,Park,Basketball Court,College Classroom,Pool,Design Studio
9,Middelheim,Sports Club,Café,Restaurant,Park,Gastropub,Bus Stop,Sculpture Garden,Zoo Exhibit,Fish & Chips Shop,Fast Food Restaurant
22,Luchtbal,Bus Stop,Electronics Store,Pharmacy,Bar,Bakery,Friterie,Park,Athletics & Sports,Supermarket,Men's Store


Conclusion:
    
+ Includes 3 neighborhoods. 
+ Business environment: the most popular categories are sports clubs, parks, horeca, and health & beauty services. 
+ Competitors: medium-competition area. 
+ Accessibility: very good; wide network of bus stops, foot traffic.

**The cluster looks appropriate. It could suit a family-oriented Italian restaurant: families as the target audience, evening and weekend opening hours, and foot traffic from the sports and park areas.**

#### Cluster 3, green

In [85]:
antwerpen_merged.loc[antwerpen_merged['Cluster Labels'] == 3, antwerpen_merged.columns[[1] + list(range(5, antwerpen_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Eilandje,Nightclub,Sandwich Place,Coffee Shop,Restaurant,Beach Bar,Cocktail Bar,Theater,Bar,Brewery,French Restaurant
4,Theaterbuurt,Boutique,Clothing Store,Coffee Shop,Theater,Gastropub,Bakery,Men's Store,Pub,Plaza,Bar
6,Stadspark,Sandwich Place,Coffee Shop,Hotel,Theater,Bar,Breakfast Spot,Café,Bistro,Electronics Store,Deli / Bodega
7,Sint-Andries,Clothing Store,Boutique,Italian Restaurant,Coffee Shop,Bar,Shoe Store,Cocktail Bar,Spanish Restaurant,Bakery,Burger Joint
10,Meir,Clothing Store,Boutique,Coffee Shop,Cosmetics Shop,Theater,Sandwich Place,Bar,Juice Bar,Furniture / Home Store,Gastropub
15,Linkeroever,Music Venue,Tram Station,Supermarket,Dog Run,Plaza,Bookstore,Salon / Barbershop,Parking,Park,Athletics & Sports
16,Kiel,Clothing Store,Cosmetics Shop,Bakery,Park,Sandwich Place,Brasserie,Pharmacy,Spanish Restaurant,Flower Shop,Event Space
17,Harmonie,Sandwich Place,Restaurant,Supermarket,Coffee Shop,Park,Gym,Friterie,Café,Dog Run,Breakfast Spot
19,Centraal Station,Zoo Exhibit,Asian Restaurant,Italian Restaurant,Thai Restaurant,Coffee Shop,Sandwich Place,Hotel,Bakery,Grocery Store,Supermarket


Conclusion:
    
+ Includes 9 neighborhoods. 
+ Business environment: the most popular categories are horeca, clothing stores and boutiques. 
+ Competitors: high-competition area. 
+ Accessibility: foot traffic.

**The cluster doesn't look appropriate.**

#### Cluster 4, orange

In [86]:
antwerpen_merged.loc[antwerpen_merged['Cluster Labels'] == 4, antwerpen_merged.columns[[1] + list(range(5, antwerpen_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
21,Antwerpen Noord,Train Station,Zoo Exhibit,Donut Shop,Flower Shop,Flea Market,Fishing Store,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Falafel Restaurant


Conclusion:
    
+ Includes 1 neighborhood ('Antwerpen Noord'). 
+ Business environment: Foursquare returned a single venue for this neighborhood, a train station; the other categories in the table have zero frequency, so no reliable profile can be drawn.
+ Competitors: low-competition area, judging from very limited data. 
+ Accessibility: good; train station.

**The cluster doesn't look appropriate.**