# Clustering Cities in Grand Paris

##### The aim of this study is to explore, segment, and cluster the cities around Paris, an area called "Grand Paris", based on their venues.

##### I will load the official list of Grand Paris cities published as open data by the French government, clean it, and read it into a pandas dataframe so that it is in a structured format.

##### Once the data is in a structured format, I will explore and cluster the cities of the Grand Paris area.


## Prerequisites

### Installation of the necessary libraries

In [1]:
!pip install bs4
#!pip install requests

Collecting bs4
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Collecting beautifulsoup4 (from bs4)
  Downloading https://files.pythonhosted.org/packages/d1/41/e6495bd7d3781cee623ce23ea6ac73282a373088fcd0ddc809a047b18eae/beautifulsoup4-4.9.3-py3-none-any.whl (115kB)
Collecting soupsieve>1.2; python_version >= "3.0" (from beautifulsoup4->bs4)
  Downloading https://files.pythonhosted.org/packages/36/69/d82d04022f02733bf9a72bc3b96332d360c0c5307096d76f6bb7489f7e57/soupsieve-2.2.1-py3-none-any.whl
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Stored in directory: /home/jupyterlab/.cache/pip/wheels/a0/b0/b2/4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: soupsieve, beautifulsoup4, bs4
Successfully installed bea

In [2]:
!conda install -c conda-forge folium=0.5.0 --yes

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    altair-4.1.0               |             py_1         614 KB  conda-forge
    attrs-21.2.0               |     pyhd8ed1ab_0          44 KB  conda-forge
    branca-0.4.2               |     pyhd8ed1ab_0          26 KB  conda-forge
    entrypoints-0.3            |  pyhd8ed1ab_1003           8 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    jsonschema-3.2.0           |     pyhd8ed1ab_3          45 KB  conda-forge
    pyrsistent-0.17.3          |   py36h8f

In [5]:
conda install -c conda-forge geopy

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geographiclib-1.52         |     pyhd8ed1ab_0          35 KB  conda-forge
    geopy-2.2.0                |     pyhd8ed1ab_0          67 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         102 KB

The following NEW packages will be INSTALLED:

  geographiclib      conda-forge/noarch::geographiclib-1.52-pyhd8ed1ab_0
  geopy              conda-forge/noarch::geopy-2.2.0-pyhd8ed1ab_0



Downloading and Extracting Packages
geographiclib-1.52   | 35 KB     | ##################################### | 100% 
geopy-2.2.0          | 67 KB     | ########################

### Loading of the libraries

In [6]:
import pandas as pd
import numpy as np
#from bs4 import BeautifulSoup
import requests

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

import json
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas.io.json import json_normalize # transform a JSON file into a pandas dataframe

## Part 1: Data Scraping and Cleaning

##### We will use the list of cities which compose the Grand Paris area provided as open data by the French Government:
##### https://www.data.gouv.fr/fr/datasets/communes-de-la-metropole-du-grand-paris-par-ept/

##### The data can be downloaded as an Excel file and contains:
- the city names ("Libellé géographique").
- city codes ("Code géographique"), which are not postal codes and therefore cannot be used directly by interfaces that expect a postal code.
- the codes "Région", "Département" and "EPT", which are not relevant for this study.

### Import the data into a dataframe and clean it
##### Target:
- import the xlsx data.
- keep only the cities' names (key "Libellé géographique").

In [7]:
df_raw = pd.read_excel('communes-metropole-du-grand-paris-par-ept.xlsx')
df_raw.head()

Unnamed: 0,Code géographique,Région,Département,Libellé géographique,EPT
0,75056,11,75,Paris,Ville de Paris - T1
1,94015,11,94,Bry-sur-Marne,Paris-Est-Marne et Bois - T10
2,94017,11,94,Champigny-sur-Marne,Paris-Est-Marne et Bois - T10
3,94018,11,94,Charenton-le-Pont,Paris-Est-Marne et Bois - T10
4,94033,11,94,Fontenay-sous-Bois,Paris-Est-Marne et Bois - T10


In [8]:
del df_raw['Code géographique']
del df_raw['Région']
del df_raw['Département']
del df_raw['EPT']

df_raw.head()

Unnamed: 0,Libellé géographique
0,Paris
1,Bry-sur-Marne
2,Champigny-sur-Marne
3,Charenton-le-Pont
4,Fontenay-sous-Bois


##### Then I remove the city of Paris, which is outside the scope of the study.

In [41]:
df=pd.DataFrame()
df = df_raw.iloc[1: , :]
df.reset_index(inplace=True)
df.head(10)

Unnamed: 0,index,Libellé géographique
0,1,Bry-sur-Marne
1,2,Champigny-sur-Marne
2,3,Charenton-le-Pont
3,4,Fontenay-sous-Bois
4,5,Joinville-le-Pont
5,6,Maisons-Alfort
6,7,Nogent-sur-Marne
7,8,Le Perreux-sur-Marne
8,9,Saint-Mandé
9,10,Saint-Maur-des-Fossés


## Part 2: Add geospatial data into the dataframe

##### Now that a dataframe of the cities' names has been built, we need the latitude and longitude of each city in order to use the Foursquare location data.

##### For this, we use the Nominatim geocoder from the geopy Python package (https://geopy.readthedocs.io/), with the cities' names as inputs.


##### Define Foursquare Credentials and Version.

In [10]:
CLIENT_ID = 'ERPZV0TPS2CQWWO2KLL4FFCHQJZVEMCOZFZWIE0WJU2K5Z4P' # your Foursquare ID
CLIENT_SECRET = 'WOBJDZNIRL5MCSSWLU43DPTO53LBMBD1VSIVCSDLDEO0T335' # your Foursquare Secret
ACCESS_TOKEN = 'MGDYWNEMINFMG3ELVCQTABQSS03DCGMNB54C4RARIX13MUN1' # your FourSquare Access Token
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: ERPZV0TPS2CQWWO2KLL4FFCHQJZVEMCOZFZWIE0WJU2K5Z4P
CLIENT_SECRET:WOBJDZNIRL5MCSSWLU43DPTO53LBMBD1VSIVCSDLDEO0T335
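
##### As an alternative to writing the credentials in the notebook, they can be read from environment variables. The sketch below is only an illustration: the function name `load_foursquare_credentials` and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical choices, not something Foursquare requires.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping (the process environment by default).

    The variable names are hypothetical; pick whatever naming suits your setup.
    """
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first"
        )
    return client_id, client_secret


# Demonstration with an explicit mapping instead of the real environment
demo_env = {
    "FOURSQUARE_CLIENT_ID": "my-id",
    "FOURSQUARE_CLIENT_SECRET": "my-secret",
}
CLIENT_ID_DEMO, CLIENT_SECRET_DEMO = load_foursquare_credentials(demo_env)
```

##### In the notebook, calling the function without arguments would read the real environment, and nothing secret would need to be printed or saved with the file.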


##### We run a loop across the cities' names, which requests the latitude and longitude of each city and puts them in a new dataframe:

In [11]:
df2 = df.copy()  # work on a real copy to avoid pandas' SettingWithCopyWarning

# create the geocoder once and reuse it for every city
geolocator = Nominatim(user_agent="foursquare_agent")

for i in range(len(df)):
    adr = df.loc[i,'Libellé géographique']
    location = geolocator.geocode(adr)
    df2.loc[i,'Latitude'] = location.latitude
    df2.loc[i,'Longitude'] = location.longitude
    
df2.head()


Unnamed: 0,index,Libellé géographique,Latitude,Longitude
0,1,Bry-sur-Marne,48.835287,2.519332
1,2,Champigny-sur-Marne,48.813776,2.510738
2,3,Charenton-le-Pont,48.819848,2.415951
3,4,Fontenay-sous-Bois,48.849072,2.474935
4,5,Joinville-le-Pont,48.818372,2.466808
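
##### The geocoding loop assumes that every lookup succeeds, but geopy's `geocode` returns `None` when nothing is found. A minimal sketch of a more defensive loop is given below. The helper `geocode_all` and the stub `fake_geocode` are hypothetical names, and the stub's coordinates are made up for the demonstration.

```python
from collections import namedtuple


def geocode_all(names, geocode):
    """Return ({name: (lat, lon)}, [names that could not be resolved]).

    `geocode` is any callable taking a query string and returning an object
    with `.latitude` and `.longitude`, or None when nothing is found.
    """
    coords, missing = {}, []
    for name in names:
        location = geocode(name)
        if location is None:
            missing.append(name)  # keep track instead of crashing
        else:
            coords[name] = (location.latitude, location.longitude)
    return coords, missing


# Stub geocoder with made-up coordinates, so the example runs offline
Location = namedtuple("Location", ["latitude", "longitude"])


def fake_geocode(query):
    known = {"Vincennes": Location(48.85, 2.44)}
    return known.get(query)


coords, missing = geocode_all(["Vincennes", "Nowhere-sur-Rien"], fake_geocode)
print(coords)
print(missing)
```

##### With geopy, the callable would be `Nominatim(user_agent=...).geocode`, possibly wrapped in `geopy.extra.rate_limiter.RateLimiter` to space out the requests.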


## Part 3: Neighborhoods Clustering

In [12]:
df3=df2


### List venues in each city with the Foursquare API

##### I now want to create a list of all venues in the vicinity of each of the above cities. For this, I use the Foursquare API, which can give me the list of all venues in a given radius of a geographic point defined by its latitude and longitude.

##### I create a function to list all venues in a defined radius of defined geospatial coordinates:

In [15]:
def getNearbyVenues(names, latitudes, longitudes, radius=2000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
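
##### The function above reads `v['venue']['categories'][0]`, which would fail for a venue returned without any category. The sketch below shows one way to build the same 7-field row defensively. The helper `venue_row` is a hypothetical name and `sample_items` is a hand-written sample that only mimics the fields used above; it is not real API output.

```python
def venue_row(city, lat, lng, item):
    """Build one (city, lat, lng, venue, venue lat, venue lng, category) tuple."""
    venue = item["venue"]
    categories = venue.get("categories") or []
    # fall back to None when the venue has no category at all
    category = categories[0]["name"] if categories else None
    return (
        city,
        lat,
        lng,
        venue["name"],
        venue["location"]["lat"],
        venue["location"]["lng"],
        category,
    )


# Hand-written sample items (made-up values, same shape as used in the notebook)
sample_items = [
    {"venue": {"name": "Café A",
               "location": {"lat": 48.84, "lng": 2.52},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Unnamed spot",
               "location": {"lat": 48.83, "lng": 2.51},
               "categories": []}},
]

rows = [venue_row("Bry-sur-Marne", 48.835287, 2.519332, it) for it in sample_items]
for row in rows:
    print(row)
```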

##### Now I run this function on each city of the Grand Paris area and I create a new dataframe called GrandParis_venues.

In [16]:
GrandParis_venues = getNearbyVenues(names=df3['Libellé géographique'],
                                   latitudes=df3['Latitude'],
                                   longitudes=df3['Longitude']
                                  )

Bry-sur-Marne
Champigny-sur-Marne
Charenton-le-Pont
Fontenay-sous-Bois
Joinville-le-Pont
Maisons-Alfort
Nogent-sur-Marne
Le Perreux-sur-Marne
Saint-Mandé
Saint-Maur-des-Fossés
Saint-Maurice
Villiers-sur-Marne
Vincennes
Alfortville
Boissy-Saint-Léger
Bonneuil-sur-Marne
Chennevières-sur-Marne
Créteil
Limeil-Brévannes
Mandres-les-Roses
Marolles-en-Brie
Noiseau
Ormesson-sur-Marne
Périgny
Le Plessis-Trévise
La Queue-en-Brie
Santeny
Sucy-en-Brie
Villecresnes
Athis-Mons
Juvisy-sur-Orge
Morangis
Paray-Vieille-Poste
Savigny-sur-Orge
Viry-Châtillon
Ablon-sur-Seine
Arcueil
Cachan
Chevilly-Larue
Choisy-le-Roi
Fresnes
Gentilly
L'Haÿ-les-Roses
Ivry-sur-Seine
Le Kremlin-Bicêtre
Orly
Rungis
Thiais
Valenton
Villejuif
Villeneuve-le-Roi
Villeneuve-Saint-Georges
Vitry-sur-Seine
Antony
Bagneux
Bourg-la-Reine
Châtenay-Malabry
Châtillon
Clamart
Fontenay-aux-Roses
Malakoff
Montrouge
Le Plessis-Robinson
Sceaux
Boulogne-Billancourt
Chaville
Issy-les-Moulineaux
Marnes-la-Coquette
Meudon
Sèvres
Vanves
Ville-d'Avr

##### Let's check the size of the resulting dataframe

In [19]:
print(GrandParis_venues.shape)
GrandParis_venues.to_csv('GrandParis_venues.csv')
#print('A total of {} venues were retrieved by the Foursquare API in the Grand Paris area'.format(len(GrandParis_venues)))

(5482, 7)


In [21]:
GrandParis_venues.head()

Unnamed: 0,City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Bry-sur-Marne,48.835287,2.519332,Keftedes & Tzatziki,48.835981,2.513415,Greek Restaurant
1,Bry-sur-Marne,48.835287,2.519332,Studios de Bry,48.835987,2.534052,Film Studio
2,Bry-sur-Marne,48.835287,2.519332,Les Délices de Fred,48.835291,2.498486,Bakery
3,Bry-sur-Marne,48.835287,2.519332,IKEA,48.827993,2.530133,Furniture / Home Store
4,Bry-sur-Marne,48.835287,2.519332,Quai Est,48.834266,2.516968,French Restaurant


### Group venues by super-categories

##### At this point, the Foursquare API provides too many venue categories, which in my opinion makes a good clustering difficult for this study. It would, for example, be much more relevant to group all restaurants, whatever their speciality, in a single "Food" category.

##### Foursquare gives us this possibility since super-categories are already defined: https://developer.foursquare.com/docs/build-with-foursquare/categories/

##### Let's create a function to make a table listing the venue categories and the super-categories they belong to, and save this table in a dataframe:

##### Note: only the level 0 super-categories are kept, because I am only interested in the highest categories.

In [22]:
import requests
import pandas as pd

# API parameters (the credentials are reused from the cell that defines them above)
CATEGORIES_API = 'https://api.foursquare.com/v2/venues/categories'
PARAMS = {
    'client_id': CLIENT_ID,
    'client_secret': CLIENT_SECRET,
    'v': VERSION
}

# a dictionary to store subcategory (key) and all of its possible parents
SUBCATEGORIES = {}


def subcategorize(cat, prev):
    if cat.get('categories', False):
        
        lvl = len(prev)-1
        for subcat in cat['categories']:
            
            child = subcat['name']
            subcategorize(subcat, prev+[child])
            if child not in SUBCATEGORIES:
                SUBCATEGORIES[child] = [(prev[0], 0)]
                
            for i in range(1, lvl+1):
                SUBCATEGORIES[child].append((prev[i], i))

#fetch categories from api
response = requests.get(CATEGORIES_API, params=PARAMS).json()

#subcategorize each category
for cat in response['response']['categories']:
    name = cat['name']
    subcategorize(cat, [name])

#populate a dataframe from SUBCATEGORIES dictionary
dfCategories_raw = pd.DataFrame(
    [(k, sub, lvl) for k, v in SUBCATEGORIES.items() for sub, lvl in v],
    columns=['venue', 'venue_category', 'level'])

# keep only the level 0 (top-level) super-categories
dfCategories = dfCategories_raw[dfCategories_raw['level']==0].reset_index()
        
#dfCategories.to_csv('venue_subcategories.csv', index=False)
dfCategories.head(20)

Unnamed: 0,index,venue,venue_category,level
0,0,Amphitheater,Arts & Entertainment,0
1,1,Aquarium,Arts & Entertainment,0
2,2,Arcade,Arts & Entertainment,0
3,3,Art Gallery,Arts & Entertainment,0
4,4,Bowling Alley,Arts & Entertainment,0
5,5,Casino,Arts & Entertainment,0
6,6,Circus,Arts & Entertainment,0
7,7,Comedy Club,Arts & Entertainment,0
8,8,Concert Hall,Arts & Entertainment,0
9,9,Country Dance Club,Arts & Entertainment,0
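
##### To make the idea of a "level 0 super-category" concrete, here is a small sketch that walks a nested category tree and maps every sub-category to its top-level ancestor. The function `top_level_parents` is a hypothetical name and `toy_tree` is a tiny hand-written tree with the same `name` / `categories` fields as used above; it is not the real Foursquare hierarchy.

```python
def top_level_parents(categories):
    """Map each sub-category name to the name of its top-level ancestor."""
    mapping = {}

    def walk(node, root_name):
        for child in node.get("categories", []):
            # keep the first top-level parent found for a given name
            mapping.setdefault(child["name"], root_name)
            walk(child, root_name)

    for root in categories:
        walk(root, root["name"])
    return mapping


# Hand-written toy tree (illustrative only)
toy_tree = [
    {"name": "Food", "categories": [
        {"name": "Bakery", "categories": []},
        {"name": "Asian Restaurant", "categories": [
            {"name": "Sushi Restaurant", "categories": []},
        ]},
    ]},
    {"name": "Shop & Service", "categories": [
        {"name": "Furniture / Home Store", "categories": []},
    ]},
]

parents = top_level_parents(toy_tree)
print(parents)
```

##### Note that the top-level names themselves ("Food", "Shop & Service") are not keys of the mapping, only their descendants are.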


##### Now I want to add these super-categories to my dataframe containing the list of venues:

In [23]:
GrandParis_venues['Venue Group']=0

for i in range(len(GrandParis_venues)):
    for j in range(len(dfCategories)):
        if GrandParis_venues.loc[i,'Venue Category']==dfCategories.loc[j,'venue']:
            GrandParis_venues.loc[i,'Venue Group']= dfCategories.loc[j,'venue_category']
            

GrandParis_venues.to_csv('GrandParis_venues.csv', index=False)
GrandParis_venues.head()

Unnamed: 0,City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Venue Group
0,Bry-sur-Marne,48.835287,2.519332,Keftedes & Tzatziki,48.835981,2.513415,Greek Restaurant,Food
1,Bry-sur-Marne,48.835287,2.519332,Studios de Bry,48.835987,2.534052,Film Studio,Shop & Service
2,Bry-sur-Marne,48.835287,2.519332,Les Délices de Fred,48.835291,2.498486,Bakery,Food
3,Bry-sur-Marne,48.835287,2.519332,IKEA,48.827993,2.530133,Furniture / Home Store,Shop & Service
4,Bry-sur-Marne,48.835287,2.519332,Quai Est,48.834266,2.516968,French Restaurant,Food
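
##### The nested loop above compares every venue with every category. A possible alternative, sketched here on a toy dataframe, is a dictionary lookup with `Series.map`. The frames `venues` and `categories` are small stand-ins with the same column names as in the notebook, and unmatched rows end up as `NaN` instead of `0`.

```python
import pandas as pd

# Toy stand-ins for GrandParis_venues and dfCategories
venues = pd.DataFrame(
    {"Venue Category": ["Bakery", "Film Studio", "Unknown Thing"]}
)
categories = pd.DataFrame(
    {"venue": ["Bakery", "Film Studio"],
     "venue_category": ["Food", "Shop & Service"]}
)

# Build a category -> super-category lookup and apply it in one pass
lookup = dict(zip(categories["venue"], categories["venue_category"]))
venues["Venue Group"] = venues["Venue Category"].map(lookup)

# Rows without a matching super-category are left as NaN
unmatched = venues.index[venues["Venue Group"].isna()].tolist()
print(venues)
print(unmatched)
```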


##### I check whether any row was left without a corresponding venue group

In [24]:
for i in range(len(GrandParis_venues)):
    if GrandParis_venues.loc[i,'Venue Group']==0:
        print(i)

1630


##### There are 2 of them, which I fill in manually:

In [27]:
GrandParis_venues.loc[1630,'Venue Group']='Shop & Service'
GrandParis_venues.loc[2639,'Venue Group']='Shop & Service'

##### Let's list these super-categories / groups:

In [28]:
print(pd.unique(GrandParis_venues['Venue Group']))

['Food' 'Shop & Service' 'Outdoors & Recreation' 'Travel & Transport'
 'Arts & Entertainment' 'Nightlife Spot' 'Professional & Other Places'
 'Residence']


In [29]:
print('There are {} unique super-categories/groups.'.format(len(GrandParis_venues['Venue Group'].unique())))

There are 8 unique super-categories/groups.


### Cities clustering

##### To meet the study target, which is to determine which cities best fit different groups of inhabitants, I want to cluster the cities using a criterion based on these super-categories of venues.

##### We run _k_-means to group the cities into 4 clusters.

In [30]:
# one hot encoding
GrandParis_onehot = pd.get_dummies(GrandParis_venues[['Venue Group']], prefix="", prefix_sep="")

# add city column back to dataframe
GrandParis_onehot['City'] = GrandParis_venues['City'] 

# move city column to the first column
fixed_columns = [GrandParis_onehot.columns[-1]] + list(GrandParis_onehot.columns[:-1])
GrandParis_onehot = GrandParis_onehot[fixed_columns]

GrandParis_onehot.head()

Unnamed: 0,City,Arts & Entertainment,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Residence,Shop & Service,Travel & Transport
0,Bry-sur-Marne,0,1,0,0,0,0,0,0
1,Bry-sur-Marne,0,0,0,0,0,0,1,0
2,Bry-sur-Marne,0,1,0,0,0,0,0,0
3,Bry-sur-Marne,0,0,0,0,0,0,1,0
4,Bry-sur-Marne,0,1,0,0,0,0,0,0


##### Next, let's group rows by city, taking the mean of the frequency of occurrence of each super-category:


In [32]:
GrandParis_grouped = GrandParis_onehot.groupby('City').mean().reset_index()
GrandParis_grouped.head(20)

Unnamed: 0,City,Arts & Entertainment,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Residence,Shop & Service,Travel & Transport
0,Ablon-sur-Seine,0.0,0.4,0.0,0.0,0.0,0.0,0.333333,0.266667
1,Alfortville,0.047619,0.333333,0.047619,0.214286,0.0,0.0,0.285714,0.071429
2,Antony,0.033333,0.4,0.033333,0.2,0.0,0.0,0.266667,0.066667
3,Arcueil,0.06,0.49,0.02,0.16,0.01,0.0,0.16,0.1
4,Argenteuil,0.066667,0.466667,0.0,0.0,0.0,0.0,0.2,0.266667
5,Asnières-sur-Seine,0.04,0.61,0.02,0.12,0.0,0.0,0.16,0.05
6,Athis-Mons,0.0,0.230769,0.0,0.0,0.076923,0.0,0.230769,0.461538
7,Aubervilliers,0.161765,0.367647,0.014706,0.102941,0.0,0.0,0.235294,0.117647
8,Aulnay-sous-Bois,0.058824,0.176471,0.0,0.058824,0.058824,0.0,0.411765,0.235294
9,Bagnolet,0.12,0.49,0.11,0.13,0.02,0.0,0.12,0.01
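
##### The mean of a one-hot column is simply the share of venues of that group in the city, so each row of the table sums to 1. A tiny worked example, using a made-up dataframe `toy` with two fictional cities "A" and "B":

```python
import pandas as pd

# Made-up data: city A has 2 food venues and 1 shop, city B has 1 food venue
toy = pd.DataFrame({
    "City": ["A", "A", "A", "B"],
    "Venue Group": ["Food", "Food", "Shop & Service", "Food"],
})

# One-hot encode the group, then average per city
onehot = pd.get_dummies(toy["Venue Group"], dtype=float)
onehot.insert(0, "City", toy["City"])
shares = onehot.groupby("City").mean()

print(shares)
print(shares.sum(axis=1))  # every row sums to 1
```

##### City A gets 2/3 for "Food" and 1/3 for "Shop & Service", while city B gets 1 for "Food" even though it has a single venue.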


##### I run the k-means with 4 clusters:

In [33]:
# set number of clusters
kclusters = 4

GrandParis_grouped_clustering = GrandParis_grouped.drop(columns='City')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(GrandParis_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 


array([1, 1, 1, 0, 0, 0, 3, 1, 2, 0], dtype=int32)
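
##### The number of clusters is fixed at 4 here. One common way to sanity-check such a choice, which was not part of this study, is to compare silhouette scores for several values of k. The sketch below uses synthetic data (random category shares) as a stand-in for `GrandParis_grouped_clustering`, so its numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in: 60 "cities" x 8 category shares, each row summing to 1
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(8), size=60)

# Fit k-means for several k and record the silhouette score of each fit
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 3))
print("k with the highest silhouette score:", best_k)
```

##### On the real data, `X` would be replaced by the dataframe of category shares without the city column.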

In [34]:
# add clustering labels
GrandParis_grouped.insert(0, 'Cluster Labels', kmeans.labels_)
GrandParis_grouped.to_csv("GrandParis_clustered.csv")

In [35]:
GrandParis_grouped.head() # the cluster labels are in the first column

Unnamed: 0,Cluster Labels,City,Arts & Entertainment,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Residence,Shop & Service,Travel & Transport
0,1,Ablon-sur-Seine,0.0,0.4,0.0,0.0,0.0,0.0,0.333333,0.266667
1,1,Alfortville,0.047619,0.333333,0.047619,0.214286,0.0,0.0,0.285714,0.071429
2,1,Antony,0.033333,0.4,0.033333,0.2,0.0,0.0,0.266667,0.066667
3,0,Arcueil,0.06,0.49,0.02,0.16,0.01,0.0,0.16,0.1
4,0,Argenteuil,0.066667,0.466667,0.0,0.0,0.0,0.0,0.2,0.266667


##### If I analyze the per-cluster means shown in the table that follows, I notice that:
- Cluster 0 has the highest share of food, nightlife, and arts and entertainment venues, so I would recommend it to students and young workers.
- Cluster 1 has the highest share of residences and of outdoor and recreational areas, so I would recommend it to nature lovers and possibly retired people.
- Cluster 2 has the highest share of professional places, shops and services, so I would recommend it for setting up companies and businesses.
- Cluster 3 has by far the highest share of travel and transport venues, so I would recommend it to frequent travellers.



In [36]:
GrandParis_grouped.groupby('Cluster Labels').mean(numeric_only=True)

Cluster Labels,Arts & Entertainment,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Residence,Shop & Service,Travel & Transport
0,0.055275,0.538116,0.032077,0.11214,0.005029,0.000393,0.156135,0.100835
1,0.048,0.336544,0.01414,0.1746,0.002478,0.000565,0.272524,0.151149
2,0.026816,0.220651,0.002156,0.10152,0.016951,0.0,0.53724,0.094666
3,0.021296,0.186731,0.0,0.027437,0.008547,0.0,0.274882,0.481107


In [37]:
# preview the dataframe with its cluster labels
GrandParis_grouped.head()

Unnamed: 0,Cluster Labels,City,Arts & Entertainment,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Residence,Shop & Service,Travel & Transport
0,1,Ablon-sur-Seine,0.0,0.4,0.0,0.0,0.0,0.0,0.333333,0.266667
1,1,Alfortville,0.047619,0.333333,0.047619,0.214286,0.0,0.0,0.285714,0.071429
2,1,Antony,0.033333,0.4,0.033333,0.2,0.0,0.0,0.266667,0.066667
3,0,Arcueil,0.06,0.49,0.02,0.16,0.01,0.0,0.16,0.1
4,0,Argenteuil,0.066667,0.466667,0.0,0.0,0.0,0.0,0.2,0.266667


In [38]:
GrandParis_grouped_clustered = GrandParis_grouped.reset_index()

for i in range(len(GrandParis_grouped)):
    for j in range(0, len(df3)):
        if GrandParis_grouped_clustered.loc[i,"City"] == df3.loc[j,"Libellé géographique"]:
            GrandParis_grouped_clustered.loc[i,"Latitude"] = df3.loc[j,"Latitude"]
            GrandParis_grouped_clustered.loc[i,"Longitude"] = df3.loc[j,"Longitude"]
GrandParis_grouped_clustered.head()

Unnamed: 0,index,Cluster Labels,City,Arts & Entertainment,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Residence,Shop & Service,Travel & Transport,Latitude,Longitude
0,0,1,Ablon-sur-Seine,0.0,0.4,0.0,0.0,0.0,0.0,0.333333,0.266667,48.724758,2.421509
1,1,1,Alfortville,0.047619,0.333333,0.047619,0.214286,0.0,0.0,0.285714,0.071429,48.805162,2.419711
2,2,1,Antony,0.033333,0.4,0.033333,0.2,0.0,0.0,0.266667,0.066667,48.753554,2.295942
3,3,0,Arcueil,0.06,0.49,0.02,0.16,0.01,0.0,0.16,0.1,48.8065,2.33665
4,4,0,Argenteuil,0.066667,0.466667,0.0,0.0,0.0,0.0,0.2,0.266667,48.947907,2.24818


##### And finally I plot the different clusters on a map of the Paris outskirts.

In [39]:
# create map
map_clusters = folium.Map(location=[48.8, 2.3], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(GrandParis_grouped_clustered['Latitude'], GrandParis_grouped_clustered['Longitude'], GrandParis_grouped_clustered['City'], GrandParis_grouped_clustered['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=10,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

##### I find some logic by comparing the map and the clusters:
- the first ring around Paris is mainly made up of cluster 0 (red dots).
- the outer area around Paris is cluster 1 (purple dots), recommended for outdoor activities and possibly retired people. This is not a surprise since it is the least urbanized area.
- cluster 3 (green dots), recommended for frequent travellers, is composed of cities located close to the airports.


##### Limitations of this study and potential improvements:
- the clustering is based on the relative distribution of venue categories in each city, not on their density. This means, for example, that a city whose only venue is a restaurant will fall into the cluster of cities where food venues are in the majority, even though its density of food venues is low. Solving this issue requires a different encoding.
- the Foursquare datasets are sometimes not comprehensive enough:
    - In the Paris area, no venue of the category "College and Universities" is listed, which makes it impossible, for example, to define a cluster recommended for families.
    - The category "Residence" includes very few types of venues.
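
##### As an illustration of the first limitation, the sketch below compares shares with raw counts on a made-up dataframe `toy`. It is one possible starting point for a different encoding, not the method used in this study. A true density would also need each city's area or population, which is not in the dataset, and the counts are capped by the `LIMIT` of 100 venues per request.

```python
import pandas as pd

# Made-up data: city A has a single food venue, city B has 3 food venues and 1 shop
toy = pd.DataFrame({
    "City": ["A", "B", "B", "B", "B"],
    "Venue Group": ["Food", "Food", "Food", "Food", "Shop & Service"],
})

# Raw number of venues per city and group
counts = pd.crosstab(toy["City"], toy["Venue Group"])

# Relative distribution per city (what the clustering above is based on)
shares = pd.crosstab(toy["City"], toy["Venue Group"], normalize="index")

print(counts)
print(shares)
```

##### City A looks like the most food-oriented city in terms of shares (1.0 against 0.75), while the counts show that city B has three times as many food venues.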