<h1 align="center">Clustering the Neighbourhoods of French Cities</h1>

# 1. Introduction

The Covid-19 crisis has been shaking the world for more than six months. 
Faced with the health emergency, many governments, notably in Europe, have chosen to put public health ahead of economic life by confining their populations and closing their borders. 
One consequence of these policies has been a radical change in the daily lives of citizens around the world, especially in **cities**. 

# 2. Business Problem

The aim of this project is to identify what characterizes French cities in terms of their neighbourhoods and venues.   
This breakdown helps us understand, in part, how the French economy is specialized.  
It could help to identify the strengths, but also the weaknesses, of the French economy when crises such as Covid-19 hit the world. 
Are French cities highly dependent on the tourism sector? What about the commercial sector, with the festive season approaching? 
The conclusions of this project are aimed at interested citizens, but also at stakeholders (city halls, the government, etc.), to help them identify the sectors to protect in their cities. 


# 3. Data Description
As described in Section 4 (Methodology), we first retrieve as much data as possible on French cities. 
We need the postal codes, the names of the cities and their neighbourhoods (if they have any). 

## 3.1 Rank Cities in France

To carry out the analysis in Section 5, we restrict the discussion of our clustering to the top 10 French cities by population. 

To do that, we scrape our data from https://en.wikipedia.org/wiki/List_of_communes_in_France_with_over_20,000_inhabitants

This Wikipedia page lists the large communes of France and ranks them by number of inhabitants. Its main table has the following columns:

1. *Commune* : Name of Commune
2. *Department* : Name of Department
3. *Region* : Name of Region
4. *Population, 2013* : Population in 2013
5. *Population, 2017* : Population in 2017
6. *Rank* : Rank based on the 2017 population


## 3.2 Get French cities and their neighbourhoods

We use JSON data available at https://www.data.gouv.fr/fr/datasets/r/34d4364c-22eb-4ac0-b179-7a1845ac033a

1. *codePostal* : Postal codes for France
2. *codeCommune* : Code for Commune in France
3. *nomCommune* : Name of the commune, or of the borough (arrondissement) for big cities
4. *libelleAcheminement* : Name of the city in upper case (mail routing label)


## 3.3 Foursquare API Data

To meet the need identified above, we need the venues of French cities. 
The Foursquare API provides this information. It requires GPS coordinates (obtained by geocoding) to work. 

The API provides the following information (the notebook below keeps every field except the venue coordinates):

1. *Neighbourhood* : Name of the Neighbourhood
2. *Neighbourhood Latitude* : Latitude of the Neighbourhood
3. *Neighbourhood Longitude* : Longitude of the Neighbourhood
4. *Venue* : Name of the Venue
5. *Venue Latitude* : Latitude of Venue
6. *Venue Longitude* : Longitude of Venue
7. *Venue Category* : Category of Venue

# 4. Methodology

As a first step, we collect as much data as possible on French cities, so that the clustering covers the largest possible number of cities.   
The clustering is based on the venue categories returned by the Foursquare API.
The result is a complete, ready-to-use clustering for further analysis, which could feed, for example, a report for the French government.   
For this project, we then filter the results to the top 10 cities in France and present our conclusions. 

Along the way, we build maps of the French cities and then of the clusters. 


## 4.0 Load packages

In [2]:
import pandas as pd
import requests
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim 
import numpy as np
import folium
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

## 4.1 Importing Data

### 4.1.1 Top from wikipedia

In [3]:
url = "https://en.wikipedia.org/wiki/List_of_communes_in_France_with_over_20,000_inhabitants"
wiki_url = requests.get(url)
wiki_url

<Response [200]>

A response code of 200 means that the request to the page succeeded.

In [4]:
wiki_data = pd.read_html(wiki_url.text)
wiki_data

[                   Commune          Department                      Region  \
 0                    Paris               Paris               Île-de-France   
 1                Marseille    Bouches-du-Rhône  Provence-Alpes-Côte d'Azur   
 2                     Lyon     Lyon Metropolis        Auvergne-Rhône-Alpes   
 3                 Toulouse       Haute-Garonne                   Occitanie   
 4                     Nice     Alpes-Maritimes  Provence-Alpes-Côte d'Azur   
 ..                     ...                 ...                         ...   
 267      Charenton-le-Pont        Val-de-Marne               Île-de-France   
 268  Pierrefitte-sur-Seine   Seine-Saint-Denis               Île-de-France   
 269                 Chatou            Yvelines               Île-de-France   
 270       Rillieux-la-Pape     Lyon Metropolis        Auvergne-Rhône-Alpes   
 271    Vandœuvre-lès-Nancy  Meurthe-et-Moselle                   Grand Est   
 
      Population, 2013  Population, 2017  Rank  
 

In [5]:
len(wiki_data), type(wiki_data)

(11, list)

We only need the first table, so we discard the others.

In [6]:
wiki_data = wiki_data[0]
wiki_data

Unnamed: 0,Commune,Department,Region,"Population, 2013","Population, 2017",Rank
0,Paris,Paris,Île-de-France,2420069,2187526,1
1,Marseille,Bouches-du-Rhône,Provence-Alpes-Côte d'Azur,855393,863310,2
2,Lyon,Lyon Metropolis,Auvergne-Rhône-Alpes,500715,516092,3
3,Toulouse,Haute-Garonne,Occitanie,458298,479553,4
4,Nice,Alpes-Maritimes,Provence-Alpes-Côte d'Azur,342295,340017,5
...,...,...,...,...,...,...
267,Charenton-le-Pont,Val-de-Marne,Île-de-France,30408,30374,268
268,Pierrefitte-sur-Seine,Seine-Saint-Denis,Île-de-France,28459,30306,269
269,Chatou,Yvelines,Île-de-France,30809,30253,270
270,Rillieux-la-Pape,Lyon Metropolis,Auvergne-Rhône-Alpes,30645,30012,271


In [8]:
wiki_data['Commune'] = wiki_data['Commune'].str.upper() 
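
Upper-casing alone may not be enough for the join in Section 4.2: judging from the samples printed further down, `libelleAcheminement` carries no accents and often uses spaces where the commune name has hyphens (for example `SAINTE FOY LES LYON`). The following is a minimal sketch of a normalization step that could be applied to both columns before joining. The helper `normalize_name` is hypothetical and not part of the original notebook, and it assumes that stripping accents, expanding the `œ` ligature and replacing hyphens with spaces is sufficient.

```python
import unicodedata


def normalize_name(name):
    """Hypothetical helper: upper-case a commune name without accents or hyphens."""
    # The oe ligature has no Unicode decomposition, so expand it by hand.
    name = name.replace("œ", "oe").replace("Œ", "OE")
    # Split accented letters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Hyphens become spaces; repeated whitespace is collapsed.
    return " ".join(no_accents.replace("-", " ").upper().split())


print(normalize_name("Sainte-Foy-lès-Lyon"))
print(normalize_name("Vandœuvre-lès-Nancy"))
```

Applying the same function to `wiki_data['Commune']` and to `france_raw['libelleAcheminement']` would make the two keys comparable; whether it recovers every commune would need to be checked against the real data.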

### 4.1.2 Collect French cities data

To collect data on cities, we download the JSON file containing all French postal codes from https://www.data.gouv.fr/fr/datasets/r/34d4364c-22eb-4ac0-b179-7a1845ac033a

We load the JSON file into a pandas DataFrame:

In [9]:
#set the file location as URL or filepath of the json file
f_data_url="https://www.data.gouv.fr/fr/datasets/r/34d4364c-22eb-4ac0-b179-7a1845ac033a"
#load the json data from the file to a pandas dataframe
france_raw = pd.read_json(f_data_url)

In [10]:
france_raw.head()

Unnamed: 0,codePostal,codeCommune,nomCommune,libelleAcheminement
0,10200,10002,Ailleville,AILLEVILLE
1,10160,10003,Aix-Villemaur-Pâlis,AIX-VILLEMAUR-PALIS
2,10190,10003,Aix-Villemaur-Pâlis,AIX-VILLEMAUR-PALIS
3,10700,10004,Allibaudières,ALLIBAUDIERES
4,10140,10005,Amance,AMANCE


In [11]:
france_raw[france_raw['nomCommune'].str.contains('Lyon')]

Unnamed: 0,codePostal,codeCommune,nomCommune,libelleAcheminement
6400,27480,27048,Beauficel-en-Lyons,BEAUFICEL EN LYONS
6685,27480,27377,Lyons-la-Forêt,LYONS LA FORET
12472,42140,42059,Chazelles-sur-Lyon,CHAZELLES SUR LYON
24158,69110,69202,Sainte-Foy-lès-Lyon,SAINTE FOY LES LYON
24244,69001,69381,Lyon 1er Arrondissement,LYON
24245,69002,69382,Lyon 2e Arrondissement,LYON
24246,69003,69383,Lyon 3e Arrondissement,LYON
24247,69004,69384,Lyon 4e Arrondissement,LYON
24248,69005,69385,Lyon 5e Arrondissement,LYON
24249,69006,69386,Lyon 6e Arrondissement,LYON


## 4.2 Data Processing

In [12]:
wiki_data.head()

Unnamed: 0,Commune,Department,Region,"Population, 2013","Population, 2017",Rank
0,PARIS,Paris,Île-de-France,2420069,2187526,1
1,MARSEILLE,Bouches-du-Rhône,Provence-Alpes-Côte d'Azur,855393,863310,2
2,LYON,Lyon Metropolis,Auvergne-Rhône-Alpes,500715,516092,3
3,TOULOUSE,Haute-Garonne,Occitanie,458298,479553,4
4,NICE,Alpes-Maritimes,Provence-Alpes-Côte d'Azur,342295,340017,5


In [13]:
france_raw.head()

Unnamed: 0,codePostal,codeCommune,nomCommune,libelleAcheminement
0,10200,10002,Ailleville,AILLEVILLE
1,10160,10003,Aix-Villemaur-Pâlis,AIX-VILLEMAUR-PALIS
2,10190,10003,Aix-Villemaur-Pâlis,AIX-VILLEMAUR-PALIS
3,10700,10004,Allibaudières,ALLIBAUDIERES
4,10140,10005,Amance,AMANCE


We perform an inner join of the two tables on the city name, so that each commune or borough carries the population and rank of its city.

In [14]:
combined_data = france_raw.join(wiki_data.set_index('Commune'), on='libelleAcheminement', how='inner')

In [15]:
combined_data.head()

Unnamed: 0,codePostal,codeCommune,nomCommune,libelleAcheminement,Department,Region,"Population, 2013","Population, 2017",Rank
376,10000,10387,Troyes,TROYES,Aube,Grand Est,59671,61652,90
499,11000,11069,Carcassonne,CARCASSONNE,Aude,Occitanie,46724,46031,148
690,11100,11262,Narbonne,NARBONNE,Aude,Occitanie,52082,54700,109
858,11150,11434,Villepinte,VILLEPINTE,Seine-Saint-Denis,Île-de-France,35329,36830,207
31738,93420,93078,Villepinte,VILLEPINTE,Seine-Saint-Denis,Île-de-France,35329,36830,207
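
Joining on the name alone can attach a ranking to a different commune that happens to share the name: in the output above, the Villepinte with postal code 11150 appears with the department and population of the Villepinte in Seine-Saint-Denis. The sketch below reproduces this on toy rows copied from the outputs and flags the ambiguous names. It assumes that the first two digits of `codeCommune` identify the department, which holds for most of metropolitan France but not for Corsica or the overseas departments.

```python
import pandas as pd

# Toy rows copied from the outputs above (not the full datasets).
france_toy = pd.DataFrame({
    "codePostal": [10000, 11150, 93420],
    "codeCommune": [10387, 11434, 93078],
    "nomCommune": ["Troyes", "Villepinte", "Villepinte"],
    "libelleAcheminement": ["TROYES", "VILLEPINTE", "VILLEPINTE"],
})
wiki_toy = pd.DataFrame({
    "Commune": ["TROYES", "VILLEPINTE"],
    "Department": ["Aube", "Seine-Saint-Denis"],
    "Rank": [90, 207],
})

# Same join as in the notebook: the key is the upper-case city name only.
joined = france_toy.join(wiki_toy.set_index("Commune"),
                         on="libelleAcheminement", how="inner")

# Assumption: the first two digits of codeCommune give the department.
joined["dept_prefix"] = joined["codeCommune"].astype(str).str.zfill(5).str[:2]

# A name matched in more than one department is a candidate false match.
n_depts = joined.groupby("libelleAcheminement")["dept_prefix"].transform("nunique")
ambiguous = joined[n_depts > 1]
print(ambiguous[["codePostal", "nomCommune", "Department", "dept_prefix"]])
```

Rows flagged this way could then be reviewed by hand or filtered by comparing the department prefix with the department given by Wikipedia.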


Sort by rank

In [16]:
combined_data.sort_values("Rank")

Unnamed: 0,codePostal,codeCommune,nomCommune,libelleAcheminement,Department,Region,"Population, 2013","Population, 2017",Rank
26300,75009,75109,Paris 9e Arrondissement,PARIS,Paris,Île-de-France,2420069,2187526,1
26301,75010,75110,Paris 10e Arrondissement,PARIS,Paris,Île-de-France,2420069,2187526,1
26299,75008,75108,Paris 8e Arrondissement,PARIS,Paris,Île-de-France,2420069,2187526,1
26296,75005,75105,Paris 5e Arrondissement,PARIS,Paris,Île-de-France,2420069,2187526,1
26297,75006,75106,Paris 6e Arrondissement,PARIS,Paris,Île-de-France,2420069,2187526,1
...,...,...,...,...,...,...,...,...,...
31681,92240,92046,Malakoff,MALAKOFF,Hauts-de-Seine,Île-de-France,30304,30720,266
7126,28410,28185,Goussainville,GOUSSAINVILLE,Val-d'Oise,Île-de-France,31212,30637,267
31870,95190,95280,Goussainville,GOUSSAINVILLE,Val-d'Oise,Île-de-France,31212,30637,267
27599,78400,78146,Chatou,CHATOU,Yvelines,Île-de-France,30809,30253,270


Keep only the columns of interest

In [17]:
combined_data = combined_data[['codePostal','nomCommune','libelleAcheminement','Population, 2017','Rank']]

## 4.3 Feature Engineering

To use the Foursquare API, we need the coordinates of the cities and neighbourhoods. We use the geopy library to geocode them.

Let's test it with Paris 9e Arrondissement

In [18]:
address = 'Paris 9e Arrondissement'

geolocator = Nominatim(user_agent="test")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The coordinates of Paris 9e are {}, {}.'.format(latitude, longitude))

The coordinates of Paris 9e are 48.876019, 2.339962.


It works!

Let's apply our geocoding to our full dataset

In [20]:
# 1 - wrap the geocoder so that calls are at least one second apart
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
# 2 - create the location column
combined_data['location'] = combined_data['nomCommune'].apply(geocode)
# 3 - extract (latitude, longitude, altitude) from the location column as a tuple
combined_data['point'] = combined_data['location'].apply(lambda loc: tuple(loc.point) if loc else None)
# 4 - split the point column into latitude, longitude and altitude columns
combined_data[['latitude', 'longitude', 'altitude']] = pd.DataFrame(combined_data['point'].tolist(), index=combined_data.index)

RateLimiter caught an error, retrying (0/2 tries). Called with (*('Marseille 3e Arrondissement',), **{}).
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connection.py", line 160, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
    raise err
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
    sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 381, in _make_request
    self._validate_conn(conn)
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\c

In [21]:
combined_data=combined_data.sort_values("Rank")
combined_data

Unnamed: 0,codePostal,nomCommune,libelleAcheminement,"Population, 2017",Rank,location,point,latitude,longitude,altitude
26300,75009,Paris 9e Arrondissement,PARIS,2187526,1,"(Paris 9e Arrondissement, Paris, Île-de-France...","(48.876019, 2.339962, 0.0)",48.876019,2.339962,0.0
26301,75010,Paris 10e Arrondissement,PARIS,2187526,1,"(Paris 10e Arrondissement, Paris, Île-de-Franc...","(48.876106, 2.35991, 0.0)",48.876106,2.359910,0.0
26299,75008,Paris 8e Arrondissement,PARIS,2187526,1,"(Paris 8e Arrondissement, Paris, Île-de-France...","(48.8774799, 2.31765, 0.0)",48.877480,2.317650,0.0
26296,75005,Paris 5e Arrondissement,PARIS,2187526,1,"(Paris 5e Arrondissement, Paris, Île-de-France...","(48.8460591, 2.3445228, 0.0)",48.846059,2.344523,0.0
26297,75006,Paris 6e Arrondissement,PARIS,2187526,1,"(Paris 6e Arrondissement, Paris, Île-de-France...","(48.8504333, 2.3329507, 0.0)",48.850433,2.332951,0.0
...,...,...,...,...,...,...,...,...,...,...
31681,92240,Malakoff,MALAKOFF,30720,266,"(Malakoff, Antony, Hauts-de-Seine, Île-de-Fran...","(48.8211559, 2.3019814, 0.0)",48.821156,2.301981,0.0
7126,28410,Goussainville,GOUSSAINVILLE,30637,267,"(Goussainville, Sarcelles, Val-d'Oise, Île-de-...","(49.0323168, 2.4733628, 0.0)",49.032317,2.473363,0.0
31870,95190,Goussainville,GOUSSAINVILLE,30637,267,"(Goussainville, Sarcelles, Val-d'Oise, Île-de-...","(49.0323168, 2.4733628, 0.0)",49.032317,2.473363,0.0
27599,78400,Chatou,CHATOU,30253,270,"(Chatou, Saint-Germain-en-Laye, Yvelines, Île-...","(48.8897044, 2.1573695, 0.0)",48.889704,2.157370,0.0


Drop the rows of combined_data that contain missing values

In [24]:
combined_data=combined_data.dropna()
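
The geocoding step above calls the geocoder once per row, although many rows share the same `nomCommune` (a city with several postal codes appears several times). With a one-second delay per call, geocoding each distinct name once and mapping the result back can save time. The sketch below illustrates the idea with a stand-in geocoder: `fake_geocode` and `counting_geocode` are hypothetical and only mimic a call that returns coordinates or `None`; in the notebook the rate-limited `geocode` would take their place.

```python
import pandas as pd


def fake_geocode(name):
    """Hypothetical stand-in for the rate-limited geopy call: (lat, lon) or None."""
    known = {
        "Paris 9e Arrondissement": (48.876019, 2.339962),
        "Chatou": (48.8897044, 2.1573695),
    }
    return known.get(name)


calls = []


def counting_geocode(name):
    """Record each call so we can see how many lookups were made."""
    calls.append(name)
    return fake_geocode(name)


toy = pd.DataFrame(
    {"nomCommune": ["Paris 9e Arrondissement", "Chatou", "Chatou", "Nowhere"]}
)

# Geocode every distinct name once.
lookup = {name: counting_geocode(name) for name in toy["nomCommune"].unique()}

# Map the cached results back to the rows; failed lookups become NaN.
points = [lookup[name] for name in toy["nomCommune"]]
toy["latitude"] = [p[0] if p is not None else float("nan") for p in points]
toy["longitude"] = [p[1] if p is not None else float("nan") for p in points]

print(len(calls), "lookups for", len(toy), "rows")
print(toy)
```

Rows that could not be geocoded end up with `NaN` coordinates, so the `dropna()` call above still removes them.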

## 4.5 Visualizing the Neighbourhoods of French cities

In [26]:
# Creating the map of France (centred on Paris)
map_france = folium.Map(location=[48.866667, 2.333333], zoom_start=1)

# adding markers to map
for latitude, longitude, nomCommune, Rank in zip(combined_data['latitude'], combined_data['longitude'], combined_data['nomCommune'], combined_data['Rank']):
    label = '{}, {},{},{}'.format(latitude, longitude, nomCommune, Rank)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [latitude, longitude],
        radius=5,
        popup=label,
        color='red',
        fill=True
        ).add_to(map_france)

map_france

## 4.6 Get French Venues with Foursquare API

We set our Foursquare API credentials (`CLIENT_ID`, `CLIENT_SECRET` and `VERSION`) in a cell that has been removed for privacy.

In [49]:
LIMIT=100

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            LIMIT
            )
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Category']
    
    return(nearby_venues)
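
The function above indexes straight into the JSON response, so a response without groups, or a venue without any category, would raise an error in the middle of the loop. The sketch below shows one defensive way to flatten a payload. The helper `extract_venues` is hypothetical, and the toy payload only mirrors the nesting that the function above relies on (`response`, `groups`, `items`, `venue`, `categories`); it is not a real API response.

```python
def extract_venues(payload, name, lat, lng):
    """Hypothetical helper: flatten one 'explore' payload into rows."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        # Keep the venue even when it has no category.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue.get("name"), category))
    return rows


# Toy payload with the same nesting as the code above expects.
toy_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Caillebotte",
               "categories": [{"name": "French Restaurant"}]}},
    {"venue": {"name": "Place without category", "categories": []}},
]}]}}

rows = extract_venues(toy_payload, "Paris 9e Arrondissement", 48.876019, 2.339962)
empty = extract_venues({"response": {}}, "Nowhere", 0.0, 0.0)
print(rows)
print(empty)
```

Venues without a category come back with `None` and can be dropped or kept later, instead of stopping the whole collection run.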

In [52]:
venues_in_france = getNearbyVenues(combined_data['nomCommune'], combined_data['latitude'], combined_data['longitude'])

Paris 9e Arrondissement
Paris 10e Arrondissement
Paris 8e Arrondissement
Paris 5e Arrondissement
Paris 6e Arrondissement
Paris 4e Arrondissement
Paris 2e Arrondissement
Paris 1er Arrondissement
Paris 11e Arrondissement
Paris 12e Arrondissement
Paris 3e Arrondissement
Paris 14e Arrondissement
Paris 15e Arrondissement
Paris 17e Arrondissement
Paris 18e Arrondissement
Paris 19e Arrondissement
Paris 20e Arrondissement
Paris 13e Arrondissement
Issy-les-Moulineaux
Paris 7e Arrondissement
Marseille 12e Arrondissement
Marseille 12e Arrondissement
Marseille 13e Arrondissement
Marseille 13e Arrondissement
Marseille 15e Arrondissement
Marseille 15e Arrondissement
Marseille 16e Arrondissement
Marseille 12e Arrondissement
Marseille 14e Arrondissement
Marseille 11e Arrondissement
Marseille 14e Arrondissement
Marseille 11e Arrondissement
Marseille 10e Arrondissement
Marseille 10e Arrondissement
Marseille 9e Arrondissement
Marseille 8e Arrondissement
Marseille 7e Arrondissement
Marseille 5e Arrondisse

In [55]:
venues_in_france.shape

(9728, 5)

So we have 9728 records and 5 columns. Let's look at a sample of the data

In [56]:
venues_in_france.groupby('Neighbourhood').head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Category
0,Paris 9e Arrondissement,48.876019,2.339962,Caillebotte,French Restaurant
1,Paris 9e Arrondissement,48.876019,2.339962,Le Bouclier de Bacchus,Wine Bar
2,Paris 9e Arrondissement,48.876019,2.339962,So Nat,Vegetarian / Vegan Restaurant
3,Paris 9e Arrondissement,48.876019,2.339962,Farine & O,Bakery
4,Paris 9e Arrondissement,48.876019,2.339962,Juste,Seafood Restaurant
...,...,...,...,...,...
9690,Chatou,48.889704,2.157370,Île des Impressionnistes,Island
9691,Chatou,48.889704,2.157370,Au Bureau,Pub
9692,Chatou,48.889704,2.157370,Les rives de la Courtille,French Restaurant
9693,Chatou,48.889704,2.157370,Monoprix,Supermarket


In [57]:
venues_in_france.groupby('Venue Category').max()

Unnamed: 0_level_0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue
Venue Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ATM,Saint-Louis,38.626804,-90.199410,U.S. Bank ATM
Afghan Restaurant,Paris 17e Arrondissement,48.884224,2.379703,Buzkashi
African Restaurant,Schiltigheim,48.889343,7.748449,Waly Fay
Airport Terminal,Tremblay-en-France,48.980204,2.558956,Ladies Room
Alsatian Restaurant,Strasbourg,48.584614,7.750713,Wistub de la Petite Venise
...,...,...,...,...
Wine Bar,Vannes,50.636565,7.750713,Ze Bar
Wine Shop,Reims,49.257789,7.013442,Veuve Clicquot
Wings Joint,Suresnes,48.871099,2.228400,JFC
Women's Store,Toulon,48.867684,5.930492,Mango


We have 323 venue categories

## 4.7 One Hot encoding the venue Categories

In [58]:
france_venue_cat = pd.get_dummies(venues_in_france[['Venue Category']], prefix="", prefix_sep="")
france_venue_cat

Unnamed: 0,ATM,Afghan Restaurant,African Restaurant,Airport Terminal,Alsatian Restaurant,American Restaurant,Antique Shop,Aquarium,Argentinian Restaurant,Art Gallery,...,Venezuelan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Water Park,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9723,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9724,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
9725,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9726,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Adding the neighbourhood to the encoded dataframe

In [59]:
france_venue_cat['Neighbourhood'] = venues_in_france['Neighbourhood'] 

In [60]:

# moving neighborhood column to the first column
fixed_columns = [france_venue_cat.columns[-1]] + list(france_venue_cat.columns[:-1])
france_venue_cat = france_venue_cat[fixed_columns]

# Grouping and calculating the mean
france_grouped = france_venue_cat.groupby('Neighbourhood').mean().reset_index()
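
To make the encoding and grouping steps concrete, here is a small worked example on invented data (the neighbourhoods `A` and `B` and their venues are toy values). Taking the mean of the one-hot columns per neighbourhood gives the share of each category among that neighbourhood's venues, and `pd.crosstab` with `normalize="index"` is one alternative way to obtain the same table.

```python
import pandas as pd

# Toy venues: three in neighbourhood A, one in neighbourhood B.
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B"],
    "Venue Category": ["Bakery", "Bar", "Bakery", "Bar"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# Mean of the indicators = frequency of each category per neighbourhood.
freq = onehot.groupby("Neighbourhood").mean()

# Same frequencies computed directly from the long table.
freq_ct = pd.crosstab(venues["Neighbourhood"], venues["Venue Category"], normalize="index")

print(freq)
```

Each row of `freq` sums to 1, so neighbourhoods with many venues and neighbourhoods with few are compared on the same scale.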




In [61]:
france_grouped.head()

Unnamed: 0,Neighbourhood,ATM,Afghan Restaurant,African Restaurant,Airport Terminal,Alsatian Restaurant,American Restaurant,Antique Shop,Aquarium,Argentinian Restaurant,...,Venezuelan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Water Park,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Agen,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Ajaccio,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Albi,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Alfortville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Amiens,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 4.8 Top Venues in the Neighbourhoods

Let's write a function that returns the most common venue categories of a neighbourhood

In [62]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
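
As a quick check of what this function does, the sketch below applies the same logic to a toy frequency table (the neighbourhoods and frequencies are invented). The helper `top_categories` is a hypothetical variant that returns a plain list; it skips the first column (the neighbourhood name), sorts the remaining frequencies in descending order and keeps the first `n` labels.

```python
import pandas as pd

# Toy frequency table in the same layout as france_grouped.
freq = pd.DataFrame({
    "Neighbourhood": ["A", "B"],
    "Bakery": [0.5, 0.1],
    "Bar": [0.3, 0.7],
    "Hotel": [0.2, 0.2],
})


def top_categories(row, n):
    """Hypothetical variant of return_most_common_venues returning a list."""
    frequencies = row.iloc[1:].astype(float)  # drop the neighbourhood name
    return frequencies.sort_values(ascending=False).index[:n].tolist()


for _, row in freq.iterrows():
    print(row["Neighbourhood"], top_categories(row, 2))
```

Note that categories with equal frequency (including all the zeros of a sparse row) are ordered arbitrarily by the sort, so the tail of a top 10 list can be less meaningful than its head.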

There are far too many venue categories to inspect by hand, so we keep the 10 most common categories of each neighbourhood to describe it (the clustering itself uses all category frequencies)

In [63]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = france_grouped['Neighbourhood']

for ind in np.arange(france_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(france_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agen,Bar,Cosmetics Shop,Diner,Department Store,Mobile Phone Shop,Plaza,Bookstore,Multiplex,Hotel,French Restaurant
1,Ajaccio,French Restaurant,Restaurant,Chinese Restaurant,Steakhouse,Supermarket,Bistro,Grocery Store,Harbor / Marina,Boat or Ferry,Plaza
2,Albi,Restaurant,French Restaurant,Historic Site,Tea Room,Multiplex,Pub,Farmers Market,Garden,Bar,Performing Arts Venue
3,Alfortville,Supermarket,Bus Stop,Convenience Store,Music Venue,Pool,Bakery,Plaza,Flea Market,Park,Outdoor Sculpture
4,Amiens,Bar,Hotel,Plaza,Restaurant,Italian Restaurant,Supermarket,Fast Food Restaurant,Clothing Store,Japanese Restaurant,Department Store


Let's make the model to cluster our Neighbourhoods

## 4.9 Model Building - KMeans

In [64]:
# set number of clusters
k_num_clusters = 5

france_grouped_clustering = france_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=k_num_clusters, random_state=0).fit(france_grouped_clustering)
kmeans

KMeans(n_clusters=5, random_state=0)
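
The notebook fixes the number of clusters at 5 without comparing alternatives. One common way to sanity-check such a choice is to compare silhouette scores across several values of k. The sketch below does this on synthetic data generated with `make_blobs`, not on the venue data; the three blob centres are arbitrary assumptions chosen so that the expected answer is obvious.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated groups of points.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Fit k-means for several k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On the real data, `X` would be `france_grouped_clustering`; the scores are likely to be much less clear-cut there, and they are only one input among others when choosing k.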

Checking the labelling of our model

In [65]:
kmeans.labels_[0:100]

array([1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1,
       1, 1, 2, 3, 1, 1, 1, 1, 2, 2, 2, 1, 0, 2, 1, 1, 2, 1, 1, 1, 4, 2,
       1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1,
       1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1,
       2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2])

Let's add the cluster label column to the table of the 10 most common venue categories

In [66]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

venues_in_france.groupby('Venue Category').max()

Join neighborhoods_venues_sorted with combined_data on the neighbourhood name, so that each neighbourhood has its latitude and longitude for plotting

In [69]:
france_merged = combined_data

france_merged = france_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='nomCommune')

france_merged.head()

Unnamed: 0,codePostal,nomCommune,libelleAcheminement,"Population, 2017",Rank,location,point,latitude,longitude,altitude,...,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
26300,75009,Paris 9e Arrondissement,PARIS,2187526,1,"(Paris 9e Arrondissement, Paris, Île-de-France...","(48.876019, 2.339962, 0.0)",48.876019,2.339962,0.0,...,French Restaurant,Hotel,Bar,Burger Joint,Restaurant,Wine Bar,Tea Room,Vegetarian / Vegan Restaurant,Bakery,Bistro
26301,75010,Paris 10e Arrondissement,PARIS,2187526,1,"(Paris 10e Arrondissement, Paris, Île-de-Franc...","(48.876106, 2.35991, 0.0)",48.876106,2.35991,0.0,...,French Restaurant,Hotel,Coffee Shop,Bar,Café,Bistro,Restaurant,Pizza Place,Indian Restaurant,Breakfast Spot
26299,75008,Paris 8e Arrondissement,PARIS,2187526,1,"(Paris 8e Arrondissement, Paris, Île-de-France...","(48.8774799, 2.31765, 0.0)",48.87748,2.31765,0.0,...,Hotel,French Restaurant,Bistro,Pub,Pizza Place,Restaurant,Sandwich Place,Thai Restaurant,Sushi Restaurant,Bakery
26296,75005,Paris 5e Arrondissement,PARIS,2187526,1,"(Paris 5e Arrondissement, Paris, Île-de-France...","(48.8460591, 2.3445228, 0.0)",48.846059,2.344523,0.0,...,French Restaurant,Hotel,Bar,Italian Restaurant,Indie Movie Theater,Pub,Café,Bakery,Ice Cream Shop,Plaza
26297,75006,Paris 6e Arrondissement,PARIS,2187526,1,"(Paris 6e Arrondissement, Paris, Île-de-France...","(48.8504333, 2.3329507, 0.0)",48.850433,2.332951,0.0,...,French Restaurant,Italian Restaurant,Plaza,Café,Wine Bar,Chocolate Shop,Bistro,Ice Cream Shop,Seafood Restaurant,Fountain


In [70]:
france_merged_nonan = france_merged.dropna(subset=['Cluster Labels'])

## 4.10 Map Clustering

In [71]:
# centre the map on Paris explicitly instead of reusing variables left over from an earlier cell
map_clusters = folium.Map(location=[48.866667, 2.333333], zoom_start=11)

# set color scheme for the clusters
x = np.arange(k_num_clusters)
ys = [i + x + (i*x)**2 for i in range(k_num_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(france_merged_nonan['latitude'], france_merged_nonan['longitude'], france_merged_nonan['nomCommune'], france_merged_nonan['Cluster Labels']):
    label = folium.Popup('Cluster ' + str(int(cluster) +1) + '\n' + str(poi) , parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)]
        ).add_to(map_clusters)
        
map_clusters
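
K-means labels run from 0 to k-1, so a palette with k entries can be indexed by the label directly. The sketch below builds such a palette with the standard library only; `cluster_palette` is a hypothetical helper shown for illustration, and the saturation and brightness values are arbitrary choices.

```python
import colorsys


def cluster_palette(k):
    """Hypothetical helper: k evenly spaced hues as hex colour strings."""
    palette = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.8, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette


palette = cluster_palette(5)

# Toy labels: each marker takes the colour stored at its label's index.
labels = [1, 2, 0, 4, 3, 1]
marker_colors = [palette[int(label)] for label in labels]
print(palette)
print(marker_colors)
```

The resulting strings can be passed to the `color` and `fill_color` arguments of `folium.CircleMarker` in the same way as the `rainbow` list.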

## 4.11 Examining our Clusters


In [73]:
Cluster1=france_merged[france_merged['Cluster Labels'] == 1]

Select the top ten cities in Cluster 1

In [74]:
Cluster1[Cluster1["Rank"]<= 10]

Unnamed: 0,codePostal,nomCommune,libelleAcheminement,"Population, 2017",Rank,location,point,latitude,longitude,altitude,...,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
26300,75009,Paris 9e Arrondissement,PARIS,2187526,1,"(Paris 9e Arrondissement, Paris, Île-de-France...","(48.876019, 2.339962, 0.0)",48.876019,2.339962,0.0,...,French Restaurant,Hotel,Bar,Burger Joint,Restaurant,Wine Bar,Tea Room,Vegetarian / Vegan Restaurant,Bakery,Bistro
26301,75010,Paris 10e Arrondissement,PARIS,2187526,1,"(Paris 10e Arrondissement, Paris, Île-de-Franc...","(48.876106, 2.35991, 0.0)",48.876106,2.359910,0.0,...,French Restaurant,Hotel,Coffee Shop,Bar,Café,Bistro,Restaurant,Pizza Place,Indian Restaurant,Breakfast Spot
26299,75008,Paris 8e Arrondissement,PARIS,2187526,1,"(Paris 8e Arrondissement, Paris, Île-de-France...","(48.8774799, 2.31765, 0.0)",48.877480,2.317650,0.0,...,Hotel,French Restaurant,Bistro,Pub,Pizza Place,Restaurant,Sandwich Place,Thai Restaurant,Sushi Restaurant,Bakery
26296,75005,Paris 5e Arrondissement,PARIS,2187526,1,"(Paris 5e Arrondissement, Paris, Île-de-France...","(48.8460591, 2.3445228, 0.0)",48.846059,2.344523,0.0,...,French Restaurant,Hotel,Bar,Italian Restaurant,Indie Movie Theater,Pub,Café,Bakery,Ice Cream Shop,Plaza
26297,75006,Paris 6e Arrondissement,PARIS,2187526,1,"(Paris 6e Arrondissement, Paris, Île-de-France...","(48.8504333, 2.3329507, 0.0)",48.850433,2.332951,0.0,...,French Restaurant,Italian Restaurant,Plaza,Café,Wine Bar,Chocolate Shop,Bistro,Ice Cream Shop,Seafood Restaurant,Fountain
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9089,33300,Bordeaux,BORDEAUX,254436,9,"(Bordeaux, Gironde, Nouvelle-Aquitaine, France...","(44.841225, -0.5800364, 0.0)",44.841225,-0.580036,0.0,...,Plaza,French Restaurant,Coffee Shop,Hotel,Pedestrian Plaza,Shopping Mall,Multiplex,Electronics Store,Bistro,Bakery
9088,33000,Bordeaux,BORDEAUX,254436,9,"(Bordeaux, Gironde, Nouvelle-Aquitaine, France...","(44.841225, -0.5800364, 0.0)",44.841225,-0.580036,0.0,...,Plaza,French Restaurant,Coffee Shop,Hotel,Pedestrian Plaza,Shopping Mall,Multiplex,Electronics Store,Bistro,Bakery
19062,59777,Lille,LILLE,232787,10,"(Lille, Nord, Hauts-de-France, France métropol...","(50.6365654, 3.0635282, 0.0)",50.636565,3.063528,0.0,...,French Restaurant,Bar,Plaza,Bakery,Burger Joint,Cocktail Bar,Café,Japanese Restaurant,Coffee Shop,Hotel
19060,59000,Lille,LILLE,232787,10,"(Lille, Nord, Hauts-de-France, France métropol...","(50.6365654, 3.0635282, 0.0)",50.636565,3.063528,0.0,...,French Restaurant,Bar,Plaza,Bakery,Burger Joint,Cocktail Bar,Café,Japanese Restaurant,Coffee Shop,Hotel


In [75]:
Cluster4=france_merged[france_merged['Cluster Labels'] == 4]

In [76]:
Cluster4[Cluster4["Rank"]<= 10]

Unnamed: 0,codePostal,nomCommune,libelleAcheminement,"Population, 2017",Rank,location,point,latitude,longitude,altitude,...,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
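
Reading these tables row by row is error-prone, partly because a city with several postal codes appears several times (Bordeaux and Lille above). A short summary per cluster can support the discussion that follows. The sketch below works on a toy table: the Cluster 1 rows echo values shown above, while the Cluster 4 rows are invented placeholders, since no Cluster 4 rows are displayed.

```python
import pandas as pd

# Toy table: Cluster 1 rows echo the output above, Cluster 4 rows are invented.
toy = pd.DataFrame({
    "nomCommune": ["Paris 9e Arrondissement", "Bordeaux", "Bordeaux", "Lille",
                   "Toy place A", "Toy place B", "Toy place C"],
    "Cluster Labels": [1, 1, 1, 1, 4, 4, 4],
    "1st Most Common Venue": ["French Restaurant", "Plaza", "Plaza",
                              "French Restaurant", "Park", "Park", "Bakery"],
})

# Count each neighbourhood once, whatever its number of postal codes.
unique_places = toy.drop_duplicates(subset="nomCommune")

# How often each category is the most common venue, per cluster.
counts = unique_places.groupby("Cluster Labels")["1st Most Common Venue"].value_counts()

# The single most frequent top venue of each cluster.
dominant = unique_places.groupby("Cluster Labels")["1st Most Common Venue"].agg(
    lambda s: s.value_counts().idxmax()
)
print(counts)
print(dominant)
```

Without the `drop_duplicates` step, Bordeaux would be counted twice and Plaza would tie with French Restaurant in this toy example.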


# 5. Results and Discussion


**Cluster 1 Analysis**: in general, the most common venue is the French restaurant.   
The top 10 French cities bet on their comparative advantage in food.   
This cluster highlights the tourist dimension of big French cities: hotels, bars, cafés, plazas and some museums.   

To this tourist dimension we can add an important commercial dimension for the top cities of France: there are many shopping venues, such as clothing shops and food stores. 

**Cluster 4**: is more about daily life in the top cities, with parks, pedestrian plazas, metro stations, bakeries and so on.     

In addition, we notice a strong multicultural presence in the neighbourhoods of these cities, with Indian, Italian, Greek and other restaurants.   
Food remains very important in these cities, and we can hypothesize that the same holds for the daily life of the French.   
The modes of transport differ from city to city: tram for Lyon, metro for Paris.

# 6. Conclusion

The aim of this project was to establish comprehensive comparisons between cities in France using the k-means technique. In doing so, we can study what makes these cities attractive and what makes each of them specific.   

Using a complete database, we have assigned a cluster to every French city in our dataset (communes with more than 20,000 inhabitants). This gives us, for future work, a broad comparison across France. For this task, we restricted our analysis to the top 10 cities in France by cross-referencing Wikipedia data.   

We first observe that the neighbourhoods of the top French cities are similar: restaurants, bakeries, bars, museums and so on. This reflects France's specialization in the tourism sector. 
Each city then has its own cultural specificity. Some cities show considerable diversity, which is linked to their history of immigration. 

In this period of Covid-19, this study gives reason to worry about the economy of these cities, which relies heavily on shops and tourist venues, and to raise the alarm with the competent authorities. 

How can the economic contribution of these flows be replaced when borders are closed and shops are forbidden to open?   
Will the cities and their neighbourhoods change as a result of this crisis? Will we get the same clustering in two years? These questions remain open at the end of this project.

# 7. References

1.  [GitHub](https://github.com/Thomas-George-T/A-Tale-of-Two-Cities/blob/master/Tale_of_Two_Cities_A_Data_Science_Take.ipynb)

2. [Foursquare API](https://foursquare.com)

3. [ArcGIS API](https://www.arcgis.com/index.html)