## Capston Project - The Battle of Neighborhoods¶


### Introduction

#### Background and problem description

The French capital of Paris is one of the well-known metropolitan areas in getting abreast of cities such as New York and London and the dream destination for its rich historical and cultural heritage and its gastronomy which UNESCO declared as 'world intangible heritage'. Paris is the real melting pot of culinary world as people from various countries have moved for a long time, shared, developed their specialities.

Since the past few years from now, Paris has set to work for 'The Grand Paris' Project to redefine it as a leading metropolitan area, home to 12 million people, bringing together attrativeness and competitiveness by binding Paris and surroudning suburbs.

- Having been stayed over 10 years in the west suburban area of Paris, two friends, one from Korea and the other from Vietnam have a common project to launch Korean-Vietnamese Fusion Restaurant with the following priorities:

- Main target audience: business people working in La Defense area (a major business district located three kilometres west of the city limits of Paris. It is part of the Paris metropolitan area in the Île-de-France region)

- Localisation with easy access to public transports with visibility

- Opening hours only morning and afternoon in weekdays

- Near Farmers markets / local organic producers to provide a qulity of healthy foods and propose additional service like 'ready to cook' box and participate in local economy

#### A description of the data and how it will be used to solve the problem.

To realize this project, we need to spot competitors with the same opening hours which propose similar menu/services. It's also important to search for ideal area near from Farmer markets / local organic producers and transportations.

We'll exploring areas and extract features in using Foursquare API and appying Folium to visualize for the audience to easily understand our choice. In addition to those tools, we'll analyze the average number of public transport users through nearby stations and adjacent commerces throughout datasets provided by the RATP Group (French: Groupe RATP), best known as the Régie Autonome des Transports Parisiens (English: Autonomous Parisian Transportation Administration), is a state-owned public transport operator and maintainer headquartered in Paris, France.

In [1]:
# Import necessary libraries
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files


import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans


print('Libraries imported.')


Libraries imported.


In [2]:
!pip install geocoder==1.5.0
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

Collecting geocoder==1.5.0
[?25l  Downloading https://files.pythonhosted.org/packages/2d/ea/9554295b2abce67935ae1640ae8d8aa9cadc0f42deb27b3f6fc432a4e541/geocoder-1.5.0-py2.py3-none-any.whl (50kB)
[K     |████████████████████████████████| 51kB 10.5MB/s eta 0:00:01
Collecting ratelim (from geocoder==1.5.0)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.5.0 ratelim-0.1.6


In [5]:
!pip -q install folium
import folium # map rendering library

### Methodology

#### Load and Explore data

##### Download data to facilitate to see the Ile-de-France region and narrow down our search on boroughs near La Defense area.

In [6]:
# This data is provided by by the French National Institute for Statistics and Economic Studies (INSEE). 
!wget -q -O 'correspondances-code-insee-code-postal.json' https://www.data.gouv.fr/fr/datasets/r/e88c6fda-1d09-42a0-a069-606d3259114e
print('Data downloaded')

Data downloaded


In [7]:
# open this json file and rename it as ile-de-france-data 
with open('correspondances-code-insee-code-postal.json') as json_data:
    idf_region_data = json.load(json_data)

##### # In France, Department has more or less a notion of Department, Neighborhood has a sense of Borough.

In [11]:
# let's see only the first element of the file to create our own dataframe.
idf_region_data[0]

{'datasetid': 'correspondances-code-insee-code-postal',
 'recordid': '2bf36b38314b6c39dfbcd09225f97fa532b1fc45',
 'fields': {'code_comm': '645',
  'nom_dept': 'ESSONNE',
  'statut': 'Commune simple',
  'z_moyen': 121.0,
  'nom_region': 'ILE-DE-FRANCE',
  'code_reg': '11',
  'insee_com': '91645',
  'code_dept': '91',
  'geo_point_2d': [48.750443119964764, 2.251712972144151],
  'postal_code': '91370',
  'id_geofla': '16275',
  'code_cant': '03',
  'geo_shape': {'type': 'Polygon',
   'coordinates': [[[2.238024349288764, 48.735565859837095],
     [2.226414985434264, 48.75003536744732],
     [2.22450256558849, 48.75882853410981],
     [2.232859032169924, 48.76598806763034],
     [2.250043759055985, 48.761213267519565],
     [2.269288614654887, 48.76063999654954],
     [2.276145972515501, 48.75666127305422],
     [2.283691112862691, 48.748081131389654],
     [2.274517407535147, 48.74072222671912],
     [2.238024349288764, 48.735565859837095]]]},
  'superficie': 999.0,
  'nom_comm': 'VERRIERE

##### Let's define the data frame columns
##### In France, Department has more or less a notion of Department, Neighborhood has a sense of Borough.

In [12]:
column_names = ['Department','Department Code','Borough','Postal Code','Population','Latitude', 'Longitude']

# instantiate the dataframe for sorting out departments and its boroughs
idf_dep_bor = pd.DataFrame(columns=column_names)

##### Let's visualize what our dataframe columns look like to deploy 'for loops' to fill the follwoing rows with the adequate data.

In [13]:
idf_dep_bor

Unnamed: 0,Department,Department Code,Borough,Postal Code,Population,Latitude,Longitude


In [14]:
# Loop through the data and fill the data frame one row at a time.
for data in idf_region_data:
    department = data['fields']['nom_dept']
    department_code = data['fields']['code_dept']
    borough_name = data['fields']['nom_comm']
    postal_code = data['fields']['postal_code']
    population = data['fields']['population']
    
    idf_region_latlon = idf_region_latlon = data['geometry']['coordinates']
    idf_region_lat = idf_region_latlon[1]
    idf_region_lon = idf_region_latlon[0]
    
    idf_dep_bor = idf_dep_bor.append({'Department': department,
                                          'Department Code': department_code,
                                          'Borough': borough_name,
                                          'Postal Code': postal_code,
                                          'Population': population,
                                          'Latitude': idf_region_lat,
                                          'Longitude': idf_region_lon}, ignore_index=True)

In [15]:
# Let's examine the resulting dataframe.
idf_dep_bor.head()

Unnamed: 0,Department,Department Code,Borough,Postal Code,Population,Latitude,Longitude
0,ESSONNE,91,VERRIERES-LE-BUISSON,91370,15.5,48.750443,2.251713
1,SEINE-ET-MARNE,77,COURCELLES-EN-BASSEE,77126,0.2,48.412561,3.052941
2,ESSONNE,91,MAUCHAMPS,91730,0.3,48.527268,2.197182
3,SEINE-ET-MARNE,77,LAGNY-SUR-MARNE,77400,20.2,48.87307,2.709781
4,SEINE-ET-MARNE,77,SAINT-HILLIERS,77160,0.4,48.628915,3.258236


In [16]:
# Make sure that the dataset has the right number of departments and boroughs.
print('The dataframe has {} departments and {} boroughs.'.format(
        len(idf_dep_bor['Department Code'].unique()),
        idf_dep_bor.shape[0]
    )
)

The dataframe has 8 departments and 1300 boroughs.


In [17]:
# We can check easily through the fact that Paris has its 20 boroughs.
idf_dep_bor['Department'].value_counts()

SEINE-ET-MARNE       514
YVELINES             262
ESSONNE              196
VAL-D'OISE           185
VAL-DE-MARNE          47
SEINE-SAINT-DENIS     40
HAUTS-DE-SEINE        36
PARIS                 20
Name: Department, dtype: int64

##### There are 7 other departments around Paris. We'll focus on the department called as "Hauts-de-Seine" to which La Defense belong.

In [18]:
# Narrow down the dataframe only for Hauts-de-Seine where we can find out La Defense area.
df_hauts_de_seine = idf_dep_bor[idf_dep_bor["Department"] == "HAUTS-DE-SEINE"].reset_index(drop = True)
df_hauts_de_seine.head()

Unnamed: 0,Department,Department Code,Borough,Postal Code,Population,Latitude,Longitude
0,HAUTS-DE-SEINE,92,CHATILLON,92320,32.4,48.803409,2.287991
1,HAUTS-DE-SEINE,92,ASNIERES-SUR-SEINE,92600,81.6,48.915353,2.288038
2,HAUTS-DE-SEINE,92,COLOMBES,92700,84.6,48.922518,2.246752
3,HAUTS-DE-SEINE,92,SCEAUX,92330,19.3,48.776816,2.295294
4,HAUTS-DE-SEINE,92,RUEIL-MALMAISON,92500,79.1,48.86919,2.177341


##### Use geopy library to get the latitude and longitude values of Hauts-de-Seine department.

In [19]:
# In order to define an instance of the geocoder, it's necessary to define a user_agent. Here we can name it as below.
address = 'Île-de-France, PAR'

geolocator = Nominatim(user_agent="idf_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Île-de-France are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Île-de-France are 48.8566969, 2.3514616.


##### Let's visualize the dimension of Ile-de-France region and compare La Defense area and its surrounding boroughs.

In [20]:
# Create a map of Hauts-de-Seine departmenht with its boroughs superimposed on top, using latitude and longitude values
map_hauts_de_seine = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, df_hauts_de_seine in zip(df_hauts_de_seine["Latitude"], df_hauts_de_seine["Longitude"], df_hauts_de_seine["Department"], df_hauts_de_seine["Borough"]):
    label = '{}, {}'.format(department, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_hauts_de_seine)  
    
map_hauts_de_seine

##### Instead of exploring all boroughs spotted as above, we'll explore more in detail 4 surrounding boroughs around La Defense area where you could spot on the upper right hand side. Rename it as boroughs near la defense

In [39]:
df_bors_near_ld = idf_dep_bor.query("Borough in ['PUTEAUX','COURBEVOIE','NANTERRE','LA GARENNE-COLOMBES']")
df_bors_near_ld.head()
df_bors_near_ld

Unnamed: 0,Department,Department Code,Borough,Postal Code,Population,Latitude,Longitude
342,HAUTS-DE-SEINE,92,LA GARENNE-COLOMBES,92250,27.1,48.906758,2.244646
429,HAUTS-DE-SEINE,92,PUTEAUX,92800,44.9,48.883709,2.238342
601,HAUTS-DE-SEINE,92,COURBEVOIE,92400,86.9,48.89845,2.255706
1161,HAUTS-DE-SEINE,92,NANTERRE,92000,90.0,48.89607,2.206713


##### Check again if the data frame contains 4 boroughs after query command.

In [40]:
df_bors_near_ld

Unnamed: 0,Department,Department Code,Borough,Postal Code,Population,Latitude,Longitude
342,HAUTS-DE-SEINE,92,LA GARENNE-COLOMBES,92250,27.1,48.906758,2.244646
429,HAUTS-DE-SEINE,92,PUTEAUX,92800,44.9,48.883709,2.238342
601,HAUTS-DE-SEINE,92,COURBEVOIE,92400,86.9,48.89845,2.255706
1161,HAUTS-DE-SEINE,92,NANTERRE,92000,90.0,48.89607,2.206713


##### Let's visualize La Defense area and its surrounding boroughs superimposed on top, using latitude and longitude values

In [24]:
map_bors_near_ld = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, df_bors_near_ld in zip(df_bors_near_ld["Latitude"], df_bors_near_ld["Longitude"], df_bors_near_ld["Department"], df_bors_near_ld["Borough"]):
    label = '{}, {}'.format(department, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_bors_near_ld)  
    
map_bors_near_ld

##### Download another file to see shops near public transport stations and its closing dates to understand a general trend.

In [25]:
!wget -O 'liste-des-commerces-de-proximite-agrees-ratp.json' "https://data.ratp.fr/explore/dataset/liste-des-commerces-de-proximite-agrees-ratp/download/?format=json&timezone=Europe/Berlin&lang=fr"
print('Data Downloaded')

--2020-07-19 19:42:41--  https://data.ratp.fr/explore/dataset/liste-des-commerces-de-proximite-agrees-ratp/download/?format=json&timezone=Europe/Berlin&lang=fr
Resolving data.ratp.fr (data.ratp.fr)... 34.249.199.226, 34.248.20.69
Connecting to data.ratp.fr (data.ratp.fr)|34.249.199.226|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/json]
Saving to: ‘liste-des-commerces-de-proximite-agrees-ratp.json’

    [   <=>                                 ] 458,120      923KB/s   in 0.5s   

2020-07-19 19:42:43 (923 KB/s) - ‘liste-des-commerces-de-proximite-agrees-ratp.json’ saved [458120]

Data Downloaded


In [27]:
# open this json file and rename it as ile-de-france-data 
with open('liste-des-commerces-de-proximite-agrees-ratp.json') as json_data:
    commerces_near_trans = json.load(json_data)

##### Let's visualize what our dataframe columns look like to deploy 'for loops' to fill the follwoing rows with the adequate data.

In [28]:
commerces_near_trans[0]

{'datasetid': 'liste-des-commerces-de-proximite-agrees-ratp',
 'recordid': 'f9ac56203dc48fcbe1026d4cce024fab777bb29d',
 'fields': {'ville': 'PARIS',
  'code_postal': 75001,
  'tco_libelle': 'Et Cetera',
  'coord_geo': [48.859744, 2.346908],
  'dea_fermeture': 'dimanche',
  'dea_numero_rue_livraison_dea_rue_livraison': '10 R. DES HALLES'},
 'geometry': {'type': 'Point', 'coordinates': [2.346908, 48.859744]},
 'record_timestamp': '2020-07-16T16:16:10.197+02:00'}

In [29]:
# define the data frame columns
# In France, Department has more or less a notion of Department, Neighborhood has a sense of Borough.
column_names = ['Borough','Postal Code','Shop Name','Closing Date','Latitude', 'Longitude']

# instantiate the dataframe for sorting out departments and its boroughs
shops_near_trans = pd.DataFrame(columns=column_names)


In [30]:
# Loop through the data and fill the data frame one row at a time.
for data in commerces_near_trans:
    borough = data['fields']['ville']
    postal_code = data['fields']['code_postal']
    shop_name = data['fields']['tco_libelle']
    closing_date = data['fields']['dea_fermeture'] if 'dea_fermeture' in data['fields'] else 'Unknown'
    
    shops_near_trans_latlon = commerces_near_trans_latlon = data['geometry']['coordinates']
    shops_near_trans_lat = commerces_near_trans_latlon[1]
    shops_near_trans_lon = commerces_near_trans_latlon[0]
    
    shops_near_trans = shops_near_trans.append({'Borough': borough,
                                          'Postal Code': postal_code,
                                          'Shop Name': shop_name,
                                          'Closing Date': closing_date,
                                          'Latitude': idf_region_lat,
                                          'Longitude': idf_region_lon}, ignore_index=True)

In [31]:
# Let's visualize what our dataframe columns look like 
# to deploy 'for loops' to fill the follwoing rows with the adequate data.
shops_near_trans

Unnamed: 0,Borough,Postal Code,Shop Name,Closing Date,Latitude,Longitude
0,PARIS,75001,Et Cetera,dimanche,48.764989,3.213241
1,PARIS,75002,Calumet des Halles,Unknown,48.764989,3.213241
2,PARIS,75003,La Violette,dimanche,48.764989,3.213241
3,PARIS,75004,Le Tabac de Rivoli,dimanche,48.764989,3.213241
4,PARIS,75004,Le Victoria,dimanche,48.764989,3.213241
5,PARIS,75005,Tabac Alfa,dimanche,48.764989,3.213241
6,PARIS,75005,Le Rond-Point,dimanche,48.764989,3.213241
7,PARIS,75005,Presse - Tabac,dimanche,48.764989,3.213241
8,PARIS,75005,Civette Saint-Michel,dimanche,48.764989,3.213241
9,PARIS,75005,Tabac Notre-Dame,mardi,48.764989,3.213241


##### Shortlist 4 nearest boroughs from La Defense.

In [32]:
shops_near_trans_ld = shops_near_trans.query("Borough in ['PUTEAUX','COURBEVOIE','NANTERRE','LA GARENNE-COLOMBES']" if 'dea_fermeture' in data['fields'] else 'Unknown') 
shops_near_trans_ld.head()

Unnamed: 0,Borough,Postal Code,Shop Name,Closing Date,Latitude,Longitude
92,NANTERRE,92000,Le Clémenceau,dimanche,48.764989,3.213241
110,LA GARENNE-COLOMBES,92250,Tabac du Marché,lundi,48.764989,3.213241
111,LA GARENNE-COLOMBES,92250,Le Celtique,dimanche,48.764989,3.213241
126,COURBEVOIE,92400,Le Saint-Claude,Unknown,48.764989,3.213241
127,COURBEVOIE,92400,Presse Sainte Marie,lundi,48.764989,3.213241


##### See if the dataset includes Nan values apart from the closing date column modified previously as 'unknown'.

In [33]:
shops_near_trans_ld.isna().sum()

Borough         0
Postal Code     0
Shop Name       0
Closing Date    0
Latitude        0
Longitude       0
dtype: int64

##### To see the number of shops near those boroughs' stations.

In [34]:
shops_near_trans_ld.shape

(33, 6)

##### dimanche = Sunday, lundi = Monday, Samedi = Saturday
##### This example allows us to assume that lots of shops close only for Sunday. 

In [35]:
shops_near_trans_ld.groupby('Closing Date').count()

Unnamed: 0_level_0,Borough,Postal Code,Shop Name,Latitude,Longitude
Closing Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Unknown,2,2,2,2,2
dimanche,26,26,26,26,26
lundi,4,4,4,4,4
samedi,1,1,1,1,1


##### To explore further into various venues, let's set up parameters for Foursqaure API.

In [36]:
#Define Foursquare Credentials and Version
CLIENT_ID = 'DCWR3XPE3YXZPVHACVO5T0GFBPOW0PV0KPDWM4OPH2ZRDEJQ' # your Foursquare ID
CLIENT_SECRET = 'TQ5HFUYL5JNZZ4CTRWTNZZ1QQOMSMPGDREGO0VURQYGX0A34' # your Foursquare Secret
VERSION = '20200130' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: DCWR3XPE3YXZPVHACVO5T0GFBPOW0PV0KPDWM4OPH2ZRDEJQ
CLIENT_SECRET:TQ5HFUYL5JNZZ4CTRWTNZZ1QQOMSMPGDREGO0VURQYGX0A34


##### Let's get the top 100 venues for our 4 boroughs within a radius of 2500 meters in considering that it could take 25 minutes walking distance from the center of La Defense to the endpoint of each borough.


In [37]:
# Define a new function to process 4 boroughs at once in using for loops.
def getNearbyVenues(names, latitudes, longitudes, radius=2500):
    
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            300)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)


##### Now write the code to run the above function on each neighborhood and create a new dataframe called ladefense_venues.

In [79]:
# Get the nearby venues for boroughs as above.
ladefense_venues = getNearbyVenues(names=df_bors_near_ld['Borough'],
                                   latitudes=df_bors_near_ld['Latitude'],
                                   longitudes=df_bors_near_ld['Longitude']
                                  )

LA GARENNE-COLOMBES
PUTEAUX
COURBEVOIE
NANTERRE


##### Check the size of the resulting dataframe

In [80]:
print(ladefense_venues.shape)
ladefense_venues.head()

(400, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,LA GARENNE-COLOMBES,48.906758,2.244646,Pâtisserie Nicolas Bernardé,48.90976,2.245964,Dessert Shop
1,LA GARENNE-COLOMBES,48.906758,2.244646,Le Dattier,48.900733,2.248519,Moroccan Restaurant
2,LA GARENNE-COLOMBES,48.906758,2.244646,Montecatini,48.905987,2.25088,Pizza Place
3,LA GARENNE-COLOMBES,48.906758,2.244646,Marché de La Garenne-Colombes,48.90914,2.247115,Market
4,LA GARENNE-COLOMBES,48.906758,2.244646,Thaïoria,48.900409,2.239551,Thai Restaurant


##### Check the number of venues returned for each neighborhood

In [69]:
ladefense_venues_restaurant=ladefense_venues[ladefense_venues['Venue Category'].str.contains("Restaurant")]
ladefense_venues_restaurant.groupby(['Neighborhood', 'Venue Category']).size().reset_index(name='counts')

Unnamed: 0,Neighborhood,Venue Category,counts
0,COURBEVOIE,African Restaurant,1
1,COURBEVOIE,American Restaurant,1
2,COURBEVOIE,Asian Restaurant,1
3,COURBEVOIE,Chinese Restaurant,1
4,COURBEVOIE,French Restaurant,17
5,COURBEVOIE,Gluten-free Restaurant,1
6,COURBEVOIE,Indian Restaurant,1
7,COURBEVOIE,Italian Restaurant,2
8,COURBEVOIE,Japanese Restaurant,3
9,COURBEVOIE,Korean Restaurant,1


##### Let's see where we can find out Farmers market.

In [75]:
ladefense_venues_market=ladefense_venues[ladefense_venues['Venue Category'].str.contains("Market")]
ladefense_venues_market

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
3,LA GARENNE-COLOMBES,48.906758,2.244646,Marché de La Garenne-Colombes,48.90914,2.247115,Market
232,COURBEVOIE,48.89845,2.255706,Marché de La Garenne-Colombes,48.90914,2.247115,Market
384,NANTERRE,48.89607,2.206713,Marché du Centre,48.889185,2.194905,Farmers Market


##### Check the number of venues returned for each neighborhood

In [76]:
ladefense_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
COURBEVOIE,100,100,100,100,100,100
LA GARENNE-COLOMBES,100,100,100,100,100,100
NANTERRE,100,100,100,100,100,100
PUTEAUX,100,100,100,100,100,100


##### See the number of the unique places

In [84]:
print(
    'There are {} uniques categories'.format(len(ladefense_venues['Venue Category'].unique())))

There are 91 uniques categories


#### Analyze each neighborhood

In [85]:
# one hot encoding
ladefense_onehot = pd.get_dummies(ladefense_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
ladefense_onehot['Neighborhood'] = ladefense_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [ladefense_onehot.columns[-1]] + list(ladefense_onehot.columns[:-1])
ladefense_onehot = ladefense_onehot[fixed_columns]

ladefense_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,African Restaurant,American Restaurant,Art Museum,Asian Restaurant,Bagel Shop,Bakery,Basketball Stadium,Beer Store,Bistro,Boat or Ferry,Bookstore,Botanical Garden,Brewery,Burger Joint,Café,Campground,Chinese Restaurant,Clothing Store,Coffee Shop,Cosmetics Shop,Creperie,Cupcake Shop,Department Store,Dessert Shop,Electronics Store,Farm,Farmers Market,Fast Food Restaurant,Food & Drink Shop,Food Court,Forest,Fountain,French Restaurant,Frozen Yogurt Shop,Furniture / Home Store,Garden,Gas Station,Gluten-free Restaurant,Go Kart Track,Golf Course,Gym,Gym / Fitness Center,Historic Site,Hotel,Indian Restaurant,Island,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Lebanese Restaurant,Lounge,Market,Mexican Restaurant,Monument / Landmark,Moroccan Restaurant,Movie Theater,Multiplex,New American Restaurant,Office,Park,Pedestrian Plaza,Pet Store,Pizza Place,Plaza,Pool,Pub,Restaurant,Roof Deck,Salad Place,Sandwich Place,Sauna / Steam Room,Scenic Lookout,School,Shopping Mall,Snack Place,Soccer Field,Soccer Stadium,Sporting Goods Shop,Stadium,Supermarket,Sushi Restaurant,Tapas Restaurant,Tennis Court,Thai Restaurant,Theater,Trail,Train Station,Tunnel,Turkish Restaurant,Vietnamese Restaurant
0,LA GARENNE-COLOMBES,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,LA GARENNE-COLOMBES,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,LA GARENNE-COLOMBES,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,LA GARENNE-COLOMBES,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,LA GARENNE-COLOMBES,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0


##### Examine the new dataframe size

In [86]:
ladefense_onehot.shape

(400, 92)

##### Group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [87]:
ladefense_grouped = ladefense_onehot.groupby('Neighborhood').mean().reset_index()
ladefense_grouped

Unnamed: 0,Neighborhood,Accessories Store,African Restaurant,American Restaurant,Art Museum,Asian Restaurant,Bagel Shop,Bakery,Basketball Stadium,Beer Store,Bistro,Boat or Ferry,Bookstore,Botanical Garden,Brewery,Burger Joint,Café,Campground,Chinese Restaurant,Clothing Store,Coffee Shop,Cosmetics Shop,Creperie,Cupcake Shop,Department Store,Dessert Shop,Electronics Store,Farm,Farmers Market,Fast Food Restaurant,Food & Drink Shop,Food Court,Forest,Fountain,French Restaurant,Frozen Yogurt Shop,Furniture / Home Store,Garden,Gas Station,Gluten-free Restaurant,Go Kart Track,Golf Course,Gym,Gym / Fitness Center,Historic Site,Hotel,Indian Restaurant,Island,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Lebanese Restaurant,Lounge,Market,Mexican Restaurant,Monument / Landmark,Moroccan Restaurant,Movie Theater,Multiplex,New American Restaurant,Office,Park,Pedestrian Plaza,Pet Store,Pizza Place,Plaza,Pool,Pub,Restaurant,Roof Deck,Salad Place,Sandwich Place,Sauna / Steam Room,Scenic Lookout,School,Shopping Mall,Snack Place,Soccer Field,Soccer Stadium,Sporting Goods Shop,Stadium,Supermarket,Sushi Restaurant,Tapas Restaurant,Tennis Court,Thai Restaurant,Theater,Trail,Train Station,Tunnel,Turkish Restaurant,Vietnamese Restaurant
0,COURBEVOIE,0.0,0.01,0.01,0.01,0.01,0.0,0.03,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.04,0.02,0.0,0.01,0.01,0.01,0.01,0.01,0.0,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.17,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.01,0.02,0.0,0.03,0.01,0.01,0.02,0.03,0.01,0.0,0.0,0.01,0.01,0.01,0.03,0.01,0.01,0.0,0.0,0.05,0.02,0.0,0.02,0.03,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.01,0.02,0.0,0.01,0.02,0.03,0.01,0.02,0.0,0.02,0.01,0.01,0.0,0.0,0.01,0.01
1,LA GARENNE-COLOMBES,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.03,0.0,0.0,0.01,0.02,0.01,0.01,0.02,0.0,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.18,0.0,0.03,0.01,0.0,0.0,0.0,0.0,0.01,0.03,0.0,0.04,0.0,0.0,0.01,0.03,0.01,0.0,0.01,0.01,0.01,0.01,0.02,0.0,0.01,0.01,0.0,0.08,0.02,0.0,0.01,0.05,0.01,0.0,0.0,0.01,0.0,0.01,0.0,0.01,0.01,0.01,0.0,0.02,0.0,0.01,0.02,0.02,0.01,0.01,0.0,0.02,0.01,0.01,0.0,0.01,0.01,0.01
2,NANTERRE,0.01,0.0,0.0,0.0,0.01,0.0,0.02,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.03,0.0,0.0,0.0,0.01,0.02,0.02,0.0,0.01,0.01,0.0,0.02,0.01,0.01,0.01,0.0,0.01,0.0,0.0,0.04,0.01,0.03,0.01,0.02,0.0,0.01,0.01,0.0,0.01,0.01,0.06,0.0,0.0,0.03,0.05,0.0,0.01,0.0,0.0,0.01,0.01,0.01,0.01,0.01,0.0,0.0,0.04,0.01,0.0,0.01,0.06,0.0,0.0,0.0,0.01,0.02,0.03,0.0,0.01,0.01,0.02,0.0,0.02,0.01,0.03,0.01,0.06,0.0,0.0,0.01,0.02,0.02,0.0,0.01,0.01,0.0,0.0
3,PUTEAUX,0.0,0.01,0.0,0.01,0.02,0.01,0.01,0.01,0.01,0.01,0.01,0.02,0.01,0.01,0.02,0.0,0.01,0.0,0.01,0.01,0.01,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.01,0.01,0.01,0.17,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.04,0.0,0.01,0.02,0.0,0.01,0.0,0.0,0.0,0.01,0.01,0.03,0.0,0.01,0.0,0.01,0.05,0.02,0.01,0.0,0.05,0.0,0.01,0.01,0.0,0.0,0.02,0.01,0.02,0.01,0.01,0.0,0.02,0.01,0.01,0.02,0.02,0.0,0.0,0.01,0.01,0.02,0.0,0.01,0.01,0.0,0.01


##### Confirm new size

In [88]:
ladefense_grouped.shape

(4, 92)

##### Let's print each neighborhood along with the top 10 most common venues

In [89]:
num_top_venues = 10

for hood in ladefense_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = ladefense_grouped[ladefense_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----COURBEVOIE----
                 venue  freq
0    French Restaurant  0.17
1                 Park  0.05
2         Burger Joint  0.04
3                Plaza  0.03
4                Hotel  0.03
5               Bakery  0.03
6          Supermarket  0.03
7  Japanese Restaurant  0.03
8  Moroccan Restaurant  0.03
9     Pedestrian Plaza  0.02


----LA GARENNE-COLOMBES----
                    venue  freq
0       French Restaurant  0.18
1                    Park  0.08
2                   Plaza  0.05
3                   Hotel  0.04
4            Burger Joint  0.03
5  Furniture / Home Store  0.03
6    Gym / Fitness Center  0.03
7     Japanese Restaurant  0.03
8        Pedestrian Plaza  0.02
9            Soccer Field  0.02


----NANTERRE----
                 venue  freq
0                Plaza  0.06
1                Hotel  0.06
2          Supermarket  0.06
3  Japanese Restaurant  0.05
4    French Restaurant  0.04
5                 Park  0.04
6         Burger Joint  0.03
7  Sporting Goods Shop  0.03


##### Let's put that into a pandas dataframe

In [90]:
# transform it into the pandas dataframe
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [91]:
# create the new dataframe and display the top 10 venues for each neighborhood.
# num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = ladefense_grouped['Neighborhood']

for ind in np.arange(ladefense_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(ladefense_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,COURBEVOIE,French Restaurant,Park,Burger Joint,Hotel,Moroccan Restaurant,Bakery,Supermarket,Plaza,Japanese Restaurant,Sandwich Place
1,LA GARENNE-COLOMBES,French Restaurant,Park,Plaza,Hotel,Japanese Restaurant,Furniture / Home Store,Burger Joint,Gym / Fitness Center,Bakery,Clothing Store
2,NANTERRE,Hotel,Supermarket,Plaza,Japanese Restaurant,Park,French Restaurant,Sporting Goods Shop,Sandwich Place,Burger Joint,Furniture / Home Store
3,PUTEAUX,French Restaurant,Park,Plaza,Hotel,Moroccan Restaurant,Soccer Field,Bookstore,Pedestrian Plaza,Burger Joint,Italian Restaurant


#### Cluster Neighborhoods

##### Run k-means to cluster the neighborhood into 2 clusters.

In [92]:
# set number of clusters
kclusters = 2

ladefense_grouped_clustering = ladefense_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(ladefense_grouped_clustering)


##### Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [93]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

ladefense_merged = df_bors_near_ld

# merge ladefense_grouped with ladefense_data to add latitude/longitude for each neighborhood
ladefense_merged = ladefense_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Borough')

ladefense_merged.head() # check the last columns!

Unnamed: 0,Department,Department Code,Borough,Postal Code,Population,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
342,HAUTS-DE-SEINE,92,LA GARENNE-COLOMBES,92250,27.1,48.906758,2.244646,0,French Restaurant,Park,Plaza,Hotel,Japanese Restaurant,Furniture / Home Store,Burger Joint,Gym / Fitness Center,Bakery,Clothing Store
429,HAUTS-DE-SEINE,92,PUTEAUX,92800,44.9,48.883709,2.238342,0,French Restaurant,Park,Plaza,Hotel,Moroccan Restaurant,Soccer Field,Bookstore,Pedestrian Plaza,Burger Joint,Italian Restaurant
601,HAUTS-DE-SEINE,92,COURBEVOIE,92400,86.9,48.89845,2.255706,0,French Restaurant,Park,Burger Joint,Hotel,Moroccan Restaurant,Bakery,Supermarket,Plaza,Japanese Restaurant,Sandwich Place
1161,HAUTS-DE-SEINE,92,NANTERRE,92000,90.0,48.89607,2.206713,1,Hotel,Supermarket,Plaza,Japanese Restaurant,Park,French Restaurant,Sporting Goods Shop,Sandwich Place,Burger Joint,Furniture / Home Store


##### Let's visualize the resulting clusters

In [94]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(ladefense_merged['Latitude'], ladefense_merged['Longitude'], ladefense_merged['Borough'], ladefense_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Examine Clusters

In [95]:
#Cluster1
ladefense_merged.loc[ladefense_merged['Cluster Labels'] == 0, ladefense_merged.columns[[1] + list(range(5, ladefense_merged.shape[1]))]]

Unnamed: 0,Department Code,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
342,92,48.906758,2.244646,0,French Restaurant,Park,Plaza,Hotel,Japanese Restaurant,Furniture / Home Store,Burger Joint,Gym / Fitness Center,Bakery,Clothing Store
429,92,48.883709,2.238342,0,French Restaurant,Park,Plaza,Hotel,Moroccan Restaurant,Soccer Field,Bookstore,Pedestrian Plaza,Burger Joint,Italian Restaurant
601,92,48.89845,2.255706,0,French Restaurant,Park,Burger Joint,Hotel,Moroccan Restaurant,Bakery,Supermarket,Plaza,Japanese Restaurant,Sandwich Place


In [96]:
#Cluster2
ladefense_merged.loc[ladefense_merged['Cluster Labels'] == 1, ladefense_merged.columns[[1] + list(range(5, ladefense_merged.shape[1]))]]

Unnamed: 0,Department Code,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1161,92,48.89607,2.206713,1,Hotel,Supermarket,Plaza,Japanese Restaurant,Park,French Restaurant,Sporting Goods Shop,Sandwich Place,Burger Joint,Furniture / Home Store


### Results

K-means algorithm is ideal to segment to close the distance from our inital alternatives for this unsupervised dataset obtained from Foursquare API. 
La Defense share the part of territories of the four adjacent boroughs. 

Throughout k-means machine learning method, the first cluster has a similar tendancy where common venues are quite similar from the 1st to 5th. It includes three following boroughs: Puteaux, La Garenne Colombes, and Courbevoie.

For the second cluster, there is only one borough called 'Nanterre' where represents more population than the others. 

### Oberservations

To go further into this project, it would be interesting to get a data for shops for sale/rent and sub-borough information to spot the best place to start this future business. 

As it's not easy to get the right data for the number of the public transport users per station and its geo information, there are two huge stations in Puteaux and Nanterre which belong to top 10 in terms of frequency in Ile-de-France region. 

### Conclusion

As a conclusion of this project, opening a Korean-Vietnamese restaurant in Nanterre is optimal with the following reasons:

- There are over 90k populations and 71k public transport users for Nanterre Metro / RER (Suburban train) on daily basis.
- The Ratio of the presence of Korean and Vietnam restaurants is 0 compared to other boroughs even though the total number of restaurants are quite similar one another and there's no difference in terms of Asian food offers.
- All boroughs near La Defense have a smiliar pattern in terms of closing day as Sunday. There are many other ways to take novel profit structure by inventing atypic services targetings for the working people in the zone and the nearby habitants.
