**Plan**


1. Business Problem
2. Proposed Approach
3. Data configuration
4. Experiments
5. Evaluation
6. Conclusion




# Description of the problem

In this work, we aim to identify which of the 20 districts (arrondissements) of Paris, France, are the best candidates for recommending Asian restaurants.

# Proposed Approach

First, we create a dataframe from the Paris neighborhood dataset, which lists each district's postcode together with the postcodes of its neighboring districts.

Second, we geocode each of the 20 districts of Paris to obtain its coordinates.

Third, we explore, segment, and cluster the districts based on their most common venues.

For the evaluation, we analyze the clustered results and propose suitable districts for recommending Asian restaurants in Paris. Finally, we conclude with some perspectives for improving the approach.
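
One way to make the final recommendation concrete is to score each district by the share of its venues that fall into Asian restaurant categories. The sketch below is only an illustration under stated assumptions: the `ASIAN_CATEGORIES` set and the `asian_share_by_district` helper are hypothetical names, the choice of categories is a judgment call, and the sample rows are made up.

```python
import pandas as pd

# Assumption: which Foursquare categories count as "Asian" is a judgment call.
ASIAN_CATEGORIES = {
    "Asian Restaurant", "Chinese Restaurant", "Japanese Restaurant",
    "Korean Restaurant", "Vietnamese Restaurant", "Thai Restaurant",
    "Ramen Restaurant", "Udon Restaurant", "Sushi Restaurant",
    "Cambodian Restaurant",
}

def asian_share_by_district(venues, district_col="postcode",
                            category_col="Venue Category"):
    """Return the fraction of venues per district that are Asian restaurants, highest first."""
    is_asian = venues[category_col].isin(ASIAN_CATEGORIES)
    return is_asian.groupby(venues[district_col]).mean().sort_values(ascending=False)

# Made-up sample rows, for illustration only.
toy_venues = pd.DataFrame({
    "postcode": ["75001", "75001", "75001", "75001", "75013", "75013"],
    "Venue Category": ["Garden", "Ramen Restaurant", "Udon Restaurant", "Hotel",
                       "Vietnamese Restaurant", "Chinese Restaurant"],
})
asian_share = asian_share_by_district(toy_venues)
print(asian_share)
```

A score like this could complement the cluster labels, since clusters group districts by overall venue mix rather than by Asian restaurants specifically.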

# Data configuration

In this section, we build a dataframe that pairs each Paris district with its neighboring districts and its coordinates. The neighborhood data is based on the following sources:

* Paris Arrondissements & Neighborhoods Map: https://parismap360.com/paris-arrondissement-map#.XfVpqtEo91l
* Arrondissements in Paris, France: https://francetravelplanner.com/go/paris/areas/arrondismt.html

* We use the geopy package to convert an address into latitude and longitude values.

In [0]:
import os
import pandas as pd

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from geopy.extra.rate_limiter import RateLimiter

In [0]:
COL_NAME_POSTCODE = "postcode"
COL_NAME_COUNTRY = "country"
COL_NAME_ADDRESS = "address"
COL_NAME_LOCATION = "location"
COL_NAME_POINT = "point"
COL_NAME_LATITUDE = "latitude"
COL_NAME_LONGITUDE = "longitude"
COL_NAME_ALTITUDE = "altitude"
COL_NAME_NEIGHBOURHOOD = "neighbourhood"

file_coordinate_path = "./data/Geospatial_Coordinates_Paris.csv"
file_neighbourhood_path = "./data/Paris_Neighbourhood.csv"

Create the dataframe

In [6]:
if os.path.exists(file_neighbourhood_path):
    print("Loading Paris neighbourhood data from file : %s" % file_neighbourhood_path)
    df_neighbourhood = pd.read_csv(file_neighbourhood_path, header=0)
else:
    # Neighbourhood (adjacency) data for Paris, compiled by hand from the information in
    # https://parismap360.com/paris-arrondissement-map#.XfXp89Eo91m
    # https://francetravelplanner.com/go/paris/areas/arrondismt.html
    list_neighbourhood = [
    ["75001", "75002"], ["75001", "75003"], ["75001", "75004"], ["75001", "75005"], 
    ["75001", "75006"], ["75001", "75007"], ["75001", "75008"], ["75001", "75009"], 
    ["75002", "75001"], ["75002", "75003"], ["75002", "75009"], ["75002", "75010"],
    ["75003", "75001"], ["75003", "75002"], ["75003", "75004"], ["75003", "75010"],
    ["75003", "75011"], ["75004", "75001"], ["75004", "75003"], ["75004", "75005"],
    ["75004", "75006"], ["75004", "75011"], ["75004", "75012"], ["75005", "75001"],
    ["75005", "75004"], ["75005", "75006"], ["75005", "75012"], ["75005", "75013"],
    ["75005", "75014"], ["75006", "75001"], ["75006", "75004"], ["75006", "75005"],
    ["75006", "75007"], ["75006", "75014"], ["75006", "75015"], ["75007", "75001"],
    ["75007", "75006"], ["75007", "75008"], ["75007", "75015"], ["75007", "75016"],
    ["75008", "75001"], ["75008", "75007"], ["75008", "75009"], ["75008", "75016"],
    ["75008", "75017"], ["75008", "75018"], ["75009", "75001"], ["75009", "75002"],
    ["75009", "75008"], ["75009", "75010"], ["75009", "75017"], ["75009", "75018"],
    ["75010", "75002"], ["75010", "75003"], ["75010", "75009"], ["75010", "75011"],
    ["75010", "75018"], ["75010", "75019"], ["75010", "75020"], ["75011", "75003"],
    ["75011", "75004"], ["75011", "75010"], ["75011", "75012"], ["75011", "75019"],
    ["75011", "75020"], ["75012", "75004"], ["75012", "75005"], ["75012", "75011"],
    ["75012", "75013"], ["75012", "75020"], ["75013", "75005"], ["75013", "75012"],
    ["75013", "75014"], ["75014", "75005"], ["75014", "75006"], ["75014", "75013"],
    ["75014", "75015"], ["75015", "75006"], ["75015", "75007"], ["75015", "75014"],
    ["75015", "75016"], ["75016", "75007"], ["75016", "75008"], ["75016", "75015"],
    ["75016", "75017"], ["75017", "75008"], ["75017", "75009"], ["75017", "75016"],
    ["75017", "75018"], ["75018", "75008"], ["75018", "75009"], ["75018", "75010"],
    ["75018", "75017"], ["75018", "75019"], ["75019", "75010"], ["75019", "75011"],
    ["75019", "75018"], ["75019", "75020"], ["75020", "75010"], ["75020", "75011"],
    ["75020", "75012"], ["75020", "75019"]]

    df_neighbourhood = pd.DataFrame(data=list_neighbourhood, columns=[COL_NAME_POSTCODE, COL_NAME_NEIGHBOURHOOD])

    df_neighbourhood.to_csv(file_neighbourhood_path, header=True, index=False)

Loading Paris neighbourhood data from file : ./data/Paris_Neighbourhood.csv


In [10]:
df_neighbourhood.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 0 to 101
Data columns (total 2 columns):
postcode         102 non-null object
neighbourhood    102 non-null object
dtypes: object(2)
memory usage: 1.7+ KB


In [14]:
df_neighbourhood.head()

Unnamed: 0,postcode,neighbourhood
0,75001,75002
1,75001,75003
2,75001,75004
3,75001,75005
4,75001,75006


In [8]:
df_neighbourhood.columns

Index(['postcode', 'neighbourhood'], dtype='object')

In [0]:
# Convert all values in the dataframe to strings
df_neighbourhood = df_neighbourhood.astype(str)

In [13]:
# Get the shape of the dataframe
print("(row, column) = ", str(df_neighbourhood.shape))

(row, column) =  (102, 2)


In [16]:
# Group by postcode and join the neighbouring postcodes with commas
df_combined = df_neighbourhood.groupby(by=[COL_NAME_POSTCODE]).agg(lambda x: ",".join(x)).reset_index()
df_combined.head()

Unnamed: 0,postcode,neighbourhood
0,75001,"75002,75003,75004,75005,75006,75007,75008,75009"
1,75002,"75001,75003,75009,75010"
2,75003,"75001,75002,75004,75010,75011"
3,75004,"75001,75003,75005,75006,75011,75012"
4,75005,"75001,75004,75006,75012,75013,75014"
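
The group-and-join step can be checked on a tiny input. The sketch below uses made-up pairs in the same two-column shape as `df_neighbourhood`; the variable names are only illustrative.

```python
import pandas as pd

# Made-up (postcode, neighbour) pairs in the same shape as df_neighbourhood.
pairs = pd.DataFrame({
    "postcode": ["75002", "75002", "75001", "75001", "75001"],
    "neighbourhood": ["75001", "75003", "75002", "75003", "75004"],
})

# Build one row per postcode; groupby sorts the postcodes by default.
combined = (
    pairs.groupby("postcode")["neighbourhood"]
    .agg(",".join)
    .reset_index()
)
print(combined)

# Splitting on the comma recovers the original list of neighbours.
neighbours_of_75001 = (
    combined.loc[combined["postcode"] == "75001", "neighbourhood"].iloc[0].split(",")
)
```

Keeping the neighbours as one comma-separated string gives exactly one row per district, which is what the later merge on postcode relies on.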


Build the Coordinates of Paris Districts

In [17]:
if os.path.exists(file_coordinate_path):
    print("Loading file input : {}".format(file_coordinate_path))
    df_coordinates = pd.read_csv(file_coordinate_path, header=0)
else:
    # In Paris, France, there are 20 districts
    list_of_districts_in_Paris = ["750" + str(x).zfill(2) for x in range(1, 21)]
    
    # Create DataFrame with given list of districts of Paris
    df_coordinates = pd.DataFrame(data=list_of_districts_in_Paris, columns=[COL_NAME_POSTCODE])

    df_coordinates[COL_NAME_COUNTRY] = "FR"
    df_coordinates[COL_NAME_ADDRESS] = df_coordinates.apply(lambda row: str(row[COL_NAME_POSTCODE]) + ", " + row[COL_NAME_COUNTRY], axis=1)

    locator = Nominatim(user_agent="paris_explorer")

    # convenient function to delay between geocoding calls
    geocode = RateLimiter(locator.geocode, min_delay_seconds=1)

    # create column "location"
    df_coordinates[COL_NAME_LOCATION] = df_coordinates[COL_NAME_ADDRESS].apply(geocode)

    # extract (latitude, longitude, altitude) from the location column (returns a tuple)
    df_coordinates[COL_NAME_POINT] = df_coordinates[COL_NAME_LOCATION].apply(lambda loc: tuple(loc.point) if loc else None)

    # split point column into latitude, longitude and altitude columns
    df_coordinates[[COL_NAME_LATITUDE, COL_NAME_LONGITUDE, COL_NAME_ALTITUDE]] = pd.DataFrame(df_coordinates[COL_NAME_POINT].tolist(), index=df_coordinates.index)
    
    # save to file csv
    df_coordinates.to_csv(file_coordinate_path, header=True, index=False)

Loading file input : ./data/Geospatial_Coordinates_Paris.csv
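
If a lookup fails, the cell above stores `None` in the point column, and expanding a column that mixes tuples and `None` into three columns may not behave as intended. The sketch below shows one possible way to keep such rows as NaN; `split_points` is a hypothetical helper and the sample values are illustrative.

```python
import numpy as np
import pandas as pd

def split_points(points):
    """Expand a Series of (lat, lon, alt) tuples into three columns, keeping NaN for misses."""
    missing = (np.nan, np.nan, np.nan)
    filled = [p if p is not None else missing for p in points]
    return pd.DataFrame(filled, index=points.index,
                        columns=["latitude", "longitude", "altitude"])

# The second entry simulates a postcode that the geocoder could not resolve.
points = pd.Series([(48.8635, 2.3388, 0.0), None, (48.8527, 2.3463, 0.0)],
                   index=["75001", "75002", "75005"])
coords = split_points(points)
print(coords)
```

Rows with NaN coordinates can then be retried or dropped explicitly before any map or API call uses them.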


In [18]:
df_coordinates.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 8 columns):
postcode     20 non-null int64
country      20 non-null object
address      20 non-null object
location     20 non-null object
point        20 non-null object
latitude     20 non-null float64
longitude    20 non-null float64
altitude     20 non-null float64
dtypes: float64(3), int64(1), object(4)
memory usage: 1.4+ KB


In [19]:
df_coordinates.head()

Unnamed: 0,postcode,country,address,location,point,latitude,longitude,altitude
0,75001,FR,"75001, FR","Quartier du Palais Royal, Paris 1er Arrondisse...","(48.8635535039561, 2.33885565919603, 0.0)",48.863554,2.338856,0.0
1,75002,FR,"75002, FR","Quartier du Mail, Paris 2e Arrondissement, Par...","(48.8674178540292, 2.34425631198765, 0.0)",48.867418,2.344256,0.0
2,75003,FR,"75003, FR","Quartier des Enfants-Rouges, Paris 3e Arrondis...","(48.8626070953067, 2.36021125472424, 0.0)",48.862607,2.360211,0.0
3,75004,FR,"75004, FR","Quartier Saint-Gervais, Paris 4e Arrondissemen...","(48.8560044890472, 2.35702787286538, 0.0)",48.856004,2.357028,0.0
4,75005,FR,"75005, FR","Quartier de la Sorbonne, Paris 5e Arrondisseme...","(48.85275155, 2.34634315537975, 0.0)",48.852752,2.346343,0.0


In [0]:
# Remove the useless columns
df_coordinates.drop([COL_NAME_COUNTRY, COL_NAME_POINT, COL_NAME_ALTITUDE, COL_NAME_LOCATION], axis=1, inplace=True)

In [22]:
df_coordinates.columns

Index(['postcode', 'address', 'latitude', 'longitude'], dtype='object')

In [0]:
df_coordinates[COL_NAME_POSTCODE] = df_coordinates[COL_NAME_POSTCODE].astype(str)

In [24]:
df_coordinates.head(2)

Unnamed: 0,postcode,address,latitude,longitude
0,75001,"75001, FR",48.863554,2.338856
1,75002,"75002, FR",48.867418,2.344256


In [25]:
# District 5
df_coordinates[df_coordinates[COL_NAME_POSTCODE]=="75005"]

Unnamed: 0,postcode,address,latitude,longitude
4,75005,"75005, FR",48.852752,2.346343


Merge two dataframes by postal code

In [27]:
print(df_combined.columns)
print(df_coordinates.columns)

Index(['postcode', 'neighbourhood'], dtype='object')
Index(['postcode', 'address', 'latitude', 'longitude'], dtype='object')


In [28]:
df_merged = pd.merge(df_combined, df_coordinates, 
                     left_on=COL_NAME_POSTCODE, right_on=COL_NAME_POSTCODE,
                     how="inner")
df_merged.head(2)

Unnamed: 0,postcode,neighbourhood,address,latitude,longitude
0,75001,"75002,75003,75004,75005,75006,75007,75008,75009","75001, FR",48.863554,2.338856
1,75002,"75001,75003,75009,75010","75002, FR",48.867418,2.344256
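
Because the coordinates file stores the postcode as an integer while the neighbourhood table stores it as a string, the key types need to match before merging. The sketch below, on made-up rows, shows one way to verify a merge with the `validate` and `indicator` options of `pd.merge`; the frame names are illustrative.

```python
import pandas as pd

left = pd.DataFrame({
    "postcode": ["75001", "75002", "75003"],
    "neighbourhood": ["75002,75003", "75001,75003", "75001,75002"],
})
right = pd.DataFrame({
    "postcode": [75001, 75002],          # integers, as when read back from CSV
    "latitude": [48.8635, 48.8674],
})

# Align the key dtype first, as the notebook does with astype(str).
right["postcode"] = right["postcode"].astype(str)

# validate raises if a postcode is duplicated; indicator records where each row came from.
checked = pd.merge(left, right, on="postcode", how="left",
                   validate="one_to_one", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "postcode"].tolist()
print(unmatched)
```

With an inner join, unmatched districts would silently disappear, so a left join with `indicator=True` is a quick way to confirm that all 20 districts survive.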


In [29]:
df_merged.columns

Index(['postcode', 'neighbourhood', 'address', 'latitude', 'longitude'], dtype='object')

In [31]:
df_merged.describe()

Unnamed: 0,latitude,longitude
count,20.0,20.0
mean,48.859848,2.34218
std,0.01777,0.031128
min,48.826997,2.273958
25%,48.853341,2.315255
50%,48.856565,2.344069
75%,48.873627,2.357824
max,48.893074,2.409257


In [32]:
# Get the shape of the dataframe
print("(row, column) = ", str(df_merged.shape))

(row, column) =  (20, 5)


**Explore and cluster the neighborhoods in Paris**

In [33]:
# Get distinct postal codes from dataframe
df_merged[COL_NAME_POSTCODE].unique()

array(['75001', '75002', '75003', '75004', '75005', '75006', '75007',
       '75008', '75009', '75010', '75011', '75012', '75013', '75014',
       '75015', '75016', '75017', '75018', '75019', '75020'], dtype=object)

In [35]:
print('The dataframe has {} distinct districts and {} neighborhoods.'.format(
      df_merged[COL_NAME_POSTCODE].nunique(),
      df_merged.shape[0]))

The dataframe has 20 distinct districts and 20 neighborhoods.


# Experiments

**Use the geopy package to get the coordinate values**

In [36]:
# Get the coordinate of Paris, France
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

def get_latitude_longitude(address=""):
    if not address:
        return None, None
    
    geolocator = Nominatim(user_agent="paris_explorer")
    location = geolocator.geocode(address)
    if location is None:
        # geocode returns None when the address cannot be resolved
        return None, None
    return location.latitude, location.longitude

def get_latitude_longitude_paris_fr():
    address = 'Paris, FR'
    return get_latitude_longitude(address)

latitude, longitude = get_latitude_longitude_paris_fr()
print('The geographical coordinates of Paris, FR are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Paris, FR are 48.8566969, 2.3514616.


In [38]:
# Create a map with the neighborhoods

import folium

# create map using latitude and longitude values
plan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, district, neighborhood in zip(df_merged[COL_NAME_LATITUDE], 
                                            df_merged[COL_NAME_LONGITUDE], 
                                            df_merged[COL_NAME_POSTCODE], 
                                            df_merged[COL_NAME_NEIGHBOURHOOD]):
    label = 'District:{}, Neighbourhood:{}'.format(district, neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(plan)  
    
plan

In [76]:
CLIENT_ID = 'X'     # Foursquare ID
CLIENT_SECRET = 'X' # Foursquare Secret
VERSION = '20180604'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: X
CLIENT_SECRET:X


In [41]:
# Get the latitude and longitude values of the first district
neighborhood_latitude = df_merged.loc[0, COL_NAME_LATITUDE]   # neighborhood latitude value
neighborhood_longitude = df_merged.loc[0, COL_NAME_LONGITUDE] # neighborhood longitude value

neighborhood_name = df_merged.loc[0, COL_NAME_NEIGHBOURHOOD] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of 75002,75003,75004,75005,75006,75007,75008,75009 are 48.8635535039561, 2.33885565919603.


In [77]:
# Get the top 100 venues within a radius of 500 meters of the first district (75001).
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=X&client_secret=X&v=20180604&ll=48.8635535039561,2.33885565919603&radius=500&limit=100'
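
As an alternative to formatting the query string by hand, the parameters can be passed through `urllib.parse.urlencode`, which handles escaping and avoids the stray `&` after the `?`. This is only a sketch: `build_explore_url` is a hypothetical helper, and the endpoint and parameter names are the ones already used in this notebook.

```python
from urllib.parse import urlencode

# Endpoint taken from the notebook cell above.
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Assemble the explore URL from its parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

example_url = build_explore_url("X", "X", "20180604",
                                48.8635535039561, 2.33885565919603)
print(example_url)
```

The comma in `ll` is percent-encoded as `%2C`, which is a standard encoding of the same value.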

In [78]:
# Send the GET request and examine the results
import requests # library to handle requests

results = requests.get(url).json()
results

{'meta': {'code': 400,
  'errorDetail': 'Missing access credentials. See https://developer.foursquare.com/docs/api/configuration/authentication for details.',
  'errorType': 'invalid_auth',
  'requestId': '5e628b2a760a7f001b8c2f02'},
 'response': {}}
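
The response above is an authentication error with an empty `response`, so indexing `results['response']['groups']` on it would raise a `KeyError`. A small guard can turn that into a readable message. The sketch below assumes the payload layout shown in this notebook (`meta.code`, `response.groups[0].items`); `extract_venue_items` is a hypothetical helper.

```python
def extract_venue_items(payload):
    """Return the list of venue items, or raise a readable error if the API call failed."""
    meta = payload.get("meta", {})
    if meta.get("code") != 200:
        raise RuntimeError("Foursquare request failed ({}): {}".format(
            meta.get("errorType", "unknown"),
            meta.get("errorDetail", "no detail")))
    groups = payload.get("response", {}).get("groups", [])
    return groups[0]["items"] if groups else []

# A failed payload shaped like the one printed above.
failed = {"meta": {"code": 400, "errorType": "invalid_auth",
                   "errorDetail": "Missing access credentials."},
          "response": {}}
try:
    extract_venue_items(failed)
    message = ""
except RuntimeError as err:
    message = str(err)
print(message)

# A minimal successful payload, made up for illustration.
ok = {"meta": {"code": 200},
      "response": {"groups": [{"items": [{"venue": {"name": "Palais Royal"}}]}]}}
items = extract_venue_items(ok)
```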

In [45]:
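# Save to dataframe
# transform JSON into a pandas dataframe (pandas 1.0 and later expose this as pd.json_normalize)
from pandas.io.json import json_normalize

# get_category_type is not defined elsewhere in this notebook; a minimal version is assumed here
def get_category_type(row):
    categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# keep only the name of the first category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)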

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Jardin du Palais Royal,Garden,48.864941,2.337728
1,Palais Royal,Historic Site,48.863236,2.337127
2,Comédie-Française,Theater,48.863088,2.336612
3,Place du Palais Royal,Plaza,48.862523,2.336688
4,La Clef Louvre Paris,Hotel,48.863977,2.33614


In [46]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

100 venues were returned by Foursquare.


**Explore all neighborhoods**

In [0]:
# Create a function to repeat the same process to all the neighborhoods
COL_NAME_VENUE = "Venue"
COL_NAME_CATEGORY = "Category"

COL_NAME_NEIGHBOURHOOD_LATITUDE = COL_NAME_NEIGHBOURHOOD + " " + COL_NAME_LATITUDE
COL_NAME_NEIGHBOURHOOD_LONGITUDE = COL_NAME_NEIGHBOURHOOD + " " + COL_NAME_LONGITUDE
COL_NAME_VENUE_LATITUDE = COL_NAME_VENUE + " " + COL_NAME_LATITUDE
COL_NAME_VENUE_LONGITUDE = COL_NAME_VENUE + " " + COL_NAME_LONGITUDE
COL_NAME_VENUE_CATEGORY = COL_NAME_VENUE + " " + COL_NAME_CATEGORY


def get_near_by_venues(names, latitudes, longitudes, radius=500):    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = [COL_NAME_NEIGHBOURHOOD, 
                             COL_NAME_NEIGHBOURHOOD_LATITUDE,
                             COL_NAME_NEIGHBOURHOOD_LONGITUDE,
                             COL_NAME_VENUE,
                             COL_NAME_VENUE_LATITUDE,
                             COL_NAME_VENUE_LONGITUDE,
                             COL_NAME_VENUE_CATEGORY]
    return nearby_venues

In [48]:
# dataframe that contains all the neighborhoods 
venues_neighbourhoods = get_near_by_venues(
    names=df_merged[COL_NAME_NEIGHBOURHOOD],
    latitudes=df_merged[COL_NAME_LATITUDE],                           
    longitudes=df_merged[COL_NAME_LONGITUDE])

75002,75003,75004,75005,75006,75007,75008,75009
75001,75003,75009,75010
75001,75002,75004,75010,75011
75001,75003,75005,75006,75011,75012
75001,75004,75006,75012,75013,75014
75001,75004,75005,75007,75014,75015
75001,75006,75008,75015,75016
75001,75007,75009,75016,75017,75018
75001,75002,75008,75010,75017,75018
75002,75003,75009,75011,75018,75019,75020
75003,75004,75010,75012,75019,75020
75004,75005,75011,75013,75020
75005,75012,75014
75005,75006,75013,75015
75006,75007,75014,75016
75007,75008,75015,75017
75008,75009,75016,75018
75008,75009,75010,75017,75019
75010,75011,75018,75020
75010,75011,75012,75019


In [50]:
# Get the shape of the dataframe
print("(row, column) = %s" % str(venues_neighbourhoods.shape))
venues_neighbourhoods.head(2)

(row, column) = (1399, 7)


Unnamed: 0,neighbourhood,neighbourhood latitude,neighbourhood longitude,Venue,Venue latitude,Venue longitude,Venue Category
0,"75002,75003,75004,75005,75006,75007,75008,75009",48.863554,2.338856,Jardin du Palais Royal,48.864941,2.337728,Garden
1,"75002,75003,75004,75005,75006,75007,75008,75009",48.863554,2.338856,Palais Royal,48.863236,2.337127,Historic Site


In [51]:
# check how many venues were returned for each neighborhood
venues_neighbourhoods.groupby(COL_NAME_NEIGHBOURHOOD).count()

neighbourhood,neighbourhood latitude,neighbourhood longitude,Venue,Venue latitude,Venue longitude,Venue Category
"75001,75002,75004,75010,75011",100,100,100,100,100,100
"75001,75002,75008,75010,75017,75018",100,100,100,100,100,100
"75001,75003,75005,75006,75011,75012",100,100,100,100,100,100
"75001,75003,75009,75010",100,100,100,100,100,100
"75001,75004,75005,75007,75014,75015",100,100,100,100,100,100
"75001,75004,75006,75012,75013,75014",100,100,100,100,100,100
"75001,75006,75008,75015,75016",56,56,56,56,56,56
"75001,75007,75009,75016,75017,75018",61,61,61,61,61,61
"75002,75003,75004,75005,75006,75007,75008,75009",100,100,100,100,100,100
"75002,75003,75009,75011,75018,75019,75020",57,57,57,57,57,57


In [52]:
print('There are {} distinct categories.'.format(
    len(venues_neighbourhoods[COL_NAME_VENUE_CATEGORY].unique())))

There are 199 distinct categories.


In [53]:
# All categories
venues_neighbourhoods[COL_NAME_VENUE_CATEGORY].unique()

array(['Garden', 'Historic Site', 'Theater', 'Plaza', 'Hotel',
       'Shoe Store', 'French Restaurant', 'Cheese Shop', 'Restaurant',
       'Bar', 'Smoke Shop', 'Café', 'Spa', 'Breakfast Spot',
       'Sculpture Garden', 'Coffee Shop', 'Pizza Place',
       'Ramen Restaurant', 'Bistro', 'Bakery', 'Wine Shop',
       'Udon Restaurant', 'Sandwich Place', 'Wine Bar', 'Art Museum',
       'Pedestrian Plaza', 'Japanese Restaurant', 'Chinese Restaurant',
       'Korean Restaurant', 'Brasserie', 'Clothing Store', 'Cocktail Bar',
       'Tea Room', 'Cosmetics Shop', 'Italian Restaurant', 'Exhibit',
       'Furniture / Home Store', 'General College & University',
       'Bubble Tea Shop', 'Shopping Mall', 'Perfume Shop',
       'Grocery Store', 'Gift Shop', 'Vietnamese Restaurant', 'Bookstore',
       'Beer Bar', 'Nightclub', 'Souvlaki Shop', 'Music Store',
       'Peruvian Restaurant', "Women's Store", 'Donut Shop',
       'Burger Joint', 'Greek Restaurant', 'Creperie', 'Ice Cream Shop',
    

**Analyze each neighborhood**

In [54]:
# one hot encoding
df_onehot = pd.get_dummies(venues_neighbourhoods[[COL_NAME_VENUE_CATEGORY]], 
                                        prefix="", 
                                        prefix_sep="")

# add neighborhood column back to dataframe
df_onehot[COL_NAME_NEIGHBOURHOOD] = venues_neighbourhoods[COL_NAME_NEIGHBOURHOOD] 

# move neighborhood column to the first column
fixed_columns = [df_onehot.columns[-1]] + list(df_onehot.columns[:-1])
df_onehot = df_onehot[fixed_columns]

df_onehot.head()

Unnamed: 0,neighbourhood,African Restaurant,Arepa Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Bagel Shop,Bakery,Bank,Bar,Basque Restaurant,Beer Bar,Beer Garden,Beer Store,Bistro,Bookstore,Boutique,Brasserie,Brazilian Restaurant,Breakfast Spot,Breton Restaurant,Brewery,Bubble Tea Shop,Burger Joint,Burgundian Restaurant,Bus Stop,Café,Cambodian Restaurant,Candy Store,Caucasian Restaurant,Ch'ti Restaurant,Cheese Shop,Chinese Restaurant,Chocolate Shop,Church,Clothing Store,Cocktail Bar,Coffee Shop,Comedy Club,...,Sculpture Garden,Seafood Restaurant,Shoe Store,Shop & Service,Shopping Mall,Skate Park,Smoke Shop,Snack Place,Soccer Stadium,Southern / Soul Food Restaurant,Souvlaki Shop,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Sports Club,Stadium,Steakhouse,Supermarket,Sushi Restaurant,Tapas Restaurant,Tea Room,Tennis Court,Tennis Stadium,Thai Restaurant,Theater,Toy / Game Store,Track Stadium,Train Station,Tram Station,Turkish Restaurant,Udon Restaurant,Used Bookstore,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,7500275003750047500575006750077500875009,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,7500275003750047500575006750077500875009,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,7500275003750047500575006750077500875009,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,7500275003750047500575006750077500875009,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,7500275003750047500575006750077500875009,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [55]:
# Get the shape of the dataframe
df_onehot.shape

(1399, 200)

In [58]:
# Group rows by neighborhood and take the mean frequency of each category
df_grouped = df_onehot.groupby(COL_NAME_NEIGHBOURHOOD).mean().reset_index()
df_grouped.head(2)

Unnamed: 0,neighbourhood,African Restaurant,Arepa Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Bagel Shop,Bakery,Bank,Bar,Basque Restaurant,Beer Bar,Beer Garden,Beer Store,Bistro,Bookstore,Boutique,Brasserie,Brazilian Restaurant,Breakfast Spot,Breton Restaurant,Brewery,Bubble Tea Shop,Burger Joint,Burgundian Restaurant,Bus Stop,Café,Cambodian Restaurant,Candy Store,Caucasian Restaurant,Ch'ti Restaurant,Cheese Shop,Chinese Restaurant,Chocolate Shop,Church,Clothing Store,Cocktail Bar,Coffee Shop,Comedy Club,...,Sculpture Garden,Seafood Restaurant,Shoe Store,Shop & Service,Shopping Mall,Skate Park,Smoke Shop,Snack Place,Soccer Stadium,Southern / Soul Food Restaurant,Souvlaki Shop,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Sports Club,Stadium,Steakhouse,Supermarket,Sushi Restaurant,Tapas Restaurant,Tea Room,Tennis Court,Tennis Stadium,Thai Restaurant,Theater,Toy / Game Store,Track Stadium,Train Station,Tram Station,Turkish Restaurant,Udon Restaurant,Used Bookstore,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,7500175002750047501075011,0.0,0.0,0.03,0.01,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.01,0.0,0.0,0.04,0.02,0.03,0.01,0.0,0.0,0.0,0.0,0.01,0.04,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.02,0.02,0.05,0.0,...,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.03,0.0,0.0,0.0
1,750017500275008750107501775018,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.04,0.0,0.04,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.0,0.05,0.02,0.0,...,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.01,0.01,0.02,0.0,0.0,0.0
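
Averaging one-hot rows per neighborhood yields, for each category, the fraction of that neighborhood's venues belonging to it. The toy example below uses made-up venues to show the mechanics; the labels `A` and `B` are placeholders.

```python
import pandas as pd

# Made-up venues for two placeholder neighbourhoods.
venues = pd.DataFrame({
    "neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Café", "Café", "Bakery", "Hotel",
                       "Hotel", "Ramen Restaurant"],
})

# One column per category, with 1 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "neighbourhood", venues["neighbourhood"])

# The mean of 0/1 indicators is the share of venues in each category.
frequencies = onehot.groupby("neighbourhood").mean()
print(frequencies)
```

Each row of `frequencies` sums to 1, so the values are comparable across neighborhoods even when Foursquare returns different numbers of venues for them.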


In [59]:
# Print each neighborhood along with the top 5 most common venues
num_top_venues = 5
COL_NAME_FREQUENCE = 'freq'

for hood in df_grouped[COL_NAME_NEIGHBOURHOOD]:
    print("----"+hood+"----")
    temp = df_grouped[df_grouped[COL_NAME_NEIGHBOURHOOD] == hood].T.reset_index()
    temp.columns = [COL_NAME_VENUE, COL_NAME_FREQUENCE]
    temp = temp.iloc[1:]
    temp[COL_NAME_FREQUENCE] = temp[COL_NAME_FREQUENCE].astype(float)
    temp = temp.round({COL_NAME_FREQUENCE: 2})
    print(temp.sort_values(COL_NAME_FREQUENCE, ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----75001,75002,75004,75010,75011----
                 Venue  freq
0    French Restaurant  0.07
1                 Café  0.05
2          Coffee Shop  0.05
3  Japanese Restaurant  0.04
4               Bakery  0.04


----75001,75002,75008,75010,75017,75018----
                Venue  freq
0   French Restaurant  0.17
1               Hotel  0.12
2        Cocktail Bar  0.05
3  Italian Restaurant  0.04
4              Bakery  0.04


----75001,75003,75005,75006,75011,75012----
               Venue  freq
0  French Restaurant  0.10
1     Clothing Store  0.05
2        Pastry Shop  0.04
3              Hotel  0.03
4             Bakery  0.03


----75001,75003,75009,75010----
               Venue  freq
0  French Restaurant  0.10
1       Cocktail Bar  0.06
2           Wine Bar  0.05
3             Bakery  0.05
4              Hotel  0.04


----75001,75004,75005,75007,75014,75015----
               Venue  freq
0  French Restaurant  0.12
1          Bookstore  0.05
2              Hotel  0.05
3               

In [0]:
# Create a function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
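
To see what the function above does, it helps to run the same logic on a single hand-written row. The sketch below re-implements it under the hypothetical name `top_categories`, with made-up frequencies.

```python
import pandas as pd

def top_categories(row, num_top_venues):
    """Return the category names with the highest frequency, skipping the first (label) entry."""
    frequencies = row.iloc[1:].astype(float)
    return frequencies.sort_values(ascending=False).index.values[:num_top_venues]

# The first entry is the neighbourhood label; the rest are category frequencies.
row = pd.Series({"neighbourhood": "75001", "Bakery": 0.04, "Café": 0.05,
                 "French Restaurant": 0.07, "Hotel": 0.01})
top3 = list(top_categories(row, 3))
print(top3)
```

When two categories have the same frequency, their relative order is not guaranteed, which may explain small differences between the top-5 printout and the top-10 table.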

**Create the new dataframe and display the top 10 venues for each neighborhood**

In [62]:
import numpy as np

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = [COL_NAME_NEIGHBOURHOOD]
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted[COL_NAME_NEIGHBOURHOOD] = df_grouped[COL_NAME_NEIGHBOURHOOD]

for ind in np.arange(df_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(df_grouped.iloc[ind, :], 
                                                                          num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"75001,75002,75004,75010,75011",French Restaurant,Café,Coffee Shop,Japanese Restaurant,Burger Joint,Bistro,Bakery,Gourmet Shop,Restaurant,Sandwich Place
1,"75001,75002,75008,75010,75017,75018",French Restaurant,Hotel,Cocktail Bar,Bistro,Italian Restaurant,Bar,Bakery,Japanese Restaurant,Lounge,Theater
2,"75001,75003,75005,75006,75011,75012",French Restaurant,Clothing Store,Pastry Shop,Bakery,Ice Cream Shop,Wine Bar,Gourmet Shop,Hotel,Garden,Furniture / Home Store
3,"75001,75003,75009,75010",French Restaurant,Cocktail Bar,Bakery,Wine Bar,Italian Restaurant,Hotel,Bistro,Coffee Shop,Thai Restaurant,Restaurant
4,"75001,75004,75005,75007,75014,75015",French Restaurant,Hotel,Bookstore,Plaza,Bar,Seafood Restaurant,Café,Creperie,Coffee Shop,Lebanese Restaurant
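
The column labels above rely on a three-item suffix list plus a fallback to `th`, which is fine for ten columns but would produce labels such as `21th` for longer lists. A small helper can generate the suffix directly; `ordinal` is a hypothetical name, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th, 21st."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

# Same column layout as in the notebook: a label column plus ten ranked venue columns.
columns = ["neighbourhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```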


# Evaluation

In [63]:
# Run k-means to cluster the neighborhoods into 6 clusters.
# import k-means from scikit-learn
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 6

clustering_grouped_paris = df_grouped.drop(COL_NAME_NEIGHBOURHOOD, axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(clustering_grouped_paris)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 3, 1, 1, 1, 1, 3, 3, 3, 3], dtype=int32)
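
The notebook fixes the number of clusters at 6 without stating how that value was chosen. One common way to compare candidate values is the silhouette score. The sketch below runs on synthetic points with three obvious groups, not on the Paris data, and the range of `k` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated groups of ten points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

# Score each candidate number of clusters; higher silhouette is better.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On the real data there are only 20 rows, so the scores may be noisy and should be read as a rough guide rather than a definitive answer.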

In [64]:
# Create a new dataframe
COL_NAME_CLUSTER_LABELS = 'Cluster Labels'

# add clustering labels
neighborhoods_venues_sorted.insert(0, COL_NAME_CLUSTER_LABELS, kmeans.labels_)

df_merged_paris = df_merged

df_merged_paris = df_merged_paris.join(neighborhoods_venues_sorted.set_index(COL_NAME_NEIGHBOURHOOD), 
                                                           on=COL_NAME_NEIGHBOURHOOD)

df_merged_paris.head(2) 

Unnamed: 0,postcode,neighbourhood,address,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,75001,"75002,75003,75004,75005,75006,75007,75008,75009","75001, FR",48.863554,2.338856,3,French Restaurant,Hotel,Japanese Restaurant,Café,Plaza,Coffee Shop,Historic Site,Bakery,Udon Restaurant,Bistro
1,75002,"75001,75003,75009,75010","75002, FR",48.867418,2.344256,1,French Restaurant,Cocktail Bar,Bakery,Wine Bar,Italian Restaurant,Hotel,Bistro,Coffee Shop,Thai Restaurant,Restaurant


In [67]:
df_merged_paris.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to 19
Data columns (total 16 columns):
postcode                  20 non-null object
neighbourhood             20 non-null object
address                   20 non-null object
latitude                  20 non-null float64
longitude                 20 non-null float64
Cluster Labels            20 non-null int32
1st Most Common Venue     20 non-null object
2nd Most Common Venue     20 non-null object
3rd Most Common Venue     20 non-null object
4th Most Common Venue     20 non-null object
5th Most Common Venue     20 non-null object
6th Most Common Venue     20 non-null object
7th Most Common Venue     20 non-null object
8th Most Common Venue     20 non-null object
9th Most Common Venue     20 non-null object
10th Most Common Venue    20 non-null object
dtypes: float64(2), int32(1), object(13)
memory usage: 3.2+ KB


**Visualize the clusters**

In [68]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Get the geographical coordinates of Paris, France
latitude, longitude = get_latitude_longitude_paris_fr()
print('The geographical coordinates of Paris, FR are {}, {}.'.format(latitude, longitude))
# ------------------------------------------------------------------------------------------------

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one marker per district, colored by cluster
# (cluster labels are 0-based; the popup shows them 1-based)
for lat, lon, district, neighbourhood, cluster in zip(df_merged_paris[COL_NAME_LATITUDE],
                                                      df_merged_paris[COL_NAME_LONGITUDE],
                                                      df_merged_paris[COL_NAME_POSTCODE],
                                                      df_merged_paris[COL_NAME_NEIGHBOURHOOD],
                                                      df_merged_paris[COL_NAME_CLUSTER_LABELS]):
    label = 'District: {}, Neighbourhood: {}, Cluster: {}'.format(district, neighbourhood, cluster + 1)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],       # index by the 0-based label directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

The geographical coordinates of Paris, FR are 48.8566969, 2.3514616.


**Examine clusters according to distinct venue categories**

In [69]:
# Cluster 1
df_merged_paris.loc[df_merged_paris[COL_NAME_CLUSTER_LABELS] == 0, 
                    df_merged_paris.columns[[1] + list(range(5, df_merged_paris.shape[1]))]]

Unnamed: 0,neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,7500475005750117501375020,0,French Restaurant,Hotel,Beer Garden,Museum,Garden,Skate Park,Chinese Restaurant,Steakhouse,Coffee Shop,Convenience Store


In [70]:
# Cluster 2
df_merged_paris.loc[df_merged_paris[COL_NAME_CLUSTER_LABELS] == 1, 
                    df_merged_paris.columns[[1] + list(range(5, df_merged_paris.shape[1]))]]

Unnamed: 0,neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,75001750037500975010,1,French Restaurant,Cocktail Bar,Bakery,Wine Bar,Italian Restaurant,Hotel,Bistro,Coffee Shop,Thai Restaurant,Restaurant
2,7500175002750047501075011,1,French Restaurant,Café,Coffee Shop,Japanese Restaurant,Burger Joint,Bistro,Bakery,Gourmet Shop,Restaurant,Sandwich Place
3,750017500375005750067501175012,1,French Restaurant,Clothing Store,Pastry Shop,Bakery,Ice Cream Shop,Wine Bar,Gourmet Shop,Hotel,Garden,Furniture / Home Store
4,750017500475006750127501375014,1,French Restaurant,Café,Bar,Hotel,Coffee Shop,Bookstore,Plaza,Bakery,Creperie,Bistro
5,750017500475005750077501475015,1,French Restaurant,Hotel,Bookstore,Plaza,Bar,Seafood Restaurant,Café,Creperie,Coffee Shop,Lebanese Restaurant
10,750037500475010750127501975020,1,French Restaurant,Pizza Place,Coffee Shop,Bar,Italian Restaurant,Hotel,Bookstore,Bistro,Pub,Cocktail Bar
12,750057501275014,1,French Restaurant,Vietnamese Restaurant,Bar,Bakery,Thai Restaurant,Hotel,Bistro,Japanese Restaurant,Italian Restaurant,Juice Bar
17,7500875009750107501775019,1,French Restaurant,Bar,Hotel,Pizza Place,Bistro,Gastropub,Café,Italian Restaurant,Supermarket,Restaurant
18,75010750117501875020,1,French Restaurant,Bar,Park,Bistro,Pool,Restaurant,Café,Moroccan Restaurant,Greek Restaurant,Bus Stop


In [71]:
# Cluster 3
df_merged_paris.loc[df_merged_paris[COL_NAME_CLUSTER_LABELS] == 2, 
                    df_merged_paris.columns[[1] + list(range(5, df_merged_paris.shape[1]))]]

Unnamed: 0,neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
13,75005750067501375015,2,Bakery,Supermarket,Hotel,Japanese Restaurant,Theater,Café,Flea Market,Plaza,Fast Food Restaurant,Stadium


In [72]:
# Cluster 4
df_merged_paris.loc[df_merged_paris[COL_NAME_CLUSTER_LABELS] == 3, 
                    df_merged_paris.columns[[1] + list(range(5, df_merged_paris.shape[1]))]]

Unnamed: 0,neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,7500275003750047500575006750077500875009,3,French Restaurant,Hotel,Japanese Restaurant,Café,Plaza,Coffee Shop,Historic Site,Bakery,Udon Restaurant,Bistro
6,7500175006750087501575016,3,French Restaurant,Hotel,Café,Plaza,Italian Restaurant,History Museum,Art Museum,Garden,Park,Historic Site
7,750017500775009750167501775018,3,French Restaurant,Hotel,Bakery,Cocktail Bar,Spa,Theater,Japanese Restaurant,Art Gallery,Cycle Studio,Brewery
8,750017500275008750107501775018,3,French Restaurant,Hotel,Cocktail Bar,Bistro,Italian Restaurant,Bar,Bakery,Japanese Restaurant,Lounge,Theater
9,75002750037500975011750187501975020,3,French Restaurant,Hotel,Indian Restaurant,Coffee Shop,Japanese Restaurant,Restaurant,Bakery,Breton Restaurant,Café,Record Shop
14,75006750077501475016,3,French Restaurant,Hotel,Japanese Restaurant,Italian Restaurant,Coffee Shop,Seafood Restaurant,Dessert Shop,Beer Store,Pizza Place,Garden


In [73]:
# Cluster 5
df_merged_paris.loc[df_merged_paris[COL_NAME_CLUSTER_LABELS] == 4, 
                    df_merged_paris.columns[[1] + list(range(5, df_merged_paris.shape[1]))]]

Unnamed: 0,neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,75010750117501275019,4,Hotel,French Restaurant,Tram Station,Supermarket,Japanese Restaurant,Tennis Court,Music Venue,Fast Food Restaurant,Discount Store,Pharmacy


In [74]:
# Cluster 6
df_merged_paris.loc[df_merged_paris[COL_NAME_CLUSTER_LABELS] == 5, 
                    df_merged_paris.columns[[1] + list(range(5, df_merged_paris.shape[1]))]]

Unnamed: 0,neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
15,75007750087501575017,5,French Restaurant,Italian Restaurant,Bakery,Japanese Restaurant,Plaza,Bar,Seafood Restaurant,Train Station,Grocery Store,Sandwich Place
16,75008750097501675018,5,French Restaurant,Italian Restaurant,Hotel,Bakery,Bistro,Sushi Restaurant,Japanese Restaurant,Bar,Restaurant,Diner
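
The six per-cluster views above can also be summarised by counting how many districts fall into each cluster. The following is a minimal sketch, assuming the labels are available as a plain list; here they are copied by row index (0 to 19) from the `Cluster Labels` column of the tables above, and `cluster_sizes` is a hypothetical helper name, not part of the notebook.

```python
from collections import Counter

# Cluster labels by row index 0..19, copied from the "Cluster Labels" column
# of the tables above (in the notebook: df_merged_paris[COL_NAME_CLUSTER_LABELS]).
cluster_labels = [3, 1, 1, 1, 1, 1, 3, 3, 3, 3, 1, 0, 1, 2, 3, 5, 5, 1, 1, 4]


def cluster_sizes(labels):
    """Return {1-based cluster number: number of districts}, ordered by cluster."""
    counts = Counter(labels)
    return {label + 1: counts[label] for label in sorted(counts)}


sizes = cluster_sizes(cluster_labels)
for cluster, size in sizes.items():
    print("Cluster {}: {} district(s)".format(cluster, size))

# Clusters that contain a single district
singletons = [cluster for cluster, size in sizes.items() if size == 1]
print("Single-district clusters:", singletons)
```

With these labels, clusters 2 and 4 hold 15 of the 20 districts, while clusters 1, 3 and 5 each contain a single district.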


# Conclusion

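In this work, we grouped the 20 districts of Paris into six clusters based on the 10 most common venues of each district, and examined each cluster (numbered 1 to 6, that is, cluster label + 1).

We observed that French Restaurant is omnipresent: it is the most common venue in every district of clusters 1, 2, 4 and 6 (18 of the 20 districts) and the second most common venue in cluster 5. Only cluster 3, which contains a single district, has no French Restaurant in its top 10.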

However, these results are neither sufficient nor specific enough to recommend districts for Asian restaurants, because Asian venue categories rarely rank high. For example, in cluster 1, Chinese Restaurant is only the seventh most common venue. In cluster 2, Vietnamese Restaurant is the second most common venue in one district, while Thai Restaurant is the fifth or ninth most common venue depending on the district.
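
To make this observation easier to compare across districts, the ranks of Asian categories in each top-10 list can be turned into a single number. The sketch below is only an illustration: the set of categories counted as "Asian" and the rank-based score are assumptions introduced here, and `asian_ranks` and `asian_score` are hypothetical helper names. The three venue lists are copied from the cluster tables above (row indices 0, 11 and 12).

```python
# Venue categories counted as "Asian" in this sketch: an assumption,
# limited to categories that appear in the tables above.
ASIAN_CATEGORIES = {
    "Chinese Restaurant", "Japanese Restaurant", "Vietnamese Restaurant",
    "Thai Restaurant", "Indian Restaurant", "Udon Restaurant", "Sushi Restaurant",
}

# Top-10 venue lists copied from the cluster tables above, keyed by row index.
top_venues = {
    0: ["French Restaurant", "Hotel", "Japanese Restaurant", "Café", "Plaza",
        "Coffee Shop", "Historic Site", "Bakery", "Udon Restaurant", "Bistro"],
    11: ["French Restaurant", "Hotel", "Beer Garden", "Museum", "Garden",
         "Skate Park", "Chinese Restaurant", "Steakhouse", "Coffee Shop",
         "Convenience Store"],
    12: ["French Restaurant", "Vietnamese Restaurant", "Bar", "Bakery",
         "Thai Restaurant", "Hotel", "Bistro", "Japanese Restaurant",
         "Italian Restaurant", "Juice Bar"],
}


def asian_ranks(venues):
    """Return the 1-based ranks at which an Asian category appears."""
    return [rank for rank, venue in enumerate(venues, start=1)
            if venue in ASIAN_CATEGORIES]


def asian_score(venues):
    """Hypothetical score: an Asian category at rank r adds (len(venues) + 1 - r)."""
    n = len(venues)
    return sum(n + 1 - rank for rank in asian_ranks(venues))


# Rank the districts (row indices) from highest to lowest score
ranking = sorted(top_venues, key=lambda idx: asian_score(top_venues[idx]), reverse=True)
for idx in ranking:
    print(idx, asian_ranks(top_venues[idx]), asian_score(top_venues[idx]))
```

A score of this kind only reflects the rank of a category among the most common venues, not the actual number of restaurants, so it should be read as a rough ordering rather than a measure of demand.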

As future work, we should add more relevant features for each district, such as transport information (public transport, parking, etc.), the presence of Asian communities, and the location of major tourist venues.

We could also experiment with other clustering algorithms: fuzzy c-means, DBSCAN (density-based clustering), hierarchical k-means clustering, and deep learning models.
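
As a starting point for such experiments, two of these families are available in scikit-learn with the same `fit_predict` interface as `KMeans`. The sketch below uses plain agglomerative (hierarchical) clustering and DBSCAN on a small invented matrix standing in for the venue-frequency features; the data, `n_clusters`, `eps` and `min_samples` are illustrative assumptions, not values tuned for the Paris dataset.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN

# Toy stand-in for the venue-frequency matrix (rows: districts, columns: venue
# categories). These numbers are invented for illustration only.
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.9, 0.0],
    [0.2, 0.8, 0.0],
    [0.0, 0.1, 0.9],
    [0.0, 0.2, 0.8],
])

# Hierarchical (agglomerative) clustering with a fixed number of clusters.
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Density-based clustering: no cluster count is given; eps and min_samples
# are illustrative values chosen for the toy data. A label of -1 marks noise.
dbscan_labels = DBSCAN(eps=0.3, min_samples=2).fit_predict(X)

print("Agglomerative:", agglo_labels)
print("DBSCAN:", dbscan_labels)
```

On the real features, `eps` would need to be chosen from the distribution of pairwise distances between districts; with only 20 rows, DBSCAN may label many districts as noise.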



# References


1. The tutorials in course "Applied Data Science Capstone": https://www.coursera.org/learn/applied-data-science-capstone/

2. Paris Arrondissements & Neighborhoods Map: https://parismap360.com/paris-arrondissement-map#.XfVpqtEo91l

3. Arrondissements in Paris, France: https://francetravelplanner.com/go/paris/areas/arrondismt.html
