# Comparative Analysis of Neighborhoods | Venue Data Preparation

We will now collect data on local amenities and services such as cafes, schools, shops, and hospitals.

We will use the API of the Foursquare location platform to retrieve up-to-date venue names, venue types (categories), and venue locations.

Note that with the free HTTP request version, each search is limited to 100 places. Reference documentation:
- Venues: https://docs.foursquare.com/developer/reference/foursquare-apis-overview  
- Categories: https://docs.foursquare.com/data-products/docs/categories#places-api--flat-file

## [1] Working environment set up

Before starting, we need to install and import libraries.

In [27]:
# Data Access and Web Scraping
!pip install requests
import requests

# Data Storage and File Handling
import json
from pandas import json_normalize

# Data Manipulation and Processing
!pip install pandas
import pandas as pd
import numpy as np

print("Libraries imported.")

Libraries imported.


## [2] Data Collection

Before starting, we need to open the geolocation dataframe and dictionary.

In [3]:
# Load from CSV file
cities_df = pd.read_csv('geolocation_df_output.csv', encoding='utf-8')

In [5]:
# Load from JSON file
with open("geolocation_dic_output.json", "r") as f:
    citiesinfo_dic = json.load(f)

Define Foursquare credentials and version.

In [6]:
# @hidden_cell
CLIENT_ID = 'GPPJ4ZU41Q1I0ABMD3EKVNQS2DX5CHOOUVLJIZ40NE2XF1X0' # Foursquare ID
CLIENT_SECRET = 'ZZXSQUNX3RX13XAANMWZNM2IRHEOR2RC5HBBOL01XUCOB1DF' # Foursquare Secret
VERSION = '20180605' # Foursquare API version
print('Your credentials:')
print('CLIENT_ID: '+ CLIENT_ID)
print('CLIENT_SECRET: '+ CLIENT_SECRET)

Your credentials:
CLIENT_ID: GPPJ4ZU41Q1I0ABMD3EKVNQS2DX5CHOOUVLJIZ40NE2XF1X0
CLIENT_SECRET: ZZXSQUNX3RX13XAANMWZNM2IRHEOR2RC5HBBOL01XUCOB1DF
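
Hard-coding and printing credentials exposes them to anyone who reads the notebook. The following is a minimal sketch of an alternative, assuming the keys are stored in two environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are hypothetical and not part of the original notebook.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the credentials from environment variables instead of hard-coding them.

    The variable names below are assumptions made for this sketch.
    """
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set in the environment.")
    return client_id, client_secret

def mask(secret, visible=4):
    """Keep only the first few characters so the secret is never printed in full."""
    return secret[:visible] + "*" * (len(secret) - visible)

# Demonstration with placeholder values passed as a plain dictionary
demo_id, demo_secret = load_foursquare_credentials(
    {"FOURSQUARE_CLIENT_ID": "ABCDEFGH", "FOURSQUARE_CLIENT_SECRET": "12345678"}
)
print("CLIENT_ID: " + mask(demo_id))
print("CLIENT_SECRET: " + mask(demo_secret))
```

In the notebook itself, calling `load_foursquare_credentials()` without arguments would read from the real environment.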


To retrieve data from Foursquare efficiently without exceeding API limits, we first test our approach on the city coordinates previously extracted with the Nominatim geocoder.

Once validated, we will develop a function that extracts the venue information associated with each postal code, using the coordinates extracted with the ArcGIS geocoder.

For each neighborhood, we set the search radius to 1000 meters. The number of places retrieved per neighborhood is limited to 100, since that is the Foursquare limit per search.

Let's begin by evaluating the data returned by the Foursquare API:

In [7]:
# Search radius in meters
radius = 1000
# Call response limit (#)
LIMIT = 100
# Test the first coordinates
lat_test = citiesinfo_dic['coordinates'][0][0]
long_test = citiesinfo_dic['coordinates'][0][1]

# Create the url to use the Foursquare API's "explore" endpoint
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    lat_test,
    long_test, 
    radius,
    LIMIT)
# Send a GET request to the Foursquare API for a response returned in JSON format
results = requests.get(url).json()
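
As a side note, the query string can also be assembled from a dictionary of parameters, which keeps each parameter next to its name and takes care of URL encoding. The sketch below uses only the standard library; the function name `build_explore_url` and the placeholder credentials and coordinates are assumptions made for illustration.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=1000, limit=100):
    """Assemble the 'explore' URL used above from a dictionary of parameters."""
    base_url = "https://api.foursquare.com/v2/venues/explore"
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-encodes the values (the comma in "ll" becomes %2C)
    return f"{base_url}?{urlencode(params)}"

# Placeholder credentials and coordinates, for illustration only
example_url = build_explore_url("MY_ID", "MY_SECRET", "20180605", 46.8, -71.2)
print(example_url)
```

The resulting string can be passed to `requests.get` exactly like the `url` built with `format`.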

Identify the type of data or columns returned by the Foursquare API call to ensure we filter and select only the relevant information for our analysis.

In [8]:
# Extract the list of venues and normalise it back into pandas
venues_list = results['response']['groups'][0]['items']
nearbyvenues_df = json_normalize(venues_list)
# Display the dataframe columns
nearbyvenues_df.columns

Index(['referralId', 'reasons.count', 'reasons.items', 'venue.id',
       'venue.name', 'venue.location.address', 'venue.location.lat',
       'venue.location.lng', 'venue.location.labeledLatLngs',
       'venue.location.distance', 'venue.location.postalCode',
       'venue.location.cc', 'venue.location.city', 'venue.location.state',
       'venue.location.country', 'venue.location.formattedAddress',
       'venue.categories', 'venue.photos.count', 'venue.photos.groups',
       'venue.createdAt', 'photo.id', 'photo.createdAt', 'photo.prefix',
       'photo.suffix', 'photo.width', 'photo.height', 'photo.visibility',
       'venue.location.crossStreet', 'venue.venuePage.id'],
      dtype='object')

In [9]:
# Select and display only the columns we want
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearbyvenues_df = nearbyvenues_df.loc[:, filtered_columns]

In [10]:
# Check the results
nearbyvenues_df.head()

Unnamed: 0,venue.name,venue.categories,venue.location.lat,venue.location.lng
0,Centre Vidéotron,"[{'id': '4bf58dd8d48988d185941735', 'name': 'H...",46.822707,-71.25077
1,Centre de foires de Québec,"[{'id': '4bf58dd8d48988d1ff931735', 'name': 'C...",46.825676,-71.249599
2,Fruiterie 440,"[{'id': '4bf58dd8d48988d118951735', 'name': 'G...",46.820783,-71.256018
3,ExpoCité,"[{'id': '4d4b7104d754a06370d81259', 'name': 'A...",46.826695,-71.24958
4,Le Grand Marché de Québec,"[{'id': '50be8ee891d4fa8dcc7199a7', 'name': 'M...",46.828985,-71.245578


Venues in the Foursquare API can belong to multiple categories, and each category is represented by a dictionary containing details such as the category id and name. We therefore need a function that returns the name of the first category in the list, which is typically the venue's main category.

In [11]:
# Build a function to extract a clean category name
def get_category_type(row):
    categories_list = row['venue.categories']
    return categories_list[0]['name'] if len(categories_list) > 0 else None
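
The function above assumes that `venue.categories` always holds a list. A slightly more defensive variant is sketched here, under the assumption that some rows might carry a missing value (`None` or `NaN`) instead of a list. The name `first_category_name` and the sample ids are hypothetical.

```python
def first_category_name(categories):
    """Return the name of the first category, or None when no usable category exists."""
    # Missing values may arrive as None or NaN (a float) rather than a list
    if not isinstance(categories, list) or len(categories) == 0:
        return None
    first = categories[0]
    # .get returns None if a category dictionary has no 'name' key
    return first.get("name") if isinstance(first, dict) else None

# Hand-written sample, not actual API output
sample = [{"id": "x1", "name": "Grocery Store"}, {"id": "x2", "name": "Market"}]
print(first_category_name(sample))        # Grocery Store
print(first_category_name([]))            # None
print(first_category_name(float("nan")))  # None
```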

In [12]:
# Apply the function built to the categories in our dataframe 
nearbyvenues_df['venue.categories'] = nearbyvenues_df.apply(get_category_type, axis=1)

# Clean each column name by only retaining the last part
nearbyvenues_df.columns = [col.split(".")[-1] for col in nearbyvenues_df.columns]

# Check the result in our dataframe
nearbyvenues_df.head(5)

Unnamed: 0,name,categories,lat,lng
0,Centre Vidéotron,Hockey Stadium,46.822707,-71.25077
1,Centre de foires de Québec,Convention Center,46.825676,-71.249599
2,Fruiterie 440,Grocery Store,46.820783,-71.256018
3,ExpoCité,Arts and Entertainment,46.826695,-71.24958
4,Le Grand Marché de Québec,Market,46.828985,-71.245578


Now that we have confirmed that the data can be extracted from Foursquare and formatted for our exercise, we can explore the venues near all our city neighborhoods.

In [13]:
# Create a function to repeat the same process for all the neighborhoods in our cities
def get_nearby_venues(cities, neighborhoods, latitudes, longitudes, radius = 1000):
    venues_list = []
    for city, neighborhood, lat, lng in zip(cities, neighborhoods, latitudes, longitudes):
        
        # Create the URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # Send the GET request
        results_list = requests.get(url).json()['response']['groups'][0]['items']
        
        # Return only relevant information for each nearby venue
        venues_list.append([(
            city,
            neighborhood, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name'] if len(v['venue']['categories']) > 0 else None) for v in results_list])

    # Create the dataframe
    returnvenues_df = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    returnvenues_df.columns = ['City',
                               'Neighborhood', 
                               'Neighborhood Latitude', 
                               'Neighborhood Longitude',
                               'Venue', 
                               'Venue Latitude', 
                               'Venue Longitude', 
                               'Venue Category']
    
    return returnvenues_df
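
Note that the chained lookup `['response']['groups'][0]['items']` raises a `KeyError` or `IndexError` whenever a response lacks those fields, for example if a request fails. One possible way to make the parsing step tolerant is sketched below. The helper `extract_venue_rows` is hypothetical, and the payload is hand-written to mimic the structure used above rather than real API output.

```python
def extract_venue_rows(payload, city, neighborhood, lat, lng):
    """Turn one parsed 'explore' response into a list of row tuples."""
    groups = payload.get("response", {}).get("groups", [])
    # An error response or an empty area may contain no groups at all
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        rows.append((
            city, neighborhood, lat, lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("name") if categories else None,
        ))
    return rows

# Hand-written payloads for illustration
ok_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe X", "location": {"lat": 1.0, "lng": 2.0},
               "categories": [{"name": "Cafe"}]}},
    {"venue": {"name": "Spot Y", "location": {"lat": 3.0, "lng": 4.0},
               "categories": []}},
]}]}}
error_payload = {"meta": {"code": 429}, "response": {}}

print(extract_venue_rows(ok_payload, "City A", "N1", 0.0, 0.0))
print(extract_venue_rows(error_payload, "City A", "N1", 0.0, 0.0))  # []
```

With such a helper, a neighborhood whose request fails contributes no rows instead of stopping the whole loop.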

In [14]:
# Run the function on each neighborhood and create a new dataframe
nearbyvenues_df = get_nearby_venues(cities = cities_df['City'],
                                    neighborhoods = cities_df['Neighborhood'],
                                    latitudes = cities_df['Latitude'],
                                    longitudes = cities_df['Longitude']
                                   )

# Display the number of venues found for each city
for city_name, city_data in nearbyvenues_df.groupby('City'):
    print(f"We have located {city_data.shape[0]} venues across neighborhoods in {city_name}.")
    
print(f"A total of {nearbyvenues_df.shape[0]} venues have been identified across all cities.")
nearbyvenues_df.head()

We have located 4757 venues across neighborhoods in Montreal.
We have located 1986 venues across neighborhoods in Ottawa.
We have located 1945 venues across neighborhoods in Paris.
We have located 3146 venues across neighborhoods in Quebec City.
We have located 4743 venues across neighborhoods in Toronto.
We have located 10314 venues across neighborhoods in Vancouver.
A total of 26891 venues have been identified across all cities.


Unnamed: 0,City,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Quebec City,Quebec Provincial Government,46.809315,-71.213023,Assemblée nationale du Québec,46.808703,-71.214151,Capitol Building
1,Quebec City,Quebec Provincial Government,46.809315,-71.213023,Fontaine de Tourny,46.809683,-71.212636,Plaza
2,Quebec City,Quebec Provincial Government,46.809315,-71.213023,Parc de l'Esplanade,46.810424,-71.211505,Park
3,Quebec City,Quebec Provincial Government,46.809315,-71.213023,Hôtel Manoir D'auteuil,46.811391,-71.211379,Bed and Breakfast
4,Quebec City,Quebec Provincial Government,46.809315,-71.213023,Le Saint-Amour,46.811331,-71.210221,French Restaurant


## [3] Data Cleaning and Formatting

Remove duplicate venues only for full-city analyses. When two neighborhoods are close to each other, their search radii overlap and the same venue is retrieved once for each neighborhood.
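
To illustrate on a toy dataframe (made-up values), the sketch below contrasts two pandas calls: `duplicated(keep=False)` flags every row of a duplicate group, including the first occurrence, whereas `drop_duplicates` removes only the repeats. The two counts therefore differ.

```python
import pandas as pd

# Toy data: "Cafe X" is returned for both neighborhoods N1 and N2
toy_df = pd.DataFrame({
    "City": ["A", "A", "A", "A"],
    "Neighborhood": ["N1", "N2", "N1", "N2"],
    "Venue": ["Cafe X", "Cafe X", "Park Y", "Shop Z"],
    "Venue Latitude": [1.0, 1.0, 2.0, 3.0],
    "Venue Longitude": [5.0, 5.0, 6.0, 7.0],
})
key = ["City", "Venue", "Venue Latitude", "Venue Longitude"]

# keep=False flags every member of a duplicate group
rows_in_duplicate_groups = int(toy_df.duplicated(subset=key, keep=False).sum())
# drop_duplicates keeps the first occurrence of each group
unique_df = toy_df.drop_duplicates(subset=key)
rows_removed = len(toy_df) - len(unique_df)

print(rows_in_duplicate_groups, rows_removed, len(unique_df))  # 2 1 3
```

Here two rows belong to a duplicate group, but only one row is actually removed.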

In [15]:
# Flag every row that belongs to a duplicate group (keep=False also marks the first occurrence)
duplicates_df = nearbyvenues_df[nearbyvenues_df.duplicated(subset=['City', 'Venue', 'Venue Latitude', 'Venue Longitude'], keep=False)]
# Keep only the first occurrence of each venue
nearbyvenuescleaned_df = nearbyvenues_df.drop_duplicates(subset=['City', 'Venue', 'Venue Latitude', 'Venue Longitude'])

for city in nearbyvenues_df['City'].unique():
    city_duplicates = duplicates_df[duplicates_df['City'] == city].shape[0]
    city_total = nearbyvenues_df[nearbyvenues_df['City'] == city].shape[0]
    city_cleaned = nearbyvenuescleaned_df[nearbyvenuescleaned_df['City'] == city].shape[0]
    # City summary
    print(f"{city}: {city_duplicates} of {city_total} venue rows belong to duplicate groups; {city_cleaned} unique venues remain.")

# Overall summary
print(f"\nOverall, {duplicates_df.shape[0]} of {nearbyvenues_df.shape[0]} venue rows across all cities belong to duplicate groups, "
      f"leaving a cleaned list of {nearbyvenuescleaned_df.shape[0]} unique venues.")

Quebec City: 2510 of 3146 venue rows belong to duplicate groups; 916 unique venues remain.
Montreal: 2202 of 4757 venue rows belong to duplicate groups; 3382 unique venues remain.
Ottawa: 1034 of 1986 venue rows belong to duplicate groups; 1303 unique venues remain.
Toronto: 1892 of 4743 venue rows belong to duplicate groups; 3564 unique venues remain.
Vancouver: 7998 of 10314 venue rows belong to duplicate groups; 3784 unique venues remain.
Paris: 330 of 1945 venue rows belong to duplicate groups; 1771 unique venues remain.

Overall, 15966 of 26891 venue rows across all cities belong to duplicate groups, leaving a cleaned list of 14720 unique venues.


Retrieve the list of venue categories.

In [16]:
# Load the CSV file to see its structure and content
categories_path = r'C:\Users\marin\OneDrive\Documents\GITHUB\Foursquarecategories.csv'
categorieslabels_df = pd.read_csv(categories_path)

# Display the dataframe to understand its structure
categorieslabels_df.head()

Unnamed: 0,Category ID,Category Label
0,10000,Arts and Entertainment
1,10001,Arts and Entertainment > Amusement Park
2,10002,Arts and Entertainment > Aquarium
3,10003,Arts and Entertainment > Arcade
4,10004,Arts and Entertainment > Art Gallery


Clean and format our list of venue categories.

In [17]:
# Split each 'Parent > Child' label and use a prevailing term as the venue label where one applies
prevailing_terms = ["Education", "Office", "Financial Service", "Bar", "Food and Beverage Retail"]

categorieslabels_df['Venue Category'] = categorieslabels_df['Category Label'].apply(lambda x: x.split('>')[-1].strip())
categorieslabels_df['Venue Label'] = categorieslabels_df.apply(
    lambda row: next(
        (term for term in prevailing_terms if term in row['Category Label']),
        row['Category Label'].split('>')[0].strip()
    ), axis=1
).str.strip().replace(r'\s*Dining\s*and\s*Drinking\s*', 'Dining and Lunching', regex=True).replace('>', '', regex=False)

categorieslabels_df = categorieslabels_df.loc[:, ['Venue Label', 'Venue Category']]
categorieslabels_df.head()

Unnamed: 0,Venue Label,Venue Category
0,Arts and Entertainment,Arts and Entertainment
1,Arts and Entertainment,Amusement Park
2,Arts and Entertainment,Aquarium
3,Arts and Entertainment,Arcade
4,Arts and Entertainment,Art Gallery
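
One caveat of the step above: `term in row['Category Label']` is a substring test, so a short term such as "Bar" would also match any label that merely contains those letters (for instance a label ending in "Barbershop"). The sketch below shows one possible alternative that compares whole path segments instead. It is not the notebook's method, it leaves out the "Dining and Drinking" renaming, and the sample labels are illustrative rather than taken from the category file.

```python
PREVAILING_TERMS = ["Education", "Office", "Financial Service", "Bar", "Food and Beverage Retail"]

def split_label(category_label):
    """Split 'A > B > C' into its path segments, stripped of whitespace."""
    return [segment.strip() for segment in category_label.split(">")]

def venue_label(category_label, prevailing_terms=PREVAILING_TERMS):
    """Return a prevailing term if it is a whole segment of the path, else the top-level segment."""
    segments = split_label(category_label)
    for term in prevailing_terms:
        # List membership is an exact segment match, not a substring match
        if term in segments:
            return term
    return segments[0]

# Illustrative labels
print(venue_label("Dining and Drinking > Bar > Wine Bar"))
# Bar
print(venue_label("Business and Professional Services > Health and Beauty Service > Barbershop"))
# Business and Professional Services
print(split_label("Arts and Entertainment > Aquarium")[-1])
# Aquarium
```

Whether exact segment matching or substring matching is preferable depends on how the labels should be grouped, so the result is worth checking against the full category list.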


To assign a label to each venue category, we merge our main dataframe "nearbyvenues_df" with the "categorieslabels_df" dataframe on their common column "Venue Category".

In [18]:
nearbyvenues_df = pd.merge(nearbyvenues_df, categorieslabels_df, on='Venue Category', how='left')
nearbyvenues_df.head()

Unnamed: 0,City,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Venue Label
0,Quebec City,Quebec Provincial Government,46.809315,-71.213023,Assemblée nationale du Québec,46.808703,-71.214151,Capitol Building,Community and Government
1,Quebec City,Quebec Provincial Government,46.809315,-71.213023,Fontaine de Tourny,46.809683,-71.212636,Plaza,Landmarks and Outdoors
2,Quebec City,Quebec Provincial Government,46.809315,-71.213023,Parc de l'Esplanade,46.810424,-71.211505,Park,Landmarks and Outdoors
3,Quebec City,Quebec Provincial Government,46.809315,-71.213023,Hôtel Manoir D'auteuil,46.811391,-71.211379,Bed and Breakfast,Travel and Transportation
4,Quebec City,Quebec Provincial Government,46.809315,-71.213023,Le Saint-Amour,46.811331,-71.210221,French Restaurant,Dining and Lunching


In [19]:
# Let's not forget our cleaned city list
nearbyvenuescleaned_df = pd.merge(nearbyvenuescleaned_df, categorieslabels_df, on='Venue Category', how='left')

Next, we will integrate our venue information back into the "cities_df" dataframe, using the postal code as the primary key to generate insights. Specifically, for each postal code, we aim to display the frequency of occurrence of each venue category across all neighborhoods. To do so, we will use **One-Hot Encoding** to convert the unique venue categories into a binary format, which is easier to transform into frequencies.

In [20]:
# Count of unique categories curated from all the returned venues
print(f'There are {len(nearbyvenues_df["Venue Category"].unique())} unique venue categories to convert into columns.')

There are 483 unique venue categories to convert into columns.


In [21]:
# Create a new column for each unique category in "Venue Category" and assign it a binary (0 or 1) mark
citiescategoryonehot_df = pd.get_dummies(nearbyvenues_df[['Venue Category']], prefix="", prefix_sep="")

# Add both 'City' and 'Neighborhood' columns to the DataFrame
citiescategoryonehot_df['City'] = nearbyvenues_df['City']
citiescategoryonehot_df['Neighborhood'] = nearbyvenues_df['Neighborhood']

# Set 'City' and 'Neighborhood' as a multi-index
citiescategoryonehot_df.set_index(['City', 'Neighborhood'], inplace=True)

# Show the current shape
print(citiescategoryonehot_df.shape)
citiescategoryonehot_df.head(5)

(26891, 482)


City,Neighborhood,Adult Store,Advertising Agency,Afghan Restaurant,African Restaurant,Agriculture and Forestry Service,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,...,Waterfront,Whisky Bar,Wholesaler,Wine Bar,Wine Store,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
Quebec City,Quebec Provincial Government,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Quebec City,Quebec Provincial Government,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Quebec City,Quebec Provincial Government,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Quebec City,Quebec Provincial Government,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Quebec City,Quebec Provincial Government,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


We use the mean instead of the count because the mean gives a normalized frequency (in the 0 to 1 range) for each category. Neighborhoods with different total venue counts thus become comparable, because we are looking at proportions rather than raw counts.
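
A small worked example on made-up data may help: after one-hot encoding, the group sum gives raw counts while the group mean gives each category's share of the venues in that neighborhood.

```python
import pandas as pd

# Toy data: N1 has four venues, N2 has two
toy_df = pd.DataFrame({
    "Neighborhood": ["N1", "N1", "N1", "N1", "N2", "N2"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Bar", "Cafe", "Park"],
})

# One column per category, one row per venue
onehot_df = pd.get_dummies(toy_df[["Venue Category"]], prefix="", prefix_sep="")
onehot_df["Neighborhood"] = toy_df["Neighborhood"]

counts_df = onehot_df.groupby("Neighborhood").sum()   # raw counts per category
freq_df = onehot_df.groupby("Neighborhood").mean()    # share of venues per category

print(counts_df)
print(freq_df)
```

N1 has two cafes and N2 only one, yet cafes make up half of the venues in both neighborhoods, so both get a frequency of 0.5. Each row of `freq_df` sums to 1.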

In [22]:
# Group rows by both 'City' and 'Neighborhood', taking the mean of the frequency of occurrence of each category
citiescategoryfreq_df = citiescategoryonehot_df.groupby(['City', 'Neighborhood']).mean().reset_index()

# Confirm the new shape
print(citiescategoryfreq_df.shape)
citiescategoryfreq_df.head(5)

(613, 484)


Unnamed: 0,City,Neighborhood,Adult Store,Advertising Agency,Afghan Restaurant,African Restaurant,Agriculture and Forestry Service,Airport,Airport Food Court,Airport Lounge,...,Waterfront,Whisky Bar,Wholesaler,Wine Bar,Wine Store,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Montreal,Ahuntsic East,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Montreal,Ahuntsic North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Montreal,Ahuntsic South,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.016667,0.0,0.0,0.0
3,Montreal,Ahuntsic West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.0
4,Montreal,Anjou North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Compare the "cities_df" and "citiescategoryfreq_df" to list the postal codes removed from the analysis due to the absence of nearby venues.
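
One compact way to express such a comparison is a left merge with `indicator=True`, sometimes called an anti-join. The sketch below runs on toy dataframes with made-up names and assumes that each (City, Neighborhood) pair appears only once in the right-hand dataframe, as is the case after a `groupby`.

```python
import pandas as pd

# Toy data: neighborhood N2 of city A returned no venue
all_df = pd.DataFrame({"City": ["A", "A", "B"], "Neighborhood": ["N1", "N2", "N1"]})
with_venues_df = pd.DataFrame({"City": ["A", "B"], "Neighborhood": ["N1", "N1"]})

# indicator=True adds a '_merge' column telling where each row was found
merged_df = all_df.merge(with_venues_df, on=["City", "Neighborhood"], how="left", indicator=True)
# Rows marked 'left_only' exist in all_df but not in with_venues_df
missing_df = merged_df.loc[merged_df["_merge"] == "left_only", ["City", "Neighborhood"]]

print(missing_df)
```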

In [23]:
# Initialize a dictionary to store missing neighborhoods for each city
missing_neighborhoods_dict = {}
total_missing = 0

# Loop through each unique city
for city in cities_df['City'].unique():
    city_neighborhoods = cities_df[cities_df['City'] == city]['Neighborhood']
    
    # Identify missing neighborhoods for the current city
    missing_neighborhoods = city_neighborhoods[~city_neighborhoods.isin(citiescategoryfreq_df[citiescategoryfreq_df['City'] == city]['Neighborhood'])]
    if not missing_neighborhoods.empty:
        missing_neighborhoods_dict[city] = missing_neighborhoods.tolist()
        total_missing += len(missing_neighborhoods)

# Print the total missing neighborhoods across all cities
print(f"\nTotal missing postal codes across all cities: {total_missing} \n")
# Print results for each city
for city, missing_neighborhoods in missing_neighborhoods_dict.items():
    print(f"In {city}, {len(missing_neighborhoods)} postal code(s) are missing in the venue categories frequency dataframe "
    f"which include the following neighborhood(s): {missing_neighborhoods}\n")


Total missing postal codes across all cities: 34 

In Quebec City, 19 postal code(s) are missing in the venue categories frequency dataframe which include the following neighborhood(s): ['North Loretteville', 'Saint-Augustin-de-Desmaures', 'Lac-Delage', "L'Ancienne-Lorette, YQB", 'Pont-Rouge', 'Val-Bélair North', 'Charlesbourg  Bourg-Royal', 'Chibougamau', 'Grosse-Île', "Beaupré, Boischatel, Cap-Santé, Château-Richer, Deschambault-Grondines, Deschambault-Grondines, L'Isle-aux-Coudres, L'Isle-aux-Coudres, Lac-Sergent, L'Ange-Gardien, Petite-Rivière-Saint-François, Les Éboulements, Neuville, Portneuf, Rivière-à-Pierre, Saint-Alban, Sainte-Anne-de-Beaupré, Saint-Basile, L'Isle-aux-Coudres, Sainte-Brigitte-de-Laval, Saint-Casimir, Sainte-Famille-de-l'Île-d'Orléans, Saint-Ferréol-les-Neiges, Saint-François-de-l'Île-d'Orléans, Saint-Gilbert, Saint-Hilarion, Saint-Jean-de-l'Île-d'Orléans, Saint-Joachim, Les Éboulements, Saint-Laurent-de-l'Île-d'Orléans, Saint-Léonard-de-Portneuf, Saint-Marc

We now have the frequency of each venue category available. However, given the large number of categories, it is more effective to group them under their parent labels. By calculating the frequency of these parent labels and ranking them, we can gain a clearer understanding of the most common types of services provided in each neighborhood.

In [24]:
# Create a new column for each unique label in "Venue Label" and assign it a binary (0 or 1) mark
citieslabelonehot_df = pd.get_dummies(nearbyvenues_df[['Venue Label']], prefix="", prefix_sep="")
citieslabelonehot_df['City'] = nearbyvenues_df['City']
citieslabelonehot_df['Neighborhood'] = nearbyvenues_df['Neighborhood']
citieslabelonehot_df.set_index(['City', 'Neighborhood'], inplace=True)
citieslabelfreq_df = citieslabelonehot_df.groupby(['City', 'Neighborhood']).mean().reset_index()
print(citieslabelfreq_df.shape)
citieslabelfreq_df.head(5)

(613, 17)


Unnamed: 0,City,Neighborhood,Arts and Entertainment,Bar,Business and Professional Services,Community and Government,Dining and Lunching,Education,Event,Financial Service,Food and Beverage Retail,Health and Medicine,Landmarks and Outdoors,Office,Retail,Sports and Recreation,Travel and Transportation
0,Montreal,Ahuntsic East,0.071429,0.071429,0.0,0.0,0.571429,0.0,0.0,0.0,0.142857,0.0,0.071429,0.0,0.0,0.0,0.071429
1,Montreal,Ahuntsic North,0.020408,0.061224,0.0,0.0,0.571429,0.0,0.0,0.0,0.142857,0.0,0.061224,0.0,0.102041,0.0,0.040816
2,Montreal,Ahuntsic South,0.033333,0.0,0.0,0.0,0.466667,0.0,0.0,0.0,0.05,0.0,0.0,0.016667,0.383333,0.0,0.05
3,Montreal,Ahuntsic West,0.0,0.047619,0.0,0.0,0.47619,0.0,0.0,0.0,0.142857,0.0,0.047619,0.0,0.142857,0.0,0.142857
4,Montreal,Anjou North,0.0,0.0,0.0,0.0,0.375,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.25,0.0,0.25


In [25]:
# Build a function to sort the venue labels in descending order
def return_most_common_labels(row, num_top_labels):
    # Select all the label frequencies from the row, skipping the 'City' and 'Neighborhood' columns
    row_labels = row.iloc[2:]
    # Sort the frequencies in descending order
    row_labels_sorted = row_labels.sort_values(ascending = False)
    # Return the index of the sorted values, i.e. their respective value names
    return row_labels_sorted.index.values[0:num_top_labels]

In [28]:
# Focus on the top 10 venue labels for each neighborhood
num_top_labels = 10
# Ordinal suffixes for the 1st, 2nd, and 3rd venue labels
indicators = ['st', 'nd', 'rd']

# Create as many columns as stated in num_top_labels
columns = ['City','Neighborhood']
for ind in np.arange(num_top_labels):
    try:
        columns.append('{}{} Most Common Label'.format(ind + 1, indicators[ind]))
    except IndexError:
        # Ranks beyond the third use the 'th' suffix
        columns.append('{}th Most Common Label'.format(ind + 1))

# Create a new dataframe to display our top 10 most common venue labels for each neighborhood
citieslabelranked_df = pd.DataFrame(columns = columns)
citieslabelranked_df['City'] = citieslabelfreq_df['City']
citieslabelranked_df['Neighborhood'] = citieslabelfreq_df['Neighborhood']

# Populate the dataframe with the most common venue labels
for ind in np.arange(citieslabelfreq_df.shape[0]):
    citieslabelranked_df.iloc[ind, 2:] = return_most_common_labels(citieslabelfreq_df.iloc[ind, :], num_top_labels)

# Display the dataframe
citieslabelranked_df.head()

Unnamed: 0,City,Neighborhood,1st Most Common Label,2nd Most Common Label,3rd Most Common Label,4th Most Common Label,5th Most Common Label,6th Most Common Label,7th Most Common Label,8th Most Common Label,9th Most Common Label,10th Most Common Label
0,Montreal,Ahuntsic East,Dining and Lunching,Food and Beverage Retail,Arts and Entertainment,Bar,Landmarks and Outdoors,Travel and Transportation,Business and Professional Services,Community and Government,Education,Event
1,Montreal,Ahuntsic North,Dining and Lunching,Food and Beverage Retail,Retail,Bar,Landmarks and Outdoors,Travel and Transportation,Arts and Entertainment,Business and Professional Services,Community and Government,Education
2,Montreal,Ahuntsic South,Dining and Lunching,Retail,Food and Beverage Retail,Travel and Transportation,Arts and Entertainment,Office,Bar,Business and Professional Services,Community and Government,Education
3,Montreal,Ahuntsic West,Dining and Lunching,Food and Beverage Retail,Retail,Travel and Transportation,Bar,Landmarks and Outdoors,Arts and Entertainment,Business and Professional Services,Community and Government,Education
4,Montreal,Anjou North,Dining and Lunching,Retail,Travel and Transportation,Food and Beverage Retail,Arts and Entertainment,Bar,Business and Professional Services,Community and Government,Education,Event
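
The suffix list used for the column names is sufficient for a top 10, but it would yield "21th" for larger rankings. A more general helper is sketched below, purely as an illustration. The function name `ordinal` is hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th, 21st."""
    # Numbers ending in 11, 12 or 13 take 'th' despite their last digit
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

columns = ["City", "Neighborhood"] + [f"{ordinal(i)} Most Common Label" for i in range(1, 11)]
print(columns[2:5])
# ['1st Most Common Label', '2nd Most Common Label', '3rd Most Common Label']
```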


In [29]:
# Add the total number of venues found for each neighborhood within each city to our main dataframe
venues_count_df = citiescategoryonehot_df.groupby(['City', 'Neighborhood']).size().reset_index(name='Total Venues')
cities_df = cities_df.merge(venues_count_df, on=['City', 'Neighborhood'], how='left')

# Merge the citieslabelranked_df results into cities_df for each city and neighborhood
cities_df = cities_df.merge(citieslabelranked_df, on=['City', 'Neighborhood'], how='left', suffixes=('_left', '_right'))

# Drop the neighborhoods without any venue, then check the result
cities_df = cities_df.dropna(subset=['Total Venues'])
cities_df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood,City,City1,City2,Latitude,Longitude,IsDuplicate,Total Venues,1st Most Common Label,2nd Most Common Label,3rd Most Common Label,4th Most Common Label,5th Most Common Label,6th Most Common Label,7th Most Common Label,8th Most Common Label,9th Most Common Label,10th Most Common Label
0,G1A,Quebec Provincial Government,Quebec Provincial Government,Quebec City,"Quebec City, Quebec","Quebec City, QC",46.809315,-71.213023,False,100.0,Dining and Lunching,Landmarks and Outdoors,Travel and Transportation,Bar,Arts and Entertainment,Retail,Community and Government,Food and Beverage Retail,Business and Professional Services,Education
3,G4A,Clermont,Clermont,Quebec City,"Quebec City, Quebec","Quebec City, QC",47.690608,-70.219378,False,5.0,Travel and Transportation,Food and Beverage Retail,Landmarks and Outdoors,Retail,Arts and Entertainment,Bar,Business and Professional Services,Community and Government,Dining and Lunching,Education
4,G5A,La Malbaie,La Malbaie,Quebec City,"Quebec City, Quebec","Quebec City, QC",47.656944,-70.151389,False,6.0,Dining and Lunching,Landmarks and Outdoors,Travel and Transportation,Arts and Entertainment,Bar,Business and Professional Services,Community and Government,Education,Event,Financial Service
5,G6A,Saint-Georges,Saint-Georges Northwest,Quebec City,"Quebec City, Quebec","Quebec City, QC",46.122714,-70.670151,True,11.0,Dining and Lunching,Arts and Entertainment,Food and Beverage Retail,Retail,Travel and Transportation,Bar,Business and Professional Services,Community and Government,Education,Event
6,G7A,Lévis,Lévis South,Quebec City,"Quebec City, Quebec","Quebec City, QC",46.699333,-71.301231,False,9.0,Travel and Transportation,Landmarks and Outdoors,Retail,Dining and Lunching,Arts and Entertainment,Bar,Business and Professional Services,Community and Government,Education,Event


## [4] Saving

In [30]:
# Save the DataFrame to a CSV file with UTF-8 encoding
cities_df.to_csv('venue_df_output.csv', index=False, encoding='utf-8')
print("DataFrame saved as 'venue_df_output.csv'.")

DataFrame saved as 'venue_df_output.csv'.


In [31]:
# Save the DataFrame nearbyvenues_df to a CSV file with UTF-8 encoding
nearbyvenues_df.to_csv('venue_df_output2.csv', index=False, encoding='utf-8')
print("DataFrame saved as 'venue_df_output2.csv'.")

DataFrame saved as 'venue_df_output2.csv'.


In [32]:
# Save the DataFrame nearbyvenuescleaned_df to a CSV file with UTF-8 encoding
nearbyvenuescleaned_df.to_csv('venue_df_output3.csv', index=False, encoding='utf-8')
print("DataFrame saved as 'venue_df_output3.csv'.")

DataFrame saved as 'venue_df_output3.csv'.


In [34]:
# Save the DataFrame citieslabelfreq_df to a CSV file with UTF-8 encoding
citieslabelfreq_df.to_csv('venue_df_output4.csv', index=False, encoding='utf-8')
print("DataFrame saved as 'venue_df_output4.csv'.")

DataFrame saved as 'venue_df_output4.csv'.
