# Comparative Analysis of Neighborhoods | Geolocation Data Preparation

We will use the following Geolocation APIs to locate our neighborhoods:
- ArcGIS geocoder API: get latitude and longitude for specific postal codes combined with the city name.
- Nominatim geocoder API: the OpenStreetMap geocoding service will help us find the geographical coordinates of the whole city we are investigating.

## [1] Working environment setup

Before starting, we need to install and import libraries.

In [1]:
# Data Storage and File Handling
import json

# Data Manipulation and Processing
!pip install pandas
import pandas as pd

# Geolocation and Mapping
!pip install geocoder
import geocoder
!pip install geopy
from geopy.geocoders import Nominatim 

# Miscellaneous
import time

print("Libraries imported.")

Libraries imported.


## [2] Data Collection

First, we load the Wikipedia dataframe and dictionary from their CSV and JSON files.

In [2]:
# Load from CSV file
cities_df = pd.read_csv('wikipedia_df_output.csv', encoding='utf-8')

In [3]:
# Load from JSON file
with open("wikipedia_dic_output.json", "r") as f:
    citiesinfo_dic = json.load(f)

Next, we retrieve the latitude and longitude of each city. We use the Nominatim geolocator here because it is better suited to large areas such as a whole city than to smaller areas such as boroughs.

In [4]:
# Initialize the geolocator
geolocator = Nominatim(user_agent="myGeocoderproject4")

# Initialize lists in dictionary to store latitude and longitude
citiesinfo_dic['latitude'] = []
citiesinfo_dic['longitude'] = []

# Iterate over each city in 'city2' for geolocation
for city in citiesinfo_dic['city2']:
    try:
        # Fetch the location with a 5-second timeout
        location = geolocator.geocode(city, timeout=5)
        
        # Store latitude and longitude or None if not found
        latitude = location.latitude if location else None
        longitude = location.longitude if location else None
        
        # Append the coordinates to the dictionary lists
        citiesinfo_dic['latitude'].append(latitude)
        citiesinfo_dic['longitude'].append(longitude)
        
        # Print the result for each city
        if location:
            print(f'The geographical coordinates of "{city}" are Latitude: {latitude}, Longitude: {longitude}.')
        else:
            print(f'Coordinates for "{city}" could not be found.')
        
        # Add a delay to avoid overloading the geocoding service
        time.sleep(2)
    
    except Exception as e:
        # Handle exceptions and append None for coordinates if there's an error
        print(f"An error occurred for city {city}: {e}")
        citiesinfo_dic['latitude'].append(None)
        citiesinfo_dic['longitude'].append(None)

The geographical coordinates of "Quebec City, QC" are Latitude: 46.8137431, Longitude: -71.2084061.
The geographical coordinates of "Montreal, QC" are Latitude: 45.5031824, Longitude: -73.5698065.
The geographical coordinates of "Ottawa, ON" are Latitude: 45.4208777, Longitude: -75.6901106.
The geographical coordinates of "Toronto, ON" are Latitude: 43.6534817, Longitude: -79.3839347.
The geographical coordinates of "Vancouver, BC" are Latitude: 49.2608724, Longitude: -123.113952.
The geographical coordinates of "Paris, France" are Latitude: 48.8588897, Longitude: 2.3200410217200766.


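The OpenStreetMap geocoding service was unavailable for a while, so I created a backup lookup with ArcGIS. Note that this cell resets the latitude and longitude lists and overwrites the Nominatim results stored above.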

In [5]:
# Initialize lists to store latitude and longitude
citiesinfo_dic['latitude'] = []
citiesinfo_dic['longitude'] = []

# Iterate over each city in 'city1'
for city in citiesinfo_dic['city1']:
    try:
        # Use ArcGIS to fetch latitude and longitude
        g = geocoder.arcgis(city)
        lati_long_coords = g.latlng

        # Append the latitude and longitude or None if not found
        if lati_long_coords:
            latitude, longitude = lati_long_coords
            citiesinfo_dic['latitude'].append(latitude)
            citiesinfo_dic['longitude'].append(longitude)
            print(f'The geographical coordinates of "{city}" are Latitude: {latitude}, Longitude: {longitude}.')
        else:
            citiesinfo_dic['latitude'].append(None)
            citiesinfo_dic['longitude'].append(None)
            print(f'Coordinates for "{city}" could not be found.')

        # Add a delay to avoid overloading the geocoding service
        time.sleep(1)

    except Exception as e:
        # Handle exceptions and append None for coordinates if there's an error
        print(f"An error occurred for city '{city}': {e}")
        citiesinfo_dic['latitude'].append(None)
        citiesinfo_dic['longitude'].append(None)

print("Geocoding completed.")

The geographical coordinates of "Quebec City, Quebec" are Latitude: 46.812280000000044, Longitude: -71.21453999999994.
The geographical coordinates of "Montreal, Quebec" are Latitude: 45.508867000000066, Longitude: -73.55424199999999.
The geographical coordinates of "Ottawa, Ontario" are Latitude: 45.425226000000066, Longitude: -75.69996299999997.
The geographical coordinates of "Toronto, Ontario" are Latitude: 43.65352400000006, Longitude: -79.38390699999997.
The geographical coordinates of "Vancouver, British Columbia" are Latitude: 49.26163600000007, Longitude: -123.11334999999997.
The geographical coordinates of "Paris, France" are Latitude: 48.86369757600005, Longitude: 2.3616573370000538.
Geocoding completed.
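
Since both services were queried for the same cities, the two lookups above could be folded into a single call that tries one geocoder and falls back to the next. The sketch below is only an illustration, not the notebook's method: `geocode_with_fallback` and the two stand-in providers are hypothetical names, and each provider is assumed to be a callable that returns a `(latitude, longitude)` tuple, returns `None`, or raises. In the notebook, the providers would wrap `geolocator.geocode` and `geocoder.arcgis`.

```python
from typing import Callable, Optional, Sequence, Tuple

Coords = Tuple[float, float]
Provider = Callable[[str], Optional[Coords]]


def geocode_with_fallback(query: str, providers: Sequence[Provider]) -> Optional[Coords]:
    """Return the first non-empty result, trying the providers in order."""
    for provider in providers:
        try:
            coords = provider(query)
        except Exception as exc:  # a failing service should not stop the lookup
            print(f"Provider {provider.__name__} failed for '{query}': {exc}")
            continue
        if coords is not None:
            return coords
    return None


# Stand-in providers (hypothetical) so the sketch runs without network access.
def unavailable_service(query: str) -> Optional[Coords]:
    raise ConnectionError("service unavailable")


def backup_service(query: str) -> Optional[Coords]:
    known = {"Toronto, Ontario": (43.653524, -79.383907)}
    return known.get(query)


result = geocode_with_fallback("Toronto, Ontario", [unavailable_service, backup_service])
print(result)
```

Passing the providers as a list keeps the order of preference explicit and makes it easy to test the logic with fake services.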


Next, we retrieve the latitude and longitude for each postal code in our cities and add them to the city dataframe. We now use the ArcGIS geocoder, which is better suited to smaller, more precise locations.

In [6]:
# Create functions to retrieve coordinates with geocoder.arcgis for postal codes and neighborhoods
def get_latilong_postalcode(postalcode, city, max_retries=5):
    # Retry a bounded number of times (5 is an arbitrary default) so an unresolvable query cannot loop forever
    for _ in range(max_retries):
        # Use the city string in the geocoder query
        g = geocoder.arcgis('{}, {}'.format(postalcode, city))
        time.sleep(2)  # Add a delay between requests
        if g.latlng is not None:
            return g.latlng
    return (None, None)  # Two values so callers can still unpack the result

def get_latilong_neighborhood(neighborhood, city, max_retries=5):
    # Same bounded retry as above, with the neighborhood in the query
    for _ in range(max_retries):
        g = geocoder.arcgis('{}, {}'.format(neighborhood, city))
        time.sleep(2)  # Add a delay between requests
        if g.latlng is not None:
            return g.latlng
    return (None, None)

In [7]:
# Test one of the functions with samples like a single postal code
coordinates = []
for city, test in zip(citiesinfo_dic['city1'], citiesinfo_dic['test1']):
    try:
        latitude, longitude = get_latilong_postalcode(test, city)
        coordinates.append((latitude, longitude))  # Save coordinates in the list
        print(f"The geographical coordinates for {city} with postal code {test} are Latitude: {latitude}, Longitude: {longitude}.")
    except Exception as e:
        coordinates.append((None, None))  # Save None for failed retrievals
        print(f"An error occurred for city '{city}' with postal code '{test}': {e}")

citiesinfo_dic['coordinates'] = coordinates

The geographical coordinates for Quebec City, Quebec with postal code G3N are Latitude: 46.82439100000005, Longitude: -71.25270099999994.
The geographical coordinates for Montreal, Quebec with postal code H3A are Latitude: 45.50534869300003, Longitude: -73.57705756099995.
The geographical coordinates for Ottawa, Ontario with postal code K2C are Latitude: 45.375505469000075, Longitude: -75.70906227299997.
The geographical coordinates for Toronto, Ontario with postal code M4G are Latitude: 43.70884486600005, Longitude: -79.36613953399996.
The geographical coordinates for Vancouver, British Columbia with postal code V5A are Latitude: 49.26505536800005, Longitude: -122.93620847099999.
The geographical coordinates for Paris, France with postal code 75003 are Latitude: 48.8637018, Longitude: 2.3610909.
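
One way to sanity-check a postal-code result is to measure its great-circle distance from the city-level coordinates retrieved earlier. The sketch below uses the haversine formula with a mean Earth radius of 6371 km. `haversine_km` is a hypothetical helper that is not part of the notebook, and the two Toronto points are copied from the ArcGIS outputs above.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


toronto_city = (43.65352400000006, -79.38390699999997)  # "Toronto, Ontario"
toronto_m4g = (43.70884486600005, -79.36613953399996)   # postal code M4G

distance = haversine_km(*toronto_city, *toronto_m4g)
print(f"M4G is {distance:.1f} km from the city-level coordinates.")
```

A postal code that lands far outside the expected radius of its city would be a reasonable candidate for manual review.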


In [8]:
# Retrieve latitude and longitude coordinates directly into new columns using 'Postalcode' and 'City1'
cities_df[['Latitude1', 'Longitude1']] = cities_df.apply(lambda row: get_latilong_postalcode(row['Postalcode'], row['City1']), axis=1, result_type='expand')

In [9]:
# Retrieve latitude and longitude coordinates directly into new columns using 'Postalcode' and 'City2'
cities_df[['Latitude2', 'Longitude2']] = cities_df.apply(lambda row: get_latilong_postalcode(row['Postalcode'], row['City2']), axis=1, result_type='expand')

In [10]:
# Retrieve latitude and longitude coordinates directly into new columns using 'Borough' and 'City1'
cities_df[['Latitude3', 'Longitude3']] = cities_df.apply(lambda row: get_latilong_neighborhood(row['Borough'], row['City1']), axis=1, result_type='expand')

In [11]:
# Retrieve latitude and longitude coordinates directly into new columns using 'Borough' and 'City2'.
cities_df[['Latitude4', 'Longitude4']] = cities_df.apply(lambda row: get_latilong_neighborhood(row['Borough'], row['City2']), axis=1, result_type='expand')

In [12]:
# Retrieve the most refined latitude and longitude coordinates by selecting the best results from the above methods
cities_df['Latitude5'] = cities_df['Latitude1']
cities_df['Longitude5'] = cities_df['Longitude1']

# Iteratively flag duplicates and replace them with the next search level's coordinates
for level in range(2, 5):  # Levels 2, 3, 4
    cities_df['IsDuplicate'] = cities_df.duplicated(subset=['Latitude5', 'Longitude5'], keep=False)
    lat_col = f'Latitude{level}'
    long_col = f'Longitude{level}'
    mask = cities_df['IsDuplicate']
    cities_df.loc[mask, ['Latitude5', 'Longitude5']] = cities_df.loc[mask, [lat_col, long_col]].values

# Final re-flagging for duplicates after Level 4
cities_df['IsDuplicate'] = cities_df.duplicated(subset=['Latitude5', 'Longitude5'], keep=False)
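
To make the merging rule easier to follow, here is a plain-Python restatement of the cell above, under the assumption that each row carries one candidate coordinate per search level. `merge_levels` is a hypothetical name and the coordinates are made up for illustration. At each level, only rows whose current coordinates are shared with another row take the next level's result.

```python
from collections import Counter


def merge_levels(candidates):
    """candidates: list of rows, each a list of (lat, lon) tuples ordered by search level."""
    merged = [row[0] for row in candidates]  # start from the first search
    n_levels = len(candidates[0])
    for level in range(1, n_levels):
        counts = Counter(merged)  # how many rows currently share each coordinate
        merged = [
            row[level] if counts[current] > 1 else current
            for row, current in zip(candidates, merged)
        ]
    return merged


rows = [
    [(46.81, -71.21), (46.82, -71.25)],  # shares its first result with the next row
    [(46.81, -71.21), (46.79, -71.30)],
    [(45.50, -73.57), (45.51, -73.55)],  # already unique, keeps its first result
]
merged = merge_levels(rows)
print(merged)
```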

In [13]:
# Build a function to compute the number of duplicated coordinates for each city at each level
def compute_duplicated(cities_df, levels):
    # Initialize a dictionary to store results
    results = {'City': citiesinfo_dic['city0']}
    results_df = pd.DataFrame(results)

    # Group by city and coordinates to count duplicates and merge the results together
    for level in range(1, levels + 1):
        coordinatescount_df = (cities_df.groupby(['City', f'Latitude{level}', f'Longitude{level}']).size().reset_index(name='Count'))
        duplicatedcount_df = (coordinatescount_df[coordinatescount_df['Count'] > 1].groupby('City')['Count'].sum().reset_index().rename(columns={'Count': f'Search {level}'}))
        results_df = pd.merge(results_df, duplicatedcount_df, on='City', how='left')
    
    # Add the total number of postal codes per city
    totalpostalcodes_df = cities_df['City'].value_counts().reset_index()
    totalpostalcodes_df.columns = ['City', 'Total Postal Codes']
    results_df = pd.merge(results_df, totalpostalcodes_df, on='City', how='left')

    # Calculate percentages for each level
    for level in range(1, levels + 1):
        results_df[f'Search {level}'] = ((results_df[f'Search {level}'] / results_df['Total Postal Codes']) * 100).fillna(0).astype(int).astype(str) + '%'

    # Return the results
    results_df = results_df.rename(columns={'Search 5': 'Merged'})
    return results_df

# Compute the percentage of duplicates by level for each city
duplicated_df = compute_duplicated(cities_df, 5)

# Display the results
print("\033[1m\033[4mProportion of Duplicated Coordinates Across Cities Within Each Search and Merged Results\033[0m\n")
print("As a reminder:\n - Search 1: by Postal Codes across City1\n - Search 2: by Postal Codes across City2\n - Search 3: by Neighborhoods across City1\n - Search 4: by Neighborhoods across City2\n")
duplicated_df

Proportion of Duplicated Coordinates Across Cities Within Each Search and Merged Results

As a reminder:
 - Search 1: by Postal Codes across City1
 - Search 2: by Postal Codes across City2
 - Search 3: by Neighborhoods across City1
 - Search 4: by Neighborhoods across City2



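| | City | Search 1 | Search 2 | Search 3 | Search 4 | Merged | Total Postal Codes |
|---|---|---|---|---|---|---|---|
| 0 | Quebec City | 92% | 72% | 60% | 59% | 40% | 140 |
| 1 | Montreal | 7% | 7% | 82% | 81% | 5% | 123 |
| 2 | Ottawa | 34% | 38% | 21% | 21% | 16% | 84 |
| 3 | Toronto | 0% | 0% | 80% | 80% | 0% | 97 |
| 4 | Vancouver | 59% | 60% | 77% | 72% | 36% | 195 |
| 5 | Paris | 0% | 0% | 0% | 0% | 0% | 20 |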


Duplication varies notably across the four searches, which differ in the query key (postal code versus neighborhood) and in the city naming convention (City1 versus City2).

- In Quebec City and Ottawa, the postal-code searches (Search 1 and 2) show higher duplication than the neighborhood searches (Search 3 and 4).
- Conversely, in Montreal, Toronto, and Vancouver, the neighborhood searches (Search 3 and 4) show higher duplication than the postal-code searches (Search 1 and 2).

After combining the results of the four searches to minimize duplication, the "Merged" column shows a duplication percentage no higher than that of any individual search for every city. The variability across cities persists, however.

- Quebec City and Vancouver still show the highest merged duplication (40% and 36%), likely due to the dense clustering of postal codes or neighborhoods in these regions.
- Montreal and Ottawa end at 5% and 16% after merging, below every individual search.
- Toronto is a distinct case: Search 3 and 4 (neighborhoods) have an 80% duplication rate, while the postal-code searches and the merged result report 0%.
- Paris stands out with 0% duplication across all searches, likely due to its precise and organized postal code system; its sample is also the smallest (20 postal codes).
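
As a reading aid for the percentages discussed above, the sketch below restates the metric in plain Python: the share of rows whose coordinates appear more than once, truncated to an integer as `astype(int)` does in `compute_duplicated`. `duplicated_share` is a hypothetical helper and the sample coordinates are made up.

```python
from collections import Counter


def duplicated_share(coords):
    """Percentage (truncated to an int) of rows whose coordinates appear more than once."""
    if not coords:
        return 0
    counts = Counter(coords)
    duplicated_rows = sum(n for n in counts.values() if n > 1)
    # Integer arithmetic truncates toward zero, like astype(int) on a non-negative float
    return duplicated_rows * 100 // len(coords)


# Made-up coordinates for one city: three rows share a point, two are unique.
sample = [(49.26, -123.11)] * 3 + [(49.27, -122.94), (49.25, -123.10)]
print(f"{duplicated_share(sample)}%")  # prints "60%"
```

Note that every row in a shared group counts as duplicated, including the first occurrence, which matches `keep=False` in the merging cell.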

In [14]:
# Drop intermediate latitude/longitude columns and update the final merged results
cities_df.drop(columns=[f'Latitude{i}' for i in range(1, 5)] + [f'Longitude{i}' for i in range(1, 5)], inplace=True, errors='ignore')
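# Not in place: this returns a copy without 'IsDuplicate' and leaves cities_df unchanged (pass inplace=True to drop the column)
cities_df.drop(columns=['IsDuplicate'])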
cities_df.rename(columns={'Latitude5': 'Latitude', 'Longitude5': 'Longitude'}, inplace=True)

## [3] Data Cleaning and Formatting

As we proceed with our analysis using neighborhoods as keys, it's important to verify if any neighborhoods in our "cities_df" dataframe are associated with multiple postal codes. If such cases are found, we will update the neighborhood names by appending a cardinal direction suffix based on their relative positions to one another.

In [15]:
# Find duplicate neighborhoods based on 'Neighborhood'
neighboroodduplicates_df = cities_df[cities_df.duplicated(subset=['Neighborhood'], keep=False)]

if neighboroodduplicates_df.empty:
    print('There are no duplicated neighborhoods in the cities_df dataframe.')
else:
    print(f'There are {neighboroodduplicates_df.shape[0]} neighborhoods with two or more different postal codes assigned.')
    # Group by neighborhood to handle each neighborhood with multiple postal codes separately
    for neighborhood, group in neighboroodduplicates_df.groupby('Neighborhood'):
        
        # Find the max/min values for latitude and longitude
        max_latitude = group['Latitude'].max()
        min_latitude = group['Latitude'].min()
        max_longitude = group['Longitude'].max()
        min_longitude = group['Longitude'].min()
        
        # Create a function to apply the cardinal directions to the neighborhoods
        def assign_cardinal_direction(row):
            if row['Latitude'] == max_latitude:
                return f"{row['Neighborhood']} North"
            elif row['Latitude'] == min_latitude:
                return f"{row['Neighborhood']} South"
            elif row['Longitude'] == max_longitude:
                return f"{row['Neighborhood']} East"
            elif row['Longitude'] == min_longitude:
                return f"{row['Neighborhood']} West"
            else:
                return f"{row['Neighborhood']} Center"
        
        # Work on a copy of the group, then assign initial names with cardinal directions
        group = group.copy()
        group['Neighborhood'] = group.apply(assign_cardinal_direction, axis=1)
        
        # If duplicates remain, add ordinal suffixes
        neighborhood_counts = group['Neighborhood'].value_counts()
        for name, count in neighborhood_counts.items():
            if count > 1:  # Only apply ordinal if there are duplicates
                duplicates = group[group['Neighborhood'] == name]
                unique_coords = duplicates[['Latitude', 'Longitude']].drop_duplicates()
                if len(unique_coords) > 1:
                    for i, idx in enumerate(duplicates.index, 1):
                        group.at[idx, 'Neighborhood'] = f"{name} {i}"

        # Update the neighborhood names in the original cities_df
        cities_df.loc[group.index, 'Neighborhood'] = group['Neighborhood']

# Print the updated DataFrame for duplicates, sorted by 'Neighborhood'
cities_df.loc[neighboroodduplicates_df.index].sort_values(by="Neighborhood").head()

There are 94 neighborhoods with two or more different postal codes assigned.


| | Postalcode | Borough | Neighborhood | City | City1 | City2 | Latitude | Longitude | IsDuplicate |
|---|---|---|---|---|---|---|---|---|---|
| 198 | H2M | Ahuntsic | Ahuntsic East | Montreal | Montreal, Quebec | Montreal, QC | 45.554532 | -73.639078 | False |
| 155 | H2C | Ahuntsic | Ahuntsic North | Montreal | Montreal, Quebec | Montreal, QC | 45.560602 | -73.658873 | False |
| 203 | H2N | Ahuntsic | Ahuntsic South | Montreal | Montreal, Quebec | Montreal, QC | 45.540904 | -73.651166 | False |
| 193 | H3L | Ahuntsic | Ahuntsic West | Montreal | Montreal, Quebec | Montreal, QC | 45.546533 | -73.672191 | False |
| 179 | H1J | Anjou | Anjou North | Montreal | Montreal, Quebec | Montreal, QC | 45.615959 | -73.573361 | False |
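
The naming rule can be checked by hand on the four Ahuntsic rows shown above. The sketch below is a standalone restatement of the priority order used in `assign_cardinal_direction` (North, South, East, West, then Center). `cardinal_suffix` is a hypothetical name, and the coordinates are copied from the output table.

```python
def cardinal_suffix(lat, lon, lats, lons):
    """Apply the same priority order as the cell above: latitude extremes first, then longitude."""
    if lat == max(lats):
        return "North"
    if lat == min(lats):
        return "South"
    if lon == max(lons):
        return "East"
    if lon == min(lons):
        return "West"
    return "Center"


# Postal code -> (latitude, longitude) for the Ahuntsic rows
ahuntsic = {
    "H2M": (45.554532, -73.639078),
    "H2C": (45.560602, -73.658873),
    "H2N": (45.540904, -73.651166),
    "H3L": (45.546533, -73.672191),
}
lats = [lat for lat, _ in ahuntsic.values()]
lons = [lon for _, lon in ahuntsic.values()]
names = {
    code: f"Ahuntsic {cardinal_suffix(lat, lon, lats, lons)}"
    for code, (lat, lon) in ahuntsic.items()
}
print(names)
```

Because latitude is tested first, a group of two postal codes always ends up as North and South, even when the two points differ mostly in longitude.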


## [4] Saving

In [16]:
# Save the DataFrame to a CSV file with UTF-8 encoding
cities_df.to_csv('geolocation_df_output.csv', index=False, encoding='utf-8')
print("DataFrame saved as 'geolocation_df_output.csv'.")

DataFrame saved as 'geolocation_df_output.csv'.


In [1]:
# Save to JSON file
with open("geolocation_dic_output.json", "w") as f:
    json.dump(citiesinfo_dic, f, indent=4)
print("Dictionary saved as 'geolocation_dic_output.json'.")

Dictionary saved as 'geolocation_dic_output.json'.
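
One detail to keep in mind when reloading the dictionary: `citiesinfo_dic['coordinates']` holds tuples, and JSON has no tuple type, so they come back as lists (with `None` stored as `null`). The short sketch below shows the round trip in memory on two sample pairs; converting back to tuples is optional and only matters if later code relies on them, for example as dictionary keys.

```python
import json

# Two sample pairs in the same shape as citiesinfo_dic['coordinates']
coordinates = [(43.70884486600005, -79.36613953399996), (None, None)]

# Serialize and parse again, as json.dump / json.load would do through a file
restored = json.loads(json.dumps({"coordinates": coordinates}))["coordinates"]
print(restored)  # the pairs are now lists

# Convert back to tuples if needed
restored_tuples = [tuple(pair) for pair in restored]
```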
