# Comparative Analysis of Neighborhoods | Wikipedia Data Preparation

We will scrape and format the Wikipedia pages (and, for Paris, a Geneawiki page) that list the most up-to-date postal codes, neighborhoods and boroughs of the cities we want to investigate.
- Halifax: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_B
- Quebec City: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_G
- Montreal: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_H
- Ottawa: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_K
- Toronto: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
- Vancouver: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_V
- Paris: https://fr.geneawiki.com/wiki/Liste_des_quartiers_de_Paris

## [1] Working environment setup

Before starting, we need to install and import libraries.

In [1]:
# Data Access and Web Scraping
!pip install beautifulsoup4
from bs4 import BeautifulSoup
!pip install requests
import requests

# Data Storage and File Handling
import json
#from pandas import json_normalize
#!pip install openpyxl
!pip install lxml
import xml
import zipfile
#from io import BytesIO, StringIO

# Data Manipulation and Processing
!pip install pandas
import pandas as pd
#import numpy as np
import re
#import unicodedata
#!pip install unidecode
#import unidecode
#!pip install fuzzywuzzy
#from fuzzywuzzy import fuzz
#from fuzzywuzzy import process
#from difflib import get_close_matches

# Geolocation and Mapping
#!pip install geocoder
#import geocoder
#!pip install geopy
#from geopy.geocoders import Nominatim 
#from geopy.distance import geodesic

# Statistical Analysis and Clustering
#from scipy import stats
#import researchpy as rp
#!pip install scikit-learn
#from sklearn.cluster import KMeans

# Data Visualization
#!pip install matplotlib
#import matplotlib.cm as cm
#import matplotlib.colors as colors
#import matplotlib.pyplot as plt
#%matplotlib inline
#import seaborn as sns
#!pip install folium
#import folium

# Display and Configuration
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)
#pd.set_option('display.expand_frame_repr', False)
#pd.set_option('display.width', 1000)

# Miscellaneous
#import time
#from IPython.display import display
#import warnings
#warnings.filterwarnings("ignore")
#warnings.simplefilter(action='ignore', category=FutureWarning)

print("Libraries imported.")

Libraries imported.


## [2] User Input

Store the source page details for each city so they can be selected later.

In [2]:
# Store the city labels and source pages in a dictionary (one list per key, aligned by position)
citiesinfo_dic = {
    'city0': ['Quebec City', 'Montreal', 'Ottawa', 'Toronto', 'Vancouver', 'Paris'],
    'city1': ['Quebec City, Quebec', 'Montreal, Quebec', 'Ottawa, Ontario', 
              'Toronto, Ontario', 'Vancouver, British Columbia', 'Paris, France'],
    'city2': ['Quebec City, QC', 'Montreal, QC', 'Ottawa, ON', 
              'Toronto, ON', 'Vancouver, BC', 'Paris, France'],
    'test1': ['G3N', 'H3A', 'K2C', 'M4G', 'V5A','75003'],
    'url': [
        'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_G',
        'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_H',
        'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_K',
        'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M',
        'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_V',
        'https://fr.geneawiki.com/wiki/Liste_des_quartiers_de_Paris'
    ]
}

## [3] Data Collection

To extract the data from our sources, we first need to know what kind of object pd.read_html returns for each page.

In [3]:
for city, url in zip(citiesinfo_dic['city0'], citiesinfo_dic['url']):
    url_data = pd.read_html(url)
    print(f"City: {city} - Type of url_data: {type(url_data)}")

City: Quebec City - Type of url_data: <class 'list'>
City: Montreal - Type of url_data: <class 'list'>
City: Ottawa - Type of url_data: <class 'list'>
City: Toronto - Type of url_data: <class 'list'>
City: Vancouver - Type of url_data: <class 'list'>
City: Paris - Type of url_data: <class 'list'>


Each page is returned as a list of tables. We can now scrape the postal codes from each page with BeautifulSoup.

In [4]:
citiesinfo_dic['soup'] = []
for city, url in zip(citiesinfo_dic['city0'], citiesinfo_dic['url']):
    try:
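        # Fetch the HTML content of the page (fail on HTTP errors or after 30 seconds)
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        extracted_data = response.text
        
        # Parse the HTML using BeautifulSoup
        soup = BeautifulSoup(extracted_data, 'lxml') # lxml is generally faster than the built-in html.parser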

        # Store the parsed BeautifulSoup object in the dictionary for further analysis
        citiesinfo_dic['soup'].append(soup)
        print(f"Data for {city} stored successfully.")
        
    except Exception as e:
        # Handle any errors encountered
        print(f"An error occurred for {city} ({url}): {e}")
        citiesinfo_dic['soup'].append(None)  # Store None if there's an error

Data for Quebec City stored successfully.
Data for Montreal stored successfully.
Data for Ottawa stored successfully.
Data for Toronto stored successfully.
Data for Vancouver stored successfully.
Data for Paris stored successfully.


Let's preview one block of data from a page to see how to scrape the information we need: postal codes, boroughs and neighborhoods.

To avoid overcrowding this notebook, I will only display an excerpt of the Quebec City data:

In [5]:
print(citiesinfo_dic['soup'][citiesinfo_dic['city0'].index("Quebec City")].prettify()[43050:43490])

       </div>
         <table cellpadding="2" cellspacing="0" rules="all" style="border-collapse: collapse; border: 1px solid #ccc;" width="100%">
          <tbody>
           <tr>
            <td valign="top" width="11.1%">
             <b>
              G1A
             </b>
             <br/>
             <span style="font-size: smaller; line-height: 125%;">
              <a href="/wiki/Quebec_City" title="Quebec City">
             


List the postal codes found on each page.

In [6]:
# Create a function to extract the list of postal codes for Paris
def extract_french_postal_codes(soup):
    postalcodes_list = []
    for tr in soup.find_all('tr'):
        postalcodes_elements = tr.find_all('td', align="center") # Get all the <td> elements within the row
        if len(postalcodes_elements) >= 2: # Ensure the row contains at least two <td> elements
            postalcode = postalcodes_elements[1].text.strip()
            if postalcode.isdigit() and len(postalcode) == 5: # Append only valid 5-digit postal codes
                postalcodes_list.append(postalcode)
    return postalcodes_list

# Create a function to extract the list of postal codes for Canadian cities
def extract_canadian_postal_codes(soup):
    postalcodes_elements = soup.find_all('td', {'style': 'vertical-align:top;'}) + soup.find_all('td', {'valign': 'top'})
    postalcodes_list = [td.find('b').text.strip() for td in postalcodes_elements if td.find('b')]
    return postalcodes_list

def extract_postal_codes(city, soup):
    return extract_french_postal_codes(soup) if city == 'Paris' else extract_canadian_postal_codes(soup)

# Add a new key 'postal_codes' to citiesinfo_dic and store the extracted postal codes
citiesinfo_dic['postal_codes'] = [
    extract_postal_codes(city, soup) if soup else []
    for city, soup in zip(citiesinfo_dic['city0'], citiesinfo_dic['soup'])
]

# Display summary of extracted postal codes
for city, postal_codes in zip(citiesinfo_dic['city0'], citiesinfo_dic['postal_codes']):
    print(f"{city}: {len(postal_codes)} postal codes extracted.")

Quebec City: 200 postal codes extracted.
Montreal: 180 postal codes extracted.
Ottawa: 160 postal codes extracted.
Toronto: 97 postal codes extracted.
Vancouver: 200 postal codes extracted.
Paris: 20 postal codes extracted.


We will use the following methodology to retrieve the postal codes, boroughs and neighborhoods from the HTML data:
- **Compartment**: to investigate all the information assigned to a single postal code, I will isolate the *[td]* tags for Canada and the *[tr]* tags for France.
- **Postal Code**: to extract the postal codes, I will check within the *[b]* tag of each compartment for Canada and within the *[td]* tags for France.
- **Borough**: to extract the borough, I will take the first *[a]* tag inside the *[span]* tag for Canada and the first *[a]* tag with a title for France. Postal codes without one are marked "Unknown" at this stage.
- **Neighborhoods**: to extract the neighborhoods, I will stop at each *[a]* tag (excluding the first one, which is linked to the borough) for Canada and at each *[dd]* tag for Paris, since one postal code area can contain more than one neighborhood.
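
The methodology above can be sketched on a single Canadian compartment. This is a minimal illustration, not the extraction function used in this notebook: the HTML snippet is a hypothetical cell modeled on the Quebec City preview, the neighborhood names are placeholders, and `parse_cell` is a helper name introduced here. It uses the built-in `html.parser` so that the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical cell modeled on the Quebec City preview; real pages may differ.
SAMPLE_HTML = """
<table><tbody><tr>
<td valign="top" width="11.1%">
  <b>G1A</b><br/>
  <span style="font-size: smaller;">
    <a href="/wiki/Quebec_City" title="Quebec City">Quebec City</a>
    (<a href="/wiki/Example_A">Example A</a> / <a href="/wiki/Example_B">Example B</a>)
  </span>
</td>
</tr></tbody></table>
"""

def parse_cell(td):
    """Return (postal_code, borough, neighborhoods) for one <td> compartment."""
    code_tag = td.find("b")                      # [Postal Code]
    span = td.find("span")
    links = span.find_all("a") if span else []
    postal_code = code_tag.get_text(strip=True) if code_tag else None
    borough = links[0].get_text(strip=True) if links else "Unknown"  # [Borough]
    neighborhoods = [a.get_text(strip=True) for a in links[1:]]      # [Neighborhoods]
    return postal_code, borough, neighborhoods

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
cells = soup.find_all("td", {"valign": "top"})   # [Compartment]
result = parse_cell(cells[0])
print(result)
```

Returning the neighborhoods as a list keeps the parsing step separate from the later string clean-up.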

In [7]:
# Build a function to extract the French data
def extract_french_data(soup):
    city_data = []
    # [Compartment]
    for tr in soup.find_all('tr'):
        # [Postal Code]
        postalcode_element = tr.find_all('td', align="center")
        postalcode = (postalcode_element[1].text.strip() if len(postalcode_element) >= 2
                                                            and postalcode_element[1].text.strip().isdigit()
                                                            and len(postalcode_element[1].text.strip()) == 5
                                                        else None)
        if postalcode:
            # [Borough]
            borough = tr.find('a', title=True).text.strip() if tr.find('a', title=True) else "Unknown"
            # [Neighborhoods]
            neighborhoods_list = [n.find('a').text.strip() if n.find('a') else n.text.strip().split(' - ')[-1] for n in tr.find_all('dd')]
            neighborhood = ', '.join(neighborhoods_list) if neighborhoods_list else "Unknown"
            # [Append the final result]
            city_data.append({'Postalcode': postalcode, 'Borough': borough, 'Neighborhood': neighborhood})
    return city_data

In [8]:
# Build a function to extract the Canadian data
def extract_canadian_data(soup, city):
    city_data = []
    # Define cardinal directions to check
    cardinal_directions_list = ["north", "south", "east", "west", "northeast", "northwest", "southeast", "southwest", "central"]
    # Create a tuple of cardinal directions with a space after each for the startswith check
    cardinal_directions_tuple = tuple(direction + " " for direction in cardinal_directions_list)
    # [Compartment]
    for td in soup.find_all('td', {'style': 'vertical-align:top;'}) + soup.find_all('td', {'valign': 'top'}):
        # [Postal Code]
        postalcode = td.find('b').text.strip() if td.find('b') else None
        if postalcode:
            # [Borough]
            span = td.find('span')
            borough = span.find('a').text.strip() if span and span.find('a') else "Unknown"
            
            # [Neighborhoods]
            neighborhood_text = span.text if span else ""
            neighborhoods_list = [a.text.strip() for a in span.find_all('a')[1:]] if span else []
            # Fallback if neighborhoods list is empty
            if not neighborhoods_list:
                neighborhood_text = span.get_text(strip=True) if span else ""
                neighborhoods_list = [neighborhood_text.strip()]
            # Build the neighborhood string with cardinal direction handling
            neighborhood = ', '.join(neighborhoods_list).strip()
            # Clean up the neighborhood: remove parentheses and check for empty values
            neighborhood = re.sub(r"[()]", " ", neighborhood).strip()
            # Remove borough from neighborhood only if it starts with it and there’s no cardinal direction
            if neighborhood.startswith(borough) and not any(n.lower() in cardinal_directions_list for n in neighborhoods_list):
                neighborhood = neighborhood[len(borough):].strip()
            # Handle cases where neighborhood is just a direction
            if neighborhood.lower() in cardinal_directions_list:
                neighborhood = f"{borough} {neighborhood}"
            # Fall back to the borough name if the neighborhood is empty or invalid
            if not neighborhood or neighborhood in {"", "[", "]", ")", "(", ",", "Unknown"} or neighborhood.lower() == "unknown":
                neighborhood = borough
            # Ensure borough names are unique enough
            if borough == city or borough.lower() == "unknown":
                borough = neighborhood.split(',', 1)[0].strip() + " & co" if ',' in neighborhood else neighborhood
            # If neighborhood starts or ends with a hyphen, prepend or append borough as needed
            if neighborhood.startswith('-'):
                neighborhood = f"{borough} {neighborhood.lstrip('-').strip()}".replace(' ', '-')
            elif neighborhood.endswith('-'):
                neighborhood = f"{neighborhood.rstrip('-').strip()} {borough}".replace(' ', '-')
            # [Append the final result]
            city_data.append({'Postalcode': postalcode, 'Borough': borough, 'Neighborhood': neighborhood})
    
    return city_data

In [9]:
# Initialize the 'extracted_data' field in citiesinfo_dic to store results for each city
citiesinfo_dic['extracted_data'] = []

# Iterate over each city in citiesinfo_dic to extract and store data
for city, soup in zip(citiesinfo_dic['city0'], citiesinfo_dic['soup']):
    if soup:
        extracted_data = extract_french_data(soup) if city == 'Paris' else extract_canadian_data(soup, city)
        citiesinfo_dic['extracted_data'].append(extracted_data)
        print(f"Data extracted for {city}.")
    else:
        citiesinfo_dic['extracted_data'].append([])
        print(f"No data available for {city}.")

Data extracted for Quebec City.
Data extracted for Montreal.
Data extracted for Ottawa.
Data extracted for Toronto.
Data extracted for Vancouver.
Data extracted for Paris.


Transform the retrieved data into a dataframe and check that we could retrieve all the postal codes.

In [10]:
# Initialize a list to hold all data with city information
all_data = []

# Iterate over each city and add the city name to each extracted row
for city, city1, city2, extracted_data in zip(citiesinfo_dic['city0'], citiesinfo_dic['city1'], citiesinfo_dic['city2'], citiesinfo_dic['extracted_data']):
    # Add the city name to each entry in extracted_data
    for row in extracted_data:
        row['City'] = city
        row['City1'] = city1
        row['City2'] = city2
        all_data.append(row)

# Create a dataframe from the rows
cities_df = pd.DataFrame(all_data)

# Check row counts for consistency
print("Comparing the count of rows in our DataFrame with the count of postal codes in the Wikipedia source:")
for city in citiesinfo_dic['city0']:
    city_rows_count = cities_df[cities_df['City'] == city].shape[0]
    postal_codes_count = len(citiesinfo_dic['postal_codes'][citiesinfo_dic['city0'].index(city)])
    if city_rows_count == postal_codes_count:
        print(f"{city}: Matched ({city_rows_count})")
    else:
        print(f"{city}: Mismatch (DataFrame: {city_rows_count}, Source: {postal_codes_count})")

Comparing the count of rows in our DataFrame with the count of postal codes in the Wikipedia source:
Quebec City: Matched (200)
Montreal: Matched (180)
Ottawa: Matched (160)
Toronto: Matched (97)
Vancouver: Matched (200)
Paris: Matched (20)


The row counts match for every city, so no postal code was lost between the two extraction steps.
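
Matching counts do not guarantee that each extracted value actually looks like a postal code. As a complementary check, here is a minimal sketch that validates the format of each code. The patterns are assumptions on my part: the Canadian pages are assumed to list three-character codes (letter, digit, letter) and the Paris page five-digit codes. `is_valid_code` and `invalid_codes` are hypothetical helper names, and the sample values come from the 'city0' and 'test1' entries of citiesinfo_dic.

```python
import re

# Loose format checks (assumptions): letter-digit-letter for Canada, five digits for Paris.
CANADIAN_CODE = re.compile(r"[A-Z]\d[A-Z]")
FRENCH_CODE = re.compile(r"\d{5}")

def is_valid_code(city, code):
    """Check a single postal code against the pattern expected for its city."""
    pattern = FRENCH_CODE if city == "Paris" else CANADIAN_CODE
    return pattern.fullmatch(code) is not None

def invalid_codes(city, codes):
    """Return the codes that do not match the expected pattern."""
    return [code for code in codes if not is_valid_code(city, code)]

# Sample values copied from the 'city0' and 'test1' entries of citiesinfo_dic
samples = dict(zip(
    ['Quebec City', 'Montreal', 'Ottawa', 'Toronto', 'Vancouver', 'Paris'],
    ['G3N', 'H3A', 'K2C', 'M4G', 'V5A', '75003'],
))
checks = {city: is_valid_code(city, code) for city, code in samples.items()}
print(checks)
```

In the notebook, `invalid_codes(city, postal_codes)` could be called inside the loop over citiesinfo_dic['postal_codes'] to list any suspicious value per city.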

## [4] Data Cleaning and Formatting

Remove the noise from the dataframe, that is, the rows whose borough and neighborhood are both missing, 'Not assigned' or 'Unknown'.

In [11]:
# Initialize a dictionary to store the removed and remaining row counts for each city
city_counts = {}

# Calculate the initial number of rows for each city, including unassigned postal codes
for city in citiesinfo_dic['city0']:
    city_df = cities_df[cities_df['City'] == city]
    unassigned_city_count = city_df[((city_df['Borough'].isna()) | (city_df['Borough'] == 'Not assigned') | (city_df['Borough'] == 'Unknown')) &
    ((city_df['Neighborhood'].isna()) | (city_df['Neighborhood'] == 'Not assigned') | (city_df['Neighborhood'] == 'Unknown'))
    ].shape[0]
    final_city_count = city_df.shape[0] - unassigned_city_count
    city_counts[city] = (unassigned_city_count, final_city_count)

# Now perform the removal of unassigned postal codes for the entire DataFrame
cities_df = cities_df[~(((cities_df['Borough'].isna()) |(cities_df['Borough'] == 'Not assigned') | (cities_df['Borough'] == 'Unknown')) & 
                        ((cities_df['Neighborhood'].isna()) |(cities_df['Neighborhood'] == 'Not assigned') | (cities_df['Neighborhood'] == 'Unknown')))]

# Calculate total removed across all cities
unassigned_cities_count = sum(removed for removed, _ in city_counts.values())
final_cities_count = cities_df.shape[0]

# Print the results
print("Summary of unassigned postal codes removal")
for city, (unassigned_city_count, final_city_count) in city_counts.items():
    if unassigned_city_count > 0:
        print(f"{city}: Removed {unassigned_city_count}, final rows: {final_city_count}")
    else:
        print(f"{city}: No removals, final rows: {final_city_count}")
print("Total")
if unassigned_cities_count > 0:
    print(f"Removed {unassigned_cities_count}, final rows: {final_cities_count}")
else:
    print(f"No unassigned postal codes were removed across all cities. The dataframe retains all {final_cities_count} rows.")

Summary of unassigned postal codes removal
Quebec City: Removed 60, final rows: 140
Montreal: Removed 57, final rows: 123
Ottawa: Removed 76, final rows: 84
Toronto: No removals, final rows: 97
Vancouver: Removed 5, final rows: 195
Paris: No removals, final rows: 20
Total
Removed 198, final rows: 659
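
The filter above repeats the same three comparisons for each column. As a possible simplification, here is a minimal sketch that builds the mask with `Series.isin`. The toy rows are hypothetical values, not taken from the scraped pages, and `is_unassigned` is a helper name introduced for this illustration.

```python
import pandas as pd

PLACEHOLDERS = ["Not assigned", "Unknown"]

def is_unassigned(series):
    """True where a value is missing or equal to one of the placeholders."""
    return series.isna() | series.isin(PLACEHOLDERS)

# Toy rows (hypothetical values)
toy_df = pd.DataFrame({
    "Postalcode": ["X1A", "X2A", "X3A", "X4A"],
    "Borough": ["Unknown", "Somewhere", None, "Unknown"],
    "Neighborhood": ["Unknown", "Unknown", "Not assigned", "Elsewhere"],
})

# A row is dropped only when BOTH columns are unassigned
mask = is_unassigned(toy_df["Borough"]) & is_unassigned(toy_df["Neighborhood"])
cleaned_df = toy_df[~mask].reset_index(drop=True)
print(cleaned_df)
```

Computing the mask once also makes it possible to derive the per-city counts with `mask.groupby(cities_df['City']).sum()` instead of filtering each city separately.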


Let's review the result visually.

In [12]:
# Reset the index of the filtered DataFrame and display it
cities_df.reset_index(drop=True, inplace=True)
cities_df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood,City,City1,City2
0,G1A,Quebec Provincial Government,Quebec Provincial Government,Quebec City,"Quebec City, Quebec","Quebec City, QC"
1,G2A,North Loretteville,North Loretteville,Quebec City,"Quebec City, Quebec","Quebec City, QC"
2,G3A,Saint-Augustin-de-Desmaures,Saint-Augustin-de-Desmaures,Quebec City,"Quebec City, Quebec","Quebec City, QC"
3,G4A,Clermont,Clermont,Quebec City,"Quebec City, Quebec","Quebec City, QC"
4,G5A,La Malbaie,La Malbaie,Quebec City,"Quebec City, Quebec","Quebec City, QC"


Visual inspection shows that the results are not flawless. They include the following issues:
- Neighborhood entries containing commas, embedded postal codes, parentheses, or missing spaces
- Incomplete retrieval of borough and neighborhood names
- Reversed borough and neighborhood names
- Other similar inconsistencies

However, since our geolocation analysis will primarily focus on postal codes, these issues won't significantly affect our process.
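
Should cleaner names be needed later, some of the issues listed above could be reduced with a few regular expressions. The sketch below is only an illustration under assumptions: `tidy_neighborhood` is a hypothetical helper, the embedded codes are assumed to follow the letter-digit-letter pattern, and the input strings are made-up examples rather than actual rows.

```python
import re

def tidy_neighborhood(name):
    """Apply a few conservative clean-up rules to a neighborhood string."""
    # Drop embedded Canadian-style codes such as "G1A" (assumed pattern)
    name = re.sub(r"\b[A-Z]\d[A-Z]\b", " ", name)
    # Replace parentheses and square brackets with spaces
    name = re.sub(r"[()\[\]]", " ", name)
    # Insert a missing space between a lowercase letter and an uppercase letter
    name = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    # Normalize spacing around commas, then collapse runs of whitespace
    name = re.sub(r"\s*,\s*", ", ", name)
    name = re.sub(r"\s+", " ", name)
    return name.strip(" ,")

# Made-up inputs illustrating the issues listed above
print(tidy_neighborhood("Example Town(North)"))
print(tidy_neighborhood("ExampleVille G1A , Other Place"))
```

The rule that inserts missing spaces is deliberately naive: it would also split legitimate names such as "McGill", so it should be reviewed against the real data before use.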

## [5] Saving

In [16]:
# Save the DataFrame to a CSV file with UTF-8 encoding
cities_df.to_csv('wikipedia_df_output.csv', index=False, encoding='utf-8')
print("DataFrame saved as 'wikipedia_df_output.csv'.")

DataFrame saved as 'wikipedia_df_output.csv'.


In [14]:
# Delete the "soup" entry: BeautifulSoup objects cannot be serialized to JSON
del citiesinfo_dic["soup"]
# Save to JSON file
with open("wikipedia_dic_output.json", "w") as f:
    json.dump(citiesinfo_dic, f, indent=4)
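
To confirm that a saved file can be read back, a round trip can be tested on a small example. This is a minimal sketch under assumptions: the toy dictionary and the file name are hypothetical, and a plain `object()` stands in for a BeautifulSoup object. It also shows an alternative to `del` that leaves the original dictionary untouched.

```python
import json
import os
import tempfile

# Toy dictionary (hypothetical values) shaped like citiesinfo_dic
toy_dic = {
    "city0": ["Paris"],
    "postal_codes": [["75001", "75002"]],
    "soup": [object()],  # stand-in for a non-serializable BeautifulSoup object
}

# Keep only what JSON can represent, without mutating the original dictionary
serializable = {key: value for key, value in toy_dic.items() if key != "soup"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_output.json")
    # Write with an explicit encoding so accented names are stored as-is
    with open(path, "w", encoding="utf-8") as f:
        json.dump(serializable, f, indent=4, ensure_ascii=False)
    # Read the file back and compare it with what was written
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == serializable)
```

Filtering into a new dictionary keeps the parsed pages available in memory if later cells still need them.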