# Comparative Analysis of Neighborhoods | French Rent Data Preparation

The website data.gouv.fr is the French government's open data platform, managed by Etalab, which provides public access to datasets from various governmental institutions. We will also retrieve 2023 french median rent prices.

Link: https://www.data.gouv.fr/fr/datasets/carte-des-loyers-indicateurs-de-loyers-dannonce-par-commune-en-2023/#/resources

## [1] Working environment set up

Before starting, we need to install and import libraries.

In [1]:
# Data Access and Web Scraping
#!pip install beautifulsoup4
#from bs4 import BeautifulSoup
!pip install requests
import requests

# Data Storage and File Handling
import json
#from pandas import json_normalize
#!pip install openpyxl
#!pip install lxml
#import xml
#import zipfile
from io import BytesIO, StringIO

# Data Manipulation and Processing
!pip install pandas
import pandas as pd
#import numpy as np
#import re
#import unicodedata
#!pip install unidecode
#import unidecode
#!pip install fuzzywuzzy
#!pip install python-Levenshtein
#from fuzzywuzzy import fuzz
#from fuzzywuzzy import process
#from difflib import get_close_matches

# Geolocation and Mapping
#!pip install geocoder
#import geocoder
!pip install geopy
#from geopy.geocoders import Nominatim 
from geopy.distance import geodesic

# Statistical Analysis and Clustering
#from scipy import stats
#import researchpy as rp
#!pip install scikit-learn
#from sklearn.cluster import KMeans

# Data Visualization
#!pip install matplotlib
#import matplotlib.cm as cm
#import matplotlib.colors as colors
#import matplotlib.pyplot as plt
#%matplotlib inline
#import seaborn as sns
#!pip install folium
#import folium

# Display and Configuration
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)
#pd.set_option('display.expand_frame_repr', False)
#pd.set_option('display.width', 1000)

# Miscellaneous
#import time
#from IPython.display import display
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings("ignore", category=pd.errors.SettingWithCopyWarning)

print("Libraries imported.")

Libraries imported.


## [2] Data Retrieval

Before starting, we need to open the canadian rent dataframe and geolocation dictionary.

In [2]:
# Load from CSV file
cities_df = pd.read_csv('canadianrent_df_output.csv', encoding='utf-8')

In [3]:
# Load from JSON file
with open("geolocation_dic_output.json", "r") as f:
    citiesinfo_dic = json.load(f)

Investigate the links provided in the following french research "Carte des loyers - Indicateurs de loyers d'annonce par commune en 2023" available in the data.gouv website. the research itself was based on data extracted from two well known real estate websites: Seloger.com and LeBonCoin.

In [4]:
# List of URLs to download data from and initialise the dataframes exported
urls = [
    ("12rooms","https://static.data.gouv.fr/resources/carte-des-loyers-indicateurs-de-loyers-dannonce-par-commune-en-2023/20240115-134722/pred-app12-mef-dhup.csv"),
    ("3rooms","https://static.data.gouv.fr/resources/carte-des-loyers-indicateurs-de-loyers-dannonce-par-commune-en-2023/20240115-134704/pred-app3-mef-dhup.csv")
]
dataframes = []

# Loop through each URL and download the data
for source_label, url in urls:
    response = requests.get(url)
    data = StringIO(response.text)
    df = pd.read_csv(data, delimiter=';')
    df['source'] = source_label
    dataframes.append(df)

# Merge all the dataframes (axis=0 concatenates them row-wise)
datagouv_df = pd.concat(dataframes, axis=0, ignore_index=True)

datagouv_df.head()

Unnamed: 0,id_zone,INSEE_C,LIBGEO,EPCI,DEP,REG,loypredm2,lwr.IPm2,upr.IPm2,TYPPRED,nbobs_com,nbobs_mail,R2_adj,source
0,1,36077,Fontguenand,200040558,36,24,918996654326032,719993079768473,117300412239243,maille,0,521,608561344447099,12rooms
1,1,36092,Langé,200040558,36,24,918996654326032,719993079768473,117300412239243,maille,0,521,608561344447099,12rooms
2,1,36123,Mézières-en-Brenne,243600343,36,24,918996654326032,719993079768473,117300412239243,maille,8,521,608561344447099,12rooms
3,1,36013,Baudres,243600293,36,24,918996654326032,719993079768473,117300412239243,maille,0,521,608561344447099,12rooms
4,1,36229,Val-Fouzon,243600202,36,24,918996654326032,719993079768473,117300412239243,maille,1,521,608561344447099,12rooms


## [3] Data Cleaning and Formatting

**Cleaning and Formatting the French Rent Data**

In [5]:
# Get the columns from the merged dataframe
print(datagouv_df.columns.tolist())

['id_zone', 'INSEE_C', 'LIBGEO', 'EPCI', 'DEP', 'REG', 'loypredm2', 'lwr.IPm2', 'upr.IPm2', 'TYPPRED', 'nbobs_com', 'nbobs_mail', 'R2_adj', 'source']


Can we use the TYPRED field to differentiate the property type as it likely refers to the type of prediction model or category ?

In [6]:
unique_typred_values = datagouv_df['TYPPRED'].unique()
print(unique_typred_values)

['maille' 'commune' 'EPCI']


No, we can't use that as it lacks sufficient useful information.

Let's now check the format of the postal codes, they might not be matching ours.

In [7]:
# Test the postal code format
if (datagouv_df['INSEE_C'] == "75012").any():
    print("The format is 75012.")
elif (datagouv_df['INSEE_C'] == "75112").any():
    print("The format is 75112.")

The format is 75112.


In [8]:
# Update the postal code format
datagouv_df['INSEE_C'] = datagouv_df['INSEE_C'].apply(lambda x: str(x) if pd.notnull(x) else '')

# Function to replace the third character from 0 to 1
def update_postal_code(value):
    if pd.isnull(value) or len(value) < 3:
        return value
    return value[:2] + '0' + value[3:] if value[2] == '1' else value

# Apply the function to the column
datagouv_df['Postal_Code'] = datagouv_df['INSEE_C'].apply(update_postal_code)

# Test again the postal code format
if (datagouv_df['Postal_Code'] == "75012").any():
    print("The format is 75012.")
elif (datagouv_df['Postal_Code'] == "75112").any():
    print("The format is 75112.")

The format is 75012.


In [9]:
# Filter housing_df to keep only rows where the postal code is in cities_df
datagouv_df = datagouv_df[datagouv_df['Postal_Code'].isin(cities_df['Postalcode'])]

According to the information provided on the research analysis:
- For a T1-T2 type apartment: area of 37 m² and average area per room of 23.0 m².
- For a T3 and larger type apartment: area of 72 m² and average area per room of 21.2 m².
- For a house: area of 92 m² and average area per room of 22.4 m².

We will therefore assume the following average surface areas for each apartment type:
- A studio apartment is estimated to be around 25 square meters and its rent will based on the pricing from the '12rooms' source.
- A 1-bedroom apartment is estimated to be around 40 square meters and its rent will based on the pricing from the '12rooms' source.
- A 2-bedroom apartment is estimated to be around 60 square meters and its rent will based on the pricing from the '3rooms' source.
- A 3-bedroom apartment is estimated to be around 80 square meters and its rent will based on the pricing from the '3rooms' source.

In [10]:
# Define the surfaces for different room counts (in square meters)
surface_mapping = {0: 25, 1: 40, 2: 60, 3: 80}

# Duplicate the dataset based on the source and assign room counts and surfaces
datagouv_0_bed = datagouv_df[datagouv_df['source'] == '12rooms'].copy()
datagouv_0_bed['Bedroom_Count'] = 0
datagouv_0_bed['Surface'] = surface_mapping[0]

datagouv_1_bed = datagouv_df[datagouv_df['source'] == '12rooms'].copy()
datagouv_1_bed['Bedroom_Count'] = 1
datagouv_1_bed['Surface'] = surface_mapping[1]

datagouv_2_bed = datagouv_df[datagouv_df['source'] == '3rooms'].copy()
datagouv_2_bed['Bedroom_Count'] = 2
datagouv_2_bed['Surface'] = surface_mapping[2]

datagouv_3_bed = datagouv_df[datagouv_df['source'] == '3rooms'].copy()
datagouv_3_bed['Bedroom_Count'] = 3
datagouv_3_bed['Surface'] = surface_mapping[3]

# Combine all the datasets back together and format the fields
datagouv_updated_df = pd.concat([datagouv_0_bed, datagouv_1_bed, datagouv_2_bed, datagouv_3_bed], ignore_index=True)
datagouv_updated_df['Bedroom_Count'] = datagouv_updated_df['Bedroom_Count'].astype('Int64')

After investigating the results, we will use the 'upr.IPm2' metric, representing the uper median rent price per square meter. We choose the upper interval predicion to reflect the recent inflation on the prices. We will multiply the price per square meter by the average surface areas defined for each property type. This will allow us to estimate the median rent price for each type of property.

In [11]:
# Correct formatting for the 'loypredm2' and 'Bedroom_Count' columns
datagouv_updated_df['upr.IPm2'] = datagouv_updated_df['upr.IPm2'].apply(lambda x: float(str(x).replace(',', '.')) if pd.notnull(x) else None)

# Calculate the updated valuation by multiplying the valuation by the surface
datagouv_updated_df['Median_Rent'] = datagouv_updated_df['upr.IPm2'] * datagouv_updated_df['Surface']

# Display the resulting dataframe
datagouv_updated_df.reset_index(drop=True, inplace=True)

# First, create a pivot table from the frenchhousing_df to get each bedroom type as a column
frenchrent_df = datagouv_updated_df.pivot(index='Postal_Code', 
                                                columns='Bedroom_Count', 
                                                values='Median_Rent')

# Rename the columns according to the number of bedrooms
rent_columns = ['Median Rent Studio', 'Median Rent 1 Bedroom', 'Median Rent 2 Bedrooms', 'Median Rent 3 Bedrooms']
frenchrent_df.columns = rent_columns

# Reset index to convert the postal codes back to a column
frenchrent_df.reset_index(inplace=True)
frenchrent_df[rent_columns] = frenchrent_df[rent_columns].round(0)

frenchrent_df.shape
frenchrent_df.head(10)

Unnamed: 0,Postal_Code,Median Rent Studio,Median Rent 1 Bedroom,Median Rent 2 Bedrooms,Median Rent 3 Bedrooms
0,75001,1337.0,2140.0,3002.0,4003.0
1,75002,1108.0,1773.0,2409.0,3211.0
2,75003,1177.0,1884.0,2502.0,3336.0
3,75004,1208.0,1934.0,2650.0,3533.0
4,75005,1065.0,1704.0,2372.0,3163.0
5,75006,1239.0,1983.0,2625.0,3501.0
6,75007,1170.0,1871.0,2643.0,3523.0
7,75008,1167.0,1868.0,2568.0,3423.0
8,75009,1045.0,1672.0,2191.0,2922.0
9,75010,979.0,1566.0,2104.0,2805.0


In [12]:
# Ensure 'Postalcode' is set as the index in both DataFrames for alignment
cities_df.set_index('Postalcode', inplace=True)
frenchrent_df.set_index('Postal_Code', inplace=True)

# For each rent column in rent_columns, fill missing values in cities_df using frenchrent_df
for column in rent_columns:
    if column in frenchrent_df.columns:
        # Only fill missing values in cities_df with corresponding values from frenchrent_df
        cities_df[column].fillna(frenchrent_df[column], inplace=True)

# Reset the index to restore 'Postalcode' as a column
cities_df.reset_index(inplace=True)

**Final review of the Canadian and French Rent Data**

Droping unecessary columns.

In [13]:
# Drop the unecessary columns
cities_df = cities_df.drop(columns=['City1', 'City2', 'Neigh_normalized'])

Checked how many missing rent data do we have in our dataset to evaluate if we need to further populate it.

In [14]:
# Initialize a dictionary to store missing data details per city
missing_data_per_city = {}

# Loop through each unique city and perform the merge and filtering operations
for city in cities_df['City'].unique():
    # Filter the DataFrame for the current city
    city_df = cities_df[cities_df['City'] == city]
    
    # Filter rows where 'Median Rent Studio' is NaN
    nan_city_df = city_df[city_df['Median Rent Studio'].isna()]
    nan_city_df_count = nan_city_df.shape[0]
    nan_city_df_total = city_df.shape[0]
    nan_city_list = nan_city_df['Borough'].unique().tolist()

    # Store the information for the current city
    missing_data_per_city[city] = {
        "missing_count": nan_city_df_count,
        "total_count": nan_city_df_total,
        "missing_boroughs": nan_city_list
    }

# Print the results for each city
print("Summary of missing median rent data by postal codes:")
for city, data in missing_data_per_city.items():
    print(f"- {city}: {data['missing_count']} missing out of {data['total_count']}.")
    print(f"=> List of missing boroughs: {data['missing_boroughs']}\n")

Summary of missing median rent data by postal codes:
- Quebec City: 84 missing out of 121.
=> List of missing boroughs: ['Clermont', 'La Malbaie', 'Saint-Georges', 'Saguenay', 'Trois-Rivières', 'Beauport North', 'Lac-Beauport', 'Port-Cartier', 'Lac-Mégantic', 'Quebec City Northwest', 'Baie-Comeau', 'Saint-Émile', 'Sainte-Marie', 'Charlesbourg  Orsainville', 'Thetford Mines', 'Métabetchouan–Lac-à-la-Croix', 'South Charlesbourg', 'Mont-Joli', 'Roberval', 'Bécancour', 'Saint-Prime', 'South Val-Bélair', 'Saint-Félicien', 'Saint-Raymond', 'Rimouski', 'Plessisville', 'Dolbeau-Mistassini', 'Donnacona', 'Normandin', 'Quebec City West', 'Sainte-Catherine-de-la-Jacques-Cartier', 'Hébertville', 'Shawinigan', 'Victoriaville', 'Saint-Ambroise', 'Sept-Îles', 'Rivière-du-Loup', 'Shannon', 'Dégelis', 'Sainte-Anne-des-Monts', 'Montmagny', 'Lévis', 'Beauceville', 'La Tuque', 'Cap-Rouge', 'Baie-Saint-Paul', 'Saint-Martin', 'Saint-Joseph-de-Coleraine', 'Armagh', "Saint-Eugène-d'Argentenay", 'Aston-Jonctio

Up to a third of the rent data is missing in some cities. So we need ot build a function that replaces the missing rent data with the one from the nearest postal code with available rent data.

In [15]:
# Define a function to find the closest postal code with available rent data
def find_closest_rent_data(row, filled_df):
    # Extract the row's geolocation
    origin = (row['Latitude'], row['Longitude'])
    
    # Initialize minimum distance and closest rent values
    min_distance = float('inf')
    closest_rents = {}
    
    # Loop through the filled data to find the nearest location
    for _, filled_row in filled_df.iterrows():
        target = (filled_row['Latitude'], filled_row['Longitude'])
        distance = geodesic(origin, target).kilometers
        
        # Update if this is the closest location
        if distance < min_distance:
            min_distance = distance
            closest_rents = filled_row[rent_columns].to_dict()  # Get rent data for closest postal code
    
    # Return the closest rent values as a dictionary
    return closest_rents

# Separate rows with missing rent data and those with complete rent data
missing_rent_df = cities_df[cities_df[rent_columns].isna().any(axis=1)]
filled_rent_df = cities_df.dropna(subset=rent_columns)

# Loop through each row with missing rent data and fill with closest rent data
for i, row in missing_rent_df.iterrows():
    closest_rents = find_closest_rent_data(row, filled_rent_df)
    for column in rent_columns:
        if pd.isna(cities_df.at[i, column]):  # Only fill if still NaN
            cities_df.at[i, column] = closest_rents.get(column)

We have so far computed the rents in their local currencies. Though to be able to perform international comparison analysis, we need to align all the rents in US dollar.

In [16]:
# Display rents only in dollar
conversion_rate = 0.9248 # Conversion rate USD/EUR
for col in rent_columns:
    new_col_name = f"{col} in USD"
    cities_df[new_col_name] = cities_df.apply(lambda x: round(x[col] * conversion_rate, 1) if x['City'] == 'Paris' else x[col], axis=1)
rent_columns_USD = cities_df.columns[-4:] 

**Reviewing the final cities_df data to prepare the dataset for descriptive, explanatory and predicitive analysis.**

Retrieve the city index to keep the same city order to display our results when performing the different data analysis.

In [17]:
neigh_distrib_city_counts = cities_df['City'].value_counts()
cities_index = neigh_distrib_city_counts.index

If we want to compare rents and amenities between cities, we need to normalise their size and spacial layout. Indeed, the cities we investage are located in different countries and they vary greatly in population, size and spatial layout.

So before testing hypothesis, investigating correlations or performing regressions, it's essential to quantify and normalize the distance between each neighborhoods and their respective city center. This way we will be able to compare them.

To normalize, I will use the transform function to group by City and calculate the z-score within each city. This will adjust rent values, the proximity distance and the number of venues to each city’s specific mean and standard deviation.

In [18]:
# Initialize the variables, columns and dictionary to store calculated inner radius values for each city
citiesinfo_dic['radius'] = []
percentage_of_city_radius = 0.05  # 5%
cities_df['innerouter'] = 'outer' # Default value
cities_df['Proximity'] = 0.0  # Temporary placeholder for distances

 # Calculate the maximum distance from the city center to neighborhoods in that city
for city in cities_index:
    city_index = citiesinfo_dic['city0'].index(city)
    center_coords = (citiesinfo_dic['latitude'][city_index], citiesinfo_dic['longitude'][city_index])
    city_neighborhoods = cities_df[cities_df['City'] == city]
    max_distance = city_neighborhoods.apply(
        lambda row: geodesic(center_coords, (row['Latitude'], row['Longitude'])).kilometers,
        axis=1
    ).max()
    inner_radius_km = max_distance * percentage_of_city_radius
    citiesinfo_dic['radius'].append(inner_radius_km)

# Iterate through each row in cities_df
for i, row in cities_df.iterrows():
    city = row['City']
    city_index = citiesinfo_dic['city0'].index(city)
    center_coords = (citiesinfo_dic['latitude'][city_index], citiesinfo_dic['longitude'][city_index])
    neighborhood_coords = (row['Latitude'], row['Longitude'])
    distance_to_center = geodesic(center_coords, neighborhood_coords).kilometers
    if distance_to_center <= citiesinfo_dic['radius'][city_index]:
        cities_df.at[i, 'innerouter'] = 'inner'
    cities_df.at[i, 'Proximity'] = distance_to_center

# Apply z-score normalization
for rent_column in rent_columns_USD:
    cities_df[f'Normalized {rent_column}'] = cities_df.groupby('City')[rent_column].transform(lambda x: (x - x.mean()) / x.std())
cities_df['Normalized Proximity'] = cities_df.groupby('City')['Proximity'].transform(lambda x: (x - x.mean()) / x.std())
cities_df['Normalized Total Venues'] = cities_df.groupby('City')['Total Venues'].transform(lambda x: (x - x.mean()) / x.std())

## [4] Saving

We can now review the data that has been collected, cleaning, formatted and centralised: 

In [19]:
for col in cities_df.columns:
    print(f"- {col}")

- Postalcode
- Borough
- Neighborhood
- City
- Latitude
- Longitude
- IsDuplicate
- Total Venues
- 1st Most Common Label
- 2nd Most Common Label
- 3rd Most Common Label
- 4th Most Common Label
- 5th Most Common Label
- 6th Most Common Label
- 7th Most Common Label
- 8th Most Common Label
- 9th Most Common Label
- 10th Most Common Label
- Median Rent Studio
- Median Rent 1 Bedroom
- Median Rent 2 Bedrooms
- Median Rent 3 Bedrooms
- Median Rent Studio in USD
- Median Rent 1 Bedroom in USD
- Median Rent 2 Bedrooms in USD
- Median Rent 3 Bedrooms in USD
- innerouter
- Proximity
- Normalized Median Rent Studio in USD
- Normalized Median Rent 1 Bedroom in USD
- Normalized Median Rent 2 Bedrooms in USD
- Normalized Median Rent 3 Bedrooms in USD
- Normalized Proximity
- Normalized Total Venues


In [20]:
# Check a sample of the final dataframe
cities_df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood,City,Latitude,Longitude,IsDuplicate,Total Venues,1st Most Common Label,2nd Most Common Label,...,Median Rent 2 Bedrooms in USD,Median Rent 3 Bedrooms in USD,innerouter,Proximity,Normalized Median Rent Studio in USD,Normalized Median Rent 1 Bedroom in USD,Normalized Median Rent 2 Bedrooms in USD,Normalized Median Rent 3 Bedrooms in USD,Normalized Proximity,Normalized Total Venues
0,G1A,Quebec Provincial Government,Quebec Provincial Government,Quebec City,46.809315,-71.213023,False,100.0,Dining and Lunching,Landmarks and Outdoors,...,906.0,1025.0,inner,0.349394,-0.352042,-0.772624,-0.754949,-0.900366,-0.30154,0.232268
1,G4A,Clermont,Clermont,Quebec City,47.690608,-70.219378,False,5.0,Travel and Transportation,Food and Beverage Retail,...,1029.0,1381.0,inner,123.327396,-0.664645,-0.050637,-0.098263,0.497262,-0.040151,-0.355174
2,G5A,La Malbaie,La Malbaie,Quebec City,47.656944,-70.151389,False,6.0,Dining and Lunching,Landmarks and Outdoors,...,1029.0,1381.0,inner,123.687661,-0.664645,-0.050637,-0.098263,0.497262,-0.039385,-0.34899
3,G6A,Saint-Georges,Saint-Georges Northwest,Quebec City,46.122714,-70.670151,True,11.0,Dining and Lunching,Arts and Entertainment,...,1257.0,1519.0,inner,87.314832,1.255634,1.1917,1.11901,1.039039,-0.116695,-0.318072
4,G7A,Lévis,Lévis South,Quebec City,46.699333,-71.301231,False,9.0,Travel and Transportation,Landmarks and Outdoors,...,1257.0,1519.0,inner,14.195728,1.255634,1.1917,1.11901,1.039039,-0.27211,-0.33044


In [21]:
# Save the DataFrame to a CSV file with UTF-8 encoding
cities_df.to_csv('final_df_output.csv', index=False, encoding='utf-8')
print("DataFrame saved as 'final_df_output.csv'.")

DataFrame saved as 'final_df_output.csv'.
