# Demographic clustering of Toulouse
The goal of this project is to gain some insight into the town I live in: Toulouse (FR). 
In the near future I may be interested in buying or selling an apartment, so I would like some insight that could help me find the best location. The example from the introduction of this module is extended with a dataset that is freely available on the INSEE website (although it is in French).
## Step 1: Import many helpful libraries
These libraries were either presented in the course or found on the web; they are used to manipulate demographic data.

In [1]:
import os
import numpy as np
import pandas as pd
import geopandas as gpd
import scipy
import shapely
import matplotlib as mpl
import matplotlib.pyplot as plt
import bokeh
import cartopy
import statsmodels
import sklearn
import geoplot
import folium
import dash
import rasterio
import rasterstats
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import matplotlib.cm as cm
import matplotlib.colors as colors

## Step 2: Download and unzip French Demographic database

In [2]:
os.system("wget -O Filosofi2015_carreaux_niveau_naturel_csv.zip https://www.insee.fr/fr/statistiques/fichier/4176281/Filosofi2015_carreaux_niveau_naturel_csv.zip")
#os.system("wget -O Filosofi2015_carreaux_200m_shp.zip https://www.insee.fr/fr/statistiques/fichier/4176290/Filosofi2015_carreaux_200m_shp.zip")
print("Data Downloaded")

Data Downloaded


In [3]:
!unzip -o -j Filosofi2015_carreaux_niveau_naturel_csv.zip
 #!unzip -o -j Filosofi2015_carreaux_200m_shp.zip

Archive:  Filosofi2015_carreaux_niveau_naturel_csv.zip
  inflating: Filosofi2015_carreaux_niveau_naturel_reg02_csv.7z  
  inflating: Filosofi2015_carreaux_niveau_naturel_reg04_csv.7z  
  inflating: Filosofi2015_carreaux_niveau_naturel_metropole_csv.7z  


Toulouse is in metropolitan France, so we extract only the corresponding archive.

In [4]:
#!7z e Filosofi2015_carreaux_200m_metropole_shp.7z
!7z e Filosofi2015_carreaux_niveau_naturel_metropole_csv.7z


7-Zip 19.00 (x64) : Copyright (c) 1999-2018 Igor Pavlov : 2019-02-21

Scanning the drive for archives:
1 file, 6368936 bytes (6220 KiB)

Extracting archive: Filosofi2015_carreaux_niveau_naturel_metropole_csv.7z
--
Path = Filosofi2015_carreaux_niveau_naturel_metropole_csv.7z
Type = 7z
Physical Size = 6368936
Headers Size = 202
Method = LZMA:25
Solid = -
Blocks = 1

Everything is Ok

Size:       26147825
Compressed: 6368936


## Step 3: Read the CSV file into a dataframe
 

In [5]:
# load the CSV file into a dataframe
df_demographics = pd.read_csv('Filosofi2015_carreaux_niveau_naturel_metropole.csv')

Complete documentation describing the attributes of this dataset can be found here:
https://www.insee.fr/fr/statistiques/4176290?sommaire=4176305

The documentation is in French, so I will translate the relevant parts here.
Each row of this dataset contains demographic data for one square of the French territory.


In [6]:
df_demographics.head()

Unnamed: 0,Id_carr_n,Ind,Men,Men_pauv,Men_1ind,Men_5ind,Men_prop,Men_fmp,Ind_snv,Men_surf,...,Ind_11_17,Ind_18_24,Ind_25_39,Ind_40_54,Ind_55_64,Ind_65_79,Ind_80p,Ind_inc,I_pauv,t_maille
0,CRS3035RES1000mN2034000E4252000,144.0,57.0,9.0,17.0,4.9,33.2,5.0,3352210.9,4874.9,...,9.3,8.1,32.3,23.1,19.7,19.8,5.9,2.1,0,1000
1,CRS3035RES1000mN2034000E4253000,45.0,21.0,3.0,6.0,0.0,12.0,3.0,845666.2,1389.0,...,6.0,2.1,12.0,6.0,3.9,9.0,0.9,1.2,0,1000
2,CRS3035RES1000mN2035000E4252000,81.0,33.8,3.1,11.2,3.1,19.1,9.8,1713668.3,2777.1,...,5.8,6.4,9.7,17.6,16.1,9.8,3.1,3.3,0,1000
3,CRS3035RES1000mN2035000E4253000,46.0,19.9,4.0,7.1,0.9,13.1,1.9,1013579.4,1765.9,...,3.1,3.1,7.9,13.2,8.0,5.1,1.9,0.9,0,1000
4,CRS3035RES1000mN2044000E4253000,91.0,41.9,3.1,18.0,3.0,27.9,4.0,2246566.4,3280.0,...,4.0,3.1,16.9,16.9,14.0,19.1,7.9,0.0,0,1000


Each square is identified by Id_carr_n, which encodes its bottom-left corner and its side length.
For instance, RES1000m means that the side of the square is 1000 meters; this value is also reported in the attribute t_maille.
The coordinates are given in a projected reference system, EPSG:3035, as described in the documentation.
N stands for north and E stands for east: instead of longitude and latitude, positions are given as on a 2D map, with a northing (y position) after N and an easting (x position) after E, both in meters.
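As a minimal sketch of how such an identifier breaks down, the snippet below parses one Id_carr_n with a regular expression. The pattern is inferred from the sample rows shown above rather than from the official specification, and the helper names parse_square_id and square_center are hypothetical.

```python
import re

# Pattern inferred from identifiers such as 'CRS3035RES1000mN2034000E4252000' (an assumption)
ID_PATTERN = re.compile(r'CRS(?P<crs>\d+)RES(?P<size>\d+)mN(?P<northing>\d+)E(?P<easting>\d+)$')

def parse_square_id(square_id):
    """Split an identifier into its EPSG code, side length and bottom-left corner (all integers)."""
    match = ID_PATTERN.match(square_id)
    if match is None:
        raise ValueError('Unexpected identifier format: {!r}'.format(square_id))
    return {name: int(value) for name, value in match.groupdict().items()}

def square_center(square_id):
    """Return (easting, northing) of the center of the square, in meters."""
    parts = parse_square_id(square_id)
    half_side = parts['size'] / 2
    return parts['easting'] + half_side, parts['northing'] + half_side

example = 'CRS3035RES1000mN2034000E4252000'
print(parse_square_id(example))
# {'crs': 3035, 'size': 1000, 'northing': 2034000, 'easting': 4252000}
print(square_center(example))
# (4252500.0, 2034500.0)
```

The center is returned as (easting, northing), that is in (x, y) order, which is the order most projection libraries expect.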

In [7]:
df_demographics.shape

(142921, 31)

The dataset contains 142921 entries, which is quite large.

## Step 4: Let's extract x and y coordinates from Id_carr_n

**The coordinates of the center of each square are computed by parsing Id_carr_n and adding half the square size (t_maille).**

In [8]:
# Note: the number after 'N' is the northing (a y position) and the number after 'E' is the easting (an x position),
# so 'x_coordinate' holds the northing and 'y_coordinate' the easting. The names are kept as they are used in the later cells.
# Half the side length is added to move from the bottom-left corner to the center of the square.
df_demographics['x_coordinate'] = df_demographics.Id_carr_n.str.extract(r'N(\d{7})E', expand=False).astype(int) + df_demographics['t_maille']/2
df_demographics['y_coordinate'] = df_demographics.Id_carr_n.str.extract(r'E(\d{7})', expand=False).astype(int) + df_demographics['t_maille']/2

In [9]:
df_demographics.head()

Unnamed: 0,Id_carr_n,Ind,Men,Men_pauv,Men_1ind,Men_5ind,Men_prop,Men_fmp,Ind_snv,Men_surf,...,Ind_25_39,Ind_40_54,Ind_55_64,Ind_65_79,Ind_80p,Ind_inc,I_pauv,t_maille,x_coordinate,y_coordinate
0,CRS3035RES1000mN2034000E4252000,144.0,57.0,9.0,17.0,4.9,33.2,5.0,3352210.9,4874.9,...,32.3,23.1,19.7,19.8,5.9,2.1,0,1000,2034500.0,4252500.0
1,CRS3035RES1000mN2034000E4253000,45.0,21.0,3.0,6.0,0.0,12.0,3.0,845666.2,1389.0,...,12.0,6.0,3.9,9.0,0.9,1.2,0,1000,2034500.0,4253500.0
2,CRS3035RES1000mN2035000E4252000,81.0,33.8,3.1,11.2,3.1,19.1,9.8,1713668.3,2777.1,...,9.7,17.6,16.1,9.8,3.1,3.3,0,1000,2035500.0,4252500.0
3,CRS3035RES1000mN2035000E4253000,46.0,19.9,4.0,7.1,0.9,13.1,1.9,1013579.4,1765.9,...,7.9,13.2,8.0,5.1,1.9,0.9,0,1000,2035500.0,4253500.0
4,CRS3035RES1000mN2044000E4253000,91.0,41.9,3.1,18.0,3.0,27.9,4.0,2246566.4,3280.0,...,16.9,16.9,14.0,19.1,7.9,0.0,0,1000,2044500.0,4253500.0


## Step 5: Let's compute longitude and latitude for each square

**The square coordinates need to be transformed into latitude and longitude. For that we use the pyproj library.**

In [10]:
from pyproj import Transformer

# EPSG:3035 (projected, meters) -> EPSG:4326 (longitude/latitude in degrees).
# always_xy=True fixes the axis order to (easting, northing) in and (longitude, latitude) out.
transformer = Transformer.from_crs('EPSG:3035', 'EPSG:4326', always_xy=True)
# 'y_coordinate' holds the easting and 'x_coordinate' the northing (see Step 4), hence the argument order.
df_demographics['long'], df_demographics['lat'] = transformer.transform(
    df_demographics['y_coordinate'].values, df_demographics['x_coordinate'].values)


In [11]:
df_demographics.head()

Unnamed: 0,Id_carr_n,Ind,Men,Men_pauv,Men_1ind,Men_5ind,Men_prop,Men_fmp,Ind_snv,Men_surf,...,Ind_55_64,Ind_65_79,Ind_80p,Ind_inc,I_pauv,t_maille,x_coordinate,y_coordinate,long,lat
0,CRS3035RES1000mN2034000E4252000,144.0,57.0,9.0,17.0,4.9,33.2,5.0,3352210.9,4874.9,...,19.7,19.8,5.9,2.1,0,1000,2034500.0,4252500.0,9.184071,41.408295
1,CRS3035RES1000mN2034000E4253000,45.0,21.0,3.0,6.0,0.0,12.0,3.0,845666.2,1389.0,...,3.9,9.0,0.9,1.2,0,1000,2034500.0,4253500.0,9.195982,41.408392
2,CRS3035RES1000mN2035000E4252000,81.0,33.8,3.1,11.2,3.1,19.1,9.8,1713668.3,2777.1,...,16.1,9.8,3.1,3.3,0,1000,2035500.0,4252500.0,9.183952,41.417336
3,CRS3035RES1000mN2035000E4253000,46.0,19.9,4.0,7.1,0.9,13.1,1.9,1013579.4,1765.9,...,8.0,5.1,1.9,0.9,0,1000,2035500.0,4253500.0,9.195865,41.417432
4,CRS3035RES1000mN2044000E4253000,91.0,41.9,3.1,18.0,3.0,27.9,4.0,2246566.4,3280.0,...,14.0,19.1,7.9,0.0,0,1000,2044500.0,4253500.0,9.194808,41.49879


## Step 6: Let's select Toulouse squares

To get the longitude and latitude of Toulouse we can use Nominatim, as shown in the course.


In [12]:
address = 'Toulouse, FR'

geolocator = Nominatim(user_agent="FR_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toulouse are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toulouse are 43.6044622, 1.4442469.


We select all the squares whose latitude and longitude are both within 0.1° of those of Toulouse.

In [13]:
selection_size = 0.1  # half-width of the selection box, in degrees
lower_corner = [latitude-selection_size, longitude-selection_size]
upper_corner = [latitude+selection_size, longitude+selection_size]
# keep the squares inside the box; .copy() avoids a SettingWithCopyWarning when columns are added later
df_demographics = df_demographics.loc[
    df_demographics.lat.between(lower_corner[0], upper_corner[0])
    & df_demographics.long.between(lower_corner[1], upper_corner[1])].copy()
df_demographics.head()

Unnamed: 0,Id_carr_n,Ind,Men,Men_pauv,Men_1ind,Men_5ind,Men_prop,Men_fmp,Ind_snv,Men_surf,...,Ind_55_64,Ind_65_79,Ind_80p,Ind_inc,I_pauv,t_maille,x_coordinate,y_coordinate,long,lat
4288,CRS3035RES1000mN2305000E3631000,125.0,40.0,2.9,2.9,4.0,33.3,2.0,3981311.5,6275.7,...,15.0,12.9,7.1,12.9,0,1000,2305500.0,3631500.0,1.475173,43.505158
4289,CRS3035RES1000mN2305000E3634000,1230.5,521.0,17.9,141.0,29.1,388.0,48.9,34788531.4,50434.0,...,189.0,149.9,44.0,51.0,0,1000,2305500.0,3634500.0,1.511997,43.508222
4290,CRS3035RES1000mN2305000E3635000,54.0,23.0,1.0,7.8,0.0,13.0,0.0,1426195.4,2323.8,...,6.9,9.3,2.9,0.0,0,1000,2305500.0,3635500.0,1.524273,43.50924
4376,CRS3035RES1000mN2306000E3622000,1463.0,659.0,54.9,214.0,27.0,406.1,61.0,35119333.9,55500.0,...,151.1,235.0,65.1,19.0,0,1000,2306500.0,3622500.0,1.363387,43.504846
4377,CRS3035RES1000mN2306000E3623000,614.5,255.0,27.2,70.1,15.2,171.0,27.2,16022939.6,26761.0,...,81.1,99.3,33.9,18.9,0,1000,2306500.0,3623500.0,1.37566,43.505882
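A selection expressed in degrees is not square on the ground, because a degree of longitude gets shorter as latitude increases. The sketch below estimates the size of the ±0.1° box around Toulouse; it assumes a spherical Earth with a mean radius of 6371 km, and the function name box_half_size_km is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation (assumption)

def box_half_size_km(latitude_deg, selection_size_deg):
    """Return the (north-south, east-west) half-extent in km of a +/- selection_size_deg box."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360
    north_south = selection_size_deg * km_per_degree
    # Meridians converge towards the poles, so a degree of longitude shrinks with cos(latitude)
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west

ns, ew = box_half_size_km(43.6044622, 0.1)
print('north-south: {:.1f} km, east-west: {:.1f} km'.format(ns, ew))
# north-south: 11.1 km, east-west: 8.1 km
```

Under these assumptions the selection covers roughly 22 km from north to south and 16 km from east to west.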


## Step 7: Let's compute some other metrics

The population count of each square is transformed into a population density in order to have a homogeneous measure.

In [14]:
# total population density [individuals per square kilometer]; t_maille is the side length in meters
df_demographics['pop_density'] = df_demographics['Ind']/(df_demographics['t_maille']**2)*10**6
# population density of each age group [individuals per square kilometer]
age_groups = ['0_3', '4_5', '6_10', '11_17', '18_24', '25_39', '40_54', '55_64', '65_79', '80p']
for age in age_groups:
    df_demographics['pop_density_'+age] = df_demographics['Ind_'+age]/(df_demographics['t_maille']**2)*10**6
df_demographics.head()

Unnamed: 0,Id_carr_n,Ind,Men,Men_pauv,Men_1ind,Men_5ind,Men_prop,Men_fmp,Ind_snv,Men_surf,...,pop_density_0_3,pop_density_4_5,pop_density_6_10,pop_density_11_17,pop_density_18_24,pop_density_25_39,pop_density_40_54,pop_density_55_64,pop_density_65_79,pop_density_80p
4288,CRS3035RES1000mN2305000E3631000,125.0,40.0,2.9,2.9,4.0,33.3,2.0,3981311.5,6275.7,...,3.9,4.0,6.7,11.1,4.4,12.1,34.9,15.0,12.9,7.1
4289,CRS3035RES1000mN2305000E3634000,1230.5,521.0,17.9,141.0,29.1,388.0,48.9,34788531.4,50434.0,...,56.1,30.9,72.1,89.5,53.0,193.9,301.1,189.0,149.9,44.0
4290,CRS3035RES1000mN2305000E3635000,54.0,23.0,1.0,7.8,0.0,13.0,0.0,1426195.4,2323.8,...,4.0,1.0,0.4,3.8,1.0,14.9,9.8,6.9,9.3,2.9
4376,CRS3035RES1000mN2306000E3622000,1463.0,659.0,54.9,214.0,27.0,406.1,61.0,35119333.9,55500.0,...,79.4,22.5,96.9,96.1,74.9,365.0,258.0,151.1,235.0,65.1
4377,CRS3035RES1000mN2306000E3623000,614.5,255.0,27.2,70.1,15.2,171.0,27.2,16022939.6,26761.0,...,23.7,10.8,33.3,49.5,36.0,105.9,122.1,81.1,99.3,33.9
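For 1000 m squares the density equals the raw count, as the rows above show, but the conversion matters as soon as a square has a different side length. The sketch below works through the arithmetic; the function name population_density is hypothetical and the 200 m example uses a made-up count.

```python
def population_density(individuals, side_m):
    """Convert a head count in a square of side side_m (meters) to individuals per km²."""
    area_km2 = (side_m / 1000) ** 2
    return individuals / area_km2

# 125 individuals in a 1000 m square (first row above) versus
# 50 individuals in a 200 m square (made-up count for illustration)
for individuals, side_m in [(125.0, 1000), (50.0, 200)]:
    print('{} individuals in a {} m square: {:.1f} per km²'.format(
        individuals, side_m, population_density(individuals, side_m)))
```

The smaller square holds fewer people but is ten times denser, which is why counts alone cannot be compared across squares of different sizes.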


## Step 8: Let's display population density on a map

The results are shown on a map using folium, as in the course. The color of each marker reflects the population density of its square.

In [15]:
# create a map of Toulouse centered on its latitude and longitude
map_Toulouse = folium.Map(location=[latitude, longitude], zoom_start=12)
# map each population density to a color of the 'hot' colormap
hot_cmap = plt.get_cmap('hot')
norm = mpl.colors.Normalize(vmin=df_demographics['pop_density'].min(), vmax=df_demographics['pop_density'].max())
marker_colors = [colors.rgb2hex(rgba) for rgba in hot_cmap(norm(df_demographics['pop_density']))]
# add one circle marker per square, colored by its population density
for lat, lng, density, clr in zip(df_demographics['lat'], df_demographics['long'], df_demographics['pop_density'], marker_colors):
    label = folium.Popup('Pop density: {}'.format(density), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=clr,
        fill=True,
        fill_color=clr,
        fill_opacity=0.7).add_to(map_Toulouse)

map_Toulouse
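The coloring in the cell above has two steps: densities are rescaled linearly to the range [0, 1], then each rescaled value is looked up in a colormap and converted to a hex string. The dependency-free sketch below illustrates the idea under simplifying assumptions: it uses a plain black-to-red ramp as a stand-in for matplotlib's 'hot' colormap, and the names normalize and ramp_to_hex are hypothetical.

```python
def normalize(value, vmin, vmax):
    """Linearly rescale value from [vmin, vmax] to [0, 1]."""
    if vmax == vmin:
        return 0.0  # avoid a division by zero when all values are equal
    return (value - vmin) / (vmax - vmin)

def ramp_to_hex(t):
    """Map t in [0, 1] to a hex color on a black-to-red ramp (stand-in for a real colormap)."""
    t = min(max(t, 0.0), 1.0)  # clamp out-of-range inputs
    return '#{:02x}0000'.format(round(255 * t))

# Sample densities taken from the 1000 m squares listed in Step 7; the notebook
# uses the minimum and maximum of the whole selection instead of this small sample
densities = [54.0, 125.0, 614.5, 1230.5, 1463.0]
vmin, vmax = min(densities), max(densities)
marker_colors = [ramp_to_hex(normalize(d, vmin, vmax)) for d in densities]
print(marker_colors)
```

Because the scaling is linear, a few very dense squares can push most markers towards the dark end of the ramp; a logarithmic normalization is one possible alternative.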