# Making choropleth maps in Altair

Here's a quick example of how to make a choropleth map in Altair.  We'll work with a fairly large dataset of baby names in France from 1900 to 2020, broken down by department.

To work with geographical data, we'll use `geopandas`, which extends `pandas` dataframes with a geometry column and can read geographical outlines from formats such as `geojson`.  You can use these dataframes just as you would a regular `pandas` dataframe, but they carry that extra outline data.

To get started, we'll need to import our libraries.

In [8]:
import altair as alt
import pandas as pd
import geopandas as gpd # Requires geopandas -- e.g.: conda install -c conda-forge geopandas
alt.data_transformers.enable('json') # Let Altair/Vega-Lite work with large data sets

pass

# Reading our names data

Now, let's read in our dataset.  The exported data is in CSV format, but with a `;` separator instead of commas.  The INSEE data lumps rare names together under `_PRENOMS_RARES` and marks rows whose department-level information has been elided with `XX` (presumably to protect individuals with uncommon names, or who were among the only ones given that name in a given year).  We'll strip both kinds of rows out.

In [9]:
names = pd.read_csv("dpt2020.csv", sep=";")
names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
names.drop(names[names.dpt == 'XX'].index, inplace=True)

names.sample(5)

Unnamed: 0,sexe,preusuel,annais,dpt,nombre
625981,1,GÉRARD,1931,58,23
3004023,2,MALIKA,1953,80,3
2273137,2,DJAMILA,1989,13,6
3481697,2,ROSETTE,1916,972,4
414508,1,EDMOND,1989,76,3
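
One detail worth checking when reading this kind of file is how the department code is typed. The sketch below uses a tiny made-up sample in the same column layout (not real INSEE figures) and assumes that department codes can carry leading zeros or letters, such as `01` or `2A`. Passing `dtype={"dpt": str}` is an optional precaution, not something the cell above does.

```python
import io
import pandas as pd

# Tiny made-up sample in the same layout as dpt2020.csv (not real INSEE figures)
sample_csv = io.StringIO(
    "sexe;preusuel;annais;dpt;nombre\n"
    "1;_PRENOMS_RARES;1900;02;7\n"
    "1;JEAN;1900;01;10\n"
    "2;MALIKA;1953;XX;3\n"
    "2;ROSETTE;1916;972;4\n"
    "1;ANGE;1950;2A;5\n"
)

# Reading `dpt` as text keeps leading zeros and codes such as '2A' intact
demo = pd.read_csv(sample_csv, sep=";", dtype={"dpt": str})

# A boolean mask keeps only rows with a real name and a real department
keep = (demo["preusuel"] != "_PRENOMS_RARES") & (demo["dpt"] != "XX")
demo = demo[keep].reset_index(drop=True)

print(demo)
```

Filtering with a mask returns a new dataframe rather than modifying the original in place, which makes the step easier to rerun in a notebook.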


# Loading map data

Next, let's load some map data of regions in France using `geopandas`.  These map data come from the [INSEE] and [IGN] and were processed into the `geojson` format we'll need to work with by [Grégoire David].  Here's the [github] repository.

In this example, we'll work with the simplified departments tiles for the Hexagon, but that repository contains higher-resolution versions, the DOM-TOM, and more.

[Grégoire David]: https://gregoiredavid.fr
[INSEE]: http://www.insee.fr/fr/methodes/nomenclatures/cog/telechargement.asp
[IGN]: https://geoservices.ign.fr/adminexpress
[github]: https://github.com/gregoiredavid/france-geojson/

In [10]:
depts = gpd.read_file('departements-version-simplifiee.geojson')

depts.sample(5)

Unnamed: 0,code,nom,geometry
26,28,Eure-et-Loir,"POLYGON ((0.81482 48.67017, 0.82767 48.68072, ..."
13,14,Calvados,"POLYGON ((-1.11962 49.35557, -1.07822 49.38849..."
65,65,Hautes-Pyrénées,"MULTIPOLYGON (((-0.10308 43.24282, -0.12194 43..."
91,91,Essonne,"POLYGON ((2.22656 48.77610, 2.23298 48.76620, ..."
2,3,Allier,"POLYGON ((3.03207 46.79491, 3.04907 46.75808, ..."


Notice that `depts` is a geopandas dataframe.  We can use it just like a regular `pandas` dataframe, but it includes the geometry info we need to draw those regions when we pass them into Altair.  When we work with our data, we just need to keep it in a geopandas dataframe, not a plain one, if we want to draw the departments.
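
To make the difference between the two kinds of dataframe concrete, here is a minimal sketch with two made-up square "departments". The codes, names and shapes are hypothetical. It illustrates that the type of a merge result follows the dataframe that `merge` is called on.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon

# Two made-up square "departments" (hypothetical codes and shapes)
toy_depts = gpd.GeoDataFrame(
    {"code": ["A", "B"], "nom": ["Alpha", "Beta"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
)

# A plain pandas table of made-up name counts
toy_names = pd.DataFrame(
    {"dpt": ["A", "A", "B"], "preusuel": ["X", "Y", "X"], "nombre": [3, 5, 4]}
)

# Calling merge on the geopandas dataframe keeps the geopandas type
geo_first = toy_depts.merge(toy_names, how="right", left_on="code", right_on="dpt")

# Calling merge on the plain dataframe gives a plain dataframe back,
# even though a geometry column is still present
plain_first = toy_names.merge(toy_depts, how="left", left_on="dpt", right_on="code")

print(type(geo_first).__name__)
print(type(plain_first).__name__)
```

Both results contain the same rows; only the first is a dataframe whose geometry Altair can draw directly.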

In the next cell, notice that we call `merge` on `depts` and use a right merge to bring the department data into `names`.  We do it this way because the result takes the type of the dataframe we call `merge` on: `depts` is a geopandas dataframe, while `names` is a regular one.  If we called `merge` on `names` with a left merge instead, we'd end up with a regular `pandas` dataframe.  After this merge, both `names` and `depts` are geopandas dataframes.

**Hint:** Be careful when you do your data joins here.  It's easy to merge the wrong way and accidentally create a _much bigger_ dataset.

In [11]:
# Keep a reference around to the plain pandas dataframe, without geometry data, just in case
just_names = names

names = depts.merge(names, how='right', left_on='code', right_on='dpt')

names.sample(5)

Unnamed: 0,code,nom,geometry,sexe,preusuel,annais,dpt,nombre
1504832,76,Seine-Maritime,"POLYGON ((1.38155 50.06577, 1.40926 50.05707, ...",1,SOUHEYL,2020,76,3
1273692,76,Seine-Maritime,"POLYGON ((1.38155 50.06577, 1.40926 50.05707, ...",1,OLIVIER,1998,76,13
1830595,44,Loire-Atlantique,"POLYGON ((-2.45849 47.44812, -2.45343 47.46207...",2,ANDRÉE,1965,44,3
430976,66,Pyrénées-Orientales,"POLYGON ((2.16605 42.66392, 2.17620 42.65251, ...",1,EMILE,1956,66,6
2208871,72,Sarthe,"POLYGON ((-0.05453 48.38200, -0.04463 48.37976...",2,DAPHNÉ,1996,72,3
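
One way to guard against a merge that silently multiplies rows is to let pandas check the key relationship. The sketch below uses made-up tables; `validate` and `indicator` are standard `merge` parameters, but the cell above does not use them, so treat this as an optional safety net.

```python
import pandas as pd

# Made-up stand-ins: one row per department on the left, many rows on the right
left = pd.DataFrame({"code": ["A", "B"], "nom": ["Alpha", "Beta"]})
right = pd.DataFrame({"dpt": ["A", "A", "B", "C"], "nombre": [3, 5, 4, 1]})

merged = left.merge(
    right,
    how="right",
    left_on="code",
    right_on="dpt",
    validate="one_to_many",  # raises MergeError if a code repeats on the left
    indicator=True,          # adds a `_merge` column saying where each row came from
)

# A right merge on unique left keys should keep exactly the right-hand row count
assert len(merged) == len(right)

# Rows whose department code found no match on the left
unmatched = merged[merged["_merge"] == "right_only"]
print(unmatched)
```

In this toy data the unmatched rows are those for department `C`, which has no entry in the left table.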


### Marc - Added

In [12]:
#import altair as alt
#import pandas as pd
#import geopandas as gpd
#
## Enable large data sets
#alt.data_transformers.enable('json')
#
## Load the baby names data
#names = pd.read_csv("dpt2020.csv", sep=";")
#names = names[names.preusuel != '_PRENOMS_RARES']
#names = names[names.dpt != 'XX']
#
## Load the geographical data for departments
#depts = gpd.read_file('departements-version-simplifiee.geojson')
#
## Merge names data with geographical data
#names = depts.merge(names, how='right', left_on='code', right_on='dpt')
#
## Identify the most common name in each department
#most_common_names = names.groupby(['dpt', 'preusuel'], as_index=False)['nombre'].sum()
#most_common_names = most_common_names.loc[most_common_names.groupby('dpt')['nombre'].idxmax()]
#
## Merge with department data to retain geometry
#most_common_names = depts.merge(most_common_names, how='right', left_on='code', right_on='dpt')
#
## Calculate top 5 names for each department
#top_5_names = names.groupby(['dpt', 'preusuel'], as_index=False)['nombre'].sum()
#top_5_names['rank'] = top_5_names.groupby('dpt')['nombre'].rank(method='first', ascending=False)
#top_5_names = top_5_names[top_5_names['rank'] <= 5]
#
## Create the base map colored by the most common name
#base = alt.Chart(most_common_names).mark_geoshape(
#    stroke='white'
#).encode(
#    color=alt.Color('preusuel:N', title='Most Common Name'),
#    tooltip=['nom:N', 'dpt:N', 'preusuel:N', 'nombre:Q']
#).properties(
#    width=800,
#    height=600
#)
#
## Prepare tooltip data for top 5 names
#top_5_tooltips = top_5_names.groupby('dpt').apply(
#    lambda df: '<br>'.join([f"{row['preusuel']}: {row['nombre']}" for _, row in df.iterrows()])
#).reset_index(name='top_5_names')
#
## Merge tooltip data with department geometries
#tooltip_data = depts.merge(top_5_tooltips, how='left', left_on='code', right_on='dpt')
#
## Add top 5 names tooltip to the map
#tooltip_chart = alt.Chart(tooltip_data).mark_geoshape(
#    stroke='white',
#    fillOpacity=0
#).encode(
#    tooltip=['nom:N', 'top_5_names:N']
#)
#
## Combine the base map and tooltip chart
#final_chart = base + tooltip_chart
#
## Display the final chart
#final_chart


In [13]:
#import altair as alt
import pandas as pd
import geopandas as gpd
#
## Enable large data sets
#alt.data_transformers.enable('json')
#
## Load the baby names data
#names = pd.read_csv("dpt2020.csv", sep=";")
#names = names[names.preusuel != '_PRENOMS_RARES']
#names = names[names.dpt != 'XX']
#
## Load the geographical data for departments
#depts = gpd.read_file('departements-version-simplifiee.geojson')
#
## Merge names data with geographical data
#names = depts.merge(names, how='right', left_on='code', right_on='dpt')
#
## Identify the most common name in each department
#most_common_names = names.groupby(['dpt', 'preusuel'], as_index=False)['nombre'].sum()
#most_common_names = most_common_names.loc[most_common_names.groupby('dpt')['nombre'].idxmax()]
#
## Merge with department data to retain geometry
#most_common_names = depts.merge(most_common_names, how='right', left_on='code', right_on='dpt')
#
## Calculate top 5 and flop 5 names for each department
#top_flop_names = names.groupby(['dpt', 'preusuel'], as_index=False)['nombre'].sum()
#top_flop_names['rank_top'] = top_flop_names.groupby('dpt')['nombre'].rank(method='first', ascending=False)
#top_flop_names['rank_flop'] = top_flop_names.groupby('dpt')['nombre'].rank(method='first', ascending=True)
#top_5_names = top_flop_names[top_flop_names['rank_top'] <= 5]
#flop_5_names = top_flop_names[(top_flop_names['rank_flop'] <= 5) & (top_flop_names['nombre'] > 0)]
#
## Pivot the data for top 5 names
#top_5_pivot = top_5_names.pivot(index='dpt', columns='rank_top', values=['preusuel', 'nombre'])
#top_5_pivot.columns = [f'top_{int(col[1])}_{col[0]}' for col in top_5_pivot.columns]
#top_5_pivot = top_5_pivot.reset_index()
#
## Pivot the data for flop 5 names
#flop_5_pivot = flop_5_names.pivot(index='dpt', columns='rank_flop', values=['preusuel', 'nombre'])
#flop_5_pivot.columns = [f'flop_{int(col[1])}_{col[0]}' for col in flop_5_pivot.columns]
#flop_5_pivot = flop_5_pivot.reset_index()
#
## Merge the pivot tables with department data
#tooltip_data = depts[['code', 'nom']].merge(top_5_pivot, how='left', left_on='code', right_on='dpt')
#tooltip_data = tooltip_data.merge(flop_5_pivot, how='left', left_on='code', right_on='dpt', suffixes=('', '_flop'))
#
## Merge the tooltip data with the most common names data
#final_data = most_common_names.merge(tooltip_data, how='left', left_on='code', right_on='code')
#
## Fill missing values with "N/A" for non-geometry columns
#for col in final_data.columns:
#    if col != 'geometry':
#        final_data[col] = final_data[col].fillna("N/A")
#
## Print sample data to verify correctness
#print("final_data sample:\n", final_data.sample(5))
#
## Create the map with the consolidated tooltip
#final_chart = alt.Chart(final_data).mark_geoshape(
#    stroke='white'
#).encode(
#    color=alt.Color('preusuel:N', title='Most Common Name'),
#    tooltip=[
#        alt.Tooltip('nom:N', title='Department'),
#        alt.Tooltip('preusuel:N', title='Most Common Name'),
#        alt.Tooltip('nombre:Q', title='Most Common Count'),
#        alt.Tooltip('top_1_preusuel:N', title='Top 1 Name'), alt.Tooltip('top_1_nombre:Q', title='Top 1 Count'),
#        alt.Tooltip('top_2_preusuel:N', title='Top 2 Name'), alt.Tooltip('top_2_nombre:Q', title='Top 2 Count'),
#        alt.Tooltip('top_3_preusuel:N', title='Top 3 Name'), alt.Tooltip('top_3_nombre:Q', title='Top 3 Count'),
#        alt.Tooltip('top_4_preusuel:N', title='Top 4 Name'), alt.Tooltip('top_4_nombre:Q', title='Top 4 Count'),
#        alt.Tooltip('top_5_preusuel:N', title='Top 5 Name'), alt.Tooltip('top_5_nombre:Q', title='Top 5 Count'),
#        alt.Tooltip('flop_1_preusuel:N', title='Flop 1 Name'), alt.Tooltip('flop_1_nombre:Q', title='Flop 1 Count'),
#        alt.Tooltip('flop_2_preusuel:N', title='Flop 2 Name'), alt.Tooltip('flop_2_nombre:Q', title='Flop 2 Count'),
#        alt.Tooltip('flop_3_preusuel:N', title='Flop 3 Name'), alt.Tooltip('flop_3_nombre:Q', title='Flop 3 Count'),
#        alt.Tooltip('flop_4_preusuel:N', title='Flop 4 Name'), alt.Tooltip('flop_4_nombre:Q', title='Flop 4 Count'),
#        alt.Tooltip('flop_5_preusuel:N', title='Flop 5 Name'), alt.Tooltip('flop_5_nombre:Q', title='Flop 5 Count')
#    ]
#).properties(
#    width=800,
#    height=600
#)
#
## Display the final chart
#final_chart
#

# Region visualization for week 2

In [14]:
import altair as alt
import pandas as pd
import geopandas as gpd

# Enable large data sets
alt.data_transformers.enable('json')

# Load the baby names data
names = pd.read_csv("dpt2020.csv", sep=";")
names = names[names.preusuel != '_PRENOMS_RARES']
names = names[names.dpt != 'XX']

# Load the geographical data for departments
depts = gpd.read_file('departements-version-simplifiee.geojson')

# Merge names data with geographical data
names = depts.merge(names, how='right', left_on='code', right_on='dpt')

# Identify the most common name in each department
most_common_names = names.groupby(['dpt', 'preusuel'], as_index=False)['nombre'].sum()
most_common_names = most_common_names.loc[most_common_names.groupby('dpt')['nombre'].idxmax()]

# Merge with department data to retain geometry
most_common_names = depts.merge(most_common_names, how='right', left_on='code', right_on='dpt')

# Calculate top 5 and flop 5 names for each department
top_flop_names = names.groupby(['dpt', 'preusuel'], as_index=False)['nombre'].sum()
top_flop_names['rank_top'] = top_flop_names.groupby('dpt')['nombre'].rank(method='first', ascending=False)
top_flop_names['rank_flop'] = top_flop_names.groupby('dpt')['nombre'].rank(method='first', ascending=True)
top_5_names = top_flop_names[top_flop_names['rank_top'] <= 5]
flop_5_names = top_flop_names[(top_flop_names['rank_flop'] <= 5) & (top_flop_names['nombre'] > 0)]

# Pivot the data for top 5 names
top_5_pivot = top_5_names.pivot(index='dpt', columns='rank_top', values=['preusuel', 'nombre'])
top_5_pivot.columns = [f'top_{int(col[1])}_{col[0]}' for col in top_5_pivot.columns]
top_5_pivot = top_5_pivot.reset_index()

# Pivot the data for flop 5 names
flop_5_pivot = flop_5_names.pivot(index='dpt', columns='rank_flop', values=['preusuel', 'nombre'])
flop_5_pivot.columns = [f'flop_{int(col[1])}_{col[0]}' for col in flop_5_pivot.columns]
flop_5_pivot = flop_5_pivot.reset_index()

# Merge the pivot tables with department data
tooltip_data = depts[['code', 'nom']].merge(top_5_pivot, how='left', left_on='code', right_on='dpt')
tooltip_data = tooltip_data.merge(flop_5_pivot, how='left', left_on='code', right_on='dpt', suffixes=('', '_flop'))

# Merge the tooltip data with the most common names data
final_data = most_common_names.merge(tooltip_data, how='left', left_on='code', right_on='code')

# Fill missing values with "N/A" for non-geometry columns
for col in final_data.columns:
    if col != 'geometry':
        final_data[col] = final_data[col].fillna("N/A")

# Add calculated fields for tooltips
final_data = final_data.assign(
    top_1=lambda x: x['top_1_preusuel'] + ' (' + x['top_1_nombre'].astype(str) + ')',
    top_2=lambda x: x['top_2_preusuel'] + ' (' + x['top_2_nombre'].astype(str) + ')',
    top_3=lambda x: x['top_3_preusuel'] + ' (' + x['top_3_nombre'].astype(str) + ')',
    top_4=lambda x: x['top_4_preusuel'] + ' (' + x['top_4_nombre'].astype(str) + ')',
    top_5=lambda x: x['top_5_preusuel'] + ' (' + x['top_5_nombre'].astype(str) + ')',
    flop_1=lambda x: x['flop_1_preusuel'] + ' (' + x['flop_1_nombre'].astype(str) + ')',
    flop_2=lambda x: x['flop_2_preusuel'] + ' (' + x['flop_2_nombre'].astype(str) + ')',
    flop_3=lambda x: x['flop_3_preusuel'] + ' (' + x['flop_3_nombre'].astype(str) + ')',
    flop_4=lambda x: x['flop_4_preusuel'] + ' (' + x['flop_4_nombre'].astype(str) + ')',
    flop_5=lambda x: x['flop_5_preusuel'] + ' (' + x['flop_5_nombre'].astype(str) + ')'
)

# Create the map with the consolidated tooltip
final_chart = alt.Chart(final_data).mark_geoshape(
    stroke='white'
).encode(
    color=alt.Color('preusuel:N', title='Most Common Name'),
    tooltip=[
        alt.Tooltip('nom_x:N', title='Department'),
        alt.Tooltip('top_1:N', title='Top 1'),
        alt.Tooltip('top_2:N', title='Top 2'),
        alt.Tooltip('top_3:N', title='Top 3'),
        alt.Tooltip('top_4:N', title='Top 4'),
        alt.Tooltip('top_5:N', title='Top 5'),
        alt.Tooltip('flop_1:N', title='Flop 1'),
        alt.Tooltip('flop_2:N', title='Flop 2'),
        alt.Tooltip('flop_3:N', title='Flop 3'),
        alt.Tooltip('flop_4:N', title='Flop 4'),
        alt.Tooltip('flop_5:N', title='Flop 5')
    ]
).properties(
    width=800,
    height=600
)

# Display the final chart
final_chart
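
The group, rank and pivot steps in the cell above are easier to follow on a toy table. The sketch below uses made-up counts and keeps the top 2 names instead of the top 5 to stay short; the column names mirror the real data, everything else is illustrative.

```python
import pandas as pd

# Made-up counts for two departments and three names
toy = pd.DataFrame({
    "dpt": ["A", "A", "A", "B", "B", "B"],
    "preusuel": ["X", "Y", "Z", "X", "Y", "Z"],
    "nombre": [10, 30, 20, 5, 1, 8],
})

# Total per (department, name), then the row with the largest total per department
totals = toy.groupby(["dpt", "preusuel"], as_index=False)["nombre"].sum()
most_common = totals.loc[totals.groupby("dpt")["nombre"].idxmax()]

# Rank names within each department; method="first" breaks ties by row order
totals["rank_top"] = totals.groupby("dpt")["nombre"].rank(method="first", ascending=False)
top_2 = totals[totals["rank_top"] <= 2]

# One row per department, one column per (field, rank) pair
wide = top_2.pivot(index="dpt", columns="rank_top", values=["preusuel", "nombre"])
wide.columns = [f"top_{int(rank)}_{field}" for field, rank in wide.columns]
wide = wide.reset_index()

print(most_common)
print(wide)
```

The pivot produces a two-level column index of `(field, rank)` pairs, which is why the column names are rebuilt from both parts before `reset_index`.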


In [15]:
#import altair as alt
#import pandas as pd
#import geopandas as gpd
#
## Enable large data sets
#alt.data_transformers.enable('json')
#
## Load the baby names data
#names = pd.read_csv("dpt2020.csv", sep=";")
#names = names[names.preusuel != '_PRENOMS_RARES']
#names = names[names.dpt != 'XX']
#
## Ensure there is a year field
#names['year'] = names['annais'].astype(int)
#
## Load the geographical data for departments
#depts = gpd.read_file('departements-version-simplifiee.geojson')
#
## Define a selection interval for selecting the year range
#year_interval = alt.selection_interval(encodings=['x'])
#
## Create a base chart for year selection
#year_chart = alt.Chart(names).mark_bar().encode(
#    x=alt.X('year:O', axis=alt.Axis(title='Year')),
#    y=alt.Y('count()', axis=alt.Axis(title='Count'))
#).add_selection(
#    year_interval
#).properties(
#    width=800,
#    height=100
#)
#
## Filter data based on selected year range
#filtered_data = names.transform_filter(
#    year_interval
#)
#
## Identify the most common name in each department within the selected year range
#most_common_names = filtered_data.groupby(['dpt', 'preusuel'], as_index=False)['nombre'].sum()
#most_common_names = most_common_names.loc[most_common_names.groupby('dpt')['nombre'].idxmax()]
#
## Merge with department data to retain geometry
#most_common_names = depts.merge(most_common_names, how='right', left_on='code', right_on='dpt')
#
## Create the map with the consolidated tooltip
#map_chart = alt.Chart(most_common_names).mark_geoshape(
#    stroke='white'
#).encode(
#    color=alt.Color('preusuel:N', title='Most Common Name'),
#    tooltip=[
#        alt.Tooltip('nom:N', title='Department'),
#        alt.Tooltip('preusuel:N', title='Most Common Name'),
#        alt.Tooltip('nombre:Q', title='Most Common Count')
#    ]
#).properties(
#    width=800,
#    height=600
#).transform_filter(
#    year_interval
#)
#
## Combine the year selection chart with the map chart
#final_chart = alt.vconcat(
#    year_chart,
#    map_chart
#)
#
## Display the final chart
#final_chart
