# Making choropleth maps in Altair

Here's a quick example of how to make a choropleth map in Altair.  In this example, we'll work with a fairly large data set of baby names in France from 1900-2019, broken down by department.

To work with geographical data, we'll use `geopandas`, which extends `pandas` dataframes with support for geographical outlines and can read them from files in the `geojson` format.  You can use these dataframes just as you would a regular `pandas` dataframe, but they will include that extra geographical outline data.

To get started, we'll need to import our libraries.

In [1]:
import altair as alt
import pandas as pd
import geopandas as gpd # Requires geopandas -- e.g.: conda install -c conda-forge geopandas
alt.data_transformers.enable('json') # Let Altair/Vega-Lite work with large data sets

pass

# Reading our names data

Now, let's read in our dataset.  The exported data is in CSV format, but with a `;` separator instead of commas.  The INSEE data lumps rare names together under `_PRENOMS_RARES` and marks rows whose department-level information has been elided with `XX` (presumably to protect individuals with uncommon names or who were one of the only ones born with that name in a given year).  We'll strip those out.

In [2]:
names = pd.read_csv("dpt2020.csv", sep=";")
names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
names.drop(names[names.dpt == 'XX'].index, inplace=True)

names.sample(5)

Unnamed: 0,sexe,preusuel,annais,dpt,nombre
96805,1,ALLAN,1995,52,7
50163,1,ALAIN,1944,974,17
863598,1,JIMMY,1994,93,33
2922000,2,LUCIE,1909,1,24
2625395,2,ISABELLE,1947,57,7
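
If the file is not at hand, the same read-and-filter step can be tried on a few made-up rows. This is only a sketch: the rows below are invented, and only the column names and the `_PRENOMS_RARES` and `XX` placeholders come from the data above. It uses a boolean mask instead of `drop(..., inplace=True)`, and it reads `dpt` as text so that a code like `01` keeps its leading zero, which is an assumption about how you want the codes to line up with the map data later.

```python
import io

import pandas as pd

# Invented sample in the same semicolon-separated layout as dpt2020.csv
raw = io.StringIO(
    "sexe;preusuel;annais;dpt;nombre\n"
    "1;_PRENOMS_RARES;1900;02;7\n"
    "1;LUCIEN;1900;01;12\n"
    "2;MARIE;1900;XX;30\n"
    "2;MARIE;1900;01;45\n"
)

# Read department codes as text so that '01' is not turned into the number 1
sample = pd.read_csv(raw, sep=";", dtype={"dpt": str})

# Keep only rows with a real name and a real department
keep = (sample.preusuel != "_PRENOMS_RARES") & (sample.dpt != "XX")
sample = sample[keep].reset_index(drop=True)

print(sample)
```

The mask version returns a new dataframe rather than modifying `names` in place, which makes the cell safe to re-run.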


# Loading map data

Next, let's load some map data of regions in France using `geopandas`.  These map data come from the [INSEE] and [IGN] and were processed into the `geojson` format we'll need to work with by [Grégoire David].  Here's the [github] repository.

In this example, we'll work with the simplified departments tiles for the Hexagon, but that repository contains higher-resolution versions, the DOM-TOM, and more.

[Grégoire David]: https://gregoiredavid.fr
[INSEE]: http://www.insee.fr/fr/methodes/nomenclatures/cog/telechargement.asp
[IGN]: https://geoservices.ign.fr/adminexpress
[github]: https://github.com/gregoiredavid/france-geojson/

In [3]:
depts = gpd.read_file('departements-version-simplifiee.geojson')

depts.sample(5)

Unnamed: 0,code,nom,geometry
87,87,Haute-Vienne,"POLYGON ((0.82343 46.12858, 0.83345 46.16655, ..."
46,46,Lot,"POLYGON ((1.44826 45.01931, 1.47632 45.01845, ..."
93,93,Seine-Saint-Denis,"POLYGON ((2.55306 49.00982, 2.58031 48.99159, ..."
66,66,Pyrénées-Orientales,"POLYGON ((2.16605 42.66392, 2.17620 42.65251, ..."
61,61,Orne,"POLYGON ((-0.84094 48.75222, -0.81927 48.75413..."


Notice how `depts` is a geopandas dataframe.  We'll use it just as a regular `pandas` dataframe, but it includes the geometry info we need to be able to draw those regions when we pass them into Altair.  We just need to make sure that when we work with our data, we keep them in a geopandas dataframe and not a plain dataframe if we want to draw the departments.

In the next cell, notice how we do a right merge to bring the department data into `names`.  We call `merge` on `depts` because we need the result to be a geopandas dataframe.  Remember, `depts` is a geopandas dataframe, while `names` is a regular dataframe.  If we instead called `merge` on `names` and did a left merge, we'd end up with a regular pandas dataframe.  After this merge, both `names` and `depts` will be geopandas dataframes.

**Hint:** Be careful when you do your data joins here.  It's easy to merge the wrong way and accidentally create a _much bigger_ dataset.
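
One way to guard against a join that silently multiplies rows is to let `pandas` check the key relationship for you. The sketch below uses two small invented tables standing in for the departments and names data; `regions` and `births` are hypothetical names used only for illustration. The `validate` argument of `merge` raises an error if the department codes on the left are not unique.

```python
import pandas as pd

# Invented stand-ins for the departments table and the names table
regions = pd.DataFrame({"code": ["01", "02"], "nom": ["Ain", "Aisne"]})
births = pd.DataFrame({
    "dpt": ["01", "01", "02"],
    "preusuel": ["LUCIEN", "MARIE", "MARIE"],
    "nombre": [12, 45, 30],
})

# validate="one_to_many" raises pandas.errors.MergeError if a department code
# appears more than once on the left, which is what would multiply rows
merged = regions.merge(
    births, how="right", left_on="code", right_on="dpt", validate="one_to_many"
)

# A right merge on a unique key keeps exactly one row per row of the names table
print(len(births), len(merged))
```

Comparing the row counts before and after the merge is a cheap sanity check even when you do not use `validate`.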

In [4]:
# Keep a reference around to the plain pandas dataframe, without geometry data, just in case
just_names = names

names = depts.merge(names, how='right', left_on='code', right_on='dpt')

names.sample(5)

Unnamed: 0,code,nom,geometry,sexe,preusuel,annais,dpt,nombre
1155557,81,Tarn,"POLYGON ((1.99017 44.14945, 2.02970 44.15704, ...",1,MAXIME,1983,81,12
659159,86,Vienne,"POLYGON ((-0.10212 47.06480, -0.09806 47.09135...",1,GUY,1958,86,19
1947435,88,Vosges,"POLYGON ((5.47006 48.42093, 5.51099 48.41822, ...",2,AURORE,1987,88,48
2610788,78,Yvelines,"POLYGON ((2.20059 48.90868, 2.16838 48.89508, ...",2,JEANNE,1950,78,30
3356147,55,Meuse,"POLYGON ((4.95099 49.23687, 4.96436 49.24745, ...",2,PIERRETTE,1913,55,8
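
To see why the side you call `merge` on matters, here is a small sketch with two invented square "departments" built with `shapely` (installed alongside `geopandas`). The names `shapes` and `counts` are hypothetical and only used for this illustration.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Two invented square outlines standing in for department shapes
shapes = gpd.GeoDataFrame(
    {"code": ["01", "02"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
)
counts = pd.DataFrame({"dpt": ["01", "01", "02"], "nombre": [3, 5, 7]})

# Calling merge on the geopandas dataframe keeps the geopandas type
geo_first = shapes.merge(counts, how="right", left_on="code", right_on="dpt")

# Calling merge on the plain dataframe returns a plain dataframe,
# even though the geometry column is still present
plain_first = counts.merge(shapes, how="left", left_on="dpt", right_on="code")

print(type(geo_first).__name__)
print(type(plain_first).__name__)
```

Both results contain the same rows; only the first one is the kind of dataframe we want to hand to `mark_geoshape`.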


# Show a name over all years

Now we'll choose a name to show across all years.  To do that, we'll group all of the rows for a name in a department together (squashing the years together) and sum the counts.

In [5]:
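# Sum only the birth counts; summing every column would also add up the years and concatenate the strings
grouped = names.drop(columns='geometry').groupby(['dpt', 'preusuel', 'sexe'], as_index=False)['nombre'].sum()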

In [6]:
# Drop the columns we don't need for the aggregation
names2 = names.drop(columns=['geometry','nom'])

In [7]:
# Sum only the birth counts per department, name and sex
grouped = names2.groupby(['dpt', 'preusuel', 'sexe'], as_index=False)['nombre'].sum()
grouped = depts.merge(grouped, how='right', left_on='code', right_on='dpt') # Add geometry data back in
grouped2 = grouped[['geometry','preusuel','sexe','nombre','dpt','nom']]
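
To make the "squashing" of years concrete, here is a minimal sketch on invented rows with the same columns as the names data. Selecting `nombre` before calling `sum()` restricts the aggregation to the birth counts.

```python
import pandas as pd

# Invented rows: LUCIEN appears in department 01 in two different years
toy = pd.DataFrame({
    "dpt": ["01", "01", "01", "02"],
    "preusuel": ["LUCIEN", "LUCIEN", "MARIE", "LUCIEN"],
    "sexe": [1, 1, 2, 1],
    "annais": [1900, 1901, 1900, 1900],
    "nombre": [12, 8, 45, 5],
})

# One row per (department, name, sex), with the counts summed over all years
totals = toy.groupby(["dpt", "preusuel", "sexe"], as_index=False)["nombre"].sum()

print(totals)
```

The two LUCIEN rows for department 01 collapse into a single row whose `nombre` is their sum.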


Now let's pick a name and check out its distribution across Metropolitan France over the last 120 years.  In this example, I chose the name “Lucien,” which I rather like for some reason.

In [8]:
name = 'LUCIEN'
subset = grouped2[grouped2.preusuel == name]
alt.Chart(subset).mark_geoshape(stroke='white').encode(
    tooltip=['nom', 'dpt', 'nombre'],
    color='nombre',
).properties(width=800, height=600)

### Marc - Added

In [9]:
import altair as alt
import pandas as pd
import geopandas as gpd

# Enable large data sets
alt.data_transformers.enable('json')

# Load the baby names data
names = pd.read_csv("dpt2020.csv", sep=";")
names = names[names.preusuel != '_PRENOMS_RARES']
names = names[names.dpt != 'XX']

# Load the geographical data for departments
depts = gpd.read_file('departements-version-simplifiee.geojson')

# Merge names data with geographical data
names = depts.merge(names, how='right', left_on='code', right_on='dpt')

# Identify the most common name in each department
most_common_names = names.groupby(['dpt', 'preusuel'], as_index=False)['nombre'].sum()
most_common_names = most_common_names.loc[most_common_names.groupby('dpt')['nombre'].idxmax()]

# Merge with department data to retain geometry
most_common_names = depts.merge(most_common_names, how='right', left_on='code', right_on='dpt')

# Calculate top 5 names for each department
top_5_names = names.groupby(['dpt', 'preusuel'], as_index=False)['nombre'].sum()
top_5_names['rank'] = top_5_names.groupby('dpt')['nombre'].rank(method='first', ascending=False)
top_5_names = top_5_names[top_5_names['rank'] <= 5]

# Create the base map colored by the most common name
base = alt.Chart(most_common_names).mark_geoshape(
    stroke='white'
).encode(
    color=alt.Color('preusuel:N', title='Most Common Name'),
    tooltip=['nom:N', 'dpt:N', 'preusuel:N', 'nombre:Q']
).properties(
    width=800,
    height=600
)

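# Prepare tooltip data for top 5 names, ordered by rank within each department.
# Vega tooltips show text as-is, so join with ', ' rather than an HTML '<br>'.
top_5_tooltips = (
    top_5_names.sort_values(['dpt', 'rank'])
    .assign(label=lambda df: df['preusuel'] + ': ' + df['nombre'].astype(str))
    .groupby('dpt')['label'].agg(', '.join)
    .reset_index(name='top_5_names')
)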

# Merge tooltip data with department geometries
tooltip_data = depts.merge(top_5_tooltips, how='left', left_on='code', right_on='dpt')

# Add top 5 names tooltip to the map
tooltip_chart = alt.Chart(tooltip_data).mark_geoshape(
    stroke='white',
    fillOpacity=0
).encode(
    tooltip=['nom:N', 'top_5_names:N']
)

# Combine the base map and tooltip chart
final_chart = base + tooltip_chart

# Display the final chart
final_chart
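
The two grouping patterns used in this cell can be tried in isolation on a tiny invented table. This is a sketch of one possible approach, not the only one: `idxmax` picks the single winning row per department, while sorting and then taking `head(n)` per group is an alternative to `rank` for a top-N list.

```python
import pandas as pd

# Invented totals per department and name
totals = pd.DataFrame({
    "dpt": ["01", "01", "01", "02", "02", "02"],
    "preusuel": ["JEAN", "MARIE", "LUCIEN", "JEAN", "MARIE", "LUCIEN"],
    "nombre": [50, 70, 20, 90, 40, 10],
})

# Most common name per department: the row holding each group's maximum
winners = totals.loc[totals.groupby("dpt")["nombre"].idxmax()]

# Top 2 per department: sort once, then keep the first rows of each group
top_2 = (
    totals.sort_values(["dpt", "nombre"], ascending=[True, False])
    .groupby("dpt")
    .head(2)
)

print(winners)
print(top_2)
```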


In [10]:
import altair as alt
import pandas as pd
import geopandas as gpd

# Enable large data sets
alt.data_transformers.enable('json')

# Load the baby names data
names = pd.read_csv("dpt2020.csv", sep=";")
names = names[names.preusuel != '_PRENOMS_RARES']
names = names[names.dpt != 'XX']

# Load the geographical data for departments
depts = gpd.read_file('departements-version-simplifiee.geojson')

# Merge names data with geographical data
names = depts.merge(names, how='right', left_on='code', right_on='dpt')

# Identify the most common name in each department
most_common_names = names.groupby(['dpt', 'preusuel'], as_index=False)['nombre'].sum()
most_common_names = most_common_names.loc[most_common_names.groupby('dpt')['nombre'].idxmax()]

# Merge with department data to retain geometry
most_common_names = depts.merge(most_common_names, how='right', left_on='code', right_on='dpt')

# Calculate top 5 and flop 5 names for each department
top_flop_names = names.groupby(['dpt', 'preusuel'], as_index=False)['nombre'].sum()
top_flop_names['rank_top'] = top_flop_names.groupby('dpt')['nombre'].rank(method='first', ascending=False)
top_flop_names['rank_flop'] = top_flop_names.groupby('dpt')['nombre'].rank(method='first', ascending=True)
top_5_names = top_flop_names[top_flop_names['rank_top'] <= 5]
flop_5_names = top_flop_names[(top_flop_names['rank_flop'] <= 5) & (top_flop_names['nombre'] > 0)]

# Pivot the data for top 5 names
top_5_pivot = top_5_names.pivot(index='dpt', columns='rank_top', values=['preusuel', 'nombre'])
top_5_pivot.columns = [f'top_{int(col[1])}_{col[0]}' for col in top_5_pivot.columns]
top_5_pivot = top_5_pivot.reset_index()

# Pivot the data for flop 5 names
flop_5_pivot = flop_5_names.pivot(index='dpt', columns='rank_flop', values=['preusuel', 'nombre'])
flop_5_pivot.columns = [f'flop_{int(col[1])}_{col[0]}' for col in flop_5_pivot.columns]
flop_5_pivot = flop_5_pivot.reset_index()

# Merge the pivot tables with department data
tooltip_data = depts[['code', 'nom']].merge(top_5_pivot, how='left', left_on='code', right_on='dpt')
tooltip_data = tooltip_data.merge(flop_5_pivot, how='left', left_on='code', right_on='dpt', suffixes=('', '_flop'))

# Merge the tooltip data with the most common names data
final_data = most_common_names.merge(tooltip_data, how='left', left_on='code', right_on='code')

# Fill missing values with "N/A" for non-geometry columns
for col in final_data.columns:
    if col != 'geometry':
        final_data[col] = final_data[col].fillna("N/A")

# Print sample data to verify correctness
print("final_data sample:\n", final_data.sample(5))

# Create the map with the consolidated tooltip
final_chart = alt.Chart(final_data).mark_geoshape(
    stroke='white'
).encode(
    color=alt.Color('preusuel:N', title='Most Common Name'),
    tooltip=[
        # After the merge the department name lives in 'nom_x' (see the sample output below)
        alt.Tooltip('nom_x:N', title='Department'),
        alt.Tooltip('preusuel:N', title='Most Common Name'),
        alt.Tooltip('nombre:Q', title='Most Common Count'),
        alt.Tooltip('top_1_preusuel:N', title='Top 1 Name'), alt.Tooltip('top_1_nombre:Q', title='Top 1 Count'),
        alt.Tooltip('top_2_preusuel:N', title='Top 2 Name'), alt.Tooltip('top_2_nombre:Q', title='Top 2 Count'),
        alt.Tooltip('top_3_preusuel:N', title='Top 3 Name'), alt.Tooltip('top_3_nombre:Q', title='Top 3 Count'),
        alt.Tooltip('top_4_preusuel:N', title='Top 4 Name'), alt.Tooltip('top_4_nombre:Q', title='Top 4 Count'),
        alt.Tooltip('top_5_preusuel:N', title='Top 5 Name'), alt.Tooltip('top_5_nombre:Q', title='Top 5 Count'),
        alt.Tooltip('flop_1_preusuel:N', title='Flop 1 Name'), alt.Tooltip('flop_1_nombre:Q', title='Flop 1 Count'),
        alt.Tooltip('flop_2_preusuel:N', title='Flop 2 Name'), alt.Tooltip('flop_2_nombre:Q', title='Flop 2 Count'),
        alt.Tooltip('flop_3_preusuel:N', title='Flop 3 Name'), alt.Tooltip('flop_3_nombre:Q', title='Flop 3 Count'),
        alt.Tooltip('flop_4_preusuel:N', title='Flop 4 Name'), alt.Tooltip('flop_4_nombre:Q', title='Flop 4 Count'),
        alt.Tooltip('flop_5_preusuel:N', title='Flop 5 Name'), alt.Tooltip('flop_5_nombre:Q', title='Flop 5 Count')
    ]
).properties(
    width=800,
    height=600
)

# Display the final chart
final_chart


final_data sample:
    code              nom_x                                           geometry  \
88   89              Yonne  POLYGON ((2.93631 48.16339, 2.93475 48.17882, ...   
92   93  Seine-Saint-Denis  POLYGON ((2.55306 49.00982, 2.58031 48.99159, ...   
22   23             Creuse  POLYGON ((2.16779 46.42407, 2.19757 46.42830, ...   
66   67           Bas-Rhin  POLYGON ((7.63529 49.05416, 7.67449 49.04504, ...   
71   72             Sarthe  POLYGON ((-0.05453 48.38200, -0.04463 48.37976...   

   dpt_x preusuel  nombre              nom_y dpt_y top_1_preusuel  \
88    89     JEAN    8001              Yonne    89           JEAN   
92    93  MOHAMED    8010  Seine-Saint-Denis    93        MOHAMED   
22    23    MARIE   10491             Creuse    23          MARIE   
66    67    MARIE   70872           Bas-Rhin    67          MARIE   
71    72     JEAN   19576             Sarthe    72           JEAN   

   top_2_preusuel  ... flop_1_preusuel flop_2_preusuel flop_3_preusuel  \
88  
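
The pivot step above turns one row per (department, rank) into one wide row per department. Here is a minimal sketch on invented data showing what the column flattening does; the `ranked` table is hypothetical and only has two ranks.

```python
import pandas as pd

# Invented long-format table: one row per department and rank
ranked = pd.DataFrame({
    "dpt": ["01", "01", "02", "02"],
    "preusuel": ["MARIE", "JEAN", "JEAN", "MARIE"],
    "nombre": [70, 50, 90, 40],
    "rank_top": [1.0, 2.0, 1.0, 2.0],
})

wide = ranked.pivot(index="dpt", columns="rank_top", values=["preusuel", "nombre"])

# Columns are now (value, rank) pairs such as ('preusuel', 1.0); flatten them
wide.columns = [f"top_{int(rank)}_{value}" for value, rank in wide.columns]
wide = wide.reset_index()

print(wide.columns.tolist())
```

Each department now has one row with columns like `top_1_preusuel` and `top_1_nombre`, which is the shape a tooltip encoding needs.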

In [11]:
import altair as alt
import pandas as pd
import geopandas as gpd

# Enable large data sets
alt.data_transformers.enable('json')

# Load the baby names data
names = pd.read_csv("dpt2020.csv", sep=";")
names = names[names.preusuel != '_PRENOMS_RARES']
names = names[names.dpt != 'XX']

# Load the geographical data for departments
depts = gpd.read_file('departements-version-simplifiee.geojson')

# Merge names data with geographical data
names = depts.merge(names, how='right', left_on='code', right_on='dpt')

# Identify the most common name in each department
most_common_names = names.groupby(['dpt', 'preusuel'], as_index=False)['nombre'].sum()
most_common_names = most_common_names.loc[most_common_names.groupby('dpt')['nombre'].idxmax()]

# Merge with department data to retain geometry
most_common_names = depts.merge(most_common_names, how='right', left_on='code', right_on='dpt')

# Calculate top 5 and flop 5 names for each department
top_flop_names = names.groupby(['dpt', 'preusuel'], as_index=False)['nombre'].sum()
top_flop_names['rank_top'] = top_flop_names.groupby('dpt')['nombre'].rank(method='first', ascending=False)
top_flop_names['rank_flop'] = top_flop_names.groupby('dpt')['nombre'].rank(method='first', ascending=True)
top_5_names = top_flop_names[top_flop_names['rank_top'] <= 5]
flop_5_names = top_flop_names[(top_flop_names['rank_flop'] <= 5) & (top_flop_names['nombre'] > 0)]

# Pivot the data for top 5 names
top_5_pivot = top_5_names.pivot(index='dpt', columns='rank_top', values=['preusuel', 'nombre'])
top_5_pivot.columns = [f'top_{int(col[1])}_{col[0]}' for col in top_5_pivot.columns]
top_5_pivot = top_5_pivot.reset_index()

# Pivot the data for flop 5 names
flop_5_pivot = flop_5_names.pivot(index='dpt', columns='rank_flop', values=['preusuel', 'nombre'])
flop_5_pivot.columns = [f'flop_{int(col[1])}_{col[0]}' for col in flop_5_pivot.columns]
flop_5_pivot = flop_5_pivot.reset_index()

# Merge the pivot tables with department data
tooltip_data = depts[['code', 'nom']].merge(top_5_pivot, how='left', left_on='code', right_on='dpt')
tooltip_data = tooltip_data.merge(flop_5_pivot, how='left', left_on='code', right_on='dpt', suffixes=('', '_flop'))

# Merge the tooltip data with the most common names data
final_data = most_common_names.merge(tooltip_data, how='left', left_on='code', right_on='code')

# Fill missing values with "N/A" for non-geometry columns
for col in final_data.columns:
    if col != 'geometry':
        final_data[col] = final_data[col].fillna("N/A")

# Add calculated fields for tooltips
final_data = final_data.assign(
    top_1=lambda x: x['top_1_preusuel'] + ' (' + x['top_1_nombre'].astype(str) + ')',
    top_2=lambda x: x['top_2_preusuel'] + ' (' + x['top_2_nombre'].astype(str) + ')',
    top_3=lambda x: x['top_3_preusuel'] + ' (' + x['top_3_nombre'].astype(str) + ')',
    top_4=lambda x: x['top_4_preusuel'] + ' (' + x['top_4_nombre'].astype(str) + ')',
    top_5=lambda x: x['top_5_preusuel'] + ' (' + x['top_5_nombre'].astype(str) + ')',
    flop_1=lambda x: x['flop_1_preusuel'] + ' (' + x['flop_1_nombre'].astype(str) + ')',
    flop_2=lambda x: x['flop_2_preusuel'] + ' (' + x['flop_2_nombre'].astype(str) + ')',
    flop_3=lambda x: x['flop_3_preusuel'] + ' (' + x['flop_3_nombre'].astype(str) + ')',
    flop_4=lambda x: x['flop_4_preusuel'] + ' (' + x['flop_4_nombre'].astype(str) + ')',
    flop_5=lambda x: x['flop_5_preusuel'] + ' (' + x['flop_5_nombre'].astype(str) + ')'
)

# Create the map with the consolidated tooltip
final_chart = alt.Chart(final_data).mark_geoshape(
    stroke='white'
).encode(
    color=alt.Color('preusuel:N', title='Most Common Name'),
    tooltip=[
        alt.Tooltip('nom_x:N', title='Department'),
        alt.Tooltip('top_1:N', title='Top 1'),
        alt.Tooltip('top_2:N', title='Top 2'),
        alt.Tooltip('top_3:N', title='Top 3'),
        alt.Tooltip('top_4:N', title='Top 4'),
        alt.Tooltip('top_5:N', title='Top 5'),
        alt.Tooltip('flop_1:N', title='Flop 1'),
        alt.Tooltip('flop_2:N', title='Flop 2'),
        alt.Tooltip('flop_3:N', title='Flop 3'),
        alt.Tooltip('flop_4:N', title='Flop 4'),
        alt.Tooltip('flop_5:N', title='Flop 5')
    ]
).properties(
    width=800,
    height=600
)

# Display the final chart
final_chart


final_data sample:
    code                 nom_x  \
43   44      Loire-Atlantique   
24   25                 Doubs   
63   64  Pyrénées-Atlantiques   
81   82       Tarn-et-Garonne   
77   78              Yvelines   

                                             geometry dpt_x preusuel  nombre  \
43  POLYGON ((-2.45849 47.44812, -2.45343 47.46207...    44    MARIE   53849   
24  POLYGON ((6.80701 47.56280, 6.81666 47.54792, ...    25    MARIE   14318   
63  POLYGON ((-0.24284 43.58498, -0.21061 43.59324...    64    MARIE   42367   
81  POLYGON ((1.06408 44.37851, 1.10672 44.39235, ...    82     JEAN    8456   
77  POLYGON ((2.20059 48.90868, 2.16838 48.89508, ...    78     JEAN   38183   

                   nom_y dpt_y top_1_preusuel top_2_preusuel  ...  \
43      Loire-Atlantique    44          MARIE           JEAN  ...   
24                 Doubs    25          MARIE           JEAN  ...   
63  Pyrénées-Atlantiques    64          MARIE           JEAN  ...   
81       Tarn-et-Garonne
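
The ten near-identical `lambda` lines and tooltip entries in this cell could also be generated in a loop. The sketch below shows the idea on an invented one-row table with only two ranks; the column names follow the `top_N_preusuel` / `top_N_nombre` pattern used above, and `tooltip_specs` is a hypothetical helper list.

```python
import pandas as pd

# Invented wide table with the same column naming pattern as the pivot output
wide = pd.DataFrame({
    "nom": ["Ain"],
    "top_1_preusuel": ["MARIE"], "top_1_nombre": [70],
    "top_2_preusuel": ["JEAN"], "top_2_nombre": [50],
})

ranks = (1, 2)

# Build a "NAME (count)" label column for every rank
for rank in ranks:
    wide[f"top_{rank}"] = (
        wide[f"top_{rank}_preusuel"]
        + " ("
        + wide[f"top_{rank}_nombre"].astype(str)
        + ")"
    )

# (field, title) pairs that could be passed on as alt.Tooltip(field, title=title)
tooltip_specs = [(f"top_{rank}:N", f"Top {rank}") for rank in ranks]

print(wide[["nom", "top_1", "top_2"]])
print(tooltip_specs)
```

Extending this to the flop names would only require looping over a second prefix.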

In [12]:
import altair as alt
import pandas as pd
import geopandas as gpd

# Enable large data sets
alt.data_transformers.enable('json')

# Load the baby names data
names = pd.read_csv("dpt2020.csv", sep=";")
names = names[names.preusuel != '_PRENOMS_RARES']
names = names[names.dpt != 'XX']

# Ensure there is a year field
names['year'] = names['annais'].astype(int)

# Load the geographical data for departments
depts = gpd.read_file('departements-version-simplifiee.geojson')

# Define a selection interval for selecting the year range
year_interval = alt.selection_interval(encodings=['x'])

# Create a base chart for year selection
year_chart = alt.Chart(names).mark_bar().encode(
    x=alt.X('year:O', axis=alt.Axis(title='Year')),
    y=alt.Y('count()', axis=alt.Axis(title='Count'))
).add_selection(
    year_interval
).properties(
    width=800,
    height=100
)

# The brush selection only exists in the browser, so pandas cannot see it:
# the year filter and the per-department ranking have to be Vega-Lite transforms.
# Note: this sends the full names table to the browser, which can be slow.
map_chart = alt.Chart(names).transform_filter(
    year_interval
).transform_aggregate(
    # Total births per department and name within the selected years
    nombre='sum(nombre)',
    groupby=['dpt', 'preusuel']
).transform_window(
    # Number the names within each department, most common first
    name_rank='row_number()',
    sort=[alt.SortField('nombre', order='descending')],
    groupby=['dpt']
).transform_filter(
    alt.datum.name_rank == 1
).transform_lookup(
    # Attach each department's outline and name to its winning row
    lookup='dpt',
    from_=alt.LookupData(depts, key='code', fields=['geometry', 'type', 'nom'])
).mark_geoshape(
    stroke='white'
).encode(
    color=alt.Color('preusuel:N', title='Most Common Name'),
    tooltip=[
        alt.Tooltip('nom:N', title='Department'),
        alt.Tooltip('preusuel:N', title='Most Common Name'),
        alt.Tooltip('nombre:Q', title='Most Common Count')
    ]
).properties(
    width=800,
    height=600
)

# Combine the year selection chart with the map chart
final_chart = alt.vconcat(
    year_chart,
    map_chart
)

# Display the final chart
final_chart