# Making choropleth maps in Altair

Here's a quick example of how to make a choropleth map in Altair.  In this example, we'll work with a fairly large data set of baby names in France from 1900-2019, broken down by department.

To work with geographical data, we'll use `geopandas`, which extends `pandas` dataframes with support for geographical outlines, such as those stored in the `geojson` format.  You can use these dataframes just as you would a regular `pandas` dataframe, but they also carry that extra geographical outline data.

To get started, we'll need to import our libraries.

In [1]:
import altair as alt
import pandas as pd
# Requires geopandas -- e.g.: conda install -c conda-forge geopandas
import geopandas as gpd
# Let Altair/Vega-Lite work with large data sets
alt.data_transformers.enable('json')

pass

# Reading our names data

Now, let's read in our dataset.  The exported data is in CSV format, but with a `;` separator instead of commas.  The INSEE data lumps rare names together under a placeholder (`_PRENOMS_RARES`) and uses the department code `XX` where department-level information has been elided (presumably to protect individuals with uncommon names or who were among the only ones given that name in a given year).  We'll strip those rows out.

In [2]:
names = pd.read_csv("dpt2020.csv", sep=";")
names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
names.drop(names[names.dpt == 'XX'].index, inplace=True)

names.sample(5)

Unnamed: 0,sexe,preusuel,annais,dpt,nombre
3705143,2,YOUSRA,2013,38,3
738242,1,ISHAK,2011,84,3
701677,1,HERVÉ,1957,972,9
3042035,2,MARIA,1990,34,5
536853,1,FLORENT,1986,57,33
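As a minimal sketch of the loading step above, the snippet below reads a tiny in-memory stand-in for `dpt2020.csv` and filters it with a boolean mask instead of `drop`. The rows are invented for illustration, and the explicit `dtype={"dpt": str}` is an assumption added here to make sure codes such as `06` keep their leading zero; the real file may already load that column as strings.

```python
import io
import pandas as pd

# Hypothetical stand-in for dpt2020.csv: same columns, invented rows.
raw = io.StringIO(
    "sexe;preusuel;annais;dpt;nombre\n"
    "1;LUCIEN;1920;06;12\n"
    "1;_PRENOMS_RARES;1920;06;40\n"
    "2;MARIA;1990;XX;5\n"
    "2;MARIA;1990;2A;7\n"
)

# Read department codes as strings so "06" keeps its leading zero.
sample = pd.read_csv(raw, sep=";", dtype={"dpt": str})

# Keep only rows with a real name and a real department code.
keep = (sample.preusuel != "_PRENOMS_RARES") & (sample.dpt != "XX")
sample = sample[keep].reset_index(drop=True)

print(sample)
```

The mask version produces a new frame rather than modifying `names` in place, which can make it easier to re-run a cell without surprises.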


# Loading map data

Next, let's load some map data of regions in France using `geopandas`.  These map data come from the [INSEE] and [IGN] and were processed into the `geojson` format we'll need to work with by [Grégoire David].  Here's the [github] repository.

In this example, we'll work with the simplified departments tiles for the Hexagon, but that repository contains higher-resolution versions, the DOM-TOM, and more.

[Grégoire David]: https://gregoiredavid.fr
[INSEE]: http://www.insee.fr/fr/methodes/nomenclatures/cog/telechargement.asp
[IGN]: https://geoservices.ign.fr/adminexpress
[github]: https://github.com/gregoiredavid/france-geojson/

In [3]:
depts = gpd.read_file('departements-version-simplifiee.geojson')

depts.sample(5)

Unnamed: 0,code,nom,geometry
91,91,Essonne,"POLYGON ((2.22656 48.77610, 2.23298 48.76620, ..."
84,84,Vaucluse,"MULTIPOLYGON (((4.89291 44.36482, 4.90663 44.3..."
35,35,Ille-et-Vilaine,"MULTIPOLYGON (((-2.12371 48.60441, -2.14142 48..."
12,13,Bouches-du-Rhône,"POLYGON ((4.73906 43.92406, 4.82174 43.91283, ..."
31,31,Haute-Garonne,"POLYGON ((0.95398 43.78737, 0.97780 43.78644, ..."


Notice how `depts` is a geopandas dataframe.  We'll use it just as a regular `pandas` dataframe, but it includes the geometry info we need to be able to draw those regions when we pass them into Altair.  We just need to make sure that when we work with our data, we keep them in a geopandas dataframe and not a plain dataframe if we want to draw the departments.

In the next cell, notice how we do a right merge to bring the department data into `names`.  We call `merge` on `depts` because we need the result to be a geopandas dataframe.  Remember, `depts` is a geopandas dataframe, while `names` is a regular dataframe.  If we instead started from `names` and did a left merge with `depts`, we'd end up with a regular pandas dataframe. After this merge, both `names` and `depts` will be geopandas dataframes.

**Hint:** Be careful when you do your data joins here.  It's easy to merge the wrong way and accidentally create a _much bigger_ dataset.

In [4]:
# Keep a reference around to the plain pandas dataframe, without geometry data, just in case
just_names = names

names = depts.merge(names, how='right', left_on='code', right_on='dpt')

names.sample(5)

Unnamed: 0,code,nom,geometry,sexe,preusuel,annais,dpt,nombre
1615703,93,Seine-Saint-Denis,"POLYGON ((2.55306 49.00982, 2.58031 48.99159, ...",1,VINCENT,1990,93,134
1268891,38,Isère,"POLYGON ((5.62375 45.61327, 5.62303 45.60428, ...",1,OLIVIER,1930,38,5
2980126,80,Somme,"POLYGON ((1.38155 50.06577, 1.45388 50.11033, ...",2,MARGUERITE,1912,80,70
1438405,69,Rhône,"POLYGON ((4.38808 46.21979, 4.39205 46.26302, ...",1,ROMAIN,2008,69,73
1936169,25,Doubs,"POLYGON ((6.80701 47.56280, 6.81666 47.54792, ...",2,AUGUSTINE,1904,25,7
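One way to guard against the oversized merge mentioned in the hint is to ask pandas to validate the join. The sketch below uses plain pandas frames with invented rows (`toy_depts` and `toy_names` are hypothetical stand-ins, not the real tables) and relies on the `validate` argument of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for the real tables (plain pandas, invented rows).
toy_depts = pd.DataFrame(
    {"code": ["06", "75"], "nom": ["Alpes-Maritimes", "Paris"]}
)
toy_names = pd.DataFrame({
    "preusuel": ["LUCIEN", "MARIA", "LUCIEN"],
    "dpt": ["06", "06", "75"],
    "nombre": [12, 7, 30],
})

# validate="one_to_many" raises pandas.errors.MergeError if a department
# code appears more than once on the left side, which is the situation
# that silently multiplies rows.
merged = toy_depts.merge(
    toy_names, how="right", left_on="code", right_on="dpt",
    validate="one_to_many",
)
same_size = len(merged) == len(toy_names)

# Duplicated department rows are rejected instead of inflating the result.
dup_depts = pd.concat([toy_depts, toy_depts])
rejected = False
try:
    dup_depts.merge(
        toy_names, how="right", left_on="code", right_on="dpt",
        validate="one_to_many",
    )
except pd.errors.MergeError:
    rejected = True

print(same_size, rejected)
```

Comparing `len(merged)` with `len(names)` after the real merge is a cheap additional check.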


# Show a name over all years

Now we'll choose a name to show across all years.  To do that, we'll group all of the names in a department together (squashing the years together) and use the sum.

In [5]:
greoup = names.groupby(
    ['preusuel', 'dpt', 'sexe']).nombre.sum().reset_index()
greoup = depts.merge(greoup, how='right', left_on='code', right_on='dpt')

greoup

Unnamed: 0,code,nom,geometry,preusuel,dpt,sexe,nombre
0,84,Vaucluse,"MULTIPOLYGON (((4.89291 44.36482, 4.90663 44.3...",AADIL,84,1,3
1,92,Hauts-de-Seine,"POLYGON ((2.29097 48.95097, 2.32697 48.94536, ...",AADIL,92,1,3
2,95,Val-d'Oise,"POLYGON ((2.59052 49.07965, 2.57203 49.06149, ...",AAHIL,95,1,3
3,75,Paris,"POLYGON ((2.41634 48.84924, 2.46226 48.84254, ...",AALIYA,75,2,3
4,06,Alpes-Maritimes,"POLYGON ((6.88743 44.36105, 6.92257 44.35073, ...",AALIYAH,06,2,43
...,...,...,...,...,...,...,...
239574,74,Haute-Savoie,"POLYGON ((6.80252 45.77837, 6.75551 45.76635, ...",ÖMER,74,1,7
239575,77,Seine-et-Marne,"POLYGON ((2.57166 48.69201, 2.56880 48.70722, ...",ÖMER,77,1,8
239576,91,Essonne,"POLYGON ((2.22656 48.77610, 2.23298 48.76620, ...",ÖMER,91,1,18
239577,93,Seine-Saint-Denis,"POLYGON ((2.55306 49.00982, 2.58031 48.99159, ...",ÖMER,93,1,17
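To make the "squashing" step concrete, here is a small sketch on invented rows (the `toy` frame is hypothetical). Grouping by name, department, and sex and summing `nombre` collapses the separate years into one total per group. The result is a plain dataframe without geometry, which is presumably why the cell above merges with `depts` again.

```python
import pandas as pd

# Invented rows with the same columns as the names data.
toy = pd.DataFrame({
    "preusuel": ["LUCIEN", "LUCIEN", "LUCIEN", "MARIA"],
    "dpt": ["06", "06", "75", "06"],
    "sexe": [1, 1, 1, 2],
    "annais": [1920, 1921, 1920, 1990],
    "nombre": [12, 9, 30, 7],
})

# One row per (name, department, sex); the years are summed away.
totals = toy.groupby(["preusuel", "dpt", "sexe"]).nombre.sum().reset_index()
print(totals)
```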


Now let's pick a name and check out its distribution across Metropolitan France over the last 120 years.  In this example, I chose the name “Lucien,” which I rather like for some reason.

In [6]:
name = 'LUCIEN'
subset = greoup[greoup.preusuel == name]
alt.Chart(subset).mark_geoshape(stroke='white').encode(
    tooltip=['nom', 'code', 'nombre'],
    color='nombre',
).properties(width=800, height=600)
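Because the grouped data comes from a right merge, a department with no recorded births for the chosen name has no row in `subset`, so it will presumably be left blank on the map. If a complete map matters, one option is to start from the departments and left-merge the counts, filling the gaps with zero. The sketch below shows the idea on invented plain-pandas frames (`toy_depts` and `toy_subset` are hypothetical); with the real data the left side would be the `depts` geopandas dataframe.

```python
import pandas as pd

# Three departments, but the name only occurs in two of them.
toy_depts = pd.DataFrame({"code": ["06", "75", "93"]})
toy_subset = pd.DataFrame({"dpt": ["06", "75"], "nombre": [21, 30]})

# A left merge from the departments keeps every department row.
full = toy_depts.merge(toy_subset, how="left", left_on="code", right_on="dpt")

# Unmatched departments get NaN counts; treat them as zero occurrences.
full["nombre"] = full["nombre"].fillna(0).astype(int)
print(full[["code", "nombre"]])
```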

In [7]:
# Identify the top 5 most popular names
top5 = names.groupby('preusuel').nombre.sum().nlargest(5).index

# Filter the data to only include the top 5 names
top5_names = names[names.preusuel.isin(top5)]

# Identify the 5 least popular names
bottom5 = names.groupby('preusuel').nombre.sum().nsmallest(5).index

# Filter the data to only include the bottom 5 names
bottom5_names = names[names.preusuel.isin(bottom5)]

# Line charts of how the top 5 and bottom 5 names evolve over time
pop_chart = alt.Chart(top5_names).mark_line().encode(
    x='annais:O',
    y='sum(nombre):Q',
    color=alt.Color('preusuel:N', scale=alt.Scale(
        scheme='category10'), legend=alt.Legend(title="Popular Names")),
    tooltip=['preusuel', 'sum(nombre)', 'annais']
).properties(
    title='Top 5 Most Popular and Unpopular Baby Names Evolution Over Time in France'
)

unpop_chart = alt.Chart(bottom5_names).mark_line().encode(
    x='annais:O',
    y='sum(nombre):Q',
    color=alt.Color('preusuel:N', scale=alt.Scale(
        scheme='category20b'), legend=alt.Legend(title="Unpopular Names")),
    tooltip=['preusuel', 'sum(nombre)', 'annais']
)

# Combine the two charts
combined_chart = alt.layer(pop_chart, unpop_chart).resolve_scale(
    color='independent'
).properties(
    width=800,
    height=400
)

combined_chart
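A caveat on the "bottom 5" selection: when many names share the same small total, `nsmallest(5)` simply keeps the first ones it encounters among the ties. The sketch below uses an invented series of totals (`toy_totals` is hypothetical) to show the default behaviour next to `keep="all"`, which returns every tied name.

```python
import pandas as pd

# Invented totals per name; three names tie at the minimum.
toy_totals = pd.Series(
    {"MARIA": 500, "LUCIEN": 300, "AADIL": 3, "AAHIL": 3, "AALIYA": 3, "ÖMER": 40}
)

top2 = toy_totals.nlargest(2).index

# Default keep="first": ties are broken by order of appearance.
bottom2 = toy_totals.nsmallest(2).index

# keep="all": every name tied with the last selected value is retained.
bottom_all = toy_totals.nsmallest(2, keep="all").index

print(list(top2), list(bottom2), list(bottom_all))
```

With the real data, that means the five "least popular" names may be an arbitrary pick from a much larger group of equally rare names.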

In [8]:
# Yearly totals for the 5 most popular names overall
top5_names = names[names.preusuel.isin(top5)]
top5_names = top5_names.groupby(
    ['annais', 'preusuel']).nombre.sum().reset_index()

# Yearly totals for the 5 least popular names overall
bottom5_names = names[names.preusuel.isin(bottom5)]
bottom5_names = bottom5_names.groupby(
    ['annais', 'preusuel']).nombre.sum().reset_index()


# Faceted bar chart of the top 5 names over time, one row per name
pop_chart = alt.Chart(top5_names).mark_bar().encode(
    x='annais:O',
    y='sum(nombre):Q',
    color=alt.Color('preusuel:N', scale=alt.Scale(
        scheme='category10'), legend=alt.Legend(title="Popular Names")),
    tooltip=['preusuel', 'sum(nombre)', 'annais']
).properties(
    title='Top 5 Most Popular Baby Names Evolution Over Time in France'
).facet(
    row='preusuel:N'
)


pop_chart

In [45]:
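# Line chart of yearly totals for the 5 most and 5 least popular names.
# Aggregate first: charting the full `names` frame would embed every row
# (and its geometry) in the chart.
totals = names.groupby('preusuel').nombre.sum()
selected = totals.nlargest(5).index.union(totals.nsmallest(5).index)
line_data = names[names.preusuel.isin(selected)].groupby(
    ['annais', 'preusuel']).nombre.sum().reset_index()

line_chart = alt.Chart(line_data).mark_line().encode(
    x='annais:O',
    y='sum(nombre):Q',
    color=alt.Color('preusuel:N', scale=alt.Scale(
        scheme='category10'), legend=alt.Legend(title="Names")),
    tooltip=['preusuel', 'sum(nombre)', 'annais']
).properties(
    title='Top 5 Most Popular and Unpopular Baby Names Evolution Over Time in France'
)

line_chart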

In [13]:
# Choropleth map of the regional distribution of one name: the name with
# the single highest department-level count
top_name = greoup.groupby(['preusuel', 'code']).nombre.sum().nlargest(
    1).index.get_level_values(0)
top_name_rows = greoup[greoup.preusuel.isin(top_name)]

# No `shape` encoding is needed: mark_geoshape draws each row's geometry
top_name_map = alt.Chart(top_name_rows).mark_geoshape(stroke='white').encode(
    color=alt.Color('nombre:Q', scale=alt.Scale(
        scheme='blues'), title='Number of Occurrences'),
    tooltip=['nom', 'code', 'nombre', 'preusuel']
).properties(width=800, height=600)

top_name_map

In [17]:
# Total number of recorded names per department, across all names and years

overall_popularity = greoup.groupby('dpt').nombre.sum().reset_index()
overall_popularity = depts.merge(
    overall_popularity, how='right', left_on='code', right_on='dpt')

# No `shape` encoding is needed: mark_geoshape draws each row's geometry
overall_map = alt.Chart(overall_popularity).mark_geoshape(stroke='white').encode(
    color=alt.Color('nombre:Q', scale=alt.Scale(
        scheme='blues'), title='Number of Occurrences'),
    tooltip=['nom', 'code', 'nombre']
).properties(width=800, height=600)


overall_map

In [24]:
# Summarize the data by year, department, and gender.
# Keep `nom` so the pyramid chart can label each department by name;
# rows whose department has no match in `depts` (nom is NaN) are dropped.
names_year_gender = names.groupby(
    ['annais', 'dpt', 'nom', 'sexe']).nombre.sum().reset_index()

# Convert gender codes to labels
names_year_gender['sexe'] = names_year_gender['sexe'].map(
    {1: 'Masculin', 2: 'Féminin'})

# Negate male counts so they display on the left side of the pyramid
is_male = names_year_gender['sexe'] == 'Masculin'
names_year_gender.loc[is_male, 'nombre'] *= -1

In [None]:
# Create a selection for the year using a slider
# Cast the bounds to plain ints so the slider gets ordinary numbers
year_select = alt.binding_range(
    min=int(names_year_gender['annais'].min()),
    max=int(names_year_gender['annais'].max()),
    step=1)
year_selection = alt.selection_single(
    name='annais', fields=['annais'], bind=year_select, init={'annais': 1990})

# Base chart
base = alt.Chart(names_year_gender).encode(
    y=alt.Y('nom:N', title='Department', sort=alt.EncodingSortField(
        field='nombre', op='sum', order='descending'))
)

# Male bars
male_bars = base.transform_filter(
    (alt.datum.sexe == 'Masculin') & (alt.datum.annais == year_selection.annais)
).mark_bar(color='steelblue').encode(
    x=alt.X('nombre:Q', title='Number of Occurrences')
)

# Female bars
female_bars = base.transform_filter(
    (alt.datum.sexe == 'Féminin') & (alt.datum.annais == year_selection.annais)
).mark_bar(color='lightpink').encode(
    x=alt.X('nombre:Q', title='Number of Occurrences')
)

# Text labels for male bars (negative values, so the bar ends on the left)
male_text = male_bars.mark_text(
    align='right',
    baseline='middle',
    dx=-3  # Nudges text to left so it doesn't appear on top of the bar
).encode(
    text='nombre:Q'
)

# Text labels for female bars (positive values, so the bar ends on the right)
female_text = female_bars.mark_text(
    align='left',
    baseline='middle',
    dx=3  # Nudges text to right so it doesn't appear on top of the bar
).encode(
    text='nombre:Q'
)

# Combine the charts
pyramid_chart = alt.layer(male_bars, female_bars, male_text, female_text).add_selection(
    year_selection
).properties(
    title='Evolution of the number of male and female first names by department over the 20th century',
    width=800,
    height=600
)

pyramid_chart