# Making chloropleth maps in Altair

Here's a quick example of how to make a chloropleth map in Altair.  In this example, we'll work with a fairly large data set of baby names in France from 1900-2019, broken down by department.

To work with geographical data, we'll use the `geopandas`, which loads `pandas` dataframes, but with support for geographical outlines in the `geojson` format.  You can use these dataframes just as you would a regular `pandas` dataframe, but they will include that extra geographical outline data.

To get started, we'll need to import our libraries.

In [1]:
pip install geopandas

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.[0m
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install altair

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.[0m
Note: you may need to restart the kernel to use updated packages.


In [3]:
import altair as alt
import pandas as pd
import geopandas as gpd # Requires geopandas -- e.g.: conda install -c conda-forge geopandas
alt.data_transformers.enable('json') # Let Altair/Vega-Lite work with large data sets

pass

# Reading our names data

Now, let's read in our dataset.  The exported data is in CSV format, but with a `;` separator instead of commas.  The INSEE data collapses rare names or where department-level information has been elided (presumably to protect individuals with uncommon names or who were one of the only ones born with that name in a given year).  We'll strip those out.

In [4]:
names = pd.read_csv("dpt2020.csv", sep=";")
names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
names.drop(names[names.dpt == 'XX'].index, inplace=True)

names.sample(5)

Unnamed: 0,sexe,preusuel,annais,dpt,nombre
751348,1,JACKY,1964,54,5
3173146,2,MARYVONNE,1965,59,11
2942508,2,LUDIVINE,1986,28,11
1945595,2,ANTOINETTE,1983,13,9
1220264,1,MOHAMED,1986,26,9


# Loading map data

Next, let's load some map data of regions in France using `geopandas`.  These map data come from the [INSEE] and [IGN] and were processed into the `geojson` format we'll need to work with by [Grégoire David].  Here's the [github] repository.

In this example, we'll work with the simplified departments tiles for the Hexagon, but that repository contains higher-resolution versions, the DOM-TOM, and more.

[Grégoire David]: https://gregoiredavid.fr
[INSEE]: http://www.insee.fr/fr/methodes/nomenclatures/cog/telechargement.asp
[IGN]: https://geoservices.ign.fr/adminexpress
[github]: https://github.com/gregoiredavid/france-geojson/

In [5]:
depts = gpd.read_file('departements-version-simplifiee.geojson')

depts.sample(5)

Unnamed: 0,code,nom,geometry
15,16,Charente,"POLYGON ((-0.10294 45.96966, -0.04143 45.99348..."
5,6,Alpes-Maritimes,"POLYGON ((6.88743 44.36105, 6.92257 44.35073, ..."
22,24,Dordogne,"POLYGON ((0.62974 45.71457, 0.65423 45.6887, 0..."
35,35,Ille-et-Vilaine,"MULTIPOLYGON (((-2.12371 48.60441, -2.14142 48..."
20,22,Côtes-d'Armor,"POLYGON ((-3.65914 48.65921, -3.63649 48.67069..."


Notice how `depts` is a geopandas dataframe.  We'll use it just as a regular `pandas` dataframe, but it includes the geometry info we need to be able to draw those regions when we pass them into Altair.  We just need to make sure that when we work with our data, we keep them in a geopandas dataframe and not a plain dataframe if we want to draw the departments.

In the next cell, notice how we do a right-merge to bring in department data into names.  We do this as a merge on `depts` because we need a geopandas dataframe.  Remember, `depts` is a geopandas dataframe, while `names` is a regular dataframe.  If we did a left merge on `names`, we'd end up with a regular pandas dataframe. After this merge, both `names` and `depts` will be geopandas dataframes.

**Hint:** Be careful when you do your data joins here.  It's easy to accidentally merge the wrong way to accidentally create a _much bigger_ dataset.

In [6]:
# Keep a reference around to the plain pandas dataframe, without geometry data, just in case
just_names = names

names = depts.merge(names, how='right', left_on='code', right_on='dpt')

names.sample(5)

Unnamed: 0,code,nom,geometry,sexe,preusuel,annais,dpt,nombre
1305211,50,Manche,"POLYGON ((-1.11962 49.35557, -1.11346 49.32795...",1,PAUL,1963,50,5
3312509,59,Nord,"MULTIPOLYGON (((3.0404 50.15971, 3.06301 50.17...",2,OLYMPE,1922,59,3
2665268,37,Indre-et-Loire,"POLYGON ((0.61443 47.69421, 0.63131 47.7091, 0...",2,JULIA,1921,37,3
759098,68,Haut-Rhin,"POLYGON ((7.19828 48.31047, 7.24173 48.30243, ...",1,JEAN,1969,68,137
19575,68,Haut-Rhin,"POLYGON ((7.19828 48.31047, 7.24173 48.30243, ...",1,ADOLPHE,1931,68,10


In [7]:
just_names

Unnamed: 0,sexe,preusuel,annais,dpt,nombre
10885,1,AADIL,1983,84,3
10886,1,AADIL,1992,92,3
10888,1,AAHIL,2016,95,3
10892,1,AARON,1962,75,3
10893,1,AARON,1976,75,3
...,...,...,...,...,...
3727545,2,ZYA,2013,44,4
3727546,2,ZYA,2013,59,3
3727547,2,ZYA,2017,974,3
3727548,2,ZYA,2018,59,3


# Show a name over all years

Now we'll choose a name to show across all years.  To that, we'll group all of the names in a department together (squashing the years together) and use the sum.

In [9]:
# Select only numeric columns for the sum operation
numeric_columns = names.select_dtypes(include=['number']).columns

# Group by the specified columns and sum only the numeric columns
grouped = names.groupby(['dpt', 'preusuel', 'sexe'], as_index=False)[numeric_columns].sum()

# Merge with the depts DataFrame to add geometry data back in
grouped = depts.merge(grouped, how='right', left_on='code', right_on='dpt')

Now let's pick a name and check out how it's distribution over the last 120 years across Metropolitan France.  In this example, I choose the name “Lucien,” which I rather like for some reason.

In [10]:
name = 'LUCIEN'
subset = grouped[grouped.preusuel == name]
alt.Chart(subset).mark_geoshape(stroke='white').encode(
    tooltip=['nom', 'code', 'nombre'],
    color='nombre',
).properties(width=800, height=600)

In [17]:
### Visualisation 1
grouped

Unnamed: 0,code,nom,geometry,dpt,preusuel,sexe,nombre
0,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,AARON,15,160
1,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABBY,2,3
2,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABDALLAH,2,7
3,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABDEL,1,3
4,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABDELKADER,1,3
...,...,...,...,...,...,...,...
239574,,,,974,ÉSAÏE,1,3
239575,,,,974,ÉTHAN,8,53
239576,,,,974,ÉTIENNE,1,3
239577,,,,974,ÉVA,16,32


## Visualisation 1

In [11]:
# Create a DataFrame
names_df = pd.DataFrame(just_names)

# Aggregate the data to get the total count of each name per year
names_df_grouped = names_df.groupby(['preusuel', 'annais']).sum().reset_index()

# Get the top 10 names based on their overall popularity
top_10_names = names_df_grouped.groupby('preusuel')['nombre'].sum().nlargest(10).index

# Filter the data for the top 10 names
top_10_df = names_df_grouped[names_df_grouped['preusuel'].isin(top_10_names)]

# Create the Altair line chart
chart = alt.Chart(top_10_df).mark_line().encode(
    x='annais:O',  # Year as ordinal
    y='sum(nombre):Q',  # Sum of occurrences as quantitative
    color='preusuel:N',  # Name as nominal for color encoding
    strokeWidth=alt.value(2)
).properties(
    width=800,
    height=500,
    title='Evolution of Top 10 Baby Names Over Time'
)

# Display the chart
chart.show()

## Visualisation 2

In [None]:
import pandas as pd
import altair as alt

selected_names = ['ALAIN', 'ANDRE', 'JEAN', 'JEANNE', 'LOUIS', 'MARIE', 'MICHEL', 'PHILIPPE', 'PIERRE', 'RENE']
filtered_names = names[names['preusuel'].isin(selected_names)]

numeric_columns = filtered_names.select_dtypes(include=['number']).columns

grouped = filtered_names.groupby(['dpt', 'preusuel', 'sexe', 'annais'], as_index=False)[numeric_columns].sum()

grouped = depts.merge(grouped, how='right', left_on='code', right_on='dpt')

name_dropdown = alt.binding_select(options=sorted(grouped['preusuel'].unique().tolist()))
name_select = alt.selection_point(fields=['preusuel'], bind=name_dropdown, name='Select a name:')

year_slider = alt.binding_range(min=int(grouped['annais'].min()), max=int(grouped['annais'].max()), step=1)
year_select = alt.selection_point(fields=['annais'], bind=year_slider, name='Select a year:')

chart = alt.Chart(grouped).mark_geoshape(stroke='white').encode(
    tooltip=['nom:N', 'code:N', 'nombre:Q'],
    color=alt.Color('nombre:Q', scale=alt.Scale(scheme='blues'), legend=None)
).add_params(
    name_select,
    year_select
).transform_filter(
    name_select
).transform_filter(
    year_select
).properties(
    width=800,
    height=600,
    title='Regional Distribution of Baby Names in France'
)

chart.show()


### Visualisation 3 : 

In [8]:
## find unixes names 
grouped = names.groupby(['preusuel', 'sexe'], as_index=False)['nombre'].sum()
pivoted = grouped.pivot(index='preusuel', columns='sexe', values='nombre').fillna(0)
pivoted.columns = ['Girls', 'Boys'] 

pivoted['Total'] = pivoted['Girls'] + pivoted['Boys']
pivoted['ProportionDiff'] = abs(pivoted['Girls'] - pivoted['Boys']) / pivoted['Total']

filtered = pivoted[pivoted['Total'] > 1000]

unisex_names = filtered.sort_values('ProportionDiff').head(10)

print(unisex_names)

              Girls      Boys     Total  ProportionDiff
preusuel                                               
YAEL         1615.0    1666.0    3281.0        0.015544
CHARLIE     10060.0   10465.0   20525.0        0.019732
JANICK        774.0     870.0    1644.0        0.058394
LOUISON      3146.0    3987.0    7133.0        0.117903
JANY          668.0     900.0    1568.0        0.147959
YAËL          863.0     626.0    1489.0        0.159167
DOMINIQUE  238623.0  165887.0  404510.0        0.179813
SASHA        8380.0    5800.0   14180.0        0.181946
MAE          1234.0    2123.0    3357.0        0.264820
GABY          447.0     779.0    1226.0        0.270799


In [11]:
filtered = names[names['preusuel'].isin(['CHARLIE', 'MARIE', 'CAMILLE', 'YAEL', 'JANICK',  'LOUISON', 'JANY', 'DOMINIQUE', 'SASHA', 'MAE', 'GABY'])]  
filtered['annais'] = pd.to_numeric(filtered['annais'], errors='coerce')
filtered = filtered.dropna(subset=['annais'])
filtered['annais'] = filtered['annais'].astype(int)

grouped = filtered.groupby(['preusuel', 'sexe', 'annais'], as_index=False)['nombre'].sum()

grouped['nombre_signed'] = grouped.apply(
    lambda row: -row['nombre'] if row['sexe'] == 2 else row['nombre'], axis=1
)

grouped['sexe'] = grouped['sexe'].astype(str)

chart = alt.Chart(grouped).mark_bar().encode(
    y=alt.Y('annais:O', title='Year', sort='descending'),
    x=alt.X('nombre_signed:Q', title='Number of births', axis=alt.Axis(format='~s')),
    color=alt.Color('sexe:N',
                    scale=alt.Scale(domain=['1', '2'], range=['steelblue', 'hotpink']),
                    title='Sex',
                    legend=alt.Legend(labelExpr="datum.value == '1' ? 'Boys' : 'Girls'")),
    tooltip=['preusuel:N', 'annais:O', 'sexe:N', 'nombre:Q']
).properties(
    width=500,
    height=700
).facet(
    row=alt.Row('preusuel:N', title=None)
).resolve_scale(
    x='independent'
)

chart.show()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  super().__setitem__(key, value)
