# Making choropleth maps in Altair

Here's a quick example of how to make a choropleth map in Altair.  In this example, we'll work with a fairly large data set of baby names in France from 1900-2019, broken down by department.

To work with geographical data, we'll use `geopandas`, which extends `pandas` dataframes with support for geographical outlines, such as those stored in the `geojson` format.  You can use these dataframes just as you would a regular `pandas` dataframe, but they will include that extra geographical outline data.

To get started, we'll need to import our libraries.

In [77]:
import altair as alt
import pandas as pd
import geopandas as gpd # Requires geopandas -- e.g.: conda install -c conda-forge geopandas
alt.data_transformers.enable('json') # Let Altair/Vega-Lite work with large data sets

pass

# Reading our names data

Now, let's read in our dataset.  The exported data is in CSV format, but with a `;` separator instead of commas.  The INSEE data collapses rare names into a single `_PRENOMS_RARES` entry and elides department-level information for some rows by marking the department as `XX` (presumably to protect individuals with uncommon names or who were among the only ones given that name in a given year).  We'll strip both kinds of rows out.

In [78]:
names = pd.read_csv("dpt2020.csv", sep=";")
names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
names.drop(names[names.dpt == 'XX'].index, inplace=True)

names.sample(5)

Unnamed: 0,sexe,preusuel,annais,dpt,nombre
1238708,1,NADIR,1984,75,7
1180515,1,MAXIME,2019,28,7
609583,1,GAUTIER,1998,88,3
628430,1,GÉRARD,1957,972,67
1755979,2,ADRIENNE,1937,73,3
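
The loading and filtering steps above can also be written as a small reusable helper with a boolean mask instead of in-place drops. This is only a sketch: `load_names` and `SAMPLE` are hypothetical names, the sample rows are made up to mimic the file layout, and `dtype={"dpt": str}` is an assumption meant to keep department codes such as `01` or `2A` as text.

```python
import io
import pandas as pd

# Tiny made-up sample in the same layout as dpt2020.csv (values are illustrative only)
SAMPLE = """sexe;preusuel;annais;dpt;nombre
1;_PRENOMS_RARES;1900;02;7
1;LUCIEN;XXXX;XX;12
1;LUCIEN;1920;01;5
2;JEANNE;1920;75;9
"""

def load_names(source):
    """Read the semicolon-separated export and drop the collapsed rows."""
    # Assumption: reading dpt as text preserves codes like "01" and "2A"
    names = pd.read_csv(source, sep=";", dtype={"dpt": str})
    # Keep only rows with a real name and a known department
    keep = (names.preusuel != "_PRENOMS_RARES") & (names.dpt != "XX")
    return names[keep].reset_index(drop=True)

# read_csv accepts a path or a file-like object, so the sample can stand in for the file
sample_names = load_names(io.StringIO(SAMPLE))
print(sample_names)
```

With the real file, the call would be `load_names("dpt2020.csv")`, which would also avoid repeating the same three lines in every function that needs the raw data.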


# Loading map data

Next, let's load some map data of regions in France using `geopandas`.  These map data come from the [INSEE] and [IGN] and were processed into the `geojson` format we'll need to work with by [Grégoire David].  Here's the [github] repository.

In this example, we'll work with the simplified departments tiles for the Hexagon, but that repository contains higher-resolution versions, the DOM-TOM, and more.

[Grégoire David]: https://gregoiredavid.fr
[INSEE]: http://www.insee.fr/fr/methodes/nomenclatures/cog/telechargement.asp
[IGN]: https://geoservices.ign.fr/adminexpress
[github]: https://github.com/gregoiredavid/france-geojson/

In [79]:
depts = gpd.read_file('departements-version-simplifiee.geojson')

depts.sample(5)

Unnamed: 0,code,nom,geometry
94,94,Val-de-Marne,"POLYGON ((2.33190 48.81701, 2.36395 48.81632, ..."
20,22,Côtes-d'Armor,"POLYGON ((-3.65914 48.65921, -3.63649 48.67069..."
80,80,Somme,"POLYGON ((1.38155 50.06577, 1.45388 50.11033, ..."
22,24,Dordogne,"POLYGON ((0.62974 45.71457, 0.65423 45.68870, ..."
21,23,Creuse,"POLYGON ((2.16779 46.42407, 2.19757 46.42830, ..."


Notice how `depts` is a geopandas dataframe.  We'll use it just as a regular `pandas` dataframe, but it includes the geometry info we need to be able to draw those regions when we pass them into Altair.  We just need to make sure that when we work with our data, we keep them in a geopandas dataframe and not a plain dataframe if we want to draw the departments.

In the next cell, notice how we use a right merge to bring the department data into `names`.  We call `merge` on `depts` because we need the result to be a geopandas dataframe: `depts` is a geopandas dataframe, while `names` is a regular dataframe.  If we instead called `merge` on `names` with a left merge, we'd end up with a regular pandas dataframe.  After this merge, both `names` and `depts` will be geopandas dataframes.

**Hint:** Be careful when you do your data joins here.  It's easy to merge the wrong way and accidentally create a _much bigger_ dataset.

In [80]:
# Keep a reference around to the plain pandas dataframe, without geometry data, just in case
just_names = names

names = depts.merge(names, how='right', left_on='code', right_on='dpt')

names.sample(5)

Unnamed: 0,code,nom,geometry,sexe,preusuel,annais,dpt,nombre
3107157,13.0,Bouches-du-Rhône,"POLYGON ((4.73906 43.92406, 4.82174 43.91283, ...",2,MARTINE,1974,13,20
2005688,26.0,Drôme,"POLYGON ((4.80049 45.29836, 4.85880 45.30895, ...",2,CAMILLE,1996,26,48
912676,54.0,Meurthe-et-Moselle,"POLYGON ((5.47091 49.49721, 5.54118 49.51526, ...",1,JUSTIN,1911,54,3
1794695,,,,2,AMBRE,2005,971,5
444952,30.0,Gard,"POLYGON ((3.37365 44.17076, 3.43083 44.14800, ...",1,ENZO,1993,30,8
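
One way to guard against the oversized-merge problem mentioned in the hint is to let `merge` check the key relationship and to flag unmatched rows. The sketch below uses plain pandas toy frames (`toy_depts` and `toy_names` are made-up stand-ins without geometry) together with the `validate` and `indicator` parameters of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for `depts` and `names` (plain pandas, no geometry; values are made up)
toy_depts = pd.DataFrame({"code": ["01", "13"], "nom": ["Ain", "Bouches-du-Rhône"]})
toy_names = pd.DataFrame({
    "preusuel": ["AMBRE", "ENZO", "MARTINE"],
    "dpt": ["971", "01", "13"],
    "nombre": [5, 8, 20],
})

merged = toy_depts.merge(
    toy_names,
    how="right",
    left_on="code",
    right_on="dpt",
    validate="one_to_many",  # raises MergeError if a code appears twice in toy_depts
    indicator=True,          # adds a `_merge` column: 'both', 'left_only' or 'right_only'
)

# A right merge against unique codes should keep exactly one row per names row
assert len(merged) == len(toy_names)

# Rows whose department has no outline, like the 971 row in the sample output above
unmatched = merged[merged["_merge"] == "right_only"]
print(unmatched[["preusuel", "dpt", "nombre"]])
```

Comparing the row count before and after the merge is a cheap check that the join did not multiply rows.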


# Show a name over all years

Now we'll choose a name to show across all years.  To do that, we'll group all of the rows for a name in a department together (squashing the years together) and use the sum.

In [81]:
# numeric_only=True restricts the sum to numeric columns.  A plain .sum() fails here,
# presumably because columns such as geometry cannot be summed.
grouped = names.groupby(['dpt', 'preusuel', 'sexe'], as_index=False).sum(numeric_only=True)

grouped = depts.merge(grouped, how='right', left_on='code', right_on='dpt') # Add geometry data back in
grouped

Unnamed: 0,code,nom,geometry,dpt,preusuel,sexe,nombre
0,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,AARON,1,160
1,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABBY,2,3
2,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABDALLAH,1,7
3,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABDEL,1,3
4,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABDELKADER,1,3
...,...,...,...,...,...,...,...
239574,,,,974,ÉSAÏE,1,3
239575,,,,974,ÉTHAN,1,53
239576,,,,974,ÉTIENNE,1,3
239577,,,,974,ÉVA,2,32
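
An alternative to `numeric_only=True`, sketched here on a made-up toy frame, is to select the `nombre` column before summing, so that no other column is ever aggregated. The frame `toy` and its values are illustrative assumptions, not the real data.

```python
import pandas as pd

# Made-up rows: one name, two departments, two years each
toy = pd.DataFrame({
    "dpt": ["01", "01", "75", "75"],
    "preusuel": ["JEAN", "JEAN", "JEAN", "JEAN"],
    "sexe": [1, 1, 1, 1],
    "annais": ["1950", "1951", "1950", "1951"],
    "nom": ["Ain", "Ain", "Paris", "Paris"],
    "nombre": [10, 12, 300, 280],
})

# Selecting the column first means only `nombre` is summed across the years
totals = toy.groupby(["dpt", "preusuel", "sexe"], as_index=False)["nombre"].sum()
print(totals)
```

The result has one row per department, name, and sex, which is the same shape the merge with `depts` expects.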


Now let's pick a name and check out its distribution over the last 120 years across Metropolitan France.  The cell below uses the name “Jean” (stored in uppercase as `JEAN`); swap in any other name, such as “Lucien,” which I rather like for some reason.

In [82]:
name = 'JEAN'
subset = grouped[grouped.preusuel == name]
alt.Chart(subset).mark_geoshape(stroke='white').encode(
    tooltip=['nom', 'code', 'nombre'],
    color='nombre',
).properties(width=800, height=600)
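
The `grouped` table shown earlier ends with rows for departments such as 974 that have no geometry, because the simplified file only covers the Hexagon. Those rows have no outline to draw, so it may be worth filtering them out before charting. A minimal sketch with plain pandas follows; `toy_grouped` is a made-up frame and its geometry values are placeholder strings rather than real shapes.

```python
import pandas as pd

# Made-up stand-in for `grouped`; geometry holds placeholder text, not real polygons
toy_grouped = pd.DataFrame({
    "code": ["01", None],
    "nom": ["Ain", None],
    "geometry": ["POLYGON (...)", None],
    "dpt": ["01", "974"],
    "preusuel": ["JEAN", "JEAN"],
    "nombre": [160, 53],
})

name = "JEAN"
# Keep the chosen name, and only the rows that actually have an outline
drawable = toy_grouped[(toy_grouped.preusuel == name) & toy_grouped.geometry.notna()]
print(drawable[["dpt", "nom", "nombre"]])
```

On the real data, the same mask would be applied to `grouped` before passing the subset to `alt.Chart`.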

### Visualization 2
- **Is there a regional effect in the data?**
- **Are some names more popular in some regions?**
- **Are popular names generally popular across the whole country?**


To answer these questions, we first wrote a few general-purpose functions to experiment with. After studying their results, we put our best solution at the end of the notebook.

In [85]:
def heatmap(name):
    """
    Take a name and build a heatmap of that name's popularity by department.

    Arguments name: str, the name whose popularity heatmap should be displayed

    Returns the heatmap of the name's popularity by department
    """

    subset = grouped[grouped.preusuel == name]

    heatmap = alt.Chart(subset).mark_rect().encode(
        x='nom:N',
        y='code:N',
        color=alt.Color('nombre:Q', legend=alt.Legend(title='Popularity')),
        tooltip=['nom', 'code', 'nombre']
    ).properties(
        width=800,
        height=1400
    )

    return heatmap


In [89]:
def top_10_region(name):
    """
    Take a name and build a bar chart of the 10 departments where that name is most common.

    Arguments name: str, the name for which to display the top 10 departments

    Returns the bar chart of the 10 departments with the most occurrences of name
    """
    
    subset = grouped[grouped.preusuel == name]
    top_regions = subset.nlargest(10, 'nombre')

    bar_chart = alt.Chart(top_regions).mark_bar().encode(
        x='nombre:Q',
        y=alt.Y('nom:N', sort='-x'),
        tooltip=['nom', 'nombre']
    ).properties(
        width=400,
        height=300,
        title=f'Top 10 departments for the name {name}'
    )

    return bar_chart


In [97]:
def popularity_per_region(name):
    """
    Take a name and build a bar chart of that name's popularity in each department.

    Arguments name: str, the name whose popularity by department should be displayed

    Returns the bar chart of the name's popularity by department
    """
    
    names = pd.read_csv("dpt2020.csv", sep=";")
    names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
    names.drop(names[names.dpt == 'XX'].index, inplace=True)


    subset = names[names.preusuel == name]
    
    bar_chart = alt.Chart(subset).mark_bar().encode(
        x='dpt:N',
        y='nombre:Q',
        tooltip=['dpt', 'nombre']
    ).properties(
        width=1000,
        height=400,
        title=f'Popularity of the name {name} in each department, all years combined'
    )

    return bar_chart


In [100]:
def popularity_top_7_names(departement):
    """
    Take a department code and build a line chart of the popularity of the 7 most common names in that department between 2000 and 2020.

    Arguments departement: str, the code of the department whose name popularity should be displayed

    Returns the line chart of the popularity of the 7 most common names in the department
    """

    names = pd.read_csv("dpt2020.csv", sep=";")
    names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
    names.drop(names[names.dpt == 'XX'].index, inplace=True)

    names['annais'] = names['annais'].astype(int)

    desired_dpt = departement
    subset = names[(names.annais >= 2000) & (names.annais <= 2020) & (names.dpt == desired_dpt)]

    grouped = subset.groupby(['annais', 'dpt', 'preusuel'])['nombre'].sum().reset_index()

    top_n = 7
    top_names = grouped.groupby('preusuel')['nombre'].sum().nlargest(top_n).index

    top_names_subset = grouped[grouped.preusuel.isin(top_names)]

    line_chart = alt.Chart(top_names_subset).mark_line().encode(
        x='annais:O',
        y='nombre:Q',
        color='preusuel:N',
        tooltip=['preusuel', 'annais', 'nombre']
    ).properties(
        width=600,
        height=400,
        title='Evolution of the top 7 names in the department'
    )

    return line_chart
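
The top-N selection inside `popularity_top_7_names` can be tried in isolation on a toy frame. The sketch below uses made-up names and counts, and `top_n = 2` only to keep the example small; the steps are the same: sum per name, take the largest totals, then keep only the rows for those names.

```python
import pandas as pd

# Made-up yearly counts for three names
toy = pd.DataFrame({
    "annais": [2000, 2000, 2000, 2001, 2001, 2001],
    "preusuel": ["LEA", "HUGO", "ZOE", "LEA", "HUGO", "ZOE"],
    "nombre": [50, 40, 5, 45, 60, 4],
})

top_n = 2
# Total births per name over the whole period
totals = toy.groupby("preusuel")["nombre"].sum()
# Names with the largest totals, most common first
top_names = totals.nlargest(top_n).index
# Keep the per-year rows for those names only, ready for a line chart
top_subset = toy[toy.preusuel.isin(top_names)]
print(top_subset)
```

Because the ranking uses totals over the whole period, a name that was popular only briefly can still be pushed out by one that was moderately common every year.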


## Question 1: Are some names more popular in some regions?

In [114]:
popularity_per_region("GABRIELLE")

In [103]:
popularity_per_region("JEREMY")

**Indeed, name popularity can vary across different regions. This variation can be attributed to cultural, historical, and regional factors that shape naming trends and preferences. France, with its rich cultural heritage and diverse regional identities, is no exception. The cultural richness of France contributes to a wide range of naming traditions and influences. Historical events, local customs, and regional identities can all play a role in determining popular names in specific areas.**

## Question 2: Are popular names generally popular across the whole country?

In [104]:
popularity_top_7_names("75")

In [105]:
popularity_top_7_names("15")

In [106]:
popularity_top_7_names("45")

**At first glance at the data, the answer appears to be no. However, some names have a more universal appeal and remain popular throughout France.**

## Question 3: Is there a regional effect in the data?

In [107]:
popularity_per_region("HUGO")


In [115]:
heatmap("GABRIELLE")

**We can clearly see a regional effect in our data.**

# OUR SOLUTION FOR VISUALIZATION 2

In [None]:
def solution(name):
    """
    Take a name and build a heatmap of its popularity per department and per year, from 2000 to 2020.
    """
    names = pd.read_csv("dpt2020.csv", sep=";")
    names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
    names.drop(names[names.dpt == 'XX'].index, inplace=True)

    # Convert the year to an integer so it can be compared numerically
    names['annais'] = names['annais'].astype(int)

    # The 'XX' departments were already dropped above, so only the years and the name are filtered here
    subset = names[(names.annais >= 2000) & (names.annais <= 2020) & (names.preusuel == name)]

    chart = alt.Chart(subset).mark_rect().encode(
        x=alt.X('annais:O', title='Year'),
        y=alt.Y('dpt:N', title='Region'),
        color=alt.Color('sum(nombre):Q', title='Popularity'),
        tooltip=['annais', 'dpt', 'sum(nombre)']
    ).properties(
        width=1000,
        height=1400,
        title='Popularity of the Name "{}" across Regions and Years'.format(name)
    )
    return chart


In [113]:
solution("GABRIELLE")

Here are the questions we need to answer with this solution, followed by why we think it is a good visualisation.

- **Is there a regional effect in the data?**
- **Are some names more popular in some regions?**
- **Are popular names generally popular across the whole country?**


**Is there a regional effect in the data?**

The data visualization offers valuable insights into the regional effect on naming preferences. It demonstrates that certain names, like Gabrielle, enjoy high popularity in specific regions such as Paris, but their popularity might not extend uniformly across the entire country. This suggests that naming trends exhibit significant regional variations, indicating the influence of cultural diversity and regional factors in shaping naming preferences in different parts of France.

**Are popular names generally popular across the whole country, and are some names more popular in some regions?**

The visualization provides evidence of a regional effect in the popularity of names, as seen with the example of Gabrielle. While some names achieve popularity nationwide, others experience varying degrees of popularity depending on the region. This observation highlights the unique cultural and social dynamics present in different regions of France, contributing to the preference and prominence of certain names in specific areas. Consequently, it can be concluded that popular names are not necessarily universally popular across the entire country, indicating the existence of regional variations in naming preferences.

**Strengths:**

- The visualization enables a quick assessment of name trends and their evolution over time.
- By focusing on a specific name, it facilitates the observation of popularity shifts and patterns associated with that name.
- The visualization sheds light on regional differences in naming preferences, underscoring the cultural richness and diversity within France.

**Weaknesses:**

- The choice of the name under analysis can significantly impact the results, warranting careful selection.
- Aesthetically, the visualization could be further refined for enhanced visual appeal.
- While the visualization captures temporal trends, it might not provide a comprehensive understanding of the multifaceted factors influencing naming patterns. Additionally, the data size limitations of the visualization should be acknowledged.
- Importantly, the visualization does not provide a definitive answer to whether popular names are generally popular nationwide, as the regional effect demonstrates variations in popularity across different parts of the country.