# Baby Names: Visualization 2

In this notebook we present two maps that answer task two of the Baby Names mini-project.
Each map is introduced by a large, bold Markdown heading. Their strengths and weaknesses are discussed at the bottom of the notebook.

In [1]:
import altair as alt
import pandas as pd
import geopandas as gpd # Requires geopandas -- e.g.: conda install -c conda-forge geopandas
alt.data_transformers.enable('json') # Let Altair/Vega-Lite work with large data sets

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

pass

#### Reading our names data

We strip out data points for rare names or when the département is missing.
We also ignore data from Outre-Mer départements because we don't have geoshapes for those.

In [2]:
names = pd.read_csv("dpt2020.csv", sep=";").rename(columns={"preusuel": "prenom", "annais": "annees"})
names.drop(names[names.prenom == '_PRENOMS_RARES'].index, inplace=True)
names.drop(names[names.dpt == 'XX'].index, inplace=True)
names.drop(names[names.dpt >= '97'].index, inplace=True)
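
The last filter relies on string comparison: any code that sorts at or after `'97'` is dropped, which covers the three-character overseas codes starting with 97, while every metropolitan code sorts before it. A minimal plain-Python sketch of the same rule, using a made-up list of codes rather than the real file:

```python
# Toy département codes (illustrative, not read from dpt2020.csv).
codes = ["01", "20", "75", "95", "971", "974", "XX"]

def is_metropolitan(code):
    """Mirror the notebook's filter: drop 'XX' and any code that sorts at or after '97'."""
    return code != "XX" and code < "97"

kept = [c for c in codes if is_metropolitan(c)]
print(kept)  # ['01', '20', '75', '95']
```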

In [3]:
#names.sample(5)

#### Loading map data

The map data come from [INSEE] and [IGN] and were converted to the `geojson` format by [Grégoire David]. Here's the [github] repository.

In this example, we work with the simplified département outlines for metropolitan France (the Hexagon), but that repository also contains higher-resolution versions, the DOM-TOM, and more.

[Grégoire David]: https://gregoiredavid.fr
[INSEE]: http://www.insee.fr/fr/methodes/nomenclatures/cog/telechargement.asp
[IGN]: https://geoservices.ign.fr/adminexpress
[github]: https://github.com/gregoiredavid/france-geojson/

In [4]:
depts = gpd.read_file('departements-version-simplifiee.geojson')

In [5]:
#depts[25:31]

Merge départements 2A and 2B into a single Corse entry with code 20, for convenience.

In [6]:
for i,line in depts.iterrows():
    if line.code == "2A":
        i2A = i
        geom2A = line.geometry
    elif line.code == "2B":
        i2B = i
        geom2B = line.geometry
depts = depts.drop([i2A,i2B])

new_row = {'code':'20', 'nom':'Corse', 'geometry':geom2A.union(geom2B)}
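# DataFrame.append was removed in pandas 2.0; concatenate a one-row GeoDataFrame instead
depts = pd.concat([depts, gpd.GeoDataFrame([new_row], crs=depts.crs)], ignore_index=True)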

In [7]:
#depts

In [8]:
names = depts.merge(names, how='right', left_on='code', right_on='dpt')

In [9]:
#names.sample(5)

In [10]:
grouped = names.groupby(['dpt', 'prenom', 'sexe'], as_index=False)['nombre'].sum() # Sum births over all years; only 'nombre' is summable
grouped = depts.merge(grouped, how='right', left_on='code', right_on='dpt') # Add geometry data back in
#grouped[34130:34563]

In [11]:
def absolute_count_chart(name):
    name = name.upper()
    subset = grouped[grouped.prenom == name]
    if len(subset) == 0:
        return "This name was not found anywhere."
    return alt.Chart(subset).mark_geoshape(stroke='white').encode(
        tooltip=['nom', 'code', 'nombre'],
        color='nombre',
    ).properties(width=800, height=600)
#absolute_count_chart('Thomas')

We count the total number of births across all years in each département, so that a name's popularity can be expressed as a share of births rather than as an absolute count.
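
As a sketch of that normalization, the plain-Python snippet below divides each name's count by the total births of its département. The counts are invented for illustration; the notebook performs the same steps with a `groupby` and a `merge`.

```python
from collections import defaultdict

# Invented (département, name) -> births counts, for illustration only.
counts = {
    ("29", "NOLWENN"): 30,
    ("29", "JEAN"): 70,
    ("75", "NOLWENN"): 40,
    ("75", "JEAN"): 760,
}

# Total births per département, summed over all names.
tot_births = defaultdict(int)
for (dpt, _prenom), nombre in counts.items():
    tot_births[dpt] += nombre

# Share of each name among the births of its département.
proportion = {key: nombre / tot_births[key[0]] for key, nombre in counts.items()}

print(proportion[("29", "NOLWENN")])  # 0.3
print(proportion[("75", "NOLWENN")])  # 0.05
```

In this toy data the name has more births in 75 than in 29 (40 against 30) but a much smaller share, which is the distinction the proportion is meant to capture.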

In [12]:
births = names.groupby('dpt', as_index=False)['nombre'].sum().rename(columns={'nombre': 'tot_births'}) # Total births per département
#births

In [13]:
grouped_with_totals = grouped.merge(births, how='right', on='dpt') # Add each département's total births
grouped_with_totals['proportion'] = grouped_with_totals.nombre / grouped_with_totals.tot_births
#grouped_with_totals

In [14]:
def proportion_chart(_name):
    name = _name.upper()
    subset = grouped_with_totals[grouped_with_totals.prenom == name]
    if len(subset) == 0:
        return "This name was not found anywhere."
    title = "Proportion of babies with the name "+_name+" from 1900 to 2010 in French départements"
    return alt.Chart(subset).mark_geoshape(stroke='white').encode(
        tooltip=['nom', 'code', 'nombre'],
        color=alt.Color('proportion', scale=alt.Scale(scheme="reds"))
    ).properties(width=800, height=600, title=title).configure_title(fontSize=18)

# Map 1: Proportion of a name out of all births by département

Replace the name in the cell below to see where that name is most popular.
Here are four suggestions that give completely different charts: Jean (popular everywhere except near Paris), Nicolas (only popular around Paris), Nolwenn (a Breton name, also given elsewhere), and Txomin (a Basque name that doesn't appear anywhere else).
                
                   vvvvvv

In [15]:
proportion_chart('Nolwenn')

In [16]:
grouped_france = names.groupby(['prenom', 'sexe'], as_index=False).sum()
data_france_male = grouped_france[grouped_france["sexe"] == 1]
data_france_female = grouped_france[grouped_france["sexe"] == 2]
top_france_male = data_france_male.nlargest(100, 'nombre').drop(columns=["sexe"])
top_france_female = data_france_female.nlargest(100, 'nombre').drop(columns=["sexe"])
top_france = [top_france_male, top_france_female]

In [17]:
#top_france_male

In [18]:
#top_france_female

In [19]:
codes = depts.code

In [20]:
top_par_departement = {code:[None,None] for code in codes} 
for code in codes:
    departement_data = grouped[grouped["code"] == code].drop(columns=["geometry", "code", "nom", "dpt"])
    top_par_departement[code][0] = departement_data[departement_data["sexe"]==1].nlargest(100, 'nombre').drop(columns=["sexe"])
    top_par_departement[code][1] = departement_data[departement_data["sexe"]==2].nlargest(100, 'nombre').drop(columns=["sexe"])
    codeint = int(code)
    top_par_departement[codeint] = top_par_departement[code]

In [21]:
#top_par_departement[20][0][:5]

In [22]:
import numpy as np
n = 20
def similarity_score_sex(top_A, top_B):
    max_th = 2*np.sum(np.array(top_A.nombre[:n])*np.array(top_B.nombre[:n]))
    score = 0
    for _ in range(2):
        for i,prenom in enumerate(np.array(top_A["prenom"][:n])):
            if prenom in np.array(top_B["prenom"]):
                j = np.array(top_B["prenom"]).tolist().index(prenom)
                assert np.array(top_B["prenom"])[j] == prenom
                score += np.array(top_A["nombre"])[i] * np.array(top_B["nombre"])[j]
        top_A,top_B = top_B,top_A
    ans = score/max_th
    return ans
def similarity_score_between(top_A, top_B):
    top_A_male, top_A_female = top_A
    top_B_male, top_B_female = top_B
    return (similarity_score_sex(top_A_male, top_B_male)+similarity_score_sex(top_A_female, top_B_female))/2

In [23]:
def similarity_two_dpts(dept_code_A, dept_code_B):
    return similarity_score_between(top_par_departement[dept_code_A], top_par_departement[dept_code_B])
def similarity_with_france(dept_code):
    return similarity_score_between(top_par_departement[dept_code], top_france)

In [24]:
#print(similarity_two_dpts("22", "29"))
#print(similarity_two_dpts("22", "75"))

In [25]:
depts_sim = depts.copy()
# One similarity value per département, compared with the national ranking
depts_sim['Similarity'] = [similarity_with_france(code) for code in depts_sim.code]

In [26]:
#depts_sim

In [27]:
def similarity_chart(code=None):
    try:
        assert 1 <= int(code) <= 95
        if isinstance(code, int):
            code = f"{code:02d}" # Zero-pad integer codes: 5 -> "05"
        for i,dataPt in depts_sim.iterrows():
            depts_sim.loc[i,'Similarity'] = similarity_two_dpts(dataPt.code, code)
            if dataPt.code == code:
                dptName = dataPt.nom
        title = "Similarity of local name popularity rankings with the ranking in "+dptName+" ("+code+")"
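    except Exception: # Anything that is not a valid code (e.g. "France" or None) falls back to the national ranking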
        for i,dataPt in depts_sim.iterrows():
            depts_sim.loc[i,'Similarity'] = similarity_with_france(dataPt.code)
        title = "Similarity of local name popularity rankings with the national ranking"
            
    return alt.Chart(depts_sim).mark_geoshape(stroke='white').encode(
        tooltip=['nom', 'code', alt.Tooltip('Similarity', title='similarity')],
        color=alt.Color('Similarity', scale=alt.Scale(domain=(0,1), scheme="darkmulti"))
    ).properties(width=800, height=600, title=title).configure_title(fontSize=18).configure(background='#D9E9F0')
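
The chart function accepts the département code either as an integer or as a two-character string, and treats anything else as a request for the national comparison. A small sketch of that input handling in isolation, using a hypothetical helper `normalize_code` that is not part of the notebook:

```python
def normalize_code(code):
    """Return a two-character metropolitan département code, or None if `code` is not one.

    Hypothetical helper, not part of the notebook: it isolates the input
    handling that similarity_chart performs inline.
    """
    try:
        number = int(code)
    except (TypeError, ValueError):
        return None  # e.g. "France" or None: use the national comparison instead
    if not 1 <= number <= 95:
        return None
    return f"{number:02d}"  # zero-pad: 5 -> "05"

print(normalize_code(5))         # 05
print(normalize_code("75"))      # 75
print(normalize_code("France"))  # None
```

One difference from the notebook's version: this sketch also pads string input such as `"5"`, whereas the notebook only pads integers.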

# Map 2: Similarity between local and national name leaderboards

A similarity of 1 means that the two compared zones have the same top 20 names in the same order, for both boys and girls. A similarity of 0 means that none of the top 20 names of one zone appears in the other zone's top 100.
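
A plain-Python sketch of the score for one sex, on invented counts, may make the two extremes concrete. It mirrors the logic of `similarity_score_sex` but takes lists of `(name, count)` pairs instead of data frames; the toy rankings are assumptions for illustration.

```python
def similarity(top_a, top_b, n=20):
    """Plain-Python sketch of the notebook's score for one sex.

    top_a and top_b are lists of (name, count) pairs sorted by decreasing count.
    """
    # Reference value: both zones have the same names at the same ranks.
    max_score = 2 * sum(a[1] * b[1] for a, b in zip(top_a[:n], top_b[:n]))
    score = 0
    for first, second in ((top_a, top_b), (top_b, top_a)):
        counts_second = dict(second)  # look each name up in the other zone's full list
        for name, count in first[:n]:
            score += count * counts_second.get(name, 0)
    return score / max_score

a = [("JEAN", 100), ("PIERRE", 50)]
b = [("JEAN", 80), ("PIERRE", 40)]        # same names, same order
c = [("NOLWENN", 60), ("YANN", 30)]       # no name in common
swapped = [("PIERRE", 80), ("JEAN", 40)]  # same names, different order

print(similarity(a, b))        # 1.0
print(similarity(a, c))        # 0.0
print(similarity(a, swapped))  # 0.8
```

The last case shows that the score is weighted by counts: two zones with the same names ranked differently score below 1.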

Run *similarity_chart("XY")* to compare every département with département XY (a code between 01 and 95), or *similarity_chart("France")* to compare each local name popularity ranking with the national ranking.

                  vv

In [28]:
similarity_chart("75")
# suggestions: France, 20 (Corse), 22 (in Brittany), 63 (rural), 75 (Paris), 93 (near Paris)

Oddly enough, the départements that stand out are those with codes 91, 92, 93, 94 and 95.

In [29]:
#top_par_departement["33"][0],top_france[0]

In [30]:
#top_par_departement["94"][0],top_france[0]

# Discussion

## Map 1

### Strengths

Querying popular names (Jean, Nicolas...) clearly shows the regional effects in the data: in the most common case, a name has a constant popularity outside of Paris and a different popularity (higher for Nicolas, lower for Jean) in the greater Paris region.

It's also possible to explore the data by entering rarer names. These don't necessarily answer the third question ("Are popular names generally popular across the whole country?"), but they occasionally produce curious results that the user may find interesting.

### Weaknesses

You have to enter several names to get an overall idea of the regional differences. If you're unlucky, you might only query names whose popularity is constant across the country, which goes against the conclusion that this visualization tries to convey.


## Map 2

### Strengths

There is an immediately noticeable pattern with two types of départements: those that closely follow the national trend, and those that don't follow it *at all*. The latter category is easily spotted: départements 91, 92, 93, 94 and 95.

### Weaknesses

The initial design plan was to include a richer tooltip that would display the top 5 names of the département, for men and women, when you hover over it. Without this, we don't really know which names are making départements 91 to 95 so different. We also don't know what the most popular names shared by départements 01 to 90 are.
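
One possible way to prepare such a tooltip, sketched here under assumptions rather than taken from the notebook, is to build one text label per département from its ranked names and expose it as an extra tooltip field. The helper `top_names_label` and the toy rankings below are hypothetical.

```python
def top_names_label(ranked_names, k=5):
    """Join the first k names of a ranking into one tooltip-friendly string.

    Hypothetical helper: ranked_names is a list of names sorted by
    decreasing popularity.
    """
    return ", ".join(name.title() for name in ranked_names[:k])

# Invented rankings keyed by département code (illustration only).
toy_rankings = {
    "29": ["JEAN", "YVES", "PIERRE", "FRANCOIS", "LOUIS", "MICHEL"],
    "75": ["JEAN", "PIERRE", "MICHEL"],
}
labels = {code: top_names_label(names) for code, names in toy_rankings.items()}
print(labels["29"])  # Jean, Yves, Pierre, Francois, Louis
print(labels["75"])  # Jean, Pierre, Michel
```

In the notebook, the ranked names would come from the `prenom` column of `top_par_departement[code][0]` (male) and `top_par_departement[code][1]` (female); the resulting labels could then be stored as a column of `depts_sim` and listed in the chart's `tooltip`.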
