# Exploring the French birth rates dataset

This notebook explores the births recorded in France in 2019 and provides some insights into them.

The dataset is provided by INSEE and can be found here: https://www.insee.fr/fr/statistiques/4768335?sommaire=4768339

In [None]:
!pip install --upgrade --user pip pandas numpy geopandas plotly jupyter-dash
import os
import numpy as np
import pandas as pd
import geopandas as gpd
import plotly.express as px

from dash import dcc, html, Input, Output
from jupyter_dash import JupyterDash
from calendar import monthrange

## Motivation for using plotly and Dash

I decided to use plotly as I'm already familiar with it, which in turn will help me get better results, faster.

Using it in tandem with Dash allows us to use the same logic code for different views (that is, here in this notebook, and in the fully fledged dashboard).


## Loading the data

First, we will load the data: the birth rates in France in 2019.

In [None]:
df = pd.read_csv("./FD_NAIS_2019.csv", delimiter=";")


# A bit of cleaning, useful later.

def clean_dep(dep) -> str:
    dep = str(dep)
    if len(dep) == 1:
        return f"0{dep}"
    return dep


df["DEPNAIS"] = df["DEPNAIS"].replace({
    dep: clean_dep(dep)
    for dep in df["DEPNAIS"].unique()
})
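
As a quick sanity check of the padding logic, here is a minimal sketch on a few hypothetical raw values (the sample list is made up for illustration; the real column may mix integers and strings such as `2A`). It compares `clean_dep` with Python's built-in `str.zfill`, which should give the same result for these codes:

```python
def clean_dep(dep) -> str:
    """Left-pad single-character department codes with a zero."""
    dep = str(dep)
    return f"0{dep}" if len(dep) == 1 else dep


# Hypothetical sample of raw values as they might come out of the CSV.
raw_codes = [1, "9", 75, "2A", 971]

padded = [clean_dep(code) for code in raw_codes]
zfilled = [str(code).zfill(2) for code in raw_codes]

print(padded)
print(padded == zfilled)
```

If both approaches agree on the real column too, `zfill` could be a shorter alternative, but that is worth verifying on the actual data first.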

# Visualization

## Number of births per department

An interesting thing we can do, given we have information about the birth location, is plot the births depending on the department (the most granular info we have).

We will use the France GeoJSON info, which can be found at https://france-geojson.gregoiredavid.fr

In [None]:
gdf = gpd.read_file("https://france-geojson.gregoiredavid.fr/repo/departements.geojson")
gdf = gdf.set_index("code", drop=True)

Let's plot the birth rates per department.

We will have two views: one with the absolute numbers, which will highlight which departments have the highest rates, and one with a logarithmic scale, which will help us see which departments have the lowest rates.
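
To illustrate why a logarithmic view helps with the low end of the range, here is a small sketch using made-up counts (the department labels and numbers are hypothetical). On a linear scale the largest value dominates, while on a log scale each tenfold increase takes the same amount of space:

```python
import math

# Hypothetical birth counts, chosen only to span several orders of magnitude.
births = {"A": 500, "B": 5_000, "C": 50_000}

log_births = {dep: math.log(n) for dep, n in births.items()}

abs_spread = max(births.values()) - min(births.values())
log_spread = max(log_births.values()) - min(log_births.values())

print(abs_spread)            # gap between the extremes on the linear scale
print(round(log_spread, 2))  # the same gap on the log scale

# Each tenfold step covers the same distance on the log scale.
step_ab = log_births["B"] - log_births["A"]
step_bc = log_births["C"] - log_births["B"]
print(math.isclose(step_ab, step_bc))
```

Note that the logarithm of 0 is undefined, so a department with no recorded births would need special handling before applying the transform.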

In [None]:
df_choropleth = pd.DataFrame(
    [
        (code, df[df["DEPNAIS"] == clean_dep(code)].shape[0])
        for code in gdf.index
    ],
    columns=["Department", "Births (abs)"],
)
df_choropleth["Births (log)"] = np.log(df_choropleth["Births (abs)"])


app = JupyterDash(__name__)


app.layout = html.Div([
    dcc.Graph(id="choropleth-graph"),
    html.P("Graph scale:"),
    dcc.RadioItems(
        ['Absolute', 'Logarithmic'],
        'Absolute',
        id='choropleth-graph-scale'
    ),
])


@app.callback(
    Output("choropleth-graph", "figure"),
    Input("choropleth-graph-scale", "value"),
)
def update_map(scale: str):
    birth_var = {
        "Absolute": "Births (abs)",
        "Logarithmic": "Births (log)"
    }
    fig = px.choropleth_mapbox(
        df_choropleth,
        geojson=gdf.geometry,
        locations='Department',
        color=birth_var[scale],
        color_continuous_scale="Viridis",
        mapbox_style="carto-positron",
        zoom=4, center = {"lat": 46.2447976, "lon": 4.1123575},
        opacity=0.5,
        title="Births per department",
    )
    fig.update_layout(margin={"r": 0,"t": 0,"l": 0,"b": 0})
    return fig


app.run_server(mode="inline", port=8051)

We observe a very large disparity in the number of births across the territory.

Quite predictably, the departments with the highest rates are those containing the largest cities in the country.

The "diagonale du vide" is not very noticeable, except for a few departments, such as Creuse (23) and Lozère (48).

## Monthly distribution

On another topic, I'd be interested to know if there are times in the year when there are more births.

Intuitively, a few things come to mind:
- The births should be pretty constant throughout the year
- There might be spikes about 9 months after a festive event
  - The only special one I can think of is France winning the FIFA World Cup (July 2018), so perhaps there's a spike in April 2019.
  - Other ones include Valentine's day (expected: November 2019) and Christmas (expected: September 2019).

Let's check that!

Two things spring to mind:
- some months, like February, have fewer days, thus we should normalize them
- I'm also interested in mapping the birth month to the procreation month
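
Both ideas can be sketched with the standard library alone. The helper names below (`normalize_to_30_days`, `conception_month`) and the sample count are hypothetical, and the fixed nine-month shift is only a rough approximation of a pregnancy's length:

```python
from calendar import month_name, monthrange


def normalize_to_30_days(births: int, year: int, month: int) -> int:
    """Rescale a monthly count to what a 30-day month would have seen."""
    days_in_month = monthrange(year, month)[1]
    return round(births / days_in_month * 30)


def conception_month(year: int, month: int, gestation_months: int = 9) -> str:
    """Shift a birth month back by a fixed number of months (rough estimate)."""
    index = year * 12 + (month - 1) - gestation_months
    return f"{month_name[index % 12 + 1]} {index // 12}"


# Hypothetical count, for illustration only: February 2019 has 28 days.
print(normalize_to_30_days(56_000, 2019, 2))
print(conception_month(2019, 1))
print(conception_month(2019, 12))
```

Counting months as a single running index (`year * 12 + month - 1`) avoids special-casing the wrap-around from January back into the previous year.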

In [None]:
months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
df_barplot = pd.DataFrame(
    [
        (month, months[month - 1], df[df["MNAIS"] == month].shape[0])
        for month in df["MNAIS"].unique()
    ],
    columns=["Month_ind", "Month (birth)", "Births (abs)"],
).sort_values("Month_ind")
df_barplot["Births (norm)"] = df_barplot.apply(
    lambda ser: round((ser["Births (abs)"] / monthrange(2019, ser["Month_ind"])[1] * 30)),
    axis=1,
)
df_barplot["Month (procreation)"] = df_barplot["Month (birth)"].replace({
    "January": "April 2018",
    "February": "May 2018",
    "March": "June 2018",
    "April": "July 2018",
    "May": "August 2018",
    "June": "September 2018",
    "July": "October 2018",
    "August": "November 2018",
    "September": "December 2018",
    "October": "January 2019",
    "November": "February 2019",
    "December": "March 2019",
})
df_barplot["Number"] = df_barplot["Births (norm)"]  # alias

app = JupyterDash(__name__)


app.layout = html.Div([
    html.H4('Interactive bar plot of births per month'),
    dcc.Graph(id="barplot-graph"),
    html.P("Births:"),
    dcc.RadioItems(
        ['Absolute', 'Normalized'],
        'Absolute',
        id='barplot-graph-bars'
    ),
    html.P("Months:"),
    dcc.RadioItems(
        ['Births', 'Procreations'],
        'Births',
        id='barplot-graph-type'
    ),
])


@app.callback(
    Output("barplot-graph", "figure"),
    Input("barplot-graph-bars", "value"),
    Input("barplot-graph-type", "value"),
)
def update_barplot(bars: str, data_type: str):
    birth_var = {
        "Absolute": "Births (abs)",
        "Normalized": "Births (norm)",
    }
    type_var = {
        "Births": "Month (birth)",
        "Procreations": "Month (procreation)",
    }
    fig = px.bar(
        df_barplot,
        x=type_var[data_type], y=birth_var[bars],
        title=f"{data_type} per month",
    )
    return fig


app.run_server(mode="inline", port=8052)

There are small fluctuations, but no single month stands out.

We can nonetheless notice a smooth curve once the data is normalized.

So, from this graph, we can revisit the intuitions I had earlier:
- It looks like the result of the FIFA World Cup didn't have any noticeable impact in the short term
- Neither Christmas nor Valentine's day seems to have a significant effect
- There does seem to be a tendency for winter to be the most active period of the year

## Parents' ages

Next thing I'm interested in: how couples are distributed across ages.

To know that, I'm going to plot the mother's and father's age on a heatmap, parametrized by the number of births.
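
One possible way to build such a count matrix is `pd.crosstab`, shown here on a tiny hypothetical sample (the ages below are made up; only the column names come from the dataset). This is a sketch of the idea rather than the exact code used in this notebook:

```python
import pandas as pd

# Hypothetical toy records: one row per birth, with both parents' ages.
toy = pd.DataFrame({
    "AGEXACTM": [30, 30, 31, 28, 30],
    "AGEXACTP": [31, 31, 33, 28, 32],
})

# Rows: mother's age, in descending order. Columns: father's age.
# Cells: number of births, 0 when a pair of ages never occurs.
counts = pd.crosstab(toy["AGEXACTM"], toy["AGEXACTP"]).sort_index(ascending=False)
print(counts)
```

Each cell can then be read directly as the colour intensity of the heatmap.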

In [None]:
# Count births for every (mother's age, father's age) pair.
# crosstab fills missing pairs with 0 and avoids chained assignment,
# which may silently fail to write into the DataFrame.
df_heatmapage = pd.crosstab(df["AGEXACTM"], df["AGEXACTP"]).sort_index(ascending=False)

fig = px.imshow(df_heatmapage,
                labels=dict(x="Father's age", y="Mother's age", color="Births"),
                x=df_heatmapage.columns, y=df_heatmapage.index,
                title="Comparison of parents' ages and the number of births")
fig.update_xaxes(side="top")
fig.show()

Unsurprisingly, the diagonal stands out: people tend to have relationships (and thus kids) with people their own age. The peak is around age 30, the most frequent combination being a 31-year-old father and a 30-year-old mother.

I notice an offset diagonal between father=22 & mother=17 and father=28 & mother=23 that I'm not sure how to explain, but it looks like it then fades into the larger cluster further on.

A quirk of the data is noticeable at the right edge of the heatmap, on the father's axis: the dataset stops at age 46, aggregating older folks at the 46 mark. As men are still somewhat fertile past this age, it's still possible for them to have kids; this doesn't hold true for women.

Admittedly, these are quite unsurprising observations and conclusions, but at least we have confirmed our intuition with actual numbers.

Something I notice and want to dig into: the diffusion in the top-right part of the diagonal seems to imply that older men tend to have children with younger women, the contrary being way less common.

For the next figure, I'd like to plot the age difference compared to the partner for both men and women depending on their age. Let's do that.
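
The quantity to plot is, for each age, that age minus the mean age of the other parent. A minimal sketch with made-up pairs (the data below is hypothetical and only shows the mother's side):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (mother_age, father_age) pairs, for illustration only.
pairs = [(25, 28), (25, 30), (30, 31), (30, 29), (40, 38)]

partner_ages = defaultdict(list)
for mother_age, father_age in pairs:
    partner_ages[mother_age].append(father_age)

# Negative value: mothers of that age are younger than their partner on average.
mother_diff = {
    age: age - mean(fathers)
    for age, fathers in sorted(partner_ages.items())
}
print(mother_diff)
```

The father's curve follows the same logic with the roles of the two ages swapped.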

In [None]:
data = []
data.extend([
    (age, age - df[df["AGEXACTM"] == age]["AGEXACTP"].mean(), "Mother")
    for age in sorted(df["AGEXACTM"].unique())
])
data.extend([
    (age, age - df[df["AGEXACTP"] == age]["AGEXACTM"].mean(), "Father")
    for age in sorted(df["AGEXACTP"].unique())
])
data.extend([
    (age, 0, "Equal")
    for age in sorted(df["AGEXACTP"].unique())
])

df_agediff = pd.DataFrame(
    data,
    columns=("Age", "Mean age difference compared to the other parent", "Parent"),
)

fig = px.line(df_agediff, x='Age', y="Mean age difference compared to the other parent",
              color='Parent', title="Mean age difference of a parent compared to the other")
fig.show()

Wow... that's way more than I imagined!

This graph shows us that on average:
- Women tend to have children with older partners until age 42
- Conversely, men tend to have children with younger partners from age 22 onwards

New fathers aged 46+ are on average 10 years older than their partner! (it might be even more because the data is aggregated)

This can also, to some extent, be explained by the difference in fertility between older men and women.
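
To see why aggregating ages at the 46 mark can understate the gap, here is a small worked example with hypothetical true ages. Capping the father's age shrinks every difference where he is actually older than 46:

```python
from statistics import mean

TOP_CODE = 46  # ages above this are recorded as 46 in the dataset

# Hypothetical true (father, mother) ages for older fathers.
true_pairs = [(46, 38), (50, 39), (58, 40)]

true_gap = mean([father - mother for father, mother in true_pairs])
recorded_gap = mean([
    min(father, TOP_CODE) - min(mother, TOP_CODE)
    for father, mother in true_pairs
])

print(round(true_gap, 2))
print(recorded_gap)
print(recorded_gap < true_gap)
```

With these made-up numbers the recorded gap is noticeably smaller than the true one, which is the direction of bias suggested above; the actual size of the effect cannot be recovered from the aggregated data.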

## Last name inheritance

I'd be interested in knowing what last name the children usually inherit.

To me, one factor that might influence this stat is the cultural origin. We can try using the nationality indicator to see if that makes any difference.

In [None]:
ind_to_origine_nom = {
    1: "Father",
    2: "Mother",
    3: "Father - Mother",
    4: "Mother - Father",
    5: "Other",
}

# Vectorized lookup; codes missing from the mapping fall back to "Father".
df["Last name choice"] = df["ORIGINOM"].map(ind_to_origine_nom).fillna("Father")
ind_to_nat = {
    1: "French",
    2: "Foreign",
}

groups = df.groupby(["INDNATP", "INDNATM", "Last name choice"]).count()["ANAIS"].reset_index()

fig = px.sunburst(
    pd.DataFrame(
        [
            (ser["ANAIS"], f"{ind_to_nat[ser['INDNATP']]}/{ind_to_nat[ser['INDNATM']]}", ser["Last name choice"])
            for i, ser in groups.iterrows()
        ],
        columns=("Births", "Parents nationality (father/mother)", "Last name choice"),
    ),
    values='Births', 
    path=['Parents nationality (father/mother)', 'Last name choice'],
    title="Representation of children born from French and/or foreign parents, and last name choice",
)
fig.show()

It doesn't look like nationality plays an important role in the choice of last name.

Maybe the parents' ages can impact this result.

My intuition is that older couples would tend to be more conservative, thus using the father's name, while younger generations might be more relaxed about this subject.

In [None]:
data = []
for age in sorted(df["AGEXACTP"].unique()):
    sub_df = df[(df["AGEXACTP"] == age) | (df["AGEXACTM"] == age)]
    for name_choice in df["Last name choice"].unique():
        data.append((
            age, name_choice,
            ((sub_df["Last name choice"] == name_choice).sum() / sub_df.shape[0]) * 100
        ))

fig = px.line(
    pd.DataFrame(
        data,
        columns=("Age", "Last name choice", "Usage (%)")
    ),
    range_y=(0, 100),
    x="Age", y="Usage (%)", color="Last name choice",
    title="Last name choice depending on parents' age",
)
fig.show()

That is indeed what we observe: younger folks (< 25 years old) are much more likely to use the mother's name for their child.

Alternative combinations (Father - Mother / Mother - Father) are quite uncommon regardless of the couple's age.

Next up: the recognition of the child by its parents.
My intuition is that the graph will show a growing curve:
- Young parents might often not recognize the child
- Past age 25, parents are more responsible, and often married, thus recognizing the child more often
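
The rate to plot is, for each age, the share of births where the parent either recognized the child or was married. A minimal sketch on hypothetical rows, assuming (as the filter used below does) that a value of 0 means "no recognition year" or "no marriage year":

```python
# Hypothetical rows: (father_age, ARECP, AMAR), for illustration only.
rows = [
    (22, 0, 0),
    (22, 2019, 0),
    (30, 0, 2015),
    (30, 2019, 0),
    (30, 0, 0),
    (30, 2018, 2016),
]


def recognition_rate(rows, age):
    """Share (in %) of births at a given age with a recognition or a marriage."""
    at_age = [(rec, mar) for a, rec, mar in rows if a == age]
    recognized = [1 for rec, mar in at_age if rec != 0 or mar != 0]
    return len(recognized) / len(at_age) * 100


print(recognition_rate(rows, 22))
print(recognition_rate(rows, 30))
```

The meaning of the 0 code is an assumption here and is worth checking against the dataset's documentation.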

In [None]:
data = []
for age in sorted(df["AGEXACTP"].unique()):
    data.append((
        "Mother", age,
        (((df["AGEXACTM"] == age) & ((df["ARECM"] != 0) | (df["AMAR"] != 0))).sum() / (df["AGEXACTM"] == age).sum()) * 100
    ))
    data.append((
        "Father", age,
        (((df["AGEXACTP"] == age) & ((df["ARECP"] != 0) | (df["AMAR"] != 0))).sum() / (df["AGEXACTP"] == age).sum()) * 100
    ))

fig = px.line(
    pd.DataFrame(
        data,
        columns=("Parent", "Age", "Recognition rate (%)")
    ),
    range_y=(0, 100),
    x="Age", y="Recognition rate (%)", color="Parent",
    title="Recognition rate per parent depending on age",
)
fig.show()

My intuition was partly correct, but it seems very young parents (< 20 years old) recognize the child more often.

There's a dip around age 23 for both, and the peak for both stands at 45+ years old.

Last thing I'm interested in from the data at hand: what is usually the number of children born at a time?

In [None]:
fig = px.pie(
    df.groupby("NBENF").count().reset_index(),
    values="ANAIS", names="NBENF",
    title="Number of children per birth",
)
fig.show()

Interesting! It turns out having twins is pretty uncommon (~ 3%), which I would've assumed to be much higher (~ 8-10%) since I'm biased by my family.

And it looks like triplets are very scarce!
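
One caveat when reading these shares: if each row of the file describes one child (an assumption worth checking against the INSEE documentation), then the pie chart shows the share of children born as twins, not the share of deliveries that produce twins. A rough sketch of the conversion, using a hypothetical helper and made-up counts:

```python
def share_of_deliveries(children_by_nbenf: dict) -> dict:
    """Convert counts of children (keyed by NBENF) into shares of deliveries (%)."""
    # A delivery of n children contributes n rows, so divide each count by n.
    deliveries = {n: count / n for n, count in children_by_nbenf.items()}
    total = sum(deliveries.values())
    return {n: d / total * 100 for n, d in deliveries.items()}


# Hypothetical counts of children, for illustration only.
children = {1: 9_700, 2: 300}
shares = share_of_deliveries(children)

print(round(shares[1], 2))
print(round(shares[2], 2))
```

Under that assumption, the share of twin deliveries would be roughly half the share of twin children.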

# Conclusion

From this data, it seems that, as of 2019:
- the most common age for having children is around 30 years old for both parents
- there is statistically no better or worse month to procreate and have a child
- younger parents use the mother's last name more often, while combinations of the two names remain uncommon

# Future work

## Data sources to cross this data with

It might be interesting to cross this data with other datasets, such as

- Birth rates for other years
- Census data about the French population
- Data that could help explain why some departments are more "fertile" than others
  - Jobs data
  - Public infrastructures data