# This notebook presents our final implementation, in two parts.

### I- Final implementation of the visualizations

### II- Our discussion of the strengths and weaknesses of all the other visualizations

In [55]:
import altair as alt
import pandas as pd

import geopandas as gpd # Requires geopandas -- e.g.: conda install -c conda-forge geopandas
alt.data_transformers.enable('json') # Let Altair/Vega-Lite work with large data sets

DataTransformerRegistry.enable('json')

# I- Final implementation

## 1) Visualization 1

Reminder:

**How do baby names evolve over time?**

**Are there names that have consistently remained popular or unpopular?**

**Are there some that were suddenly or briefly popular or unpopular?**

**Are there trends in time?**

In [56]:
import altair as alt
import pandas as pd

# Load the data
names = pd.read_csv("dpt2020.csv", sep=";")

names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
names.drop(names[names.dpt == 'XX'].index, inplace=True)

# Keep the 10 largest rows for each year. Each row is a per-department count,
# so this ranks department-level counts rather than national totals.
top_10_names = names.groupby('annais').apply(lambda x: x.nlargest(10, 'nombre')).reset_index(drop=True)

# Create the stacked bar chart
chart = alt.Chart(top_10_names).mark_bar().encode(
    x='annais:O',
    y='sum(nombre):Q',
    color='preusuel:N',
    tooltip=['preusuel:N', 'annais:O', 'nombre:Q']
).properties(
    width=900,
    height=600,
    title='Top 10 Most Popular Names Over the Years'
)

chart


**How do baby names evolve over time?**

This visualization offers valuable insights into the evolution of popular baby names over time. With the interactive features, it becomes easy to observe the changes in certain names. For instance, we can clearly see the decline of the name "Gerard," which was popular between 1947 and 1952. Nowadays, it is commonly associated with an older generation. By combining interaction and visualization, we effectively address the question and provide a compelling way to understand the trends in name popularity.

**Are there names that have consistently remained popular or unpopular?**

There are indeed certain names that maintain their popularity consistently over time, such as Marie and Jean. One could hypothesize that these names have a connection to the historical influence of the Catholic Church in France, considering that Marie is associated with the mother of Christ and Jean is linked to a significant figure who supported Mary. However, it is important to note that this hypothesis may or may not be accurate. Other factors, such as cultural traditions or personal preferences, could also contribute to the enduring popularity of these classic names.


**Are there some that were suddenly or briefly popular or unpopular?**

Indeed, it is intriguing to observe the sudden rise and subsequent disappearance of the name Maurice in 1917, shown by the small green rectangle that appears and then disappears in the visualization.

The specific reasons behind this phenomenon remain uncertain, and it is difficult to establish a direct link between the name Maurice and the context of World War I in France. Various factors could be at play, including cultural influences, popular figures or characters, or even random fluctuations in naming trends.

On a lighter note, Ross's monkey in the TV show Friends is named Marcel, not Maurice! :)

**Are there trends in time?**

Yes: trends are clearly visible in this chart, such as the decline of "Gerard" and the lasting popularity of Marie and Jean described above.

**Stacked bar chart**

We created here a stacked bar chart using Altair to display the top 10 most popular names over the years. It encodes the x-axis with the annais field as an ordinal scale, the y-axis with the sum of nombre field as a quantitative scale, and the color of the bars with the preusuel field. Additionally, it includes a tooltip that shows the name, year, and count for each bar.
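
Note that `nlargest(10, 'nombre')` in the cell above operates on the raw rows, which are per-department counts. If national totals are wanted instead, one option is to sum over departments first. The following is only a minimal sketch: the toy rows are invented for illustration (laid out like `dpt2020.csv`), and `top_n_per_year` is a hypothetical helper that is not part of the notebook.

```python
import pandas as pd

# Toy rows with the same columns as dpt2020.csv (all counts are made up).
toy = pd.DataFrame({
    "sexe":     [1, 1, 2, 2, 1, 1],
    "preusuel": ["JEAN", "JEAN", "MARIE", "MARIE", "PAUL", "JEAN"],
    "annais":   ["1950", "1950", "1950", "1950", "1950", "1951"],
    "dpt":      ["75", "69", "75", "69", "75", "75"],
    "nombre":   [30, 20, 25, 40, 10, 35],
})

def top_n_per_year(df, n=10):
    """Sum counts over departments, then keep the n most given names per year."""
    national = df.groupby(["annais", "preusuel"], as_index=False)["nombre"].sum()
    ranked = national.sort_values(["annais", "nombre"], ascending=[True, False])
    return ranked.groupby("annais").head(n).reset_index(drop=True)

top2 = top_n_per_year(toy, n=2)
print(top2)
```

The resulting table has one row per (year, name), so it could be passed to the same `alt.Chart(...).mark_bar()` call with `y='nombre:Q'`.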

The strengths of using a stacked bar chart to display the top names for each year in a bar chart format include:

**Comparison**: A stacked bar chart allows for easy visual comparison between names within each year. We can quickly identify the most popular and least popular names by comparing the heights of the bars.

**Trend Analysis**: By observing the changes in the distribution of the stacked bars over the years, we can identify trends in name popularity. For example, we can see if certain names consistently remain popular or if there are fluctuations in their popularity.

**Total Count**: The stacked bars also provide information on the total count of names in a given year. By looking at the overall height of the bars, we can understand the total number of occurrences of names and compare it across different years.

**Name Contributions**: The stacked nature of the bars allows us to see the contribution of each name to the total count. This helps in identifying the relative popularity of different names within a year.

However, there are also some potential weaknesses to consider:

**Visual Clutter**: There are many names and the dataset spans a large number of years, so the stacked bar chart can become visually cluttered and challenging to interpret. This makes it difficult to distinguish individual names and observe trends clearly.

**Lack of Granularity**: A stacked bar chart provides an overview of name popularity trends but may not offer detailed insights into specific names or their variations (e.g., spelling variations).

**Data Size Limitations**: We encounter limitations in terms of the number of names or years that can be effectively displayed in a single chart.

## 2) Visualization 2

Reminder:

**Is there a regional effect in the data?**

**Are some names more popular in some regions?**

**Are popular names generally popular across the whole country?**

In [57]:
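# NOTE: `top_names_region_grouped` is not defined anywhere in this notebook. The chart
# expects one row per (name, department) with columns `preusuel`, `nombre`, `nom`, `code`.
scatter_chart = alt.Chart(top_names_region_grouped).mark_circle().encode(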
    x=alt.X('preusuel:N', title='Prénom'),
    y=alt.Y('nombre:Q', title='Nombre de bébés'),
    color=alt.Color('nom:N'),
    tooltip=['preusuel', 'code', 'nombre']
).properties(
    width=800,
    height=400,
    title='Répartition des prénoms populaires par région (Nuage de points)'
)

scatter_chart.interactive()
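
The scatter plot reads from `top_names_region_grouped`, which this notebook never builds. As a rough sketch under assumptions, it could be derived from a table shaped like `grouped` (built in Part II) by keeping the most common names and summing their counts per department. The toy values and the helper `top_names_by_department` are invented for illustration and may differ from what was actually used.

```python
import pandas as pd

# Toy stand-in for the `grouped` table (all counts are made up).
grouped_toy = pd.DataFrame({
    "code":     ["75", "75", "69", "69", "13", "13"],
    "nom":      ["Paris", "Paris", "Rhône", "Rhône",
                 "Bouches-du-Rhône", "Bouches-du-Rhône"],
    "preusuel": ["JEAN", "MARIE", "JEAN", "MARIE", "JEAN", "LOUIS"],
    "sexe":     [1, 2, 1, 2, 1, 1],
    "nombre":   [500, 650, 300, 320, 280, 90],
})

def top_names_by_department(df, n_names=2):
    """Keep the n_names most given names overall, with counts summed per department."""
    totals = df.groupby("preusuel")["nombre"].sum()
    keep = totals.nlargest(n_names).index
    subset = df[df["preusuel"].isin(keep)]
    return subset.groupby(["code", "nom", "preusuel"], as_index=False)["nombre"].sum()

top_names_region_grouped = top_names_by_department(grouped_toy, n_names=2)
print(top_names_region_grouped)
```

The output has the four columns the chart encodes (`preusuel`, `nombre`, `nom`, `code`), with one point per name and department.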

**Is there a regional effect in the data?**

The data visualization offers valuable insights into the regional effect on naming preferences. It demonstrates that certain names enjoy high popularity, but their popularity might not extend uniformly across the entire country. This suggests that naming trends exhibit significant regional variations.

**Are some names more popular in some regions?**

The visualization provides evidence of a regional effect in the popularity of names. While some names achieve popularity nationwide, others experience varying degrees of popularity depending on the region. This observation highlights the unique cultural and social dynamics present in different regions of France, contributing to the preference and prominence of certain names in specific areas. Consequently, it can be concluded that popular names are not necessarily universally popular across the entire country, indicating the existence of regional variations in naming preferences.

**Are popular names generally popular across the whole country?**

Yes and no: many names appear with only a small number of points. But it is hard to ignore the fact that Jean and Marie crush the competition, and that has nothing to do with the fact that my name is Jean, of course.

**Strengths:**

The visualization enables a quick assessment of name trends and their evolution over time. By focusing on a specific name, it facilitates the observation of popularity shifts and patterns associated with that name. The visualization sheds light on regional differences in naming preferences, underscoring the cultural richness and diversity within France.

**Weaknesses:**

The choice of the name under analysis can significantly impact the results, warranting careful selection. Aesthetically, the visualization could be further refined for enhanced visual appeal. While the visualization captures temporal trends, it might not provide a comprehensive understanding of the multifaceted factors influencing naming patterns. Additionally, the data size limitations of the visualization should be acknowledged. Importantly, the visualization does not provide a definitive answer to whether popular names are generally popular nationwide, as the regional effect demonstrates variations in popularity across different parts of the country.

## 3) Visualization 3

Reminder:

**Are there gender effects in the data?**

**Does popularity of names given to both sexes evolve consistently?**



In [59]:
gender_counts = names.groupby(['annais', 'sexe'], as_index=False)['nombre'].sum()

chart = alt.Chart(gender_counts).mark_line().encode(
    x='annais',
    y='nombre',
    color='sexe:N',
    tooltip=['annais', 'nombre']
).properties(
    width=800,
    height=400,
    title="Évolution de la popularité des prénoms par sexe"
)

chart

In [58]:
name = 'CAMILLE'

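subset = names[names.preusuel == name]

# The rows are per department, so the y encoding sums the counts for each year and sex
chart = alt.Chart(subset).mark_line().encode(
    x='annais:O',
    y='sum(nombre):Q',
    color='sexe:N',
    tooltip=['annais:O', 'sexe:N', 'sum(nombre):Q']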
).properties(
    width=800,
    height=400,
    title=f"Évolution du nombre de bébés prénommés '{name}'"
)

chart

**Are there gender effects in the data?**

**Does popularity of names given to both sexes evolve consistently?**
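
One way to approach these two questions numerically is to list the names recorded for both sexes and track, for each of them, the share of female births per year. The sketch below does this on invented toy rows. It assumes the same coding as the notebook (`sexe == 1` for male, `sexe == 2` for female), and `female_share_by_year` is a hypothetical helper.

```python
import pandas as pd

# Toy rows (all counts are made up for illustration).
toy = pd.DataFrame({
    "sexe":     [1, 2, 1, 2, 1, 2],
    "preusuel": ["CAMILLE", "CAMILLE", "CAMILLE", "CAMILLE", "JEAN", "MARIE"],
    "annais":   ["1950", "1950", "2000", "2000", "1950", "1950"],
    "nombre":   [30, 10, 20, 80, 100, 120],
})

# Names that appear with both sex codes.
sexes_per_name = toy.groupby("preusuel")["sexe"].nunique()
names_both = sexes_per_name[sexes_per_name == 2].index.tolist()

def female_share_by_year(df, name):
    """Per year, the fraction of babies given `name` that are recorded as female."""
    rows = df[df["preusuel"] == name]
    counts = rows.pivot_table(index="annais", columns="sexe", values="nombre",
                              aggfunc="sum", fill_value=0)
    # Make sure both sex columns exist even if one sex is absent for this name.
    counts = counts.reindex(columns=[1, 2], fill_value=0)
    return counts[2] / (counts[1] + counts[2])

share = female_share_by_year(toy, "CAMILLE")
print(names_both)
print(share)
```

A share that stays roughly flat over the years would suggest the name evolves consistently for both sexes, while a drifting share would suggest it does not.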

# II- Discussions around other possible solutions


This part gathers all the other solutions we considered but did not retain, based on certain criteria.

### Function implementations

In [2]:
def top_10_region(name):
    """
    Take a name and build a bar chart of the 10 regions (departments) where it is most popular.

    Relies on the global `grouped` DataFrame built later in this notebook (merge of `names` and `depts`).

    Arguments name: str, the name whose 10 most popular regions are displayed

    Returns: the bar chart of the 10 regions with the most occurrences of name
    """
    
    subset = grouped[grouped.preusuel == name]
    top_regions = subset.nlargest(10, 'nombre')

    bar_chart = alt.Chart(top_regions).mark_bar().encode(
        x='nombre:Q',
        y=alt.Y('nom:N', sort='-x'),
        tooltip=['nom', 'nombre']
    ).properties(
        width=400,
        height=300,
        title=f'Top 10 Régions pour le nom {name}'
    )

    return bar_chart

In [3]:
def heatmap(name):
    """
    Take a name and build a heatmap of its popularity by region (department).

    Relies on the global `grouped` DataFrame built later in this notebook (merge of `names` and `depts`).

    Arguments name: str, the name whose popularity heatmap is displayed

    Returns the heatmap of the name's popularity by region
    """

    subset = grouped[grouped.preusuel == name]

    heatmap = alt.Chart(subset).mark_rect().encode(
        x='nom:N',
        y='code:N',
        color=alt.Color('nombre:Q', legend=alt.Legend(title='Popularité')),
        tooltip=['nom', 'code', 'nombre']
    ).properties(
        width=800,
        height=1400
    )

    return heatmap

In [4]:
def popularity_per_region(name):
    """
    Take a name and build a bar chart of its popularity in each region (department).

    No year filter is applied, so the bars stack the counts of all years,
    although the chart title mentions 2020.

    Arguments name: str, the name whose popularity by region is displayed

    Returns the bar chart of the name's popularity by region
    """
    
    names = pd.read_csv("dpt2020.csv", sep=";")
    names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
    names.drop(names[names.dpt == 'XX'].index, inplace=True)


    subset = names[names.preusuel == name]
    
    bar_chart = alt.Chart(subset).mark_bar().encode(
        x='dpt:N',
        y='nombre:Q',
        tooltip=['dpt', 'nombre']
    ).properties(
        width=1000,
        height=400,
        title=f'Popularité du nom {name} dans chaque région en 2020'
    )

    return bar_chart

In [5]:
def popularity_top_7_names(departement):
    """
    Take a department code and build a line chart of the popularity of the 7 most common names
    in that department between 2000 and 2020.

    Arguments departement: str, the code of the department whose name popularity is displayed

    Returns the line chart of the popularity of the 7 most common names in the department
    """

    names = pd.read_csv("dpt2020.csv", sep=";")
    names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
    names.drop(names[names.dpt == 'XX'].index, inplace=True)

    names['annais'] = names['annais'].astype(int)

    desired_dpt = departement
    subset = names[(names.annais >= 2000) & (names.annais <= 2020) & (names.dpt == desired_dpt)]

    grouped = subset.groupby(['annais', 'dpt', 'preusuel'])['nombre'].sum().reset_index()

    top_n = 7
    top_names = grouped.groupby('preusuel')['nombre'].sum().nlargest(top_n).index

    top_names_subset = grouped[grouped.preusuel.isin(top_names)]

    line_chart = alt.Chart(top_names_subset).mark_line().encode(
        x='annais:O',
        y='nombre:Q',
        color='preusuel:N',
        tooltip=['preusuel', 'annais', 'nombre']
    ).properties(
        width=600,
        height=400,
        title='Evolution du top 7 des prénoms dans le département'
    )

    return line_chart

In [6]:
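def solution(name):
    """
    Take a name and build a heatmap of its popularity by department and by year (2000 to 2020).

    Arguments name: str, the name whose popularity is displayed

    Returns the heatmap of the name's popularity by department and year
    """

    names = pd.read_csv("dpt2020.csv", sep=";")
    names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
    names.drop(names[names.dpt == 'XX'].index, inplace=True)

    names['annais'] = names['annais'].astype(int)

    # Rows with dpt == 'XX' were already dropped above, so only the year range is filtered here
    subset = names[(names.annais >= 2000) & (names.annais <= 2020)]
    subset = subset[subset.preusuel == name]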

    heatmap = alt.Chart(subset).mark_rect().encode(
        x=alt.X('annais:O', title='Year'),
        y=alt.Y('dpt:N', title='Region'),
        color=alt.Color('sum(nombre):Q', title='Popularity'),
        tooltip=['annais', 'dpt', 'sum(nombre)']
    ).properties(
        width=1000,
        height=1400,
        title='Popularity of the Name "{}" across Regions and Years'.format(name)
    )
    return heatmap

## 1) Visualization 1

**Stacked bar chart**

We created here a stacked bar chart using Altair to display the top 10 most popular names over the years. It encodes the x-axis with the annais field as an ordinal scale, the y-axis with the sum of nombre field as a quantitative scale, and the color of the bars with the preusuel field. Additionally, it includes a tooltip that shows the name, year, and count for each bar.

The strengths of using a stacked bar chart to display the top names for each year in a bar chart format include:

**Comparison**: A stacked bar chart allows for easy visual comparison between names within each year. We can quickly identify the most popular and least popular names by comparing the heights of the bars.

**Trend Analysis**: By observing the changes in the distribution of the stacked bars over the years, we can identify trends in name popularity. For example, we can see if certain names consistently remain popular or if there are fluctuations in their popularity.

**Total Count**: The stacked bars also provide information on the total count of names in a given year. By looking at the overall height of the bars, we can understand the total number of occurrences of names and compare it across different years.

**Name Contributions**: The stacked nature of the bars allows us to see the contribution of each name to the total count. This helps in identifying the relative popularity of different names within a year.

However, there are also some potential weaknesses to consider:

**Visual Clutter**: There are many names and the dataset spans a large number of years, so the stacked bar chart can become visually cluttered and challenging to interpret. This makes it difficult to distinguish individual names and observe trends clearly.

**Lack of Granularity**: A stacked bar chart provides an overview of name popularity trends but may not offer detailed insights into specific names or their variations (e.g., spelling variations).

**Data Size Limitations**: We encounter limitations in terms of the number of names or years that can be effectively displayed in a single chart.

In [7]:
import altair as alt
import pandas as pd

# Load the data
names = pd.read_csv("dpt2020.csv", sep=";")

names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
names.drop(names[names.dpt == 'XX'].index, inplace=True)

# Keep the 10 largest rows for each year. Each row is a per-department count,
# so this ranks department-level counts rather than national totals.
top_10_names = names.groupby('annais').apply(lambda x: x.nlargest(10, 'nombre')).reset_index(drop=True)

In [8]:
# Creating the stacked bar chart
chart = alt.Chart(top_10_names).mark_bar().encode(
    x='annais:O',
    y='sum(nombre):Q',
    color='preusuel:N',
    tooltip=['preusuel:N', 'annais:O', 'nombre:Q']
).properties(
    width=900,
    height=600,
    title='Top 10 Most Popular Names Over the Years'
)

chart

In [9]:
####################################

**Line chart**

To identify names that have consistently remained popular or unpopular, we calculate the average occurrences of each name across all years and sort them accordingly.

To identify names that were suddenly or briefly popular or unpopular, we analyze the yearly occurrences of each name and look for significant changes or spikes.

To identify trends over time, we analyze the overall pattern of occurrences for different names by creating line plots for a selected set of names.
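
The year-over-year comparison described above can be illustrated on a tiny example. This is only a sketch on invented counts for two placeholder names. `flag_spikes` is a hypothetical helper, and the threshold of 1000 simply mirrors the value used in the notebook's own cell.

```python
import pandas as pd

# Toy yearly totals for two placeholder names (all counts are made up).
name_counts_toy = pd.DataFrame({
    "preusuel": ["NAME_A"] * 4 + ["NAME_B"] * 4,
    "annais":   ["1988", "1989", "1990", "1991"] * 2,
    "nombre":   [500, 2500, 2600, 900, 4000, 3900, 3800, 3700],
})

def flag_spikes(df, threshold=1000):
    """Return the rows whose count changed by more than `threshold` since the previous year."""
    df = df.sort_values(["preusuel", "annais"])
    # diff() is computed within each name, so the first year of a name is NaN.
    change = df.groupby("preusuel")["nombre"].diff()
    return df.assign(change=change)[change.abs() > threshold]

spikes = flag_spikes(name_counts_toy)
print(spikes)
```

Here only `NAME_A` is flagged (a sharp rise, then a sharp fall), while the slow decline of `NAME_B` stays under the threshold. An absolute threshold favors very common names, which is a design choice to keep in mind.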

**Strengths:**

**Visualizing Trends**: Line charts are effective at showing trends and patterns over time. They allow us to easily observe the rise or decline of name popularity and identify any long-term trends.

**Comparing Multiple Names**: Line charts enable the comparison of multiple names on the same chart.

**Highlighting Significant Changes**: By plotting the significant changes or spikes in occurrences, we can easily identify names that experienced sudden or brief popularity or unpopularity. These changes are visually apparent as peaks or valleys in the chart.

**Exploring Individual Name Histories**: Line charts provide a way to explore the history of individual names over time. By hovering over the lines, we can see the specific occurrences of each name in different years.

**Weaknesses:**

**Limited Comparison for a Large Number of Names**: If the number of names is very large, the chart can become visually cluttered and it is challenging to compare all the lines effectively. In such cases, focusing on a subset of names or using interactive features to filter or highlight specific names can be helpful.

In conclusion, line charts are quite useful if we want to answer each question individually, but it becomes challenging to answer all the questions with a single graph.

In [11]:
# Group the data by name and year and calculate the total occurrences
name_counts = names.groupby(['preusuel', 'annais'])['nombre'].sum().reset_index()

# Calculate the average occurrences of each name
name_avg_counts = name_counts.groupby('preusuel')['nombre'].mean().reset_index()

# Sort the names based on average occurrences in descending order
popular_names = name_avg_counts.sort_values('nombre', ascending=False)

# Get the top 10 popular names
top_10_popular_names = popular_names.head(10)

# Get the bottom 10 unpopular names
bottom_10_unpopular_names = popular_names.tail(10)

# Filter the data for the top 10 popular names
top_10_popular_counts = name_counts[name_counts['preusuel'].isin(top_10_popular_names['preusuel'])]

# Filter the data for the top 10 unpopular names
top_10_unpopular_counts = name_counts[name_counts['preusuel'].isin(bottom_10_unpopular_names['preusuel'])]

# Define the Altair line chart for the top 10 popular names
chart_popular = alt.Chart(top_10_popular_counts).mark_line().encode(
    x='annais:O',
    y='nombre:Q',
    color='preusuel:N',
    tooltip=['preusuel:N', 'annais:O', 'nombre:Q']
).properties(
    width=800,
    height=400,
    title='Evolution of Top 10 Popular Names Over Time'
)

# Define the Altair line chart for the top 10 unpopular names
chart_unpopular = alt.Chart(top_10_unpopular_counts).mark_line().encode(
    x='annais:O',
    y='nombre:Q',
    color='preusuel:N',
    tooltip=['preusuel:N', 'annais:O', 'nombre:Q']
).properties(
    width=800,
    height=400,
    title='Evolution of Top 10 Unpopular Names Over Time'
)

# Display the line charts
chart_popular | chart_unpopular

In [12]:
# Calculate the yearly occurrences of each name
name_yearly_counts = name_counts.groupby(['preusuel', 'annais'])['nombre'].sum().reset_index()

# Calculate the difference in occurrences between consecutive years for each name
name_yearly_diff = name_yearly_counts.groupby('preusuel')['nombre'].diff()

# Filter names with significant changes or spikes in occurrences
significant_changes = name_yearly_counts[(name_yearly_diff > 1000) | (name_yearly_diff < -1000)]


# Define the Altair line chart
chart = alt.Chart(significant_changes).mark_line().encode(
    x='annais:O',
    y='nombre:Q',
    color='preusuel:N',
    tooltip=['preusuel:N', 'annais:O', 'nombre:Q']
).properties(
    width=800,
    height=400,
    title='Names with Significant Changes in Occurrences Over Time'
)

# Display the line chart
chart

In [13]:
# Select a set of names to visualize
selected_names = ['MARIE', 'JEAN', 'THIERRY']

# Filter the data for the selected names
selected_names_counts = name_counts[name_counts['preusuel'].isin(selected_names)]

# Define the Altair line plot
line_plot = alt.Chart(selected_names_counts).mark_line().encode(
    x='annais:O',
    y='nombre:Q',
    color='preusuel:N',
    tooltip=['preusuel:N', 'annais:O', 'nombre:Q']
).properties(
    width=900,
    height=600,
    title='Trends of Selected Names (MARIE, JEAN, THIERRY) Over Time'
)

# Display the line plot
line_plot

In [None]:
#######################################################

In [14]:
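# Sum only the count column; summing every column would also try to add up `dpt` strings
popular_names = names.groupby(['annais', 'preusuel'], as_index=False)['nombre'].sum()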
popular_names = popular_names[popular_names['nombre'] > 1000]

top_15_names = popular_names.groupby('preusuel')['nombre'].sum().nlargest(15).index
popular_names = popular_names[popular_names['preusuel'].isin(top_15_names)]

chart = alt.Chart(popular_names).mark_bar().encode(
    x='annais:T',
    y='sum(nombre):Q',
    color='preusuel:N',
    tooltip=['preusuel:N', 'annais:T', 'sum(nombre):Q']
)

chart

In [15]:
##############################################

In [16]:
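# Sum only the count column; summing every column would also try to add up `dpt` strings
emerging_names = names.groupby(['annais', 'preusuel'], as_index=False)['nombre'].sum()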
emerging_names["annee"]=emerging_names['annais']
emerging_names['annee'] = emerging_names['annee'].astype('int64')
emerging_names = emerging_names[emerging_names['annee'] > 1920]

In [17]:
emerging_names = emerging_names[emerging_names['nombre'] > 1000]

In [18]:
emerging_names = emerging_names.sort_values(['annais', 'nombre'], ascending=[True, False])
emerging_names = emerging_names.groupby('preusuel').first().reset_index()

chart = alt.Chart(emerging_names).mark_line().encode(
    x='annais:T',
    y='nombre:Q',
    color=alt.condition(
        alt.datum.nombre > 5000,
        alt.value('green'),  
        alt.value('red')  
    ),
    tooltip=['preusuel:N', 'annais:T', 'nombre:Q']
)

chart

In [19]:
################################################################

In [22]:
names_fem = names[names.sexe==2]
names_masc = names[names.sexe==1]

In [23]:
names_fem_popular = names_fem[['preusuel', 'nombre']].groupby('preusuel', as_index=False).sum()
top_names_fem = names_fem_popular.sort_values('nombre', ascending=False)[:20]

names_masc_popular = names_masc[['preusuel', 'nombre']].groupby('preusuel', as_index=False).sum()
top_names_masc = names_masc_popular.sort_values('nombre', ascending=False)[:20]

names_popular = names[['preusuel', 'nombre']].groupby('preusuel', as_index=False).sum()
top_names = names_popular.sort_values('nombre', ascending=False)[:20]

In [24]:
# Yearly totals for the top 20 names, summing only the count column
names_masc_filt = names_masc[names_masc['preusuel'].isin(list(top_names_masc['preusuel']))].groupby(['preusuel', 'annais'], as_index=False)['nombre'].sum()
names_fem_filt = names_fem[names_fem['preusuel'].isin(list(top_names_fem['preusuel']))].groupby(['preusuel', 'annais'], as_index=False)['nombre'].sum()
names_filt = names[names['preusuel'].isin(list(top_names['preusuel']))].groupby(['preusuel', 'annais'], as_index=False)['nombre'].sum()


In [25]:
selection = alt.selection_multi(fields=['preusuel'], bind='legend')
chart = alt.Chart(names_filt).mark_line().encode(
  x='annais:T',
  y='nombre:Q',
  color=alt.condition(selection, 'preusuel:N', alt.value('lightgray')),
  size=alt.condition(selection, alt.value(3), alt.value(1))
).add_selection(selection).properties(
    width=500,
    height=400,
    title='Top 20 prénoms (masculins et féminins confondus)'
)
chart




Click on a name in the legend of the visualization above.

In [26]:
selection = alt.selection_multi(fields=['preusuel'], bind='legend')
chart = alt.Chart(names_fem_filt).mark_line().encode(
  x='annais:T',
  y='nombre:Q',
  color=alt.condition(selection, 'preusuel:N', alt.value('lightgray')),
  size=alt.condition(selection, alt.value(3), alt.value(1))
).add_selection(selection).properties(
    width=500,
    height=400,
    title='Top 20 prénoms (féminins uniquement)'
)
chart



If we click on the name "Natalie" in the legend, we can see that it peaked in popularity in the 1970s.

Our first visualization is a line chart created with Altair. It shows the evolution over time of the 20 most used names. We made several versions: one for both sexes combined, one for male names, and one for female names.

Looking at the 20 most used names, we can clearly see certain evolutions. In general, no name (female or male) manages to keep a constant popularity. Some names see their counts decrease very sharply over the years, while others become popular very quickly for a short time. We can also see general trends, for example the overall drop in the counts of all names around 1918 (low birth rate linked to the First World War?).

This visualization lets us follow the trend of one particular name from the top 20. Clicking on a name in the legend makes the corresponding curve stand out from the others, so that it can be fully appreciated within a cloud of curves.

In this visualization, it would have been interesting to highlight the most given names per range of years (for example, the most given names from 1900 to 1940, from 1940 to 1980, and so on). Indeed, most of the names retained in the chart peaked in popularity between 1920 and 1980, hence their very high counts. However, names tend to diversify over the years. This visualization could therefore benefit from highlighting the most popular "new" names from the 1980s onward, for example (the point at which most of the previously popular names fall below a count of 5000).
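
The idea of ranking names per range of years could be sketched as follows. The toy counts are invented, `top_names_per_period` is a hypothetical helper, and the period boundaries are only an example loosely based on the ranges mentioned above.

```python
import pandas as pd

# Toy yearly totals (all counts are made up for illustration).
toy = pd.DataFrame({
    "preusuel": ["MARIE", "JEAN", "MARIE", "LEA", "JEAN", "LEA"],
    "annais":   ["1910", "1950", "1950", "1990", "1990", "2010"],
    "nombre":   [900, 800, 600, 300, 100, 400],
})

def top_names_per_period(df, edges, n=1):
    """Bin years into [edge, next_edge) periods and keep the n most given names in each."""
    years = df["annais"].astype(int)
    labels = [f"{lo}-{hi - 1}" for lo, hi in zip(edges[:-1], edges[1:])]
    period = pd.cut(years, bins=edges, right=False, labels=labels)
    totals = (df.assign(period=period)
                .groupby(["period", "preusuel"], observed=True, as_index=False)["nombre"]
                .sum())
    ranked = totals.sort_values(["period", "nombre"], ascending=[True, False])
    return ranked.groupby("period", observed=True).head(n).reset_index(drop=True)

top = top_names_per_period(toy, edges=[1900, 1940, 1980, 2021])
print(top)
```

Each period then gets its own ranking, so names that became popular recently are no longer hidden by the very high counts of earlier decades.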

## 2) Visualization 2

Here are the questions this solution needs to answer, followed by why we think it is a good visualization.

**Is there a regional effect in the data?**

**Are some names more popular in some regions?**

**Are popular names generally popular across the whole country?**


**Is there a regional effect in the data?**

The data visualization offers valuable insights into the regional effect on naming preferences. It demonstrates that certain names, like Gabrielle, enjoy high popularity in specific regions such as Paris, but their popularity might not extend uniformly across the entire country. This suggests that naming trends exhibit significant regional variations, indicating the influence of cultural diversity and regional factors in shaping naming preferences in different parts of France.

**Are popular names generally popular across the whole of France, and are some names more popular in some regions?**

The visualization provides evidence of a regional effect in the popularity of names, as seen with the example of Gabrielle. While some names achieve popularity nationwide, others experience varying degrees of popularity depending on the region. This observation highlights the unique cultural and social dynamics present in different regions of France, contributing to the preference and prominence of certain names in specific areas. Consequently, it can be concluded that popular names are not necessarily universally popular across the entire country, indicating the existence of regional variations in naming preferences.

**Strengths:**

The visualization enables a quick assessment of name trends and their evolution over time. By focusing on a specific name, it facilitates the observation of popularity shifts and patterns associated with that name. The visualization sheds light on regional differences in naming preferences, underscoring the cultural richness and diversity within France.

**Weaknesses:**

The choice of the name under analysis can significantly impact the results, warranting careful selection. Aesthetically, the visualization could be further refined for enhanced visual appeal. While the visualization captures temporal trends, it might not provide a comprehensive understanding of the multifaceted factors influencing naming patterns. Additionally, the data size limitations of the visualization should be acknowledged. Importantly, the visualization does not provide a definitive answer to whether popular names are generally popular nationwide, as the regional effect demonstrates variations in popularity across different parts of the country.

In [27]:
import geopandas as gpd # Requires geopandas -- e.g.: conda install -c conda-forge geopandas
alt.data_transformers.enable('json') # Let Altair/Vega-Lite work with large data sets


DataTransformerRegistry.enable('json')

In [28]:
names = pd.read_csv("dpt2020.csv", sep=";")
names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
names.drop(names[names.dpt == 'XX'].index, inplace=True)
depts = gpd.read_file('departements-version-simplifiee.geojson')

just_names = names

names = depts.merge(names, how='right', left_on='code', right_on='dpt') 

grouped = names.groupby(['dpt', 'preusuel', 'sexe'], as_index=False).sum(numeric_only=True)

grouped = depts.merge(grouped, how='right', left_on='code', right_on='dpt')

In [29]:
###########################################""

**Heatmap**

In this case, we created a heatmap to provide a comprehensive view of the popularity of baby names and to explore the regional effect in the data. We use a color gradient to represent the number of occurrences, with darker shades indicating higher popularity. This visualization allows for easy identification of popular and unpopular names across different departments.

The strengths and weaknesses of this type of chart for representing baby name popularity by region are as follows:

**Strengths:**

**Comparison of Popularity**: The color represents the number of occurrences of a specific name in a region, allowing for easy visual comparison of popularity across regions and names.

**Effective for displaying patterns and trends**: Heatmaps are particularly useful for identifying patterns and trends in data. They allow us to quickly identify areas of high and low values, making it easy to spot clusters, correlations, and outliers.

**Weaknesses:**

**Distortion due to color perception**: Color perception can vary among individuals, which may lead to differences in interpretation. It's essential to choose a color scheme that is accessible and avoids misleading interpretations.

**Difficulty in comparing exact values**: While heatmaps provide a good sense of the relative magnitude or density of data values, they may not be ideal for precise comparisons between specific values.

In [30]:
region_counts = top_10_names.groupby(['dpt', 'preusuel'])['nombre'].sum().reset_index()

# Specify a sequential color scheme, which suits the quantitative 'nombre' field
color_scheme = 'blues'  # Other sequential schemes include 'greens', 'oranges', 'viridis', etc.

# Create the heatmap with the specified color scheme
chart2 = alt.Chart(region_counts).mark_rect().encode(
    alt.X('dpt:N', axis=alt.Axis(title='Department')),
    alt.Y('preusuel:N', axis=alt.Axis(title='Name')),
    alt.Color('nombre:Q', scale=alt.Scale(scheme=color_scheme)),  # Darker shades mean higher counts
    tooltip=['dpt:N', 'preusuel:N', 'nombre:Q']
).properties(
    width=800,
    height=900,
    title='Baby Name Popularity by Region'
)

# Display the chart
chart2

In [31]:
#####################################################################""

In [32]:
popularity_per_region("GABRIELLE")

In [33]:
###################################""

In [34]:
popularity_top_7_names("75")

In [35]:
#################################

In [36]:
heatmap("GABRIELLE")

In [37]:
#################################

In [38]:
solution("GABRIELLE")

Here are the questions this solution needs to answer, followed by the reasons we consider it a good visualisation.

**Is there a regional effect in the data?**

**Are some names more popular in some regions?**

**Are popular names generally popular across the whole country?**


**Is there a regional effect in the data?**

The data visualization offers valuable insights into the regional effect on naming preferences. It demonstrates that certain names, like Gabrielle, enjoy high popularity in specific regions such as Paris, but their popularity might not extend uniformly across the entire country. This suggests that naming trends exhibit significant regional variations, indicating the influence of cultural diversity and regional factors in shaping naming preferences in different parts of France.

**Are popular names generally popular across the whole country of France? And are some names more popular in some regions?**

The visualization provides evidence of a regional effect in the popularity of names, as seen with the example of Gabrielle. While some names achieve popularity nationwide, others experience varying degrees of popularity depending on the region. This observation highlights the unique cultural and social dynamics present in different regions of France, contributing to the preference and prominence of certain names in specific areas. Consequently, it can be concluded that popular names are not necessarily universally popular across the entire country, indicating the existence of regional variations in naming preferences.
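
One way to put a number on this regional effect, offered as a sketch rather than as the method used in this notebook, is to compare a name's share of births within a department to its share in the whole country. The data below is made up for illustration, and the column names `dept_share`, `national_share` and `relative_popularity` are hypothetical names introduced for this example.

```python
import pandas as pd

# Illustrative rows only (not real INSEE figures).
sample = pd.DataFrame({
    "dpt": ["75", "75", "13", "13"],
    "preusuel": ["GABRIELLE", "LOUISE", "GABRIELLE", "LOUISE"],
    "nombre": [300, 100, 100, 300],
})

# Share of each name among the births recorded in its department.
dept_totals = sample.groupby("dpt")["nombre"].transform("sum")
sample["dept_share"] = sample["nombre"] / dept_totals

# Share of each name among all births in the sample (the national baseline).
national_share = sample.groupby("preusuel")["nombre"].sum() / sample["nombre"].sum()
sample["national_share"] = sample["preusuel"].map(national_share)

# A ratio above 1 means the name is over-represented in that department,
# a ratio below 1 means it is under-represented.
sample["relative_popularity"] = sample["dept_share"] / sample["national_share"]
print(sample[["dpt", "preusuel", "relative_popularity"]])
```

In this toy sample GABRIELLE gets a ratio of 1.5 in department 75 and 0.5 in department 13. A name that is popular everywhere would stay close to 1 in every department, so using shares rather than raw counts also keeps populous departments from dominating the comparison.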

**Strengths:**

The visualization enables a quick assessment of name trends and their evolution over time. By focusing on a specific name, it facilitates the observation of popularity shifts and patterns associated with that name. The visualization sheds light on regional differences in naming preferences, underscoring the cultural richness and diversity within France.

**Weaknesses:**

The choice of the name under analysis can significantly impact the results, warranting careful selection. Aesthetically, the visualization could be further refined for enhanced visual appeal. While the visualization captures temporal trends, it might not provide a comprehensive understanding of the multifaceted factors influencing naming patterns. Additionally, the data size limitations of the visualization should be acknowledged. Importantly, the visualization does not provide a definitive answer to whether popular names are generally popular nationwide, as the regional effect demonstrates variations in popularity across different parts of the country.

In [39]:
##########################################

In [40]:
import altair as alt

top_names_region = names.groupby(['nom', 'preusuel'], as_index=False)['nombre'].sum()
top_names_region = top_names_region.groupby('nom').apply(lambda x: x.nlargest(5, 'nombre')).reset_index(drop=True)
top_names_region_grouped = depts.merge(top_names_region, how='right', left_on='nom', right_on='nom')

# mark_geoshape draws each department outline and takes no x/y channels,
# so the count is conveyed through color and the tooltip only
chart = alt.Chart(top_names_region_grouped).mark_geoshape().encode(
    tooltip=['preusuel', 'code', 'nombre'],
    #color=alt.Color('nom:N')
    color=alt.Color('nombre')
).properties(
    width=800,
    height=600,
    title='Distribution of popular names by region'
)

chart

In [41]:
####################################################

In [42]:
import altair as alt

top_names_region = names.groupby(['nom', 'preusuel'], as_index=False)['nombre'].sum()
top_names_region = top_names_region.groupby('nom').apply(lambda x: x.nlargest(5, 'nombre')).reset_index(drop=True)
top_names_region_grouped = depts.merge(top_names_region, how='right', left_on='nom', right_on='nom')

# Same map as before, colored by department name instead of count
# (mark_geoshape takes no x/y channels)
chart = alt.Chart(top_names_region_grouped).mark_geoshape().encode(
    #color=alt.Color('nombre')
    tooltip=['preusuel', 'code', 'nombre'],
    color=alt.Color('nom:N')
).properties(
    width=800,
    height=600,
    title='Distribution of popular names by region'
)

chart

In [43]:
###################################################

In [44]:
scatter_chart = alt.Chart(top_names_region_grouped).mark_circle().encode(
    x=alt.X('preusuel:N', title='First name'),
    y=alt.Y('nombre:Q', title='Number of babies'),
    color=alt.Color('nom:N'),
    tooltip=['preusuel', 'code', 'nombre']
).properties(
    width=800,
    height=400,
    title='Distribution of popular names by region (scatter plot)'
)

scatter_chart.interactive()

In [45]:
###############################################

In [46]:
faceted_bar_chart = alt.Chart(top_names_region_grouped).mark_bar().encode(
    x=alt.X('preusuel:N', title='First name'),
    y=alt.Y('nombre:Q', title='Number of babies'),
    color=alt.Color('nom:N'),
    column=alt.Column('code:N', title='Region'),
    tooltip=['preusuel', 'code', 'nombre'],
).properties(
    width=200,
    height=300,
    title='Distribution of popular names by region (trellis chart)'
).interactive()

faceted_bar_chart

In [47]:
###############################################

In [48]:
top_names_region = grouped.groupby(['nom', 'preusuel'], as_index=False)['nombre'].sum()
top_names_region = top_names_region.groupby('nom').apply(lambda x: x.nlargest(5, 'nombre')).reset_index(drop=True)

chart = alt.Chart(top_names_region).mark_bar().encode(
    x=alt.X('preusuel:N', title='First name'),
    y=alt.Y('nombre:Q', title='Number of babies'),
    color=alt.Color('nom:N')
).properties(
    width=800,
    height=400,
    title='Distribution of popular names by region'
)

chart

## 3) Vizualisation 3 

In [49]:
name = 'CAMILLE'

subset = names[(names.preusuel == name)]

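# 'subset' still has one row per department, so sum the counts
# to get a single point per year and sex
chart = alt.Chart(subset).mark_line().encode(
    x='annais',
    y='sum(nombre):Q',
    color='sexe:N',
    tooltip=['annais', 'sum(nombre):Q', 'sexe:N']
).properties(
    width=800,
    height=400,
    title=f"Evolution of the number of babies named '{name}'"
)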

chart

In [50]:
gender_counts = names.groupby(['annais', 'sexe'], as_index=False)['nombre'].sum()

chart = alt.Chart(gender_counts).mark_line().encode(
    x='annais',
    y='nombre',
    color='sexe:N',
    tooltip=['annais', 'nombre']
).properties(
    width=800,
    height=400,
    title="Evolution of name popularity by sex"
)

chart

Group the data by year, first name and sex to obtain the total number of names given per sex for each year.

In [54]:

grouped = names.groupby(['annais', 'preusuel', 'sexe'], as_index=False).sum(numeric_only=True)

# Create the bar chart (one panel per sex)
chart = alt.Chart(grouped).mark_bar().encode(
    x='annais:O',
    y='sum(nombre):Q',
    color='sexe:N',
    column='sexe:N',
    tooltip=['annais:O', 'sum(nombre):Q', 'sexe:N']
).properties(
    width=500,
    height=300
).interactive()

chart