# Deputies' questions corpora exploration during the 14th and 15th legislatures

This notebook explores summary statistics of the corpora of questions asked by deputies during the 14th and 15th French legislatures.


## <a id="content">Contents</a>
1. [Male and female presence in the legislatures](#gender-presence)
   - [Number of males and females per legislature](#gender-per-leg)
   - [Number of male and female question authors per legislature](#gender-authorship-per-leg)
   - [Number of male and female questions per legislature](#gender-questions-per-leg)
2. [Distribution of question authors' political parties](#authors-political-parties)
3. [Distribution of authors' gender per question section](#authors-gender-per-section)
4. [Authors' usage of gender-related words](#authors-usage-of-gender-words)

<ins>Loading useful libraries</ins>

In [59]:
# importing external libraries
import os
import json
import re
import string

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


import plotly.graph_objects as go
import plotly.express as px

pal_ = list(sns.color_palette(palette='plasma_r',
                              n_colors=6).as_hex())

# increase resolution of graphs (150 dpi for both displayed and saved figures)
sns.set(rc={"figure.dpi": 150, "savefig.dpi": 150})

<ins>Import questions and actors datasets</ins>

In [2]:
df_questions_XIV = pd.read_csv('../data/legislature_XIV/df_questions.csv')
df_questions_XV = pd.read_csv('../data/legislature_XV/df_questions.csv')

df_actors_XIV = pd.read_csv('../data/legislature_XIV/df_actors.csv')
df_actors_XV = pd.read_csv('../data/legislature_XV/df_actors.csv')

questions_XIV = list(df_questions_XIV['q_text'])
questions_XV = list(df_questions_XV['q_text'])

df_questions_XIV.head(5)

| Unnamed: 0 | q_text | author_id | author_org | author_org_abrev | section | analysis_head | answer_min |
|---|---|---|---|---|---|---|---|
| 0 | des affaires sociales et de la santé sur l' ac... | PA2907 | Les Républicains | LES-REP | santé | remboursement | Affaires sociales et santé |
| 1 | de l' agriculture , de l' agroalimentaire et d... | PA607595 | Socialiste, républicain et citoyen | SRC | élevage | bovins | Agriculture, agroalimentaire et forêt |
| 2 | du travail , de l' emploi , de la formation pr... | PA343623 | Union pour un Mouvement Populaire | UMP | entreprises | entreprises en difficulté | Travail, emploi, formation professionnelle et ... |
| 3 | d' état , auprès de la ministre de l' écologie... | PA1857 | Union pour un Mouvement Populaire | UMP | transports ferroviaires | SNCF | Transports, mer et pêche |
| 4 | de l' économie , de l' industrie et du numériq... | PA608826 | Union pour un Mouvement Populaire | UMP | télécommunications | Internet | Économie, industrie et numérique |


## <a id="gender-presence">1. Male and female presence in the legislatures </a> ([&uarr;](#content))

### <a id="gender-per-leg">1.1. Number of males and females per legislature </a> ([&uarr;](#content))


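We first merge the two datasets (questions and actors) on `author_id`, so that every question carries its author's gender (column `civ`). The merged dataframes are reused in the later sections, including the distribution of sections by gender.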

In [3]:
# attach each author's attributes (including gender, `civ`) to their questions
df_XIV_merged = pd.merge(df_questions_XIV, df_actors_XIV, on='author_id')
df_XV_merged = pd.merge(df_questions_XV, df_actors_XV, on='author_id')


# restrict to important columns
columns = ['q_text', 'author_id', 'civ', 'section', 'author_org', 'answer_min']
df_XIV_merged = df_XIV_merged[columns]
df_XV_merged = df_XV_merged[columns]


# ensure there are only 2 gender classes
print(df_XIV_merged['civ'].unique())
print(df_XV_merged['civ'].unique())

['M.' 'Mme']
['M.' 'Mme']
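
Since `pd.merge` performs an inner join by default, any question whose `author_id` is absent from the actors table would be dropped without warning. The following is a minimal sketch of how this could be checked. It uses toy data: the values are made up for illustration, and only the column names `q_text`, `author_id`, and `civ` come from the notebook.

```python
import pandas as pd

# toy stand-ins for df_questions_* and df_actors_* (values are illustrative)
questions = pd.DataFrame({
    'q_text': ['q1', 'q2', 'q3'],
    'author_id': ['PA1', 'PA2', 'PA9'],   # 'PA9' has no matching actor
})
actors = pd.DataFrame({
    'author_id': ['PA1', 'PA2'],
    'civ': ['M.', 'Mme'],
})

# a left join with indicator=True flags questions that have no matching actor;
# validate='many_to_one' raises if an author_id is duplicated in the actors table
checked = pd.merge(questions, actors, on='author_id', how='left',
                   indicator=True, validate='many_to_one')
unmatched = checked[checked['_merge'] == 'left_only']
print(f"{len(unmatched)} question(s) without a matching actor")
```

If the count is zero on the real data, the inner join used above loses nothing.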


In [4]:
# dataframes by gender
df_XIV_male = df_XIV_merged.loc[df_XIV_merged['civ']=='M.']
df_XIV_female = df_XIV_merged.loc[df_XIV_merged['civ']=='Mme']

df_XV_male = df_XV_merged.loc[df_XV_merged['civ']=='M.']
df_XV_female = df_XV_merged.loc[df_XV_merged['civ']=='Mme']

In [89]:
def make_plot_df(df, group1='leg', group2='civ', apply_func=None, apply_col=None):
    """ Count rows per (group1, group2) pair and compute each pair's share
    (in %) within its group1 value. Optionally map `apply_col` through `apply_func`. """
    counts = df.groupby([group1, group2]).size()
    df_g = counts.reset_index(name='Counts')
    # percentage of each (group1, group2) count relative to its group1 total
    df_g['Percentage'] = (100 * counts / counts.groupby(level=0).transform('sum')).values
    if apply_func is not None:
        df_g[apply_col] = df_g[apply_col].apply(apply_func)
    return df_g
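
To make the output of `make_plot_df` concrete, the same counts-and-shares computation can be traced on a few toy rows. This is a sketch with made-up data; the names `toy`, `toy_g`, and `totals` are hypothetical, and the within-group share is computed here with `transform('sum')`.

```python
import pandas as pd

# toy data: 3 rows in 'leg. XIV' (2 'M.', 1 'Mme'), 2 rows in 'leg. XV' (1 each)
toy = pd.DataFrame({
    'leg': ['leg. XIV', 'leg. XIV', 'leg. XIV', 'leg. XV', 'leg. XV'],
    'civ': ['M.', 'M.', 'Mme', 'M.', 'Mme'],
})

# count rows per (leg, civ) pair
toy_g = toy.groupby(['leg', 'civ']).size().reset_index(name='Counts')

# share of each pair within its legislature: divide by the per-legislature total
totals = toy_g.groupby('leg')['Counts'].transform('sum')
toy_g['Percentage'] = 100 * toy_g['Counts'] / totals
print(toy_g)
```

Within each legislature the percentages sum to 100, which is what the bar labels in the plots below display.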

In [90]:
# create a dataframe for the plot 
df_full = pd.concat((df_actors_XIV, df_actors_XV), axis=0).reset_index()
df_leg = pd.DataFrame({'leg':['leg. XIV']*len(df_actors_XIV)+['leg. XV']*len(df_actors_XV)})
df_full = pd.concat((df_full, df_leg), axis=1)

apply_func = lambda x: 'female' if x=='Mme' else 'male'
df_g = make_plot_df(df_full, group1='leg', group2='civ', apply_col='civ', apply_func=apply_func)

# make a plot 
fig = px.bar(df_g, x='leg', y=['Counts'], color='civ', 
             text=df_g['Percentage'].apply(lambda x: '{0:1.2f}%'.format(x)), 
             color_discrete_sequence=pal_,
             labels={'civ':'Gender'})

fig.update_layout(margin=dict(t=100, b=30, l=100, r=0), showlegend=True, height=400, width=800,
                        plot_bgcolor='white', 
                        #paper_bgcolor='#fafafa',
                        title_font=dict(size=20, color='#555', family="Lato, sans-serif"),
                        font=dict(size=16, color='#4b4d52'),
                        xaxis_title_text='Legislature',
                        yaxis_title_text = "Count",
                        hoverlabel=dict(bgcolor="#444", font_size=13, font_family="Lato, sans-serif"))

fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgray')
fig.show()
#fig.write_image('../figures/nb_men_women_per_leg.png', scale=4)

### <a id="gender-authorship-per-leg">1.2. Number of male and female question authors per legislature</a> ([&uarr;](#content))

In [244]:
df_full = pd.concat((df_XIV_merged, df_XV_merged), axis=0).reset_index()
df_leg = pd.DataFrame({'leg':['leg. XIV']*len(df_XIV_merged)+['leg. XV']*len(df_XV_merged)})
df_full = pd.concat((df_full, df_leg), axis=1)

# number of distinct question authors per (legislature, gender)
df_g = df_full.groupby(['leg', 'civ'])['author_id'].nunique().reset_index(name='Counts')
# share of each gender among the question authors of a legislature
df_g['Percentage'] = 100 * df_g['Counts'] / df_g.groupby('leg')['Counts'].transform('sum')

df_g['civ'] = df_g['civ'].apply(lambda x: 'female' if x=='Mme' else 'male')

# make a plot 
pal_ = list(sns.color_palette(palette='plasma_r',
                              n_colors=6).as_hex())

fig = px.bar(df_g, x='leg', y=['Counts'], color='civ', 
             text=df_g['Percentage'].apply(lambda x: '{0:1.2f}%'.format(x)), 
             color_discrete_sequence=pal_,
             labels={'civ':'Gender'})


fig.update_layout(margin=dict(t=100, b=30, l=100, r=0), showlegend=True, height=400, width=800,
                        plot_bgcolor='white', 
                        #paper_bgcolor='#fafafa',
                        title_font=dict(size=20, color='#555', family="Lato, sans-serif"),
                        font=dict(size=16, color='#4b4d52'),
                        xaxis_title_text='Legislature',
                        yaxis_title_text = "Count",
                        hoverlabel=dict(bgcolor="#444", font_size=13, font_family="Lato, sans-serif"))

#fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='f3f3f3')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgray')
fig.show()
fig.write_image('../figures/nb_men_women_asking_per_leg.png', scale=4)

### <a id="gender-questions-per-leg">1.3. Number of male and female questions per legislature</a> ([&uarr;](#content))


In [91]:
df_full = pd.concat((df_XIV_merged, df_XV_merged), axis=0).reset_index()
df_leg = pd.DataFrame({'leg':['leg. XIV']*len(df_XIV_merged)+['leg. XV']*len(df_XV_merged)})
df_full = pd.concat((df_full, df_leg), axis=1)

apply_func = lambda x: 'female' if x=='Mme' else 'male'
df_g = make_plot_df(df_full, group1='leg', group2='civ', apply_col='civ', apply_func=apply_func)

# make a plot 
pal_ = list(sns.color_palette(palette='plasma_r',
                              n_colors=6).as_hex())

fig = px.bar(df_g, x='leg', y=['Counts'], color='civ', 
             text=df_g['Percentage'].apply(lambda x: '{0:1.2f}%'.format(x)), 
             color_discrete_sequence=pal_,
             labels={'civ':'Gender'})


fig.update_layout(margin=dict(t=100, b=30, l=100, r=0), showlegend=True, height=400, width=800,
                        plot_bgcolor='white', 
                        #paper_bgcolor='#fafafa',
                        title_font=dict(size=20, color='#555', family="Lato, sans-serif"),
                        font=dict(size=16, color='#4b4d52'),
                        xaxis_title_text='Legislature',
                        yaxis_title_text = "Count",
                        hoverlabel=dict(bgcolor="#444", font_size=13, font_family="Lato, sans-serif"))

#fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='f3f3f3')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgray')
fig.show()
#fig.write_image('../figures/nb_men_women_questions_per_leg.png', scale=4)

## <a id="authors-political-parties">2. Distribution of question authors' political parties</a> ([&uarr;](#content))


<ins>**Legislature XIV**</ins>

In [100]:
pal_ = list(sns.color_palette(palette='plasma_r',
                              n_colors=df_questions_XIV['author_org'].nunique()).as_hex())

fig = px.pie(df_questions_XIV, names='author_org', 
             height=800, width=1000,
             hole=0.6,
             color_discrete_sequence=pal_)

fig.update_traces(hovertemplate=None, textposition='outside', textinfo='percent+label', rotation=50)
fig.update_layout(margin=dict(t=100, b=30, l=0, r=0), showlegend=False,
                        plot_bgcolor='#fafafa',
                        title_font=dict(size=20, color='#555', family="Lato, sans-serif"),
                        font=dict(size=14, color='#4b4d52'),
                        hoverlabel=dict(bgcolor="#444", font_size=13, font_family="Lato, sans-serif"))

func_label = "<br>".join(["Political parties",
                          "Leg. XIV"])
fig.add_annotation(dict(x=0.5, y=0.5,  align='center',
                        xref = "paper", yref = "paper",
                        showarrow = False, font_size=30,
                        text=func_label))

#fig.update_layout(plot_bgcolor='#fafafa', legend=dict(orientation="h", yanchor="bottom", y=0.3, xanchor="right", x=2))
fig.show()

fig.write_image('../figures/dep_polit_parties_XIV.png', scale=4)

<ins>**Legislature XV**</ins>

In [99]:
pal_ = list(sns.color_palette(palette='plasma_r',
                              n_colors=df_questions_XV['author_org'].nunique()).as_hex())

fig = px.pie(df_questions_XV, names='author_org', 
             height=800, width=1000,
             hole=0.6,
             color_discrete_sequence=pal_)

fig.update_traces(hovertemplate=None, textposition='outside', textinfo='percent+label', rotation=50)
fig.update_layout(margin=dict(t=100, b=30, l=0, r=0), showlegend=False,
                        plot_bgcolor='#fafafa',
                        title_font=dict(size=20, color='#555', family="Lato, sans-serif"),
                        font=dict(size=14, color='#4b4d52'),
                        hoverlabel=dict(bgcolor="#444", font_size=13, font_family="Lato, sans-serif"))

func_label = "<br>".join(["Political parties",
                          "Leg. XV"])
fig.add_annotation(dict(x=0.5, y=0.5,  align='center',
                        xref = "paper", yref = "paper",
                        showarrow = False, font_size=30,
                        text=func_label))

#fig.update_layout(plot_bgcolor='#fafafa', legend=dict(orientation="h", yanchor="bottom", y=0.3, xanchor="right", x=2))
fig.show()

fig.write_image('../figures/dep_polit_parties_XV.png', scale=4)

## <a id="authors-gender-per-section">3. Distribution of authors' gender per question section</a> ([&uarr;](#content))

In [8]:
# number of sections in each legislature 
n_sections_XIV = df_XIV_merged['section'].nunique()
n_sections_XV = df_XV_merged['section'].nunique()

In [92]:
# keep only the questions belonging to the 10 most frequent sections
lst_sections_ordered = df_XIV_merged['section'].value_counts().index.to_list()[:10]
reduced_df = df_XIV_merged[df_XIV_merged['section'].isin(lst_sections_ordered)]

# create a dataframe to use for the plot
apply_func = lambda x: 'female' if x=='Mme' else 'male'
df_g = make_plot_df(reduced_df, group1='section', group2='civ', apply_col='civ', apply_func=apply_func)


# make a plot 
fig = px.bar(df_g, x='section', y=['Counts'], color='civ', 
             text=df_g['Percentage'].apply(lambda x: '{0:1.2f}%'.format(x)), 
             color_discrete_sequence=pal_,
             labels={'civ':'Gender'},
             #title='Leg. XIV: Distribution of questions w.r.t. sections'
             )


fig.update_layout(margin=dict(t=100, b=30, l=100, r=0), showlegend=True, height=600, width=1100,
                        plot_bgcolor='white', 
                        #paper_bgcolor='#fafafa',
                        title_font=dict(size=20, color='#555', family="Lato, sans-serif"),
                        font=dict(size=16, color='#4b4d52'),
                        #xaxis_title_text='Legislature',
                        yaxis_title_text = "Count",bargap=.1,
                        xaxis={'categoryorder': 'total descending'},
                        hoverlabel=dict(bgcolor="#444", font_size=13, font_family="Lato, sans-serif"))

fig.show()
#fig.write_image('../figures/sections_per_gender_XIV.png', scale=4)

In [93]:
# this time, keep all sections (no reduction to the top 10)
reduced_df = df_XIV_merged

# create a dataframe to use for the plot
apply_func = lambda x: 'female' if x=='Mme' else 'male'
df_g = make_plot_df(reduced_df, group1='section', group2='civ', apply_col='civ', apply_func=apply_func)



# make a plot 
fig = px.bar(df_g, y='section', x=['Counts'], color='civ', 
             text=df_g['Percentage'].apply(lambda x: '{0:1.2f}%'.format(x)), 
             color_discrete_sequence=pal_,
             labels={'civ':'Gender'},
             orientation='h',
             title='Distribution of sections by gender in leg. XIV'
             )


fig.update_layout(margin=dict(t=100, b=30, l=100, r=0), showlegend=True, height=5000, width=1000,
                        plot_bgcolor='white', 
                        #paper_bgcolor='#fafafa',
                        title_font=dict(size=20, color='#555', family="Lato, sans-serif"),
                        font=dict(size=10, color='#4b4d52'),
                        #xaxis_title_text='Legislature',
                        xaxis_title_text = "Count",bargap=.1,  # horizontal bars: counts are on the x-axis
                        yaxis={'categoryorder': 'total ascending'},
                        hoverlabel=dict(bgcolor="#444", font_size=13, font_family="Lato, sans-serif"))

fig.show()

#fig.write_image('../figures/sections_per_gender_XV.png', scale=4)

**Distribution of sections by gender in the 15th legislature**

In [95]:
# keep only the questions belonging to the 10 most frequent sections
lst_sections_ordered = df_XV_merged['section'].value_counts().index.to_list()[:10]
reduced_df = df_XV_merged[df_XV_merged['section'].isin(lst_sections_ordered)]

# create a dataframe to use for the plot
apply_func = lambda x: 'female' if x=='Mme' else 'male'
df_g = make_plot_df(reduced_df, group1='section', group2='civ', apply_col='civ', apply_func=apply_func)


# make a plot 
fig = px.bar(df_g, x='section', y=['Counts'], color='civ', 
             text=df_g['Percentage'].apply(lambda x: '{0:1.2f}%'.format(x)), 
             color_discrete_sequence=pal_,
             labels={'civ':'Gender'},
             #title='Leg. XV: Distribution of questions w.r.t. sections'
             )


fig.update_layout(margin=dict(t=100, b=30, l=100, r=0), showlegend=True, height=600, width=1100,
                        plot_bgcolor='white', 
                        #paper_bgcolor='#fafafa',
                        title_font=dict(size=20, color='#555', family="Lato, sans-serif"),
                        font=dict(size=16, color='#4b4d52'),
                        #xaxis_title_text='Legislature',
                        yaxis_title_text = "Count",bargap=.1,
                        xaxis={'categoryorder': 'total descending'},
                        hoverlabel=dict(bgcolor="#444", font_size=13, font_family="Lato, sans-serif"))

fig.show()

#fig.write_image('../figures/sections_per_gender_XV.png', scale=4)

In [94]:
# this time, keep all sections (no reduction to the top 10)
reduced_df = df_XV_merged

# create a dataframe to use for the plot
apply_func = lambda x: 'female' if x=='Mme' else 'male'
df_g = make_plot_df(reduced_df, group1='section', group2='civ', apply_col='civ', apply_func=apply_func)

# make a plot 
fig = px.bar(df_g, y='section', x=['Counts'], color='civ', 
             text=df_g['Percentage'].apply(lambda x: '{0:1.2f}%'.format(x)), 
             color_discrete_sequence=pal_,
             labels={'civ':'Gender'},
             orientation='h', 
             title='Distribution of sections by gender in leg. XV'
             )


fig.update_layout(margin=dict(t=100, b=30, l=100, r=0), showlegend=True, height=5000, width=1000,
                        plot_bgcolor='white', 
                        #paper_bgcolor='#fafafa',
                        title_font=dict(size=20, color='#555', family="Lato, sans-serif"),
                        font=dict(size=10, color='#4b4d52'),
                        #xaxis_title_text='Legislature',
                        xaxis_title_text = "Count",bargap=.1,  # horizontal bars: counts are on the x-axis
                        yaxis={'categoryorder': 'total ascending'},
                        hoverlabel=dict(bgcolor="#444", font_size=13, font_family="Lato, sans-serif"))

fig.show()

#fig.write_image('../figures/sections_per_gender_XV.png', scale=4)

**Conclusion:**

In each section, the majority of questions are asked by men. This raises a concern: any bias towards men measured in embeddings trained on such a corpus is likely due simply to the fact that most question authors are men.

One solution is therefore to reduce the size of the corpus by downsampling it so that, within each section, the number of questions asked by men equals the number of questions asked by women.
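
The balancing idea described above could be sketched as follows. `balance_by_gender` is a hypothetical helper, not part of the notebook, and the toy rows are invented. It assumes that a section in which one gender asked no question at all is dropped, and that a fixed random seed is acceptable for the sampling.

```python
import pandas as pd

def balance_by_gender(df, section_col='section', gender_col='civ', seed=0):
    """Downsample so that each section has as many questions per gender.

    Hypothetical helper: within every section, each gender is sampled down
    to the size of the smallest gender group of that section. Sections in
    which one gender is absent are dropped entirely.
    """
    n_genders = df[gender_col].nunique()
    balanced = []
    for _, sec in df.groupby(section_col):
        sizes = sec[gender_col].value_counts()
        if len(sizes) < n_genders:
            continue  # one gender never asked a question in this section
        n = sizes.min()
        for _, grp in sec.groupby(gender_col):
            balanced.append(grp.sample(n=n, random_state=seed))
    if not balanced:
        return df.iloc[0:0]  # nothing could be balanced: empty frame, same columns
    return pd.concat(balanced, ignore_index=True)

# toy data (illustrative values only)
toy = pd.DataFrame({
    'section': ['santé'] * 5 + ['élevage'] * 3 + ['transports'] * 2,
    'civ': ['M.', 'M.', 'M.', 'Mme', 'Mme',   # santé: 3 M., 2 Mme
            'M.', 'M.', 'Mme',                # élevage: 2 M., 1 Mme
            'M.', 'M.'],                      # transports: M. only
    'q_text': [f'q{i}' for i in range(10)],
})
balanced = balance_by_gender(toy)
print(balanced.groupby(['section', 'civ']).size())
```

On the real data this would be called as `balance_by_gender(df_XIV_merged)` or `balance_by_gender(df_XV_merged)`. The cost is that the balanced corpus can be much smaller than the original one, since every section shrinks to twice the size of its smaller gender group.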

## <a id="authors-usage-of-gender-words">4. Authors' usage of gender-related words </a> ([&uarr;](#content))

To study how authors of each gender use gender-related words, we first collect the question texts and build a suitable dataframe. We also define two lists of gender-related words: female-related words and male-related words.

In [42]:
# gender-related words (French)
fem_relative_words = "femme fille femmes filles mère mères maman tante soeur soeurs grand-mère grand-mères".split()
male_relative_words = "homme hommes fils garçon garçons père pères papa oncle frère frères grand-père grand-pères".split()

# tokenised questions asked by women and by men
# (note: both are built from df_XV_merged, i.e. the 15th legislature)
female_q_XV = df_XV_merged[df_XV_merged['civ']=='Mme']['q_text'].apply(lambda x: x.split())
male_q_XV = df_XV_merged[df_XV_merged['civ']=='M.']['q_text'].apply(lambda x: x.split())

# gender-related words used by each gender
female_used_fem_words = [word for question in female_q_XV for word in question if word in fem_relative_words]
male_used_fem_words = [word for question in male_q_XV for word in question if word in fem_relative_words]
female_used_male_words = [word for question in female_q_XV for word in question if word in male_relative_words]
male_used_male_words = [word for question in male_q_XV for word in question if word in male_relative_words]

# creating a dataframe
female_used_gender_words = pd.DataFrame({'gen_words':female_used_male_words+female_used_fem_words, 'category':['male']*len(female_used_male_words) + ['female']*len(female_used_fem_words)})
male_used_gender_words = pd.DataFrame({'gen_words':male_used_male_words+male_used_fem_words, 'category':['male']*len(male_used_male_words) + ['female']*len(male_used_fem_words)})

female_used_gender_words = pd.concat([female_used_gender_words, pd.DataFrame({'user':['female']*len(female_used_gender_words)})], axis=1)
male_used_gender_words = pd.concat([male_used_gender_words, pd.DataFrame({'user':['male']*len(male_used_gender_words)})], axis=1)

df_words_per_gender = pd.concat([female_used_gender_words, male_used_gender_words], axis=0)
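
For comparison, a long-format table with the same three columns (`gen_words`, `category`, `user`) could also be built in a single pass with `explode`. This is only a sketch on toy sentences: the two word sets are small subsets of the lists above, and the names `toy`, `tokens`, and `df_words` are hypothetical.

```python
import pandas as pd

fem_words = {'femme', 'femmes', 'mère'}      # subset of fem_relative_words
male_words = {'homme', 'hommes', 'père'}     # subset of male_relative_words

# toy questions (illustrative text only)
toy = pd.DataFrame({
    'civ': ['Mme', 'M.'],
    'q_text': ["la mère et le père de l' enfant", "les hommes et les femmes"],
})

# one row per token, keeping the author's civility next to each token
tokens = toy.assign(gen_words=toy['q_text'].str.split()).explode('gen_words')

# label each token as a female word, a male word, or neither (missing value)
tokens['category'] = tokens['gen_words'].map(
    lambda w: 'female' if w in fem_words else 'male' if w in male_words else None
)
tokens['user'] = tokens['civ'].map({'Mme': 'female', 'M.': 'male'})

# keep only the gender-related tokens
df_words = tokens.dropna(subset=['category'])[['gen_words', 'category', 'user']]
df_words = df_words.reset_index(drop=True)
print(df_words)
```

Using sets for the word lists makes each membership test constant-time, which matters when every token of every question is checked.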

In [96]:
# create a dataframe for the plot 
df_g = make_plot_df(df_words_per_gender, group1='user', group2='category')

fig = px.bar(df_g, x='user', y=['Counts'], color='category', 
             text=df_g['Percentage'].apply(lambda x: '{0:1.2f}%'.format(x)), 
             color_discrete_sequence=pal_,
             labels={'category':'gender words'})

fig.update_layout(margin=dict(t=100, b=30, l=100, r=0), showlegend=True, height=400, width=800,
                        plot_bgcolor='white', 
                        #paper_bgcolor='#fafafa',
                        title_font=dict(size=20, color='#555', family="Lato, sans-serif"),
                        font=dict(size=16, color='#4b4d52'),
                        xaxis_title_text='Author gender',
                        yaxis_title_text = "Count",
                        hoverlabel=dict(bgcolor="#444", font_size=13, font_family="Lato, sans-serif"))

fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgray')
fig.show()
#fig.write_image('../figures/deputies_usage_of_gender_words.png', scale=4)
