# <span style="color:blue"> Data Preparation </span>

This notebook describes how we build the datasets that may be used in this project.

In [1]:
# importing external libraries
import os
import json
import re

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# plotly for nice plots
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

# import internal libraries
os.chdir('../')
import lib.text_pre_processing as preprocess
os.chdir('notebooks/')

sns.set(rc={"figure.dpi":200, 'savefig.dpi':200})
plt.rcParams['figure.dpi'] = 200
plt.rcParams['savefig.dpi'] = 200


## <span style="color:blue"> XIV legislature</span>


Loading the questions and actors info (deputies) of the 14th legislature.

In [2]:
# loading the json files 
PATH_JSON_XIV_QUESTIONS = "../data/legislature_XIV/questions.json"
PATH_JSON_XIV_INFO_DEPUTES = "../data/legislature_XIV/info_deputes_senateurs_ministres.json"


# parse each file directly from its handle
with open(PATH_JSON_XIV_QUESTIONS, 'r', encoding='utf-8') as fjson_Q:
    data_questions = json.load(fjson_Q)

with open(PATH_JSON_XIV_INFO_DEPUTES, 'r', encoding='utf-8') as fjson_info:
    data_info_dep = json.load(fjson_info)


In [3]:
# retrieve list of questions and actors
list_questions = data_questions['questionsEcrites']['question']
list_actors = data_info_dep['export']['acteurs']['acteur']

In [4]:
list_questions[0]['textesQuestion']['texteQuestion']['texte']

"M. Jean-Sébastien Vialatte attire l'attention de Mme la ministre des affaires sociales et de la santé sur l'accroissement des coûts liés au remboursement de certains actes de radiothérapie dans le secteur public depuis 2009. La loi de financement de la sécurité sociale pour 2014 pose dans son article 34 le cadre d'une expérimentation dont l'objectif est d'élaborer un nouveau modèle de financement du traitement du cancer par radiothérapie, plus intégré et plus lisible, qui pourra prendre en compte toutes les composantes du parcours de soins lors du traitement du cancer en radiothérapie et accompagner l'évolution des techniques et des prises en charge. Dans cette optique l'Agence technique de l'information hospitalière (ATIH) devait mener une enquête de pratique accompagnée d'une enquête de coûts dont le périmètre devait couvrir le cancer du sein et celui de la prostate pour un début d'expérimentation en 2015. L'absence de mise en place de cette expérimentation, d'une durée de quatre an

### Creating datasets

We need to create two datasets:
* `df_actors`: information about the deputies (the question authors):
  * author_id: id of the deputy
  * civ:  M. or Mme
  * firstname
  * lastname
  * birth_date: date of birth
  * birth_dep: department of birth
  * birth_country: country of birth
  
  <br>
* `df_questions`: information about the questions:
  * q_text: text of the question
  * author_id: id of the author
  * author_org: organisation (political group) of the author
  * author_org_abrev: abbreviated name of that organisation
  * section: question section
  * analysis_head: head of the question
  * answer_min: minister answering the question
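
As a minimal sketch of how one nested actor record maps onto a `df_actors` row, the hypothetical helper `actor_to_row` below flattens the JSON structure into the columns listed above. The sample record is invented for illustration and only mirrors the keys used in this notebook.

```python
def actor_to_row(actor):
    """Flatten one nested actor record into the df_actors columns."""
    ident = actor["etatCivil"]["ident"]
    birth = actor["etatCivil"]["infoNaissance"]
    return {
        "author_id": actor["uid"]["#text"],
        "civ": ident["civ"],
        "firstname": ident["prenom"],
        "lastname": ident["nom"],
        "birth_date": birth["dateNais"],
        "birth_dep": birth["depNais"],
        "birth_country": birth["paysNais"],
    }


# Invented record, for illustration only (not taken from the data)
sample_actor = {
    "uid": {"#text": "PA0000"},
    "etatCivil": {
        "ident": {"civ": "Mme", "prenom": "Jeanne", "nom": "Exemple"},
        "infoNaissance": {
            "dateNais": "1970-01-01",
            "depNais": "Rhône",
            "paysNais": "France",
        },
    },
}

row = actor_to_row(sample_actor)
```

Collecting such rows in a list and passing it to `pd.DataFrame` would give the same table as the column-wise dictionary used in the cells that follow.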



*Actors (deputies) dataset*

In [22]:
# prepare dictionary of actors for the dataframe
info_actors = {'author_id':[], 'civ':[], 'firstname':[], 'lastname':[], 'birth_date':[], 'birth_dep': [], 'birth_country':[]}
for actor in list_actors:
    author_id = actor['uid']['#text']
    civ = actor['etatCivil']['ident']['civ']
    firstname = actor['etatCivil']['ident']['prenom']
    lastname = actor['etatCivil']['ident']['nom']
    birth_date = actor['etatCivil']['infoNaissance']['dateNais']
    birth_dep = actor['etatCivil']['infoNaissance']['depNais']
    birth_country = actor['etatCivil']['infoNaissance']['paysNais']

    info_actors['author_id'].append(author_id)
    info_actors['civ'].append(civ)
    info_actors['firstname'].append(firstname)
    info_actors['lastname'].append(lastname)
    info_actors['birth_date'].append(birth_date)
    info_actors['birth_dep'].append(birth_dep)
    info_actors['birth_country'].append(birth_country)

# creating dataframes
df_actors = pd.DataFrame(info_actors)

# ensure there are no duplicate instances
len(df_actors) == df_actors['author_id'].nunique()

True

*Questions dataset*

In [4]:
# prepare dictionary of questions for the dataframe
info_questions = {'q_text':[], 'author_id':[], 'author_org':[], 'author_org_abrev':[], 'section':[], 'analysis_head':[], 'answer_min':[]}
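for question in list_questions:
    # `texteQuestion` is either a dict or a list of dicts; in the latter case keep the first text
    try:
        q_text = question['textesQuestion']['texteQuestion']['texte']
    except TypeError:
        q_text = question['textesQuestion']['texteQuestion'][0]['texte']

    # keep only the questions whose text is a string
    if isinstance(q_text, str):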
        author_id = question['auteur']['identite']['acteurRef']
        author_org = question['auteur']['groupe']['developpe']
        author_org_abrev = question['auteur']['groupe']['abrege']
        section = question['indexationAN']['rubrique']
        analysis_head = question['indexationAN']['teteAnalyse']
        answer_min = question['minInt']['abrege']

        info_questions['q_text'].append(q_text)
        info_questions['author_id'].append(author_id)
        info_questions['author_org'].append(author_org)
        info_questions['author_org_abrev'].append(author_org_abrev)
        info_questions['section'].append(section)
        info_questions['analysis_head'].append(analysis_head)
        info_questions['answer_min'].append(answer_min)

# creating dataframes
df_questions = pd.DataFrame(info_questions)

print('Number of questions : ', len(df_questions))

Number of questions :  104142
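
The extraction loop above copes with `texteQuestion` being either a single dict or a list of dicts. A minimal sketch of the same logic without exception handling is shown below; the helper name `get_question_text` and the two sample records are hypothetical.

```python
def get_question_text(question):
    """Return the question text whether `texteQuestion` is a dict or a list of dicts."""
    block = question["textesQuestion"]["texteQuestion"]
    if isinstance(block, list):
        block = block[0]  # same choice as the loop above: keep the first text
    return block["texte"]


# Invented records, for illustration only
single = {"textesQuestion": {"texteQuestion": {"texte": "texte A"}}}
multiple = {"textesQuestion": {"texteQuestion": [{"texte": "texte B"}, {"texte": "texte C"}]}}

text_single = get_question_text(single)
text_multiple = get_question_text(multiple)
```

Checking the type explicitly avoids hiding unrelated errors, which a bare `except` would silently reroute to the list branch.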


In [9]:
pal_ = list(sns.color_palette(palette='plasma_r',
                              n_colors=df_questions['author_org'].nunique()).as_hex())

fig = px.pie(df_questions, names='author_org', 
             height=800, width=1000,
             hole=0.6,
             color_discrete_sequence=pal_)

fig.update_traces(hovertemplate=None, textposition='outside', textinfo='percent+label', rotation=50)
fig.update_layout(margin=dict(t=100, b=30, l=0, r=0), showlegend=False,
                        plot_bgcolor='#fafafa',
                        title_font=dict(size=20, color='#555', family="Lato, sans-serif"),
                        font=dict(size=14, color='#4b4d52'),
                        hoverlabel=dict(bgcolor="#444", font_size=13, font_family="Lato, sans-serif"))

func_label = "<br>".join(["Political parties", "Leg. XIV"])
fig.add_annotation(dict(x=0.5, y=0.5,  align='center',
                        xref = "paper", yref = "paper",
                        showarrow = False, font_size=30,
                        text=func_label))

#fig.update_layout(plot_bgcolor='#fafafa', legend=dict(orientation="h", yanchor="bottom", y=0.3, xanchor="right", x=2))
fig.show()

#fig.write_image('../figures/dep_polit_parties_XIV.png', scale=3)

### Pre-processing questions text

We are only interested in learning embeddings of the words in the questions asked by deputies. To this end, we first prepare the data for the embedding models.

As a first step, we clean the text by:
* removing HTML tags
* fixing some anomalies: `\xa0` and `\'` (replaced by `'`)
* removing URL links
* converting the text to lowercase
* inserting a space between words and punctuation
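
The implementation of `lib.text_pre_processing` is not shown in this notebook. As a rough sketch of the steps listed above, the hypothetical function `clean_text` below uses only the standard library: a regular expression stands in for BeautifulSoup, and the exact punctuation set is an assumption.

```python
import re


def clean_text(text):
    """Illustrative cleaning pipeline; not the project's `preprocess` module."""
    text = re.sub(r"<[^>]+>", " ", text)                   # drop HTML tags (regex stand-in for BeautifulSoup)
    text = text.replace("\xa0", " ").replace("\\'", "'")   # non-breaking spaces and escaped apostrophes
    text = re.sub(r"https?://\S+", " ", text)              # drop URLs before touching punctuation
    text = text.lower()
    text = re.sub(r"([.,;:!?()])", r" \1 ", text)          # isolate punctuation marks
    text = text.replace("'", "' ")                         # split elisions: l'attention -> l' attention
    return re.sub(r"\s+", " ", text).strip()               # collapse repeated whitespace


raw = "<p>Il attire l\\'attention de Mme la ministre,\xa0voir https://example.org !</p>"
cleaned = clean_text(raw)
print(cleaned)  # il attire l' attention de mme la ministre , voir !
```

URLs are removed before punctuation is spaced out; in the opposite order the dots and colons inside a URL would be split apart and the URL pattern would no longer match.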


In [5]:
text_list = list(df_questions['q_text'])

# remove html tags using BeautifulSoup
html_q = preprocess.remove_tags(text_list)

# removing some anomalies: \xa0, apostrophes (replace \' by '), url
cln_corpus = preprocess.remove_anomalies(html_q)

# inserting the cleaned text into the dataframe
df_questions['q_text'] = cln_corpus

In [7]:
df_questions.head(3)

| | q_text | author_id | author_org | author_org_abrev | section | analysis_head | answer_min |
|---|---|---|---|---|---|---|---|
| 0 | des affaires sociales et de la santé sur l' ac... | PA2907 | Les Républicains | LES-REP | santé | remboursement | Affaires sociales et santé |
| 1 | de l' agriculture , de l' agroalimentaire et d... | PA607595 | Socialiste, républicain et citoyen | SRC | élevage | bovins | Agriculture, agroalimentaire et forêt |
| 2 | du travail , de l' emploi , de la formation pr... | PA343623 | Union pour un Mouvement Populaire | UMP | entreprises | entreprises en difficulté | Travail, emploi, formation professionnelle et ... |


In [8]:
# save the dataframes as CSV files (paths are relative to the notebooks/ directory)
df_questions.to_csv("../data/legislature_XIV/df_questions.csv", index=False)
df_actors.to_csv("../data/legislature_XIV/df_actors.csv", index=False)

## <span style='color:blue'> XV legislature </span>
We repeat the same steps for the 15th legislature. Here, each question and each actor is stored in its own JSON file.

*Questions dataset*

In [11]:
# loading questions
PATH_JSON_XV_QUESTIONS = "../data/legislature_XV/questions"

# prepare dictionary of questions for the dataframe
info_questions = {'q_text':[], 'author_id':[], 'author_org':[], 'author_org_abrev':[], 'section':[], 'analysis_head':[], 'answer_min':[]}

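for filename in os.listdir(PATH_JSON_XV_QUESTIONS):
    # each file holds a single question; the context manager closes it after parsing
    with open(os.path.join(PATH_JSON_XV_QUESTIONS, filename), 'r') as fjson:
        question = json.load(fjson)
    # `texteQuestion` is either a dict or a list of dicts; in the latter case keep the first text
    try:
        q_text = question['question']['textesQuestion']['texteQuestion']['texte']
    except TypeError:
        q_text = question['question']['textesQuestion']['texteQuestion'][0]['texte']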
    author_id = question['question']['auteur']['identite']['acteurRef']
    author_org = question['question']['auteur']['groupe']['developpe']
    author_org_abrev = question['question']['auteur']['groupe']['abrege']
    section = question['question']['indexationAN']['rubrique']
    analysis_head = question['question']['indexationAN']['teteAnalyse']
    answer_min = question['question']['minInt']['abrege']

    info_questions['q_text'].append(q_text)
    info_questions['author_id'].append(author_id)
    info_questions['author_org'].append(author_org)
    info_questions['author_org_abrev'].append(author_org_abrev)
    info_questions['section'].append(section)
    info_questions['analysis_head'].append(analysis_head)
    info_questions['answer_min'].append(answer_min)

    
# creating dataframes
df_questions = pd.DataFrame(info_questions)

print('Number of questions : ', len(df_questions))  
    

Number of questions :  45665
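
For the one-file-per-record layout used by this legislature, a small generator keeps the file handling in one place. The sketch below is hypothetical: the name `iter_json_records`, the `.json` extension filter and the UTF-8 encoding are assumptions, and the demo files are invented.

```python
import json
import tempfile
from pathlib import Path


def iter_json_records(directory):
    """Yield the parsed content of every .json file in `directory`, in name order."""
    for path in sorted(Path(directory).glob("*.json")):
        # the context manager closes each file as soon as it is parsed
        with path.open("r", encoding="utf-8") as fjson:
            yield json.load(fjson)


# Demo on a throwaway directory holding two invented one-question files
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, text in [("Q1.json", "premier texte"), ("Q2.json", "second texte")]:
        payload = {"question": {"textesQuestion": {"texteQuestion": {"texte": text}}}}
        Path(tmp_dir, name).write_text(json.dumps(payload), encoding="utf-8")

    records = list(iter_json_records(tmp_dir))

texts = [r["question"]["textesQuestion"]["texteQuestion"]["texte"] for r in records]
```

Sorting the paths makes the row order reproducible, which `os.listdir` does not guarantee.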


In [12]:
pal_ = list(sns.color_palette(palette='plasma_r',
                              n_colors=df_questions['author_org'].nunique()).as_hex())

fig = px.pie(df_questions, names='author_org', 
             height=800, width=1000,
             hole=0.6,
             color_discrete_sequence=pal_)

fig.update_traces(hovertemplate=None, textposition='outside', textinfo='percent+label', rotation=50)
fig.update_layout(margin=dict(t=100, b=30, l=0, r=0), showlegend=False,
                        plot_bgcolor='#fafafa',
                        title_font=dict(size=20, color='#555', family="Lato, sans-serif"),
                        font=dict(size=14, color='#4b4d52'),
                        hoverlabel=dict(bgcolor="#444", font_size=13, font_family="Lato, sans-serif"))

func_label = "<br>".join(["Political parties", "Leg. XV"])
fig.add_annotation(dict(x=0.5, y=0.5,  align='center',
                        xref = "paper", yref = "paper",
                        showarrow = False, font_size=30,
                        text=func_label))

#fig.update_layout(plot_bgcolor='#fafafa', legend=dict(orientation="h", yanchor="bottom", y=0.3, xanchor="right", x=2))
fig.show()

fig.write_image('../figures/dep_polit_parties_XV.png', scale=3)

In [13]:
text_list = list(df_questions['q_text'])

# remove html tags using BeautifulSoup
html_q = preprocess.remove_tags(text_list)

# removing some anomalies: \xa0, apostrophes (replace \' by '), url
cln_corpus = preprocess.remove_anomalies(html_q)

# inserting the clean text into the dataframe
df_questions['q_text'] = cln_corpus

In [18]:
df_questions.head(3)

| | q_text | author_id | author_org | author_org_abrev | section | analysis_head | answer_min |
|---|---|---|---|---|---|---|---|
| 0 | du travail , de l' emploi et de l' insertion c... | PA720814 | La République en Marche | LAREM | entreprises | | Travail, emploi et insertion |
| 1 | d' état , ministre de l' intérieur , sur le no... | PA330240 | Les Républicains | LR | étrangers | | Intérieur |
| 2 | des solidarités et de la santé sur la prise en... | PA721726 | La République en Marche | LAREM | assurance maladie maternité | | Solidarités et santé |


*Actors dataset*

In [11]:
# loading info actors
PATH_JSON_XV_ACTORS = "../data/legislature_XV/info_deputes_senateurs_ministres/acteur"

# prepare dictionary of actors for the dataframe
info_actors = {'author_id':[], 'civ':[], 'firstname':[], 'lastname':[], 'birth_date':[], 'birth_dep': [], 'birth_country':[]}

for filename in os.listdir(PATH_JSON_XV_ACTORS):
    # each file holds a single actor; the context manager closes it after parsing
    with open(os.path.join(PATH_JSON_XV_ACTORS, filename), 'r') as fjson:
        actor = json.load(fjson)

    author_id = actor['acteur']['uid']['#text']
    civ = actor['acteur']['etatCivil']['ident']['civ']
    firstname = actor['acteur']['etatCivil']['ident']['prenom']
    lastname = actor['acteur']['etatCivil']['ident']['nom']
    birth_date = actor['acteur']['etatCivil']['infoNaissance']['dateNais']
    birth_dep = actor['acteur']['etatCivil']['infoNaissance']['depNais']
    birth_country = actor['acteur']['etatCivil']['infoNaissance']['paysNais']
    # replace any non-string birth information with None
    if not isinstance(birth_date, str):
        birth_date = None
    if not isinstance(birth_dep, str):
        birth_dep = None
    if not isinstance(birth_country, str):
        birth_country = None

    info_actors['author_id'].append(author_id)
    info_actors['civ'].append(civ)
    info_actors['firstname'].append(firstname)
    info_actors['lastname'].append(lastname)
    info_actors['birth_date'].append(birth_date)
    info_actors['birth_dep'].append(birth_dep)
    info_actors['birth_country'].append(birth_country)

# creating dataframes
df_actors = pd.DataFrame(info_actors)

# ensure there are no duplicate instances
len(df_actors) == df_actors['author_id'].nunique()
    

True

*Save datasets*

In [17]:
# save the dataframes as CSV files (paths are relative to the notebooks/ directory)
df_questions.to_csv("../data/legislature_XV/df_questions.csv", index=False)
df_actors.to_csv("../data/legislature_XV/df_actors.csv", index=False)

**Load the dataframes, extract questions and save them in a text file**

The text file will be used later for domain-adaptive pre-training and fine-tuning of transformer-based language models.

In [37]:
import pandas as pd
import re 

df_questions_XIV = pd.read_csv("../data/legislature_XIV/df_questions.csv")
df_questions_XV = pd.read_csv("../data/legislature_XV/df_questions.csv")
df_actors_XIV = pd.read_csv("../data/legislature_XIV/df_actors.csv")
df_actors_XV = pd.read_csv("../data/legislature_XV/df_actors.csv")

In [38]:
# additional pre-processing
def more_preprocess(string):
    """Tighten punctuation spacing and map gendered titles and pronouns to neutral tokens."""
    string = re.sub(r' ([:.,!?)])', r'\1', string)   # remove the space before closing punctuation
    string = re.sub(r'([(]) ', r'\1', string)        # remove the space after an opening parenthesis
    string = string.replace("' ", "'")               # re-attach elisions: "l' emploi" -> "l'emploi"
    string = string.replace(" mr ", ' mr-mme ')      # gendered titles -> neutral token
    string = string.replace(" mme ", ' mr-mme ')
    string = string.replace(" il ", ' il-elle ')     # gendered pronouns -> neutral token
    string = string.replace(" elle ", ' il-elle ')
    return string
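
To make the effect of `more_preprocess` concrete, the sketch below restates the function so that the block is self-contained and applies it to an invented sentence in the cleaned format (lowercase, spaces around punctuation).

```python
import re


def more_preprocess(string):
    string = re.sub(r' ([:.,!?)])', r'\1', string)   # remove the space before closing punctuation
    string = re.sub(r'([(]) ', r'\1', string)        # remove the space after an opening parenthesis
    string = string.replace("' ", "'")               # re-attach elisions
    string = string.replace(" mr ", ' mr-mme ')
    string = string.replace(" mme ", ' mr-mme ')
    string = string.replace(" il ", ' il-elle ')
    string = string.replace(" elle ", ' il-elle ')
    return string


# Invented sentence, for illustration only
sample = "l' attention de mme la ministre ( santé ) , il demande ."
result = more_preprocess(sample)
print(result)  # l'attention de mr-mme la ministre (santé), il-elle demande.
```

Because the replacements match on surrounding spaces, a title or pronoun at the very start or end of a text, or directly next to punctuation, is left unchanged.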

In [39]:
df_questions_XIV['q_text'] = df_questions_XIV['q_text'].apply(more_preprocess)
df_questions_XV['q_text'] = df_questions_XV['q_text'].apply(more_preprocess)

In [46]:
df_clf_XIV = pd.merge(df_questions_XIV, df_actors_XIV, on='author_id')
df_clf_XV = pd.merge(df_questions_XV, df_actors_XV, on='author_id')

df_clf_XIV = df_clf_XIV[['q_text', 'civ']]
df_clf_XV = df_clf_XV[['q_text', 'civ']]
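
`pd.merge` defaults to an inner join, so questions whose `author_id` has no match in the actors table are dropped without warning. As a hedged sketch on invented toy data, the variant below uses a left join with the `validate` and `indicator` options to surface such rows before discarding them.

```python
import pandas as pd

# Invented toy data, for illustration only
questions = pd.DataFrame({"q_text": ["q1", "q2", "q3"],
                          "author_id": ["PA1", "PA2", "PA9"]})
actors = pd.DataFrame({"author_id": ["PA1", "PA2"],
                       "civ": ["M.", "Mme"]})

# validate raises an error if an author_id appears more than once in `actors`;
# indicator adds a `_merge` column telling where each row was found
merged = pd.merge(questions, actors, on="author_id", how="left",
                  validate="many_to_one", indicator=True)

unmatched = merged[merged["_merge"] == "left_only"]              # questions without a known author
df_clf = merged.loc[merged["_merge"] == "both", ["q_text", "civ"]]
```

Inspecting `unmatched` shows how many questions would be lost; `df_clf` then matches what the inner join produces.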

In [48]:
# save the classification dataframes as CSV files
df_clf_XIV.to_csv("../data/legislature_XIV/df_clf_XIV.csv", index=False)
df_clf_XV.to_csv("../data/legislature_XV/df_clf_XV.csv", index=False)
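
The cell that writes the questions to a text file is not shown in this notebook. As a minimal sketch, assuming a layout of one question per line in UTF-8 (an assumption, not the project's confirmed format), a hypothetical helper `write_corpus` could look like this:

```python
import tempfile
from pathlib import Path


def write_corpus(texts, path):
    """Write one question per line, skipping missing or empty entries."""
    n_written = 0
    with open(path, "w", encoding="utf-8") as fout:
        for text in texts:
            if not isinstance(text, str) or not text.strip():
                continue  # e.g. NaN values read back from a CSV file
            fout.write(" ".join(text.split()) + "\n")  # collapse internal line breaks
            n_written += 1
    return n_written


# Demo on a throwaway file with invented questions
with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = Path(tmp_dir) / "questions.txt"
    n_written = write_corpus(["première question.", float("nan"), "seconde\nquestion."],
                             corpus_path)
    lines = corpus_path.read_text(encoding="utf-8").splitlines()
```

In this notebook it could be called on a `q_text` column, for example `write_corpus(df_questions_XIV['q_text'], path)`, with an output path of your choosing.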