The data on employees with disabilities and on the overall workforce is stored in two separate CSV files. The first step is to read both files into pandas dataframes.

In [2]:
import pandas as pd
file_path_1 = '../data/raw/EDF/bilan-social-d-edf-sa-effectifs-et-repartition-par-age-statut-et-sexe.csv'
file_path_2 = '../data/raw/EDF/bilan-social-d-edf-sa-salaries-en-situation-de-handicap.csv'
df_age_statut_sexe = pd.read_csv(file_path_1, delimiter=';')
df_handicap = pd.read_csv(file_path_2, delimiter=';')
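
As a minimal sketch of what the `delimiter=';'` argument does, the snippet below parses a small in-memory sample instead of the real files. The sample rows and the name `df_sample` are hypothetical and only mimic the semicolon-separated layout assumed above.

```python
import io
import pandas as pd

# Hypothetical sample mimicking a semicolon-separated export (not real EDF data)
sample = io.StringIO(
    "Année;Indicateur;Valeur\n"
    "2019;Effectif;100\n"
    "2019;Salariés en situation de handicap;5\n"
)

# Without sep=';' pandas would read each line as a single comma-less column
df_sample = pd.read_csv(sample, sep=";")
print(df_sample.dtypes)
```

Checking `dtypes` right after loading is a cheap way to confirm that 'Valeur' was parsed as a number rather than as text.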

We drop all English-language columns from both dataframes.

In [3]:
df_age_statut_sexe_fr=df_age_statut_sexe.drop(['Spatial perimeter','Indicator', 'Type of contract', 
                         'Employee category', 'Employee subcategory', 'Gender','M3E classification', 
                         'Nationality', 'Seniority', 'Age bracket', 'Unit'], axis=1)

df_handicap_fr=df_handicap.drop(['Spatial perimeter', 'Indicator',
       'Type of contract', 'Employee category', 'Gender', 'Unit'], axis=1)

We keep only the columns of interest in both dataframes.

In [4]:
colonnes_a_conserver=['Année', 'Indicateur', 'Valeur']
df1=df_age_statut_sexe_fr[colonnes_a_conserver]
df2=df_handicap_fr[colonnes_a_conserver]
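
Selecting columns by name already discards every column that is not listed, so the selection alone would be enough to remove the English columns. The sketch below illustrates this on a hypothetical toy dataframe (`toy`, with one made-up English duplicate column); it is not the real EDF data.

```python
import pandas as pd

# Hypothetical toy frame with one French/English column pair
toy = pd.DataFrame({
    "Année": [2019, 2020],
    "Indicateur": ["Effectif", "Effectif"],
    "Indicator": ["Headcount", "Headcount"],  # English duplicate, for illustration
    "Valeur": [100, 110],
})

cols = ["Année", "Indicateur", "Valeur"]
# .copy() gives an independent frame, so later edits do not touch `toy`
subset = toy[cols].copy()
print(subset.columns.tolist())
```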

In the workforce dataframe (df1), we are only interested in the 'Effectif' category of the 'Indicateur' column. In the disability dataframe (df2), we are only interested in the 'Salariés en situation de handicap' category of the same column.

In [5]:
df3=df1[df1['Indicateur']=='Effectif']
df4=df2[df2['Indicateur']=='Salariés en situation de handicap']

We now group by year and sum the 'Valeur' column to obtain the total number of employees.

In [6]:
df_grouped = df3.groupby('Année', as_index=False)['Valeur'].sum()
df_grouped.rename(columns={'Valeur': 'Effectif'}, inplace=True)

We do the same for the disability dataframe.

In [7]:
df_grouped_handicap = df4.groupby('Année', as_index=False)['Valeur'].sum()
df_grouped_handicap.rename(columns={'Valeur': 'Effectif_handicap'}, inplace=True)
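
The filter-then-group pattern used in the last three cells can be checked on a small example. This is a sketch on hypothetical numbers: the frame `toy` and the 'Autre' indicator are made up to show that rows with other indicators do not enter the yearly totals.

```python
import pandas as pd

# Hypothetical long-format data: several rows per year, one unrelated indicator
toy = pd.DataFrame({
    "Année": [2019, 2019, 2020, 2020, 2020],
    "Indicateur": ["Effectif", "Effectif", "Effectif", "Effectif", "Autre"],
    "Valeur": [60, 40, 70, 50, 999],
})

# Keep only the indicator of interest, then sum 'Valeur' per year
effectif = toy[toy["Indicateur"] == "Effectif"]
totals = (
    effectif.groupby("Année", as_index=False)["Valeur"].sum()
    .rename(columns={"Valeur": "Effectif"})
)
print(totals)
```

Here the totals are 100 for 2019 and 120 for 2020; the 'Autre' row is excluded by the filter.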

We merge the two dataframes on 'Année' and calculate the percentage of employees with disabilities relative to the total number of employees.

In [8]:
merged_df = pd.merge(df_grouped, df_grouped_handicap, on='Année', how='outer')
merged_df['Pourcentage']=merged_df['Effectif_handicap']/merged_df['Effectif']*100
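
Because the merge is an outer join, a year present in only one of the two dataframes yields a missing value in the percentage. The sketch below, on hypothetical numbers, uses the `indicator=True` option of `pd.merge` to list such years; whether the real data contains any is not known here.

```python
import pandas as pd

# Hypothetical yearly totals; 2021 is missing from the second frame on purpose
totals = pd.DataFrame({"Année": [2019, 2020, 2021], "Effectif": [100, 120, 130]})
handicap = pd.DataFrame({"Année": [2019, 2020], "Effectif_handicap": [5, 9]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(totals, handicap, on="Année", how="outer", indicator=True)
merged["Pourcentage"] = merged["Effectif_handicap"] / merged["Effectif"] * 100

# Years that did not match in both frames end up with NaN in 'Pourcentage'
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["Année", "_merge"]])
```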

We export the cleaned dataframe to a CSV file, ready to be used in Tableau for visualization.

In [9]:
merged_df.to_csv('../data/processed/Fichier1.csv', index=False)

We now turn to the web-scraped data for other companies, which is saved in a CSV file, and load it into a dataframe.

In [10]:
file_path_3= '../data/processed/toutes_entreprises_data_effectif_et_handicap.csv'
df_all = pd.read_csv(file_path_3, delimiter=',')

We group by year, company name ('Perimètre juridique') and 'Indicateur' to calculate the total number of employees and of employees with disabilities.

In [11]:
df_all.groupby(['Année', 'Perimètre juridique', 'Indicateur'], as_index=False)['Valeur'].sum()

Unnamed: 0,Année,Perimètre juridique,Indicateur,Valeur
0,2019,ENGIE,Effectif,4045
1,2019,ENGIE,Salariés en situation de handicap,189
2,2019,Orange,Effectif,79774
3,2019,Orange,Salariés en situation de handicap,5247
4,2020,Auchan,Effectif,173412
5,2020,Auchan,Salariés en situation de handicap,6936
6,2020,Decathlon,Effectif,93710
7,2020,Decathlon,Salariés en situation de handicap,2999
8,2020,ENGIE,Effectif,4131
9,2020,ENGIE,Salariés en situation de handicap,155


We pivot the dataframe so that each year and company has a single row, with one column per indicator.

In [12]:
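df_pivot = df_all.pivot_table(index=['Année', 'Perimètre juridique'], columns='Indicateur', values='Valeur', aggfunc='sum')
# pivot_table returns the indicator columns in sorted order:
# 'Effectif' first, then 'Salariés en situation de handicap'
df_pivot.columns = ['Effectif', 'Effectif_handicap']
df_pivot = df_pivot.reset_index()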
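
The reshaping from long to wide format can be illustrated on a toy example. This is a sketch under assumptions: the companies 'A' and 'B' and their values are hypothetical, and the columns are renamed by name rather than by position, which is an alternative to the cell above and not what it does.

```python
import pandas as pd

# Hypothetical long-format data: one row per (year, company, indicator)
long_df = pd.DataFrame({
    "Année": [2020, 2020, 2020, 2020],
    "Perimètre juridique": ["A", "A", "B", "B"],
    "Indicateur": ["Effectif", "Salariés en situation de handicap"] * 2,
    "Valeur": [200, 8, 100, 3],
})

wide = long_df.pivot_table(
    index=["Année", "Perimètre juridique"],
    columns="Indicateur",
    values="Valeur",
    aggfunc="sum",
)
# Renaming by name does not depend on the order of the pivoted columns
wide = wide.rename(columns={"Salariés en situation de handicap": "Effectif_handicap"})
wide = wide.reset_index()
wide.columns.name = None  # drop the leftover 'Indicateur' axis label

wide["Pourcentage"] = wide["Effectif_handicap"] / wide["Effectif"] * 100
print(wide)
```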

We calculate the percentage of employees with disabilities relative to the total number of employees.

In [13]:
df_pivot['Pourcentage']=df_pivot['Effectif_handicap']/df_pivot['Effectif']*100

We export the dataframe to a CSV file, ready to be used for visualization in Tableau.

In [14]:
df_pivot.to_csv('../data/processed/Fichier2.csv', index=False)

For the real scraped data, the process is nearly identical. We combine the file produced by the other notebook "../notebooks/repertoires_entreprises_pipeline_complet.ipynb" with the previously exported file of EDF percentages, "Fichier1":

In [32]:
# data from other enterprises
file_path_4= '../data/processed/salaries-en-situation-handicap_entreprises-cibles.csv'
df_other = pd.read_csv(file_path_4, delimiter=',')
display(df_other.head(2))
# select only percentages
mask=df_other.Indicateur=="Salariés en situation de handicap (%)"
df_other=df_other.loc[mask]
# reshape data to match "Fichier1"
df_other=df_other.groupby(['Année', 'Perimètre juridique', 'Indicateur'], as_index=False)['Valeur'].sum()
df_pivot = df_other.pivot_table(index=['Année', 'Perimètre juridique'], columns='Indicateur', values='Valeur', aggfunc='sum')
df_pivot.columns = ['Pourcentage']
df_pivot = df_pivot.reset_index()
display(df_pivot.head(2))

# data from EDF 
df_edf = pd.read_csv('../data/processed/Fichier1.csv').loc[:,['Année', 'Pourcentage']]
df_edf['Perimètre juridique']='EDF SA'
display(df_edf.head(2))

# merge EDF and other enterprises data
df_all = df_pivot.merge(df_edf, on=["Année", "Perimètre juridique", "Pourcentage"], how='outer')
display(df_all)

# export to csv for tableau
df_all.to_csv('../data/processed/toutes_entreprises_data_pourcentage_handicap.csv', index=False)

Unnamed: 0,Année,Perimètre juridique,Valeur,Perimètre spatial,Indicateur,Unité
0,2020,Auchan,,France,Salariés en situation de handicap,nombre
1,2021,Auchan,,France,Salariés en situation de handicap,nombre


Unnamed: 0,Année,Perimètre juridique,Pourcentage
0,2020,Auchan,4.0
1,2020,Decathlon,3.2


Unnamed: 0,Année,Pourcentage,Perimètre juridique
0,2017,0.671964,EDF SA
1,2018,0.707893,EDF SA


Unnamed: 0,Année,Perimètre juridique,Pourcentage
0,2017,EDF SA,0.671964
1,2018,EDF SA,0.707893
2,2019,EDF SA,0.721877
3,2020,Auchan,4.0
4,2020,Decathlon,3.2
5,2020,EDF SA,0.75856
6,2021,Auchan,4.6
7,2021,Carrefour,3.4
8,2021,Decathlon,6.2
9,2021,EDF SA,0.778404
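
The final step merges on all three shared columns, so it effectively stacks the two tables. As a sketch, and assuming no identical row appears in both tables, the same result can be obtained with `pd.concat` followed by a sort. The frames `others` and `edf` below hold hypothetical values, not the figures shown above.

```python
import pandas as pd

# Hypothetical percentages for one other company and for EDF SA
others = pd.DataFrame({
    "Année": [2020, 2021],
    "Perimètre juridique": ["A", "A"],
    "Pourcentage": [4.0, 4.6],
})
edf = pd.DataFrame({"Année": [2020, 2021], "Pourcentage": [0.7, 0.8]})
edf["Perimètre juridique"] = "EDF SA"

# concat aligns columns by name, so the different column order in `edf` is fine
stacked = (
    pd.concat([others, edf], ignore_index=True)
    .sort_values(["Année", "Perimètre juridique"])
    .reset_index(drop=True)
)
print(stacked)
```

Stacking makes the intent explicit: rows are appended, and no key matching between the two tables is expected.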
