# Project: What promotes higher rates of family female workers?

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

In this project, we are going to _analyse the impact of several indicators on the number of family female workers_. The data is extracted from [GapMinder](https://www.gapminder.org/data/).

The purpose of this analysis is to understand:
- Which factors make a country more likely to have a higher share of female family workers?
- How did the female/male family workers ratio evolve over time in developed and emerging countries?

To that end, we analyse _economy, education, equality and society_ indicators to estimate how they relate to the rate of female family workers. We select 10 countries around the world to frame this analysis.

####  Scope

> The 10 countries we will keep in the dataset for this analysis are:
1. Sweden
2. Germany
3. Belgium
4. Italy
5. Senegal
6. India
7. USA
8. Brazil
9. Syria
10. Australia

#### Questions

In this analysis, we will attempt to answer the following detailed questions:

1. How do the 10 countries rank by % of female family workers?
2. Which indicator has the highest correlation with the % of female family workers?
3. What is the male/female ratio of family workers?
4. What level of income/HDI goes with the most equal male/female ratio of family workers?
5. How did the male/female ratio of family workers evolve in developed and emerging countries in the past vs. today?

#### Data collection

> **Datasets**: we have downloaded 5 datasets from GapMinder in order to perform this analysis:
>
> - Female Family workers
> - Male Family workers
> - Income
> - Gender ratio of mean years at school (25 - 34 years)
> - Human development Index

In [None]:
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import functools 
%matplotlib inline

# import datasets
df_female = pd.read_csv('female_family_workers_percent_of_female_employment.csv')
df_male = pd.read_csv('male_family_workers_percent_of_male_employment.csv')
df_income = pd.read_csv('income_per_person_gdppercapita_ppp_inflation_adjusted.csv')
df_school = pd.read_csv('mean_years_in_school_women_percent_men_25_to_34_years.csv')
df_hdi = pd.read_csv('hdi_human_development_index.csv')

<a id='wrangling'></a>
## Data Wrangling

#### Assessment
> Checking the info of each dataframe, we notice:
> - All these datasets appear to share the same structure: countries in rows and years in columns
> - Indicator values are stored as floats, so no type conversion is needed
> - Data is more complete for recent years
> - Some countries may be missing
> - School years and HDI data only run until 2015
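
The observations above can also be checked programmatically. The sketch below uses a tiny hypothetical frame (`toy`, with made-up values) shaped like the GapMinder files, and reports whether the year columns are floats and how much data is missing per year. The names `toy`, `year_cols` and `missing_share` are illustrative and not part of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data in the same wide layout as the GapMinder files:
# one row per country, one column per year (values are made up).
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "1995": [np.nan, 10.0, np.nan, 4.0],
    "2005": [12.0, 9.0, np.nan, 3.5],
    "2015": [11.0, 8.0, 20.0, 3.0],
})

# Every column except 'country' is expected to hold floats.
year_cols = [c for c in toy.columns if c != "country"]
all_float = all(pd.api.types.is_float_dtype(toy[c]) for c in year_cols)

# Share of missing values per year: lower values mean more complete data.
missing_share = toy[year_cols].isna().mean()

print(all_float)
print(missing_share)
```

Applied to the real dataframes, the same two checks would summarise in a few lines what `info()` shows column by column.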

In [None]:
# Female family workers
df_female.info()

In [None]:
# Male family workers
df_male.info()

In [None]:
# Income
df_income.head()

In [None]:
# School years
df_school.info()

In [None]:
# HDI
df_hdi.info()

### Data cleaning and new unified datasets

#### Cleaning

> Plan for the data:
> - To have one column per indicator and period in the new dataset, we average each indicator over a range of years
> - For the first questions, we average each indicator from 2010 to 2015 (recent overview)
> - For the last question, we average each indicator from 1995 to 2000 (past overview)
> - We then merge all the datasets on the country column
> - We check that all the countries we want are present
> - We filter out the countries we do not analyse
> - We run a sanity check for duplicates in the final dataset

We use a loop because every dataset has the same structure and needs the same transformation. At this step, each dataframe can be described as a matrix of one indicator by country and by year. The goal of the wrangling step, as outlined in the bullet points above, is to keep the country dimension, calculate 2 averages over time for each indicator, and gather all the indicators in one dataset. Finally, we reduce this to one last dataframe containing the 10 countries chosen for this analysis.
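
As an illustration of that transformation, here is a minimal sketch on made-up data. It is not the code used in this notebook: the helper `period_means` and the dictionaries `indicators` and `summaries` are hypothetical names, and the sketch assumes the year columns are strings such as `'1995'`, as in the GapMinder files. Keeping the results in a dictionary is one possible way to avoid creating variables dynamically.

```python
import pandas as pd

def period_means(df, name):
    """Return country plus the 2010-2015 and 1995-2000 means for one indicator."""
    out = df[["country"]].copy()
    out[f"{name}_10_15"] = df.loc[:, "2010":"2015"].mean(axis=1).round(2)
    out[f"{name}_95_00"] = df.loc[:, "1995":"2000"].mean(axis=1).round(2)
    return out

# Hypothetical wide frame with one column for every year from 1995 to 2015.
years = [str(y) for y in range(1995, 2016)]
data = {"country": ["A", "B"]}
data.update({year: [10.0 + k, 50.0 - k] for k, year in enumerate(years)})
toy = pd.DataFrame(data)

# Keep the results in a dict keyed by indicator name.
indicators = {"female": toy}
summaries = {name: period_means(df, name) for name, df in indicators.items()}

print(summaries["female"])
```

Because the helper works on a copy of the country column, the input dataframe is left unchanged.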

In [None]:
# Create a list of dataframes extracted from our csv files in order to wrangle these datasets in the next step
list_dataframes = [df_female, df_male, df_income, df_school, df_hdi]


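# Create function to get the name of a dataframe
# (it returns the first global variable bound to that exact object, so it relies on
# the dataframes being defined as notebook-level variables)
def get_df_name(df):
    name = [x for x in globals() if globals()[x] is df][0]
    return name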

In [None]:
# Loop over the dataframes to build one reduced dataframe per indicator

for i in list_dataframes:

    # Prepare naming   
    x = get_df_name(i).split("_",2)[1]
    # new columns  
    y = x + "_10_15"
    z = x + "_95_00"
    # new dataframe
    df_name = "df_"+ x +"_ind"
    print(df_name)
   
    # Calculate mean for i indicator from 2010 to 2015 period per country
    k = i.loc[:,'2010':'2015'].mean(axis=1).round(2)
    # Calculate mean for i indicator from 1995 to 2000 period per country 
    l = i.loc[:,'1995':'2000'].mean(axis=1).round(2)

    # Add means to dataframe
    i[y] = k
    i[z] = l

    # Build a dataframe with only the information needed for this indicator

    # Get positions of the last two columns (the means added above)
    a = i.shape[1]-2
    b = i.shape[1]

    # Build our dataframe: country column plus the two mean columns
    df = i.iloc[:,np.r_[0:1,a:b]]
    # exec binds df to the generated name (e.g. df_female_ind) so that the
    # 5 expected dataframes exist as variables after the loop
    exec('{} = df'.format(df_name))

In [None]:
# New dataframes to join in one dataframe for our analysis
list_new_dfs = [df_female_ind, df_male_ind, df_income_ind, df_school_ind, df_hdi_ind]

In [None]:
# Merge all the dataframes using reduce() in order to pass the merge function to all elements in the list
df_final = functools.reduce(lambda x, y: pd.merge(x, y, on = 'country', how = 'left'), list_new_dfs)
df_final.info()

In [None]:
# Have the column names in order to identify the spelling of each country
# If the datasets are neat, we expect it will be the same names in each dataset so that when we merge it on country,
# it will not cause any issue
df_female.country.unique()

In [None]:
# List the countries to keep in analysis
countries = ['Sweden', 'Belgium','Italy', 'Germany', 'Brazil', 'Senegal', 'India','United States', 'Australia', 'Syria'  ]

In [None]:
# Clean the final dataframe in order to only keep the countries we selected for the analysis
df_countries = df_final[df_final['country'].isin(countries)].reset_index(drop = True)
df_countries

In [None]:
# Same merge written as a chain of merge() calls; it rebuilds df_final
df_final = (
    df_female_ind
    .merge(df_male_ind, how = 'left', on='country')
    .merge(df_income_ind, how = 'left', on='country')
    .merge(df_school_ind, how = 'left', on='country')
    .merge(df_hdi_ind, how = 'left', on='country')
)

In [None]:
# Check for duplicates
df_countries[df_countries.duplicated()].count()

In [None]:
df_final

>Everything is ready for the exploration step, which uses the newly formed dataset **df_countries**
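
Before moving on, a quick check along the following lines could confirm that every selected country survived the filtering and that no country appears twice. This is a sketch on made-up data: `selected` and `merged` are hypothetical stand-ins for the `countries` list and the filtered dataframe.

```python
import pandas as pd

# Hypothetical selection list and filtered result (not the notebook's data).
selected = ["Sweden", "Brazil", "Syria"]
merged = pd.DataFrame({
    "country": ["Sweden", "Brazil"],
    "female_10_15": [0.2, 3.1],
})

# Countries requested but absent, for example because of a spelling mismatch.
missing = sorted(set(selected) - set(merged["country"]))

# Number of rows whose country already appeared earlier in the frame.
n_duplicates = int(merged["country"].duplicated().sum())

print(missing)
print(n_duplicates)
```

An empty `missing` list and a duplicate count of zero would indicate that the spelling in the selection list matches the dataset and that each country has exactly one row.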

In [None]:
df_final.describe()

<a id='eda'></a>
## Exploratory Data Analysis

### Preliminary observations

In [None]:
# Info about dataframe
df_countries.info()

In [None]:
# Get insights about the dataframe we will use for our analysis
df_countries.describe()

In [None]:
# Split the dataframe into recent and past indicators

# Dataframe with recent indicators
df_countries_recent = df_countries.iloc[:,np.r_[0,np.arange(1,11,2)]].reset_index(drop = True)
# Dataframe with past indicators
df_countries_past = df_countries.iloc[:,np.r_[np.arange(0,12,2)]].reset_index(drop = True)

In [None]:
# quick check to see if these are the columns we expect for each time period
print(df_countries_recent.columns)
print(df_countries_past.columns)


### Research question 1: 
#### Which factors make a country more likely to have a higher share of female family workers?

_Note: for this question, we consider the indicators in the recent period dataframe_

Sub-questions:
1. How do the 10 countries rank by % of female family workers (% of total female employment)?
2. Which indicator has the highest correlation with the % of female family workers?
3. What is the male/female ratio of family workers?
4. What level of income/HDI goes with the most equal male/female ratio of family workers?


In [None]:
# Let's answer question 1 (we look at recent rates - so indicator = female_10_15)
# How do the 10 countries rank by % of female family workers (% of total female employment)?
df_countries_recent.iloc[:,0:2].sort_values(by = 'female_10_15', ascending = False)

> Interesting... It looks like the % of female family workers over total female employment is much higher in emerging countries than in developed countries!

In [None]:
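# Question 2: Which indicator has the highest correlation with the % of female family workers?
# Only numeric columns are used, since 'country' holds strings
df_countries_recent.select_dtypes(include = 'number').corr(method = 'pearson').iloc[:,0:1].nlargest(10, columns = 'female_10_15')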

> This correlation table shows how each indicator relates to the % of female family workers among employed women.
> 1. The indicator with the strongest correlation to the % of female family workers is the gender ratio of mean years at school (negative correlation): countries where women spend more years at school relative to men tend to have fewer female family workers
> 2. In the same way, a country with higher income and Human Development Index tends to have fewer female family workers
> 3. On the contrary, a higher % of male family workers correlates with a higher % of female family workers
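
To make the coefficient behind this table concrete, the sketch below computes a Pearson correlation by hand from its definition (covariance divided by the product of the standard deviations). The function `pearson` and the two short series are made up for illustration and are not taken from the dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: the second series falls as the first one rises.
school_ratio = [60.0, 80.0, 100.0, 110.0]
female_family = [30.0, 20.0, 5.0, 1.0]

r = pearson(school_ratio, female_family)
print(round(r, 2))
```

A value close to -1 means the two series move in opposite directions almost linearly. It says nothing about which one drives the other.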

In [None]:
# Question 3: What is the male/female ratio of family workers ?
df_countries_recent['male/female_10_15'] = (df_countries_recent['male_10_15']/df_countries_recent['female_10_15']).round(2)

In [None]:
df_countries_recent

> Sweden is by far the country with the most equal gender distribution of family workers. Surprisingly, a developed country like Germany has the same male/female ratio of family workers as an emerging country like Brazil. What does this tell us? So far, we have observed that developed countries tend to have fewer female family workers. What we find here is that male family workers are fewer still.
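
One way to make "most equal" explicit is to measure how far each ratio is from parity (a ratio of 1). The sketch below does this on made-up values. The column `parity_gap` is a hypothetical helper, not something computed elsewhere in this notebook.

```python
import pandas as pd

# Made-up percentages of family workers among employed men and women.
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "male_10_15": [0.2, 1.0, 6.0],
    "female_10_15": [0.2, 4.0, 3.0],
})

# Ratio as used in this notebook: male share divided by female share.
toy["male/female_10_15"] = (toy["male_10_15"] / toy["female_10_15"]).round(2)

# Hypothetical helper column: distance of the ratio from parity (1.0).
toy["parity_gap"] = (toy["male/female_10_15"] - 1).abs()

print(toy.sort_values("parity_gap"))
```

Note that an absolute gap treats ratios of 0.5 and 2.0 differently, although both mean that one share is twice the other. Taking the logarithm of the ratio would treat them symmetrically.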

In [None]:
# Question 4: What level of income/HDI goes with the most equal male/female ratio of family workers?
df_countries_recent.query('country == "Sweden"').iloc[:,np.r_[3,5]]


In [None]:
# Let's see how it compares when we group developed countries and emerging countries

# developed countries 
dev_countries = ['Sweden', 'Belgium','Italy', 'Germany','United States', 'Australia']
# emerging countries
em_countries = ['Brazil', 'Senegal', 'India', 'Syria']

# New dimension 'country_level' for developed and emerging countries
df_countries_recent.loc[df_countries_recent.country.isin(dev_countries), 'country_level'] = 'developed' 
df_countries_recent.loc[df_countries_recent.country.isin(em_countries), 'country_level'] = 'emerging' 

In [None]:
# Check the new dimension
df_countries_recent

In [None]:
# Look at mean income, HDI and male/female family workers ratio by developed and emerging country
df_countries_recent.groupby('country_level').agg({'income_10_15': 'mean', 'hdi_10_15': 'mean','male/female_10_15': 'mean'}).round(2)

> Income and HDI are definitely high in Sweden, where the male/female family worker ratio is the most 'equal'. In Sweden, the % of family workers is the same among employed men as among employed women. For the other developed countries, the ratio does not show a clearly more equal balance than in emerging countries. So although developed countries generally have higher income and HDI, they do not necessarily have as many male as female family workers.
> On average, though, higher income and HDI still go together with a more equal balance of female and male family workers.

### Research question 2:
#### How did the male/female ratio of family workers evolve in developed and emerging countries in the past vs. today?

In [None]:
# We have already checked on developed vs. emerging countries in the previous step

# Let's add the male/female ratio for this dataset
df_countries_past['male/female_95_00'] = (df_countries_past['male_95_00']/df_countries_past['female_95_00']).round(2)
# Let's create the 'country_level' dimension for the df_countries_past dataframe
df_countries_past.loc[df_countries_past.country.isin(dev_countries), 'country_level'] = 'developed' 
df_countries_past.loc[df_countries_past.country.isin(em_countries), 'country_level'] = 'emerging' 


In [None]:
# Check the dataframe
df_countries_past

> We already see that we do not have the data for 2 of the emerging countries
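
Missing values matter for the group averages that follow, because `mean()` silently skips them. The sketch below, on made-up data, shows the mean next to the number of countries that actually contribute to it. The frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up past-period values; two emerging countries have no data.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "country_level": ["developed", "developed", "emerging", "emerging", "emerging"],
    "female_95_00": [1.0, 3.0, 20.0, np.nan, np.nan],
})

grouped = toy.groupby("country_level")["female_95_00"]

# mean() skips missing values; count() shows how many countries back each mean.
summary = grouped.agg(["mean", "count"])

print(summary)
```

In this toy case the emerging average rests on a single country, which is worth keeping in mind when comparing groups.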

In [None]:
# let's check the mean of indicators for the past
df_countries_past.groupby('country_level').mean(numeric_only = True)

In [None]:
# let's check the mean of indicators for the recent years
df_countries_recent.groupby('country_level').mean(numeric_only = True)

> We see a strong evolution in developed countries: there are now fewer family workers in general, and the balance between men and women is more even. The evolution in emerging countries seems much slower, but it goes in the same direction.
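
The size of such an evolution can be quantified by subtracting the past group means from the recent ones. The sketch below uses made-up numbers and assumes the period suffixes have been stripped from the column names so that the two tables line up. `past`, `recent`, `abs_change` and `rel_change` are hypothetical names.

```python
import pandas as pd

# Made-up group means for the two periods (not results from this notebook).
past = pd.DataFrame(
    {"female": [4.0, 30.0], "male": [1.0, 10.0]},
    index=["developed", "emerging"],
)
recent = pd.DataFrame(
    {"female": [1.0, 24.0], "male": [0.5, 9.0]},
    index=["developed", "emerging"],
)

# Absolute change in percentage points and relative change in percent.
abs_change = recent - past
rel_change = ((recent - past) / past * 100).round(1)

print(abs_change)
print(rel_change)
```

Looking at both tables helps, since a small absolute change can still be a large relative one when the starting value is low.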

<a id='conclusions'></a>
## Conclusions

### General question 1: 
#### Which factors make a country more likely to have a higher share of female family workers?

#### Conclusion: 

The analysis shows a large difference between developed and emerging countries. Developed countries tend to have far fewer female family workers. From the sub-questions addressed in this analysis, the main findings are:
- The % of female family workers over total female employment is much higher in emerging countries than in developed countries
- A country where women spend more years at school relative to men tends to have fewer female family workers
- A country with higher income and Human Development Index tends to have fewer female family workers
- A higher % of male family workers correlates with a higher % of female family workers
- Sweden is by far the country with the most equal gender distribution of family workers
- On average, higher income and HDI go together with a more equal balance of female and male family workers, but Belgium and Germany do not fully follow that trend.


In [None]:
# Create bar graph to give context on the Female Family workers indicator
Fig, ax = plt.subplots(figsize = (16,5))
ax.set_title('% Female family workers per country')
ax.set_ylabel('% female family worker/total female workers')
female_country = df_countries_recent.iloc[:,0:2].sort_values(by = 'female_10_15', ascending = False).reset_index(drop = True)
xvalues = female_country.index
yvalues = female_country.female_10_15
xlabels = female_country.country

plt.bar( xvalues, yvalues, tick_label = xlabels);


In [None]:
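# Show table of correlations with all the different indicators
# Only numeric columns are used, since 'country' and 'country_level' hold strings
df_countries_recent.select_dtypes(include = 'number').corr(method = 'pearson').iloc[:,0:1].nlargest(10, columns = 'female_10_15')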


> All the indicators are strongly correlated with the % of female family workers, which suggests it may be fairly predictable from them. Let's check whether that is the case by looking at more results.
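
With only 10 countries, a single extreme value can weigh heavily on a Pearson coefficient. One possible robustness check, sketched below on made-up values, is to compare it with a rank-based (Spearman) correlation, computed here as the Pearson correlation of the ranks. The frame `toy` is hypothetical and does not contain the notebook's data.

```python
import pandas as pd

# Made-up values for ten hypothetical countries.
toy = pd.DataFrame({
    "female_10_15": [0.1, 0.3, 0.5, 0.8, 1.0, 2.0, 5.0, 9.0, 30.0, 45.0],
    "income_10_15": [46000, 44000, 43000, 36000, 52000,
                     41000, 15000, 5000, 2300, 4500],
})

# Pearson measures linear association and is sensitive to extreme values.
pearson_r = toy["female_10_15"].corr(toy["income_10_15"])

# Spearman is Pearson applied to ranks, so it only uses the ordering.
ranks = toy.rank()
spearman_r = ranks["female_10_15"].corr(ranks["income_10_15"])

print(round(pearson_r, 2), round(spearman_r, 2))
```

If both coefficients point in the same direction with similar strength, the relationship is less likely to be driven by one or two countries.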

In [None]:
# Correlation graph with the strongest potential "predictor" of % female family workers
Fig, ax = plt.subplots(figsize = (4,4))
ax.set_title('Correlation of mean school years with % female family workers')
ax.set_ylabel('Gender ratio of mean school years')
ax.set_xlabel('% female family worker/total female workers')
plt.scatter(df_countries_recent.female_10_15,df_countries_recent.school_10_15,color='b');

> There is a strong negative correlation between the % of female family workers among employed women and the gender ratio of mean school years. Countries where women spend more years at school than men tend to have fewer female family workers. Conversely, countries where women spend fewer years at school than men tend to have many more female family workers.

> Thought: does this mean that in countries with a lower % of female family workers, fewer women start families, or that women who have families work less? We do not have data in this analysis about the number of families in a country.

In [None]:
# Male/Female family workers ratio by country
Fig, ax = plt.subplots(figsize = (16,5))
ax.set_title('Male/Female family workers ratio by country')
ax.set_ylabel('Male/Female family workers ratio')
male_female_country = df_countries_recent.iloc[:, np.r_[0,6]].sort_values(by = 'male/female_10_15' ,ascending = False).reset_index(drop = True)
xvalues = male_female_country.index
yvalues = male_female_country['male/female_10_15']
xlabels = male_female_country.country

plt.bar( xvalues, yvalues, tick_label = xlabels);

In [None]:
df_level = df_countries_recent.groupby('country_level').agg({'income_10_15': 'mean', 'hdi_10_15': 'mean','male/female_10_15': 'mean'}).round(2)
df_level

In [None]:
# Indicators by country level of development

#Set the plots and figures configuration in graph
Fig, ax1 = plt.subplots(figsize = (10,5))
ax2 = ax1.twinx()


#Title and axis labels
ax1.set_title('Indicators by country level of development')
ax1.set_ylabel('Income')
ax2.set_ylabel('HDI and male/female family workers ratio')

# X coordinates, x labels and y values
xvalues = [1,2]
xlabels = df_level.index
y1 = df_level.income_10_15
y2 = df_level.hdi_10_15
y3 = df_level['male/female_10_15']

#Plot the whole data
w = 0.2
a = ax1.bar([0.8,1.8], y1, width=w, color='orange', align='center')
b = ax2.bar(xvalues, y2, width=w, color='brown', align='center')
c = ax2.bar([1.2,2.2], y3, width=w, color='grey', align='center')

#X tick labels
ax1.set_xticks(xvalues)
ax1.set_xticklabels(xlabels)

#Legend
ax1.legend((a[0],), ('Income',), loc = (0.57,0.9))
ax2.legend((b[0], c[0]), ('HDI', 'male/female family workers ratio'), loc = (0.57,0.78))
plt.show();

### General question 2:
#### How did the male/female ratio of family workers evolve in developed countries in the past vs. today?

##### Conclusion: 
Developed countries saw gender equality among family workers increase along with income, the gender ratio of mean school years (years spent at school by women relative to men), and HDI. According to the analysis for the first question, there are definitely fewer female family workers in developed countries, and the same tends to happen to male family workers as a country develops.

In [None]:
# Prepare the data to show in graph
df_past = df_countries_past.groupby('country_level').mean(numeric_only = True).reset_index()
df_recent = df_countries_recent.groupby('country_level').mean(numeric_only = True).reset_index()
df_past_dev = df_past[df_past['country_level'] == "developed"]
df_recent_dev = df_recent[df_recent['country_level'] == "developed"]

In [None]:
# Equality of genders in family workers and the evolution of correlated indicators in developed countries 

#Set the plots and figures configuration in graph
Fig, ax1 = plt.subplots(figsize = (10,5))
ax2 = ax1.twinx()


#Title and axis labels
ax1.set_title('Equality of genders in family workers and the evolution of correlated indicators in developed countries ')
ax1.set_ylabel('Income and Gender ratio of mean school years')
ax2.set_ylabel('HDI and male/female family workers ratio')

# X coordinates, x labels and y values
xvalues = [1,2]
xlabels = ['1995 to 2000', '2010 to 2015']
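# Each filtered dataframe holds a single row, so .iloc[0] extracts the value itself
y1 = [df_past_dev['male/female_95_00'].iloc[0], df_recent_dev['male/female_10_15'].iloc[0]]
y2 = [df_past_dev['income_95_00'].iloc[0], df_recent_dev['income_10_15'].iloc[0]]
y3 = [df_past_dev['school_95_00'].iloc[0], df_recent_dev['school_10_15'].iloc[0]]
y4 = [df_past_dev['hdi_95_00'].iloc[0], df_recent_dev['hdi_10_15'].iloc[0]]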

#Plot the whole data
w = 0.2
a = ax2.plot(xvalues, y1, color='orange')
b = ax1.plot(xvalues, y2, color='green')
c = ax1.plot(xvalues, y3, color='grey')
d = ax2.plot(xvalues, y4, color='blue')

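#X tick labels
ax1.set_xticks(xvalues)
ax1.set_xticklabels(xlabels)

#Legend (a is the male/female ratio line, d is the HDI line)
ax1.legend((b[0], c[0]), ('Income', 'Gender ratio of mean school years'), loc = (0.57,0.75))
ax2.legend((a[0], d[0]), ('male/female family workers ratio', 'HDI'), loc = (0.57,0.63));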