# Project: What promotes higher rates of family female workers?

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

In this project, we _analyse how several indicators relate to the share of female family workers_. The data comes from [GapMinder](https://www.gapminder.org/data/).

The purpose of this analysis is to understand:
- Which factors are associated with a higher share of female family workers in a country?
- How did the female/male family workers ratio evolve over time in developed countries?

To that end, we will analyse _economy, education, equality and society_ indicators and estimate how they relate to the rate of female family workers. The analysis is restricted to 10 countries from around the world.

####  Scope

> The 10 countries we will keep in the dataset for this analysis are:
1. Sweden
2. Germany
3. Belgium
4. Italy
5. Senegal
6. India
7. United States
8. Brazil
9. Syria
10. Australia

#### Questions

In this analysis, we will attempt to answer the following detailed questions:

1. How do the 10 countries rank by % of female family workers?
2. Which indicator correlates most strongly with the % of female family workers?
3. What is the female/male ratio of family workers?
4. What level of income/HDI goes with the most equal ratio of female/male family workers?
5. How did the female/male ratio of family workers evolve in Belgium and Italy between the past and today?

#### Data collection

> **Datasets**: we have downloaded 5 datasets from GapMinder in order to perform this analysis:
>
> - Female Family workers
> - Male Family workers
> - Income
> - Mean years in school
> - Human development Index

In [1]:
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import functools 
%matplotlib inline

# import datasets
df_female = pd.read_csv('female_family_workers_percent_of_female_employment.csv')
df_male = pd.read_csv('male_family_workers_percent_of_male_employment.csv')
df_income = pd.read_csv('income_per_person_gdppercapita_ppp_inflation_adjusted.csv')
df_school = pd.read_csv('mean_years_in_school_women_percent_men_25_to_34_years.csv')
df_hdi = pd.read_csv('hdi_human_development_index.csv')

<a id='wrangling'></a>
## Data Wrangling

#### Assessment
> Inspecting each dataframe with `info()` and `head()`, we notice that:
> - All these datasets appear to share the same structure: countries in rows and years in columns
> - Indicator values are floats, so there is no need to convert them
> - Data is more complete for recent years
> - Some countries may be missing
> - School years and HDI data have only been collected until 2015

In [None]:
# Female family workers
df_female.info()

In [None]:
# Male family workers
df_male.info()

In [None]:
# Income
df_income.head()

In [None]:
# School years
df_school.info()

In [None]:
# HDI
df_hdi.info()

### Data cleaning and new unified datasets

#### Cleaning

> The cleaning plan is as follows:
> - To get one column per indicator and period in the new dataset, we average each indicator over a range of years
> - For the first questions, we average each indicator from 2010 until 2015 (recent overview)
> - For the last question, we average each indicator from 1995 until 2000 (past overview)
> - We then merge all the datasets on country
> - We check that all the countries we want are present
> - We filter out the countries we do not analyse
> - We run a sanity check for duplicates in the final dataset

Because every dataset has the same structure, each one goes through the same transformation, so we use a loop. At this step, each dataframe can be described as a matrix of one indicator by country and by year. The goal of this wrangling step, as outlined in the bullet points above, is to keep the country dimension, calculate 2 averages over time for each indicator, gather all the indicators in one dataset, and finally keep only the 10 countries chosen for this analysis.

In [2]:
# Create a list of dataframes extracted from our csv files in order to wrangle these datasets in the next step
list_dataframes = [df_female, df_male, df_income, df_school, df_hdi]


# Create function to get the name of a dataframe
def get_df_name(df):
    name =[x for x in globals() if globals()[x] is df][0]
    return name

In [3]:
# Loop over the dataframes to build one new dataframe per indicator

for i in list_dataframes:

    # Prepare naming   
    x = get_df_name(i).split("_",2)[1]
    # new columns  
    y = x + "_10_15"
    z = x + "_95_00"
    # new dataframe
    df_name = "df_"+ x +"_ind"
    print(df_name)
   
    # Calculate mean for i indicator from 2010 to 2015 period per country
    k = i.loc[:,'2010':'2015'].mean(axis=1).round(2)
    # Calculate mean for i indicator from 1995 to 2000 period per country 
    l = i.loc[:,'1995':'2000'].mean(axis=1).round(2)

    # Add means to dataframe
    i[y] = k
    i[z] = l

    # Build a dataframe with only the information needed for this indicator

    # Get positions of the last two columns (the means just added)
    a = i.shape[1] - 2
    b = i.shape[1]

    # Keep the country column plus the two mean columns
    df = i.iloc[:, np.r_[0:1, a:b]]
    # exec binds df to the name built above (e.g. df_female_ind), giving the 5 expected dataframes
    exec('{} = df'.format(df_name))

df_female_ind
df_male_ind
df_income_ind
df_school_ind
df_hdi_ind
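
The loop above relies on `globals()` and `exec` to name its outputs. A minimal sketch of the same transformation that returns its results in a dictionary instead is shown below. The helper `period_means` and the small `toy` frame are hypothetical, introduced only for illustration, and they assume the layout noted in the assessment: a `country` column followed by one string-labelled column per year.

```python
import pandas as pd


def period_means(df, name):
    """Return country plus the 2010-2015 and 1995-2000 means of one indicator."""
    out = df[['country']].copy()
    # Label-based column slices include both end labels
    out[name + '_10_15'] = df.loc[:, '2010':'2015'].mean(axis=1).round(2)
    out[name + '_95_00'] = df.loc[:, '1995':'2000'].mean(axis=1).round(2)
    return out


# Toy data (made-up values) with the assumed layout
years = [str(y) for y in range(1995, 2016)]
toy = pd.DataFrame([[1.0] * len(years), [2.0] * len(years)], columns=years)
toy.insert(0, 'country', ['A', 'B'])

# In the notebook this could be {'female': df_female, 'male': df_male, ...}
indicators = {'female': toy}
frames = {name: period_means(df, name) for name, df in indicators.items()}
print(frames['female'])
```

Keeping the results in a dictionary avoids creating variables dynamically, and the input dataframes are left unmodified.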


In [4]:
# New dataframes to join in one dataframe for our analysis
list_new_dfs = [df_female_ind, df_male_ind, df_income_ind, df_school_ind, df_hdi_ind]

In [5]:
# Merge all the dataframes using reduce() in order to pass the merge function to all elements in the list
df_final = functools.reduce(lambda x, y: pd.merge(x, y, on = 'country', how = 'left'), list_new_dfs)
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 171 entries, 0 to 170
Data columns (total 11 columns):
country         171 non-null object
female_10_15    137 non-null float64
female_95_00    106 non-null float64
male_10_15      138 non-null float64
male_95_00      107 non-null float64
income_10_15    171 non-null float64
income_95_00    171 non-null float64
school_10_15    167 non-null float64
school_95_00    167 non-null float64
hdi_10_15       168 non-null float64
hdi_95_00       153 non-null float64
dtypes: float64(10), object(1)
memory usage: 16.0+ KB
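
A left merge keeps every country of the first dataframe and drops countries that appear only in the later ones. As a hedged sketch, the `validate` argument of `pd.merge` can additionally guard against a country appearing twice in any input. The two small frames below are made up for illustration.

```python
import functools
import pandas as pd

left = pd.DataFrame({'country': ['A', 'B'], 'female_10_15': [1.0, 2.0]})
right = pd.DataFrame({'country': ['A', 'C'], 'male_10_15': [0.5, 0.7]})


def merge_on_country(x, y):
    # validate='one_to_one' raises MergeError if 'country' is duplicated on either side
    return pd.merge(x, y, on='country', how='left', validate='one_to_one')


merged = functools.reduce(merge_on_country, [left, right])
print(merged)
```

Here country `C` is absent from the result because it is not in the first frame, and country `B` gets a missing value for `male_10_15`.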


In [6]:
# List the country names to check how each country is spelled
# If the datasets are tidy, we expect the same names in each dataset, so merging on country
# should not cause any issue
df_female.country.unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia',
       'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
       'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia',
       'Cameroon', 'Canada', 'Cape Verde', 'Chad', 'Chile', 'Colombia',
       'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.', 'Costa Rica',
       "Cote d'Ivoire", 'Croatia', 'Cuba', 'Cyprus', 'Czech Republic',
       'Denmark', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia',
       'Ethiopia', 'Fiji', 'Finland', 'France', 'Gabon', 'Gambia',
       'Georgia', 'Germany', 'Ghana', 'Greece', 'Grenada', 'Guatemala',
       'Guinea', 'Haiti', 'Honduras', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Iran', 'Ireland', 'Israel', 'It

In [7]:
# List the countries to keep in analysis
countries = ['Sweden', 'Belgium','Italy', 'Germany', 'Brazil', 'Senegal', 'India','United States', 'Australia', 'Syria'  ]

In [8]:
# Clean the final dataframe in order to only keep the countries we selected for the analysis
df_countries = df_final[df_final['country'].isin(countries)].reset_index(drop = True)
df_countries

Unnamed: 0,country,female_10_15,female_95_00,male_10_15,male_95_00,income_10_15,income_95_00,school_10_15,school_95_00,hdi_10_15,hdi_95_00
0,Australia,0.28,1.24,0.2,0.63,42650.0,32733.33,102.5,100.33,0.93,0.89
1,Belgium,1.56,4.9,0.42,0.54,41200.0,34716.67,104.0,102.5,0.89,0.86
2,Brazil,4.35,9.9,1.71,4.5,15016.67,11200.0,112.67,108.5,0.74,0.67
3,Germany,0.75,1.85,0.29,0.36,42666.67,34933.33,101.5,98.42,0.92,0.85
4,India,31.65,33.6,10.6,12.4,5015.0,2271.67,66.57,57.53,0.6,0.48
5,Italy,2.05,6.66,1.06,2.83,35000.0,34566.67,103.67,101.0,0.88,0.81
6,Senegal,29.35,,19.5,,2205.0,1815.0,62.22,54.67,0.48,0.38
7,Sweden,0.23,0.55,0.23,0.48,43866.67,33466.67,104.0,102.33,0.91,0.87
8,Syria,9.68,,2.58,,5343.33,5271.67,83.32,72.67,0.6,0.58
9,United States,0.09,0.17,0.06,0.08,50883.33,42616.67,103.0,100.67,0.92,0.88
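
The cleaning plan calls for checking that every selected country is present. One way to sketch that check is a set difference between the selection and the `country` column. In the example below, `available` is a hypothetical stand-in for `df_final['country']`, and `wanted` deliberately contains two names that do not match the dataset's spelling.

```python
import pandas as pd

# Hypothetical stand-in for df_final['country']
available = pd.Series(['Australia', 'Belgium', 'Brazil', 'Sweden', 'United States'])
wanted = ['Sweden', 'Belgium', 'Brasil', 'USA']

# Names in the selection that have no exact match in the data
missing = sorted(set(wanted) - set(available))
print(missing)
```

An empty list means every selected country was found; any name that is printed needs its spelling aligned with the dataset before filtering with `isin`.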


In [9]:
df_final = df_female_ind.merge(df_male_ind, how = 'left', on='country').merge(df_income_ind, how = 'left', on='country').merge(df_school_ind, how = 'left', on='country').merge(df_hdi_ind, how = 'left', on='country')

In [10]:
# Check for duplicates
df_countries[df_countries.duplicated()].count()

country         0
female_10_15    0
female_95_00    0
male_10_15      0
male_95_00      0
income_10_15    0
income_95_00    0
school_10_15    0
school_95_00    0
hdi_10_15       0
hdi_95_00       0
dtype: int64

In [11]:
df_final

Unnamed: 0,country,female_10_15,female_95_00,male_10_15,male_95_00,income_10_15,income_95_00,school_10_15,school_95_00,hdi_10_15,hdi_95_00
0,Afghanistan,38.60,,7.90,,1741.67,937.50,23.20,19.75,0.47,0.33
1,Albania,43.98,,22.15,,10455.00,4626.67,102.00,98.25,0.76,0.64
2,Algeria,3.81,,2.26,,13266.67,9706.67,89.90,85.62,0.74,0.62
3,Angola,24.54,,16.66,,6081.67,3345.00,71.97,63.68,0.52,0.39
4,Antigua and Barbuda,,,,,19216.67,17650.00,110.00,109.00,0.78,
...,...,...,...,...,...,...,...,...,...,...,...
166,Venezuela,1.44,1.64,0.63,1.32,16866.67,14983.33,109.00,105.33,0.77,0.67
167,Vietnam,24.42,55.56,12.92,23.66,5046.67,2365.00,98.78,92.85,0.67,0.55
168,Yemen,38.50,0.35,9.35,0.33,3726.67,3686.67,28.85,21.55,0.49,0.43
169,Zambia,52.00,48.13,17.10,18.00,3498.33,2080.00,84.07,77.53,0.56,0.41


> Everything is ready for the exploration step, using the newly formed dataset **df_countries**.

In [12]:
df_final.describe()

Unnamed: 0,female_10_15,female_95_00,male_10_15,male_95_00,income_10_15,income_95_00,school_10_15,school_95_00,hdi_10_15,hdi_95_00
count,137.0,106.0,138.0,107.0,171.0,171.0,167.0,167.0,168.0,153.0
mean,13.83292,12.801981,6.192754,5.488598,17670.782807,13701.44345,93.083713,88.331916,0.699345,0.623203
std,16.005366,16.818073,7.718418,7.290457,19211.528997,17502.023957,19.341991,21.063724,0.152979,0.168806
min,0.02,0.11,0.03,0.02,675.5,499.5,23.2,19.75,0.34,0.24
25%,1.12,1.81,0.55,0.615,3671.67,2595.83,83.945,75.655,0.58,0.48
50%,7.03,4.82,2.945,2.06,10993.33,7085.0,102.0,97.98,0.73,0.66
75%,23.88,15.88,9.3475,8.28,24041.67,15908.335,105.0,102.585,0.82,0.75
max,65.9,75.3,33.95,32.7,123833.33,101883.33,126.0,127.0,0.94,0.9


<a id='eda'></a>
## Exploratory Data Analysis

### Preliminary observations

In [14]:
# Info about dataframe
df_countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 11 columns):
country         10 non-null object
female_10_15    10 non-null float64
female_95_00    8 non-null float64
male_10_15      10 non-null float64
male_95_00      8 non-null float64
income_10_15    10 non-null float64
income_95_00    10 non-null float64
school_10_15    10 non-null float64
school_95_00    10 non-null float64
hdi_10_15       10 non-null float64
hdi_95_00       10 non-null float64
dtypes: float64(10), object(1)
memory usage: 1008.0+ bytes


In [15]:
# Get insights about the dataframe we will use for our analysis
df_countries.describe()

Unnamed: 0,female_10_15,female_95_00,male_10_15,male_95_00,income_10_15,income_95_00,school_10_15,school_95_00,hdi_10_15,hdi_95_00
count,10.0,8.0,10.0,8.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,7.999,7.35875,3.665,2.7275,28384.667,23359.168,94.345,89.862,0.787,0.727
std,12.216873,11.131792,6.40615,4.201356,19155.723931,16101.369055,17.406331,20.181486,0.168856,0.187501
min,0.09,0.17,0.06,0.08,2205.0,1815.0,62.22,54.67,0.48,0.38
25%,0.3975,1.0675,0.245,0.45,7761.665,6753.7525,87.865,79.1075,0.635,0.6025
50%,1.805,3.375,0.74,0.585,38100.0,33100.0,102.75,100.5,0.885,0.83
75%,8.3475,7.47,2.3625,3.2475,42662.5025,34679.17,103.9175,101.9975,0.9175,0.8675
max,31.65,33.6,19.5,12.4,50883.33,42616.67,112.67,108.5,0.93,0.89


In [16]:
# Split the dataframe into recent and past indicators

# Dataframe with recent indicators (copied so that new columns can be added without a SettingWithCopyWarning)
df_countries_recent = df_countries.iloc[:, np.r_[0, np.arange(1, 11, 2)]].copy()
# Dataframe with past indicators
df_countries_past = df_countries.iloc[:, np.arange(0, 12, 2)].copy()

In [17]:
# quick check to see if these are the columns we expect for each time period
print(df_countries_recent.columns)
print(df_countries_past.columns)

Index(['country', 'female_10_15', 'male_10_15', 'income_10_15', 'school_10_15',
       'hdi_10_15'],
      dtype='object')
Index(['country', 'female_95_00', 'male_95_00', 'income_95_00', 'school_95_00',
       'hdi_95_00'],
      dtype='object')


In [None]:
cols = [col for col in df_countries.columns if '_10_15' in col]
print(list(df_countries.columns))
print (cols)

In [18]:
# Split the dataframe into recent and past indicators

# Indicators recent time period
recent_ind = [col for col in df_countries.columns if '_10_15' in col]
recent_ind.insert(0,'country')
print(recent_ind)

# Indicators past time period
past_ind = [col for col in df_countries.columns if '_95_00' in col]
past_ind.insert(0,'country')
print(past_ind)

['country', 'female_10_15', 'male_10_15', 'income_10_15', 'school_10_15', 'hdi_10_15']
['country', 'female_95_00', 'male_95_00', 'income_95_00', 'school_95_00', 'hdi_95_00']


In [None]:
np.arange(1,11,2)

In [None]:
df_countries.iloc[:,np.r_[0,np.arange(1,11,2)]]


### Research question 1: 
#### Which factors are associated with a higher share of female family workers in a country?

_Note: for this question, we consider the indicators in the recent period dataframe_

Sub-questions:
1. How do the 10 countries rank by % of female family workers (% of total female employment)?
2. Which indicator correlates most strongly with the % of female family workers?
3. What is the female/male ratio of family workers?
4. What level of income/HDI goes with the most equal ratio of female/male family workers?


In [None]:
# Let's answer question 1 (we look at recent rates, so indicator = female_10_15)
# How do the 10 countries rank by % of female family workers (% of total female employment)?
df_countries_recent.iloc[:,0:2].sort_values(by = 'female_10_15', ascending = False)

> Interesting... The % of female family workers in total female employment looks much higher in emerging countries than in developed countries!

In [None]:
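# Question 2: Which indicator correlates most strongly with the % of female family workers?
# Drop the non-numeric 'country' column so that corr() works across pandas versions
df_countries_recent.drop(columns='country').corr(method = 'pearson').iloc[:,0:1].nlargest(10, columns = 'female_10_15')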

In [None]:
df_countries.columns

> This correlation table shows how each indicator relates to the % of female family workers (as a share of total female employment).
> 1. The indicator most strongly correlated with the % of female family workers is mean years in school (women as a % of men), and the correlation is negative: countries where women spend more years in school relative to men tend to have fewer female family workers
> 2. Similarly, countries with a higher income and Human Development Index tend to have fewer female family workers
> 3. Conversely, a higher rate of male family workers goes with a higher rate of female family workers
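
`nlargest` orders the correlations by signed value, so strong negative correlations end up at the bottom of the table. To rank indicators by strength, one option is to sort by absolute value. The sketch below uses the 2010-2015 values from the `df_countries` table above; the names `recent` and `ranked` are introduced here for illustration.

```python
import pandas as pd

# 2010-2015 values copied from the df_countries table (same country order)
recent = pd.DataFrame({
    'female_10_15': [0.28, 1.56, 4.35, 0.75, 31.65, 2.05, 29.35, 0.23, 9.68, 0.09],
    'male_10_15': [0.2, 0.42, 1.71, 0.29, 10.6, 1.06, 19.5, 0.23, 2.58, 0.06],
    'income_10_15': [42650.0, 41200.0, 15016.67, 42666.67, 5015.0,
                     35000.0, 2205.0, 43866.67, 5343.33, 50883.33],
    'school_10_15': [102.5, 104.0, 112.67, 101.5, 66.57,
                     103.67, 62.22, 104.0, 83.32, 103.0],
    'hdi_10_15': [0.93, 0.89, 0.74, 0.92, 0.6, 0.88, 0.48, 0.91, 0.6, 0.92],
})

# Pearson correlation of every indicator with the female rate, excluding itself
corr = recent.corr(method='pearson')['female_10_15'].drop('female_10_15')
# Order by strength (absolute value) while keeping the sign visible
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked.round(2))
```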

In [None]:
recent_ind

In [None]:
df_countries_past

In [None]:
df_countries_recent.plot('female_10_15','male_10_15',kind = 'scatter');

In [None]:
# Question 3: What is the female/male ratio of family workers?
# Computed here as male / female, so a value of 1 means parity
df_countries_recent['male/female_10_15'] = (df_countries_recent['male_10_15']/df_countries_recent['female_10_15']).round(2)

In [None]:
df_countries_recent

> Sweden is by far the country with the most equal gender distribution of family workers. Surprisingly, a developed country like Germany has the same male/female ratio as an emerging country like Brazil. What does this tell us? So far we have observed that developed countries tend to have fewer female family workers. What we find here is that male family workers are rarer still.
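
To make "most equal" concrete, one option is to measure how far each male/female ratio sits from 1 (parity). The sketch below uses the 2010-2015 values from the `df_countries` table above; the column name `gap_to_parity` is introduced here for illustration.

```python
import pandas as pd

# 2010-2015 values copied from the df_countries table
recent = pd.DataFrame({
    'country': ['Australia', 'Belgium', 'Brazil', 'Germany', 'India',
                'Italy', 'Senegal', 'Sweden', 'Syria', 'United States'],
    'female_10_15': [0.28, 1.56, 4.35, 0.75, 31.65, 2.05, 29.35, 0.23, 9.68, 0.09],
    'male_10_15': [0.2, 0.42, 1.71, 0.29, 10.6, 1.06, 19.5, 0.23, 2.58, 0.06],
})

# Ratio of the male rate to the female rate; 1 means both rates are equal
recent['male/female_10_15'] = (recent['male_10_15'] / recent['female_10_15']).round(2)
# Distance from parity, used only for ordering
recent['gap_to_parity'] = (recent['male/female_10_15'] - 1).abs()

by_gap = recent.sort_values('gap_to_parity')
print(by_gap[['country', 'male/female_10_15']])
```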

In [None]:
# Question 4: What level of income/HDI goes with the most equal ratio of female/male family workers?
df_countries_recent.query('country == "Sweden"').iloc[:,np.r_[3,5]]


In [None]:
# Let's see how it compares when we group developed countries and emerging countries

In [None]:
# developed countries 
dev_countries = ['Sweden', 'Belgium','Italy', 'Germany','United States', 'Australia']
# emerging countries
em_countries = ['Brazil', 'Senegal', 'India', 'Syria']

# Dataframes for developed and emerging countries
df_dev_countries_recent = df_countries_recent[df_countries_recent['country'].isin(dev_countries)]
df_em_countries_recent = df_countries_recent[df_countries_recent['country'].isin(em_countries)]

In [None]:
# df_countries_recent['country_level'] = df['country'].apply(lambda x: 'developed' if x in dev_countries else 'emerging')
df_countries.loc[df_countries.country.isin(dev_countries), 'country_level'] = 'developed' 
df_countries.loc[df_countries.country.isin(em_countries), 'country_level'] = 'emerging' 



In [None]:
df_countries

In [None]:
df_countries_recent

In [None]:
df_dev_countries_recent.agg({'income_10_15': 'mean', 'hdi_10_15': 'mean', 'male/female_10_15': 'mean'})

In [None]:
df_em_countries_recent.agg({'income_10_15': 'mean', 'hdi_10_15': 'mean', 'male/female_10_15': 'mean'})
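
The two `agg` calls can be collapsed into a single `groupby` once each country carries a `country_level` label. This is a sketch using the same developed/emerging grouping and the 2010-2015 income and HDI values from the `df_countries` table; the names `recent` and `summary` are introduced here for illustration.

```python
import pandas as pd

dev_countries = ['Sweden', 'Belgium', 'Italy', 'Germany', 'United States', 'Australia']

# 2010-2015 values copied from the df_countries table
recent = pd.DataFrame({
    'country': ['Australia', 'Belgium', 'Brazil', 'Germany', 'India',
                'Italy', 'Senegal', 'Sweden', 'Syria', 'United States'],
    'income_10_15': [42650.0, 41200.0, 15016.67, 42666.67, 5015.0,
                     35000.0, 2205.0, 43866.67, 5343.33, 50883.33],
    'hdi_10_15': [0.93, 0.89, 0.74, 0.92, 0.6, 0.88, 0.48, 0.91, 0.6, 0.92],
})

# Label each country, then average the indicators per group
recent['country_level'] = recent['country'].apply(
    lambda c: 'developed' if c in dev_countries else 'emerging')
summary = recent.groupby('country_level')[['income_10_15', 'hdi_10_15']].mean().round(2)
print(summary)
```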

### Research question 2:
#### How did the female/male ratio of family workers evolve in developed countries between the past and today?
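
One possible starting point for this question, sketched here and not taken from the original analysis, is to compute the ratio for both periods and compare them. As in Question 3, the ratio is taken as male / female. The values come from the `df_countries` table above for the six countries grouped as developed; `dev` and `change` are names introduced for illustration.

```python
import pandas as pd

# Values copied from the df_countries table for the developed countries
dev = pd.DataFrame({
    'country': ['Australia', 'Belgium', 'Germany', 'Italy', 'Sweden', 'United States'],
    'female_10_15': [0.28, 1.56, 0.75, 2.05, 0.23, 0.09],
    'male_10_15': [0.2, 0.42, 0.29, 1.06, 0.23, 0.06],
    'female_95_00': [1.24, 4.9, 1.85, 6.66, 0.55, 0.17],
    'male_95_00': [0.63, 0.54, 0.36, 2.83, 0.48, 0.08],
})

# Male/female ratio in each period; 1 means both rates are equal
dev['ratio_95_00'] = (dev['male_95_00'] / dev['female_95_00']).round(2)
dev['ratio_10_15'] = (dev['male_10_15'] / dev['female_10_15']).round(2)
# Positive change: the ratio is higher in 2010-2015 than in 1995-2000
dev['change'] = (dev['ratio_10_15'] - dev['ratio_95_00']).round(2)
print(dev[['country', 'ratio_95_00', 'ratio_10_15', 'change']])
```

A ratio that moves toward 1 between the two periods means the male and female rates are converging.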



<a id='conclusions'></a>
## Conclusions
