# Project: What promotes higher rates of female family workers?

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

In this project, we _analyse the impact of several indicators on the rate of female family workers_. The data comes from [GapMinder](https://www.gapminder.org/data/).

The purpose of this analysis is to understand:
- Which factors make a country more likely to have a higher rate of female family workers?
- How did the female/male family workers ratio evolve over time in two selected European countries?

To this end, we will analyse _economy, education, equality and society_ indicators to estimate their effect on the rate of female family workers. We select 10 countries around the world to frame this analysis.

####  Scope

> The 10 countries we will keep in the dataset for this analysis are:
1. Sweden
2. Germany
3. Belgium
4. Italy
5. Senegal
6. India
7. United States
8. Brazil
9. Syria
10. Australia

#### Questions

In this analysis, we will attempt to answer the following detailed questions:

1. How do the 10 countries rank by rate of female family workers?
2. Which indicator has the highest average positive correlation with the rate of female family workers?
3. What is the female/male ratio of family workers?
4. Which level of income/aid is associated with the most equal female/male ratio of family workers?
5. How did the female/male ratio of family workers evolve in Belgium and Italy between the past and today?

#### Data collection

> **Datasets**: we have downloaded 5 datasets from GapMinder in order to perform this analysis:
>
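> - Female family workers (% of female employment)
> - Male family workers (% of male employment)
> - Income per person (GDP per capita, PPP, inflation-adjusted)
> - Mean years in school (women as % of men, ages 25 to 34)
> - Human Development Index (HDI)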

In [49]:
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

# import datasets
df_female = pd.read_csv('female_family_workers_percent_of_female_employment.csv')
df_male = pd.read_csv('male_family_workers_percent_of_male_employment.csv')
df_income = pd.read_csv('income_per_person_gdppercapita_ppp_inflation_adjusted.csv')
df_school = pd.read_csv('mean_years_in_school_women_percent_men_25_to_34_years.csv')
df_hdi = pd.read_csv('hdi_human_development_index.csv')

<a id='wrangling'></a>
## Data Wrangling

#### Assessment
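> Checking the info of each of these dataframes, we notice that:
> - All the datasets share the same structure: countries in rows and years in columns
> - Indicator values are already numeric, so no type conversion is needed
> - The family worker datasets are sparse in the early years (a single non-null value per year from 1970 to 1975) and only become more complete from the 1990s
> - Some countries may be missing: the female dataset has 171 countries, the male one 170, and the school and HDI datasets 187
> - The school years and HDI datasets only contain data until 2015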
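
The completeness observations above can also be checked directly instead of being read off `info()`. The sketch below is only an illustration: it uses a small invented frame in the same wide layout (countries in rows, years in columns), and the helper names `non_null_per_year` and `missing_countries` are hypothetical, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame in the GapMinder wide layout; the values are invented for illustration
toy = pd.DataFrame({
    'country': ['Belgium', 'Italy', 'India'],
    '1995': [np.nan, 4.1, np.nan],
    '2010': [2.0, 1.9, 30.5],
    '2015': [1.5, np.nan, 28.0],
})

def non_null_per_year(df):
    """Number of countries with a value in each year column (hypothetical helper)."""
    return df.drop(columns='country').notna().sum()

def missing_countries(df, wanted):
    """Countries from `wanted` that do not appear in the frame (hypothetical helper)."""
    return sorted(set(wanted) - set(df['country']))

coverage = non_null_per_year(toy)
absent = missing_countries(toy, ['Belgium', 'Italy', 'Syria'])

print(coverage)
print(absent)
```

Applied to the real dataframes, the same two helpers would show how coverage grows over the years and whether any of the 10 selected countries is absent from a dataset.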

In [20]:
# Female family workers
df_female.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171 entries, 0 to 170
Data columns (total 49 columns):
country    171 non-null object
1970       1 non-null float64
1971       1 non-null float64
1972       1 non-null float64
1973       1 non-null float64
1974       1 non-null float64
1975       1 non-null float64
1976       4 non-null float64
1977       3 non-null float64
1978       3 non-null float64
1979       3 non-null float64
1980       4 non-null float64
1981       3 non-null float64
1982       4 non-null float64
1983       14 non-null float64
1984       13 non-null float64
1985       13 non-null float64
1986       16 non-null float64
1987       22 non-null float64
1988       22 non-null float64
1989       26 non-null float64
1990       37 non-null float64
1991       49 non-null float64
1992       47 non-null float64
1993       55 non-null float64
1994       59 non-null float64
1995       64 non-null float64
1996       76 non-null float64
1997       71 non-null float64
1998     

In [10]:
# Male family workers
df_male.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170 entries, 0 to 169
Data columns (total 49 columns):
country    170 non-null object
1970       1 non-null float64
1971       1 non-null float64
1972       1 non-null float64
1973       1 non-null float64
1974       1 non-null float64
1975       1 non-null float64
1976       4 non-null float64
1977       3 non-null float64
1978       3 non-null float64
1979       3 non-null float64
1980       4 non-null float64
1981       3 non-null float64
1982       4 non-null float64
1983       14 non-null float64
1984       13 non-null float64
1985       13 non-null float64
1986       16 non-null float64
1987       22 non-null float64
1988       22 non-null float64
1989       26 non-null float64
1990       37 non-null float64
1991       49 non-null float64
1992       47 non-null float64
1993       55 non-null float64
1994       58 non-null float64
1995       64 non-null float64
1996       75 non-null float64
1997       72 non-null float64
1998     

In [12]:
# Income
df_income.head()

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2031,2032,2033,2034,2035,2036,2037,2038,2039,2040
0,Afghanistan,603,603,603,603,603,603,603,603,603,...,2420,2470,2520,2580,2640,2700,2760,2820,2880,2940
1,Albania,667,667,667,667,667,668,668,668,668,...,18500,18900,19300,19700,20200,20600,21100,21500,22000,22500
2,Algeria,715,716,717,718,719,720,721,722,723,...,15600,15900,16300,16700,17000,17400,17800,18200,18600,19000
3,Andorra,1200,1200,1200,1200,1210,1210,1210,1210,1220,...,73200,74800,76400,78100,79900,81600,83400,85300,87200,89100
4,Angola,618,620,623,626,628,631,634,637,640,...,6270,6410,6550,6700,6850,7000,7150,7310,7470,7640


In [14]:
# School years
df_school.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 187 entries, 0 to 186
Data columns (total 47 columns):
country    187 non-null object
1970       187 non-null float64
1971       187 non-null float64
1972       187 non-null float64
1973       187 non-null float64
1974       187 non-null float64
1975       187 non-null float64
1976       187 non-null float64
1977       187 non-null float64
1978       187 non-null float64
1979       187 non-null float64
1980       187 non-null float64
1981       187 non-null float64
1982       187 non-null float64
1983       187 non-null float64
1984       187 non-null float64
1985       187 non-null float64
1986       187 non-null float64
1987       187 non-null float64
1988       187 non-null float64
1989       187 non-null float64
1990       187 non-null float64
1991       187 non-null float64
1992       187 non-null float64
1993       187 non-null float64
1994       187 non-null float64
1995       187 non-null float64
1996       187 non-null float64


In [17]:
# HDI
df_hdi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 187 entries, 0 to 186
Data columns (total 27 columns):
country    187 non-null object
1990       143 non-null float64
1991       143 non-null float64
1992       143 non-null float64
1993       143 non-null float64
1994       143 non-null float64
1995       147 non-null float64
1996       147 non-null float64
1997       147 non-null float64
1998       147 non-null float64
1999       150 non-null float64
2000       167 non-null float64
2001       167 non-null float64
2002       167 non-null float64
2003       169 non-null float64
2004       172 non-null float64
2005       181 non-null float64
2006       181 non-null float64
2007       181 non-null float64
2008       181 non-null float64
2009       181 non-null float64
2010       187 non-null float64
2011       187 non-null float64
2012       187 non-null float64
2013       187 non-null float64
2014       187 non-null float64
2015       187 non-null float64
dtypes: float64(26), object(1)

In [25]:
df_female.country.unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia',
       'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
       'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia',
       'Cameroon', 'Canada', 'Cape Verde', 'Chad', 'Chile', 'Colombia',
       'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.', 'Costa Rica',
       "Cote d'Ivoire", 'Croatia', 'Cuba', 'Cyprus', 'Czech Republic',
       'Denmark', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia',
       'Ethiopia', 'Fiji', 'Finland', 'France', 'Gabon', 'Gambia',
       'Georgia', 'Germany', 'Ghana', 'Greece', 'Grenada', 'Guatemala',
       'Guinea', 'Haiti', 'Honduras', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Iran', 'Ireland', 'Israel', 'It

### Data cleaning and new unified datasets

#### Cleaning

> Cleaning plan:
> - To get one column per indicator in the new dataset, we average each indicator over a period of years
> - For the first four questions, we average each indicator from 2010 to 2015 (recent overview)
> - For the last question, we also average each indicator from 1995 to 2000 (past overview)
> - We then merge all the datasets on the country
> - We check that all the countries we want are present
> - We filter out the countries we do not analyse
> - We run a sanity check for duplicates in the final dataset
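
The merge, filter, and duplicate check from the plan above could look roughly like the sketch below. It is not the notebook's final implementation: the two small frames and their values are invented, and using an outer merge is an assumption made here so that countries present in only one dataset stay visible.

```python
import pandas as pd

# Invented per-country period means for two indicators
female = pd.DataFrame({'country': ['Belgium', 'Italy', 'Chad'],
                       'female_10_15': [1.2, 1.8, 40.0]})
hdi = pd.DataFrame({'country': ['Belgium', 'Italy', 'Syria'],
                    'hdi_10_15': [0.89, 0.87, 0.60]})

# Countries we want to keep (shortened list for the example)
countries = ['Belgium', 'Italy', 'Syria', 'Senegal']

# Outer merge keeps countries that exist in only one dataset, so gaps show up as NaN
merged = female.merge(hdi, on='country', how='outer')

# Keep only the countries under analysis
selected = merged[merged['country'].isin(countries)].reset_index(drop=True)

# Countries we wanted but did not find in any dataset
not_found = sorted(set(countries) - set(selected['country']))

# Sanity check: each country should appear exactly once
n_duplicates = int(selected['country'].duplicated().sum())

print(selected)
print(not_found, n_duplicates)
```

Here Syria survives the merge with a missing female value, while Senegal is reported as not found, which is the kind of gap the plan asks us to check before analysing.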

In [23]:
# List the countries to keep in analysis
countries = ['Sweden', 'Belgium', 'Italy', 'Germany', 'Brazil', 'Senegal', 'India', 'United States', 'Australia', 'Syria']


In [26]:
df_female.head()

Unnamed: 0,country,1970,1971,1972,1973,1974,1975,1976,1977,1978,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,Afghanistan,,,,,,,,,,...,74.5,,,,38.6,,,,,
1,Albania,,,,,,,,,,...,48.7,43.9,43.6,43.2,50.2,45.1,42.0,39.8,,26.8
2,Algeria,,,,,,,,,,...,7.98,5.32,8.48,2.18,,2.94,1.63,,,2.41
3,Angola,,,,,,,,,,...,,,,42.1,,,6.99,,,
4,Antigua and Barbuda,,,,,,,,,,...,,,,,,,,,,


In [45]:
# Calculate mean for female workers indicator from 2010 to 2015 period per country
female_workers_10_15 = df_female.loc[:,'2010':'2015'].mean(axis=1).round(2)
# Calculate mean for female workers indicator from 1995 to 2000 period per country 
female_workers_95_00 = df_female.loc[:,'1995':'2000'].mean(axis=1).round(2)
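
One detail worth keeping in mind with these period means: `mean(axis=1)` skips missing values by default, so a country with a single observation in the period still gets a mean. The sketch below illustrates this with the 2010 to 2015 values of three rows copied from the `df_female.head()` output above; the `n_years` column and the `summary` frame are additions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Rows copied from the df_female.head() output, years 2010 to 2015
sample = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Algeria'],
    '2010': [np.nan, 43.6, 8.48],
    '2011': [np.nan, 43.2, 2.18],
    '2012': [38.6, 50.2, np.nan],
    '2013': [np.nan, 45.1, 2.94],
    '2014': [np.nan, 42.0, 1.63],
    '2015': [np.nan, 39.8, np.nan],
}).set_index('country')

period = sample.loc[:, '2010':'2015']
means = period.mean(axis=1)      # NaN values are skipped by default
n_years = period.count(axis=1)   # number of years that actually have a value

# Keep the count next to the mean so thinly supported averages are easy to spot
summary = pd.DataFrame({'mean_10_15': means.round(2), 'n_years': n_years})
print(summary)
```

Afghanistan's mean rests on one year only, so keeping a count like `n_years` alongside the mean may help when interpreting the rankings later.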

In [53]:
# Add means to dataframe
df_female['female_workers_10_15'] = female_workers_10_15
df_female['female_workers_95_00'] = female_workers_95_00

# Build a dataframe with only necessary information for our analysis and this indicator 
# keep countries in case we wanted to extend our analysis to other countries later
# select columns by name so the result does not depend on column positions
df_female_ind = df_female[['country', 'female_workers_10_15', 'female_workers_95_00']]

In [54]:
df_female_ind.head()

Unnamed: 0,country,female_workers_10_15,female_workers_95_00
0,Afghanistan,38.6,
1,Albania,43.98,
2,Algeria,3.81,
3,Angola,24.54,
4,Antigua and Barbuda,,


In [55]:
# Calculate mean for male workers indicator from 2010 to 2015 period per country
male_workers_10_15 = df_male.loc[:,'2010':'2015'].mean(axis=1).round(2)
# Calculate mean for male workers indicator from 1995 to 2000 period per country 
male_workers_95_00 = df_male.loc[:,'1995':'2000'].mean(axis=1).round(2)

# Add means to dataframe
df_male['male_workers_10_15'] = male_workers_10_15
df_male['male_workers_95_00'] = male_workers_95_00

# Build a dataframe with only necessary information for our analysis and this indicator 
# keep countries in case we wanted to extend our analysis to other countries later
# select columns by name so the result does not depend on column positions
df_male_ind = df_male[['country', 'male_workers_10_15', 'male_workers_95_00']]

In [75]:
# Map each indicator name to its dataframe, so names no longer have to be looked up in globals()
indicators = {
    'female': df_female,
    'male': df_male,
    'income': df_income,
    'school': df_school,
    'hdi': df_hdi,
}

In [106]:
# Build one reduced dataframe per indicator, stored by indicator name
df_ind = {}

for name, data in indicators.items():

    # Calculate mean for this indicator from 2010 to 2015 period per country
    recent = data.loc[:, '2010':'2015'].mean(axis=1).round(2)
    # Calculate mean for this indicator from 1995 to 2000 period per country
    past = data.loc[:, '1995':'2000'].mean(axis=1).round(2)

    # Build a dataframe with only necessary information for our analysis and this indicator
    # keep countries in case we wanted to extend our analysis to other countries later
    df_ind[name] = pd.DataFrame({
        'country': data['country'],
        name + '_10_15': recent,
        name + '_95_00': past,
    })
    print(name)

female
male
income
school
hdi


In [108]:
df_ind['hdi'].head()

Unnamed: 0,country,hdi_10_15,hdi_95_00
0,Afghanistan,0.47,0.33
1,Albania,0.76,0.64
2,Algeria,0.74,0.62
3,Andorra,0.84,
4,Angola,0.52,0.39


<a id='eda'></a>
## Exploratory Data Analysis


### Research Question 1: How do the 10 countries rank by rate of female family workers?
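
One possible way to approach the ranking question from the introduction, and the female/male ratio that goes with it, is sketched below. The values are invented for illustration and the column names `female_10_15` and `male_10_15` are assumptions about how the cleaned dataset ends up being named; the real figures would come from the merged dataset.

```python
import pandas as pd

# Invented period means; the real values would come from the cleaned dataset
df = pd.DataFrame({
    'country': ['Sweden', 'India', 'Italy'],
    'female_10_15': [0.3, 33.0, 1.8],
    'male_10_15': [0.2, 11.0, 1.2],
})

# Rank countries from highest to lowest share of female family workers
ranked = df.sort_values('female_10_15', ascending=False).reset_index(drop=True)
ranked['rank'] = ranked.index + 1

# A ratio above 1 means family work weighs more in female than in male employment
ranked['female_male_ratio'] = ranked['female_10_15'] / ranked['male_10_15']

print(ranked)
```

Because both indicators are percentages of each gender's own employment, the ratio compares shares rather than head counts.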


### Research Question 2: Which indicator has the highest average positive correlation with the rate of female family workers?
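
For the correlation question from the introduction, a minimal sketch could rely on `DataFrame.corr()`, which computes Pearson correlations by default. The numbers below are invented and the column names are assumptions about the cleaned dataset. With only 10 countries, any coefficient obtained this way is descriptive and does not imply causation.

```python
import pandas as pd

# Invented period means for five countries
df = pd.DataFrame({
    'female_10_15': [0.3, 1.8, 33.0, 25.0, 5.0],
    'income_10_15': [45000, 35000, 5000, 3000, 15000],
    'hdi_10_15': [0.91, 0.87, 0.61, 0.49, 0.75],
})

# Pearson correlation of every indicator with the female family worker share
corr = df.corr()['female_10_15'].drop('female_10_15')

# Indicator with the largest correlation in absolute value
strongest = corr.abs().idxmax()

print(corr)
print(strongest)
```

In this toy data both indicators move in the opposite direction to the female family worker share, so the sign of each coefficient matters as much as its size when answering the question.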


<a id='conclusions'></a>
## Conclusions
