# Project: What promotes higher rates of female family workers?

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

In this project, we _analyse how several indicators relate to the share of female family workers_ (female family workers as a percentage of female employment). The data is extracted from [GapMinder](https://www.gapminder.org/data/).

The purpose of this analysis is to understand:
- Which factors are associated with a higher share of female family workers in a country?
- How did the female/male family workers ratio evolve over time in 2 selected European countries?

To this end, we will analyse _economy, education, equality and society_ indicators to estimate how they relate to the rate of female family workers. We select 10 countries around the world to frame this analysis.

#### Scope

> The 10 countries we will keep in the dataset for this analysis are:
1. Sweden
2. Germany
3. Belgium
4. Italy
5. Senegal
6. India
7. USA
8. Brazil
9. Syria
10. Australia

#### Questions

In this analysis, we will attempt to answer the following detailed questions:

1. How do the 10 countries rank by share of female family workers?
2. Which indicator has the highest average positive correlation with the share of female family workers?
3. What is the female/male ratio of family workers?
4. Which level of income/aid is associated with the most equal female/male ratio of family workers?
5. How has the female/male ratio of family workers in Belgium and Italy evolved between the past (1995 to 2000) and today (2010 to 2015)?

#### Data collection

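> **Datasets**: we downloaded 5 datasets from GapMinder to perform this analysis:
>
> - Female family workers (% of female employment)
> - Male family workers (% of male employment)
> - Income (GDP per capita, PPP, inflation-adjusted)
> - Mean years in school (women as % of men, 25 to 34 years)
> - Human Development Index (HDI)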

In [1]:
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

# import datasets
df_female = pd.read_csv('female_family_workers_percent_of_female_employment.csv')
df_male = pd.read_csv('male_family_workers_percent_of_male_employment.csv')
df_income = pd.read_csv('income_per_person_gdppercapita_ppp_inflation_adjusted.csv')
df_school = pd.read_csv('mean_years_in_school_women_percent_men_25_to_34_years.csv')
df_hdi = pd.read_csv('hdi_human_development_index.csv')

<a id='wrangling'></a>
## Data Wrangling

#### Assessment
> Inspecting each dataframe's info shows that:
> - All datasets appear to share the same structure: countries in rows and years in columns
> - Indicator values are already stored as floats, so no type conversion is needed
> - Data is more complete for recent years
> - Some countries may be missing
> - School years and HDI data are only available until 2015
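The checks above can be sketched as one small helper that is applied to every table. This is a minimal sketch on a made-up toy table, assuming each dataset has a `country` column followed by one column per year; `summarize` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for one Gapminder-style table (values are made up)
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "1995": [10.0, None, 30.0],
    "2010": [12.0, 22.0, None],
    "2015": [11.0, 21.0, 31.0],
})

def summarize(df):
    """Return the year range, a dtype check and the share of missing values per year."""
    years = df.columns.drop("country")
    return {
        "first_year": years.min(),
        "last_year": years.max(),
        # True when every year column is stored as a float
        "all_float": all(pd.api.types.is_float_dtype(df[c]) for c in years),
        # Fraction of countries with no value, per year column
        "missing_share": df[years].isna().mean().round(2).to_dict(),
    }

summary = summarize(toy)
print(summary)
```

Running the same helper on `df_female`, `df_male`, `df_income`, `df_school` and `df_hdi` would make it easy to compare their year coverage side by side.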

In [None]:
# Female family workers
df_female.info()

In [None]:
# Male family workers
df_male.info()

In [None]:
# Income
df_income.head()

In [None]:
# School years
df_school.info()

In [None]:
# HDI
df_hdi.info()

In [None]:
df_female.country.unique()

### Data cleaning and new unified datasets

#### Cleaning

> Cleaning steps:
> - To get one column per indicator in the new dataset, we average each indicator over a period
> - For the first questions, we average each indicator from 2010 to 2015 (recent overview)
> - For the last question, we average each indicator from 1995 to 2000 (past overview)
> - We then merge all the datasets on country
> - We check that all the countries we want are present
> - We filter out the countries we do not analyse
> - We run a sanity check for duplicates in the final dataset
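The merge, filter and duplicate-check steps listed above could look roughly like the following. This is a sketch on made-up toy tables, assuming each per-indicator dataframe has a `country` column; the variable names `female`, `male`, `income`, `merged`, `missing` and `selected` are illustrative only.

```python
from functools import reduce
import pandas as pd

# Toy per-indicator tables (values are made up)
female = pd.DataFrame({"country": ["Belgium", "Italy", "Peru"], "female_10_15": [1.0, 2.0, 3.0]})
male = pd.DataFrame({"country": ["Belgium", "Italy", "Peru"], "male_10_15": [0.5, 1.0, 2.0]})
income = pd.DataFrame({"country": ["Belgium", "Italy"], "income_10_15": [40000.0, 35000.0]})

countries = ["Belgium", "Italy", "Syria"]

# Outer merge on country, so a country missing from one table is kept with NaN
merged = reduce(
    lambda left, right: left.merge(right, on="country", how="outer"),
    [female, male, income],
)

# Countries we want that do not appear in any table
missing = sorted(set(countries) - set(merged["country"]))

# Keep only the countries in scope
selected = merged[merged["country"].isin(countries)].reset_index(drop=True)

# Sanity check: each country should appear exactly once
n_duplicates = int(selected["country"].duplicated().sum())

print(missing, len(selected), n_duplicates)
```

An outer merge is used here so that gaps stay visible as `NaN`; an inner merge would silently drop any country that is absent from one of the tables.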

In [None]:
# List the countries to keep in analysis
countries = ['Sweden', 'Belgium', 'Italy', 'Germany', 'Brazil', 'Senegal', 'India', 'United States', 'Australia', 'Syria']


In [17]:
list_dataframes = [df_female, df_male, df_income, df_school, df_hdi]


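# Return the name of the first global variable bound to this dataframe
# (only works for dataframes assigned to a notebook-level name such as df_female)
def get_df_name(df):
    name = [x for x in globals() if globals()[x] is df][0]
    return name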
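Looking names up in `globals()` works in a notebook but depends on how the variables happen to be named. One possible alternative, sketched below on made-up toy tables, is to keep the indicator names explicitly in a dictionary; `indicators` and `trimmed` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for two of the Gapminder tables (values are made up)
df_female = pd.DataFrame({"country": ["A", "B"], "2010": [1.0, 3.0], "2015": [2.0, 5.0]})
df_male = pd.DataFrame({"country": ["A", "B"], "2010": [2.0, 4.0], "2015": [2.0, 6.0]})

# Explicit mapping from indicator name to dataframe
indicators = {"female": df_female, "male": df_male}

# One trimmed dataframe per indicator, keyed by its new name
trimmed = {}
for name, df in indicators.items():
    out = df[["country"]].copy()
    # Mean of the indicator per country over the 2010-2015 columns
    out[name + "_10_15"] = df.loc[:, "2010":"2015"].mean(axis=1).round(2)
    trimmed["df_" + name + "_ind"] = out

print(list(trimmed))
```

With this layout the new dataframes are reached as `trimmed["df_female_ind"]` rather than as separate variables.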

In [22]:
# Build one trimmed dataframe per indicator

# Names of the new dataframes (initialised once, outside the loop,
# so that it is not reset on every iteration)
list_new_df = []

for i in list_dataframes:

    # Prepare naming, e.g. "df_female" -> "female"
    x = get_df_name(i).split("_", 2)[1]
    # New column names
    y = x + "_10_15"
    z = x + "_95_00"
    # New dataframe name
    df_name = "df_" + x + "_ind"

    # Mean of the indicator per country over the 2010-2015 period
    k = i.loc[:, '2010':'2015'].mean(axis=1).round(2)
    # Mean of the indicator per country over the 1995-2000 period
    l = i.loc[:, '1995':'2000'].mean(axis=1).round(2)

    # Add means to dataframe
    i[y] = k
    i[z] = l

    # Keep only the first column (country) and the two new mean columns
    df = i.iloc[:, [0, -2, -1]].copy()
    # Bind the new dataframe to its name at notebook level
    globals()[df_name] = df
    # Record the name of the new dataframe
    list_new_df.append(df_name)

In [23]:
list_new_df

['df_female_ind', 'df_male_ind', 'df_income_ind', 'df_school_ind', 'df_hdi_ind']

In [15]:
df_female_ind

| | country | female_10_15 | female_95_00 |
|---|---|---|---|
| 0 | Afghanistan | 38.60 | |
| 1 | Albania | 43.98 | |
| 2 | Algeria | 3.81 | |
| 3 | Angola | 24.54 | |
| 4 | Antigua and Barbuda | | |
| ... | ... | ... | ... |
| 166 | Venezuela | 1.44 | 1.64 |
| 167 | Vietnam | 24.42 | 55.56 |
| 168 | Yemen | 38.50 | 0.35 |
| 169 | Zambia | 52.00 | 48.13 |


<a id='eda'></a>
## Exploratory Data Analysis


### Research Question 1: How do the 10 countries rank by share of female family workers?
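Ranking countries by the recent share of female family workers could be approached as follows. This is only a sketch on made-up values, assuming a dataframe with a `country` column and a `female_10_15` column like the one built during wrangling; it does not report any result from the real data.

```python
import pandas as pd

# Toy table (values are made up, not taken from the Gapminder data)
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "female_10_15": [5.0, 40.0, 12.5],
})

# Sort from the highest to the lowest share of female family workers
ranked = toy.sort_values("female_10_15", ascending=False).reset_index(drop=True)

# Rank 1 is the country with the highest share; ties share the lowest rank
ranked["rank"] = ranked["female_10_15"].rank(ascending=False, method="min").astype(int)

print(ranked)
```

Countries with a missing value would need to be handled first, for example by dropping them before ranking.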



### Research Question 2: Which indicator has the highest average positive correlation with the share of female family workers?
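One way to compare indicators is to compute the correlation of each one with the female family workers column and pick the largest positive value. The sketch below uses made-up values and pandas' default Pearson correlation; the column names follow the `<indicator>_10_15` pattern from the wrangling step and are otherwise assumptions.

```python
import pandas as pd

# Toy merged table (values are made up, not taken from the Gapminder data)
toy = pd.DataFrame({
    "female_10_15": [1.0, 2.0, 3.0, 4.0],
    "income_10_15": [40.0, 30.0, 20.0, 10.0],
    "school_10_15": [2.0, 4.0, 6.0, 8.0],
})

# Pearson correlation of every indicator with the female family workers share
corr = toy.corr()["female_10_15"].drop("female_10_15")

# Indicator with the highest positive correlation
strongest_positive = corr.idxmax()

print(corr)
print(strongest_positive)
```

With only 10 countries, such correlations are descriptive only and do not show that an indicator causes a higher or lower share.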



<a id='conclusions'></a>
## Conclusions
