
## Research Proposal ##

For this capstone project, we have elected to perform analysis upon the dataset from the link below:

https://datacatalog.worldbank.org/dataset/worldwide-bureaucracy-indicators

The above dataset comes from the Worldwide Bureaucracy Indicators database, hosted on the World Bank's website. The dataset comprises census-based data, mainly floating-point numerical values, on public and private sector employment in 130+ nations between the years 2000 and 2018.

This dataset is interesting due to its scope, content, and potential for addressing socio-economic questions. Loaded into a pandas DataFrame, it has shape (12144, 24), i.e., 12144 rows, one per country and indicator pair, and 24 columns. Analysis of this dataset allows for comparisons between nations or continents along a variety of axes, or across spans of time.

Specifically, this capstone will seek to answer the following question:

1. On average, is there a difference in the female share of paid private sector employees in European Union member nations between 2012 and 2017, given the samples that have been collected?

To answer this question, we will state our null and alternative hypotheses as follows:

$H_0 : $ there is no difference in the female share of paid private sector employees in EU member nations between 2012 and 2017

$H_a : $ there is a difference in the female share of paid private sector employees in EU member nations between 2012 and 2017

To test these hypotheses, we will begin by conducting normality tests from the SciPy stats library, in particular the D'Agostino-Pearson test (`stats.normaltest`) and the Shapiro-Wilk test (`stats.shapiro`), to determine whether the data are normally distributed. Depending on the results, we will proceed with either a parametric test, such as the one-way ANOVA, or a non-parametric test to further test our hypotheses.
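The testing plan above can be sketched as a small decision routine. This is a minimal illustration rather than the notebook's actual procedure: the helper `choose_and_run_test` is hypothetical, the data are synthetic, and the Kruskal-Wallis test is assumed as the non-parametric fallback, since the proposal does not name one.

```python
import numpy as np
from scipy import stats

def choose_and_run_test(groups, alpha=0.05):
    """Hypothetical helper: check normality per group, then run a matching test."""
    # Drop missing values so that NaN entries do not contaminate the tests
    cleaned = []
    for g in groups:
        g = np.asarray(g, dtype=float)
        cleaned.append(g[~np.isnan(g)])

    # A group counts as plausibly normal only if neither test rejects normality
    looks_normal = all(
        stats.normaltest(g).pvalue > alpha and stats.shapiro(g).pvalue > alpha
        for g in cleaned
    )

    if looks_normal:
        return "one-way ANOVA", stats.f_oneway(*cleaned)
    # Assumed fallback; the proposal does not specify a non-parametric test
    return "Kruskal-Wallis", stats.kruskal(*cleaned)

# Synthetic stand-in data: three "years" of 22 made-up country values each
rng = np.random.default_rng(0)
fake_years = [rng.normal(0.45, 0.03, size=22) for _ in range(3)]
test_name, result = choose_and_run_test(fake_years)
print(test_name, result.pvalue)
```

Requiring both normality tests to pass is a conservative choice made for this sketch; other rules for combining them are equally defensible.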

Time-permitting, we may seek to answer a secondary question:

2. On average, is there a difference in the female to male wage ratio in the private sector of European Union member nations, given the samples collected, between 2012 and 2017?

Our null and alternative hypotheses for this question are the following:

$H_0 : $ there is no difference in the female to male wage ratio in the private sector of EU member nations, given the collected samples, between 2012 and 2017

$H_a : $ there is a difference in the female to male wage ratio in the private sector of EU member nations, given the collected samples, between 2012 and 2017

The testing for this secondary question will generally proceed in the same manner as that for the primary question above.


These questions are rooted in economics and social justice and bear on the advancement of women in the European Union. They should interest those in government, social justice, or the business sector who seek to address potential pay discrepancies or gender-based hiring and employment biases in the European Union. The general public would also benefit from seeing the answers directly, as the questions, especially the one on the female to male wage ratio, are relevant regardless of location.


## Question 1 ##

In [1]:
# Import modules and libraries 

%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

In [2]:
# Import our data into a dataframe

wwbi = pd.read_csv("WWBIData.csv")


In [None]:
# Determine the shape of the table 

wwbi.shape

12144 rows and 24 columns. Not too bad.

In [None]:
# Display the first 10 rows

wwbi.head(10)

It seems that there are a lot of null values or missing entries within the dataframe.  

In [None]:
# Check for null values by counting them in each column

wwbi.isna().sum()

In [None]:
# Females as a share of private paid employees

wwbi_female_priv_paid = wwbi[wwbi['Indicator Name'] == 'Females as a share of private paid employees']

wwbi_female_priv_paid.head(5)

In [None]:
# Check countries 

wwbi_female_priv_paid[wwbi_female_priv_paid['Country Name'] == 'Sweden']

wwbi.columns

In [None]:
# List of EU countries with data within our table

eu = ['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 
      'Estonia', 'Finland', 'France', 'Greece', 'Hungary', 
      'Ireland', 'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 
      'Poland', 'Portugal', 'Romania', 'Slovak Republic', 'Spain']

In [None]:
# Collect the rows for each country in the eu list into a single DataFrame, in list order
# (DataFrame.append was removed in pandas 2.0, so we use pd.concat instead)

wwbi_female_priv_paid_EU = pd.concat(
    [wwbi_female_priv_paid[wwbi_female_priv_paid['Country Name'] == country] for country in eu],
    ignore_index = True)


In [None]:
wwbi_EU_fem = wwbi_female_priv_paid_EU.drop(columns=['Country Code', 'Indicator Name', 'Indicator Code', 'Unnamed: 23'])

wwbi_EU_fem

In [None]:
# Generate a plot for the evolution of the female share of the private
# paid sector over time, one line per country

# Set a figure size such that the plot is legible

plt.figure(figsize = (20,10))

# Legend labels, in the same order as the rows of wwbi_EU_fem (Austria - Spain)

labels = ['AUT', 'BEL', 'BUL', 'CRO', 'CYP', 'CZH', 'EST', 'FIN',
          'FRA', 'GRE', 'HUN', 'IRL', 'ITA', 'LAT', 'LIT', 'LUX',
          'MAL', 'POL', 'POR', 'ROM', 'SVK', 'ESP']

# Plot each country's row (skipping the 'Country Name' column) against the year columns

for row, label in enumerate(labels):

    plt.plot(wwbi_EU_fem.columns[1:], list(wwbi_EU_fem.iloc[row].values[1:]), label = label)

# Add labels and a legend to the plot 

plt.legend(loc = 'lower left', ncol = 2)
plt.title('Female share of the paid private sector 2000 - 2018')
plt.xlabel('Year')
plt.ylabel('Female Share')

plt.savefig('EU_female_share.png')

Upon inspection of the dataframe and the above plot, we cannot analyze the full span from 2000 to 2018, since a majority of the nations within the above dataframe are missing data for the years 2000 to 2004. An analysis is possible for the years from 2007 onward.
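The visual inspection above can be complemented by counting, for each year, how many countries actually have a value. The sketch below uses a made-up miniature table (`toy`, with invented values) in the same wide country-by-year layout; it is an illustration of the idea, not output from the WWBI data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the wide country-by-year table (values are made up)
toy = pd.DataFrame({
    'Country Name': ['A', 'B', 'C', 'D'],
    '2005': [np.nan, np.nan, 0.41, np.nan],
    '2006': [0.40, np.nan, 0.42, 0.39],
    '2007': [0.41, 0.44, 0.42, 0.40],
})

# Year columns are the ones whose names are all digits
year_cols = [c for c in toy.columns if c.isdigit()]

# Number of countries with a non-missing value in each year
coverage = toy[year_cols].notna().sum()

# Years in which every country has data
complete_years = coverage[coverage == len(toy)].index.tolist()

print(coverage)
print(complete_years)
```

Applied to `wwbi_EU_fem`, the same `notna().sum()` call would show at a glance which years have enough countries to support a test.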

In [None]:
# Conduct a preliminary normality check via histogram plots for 2007 - 2017
# (columns 8 through 18 of wwbi_EU_fem)

for i in range(11):

    year = wwbi_EU_fem.columns[i + 8]

    plt.figure(figsize=(10,5))

    # Drop missing values so that the bin range can be computed
    plt.hist(wwbi_EU_fem[year].dropna(), alpha = 0.5, label = year)
    plt.title("Female share of private sector - {year}".format(year = year))

Some of these appear to be vaguely normally distributed. To be more certain, we will conduct the D'Agostino-Pearson normality test (`stats.normaltest`) and the Shapiro-Wilk test for each year.

In [None]:
# Conduct normality tests for the years 2007 - 2017


for i in range(11):

    print("Normal Test: ", stats.normaltest(wwbi_EU_fem[wwbi_EU_fem.columns[i + 8]]))
    print("Shapiro-Wilks Test: ", stats.shapiro(wwbi_EU_fem[wwbi_EU_fem.columns[i + 8]]))

It seems that when conducting tests for the years 2007 to 2011, we are unable to obtain a test statistic. This most likely stems from missing values for some countries in those years, which leave too small a usable sample. Thus, for further analysis, we will only consider the years 2012 to 2017.
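One plausible mechanism, assuming the failing years contain missing entries, is that `stats.normaltest` propagates NaN by default (`nan_policy='propagate'`). The sketch below reproduces that behaviour on synthetic data; the sample values are made up and stand in for one year's column.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for one year's column: 22 made-up country values
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.45, scale=0.03, size=22)

# Simulate one country with no data for that year
with_gap = sample.copy()
with_gap[3] = np.nan

# Default behaviour: a single NaN makes the statistic and p-value NaN
propagated = stats.normaltest(with_gap)

# Omitting NaN values runs the test on the 21 remaining observations
omitted = stats.normaltest(with_gap, nan_policy='omit')

print(propagated)
print(omitted)
```

Whether to omit missing values or drop the affected years is a judgment call; omitting keeps more years but shrinks each year's sample.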

In [None]:
# Print Shapiro-Wilks and Normal test result statements for the years 2012 - 2017

for i in range(6):

    print("Normal Test {}: ".format(wwbi_EU_fem.columns[i + 13]), stats.normaltest(wwbi_EU_fem[wwbi_EU_fem.columns[i + 13]]))
    print("Shapiro-Wilks Test {}: ".format(wwbi_EU_fem.columns[i + 13]), stats.shapiro(wwbi_EU_fem[wwbi_EU_fem.columns[i + 13]]))

In [None]:
# Check the distributions for kurtosis and skewness

for i in range(6):

    print("{}: ".format(wwbi_EU_fem.columns[i + 13]),stats.describe(wwbi_EU_fem[wwbi_EU_fem.columns[i + 13]]))


Since every p-value for the years 2012 to 2017 is greater than 0.05, we fail to reject the null hypothesis of normality for each year; the samples are consistent with a normal distribution, with skewness and kurtosis in the range (-1, 1). We can thus proceed with a parametric test method. In particular, as we have multiple groups that appear normally distributed, we will conduct a one-way ANOVA test.

In [None]:
# Conduct an ANOVA parametric test

print(stats.f_oneway(wwbi_EU_fem['2012'], wwbi_EU_fem['2013'], wwbi_EU_fem['2014'], wwbi_EU_fem['2015'], wwbi_EU_fem['2016'], wwbi_EU_fem['2017']))

Since the p-value is greater than 0.05, we are unable to reject the null hypothesis. We can interpret this p-value as follows: if our null hypothesis were true, there would be a 99.99% chance of observing differences between the yearly means at least as large as those observed.
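To make the origin of the ANOVA p-value concrete, the sketch below computes the F statistic by hand as the ratio of between-group to within-group variance and compares it with `stats.f_oneway`. The six groups are synthetic (22 made-up values each, all drawn from one distribution), so the numbers are illustrative and are not the notebook's results.

```python
import numpy as np
from scipy import stats

# Six synthetic "years", each with 22 made-up country values
rng = np.random.default_rng(42)
groups = [rng.normal(0.45, 0.03, size=22) for _ in range(6)]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Variation of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the two mean squares
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

# The p-value is the upper tail of the F distribution with (k - 1, n_total - k) df
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, p_manual)
print(f_scipy, p_scipy)
```

A p-value near 1 corresponds to an F statistic near 0, meaning the yearly means differ far less than the spread within each year would lead one to expect.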

In [None]:
# Plot the difference in means via point plot 

g = sns.pointplot(data= [wwbi_EU_fem['2012'], wwbi_EU_fem['2013'], wwbi_EU_fem['2014'], 
                         wwbi_EU_fem['2015'], wwbi_EU_fem['2016'], wwbi_EU_fem['2017']], join = False)

g.set(xticklabels = ['2012', '2013', '2014', '2015', '2016', '2017'])

plt.savefig('Point_plot_EU_fem.png')

## Question 2 ##

In [None]:
wwbi_wage_ratio = wwbi[wwbi['Indicator Name'] == 'Female to male wage ratio in the private sector (using median)']


wwbi_wage_ratio

In [None]:
# We will iterate through a list of EU member nations and collect the matching
# rows into a single DataFrame

# Create a list of European Union nations

EU_Mitglieder = ['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 
      'Denmark', 'Estonia', 'Finland', 'France', 'Greece', 'Hungary', 
      'Ireland', 'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 
      'Poland', 'Portugal', 'Romania', 'Slovak Republic', 'Spain']

# Gather the rows for each nation and concatenate them in list order
# (DataFrame.append was removed in pandas 2.0, so we use pd.concat instead)

wwbi_empty_frame = pd.concat([wwbi[wwbi['Country Name'] == country] for country in EU_Mitglieder],
                             ignore_index = True)

In [None]:
# Filter the DataFrame based upon the chosen indicator

wwbi_eu_wage_ratio = wwbi_empty_frame[wwbi_empty_frame['Indicator Name'] == 'Female to male wage ratio in the private sector (using median)']

# Drop the columns 'Country Code', 'Indicator Code', 'Indicator Name', and 'Unnamed: 23'
# Display the resulting DataFrame

wwbi_eu_wage_ratio = wwbi_eu_wage_ratio.drop(columns= ['Country Code', 'Indicator Code', 'Indicator Name', 'Unnamed: 23'])
wwbi_eu_wage_ratio

For any analysis, we will have to consider the years 2008 to 2018, as there are more than 20 countries with data in each year. 

In [None]:
# Plot histograms for a preliminary normality check

# wwbi_eu_wage_ratio.columns[19]

for i in range(10):

    plt.figure(figsize= (10, 5))

    plt.hist(wwbi_eu_wage_ratio[wwbi_eu_wage_ratio.columns[i + 9]], alpha = 0.5)
    plt.title("Female to male wage ratio (using median) - {}".format(wwbi_eu_wage_ratio.columns[i + 9]))

In [None]:
# We will need to check for skewness and kurtosis


for i in range(10):

    print(stats.describe(wwbi_eu_wage_ratio[wwbi_eu_wage_ratio.columns[i + 9]]))


In [None]:
# Conduct normal and Shapiro-Wilks tests

for i in range(10):

    print("{}: ".format(wwbi_eu_wage_ratio.columns[i + 9]), stats.normaltest(wwbi_eu_wage_ratio[wwbi_eu_wage_ratio.columns[i + 9]]))
    print("{}: ".format(wwbi_eu_wage_ratio.columns[i + 9]), stats.shapiro(wwbi_eu_wage_ratio[wwbi_eu_wage_ratio.columns[i + 9]]))

It is evident that the years 2008 - 2015 do not have enough entries for the normality checks to return a value. There are two options going forward:

1. We fill in the missing values for each nation with a replacement value

2. We proceed with our analysis for only the years 2016 and 2017

The only ethical option is option 2, since filling in missing values would mean analyzing data that were never observed. We therefore conduct our analysis for the years 2016 and 2017 only.

In [None]:
# Repeat our normality tests, this time for only 2016 and 2017

for i in range(2):

    # Test the wage ratio data itself (not the female share frame wwbi_EU_fem)
    print("{}:".format(wwbi_eu_wage_ratio.columns[i + 17]), stats.normaltest(wwbi_eu_wage_ratio[wwbi_eu_wage_ratio.columns[i + 17]]))
    print("{}:".format(wwbi_eu_wage_ratio.columns[i + 17]), stats.shapiro(wwbi_eu_wage_ratio[wwbi_eu_wage_ratio.columns[i + 17]]))

The p-values for each year and test are greater than 0.05, so we fail to reject normality for these samples. We may then continue our analysis with an independent samples t-test.

In [None]:
# Perform an independent samples t-test

print(stats.ttest_ind(wwbi_eu_wage_ratio['2016'], wwbi_eu_wage_ratio['2017']))

As the p-value is greater than 0.05, we are unable to reject $H_0$, implying that there is not a statistically significant difference between the female to male wage ratio in 2016 and 2017. We will now plot a point plot and obtain the confidence interval.
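For reference, a two-sided 95% confidence interval for a difference in means under a normal approximation takes the following form, where $\bar{x}_1, \bar{x}_2$ are the sample means, $s_1^2, s_2^2$ the sample variances, and $n_1, n_2$ the sample sizes for 2016 and 2017 respectively. The symbols are introduced here for illustration; 1.96 is the standard normal critical value, which is an approximation for samples of this size.

$$
(\bar{x}_2 - \bar{x}_1) \pm 1.96 \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
$$

If the resulting interval contains 0, it is consistent with failing to reject $H_0$ at the 5% level.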

In [None]:
# Get the 95% confidence interval

# Import the math module

import math

# First we define a function to compute the 95% confidence interval
def get_95_ci(array_1, array_2):

    sample_1_n = array_1.shape[0]
    sample_2_n = array_2.shape[0]
    sample_mean_1 = array_1.mean()
    sample_mean_2 = array_2.mean()
    sample_1_var = array_1.var()
    sample_2_var = array_2.var()

    mean_diff = sample_mean_2 - sample_mean_1
    std_err_diff = math.sqrt((sample_1_var/sample_1_n) + (sample_2_var/sample_2_n))
    margin_of_error = 1.96 * std_err_diff

    ci_lower = mean_diff - margin_of_error
    ci_upper = mean_diff + margin_of_error

    return("The difference in means at the 95% confidence interval (two-tail) is between "+str(ci_lower)+" and "+str(ci_upper)+".") 

get_95_ci(wwbi_eu_wage_ratio['2016'], wwbi_eu_wage_ratio['2017'])


In [None]:
# Plot the difference in means with a pointplot

h = sns.pointplot(data = [wwbi_eu_wage_ratio['2016'], wwbi_eu_wage_ratio['2017']], join = False)

h.set(xticklabels = ['2016', '2017'])
plt.title("Point plot 2016 vs. 2017")