# **Group 2 - Project Update**

## Section 1: Project Overview

Provide an overview of your project including the research questions. You may start with the text from your proposal but it must be edited for clarity and include any updates to your thinking.

Questions of Interest:
1. What is the relationship between education level and the employment rate gap?
2. Are birth rates correlated with the unemployment rate (by education level)?
3. What are the differences between paternity and maternity leave in the United States?
4. What is the correlation between GDP and women’s labor participation?
5. How does women’s labor participation rate change over time?
6. How much of the global labor rate gap is due to cultural norm differences?

## Section 2: Milestones and Progress

Below is our project timeline:

*   Data Collection - Shaunak - 10/19
*   Data Filtering - Shaunak - 10/24
*   Data Selection - Shamit - 10/25
*   Handling Missing Data - Ishita - 10/28
*   Data Transformation and Merging - Siddhant - 10/31
*   Data Cleaning - Yue - 11/6
*   Data Processing - Asheer - 11/9

Based on our project timeline, we are close to schedule and have completed most of the data processing. As the remaining cells of this notebook show, the core of our processing tasks is complete. Our immediate goal is to refine the existing code so that it is more efficient and modular and can be reapplied with variations easily. Once that refinement is done, we will carry out the remaining processing so that the output can be fed directly into visualizations.

Our timeline called for processing to be completed by 11/9. We are slightly behind that date, but not by enough to cause problems, and we are confident we will keep to the rest of our schedule.

## Section 3: Data Acquisition and Cleaning Code

Provide code that demonstrates you have made progress with data acquisition and cleaning. In a markdown cell at the top of the section, summarize what you have accomplished thus far. Then follow through with the code that shows what has been accomplished.



In [None]:
# Import the 'drive' module from the 'google.colab' library to mount Google Drive.
from google.colab import drive

# Mount Google Drive at the '/content/drive' directory in the Colab environment.
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Install the 'pycountry' library.
!pip install pycountry

# Install the 'pycountry-convert' library.
!pip install pycountry-convert



Installing two Python libraries using pip:

* We install the pycountry library, which provides information about countries.
* We also install the pycountry-convert library, which is an extension of pycountry and provides utilities for converting and working with country-related data.

These libraries can be useful for tasks such as retrieving country information, working with country codes, and converting between different country-related data formats.

In [None]:
# Import necessary libraries and modules.
import pandas as pd
import numpy as np
import matplotlib as mtp
import spacy
import pycountry_convert as pc

This cell imports the following libraries, aliasing most of them for convenience:

* pandas (as pd) for data manipulation.
* numpy (as np) for numerical computations.
* matplotlib (as mtp) for data visualization.
* spacy for natural language processing.
* pycountry_convert (as pc) for working with country-related data.



---



## Explanation of Dataset Structure

The 'Gender_StatsData.csv' dataset is a comprehensive collection of gender-related data, sourced from the World Bank, covering a broad range of countries and regions over several decades. The dataset's structure is as follows:

- **Country Name**: This column contains the names of the countries or regions to which the data pertains. It serves as a reference point for identifying geographic locations.

- **Country Code**: This column provides unique codes or identifiers for each country or region within the dataset. These codes are valuable for cross-referencing and data management.

- **Indicator Name**: In this column, you'll find descriptions or names of specific gender-related indicators or metrics being measured. These indicators span a wide range of gender-related data points.

- **Indicator Code**: This column contains unique codes or identifiers corresponding to the gender-related indicators. These codes are useful for programmatic referencing and organization.

- **Year Columns (1960 to 2022)**: The dataset encompasses a significant time span, from 1960 to 2022. Each column within this range represents a specific year and contains corresponding data values associated with the gender-related indicators for that year.

The dataset follows a time-series structure, enabling users to analyze and track changes in gender-related data across different countries or regions over time. To work effectively with this dataset, various data analysis tasks can be performed, such as filtering, aggregating, and visualizing the data. This enables the extraction of valuable insights into gender-related trends and patterns, allowing for informed decision-making and policy analysis.
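To make the wide layout concrete, here is a minimal sketch that builds a made-up two-row frame with the same column structure. The `toy_wide` name and all values are invented for illustration and do not come from the World Bank file.

```python
import pandas as pd

# Toy frame mimicking the wide layout: four identifier columns plus one column per year.
# All names and values below are invented for illustration.
toy_wide = pd.DataFrame({
    "Country Name": ["Country A", "Country B"],
    "Country Code": ["AAA", "BBB"],
    "Indicator Name": ["Example indicator", "Example indicator"],
    "Indicator Code": ["EX.IND", "EX.IND"],
    "1960": [1.0, None],
    "1961": [1.5, 2.0],
})

# Year columns are the ones whose labels consist only of digits.
year_columns = [col for col in toy_wide.columns if col.isdigit()]
print(year_columns)

# Share of missing cells per year column, a quick check of how sparse each year is.
missing_share = toy_wide[year_columns].isna().mean()
print(missing_share)
```

Running the same missing-share check on the real data would show which years are too sparse to be useful for a given indicator.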


In [None]:
# Define the file path to your CSV file.
file_path = r'drive/MyDrive/Copy of Gender_StatsData.csv'

# Use the pandas library to read the CSV file into a DataFrame.
gender_df = pd.read_csv(file_path)

In [None]:
# Display the top 10 rows of the dataset
gender_df.head(10)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,Africa Eastern and Southern,AFE,A woman can apply for a passport in the same w...,SG.APL.PSPT.EQ,,,,,,,...,,,,,,,,,,
1,Africa Eastern and Southern,AFE,A woman can be head of household in the same w...,SG.HLD.HEAD.EQ,,,,,,,...,,,,,,,,,,
2,Africa Eastern and Southern,AFE,A woman can choose where to live in the same w...,SG.LOC.LIVE.EQ,,,,,,,...,,,,,,,,,,
3,Africa Eastern and Southern,AFE,A woman can get a job in the same way as a man...,SG.GET.JOBS.EQ,,,,,,,...,,,,,,,,,,
4,Africa Eastern and Southern,AFE,A woman can obtain a judgment of divorce in th...,SG.OBT.DVRC.EQ,,,,,,,...,,,,,,,,,,
5,Africa Eastern and Southern,AFE,A woman can open a bank account in the same wa...,SG.OPN.BANK.EQ,,,,,,,...,,,,,,,,,,
6,Africa Eastern and Southern,AFE,A woman can register a business in the same wa...,SG.BUS.REGT.EQ,,,,,,,...,,,,,,,,,,
7,Africa Eastern and Southern,AFE,A woman can sign a contract in the same way as...,SG.CNT.SIGN.EQ,,,,,,,...,,,,,,,,,,
8,Africa Eastern and Southern,AFE,A woman can travel outside her home in the sam...,SG.HME.TRVL.EQ,,,,,,,...,,,,,,,,,,
9,Africa Eastern and Southern,AFE,A woman can travel outside the country in the ...,SG.CTR.TRVL.EQ,,,,,,,...,,,,,,,,,,


1. We define the `file_path` variable, which holds the path to the CSV file. The `r` prefix marks a raw string; it matters for Windows paths that contain backslashes and has no effect on the forward-slash path used here.
2. We use the pd.read_csv() method to read the CSV file specified by file_path and store the data in the gender_df DataFrame.

In [None]:
# Melt the 'gender_df' DataFrame to transform it into long format, using specific columns as identifiers.
gender_df = gender_df.melt(id_vars=['Country Name','Country Code','Indicator Name','Indicator Code'])

# Rename the 'variable' column to 'Year' for clarity.
gender_df.rename(columns={"variable":"Year"}, inplace=True)

## Data Transformation: Melting Wide Data to Long Format in the gender_df DataFrame
This code snippet transforms the `gender_df` DataFrame. Below is an explanation of what the code does:

1. **Melt Data**: The `melt` function is used to transform the `gender_df` DataFrame from a wide format to a long format. It rearranges the data so that each row represents a unique combination of 'Country Name,' 'Country Code,' 'Indicator Name,' 'Indicator Code,' and 'Year.' The `id_vars` parameter specifies the columns that will serve as identifiers, and the other columns are "melted" into rows.

2. **Rename Column**: After melting the DataFrame, the column labeled 'variable' is renamed to 'Year' for clarity. This is done using the `rename` method with the `inplace=True` argument, which modifies the DataFrame in place.

The result of running this code is a long-format `gender_df` with the columns 'Country Name,' 'Country Code,' 'Indicator Name,' 'Indicator Code,' 'Year,' and 'value.' Each row holds one indicator for one country in one year. The 'Year' column contains the former column labels (it is the melted column that pandas names 'variable' by default), and the 'value' column holds the corresponding measurement.

The appearance of the output DataFrame will have a structure suitable for various data analysis tasks, such as filtering, aggregating, and visualizing data over time.
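The wide-to-long step can be sketched on a toy frame as follows. This is an illustrative variant, not the notebook's exact code: it passes `var_name` and `value_name` to `melt` instead of renaming afterwards, and it converts the year labels to integers, which the notebook does not do. All data values are invented.

```python
import pandas as pd

# Toy wide-format frame; names and values are invented for illustration.
toy_wide = pd.DataFrame({
    "Country Name": ["Country A", "Country B"],
    "Country Code": ["AAA", "BBB"],
    "Indicator Name": ["Example indicator", "Example indicator"],
    "Indicator Code": ["EX.IND", "EX.IND"],
    "1960": [1.0, None],
    "1961": [1.5, 2.0],
})

id_columns = ["Country Name", "Country Code", "Indicator Name", "Indicator Code"]

# Name the melted columns directly rather than renaming 'variable' afterwards.
toy_long = toy_wide.melt(id_vars=id_columns, var_name="Year", value_name="value")

# Year labels come from column names, so they are strings until converted.
toy_long["Year"] = toy_long["Year"].astype(int)
print(toy_long)
```

Converting 'Year' to an integer is optional, but it makes numeric comparisons such as `toy_long["Year"] >= 2000` behave as expected.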

In [None]:
# Display the top 10 rows to check if the changes to the dataset have been reflected
gender_df.head(10)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,Year,value
0,Africa Eastern and Southern,AFE,A woman can apply for a passport in the same w...,SG.APL.PSPT.EQ,1960,
1,Africa Eastern and Southern,AFE,A woman can be head of household in the same w...,SG.HLD.HEAD.EQ,1960,
2,Africa Eastern and Southern,AFE,A woman can choose where to live in the same w...,SG.LOC.LIVE.EQ,1960,
3,Africa Eastern and Southern,AFE,A woman can get a job in the same way as a man...,SG.GET.JOBS.EQ,1960,
4,Africa Eastern and Southern,AFE,A woman can obtain a judgment of divorce in th...,SG.OBT.DVRC.EQ,1960,
5,Africa Eastern and Southern,AFE,A woman can open a bank account in the same wa...,SG.OPN.BANK.EQ,1960,
6,Africa Eastern and Southern,AFE,A woman can register a business in the same wa...,SG.BUS.REGT.EQ,1960,
7,Africa Eastern and Southern,AFE,A woman can sign a contract in the same way as...,SG.CNT.SIGN.EQ,1960,
8,Africa Eastern and Southern,AFE,A woman can travel outside her home in the sam...,SG.HME.TRVL.EQ,1960,
9,Africa Eastern and Southern,AFE,A woman can travel outside the country in the ...,SG.CTR.TRVL.EQ,1960,


In [None]:
# Set a multi-index on the DataFrame, with 'Country Name' and 'Year' as the index levels.
gender_df.set_index(["Country Name","Year"], inplace=True)

# Sort the DataFrame based on the multi-index levels, first by 'Country Name' and then by 'Year'.
gender_df.sort_index(level=['Country Name','Year'], inplace=True)

# Delete the 'Country Code' column, as it's no longer needed.
del gender_df['Country Code']

# Remove rows with missing values (NaN) from the DataFrame.
gender_df.dropna(inplace=True)

## Data Refinement: Preparing gender_df DataFrame for Analysis
The above code continues to perform various operations on the `gender_df` DataFrame. Here's an explanation of each step:

1. **Set a Multi-Index**: The code sets a multi-index on the DataFrame using the `.set_index` method, with 'Country Name' and 'Year' as the index levels. This groups the rows by country and year. The index is not unique, because every indicator recorded for a given country and year shares the same index entry.

2. **Sort the DataFrame**: After setting the multi-index, the code sorts the DataFrame based on the multi-index levels. It first sorts by 'Country Name' and then by 'Year'. Sorting the data in this way can be useful for data visualization and analysis tasks, ensuring that the data is organized in a meaningful order.

3. **Delete 'Country Code' Column**: The code deletes the 'Country Code' column from the DataFrame using the `del` statement. Since 'Country Code' was likely used for identification and is no longer needed for analysis, removing it can help reduce unnecessary data and simplify the DataFrame.

4. **Remove Rows with Missing Values (NaN)**: The code uses the `.dropna` method to remove rows in the DataFrame that contain missing values (NaN). This step is essential for data quality, as it ensures that only rows with complete data are retained for analysis.

The result of running this code will be a modified version of the `gender_df` DataFrame with the following characteristics:

- It has a multi-index with 'Country Name' and 'Year' as the index levels.
- The DataFrame is sorted first by 'Country Name' and then by 'Year.'
- The 'Country Code' column is deleted from the DataFrame.
- Rows with missing values (NaN) are removed from the DataFrame.

This modified DataFrame is now in a structured and clean format, ready for further data analysis, visualization, and other tasks.
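The refinement steps can be sketched on a small invented frame. As a stylistic assumption, this version chains the methods instead of using `inplace=True`, and it restricts `dropna` to the 'value' column; the notebook's own code drops rows with a missing value in any column.

```python
import pandas as pd

# Toy long-format frame; names and values are invented for illustration.
toy_long = pd.DataFrame({
    "Country Name": ["Country B", "Country A", "Country A", "Country A"],
    "Year": [1961, 1961, 1960, 1960],
    "Indicator Code": ["EX.ONE", "EX.ONE", "EX.ONE", "EX.TWO"],
    "value": [2.0, None, 1.0, 3.0],
})

toy_indexed = (
    toy_long
    .set_index(["Country Name", "Year"])  # two-level index
    .sort_index()                          # sort by country, then year
    .dropna(subset=["value"])              # keep only rows with a measurement
)
print(toy_indexed)

# Two indicators share the (Country A, 1960) entry, so the index is not unique.
print(toy_indexed.index.is_unique)
```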

In [None]:
gender_df.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Indicator Name,Indicator Code,value
Country Name,Year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Afghanistan,1960,"Adolescent fertility rate (births per 1,000 wo...",SP.ADO.TFRT,138.876
Afghanistan,1960,Age dependency ratio (% of working-age populat...,SP.POP.DPND,80.051114
Afghanistan,1960,"Age population, age 0, female, interpolated",SP.POP.AG00.FE.IN,178344.5
Afghanistan,1960,"Age population, age 0, male, interpolated",SP.POP.AG00.MA.IN,182281.0
Afghanistan,1960,"Age population, age 01, female, interpolated",SP.POP.AG01.FE.IN,151954.5
Afghanistan,1960,"Age population, age 01, male, interpolated",SP.POP.AG01.MA.IN,154208.5
Afghanistan,1960,"Age population, age 02, female, interpolated",SP.POP.AG02.FE.IN,139152.0
Afghanistan,1960,"Age population, age 02, male, interpolated",SP.POP.AG02.MA.IN,141483.5
Afghanistan,1960,"Age population, age 03, female, interpolated",SP.POP.AG03.FE.IN,130322.0
Afghanistan,1960,"Age population, age 03, male, interpolated",SP.POP.AG03.MA.IN,132635.0


In [None]:
# Filter data for a specific gender-related indicator and reset the index for analysis.
indicator1 = 'Unemployment with advanced education, female (% of female labor force with advanced education)'

# Filter the 'gender_df' DataFrame to select rows where the 'Indicator Name' matches the specified indicator.
employment_time_data_indicator1 = gender_df[gender_df['Indicator Name'] == indicator1].reset_index()

# Display the first 10 rows of the dataset
employment_time_data_indicator1.head(10)

Unnamed: 0,Country Name,Year,Indicator Name,Indicator Code,value
0,Afghanistan,2014,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,9.96
1,Afghanistan,2017,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,30.49
2,Afghanistan,2020,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,27.74
3,Afghanistan,2021,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,20.21
4,Albania,2002,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,3.13
5,Albania,2005,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,5.48
6,Albania,2007,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,15.31
7,Albania,2008,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,15.38
8,Albania,2009,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,18.53
9,Albania,2010,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,14.39


## Filtering and Displaying Data for a Specific Indicator

In the above code snippet, we filter and display data from the `gender_df` DataFrame for a specific indicator, "Unemployment with advanced education, female (% of female labor force with advanced education)." This code can be broken down into the following steps:

1. **Defining the Indicator**:
   - An indicator has been defined, named `indicator1`, which corresponds to the specific gender-related metric, "Unemployment with advanced education, female (% of female labor force with advanced education)."

2. **Filtering Data**:
   - The code filters the `gender_df` DataFrame to select rows where the 'Indicator Name' matches the defined indicator, `indicator1`. This isolates the data associated with the specified metric.

3. **Resetting the Index**:
   - Following the filtering step, the code calls `.reset_index()`. This gives the filtered DataFrame a default integer index and moves the original index levels ('Country Name' and 'Year') back into regular columns.

4. **Displaying Data**:
   - The code, by displaying the resulting `employment_time_data_indicator1` DataFrame, provides a view of the data associated with the indicator "Unemployment with advanced education, female (% of female labor force with advanced education)." This DataFrame now includes information about 'Country Name,' 'Year,' and other relevant columns, allowing the user to analyze and interpret the data for the specified indicator.

In summary, this code is an example of how to filter and extract data for a specific gender-related indicator, making it easier to focus on and explore data related to a specific indicator. The resulting DataFrame, `employment_time_data_indicator1`, is a valuable resource for in-depth analysis and further investigation of this particular gender-related metric.
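Since the stated goal is to make the code modular, the filtering step could be wrapped in a small helper. The sketch below is one possible shape, not the notebook's code: `select_indicator` is a hypothetical name, and it matches on 'Indicator Code' rather than the longer 'Indicator Name'. The toy values are invented; the code `SL.UEM.ADVN.FE.ZS` is the one shown in the output above.

```python
import pandas as pd

def select_indicator(long_df, indicator_code):
    """Return the rows for one indicator, with the index moved back into columns."""
    mask = long_df["Indicator Code"] == indicator_code
    return long_df[mask].reset_index()

# Toy frame indexed the same way as gender_df; values are invented for illustration.
toy = pd.DataFrame({
    "Country Name": ["Country A", "Country A", "Country B"],
    "Year": [2010, 2010, 2011],
    "Indicator Code": ["SL.UEM.ADVN.FE.ZS", "EX.OTHER", "SL.UEM.ADVN.FE.ZS"],
    "value": [9.5, 1.0, 12.0],
}).set_index(["Country Name", "Year"])

selected = select_indicator(toy, "SL.UEM.ADVN.FE.ZS")
print(selected)
```

Matching on the short code avoids typos in the long indicator names, and the same helper can be reused for each research question.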


In [None]:
# List of all countries in the dataset.
all_countries = employment_time_data_indicator1['Country Name'].unique()

# List of non-country entities that need to be categorized separately.
non_country = ['Early-demographic dividend',
               'Caribbean small states',
               'Central Europe and the Baltics',
               'Euro area',
               'European Union','High income','IDA blend',
               'Latin America & Caribbean','Lower middle income',
               'Middle East & North Africa',
               'North America','OECD members',
               'Post-demographic dividend','South Asia']

# List of countries or regions in different continents
asia = ["Hong Kong SAR, China",'Korea, Rep.','Lao PDR','Macao SAR, China','Timor-Leste','West Bank and Gaza']
europe = ['Kosovo','Turkiye']
africa = ["Cote d'Ivoire"]
south_america = ['Curacao']

# List of keywords to categorize non-countries.
logic = ['(','&']

In [None]:
# Function to categorize countries into continents.
def make_continent(country):
    # Names containing '(' or '&' are aggregate regions, not countries.
    if logic[0] in country or logic[1] in country:
        return 'ZZ_REGION'
    elif country in non_country:
        return 'ZZ_REGION'
    elif country in asia:
        return 'Asia'
    elif country in europe:
        return 'Europe'
    elif country in south_america:
        return 'South America'
    elif country in africa:
        return 'Africa'
    else:
        # Split the country name to handle cases like "Country, Subdivision".
        country = country.split(',')[0]
        # Use the pycountry_convert library (imported as pc) to determine the country's continent.
        country_alpha2 = pc.country_name_to_country_alpha2(country)
        country_continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
        country_continent_name = pc.convert_continent_code_to_continent_name(country_continent_code)
        return country_continent_name

In [None]:
# Apply the 'make_continent' function to create a new 'Continent' column.
employment_time_data_indicator1['Continent'] = employment_time_data_indicator1['Country Name'].apply(make_continent)

In [None]:
employment_time_data_indicator1.head(10)

Unnamed: 0,Country Name,Year,Indicator Name,Indicator Code,value,Continent
0,Afghanistan,2014,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,9.96,Asia
1,Afghanistan,2017,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,30.49,Asia
2,Afghanistan,2020,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,27.74,Asia
3,Afghanistan,2021,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,20.21,Asia
4,Albania,2002,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,3.13,Europe
5,Albania,2005,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,5.48,Europe
6,Albania,2007,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,15.31,Europe
7,Albania,2008,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,15.38,Europe
8,Albania,2009,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,18.53,Europe
9,Albania,2010,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,14.39,Europe


## Categorizing Countries by Continent in the 'employment_time_data_indicator1' DataFrame


The above code snippets detail the process of categorizing countries and regions within the 'employment_time_data_indicator1' DataFrame based on their respective continents. The code performs the following steps:

1. **Defining Country Lists**: The code begins by defining several lists:
   - `all_countries` stores unique country names within the dataset.
   - `non_country` lists entities that are not countries and need to be categorized separately.
   - `asia`, `europe`, `africa`, and `south_america` list specific countries or regions belonging to these continents.
   - `logic` stores keywords used to categorize non-countries.

2. **Categorization Function**: The 'make_continent' function is defined to categorize countries into continents based on a set of conditions:
   - If the country name contains specific keywords from the 'logic' list or is found in the 'non_country' list, it is categorized as 'ZZ_REGION,' indicating it's not a specific country.
   - Countries in Asia, Europe, South America, and Africa are categorized accordingly.
   - For all other countries, the code splits the country name to handle cases where there are subdivisions (e.g., "Country, Subdivision"). It then uses the 'pycountry_convert' library to determine the country's continent based on the alpha-2 country code.

3. **Applying the Function**: The 'make_continent' function is applied to the 'Country Name' column in the 'employment_time_data_indicator1' DataFrame. The results are stored in a new 'Continent' column, which now categorizes each country or region into its respective continent.

This code aids the analysis of 'employment_time_data_indicator1' by adding a 'Continent' column, which makes it possible to examine the data by continent and easier to identify continent-specific trends and patterns.
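One way the categorization logic could be made more reusable is to pass the lookup in as a function and to flag unresolved names instead of letting one bad name stop the whole `apply` call. The sketch below is hypothetical: `categorize`, `toy_lookup`, and the 'UNKNOWN' label are illustrative names, and the `toy_continents` dictionary stands in for the pycountry_convert calls so the example runs on its own. It also assumes the lookup raises `KeyError` for unrecognized names, which is worth checking against pycountry_convert before relying on it.

```python
def categorize(name, lookup, region_names=(), region_markers=("(", "&"), overrides=None):
    """Classify a name as an aggregate region, a manual override, or a looked-up continent."""
    overrides = overrides or {}
    if name in region_names or any(marker in name for marker in region_markers):
        return "ZZ_REGION"
    if name in overrides:
        return overrides[name]
    try:
        # Keep only the part before the first comma, e.g. "Country, Subdivision".
        return lookup(name.split(",")[0])
    except KeyError:
        # Assumption: the lookup signals an unknown name with KeyError.
        return "UNKNOWN"

# Stand-in lookup for illustration only; the notebook would pass a function
# that wraps the pycountry_convert conversion chain instead.
toy_continents = {"Albania": "Europe", "Benin": "Africa"}

def toy_lookup(name):
    return toy_continents[name]

names = ["Albania", "Euro area", "Latin America & Caribbean", "Kosovo", "Atlantis"]
results = {
    name: categorize(name, toy_lookup, region_names={"Euro area"}, overrides={"Kosovo": "Europe"})
    for name in names
}
print(results)
```

Rows labeled 'UNKNOWN' can then be inspected and added to the override mapping, rather than being discovered through an exception partway through processing.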

In [None]:
# Create a multi-index for the DataFrame using 'Continent,' 'Country Name,' and 'Year.'
continent_employement_time_grouped = employment_time_data_indicator1.set_index(['Continent', 'Country Name', 'Year'])

# Sort the DataFrame based on the multi-index levels.
continent_employement_time_grouped = continent_employement_time_grouped.sort_index(level=['Continent', 'Country Name', 'Year'])

# Display the first 50 rows of the grouped DataFrame.
continent_employement_time_grouped.head(50)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Indicator Name,Indicator Code,value
Continent,Country Name,Year,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Africa,Algeria,2004,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,35.43
Africa,Algeria,2017,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,25.36
Africa,Angola,2004,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,25.62
Africa,Angola,2009,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,5.58
Africa,Angola,2011,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,20.81
Africa,Angola,2014,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,6.86
Africa,Angola,2019,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,24.9
Africa,Angola,2021,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,22.17
Africa,Benin,2011,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,16.18
Africa,Benin,2018,"Unemployment with advanced education, female (...",SL.UEM.ADVN.FE.ZS,3.86


## Organizing Employment Time Data by Continent, Country, and Year

**Explanation:**

The above code aims to structure and organize employment time data within the 'employment_time_data_indicator1' DataFrame by creating a multi-index that includes 'Continent,' 'Country Name,' and 'Year' as hierarchical levels. The step-by-step explanation is as follows:

**Step 1: Set Multi-Index**

In this step, a multi-index is established for the DataFrame. The columns 'Continent,' 'Country Name,' and 'Year' are selected as the levels of the index. This hierarchical structure enables data organization and categorization based on these key attributes, facilitating continent-specific and time-based analysis.

**Step 2: Sort the DataFrame**

After creating the multi-index, the DataFrame is sorted for enhanced clarity. The sorting order first arranges the data by 'Continent,' followed by 'Country Name,' and lastly by 'Year.' This ordered presentation streamlines data analysis and visualization tasks.

**Step 3: Display the Data**

The final step involves displaying the initial 50 rows of the organized DataFrame, 'continent_employement_time_grouped.' By doing so, users can quickly inspect a portion of the data in its structured form, ideal for continent-specific exploration and analysis.

The resulting DataFrame is now structured with a multi-index, making it easier to delve into employment time trends by continent, country, and year.
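As a sketch of how this structure might feed a visualization, the toy example below averages the indicator by continent and year after excluding the 'ZZ_REGION' aggregates. The data are invented, and the unweighted mean across countries is an assumption made for illustration; a population-weighted figure may be more appropriate for the final analysis.

```python
import pandas as pd

# Toy frame with the same three-level index; values are invented for illustration.
toy = pd.DataFrame({
    "Continent": ["Africa", "Africa", "Europe", "Europe", "ZZ_REGION"],
    "Country Name": ["Country A", "Country B", "Country C", "Country D", "Example region"],
    "Year": [2010, 2010, 2010, 2011, 2010],
    "value": [20.0, 10.0, 5.0, 7.0, 8.0],
}).set_index(["Continent", "Country Name", "Year"]).sort_index()

# Exclude aggregate regions so they are not averaged together with countries.
is_country = toy.index.get_level_values("Continent") != "ZZ_REGION"
countries_only = toy[is_country]

# Unweighted mean of the indicator per continent and year.
continent_year_mean = countries_only.groupby(level=["Continent", "Year"])["value"].mean()
print(continent_year_mean)
```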