# 1. Data Set Review & Description

In [1]:
import pandas as pd 
import numpy as np

In [2]:
def read_csv(csv_url):
    return pd.read_csv(csv_url)

In [52]:
def is_null(df):
    return df.isnull().sum().sort_values(ascending=False)

## Unpaid care work dataset

- Unpaid care work refers to all unpaid services provided within a household for its members, including care of persons, housework and voluntary community work.
- Source: https://ourworldindata.org/grapher/female-to-male-ratio-of-time-devoted-to-unpaid-care-work

In [3]:
df_unpaid_care_work = read_csv('data/1-female-to-male-ratio-of-time-devoted-to-unpaid-care-work.csv')

In [4]:
df_unpaid_care_work.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69 entries, 0 to 68
Data columns (total 4 columns):
 #   Column                                                                  Non-Null Count  Dtype  
---  ------                                                                  --------------  -----  
 0   Entity                                                                  69 non-null     object 
 1   Code                                                                    68 non-null     object 
 2   Year                                                                    69 non-null     int64  
 3   Female to male ratio of time devoted to unpaid care work (OECD (2014))  69 non-null     float64
dtypes: float64(1), int64(1), object(2)
memory usage: 2.3+ KB


- The ratio is calculated by dividing the average time spent by females by the average time spent by males.
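
As a minimal sketch of that calculation, assuming hypothetical average daily minutes (the two input values are illustrative and are not taken from the dataset):

```python
# Hypothetical average daily minutes of unpaid care work (illustrative only).
female_minutes = 300.0
male_minutes = 120.0

# Female-to-male ratio: average female time divided by average male time.
ratio = female_minutes / male_minutes
print(ratio)  # 2.5
```

A ratio of 2.5 would mean women spend two and a half times as long as men on unpaid care work; a ratio of 1.0 would indicate parity.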

In [5]:
df_unpaid_care_work.head()

Unnamed: 0,Entity,Code,Year,Female to male ratio of time devoted to unpaid care work (OECD (2014))
0,Albania,ALB,2014,7.21
1,Algeria,DZA,2014,6.75
2,Argentina,ARG,2014,2.88
3,Armenia,ARM,2014,5.24
4,Australia,AUS,2014,1.81


In [8]:
df_unpaid_care_work_2 = read_csv('data/4-female-to-male-ratio-of-time-devoted-to-unpaid-care-work.csv')

In [9]:
df_unpaid_care_work_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69 entries, 0 to 68
Data columns (total 4 columns):
 #   Column                                                                  Non-Null Count  Dtype  
---  ------                                                                  --------------  -----  
 0   Entity                                                                  69 non-null     object 
 1   Code                                                                    68 non-null     object 
 2   Year                                                                    69 non-null     int64  
 3   Female to male ratio of time devoted to unpaid care work (OECD (2014))  69 non-null     float64
dtypes: float64(1), int64(1), object(2)
memory usage: 2.3+ KB


In [12]:
# The two datasets appear to be identical.
# `Code` is False only because it holds one null value, and NaN never compares equal to NaN.
(df_unpaid_care_work_2 == df_unpaid_care_work).all()

Entity                                                                     True
Code                                                                      False
Year                                                                       True
Female to male ratio of time devoted to unpaid care work (OECD (2014))     True
dtype: bool
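
The `False` for `Code` comes from how elementwise `==` treats missing values. A small sketch on toy frames (the rows below are hypothetical, not from the dataset) shows the difference between elementwise comparison and `DataFrame.equals`, which counts NaNs in the same position as equal:

```python
import numpy as np
import pandas as pd

# Two toy frames with a missing Code in the same position (illustrative rows).
left = pd.DataFrame({"Entity": ["Albania", "Region X"], "Code": ["ALB", np.nan]})
right = left.copy()

# Elementwise comparison: NaN == NaN is False, so the Code column is flagged.
elementwise = (left == right).all()

# NaN-aware comparison: NaNs in the same location count as equal.
identical = left.equals(right)

print(elementwise["Code"], identical)
```

With frames like these, `equals` gives a single yes/no answer without the missing value masking an otherwise exact match.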

In [13]:
# Dropping the second DataFrame because it duplicates the first
del df_unpaid_care_work_2

In [6]:
# All the countries are unique
# There are 69 countries
df_unpaid_care_work['Entity'].nunique()

69

In [53]:
# There is one null value, in the Code column
# It does not affect the ratio values
is_null(df_unpaid_care_work)

Code                                                                      1
Entity                                                                    0
Year                                                                      0
Female to male ratio of time devoted to unpaid care work (OECD (2014))    0
dtype: int64

In [7]:
# The data covers only 2014
df_unpaid_care_work.describe()

Unnamed: 0,Year,Female to male ratio of time devoted to unpaid care work (OECD (2014))
count,69.0,69.0
mean,2014.0,3.248696
std,0.0,2.510711
min,2014.0,1.18
25%,2014.0,1.81
50%,2014.0,2.53
75%,2014.0,3.38
max,2014.0,17.29


In [56]:
# Changing the column name for readability
df_unpaid_care_work = df_unpaid_care_work.rename(columns={"Female to male ratio of time devoted to unpaid care work (OECD (2014))": "f_to_m_unpaid_care_work_ratio"})

In [59]:
# Normalizing the ratio so values are easier to compare
# Using min-max normalization, which rescales values to the range 0-1
ratio = df_unpaid_care_work['f_to_m_unpaid_care_work_ratio']
df_unpaid_care_work['normalized_ratio'] = (ratio - ratio.min()) / (ratio.max() - ratio.min())



In [61]:
df_unpaid_care_work.describe()

Unnamed: 0,Year,f_to_m_unpaid_care_work_ratio,normalized_ratio
count,69.0,69.0,69.0
mean,2014.0,3.248696,0.128411
std,0.0,2.510711,0.155848
min,2014.0,1.18,0.0
25%,2014.0,1.81,0.039106
50%,2014.0,2.53,0.083799
75%,2014.0,3.38,0.136561
max,2014.0,17.29,1.0
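
The normalized summary can be spot-checked by hand. The sketch below plugs the minimum, maximum and 75th percentile from the `describe()` outputs into the min-max formula; it only re-derives one reported number and does not touch the DataFrame:

```python
# Summary values copied from the describe() outputs above.
ratio_min, ratio_max = 1.18, 17.29
ratio_q75 = 3.38

# Min-max normalization: (x - min) / (max - min).
normalized_q75 = (ratio_q75 - ratio_min) / (ratio_max - ratio_min)
print(round(normalized_q75, 6))  # 0.136561
```

The result matches the 75% row of `normalized_ratio`, which is expected because min-max scaling is a monotonic linear transformation and therefore maps quantiles to quantiles.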


## Maternal Mortality Dataset

 - The World Health Organization defines maternal death as the death of a pregnant mother due to complications related to pregnancy, underlying conditions worsened by the pregnancy, or the management of these conditions.
 - The Maternal Mortality Ratio compares the number of pregnancy-related deaths to the number of births in the same period.
 - Source: https://ourworldindata.org/maternal-mortality
 - [Maternal mortality declined by 34 per cent between 2000 and 2020](https://data.unicef.org/topic/maternal-health/maternal-mortality/)
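
A minimal sketch of the ratio, assuming the common convention of expressing it per 100,000 live births (the scaling factor is an assumption about this dataset, and the counts are hypothetical):

```python
def maternal_mortality_ratio(maternal_deaths, live_births, per=100_000):
    """Pregnancy-related deaths per `per` live births in the same period."""
    if live_births <= 0:
        raise ValueError("live_births must be positive")
    return maternal_deaths / live_births * per

# Hypothetical counts, for illustration only.
mmr = maternal_mortality_ratio(maternal_deaths=45, live_births=30_000)
print(round(mmr, 1))
```

Under that convention, 45 deaths among 30,000 live births corresponds to a ratio of about 150.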

In [14]:
df_maternal_mortality = read_csv('data/5-maternal-mortality.csv')

In [15]:
df_maternal_mortality.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5800 entries, 0 to 5799
Data columns (total 4 columns):
 #   Column                                                             Non-Null Count  Dtype  
---  ------                                                             --------------  -----  
 0   Entity                                                             5800 non-null   object 
 1   Code                                                               5548 non-null   object 
 2   Year                                                               5800 non-null   int64  
 3   Maternal Mortality Ratio (Gapminder (2010) and World Bank (2015))  5800 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 181.4+ KB


In [16]:
df_maternal_mortality.head()

Unnamed: 0,Entity,Code,Year,Maternal Mortality Ratio (Gapminder (2010) and World Bank (2015))
0,Afghanistan,AFG,2000,1450.0
1,Afghanistan,AFG,2001,1390.0
2,Afghanistan,AFG,2002,1300.0
3,Afghanistan,AFG,2003,1240.0
4,Afghanistan,AFG,2004,1180.0


In [17]:
df_maternal_mortality.groupby(by='Entity').count()

Unnamed: 0_level_0,Code,Year,Maternal Mortality Ratio (Gapminder (2010) and World Bank (2015))
Entity,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,18,18,18
Albania,18,18,18
Algeria,18,18,18
Angola,18,18,18
Antigua and Barbuda,18,18,18
...,...,...,...
Vietnam,18,18,18
World,18,18,18
Yemen,18,18,18
Zambia,18,18,18


In [18]:
df_maternal_mortality.groupby(by='Year').count()

Unnamed: 0_level_0,Entity,Code,Maternal Mortality Ratio (Gapminder (2010) and World Bank (2015))
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1751,2,2,2
1752,2,2,2
1753,2,2,2
1754,2,2,2
1755,2,2,2
...,...,...,...
2016,200,186,200
2017,200,186,200
2018,35,35,35
2019,32,32,32


## Income datasets

In [19]:
df_income = read_csv('data/2-share-of-women-in-top-income-groups.csv')

- Percentage of individuals in each top income bracket who are women.
- Source: https://ourworldindata.org/grapher/share-of-women-in-top-income-groups-atkinson-casarico-voitchovsky-2018

In [20]:
df_income.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168 entries, 0 to 167
Data columns (total 9 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Entity                       168 non-null    object 
 1   Code                         148 non-null    object 
 2   Year                         168 non-null    int64  
 3   Share of women in top 0.1%   131 non-null    float64
 4   Share of women in top 0.25%  37 non-null     float64
 5   Share of women in top 0.5%   82 non-null     float64
 6   Share of women in top 1%     167 non-null    float64
 7   Share of women in top 10%    168 non-null    float64
 8   Share of women in top 5%     168 non-null    float64
dtypes: float64(6), int64(1), object(2)
memory usage: 11.9+ KB


In [21]:
df_income.head()

Unnamed: 0,Entity,Code,Year,Share of women in top 0.1%,Share of women in top 0.25%,Share of women in top 0.5%,Share of women in top 1%,Share of women in top 10%,Share of women in top 5%
0,Australia,AUS,2000,14.2,,,18.3,24.9,21.1
1,Australia,AUS,2001,13.2,,,18.4,25.1,21.4
2,Australia,AUS,2002,13.5,,,18.8,25.1,21.5
3,Australia,AUS,2003,14.4,,,19.1,25.1,21.6
4,Australia,AUS,2004,15.2,,,19.6,25.5,22.2


In [22]:
df_income.groupby(by='Entity').count()

Unnamed: 0_level_0,Code,Year,Share of women in top 0.1%,Share of women in top 0.25%,Share of women in top 0.5%,Share of women in top 1%,Share of women in top 10%,Share of women in top 5%
Entity,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Australia,15,15,14,0,0,15,15,15
Canada,23,23,23,0,0,23,23,23
Denmark,34,34,34,0,0,34,34,34
Italy,17,17,17,17,17,17,17,17
New Zealand,36,36,0,0,22,35,36,36
Norway,8,8,8,0,8,8,8,8
Spain,15,15,15,0,15,15,15,15
UK,0,20,20,20,20,20,20,20


In [23]:
df_gender_gap_wage = read_csv('data/6-gender-gap-in-average-wages-ilo.csv')

- The gender wage gap is defined as the difference between median earnings of men and women relative to median earnings of men.
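
The definition can be sketched as a small function. The earnings figures are hypothetical and only illustrate the sign convention:

```python
def gender_wage_gap(median_men, median_women):
    """Gap between men's and women's median earnings, as a % of men's median."""
    return (median_men - median_women) / median_men * 100

# Hypothetical median earnings, for illustration only.
gap = gender_wage_gap(median_men=2000.0, median_women=1700.0)          # women earn less
reverse_gap = gender_wage_gap(median_men=2000.0, median_women=2200.0)  # women earn more
print(round(gap, 2), round(reverse_gap, 2))
```

A positive value means men's median earnings are higher; a negative value means women's median earnings are higher.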

In [24]:
df_gender_gap_wage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 413 entries, 0 to 412
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Entity               413 non-null    object 
 1   Code                 413 non-null    object 
 2   Year                 413 non-null    int64  
 3   Gender wage gap (%)  413 non-null    float64
dtypes: float64(1), int64(1), object(2)
memory usage: 13.0+ KB


In [25]:
df_gender_gap_wage.head()

Unnamed: 0,Entity,Code,Year,Gender wage gap (%)
0,Argentina,ARG,1986,15.79
1,Argentina,ARG,1987,12.5
2,Argentina,ARG,1988,11.31
3,Argentina,ARG,1991,6.71
4,Argentina,ARG,1992,8.33


In [26]:
df_gender_gap_wage.describe()

Unnamed: 0,Year,Gender wage gap (%)
count,413.0,413.0
mean,2005.159806,10.886949
std,7.820724,10.241873
min,1981.0,-28.27
25%,2000.0,3.92
50%,2006.0,10.68
75%,2012.0,17.63
max,2016.0,35.75


In [27]:
df_gender_gap_wage.groupby(by='Year').count()

Unnamed: 0_level_0,Entity,Code,Gender wage gap (%)
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1981,1,1,1
1982,1,1,1
1983,1,1,1
1984,1,1,1
1985,1,1,1
1986,2,2,2
1987,2,2,2
1988,2,2,2
1989,6,6,6
1990,4,4,4


In [28]:
df_gender_gap_wage.groupby(by='Entity').count()

Unnamed: 0_level_0,Code,Year,Gender wage gap (%)
Entity,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Argentina,27,27,27
Austria,4,4,4
Belgium,4,4,4
Belize,6,6,6
Bolivia,15,15,15
...,...,...,...
Turkey,1,1,1
United Kingdom,2,2,2
Uruguay,21,21,21
Venezuela,13,13,13


## Labor datasets

- The labor force is the number of people who are employed plus the unemployed who are looking for work.
- At its most basic level, entrepreneurship refers to an individual or a small group of partners who strike out on an original path to create a new business.

- The labor force participation rate is the proportion of the population aged 15 years and older that is economically active.
- The female-to-male ratio is calculated by dividing the labor force participation rate among women by the corresponding rate for men; the dataset expresses it as a percentage.
- Source: https://ourworldindata.org/grapher/ratio-of-female-to-male-labor-force-participation-rates-ilo-wdi
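
As a rough illustration of that ratio, the sketch below uses the 1990 female and male participation rates for Afghanistan that appear in the participation tables later in this section. The helper name `participation_ratio` is hypothetical:

```python
def participation_ratio(female_rate, male_rate):
    """Female-to-male labor force participation ratio, in percent."""
    return female_rate / male_rate * 100

# Afghanistan, 1990: rates from the female and male participation tables.
ratio_afg_1990 = participation_ratio(female_rate=15.18, male_rate=77.43)
print(round(ratio_afg_1990, 2))
```

The result is close to the 1990 value reported for Afghanistan in the ratio dataset; small differences are plausible because the files may come from different releases of the underlying estimates.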

In [29]:
df_ratio_labor = read_csv('data/3-ratio-of-female-to-male-labor-force-participation-rates-ilo-wdi.csv')

In [30]:
df_ratio_labor.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6432 entries, 0 to 6431
Data columns (total 4 columns):
 #   Column                                                                             Non-Null Count  Dtype  
---  ------                                                                             --------------  -----  
 0   Entity                                                                             6432 non-null   object 
 1   Code                                                                               5984 non-null   object 
 2   Year                                                                               6432 non-null   int64  
 3   Ratio of female to male labor force participation rate (%) (modeled ILO estimate)  6432 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 201.1+ KB


In [31]:
df_ratio_labor.head()

Unnamed: 0,Entity,Code,Year,Ratio of female to male labor force participation rate (%) (modeled ILO estimate)
0,Afghanistan,AFG,1990,19.604805
1,Afghanistan,AFG,1991,19.71338
2,Afghanistan,AFG,1992,19.803307
3,Afghanistan,AFG,1993,19.844606
4,Afghanistan,AFG,1994,19.88471


In [32]:
df_ratio_labor.groupby(by="Entity").count()

Unnamed: 0_level_0,Code,Year,Ratio of female to male labor force participation rate (%) (modeled ILO estimate)
Entity,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,32,32,32
Albania,32,32,32
Algeria,32,32,32
Angola,32,32,32
Argentina,32,32,32
...,...,...,...
West Bank and Gaza,0,32,32
World,32,32,32
Yemen,32,32,32
Zambia,32,32,32


In [33]:
df_ratio_labor.groupby(by="Year").count()

Unnamed: 0_level_0,Entity,Code,Ratio of female to male labor force participation rate (%) (modeled ILO estimate)
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1990,201,187,201
1991,201,187,201
1992,201,187,201
1993,201,187,201
1994,201,187,201
1995,201,187,201
1996,201,187,201
1997,201,187,201
1998,201,187,201
1999,201,187,201


In [34]:
df_labor_entp = pd.read_csv('data/Labor-Force-Women-Entrpreneurship.csv', sep=';')

In [35]:
df_labor_entp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 9 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   No                                     51 non-null     int64  
 1   Country                                51 non-null     object 
 2   Level of development                   51 non-null     object 
 3   European Union Membership              51 non-null     object 
 4   Currency                               51 non-null     object 
 5   Women Entrepreneurship Index           51 non-null     float64
 6   Entrepreneurship Index                 51 non-null     float64
 7   Inflation rate                         51 non-null     float64
 8   Female Labor Force Participation Rate  51 non-null     float64
dtypes: float64(4), int64(1), object(4)
memory usage: 3.7+ KB


In [36]:
df_labor_entp.head()

Unnamed: 0,No,Country,Level of development,European Union Membership,Currency,Women Entrepreneurship Index,Entrepreneurship Index,Inflation rate,Female Labor Force Participation Rate
0,4,Austria,Developed,Member,Euro,54.9,64.9,0.9,67.1
1,6,Belgium,Developed,Member,Euro,63.6,65.5,0.6,58.0
2,17,Estonia,Developed,Member,Euro,55.4,60.2,-0.88,68.5
3,18,Finland,Developed,Member,Euro,66.4,65.7,-0.2,67.7
4,19,France,Developed,Member,Euro,68.8,67.3,0.0,60.6


- Level of development: classification of each country by its level of development (for example, "Developed").

- European Union Membership: whether the country is a member of the European Union (EU) or not.

- Currency: likely the currency used in each country.

- Women Entrepreneurship Index: the Women Entrepreneurship Index (WEI) for each country. It quantifies how favorable or unfavorable conditions are for female entrepreneurs. Values are typically between 0 and 100, with higher values indicating a more favorable environment for women entrepreneurs.

- Entrepreneurship Index: the Entrepreneurship Index (EI) for each country. It assesses the overall entrepreneurial environment, regardless of gender. As with the Women Entrepreneurship Index, values are typically between 0 and 100, with higher values indicating a more favorable environment for entrepreneurship.

- Inflation rate: the inflation rate for each country, that is, the percentage change in the general price level of goods and services over a period of time.

- Female Labor Force Participation Rate: the percentage of women who are either employed or actively seeking employment in each country.

In [37]:
df_labor_entp.describe()

Unnamed: 0,No,Women Entrepreneurship Index,Entrepreneurship Index,Inflation rate,Female Labor Force Participation Rate
count,51.0,51.0,51.0,51.0,51.0
mean,29.980392,47.835294,47.241176,2.587647,58.481765
std,18.017203,14.26848,16.193149,5.380639,13.864567
min,1.0,25.3,24.8,-2.25,13.0
25%,14.5,36.35,31.9,-0.5,55.8
50%,30.0,44.5,42.7,0.6,61.0
75%,45.5,59.15,65.4,3.6,67.4
max,60.0,74.8,77.6,26.5,82.3


In [38]:
df_labour_female = read_csv('data/Labour-Force-Participation-Female.csv')

In [39]:
df_labour_female.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 37 columns):
 #   Column                                                                Non-Null Count  Dtype  
---  ------                                                                --------------  -----  
 0   ISO3                                                                  195 non-null    object 
 1   Country                                                               195 non-null    object 
 2   Continent                                                             195 non-null    object 
 3   Hemisphere                                                            195 non-null    object 
 4   HDI Rank (2021)                                                       191 non-null    float64
 5   Labour force participation rate, female (% ages 15 and older) (1990)  180 non-null    float64
 6   Labour force participation rate, female (% ages 15 and older) (1991)  180 non-null    float64
 7  

In [40]:
# TODO: extract the year from the column names as a feature
df_labour_female.head()

Unnamed: 0,ISO3,Country,Continent,Hemisphere,HDI Rank (2021),"Labour force participation rate, female (% ages 15 and older) (1990)","Labour force participation rate, female (% ages 15 and older) (1991)","Labour force participation rate, female (% ages 15 and older) (1992)","Labour force participation rate, female (% ages 15 and older) (1993)","Labour force participation rate, female (% ages 15 and older) (1994)",...,"Labour force participation rate, female (% ages 15 and older) (2012)","Labour force participation rate, female (% ages 15 and older) (2013)","Labour force participation rate, female (% ages 15 and older) (2014)","Labour force participation rate, female (% ages 15 and older) (2015)","Labour force participation rate, female (% ages 15 and older) (2016)","Labour force participation rate, female (% ages 15 and older) (2017)","Labour force participation rate, female (% ages 15 and older) (2018)","Labour force participation rate, female (% ages 15 and older) (2019)","Labour force participation rate, female (% ages 15 and older) (2020)","Labour force participation rate, female (% ages 15 and older) (2021)"
0,AFG,Afghanistan,Asia,Northern Hemisphere,180.0,15.18,15.214,15.223,15.197,15.178,...,15.879,16.794,17.749,18.746,19.798,20.887,21.228,21.566,16.189,14.848
1,AGO,Angola,Africa,Southern Hemisphere,148.0,75.408,75.381,75.369,75.371,75.387,...,74.834,74.833,74.843,74.864,74.882,74.912,74.955,75.011,73.618,73.968
2,ALB,Albania,Europe,Northern Hemisphere,67.0,51.364,54.727,55.608,54.638,53.825,...,48.778,43.598,43.733,46.898,49.676,49.51,51.189,52.723,49.786,50.733
3,AND,Andorra,Europe,Northern Hemisphere,40.0,,,,,,...,,,,,,,,,,
4,ARE,United Arab Emirates,Asia,Northern Hemisphere,26.0,29.083,29.779,30.272,30.944,31.121,...,44.718,46.19,47.659,49.072,50.373,51.947,48.951,48.923,45.703,46.542
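
One possible way to extract the year as a feature is to unpivot the wide table with `melt` and parse the trailing `(YYYY)` from each column name. This is a sketch on a toy frame that mimics the layout above; the names `long_df`, `indicator` and `female_rate` are illustrative choices, not part of the dataset:

```python
import pandas as pd

# Toy frame mimicking the wide layout (values copied from the first two rows).
prefix = "Labour force participation rate, female (% ages 15 and older)"
wide = pd.DataFrame({
    "ISO3": ["AFG", "AGO"],
    "Country": ["Afghanistan", "Angola"],
    f"{prefix} (1990)": [15.18, 75.408],
    f"{prefix} (1991)": [15.214, 75.381],
})

# Unpivot the per-year columns into rows.
long_df = wide.melt(id_vars=["ISO3", "Country"],
                    var_name="indicator", value_name="female_rate")

# Pull the four-digit year out of the trailing "(YYYY)", then drop the helper column.
long_df["Year"] = long_df["indicator"].str.extract(r"\((\d{4})\)$", expand=False).astype(int)
long_df = (long_df.drop(columns="indicator")
                  .sort_values(["ISO3", "Year"])
                  .reset_index(drop=True))
print(long_df)
```

On the full table, the remaining descriptive columns (`Continent`, `Hemisphere`, `HDI Rank (2021)`) would also need to be listed in `id_vars` so that they are not treated as year columns.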


- Hemisphere: This column categorizes countries by geographical hemisphere, such as "Northern Hemisphere" or "Southern Hemisphere".

- HDI Rank (2021): This column represents the Human Development Index (HDI) rank of each country for the year 2021. The HDI is a composite index measuring average achievement in three basic dimensions of human development: health (life expectancy at birth), education (mean years of schooling and expected years of schooling), and standard of living (gross national income per capita). The HDI is the geometric mean of normalized indices for each of the three dimensions. A country scores a higher HDI when life expectancy, education level, and gross national income (GNI, PPP) per capita are higher. A numerically lower rank indicates higher human development.
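
The geometric-mean step described above can be sketched as follows. The three dimension indices are hypothetical values in the range 0-1, and the function covers only the aggregation, not how each index is normalized:

```python
def hdi(health_index, education_index, income_index):
    """Geometric mean of the three normalized dimension indices (each in 0-1)."""
    return (health_index * education_index * income_index) ** (1 / 3)

# Hypothetical dimension indices, for illustration only.
score = hdi(health_index=0.9, education_index=0.8, income_index=0.7)
arithmetic_mean = (0.9 + 0.8 + 0.7) / 3
print(round(score, 3), round(arithmetic_mean, 3))
```

Because a geometric mean is never larger than the arithmetic mean of the same values, a weak score in one dimension pulls the HDI down more than a simple average would.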

In [41]:
df_labour_male = read_csv('data/Labour-Force-Participation-Male.csv')

In [42]:
df_labour_male.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 37 columns):
 #   Column                                                              Non-Null Count  Dtype  
---  ------                                                              --------------  -----  
 0   ISO3                                                                195 non-null    object 
 1   Country                                                             195 non-null    object 
 2   Continent                                                           195 non-null    object 
 3   Hemisphere                                                          195 non-null    object 
 4   HDI Rank (2021)                                                     191 non-null    float64
 5   Labour force participation rate, male (% ages 15 and older) (1990)  180 non-null    float64
 6   Labour force participation rate, male (% ages 15 and older) (1991)  180 non-null    float64
 7   Labour force part

In [43]:
# TODO: extract the year from the column names as a feature
# TODO: merge the female and male tables
df_labour_male.head()

Unnamed: 0,ISO3,Country,Continent,Hemisphere,HDI Rank (2021),"Labour force participation rate, male (% ages 15 and older) (1990)","Labour force participation rate, male (% ages 15 and older) (1991)","Labour force participation rate, male (% ages 15 and older) (1992)","Labour force participation rate, male (% ages 15 and older) (1993)","Labour force participation rate, male (% ages 15 and older) (1994)",...,"Labour force participation rate, male (% ages 15 and older) (2012)","Labour force participation rate, male (% ages 15 and older) (2013)","Labour force participation rate, male (% ages 15 and older) (2014)","Labour force participation rate, male (% ages 15 and older) (2015)","Labour force participation rate, male (% ages 15 and older) (2016)","Labour force participation rate, male (% ages 15 and older) (2017)","Labour force participation rate, male (% ages 15 and older) (2018)","Labour force participation rate, male (% ages 15 and older) (2019)","Labour force participation rate, male (% ages 15 and older) (2020)","Labour force participation rate, male (% ages 15 and older) (2021)"
0,AFG,Afghanistan,Asia,Northern Hemisphere,180.0,77.43,77.176,76.871,76.58,76.33,...,76.42,75.588,74.737,73.875,73.045,72.183,72.023,71.863,65.58,66.515
1,AGO,Angola,Africa,Southern Hemisphere,148.0,79.292,79.367,79.405,79.409,79.381,...,79.922,79.93,79.912,79.865,79.827,79.756,79.653,79.519,78.798,79.071
2,ALB,Albania,Europe,Northern Hemisphere,67.0,72.51,75.143,75.858,75.222,74.68,...,65.197,61.18,62.984,63.957,64.8,66.44,67.247,67.742,65.631,66.154
3,AND,Andorra,Europe,Northern Hemisphere,40.0,,,,,,...,,,,,,,,,,
4,ARE,United Arab Emirates,Asia,Northern Hemisphere,26.0,91.714,91.894,91.989,92.196,92.168,...,89.976,90.557,91.098,91.509,91.697,91.559,90.621,90.686,87.191,88.003
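
A possible shape for the female/male merge, assuming both tables have already been reshaped to one row per country and year. The toy frames and the column names `female_rate`, `male_rate` and `female_to_male_pct` are hypothetical:

```python
import pandas as pd

# Toy long-format frames (values copied from the 1990 columns of the tables above).
female = pd.DataFrame({"ISO3": ["AFG", "AGO"], "Year": [1990, 1990],
                       "female_rate": [15.18, 75.408]})
male = pd.DataFrame({"ISO3": ["AFG", "AGO"], "Year": [1990, 1990],
                     "male_rate": [77.43, 79.292]})

# One row per country-year; validate guards against duplicate keys on either side.
merged = female.merge(male, on=["ISO3", "Year"], how="inner", validate="one_to_one")

# Derived female-to-male participation ratio, in percent.
merged["female_to_male_pct"] = merged["female_rate"] / merged["male_rate"] * 100
print(merged)
```

Merging on `ISO3` rather than `Country` avoids mismatches caused by spelling differences in country names; an inner join keeps only the country-years present in both tables.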


In [44]:
df_ent = pd.read_csv('data/Women-Ent_Data3.csv', sep=';')

In [45]:
df_ent.head()

Unnamed: 0,No,Country,Level of development,European Union Membership,Currency,Women Entrepreneurship Index,Entrepreneurship Index,Inflation rate,Female Labor Force Participation Rate
0,4,Austria,Developed,Member,Euro,54.9,64.9,0.9,67.1
1,6,Belgium,Developed,Member,Euro,63.6,65.5,0.6,58.0
2,17,Estonia,Developed,Member,Euro,55.4,60.2,-0.88,68.5
3,18,Finland,Developed,Member,Euro,66.4,65.7,-0.2,67.7
4,19,France,Developed,Member,Euro,68.8,67.3,0.0,60.6


In [46]:
df_ent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 9 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   No                                     51 non-null     int64  
 1   Country                                51 non-null     object 
 2   Level of development                   51 non-null     object 
 3   European Union Membership              51 non-null     object 
 4   Currency                               51 non-null     object 
 5   Women Entrepreneurship Index           51 non-null     float64
 6   Entrepreneurship Index                 51 non-null     float64
 7   Inflation rate                         51 non-null     float64
 8   Female Labor Force Participation Rate  51 non-null     float64
dtypes: float64(4), int64(1), object(4)
memory usage: 3.7+ KB


In [47]:
# df_ent and df_labor_entp are the same
(df_ent == df_labor_entp).all()

No                                       True
Country                                  True
Level of development                     True
European Union Membership                True
Currency                                 True
Women Entrepreneurship Index             True
Entrepreneurship Index                   True
Inflation rate                           True
Female Labor Force Participation Rate    True
dtype: bool

In [48]:
# Deleting one of the datasets
del df_ent

In [49]:
df_placement = read_csv('data/Placement.csv')

In [50]:
df_placement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   gender               215 non-null    object 
 1   ssc_percentage       215 non-null    float64
 2   ssc_board            215 non-null    object 
 3   hsc_percentage       215 non-null    float64
 4   hsc_board            215 non-null    object 
 5   hsc_subject          215 non-null    object 
 6   degree_percentage    215 non-null    float64
 7   undergrad_degree     215 non-null    object 
 8   work_experience      215 non-null    object 
 9   emp_test_percentage  215 non-null    float64
 10  specialisation       215 non-null    object 
 11  mba_percent          215 non-null    float64
 12  status               215 non-null    object 
dtypes: float64(5), object(8)
memory usage: 22.0+ KB


- The Secondary School Certificate (SSC), also called the Secondary School Leaving Certificate (SSLC) or Matriculation examination, is a public examination in Bangladesh, India, Pakistan and the Maldives, conducted by educational boards on completion of secondary education. Students in 10th grade (class ten) can sit these exams.

- The Higher Secondary Certificate (HSC), also called the Higher Secondary School Certificate (HSSC) or Higher Secondary Education Certificate (HSE), is a secondary education qualification in Bangladesh, India and Pakistan. It is equivalent to the final year of high school in the United States and GCSE and/or A level in the United Kingdom.

- A Master of Business Administration (MBA; also Master in Business Administration) is a postgraduate degree focused on business administration.

- Source: Wikipedia

- ssc_percentage: percentage score obtained by individuals in their Secondary School Certificate (SSC) exams.

- ssc_board: educational board from which individuals completed their Secondary School Certificate (SSC) exams.

- hsc_percentage: percentage score obtained by individuals in their Higher Secondary Certificate (HSC) exams.

- hsc_board: educational board from which individuals completed their Higher Secondary Certificate (HSC) exams.

- hsc_subject: subject or stream chosen by individuals in their Higher Secondary Certificate (HSC) exams.

- degree_percentage: percentage score obtained by individuals in their undergraduate degree program.

- undergrad_degree: type or field of undergraduate degree obtained by individuals.

- emp_test_percentage: percentage score obtained by individuals in an employment test or assessment.

- specialisation: specialization or focus area of study, possibly in postgraduate or professional education.

- mba_percent: percentage score obtained by individuals in their Master of Business Administration (MBA) program.

- status: the status of individuals, possibly related to their employment or educational status (e.g., "Placed", "Not Placed").

In [51]:
df_placement.head()

Unnamed: 0,gender,ssc_percentage,ssc_board,hsc_percentage,hsc_board,hsc_subject,degree_percentage,undergrad_degree,work_experience,emp_test_percentage,specialisation,mba_percent,status
0,M,67.0,Others,91.0,Others,Commerce,58.0,Sci&Tech,No,55.0,Mkt&HR,58.8,Placed
1,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed
2,M,65.0,Central,68.0,Central,Arts,64.0,Comm&Mgmt,No,75.0,Mkt&Fin,57.8,Placed
3,M,56.0,Central,52.0,Central,Science,52.0,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed
4,M,85.8,Central,73.6,Central,Commerce,73.3,Comm&Mgmt,No,96.8,Mkt&Fin,55.5,Placed
