# Data Cleaning

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### 1. Percentage of women author participation in journals by year for each country
Source file description (Luke Holman, 2018): The spreadsheet gives the number of male, female and unknown-gender authors that were counted for each combination of year, authorship position (i.e. first/last/middle/single), country (including 'unknown', which refers either to authors with no affiliation, or those with an affiliation for which we could not identify the country), and journal (using the abbreviations favoured by PubMed). 'First' and 'Last' authors were counted from all papers with 2 or more authors. 'Middle' authors are any authors other than the first and last, on papers with three or more authors. Single authors are the authors of papers that list only one author. The unknown-gender authors are people who only gave initials, those whose names were not listed on genderize.io, or those with names that are not associated with one gender >95% of the time (e.g. Alex, Robin).

In [2]:
# source: https://github.com/lukeholman/genderGapCode
women_journal_percentage = pd.read_csv('author_frequency.csv')
# drop position & journal type (unrelated)
women_journal_percentage = women_journal_percentage.drop(columns=['position','journal'])
# add up the overall occurrence of women (F), men (M), and gender unknown (U) for each year by country
# Reference: https://stackoverflow.com/questions/39922986/how-do-i-pandas-group-by-to-get-sum
women_journal_percentage = women_journal_percentage.groupby(['country', 'year', 'gender']).sum().reset_index()
women_journal_percentage.head()

Unnamed: 0,country,year,gender,n
0,Algeria,2001,F,3
1,Algeria,2001,M,2
2,Algeria,2002,F,19
3,Algeria,2002,M,25
4,Algeria,2002,U,6


In [3]:
# drop unknown gender
women_journal_percentage_noU = women_journal_percentage[~women_journal_percentage.gender.str.contains('U')]

# drop unknown country
women_journal_percentage_1 = women_journal_percentage_noU[women_journal_percentage_noU.country != 'Unknown']
women_journal_percentage_1.head()

Unnamed: 0,country,year,gender,n
0,Algeria,2001,F,3
1,Algeria,2001,M,2
2,Algeria,2002,F,19
3,Algeria,2002,M,25
5,Algeria,2003,F,27


In [4]:
# reshape dataframe
# reference: https://stackoverflow.com/questions/17298313/python-pandas-convert-rows-as-column-headers
df3 = women_journal_percentage_1.pivot_table('n', ['country', 'year'], 'gender').rename_axis(None, axis=1)
df3.reset_index(drop=False, inplace=True)
# reindex returns a new DataFrame, so assign the result back
df3 = df3.reindex(['country', 'year', 'F', 'M'], axis=1)

df3.head()

Unnamed: 0,country,year,F,M
0,Algeria,2001,3.0,2.0
1,Algeria,2002,19.0,25.0
2,Algeria,2003,27.0,29.0
3,Algeria,2004,34.0,33.0
4,Algeria,2005,17.0,28.0
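
A note on the reshape step: `pivot_table` aggregates with the mean by default. After the groupby sum each (country, year, gender) combination occurs only once, so the default leaves the counts unchanged, but stating the aggregation explicitly makes the intent clearer. A minimal sketch on toy data (the frame `long_df` and its values are hypothetical, not taken from the real file):

```python
import pandas as pd

# Hypothetical counts: one row per (country, year, gender), as after the groupby-sum.
long_df = pd.DataFrame({
    "country": ["A", "A", "A", "A"],
    "year":    [2001, 2001, 2002, 2002],
    "gender":  ["F", "M", "F", "M"],
    "n":       [3, 2, 19, 25],
})

# aggfunc="sum" states the intent explicitly; pivot_table defaults to the mean.
wide = (
    long_df.pivot_table(values="n", index=["country", "year"],
                        columns="gender", aggfunc="sum")
    .rename_axis(None, axis=1)
    .reset_index()
)
print(wide)
```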


In [5]:
# calculate the percentage of women author participation in journals by year for each country
df3['F_percentage'] = df3['F']/(df3['F'] + df3['M'])*100
df3.head()

Unnamed: 0,country,year,F,M,F_percentage
0,Algeria,2001,3.0,2.0,60.0
1,Algeria,2002,19.0,25.0,43.181818
2,Algeria,2003,27.0,29.0,48.214286
3,Algeria,2004,34.0,33.0,50.746269
4,Algeria,2005,17.0,28.0,37.777778


In [6]:
# only keep data from 2010 to 2016
df3_2016 = df3.loc[(2010 <= df3['year']) & (df3['year']<= 2016)]
df3_2016

Unnamed: 0,country,year,F,M,F_percentage
9,Algeria,2010,93.0,143.0,39.406780
10,Algeria,2011,117.0,173.0,40.344828
11,Algeria,2012,199.0,259.0,43.449782
12,Algeria,2013,227.0,293.0,43.653846
13,Algeria,2014,151.0,215.0,41.256831
...,...,...,...,...,...
1857,Zimbabwe,2012,44.0,75.0,36.974790
1858,Zimbabwe,2013,67.0,95.0,41.358025
1859,Zimbabwe,2014,93.0,113.0,45.145631
1860,Zimbabwe,2015,60.0,105.0,36.363636


##### Clarification: 
1. Why drop position & journal: An author's position (first, middle, last, or single) does not affect the overall rate of women author participation, which is what I intend to analyze, and the specific journal is likewise unrelated to that rate. I analyze the data country by country so that economic and gender-equality factors can be compared across economies.
2. Why drop unknown countries: Records with an unknown country cannot be assigned to any economy, so none of the remaining factors can be matched to them.
3. Why drop unknown gender: Authors of unknown gender cannot be counted as either women or men. If a country has a considerable number of them, its measured percentage of women's participation may differ from the actual value, and the difference between that country and countries with mostly known data may look larger than it really is (even when the economic status of the countries is similar).
4. Why 2010 - 2016: The economic situation of different countries fluctuated widely in this period because of the economic crisis, which makes the influence of economic status on the percentage of women authors' participation in journals easier to see.
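
To judge how much point 3 matters for a given country, one could look at the share of unknown-gender authors before dropping them. The following is only a sketch on toy data; the frame `counts` and its values are hypothetical and merely mimic the layout of the grouped table above.

```python
import pandas as pd

# Toy long-format counts with hypothetical values (not taken from the real dataset).
counts = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "gender":  ["F", "M", "U", "F", "M", "U"],
    "n":       [30, 50, 20, 10, 10, 80],
})

# Total authors per country and the number with unknown gender.
total = counts.groupby("country")["n"].sum()
unknown = counts[counts["gender"] == "U"].groupby("country")["n"].sum()

# Share of each country's authors that would be dropped together with the 'U' rows.
unknown_share = (unknown / total * 100).rename("unknown_share(%)")
print(unknown_share)
```

A country with a high share (like the hypothetical country B) would be one where the computed women's percentage rests on a small part of its authors.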

#### 2. Possible factor: GDP of countries, summarized by year
Source file description (DataHub.io): Country, regional and world GDP in current US Dollars ($).

In [7]:
# source: https://datahub.io/core/gdp#data
gdp = pd.read_csv('gdp_csv.csv')
# rename columns
gdp = gdp.rename(columns={'Country Name':'country', 'Country Code':'country_code', \
                          'Year':'year', 'Value':'GDP'})
# preview
gdp.head()

Unnamed: 0,country,country_code,year,GDP
0,Arab World,ARB,1968,25760680000.0
1,Arab World,ARB,1969,28434200000.0
2,Arab World,ARB,1970,31385500000.0
3,Arab World,ARB,1971,36426910000.0
4,Arab World,ARB,1972,43316060000.0


In [8]:
# only keep data from 2010 to 2016
gdp_2016 = gdp.loc[(2010 <= gdp['year']) & (gdp['year']<= 2016)]
gdp_2016

Unnamed: 0,country,country_code,year,GDP
42,Arab World,ARB,2010,2.109646e+12
43,Arab World,ARB,2011,2.501554e+12
44,Arab World,ARB,2012,2.741239e+12
45,Arab World,ARB,2013,2.839627e+12
46,Arab World,ARB,2014,2.906616e+12
...,...,...,...,...
11502,Zimbabwe,ZWE,2012,1.424249e+10
11503,Zimbabwe,ZWE,2013,1.545177e+10
11504,Zimbabwe,ZWE,2014,1.589105e+10
11505,Zimbabwe,ZWE,2015,1.630467e+10


##### Clarification:
1. Why 2010 - 2016: The economic situation of different countries fluctuated widely in this period because of the economic crisis, which makes the influence of economic status on the percentage of women authors' participation in journals easier to see. In addition, almost all countries' GDP data are recorded between 2010 and 2016, which allows a more comprehensive analysis. The sort order of this dataset is also kept the same as that of the women participation dataset cleaned before, which makes the two easier to merge in step 6.
2. Why rename columns: The columns in this dataset that overlap with those in the women participation dataset (country, year) are given the same names as above, which makes the datasets easier to merge in step 6.
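
As a small illustration of why the renaming helps, the sketch below renames the key columns of a toy GDP-like table and then checks that both tables share the merge keys with matching dtypes. The frames `left` and `right` and their values are hypothetical stand-ins, not the real data.

```python
import pandas as pd

# Hypothetical frames standing in for the participation and GDP tables.
left = pd.DataFrame({"country": ["A"], "year": [2010], "F_percentage": [40.0]})
right = pd.DataFrame({"Country Name": ["A"], "Year": [2010], "Value": [1.0e9]})

right = right.rename(columns={"Country Name": "country", "Year": "year", "Value": "GDP"})

keys = ["country", "year"]
# Both frames need the key columns, and the key dtypes should agree.
missing = [k for k in keys if k not in left.columns or k not in right.columns]
same_dtype = all(left[k].dtype == right[k].dtype for k in keys)
print(missing, same_dtype)
```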

#### 3. Possible factor: Government expenditure
Source file description (UNESCO Institute for Statistics, 2020): Government expenditure per student, primary (% of GDP per capita)

In [9]:
# source: https://data.worldbank.org/indicator/SE.XPD.PRIM.PC.ZS
expenditure = pd.read_csv('government_expenditure.csv')
# drop indicator name and indicator code
expenditure = expenditure.drop(columns=['Indicator Name', 'Indicator Code'])
# preview
expenditure.head()

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Aruba,ABW,,,,,,,,,...,18.38443,17.36433,17.08119,,,,,,,
1,Africa Eastern and Southern,AFE,,,,,,,,,...,,,,,,,,,,
2,Afghanistan,AFG,,,,,,,,,...,8.31733,11.17447,11.72219,10.24688,10.34081,10.25174,,,,
3,Africa Western and Central,AFW,,,,,,,,,...,12.891175,10.67458,,,,,,,,
4,Angola,AGO,,,,,,,,,...,,,,,,,,,,


In [10]:
# only keep data from 2010 to 2016
expenditure_2016 = expenditure[['Country Name','Country Code','2010','2011','2012','2013','2014','2015','2016']]
# rename columns
expenditure_2016 = expenditure_2016.rename(columns={'Country Name':'country','Country Code':'country_code'})
expenditure_2016.head()

Unnamed: 0,country,country_code,2010,2011,2012,2013,2014,2015,2016
0,Aruba,ABW,,,18.38443,17.36433,17.08119,,
1,Africa Eastern and Southern,AFE,,,,,,,
2,Afghanistan,AFG,11.95215,12.21159,8.31733,11.17447,11.72219,10.24688,10.34081
3,Africa Western and Central,AFW,10.917025,10.633625,12.891175,10.67458,,,
4,Angola,AGO,,,,,,,


In [11]:
# reshape dataframe
# reference: https://stackoverflow.com/questions/28654047/convert-columns-into-rows-with-pandas
expenditure_reshape = expenditure_2016.melt(id_vars=['country','country_code'], var_name='year', value_name='expenditure_per_student(%GDP)')
# convert year to int
expenditure_reshape['year'] = expenditure_reshape['year'].astype(int)
# sort dataframe by country and year
expenditure_reshape = expenditure_reshape.sort_values(by=['country','year']).reset_index(drop=True)
expenditure_reshape

Unnamed: 0,country,country_code,year,expenditure_per_student(%GDP)
0,Afghanistan,AFG,2010,11.95215
1,Afghanistan,AFG,2011,12.21159
2,Afghanistan,AFG,2012,8.31733
3,Afghanistan,AFG,2013,11.17447
4,Afghanistan,AFG,2014,11.72219
...,...,...,...,...
1857,Zimbabwe,ZWE,2012,14.01205
1858,Zimbabwe,ZWE,2013,14.00009
1859,Zimbabwe,ZWE,2014,
1860,Zimbabwe,ZWE,2015,


##### Clarification:
1. Why drop indicator name and indicator code: They only label the series and carry no information about government expenditure or the overall percentage of women author participation.
2. Why only keep data from 2010 to 2016: The economic situation of different countries fluctuated widely in this period because of the economic crisis, which makes the influence of economic status on the percentage of women authors' participation in journals easier to see. In addition, almost all countries' expenditure data are recorded between 2010 and 2016, which allows a more comprehensive analysis. The sort order of this dataset is also kept the same as that of the women participation dataset cleaned before, which makes the two easier to merge in step 6.
3. Why rename columns: The columns in this dataset that overlap with those in the women participation dataset (country, year) are given the same names as above, which makes the datasets easier to merge in step 6.
4. Why reshape the dataframe: To give it the same shape as the women participation dataset (one row per country and year), which makes it easier to merge in step 6.
5. Why convert year to int: To make it easier to sort by year and to assign factors year by year in step 6.
6. Why sort the dataframe by country and year: To keep the same format as the two datasets generated above, and to show how expenditure develops for each country over the same period (2010-2016).
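
The reshape, type conversion, and sort described in points 4 to 6 can be sketched on a toy table. The frame `wide` and its values are hypothetical; only the column layout mirrors the World Bank file.

```python
import pandas as pd

# Hypothetical wide table: one column per year, with year labels stored as strings.
wide = pd.DataFrame({
    "country": ["B", "A"],
    "country_code": ["BBB", "AAA"],
    "2010": [1.0, 3.0],
    "2011": [2.0, None],
})

long_df = wide.melt(id_vars=["country", "country_code"],
                    var_name="year", value_name="value")
# The melted year labels are strings; cast them so they match an integer year key.
long_df["year"] = long_df["year"].astype(int)
long_df = long_df.sort_values(["country", "year"]).reset_index(drop=True)
print(long_df)
```

Without the cast, a merge on `year` against an integer column would compare strings with integers, which is why the conversion comes before the merge.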

#### 4. Possible factor: Educational Attainment of women
Source file description (UNESCO Institute for Statistics, 2022): Educational Attainment, at least completed upper secondary, female(%) (cumulative)

In [12]:
# source: https://data.worldbank.org/indicator/SE.SEC.CUAT.UP.FE.ZS
e_a = pd.read_csv('educational_attainment_female.csv')
# drop indicator name and indicator code
e_a = e_a.drop(columns=['Indicator Name', 'Indicator Code'])
# preview
e_a.tail()

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
261,Kosovo,XKX,,,,,,,,,...,,,,,,,,,,
262,"Yemen, Rep.",YEM,,,,,,,,,...,,,,,,,,,,
263,South Africa,ZAF,,,,,,,,,...,59.818909,,62.796871,62.796791,,60.47858,,54.322811,,
264,Zambia,ZMB,,,,,,,,,...,,,,,,,,,,
265,Zimbabwe,ZWE,,,,,,,,,...,3.93888,,12.00218,,,9.35805,,,,


In [13]:
# only keep data from 2010 to 2016
e_a_2016 = e_a[['Country Name','Country Code','2010','2011','2012','2013','2014','2015','2016']]
# rename columns
e_a_2016 = e_a_2016.rename(columns={'Country Name':'country','Country Code':'country_code'})
e_a_2016.head(30)

Unnamed: 0,country,country_code,2010,2011,2012,2013,2014,2015,2016
0,Aruba,ABW,32.074589,,,,,,
1,Africa Eastern and Southern,AFE,,,,,,,
2,Afghanistan,AFG,,,,,,,
3,Africa Western and Central,AFW,,,,,,,
4,Angola,AGO,,,,,12.2926,,
5,Albania,ALB,,38.72538,44.491501,,,,
6,Andorra,AND,,,,,47.593479,47.281551,47.035629
7,Arab World,ARB,,,,,,,
8,United Arab Emirates,ARE,,,,,,,
9,Argentina,ARG,,,,,,,


In [14]:
# reshape dataframe
# reference: https://stackoverflow.com/questions/28654047/convert-columns-into-rows-with-pandas
e_a_reshape = e_a_2016.melt(id_vars=['country','country_code'], var_name='year', value_name='e_a_percentage(%)')
# convert year to int
e_a_reshape['year'] = e_a_reshape['year'].astype(int)
# sort dataframe by country and year
e_a_reshape = e_a_reshape.sort_values(by=['country','year']).reset_index(drop=True)
e_a_reshape

Unnamed: 0,country,country_code,year,e_a_percentage(%)
0,Afghanistan,AFG,2010,
1,Afghanistan,AFG,2011,
2,Afghanistan,AFG,2012,
3,Afghanistan,AFG,2013,
4,Afghanistan,AFG,2014,
...,...,...,...,...
1857,Zimbabwe,ZWE,2012,3.93888
1858,Zimbabwe,ZWE,2013,
1859,Zimbabwe,ZWE,2014,12.00218
1860,Zimbabwe,ZWE,2015,


##### Clarification:
1. Why drop indicator name and indicator code: They only label the series and carry no information about educational attainment or the overall percentage of women author participation.
2. Why only keep data from 2010 to 2016: The economic situation of different countries fluctuated widely in this period because of the economic crisis, which makes the influence of economic status on the percentage of women authors' participation in journals easier to see. In addition, almost all educational attainment data are recorded between 2010 and 2016, which allows a more comprehensive analysis. The sort order of this dataset is also kept the same as that of the women participation dataset cleaned before, which makes the two easier to merge in step 6.
3. Why rename columns: The columns in this dataset that overlap with those in the women participation dataset (country, year) are given the same names as above, which makes the datasets easier to merge in step 6.
4. Why reshape the dataframe: To give it the same shape as the women participation dataset (one row per country and year), which makes it easier to merge in step 6.
5. Why convert year to int: To make it easier to sort by year and to assign factors year by year in step 6.
6. Why sort the dataframe by country and year: To keep the same format as the datasets generated above, and to show how women's educational attainment develops for each country over the same period (2010-2016).
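
Because the preview above contains many empty cells, it may be worth measuring how much of the window actually has recorded values. This is a sketch on toy data, assuming the long format produced above; the frame `long_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format indicator with gaps (NaN marks a year with no record).
long_df = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "year":    [2010, 2011, 2010, 2011, 2010, 2011],
    "value":   [32.0, None, None, None, 44.5, 47.0],
})

# Percentage of countries with a recorded value in each year.
coverage_by_year = long_df.groupby("year")["value"].apply(lambda s: s.notna().mean() * 100)

# Countries with no recorded value in the whole window.
recorded = long_df.groupby("country")["value"].count()
empty_countries = recorded[recorded == 0].index.tolist()

print(coverage_by_year)
print(empty_countries)
```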

#### 5. Possible factor: Gender gap in average wages
Source file description (ILOSTAT): Gender wage gap, unadjusted for worker characteristics. Estimates correspond to the difference between average earnings of men and women, expressed as a percentage (%) of average earnings of men. A positive gap means that women earn less than men.
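
The sign convention in this description can be written out as a small function. This is a sketch of the stated definition only; the function name `gender_wage_gap` and the earnings figures are hypothetical.

```python
def gender_wage_gap(men_avg: float, women_avg: float) -> float:
    """Unadjusted gap as a percentage of men's average earnings."""
    return (men_avg - women_avg) / men_avg * 100


# Hypothetical earnings, chosen only to show the sign convention.
print(gender_wage_gap(100.0, 80.0))   # positive: women earn less than men
print(gender_wage_gap(100.0, 100.0))  # zero: equal average earnings
print(gender_wage_gap(100.0, 105.0))  # negative: women earn more than men
```

Negative values, such as some of the Argentina rows in the table below, therefore indicate years in which women's average earnings were higher than men's.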

In [15]:
# source: https://ourworldindata.org/economic-inequality-by-gender
gap = pd.read_csv('gender_gap_in_average_wages.csv')
# preview
gap.head()

Unnamed: 0,Entity,Code,Year,Gender wage gap (%)
0,Argentina,ARG,1986,15.79
1,Argentina,ARG,1987,12.5
2,Argentina,ARG,1988,11.31
3,Argentina,ARG,1991,6.71
4,Argentina,ARG,1992,8.33


In [16]:
# only keep data from 2010 to 2016
gap_2016 = gap.loc[(gap['Year'] >= 2010) & (gap['Year'] <= 2016)]
# rename columns
gap_2016 = gap_2016.rename(columns={'Entity':'country', 'Code':'country_code', 'Year':'year', \
                                    'Gender wage gap (%)':'gender_wage_gap(%)'})
gap_2016

Unnamed: 0,country,country_code,year,gender_wage_gap(%)
22,Argentina,ARG,2010,-0.61
23,Argentina,ARG,2011,0.00
24,Argentina,ARG,2012,-1.90
25,Argentina,ARG,2013,1.45
26,Argentina,ARG,2014,-3.62
...,...,...,...,...
395,Uruguay,URY,2012,8.59
396,Uruguay,URY,2013,10.01
397,Uruguay,URY,2014,8.62
411,Vietnam,VNM,2015,7.69


##### Clarification:
1. Why 2010 - 2016: The economic situation of different countries fluctuated widely in this period because of the economic crisis, which makes the influence of economic status on the percentage of women authors' participation in journals easier to see. In addition, almost all countries' gender wage gap data are recorded between 2010 and 2016, which allows a more comprehensive analysis. The sort order of this dataset is also kept the same as that of the women participation dataset cleaned before, which makes the two easier to merge in step 6.
2. Why rename columns: The columns in this dataset that overlap with those in the women participation dataset (country, year) are given the same names as above, which makes the datasets easier to merge in step 6.

#### 6. Merge the datasets

In [17]:
# unify country names across the datasets
# reference: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
# assign() returns a new DataFrame instead of writing into a slice, so no
# SettingWithCopyWarning is raised and the warning need not be disabled globally
rename = {'USA':'United States', 'South Korea':'Korea', 'Korea, Rep.':'Korea'}
df3_2016 = df3_2016.assign(country=df3_2016['country'].replace(rename))
gdp_2016 = gdp_2016.assign(country=gdp_2016['country'].replace(rename))
expenditure_reshape = expenditure_reshape.assign(country=expenditure_reshape['country'].replace(rename))
e_a_reshape = e_a_reshape.assign(country=e_a_reshape['country'].replace(rename))
gap_2016 = gap_2016.assign(country=gap_2016['country'].replace(rename))

In [18]:
df3_2016

Unnamed: 0,country,year,F,M,F_percentage
9,Algeria,2010,93.0,143.0,39.406780
10,Algeria,2011,117.0,173.0,40.344828
11,Algeria,2012,199.0,259.0,43.449782
12,Algeria,2013,227.0,293.0,43.653846
13,Algeria,2014,151.0,215.0,41.256831
...,...,...,...,...,...
1857,Zimbabwe,2012,44.0,75.0,36.974790
1858,Zimbabwe,2013,67.0,95.0,41.358025
1859,Zimbabwe,2014,93.0,113.0,45.145631
1860,Zimbabwe,2015,60.0,105.0,36.363636


In [19]:
gdp_2016

Unnamed: 0,country,country_code,year,GDP
42,Arab World,ARB,2010,2.109646e+12
43,Arab World,ARB,2011,2.501554e+12
44,Arab World,ARB,2012,2.741239e+12
45,Arab World,ARB,2013,2.839627e+12
46,Arab World,ARB,2014,2.906616e+12
...,...,...,...,...
11502,Zimbabwe,ZWE,2012,1.424249e+10
11503,Zimbabwe,ZWE,2013,1.545177e+10
11504,Zimbabwe,ZWE,2014,1.589105e+10
11505,Zimbabwe,ZWE,2015,1.630467e+10


In [20]:
# Merge datasets of women participation and GDP
frequency_gdp = pd.merge(df3_2016, gdp_2016, how='left', on=['country','year'])
# reorder columns so that F_percentage comes last (country_code is dropped here)
new_frequency_gdp = frequency_gdp[['country', 'year', 'F', 'M', 'GDP', 'F_percentage']]
new_frequency_gdp

Unnamed: 0,country,year,F,M,GDP,F_percentage
0,Algeria,2010,93.0,143.0,1.612073e+11,39.406780
1,Algeria,2011,117.0,173.0,2.000191e+11,40.344828
2,Algeria,2012,199.0,259.0,2.090590e+11,43.449782
3,Algeria,2013,227.0,293.0,2.097550e+11,43.653846
4,Algeria,2014,151.0,215.0,2.138100e+11,41.256831
...,...,...,...,...,...,...
807,Zimbabwe,2012,44.0,75.0,1.424249e+10,36.974790
808,Zimbabwe,2013,67.0,95.0,1.545177e+10,41.358025
809,Zimbabwe,2014,93.0,113.0,1.589105e+10,45.145631
810,Zimbabwe,2015,60.0,105.0,1.630467e+10,36.363636


##### Clarification:
Why drop country_code: The women's participation dataset identifies countries by name only and does not include country codes, so the codes cannot help assign factors to each country. Keeping them would instead cause problems in this case, because some countries in the factor datasets (e.g. the government expenditure dataset) do not appear in the women author participation dataset.
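
One way to keep `country_code` out of the result is to pass only the key columns and the factor column into the merge. The sketch below uses toy frames (`base`, `gdp_toy`, and their values are hypothetical); `validate="many_to_one"` is an optional extra check that each (country, year) pair occurs at most once in the factor table.

```python
import pandas as pd

# Hypothetical frames; the factor table carries a country_code column.
base = pd.DataFrame({"country": ["A"], "year": [2010], "F_percentage": [40.0]})
gdp_toy = pd.DataFrame({"country": ["A"], "country_code": ["AAA"],
                        "year": [2010], "GDP": [1.0e9]})

keys = ["country", "year"]
# Keep only the keys and the factor column so no country_code reaches the result.
merged = base.merge(gdp_toy[keys + ["GDP"]], how="left", on=keys,
                    validate="many_to_one")
print(merged.columns.tolist())
```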

In [21]:
# Merge datasets of women participation, GDP, and government education expenditure
frequency_gdp_expenditure = pd.merge(new_frequency_gdp, expenditure_reshape, how='left', on=['country','year'])

# reorder columns so that F_percentage comes last
new_frequency_gdp_expenditure = frequency_gdp_expenditure[['country', 'year', 'F',
                                                           'M', 'GDP', 'expenditure_per_student(%GDP)',
                                                           'F_percentage']]
new_frequency_gdp_expenditure

Unnamed: 0,country,year,F,M,GDP,expenditure_per_student(%GDP),F_percentage
0,Algeria,2010,93.0,143.0,1.612073e+11,,39.406780
1,Algeria,2011,117.0,173.0,2.000191e+11,,40.344828
2,Algeria,2012,199.0,259.0,2.090590e+11,,43.449782
3,Algeria,2013,227.0,293.0,2.097550e+11,,43.653846
4,Algeria,2014,151.0,215.0,2.138100e+11,,41.256831
...,...,...,...,...,...,...,...
807,Zimbabwe,2012,44.0,75.0,1.424249e+10,14.01205,36.974790
808,Zimbabwe,2013,67.0,95.0,1.545177e+10,14.00009,41.358025
809,Zimbabwe,2014,93.0,113.0,1.589105e+10,,45.145631
810,Zimbabwe,2015,60.0,105.0,1.630467e+10,,36.363636


In [22]:
# Merge datasets of women participation, GDP, government education expenditure, and women educational attainment
frequency_gdp_expenditure_att = pd.merge(new_frequency_gdp_expenditure, e_a_reshape,
                                         how='left', on=['country','year'])

# reorder columns so that F_percentage comes last
new_df = frequency_gdp_expenditure_att[['country', 'year', 'F', 'M', 'GDP',
                                        'expenditure_per_student(%GDP)', 'e_a_percentage(%)',
                                        'F_percentage']]
new_df.iloc[620:650]

Unnamed: 0,country,year,F,M,GDP,expenditure_per_student(%GDP),e_a_percentage(%),F_percentage
620,Singapore,2014,3485.0,6565.0,308142800000.0,,67.067337,34.676617
621,Singapore,2015,3146.0,5689.0,296840700000.0,,68.283501,35.608376
622,Singapore,2016,1742.0,3086.0,296975700000.0,,67.8349,36.081193
623,Slovakia,2010,797.0,983.0,,,,44.775281
624,Slovakia,2011,1056.0,1177.0,,,,47.29064
625,Slovakia,2012,1049.0,1229.0,,,,46.049166
626,Slovakia,2013,1117.0,1254.0,,,,47.110924
627,Slovakia,2014,1195.0,1399.0,,,,46.067849
628,Slovakia,2015,1228.0,1229.0,,,,49.97965
629,Slovakia,2016,582.0,650.0,,,,47.24026


In [23]:
# Merge datasets of women participation, GDP, government education expenditure,
# women educational attainment, and gender wage gap

final_df = pd.merge(new_df, gap_2016, how='left', on=['country','year'])

# reorder columns so that F_percentage comes last
overall_df = final_df[['country', 'year', 'F', 'M', 'GDP',
                       'expenditure_per_student(%GDP)', 'e_a_percentage(%)',
                       'gender_wage_gap(%)', 'F_percentage']]

In [24]:
# preview a slice of the merged dataframe
overall_df.iloc[620:650]

Unnamed: 0,country,year,F,M,GDP,expenditure_per_student(%GDP),e_a_percentage(%),gender_wage_gap(%),F_percentage
620,Singapore,2014,3485.0,6565.0,308142800000.0,,67.067337,,34.676617
621,Singapore,2015,3146.0,5689.0,296840700000.0,,68.283501,,35.608376
622,Singapore,2016,1742.0,3086.0,296975700000.0,,67.8349,,36.081193
623,Slovakia,2010,797.0,983.0,,,,19.66,44.775281
624,Slovakia,2011,1056.0,1177.0,,,,,47.29064
625,Slovakia,2012,1049.0,1229.0,,,,,46.049166
626,Slovakia,2013,1117.0,1254.0,,,,,47.110924
627,Slovakia,2014,1195.0,1399.0,,,,19.66,46.067849
628,Slovakia,2015,1228.0,1229.0,,,,19.1,49.97965
629,Slovakia,2016,582.0,650.0,,,,,47.24026
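
In the slice above, the Slovakia rows have no GDP value even though the wage gap column is partly filled. One possible explanation, which I have not verified against the source files, is that the country name is spelled differently in some of the datasets. A minimal sketch of how such mismatches could be listed with the `indicator` option of `merge`, using toy data and the hypothetical spelling 'Slovak Republic':

```python
import pandas as pd

# Hypothetical tables in which one country is spelled differently in each source.
participation = pd.DataFrame({
    "country": ["Algeria", "Slovakia"],
    "year": [2010, 2010],
    "F_percentage": [39.4, 44.8],
})
factor = pd.DataFrame({
    "country": ["Algeria", "Slovak Republic"],
    "year": [2010, 2010],
    "GDP": [1.6e11, 9.0e10],
})

# indicator=True adds a '_merge' column recording where each row was found.
check = participation.merge(factor, how="left", on=["country", "year"], indicator=True)
unmatched = sorted(check.loc[check["_merge"] == "left_only", "country"].unique())
print(unmatched)
```

Any names listed this way could then be added to the `rename` dictionary used when unifying country names.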


In [25]:
# confirm shape
overall_df.shape

(812, 9)
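
Beyond the shape, it can help to see how many values are missing in each column of the merged result, since the left merges keep every participation row whether or not a factor was found. A sketch on a toy frame (`merged` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical merged frame with gaps in the factor column.
merged = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2010, 2011, 2010],
    "GDP": [1.0e9, None, None],
    "F_percentage": [40.0, 41.0, 45.0],
})

# Number of missing values in each column.
missing_counts = merged.isna().sum()
# Rows where the listed factor is present (only GDP in this toy frame).
complete_rows = merged.dropna(subset=["GDP"])
print(missing_counts)
print(len(complete_rows))
```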