# Lab Assignment 9: Data Management Using `pandas`, Part 2
## DS 6001: Practice and Application of Data Science
## Name: Afnan Alabdulwahab

### Instructions
Please answer the following questions as completely as possible using text, code, and the results of code as needed. Format your answers in a Jupyter notebook. To receive full credit, make sure you address every part of the problem, and make sure your document is formatted in a clean and professional way.

In this lab, we are going to build the Country Analysis Relational DataBase (which we will call the C.A.R.D.B. or the "Cardi B"):

![CardbiB](https://media.giphy.com/media/3oEjI5ry4IwZ3RDw6k/giphy.gif "cardib")

We will be collecting data from two sources. First, we will use open data from the World Bank's [Sovereign
Environmental, Social, and Governance (ESG) Data](https://datatopics.worldbank.org/esg/) project. The ESG data reports data from every country in the world over the time frame from 1960-2022 on a wide variety of topics including education, health, and economic factors within the countries. Second, we will use data on the quality and democratic character of countries' governments as reported by the [Varieties of Democracy (V-Dem)](https://www.v-dem.net/data/the-v-dem-dataset/) project at the University of Notre Dame. By using both data sources, we can conduct analyses to see whether democratic openness leads to better societal outcomes for countries. We can also write queries to capture a wide range of information on countries' political parties, tax systems, and banking industries, for example. Or as Cardi B would say, "You in the club just to party, I'm there, I get paid a fee. I be in and out them banks so much, I know they're tired of me."

## Problem 0
Import the following packages (use `pip install` to download any packages you don't already have installed):

In [1]:
import numpy as np
import pandas as pd
import requests
import os
import io
import zipfile

Both the World Bank and V-Dem store their data in zipped directories containing CSV files. Download the World Bank data into your current working directory by typing the following code:

And download the V-Dem data by typing:

After you've run this code successfully once, the files you need will be in your working directory and you should save time by switching these cells from "code" to "raw" so that they don't run again if you restart the kernel.

You will only need three of the files you've downloaded. Load the 'V-Dem-CY-Core-v13.csv' file as `vdem` and the 'ESGData.csv' file as `wb`. 

In [2]:
vdem = pd.read_csv('V-Dem-CY-Core-v13.csv')
wb = pd.read_csv('ESGCSV.csv')

## Problem 1
First, let's focus on the `vdem` data ('V-Dem-CY-Core-v13.csv'). Use `pandas` methods to perform the following tasks:

### Part a
Keep only the 'country_text_id', 'country_name','year', 'v2x_polyarchy', and 'v2peedueq' columns. [1 point]

In [3]:
# Setting pandas option to display all columns
pd.set_option('display.max_columns', None) 

To keep only these columns, I pass a list of the column names to the dataframe's indexing operator:

In [4]:
vdem = vdem[['country_text_id', 'country_name', 'year', 'v2x_polyarchy', 'v2peedueq']]
vdem

Unnamed: 0,country_text_id,country_name,year,v2x_polyarchy,v2peedueq
0,MEX,Mexico,1789,0.028,
1,MEX,Mexico,1790,0.028,
2,MEX,Mexico,1791,0.028,
3,MEX,Mexico,1792,0.028,
4,MEX,Mexico,1793,0.028,
...,...,...,...,...,...
27550,SPD,Piedmont-Sardinia,1857,0.207,
27551,SPD,Piedmont-Sardinia,1858,0.210,
27552,SPD,Piedmont-Sardinia,1859,0.210,
27553,SPD,Piedmont-Sardinia,1860,0.213,
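
As a small illustration of the list-based selection used above, here is a minimal sketch on a toy dataframe. The toy rows and the `extra_column` name are made up for the example and are not part of the V-Dem file.

```python
import pandas as pd

# Toy stand-in for the V-Dem data; 'extra_column' is a made-up name.
toy = pd.DataFrame({
    "country_text_id": ["MEX", "MEX"],
    "country_name": ["Mexico", "Mexico"],
    "year": [1789, 1790],
    "v2x_polyarchy": [0.028, 0.028],
    "extra_column": [1, 2],
})

keep = ["country_text_id", "country_name", "year", "v2x_polyarchy"]

# Indexing with a list of names returns a new dataframe that contains
# only those columns, in the order given by the list.
subset = toy[keep]
print(list(subset.columns))
```

If memory is a concern, `pd.read_csv` also accepts a `usecols` argument, so the unwanted columns never need to be loaded in the first place.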


### Part b
Use the `.query()` method to keep only the rows in which year is greater than or equal to 1960 and less than or equal to 2021. [1 point]

I pass the logical condition `'year >= 1960 & year <= 2021'` to the `.query()` method and assign the result back to `vdem`:

In [5]:
vdem = vdem.query('year >= 1960 & year <= 2021')
vdem

Unnamed: 0,country_text_id,country_name,year,v2x_polyarchy,v2peedueq
171,MEX,Mexico,1960,0.232,-1.438
172,MEX,Mexico,1961,0.234,-1.438
173,MEX,Mexico,1962,0.233,-1.438
174,MEX,Mexico,1963,0.233,-1.438
175,MEX,Mexico,1964,0.231,-1.438
...,...,...,...,...,...
26150,ZZB,Zanzibar,2017,0.267,1.661
26151,ZZB,Zanzibar,2018,0.268,1.486
26152,ZZB,Zanzibar,2019,0.266,1.486
26153,ZZB,Zanzibar,2020,0.258,1.427
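
To check that the query keeps both endpoints, here is a minimal sketch on made-up years. It compares `.query()` with two equivalent filters, a Boolean mask and `Series.between()`, which is inclusive on both ends by default.

```python
import pandas as pd

# Toy data: two years outside the window and three inside it.
toy = pd.DataFrame({"country": ["A"] * 5,
                    "year": [1958, 1960, 1990, 2021, 2022]})

# Three ways to express 1960 <= year <= 2021.
with_query = toy.query("year >= 1960 and year <= 2021")
with_mask = toy[(toy["year"] >= 1960) & (toy["year"] <= 2021)]
with_between = toy[toy["year"].between(1960, 2021)]

print(with_query["year"].tolist())
```

All three return the same rows, so the choice is mostly a matter of readability.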


### Part c
Rename 'country_text_id' to 'country_code', 'country_name' to 'country_name_vdem', 'v2x_polyarchy' to 'democracy', and 'v2peedueq' to 'educational_equality'. [1 point]

I call the `.rename()` method on `vdem` with a dictionary that maps each old column name to its new name, and set `axis=1` so that the mapping is applied to columns rather than to row labels:

In [6]:
vdem = vdem.rename({'country_text_id': 'country_code',
                    'country_name': 'country_name_vdem',
                    'v2x_polyarchy': 'democracy',
                    'v2peedueq':'educational_equality'}, axis=1)
vdem

Unnamed: 0,country_code,country_name_vdem,year,democracy,educational_equality
171,MEX,Mexico,1960,0.232,-1.438
172,MEX,Mexico,1961,0.234,-1.438
173,MEX,Mexico,1962,0.233,-1.438
174,MEX,Mexico,1963,0.233,-1.438
175,MEX,Mexico,1964,0.231,-1.438
...,...,...,...,...,...
26150,ZZB,Zanzibar,2017,0.267,1.661
26151,ZZB,Zanzibar,2018,0.268,1.486
26152,ZZB,Zanzibar,2019,0.266,1.486
26153,ZZB,Zanzibar,2020,0.258,1.427
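
For comparison, a minimal sketch on a toy dataframe showing that `rename(mapping, axis=1)` and `rename(columns=mapping)` give the same result. The `not_a_column` key is made up to show that, by default, keys that match no column are ignored rather than raising an error.

```python
import pandas as pd

toy = pd.DataFrame({"country_text_id": ["MEX"], "v2x_polyarchy": [0.232]})

mapping = {
    "country_text_id": "country_code",
    "v2x_polyarchy": "democracy",
    "not_a_column": "ignored",  # made-up key; no such column exists
}

# Both calls rename columns and return a new dataframe.
by_axis = toy.rename(mapping, axis=1)
by_keyword = toy.rename(columns=mapping)

print(list(by_axis.columns))
```

Because unmatched keys pass silently, a typo in an old column name leaves that column unrenamed, so it is worth checking the resulting column list.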


### Part d
Sort the rows by 'country_code' and 'year' in ascending order. [1 point]

To sort, I use the `.sort_values()` method. The `by` argument takes a list of the columns to sort by, `country_code` first and then `year`. The `ascending` argument takes a matching list of Boolean values, one per column, that says whether each column is sorted in ascending or descending order. Here both are `True`, so both columns are sorted in ascending order:

In [7]:
vdem = vdem.sort_values(by = ['country_code', 'year'], ascending = [True, True])
vdem

Unnamed: 0,country_code,country_name_vdem,year,democracy,educational_equality
5433,AFG,Afghanistan,1960,0.080,-1.123
5434,AFG,Afghanistan,1961,0.083,-1.123
5435,AFG,Afghanistan,1962,0.082,-1.123
5436,AFG,Afghanistan,1963,0.085,-1.123
5437,AFG,Afghanistan,1964,0.137,-0.951
...,...,...,...,...,...
26150,ZZB,Zanzibar,2017,0.267,1.661
26151,ZZB,Zanzibar,2018,0.268,1.486
26152,ZZB,Zanzibar,2019,0.266,1.486
26153,ZZB,Zanzibar,2020,0.258,1.427
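
The output above keeps the original row labels (5433, 5434, ...), because sorting reorders rows without renumbering them. A minimal sketch on made-up rows, with an optional `reset_index(drop=True)` step that the assignment does not require:

```python
import pandas as pd

toy = pd.DataFrame({
    "country_code": ["ZZB", "AFG", "ZZB", "AFG"],
    "year": [1961, 1961, 1960, 1960],
})

# Ascending order is the default, so the 'ascending' argument can be omitted.
ordered = toy.sort_values(by=["country_code", "year"])

# Optional: replace the old row labels with 0, 1, 2, ...
ordered_reset = ordered.reset_index(drop=True)

print(ordered.index.tolist())
print(ordered_reset.index.tolist())
```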


## Problem 2
Next focus on the World Bank `wb` dataset 'ESGData.csv'. Use `pandas` methods to perform the following tasks:

### Part a
Keep only the columns named 'Country Code', 'Country Name', and 'Indicator Code', or begin with '19' or '20'. (Don't type in all the years individually. Instead, use code that finds all columns that begin '19' or '20'.) [1 point]

First, I am defining a list `mycols` containing the column names we want to keep:

In [8]:
mycols = ['Country Code', 'Country Name', 'Indicator Code']

Here, I am creating a list of column names from the dataframe, `wb`, that start with '19' or '20'. I am using a list comprehension to iterate over the column names and check if each name starts with '19' or '20'. The resulting list, `yearcols`, contains only those column names that meet this condition:

In [9]:
# getting the year columns
yearcols = [x for x in wb.columns if x.startswith("19") or x.startswith("20")]

Here, I am overwriting the `wb` dataframe with a dataframe only containing `mycols` and `yearcols`:

In [10]:
wb = wb[mycols + yearcols]
wb

Unnamed: 0,Country Code,Country Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,ARB,Arab World,EG.CFT.ACCS.ZS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,74.998725,76.638513,78.084222,79.317256,80.412595,81.436463,82.364313,83.153561,83.852744,84.520175,84.995489,85.545061,86.024476,86.406487,86.705717,86.942778,87.228705,87.390856,87.617862,87.798740,87.948264,88.092536,,
1,ARB,Arab World,EG.ELC.ACCS.ZS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,76.964326,77.623521,78.521409,79.112320,81.108851,81.936223,81.904019,82.888219,82.830288,83.590460,85.815868,84.152501,83.838359,84.735838,87.482231,87.719569,87.402502,89.340705,88.832276,89.053852,89.539016,90.662754,89.176939,90.352802,90.635050,90.845661,,
2,ARB,Arab World,NY.ADJ.DRES.GN.ZS,,,,,,,,,,,5.961207,6.065469,8.242519,11.302722,24.108772,16.683838,18.461057,18.377966,15.884196,33.643670,27.427013,20.161788,11.285801,9.753678,9.670800,8.289274,5.416472,6.875435,5.765414,7.849742,7.983244,8.360681,8.124148,8.070243,8.245936,7.845689,9.076897,7.773420,5.479318,6.976811,9.377097,7.561574,7.352450,8.647153,10.056139,12.062701,12.156493,11.130419,12.794641,7.994157,9.292306,12.665397,12.199566,11.176049,10.050554,6.130655,5.265859,6.245422,8.187714,7.234436,4.598506,,,
3,ARB,Arab World,NY.ADJ.DFOR.GN.ZS,,,,,,,,,,,0.174568,0.139287,0.132084,0.152424,0.087390,0.098760,0.066341,0.102411,0.107103,0.074228,0.057760,0.055111,0.114144,0.081489,0.079541,0.038306,0.089656,0.085068,0.089702,0.087330,0.067230,0.071939,0.057122,0.044027,0.045949,0.062263,0.058694,0.055242,0.077565,0.036729,0.021056,0.024409,0.026641,0.032390,0.024895,0.020664,0.021369,0.017943,0.024938,0.028746,0.029684,0.030113,0.033494,0.065108,0.084361,0.096672,0.092911,0.102684,0.057123,0.064516,0.075686,,,
4,ARB,Arab World,AG.LND.AGRI.ZS,,30.981414,30.982663,31.007054,31.018001,31.042466,31.0504,31.103223,31.133565,31.190429,31.254493,31.386588,31.499948,31.496808,31.550807,31.529648,31.599736,31.621997,31.666078,31.678494,31.758868,31.454314,31.480303,31.528747,31.942270,32.442177,33.026539,33.582999,34.186977,34.697784,35.109453,35.159977,35.321083,36.096181,36.750654,37.380826,37.977641,38.488463,39.096353,39.622190,39.656941,39.679751,39.752997,39.941566,39.991253,40.048478,40.119766,40.172169,40.111560,40.122131,40.160303,40.176419,39.789967,39.838650,39.834421,39.872575,39.937814,39.984452,39.969738,39.907031,39.973290,39.970742,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16964,ZWE,Zimbabwe,ER.PTD.TOTL.ZS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,27.214542,27.214585,27.214585,27.214747,27.214747,27.214747,27.214747,
16965,ZWE,Zimbabwe,AG.LND.FRLS.HA,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16966,ZWE,Zimbabwe,SL.UEM.TOTL.ZS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4.750000,4.929000,5.007000,4.960000,5.557000,6.123000,6.930000,6.356000,6.000000,5.683000,5.297000,4.994000,4.743000,4.390000,4.665000,4.802000,5.073000,5.543000,5.617000,5.542000,5.370000,5.020000,4.932000,4.770000,5.412000,5.918000,6.349000,6.767000,7.370000,8.651000,9.540000,9.256000,9.116
16967,ZWE,Zimbabwe,SP.UWT.TFRT,,,,,,,,,,,,,,,,,,,,,,,,,14.100000,,,,,,,,,,19.100000,,,,,16.700000,,,,,,,15.500000,,,,,14.600000,,,10.382129,10.400000,,,,,,,,
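
The year-column selection can be written a little more compactly. This is a minimal sketch on a toy dataframe with made-up values: `str.startswith` accepts a tuple of prefixes, and `DataFrame.filter(regex=...)` selects columns whose names match a regular expression.

```python
import pandas as pd

# Toy frame with a few World Bank-style column names; values are made up.
toy = pd.DataFrame(
    [["AFG", "SL.UEM.TOTL.ZS", 1.0, 2.0, 3.0]],
    columns=["Country Code", "Indicator Code", "1960", "2021", "Unnamed: 67"],
)

# A tuple of prefixes replaces the two startswith calls joined by 'or'.
with_tuple = [c for c in toy.columns if c.startswith(("19", "20"))]

# The same selection using a regular expression on the column names.
with_filter = list(toy.filter(regex=r"^(19|20)").columns)

print(with_tuple)
```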


### Part b
Rename 'Country Code' to 'country_code', 'Country Name' to 'country_name_wb', and 'Indicator Code' to 'feature'. [1 point]

I call the `.rename()` method on `wb` with a dictionary that maps each old column name to its new name, and set `axis=1` so that the mapping is applied to columns:

In [11]:
wb = wb.rename({'Country Code': 'country_code',
                'Country Name': 'country_name_wb',
                'Indicator Code': 'feature'}, axis=1)
wb

Unnamed: 0,country_code,country_name_wb,feature,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,ARB,Arab World,EG.CFT.ACCS.ZS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,74.998725,76.638513,78.084222,79.317256,80.412595,81.436463,82.364313,83.153561,83.852744,84.520175,84.995489,85.545061,86.024476,86.406487,86.705717,86.942778,87.228705,87.390856,87.617862,87.798740,87.948264,88.092536,,
1,ARB,Arab World,EG.ELC.ACCS.ZS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,76.964326,77.623521,78.521409,79.112320,81.108851,81.936223,81.904019,82.888219,82.830288,83.590460,85.815868,84.152501,83.838359,84.735838,87.482231,87.719569,87.402502,89.340705,88.832276,89.053852,89.539016,90.662754,89.176939,90.352802,90.635050,90.845661,,
2,ARB,Arab World,NY.ADJ.DRES.GN.ZS,,,,,,,,,,,5.961207,6.065469,8.242519,11.302722,24.108772,16.683838,18.461057,18.377966,15.884196,33.643670,27.427013,20.161788,11.285801,9.753678,9.670800,8.289274,5.416472,6.875435,5.765414,7.849742,7.983244,8.360681,8.124148,8.070243,8.245936,7.845689,9.076897,7.773420,5.479318,6.976811,9.377097,7.561574,7.352450,8.647153,10.056139,12.062701,12.156493,11.130419,12.794641,7.994157,9.292306,12.665397,12.199566,11.176049,10.050554,6.130655,5.265859,6.245422,8.187714,7.234436,4.598506,,,
3,ARB,Arab World,NY.ADJ.DFOR.GN.ZS,,,,,,,,,,,0.174568,0.139287,0.132084,0.152424,0.087390,0.098760,0.066341,0.102411,0.107103,0.074228,0.057760,0.055111,0.114144,0.081489,0.079541,0.038306,0.089656,0.085068,0.089702,0.087330,0.067230,0.071939,0.057122,0.044027,0.045949,0.062263,0.058694,0.055242,0.077565,0.036729,0.021056,0.024409,0.026641,0.032390,0.024895,0.020664,0.021369,0.017943,0.024938,0.028746,0.029684,0.030113,0.033494,0.065108,0.084361,0.096672,0.092911,0.102684,0.057123,0.064516,0.075686,,,
4,ARB,Arab World,AG.LND.AGRI.ZS,,30.981414,30.982663,31.007054,31.018001,31.042466,31.0504,31.103223,31.133565,31.190429,31.254493,31.386588,31.499948,31.496808,31.550807,31.529648,31.599736,31.621997,31.666078,31.678494,31.758868,31.454314,31.480303,31.528747,31.942270,32.442177,33.026539,33.582999,34.186977,34.697784,35.109453,35.159977,35.321083,36.096181,36.750654,37.380826,37.977641,38.488463,39.096353,39.622190,39.656941,39.679751,39.752997,39.941566,39.991253,40.048478,40.119766,40.172169,40.111560,40.122131,40.160303,40.176419,39.789967,39.838650,39.834421,39.872575,39.937814,39.984452,39.969738,39.907031,39.973290,39.970742,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16964,ZWE,Zimbabwe,ER.PTD.TOTL.ZS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,27.214542,27.214585,27.214585,27.214747,27.214747,27.214747,27.214747,
16965,ZWE,Zimbabwe,AG.LND.FRLS.HA,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16966,ZWE,Zimbabwe,SL.UEM.TOTL.ZS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4.750000,4.929000,5.007000,4.960000,5.557000,6.123000,6.930000,6.356000,6.000000,5.683000,5.297000,4.994000,4.743000,4.390000,4.665000,4.802000,5.073000,5.543000,5.617000,5.542000,5.370000,5.020000,4.932000,4.770000,5.412000,5.918000,6.349000,6.767000,7.370000,8.651000,9.540000,9.256000,9.116
16967,ZWE,Zimbabwe,SP.UWT.TFRT,,,,,,,,,,,,,,,,,,,,,,,,,14.100000,,,,,,,,,,19.100000,,,,,16.700000,,,,,,,15.500000,,,,,14.600000,,,10.382129,10.400000,,,,,,,,


### Part c
Use the `.query()` method to remove the rows in which 'country_name_wb' is equal to one of the entries in the following `noncountries` list: [1 point]

In [12]:
noncountries = ["Arab World", "Central Europe and the Baltics",
                "Caribbean small states",
                "East Asia & Pacific (excluding high income)",
                "Early-demographic dividend","East Asia & Pacific",
                "Europe & Central Asia (excluding high income)",
                "Europe & Central Asia", "Euro area",
                "European Union","Fragile and conflict affected situations",
                "High income",
                "Heavily indebted poor countries (HIPC)","IBRD only",
                "IDA & IBRD total",
                "IDA total","IDA blend","IDA only",
                "Latin America & Caribbean (excluding high income)",
                "Latin America & Caribbean",
                "Least developed countries: UN classification",
                "Low income","Lower middle income","Low & middle income",
                "Late-demographic dividend","Middle East & North Africa",
                "Middle income",
                "Middle East & North Africa (excluding high income)",
                "North America","OECD members",
                "Other small states","Pre-demographic dividend",
                "Pacific island small states",
                "Post-demographic dividend",
                "Sub-Saharan Africa (excluding high income)",
                "Sub-Saharan Africa",
                "Small states","East Asia & Pacific (IDA & IBRD)",
                "Europe & Central Asia (IDA & IBRD)",
                "Latin America & Caribbean (IDA & IBRD)",
                "Middle East & North Africa (IDA & IBRD)","South Asia",
                "South Asia (IDA & IBRD)",
                "Sub-Saharan Africa (IDA & IBRD)",
                "Upper middle income", "World"]

To remove the rows in which 'country_name_wb' is equal to one of the entries in the `noncountries` list, I pass the query string `"country_name_wb not in @noncountries"` to `.query()`.

The query string has three parts:
* `country_name_wb` refers to the column of the dataframe that contains the country names.
* `not in` is a membership test that is true when the value in `country_name_wb` does not appear in the list that follows.
* `@noncountries` uses the `@` symbol to reference a Python variable defined outside the query string, in this case the `noncountries` list defined above.

In [13]:
wb = wb.query("country_name_wb not in @noncountries")
wb

Unnamed: 0,country_code,country_name_wb,feature,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
3266,AFG,Afghanistan,EG.CFT.ACCS.ZS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,6.700000,7.700000,8.800000,10.000000,11.100000,12.500000,13.900000,15.300000,16.800000,18.200000,19.700000,21.300000,22.700000,24.300000,25.700000,27.250000,28.500000,30.000000,31.100000,32.450000,33.800000,35.400000,,
3267,AFG,Afghanistan,EG.ELC.ACCS.ZS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4.446891,9.294527,14.133616,18.971165,23.814182,28.669672,33.544418,38.440002,42.400000,48.279007,42.700000,43.222019,69.100000,68.040878,89.500000,71.500000,97.700000,97.700000,93.430878,97.700000,97.700000,97.700000,,
3268,AFG,Afghanistan,NY.ADJ.DRES.GN.ZS,,,,,,,,,,,0.503854,0.644591,0.786744,1.242755,1.493057,1.849701,1.962964,1.840750,1.618266,1.692939,1.611075,1.226914,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.270365,0.359454,0.386644,0.380988,0.335091,0.315571,0.290261,0.363282,0.350879,0.401053,0.370131,0.243668,0.335935,,
3269,AFG,Afghanistan,NY.ADJ.DFOR.GN.ZS,,,,,,,,,,,0.279412,0.337567,0.389290,0.751733,0.783186,0.795687,0.756951,0.589740,0.734971,0.559937,0.581685,0.434263,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.229864,0.292788,0.244239,0.211376,0.211413,0.216609,0.232762,0.284781,0.229822,0.237615,0.269353,0.237958,0.317732,,
3270,AFG,Afghanistan,AG.LND.AGRI.ZS,,57.878356,57.955016,58.031676,58.116002,58.123668,58.192662,58.229459,58.230992,58.255523,58.270855,58.316851,58.335250,58.337090,58.338776,58.338776,58.338316,58.338316,58.338316,58.336783,58.336783,58.342916,58.344449,58.344449,58.344449,58.344449,58.344449,58.33065,58.322984,58.322984,58.322984,58.307652,58.307652,58.160465,57.974947,57.898287,57.889088,57.94735,58.059274,57.899821,57.945817,57.947350,57.939684,58.083805,58.151266,58.134400,58.123668,58.129801,58.132867,58.132867,58.134400,58.131334,58.129801,58.123668,58.123668,58.123668,58.123668,58.123668,58.276988,58.276988,58.741548,58.741548,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16964,ZWE,Zimbabwe,ER.PTD.TOTL.ZS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,27.214542,27.214585,27.214585,27.214747,27.214747,27.214747,27.214747,
16965,ZWE,Zimbabwe,AG.LND.FRLS.HA,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16966,ZWE,Zimbabwe,SL.UEM.TOTL.ZS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4.750000,4.929000,5.007000,4.960000,5.557000,6.123000,6.93000,6.356000,6.000000,5.683000,5.297000,4.994000,4.743000,4.390000,4.665000,4.802000,5.073000,5.543000,5.617000,5.542000,5.370000,5.020000,4.932000,4.770000,5.412000,5.918000,6.349000,6.767000,7.370000,8.651000,9.540000,9.256000,9.116
16967,ZWE,Zimbabwe,SP.UWT.TFRT,,,,,,,,,,,,,,,,,,,,,,,,,14.100000,,,,,,,,,,19.100000,,,,,16.700000,,,,,,,15.500000,,,,,14.600000,,,10.382129,10.400000,,,,,,,,
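
To see the `@` reference in isolation, here is a minimal sketch on made-up rows. The `aggregates` list is a shortened, hypothetical stand-in for `noncountries`, and the `.isin()` version is an equivalent filter written without `.query()`.

```python
import pandas as pd

toy = pd.DataFrame({
    "country_name_wb": ["Arab World", "Afghanistan", "World", "Zimbabwe"],
    "value": [1, 2, 3, 4],
})

# Shortened stand-in for the noncountries list.
aggregates = ["Arab World", "World"]

# '@aggregates' pulls the Python list into the query expression.
with_query = toy.query("country_name_wb not in @aggregates")

# Equivalent Boolean-mask version: '~' negates the membership test.
with_isin = toy[~toy["country_name_wb"].isin(aggregates)]

print(with_query["country_name_wb"].tolist())
```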


### Part d
The features in this dataset are given strange and incomprehensible codes such as 'EG.CFT.ACCS.ZS'. Use the `replace_map` dictionary, defined below, to recode all of these values with more descriptive names for each feature. [1 point]

In [14]:
replace_map = {
  "AG.LND.AGRI.ZS": "agricultural_land",
  "AG.LND.FRST.ZS": "forest_area",
  "AG.PRD.FOOD.XD": "food_production_index",
  "CC.EST": "control_of_corruption",
  "EG.CFT.ACCS.ZS": "access_to_clean_fuels_and_technologies_for_cooking",
  "EG.EGY.PRIM.PP.KD": "energy_intensity_level_of_primary_energy",
  "EG.ELC.ACCS.ZS": "access_to_electricity",
  "EG.ELC.COAL.ZS": "electricity_production_from_coal_sources",
  "EG.ELC.RNEW.ZS": "renewable_electricity_output",
  "EG.FEC.RNEW.ZS": "renewable_energy_consumption",
  "EG.IMP.CONS.ZS": "energy_imports",
  "EG.USE.COMM.FO.ZS": "fossil_fuel_energy_consumption",
  "EG.USE.PCAP.KG.OE": "energy_use",
  "EN.ATM.CO2E.PC": "co2_emissions",
  "EN.ATM.METH.PC": "methane_emissions",
  "EN.ATM.NOXE.PC": "nitrous_oxide_emissions",
  "EN.ATM.PM25.MC.M3": "pm2_5_air_pollution",
  "EN.CLC.CDDY.XD": "cooling_degree_days",
  "EN.CLC.GHGR.MT.CE": "ghg_net_emissions",
  "EN.CLC.HEAT.XD": "heat_index_35",
  "EN.CLC.MDAT.ZS": "droughts",
  "EN.CLC.PRCP.XD": "maximum_5-day_rainfall",
  "EN.CLC.SPEI.XD": "mean_drought_index",
  "EN.MAM.THRD.NO": "mammal_species",
  "EN.POP.DNST": "population_density",
  "ER.H2O.FWTL.ZS": "annual_freshwater_withdrawals",
  "ER.PTD.TOTL.ZS": "terrestrial_and_marine_protected_areas",
  "GB.XPD.RSDV.GD.ZS": "research_and_development_expenditure",
  "GE.EST": "government_effectiveness",
  "IC.BUS.EASE.XQ": "ease_of_doing_business_rank",
  "IC.LGL.CRED.XQ": "strength_of_legal_rights_index",
  "IP.JRN.ARTC.SC": "scientific_and_technical_journal_articles",
  "IP.PAT.RESD": "patent_applications",
  "IT.NET.USER.ZS": "individuals_using_the_internet",
  "NV.AGR.TOTL.ZS": "agriculture",
  "NY.ADJ.DFOR.GN.ZS": "net_forest_depletion",
  "NY.ADJ.DRES.GN.ZS": "natural_resources_depletion",
  "NY.GDP.MKTP.KD.ZG": "gdp_growth",
  "PV.EST": "political_stability_and_absence_of_violence",
  "RL.EST": "rule_of_law",
  "RQ.EST": "regulatory_quality",
  "SE.ADT.LITR.ZS": "literacy_rate",
  "SE.ENR.PRSC.FM.ZS": "gross_school_enrollment",
  "SE.PRM.ENRR": "primary_school_enrollment",
  "SE.XPD.TOTL.GB.ZS": "government_expenditure_on_education",
  "SG.GEN.PARL.ZS": "proportion_of_seats_held_by_women_in_national_parliaments",
  "SH.DTH.COMM.ZS": "cause_of_death",
  "SH.DYN.MORT": "mortality_rate",
  "SH.H2O.SMDW.ZS": "people_using_safely_managed_drinking_water_services",
  "SH.MED.BEDS.ZS": "hospital_beds",
  "SH.STA.OWAD.ZS": "prevalence_of_overweight",
  "SH.STA.SMSS.ZS": "people_using_safely_managed_sanitation_services",
  "SI.DST.FRST.20": "income_share_held_by_lowest_20pct",
  "SI.POV.GINI": "gini_index",
  "SI.POV.NAHC": "poverty_headcount_ratio_at_national_poverty_lines",
  "SI.SPR.PCAP.ZG": "annualized_average_growth_rate_in_per_capita_real_survey_mean_consumption_or_income",
  "SL.TLF.0714.ZS": "children_in_employment",
  "SL.TLF.ACTI.ZS": "labor_force_participation_rate",
  "SL.TLF.CACT.FM.ZS": "ratio_of_female_to_male_labor_force_participation_rate",
  "SL.UEM.TOTL.ZS": "unemployment",
  "SM.POP.NETM": "net_migration",
  "SN.ITK.DEFC.ZS": "prevalence_of_undernourishment",
  "SP.DYN.LE00.IN": "life_expectancy_at_birth",
  "SP.DYN.TFRT.IN": "fertility_rate",
  "SP.POP.65UP.TO.ZS": "population_ages_65_and_above",
  "SP.UWT.TFRT": "unmet_need_for_contraception",
  "VA.EST": "voice_and_accountability",
  "EN.CLC.CSTP.ZS": "coastal_protection",
  "SD.ESR.PERF.XQ": "economic_and_social_rights_performance_score",
  "EN.CLC.HDDY.XD": "heating_degree_days",
  "EN.LND.LTMP.DC": "land_surface_temperature",
  "ER.H2O.FWST.ZS": "freshwater_withdrawal",
  "EN.H2O.BDYS.ZS": "water_quality",
  "AG.LND.FRLS.HA": "tree_cover_loss",
}

I use `.map()` on the `feature` column to apply the mapping defined by the dictionary `replace_map`. Because `wb` is the result of earlier filtering steps, I build the recoded column with `.assign()`, which returns a new dataframe instead of setting values on what pandas may treat as a copy of a slice:

In [15]:
wb = wb.assign(feature=wb['feature'].map(replace_map))
wb


Unnamed: 0,country_code,country_name_wb,feature,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
3266,AFG,Afghanistan,access_to_clean_fuels_and_technologies_for_coo...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,6.700000,7.700000,8.800000,10.000000,11.100000,12.500000,13.900000,15.300000,16.800000,18.200000,19.700000,21.300000,22.700000,24.300000,25.700000,27.250000,28.500000,30.000000,31.100000,32.450000,33.800000,35.400000,,
3267,AFG,Afghanistan,access_to_electricity,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4.446891,9.294527,14.133616,18.971165,23.814182,28.669672,33.544418,38.440002,42.400000,48.279007,42.700000,43.222019,69.100000,68.040878,89.500000,71.500000,97.700000,97.700000,93.430878,97.700000,97.700000,97.700000,,
3268,AFG,Afghanistan,natural_resources_depletion,,,,,,,,,,,0.503854,0.644591,0.786744,1.242755,1.493057,1.849701,1.962964,1.840750,1.618266,1.692939,1.611075,1.226914,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.270365,0.359454,0.386644,0.380988,0.335091,0.315571,0.290261,0.363282,0.350879,0.401053,0.370131,0.243668,0.335935,,
3269,AFG,Afghanistan,net_forest_depletion,,,,,,,,,,,0.279412,0.337567,0.389290,0.751733,0.783186,0.795687,0.756951,0.589740,0.734971,0.559937,0.581685,0.434263,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.229864,0.292788,0.244239,0.211376,0.211413,0.216609,0.232762,0.284781,0.229822,0.237615,0.269353,0.237958,0.317732,,
3270,AFG,Afghanistan,agricultural_land,,57.878356,57.955016,58.031676,58.116002,58.123668,58.192662,58.229459,58.230992,58.255523,58.270855,58.316851,58.335250,58.337090,58.338776,58.338776,58.338316,58.338316,58.338316,58.336783,58.336783,58.342916,58.344449,58.344449,58.344449,58.344449,58.344449,58.33065,58.322984,58.322984,58.322984,58.307652,58.307652,58.160465,57.974947,57.898287,57.889088,57.94735,58.059274,57.899821,57.945817,57.947350,57.939684,58.083805,58.151266,58.134400,58.123668,58.129801,58.132867,58.132867,58.134400,58.131334,58.129801,58.123668,58.123668,58.123668,58.123668,58.123668,58.276988,58.276988,58.741548,58.741548,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16964,ZWE,Zimbabwe,terrestrial_and_marine_protected_areas,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,27.214542,27.214585,27.214585,27.214747,27.214747,27.214747,27.214747,
16965,ZWE,Zimbabwe,tree_cover_loss,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16966,ZWE,Zimbabwe,unemployment,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4.750000,4.929000,5.007000,4.960000,5.557000,6.123000,6.93000,6.356000,6.000000,5.683000,5.297000,4.994000,4.743000,4.390000,4.665000,4.802000,5.073000,5.543000,5.617000,5.542000,5.370000,5.020000,4.932000,4.770000,5.412000,5.918000,6.349000,6.767000,7.370000,8.651000,9.540000,9.256000,9.116
16967,ZWE,Zimbabwe,unmet_need_for_contraception,,,,,,,,,,,,,,,,,,,,,,,,,14.100000,,,,,,,,,,19.100000,,,,,16.700000,,,,,,,15.500000,,,,,14.600000,,,10.382129,10.400000,,,,,,,,
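
One caveat with `.map()` is worth illustrating: a code that is missing from the dictionary becomes `NaN`, whereas `.replace()` leaves it unchanged. A minimal sketch, where `small_map` is a two-entry excerpt of `replace_map` and `XX.NOT.IN.MAP` is a made-up code:

```python
import pandas as pd

codes = pd.Series(["EG.ELC.ACCS.ZS", "SL.UEM.TOTL.ZS", "XX.NOT.IN.MAP"])

# Two entries copied from replace_map.
small_map = {
    "EG.ELC.ACCS.ZS": "access_to_electricity",
    "SL.UEM.TOTL.ZS": "unemployment",
}

mapped = codes.map(small_map)        # unmapped code becomes NaN
replaced = codes.replace(small_map)  # unmapped code is left as-is

print(mapped.isna().sum())
print(replaced.tolist())
```

With `.map()`, counting missing values in the recoded column is a quick way to check whether any indicator code in the data is absent from the dictionary.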


## Problem 3
The `wb` dataset is strangely organized. The features are stored in the rows, when typically we would want these features to be columns. Also, years are stored in columns, when typically we would want years to be represented by different rows. We can repair this structure by reshaping the data. 

### Part a
First, reshape the data to turn the columns that refer to years into rows. [1 point]

To turn the year columns into rows, I will use `pd.melt()`.
* The first argument is the dataframe to reshape, `wb`.
* The second argument, `id_vars`, is a list of the columns that should remain columns after the dataframe is melted. In our case, these are `country_code`, `country_name_wb`, and `feature`.
* The third argument, `value_vars`, is a list of the columns whose names should be stored in rows. In our case, these column names are consecutive years, so I use `[str(i) for i in range(1960,2022)]`, which loops over the integers from 1960 to 2021 and places each year in a list as a string.

In [16]:
wb = pd.melt(wb, id_vars = ['country_code', 'country_name_wb', 'feature'],
            value_vars = [str(i) for i in range(1960,2022)])
wb

Unnamed: 0,country_code,country_name_wb,feature,variable,value
0,AFG,Afghanistan,access_to_clean_fuels_and_technologies_for_coo...,1960,
1,AFG,Afghanistan,access_to_electricity,1960,
2,AFG,Afghanistan,natural_resources_depletion,1960,
3,AFG,Afghanistan,net_forest_depletion,1960,
4,AFG,Afghanistan,agricultural_land,1960,
...,...,...,...,...,...
849581,ZWE,Zimbabwe,terrestrial_and_marine_protected_areas,2021,27.214747
849582,ZWE,Zimbabwe,tree_cover_loss,2021,
849583,ZWE,Zimbabwe,unemployment,2021,9.540000
849584,ZWE,Zimbabwe,unmet_need_for_contraception,2021,
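
The reshape is easier to see on a tiny example. This is a minimal sketch with made-up numbers. It also uses the optional `var_name` argument of `pd.melt()`, which names the new column directly instead of the default `variable`.

```python
import pandas as pd

# Toy wide table: one row per country and feature, one column per year.
wide = pd.DataFrame({
    "country_code": ["AFG", "ZWE"],
    "feature": ["unemployment", "unemployment"],
    "2020": [11.7, 8.7],
    "2021": [12.0, 9.5],
    "2022": [14.1, 9.3],
})

# Only the columns listed in value_vars are melted; '2022' is dropped.
long = pd.melt(wide,
               id_vars=["country_code", "feature"],
               value_vars=[str(i) for i in range(2020, 2022)],
               var_name="year")

print(long)
```

Each of the two original rows now appears once per melted year, so the result has four rows, and the year labels are still strings because they came from column names.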


### Part b
Then rename `variable` to `year`, and reshape the data again by turning the rows that refer to features into columns. [1 point]

After melting, the old column names (1960 through 2021 in this case) are stored in a column named "variable", and the datapoints that populated those columns are stored in a column named "value". So I rename the "variable" column to "year":

In [17]:
wb = wb.rename({'variable': 'year'}, axis=1)
wb

Unnamed: 0,country_code,country_name_wb,feature,year,value
0,AFG,Afghanistan,access_to_clean_fuels_and_technologies_for_coo...,1960,
1,AFG,Afghanistan,access_to_electricity,1960,
2,AFG,Afghanistan,natural_resources_depletion,1960,
3,AFG,Afghanistan,net_forest_depletion,1960,
4,AFG,Afghanistan,agricultural_land,1960,
...,...,...,...,...,...
849581,ZWE,Zimbabwe,terrestrial_and_marine_protected_areas,2021,27.214747
849582,ZWE,Zimbabwe,tree_cover_loss,2021,
849583,ZWE,Zimbabwe,unemployment,2021,9.540000
849584,ZWE,Zimbabwe,unmet_need_for_contraception,2021,


To turn the `feature` rows into columns, I use the `.pivot_table()` method on the `wb` dataframe. Its arguments are as follows:
* `index` - a list containing the names of the current columns that we want to remain columns in the reshaped data. In our case, this is `country_code`, `country_name_wb`, and `year`.
* `columns` - the name of the column that contains the names of the new columns we are trying to create, which is `feature` in our case.
* `values` - the name of the column that contains the data points we are trying to move to the new columns, which is `value` in our case.
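
As a minimal sketch of how these three arguments fit together, here is the same reshape on a tiny made-up data frame (the country codes and values are invented for illustration and are not taken from the World Bank data):

```python
import pandas as pd

# Toy long-format data (made-up values), shaped like `wb` after melting.
long_df = pd.DataFrame({
    "country_code": ["AAA", "AAA", "AAA", "AAA", "BBB", "BBB"],
    "year": ["2000", "2000", "2001", "2001", "2000", "2000"],
    "feature": ["gdp_growth", "unemployment"] * 3,
    "value": [1.5, 7.0, 2.0, 6.5, 3.0, 4.0],
})

# index:   columns that stay as row identifiers
# columns: the column whose distinct values become new column names
# values:  the column whose entries fill the new cells
wide_df = long_df.pivot_table(
    index=["country_code", "year"],
    columns="feature",
    values="value",
).reset_index()

print(wide_df)
```

Note that `.pivot_table()` aggregates with the mean by default, so duplicated key combinations would be averaged rather than flagged.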

First, I want to make sure that the `value` column is numeric:

In [18]:
wb.dtypes

country_code        object
country_name_wb     object
feature             object
year                object
value              float64
dtype: object

The `value` column is numeric, so we can move on to `.pivot_table()`, chaining `.reset_index()` so that the multi-index created by the pivot is flattened back into regular columns:

In [19]:
wb = wb.pivot_table(index=['country_code', 'country_name_wb', 'year'],
                   columns='feature',
                   values='value').reset_index()
wb

feature,country_code,country_name_wb,year,access_to_clean_fuels_and_technologies_for_cooking,access_to_electricity,agricultural_land,agriculture,annual_freshwater_withdrawals,annualized_average_growth_rate_in_per_capita_real_survey_mean_consumption_or_income,cause_of_death,children_in_employment,co2_emissions,coastal_protection,control_of_corruption,cooling_degree_days,economic_and_social_rights_performance_score,electricity_production_from_coal_sources,energy_imports,energy_intensity_level_of_primary_energy,energy_use,fertility_rate,food_production_index,forest_area,fossil_fuel_energy_consumption,freshwater_withdrawal,gdp_growth,ghg_net_emissions,gini_index,government_effectiveness,government_expenditure_on_education,gross_school_enrollment,heat_index_35,heating_degree_days,hospital_beds,income_share_held_by_lowest_20pct,individuals_using_the_internet,labor_force_participation_rate,land_surface_temperature,life_expectancy_at_birth,literacy_rate,mammal_species,mean_drought_index,methane_emissions,mortality_rate,natural_resources_depletion,net_forest_depletion,net_migration,nitrous_oxide_emissions,patent_applications,people_using_safely_managed_drinking_water_services,people_using_safely_managed_sanitation_services,pm2_5_air_pollution,political_stability_and_absence_of_violence,population_ages_65_and_above,population_density,poverty_headcount_ratio_at_national_poverty_lines,prevalence_of_overweight,prevalence_of_undernourishment,primary_school_enrollment,proportion_of_seats_held_by_women_in_national_parliaments,ratio_of_female_to_male_labor_force_participation_rate,regulatory_quality,renewable_electricity_output,renewable_energy_consumption,research_and_development_expenditure,rule_of_law,scientific_and_technical_journal_articles,strength_of_legal_rights_index,terrestrial_and_marine_protected_areas,tree_cover_loss,unemployment,unmet_need_for_contraception,voice_and_accountability,water_quality
0,AFG,Afghanistan,1960,,,,,,,,,,,,,,,,,,7.282,,,,,,,,,,,,,0.170627,,,,,32.535,,,0.761520,,357.3,,,2606.0,,,,,,,2.833029,,,,,,,,,,,,,,,,,,,,
1,AFG,Afghanistan,1961,,,57.878356,,,,,,,,,,,,,,,7.284,41.00,,,,,,,,,,,,,,,,,33.068,,,-0.076736,,351.7,,,6109.0,,,,,,,2.817674,13.477056,,,,,,,,,,,,,,,,,,,
2,AFG,Afghanistan,1962,,,57.955016,,,,,,,,,,,,,,,7.292,41.34,,,,,,,,,,,,,,,,,33.547,,,-0.665528,,345.8,,,7016.0,,,,,,,2.799055,13.751356,,,,,,,,,,,,,,,,,,,
3,AFG,Afghanistan,1963,,,58.031676,,,,,,,,,,,,,,,7.302,41.16,,,,,,,,,,,,,,,,,34.016,,,0.216942,,340.2,,,6681.0,,,,,,,2.778968,14.040239,,,,,,,,,,,,,,,,,,,
4,AFG,Afghanistan,1964,,,58.116002,,,,,,,,,,,,,,,7.304,44.60,,,,,,,,,,,,,,,,,34.494,,,0.488956,,334.8,,,7079.0,,,,,,,2.758929,14.343888,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11961,ZWE,Zimbabwe,2017,29.8,43.979065,41.876696,8.340969,27.234910,,,,0.663069,,-1.298485,2235.64,1.883484,,,12.79,,3.706,106.59,45.451183,,31.346226,4.080264,,44.3,-1.282108,20.874201,,0.00,330.37,,6.0,24.400000,67.093,30.437847,60.709,,,1.297028,0.816483,56.2,5.905431,4.205873,-59918.0,0.349407,,26.944588,33.586600,22.582451,-0.710431,3.233118,38.131320,30.4,,36.3,98.545097,32.575758,83.796443,-1.583454,,82.63,,-1.396204,334.71,5.0,27.214585,,6.349,,-1.195905,76.5
11962,ZWE,Zimbabwe,2018,30.0,45.400288,41.876696,7.319375,30.761677,,,,0.735435,,-1.246001,2663.61,1.870709,,,12.82,,3.659,107.82,45.332093,,35.405385,5.009867,,,-1.297906,19.039841,,0.01,303.99,,,25.000000,67.052,32.686932,61.414,,10.0,-0.690707,0.817171,53.7,2.783017,1.305149,-59918.0,0.346090,,26.807938,33.544266,22.085555,-0.721038,3.293359,38.909614,,,38.2,97.879272,31.481481,84.079919,-1.525652,,80.43,,-1.292463,406.23,5.0,27.214585,,6.767,,-1.136798,
11963,ZWE,Zimbabwe,2019,30.2,46.682095,41.876696,9.819262,30.761677,1.03,47.647301,,0.663338,,-1.271190,2998.17,,,,13.40,,3.599,105.74,45.213002,,35.405385,-6.332446,,50.3,-1.319774,,,0.21,293.51,,,26.588274,66.938,33.812884,61.292,,,-0.301619,0.811302,52.7,4.088445,2.028044,-59918.0,0.338353,,26.683978,32.961481,20.834700,-0.943286,3.345781,39.691374,38.3,,38.9,97.476608,31.851852,84.432828,-1.486515,,81.52,,-1.303515,431.62,6.0,27.214747,,7.370,,-1.163669,
11964,ZWE,Zimbabwe,2020,30.3,52.747667,41.876696,8.772859,30.761677,,,,0.530484,,-1.287992,2640.14,,,,13.71,,3.545,110.34,45.093912,,35.405385,-7.816951,,,-1.355726,15.666611,,0.11,355.61,,,29.298565,66.259,32.436873,61.124,,,-0.204054,0.769692,51.8,4.108902,2.315889,-29955.0,0.299359,,26.573846,32.381232,,-1.052728,3.376262,40.505793,,,39.1,97.384163,31.851852,83.784351,-1.434415,,84.36,,-1.329611,480.16,,27.214747,,8.651,,-1.113408,83.3


### Part c
After these reshapes, the year column in the `wb` data frame is stored as a string. Convert this column to an integer data type. [1 point] 

Using the `.astype()` method to convert the year column to integer:

In [20]:
wb.year = wb.year.astype(int)

In [21]:
wb.year.dtypes

dtype('int64')

## Problem 4
Next we will merge the `wb` data frame with the `vdem` data frame, matching on the 'country_code' and 'year' columns. 

### Part a
First, write a sentence stating whether you expect this merge to be one-to-one, many-to-one, one-to-many, or many-to-many, and describe your rationale. [1 point]

I expect this merge to be one-to-one. The 'country_code' and 'year' combination should uniquely identify each row in both data frames. Each combination of 'country_code' and 'year' in wb corresponds to a single combination of 'country_code' and 'year' in vdem.
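
One way to check this expectation before merging is to test whether the key combination is unique in each data frame. The sketch below uses small made-up stand-ins for `wb` and `vdem` (the names `left`, `right`, and all values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins for `wb` and `vdem` (made-up values).
left = pd.DataFrame({
    "country_code": ["AAA", "AAA", "BBB"],
    "year": [2000, 2001, 2000],
    "gdp_growth": [1.5, 2.0, 3.0],
})
right = pd.DataFrame({
    "country_code": ["AAA", "AAA", "BBB"],
    "year": [2000, 2001, 2000],
    "democracy": [0.4, 0.5, 0.9],
})

keys = ["country_code", "year"]

# A merge is one-to-one only if the key combination is unique on both sides.
left_unique = not left.duplicated(subset=keys).any()
right_unique = not right.duplicated(subset=keys).any()
print(left_unique, right_unique)
```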

### Part b
Next, merge the two datasets together in a way that checks whether your expectation is met, and also allows you to see the rows that failed to match. [2 points]

I use `pd.merge()` to perform an outer merge of the two dataframes, passing the matching keys 'country_code' and 'year' as a list to the `on` argument and setting the type of merge with the `how` argument. The `validate` argument checks that the merge is **one-to-one** (from wb to vdem) and raises an error if it is not, and the `indicator` argument adds a column showing which rows matched and which did not. This approach lets us verify the type of merge and identify unmatched rows.

In [22]:
merged_data = pd.merge(wb, vdem, 
               on= ['country_code', 'year'], 
               how = 'outer',
               validate = 'one_to_one',
               indicator = 'matched')
merged_data

Unnamed: 0,country_code,country_name_wb,year,access_to_clean_fuels_and_technologies_for_cooking,access_to_electricity,agricultural_land,agriculture,annual_freshwater_withdrawals,annualized_average_growth_rate_in_per_capita_real_survey_mean_consumption_or_income,cause_of_death,children_in_employment,co2_emissions,coastal_protection,control_of_corruption,cooling_degree_days,economic_and_social_rights_performance_score,electricity_production_from_coal_sources,energy_imports,energy_intensity_level_of_primary_energy,energy_use,fertility_rate,food_production_index,forest_area,fossil_fuel_energy_consumption,freshwater_withdrawal,gdp_growth,ghg_net_emissions,gini_index,government_effectiveness,government_expenditure_on_education,gross_school_enrollment,heat_index_35,heating_degree_days,hospital_beds,income_share_held_by_lowest_20pct,individuals_using_the_internet,labor_force_participation_rate,land_surface_temperature,life_expectancy_at_birth,literacy_rate,mammal_species,mean_drought_index,methane_emissions,mortality_rate,natural_resources_depletion,net_forest_depletion,net_migration,nitrous_oxide_emissions,patent_applications,people_using_safely_managed_drinking_water_services,people_using_safely_managed_sanitation_services,pm2_5_air_pollution,political_stability_and_absence_of_violence,population_ages_65_and_above,population_density,poverty_headcount_ratio_at_national_poverty_lines,prevalence_of_overweight,prevalence_of_undernourishment,primary_school_enrollment,proportion_of_seats_held_by_women_in_national_parliaments,ratio_of_female_to_male_labor_force_participation_rate,regulatory_quality,renewable_electricity_output,renewable_energy_consumption,research_and_development_expenditure,rule_of_law,scientific_and_technical_journal_articles,strength_of_legal_rights_index,terrestrial_and_marine_protected_areas,tree_cover_loss,unemployment,unmet_need_for_contraception,voice_and_accountability,water_quality,country_name_vdem,democracy,educational_equality,matched
0,AFG,Afghanistan,1960,,,,,,,,,,,,,,,,,,7.282,,,,,,,,,,,,,0.170627,,,,,32.535,,,0.761520,,357.3,,,2606.0,,,,,,,2.833029,,,,,,,,,,,,,,,,,,,,,Afghanistan,0.080,-1.123,both
1,AFG,Afghanistan,1961,,,57.878356,,,,,,,,,,,,,,,7.284,41.00,,,,,,,,,,,,,,,,,33.068,,,-0.076736,,351.7,,,6109.0,,,,,,,2.817674,13.477056,,,,,,,,,,,,,,,,,,,,Afghanistan,0.083,-1.123,both
2,AFG,Afghanistan,1962,,,57.955016,,,,,,,,,,,,,,,7.292,41.34,,,,,,,,,,,,,,,,,33.547,,,-0.665528,,345.8,,,7016.0,,,,,,,2.799055,13.751356,,,,,,,,,,,,,,,,,,,,Afghanistan,0.082,-1.123,both
3,AFG,Afghanistan,1963,,,58.031676,,,,,,,,,,,,,,,7.302,41.16,,,,,,,,,,,,,,,,,34.016,,,0.216942,,340.2,,,6681.0,,,,,,,2.778968,14.040239,,,,,,,,,,,,,,,,,,,,Afghanistan,0.085,-1.123,both
4,AFG,Afghanistan,1964,,,58.116002,,,,,,,,,,,,,,,7.304,44.60,,,,,,,,,,,,,,,,,34.494,,,0.488956,,334.8,,,7079.0,,,,,,,2.758929,14.343888,,,,,,,,,,,,,,,,,,,,Afghanistan,0.137,-0.951,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12356,ZZB,,2017,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Zanzibar,0.267,1.661,right_only
12357,ZZB,,2018,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Zanzibar,0.268,1.486,right_only
12358,ZZB,,2019,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Zanzibar,0.266,1.486,right_only
12359,ZZB,,2020,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Zanzibar,0.258,1.427,right_only
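
To illustrate what `validate` and `indicator` do, here is a small sketch on made-up data frames (the names `left`, `right`, `right_dup`, and all values are hypothetical). It shows the three indicator categories and what happens when the one-to-one assumption is violated:

```python
import pandas as pd

keys = ["country_code", "year"]

# Toy stand-ins for `wb` and `vdem` (made-up values).
left = pd.DataFrame({"country_code": ["AAA", "BBB"],
                     "year": [2000, 2000],
                     "gdp_growth": [1.5, 3.0]})
right = pd.DataFrame({"country_code": ["BBB", "CCC"],
                      "year": [2000, 2000],
                      "democracy": [0.9, 0.2]})

# The outer merge keeps every row; the indicator column records its origin.
merged = pd.merge(left, right, on=keys, how="outer",
                  validate="one_to_one", indicator="matched")
print(merged[keys + ["matched"]])

# Duplicating a key on the right side breaks the one-to-one assumption,
# so the same call raises pandas.errors.MergeError.
right_dup = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    pd.merge(left, right_dup, on=keys, how="outer", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print("MergeError raised:", raised)
```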


### Part c
After this merge, use the `.value_counts()` method to see the total number of observations that were found in both datasets, the number found only in the left dataset, and the number found only in the right dataset. (If you entered the `wb` data frame into the merge function first, then "left_only" refers to the rows found in the World Bank but not V-Dem, and "right_only" refers to the rows found in V-Dem but not the World Bank.) There should be more than 9000 rows that matched, but more than 2000 that failed to match.

Then conduct two data aggregations to help us investigate why these observations did not match:

* First use `.query()` to keep only the observations that were present in `wb` but not `vdem`. (These are the 'left_only' observations if you typed the World Bank data into the merge function first.) Use `.groupby()` to aggregate the data by both 'country_code' and 'country_name_wb'. Then save the minimum and maximum values of 'year' for each country.

* Then use `.query()` to keep only the observations that were present in `vdem` data but not `wb`. Use `.groupby()` to aggregate the data by both 'country_code' and 'country_name_vdem'. Then save the minimum and maximum values of 'year' for each country. [2 points]

In [23]:
merged_data['matched'].value_counts()

matched
both          9976
left_only     1990
right_only     395
Name: count, dtype: int64

I believe these results align with the expectation that there should be more than 9000 rows that matched, and more than 2000 rows that failed to match (total of `left_only` and `right_only` is 2385).

- **Rows found in both datasets ("both")**: 9976
- **Rows found only in the left dataset (World Bank, "left_only")**: 1990
- **Rows found only in the right dataset (V-Dem, "right_only")**: 395

Investigating unmatched observations:

Left Only (World Bank):

In [24]:
left_only = merged_data.query("matched=='left_only'")
left_only.groupby(['country_code', 'country_name_wb'])['year'].agg(['min', 'max']).reset_index()

Unnamed: 0,country_code,country_name_wb,min,max
0,AND,Andorra,1960,2021
1,ARE,United Arab Emirates,1960,1970
2,ARM,Armenia,1960,1989
3,ATG,Antigua and Barbuda,1960,2021
4,AZE,Azerbaijan,1960,1989
5,BGD,Bangladesh,1960,1970
6,BHS,"Bahamas, The",1960,2021
7,BIH,Bosnia and Herzegovina,1960,1991
8,BLR,Belarus,1960,1989
9,BLZ,Belize,1960,2021


Right Only (V-Dem):

In [25]:
right_only = merged_data.query("matched=='right_only'")
right_only.groupby(['country_code', 'country_name_vdem'])['year'].agg(['min', 'max']).reset_index()

Unnamed: 0,country_code,country_name_vdem,min,max
0,DDR,German Democratic Republic,1960,1990
1,HKG,Hong Kong,1960,2021
2,PSE,Palestine/West Bank,1967,2021
3,PSG,Palestine/Gaza,1960,2021
4,SML,Somaliland,1991,2021
5,TWN,Taiwan,1960,2021
6,VDR,Republic of Vietnam,1960,1975
7,XKX,Kosovo,1999,2021
8,YMD,South Yemen,1960,1990
9,ZZB,Zanzibar,1960,2021


### Part d
Here's where a deep understanding of the data becomes very important. There are two reasons why an observation may fail to match in a merge. One reason is a difference in spelling. Suppose that South Korea (which is also known as the Republic of Korea) is coded as SKO in the World Bank data and ROK in V-Dem. In this case, we should recode one or the other of SKO and ROK so that they match, otherwise we will lose the data on South Korea. But the second reason why observations might fail to match is due to differences in coverage in the data collection strategy: it is possible that a country wasn't included in one data's coverage, or that certain years for that country were not included. For differences in coverage, there's no way to manipulate the data to match, so we are out of luck and we have to either delete these observations or proceed with missing data from one of the data sources.
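
The recoding case can be sketched as follows, using the hypothetical SKO and ROK codes from the paragraph above (the data frames and values are made up for illustration):

```python
import pandas as pd

# Hypothetical codes from the text: SKO in one source, ROK in the other.
left = pd.DataFrame({"country_code": ["SKO"], "year": [2000],
                     "gdp_growth": [8.9]})
right = pd.DataFrame({"country_code": ["ROK"], "year": [2000],
                      "democracy": [0.7]})

# Without recoding, the inner merge finds no matching keys.
before = pd.merge(left, right, on=["country_code", "year"], how="inner")

# Recode one side so that the keys agree, then merge again.
right["country_code"] = right["country_code"].replace({"ROK": "SKO"})
after = pd.merge(left, right, on=["country_code", "year"], how="inner")

print(len(before), len(after))
```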

Take a close look at the two data aggregation tables you generated in part c, and answer the following questions:

* Do you see any countries that are present in both the unmatched World Bank rows and the unmatched V-Dem rows, but with different spellings?

I don't see any clear cases of countries present in both lists with different spellings. However, DDR, German Democratic Republic (V-Dem) is likely part of DEU (Germany) in the World Bank data, and YMD, South Yemen (V-Dem) is likely part of YEM (Yemen) in the World Bank data.

* Do some digging on Wikipedia and other sources on the Internet. What do you think is the primary reason why some countries are present in the V-Dem data but not the World Bank? (You don't need to describe the reasoning for every country. Just dig until you see a general pattern and describe it here.)

The reason is likely these countries' historical status changes (such as the German Democratic Republic's reunification with West Germany) and/or complex political status (such as Palestine/West Bank and Palestine/Gaza). These changes affect how data are collected, recognized, and reported by different organizations over the years. For example, the maximum year for DDR (German Democratic Republic) in V-Dem is 1990, the year of the country's reunification with West Germany.

V-Dem likely aims to provide a comprehensive historical and political analysis, including countries that may not be universally recognized but are relevant for studying democratic processes.

* Do some more digging on Wikipedia and other sources on the Internet. What do you think is the primary reason why some countries are present in the World Bank data but not V-Dem? (You don't need to describe the reasoning for every country. Just dig until you see a general pattern and describe it here.) [1 point]

On the other hand, the World Bank data appears to focus on currently existing, internationally recognized states. This aligns with the World Bank's role as an international financial institution that typically works with officially recognized countries.

### Part e
Once you are convinced that all of the unmatched observations are due to differences in the coverage of the data collection strategies of the World Bank and V-Dem, repeat the merge, dropping all unmatched observations. This time there is no need to validate the type of merge, and no need to define a variable to indicate matching. [1 point]

To merge while dropping all unmatched observations, we can perform an inner join, which keeps only the rows that have matching 'country_code' and 'year' values in both datasets:

In [32]:
cardb_data = pd.merge(wb, vdem,
                      on=['country_code', 'year'],
                      how='inner')
cardb_data

Unnamed: 0,country_code,country_name_wb,year,access_to_clean_fuels_and_technologies_for_cooking,access_to_electricity,agricultural_land,agriculture,annual_freshwater_withdrawals,annualized_average_growth_rate_in_per_capita_real_survey_mean_consumption_or_income,cause_of_death,children_in_employment,co2_emissions,coastal_protection,control_of_corruption,cooling_degree_days,economic_and_social_rights_performance_score,electricity_production_from_coal_sources,energy_imports,energy_intensity_level_of_primary_energy,energy_use,fertility_rate,food_production_index,forest_area,fossil_fuel_energy_consumption,freshwater_withdrawal,gdp_growth,ghg_net_emissions,gini_index,government_effectiveness,government_expenditure_on_education,gross_school_enrollment,heat_index_35,heating_degree_days,hospital_beds,income_share_held_by_lowest_20pct,individuals_using_the_internet,labor_force_participation_rate,land_surface_temperature,life_expectancy_at_birth,literacy_rate,mammal_species,mean_drought_index,methane_emissions,mortality_rate,natural_resources_depletion,net_forest_depletion,net_migration,nitrous_oxide_emissions,patent_applications,people_using_safely_managed_drinking_water_services,people_using_safely_managed_sanitation_services,pm2_5_air_pollution,political_stability_and_absence_of_violence,population_ages_65_and_above,population_density,poverty_headcount_ratio_at_national_poverty_lines,prevalence_of_overweight,prevalence_of_undernourishment,primary_school_enrollment,proportion_of_seats_held_by_women_in_national_parliaments,ratio_of_female_to_male_labor_force_participation_rate,regulatory_quality,renewable_electricity_output,renewable_energy_consumption,research_and_development_expenditure,rule_of_law,scientific_and_technical_journal_articles,strength_of_legal_rights_index,terrestrial_and_marine_protected_areas,tree_cover_loss,unemployment,unmet_need_for_contraception,voice_and_accountability,water_quality,country_name_vdem,democracy,educational_equality
0,AFG,Afghanistan,1960,,,,,,,,,,,,,,,,,,7.282,,,,,,,,,,,,,0.170627,,,,,32.535,,,0.761520,,357.3,,,2606.0,,,,,,,2.833029,,,,,,,,,,,,,,,,,,,,,Afghanistan,0.080,-1.123
1,AFG,Afghanistan,1961,,,57.878356,,,,,,,,,,,,,,,7.284,41.00,,,,,,,,,,,,,,,,,33.068,,,-0.076736,,351.7,,,6109.0,,,,,,,2.817674,13.477056,,,,,,,,,,,,,,,,,,,,Afghanistan,0.083,-1.123
2,AFG,Afghanistan,1962,,,57.955016,,,,,,,,,,,,,,,7.292,41.34,,,,,,,,,,,,,,,,,33.547,,,-0.665528,,345.8,,,7016.0,,,,,,,2.799055,13.751356,,,,,,,,,,,,,,,,,,,,Afghanistan,0.082,-1.123
3,AFG,Afghanistan,1963,,,58.031676,,,,,,,,,,,,,,,7.302,41.16,,,,,,,,,,,,,,,,,34.016,,,0.216942,,340.2,,,6681.0,,,,,,,2.778968,14.040239,,,,,,,,,,,,,,,,,,,,Afghanistan,0.085,-1.123
4,AFG,Afghanistan,1964,,,58.116002,,,,,,,,,,,,,,,7.304,44.60,,,,,,,,,,,,,,,,,34.494,,,0.488956,,334.8,,,7079.0,,,,,,,2.758929,14.343888,,,,,,,,,,,,,,,,,,,,Afghanistan,0.137,-0.951
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9971,ZWE,Zimbabwe,2017,29.8,43.979065,41.876696,8.340969,27.234910,,,,0.663069,,-1.298485,2235.64,1.883484,,,12.79,,3.706,106.59,45.451183,,31.346226,4.080264,,44.3,-1.282108,20.874201,,0.00,330.37,,6.0,24.400000,67.093,30.437847,60.709,,,1.297028,0.816483,56.2,5.905431,4.205873,-59918.0,0.349407,,26.944588,33.586600,22.582451,-0.710431,3.233118,38.131320,30.4,,36.3,98.545097,32.575758,83.796443,-1.583454,,82.63,,-1.396204,334.71,5.0,27.214585,,6.349,,-1.195905,76.5,Zimbabwe,0.295,1.490
9972,ZWE,Zimbabwe,2018,30.0,45.400288,41.876696,7.319375,30.761677,,,,0.735435,,-1.246001,2663.61,1.870709,,,12.82,,3.659,107.82,45.332093,,35.405385,5.009867,,,-1.297906,19.039841,,0.01,303.99,,,25.000000,67.052,32.686932,61.414,,10.0,-0.690707,0.817171,53.7,2.783017,1.305149,-59918.0,0.346090,,26.807938,33.544266,22.085555,-0.721038,3.293359,38.909614,,,38.2,97.879272,31.481481,84.079919,-1.525652,,80.43,,-1.292463,406.23,5.0,27.214585,,6.767,,-1.136798,,Zimbabwe,0.305,0.995
9973,ZWE,Zimbabwe,2019,30.2,46.682095,41.876696,9.819262,30.761677,1.03,47.647301,,0.663338,,-1.271190,2998.17,,,,13.40,,3.599,105.74,45.213002,,35.405385,-6.332446,,50.3,-1.319774,,,0.21,293.51,,,26.588274,66.938,33.812884,61.292,,,-0.301619,0.811302,52.7,4.088445,2.028044,-59918.0,0.338353,,26.683978,32.961481,20.834700,-0.943286,3.345781,39.691374,38.3,,38.9,97.476608,31.851852,84.432828,-1.486515,,81.52,,-1.303515,431.62,6.0,27.214747,,7.370,,-1.163669,,Zimbabwe,0.293,0.999
9974,ZWE,Zimbabwe,2020,30.3,52.747667,41.876696,8.772859,30.761677,,,,0.530484,,-1.287992,2640.14,,,,13.71,,3.545,110.34,45.093912,,35.405385,-7.816951,,,-1.355726,15.666611,,0.11,355.61,,,29.298565,66.259,32.436873,61.124,,,-0.204054,0.769692,51.8,4.108902,2.315889,-29955.0,0.299359,,26.573846,32.381232,,-1.052728,3.376262,40.505793,,,39.1,97.384163,31.851852,83.784351,-1.434415,,84.36,,-1.329611,480.16,,27.214747,,8.651,,-1.113408,83.3,Zimbabwe,0.293,1.674
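
As a quick sanity check on the logic, an inner merge should keep exactly the rows that an outer merge labels "both". A minimal sketch on made-up data frames (names and values are assumptions for illustration):

```python
import pandas as pd

keys = ["country_code", "year"]

# Toy stand-ins for `wb` and `vdem` (made-up values).
left = pd.DataFrame({"country_code": ["AAA", "AAA", "BBB"],
                     "year": [2000, 2001, 2000],
                     "gdp_growth": [1.5, 2.0, 3.0]})
right = pd.DataFrame({"country_code": ["AAA", "BBB", "CCC"],
                      "year": [2000, 2000, 2000],
                      "democracy": [0.4, 0.9, 0.2]})

outer = pd.merge(left, right, on=keys, how="outer", indicator="matched")
inner = pd.merge(left, right, on=keys, how="inner")

# The inner merge has as many rows as the outer merge has "both" rows.
n_both = int((outer["matched"] == "both").sum())
print(len(inner), n_both)
```

In this notebook, the same check corresponds to comparing the 9976 rows of `cardb_data` with the "both" count reported by `.value_counts()`.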


## Problem 5
Write code using `pandas` that answers the next two questions:

### Part a
Of all countries in the data, which countries have the highest and lowest average levels of democratic quality across the 1960-2022 timespan? [1 point]

I calculate the average level of democratic quality for each country by first grouping the data by country with `groupby('country_name_wb')` and then aggregating to the mean with `.agg({'democracy': 'mean'})`:

In [33]:
avg_democracy = cardb_data.groupby('country_name_wb').agg({'democracy': 'mean'}).reset_index()
avg_democracy

Unnamed: 0,country_name_wb,democracy
0,Afghanistan,0.179484
1,Albania,0.322790
2,Algeria,0.226790
3,Angola,0.133403
4,Argentina,0.587968
...,...,...
167,"Venezuela, RB",0.595565
168,Viet Nam,0.170306
169,"Yemen, Rep.",0.168806
170,Zambia,0.335403


To find the countries with the highest and lowest average democratic quality, I use `idxmax()` and `idxmin()` on `avg_democracy['democracy']`, which return the index labels of the rows with the highest and lowest average democracy scores:

In [34]:
max_avg_democracy = avg_democracy['democracy'].idxmax()
min_avg_democracy = avg_democracy['democracy'].idxmin()

Using the indices obtained, I retrieved the corresponding country names and their average democracy scores then printed the results:

In [35]:
max_country = avg_democracy.loc[max_avg_democracy, 'country_name_wb']
min_country = avg_democracy.loc[min_avg_democracy, 'country_name_wb']
max_score = avg_democracy.loc[max_avg_democracy, 'democracy']
min_score = avg_democracy.loc[min_avg_democracy, 'democracy']

print(f"Highest average democratic quality: {max_country} ({round(max_score, 3)})")
print(f"Lowest average democratic quality: {min_country} ({round(min_score, 3)})")

Highest average democratic quality: Denmark (0.91)
Lowest average democratic quality: Saudi Arabia (0.015)
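
An alternative to `idxmax()` and `idxmin()` is to sort the country means and read off both ends. This is only a sketch on fictional countries and made-up scores, not the approach used above:

```python
import pandas as pd

# Fictional countries with made-up democracy scores for two years each.
scores = pd.DataFrame({
    "country_name_wb": ["Freedonia", "Freedonia", "Sylvania", "Sylvania",
                        "Ruritania", "Ruritania"],
    "democracy": [0.8, 0.9, 0.1, 0.2, 0.5, 0.6],
})

# Average per country, sorted from lowest to highest.
avg = scores.groupby("country_name_wb")["democracy"].mean().sort_values()

lowest_name, lowest_score = avg.index[0], avg.iloc[0]
highest_name, highest_score = avg.index[-1], avg.iloc[-1]

print(f"Highest: {highest_name} ({round(highest_score, 3)})")
print(f"Lowest: {lowest_name} ({round(lowest_score, 3)})")
```

Sorting also makes it easy to inspect the runners-up with `avg.head()` and `avg.tail()`.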


### Part b
The 'educational_equality' index compiled by V-Dem measures the extent to which "high quality basic education guaranteed to all, sufficient to enable them to exercise their basic rights as adult citizens." They use a Bayesian scaling method to create a score for each country in each year that ranges roughly from -4 to 4, where low values of the scale mean that
> Provision of high quality basic education is extremely unequal and at least 75 percent (%) of children receive such low-quality education that undermines their ability to exercise their basic rights as adult citizens.

And high values mean that
> Basic education is equal in quality and less than five percent (%) of children receive such low-quality education that probably undermines their ability to exercise their basic rights as adult citizens.

Use the `pd.cut()` method to create a categorical version of 'educational_equality' with five categories, one from -4 to -2 called "extremely unequal", one from -2 to -.5 called "very unequal", one from -.5 to .5 called "somewhat unequal", one from .5 to 1.5 called "relatively equal", and one for values from 1.5 to 4 called "equal". (By default, the `pd.cut()` method sets `right=True`, which means the bins include their rightmost edges, so a value of exactly -2 will fall within the "extremely unequal" bin. Leave this default in place.)

Then aggregate the data to have one row per category of the new categorical version of "educational_equality". Collapse the following features to the mean with each category of "educational_equality":

* 'gini_index': The GINI index measures the amount of economic inequality in a country. The higher the index, the greater the economic disparity between rich and poor.
* 'poverty_headcount_ratio_at_national_poverty_lines': a measure of the proportion of the population living in poverty [1 point]
  

Breaking the values of the `educational_equality` column into five categories using `pd.cut()`:

In [36]:
cardb_data['educational_equality_cat'] = pd.cut(cardb_data.educational_equality, 
         bins=[-4, -2, -.5, .5, 1.5, 4], 
         labels=("extremely unequal", "very unequal", "somewhat unequal", "relatively equal", "equal"))
cardb_data[['educational_equality', 'educational_equality_cat']]

Unnamed: 0,educational_equality,educational_equality_cat
0,-1.123,very unequal
1,-1.123,very unequal
2,-1.123,very unequal
3,-1.123,very unequal
4,-0.951,very unequal
...,...,...
9971,1.490,relatively equal
9972,0.995,relatively equal
9973,0.999,relatively equal
9974,1.674,equal
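
To see how the default `right=True` treats values that fall exactly on a bin edge, here is a small sketch with hand-picked values (the series `values` is made up for illustration). Each bin is open on the left and closed on the right, so a value equal to the lowest edge (-4) falls outside every bin and becomes missing:

```python
import pandas as pd

bins = [-4, -2, -0.5, 0.5, 1.5, 4]
labels = ["extremely unequal", "very unequal", "somewhat unequal",
          "relatively equal", "equal"]

# Hand-picked values sitting on or near the bin edges.
values = pd.Series([-4.0, -2.0, -1.999, 0.5, 1.5, 1.501, 4.0])

# Default right=True: bins are (-4, -2], (-2, -0.5], ..., (1.5, 4].
cats = pd.cut(values, bins=bins, labels=labels)

print(pd.DataFrame({"value": values, "category": cats}))
```

If values exactly equal to -4 needed to be kept, `include_lowest=True` would close the first bin on the left as well.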


Then, to show the average GINI index and poverty headcount ratio for each category of educational equality, I group by the new column and take the means:

In [37]:
cardb_data.groupby('educational_equality_cat', observed=False).agg({
    'gini_index': 'mean',
    'poverty_headcount_ratio_at_national_poverty_lines': 'mean'
}).reset_index()


Unnamed: 0,educational_equality_cat,gini_index,poverty_headcount_ratio_at_national_poverty_lines
0,extremely unequal,38.846154,58.16
1,very unequal,45.926484,38.636058
2,somewhat unequal,43.200442,24.149123
3,relatively equal,37.148861,22.548536
4,equal,32.652901,17.207444
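
Two details of this aggregation are worth illustrating on a toy example (the data frame `toy` and its values are made up): the mean skips missing values within each category, and grouping on a categorical column with `observed=False` keeps categories that contain no rows.

```python
import pandas as pd

bins = [-4, -2, -0.5, 0.5, 1.5, 4]
labels = ["extremely unequal", "very unequal", "somewhat unequal",
          "relatively equal", "equal"]

# Toy data with one missing GINI value (made-up numbers).
toy = pd.DataFrame({
    "educational_equality": [-1.0, -0.8, 0.0, 2.0],
    "gini_index": [50.0, None, 40.0, 30.0],
})
toy["educational_equality_cat"] = pd.cut(
    toy["educational_equality"], bins=bins, labels=labels
)

# observed=False keeps all five categories, including empty ones;
# the mean ignores the missing GINI value in "very unequal".
means = toy.groupby("educational_equality_cat", observed=False)["gini_index"].mean()
print(means)
```

With `observed=True`, only the three categories that actually appear in `toy` would be returned.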
