# Data Cleaning
## 01 Import Libraries
## 02 Import Economy and Growth csv
## 03 Extract Relevant Rows for Study
### a) Extract required variables and copy the rows into a new dataframe
### b) Missing data and initial consistency checks
### c) Initial aggregations
### d) Interpolate Data
### e) Tidy Data Principles

## 01 Import Libraries

In [1]:
import os
import numpy as np
import pandas as pd

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## 02 Import Economy and Growth csv

In [3]:
# import dataframe

pathData = r'C:\Users\Michael\Desktop\Career Foundry\02 Data Immersion Course\06 Advanced Analytics and Dashboard Design\00 Data'
WB_poverty = pd.read_csv(os.path.join(pathData, 'World Bank Poverty', 'poverty.csv'))

## 03 Extract Relevant Rows for Study
### a) Extract required variables and copy the rows into a new dataframe

The file contains a column called "Indicator Code", with one row per indicator for each country. There are 254 indicators, but only the following five will be used for the study, as it is currently understood:

| INDICATOR_CODE | INDICATOR_NAME                                                       |
|:---------------|:---------------------------------------------------------------------|
| SI.POV.NAHC    | Poverty headcount ratio at national poverty lines (% of population)  |
| SI.POV.UMIC    | Poverty headcount ratio at \$6.85 a day (2017 PPP) (% of population) |
| SI.POV.DDAY    | Poverty headcount ratio at \$2.15 a day (2017 PPP) (% of population) |
| SI.POV.LMIC    | Poverty headcount ratio at \$3.65 a day (2017 PPP) (% of population) |
| SI.POV.GINI    | Gini index                                                           |


The process to extract these will be as follows:
* All indicators other than these are to be dropped.
* The remaining rows will be saved as a new df (WB_poverty_req)
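The two steps above can be sketched on a toy frame. This is a minimal illustration, not the notebook's own cell: the frame `toy`, its values, and the placeholder code `OTHER.CODE` are made up, and the final set difference is an extra check (an assumption about what is worth verifying) that every requested code was actually present.

```python
import pandas as pd

# Toy stand-in for the World Bank file (values are made up for illustration).
toy = pd.DataFrame({
    "Country Code": ["ABW", "ABW", "VNM", "VNM"],
    "Indicator Code": ["SI.POV.GINI", "OTHER.CODE", "SI.POV.GINI", "SI.POV.DDAY"],
    "2000": [None, 1.0, 36.4, 27.7],
})

wanted = ["SI.POV.GINI", "SI.POV.DDAY"]

# Keep only the wanted indicators; .copy() makes the result an independent frame.
toy_req = toy[toy["Indicator Code"].isin(wanted)].copy()

# Confirm that every requested code was actually found in the data.
missing_codes = set(wanted) - set(toy_req["Indicator Code"])
print(len(toy_req), missing_codes)
```

An empty `missing_codes` set means no requested indicator was silently absent, for example because of a typo in the list.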

In [4]:
# make list of the required Indicator Codes
poverty_indicators = [ 'SI.POV.NAHC',
                    'SI.POV.UMIC',
                    'SI.POV.DDAY',
                    'SI.POV.LMIC',
                    'SI.POV.GINI'
                    ]

In [5]:
# if the Indicator Code is not in the poverty_indicators list then the row is to be dropped
# the new df is to be saved as WB_poverty_req

WB_poverty_req = WB_poverty[WB_poverty['Indicator Code'].isin(poverty_indicators)]

In [6]:
#Check that the correct values have been selected
WB_poverty_req.head(100)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
5,Aruba,ABW,Poverty headcount ratio at $6.85 a day (2017 P...,SI.POV.UMIC,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6,Aruba,ABW,Poverty headcount ratio at national poverty li...,SI.POV.NAHC,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16,Aruba,ABW,Poverty headcount ratio at $3.65 a day (2017 P...,SI.POV.LMIC,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
17,Aruba,ABW,Gini index,SI.POV.GINI,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
19,Aruba,ABW,Poverty headcount ratio at $2.15 a day (2017 P...,SI.POV.DDAY,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
34,Africa Eastern and Southern,AFE,Poverty headcount ratio at $6.85 a day (2017 P...,SI.POV.UMIC,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
35,Africa Eastern and Southern,AFE,Poverty headcount ratio at national poverty li...,SI.POV.NAHC,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
45,Africa Eastern and Southern,AFE,Poverty headcount ratio at $3.65 a day (2017 P...,SI.POV.LMIC,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
46,Africa Eastern and Southern,AFE,Gini index,SI.POV.GINI,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
48,Africa Eastern and Southern,AFE,Poverty headcount ratio at $2.15 a day (2017 P...,SI.POV.DDAY,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [7]:
WB_poverty_req.shape

(1330, 67)

The economic data is only available from 1996 to 2021. The poverty data will first be trimmed to fit, as there is no need to run tests on data that will not be used.

In [8]:
for n in range(1960, 1996):
    WB_poverty_req.drop(columns=[str(n)], inplace=True)

WB_poverty_req.drop(columns=['2022'], inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  WB_poverty_req.drop(columns=[str(n)], inplace=True)
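The warning above is raised because the frame being modified was created by filtering another frame. A minimal sketch of one way to avoid it, assuming a toy frame with placeholder values: take an explicit `.copy()` after filtering and drop all unwanted year columns in a single, non-inplace call.

```python
import pandas as pd

# Toy wide frame with year columns 1994-2022 (values are placeholders).
cols = [str(y) for y in range(1994, 2023)]
toy = pd.DataFrame([[1.0] * len(cols), [2.0] * len(cols)], columns=cols)
toy.insert(0, "Indicator Code", ["KEEP", "DROP"])

# .copy() detaches the filtered rows from the original frame.
toy_req = toy[toy["Indicator Code"] == "KEEP"].copy()

# Drop every unwanted year in one call instead of looping with inplace=True.
unwanted = [str(y) for y in range(1960, 1996) if str(y) in toy_req.columns] + ["2022"]
toy_req = toy_req.drop(columns=unwanted)
print(list(toy_req.columns)[:3], toy_req.shape)
```

The membership test on `toy_req.columns` is only needed here because the toy frame starts in 1994 rather than 1960.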

### b) Missing data and initial consistency checks

In [9]:
# Change display to 2dp floats
pd.options.display.float_format = '{:.2f}'.format

In [10]:
# check for duplicates
WB_poverty_req[WB_poverty_req.duplicated()]

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021


No duplicates found
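A full-row check only catches rows that are identical in every column. As a hedged complement (not part of the original checks), the sketch below also looks for repeated country and indicator pairs on a made-up frame, since two rows for the same pair with different values would pass the full-row check.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country Code": ["ABW", "ABW", "VNM", "VNM"],
    "Indicator Code": ["SI.POV.GINI", "SI.POV.DDAY", "SI.POV.GINI", "SI.POV.GINI"],
    "2000": [None, None, 36.4, 36.0],
})

# Full-row check: the two VNM rows differ in "2000", so nothing is flagged.
full_dupes = toy[toy.duplicated()]

# Key check: the same country/indicator pair appears twice for VNM.
key_dupes = toy[toy.duplicated(subset=["Country Code", "Indicator Code"], keep=False)]
print(len(full_dupes), len(key_dupes))
```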

### c) Perform aggregations

In [11]:
# making a list of all the years to use later
years = [str(n) for n in range(1996, 2022)]

print(years)

['1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021']


In [12]:
# find the max value for each type, using only the year columns
WB_poverty_req.groupby('Indicator Name')[years].max()

Indicator Name,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
Gini index,59.9,65.8,59.6,59.0,61.6,58.4,64.7,63.3,58.1,64.8,57.5,55.8,63.0,61.0,63.4,53.5,53.4,52.7,63.0,59.1,54.6,53.3,53.9,53.5,53.5,52.9
Poverty headcount ratio at $2.15 a day (2017 PPP) (% of population),82.7,69.4,79.9,68.7,84.0,71.0,80.6,81.5,91.5,80.5,71.8,78.9,70.8,52.6,80.2,60.6,80.7,65.1,64.6,60.8,67.3,45.3,50.9,74.4,35.0,65.7
Poverty headcount ratio at $3.65 a day (2017 PPP) (% of population),91.8,89.0,94.2,90.4,95.5,86.2,94.2,94.6,97.6,92.8,90.5,93.0,89.6,80.0,92.4,88.3,92.4,86.7,83.1,78.1,87.3,72.7,81.2,89.1,68.6,85.8
Poverty headcount ratio at $6.85 a day (2017 PPP) (% of population),97.2,98.0,98.5,99.2,99.4,96.0,99.1,99.1,99.7,98.1,98.2,97.9,97.0,95.9,98.1,97.7,98.2,96.5,96.7,94.2,96.8,92.4,95.0,97.3,91.0,96.2
Poverty headcount ratio at national poverty lines (% of population),,,,,69.0,70.8,64.7,54.0,69.3,73.2,76.8,50.4,62.1,63.0,71.7,62.4,70.7,64.9,59.3,54.4,82.3,55.5,56.8,50.7,53.4,68.8


All max values are consistent with expectation

In [13]:
# find the min value for each type, using only the year columns
WB_poverty_req.groupby('Indicator Name')[years].min()

Indicator Name,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
Gini index,25.8,26.8,28.1,28.1,23.8,28.7,25.3,25.3,24.8,24.6,24.4,24.4,23.7,24.8,24.8,24.6,24.7,24.6,24.0,25.4,24.8,23.2,24.6,23.2,24.0,25.7
Poverty headcount ratio at $2.15 a day (2017 PPP) (% of population),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Poverty headcount ratio at $3.65 a day (2017 PPP) (% of population),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3
Poverty headcount ratio at $6.85 a day (2017 PPP) (% of population),0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.1,0.1,0.0,0.0,0.0,0.1,0.1,0.1,0.0,0.0,0.1,0.1,0.1,0.0,0.1,0.1,0.0,1.0
Poverty headcount ratio at national poverty lines (% of population),,,,,15.3,16.9,10.8,10.0,9.7,9.6,8.9,7.7,6.1,5.4,5.2,5.5,6.0,4.8,4.8,5.1,4.5,3.1,1.7,0.6,0.0,5.2


All min values are consistent with expectation

Find the mean values of each year

In [14]:
WB_poverty_req.groupby('Indicator Name')[years].mean()

Indicator Name,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
Gini index,41.45,43.05,42.78,42.63,40.6,42.25,40.99,39.24,37.85,39.0,37.76,36.48,37.52,37.54,36.44,36.19,36.31,36.46,36.55,36.72,36.18,35.51,35.71,35.21,35.03,40.51
Poverty headcount ratio at $2.15 a day (2017 PPP) (% of population),22.64,18.81,23.86,21.72,18.42,20.09,19.66,16.54,12.15,13.69,9.44,10.14,7.93,9.15,9.78,8.78,7.69,5.62,6.77,7.08,6.94,4.7,6.67,5.35,2.71,7.29
Poverty headcount ratio at $3.65 a day (2017 PPP) (% of population),38.27,32.17,39.68,37.12,32.05,36.86,36.16,30.65,24.29,26.09,18.85,20.62,16.66,19.21,20.01,18.14,16.83,12.97,15.6,16.42,14.73,11.7,16.15,11.73,8.24,18.12
Poverty headcount ratio at $6.85 a day (2017 PPP) (% of population),56.51,48.44,58.17,55.31,48.52,56.41,57.08,48.71,41.48,43.32,34.8,35.51,31.41,35.12,35.64,32.7,33.09,26.68,30.23,31.99,28.12,25.59,30.87,25.21,20.15,39.17
Poverty headcount ratio at national poverty lines (% of population),,,,,36.42,42.67,34.46,25.63,25.58,28.44,25.73,23.69,23.09,23.07,25.47,22.62,21.04,21.7,24.51,22.22,23.94,20.26,23.23,19.57,19.81,20.6


The mean values are all consistent with the expectations of the variables (e.g. percentages are in the range 0-100)
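The visual checks above can be backed by a programmatic one. The sketch below is an illustration on made-up values (the indicator name `Headcount ratio` is a placeholder): it aggregates the year columns per indicator and then tests whether any value falls outside 0-100, the range expected for both percentages and the Gini index.

```python
import pandas as pd

years = ["2000", "2001"]
toy = pd.DataFrame({
    "Indicator Name": ["Gini index", "Gini index", "Headcount ratio", "Headcount ratio"],
    "2000": [36.4, None, 27.7, 0.0],
    "2001": [36.7, 41.0, 28.8, 100.0],
})

# Select the year columns explicitly, then take min and max per indicator.
summary = toy.groupby("Indicator Name")[years].agg(["min", "max"])

# Comparisons with NaN are False, so missing values are never flagged here.
out_of_range = bool(((toy[years] < 0) | (toy[years] > 100)).any().any())
print(summary)
print(out_of_range)
```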

### d) Interpolate Data

There are many missing values. Linear interpolation can be used to estimate them.
Before interpolating, a flag will be made to mark each row that has had data interpolated.
Interpolation is only intended for gaps of no more than 5 consecutive missing years (about 20% of the total timeframe).
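One way the flag and the 5-year rule described above could be implemented is sketched below. It is an assumption about the intended behaviour, not the notebook's own code: the helper `interpolate_short_gaps`, the constant `MAX_GAP`, and the toy values are hypothetical. A helper is used because the `limit` argument of `interpolate` caps how many consecutive missing values are filled but still partially fills longer gaps.

```python
import numpy as np
import pandas as pd

MAX_GAP = 5  # longest run of missing years that will be filled

def interpolate_short_gaps(row: pd.Series, max_gap: int = MAX_GAP) -> pd.Series:
    """Linearly interpolate a row, leaving runs of NaN longer than max_gap untouched."""
    is_na = row.isna()
    # Label each run of consecutive equal values in is_na, then measure its length.
    run_id = (is_na != is_na.shift()).cumsum()
    run_len = is_na.groupby(run_id).transform("size")
    filled = row.interpolate(method="linear", limit_direction="both")
    # Restore NaN wherever the original gap was too long.
    return filled.mask(is_na & (run_len > max_gap))

years = [str(y) for y in range(1996, 2006)]
toy = pd.DataFrame(
    [[10.0, np.nan, np.nan, 16.0] + [np.nan] * 6,  # a 2-year gap, then a 6-year gap
     [np.nan] * 10],                               # no data at all
    columns=years,
)

result = toy.apply(interpolate_short_gaps, axis=1)
# Flag rows in which at least one value was filled in.
result["interpolated"] = (toy.isna() & result[years].notna()).any(axis=1)
print(result)
```

In the first toy row the 2-year gap is filled and the 6-year gap stays missing; the second row has no data, so nothing is filled and its flag is False.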

In [15]:
# make a dataframe with only the numerical information
WB_poverty_req_interpolated = WB_poverty_req[WB_poverty_req.columns.intersection(years)]
WB_poverty_req_interpolated.head()

Unnamed: 0,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
5,,,,,,,,,,,,,,,,,,,,,,,,,,
6,,,,,,,,,,,,,,,,,,,,,,,,,,
16,,,,,,,,,,,,,,,,,,,,,,,,,,
17,,,,,,,,,,,,,,,,,,,,,,,,,,
19,,,,,,,,,,,,,,,,,,,,,,,,,,


In [16]:
#interpolate the figures in the numerical dataframe
WB_poverty_req_interpolated = WB_poverty_req_interpolated.interpolate(method='linear',
                                                            axis=1,
                                                            inplace=False,
                                                            limit_direction = 'both',
                                                            limit_area=None)

In [17]:
# check the interpolated values
WB_poverty_req_interpolated

Unnamed: 0,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
5,,,,,,,,,,,,,,,,,,,,,,,,,,
6,,,,,,,,,,,,,,,,,,,,,,,,,,
16,,,,,,,,,,,,,,,,,,,,,,,,,,
17,,,,,,,,,,,,,,,,,,,,,,,,,,
19,,,,,,,,,,,,,,,,,,,,,,,,,,
34,,,,,,,,,,,,,,,,,,,,,,,,,,
35,,,,,,,,,,,,,,,,,,,,,,,,,,
45,,,,,,,,,,,,,,,,,,,,,,,,,,
46,,,,,,,,,,,,,,,,,,,,,,,,,,
48,,,,,,,,,,,,,,,,,,,,,,,,,,


In [18]:
# Extract the row headers so that they can be re-attached after interpolation
WB_poverty_req_headers = WB_poverty_req[
                                    WB_poverty_req.columns.intersection([
                                                'Country Name',
                                                'Country Code',
                                                'Indicator Name',
                                                'Indicator Code'])]
WB_poverty_req_headers.head(50)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code
5,Aruba,ABW,Poverty headcount ratio at $6.85 a day (2017 P...,SI.POV.UMIC
6,Aruba,ABW,Poverty headcount ratio at national poverty li...,SI.POV.NAHC
16,Aruba,ABW,Poverty headcount ratio at $3.65 a day (2017 P...,SI.POV.LMIC
17,Aruba,ABW,Gini index,SI.POV.GINI
19,Aruba,ABW,Poverty headcount ratio at $2.15 a day (2017 P...,SI.POV.DDAY
34,Africa Eastern and Southern,AFE,Poverty headcount ratio at $6.85 a day (2017 P...,SI.POV.UMIC
35,Africa Eastern and Southern,AFE,Poverty headcount ratio at national poverty li...,SI.POV.NAHC
45,Africa Eastern and Southern,AFE,Poverty headcount ratio at $3.65 a day (2017 P...,SI.POV.LMIC
46,Africa Eastern and Southern,AFE,Gini index,SI.POV.GINI
48,Africa Eastern and Southern,AFE,Poverty headcount ratio at $2.15 a day (2017 P...,SI.POV.DDAY


In [19]:
# merge the index names with the interpolated years data
WB_poverty_req_corrected = pd.merge(WB_poverty_req_headers, WB_poverty_req_interpolated, left_index=True, right_index=True)
WB_poverty_req_corrected.tail(50)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
7429,Virgin Islands (U.S.),VIR,Poverty headcount ratio at $6.85 a day (2017 P...,SI.POV.UMIC,,,,,,,,,,,,,,,,,,,,,,,,,,
7430,Virgin Islands (U.S.),VIR,Poverty headcount ratio at national poverty li...,SI.POV.NAHC,,,,,,,,,,,,,,,,,,,,,,,,,,
7440,Virgin Islands (U.S.),VIR,Poverty headcount ratio at $3.65 a day (2017 P...,SI.POV.LMIC,,,,,,,,,,,,,,,,,,,,,,,,,,
7441,Virgin Islands (U.S.),VIR,Gini index,SI.POV.GINI,,,,,,,,,,,,,,,,,,,,,,,,,,
7443,Virgin Islands (U.S.),VIR,Poverty headcount ratio at $2.15 a day (2017 P...,SI.POV.DDAY,,,,,,,,,,,,,,,,,,,,,,,,,,
7458,Viet Nam,VNM,Poverty headcount ratio at $6.85 a day (2017 P...,SI.POV.UMIC,89.0,89.0,88.98,88.96,88.94,88.92,88.9,86.55,84.2,82.0,79.8,78.75,77.7,62.2,46.7,43.7,40.7,37.8,34.9,31.0,27.1,24.65,22.2,20.45,18.7,18.7
7459,Viet Nam,VNM,Poverty headcount ratio at national poverty li...,SI.POV.NAHC,9.2,9.2,9.2,9.2,9.2,9.2,9.2,9.2,9.2,9.2,9.2,9.2,9.2,9.2,9.2,9.2,9.2,9.2,9.2,9.2,9.2,7.9,6.8,5.7,4.8,4.8
7469,Viet Nam,VNM,Poverty headcount ratio at $3.65 a day (2017 P...,SI.POV.LMIC,63.4,63.4,63.84,64.28,64.72,65.16,65.6,59.85,54.1,49.35,44.6,42.15,39.7,26.85,14.0,12.1,10.2,9.5,8.8,7.6,6.4,5.85,5.3,4.55,3.8,3.8
7470,Viet Nam,VNM,Gini index,SI.POV.GINI,35.4,35.4,35.72,36.04,36.36,36.68,37.0,36.9,36.8,36.3,35.8,35.7,35.6,37.45,39.3,37.45,35.6,35.2,34.8,35.05,35.3,35.5,35.7,36.25,36.8,36.8
7472,Viet Nam,VNM,Poverty headcount ratio at $2.15 a day (2017 P...,SI.POV.DDAY,24.4,24.4,25.5,26.6,27.7,28.8,29.9,25.0,20.1,17.5,14.9,13.0,11.1,7.0,2.9,2.3,1.7,1.8,1.9,1.6,1.3,1.25,1.2,0.95,0.7,0.7


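The poverty dataframe has been interpolated and the row headers re-attached. The call in In [16] sets no limit, so gaps longer than 5 years were also filled: the Viet Nam national poverty lines row, for example, holds 9.2 for every year from 1996 to 2016.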
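The split, interpolate and merge sequence above can also be written as a single assignment on the year columns. The following is a minimal sketch on made-up values, offered as an alternative rather than as the notebook's method; the identifying columns never leave the frame, so no merge on the index is needed.

```python
import numpy as np
import pandas as pd

years = ["1996", "1997", "1998"]
toy = pd.DataFrame({
    "Country Code": ["VNM", "ABW"],
    "Indicator Code": ["SI.POV.GINI", "SI.POV.GINI"],
    "1996": [35.4, np.nan],
    "1997": [np.nan, np.nan],
    "1998": [36.0, np.nan],
})

# Interpolate across the year columns only and write the result back.
toy_filled = toy.copy()
toy_filled[years] = toy[years].interpolate(method="linear", axis=1, limit_direction="both")
print(toy_filled)
```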

### e) Tidy Data Principles

MELT: Each year is an observation and should be in a column called "Year".

PIVOT: The five indicators (currently in 'Indicator Name') are separate variables, and should be five columns, with the numerical values being the observations in these columns.

Country Name and Country Code are already correct as column headers.

The dataframe should be presented in the following way:

(All values are percentages of the population, except the Gini index)

| Country | Year | Indicator 1 | Indicator 2 | Indicator 3 | Indicator 4 | Indicator 5 |
|:--------|:-----|:------------|:------------|:------------|:------------|:------------|
| Aruba   | 1996 | XXX         | XXX         | XXX         | XXX         | XXX         |
| Aruba   | 1997 | XXX         | XXX         | XXX         | XXX         | XXX         |
| Aruba   | 1998 | XXX         | XXX         | XXX         | XXX         | XXX         |
| Aruba   | 1999 | XXX         | XXX         | XXX         | XXX         | XXX         |

The indicators are:

"Poverty headcount ratio at national poverty lines (% of population)"

"Poverty headcount ratio at $6.85 a day (2017 PPP) (% of population)"

"Poverty headcount ratio at $2.15 a day (2017 PPP) (% of population)"

"Poverty headcount ratio at $3.65 a day (2017 PPP) (% of population)"

"Gini index"

This will also make merging with other dataframes easier, and give greater versatility to the visualisations made in the future.
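The melt and pivot steps described above can be sketched on a two-row toy frame. The values and the shortened indicator name `Headcount ratio` are placeholders, and resetting `columns.name` is an optional tidy-up that is assumed here, not taken from the notebook.

```python
import pandas as pd

wide = pd.DataFrame({
    "Country Name": ["Viet Nam", "Viet Nam"],
    "Indicator Name": ["Gini index", "Headcount ratio"],
    "1996": [35.4, 24.4],
    "1997": [35.4, 24.4],
})

# MELT: one row per country, indicator and year.
long = wide.melt(id_vars=["Country Name", "Indicator Name"],
                 var_name="Year", value_name="value")

# PIVOT: one column per indicator, one row per country and year.
tidy = long.pivot_table(index=["Country Name", "Year"],
                        columns="Indicator Name", values="value").reset_index()
tidy.columns.name = None  # drop the leftover "Indicator Name" label on the column axis
print(tidy)
```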


In [20]:
# First use a MELT to turn the years into a variable.
# The recorded value of the indicator will temporarily be called "value":
WB_poverty_req_corrected_melt = WB_poverty_req_corrected.melt(['Country Name','Country Code','Indicator Name','Indicator Code'], var_name = "Year", value_name = "value")

In [21]:
WB_poverty_req_corrected_melt.head(50)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,Year,value
0,Aruba,ABW,Poverty headcount ratio at $6.85 a day (2017 P...,SI.POV.UMIC,1996,
1,Aruba,ABW,Poverty headcount ratio at national poverty li...,SI.POV.NAHC,1996,
2,Aruba,ABW,Poverty headcount ratio at $3.65 a day (2017 P...,SI.POV.LMIC,1996,
3,Aruba,ABW,Gini index,SI.POV.GINI,1996,
4,Aruba,ABW,Poverty headcount ratio at $2.15 a day (2017 P...,SI.POV.DDAY,1996,
5,Africa Eastern and Southern,AFE,Poverty headcount ratio at $6.85 a day (2017 P...,SI.POV.UMIC,1996,
6,Africa Eastern and Southern,AFE,Poverty headcount ratio at national poverty li...,SI.POV.NAHC,1996,
7,Africa Eastern and Southern,AFE,Poverty headcount ratio at $3.65 a day (2017 P...,SI.POV.LMIC,1996,
8,Africa Eastern and Southern,AFE,Gini index,SI.POV.GINI,1996,
9,Africa Eastern and Southern,AFE,Poverty headcount ratio at $2.15 a day (2017 P...,SI.POV.DDAY,1996,


The dataframe has had the years correctly converted to a variable, and each year is an observation.

In [22]:
# Second use PIVOT to convert the Indicators into separate variables, and not observations
WB_poverty_req_corrected_melt_pivot = WB_poverty_req_corrected_melt.pivot_table(
                                            index = ['Country Name', 'Year'], 
                                            columns = 'Indicator Name', 
                                            values = 'value').reset_index()

WB_poverty_req_corrected_melt_pivot.head(500)

Indicator Name,Country Name,Year,Gini index,Poverty headcount ratio at $2.15 a day (2017 PPP) (% of population),Poverty headcount ratio at $3.65 a day (2017 PPP) (% of population),Poverty headcount ratio at $6.85 a day (2017 PPP) (% of population),Poverty headcount ratio at national poverty lines (% of population)
0,Afghanistan,1996,,,,,33.7
1,Afghanistan,1997,,,,,33.7
2,Afghanistan,1998,,,,,33.7
3,Afghanistan,1999,,,,,33.7
4,Afghanistan,2000,,,,,33.7
5,Afghanistan,2001,,,,,33.7
6,Afghanistan,2002,,,,,33.7
7,Afghanistan,2003,,,,,33.7
8,Afghanistan,2004,,,,,33.7
9,Afghanistan,2005,,,,,33.7


The pivoting has correctly converted the data to the standardised Tidy Data norms.
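Two side effects of the pivot are worth keeping in mind: `pivot_table` aggregates (mean by default) and leaves out country and year combinations that have no data at all, and 'Country Code' is lost because it is not part of the index. The sketch below, on made-up values, contrasts this with `DataFrame.pivot`, which only reshapes. It is an illustration of the difference, not a recommendation drawn from the notebook.

```python
import numpy as np
import pandas as pd

long = pd.DataFrame({
    "Country Name": ["Aruba", "Aruba", "Viet Nam", "Viet Nam"],
    "Country Code": ["ABW", "ABW", "VNM", "VNM"],
    "Indicator Name": ["Gini index", "Headcount ratio"] * 2,
    "Year": ["1996"] * 4,
    "value": [np.nan, np.nan, 35.4, 24.4],
})

# pivot_table aggregates and discards groups with no data.
with_table = long.pivot_table(index=["Country Name", "Year"],
                              columns="Indicator Name", values="value").reset_index()

# pivot only reshapes: all-NaN rows survive and Country Code can stay in the index.
with_pivot = long.pivot(index=["Country Name", "Country Code", "Year"],
                        columns="Indicator Name", values="value").reset_index()

print(len(with_table), len(with_pivot))
```

Note that `pivot` raises an error if an index and column combination appears more than once, so it also acts as a uniqueness check.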

In [23]:
# pivot_table() followed by reset_index() already returns a DataFrame,
# so wrapping it in pd.DataFrame() leaves it unchanged.
WB_poverty_req_corrected_melt_pivot = pd.DataFrame(WB_poverty_req_corrected_melt_pivot)

In [24]:
WB_poverty_req_corrected_melt_pivot.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4784 entries, 0 to 4783
Data columns (total 7 columns):
 #   Column                                                               Non-Null Count  Dtype  
---  ------                                                               --------------  -----  
 0   Country Name                                                         4784 non-null   object 
 1   Year                                                                 4784 non-null   object 
 2   Gini index                                                           4316 non-null   float64
 3   Poverty headcount ratio at $2.15 a day (2017 PPP) (% of population)  4680 non-null   float64
 4   Poverty headcount ratio at $3.65 a day (2017 PPP) (% of population)  4680 non-null   float64
 5   Poverty headcount ratio at $6.85 a day (2017 PPP) (% of population)  4680 non-null   float64
 6   Poverty headcount ratio at national poverty lines (% of population)  4030 non-null   float64
dtypes: floa

## 04 Save data to a cleaned data folder

In [25]:
WB_poverty_req_corrected_melt_pivot.to_csv(os.path.join(pathData, 'World Bank Cleaned', 'poverty_clean.csv'), index=False)
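`to_csv` does not create missing folders, so the save fails if the target folder does not exist yet. A minimal sketch of a guarded save followed by a read-back check is given below; a temporary directory stands in for `pathData`, and the one-row frame is made up.

```python
import os
import tempfile

import pandas as pd

tidy = pd.DataFrame({"Country Name": ["Viet Nam"], "Year": ["1996"], "Gini index": [35.4]})

# A temporary directory stands in for pathData in this sketch.
with tempfile.TemporaryDirectory() as base:
    out_dir = os.path.join(base, "World Bank Cleaned")
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    out_path = os.path.join(out_dir, "poverty_clean.csv")
    tidy.to_csv(out_path, index=False)

    # Read the file back; Year is read as text so it matches the saved frame.
    reloaded = pd.read_csv(out_path, dtype={"Year": str})

print(reloaded)
```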