# Recap of data science problem and objective

HDI 
Education
Poverty
There are some fundamental questions to resolve in this notebook:
Do you think you may have the data you need to tackle the desired question?
Have you identified the required target value?
Do you have potentially useful features?
Do you have any fundamental issues with the data?

Imports

In [53]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import requests
from pandas_profiling import ProfileReport

# Loading and cleaning data

### Loading and cleaning Human Development Index (HDI) dataframe

In [55]:
hdi = pd.read_csv('Human Development Index (HDI) (1).csv', sep = ",", skiprows = 5)
hdi_df = pd.DataFrame(hdi)
hdi_df.head()

Unnamed: 0,HDI Rank,Country,1990,Unnamed: 3,1991,Unnamed: 5,1992,Unnamed: 7,1993,Unnamed: 9,...,2015,Unnamed: 53,2016,Unnamed: 55,2017,Unnamed: 57,2018,Unnamed: 59,2019,Unnamed: 61
0,169,Afghanistan,0.302,,0.307,,0.316,,0.312,,...,0.5,,0.502,,0.506,,0.509,,0.511,
1,69,Albania,0.650,,0.631,,0.615,,0.618,,...,0.788,,0.788,,0.79,,0.792,,0.795,
2,91,Algeria,0.572,,0.576,,0.582,,0.586,,...,0.74,,0.743,,0.745,,0.746,,0.748,
3,36,Andorra,..,,..,,..,,..,,...,0.862,,0.866,,0.863,,0.867,,0.868,
4,148,Angola,..,,..,,..,,..,,...,0.572,,0.578,,0.582,,0.582,,0.581,


A first glance at the data shows that missing values are written as ".." when they should have been encoded as NaN. This needs to be cleaned before processing the data.

In [60]:
# Converting the ".." placeholders into NaN and keeping the result.
hdi_df = hdi_df.replace("..", np.nan)
hdi_df

Unnamed: 0,HDI Rank,Country,1990,1991,1992,1993,1994,1995,1996,1997,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,169,Afghanistan,0.302,0.307,0.316,0.312,0.307,0.331,0.335,0.339,...,0.472,0.477,0.489,0.496,0.500,0.500,0.502,0.506,0.509,0.511
1,69,Albania,0.650,0.631,0.615,0.618,0.624,0.637,0.646,0.645,...,0.745,0.764,0.775,0.782,0.787,0.788,0.788,0.790,0.792,0.795
2,91,Algeria,0.572,0.576,0.582,0.586,0.590,0.595,0.602,0.611,...,0.721,0.728,0.728,0.729,0.736,0.740,0.743,0.745,0.746,0.748
3,36,Andorra,,,,,,,,,...,0.837,0.836,0.858,0.856,0.863,0.862,0.866,0.863,0.867,0.868
4,148,Angola,,,,,,,,,...,0.517,0.533,0.544,0.555,0.565,0.572,0.578,0.582,0.582,0.581
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
202,,Least Developed Countries,0.350,0.353,0.354,0.358,0.358,0.366,0.374,0.381,...,0.485,0.493,0.499,0.504,0.510,0.516,0.520,0.525,0.528,0.538
203,,Small Island Developing States,0.595,0.598,0.603,0.608,0.612,0.618,0.624,0.629,...,0.702,0.706,0.704,0.708,0.712,0.717,0.719,0.722,0.723,0.728
204,,Organization for Economic Co-operation and Dev...,0.785,0.790,0.788,0.800,0.807,0.812,0.817,0.817,...,0.873,0.877,0.879,0.883,0.886,0.889,0.892,0.894,0.895,0.900
205,,World,0.598,0.601,0.601,0.608,0.611,0.617,0.622,0.624,...,0.697,0.703,0.708,0.713,0.718,0.722,0.727,0.729,0.731,0.737
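As an alternative to replacing ".." after loading, the placeholder can be declared while reading the file. The sketch below uses a made-up two-row sample rather than the real file, and assumes ".." is the only missing-value marker in the data.

```python
import io
import pandas as pd

# Hypothetical two-row sample mimicking the layout of the HDI file.
sample_csv = io.StringIO(
    "HDI Rank,Country,1990,2019\n"
    "169,Afghanistan,0.302,0.511\n"
    "36,Andorra,..,0.868\n"
)

# na_values tells read_csv to parse ".." as NaN while loading.
sample_df = pd.read_csv(sample_csv, na_values="..")

print(sample_df["1990"].isna().sum())  # number of missing 1990 values
print(sample_df["1990"].dtype)         # dtype of the column after parsing
```

With this option the year columns are parsed as numbers directly, so a later conversion step may not be needed.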


In [62]:
# A first glance at the data shows several "Unnamed" columns without any data, so we drop them directly.
# The ".." placeholders marking missing data are replaced with NaN in a separate step.
hdi_df = hdi_df.loc[:, ~hdi_df.columns.str.startswith('Unnamed:')]
hdi_df.head()

Unnamed: 0,HDI Rank,Country,1990,1991,1992,1993,1994,1995,1996,1997,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,169,Afghanistan,0.302,0.307,0.316,0.312,0.307,0.331,0.335,0.339,...,0.472,0.477,0.489,0.496,0.5,0.5,0.502,0.506,0.509,0.511
1,69,Albania,0.65,0.631,0.615,0.618,0.624,0.637,0.646,0.645,...,0.745,0.764,0.775,0.782,0.787,0.788,0.788,0.79,0.792,0.795
2,91,Algeria,0.572,0.576,0.582,0.586,0.59,0.595,0.602,0.611,...,0.721,0.728,0.728,0.729,0.736,0.74,0.743,0.745,0.746,0.748
3,36,Andorra,,,,,,,,,...,0.837,0.836,0.858,0.856,0.863,0.862,0.866,0.863,0.867,0.868
4,148,Angola,,,,,,,,,...,0.517,0.533,0.544,0.555,0.565,0.572,0.578,0.582,0.582,0.581


In [63]:
# Now that the data are imported, we use pandas profiling to surface the possible problems with the data.
report = ProfileReport(hdi_df, title='Pandas Profiling Report', explorative=True)
report




We have annual HDI data from 1990 to 2019 (30 yearly columns), which is fairly consistent.
A first glance at the data profile shows a few issues:
There are 195 countries in the world, but the Country column has 206 distinct entries. The tail of the dataframe shows aggregations by region and other rows we might not need.
There are 612 missing cells, representing 9.2% of the data, and they are more frequent in the earlier years.
All HDI values are stored as categorical (text) values instead of floats.
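The issues listed above can also be checked without the profiling report. The following is a minimal sketch on a made-up three-row frame (not the real data); the column names mirror the HDI file, and `toy`, `year_cols` and `non_numeric` are illustrative names only.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the HDI frame after ".." has become NaN.
toy = pd.DataFrame({
    "Country": ["Afghanistan", "Andorra", "Angola"],
    "1990": ["0.302", np.nan, np.nan],
    "2019": ["0.511", "0.868", "0.581"],
})

year_cols = [c for c in toy.columns if c.isdigit()]

# Missing cells per year column, and the overall share of missing cells.
missing_per_year = toy[year_cols].isna().sum()
missing_share = toy[year_cols].isna().to_numpy().mean()

# Year columns that are still stored as text rather than numbers.
non_numeric = [c for c in year_cols if not pd.api.types.is_numeric_dtype(toy[c])]

print(missing_per_year)
print(round(missing_share, 3))
print(non_numeric)
```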

The outcome of the cleaning will be a dataframe with 3 columns: the HDI values for 2009 and 2019, and the difference between them (named hdi_var), which represents the growth of HDI over those 10 years. This range of years is appropriate because it is recent and has fewer missing values.

The target features are henceforth "2009" and "2019".
"2009" shows 172 entries and 6 missing values.
"2019" shows 165 entries and 3 missing values.

In [64]:
# Expanding the tail to show the non-country aggregation rows that need to be dropped.
hdi_df.tail(20)

Unnamed: 0,HDI Rank,Country,1990,1991,1992,1993,1994,1995,1996,1997,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
187,146.0,Zambia,0.421,0.417,0.416,0.419,0.414,0.415,0.416,0.416,...,0.527,0.534,0.549,0.557,0.561,0.569,0.571,0.578,0.582,0.584
188,150.0,Zimbabwe,0.478,0.481,0.467,0.463,0.46,0.453,0.453,0.447,...,0.482,0.499,0.525,0.537,0.547,0.553,0.558,0.563,0.569,0.571
189,,Human Development,,,,,,,,,...,,,,,,,,,,
190,,Very high human development,0.779,0.782,0.78,0.79,0.795,0.799,0.804,0.804,...,0.866,0.871,0.874,0.878,0.882,0.886,0.888,0.89,0.892,0.898
191,,High human development,0.568,0.573,0.578,0.584,0.588,0.596,0.604,0.61,...,0.706,0.713,0.72,0.727,0.733,0.738,0.743,0.746,0.75,0.753
192,,Medium human development,0.437,0.439,0.445,0.451,0.457,0.464,0.471,0.476,...,0.575,0.584,0.593,0.599,0.608,0.616,0.625,0.63,0.634,0.631
193,,Low human development,0.352,0.353,0.355,0.356,0.356,0.361,0.368,0.373,...,0.473,0.479,0.484,0.49,0.496,0.499,0.501,0.505,0.507,0.513
194,,Developing Countries,0.516,0.52,0.525,0.53,0.534,0.541,0.547,0.553,...,0.642,0.65,0.657,0.663,0.669,0.674,0.68,0.683,0.686,0.689
195,,Regions,,,,,,,,,...,,,,,,,,,,
196,,Arab States,0.556,0.559,0.565,0.571,0.576,0.581,0.587,0.594,...,0.676,0.681,0.687,0.688,0.691,0.695,0.699,0.701,0.703,0.705


In [203]:
# Removing aggregations
# The tail above shows that all rows from index 189 onward can be dropped, as they are aggregations rather than countries.
# Keep the rows up to Zimbabwe and print the tail of the trimmed dataframe.
hdi_df = hdi_df.iloc[:189]
hdi_df.tail()

Unnamed: 0,HDI Rank,Country,1990,1991,1992,1993,1994,1995,1996,1997,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,hdi_var
184,113,Venezuela (Bolivarian Republic of),0.644,0.654,0.66,0.662,0.662,0.666,0.668,0.67,...,0.769,0.772,0.777,0.775,0.769,0.759,0.743,0.733,0.711,-0.045
185,117,Viet Nam,0.483,0.493,0.504,0.514,0.525,0.537,0.548,0.547,...,0.671,0.676,0.681,0.683,0.688,0.693,0.696,0.7,0.704,0.045
186,179,Yemen,0.401,0.401,0.404,0.406,0.408,0.414,0.421,0.426,...,0.506,0.504,0.509,0.502,0.483,0.474,0.467,0.468,0.47,-0.032
187,146,Zambia,0.421,0.417,0.416,0.419,0.414,0.415,0.416,0.416,...,0.534,0.549,0.557,0.561,0.569,0.571,0.578,0.582,0.584,0.067
188,150,Zimbabwe,0.478,0.481,0.467,0.463,0.46,0.453,0.453,0.447,...,0.499,0.525,0.537,0.547,0.553,0.558,0.563,0.569,0.571,0.113
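Slicing by position works here, but it depends on the exact row count of the file. In the tail shown earlier, the aggregation rows have no HDI rank, so a filter on that column is one possible alternative. This is a sketch on made-up rows, and it assumes every real country carries a rank.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature: two countries followed by a header row and an aggregate row.
toy = pd.DataFrame({
    "HDI Rank": ["146", "150", np.nan, np.nan],
    "Country": ["Zambia", "Zimbabwe", "Human Development", "Very high human development"],
    "2019": ["0.584", "0.571", np.nan, "0.898"],
})

# Keep only rows that carry an HDI rank; copy so later assignments are safe.
countries_only = toy[toy["HDI Rank"].notna()].copy()

print(countries_only["Country"].tolist())
```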


In [199]:
# Since all HDI values are stored as text, we convert our two target features (2009 and 2019) to float.
hdi_df["2009"] = pd.to_numeric(hdi_df["2009"], downcast="float")

In [200]:
hdi_df["2019"] = pd.to_numeric(hdi_df["2019"], downcast="float")

In [201]:
# Calculating the difference between the 2 target features to see the net growth of HDI over the period
hdi_df["hdi_var"] = hdi_df["2019"] - hdi_df["2009"]
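The conversions above handle one column at a time. A possible variant converts several year columns in one step and uses `errors="coerce"` so that any leftover placeholder becomes NaN instead of raising. The frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical miniature with values still stored as text.
toy = pd.DataFrame({
    "Country": ["Afghanistan", "Andorra"],
    "2009": ["0.46", "0.839"],
    "2019": ["0.511", ".."],
})

year_cols = ["2009", "2019"]

# errors="coerce" turns anything unparseable (such as "..") into NaN.
toy[year_cols] = toy[year_cols].apply(pd.to_numeric, errors="coerce")

# Net change between the two years; NaN wherever either year is missing.
toy["hdi_var"] = toy["2019"] - toy["2009"]

print(toy)
```

Coercion hides parsing problems, so it is worth counting the NaN values before and after the conversion.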

In [359]:
hdi_df10 = hdi_df[["Country", "2009", "2019", "hdi_var"]]
hdi_df10.head()

Unnamed: 0,Country,2009,2019,hdi_var
0,Afghanistan,0.46,0.511,0.051
1,Albania,0.733,0.795,0.062
2,Algeria,0.711,0.748,0.037
3,Andorra,0.839,0.868,0.029
4,Angola,0.515,0.581,0.066


In [371]:
hdi_df10 = hdi_df10.rename(columns={'2009': 'hdi_2009',"2019":"hdi_2019" })

In [372]:
hdi_df10.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189 entries, 0 to 188
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Country   189 non-null    object 
 1   hdi_2009  189 non-null    float32
 2   hdi_2019  189 non-null    float32
 3   hdi_var   189 non-null    float32
dtypes: float32(3), object(1)
memory usage: 3.8+ KB


In summary, cleaning the HDI dataframe resulted in the following:
17 observations were dropped because they were not countries. This leaves a total of 189 countries, which means 6 are still missing compared with the 195 countries in the world. After dropping these rows, no missing data remain for either target feature.
The final dataframe hdi_df10, holding the HDI data for 2009 and 2019, has no missing data, which made it possible to calculate hdi_var.

### Loading and cleaning education dataframe

In [320]:
edu = pd.read_csv('Expected years of schooling (years).csv', sep = ",", skiprows = 6)
edu_df = pd.DataFrame(edu)
edu_df.head()

Unnamed: 0,HDI Rank,Country,1990,Unnamed: 3,1991,Unnamed: 5,1992,Unnamed: 7,1993,Unnamed: 9,...,2015,Unnamed: 53,2016,Unnamed: 55,2017,Unnamed: 57,2018,Unnamed: 59,2019,Unnamed: 61
0,169,Afghanistan,2.6,,2.9,,3.2,,3.6,,...,10.2,,10.3,,10.1,,10.1,,10.2,a
1,69,Albania,11.6,,11.8,,10.7,,10.1,,...,15.1,,14.8,,14.8,,14.7,,14.7,a
2,91,Algeria,9.6,,9.7,,9.8,,9.8,,...,14.2,,14.2,,14.4,,14.5,,14.6,a
3,36,Andorra,10.8,,10.8,,10.8,,10.8,,...,13.1,,13.3,,13.0,,13.3,,13.3,"a,b"
4,148,Angola,3.4,,3.3,,3.2,,3.7,,...,11.0,,11.4,,11.8,,11.8,,11.8,"a,c"


In [326]:
# Converting the ".." placeholders into NaN.
edu_df.replace("..", np.nan, inplace = True)

In [327]:
edu_df = edu_df.loc[:, ~edu_df.columns.str.startswith('Unnamed:')]
edu_df.head()

Unnamed: 0,HDI Rank,Country,1990,1991,1992,1993,1994,1995,1996,1997,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,169,Afghanistan,2.6,2.9,3.2,3.6,3.9,4.2,4.6,4.9,...,9.5,9.5,10.0,10.2,10.3,10.2,10.3,10.1,10.1,10.2
1,69,Albania,11.6,11.8,10.7,10.1,10.1,10.2,10.2,10.5,...,13.0,13.7,14.6,14.9,15.3,15.1,14.8,14.8,14.7,14.7
2,91,Algeria,9.6,9.7,9.8,9.8,9.9,9.8,10.0,10.3,...,14.0,14.3,13.9,13.6,14.0,14.2,14.2,14.4,14.5,14.6
3,36,Andorra,10.8,10.8,10.8,10.8,10.8,10.8,10.8,10.8,...,11.7,11.7,13.5,13.1,13.5,13.1,13.3,13.0,13.3,13.3
4,148,Angola,3.4,3.3,3.2,3.7,3.8,3.9,4.0,4.1,...,8.6,9.5,9.9,10.3,10.7,11.0,11.4,11.8,11.8,11.8


In [328]:
report2 = ProfileReport(edu_df, title = "Education report", explorative = True)
report2




The education data show:
Education levels, measured as expected years of schooling (per the file name) for both genders combined.
The data run from 1990 to 2019, which includes our target years 2009 and 2019.
However, some issues need to be resolved:
"2009" has 94 distinct values and 10 missing values.
"2019" has 92 distinct values and 10 missing values.
At first sight this looks like a bigger problem, because it could mean that more than half of the countries are missing from the dataframe. Note, however, that distinct values count unique numbers rather than countries, so the missing-value counts are the more reliable guide.

There is a total of 315 missing values.

There are 220 entries in the Country column instead of the 195 existing countries, which suggests aggregations such as those shown in the dataframe tail.

All entries are stored as categorical (text) data instead of floats.

As with HDI, the same outcome is expected: extract a sub-dataframe with the education values for 2009 and 2019, plus the difference between the two.
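Because the education file needs the same steps as the HDI file, the cleaning could be wrapped in one function. The helper below, `clean_indicator`, is hypothetical and not part of the notebook; it assumes aggregation rows are exactly the rows without an HDI rank, and it is demonstrated on made-up rows.

```python
import numpy as np
import pandas as pd

def clean_indicator(raw, prefix, start="2009", end="2019"):
    """Hypothetical helper mirroring the notebook's manual cleaning steps."""
    df = raw.replace("..", np.nan)
    df = df.loc[:, ~df.columns.str.startswith("Unnamed:")]
    # Assumption: aggregation rows are the ones without an HDI rank.
    df = df[df["HDI Rank"].notna()].copy()
    for year in (start, end):
        df[year] = pd.to_numeric(df[year], errors="coerce")
    df[f"{prefix}_var"] = df[end] - df[start]
    out = df[["Country", start, end, f"{prefix}_var"]]
    return out.rename(columns={start: f"{prefix}_{start}", end: f"{prefix}_{end}"})

# Made-up rows: two countries, one aggregate, one empty "Unnamed" column.
raw = pd.DataFrame({
    "HDI Rank": ["169", "69", np.nan],
    "Country": ["Afghanistan", "Albania", "World"],
    "2009": ["8.9", "..", "11.0"],
    "Unnamed: 3": [np.nan, np.nan, np.nan],
    "2019": ["10.2", "14.7", "12.0"],
})

edu_small = clean_indicator(raw, "edu")
print(edu_small)
```

Calling the same function with `prefix="hdi"` would produce the `hdi_2009`, `hdi_2019` and `hdi_var` columns.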

In [329]:
# Inspecting the tail to locate the unwanted aggregation rows before removing them.
edu_df.tail(30)

Unnamed: 0,HDI Rank,Country,1990,1991,1992,1993,1994,1995,1996,1997,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
190,179,Yemen,7.5,7.5,7.6,7.6,7.6,7.6,7.6,7.6,...,8.6,9.0,8.5,8.8,8.7,8.7,8.7,8.7,8.7,8.8
191,146,Zambia,7.5,7.7,8.0,8.2,8.4,8.7,8.9,9.1,...,11.0,10.9,10.9,11.0,11.0,11.1,11.2,11.3,11.4,11.5
192,150,Zimbabwe,9.8,10.2,9.8,9.8,9.8,9.8,9.8,9.8,...,10.1,10.2,10.3,10.2,10.3,10.3,10.4,10.5,10.5,11.0
193,,Human Development,,,,,,,,,...,,,,,,,,,,
194,,Very high human development,13.3,13.4,13.0,13.7,13.8,13.9,14.0,13.7,...,15.6,15.7,15.8,16.1,16.2,16.3,16.4,16.4,16.4,16.3
195,,High human development,9.7,9.7,9.8,9.9,9.9,10.1,10.3,10.4,...,12.9,13.1,13.3,13.5,13.6,13.7,13.8,13.8,13.8,14.0
196,,Medium human development,7.3,7.4,7.5,7.6,7.7,7.8,7.9,7.9,...,10.3,10.7,10.9,11.0,11.2,11.3,11.6,11.7,11.7,11.5
197,,Low human development,5.4,5.4,5.4,5.5,5.5,5.7,5.9,6.1,...,8.7,8.9,9.0,9.2,9.2,9.3,9.2,9.3,9.3,9.4
198,,Developing Countries,8.4,8.5,8.5,8.6,8.6,8.8,8.9,9.0,...,11.3,11.5,11.7,11.8,12.0,12.0,12.2,12.2,12.2,12.2
199,,Regions,,,,,,,,,...,,,,,,,,,,


In [330]:
edu_df = edu_df.iloc[:193]
edu_df.tail()

Unnamed: 0,HDI Rank,Country,1990,1991,1992,1993,1994,1995,1996,1997,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
188,113,Venezuela (Bolivarian Republic of),10.5,10.7,10.8,10.8,10.7,10.7,10.6,10.6,...,13.6,13.7,13.8,14.2,14.1,14.0,13.6,12.8,12.8,12.8
189,117,Viet Nam,7.8,8.1,8.4,8.7,9.0,9.3,9.6,9.0,...,12.0,12.5,12.6,12.6,12.7,12.7,12.7,12.7,12.7,12.7
190,179,Yemen,7.5,7.5,7.6,7.6,7.6,7.6,7.6,7.6,...,8.6,9.0,8.5,8.8,8.7,8.7,8.7,8.7,8.7,8.8
191,146,Zambia,7.5,7.7,8.0,8.2,8.4,8.7,8.9,9.1,...,11.0,10.9,10.9,11.0,11.0,11.1,11.2,11.3,11.4,11.5
192,150,Zimbabwe,9.8,10.2,9.8,9.8,9.8,9.8,9.8,9.8,...,10.1,10.2,10.3,10.2,10.3,10.3,10.4,10.5,10.5,11.0
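Assigning columns on a frame that was produced by slicing, as with `edu_df.iloc[:193]`, can make pandas emit a SettingWithCopyWarning in the cells that follow. One common way to avoid it is to take an explicit copy when trimming. The frame below is made up, and `full` and `trimmed` are illustrative names.

```python
import pandas as pd

# Hypothetical miniature: two countries and one aggregate row.
full = pd.DataFrame({
    "Country": ["Zambia", "Zimbabwe", "World"],
    "2009": ["10.9", "10.0", "11.0"],
})

# .copy() makes the trimmed frame independent of `full`,
# so column assignments on it are unambiguous.
trimmed = full.iloc[:2].copy()
trimmed["2009"] = pd.to_numeric(trimmed["2009"])

print(trimmed.dtypes)
```

The original frame keeps its text values, which confirms that the assignment only touched the copy.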


In [339]:
edu_df["2009"] = pd.to_numeric(edu_df["2009"], downcast="float")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [340]:
edu_df["2019"] = pd.to_numeric(edu_df["2019"], downcast="float")



In [341]:
edu_df.dtypes

HDI Rank     object
Country      object
1990         object
1991         object
1992         object
1993         object
1994         object
1995         object
1996         object
1997         object
1998         object
1999         object
2000         object
2001         object
2002         object
2003         object
2004         object
2005         object
2006         object
2007         object
2008         object
2009        float32
2010         object
2011         object
2012         object
2013         object
2014         object
2015         object
2016         object
2017         object
2018         object
2019        float32
dtype: object

In [342]:
edu_df["edu_var"] = edu_df["2019"] - edu_df["2009"]



In [358]:
edu_df10 = edu_df[["Country", "2009", "2019", "edu_var"]]
edu_df10.head()

Unnamed: 0,Country,2009,2019,edu_var
0,Afghanistan,8.9,10.2,1.3
1,Albania,12.3,14.7,2.4
2,Algeria,13.6,14.6,1.0
3,Andorra,11.7,13.3,1.6
4,Angola,9.0,11.8,2.8


In [345]:
edu_df['2009'].isna().sum()

5

In [346]:
edu_df['2019'].isna().sum()

0

In [373]:
edu_df10 = edu_df10.rename(columns={'2009': 'edu_2009', "2019":"edu_2019"})

In [374]:
edu_df10.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Country   193 non-null    object 
 1   edu_2009  188 non-null    float32
 2   edu_2019  193 non-null    float32
 3   edu_var   188 non-null    float32
dtypes: float32(3), object(1)
memory usage: 3.9+ KB


In summary, we ended up with a new dataframe, edu_df10, showing education growth over the past 10 years for 193 countries, with 5 missing values in edu_2009 and therefore in edu_var.
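To decide how to treat the missing values, it helps to see which countries they belong to. The sketch below uses a made-up frame ("Exampleland" is not a real entry), and dropping the incomplete rows is shown only as one possible option.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of edu_df10; "Exampleland" is made up.
toy = pd.DataFrame({
    "Country": ["Afghanistan", "Exampleland", "Angola"],
    "edu_2009": [8.9, np.nan, 9.0],
    "edu_2019": [10.2, 12.0, 11.8],
})
toy["edu_var"] = toy["edu_2019"] - toy["edu_2009"]

# Names of the countries without a 2009 value.
missing_2009 = toy.loc[toy["edu_2009"].isna(), "Country"].tolist()

# One option: keep only rows where the 2009 value is present.
complete = toy.dropna(subset=["edu_2009"])

print(missing_2009)
print(len(complete))
```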

### Loading and cleaning poverty dataframe

In [350]:
pov = pd.read_csv('Multidimensional_poverty_index_MPI_.csv', sep = ",", skiprows = 5)
pov_df = pd.DataFrame(pov)
pov_df.head()

Unnamed: 0,HDI Rank,Country,2008,Unnamed: 3,2009,Unnamed: 5,2010,Unnamed: 7,2011,Unnamed: 9,...,2016,Unnamed: 19,2017,Unnamed: 21,2018,Unnamed: 23,2019,Unnamed: 25,2008-2019,Unnamed: 27
0,169,Afghanistan,..,,..,,..,,..,,...,0.272,,..,,..,,..,,0.272,"a,b"
1,69,Albania,..,,..,,..,,..,,...,..,,..,,0.003,,..,,0.003,a
2,91,Algeria,..,,..,,..,,..,,...,..,,..,,..,,..,,0.008,a
3,148,Angola,..,,..,,..,,..,,...,0.282,,..,,..,,..,,0.282,a
4,81,Armenia,..,,..,,..,,..,,...,0.001,,..,,..,,..,,0.001,a


In [352]:
# Converting the ".." placeholders into NaN.
pov_df.replace("..", np.nan, inplace = True)

In [353]:
pov_df = pov_df.loc[:, ~pov_df.columns.str.startswith('Unnamed:')]
pov_df.head()

Unnamed: 0,HDI Rank,Country,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2008-2019
0,169,Afghanistan,,,,,,,,,0.272,,,,0.272
1,69,Albania,,,,,,,,,,,0.003,,0.003
2,91,Algeria,,,,,,0.008,,,,,,,0.008
3,148,Angola,,,,,,,,,0.282,,,,0.282
4,81,Armenia,,,,,,,,,0.001,,,,0.001


In [354]:
pov_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   HDI Rank   208 non-null    object
 1   Country    197 non-null    object
 2   2008       1 non-null      object
 3   2009       1 non-null      object
 4   2010       4 non-null      object
 5   2011       4 non-null      object
 6   2012       10 non-null     object
 7   2013       4 non-null      object
 8   2014       19 non-null     object
 9   2015       7 non-null      object
 10  2016       21 non-null     object
 11  2017       11 non-null     object
 12  2018       21 non-null     object
 13  2019       4 non-null      object
 14  2008-2019  114 non-null    object
dtypes: object(15)
memory usage: 24.7+ KB


In [356]:
# Converting the "2008-2019" column to float
pov_df["2008-2019"] = pd.to_numeric(pov_df["2008-2019"], downcast="float")

In [360]:
# Dropping all columns except Country and "2008-2019", which are the only ones we are interested in.
pov_df_actual = pov_df[["Country", "2008-2019"]]
pov_df_actual.head()

Unnamed: 0,Country,2008-2019
0,Afghanistan,0.272
1,Albania,0.003
2,Algeria,0.008
3,Angola,0.282
4,Armenia,0.001
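In the rows shown earlier, the "2008-2019" value equals the single yearly value each country has. Before relying on that column alone, one could check whether it matches the latest available yearly value. This sketch uses three rows taken from the head output above and only a subset of the year columns; whether the pattern holds for the whole file is an assumption to verify.

```python
import numpy as np
import pandas as pd

# Rows based on the head output above, with a subset of the year columns.
toy = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Algeria"],
    "2013": [np.nan, np.nan, 0.008],
    "2016": [0.272, np.nan, np.nan],
    "2018": [np.nan, 0.003, np.nan],
    "2008-2019": [0.272, 0.003, 0.008],
})

year_cols = ["2013", "2016", "2018"]

# Forward-fill along each row so the last column holds the latest known value.
latest = toy[year_cols].ffill(axis=1).iloc[:, -1]

# Compare with the summary column, allowing for floating-point noise.
matches = np.isclose(latest, toy["2008-2019"])
print(matches.all())
```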


In [378]:
pov_df_actual = pov_df_actual.rename(columns={'2008-2019': "pov_index"})

In [379]:
pov_df_actual.dtypes

Country       object
pov_index    float32
dtype: object

# Merging dataframes

In [389]:
df_merge = pd.merge(hdi_df10, edu_df10, on='Country')
df_merge.head()

Unnamed: 0,Country,hdi_2009,hdi_2019,hdi_var,edu_2009,edu_2019,edu_var
0,Afghanistan,0.46,0.511,0.051,8.9,10.2,1.3
1,Albania,0.733,0.795,0.062,12.3,14.7,2.4
2,Algeria,0.711,0.748,0.037,13.6,14.6,1.0
3,Andorra,0.839,0.868,0.029,11.7,13.3,1.6
4,Angola,0.515,0.581,0.066,9.0,11.8,2.8


In [392]:
data = pd.merge(df_merge, pov_df_actual, on='Country')
data.head()

Unnamed: 0,Country,hdi_2009,hdi_2019,hdi_var,edu_2009,edu_2019,edu_var,pov_index
0,Afghanistan,0.46,0.511,0.051,8.9,10.2,1.3,0.272
1,Albania,0.733,0.795,0.062,12.3,14.7,2.4,0.003
2,Algeria,0.711,0.748,0.037,13.6,14.6,1.0,0.008
3,Andorra,0.839,0.868,0.029,11.7,13.3,1.6,
4,Angola,0.515,0.581,0.066,9.0,11.8,2.8,0.282
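`pd.merge` performs an inner join by default, so a country whose name is spelled differently in two files would silently disappear. A possible check is an outer merge with `indicator=True`, which labels where each row came from. The frames and the poverty value for "Vietnam" below are made up for illustration.

```python
import pandas as pd

# Made-up frames; the "Viet Nam" / "Vietnam" mismatch is hypothetical.
left = pd.DataFrame({
    "Country": ["Albania", "Andorra", "Viet Nam"],
    "hdi_2019": [0.795, 0.868, 0.704],
})
right = pd.DataFrame({
    "Country": ["Albania", "Vietnam"],
    "pov_index": [0.003, 0.5],  # 0.5 is a placeholder, not real data
})

# Outer merge keeps every row and records its origin in the "_merge" column.
merged = pd.merge(left, right, on="Country", how="outer", indicator=True)
unmatched = merged.loc[merged["_merge"] != "both", ["Country", "_merge"]]

print(unmatched)
```

Rows labelled `left_only` or `right_only` are the ones an inner join would drop.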


# Performing basic statistics

In [394]:
data.describe()

Unnamed: 0,hdi_2009,hdi_2019,hdi_var,edu_2009,edu_2019,edu_var,pov_index
count,189.0,189.0,189.0,187.0,189.0,187.0,107.0
mean,0.672555,0.722423,0.049868,12.550269,13.326985,0.824599,0.14129
std,0.178472,0.149791,0.078791,2.878664,2.940014,1.006342,0.153217
min,0.0,0.394,-0.107,4.4,5.0,-2.999999,0.001
25%,0.528,0.602,0.029,10.6,11.4,0.3,0.014
50%,0.715,0.74,0.041,12.8,13.2,0.8,0.085
75%,0.8,0.829,0.057,14.5,15.2,1.35,0.243
max,0.937,0.957,0.715,20.299999,22.0,3.8,0.59
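The summary above reports a minimum hdi_2009 of 0.0 and a maximum hdi_var of 0.715. Together these may point to a zero standing in for a missing 2009 value, although that is an assumption to confirm against the source file. A minimal sketch for listing such rows, on made-up data ("Exampleland" is not a real entry):

```python
import pandas as pd

# Hypothetical miniature of the merged frame; "Exampleland" is made up.
toy = pd.DataFrame({
    "Country": ["Afghanistan", "Exampleland"],
    "hdi_2009": [0.46, 0.0],
    "hdi_2019": [0.511, 0.715],
})
toy["hdi_var"] = toy["hdi_2019"] - toy["hdi_2009"]

# An HDI of exactly zero is implausible, so flag those rows for review.
suspicious = toy[toy["hdi_2009"] <= 0]

print(suspicious["Country"].tolist())
```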


In [395]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 189 entries, 0 to 188
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    189 non-null    object 
 1   hdi_2009   189 non-null    float32
 2   hdi_2019   189 non-null    float32
 3   hdi_var    189 non-null    float32
 4   edu_2009   187 non-null    float32
 5   edu_2019   189 non-null    float32
 6   edu_var    187 non-null    float32
 7   pov_index  107 non-null    float32
dtypes: float32(7), object(1)
memory usage: 8.1+ KB


In [None]:
data.pov_index.T