# Assignment 2 - Correlation between the ranking of Human Development Index (HDI) values and the values of the factors that make up the HDI

## Introduction

The Human Development Index (HDI) ranks countries based on their standard of living. The purpose of the index is to assess the standard of living within most of the sovereign nations and territories around the world. While the index is far from perfect, it is useful as a general overview of the living standards one could expect should one be born or live in any of the nations assessed on the index. It is also useful in assessing which areas need to be improved in order for a country to attain a higher standard of living.

Indicators that make up the index include the life expectancy of a person born in that country, their expected and mean years of schooling and their Gross National Income (GNI) per capita. Through a complex calculation, the indicators are combined to give an HDI value, measuring the standard of living in the respective country. A higher life expectancy, more expected and mean years of schooling and a higher GNI per capita generally mean a higher HDI value, and therefore a higher standard of living in that country. An HDI value closer to 1 means the country has a high standard of living, while an HDI value closer to 0 means the country has a lower standard of living.
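
For readers curious about the "complex calculation", the sketch below follows the method set out in UNDP's technical notes: each indicator is rescaled to an index between 0 and 1 using fixed minimum and maximum "goalposts", the two schooling indices are averaged, and the HDI is the geometric mean of the three dimension indices. The goalposts used here (20 and 85 years of life expectancy, 18 and 15 years of schooling, 100 and 75,000 dollars of income) and the function name `hdi_sketch` are assumptions that do not come from this notebook, so the result should be treated as illustrative only.

```r
# Illustrative sketch of the HDI calculation.
# The goalposts below are assumed from UNDP's technical notes, not taken from this notebook.
hdi_sketch <- function(life_exp, exp_school, mean_school, gni_pc) {
  # Health index: life expectancy rescaled between 20 and 85 years
  health <- (life_exp - 20) / (85 - 20)
  # Education index: mean of the two schooling indices (expected years capped at 18)
  education <- (pmin(exp_school, 18) / 18 + mean_school / 15) / 2
  # Income index: based on the logarithm of GNI per capita, capped at 75,000
  income <- (log(pmin(gni_pc, 75000)) - log(100)) / (log(75000) - log(100))
  # HDI: geometric mean of the three dimension indices
  (health * education * income)^(1 / 3)
}

# Norway's 2017 indicator values as listed in the dataset
norway <- hdi_sketch(82.3, 17.9, 12.6, 68012)
round(norway, 3)
```

With Norway's values the function returns a figure close to the published HDI value of 0.953. Income enters through a logarithm, so each extra dollar counts for less as GNI per capita grows.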

The dataset used contains the HDI values and ranks for countries in 2017, in addition to four columns listing the results of the four indicators used to measure the HDI value for each country. There are also two additional columns at the end of the dataset – a column that shows the results of the HDI rank subtracted from the GNI per capita rank and a column that shows the HDI rank for 2016.

The HDI proves to be a fascinating dataset to use, because it raises several questions about how the various indicators affect living standards in different countries. One interesting question is which indicator is the best at predicting the HDI rank of each country. Does life expectancy, expected years of schooling, mean years of schooling or GNI per capita correlate best with the HDI rank of each country?

The question is significant because the answer would allow one to estimate the HDI rank of a country from the single indicator that correlates best with the HDI ranks. This can give a good idea of the living standards of a country without needing to know all the indicators or to work through the calculation that produces the HDI value in the first place. It could also reveal which indicator the HDI relies on the most in calculating its score for a given country.



## Methods and Results

To begin with, the appropriate R packages have to be loaded: `rvest` and `dplyr`.

In [2]:
library(rvest)

Loading required package: xml2


In [3]:
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



In [99]:
vignette('selectorgadget')

After loading the packages, the `read_html` function is used to read the website containing the data. The website is http://hdr.undp.org/en/composite/HDI. The result is saved as the object 'hweb'.

In [4]:
hweb = read_html("http://hdr.undp.org/en/composite/HDI")

In [102]:
?html_nodes

In [120]:
?html_table

Using the `html_nodes` function, the page is searched for the node that holds the dataset. The dataset is a large table listing the HDI ranks, values and components for the countries assessed.

In [5]:
html_nodes(hweb, 'table')

{xml_nodeset (1)}
[1] <table border="0" cellpadding="0" cellspacing="0" width="2268" style="bor ...

The data is scraped from the website into this notebook. Because the rows of the table have an irregular number of columns, the `fill` argument of `html_table` had to be set to `TRUE`.

In [6]:
hweb2 = html_nodes(hweb, 'table') %>% .[[1]] %>% html_table(fill = TRUE)
hweb2

X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X22,X23,X24,X25,X26,X27,X28,X29,X30,X31
,Table 1. Human Development Index and its components,,,,,,,,,...,,,,,,,,,,
,,,,SDG 3,,SDG 4.3,,SDG 4.6,,...,,,,,,,,,,
,,Human Development Index (HDI),,Life expectancy at birth,,Expected years of schooling,,Mean years of schooling,,...,,,,,,,,,,
HDI rank,Country,Value,,(years),,(years),,(years),,...,,,,,,,,,,
,,2017,,2017,,2017,a,2017,a,...,,,,,,,,,,
,VERY HIGH HUMAN DEVELOPMENT,VERY HIGH HUMAN DEVELOPMENT,VERY HIGH HUMAN DEVELOPMENT,VERY HIGH HUMAN DEVELOPMENT,VERY HIGH HUMAN DEVELOPMENT,VERY HIGH HUMAN DEVELOPMENT,VERY HIGH HUMAN DEVELOPMENT,VERY HIGH HUMAN DEVELOPMENT,VERY HIGH HUMAN DEVELOPMENT,...,,,,,,,,,,
1,Norway,0.953,,82.3,,17.9,,12.6,,...,,,,,,,,,,
2,Switzerland,0.944,,83.5,,16.2,,13.4,,...,,,,,,,,,,
3,Australia,0.939,,83.1,,22.9,b,12.9,,...,,,,,,,,,,
4,Ireland,0.938,,81.6,,19.6,b,12.5,c,...,,,,,,,,,,


The dataset scraped from the website is a mess. To make things clearer, the `filter` function was used to remove all rows with a blank first column (X1), plus all countries without a defined HDI rank (rows where X1 holds two or three periods).

In [31]:
hweb2 = filter(hweb2, X1!='', X1!='...', X1!='..')
hweb2

X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X22,X23,X24,X25,X26,X27,X28,X29,X30,X31
HDI rank,Country,Value,,(years),,(years),,(years),,...,,,,,,,,,,
1,Norway,0.953,,82.3,,17.9,,12.6,,...,,,,,,,,,,
2,Switzerland,0.944,,83.5,,16.2,,13.4,,...,,,,,,,,,,
3,Australia,0.939,,83.1,,22.9,b,12.9,,...,,,,,,,,,,
4,Ireland,0.938,,81.6,,19.6,b,12.5,c,...,,,,,,,,,,
5,Germany,0.936,,81.2,,17.0,,14.1,,...,,,,,,,,,,
6,Iceland,0.935,,82.9,,19.3,b,12.4,c,...,,,,,,,,,,
7,"Hong Kong, China (SAR)",0.933,,84.1,,16.3,,12.0,,...,,,,,,,,,,
7,Sweden,0.933,,82.6,,17.6,,12.4,,...,,,,,,,,,,
9,Singapore,0.932,,83.2,,16.2,d,11.5,,...,,,,,,,,,,
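
An alternative to listing every unwanted value is to keep only the rows whose first column looks like a whole number. The sketch below shows that idea in base R on a small made-up data frame; `toy` and its contents are hypothetical stand-ins for `hweb2`, and this is not the approach used in the notebook.

```r
# Hypothetical stand-in for the scraped table: header rows, a group label and data rows
toy <- data.frame(
  X1 = c("", "HDI rank", "", "1", "2", "..", "..."),
  X2 = c("Table 1", "Country", "VERY HIGH HUMAN DEVELOPMENT",
         "Norway", "Switzerland", "Unranked A", "Unranked B"),
  stringsAsFactors = FALSE
)

# Keep rows where X1 consists only of digits, i.e. rows that carry an HDI rank
ranked <- toy[grepl("^[0-9]+$", toy$X1), ]
ranked
```

Unlike the `filter` call above, this pattern would also drop the row that starts with "HDI rank", because that cell is not made up of digits.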


This looks slightly better but it is still a mess. The headings have been cut off - however, that can be fixed later on.
Next, any column with no useful information in it (for example, columns that are blank, hold only footnote letters or contain 'NA' repeated over and over) is removed using the `subset` function and its `select` argument.
Note that within `select`, the `c()` function lists the columns with blanks or 'NA's. The minus sign before `c()` indicates that these are the columns that **are not** wanted.

In [32]:
hweb3 = subset(hweb2, select=-c(X4,X6,X8,X10,X12,X14,X16:X31))
hweb3

X1,X2,X3,X5,X7,X9,X11,X13,X15
HDI rank,Country,Value,(years),(years),(years),(2011 PPP $),,
1,Norway,0.953,82.3,17.9,12.6,68012,5,1
2,Switzerland,0.944,83.5,16.2,13.4,57625,8,2
3,Australia,0.939,83.1,22.9,12.9,43560,18,3
4,Ireland,0.938,81.6,19.6,12.5,53754,8,4
5,Germany,0.936,81.2,17.0,14.1,46136,13,4
6,Iceland,0.935,82.9,19.3,12.4,45810,13,6
7,"Hong Kong, China (SAR)",0.933,84.1,16.3,12.0,58420,2,8
7,Sweden,0.933,82.6,17.6,12.4,47766,9,7
9,Singapore,0.932,83.2,16.2,11.5,82503,-6,8
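
Instead of naming each unwanted column by hand, the empty ones could also be found programmatically. This is a minimal base R sketch on a made-up data frame (`toy` is hypothetical); it treats a column as empty when every cell is `NA` or a blank string.

```r
# Hypothetical table with two useful text columns, one value column and two empty columns
toy <- data.frame(
  X1 = c("1", "2"),
  X2 = c("Norway", "Switzerland"),
  X3 = c("0.953", "0.944"),
  X4 = c("", ""),
  X5 = c(NA, NA),
  stringsAsFactors = FALSE
)

# TRUE for columns in which every cell is NA or an empty string
is_empty <- vapply(toy, function(col) all(is.na(col) | col == ""), logical(1))

# Keep only the columns that contain at least one value
cleaned <- toy[, !is_empty, drop = FALSE]
names(cleaned)
```

In the scraped table the footnote columns hold letters such as 'b' and 'c' in some rows, so a check like this would not catch them and they would still need to be dropped by name.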


Slowly, the dataset is being cleaned up.

The first row of the dataset is actually made up of parts of the headings that the dataset is supposed to have. This row needs to be filtered out and the current column headings (X1, X2 etc.) need to be renamed.
First, the top row is filtered out. This is done by removing the row that has the word "Country" in column 'X2'.

In [33]:
hweb3 = filter(hweb3, X2!="Country")
hweb3

X1,X2,X3,X5,X7,X9,X11,X13,X15
1,Norway,0.953,82.3,17.9,12.6,68012,5,1
2,Switzerland,0.944,83.5,16.2,13.4,57625,8,2
3,Australia,0.939,83.1,22.9,12.9,43560,18,3
4,Ireland,0.938,81.6,19.6,12.5,53754,8,4
5,Germany,0.936,81.2,17.0,14.1,46136,13,4
6,Iceland,0.935,82.9,19.3,12.4,45810,13,6
7,"Hong Kong, China (SAR)",0.933,84.1,16.3,12.0,58420,2,8
7,Sweden,0.933,82.6,17.6,12.4,47766,9,7
9,Singapore,0.932,83.2,16.2,11.5,82503,-6,8
10,Netherlands,0.931,82.0,18.0,12.2,47900,5,10


Now the columns are renamed. It should be noted that some of the columns are not named as they were in the original dataset, in order to save space. Most of the changes drop the three or four words that spell out what 'HDI' and 'GNI' mean. For reference, 'HDI' means 'Human Development Index' and 'GNI' means 'Gross National Income'. Brackets have been left out because R kept treating brackets in a column name as part of a function call. Also, underscores have been placed between the words in each column name so that the names can be used directly when the data is manipulated later.

In [41]:
colnames(hweb3) = c("HDI_rank_2017", "Country", "HDI_Value", "Life_expectancy_at_birth_in_years", "Expected_years_of_schooling_in_years",
                    "Mean_years_of_schooling_in_years", "GNI_per_capita_in_USdollars", "GNI_per_capita_rank_-_HDI_rank", "HDI_rank_2016")
hweb3

HDI_rank_2017,Country,HDI_Value,Life_expectancy_at_birth_in_years,Expected_years_of_schooling_in_years,Mean_years_of_schooling_in_years,GNI_per_capita_in_USdollars,GNI_per_capita_rank_-_HDI_rank,HDI_rank_2016
1,Norway,0.953,82.3,17.9,12.6,68012,5,1
2,Switzerland,0.944,83.5,16.2,13.4,57625,8,2
3,Australia,0.939,83.1,22.9,12.9,43560,18,3
4,Ireland,0.938,81.6,19.6,12.5,53754,8,4
5,Germany,0.936,81.2,17.0,14.1,46136,13,4
6,Iceland,0.935,82.9,19.3,12.4,45810,13,6
7,"Hong Kong, China (SAR)",0.933,84.1,16.3,12.0,58420,2,8
7,Sweden,0.933,82.6,17.6,12.4,47766,9,7
9,Singapore,0.932,83.2,16.2,11.5,82503,-6,8
10,Netherlands,0.931,82.0,18.0,12.2,47900,5,10


Now the data is ready to be manipulated.

To repeat, we want to find out which indicator (life expectancy, expected years of schooling, mean years of schooling or GNI per capita) is most closely correlated with the HDI rank of each country. The plan is to give a rank to each value of each indicator and then use these ranks to determine a correlation coefficient between the ranks of the indicator and the HDI ranks. A correlation coefficient measures the strength and direction of the relationship between two variables. A coefficient close to -1 is a strong negative correlation, while a coefficient close to +1 is a strong positive correlation. A coefficient around 0 shows no or weak correlation.

As a first example, the data in the 'Life_expectancy_at_birth_in_years' column shall be manipulated.

In [68]:
?rank

First, an object for the 'Life_expectancy_at_birth' data is created to keep things simple.

In [57]:
le = hweb3$Life_expectancy_at_birth_in_years
le

Then, each value for life expectancy is given a rank. `rank(desc())` is used because the ranking should be done in descending order - the largest value for life expectancy should have the lowest rank. The `ties.method` argument sets which rank tied values should be given. "min" means tied values all receive the smallest of the ranks they span - as is the case with the HDI ranks.

In [97]:
lerank = rank(desc(le), ties.method = "min")
lerank
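
To see what `ties.method = "min"` does, here is a small example on made-up life expectancy values. It uses base R only and negates the values instead of calling `desc()`, which has the same effect for numeric input; the vector `toy_le` is hypothetical.

```r
# Hypothetical life expectancy values with one tie (82.3 appears twice)
toy_le <- c(84.1, 82.3, 83.5, 82.3, 81.2)

# Negating the values makes the largest value receive rank 1
toy_rank <- rank(-toy_le, ties.method = "min")
toy_rank
# ranks: 1 3 2 3 5
```

The two tied values share rank 3 and rank 4 is skipped, which mirrors how Hong Kong and Sweden share HDI rank 7 while Singapore follows at rank 9.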

To make things simpler, 'HDI_rank_2017' is turned into an object.

In [98]:
hdirank = hweb3$HDI_rank_2017
hdirank

The inverted commas around the values for the HDI rank indicate that these values are not numeric. They must be converted to numeric values first.

In [99]:
hdirank = as.numeric(hdirank)

Now, using the `cor()` function, the correlation between HDI ranks and life expectancy can be determined.

In [161]:
cor(lerank, hdirank)

The coefficient of 0.91345536617642 shows a strong positive correlation between the HDI ranks and life expectancy ranks. However, the other indicators may have stronger correlations.
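
Correlating two sets of ranks with `cor()` is essentially Spearman's rank correlation. The sketch below, on made-up numbers without ties, shows that the default (Pearson) correlation of the ranks matches `cor(..., method = "spearman")` on the raw values. With ties the two can differ slightly, because Spearman's method assigns average ranks rather than minimum ranks. The vectors `toy_values` and `toy_hdi_rank` are hypothetical.

```r
# Hypothetical indicator values and overall ranks for five countries (no ties)
toy_values <- c(82.3, 83.5, 79.0, 71.4, 60.2)
toy_hdi_rank <- c(1, 2, 3, 4, 5)

# Pearson correlation computed on ranks, following the approach in this notebook
by_hand <- cor(rank(-toy_values), toy_hdi_rank)

# Spearman correlation on the raw values; the minus sign makes the largest value rank first
built_in <- cor(-toy_values, toy_hdi_rank, method = "spearman")

c(by_hand, built_in)
```

Both numbers agree here, so the built-in method could serve as a cross-check on the hand-made ranks.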

First, for the 'GNI_per_capita_in_USdollars' column, we have to remove the commas from the values within it. We use `gsub` to tell R that the commas (",") should be replaced with nothing (""). We then convert the values to numeric using `as.numeric`.

In [None]:
?gsub

In [150]:
class(hweb3$GNI_per_capita_in_USdollars)
gsub(",", "", hweb3$GNI_per_capita_in_USdollars)
gn = as.numeric(gsub(",", "", hweb3$GNI_per_capita_in_USdollars))
gn
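
The effect of the `gsub` step can be seen on a few made-up strings. In this sketch `toy_gni` is a hypothetical vector of figures written with thousands separators; it is not taken from the scraped table.

```r
# Hypothetical income figures formatted with thousands separators
toy_gni <- c("68,012", "57,625", "1,750")

# as.numeric() alone cannot parse the commas and returns NA (with a warning)
direct <- suppressWarnings(as.numeric(toy_gni))

# Removing the commas first gives the expected numbers
cleaned <- as.numeric(gsub(",", "", toy_gni))
cleaned
```

If a column contains no commas, `gsub` simply returns the strings unchanged, so the step is harmless either way.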

Finally, the 'GNI_per_capita_in_USdollars' column values have been converted to numbers.

One may ask why 'GNI_per_capita_rank_-_HDI_rank' was not used instead. Surely it would have been easier to just work from that particular column, instead of writing extra code to turn the 'GNI_per_capita' column values into numerical values. However, there are two reasons why the latter option was taken. One is that the column gives the GNI per capita rank minus the HDI rank, whereas the GNI per capita rank is supposed to be subtracted *from* the HDI rank. The second is that it is better to work from the underlying values than to rely on a column derived from them.

Now that the 'GNI_per_capita_in_USdollars' column values have been converted, the other two indicators, expected years of schooling and mean years of schooling, need to be stored as objects.

In [130]:
es = hweb3$Expected_years_of_schooling_in_years
ms = hweb3$Mean_years_of_schooling_in_years

In [162]:
?cor

The objects containing the values for each indicator are then ranked.

In [163]:
esrank = rank(desc(es), ties.method = "min")
msrank = rank(desc(ms), ties.method = "min")
gnrank = rank(desc(gn), ties.method = "min")
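
One thing worth checking at this step is the type of each vector. Columns scraped alongside header text come back as character, and ranking character values orders them alphabetically rather than numerically, which matters once values differ in their number of digits. The sketch below uses made-up schooling values (`toy_school` is hypothetical) and base R's `rank()`; whether `es` and `ms` are affected can be confirmed with `class(es)` and `class(ms)`.

```r
# Hypothetical years-of-schooling values stored as text, as a scraped column might be
toy_school <- c("17.9", "9.8", "12.6", "5.4")

# Ranking the text orders it alphabetically: "9.8" comes out as the largest value
rank_as_text <- rank(toy_school)

# Converting first gives the numeric ordering
rank_as_number <- rank(as.numeric(toy_school))

rbind(rank_as_text, rank_as_number)
```

If the vectors turn out to be character, wrapping them in `as.numeric()` before ranking, as was done for `hdirank` and `gn`, gives the numeric ordering.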

In [164]:
cor(esrank, hdirank)
cor(msrank, hdirank)
cor(gnrank, hdirank)


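The correlation coefficients between the HDI ranks and each indicator are as follows:
- the correlation coefficient of the HDI ranks and life expectancy ranks is 0.913455366176421.
- the correlation coefficient of the HDI ranks and expected schooling years' ranks is 0.304213266854488.
- the correlation coefficient of the HDI ranks and mean schooling years' ranks is -0.238679106739706.
- the correlation coefficient of the HDI ranks and the GNI per capita ranks is 0.951775940760708.

The two schooling coefficients should be read with caution: `es` and `ms` were ranked without first being converted with `as.numeric()`, so if those columns are stored as text (as the HDI rank column was), values such as "9.8" are ranked above "17.9" and the ranks do not follow the numeric order.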

## Discussion

To recap, the purpose of this assignment was to find out which of the four indicators making up the Human Development Index is most closely correlated with the HDI rank of each country. A rank was given to each value in each indicator and a correlation coefficient was determined between the ranks of the indicator and the ranks of the HDI values. The indicator with the correlation coefficient closest to 1 is the indicator most positively correlated with the HDI rank. Because the ‘GNI_per_capita_in_USdollars’ indicator has the coefficient closest to 1, with a rounded coefficient of 0.952, it is the best of the four indicators for predicting what the HDI rank will be.

This does not mean ‘GNI per capita’ is the perfect method for predicting HDI values or ranks or that one should only use ‘GNI per capita’ when assessing the living standards of a country. One possible explanation for the results is that the calculations used to work out the HDI values in effect put more weight on ‘GNI per capita’ than on the other three indicators. If so, these results would point to a part of the HDI calculation that could be adjusted in order to give equal weight to all indicators.

According to the Human Development Report’s [website](http://hdr.undp.org/en/faq-page/human-development-index-hdi#t292n287), from which the dataset was obtained, GNI per capita only reflects the average national income and does not show whether this translates to better health, education or other improvements in quality of life. However, this does not explain why there is such a strong correlation between GNI per capita and HDI. One interpretation is that a higher GNI per capita is actually a result of improved human development, rather than a cause or indicator of it, meaning a higher HDI score for a country will result in a higher GNI per capita.

However, a more [extensive study](https://www.tandfonline.com/doi/full/10.1080/1331677X.2017.138317) published in 2017 claimed that correlations between economic indicators and HDI are much less clear than might be thought. The study claimed that there is a weak correlation between GDP (Gross Domestic Product) and HDI but a stronger correlation between GDP per capita and HDI. The study then points out that the distribution of wealth and resources is a better indicator of human development than the mere accumulation of them. While the authors relied more on GDP per capita in their study, the fact that GNI per capita is [based on GDP per capita](https://en.wikipedia.org/wiki/Gross_national_income) could explain the high correlation achieved in this assignment – more highly developed countries generally have a higher GNI per capita due to a better distribution of wealth.

To summarise, of the four HDI indicators, GNI per capita is the one most highly correlated with the HDI ranks. Among the potential reasons for this are statistical bias, it being an outcome of the HDI value or it being a perfectly valid factor indicating a better distribution of wealth and resources, hence leading to a better HDI. However, GNI per capita is only one indicator and should not be used alone in assessing the living standards of a country.

Help for working the correlation coefficient in R came from http://www.r-tutor.com/elementary-statistics/numerical-measures/correlation-coefficient.