### Note 2
As mentioned previously, you can follow any approach to arrive at the final list of countries: either the binning approach, or the outlier-assignment approach, in which each outlier you found is assigned to the nearest of the given cluster means. Here we take the simpler route and use binning, starting with an indicator that shows good variability across the different clusters.
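
For comparison, the outlier-assignment alternative can be sketched as follows. This is a minimal illustration, assuming the indicators have been scaled the same way as during clustering. The function name `assign_to_nearest_mean` and the toy numbers are hypothetical and are not taken from the dataset or the earlier notebook.

```python
import numpy as np

def assign_to_nearest_mean(points, cluster_means):
    """Return, for each row of `points`, the index of the closest cluster mean."""
    points = np.asarray(points, dtype=float)
    cluster_means = np.asarray(cluster_means, dtype=float)
    # Pairwise Euclidean distances, shape (n_points, n_clusters).
    distances = np.linalg.norm(
        points[:, None, :] - cluster_means[None, :, :], axis=2
    )
    return distances.argmin(axis=1)

# Hypothetical, already-scaled values for two indicators.
cluster_means = np.array([[0.0, 0.0], [5.0, 5.0]])
outliers = np.array([[0.5, -0.2], [4.0, 6.0], [9.0, 9.0]])
labels = assign_to_nearest_mean(outliers, cluster_means)
```

Each entry of `labels` is the position of the nearest mean in `cluster_means`, so an outlier simply inherits the label of the cluster it most resembles.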

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [2]:
fin = pd.read_csv('Country-data.csv')
#Convert exports, imports and health spending from percentages of gdpp to absolute values.
fin['exports'] = fin['exports']*fin['gdpp']/100
fin['imports'] = fin['imports']*fin['gdpp']/100
fin['health'] = fin['health']*fin['gdpp']/100
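
Because the cell above overwrites the columns in place, running it twice would apply the conversion twice. One way to guard against that is a small helper that works on a copy. This is only a sketch: the name `pct_to_absolute` is hypothetical, and the toy percentages are chosen so that the results equal the `exports` values shown for the first two rows of `fin.head()`.

```python
import pandas as pd

def pct_to_absolute(df, pct_cols, base_col="gdpp"):
    """Return a copy of `df` with each percentage column scaled by `base_col`."""
    out = df.copy()
    for col in pct_cols:
        out[col] = out[col] * out[base_col] / 100
    return out

# Toy frame: 10% of 553 and 28% of 4090.
toy = pd.DataFrame({"exports": [10.0, 28.0], "gdpp": [553, 4090]})
converted = pct_to_absolute(toy, ["exports"])
```

The original `toy` frame is left untouched, so the helper can be re-run safely on the raw data.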

In [3]:
fin.head()

Unnamed: 0,country,child_mort,exports,health,imports,income,inflation,life_expec,total_fer,gdpp
0,Afghanistan,90.2,55.3,41.9174,248.297,1610,9.44,56.2,5.82,553
1,Albania,16.6,1145.2,267.895,1987.74,9930,4.49,76.3,1.65,4090
2,Algeria,27.3,1712.64,185.982,1400.44,12900,16.1,76.5,2.89,4460
3,Angola,119.0,2199.19,100.605,1514.37,5900,22.4,60.1,6.16,3530
4,Antigua and Barbuda,10.3,5551.0,735.66,7185.8,19100,1.44,76.8,2.13,12200


In [4]:
#Let's use the binning with gdpp first to see the list of countries which might be important.
#The upper limit that we got from the clustering process was 2300.
#Let's filter the complete dataset with 2300 as the cut-off limit for gdpp.
fin2=fin[fin['gdpp']<=2300]
fin2.head()

Unnamed: 0,country,child_mort,exports,health,imports,income,inflation,life_expec,total_fer,gdpp
0,Afghanistan,90.2,55.3,41.9174,248.297,1610,9.44,56.2,5.82,553
12,Bangladesh,49.4,121.28,26.6816,165.244,2440,7.14,70.4,2.33,758
17,Benin,111.0,180.404,31.078,281.976,1820,0.885,61.8,5.36,758
18,Bhutan,42.7,926.5,113.36,1541.26,6420,5.99,72.1,2.38,2180
19,Bolivia,46.6,815.76,95.832,679.14,5410,8.78,71.6,3.2,1980


In [5]:
len(fin2)

51

In [6]:
#So we got 51 countries here. We can create further sub-categories by taking another good clustering indicator. 
#Let's use the describe function to see how the variables are aligned now.
fin2.describe()

Unnamed: 0,child_mort,exports,health,imports,income,inflation,life_expec,total_fer,gdpp
count,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0
mean,82.196078,277.390932,55.64861,425.589061,2421.039216,8.701471,61.384314,4.456078,921.058824
std,38.227922,245.951433,36.811409,344.028228,1397.644203,5.720899,7.465906,1.39914,476.93733
min,17.2,1.07692,12.8212,0.651092,609.0,0.885,32.1,1.27,231.0
25%,55.75,102.4975,31.5132,185.067,1400.0,4.185,57.4,3.25,557.5
50%,80.3,180.404,45.7442,302.802,1990.0,7.64,61.7,4.75,769.0
75%,104.5,414.393,67.5305,497.965,3300.0,12.1,66.55,5.35,1300.0
max,208.0,943.2,190.71,1541.26,6420.0,23.6,73.1,7.49,2180.0


In [24]:
#From the clustering process we got the life expectancy to be at most 68 for the most downtrodden cluster.
#Let's see how many countries lie below that number
len(fin2[fin2['life_expec']<=68])

40

In [28]:
#We now have 40 countries. We can stop here, or take one more indicator to arrive at the final list.
#Here we take income as the next one; the median income of the downtrodden cluster was around 2330.
fin3=fin2[fin2['life_expec']<=68]
fin4=fin3[fin3['income']<2330]
len(fin4)

27
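
The three successive filters can also be written as a single boolean mask, which keeps the cut-offs in one place. The sketch below assumes the same thresholds as above (2300 for `gdpp`, 68 for `life_expec`, 2330 for `income`). The function name `shortlist` is hypothetical, and the sample rows are copied from the outputs shown earlier.

```python
import pandas as pd

def shortlist(df, gdpp_max=2300, life_expec_max=68, income_max=2330):
    """Keep rows that pass all three cut-offs used in this note."""
    mask = (
        (df["gdpp"] <= gdpp_max)
        & (df["life_expec"] <= life_expec_max)
        & (df["income"] < income_max)  # strict, matching the cell above
    )
    return df[mask]

# Four rows taken from the fin.head() and fin2.head() outputs.
sample = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Bangladesh", "Bhutan"],
    "gdpp": [553, 4090, 758, 2180],
    "life_expec": [56.2, 76.3, 70.4, 72.1],
    "income": [1610, 9930, 2440, 6420],
})
picked = shortlist(sample)
```

On the full data, `shortlist(fin)` should select the same rows as `fin4`, since it applies the same conditions in one step.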

In [29]:
#We're left with 27 countries, let's use the describe function to see how they're aligned again.
fin4.describe()

Unnamed: 0,child_mort,exports,health,imports,income,inflation,life_expec,total_fer,gdpp
count,27.0,27.0,27.0,27.0,27.0,27.0,27.0,27.0,27.0
mean,100.292593,142.13037,43.194341,295.217741,1386.777778,7.217222,57.859259,5.343333,617.222222
std,37.982505,120.760264,31.363995,255.449074,428.064547,5.248045,6.52482,0.908024,285.668492
min,28.1,20.6052,12.8212,90.552,609.0,0.885,32.1,3.33,231.0
25%,76.1,72.408,28.52205,169.535,1110.0,2.79,56.35,4.81,432.5
50%,90.5,110.4,37.332,204.282,1410.0,6.39,58.7,5.31,562.0
75%,116.0,170.914,46.1196,292.389,1695.0,10.02,61.25,5.845,705.0
max,208.0,635.97,168.37,1190.51,2180.0,20.8,65.9,7.49,1490.0


In [30]:
#The final list of countries 
fin4

Unnamed: 0,country,child_mort,exports,health,imports,income,inflation,life_expec,total_fer,gdpp
0,Afghanistan,90.2,55.3,41.9174,248.297,1610,9.44,56.2,5.82,553
17,Benin,111.0,180.404,31.078,281.976,1820,0.885,61.8,5.36,758
25,Burkina Faso,116.0,110.4,38.755,170.2,1430,6.81,57.9,5.87,575
26,Burundi,93.6,20.6052,26.796,90.552,764,12.3,57.7,6.26,231
31,Central African Republic,149.0,52.628,17.7508,118.19,888,2.01,47.5,5.21,446
32,Chad,150.0,330.096,40.6341,390.195,1930,6.39,56.5,6.59,897
36,Comoros,88.2,126.885,34.6819,397.573,1410,3.87,65.9,4.75,769
37,"Congo, Dem. Rep.",116.0,137.274,26.4194,165.664,609,20.8,57.5,6.54,334
50,Eritrea,55.2,23.0878,12.8212,112.306,1420,11.6,61.7,4.61,482
56,Gambia,80.3,133.756,31.9778,239.974,1660,4.3,65.5,5.71,562


#### Final Remarks
Hence, the countries in the final list above (the 27 that pass the gdpp, life expectancy and income cut-offs) should be given the most focus.

#### Additional Remarks (Non-Evaluative)

A similar heuristic is used by the United Nations to calculate the Human Development Index (HDI), which ranks countries on the basis of their development. It takes 3 measures: life expectancy, education (years of schooling; older versions of the index used the literacy rate) and gross national income per capita. The logic behind using these 3 variables is that they directly affect most of the other variables that determine a country's development. For example, if you take a dataset that contains many different socio-economic variables, those variables would be heavily correlated with at least one of the 3 factors mentioned previously.
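
To make the comparison concrete, here is a rough sketch of how an HDI-style score combines the three dimensions into one number. It follows my understanding of the UNDP's recent methodology (a geometric mean of three normalised indices, with education measured by years of schooling). The goalposts hard-coded below (20 to 85 years, 18 and 15 years of schooling, 100 to 75,000 for income) are assumptions that should be checked against the UNDP technical notes, and `hdi_sketch` is a hypothetical name.

```python
import math

def _clamp01(x):
    """Keep a dimension index inside the [0, 1] range."""
    return max(0.0, min(1.0, x))

def hdi_sketch(life_expec, expected_school_years, mean_school_years, gni_per_capita):
    """Geometric mean of health, education and income indices (assumed goalposts)."""
    health = _clamp01((life_expec - 20) / (85 - 20))
    education = (
        _clamp01(expected_school_years / 18) + _clamp01(mean_school_years / 15)
    ) / 2
    # Income enters on a log scale; gni_per_capita must be positive.
    income = _clamp01(
        (math.log(gni_per_capita) - math.log(100))
        / (math.log(75000) - math.log(100))
    )
    return (health * education * income) ** (1 / 3)

# Hypothetical inputs, not taken from the dataset.
score = hdi_sketch(70, 12, 8, 10000)
```

Because the dimensions are multiplied, a very low value in any one of them pulls the whole score down, which is similar in spirit to requiring a country to fall below every cut-off in the binning above.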

In this assignment, however, we find those indicators ourselves by using the variance inflation factor (VIF). Then, instead of using a metric like HDI, we cluster the countries on the basis of those important indicators. The clusters that are formed are similar to the HDI ranges used to denote development. To read more on it, see http://hdr.undp.org/en/content/human-development-index-hdi. A list of the countries ranked according to HDI is available at http://hdr.undp.org/en/composite/HDI