Clement NICOLLE

## Decision Theory assignment
In this study we are interested in ranking all the countries of the world according to three main criteria: Health, Education and Standard of Living.

### Loading and preprocessing the data

In [1]:
# Import the libraries used in this case study

import pandas as pd
import numpy as np
import scipy as sp
from scipy.optimize import linprog

In [2]:
# Load the 2021-2022 HDI available on 
# https://hdr.undp.org/en/content/human-development-index-hdi

df = pd.read_excel('./HDR21-22_Statistical_Annex_HDI_Table.xlsx', 
                   usecols='B,C,E,G,I,K', skiprows=4)
df['Country'] = df['Unnamed: 1'] 
df.drop('Unnamed: 1', axis=1, inplace = True)
df.drop(0, inplace = True) #first line is empty
df.drop(list(range(198,202)), inplace = True) #some data are incomplete
                                              #(this is the case for 4 countries)

df.drop(list(range(204,225)), inplace = True) #these lines are not countries
                                              #but groups of countries

df.dropna(inplace=True)
df.set_index('Country', inplace=True)
df.head()

Country,Human Development Index (HDI),Life expectancy at birth,Expected years of schooling,Mean years of schooling,Gross national income (GNI) per capita
Switzerland,0.962,83.9872,16.500299,13.85966,66933.00454
Norway,0.961,83.2339,18.1852,13.00363,64660.10622
Iceland,0.959,82.6782,19.163059,13.76717,55782.04981
"Hong Kong, China (SAR)",0.952,85.4734,17.27817,12.22621,62606.8454
Australia,0.951,84.5265,21.05459,12.72682,49238.43335


In [3]:
# From the previous raw DataFrame create a new, more usable one, mainly:
# - rename the columns according to the assignment
# - compute the Education index as the mean of expected and mean years 
# of schooling
# - add a column for the HDI ranking

df_hes = pd.DataFrame(df.index)
df_hes['HDI'] = df['Human Development Index (HDI) '].values
df_hes['Health'] = df['Life expectancy at birth'].values
df_hes['Education'] = df.apply(lambda x : (x['Expected years of schooling']+x['Mean years of schooling'])/2, axis=1).values
df_hes['Standard of living'] = df['Gross national income (GNI) per capita'].values
df_hes = df_hes.sort_values('HDI', ascending = False)
df_hes.reset_index(inplace =True)
df_hes['HDI ranking']=df_hes['index'].apply(lambda x : x+1)
df_hes.drop('index',axis=1, inplace=True)
df_hes.set_index('Country', inplace=True)
df_hes.head()

Country,HDI,Health,Education,Standard of living,HDI ranking
Switzerland,0.962,83.9872,15.17998,66933.00454,1
Norway,0.961,83.2339,15.594415,64660.10622,2
Iceland,0.959,82.6782,16.465115,55782.04981,3
"Hong Kong, China (SAR)",0.952,85.4734,14.75219,62606.8454,4
Australia,0.951,84.5265,16.890705,49238.43335,5


In [4]:
# Normalize the different criteria 

df_hes['Health'] = df_hes.Health/df_hes.Health.max()
df_hes['Education'] = df_hes.Education/df_hes.Education.max()
df_hes['Standard of living'] = df_hes['Standard of living']/df_hes['Standard of living'].max()

### First Method :  Using Direct Comparisons of Subset of Alternatives to Infer Criteria Weights

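Let $C_{2k} \succ C_{2k+1}$ be a set of comparisons between some countries.

If we denote by $h_{2k}, e_{2k}, s_{2k}$ the health, education and standard of living of country $C_{2k}$ (resp. $C_{2k+1}$), by $w_h, w_e, w_s$ the associated weights to be determined, by $e_k$ the error variable of the $k$-th comparison (not to be confused with the education values $e_{2k}, e_{2k+1}$) and by $\delta_k$ the required score margin, we can write the optimization problem as: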

$$\min \sum\limits_{k} e_k$$
under the constraints : 
$$h_{2k}*w_h+ e_{2k}*w_e+ s_{2k}*w_s-(h_{2k+1}*w_h+ e_{2k+1}*w_e+ s_{2k+1}*w_s)+e_k\geq\delta_k$$
$$w_h+ w_e+ w_s = 1$$
$$w_h, w_e, w_s\geq 0$$
$$e_k\geq0$$

To solve such a linear problem, the library scipy.optimize provides the wonderful function linprog but we first need to rewrite the previous system as :

$$\min c^T.w$$
under the constraints :
$$A_{ub}.w\leq b_{ub}$$
$$A_{eq}.w = b_{eq}$$

where $w=\begin{pmatrix}w_h \\ w_e \\ w_s \\ \vdots \\ e_k \\ \vdots \end{pmatrix}, c=\begin{pmatrix}0 \\ 0 \\ 0 \\ 1\\ \vdots \\ 1\end{pmatrix}, b_{ub}=\begin{pmatrix} \vdots \\ -\delta_k \\ \vdots \end{pmatrix}, A_{eq} = \begin{pmatrix} 1&1&1&0&\cdots&0\end{pmatrix}$

$A_{ub}= \begin{pmatrix} \vdots & \vdots & \vdots & -1 & 0 & \cdots & 0\\ h_{2k+1}-h_{2k} & e_{2k+1}-e_{2k} & s_{2k+1}-s_{2k} & 0 & \ddots & \ddots & \vdots\\ \vdots & \vdots & \vdots & \vdots & \ddots & \ddots & 0\\ \vdots & \vdots & \vdots & 0 & \cdots & 0 & -1 \end{pmatrix} $ and  $b_{eq} = \begin{pmatrix} 1 \end{pmatrix}$
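To make the matrix form above concrete, here is a minimal sketch on toy data. The function name `infer_weights`, the toy values and the choice of `method="highs"` are assumptions made for illustration; they are not the data or the solver setting used in the cells of this notebook.

```python
import numpy as np
from scipy.optimize import linprog

def infer_weights(better, worse, delta=0.1):
    """Infer criteria weights from comparisons better[k] > worse[k].

    better, worse: arrays of shape (K, m) holding normalized criteria values.
    Returns (weights, errors). Hypothetical helper, for illustration only.
    """
    better = np.asarray(better, dtype=float)
    worse = np.asarray(worse, dtype=float)
    K, m = better.shape
    # Inequality rows: (worse - better) . w - e_k <= -delta
    A_ub = np.hstack([worse - better, -np.eye(K)])
    b_ub = -delta * np.ones(K)
    # Equality row: the weights sum to one (error variables excluded)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, K))])
    b_eq = np.array([1.0])
    # Objective: minimize the sum of the error variables only
    c = np.concatenate([np.zeros(m), np.ones(K)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, method="highs")
    return res.x[:m], res.x[m:]

# Toy data (assumed values): two comparisons over three criteria
better = [[0.9, 0.5, 0.5], [0.6, 0.9, 0.4]]
worse = [[0.5, 0.6, 0.5], [0.6, 0.5, 0.5]]
weights, errors = infer_weights(better, worse, delta=0.1)
print("weights:", weights, "errors:", errors)
```

The default bounds of `linprog` are `(0, None)` for every variable, which is why the non-negativity constraints on the weights and on the errors need no extra rows.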

In [5]:
# Create a list of comparisons couples

comparisons = [('Oman','Brazil'),('Ireland', 'Portugal'),
               ('Türkiye', 'Ukraine'), ('Zimbabwe', 'Haiti'),
               ('Japan','Estonia'),('Algeria','Panama'),
              ('Kenya','India'),('Peru', 'Romania')]

In [6]:
# Define the value of the parameter delta

delta_ = 0.1

In [7]:
# Create the matrices used by the linprog function

Aub = np.zeros((len(comparisons),len(comparisons)+3))
for i,couple in enumerate(comparisons):
    best,worse = couple  # best is the country preferred to worse
    Aub[i][0] = df_hes.loc[worse].Health - df_hes.loc[best].Health
    Aub[i][1] = df_hes.loc[worse].Education - df_hes.loc[best].Education
    Aub[i][2] = df_hes.loc[worse]['Standard of living'] - df_hes.loc[best]['Standard of living']
    Aub[i][3+i] = -1  # coefficient of the error variable e_i

bub = - delta_ * np.ones(len(comparisons))

Aeq = np.zeros((1,len(comparisons)+3))
Aeq[0][0] = 1
Aeq[0][1] = 1
Aeq[0][2] = 1

beq = np.array([1])

c = np.ones(len(comparisons)+3)
c[0] = 0
c[1] = 0
c[2] = 0

In [8]:
# Solve the linear problem and store the solution weights 

# Note: method='simplex' was removed in SciPy 1.11; method='highs' replaces it there
res = linprog(c, A_ub=Aub, b_ub=bub, A_eq=Aeq, b_eq=beq, method='simplex', options={"disp": True})
w = res.x[0:3]
w

Optimization terminated successfully.
         Current function value: 0.583074    
         Iterations: 11


array([0.26926672, 0.67999576, 0.05073751])

In [9]:
# Calculate for each country its score and determine its new rank

df_hes['Result']= df_hes.apply(lambda x : w[0]*x.Health + w[1]*x.Education +
                              w[2]*x['Standard of living'], axis=1)
df_hes.sort_values('Result', ascending=False, inplace = True)
df_hes.reset_index(inplace =True) # move 'Country' back to a regular column
df_hes.reset_index(inplace =True) # add an 'index' column holding the sorted position
df_hes['New ranking']=df_hes['index'].apply(lambda x : x+1)
df_hes.drop('index',axis=1, inplace=True)
df_hes.set_index('Country', inplace=True)
df_hes.head()

Country,HDI,Health,Education,Standard of living,HDI ranking,Result,New ranking
Australia,0.951,0.988922,1.0,0.335344,5,0.963294,1
New Zealand,0.937,0.964643,0.983635,0.300057,14,0.943838,2
Iceland,0.959,0.967297,0.974803,0.37991,3,0.942599,3
Sweden,0.947,0.970867,0.948103,0.371106,7,0.924957,4
Belgium,0.937,0.957944,0.946672,0.35615,13,0.919745,5


#### Question 1

As we a priori want to give the same importance to differences in value for every criterion, I took the same value for each $\delta_k$.

After several tries, the value 0.1 seemed to be the most relevant.
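The "several tries" can be organized as a small sweep over $\delta$. The sketch below uses made-up criteria differences (`toy_diffs`), a hypothetical helper `total_error` and `method="highs"`, so it only illustrates the procedure and does not reproduce the values of this notebook.

```python
import numpy as np
from scipy.optimize import linprog

def total_error(diffs, delta):
    """Minimal total error for comparisons with criteria differences `diffs`.

    diffs[k] = (criteria of the preferred country) - (criteria of the other one).
    Hypothetical helper, for illustration only.
    """
    diffs = np.asarray(diffs, dtype=float)
    K, m = diffs.shape
    c = np.concatenate([np.zeros(m), np.ones(K)])          # minimize the errors
    A_ub = np.hstack([-diffs, -np.eye(K)])                 # -diffs.w - e <= -delta
    b_ub = -delta * np.ones(K)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, K))])  # weights sum to one
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], method="highs")
    return res.fun

# Toy differences (assumed values); the second comparison is dominated
toy_diffs = [[0.20, -0.10, 0.05], [-0.05, -0.02, -0.01], [0.01, 0.15, -0.03]]
deltas = [0.0, 0.01, 0.05, 0.1, 0.2]
sweep = {d: total_error(toy_diffs, d) for d in deltas}
for d, err in sweep.items():
    print(f"delta = {d:<5} total error = {err:.4f}")
```

Since a larger $\delta$ only shrinks the feasible set, the total error can never decrease when $\delta$ grows; the sweep shows how quickly it increases.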

#### Question 2

In [10]:
# extract the new rank of countries used to infer weights

original_countries = np.concatenate(np.array(comparisons))
df_hes.loc[original_countries][['Health','Education', 
                                'Standard of living','New ranking']]

Country,Health,Education,Standard of living,New ranking
Oman,0.848692,0.776389,0.184256,55
Brazil,0.851147,0.702394,0.097868,93
Ireland,0.959335,0.903676,0.518757,10
Portugal,0.948182,0.782891,0.225803,44
Türkiye,0.889545,0.798408,0.211352,49
Ukraine,0.837968,0.772259,0.090278,63
Zimbabwe,0.693234,0.616371,0.025948,135
Haiti,0.739322,0.45155,0.019393,167
Japan,0.991933,0.846256,0.287914,24
Estonia,0.902545,0.872592,0.259132,25


Our optimal solution exhibits 3 inconsistencies (out of 8), which are Panama $\succ$ Algeria, India $\succ$ Kenya and Romania $\succ$ Peru.

The last two inconsistencies are not surprising: all **three** criteria of India (resp. Romania) are higher than those of Kenya (resp. Peru), so no choice of non-negative weights can rank Kenya above India (resp. Peru above Romania).
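Both checks, finding the comparisons contradicted by the scores and testing whether one country dominates another on every criterion, can be sketched as follows. The helpers `dominates` and `violated` and the toy values are assumptions for illustration, not the HDI data.

```python
import numpy as np

def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return bool(np.all(a >= b) and np.any(a > b))

def violated(comparisons, scores):
    """Return the stated comparisons (preferred, other) contradicted by the scores."""
    return [(p, o) for p, o in comparisons if scores[p] <= scores[o]]

# Toy values (assumed): [health, education, standard of living]
criteria = {"A": [0.9, 0.8, 0.3], "B": [0.7, 0.6, 0.2], "C": [0.95, 0.7, 0.1]}
w = np.array([0.3, 0.6, 0.1])
scores = {name: float(w @ np.array(vals)) for name, vals in criteria.items()}

stated = [("B", "A"), ("A", "C")]  # stated preferences: B > A and A > C
for preferred, other in violated(stated, scores):
    hopeless = dominates(criteria[other], criteria[preferred])
    print(f"{preferred} > {other} is violated; dominated pair: {hopeless}")
```

When `dominates` returns `True` for a violated pair, the inconsistency comes from the stated preference itself rather than from the choice of weights.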

#### Question 3

In [11]:
print('Given delta = %f, the optimal weights are %f for the   Health, %f for the Education and %f for the standard of living' % (delta_, w[0], w[1], w[2]))

Given delta = 0.100000, the optimal weights are 0.269267 for the   Health, 0.679996 for the Education and 0.050738 for the standard of living


#### Question 4

In [12]:
# Load statistics of Canada

df_hes.loc['Canada']

HDI                      0.936
Health                0.967044
Education             0.894891
Standard of living    0.318791
HDI ranking                 15
Result                 0.88509
New ranking                 15
Name: Canada, dtype: object

Canada ranks 15th with this method. Interestingly, this is the same rank as in the HDI ranking!

### Second Method : Applying Data Envelopment Analysis 

In our case study, the higher the values of a country's criteria, the better the country is ranked. In the context of Data Envelopment Analysis (DEA), this means that all three criteria are considered as outputs and there is no input. Applying DEA with an output-oriented model and assuming variable returns to scale (VRS) gives, for every country $o$, the following linear problem to solve:

$$\max \phi + \epsilon\cdot(r_h^++r_e^++r_s^+)$$
under the constraints : 
$$\sum\limits_{k=1}^n \lambda_k h_k -r_h^+ = \phi h_o$$
$$\sum\limits_{k=1}^n \lambda_k e_k -r_e^+ = \phi e_o$$
$$\sum\limits_{k=1}^n \lambda_k s_k -r_s^+ = \phi s_o$$
$$\sum\limits_{k=1}^n \lambda_k=1$$
$$\lambda_k\geq0, \quad r_h^+, r_e^+, r_s^+\geq0$$
where $n$ is the number of countries, $h_k,e_k,s_k$ are the criteria values of country $k$, $r_h^+, r_e^+, r_s^+$ are the output slacks and $\epsilon$ is an arbitrarily small positive number.

Once again, it can be solved with the linprog function by rewriting the previous problem as:

$$\min c^T.w$$
under the constraints :
$$A_{eq}.w = b_{eq}$$

where $w=\begin{pmatrix}\phi \\ r_h^+ \\ r_e^+ \\ r_s^+ \\ \vdots \\ \lambda_k \\ \vdots \end{pmatrix}, c=\begin{pmatrix}-1 \\ -\epsilon \\ -\epsilon \\ -\epsilon \\ \vdots \\ 0 \\ \vdots \end{pmatrix},b_{eq} = \begin{pmatrix} 0\\0\\0\\1 \end{pmatrix}$ and 
$A_{eq} = \begin{pmatrix} h_o & 1 & 0 & 0 & \cdots & -h_k & \cdots\\ e_o & 0 & 1 & 0 & \cdots & -e_k & \cdots\\ s_o & 0 & 0 & 1 & \cdots & -s_k & \cdots\\ 0 & 0 & 0 & 0 & 1 & \cdots & 1 \end{pmatrix} $
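As a sanity check of this formulation, here is a minimal sketch on a toy problem with two outputs, where the frontier can be verified by hand. The function `dea_output_vrs`, the toy matrix `Y`, the value `eps=1e-6` and `method="highs"` are assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def dea_output_vrs(Y, o, eps=1e-6):
    """Output-oriented VRS DEA without inputs, for unit `o`.

    Y: array (n, m) of outputs. Returns (phi, slacks, lambdas).
    Hypothetical helper, for illustration only.
    """
    Y = np.asarray(Y, dtype=float)
    n, m = Y.shape
    # Variable order: [phi, r_1..r_m, lambda_1..lambda_n]
    c = np.concatenate([[-1.0], -eps * np.ones(m), np.zeros(n)])
    A_eq = np.zeros((m + 1, 1 + m + n))
    A_eq[:m, 0] = Y[o]               # phi * y_o
    A_eq[:m, 1:1 + m] = np.eye(m)    # + slack
    A_eq[:m, 1 + m:] = -Y.T          # - sum_k lambda_k * y_k
    A_eq[m, 1 + m:] = 1.0            # the lambdas sum to one (VRS)
    b_eq = np.concatenate([np.zeros(m), [1.0]])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, method="highs")
    return res.x[0], res.x[1:1 + m], res.x[1 + m:]

# Toy outputs (assumed): units 0 and 1 span the frontier, units 2 and 3 lie below it
Y = np.array([[4.0, 1.0], [1.0, 4.0], [2.0, 2.0], [1.0, 1.0]])
phis = [dea_output_vrs(Y, o)[0] for o in range(len(Y))]
for o, phi in enumerate(phis):
    print(f"unit {o}: phi = {phi:.4f}, efficiency = {1 / phi:.4f}")
```

The efficiency of a unit is $1/\phi$, so a unit lies on the frontier when $\phi = 1$ and all slacks are zero.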

In [13]:
# From the previous raw DataFrame df create a new, more usable one, mainly:
# - rename the columns according to the assignment
# - compute the Education index as the mean of expected and mean years 
# of schooling
# This time, values are not normalized 

df_hes = pd.DataFrame(df.index)
df_hes['Health'] = df['Life expectancy at birth'].values
df_hes['Education'] = df.apply(lambda x : (x['Expected years of schooling']+x['Mean years of schooling'])/2, axis=1).values
df_hes['Standard of living'] = df['Gross national income (GNI) per capita'].values
df_hes.set_index('Country', inplace=True)
df_hes.head()

Country,Health,Education,Standard of living
Switzerland,83.9872,15.17998,66933.00454
Norway,83.2339,15.594415,64660.10622
Iceland,82.6782,16.465115,55782.04981
"Hong Kong, China (SAR)",85.4734,14.75219,62606.8454
Australia,84.5265,16.890705,49238.43335


In [14]:
# Define the arbitrary small number epsilon

epsilon = 10**(-10)

In [15]:
# Solve these linear problems and store the results

efficiencies = []
results = np.zeros((df_hes.shape[0],df_hes.shape[0]+4))

for n in range(df_hes.shape[0]):
    # for each country, define their associated matrices
    
    c = np.zeros(df_hes.shape[0]+4)
    c[0] = -1
    c[1] = -epsilon
    c[2] = -epsilon
    c[3] = -epsilon

    Aeq = np.zeros((4, df_hes.shape[0]+4))
    Aeq[0][0] = df_hes.iloc[n].Health
    Aeq[0][1] = 1
    Aeq[1][0] = df_hes.iloc[n].Education
    Aeq[1][2] = 1
    Aeq[2][0] = df_hes.iloc[n]['Standard of living']
    Aeq[2][3] = 1
    for k in range(df_hes.shape[0]):
        Aeq[0][k+4] = - df_hes.iloc[k].Health
        Aeq[1][k+4] = - df_hes.iloc[k].Education
        Aeq[2][k+4] = - df_hes.iloc[k]['Standard of living']
        Aeq[3][k+4] = 1

    beq = np.zeros((4))
    beq[3] = 1
    
    # Solve the linear problem for country n
    res = linprog(c, A_eq=Aeq, b_eq=beq, method='simplex', options={"disp": False})
    w = res.x
    efficiencies.append(1/w[0]) #store efficiency
    results[n] = w #store other parameters



In [16]:
# Add the Efficiency column to the DataFrame

df_hes['Efficiency'] = efficiencies

#### Question 1

In [17]:
# Extract countries lying on efficient frontier 

df_hes[df_hes['Efficiency']==1]

Country,Health,Education,Standard of living,Efficiency
"Hong Kong, China (SAR)",85.4734,14.75219,62606.8454,1.0
Australia,84.5265,16.890705,49238.43335,1.0
Liechtenstein,83.2575,13.860055,146829.7006,1.0


Three countries lie on the efficient frontier: Hong Kong, Australia and Liechtenstein.
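The filter on `Efficiency == 1` relies on an exact floating-point comparison. A possible safeguard, sketched here on made-up efficiencies, is to compare within a tolerance; the tolerance `1e-8` and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy efficiencies (assumed values), one of them affected by rounding error
toy = pd.DataFrame(
    {"Efficiency": [1.0, 0.9999999999, 0.98, 0.75]},
    index=["A", "B", "C", "D"],
)

# Exact comparison: misses the unit whose efficiency is 1 up to rounding
exact = toy[toy["Efficiency"] == 1]

# Tolerant comparison: absolute tolerance only, so the threshold is explicit
tolerant = toy[np.isclose(toy["Efficiency"], 1.0, rtol=0, atol=1e-8)]

print(list(exact.index), list(tolerant.index))
```

Whether a tolerance is needed depends on the solver: a vertex solution often returns exactly 1.0 for frontier units, but this is not guaranteed.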

#### Question 2

In [18]:
# Extract from the results array the efficient target of each country
# and the convex coefficients of the efficient reference set (ERS)

# Column of each efficient country in `results`
# (offset 4: phi and the three slacks come first)
hk_index = 4 + df_hes.index.get_loc('Hong Kong, China (SAR)')
aus_index = 4 + df_hes.index.get_loc('Australia')
lie_index = 4 + df_hes.index.get_loc('Liechtenstein')

eff_target_h = []
eff_target_e = []
eff_target_s = []
coeff_hk = []
coeff_aus = []
coeff_lie = []

for n in range(df_hes.shape[0]):
    eff_target_h.append(results[n][0]*df_hes.iloc[n].Health+results[n][1])
    eff_target_e.append(results[n][0]*df_hes.iloc[n].Education+results[n][2])
    eff_target_s.append(results[n][0]*df_hes.iloc[n]['Standard of living']+results[n][3])
    coeff_hk.append(results[n][hk_index])
    coeff_aus.append(results[n][aus_index])
    coeff_lie.append(results[n][lie_index])

In [19]:
# Add these information to the DataFrame

df_hes['Efficient target health'] = eff_target_h
df_hes['Efficient target education'] = eff_target_e
df_hes['Efficient target standard of living'] = eff_target_s
df_hes['Lambda Hong Kong'] = coeff_hk
df_hes['Lambda Australia'] = coeff_aus
df_hes['Lambda Liechtenstein'] = coeff_lie

In [20]:
df_hes.loc['New Zealand']

Health                                     82.4513
Education                                 16.61429
Standard of living                     44057.31394
Efficiency                                0.983635
Efficient target health                    84.5265
Efficient target education               16.890705
Efficient target standard of living    49238.43335
Lambda Hong Kong                               0.0
Lambda Australia                               1.0
Lambda Liechtenstein                           0.0
Name: New Zealand, dtype: object

The Efficiency rating of New Zealand is 0.983635.

According to the DEA results, it is not efficient. The target values for its three criteria are 84.5265 for health, 16.890705 for education and 49238.43335 for standard of living.

These targets are achieved by the following convex combination of the members of the efficient reference set (ERS):

$$\hat{x}_{\text{New Zealand}} = 0 \cdot x_{\text{Hong Kong}} + 1 \cdot x_{\text{Australia}} + 0 \cdot x_{\text{Liechtenstein}}$$

where $x_{\text{Country}}$ represents any of the three criteria values of a country and $\hat{x}_{\text{New Zealand}}$ the corresponding efficient target of New Zealand.
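The relation between the efficiency, the slacks and the targets can be checked by hand with the values reported in the outputs above. This is only a verification sketch; the variable names are mine and the numbers are copied from the tables, not recomputed by the solver.

```python
import numpy as np

# Values copied from the outputs above: [health, education, standard of living]
ers = np.array([
    [85.4734, 14.75219, 62606.8454],     # Hong Kong, China (SAR)
    [84.5265, 16.890705, 49238.43335],   # Australia
    [83.2575, 13.860055, 146829.7006],   # Liechtenstein
])
lambdas = np.array([0.0, 1.0, 0.0])      # coefficients reported for New Zealand
new_zealand = np.array([82.4513, 16.61429, 44057.31394])
efficiency = 0.983635                    # reported efficiency, i.e. 1 / phi

target = lambdas @ ers                   # convex combination of the ERS members
radial = new_zealand / efficiency        # phi * x_o, the radial expansion
slacks = target - radial                 # remaining slack r^+ per criterion

print("target:", target)
print("slacks:", slacks)
```

With these numbers the education slack comes out close to zero, which suggests that education is the criterion fixing the value of $\phi$, while health and standard of living keep a positive slack.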