# GHSI puzzle
## Variable transformation


Loading the libraries and the two GHSI data sets:


In [1]:
import pandas as pd
import numpy as np
from scipy.stats import boxcox
import sys
sys.path.append('./functions')
from DnRoutlier import DnRoutlier

ghsi_2019_df = pd.read_csv('data/mr_GHSI_demo_data_transf_out.csv')
ghsi_2021_df = pd.read_csv('data/GHSI_2021_large_table.csv', sep="\t")

ghsi_2019_df = ghsi_2019_df.set_index("Row")
ghsi_2021_df = ghsi_2021_df.set_index("Country")



Filtering by country:

In [2]:
common = ghsi_2019_df.index.intersection(ghsi_2021_df.index)
ghsi_2019_df = ghsi_2019_df.loc[common]
ghsi_2021_df = ghsi_2021_df.loc[common]

# Checking if the two are filtered correctly

ghsi_2019_df.index.equals(ghsi_2021_df.index)



True
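The filtering step can be tried on toy data. The sketch below is only an illustration: the two small frames and their values are made up and do not come from the GHSI files. It shows that selecting both frames with the same intersected index leaves them with identical row labels in identical order.

```python
import pandas as pd

# Toy frames standing in for the 2019 and 2021 tables (values are made up).
df_a = pd.DataFrame({"score": [1.0, 2.0, 3.0]},
                    index=["Austria", "Brazil", "Chile"])
df_b = pd.DataFrame({"score": [4.0, 5.0, 6.0]},
                    index=["Chile", "Austria", "Denmark"])

# Keep only the labels present in both frames.
common = df_a.index.intersection(df_b.index)
df_a = df_a.loc[common]
df_b = df_b.loc[common]

# Both frames were selected with the same index object,
# so their rows now line up label by label.
same_order = df_a.index.equals(df_b.index)
print(sorted(common), same_order)
```

Because `.loc[common]` reorders each frame to follow `common`, the row order of the original files does not matter.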

Pulling the country and predictor names into lists for later use:

In [3]:
country_names = ghsi_2021_df.index.tolist()
col_names = ghsi_2021_df.columns.to_list()

Converting the table into a numeric matrix for further manipulation. We prepare the GHSI 2021 data for a Box-Cox transformation so that it can be used for PCA. The GHSI 2019 data were already cleaned and prepared by Markovic et al. (2022) and are taken as is.

In [4]:
ghsi_2021_matrix = ghsi_2021_df.to_numpy()


# Preparing data for boxcox transformation

ghsi_2021_matrix_transformed = np.zeros(ghsi_2021_matrix.shape)
lambdas = np.zeros(ghsi_2021_matrix.shape[1])

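# Box-Cox requires strictly positive input, so shift the whole matrix
# if its global minimum is zero or negative
min_val = np.min(ghsi_2021_matrix)

if min_val <= 0:
    ghsi_2021_matrix = ghsi_2021_matrix - min_val + np.finfo(float).eps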

for i in range(ghsi_2021_matrix.shape[1]):
    transformed_column, lambda_val = boxcox(ghsi_2021_matrix[:, i])
    ghsi_2021_matrix_transformed[:, i] = transformed_column
    lambdas[i] = lambda_val
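To make the per-column step concrete, here is a minimal sketch on a single synthetic column. The data are randomly generated (an assumption for illustration, not GHSI values). It shows that `scipy.stats.boxcox` returns both the transformed values and the fitted lambda, that the result matches the textbook formula `(x**lambda - 1) / lambda`, and that `scipy.special.inv_boxcox` undoes the transformation when the lambda is kept.

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# Synthetic right-skewed, strictly positive column (illustrative only).
rng = np.random.default_rng(0)
column = rng.gamma(shape=2.0, scale=1.0, size=200)

# With no lambda supplied, boxcox estimates it by maximum likelihood
# and returns the transformed data together with that estimate.
transformed, lam = boxcox(column)

# Textbook definition of the transform; the log branch covers lambda == 0.
if lam != 0:
    manual = (column**lam - 1.0) / lam
else:
    manual = np.log(column)

# Keeping lambda allows mapping values back to the original scale.
recovered = inv_boxcox(transformed, lam)

print(np.allclose(transformed, manual), np.allclose(recovered, column))
```

This is why the `lambdas` array is worth keeping: without the fitted lambda for each predictor, the transformed scores cannot be converted back to the original scale.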

Using DnRoutlier (Detect and Replace outlier) to replace outliers with the median of the corresponding predictor:

In [5]:
ghsi_2021_matrix_transformed = DnRoutlier(ghsi_2021_matrix_transformed)
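The source of `DnRoutlier` is not shown in this notebook, so its detection rule is unknown here. As a rough illustration of the general idea only, the sketch below uses a hypothetical function, `replace_outliers_with_median`, which flags values more than three scaled median absolute deviations from the column median and replaces them with that median. Both the function name and the threshold are assumptions, and this may differ from what `DnRoutlier` actually does.

```python
import numpy as np

def replace_outliers_with_median(matrix, n_mads=3.0):
    """Hypothetical stand-in for DnRoutlier (not the real implementation).

    Flags values farther than n_mads scaled MADs from the column median
    and replaces them with that column's median.
    """
    matrix = np.asarray(matrix, dtype=float)
    result = matrix.copy()
    medians = np.median(matrix, axis=0)
    deviations = np.abs(matrix - medians)
    # 1.4826 scales the MAD to match a standard deviation for normal data.
    mads = 1.4826 * np.median(deviations, axis=0)
    outliers = deviations > n_mads * mads
    # Broadcast each column's median into the flagged positions.
    result[outliers] = np.broadcast_to(medians, matrix.shape)[outliers]
    return result

# Toy matrix: the value 100.0 in the first column is an obvious outlier.
demo = np.array([[1.0, 10.0],
                 [2.0, 11.0],
                 [3.0, 12.0],
                 [4.0, 13.0],
                 [100.0, 14.0]])
cleaned = replace_outliers_with_median(demo)
print(cleaned)
```

Replacing with the median rather than dropping rows keeps every country in the table, which matters when the 2019 and 2021 data must stay aligned row by row.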

Saving the transformed matrix into a table:

In [9]:
ghsi_2021_table_transformed = pd.DataFrame(
    ghsi_2021_matrix_transformed,
    index=country_names,
    columns=col_names
)

ghsi_2021_table_transformed



| | OVERALL SCORE | 1) PREVENTION OF THE EMERGENCE OR RELEASE OF PATHOGENS | 2) EARLY DETECTION & REPORTING FOR EPIDEMICS OF POTENTIAL INT'L CONCERN | 3) RAPID RESPONSE TO AND MITIGATION OF THE SPREAD OF AN EPIDEMIC | 4) SUFFICIENT & ROBUST HEALTH SECTOR TO TREAT THE SICK & PROTECT HEALTH WORKERS | 5) COMMITMENTS TO IMPROVING NATIONAL CAPACITY, FINANCING AND ADHERENCE TO NORMS | 6) OVERALL RISK ENVIRONMENT AND COUNTRY VULNERABILITY TO BIOLOGICAL THREATS |
|---|---|---|---|---|---|---|---|
| Afghanistan | 9.239996 | 7.572495 | 9.829645 | 4.992588 | 12.046882 | 74.102016 | 39.253817 |
| Andorra | 10.388138 | 15.383090 | 15.508250 | 6.162419 | 8.654353 | 51.091233 | 111.021456 |
| Armenia | 14.750233 | 20.249933 | 24.093557 | 7.129452 | 25.165138 | 71.870169 | 62.018425 |
| Australia | 16.025246 | 31.656961 | 27.109299 | 7.389482 | 28.497672 | 89.037839 | 104.182236 |
| Austria | 14.041782 | 26.888342 | 16.576125 | 6.310926 | 23.573333 | 78.050616 | 121.271983 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| Turkey | 12.993144 | 25.981757 | 16.576125 | 5.965761 | 23.539768 | 72.526158 | 76.056147 |
| Uruguay | 11.394925 | 23.548220 | 7.662603 | 6.041667 | 17.321488 | 39.237939 | 100.550379 |
| Uzbekistan | 11.167438 | 21.524632 | 9.043068 | 5.355757 | 15.022159 | 76.601333 | 70.323656 |
| Venezuela | 7.510077 | 8.140242 | 15.508250 | 5.075957 | 9.903073 | 34.579098 | 43.592263 |


Saving the rearranged 2019 data and the transformed 2021 data as CSV files for further use.

In [10]:
ghsi_2019_df.to_csv("data/ghsi_2019_rearanged.csv")
ghsi_2021_table_transformed.to_csv("data/ghsi_2021_transformed.csv")
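When these files are read back later, the country names are stored in an unlabelled first column. The sketch below uses a made-up two-row table and a temporary file (both are assumptions for illustration, not the real data) to show the difference between a plain `pd.read_csv` and one that passes `index_col=0` to restore the index.

```python
import os
import tempfile

import pandas as pd

# Toy table standing in for the transformed GHSI data (values are made up).
table = pd.DataFrame(
    {"OVERALL SCORE": [1.5, 2.25], "PREVENTION": [3.0, 4.5]},
    index=["Austria", "Chile"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_transformed.csv")
    table.to_csv(path)

    # Without index_col, the saved index becomes a column named "Unnamed: 0".
    naive = pd.read_csv(path)

    # With index_col=0, the first column is used as the row index again.
    reloaded = pd.read_csv(path, index_col=0)

print(naive.columns.tolist())
print(reloaded.index.tolist())
```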