# Feature Correlation & Analysis

This notebook defines helper functions for exploring how a specific variable correlates with the rest of the dataset. Most column names in the dataset are opaque indicator codes, so the functions reference the indicator dictionary and show the descriptions of the variables most highly correlated with a chosen variable. The notebook also outputs corrsDF (saved as corrsDF.csv), which is the correlation matrix reformatted as a three-column table: the first two columns are features and the third column is the correlation between them. Some correlations of note are at the bottom of the notebook. The feature sets obtained from various feature selection methods are also looked up here, so one can see which variables correlate well with what will go into our eventual model. The feature selection itself appears in M6, and that notebook refers back here to complete the big picture.

## Import Libraries

In [2]:
import numpy as np
import pandas as pd
from collections import Counter

## Load Data

In [3]:
completeDF = pd.read_csv("../../M2/carpentry/completeDF.csv")
completeDF.head(2)

Unnamed: 0,CountryName,Year,SPURBGROW,SPPOPAG25FEIN,SPPOPAG25MAIN,SPPOPDPND,SPPOPDPNDOL,SPPOPDPNDYG,SPPOPGROW,SPPOPTOTL,...,NYGDPFCSTKD,NYGDPFCSTKN,NYTAXNINDCD,SEPRMENRRFE,SEENRPRIMFMZS,SEPRMENRRMA,NVINDMANFKDZG,Continent,Region1,ISO3
0,Albania,1990.0,2.543043,29833.0,30896.0,61.994909,8.901594,53.093316,1.799086,3286542.0,...,4687984000.0,556155400000.0,456523000.0,99.688721,1.00459,99.233337,-1.330062,Europe,Southern Europe,ALB
1,Albania,1991.0,0.141061,28894.0,29109.0,62.715405,9.191396,53.524009,-0.60281,3266790.0,...,3484063000.0,413329200000.0,559041600.0,101.441803,1.00398,101.039749,4.437151,Europe,Southern Europe,ALB


In [4]:
completeDF.shape

(3150, 601)

In [5]:
completeDF.iloc[:3,1:549]

Unnamed: 0,Year,SPURBGROW,SPPOPAG25FEIN,SPPOPAG25MAIN,SPPOPDPND,SPPOPDPNDOL,SPPOPDPNDYG,SPPOPGROW,SPPOPTOTL,SPPOPTOTLFEIN,...,BMTRFPRVTCD,BMGSRINSFZS,TMVALINSFZSWT,FIRESTOTLCD,FIRESXGLDCD,NVINDTOTLKDZG,NYGDPFCSTCN,NECONGOVTKDZG,NECONPRVTPCKDZG,SHSTAOB18FEZS
0,1990.0,2.543043,29833.0,30896.0,61.994909,8.901594,53.093316,1.799086,3286542.0,1603543.0,...,0.0,2.920962,2.920962,91241940.0,85776640.0,-0.950977,42214810000.0,1.894168,1.371166,11.7
1,1991.0,0.141061,28894.0,29109.0,62.715405,9.191396,53.524009,-0.60281,3266790.0,1604790.0,...,0.0,1.676647,1.676647,91241940.0,85776640.0,6.149618,62312370000.0,1.894168,1.371166,12.0
2,1992.0,0.87843,27689.0,26863.0,63.311979,9.516996,53.794984,-0.606435,3247039.0,1610302.0,...,300000.0,4.242424,4.364896,196031100.0,171595900.0,6.149618,62312370000.0,3.4806,4.121177,12.2


---

## Correlation generation

### ***Generating correlation coefficients using .corr()***

In [6]:
corrmatrixDF = completeDF.iloc[:,2:549].corr()

corrmatrixDF.iloc[:5,:5]

Unnamed: 0,SPURBGROW,SPPOPAG25FEIN,SPPOPAG25MAIN,SPPOPDPND,SPPOPDPNDOL
SPURBGROW,1.0,0.096373,0.093288,0.731157,-0.699875
SPPOPAG25FEIN,0.096373,1.0,0.999571,-0.112233,-0.09
SPPOPAG25MAIN,0.093288,0.999571,1.0,-0.112257,-0.087194
SPPOPDPND,0.731157,-0.112233,-0.112257,1.0,-0.626614
SPPOPDPNDOL,-0.699875,-0.09,-0.087194,-0.626614,1.0


In [7]:
corrmatrixDF.shape

(547, 547)

In [8]:
%%time
#25 sec runtime

x = 0
y = 1
corfeat1 = []
corfeat2 = []
corcoefs = []
while x < corrmatrixDF.shape[0]:
    while y < corrmatrixDF.shape[0]:
        corcoef = corrmatrixDF.iloc[x, y]
        corfeat1.append((corrmatrixDF.iloc[x,:].name))
        corfeat2.append((corrmatrixDF.iloc[:,y].name))
        corcoefs.append(corcoef)
        y += 1
    y = x + 2
    x += 1
    
matrixcorrsDF = pd.DataFrame({"indicator1": corfeat1, "indicator2": corfeat2, "correlation": corcoefs})
matrixcorrsDF = matrixcorrsDF[~matrixcorrsDF.correlation.isnull()] #Drops any NA results (none in this run: 547*546/2 = 149331 pairs remain)
matrixcorrsDF.shape

CPU times: user 24.4 s, sys: 12.5 ms, total: 24.5 s
Wall time: 24.5 s


(149331, 3)
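
The nested loop above walks the strict upper triangle of the correlation matrix one cell at a time, which accounts for most of the 25 second runtime. As a hedged alternative, the same three-column table can be built in one step with `np.triu_indices`. This is a minimal sketch, not the code that produced the output above: `corr_matrix_to_pairs` and the toy frame are hypothetical names introduced for illustration, and unlike the original the sketch resets the row index after dropping NA values.

```python
import numpy as np
import pandas as pd

def corr_matrix_to_pairs(corrmatrix):
    """Flatten the strict upper triangle of a square correlation matrix
    into (indicator1, indicator2, correlation) rows. Hypothetical helper."""
    names = corrmatrix.columns.to_numpy()
    # k=1 skips the diagonal, so each unordered pair appears exactly once,
    # in the same row-major order as the nested while-loops.
    row_idx, col_idx = np.triu_indices(len(names), k=1)
    pairs = pd.DataFrame({
        "indicator1": names[row_idx],
        "indicator2": names[col_idx],
        "correlation": corrmatrix.to_numpy()[row_idx, col_idx],
    })
    # Drop pairs whose correlation could not be computed.
    return pairs.dropna(subset=["correlation"]).reset_index(drop=True)

# Toy data standing in for completeDF (assumption, not the real dataset).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
    "d": [1.0, 3.0, 2.0, 5.0],
})
toy_pairs = corr_matrix_to_pairs(toy.corr())
print(toy_pairs)
```

For an n-column matrix this yields n(n-1)/2 rows, which for the 547 columns here is the 149331 rows reported above.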

In [9]:
matrixcorrsDF.head(2)

Unnamed: 0,indicator1,indicator2,correlation
0,SPURBGROW,SPPOPAG25FEIN,0.096373
1,SPURBGROW,SPPOPAG25MAIN,0.093288


In [10]:
matrixcorrsDF.to_csv('corrsDF.csv', index=False)

---

## Functions for Correlation Exploration

### ***Indicator definition lookup***

*Takes an array of indicators and returns their definitions in array form*

In [11]:
#Loading saved generated pearsonR correlation dataframe
pearsoncorrsDF = pd.read_csv("corrsDF.csv")

pearsoncorrsDF = pearsoncorrsDF[~pearsoncorrsDF.correlation.isnull()] #Drops any NA results (already dropped before saving, kept as a safeguard)

pearsoncorrsDF.head(2)

Unnamed: 0,indicator1,indicator2,correlation
0,SPURBGROW,SPPOPAG25FEIN,0.096373
1,SPURBGROW,SPPOPAG25MAIN,0.093288


In [12]:
def ind_def_lookup(indicatorarray):
    
    defs = []
    
    indicatordict = pd.read_csv('../../M1/make_indicator_dict/Indicator_Dict.csv')
    
    indicatordict.columns = ["IndicatorCode", "IndicatorName"]
    
    for ind in indicatorarray:
        if ind not in indicatordict["IndicatorCode"].unique():
            defs.append(ind)
        else:
            defs.append(indicatordict[indicatordict["IndicatorCode"] == ind]["IndicatorName"].values[0])
        
    return defs
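
`ind_def_lookup` re-reads the indicator dictionary CSV and scans it once per indicator on every call. If that becomes slow, one option is to build a code-to-name mapping once and reuse it. The sketch below assumes the same two-column layout (`IndicatorCode`, `IndicatorName`); `make_lookup` and `toy_dict` are hypothetical names, and the toy rows are copied from outputs shown in this notebook rather than read from the real file.

```python
import pandas as pd

# Toy stand-in for Indicator_Dict.csv (assumption: same two columns).
toy_dict = pd.DataFrame({
    "IndicatorCode": ["AGLNDARBLZS", "SPDYNCDRTIN"],
    "IndicatorName": ["Arable land (% of land area)",
                      "Death rate, crude (per 1,000 people)"],
})

def make_lookup(indicatordict):
    """Build the code -> name mapping once and return a lookup function.
    Hypothetical helper, not part of the original notebook."""
    # Keep the first name for a duplicated code, as .values[0] does above.
    unique = indicatordict.drop_duplicates("IndicatorCode")
    code_to_name = dict(zip(unique["IndicatorCode"], unique["IndicatorName"]))

    def lookup(codes):
        # Codes missing from the dictionary are passed through unchanged.
        return [code_to_name.get(code, code) for code in codes]

    return lookup

lookup = make_lookup(toy_dict)
defs = lookup(["AGLNDARBLZS", "IMMIGRATION", "SPDYNCDRTIN"])
print(defs)
```

As in `ind_def_lookup`, a code with no dictionary entry (such as `IMMIGRATION`) comes back as the code itself.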

In [13]:
for i in ["AGCONFERTZS", "AGLNDARBLZS", "AGPRDLVSKXD", "BMGSRCMCPZS", "BMTRFPWKRCDDT", "BNGSRFCTYCD", "BXTRFPWKRCDDT", "EGELCCOALZS", "EGELCFOSLZS", "ENATMCO2EPPGDKD", "ENATMCO2ESFZS", "FIRESTOTLMO", "FMASTCGOVZGM3", "FMLBLBMNYCN", "IMMIGRATION", "ISAIRGOODMTK1", "ITCELSETSP2", "MSMILTOTLP1", "NECONPRVTKN", "NECONPRVTPCKD", "NEGDIFTOTCN", "NEGDIFTOTZS", "NEGDITOTLKD", "NEIMPGNFSZS", "NVAGRTOTLCD", "NVINDEMPLKD", "NVINDMANFKDZG", "NVSRVTOTLKN", "NYADJNNTYKD", "NYGDPFCSTCN", "NYGDPTOTLRTZS", "NYGNPMKTPKD", "NYGNSICTRGNZS", "NYGSRNFCYCD", "SESECENRLGCFEZS", "SGHMETRVLEQ", "SGLAWINDX", "SGOWNPRRTIM", "SHDTH0509", "SHDTH1519", "SHDYN0509", "SHDYNMORTFE", "SHHTNTRETFEZS", "SHIMMIBCG", "SHIMMMEAS", "SHSTAOWADMAZS", "SLEMP1524SPMAZS", "SLEMPTOTLSPZS", "SLFAMWORKMAZS", "SLINDEMPLFEZS", "SLSRVEMPLFEZS", "SLTLFACTIFEZS", "SLTLFACTIZS", "SLTLFTOTLFEIN", "SLUEM1524FEZS", "SLUEM1524FMZS", "SLUEM1524ZS", "SLUEMTOTLFEZS", "SPADOTFRT", "SPDYNAMRTFE", "SPDYNLE00FEIN", "SPDYNLE00IN", "SPDYNLE00MAIN", "SPDYNTO65FEZS", "SPPOP0509FE5Y", "SPPOP0509MA5Y", "SPPOP1014FE5Y", "SPPOP1014MA5Y", "SPPOP1519FE5Y", "SPPOP1519MA5Y", "SPPOP2024FE5Y", "SPPOP2024MA5Y", "SPPOP2529FE5Y", "SPPOP2529MA5Y", "SPPOP3034FE5Y", "SPPOP3034MA5Y", "SPPOP3539FE5Y", "SPPOP3539MA5Y", "SPPOP4044FE5Y", "SPPOP4044MA5Y", "SPPOP4549FE5Y", "SPPOP4549MA5Y", "SPPOP5054FE5Y", "SPPOP5054MA5Y", "SPPOP5559FE5Y", "SPPOP5559MA5Y", "SPPOP6064FE5Y", "SPPOP6064MA5Y", "SPPOP6569FE5Y", "SPPOP6569MA5Y", "SPPOP7074FE5Y", "SPPOP7074MA5Y", "SPPOP7579FE5Y", "SPPOP7579MA5Y", "SPPOP80UPFE5Y", "SPPOP80UPMA5Y", "SPPOPGROW", "SPRURTOTLZS", "SPURBGROW", "SPURBTOTLINZS", "TGVALTOTLGDZS", "TMVALMANFZSUN", "TMVALTRANZSWT", "TXUVIMRCHXDWD", "TXVALMMTLZSUN", "TXVALMRCHALZS", "TXVALMRCHR6ZS"]:
    print(i, ": ",  ind_def_lookup([i]))

AGCONFERTZS :  ['Fertilizer consumption (kilograms per hectare of arable land)']
AGLNDARBLZS :  ['Arable land (% of land area)']
AGPRDLVSKXD :  ['Livestock production index (2014-2016 = 100)']
BMGSRCMCPZS :  ['Communications, computer, etc. (% of service imports, BoP)']
BMTRFPWKRCDDT :  ['Personal remittances, paid (current US$)']
BNGSRFCTYCD :  ['Net primary income (BoP, current US$)']
BXTRFPWKRCDDT :  ['Personal remittances, received (current US$)']
EGELCCOALZS :  ['Electricity production from coal sources (% of total)']
EGELCFOSLZS :  ['Electricity production from oil, gas and coal sources (% of total)']
ENATMCO2EPPGDKD :  ['CO2 emissions (kg per 2017 PPP $ of GDP)']
ENATMCO2ESFZS :  ['CO2 emissions from solid fuel consumption (% of total)']
FIRESTOTLMO :  ['Total reserves in months of imports']
FMASTCGOVZGM3 :  ['Claims on central government (annual growth as % of broad money)']
FMLBLBMNYCN :  ['Broad money (current LCU)']
IMMIGRATION :  ['IMMIGRATION']
ISAIRGOODMTK1 :  ['Air trans

In [14]:
ind_def_lookup(["AGLNDAGRIK2", "AGLNDCROPZS", "AGLNDFRSTK2", "AGLNDFRSTZS", "AGPRDFOODXD", "BMGSRCMCPZS", "BMTRFPWKRCDDT", "BNCABXOKAGDZS", "BXTRFPWKRCDDT", "BXTRFPWKRDTGDZS", "EGELCHYROZS", "EGELCNGASZS", "EGELCRNWXZS", "ENATMCO2EPPGDKD", "ENATMCO2ESFZS", "ENATMGHGTKTCE", "ENATMMETHAGKTCE", "ENPOPDNST", "ENURBMCTYTLZS", "ERFSHCAPTMT", "FIRESTOTLMO", "FMASTCGOVZGM3", "FPCPITOTL", "IMMIGRATION", "ISAIRGOODMTK1", "ISAIRPSGR", "NECONPRVTPCKD", "NECONPRVTZS", "NEGDIFTOTKDZG", "NEIMPGNFSKD", "NVAGRTOTLKD", "NVINDEMPLKD", "NVINDMANFKDZG", "NVSRVTOTLKD", "NYADJSVNGGNZS", "NYGDPFCSTKN", "NYGDPTOTLRTZS", "NYGNPMKTPKDZG", "NYGNPPCAPKDZG", "PANUSPPP", "PANUSPRVTPP", "SGHMETRVLEQ", "SGLAWINDX", "SGOWNPRRTIM", "SHDTH0509", "SHDTH1519", "SHDYNMORTFE", "SHHTNTRETZS", "SHIMMIBCG", "SHIMMMEAS", "SHSTAOWADMAZS", "SLAGREMPLFEZS", "SLEMPMPYRFEZS", "SLEMPTOTLSPFEZS", "SLFAMWORKZS", "SLTLFACTI1524ZS", "SLTLFACTIMAZS", "SLTLFTOTLIN", "SPDYNAMRTFE", "SPDYNCDRTIN", "SPDYNLE00FEIN", "SPDYNTO65FEZS", "SPPOP0509FE5Y", "SPPOP0509MA5Y", "SPPOP1014FE5Y", "SPPOP1014MA5Y", "SPPOP1519FE5Y", "SPPOP1519MA5Y", "SPPOP2024FE5Y", "SPPOP2024MA5Y", "SPPOP2529FE5Y", "SPPOP2529MA5Y", "SPPOP3034FE5Y", "SPPOP3034MA5Y", "SPPOP3539FE5Y", "SPPOP3539MA5Y", "SPPOP4044FE5Y", "SPPOP4044MA5Y", "SPPOP4549FE5Y", "SPPOP4549MA5Y", "SPPOP5054FE5Y", "SPPOP5054MA5Y", "SPPOP5559FE5Y", "SPPOP5559MA5Y", "SPPOP6064FE5Y", "SPPOP6064MA5Y", "SPPOP6569FE5Y", "SPPOP6569MA5Y", "SPPOP7074FE5Y", "SPPOP7074MA5Y", "SPPOP7579FE5Y", "SPPOP7579MA5Y", "SPPOP80UPFE5Y", "SPPOP80UPMA5Y", "SPPOPDPND", "SPPOPDPNDOL", "SPRURTOTLZS", "SPURBGROW", "SPURBTOTLINZS", "TMUVIMRCHXDWD", "TMVALMMTLZSUN", "TMVALMRCHR5ZS", "TMVALTRANZSWT", "TXVALMMTLZSUN", "TXVALMRCHALZS", "TXVALMRCHORZS", "TXVALMRCHR2ZS", "TXVALMRCHR6ZS", "TXVALMRCHXDWD"])

['Agricultural land (sq. km)',
 'Permanent cropland (% of land area)',
 'Forest area (sq. km)',
 'Forest area (% of land area)',
 'Food production index (2014-2016 = 100)',
 'Communications, computer, etc. (% of service imports, BoP)',
 'Personal remittances, paid (current US$)',
 'Current account balance (% of GDP)',
 'Personal remittances, received (current US$)',
 'Personal remittances, received (% of GDP)',
 'Electricity production from hydroelectric sources (% of total)',
 'Electricity production from natural gas sources (% of total)',
 'Electricity production from renewable sources, excluding hydroelectric (% of total)',
 'CO2 emissions (kg per 2017 PPP $ of GDP)',
 'CO2 emissions from solid fuel consumption (% of total)',
 'Total greenhouse gas emissions (kt of CO2 equivalent)',
 'Agricultural methane emissions (thousand metric tons of CO2 equivalent)',
 'Population density (people per sq. km of land area)',
 'Population in urban agglomerations of more than 1 million (% of total

In [15]:
ind_def_lookup(["EGELCACCSURZS", "EGELCACCSZS", "EGELCRNEWZS", "EGEGYPRIMPPKD", "EGELCACCSRUZS", "EGELCPETRZS", "EGELCFOSLZS", "EGELCCOALZS", "EGELCHYROZS", "EGELCNGASZS", "EGELCRNWXZS", "EGELCRNWXKH"])

['Access to electricity, urban (% of urban population)',
 'Access to electricity (% of population)',
 'Renewable electricity output (% of total electricity output)',
 'Energy intensity level of primary energy (MJ/$2011 PPP GDP)',
 'Access to electricity, rural (% of rural population)',
 'Electricity production from oil sources (% of total)',
 'Electricity production from oil, gas and coal sources (% of total)',
 'Electricity production from coal sources (% of total)',
 'Electricity production from hydroelectric sources (% of total)',
 'Electricity production from natural gas sources (% of total)',
 'Electricity production from renewable sources, excluding hydroelectric (% of total)',
 'Electricity production from renewable sources, excluding hydroelectric (kWh)']

### ***Correlation Generator Function***
*Generates an array of correlations (correlation coefficient, correlated indicator, and definition) of a given indicator. Correlation thresholds may be specified*

In [16]:
def ind_corr_gen(indicator, minimum=.3, maximum=.9):
    
    storedcorrs = []
    
    #Search for indicator in correlationDF (copy so the new column is not written to a view)
    corrDF = pearsoncorrsDF.loc[(pearsoncorrsDF["indicator1"] == indicator) | (pearsoncorrsDF["indicator2"] == indicator)].copy()
    
    
    #Create a column of those indicators that correlate
    corrDF['correlated_indicator'] = corrDF["indicator1"].replace(indicator, "") + corrDF["indicator2"].replace(indicator, "")
    
    #Filter DF by indicated correlation thresholds (absolute value is taken before each comparison)
    corrDF = corrDF[(abs(corrDF["correlation"]) > minimum) & (abs(corrDF["correlation"]) < maximum)]
    
    #Lookup definitions of correlated indicators
    indicatordefs = ind_def_lookup(corrDF['correlated_indicator'])
    
    #Create array containing correlation, correlated indicator, and definition
    for i, ind in enumerate(corrDF['correlated_indicator']):
        correlation = corrDF["correlation"].iloc[i]
        storedcorrs.append([round(correlation, 4), ind, indicatordefs[i]])

    #Print a summary, then return as array
    print("Indicator searched: ", ind_def_lookup([indicator])[0]) #Uses above function to generate name
    print("Total correlations: ", len(storedcorrs))
    print(" ")
    return storedcorrs
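
The threshold step is easiest to check on a small table. The sketch below illustrates the intended filter, keeping a row only when the absolute correlation lies strictly between `minimum` and `maximum`, so that a strong negative correlation is excluded just like a strong positive one. `filter_by_abs_corr` and `toy_corrs` are hypothetical names and the numbers are made up for illustration.

```python
import pandas as pd

# Made-up three-column correlation table in the corrsDF layout.
toy_corrs = pd.DataFrame({
    "indicator1": ["A", "A", "B", "A"],
    "indicator2": ["B", "C", "C", "D"],
    "correlation": [0.45, -0.95, 0.10, -0.60],
})

def filter_by_abs_corr(corrs, indicator, minimum=0.3, maximum=0.9):
    """Rows involving `indicator` whose |correlation| is in (minimum, maximum).
    Hypothetical helper for illustration."""
    involved = corrs[(corrs["indicator1"] == indicator)
                     | (corrs["indicator2"] == indicator)].copy()
    # The partner is whichever of the two columns is not the searched indicator.
    involved["correlated_indicator"] = involved["indicator2"].where(
        involved["indicator1"] == indicator, involved["indicator1"])
    # Take the absolute value first, then compare against both bounds.
    strength = involved["correlation"].abs()
    return involved[(strength > minimum) & (strength < maximum)]

kept = filter_by_abs_corr(toy_corrs, "A")
print(kept)
```

Here the pair (A, C) is dropped because its magnitude 0.95 exceeds the upper bound, while (A, D) at -0.60 is kept.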

### ***Indicator Relationship Search***

*Generates the correlation coefficient of two given indicators*

In [17]:
def corr_finder(indicator1, indicator2):
    #An indicator is perfectly correlated with itself
    if indicator1 == indicator2:
        return 1
    #Each pair is stored once, so check one ordering and then the other
    coef = pearsoncorrsDF[(pearsoncorrsDF["indicator1"] == indicator1) & (pearsoncorrsDF["indicator2"] == indicator2)]
    coef = coef["correlation"].values
    if coef.size == 0:
        coef = pearsoncorrsDF[(pearsoncorrsDF["indicator1"] == indicator2) & (pearsoncorrsDF["indicator2"] == indicator1)]
        coef = coef["correlation"].values
    if coef.size == 0:
        print('indicator not found')
        return None
    #Print and also return the coefficient so callers can use it
    print("Correlation: ", coef[0])
    return coef[0]
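
Because each pair is stored only once in the three-column table, a lookup has to try both orderings. One way to make the lookup order-insensitive is to key a dictionary by the unordered pair. The sketch below is an assumption-labeled alternative, not the notebook's method: `build_pair_index` and `pair_corr` are hypothetical helpers, and the three toy rows reuse coefficients from the correlation matrix preview shown earlier.

```python
import pandas as pd

# Three pairs taken from the correlation matrix preview in this notebook.
toy_pairs = pd.DataFrame({
    "indicator1": ["SPURBGROW", "SPURBGROW", "SPPOPAG25FEIN"],
    "indicator2": ["SPPOPAG25FEIN", "SPPOPAG25MAIN", "SPPOPAG25MAIN"],
    "correlation": [0.096373, 0.093288, 0.999571],
})

def build_pair_index(corrs):
    """Map each unordered indicator pair to its correlation (hypothetical helper)."""
    return {
        frozenset((first, second)): coef
        for first, second, coef in zip(
            corrs["indicator1"], corrs["indicator2"], corrs["correlation"])
    }

def pair_corr(pair_index, indicator1, indicator2):
    """Return the stored correlation, 1.0 for identical indicators, or None."""
    if indicator1 == indicator2:
        return 1.0
    # frozenset ignores order, so (x, y) and (y, x) hit the same key.
    return pair_index.get(frozenset((indicator1, indicator2)))

pair_index = build_pair_index(toy_pairs)
forward = pair_corr(pair_index, "SPURBGROW", "SPPOPAG25MAIN")
backward = pair_corr(pair_index, "SPPOPAG25MAIN", "SPURBGROW")
print(forward, backward)
```

Building the index costs one pass over the table, after which each lookup is a single dictionary access.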


### ***Keyword Search***

*Generates a list of indicators and their definitions which contain a given keyword*

In [18]:
def ind_search(keyword):
    
    indices_to_filter = []
    
    indicatordict = pd.read_csv('../../M1/make_indicator_dict/Indicator_Dict.csv')
    
    indicatordict.columns = ["IndicatorCode", "IndicatorName"]
    
    indicatornames = indicatordict["IndicatorName"].values
    
    for i in np.arange(len(indicatornames)):
        if keyword in indicatornames[i]:
            indices_to_filter.append(i)
    
    for j in indicatordict.filter(indices_to_filter, axis=0).values:
        print(j[0], ":", j[1])
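
The keyword match can also be written with the pandas string accessor, which avoids the explicit index loop. This is a minimal sketch under the same two-column assumption as above; `search_indicators` and `toy_dict` are hypothetical names, and the toy rows are copied from outputs in this notebook. Passing `regex=False` treats the keyword as a literal substring, matching the behavior of Python's `in` operator used in `ind_search`.

```python
import pandas as pd

# Toy stand-in for Indicator_Dict.csv, rows copied from outputs in this notebook.
toy_dict = pd.DataFrame({
    "IndicatorCode": ["SHXPDCHEXGDZS", "AGLNDARBLZS", "NYGDPCOALRTZS"],
    "IndicatorName": ["Current health expenditure (% of GDP)",
                      "Arable land (% of land area)",
                      "Coal rents (% of GDP)"],
})

def search_indicators(indicatordict, keyword):
    """Return the rows whose IndicatorName contains `keyword` (case-sensitive).
    Hypothetical helper for illustration."""
    # regex=False: literal substring match; na=False: missing names never match.
    mask = indicatordict["IndicatorName"].str.contains(keyword, regex=False, na=False)
    return indicatordict.loc[mask, ["IndicatorCode", "IndicatorName"]]

matches = search_indicators(toy_dict, "% of GDP")
for code, name in matches.itertuples(index=False):
    print(code, ":", name)
```

Returning a DataFrame instead of printing inside the function leaves the caller free to print, count, or pass the matched codes on to other functions.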

In [19]:
ind_search('% of GDP')

SEXPDTOTLGDZS : Government expenditure on education, total (% of GDP)
SEXPDPRIMPCZS : Government expenditure per student, primary (% of GDP per capita)
SEXPDSECOPCZS : Government expenditure per student, secondary (% of GDP per capita)
SHXPDKHEXGDZS : Capital health expenditure (% of GDP)
SHXPDCHEXGDZS : Current health expenditure (% of GDP)
SHXPDGHEDGDZS : Domestic general government health expenditure (% of GDP)
NVAGRTOTLZS : Agriculture, forestry, and fishing, value added (% of GDP)
FMLBLBMNYGDZS : Broad money (% of GDP)
GCDODTOTLGDZS : Central government debt, total (% of GDP)
FSASTDOMOGDZS : Claims on other sectors of the domestic economy (% of GDP)
NYGDPCOALRTZS : Coal rents (% of GDP)
BNCABXOKAGDZS : Current account balance (% of GDP)
FSASTDOMSGDZS : Domestic credit provided by financial sector (% of GDP)
FSASTPRVTGDZS : Domestic credit to private sector (% of GDP)
FDASTPRVTGDZS : Domestic credit to private sector by banks (% of GDP)
GCXPNTOTLGDZS : Expense (% of GDP)
NEEXPGNFSZ

---

## Test Area

In [15]:
ind_corr_gen("NYGNSICTRCN")

Indicator searched:  Gross savings (current LCU)
Total correlations:  17
 


[[0.9798, 'NYGDPMKTPCN', 'GDP (current LCU)'],
 [0.9798, 'NYGDPMKTPCNAD', 'GDP: linked series (current LCU)'],
 [0.965, 'NEGDITOTLCN', 'Gross capital formation (current LCU)'],
 [0.978, 'NYGNPMKTPCN', 'GNI (current LCU)'],
 [0.978, 'NYGNPMKTPCNAD', 'GNI: linked series (current LCU)'],
 [0.9805, 'NECONTOTLCN', 'Final consumption expenditure (current LCU)'],
 [0.9782, 'NEDABTOTLCN', 'Gross national expenditure (current LCU)'],
 [0.9659, 'NYGDSTOTLCN', 'Gross domestic savings (current LCU)'],
 [0.9512,
  'NECONGOVTCN',
  'General government final consumption expenditure (current LCU)'],
 [0.9801,
  'NECONPRVTCN',
  'Households and NPISHs Final consumption expenditure (current LCU)'],
 [0.9878, 'NEGDIFTOTCN', 'Gross fixed capital formation (current LCU)'],
 [0.9801,
  'NECONPRVTCNAD',
  'Households and NPISHs final consumption expenditure: linked series (current LCU)'],
 [0.96,
  'NVAGRTOTLCN',
  'Agriculture, forestry, and fishing, value added (current LCU)'],
 [0.9878,
  'NVINDTOTLCN',
 

In [16]:
ind_def_lookup(['SPDYNCDRTIN'])

['Death rate, crude (per 1,000 people)']

### *Result of run with no SP-prefixed indicators*

In [17]:
vars_generated = ["SGLAWINDX", "NYADJAEDUGNZS", "SHMMRLEVE", "ENPOPDNST", "SESECAGES", "SHDYNMORT", "SHDYNMORTFE", "SHDTHIMRTMA", "NYADJDNGYCD", "AGLNDARBLZS", "AGPRDLVSKXD", "AGLNDCROPZS", "NYGDPTOTLRTZS", "SHDYNNMRT", "AGYLDCRELKG", "NYGDPPCAPKN", "TMVALMRCHHIZS", "TMVALMRCHRSZS", "ITCELSETS", "ENATMCO2ESFKT", "ENURBLCTY", "ENURBLCTYURZS", "NYADJDCO2GNZS", "TMVALMRCHR5ZS", "TXVALMRCHR6ZS", "NYGNPATLSCD", "NYGNPPCAPCD", "ENATMCO2EKDGD", "AGCONFERTZS", "TMVALMRCHR4ZS", "NECONPRVTZS", "SHSTAOWADZS", "SHSTAOB18FEZS", "TMVALMRCHR2ZS", "ENURBMCTY", "ENURBMCTYTLZS", "ENATMMETHEGZS", "BNCABXOKACD", "BXGSRTOTLCD", "NYTRFNCTRCN", "TMVALTRANZSWT", "BMGSRCMCPZS", "TMVALTRVLZSWT", "BXGSRTRVLZS", "SHIMMMEAS", "TMVALMANFZSUN", "TMVALFUELZSUN", "TXVALMANFZSUN", "FMLBLBMNYIRZS", "NVINDTOTLKN", "DTODAODATPCZS", "FIRESTOTLMO", "EGELCCOALZS", "TXVALFUELZSUN", "NEGDITOTLKN", "NEDABTOTLKN", "TMVALMRCHXDWD", "EGIMPCONSZS", "ENCO2OTHXZS", "ENCO2TRANZS", "EGUSECOMMFOZS", "EGELCLOSSZS", "NYGNSICTRCD"]
ind_def_lookup(vars_generated)

['Women Business and the Law Index Score (scale 1-100)',
 'Adjusted savings: education expenditure (% of GNI)',
 'Length of paid maternity leave (calendar days)',
 'Population density (people per sq. km of land area)',
 'Lower secondary school starting age (years)',
 'Mortality rate, under-5 (per 1,000)',
 'Mortality rate, under-5, female (per 1,000)',
 'Number of infant deaths, male',
 'Adjusted savings: energy depletion (current US$)',
 'Arable land (% of land area)',
 'Livestock production index (2014-2016 = 100)',
 'Permanent cropland (% of land area)',
 'Total natural resources rents (% of GDP)',
 'Mortality rate, neonatal (per 1,000 live births)',
 'Cereal yield (kg per hectare)',
 'GDP per capita (constant LCU)',
 'Merchandise imports from high-income economies (% of total merchandise imports)',
 'Merchandise imports by the reporting economy, residual (% of total merchandise imports)',
 'Mobile cellular subscriptions',
 'CO2 emissions from solid fuel consumption (kt)',
 'Populat

### *Europe*

In [18]:
eu_vars_generated = ["SPPOPAG05FEIN", "SPPOPAG01FEIN", "SPDYNTO65MAZS", "SPDYNTO65FEZS", "SPPOP2529FE5Y", "SPPOP1564TOZS", "SPPOP1014MA5Y", "SPPOP0509FE5Y", "SPPOPDPNDOL", "SPPOP0509FE", "SPPOP1014MA", "SPPOP1519MA", "SPPOP2529MA", "SPPOPGROW", "ENPOPDNST", "AGSRFTOTLK2", "TXVALMRCHRSZS", "TGVALTOTLGDZS", "ITCELSETSP2", "TMVALFUELZSUN", "EGELCHYROZS", "NYADJNNATGNZS", "ENCO2ETOTZS", "BMTRFPWKRCDDT"]
#ind_def_lookup(eu_vars_generated)

### *Africa*

In [19]:
af_vars_generated = ["SPADOTFRT", "SPPOP65UPMAZS", "SPPOP65UPTOZS", "SPPOP7074FE5Y", "SPPOPAG01MAIN", "SPPOPAG03FEIN", "SPDYNTO65MAZS", "SPPOP1564FEZS", "SPPOP1014MA5Y", "SPPOP0509MA5Y", "SPPOP0509FE5Y", "SPDYNLE00MAIN", "SPPOP5054FE5Y", "SPPOP4044MA5Y", "SPPOP4549FE", "SPPOP5559MA", "SPPOPGROW", "ENPOPDNST", "SESECDURS", "SPDYNIMRTFEIN", "TMVALMRCHHIZS", "ITCELSETSP2", "NYADJDNGYGNZS", "MSMILXPNDGDZS", "NECONGOVTZS", "SHSTAOB18FEZS", "SHSTAOB18MAZS", "NVAGRTOTLZS", "NVAGRTOTLCD", "NVINDMANFZS", "SHIMMMEAS", "EGUSEELECKHPC"]
#ind_def_lookup(af_vars_generated)

### *South America*

In [20]:
sa_vars_generated = ["SPADOTFRT", "SPPOPAG04FEIN", "SGLAWOBHBMRNO", "SPPOP1564TOZS", "SPDYNCDRTIN", "SPPOP1014MA5Y", "SPPOP0509MA5Y", "SGLAWINDX", "SPPOP4549MA5Y", "SPPOPDPNDOL", "SPPOP6064FE", "SPRURTOTLZG", "PANUSATLS", "SHDYNNMRT", "ENATMCO2ELFZS", "TXVALMRCHORZS", "TMVALMRCHR4ZS", "NECONGOVTZS", "DCDACTOTLCD", "TXVALAGRIZSUN", "NECONGOVTKD", "BXTRFPWKRDTGDZS", "DTNFLUNDPCD", "EGUSECRNWZS"]
#ind_def_lookup(sa_vars_generated)

### *Asia*

In [21]:
asia_vars_generated = ["SPPOP6569FE5Y", "SPADOTFRT", "SPDYNTO65FEZS", "SPPOP1564MAZS", "SPPOP1014MA5Y", "SPPOP0509MA5Y", "SPDYNLE00FEIN", "SPPOP3539FE5Y", "SPPOP5559MA5Y", "SPPOPDPND", "SPPOPDPNDOL", "IMMIGRATION", "SPURBGROW", "SPPOPGROW", "SPDYNIMRTMAIN", "SPDYNIMRTFEIN", "AGLNDAGRIZS", "AGLNDARBLHAPC", "TMVALMRCHHIZS", "NYGDPDEFLZS", "NYADJDKAPGNZS", "NYADJDCO2GNZS", "NECONGOVTZS", "FMASTPRVTGDZS", "PANUSFCRF", "BXGSRNFSVCD", "SHIMMMEAS", "DTODAALLDKD", "TMVALFUELZSUN", "BXGSRINSFZS", "FIRESTOTLMO", "EGELCRNWXZS", "NEGDITOTLKD", "EGELCLOSSZS"]
#ind_def_lookup(asia_vars_generated)

### *North America*

In [22]:
na_vars_generated = ["SPADOTFRT", "SPPOP1564TOZS", "SPPOP1519FE5Y", "SPPOP1014MA5Y", "SPDYNLE00FEIN", "SHMMRLEVE", "SGOPNBANKEQ", "SPPOPDPNDOL", "ENPOPDNST", "SPRURTOTLZG", "SPDYNAMRTMA", "SPDYNIMRTMAIN", "SPDYNIMRTFEIN", "SHDTHIMRTFE", "AGLNDARBLHA", "NYGDPFRSTRTZS", "BXKLTDINVWDGDZS", "NYADJDKAPGNZS", "NYADJDCO2GNZS", "SEPRMENRR", "NECONPRVTZS", "ENURBMCTYTLZS", "TXVALFOODZSUN", "EGELCHYROZS", "NEGDITOTLKN", "NEDABTOTLKN", "NEGDIFTOTKD", "EGIMPCONSZS"]
#ind_def_lookup(na_vars_generated)

In [23]:
Counter(na_vars_generated + asia_vars_generated + sa_vars_generated + af_vars_generated + eu_vars_generated).most_common(15)

[('SPPOP1014MA5Y', 5),
 ('SPADOTFRT', 4),
 ('SPPOPDPNDOL', 4),
 ('SPPOP1564TOZS', 3),
 ('ENPOPDNST', 3),
 ('SPDYNIMRTFEIN', 3),
 ('SPPOP0509MA5Y', 3),
 ('SPPOPGROW', 3),
 ('NECONGOVTZS', 3),
 ('SPDYNLE00FEIN', 2),
 ('SPRURTOTLZG', 2),
 ('SPDYNIMRTMAIN', 2),
 ('NYADJDKAPGNZS', 2),
 ('NYADJDCO2GNZS', 2),
 ('EGELCHYROZS', 2)]

---

### *Europe No SP*

In [24]:
vars_generated = ["SGAGEMRETEQ", "SGLAWINDX", "SHMMRLEVE", "ENPOPDNST", "SHDYNMORT", "SHDYNMORTFE", "ENATMMETHAGKTCE", "AGPRDFOODXD", "NYGDPPCAPCD", "TMVALMRCHRSZS", "TGVALTOTLGDZS", "BXKLTDINVWDGDZS", "NYADJDCO2GNZS", "NYGDPCOALRTZS", "ENATMCO2EKDGD", "MSMILXPNDCD", "MSMILXPNDGDZS", "NECONGOVTZS", "SEPRMENRR", "TXVALMRCHR3ZS", "NVAGRTOTLZS", "NVINDTOTLCN", "BMGSRTRVLZS", "TXVALFOODZSUN", "BXGSRINSFZS", "SESECENRL", "NECONGOVTKDZG", "BXTRFPWKRDTGDZS", "DTODAODATPCZS", "NECONPRVTPCKDZG", "EGIMPCONSZS", "NEGDITOTLKDZG", "ENCO2OTHXZS", "ENCO2TRANZS"]
ind_def_lookup(vars_generated)

['The mandatory retirement age for men and women is the same (1=yes; 0=no)',
 'Women Business and the Law Index Score (scale 1-100)',
 'Length of paid maternity leave (calendar days)',
 'Population density (people per sq. km of land area)',
 'Mortality rate, under-5 (per 1,000)',
 'Mortality rate, under-5, female (per 1,000)',
 'Agricultural methane emissions (thousand metric tons of CO2 equivalent)',
 'Food production index (2014-2016 = 100)',
 'GDP per capita (Current US$)',
 'Merchandise imports by the reporting economy, residual (% of total merchandise imports)',
 'Merchandise trade (% of GDP)',
 'Foreign direct investment, net inflows (% of GDP)',
 'Adjusted savings: carbon dioxide damage (% of GNI)',
 'Coal rents (% of GDP)',
 'CO2 emissions (kg per 2015 US$ of GDP)',
 'Military expenditure (current USD)',
 'Military expenditure (% of GDP)',
 'General government final consumption expenditure (% of GDP)',
 'School enrollment, primary (% gross)',
 'Merchandise exports to low- and m

### *Africa No SP*

In [25]:
vars_generated = ["SGLAWINDX", "NYADJAEDUGNZS", "ENPOPDNST", "SESECDURS", "SESECAGES", "SHDTHIMRTFE", "ITMLTMAINP2", "AGLNDARBLHAPC", "AGPRDFOODXD", "SHDYNNMRT", "AGYLDCRELKG", "SHDTHNMRT", "NYGDPMKTPKDZG", "NYGDPPCAPKDZG", "ENATMCO2ELFZS", "ENATMCO2EGFZS", "NYADJDCO2GNZS", "NYGDPCOALRTZS", "NEIMPGNFSZS", "NETRDGNFSZS", "TMVALMRCHR4ZS", "ERFSHAQUAMT", "NECONPRVTZS", "FPCPITOTL", "SHSTAOB18FEZS", "ENURBMCTYTLZS", "SHIMMMEAS", "NYGDPFCSTKN", "TXVALMANFZSUN", "FMLBLBMNYIRZS", "BXTRFPWKRCDDT", "EGELCCOALZS", "NYADJNNATGNZS", "NYADJNNATCD", "EGUSEELECKHPC", "NEGDITOTLKDZG", "ENCO2BLDGZS", "NEDABDEFLZS", "BMTRFPWKRCDDT"]
ind_def_lookup(vars_generated)

['Women Business and the Law Index Score (scale 1-100)',
 'Adjusted savings: education expenditure (% of GNI)',
 'Population density (people per sq. km of land area)',
 'Secondary education, duration (years)',
 'Lower secondary school starting age (years)',
 'Number of infant deaths, female',
 'Fixed telephone subscriptions (per 100 people)',
 'Arable land (hectares per person)',
 'Food production index (2014-2016 = 100)',
 'Mortality rate, neonatal (per 1,000 live births)',
 'Cereal yield (kg per hectare)',
 'Number of neonatal deaths',
 'GDP growth (annual %)',
 'GDP per capita growth (annual %)',
 'CO2 emissions from liquid fuel consumption (% of total)',
 'CO2 emissions from gaseous fuel consumption (% of total)',
 'Adjusted savings: carbon dioxide damage (% of GNI)',
 'Coal rents (% of GDP)',
 'Imports of goods and services (% of GDP)',
 'Trade (% of GDP)',
 'Merchandise imports from low- and middle-income economies in Middle East & North Africa (% of total merchandise imports)',


### *Asia No SP*

In [26]:
vars_generated = ["ENPOPDNST", "SHDYNMORT", "NYADJDNGYCD", "AGLNDAGRIZS", "AGPRDLVSKXD", "BXKLTDINVCDWD", "NYGDPMINRRTZS", "SHDYNNMRT", "AGYLDCRELKG", "TMVALMRCHHIZS", "TMVALMRCHRSZS", "ENATMCO2EGFZS", "ENURBLCTYURZS", "NYADJDCO2GNZS", "NYGDPPETRRTZS", "TMVALMRCHR4ZS", "NYADJNNTYPCCD", "TXVALMRCHR3ZS", "FDASTPRVTGDZS", "TMVALMRCHR2ZS", "NVAGRTOTLZS", "PANUSFCRF", "SEENRPRIMFMZS", "SEPRMENRRMA", "SEPRMENRLFEZS", "ENURBMCTYTLZS", "NVAGRTOTLKDZG", "NYTRFNCTRCN", "BXGSRCMCPZS", "SHIMMIDPT", "BXGSRTRANZS", "BMGSRINSFZS", "BXGRTEXTACDWD", "NVINDTOTLKD", "EGELCRNWXZS", "NVINDMANFKN", "EGUSECRNWZS", "ENCO2OTHXZS", "ENCO2BLDGZS", "EGELCLOSSZS"]
ind_def_lookup(vars_generated)

['Population density (people per sq. km of land area)',
 'Mortality rate, under-5 (per 1,000)',
 'Adjusted savings: energy depletion (current US$)',
 'Agricultural land (% of land area)',
 'Livestock production index (2014-2016 = 100)',
 'Foreign direct investment, net inflows (BoP, current US$)',
 'Mineral rents (% of GDP)',
 'Mortality rate, neonatal (per 1,000 live births)',
 'Cereal yield (kg per hectare)',
 'Merchandise imports from high-income economies (% of total merchandise imports)',
 'Merchandise imports by the reporting economy, residual (% of total merchandise imports)',
 'CO2 emissions from gaseous fuel consumption (% of total)',
 'Population in the largest city (% of urban population)',
 'Adjusted savings: carbon dioxide damage (% of GNI)',
 'Oil rents (% of GDP)',
 'Merchandise imports from low- and middle-income economies in Middle East & North Africa (% of total merchandise imports)',
 'Adjusted net national income per capita (current US$)',
 'Merchandise exports to l

---

## Results of Feature Selection

### *Random Method - 'swamy-arora'*

In [20]:
sa_results = ["SPPOPDPNDOL", "SPADOTFRT", "SPDYNTO65MAZS", "SPPOP1014FE5Y", "SPPOP0509MA5Y", "SPPOP1564TOZS", "SPPOP2024FE5Y", "SPDYNCDRTIN", "SHDYNMORTMA", "SPPOP6064MA5Y", "SPPOP65UPFEZS", "SPPOP65UPMAZS", "SPPOP3034MA", "NYGDPTOTLRTZS", "NYGDPMKTPKDZG", "NYGDPPCAPKDZG", "TMVALMRCHR3ZS", "AGLNDARBLZS", "AGPRDLVSKXD", "NYGNPPCAPPPCD", "TXVALMRCHR6ZS", "SLUEMTOTLZS", "TXVALMRCHXDWD", "ENATMCO2EPPGDKD", "BMGSRTRANZS", "EGELCACCSURZS", "NYGNPPCAPKN", "EGELCRNEWZS", "SHIMMIBCG", "SESECENRLGC", "EGELCFOSLZS"]
ind_def_lookup(sa_results)

['Age dependency ratio, old',
 'Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'Survival to age 65, male (% of cohort)',
 'Population ages 10-14, female (% of female population)',
 'Population ages 05-09, male (% of male population)',
 'Population ages 15-64 (% of total population)',
 'Population ages 20-24, female (% of female population)',
 'Death rate, crude (per 1,000 people)',
 'Mortality rate, under-5, male (per 1,000)',
 'Population ages 60-64, male (% of male population)',
 'Population ages 65 and above, female (% of female population)',
 'Population ages 65 and above, male (% of male population)',
 'Population ages 30-34, male',
 'Total natural resources rents (% of GDP)',
 'GDP growth (annual %)',
 'GDP per capita growth (annual %)',
 'Merchandise imports from low- and middle-income economies in Latin America & the Caribbean (% of total merchandise imports)',
 'Arable land (% of land area)',
 'Livestock production index (2014-2016 = 100)',
 'GNI per capita, 

### *Random Method - 'nerlove'*

In [21]:
nerlove_results = ["SPADOTFRT", "SPDYNTO65FEZS", "SPPOP1014FE5Y", "SPPOP1519MA", "SPPOP0509MA5Y", "SPPOP1564TOZS", "SPPOP2024FE5Y", "SPDYNLE00FEIN", "SPPOP6064MA5Y", "SPPOP65UPFEZS", "SPPOP65UPMAZS", "SPPOP80UPFE5Y", "NYGDPTOTLRTZS", "SPDYNAMRTFE", "NYGDPMKTPKDZG", "NYGDPPCAPKDZG", "SLTLFACTIMAZS", "NYGDPPCAPPPKD", "NEGDIFTOTZS", "NVAGRTOTLKD", "TXVALMRCHR6ZS", "SLFAMWORKMAZS", "SLEMP1524SPMAZS", "SLAGREMPLZS", "TXVALMRCHR4ZS", "SMPOPREFG", "ENATMCO2EPPGDKD", "BXGSRTRVLZS", "SHSTAOWADMAZS", "NYTAXNINDCN", "NVINDEMPLKD", "TXVALFUELZSUN", "TMUVIMRCHXDWD", "FMASTCGOVZGM3", "STINTTVLXCD", "IMMIGRATION"]
ind_def_lookup(nerlove_results)

['Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'Survival to age 65, female (% of cohort)',
 'Population ages 10-14, female (% of female population)',
 'Population ages 15-19, male',
 'Population ages 05-09, male (% of male population)',
 'Population ages 15-64 (% of total population)',
 'Population ages 20-24, female (% of female population)',
 'Life expectancy at birth, female (years)',
 'Population ages 60-64, male (% of male population)',
 'Population ages 65 and above, female (% of female population)',
 'Population ages 65 and above, male (% of male population)',
 'Population ages 80 and older, female (% of female population)',
 'Total natural resources rents (% of GDP)',
 'Mortality rate, adult, female (per 1,000 female adults)',
 'GDP growth (annual %)',
 'GDP per capita growth (annual %)',
 'Labor force participation rate, male (% of male population ages 15-64) (modeled ILO estimate)',
 'GDP per capita, PPP (constant 2017 international $)',
 'Gross fixed capi

### *Random Method - 'walhus'*

In [22]:
walhus_results = ["SPADOTFRT", "SPDYNTO65FEZS", "SPPOP1014FE5Y", "SPPOP0509MA5Y", "SPPOP1564TOZS", "SPPOP2024FE5Y", "SPDYNLE00FEIN", "SHDYNMORTMA", "SPPOP1519FE", "SPPOP6569FE5Y", "SPPOP65UPFEZS", "SPPOP65UPMAZS", "SPPOP3034MA", "NYGDPTOTLRTZS", "SPDYNAMRTFE", "NYGDPMKTPKDZG", "NYGDPPCAPKDZG", "SLTLFACTI1524ZS", "SLTLFACTIZS", "TXVALMRCHALZS", "AGPRDLVSKXD", "SLUEM1524ZS", "SMPOPREFG", "SLGDPPCAPEMKD", "EGELCACCSURZS", "TXQTYMRCHXDWD", "SHIMMIBCG", "EGELCFOSLZS", "IMMIGRATION"]
ind_def_lookup(walhus_results)

['Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'Survival to age 65, female (% of cohort)',
 'Population ages 10-14, female (% of female population)',
 'Population ages 05-09, male (% of male population)',
 'Population ages 15-64 (% of total population)',
 'Population ages 20-24, female (% of female population)',
 'Life expectancy at birth, female (years)',
 'Mortality rate, under-5, male (per 1,000)',
 'Population ages 15-19, female',
 'Population ages 65-69, female (% of female population)',
 'Population ages 65 and above, female (% of female population)',
 'Population ages 65 and above, male (% of male population)',
 'Population ages 30-34, male',
 'Total natural resources rents (% of GDP)',
 'Mortality rate, adult, female (per 1,000 female adults)',
 'GDP growth (annual %)',
 'GDP per capita growth (annual %)',
 'Labor force participation rate for ages 15-24, total (%) (modeled ILO estimate)',
 'Labor force participation rate, total (% of total population ages 15

### *Pooling* 

In [23]:
pool_results = ["SPPOPDPNDOL", "SGIHTASSTEQ", "SPPOP1014FE5Y", "SPPOP0509FE5Y", "SPPOP1564FEZS", "SPPOP1564TOZS", "SPPOP2024FE5Y", "SPDYNLE00FEIN", "SPDYNCDRTIN", "SHDYNMORT", "SPPOP65UPFEZS", "NYGDPMKTPKDZG", "NYGDPPCAPKDZG", "NYADJDRESGNZS", "MSMILXPNDGDZS", "AGLNDARBLZS", "ENATMCO2EKDGD", "SLUEM1524ZS", "TXVALMRCHXDWD", "NVSRVTOTLKN", "EGELCACCSURZS", "MSMILXPNDZS", "EGELCHYROZS", "DEATHTOTL"]
ind_def_lookup(pool_results)

['Age dependency ratio, old',
 'Female and male surviving spouses have equal rights to inherit assets (1=yes; 0=no)',
 'Population ages 10-14, female (% of female population)',
 'Population ages 05-09, female (% of female population)',
 'Population ages 15-64, female (% of female population)',
 'Population ages 15-64 (% of total population)',
 'Population ages 20-24, female (% of female population)',
 'Life expectancy at birth, female (years)',
 'Death rate, crude (per 1,000 people)',
 'Mortality rate, under-5 (per 1,000)',
 'Population ages 65 and above, female (% of female population)',
 'GDP growth (annual %)',
 'GDP per capita growth (annual %)',
 'Adjusted savings: natural resources depletion (% of GNI)',
 'Military expenditure (% of GDP)',
 'Arable land (% of land area)',
 'CO2 emissions (kg per 2015 US$ of GDP)',
 'Unemployment, youth total (% of total labor force ages 15-24) (modeled ILO estimate)',
 'Export value index (2000 = 100)',
 'Services, value added (constant LCU)',
 '

### *Fixed Effects*

In [24]:
fe_results = ["SPURBGROW", "ENPOPDNST", "SPADOTFRT", "SPPOP1014FE5Y", "SPPOP1564TOZS", "SPPOP6569MA5Y", "SPPOP7074FE5Y", "TXVALSERVCDWT", "SHSTAOWADMAZS", "NVSRVTOTLKN", "TXVALFOODZSUN", "EGELCACCSZS", "NYADJNNTYPCKD", "NYGNPMKTPKN", "TXUVIMRCHXDWD", "FMLBLBMNYGDZS", "SESECENRLGC", "EGELCPETRZS", "EGELCFOSLZS", "DEATHTOTL"]
ind_def_lookup(fe_results)

['Urban population growth (annual %)',
 'Population density (people per sq. km of land area)',
 'Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'Population ages 10-14, female (% of female population)',
 'Population ages 15-64 (% of total population)',
 'Population ages 65-69, male (% of male population)',
 'Population ages 70-74, female (% of female population)',
 'Commercial service exports (current US$)',
 'Prevalence of overweight, male (% of male adults)',
 'Services, value added (constant LCU)',
 'Food exports (% of merchandise exports)',
 'Access to electricity (% of population)',
 'Adjusted net national income per capita (constant 2015 US$)',
 'GNI (constant LCU)',
 'Export unit value index (2000 = 100)',
 'Broad money (% of GDP)',
 'Secondary education, general pupils',
 'Electricity production from oil sources (% of total)',
 'Electricity production from oil, gas and coal sources (% of total)',
 'DEATHTOTL']
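The last entry in the output above, 'DEATHTOTL', comes back as the raw code rather than a description, which suggests the indicator dictionary has no entry for it. The sketch below shows one way a lookup could fall back to the code itself when a description is missing. It is not the notebook's `ind_def_lookup`; the helper name `lookup_descriptions` and the two-entry `toy_definitions` dictionary are assumptions made for illustration.

```python
# Hypothetical stand-in for the indicator dictionary (only two real entries copied
# from the outputs in this notebook; the real dictionary is much larger).
toy_definitions = {
    "SPURBGROW": "Urban population growth (annual %)",
    "ENPOPDNST": "Population density (people per sq. km of land area)",
}


def lookup_descriptions(codes, definitions):
    """Return a description for each code, or the code itself if none is known."""
    return [definitions.get(code, code) for code in codes]


described = lookup_descriptions(["SPURBGROW", "ENPOPDNST", "DEATHTOTL"], toy_definitions)

# Codes that were returned unchanged are the ones lacking a description.
missing = [code for code in ["SPURBGROW", "ENPOPDNST", "DEATHTOTL"] if code not in toy_definitions]
print(described)
print(missing)
```

Listing the codes without a description separately makes it easier to spot engineered columns (such as totals added during data preparation) that never had a dictionary entry.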

---

## Combine All Results
### Find most/least frequent

In [25]:
# Combine the variables selected by every method into one list
sigvars = sa_results + nerlove_results + walhus_results + pool_results + fe_results

# Count the unique variables across all methods
len(np.unique(sigvars))

85

In [26]:
# Keep variables that appear at least three times across the combined lists,
# ordered from most to least frequent
common_vars = [var for var, count in Counter(sigvars).most_common() if count >= 3]

In [27]:
len(common_vars)

13
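The count above tallies raw occurrences in the concatenated list, so it does not record which methods chose each variable. The sketch below is one possible way to keep that information. The dictionary `method_results` holds shortened toy lists that reuse a few codes from this notebook; they are assumptions for illustration, not the full result lists.

```python
from collections import Counter

# Toy, shortened stand-ins for the per-method result lists (not the full lists).
method_results = {
    "walhus": ["SPADOTFRT", "SPPOP1564TOZS", "NYGDPMKTPKDZG"],
    "pool": ["SPPOP1564TOZS", "NYGDPMKTPKDZG", "DEATHTOTL"],
    "fe": ["SPADOTFRT", "SPPOP1564TOZS", "DEATHTOTL"],
}

# Count each variable once per method, even if a list happens to repeat it.
method_counts = Counter(
    var for codes in method_results.values() for var in set(codes)
)

# Record which methods selected each variable.
selected_by = {
    var: [name for name, codes in method_results.items() if var in codes]
    for var in method_counts
}

# Variables chosen by at least `min_methods` methods.
min_methods = 3
common = sorted(var for var, n in method_counts.items() if n >= min_methods)
print(common)
print(selected_by["DEATHTOTL"])
```

Counting per method rather than per occurrence guards against a variable passing the threshold only because one list contains it more than once.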

In [28]:
ind_def_lookup(common_vars)

['Population ages 10-14, female (% of female population)',
 'Population ages 15-64 (% of total population)',
 'Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'Population ages 20-24, female (% of female population)',
 'Population ages 65 and above, female (% of female population)',
 'GDP growth (annual %)',
 'GDP per capita growth (annual %)',
 'Population ages 05-09, male (% of male population)',
 'Population ages 65 and above, male (% of male population)',
 'Total natural resources rents (% of GDP)',
 'Access to electricity, urban (% of urban population)',
 'Electricity production from oil, gas and coal sources (% of total)',
 'Life expectancy at birth, female (years)']

In [29]:
common_vars

['SPPOP1014FE5Y',
 'SPPOP1564TOZS',
 'SPADOTFRT',
 'SPPOP2024FE5Y',
 'SPPOP65UPFEZS',
 'NYGDPMKTPKDZG',
 'NYGDPPCAPKDZG',
 'SPPOP0509MA5Y',
 'SPPOP65UPMAZS',
 'NYGDPTOTLRTZS',
 'EGELCACCSURZS',
 'EGELCFOSLZS',
 'SPDYNLE00FEIN']
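Since the goal of this notebook is to see how variables correlate with what goes into the model, a natural next step is to inspect pairwise correlations among the frequently selected variables. The sketch below shows one way to do that in the three-column layout described for corrsDF. The values in `toy_df` are made up, and the column names `feature_1`, `feature_2`, and `correlation` are assumptions; in the notebook one would presumably start from `completeDF[common_vars]` instead.

```python
import pandas as pd

# Made-up values for three of the common variables (illustration only).
toy_df = pd.DataFrame(
    {
        "SPADOTFRT": [10.0, 20.0, 30.0, 40.0],
        "SPDYNLE00FEIN": [80.0, 70.0, 60.0, 50.0],
        "NYGDPMKTPKDZG": [1.0, 3.0, 2.0, 5.0],
    }
)

# Square correlation matrix (Pearson by default).
corr = toy_df.corr()

# Reshape to a long table: one row per (feature_1, feature_2) pair.
pairs = (
    corr.rename_axis("feature_1")
    .reset_index()
    .melt(id_vars="feature_1", var_name="feature_2", value_name="correlation")
)

# Keep each unordered pair once and drop self-correlations.
pairs = pairs[pairs["feature_1"] < pairs["feature_2"]]

# Sort so the strongest relationships (positive or negative) come first.
pairs = pairs.sort_values("correlation", key=lambda s: s.abs(), ascending=False)
print(pairs.reset_index(drop=True))
```

Sorting by absolute value keeps strong negative correlations near the top, which is useful when checking a selected feature set for redundancy.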

---

## Alternative Selection Results

In [30]:
alt_results = ["NYGNSICTRZS", "BXTRFPWKRDTGDZS", "NVINDMANFZS", "SHHTNTRETMAZS", "SPPOPDPNDOL", "SPPOPGROW", "SPPOPAG05MAIN", "SPPOPAG06MAIN", "SPPOPAG07MAIN", "SPPOP0509FE5Y", "SPPOP1564TOZS", "SPPOP2024FE5Y", "SPPOP2024MA5Y", "SPPOP1014MA5Y", "SPDYNCDRTIN", "SHDYNNMRT", "SPPOP6064MA5Y", "SPPOPAG12MAIN", "SPADOTFRT", "SPDYNAMRTFE", "SPDYNAMRTMA", "AGPRDLVSKXD", "FPCPITOTLZG", "NVSRVTOTLZS", "SLINDEMPLFEZS", "SLEMPMPYRZS", "SLEMP1524SPMAZS", "SLEMP1524SPZS", "SLEMPVULNFEZS", "SLEMPMPYRMAZS", "SLEMPSELFFEZS", "TXVALMRCHR4ZS", "SMPOPREFG", "BXGSRMRCHCD", "TXVALSERVCDWT", "BNGSRGNFSCD", "BMGSRGNFSCD", "SPPOP7579MA", "SPPOP4549MA5Y", "SPPOP3034MA5Y", "SPPOP3034FE5Y", "SPPOP4044FE", "SPPOP4549FE", "SPPOP4549FE5Y", "SPPOP4549MA", "SPPOP5054FE", "SPPOP5054MA", "NYGNPPCAPKD", "TXUVIMRCHXDWD", "NYTAXNINDKN", "TMUVIMRCHXDWD", "NYGNPPCAPPPKD", "SHIMMIBCG", "EGELCCOALZS", "IMMIGRATION"]
ind_def_lookup(alt_results)

['Gross savings (% of GDP)',
 'Personal remittances, received (% of GDP)',
 'Manufacturing, value added (% of GDP)',
 'Treatment for hypertension, male (% of male adults ages 30-79 with hypertension)',
 'Age dependency ratio, old',
 'Population growth (annual %)',
 'Age population, age 05, male, interpolated',
 'Age population, age 06, male, interpolated',
 'Age population, age 07, male, interpolated',
 'Population ages 05-09, female (% of female population)',
 'Population ages 15-64 (% of total population)',
 'Population ages 20-24, female (% of female population)',
 'Population ages 20-24, male (% of male population)',
 'Population ages 10-14, male (% of male population)',
 'Death rate, crude (per 1,000 people)',
 'Mortality rate, neonatal (per 1,000 live births)',
 'Population ages 60-64, male (% of male population)',
 'Age population, age 12, male, interpolated',
 'Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'Mortality rate, adult, female (per 1,000 female adul

In [31]:
# This list produced an R2 of 0.98 and an SS of 29.79
alt_results = ["SHHTNTRETZS", "SLTLFACTIZS", "SGLAWINDX", "SPADOTFRT", "SPPOP1014FE5Y", "SPPOP0509FE5Y", "SPPOPDPND", "SPPOPDPNDOL", "SPPOPTOTL", "SPPOPTOTLFEIN", "SPPOPAG05FEIN", "SPPOPAG07FEIN", "BXTRFPWKRDTGDZS", "TMVALTRVLZSWT", "BMGSRTRVLZS", "SLFAMWORKMAZS", "SHSTAOB18MAZS", "SHSTAOWADZS", "SHSTAOWADMAZS", "NVSRVTOTLKD", "SPDYNAMRTFE", "SPDYNAMRTMA", "NYGDPMKTPKDZG", "NYGDPPCAPKDZG", "NEGDITOTLCD", "NVINDEMPLKD", "TXVALFOODZSUN", "SPPOP65UPFEIN", "SPPOP65UPFEZS", "SPPOP65UPMAZS", "SPPOP65UPTO", "SPPOP7074FE5Y", "SPPOP7579FE5Y", "SPPOP7579MA5Y", "SPPOP3539MA5Y", "SPPOP3539FE5Y", "SHDTHNMRT", "ITMLTMAINP2", "SLEMPMPYRZS", "SLEMP1524SPFEZS", "SLEMP1524SPZS", "SLUEMTOTLZS", "SLUEMTOTLFEZS", "SLAGREMPLZS", "SLEMPTOTLSPFEZS", "TXVALMRCHR3ZS", "TXVALMRCHR4ZS", "SMPOPREFG", "BMGSRFCTYCD", "NEIMPGNFSKN", "BXPEFTOTLCDWD", "NYTAXNINDKN", "TMUVIMRCHXDWD", "NYGNPMKTPKD", "FMLBLBMNYCN", "ENURBMCTY", "NYGNPMKTPPPKD", "EGELCCOALZS", "IMMIGRATION"]
ind_def_lookup(alt_results)

['Treatment for hypertension (% of adults ages 30-79 with hypertension)',
 'Labor force participation rate, total (% of total population ages 15-64) (modeled ILO estimate)',
 'Women Business and the Law Index Score (scale 1-100)',
 'Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'Population ages 10-14, female (% of female population)',
 'Population ages 05-09, female (% of female population)',
 'Age dependency ratio (% of working-age population)',
 'Age dependency ratio, old',
 'Population, total',
 'Population, female',
 'Age population, age 05, female, interpolated',
 'Age population, age 07, female, interpolated',
 'Personal remittances, received (% of GDP)',
 'Travel services (% of commercial service imports)',
 'Travel services (% of service imports, BoP)',
 'Contributing family workers, male (% of male employment) (modeled ILO estimate)',
 'Prevalence of obesity, male (% of male population ages 18+)',
 'Prevalence of overweight (% of adults)',
 'Prevalence of ov

In [32]:
results = ["SPRURTOTLZS", "SPURBGROW", "SGHMETRVLEQ", "SGDMLPRGW", "ENPOPDNST", "SPADOTFRT", "SPDYNTO65FEZS", "SGOWNPRRTIM", "SPPOP1014FE5Y", "SPPOP1519MA", "SPPOP0509MA5Y", "SPPOP1564TOZS", "SPPOP2024FE5Y", "SPDYNLE00FEIN", "SPDYNCDRTIN", "SHDYNMORTMA", "SPPOP1519FE", "SPPOP6064MA5Y", "SPPOP6569FE", "SPPOP65UPFEZS", "SPPOP65UPMAZS", "SPPOP7074FE", "SPPOP7074FE5Y", "SPPOP80UPFE5Y", "SPPOP3034MA", "SPPOP3539MA", "SPPOP4044MA", "SPPOP5559FE5Y", "ITCELSETSP2", "NYGDPTOTLRTZS", "SPDYNAMRTFE", "SHIMMMEAS", "NYGDPMKTPKDZG", "NYGDPPCAPKDZG", "TGVALTOTLGDZS", "NERSBGNFSZS", "NETRDGNFSZS", "SLTLFTOTLFEIN", "SHDYN0509", "SHHTNTRETZS", "SLTLFACTIFEZS", "SLTLFACTIZS", "NEDABTOTLZS", "NECONPRVTZS", "TXVALMRCHALZS", "AGPRDLVSKXD", "NVAGRTOTLKD", "MSMILTOTLP1", "ERFSHAQUAMT", "TXVALMRCHR6ZS", "SLFAMWORKMAZS", "SLSRVEMPLZS", "SLUEM1524ZS", "SLEMP1524SPMAZS", "SLEMPTOTLSPZS", "NVAGRTOTLKDZG", "BXTRFPWKRCDDT", "EGFECRNEWZS", "BMGSRNFSVCD", "NEEXPGNFSKN", "ENATMCO2EPPGDKD", "SMPOPREFGOR", "MSMILTOTLTFZS", "BXGSRTRANZS", "NEGDIFTOTKDZG", "NVINDMANFCD", "NECONPRVTPCKD", "SHSTAOWADMAZS", "NECONPRVTKDZG", "ENATMCO2ESFZS", "NECONPRVTPPKD", "NYGDPFCSTKD", "NVINDMANFKDZG", "ISAIRGOODMTK1", "TXVALMANFZSUN", "TXVALMMTLZSUN", "NYGNPMKTPKN", "TMQTYMRCHXDWD", "TXUVIMRCHXDWD", "EGELCRNEWZS", "FMLBLBMNYCN", "ENURBMCTY", "FMLBLBMNYGDZS", "NYGNPPCAPPPKD", "NYGNPMKTPPPKD", "FMASTCGOVZGM3", "SHIMMIBCG", "EGELCFOSLZS", "SESECENRLGCFEZS", "IMMIGRATION"]
ind_def_lookup(results)

['Rural population (% of total population)',
 'Urban population growth (annual %)',
 'A woman can travel outside her home in the same way as a man (1=yes; 0=no)',
 'Dismissal of pregnant workers is prohibited (1=yes; 0=no)',
 'Population density (people per sq. km of land area)',
 'Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'Survival to age 65, female (% of cohort)',
 'Men and married women have equal ownership rights to immovable property (1=yes; 0=no)',
 'Population ages 10-14, female (% of female population)',
 'Population ages 15-19, male',
 'Population ages 05-09, male (% of male population)',
 'Population ages 15-64 (% of total population)',
 'Population ages 20-24, female (% of female population)',
 'Life expectancy at birth, female (years)',
 'Death rate, crude (per 1,000 people)',
 'Mortality rate, under-5, male (per 1,000)',
 'Population ages 15-19, female',
 'Population ages 60-64, male (% of male population)',
 'Population ages 65-69, female',
 'Popula