# World Bank Correlations

Thank you for your interest in the World Bank Correlations package, which was built with both practical application and theoretical research in mind. Through its APIs, the World Bank makes available data on over 20,000 variables. Although a notable share of this data is missing and some variables are available only for certain countries or years, the data can offer initial insight into which relationships exist and may warrant further investigation when building a model of, or making predictions about, a variable of interest.

The functions in this package let users quickly find relationships with variables in the World Bank dataset through several search methods, chosen according to the user's interest or theory. Each search method also accepts optional limits on the results, as explained below. This package uses functions built by MWOUTS in the world_bank_data package (https://github.com/mwouts/world_bank_data).

## Package Installation

The following call can be used to install the package from GitHub.

In [1]:
pip install git+https://github.com/JohnMarion54/World_Bank_Correlations



The following general setup loads two datasets that are used as examples throughout. The first is retrieved for a World Bank variable with the world_bank_data package; the second is data the user has already downloaded. Together they show that the functions work with either source. Note that both datasets have a column called "Country" and one called "Year," a requirement of the World Bank Correlations functions.

In [1]:
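import requests
import world_bank_data as wb
import pandas as pd
# World Bank series with ID 3.0.Gini; reset_index() turns the index levels into columns
sample=wb.get_series('3.0.Gini',mrv=50).reset_index()
# Data previously downloaded by the user
sample2=pd.read_csv('C:\\Users\\John Marion\\...\\Bottom_40.csv')
# The package functions require columns named "Country" and "Year"
sample2.rename(columns = {'country':'Country', 'year':'Year'}, inplace = True)
# Drop the columns that are not used in the examples
sample2.drop(['Income_share_held_by_lowest_20%', 'Income_share_held_by_fourth_20%','Unnamed: 0'], axis=1, inplace=True)
# Keep characters 1 through 4 of each Year string, dropping the one-character prefix
sample2['Year']=sample2['Year'].str.slice(1, 5)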

In [3]:
sample.head(1)

| | Country | Series | Year | 3.0.Gini |
|---|---|---|---|---|
| 0 | Andean Region | Gini Coefficient | 2000 | 0.560277 |


In [4]:
sample2.head(1)

| | Country | Year | income_share_bottom_40 |
|---|---|---|---|
| 0 | Afghanistan | 1960 | NaN |
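
Because every function in the package expects "Country" and "Year" columns, it can help to check a user-supplied frame before passing it on. The helper below is a minimal sketch and not part of the package: `prepare_frame`, its parameters, and the sample rows are hypothetical names and made-up values used only for illustration.

```python
import pandas as pd

REQUIRED_COLUMNS = ("Country", "Year")


def prepare_frame(df, country_col="country", year_col="year"):
    """Return a copy of df with the columns renamed to "Country" and "Year".

    Hypothetical helper: raises ValueError if either required column is
    still missing after renaming.
    """
    out = df.rename(columns={country_col: "Country", year_col: "Year"})
    missing = [c for c in REQUIRED_COLUMNS if c not in out.columns]
    if missing:
        raise ValueError(f"missing required column(s): {missing}")
    return out


# Made-up rows used only to exercise the helper.
raw = pd.DataFrame(
    {"country": ["A", "A"], "year": ["2000", "2001"], "value": [1.0, 2.0]}
)
prepared = prepare_frame(raw)
print(list(prepared.columns))  # ['Country', 'Year', 'value']
```

Failing early with a clear message is usually easier to debug than an error raised deep inside a later merge.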


In [None]:
from World_Bank_Correlations import World_Bank_Correlations as wbc

# Investigation by Specific Indicator
The first function of the package finds the relationship (either the correlation or the correlation between annual percent changes) between the user's chosen variable and one or more specifically named World Bank variables. Each of these variables is named by its ID, as given by the World Bank. A list of indicators and their IDs is available through the API at http://api.worldbank.org/v2/indicator?per_page=21000 or, one page at a time, at http://api.worldbank.org/v2/indicator?page=2 (the page number can be changed). Alternatively, one can run a command similar to the following:
pd.read_xml(requests.get('http://api.worldbank.org/v2/indicator?page=250').content)
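
For paging through the indicator list programmatically, a small standard-library sketch is given below. It assumes that the API accepts a `format=json` query parameter and answers with a two-element list (paging metadata followed by the indicator records, each with `id` and `name` fields). The function names are hypothetical and not part of this package or of world_bank_data.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://api.worldbank.org/v2/indicator"


def indicator_url(page=1, per_page=50):
    """Build the URL for one page of the indicator list, requested as JSON."""
    query = urlencode({"format": "json", "page": page, "per_page": per_page})
    return f"{BASE_URL}?{query}"


def fetch_indicator_page(page=1, per_page=50):
    """Download one page and return a list of (id, name) pairs.

    Assumes the response is a two-element list: paging metadata
    followed by the indicator records.
    """
    with urlopen(indicator_url(page, per_page)) as response:
        _meta, records = json.load(response)
    return [(rec["id"], rec["name"]) for rec in records]


# Only the URL is built here; fetch_indicator_page() needs network access.
print(indicator_url(page=2))
```

Calling `fetch_indicator_page(page=2)` would then return the IDs that can be passed as the indicator argument described below.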

__wb_corr(data, col, indicator, change=False)__

Required arguments:
- data: A DataFrame containing columns "Country," "Year," and information on the variable of interest
- col: The integer position of the column in data that holds the variable of interest
- indicator: The ID of an indicator whose relationship with the input variable is desired, or a list of such IDs

Optional arguments:
- change: True to also find the correlation between the annual percent changes of the variable of interest and of each World Bank variable

Using the sample built previously, we can investigate the relationship between the Gini Coefficient (chosen when building the sample) and another World Bank variable, such as the percent of GDP spent on research and development, which has the ID GB.XPD.RSDV.GD.ZS.

In [19]:
wbc.wb_corr(sample,3,'GB.XPD.RSDV.GD.ZS')

| Indicator | Correlation | n |
|---|---|---|
| Research and development expenditure (% of GDP) | 0.110859 | 114 |


Here, the correlations displayed are correlations with the input variable (in this case, the Gini Coefficient).

More than one relationship can be found if desired: a list of indicator IDs can be passed to find the relationship between the input variable and each listed indicator. Further, change can be set to True to also find the correlation between the annual percent changes of the input variable and those of the chosen variable(s).
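
To make the two measures concrete, the sketch below computes a correlation and a correlation between percent changes with plain pandas. It is an illustration under stated assumptions, not the package's implementation: rows are matched on "Country" and "Year," and the change is taken between consecutive matched rows within a country, which equals the annual change only when no year is missing. The function name and the numbers are made up.

```python
import pandas as pd


def level_and_change_corr(left, right, left_col, right_col):
    """Correlate two variables directly and in percent changes."""
    merged = (
        left.merge(right, on=["Country", "Year"])
        .dropna(subset=[left_col, right_col])
        .sort_values(["Country", "Year"])
    )
    level_corr = merged[left_col].corr(merged[right_col])

    # Percent change relative to the previous row of the same country.
    previous = merged.groupby("Country")[[left_col, right_col]].shift(1)
    changes = (merged[[left_col, right_col]] / previous - 1).dropna()
    change_corr = changes[left_col].corr(changes[right_col])

    return {
        "Correlation": level_corr,
        "n": len(merged),
        "Correlation_change": change_corr,
        "n_change": len(changes),
    }


# Made-up numbers for two hypothetical countries.
keys = {"Country": ["A", "A", "A", "B", "B"],
        "Year": [2000, 2001, 2002, 2000, 2001]}
x = pd.DataFrame({**keys, "x": [1.0, 2.0, 4.0, 2.0, 3.0]})
y = pd.DataFrame({**keys, "y": [10.0, 20.0, 40.0, 5.0, 4.0]})
result = level_and_change_corr(x, y, "x", "y")
print(result["n"], result["n_change"])  # 5 3
```

The first row of each country has no previous value, which is why fewer observations enter the change correlation than the plain correlation.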

In the example below, the data read in from a CSV file is used, demonstrating that the input variable need not come from the World Bank data. The function finds both types of relationship between the loaded variable and a list of World Bank variables. Here, we relate the income share of the bottom 40% to the percent of GDP spent on research and development (GB.XPD.RSDV.GD.ZS) and to total population (SP.POP.TOTL).

In [20]:
wbc.wb_corr(sample2,2,['GB.XPD.RSDV.GD.ZS','SP.POP.TOTL'],change=True)

| Indicator | Correlation | n | Correlation_change | n_change |
|---|---|---|---|---|
| Research and development expenditure (% of GDP) | 0.404506 | 1053 | 0.056673 | 799 |
| Population, total | -0.022363 | 1739 | -0.008851 | 1057 |


# Investigation by Topic
Rather than find the relationship(s) between the input variable and specifically named variables, the wb_topic_corrs function can be used to find the k variables under a given topic (as defined by the World Bank) that have the strongest relationship with an input variable. The relationship to judge can be either the correlation between the input variable and the variables under the chosen topic (change=False) or the correlation between the annual percent changes (change=True).

The World Bank defines 21 topics such as Economy & Growth and Social Protection & Labor. A complete list of topics can be found here: http://api.worldbank.org/v2/topic?

Topics chosen in the function can either be written as the integer corresponding to the topic or the name of the topic as a string. The k strongest relationships will be found and displayed. 
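
Accepting either form of the topic argument amounts to a small lookup. The helper below is a hypothetical sketch of such a lookup, not the package's code. Matching names without regard to case is an assumption of the sketch, and the only ID it uses is the pairing of 11 with "Poverty" mentioned in this README; a full mapping would come from the topic endpoint linked above.

```python
def resolve_topic(topic, topics):
    """Return the integer ID for a topic given as an ID or as a name.

    `topics` maps integer IDs to topic names. Name matching ignores case
    and surrounding whitespace. Raises KeyError for an unknown topic.
    """
    if isinstance(topic, int):
        if topic in topics:
            return topic
        raise KeyError(f"unknown topic ID: {topic}")
    wanted = topic.strip().lower()
    for topic_id, name in topics.items():
        if name.strip().lower() == wanted:
            return topic_id
    raise KeyError(f"unknown topic name: {topic!r}")


# Deliberately incomplete mapping, for illustration only.
known_topics = {11: "Poverty"}
print(resolve_topic("poverty", known_topics), resolve_topic(11, known_topics))  # 11 11
```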

__wb_topic_corrs(data,col,topic,k=5,change=False,nlim=1,cor_lim=0,t_lim=0)__

Required arguments:
- data: A DataFrame containing columns "Country," "Year," and information on the variable of interest
- col: The integer position of the column in data that holds the variable of interest
- topic: The integer associated with one of the topics, or the topic name as a string

Optional arguments:
- k: The number of variables to return
- change: True to find and order variables by the correlation between the annual percent changes of the variable of interest and of the World Bank variables; False (the default) to use the correlation between the variables themselves
- nlim: The minimum number of observations to be used to find the correlation
- cor_lim: The minimum absolute value of the correlation of variables to be displayed
- t_lim: The minimum absolute value of the t-value associated with the correlation of variables to be displayed
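
This README does not state how the t-value behind t_lim is computed. The standard t statistic for a Pearson correlation, $t = r\sqrt{(n-2)/(1-r^2)}$, reproduces the t-values shown in the keyword-search output later in this README, so the sketch below assumes that formula. The function name is hypothetical.

```python
import math


def t_value(r, n):
    """t statistic for a Pearson correlation r based on n observations."""
    if n <= 2 or abs(r) >= 1:
        raise ValueError("t is undefined for n <= 2 or |r| >= 1")
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Correlation and n taken from the "Income share held by highest 10%" row
# of the keyword-search output in this README.
print(t_value(-0.994108, 1741))
```

The result is close to the -382.451007 displayed for that row; the small difference comes from the correlation being rounded to six decimals in the output.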

In the example below, the 5 variables listed under the topic of "Financial Sector" that have the strongest correlation with the Gini Coefficient are found.

In [23]:
wbc.wb_topic_corrs(sample,3,'Financial Sector')

| Indicator | Correlation | n |
|---|---|---|
| Nonfinancial corporate bonds to total bonds and notes outstanding (%) | -0.766563 | 32 |
| Outstanding domestic public debt securities to GDP (%) | 0.712796 | 38 |
| Risk premium on lending (lending rate minus treasury bill rate, %) | 0.587889 | 46 |
| Private credit bureau coverage (% of adults) | -0.550783 | 128 |
| Mutual fund assets to GDP (%) | 0.528449 | 80 |


Additional options are available when investigating by topic. Suppose a user wants the 7 variables under the "Poverty" topic that are most strongly related to the income share of the bottom 40% in terms of the annual percent change of both measures, but only where at least 20 observations can be used to calculate the correlation. In that case, change can be set to True and nlim can be set to 19 (which suggests that nlim is treated as an exclusive lower bound). To demonstrate the integer form of the topic argument, the example passes 11, the index associated with the "Poverty" topic.

In [24]:
wbc.wb_topic_corrs(sample2,2,11,k=7,change=True,nlim=19)

| Indicator | Correlation | n | Correlation_change | n_change |
|---|---|---|---|---|
| Income share held by highest 10% | -0.994108 | 1741 | -0.905048 | 1057 |
| Income share held by highest 20% | -0.993918 | 1741 | -0.898675 | 1057 |
| Income share held by third 20% | 0.977071 | 1741 | 0.872442 | 1057 |
| Income Share of Fifth Quintile | -0.96237 | 160 | -0.833629 | 125 |
| Gini index (World Bank estimate) | -0.989436 | 1741 | -0.830589 | 1057 |
| Income share held by fourth 20% | 0.830866 | 1741 | 0.829783 | 1057 |
| Income share held by second 20% | 0.973005 | 1741 | 0.821601 | 1057 |


# Investigation by Keyword Search
Rather than searching by topic, users can find the k strongest relationships in the World Bank data through a keyword search. Variables matching the search are found and their correlation with the input variable is calculated. Again, variables are listed by the absolute value of their correlation with the input variable if change is set to False, or by the absolute value of the correlation between the annual percent changes if change is set to True. The search argument should be a string that is matched against variable names. The function takes advantage of the search_indicators() function from the world_bank_data package built by MWOUTS.
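
As a rough picture of how a keyword narrows the candidate variables, the sketch below filters a handful of indicator names with a case-insensitive substring test. This is an assumption made for illustration; the actual matching is done by search_indicators() and may behave differently. The function name is hypothetical, and the names are copied from the example outputs in this README.

```python
def match_names(names, search):
    """Return the names that contain `search`, ignoring case."""
    needle = search.lower()
    return [name for name in names if needle in name.lower()]


# Indicator names copied from the example outputs in this README.
names = [
    "Income share held by highest 10%",
    "Income Share of Fifth Quintile",
    "Population, total",
    "Research and development expenditure (% of GDP)",
]
print(match_names(names, "income share"))
# ['Income share held by highest 10%', 'Income Share of Fifth Quintile']
```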

__wb_corrs_search(data,col,search,k=5,change=False,nlim=1,cor_lim=0,t_lim=0)__

Required arguments:
- data: A DataFrame containing columns "Country," "Year," and information on the variable of interest
- col: The integer position of the column in data that holds the variable of interest
- search: The keyword(s) used to match variables from the World Bank

Optional arguments:
- k: The number of variables to return
- change: True to find and order variables by the correlation between the annual percent changes of the variable of interest and of the World Bank variables; False (the default) to use the correlation between the variables themselves
- nlim: The minimum number of observations to be used to find the correlation
- cor_lim: The minimum absolute value of the correlation of variables to be displayed
- t_lim: The minimum absolute value of the t-value associated with the correlation of variables to be displayed
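
The way the optional limits combine can be sketched on a table of already computed correlations. The code below is a hedged illustration rather than the package's logic. It assumes columns named "Correlation" and "n," treats nlim as an exclusive bound (in line with the nlim=19 example for "at least 20 observations"), treats cor_lim and t_lim as inclusive, and uses the standard t statistic for a correlation. The function name is hypothetical; the rows are copied from the "Financial Sector" output earlier in this README.

```python
import numpy as np
import pandas as pd


def filter_and_rank(results, k=5, nlim=1, cor_lim=0, t_lim=0):
    """Apply the optional limits, then keep the k strongest correlations."""
    r = results["Correlation"]
    n = results["n"]
    # Standard t statistic for a Pearson correlation; NaN when n < 2.
    t = r * np.sqrt((n - 2) / (1 - r ** 2))
    keep = (n > nlim) & (r.abs() >= cor_lim) & (t.abs() >= t_lim)
    kept = results[keep]
    # Rank by absolute correlation, strongest first.
    order = kept["Correlation"].abs().sort_values(ascending=False).index
    return kept.loc[order].head(k)


rows = pd.DataFrame(
    {"Correlation": [-0.766563, 0.712796, 0.587889, -0.550783, 0.528449],
     "n": [32, 38, 46, 128, 80]},
    index=pd.Index(
        ["Nonfinancial corporate bonds to total bonds and notes outstanding (%)",
         "Outstanding domestic public debt securities to GDP (%)",
         "Risk premium on lending (lending rate minus treasury bill rate, %)",
         "Private credit bureau coverage (% of adults)",
         "Mutual fund assets to GDP (%)"],
        name="Indicator",
    ),
)
top = filter_and_rank(rows, k=2, nlim=40)
print(top)
```

With nlim=40 the two strongest rows drop out because they rest on 32 and 38 observations, so the risk premium and credit bureau indicators move to the top.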

The following example finds the 3 strongest correlations between the Gini Coefficient and variables whose names contain the phrase "income share."

In [26]:
wbc.wb_corrs_search(sample,3,"income share",k=3)

| Indicator | Correlation | n |
|---|---|---|
| Income Share of Fifth Quintile | 0.991789 | 160 |
| Income Share of Second Quintile | -0.985993 | 160 |
| Income share held by highest 20% | 0.970095 | 172 |


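If only correlations that are likely to be statistically significant are desired, t_lim can be set to about 1.96, the approximate two-sided 5% critical value for large samples. In the example output below, the t-values appear alongside the correlations.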
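
The value 1.96 is the two-sided 5% critical value of the standard normal distribution, which the t distribution approaches as the number of observations grows. A short standard-library check of that number:

```python
from statistics import NormalDist

# Two-sided 5% critical value of the standard normal distribution.
z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)
print(round(z_crit, 2))  # 1.96
```

For correlations based on only a few observations, the exact t critical value is larger than 1.96, so a somewhat higher t_lim is the more cautious choice there.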

In [27]:
wbc.wb_corrs_search(sample2,2,"income share",change=True,t_lim=1.96)

| Indicator | Correlation | n | t | Correlation_change | n_change | t_change |
|---|---|---|---|---|---|---|
| Income share held by highest 10% | -0.994108 | 1741 | -382.451007 | -0.905048 | 1057 | -69.118209 |
| Income share held by highest 20% | -0.993918 | 1741 | -376.370151 | -0.898675 | 1057 | -66.549576 |
| Income share held by third 20% | 0.977071 | 1741 | 191.371319 | 0.872442 | 1057 | 57.983605 |
| Income Share of Fifth Quintile | -0.96237 | 160 | -44.515475 | -0.833629 | 125 | -16.739031 |
| Income share held by fourth 20% | 0.830866 | 1741 | 62.264102 | 0.829783 | 1057 | 48.29362 |


# Search with All Variables

### Warning: The following code will take significant time to run. It is generally advisable to use one of the other search methods provided; a better alternative is to run the topic function for each of the 21 topics.

To find the k World Bank variables that have the strongest relationship with an input variable of interest, the following code finds the correlation (or correlation between annual percent changes) between a user's input variable and all variables available in the World Bank data and displays the k strongest. As with the previous functions, results are listed by correlation if change is set to False and by the correlation in annual percent changes if change is True. 

__wb_every(data,col,k=5,change=False,nlim=1,cor_lim=0,t_lim=0)__

Required arguments:
- data: A DataFrame containing columns "Country," "Year," and information on the variable of interest
- col: The integer position of the column in data that holds the variable of interest

Optional arguments:
- k: The number of variables to return
- change: True to find and order variables by the correlation between the annual percent changes of the variable of interest and of the World Bank variables; False (the default) to use the correlation between the variables themselves
- nlim: The minimum number of observations to be used to find the correlation
- cor_lim: The minimum absolute value of the correlation of variables to be displayed
- t_lim: The minimum absolute value of the t-value associated with the correlation of variables to be displayed
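
The warning above suggests running the topic function once per topic instead. One way to combine those runs is sketched below, under several assumptions: the topic IDs run from 1 to 21, each result is indexed by indicator name with a "Correlation" column (as in the example outputs), and an indicator may appear under more than one topic. The function name and its `rank_col` parameter are hypothetical.

```python
import pandas as pd


def strongest_across_topics(topic_corrs, data, col, topics=range(1, 22),
                            k=5, rank_col="Correlation", **limits):
    """Run a topic-level search for each topic and keep the k strongest rows.

    `topic_corrs` is any callable with the signature of wb_topic_corrs.
    """
    frames = [topic_corrs(data, col, topic, k=k, **limits) for topic in topics]
    combined = pd.concat(frames)
    # An indicator may be returned for several topics; keep one copy.
    combined = combined[~combined.index.duplicated(keep="first")]
    order = combined[rank_col].abs().sort_values(ascending=False).index
    return combined.loc[order].head(k)


# Intended use with the objects defined in this README (not run here):
# strongest_across_topics(wbc.wb_topic_corrs, sample, 3, k=5)
```

When change=True is passed through `limits`, setting `rank_col="Correlation_change"` would rank by the change correlation instead.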