# **The Preston Curve: Examining the relationship between income and life expectancy**

Imports and set magics:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# **Read and clean data**

## **Dataset 1: World Bank Development Indicators**

We import data from the World Bank's World Development Indicators, using a pre-downloaded csv file that contains the relevant variables. The file covers 217 countries. The included variables are GDP per capita in 2019 (PPP, current international $), life expectancy at birth in 2019, the under-5 mortality rate (per 1,000 live births) in 2019 and population size in 2019.

In [None]:
# import data
wd = pd.read_csv('WorldData.csv') 
# inspect data
wd.head() 

We clean the dataset by dropping the columns we do not need: country code and series code.

In [None]:
# drop the following columns: country code and series code
drop_these = ['Country Code', 'Series Code']
wd.drop(drop_these, axis=1, inplace=True)
# rename columns
wd.rename(columns = {'Country Name':'country', 'Series Name':'index', '2019 [YR2019]':'val'}, inplace=True)
wd.head()

The dataset is in long format, with one row per country and series. We want all the information for a country in one row, so we reshape from long to wide using the pivot_table function.

In [None]:
# reshape from long to wide
wd = wd.pivot_table(index='country', columns='index', values='val', aggfunc='sum').reset_index() # reset_index turns the country index back into an ordinary column
wd.head()
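
The long-to-wide reshape can be illustrated on a toy data set. The sketch below is not our actual data: the countries, codes and values are made up, and the column name `code` is an assumption. It also passes a list to `index`, which is one possible way to keep a country code next to the country name through the pivot.

```python
import pandas as pd

# Toy long-format data (hypothetical countries, codes and values)
long_df = pd.DataFrame({
    'country': ['Aland', 'Aland', 'Bland', 'Bland'],
    'code': ['AAA', 'AAA', 'BBB', 'BBB'],
    'index': ['life_exp', 'popl', 'life_exp', 'popl'],
    'val': ['70.1', '5000000', '81.3', '..'],
})

# One row per (country, code); one column per series.
# aggfunc='first' keeps the single value of each cell unchanged.
wide_df = long_df.pivot_table(
    index=['country', 'code'], columns='index', values='val', aggfunc='first'
).reset_index()
wide_df.columns.name = None  # drop the leftover 'index' label on the columns

print(wide_df)
```

Because each country-series pair occurs only once, the aggregation function never combines several values; 'first' simply makes that explicit.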

We rename the variables so that the column names contain no white space.

In [None]:
# define dictionary containing variable names
var_dict = {} # var is for variable
var_dict['GDP per capita, PPP (current international $)'] = 'gdp_per_cap'
var_dict['Life expectancy at birth, total (years)'] = 'life_exp'
var_dict['Mortality rate, under-5 (per 1,000 live births)'] = 'child_mort'
var_dict['Population, total'] = 'popl'
# rename
wd.rename(columns = var_dict,inplace=True)
wd.head()

For the sake of our later analysis, we only keep countries where we have values for all variables. Missing values are coded as '..' in the data. If data is not available for gdp_per_cap, life_exp, child_mort or popl, we delete the observation. This is the case for 37 countries in our dataset.

In [None]:
# create logical index: missing values are coded as '..'
I = wd.gdp_per_cap == '..'
I |= wd.life_exp == '..'
I |= wd.child_mort == '..'
I |= wd.popl == '..'
# print number of cases to delete
print(f'Number of cases to delete = {I.sum()}')
# keep all other observations (copy so that later assignments do not act on a view)
wd = wd.loc[~I].copy()
wd.head()

We see that American Samoa and Andorra have now been removed because of missing values. The same goes for the other 35 observations.

There is also a problem with the variable types: the dataframe has stored all the variables as 'object'.

In [None]:
wd.info() # check the variable types of the dataframe

We change the variable types to the correct ones, so we can work with them.

In [None]:
# change types
wd.country = wd.country.astype('string')
wd.gdp_per_cap = wd.gdp_per_cap.astype(float)
wd.life_exp = wd.life_exp.astype(float)
wd.child_mort = wd.child_mort.astype(float)
wd.popl = wd.popl.astype(float)
# check correct types
wd.info()
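
As a possible alternative to the two steps above (masking '..' and then casting with astype), the conversion and the filtering can be combined. The sketch below uses made-up toy values, not our data: `pd.to_numeric` with `errors='coerce'` turns every non-numeric entry into NaN, and `dropna` then removes the incomplete rows.

```python
import pandas as pd

# Toy data with '..' marking missing values (hypothetical numbers)
raw = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'gdp_per_cap': ['1500.5', '..', '42000'],
    'life_exp': ['61.2', '70.0', '..'],
})

num_cols = ['gdp_per_cap', 'life_exp']

# errors='coerce' converts anything that is not a number (here '..') to NaN
raw[num_cols] = raw[num_cols].apply(pd.to_numeric, errors='coerce')

# keep only rows with values for all numeric variables
clean = raw.dropna(subset=num_cols).copy()

print(f'Number of cases to delete = {len(raw) - len(clean)}')
print(clean.dtypes)
```

One caveat of this approach is that it silently coerces any malformed entry, not only '..', so it is worth printing the number of dropped rows as above.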

We rescale the population size to millions (this will make it easier to use in a graph later).

In [None]:
wd.popl = wd.popl*(10**-6) # change to millions
wd.head()

## **Dataset 2: Continents**

We import another dataset containing the continent of each country. The dataset originates from Our World in Data and is also imported from a pre-downloaded csv file.

In [None]:
# import data
cd = pd.read_csv('ContinentData.csv')
# inspect data
cd.head()

In [None]:
# drop the following columns: Code and Year
drop_cont = ['Code', 'Year']
cd.drop(drop_cont, axis=1, inplace=True)
# rename columns
cd.rename(columns = {'Entity':'country', 'Continent':'continent'}, inplace=True)

# change types
cd.country = cd.country.astype('string')
cd.continent = cd.continent.astype('string')

# check correct types
cd.info()

## **Merge datasets**

We merge the two dataframes wd and cd. We only want to keep observations which are available in both datasets. Therefore we use the 'inner'-method to merge the two data sets. We merge on the country variable which is common for the two datasets.

In [None]:
# merge datasets
wd_cont = pd.merge(wd, cd, how = 'inner', on = 'country')
print(f'Number of countries = {len(wd_cont.country.unique())}')

Our merge using the 'inner'-method has removed a further 19 countries. We now have all the information we want for 161 countries. We believe that this is a sufficient number of observations for our analysis.

We know that, in general, we should not merge on country names, because their spelling can vary across sources. A better option would be to merge on country code (which we drop when cleaning the dataset). We do not keep the country code because we ran into issues when trying to pivot the dataset from long to wide while keeping both country name and country code.
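
A minimal sketch of what a merge on country code could look like is given below. The data frames, codes and values are hypothetical. The `indicator=True` option of `pd.merge` adds a `_merge` column, which makes it possible to list the rows that an inner merge would drop before actually dropping them.

```python
import pandas as pd

# Hypothetical toy data: the codes and values are made up
left = pd.DataFrame({
    'code': ['AAA', 'BBB', 'CCC'],
    'country': ['Aland', 'Bland', 'Cland'],
    'life_exp': [70.0, 80.0, 65.0],
})
right = pd.DataFrame({
    'code': ['AAA', 'BBB', 'DDD'],
    'continent': ['Europe', 'Asia', 'Africa'],
})

# outer merge with indicator: '_merge' is 'both', 'left_only' or 'right_only'
check = pd.merge(left, right, how='outer', on='code', indicator=True)
unmatched = check.loc[check['_merge'] != 'both', ['code', '_merge']]
print(unmatched)

# the inner merge keeps only the codes found in both data frames
merged = pd.merge(left, right, how='inner', on='code')
print(f'Number of countries = {merged.code.nunique()}')
```

Inspecting `unmatched` in this way would show whether countries are lost because they genuinely lack data or only because the keys do not line up.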

## **Examining Our Data**

In this section, we will examine our datasets by using some summary statistics. We begin by using the df.describe function. This gives us an overall view of our data.

In [None]:
# summary statistics
wd_cont.describe()

For life expectancy, the summary shows the following:
- The average life expectancy across the countries in our sample is 72.5 years.
- The lowest life expectancy of any country is 52.9 years.
- The 25th, 50th and 75th percentiles are 66.4, 73.9 and 78.7 years, respectively.
- The highest life expectancy of any country is 84.4 years.

To further examine the data, with regards to our later graphs, we will look at the average life expectancy across continents.

*Average life expectancy by continent:*

In [None]:
# use groupby to calculate means within each continent
wd_grouped1 = wd_cont.groupby(['continent'])['life_exp'].mean() # life expectancy
wd_grouped2 = wd_cont.groupby(['continent'])['gdp_per_cap'].mean() # income
# print averages for all continents
print(f'Average life expectancy by continent:\n{wd_grouped1}')
print(f'Average income by continent:\n{wd_grouped2}')

We see how life expectancy varies across continents. Africa has the lowest average life expectancy, at 63.7 years, while Europe has the highest, at 79.6 years.

We plot this in a bar chart.

In [None]:
# create figure
fig = plt.figure()
# create plot as bar chart
ax = fig.add_subplot(1,1,1)
wd_grouped1.plot(kind='bar', label='_nolegend_')
# alterations to plot
plt.ylim(50,80)
plt.xticks(rotation=45, ha='right')
ax.set_title("Average life expectancy by continent")
ax.set_ylabel("avg. life expectancy")
# add world average life expectancy as horizontal line
plt.axhline(y = np.nanmean(wd_cont.life_exp), color='black', label='World avg. life expectancy')
plt.legend(frameon=False)
plt.show()

We see that Africa and Oceania have average life expectancies below the world average. The average life expectancy in Africa is 8.8 years lower than the world average. To examine this further, we will present the Preston Curve in the next section as well as child mortality across continents.
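
Note that the averages above give every country the same weight, regardless of its population. As a sketch of how much this choice can matter, the toy example below (made-up numbers and continent labels) compares an unweighted mean with a population-weighted mean, both overall and by group.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); popl is in millions
toy = pd.DataFrame({
    'continent': ['X', 'X', 'Y'],
    'life_exp': [60.0, 70.0, 80.0],
    'popl': [90.0, 10.0, 100.0],
})

# unweighted: every country counts the same
unweighted = toy.life_exp.mean()

# weighted: every person counts the same
weighted = np.average(toy.life_exp, weights=toy.popl)

# population-weighted mean within each continent
sums = toy.assign(w=toy.life_exp * toy.popl).groupby('continent')[['w', 'popl']].sum()
by_cont = sums['w'] / sums['popl']

print(f'Unweighted mean: {unweighted:.1f}')
print(f'Population-weighted mean: {weighted:.1f}')
print(by_cont)
```

In continent X the large country has the lower life expectancy, so the weighted mean for X falls below the unweighted mean of its two countries.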

We do the same for income across continents.

In [None]:
# create figure
fig = plt.figure()
# create plot as bar chart
ax = fig.add_subplot(1,1,1)
wd_grouped2.plot(kind='bar', label='_nolegend_')
# alterations to plot
plt.xticks(rotation=45, ha='right')
ax.set_title("Average GDP per capita across continents")
ax.set_ylabel("avg. GDP per cap")
# add world average GDP per capita as horizontal line
plt.axhline(y = np.nanmean(wd_cont.gdp_per_cap), color='black', label='World avg. GDP per cap')
plt.legend(frameon=False)
plt.show()

The two bar charts show the same pattern. Again we see that Africa and Oceania are well below the world average, while Europe pulls the average up.

## **The Preston Curve**

We want to show the relationship between income and life expectancy for all countries that have data for both. A scatter plot gives the following relationship.

In [None]:
# create figure
fig = plt.figure()
# create plot
ax = fig.add_subplot(1,1,1)
plt.scatter(wd.gdp_per_cap, wd.life_exp)

# alterations to figure
ax.set(xlim=(0, 70000), ylim=(50, 90))
ax.set_title('Relationship between income and life expectancy')
ax.set_xlabel('GDP per capita, 2019 (US$, PPP)')
ax.set_ylabel('Life expectancy at birth, 2019')


We have now shown the relationship between income and life expectancy in a simple figure. Next, we make the figure more readable and informative.

We create a dictionary that assigns a colour to each continent.

In [None]:
# create dictionary assigning colours to continents
colours = {
    'Africa':'purple',
    'Asia':'yellow',
    'Europe':'green',
    'North America':'blue',
    'Oceania':'red',
    'South America':'orange'
}

In [None]:
# map dictionary of colours over continents to assign the correct colours for each continent
wd_cont['colour'] = wd_cont['continent'].map(colours)
wd_cont.head()

We will now present the Preston Curve. In the plot we colour-code the continents to show visually how life expectancy varies across continents, as seen in the bar chart above. Furthermore, we take the population size of each country into account by scaling the size of each point in the scatter plot.

In [None]:
# create figure
fig = plt.figure()
# create plot
ax = fig.add_subplot(1,1,1)
plt.scatter(wd_cont.gdp_per_cap, wd_cont.life_exp, s = wd_cont.popl, c=wd_cont['colour'], alpha=0.5) # marker size s is set to the population size (in millions)

# add fitted line
z = np.polyfit(wd_cont.gdp_per_cap, wd_cont.life_exp, 2)
p = np.poly1d(z)
print(p)
gdp_sorted = np.sort(wd_cont.gdp_per_cap) # avoid shadowing the built-in sorted()
plt.plot(gdp_sorted, p(gdp_sorted), color='black')

# add text to selected countries
china_index = wd_cont.loc[wd_cont.country == "China"].index[0]
plt.text(wd_cont.gdp_per_cap[china_index], wd_cont.life_exp[china_index], "China")
nigeria_index = wd_cont.loc[wd_cont.country == "Nigeria"].index[0]
plt.text(wd_cont.gdp_per_cap[nigeria_index], wd_cont.life_exp[nigeria_index], "Nigeria")
usa_index = wd_cont.loc[wd_cont.country == "United States"].index[0]
plt.text(wd_cont.gdp_per_cap[usa_index], wd_cont.life_exp[usa_index], "USA")

# alterations to figure
ax.set(xlim=(0, 70000), ylim=(50, 90))
ax.set_title('Relationship between income and life expectancy')
ax.set_xlabel('GDP per capita, 2019 (US$, PPP)')
ax.set_ylabel('Life expectancy at birth, 2019')
# add legend: colours corresponding to continents
legend_elements = [Line2D([0],[0], marker='o', color=colour, linestyle='') for colour in colours.values()]
ax.legend(legend_elements, colours.keys(), numpoints=1)

From the figure we see that life expectancy rises with income, but at a decreasing rate. At low incomes, a small increase in income is associated with a large increase in life expectancy. At higher incomes, the increase in expected longevity becomes smaller and smaller. As expected, the countries with the lowest life expectancy are found in Africa, as we see a large concentration of purple points in the lower left area of the plot. Likewise, we see a large concentration of green points, representing Europe, in the upper right corner of the plot. Oceania's average life expectancy was below the world average. The plot suggests that this reflects a split within the continent: some countries in Oceania sit in the upper right corner of the plot, while others sit on the left side with low income and low life expectancy and drag down the average.

As seen in the plot, we add a fitted line using the polyfit function, which performs a least-squares fit of a second-degree polynomial. This provides us with an expression for the fitted line: life_exp = -3.994e-09*(gdp_per_cap)^2 + 0.0005767*(gdp_per_cap) + 63.76. This is not a full regression analysis; it is only a very simple approximation of the non-linear relationship between income and life expectancy, and it can only be interpreted as a correlation.
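
To get a feel for what the fitted expression implies, the sketch below evaluates it at a few income levels chosen only for illustration. It uses the rounded coefficients reported in the text, so the numbers are approximate. The derivative shows the flattening slope, and the turning point of the parabola (at roughly 72,200, just outside the plotted range) is a reminder that a quadratic is only a rough local approximation: beyond that point it would imply falling life expectancy.

```python
import numpy as np

# Rounded coefficients from the text, highest power first
p = np.poly1d([-3.994e-09, 0.0005767, 63.76])
dp = p.deriv()  # first derivative: change in life_exp per extra unit of gdp_per_cap

# illustrative income levels (chosen for this example only)
for gdp in (1_000, 10_000, 50_000):
    print(f'gdp_per_cap = {gdp:>6}: fitted life_exp = {p(gdp):.2f}, '
          f'slope per 1,000 = {1000 * dp(gdp):.3f}')

# turning point of the parabola: -b / (2a)
peak = -p.coeffs[1] / (2 * p.coeffs[0])
print(f'Fitted curve peaks at gdp_per_cap = {peak:.0f}')
```

The slope stays positive over the whole plotted range but shrinks as income grows, which matches the concave shape described above.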

Having presented the Preston Curve, we want to look at why Africa in particular has a very low life expectancy. One likely contributor is that child mortality is very high in this region, in contrast to the more developed parts of the world where income is higher.

In [None]:
# create figure
fig = plt.figure()
# create plot
ax = fig.add_subplot(1,1,1)
plt.scatter(wd_cont.gdp_per_cap, wd_cont.child_mort, c=wd_cont['colour'], alpha=0.8)

# alterations to figure
ax.set_title('Relationship between income and child mortality')
ax.set_xlabel('GDP per capita, 2019 (US$, PPP)')
ax.set_ylabel('Mortality rate, under-5 (per 1,000 births)')
# add legend: colours corresponding to continents
legend_elements = [Line2D([0],[0], marker='o', color=colour, linestyle='') for colour in colours.values()]
ax.legend(legend_elements, colours.keys(), numpoints=1)

We see that the observations for Africa are concentrated in the top left of the plot, implying high child mortality and low income. Child mortality falls rapidly as income rises. Europe has low child mortality across a range of (higher) income levels.

# **Conclusion**

In this project, we have reproduced the Preston Curve and thereby shown the positive relationship between income and life expectancy.

We find, in line with other empirical work, that life expectancy in Africa is visibly lower than in other (richer) parts of the world. This low life expectancy goes hand in hand with the high child mortality in Africa, which is likely a major contributor.