<a href="https://colab.research.google.com/github/ReidelVichot/DSTEP23/blob/main/week_3/dstep23_world_bank_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **DSTEP23 // World Bank: GDP per Capita and Life Expectancy**

*September 14, 2023*

This notebook will explore the relationship between GDP per capita and life expectancy at birth using pre-processed$^{\dagger}$ World Bank data.

<small><i>$^{\dagger}$ we'll look at <u>how</u> the data was pre-processed in the coming weeks. </i></small>

---

The first step is to create the link between the virtual machine running on Google's computational platform and our Google Drive containing the data.

In [None]:
# -- link google drive
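
In Colab, this link is usually made by mounting Drive into the runtime's file system. The sketch below assumes the conventional mount point `/content/drive` and guards the import so that it does nothing outside Colab; the `in_colab` and `mount_point` names are introduced here purely for illustration.

```python
# -- mount Google Drive when running inside Colab; do nothing elsewhere
try:
    from google.colab import drive  # only importable in a Colab runtime
    in_colab = True
except ImportError:
    in_colab = False

# -- conventional Colab mount point (an assumption, adjust if needed)
mount_point = "/content/drive"

if in_colab:
    drive.mount(mount_point)
else:
    print("not running in Colab; skipping Drive mount")
```

Once mounted, files in Drive appear under the mount point like any other local path.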


Import useful modules:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Now we load the data directly into this Colaboratory python runtime using pandas.

In [None]:
# -- read in the data
fname =
wb =

# -- check that data was read in correctly
print(wb.head())
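
As a minimal sketch of the loading step, assuming the pre-processed data is stored as a CSV file: the example below reads a tiny in-memory CSV instead of the real file, so it runs anywhere. The column names `gpc2017` and `leb2017` are the ones used later in this notebook; the `country` column and all values are made up.

```python
import io
import pandas as pd

# -- hypothetical stand-in for the real file: a tiny CSV held in memory
csv_text = """country,gpc2017,leb2017
A,500.0,55.0
B,5000.0,70.0
C,50000.0,81.0
"""

# -- pd.read_csv accepts a file path or a file-like object
demo = pd.read_csv(io.StringIO(csv_text))

# -- check that the data was read in correctly
print(demo.head())
```

With the real data, the `io.StringIO(...)` argument would be replaced by the path to the file on the mounted Drive.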

As an example of working with this DataFrame, let's sort "in place" by GDP per Capita in 2017 to find the lowest 5 values:

In [None]:
# -- sort in place by GDP per Capita in 2017


print(wb.head())
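
One way to do this, sketched on a small made-up DataFrame (the `country` labels and values are hypothetical), is the `sort_values` method with `inplace=True`:

```python
import pandas as pd

# -- made-up stand-in values, not World Bank data
demo = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "gpc2017": [5000.0, 500.0, 50000.0, 1500.0],
})

# -- inplace=True modifies demo itself; the method then returns None
demo.sort_values("gpc2017", inplace=True)

# -- after an ascending sort, the first rows hold the lowest values
lowest = demo.head(2)
print(lowest)
```

Without `inplace=True`, `sort_values` returns a sorted copy and leaves the original DataFrame unchanged.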

---

### **Exploring summary statistics with Pandas and NumPy**

Let's look at some summary statistics for the data.  Pandas DataFrames have a "describe" method:

In [None]:
# -- summarize the DataFrame using the describe method


though there is not much [flexibility](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html).
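
A short sketch of what `describe` offers, using made-up values: the main knob is the `percentiles` argument, and anything beyond that can be computed directly with NumPy.

```python
import numpy as np
import pandas as pd

# -- made-up stand-in values, not World Bank data
demo = pd.DataFrame({
    "gpc2017": [500.0, 1500.0, 5000.0, 50000.0],
    "leb2017": [55.0, 62.0, 70.0, 81.0],
})

# -- default summary: count, mean, std, min, quartiles, max
summary = demo.describe()
print(summary)

# -- the percentiles argument changes which quantiles are reported
custom = demo.describe(percentiles=[0.1, 0.9])
print(custom)

# -- other statistics can be computed with NumPy on a single column
med = np.median(demo["gpc2017"])
print(med)
```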

As we did with sea level, we can also visualize how one column varies with another,

In [None]:
# -- make a scatter plot of Life Expectancy at Birth in 2017 versus GDP per
#    Capita in 2017
fig, ax = plt.subplots()

ax.set_xlabel("GDP per Capita in 2017 [USD]")
ax.set_ylabel("Life Expectancy at Birth in 2017 [years]")
fig.show()

How would we characterize this relationship?  Linear?

What about a "scatter plot" with the log of the GDP?$^*$

<small><i> $^*$ Recall that you can modify column values by using square brackets and the name of the column.  If the column does not exist in the DataFrame, the column is created. </i></small>

In [None]:
# -- calculate the log of the GDP per Capita
wb["log10gpc"] =

In [None]:
# -- print out the GDP per capita and the log of the GDP per capita
cols = ["gpc2017", "log10gpc"]
print(wb[cols])
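
The two cells above can be sketched as follows, assuming (as the column name `log10gpc` suggests) a base-10 logarithm; the values are made up so that the result is easy to check by eye.

```python
import numpy as np
import pandas as pd

# -- made-up stand-in values chosen as exact powers of ten
demo = pd.DataFrame({"gpc2017": [100.0, 1000.0, 10000.0]})

# -- assigning to a new column name creates the column
demo["log10gpc"] = np.log10(demo["gpc2017"])

# -- select several columns at once with a list of names
cols = ["gpc2017", "log10gpc"]
print(demo[cols])
```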

In [None]:
# -- make a scatter plot of Life Expectancy at Birth in 2017 versus the log of
#    GDP per Capita (in USD) in 2017
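
A sketch of such a plot on made-up values, using the same `DataFrame.plot.scatter` call that appears later in this notebook:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# -- made-up stand-in values, not World Bank data
demo = pd.DataFrame({
    "gpc2017": [300.0, 1200.0, 4500.0, 16000.0, 60000.0],
    "leb2017": [54.0, 62.0, 69.0, 76.0, 82.0],
})
demo["log10gpc"] = np.log10(demo["gpc2017"])

# -- scatter plot against the log-transformed column
fig, ax = plt.subplots(figsize=(7, 4))
demo.plot.scatter("log10gpc", "leb2017", color="red", ax=ax)
ax.set_xlabel("log(GDP per Capita in 2017 [USD])")
ax.set_ylabel("Life Expectancy at Birth in 2017 [years]")
```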


---

### **Fitting a linear model to the data**

There are several ways to implement ordinary least squares (OLS) regression in Python.  Let's use the `statsmodels` module:

In [None]:
# -- import statsmodels using the formula api
import statsmodels.formula.api as sm

In [None]:
# -- first build the model
model =

# -- now fit the model to the data
result =

We can see a summary of how the model performs:

In [None]:
# -- summarize the model fit
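
The build, fit, and summarize steps can be sketched as below. The data are synthetic (a straight line plus random noise, not World Bank values), and the formula `"leb2017 ~ log10gpc"` is one plausible way to express "life expectancy as a linear function of log GDP per capita" with the formula API.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# -- synthetic stand-in data: a straight line plus Gaussian noise
rng = np.random.default_rng(0)
demo = pd.DataFrame({"log10gpc": np.linspace(2.5, 5.0, 40)})
demo["leb2017"] = 30.0 + 10.0 * demo["log10gpc"] + rng.normal(0.0, 2.0, 40)

# -- build the model from an R-style formula: response ~ predictor
model = sm.ols("leb2017 ~ log10gpc", data=demo)

# -- fit the model to the data by ordinary least squares
result = model.fit()

# -- the full summary table, then just the fitted coefficients
print(result.summary())
print(result.params)
```

The intercept is included automatically by the formula, so `result.params` holds two entries, `Intercept` and `log10gpc`.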


Finally, let's determine the model "predictions" for the data,

In [None]:
# -- calculate the model "prediction" of Life Expectancy at Birth for each
#    log(GDP per Capita) data point
pred =

and overplot the model on the data.

In [None]:
# -- make a scatter plot of Life Expectancy at Birth in 2017 versus the log of
#    GDP per Capita (in USD) in 2017 and overlay an OLS fit
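
The prediction and overlay steps can be sketched as follows, again on synthetic data rather than the World Bank values. Passing the DataFrame to `predict` evaluates the fitted line at every observed `log10gpc`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm

# -- synthetic stand-in data: a straight line plus Gaussian noise
rng = np.random.default_rng(1)
demo = pd.DataFrame({"log10gpc": np.linspace(2.5, 5.0, 40)})
demo["leb2017"] = 30.0 + 10.0 * demo["log10gpc"] + rng.normal(0.0, 2.0, 40)

result = sm.ols("leb2017 ~ log10gpc", data=demo).fit()

# -- model value at every observed log10gpc
pred = result.predict(demo)

# -- scatter the data and draw the fitted line on the same axes
fig, ax = plt.subplots(figsize=(7, 4))
demo.plot.scatter("log10gpc", "leb2017", color="red", ax=ax, label="data")
ax.plot(demo["log10gpc"], pred, color="steelblue", lw=2, label="linear model")
ax.set_xlabel("log(GDP per Capita in 2017 [USD])")
ax.set_ylabel("Life Expectancy at Birth in 2017 [years]")
ax.legend()
```

Because `ax.plot` connects points in row order, the line is drawn cleanly only when the rows are sorted by the x column, as they are here.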


---

### **Comparing the fit with a more complex model**

It **<i>may</i>** be that there is a slight decrease in life expectancy at birth at lower GDP per capita.  Let's try to model this with a quadratic function and compare the two models with a likelihood-ratio test.

LINEAR model <br>
$y = a_0 + a_1 \cdot x$
<br><br>
QUADRATIC model <br>
$y = a_0 + a_1 \cdot x + a_2 \cdot x^2$

First, we fit the quadratic model:

In [None]:
# -- build the quadratic model
model2 =

# -- now fit the model to the data
result2 =

# -- summarize the fit
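
A sketch of the quadratic fit on synthetic data (not World Bank values). In the formula, `I(...)` tells the formula parser to evaluate the enclosed expression arithmetically, which adds the squared term as a third coefficient.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# -- synthetic stand-in data: a straight line plus Gaussian noise
rng = np.random.default_rng(2)
demo = pd.DataFrame({"log10gpc": np.linspace(2.5, 5.0, 40)})
demo["leb2017"] = 30.0 + 10.0 * demo["log10gpc"] + rng.normal(0.0, 2.0, 40)

# -- build the quadratic model: a_0 + a_1 * x + a_2 * x^2
model2 = sm.ols("leb2017 ~ log10gpc + I(log10gpc**2)", data=demo)

# -- now fit the model to the data
result2 = model2.fit()

# -- summarize the fit
print(result2.summary())
```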


Let's overplot the two models now,

In [None]:
# -- make a scatter plot of Life Expectancy at Birth in 2017 versus the log of
#    GDP per Capita (in USD) in 2017 and overlay both OLS fits

# -- predict the Life Expectancy at Birth with the new model
pred2 =

# -- make the plot
fig, ax = plt.subplots(figsize=(7, 4))
wb.plot.scatter("log10gpc", "leb2017", color="red", ax=ax, label="data")
ax.plot(wb["log10gpc"], pred, color="steelblue", lw=2, label="linear model")
ax.plot(wb["log10gpc"], pred2, color="black", lw=2, label="quadratic model")
ax.set_xlabel("log(GDP per Capita in 2017 [USD])")
ax.set_ylabel("Life Expectancy at Birth in 2017 [years]")
ax.legend()
fig.show()

Does a likelihood ratio test reject the null hypothesis?  Because these models are "nested" (i.e., the linear model is the special case of the quadratic model with $a_2 = 0$), we can use the likelihood ratio test provided by `statsmodels`:

In [None]:
# -- likelihood ratio test
lr, pval, ddof =
print("p-value: {0}".format(round(pval, 3)))

This p-value indicates that the null hypothesis (that a straight line is an equally good fit to the data) is **<i>not</i>** rejected.  That is, the data give no reason to include the quadratic term in the fit.
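
The mechanics of the test can be sketched on synthetic data (generated from a straight line, so it says nothing about the World Bank result). The sketch calls `compare_lr_test` on the larger model with the restricted model as its argument, then reproduces the numbers by hand: the statistic is twice the difference in maximized log-likelihoods, compared against a chi-squared distribution whose degrees of freedom equal the number of extra parameters.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm
from scipy import stats

# -- synthetic stand-in data: a straight line plus Gaussian noise
rng = np.random.default_rng(3)
demo = pd.DataFrame({"log10gpc": np.linspace(2.5, 5.0, 40)})
demo["leb2017"] = 30.0 + 10.0 * demo["log10gpc"] + rng.normal(0.0, 2.0, 40)

# -- fit the restricted (linear) and the larger (quadratic) model
result = sm.ols("leb2017 ~ log10gpc", data=demo).fit()
result2 = sm.ols("leb2017 ~ log10gpc + I(log10gpc**2)", data=demo).fit()

# -- likelihood ratio test: larger model first, restricted model as argument
lr, pval, ddof = result2.compare_lr_test(result)
print("p-value: {0}".format(round(pval, 3)))

# -- the same quantities computed by hand
lr_manual = 2.0 * (result2.llf - result.llf)
pval_manual = stats.chi2.sf(lr_manual, ddof)
print("by hand: {0}".format(round(pval_manual, 3)))
```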

Incidentally, we can also extrapolate the result to values outside the fit range.  Let's predict the life expectancy at birth for a country with log10(GDP per capita) = 6, i.e., a GDP per capita of $10^6$ USD,

In [None]:
# -- create a DataFrame for extrapolation
df =
df["log10gpc"] =

In [None]:
# -- predicting a value outside the fit range
ext =
ext2 =

print("linear model extrapolation:")
print(ext)

print("")

print("quadratic model extrapolation")
print(ext2)
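
A sketch of the single-value extrapolation on synthetic data (not World Bank values). The new DataFrame only needs the predictor column named in the formula; the last lines confirm that the linear prediction is just $a_0 + a_1 \cdot 6$.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# -- synthetic stand-in data covering log10gpc between 2.5 and 5
rng = np.random.default_rng(4)
demo = pd.DataFrame({"log10gpc": np.linspace(2.5, 5.0, 40)})
demo["leb2017"] = 30.0 + 10.0 * demo["log10gpc"] + rng.normal(0.0, 2.0, 40)

result = sm.ols("leb2017 ~ log10gpc", data=demo).fit()
result2 = sm.ols("leb2017 ~ log10gpc + I(log10gpc**2)", data=demo).fit()

# -- create a DataFrame holding the value to extrapolate to
df = pd.DataFrame()
df["log10gpc"] = [6.0]

# -- predict outside the fit range with both models
ext = result.predict(df)
ext2 = result2.predict(df)

# -- the linear prediction written out from the fitted coefficients
manual = result.params["Intercept"] + result.params["log10gpc"] * 6.0
print(ext.iloc[0], manual)
print(ext2.iloc[0])
```

The two models can disagree noticeably this far outside the data, since the quadratic term grows quickly away from the fitted range.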

We can also predict at multiple values using lists:

In [None]:
# -- create a DataFrame for extrapolation
df =
df["log10gpc"] =

In [None]:
# -- predicting multiple values outside the fit range
extm =
extm2 =

print("linear model extrapolation:")
print(extm)

print("")

print("quadratic model extrapolation")
print(extm2)
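
The multi-value case works the same way, sketched here on synthetic data with three hypothetical extrapolation points: the list becomes a column, and `predict` returns one value per row in the same order.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# -- synthetic stand-in data covering log10gpc between 2.5 and 5
rng = np.random.default_rng(5)
demo = pd.DataFrame({"log10gpc": np.linspace(2.5, 5.0, 40)})
demo["leb2017"] = 30.0 + 10.0 * demo["log10gpc"] + rng.normal(0.0, 2.0, 40)

result = sm.ols("leb2017 ~ log10gpc", data=demo).fit()
result2 = sm.ols("leb2017 ~ log10gpc + I(log10gpc**2)", data=demo).fit()

# -- a list of values becomes a multi-row predictor column
xvals = [5.5, 6.0, 6.5]
df = pd.DataFrame()
df["log10gpc"] = xvals

# -- one prediction per row, in the same order as the input list
extm = result.predict(df)
extm2 = result2.predict(df)

print(extm)
print(extm2)
```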