# Inferential Modeling

The `sowc_demographics` data set is from UNICEF's State of the World's Children 2019 Statistical Tables. The data are organized by country or area and include a range of demographic statistics for each one.

In this lab you will use this data set to perform inference on linear regression models. You will run simulations, using randomization to compute p-values and bootstrapping to compute confidence intervals, and then compute statistics that let you compare different models.

# Getting started

## Load packages

For this lab we will need the following packages.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
```

## Creating a reproducible lab report

You will use Jupyter Notebook to create reproducible lab reports. Download the lab report template and load it into Jupyter Notebook. The same template can be used for each of the labs.

## The data

The data we are working with is in the `sowc_demographics.csv` file. Download the file and load it into **python** as a data frame. The `sowc_demographics` data frame has 18 variables. A description of each variable is given at the end of this document for reference.
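As a minimal sketch of the loading step: the helper name `load_clean_pair` is hypothetical, and the assumption that the cleaned data frame is built by dropping rows with missing values in the two columns of interest is ours, not something this lab specifies. The toy CSV text contains made-up values so that the example runs without the real file.

```python
import io

import pandas as pd


def load_clean_pair(source, columns=("life_expectancy_2018", "fertility_2018")):
    """Read a CSV and keep only the rows where every requested column is present."""
    df = pd.read_csv(source)
    # reset_index gives the cleaned frame a gap-free 0..n-1 index
    return df[list(columns)].dropna().reset_index(drop=True)


# Toy stand-in with made-up values (NOT real UNICEF data)
toy_csv = io.StringIO(
    "countries_and_areas,life_expectancy_2018,fertility_2018\n"
    "A,60.0,4.5\n"
    "B,,2.0\n"
    "C,80.0,1.6\n"
)
toy_clean = load_clean_pair(toy_csv)
print(toy_clean)
```

With the real file in the same folder as the notebook, the call would look like `load_clean_pair("sowc_demographics.csv")`.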

<div class="alert alert-block alert-info">
<b>Exercise 1:</b> Describe the distribution of the <code>fertility_2018</code> variable. The <code>fertility_2018</code> variable represents the number of live births per woman in 2018. A fertility level of 2.1 is called replacement level and represents a level at which the population would remain the same size. Roughly what percentage of countries are at or above the replacement level?</div>

# Inference for linear regression with a single predictor

<div class="alert alert-block alert-info">
<b>Exercise 2:</b> Is there a relationship between life expectancy and fertility? Potentially <code>life_expectancy_2018</code> is a good predictor of <code>fertility_2018</code>. If so, what is the relationship? Create a linear model and describe the relationship between the variables. Recall these techniques from lab 3.</div>

Repeated random permutations of the response variable give a sense of how unlikely the observed data would be if there were no linear relationship between `life_expectancy_2018` and `fertility_2018`. The following code does this and plots the observed linear model and the simulated models on the same graph.

```python
# Observed model (red)
sns.regplot(x='life_expectancy_2018', y='fertility_2018', data=demo_lifefert_clean, ci=None, scatter=False, color='red')

# 50 simulated models (blue), each fitted after shuffling the response.
# .to_numpy() discards the index so pandas cannot realign the shuffled values to their original rows.
for _ in range(50):
    shuffled_fertility = demo_lifefert_clean['fertility_2018'].sample(frac=1).to_numpy()
    sim_regression = demo_lifefert_clean[['life_expectancy_2018']].assign(fertility_2018=shuffled_fertility)
    sns.regplot(x='life_expectancy_2018', y='fertility_2018', data=sim_regression, ci=None, scatter=False, color='blue')
```

The code here uses the same techniques you have seen before. Coloring the observed model as red and the simulated models as blue helps distinguish the lines from each other.
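Each blue line corresponds to one fitted slope. As a rough sketch of how that slope could be pulled out of a single shuffled fit, the example below uses made-up toy data (not the UNICEF values) and a hypothetical helper named `fit_slope`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data with a built-in negative relationship (made-up values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"life_expectancy_2018": rng.uniform(55, 85, size=40)})
toy["fertility_2018"] = 10 - 0.1 * toy["life_expectancy_2018"] + rng.normal(0, 0.5, size=40)


def fit_slope(x, y):
    """Fit a one-predictor linear regression and return its slope."""
    model = LinearRegression().fit(np.asarray(x).reshape(-1, 1), np.asarray(y))
    return float(model.coef_[0])


observed_slope = fit_slope(toy["life_expectancy_2018"], toy["fertility_2018"])

# One permutation: the same fertility values, randomly reassigned to countries
shuffled_y = toy["fertility_2018"].sample(frac=1, random_state=1).to_numpy()
shuffled_slope = fit_slope(toy["life_expectancy_2018"], shuffled_y)

print("observed slope:", observed_slope)
print("slope after shuffling:", shuffled_slope)
```

Shuffling keeps the set of fertility values intact and only breaks their pairing with life expectancy, which is what "no linear relationship" looks like in a simulation.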

<div class="alert alert-block alert-info">
<b>Exercise 3:</b> What do you notice in the graph you have created comparing the 50 simulated linear models and the observed linear model? Based on this, make a guess about whether the observed model is significantly different from what would be expected if there were no linear relationship between the variables.</div>

<div class="alert alert-block alert-info">
<b>Exercise 4:</b> Randomly simulate 1000 linear models. Create a histogram of the slopes of these simulated models. Using your histogram, compute a p-value and interpret your results.</div>

<div class="alert alert-block alert-info">
<b>Exercise 5:</b> Bootstrap 5000 slopes and create a histogram of those bootstrapped slopes. Compute a confidence interval from your histogram.</div>
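Unlike a permutation, a bootstrap sample keeps each row intact and draws whole rows with replacement. The sketch below shows a single resample on made-up toy data; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values, not the UNICEF data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"life_expectancy_2018": rng.uniform(55, 85, size=40)})
toy["fertility_2018"] = 10 - 0.1 * toy["life_expectancy_2018"] + rng.normal(0, 0.5, size=40)

# One bootstrap sample: same number of rows as the original, drawn with replacement.
# Each row keeps its own (life expectancy, fertility) pair.
boot = toy.sample(frac=1, replace=True, random_state=1)

print("rows in original:", len(toy))
print("rows in bootstrap sample:", len(boot))
print("some rows drawn more than once:", bool(boot.index.duplicated().any()))
```

Because rows are drawn with replacement, some countries appear more than once and others not at all, so a model fitted to each resample gives a slightly different slope.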

# Inference for linear regression with multiple predictors

A country's life expectancy seems to be a good predictor of its fertility in 2018. This model could be used to estimate the fertility rate of a country from its life expectancy. By including more predictors, however, we might be able to create a more accurate model.

<div class="alert alert-block alert-info">
<b>Exercise 6:</b> Using the techniques from lab 3, create a linear regression model to predict fertility using the multiple predictors: <code>life_expectancy_2018</code>, <code>migration_rate</code>, <code>percent_urban_2018</code>, <code>births_2018</code>, and <code>dependency_ratio_total</code>. Do you think this model predicts fertility rates better than the model above? Why or why not?</div>

You now have two different models for predicting a country's fertility, but which is better? You can use cross validation to test the quality of each model. First, have a look at the prediction errors of the two models.

```python
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import PredictionErrorDisplay

y_pred_single = cross_val_predict(model_lr, demo_lifefert_clean[['life_expectancy_2018']], demo_lifefert_clean['fertility_2018'], cv=4)
y_pred_mult = cross_val_predict(mult_model_lr, X, demo_mult_clean['fertility_2018'], cv=4)

fig, axs = plt.subplots(ncols=2, figsize=(8, 4))
PredictionErrorDisplay.from_predictions(y_true=demo_lifefert_clean['fertility_2018'], y_pred=y_pred_single, ax=axs[0])
axs[0].set_title("Single Variable Model Residuals")
PredictionErrorDisplay.from_predictions(y_true=demo_mult_clean['fertility_2018'], y_pred=y_pred_mult, ax=axs[1])
axs[1].set_title("Multiple Variables Model Residuals")
fig.suptitle("Plotting cross-validated residuals")
plt.tight_layout()
plt.show()
```

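The code here conducts cross validation on each model, splitting each data set into four parts, or folds (`cv=4`). Each observation is predicted by a model that was fit on the other three folds. The residuals, or errors, are then plotted in a separate panel for each model.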
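To make the folds concrete, here is a small sketch on made-up toy data that repeats the four-fold procedure by hand and compares the result with `cross_val_predict`. It relies on the fact that, for a regressor and an integer `cv`, scikit-learn splits the rows with an unshuffled `KFold`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Toy data (made-up values)
rng = np.random.default_rng(0)
X_toy = rng.uniform(55, 85, size=(20, 1))
y_toy = 10 - 0.1 * X_toy[:, 0] + rng.normal(0, 0.5, size=20)

# Manual version: fit on three folds, predict the held-out fold, repeat four times
manual_pred = np.empty_like(y_toy)
for train_idx, test_idx in KFold(n_splits=4).split(X_toy):
    fold_model = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
    manual_pred[test_idx] = fold_model.predict(X_toy[test_idx])

# Library version
auto_pred = cross_val_predict(LinearRegression(), X_toy, y_toy, cv=4)

print("predictions match:", np.allclose(manual_pred, auto_pred))
```

Every prediction comes from a model that never saw that observation, which is why cross-validated residuals are a fairer measure of predictive quality than residuals from a model fit to all of the data.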

<div class="alert alert-block alert-info">
<b>Exercise 7:</b> Compare the spread of residuals between the two models. In which model are the errors more spread out? Note the scale of the y-axis might be different between the two graphs. Based on this, which model do you think is better at predicting fertility?</div>

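Now you can compute the cross-validation SSE, or the sum of squared errors, associated with the predictions. As a reminder, the formula for the CV SSE is
$$\text{CV SSE} = \sum_{i = 1}^n (\hat{y}_{cv,i} - y_i)^2$$
The **sklearn** package does not return the SSE directly, but its `mean_squared_error` function returns the SSE divided by $n$, so multiplying the mean squared error by the number of observations gives the CV SSE.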

```python
from sklearn.metrics import mean_squared_error

sse_single = len(demo_lifefert_clean['fertility_2018'])*mean_squared_error(demo_lifefert_clean['fertility_2018'], y_pred_single)
sse_mult = len(demo_mult_clean['fertility_2018'])*mean_squared_error(demo_mult_clean['fertility_2018'], y_pred_mult)

print("The cross-validation sum of squared error for the single predictor model is", sse_single)
print("The cross-validation sum of squared error for the multiple predictor model is", sse_mult)
```
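As a quick check of the arithmetic, the sketch below uses four made-up numbers to confirm that multiplying the mean squared error by the number of observations gives the same value as summing the squared errors directly.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values chosen so the arithmetic is easy to follow
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])

# Errors are 0.5, 0.0, -0.5, 1.0, so the squared errors are 0.25, 0.0, 0.25, 1.0
sse_direct = np.sum((y_pred - y_true) ** 2)
sse_from_mse = len(y_true) * mean_squared_error(y_true, y_pred)

print("SSE computed directly:", sse_direct)
print("SSE from n * MSE:", sse_from_mse)
```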

<div class="alert alert-block alert-info">
<b>Exercise 8:</b> Based on the results of the cross-validation SSE, which model appears to better predict fertility?</div>

---

# Additional questions

<div class="alert alert-block alert-info">
<b>Exercise 9:</b> Using whichever variables you would like, try to create a model that seems to predict fertility better than the models created above. Explain why you chose the variables you did and why you think they will be useful in predicting fertility.</div>

---

# Variable reference

The `sowc_demographics` data set has 18 variables. The following table gives a brief description of each variable.

| Variable Name | Description |
|:---------|:--------:|
|  countries_and_areas   |  Country or area name.   |
|  total_pop_2018   |  Population in 2018 in thousands.   |
|  under18_pop_2018   |  Population under age 18 in 2018 in thousands.   |
|  under5_pop_2018   |  Population under age 5 in 2018 in thousands.   |
|  pop_growth_rate_2018   |  Rate at which population is growing in 2018.   |
|  pop_growth_rate_2030   |  Rate at which population is estimated to grow in 2030.   |
|  births_2018   |  Number of births in 2018 in thousands.   |
|  fertility_2018   |  Number of live births per woman in 2018.   |
|  life_expectancy_1970   |  Life expectancy at birth in 1970.   |
|  life_expectancy_2000   |  Life expectancy at birth in 2000.   |
|  life_expectancy_2018   |  Life expectancy at birth in 2018.   |
|  dependency_ratio_total   |  The ratio of the non-working-age population (under 15 or over 64) to the working-age population of 15-64 years.   |
|  dependency_ratio_child   |  The ratio of the under 15 population to the working-age population of 15-64 years.   |
|  dependency_ratio_oldage   |  The ratio of the over 64 population to the working-age population of 15-64 years.   |
|  percent_urban_2018   |  Percent of population living in urban areas.   |
|  pop_urban_growth_rate_2018   |  Annual urban population growth rate from 2000 to 2018.   |
|  pop_urban_growth_rate_2030   |  Estimated annual urban population growth rate from 2018 to 2030.   |
|  migration_rate   |  Net migration rate per 1000 population from 2015 to 2020.   |