In [2]:
import pandas as pd
import numpy as np
from cmdstanpy import CmdStanModel
import arviz as az

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# You may import additional libraries if needed

# Aesthetics
sns.set_theme(style="whitegrid")
plt.rc('axes', titlesize=10, labelsize=9)
plt.rc('xtick', labelsize=8)
plt.rc('ytick', labelsize=8)

In [3]:
africa = pd.read_csv('https://raw.githubusercontent.com/MLGlobalHealth/StatML4PopHealth/refs/heads/main/data/africa_who_life_expectancy.csv')
africa.head(10)

Unnamed: 0,Country,Region,Year,Infant_deaths,Under_five_deaths,Adult_mortality,Alcohol_consumption,Hepatitis_B,Measles,BMI,...,Diphtheria,Incidents_HIV,GDP_per_capita,Population_mln,Thinness_ten_nineteen_years,Thinness_five_nine_years,Schooling,Economy_status_Developed,Economy_status_Developing,Life_expectancy
0,Comoros,East Africa,2007,66.8,91.9,255.8815,0.15,75,64,23.5,...,75,0.02,1166,0.64,7.3,7.2,3.5,0,1,60.7
1,Gabon,Central Africa,2012,39.1,57.1,256.88,7.47,82,64,24.9,...,82,1.57,7181,1.75,6.3,6.2,7.8,0,1,62.9
2,Ghana,West Africa,2011,45.2,65.9,257.0865,1.67,91,58,23.7,...,91,0.93,1580,25.39,6.9,6.8,6.8,0,1,61.4
3,"Congo, Rep.",Central Africa,2003,64.6,100.2,406.702,0.9,71,64,22.3,...,50,2.23,2072,3.41,9.1,8.8,5.7,0,1,53.8
4,Madagascar,Southern Africa,2011,45.8,67.0,237.6755,0.89,73,64,21.1,...,73,0.15,464,21.74,7.5,7.4,6.1,0,1,63.8
5,Cameroon,Central Africa,2015,57.0,88.0,340.1265,4.55,84,64,24.3,...,84,1.12,1383,23.3,5.6,5.5,6.1,0,1,57.6
6,Zimbabwe,Southern Africa,2009,54.0,91.3,551.9115,2.77,73,64,23.6,...,73,7.31,941,12.53,7.5,7.4,7.2,0,1,48.4
7,South Africa,Southern Africa,2006,48.7,79.2,515.718,7.3,83,53,26.6,...,82,11.13,5827,48.49,1.6,12.8,9.0,0,1,53.8
8,Eritrea,East Africa,2012,36.8,51.2,274.029,1.03,94,88,20.6,...,94,0.18,699,3.28,8.7,8.6,3.9,0,1,63.2
9,Seychelles,East Africa,2011,12.2,14.2,154.4505,7.83,99,99,26.3,...,99,3.5,13664,0.09,5.7,6.1,8.4,0,1,72.7


# Final Assessment

In this final assessment, you are tasked with investigating the relationship between various population health indicators and life expectancy to inform government policy decisions on resource allocation and public health priorities. Employing Bayesian statistical methods, your objective is to conduct a rigorous analysis of the World Health Organization (WHO) life expectancy dataset, interpret the results within the context of existing literature, and propose evidence-based recommendations for policy improvements.

## Submission
- You should submit a Jupyter notebook (`.ipynb` file) containing your analysis, code, and results.
- You should submit any `.stan` file you use for the analysis.
- You should include visualizations, tables, and other relevant outputs to support your analysis.
- The notebook should be well-organized, with clear explanations and interpretations of the results.

## Dataset
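Data source: https://raw.githubusercontent.com/MLGlobalHealth/StatML4PopHealth/refs/heads/main/data/africa_who_life_expectancy.csv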

The dataset is an updated subset of the WHO life expectancy dataset, specifically focused on the African continent. It comprises data from 51 countries covering the period from 2000 to 2015, with a total of 816 observations and 21 variables. The dataset includes a broad spectrum of indicators, ranging from health and immunization measures to economic and demographic factors. The variables are defined as follows:

- Country: Name of the country
- Region: Geographical region within Africa
- Year: Year of observation
- Infant_deaths: Number of infant deaths per 1000 population
- Under_five_deaths: Number of deaths among children under five per 1000 population
- Adult_mortality: Number of adult deaths per 1000 population
- Alcohol_consumption: Per capita alcohol consumption (liters for individuals aged 15 and older)
- Hepatitis_B: Immunization coverage against Hepatitis B among 1-year-olds (%)
- Measles: Immunization coverage against measles among 1-year-olds (%)
- BMI: Average Body Mass Index of the population
- Polio: Polio immunization coverage among 1-year-olds (%)
- Diphtheria: Coverage for DTP3 (Diphtheria, Tetanus, Pertussis) immunization among 1-year-olds (%)
- Incidents_HIV: Incidence of HIV per 1000 population (ages 15-49)
- GDP_per_capita: Gross Domestic Product per capita (in USD)
- Population_mln: Population size in millions
- Thinness_ten_nineteen_years: Prevalence of thinness among adolescents (ages 10-19, %)
- Thinness_five_nine_years: Prevalence of thinness among children (ages 5-9, %)
- Schooling: Average years of schooling for individuals aged 25 or older
- Economy_status_Developed: Indicator variable for developed country status
- Economy_status_Developing: Indicator variable for developing country status
- Life_expectancy: Average life expectancy (in years)

## Scoring scheme
- Exploratory Data Analysis (EDA) (20 points)
    - Appropriate methodology for answering the research question and building the model
    - Quality of the figures and interpretation 
- Bayesian Linear Regression (30 points)
    - Correct mathematical expression of the model
    - Correct implementation of the model in Stan
    - Necessary model diagnostics
    - Summarization, visualization, and interpretation of the results
- Interpretation of Results (10 points)
    - Interpretation of the results in the context of the research question
    - Comparison of the results with other sources of information
- Recommendations (10 points)
    - Recommendations are clear and supported by the analysis
- Extra Credit (30 points)
    - Attempts to improve the model

Note on code: You will not be judged on the quality of the code, but it should be clear and well-commented; otherwise we may take points off for not being able to understand your analysis.

## 1. Exploratory Data Analysis (EDA)

### 1.1 Temporal Analysis of Life Expectancy

Initiate your analysis by examining the temporal evolution of life expectancy across countries. Construct time series plots for each country to assess the progression of life expectancy over time. Scrutinize these visualizations for overarching trends and notable deviations. Identify any countries that exhibit anomalous patterns and consider potential explanatory factors, such as socioeconomic disparities, healthcare policies, or regional environmental influences.

Given the considerable number of countries within the dataset, it may be necessary to create multiple plots by grouping countries according to their respective regions. You do not have the time to present a detailed discussion for each country, so you should focus on 3 to 5 countries that are representative, have distinct trends, or unique characteristics. You may choose countries that you are familiar with as domain knowledge can provide valuable insights into the underlying dynamics.
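One possible way to organise the regional plots is sketched below. It assumes the `africa` columns `Country`, `Region`, `Year`, and `Life_expectancy`; the helper names `region_panels` and `plot_region_panels` are hypothetical, and the small `toy` frame uses made-up values purely to show the shapes involved.

```python
import pandas as pd

def region_panels(df):
    """Return {region: wide table with years as rows and countries as columns}."""
    panels = {}
    for region, grp in df.groupby("Region"):
        panels[region] = grp.pivot(
            index="Year", columns="Country", values="Life_expectancy"
        ).sort_index()
    return panels

def plot_region_panels(panels):
    """Draw one time series panel per region (matplotlib is imported lazily)."""
    import matplotlib.pyplot as plt
    fig, axes = plt.subplots(len(panels), 1, figsize=(7, 3 * len(panels)),
                             sharex=True, squeeze=False)
    for ax, (region, wide) in zip(axes[:, 0], panels.items()):
        wide.plot(ax=ax, marker="o", linewidth=1)  # one line per country
        ax.set_title(region)
        ax.set_ylabel("Life expectancy (years)")
    fig.tight_layout()
    return fig

# Toy stand-in for `africa`; the numbers are invented for illustration only.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C", "C"],
    "Region": ["East", "East", "East", "East", "West", "West"],
    "Year": [2000, 2001, 2000, 2001, 2000, 2001],
    "Life_expectancy": [50.0, 51.0, 60.0, 60.5, 55.0, 54.0],
})
panels = region_panels(toy)
```

With the real data, `plot_region_panels(region_panels(africa))` would give one panel per region, which keeps each legend to a manageable number of countries.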

### 1.2 Analysis of Variable Relationships

Examine the associations between `Life_expectancy` and the remaining variables in the dataset using appropriate visualization techniques such as scatter plots, box plots, or other relevant methods. As a preliminary step, subset the dataset to include only observations from the year 2008. Identify and describe the patterns that emerge, highlighting any variables that exhibit a particularly strong or consistent relationship with life expectancy. Provide an academic discussion on the potential underlying factors—whether socioeconomic, demographic, or health-related—that might explain these observed relationships.
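As a starting point before plotting, the 2008 subset can be ranked by the strength of each variable's linear association with `Life_expectancy`. The sketch below is one way to do this; the function name `correlations_with_target` is hypothetical, and the toy data (including the `Noise` column) are simulated, not taken from the dataset. Pearson correlation only captures linear association, so scatter plots remain necessary.

```python
import numpy as np
import pandas as pd

def correlations_with_target(df, year=2008, target="Life_expectancy"):
    """Pearson correlation of every numeric column with the target for one year."""
    sub = df[df["Year"] == year]
    numeric = sub.select_dtypes(include="number").drop(columns=["Year"])
    corr = numeric.corr()[target].drop(target)
    # Order by absolute strength so the strongest associations come first.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Simulated toy data: life expectancy falls as adult mortality rises.
rng = np.random.default_rng(0)
n = 40
adult_mortality = rng.uniform(150, 500, n)
toy = pd.DataFrame({
    "Country": [f"c{i}" for i in range(n)],
    "Year": 2008,
    "Adult_mortality": adult_mortality,
    "Noise": rng.normal(size=n),  # unrelated column for contrast
    "Life_expectancy": 80 - 0.05 * adult_mortality + rng.normal(0, 1, n),
})
ranked = correlations_with_target(toy)
```

Columns that are constant within the subset (for example an indicator that is 0 for every country) have undefined correlation and will appear as `NaN` at the end of the ranking.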

### 1.3 Variable Selection, Actionability, and Statistical Considerations

Based on your exploratory data analysis, identify the variables you propose to include in your Bayesian linear model for inferring the effects of health indicators on life expectancy. In your discussion, please address both statistical and practical aspects by considering the following questions:

- Which variables appear most influential and relevant according to your analyses, and what underlying mechanisms might justify their inclusion?
- Reflect on the concept of actionability: To what extent can interventions realistically improve these variables? Consider whether a variable represents a direct target for policy change (for example, immunization rates) versus a more distal or structural factor (such as overall GDP per capita).
- How do potential issues like multicollinearity influence your variable selection? Discuss whether highly correlated variables should be combined, represented by a single measure, or treated separately, and explain your reasoning.

Your answer should integrate these considerations to justify a well-rounded and contextually informed model specification.
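To make the multicollinearity question concrete, one common diagnostic is the variance inflation factor (VIF), computed by regressing each candidate predictor on the others. The sketch below implements it with NumPy only; the function name is hypothetical, the data are simulated, and the usual thresholds (such as VIF above 5 or 10) are rules of thumb rather than part of this assessment.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF per column of X (n_samples x n_features).

    For column j, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from an
    ordinary least-squares regression of column j on all other columns
    plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Simulated example: two nearly collinear mortality measures and one
# independent predictor.
rng = np.random.default_rng(1)
infant = rng.normal(50, 15, 200)
under_five = 1.5 * infant + rng.normal(0, 2, 200)
schooling = rng.normal(6, 2, 200)
vifs = variance_inflation_factors(np.column_stack([infant, under_five, schooling]))
```

In this simulated case the first two VIFs are very large while the third stays near 1, which is the pattern that would argue for keeping only one of the two mortality measures or combining them.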

## 2. Bayesian Linear Regression

Subset the data to 2008 and the variables you have chosen based on your EDA and reasoning. Implement a Bayesian linear regression model to infer the effects of the chosen health indicators on life expectancy.

### 2.1 Mathematical Expression of the Model
Write down the mathematical expression of the model.
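As a template for the expected level of detail, a generic Bayesian linear regression with $K$ standardized predictors could be written as follows. The priors shown are placeholder assumptions, not prescribed choices; your own specification should state and justify its priors.

$$
\begin{aligned}
y_i \mid \alpha, \beta, \sigma &\sim \mathcal{N}(\mu_i, \sigma^2), \qquad i = 1, \dots, N, \\
\mu_i &= \alpha + \sum_{k=1}^{K} \beta_k x_{ik}, \\
\alpha &\sim \mathcal{N}(0, 1), \qquad \beta_k \sim \mathcal{N}(0, 1), \qquad \sigma \sim \text{Exponential}(1).
\end{aligned}
$$

Here $y_i$ is the (standardized) life expectancy of country $i$ in 2008 and $x_{ik}$ is the value of the $k$-th chosen indicator for that country.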

### 2.2 Implement the Model in Stan and Infer the Posterior Using MCMC
Implement the Bayesian linear regression model in Stan. Compile the model and sample from the posterior using MCMC.
You will not be judged on the efficiency of the Stan code, but rather on the correctness of the model implementation.
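A minimal sketch of the workflow is shown below, assuming standardized inputs and placeholder priors. The Stan program, the file name `linear_regression.stan`, and the helpers `standardize`, `make_stan_data`, and `fit_model` are all illustrative assumptions rather than the required solution. `fit_model` needs a working CmdStan installation and is therefore not called here.

```python
import numpy as np

# Hypothetical Stan program: Gaussian likelihood with placeholder priors.
STAN_CODE = """
data {
  int<lower=1> N;          // number of observations
  int<lower=1> K;          // number of predictors
  matrix[N, K] X;          // standardized predictors
  vector[N] y;             // standardized outcome
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  sigma ~ exponential(1);
  y ~ normal(alpha + X * beta, sigma);
}
generated quantities {
  vector[N] y_rep;         // posterior predictive draws for model checking
  for (n in 1:N) y_rep[n] = normal_rng(alpha + X[n] * beta, sigma);
}
"""

def standardize(a):
    """Centre to mean 0 and scale to standard deviation 1, column-wise."""
    a = np.asarray(a, dtype=float)
    return (a - a.mean(axis=0)) / a.std(axis=0)

def make_stan_data(X, y):
    """Build the data dictionary expected by the Stan program above."""
    Xs, ys = standardize(X), standardize(y)
    return {"N": Xs.shape[0], "K": Xs.shape[1], "X": Xs, "y": ys}

def fit_model(stan_data, stan_path="linear_regression.stan", seed=1):
    """Write the Stan file, compile it, and sample with four chains."""
    from cmdstanpy import CmdStanModel
    with open(stan_path, "w") as f:
        f.write(STAN_CODE)
    model = CmdStanModel(stan_file=stan_path)
    return model.sample(data=stan_data, chains=4, iter_warmup=1000,
                        iter_sampling=1000, seed=seed)

# Simulated inputs standing in for the 2008 subset.
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(30, 2))
y_toy = 60 + X_toy @ np.array([-3.0, 1.5]) + rng.normal(0, 1, 30)
stan_data = make_stan_data(X_toy, y_toy)
```

Standardizing puts all coefficients on a comparable scale, which makes generic priors such as `normal(0, 1)` easier to defend; coefficients then need to be transformed back if effects are to be reported in years of life expectancy.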

### 2.3 Diagnose the Model
Conduct model diagnostics to ensure that the sampler did not encounter any issues during the sampling process. Check for convergence, effective sample size, and goodness of fit.
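To illustrate what a convergence check measures, the sketch below computes a basic split R-hat for a single parameter from an array of draws with shape `(chains, iterations)`. This is a simplified, non-rank-normalized version written for illustration; in practice you would rely on the diagnostics reported by CmdStanPy and ArviZ (ArviZ's default R-hat is rank-normalized), and also inspect divergences, effective sample sizes, and posterior predictive checks. The function name and the simulated chains are assumptions.

```python
import numpy as np

def split_rhat(draws):
    """Basic split R-hat for one parameter; draws has shape (chains, iterations)."""
    draws = np.asarray(draws, dtype=float)
    half = draws.shape[1] // 2
    # Split every chain in two so that within-chain drift is also detected.
    halves = np.concatenate([draws[:, :half], draws[:, half:2 * half]], axis=0)
    n = halves.shape[1]
    W = halves.var(axis=1, ddof=1).mean()      # mean within-chain variance
    B = n * halves.mean(axis=1).var(ddof=1)    # between-chain variance
    var_plus = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(3)
good = rng.normal(size=(4, 1000))   # four chains sampling the same distribution
bad = good.copy()
bad[0] += 3.0                       # one chain stuck in a different region
rhat_good, rhat_bad = split_rhat(good), split_rhat(bad)
```

Values close to 1 indicate that the chains agree; the shifted chain in `bad` pushes the statistic well above commonly used thresholds such as 1.01.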

### 2.4 Summarize and Visualise the Results
Summarise the posterior distribution of the model parameters and provide point estimates and 95% credible intervals for the effects of the chosen health indicators on life expectancy.
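A summary table of this kind can be built directly from an array of posterior draws, as sketched below. With CmdStanPy the draws for a vector parameter named `beta` could be obtained with `fit.stan_variable("beta")`, where `fit` and the parameter name are assumptions about your own model. The draws used here are simulated, and `summarize_draws` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def summarize_draws(draws, names):
    """Posterior mean, median and central 95% credible interval per coefficient.

    draws: array of shape (n_draws, n_coefficients)
    names: labels for the coefficients, in column order
    """
    draws = np.asarray(draws, dtype=float)
    lo, med, hi = np.percentile(draws, [2.5, 50, 97.5], axis=0)
    return pd.DataFrame(
        {"mean": draws.mean(axis=0), "median": med, "ci_2.5": lo, "ci_97.5": hi},
        index=names,
    )

# Simulated draws standing in for real posterior samples.
rng = np.random.default_rng(4)
toy_draws = rng.normal(loc=[-0.8, 0.3], scale=0.1, size=(4000, 2))
summary = summarize_draws(toy_draws, ["Adult_mortality", "Schooling"])
```

A coefficient whose 95% credible interval excludes zero, as for both simulated coefficients here, has a clearly identified direction of association under the model.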

### 2.5 Interpret the Results
Interpret the results of the Bayesian linear regression model in the context of the research question. Discuss the magnitude and direction of the effects of the chosen health indicators on life expectancy.

## Recommendations

Your government is interested in improving life expectancy. They inform you that they have a limited budget and can only focus on a few health indicators. Based on your analysis, what recommendations would you give to the government? What are the potential benefits of focusing on these health indicators? Provide evidence to support your recommendations.

The government has been able to provide you with the latest statistics on each of the indicators. They want you to use this information to:

1. Predict the life expectancy, which has not been measured yet.
2. Visualize the life expectancy predictions by changing the value of each health indicator you have chosen by 10%. Note that for some indicators an increase in value is good for life expectancy, while for others a decrease is good.

In [1]:
new_data_dict = {
    'Adult_mortality': 263.6,
    'Alcohol_consumption': 3.0,
    'BMI': 23.8,
    'Diphtheria': 80.9,
    'Economy_status_Developed': 0.0,
    'Economy_status_Developing': 1.0,
    'GDP_per_capita': 2705,
    'Hepatitis_B': 81.5,
    'Incidents_HIV': 1.8,
    'Infant_deaths': 47.6,
    'Measles': 69.5,
    'Polio': 81.1,
    'Population_mln': 22.1,
    'Schooling': 5.3,
    'Thinness_five_nine_years': 6.6,
    'Thinness_ten_nineteen_years': 6.7,
    'Under_five_deaths': 68.8
}
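One way to structure the scenario predictions is sketched below. It assumes posterior draws `alpha`, `beta`, and `sigma` expressed on the original (unstandardized) scale of the predictors; if your model was fitted on standardized data, the new values must first be transformed with the training means and standard deviations. The function `predict_scenarios`, the `direction` dictionary, and the simulated draws are hypothetical; only the two baseline values are taken from `new_data_dict`.

```python
import numpy as np

def predict_scenarios(alpha, beta, sigma, baseline, direction, change=0.10, seed=0):
    """Posterior predictive draws for the baseline and one-at-a-time changes.

    alpha, sigma: arrays of shape (n_draws,)
    beta:         array of shape (n_draws, n_predictors)
    baseline:     dict mapping predictor name to its latest value
    direction:    dict mapping predictor name to +1 (increase is beneficial)
                  or -1 (decrease is beneficial)
    """
    rng = np.random.default_rng(seed)
    names = list(baseline)
    x0 = np.array([baseline[k] for k in names], dtype=float)
    scenarios = {"baseline": x0}
    for j, name in enumerate(names):
        x = x0.copy()
        x[j] *= 1.0 + direction[name] * change   # move 10% in the beneficial direction
        scenarios[f"{name} {direction[name] * change:+.0%}"] = x
    # Linear predictor per draw plus observation noise.
    return {label: rng.normal(alpha + beta @ x, sigma) for label, x in scenarios.items()}

# Simulated posterior draws for illustration only.
rng = np.random.default_rng(5)
S = 2000
alpha = rng.normal(75, 0.5, S)
beta = np.column_stack([rng.normal(-0.05, 0.005, S), rng.normal(0.5, 0.1, S)])
sigma = np.abs(rng.normal(2, 0.1, S))
baseline = {"Adult_mortality": 263.6, "Schooling": 5.3}
direction = {"Adult_mortality": -1, "Schooling": +1}
preds = predict_scenarios(alpha, beta, sigma, baseline, direction)
```

Each entry of `preds` is a full set of predictive draws, so the scenarios can be compared with box plots or interval plots rather than single point predictions.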

## Extending the linear model

So far we have only worked with data from a single year. How would you improve or reformulate the model to include data from all years, account for nonlinearity in the relationship between the health indicators and life expectancy, or both? Implement your ideas, summarise the results, and provide a discussion. You will be assessed based on the following criteria:

- The ability to provide a clear mathematical expression of the model and justify the modeling choices.
- The ability to correctly implement the model in Stan.
- The ability to extract estimates from the model to create figures and tables that answer the research question.

Note: If you choose to do a longitudinal analysis, you may need to subset the data to a smaller number of countries as an analysis of all countries may be computationally intensive.
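As one illustration of a longitudinal formulation, a hierarchical model with country-specific intercepts and a shared linear time trend could be written as below. This is a sketch of one option among many, with placeholder priors; the symbols $a_c$, $\gamma$, and $\tau$ are introduced here for illustration.

$$
\begin{aligned}
y_{ct} &\sim \mathcal{N}(\mu_{ct}, \sigma^2), \\
\mu_{ct} &= \alpha + a_c + \gamma\,(t - \bar{t}) + \sum_{k=1}^{K} \beta_k x_{ctk}, \\
a_c &\sim \mathcal{N}(0, \tau^2), \qquad \tau \sim \text{Exponential}(1),
\end{aligned}
$$

where $y_{ct}$ is the life expectancy of country $c$ in year $t$, $a_c$ captures persistent country-level differences, and $\gamma$ is the average yearly change. Nonlinearity could be introduced, for example, by transforming a predictor (such as using $\log$ GDP per capita) or by replacing a linear term with a spline or Gaussian process.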