# API-201 ABC PROBLEM SET #2
**Due on Wednesday, September 14, at 5:00 p.m.**

**I - INSTRUCTIONS**  
To successfully complete this problem set, please follow these steps:

1. **Create a copy in your own directory by clicking `File > Save as…` and choosing a location outside of the `shared_data` folder - if you do not do this your work will not be saved!**
    1. Remember to save your work frequently by pressing `command-S` or clicking `File > Save and Checkpoint` in the menubar.
2. **Insert all your answers into your copy of the document.** 
    1. Please include every portion of your submission in this document unless a separate electronic file is explicitly requested. 
    2. All numerical calculations should be done in the notebook itself, using R code. If you have to do calculations by hand, include a picture of your handwritten work.
    3. Use `Edit > Insert Image` in the menubar to add an image of handwritten work, screenshots, or anything else.
3. **Once your document is complete, please save and submit the notebook on Canvas as a PDF.** 
    1. Click `Cell > Run All` in the menubar to make sure all of your code is executed.
    1. Click `File > Download as > PDF via HTML (.html)` in the menubar to export your notebook as a PDF, and submit it on Canvas.


**II - IDENTIFICATION**
1. **Your Full Name:** `     `  

2. **Group Members (classmates with whom you worked on this problem set):**  
    1. `     `
    2. `     `
    3. `     `
    4. `     `
    
3. **Compliance with HKS Academic Code**  
We abide by the Harvard Kennedy School Academic Code for all aspects of the course. In terms of problem sets, unless explicitly written otherwise, the norms are the following: You are free (and encouraged) to discuss problem sets with your classmates. However, you must hand in your own unique written work and code in all cases. Any copy/paste of another’s work is plagiarism. In other words, you may work with your classmate(s), sitting side-by-side (physically or remotely!) and going through the problem set question by question, but you must each type your own answers and your own code. For more details, please see the syllabus.

    **I certify that my work in this problem set complies with the HKS Academic Code**
    - [ ] Yes
    - [ ] No

---

#### Load `R` libraries and data
The cell below will set up your notebook for the assignment - make sure you run it before beginning!

In addition to loading the tidyverse, this code cell will import some data - an extract from the World Bank's World Development Indicators (WDI) dataset.

*Note: If you look at the lines of code below, you may notice that we're importing an Excel sheet, whereas last problem set we imported an Rdata file. The `read_excel` command lets us do this, which comes in handy if the data that you have is an Excel file!*
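
As a minimal sketch of how the `sheet` argument works, the code below reads from an example workbook that ships with the `readxl` package. That workbook is only a stand-in for illustration and is not the course's WDI file. `excel_sheets()` lists the sheet names, and `sheet` accepts either a position or a name.

```r
library(readxl)

# Path to an example workbook bundled with readxl.
# This is a stand-in for illustration, NOT the course's WDI file.
example_path <- readxl_example("datasets.xlsx")

# List the sheet names so you know what each position refers to.
sheet_names <- excel_sheets(example_path)
print(sheet_names)

# `sheet` accepts a position or a name; both calls read the same sheet.
by_position <- read_excel(example_path, sheet = 2)
by_name     <- read_excel(example_path, sheet = sheet_names[2])

print(head(by_position, 3))
```

Checking the sheet names first is a handy habit when you are not sure which sheet a position such as `sheet = 2` refers to.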

In [0]:
library(tidyverse)
library(readxl)

wdi <- read_excel("~/shared_data/API201-students/data/WDI_PS2.xlsx", 
                  sheet = 2)

wdi_advars <- read_excel("~/shared_data/API201-students/data/WDI_PS2.xlsx", 
                         sheet = 3)

---

# Table of Contents

1. [LEARNING ABOUT THE WORLD ECONOMY](#LEARNING-ABOUT-THE-WORLD-ECONOMY)
2. [CORRELATIONS IN THE WORLD](#CORRELATIONS-IN-THE-WORLD)
3. [CHOOSING THE RIGHT STATISTIC](#CHOOSING-THE-RIGHT-STATISTIC)
4. [PCE: BAYES' RULE](#PCE:-Bayes'-Rule)

---

# LEARNING ABOUT THE WORLD ECONOMY

The purpose of this question is threefold: (a) to help you learn about the world economy using a widely used dataset; (b) to ensure that you are comfortable with several R functions and operations that are commonly used to analyze data; and (c) to review the descriptive statistics we studied in Class 3. 

We will be using an extract from the World Bank’s World Development Indicators (WDI) for this question. A table of variable definitions is provided below.

Note that the WDI dataset reports GDP values in 2010 U.S. dollars, so that you can directly compare values in different years. Also note that GDP and population data are not available in some years. For this problem set, use only those observations for which data are available. (We ask specifically about the missing data in part 3.)

| Code   | Indicator Name| Long definition |
|:-:|:-:|:- |
| GDP    | GDP (constant 2010 US$)       | GDP at purchaser's prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2010 U.S. dollars. Dollar figures for GDP are converted from domestic currencies using 2010 official exchange rates. For a few countries where the official exchange rate does not reflect the rate effectively applied to actual foreign exchange transactions, an alternative conversion factor is used. |
| POP    | Population, total| Total population is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship. The values shown are midyear estimates.|
| CO2    | CO2 emissions (metric tons per capita)| Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of cement. They include carbon dioxide produced during consumption of solid, liquid, and gas fuels and gas flaring.|
| IMRT   | Mortality rate, infant (per 1,000 live births)| Infant mortality rate is the number of infants dying before reaching one year of age, per 1,000 live births in a given year.|
| CHE    | Current health expenditure (\% of GDP) | Level of current health expenditure expressed as a percentage of GDP.  Estimates of current health expenditures include healthcare goods and services consumed during each year. This indicator does not include capital health expenditures such as buildings, machinery, IT and stocks of vaccines for emergency or outbreaks.     |
| ALITRT | Literacy rate, adult total (\% of people ages 15 and above)    | Adult literacy rate is the percentage of people ages 15 and above who can both read and write with understanding a short simple statement about their everyday life.|
| PARL   | Proportion of seats held by women in national parliaments (\%) | Women in parliaments are the percentage of parliamentary seats in a single or lower chamber held by women.  |

### 0. Preview the Data

Run the cell below to view the first 10 rows of the data.

In [0]:
head(wdi, 10)

### 1. Totals: 

a. Calculate and report total world GDP and world population in 2019 using the `sum()` function. Add the argument `na.rm = TRUE` to remove missing values.

_Hints:_ 

* Recall that `sum()` takes a numerical vector as its first argument. You can extract a column from a data frame using `$` (e.g., `dataset$var1` will reference the `var1` column of your data frame `dataset`). 
* You can pass an additional argument to a function by adding it after a comma inside the parentheses (e.g., `fun(input, arg)` passes the argument `arg` to the function `fun()` along with `input`).
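
To illustrate the hints above, here is a small sketch on a made-up data frame. The data frame `toy` and its column names are hypothetical and are not the column names in `wdi`.

```r
# Toy data frame for illustration only; names and values are hypothetical.
toy <- data.frame(
  region = c("A", "B", "C", "D"),
  sales  = c(10, NA, 25, 5)
)

# `$` extracts the `sales` column as a numeric vector.
sales_vector <- toy$sales

# Without na.rm, a single missing value makes the whole sum NA.
total_with_na <- sum(sales_vector)

# With na.rm = TRUE, missing values are dropped before summing.
total_without_na <- sum(sales_vector, na.rm = TRUE)

print(total_with_na)     # NA
print(total_without_na)  # 40
```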

In [0]:
# Your answer here!



b. Identify the five countries with the largest populations in 2019 and their respective populations in 2019 using `head()` and the tidyverse function `arrange()`. 

_Hints:_
* Use pipes `%>%` to perform multiple sequential operations on the `wdi` dataset.
* `arrange(data, desc(x))` sorts data in descending order of `x` rather than ascending order.
* `head(data, n)` returns just the first `n` rows of data.
* Make sure to use `head()` and `arrange()` in the correct order, otherwise you'll order only the first 5 countries in the dataset.
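
The sketch below shows why the order of `arrange()` and `head()` matters. It uses a made-up data frame (`toy`, with hypothetical columns `city` and `visitors`), not the `wdi` data.

```r
library(dplyr)

# Toy data for illustration only; names and values are hypothetical.
toy <- data.frame(
  city     = c("P", "Q", "R", "S", "T", "U"),
  visitors = c(30, 80, 10, 95, 50, 70)
)

# Sort first, then keep the top rows: the three largest values overall.
top_three <- toy %>%
  arrange(desc(visitors)) %>%
  head(3)

# Keep the first rows, then sort: only the first three rows, reordered.
first_three_sorted <- toy %>%
  head(3) %>%
  arrange(desc(visitors))

print(top_three$city)           # "S" "Q" "U"
print(first_three_sorted$city)  # "Q" "P" "R"
```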


In [0]:
# Your answer here!



c. Identify the five countries with the largest GDPs in 2019 and their respective GDPs in 2019.

In [0]:
# Your answer here!



### 2. Central Tendencies:

When the necessary data are available, calculate GDP per capita for each country in the database in both 1994 and 2019 by creating a new dataset called `wdi_percap` that includes two new columns called `gdp_percap_1994` and `gdp_percap_2019`, respectively.

_HINT:_ Use `mutate()` to create the new columns.
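
As a rough sketch of the `mutate()` pattern, the code below adds a ratio column to a made-up data frame. The names `toy`, `revenue_2020`, and `staff_2020` are hypothetical and do not correspond to columns in `wdi`.

```r
library(dplyr)

# Toy data for illustration only; names and values are hypothetical.
toy <- data.frame(
  shop         = c("A", "B", "C"),
  revenue_2020 = c(100, 240, NA),
  staff_2020   = c(4, 8, 5)
)

# mutate() keeps the existing columns and appends the new one.
# A missing input produces a missing (NA) result for that row.
toy_per_staff <- toy %>%
  mutate(revenue_per_staff_2020 = revenue_2020 / staff_2020)

print(toy_per_staff)
```

You can create several columns in a single `mutate()` call by separating them with commas.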


**a. What is the simple average of the country-level GDP per capita values in 1994 and 2019?**

_HINT:_ You can remove missing values with `mean()` or `median()` just like with `sum()`!
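
For instance, on a made-up vector (the name `scores` and its values are hypothetical):

```r
# Toy vector for illustration only.
scores <- c(2, 4, NA, 6, 8)

# Without na.rm, the missing value propagates to the result.
mean_with_na <- mean(scores)

# With na.rm = TRUE, the statistics use only the four non-missing values.
mean_clean   <- mean(scores, na.rm = TRUE)
median_clean <- median(scores, na.rm = TRUE)

print(mean_with_na)  # NA
print(mean_clean)    # 5
print(median_clean)  # 5
```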

In [0]:
# Your answer here!



**b. What is the median country-level GDP per capita in 1994 and 2019?**

In [0]:
# Your answer here!



**c. What drives the differences between (a) and (b)?**

`Your answer here!`



**d. Using data from 2019, what is the total population of all countries with GDP per capita below the world mean calculated in part (a)?  What fraction of the total world population does this number represent?**

There are many possible ways to solve this question, so consider the following outline as a possible way to solve the problem, but not the only solution!
1. Start by using `mutate()` to create 3 new columns: 
    a. Mean GDP per capita.
    b. A binary variable that flags whether `gdp_percap_2019` is below mean GDP per capita.
    c. A variable called `pop_below` that is equal to a country's population if its GDP per capita is below the mean and 0 otherwise.
2. Use `summarize()` and `sum()` to sum total population and population of countries with GDP per capita below mean. 
3. Use `mutate()` again to calculate the fraction of world population living in countries with below-mean GDP per capita.
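
The outline above can be sketched on made-up data as follows. This is one possible pattern under stated assumptions, not the only approach: the data frame `toy` and the columns `income` and `people` are hypothetical stand-ins, so you will need to adapt the names to your own `wdi_percap` dataset.

```r
library(dplyr)

# Toy data for illustration only; names and values are hypothetical.
toy <- data.frame(
  town   = c("A", "B", "C", "D"),
  income = c(10, 20, 30, 100),   # a per-person measure
  people = c(500, 300, 150, 50)  # number of residents
)

result <- toy %>%
  # Step 1: the mean, a below-mean flag, and residents counted only if flagged.
  mutate(
    mean_income  = mean(income, na.rm = TRUE),
    below_mean   = income < mean_income,
    people_below = if_else(below_mean, people, 0)
  ) %>%
  # Step 2: collapse to one row of totals.
  summarize(
    total_people = sum(people, na.rm = TRUE),
    total_below  = sum(people_below, na.rm = TRUE)
  ) %>%
  # Step 3: share of all residents living in below-mean towns.
  mutate(share_below = total_below / total_people)

print(result)
```

In this toy example the mean of `income` is 40, so towns A, B, and C fall below it, giving 950 of 1,000 residents (a share of 0.95).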

In [0]:
# Your answer here!



**e. What is your answer to (d) telling you in terms of how appropriate mean world GDP per capita is in characterizing the economic well-being of the average person in the world? [2-3 sentences]**

`Your answer here!`



### 3. Missing data:

Missing data is, unfortunately, a fact of life, and we face it here.  While we cannot necessarily fix the problem of missing data, it is important to consider its effects.

**a. The WDI database does not have a 2019 GDP figure for how many countries?**

_HINT:_ The function `is.na()` returns `TRUE` for missing values and `FALSE` otherwise. You can calculate the total number of missing values in a vector `x` using `sum(is.na(x))`.
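
For example, on a made-up vector (the name `x` and its values are hypothetical):

```r
# Toy vector for illustration only.
x <- c(3, NA, 7, NA, NA)

# is.na() returns one logical value per element.
missing_flags <- is.na(x)
print(missing_flags)  # FALSE TRUE FALSE TRUE TRUE

# TRUE counts as 1 and FALSE as 0, so the sum is the number of missing values.
n_missing <- sum(missing_flags)
print(n_missing)      # 3
```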

In [0]:
# Your answer here!



**b. The WDI database does not have a 2019 GDP figure for Iran. Describe two distinct ways in which you could approximate Iran’s GDP in 2019 (you do not need to implement your suggestions in R).**

`Your answer here!`



**c. Based on your examination of the data, how would you expect the missing data for Iran in 2019 to affect your calculations of each of the following?  Please answer this question qualitatively (i.e., “The result I reported is likely too low because I do not have Iran’s 2019 GDP in the dataset.”). It is not necessary to do any additional calculations.**

*i. The total world GDP that you reported in part (1).*

`Your answer here!`



*ii. The central tendencies of GDP per capita that you reported in part (2).*

`Your answer here!`



*iii. The variance of GDP per capita (Hint: reference the formula for variance from Class 3)*

`Your answer here!`



### 4. Putting it all together:

Summarize your findings from your analyses above.

**a. In one crisp paragraph:**

`Your answer here!`



**b. In one tweet-length statement (if you'd like, feel free to tweet it using `#api201`!):**

`Your answer here!`



# CORRELATIONS IN THE WORLD

The “Additional Variables” dataframe (`wdi_advars`) includes five other variables that we selected to cover a variety of topics of potential interest.  As indicated in the variable definition table from Q1, these variables are:

* CO2 emissions (metric tons per capita)
* Mortality rate, infant (per 1,000 live births)
* Current health expenditure (% of GDP)
* Literacy rate, adult total (% of people ages 15 and above)
* Proportion of seats held by women in national parliaments (%) 

We have included the most recent available data for these variables, which is often from 2016, 2017, or 2018, not 2019. 

Run the block below to preview the data. When you do so, you'll be able to see which years we have additional variables for.

In [0]:
head(wdi_advars, 10)

Please choose two of the eight variables (possibly including GDP, population, and GDP per capita) that interest you and that you believe might be related.  Then:

**a. Using `ggplot()`, create a scatterplot of the two variables you selected. Label the axes appropriately using `labs()`.**
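
As a minimal sketch of the plotting pattern, the code below draws a scatterplot from a made-up data frame. The names `toy`, `hours_studied`, and `quiz_score` are hypothetical and unrelated to the WDI variables.

```r
library(ggplot2)

# Toy data for illustration only; names and values are hypothetical.
toy <- data.frame(
  hours_studied = c(1, 2, 3, 4, 5, 6),
  quiz_score    = c(52, 60, 61, 70, 78, 85)
)

# aes() maps columns to axes, geom_point() draws one point per row,
# and labs() replaces the default axis labels (the raw column names).
scatter <- ggplot(toy, aes(x = hours_studied, y = quiz_score)) +
  geom_point() +
  labs(
    x     = "Hours studied (hypothetical)",
    y     = "Quiz score (hypothetical)",
    title = "Toy scatterplot"
  )

print(scatter)
```

Including units in the axis labels, where they apply, makes the plot easier to read on its own.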

In [0]:
# Your answer here!



**b. Using the `cor()` function, calculate the correlation coefficient between the two variables you selected.**

_HINT:_ By default, if any values of vectors `x` or `y` are missing, `cor(x, y)` will return `NA`. To calculate the correlation only using non-missing values, use `cor(x, y, use = "complete.obs")`.
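
For example, with two made-up vectors (the names `x` and `y` and their values are hypothetical):

```r
# Toy vectors for illustration only; each has one missing value.
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 6, NA, 10)

# Default behavior: any missing value makes the result NA.
cor_default <- cor(x, y)

# "complete.obs" keeps only the pairs where both values are present,
# here (1, 2), (2, 4), and (3, 6).
cor_complete <- cor(x, y, use = "complete.obs")

print(cor_default)   # NA
print(cor_complete)  # 1, because the three complete pairs lie on a line
```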

In [0]:
# Your answer here!



**c. If you find that the two variables you chose to analyze are in fact correlated (either positively or negatively), suggest two distinct reasons that could explain the relationship you found between the variables.  Please be specific (i.e., explain how your two specific variables could be related, not just how two generic variables might be related).**

`Your answer here!`



Generally in the world, we may find that two variables are highly correlated (either positively or negatively) but the two variables are not causally related.  The correlation may instead be driven by a third variable or may be entirely coincidental.  These situations are sometimes called “spurious correlations.” 

**d. Briefly describe a situation from your personal or professional life in which a spurious correlation led to a wrong interpretation or decision.  \[3-4 sentences\]**

`Your answer here!`



# CHOOSING THE RIGHT STATISTIC

As we discussed in class, choosing the right statistic often depends on the problem we are trying to inform.  Different statistics have different strengths and weaknesses depending on the context. For each of the following policy problems, decide whether the mean alone is enough to inform the problem at hand. If it is, explain why. If it is not, suggest what other statistics you would like to have in addition to the mean.

**a. Earlier this spring, hospital administrators and public health officials spent considerable time planning for an expected surge in patients needing hospital care.  To help them plan, they used data available at the time to calculate the average number of COVID-positive patients who required ICU care and/or ventilator support in regions around the world.**
 

`Your answer here!`



**b. A city is setting its public transportation budget based on finely detailed traffic data. When there’s more traffic, more money is spent on increasing bus and train capacity. The city collects road speeds and traffic volume each day at rush hour and calculates the mean speed and volume during rush hour.**

`Your answer here!`



**c. A town in New England is preparing its budget for snow removal this coming winter.  The town calculates mean spending on snow removal over the previous five years.**

`Your answer here!`



# PCE: Bayes' Rule

The goal of this problem set question is to help prepare you for the class on Bayes' Rule that will be held on Thursday (the day after this problem set is due). You will be asked to read two articles on an important and controversial question: "*Should women in their 40’s have annual mammograms?*". After reading the articles, we ask that you answer one of three possible questions in the Canvas discussion forum. 

Additionally, the PCE has an optional opportunity to review probability rules. 

Remember that you get full credit for completing the required parts of the PCE - your responses will be registered in the system but will not count towards your grade in any way.

__The module is available at the following sites:__

* [Section A (Prof. Borck)](https://canvas.harvard.edu/courses/109219/modules/231151)

* [Section B (Prof. Svoronos)](https://canvas.harvard.edu/courses/109220/modules/231149)

* [Section C (Prof. Goel)](https://canvas.harvard.edu/courses/109221/modules/231138)

**Completed the module**
- [ ] Yes
- [ ] No