# Quiz 5

BEFORE YOU START THIS QUIZ:

1. Click on "Copy to Drive" to make a copy of the quiz,

2. Click on "Share",
    
3. Click on "Change" and select "Anyone with this link can edit",
    
4. Click "Copy link" and

5. Paste the link into [this Canvas assignment](https://canvas.olin.edu/courses/390/assignments/6271).

DO THIS BEFORE YOU START THE QUIZ.

The following cells define functions we have used before.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
def decorate(**options):
    """Decorate the current axes.

    Call decorate with keyword arguments like
    decorate(title='Title',
             xlabel='x',
             ylabel='y')

    The keyword arguments can be any of the axis properties
    https://matplotlib.org/api/axes_api.html
    """
    ax = plt.gca()
    ax.set(**options)

    handles, labels = ax.get_legend_handles_labels()
    if handles:
        ax.legend(handles, labels)

    plt.tight_layout()

The following cell installs `empiricaldist` if needed.

In [4]:
try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

## Vaccines and GDP

This example is based on an article in *The Economist*,
[Conspiracy theories about covid-19 vaccines may prevent herd immunity](https://www.economist.com/graphic-detail/2020/08/29/conspiracy-theories-about-covid-19-vaccines-may-prevent-herd-immunity).

Thanks to a friend at *The Economist*, I obtained the data used to generate the figures in this article.  Ultimately, the sources are [a poll conducted by the Wellcome Trust](https://wellcome.org/reports/wellcome-global-monitor/2018) and data from the World Bank.

The following cell downloads the data:

In [5]:
from os.path import basename, exists


def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)


download(
    "https://github.com/AllenDowney/ElementsOfDataScience/raw/master/data/vaxplot.csv"
)

The following cell reads the data into a Pandas `DataFrame`.

In [6]:
df = pd.read_csv("vaxplot.csv", index_col=0)
df.shape

In [7]:
df.head()

This dataset contains one row for each of 144 countries (out of approximately 195). The columns we will use are:

* `vaccinesafe`: Fraction of respondents who believe vaccines are safe.

* `trustscience`: Fraction who trust science (index based on multiple questions).

* `trustdoctor`: Fraction who trust doctors (index based on multiple questions).

* `gdppcppp`: Gross Domestic Product (a measure of formal economic activity) per capita, adjusted to reflect the cost of living in each country.

* `country`: Country name.

* `region2`: Which of several global regions the country is in (for a loose definition of "region").

I'll add a new column with the log (base 10) of `gdppcppp`.

In [8]:
df["log_gdp"] = np.log10(df["gdppcppp"])
df["log_gdp"]

The dataset is mostly complete, but there are a few countries where we don't have GDP data.

In [9]:
df.isna().sum()

**Question 1:** Use `Cdf` from `empiricaldist` to compute the CDF of `log_gdp` across the countries in the dataset. Assign the result to a variable named `cdf_gdp`.
Plot the CDF and label the axes.
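
As a reminder of what a CDF represents: it maps each observed value to the fraction of observations less than or equal to it. The following is a minimal NumPy sketch of that idea on made-up data. It is not how `empiricaldist` is implemented, and the helper name `empirical_cdf` and the sample values are assumptions for illustration only.

```python
import numpy as np


def empirical_cdf(values):
    """Return sorted unique values and the fraction of data <= each one.

    Hypothetical helper for illustration; NaNs are dropped first.
    """
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]
    qs, counts = np.unique(values, return_counts=True)
    ps = np.cumsum(counts) / counts.sum()
    return qs, ps


# Made-up sample with a repeated value and a missing value
sample = [3.0, 4.0, 4.0, 5.0, np.nan]
qs, ps = empirical_cdf(sample)
print(qs)  # unique values in ascending order
print(ps)  # cumulative fractions, ending at 1.0
```

The repeated value produces a larger step in the CDF, and the missing value does not contribute to the total.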

**Question 2:** The following cell plots the `Cdf` you computed in the previous question along with the CDF of a Gaussian distribution with the same mean and standard deviation.

In the cell below it, write a sentence or two to interpret the results. If the distribution of `log_gdp` is approximately Gaussian, what does that mean about the distribution of GDP?

In [11]:
from scipy.stats import norm

mu, sigma = cdf_gdp.mean(), cdf_gdp.std()
xs = cdf_gdp.qs
ys = norm(mu, sigma).cdf(xs)
plt.plot(xs, ys, color="gray", label="Normal CDF")

cdf_gdp.plot()

decorate(xlabel="GDP per person at PPP (log10 USD)", ylabel="CDF")

**Question 3:** Make a scatter plot that shows `log_gdp` on the x-axis and `vaccinesafe` on the y-axis. Adjust the size and transparency of the markers so they don't overlap too much.

Optionally, you can use the following function to label some of the markers with country names. You might find it interesting to label Togo, Ukraine, Belarus, and Japan.

In [13]:
import matplotlib.pyplot as plt


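def plot_name(df, name):
    """Label the marker for the country called `name`."""
    rows = df[df["country"] == name]
    # Pass scalar coordinates to plt.text rather than one-element Series
    for _, row in rows.iterrows():
        plt.text(row["log_gdp"], row["vaccinesafe"], name)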

**Question 4:** Use `linregress` from `scipy.stats` to compute the line of best fit to the relationship between `vaccinesafe` and `log_gdp`. Display the slope of the fitted line.

**Question 5:** Use the results from the previous question to compute the line of best fit for a range of values of `log_gdp` from 3 to 5.
Plot the line of best fit on top of the scatter plot and confirm that the line passes through the visual center of the data.
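
As a reminder of how `linregress` results can be used, here is a minimal sketch on synthetic data (not the quiz data). The variable names `x`, `y`, `fit_xs`, and `fit_ys`, the random seed, and the true slope of 0.2 are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic data: y depends linearly on x, plus Gaussian noise
rng = np.random.default_rng(17)
x = rng.uniform(3, 5, size=100)
y = 0.5 + 0.2 * x + rng.normal(0, 0.05, size=100)

# linregress takes the explanatory variable first, then the response
result = linregress(x, y)
print(result.slope, result.intercept)

# Evaluate the fitted line on an evenly spaced grid of x values
fit_xs = np.linspace(3, 5, 50)
fit_ys = result.intercept + result.slope * fit_xs
```

The arrays `fit_xs` and `fit_ys` can then be passed to `plt.plot` to draw the line over a scatter plot.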

**Question 6:** Based on the `log_gdp` of the United States, compute the value of `vaccinesafe` we would expect based on the line of best fit. What is the difference between this predicted value and the actual value?

In [26]:
# you can use this code to select a single row from `df`

index = (df["country"] == "United States")
row = df.loc[index]
row

**Question 7:** The following cell uses the Seaborn function `pairplot` to explore the pairwise relationships among `vaccinesafe`, `trustscience`, and `trustdoctor`.
The result is a matrix where the diagonal elements show the histogram of each variable and the off-diagonal elements show scatter plots of each pair.

Based on these results, which pair of variables seems to be most strongly correlated?

In [17]:
data = df[["vaccinesafe", "trustscience", "trustdoctor"]]
sns.pairplot(data)
None

**Question 8:** Compute a correlation matrix that shows the coefficient of correlation for each pair of variables (for just these three variables).
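
For reference, the Pandas method `DataFrame.corr` returns a symmetric matrix of pairwise correlations with ones on the diagonal. The sketch below shows its shape on synthetic data. The `toy` DataFrame and its columns `a`, `b`, and `c` are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
a = rng.normal(size=200)

toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=200),  # strongly related to a
    "c": rng.normal(size=200),                 # unrelated to a
})

# Pearson correlation for every pair of columns
corr = toy.corr()
print(corr)
```

In this toy example, the entry for `a` and `b` should be close to 1 and the entries involving `c` close to 0.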

**Question 9:** The following cell makes a "facet plot" that shows a small scatter plot of `vaccinesafe` versus `log_gdp` for each region. In which regions does the correlation between these two variables seem to be positive?

In [20]:
g = sns.FacetGrid(df, col="region2")
g.map(sns.scatterplot, "log_gdp", "vaccinesafe")
None

**Question 10:** Use `groupby` to group the dataset by `region2` and then write a loop that uses `linregress` to estimate the slope of the relationship between `vaccinesafe` and `log_gdp` in each region. When the loop runs, it should print each region name and the estimated slope.

Hint: You might have to drop rows with missing data.
Normally you want to drop only rows that are missing data you need, but in this case you can drop any row with missing data.
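
The difference between the two ways of dropping rows can be seen in a small sketch. The `toy` DataFrame and its columns `x`, `y`, and `note` are made up for illustration and are not part of the quiz data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "note": ["a", None, "c", "d"],  # column not needed for the analysis
})

# Drop any row with a missing value in any column
strict = toy.dropna()

# Drop only rows missing a value in the columns we need
needed = toy.dropna(subset=["x", "y"])

print(len(strict), len(needed))  # prints: 2 3
```

Here `dropna()` discards the second row only because `note` is missing, while `dropna(subset=...)` keeps it.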