# Quiz 6

BEFORE YOU START THIS QUIZ:

1. Click on "Copy to Drive" to make a copy of the quiz.

2. Click on "Share".

3. Click on "Change" and select "Anyone with this link can edit".

4. Click "Copy link".

5. Paste the link into [this Canvas assignment](https://canvas.olin.edu/courses/390/assignments/6317).

DO THIS BEFORE YOU START THE QUIZ.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Happiness

Recently I read [Happiness and Life Satisfaction](https://ourworldindata.org/happiness-and-life-satisfaction)
by Esteban Ortiz-Ospina and Max Roser, which discusses (among many other things) the relationship between income and happiness between countries, within countries, and over time.

It cites the [World Happiness Report](https://worldhappiness.report/), which I skimmed looking for examples of multiple regression and sources of data.

Their [Table 2.1](https://worldhappiness.report/ed/2020/social-environments-for-world-happiness/) reports the result of a multiple regression analysis that explores the relationship between happiness and six potentially predictive factors:

**Natural logarithm of GDP per capita:**
"GDP per capita is in terms of Purchasing
Power Parity (PPP) adjusted to constant
2011 international dollars, taken from
the World Development Indicators
(WDI) released by the World Bank in
September 2017."

**Social support:**
"The national average
of the binary responses (either 0 or 1)
to the Gallup World Poll (GWP)
question “If you were in trouble, do
you have relatives or friends you can
count on to help you whenever you
need them, or not?”"

**Healthy life expectancy:**
"The time series of healthy life expectancy
at birth are constructed based on data
from the World Health Organization
(WHO) and WDI. WHO publishes the
data on healthy life expectancy for
the year 2012. The time series of life
expectancies, with no adjustment for
health, are available in WDI. We adopt
the following strategy to construct the
time series of healthy life expectancy
at birth: first we generate the ratios
of healthy life expectancy to life
expectancy in 2012 for countries
with both data. We then apply the
country-specific ratios to other years
to generate the healthy life expectancy
data."

**Freedom to make life choices:**
"The
national average of binary responses to
the GWP question “Are you satisfied or
dissatisfied with your freedom to
choose what you do with your life?”"

**Generosity:**
"The residual of regressing
the national average of GWP responses
to the question “Have you donated
money to a charity in the past month?”
on GDP per capita."

**Perceptions of corruption:**
"The average
of binary answers to two GWP questions:
“Is corruption widespread throughout the
government or not?” and “Is corruption
widespread within businesses or not?”
Where data for government corruption
are missing, the perception of business
corruption is used as the overall
corruption-perception measure."
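The Generosity factor above is defined as a residual, which may be unfamiliar. Here is a minimal sketch of the idea on synthetic data. The variable names `gdp` and `donated` and all of the numbers are made up for illustration; they are not values from the World Happiness Report, and the report's actual procedure may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, made-up data: NOT values from the World Happiness Report
gdp = rng.uniform(1, 60, size=100)  # hypothetical GDP per capita, in thousands
donated = 0.2 + 0.004 * gdp + rng.normal(0, 0.1, size=100)

# Fit donated as a linear function of gdp by least squares
slope, intercept = np.polyfit(gdp, donated, 1)

# The residual is the part of `donated` that the fitted line does not explain
residual = donated - (intercept + slope * gdp)

print(residual.mean())                    # very close to 0
print(np.corrcoef(gdp, residual)[0, 1])   # very close to 0
```

By construction, the residual is uncorrelated with the predictor, so a measure defined this way captures how much more or less a country donates than we would expect given its GDP.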

The dependent variable, happiness, is the national average of responses to the "Cantril ladder question" used by the [Gallup World Poll](https://news.gallup.com/poll/122453/understanding-gallup-uses-cantril-scale.aspx):

> Please imagine a ladder with steps numbered from zero at the bottom to 10 at the top.
The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you.
On which step of the ladder would you say you personally feel you stand at this time?  

The data used to make this table can be [downloaded from here](https://happiness-report.s3.amazonaws.com/2020/WHR20_DataForFigure2.1.xls).

In [2]:
from os.path import basename, exists


def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)


download("https://happiness-report.s3.amazonaws.com/2020/WHR20_DataForFigure2.1.xls")

Now we can read the data into a Pandas `DataFrame`.

In [28]:
import sys

if "google.colab" in sys.modules:
    !pip install -U xlrd

In [3]:
whr = pd.read_excel("WHR20_DataForFigure2.1.xls")
whr.head()

## Correlations

Before we go on, I'm going to make a new dataframe that contains only the columns we need, and I'm going to give them shorter names.

In [4]:
subset = pd.DataFrame()

subset["ladder"] = whr["Ladder score"]
subset["log_gdp"] = whr["Logged GDP per capita"]
subset["social"] = whr["Social support"]
subset["life_exp"] = whr["Healthy life expectancy"]
subset["freedom"] = whr["Freedom to make life choices"]
subset["generosity"] = whr["Generosity"]
subset["corruption"] = whr["Perceptions of corruption"]

Here's what the correlations look like among these variables.

In [5]:
subset.corr()

**Question 1:** Use the Seaborn function `pairplot` to generate distributions and scatterplots for the columns in `subset`. Do you see any scatterplots that might indicate a non-linear relationship?

**Question 2:** Use `linregress` from `scipy.stats` to compute a simple regression of `ladder` as a function of `log_gdp`. 

**Question 3:** Write a sentence or two that interpret the estimated slope. For example, if we are comparing two countries with different GDPs, what difference would we expect in their average happiness?

**Question 4:** Use `ols` from `statsmodels` to compute a simple regression of `ladder` as a function of `log_gdp`. Confirm that the estimated slope and intercept are the same as what we got from `linregress`.

In [10]:
# Here's the import statement for `statsmodels`

import statsmodels.formula.api as smf

**Question 5:** Using the results from the previous question and the range of `xs` from the following cell, compute the corresponding `ys` for the line of best fit.
Plot this line along with a scatter plot of `ladder` vs `log_gdp`.
The line should pass through the visual center of the scatter plot.

In [12]:
low, high = subset["log_gdp"].min(), subset["log_gdp"].max()
xs = np.linspace(low, high)

**Question 6:** Let's see what the relationship is between `corruption` and `ladder` while controlling for GDP. 

Use `ols` to compute a multiple regression with `ladder` as the dependent variable and both `corruption` and `log_gdp` as predictors. Display the estimated parameters.

**Question 7:** Based on the scatter plot, it looks like the relationship between `corruption` and `ladder` might be nonlinear. To explore this possibility, add a new column to `subset` that contains the square of the values from `corruption`.
Then run a multiple regression with `ladder` as the dependent variable and with `log_gdp`, `corruption`, and your new variable as predictors.

**Question 8:** To visualize the results from the previous model, let's generate some predictions.

Make a new `DataFrame` named `pred` that represents hypothetical countries with different perceived levels of corruption.

Now add a column named `corruption` that contains an array of values between `low` and `high`, as computed in the next cell.
Then add a column that contains the square of these corruption values. Finally, add a column that sets `log_gdp` to `9.0`, which is near the average across countries.

In [17]:
low, high = subset["corruption"].min(), subset["corruption"].max()

The following cell uses the `pred` `DataFrame` and the results from the previous regression to generate predictions for the hypothetical countries. It assumes the fitted model from Question 7 is stored in a variable named `results`.

In [19]:
pred["ladder"] = results.predict(pred)

If everything so far has worked, you should be able to use the following cell to plot the predictions from the model along with a scatter plot of `ladder` vs `corruption`.

In [20]:
plt.plot(pred["corruption"], pred["ladder"], color="gray")
plt.plot(subset["corruption"], subset["ladder"], "o")

plt.xlabel("Perception of corruption")
plt.ylabel("Happiness ladder")
None
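The general idea behind Questions 7 and 8 (add a squared term, then predict over a grid of values) can be sketched on synthetic data. This is only an illustration under stated assumptions: it uses NumPy's least-squares solver rather than `ols`, it has a single predictor with no control for GDP, and the coefficients and noise level are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with a known curved relationship (made-up coefficients)
x = rng.uniform(0, 1, size=200)
y = 7.0 - 1.0 * x - 2.0 * x**2 + rng.normal(0, 0.3, size=200)

# Design matrix with columns: intercept, x, x squared
X = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predictions over an evenly spaced grid between the min and max of x
grid = np.linspace(x.min(), x.max())
X_grid = np.column_stack([np.ones_like(grid), grid, grid**2])
pred_y = X_grid @ coef

print(coef)  # estimated intercept, linear, and quadratic coefficients
```

Because the model is still linear in its parameters, adding a squared column lets an ordinary linear regression fit a curve.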

## Standardizing

The following cell runs a "kitchen sink model" with all of the predictors (but not the square of corruption).

In [21]:
formula = "ladder ~ log_gdp + social + life_exp + freedom + generosity + corruption"

results = smf.ols(formula, data=subset).fit()
results.params

It is tempting to compare the estimated parameters to see which ones have the strongest relationship with `ladder`, but that would be misleading because the various predictors are in different units with different ranges of values.
To demonstrate that point, here are their standard deviations. Some are clearly bigger than others.

In [22]:
subset.std()
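To see why raw coefficients depend on units, here is a small sketch on synthetic data. The names `x_years` and `x_decades` and all of the numbers are hypothetical. The same predictor is expressed in two different units, and the estimated slope changes by exactly the conversion factor even though the underlying relationship is the same.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic predictor and outcome (made-up numbers)
x_years = rng.normal(65, 7, size=150)
y = 0.1 * x_years + rng.normal(0, 0.5, size=150)

# The same predictor expressed in decades instead of years
x_decades = x_years / 10

slope_years, _ = np.polyfit(x_years, y, 1)
slope_decades, _ = np.polyfit(x_decades, y, 1)

# The slope per decade is 10 times the slope per year
print(slope_years, slope_decades)
```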

A solution to this problem is to "standardize" the dependent variable and the predictors so that they all have mean 0 and standard deviation 1.
Here's how we can standardize the columns in `subset`.

In [23]:
standardized = (subset - subset.mean()) / subset.std()

The following cell confirms that the means are near 0.

In [24]:
standardized.mean()

And the standard deviations are near 1.

In [25]:
standardized.std()
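One consequence of standardizing is worth seeing on a small example. In a simple regression with one predictor, if both variables are standardized, the estimated slope equals the Pearson correlation and the intercept is zero. The sketch below checks this on synthetic data with made-up column names `x` and `y`. Note that this identity holds only with a single predictor; in a multiple regression, standardized coefficients generally differ from the pairwise correlations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic data (made-up numbers)
df = pd.DataFrame({"x": rng.normal(50, 10, size=120)})
df["y"] = 3.0 + 0.2 * df["x"] + rng.normal(0, 2, size=120)

# Standardize each column to mean 0 and standard deviation 1
z = (df - df.mean()) / df.std()

# Simple regression of standardized y on standardized x
slope, intercept = np.polyfit(z["x"].to_numpy(), z["y"].to_numpy(), 1)

# Pearson correlation of the original columns
r = df["x"].corr(df["y"])

print(slope, r)    # these two agree
print(intercept)   # very close to 0
```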

**Question 9:** Run the "kitchen sink" regression again using the standardized data and display the parameters.
You can interpret the parameters to mean "if two countries are the same except that one of their predictors differs by one standard deviation, we expect a corresponding difference of X standard deviations in the dependent variable", where X is the estimated parameter. A negative X means the dependent variable is expected to be lower in the country where the predictor is higher.


**Question 10:** Based on these results, which factor has the strongest relationship with happiness? Which has the weakest?

*Elements of Data Science*

Copyright 2022 Allen Downey

License: [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/)