In [1]:
suppressPackageStartupMessages(library(rstanarm))
suppressPackageStartupMessages(library(ggformula))
library(tibble)
suppressPackageStartupMessages(library(glue))
suppressPackageStartupMessages(library(dplyr))
library(stringr)

In [2]:
# Set the maximum number of columns and rows to display
options(repr.matrix.max.cols=150, repr.matrix.max.rows=200)
# Set the default plot size
options(repr.plot.width=18, repr.plot.height=12)

In [3]:
download_if_missing <- function(filename, url) {
    if (!file.exists(filename)) {
        dir.create(dirname(filename), showWarnings=FALSE, recursive=TRUE)
        download.file(url, destfile = filename, method="curl")
    }
}

# Assumptions of the regression model

For the model in Section 7.1 predicting presidential vote share from the economy, discuss each of the assumptions in the numbered list in Section 11.1.
For each assumption, state where it is made (implicitly or explicitly) in the model, whether it seems reasonable, and how you might address violations of the assumptions.

The underlying data come from [Douglas A. Hibbs's "Bread and Peace" model](https://douglas-hibbs.com/background-information-on-bread-and-peace-voting-in-us-presidential-elections/).

Even when focusing only on growth, there is a hidden parameter: the growth measure is a geometrically *weighted* average of annualized quarterly real (i.e. CPI-adjusted) income growth rates.
The weight parameter is itself estimated from the data.

## Validity

The underlying model relates the US incumbent party's share of the two-party vote to real income growth. The hypothesis is that if people are earning more money on average, they are more likely to vote for the incumbent.
For this question the data are valid.

The underlying target variable is US election results, which are somewhat representative of future elections.
The two-party vote share is a very valid and reliable measure.

The weighted average of annualized quarterly real income growth rates is a reasonably valid measure.
Income growth rate is measured (somehow) by the Bureau of Labor Statistics.
Real growth is a little slippery: the CPI changes its definition over time, so CPI-adjusted income does not exactly measure real growth, but it is a good proxy.
The weight could be thought of as a parameter in the model; it is valid if it is stable over time.

## Representativeness

## Additivity and linearity

## Independence of errors

## Equal variance of errors

## Normality of errors

# Descriptive and causal inference

## Growth as descriptive variable
For the model in Section 7.1 predicting presidential vote share from the economy, describe the coefficient for economic growth in purely descriptive, non-causal terms.

## Issues with causal interpretation
Explain the difficulties of interpreting that coefficient as the effect of economic growth on the incumbent party's vote share.

# Coverage of confidence intervals

Consider the following procedure:

* Set n=100 and draw n continuous variables $x_i$ uniformly distributed between 0 and 10. Then simulate data from the model $ y_i = a + bx_i + \rm{error}_i $ for $ i = 1,\ldots,n$, with a=2, b=3, and independent errors from a normal distribution.
* Regress y on x. Look at the median and mad sd of b. Check to see if the interval formed by the median $ \pm 2$ mad sd includes the true value, b=3.
* Repeat the above 2 steps 1000 times.
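The procedure above can be sketched in a few lines. This is only an approximation of the exercise: it uses `lm()` with estimate $\pm 2$ standard errors as a fast stand-in for the median and mad sd from `stan_glm()`, and it assumes an error standard deviation of 1, which the exercise leaves unspecified.

```r
set.seed(1)                  # arbitrary seed, for reproducibility only
n <- 100
a <- 2
b <- 3
sigma <- 1                   # assumption: the exercise does not give the error sd
n_reps <- 1000

covered <- replicate(n_reps, {
    x <- runif(n, 0, 10)
    y <- a + b * x + rnorm(n, mean = 0, sd = sigma)
    est <- coef(summary(lm(y ~ x)))
    b_hat <- est["x", "Estimate"]
    b_se <- est["x", "Std. Error"]
    # Does the interval b_hat +/- 2 se contain the true slope?
    abs(b_hat - b) <= 2 * b_se
})

# Proportion of the 1000 intervals that contain b = 3
mean(covered)
```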

## Coverage
True or false: the interval should contain the true value approximately 950 times.
Explain your answer.

## Coverage for non-normal error distributions
Same as above, except the error distribution is bimodal, not normal.
True or false: the interval should contain the true value approximately 950 times.
Explain your answer.
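One way to check this by simulation is sketched below. The bimodal error distribution is a hypothetical choice (an equal mixture of normals centred at -2 and 2, each with sd 0.5), since the exercise does not specify one, and `lm()` with $\pm 2$ standard errors again stands in for the `stan_glm()` median and mad sd.

```r
set.seed(2)                  # arbitrary seed
n <- 100
a <- 2
b <- 3
n_reps <- 1000

# Hypothetical bimodal, mean-zero error: equal mixture of
# normal(-2, 0.5) and normal(2, 0.5)
r_bimodal <- function(n) {
    centre <- sample(c(-2, 2), n, replace = TRUE)
    rnorm(n, mean = centre, sd = 0.5)
}

covered_bimodal <- replicate(n_reps, {
    x <- runif(n, 0, 10)
    y <- a + b * x + r_bimodal(n)
    est <- coef(summary(lm(y ~ x)))
    abs(est["x", "Estimate"] - b) <= 2 * est["x", "Std. Error"]
})

# Proportion of intervals containing the true slope
mean(covered_bimodal)
```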

# Interpreting residual plots

Anna takes continuous data $x_1$ and binary data $x_2$, creates fake data $y$ from the model, $ y = a + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2 + \rm{error}$, and gives these data to Barb, who, not knowing how the data were constructed, fits a linear regression predicting $y$ from $x_1$ and $x_2$ but without the interaction.
In these data, Barb makes a residual plot of $y$ vs $x_1$, using dots and circles to display points with $x_2 = 0$ and $x_2 = 1$, respectively.
The residual plot indicates that she should fit the interaction model.
Sketch with pen on paper a residual plot that Barb could have seen after fitting the regression without interaction.
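As a complement to the pen-and-paper sketch, the situation can be simulated. The coefficients, sample size and error sd below are hypothetical, chosen only so that the pattern is visible; dots are $x_2 = 0$ and circles are $x_2 = 1$.

```r
set.seed(3)                  # arbitrary seed
n <- 200
x1 <- runif(n, 0, 10)
x2 <- rbinom(n, 1, 0.5)

# Anna's model, with hypothetical values a = 1, b1 = 2, b2 = 3, b3 = 1.5
y <- 1 + 2 * x1 + 3 * x2 + 1.5 * x1 * x2 + rnorm(n, 0, 2)

# Barb's model: no interaction
fit <- lm(y ~ x1 + x2)
res <- resid(fit)

# Residuals against x1: dots (pch 20) for x2 = 0, circles (pch 1) for x2 = 1
plot(x1, res, pch = ifelse(x2 == 1, 1, 20), xlab = "x1", ylab = "Residual")
abline(h = 0, lty = 2)

# Slope of the residuals within each group; opposite signs signal
# the missing interaction
slope0 <- coef(lm(res[x2 == 0] ~ x1[x2 == 0]))[[2]]
slope1 <- coef(lm(res[x2 == 1] ~ x1[x2 == 1]))[[2]]
c(slope0 = slope0, slope1 = slope1)
```

In this sketch the residuals trend downwards with $x_1$ in one group and upwards in the other, which is the kind of pattern that would suggest adding the interaction.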

# Residuals and predictions

The folder [`Pyth`](https://github.com/avehtari/ROS-Examples/tree/master/Pyth/) contains outcome $y$ and predictors $x_1$, $x_2$ for 40 data points, with a further 20 points with the predictors but no observed outcome.
Save the file to your working directory, then read it into R using `read.table()`.

In [5]:
filename <- "./data/Pyth/pyth.txt"

download_if_missing(filename,
                    'https://raw.githubusercontent.com/avehtari/ROS-Examples/master/Pyth/pyth.txt')
pyth <- read.table(filename, header=TRUE)
pyth %>% t()

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60
y,15.68,6.18,18.1,9.07,17.97,10.04,20.74,9.76,8.23,6.52,15.69,15.51,20.61,19.58,9.72,16.36,18.3,13.26,12.1,18.15,16.8,16.55,18.79,15.68,4.08,15.45,13.44,20.86,16.05,6.0,3.29,9.41,10.76,5.98,19.23,15.67,7.04,21.63,17.84,7.49,,,,,,,,,,,,,,,,,,,,
x1,6.87,4.4,0.43,2.73,3.25,5.3,7.08,9.73,4.51,6.4,5.72,6.28,6.14,8.26,9.41,2.88,5.74,0.45,3.74,5.03,9.67,3.62,2.54,9.15,0.69,7.97,2.49,9.81,7.56,0.98,0.65,9.0,7.83,0.26,3.64,9.28,5.66,9.71,9.36,0.88,9.87,9.99,8.39,0.8,9.58,4.82,2.97,8.8,6.07,0.19,4.19,5.39,6.58,2.36,2.37,1.52,2.07,6.7,2.02,9.63
x2,14.09,4.35,18.09,8.65,17.68,8.53,19.5,0.72,6.88,1.26,14.62,14.18,19.68,17.75,2.44,16.1,17.37,13.25,11.51,17.44,13.74,16.15,18.62,12.74,4.02,13.24,13.21,18.41,14.16,5.92,3.22,2.74,7.39,5.97,18.89,12.63,4.18,19.32,15.19,7.43,10.43,15.72,0.35,10.91,15.82,11.9,2.46,4.09,1.8,13.54,19.13,14.84,5.28,15.42,4.12,6.54,2.67,12.85,8.36,12.16


## Fit a model

Use R to fit a linear regression model predicting $y$ from $x_1$, $x_2$, using the first 40 data points in the file.
Summarize the inferences and check the fit of your model.
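A minimal sketch of the fit follows. To keep it self-contained, the first 40 data points are copied from the table printed above (equivalent to `pyth[1:40, ]`); the name `pyth40` is introduced here for illustration. It uses `lm()` for speed; `stan_glm(y ~ x1 + x2, data = pyth40, refresh = 0)` from rstanarm should give similar estimates.

```r
# First 40 rows of pyth.txt, copied from the printed table above
pyth40 <- data.frame(
    y = c(15.68, 6.18, 18.1, 9.07, 17.97, 10.04, 20.74, 9.76, 8.23, 6.52,
          15.69, 15.51, 20.61, 19.58, 9.72, 16.36, 18.3, 13.26, 12.1, 18.15,
          16.8, 16.55, 18.79, 15.68, 4.08, 15.45, 13.44, 20.86, 16.05, 6.0,
          3.29, 9.41, 10.76, 5.98, 19.23, 15.67, 7.04, 21.63, 17.84, 7.49),
    x1 = c(6.87, 4.4, 0.43, 2.73, 3.25, 5.3, 7.08, 9.73, 4.51, 6.4,
           5.72, 6.28, 6.14, 8.26, 9.41, 2.88, 5.74, 0.45, 3.74, 5.03,
           9.67, 3.62, 2.54, 9.15, 0.69, 7.97, 2.49, 9.81, 7.56, 0.98,
           0.65, 9.0, 7.83, 0.26, 3.64, 9.28, 5.66, 9.71, 9.36, 0.88),
    x2 = c(14.09, 4.35, 18.09, 8.65, 17.68, 8.53, 19.5, 0.72, 6.88, 1.26,
           14.62, 14.18, 19.68, 17.75, 2.44, 16.1, 17.37, 13.25, 11.51, 17.44,
           13.74, 16.15, 18.62, 12.74, 4.02, 13.24, 13.21, 18.41, 14.16, 5.92,
           3.22, 2.74, 7.39, 5.97, 18.89, 12.63, 4.18, 19.32, 15.19, 7.43)
)

# Least squares fit of y on x1 and x2
fit_pyth <- lm(y ~ x1 + x2, data = pyth40)
print(summary(fit_pyth))

# Residual sd and R^2 as quick summaries of fit
c(sigma = sigma(fit_pyth), r_squared = summary(fit_pyth)$r.squared)
```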

## Graphing model

Display the estimated model graphically as in Figure 11.2.

## Assumptions

Make a residual plot for this model.
Do the assumptions appear to be met?

## Predictions

Make predictions for the remaining 20 data points in the file.
How confident do you feel about these predictions?

## Data source

After doing this exercise, take a look at [Gelman and Nolan](http://www.stat.columbia.edu/~gelman/bag-of-tricks/) (2017, section 10.4) to see where these data came from.

# Fitting a wrong model

Suppose you have 100 data points that arose from the following model: $ y= 3 + 0.1x_1 +0.5 x_2 + \rm{error}$, with independent errors drawn from a t distribution with mean 0, scale 5, and 4 degrees of freedom.
We shall explore the implications of fitting a standard linear regression to these data.

## Simulating

Simulate data from this model.
For simplicity, suppose the values of $x_1$ are simply the integers from 1 to 100, and that the values of $x_2$ are random and equally likely to be 0 or 1.
In R, you can define `x_1 <- 1:100`, simulate `x_2` using `rbinom`, then create the linear predictor, and finally simulate the random errors in `y` using the `rt` function.
Fit a linear regression (with normal errors) to these data and see if the 68% confidence intervals for the regression coefficients (for each, the estimates $\pm 1 $ standard error) cover their true values.
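A sketch of a single simulation, following the steps described above. The errors are drawn as `5 * rt(n, df = 4)`, i.e. a $t_4$ distribution scaled by 5; the seed is arbitrary.

```r
set.seed(4)                  # arbitrary seed
n <- 100
x_1 <- 1:n
x_2 <- rbinom(n, 1, 0.5)

# True coefficients: intercept, x_1, x_2
true_coef <- c(3, 0.1, 0.5)

# Linear predictor plus t-distributed errors (scale 5, 4 degrees of freedom)
y <- true_coef[1] + true_coef[2] * x_1 + true_coef[3] * x_2 + 5 * rt(n, df = 4)

# Standard linear regression, which assumes normal errors
fit_t <- lm(y ~ x_1 + x_2)
est <- coef(summary(fit_t))

# 68% intervals: estimate +/- 1 standard error
lower <- est[, "Estimate"] - est[, "Std. Error"]
upper <- est[, "Estimate"] + est[, "Std. Error"]
covers <- lower < true_coef & true_coef < upper

data.frame(true = true_coef, lower, upper, covers)
```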

## Coverage

Put the above step in a loop and repeat 1000 times.
Calculate the confidence coverage for the 68% intervals for each of the three coefficients in the model.
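The loop might look like the following sketch. It redraws `x_2` and the errors in every replication (an assumption; one could equally hold `x_2` fixed) and records, for each coefficient, whether the $\pm 1$ standard error interval contains the true value.

```r
set.seed(5)                  # arbitrary seed
n <- 100
n_reps <- 1000
x_1 <- 1:n
true_coef <- c("(Intercept)" = 3, "x_1" = 0.1, "x_2" = 0.5)

# One row per replication, one column per coefficient
covers_mat <- matrix(NA, nrow = n_reps, ncol = 3,
                     dimnames = list(NULL, names(true_coef)))

for (r in seq_len(n_reps)) {
    x_2 <- rbinom(n, 1, 0.5)
    y <- true_coef[1] + true_coef[2] * x_1 + true_coef[3] * x_2 +
        5 * rt(n, df = 4)
    est <- coef(summary(lm(y ~ x_1 + x_2)))
    covers_mat[r, ] <- abs(est[, "Estimate"] - true_coef) < est[, "Std. Error"]
}

# Proportion of 68% intervals covering the truth, per coefficient
coverage <- colMeans(covers_mat)
coverage
```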

# Correlation and explained variance

In a least squares regression with one predictor, show that $R^2$ equals the square of the correlation between $x$ and $y$.
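One possible derivation, writing the fitted line as $\hat{y}_i = \hat{a} + \hat{b} x_i$ (the hats and the symbol $r$ for the sample correlation are notation introduced here). Least squares gives $\hat{a} = \bar{y} - \hat{b}\bar{x}$, so $\hat{y}_i - \bar{y} = \hat{b}(x_i - \bar{x})$, with

$$
\hat{b} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}.
$$

Taking $R^2$ as the proportion of variance explained (which, for least squares with an intercept, equals one minus the ratio of residual to total sum of squares):

$$
\begin{aligned}
R^2 &= \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}
     = \frac{\hat{b}^2 \sum_i (x_i - \bar{x})^2}{\sum_i (y_i - \bar{y})^2} \\
    &= \frac{\left[\sum_i (x_i - \bar{x})(y_i - \bar{y})\right]^2}{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}
     = r^2 .
\end{aligned}
$$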

# Using simulation to check the fit of a time-series model

Find time-series data and fit a first-order autoregression model to it.
Then use predictive simulation to check the fit of this model as in Section 11.5.
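A rough sketch of one way to do this. The built-in `LakeHuron` series is used purely as an example dataset, the model is fit with `lm()` rather than `stan_glm()`, and the replications plug in the point estimates, so they ignore parameter uncertainty. The test statistic, the number of switches in direction, is one reasonable choice for a time series and is an assumption here.

```r
set.seed(6)                          # arbitrary seed
y <- as.numeric(LakeHuron)           # example series from the datasets package
n <- length(y)

# First-order autoregression: regress y_t on y_{t-1}
ar_data <- data.frame(y = y[-1], y_lag = y[-n])
fit_ar <- lm(y ~ y_lag, data = ar_data)
a_hat <- coef(fit_ar)[["(Intercept)"]]
b_hat <- coef(fit_ar)[["y_lag"]]
s_hat <- sigma(fit_ar)

# Test statistic: how often the series changes direction
n_switches <- function(y) {
    d <- sign(diff(y))
    sum(d[-1] != d[-length(d)])
}

# Simulate replicated series from the fitted model, starting at y[1]
n_sims <- 1000
switches_rep <- replicate(n_sims, {
    y_rep <- numeric(n)
    y_rep[1] <- y[1]
    for (i in 2:n) {
        y_rep[i] <- a_hat + b_hat * y_rep[i - 1] + rnorm(1, 0, s_hat)
    }
    n_switches(y_rep)
})

# Compare the observed statistic with its replicated distribution
switches_obs <- n_switches(y)
c(observed = switches_obs,
  prop_rep_at_least_obs = mean(switches_rep >= switches_obs))
```

If the observed number of switches falls far in the tail of the replicated distribution, that suggests the first-order autoregression is missing some structure in the series.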

# Leave-one-out cross validation

Use LOO to compare different models fit to the beauty and teaching evaluations example from Exercise 10.6.

In [6]:
filename <- "./data/Beauty/beauty.csv"

download_if_missing(filename,
                    'https://raw.githubusercontent.com/avehtari/ROS-Examples/master/Beauty/data/beauty.csv')
beauty <- read.csv(filename)

beauty

eval,beauty,female,age,minority,nonenglish,lower,course_id
<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>
4.3,0.2015666,1,36,1,0,0,3
4.5,-0.8260813,0,59,0,0,0,0
3.7,-0.6603327,0,51,0,0,0,4
4.3,-0.7663125,1,40,0,0,0,2
4.4,1.4214450,1,31,0,0,0,0
4.2,0.5002196,0,62,0,0,0,0
4.0,-0.2143501,1,33,0,0,0,4
3.4,-0.3465390,1,51,0,0,0,0
4.5,0.0613435,1,33,0,0,0,0
3.9,0.4525679,0,47,0,0,0,4


## Comparing LOO
Discuss the LOO results for the different models and what this implies, or should imply, for model choice in this example.

## Pointwise errors as outliers
Compare predictive errors pointwise.
Are there some data points that have high predictive errors for all the fitted models?

# K-fold cross validation

Repeat part (a) of the previous exercise, but using 5-fold cross validation.

## Sampling
Randomly partition the data into five parts using the `sample` function in R.

## Fitting the folds
For each part, re-fit the model excluding that part, then use each fitted model to predict the outcomes for the left-out part, and compute the sum of squared errors for the prediction.

## Assessing cross-validated scores

For each model, add up the sum of squared errors for the five steps in (b).
Compare the different models based on this fit.
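The three steps can be sketched as below. To stay self-contained, the example runs on simulated stand-in data that only borrows the column names `eval`, `beauty` and `female`; the helper `kfold_sse()` is a hypothetical name, and `lm()` is used in place of `stan_glm()`. With the real data, replace `fake` with the `beauty` data frame.

```r
set.seed(7)                          # arbitrary seed
# Fake stand-in data (NOT the real beauty data), hypothetical coefficients
n <- 200
fake <- data.frame(beauty = rnorm(n), female = rbinom(n, 1, 0.5))
fake$eval <- 4 + 0.15 * fake$beauty - 0.2 * fake$female + rnorm(n, 0, 0.5)

# Step 1: randomly partition the rows into five parts
K <- 5
fold <- sample(rep(1:K, length.out = n))

# Step 2: for each part, fit without it and score predictions on it.
# Returns one sum of squared errors per fold.
kfold_sse <- function(formula, data, fold) {
    outcome <- all.vars(formula)[1]
    sapply(sort(unique(fold)), function(k) {
        fit <- lm(formula, data = data[fold != k, ])
        pred <- predict(fit, newdata = data[fold == k, ])
        sum((data[[outcome]][fold == k] - pred)^2)
    })
}

# Using the same folds for every model keeps the comparison fair
sse_1 <- kfold_sse(eval ~ beauty, fake, fold)
sse_2 <- kfold_sse(eval ~ beauty + female, fake, fold)

# Step 3: total out-of-sample squared error per model (lower is better)
c(model_1 = sum(sse_1), model_2 = sum(sse_2))
```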