In [1]:
suppressPackageStartupMessages(library(rstanarm))
suppressPackageStartupMessages(library(ggformula))
library(tibble)
suppressPackageStartupMessages(library(glue))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(modelr))
library(stringr)

In [2]:
# Set the maximum number of columns and rows to display
options(repr.matrix.max.cols=150, repr.matrix.max.rows=200)
# Set the default plot size
options(repr.plot.width=18, repr.plot.height=12)

In [3]:
download_if_missing <- function(filename, url) {
    if (!file.exists(filename)) {
        dir.create(dirname(filename), showWarnings=FALSE, recursive=TRUE)
        download.file(url, destfile = filename, method="curl")
    }
}

# Plotting linear and quadratic regressions

The folder [`Earnings`](https://github.com/avehtari/ROS-Examples/tree/master/Earnings/) has data on weight (in pounds), age (in years), and other information from a sample of American adults.
We create a new variable, `age10` = age/10, and fit the following regression predicting weight:

```
            Median   MAD_SD
(Intercept)  148.7     2.2
age10          1.8     0.5

Auxiliary parameter(s)
      Median   MAD_SD
sigma  34.5       0.6
```

## Sketching the model

With pen on paper, sketch a scatterplot of weights versus age (that is, weight on the y-axis, age on the x-axis) that is consistent with the above information, also drawing the fitted regression line.
Do this just given the information here and your general knowledge about adult heights and weights: do not download the data.

## Adding a quadratic term

Next, we define `age10_sq` $= (\rm{age}/10)^2$ and predict weight as a quadratic function of age:

```
            Median   MAD_SD
(Intercept)  108.0     5.7
age10         21.3     2.6
age10sq       -2.0     0.3

Auxiliary parameter(s)
      Median   MAD_SD
sigma  33.9       0.6
```

Draw this fitted curve on the graph you already sketched above.
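
As a check on the pen-and-paper sketch, both fitted curves can be drawn from the reported medians alone. The following is a minimal sketch in base R: the age range of 18 to 90 and the simulated points are assumptions made for illustration, not the actual data.

```r
# Coefficients copied from the two regression summaries above
linear_fit <- function(age) 148.7 + 1.8 * (age / 10)
quadratic_fit <- function(age) 108.0 + 21.3 * (age / 10) - 2.0 * (age / 10)^2

# Assumed age range for the sketch (not taken from the data)
ages <- seq(18, 90, by = 1)

# Simulated weights scattered around the linear fit, using sigma = 34.5
set.seed(1)
sim_age <- runif(200, 18, 90)
sim_weight <- linear_fit(sim_age) + rnorm(200, 0, 34.5)

plot(sim_age, sim_weight, col = "grey50",
     xlab = "Age (years)", ylab = "Weight (pounds)")
lines(ages, linear_fit(ages), lwd = 2)             # solid: linear fit
lines(ages, quadratic_fit(ages), lwd = 2, lty = 2) # dashed: quadratic fit

# The quadratic peaks where its derivative is zero: age10 = 21.3 / (2 * 2.0)
peak_age <- 10 * 21.3 / (2 * 2.0)
peak_age
```

The residual standard deviation is large relative to the slope, so the cloud of points should look nearly flat, with the quadratic curve rising and then falling gently inside it.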

# Plotting regression with a continuous variable broken into categories

Continuing Exercise 12.1, we divide age into 4 categories and create corresponding indicator variables, `age18_29`, `age30_44`, `age45_64`, and `age65_up`. We then fit the following regression:

```
stan_glm(weight ~ age30_44 + age45_64 + age65_up, data=earnings)

             Median   MAD_SD
(Intercept)   147.8     1.6
age30_44TRUE    9.6     2.1
age45_64TRUE   16.6     2.3
age65_upTRUE    7.5     2.7

Auxiliary parameter(s)
      Median   MAD_SD
sigma  34.1       0.6
```

## Missing Indicator

Why did we not include an indicator for the youngest group, `age18_29`?

## Sketch the graph

Using the same axes and scale as in your graph for Exercise 12.1, sketch with pen on paper the scatterplot, along with the above regression function, which will be discontinuous.
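
The discontinuous regression function can also be drawn directly from the coefficients. This is a rough sketch in base R: it treats the omitted `age18_29` group as the baseline carried by the intercept, and the upper age limit of 90 is an assumed plotting range.

```r
# Coefficients copied from the stan_glm summary above
intercept <- 147.8
offsets <- c(age18_29 = 0, age30_44 = 9.6, age45_64 = 16.6, age65_up = 7.5)

# Predicted weight in each age group: baseline plus that group's coefficient
group_means <- intercept + offsets
group_means

# Step function over age; 90 is an assumed upper limit for plotting
breaks <- c(18, 30, 45, 65, 90)
plot(0, 0, type = "n", xlim = c(18, 90), ylim = c(140, 170),
     xlab = "Age (years)", ylab = "Predicted weight (pounds)")
segments(x0 = head(breaks, -1), y0 = group_means,
         x1 = tail(breaks, -1), y1 = group_means, lwd = 2)
```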

### Check

In [11]:
filename <- "./data/Earnings/earnings.csv"

download_if_missing(filename,
                    'https://raw.githubusercontent.com/avehtari/ROS-Examples/master/Earnings/data/earnings.csv')
earnings <- read.csv(filename, header=TRUE)

# Scale of regression coefficients

A regression was fit to data from different countries, predicting the rate of civil conflicts given a set of geographic and political predictors.
Here are the estimated coefficients and their z-scores (coefficient divided by standard error), given to three decimal places.

|  | Estimate | z-score |
| --- | --- | --- |
| Intercept | -3.814 | -20.178 |
| Conflict before 2000 | 0.020 | 1.861 |
| Distance to border | 0.000 | 0.450 |
| Distance to capital | 0.000 | 1.629 |
| Population | 0.000 | 2.482 |
| % mountainous | 1.641 | 8.518 |
| % irrigated | -0.027 | -1.663 |
| GDP per capita | -0.000 | -3.589 |

Why are the coefficients for distance to border, distance to capital, population, and GDP per capita so small?
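
One way to explore this question is to see how a coefficient responds to the units of its predictor. The sketch below uses simulated data and an ordinary `lm` fit (both assumptions, since the original data and model are not given here); the variable names are hypothetical.

```r
set.seed(123)
n <- 200

# Hypothetical predictor measured in raw counts, so its values are very large
population <- runif(n, 1e5, 1e7)
y <- 1 + 2e-7 * population + rnorm(n)

# Same regression with the predictor in raw units and in millions
fit_raw <- lm(y ~ population)
population_millions <- population / 1e6
fit_scaled <- lm(y ~ population_millions)

coef_raw <- coef(summary(fit_raw))["population", ]
coef_scaled <- coef(summary(fit_scaled))["population_millions", ]

# The estimate changes by the rescaling factor, but the t value does not
rbind(raw = coef_raw, millions = coef_scaled)
```

In this simulation the raw-unit estimate would round to 0.000 at three decimal places, even though the predictor clearly matters.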

# Coding a predictor as both categorical and continuous

A linear regression is fit on a group of employed adults, predicting their physical flexibility given age.
Flexibility is defined on a 0-30 scale based on measurements from a series of stretching tasks.
Your model includes age in categories (under 30, 30-44, 45-59, 60+) and also age as a linear predictor.
Sketch a graph of flexibility vs. age, showing what the fitted regression might look like.

## Checking with a simulation
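
A possible simulation, sketched under invented assumptions: the true coefficients, the age range of 20 to 70, and the noise level below are all hypothetical, and `lm` stands in for `stan_glm` to keep the example light.

```r
set.seed(42)
n <- 500
age <- runif(n, 20, 70)
age_group <- cut(age, breaks = c(-Inf, 30, 45, 60, Inf), right = FALSE,
                 labels = c("under30", "30-44", "45-59", "60+"))

# Hypothetical truth: a linear decline with age plus an extra drop in each older group
group_shift <- c(0, -1, -3, -6)[as.integer(age_group)]
flexibility <- 25 - 0.15 * age + group_shift + rnorm(n, 0, 2)
flexibility <- pmin(pmax(flexibility, 0), 30)  # keep values on the 0-30 scale

# Age enters twice: as categories (separate intercepts) and linearly (common slope)
fit <- lm(flexibility ~ age_group + age)
coef(fit)

# Fitted values: parallel line segments with jumps at ages 30, 45, and 60
ord <- order(age)
plot(age, flexibility, col = "grey60",
     xlab = "Age (years)", ylab = "Flexibility (0-30)")
lines(age[ord], fitted(fit)[ord], lwd = 2)
```

The connecting line makes each jump look like a steep segment; on paper the regression function would be drawn as four separate parallel pieces.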

# Logarithmic transformation and regression

Consider the following regression:

$$ \log(\rm{weight}) = - 3.8 + 2.1 \log(\rm{height}) + \rm{error}, $$

with errors that have standard deviation 0.25.
Weights are in pounds and heights are in inches.

## Fill in the blanks

Approximately 68% of people will have weights within a factor of **exp(-0.25)=0.78** and **exp(0.25)=1.28** of their predicted values from the regression.
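
The two factors come from exponentiating one residual standard deviation in each direction. A small sketch of the arithmetic follows; the 66-inch height is a hypothetical example value.

```r
sigma <- 0.25

# One residual sd on the log scale becomes a multiplicative factor on the weight scale
factors <- exp(c(-sigma, sigma))
round(factors, 2)

# Predicted weight for a hypothetical 66-inch-tall adult, with the 68% range around it
predicted <- exp(-3.8 + 2.1 * log(66))
c(lower = predicted * factors[1], predicted = predicted, upper = predicted * factors[2])
```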

## Sketch

Using a pen and paper, sketch the regression line and scatterplot of log(weight) versus log(height) that make sense and are consistent with the fitted model.
Be sure to label the axes of your graph.

# Logarithmic transformations

The folder [`Pollution`](https://github.com/avehtari/ROS-Examples/tree/master/Pollution/) contains mortality rates and various environmental factors from 60 U.S. metropolitan areas (see [McDonald and Schwing, 1973](https://www.tandfonline.com/doi/abs/10.1080/00401706.1973.10489073)).
For this exercise we shall model mortality rate given nitric oxides, sulfur dioxide, and hydrocarbons as inputs.
This model is an extreme oversimplification, as it combines all sources of mortality and does not adjust for crucial factors such as age and smoking.
We use it to illustrate log transformations in regression.

In [14]:
filename <- "./data/Pollution/pollution.csv"

download_if_missing(filename,
                    'https://raw.githubusercontent.com/avehtari/ROS-Examples/master/Pollution/data/pollution.csv')
pollution <- read.csv(filename, header=TRUE)

In [17]:
pollution

prec,jant,jult,ovr65,popn,educ,hous,dens,nonw,wwdrk,poor,hc,nox,so2,humid,mort
<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<dbl>
36,27,71,8.1,3.34,11.4,81.5,3243,8.8,42.6,11.7,21,15,59,59,921.87
35,23,72,11.1,3.14,11.0,78.8,4281,3.5,50.7,14.4,8,10,39,57,997.875
44,29,74,10.4,3.21,9.8,81.6,4260,0.8,39.4,12.4,6,6,33,54,962.354
47,45,79,6.5,3.41,11.1,77.5,3125,27.1,50.2,20.6,18,8,24,56,982.291
43,35,77,7.6,3.44,9.6,84.6,6441,24.4,43.7,14.3,43,38,206,55,1071.289
53,45,80,7.7,3.45,10.2,66.8,3325,38.5,43.1,25.5,30,32,72,54,1030.38
43,30,74,10.9,3.23,12.1,83.9,4679,3.5,49.2,11.3,21,32,62,56,934.7
45,30,73,9.3,3.29,10.6,86.0,2140,5.3,40.4,10.5,6,4,4,56,899.529
36,24,70,9.0,3.31,10.5,83.2,6582,8.1,42.5,12.6,18,12,37,61,1001.902
36,27,72,9.5,3.36,10.7,79.3,4213,6.7,41.0,13.2,12,7,20,59,912.347


## Linearity in Nitric Oxides

Create a scatterplot of mortality rate versus level of nitric oxides.
Do you think linear regression will fit these data well?
Fit the regression and evaluate a residual plot from the regression.

## Transforming the data

Find an appropriate transformation that will result in data more appropriate for linear regression.
Fit a regression to the transformed data and evaluate the new residual plot.

## Interpreting Models on Transformed Data

Interpret the slope coefficient from the model you chose in (b).
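
If the chosen model regresses mortality on log(nox), which is an assumption here, the slope describes the difference in mortality per unit difference in log(nox). Translating that into multiplicative differences in nox is usually easier to read. The slope value and the helper name below are hypothetical.

```r
# Hypothetical slope from a fit of mort ~ log(nox); substitute the estimate from your own fit
slope <- 15

# Expected difference in mortality rate when nox is multiplied by a given factor
mortality_difference <- function(slope, factor) slope * log(factor)

mortality_difference(slope, 1.01)  # nox 1% higher: roughly slope / 100
mortality_difference(slope, 2)     # nox doubled
```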

## Adding Sulfur Dioxide and Hydrocarbons as Predictors

Now fit a model predicting mortality rate using levels of nitric oxides, sulfur dioxide, and hydrocarbons as inputs.
Use appropriate transformations when helpful.
Plot the fitted regression model and interpret the coefficients.

## Cross validate

Fit the model you chose above to the first half of the data and then predict for the second half.
You used all the data to construct the model in (d), so this is not really cross validation, but it gives a sense of how the steps of cross validation can be implemented.
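
The mechanics of the split can be sketched as follows. To stay runnable without the download, this uses a simulated stand-in with the same column names as the pollution data (all values are hypothetical) and `lm` in place of `stan_glm`; the log transformations are one possible choice, not necessarily the model chosen above.

```r
set.seed(7)
n <- 60

# Simulated stand-in for the pollution data (hypothetical values)
sim <- data.frame(nox = exp(rnorm(n, 2.5, 1)),
                  so2 = exp(rnorm(n, 3.5, 1)),
                  hc  = exp(rnorm(n, 3.0, 1)))
sim$mort <- 900 + 10 * log(sim$nox) + 10 * log(sim$so2) - 5 * log(sim$hc) +
  rnorm(n, 0, 50)

# First half for fitting, second half held out
half <- n %/% 2
train <- sim[1:half, ]
holdout <- sim[(half + 1):n, ]

fit_train <- lm(mort ~ log(nox) + log(so2) + log(hc), data = train)
pred_holdout <- predict(fit_train, newdata = holdout)

# Out-of-sample root mean squared error on the held-out half
rmse_holdout <- sqrt(mean((holdout$mort - pred_holdout)^2))
rmse_holdout
```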

# Cross validation comparison of models with different transformations of outcomes

When we compare models with transformed continuous outcomes, we must take into account how the nonlinear transformation warps the continuous variable.
Follow the procedure used to compare models for the mesquite bushes example on page 202.
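
The key step in that procedure is a Jacobian correction: a model for log(y) gives the density of log(y), and the density of y itself is that density multiplied by 1/y. The sketch below checks this identity in base R with hypothetical outcomes and parameters; it is not the full cross validation comparison.

```r
# Hypothetical positive outcomes and a hypothetical normal model for log(y)
y <- c(5, 20, 35, 60, 120)
mu <- 3.2
sigma <- 0.9

# Log density of log(y) under the model fit on the log scale
lpd_log_scale <- dnorm(log(y), mean = mu, sd = sigma, log = TRUE)

# Jacobian correction: subtract log(y) to get the log density of y itself
lpd_y_scale <- lpd_log_scale - log(y)

# The corrected values agree with the lognormal density evaluated at y
all.equal(lpd_y_scale, dlnorm(y, meanlog = mu, sdlog = sigma, log = TRUE))
```

In the notebook setting, the same subtraction would be applied to the pointwise log predictive densities of the log-outcome model before comparing it with the model on the original scale.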

## Earnings under log transformation

Compare models for earnings and for log(earnings) given height and sex as shown on pages 84 and 192.
Use `earnk` and `log(earnk)` as outcomes.

In [16]:
earnings

height,weight,male,earn,earnk,ethnicity,education,mother_education,father_education,walk,exercise,smokenow,tense,angry,age
<int>,<int>,<int>,<dbl>,<dbl>,<fct>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
74,210,1,50000,50.00,White,16,16,16,3,3,2,0,0,45
66,125,0,60000,60.00,White,16,16,16,6,5,1,0,0,58
64,126,0,30000,30.00,White,16,16,16,8,1,2,1,1,29
65,200,0,25000,25.00,White,17,17,,8,1,2,0,0,57
63,110,0,50000,50.00,Other,16,16,16,5,6,2,0,0,91
68,165,0,62000,62.00,Black,18,18,18,1,1,2,2,2,54
63,190,0,51000,51.00,White,17,17,17,3,1,2,4,4,39
64,125,0,9000,9.00,White,15,15,15,7,4,1,4,4,26
62,200,0,29000,29.00,White,12,12,12,2,2,2,0,0,49
73,230,1,32000,32.00,White,17,17,17,7,1,1,0,0,46


## Other examples

Compare models from Exercise 12.6.

# Log-log transformations

Suppose that, for a certain population of animals, we can predict log weight from log height as follows:

* An animal that is 50 centimeters tall is predicted to weigh 10 kg.
* Every increase of 1% in height corresponds to a predicted increase of 2% in weight.
* The weights of approximately 95% of the animals fall within a factor of 1.1 of predicted values.

## Description to model

Give the equation of the regression line and the residual standard deviation of the regression.

## Calculating $R^2$

Suppose the standard deviation of log weights is 20% in this population.
What, then, is the $R^2$ of the regression model described here?
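
One way to work through the arithmetic, as a sketch under stated readings of the description: the slope is taken as approximately 2, "within a factor of 1.1" for 95% of animals is read as two residual standard deviations on the log scale, and a standard deviation of 20% is read as 0.20 on the log scale.

```r
# Slope: a 1% increase in height goes with roughly a 2% increase in weight
slope <- 2

# Intercept: the line passes through (log(50 cm), log(10 kg))
intercept <- log(10) - slope * log(50)

# About 95% of weights within a factor of 1.1: two residual sds equal log(1.1)
sigma <- log(1.1) / 2

# R^2 = 1 - residual variance / total variance of log weights
r_squared <- 1 - sigma^2 / 0.20^2

round(c(intercept = intercept, slope = slope, sigma = sigma, r_squared = r_squared), 3)
```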

# Linear and logarithmic transformations

For a study of congressional elections, you would like a measure of the relative amount of money raised by each of the two major-party candidates in each district.
Suppose that you know the amount of money raised by each candidate; label these dollar values $D_i$ and $R_i$.
You would like to combine these into a single variable that can be included as an input variable in a model predicting vote share for the Democrats.
Discuss the advantages and disadvantages of the following measures:

1. The simple difference, $D_i - R_i$
2. The ratio, $D_i/R_i$
3. The difference on the logarithmic scale, $\log D_i - \log R_i$
4. The relative proportion, $D_i / (D_i + R_i)$.
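
To get a feel for how the four measures behave, they can be computed side by side. The fundraising totals and the helper name `measures` below are hypothetical, chosen only to illustrate symmetry and scale.

```r
# Hypothetical fundraising totals in dollars for three districts
D <- c(1e6, 2e5, 5e4)
R <- c(5e5, 2e5, 1e6)

# Hypothetical helper that computes all four candidate measures
measures <- function(D, R) {
  data.frame(difference = D - R,
             ratio = D / R,
             log_difference = log(D) - log(R),
             proportion = D / (D + R))
}

measures(D, R)

# Swapping the parties flips the sign of the difference and the log difference,
# inverts the ratio, and maps the proportion p to 1 - p
measures(R, D)
```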

# Special-purpose transformations

For the congressional elections example in the previous exercise, propose an idiosyncratic transformation as in the example on page 196 and discuss the advantages and disadvantages of using it as a regression input.

# Elasticity

An economist runs a regression examining the relation between the average price of cigarettes, $P$, and the quantity purchased, $Q$, across a large sample of counties in the United States, assuming the functional form, $\log Q = \alpha + \beta \log P$.
Suppose the estimate for $\beta$ is 0.3.
Interpret this coefficient.
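
In a log-log model the coefficient acts as an elasticity: multiplying $P$ by some factor multiplies the predicted $Q$ by that factor raised to the power $\beta$. A short sketch of that arithmetic, with a hypothetical helper name:

```r
beta <- 0.3

# Multiplicative change in predicted Q when P is multiplied by a given factor
quantity_factor <- function(price_factor, beta) price_factor^beta

# Percent difference in predicted quantity for a 1% and a 10% price difference
100 * (quantity_factor(c(1.01, 1.10), beta) - 1)
```

For small differences the percent change in $Q$ is close to $\beta$ times the percent change in $P$; the approximation loosens as the price difference grows.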

# Sequence of regressions

Find a regression problem that is of interest to you and can be performed repeatedly (for example, data from several years, or for several countries).
Perform a separate analysis for each year, or country, and display the estimates in a plot as in Figure 10.9.
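
The general pattern is to split the data by year or country, fit the same model to each piece, and collect the estimates for plotting. The sketch below uses simulated data (hypothetical years, slopes, and noise) and `lm` in place of `stan_glm`.

```r
set.seed(2024)

# Simulated stand-in: the same regression problem repeated over several years
years <- 2015:2020
sim <- do.call(rbind, lapply(years, function(yr) {
  x <- rnorm(100)
  data.frame(year = yr, x = x,
             y = 1 + (0.5 + 0.05 * (yr - 2015)) * x + rnorm(100))
}))

# Fit a separate regression per year; keep the slope estimate and its standard error
estimates <- do.call(rbind, lapply(split(sim, sim$year), function(d) {
  s <- coef(summary(lm(y ~ x, data = d)))
  data.frame(year = d$year[1],
             estimate = s["x", "Estimate"],
             se = s["x", "Std. Error"])
}))

# Plot the estimates across years with +/- 1 standard error bars
plot(estimates$year, estimates$estimate, pch = 19,
     ylim = range(estimates$estimate - estimates$se,
                  estimates$estimate + estimates$se),
     xlab = "Year", ylab = "Slope estimate")
segments(estimates$year, estimates$estimate - estimates$se,
         estimates$year, estimates$estimate + estimates$se)
```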

# Building regression models

Return to the teaching evaluations data from Exercise 10.6.
Fit regression models predicting evaluations given many of the inputs in the dataset.
Consider interactions, combinations of predictors, and transformations, as appropriate.
Consider several models, discuss in detail the final model that you choose, and also explain why you chose it rather than the others you had considered.

In [18]:
filename <- "./data/Beauty/beauty.csv"

download_if_missing(filename,
                    'https://raw.githubusercontent.com/avehtari/ROS-Examples/master/Beauty/data/beauty.csv')
beauty <- read.csv(filename)

beauty

eval,beauty,female,age,minority,nonenglish,lower,course_id
<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>
4.3,0.2015666,1,36,1,0,0,3
4.5,-0.8260813,0,59,0,0,0,0
3.7,-0.6603327,0,51,0,0,0,4
4.3,-0.7663125,1,40,0,0,0,2
4.4,1.4214450,1,31,0,0,0,0
4.2,0.5002196,0,62,0,0,0,0
4.0,-0.2143501,1,33,0,0,0,4
3.4,-0.3465390,1,51,0,0,0,0
4.5,0.0613435,1,33,0,0,0,0
3.9,0.4525679,0,47,0,0,0,4


# Prediction from a fitted regression

Consider one of the fitted models for mesquite leaves, for example `fit_4`, in Section 12.6.
Suppose you wish to use this model to make inferences about the average mesquite yield in a new set of trees whose predictors are in a data frame called `new_trees`.
Give R code to obtain an estimate and standard error for this population average.

You do not need to make the prediction; just give the code.

In [21]:
?read.table

In [31]:
filename <- "./data/Mesquite/mesquite.dat"

download_if_missing(filename,
                    'https://raw.githubusercontent.com/avehtari/ROS-Examples/master/Mesquite/data/mesquite.dat')
mesquite <- read.table(filename, header=TRUE)

mesquite <- mesquite %>%
    mutate(canopy_volume = diam1 * diam2 * canopy_height,
           canopy_area = diam1 * diam2,
           canopy_shape = diam1 / diam2)

In [32]:
fit_4 <- stan_glm(formula=log(weight) ~ log(canopy_volume) + log(canopy_area) + log(canopy_shape) +
                                             log(total_height) + log(density) + group,
                  data=mesquite, refresh=0)

In [34]:
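# Resample 20 rows (trees) with replacement; sample() on a data frame would draw columns
new_trees <- mesquite[sample(nrow(mesquite), 20, replace=TRUE), ]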
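
One possible way to summarize the average yield, sketched with a hypothetical helper `summarize_average_yield`. It assumes a matrix of posterior predictive draws of log(weight), with one row per draw and one column per tree, which is the shape `posterior_predict(fit_4, newdata=new_trees)` returns. Simulated draws are used here so the example runs without fitting the model.

```r
# Hypothetical helper: rows are posterior draws, columns are trees, values are log(weight)
summarize_average_yield <- function(log_pred_draws) {
  # Back-transform to the weight scale, then average over trees within each draw
  avg_by_draw <- rowMeans(exp(log_pred_draws))
  # Median and MAD sd of the per-draw averages serve as estimate and standard error
  c(estimate = median(avg_by_draw), se = mad(avg_by_draw))
}

# Simulated stand-in for posterior_predict(fit_4, newdata = new_trees)
set.seed(11)
fake_draws <- matrix(rnorm(4000 * 20, mean = 6, sd = 0.4), nrow = 4000, ncol = 20)
summarize_average_yield(fake_draws)
```

With the fitted model in hand, the call would be `summarize_average_yield(posterior_predict(fit_4, newdata=new_trees))`. Averaging inside each draw, rather than averaging point predictions, is what carries the uncertainty through to the population average.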

# Models for regression coefficients

Using the Portuguese student data from the [`Student`](https://github.com/avehtari/ROS-Examples/tree/master/Student) folder, repeat the analyses in Section 12.7 with the same predictors, but using as outcome the Portuguese language grade rather than the mathematics grade.

In [36]:
filename <- "./data/Student/student-merged.csv"

download_if_missing(filename,
                    'https://raw.githubusercontent.com/avehtari/ROS-Examples/master/Student/data/student-merged.csv')
student <- read.csv(filename)

student

G1mat,G2mat,G3mat,G1por,G2por,G3por,school,sex,age,address,famsize,Pstatus,Medu,Fedu,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences
<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
7,10,10,13,13,13,0,0,15,0,0,1,1,1,2,4,1,1,1,1,1,1,1,1,0,3,1,2,1,1,1,2
8,6,5,13,11,11,0,0,15,0,0,1,1,1,1,2,2,1,1,0,0,0,1,1,1,3,3,4,2,4,5,2
14,13,13,14,13,12,0,0,15,0,0,1,2,2,1,1,0,1,1,1,1,1,1,0,0,4,3,1,1,1,2,8
10,9,8,10,11,10,0,0,15,0,0,1,2,4,1,3,0,1,1,1,1,1,1,1,0,4,3,2,1,1,5,2
10,10,10,13,13,13,0,0,15,0,0,1,3,3,2,3,2,0,1,1,1,1,1,1,1,4,2,1,2,3,3,8
12,12,11,11,12,12,0,0,15,0,0,1,3,4,1,3,0,1,1,1,1,1,1,1,0,4,3,2,1,1,5,2
12,0,0,10,11,12,0,0,15,0,0,1,3,4,2,3,2,0,1,0,0,1,1,1,1,4,2,2,2,2,5,0
8,9,8,11,10,11,0,0,15,0,1,1,2,2,2,2,0,1,1,1,0,1,1,1,0,4,1,3,1,3,4,2
16,16,16,15,15,15,0,0,15,0,1,1,3,1,2,4,0,0,1,0,0,0,1,1,0,4,4,2,2,3,3,12
10,11,11,10,10,10,0,0,15,1,0,0,3,3,1,4,0,1,0,0,0,1,1,0,0,4,3,3,1,1,4,10


# Applying ideas of regression

Read a published article that uses regression modeling and is on a topic of interest to you.
Write a page or two evaluating and criticizing the article, addressing issues discussed in Chapters 1-12, such as measurement, data visualization, modeling, inference, simulation, regression, assumptions, model checking, interactions, and transformations.
The point of this exercise is not to come up with a comprehensive critique of the article but rather to review the key points of this book so far in the context of a live example.

Perhaps [Pierre Chandon and Brian Wansink, “When Are Stockpiled Products Consumed Faster? A Convenience-Salience Framework of Postpurchase Consumption Incidence and Quantity,” *Journal of Marketing Research* 39, no. 3 (August 2002): 321–35.](https://faculty.insead.edu/pierre-chandon/documents/Article-When%20are%20stockpiled%20products%20consumed%20faster%20-%20A%20convenience%20salience%20framework%20of%20post%20purchase%20consumption%20incidence%20and%20quantity.pdf)

Or [Determinants of Store Price Elasticity](https://www.researchgate.net/profile/Peter-Rossi-4/publication/237130017_Determinants_of_Store-Level_Price_Elasticity/links/55e089cc08ae2fac471bea66/Determinants-of-Store-Level-Price-Elasticity.pdf)

[Price Elasticity of SA Electricity (Hyndman)](https://www.monash.edu/business/ebs/our-research/publications/ebs/wp16-10.pdf) hmm..

[Ebook Price Elasticity](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.688.6731&rep=rep1&type=pdf)

[Happiness and Income (Kahneman)](https://www.pnas.org/content/pnas/107/38/16489.full.pdf?source=post_page---------------------------)