# Descriptive

- Summaries
- Tables

# Statistics

- Models
- Output (`stargazer`)

# Summaries

`group_by()`
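
As a minimal sketch of a grouped summary, the example below assumes the `dplyr` package is installed and uses a small made-up data frame (`toy_data`, not part of the original material) in place of `ess_data`. `group_by()` marks the grouping variable, and `summarise()` then returns one row per group.

```r
library(dplyr)

# Hypothetical stand-in for ess_data; the values are made up for illustration
toy_data <- data.frame(
  gndr   = c("Male", "Female", "Male", "Female", "Male", "Female"),
  height = c(180, 165, 175, 170, 185, 160)
)

# One row per group, with the group size and the mean height
height_by_gndr <- toy_data %>%
  group_by(gndr) %>%
  summarise(
    n           = n(),
    mean_height = mean(height, na.rm = TRUE)
  )

height_by_gndr
```

The same pattern should carry over to `ess_data` by swapping in the data frame and column names.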

# Statistical models

There are many R packages for fitting statistical models, including packages for all kinds of specialized analyses.

A recurring element across many of these packages and functions, however, is that the model is specified as a formula.

Formulas are specified as:
- `y ~ x1 + x2 + ... + xn`, with the outcome to the left of `~` and one or more predictors to the right
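
A formula is itself an R object, so it can be stored, inspected and modified before it is passed to a modelling function. The sketch below uses only base R; the variable names are borrowed from the examples in this section, and no data is needed to build the formula.

```r
# A formula is an ordinary R object and can be stored in a variable
f <- bmi ~ weight + height

class(f)      # "formula"
all.vars(f)   # names of the variables used in the formula

# update() adds or removes terms; "." stands for the existing side of the formula
f_extended <- update(f, . ~ . + yrbrn)
f_reduced  <- update(f, . ~ . - height)

f_extended
f_reduced
```

A stored formula can then be reused, for example as `lm(f, ess_data)`.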


The code below fits a linear model of weight on year of birth (`yrbrn`):

In [60]:
#Linear model for weight and yrbrn
lm(weight ~ yrbrn, ess_data)


Call:
lm(formula = weight ~ yrbrn, data = ess_data)

Coefficients:
(Intercept)        yrbrn  
   44.27414      0.01624  


In [62]:
#Multiple linear regression: bmi on weight and height
lm(bmi ~ weight + height, ess_data)


Call:
lm(formula = bmi ~ weight + height, data = ess_data)

Coefficients:
(Intercept)       weight       height  
    50.1059       0.3318      -0.2889  


An advantage of R is that a fitted model can be stored like any other object, which makes it easy to keep and recall past results.

In [63]:
#Storing model
bmi_model <- lm(bmi ~ weight + height, ess_data)

In [64]:
#Summary statistics for bmi_model
summary(bmi_model)


Call:
lm(formula = bmi ~ weight + height, data = ess_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1250 -0.1842  0.0199  0.1593  4.1995 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 50.105861   0.341344   146.8   <2e-16 ***
weight       0.331774   0.001376   241.2   <2e-16 ***
height      -0.288922   0.002197  -131.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4993 on 737 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.9875,	Adjusted R-squared:  0.9875 
F-statistic: 2.915e+04 on 2 and 737 DF,  p-value: < 2.2e-16
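
Because the model is an object, individual results can be pulled out of it rather than read off the printed summary. The sketch below shows a few base R accessors. Since `ess_data` is not available here, it uses the built-in `mtcars` data as a stand-in, and the names `car_model` and `new_cars` are only illustrative.

```r
# mtcars ships with R and is used here only as a stand-in for ess_data
car_model <- lm(mpg ~ wt + hp, data = mtcars)

# Named vector of coefficient estimates
coef(car_model)

# The summary is also an object with components that can be extracted
model_summary <- summary(car_model)
model_summary$r.squared      # R-squared as a single number
model_summary$coefficients   # matrix of estimates, std. errors, t and p values

# Predictions for new observations; column names must match the predictors
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
predicted_mpg <- predict(car_model, newdata = new_cars)
predicted_mpg
```

The same calls should work on `bmi_model`, for example `coef(bmi_model)`.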


## Models and categorical variables

When working with categorical variables in R, almost everything about how a variable is treated in a model should be specified *before* fitting the model.

- Should the variable be treated as ordered (ordinal) or unordered (nominal)?
- What value should be used as reference/base?
- Is the ordinal variable to be used as an interval variable?
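
Each of the three questions above maps onto a base R operation on the variable itself. The sketch below illustrates them on a small made-up vector (`health`, standing in for the column in `ess_data`); the level names follow the ones used later in this section.

```r
# Hypothetical stand-in for the health column in ess_data
health <- c("Good", "Fair", "Very good", "Bad", "Good", "Very bad")
health_levels <- c("Very bad", "Bad", "Fair", "Good", "Very good")

# Nominal (unordered): levels default to alphabetical order,
# and the first level is the reference category in a model
health_nominal <- factor(health)
levels(health_nominal)

# Choose a different reference category
health_ref_good <- relevel(health_nominal, ref = "Good")
levels(health_ref_good)

# Ordinal (ordered): state the order of the levels explicitly
health_ordinal <- factor(health, levels = health_levels, ordered = TRUE)

# Interval: convert the ordered levels to their integer codes (1 to 5)
health_score <- as.numeric(health_ordinal)
health_score
```

Treating the integer codes as an interval variable assumes equal spacing between adjacent categories, which is a modelling decision rather than something R can check.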


In [65]:
#Linear model with categorical (2 values)
lm(height ~ yrbrn + gndr, ess_data)


Call:
lm(formula = height ~ yrbrn + gndr, data = ess_data)

Coefficients:
(Intercept)        yrbrn     gndrMale  
   -50.2225       0.1104      12.7617  


In [67]:
#Linear model with ordinal
ess_data$healthcat <- factor(ess_data$health, levels = c('Very bad', 'Bad', 'Fair', 'Good', 'Very good'), ordered = TRUE)

summary(lm(height ~ yrbrn + healthcat, ess_data))


Call:
lm(formula = height ~ yrbrn + healthcat, data = ess_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-26.8004  -6.7797  -0.1917   6.4317  30.0358 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -61.37901   36.03229  -1.703   0.0889 .  
yrbrn         0.11893    0.01836   6.478 1.69e-10 ***
healthcat.L   4.11147    2.31731   1.774   0.0764 .  
healthcat.Q  -1.34712    1.99509  -0.675   0.4997    
healthcat.C   2.02962    1.51963   1.336   0.1821    
healthcat^4  -2.09098    1.03979  -2.011   0.0447 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.311 on 742 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.07674,	Adjusted R-squared:  0.07052 
F-statistic: 12.34 on 5 and 742 DF,  p-value: 1.644e-11
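
The `.L`, `.Q`, `.C` and `^4` terms in the output above appear because R by default codes ordered factors with orthogonal polynomial contrasts (`contr.poly`), giving linear, quadratic, cubic and fourth-degree trend terms instead of one dummy per level. The sketch below, using base R only, shows the contrast matrix and one possible way to switch an ordered factor to treatment (dummy) coding. The `healthcat` vector here is a made-up stand-in, not the column from `ess_data`.

```r
# Default contrasts: first entry for unordered factors, second for ordered ones
options("contrasts")

# Contrast matrix for a five-level ordered factor:
# columns .L, .Q, .C and ^4 are the linear, quadratic, cubic and quartic trends
poly_contrasts <- contr.poly(5)
round(poly_contrasts, 3)

# To get one dummy per level instead (compared with the first level),
# an ordered factor can be given treatment contrasts explicitly
health_levels <- c("Very bad", "Bad", "Fair", "Good", "Very good")
healthcat <- factor(health_levels, levels = health_levels, ordered = TRUE)
contrasts(healthcat) <- contr.treatment(5)
contrasts(healthcat)
```

Both codings describe the same fitted model, which is why the residuals and R-squared are the same whether `health` enters as an ordered or an unordered factor; only the interpretation of the individual coefficients changes.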


In [68]:
#Linear model with nominal (character as factor)
summary(lm(height ~ yrbrn + health, ess_data))


Call:
lm(formula = height ~ yrbrn + health, data = ess_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-26.8004  -6.7797  -0.1917   6.4317  30.0358 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -60.03582   35.99289  -1.668   0.0957 .  
yrbrn             0.11893    0.01836   6.478 1.69e-10 ***
healthFair       -2.12265    1.69675  -1.251   0.2113    
healthGood        0.03304    1.61819   0.020   0.9837    
healthVery bad   -5.55532    3.82994  -1.450   0.1473    
healthVery good   0.92896    1.61690   0.575   0.5658    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.311 on 742 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.07674,	Adjusted R-squared:  0.07052 
F-statistic: 12.34 on 5 and 742 DF,  p-value: 1.644e-11


## Output a model

In [69]:
library(stargazer)


Please cite as: 

 Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.2. https://CRAN.R-project.org/package=stargazer 

In [70]:
height_model <- lm(height ~ yrbrn + health, ess_data)
stargazer(height_model, type = "html", out = "../output/modelout.html")


<table style="text-align:center"><tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td><em>Dependent variable:</em></td></tr>
<tr><td></td><td colspan="1" style="border-bottom: 1px solid black"></td></tr>
<tr><td style="text-align:left"></td><td>height</td></tr>
<tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">yrbrn</td><td>0.119<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.018)</td></tr>
<tr><td style="text-align:left"></td><td></td></tr>
<tr><td style="text-align:left">healthFair</td><td>-2.123</td></tr>
<tr><td style="text-align:left"></td><td>(1.697)</td></tr>
<tr><td style="text-align:left"></td><td></td></tr>
<tr><td style="text-align:left">healthGood</td><td>0.033</td></tr>
<tr><td style="text-align:left"></td><td>(1.618)</td></tr>
<tr><td style="text-align:left"></td><td></td></tr>
<tr><td style="text-align:left">healthVery bad</td><td>-5.555</td><