# EEP/IAS 118 - Section 2

## Manipulating Data and More Regression Output

### July 1, 2021

Today's section will help us get familiar with some more __R__ functions, improve our ability to manipulate and summarize data, and spend some time getting more comfortable with our regression output.


For today, let's load a few packages and read in a dataset on sleep quality and time allocation for 706 individuals. This dataset is saved to the section folder as `sleep75.dta`. 

In [None]:
library(tidyverse)
library(haven)
sleepdata <- read_dta("sleep75.dta")

## Grouping Data

Sometimes we may want to group our data by values of certain variables. One way we know how to do this is by creating subsets using `filter()`. Let's say we wanted to split our sample on whether the individual is in a union (`union = 1`) and get the mean minutes slept per week (the `sleep` variable):

In [None]:
union <- filter(sleepdata, union == 1)
nonunion <- filter(sleepdata, union == 0)

summarize(union, mean = mean(sleep), union = max(union))
summarize(nonunion, mean = mean(sleep), union = max(union))

Notice that we have to use two equals signs, `==`, in the condition of the `filter` function when we want to filter on a specific value of the data.
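To see the difference between `==` and `=` in isolation, here is a minimal sketch. The `toy` data frame is a hypothetical stand-in with made-up values (it is not the `sleep75.dta` data), and the object names are only for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for sleepdata (values are made up)
toy <- tibble(
  union = c(1, 0, 0, 1, 0),
  sleep = c(3000, 3300, 2900, 3100, 3500)
)

# `==` tests equality row by row and returns TRUE or FALSE for each row
toy$union == 1

# filter() keeps only the rows where the condition is TRUE
union_members <- filter(toy, union == 1)
union_members

# A single `=` is read as a named argument, so filter() stops with an error;
# try() catches that error so the rest of the cell can keep running
single_equals <- try(filter(toy, union = 1), silent = TRUE)
inherits(single_equals, "try-error")
```

If we forget the second equals sign, the error message from `filter()` is usually a good hint about what went wrong.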

More efficiently, we can group data using __tidyverse__'s `group_by()` function. The function takes two arguments: first the name of the data, second the variable whose values we want to group on: `group_by(data, varname)`

Let's practice by grouping our sleep data by whether the sleeper is African American or not (the `black` variable).

In [None]:
head(sleepdata)

sleep_group <- group_by(sleepdata, black)
head(sleep_group)

Grouping the data does nothing to its appearance. Instead, it changes how the data behaves when we use other functions from __tidyverse__.

In [None]:
summarize(sleepdata, mean_sleep = mean(sleep), max_sleep = max(sleep), count = n())

summarize(sleep_group, mean_sleep = mean(sleep), max_sleep = max(sleep), count = n())

We can see that when we run `summarize()` on our grouped data, it produces a separate set of summary statistics for each level of the grouping variable, in this case both levels of our variable `black`. We can see that there are 35 participants who identify as black and 671 of other ethnicities, which together make up the total of 706 we see from the ungrouped data.
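Since a grouped data frame looks the same as an ungrouped one, it can help to ask R directly whether a data frame is grouped. The sketch below uses a few __dplyr__ helper functions for this. The `toy` data is hypothetical (made-up values, not `sleep75.dta`).

```r
library(dplyr)

# Hypothetical toy data standing in for sleepdata (values are made up)
toy <- tibble(
  black = c(0, 0, 1, 0, 1, 0),
  sleep = c(3100, 3400, 2900, 3300, 3000, 3600)
)
toy_group <- group_by(toy, black)

# Grouping leaves the rows and columns unchanged
identical(dim(toy), dim(toy_group))

# ...but the grouped version carries extra grouping information
is_grouped_df(toy_group)   # is the data grouped?
group_vars(toy_group)      # which variable(s) is it grouped on?
n_groups(toy_group)        # how many groups are there?

# ungroup() removes the grouping when we no longer want it
toy_ungrouped <- ungroup(toy_group)
is_grouped_df(toy_ungrouped)
```

Ungrouping is worth remembering: a data frame stays grouped until we ungroup it, so later calls to `summarize()` would otherwise keep reporting results by group.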

## Pipes

As we start wanting to generate more specific summary statistics that require multiple coding steps, it can get tedious (and memory-intensive) to constantly have to assign objects to memory in each intermediate step.

For example, if we were interested in altering our sleep variable to measure hours slept per night and then wanted to obtain summary statistics by whether individuals are in good or excellent health (`gdhlth = 1`), we could do it in the following way:


In [None]:
sleepdata <- mutate(sleepdata, hrs_night = sleep/(7*60))
sleepdata_goodhealth <- filter(sleepdata, gdhlth == 1)
sleepdata_poorhealth <- filter(sleepdata, gdhlth == 0)

summarize(sleepdata_poorhealth, mean_hours = mean(hrs_night), min_hours = min(hrs_night), max = max(hrs_night), count_badhealth = n())
summarize(sleepdata_goodhealth, mean_hours = mean(hrs_night), min_hours = min(hrs_night), max = max(hrs_night), count_goodhealth = n())

To get summary statistics on hours slept per night for each of the good and poor health groups, we had to use `filter()` to subset the data on health quality, store those subsets in memory, and then generate summary statistics for each subset individually.

`tidyverse` has a fantastic alternative that helps us skip these intermediate steps: a pipe `%>%`. The way the pipe (`%>%`) works is by taking the output from one expression and plugging it into the next expression (defaulting to the first argument in the second expression). For instance, we could rewrite the above code using pipes in fewer lines and without having to store our intermediate data:

In [None]:
sleepdata %>%
    mutate(hrs_night = sleep/(7*60)) %>%
    group_by(gdhlth) %>%
    summarize(mean_hours = mean(hrs_night), min_hours = min(hrs_night), max = max(hrs_night), count = n())

This gives us the same output in fewer steps and without storing anything to memory. What the pipe is doing here is:
1. Telling `mutate` that its first argument should be what we're piping to it, our object _sleepdata_, and creating a new variable `hrs_night` that measures hours slept per night (minutes per week divided by 7 days/week and divided by 60 min/hr)
2. Taking the mutated version of sleepdata and grouping it by our good health variable
3. Summarizing the grouped data, reporting mean/min/max hours per night and the total number in each group.
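One way to convince ourselves of what the pipe does is to write the same steps as nested function calls and compare. This is a minimal sketch on hypothetical toy data (made-up values, not `sleep75.dta`); the names `nested` and `piped` are just for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for sleepdata (values are made up)
toy <- tibble(
  gdhlth = c(1, 1, 0, 1, 0),
  sleep  = c(3360, 2940, 3780, 3150, 2520)
)

# Without pipes: nested calls, which have to be read from the inside out
nested <- summarize(
  group_by(mutate(toy, hrs_night = sleep / (7 * 60)), gdhlth),
  mean_hours = mean(hrs_night),
  count = n()
)

# With pipes: each result becomes the first argument of the next function
piped <- toy %>%
  mutate(hrs_night = sleep / (7 * 60)) %>%
  group_by(gdhlth) %>%
  summarize(mean_hours = mean(hrs_night), count = n())

# Both approaches should produce the same summary table
all.equal(as.data.frame(nested), as.data.frame(piped))
```

The two versions do the same work; the piped version simply lets us read the steps top to bottom in the order they happen.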

One quick note: if we wanted to use a pipe for a number of steps and then save the resulting object to memory, we can do that! As long as you add `[name] <-` before the object at the top of the pipe, the result at the end of all the pipes will be saved to memory.

For example, we can add our hours slept per night variable using a pipe.

In [None]:
sleepdata <- sleepdata %>%
    mutate(hrs_night = sleep/(7*60))

head(sleepdata)

We could also subset the data for those not in a union, keep only a few variables of interest, and then arrange the subset by hours slept:

In [None]:
subset <- sleepdata %>%
    filter(union == 0) %>%
    select(hrs_night, union, gdhlth, age, exper) %>%
    arrange(hrs_night)

head(subset)

### Practice with Grouping and Pipes

We want to know the average hours slept per night for everyone under age 30 in our sample. We feel the mean will be more informative if we can see the average hours slept per night by year of age. 

Report the mean of hours slept per night by ages 23, 24, 25, 26, 27, 28, and 29.

## More Regression Output

We've seen how to use the `lm()` function to run our regressions, but sometimes we want more than just the coefficients. Let's take a look at the other information stored by `lm` after running a regression. Let's start by running the following model:

$$ \widehat{hrs/{night}} = \hat\beta_0 + \hat\beta_1 totwrk + \hat\beta_2 age + \hat\beta_3 male $$

In [None]:
sleepmodel <- lm(hrs_night ~ totwrk + age + male, data = sleepdata)
sleepmodel

Our model predicts that, holding the other variables fixed, an additional year of age is associated with an additional 0.0065 hours slept per night on average, and that identifying as male is associated with 0.2 more predicted hours slept per night.

However, we can get more details by using the `summary()` function on our lm object.

In [None]:
summary(sleepmodel)

`summary` gives us a ton more info:
* First, we see the formula at the top
* Next comes information about our residuals
* Then we obtain our coefficient estimates, their estimated standard errors, the corresponding t values for a test that the true parameter equals zero, and the corresponding p-values (we'll chat about these later in the class)
* Finally, we obtain the residual standard error with its degrees of freedom, R-squared, and a model F-test.
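The pieces of the `summary()` output can also be pulled out individually rather than read off the printed table. Here is a minimal sketch, assuming a hypothetical toy data frame with made-up values (not `sleep75.dta`) and a smaller model than the one above.

```r
# Hypothetical toy data (made-up values), not the sleep75 data
toy <- data.frame(
  hrs_night = c(8.0, 7.0, 9.0, 7.5, 6.0, 8.5, 7.2, 6.8),
  totwrk    = c(2000, 2600, 1200, 2300, 3000, 1500, 2400, 2800),
  age       = c(25, 40, 31, 52, 45, 29, 38, 60)
)

toy_model   <- lm(hrs_night ~ totwrk + age, data = toy)
toy_summary <- summary(toy_model)

# Coefficient table: estimates, standard errors, t values, and p-values
toy_summary$coefficients

# A single number from that table, picked out by row and column name
totwrk_estimate <- toy_summary$coefficients["totwrk", "Estimate"]
totwrk_estimate

# R-squared and the residual degrees of freedom (observations minus parameters)
toy_summary$r.squared
toy_model$df.residual
```

Extracting values this way is handy when we want to reuse a number in a later calculation instead of retyping it from the printed output.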

We can also take a look at all the different objects stored within `sleepmodel`: these can all be accessed using `$`, like variables in a data frame. First, we can obtain all our residuals using `$residuals`.

In [None]:
head(sleepmodel$residuals)

Fitted values are stored as `$fitted.values`:

In [None]:
head(sleepmodel$fitted.values)
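Residuals and fitted values are linked: each residual is the actual value of the outcome minus its fitted value. The sketch below checks this on a hypothetical toy data frame (made-up values, not `sleep75.dta`); the names `toy_model` and `manual_resid` are only for illustration.

```r
# Hypothetical toy data (made-up values), not the sleep75 data
toy <- data.frame(
  hrs_night = c(8.0, 7.0, 9.0, 7.5, 6.0, 8.5),
  totwrk    = c(2000, 2600, 1200, 2300, 3000, 1500)
)

toy_model <- lm(hrs_night ~ totwrk, data = toy)

# Residual = actual value minus fitted value
manual_resid <- toy$hrs_night - toy_model$fitted.values

# Compare with the residuals stored in the lm object (names dropped first)
all.equal(unname(manual_resid), unname(toy_model$residuals))

# With an intercept in the model, OLS residuals sum to zero (up to rounding)
sum(toy_model$residuals)
```

The functions `residuals()` and `fitted()` return the same values as `$residuals` and `$fitted.values`, if you prefer function calls to `$`.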

In future sections we'll go over ways to get professional-looking regression tables.