# EEP/IAS 118 - Section 3

## Manipulating (more) Data, Attractive Figures, and Practice Problems!

### July 8, 2021

Today's coding portion of the section introduces a few packages that will help us improve the quality of our output tables and figures.

In [None]:
library(tidyverse)
library(haven)
library(xtable)
sleepdata <- read_dta("sleep75.dta")

## Working with Indexes

We've seen how to manipulate datasets by adding in variables or removing certain observations, but what if we want to obtain one element/a set of elements from a known location? 

### Vectors
Let's start by working with a vector:

In [None]:
vec <- rnorm(10, mean =4, sd = 2)
vec

We created a vector of length 10 of random draws from a N(4,4) distribution. Now if we were interested in getting just the third element of this vector, we can do that like so:

In [None]:
vec[3]

The `[]` lets __R__ know that you want to select on position, while the 3 is our instruction for which position to pull from. 

(note that since we're working with a vector and not a dataframe, we can't use `$` to call a certain column). 

If we were interested in elements 5 through 7, we can pull them with the use of `:`

In [None]:
vec[5:7]

Finally, if we wanted to pull the first, fourth, and ninth elements we can do that using `c()`:

In [None]:
vec[c(1,4,9)]

What `c()` is doing is combining all the elements given to it into a vector themselves. We can see that by running it on its own.

In [None]:
newvec <- c(30,34,38,42)
newvec
is.vector(newvec)
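
Position is not the only way to index a vector. As a small sketch (the vector `scores` and its values are made up for illustration), negative positions drop elements, and a logical condition keeps only the elements where it is `TRUE`:

```r
# Hypothetical example vector (values are made up for illustration)
scores <- c(30, 34, 38, 42, 46)

# Negative positions drop elements: everything except the second
dropped <- scores[-2]

# A logical condition keeps the elements where it is TRUE
big <- scores[scores > 35]

# which() returns the positions that satisfy the condition
pos <- which(scores > 35)

dropped
big
pos
```

Logical indexing is the vector-level version of what `filter()` does for rows of a data frame.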

### Matrices and Data Frames

What happens when we are working with multidimensional objects? Largely the same thing! Now we just need to refer to position by specifying `[row#, column#]`. The process is the same whether we're working with a matrix or a data frame.

In [None]:
# make a matrix
mat40 <- matrix(1:40, nrow = 4, ncol = 10)
mat40
is.matrix(mat40)

# Get the first element (1)
mat40[1,1]

# Get the element from the 3rd row and 6th column
mat40[3,6]

# Get the fifth, sixth, and seventh elements from the 2nd row
mat40[2, 5:7]

# Get all of column five
mat40[, 5]

# Get all of row four
mat40[4,]

# Get the fifth, sixth, and seventh elements from the first AND 2nd rows
mat40[1:2, 5:7]

# Get the first and fourth elements from the third row
mat40[3,c(1,4)]

We have a bunch of flexibility here to call one element or multiple elements at the same time, the only restriction being that we follow the `[row#, col#]` syntax.
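
One detail worth checking for yourself: `matrix()` fills column by column, which is why `mat40[3, 6]` returns 23 rather than 26. The sketch below rebuilds the same matrix (under the illustrative names `m` and `m_byrow`) and also shows the `drop = FALSE` argument, which keeps a single row as a matrix instead of simplifying it to a vector.

```r
# Same matrix as mat40: filled column by column (the default)
m <- matrix(1:40, nrow = 4, ncol = 10)
m[3, 6]

# Filling row by row instead puts a different value in the same position
m_byrow <- matrix(1:40, nrow = 4, ncol = 10, byrow = TRUE)
m_byrow[3, 6]

# Pulling one row simplifies to a plain vector by default...
row4_vec <- m[4, ]
is.matrix(row4_vec)

# ...unless we ask R not to drop the extra dimension
row4_mat <- m[4, , drop = FALSE]
dim(row4_mat)
```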

The process for data frames is pretty similar, albeit with one extension. Now that we have variables, we can combine a position call with the `$` for a specific variable.

In [None]:

sleepdf <- sleepdata %>%
    select(age, educ, exper, hrwage)
head(sleepdf)
nrow(sleepdf)
ncol(sleepdf)
dim(sleepdf)

is.data.frame(sleepdf)

# Get the first row
sleepdf[1,]

# Get the head of the age variable 
head(sleepdf$age)

# Get the fourth row element of column 4 (hrwage)
sleepdf[4,4]

# Alternatively, we can do the same thing by referring to the specific variable/column
sleepdf$hrwage[4]

Note that when we use the `$` to call a specific variable, __R__ now treats that variable as a vector, so we can refer to its elements with `[]` in one dimension. In that case, our call `sleepdf$hrwage[4]` gives us just a number, whereas the previous call of `sleepdf[4,4]` gives us the same value but presented in a 1x1 table.
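
The 1x1 table appears because `read_dta()` returns a tibble, and tibbles do not simplify `[row, col]` to a bare value. A base `data.frame` behaves differently. A minimal sketch, using a tiny made-up data set (`toy_tbl` and `toy_df` are illustrative names, and the numbers are invented) and assuming the tibble package that ships with the tidyverse:

```r
library(tibble)

# Tiny made-up data set, for illustration only
toy_tbl <- tibble(age = c(32, 31, 44, 30),
                  hrwage = c(7.07, 1.43, 20.53, 9.62))
toy_df <- as.data.frame(toy_tbl)

# Tibble: [row, col] stays a 1x1 tibble
cell_tbl <- toy_tbl[4, 2]
dim(cell_tbl)

# Base data frame: the same call simplifies to a bare number
cell_df <- toy_df[4, 2]
cell_df

# Pulling the column out first gives a bare number for either type
toy_tbl[["hrwage"]][4]
toy_tbl$hrwage[4]
```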

## ggplot2

One of the sad facts about (most) economic research papers is that they don't always have the most aesthetically pleasing figures. For many data visualization applications or our own work we might want to have more control over the visuals and step them up a notch, making sure they convey useful information and have informative labels/captions. This is where the __ggplot2__ package comes in.

We started off using __R's__ built-in plot function, which let us produce scatterplots and construct histograms of all sorts of variables. However, it doesn't look the best and has some ugly naming conventions. __ggplot2__ will give us complete control over our figure and allow us to get as in depth with it as we want.

### ggplot2 Basic Syntax

Let's start by getting familiar with the basic syntax of __ggplot2__. Its syntax is a little bit different from some of the functions we've used before, but once we figure it out it makes things nice and easy as we make more and more professional-looking figures.

To start a plot, we start with the function

## `ggplot()`

This function initializes an empty plot and passes data to the layers that we'll add on top. We can also use this function to define our dataset or specify what our x and y variables are.

In [None]:
ggplot()

Okay, so not the most impressive yet. We get a little bit more if we specify our data and our x/y variables. To specify the data, we add the argument `data = dataname` to the function (the name of the data frame, without quotes). To specify which variable is on the x axis and which is on the y, we use the `aes(x = xvar, y = yvar)` argument, again with unquoted variable names. `aes()` is short for "aesthetics" and allows us to automatically pass these variables along as our x and y variables for the plots we add.

Let's say we're interested in using our `sleepdata` to see the relationship between age and hourly wage in our sample

In [None]:
ggplot(data = sleepdata, aes(x = age, y = hrwage))

That is a start! Now we have labels on both of our axes corresponding to the assigned variable, and a grid corresponding to possible values of those variables. 

We will add geometries (sets of points, histograms, lines, etc.) by adding what we call "layers" - let's take a look at a few of the options.

### Scatterplots

Now let's add some points! If we want to get a sense of how age and hourly wage vary in our data, we can do that by just plotting the points. We can add points using the `geom_point()` function.

Since we already declared our two variables, all we need to do is add the function with `+ geom_point()` to our existing code:

In [None]:
ggplot(data = sleepdata, aes(x = age, y = hrwage)) +
    geom_point()

And we get a plot of all our points (note that we were warned that there are some missing values that get dropped).

#### Labels

Sometimes we might want to change the labels from the variable names to a more descriptive label, and possibly add a title. We can do that! We do this by adding the `labs()` function to our plot.

In [None]:
ggplot(data = sleepdata, aes(x = age, y = hrwage)) +
    geom_point() +
    labs(title = "Relationship between Age and Hourly Wage",
        subtitle = "Nonmissing Sample",
        x = "Age (years)",
        y = "Hourly Wage ($)")

Let's take a look at what we added to `labs()`. First, `title` gives us the main title at the top. Second, `subtitle` gives us another line in a smaller font below the main title. `x` and `y` correspond to our x and y labels, respectively. 

#### Changing Points

What if we want to change the color/shape/transparency of our points? We can do that by using arguments of `geom_point()`.

In [None]:
ggplot(data = sleepdata, aes(x = age, y = hrwage)) +
    geom_point(colour = "blue", alpha = 0.4, size = 0.8) +
    labs(title = "Relationship between Age and Hourly Wage",
        subtitle = "Nonmissing Sample",
        x = "Age (years)",
        y = "Hourly Wage ($)")

By adding `colour="blue"` we changed the color to blue. There are [a toooooon](http://sape.inf.usi.ch/sites/default/files/ggplot2-colour-names.png) of named colors that we could use instead (this gets really useful when we start splitting our data by group levels).

`alpha = 0.4` sets the opacity of our points to 40% (an alpha of 1 is fully opaque, 0 is fully transparent), which helps when points overlap. `size = 0.8` shrinks the points below the default size of 1.5.

#### Splitting by Groups

What if we wanted to change the color of our points according to whether the individual is male or not? We can do that!

In [None]:
ggplot(data = sleepdata, aes(x = age, y = hrwage)) +
    geom_point(aes(colour = factor(male))) +
    labs(title = "Relationship between Age and Hourly Wage",
        subtitle = "Nonmissing Sample",
        x = "Age (years)",
        y = "Hourly Wage ($)")

By adding an aesthetic to our `geom_point()` we can set the color to be determined by the value of $male$. By default, the zero value (i.e. female) gets a red color while a 1 value (male) gets a blue-green. We specify the variable as a `factor()` so that ggplot knows it is a discrete variable. What if we instead wanted to change color on a continuous scale?

In [None]:
ggplot(data = sleepdata, aes(x = age, y = hrwage)) +
    geom_point(aes(colour = age)) +
    labs(title = "Relationship between Age and Hourly Wage",
        subtitle = "Nonmissing Sample",
        x = "Age (years)",
        y = "Hourly Wage ($)")

Here the color is now a function of our continuous variable $age$, taking increasingly lighter values for higher ages.

(note that __ggplot2__ lets you specify the color scale or color levels if you want, as well as nitpick the labels in the legend. In reality we can change anything that appears in the plot - we just have to choose the right option). 
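
As one sketch of that flexibility, the example below sets the colors and legend labels by hand with `scale_colour_manual()`. It uses simulated data (the data frame `sim` and its values are invented for illustration) so that it runs without `sleep75.dta`; the color names are arbitrary choices.

```r
library(ggplot2)

# Simulated stand-in for sleepdata (values invented for illustration)
set.seed(118)
sim <- data.frame(
    age = sample(23:65, 200, replace = TRUE),
    male = rbinom(200, 1, 0.5)
)
sim$hrwage <- 5 + 0.1 * sim$age + 2 * sim$male + rnorm(200, sd = 3)

p_manual <- ggplot(data = sim, aes(x = age, y = hrwage)) +
    geom_point(aes(colour = factor(male)), alpha = 0.6) +
    # Choose the legend title, the color for each level, and the legend text
    scale_colour_manual(
        name = "Gender",
        values = c("0" = "darkorange", "1" = "steelblue4"),
        breaks = c("0", "1"),
        labels = c("Female", "Male")
    ) +
    labs(title = "Relationship between Age and Hourly Wage",
        subtitle = "Simulated data",
        x = "Age (years)",
        y = "Hourly Wage ($)")
p_manual
```

For a continuous variable, `scale_colour_gradient(low = ..., high = ...)` plays the same role.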

One thing to note is that we can make other options conditional on variables in our data frame too. What if we wanted the shape of our points to depend on union participation, the color to vary with gender, and the size of the points to depend on the total minutes worked per week? We can do all that - even if it might look real gross.

In [None]:
ggplot(data = sleepdata, aes(x = age, y = hrwage)) +
    geom_point(aes(colour = factor(male), shape = factor(union), size = totwrk)) +
    labs(title = "Relationship between Age and Hourly Wage",
        subtitle = "Nonmissing Sample",
        x = "Age (years)",
        y = "Hourly Wage ($)")

While the above example is cluttered, it shows how we can take a simple scatterplot and use it to convey additional information in just one plot.

### Lines

We can add lines to our figure in a couple different ways. First, if we wanted to connect all the points in our data with a line, we would use the `geom_line()` function. 

In [None]:
sleepdata %>% 
    group_by(age) %>% 
    filter(row_number() == 1) %>%
    ggplot(aes(x=age, y = hrwage)) +
    geom_line()

We can also add points just by adding another layer!

In [None]:
sleepdata %>% 
    group_by(age) %>% 
    filter(row_number() == 1) %>%
    ggplot(aes(x=age, y = hrwage)) +
    geom_line()+
    geom_point(colour = "gray40")

What if instead we wanted to add a vertical, horizontal, or sloped line in our plot? We use the layers `geom_vline()`, `geom_hline()`, and `geom_abline()` for that.

`geom_vline()` is simple and really only needs the `xintercept` argument. Similarly, `geom_hline()` takes the `yintercept` argument. `geom_abline()` requires us to specify both a `slope` and an `intercept`.

Let's say we wanted to add lines to the previous set of points (not connected):

In [None]:
sleepdata %>% 
    group_by(age) %>% 
    filter(row_number() == 1) %>%
    ggplot(aes(x=age, y = hrwage)) +
    geom_point(colour = "gray40") +
    geom_vline(xintercept = 40, colour = "orchid4") +
    geom_hline(yintercept = 10) +
    geom_abline(intercept = 25, slope = -0.5, colour = "grey60", linetype = "dashed")


### Histograms and Distributions

Sometimes we want to get information about one variable on its own. We can use __ggplot2__ to make histograms as well as predicted distributions!

We use the function `geom_histogram()` to produce histograms. To get a basic histogram of $age$:

In [None]:
ggplot(data = sleepdata, aes(x = age)) +
    geom_histogram()

Notice that __ggplot2__ picks the bins for us by default (30 bins, with a message suggesting we choose a better value), but we can change this by adding `binwidth`. We can also add labels as before.

Note that if we want to change color, we now have two different options. `colour` now changes the outline color, while `fill` changes the interior color.


In [None]:
ggplot(data = sleepdata, aes(x = age)) +
    geom_histogram(binwidth = 10, colour = "seagreen4") +
    labs(title = "Age Histogram",
        x = "Age (years)",
        y = "Count")

ggplot(data = sleepdata, aes(x = age)) +
    geom_histogram(binwidth = 10, fill = "midnightblue") +
    labs(title = "Age Histogram",
        x = "Age (years)",
        y = "Count")

ggplot(data = sleepdata, aes(x = age)) +
    geom_histogram(binwidth = 10, colour = "grey60", fill = "darkolivegreen1") +
    labs(title = "Age Histogram",
        x = "Age (years)",
        y = "Count")

ggplot(data = sleepdata, aes(x = age)) +
    geom_histogram(aes(fill = factor(male)), binwidth = 10) +
    labs(title = "Age Histogram",
        x = "Age (years)",
        y = "Count")


What if we wanted to get a sense of the estimated distribution of age rather than look at the histogram? We can do that with the `geom_density()` function!

In [None]:
ggplot(data = sleepdata, aes(x = age)) +
    geom_density(fill = "gray60", colour= "navy") +
    labs(title = "Age Density",
        x = "Age (years)",
        y = "Density")

ggplot(data = sleepdata, aes(x = age)) +
    geom_density(aes(colour = factor(male))) +
    labs(title = "Age Density",
        x = "Age (years)",
        y = "Density")

### Regression

One cool thing that we can do with __ggplot2__ is produce a simple linear regression line directly in our plot! We use the `geom_smooth(method = "lm")` layer for that.


In [None]:
wagereg <- lm(hrwage ~ age, data = sleepdata)
summary(wagereg)

ggplot(data = sleepdata, aes(x=age, y = hrwage)) +
    geom_point()+
    geom_smooth(method = "lm")

Notice that by default it gives us the 95% confidence interval too! We can change the confidence interval using the `level` argument.
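
For instance, we might want a 99% band, or no band at all. The sketch below uses simulated data (invented for illustration) and also spells out `formula = y ~ x`, which only silences the message ggplot2 prints when it picks the formula itself.

```r
library(ggplot2)

# Simulated stand-in for sleepdata (values invented for illustration)
set.seed(118)
sim <- data.frame(age = runif(150, 23, 65))
sim$hrwage <- 4 + 0.15 * sim$age + rnorm(150, sd = 3)

# Wider band: 99% confidence interval around the fitted line
p99 <- ggplot(data = sim, aes(x = age, y = hrwage)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", formula = y ~ x, level = 0.99)
p99

# No band: just the fitted line
p_line <- ggplot(data = sim, aes(x = age, y = hrwage)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p_line
```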

### Multiple Linear Regression in ggplot2

How would we go about plotting the results of a multiple linear regression? In this case we have to combine output from our regression with the `geom_abline()` layer.

In [None]:
wagereg2 <- lm(hrwage ~ age + educ + male, data = sleepdata)


summary(wagereg2)

int <- wagereg2$coefficients[1]
slope_age <- wagereg2$coefficients[2]

ggplot(data = sleepdata, aes(x=age, y = hrwage)) +
    geom_point()+
    geom_abline(intercept = int, slope = slope_age) +
    ylim(-20,40)

I had to add the `ylim(-20,40)` to change the y limits so that we could see the line... because it now doesn't pass through the data! Recall that our slope coefficient $\hat\beta_{age}$ is now the _partial_ effect of age on hourly wage, holding education level and gender constant. As a result, the plot isn't quite as informative on top of the data points in a single set of dimensions.
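
One common workaround, sketched below as an assumption rather than as the approach this section takes, is to shift the intercept so the line is drawn with the other regressors held at their sample means. An OLS fit with an intercept passes through the point of means, so the adjusted line runs through the middle of the cloud. The data are simulated and every name (`sim`, `fit`, `int_at_means`) is illustrative.

```r
# Simulated stand-in for sleepdata (values invented for illustration)
set.seed(118)
n <- 300
sim <- data.frame(
    age = runif(n, 23, 65),
    educ = sample(8:17, n, replace = TRUE),
    male = rbinom(n, 1, 0.5)
)
sim$hrwage <- -6 + 0.05 * sim$age + 0.8 * sim$educ + 2 * sim$male +
    rnorm(n, sd = 3)

fit <- lm(hrwage ~ age + educ + male, data = sim)
b <- coef(fit)

# Intercept of the age line when educ and male sit at their sample means
int_at_means <- b[["(Intercept)"]] +
    b[["educ"]] * mean(sim$educ) +
    b[["male"]] * mean(sim$male)
slope_age <- b[["age"]]

# The adjusted line evaluated at mean age reproduces mean hourly wage
int_at_means + slope_age * mean(sim$age)
mean(sim$hrwage)
```

These two values could then be passed to `geom_abline(intercept = int_at_means, slope = slope_age)`, and the line would sit inside the data without needing to stretch the y limits.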

### Facets

Sometimes we might want to produce different panels of a plot for different _values_ of another variable. For instance, instead of changing the color of our points for males vs females earlier, we could have produced separate plots for the data where male = 0 and where male = 1, right next to each other. We do that using the `facet_grid()` layer.

In [None]:
ggplot(data = sleepdata, aes(x=age, y = hrwage)) +
    geom_point()+
    facet_grid(. ~ male)

Here we put the panels next to each other, first for females ($male=0$) on the left and then for males ($male=1$) on the right. We can also arrange them vertically by changing how we write the argument.

In [None]:
ggplot(data = sleepdata, aes(x=age, y = hrwage)) +
    geom_point()+
    facet_grid(male ~ .)

Notice that when we put `male ~ .` we get the plots stacked vertically by the value of $male$, whereas `. ~ male` splits them side by side.
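
The default strip labels just show `0` and `1`, which is easy to misread. One option, sketched here on simulated data (invented for illustration), is `labeller = label_both`, which prints the variable name next to each value.

```r
library(ggplot2)

# Simulated stand-in for sleepdata (values invented for illustration)
set.seed(118)
sim <- data.frame(
    age = sample(23:65, 200, replace = TRUE),
    male = rbinom(200, 1, 0.5)
)
sim$hrwage <- 5 + 0.1 * sim$age + 2 * sim$male + rnorm(200, sd = 3)

# Strip labels now include the variable name as well as its value
p_facets <- ggplot(data = sim, aes(x = age, y = hrwage)) +
    geom_point() +
    facet_grid(. ~ male, labeller = label_both)
p_facets
```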

## xtable

The package __xtable__ allows us to obtain high-quality formatted versions of our summary statistics tables, regression tables, and raw data to improve the look of our __R__ output. This is especially useful for generating professional-looking tables that can be added to a research paper... once we get into __RStudio__ on its own. Right now it's not as useful, since our Jupyter notebook already formats results in a specific way.

One way we can get a sense of how it formats is by using it on our regression tables in our Jupyter notebook. 


In [None]:

reg <- lm(hrwage ~ educ + age + union + exper, data = sleepdata)
summary(reg)

xtable(reg)

We'll spend more time with __xtable__ (and eventually __stargazer__ once we switch over to __RStudio__). 
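
To see what __xtable__ actually produces, we can ask for the markup directly. A minimal sketch using R's built-in `mtcars` data instead of `sleepdata` (so it runs without the `.dta` file); the regression and caption are only placeholders:

```r
library(xtable)

# Placeholder regression on a built-in data set
reg_cars <- lm(mpg ~ wt + hp, data = mtcars)

# Build the table object, rounding to three digits
tab <- xtable(reg_cars, digits = 3,
              caption = "Placeholder regression on mtcars")

# LaTeX markup, ready to paste into a paper
print(tab, type = "latex")

# HTML markup, which is what a notebook or web page would render
print(tab, type = "html")
```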

## Practice with ggplot!

Let's try producing a couple of different plots. First, let's load in a new dataset - the _autos.dta_ file again.

In [114]:
autodata <- read_dta("autos.dta")
head(autodata)

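| make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign |
|------|-------|-----|-------|----------|-------|--------|--------|------|--------------|------------|---------|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl+lbl>` |
| AMC Concord | 4099 | 22 | 3.0 | 2.5 | 11 | 2930 | 186 | 40 | 121 | 3.58 | 0 |
| AMC Pacer | 4749 | 17 | 3.0 | 3.0 | 11 | 3350 | 173 | 40 | 258 | 2.53 | 0 |
| AMC Spirit | 3799 | 22 | NA | 3.0 | 12 | 2640 | 168 | 35 | 121 | 3.08 | 0 |
| Buick Century | 4816 | 20 | 3.0 | 4.5 | 16 | 3250 | 196 | 40 | 196 | 2.93 | 0 |
| Buick Electra | 7827 | 15 | 4.0 | 4.0 | 20 | 4080 | 222 | 43 | 350 | 2.41 | 0 |
| Buick LeSabre | 5788 | 18 | 3.0 | 4.0 | 21 | 3670 | 218 | 43 | 231 | 2.73 | 0 |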


### Using `autodata`, create
### 1. A scatter plot showing the relationship between weight and mpg. Put weight on the x-axis and mpg on the y-axis. Label the x-axis "Weight (lbs)" and the y-axis "Fuel Efficiency (mpg)" and give it a nice title.
### 2. A histogram of price, with fill color according to whether the vehicle is foreign-made or not
### 3. A histogram of price, faceted according to whether the vehicle is foreign-made or not. Do you think this looks better or worse than 2. ?
### 4. Run a regression of price on mpg, foreign, and weight. Use `mutate` to add the residuals as a variable in `autodata`. Then, plot the residuals (y-axis) against mpg. Do the residuals appear to vary systematically with fuel efficiency? (recall that we can access residuals from `lm` output using `$residuals`)

# Practice Exercises

## 1.

#### We run the following regression of log-wage on three X variables:

#### lm(log(wage) ~ educ + exper + female, data = WageData)


<img src="images/wagereg.png" width="800" />

#### 1. Fill in the t-stat for education and calculate the 95% confidence interval

#### 2. Interpret the coefficient on experience, remember to comment on sign, size, and significance (SSS)

#### 3. Test the null that female salaries are 50% lower than male salaries at 1% significance. Show your work using the five steps in hypothesis testing.


## 2. 

#### A multinational firm focused on petroleum refining conducted a poll that showed a disapproval rate of 63% among consumers. The CEO refuses to believe this is true, and hires you as a consultant to check on the validity of the earlier poll. After depositing your hefty consulting fee, you collect a random sample of 100 consumers and find that 55 of them disapprove of the way the firm treats the environment. Run a hypothesis test (5% significance level, i.e. 95% confidence) to evaluate whether the original poll is reporting the correct disapproval rate.

## 3.

#### To investigate possible gender discrimination in a firm, a sample of 100 men and 64 women with similar job descriptions are selected at random and independently. A summary of the resulting monthly salaries is:


|    Group   | Average | Standard Deviation | Observations |
|------------|---------|--------------------|--------------|
| Men        | 3100    | 200                | 100          |
| Women      | 2900    | 320                | 64           |


#### Do these data provide statistically significant evidence that the wages of men and women are different at the 1 percent significance level?


## 4. 

#### From a sample of 200 households, we estimated the following two models of gasoline consumption (t-statistics in parentheses)


$$ gas = \underset{(2.3)}{34.2} + \underset{(3.1)}{10.5}\, suv + \underset{(1.7)}{0.25}\, inc - \underset{(1.8)}{0.00005}\, inc^2 $$


$$ gas = \underset{(2.3)}{22.2} + \underset{(3.1)}{15.3}\, suv $$

#### where gas gives the number of gallons per month, suv is a dummy variable for whether the household owns an SUV, and inc is the annual household income in thousands of \$.

#### 1. What is the marginal effect of income on gasoline consumption?

#### 2. At what point does that relationship change sign?

#### 3. What is the correlation between income and owning an SUV? Show how you came to this conclusion. (What did you have to assume - reasonably - in order to answer this question?)