# EEP/IAS 118 - Problem Set 4

## Due __Tuesday, July 27__ at 11:59pm. 

Submit materials (a copy of this notebook and any additional typed work/written scans) as one combined PDF on __Gradescope__. All work can be completed in this notebook. Run (`shift` + `enter`) all your answer cells before submission so that all your output is displayed.

If you do not want to use this notebook, __your submitted PDF must include R code for all requested summary statistics/regression output AND the desired output itself.__ For example, if a question asks you to run a regression, I expect to see the code used to estimate the regression __and__ the output of the regression itself. You can download a compiled report from RStudio by going to __File > Compile Report__. 

Answers that provide only estimated coefficient values, without the code and output from which they were obtained, will lose points.

## R Tips 

* You can type `?function_name` (for example, `?lm`) and execute the cell to receive a popup of documentation on that function.
* After you run a regression with `lm()` and save it as `reg_name`, `reg_name$fitted.values` returns a vector of fitted values ($\hat y$) in the same order as the observations in your dataset (i.e. the first fitted value corresponds to the first observation in your data).
* The function `lm.beta()` in the package __lm.beta__ lets us run standardized regressions. First, run the regression as usual using `lm()`, saving it to memory as `reg_name`. Then, run `lm.beta(reg_name)` to get the standardized version. If you execute it without saving to memory, you will obtain a regression table for the standardized variables, and if you save it to memory as `reg_name_beta` you can access the coefficients with `reg_name_beta$standardized.coefficients`.
* `cor(var1, var2)` returns the correlation between the two variables `var1` and `var2`.
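
The tips above can be tried out on R's built-in `mtcars` data, which is only a stand-in here and not the problem-set data. The sketch below computes standardized coefficients by hand (each slope times $sd(x)/sd(y)$) so that it runs without the __lm.beta__ package; that hand calculation is an assumption about what is convenient to show, not a replacement for `lm.beta()`.

```r
# Illustration on built-in mtcars (NOT the problem-set data).
reg_name <- lm(mpg ~ wt + cyl, data = mtcars)

# Fitted values come back in the same row order as the data.
fitted_vals <- reg_name$fitted.values
head(fitted_vals)

# Standardized slopes by hand: rescale each slope by sd(x) / sd(y).
slopes <- coef(reg_name)[c("wt", "cyl")]
std_coefs <- slopes * c(sd(mtcars$wt), sd(mtcars$cyl)) / sd(mtcars$mpg)
std_coefs

# With two vectors, cor() returns a single correlation coefficient.
cor(mtcars$wt, mtcars$cyl)
```

Because standardized slopes are unit-free, they can be compared directly across regressors measured on different scales.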

# Exercise 1. Fuel Consumption

## Guidelines

This exercise uses the `Cars_PS4.dta` data, which contains observations for 200 cars on their miles per gallon $(mpg)$, their number of cylinders $(cyl)$, their engine displacement $(disp)$ (a technical parameter of the car related to the size of the pistons and the number of cylinders), horsepower $(hp)$, the vehicle weight in pounds $(wght)$, and the price in dollars $(price)$.

## Preamble

When writing **R** code (and especially once we move over to RStudio), it can be helpful to include a preamble in your script where you load in your data and all the packages you'll use throughout the entire problem set. Go ahead and use the below code cell to load in your dataset (assign it any name you'd like), and load the packages you'll be using (we'll at least need __tidyverse__ and __haven__). 

In [13]:
library(tidyverse)
library(haven)

cars <- read_dta("Cars_PS4.dta")
head(cars)
getwd()

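| mpg | cyl | hp | disp | wght | price |
|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 23.0 | 4 | 97 | 54 | 2254 | 7523.984 |
| 17.6 | 8 | 302 | 129 | 3725 | 9146.851 |
| 23.0 | 4 | 120 | 97 | 2506 | 7752.67 |
| 20.2 | 6 | 200 | 88 | 3060 | 8429.475 |
| 18.0 | 6 | 225 | 95 | 3785 | 9255.319 |
| 14.0 | 8 | 304 | 150 | 3672 | 9125.904 |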


## Question 1.

### Estimate the model:

$$ mpg= \beta_0 + \beta_1wght + \beta_2cyl + u $$


#### (a)  Print the regression summary table to the notebook and write out the estimated equation.


In [12]:
reg <- lm(mpg ~ wght + cyl + disp, data = cars)
summary(reg)


Call:
lm(formula = mpg ~ wght + cyl + disp, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.2366 -2.7924 -0.4305  2.2926 15.3742 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 48.770899   1.486424  32.811  < 2e-16 ***
wght        -0.006095   0.000992  -6.144 4.42e-09 ***
cyl          0.141052   0.405159   0.348    0.728    
disp        -0.080768   0.019329  -4.179 4.41e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.083 on 196 degrees of freedom
Multiple R-squared:  0.6336,	Adjusted R-squared:  0.628 
F-statistic:   113 on 3 and 196 DF,  p-value: < 2.2e-16


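Reading the coefficients off the summary table above, the estimated equation is:

$$ \widehat{mpg} = 48.771 - 0.0061\,wght + 0.141\,cyl - 0.081\,disp $$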

#### (b) Comment on your results (SSS for both slope parameters).

Add your written answers for 1b here.

#### (c) Compare the relative effects of weight and number of cylinders in determining fuel efficiency (hint: think standardizing).

In [3]:
# Insert your code for 1c here.

Add any written answer for 1c here.

#### (d)  Without re-estimating the model above, what is the effect of vehicle weight on fuel efficiency when vehicle weight is measured in TONS (where 1 ton$=$2000 pounds)?  Answer this question by commenting on SSS (and when discussing significance, please explicitly mention why significance does or does not change when vehicle weight is converted to tons). 

Add your written answers for 1d here.
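
To see what a change of units does to a regression, here is a minimal sketch on the built-in `mtcars` data (a stand-in, not the problem-set data). The variables `wt_lbs` and `wt_tons` are hypothetical names introduced for this illustration; `mtcars$wt` is recorded in thousands of pounds.

```r
# Illustration on built-in mtcars (NOT the problem-set data).
# wt is in 1000 lbs, so build weight in pounds and in tons (1 ton = 2000 lbs).
cars_demo <- transform(mtcars, wt_lbs = wt * 1000)
cars_demo <- transform(cars_demo, wt_tons = wt_lbs / 2000)

reg_lbs  <- lm(mpg ~ wt_lbs + cyl, data = cars_demo)
reg_tons <- lm(mpg ~ wt_tons + cyl, data = cars_demo)

# Estimate, standard error, t value and p-value for each weight measure.
row_lbs  <- summary(reg_lbs)$coefficients["wt_lbs", ]
row_tons <- summary(reg_tons)$coefficients["wt_tons", ]
row_lbs
row_tons

# The slope and its standard error are both multiplied by 2000,
# so their ratio (the t statistic) is identical in the two regressions.
row_tons["Estimate"] / row_lbs["Estimate"]
```

Rescaling a regressor rescales its coefficient and standard error by the same factor, which is why the test statistic does not move.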

## Question 2.

### Use your results above to:

#### (a) Predict the average fuel efficiency (mpg) of Mazda Tribute SUVs that weigh 2500 lbs and have 6 cylinders. 



In [4]:
# Include your code for Question 2a here.


Add your written answers for Question 2a here.

#### (b) Run a regression that allows you to put a 90\% confidence interval around this predicted value. 

In [9]:
# Include your code for Question 2b here.


Add your written answers for Question 2b here.
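
One way to get a confidence interval for a predicted mean is to recentre each regressor at the values of interest, so that the intercept of the new regression is the prediction itself. The sketch below shows this on the built-in `mtcars` data (not the problem-set data); the target values `wt0` and `cyl0` are hypothetical, and the cross-check with `predict()` is included only to confirm the two routes agree.

```r
# Illustration on built-in mtcars (NOT the problem-set data).
wt0  <- 3   # hypothetical target weight, in 1000 lbs
cyl0 <- 6   # hypothetical target number of cylinders

# After recentring, the intercept is the predicted mean of mpg at (wt0, cyl0),
# and its standard error is reported directly by lm().
reg_centered <- lm(mpg ~ I(wt - wt0) + I(cyl - cyl0), data = mtcars)
ci_intercept <- confint(reg_centered, "(Intercept)", level = 0.90)
ci_intercept

# Cross-check against predict() on the uncentred model.
reg <- lm(mpg ~ wt + cyl, data = mtcars)
ci_predict <- predict(reg, newdata = data.frame(wt = wt0, cyl = cyl0),
                      interval = "confidence", level = 0.90)
ci_predict
```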

## Question 3.


###  Let $mpg^0$ be the unknown fuel efficiency of one particular Mazda Tribute SUV that weighs 2500 lbs and has 6 cylinders. 


#### (a)  Find a 90\% confidence interval for $mpg^0$. 


In [8]:
# Include your code for Question 3a here.


Add your written answers for Question 3a here.

#### (b)  Compare the width of this confidence interval with what you found in the previous question. Explain in no more than 3 sentences.


In [8]:
# Include your code for Question 3b here.


Add your written answers for Question 3b here.
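
An interval for one particular unit has to account for that unit's own error term as well as the sampling error in the fitted mean. The following sketch, again on the built-in `mtcars` data with a hypothetical `new_car`, builds the interval by hand from $se(\hat e^0) = \sqrt{se(\hat y^0)^2 + \hat\sigma^2}$ and compares it with the intervals returned by `predict()`.

```r
# Illustration on built-in mtcars (NOT the problem-set data).
reg <- lm(mpg ~ wt + cyl, data = mtcars)
new_car <- data.frame(wt = 3, cyl = 6)   # hypothetical car

# Interval for the average mpg of all such cars vs. for one such car.
ci_mean <- predict(reg, newdata = new_car, interval = "confidence", level = 0.90)
pi_one  <- predict(reg, newdata = new_car, interval = "prediction", level = 0.90)

# By hand: combine the standard error of the fitted mean with sigma-hat.
se_fit    <- predict(reg, newdata = new_car, se.fit = TRUE)$se.fit
sigma_hat <- summary(reg)$sigma
se_pred   <- sqrt(se_fit^2 + sigma_hat^2)
t_crit    <- qt(0.95, df = df.residual(reg))   # 90% two-sided interval
manual_pi <- c(pi_one[1, "fit"] - t_crit * se_pred,
               pi_one[1, "fit"] + t_crit * se_pred)

ci_mean
pi_one
manual_pi
```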

## Question 4. 

### Plot the graph of fuel efficiency $(mpg)$ against weight $(wght)$. What functional form might be more appropriate than a linear form for explaining fuel efficiency?

In [7]:
# Include your code for Question 4 here


Add your written answers for Question 4 here.

## Question 5.

### Now estimate the following model

$$ \ln(mpg)=\beta_0+\beta_1wght+\beta_2cyl+ u  $$

### Print the regression summary table to the notebook and write out the estimated equation.


In [4]:
# Type your code for Question 5 here


Add your written discussion for Question 5 here.

## Question 6.

### For explaining variation in $mpg$, decide whether you prefer the model from question 1 or the model from question 5.

Add your written discussion for Question 6 here.
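
The $R^2$ values of a model for $mpg$ and a model for $\ln(mpg)$ explain variation in different dependent variables, so they are not directly comparable. One common workaround, sketched here on the built-in `mtcars` data (not the problem-set data), is to convert the log model's fitted values back to levels and use their squared correlation with actual $mpg$. This is an illustration of one possible comparison, not necessarily the one the question has in mind.

```r
# Illustration on built-in mtcars (NOT the problem-set data).
reg_lin <- lm(mpg ~ wt + cyl, data = mtcars)
reg_log <- lm(log(mpg) ~ wt + cyl, data = mtcars)

# R-squared of the level model: share of variation in mpg explained.
r2_lin <- summary(reg_lin)$r.squared

# Level-scale fit of the log model: squared correlation between actual mpg
# and the exponentiated fitted values. Multiplying the fitted values by a
# positive constant would not change this correlation.
mpg_hat_from_log <- exp(fitted(reg_log))
r2_log_levels <- cor(mtcars$mpg, mpg_hat_from_log)^2

c(linear = r2_lin, log_in_levels = r2_log_levels)
```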

## Question 7.

###  Let's think about adding horsepower into the linear model, and estimate the following model: 

$$ mpg= \beta_0 + \beta_1wght + \beta_2cyl + \beta_3 hp+ u $$

### Comment on what happens to the significance of $cyl$. What accounts for the change in the standard error of the coefficient on $cyl$? (No more than 3 sentences.) 


In [4]:
# Type your code for Question 7 here


Add your written discussion for Question 7 here.
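
A useful way to trace a change in a standard error is the formula $se(\hat\beta_j) = \hat\sigma / \sqrt{SST_j (1 - R_j^2)}$, where $R_j^2$ comes from regressing $x_j$ on the other regressors. The sketch below checks that formula on the built-in `mtcars` data (not the problem-set data); the object names are hypothetical.

```r
# Illustration on built-in mtcars (NOT the problem-set data).
reg_small <- lm(mpg ~ wt + cyl, data = mtcars)
reg_big   <- lm(mpg ~ wt + cyl + hp, data = mtcars)

se_small <- summary(reg_small)$coefficients["cyl", "Std. Error"]
se_big   <- summary(reg_big)$coefficients["cyl", "Std. Error"]

# How well do the other regressors explain cyl?
r2_cyl  <- summary(lm(cyl ~ wt + hp, data = mtcars))$r.squared
vif_cyl <- 1 / (1 - r2_cyl)   # variance inflation factor

# Rebuild the standard error from its components.
sst_cyl    <- sum((mtcars$cyl - mean(mtcars$cyl))^2)
se_formula <- summary(reg_big)$sigma / sqrt(sst_cyl * (1 - r2_cyl))

c(se_small = se_small, se_big = se_big, se_formula = se_formula, vif = vif_cyl)
```

The closer $R_j^2$ is to 1, the less independent variation is left in $x_j$ and the larger its standard error becomes.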

## Question 8.

###  Finally let's think about what determines car price by estimating the following two equations:

$$ price = \beta_0 + \beta_1mpg+ u$$

$$ price = \beta_0 + \beta_1mpg + \beta_2wght+u$$

#### (a) Write out the two estimated equations

In [4]:
# Type your code for part (a) here


Add your written discussion for part (a) here.

#### (b)  Comment on how the coefficient on $mpg$ changes when we add in the variable $wght$.  What does this tell us about the correlation between $mpg$ and $wght$? (No more than 3 sentences.) 

Add your written discussion for part (b) here.
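
The change in a coefficient when a variable is added follows the omitted variable identity $\tilde\beta_1 = \hat\beta_1 + \hat\beta_2 \tilde\delta_1$, where $\tilde\delta_1$ is the slope from regressing the added variable on the included one. Here is a minimal numerical check on the built-in `mtcars` data (not the problem-set data); the variable roles are chosen only for illustration.

```r
# Illustration on built-in mtcars (NOT the problem-set data).
short <- lm(mpg ~ hp, data = mtcars)        # omits wt
long  <- lm(mpg ~ hp + wt, data = mtcars)   # includes wt
aux   <- lm(wt ~ hp, data = mtcars)         # added variable on included one

b_short <- coef(short)["hp"]
b_long  <- coef(long)["hp"]
b_wt    <- coef(long)["wt"]
delta   <- coef(aux)["hp"]

# The short-regression slope equals the long-regression slope plus
# (effect of the omitted variable) x (its association with hp).
implied_short <- b_long + b_wt * delta
c(short = unname(b_short), implied = unname(implied_short))
```

The sign of the gap between the two slopes is the sign of $\hat\beta_2 \tilde\delta_1$, which is what links the coefficient change to the correlation between the regressors.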

# Exercise 2:  Do Legislators with More Daughters Vote More Liberally on Women's Issues?

## Background

The data for this exercise were used in Ebonya Washington's paper "Female Socialization: How Daughters Affect Their Legislator Fathers' Voting on Women's Issues," published in the American Economic Review in 2008. The paper asks whether having daughters influences the voting behavior of members of the US Congress. The hypothesis is that having (more) daughters makes legislators more likely to vote liberally on issues concerning women.

For this exercise, we will focus on votes that took place in the 108th Congress, which was in session in 2003/04. As a measure of a liberal voting record, we use scores assigned by the American Association of University Women (AAUW), a liberal group that concerns itself with issues of interest to women. For the 108th Congress, the AAUW selected 9 pieces of legislation in the areas of education, equality and reproductive rights. The AAUW then assigned a score to each member of Congress. The scores range from 0 to 100 and measure the percentage of times the legislator voted in favor of the position held by the AAUW.

The dataset `Legislators_PS4.dta` contains the following characteristics for 320 members of the 108th Congress:

* $ngirls$: number of daughters
* $totchi$: number of children
* $age$: age
* $female$: indicator for being female
* $repub$: indicator for being a Republican
* $moredef$: proportion of people in the legislator's district who are in favor of "more spending on defense"
* $aauw$: AAUW score

### Question 1. Estimate and report results for the following regression models:

\begin{align}
aauw&=\beta_0 +\beta_1ngirls+u \tag{1} \\
aauw&=\beta_0 +\beta_1ngirls+\beta_2totchi+u \tag{2} \\
aauw&=\beta_0 +\beta_1ngirls+\beta_2totchi+\beta_3female+\beta_4repub+u \tag{3}
\end{align}

In [None]:
# Add Code for Question 1 here.

Add your written answer for Question 1 here.


### Question 2. Look at the results of the above regressions. 

#### (a) Briefly interpret $\beta_1$ in the first model (remember SSS).


Add your written answer for part (a) here.

#### (b) Discuss how the coefficient on $ngirls$ ($\hat\beta_1$) changes across the three models.  What do you think explains the difference in $\hat\beta_1$ between the first and second model? 


Add your written answer for Question (b) here.

### Question 3. Estimate the following quadratic regression of voting scores on age:

$$aauw = \beta_0 + \beta_1 ngirls+ \beta_2 totchi  + \beta_3 age + \beta_4 age^2 + u \tag{4} $$

#### (a) Comment on the sign of $\beta_3$ and $\beta_4$.  What does this tell us about the relationship between age and $aauw$? 

In [5]:
# Add your code for part (a) here.

Add your written answer for part (a) here.

#### (b) What is the marginal effect of age on $aauw$? (Provide the mathematical formula.) 

In [7]:
# Add your code for part (b) here.

Add your written answer for part (b) here.

#### (c)  Evaluate the marginal effect of age on $aauw$ for someone with the mean age (53.79614).

In [15]:
# Add your code for part (c) here.

Add your written answer for part (c) here.

#### (d)  At what age does the marginal effect of $age$ on $aauw$  become positive? 

In [15]:
# Add your code for part (d) here.

Add your written answer for part (d) here.
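
For a quadratic specification $y = \beta_0 + \beta_1 x + \beta_2 x^2 + u$, the marginal effect of $x$ is $\beta_1 + 2\beta_2 x$, and it changes sign at $x^* = -\beta_1 / (2\beta_2)$. The sketch below works through those two formulas on the built-in `mtcars` data, with `hp` standing in for the regressor; it is an illustration of the mechanics only and does not use the legislator data.

```r
# Illustration on built-in mtcars (NOT the problem-set data).
reg_quad <- lm(mpg ~ hp + I(hp^2), data = mtcars)
b <- unname(coef(reg_quad))   # b[1] intercept, b[2] linear, b[3] quadratic

# Marginal effect of hp on mpg at a given value of hp.
marginal_effect <- function(x) b[2] + 2 * b[3] * x

# Evaluate at the sample mean of hp.
me_at_mean <- marginal_effect(mean(mtcars$hp))

# Value of hp at which the marginal effect is zero and changes sign.
turning_point <- -b[2] / (2 * b[3])

c(me_at_mean = me_at_mean, turning_point = turning_point)
```

Whether the turning point is a maximum or a minimum depends on the sign of the quadratic coefficient, and it is worth checking whether it falls inside the range of the data.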

## Question 4. 

### Compare equation (3) to equation (4).  In no more than 3 sentences, explain why you would choose one model over the other.

Add your written answer for Question 4 here.

## Downloading your Notebook

Download a PDF copy of your notebook by using __File > Download as > PDF via Chrome (.pdf)__. It does not matter what browser you are using: this command works on Chrome/Firefox/Opera/Edge/Safari/heck probably even Internet Explorer or Netscape Navigator.

If you have a problem downloading your notebook or combining PDFs, or if you are using RStudio on your own computer and are unsure what sort of output I want, __please email me__.