# Homework 5: Linear models, continued
This homework assignment is designed to give you a deeper understanding of linear models. First, we'll dive into the math behind the closed-form solution of maximum likelihood estimation. **In the first section below, write your answers using LaTeX equation formatting.**

*Note: Check out [this page](https://gtribello.github.io/mathNET/assets/notebook-writing.html) and [this page](https://towardsdatascience.com/write-markdown-latex-in-the-jupyter-notebook-10985edb91fd) for resources on how to do LaTeX formatting. You can also double-click on the question cells in this notebook to see how math is formatted in the questions.*


## Deriving the Maximum Likelihood Estimate for Simple Linear Regression (6 points)

Using the mean squared error (MSE) as your objective function (the thing you're trying to minimize when you fit your model) allows for a closed-form solution for the maximum likelihood estimate (MLE) of your model parameters in linear regression. Let's consider the model with a single predictor variable, i.e. simple linear regression: $Y = \beta_0 + \beta_1 X$.

a) Use algebra to show how you can expand out $MSE(\beta_0, \beta_1)$ to get from i to ii below.

> _i)_ $E[ (Y-(\beta_0 + \beta_1 X))^2]$

> _ii)_ $E[Y^2] -2 \beta_0E[Y]-2 \beta_1 Cov[X,Y]-2 \beta_1 E[X]E[Y]+ \beta_0^2 +2 \beta_0 \beta_1 E[X]+\beta_1^2 Var[X]+ \beta_1^2 (E[X])^2$






**Answer:**

$E[(Y-(\beta_0 + \beta_1 X))^2]$ 

$= E[Y^2 - 2Y(\beta_0 + \beta_1X) + (\beta_0 + \beta_1X)^2]$

$= E[Y^2 - 2Y\beta_0 - 2Y\beta_1X + \beta_0^2 + 2\beta_0\beta_1X + \beta_1^2X^2]$

$= E[Y^2] - 2\beta_0E[Y] - 2\beta_1E[XY] + \beta_0^2 + 2\beta_0\beta_1E[X] + \beta_1^2E[X^2]$

$= E[Y^2] - 2\beta_0E[Y] - 2\beta_1Cov[X,Y] - 2\beta_1E[X]E[Y] + \beta_0^2 + 2\beta_0\beta_1E[X] + \beta_1^2Var[X] + \beta_1^2(E[X])^2 $
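The last step above relies on two standard identities, which follow directly from the definitions of covariance and variance:

$Cov[X,Y] = E[XY] - E[X]E[Y] \implies E[XY] = Cov[X,Y] + E[X]E[Y]$

$Var[X] = E[X^2] - (E[X])^2 \implies E[X^2] = Var[X] + (E[X])^2$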




b) Prove that the MLE of $\beta_0$ is $E[Y]- \beta_1 E[X]$ by taking the derivative of _ii_ above, with respect to $\beta_0$, setting the derivative to zero, and solving for $\beta_0$.

**Answer:**

$ 0 = \frac{\partial E[(Y-(\beta_0 + \beta_1 X))^2]}{\partial\beta_0}$

$ 0 = 0 - 2*1*\beta_0^{1-1}*E[Y] - 0 - 0 + 2*\beta_0^{2-1} + 2 * 1 * \beta_0^{1-1} * \beta_1E[X] + 0 + 0$

$ 0 = -2E[Y] + 2\beta_0 + 2\beta_1E[X]$

$ 0 = -2(E[Y] - \beta_0 - \beta_1E[X])$

$ 0 = E[Y] - \beta_0 - \beta_1E[X]$

$ \beta_0 = E[Y] - \beta_1E[X] $


c) Prove that the MLE for $\beta_1$ is $Cov[X,Y]/Var[X]$ by taking the derivative of equation _ii_ above, with respect to $\beta_1$, setting the derivative to zero, and solving for $\beta_1$. *Hint: after you've simplified / expanded a bit, plug in the solution for $\beta_0$ from part b.*

**Answer:**

$0 = \frac{\partial E[(Y-(\beta_0 + \beta_1 X))^2]}{\partial\beta_1}$

$0 = 0 - 0 -2*1*\beta_1^{1-1}*Cov[X,Y] - 2 * 1 * \beta_1^{1-1}E[X]E[Y] + 0 + 2 * \beta_0 * 1 * \beta_1^{1-1}E[X] + 2*\beta_1^{2-1}*Var[X] + 2*\beta_1^{2-1}(E[X])^2$

$0 = -2Cov[X,Y] - 2E[X]E[Y] + 2\beta_0E[X] + 2\beta_1Var[X] + 2\beta_1(E[X])^2$

Substituting $\beta_0 = E[Y] - \beta_1E[X]$ from part b:

$0 = -2Cov[X,Y] - 2E[X]E[Y] + 2(E[Y] - \beta_1E[X])E[X] + 2\beta_1Var[X] + 2\beta_1(E[X])^2$

$0 = -2Cov[X,Y] - 2E[X]E[Y] + 2E[X]E[Y] - 2\beta_1(E[X])^2 + 2\beta_1Var[X] + 2\beta_1(E[X])^2$

$0 = -2Cov[X,Y] + 2\beta_1Var[X]$

$\beta_1 = \frac{Cov[X,Y]}{Var[X]}$
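The closed-form results stated in parts b and c can also be checked numerically. The sketch below is not part of the assignment: it simulates data (the sample size, true intercept, and true slope are arbitrary assumptions), computes the closed-form estimates, and compares them with the parameters found by minimizing the empirical MSE directly with base R's `optim()`.

```r
# Simulated data; the true intercept (2) and slope (0.5) are arbitrary choices.
set.seed(1)
x <- rnorm(500, mean = 10, sd = 3)
y <- 2 + 0.5 * x + rnorm(500)

# Closed-form estimates from parts b and c, using sample moments.
beta_1_closed <- cov(x, y) / var(x)
beta_0_closed <- mean(y) - beta_1_closed * mean(x)

# Empirical MSE as a function of the parameter vector c(beta_0, beta_1).
mse <- function(beta) mean((y - (beta[1] + beta[2] * x))^2)

# Minimize the MSE numerically, starting from an arbitrary point.
fit <- optim(par = c(0, 0), fn = mse, method = "BFGS")

print(c(beta_0 = beta_0_closed, beta_1 = beta_1_closed))
print(fit$par)
```

The two printed vectors should agree up to the optimizer's convergence tolerance, since both describe the minimizer of the same quadratic objective.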

---
## Connecting to data (4 points)

Now let's connect this to some real data. Once again we'll be using the **unrestricted_trimmed_1_7_2020_10_50_44.csv** file from the *Homework/hcp_data* folder in the class GitHub repository.

This data is a portion of the [Human Connectome Project database](http://www.humanconnectomeproject.org/). It provides measures of cognitive tasks and brain morphology measurements from 1206 participants. The full description of each variable is provided in the **HCP_S1200_DataDictionary_April_20_2018.csv** file in the *Homework/hcp_data* folder in the class GitHub repository.

a) Use the `setwd` and `read.csv` functions to load data from the **unrestricted_trimmed_1_7_2020_10_50_44.csv** file. Then use the `tidyverse` tools to make a new dataframe `d1` that only includes the subject ID (`Subject`), Flanker Task performance (`Flanker_Unadj`), and total grey matter volume (`FS_Total_GM_Vol`) variables, and remove all _NA_ values.

Use the `head` function to look at the first few rows of each data frame.

In [13]:
library(tidyverse)
library(broom)

In [15]:
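# Download the CSV (requires the gdown command-line tool), then load it
system("gdown --id 1ebJ2y4NuAcCD70-_v9DrIveuRN9qNr9G")
HCP_data <- read.csv("unrestricted_trimmed_1_7_2020_10_50_44.csv")

head(HCP_data)

# Keep the three variables of interest and drop rows with NA in any of them
d1 <- HCP_data %>%
  select(Subject, Flanker_Unadj, FS_Total_GM_Vol) %>%
  filter(!is.na(Subject),
         !is.na(Flanker_Unadj),
         !is.na(FS_Total_GM_Vol))
head(d1)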

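| | Subject | Release | Acquisition | Gender | Age | MMSE_Score | PSQI_Score | PSQI_Comp1 | PSQI_Comp2 | PSQI_Comp3 | ⋯ | Noise_Comp | Odor_Unadj | Odor_AgeAdj | PainIntens_RawScore | PainInterf_Tscore | Taste_Unadj | Taste_AgeAdj | Mars_Log_Score | Mars_Errs | Mars_Final |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` |
| 1 | 100004 | S900 | Q06 | M | 22-25 | 29 | 8 | 1 | 2 | 2 | ⋯ | 5.2 | 101.12 | 86.45 | 2 | 45.9 | 107.17 | 105.31 | 1.8 | 0 | 1.8 |
| 2 | 100206 | S900 | Q11 | M | 26-30 | 30 | 6 | 1 | 1 | 1 | ⋯ | 6.0 | 108.79 | 97.19 | 1 | 49.7 | 72.63 | 72.03 | 1.84 | 0 | 1.84 |
| 3 | 100307 | Q1 | Q01 | F | 26-30 | 29 | 4 | 1 | 0 | 1 | ⋯ | 3.6 | 101.12 | 86.45 | 0 | 38.6 | 71.69 | 71.76 | 1.76 | 0 | 1.76 |
| 4 | 100408 | Q3 | Q03 | M | 31-35 | 30 | 4 | 1 | 1 | 0 | ⋯ | 2.0 | 108.79 | 98.04 | 2 | 52.6 | 114.01 | 113.59 | 1.76 | 2 | 1.68 |
| 5 | 100610 | S900 | Q08 | M | 26-30 | 30 | 4 | 1 | 1 | 0 | ⋯ | 2.0 | 122.25 | 110.45 | 0 | 38.6 | 84.84 | 85.31 | 1.92 | 1 | 1.88 |
| 6 | 101006 | S500 | Q06 | F | 31-35 | 28 | 2 | 1 | 1 | 0 | ⋯ | 6.0 | 122.25 | 111.41 | 0 | 38.6 | 123.8 | 123.31 | 1.8 | 0 | 1.8 |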


| | Subject | Flanker_Unadj | FS_Total_GM_Vol |
|---|---|---|---|
| | `<int>` | `<dbl>` | `<int>` |
| 1 | 100206 | 130.42 | 807245 |
| 2 | 100307 | 112.56 | 664124 |
| 3 | 100408 | 121.18 | 726206 |
| 4 | 100610 | 126.53 | 762308 |
| 5 | 101006 | 101.85 | 579632 |
| 6 | 101107 | 107.04 | 665024 |
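
The NA filtering in part a can also be written more compactly with `tidyr::drop_na()`, which removes every row containing an `NA` in any remaining column. The sketch below uses a small made-up data frame (`toy`, with invented values and a hypothetical extra column `Other`) rather than the HCP file, so it runs on its own.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the HCP table; all values are made up for illustration.
toy <- data.frame(
  Subject         = c(1, 2, 3, 4),
  Flanker_Unadj   = c(110.2, NA, 121.5, 98.7),
  FS_Total_GM_Vol = c(700000, 650000, NA, 720000),
  Other           = c("a", "b", "c", NA)
)

# Select the three variables first, then drop rows with NA in any of them.
# Selecting first matters: the NA in `Other` should not cost us subject 4.
toy_clean <- toy %>%
  select(Subject, Flanker_Unadj, FS_Total_GM_Vol) %>%
  drop_na()

print(toy_clean)
```

Selecting before dropping keeps rows whose only missing values are in columns that are not being analyzed.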


b) Now we're going to see if the solutions we proved above actually line up with the model fit that R gives us (they should). Calculate what the $\beta_0$ and $\beta_1$ coefficients should be for a simple linear regression model using `Flanker_Unadj` as $Y$ and `FS_Total_GM_Vol` as $X$. Use the formulas we derived above ($\beta_1 = Cov[X,Y]/Var[X]$, $\beta_0 = E[Y] - \beta_1E[X]$). Then use `lm()` to compare the coefficients you calculated with the ones R gives you.

$\hat\beta_0 = E[Y] - \hat\beta_1E[X] = \bar{y} - \hat\beta_1\bar{x}$

$\hat\beta_1 = \frac{Cov[X,Y]}{Var[X]}$

In [41]:
# Sample covariance of X and Y, and sample variance of X
cov_XY <- cov(d1$FS_Total_GM_Vol, d1$Flanker_Unadj)
var_X <- var(d1$FS_Total_GM_Vol)

# Closed-form estimates from the derivation above
beta_1 <- cov_XY / var_X
beta_0 <- mean(d1$Flanker_Unadj) - beta_1 * mean(d1$FS_Total_GM_Vol)

calc_beta_1 <- round(beta_1, digits = 6)
calc_beta_0 <- round(beta_0, digits = 2)

# Fit the same model with lm() and pull out the coefficient table
d1.lm <- lm(Flanker_Unadj ~ FS_Total_GM_Vol, data = d1)
d1.summary <- summary(d1.lm)
d1.details <- tidy(d1.summary)

# Row 1 is the intercept, row 2 is the slope
check_beta_1 <- round(d1.details$estimate[2], digits = 6)
check_beta_0 <- round(d1.details$estimate[1], digits = 2)

print(paste('Calculated beta 0:', calc_beta_0))
print(paste('Calculated beta 1:', calc_beta_1))
print(paste('R beta 0: ', check_beta_0))
print(paste('R beta 1: ', check_beta_1))

[1] "Calculated beta 0: 90.26"
[1] "Calculated beta 1: 3.1e-05"
[1] "R beta 0:  90.26"
[1] "R beta 1:  3.1e-05"


$\hat\beta_0 = E[Y] - \hat\beta_1E[X] = \bar{y} - \hat\beta_1\bar{x} = 90.26$

$\hat\beta_1 = \frac{Cov[X,Y]}{Var[X]} = 0.000031$
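
Rounding both sets of coefficients before comparing them can hide small discrepancies. As a hedged alternative, the sketch below compares the closed-form estimates with `coef(lm())` using `all.equal()`, which tests equality up to a small numerical tolerance. It runs on simulated data (the data frame `toy` and its generating parameters are assumptions chosen to roughly resemble the scale of the variables above), not on the HCP file.

```r
# Simulated data on a scale loosely resembling grey matter volume and
# Flanker scores; the generating parameters are arbitrary assumptions.
set.seed(42)
toy <- data.frame(x = rnorm(200, mean = 700000, sd = 60000))
toy$y <- 90 + 3e-5 * toy$x + rnorm(200, sd = 10)

# Closed-form estimates from sample moments.
beta_1 <- cov(toy$x, toy$y) / var(toy$x)
beta_0 <- mean(toy$y) - beta_1 * mean(toy$x)

# Coefficients from lm(): intercept first, then slope.
fit <- lm(y ~ x, data = toy)

# unname() drops the coefficient names so only the numbers are compared.
print(all.equal(unname(coef(fit)), c(beta_0, beta_1)))
```

Because `cov()` and `var()` both divide by $n - 1$, that factor cancels in the ratio, so the sample version of $Cov[X,Y]/Var[X]$ matches the least-squares slope.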

**DUE:** 5pm EST, March 18, 2021

**IMPORTANT** Did you collaborate with anyone on this assignment? If so, list their names here. 
> Urszula Ozszcapinska

> Austin Luor