# Homework Assignment #2

*due Tuesday, February 12, 2019 by 11:59pm*

This homework is worth **10 points** [100%]. There are no bonus questions; however, up to 1 bonus point may be given for **clearly written code** (provided that the code is correct).

### n.b.
All deliverables are required to be typed and all graphs and statistical output generated in Jupyter Notebook using Python and associated packages. Deliverables with *any* handwritten elements will not be accepted and will receive a grade of zero. 

As with `hw1`, copy the whole directory `hw2` to your own private repository (**use the same one as for `hw1`**).

In [2]:
import numpy as np
import pandas as pd
from scipy.linalg import eigh
from scipy.stats import norm
import matplotlib.pyplot as plt
%matplotlib inline

## Part 1: Linear Regression - the closed form solution [total of 15%]

***DO NOT USE ANY REGRESSION FUNCTIONS OR REGRESSION PACKAGES FOR THIS PROBLEM***

### Data for Part 1:
Suggested steps:
* Read the [housing data](https://s3-us-west-1.amazonaws.com/usfca-cs686-ml/hw1/housing.csv) used in `hw1`
* Use the column "`GrLivArea`" as your independent variable and "`SalePrice`" as your dependent variable. (You can safely disregard the rest of the dataset)

In [18]:
df = pd.read_csv("../hw1/data/housing.csv").loc[:, ['GrLivArea','SalePrice']]

#### Find estimates for $\beta_0$ and $\beta_1$

$b_1 = \frac{\sum_{i=1}^n(x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^n(x_i - \overline{x})^2}$

In [20]:
x = df["GrLivArea"].mean()
y = df["SalePrice"].mean()
b1 = ((df["GrLivArea"] - x)*(df["SalePrice"] - y)).sum() / ((df["GrLivArea"] - x)**2).sum()
b1

107.13035896582517

$b_0 = \overline{y} - b_1\overline{x}$

In [21]:
b0 = y - b1*x
b0

18569.02585648728

#### Find $R^2$

$R^2 = (\frac{s_{xy}}{s_x s_y})^2 = (\frac{\sum_{i=1}^n(x_i - \overline{x})(y_i - \overline{y})}{(n-1)s_x s_y})^2 = \frac{(\sum_{i=1}^n(x_i - \overline{x})(y_i - \overline{y}))^2}{\sum_{i=1}^n(x_i - \overline{x})^2 \sum_{i=1}^n(y_i - \overline{y})^2}$

In [23]:
rs = (((df["GrLivArea"] - x)*(df["SalePrice"] - y)).sum())**2 \
    / (((df["GrLivArea"] - x)**2).sum() * ((df["SalePrice"] - y)**2).sum())
rs

0.502148650271804
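The closed-form estimates above can also be wrapped in a small function, which makes them easy to sanity-check. The sketch below is only an illustration: the function name `slr_closed_form` and the toy data are hypothetical, and the data are chosen to lie exactly on a line so the expected estimates are known in advance.

```python
import numpy as np

def slr_closed_form(x, y):
    """Return (b0, b1, r_squared) for a simple linear regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()          # deviations of x from its mean
    dy = y - y.mean()          # deviations of y from its mean
    sxy = (dx * dy).sum()
    sxx = (dx ** 2).sum()
    syy = (dy ** 2).sum()
    b1 = sxy / sxx
    b0 = y.mean() - b1 * x.mean()
    r_squared = sxy ** 2 / (sxx * syy)
    return b0, b1, r_squared

# Hypothetical toy data lying exactly on y = 2 + 3x (not the housing data)
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = 2.0 + 3.0 * x_toy
b0_toy, b1_toy, r2_toy = slr_closed_form(x_toy, y_toy)
print(b0_toy, b1_toy, r2_toy)
```

Calling `slr_closed_form(df["GrLivArea"], df["SalePrice"])` should reproduce the three values computed in the cells above.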

## Part 2: Linear Regression - the iterative solution [total of 35%]

***YOU ARE ASKED TO CODE THE FOLLOWING EXERCISE FROM SCRATCH. DO NOT USE ANY PACKAGES FOR AUTOMATIC DIFFERENTIATION FOR THIS PROBLEM.***

### <span style="color: #DA122C;">**YOUR SOLUTION MUST WORK FOR ANY GIVEN NUMBER OF INDEPENDENT VARIABLES**</span>

HINT: To make your implementation stand out, try implementing this question following a functional programming paradigm (i.e. define functions)

In [None]:
## YOU WILL NEED THESE PARAMETERS ##
# DO NOT CHANGE ANYTHING IN THIS CELL
init_eta = 1e-4 # eta, gamma, learning rate or whatever you want to call it ¯\_(ツ)_/¯
num_iters = 100 # number of iterations
n = 1000

In [None]:
# DO NOT CHANGE ANYTHING IN THIS CELL
np.random.seed(42)
r = np.eye(int(np.sqrt(n)/3)-1)
r += np.random.normal(0, 0.1, size=(r.shape[0],)*2)
lams, A = eigh(r)
c = np.dot(A, np.diag(np.sqrt(lams)))
X = np.dot(c, norm.rvs(size=(r.shape[0], n))).transpose()
Y = np.random.randn(r.shape[0]+1).dot(np.concatenate((X, np.ones((n,1))), axis=1).transpose()) + np.random.normal(0, 0.5, n)
del r, lams, A, c

In [None]:
# Start your solution here with data X and labels Y
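One possible shape for a from-scratch iterative solution is batch gradient descent on the mean squared error, written as small functions so that it works for any number of columns in `X`. This is a minimal sketch under stated assumptions, not the required solution: all function names are hypothetical, the demo uses its own synthetic data rather than the `X` and `Y` generated above, and the demo step size and iteration count (`demo_eta`, `demo_iters`) are picked for the toy data and differ from `init_eta` and `num_iters`.

```python
import numpy as np

def add_intercept(X):
    """Append a column of ones so the last coefficient acts as the intercept."""
    return np.concatenate((X, np.ones((X.shape[0], 1))), axis=1)

def mse(Xb, y, beta):
    """Mean squared error of the linear model Xb @ beta."""
    resid = Xb @ beta - y
    return (resid @ resid) / len(y)

def mse_gradient(Xb, y, beta):
    """Analytic gradient of the MSE with respect to beta."""
    return 2.0 * Xb.T @ (Xb @ beta - y) / len(y)

def gradient_descent(X, y, eta, num_iters):
    """Run fixed-step batch gradient descent starting from beta = 0."""
    Xb = add_intercept(X)
    beta = np.zeros(Xb.shape[1])
    losses = []
    for _ in range(num_iters):
        beta = beta - eta * mse_gradient(Xb, y, beta)
        losses.append(mse(Xb, y, beta))
    return beta, losses

# Hypothetical demo data: 3 predictors plus an intercept
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_beta = np.array([1.5, -2.0, 0.5, 4.0])   # last entry is the intercept
y_demo = add_intercept(X_demo) @ true_beta + rng.normal(0.0, 0.1, size=200)

demo_eta, demo_iters = 0.1, 500               # assumed values for this toy problem
beta_hat, losses = gradient_descent(X_demo, y_demo, demo_eta, demo_iters)
print(beta_hat)
print(losses[0], losses[-1])
```

Plotting `losses` against the iteration index is a quick way to check that the chosen step size actually decreases the error.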

## Part 3: Theory [total of 50%]

### Question 1

In the SLR model, the probability distribution of $Y$ (i.e., $Y_i$) has the same mean and variance for all levels of $X$ (i.e., $X_i$). True or False? Explain.

Answer:

False. The probability distribution of $Y_i$ has the same variance $\sigma^2$ at every level of $X$, but its mean, $E[Y_i] = \beta_0 + \beta_1 X_i$, depends on the value of $X_i$ whenever $\beta_1 \neq 0$.

### Question 2

The number of points above the fitted regression line is always equal to the number of points below it. True or False? Explain.

Answer:



### Question 3

In a SLR model, what does $\beta_1$ measure?  

Answer:

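$\beta_1$ measures the slope of the regression function, i.e., the change in the mean of $Y$ per one-unit increase in $X$. Whether or not $\beta_1 = 0$ also tells us whether the dependent variable $Y$ and the independent variable $X$ are linearly associated.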

### Question 4

In the context of an SLR model, prove the following:

- $E[Y_i] = \beta_0 + \beta_1 X_i$ 
- $V(Y_i) = \sigma^2 \hspace{5pt} \forall \hspace{5pt} i$ 

Answer:

$Y_i = \beta_0 + \beta_1X_i + \epsilon_i \Longrightarrow E[Y_i] = E[\beta_0 + \beta_1X_i + \epsilon_i]
\Longrightarrow  E[Y_i] = E[\beta_0] + E[\beta_1X_i] + E[\epsilon_i]
\Longrightarrow E[Y_i] = \beta_0 + \beta_1E[X_i] + E[\epsilon_i]$

Since we know that $E[\epsilon_i \mid X_i] = 0$ and $X_i$ is a known constant (so $E[X_i] = X_i$),

we can conclude that $E[Y_i] = \beta_0 + \beta_1E[X_i] + E[\epsilon_i] = \beta_0 + \beta_1X_i$

$Y_i = \beta_0 + \beta_1X_i + \epsilon_i \Longrightarrow V[Y_i] = V[\beta_0 + \beta_1X_i + \epsilon_i]
\Longrightarrow  V[Y_i] = V[\beta_0] + V[\beta_1X_i] + V[\epsilon_i]
\Longrightarrow V[Y_i] = V[\epsilon_i]$

Since $\beta_0$ and $\beta_1X_i$ are constants with zero variance, and we know that $V[\epsilon_i \mid X_i] = \sigma^2 \ \forall i$,

we can conclude that $V[Y_i] = V[\epsilon_i] = \sigma^2$
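Both results can be checked numerically with a short simulation. The sketch below is only an illustration under assumed values: the parameters `beta0`, `beta1`, `sigma` and the three fixed levels of $X$ are hypothetical, and normal errors are used for convenience even though the proof only needs mean zero and constant variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values and fixed (non-random) levels of X
beta0, beta1, sigma = 1.0, 2.0, 0.5
x_levels = np.array([0.0, 1.0, 5.0])
n_draws = 200_000

# One column of simulated responses per level of X
eps = rng.normal(0.0, sigma, size=(n_draws, x_levels.size))
y = beta0 + beta1 * x_levels + eps

sample_means = y.mean(axis=0)        # should be close to beta0 + beta1 * x
sample_vars = y.var(axis=0, ddof=1)  # should be close to sigma**2 at every x

print(sample_means, beta0 + beta1 * x_levels)
print(sample_vars, sigma ** 2)
```

The sample means move with $X_i$ while the sample variances stay near $\sigma^2$, matching the two statements proved above.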

### Question 5

For the SLR model, $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$, how many random variables are there? Explain.

Answer:

There are 2 random variables. The first is the random error term $\epsilon_i$. Since the response $Y_i$ is the sum of the constant term $\beta_0 + \beta_1X_i$ and $\epsilon_i$, $Y_i$ is also a random variable.

### Question 6

Write out the normal error regression model and its assumptions (in English and math).

Answer:

Normal error regression model: $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$

$Y_i$ is the value of the response variable in the $i^{th}$ trial

$\beta_0$ and $\beta_1$ are parameters: $\beta_0$ is the $y$-intercept and $\beta_1$ is the slope of the regression function

$X_i$ is a known constant, the value of the predictor variable in the $i^{th}$ trial

$\epsilon_i$ is a random error term such that:
- $E[\epsilon_i \mid X_i] = 0$
- $V[\epsilon_i \mid X_i] = \sigma^2 \ \forall i$
- $\sigma[\epsilon_i,\epsilon_j \mid X_i] = 0 \ \forall i, j; \ i \neq j$
- $\sigma[X_i, \epsilon_i] = 0 \ \forall i$
- $\epsilon_i \sim N(0,\sigma^2)$, i.e., $\epsilon_i$ is normally distributed

### Question 7

What does a negative value of $\beta_1$ indicate about the relation between $X$ and $Y$?

$$ b_0 = \frac{1}{n} \Big( \sum Y_i - b_1 \sum X_i \Big) $$
$$ b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$

Answer:

It indicates that $X$ and $Y$ are negatively correlated: when $X$ increases by 1 unit, we expect $Y$ to decrease by $\left|\beta_1\right|$ on average.

### Question 8

Are hypotheses tested concerning the actual values of the coefficients, e.g., $\beta_1$, or their estimated values, e.g., $b_1$? Why?

Answer:

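Hypothesis tests concern the actual values of the coefficients, e.g., $\beta_1$, since we want to examine whether or not there is any linear association between the dependent variable $Y$ and the independent variable $X$, which is determined by whether $\beta_1 = 0$. The estimate $b_1$ is computed from the sample and is therefore already known; it only serves as evidence about the unknown $\beta_1$.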

### Question 9

You compute a coefficient of determination for a regression model and obtain an $R^2=0.832$. What does the strength of the coefficient of determination say about the causal relationship between the explanatory and response variables?

Answer:

The strength of the coefficient of determination only describes the goodness of fit of the regression line (the unexplained variation of the response variable is smaller when $R^2$ is bigger); it does not necessarily say anything about a causal relationship. For example, for the relationship between the number of people who eat ice cream and the number of people who get bitten by sharks, the $R^2$ may be high, but it cannot indicate any causal relationship between them.

### Question 10

You compute a coefficient of determination for a regression model, regressing crime rate per capita ($Y$) on the size of municipal police force ($X$), obtaining an $R^2 = 0.6533$. What can you say about the relationship between $Y$ and $X$?

Answer:



### Question 11

From the discussion of SLR so far, how do you believe outliers will affect the regression line?

Answer:

A few mild outliers may not affect the regression line seriously, although the slope, $b_1$, and the y-intercept, $b_0$, can still be slightly affected. Serious outliers, meaning points that lie far away from the rest of the data, may affect both $b_1$ and $b_0$ seriously.

### Question 12

Write out the hypothesis test which tests for the statistical significance of $\beta_1$ for an SLR model. Be sure to include the null and alternate hypotheses, the critical value including degrees of freedom (two-tailed test) for $\alpha = 0.05$, and an interpretation of both possible results.

Answer:

Hypothesis:

$H_0: \beta_1 = 0$

$H_1: \beta_1 \neq 0$

Calculation:

$t^* = \frac{b_1 - \beta_1}{s\{b_1\}}$, which under $H_0: \beta_1 = 0$ reduces to $t^* = \frac{b_1}{s\{b_1\}}$

Since we lose 2 degrees of freedom by estimating $\beta_0$ and $\beta_1$ with $b_0$ and $b_1$,

$\frac{b_1 - \beta_1}{s\{b_1\}} \sim t(n-2)$

Conclusion:

Since it is a two-sided test with $\alpha = 0.05$, the critical value is $t(1-\alpha/2;n-2) = t(0.975;n-2)$, and we can conclude that:

if $\left|t^*\right| \leq t(1-\alpha/2;n-2)$, do not reject $H_0: \beta_1=0$. $b_1$ is not statistically significantly different from 0 at $\alpha = 0.05$, which means that at $\alpha = 0.05$ we do not have sufficient evidence of a linear association between $X$ and $Y$.

if $\left|t^*\right| > t(1-\alpha/2;n-2)$, reject $H_0: \beta_1=0$. $b_1$ is statistically significantly different from 0 at $\alpha = 0.05$, which means that at $\alpha = 0.05$ we conclude that there is a linear association between $X$ and $Y$.
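The test above can be carried out numerically as follows. This is a sketch on hypothetical synthetic data; the function name `slope_t_test` and its returned keys are made up for illustration, and $s\{b_1\}$ is computed as $\sqrt{MSE / \sum (X_i - \bar{X})^2}$ with $MSE = SSE/(n-2)$.

```python
import numpy as np
from scipy import stats

def slope_t_test(x, y, alpha=0.05):
    """Two-sided t-test of H0: beta_1 = 0 in a simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    dx = x - x.mean()
    sxx = (dx ** 2).sum()
    b1 = (dx * (y - y.mean())).sum() / sxx
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    mse = (resid ** 2).sum() / (n - 2)             # two parameters estimated
    se_b1 = np.sqrt(mse / sxx)                     # s{b1}
    t_star = b1 / se_b1                            # test statistic under H0
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # t(1 - alpha/2; n - 2)
    p_value = 2 * stats.t.sf(abs(t_star), df=n - 2)
    return {"b1": b1, "se_b1": se_b1, "t_star": t_star, "t_crit": t_crit,
            "p_value": p_value, "reject_h0": bool(abs(t_star) > t_crit)}

# Hypothetical data with a clearly non-zero slope
rng = np.random.default_rng(1)
x_sim = rng.uniform(0.0, 10.0, size=50)
y_sim = 3.0 + 2.0 * x_sim + rng.normal(0.0, 1.0, size=50)
result = slope_t_test(x_sim, y_sim)
print(result)
```

With $n = 50$ the critical value uses $n - 2 = 48$ degrees of freedom, and `reject_h0` corresponds to the second of the two conclusions above.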

### Question 13

Using [```modified_SENIC_data_01.csv```](https://s3-us-west-1.amazonaws.com/usfca-cs686-ml/hw2/modified_SENIC_data_01.csv)
1. Regress Infection Risk ($Y$) on Length of Stay ($X$). Report the $R^2$, $b_0$ and $b_1$  values.
2. Multiply the observations, both $X$ and $Y$, by 192; we will refer to these as $X_{(2)}$ and $Y_{(2)}$. Regress $Y_{(2)}$ on $X_{(2)}$. Report the $R^2$, $b_0$ and $b_1$  values.
3. Multiply only $Y$ by 47; we will refer to this as $Y_{(3)}$. Regress $Y_{(3)}$ on $X$. Report the $R^2$, $b_0$ and $b_1$  values.
4. Multiply only $X$ by 12; we will refer to this as $X_{(3)}$. Regress $Y$ on $X_{(3)}$. Report the $R^2$, $b_0$ and $b_1$  values.


**Succinctly** explain what you have gleaned from this exercise. Include a summarized tabular representation of the regression output and the associated $R^2$, $b_0$ and $b_1$ values.
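The effect of rescaling can be previewed on synthetic data before running the regressions on the SENIC file. The sketch below does not use the real data set: `slr_fit` is a hypothetical helper and the data are simulated, so only the pattern across rows carries over, not the numbers.

```python
import numpy as np

def slr_fit(x, y):
    """Closed-form SLR estimates: returns (b0, b1, r_squared)."""
    dx = x - x.mean()
    dy = y - y.mean()
    sxy = (dx * dy).sum()
    sxx = (dx ** 2).sum()
    syy = (dy ** 2).sum()
    b1 = sxy / sxx
    b0 = y.mean() - b1 * x.mean()
    return b0, b1, sxy ** 2 / (sxx * syy)

# Hypothetical stand-in data (not the SENIC observations)
rng = np.random.default_rng(0)
x = rng.uniform(5.0, 20.0, size=100)
y = 1.0 + 0.4 * x + rng.normal(0.0, 1.0, size=100)

fits = {
    "Y on X": slr_fit(x, y),
    "192Y on 192X": slr_fit(192 * x, 192 * y),
    "47Y on X": slr_fit(x, 47 * y),
    "Y on 12X": slr_fit(12 * x, y),
}
for label, (b0, b1, r2) in fits.items():
    print(f"{label:>14}: b0={b0:10.4f}  b1={b1:8.4f}  R^2={r2:.4f}")
```

In this sketch $R^2$ is identical in all four rows, while $b_0$ and $b_1$ change only by the scale factors applied to $X$ and $Y$.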

### Question 14

Using the [```fourDataSets.csv```](https://s3-us-west-1.amazonaws.com/usfca-cs686-ml/hw2/fourDataSets.csv), regress $Y$ on $X$ for $i=1,2,3,4$, i.e., generate four separate SLR models. **For each** of the four data sets, run and report summary statistics, generate a scatter plot and run a SLR model, reporting the regression function $R^2$, $R_a^2$, and the significance of $b_1$ ($p$-value). Intelligently discuss what you observe about each data set and the data sets as a whole.