# Chapter 3: Linear Regression
By [**Mosta Ashour**](https://www.linkedin.com/in/mosta-ashour/)
- **Chapter 3 from the book [An Introduction to Statistical Learning](https://www.statlearning.com/).**
- **By Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani.**
- **Pages from $120$ to $121$**


- **Exercises:**
 - **[1.](#1)**
 - **[2.](#2)**
 - **[3.](#3)**
 - **[4.](#4)**
 - **[5.](#5)**
 - **[6.](#6)**
 - **[7.](#7)**
 
# <span style="font-family:cursive;color:#0071bb;"> 3.7 Exercises </span>
## <span style="font-family:cursive;color:#0071bb;"> Conceptual </span>

<a id='1'></a>
### $1.$ Describe the null hypotheses to which the $\text{p-values}$ given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of <span style="font-family:cursive;color:red;"> $sales, TV, radio,$ </span> and <span style="font-family:cursive;color:red;"> $newspaper,$ </span> rather than in terms of the coefficients of the linear model.

![3_table3_4](img/3_table3_4.png)

- **The null hypotheses in this case are:**
 - That, with the other two advertising budgets held fixed, there is no relationship between the amount spent on $TV$, $radio$, or $newspaper$ advertising and $sales$ (one hypothesis per medium):

$$H_{0}^{(TV)}: \beta_1 = 0$$
$$H_{0}^{(radio)}: \beta_2 = 0$$
$$H_{0}^{(newspaper)}: \beta_3 = 0$$

- From the **p-values** above, $TV$ and $radio$ appear to have a statistically significant association with $sales$, while $newspaper$ does not.

- The **p-values** given in Table 3.4 suggest that we **can reject** the null hypotheses for $TV$ and $radio$, and that we **cannot reject** the null hypothesis for $newspaper$.
- It therefore seems likely that $sales$ is related to both $TV$ and $radio$ advertising spend, whereas there is no evidence of a relationship between $newspaper$ spend and $sales$ once $TV$ and $radio$ are accounted for.
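
The decision rule above can be sketched in a few lines of Python. This is only an illustration: the t-statistics are the ones printed in Table 3.4 of the book (assumed here, because the table appears only as an image in these notes), the p-values use a standard normal approximation rather than the exact t distribution, and the 0.05 threshold is a conventional choice, not something the exercise specifies.

```python
import math

# t-statistics as printed in Table 3.4 of the book (assumed here, since the
# table is only available as an image in these notes).
t_stats = {"TV": 32.81, "radio": 21.89, "newspaper": -0.18}

def two_sided_p(t):
    """Two-sided p-value using a standard normal approximation to the t distribution."""
    return math.erfc(abs(t) / math.sqrt(2))

p_values = {name: two_sided_p(t) for name, t in t_stats.items()}

for name, p in p_values.items():
    # 0.05 is an assumed significance level, chosen by convention.
    verdict = "reject H0" if p < 0.05 else "cannot reject H0"
    print(f"{name:>9}: p = {p:.4g} -> {verdict}")
```

With these inputs the rule rejects the null hypothesis for $TV$ and $radio$ and does not reject it for $newspaper$, matching the conclusion above.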

<a id='2'></a>
### $2.$ Carefully explain the differences between the $\text{KNN}$ classifier and $\text{KNN}$ regression methods.

- **$\text{KNN}$ classifier methods** 
 - Attempts to predict the **class** to which the output variable belongs by estimating the local class probabilities from the $K$ nearest neighbors and assigning the most probable class, which determines a decision boundary `"typically used for qualitative response, classification problems"`.

- **$\text{KNN}$ regression methods** 
 - Tries to predict the **value** of the output variable by taking a local average of the responses of the $K$ nearest neighbors `"typically used for quantitative response, regression problems"`.
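
The contrast can be sketched with a tiny one-dimensional example. The functions `k_nearest`, `knn_classify` and `knn_regress` and the toy data are made up for illustration; they are not from the book, and a real analysis would normally use a library implementation.

```python
from collections import Counter

def k_nearest(train, x0, k):
    """Return the k training pairs (x, y) whose x is closest to x0."""
    return sorted(train, key=lambda pair: abs(pair[0] - x0))[:k]

def knn_classify(train, x0, k):
    """Predict the most common class label among the k nearest neighbors."""
    labels = [y for _, y in k_nearest(train, x0, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, x0, k):
    """Predict the average response of the k nearest neighbors."""
    values = [y for _, y in k_nearest(train, x0, k)]
    return sum(values) / len(values)

# Toy data, invented for this sketch: same x values, different kinds of response.
class_data = [(1.0, "A"), (1.5, "A"), (2.0, "B"), (5.0, "B"), (5.5, "B")]
value_data = [(1.0, 10.0), (1.5, 12.0), (2.0, 14.0), (5.0, 30.0), (5.5, 32.0)]

predicted_class = knn_classify(class_data, x0=1.2, k=3)   # majority vote
predicted_value = knn_regress(value_data, x0=1.2, k=3)    # local average
print(predicted_class, predicted_value)
```

Both functions share the same neighbor search; they differ only in how the neighbors' responses are combined (a vote versus a mean).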

<a id='3'></a>
### $3.$ Suppose we have a data set with five predictors, $X_1 = GPA$, $X_2 = IQ$, $X_3 = Gender$ (1 for Female and 0 for Male), $X_4 = \text{Interaction between GPA and IQ}$, and $X_5 = \text{Interaction between GPA and Gender}$. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get $\hat{\beta}_0 = 50, \hat{\beta}_1 = 20, \hat{\beta}_2 = 0.07, \hat{\beta}_3 = 35, \hat{\beta}_4 = 0.01, \hat{\beta}_5 = -10$.

**$(a)$** Which answer is correct, and why?

- $i.$ For a fixed value of IQ and GPA, males earn more on average than females.
- $ii.$ For a fixed value of IQ and GPA, females earn more on average than males.
- **$iii.$ For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough.**
- $iv.$ For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA is high enough.
    
### Answer:  
- The least squares fit is given by:
$$\hat{y}=50+20\,GPA+0.07\,IQ+35\,Gender+0.01\,GPA\times IQ-10\,GPA\times Gender$$
- For males ($Gender = 0$):
$$\hat{y}=50+20\,GPA+0.07\,IQ+0.01\,GPA\times IQ$$
- For females ($Gender = 1$):
$$\hat{y}=85+10\,GPA+0.07\,IQ+0.01\,GPA\times IQ$$


- The female prediction minus the male prediction is $35 - 10\,GPA$ (in thousands of dollars). So at a fixed IQ and GPA, females start higher when the GPA is low, but males earn more on average once the GPA exceeds 3.5:
$$50 + 20\,GPA \geq 85 + 10\,GPA$$
$$10\,GPA \geq 35$$
$$GPA \geq 3.5$$
- **Answer iii. is the correct one**
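
As a quick numerical check of the crossover at a GPA of 3.5, the sketch below evaluates the fitted model for both genders at a few GPAs. The helper name `predicted_salary` and the chosen GPA grid are illustrative; the IQ value is arbitrary because it cancels in the difference.

```python
def predicted_salary(gpa, iq, gender):
    """Fitted model from the exercise; the result is in thousands of dollars."""
    return 50 + 20*gpa + 0.07*iq + 35*gender + 0.01*gpa*iq - 10*gpa*gender

iq = 110  # any fixed IQ works: it cancels in the female-minus-male gap
gaps = {}
for gpa in (2.0, 3.0, 3.5, 4.0):
    # gender = 1 encodes female, gender = 0 encodes male
    gaps[gpa] = predicted_salary(gpa, iq, 1) - predicted_salary(gpa, iq, 0)
    print(f"GPA {gpa}: female - male = {gaps[gpa]:+.2f} thousand dollars")
```

The gap is positive below a GPA of 3.5, essentially zero at 3.5, and negative above it.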

**(b)** Predict the salary of a **female** with **IQ of 110** and a **GPA of 4.0**

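```python
# Coefficients come from the fitted model; salary is in thousands of dollars.
gpa, iq, gender = 4.0, 110, 1  # gender = 1 encodes female

salary = 50 + 20*gpa + 0.07*iq + 35*gender + 0.01*gpa*iq - 10*gpa*gender
print(f"${salary * 1000:,.0f}")
```

```
$137,100
```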


**$(c)$** **True or false:** Since the coefficient for the $GPA/IQ$ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.

- **False**. The size of a coefficient depends on the units of its predictor, so a small estimate does not by itself mean weak evidence. To assess whether the $GPA/IQ$ interaction matters, we need to test the null hypothesis $H_0:\beta_4=0$ and look at the **p-value** associated with the $\text{t-statistic}$ (or with the $\text{F-statistic}$ comparing the models with and without the term) to decide whether to reject the null hypothesis.
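
To make the scale point concrete, here is a short worked note. It uses the fitted coefficient $\hat{\beta}_4 = 0.01$ and the values from part (b). The standard error $SE(\hat{\beta}_4)$ is not given in the exercise, so the t-statistic can only be written symbolically.

$$t = \frac{\hat{\beta}_4}{SE(\hat{\beta}_4)}, \qquad \hat{\beta}_4 \times GPA \times IQ = 0.01 \times 4.0 \times 110 = 4.4$$

For the student in part (b), the interaction term contributes 4.4 thousand dollars to the prediction, which is not negligible even though the coefficient looks small. Whether the effect is statistically distinguishable from zero depends on $t$, not on the raw size of $\hat{\beta}_4$.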

<a id='4'></a>
### $4.$ I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e.  $Y = β_0 + β_1X + β_2X^2 + β_3X^3 + \epsilon$

**$(a)$** Suppose that the true relationship between $X$ and $Y$ is linear, i.e. $Y = β_0 + β_1X + ε$. Consider the training residual sum of squares ($RSS$) for the linear regression, and also the training $RSS$ for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

 - The training $RSS$ for the **cubic model will be lower than, or at worst equal to, that of the linear model**. The linear model is a special case of the cubic model with $\beta_2 = \beta_3 = 0$, so the least squares cubic fit can never fit the training data worse.
 - In practice we would expect it to be strictly lower, because the extra flexibility lets the cubic model follow noise in the training data even though the true relationship between $X$ and $Y$ is linear.
 
 
**$(b)$** Answer (a) using test rather than training $RSS$.

 - We would expect the test $RSS$ for the **linear model to be lower than the cubic model**, because the cubic model's extra flexibility fits noise in the training data (overfitting). That adds variance without reducing bias, since the true relationship is linear.
 
 
**$(c)$** Suppose that the true relationship between $X$ and $Y$ is not linear, but we don't know how far it is from linear. Consider the training $RSS$ for the linear regression, and also the training $RSS$ for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

 - We would expect the training $RSS$ for the **cubic model to be lower than the linear model** because of the cubic model's greater flexibility. As in (a), this holds whatever the true relationship is.


**$(d)$** Answer (c) using test rather than training RSS.

 - **There is not enough information to tell.**
  - **Cubic would be lower if:**
   - The true relationship between $X$ and $Y$ is far from linear and there is low noise in our training data.
  - **Linear would be lower if:**
   - The relationship is only slightly non-linear or the noise in our training data is high.
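
The training-versus-test behaviour in parts (a) and (b) can be explored with a small simulation. This is a sketch under stated assumptions: the true model $Y = 1 + 2X + \varepsilon$, the noise level, the range of $X$ and the random seed are all invented for illustration, and NumPy's `polyfit` is used as the least squares fitter.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def simulate(n):
    """Draw n points from an assumed linear truth Y = 1 + 2X + noise."""
    x = rng.uniform(-2, 2, n)
    y = 1 + 2 * x + rng.normal(0, 1, n)
    return x, y

def rss(coefs, x, y):
    """Residual sum of squares of a fitted polynomial on the data (x, y)."""
    return float(np.sum((y - np.polyval(coefs, x)) ** 2))

x_train, y_train = simulate(100)
x_test, y_test = simulate(100)

linear = np.polyfit(x_train, y_train, deg=1)  # least squares line
cubic = np.polyfit(x_train, y_train, deg=3)   # least squares cubic

train_rss = {"linear": rss(linear, x_train, y_train),
             "cubic": rss(cubic, x_train, y_train)}
test_rss = {"linear": rss(linear, x_test, y_test),
            "cubic": rss(cubic, x_test, y_test)}

print("train RSS:", train_rss)
print("test RSS: ", test_rss)
```

The training RSS of the cubic fit cannot exceed that of the linear fit, because the models are nested. The test comparison varies from draw to draw, so rerunning with different seeds, or with a non-linear truth for parts (c) and (d), gives a feel for how often each model wins.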

<a id='5'></a>
### $5.$ Consider the fitted values that result from performing linear regression without an intercept. In this setting, the i-th fitted value takes the form:

$$\hat{y_i} = x_i \hat{\beta},$$

$where$

$$\hat{\beta} = \bigg(\sum_{i=1}^{n}x_i y_i\bigg) \bigg/ \bigg(\sum_{i'=1}^{n}x_{i'}^2\bigg)$$

Show that we can write

$$\hat{y_i} = \sum_{i'=1}^n a_{i'} y_{i'}$$

What is $a_{i'}$?

*Note: We interpret this result by saying that the fitted values from linear regression are linear combinations of the response values.*

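Substitute $\hat{\beta}$ into the fitted value, writing the summation index in the numerator as $i'$ so that it does not clash with the fixed index $i$:

$$\hat{y_i} = x_i \frac{\sum_{i'=1}^{n}x_{i'} y_{i'}} {\sum_{i''=1}^{n} x_{i''}^2}$$

Since $x_i$ and the denominator do not depend on $i'$, they can be moved inside the sum:

$$\hat{y_i} = \sum_{i'=1}^{n} \frac{x_i x_{i'}}  {\sum_{i''=1}^{n} x_{i''}^2}\, y_{i'}$$

 - $Where$ $$\hat{y_i} = \sum_{i'=1}^n a_{i'} y_{i'}$$
 - $So$ $$a_{i'} = \frac{x_i x_{i'} }  {\sum_{i''=1}^{n} x_{i''}^2}$$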
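
A small numerical check of this identity is sketched below. The data are made up for illustration, and the variable names (`a`, `fitted_direct`, `fitted_weighted`) are not from the book.

```python
# Made-up data for illustration; any x with at least one nonzero entry works.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.2, 1.9, 3.2, 3.8]

sum_x_sq = sum(xi ** 2 for xi in x)
beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum_x_sq

# Fitted values computed directly from the no-intercept fit.
fitted_direct = [xi * beta_hat for xi in x]

# Fitted values computed as linear combinations of the responses,
# with weights a[i][i'] = x_i * x_i' / (sum of squared x).
a = [[xi * xj / sum_x_sq for xj in x] for xi in x]
fitted_weighted = [sum(a_ij * yj for a_ij, yj in zip(row, y)) for row in a]

print(fitted_direct)
print(fitted_weighted)
```

The two lists agree up to floating-point rounding. Note that the weights depend only on the $x$ values, never on $y$.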

<a id='6'></a>
### $6.$ Using $(3.4)$, argue that in the case of simple linear regression, the least squares line always passes through the point $(\bar{x}, \bar{y})$.

The least squares line is $\hat{y}=\hat{\beta}_0+\hat{\beta}_1 x$. We need to show that $\hat{y} = \bar{y}$ when $x=\bar{x}$.

$\text{When  } x=\bar{x}$
$$\hat{y}=\hat{\beta}_0+\hat{\beta}_1 \bar{x}$$

$Where$, from $(3.4)$, $$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

$So$ $$\hat{y}=\bar{y} - \hat{\beta}_1 \bar{x}+\hat{\beta}_1 \bar{x}$$

$$\hat{y}=\bar{y}$$
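
The same fact can be checked numerically. The sketch below uses made-up data and computes the least squares estimates from their closed-form expressions.

```python
# Made-up data for illustration.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.5, 3.0, 8.0]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Closed-form least squares estimates for simple linear regression.
beta1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
beta0 = y_bar - beta1 * x_bar

# Evaluate the fitted line at the mean of x.
prediction_at_mean = beta0 + beta1 * x_bar
print(prediction_at_mean, y_bar)
```

The prediction at $\bar{x}$ equals $\bar{y}$ up to floating-point rounding, whatever data are used.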

<a id='7'></a>
### $7.$ It is claimed in the text that in the case of simple linear regression of $Y$ onto $X$, the $R^2$ statistic $(3.17)$ is equal to the square of the correlation between $X$ and $Y$ (3.18). Prove that this is the case. For simplicity, you may assume that $\bar{x} = \bar{y}= 0$.


**Proposition**: In the case of simple linear regression:

$$ y = \beta_0 + \beta_1 x + \varepsilon $$

the $R^2$ statistic is equal to the squared correlation between $X$ and $Y$, i.e.:

$$ R^2 = corr^2(x, y) $$

We'll be using the following definitions to prove the above proposition.

**Def**:
$$ R^2 = 1- \frac{RSS}{TSS} $$

**Def**:
$$ RSS = \sum (y_i - \hat{y}_i)^2 \label{RSS} $$ 

**Def**:
$$ TSS = \sum (y_i - \bar{y})^2 \label{TSS} $$

**Def**:
$$
\begin{align}
  corr(x, y) &= \frac{\sum (x_i - \bar{x}) (y_i - \bar{y})}
                     {\sigma_x \sigma_y} \\
  \sigma_x^2 &= \sum (x_i - \bar{x})^2 \\
  \sigma_y^2 &= \sum (y_i - \bar{y})^2
\end{align}
$$

**Proof**:

Substitute the definitions of TSS and RSS into $R^2$, using $\bar{y} = 0$ in the TSS:

$$
R^2 = 1-\frac{\sum (y_i - \hat{y}_i)^2}
           {\sum y_i^2}
$$


Recall that:

$$
\begin{align}
  \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \label{beta0} \\
  \hat{\beta}_1 &= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}
                        {\sum (x_i - \bar{x})^2}
\end{align}
$$

Substitute the expression for $\hat{\beta}_0$ into $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$. With $\bar{x} = \bar{y} = 0$ we have $\hat{\beta}_0 = 0$, so:

$$
\begin{align}
  \hat{y}_i &= \hat{\beta}_1 x_i \\
  \hat{y}_i &= \frac{\sum_j x_j y_j}
                        {\sum_j x_j^2} x_i
\end{align}
$$

$Then$

$$
\begin{align}
        R^2 &= 1-\frac{\sum_i \left(y_i - \frac{\sum_j x_j y_j}
                                    {\sum_j x_j^2} x_i\right)^2}
                  {\sum_i y_i^2}\\
            &= \frac{\sum_i y_i^2 - \left(\sum_i y_i^2 - 2\frac{\sum_j x_j y_j}
                     {\sum_j x_j^2}\sum_i x_i y_i + \left(\frac{\sum_j x_j y_j}
                                                {\sum_j x_j^2}\right)^2 \sum_i x_i^2\right)}
                {\sum_i y_i^2}\\
            &= \frac{\frac{2(\sum x_i y_i)^2}{\sum x_i^2} - \frac{(\sum x_i y_i)^2}{\sum x_i^2}}{\sum y_i^2}\\
            &= \frac{(\sum x_i y_i)^2}{\sum x_i^2 \sum y_i^2}
\end{align}
$$

$ \text{with } \bar{x} = \bar{y} = 0$

$$
\begin{align}
  corr(x, y) &= \frac{\sum x_i y_i}
                     {\sqrt{\sum x_i^2 \sum y_i^2}} \\
  corr^2(x, y) &= \frac{(\sum x_i y_i)^2}
                       {\sum x_i^2 \sum y_i^2} = R^2
\end{align}
$$
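
As a sanity check on the algebra, the sketch below compares $R^2$ with the squared correlation on made-up data. The data are centered first so that $\bar{x} = \bar{y} = 0$, matching the simplifying assumption the exercise allows; the variable names are illustrative.

```python
import math

# Made-up data for illustration.
x = [1.0, 2.0, 4.0, 7.0, 9.0]
y = [2.0, 3.5, 3.0, 8.0, 8.5]

# Center both variables so that their means are zero.
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
xc = [xi - x_bar for xi in x]
yc = [yi - y_bar for yi in y]

s_xy = sum(xi * yi for xi, yi in zip(xc, yc))
s_xx = sum(xi ** 2 for xi in xc)
s_yy = sum(yi ** 2 for yi in yc)

# With centered data the intercept is zero and the fit is y_hat = beta1 * x.
beta1 = s_xy / s_xx
rss = sum((yi - beta1 * xi) ** 2 for xi, yi in zip(xc, yc))
tss = s_yy

r_squared = 1 - rss / tss
corr = s_xy / math.sqrt(s_xx * s_yy)
print(r_squared, corr ** 2)
```

The two printed numbers agree up to floating-point rounding.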




## Done!