## Homework 2

1. In the DataSets folder, read the data set 'College.csv' into Python. Perform simple linear regression, with `perc_alumni` as the independent variable and `S_F_Ratio` as the response ($y$) variable. Report on the slope and intercept of the LSR line you find. What is the confidence interval for each?

In [349]:
import numpy as np
import pandas as pd

In [351]:
college_df = pd.read_csv("College.csv")
college_df.head()

Unnamed: 0,School,Private,Apps,Accept,Enroll,Top10perc,Top25perc,F_Undergrad,P_Undergrad,Outstate,Room_Board,Books,Personal,PhD,Terminal,S_F_Ratio,perc_alumni,Expend,Grad_Rate
0,Abilene Christian University,Yes,1660,1232,721,23,52,2885,537,7440,3300,450,2200,70,78,18.1,12,7041,60
1,Adelphi University,Yes,2186,1924,512,16,29,2683,1227,12280,6450,750,1500,29,30,12.2,16,10527,56
2,Adrian College,Yes,1428,1097,336,22,50,1036,99,11250,3750,400,1165,53,66,12.9,30,8735,54
3,Agnes Scott College,Yes,417,349,137,60,89,510,63,12960,5450,450,875,92,97,7.7,37,19016,59
4,Alaska Pacific University,Yes,193,146,55,16,44,249,869,7560,4120,800,1500,76,72,11.9,2,10922,15


In [353]:
x = college_df["perc_alumni"].to_numpy()
y = college_df["S_F_Ratio"].to_numpy()

In [355]:
lsr = np.polyfit(x, y, 1)
m_hat = lsr[0]
b_hat = lsr[1]

In [357]:
print(f"The slope of the LSR is m_hat = {m_hat} and the intercept of the LSR is b_hat = {b_hat}.")

The slope of the LSR is m_hat = -0.1287088334943756 and the intercept of the LSR is b_hat = 17.017043121637855.


In [359]:
# Number of data points
n = len(x)

# Residuals of the fitted line
residuals = y - (m_hat * x + b_hat)

# Sum of squared residuals (SSR = n * MSE)
SSR = np.sum(residuals**2)

# Residual standard error (RSE), the estimate of sigma
sigma_hat = np.sqrt(SSR / (n - 2))

**Standard Error of the Slope $\hat{m}$**:

$$
\text{Var}(\hat{m}) = \sigma^2 \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
$$

so, estimating $\sigma$ by the residual standard error $\hat{\sigma}$,

$$
SE(\hat{m}) = \sqrt{\text{Var}(\hat{m})} \approx \hat{\sigma} \sqrt{\frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}.
$$

**Standard Error of the Intercept $\hat{b}$**:
$$
\text{Var}(\hat{b}) = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right],
$$

so, again estimating $\sigma$ by $\hat{\sigma}$,

$$
SE(\hat{b}) = \sqrt{\text{Var}(\hat{b})} \approx \hat{\sigma} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}.
$$

In [363]:
# Mean of all x-values
xbar = np.mean(x)

# The denominator part
xx = np.sum((x - xbar)**2)

In [365]:
# Standard Error of m_hat
SE_slope = sigma_hat * np.sqrt(1.0 / xx)
print(SE_slope)

0.010501699789972306


In [367]:
# Standard Error of b_hat
SE_intercept = sigma_hat * np.sqrt(1.0/n + (xbar**2)/xx)
print(SE_intercept)

0.27196026713229987


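**Approximate 95% Confidence Intervals ($\pm 2\,\text{SE}$)**:

Here $2$ approximates the $97.5\%$ quantile of the $t$-distribution with $n-2$ degrees of freedom, which is close to $1.96$ for large $n$.

$$
\hat{m} \pm 2\,\text{SE}(\hat{m}), \quad
\hat{b} \pm 2\,\text{SE}(\hat{b}).
$$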

In [370]:
# Confidence Intervals
m_hat_confidence_interval = (m_hat - 2 * SE_slope, m_hat + 2 * SE_slope)
b_hat_confidence_interval = (b_hat - 2 * SE_intercept, b_hat + 2 * SE_intercept)


print(f"Confidence interval for m_hat: {m_hat_confidence_interval}")
print(f"Confidence interval for b_hat: {b_hat_confidence_interval}")

Confidence interval for m_hat: (-0.1497122330743202, -0.10770543391443098)
Confidence interval for b_hat: (16.473122587373254, 17.560963655902455)
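
As a sanity check on the standard-error formulas, the sketch below compares them with the covariance matrix that `np.polyfit` returns when called with `cov=True`. It runs on synthetic data (an assumption, not College.csv), and all `_demo` names are hypothetical. It assumes a recent NumPy, where the covariance is scaled by SSR divided by $n - 2$ for a degree-1 fit.

```python
import numpy as np

# Synthetic stand-in data (NOT College.csv): a noisy line.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 60, size=200)
y_demo = 17.0 - 0.13 * x_demo + rng.normal(0, 3.5, size=200)

# Fit the line and ask polyfit for the coefficient covariance matrix.
coef_demo, cov_demo = np.polyfit(x_demo, y_demo, 1, cov=True)
se_polyfit = np.sqrt(np.diag(cov_demo))  # [SE(slope), SE(intercept)]

# The same quantities from the closed-form formulas.
n_demo = len(x_demo)
resid_demo = y_demo - np.polyval(coef_demo, x_demo)
sigma_hat_demo = np.sqrt(np.sum(resid_demo**2) / (n_demo - 2))
xbar_demo = x_demo.mean()
sxx_demo = np.sum((x_demo - xbar_demo)**2)
se_slope_demo = sigma_hat_demo * np.sqrt(1.0 / sxx_demo)
se_intercept_demo = sigma_hat_demo * np.sqrt(1.0 / n_demo + xbar_demo**2 / sxx_demo)

print("polyfit :", se_polyfit)
print("formulas:", (se_slope_demo, se_intercept_demo))
```

If the two printed lines agree, the hand-computed `SE_slope` and `SE_intercept` above can be cross-checked the same way by calling `np.polyfit(x, y, 1, cov=True)` on the College data.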


**References:**
- Used AI to help me properly write and format the mathematical formulas and equations.

---

2. Write out an explanation for why $R^2$ is scale invariant. In other words, for an arbitrary constant $c$, if you take all $y$-coordinates in the data $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$ and make the change $y_i \mapsto cy_i$ then show how $R^2$ remains the same.

**Claim:** If we multiply all $y$-coordinates by a constant $c \neq 0$ (for $c = 0$ every $y_i$ becomes $0$ and $R^2$ is undefined), i.e. $y_i \mapsto c y_i$, then

$$ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} $$

remains unchanged.

**Proof:**

Let the linear regression model be $\hat{y} = mx + b$, and define:
- Total Sum of Squares (TSS): $$ TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2 $$

- Sum of Squared Residuals (SSR): $$ SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

- Mean Squared Error (MSE): $$ MSE = \frac{SSR}{n} $$

- Residual Standard Error (RSE): $$ RSE = \sqrt{\frac{SSR}{n - 2}} $$ 

- Proportion of Variance: $$ R^2 = 1 - \frac{SSR}{TSS} $$

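Now suppose we replace $y$ by $c y$. The least-squares coefficients $p = (m, b)$ solve the normal equations $(A^T A) p = A^T \mathbf{y}$, and when $A^T A$ is invertible this solution is linear in $\mathbf{y}$. Replacing $\mathbf{y}$ by $c \mathbf{y}$ therefore replaces $(m, b)$ by $(cm, cb)$, so the fitted values for the scaled data are:

$$ \hat{y}_{(c)} = (c m) x + (c b) = c (m x + b) = c \hat{y}. $$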

- New Mean: $$ \bar{y}_{(c)} = c \bar{y}. $$

- New TSS: $$ TSS_{(c)} = \sum_{i=1}^{n} (c y_i - c \bar{y})^2 = c^2 \sum_{i=1}^{n} (y_i - \bar{y})^2 = c^2 TSS. $$

- New SSR: $$ SSR_{(c)} = \sum_{i=1}^{n} (c y_i - c \hat{y}_i)^2 = c^2 \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = c^2 SSR. $$

- New MSE: $$ MSE_{(c)} = \frac{SSR_{(c)}}{n} = \frac{c^2 SSR}{n} = c^2 MSE. $$

- New RSE: $$ RSE_{(c)} = \sqrt{\frac{SSR_{(c)}}{n - 2}} = \sqrt{\frac{c^2 SSR}{n - 2}} = |c| \, RSE. $$

**Effect on $ R^2 $:**
$$ R^2_{(c)} = 1 - \frac{SSR_{(c)}}{TSS_{(c)}} = 1 - \frac{c^2 SSR}{c^2 TSS} = 1 - \frac{SSR}{TSS} = R^2. $$

- The regression coefficients $m$ and $b$ are scaled by $c$.
- The residuals $y_i - \hat{y}_i$ are scaled by $c$, so SSR and TSS are scaled by $c^2$.
- MSE is scaled by $c^2$.
- RSE is scaled by $|c|$.
- $R^2$ remains unchanged, showing that it is scale-invariant under multiplication by a nonzero constant.
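
The argument above can be checked numerically. The sketch below uses synthetic data and a hypothetical helper `r_squared` (neither is part of the assignment), and evaluates $R^2$ before and after scaling $y$ by several nonzero constants, including a negative one.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of the least-squares line fitted to (x, y)."""
    p = np.polyfit(x, y, 1)
    ssr = np.sum((y - np.polyval(p, x))**2)
    tss = np.sum((y - np.mean(y))**2)
    return 1.0 - ssr / tss

# Synthetic data, used only for illustration.
rng = np.random.default_rng(1)
x_demo = rng.normal(size=50)
y_demo = 2.0 * x_demo + 1.0 + rng.normal(size=50)

r2_base = r_squared(x_demo, y_demo)
r2_scaled = {c: r_squared(x_demo, c * y_demo) for c in (-3.0, 0.5, 10.0, 1000.0)}

for c, r2 in r2_scaled.items():
    print(f"c = {c:>7}: R^2 = {r2:.12f}   (unscaled: {r2_base:.12f})")
```

Any differences between the printed values should be at the level of floating-point rounding.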

**References:**
- Took inspiration from this thread [Coefficient of Determination Invariance on StackExchange](https://stats.stackexchange.com/questions/348758/coefficient-of-determination-invariant-to-centering-and-rescaling-of-variables)
- Used AI to help me properly write and format the mathematical formulas and equations.


---

3. Check that $(\bar{x}, \bar{y})$ necessarily lies on the LSR line. (_Do this without using calculus, which means you should not use that_ $\sum_{i=1}^n(y_i - \hat{y}_i) = 0$.)
Hint: Using ${\bf x}$ (resp. ${\bf y}$) for the vectors of $x$-coords (resp. $y$-coords), check that 
$$A^TA = \begin{bmatrix}{\bf x}\cdot{\bf x} & n\bar{x} \\ n\bar{x} & n\end{bmatrix}; \qquad A^T{\bf y} = \begin{bmatrix}{\bf x}\cdot{\bf y} \\ n\bar{y}\end{bmatrix} $$ 

**Claim:** Plugging $x = \bar{x}$ into the fitted model $\hat{y} = b + m x$ gives $\hat{y}(\bar{x}) = \bar{y}$.

**Proof:** 

Consider the matrix $A$ for simple linear regression of $y$ on $x$:

$$
A = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.
$$

The LSR coefficients $p = (m, b)$ satisfy the normal equations:

$$
(A^T A) p = A^T \mathbf{y}.
$$

- $A^T A$ is a $2 \times 2$ matrix:

  $$
  A^T A = \begin{bmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & n \end{bmatrix}.
  $$

  Substituting $\sum x_i = n \bar{x}$ and $\sum x_i^2 = \mathbf{x} \cdot \mathbf{x}$ gives the form in the hint:

  $$
  A^T A = \begin{bmatrix} \mathbf{x} \cdot \mathbf{x} & n \bar{x} \\ n \bar{x} & n \end{bmatrix}.
  $$

- $A^T \mathbf{y}$ is a $2 \times 1$ vector:

  $$
  A^T \mathbf{y} = \begin{bmatrix} \sum x_i y_i \\ \sum y_i \end{bmatrix} = \begin{bmatrix} \mathbf{x} \cdot \mathbf{y} \\ n \bar{y} \end{bmatrix}.
  $$

Putting it together, the normal equations read:

$$
(A^T A) p = A^T \mathbf{y}
\Rightarrow
\begin{bmatrix} 
\sum x_i^2 & \sum x_i \\  
\sum x_i & n 
\end{bmatrix} 
\begin{bmatrix} 
m \\ 
b 
\end{bmatrix} 
=
\begin{bmatrix} 
\sum x_i y_i \\ 
\sum y_i 
\end{bmatrix}.
$$

Multiplying the matrices results in:

$$
\begin{bmatrix} 
\left( \sum x_i^2 \right) m + \left( \sum x_i \right) b \\  
\left( \sum x_i \right) m + n b
\end{bmatrix} 
=
\begin{bmatrix} 
\sum x_i y_i \\  
\sum y_i
\end{bmatrix}.
$$

This gives the two scalar equations:

1. **First equation:**
   $$
   \left( \sum x_i^2 \right) m + \left( \sum x_i \right) b = \sum x_i y_i.
   $$

2. **Second equation:**
   $$
   \left( \sum x_i \right) m + n b = \sum y_i.
   $$

Since we know that $ \sum x_i = n \bar{x} $ and $ \sum y_i = n \bar{y} $, we substitute these into the second equation:

$$
n \bar{x} m + n b = n \bar{y}.
$$

Dividing everything by $n$:

$$
\bar{x} m + b = \bar{y}.
$$

Thus, the point $ (\bar{x}, \bar{y}) $ lies on the LSR line.
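
As a numerical illustration of the proof, the sketch below builds $A^T A$ and $A^T \mathbf{y}$ in the form given in the hint, solves the normal equations, and evaluates the fitted line at $\bar{x}$. The data are synthetic and the `_demo` names are hypothetical.

```python
import numpy as np

# Synthetic data, used only for illustration.
rng = np.random.default_rng(2)
x_demo = rng.uniform(-5, 5, size=30)
y_demo = 0.7 * x_demo - 2.0 + rng.normal(size=30)
n_demo = len(x_demo)
xbar_demo, ybar_demo = x_demo.mean(), y_demo.mean()

# Normal-equation matrices written as in the hint.
AtA = np.array([[x_demo @ x_demo, n_demo * xbar_demo],
                [n_demo * xbar_demo, n_demo]])
Aty = np.array([x_demo @ y_demo, n_demo * ybar_demo])

# Solve (A^T A) p = A^T y for p = (m, b).
m_demo, b_demo = np.linalg.solve(AtA, Aty)

# Difference between the line's value at xbar and ybar.
gap = (m_demo * xbar_demo + b_demo) - ybar_demo
print("m =", m_demo, " b =", b_demo, " gap =", gap)
```

The printed `gap` should be zero up to floating-point rounding, matching $\bar{x} m + b = \bar{y}$.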

**References:**
- Used AI to help me properly write and format the mathematical formulas, matrices/vectors, and equations.

---

4. Suppose that you have a data set, with $n=100$ points, with one independent variable $x$ and one response variable $y$. First, suppose that you fit a simple linear regression to the data. In addition, suppose that you also fit a cubic regression (that is, $\hat{y} = \hat{p}_0x^3 + \hat{p}_1x^2 + \hat{p}_2x + \hat{p}_3$). 

* Which of the two models, the linear regression or the cubic regression, should be expected to have a smaller MSE on the data? Do we have enough information to tell? Explain your reasoning. Additionally, you may use computations on example data to support your conclusion. 

In [380]:
# Simple linear regression
p_linear = np.polyfit(x, y, deg=1)  
ssr_linear = np.sum( (np.polyval(p_linear,x) - y)**2 )
mse_linear = ssr_linear / n

# Cubic regression
p_cubic = np.polyfit(x, y, deg=3)  
ssr_cubic = np.sum( (np.polyval(p_cubic,x) - y)**2 )
mse_cubic = ssr_cubic / n

print("MSE (linear) =", mse_linear)
print("MSE (cubic)  =", mse_cubic)

MSE (linear) = 13.107820886560184
MSE (cubic)  = 13.071311919557951


The linear model $\hat{y} = b + m x$ is the special case of the cubic model with $p_0 = p_1 = 0, \quad p_2 = m, \quad p_3 = b.$

- Equivalently, the columns $ \{1, x\} $ in $ X_{\text{lin}} $ are contained in the columns $ \{x^3, x^2, x, 1\} $ of $ X_{\text{cubic}} $.
- The space spanned by $ \{1, x\} $ is a subspace of the space spanned by $ \{1, x, x^2, x^3\} $.

Since the cubic fit is allowed to use more columns (more free parameters), it can always set $p_0 = p_1 = 0$ and match the linear solution exactly if that turns out best, or it can choose nonzero $p_0, p_1$ to get an even smaller SSR.

- Therefore, the best cubic fit cannot yield a larger SSR than the best linear fit; it can only be the same or smaller. This argument holds for any data set, so we do have enough information to answer the question for the data used in the fit. The MSE values computed above (about 13.0713 for the cubic fit and 13.1078 for the linear fit) are consistent with this.
- Dividing by $ n $ converts SSR to MSE, so:

  $$
  \text{MSE}_{\text{cubic}} \leq \text{MSE}_{\text{linear}}.
  $$
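
To mirror the setting of the question, here is a small sketch with $n = 100$ synthetic points whose true relationship is exactly linear plus noise (an assumption made for illustration). The helper `training_mse` is hypothetical. Even in this case the cubic fit should not have a larger MSE on the fitted data.

```python
import numpy as np

def training_mse(x, y, deg):
    """MSE of the degree-`deg` least-squares polynomial on the data it was fit to."""
    p = np.polyfit(x, y, deg)
    return np.mean((np.polyval(p, x) - y)**2)

# Synthetic data: the true relationship is linear, n = 100.
rng = np.random.default_rng(3)
x_demo = rng.uniform(-2, 2, size=100)
y_demo = 1.5 * x_demo + 0.5 + rng.normal(0, 1.0, size=100)

mse_lin_demo = training_mse(x_demo, y_demo, 1)
mse_cub_demo = training_mse(x_demo, y_demo, 3)

print("MSE (linear) =", mse_lin_demo)
print("MSE (cubic)  =", mse_cub_demo)
print("cubic <= linear:", mse_cub_demo <= mse_lin_demo)
```

The inequality concerns only the data used for fitting; the sketch makes no claim about error on new data.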

**References:**
- Used AI to help me properly write and format the mathematical formulas, matrices/vectors, and equations.

---

**Extra challenge**

5. Using the 'College.csv' data set again, compute a column (or array), `F_Undergrad_perc`, that has the percentage of undergraduate students that are full-time &ndash; e.g., if you read in the DataFrame as `college`, you could use `100*(college['F_Undergrad']/(college['F_Undergrad'] + college['P_Undergrad']))` to get this column.
* With the column `Grad_Rate` as your response variable and the columns `perc_alumni`, `S_F_Ratio`, `Top25perc`, `F_Undergrad_perc`, and `Outstate` as your independent variables, use multiple linear regression on these variables to predict the values in `Grad_Rate`. What are the regression coefficients? 
* Which variable(s) are not significant?
> In the analysis to estimate how much coefficients fluctuate relative to their size, use 100 subsamples that are at least half the size of the data set.

In [403]:
# Creates F_Undergrad_perc column
college_df["F_Undergrad_perc"] = 100 * (college_df["F_Undergrad"] / (college_df["F_Undergrad"] + college_df["P_Undergrad"]))
college_df.head()

Unnamed: 0,School,Private,Apps,Accept,Enroll,Top10perc,Top25perc,F_Undergrad,P_Undergrad,Outstate,Room_Board,Books,Personal,PhD,Terminal,S_F_Ratio,perc_alumni,Expend,Grad_Rate,F_Undergrad_perc
0,Abilene Christian University,Yes,1660,1232,721,23,52,2885,537,7440,3300,450,2200,70,78,18.1,12,7041,60,84.307423
1,Adelphi University,Yes,2186,1924,512,16,29,2683,1227,12280,6450,750,1500,29,30,12.2,16,10527,56,68.618926
2,Adrian College,Yes,1428,1097,336,22,50,1036,99,11250,3750,400,1165,53,66,12.9,30,8735,54,91.277533
3,Agnes Scott College,Yes,417,349,137,60,89,510,63,12960,5450,450,875,92,97,7.7,37,19016,59,89.005236
4,Alaska Pacific University,Yes,193,146,55,16,44,249,869,7560,4120,800,1500,76,72,11.9,2,10922,15,22.271914
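
The regression and subsampling steps requested in the problem are not carried out above. One possible way to do them is sketched below, on synthetic data so that the block is self-contained. The helper names `fit_mlr` and `subsample_coefs` are hypothetical, and comparing $|\text{mean}|$ to the standard deviation of each coefficient across subsamples is an assumed heuristic for "fluctuation relative to size", not a formal significance test.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares coefficients for y ~ X; the intercept is the last entry."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def subsample_coefs(X, y, n_subsamples=100, min_frac=0.5, seed=0):
    """Refit on random subsamples whose size is at least min_frac of the data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    min_size = int(np.ceil(min_frac * n))
    rows = []
    for _ in range(n_subsamples):
        size = rng.integers(min_size, n + 1)           # size in [min_size, n]
        idx = rng.choice(n, size=size, replace=False)  # rows without replacement
        rows.append(fit_mlr(X[idx], y[idx]))
    return np.array(rows)

# Synthetic demo (NOT College.csv): column 1 has no true effect on y.
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(300, 3))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 2] + 5.0 + rng.normal(size=300)

coefs_demo = subsample_coefs(X_demo, y_demo)
ratio_demo = np.abs(coefs_demo.mean(axis=0)) / coefs_demo.std(axis=0, ddof=1)

print("full-data coefficients:", fit_mlr(X_demo, y_demo))
print("|mean| / std across subsamples:", ratio_demo)
```

To apply this to the notebook's data, one could pass `college_df[["perc_alumni", "S_F_Ratio", "Top25perc", "F_Undergrad_perc", "Outstate"]].to_numpy(dtype=float)` as `X` and `college_df["Grad_Rate"].to_numpy(dtype=float)` as `y`. A coefficient whose ratio is small (its fluctuation is comparable to its size) would be a candidate for "not significant" under this heuristic.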


**References:**
- https://stackoverflow.com/questions/11479064/multiple-linear-regression-in-python
- Used AI to help me properly write and format the mathematical formulas, matrices/vectors, and equations.