# 3.7 Exercises

This notebook is intended as a Python translation of the exercises in Chapter 3 of *An Introduction to Statistical Learning*.


## Installation
There are multiple ways to get a notebook environment working. The easiest is probably to install Anaconda and work off of that premade setup.

However, I personally recommend using ASDF to install versions of Python and using Poetry to manage package dependencies. This is because I tend to use separate environments for each project I'm working on (including this one) and I previously had problems with `conda` installs.

Please look at the README.md in the base of the repository to see how to install things.


## Import necessary libraries

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from numpy.random import Generator, PCG64
import pandas as pd
from scipy import stats
import statsmodels.api as sm

scipy_seed = 1304

numpy_randomGen = Generator(PCG64(scipy_seed))

### Problem 8

We are looking at the `Auto` data set and want to perform simple linear regression.

In [2]:
auto_dataframe_path = '../data/Auto.csv' # filepath location of the dataset
cols = list(pd.read_csv(auto_dataframe_path, nrows=1)) # get column names of the dataset and put in list
auto_dataframe = pd.read_csv(auto_dataframe_path, usecols = [i for i in cols if i!= "Unnamed: 0"]) # remove the Index column that is used by R
auto_dataframe.head(10)

,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino
5,15.0,8,429.0,198,4341,10.0,70,1,ford galaxie 500
6,14.0,8,454.0,220,4354,9.0,70,1,chevrolet impala
7,14.0,8,440.0,215,4312,8.5,70,1,plymouth fury iii
8,14.0,8,455.0,225,4425,10.0,70,1,pontiac catalina
9,15.0,8,390.0,190,3850,8.5,70,1,amc ambassador dpl


(a) The book says to use the `lm()` function to fit a simple linear regression model on `horsepower` to predict `mpg`.

An analog for this is the `OLS` class in the `statsmodels.api` module.

In [3]:
?sm.OLS

Init signature: sm.OLS(endog, exog=None, missing='none', hasconst=None, **kwargs)
Docstring:
Ordinary Least Squares

Parameters
----------
endog : array_like
    A 1-d endogenous response variable. The dependent variable.
exog : array_like
    A nobs x k array where `nobs` is the number of observations and `k`
    is the number of regressors. An intercept is not included by default
    and should be added by the user. See
    :func:`statsmodels.tools.add_constant`.
missing : str
    Available options are 'none', 'drop', and 'raise'. If 'none', no nan
    checking is done. If 'drop', any observations with nans are dropped.
    If 'raise', an error is raised. Default is 'none'.
hasconst : None or bool
    Indicates whether the RH

Use the `summary()` method of the fitted results object (the analog of R's `summary()`) to print the results. Comment on the output. For example:

8. (a) i. Is there a relationship between the predictor (`horsepower`) and the response (`mpg`)?

8. (a) ii. How strong is the relationship between the predictor and the response?

8. (a) iii. Is the relationship between the predictor and the response positive or negative?

8. (a) iv. What is the predicted `mpg` associated with a `horsepower` of 98? What are the associated 95% confidence and prediction intervals?

8. (b) Plot the response and the predictor. Use the `matplotlib.pyplot.plot` function to make a line/scatter plot. You will need to save the fitted results and extract the intercept and slope. Afterwards, use the x values to calculate predictions and plot them with matplotlib (the predictions can be stored in the same dataframe).


We are trying to overlay a line plot (the regression) on a scatter plot (the observations).
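A minimal sketch of the overlay, again on invented data. It uses `np.polyfit` as a quick stand-in for the fitted model's intercept and slope; with a statsmodels fit, the same two numbers would come from the results object's `params`.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for horsepower and mpg (assumption: invented values).
rng = np.random.default_rng(1304)
horsepower = rng.uniform(50, 230, size=200)
mpg = 40 - 0.15 * horsepower + rng.normal(scale=4, size=200)

# np.polyfit with deg=1 returns [slope, intercept] of the least squares line.
slope, intercept = np.polyfit(horsepower, mpg, deg=1)

fig, ax = plt.subplots()
ax.scatter(horsepower, mpg, s=10, alpha=0.6, label="observations")
x_line = np.linspace(horsepower.min(), horsepower.max(), 100)
ax.plot(x_line, intercept + slope * x_line, color="red", label="least squares fit")
ax.set_xlabel("horsepower")
ax.set_ylabel("mpg")
ax.legend()
```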

8. (c) Use the `plot()` function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

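May need to come back and correct the function call.
In R, calling `plot()` on a fitted `lm` object produces a set of diagnostic plots (residuals vs. fitted values, normal Q-Q, scale-location, and residuals vs. leverage), so the equivalent plots here can be assembled from the residuals and fitted values with matplotlib.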
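A sketch of two of the usual diagnostic plots, residuals against fitted values and a normal Q-Q plot of the residuals, built by hand. The data are invented, with a deliberately curved relationship so that the residual plot has a pattern to show. `scipy.stats.probplot` draws the Q-Q plot onto the axes passed through `plot=`.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(1304)
x = rng.uniform(50, 230, size=200)
# Curved truth (invented) that a straight-line fit cannot capture.
y = 60 - 0.4 * x + 0.001 * x**2 + rng.normal(scale=3, size=200)

slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
residuals = y - fitted

fig, (ax_resid, ax_qq) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs fitted: a U-shape suggests non-linearity in the data.
ax_resid.scatter(fitted, residuals, s=10)
ax_resid.axhline(0, color="grey", linestyle="--")
ax_resid.set_xlabel("fitted values")
ax_resid.set_ylabel("residuals")
ax_resid.set_title("Residuals vs fitted")

# Normal Q-Q plot of the residuals.
stats.probplot(residuals, dist="norm", plot=ax_qq)
ax_qq.set_title("Normal Q-Q")
```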

### Problem 9

This question involves the use of multiple linear regression on the `Auto` data.

9. (a) Produce a scatterplot matrix which includes all of the variables in the data set.

Check function calls
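One candidate for the scatterplot matrix is `pandas.plotting.scatter_matrix`. The sketch below runs it on a small invented frame; the column names mirror the `Auto` data but the values are synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with three Auto-like columns (assumption: invented values).
rng = np.random.default_rng(1304)
weight = rng.uniform(1600, 5000, size=150)
demo = pd.DataFrame({
    "weight": weight,
    "horsepower": 0.04 * weight + rng.normal(scale=15, size=150),
    "mpg": 46 - 0.0075 * weight + rng.normal(scale=3, size=150),
})

# Returns a 2-D array of matplotlib Axes, one per pair of columns.
axes = pd.plotting.scatter_matrix(demo, figsize=(8, 8), diagonal="hist")
```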

9. (b) Compute the matrix of correlations between the variables using the function `pd.DataFrame.corr()`. You will need to exclude the `name` variable, which is qualitative.

Go back and check the function calls
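A short sketch of the correlation step, using the first six rows shown in the `head()` output above (three numeric columns plus `name`). Dropping `name` before calling `corr()` keeps the qualitative column out of the computation.

```python
import pandas as pd

# First six rows of the Auto data as displayed earlier in this notebook.
demo = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0, 16.0, 17.0, 15.0],
    "horsepower": [130, 165, 150, 150, 140, 198],
    "weight": [3504, 3693, 3436, 3433, 3449, 4341],
    "name": [
        "chevrolet chevelle malibu", "buick skylark 320", "plymouth satellite",
        "amc rebel sst", "ford torino", "ford galaxie 500",
    ],
})

# Exclude the qualitative column, then compute pairwise Pearson correlations.
corr_matrix = demo.drop(columns="name").corr()
print(corr_matrix.round(3))
```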

9. (c) Use the `sm.OLS` function to perform a multiple linear regression with `mpg` as the response and all other variables except `name` as the predictors. 

Use the `summary()` method of the fitted results object to print the results. Comment on the output. 

(c) i. Is there a relationship between the predictors and the response?

(c) ii. Which predictors appear to have a statistically significant relationship to the response?

(c) iii. What does the coefficient for the `year` variable suggest?

(d) Use the `plot()` function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. 

Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

May not have as easy access to residual or leverage plots...

(e) Use the `*` and `:` symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?


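These are R formula operators, not arithmetic or sequence operators.

In a formula, `a:b` adds only the interaction term between `a` and `b`, while `a*b` is shorthand for `a + b + a:b`. The formula interface in `statsmodels.formula.api` accepts the same syntax.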
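A sketch of how these operators behave in the `statsmodels.formula.api` interface, which accepts R-style formulas: `*` expands to both main effects plus their interaction, while `:` adds the interaction alone. The data and coefficients are invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in with an interaction built in (assumption: invented values).
rng = np.random.default_rng(1304)
demo = pd.DataFrame({
    "horsepower": rng.uniform(50, 230, size=200),
    "weight": rng.uniform(1600, 5000, size=200),
})
demo["mpg"] = (
    60
    - 0.1 * demo["horsepower"]
    - 0.008 * demo["weight"]
    + 0.00002 * demo["horsepower"] * demo["weight"]
    + rng.normal(scale=2, size=200)
)

# `*` gives main effects and the interaction; `:` gives the interaction only.
full = smf.ols("mpg ~ horsepower * weight", data=demo).fit()
only_interaction = smf.ols("mpg ~ horsepower:weight", data=demo).fit()

print(list(full.params.index))
print(list(only_interaction.params.index))
```

The p-value of the interaction term is then available as `full.pvalues["horsepower:weight"]`.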

(f) Try a few different transformations of the variables, such as $\log(X)$ (`np.log()`), $\sqrt{X}$ (`np.sqrt()`), and $X^2$ (`np.power()`). Comment on your findings.

### Problem 10

This question uses the `Carseats` data set.

(a) Fit a multiple regression model to predict `Sales` using `Price`, `Urban`, and `US`

(b) Provide an interpretation of each coefficient in the model. Be careful! Some of the variables in the model are qualitative!

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

(d) For which of the predictors can you reject the null hypothesis $H_{0}$: $β_{j}$ = 0 ?

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

(f) How well do the models in (a) and (e) fit the data?

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s)

(h) Is there evidence of outliers or high leverage observations in the model from (e)?
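A sketch of how qualitative predictors could be handled with the formula interface, which dummy-codes string columns automatically. The frame below is synthetic (the real `Carseats` data is not loaded here) and its coefficients are invented. A term such as `US[T.Yes]` measures the shift in `Sales` for `US == "Yes"` relative to the baseline level `"No"`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for Carseats (assumption: invented values and effects).
rng = np.random.default_rng(1304)
n = 200
demo = pd.DataFrame({
    "Price": rng.uniform(25, 190, size=n),
    "Urban": rng.choice(["No", "Yes"], size=n),
    "US": rng.choice(["No", "Yes"], size=n),
})
demo["Sales"] = (
    13
    - 0.05 * demo["Price"]
    + 1.2 * (demo["US"] == "Yes")
    + rng.normal(scale=2.5, size=n)
)

results = smf.ols("Sales ~ Price + Urban + US", data=demo).fit()
print(results.params)                  # one dummy coefficient per Yes/No column
print(results.conf_int(alpha=0.05))    # 95% confidence intervals
```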

### Problem 11

In this problem we will investigate the *t*-statistic for the null hypothesis $H_{0}$ : $β$ = 0 in simple linear regression without an intercept. To
begin, we generate a predictor `x` and a response `y` as follows.

In [6]:
np.random.seed(200)
x1_np = np.random.standard_normal(size=100)

y1_np = 2 * x1_np + np.random.standard_normal(size=100)



(a) Perform a simple linear regression of `y` onto `x`, *without* an intercept. Report the coefficient estimate $\hat{β}$, the standard error of
this coefficient estimate, and the *t*-statistic and *p*-value associated with the null hypothesis $H_{0}$ : $β$ = 0. Comment on these
results. (You can perform regression without an intercept by calling `sm.OLS` without first adding a constant column, since `sm.OLS` does not
include an intercept by default.)

In [None]:
# Hint: use sm.OLS(), and you will have to set the first element of y to be z

(b) Now perform a simple linear regression of `x` onto `y` without an
intercept, and report the coefficient estimate, its standard error,
and the corresponding *t*-statistic and *p*-values associated with
the null hypothesis $H_{0}$ : $β$ = 0. Comment on these results.

(c) What is the relationship between the results obtained in (a) and
(b)?

(d) For the regression of *Y* onto *X* without an intercept, the *t*-statistic for $H_{0}$ : $β$ = 0 takes the form $\hat{β}/SE(\hat{β})$, where $\hat{β}$ is
given by (3.38), and where

\begin{gather*}
SE(\hat{β}) = \sqrt{ \frac{\sum_{i=1}^{n} (y_{i} - x_{i}\hat{β})^2}{(n-1)\sum_{i'=1}^{n} x_{i'}^2}}.
\end{gather*}


(These formulas are slightly different from those given in Sections 3.1.1 and 3.1.2, since here we are performing regression
without an intercept.) Show algebraically, and confirm numerically in `Python`, that the *t*-statistic can be written as

\begin{gather*}
\frac{(\sqrt{n-1})\sum_{i=1}^{n} x_{i}y_{i}}{\sqrt{ (\sum_{i=1}^{n} x_{i}^2)(\sum_{i'=1}^{n} y_{i'}^2) - (\sum_{i'=1}^{n}x_{i'}y_{i'})^2 }}.
\end{gather*}
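The numerical half of (d) can be checked with plain NumPy. This sketch computes the *t*-statistic two ways, once as $\hat{β}/SE(\hat{β})$ and once from the closed-form expression, on the same simulated data as above (named `x` and `y` here).

```python
import numpy as np

np.random.seed(200)
x = np.random.standard_normal(size=100)
y = 2 * x + np.random.standard_normal(size=100)
n = len(x)

# Coefficient estimate without an intercept, equation (3.38).
beta_hat = np.sum(x * y) / np.sum(x**2)

# Standard error from the formula given in (d).
se_beta = np.sqrt(np.sum((y - x * beta_hat) ** 2) / ((n - 1) * np.sum(x**2)))
t_direct = beta_hat / se_beta

# Closed-form expression for the same t-statistic.
t_closed = (
    np.sqrt(n - 1) * np.sum(x * y)
    / np.sqrt(np.sum(x**2) * np.sum(y**2) - np.sum(x * y) ** 2)
)

print(t_direct, t_closed)
```

The closed-form expression is symmetric in `x` and `y`, which is the observation part (e) builds on.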

(e) Using the results from (d), argue that the *t*-statistic for the regression of `y` onto `x` is the same as the *t*-statistic for the regression
of `x` onto `y`.

(f) In Python, show that when regression is performed *with* an intercept,
the *t*-statistic for $H_{0}$ : $β_{1}$ = 0 is the same for the regression of `y` onto `x` as it is for the regression of `x` onto `y`.

### Problem 12

This problem involves simple linear regression without an intercept.

(a) Recall that the coefficient estimate $\hat{β}$ for the linear regression of
$Y$ onto $X$ without an intercept is given by (3.38). Under what
circumstance is the coefficient estimate for the regression of $X$
onto $Y$ the same as the coefficient estimate for the regression of
$Y$ onto $X$ ?

(b) Generate an example in `Python` with *n* = 100 observations in which
the coefficient estimate for the regression of *X* onto *Y* is *different
from* the coefficient estimate for the regression of *Y* onto *X*.

(c) Generate an example in `Python` with *n* = 100 observations in which
the coefficient estimate for the regression of *X* onto *Y* is *the
same* as the coefficient estimate for the regression of *Y* onto *X*.
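One way to build both examples, as a sketch. It relies on the fact that, by (3.38), the two no-intercept estimates share the numerator $\sum x_{i}y_{i}$ and differ only in the denominator ($\sum x_{i}^2$ versus $\sum y_{i}^2$). The helper name `slope_no_intercept` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1304)

def slope_no_intercept(response, predictor):
    """Coefficient estimate (3.38) for regressing response onto predictor."""
    return np.sum(predictor * response) / np.sum(predictor**2)

x = rng.standard_normal(100)

# (b) Different estimates: y has a different sum of squares than x.
y_diff = 2 * x + rng.standard_normal(100)
print(slope_no_intercept(y_diff, x), slope_no_intercept(x, y_diff))

# (c) Same estimates: a permutation of x has the same sum of squares as x.
y_same = rng.permutation(x)
print(slope_no_intercept(y_same, x), slope_no_intercept(x, y_same))
```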

### Problem 13

In this exercise you will create some simulated data and will fit simple
linear regression models to it. Make sure to use `np.random.seed(1)` prior to
starting part (a) to ensure consistent results.

13. (a) Using the `np.random.standard_normal()` function, create a vector, `x`, containing 100 observations drawn from a $N(0, 1)$ distribution. This represents
a feature, *X*.

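13. (b) Using the `np.random.normal()` function, create a vector, `eps`, containing 100 observations drawn from a $N(0, 0.25)$ distribution, that is, a normal
distribution with mean zero and variance 0.25. Note that `np.random.normal()` takes the standard deviation as its `scale` argument, so a variance of 0.25 corresponds to `scale=0.5`.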

13. (c) Using `x` and `eps`, generate a vector `y` according to the model

\begin{gather*}
Y = -1 + 0.5X + ϵ. \tag{3.39}
\end{gather*}
What is the length of the vector `y`? What are the values of $β_{0}$
and $β_{1}$ in this linear model?
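A sketch of the data generation for parts (a) to (c). The one detail that is easy to get wrong is that `np.random.normal` is parameterized by the standard deviation, so the variance of 0.25 becomes `scale=0.5`.

```python
import numpy as np

np.random.seed(1)

# (a) Feature X: 100 draws from N(0, 1).
x = np.random.standard_normal(size=100)

# (b) Noise: variance 0.25 means standard deviation 0.5.
eps = np.random.normal(loc=0.0, scale=0.5, size=100)

# (c) Response according to model (3.39).
y = -1 + 0.5 * x + eps

print(len(y))
```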

13. (d) Create a scatterplot displaying the relationship between `x` and
`y`. Comment on what you observe.

13. (e) Fit a least squares linear model to predict `y` using `x`. Comment
on the model obtained. How do $\hat{β}_{0}$ and $\hat{β}_{1}$ compare to $β_{0}$ and
$β_{1}$?

13. (f) Display the least squares line on the scatterplot obtained in (d).
Draw the population regression line on the plot, in a different
color. Use the `pandas.DataFrame.plot(legend=True)` command to create an appropriate legend.

13. (g) Now fit a polynomial regression model that predicts `y` using `x`
and $x^2$. Is there evidence that the quadratic term improves the
model fit? Explain your answer.

TODO: find out how to mix superscripts with code formatting.

13. (h) Repeat (a)–(f) after modifying the data generation process in
such a way that there is less noise in the data. The model (3.39)
should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term
$ϵ$ in (b). Describe your results.

13. (i) Repeat (a)–(f) after modifying the data generation process in
such a way that there is more noise in the data. The model
(3.39) should remain the same. You can do this by increasing
the variance of the normal distribution used to generate the
error term $ϵ$ in (b). Describe your results.

13. (j) What are the confidence intervals for $β_{0}$ and $β_{1}$ based on the
original data set, the noisier data set, and the less noisy data
set? Comment on your results.

### Problem 14
This problem focuses on the collinearity problem.

14. (a) Perform the following commands in Python with numpy:
> np.random.seed(200) 
> 
> x1 = np.random.uniform(size=100)
> 
> x2 = 0.5 * x1 + np.random.normal(size=100) / 10
> 
> y = 2 + 2 * x1 + 0.3 * x2 + np.random.normal(size=100)


The last line corresponds to creating a linear model in which y is
a function of x1 and x2. Write out the form of the linear model.
What are the regression coefficients?

14. (b) What is the correlation between x1 and x2? Create a scatterplot
displaying the relationship between the variables.
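A sketch of the correlation computation, assuming 100 draws per variable as in the book's R code (`runif(100)` and `rnorm(100)`). `np.corrcoef` returns the full 2 x 2 correlation matrix, so the off-diagonal entry is the value of interest.

```python
import numpy as np

np.random.seed(200)
x1 = np.random.uniform(size=100)
x2 = 0.5 * x1 + np.random.normal(size=100) / 10

# Off-diagonal entry of the 2 x 2 correlation matrix.
corr_x1_x2 = np.corrcoef(x1, x2)[0, 1]
print(round(corr_x1_x2, 3))
```

A scatterplot of `x1` against `x2` (for example with `plt.scatter(x1, x2)`) shows the same relationship visually.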

14. (c) Using this data, fit a least squares regression to predict y using
x1 and x2. Describe the results obtained. What are $\hat{β}_{0}$, $\hat{β}_{1}$, and
$\hat{β}_{2}$? How do these relate to the true $β_{0}$, $β_{1}$, and $β_{2}$? Can you
reject the null hypothesis $H_{0}$ : $β_{1}$ = 0? How about the null
hypothesis $H_{0}$ : $β_{2}$ = 0?

14. (d) Now fit a least squares regression to predict y using only x1.
Comment on your results. Can you reject the null hypothesis
$H_{0}$ : $β_{1}$ = 0?

14. (e) Now fit a least squares regression to predict y using only x2.
Comment on your results. Can you reject the null hypothesis
$H_{0}$ : $β_{1}$ = 0?

14. (f) Do the results obtained in (c)–(e) contradict each other? Explain
your answer.

14. (g) Now suppose we obtain one additional observation, which was
unfortunately mismeasured.
> x1 = np.append(x1, 0.1)
> 
> x2 = np.append(x2, 0.8)
> 
> y = np.append(y, 6)


Re-fit the linear models from (c) to (e) using this new data. What
effect does this new observation have on each of the models?
In each model, is this observation an outlier? A high-leverage
point? Both? Explain your answers.

### Problem 15

This problem involves the `Boston` data set, which we saw in the lab
for this chapter. We will now try to predict per capita crime rate
using the other variables in this data set. In other words, per capita
crime rate is the response, and the other variables are the predictors.

15. (a) For each predictor, fit a simple linear regression model to predict
the response. Describe your results. In which of the models is
there a statistically significant association between the predictor
and the response? Create some plots to back up your assertions.

15. (b) Fit a multiple regression model to predict the response using
all of the predictors. Describe your results. For which predictors
can we reject the null hypothesis $H_{0}$ : $β_{j}$ = 0?

15. (c) How do your results from (a) compare to your results from (b)?
Create a plot displaying the univariate regression coefficients
from (a) on the x-axis, and the multiple regression coefficients
from (b) on the y-axis. That is, each predictor is displayed as a
single point in the plot. Its coefficient in a simple linear regression model is shown on the x-axis, and its coefficient estimate
in the multiple linear regression model is shown on the y-axis.

15. (d) Is there evidence of non-linear association between any of the
predictors and the response? To answer this question, for each
predictor *X*, fit a model of the form

\begin{gather*}
Y = β_{0} + β_{1}X + β_{2}X^2 + β_{3}X^3 + ϵ
\end{gather*}