# Introduction to Statistical Learning with Applications in Python

## 3.7 Linear Regression Exercises

### Conceptual

*1. Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.*

Each p-value tests the null hypothesis that one advertising medium (TV, radio, or newspaper) has no relationship with sales once spending on the other two is held constant. The near-zero p-values for TV and radio let us reject that null: higher spending on either medium is associated with higher sales, given the other budgets. The high p-value for newspaper means we cannot reject its null, so there is no evidence that newspaper spending is related to sales once TV and radio are accounted for.
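
In symbols, and assuming Table 3.4 reports the multiple regression of sales on all three budgets (the $\beta_j$ labels are notation introduced here for illustration), each of those p-values tests one of the following:

$$
\text{sales} = \beta_0 + \beta_1 \, \text{TV} + \beta_2 \, \text{radio} + \beta_3 \, \text{newspaper} + \epsilon, \qquad H_0^{(j)}: \beta_j = 0 \quad \text{vs.} \quad H_a^{(j)}: \beta_j \neq 0, \quad j = 1, 2, 3
$$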

*2. Carefully explain the differences between the KNN classifier and KNN regression methods.*

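Formulaically they are similar, but their applications and outputs are not. Both are supervised methods that start by finding the $K$ training observations nearest to a test point $x_0$. The KNN classifier predicts a qualitative response $Y$: it estimates the probability of each class as the fraction of those $K$ neighbors belonging to it and assigns the most common class. KNN regression predicts a quantitative response: it estimates $f(x_0)$ as the average of the neighbors' response values.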
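
To make the contrast concrete, here is a minimal sketch of both methods sharing one neighbor search. The function names and the toy data are made up for illustration, and Euclidean distance is an assumption; this is not a reference implementation.

```python
from collections import Counter

import numpy as np


def k_nearest(X_train, x0, k):
    """Indices of the k training rows closest to x0 (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    return np.argsort(dists)[:k]


def knn_classify(X_train, y_train, x0, k):
    """Classifier: majority vote over the k nearest labels."""
    idx = k_nearest(X_train, x0, k)
    return Counter(y_train[i] for i in idx).most_common(1)[0][0]


def knn_regress(X_train, y_train, x0, k):
    """Regression: mean of the k nearest response values."""
    idx = k_nearest(X_train, x0, k)
    return float(np.mean(y_train[idx]))


# Toy data, invented for this example.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
labels = np.array(["a", "a", "b", "b", "b"])      # qualitative response
values = np.array([1.0, 2.0, 3.0, 10.0, 12.0])    # quantitative response
x0 = np.array([0.5])

label_hat = knn_classify(X, labels, x0, k=3)   # a class label
value_hat = knn_regress(X, values, x0, k=3)    # a number
print(label_hat, value_hat)
```

The only difference between the two functions is the last step: a vote versus an average.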

*3. Suppose we have a data set with five predictors, $X_1=GPA$, $X_2=IQ$, $X_3=Gender$ ($1$ for Female and $0$ for Male), $X_4=GPA \times IQ$, and $X_5 = GPA \times Gender$. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get*

$$
\hat{\beta}_0 = 50, \quad \hat{\beta}_1 = 20, \quad \hat{\beta}_2 = 0.07, \quad \hat{\beta}_3 = 35, \quad \hat{\beta}_4 = 0.01, \quad \hat{\beta}_5 = -10
$$

*a. Which answer is correct, and why?*

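For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough. The only terms that differ by gender are the $Gender$ term and the $GPA \times Gender$ term, so the predicted female-minus-male gap is $35 - 10 \times GPA$ (in thousands of dollars). That gap is negative when $GPA > 3.5$, so males earn more above a GPA of 3.5 and females earn more below it.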

*b. Predict the salary of a female with an IQ of 110 and a GPA of 4.0*

$$
\begin{aligned}
\hat{y} &= 50 + (20 \times 4) + (0.07 \times 110) + (35 \times 1) + (0.01 \times (4 \times 110)) + (-10 \times (4 \times 1)) \\
&= 50 + 80 + 7.7 + 35 + 4.4 - 40 = \$137.1\text{k}
\end{aligned}
$$
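
The arithmetic in parts a and b can be checked with a few lines of plain Python. The function names are made up for this sketch; the coefficients are the fitted values given in the exercise.

```python
def predicted_salary(gpa, iq, female):
    """Predicted starting salary in thousands of dollars.

    Uses the fitted coefficients from the exercise; `female` is 1 or 0.
    """
    return (50 + 20 * gpa + 0.07 * iq + 35 * female
            + 0.01 * gpa * iq - 10 * gpa * female)


def gender_gap(gpa, iq=110):
    """Female minus male prediction at the same GPA and IQ."""
    return predicted_salary(gpa, iq, 1) - predicted_salary(gpa, iq, 0)


# Part b: female, IQ 110, GPA 4.0
female_4_110 = predicted_salary(4.0, 110, 1)
print(round(female_4_110, 1))

# Part a: the gap changes sign at a GPA of 3.5
for gpa in (3.0, 3.5, 4.0):
    print(gpa, round(gender_gap(gpa), 2))
```

The IQ value does not affect the gap, because IQ never interacts with gender in this model.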

*c. True or false: Since the coefficient for the $GPA \times IQ$ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.*

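False. The size of a coefficient depends on the units of its predictor: $GPA \times IQ$ takes values in the hundreds, so a coefficient of $0.01$ still contributed $4.4$ (thousand dollars) to the prediction in part b. Evidence for an interaction comes from the coefficient's t-statistic and p-value, that is, from its size relative to its standard error, and neither is given here.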
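
A small simulation can illustrate why coefficient size alone says nothing about evidence. This is a sketch on synthetic data with made-up names: the same interaction is expressed in two different units, which changes the coefficient by a factor of 1000 but leaves its t-statistic unchanged.

```python
import numpy as np


def ols_t_stats(X, y):
    """Least-squares coefficients and t-statistics for y regressed on X.

    X must already include a column of ones if an intercept is wanted.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)                 # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, beta / se


# Synthetic data with a genuine interaction (values chosen arbitrarily).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + rng.normal(size=n)

ones = np.ones(n)
X_a = np.column_stack([ones, x1, x2, x1 * x2])
X_b = np.column_stack([ones, x1, x2, 1000 * x1 * x2])  # same term, new units

beta_a, t_a = ols_t_stats(X_a, y)
beta_b, t_b = ols_t_stats(X_b, y)

print("interaction coefficient:", beta_a[3], "vs", beta_b[3])
print("interaction t-statistic:", t_a[3], "vs", t_b[3])
```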



#### 4a

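You would expect the cubic regression to have the lower training RSS. The linear model is a special case of the cubic one, so the extra terms can only fit the training data at least as closely, and here the improvement would come from overfitting the noise.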

#### 4b

The test RSS for the linear regression would be expected to be lower because it is a better approximation of $f(X)$. The cubic regression would suffer from having overfitted the training data.

#### 4c

Polynomial regression will again have a training RSS at least as low as the linear regression, whatever the true relationship is, because of its greater flexibility.

#### 4d

There isn't enough information to tell. If the relationship between $X$ and $Y$ is close to linear, the linear model will likely have the lower test RSS. If the relationship is far from linear, the polynomial regression will likely perform better. Because we are not told how far from linear the relationship is, we cannot say which model would do better.
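
The training-versus-test pattern in question 4 can be explored with a short simulation. This is a sketch under stated assumptions: the true relationship, noise level, and sample sizes below are invented, and `fit_and_score` is a made-up helper. Only the training comparison is guaranteed; the test comparison depends on the draw and on the true $f$.

```python
import numpy as np


def rss(y, y_hat):
    """Residual sum of squares."""
    return float(np.sum((y - y_hat) ** 2))


def fit_and_score(x_tr, y_tr, x_te, y_te, degree):
    """Fit a polynomial of the given degree; return (train RSS, test RSS)."""
    coefs = np.polyfit(x_tr, y_tr, degree)
    return (rss(y_tr, np.polyval(coefs, x_tr)),
            rss(y_te, np.polyval(coefs, x_te)))


def true_f(x):
    # Assumed linear truth, as in parts a and b. Swap in a curved
    # function to explore parts c and d.
    return 1.0 + 2.0 * x


rng = np.random.default_rng(1)
n = 100
x_train = rng.uniform(-2, 2, size=n)
x_test = rng.uniform(-2, 2, size=n)
y_train = true_f(x_train) + rng.normal(size=n)
y_test = true_f(x_test) + rng.normal(size=n)

train_lin, test_lin = fit_and_score(x_train, y_train, x_test, y_test, 1)
train_cub, test_cub = fit_and_score(x_train, y_train, x_test, y_test, 3)

print("training RSS  linear:", train_lin, " cubic:", train_cub)
print("test RSS      linear:", test_lin, " cubic:", test_cub)
```

Because the linear model is nested inside the cubic one, the cubic training RSS can never exceed the linear training RSS, which is the point of parts a and c.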

### Applied

#### 8a

In [28]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd
import statsmodels
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
auto = pd.read_csv('../auto.csv')
# Missing horsepower values are recorded as '?'; drop those rows and convert the column to numeric.
auto.drop(auto[auto['horsepower']=='?'].index, inplace=True)
auto['horsepower'] = pd.to_numeric(auto['horsepower'])

In [3]:
auto.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |


In [4]:
auto.describe()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin |
|---|---|---|---|---|---|---|---|---|
| count | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 |
| mean | 23.445918 | 5.471939 | 194.41199 | 104.469388 | 2977.584184 | 15.541327 | 75.979592 | 1.576531 |
| std | 7.805007 | 1.705783 | 104.644004 | 38.49116 | 849.40256 | 2.758864 | 3.683737 | 0.805518 |
| min | 9.0 | 3.0 | 68.0 | 46.0 | 1613.0 | 8.0 | 70.0 | 1.0 |
| 25% | 17.0 | 4.0 | 105.0 | 75.0 | 2225.25 | 13.775 | 73.0 | 1.0 |
| 50% | 22.75 | 4.0 | 151.0 | 93.5 | 2803.5 | 15.5 | 76.0 | 1.0 |
| 75% | 29.0 | 8.0 | 275.75 | 126.0 | 3614.75 | 17.025 | 79.0 | 2.0 |
| max | 46.6 | 8.0 | 455.0 | 230.0 | 5140.0 | 24.8 | 82.0 | 3.0 |


In [5]:
auto.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 396
Data columns (total 9 columns):
mpg             392 non-null float64
cylinders       392 non-null int64
displacement    392 non-null float64
horsepower      392 non-null int64
weight          392 non-null int64
acceleration    392 non-null float64
year            392 non-null int64
origin          392 non-null int64
name            392 non-null object
dtypes: float64(3), int64(5), object(1)
memory usage: 30.6+ KB


In [6]:
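# Note: sm.OLS does not add an intercept on its own, so this call fits
# mpg = b * horsepower, a line forced through the origin.
model = sm.OLS(auto['mpg'], auto['horsepower'])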
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                    mpg   R-squared:                       0.649
Model:                            OLS   Adj. R-squared:                  0.648
Method:                 Least Squares   F-statistic:                     723.7
Date:                Tue, 20 Feb 2018   Prob (F-statistic):           5.67e-91
Time:                        14:49:22   Log-Likelihood:                -1608.1
No. Observations:                 392   AIC:                             3218.
Df Residuals:                     391   BIC:                             3222.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
horsepower     0.1788      0.007     26.901      0.0

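There appears to be a statistically significant relationship between horsepower and mileage. The F-statistic is much larger than one and its p-value is near zero, so we can reject the null hypothesis of no relationship.

The reported $R^2$ is $0.649$, but it needs care. `sm.OLS` was given only the horsepower column, so the model has no intercept (the summary shows 391 residual degrees of freedom for 392 observations). For a model without a constant, statsmodels computes the uncentered $R^2$, so this figure should not be read as "about $65\%$ of the variance in mpg is explained by horsepower."

The fitted coefficient in the summary is positive ($0.1788$), which is also a consequence of forcing the line through the origin: every mpg value is positive, so the best line through zero has to slope upward. In the data, mpg falls as horsepower rises, so a fit that includes an intercept should show a negative relationship: as horsepower rises, fuel efficiency goes down.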
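
One thing worth checking is the intercept, since `sm.OLS` fits exactly the design matrix it is given. The sketch below uses synthetic data (not the Auto set; the numbers are invented) and plain NumPy to show how dropping the intercept can flip the sign of a slope when the response is positive and decreasing. The statsmodels call in the final comment uses `sm.add_constant`, which prepends a column of ones; it is shown for the notebook's variables and has not been run here.

```python
import numpy as np

# Synthetic stand-in: a positive response that decreases with x.
rng = np.random.default_rng(0)
x = rng.uniform(50, 200, size=300)
y = 40.0 - 0.15 * x + rng.normal(scale=2.0, size=300)

# Through the origin: the design matrix is x alone.
slope_origin = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]

# With an intercept: prepend a column of ones to the design matrix.
X = np.column_stack([np.ones_like(x), x])
intercept, slope = np.linalg.lstsq(X, y, rcond=None)[0]

print("slope through origin:", slope_origin)
print("intercept and slope :", intercept, slope)

# Equivalent idea for the notebook's data (not executed in this sketch):
# results = sm.OLS(auto['mpg'], sm.add_constant(auto['horsepower'])).fit()
```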

# Uh-oh

The analyses I've done here don't match the other answers I've found online. The likely culprit is the regression in 8a: `sm.OLS` does not add an intercept on its own, so the model was fit through the origin. I feel like I know less now than I ever have before. I need to start making some flashcards so I can study something during my days off.