# Simple Linear Regression

- Demonstrate the equation of a line $y = mx + b$
    - Interpret $b$ as the "intercept" and $m$ as the "rise over run" slope
    - Introduce flexible notation for an ***expected linear relationship*** $E[y_i] = \beta_0 + \beta_1x_i$ that is inexact $y_i \approx \beta_0 + \beta_1x_i$ but $y_i \neq \beta_0 + \beta_1x_i$ 
- Demonstrate ***simple linear regression*** model
    - Give the $y_i = \beta_0 + \beta_1x_i + \epsilon_i, \epsilon_i \sim \mathcal N (0,\sigma^2)$ mathematical specification of ***simple linear regression***
        - We won't cover the assumptions implied here carefully; but, the homework will address them more fully and explicitly
        - This is the same as $y_i \sim \mathcal N (\mu_{x_i} = \beta_0 + \beta_1x_i, \sigma^2)$ so ***simple linear regression*** is just a ***normal distribution population model***
            - where the mean $\mu_{x_i}$ depends on $x_i$
        - Interpret ***coefficients*** $\beta_0$ as the "intercept" and $\beta_1$ as the "rise over run" slope and now $\epsilon_i$ as the ***error***
- Define ***correlation*** $r_{xy} = \frac{\text{Cov}(x,y)}{s_xs_y} \in [-1,1]$
    - We probably won't have time for a fuller mathematical definition, but perhaps we'll start it?
    - Explore the behavior of $r_{xy}$ relative to $\sigma$
- Introduce `import statsmodels.api as sm` and contrast it with `import statsmodels.formula.api as smf`
    - Define ***predictions*** as ***fitted values*** $\hat y_i = \hat \beta_0 + \hat \beta_1x_i$
        - Contrast ***parameters*** versus ***estimated*** ("intercept" and "slope") ***coefficients***
        - Interpret ***estimated*** ("intercept" and "slope") ***coefficients***
        - Explore
            - `np.cov(x,y, ddof=1)`, `x.var(ddof=1)`, `y.var(ddof=1)`
            - `np.cov(x,y, ddof=1)[0,1]/(x.std(ddof=1)*y.std(ddof=1))`, `np.corrcoef(x,y)[0,1]`
            - $\hat \beta_1 =$ `np.corrcoef(x,y)[0,1]*y.std(ddof=1)/x.std(ddof=1)`
            - $\hat \beta_0 =$ `y.mean() - `$\hat \beta_1$` * x.mean()`
    - Provide ***statistical inference*** based on the ***null hypothesis*** for "slope" $\beta_1$ using the ***coefficient*** $\hat \beta_1$ (***test statistic***)
- Make ***predictions*** with ***fitted values*** $\hat y_i = \hat \beta_0 + \hat \beta_1x_i$ or ***extrapolations*** $\tilde y_i = \hat \beta_0 + \hat \beta_1 \tilde x_i$ 
    - Give synonyms 
        - $y_i$ (outcomes, responses, dependent/endogenous variables)
        - $x_i$ (covariates, features, independent/exogenous/explanatory variables)
    - Define ***residuals*** $e_i = \hat \epsilon_i = y_i - \hat y_i \neq \epsilon_i$
- Define ***proportion of variation explained*** $R^2 = r_{y\hat y}^2 = 1-\frac{\sum_{i=1}^n(y_i-\hat y_i)^2}{\sum_{i=1}^n(y_i-\bar y)^2} = 1-\frac{\frac{1}{n-1}\sum_{i=1}^ne_i^2}{s_y^2}$
    
- Do some examples with "amazonbooks.csv" (`ab_noNaN`)


In [7]:
from scipy import stats
n = 100 # 10000
x = stats.uniform(loc=0, scale=10).rvs(n)
x

array([3.46813151, 5.79405812, 3.3765849 , 5.94032302, 0.42005484,
       3.93113716, 0.39258481, 4.24241201, 4.10543059, 8.59782529,
       5.11351993, 1.66249068, 6.56813692, 0.23853907, 7.48428971,
       0.47588917, 3.82460993, 6.40173168, 6.15381433, 5.82715669,
       9.87181061, 1.47713169, 2.8843529 , 3.42963969, 5.63468789,
       5.58173896, 6.5272543 , 9.15533899, 7.15851167, 9.61203416,
       0.32957747, 7.5991751 , 8.56901826, 4.4641432 , 6.06045152,
       2.17679537, 4.57994947, 9.92688446, 0.11269917, 5.80325948,
       7.07044104, 7.70194738, 6.38504437, 9.68188781, 0.69561834,
       0.65650503, 6.64349475, 7.70338965, 4.55791778, 0.11537027,
       7.80012415, 7.57369719, 0.19651778, 8.94643067, 4.87679378,
       6.6883127 , 2.44838409, 7.88685328, 9.29512991, 6.39803588,
       4.63180507, 7.75302212, 6.50410178, 2.52153283, 9.16606053,
       6.44310705, 4.54151492, 2.0424636 , 8.26646154, 3.78320699,
       1.14352169, 6.05260598, 6.92810802, 3.59293023, 0.23047

In [2]:
stats.uniform?

In [8]:
import plotly.express as px
import pandas as pd
df = pd.DataFrame({'x': x})
px.histogram(df, x='x')


# $y = mx + b$

In [10]:
b = 10 # "intercept"
m = 2 # "slope" "rise over run"
y = m*x+b # y = m1*x1 + m2*x2 + b # 
df2 = pd.DataFrame({'x':x, 'y':y})
px.scatter(df2, x='x', y='y')

# $y_i \approx mx_i + b \longrightarrow E[y_i] = \beta_0 + \beta_1 x_i$

# $y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \epsilon_i \sim \mathcal N(0, \sigma^2)$
- this is equivalent to $y_i \sim \mathcal N(\mu_{x_i} = \beta_0 + \beta_1 x_i, \sigma^2)$
- epsilon $\epsilon_i$ is the "error" or "noise" and is normally distributed
- $x_i$ is "measured without error" or is "known and fixed"
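A minimal sketch of the equivalence stated above, under assumed illustrative values (`beta0 = 10`, `beta1 = 2`, `sigma = 2`, and an arbitrary seed). It simulates $y_i$ both ways: as a line plus normal error, and directly from a normal distribution whose mean depends on $x_i$. Note that the `scale` argument of `stats.norm` is the standard deviation $\sigma$, not the variance $\sigma^2$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n = 100_000                     # large n so both forms are easy to compare
beta0, beta1, sigma = 10, 2, 2  # assumed illustrative values

x = stats.uniform(loc=0, scale=10).rvs(size=n, random_state=rng)

# Form 1: line plus normally distributed error
error = stats.norm(loc=0, scale=sigma).rvs(size=n, random_state=rng)
y_additive = beta0 + beta1 * x + error

# Form 2: normal distribution whose mean depends on x_i
y_direct = stats.norm(loc=beta0 + beta1 * x, scale=sigma).rvs(size=n, random_state=rng)

# Deviations from the true line should look alike under both forms:
# mean near 0 and standard deviation near sigma
dev_additive = y_additive - (beta0 + beta1 * x)
dev_direct = y_direct - (beta0 + beta1 * x)
print(dev_additive.mean(), dev_additive.std(ddof=1))
print(dev_direct.mean(), dev_direct.std(ddof=1))
```

The two samples are different random draws, so they agree in distribution rather than value by value.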

# Simple linear regression

- errors are normally distributed
- errors are homoskedastic (variability in errors is constant regardless of $x_i$)
- the "linear form" is true (and $x_i$ is "measured without error" or is "known and fixed")
- the errors are independent

In [66]:
x
beta0 = b
beta1 = m
error = stats.norm(loc=0, scale=2).rvs(n)
y_with_error = beta0 + beta1*x + error  #y = m*x+b 

df3 = pd.DataFrame({'x':x, 'y': y_with_error})
px.scatter(df3, x='x', y='y')

# Correlation $r_{xy} = \frac{\text{Cov}(x,y)}{s_x s_y} \in [-1 ,1]$

- a measure of the uniform strength of linear association between x and y

In [55]:
import numpy as np

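np.corrcoef(1000000*x,10*y_with_error)[0,1] # rescaling x and y does not change r_xy (only the last expression in a cell is displayed)
np.corrcoef(x,y_with_error)[0,1] # np.corrcoef returns a 2x2 matrix; [0,1] is r_xy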

0.9502977942521981

In [21]:
np.cov(x,y_with_error)

array([[ 8.54666997, 17.89893529],
       [17.89893529, 41.56424596]])
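A short sketch connecting the covariance matrix to the correlation, and exploring how $r_{xy}$ behaves relative to $\sigma$. The data are simulated here with assumed values (seed, `n`, intercept 10, slope 2), so the numbers will not match the outputs shown above.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 100
x = rng.uniform(0, 10, n)
y = 10 + 2 * x + rng.normal(0, 2, n)  # assumed intercept, slope, and noise scale

# r_xy = Cov(x, y) / (s_x * s_y); np.cov uses the n-1 denominator by default
cov_matrix = np.cov(x, y)
r_manual = cov_matrix[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
r_numpy = np.corrcoef(x, y)[0, 1]

# Correlation is unchanged by positive rescaling of x and y
r_rescaled = np.corrcoef(1_000_000 * x, 10 * y)[0, 1]
print(r_manual, r_numpy, r_rescaled)

# Larger noise sigma weakens the linear association
r_by_sigma = {}
for sigma in (0.5, 2, 8):
    y_sigma = 10 + 2 * x + rng.normal(0, sigma, n)
    r_by_sigma[sigma] = np.corrcoef(x, y_sigma)[0, 1]
print(r_by_sigma)
```

The diagonal entries of `cov_matrix` are the sample variances `x.var(ddof=1)` and `y.var(ddof=1)`.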

# ordinary least squares (ols) -- provide a "line of best fit" through data

In [60]:
import statsmodels.api as sm # not demo'ing this today 
import statsmodels.formula.api as smf

simple_linear_regression_model = smf.ols('y ~ x', data=df3)
slrm_fit = simple_linear_regression_model.fit()
slrm_fit.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.876 |
| Model: | OLS | Adj. R-squared: | 0.874 |
| Method: | Least Squares | F-statistic: | 689.4 |
| Date: | Mon, 13 Nov 2023 | Prob (F-statistic): | 3.89e-46 |
| Time: | 14:36:46 | Log-Likelihood: | -224.8 |
| No. Observations: | 100 | AIC: | 453.6 |
| Df Residuals: | 98 | BIC: | 458.8 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 9.5922 | 0.467 | 20.526 | 0.000 | 8.665 | 10.520 |
| x | 2.0891 | 0.080 | 26.256 | 0.000 | 1.931 | 2.247 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.561 | Durbin-Watson: | 2.03 |
| Prob(Omnibus): | 0.755 | Jarque-Bera (JB): | 0.687 |
| Skew: | -0.063 | Prob(JB): | 0.709 |
| Kurtosis: | 2.614 | Cond. No. | 12.1 |
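The `coef` column can be reproduced by hand from the correlation and the sample standard deviations. The sketch below uses freshly simulated data (assumed seed and true values), so its estimates will differ from the 9.5922 and 2.0891 in the summary; `np.polyfit` is swapped in for `smf.ols` here only as an independent least squares check.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
n = 100
x = rng.uniform(0, 10, n)
y = 10 + 2 * x + rng.normal(0, 2, n)  # assumed true intercept 10 and slope 2

# Estimated slope: correlation times the ratio of standard deviations
r_xy = np.corrcoef(x, y)[0, 1]
beta1_hat = r_xy * y.std(ddof=1) / x.std(ddof=1)

# Estimated intercept: forces the fitted line through (x.mean(), y.mean())
beta0_hat = y.mean() - beta1_hat * x.mean()

# Independent check: degree-1 least squares fit returns [slope, intercept]
slope_ls, intercept_ls = np.polyfit(x, y, deg=1)
print(beta0_hat, beta1_hat)
print(intercept_ls, slope_ls)
```

The estimates $\hat \beta_0$ and $\hat \beta_1$ land near, but not exactly on, the parameters $\beta_0 = 10$ and $\beta_1 = 2$ used to simulate the data.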


# $E[y_i] = \beta_0 + \beta_1 x_i$
# $\hat y_i = \hat \beta_0 + \hat\beta_1 x_i \longrightarrow 9.5922 + 2.0891 x_i $
- predicted values

In [64]:
df3 = pd.DataFrame({'x':x, 'y': y_with_error})
fig = px.scatter(df3, x='x', y='y')

import plotly.graph_objects as go
fig.add_trace(go.Scatter(x=x, y=slrm_fit.predict(), mode="lines"))

fig.add_trace(go.Scatter(x=x, y=slrm_fit.predict(), mode="markers"))
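As a sketch of what sits behind the fitted line, the block below computes fitted values, residuals, and $R^2$ by hand, then makes one extrapolation. The data, seed, and the new point `x_new = 12.0` are assumptions for illustration, and `np.polyfit` stands in for the `smf.ols` fit.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
n = 100
x = rng.uniform(0, 10, n)
y = 10 + 2 * x + rng.normal(0, 2, n)  # assumed true intercept 10 and slope 2

beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)  # returns [slope, intercept]

# Fitted values and residuals e_i = y_i - y_hat_i
y_hat = beta0_hat + beta1_hat * x
residuals = y - y_hat

# Three equivalent ways to get R^2 in simple linear regression
r2_from_residuals = 1 - (residuals**2).sum() / ((y - y.mean())**2).sum()
r2_from_fitted = np.corrcoef(y, y_hat)[0, 1]**2
r2_from_xy = np.corrcoef(x, y)[0, 1]**2
print(r2_from_residuals, r2_from_fitted, r2_from_xy)

# Extrapolation: a prediction at an x value outside the observed range (0, 10)
x_new = 12.0
y_new = beta0_hat + beta1_hat * x_new
print(y_new)
```

The residuals $e_i$ are measured from the fitted line, so they are not the same as the errors $\epsilon_i$, which are measured from the true line.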

# Hypothesis test for simple linear regression

# $H_0: \beta_1 =0 \quad \text{there is no linear association between x and y}$

For the model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \epsilon_i \sim \mathcal N(0, \sigma^2)$, this null hypothesis implies a sampling distribution for the (test) statistic $\hat \beta_1$ under $H_0$.

# $H_1: \beta_1 \neq 0 $
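A sketch of how the `std err`, `t`, and `P>|t|` entries for the slope can be computed by hand, assuming the model above holds. The data and seed are simulated assumptions, and `scipy.stats.linregress` is used only as a cross-check; it is not the method used elsewhere in these notes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)  # arbitrary seed
n = 100
x = rng.uniform(0, 10, n)
y = 10 + 2 * x + rng.normal(0, 2, n)  # assumed true intercept 10 and slope 2

beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)  # returns [slope, intercept]
residuals = y - (beta0_hat + beta1_hat * x)

# Two coefficients were estimated, leaving n - 2 residual degrees of freedom
df_resid = n - 2
sigma2_hat = (residuals**2).sum() / df_resid

# Standard error of the slope, then the t statistic for H0: beta1 = 0
se_beta1 = np.sqrt(sigma2_hat / ((x - x.mean())**2).sum())
t_stat = beta1_hat / se_beta1

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=df_resid)
print(se_beta1, t_stat, p_value)

# Cross-check against scipy's built-in simple linear regression
check = stats.linregress(x, y)
print(check.stderr, check.slope / check.stderr, check.pvalue)
```

A very small p-value is evidence against $H_0: \beta_1 = 0$, which is expected here because the data were simulated with a nonzero slope.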

In [4]:
import pandas as pd
ab = pd.read_csv("../amazonbooks.csv", encoding="ISO-8859-1")#.dropna()
#print(ab.shape)
#ab.isnull().sum()
ab_noNaN = ab.drop(['Weight_oz','Width','Height'], axis=1).dropna()
ab_noNaN['Pub year'] = ab_noNaN['Pub year'].astype(int)
ab_noNaN['NumPages'] = ab_noNaN['NumPages'].astype(int)
ab_noNaN['Hard_or_Paper'] = ab_noNaN['Hard_or_Paper'].astype("category")
#print(ab_noNaN.shape)
#ab_noNaN.dtypes
ab_noNaN

| | Title | Author | List Price | Amazon Price | Hard_or_Paper | NumPages | Publisher | Pub year | ISBN-10 | Thick |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1,001 Facts that Will Scare the S#*t Out of Yo... | Cary McNeal | 12.95 | 5.18 | P | 304 | Adams Media | 2010 | 1605506249 | 0.8 |
| 1 | 21: Bringing Down the House - Movie Tie-In: Th... | Ben Mezrich | 15.00 | 10.20 | P | 273 | Free Press | 2008 | 1416564195 | 0.7 |
| 2 | 100 Best-Loved Poems (Dover Thrift Editions) | Smith | 1.50 | 1.50 | P | 96 | Dover Publications | 1995 | 486285537 | 0.3 |
| 3 | 1421: The Year China Discovered America | Gavin Menzies | 15.99 | 10.87 | P | 672 | Harper Perennial | 2008 | 61564893 | 1.6 |
| 4 | 1493: Uncovering the New World Columbus Created | Charles C. Mann | 30.50 | 16.77 | P | 720 | Knopf | 2011 | 307265722 | 1.4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 320 | Where the Sidewalk Ends | Shel Silverstein | 18.99 | 12.24 | H | 192 | HarperCollins | 2004 | 60572345 | 1.1 |
| 321 | White Privilege | Paula S. Rothenberg | 27.55 | 27.55 | P | 160 | Worth Publishers | 2011 | 1429233443 | 0.7 |
| 322 | Why I wore lipstick | Geralyn Lucas | 12.95 | 5.18 | P | 224 | St Martin's Griffin | 2005 | 031233446X | 0.7 |
| 323 | Worlds Together, Worlds Apart: A History of th... | Robert Tignor | 97.50 | 97.50 | P | 480 | W. W. Norton & Company | 2010 | 393934942 | 0.9 |
