# Simple Linear Regression

- Demonstrate the equation of a line $y = mx + b$
    - Interpret $b$ as the "intercept" and $m$ as the "rise over run" slope
    - Introduce flexible notation for an ***expected linear relationship*** $E[y_i] = \beta_0 + \beta_1x_i$ that is inexact: $y_i \approx \beta_0 + \beta_1x_i$, but in general $y_i \neq \beta_0 + \beta_1x_i$
- Demonstrate the ***simple linear regression*** model
    - Give the mathematical specification of ***simple linear regression***: $y_i = \beta_0 + \beta_1x_i + \epsilon_i, \; \epsilon_i \sim \mathcal N (0,\sigma^2)$
        - We won't cover the assumptions implied here carefully, but the homework will address them more fully and explicitly
        - This is the same as $y_i \sim \mathcal N (\mu_{x_i} = \beta_0 + \beta_1x_i, \sigma^2)$ so ***simple linear regression*** is just a ***normal distribution population model***
            - where the mean $\mu_{x_i}$ depends on $x_i$
        - Interpret ***coefficients*** $\beta_0$ as the "intercept" and $\beta_1$ as the "rise over run" slope and now $\epsilon_i$ as the ***error***
- Define ***correlation*** $r_{xy} = \frac{\text{Cov}(x,y)}{s_xs_y} \in [-1,1]$
    - We probably won't have time for a fuller mathematical definition, but perhaps we'll start it?
    - Explore the behavior of $r_{xy}$ relative to $\sigma$
- Introduce `import statsmodels.api as sm` and contrast it with `import statsmodels.formula.api as smf`
    - Define ***predictions*** as ***fitted values*** $\hat y_i = \hat \beta_0 + \hat \beta_1x_i$
        - Contrast ***parameters*** versus ***estimated*** ("intercept" and "slope") ***coefficients***
        - Interpret ***estimated*** ("intercept" and "slope") ***coefficients***
        - Explore
            - `np.cov(x,y, ddof=1)`, `x.var(ddof=1)`, `y.var(ddof=1)`
            - `np.cov(x,y, ddof=1)[0,1]/(x.std(ddof=1)*y.std(ddof=1))`, `np.corrcoef(x,y)[0,1]`
            - $\hat \beta_1 =$ `np.corrcoef(x,y)[0,1]*y.std(ddof=1)/x.std(ddof=1)`
            - $\hat \beta_0 =$ `y.mean() - `$\hat \beta_1$` * x.mean()`
    - Provide ***statistical inference*** for the ***null hypothesis*** about the "slope" $\beta_1$ (typically $H_0: \beta_1 = 0$), using the estimated ***coefficient*** $\hat \beta_1$ as the ***test statistic***
- Make ***predictions*** with ***fitted values*** $\hat y_i = \hat \beta_0 + \hat \beta_1x_i$ or ***extrapolations*** $\tilde y_i = \hat \beta_0 + \hat \beta_1 \tilde x_i$ 
    - Give synonyms 
        - $y_i$ (outcomes, responses, dependent/endogenous variables)
        - $x_i$ (covariates, features, independent/exogenous/explanatory variables)
    - Define ***residuals*** $e_i = \hat \epsilon_i = y_i - \hat y_i \neq \epsilon_i$
- Define ***proportion of variation explained*** $R^2 = r_{y\hat y}^2 = 1-\frac{\sum_{i=1}^n(y_i-\hat y_i)^2}{\sum_{i=1}^n(y_i-\bar y)^2} = 1-\frac{\frac{1}{n-1}\sum_{i=1}^ne_i^2}{s_y^2}$
    
- Do some examples with "amazonbooks.csv" (`ab_noNaN`)

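The correlation and coefficient identities in the outline above can be checked numerically. The following is a minimal sketch on simulated data; the parameter values, the seed, and the helper `simulated_r` are hypothetical choices made for illustration and are not part of the lecture material.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Hypothetical "true" parameters of the simple linear regression model
beta0, beta1, sigma, n = 2.0, 0.5, 1.0, 200

x = rng.uniform(0, 10, size=n)
epsilon = rng.normal(0, sigma, size=n)   # errors: epsilon_i ~ N(0, sigma^2)
y = beta0 + beta1 * x + epsilon          # y_i = beta0 + beta1 * x_i + epsilon_i

# Sample correlation computed two equivalent ways
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r_manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
r_numpy = np.corrcoef(x, y)[0, 1]

# Estimated slope and intercept from the correlation-based formulas
beta1_hat = r_numpy * y.std(ddof=1) / x.std(ddof=1)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Cross-check against a generic least-squares straight-line fit
slope_ls, intercept_ls = np.polyfit(x, y, deg=1)
print(beta1_hat, slope_ls)
print(beta0_hat, intercept_ls)


def simulated_r(sigma, n=200, seed=0):
    """Correlation of x and y for data simulated with error scale sigma."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 10, size=n)
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    return np.corrcoef(x, y)[0, 1]


# Larger sigma means noisier data around the line, so r_xy moves toward 0
for s in (0.1, 1.0, 10.0):
    print(s, simulated_r(s))
```

The estimates $\hat \beta_0$ and $\hat \beta_1$ will be close to, but not equal to, the parameters $\beta_0$ and $\beta_1$ used to simulate the data.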

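The definitions of ***fitted values***, ***residuals***, ***extrapolations***, and $R^2$ from the outline can be sketched the same way. This example again assumes simulated data with hypothetical parameter values, and uses `np.polyfit` as a stand-in least-squares fit rather than `statsmodels`.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 100
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 3.0, size=n)  # hypothetical parameters

# Least-squares estimates (np.polyfit returns the highest degree first)
beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)

y_hat = beta0_hat + beta1_hat * x   # fitted values
e = y - y_hat                       # residuals (estimates of the unobserved errors)

# Three equivalent expressions for the proportion of variation explained
r2_from_sums = 1 - np.sum(e**2) / np.sum((y - y.mean())**2)
r2_from_variance = 1 - (np.sum(e**2) / (n - 1)) / y.var(ddof=1)
r2_from_corr = np.corrcoef(y, y_hat)[0, 1] ** 2
print(r2_from_sums, r2_from_variance, r2_from_corr)

# A prediction at an x value outside the observed range is an extrapolation
x_new = 12.0
y_tilde = beta0_hat + beta1_hat * x_new
print(y_tilde)
```

In simple linear regression these values also equal the squared correlation between `x` and `y`, which can be confirmed with `np.corrcoef(x, y)[0, 1] ** 2`.
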
```python
import pandas as pd
ab = pd.read_csv("../amazonbooks.csv", encoding="ISO-8859-1")#.dropna()
#print(ab.shape)
#ab.isnull().sum()
ab_noNaN = ab.drop(['Weight_oz','Width','Height'], axis=1).dropna()
ab_noNaN['Pub year'] = ab_noNaN['Pub year'].astype(int)
ab_noNaN['NumPages'] = ab_noNaN['NumPages'].astype(int)
ab_noNaN['Hard_or_Paper'] = ab_noNaN['Hard_or_Paper'].astype("category")
#print(ab_noNaN.shape)
#ab_noNaN.dtypes
ab_noNaN
```

| | Title | Author | List Price | Amazon Price | Hard_or_Paper | NumPages | Publisher | Pub year | ISBN-10 | Thick |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1,001 Facts that Will Scare the S#*t Out of Yo... | Cary McNeal | 12.95 | 5.18 | P | 304 | Adams Media | 2010 | 1605506249 | 0.8 |
| 1 | 21: Bringing Down the House - Movie Tie-In: Th... | Ben Mezrich | 15.00 | 10.20 | P | 273 | Free Press | 2008 | 1416564195 | 0.7 |
| 2 | 100 Best-Loved Poems (Dover Thrift Editions) | Smith | 1.50 | 1.50 | P | 96 | Dover Publications | 1995 | 486285537 | 0.3 |
| 3 | 1421: The Year China Discovered America | Gavin Menzies | 15.99 | 10.87 | P | 672 | Harper Perennial | 2008 | 61564893 | 1.6 |
| 4 | 1493: Uncovering the New World Columbus Created | Charles C. Mann | 30.50 | 16.77 | P | 720 | Knopf | 2011 | 307265722 | 1.4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 320 | Where the Sidewalk Ends | Shel Silverstein | 18.99 | 12.24 | H | 192 | HarperCollins | 2004 | 60572345 | 1.1 |
| 321 | White Privilege | Paula S. Rothenberg | 27.55 | 27.55 | P | 160 | Worth Publishers | 2011 | 1429233443 | 0.7 |
| 322 | Why I wore lipstick | Geralyn Lucas | 12.95 | 5.18 | P | 224 | St Martin's Griffin | 2005 | 031233446X | 0.7 |
| 323 | Worlds Together, Worlds Apart: A History of th... | Robert Tignor | 97.50 | 97.50 | P | 480 | W. W. Norton & Company | 2010 | 393934942 | 0.9 |
