Linear Regression and EDA

1) **Exploratory Data Analysis (EDA)**
* Objectives of EDA
    1. *Suggest hypotheses* about the causes of observed phenomena
    2. *Assess assumptions* on which inference is based
    3. Support *selection of appropriate tools and techniques*
    4. Provide *basis for further data* collection
* Data Munging/Cleaning Objectives:
    * Making sure that the data makes sense
    * Checking for missing data (do you need to impute data?)
    * Looking for outliers/anomalies
    * Making data type conversions
    * Transforming data (e.g. aggregation)
    * Encoding, decoding, recoding data so that categorical variables are usable in machine learning algorithms
    * Renaming variables to normalize data
    * Merging data (e.g. joining tables together)
* Types of variables
    * Qualitative or Categorical
    * Quantitative or Numerical
* Number of variables
    * Univariate (1)
    * Bivariate (2)
    * Multivariate (3 or more)
* Types of plots to use for specified variable analysis
    * Univariate, numeric: 
        * **Histogram/KDE**
            * shows center, variability (spread), skewness
            * any outliers
            * choose the number of bins (bin width) carefully
        * **Boxplots**
            * median, IQR, range
            * any outliers
            * but doesn't show the shape of the distribution (solution: use a violin plot)
    * Univariate, categorical:
        * **Bar charts**
            * univariate
            * univariate by type
                * side by side bar chart
                * stacked bar chart
    * Bivariate, numeric vs numeric:
        * **Scatterplots**
            * understanding trends
            * sometimes it is useful to bin a quantitative variable and compare the bins using boxplots
    * Bivariate, numeric vs categorical:
        * **Overlay Two Histograms**
            * comparing distribution properties
        * **Multiple Boxplots**
            * compare median, IQR, range, and outliers
    * Bivariate, categorical vs categorical:
        * **Heatmap**
            * use pd.crosstab to visualize difference in percentages    
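
The numeric summaries behind a boxplot and the percentage table behind a categorical heatmap can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and toy data are hypothetical, and in practice `pd.crosstab` with its `normalize` argument produces the same kind of percentage table.

```python
from collections import Counter
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the quantities a boxplot draws."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


def iqr_outliers(values, k=1.5):
    """Flag points more than k * IQR beyond the quartiles (a common boxplot rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def crosstab_row_percent(rows, cols):
    """Percentage of each column category within each row category."""
    counts = Counter(zip(rows, cols))
    row_totals = Counter(rows)
    return {(r, c): 100 * n / row_totals[r] for (r, c), n in counts.items()}


# Hypothetical toy data, for illustration only
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(five_number_summary(values))
print(iqr_outliers(values))
print(crosstab_row_percent(["a", "a", "b", "b"], ["x", "y", "x", "x"]))
```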

2) **Simple Linear Regression**
![simple_lin_reg](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/400px-Linear_regression.svg.png)
![lin_reg_dist](https://daviddalpiaz.github.io/appliedstats/images/model.jpg)
* Model: $Y=\beta_0 + \beta_1 X + \epsilon$
    * $\beta_0$ and $\beta_1$ are unknown constants that represent the intercept and slope
    * $\epsilon$ is the error term
        * the errors $\epsilon_i$ are assumed to be independent and identically distributed (i.i.d.) Normal random variables
        * $\epsilon$ ~ $Normal(0, \sigma^2)$
    * Fitted Values from Model: $\hat{y}=\hat{B}_0 + \hat{B}_1 x$
        * $\hat{B}_0$ and $\hat{B}_1$ are estimates, computed from the sample, of the unknown true coefficients $\beta_0$ and $\beta_1$
        * $\hat{y}$ indicates the prediction of $Y$ based on $X=x$
* want to minimize error: $e_i=y_i-\hat{y}_i=y_i-\hat{B}_0-\hat{B}_1x_i$
    * **Residual Sum of Squares (RSS)** - the sum of the squared residuals (deviations of the predicted values from the actual observed values). It measures the discrepancy between the data and the fitted model. A small RSS indicates a tight fit of the model to the data.
    * $\begin{align} RSS 
        & = e_1^2 + e_2^2 + \cdots + e_n^2 \\
        & = (y_1-\hat{B}_0-\hat{B}_1x_1)^2+(y_2-\hat{B}_0-\hat{B}_1x_2)^2+\cdots+(y_n-\hat{B}_0-\hat{B}_1x_n)^2 \\
        & = \sum_{i=1}^n (y_i-(\hat{B}_0 + \hat{B}_1x_i))^2 \\
        \end{align}$
    * estimates $\hat{B}_0$ and $\hat{B}_1$ minimize RSS
        * $\hat{B}_0=\bar{y}-\hat{B}_1\bar{x}$
        * $\hat{B}_1=\frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2}$
* Coefficient estimate - use a hypothesis test to determine whether $X$ (feature) has an effect on the response $Y$
    * $SE(\hat{\beta}_0)^2 = \sigma^2 \big[\frac{1}{n}+\frac{\bar{x}^2}{\sum_{i=1}^n(x_i-\bar{x})^2}\big]$
    * $SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i-\bar{x})^2}$
    * $\sigma^2 = Var(\epsilon)$ is the variance (spread) of the error term; it is unknown in practice and is estimated from the data by $RSS/(n-2)$
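
The closed-form estimates and standard errors above can be computed directly. The following is a minimal sketch in plain Python, assuming $\sigma^2$ is estimated by $RSS/(n-2)$; the function name `fit_simple_ols` and the toy data are hypothetical.

```python
from math import sqrt


def fit_simple_ols(x, y):
    """Least-squares fit of y = b0 + b1 * x, with standard errors of both estimates."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = sxy / sxx               # slope estimate
    b0 = y_bar - b1 * x_bar      # intercept estimate

    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)   # assumed estimator of Var(epsilon)

    se_b0 = sqrt(sigma2_hat * (1 / n + x_bar ** 2 / sxx))
    se_b1 = sqrt(sigma2_hat / sxx)
    return {"b0": b0, "b1": b1, "rss": rss, "se_b0": se_b0, "se_b1": se_b1}


# Hypothetical toy data, for illustration only
fit = fit_simple_ols([1, 2, 3, 4, 5], [2.2, 4.1, 6.2, 7.9, 10.1])
print(fit)
```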

|                     |                                         One-sample mean test                                         |                                    One-sample coefficient test                                   |
|:-------------------:|:----------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------:|
|   Setup Hypothesis  |                                       $H_0: \mu = \mu_0 = 100$                                       |                                        $H_0: \beta_1 = 0$                                        |
|   Sample Statistic  |                                               $\bar{x}$                                              |                                            $\hat{B}_1$                                           |
| Test Statistic      | $t=\frac{\bar{x}-\mu_0}{\frac{s}{\sqrt{n}}}$                                                         | $t=\frac{\hat{B}_1-0}{SE(\hat{B}_1)}$                                                            |
| Confidence Interval | $(\bar{x}-t_{\frac{\alpha}{2}}*\frac{s}{\sqrt{n}}, \bar{x}+t_{\frac{\alpha}{2}}*\frac{s}{\sqrt{n}})$ | $(\hat{B}_1-t_{\frac{\alpha}{2}}*SE(\hat{B}_1), \hat{B}_1+t_{\frac{\alpha}{2}}*SE(\hat{B}_1))$ |
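
The coefficient column of the table can be sketched as a small helper. This is an illustration only: the function name and the input numbers are hypothetical, and the two-sided critical value $t_{\alpha/2}$ is passed in by the caller (looked up in a t-table, or obtained from a library such as `scipy.stats.t.ppf`) rather than computed here.

```python
def coefficient_t_test(b1_hat, se_b1, t_crit):
    """Test H0: beta_1 = 0 and build the matching confidence interval.

    t_crit is the two-sided critical value t_{alpha/2} for the residual
    degrees of freedom (n - 2 in simple linear regression).
    """
    t_stat = (b1_hat - 0) / se_b1
    ci = (b1_hat - t_crit * se_b1, b1_hat + t_crit * se_b1)
    reject_null = abs(t_stat) > t_crit   # same as: the interval excludes 0
    return t_stat, ci, reject_null


# Hypothetical numbers: slope 1.96, standard error 0.04, n = 5 (3 degrees of
# freedom), alpha = 0.05, so t_crit = 3.182 from a t-table
t_stat, ci, reject_null = coefficient_t_test(1.96, 0.04, 3.182)
print(t_stat, ci, reject_null)
```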

3) **Multiple Linear Regression**
* Model: $Y = \beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_pX_p+\epsilon$
* Fitted Values From Model: $\hat{y} = \hat{\beta}_0+\hat{\beta}_1X_1+\hat{\beta}_2X_2+\cdots+\hat{\beta}_pX_p$
* Residual Sum of Squares: $\begin{align} RSS
    & = \sum_{i=1}^n(y_i-\hat{y}_i)^2 \\
    & = \sum_{i=1}^n(y_i-\hat{B}_0-\hat{B}_1x_{i1}-\hat{B}_2x_{i2}-\cdots-\hat{B}_px_{ip})^2 \\
    \end{align}$
* Coefficient Estimates: $\hat{B}=(X^TX)^{-1}X^Ty$
![mult_lin_reg](multiple_linear_regression.png)
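
The coefficient formula $\hat{B}=(X^TX)^{-1}X^Ty$ is equivalent to solving the normal equations $(X^TX)\hat{B}=X^Ty$. The sketch below does this in plain Python with a small Gauss-Jordan solver; the function names and data are hypothetical, and in practice a numerically safer routine such as `numpy.linalg.lstsq` would normally be used instead of forming $X^TX$ by hand.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [bi] for row, bi in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (perfectly collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Return [b0, b1, ..., bp] for predictor rows given without the intercept column."""
    X = [[1.0] + list(r) for r in rows]               # prepend the intercept column
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical noiseless data generated from y = 1 + 2*x1 - 3*x2
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, -2, 0, 2]
print(fit_ols(rows, y))
```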

4) **Assessing Accuracy and Comparing Models**
* Metrics for assessing accuracy:
    1. **Residual Sum of Squares (RSS)** - (aka Sum of the Squared Residuals) the sum of the squared residuals (deviations of the predicted values from the actual observed values)
        * equation: $RSS=\sum_{i=1}^n(y_i-\hat{y}_i)^2$
        * example: $RSS = 1520123.11$
            * really meaningless number
            * this RSS value grows with $n$
            * measured in the *squared units of the response variable*, $y$
                * e.g. think $y$ in dollars vs. $y$ in millions of dollars
    2. **Mean Squared Error (MSE)** - (aka Mean of the Squared Residuals) the average of the squared errors or deviations. As a second moment of the error, it incorporates both the variance of the estimator and its bias
        * equation: $MSE=\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2$
    3. **Root Mean Squared Error (RMSE)** - taking the square root of MSE is analogous to taking the square root of the variance to obtain the standard deviation. It represents the sample standard deviation of the differences between predicted and observed values
        * equation: $RMSE=\sqrt{\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{n}}$
    4. **Residual Standard Error (RSE)** - standard deviation of points formed around a linear function, and is an estimate of the accuracy of the dependent variable being measured. Can roughly think of as average amount that response will deviate from regression line
        * equation: $RSE=\sqrt{\frac{1}{n-p-1}RSS}=\sqrt{\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{n-p-1}}$
    5. **R-Squared (R$^2$)** (aka Coefficient of determination or Proportion of Variance Explained) - the proportion of the variance in the dependent variable that is predictable from the independent variable(s). A statistical measure of how close the data are to the fitted regression line. 
        * equation: $R^2=\frac{TSS-RSS}{TSS}=1-\frac{RSS}{TSS}$ where Total Sum of Squares: $TSS=\sum_{i=1}^n (y_i-\bar{y})^2$
        * independent of scale of y
        * 0% indicates that the model explains none of the variability of the response data around its mean
    6. **F-test** - used to compare nested models with different numbers of variables to determine how important the additional variables are
        * equation: $F=\frac{\frac{RSS_{reduced}-RSS_{full}}{p_{full}-p_{reduced}}}{\frac{RSS_{full}}{n-p_{full}-1}}$
            * where $F$ has degrees of freedom $(p_{full}-p_{reduced}), (n-p_{full}-1)$
        * example: predict $Y$ (MPG) and suspect height and color might not be very important variables
        * assume $\alpha=0.05$
            1. setting up comparison models:
                * $m_{reduced}: Y$~$\beta_0+\beta_{weight}+\beta_{modelyear}+\beta_{cartype}$
                * $m_{full}: Y$~$\beta_0+\beta_{weight}+\beta_{modelyear}+\beta_{cartype}+\beta_{height}+\beta_{color}$
            2. compute F-statistic:
                * $F = 2.23$
                * notice that if height and color really don't matter much: $(RSS_{reduced}-RSS_{full})$ will be small $\rightarrow$ F-statistic will be small
            3. compute p-value:
                * p-value = 0.1241
                * if $p < 0.05$ reject null (that height and color don't matter)
                * if $p \geq 0.05$ fail to reject null (that height and color don't matter)
        * comparing models using the F-test:
            * is my model useful at all? e.g. is at least one of my predictors $X_1,X_2,\dots,X_p$ useful in predicting the response?
                * $H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$
                * $H_A:$ at least one of the $\beta_j$ is non-zero
                * $F=\frac{\frac{TSS-RSS}{p}}{\frac{RSS}{n-p-1}}$~$F_{p,n-p-1}$
            * equivalence to t-test in the regression output
                * $m_{reduced}: Y$~$\beta_0+\beta_{weight}+\beta_{modelyear}$
                * $m_{full}: Y$~$\beta_0+\beta_{weight}+\beta_{modelyear}+\beta_{cartype}$
* Interpretation:
<img src=ols_regression_results.png text="Simple Linear Regression with Distribution" width=90% />
    * **R-squared** - proportion of variance explained by model is 93.3%
    * **Prob(F-statistic)** - measure of the significance of the fit (determining how useful the model is): $6.30e-27$
        * you want a low value to determine that the model is useful
    * **Confidence Interval of Coefficient** - we are 95% confident that $[0.275, 0.693]$ contains the true value of $\beta_2$ (about 95% of intervals constructed this way would contain it)
    * **p-value of Coefficient** - each coefficient is statistically significant (each t-test can also be thought of as a partial F-test)
    * **Coefficient value** - interpreting what it signifies
        * example: average effect on $Y$ of a one unit increase in $X_2$, holding all other predictors ($X_1$ and $X_3$) fixed, is $0.4836$
        * beware that interpretations are generally pretty hazardous due to **correlations among predictors**
        * however, p-values for each coefficient are $\approx$ 0, so it might be possible to interpret the coefficients
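
The accuracy metrics and the partial F-statistic defined in this section translate directly into code. This is a minimal sketch in plain Python; the function names and the numbers in the usage lines are hypothetical, and turning $F$ into a p-value still requires the $F$ distribution (for example from a statistics library).

```python
from math import sqrt


def regression_metrics(y, y_hat, p):
    """RSS, MSE, RMSE, RSE and R^2 for a model with p predictors."""
    n = len(y)
    y_bar = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return {
        "RSS": rss,
        "MSE": rss / n,
        "RMSE": sqrt(rss / n),
        "RSE": sqrt(rss / (n - p - 1)),
        "R2": 1 - rss / tss,
    }


def partial_f_statistic(rss_reduced, rss_full, p_reduced, p_full, n):
    """F-statistic comparing a reduced model nested inside a full model."""
    numerator = (rss_reduced - rss_full) / (p_full - p_reduced)
    denominator = rss_full / (n - p_full - 1)
    return numerator / denominator


# Hypothetical values, for illustration only
print(regression_metrics([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8], p=1))
print(partial_f_statistic(rss_reduced=120, rss_full=100, p_reduced=3, p_full=5, n=56))
```

Degrees of freedom for the F-statistic are $(p_{full}-p_{reduced})$ and $(n-p_{full}-1)$, matching the formula above.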
    

5) **Linear Regression Assumptions**
* LR Assumptions - all of the linear regression model assumptions are really statements about the regression error terms, $\epsilon$
    1. **Linearity** - is the relationship between predictors and response linear? Check that the scatterplot of residuals vs. fitted values shows no curvilinear pattern
    2. **Constant variance (homoscedasticity)** - is the spread (variance) of the residuals in the scatterplot constant (e.g. not in triangular form)?
    3. **Independence of errors** - are we collecting observations from the same entity over time (e.g. stock prices over time)
        * longitudinal data - data from specific entities over time
        * cross-sectional data - collect data on entities only once
    4. **Normality of errors** - is the histogram of the residuals approximately normal? are the residuals skewed?
    5. **Lack of multicollinearity** - are there any high correlations between independent variables?
* **Studentized Residuals (aka Standardized Residuals)** - the quotient resulting from the division of a **residual** by an estimate of its standard deviation
![studentized_residuals](http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/images/res1panel.png)
    * Error terms cannot be observed directly, so we rely on the least squares *residuals*: $e_i = y_i-\hat{y}_i$ 
    * equation: $r_i=\frac{e_i}{s_{e_i}}\approx\frac{\epsilon_i}{\sigma}$~$Normal(0,1)$
    * Obtaining studentized residuals:
        1. run the regression
        2. calculate the predicted values
        3. calculate the *residuals*: $e_i=y_i-\hat{y}_i$
        4. calculate the *studentized residuals*: $r_i=\frac{e_i}{s_{e_i}}\approx\frac{\epsilon_i}{\sigma}$~$Normal(0,1)$
    * residual plot comparison types
        * residuals $e_i$ vs. independent variables
        * residuals $e_i$ vs. predicted/fitted values $\hat{y}$
* Regression diagnostics (Model Checking):
    1. **Non-linearity with predictors**
![non_linearity](http://docs.statwing.com/wp-content/uploads/2014/10/Nonlinear-residual-21.png)
        * **Polynomials**
            * different degrees of polynomial
        * **Step functions**
            * easy to create and explain
        * **Splines**
        * **Local Regression** - using a sliding weight function to make separate linear fits over range of $X$
        * **Generalized Additive Models (GAMs)** - adding up contributing effects
    2. **Non-normality of Error Terms** - normality assumption allows us to construct confidence intervals and do hypothesis tests
        * Graphical checks:
            * **Normal Q-Q plot** - a quantile-quantile plot of the standardized data against the standard normal distribution
![norm_qq_plot](https://i.stack.imgur.com/ezLDI.png)
            * Histogram
        * Normality tests
            * **Jarque-Bera test**
            * **Shapiro-Wilk test**
        * Solution? A log transformation of the dependent variable is often useful
    3. **Heteroscedasticity or Non-constant Variance** - checking if the variance is constant over fitted values
![heteroscedasticity](https://i.stack.imgur.com/RU17l.png) 
        * use residual plot $e_i$ vs. predicted/fitted values $\hat{y}$
        * Solution? transform Y via $log(Y)$ or $\sqrt{Y}$
    4. **Multicollinearity** - checking if features are highly correlated with each other
        * **Correlation Matrix / Scatterplot Matrix** - shows correlation between pairwise variables
            * downside is that it only picks up pairwise effects
        * **Variance Inflation Factors (VIF)** - runs ordinary least squares for each predictor as function of all the other predictors
            * equation: $VIF = \frac{1}{1-R^2_i}$
            * $k$ times for $k$ predictors
            * $X_1 = \alpha_2X_2+\alpha_3X_3+\cdots+\alpha_kX_k+c_0+e$
            * rule of thumb, $VIF > 10$ is problematic
    5. **Outliers** - occurs when $y_i$ is far from predicted, $\hat{y}_i$
        * may occur due to data collection, re-coding issues, dirty data, etc.
        * least squares estimates particularly affected by outliers
        * residual plots can help identify outliers: $e_i=y_i-\hat{y}_i$
            * dividing each residual by its standard error gives the "studentized residual"; as a common rule of thumb, values outside roughly $\pm 3$ suggest possible outliers
        * different types of outliers:
            * extreme $X$ value
            * extreme $Y$ value
            * extreme $X$ and $Y$
            * distant data point
        * **Leverage point** - an observation with an unusual $X$ value
![leverage_point](leverage.png)
            * does not necessarily have a large effect on the regression model
            * most common measure, the hat value, $h_{ii} = (H)_{ii}$
            * the $i$th diagonal of the hat matrix: $H = X(X^TX)^{-1}X^T$
            * high-leverage points are those observations, if any, made at extreme or outlying values of the independent variables such that the lack of neighboring observations means that the fitted regression model will pass close to that particular observation
        * **Influential points** - an outlier that greatly affects the slope of the regression line
![influential_points](influential_points.png)
            * observations that have high leverage and large residuals tend to be influential            
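
Leverage and studentized residuals can be computed by hand for the simple (one-predictor) case, where the hat value reduces to $h_{ii}=\frac{1}{n}+\frac{(x_i-\bar{x})^2}{\sum_j (x_j-\bar{x})^2}$. The sketch below is a minimal illustration under that assumption, using the internally studentized form $r_i = e_i / (\hat{\sigma}\sqrt{1-h_{ii}})$; the function name and data are hypothetical, and it assumes no single point has $h_{ii}=1$.

```python
from math import sqrt


def simple_regression_diagnostics(x, y):
    """Return (leverage, studentized residuals) for the fit y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar

    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sigma_hat = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

    # Hat values: diagonal of H = X (X^T X)^{-1} X^T in the one-predictor case
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    # Each residual divided by its estimated standard error
    studentized = [e / (sigma_hat * sqrt(1 - h)) for e, h in zip(residuals, leverage)]
    return leverage, studentized


# Hypothetical toy data: the last point has an unusual x value
x = [1, 2, 3, 4, 10]
y = [1.1, 1.9, 3.2, 3.9, 10.0]
leverage, studentized = simple_regression_diagnostics(x, y)
for xi, h, r in zip(x, leverage, studentized):
    print(f"x={xi:>2}  leverage={h:.2f}  studentized residual={r:+.2f}")
```

The hat values sum to the number of fitted coefficients (2 here), so a point whose leverage is well above the average $(p+1)/n$ is worth inspecting.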

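The VIF procedure described under multicollinearity (regress each predictor on all the others, then take $\frac{1}{1-R^2_i}$) can be sketched in plain Python. The helper names and data below are hypothetical, the small solver stands in for a proper linear algebra library, and the sketch assumes no predictor is perfectly collinear with the others (which would give $R^2_i = 1$ and a division by zero).

```python
def _solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def _r_squared(rows, y):
    """R^2 of an OLS fit of y on the given predictor rows plus an intercept."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = _solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in X]
    y_bar = sum(y) / len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - rss / tss


def variance_inflation_factors(rows):
    """One VIF per predictor column: 1 / (1 - R^2 of that column on the others)."""
    k = len(rows[0])
    vifs = []
    for j in range(k):
        target = [r[j] for r in rows]
        others = [[v for i, v in enumerate(r) if i != j] for r in rows]
        vifs.append(1 / (1 - _r_squared(others, target)))
    return vifs


# Hypothetical data: two predictors with correlation 0.8
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
print(variance_inflation_factors(rows))
```
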
6) **Categorical Variables and Interactions**
* Categorical Variables:
    * interested in credit card balances $y$
    * suspect it may be related to gender or ethnicity
    * modeling with only gender
        * $x_i = \begin{cases}
            1 & \text{if $i$th person is female} \\
            0 & \text{if $i$th person is male}
            \end{cases}$
        * $y_i = \beta_0 + \beta_1x_i + \epsilon_i = \begin{cases}
            \beta_0+\beta_1+\epsilon_i & \text{if $i$th person is female} \\
            \beta_0+\epsilon_i & \text{if $i$th person is male}
            \end{cases}$
    * modeling with only ethnicity (more than 2 levels)
        * $x_{i1} = \begin{cases}
            1 & \text{if $i$th person is asian} \\
            0 & \text{if $i$th person is not asian}
            \end{cases}$
        * $x_{i2} = \begin{cases}
            1 & \text{if $i$th person is caucasian} \\
            0 & \text{if $i$th person is not caucasian}
            \end{cases}$
        * $y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \epsilon_i = \begin{cases}
            \beta_0+\beta_1+\epsilon_i & \text{if $i$th person is asian} \\
            \beta_0+\beta_2+\epsilon_i & \text{if $i$th person is caucasian} \\
            \beta_0+\epsilon_i & \text{if $i$th person is african american (AA)}
            \end{cases}$
        * $\beta_0$ as average credit card balance for AA
        * $\beta_1$ as difference in average balance between asian and AA
        * $\beta_2$ as difference in average balance between caucasian and AA
        * recode categorical column to 0 and 1
            * Asian: {Asian:1, Caucasian:0}
            * Caucasian: {Asian:0, Caucasian:1}
            * AA: {Asian:0, Caucasian:0}
        * intercept $\beta_0$ loses nice interpretation
        * $\beta_1=-23.1$ - still interpret as difference between asian and AA holding all other predictors constant (beware of interpretation)
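* **Interactions** - occur when the effect of one predictor on the response depends on the value of another predictor (a relationship among three or more variables)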
    * example of predicting sales by TV/radio/newspaper expenditure
        * $\hat{sales} = \beta_0+\beta_1*TV+\beta_2*radio+\beta_3*newspaper$
        * looks to be synergy between TV and Radio based on plot of sales vs. tv/radio
        * account for this synergy by adding an interaction term $TV*Radio$
        * $sales = \beta_0+\beta_1*TV+\beta_2*radio+\beta_3*(radio*TV)+\epsilon$
            * the coefficient estimates in the table suggest that an increase in TV advertising of \$1,000 is associated with increased sales of $(\hat{\beta}_1+\hat{\beta}_3 * radio) * 1000 = 19+1.1 * radio$ units
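
Both ideas in this section, 0/1 recoding against a baseline level and an interaction effect that depends on a second predictor, can be sketched in plain Python. The function names are hypothetical; the default coefficients $\hat{\beta}_1 = 0.019$ and $\hat{\beta}_3 = 0.0011$ are simply the values implied by the expression $19 + 1.1 * radio$ above.

```python
def dummy_encode(values, baseline):
    """Recode a categorical column into 0/1 columns, one per non-baseline level."""
    levels = sorted(set(values) - {baseline})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


def tv_effect_on_sales(radio, b1_hat=0.019, b3_hat=0.0011, tv_increase=1000):
    """Change in sales associated with a TV increase, at a given radio level.

    With the interaction term, the TV slope is (b1_hat + b3_hat * radio),
    so the effect of TV spending depends on radio spending.
    """
    return (b1_hat + b3_hat * radio) * tv_increase


# The baseline level (all zeros) is absorbed into the intercept beta_0
levels, rows = dummy_encode(
    ["Asian", "Caucasian", "African American", "Asian"],
    baseline="African American",
)
print(levels)
print(rows)

# Hypothetical radio levels, for illustration only
for radio in (0, 10, 50):
    print(radio, tv_effect_on_sales(radio))
```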
    