# [CPSC 222](https://github.com/GonzagaCPSC222) Intro to Data Science
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# Linear Regression
What are our learning objectives for this lesson?
* Calculate a least squares linear regression line

Content used in this lesson is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining notes

## Warm-up Task(s)
In ClassificationFun:
* Download the AmesHousing.csv file from the DAs repository on Github
* In a new notebook called LinearRegression.ipynb
    * Read in AmesHousing.csv into a dataframe
    * Create a scatter plot of SalePrice on the y-axis and a numeric attribute on the x-axis you think is predictive of how much a house sells for (attribute descriptions available [here](https://jse.amstat.org/v19n3/decock/DataDocumentation.txt))

## Today
* Announcements
    * Let's go over IQ9
    * Work on your project (Cleaning, EDA, and at least 1 hypothesis test)
        * Check-in due in class on 5/29 at the latest
        * Bonus points for demoing early (this week: 4/22-4/25) during office hours
    * Any DA7 questions? Due Sunday night. Today we will cover linear regression
    * Note: office hours tomorrow (4/23) cancelled. If you need me, I'll be in my office at 4:15pm or email me
* Today
    * Cross validation (magnet demo + code in [kNN.ipynb on Github](https://github.com/GonzagaCPSC222/U7-Machine-Learning-NLP/blob/master/ClassificationFun/kNN.ipynb))
    * LinearRegression.ipynb

## Regression
In supervised machine learning, when your "class" attribute is continuous, the machine learning task is called regression instead of classification. There are several regression algorithms in scikit-learn that can be used for these tasks, such as:
* [Linear regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
* [kNN regressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)
* [Decision tree regressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)
* [Support vector regressor](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html)
* Etc.

While these algorithms can get complex, it is pretty straightforward to implement our own "simple" linear regression algorithm. To do this, consider regression tasks where we have one feature variable (an independent variable, x) and one target variable (a dependent variable, y). Given a training set of (x, y) pairs, we can fit a line y = mx + b that can be used for unseen instances (e.g. x values) to predict a y value.

## Simple Linear Regression
In scatter plots of (x, y) data, it can be helpful to "fit a line"
<img src="https://raw.githubusercontent.com/GonzagaCPSC222/U7-Machine-Learning-NLP/master/figures/linear_regression_example.png" width="600"/>

* this can be done via linear regression
* we're going to look at a simple approach called "Least Squares"

The basic idea: Given a set of points, find a line that "best" fits the points
* i.e., find values for $m$ (slope) and $b$ (intercept) such that $y = mx + b$ best fits the points

In least squares linear regression
* find the $m$ and $b$ that minimize the sum of the squared (vertical) distances between the line and the measured data points
* once we find $m$, finding $b$ isn't difficult

The basic least squares approach:
1. Calculate the mean $\bar{x}$ of the $x$ values and the mean $\bar{y}$ of the $y$ values
    * Note the line must go through the point ($\bar{x}$, $\bar{y}$)
2. Calculate the slope using the means (where $n$ is the number of data points):
$$m = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
3. Calculate the y intercept as $b = \bar{y} - m\bar{x}$
    * This is $\bar{y} = m\bar{x} + b$ solved for $b$, since we know the line must go through ($\bar{x}$, $\bar{y}$)
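
The three steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name `least_squares_fit` and the tiny data set (points lying exactly on $y = 2x + 1$) are made up for the example.

```python
def least_squares_fit(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    # Step 1: means of the x values and the y values
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Step 2: slope from the deviations around the means
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    m = numerator / denominator
    # Step 3: intercept, since the line passes through (x_bar, y_bar)
    b = y_bar - m * x_bar
    return m, b


# Made-up data lying exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, b = least_squares_fit(xs, ys)
print(m, b)  # 2.0 1.0

# Predict y for an unseen x value
x_unseen = 6
print(m * x_unseen + b)  # 13.0
```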

## Multiple Linear Regression
We can perform linear regression with multiple independent variables (e.g. $x_1, x_2, ..., x_n$) using [scikit-learn's `LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html). Like classifiers in scikit-learn, `LinearRegression` has `fit(X_train, y_train)`, `predict(X_test)`, and `score(X_test, y_test)` methods.
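
As a small sketch of how these methods fit together, the example below trains on made-up, noise-free data generated from $y = 3x_1 + 2x_2 + 5$. The data values and variable names are assumptions chosen for illustration; for a regressor, `score()` returns $R^2$ rather than accuracy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: two features per instance (x1, x2)
X_train = np.array([[1, 1], [2, 1], [3, 2], [4, 3], [5, 5]])
# Target generated from y = 3*x1 + 2*x2 + 5 (no noise, for illustration only)
y_train = 3 * X_train[:, 0] + 2 * X_train[:, 1] + 5

reg = LinearRegression()
reg.fit(X_train, y_train)

print(reg.coef_)       # one slope per feature
print(reg.intercept_)  # the fitted intercept

# Made-up "unseen" instances generated from the same relationship
X_test = np.array([[6, 2], [0, 4]])
y_test = 3 * X_test[:, 0] + 2 * X_test[:, 1] + 5

y_pred = reg.predict(X_test)
r2 = reg.score(X_test, y_test)  # R^2 on the test set
print(y_pred, r2)
```

Because the made-up data has no noise, the fitted slopes and intercept should recover the generating values up to floating point error. Real data such as AmesHousing.csv will not fit this cleanly.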

## Regression Evaluation Metrics
The correlation coefficient $r$ helps check how strong the linear relationship is:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
* Note the denominator is built from the same deviations as the numerator, just squared to strip away the signs, so it scales $r$ to the range $[-1, 1]$
* If the relationship is perfectly linear (with positive slope), then $r = 1$
* If the relationship is perfectly inverse linear (with negative slope), then $r = -1$
* If there is no linear relationship, then $r = 0$

An alternative formula for the slope $m$ (where $\sigma_x$ and $\sigma_y$ are the standard deviations of $x$ and $y$):
$$m = r \frac{\sigma_y}{\sigma_x}$$
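
The formula for $r$ and the alternative slope formula can be checked with a short plain Python sketch. The helper names and the five data points are made up for illustration, and the standard deviations here divide by $n$ (population form).

```python
import math


def mean(values):
    return sum(values) / len(values)


def correlation_coefficient(xs, ys):
    """Pearson's r computed directly from the deviations around the means."""
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys)
    )
    return numerator / denominator


def population_std(values):
    """Standard deviation that divides by n (not n - 1)."""
    v_bar = mean(values)
    return math.sqrt(sum((v - v_bar) ** 2 for v in values) / len(values))


# Made-up data with a positive but imperfect linear relationship
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation_coefficient(xs, ys)
m = r * population_std(ys) / population_std(xs)
print(round(r, 4))  # 0.7746
print(round(m, 4))  # 0.6, the same slope the least squares formula gives
```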

For simple linear regression, the coefficient of determination $R^2$ is the correlation coefficient squared, $R^2 = r^2$
* $R^2$ is the proportion of variation in the dependent (y) variable that is explained by the independent (x) variable
* The higher the $R^2$, the better the line fits the data
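
To make the "proportion of variation explained" reading concrete, the sketch below computes $R^2$ as one minus the ratio of residual variation to total variation. That formulation is a common equivalent definition and is not taken from these notes; for a least squares line with one feature it agrees with $r^2$. The data and the helper name `fit_line` are made up.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for one feature."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    m = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return m, y_bar - m * x_bar


# Made-up data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, b = fit_line(xs, ys)
y_bar = sum(ys) / len(ys)

# Variation left over after fitting the line (sum of squared residuals)
ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
# Total variation of y around its mean
ss_tot = sum((y - y_bar) ** 2 for y in ys)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 4))  # 0.6, i.e. 60% of the variation in y is explained by x
```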

The covariance can also be used to assess correlation:
$$cov = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n}$$
* Covariance can also be used to calculate the correlation coefficient (with $\sigma_x$ and $\sigma_y$ computed by dividing by $n$, to match the covariance above):
$$r = \frac{cov}{\sigma_x \sigma_y}$$
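
A brief sketch of the covariance route to $r$, using made-up data and hypothetical helper names. Both the covariance and the standard deviations divide by $n$ so that the two formulas stay consistent.

```python
import math


def covariance(xs, ys):
    """Population covariance: average product of deviations (divides by n)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n


def population_std(values):
    """Population standard deviation (divides by n)."""
    n = len(values)
    v_bar = sum(values) / n
    return math.sqrt(sum((v - v_bar) ** 2 for v in values) / n)


# Made-up data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

cov = covariance(xs, ys)
r = cov / (population_std(xs) * population_std(ys))
print(round(cov, 4))  # 1.2
print(round(r, 4))    # 0.7746
```

Unlike $r$, the covariance depends on the units of $x$ and $y$, which is why dividing by the standard deviations is needed to get a value between -1 and 1.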

The standard error (SE) is also used to help check the fit:
$$SE = \sqrt{\frac{\sum_{i=1}^{n}(y_i - y_i^\prime)^2}{n}}$$
* Where $y_i^\prime$ is the "predicted" value and $y_i$ is the actual value
* $(y_i - y_i^\prime)$ is called a "residual"
* Note this standard error is essentially the standard deviation of the residuals (not to be confused with the standard error of the mean, $\frac{\sigma}{\sqrt{n}}$)
* The lower the value, the "better" the fit
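
As an illustration, the standard error can be computed from the residuals of a fitted line. The data below is made up, and the slope and intercept are hard-coded to the least squares values for that data ($m = 0.6$, $b = 2.2$) to keep the sketch short.

```python
import math

# Made-up data and its least squares line (m and b precomputed for this data)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b = 0.6, 2.2

# Predicted value for each x, then the residual (actual minus predicted)
predictions = [m * x + b for x in xs]
residuals = [y - y_pred for y, y_pred in zip(ys, predictions)]

# Square the residuals, average them, and take the square root
standard_error = math.sqrt(sum(res ** 2 for res in residuals) / len(residuals))
print(round(standard_error, 4))  # 0.6928
```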

Additional regression evaluation metrics that are commonly used to quantify the residuals:
* Mean absolute error (MAE)
* Mean square error (MSE)
* Root mean square error (RMSE)
* Normalized RMSE (NRMSE)
* Etc. 
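
The first four metrics above can be sketched in a few lines of plain Python. The function names and data are made up for illustration. NRMSE has more than one convention; this sketch assumes normalization by the range of the actual values, whereas other sources divide by the mean instead.

```python
import math


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    return math.sqrt(mean_squared_error(y_true, y_pred))


def normalized_rmse(y_true, y_pred):
    # Assumption: normalize by the range of the actual values
    return root_mean_squared_error(y_true, y_pred) / (max(y_true) - min(y_true))


# Made-up actual values and predictions
y_true = [2, 4, 5, 4, 5]
y_pred = [2.8, 3.4, 4.0, 4.6, 5.2]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
nrmse = normalized_rmse(y_true, y_pred)
print(round(mae, 4), round(mse, 4), round(rmse, 4), round(nrmse, 4))
# 0.64 0.48 0.6928 0.2309
```

MAE and RMSE are in the same units as $y$, which makes them easier to interpret than MSE. NRMSE is unitless, which helps when comparing models across targets with different scales.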

What does it mean if there is a strong (linear) correlation?
* One of the attributes is (potentially) redundant because it is implied by the other
* One is a good predictor for the other ... good if one is a class label
    * i.e., Regression is one way to make predictions

## Correlation DOES NOT Imply Causation
<img src="https://qph.fs.quoracdn.net/main-qimg-13d22f6fda3811a9108d18b71c46e933" width="400"/>

(image from https://qph.fs.quoracdn.net/main-qimg-13d22f6fda3811a9108d18b71c46e933)

It is important to note that a strong correlation between two variables does not mean that one causes the other. Correlation does not account for extraneous variables, such as:
* Weather
* Seasonality
* Competitor
* Economy
* Pricing
* etc.

For example, consider how ice cream consumption and sunburn prevalence are correlated, but neither "causes" the other because there is an extraneous variable, sunny weather:

<img src="https://pbs.twimg.com/media/ENXkNi2X0AAqLyI?format=jpg&name=medium" width="500"/>

(image from https://pbs.twimg.com/media/ENXkNi2X0AAqLyI?format=jpg&name=medium)
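
The ice cream and sunburn example can be imitated with a small simulation. Everything here is synthetic and hypothetical: the variable names, the coefficients, and the noise levels are made up so that sunshine drives both quantities while neither affects the other. The sketch compares the raw correlation with the correlation that remains after the linear effect of sunshine has been removed from both variables.

```python
import math
import random


def pearson_r(xs, ys):
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys)
    )
    return numerator / denominator


def residuals_after_fit(xs, ys):
    """Residuals of ys after subtracting the least squares line fit on xs."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    m = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    b = y_bar - m * x_bar
    return [y - (m * x + b) for x, y in zip(xs, ys)]


random.seed(0)  # fixed seed so the synthetic data is reproducible
n_days = 500

# Sunshine is the extraneous variable; it drives both other variables
sunshine = [random.uniform(0, 1) for _ in range(n_days)]
ice_cream = [10 * s + random.gauss(0, 1) for s in sunshine]
sunburns = [5 * s + random.gauss(0, 0.5) for s in sunshine]

# Raw correlation: strong, even though neither variable affects the other
r_raw = pearson_r(ice_cream, sunburns)

# Correlation after removing the linear effect of sunshine from both
r_controlled = pearson_r(
    residuals_after_fit(sunshine, ice_cream),
    residuals_after_fit(sunshine, sunburns),
)
print(r_raw, r_controlled)
```

In this synthetic setup the raw correlation should come out strongly positive while the controlled correlation should be close to zero, which is the pattern expected when a third variable explains the relationship.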

## Isolating Causation and Experiment Design
As another example, consider how changes in weather are correlated with increases in product sales:
    
<img src="https://www.businessprocessincubator.com/wp-content/uploads/2017/10/www.capgemini.comweeklysalesincreaseusatem-0f0687fe4fda13494079d48aa4b6ba5521ecb085.png" width="400"/>

(image from https://www.businessprocessincubator.com/content/understanding-the-sales-landscape-with-an-extremely-randomized-forest/)

If a hedge trimmer company rolled out a new marketing campaign at the start of summer, they may attribute their increase in sales to the marketing campaign because sales are correlated with the marketing budget/efforts; however, as the infographic above shows, weather is an extraneous variable that is more likely the real cause of the increased sales during the summer.

To better estimate the effects of the campaign, the hedge trimmer company could compare sales from this summer (with a marketing campaign) to a previous summer (without a marketing campaign) and try to control for other extraneous/confounding variables. 

In general, researchers typically isolate causality with an effectively designed experiment, such as a randomized controlled trial where there are two groups (control and experiment) that are identical except for one variable (the one hypothesized to be the cause of a certain outcome):

<img src="https://quantifyinghealth.com/wp-content/uploads/2020/01/rct-vs-cohort-study-design.png" width="500"/>

(image from https://quantifyinghealth.com/wp-content/uploads/2020/01/rct-vs-cohort-study-design.png)