# [CPSC 222](https://github.com/GonzagaCPSC222) Intro to Data Science
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# Linear Regression
What are our learning objectives for this lesson?
* Calculate a least squares linear regression line

Content used in this lesson is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining notes

## Warm-up Task(s)
* Open LinearRegression.ipynb from last class
* For our x, y scatter data, compute the correlation coefficient $r$
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
    * Add more/less noise to our y values. How does this affect $r$?
* Read the rest of the Linear Regression notes below

## Today
* Attendance
* Announcements
    * Great mid-project demos
    * IQ9 is graded, average was ~8/10, fantastic job!
    * DA7 questions?
* Today
    * Finish the kNN.ipynb
        * Quick overview of bootstrap method
        * "Classifier Evaluation Metrics"
            * Including confusion matrices
    * Start LinearRegression.ipynb

## Regression
In supervised machine learning, when your "class" attribute is continuous, the machine learning task is called regression instead of classification. There are several regression algorithms in scikit-learn that can be used for these tasks, such as:
* [Linear regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
* [kNN regressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)
* [Decision tree regressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)
* [Support vector regressor](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html)
* Etc.

While these algorithms can get complex, it is pretty straightforward to implement our own "simple" linear regression algorithm. To do this, consider regression tasks where we have one feature variable (an independent variable, x) and one target variable (a dependent variable, y). Given a training set of (x, y) pairs, we can fit a line $y = mx + b$ and use it to predict a y value for unseen instances (i.e., new x values).

## Simple Linear Regression
In scatter plots of (x, y) data, it can be helpful to "fit a line"
<img src="https://raw.githubusercontent.com/GonzagaCPSC222/U6-Machine-Learning/master/figures/linear_regression_example.png" width="600"/>

* this can be done via linear regression
* we're going to look at a simple approach called "Least Squares"

The basic idea: Given a set of points, find a line that "best" fits the points
* i.e., find values for $m$ (slope) and $b$ (intercept) such that $y = mx + b$ best fits the points

In least squares linear regression
* find the $m$ and $b$ that minimize the sum of the squared (vertical) distances from the line to the measured data points
* once we find $m$, finding $b$ isn't difficult
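
To make "minimizes the sum of squared distances" concrete, here is a minimal sketch. The data and the helper name `sum_squared_error` are made up for illustration, and the least squares line for this data ($m = 0.6$, $b = 2.2$) is stated without derivation. Nudging the slope or intercept away from those values should only increase the error.

```python
def sum_squared_error(xs, ys, m, b):
    """Sum of squared vertical distances from the points to the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Made-up example data (not from the lesson notebook)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Error of the least squares line for this data (m = 0.6, b = 2.2)
best_sse = sum_squared_error(xs, ys, 0.6, 2.2)

# Errors after nudging the slope or the intercept up or down by 0.1
nudged = [
    sum_squared_error(xs, ys, 0.6 + dm, 2.2 + db)
    for dm, db in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]
]

print(best_sse)  # smallest of all the errors
print(nudged)    # every nudged line has a larger error
```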

The basic least squares approach:
1. Calculate the mean $\bar{x}$ of the $x$ values and the mean $\bar{y}$ of the $y$ values
    * note the line must go through the point ($\bar{x}$, $\bar{y}$)
2. Calculate the slope using the means (where n is the number of data points):
$$m = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
3. Calculate the y intercept as $b = \bar{y} - m\bar{x}$
     * or, $\bar{y} = m\bar{x} + b$ ... since we know it must go through ($\bar{x}$, $\bar{y}$)
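
The three steps above can be sketched in plain Python. This is one possible implementation, not necessarily the one in LinearRegression.ipynb; the function name `fit_line` and the data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least squares fit of y = m*x + b for a single feature."""
    n = len(xs)
    # Step 1: means of x and y
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Step 2: slope from the deviations about the means
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    m = numerator / denominator
    # Step 3: intercept, since the line passes through (mean_x, mean_y)
    b = mean_y - m * mean_x
    return m, b

# Made-up example data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, b = fit_line(xs, ys)
prediction = m * 6 + b  # predict y for an unseen instance, x = 6
print(m, b, prediction)
```

Note the sketch does not guard against all x values being identical, in which case the denominator is 0 and no slope can be computed.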
     
## Regression Evaluation Metrics
The correlation coefficient $r$ helps check how strong the linear relationship between $x$ and $y$ is:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
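* Note the denominator uses the same deviations as the numerator, just squared to strip away the signs (and then square rooted), so $r$ always falls between -1 and 1
* If the relationship is perfectly linear with a positive slope, then the result is 1
* If the relationship is perfectly linear with a negative slope (inverse), then the result is -1
* If there is no linear relationship, the result is 0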
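
A minimal sketch of the $r$ formula, checked against the three cases above. The function name `correlation` and the data are made up for illustration. The last case is a reminder that $r$ only measures linear relationships: $y = (x - 3)^2$ depends entirely on $x$, yet $r$ is 0.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]

r_perfect = correlation(xs, [2 * x + 1 for x in xs])     # perfectly linear, positive slope
r_inverse = correlation(xs, [-3 * x + 10 for x in xs])   # perfectly linear, negative slope
r_none = correlation(xs, [(x - 3) ** 2 for x in xs])     # symmetric curve, no linear trend

print(r_perfect, r_inverse, r_none)
```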

An alternative formula for the slope $m$ (where $\sigma_x$ and $\sigma_y$ are the standard deviations of $x$ and $y$):
$$m = r \frac{\sigma_y}{\sigma_x}$$
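
As a quick sanity check, the sketch below computes the slope both ways on made-up data and confirms they agree. It assumes the standard deviations come from `statistics.pstdev`; the ratio $\sigma_y / \sigma_x$ is the same for sample standard deviations as long as both use the same kind.

```python
import math
import statistics

# Made-up example data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

# Slope straight from the least squares formula
m_direct = sxy / sxx

# Slope from the correlation coefficient and the standard deviations
r = sxy / math.sqrt(sxx * syy)
m_from_r = r * statistics.pstdev(ys) / statistics.pstdev(xs)

print(m_direct, m_from_r)
```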

The coefficient of determination $R^2$ is the correlation coefficient squared, $R^2 = r^2$
* $R^2$ is the proportion of variation in the dependent (y) variable that is explained by the independent (x) variable
* The higher $R^2$ is (the closer to 1), the better the line fits the data
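
One way to see the "proportion of variation explained" reading is the formulation $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$, which is standard but not stated in these notes. The sketch below, on made-up data with illustrative variable names, checks that it matches $r^2$ for a least squares line.

```python
import math

# Made-up example data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

# Least squares line and correlation coefficient
m = sxy / sxx
b = mean_y - m * mean_x
r = sxy / math.sqrt(sxx * syy)

# Variation left over after the fit vs. total variation in y
ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
ss_tot = syy
r_squared = 1 - ss_res / ss_tot

print(r_squared, r ** 2)  # both describe the same quantity
```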

The covariance can also be used to assess correlation
$$cov = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n}$$
* covariance can also be used to calculate the correlation coefficient (with $\sigma_x$ and $\sigma_y$ computed by dividing by $n$, to match the covariance above):
$$r = \frac{cov}{\sigma_x \sigma_y}$$
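
A short sketch of both formulas on made-up data. Because the covariance above divides by $n$, the sketch pairs it with `statistics.pstdev` (population standard deviation); mixing in the sample version (`statistics.stdev`) would give a slightly different $r$.

```python
import statistics

# Made-up example data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Covariance as defined above (dividing by n)
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Correlation coefficient from the covariance and population standard deviations
r = cov / (statistics.pstdev(xs) * statistics.pstdev(ys))

print(cov, r)
```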

The standard error (SE) is also used to help check the fit
$$SE = \sqrt{\frac{\sum_{i=1}^{n}(y_i - y_i^\prime)^2}{n}}$$
* Where $y_i^\prime$ is the "predicted" value and $y_i$ is the actual value
* $(y_i - y_i^\prime)$ is called a "residual"
* Note this standard error is essentially the standard deviation of the residuals (not to be confused with the standard error of the mean, $\frac{\sigma}{\sqrt{n}}$)
* The lower the value, the "better" the fit
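
The sketch below computes the residuals and the SE formula above for made-up data and its least squares line ($m = 0.6$, $b = 2.2$, stated without derivation). It also checks two facts that hold for a least squares line with an intercept: the residuals sum to (approximately) zero, so SE matches their population standard deviation.

```python
import math
import statistics

# Made-up example data and its least squares line
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b = 0.6, 2.2

predicted = [m * x + b for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]

# Standard error as defined above (dividing by n)
se = math.sqrt(sum(res ** 2 for res in residuals) / len(residuals))

print(residuals)
print(sum(residuals))                      # approximately 0
print(se, statistics.pstdev(residuals))    # approximately equal
```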

Additional regression evaluation metrics that are commonly used to quantify the residuals:
* Mean absolute error (MAE)
* Mean square error (MSE)
* Root mean square error (RMSE)
* Normalized RMSE (NRMSE)
* Etc. 
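
A minimal sketch of these metrics in plain Python. The function name `regression_metrics` and the data are made up for illustration. NRMSE has more than one convention; this sketch assumes normalizing RMSE by the range of the actual values (dividing by the mean is another common choice).

```python
import math

def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE, and range-normalized RMSE as a dict."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]  # the residuals
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    # Assumption: normalize by the range of the actual values
    nrmse = rmse / (max(actual) - min(actual))
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "NRMSE": nrmse}

# Made-up actual values and hypothetical predictions from a fitted line
actual = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE and RMSE are in the same units as $y$, while MSE is in squared units, which is one reason RMSE is often reported alongside or instead of MSE.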

Q: What does it mean if there is a strong (linear) correlation?
* one of the attributes is (potentially) redundant because it is implied by the other
* one is a good predictor for the other ... good if one is a class label
* i.e., regression is one way to make predictions (more later)