In [1]:
from datascience import *
%matplotlib inline
path_data = '../../../assets/data/'
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import math
import numpy as np
from scipy import stats

### Questions:
- I don't quite get the regression effect. Why does it happen?
- Why does the regression line pass through the origin when $x$ and $y$ are measured in standard units?
- Are all residual plots centered around 0?
- The text says that residual plots show no trend, yet they sometimes show a pattern. What is the difference between a trend and a pattern?

# Chapter 15: Prediction
Terminology:
- **regression:** 
- **ecological correlations:** correlations based on aggregates and averages. (must be interpreted with care)
- **Heteroscedasticity**: uneven spread


### 15.1 Correlation
The *correlation coefficient* measures the strength of the **linear** relationship, and only linear association, between two variables. Graphically, it measures how clustered the scatter diagram is around a straight line.

Note that outliers can have a big effect on correlation.

Some facts about the correlation coefficient $r$:
- $r$ is a number between $-1$ and $1$
- $r$ measures the extent to which the scatter plot clusters around a straight line.
- $r = 1$ if the scatter diagram is a perfect straight line sloping upwards, and $r = -1$ if the scatter diagram is a perfect straight line sloping downwards.


##### Calculating $r$
The mathematical basis for this is out of scope, but the formula for $r$ is:

**$r$ is the average of the products of the two variables, when both variables are measured in standard units**
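
The definition above translates almost word for word into code. The following is a minimal sketch using plain NumPy; the helper names `standard_units` and `correlation` and the small data arrays are made up for illustration.

```python
import numpy as np

def standard_units(values):
    """Convert an array to standard units: (value - mean) / SD."""
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)

def correlation(x, y):
    """r = the average of the products of x and y in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

# Made-up illustrative data
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 3, 5, 4, 6, 7])

r = correlation(x, y)
print(r)

# Cross-check against NumPy's built-in correlation matrix
print(np.corrcoef(x, y)[0, 1])
```

Both printed values should agree up to floating-point rounding.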

##### Properties of $r$
- $r$ is unit-less. This is because $r$ is based on standard units.
- $r$ is unaffected by changing the units on either axis. This too is because $r$ is based on standard units.
- $r$ is unaffected by switching the axes. Algebraically, this is because the product of standard units does not depend on which variable is called $x$ and which is called $y$. Geometrically, switching the axes reflects the scatter plot about the line $y = x$, but changes neither the amount of clustering nor the sign of the association.
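
These properties can be checked numerically. The sketch below uses made-up heights and weights (the variable names and values are assumptions for illustration) and compares $r$ before and after a change of units and a switch of axes.

```python
import numpy as np

def correlation(x, y):
    """Average of the products of x and y in standard units."""
    x_su = (x - np.mean(x)) / np.std(x)
    y_su = (y - np.mean(y)) / np.std(y)
    return np.mean(x_su * y_su)

# Made-up heights (inches) and weights (pounds)
height_in = np.array([60.0, 62.0, 65.0, 68.0, 70.0, 72.0])
weight_lb = np.array([115.0, 120.0, 140.0, 155.0, 160.0, 180.0])

r_original = correlation(height_in, weight_lb)

# Change units on both axes: inches -> centimeters, pounds -> kilograms
r_rescaled = correlation(height_in * 2.54, weight_lb * 0.4536)

# Switch which variable plays the role of x and which plays y
r_switched = correlation(weight_lb, height_in)

print(r_original, r_rescaled, r_switched)
```

All three values should be equal up to rounding.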

### 15.2 The Regression Line
For a *football-shaped scatter plot*, when $r$ is close to 1, the scatter plot, the 45 degree line, and the regression line are all very close to each other. But for more moderate values of $r$, the regression line is noticeably flatter than the 45 degree line.

#### The Regression Effect
The regression line gives predictions that are somewhat closer to the average than the values used to make the predictions. This is called "*regression to the mean*", and it is how the name *regression* arises.

In general, individuals who are away from average on one variable are expected to be not quite as far away from average on the other. This is called the *regression effect.*


#### The Equation of the Regression Line
When $x$ and $y$ are measured in standard units, the regression line for predicting $y$ based on $x$ has slope $r$ and passes through the origin. Thus the equation of the regression line is:

$$
\text{estimate of } y = r \cdot x
$$
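
The equation above also makes the regression effect visible: since $r$ lies between $-1$ and $1$, the estimate $r \cdot x$ is never further from 0 (the average, in standard units) than $x$ itself. A small sketch, using a hypothetical $r = 0.5$ chosen only for illustration:

```python
import numpy as np

r = 0.5  # hypothetical correlation, chosen only for illustration

# Values of x in standard units (0 is the average)
x_su = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Regression estimate of y in standard units: r * x
predicted_y_su = r * x_su

for x_val, y_val in zip(x_su, predicted_y_su):
    print(f"x = {x_val:+.1f} SDs  ->  estimated y = {y_val:+.1f} SDs")
```

With this $r$, someone 2 SDs above average on $x$ is estimated to be only 1 SD above average on $y$, and an average $x$ gives an average estimate, which is why the line passes through the origin.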

The slope and intercept of the regression line in original units are:

$$
\text{slope} = r \cdot \frac{\text{SD of }y}{\text{SD of }x}
$$
$$
\text{intercept} = \overline{y} - \text{slope} \cdot \overline{x}
$$
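
As a sketch, the two formulas can be written as small functions. The helper names `correlation`, `slope`, and `intercept` and the data are made up for illustration; `np.polyfit` (NumPy's least-squares polynomial fit) is used only as an independent cross-check.

```python
import numpy as np

def correlation(x, y):
    """Average of the products of x and y in standard units."""
    x_su = (x - np.mean(x)) / np.std(x)
    y_su = (y - np.mean(y)) / np.std(y)
    return np.mean(x_su * y_su)

def slope(x, y):
    """slope = r * (SD of y) / (SD of x)"""
    return correlation(x, y) * np.std(y) / np.std(x)

def intercept(x, y):
    """intercept = (mean of y) - slope * (mean of x)"""
    return np.mean(y) - slope(x, y) * np.mean(x)

# Made-up illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 5.2, 4.1, 6.3, 6.8])

m = slope(x, y)
b = intercept(x, y)
print(m, b)

# Cross-check: a degree-1 least-squares fit returns [slope, intercept]
check_slope, check_intercept = np.polyfit(x, y, 1)
print(check_slope, check_intercept)
```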

A surprising mathematical fact is that no matter what the shape of the scatter plot, the same equation gives the "best" among all straight lines. That's the topic of the next section.

### 15.3 The Method of Least Squares
The purpose of the line is to *predict* or *estimate* values of $y$. Estimates aren't perfect: each one is off from the true value by an *error*. A reasonable criterion for a line to be the "best" is for it to have the smallest possible overall error among all straight lines.

#### Root Mean Squared Error
We measure the rough size of the errors the same way we developed the SD.

We square the errors to avoid cancellation and take the mean, which measures roughly how big the squared errors are. As noted when finding the SD, squared units are hard to interpret, so we take the square root. This yields the root mean squared error (rmse), which is in the same units as the variable being predicted and is therefore much easier to interpret.
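
Reading "root mean squared error" from right to left gives the order of the steps. A minimal sketch, with a hypothetical `rmse` helper and made-up data that lie exactly on the line $y = 2x$:

```python
import numpy as np

def rmse(slope, intercept, x, y):
    """Root mean squared error of the line slope * x + intercept."""
    fitted = slope * x + intercept
    errors = y - fitted
    squared = errors ** 2          # square, to avoid cancellation
    mean_squared = np.mean(squared)
    return np.sqrt(mean_squared)   # root, to get back to the units of y

# Made-up data lying exactly on y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

perfect = rmse(2.0, 0.0, x, y)     # the true line: every error is 0
off_by_one = rmse(2.0, 1.0, x, y)  # shifted up by 1: every error is -1

print(perfect)      # 0.0
print(off_by_one)   # 1.0
```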

A remarkable fact of mathematics is that **for a scatter plot of any shape, the regression line is the unique straight line that minimizes the mean squared error of estimation among all straight lines.**


#### Numerical Optimization
The proof of the statement above requires abstract mathematics, but we can use Python to confirm it numerically.

First note that minimizing the root mean squared error is the same as minimizing the mean squared error: taking the square root makes no difference to the minimization. So we'll save a step of computation and just minimize the mean squared error (mse).

Since the prediction depends on the slope $m$ and intercept $b$:
$$
\text{prediction} = mx + b
$$
we can write a function that takes in `slope` and `intercept` and returns the mse, and then search by trial and error for the slope and intercept that minimize the returned value of that function.

The function that performs the trial and error is called `minimize`. It follows the changes that lead to incrementally lower output values. The input to `minimize` is a function that itself takes numerical arguments and returns a numerical value.

`minimize` returns an array in which each element corresponds to one argument of the input function, set to the value that minimizes that function's output.
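
To keep the following sketch self-contained, it uses `scipy.optimize.minimize` as a stand-in for the `minimize` described above. One difference to note: SciPy passes all arguments to the objective as a single array and needs a starting guess `x0`. The data and the starting point are made up for illustration.

```python
import numpy as np
from scipy import optimize

# Made-up illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 5.2, 4.1, 6.3, 6.8])

def mse(params):
    """Mean squared error of the line with the given (slope, intercept)."""
    slope, intercept = params
    fitted = slope * x + intercept
    return np.mean((y - fitted) ** 2)

# Numerical search, starting from an arbitrary guess of slope 0, intercept 0
result = optimize.minimize(mse, x0=[0.0, 0.0])
best_slope, best_intercept = result.x

# Regression line from the formulas, for comparison
x_su = (x - np.mean(x)) / np.std(x)
y_su = (y - np.mean(y)) / np.std(y)
r = np.mean(x_su * y_su)
formula_slope = r * np.std(y) / np.std(x)
formula_intercept = np.mean(y) - formula_slope * np.mean(x)

print(best_slope, best_intercept)
print(formula_slope, formula_intercept)
```

The two printed pairs should agree to several decimal places; the numerical search is approximate, so an exact match is not expected.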

#### The Least Squares Line
Therefore, we have found that the regression line minimizes the mean squared error, and that minimizing the mean squared error gives us the regression line, **for a scatter plot of any shape.**

This is why the regression line is sometimes called the "least squares line."



### 15.4 Least Squares Regression

For scatter plots with a non-linear association, it is sometimes better to fit a curve than a straight line.

#### Nonlinear Regression
For a quadratic relation on a scatter plot, we can use `minimize` to find the best quadratic function among all quadratic functions, just as easily as we did for linear relations. Recall that a quadratic function has the form:

$$
f(x) = ax^2 + bx + c
$$
for constants $a, b, \text{and } c$.

Therefore we can write a function that finds the mse, where the fitted values are now based on a quadratic function instead of linear.

In [7]:
def quadratic_mse(a, b, c):
    x = ... # tbl.column('x column')
    y = ... # tbl.column('y column')
    fitted = a*(x**2) + b*x + c
    return np.mean((y - fitted)**2)

# best = minimize(quadratic_mse)
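
The cell above leaves the data columns unspecified. As a runnable sketch of the same idea, the version below uses made-up data generated from a known quadratic, and again uses `scipy.optimize.minimize` as a stand-in, so the objective takes a single array of parameters rather than three separate arguments. The name `quadratic_mse_on_example` is hypothetical.

```python
import numpy as np
from scipy import optimize

# Made-up data lying exactly on the curve y = 2x^2 - x + 1
x = np.arange(-3.0, 4.0)   # -3, -2, ..., 3
y = 2 * x**2 - x + 1

def quadratic_mse_on_example(params):
    """Mean squared error of the quadratic a*x^2 + b*x + c on the data above."""
    a, b, c = params
    fitted = a * (x**2) + b * x + c
    return np.mean((y - fitted) ** 2)

# Numerical search from an arbitrary starting guess
result = optimize.minimize(quadratic_mse_on_example, x0=[0.0, 0.0, 0.0])
a_best, b_best, c_best = result.x
print(a_best, b_best, c_best)

# Cross-check with NumPy's least-squares polynomial fit, which returns [a, b, c]
print(np.polyfit(x, y, 2))
```

Because the data were generated without noise, the search should recover values close to $a = 2$, $b = -1$, $c = 1$.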

### 15.5 Visual Diagnostics

To see how well linear regression performs as a method of estimation, we must measure how far off the estimates are from the actual values. These differences are called *residuals.*

$$
\text{residual} = \text{observed value} - \text{regression estimate} 
$$
A residual is what's left over – the residue – after estimation.
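
A minimal sketch of computing residuals, assuming made-up data and using `np.polyfit` with degree 1 to obtain the least-squares line:

```python
import numpy as np

# Made-up illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 5.2, 4.1, 6.3, 6.8])

# Least-squares line: polyfit with degree 1 returns [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

fitted = slope * x + intercept   # regression estimates
residuals = y - fitted           # observed value - regression estimate

print(residuals)
print(np.mean(residuals))        # very close to 0
```

There is one residual per data point, and for the regression line they average out to 0 up to rounding.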

It helps to start with a visualization: a *residual plot* can be drawn by plotting the residuals against the predictor variable.


#### Regression Diagnostics
The residual plot of a good regression shows no pattern. The residuals look about the same, above and below the horizontal line at 0, across the range of the predictor variable (x).


#### Detecting Nonlinearity
You can usually spot nonlinearity just by drawing a scatter plot of the data, but it is often easier to spot in a residual plot than in the original plot. This is because the residual plot lets us zoom in on the errors, which makes patterns easier to spot.

**When a residual plot shows a pattern, there may be a non-linear relation between the variables.**

#### Detecting Heteroscedasticity
The meaning of heteroscedasticity is *"uneven spread"*.

Take the mpg vs acceleration residual plot of hybrid cars for example.

![mpg vs. acceleration residual plot](images/mpg-vs-accel-residual-plot.png)


Notice how the residual plot flares out towards the low end of the accelerations.

**If the residual plot shows uneven variation about the horizontal line at 0, the regression estimates are not equally accurate across the range of the predictor variable.**
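
One rough way to put a number on uneven spread is to compare the SD of the residuals over the lower and upper halves of the predictor. This is only an illustrative sketch, not a formal test, and the data are synthetic: they are generated (with a fixed random seed) so that the errors are larger where $x$ is small, loosely mimicking a plot that flares out at the low end.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the made-up data are reproducible

# Synthetic data: a linear relation whose errors are larger where x is small
x = np.linspace(1, 10, 200)
noise_sd = 0.3 * (11 - x)
y = 2 * x + 1 + rng.normal(0, 1, 200) * noise_sd

# Residuals from the least-squares line
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Compare the spread of residuals in the lower and upper halves of x
low_half = x < np.median(x)
spread_low = np.std(residuals[low_half])
spread_high = np.std(residuals[~low_half])

print(spread_low, spread_high)
```

A noticeably larger spread in one half suggests that the regression estimates are less accurate in that part of the range.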

### 15.6 Numerical Diagnostics

#### Residual Plots Show No Trend
**For every linear regression, whether good or bad, the residual plot shows no trend. Overall, it is flat. In other words, the residuals and the predictor variable are uncorrelated (the correlation coefficient is 0).**
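
This can be checked numerically even for a deliberately bad fit. The sketch below fits a straight line to made-up data that follow a curve ($y = x^2$), then computes the correlation between the predictor and the residuals; the `correlation` helper is the same standard-units average used for $r$.

```python
import numpy as np

def correlation(x, y):
    """Average of the products of x and y in standard units."""
    x_su = (x - np.mean(x)) / np.std(x)
    y_su = (y - np.mean(y)) / np.std(y)
    return np.mean(x_su * y_su)

# Made-up curved data, so a straight line is a poor fit
x = np.arange(0, 10, dtype=float)
y = x ** 2

# Least-squares line and its residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

residual_corr = correlation(x, residuals)
residual_mean = np.mean(residuals)

print(residual_corr)   # 0, up to rounding
print(residual_mean)   # 0, up to rounding
print(residuals)       # still U-shaped: positive at the ends, negative in the middle
```

The residuals here have no linear trend and average to 0, yet they clearly follow a U-shaped pattern, which is the kind of pattern that signals nonlinearity.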

A trend means the general direction of a graph. Residual plots 