In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

def standard_units(x):
    return (x - np.average(x))/np.std(x)

def correlation(t, x, y):
    x_su = standard_units(t.column(x))
    y_su = standard_units(t.column(y))
    return np.average(x_su * y_su)

def slope(t, x, y):
    r = correlation(t, x, y)
    return r * np.std(t.column(y))/np.std(t.column(x))

def intercept(t, x, y):
    a = slope(t, x, y)
    return np.average(t.column(y)) - a*np.average(t.column(x))

def fitted_values(t, x, y):
    a = slope(t, x, y)
    b = intercept(t, x, y)
    return a * t.column(x) + b

def residuals(t, x, y):
    """ Returns residual for each prediction, 
        i.e. the difference between the true y and predicted y"""
    predictions = fitted_values(t, x, y)
    return t.column(y) - predictions

def plot_residuals(t, x, y):
    with_residuals = t.with_columns(
        "Fitted", fitted_values(t, x, y),
        "Residual", residuals(t, x, y)/ 1000 # I did this division just for this example
    )
    with_residuals.select(x, y, 'Fitted').scatter(0)
    with_residuals.scatter(x, 'Residual')
    plots.ylim(-1, 1)

**Question:**
<br>
Hypothetically, if the scattered data seem to lie closely around a straight, horizontal line instead of the usual upward- or downward-tilting line, would we still get an *r* value (since technically the y-value of the scatter plot doesn't vary with the x-value)? Would this still count as a linear relationship? Since *r* measures how clustered the scattered data are around a straight line, does this definition include how clustered the data are around a straight, horizontal line as well?

**Answer:**

Let's first make a table with this example.
Here, for every `x`, the y-value is between 0 and 0.1.

In [None]:
np.random.seed(1234)
X = np.arange(-10, 10)
Y = [np.random.rand()*.1 for i in range(20)]
example_tbl = Table().with_columns("x", X, "y", Y)
example_tbl.show(5)
example_tbl.scatter('x')
plots.ylim(-1, 1)

The next line will compute the correlation for us

In [None]:
correlation(example_tbl, 'x', 'y')

As we can see, these two variables aren't really correlated: *r* = 0.08 (pretty low).

Let's now make a prediction by finding a regression line

In [None]:
example_tbl_fitted = example_tbl.with_columns("fitted y", fitted_values(example_tbl, 'x', 'y'))

Let's graph that in a scatter graph

In [None]:
example_tbl_fitted.scatter('x')
plots.ylim(-1, 1)

Zoomed out, our predictions don't look that bad. Let's zoom in a bit more.

In [None]:
example_tbl_fitted.scatter('x')

Here, the prediction doesn't look that great.

Let's plot our residuals

In [None]:
plot_residuals(example_tbl_fitted, 'x', 'y')

Remember that the true `y` is the fitted line (the predictions) plus the residuals (the errors in our predictions).
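
This decomposition can be checked numerically. The sketch below is a minimal NumPy-only version: it does not use the `Table` helpers from this notebook, and the noisy data and the names `fitted` and `resid` are made up for illustration.

```python
import numpy as np

# Made-up data for illustration: small random y-values, similar to the example above.
rng = np.random.default_rng(0)
x = np.arange(-10, 10)
y = rng.random(20) * 0.1

# Regression line from the same formulas as slope() and intercept() above.
x_su = (x - np.average(x)) / np.std(x)
y_su = (y - np.average(y)) / np.std(y)
r = np.average(x_su * y_su)
a = r * np.std(y) / np.std(x)
b = np.average(y) - a * np.average(x)

fitted = a * x + b   # predictions
resid = y - fitted   # errors in the predictions

# True y = fitted value + residual, up to floating-point rounding.
print(np.allclose(fitted + resid, y))
# Residuals from a regression line average out to (approximately) zero.
print(np.isclose(np.average(resid), 0))
```

Both checks should print `True`, whatever data you plug in, as long as `x` and `y` each contain more than one distinct value.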

We'll talk about this more in today's (December 8, 2020) lecture

Before we end, let's make a new example that reflects the question.

Now, instead of adding some random noise, we will draw a horizontal line: every `y` is exactly 1.

In [None]:
np.random.seed(1234)
X = np.arange(-10, 10)
Y = [1 for i in range(20)]
example_tbl = Table().with_columns("x", X, "y", Y)
example_tbl.show(5)
example_tbl.scatter('x')
plots.ylim(-0.5, 1.5)

Are these correlated?

In [None]:
correlation(example_tbl, 'x', 'y')

Uh-oh, we got a `nan`. We haven't seen this value before. `nan` stands for "not a number".

Hmm, what does it mean for the correlation to be not a number?

To answer this, let's look at the equation for correlation, which we provide in the code below.

In [None]:
def standard_units(x):
    return (x - np.average(x))/np.std(x)

def correlation(t, x, y):
    x_su = standard_units(t.column(x))
    y_su = standard_units(t.column(y))
    return np.average(x_su * y_su)

To compute *r* (the correlation coefficient), we multiply the standard units of `x` and `y` and take the average of the products.
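
As a cross-check on this formula, the average of the products of standard units should agree with NumPy's built-in `np.corrcoef`. The sketch below uses made-up data and a hypothetical helper name, `correlation_from_arrays`, which works on plain arrays instead of a `Table`.

```python
import numpy as np

def standard_units(x):
    return (x - np.average(x)) / np.std(x)

def correlation_from_arrays(x, y):
    # Hypothetical helper (not part of this notebook): same formula as
    # correlation(t, x, y), but it takes arrays directly.
    return np.average(standard_units(x) * standard_units(y))

# Made-up data: an upward-tilting line plus random noise.
rng = np.random.default_rng(0)
x = np.arange(-10, 10)
y = 2 * x + rng.normal(0, 5, size=20)

r_by_hand = correlation_from_arrays(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]   # off-diagonal entry of the 2x2 correlation matrix

print(r_by_hand, r_numpy)
print(np.isclose(r_by_hand, r_numpy))
```

The two numbers should match up to floating-point rounding.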

Let's focus right now on the standard units of `y`.

In [None]:
standard_units(example_tbl.column('y'))

Huh, we get lots of `nan`s again. Before reading on, ask yourself: why is this the case?

*(I added empty cells below, scroll down after trying to answer this question on your own)*

Let's look at the equation for converting a list of numbers into standard units

$\dfrac{y - \texttt{np.average}(y)}{\texttt{np.std}(y)}$

Here, what is the average of y?

In [None]:
np.average(example_tbl.column('y'))

**Question:** So for every `y`, what will the numerator here be?

In [None]:
example_tbl.column('y') - np.average(example_tbl.column('y'))

**Answer:** The numerator is 0 for every `y`, so the standard units will be 
    $\dfrac{0}{\texttt{np.std}(y)}$

**Question:** What is the standard deviation of `y`? 

In [None]:
np.std(example_tbl.column('y'))

Remember that standard deviation measures roughly
how far the data are from their average (slide 30 from https://coms1016.barnard.edu/slides/week6/lecture18_sdt_correlation.pdf).

**Answer:** Since every `y` is the same here (recall this is a horizontal line), every value sits exactly at the average, so `np.std(y)` is just 0.

So, let's go back to our equation for converting a list of numbers into standard units.

$\dfrac{y - \texttt{np.average}(y)}{\texttt{np.std}(y)}$

If we plug in what we just got, the standard units will be a list of

$\dfrac{0}{0} = \text{ ?} $ 

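Dividing by `0` is undefined, so every standard unit of `y` comes out as `nan`. Averaging products that contain `nan` gives `nan` as well. That is why the correlation is `nan`, not a number.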
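
To see this behavior in isolation, the sketch below divides zero by zero directly. NumPy follows floating-point rules here, so `0/0` produces `nan` (with a `RuntimeWarning`) instead of stopping the program, while a nonzero number divided by zero gives `inf`. Plain Python floats raise a `ZeroDivisionError` instead. `np.errstate` is used only to silence the warnings, and the variable names are illustrative.

```python
import numpy as np

# Silence NumPy's divide-by-zero and invalid-value warnings for this block only.
with np.errstate(divide='ignore', invalid='ignore'):
    zero_over_zero = np.float64(0) / np.float64(0)
    one_over_zero = np.float64(1) / np.float64(0)

print(zero_over_zero)            # the 0/0 case from the standard units above
print(one_over_zero)             # nonzero divided by zero
print(np.isnan(zero_over_zero))

# nan spreads through later arithmetic, so an average that includes nan is nan.
average_with_nan = np.average(np.array([1.0, np.nan, 3.0]))
print(average_with_nan)

# Plain Python floats do not return nan here; they raise an error.
try:
    0.0 / 0.0
except ZeroDivisionError as err:
    python_error = str(err)
    print("Plain Python raises ZeroDivisionError:", python_error)
```

This is why the notebook cell above shows `nan` instead of crashing: the arithmetic goes through NumPy arrays.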

### Big Picture

So, what does this mean? When the points lie on a horizontal line, are the two variables linearly correlated? 
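
One way to start on this question is to separate the regression line from *r*. When SD(y) is not zero, the slope formula used earlier, `r * SD(y) / SD(x)`, is algebraically the same as the average product of deviations divided by the variance of `x`, a form that never divides by SD(y). The sketch below (NumPy only, with illustrative variable names) applies that form to the horizontal-line data as a hint, not a full answer.

```python
import numpy as np

# The horizontal-line example: every y is exactly 1.
x = np.arange(-10, 10)
y = np.ones(20)

dev_x = x - np.average(x)
dev_y = y - np.average(y)   # all zeros, since every y equals the average

# Slope as (average product of deviations) / (variance of x).
# This avoids dividing by SD(y), which is 0 here.
slope = np.average(dev_x * dev_y) / np.var(x)
intercept = np.average(y) - slope * np.average(x)
resid = y - (slope * x + intercept)

print(slope, intercept)
print(np.all(resid == 0))
```

Under these assumptions the line has slope 0 and intercept 1 and passes through every point, even though *r* itself cannot be computed. Knowing `x` tells us nothing extra about `y`, because `y` never changes.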

In [None]:
# Initialize Otter
import otter
grader = otter.Notebook()
grader.export("horizontal-line-and-correlation.ipynb", pdf=False)