In [None]:
from datascience import *
import numpy as np

import matplotlib.pyplot as plots
from mpl_toolkits.mplot3d import Axes3D
plots.style.use('fivethirtyeight')
%matplotlib inline

In [None]:
def standard_units(any_numbers):
    """Convert any array of numbers to standard units."""
    return (any_numbers - np.average(any_numbers)) / np.std(any_numbers)

def correlation(t, x, y):
    """Return the correlation coefficient (r) of two variables."""
    return np.mean(standard_units(t.column(x)) * standard_units(t.column(y)))

def slope(t, x, y):
    """The slope of the regression line (original units)."""
    r = correlation(t, x, y)
    return r * np.std(t.column(y)) / np.std(t.column(x))

def intercept(t, x, y):
    """The intercept of the regression line (original units)."""
    return np.mean(t.column(y)) - slope(t, x, y) * np.mean(t.column(x))

def fit(t, x, y):
    """The fitted values along the regression line."""
    a = slope(t, x, y)
    b = intercept(t, x, y)
    return a * t.column(x) + b

def plot_residuals(t, x, y):
    """Plot a scatter diagram and residuals."""
    t.scatter(x, y, fit_line=True)
    actual = t.column(y)
    fitted = fit(t, x, y)
    residuals = actual - fitted
    print('r:', correlation(t, x, y))
    print('RMSE:', np.mean(residuals**2)**0.5)
    t.select(x).with_column('Residual', residuals).scatter(0, 1)

# Residuals Examples: Medicine



In this example we focus on residuals, which quantify the error of a prediction.
A residual is the difference between the real (true) y and our estimate (prediction) of y.
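
As a minimal sketch of that definition, the cell below subtracts predictions from observed values. The numbers are hypothetical and chosen only to illustrate the arithmetic; they are not taken from `ckd.csv`.

```python
import numpy as np

# Hypothetical values, used only to illustrate the definition.
actual = np.array([5.0, 4.2, 3.9])      # true (observed) y
predicted = np.array([4.6, 4.4, 3.9])   # our estimates of y

# Residual = true value minus predicted value.
residuals = actual - predicted
print(residuals)
```

With this sign convention, a positive residual means the prediction was too low, a negative residual means it was too high, and zero means the prediction was exact.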

Here, we will try to predict a patient's *Red Blood Cell Count* from their *Hemoglobin* level.

In [None]:
ckd = Table.read_table('ckd.csv').relabeled('Blood Glucose Random', 'Glucose')
ckd.show(3)

In the next cell we will plot the relationship between these two variables.

In [None]:
ckd.scatter('Hemoglobin', 'Red Blood Cell Count')

If we add `fit_line=True` as an argument to `.scatter()`, the plot will show the regression (best-fit) line.

In [None]:
ckd.scatter('Hemoglobin', 'Red Blood Cell Count', fit_line=True)

For any given *Hemoglobin* value we can now predict the *Red Blood Cell Count* by finding the corresponding
value on the line. In the figure above, if we know a new patient's *Hemoglobin* value is $12$, we would predict the patient's *Red Blood Cell Count* to be about $4.5$.

Let's now write some code to make predictions for every individual in our dataset.

Remember that we can predict the *Red Blood Cell Count* with the following equation:
\begin{equation}
\text{predicted Red Blood Cell Count} = \text{slope} \times \text{Hemoglobin} + \text{intercept}
\end{equation}

The next code cell finds the slope and intercept.

In [None]:
ckd_h_rb = ckd.select('Hemoglobin', 'Red Blood Cell Count').relabel('Red Blood Cell Count', 'RB True')
current_slope = slope(ckd_h_rb, 'Hemoglobin', 'RB True')
current_intercept = intercept(ckd_h_rb, 'Hemoglobin', 'RB True')
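
The `slope` and `intercept` helpers defined at the top of the notebook compute the slope as $r \cdot SD(y) / SD(x)$ and the intercept as $mean(y) - slope \cdot mean(x)$. The sketch below repeats that calculation with plain NumPy on a small hypothetical dataset (not values from `ckd.csv`) and cross-checks it against `np.polyfit`, which is a different route to the same least-squares line. The `_value` names are illustrative only.

```python
import numpy as np

# Hypothetical toy data, not taken from ckd.csv.
x = np.array([8.0, 10.0, 12.0, 14.0, 16.0])
y = np.array([3.1, 3.9, 4.4, 5.2, 5.8])

def standard_units(values):
    """Convert an array of numbers to standard units."""
    return (values - np.mean(values)) / np.std(values)

# Correlation: the mean of the products of standard units.
r = np.mean(standard_units(x) * standard_units(y))

# Slope and intercept of the regression line, in original units.
slope_value = r * np.std(y) / np.std(x)
intercept_value = np.mean(y) - slope_value * np.mean(x)

# Prediction for a new x value of 12.
predicted_at_12 = slope_value * 12 + intercept_value

# Cross-check with NumPy's degree-1 least-squares fit.
ls_slope, ls_intercept = np.polyfit(x, y, 1)
print(slope_value, intercept_value, predicted_at_12)
```

Both approaches should agree up to floating-point rounding, because the regression line is the least-squares line.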

Now we can make our predictions by plugging in the slope and intercept we just computed. The second line of code adds our predictions to our table.

In [None]:
predictions = current_slope * ckd_h_rb.column('Hemoglobin') + current_intercept
ckd_h_rb = ckd_h_rb.with_columns('RB Prediction', predictions)
ckd_h_rb

Let's look at our prediction for the example where the patient's *Hemoglobin* was $12$.

In [None]:
example_row = ckd_h_rb.where('Hemoglobin', 12).row(0)
example_row.item(0), example_row.item(2)

We just predicted that the patient's *Red Blood Cell Count* is about $4.44$ (very close to what we estimated when we looked at the regression line).

However, if we look at the figure below, we can see that the real patient with a *Hemoglobin* of $12$ was an outlier.

In [None]:
ckd_h_rb.scatter('Hemoglobin')

This patient's true *Red Blood Cell Count* was $8$.

In [None]:
example_row.item(1)

Let's quantify the error of our prediction for this individual.

Before scrolling down, think about how we would compute the error of our prediction.

(skip a few cells)

The next cell computes the error for each prediction. 

In [None]:


# Residual = true value minus predicted value, computed for every patient
residuals = ckd_h_rb.column('RB True') - ckd_h_rb.column('RB Prediction')
ckd_h_rb = ckd_h_rb.with_columns('RB Residual (Error)', residuals)

ckd_h_rb.scatter('Hemoglobin')

The light blue dots represent the residuals. Each residual indicates how far our prediction was from the true value.
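
The residuals can also be summarized in a single number. The `plot_residuals` helper defined at the top of the notebook does this with the root mean squared error (RMSE). The sketch below applies the same formula to a small hypothetical dataset (not values from `ckd.csv`); it uses `np.polyfit` for the line as a stand-in for the notebook's `slope` and `intercept` helpers.

```python
import numpy as np

# Hypothetical toy data, not taken from ckd.csv.
x = np.array([8.0, 10.0, 12.0, 14.0, 16.0])
y = np.array([3.1, 3.9, 4.4, 5.2, 5.8])

# Least-squares line via NumPy (stand-in for the notebook's helpers).
slope_value, intercept_value = np.polyfit(x, y, 1)
fitted = slope_value * x + intercept_value

# Residual = true value minus fitted value.
residuals = y - fitted

# RMSE: square the residuals, average them, take the square root.
rmse = np.mean(residuals ** 2) ** 0.5

print('Mean of residuals:', np.mean(residuals))
print('RMSE:', rmse)
```

For a least-squares line the residuals average to zero (up to rounding), so the mean of the residuals says nothing about the size of the errors. Squaring before averaging keeps positive and negative errors from cancelling out.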

Let's look at the dots where *Hemoglobin* is $12$. We have three dots: the real value (dark blue), the prediction (yellow), and the residual (light blue). 

Recall that our prediction was a little less than $4.5$ and the true value was $8$. Therefore, the residual is about $3.5$. If you look at the figure above, the y-value of the light blue dot is about $3.5$.
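
The same arithmetic for this one patient can be written out directly, using the rounded values quoted in the text above. The variable names are illustrative only.

```python
# Values quoted in the text for the patient with Hemoglobin 12.
true_count = 8          # true Red Blood Cell Count
predicted_count = 4.44  # predicted Red Blood Cell Count (rounded)

# Residual = true value minus predicted value.
residual = true_count - predicted_count
print(round(residual, 2))  # 3.56
```

The residual is positive, so the regression line underestimated this patient's *Red Blood Cell Count*.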

In [None]:
# Initialize Otter
import otter
grader = otter.Notebook()
grader.export("Residuals_Example.ipynb", pdf=False)