In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## Prediction ##

Let's work with Galton's dataset again.  As a reminder, for each child in a family, Galton recorded the average height of the child's two parents and the height of the child in adulthood.  The former is called the "midparent height" in this notebook.  (Why average the heights of the two parents?  He reasoned that if either the father or the mother was tall, their child might be more likely to be tall, and if both were tall the effect might be even stronger, so the heights of both parents are relevant.)  We're going to investigate how well one could predict the adult height of a couple's child from the heights of the two parents.

In [None]:
# Run this cell

galton = Table.read_table('data/galton.csv')
galton

In [None]:
# Assign to `heights` a table with the columns "MidParent" and "Child"
# that shows the mid-parent height and child height, respectively,
# based on the table `galton`
#

...
heights

Recall that it is usually a good idea to visualize a new dataset.  Let's look at a scatter plot to see whether there is any association between midparent height and the adult height of the child.

In [None]:
# Create a scatter plot of `heights`

# Hint: We have not used this code in a while! 
# Looking at the documentation may be helpful

...

It looks like there might be a weak association.  Can we use it for prediction?  If we know the heights of the two parents, can we predict how tall their child will be as an adult?  Let's work out how to do that.  The principle is to find other couples with a similar midparent height whose child's adult height we already know, and use those children to predict.  There may be several similar couples, so we'll average the heights of all of their children.
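
The averaging idea described above can be sketched in plain NumPy. This is only an illustration: the heights below are made up (they are not Galton's data), the name `predict_from_neighbors` is hypothetical, and treating the window as inclusive at both ends is an assumption.

```python
import numpy as np

# Made-up (midparent, child) heights in inches; NOT Galton's data.
midparent = np.array([64.0, 66.2, 66.8, 67.1, 68.0, 68.4, 69.0, 70.5, 72.0])
child = np.array([63.5, 66.0, 67.5, 66.5, 68.0, 69.0, 68.5, 70.0, 71.5])

def predict_from_neighbors(h, x, y, window=0.5):
    """Average the y values of all points whose x is within `window` of h.

    Hypothetical helper for illustration; the window is inclusive here.
    """
    near = np.abs(x - h) <= window      # boolean mask of "similar" couples
    if not near.any():
        return np.nan                   # no similar couples, so no prediction
    return y[near].mean()

# Only the couples at 66.8 and 67.1 fall within 0.5 of 67.0,
# so the prediction is the average of their children's heights.
prediction = predict_from_neighbors(67.0, midparent, child)
print(prediction)
```

Note that a midparent height with no nearby couples has nothing to average, which is why the sketch returns `nan` in that case.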

In [None]:
# Create a function called `predict_child` that takes a midparent height h.
# Within the function, assign to `close_points` a table based on `heights`
# that contains only the rows whose `MidParent` value is within 0.5 of h
# (that is, between h - 0.5 and h + 0.5)

# Note: This is very similar to what we did in Lesson 26

# This function will return the average value of the column `Child` based 
# on the table `close_points`

... :
    """Return a prediction of the height of a child 
    whose parents have a midparent height of h.
    
    The prediction is the average height of the children 
    whose midparent height is in the range h plus or minus 0.5 inches.
    """
    
    close_points = ...
    
    return ... 

In [None]:
# Use the function `predict_child` to find the predicted height of a child
# whose parents have various values for their MidParent height

predict_child(...)

In [None]:
# Create a table called `heights_with_predictions` that adds the column 
# `Prediction` to the table `heights`, and uses the function 
# `predict_child` on the `MidParent` column of the table `heights`
# by way of .apply

# Hint: To recall how to use .apply, refer to the documentation

heights_with_predictions = ...
heights_with_predictions

Let's visualize the predictions alongside the data.  The blue dots are the observed data; the yellow dots are the predictions this method would have made.  You can see that the taller the parents are, the taller we predict the child will be.

In [None]:
# Create a scatter plot for the table `heights_with_predictions`
# with `MidParent` on the horizontal axis

...

## Association ##

Let's look at another example. This time we'll look at a dataset of different car models, with various attributes for each model.  We'll investigate which attributes are associated with one another.

In [None]:
# Run this cell

hybrid = Table.read_table('data/hybrid.csv')
hybrid

In [None]:
# Sort the table `hybrid` by `msrp` in descending order

... 

Let's check whether there is an association between miles-per-gallon and the manufacturer's suggested retail price (msrp).

In [None]:
# Create a scatter plot to visualize the association between mpg and msrp
# based on the table `hybrid`

...

It looks like there is some kind of association.  Is it a linear association?  Why do you think there is an association?  At first glance, one might have thought that building a car that gets better miles-per-gallon requires better technology, which would lead to a more expensive car -- but that's not what we're seeing.  Any guesses why?

How about acceleration vs msrp?  Is there an association?  A linear association?  Why?

In [None]:
# Create a scatter plot to visualize the association between acceleration 
# and msrp based on the table `hybrid`

...

In [None]:
# Assign to `suv` a new table, based on `hybrid`, that only shows the vehicles
# that are classified as an SUV

suv = ...

In [None]:
# Create a scatter plot to visualize the association between acceleration 
# and msrp based on the table `suv`

... 

In [None]:
# Create a scatter plot to visualize the association between mpg and msrp
# based on the table `suv`

...

In [None]:
# Create a function called `standard_units` that takes in any array
# and returns the array in standard units

# Recall, standard units are found by taking the difference between the
# data and the average of a data set, and then dividing by the 
# standard deviation of the same data set

def ...:
    "Convert any array of numbers to standard units."
    return ...
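
As a small worked example of the arithmetic described in the comments above, here is a sketch in plain NumPy. The numbers are made up and the name `to_standard_units` is hypothetical. It assumes the population standard deviation, which is what `np.std` computes by default.

```python
import numpy as np

def to_standard_units(values):
    """Hypothetical helper: (value - mean) / SD, using the population SD."""
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)

# Made-up sample with mean 5 and SD 2, so the arithmetic is easy to follow.
sample = np.array([2, 4, 4, 4, 5, 5, 7, 9])
su = to_standard_units(sample)
print(su)   # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
```

For instance, the value 9 is 4 above the mean of 5, and 4 divided by the SD of 2 gives 2 standard units. After conversion the data always have mean 0 and SD 1, which is why the plots in standard units can share the same axis limits.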

In [None]:
# Run this cell

Table().with_columns(
    'mpg (standard units)',  standard_units(suv.column('mpg')), 
    'msrp (standard units)', standard_units(suv.column('msrp'))
).scatter(0, 1)
plots.xlim(-3, 3)
plots.ylim(-3, 3);

In [None]:
# Run this cell.
# How does it compare to the scatter plot above?

suv.scatter('mpg', 'msrp')

In [None]:
# Run this cell

Table().with_columns(
    'acceleration (standard units)', standard_units(suv.column('acceleration')), 
    'msrp (standard units)',         standard_units(suv.column('msrp'))
).scatter(0, 1)
plots.xlim(-3, 3)
plots.ylim(-3, 3);

In [None]:
# Run this cell.
# How does it compare to the scatter plot above?

suv.scatter('acceleration', 'msrp')

## Correlation ##

In [None]:
# Below, we define a function that generates a scatter plot whose
# correlation is approximately the value r that we pass in
# Review the function to make sure you understand what the code is doing

# Run this cell

def r_scatter(r):
    "Generate a scatter plot with a correlation approximately r"
    plots.figure(figsize=(5,5))
    x = np.random.normal(0, 1, 1000)
    z = np.random.normal(0, 1, 1000)
    y = r*x + (np.sqrt(1-r**2))*z
    plots.scatter(x, y, color='darkblue', s=20)
    plots.xlim(-4, 4)
    plots.ylim(-4, 4)
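
Why does `y = r*x + sqrt(1-r**2)*z` give a correlation of about `r`? Since `x` and `z` are independent with variance 1, `y` has variance `r**2 + (1 - r**2) = 1` and its covariance with `x` is `r`. The sketch below checks this numerically without plotting. The name `simulate_xy`, the seed, and the sample size are choices made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the check is repeatable

def simulate_xy(r, n=10_000):
    """Hypothetical helper: same construction as r_scatter, minus the plot."""
    x = rng.normal(0, 1, n)
    z = rng.normal(0, 1, n)
    y = r * x + np.sqrt(1 - r**2) * z
    return x, y

results = {}
for target in (-0.8, 0.0, 0.5):
    x, y = simulate_xy(target)
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
    results[target] = np.corrcoef(x, y)[0, 1]
    print(target, round(results[target], 2))
```

The sample correlation will not equal the target exactly, because the points are random, but it should be close for a sample of this size.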

In [None]:
# Use various values of r that are between -1 and 1
# to see the scatter plot that is created

r_scatter(...)

## Calculating $r$ ##

In [None]:
# Here we are creating a table to use to calculate r
# for the data set

# Run this cell

x = np.arange(1, 7, 1)
y = make_array(2, 3, 1, 5, 2, 7)
t = Table().with_columns('x', x, 'y', y)

t

In [None]:
# Create a scatter plot of our values from the table `t`

# Please do not delete the last portion of the code that is already 
# provided

... (... , ... , s=30, color='red')

In [None]:
# Add two columns to the table `t`

# The first column should be labeled `x (standard units)`, 
# and takes the values from using the function `standard_units`
# on the x-values.

# The second column should be labeled `y (standard units)`, 
# and takes the values from using the function `standard_units` 
# on the y-values.

t = ...
t

In [None]:
# Create a new scatter plot of the table t
# How is it different from the first scatter plot?

...

In [None]:
# Add another column to the table `t` with the label
# `product of standard units`.  The values in this column
# are generated by finding the product of the x and y values 
# in standard units

t = ...
t

In [None]:
# r is the average of the product of the x and y values
# in standard units.  Assign this value to r.

r = ...
r

In [None]:
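# The function below packages the steps above: it converts both columns
# to standard units and returns the average of their products.

# Run this cell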

def correlation(t, x, y):
    """t is a table; x and y are column labels"""
    x_in_standard_units = standard_units(t.column(x))
    y_in_standard_units = standard_units(t.column(y))
    return np.average(x_in_standard_units * y_in_standard_units)
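
As a cross-check, the "average of products in standard units" recipe should agree with NumPy's built-in Pearson correlation. The sketch below uses the same `x` and `y` values as the table `t`, but in plain NumPy arrays. It assumes standard units are computed with the population SD (the `np.std` default).

```python
import numpy as np

x = np.arange(1, 7, 1)
y = np.array([2, 3, 1, 5, 2, 7])

# Convert each variable to standard units (population SD assumed).
x_su = (x - x.mean()) / x.std()
y_su = (y - y.mean()) / y.std()

r_by_hand = np.mean(x_su * y_su)        # average of the products
r_numpy = np.corrcoef(x, y)[0, 1]       # off-diagonal entry of the 2x2 matrix

print(r_by_hand, r_numpy)
```

The two values agree up to floating-point rounding, even though `np.corrcoef` is organized differently internally.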

In [None]:
# Run this cell

# What do you notice?

correlation(t, 'x', 'y')

In [None]:
# Run this cell

# What do you notice?

correlation(suv, 'mpg', 'msrp')

In [None]:
# Run this cell

# What do you notice?

correlation(suv, 'acceleration', 'msrp')

### Switching Axes ###

In [None]:
# Going back to our table `t`
# Run this cell

correlation(t, 'x', 'y')

In [None]:
# Let's create a scatter plot of our values from the table `t`

# The x values should be represented along the x-axis and the 
# y values should be represented along the y-axis

# Please do not delete the last portion of the code that is already 
# provided

... (... , ... , s=30, color='red')

In [None]:
# Create a scatter plot of our values from the table `t`

# The y-values should be represented along the x-axis and the 
# x-values should be represented along the y-axis

# Please do not delete the last portion of the code that is already 
# provided

... (... , ... , s=30, color='red')

In [None]:
# Now, let's see how changing the axes will impact our correlation

correlation(t, 'y', 'x')

### Nonlinearity ###

In [None]:
new_x = np.arange(-4, 4.1, 0.5)
nonlinear = Table().with_columns(
        'x', new_x,
        'y', new_x**2
    )
nonlinear.scatter('x', 'y', s=30, color='r')

In [None]:
correlation(nonlinear, 'x', 'y')
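
The result above can be reproduced in plain NumPy. In this sketch `y` is completely determined by `x`, yet `r` is essentially zero, because `r` measures only linear association and the positive and negative halves of the parabola cancel. Restricting to `x >= 0` (a variation added here for contrast, not part of the original example) removes the cancellation.

```python
import numpy as np

x = np.arange(-4, 4.1, 0.5)
y = x**2

r_full = np.corrcoef(x, y)[0, 1]
print(abs(r_full) < 1e-12)    # True: no linear association over the full range

# Contrast (added for illustration): keep only the right half of the parabola.
right = x >= 0
r_right = np.corrcoef(x[right], y[right])[0, 1]
print(r_right)                # strongly positive, though the curve is not a line
```

A value of `r` near zero therefore does not mean the variables are unrelated, and a value near 1 does not mean the relation is a straight line. Always look at the scatter plot.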

### Outliers ###

In [None]:
line = Table().with_columns(
        'x', make_array(1, 2, 3, 4),
        'y', make_array(1, 2, 3, 4)
    )
line.scatter('x', 'y', s=30, color='r')

In [None]:
correlation(line, 'x', 'y')

In [None]:
outlier = Table().with_columns(
        'x', make_array(1, 2, 3, 4, 5),
        'y', make_array(1, 2, 3, 4, 0)
    )
outlier.scatter('x', 'y', s=30, color='r')

In [None]:
correlation(outlier, 'x', 'y')
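
The two results can be compared side by side in plain NumPy, using the same values as the tables `line` and `outlier`. This is a sketch that uses `np.corrcoef` in place of the notebook's `correlation` function.

```python
import numpy as np

# Four points exactly on a line.
line_x = np.array([1, 2, 3, 4])
line_y = np.array([1, 2, 3, 4])

# The same four points plus one point far from the line.
out_x = np.array([1, 2, 3, 4, 5])
out_y = np.array([1, 2, 3, 4, 0])

r_line = np.corrcoef(line_x, line_y)[0, 1]
r_outlier = np.corrcoef(out_x, out_y)[0, 1]

print(r_line)      # 1 up to rounding: a perfect linear association
print(r_outlier)   # 0 up to rounding: one point wipes out the correlation
```

With the outlier included, the products of deviations from the means are 2, 0, 0, 2 and -4, which sum to zero. A single point is enough to take `r` from 1 to 0 here.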