# EDA Part 2: Looking at Two Variables

# Scatterplots

So far, you have looked at exploring one variable at a time through summary statistics, histograms, boxplots, etc.

What do you do when you have more than one numeric variable? You can start by visualizing the relationship between the variables through a scatterplot.

A scatterplot can provide a quick view of the general relationship between variables. On a scatterplot, each point corresponds to a single observation.

## Things to Look for in a Scatterplot:

1. **Direction of Association:** Positive or Negative
2. **Form of Association:** Linear? Curved? Neither?
3. **Strength of Association**
4. **Outliers**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Load the `auto-mpg.csv` dataset and look at the first 5 rows with the `pandas` `.head()` method.

In [None]:
cars = pd.read_csv('../data/auto-mpg.csv')

In [None]:
cars.head()

You want to see the relationship between horsepower and engine size (displacement). First make sure these are both numeric variables and don't contain null values. You can use the `pandas` `.info()` method to check both of these things.

In [None]:
cars.info()

You can see that displacement is a floating-point number (`float64`). You can also see that horsepower has the `object` dtype, which is how pandas typically stores strings or other text data. You need to convert horsepower to a numeric data type.

In [None]:
cars['horsepower'].sort_values()

In [None]:
cars['horsepower'] = pd.to_numeric(cars['horsepower'], errors='coerce')
cars.head(3)

In [None]:
cars.info()

The `errors = 'coerce'` argument introduced some missing values into our data set. To save ourselves some trouble later, let's drop the null values.

In [None]:
cars = cars.dropna()
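To see what `errors = 'coerce'` does in isolation, here is a minimal sketch on a toy series. The values, including the `'?'` placeholder, are made up for illustration and are not taken from `auto-mpg.csv`.

```python
import pandas as pd

# Toy values (an assumption for illustration): one entry is a placeholder string
raw = pd.Series(['130', '165', '?', '150'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
converted = pd.to_numeric(raw, errors='coerce')

print(converted.dtype)         # float64
print(converted.isna().sum())  # 1 value could not be converted
print(raw[converted.isna()])   # the original entries that failed to parse
```

Filtering the original strings with `converted.isna()`, as in the last line, is one way to inspect exactly which entries were turned into missing values before you drop them.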

Now you are ready to plot your scatterplot!

Note that if we want a larger plot, we'll need to pass in `figsize` as an argument to the `.plot` method.

In [None]:
cars.plot(kind = 'scatter', x = 'displacement', y = 'horsepower', figsize = (10,6))
plt.title('Horsepower vs. Displacement');

**Question:** Based on this scatterplot, what can we say about the relationship between displacement and horsepower?

**Write a few observations about the scatterplot here.**

If you want to bring in an additional variable, such as the number of cylinders, to see how it is related to the first two, you can use seaborn to color the points with the `hue` argument.

In [None]:
import seaborn as sns

In [None]:
plt.figure(figsize = (12,8))

sns.scatterplot(data = cars, x = 'displacement', y = 'horsepower',
                hue = 'cylinders', palette = 'Blues', edgecolor = 'black'
               )
plt.title('Horsepower vs. Displacement');

**Question:** What additional information does this give us about this relationship?

**Write down your observations here.**

There are a few cars that are somewhat far from the general trend. Let's investigate. We can accomplish this by slicing the dataframe. This is done by adding a set of square brackets including the conditions we are using to slice the data.

For example, if we want to find all cars with over 200 horsepower, we can do it by typing:

In [None]:
cars['horsepower'] > 200

In [None]:
cars[cars['horsepower'] > 200]

Notice that if you are specifying multiple conditions, you must place each condition inside a set of parentheses and separate them with an ampersand (`&`). The parentheses are required because `&` binds more tightly than comparison operators such as `<` and `>`. In this context, `&` means "and". To slice based on an "or" condition, use the pipe (`|`).

In [None]:
cars[(cars['displacement'] < 250) & (cars['horsepower'] > 175)]

These three vehicles are all large trucks.
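The filtering pattern used above can be tried on a small made-up table. The sketch below uses invented rows (not the auto-mpg data) to contrast an "and" filter with an "or" filter.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'displacement': [200, 240, 300, 350],
    'horsepower':   [180, 100, 150, 220],
})

# "and": both conditions must hold for a row to be kept
both = toy[(toy['displacement'] < 250) & (toy['horsepower'] > 175)]

# "or": at least one condition must hold for a row to be kept
either = toy[(toy['displacement'] < 250) | (toy['horsepower'] > 175)]

print(len(both))    # 1
print(len(either))  # 3
```

Each comparison produces a column of `True`/`False` values, and `&` or `|` combines those columns row by row before the dataframe is sliced.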

There is a single 6-cylinder vehicle that stands out since it has higher horsepower than other cars with its displacement value. Investigate this car.

In [None]:
# Your Code Here

Next, take a look at the relationship between weight and acceleration.

Note that acceleration shows time (in seconds) to accelerate from 0 to 60 mph.

In [None]:
cars.plot(kind = 'scatter', x = 'weight', y = 'acceleration', figsize = (12,8))
plt.title('Acceleration vs. Weight');

**Question:** What does the plot tell us?

**Write a few observations here.**

Next, create a scatterplot to take a look at the relationship between horsepower and the car's fuel efficiency (mpg).

In [None]:
cars.plot(kind = 'scatter', x = 'horsepower', y = 'mpg', figsize = (12,8))
plt.title('Miles per Gallon vs. Horsepower');

**Question:** What does this scatterplot tell us?

There is one car that has around 130 horsepower but an unusually high mpg for that range of horsepower. Which car is it?

In [None]:
# Your Code Here


Now, let's see how we can quantify the relationship between two variables.


## Covariance

For a dataset $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ we define the **covariance** as

$$cov(X, Y) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

where $\bar{x}$ is the mean of the $x_i$'s and $\bar{y}$ is the mean of the $y_i$'s.

To calculate covariance, we can use the `cov` method from pandas. We'll slice the dataframe down to the variables that we are interested in before applying this method.

Note that this returns the *covariance matrix*.

In [None]:
cars[['displacement', 'horsepower']].cov()
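As a check on the formula, the covariance can be computed term by term and compared with the off-diagonal entry of the matrix that pandas returns. This is a sketch on a small made-up sample, not on the auto-mpg data.

```python
import numpy as np
import pandas as pd

# Small made-up sample for illustration
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'y': [2.0, 4.0, 5.0, 4.0, 5.0]})

n = len(toy)
dx = toy['x'] - toy['x'].mean()   # deviations from the mean of x
dy = toy['y'] - toy['y'].mean()   # deviations from the mean of y

# Sum of the products of deviations, divided by n - 1
cov_by_hand = (dx * dy).sum() / (n - 1)

# Off-diagonal entry of the covariance matrix
cov_pandas = toy.cov().loc['x', 'y']

print(cov_by_hand, cov_pandas)
print(np.isclose(cov_by_hand, cov_pandas))
```

The covariance matrix is symmetric, so `.loc['y', 'x']` holds the same value.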

First, let's look at a special case - the covariance of a variable with itself:

$$cov(X, X) = \frac{\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})}{n-1}$$

$$ = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}$$

**Question:** What is this the same as?

In [None]:
cars['displacement'].var()

In [None]:
cars['horsepower'].var()

Now, what about when we're looking at the covariance of two different variables?

Take a look at your scatterplot again. This time we'll add a vertical line to show the mean value for displacement (plotted along the horizontal axis) and a horizontal line to show the mean value of horsepower (plotted along the vertical axis).

In [None]:
from nssstats.plots import quadrant_plot, half_plot

In [None]:
quadrant_plot(cars.displacement, cars.horsepower, labels = ['displacement', 'horsepower'], figsize = (12,8))

Let's analyze the four quadrants that result from partitioning at the average values, starting with the upper left quadrant.

In [None]:
quadrant_plot(cars.displacement,
              cars.horsepower,
              labels = ['displacement', 'horsepower'],
              quadrant = 4,
              figsize = (12,8))

Points in the upper left quadrant have lower than average displacement and higher than average horsepower. Thus, $(x_i - \bar{x})(y_i - \bar{y})$ will be a negative number times a positive number, so negative overall.

Next take a look at the lower right quadrant.

In [None]:
quadrant_plot(cars.displacement,
              cars.horsepower,
              labels = ['displacement', 'horsepower'],
              quadrant = 2,
              figsize = (12,8))

For points in the lower right, where horsepower is below average and displacement is above average, $(x_i - \bar{x})(y_i - \bar{y})$ will also be negative.

Next, focus on the lower left quadrant, where both displacement and horsepower are below the mean.

In [None]:
quadrant_plot(cars.displacement,
              cars.horsepower,
              labels = ['displacement', 'horsepower'],
              quadrant = 3,
              figsize = (12,8))

Points in the lower left quadrant have lower than average displacement and lower than average horsepower. Thus, $(x_i - \bar{x})(y_i - \bar{y})$ will be a negative number times a negative number, so positive overall.

Finally, take a look at the upper right quadrant.

In [None]:
quadrant_plot(cars.displacement,
              cars.horsepower,
              labels = ['displacement', 'horsepower'],
              quadrant = 1,
              figsize = (12,8))

For points in the upper right quadrant, $(x_i - \bar{x})(y_i - \bar{y})$ will be positive.

If you have more points in the lower left and upper right, when finding the covariance, you will be adding a lot of positive numbers, so the outcome is likely to be positive.

If you have more points in the upper left and lower right, when calculating the covariance, you will be adding a lot of negative numbers, so the outcome is likely to be negative.

These two scenarios correspond to a positive trend and a negative trend, respectively.

On the other hand, if points are roughly evenly spread around the four quadrants (no trend), then when finding the covariance, you will be adding a lot of both positive and negative values, so overall, the covariance will be close to zero.
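The quadrant argument can also be checked numerically without any plotting. The sketch below uses made-up points with an upward trend, computes each point's contribution $(x_i - \bar{x})(y_i - \bar{y})$, and counts how many contributions are positive and how many are negative.

```python
import numpy as np

# Made-up points with a clear upward trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])

dx = x - x.mean()
dy = y - y.mean()
products = dx * dy   # one term of the covariance sum per point

# Lower left and upper right points give positive products;
# upper left and lower right points give negative products
n_positive = int((products > 0).sum())
n_negative = int((products < 0).sum())

covariance = products.sum() / (len(x) - 1)
print(n_positive, n_negative, covariance)
```

Here most points fall in the lower left or upper right, so the positive terms dominate and the covariance comes out positive.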

Let's look at this in another way. First look at observations with below-average displacement.

In [None]:
half_plot(cars.displacement, cars.horsepower, labels = ['displacement', 'horsepower'],
          figsize = (12,8), half = 'left')

Notice that most observations with below-average displacement will also have below-average horsepower.

Now, look at points that have above-average displacement.

In [None]:
half_plot(cars.displacement, cars.horsepower, labels = ['displacement', 'horsepower'],
          figsize = (12,8), half = 'right')

Notice that most points with above-average displacement also have above-average horsepower.

This is another way to understand covariance. 

**Positive covariance** between variables $a$ and $b$ means that observations with above-average values for $a$ tend to also have above-average values for $b$ and vice versa.

**Negative covariance** between variables $a$ and $b$ means that observations with above-average values for $a$ tend to have below-average values for $b$ and vice versa.

Covariance actually does a little more than this - it detects the strength of a _linear_ relationship - that is, do the points tend to follow a straight line?

You can check this using the horsepower and mpg variables.

In [None]:
cars[['horsepower', 'mpg']].cov()

In [None]:
half_plot(cars.horsepower, cars.mpg, labels = ['horsepower', 'mpg'],
          figsize = (12,8), half = 'left')

In [None]:
half_plot(cars.horsepower, cars.mpg, labels = ['horsepower', 'mpg'],
          figsize = (12,8), half = 'right')

By examining the sign and the magnitude of the covariance, you can get an idea of the existence or nonexistence of a trend in the data. 

But there is one major drawback, which is that the magnitude of $(x_i - \bar{x})$ and $(y_i - \bar{y})$ depends on the measurement scale of the variables. In this case, you are measuring these differences in displacement and horsepower units (and so when you multiply them, you have displacement units $\cdot$ horsepower units).

A better approach would be to standardize these differences, so that you get a unitless measure of trend. This is exactly what the correlation does.
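To make the unit problem concrete, the sketch below measures the same made-up quantity in kilograms and in grams. The numbers are invented for illustration. Only the unit changes, yet the covariance changes by a factor of 1000, while the correlation stays the same.

```python
import numpy as np

# Made-up weights in kilograms and a second made-up variable
weight_kg = np.array([900.0, 1100.0, 1300.0, 1500.0, 1700.0])
y = np.array([30.0, 27.0, 22.0, 20.0, 16.0])

weight_g = weight_kg * 1000   # same information, different unit

# np.cov and np.corrcoef return 2x2 matrices; [0, 1] is the off-diagonal entry
cov_kg = np.cov(weight_kg, y)[0, 1]
cov_g = np.cov(weight_g, y)[0, 1]

corr_kg = np.corrcoef(weight_kg, y)[0, 1]
corr_g = np.corrcoef(weight_g, y)[0, 1]

print(cov_g / cov_kg)     # covariance scales with the unit
print(corr_kg, corr_g)    # correlation does not
```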

## (Pearson) Correlation

How can you normalize these differences? The most common way is to convert them into units of standard deviations, by dividing each deviation from the mean by the corresponding standard deviation.

$$ r = \frac{\sum_{i=1}^n \frac{(x_i - \bar{x})}{s_X}\frac{(y_i - \bar{y})}{s_Y}}{n-1} = \frac{cov(X,Y)}{s_X \cdot s_Y} $$

**Question:** Look at the numerator. What are $\frac{(x_i - \bar{x})}{s_X}$ and $\frac{(y_i - \bar{y})}{s_Y}$?

For correlation, we can use the `corr` method. 

In [None]:
cars[['displacement', 'horsepower']].corr()

This shows that the correlation between displacement and horsepower is roughly 0.897.

The correlation will always be between -1 and 1, where higher magnitudes (further from zero in either direction) signify a stronger association and the sign determines whether it is a positive or negative association.
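The two forms of the formula for $r$ can be verified against pandas on a small made-up sample. This is a sketch for illustration; the toy values are not from the auto-mpg data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'y': [2.0, 4.0, 5.0, 4.0, 5.0]})
n = len(toy)

# Standardize: subtract the mean, divide by the sample standard deviation
zx = (toy['x'] - toy['x'].mean()) / toy['x'].std()
zy = (toy['y'] - toy['y'].mean()) / toy['y'].std()

# First form: average product of the standardized values (dividing by n - 1)
r_from_z = (zx * zy).sum() / (n - 1)

# Second form: covariance divided by the product of standard deviations
r_from_cov = toy['x'].cov(toy['y']) / (toy['x'].std() * toy['y'].std())

# Built-in result
r_pandas = toy['x'].corr(toy['y'])

print(r_from_z, r_from_cov, r_pandas)
print(np.isclose(r_from_z, r_pandas), np.isclose(r_from_cov, r_pandas))
```

Both pandas `.std()` and `.cov()` divide by $n - 1$ by default, which is why all three numbers agree.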

This table gives a rule of thumb for assessing the strength of a relationship based on the correlation.

| $r$  | interpretation  |
|---|---|
| 1  | Perfect positive relationship  |
| 0.8  | Strong positive relationship  |
| 0.5 | Moderate positive relationship  |
| 0.2 | Weak positive relationship |
| 0 | No linear relationship |
| -0.2 | Weak negative relationship |
| -0.5 | Moderate negative relationship |
| -0.8 | Strong negative relationship |
| -1 | Perfect negative relationship |


The following interactive plot can be used to see what different levels of correlation look like.

In [None]:
from ipywidgets import interact, FloatSlider

@interact(corr = FloatSlider(value = 0.8, min = -1, max = 1, step = 0.01, continuous_update = False))
def make_corr_plot(corr):
    # The endpoints 0 and 1 give each variable a mean of 0.5 and a std of 0.5
    xx = np.array([0, 1])
    yy = np.array([0, 1])
    means = [xx.mean(), yy.mean()]
    stds = [xx.std() / 3, yy.std() / 3]
    # Covariance matrix: variances on the diagonal, corr * s_x * s_y elsewhere
    covs = [[stds[0]**2          , stds[0]*stds[1]*corr],
            [stds[0]*stds[1]*corr,           stds[1]**2]]

    # Draw 1000 points from a bivariate normal with the requested correlation
    m = np.random.multivariate_normal(means, covs, 1000).T
    fig, ax = plt.subplots(figsize = (8,6))
    ax.scatter(m[0], m[1])
    ax.set_title('Correlation = {:.2f}'.format(corr))
    ax.axis('equal');

The Pearson correlation measures the strength of a *linear* relationship between variables, that is, how closely the points follow a line. If you want to see this trendline, you can use the `polyfit` function from numpy. You'll learn much more about this trendline when we talk about linear regression later in the course.

In [None]:
fig, ax = plt.subplots(figsize = (12,8))
cars.plot(kind = 'scatter', x = 'displacement', y = 'horsepower', ax = ax)

x = np.linspace(cars['displacement'].min(), cars['displacement'].max(), 100)
z = np.polyfit(cars['displacement'], cars['horsepower'], 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")

plt.title('horsepower vs. displacement');

**Caution:** the correlation coefficient that you have encountered so far only measures the strength of a *linear* relationship. It is possible that two variables can have a strong *nonlinear* relationship that cannot be detected by using the correlation coefficient. It is advisable to always plot your variables against each other rather than just relying on the correlation coefficient.

For example, the following variables have a very strong relationship. In fact, there is a formula to find the value of $y$ based on the value of $x$.

In [None]:
x = np.linspace(start = -3, stop = 3, num=50)
y = x**2

plt.scatter(x,y);

However, you won't get very far trying to detect this relationship using the correlation coefficient, which in this case is essentially 0.

Notice that you can also calculate correlation using functions from numpy.

In [None]:
np.corrcoef(x,y)

Now, let's look at the correlations between all of our variables.

In [None]:
cars[['mpg', 'cylinders', 'displacement', 'horsepower', 'acceleration', 'weight']].corr()

To find the correlation between any two variables, find the intersection of the row and column for those variables.

For example, the correlation between weight and acceleration is -0.417.

It appears that the strongest relationship is between cylinders and displacement.
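Reading a large correlation matrix by eye gets tedious, so a lookup can help. The sketch below, on made-up data with hypothetical column names `a`, `b`, and `c`, pulls out one pair with `.loc` and then finds the pair with the largest absolute correlation, ignoring the diagonal.

```python
import numpy as np
import pandas as pd

# Made-up data with three columns
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.1, 3.9, 6.2, 8.0, 9.8],    # nearly a multiple of a
    'c': [5.0, 3.0, 4.0, 1.0, 2.0],
})
corr = toy.corr()

# Row/column lookup for one pair
print(corr.loc['a', 'c'])

# Keep only the entries above the diagonal so each pair appears once
# and the self-correlations of 1.0 are excluded
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# Largest absolute correlation among the remaining pairs
strongest = upper.abs().stack().idxmax()
print(strongest)
```

Taking the absolute value matters here: a correlation of -0.9 is a stronger relationship than one of 0.5, even though it is the smaller number.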

Because cylinders is a discrete variable, a scatterplot of cylinders vs displacement has points only at each discrete value of cylinder and not continuously along the horizontal axis. 

In [None]:
cars.plot(kind = 'scatter', x = 'cylinders', y = 'displacement', figsize = (12,8))
x = np.linspace(cars['cylinders'].min(), cars['cylinders'].max(), 100)
z = np.polyfit(cars['cylinders'], cars['displacement'], 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--");

Note that the type of correlation discussed above is called Pearson's correlation coefficient. It is what is most commonly referred to by the term "correlation", but it is not the only correlation measure. For a discussion of other types of correlations and appropriate use cases of these, see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3576830/.
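As one illustration of another measure, Spearman's rank correlation is the Pearson correlation computed on the ranks of the values rather than on the values themselves. The sketch below uses made-up data that always increases but is far from a straight line. Computing Spearman by ranking first is a choice made here to keep the example dependent only on pandas and numpy.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.linspace(0, 5, 30))
y = np.exp(x)    # always increasing, but far from a straight line

pearson = x.corr(y)                    # Pearson on the raw values
spearman = x.rank().corr(y.rank())     # Pearson on the ranks, i.e. Spearman

print(round(pearson, 3), round(spearman, 3))
```

Because `y` always increases with `x`, the ranks line up perfectly and the rank correlation is 1, while the Pearson correlation is noticeably lower because the points do not follow a line.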

# Exploring Categorical-Numeric Relationships

In [None]:
cars.head()

The cars dataset includes a categorical variable, the model year. Note that while year is a number, in this context it is not a measurement. Instead, it is a grouping variable, making it categorical.

In `pandas`, if we want to explore categorical vs numeric variables, we usually do it by using `groupby`. To groupby, you need to specify the column(s) to group on, followed by the column you want to aggregate, and finally an aggregation type.

Let's say we want a basic count of cars by year.
* groupby: model year
* column to aggregate: car name (can really use any column here)
* aggregation: count

In [None]:
cars.groupby('model year')['car name'].count()

We can feed the result of our groupby into a plot.

In [None]:
plt.figure(figsize = (10,6))

cars.groupby('model year')['car name'].count().plot(kind = 'bar')
plt.title('Count of Cars by Model Year')
plt.ylabel('count')
plt.xticks(rotation = 0);

However, we are not just limited to counting. We can apply any sort of aggregation we want. For example, let's say we want to see the average mpg by year.

* groupby: model year
* column to aggregate: mpg
* aggregation: mean

In [None]:
plt.figure(figsize = (10,6))

cars.groupby('model year')['mpg'].mean().plot(kind = 'bar')
plt.title('Mean mpg per Model Year')
plt.ylabel('mpg')
plt.xticks(rotation = 0);

**Question:** What does this plot tell us?

While the above plot gives some idea about how mpg has changed over the years, it gives only a single summary value per year. A boxplot, which you saw in the last set of slides, gives a more complete picture since it shows the overall distribution of mpg values per year. A boxplot shows not just a central value but also reveals something about the spread of values in a given year.

In [None]:
plt.figure(figsize = (10,6))

sns.boxplot(data = cars, x = 'model year', y = 'mpg');

**Question:** What additional information do we get from the boxplot that was not obvious from the mean plot above?
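Some of what a boxplot shows can also be tabulated by passing several aggregations to `groupby` at once. The sketch below uses made-up values for two model years, not the actual auto-mpg rows.

```python
import pandas as pd

# Made-up values for two model years
toy = pd.DataFrame({
    'model year': [70, 70, 70, 71, 71, 71],
    'mpg':        [15.0, 18.0, 21.0, 20.0, 25.0, 33.0],
})

# One row per group, one column per aggregation
summary = toy.groupby('model year')['mpg'].agg(['mean', 'median', 'min', 'max'])
print(summary)
```

Comparing the `mean` and `median` columns, or the gap between `min` and `max`, gives a rough numeric counterpart to what the boxes and whiskers display.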

**Your Turn**: The origin variable shows the origin of the vehicle (1 = American, 2 = European, 3 = Japanese).

Look at the relationship between mpg and origin. What do you find?

In [None]:
# Your Code Here

**Your Turn #2**: Above, we treated cylinders as a numeric variable, but we might also want to view it as a categorical variable. Examine the relationship between number of cylinders and the acceleration. What do you find? You might also look at the relationship between cylinders and other variables.

In [None]:
# Your Code Here

## Categorical-Categorical

When studying two categorical variables, you can perform *cross-tabulation* to see how the sample is distributed across the categories.

For this example, you'll look at data from the 2018 Central Park Squirrel Census, which can be obtained from https://data.cityofnewyork.us/Environment/2018-Central-Park-Squirrel-Census-Squirrel-Data/vfnx-vebw.

In [None]:
squirrels = pd.read_csv('../data/2018_Central_Park_Squirrel_Census_-_Squirrel_Data.csv')
squirrels.head()

Let's say you are interested in seeing if squirrels of different colors behave differently around humans. You'll be looking specifically at the `Primary Fur Color` and `Runs from` columns. The `Runs from` column indicates "Squirrel was seen running from humans, seeing them as a threat." You'll be using the `pandas` `crosstab` function for this.

In [None]:
pd.crosstab(squirrels['Primary Fur Color'], squirrels['Runs from'])

By default, `crosstab` will return counts, which can give an idea about the relative size of each group, but makes it difficult to assess exact proportions. Luckily, you can normalize your measurements to give relative proportions by specifying the `normalize` argument. To normalize across rows, you can specify `normalize = 'index'`.

In [None]:
pd.crosstab(squirrels['Primary Fur Color'], squirrels['Runs from'], normalize='index')

Based on this, you can see that a larger proportion of black squirrels run from humans than other colors of squirrels.
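The effect of `normalize = 'index'` is easiest to see on a tiny table. The sketch below uses made-up observations with hypothetical column names `color` and `runs`; it is not the squirrel census data.

```python
import pandas as pd

# Made-up observations for illustration
toy = pd.DataFrame({
    'color': ['gray', 'gray', 'gray', 'gray', 'black', 'black'],
    'runs':  ['yes', 'no', 'no', 'no', 'yes', 'no'],
})

counts = pd.crosstab(toy['color'], toy['runs'])
row_props = pd.crosstab(toy['color'], toy['runs'], normalize='index')

print(counts)
print(row_props)
print(row_props.sum(axis=1))   # every row adds up to 1
```

In the counts, each color has exactly one "yes", which hides the fact that the groups differ in size. After normalizing across rows, the proportion is 0.5 for black and 0.25 for gray. The `normalize` argument also accepts `'columns'` and `'all'` if you want proportions within each column or of the whole table.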

You can visualize your findings using a side-by-side barplot.

In [None]:
pd.crosstab(squirrels['Primary Fur Color'],
            squirrels['Runs from'], 
            normalize='index').plot(kind = 'bar', 
                                    edgecolor = 'black', 
                                    width = 0.75)
plt.ylabel('Proportion');

Or you can show them in a stacked bar plot.

In [None]:
pd.crosstab(squirrels['Primary Fur Color'], 
            squirrels['Runs from'], 
            normalize='index').plot(kind = 'bar', 
                                    edgecolor = 'black', 
                                    width = 0.75,
                                    stacked = True)
plt.ylabel('Proportion');

**Your turn** 

Create a stacked bar plot from the cars dataset showing the proportion of cars having each number of cylinders by year. 

What new information does this plot give about this dataset? 

In [None]:
# Your Code Here