Part of EDA is understanding the relationship between two (or more) variables.

In this notebook, we'll continue exploring the Palmer Penguins dataset and learn tools for looking at relationships between variables.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
penguins = pd.read_csv('../data/penguins.csv')
penguins.head()

As in the single-variable case, the tool we use depends on the types of the variables we are examining.

## Examining Two Categorical Variables

Suppose we want a cross-tabulation of the number of observations of each species, broken down by island.

For this, we can use the `pandas` `crosstab` function. We pass in the two columns we want to tabulate: the first supplies the rows of the table and the second supplies the columns.

In [None]:
pd.crosstab(penguins['island'], penguins['species'])

We can take the results and create a plot.

In [None]:
pd.crosstab(penguins['island'], penguins['species']).plot(kind = 'bar', 
                                                          stacked = True,       # stacked as opposed to side-by-side
                                                          color = ['cornflowerblue', 'coral', 'pink'],     # change the default colors
                                                          edgecolor = 'black')              # add a border to the bars
plt.title('Penguin Species Distribution by Island')                   # add a title
plt.xticks(rotation = 0);                                             # change the appearance of the x tick labels

You can also normalize the cross-tabulation. Since we are working with two variables here, we can normalize in a number of ways. 

For example, if we just want the proportion of total observations that are contained in each cell, we can use the `normalize = 'all'` option.

In [None]:
pd.crosstab(penguins['island'], penguins['species'],  normalize = 'all')

Or if we want proportions by row, we can use the `normalize = 'index'` option.

In [None]:
pd.crosstab(penguins['island'], penguins['species'],  normalize = 'index')
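
As a quick check on what each `normalize` option does, here is a minimal sketch on a small made-up DataFrame (the `toy` data is hypothetical and only stands in for `penguins`). It also shows `normalize = 'columns'`, which gives proportions by column instead of by row.

```python
import pandas as pd

# Hypothetical toy data, not the real penguins dataset
toy = pd.DataFrame({
    'island':  ['Biscoe', 'Biscoe', 'Biscoe', 'Dream', 'Dream', 'Torgersen'],
    'species': ['Adelie', 'Gentoo', 'Gentoo', 'Adelie', 'Chinstrap', 'Adelie'],
})

# Proportions within each row (each island sums to 1)
by_row = pd.crosstab(toy['island'], toy['species'], normalize = 'index')

# Proportions within each column (each species sums to 1)
by_col = pd.crosstab(toy['island'], toy['species'], normalize = 'columns')

row_sums = by_row.sum(axis = 1)   # one total per island
col_sums = by_col.sum(axis = 0)   # one total per species
```

With `'index'`, each row answers "what share of this island's penguins belong to each species?", while `'columns'` answers "what share of this species lives on each island?".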

In [None]:
(pd.crosstab(penguins['island'], penguins['species'],  normalize = 'index') * 100).plot(kind = 'bar', 
                                                                                        stacked = True,
                                                                                        color = ['cornflowerblue', 'coral', 'pink'],
                                                                                        edgecolor = 'black',
                                                                                        width = 0.75,
                                                                                       )
plt.title('Percentage of Species by Island')
plt.xticks(rotation = 0)
plt.legend(bbox_to_anchor = (1, 0.8), loc = 'upper left');   # move the legend to the right side of the plot

## Categorical-Numeric Combinations

Let's say we want to look at the average body mass by species. One way to do this is to use `.groupby`.

Quite often when using `.groupby()`, our goal is to calculate an aggregate value by group. To use `.groupby()`, we need to tell pandas: 
* **what** to group by
* **which** column (or columns) we want to aggregate
* **how** to aggregate

In this case, we want to group by `species` and then aggregate the `body_mass_g` column by taking the _mean_.

In [None]:
penguins.groupby('species')['body_mass_g'].mean()
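
The "how" step does not have to be a single statistic. As a rough sketch on made-up numbers (the `toy` DataFrame below is hypothetical, not the real measurements), `.agg()` accepts a list of aggregation names and returns one column per statistic.

```python
import pandas as pd

# Hypothetical toy data, not the real penguins dataset
toy = pd.DataFrame({
    'species':     ['Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Chinstrap'],
    'body_mass_g': [3500.0, 3900.0, 5000.0, 5200.0, 3700.0],
})

# what: group by species; which: body_mass_g; how: mean, median, and count
summary = toy.groupby('species')['body_mass_g'].agg(['mean', 'median', 'count'])
summary
```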

You can also use `.describe()` with `.groupby()` to get more summary statistics by species.

In [None]:
penguins.groupby('species')['body_mass_g'].describe()

**Question:** Looking at the summary statistics by group, what do you notice?

We can easily compare distributions using a grouped boxplot.

In [None]:
plt.figure(figsize = (10,6))

sns.boxplot(data = penguins,
           x = 'body_mass_g',
           y = 'species');

## Comparing Two Numeric Variables

One way to assess the relationship between two numeric variables is to find the correlation. This can be accomplished using the `.corr()` method, which returns the correlation matrix.

In [None]:
penguins[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].corr()

**Question:** Do you notice anything interesting when inspecting the correlation values?

Let's investigate the relationship between bill length and bill depth. We can do this using a scatterplot.

First, let's use the `.plot` method from pandas.

In [None]:
penguins.plot(kind = 'scatter',
             x = 'bill_length_mm',
             y = 'bill_depth_mm',
             figsize = (10,6));

We do need to remember that we have multiple species of penguins. To get a better understanding of the relationship between these variables, we could color the points by species. The easiest way to do this is using the _seaborn_ library.

In [None]:
plt.figure(figsize = (10,6))

sns.scatterplot(data = penguins,
               x = 'bill_length_mm',
               y = 'bill_depth_mm',
               hue = 'species',
               palette = ['cornflowerblue', 'coral', 'pink']);
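
Coloring by group can reveal that the overall correlation and the within-group correlations disagree. The sketch below uses made-up numbers (the `toy` DataFrame and its species labels `A` and `B` are hypothetical, chosen only to make the effect obvious) to show one way of computing the correlation separately for each group.

```python
import pandas as pd

# Hypothetical toy data constructed to illustrate the idea; not real measurements
toy = pd.DataFrame({
    'species':        ['A', 'A', 'A', 'B', 'B', 'B'],
    'bill_length_mm': [38, 40, 42, 46, 48, 50],
    'bill_depth_mm':  [18, 19, 20, 14, 15, 16],
})

# Correlation ignoring the groups
overall = toy['bill_length_mm'].corr(toy['bill_depth_mm'])

# Correlation computed separately within each group
by_species = {
    name: group['bill_length_mm'].corr(group['bill_depth_mm'])
    for name, group in toy.groupby('species')
}

print(overall)
print(by_species)
```

In this toy example the overall correlation is negative even though the relationship within each group is positive. The same loop could be run on `penguins` to check whether something similar happens there.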

# End of Instruction

### Read in the schools_clean.csv

### Make a crosstab of zipcode and level

### Now make the same crosstab normalized

### Now make that crosstab into a stacked bar chart where each bar totals 100%

### Now group by zipcode and find the number of different levels each zipcode has. (Look up the `nunique()` method)

### Now find the average total students by school level. (This will be a multi-step problem)