In this notebook, we will look at some tools for exploratory data analysis (EDA) on a dataset.

We'll be using the [Palmer Penguins](https://allisonhorst.github.io/palmerpenguins/articles/intro.html) dataset, which contains size measurements for three penguin species observed on three islands in the Palmer Archipelago of Antarctica.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
penguins = pd.read_csv('../data/penguins.csv')

In [None]:
penguins.head()

**Question: Do you notice anything when you look at the first five rows of data?**

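Let's see how many null values we have. The `.info()` method reports the number of non-null values in each column, which we can compare against the total number of rows.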

In [None]:
penguins.info()

Another way to count null values is by using the `.isna()` method.

In [None]:
penguins.isna().sum()
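Since `.isna()` returns a DataFrame of True/False values, taking the mean instead of the sum gives the fraction of missing values in each column. Here is a minimal sketch on a small made-up DataFrame (the `toy` values are invented for illustration and are not taken from the penguins data).

```python
import numpy as np
import pandas as pd

# Made-up toy data standing in for the penguins DataFrame
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Chinstrap'],
    'body_mass_g': [3750.0, np.nan, 5000.0, 3700.0],
    'sex': ['male', np.nan, np.nan, 'female'],
})

null_counts = toy.isna().sum()    # number of missing values per column
null_share = toy.isna().mean()    # fraction of rows that are missing per column

print(null_counts)
print(null_share)
```

The same two lines should work unchanged on `penguins`.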

## Categorical Variables

**Warm-up Question:** How many penguins are there of each sex?

In [None]:
# Your code here

In [None]:
penguins['sex'].value_counts()

The NaN values do not show up. Place your cursor inside the parentheses for `value_counts` and press Shift+Tab to look for a way to retain the NaN values.

In [None]:
penguins['sex'].value_counts(dropna = False)

Notice also that we can normalize our value counts. For example, to get the proportion of penguins of each sex (rather than the raw counts), we can use this:

In [None]:
penguins['sex'].value_counts(normalize = True)
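The two arguments can also be combined. By default, the proportions are computed over the non-null values only. With `dropna = False`, the NaN values get their own share. A small sketch on a made-up Series (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Made-up toy Series standing in for penguins['sex']
sex = pd.Series(['male', 'female', 'male', np.nan], name = 'sex')

# Proportions among the three non-null values only
shares = sex.value_counts(normalize = True)

# Proportions among all four rows, with NaN counted as its own category
shares_with_nan = sex.value_counts(normalize = True, dropna = False)

print(shares)
print(shares_with_nan)
```

Multiplying either result by 100 converts the proportions to percentages.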

As we have seen before, we can create a bar plot of the value counts by using the `.plot` method and specifying `kind = 'bar'`.

In [None]:
penguins['species'].value_counts().plot(kind = 'bar');

## Examining Two Categorical Variables

What if we want to look at a cross-tabulation of the number of observations of each species broken down by island?

For this, we can use the `pandas` `crosstab` function. We pass in the two columns we want to tabulate: the first defines the rows of the table and the second defines the columns.

In [None]:
pd.crosstab(penguins['island'], penguins['species'])

As above, we can take the results and create a plot.

In [None]:
pd.crosstab(penguins['island'], penguins['species']).plot(kind = 'bar', 
                                                          stacked = True,       # stacked as opposed to side-by-side
                                                          color = ['cornflowerblue', 'coral', 'pink'],     # change the default colors
                                                          edgecolor = 'black')              # add a border to the bars
plt.title('Penguin Species Distribution by Island')                   # add a title
plt.xticks(rotation = 0);                                             # change the appearance of the x tick labels

You can also normalize the cross-tabulation. Since we are working with two variables here, we can normalize in a number of ways. 

For example, if we just want the proportion of total observations that are contained in each cell, we can use the `normalize = 'all'` option.

In [None]:
pd.crosstab(penguins['island'], penguins['species'],  normalize = 'all')

Or if we want proportions by row, we can use the `normalize = 'index'` option.

In [None]:
pd.crosstab(penguins['island'], penguins['species'],  normalize = 'index')

In [None]:
(pd.crosstab(penguins['island'], penguins['species'],  normalize = 'index') * 100).plot(kind = 'bar', 
                                                                                        stacked = True,
                                                                                        color = ['cornflowerblue', 'coral', 'pink'],
                                                                                        edgecolor = 'black',
                                                                                        width = 0.75,
                                                                                       )
plt.title('Percentage of Species by Island')
plt.xticks(rotation = 0)
plt.legend(bbox_to_anchor = (1, 0.8), loc = 'upper left');   # move the legend to the right side of the plot
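A third option is `normalize = 'columns'`, which makes each column sum to 1. With island as the rows and species as the columns, this answers "of all the penguins of a given species, what share was observed on each island?" A minimal sketch on made-up data (the `toy` rows are invented for illustration):

```python
import pandas as pd

# Made-up toy data standing in for the penguins DataFrame
toy = pd.DataFrame({
    'island': ['Biscoe', 'Biscoe', 'Dream', 'Dream', 'Torgersen', 'Biscoe'],
    'species': ['Adelie', 'Gentoo', 'Adelie', 'Chinstrap', 'Adelie', 'Gentoo'],
})

# Each column (species) is divided by its own total
by_column = pd.crosstab(toy['island'], toy['species'], normalize = 'columns')

print(by_column)
print(by_column.sum())    # every column total should be 1
```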

## Numeric Variables

Now, let's say that we want to study the `body_mass_g` variable.

We can get a quick summary by using the `.describe()` method.

In [None]:
penguins['body_mass_g'].describe()

We can also calculate individual summary statistics, many of which have built-in _pandas_ methods.

In [None]:
penguins['body_mass_g'].mean()

In [None]:
penguins['body_mass_g'].skew()
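A few other built-in methods that are often useful are `.median()`, `.std()`, and `.quantile()`. These methods skip NaN values by default. A small sketch on a made-up Series (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Made-up toy Series standing in for penguins['body_mass_g']
mass = pd.Series([3000, 3500, 4000, 6000, np.nan], name = 'body_mass_g')

print(mass.mean())      # NaN is ignored, so this averages the four known values
print(mass.median())    # middle of the four known values
print(mass.std())       # sample standard deviation

# Several quantiles at once: pass a list of probabilities
quartiles = mass.quantile([0.25, 0.75])
print(quartiles)
```

Comparing the mean to the median is a quick check for skew: here the single large value pulls the mean above the median.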

We have a number of options for inspecting the distribution of a numeric variable. In this notebook, we'll look at histograms and box plots.

For histograms, we can use the `.hist()` method from _pandas_.

In [None]:
penguins['body_mass_g'].hist();

As before, we can easily make modifications to this plot.

In [None]:
penguins['body_mass_g'].hist(bins = 25,
                            color = 'coral',
                            edgecolor = 'black',
                            figsize = (10,6))

plt.title('Distribution of Body Mass Values', fontsize = 16);

The _pandas_ library also includes a boxplot method, but we can get a nicer looking one using the _seaborn_ library.

In [None]:
sns.boxplot(data = penguins,
            x = 'body_mass_g');

Note that seaborn's `boxplot` function does not take a `figsize` argument. To change the figure size, we need to create the figure first, using either the `plt.figure()` or `plt.subplots()` function from matplotlib.

In [None]:
plt.figure(figsize = (10,4))

sns.boxplot(data = penguins,
            x = 'body_mass_g');
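The `plt.subplots()` route looks slightly different: it returns a figure and an axes object, and we can hand the axes to seaborn through the `ax` argument. A minimal sketch on made-up data (the `toy` values and the title are invented for illustration):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up toy data standing in for the penguins DataFrame
toy = pd.DataFrame({'body_mass_g': [3200, 3750, 4100, 4650, 5400]})

# Create the figure and axes first, setting the size here
fig, ax = plt.subplots(figsize = (10, 4))

# Tell seaborn to draw on the axes we just created
sns.boxplot(data = toy,
            x = 'body_mass_g',
            ax = ax)

ax.set_title('Body Mass (toy data)');
```

Keeping a reference to `ax` is handy when we want to adjust titles or labels afterwards, or place several plots side by side.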

## Categorical-Numeric Combinations

The histogram of body mass values is not exactly symmetric and appears to have a number of subgroups. Perhaps this distribution shape could be explained by looking at the body mass distribution by species.

One way to do this is to use `.groupby`, as we've seen before.

In [None]:
penguins.groupby('species')['body_mass_g'].describe()
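If we only need a few of these statistics, we can pass a list of names to `.agg()` instead of using `.describe()`. A minimal sketch on made-up data (the `toy` values are invented for illustration):

```python
import pandas as pd

# Made-up toy data standing in for the penguins DataFrame
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'body_mass_g': [3500, 3700, 5000, 5400],
})

# One row per species, one column per requested statistic
summary = toy.groupby('species')['body_mass_g'].agg(['count', 'mean', 'median'])

print(summary)
```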

**Question:** Looking at the summary statistics by group, what do you notice?

We can easily compare distributions using a grouped boxplot.

In [None]:
plt.figure(figsize = (10,6))

sns.boxplot(data = penguins,
           x = 'body_mass_g',
           y = 'species');

## Comparing Two Numeric Variables

One way to assess the relationship between two numeric variables is to find the correlation. This can be accomplished using the `.corr()` method, which returns the correlation matrix.

In [None]:
penguins[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].corr()
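When we only care about one pair of variables, we can call `.corr()` on one column and pass in the other. This returns a single number rather than a matrix. A minimal sketch on made-up data (the `toy` values are invented for illustration):

```python
import pandas as pd

# Made-up toy data standing in for the penguins DataFrame
toy = pd.DataFrame({
    'flipper_length_mm': [180, 190, 200, 210],
    'body_mass_g': [3400, 3900, 4100, 4600],
})

# Pearson correlation between two columns, returned as a single number
r = toy['flipper_length_mm'].corr(toy['body_mass_g'])

# The same value appears in the corresponding cell of the correlation matrix
r_from_matrix = toy.corr().loc['flipper_length_mm', 'body_mass_g']

print(r)
print(r_from_matrix)
```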

**Question:** Do you notice anything interesting when inspecting the correlation values?

Let's investigate the relationship between bill length and bill depth. We can do this using a scatterplot.

First, let's use the `.plot` method from pandas.

In [None]:
penguins.plot(kind = 'scatter',
             x = 'bill_length_mm',
             y = 'bill_depth_mm',
             figsize = (10,6));

We do need to remember that we have multiple species of penguins. To get a better understanding of the relationship between these variables, we could color the points by species. The easiest way to do this is using the _seaborn_ library.

In [None]:
plt.figure(figsize = (10,6))

sns.scatterplot(data = penguins,
               x = 'bill_length_mm',
               y = 'bill_depth_mm',
               hue = 'species',
               palette = ['cornflowerblue', 'coral', 'pink']);

**Question:** What is the correlation between bill length and bill depth if we just look at the Gentoo species?

In [None]:
# Your code here

## Additional Practice

1. Are there major differences in the distribution of species observed across the three years that the data was collected?

In [None]:
# Your Code Here

2. How does the distribution of body mass differ between male and female observations?

In [None]:
# Your Code Here

3. You can group by multiple variables simultaneously by passing a list into your `groupby` method. What do the differences in body mass between male and female penguins look like at a species level?

In [None]:
# Your Code Here

4. Inspect the relationship between body mass and flipper length. How is the relationship between these variables different from the one between bill length and bill depth that we observed above?

In [None]:
# Your Code Here