# Exploratory Data Analysis

In this notebook, we will look at some tools for exploratory data analysis (EDA).

We'll be using the [Palmer Penguins](https://allisonhorst.github.io/palmerpenguins/articles/intro.html) dataset, which contains size measurements for three penguin species observed on three islands in the Palmer Archipelago of Antarctica.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
penguins = pd.read_csv('../data/penguins.csv')

In [None]:
penguins.head()

## Categorical Variables

**Warm-up Question:** How many penguins are there of each sex?

In [None]:
# Your code here

We can also normalize our value counts. For example, to get the proportion of penguins of each sex, we can use this:

In [None]:
penguins['sex'].value_counts(normalize = True)
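
As a small illustration of how normalization behaves, the sketch below uses a made-up Series in place of the `sex` column (the values are hypothetical, not taken from the dataset). It shows that `value_counts` skips missing values by default, that `normalize = True` returns proportions of the non-missing values, and that `dropna = False` keeps missing values as their own category.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the `sex` column (not the real dataset)
sex = pd.Series(['male', 'female', 'female', 'male', 'female', np.nan])

counts = sex.value_counts()                      # missing values are dropped by default
proportions = sex.value_counts(normalize=True)   # fractions of the non-missing values
with_missing = sex.value_counts(dropna=False)    # keep missing values as a category

print(counts)
print((proportions * 100).round(1))              # multiply by 100 to get percentages
print(with_missing)
```

Multiplying the proportions by 100 is one way to report them as percentages.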

If we want to create a plot showing the number of penguins per species, we can do so using the `.plot` method. We need to specify that we want to create a bar chart.

In [None]:
penguins['species'].value_counts().plot(kind = 'bar');

The default plot can be improved using a combination of arguments and matplotlib methods.

In [None]:
penguins['species'].value_counts().plot(kind = 'bar',
                                        figsize = (10,6))               # Increase the plot size                

plt.xticks(rotation = 0,                                                # Remove the rotation of the labels
           fontsize = 12)                   
plt.title('Number of Penguins by Species',                              # Add a title
         fontsize = 14,
         fontweight = 'bold');
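
The bars in a chart built from `value_counts()` appear in the order of the resulting index. The sketch below, which uses a made-up Series rather than the real `species` column, shows the default ordering (most frequent first) next to an alphabetical ordering obtained with `.sort_index()`. Either result could then be passed to `.plot(kind = 'bar')`.

```python
import pandas as pd

# Hypothetical toy data standing in for the `species` column
species = pd.Series(['Gentoo', 'Adelie', 'Adelie', 'Chinstrap', 'Adelie', 'Gentoo'])

by_count = species.value_counts()               # most frequent category first (default)
by_name = species.value_counts().sort_index()   # alphabetical by category label

print(by_count.index.tolist())
print(by_name.index.tolist())
```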

## Numeric Variables

Now, let's say that we want to study the `body_mass_g` variable.

We can get a quick summary by using the `.describe()` method.

In [None]:
penguins['body_mass_g'].describe()

We can also calculate individual summary statistics, many of which have built-in _pandas_ methods.

In [None]:
penguins['body_mass_g'].mean()

In [None]:
penguins['body_mass_g'].median()
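
To see how the mean and median can differ, here is a minimal sketch on a made-up Series of body masses (hypothetical values, with one missing entry and one unusually heavy bird). Both methods skip missing values by default, and the single large value pulls the mean upward while leaving the median unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical body masses in grams, with one missing value and one heavy outlier
mass = pd.Series([3500.0, 3600.0, 3700.0, 3800.0, np.nan, 6400.0])

mean_mass = mass.mean()       # missing values are skipped by default (skipna=True)
median_mass = mass.median()   # middle value of the five non-missing entries
n_observed = mass.count()     # number of non-missing values

print(mean_mass, median_mass, n_observed)
```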

We have a number of options for inspecting the distribution of a numeric variable. In this notebook, we'll look at histograms and box plots.

For histograms, we can use the `.hist()` method from _pandas_.

In [None]:
penguins['body_mass_g'].hist();

As before, we can easily make modifications to this plot.

In [None]:
penguins['body_mass_g'].hist(bins = 25,
                            color = '#99FFFF',
                            edgecolor = 'black',
                            figsize = (10,6),
                            grid = False)


plt.title('Distribution of Body Mass Values', fontsize = 16);
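
The `bins` argument controls how finely the range of values is divided. As a rough sketch of the counting that sits behind a histogram, the example below uses `pd.cut` on a made-up Series (hypothetical values) to split the range into equal-width bins and count the values in each. Using `pd.cut` here is an illustrative choice, not necessarily how `.hist()` computes its bins internally.

```python
import pandas as pd

# Hypothetical body masses in grams (not the real dataset)
mass = pd.Series([2900, 3300, 3450, 3700, 3800, 4100, 4750, 5000, 5400, 6000])

# Split the range into equal-width bins and count the values in each bin
coarse = pd.cut(mass, bins=2).value_counts().sort_index()
fine = pd.cut(mass, bins=5).value_counts().sort_index()

print(coarse)
print(fine)
```

With more bins, each bar covers a narrower range, so more detail is visible but each count is based on fewer observations.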

## Categorical-Numeric Combinations

The histogram of body mass values is not exactly symmetric and appears to have a number of subgroups. Perhaps this distribution shape could be explained by looking at the body mass distribution by species. Let's say we want to look at the average body mass by species.

One way to do this is to use `.groupby()`.

Quite often when using `.groupby()`, our goal is to calculate an aggregate value by group. To use `.groupby()`, we need to tell pandas: 
* **what** to group by
* **which** column (or columns) we want to aggregate
* **how** to aggregate

In this case, we want to group by `species` and then aggregate the `body_mass_g` column by taking the _mean_.

In [None]:
penguins.groupby('species')['body_mass_g'].mean()

We can also use `.describe()` with `.groupby()` to get more information by species.

In [None]:
penguins.groupby('species')['body_mass_g'].describe()
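
If only a few statistics are needed, one option is to pass a list of aggregation names to `.agg()`. The sketch below follows the what / which / how pattern on a small made-up DataFrame (the rows and masses are hypothetical, chosen only for illustration).

```python
import pandas as pd

# Hypothetical toy rows; the masses are illustrative, not real measurements
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Chinstrap', 'Chinstrap', 'Gentoo', 'Gentoo'],
    'body_mass_g': [3600, 3800, 3700, 3900, 5000, 5200],
})

summary = (
    toy.groupby('species')['body_mass_g']   # what to group by, which column to use
       .agg(['count', 'mean', 'median'])    # how to aggregate (one output column each)
)

print(summary)
```

The result is a DataFrame with one row per group and one column per aggregation.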

**Question:** Looking at the summary statistics by group, what do you notice?

## Comparing Two Numeric Variables

Let's investigate the relationship between bill length and bill depth. We can do this using a scatterplot.

First, let's use the `.plot` method from pandas.

In [None]:
penguins.plot(kind = 'scatter',
             x = 'bill_length_mm',
             y = 'bill_depth_mm',
             figsize = (10,6));

We need to remember, however, that we have multiple species of penguins. To get a better understanding of the relationship between these variables, we could color the points by species. A convenient way to do this is with the _seaborn_ library.

In [None]:
plt.figure(figsize = (10,6))

sns.scatterplot(data = penguins,
               x = 'bill_length_mm',
               y = 'bill_depth_mm',
               hue = 'species',
               palette = ['cornflowerblue', 'coral', 'pink']);
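
Grouping matters for numeric summaries of a relationship as well as for plots. The sketch below uses made-up measurements (deliberately constructed, not taken from the dataset) in which the correlation across all rows is negative even though it is positive within each group. Whether the real data behaves this way is something to check rather than assume.

```python
import pandas as pd

# Hypothetical measurements constructed so that the overall and within-group
# trends point in opposite directions (not real penguin data)
toy = pd.DataFrame({
    'species': ['Adelie'] * 3 + ['Gentoo'] * 3,
    'bill_length_mm': [38, 40, 42, 46, 48, 50],
    'bill_depth_mm': [17, 18, 19, 13, 14, 15],
})

# Pearson correlation across all rows
overall = toy['bill_length_mm'].corr(toy['bill_depth_mm'])

# Pearson correlation computed separately for each species
by_species = {
    name: group['bill_length_mm'].corr(group['bill_depth_mm'])
    for name, group in toy.groupby('species')
}

print(overall)
print(by_species)
```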