In this notebook, we will see some tools for doing exploratory data analysis (EDA) on a dataset. 

This is the process of familiarizing yourself with your data, and it typically includes examining the structure and components of your dataset, the distributions of individual variables, and the relationships between two or more variables.

We'll be using the [Palmer Penguins](https://allisonhorst.github.io/palmerpenguins/articles/intro.html) dataset, which contains size measurements for three penguin species observed on three islands in the Palmer Archipelago of Antarctica.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
penguins = pd.read_csv('../data/penguins.csv')

We can start by looking at the first few rows.

In [None]:
penguins.head()

**Question: Do you notice anything when you look at the first five rows of data?**

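Let's see how many null values we have. One way is the `.info()` method, which reports the number of non-null entries in each column; comparing that to the total number of rows shows how many values are missing.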

In [None]:
penguins.info()

Another way to count null values is by using the `.isna()` method.

In [None]:
penguins.isna().sum()
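Raw null counts are easier to interpret next to the share of rows they represent. The sketch below shows one way to put both side by side. It uses a small made-up frame (`toy`) rather than the real penguins file, so the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the penguins data
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Chinstrap'],
    'body_mass_g': [3750.0, np.nan, 5000.0, 3800.0],
    'sex': ['male', None, None, 'female'],
})

# .isna() gives a True/False frame; summing counts the Trues per column
missing_count = toy.isna().sum()

# The mean of a True/False column is the fraction of Trues
missing_share = toy.isna().mean()

summary = pd.DataFrame({'missing': missing_count, 'share': missing_share})
print(summary)
```

The same two lines should work on `penguins` directly, since they only rely on `.isna()`.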

## Variable Types

The tools that we use to understand a variable depend on the type of variable we are looking at. Broadly speaking, there are two types of variables:

**Categorical Variables:**  Express a qualitative attribute. Examples: hair color, eye color, religion, favorite movie, gender

**Numeric Variables:** Variables that are measured in terms of numbers. Examples: height, weight, shoe size
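As a rough first pass, the column dtypes can suggest which variables are numeric and which are categorical. The sketch below uses `select_dtypes` on a made-up frame (`toy`); the column names mirror the penguins data, but the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Chinstrap'],
    'island': ['Torgersen', 'Biscoe', 'Dream'],
    'bill_length_mm': [39.1, 46.1, 46.5],
    'body_mass_g': [3750, 5000, 3500],
})

# Columns stored as integers or floats
numeric_cols = toy.select_dtypes(include='number').columns.tolist()

# Everything else (here, text columns)
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

The dtype is only a hint. A column of numeric codes, such as a ZIP code, is stored as numbers but is better treated as categorical.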

## Categorical Variables

For categorical variables, we can start by getting a count by category. This can be accomplished in pandas by using the `value_counts` method.

In [None]:
penguins['sex'].value_counts()

Notice that the NaN values do not show up. Place your cursor inside the parentheses for `value_counts` and press Shift+Tab to look for a way to retain the NaN values.

In [None]:
penguins['sex'].value_counts(dropna = False)

Notice also that we can normalize our value counts in order to get proportions instead of counts. For example, to get the proportion of penguins by sex, we can use this:

In [None]:
penguins['sex'].value_counts(normalize = True)
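With `normalize = True` the result is a proportion between 0 and 1, computed over the non-null values only. The sketch below, on a made-up series (`sex`), shows one way to convert to percentages and how adding `dropna = False` changes the denominator.

```python
import pandas as pd

# Hypothetical values: 4 female, 3 male, 1 missing
sex = pd.Series(['male', 'female', 'female', None,
                 'male', 'female', 'male', 'female'])

# Proportions over the 7 non-null values
proportions = sex.value_counts(normalize=True)

# Multiply by 100 and round to read them as percentages
percents = (proportions * 100).round(1)

# Keeping NaN makes the proportions add up over all 8 rows instead
with_missing = sex.value_counts(normalize=True, dropna=False)

print(percents)
print(with_missing)
```

Which denominator is appropriate depends on the question: the share among penguins with a recorded sex, or the share among all penguins.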

If we want to create a plot showing the number of penguins per species, we can do so using the `.plot` method. We need to specify that we want to create a bar chart.

In [None]:
penguins['species'].value_counts().plot(kind = 'bar');

The default plot can be improved using a combination of arguments and matplotlib methods.

In [None]:
# Increase the plot size
penguins['species'].value_counts().plot(kind = 'bar', figsize = (10,6))

# Remove the rotation of the labels
plt.xticks(rotation = 0, fontsize = 12)

# Add a title
plt.title('Number of Penguins by Species', fontsize = 14, fontweight = 'bold');
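The same customizations can also be written against an explicit `Axes` object, which some people find easier to follow once a figure has several parts. This is an alternative sketch, not a replacement for the cell above. The `species` series is made up, and the y-axis label is an addition that the original plot does not have.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical species labels; the real data has many more rows
species = pd.Series(['Adelie', 'Adelie', 'Adelie',
                     'Gentoo', 'Gentoo', 'Chinstrap'])
counts = species.value_counts()

# Create the figure and axes first, then tell pandas to draw on them
fig, ax = plt.subplots(figsize=(10, 6))
counts.plot(kind='bar', ax=ax, rot=0)   # rot=0 keeps the labels horizontal

ax.set_title('Number of Penguins by Species', fontsize=14, fontweight='bold')
ax.set_ylabel('Count')                  # extra label, not in the original plot
```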

## Numeric Variables

When examining a numeric variable, we can start by calculating descriptive statistics. These can include things like mean, median, max, min, standard deviation, and quartiles.

Let's say that we want to study the `body_mass_g` variable.

We can get a quick summary by using the `.describe()` method.

In [None]:
penguins['body_mass_g'].describe()

We can also calculate individual summary statistics, many of which have built-in _pandas_ methods.

In [None]:
penguins['body_mass_g'].mean()

In [None]:
penguins['body_mass_g'].median()
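Several statistics can be requested in one call with `.agg()`, and quartiles are available through `.quantile()`. The sketch below uses a short made-up series (`body_mass`) in place of the real column.

```python
import pandas as pd

# Hypothetical body mass values in grams
body_mass = pd.Series([3750, 3800, 3250, 3450, 3650, 5000, 5700, 4300],
                      name='body_mass_g')

# Several summary statistics at once
stats = body_mass.agg(['mean', 'median', 'std', 'min', 'max'])

# First quartile, median, and third quartile
quartiles = body_mass.quantile([0.25, 0.5, 0.75])

print(stats)
print(quartiles)
```

In this toy series the mean is larger than the median, which is often a sign that a few large values are pulling the mean upward.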

We have a number of options for inspecting the distribution of a numeric variable. In this notebook, we'll look at histograms and box plots.

For histograms, we can use the `.hist()` method from _pandas_.

In [None]:
penguins['body_mass_g'].hist();

As before, we can easily make modifications to this plot.

In [None]:
penguins['body_mass_g'].hist(bins = 25, color = 'coral', edgecolor = 'black', figsize = (10,6))

plt.title('Distribution of Body Mass Values', fontsize = 16);
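To see what the `bins` argument changes, it can help to look at the counts behind the bars. The sketch below uses `np.histogram` on a handful of made-up values; the counts it returns are the bar heights a histogram with that many equal-width bins would draw.

```python
import numpy as np

# Hypothetical body mass values in grams
values = np.array([3250, 3450, 3650, 3750, 3800, 4300, 5000, 5700])

# Two wide bins hide most of the shape
counts_few, edges_few = np.histogram(values, bins=2)

# Five narrower bins show where the values cluster
counts_many, edges_many = np.histogram(values, bins=5)

print(counts_few, edges_few)
print(counts_many, edges_many)
```

Too few bins can hide structure, while too many can leave most bins nearly empty, so it is worth trying a few settings.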

The _pandas_ library also includes a boxplot method, but we can get a nicer looking one using the _seaborn_ library.

In [None]:
sns.boxplot(data = penguins, x = 'body_mass_g');

Note that if we want to increase the figure size when using a seaborn function such as `boxplot`, we cannot pass `figsize` in as an argument. Instead, we must first create the figure with either `plt.figure()` or `plt.subplots()` from matplotlib.

In [None]:
plt.figure(figsize = (10,4))

sns.boxplot(data = penguins, x = 'body_mass_g');
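Points drawn beyond the whiskers of a box plot are usually those more than 1.5 times the interquartile range (IQR) away from the box, which is the default whisker setting in matplotlib and seaborn. The sketch below applies that rule by hand to a made-up series (`values`) to list the points that would be flagged.

```python
import pandas as pd

# Hypothetical values with one unusually large observation
values = pd.Series([3250, 3450, 3650, 3750, 3800, 4300, 5000, 9000])

# First and third quartiles mark the edges of the box
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Whiskers reach at most 1.5 * IQR beyond the box
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside the fences would be drawn as an individual point
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

A flagged point is only a candidate for a closer look; whether it is an error or a genuine extreme value depends on the data.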

# End of Instruction

### Read in the people.csv

### Use normalized value_counts to find the percent male and percent female

### Make a histogram for all the ages in the data frame.  Change the default number of bins to be fewer.  
### Now, change it to be more.  Which do you prefer?

### Make a box plot showing the distribution of BMI. Are there any outliers?

### Make a box plot showing the distribution of years_played_sports.  Are there any outliers?

### Make a histogram showing years_played_sports.  Which chart type do you think best represents the data?