# Palmer Penguins

This notebook contains my analysis of the famous Palmer Penguins dataset. The dataset holds measurements for three penguin species (Adélie, Chinstrap, and Gentoo) observed on three islands (Biscoe, Dream, and Torgersen) in Antarctica's Palmer Archipelago.

The dataset was collected and compiled by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. Allison Horst [shared and maintains the dataset](https://allisonhorst.github.io/palmerpenguins/), with the aim of providing a great dataset for data exploration & visualisation, as an alternative to [the Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set).

## Reviewing the Dataset
I will begin by importing [pandas](https://pandas.pydata.org/) as a useful tool for data analysis and manipulation. I will then use pandas to read in the dataset and take a preliminary look. As I have not worked with this dataset before, I will take some time to familiarise myself with it before proceeding to any analysis.

In [97]:
# Data Frames
import pandas as pd

# Load the Penguins Dataset
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")

# Take a Look at the Dataset
df

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | MALE |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE |


The dataset contains 344 rows, with each row corresponding to a unique penguin. I can see that for each penguin, the researchers captured seven variables:
- **species:** which of the three species (Adélie, Chinstrap, or Gentoo) the penguin belongs to
- **island:** which of the three islands (Biscoe, Dream, or Torgersen) the penguin was found on
- **bill_length_mm:** length (in mm) of the penguin's culmen (upper ridge of the penguin's bill)
- **bill_depth_mm:** depth (in mm) of the penguin's culmen
- **flipper_length_mm:** length (in mm) of the penguin's flipper
- **body_mass_g:** body mass (in g) of the penguin
- **sex:** whether the penguin was female or male
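
The preview above includes two rows (3 and 339) with no measurements recorded, so it is worth counting missing values per column before summarising. The following is a minimal sketch rather than part of the original analysis: `sample` is a hypothetical name for a small dataframe copied by hand from the first five rows of the preview, so the block runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the first five rows of the preview above
sample = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "island": ["Torgersen"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, np.nan, 36.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, np.nan, 19.3],
    "flipper_length_mm": [181.0, 186.0, 195.0, np.nan, 193.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 3450.0],
    "sex": ["MALE", "FEMALE", "FEMALE", np.nan, "FEMALE"],
})

# Count the missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Select the rows where at least one value is missing
rows_with_missing = sample[sample.isna().any(axis=1)]
print(rows_with_missing)
```

The same two calls should work unchanged on the full `df`.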

I will now use the [describe()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) method to get a quick overview of summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric variables:

In [98]:
# Describe the Data Set
df.describe()

| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| count | 342.0 | 342.0 | 342.0 | 342.0 |
| mean | 43.92193 | 17.15117 | 200.915205 | 4201.754386 |
| std | 5.459584 | 1.974793 | 14.061714 | 801.954536 |
| min | 32.1 | 13.1 | 172.0 | 2700.0 |
| 25% | 39.225 | 15.6 | 190.0 | 3550.0 |
| 50% | 44.45 | 17.3 | 197.0 | 4050.0 |
| 75% | 48.5 | 18.7 | 213.0 | 4750.0 |
| max | 59.6 | 21.5 | 231.0 | 6300.0 |


For the three remaining non-numeric variables, I will use the [value_counts()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) method to count how many penguins fall into each category, which gives an idea of the split among them:

In [99]:
# Count the number of penguins of each species
df["species"].value_counts()

species
Adelie       152
Gentoo       124
Chinstrap     68
Name: count, dtype: int64

In [100]:
# Count the number of penguins on each island
df["island"].value_counts()

island
Biscoe       168
Dream        124
Torgersen     52
Name: count, dtype: int64

In [101]:
# Count the number of penguins of each sex
df["sex"].value_counts()

sex
MALE      168
FEMALE    165
Name: count, dtype: int64
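
The counts above can also be expressed as proportions. Note too that the `sex` counts sum to 333 rather than 344, which suggests that 11 rows have no recorded sex. The following is a minimal sketch under stated assumptions: the `species` and `sex` series are rebuilt from the counts reported above rather than read from the raw data, and the 11 missing values are inferred from the difference, not shown in any output.

```python
import pandas as pd

# Rebuild the columns from the counts reported above (not the raw data)
species = pd.Series(
    ["Adelie"] * 152 + ["Gentoo"] * 124 + ["Chinstrap"] * 68, name="species"
)
# 344 rows minus 333 recorded values leaves 11 missing (inferred)
sex = pd.Series(["MALE"] * 168 + ["FEMALE"] * 165 + [None] * 11, name="sex")

# normalize=True returns each category's share of the non-missing total
species_share = species.value_counts(normalize=True)
print(species_share.round(3))

# dropna=False keeps missing values as their own category
sex_counts = sex.value_counts(dropna=False)
print(sex_counts)
```

On the full dataset, the equivalent calls would be `df["species"].value_counts(normalize=True)` and `df["sex"].value_counts(dropna=False)`.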

## Species-Specific Analysis
Having taken an initial look at the data, I think it would be useful to break it down by species and look at how the measurements vary between the three groups. I will begin by creating a new dataframe for each species by filtering the initial dataset:

In [102]:
# Create dataframes for each species
adelie_df = df.loc[df["species"] == "Adelie"]
chinstrap_df = df.loc[df["species"] == "Chinstrap"]
gentoo_df = df.loc[df["species"] == "Gentoo"]


Now I'll take a quick look at each one using describe() again:

In [103]:
# Summary statistics for the Adélie penguins
adelie_df.describe()

| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| count | 151.0 | 151.0 | 151.0 | 151.0 |
| mean | 38.791391 | 18.346358 | 189.953642 | 3700.662252 |
| std | 2.663405 | 1.21665 | 6.539457 | 458.566126 |
| min | 32.1 | 15.5 | 172.0 | 2850.0 |
| 25% | 36.75 | 17.5 | 186.0 | 3350.0 |
| 50% | 38.8 | 18.4 | 190.0 | 3700.0 |
| 75% | 40.75 | 19.0 | 195.0 | 4000.0 |
| max | 46.0 | 21.5 | 210.0 | 4775.0 |


In [104]:
# Summary statistics for the Chinstrap penguins
chinstrap_df.describe()

| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| count | 68.0 | 68.0 | 68.0 | 68.0 |
| mean | 48.833824 | 18.420588 | 195.823529 | 3733.088235 |
| std | 3.339256 | 1.135395 | 7.131894 | 384.335081 |
| min | 40.9 | 16.4 | 178.0 | 2700.0 |
| 25% | 46.35 | 17.5 | 191.0 | 3487.5 |
| 50% | 49.55 | 18.45 | 196.0 | 3700.0 |
| 75% | 51.075 | 19.4 | 201.0 | 3950.0 |
| max | 58.0 | 20.8 | 212.0 | 4800.0 |


In [105]:
# Summary statistics for the Gentoo penguins
gentoo_df.describe()

| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| count | 123.0 | 123.0 | 123.0 | 123.0 |
| mean | 47.504878 | 14.982114 | 217.186992 | 5076.01626 |
| std | 3.081857 | 0.98122 | 6.484976 | 504.116237 |
| min | 40.9 | 13.1 | 203.0 | 3950.0 |
| 25% | 45.3 | 14.2 | 212.0 | 4700.0 |
| 50% | 47.3 | 15.0 | 216.0 | 5000.0 |
| 75% | 49.55 | 15.7 | 221.0 | 5500.0 |
| max | 59.6 | 17.3 | 231.0 | 6300.0 |
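
The three per-species summaries can also be produced side by side with `groupby`, which avoids maintaining a separate dataframe for each species. The following is a minimal sketch, not the approach used in this notebook: `sample` is a hypothetical name for a small dataframe built by hand from the complete Adelie and Gentoo rows visible in the first preview, so the numbers it prints describe only those seven penguins.

```python
import pandas as pd

# Hypothetical sample: the complete rows from the first preview (indices 0-2, 4, 340-342)
sample = pd.DataFrame({
    "species": ["Adelie"] * 4 + ["Gentoo"] * 3,
    "bill_length_mm": [39.1, 39.5, 40.3, 36.7, 46.8, 50.4, 45.2],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0, 4850.0, 5750.0, 5200.0],
})

# Group by species, then compute several statistics for each numeric column
by_species = (
    sample.groupby("species")[["bill_length_mm", "body_mass_g"]]
    .agg(["count", "mean", "std"])
)
print(by_species)
```

On the full dataset, `df.groupby("species").describe()` should return the same statistics as the three separate `describe()` calls, arranged in a single table.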


***

### End