# Palmer Penguins

This notebook contains my analysis of the palmerpenguins data set, which holds measurements for three penguin species: Adélie, Chinstrap and Gentoo. The data was collected from 2007 to 2009 by Dr. Kristen Gorman with the [Palmer Station Long Term Ecological Research Program](https://lternet.edu/site/palmer-antarctica-lter/) on three islands in the Palmer Archipelago, Antarctica.

![The Palmer Penguins. Artwork by @allison_horst](https://sebastiancallh.github.io/ox-hugo/palmer-penguins.png)

The Palmer Penguins, artwork by @allison_horst






According to the palmerpenguins [README](https://github.com/allisonhorst/palmerpenguins/blob/main/README.md), "the goal of palmerpenguins is to provide a great dataset for data exploration and visualization, as an alternative to iris".

## Pandas

Pandas is a Python library for manipulating and analysing data. It can perform statistical calculations, find correlations between two or more columns, and be used to visualise data.

Before pandas can be used, the package must be loaded with the 'import' statement. It is usually imported under the alias 'pd'.

In [4]:
# Data frames.
import pandas as pd

## Importing the data set

The data set was imported as a csv file from [Seaborn Data](https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv). The original raw data set is available on Allison Horst's [GitHub](https://github.com/allisonhorst/palmerpenguins/blob/main/inst/extdata/penguins_raw.csv). The version used here has the advantage that it has been processed to remove extraneous columns, such as the year in which the data was collected. Missing values are still present, as the output of df.info() further down shows. The other difference is that the sex values are written in upper case (MALE, FEMALE) in the seaborn data set.

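The command to import the data set as a csv file from a URL is: df = pd.read_csv("URL")

Explanation of the command: pd.read_csv reads the csv file at the given URL and returns it as a pandas DataFrame, which is assigned to the variable df (short for dataframe).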

In [5]:
# Load the penguins data set
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
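
To see what read_csv does without needing a network connection, the sketch below parses a small in-memory csv instead of the URL. The contents of csv_text are an assumption for illustration only: they are typed in by hand from the first five rows displayed in this notebook, and sample is a hypothetical variable name, not the full data set.

```python
import io

import pandas as pd

# Hypothetical in-memory csv: header plus the first five rows shown in this notebook.
csv_text = """species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
Adelie,Torgersen,,,,,
Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
"""

# read_csv accepts a URL, a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# The result is a DataFrame; empty csv fields are parsed as NaN.
print(type(sample).__name__)
print(sample.loc[3].isna().sum())
```

The fourth row has no measurements and no sex recorded, so five of its seven fields come back as NaN.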


## Overview of the data set

There are a couple of commands that can be used to get an overview of the data set. 

The first of these is simply df. In a notebook this displays a few rows from the start and the end of the dataframe. Another useful command is df.head(), which by default displays the first 5 rows of the data set.

In [15]:
# Have a look. Prints out a few rows of the parsed csv file.
df

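|     | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | MALE |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE |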


In [16]:
# df.head() prints out the first five rows of the csv file.

df.head()

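|     | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |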


The output of df tells us that the data set has 344 rows and seven columns. The seven variables (columns) are the species, island, bill length, bill depth, flipper length, body mass and sex of each penguin studied.
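
The row and column counts can also be read directly from the dataframe rather than from the printed output. A minimal sketch, assuming a hand-typed five-row sample (sample is a hypothetical name standing in for the full df):

```python
import pandas as pd

# Hypothetical sample: the first five rows of the data set, typed in by hand.
sample = pd.DataFrame(
    {
        "species": ["Adelie"] * 5,
        "island": ["Torgersen"] * 5,
        "bill_length_mm": [39.1, 39.5, 40.3, None, 36.7],
        "bill_depth_mm": [18.7, 17.4, 18.0, None, 19.3],
        "flipper_length_mm": [181.0, 186.0, 195.0, None, 193.0],
        "body_mass_g": [3750.0, 3800.0, 3250.0, None, 3450.0],
        "sex": ["MALE", "FEMALE", "FEMALE", None, "FEMALE"],
    }
)

# shape is a (rows, columns) tuple; columns holds the column names.
n_rows, n_cols = sample.shape
column_names = list(sample.columns)

print(n_rows, n_cols)
print(column_names)
```

Run against the full df instead of the sample, shape should give the 344 rows and seven columns quoted above.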

## Variables 

The info() method gives a concise summary of the dataframe. Like df and df.head(), it shows the number and names of the columns, but it also gives the data type of each column and the number of non-null entries in each.

In [23]:
# df.info() gives concise information about the dataframe. 

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [20]:
# Describe the data set. Summary statistics about the data file. 
df.describe()

|     | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
| --- | --- | --- | --- | --- |
| count | 342.0 | 342.0 | 342.0 | 342.0 |
| mean | 43.92193 | 17.15117 | 200.915205 | 4201.754386 |
| std | 5.459584 | 1.974793 | 14.061714 | 801.954536 |
| min | 32.1 | 13.1 | 172.0 | 2700.0 |
| 25% | 39.225 | 15.6 | 190.0 | 3550.0 |
| 50% | 44.45 | 17.3 | 197.0 | 4050.0 |
| 75% | 48.5 | 18.7 | 213.0 | 4750.0 |
| max | 59.6 | 21.5 | 231.0 | 6300.0 |


The data type (dtype) of the species, island and sex columns is object. This is as expected, since they all hold strings. Species takes three different values: Adelie, Chinstrap and Gentoo. The three islands are Torgersen, Biscoe and Dream. The sex of each penguin is recorded as either MALE or FEMALE. This variable has the highest number of nulls (333 non-null entries out of 344, so 11 are missing), recorded where the sex of the penguin could not be identified.

Bill length, bill depth and flipper length are stored as float64 and measured in mm, with values shown to one decimal place. The weight or body mass of the penguins is also stored as float64 and is given in grams, again shown to one decimal place.
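
The number of missing values per column can be counted directly instead of subtracting the non-null counts from the number of rows. A minimal sketch, assuming a hand-typed five-row sample with three of the seven columns (sample and missing_per_column are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: three columns from the first five rows of the data set.
sample = pd.DataFrame(
    {
        "species": ["Adelie"] * 5,
        "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 3450.0],
        "sex": ["MALE", "FEMALE", "FEMALE", None, "FEMALE"],
    }
)

# isna() marks each missing cell as True; sum() counts the True values per column.
missing_per_column = sample.isna().sum()

print(missing_per_column)
```

On the full df the same call should report 11 missing values for sex and 2 for each of the four measurement columns, in line with the non-null counts from df.info().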



In [47]:
# Look at the row with index label 1 (the second row)
df.loc[1]


species                 Adelie
island               Torgersen
bill_length_mm            39.5
bill_depth_mm             17.4
flipper_length_mm        186.0
body_mass_g             3800.0
sex                     FEMALE
Name: 1, dtype: object

In [25]:
# Distinct values in the species column
df["species"].unique()

array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)

In [24]:
# Distinct values in the island column
df["island"].unique()

array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)

In [29]:
# Which island has the highest penguin population?

df['island'].value_counts()


island
Biscoe       168
Dream        124
Torgersen     52
Name: count, dtype: int64

In [49]:
# Proportion of penguins on each island (normalize=True returns fractions instead of counts)
df['island'].value_counts(normalize=True)

island
Biscoe       0.488372
Dream        0.360465
Torgersen    0.151163
Name: proportion, dtype: float64
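
The first parameter of value_counts is normalize, which switches the result from counts to proportions. The sketch below shows both forms side by side. It assumes a hypothetical series, islands, rebuilt by hand from the island counts shown above rather than taken from df.

```python
import pandas as pd

# Hypothetical series rebuilt from the island counts shown above.
islands = pd.Series(
    ["Biscoe"] * 168 + ["Dream"] * 124 + ["Torgersen"] * 52,
    name="island",
)

# Default: absolute counts per island.
counts = islands.value_counts()

# normalize=True: each count divided by the total number of entries.
proportions = islands.value_counts(normalize=True)

print(counts)
print(proportions.round(6))
```

Each proportion is the island count divided by 344, for example 168 / 344 for Biscoe, so the proportions sum to 1.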

In [19]:


# Number of non-null entries in each column, per species
df.groupby('species').count()

| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
| --- | --- | --- | --- | --- | --- | --- |
| Adelie | 152 | 151 | 151 | 151 | 151 | 146 |
| Chinstrap | 68 | 68 | 68 | 68 | 68 | 68 |
| Gentoo | 124 | 123 | 123 | 123 | 123 | 119 |


In [26]:
# Number of penguins of each species on each island
df.groupby('species')['island'].value_counts()


species    island   
Adelie     Dream         56
           Torgersen     52
           Biscoe        44
Chinstrap  Dream         68
Gentoo     Biscoe       124
Name: count, dtype: int64

In [6]:
# Number of distinct species recorded on each island
df.groupby('island')['species'].nunique()


island
Biscoe       2
Dream        2
Torgersen    1
Name: species, dtype: int64

In [8]:
# Breakdown of penguin species by island.  

df.groupby('island')['species'].value_counts()


island     species  
Biscoe     Gentoo       124
           Adelie        44
Dream      Chinstrap     68
           Adelie        56
Torgersen  Adelie        52
Name: count, dtype: int64
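
The same breakdown can be laid out as a two-way table, which makes the island and species combinations with no penguins visible as zeros. This is a sketch of one alternative, using pd.crosstab, not the method used in the cell above. The dataframe penguins is hypothetical: it is rebuilt by hand from the five group sizes in the output above rather than taken from df.

```python
import pandas as pd

# Group sizes copied from the island/species breakdown above.
group_sizes = {
    ("Biscoe", "Gentoo"): 124,
    ("Biscoe", "Adelie"): 44,
    ("Dream", "Chinstrap"): 68,
    ("Dream", "Adelie"): 56,
    ("Torgersen", "Adelie"): 52,
}

# Hypothetical frame with one row per penguin, holding only island and species.
rows = [
    {"island": island, "species": species}
    for (island, species), n in group_sizes.items()
    for _ in range(n)
]
penguins = pd.DataFrame(rows)

# Islands as rows, species as columns, counts in the cells.
table = pd.crosstab(penguins["island"], penguins["species"])

print(table)
```

With the full data set, pd.crosstab(df['island'], df['species']) would be the equivalent call.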

In [12]:
# Mean body mass (g) of the penguins on each island
df.groupby('island')['body_mass_g'].mean()


island
Biscoe       4716.017964
Dream        3712.903226
Torgersen    3706.372549
Name: body_mass_g, dtype: float64

In [13]:
# Mean body mass (g) of each species
df.groupby('species')['body_mass_g'].mean()

species
Adelie       3700.662252
Chinstrap    3733.088235
Gentoo       5076.016260
Name: body_mass_g, dtype: float64

In [17]:
# Are male penguins heavier than female penguins?

df.groupby('sex')['body_mass_g'].mean()

sex
FEMALE    3862.272727
MALE      4545.684524
Name: body_mass_g, dtype: float64

In [7]:
# Number of islands on which each species was recorded
df.groupby('species')['island'].nunique()


species
Adelie       3
Chinstrap    1
Gentoo       1
Name: island, dtype: int64

In [18]:
# Look at the first row
df.iloc[0]

species                 Adelie
island               Torgersen
bill_length_mm            39.1
bill_depth_mm             18.7
flipper_length_mm        181.0
body_mass_g             3750.0
sex                       MALE
Name: 0, dtype: object

In [19]:
# Number and sex of penguins
df["sex"].value_counts()

sex
MALE      168
FEMALE    165
Name: count, dtype: int64

In [30]:
# Mean flipper length (mm) of each species
df.groupby('species')['flipper_length_mm'].mean()


species
Adelie       189.953642
Chinstrap    195.823529
Gentoo       217.186992
Name: flipper_length_mm, dtype: float64

In [33]:
# Standard deviation of flipper length (mm) across all penguins
df["flipper_length_mm"].std()



14.061713679356894