<h1 align="center">Principles of Data Analytics</h1>


<h1 align="center">Palmer Penguins Exploratory Data Analysis</h1>

***

<p align="center">
<img width="528" height="291" src="img/lter_penguins.png">
</p>

[Artwork by @allison_horst](https://allisonhorst.github.io/palmerpenguins/articles/art.html)

## Introduction

***

The [Palmer penguins dataset](https://allisonhorst.github.io/palmerpenguins/) by [Allison Horst](https://allisonhorst.com/), [Alison Hill](https://www.apreshill.com/), and [Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) is a dataset for data exploration and visualization, offered as an alternative to the Iris dataset. It was made available in 2020.

The dataset contains data for 344 penguins. There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica.

Data was collected from 2007 to 2009 by Dr. Kristen Gorman and the [Palmer Station, Antarctica LTER](https://pallter.marine.rutgers.edu/), a member of the [Long Term Ecological Research Network](https://lternet.edu/).

More information about the dataset is available in its [official documentation](https://allisonhorst.github.io/palmerpenguins/).

## Imports

***

We require the following libraries to analyse the dataset.

 - Pandas: Fundamental data analysis and manipulation library built on top of the Python programming language. It offers data structures and operations for manipulating numerical tables and time series.

 - NumPy: Provides multidimensional arrays and high-level mathematical functions, such as linear algebra operations.

 - Matplotlib: Essential for creating static, animated, and interactive visualizations in Python. It is closely integrated with NumPy and provides a MATLAB-like interface for creating plots and visualizations.


In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Load data

***

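Load the Palmer Penguins data set from a URL.

The CSV file used here is hosted in the [seaborn-data repository on GitHub](https://github.com/mwaskom/seaborn-data); the dataset itself is documented on the [palmerpenguins site](https://allisonhorst.github.io/palmerpenguins/).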

In [2]:

df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")

The data is now loaded.

In [3]:

df

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,FEMALE
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,MALE
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,FEMALE
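
The blank fields in rows 3 and 339 of the output above are missing values. As a minimal sketch, assuming a few of those rows typed in by hand as CSV text rather than fetched from the URL, `pd.read_csv` parses empty fields as `NaN`. The names `csv_text` and `sample` are illustrative and not part of the notebook.

```python
import io

import pandas as pd

# A few rows copied from the output above, written as CSV text.
# The fourth data row (index 3) has empty measurement and sex fields.
csv_text = """species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
Adelie,Torgersen,,,,,
Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
"""

# read_csv accepts any file-like object, so StringIO stands in for the URL.
sample = pd.read_csv(io.StringIO(csv_text))

# Empty fields are read as NaN.
print(sample.loc[3])

# The measurement columns are still parsed as floating-point numbers.
print(sample.dtypes)
```

Reading from an in-memory string keeps the example independent of network access; the call is otherwise the same as the one used to load the full file.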


## Data Exploration

***

### Selecting data by row numbers (.iloc)

**Rationale:** 

The `iloc` property gets or sets values by integer position. A row, a column, or both can be specified by position, counting from 0. In this example, we look at the first row (position 0) in detail.<sup id="a1">[1](#f1)</sup> <sup name="a2">[2](#f2)</sup> [<sup>3</sup>](#fn3)


**Findings:**
- The first row is a male Adelie penguin from Torgersen island.
- Bill length is 39.1 mm and bill depth is 18.7 mm.
- Flipper length is 181 mm.
- Body mass is 3750 g.

###### <b id="f1">1</b> ---  Pandas Documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) --- [↩](#a1)

###### <b name="f2">2</b> ___  Adding footnotes to Github flavoured Markdown (https://stackoverflow.com/questions/25579868/how-to-add-footnotes-to-github-flavoured-markdown) ___ [↩](#a2)

###### <span id="fn3"> Footnote 3</span>




<p align="center">
<img width="428" height="291" src="https://allisonhorst.github.io/palmerpenguins/reference/figures/culmen_depth.png">
</p>

[Artwork by @allison_horst](https://allisonhorst.github.io/palmerpenguins/articles/art.html)

For this penguin data, the bill/culmen length and depth are measured as shown in the figure above.

In [4]:
# Select the first row by integer position.
df.iloc[0]

species                 Adelie
island               Torgersen
bill_length_mm            39.1
bill_depth_mm             18.7
flipper_length_mm        181.0
body_mass_g             3750.0
sex                       MALE
Name: 0, dtype: object
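
Beyond selecting a whole row, `iloc` also accepts a column position and slices. The sketch below assumes a small stand-in DataFrame, `sample`, built by hand from values in the first three rows shown earlier; it is not the notebook's `df`, and only four of the seven columns are included.

```python
import pandas as pd

# Hypothetical stand-in for the penguins DataFrame (first three rows, four columns).
sample = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Adelie"],
        "island": ["Torgersen", "Torgersen", "Torgersen"],
        "bill_length_mm": [39.1, 39.5, 40.3],
        "body_mass_g": [3750.0, 3800.0, 3250.0],
    }
)

first_row = sample.iloc[0]         # one row, returned as a Series
one_cell = sample.iloc[0, 2]       # row 0, column 2 (bill_length_mm)
first_two = sample.iloc[0:2, 0:2]  # slices exclude the end position

# iloc can also set a value by position; this changes only the stand-in frame.
sample.iloc[2, 3] = 3300.0

print(first_row)
print(one_cell)
print(first_two)
print(sample)
```

Positions always count from 0 regardless of the index labels, which is why the first penguin is `iloc[0]`.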

### T

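**Rationale:** 

Selecting a single column by its label with square brackets returns that column as a pandas Series. Here we look at the `sex` column.

**Findings:**
- The column has 344 entries with dtype object.
- Values are `MALE` or `FEMALE`; some entries, such as rows 3 and 339, are missing (`NaN`).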

In [5]:
# Sex of penguins
df["sex"]

0        MALE
1      FEMALE
2      FEMALE
3         NaN
4      FEMALE
        ...  
339       NaN
340    FEMALE
341      MALE
342    FEMALE
343      MALE
Name: sex, Length: 344, dtype: object

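**Rationale:** 

The `value_counts()` method counts how often each distinct value occurs in a column. By default, missing values are excluded from the counts.

**Findings:**
- 168 penguins are recorded as male and 165 as female.
- The counts sum to 333, so 11 of the 344 rows have no recorded sex.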

In [6]:
# Count the number of penguins of each sex.
df["sex"].value_counts()

sex
MALE      168
FEMALE    165
Name: count, dtype: int64
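
The counts above leave out rows where `sex` is missing. As a hedged sketch, assuming a five-value Series that mirrors rows 0 to 4 of the `sex` column shown earlier, the `dropna` and `normalize` parameters of `value_counts` control whether `NaN` is counted and whether proportions are returned instead of counts.

```python
import numpy as np
import pandas as pd

# Stand-in for df["sex"], using the values of rows 0 to 4 shown earlier.
sex = pd.Series(["MALE", "FEMALE", "FEMALE", np.nan, "FEMALE"], name="sex")

# Default behaviour: NaN is excluded.
counts = sex.value_counts()

# dropna=False gives NaN its own row in the result.
counts_with_missing = sex.value_counts(dropna=False)

# normalize=True returns proportions of the non-missing values.
shares = sex.value_counts(normalize=True)

print(counts)
print(counts_with_missing)
print(shares)
```

Comparing the totals of the two counts is a quick way to see how many values are missing.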

### Generate descriptive statistics with pandas .describe method

**Rationale:** 

The `describe()` method gives a quick summary of the numerical columns, which helps us understand the central tendency, variability, and range of the penguins' measurements.[1] It reports the following statistics: count, mean, standard deviation, minimum, the 25th, 50th, and 75th percentiles, and maximum.[2]


**Findings:**

In [7]:
# Describe the data set
df.describe()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
count,342.0,342.0,342.0,342.0
mean,43.92193,17.15117,200.915205,4201.754386
std,5.459584,1.974793,14.061714,801.954536
min,32.1,13.1,172.0,2700.0
25%,39.225,15.6,190.0,3550.0
50%,44.45,17.3,197.0,4050.0
75%,48.5,18.7,213.0,4750.0
max,59.6,21.5,231.0,6300.0
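
Each row of the summary corresponds to a method that can also be called on its own. The sketch below assumes a short stand-in Series, `body_mass`, holding the body mass values of rows 0 to 4 shown earlier (row 3 is missing), and compares `describe()` with the individual calls. Note that `count` skips missing values, which is why the table above reports 342 rather than 344.

```python
import numpy as np
import pandas as pd

# Stand-in for df["body_mass_g"]: rows 0 to 4 shown earlier, with row 3 missing.
body_mass = pd.Series([3750.0, 3800.0, 3250.0, np.nan, 3450.0], name="body_mass_g")

summary = body_mass.describe()

# The same statistics computed one at a time; all of them skip NaN.
manual = {
    "count": body_mass.count(),
    "mean": body_mass.mean(),
    "std": body_mass.std(),  # sample standard deviation (ddof=1)
    "min": body_mass.min(),
    "25%": body_mass.quantile(0.25),
    "50%": body_mass.median(),
    "75%": body_mass.quantile(0.75),
    "max": body_mass.max(),
}

print(summary)
for name, value in manual.items():
    print(name, value, summary[name])
```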


### Summary of Data info()

**Rationale:** 

The pandas `DataFrame.info()` method prints a concise summary of the DataFrame: the index range, the number of columns, the column labels and data types, the number of non-null values in each column, and the memory usage.

**Findings:**
- The set contains 344 rows and 7 columns.
- Four variables are numeric with type float64: bill length in mm, bill depth in mm, flipper length in mm, and body mass in grams.
- Three variables are categorical with type object: species, island, and sex.
- Five of the seven columns have missing values: the four numeric columns each have 342 non-null values, and sex has 333.

In [8]:
df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
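
The non-null counts printed by `info()` can be turned into missing-value counts directly. This is a minimal sketch, assuming a hypothetical three-column frame `sample` built from values in rows 0 to 4 shown earlier; it uses `isna().sum()` and `count()`, which together account for every row.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with one fully populated column and two with a gap.
sample = pd.DataFrame(
    {
        "species": ["Adelie"] * 5,
        "bill_length_mm": [39.1, 39.5, 40.3, np.nan, 36.7],
        "sex": ["MALE", "FEMALE", "FEMALE", np.nan, "FEMALE"],
    }
)

missing_per_column = sample.isna().sum()  # number of NaN values per column
non_null_per_column = sample.count()      # same counts that info() reports

# Columns that contain at least one missing value.
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

print(missing_per_column)
print(non_null_per_column)
print(columns_with_missing)
```

Applied to the full DataFrame, the same two calls would list which columns fall short of 344 non-null values.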


***

<p align="center">
<img width="250" height="291" src="https://allisonhorst.github.io/palmerpenguins/reference/figures/palmerpenguins.png">
</p>

[Artwork by @allison_horst](https://allisonhorst.github.io/palmerpenguins/articles/art.html)

***

<p align="center">
<img width="428" height="291" src="https://allisonhorst.github.io/palmerpenguins/reference/figures/culmen_depth.png">
</p>

[Artwork by @allison_horst](https://allisonhorst.github.io/palmerpenguins/articles/art.html)

### Culmen measurements

What are culmen length and depth? The culmen is “the upper ridge of a bird’s beak” (definition from Oxford Languages). In the simplified penguins subset, culmen length and depth have been renamed to the variables `bill_length_mm` and `bill_depth_mm`.

For this penguin data, the bill/culmen length and depth are measured as shown in the figure above (thanks Kristen Gorman for clarifying!).

## References

***

- Jupyter Notebook: https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html
- W3Schools: https://www.w3schools.com/python/pandas/ref_df_info.asp
- Pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html

***
## End