# | Palmer's Penguins Dataset EDA & Data Visualization |

### | Introduction |

[Originally published data](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081) by Kristen B. Gorman, Tony D. Williams, and William R. Fraser.

More information can be found [here](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081).

## | Importing libraries | 

In [1]:
# import the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
import plotly.express as px


## | Importing the Data |

#### The data was downloaded from Kaggle

- Palmer's Penguins | [Data Used in this project](https://www.kaggle.com/way2studytable/palmer-penguins-using-r) | [Raw Data](https://www.kaggle.com/malanep/palmer-penguine) | [Original Data](https://github.com/allisonhorst/palmerpenguins)

## | Understanding the Dataset |

### Dictionary
 
- species:
    1. Adelie
    2. Chinstrap
    3. Gentoo
- island:
    1. Torgersen
    2. Biscoe
    3. Dream
- culmen_length_mm:
    - Culmen (upper bill ridge) length of the observed penguin, in millimetres.
- culmen_depth_mm:
    - Culmen depth of the observed penguin, in millimetres.
- flipper_length_mm:
    - Flipper length of the observed penguin, in millimetres.
> | More information about bird measurements [here](https://en.wikipedia.org/wiki/Bird_measurement) |
- body_mass_g:
    - Body mass of the observed penguin, in grams.
- sex:
    1. MALE
    2. FEMALE
- body_mass_kg:
    - A new column derived from the body_mass_g column.
    - Body mass of the observed penguin, in kilograms.
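The dictionary above can double as a lightweight sanity check on the categorical columns. The sketch below runs on a small hypothetical frame rather than the real file; `EXPECTED` and `unexpected_values` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Allowed categories, copied from the data dictionary above.
EXPECTED = {
    "species": {"Adelie", "Chinstrap", "Gentoo"},
    "island": {"Torgersen", "Biscoe", "Dream"},
    "sex": {"MALE", "FEMALE"},
}

def unexpected_values(df, expected=EXPECTED):
    """Return {column: set of non-null values that the dictionary does not list}."""
    problems = {}
    for column, allowed in expected.items():
        found = set(df[column].dropna().unique()) - allowed
        if found:
            problems[column] = found
    return problems

# Hypothetical toy rows, for illustration only.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "sex": ["MALE", ".", None],
})
print(unexpected_values(toy))
```

Missing values are ignored here on purpose, so the check reports only values that are present but not in the dictionary.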

In [2]:
peng = pd.read_csv('data/PenguinsData.csv')

In [3]:
# looking at general information about the data
peng.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
species              344 non-null object
island               344 non-null object
culmen_length_mm     342 non-null float64
culmen_depth_mm      342 non-null float64
flipper_length_mm    342 non-null float64
body_mass_g          342 non-null float64
sex                  334 non-null object
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [4]:
peng.head()

,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE


In [5]:
peng.isnull().sum()

species               0
island                0
culmen_length_mm      2
culmen_depth_mm       2
flipper_length_mm     2
body_mass_g           2
sex                  10
dtype: int64
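Column totals do not show whether the missing values fall in the same rows, which is what determines how many rows `dropna()` removes. A minimal sketch on hypothetical toy data (not the penguin file) that counts nulls per row:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one row missing everything, one row missing only sex.
toy = pd.DataFrame({
    "body_mass_g": [3750.0, np.nan, 3450.0, 4200.0],
    "sex": ["MALE", None, None, "FEMALE"],
})

missing_per_row = toy.isnull().sum(axis=1)               # nulls in each row
rows_with_any_null = int(toy.isnull().any(axis=1).sum())  # rows dropna() would remove

print(missing_per_row.tolist())
print(rows_with_any_null)
print(len(toy.dropna()))
```

In this notebook the sex column alone has 10 nulls, and the final shape of 333 rows is reached after one further row is dropped, which is consistent with every row that lacks measurements also lacking a sex value.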

In [6]:
# Dropping null values and resetting the index
peng = peng.dropna()
peng.reset_index(drop=True, inplace=True)

In [7]:
# locating the single observation with the invalid value "." in the sex column
peng.loc[peng["sex"] == "."]

,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
327,Gentoo,Biscoe,44.5,15.7,217.0,4875.0,.


In [8]:
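# dropping the row with sex "." (index label 327) by its mask, then resetting the index
peng = peng.drop(peng.index[peng["sex"] == "."])
peng.reset_index(drop=True, inplace=True)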
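An alternative to removing the "." row by hand is to declare "." a missing-value marker at load time, so that a single `dropna()` handles it together with the other nulls. The sketch below uses the `na_values` parameter of `pd.read_csv` on a hypothetical in-memory CSV rather than the real file.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the real file.
csv_text = """species,sex
Adelie,MALE
Gentoo,.
Gentoo,FEMALE
Adelie,
"""

# na_values adds "." to the strings that read_csv parses as NaN.
toy = pd.read_csv(io.StringIO(csv_text), na_values=["."])
clean = toy.dropna().reset_index(drop=True)
print(clean)
```

This avoids depending on a specific index label, at the cost of silently discarding any other row that uses the same marker.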

In [9]:
# checking that the index was reset: label 327 now holds the next row
peng.loc[327]

species              Gentoo
island               Biscoe
culmen_length_mm       48.8
culmen_depth_mm        16.2
flipper_length_mm       222
body_mass_g            6000
sex                    MALE
Name: 327, dtype: object

In [10]:
# Shape of the data after dropping null and invalid values
peng.shape

(333, 7)

In [11]:
# Helper that converts body mass from grams to kilograms
def g_to_kg(x):
    return x / 1000

In [12]:
# Applying the function to create the body_mass_kg column
peng["body_mass_kg"] = peng["body_mass_g"].apply(g_to_kg)
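For a simple unit conversion like this, dividing the column directly gives the same values as `apply` without calling a Python function per element. A minimal sketch using the first three body masses shown in `peng.head()`:

```python
import pandas as pd

# First three body masses from the head() output above.
toy = pd.DataFrame({"body_mass_g": [3750.0, 3800.0, 3250.0]})

def g_to_kg(x):
    return x / 1000

via_apply = toy["body_mass_g"].apply(g_to_kg)
via_division = toy["body_mass_g"] / 1000  # vectorized over the whole column

print(via_division.tolist())
print(via_apply.equals(via_division))
```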

In [13]:
# final look at the dataset after some cleaning 
peng.head()

,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,body_mass_kg
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE,3.75
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE,3.8
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE,3.25
3,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE,3.45
4,Adelie,Torgersen,39.3,20.6,190.0,3650.0,MALE,3.65


## | Table Visualization |

In [14]:
# quick numeric information about the dataset
peng.describe()

,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,body_mass_kg
count,333.0,333.0,333.0,333.0,333.0
mean,43.992793,17.164865,200.966967,4207.057057,4.207057
std,5.468668,1.969235,14.015765,805.215802,0.805216
min,32.1,13.1,172.0,2700.0,2.7
25%,39.5,15.6,190.0,3550.0,3.55
50%,44.5,17.3,197.0,4050.0,4.05
75%,48.6,18.7,213.0,4775.0,4.775
max,59.6,21.5,231.0,6300.0,6.3
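`describe()` pools all three species together, so its mean and spread mix different groups. A per-species breakdown can be sketched with `groupby` and `agg`; the frame below is hypothetical toy data with placeholder values, not the penguin measurements.

```python
import pandas as pd

# Hypothetical toy values, for illustration only.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_kg": [3.0, 4.0, 5.0, 6.0],
})

# One row per species, one column per summary statistic.
per_species = toy.groupby("species")["body_mass_kg"].agg(["count", "mean", "min", "max"])
print(per_species)
```

On the notebook's frame the same call would be applied to `peng` instead of `toy`.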


In [15]:
# penguin counts per species and island
peng.groupby(["species","island"]).size().to_frame("count")

species,island,count
Adelie,Biscoe,44
Adelie,Dream,55
Adelie,Torgersen,47
Chinstrap,Dream,68
Gentoo,Biscoe,119


In [16]:
# penguin counts per species and sex
peng.groupby(["species","sex"]).size().to_frame("count")

species,sex,count
Adelie,FEMALE,73
Adelie,MALE,73
Chinstrap,FEMALE,34
Chinstrap,MALE,34
Gentoo,FEMALE,58
Gentoo,MALE,61
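The same two-way counts can also be laid out as a wide table, with one row per species and one column per sex, using `pd.crosstab`. A minimal sketch on hypothetical toy rows:

```python
import pandas as pd

# Hypothetical toy rows, for illustration only.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "sex": ["MALE", "FEMALE", "MALE", "MALE", "FEMALE"],
})

# Rows are species, columns are sex, cells are observation counts.
counts = pd.crosstab(toy["species"], toy["sex"])
print(counts)
```

The wide layout makes it easier to compare the male and female counts of a species side by side.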


## | Chart Visualization |

In [17]:
# grouped histogram of penguin counts by sex, colored by species
fig = px.histogram(peng,
                    x="sex",
                    color='species', barmode='group',
                    height=400,
                    width=400,
                    title="Penguins by Sex and Species")
fig.show()

In [18]:
# pie chart for penguins distribution based on the island 
fig = px.pie(peng, 
            names='island',
            width=400,
            height=400,
            title="Penguins by Location")
fig.show()

In [19]:
# histogram of penguin body mass in kg
fig = px.histogram(peng, 
                    x="body_mass_kg",
                    barmode='group',
                    height=400,
                    width=800,
                    title="Penguins body mass")
fig.show()

In [20]:
# three box plots by island: culmen_length_mm, culmen_depth_mm and flipper_length_mm
fig = px.box(peng,
            x="island",
            y="culmen_length_mm",
            height=500,
            width=500,
            title="Culmen length by island")
fig.show()

fig01 = px.box(peng,
                x="island",
                y="culmen_depth_mm",
                height=500,
                width=500,
                title="Culmen depth by island")
fig01.show()

fig02 = px.box(peng,
                x="island",
                y="flipper_length_mm",
                height=500,
                width=500,
                title="Flipper length by island")
fig02.show()
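The three near-identical box plots could instead be drawn from a single long-format frame. The sketch below covers only the reshaping step with `DataFrame.melt`, on hypothetical toy rows; `MEASUREMENTS` and `to_long` are illustrative names, not part of the notebook.

```python
import pandas as pd

MEASUREMENTS = ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]

def to_long(df):
    """Stack the three measurement columns into (island, measurement, value) rows."""
    return df.melt(id_vars=["island"], value_vars=MEASUREMENTS,
                   var_name="measurement", value_name="value")

# Hypothetical toy rows, for illustration only.
toy = pd.DataFrame({
    "island": ["Torgersen", "Biscoe"],
    "culmen_length_mm": [39.1, 46.1],
    "culmen_depth_mm": [18.7, 13.2],
    "flipper_length_mm": [181.0, 211.0],
})
long = to_long(toy)
print(long)
```

The resulting frame could then be passed to `px.box(long, x="island", y="value", facet_col="measurement")` to get one panel per measurement. Because the measurements are on different scales, unlinking the y-axes with `fig.update_yaxes(matches=None)` is probably worthwhile.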



In [21]:
# scatter plot of culmen length against culmen depth, colored by species
fig = px.scatter(peng,
                y="culmen_length_mm",  
                x="culmen_depth_mm", 
                color="species",
                height=500,
                width=800)
fig.show()
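A scatter plot colored by group is often paired with a numeric check, since the correlation across all points can differ from the correlation within each group. The sketch below computes both on hypothetical toy data built so that the two disagree in sign. It makes no claim about the real penguin measurements.

```python
import pandas as pd

# Hypothetical toy data: two groups, each with a rising trend,
# but the group with longer culmens has shallower ones.
toy = pd.DataFrame({
    "species": ["A", "A", "A", "B", "B", "B"],
    "culmen_length_mm": [38.0, 39.0, 40.0, 46.0, 47.0, 48.0],
    "culmen_depth_mm": [18.0, 18.5, 19.0, 14.0, 14.5, 15.0],
})

# Pearson correlation over all rows pooled together.
overall = toy["culmen_length_mm"].corr(toy["culmen_depth_mm"])

# Pearson correlation computed separately within each species.
by_species = {
    name: group["culmen_length_mm"].corr(group["culmen_depth_mm"])
    for name, group in toy.groupby("species")
}

print(round(overall, 3))
print({name: round(r, 3) for name, r in by_species.items()})
```

Running the same two computations on `peng` would show whether the pattern visible in the scatter plot holds within each species or only across them.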

In [1]:
!jupyter nbconvert --to html palmerpenguins.ipynb

[NbConvertApp] Converting notebook palmerpenguins.ipynb to html
[NbConvertApp] Writing 284471 bytes to palmerpenguins.html
