# Palmer Penguins 
********************************

Welcome! This notebook contains my analysis of the well-known Palmer Penguins dataset.

In this notebook, we will explore and analyze a comprehensive set of data related to various penguin species. The dataset encompasses information on key variables such as species, island of origin, bill dimensions, flipper length, body mass, and sex.

Throughout the analysis, we will examine each variable in turn, looking for patterns and relationships within the data.

![Penguins](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*KU-V8tWWQU3nDtw12-bQ_g.png)

The Palmer Penguins dataset is available on [GitHub](https://allisonhorst.github.io/palmerpenguins/).


In [3]:
# Data frames. 
import pandas as pd 

In [4]:
# Load the penguins data set.
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv')

# Overview of the Data Set and Variables
****************************
This section gives a concise overview of the Palmer Penguins dataset and its key variables.

In [4]:
# Let's have a general look at the data set.
df

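|     | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | MALE |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE |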


In [5]:
# Display basic information about this dataset.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
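
The Non-Null Count column above shows that some columns contain missing values (for example, 333 of 344 entries for `sex`). The sketch below counts missing values per column directly. It assumes a small hand-built `sample` DataFrame copied from a few of the rows displayed earlier, not the full dataset, so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Sample only: a few rows copied from the DataFrame displayed above.
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Torgersen"] * 4 + ["Biscoe"] * 2,
    "bill_length_mm": [39.1, 39.5, 40.3, np.nan, np.nan, 46.8],
    "sex": ["MALE", "FEMALE", "FEMALE", np.nan, np.nan, "FEMALE"],
})

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Running `df.isna().sum()` on the full dataset should give counts that match 344 minus each Non-Null Count listed by `df.info()`.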


In [9]:
# Summary statistics for the numeric columns.
df.describe()

|     | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
| --- | --- | --- | --- | --- |
| count | 342.0 | 342.0 | 342.0 | 342.0 |
| mean | 43.92193 | 17.15117 | 200.915205 | 4201.754386 |
| std | 5.459584 | 1.974793 | 14.061714 | 801.954536 |
| min | 32.1 | 13.1 | 172.0 | 2700.0 |
| 25% | 39.225 | 15.6 | 190.0 | 3550.0 |
| 50% | 44.45 | 17.3 | 197.0 | 4050.0 |
| 75% | 48.5 | 18.7 | 213.0 | 4750.0 |
| max | 59.6 | 21.5 | 231.0 | 6300.0 |


In [11]:
# Look at the second row.
df.iloc[1]

species                 Adelie
island               Torgersen
bill_length_mm            39.5
bill_depth_mm             17.4
flipper_length_mm        186.0
body_mass_g             3800.0
sex                     FEMALE
Name: 1, dtype: object

In [6]:
# Display basic statistics of the dataset
summary_stats = df.describe(include='all')
print("\nSummary Statistics:")
print(summary_stats)


Summary Statistics:
       species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
count      344     344      342.000000     342.000000         342.000000   
unique       3       3             NaN            NaN                NaN   
top     Adelie  Biscoe             NaN            NaN                NaN   
freq       152     168             NaN            NaN                NaN   
mean       NaN     NaN       43.921930      17.151170         200.915205   
std        NaN     NaN        5.459584       1.974793          14.061714   
min        NaN     NaN       32.100000      13.100000         172.000000   
25%        NaN     NaN       39.225000      15.600000         190.000000   
50%        NaN     NaN       44.450000      17.300000         197.000000   
75%        NaN     NaN       48.500000      18.700000         213.000000   
max        NaN     NaN       59.600000      21.500000         231.000000   

        body_mass_g   sex  
count    342.000000   333  
unique    

### Species
- This dataset includes three species of penguins: Adelie, Chinstrap, and Gentoo. 
- Adelie is the most common species in this dataset, with 152 of the 344 penguins.


In [13]:
# Species of penguins.
df["species"]

0      Adelie
1      Adelie
2      Adelie
3      Adelie
4      Adelie
        ...  
339    Gentoo
340    Gentoo
341    Gentoo
342    Gentoo
343    Gentoo
Name: species, Length: 344, dtype: object

### Island of origin
- Penguins are observed on three islands: Torgersen, Biscoe, and Dream.
- The distribution of penguins among the islands is not uniform.

In [12]:
# Look at the island the penguins come from. 
df["island"]

0      Torgersen
1      Torgersen
2      Torgersen
3      Torgersen
4      Torgersen
         ...    
339       Biscoe
340       Biscoe
341       Biscoe
342       Biscoe
343       Biscoe
Name: island, Length: 344, dtype: object

### Sex of penguins
- The dataset records the sex of each observed penguin in two categories, MALE and FEMALE. The value is missing (NaN) for 11 of the 344 penguins.

In [15]:
# Sex of penguins
df["sex"]

0        MALE
1      FEMALE
2      FEMALE
3         NaN
4      FEMALE
        ...  
339       NaN
340    FEMALE
341      MALE
342    FEMALE
343      MALE
Name: sex, Length: 344, dtype: object
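
Printing the column shows individual values, including NaN at rows 3 and 339, but not how many penguins fall into each category. A minimal sketch of counting the categories together with the missing entries follows. It assumes a hand-built `sex` Series holding only the ten values printed above, not the full column.

```python
import numpy as np
import pandas as pd

# Sample only: the ten values visible in the output above (rows 0-4 and 339-343).
sex = pd.Series(
    ["MALE", "FEMALE", "FEMALE", np.nan, "FEMALE",
     np.nan, "FEMALE", "MALE", "FEMALE", "MALE"],
    name="sex",
)

# dropna=False keeps NaN as its own row in the counts.
sex_counts = sex.value_counts(dropna=False)
print(sex_counts)

# Number of penguins with no recorded sex.
n_missing = int(sex.isna().sum())
print("Missing:", n_missing)
```

The same two calls work unchanged on `df["sex"]`.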

### Body mass of penguins
- The body mass of the penguins in this dataset ranges from 2700 g to 6300 g.

In [10]:
# Count how often each body mass value occurs.
df["body_mass_g"].value_counts()

body_mass_g
3800.0    12
3700.0    11
3900.0    10
3950.0    10
3550.0     9
          ..
4475.0     1
3975.0     1
3575.0     1
3850.0     1
5750.0     1
Name: count, Length: 94, dtype: int64
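
`value_counts()` reports how often each weight occurs, but the range quoted above is easier to read off with `min()` and `max()`. A minimal sketch, assuming a small `body_mass` Series built from the rows printed earlier in this notebook rather than from the full dataset:

```python
import numpy as np
import pandas as pd

# Sample only: body mass values (grams) from the rows displayed earlier in this notebook.
body_mass = pd.Series(
    [3750.0, 3800.0, 3250.0, np.nan, 3450.0, np.nan, 4850.0, 5750.0, 5200.0],
    name="body_mass_g",
)

# min() and max() skip NaN values by default.
mass_min = body_mass.min()
mass_max = body_mass.max()
print(f"Body mass ranges from {mass_min:.0f} g to {mass_max:.0f} g")
```

Applied to the full column, `df["body_mass_g"].min()` and `df["body_mass_g"].max()` should agree with the `min` and `max` rows of `df.describe()` (2700 g and 6300 g).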

***************************


## Types of variables

In this section, I suggest which Python types should be used to model the variables in the Palmer Penguins dataset. For each variable, I explain the rationale behind the choice and cite the material I used in my research. I also work through practical examples using the variables discussed previously, to give a hands-on view of how each variable can be used when modeling the dataset in Python and what role it plays in data analysis.

### Species
This section explores the variable "species" within the Palmer Penguins dataset. The previous section gave a brief overview of this variable.

The penguins observed in the Palmer Penguins dataset belong to three species: Adelie, Chinstrap, and Gentoo. An introduction to the dataset is available at this link: [Palmer Penguins Intro](https://allisonhorst.github.io/palmerpenguins/articles/intro.html#:~:text=The%20palmerpenguins%20data%20contains%20size,Linux%20is%20named%20after%20penguins!).

#### Type of variable suggested
The species names in this dataset are categorical labels representing distinct categories. Therefore, a string data type is the most suitable approach for modeling this variable in Python.

#### Rationale
By storing the species names as strings, we can accurately capture their unique identities without suggesting any hierarchical order among them. String data types preserve each species name exactly and, at the same time, simplify identification within the dataset.
Treating the variable as categorical in Python lets us run analyses and visualizations tailored to the different species. For instance, we can calculate proportions, visualize species distribution across different variables, create an array of species names, and count occurrences of each species in the dataset. See examples below.


In [1]:
# Array of species names. This array can be used for filtering, sorting, and so on.
import numpy as np

# Define an array of species names.
names_species = np.array(["Adelie", "Chinstrap", "Gentoo"])

# Print the array.
print(names_species)

['Adelie' 'Chinstrap' 'Gentoo']


In [3]:
# Count occurrences of each species in the Dataset

# Import the pandas library
import pandas as pd

# Define df by loading the dataset
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv')

# Count the occurrences 
counts_species = df["species"].value_counts()
print("Counts species: ")
print(counts_species)

Counts species: 
species
Adelie       152
Gentoo       124
Chinstrap     68
Name: count, dtype: int64
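
The rationale above also mentions proportions and a categorical data type. The sketch below shows one way to do both with pandas: it converts string labels to the `category` dtype, confirms that the categories are unordered, and computes each species' share. To stay self-contained it rebuilds a `species` Series from the counts printed above instead of downloading the data. That reconstruction is an assumption made for illustration and does not preserve the original row order.

```python
import pandas as pd

# Rebuild one label per penguin from the counts reported above (illustration only).
species = pd.Series(
    ["Adelie"] * 152 + ["Gentoo"] * 124 + ["Chinstrap"] * 68,
    name="species",
)

# Convert the string labels to pandas' categorical dtype.
species_cat = species.astype("category")

# The categories carry no order unless one is requested explicitly.
print(species_cat.cat.categories.tolist())
print("Ordered:", species_cat.cat.ordered)

# Share of each species, rounded to three decimals.
proportions = species_cat.value_counts(normalize=True).round(3)
print(proportions)
```

With the real data, the equivalent call is `df["species"].astype("category")`. The underlying labels remain strings, so the string-based reasoning above still applies.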


#### Literature
- Alabuda, A. (2022). Classification and EDA - Palmer Penguins. Kaggle. [Link here](https://www.kaggle.com/code/alabuda/classification-and-eda-palmer-penguins)
- Analytics Vidhya. (2022, April). Data Exploration and Visualization using Palmer Penguins Dataset. [Link here](https://www.analyticsvidhya.com/blog/2022/04/data-exploration-and-visualisation-using-palmer-penguins-dataset/)
- Horst, A. (2020, November 30). Introduction to palmerpenguins. GitHub. [Link here](https://allisonhorst.github.io/palmerpenguins/articles/intro.html#:~:text=The%20palmerpenguins%20data%20contains%20size,Linux%20is%20named%20after%20penguins!)
- Google Developers. (n.d.). Python Strings. [Link here](https://developers.google.com/edu/python/strings)
- JavaTpoint. (n.d.). Categorical Variable in Python. [Link here](https://www.javatpoint.com/categorical-variable-in-python#:~:text=In%20Python%2C%20a%20categorical%20variable,as%20nominal%20variables%20or%20factors.)
- Lutz, M. (2009). Learning Python. O'Reilly Media.
- Twomey, M. (2022, July 12). Exploring Palmer Penguins. [Link here](https://rpubs.com/michelle10128/923430)
- NumPy Contributors. (n.d.). NumPy documentation. [Link here](https://numpy.org/doc/stable/)
- Real Python. (n.d.). Python Variables: Defining, Assigning, and Using Variables in Python. [Link here](https://realpython.com/python-variables/)
- Statistics Canada. (n.d.). Introduction to probability - Variables. [Link here](https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch8/5214817-eng.htm)
- Yale University Department of Statistics. (1997-98). Categorical Data. [Link here](http://www.stat.yale.edu/Courses/1997-98/101/catdat.htm)
- VanderPlas, J. (2016). Python Data Science Handbook. O'Reilly Media. [Link here](https://jakevdp.github.io/PythonDataScienceHandbook/)

### Island of Origin
This section focuses on the variable "island of origin" within the Palmer Penguins dataset. A brief overview of this variable has been provided in the preceding section.

The dataset categorizes the islands where penguins were observed into three main groups: Torgersen, Biscoe, and Dream. For an introduction to the Palmer Penguins dataset, you can refer to the [Palmer Penguins Intro](https://allisonhorst.github.io/palmerpenguins/articles/intro.html#:~:text=The%20palmerpenguins%20data%20contains%20size,Linux%20is%20named%20after%20penguins!).

#### Type of Variable Suggested
Similar to the "species" variable, the "island of origin" variable can be appropriately modeled as a categorical label representing distinct categories. Therefore, utilizing a string data type for this variable in Python would be the most suitable approach.

#### Rationale 
Categorizing the islands of origin as strings allows us to accurately capture their unique identities without implying any hierarchical order among them. Using string data types enables us to maintain the integrity of each island name individually while facilitating their identification within the dataset. Utilizing a categorical data type in Python facilitates various analyses and visualizations tailored to different island categories. For example, we can calculate proportions, visualize island distribution across different variables, create an array of island names, and count occurrences for each island in the dataset.

The examples below first create an array of island names and then count the occurrences of each island in the dataset:


In [10]:
# Create an array of island names

# Use NumPy's unique function.
# It finds the unique elements of an array and returns them sorted.
array_islands = df['island'].to_numpy()
unique_islands = np.unique(array_islands)

# Display the array
print("Unique islands:", unique_islands)

Unique islands: ['Biscoe' 'Dream' 'Torgersen']


In [11]:
# Count occurrences of each island in the dataset
island_counts = df['island'].value_counts()

# Print out the result
print("Occurrences of each island:")
print(island_counts)

Occurrences of each island:
island
Biscoe       168
Dream        124
Torgersen     52
Name: count, dtype: int64
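
The rationale above also mentions looking at the island distribution across other variables. One way to sketch this is a cross-tabulation of island against species. The `sample` DataFrame below is an assumption made for illustration: it contains only the ten rows whose species and island were printed earlier (rows 0-4 and 339-343), so the resulting table is not the full picture.

```python
import pandas as pd

# Sample only: species and island for rows 0-4 and 339-343 as printed earlier.
sample = pd.DataFrame({
    "species": ["Adelie"] * 5 + ["Gentoo"] * 5,
    "island": ["Torgersen"] * 5 + ["Biscoe"] * 5,
})

# Rows are islands, columns are species, cells are penguin counts.
island_by_species = pd.crosstab(sample["island"], sample["species"])
print(island_by_species)
```

On the full dataset, `pd.crosstab(df["island"], df["species"])` builds the same kind of table for all three islands and all three species.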


#### Literature
- Alabuda, A. (2022). Classification and EDA - Palmer Penguins. Kaggle. [Link here](https://www.kaggle.com/code/alabuda/classification-and-eda-palmer-penguins)
- Analytics Vidhya. (2022, April). Data Exploration and Visualization using Palmer Penguins Dataset. [Link here](https://www.analyticsvidhya.com/blog/2022/04/data-exploration-and-visualisation-using-palmer-penguins-dataset/)
- Horst, A. (2020, November 30). Introduction to palmerpenguins. GitHub. [Link here](https://allisonhorst.github.io/palmerpenguins/articles/intro.html#:~:text=The%20palmerpenguins%20data%20contains%20size,Linux%20is%20named%20after%20penguins!)
- NumPy Contributors. (n.d.). Array manipulation routines — NumPy v1.21 Manual. [Link here](https://numpy.org/doc/stable/reference/routines.array-manipulation.html)
- VanderPlas, J. (2016). Python Data Science Handbook. O'Reilly Media. [Link here](https://jakevdp.github.io/PythonDataScienceHandbook/)
- W3Resource. (n.d.). NumPy: Array manipulation - numpy.unique(). [Link here](https://www.w3resource.com/numpy/manipulation/unique.php)