# <center>Palmer Penguins Exploratory Data Analysis</center>

> *Irina Simoes*

#### Table of Contents

- [1. Intro](#1.-Intro)
- [2. Data Exploration](#2.-Data-Exploration)
- [X. Notes](#X.-Notes)

### 1. Intro

This exploratory data analysis (EDA) aims to provide insights into the variables of the [Palmer Penguins dataset](https://allisonhorst.github.io/palmerpenguins/articles/intro.html), which was first introduced by Allison Horst, Alison Hill, and Kristen Gorman in 2020. The dataset is a collection of data about three different species of penguins inhabiting the Palmer Archipelago near Palmer Station in Antarctica. The data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network.

For this analysis, we will focus on exploring the data to gain insights into the underlying attributes, with the end goal of uncovering patterns and identifying dependencies. We will seek to explore the [correlation between two of the variables](https://towardsdatascience.com/what-it-takes-to-be-correlated-ce41ad0d8d7f), whether or not the relationship is causal. The report is structured as per the [Exploratory Data Analysis in Python](https://www.geeksforgeeks.org/exploratory-data-analysis-in-python/) article.


- https://www.geeksforgeeks.org/exploratory-data-analysis-eda-types-and-tools/
- https://www.geeksforgeeks.org/exploratory-data-analysis-in-python-set-1/
- https://www.geeksforgeeks.org/data-analysis-visualization-python-set-2/
- https://www.geeksforgeeks.org/what-is-exploratory-data-analysis/





----

### 2. Data Exploration

> 📝 **Load all the required libraries for the analysis:**
* seaborn
* pandas
* numpy
* matplotlib

In [6]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

> 📝 **Load the dataset which our analysis will be based on:**

As stated in the [official documentation for the Seaborn library](https://seaborn.pydata.org/generated/seaborn.load_dataset.html), datasets can be loaded from an [online repository](https://github.com/mwaskom/seaborn-data):
- first by invoking `seaborn.get_dataset_names()` to check whether Palmer Penguins is listed among the available datasets;
- secondly by invoking `seaborn.load_dataset()` with the actual dataset name, namely *penguins*, to load it into our project.

The loaded dataset is a pandas DataFrame, since that is the type `load_dataset()` returns.

In [7]:
sns.get_dataset_names()

['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'dowjones',
 'exercise',
 'flights',
 'fmri',
 'geyser',
 'glue',
 'healthexp',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'seaice',
 'taxis',
 'tips',
 'titanic']

In [23]:
df = sns.load_dataset("penguins")

> 📝 **Gain general knowledge about the data:**

We should get a basic understanding of the data structure, format, and characteristics by inspecting the dataset's dimensions, data types, presence of missing values and/or duplicate records, as well as exploring some initial summary statistics and visualizations.

- Check the DataFrame's dimensionality with the pandas `.shape` attribute, which returns a tuple with the total number of rows and columns, as per the [pandas official documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html). This tells us the size and structure of the dataset before we perform further analysis.

In [25]:
print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])

Number of rows: 344
Number of columns: 7
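
The other checks mentioned above (data types, missing values, and duplicate records) can be sketched along the same lines. The snippet below is a minimal illustration rather than part of the original analysis: it uses a small hand-made DataFrame, `sample`, as a hypothetical stand-in for `df` so that it runs without downloading anything. The same calls should apply to the penguins DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the penguins DataFrame (values are illustrative only)
sample = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
        "bill_length_mm": [39.1, np.nan, 46.1, 46.1],
        "body_mass_g": [3750.0, 3800.0, 4500.0, 4500.0],
    }
)

# Data type of each column
print(sample.dtypes)

# Number of missing values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Number of fully duplicated rows (the first occurrence is not counted)
duplicate_rows = int(sample.duplicated().sum())
print("Duplicate rows:", duplicate_rows)
```

Replacing `sample` with `df` would run the same checks on the full dataset.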


* Understand the distribution of numerical data by generating descriptive statistics with the [pandas `.describe()` method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#pandas.DataFrame.describe). It provides a quick summary that helps us understand the data's central tendency, variability, and range. This also gives us a quick glimpse of how many columns contain ca

*identify potential outliers or anomalies? check for literature.*

In [26]:
df.describe()

| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| count | 342.0 | 342.0 | 342.0 | 342.0 |
| mean | 43.92193 | 17.15117 | 200.915205 | 4201.754386 |
| std | 5.459584 | 1.974793 | 14.061714 | 801.954536 |
| min | 32.1 | 13.1 | 172.0 | 2700.0 |
| 25% | 39.225 | 15.6 | 190.0 | 3550.0 |
| 50% | 44.45 | 17.3 | 197.0 | 4050.0 |
| 75% | 48.5 | 18.7 | 213.0 | 4750.0 |
| max | 59.6 | 21.5 | 231.0 | 6300.0 |
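
One common convention for flagging potential outliers, offered here as an assumption rather than the method this analysis settles on, is Tukey's rule: values lying more than 1.5 times the interquartile range (IQR) beyond the quartiles are treated as suspect. The sketch below applies it to the `body_mass_g` quartiles reported in the summary above. The helper `iqr_fences` is a hypothetical name introduced for illustration.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of Tukey's rule for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# body_mass_g values taken from the describe() output above
q1, q3 = 3550.0, 4750.0
observed_min, observed_max = 2700.0, 6300.0

lower, upper = iqr_fences(q1, q3)
print("Fences:", lower, upper)

# True only if the observed range extends beyond either fence
has_outliers = observed_min < lower or observed_max > upper
print("Potential outliers in body_mass_g:", has_outliers)
```

With these numbers the fences are 1750 g and 6550 g, so the observed range of 2700 g to 6300 g lies entirely inside them. On the full data, the quartiles could instead be taken from `df["body_mass_g"].quantile([0.25, 0.75])`.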


In [17]:

# Shape of the DataFrame as a (rows, columns) tuple
print("Shape -- \n", df.shape, "\n\n")

# Descriptive statistics for all numerical variables (categorical ones are excluded)
print("Describe -- \n", df.describe(), "\n\n")

# Check the type of the loaded object
print("Type -- \n", type(df), "\n\n")

# Print the first 10 records
print("Head -- \n", df.head(10), "\n\n")

# Print the last 10 records
print("Tail -- \n", df.tail(10), "\n\n")

Shape -- 
 (344, 7) 


Describe -- 
        bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
count      342.000000     342.000000         342.000000   342.000000
mean        43.921930      17.151170         200.915205  4201.754386
std          5.459584       1.974793          14.061714   801.954536
min         32.100000      13.100000         172.000000  2700.000000
25%         39.225000      15.600000         190.000000  3550.000000
50%         44.450000      17.300000         197.000000  4050.000000
75%         48.500000      18.700000         213.000000  4750.000000
max         59.600000      21.500000         231.000000  6300.000000 


Type -- 
 <class 'pandas.core.frame.DataFrame'> 


Head -- 
   species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0     

----

### X. Notes

a. The Markdown formatting of this Jupyter Notebook was based on:
* [The Jupyter Notebook Formatting Guide](https://medium.com/pythoneers/jupyter-notebook-101-everything-you-need-to-know-56cda3ea76ef) by Raghu Prodduturi
* [Markdown Cheat Sheet](https://markdownguide.offshoot.io/cheat-sheet/)
* [Markdown Extended Syntax](https://markdownguide.offshoot.io/extended-syntax)