## Project: Data Analysis for Palmer Penguin data
- **Source**: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. Artwork by @allison_horst
- **URL**: 'https://gist.githubusercontent.com/slopp/ce3b90b9168f2f921784de84fa445651/raw/4ecf3041f0ed4913e7c230758733948bc561f434/penguins.csv'
- **Date**: 29/11/24
- **Goal**: Learn the basics of descriptive statistics

In [1]:
# Step 0. Load libraries and custom modules
# Data -----------------------------------------------------------------
import pandas as pd
import numpy as np
# Graphics -------------------------------------------------------------
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

### 1. Data loading
**Objective**: Obtain the data from source and get a first glimpse of their properties and presentation

In [2]:
# Step 1. Load data
# 1.1 Read the dataset from url
# Credits:
# Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer
# Archipelago (Antarctica) penguin data. R package version 0.1.0.
# https://allisonhorst.github.io/palmerpenguins/
url = (
    'https://gist.githubusercontent.com/slopp/'
    'ce3b90b9168f2f921784de84fa445651/raw/'
    '4ecf3041f0ed4913e7c230758733948bc561f434/penguins.csv'
)
df_raw = pd.read_csv(url)

<img src='https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png' width=400/>

**Image 1.** Penguins drawing. Artwork by @allison_horst.

In [3]:
# 1.2 Get basic info
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   rowid              344 non-null    int64  
 1   species            344 non-null    object 
 2   island             344 non-null    object 
 3   bill_length_mm     342 non-null    float64
 4   bill_depth_mm      342 non-null    float64
 5   flipper_length_mm  342 non-null    float64
 6   body_mass_g        342 non-null    float64
 7   sex                333 non-null    object 
 8   year               344 non-null    int64  
dtypes: float64(4), int64(2), object(3)
memory usage: 24.3+ KB


In [4]:
# 1.3 Get a reproducible sample
df_raw.sample(10, random_state=2025)

| | rowid | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|---|
| 172 | 173 | Gentoo | Biscoe | 50.2 | 14.3 | 218.0 | 5700.0 | male | 2007 |
| 254 | 255 | Gentoo | Biscoe | 47.2 | 15.5 | 215.0 | 4975.0 | female | 2009 |
| 69 | 70 | Adelie | Torgersen | 41.8 | 19.4 | 198.0 | 4450.0 | male | 2008 |
| 236 | 237 | Gentoo | Biscoe | 44.9 | 13.8 | 212.0 | 4750.0 | female | 2009 |
| 258 | 259 | Gentoo | Biscoe | 41.7 | 14.7 | 210.0 | 4700.0 | female | 2009 |
| 60 | 61 | Adelie | Biscoe | 35.7 | 16.9 | 185.0 | 3150.0 | female | 2008 |
| 133 | 134 | Adelie | Dream | 37.5 | 18.5 | 199.0 | 4475.0 | male | 2009 |
| 264 | 265 | Gentoo | Biscoe | 43.5 | 15.2 | 213.0 | 4650.0 | female | 2009 |
| 33 | 34 | Adelie | Dream | 40.9 | 18.9 | 184.0 | 3900.0 | male | 2007 |
| 124 | 125 | Adelie | Torgersen | 35.2 | 15.9 | 186.0 | 3050.0 | female | 2009 |


<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/culmen_depth.png" width=400/>

**Image 2.** Penguins dimensions. Artwork by @allison_horst.

### 2. Data preprocessing
**Objectives**: Perform data cleaning, transformation, and reduction to avoid mismatched, noisy, or unwrangled data

In [11]:
# Step 2. Prepare the dataset for analysis
# 2.1 Manage NaN values
df_baking = df_raw.copy()
# Drop the rowid and year columns
df_baking = df_baking.drop(["rowid", "year"], axis=1)
# Remove rows with a missing value in any of the listed columns
df_baking = df_baking.dropna(subset=["bill_length_mm", "flipper_length_mm", "body_mass_g", "sex"])
# Convert the text columns to the category dtype
df_baking["island"] = pd.Categorical(df_baking["island"])
df_baking["sex"] = pd.Categorical(df_baking["sex"])
df_baking["species"] = pd.Categorical(df_baking["species"])
df_baking.info()
df = df_baking.copy()

<class 'pandas.core.frame.DataFrame'>
Index: 333 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   species            333 non-null    category
 1   island             333 non-null    category
 2   bill_length_mm     333 non-null    float64 
 3   bill_depth_mm      333 non-null    float64 
 4   flipper_length_mm  333 non-null    float64 
 5   body_mass_g        333 non-null    float64 
 6   sex                333 non-null    category
dtypes: category(3), float64(4)
memory usage: 14.4 KB


### 3. Exploratory Data Analysis
**Objective**: Summarize the main characteristics of the dataset using descriptive statistics and data visualization methods

In [20]:
# 3.1 Get numerical and categorical summaries
df["body_mass_g"].median()

np.float64(4050.0)
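
The median above covers a single column. One possible way to get numerical and categorical summaries in one go is `DataFrame.describe`, sketched here on the ten rows shown in step 1.3 rather than on the full `df`; the name `sample` is only illustrative.

```python
import pandas as pd

# Ten rows copied from the sample in step 1.3 (not the full dataset)
sample = pd.DataFrame({
    "species": ["Gentoo", "Gentoo", "Adelie", "Gentoo", "Gentoo",
                "Adelie", "Adelie", "Gentoo", "Adelie", "Adelie"],
    "sex": ["male", "female", "male", "female", "female",
            "female", "male", "female", "male", "female"],
    "body_mass_g": [5700.0, 4975.0, 4450.0, 4750.0, 4700.0,
                    3150.0, 4475.0, 4650.0, 3900.0, 3050.0],
})
for col in ["species", "sex"]:
    sample[col] = pd.Categorical(sample[col])

# Numeric columns: count, mean, std, min, quartiles, max
numeric_summary = sample.describe()
# Categorical columns: count, unique, top, freq
categorical_summary = sample.describe(include=["category"])

print(numeric_summary)
print(categorical_summary)
```

On the notebook's data the same two calls would be applied to `df` directly.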

In [47]:
# 3.2 Count categorical values, in a stratified manner
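
A minimal sketch of a stratified count, assuming "stratified" here means counting one categorical variable within the levels of another (sex within species). It uses the ten rows from step 1.3; `sample`, `counts`, and `shares` are illustrative names.

```python
import pandas as pd

# Ten rows copied from the sample in step 1.3 (not the full dataset)
sample = pd.DataFrame({
    "species": ["Gentoo", "Gentoo", "Adelie", "Gentoo", "Gentoo",
                "Adelie", "Adelie", "Gentoo", "Adelie", "Adelie"],
    "sex": ["male", "female", "male", "female", "female",
            "female", "male", "female", "male", "female"],
})

# Absolute counts for every (species, sex) combination
counts = sample.groupby(["species", "sex"]).size()

# Relative frequencies of sex within each species
shares = sample.groupby("species")["sex"].value_counts(normalize=True)

print(counts)
print(shares)
```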

In [48]:
# 3.3 Create a cross table
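
One way to build the cross table is `pd.crosstab`. The sketch below crosses species with island on the ten rows from step 1.3; the choice of these two variables is an assumption, since the cell does not say which ones to cross.

```python
import pandas as pd

# Ten rows copied from the sample in step 1.3 (not the full dataset)
sample = pd.DataFrame({
    "species": ["Gentoo", "Gentoo", "Adelie", "Gentoo", "Gentoo",
                "Adelie", "Adelie", "Gentoo", "Adelie", "Adelie"],
    "island": ["Biscoe", "Biscoe", "Torgersen", "Biscoe", "Biscoe",
               "Biscoe", "Dream", "Biscoe", "Dream", "Torgersen"],
})

# Rows are species, columns are islands; margins=True adds the "All" totals
table = pd.crosstab(sample["species"], sample["island"], margins=True)
print(table)
```

Passing `normalize="index"` instead of `margins=True` would return row proportions rather than counts.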

In [49]:
# 3.4 Calculate statistics by species
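
Per-species statistics can be sketched with `groupby` followed by `agg`. The example uses body mass from the ten rows in step 1.3; the choice of mean, median, and standard deviation is an assumption about which statistics are wanted.

```python
import pandas as pd

# Ten rows copied from the sample in step 1.3 (not the full dataset)
sample = pd.DataFrame({
    "species": ["Gentoo", "Gentoo", "Adelie", "Gentoo", "Gentoo",
                "Adelie", "Adelie", "Gentoo", "Adelie", "Adelie"],
    "body_mass_g": [5700.0, 4975.0, 4450.0, 4750.0, 4700.0,
                    3150.0, 4475.0, 4650.0, 3900.0, 3050.0],
})

# One row per species, one column per statistic
by_species = sample.groupby("species")["body_mass_g"].agg(["mean", "median", "std"])
print(by_species)
```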

In [50]:
# 3.5 Show the histograms
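
A possible histogram layout, drawn with matplotlib on two numeric columns from the ten rows in step 1.3. The bin count of 5 and the figure size are arbitrary choices for this small sample, not values taken from the notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Ten rows copied from the sample in step 1.3 (not the full dataset)
sample = pd.DataFrame({
    "bill_length_mm": [50.2, 47.2, 41.8, 44.9, 41.7,
                       35.7, 37.5, 43.5, 40.9, 35.2],
    "body_mass_g": [5700.0, 4975.0, 4450.0, 4750.0, 4700.0,
                    3150.0, 4475.0, 4650.0, 3900.0, 3050.0],
})

# One panel per numeric column
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, column in zip(axes, sample.columns):
    ax.hist(sample[column], bins=5, edgecolor="black")  # bins=5 is arbitrary
    ax.set_title(column)
    ax.set_xlabel(column)
    ax.set_ylabel("count")
fig.tight_layout()
```

With the full `df`, a larger bin count would likely show the shape of each distribution better.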

In [51]:
# 3.6 Show the boxplot of the numerical values
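
As an illustration, a boxplot of one numeric variable split by species can be drawn with seaborn. The sketch uses body mass from the ten rows in step 1.3; splitting by species is an assumption, and the other numeric columns would follow the same pattern.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Ten rows copied from the sample in step 1.3 (not the full dataset)
sample = pd.DataFrame({
    "species": ["Gentoo", "Gentoo", "Adelie", "Gentoo", "Gentoo",
                "Adelie", "Adelie", "Gentoo", "Adelie", "Adelie"],
    "body_mass_g": [5700.0, 4975.0, 4450.0, 4750.0, 4700.0,
                    3150.0, 4475.0, 4650.0, 3900.0, 3050.0],
})

# One box per species on a shared body-mass axis
fig, ax = plt.subplots(figsize=(6, 4))
sns.boxplot(data=sample, x="species", y="body_mass_g", ax=ax)
ax.set_title("Body mass by species (10-row sample)")
```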


In [52]:
# 3.7 Show the bivariate analysis
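
One way to sketch a bivariate analysis is a scatter plot of two numeric variables together with their Pearson correlation. The pair used here (flipper length against body mass) is an assumption, and the data are the ten rows from step 1.3.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Ten rows copied from the sample in step 1.3 (not the full dataset)
sample = pd.DataFrame({
    "species": ["Gentoo", "Gentoo", "Adelie", "Gentoo", "Gentoo",
                "Adelie", "Adelie", "Gentoo", "Adelie", "Adelie"],
    "flipper_length_mm": [218.0, 215.0, 198.0, 212.0, 210.0,
                          185.0, 199.0, 213.0, 184.0, 186.0],
    "body_mass_g": [5700.0, 4975.0, 4450.0, 4750.0, 4700.0,
                    3150.0, 4475.0, 4650.0, 3900.0, 3050.0],
})

# Pearson correlation (the default method of Series.corr)
r = sample["flipper_length_mm"].corr(sample["body_mass_g"])

# Scatter plot coloured by species
fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(data=sample, x="flipper_length_mm", y="body_mass_g",
                hue="species", ax=ax)
ax.set_title(f"Pearson r = {r:.2f}")
```

For all numeric columns at once, `df.corr(numeric_only=True)` or `sns.pairplot` would be natural extensions.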


In [53]:
# 3.8 Calculate probabilities for continuous data
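
A minimal sketch of a probability calculation for a continuous variable, assuming body mass is roughly normally distributed. It compares the empirical share of penguins above a threshold with the tail probability of a normal model fitted to the sample mean and standard deviation. The 4500 g threshold is hypothetical, the normality assumption is not checked here, and the data are the ten rows from step 1.3.

```python
from statistics import NormalDist

import pandas as pd

# Body masses of the ten rows sampled in step 1.3 (not the full dataset)
masses = pd.Series([5700.0, 4975.0, 4450.0, 4750.0, 4700.0,
                    3150.0, 4475.0, 4650.0, 3900.0, 3050.0],
                   name="body_mass_g")

threshold = 4500.0  # hypothetical cut-off, chosen only for illustration

# Empirical probability: share of observations above the threshold
p_empirical = (masses > threshold).mean()

# Normal approximation using the sample mean and the sample std (ddof=1)
model = NormalDist(mu=masses.mean(), sigma=masses.std())
p_normal = 1 - model.cdf(threshold)

print(f"Empirical P(mass > {threshold:.0f} g): {p_empirical:.3f}")
print(f"Normal    P(mass > {threshold:.0f} g): {p_normal:.3f}")
```

A large gap between the two numbers on the full dataset would suggest the normal model is a poor fit, for example because species with different typical masses are mixed together.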
