# Introduction

The palmerpenguins dataset contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.

These data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data were imported directly from the Environmental Data Initiative (EDI) Data Portal and are available under a CC0 license (“No Rights Reserved”) in accordance with the Palmer Station Data Policy.

We will use this dataset in a classification setting to predict each penguin’s species from anatomical information.

![title](images/lter_penguins.png)

# Setup

In [3]:
import sklearn
import pandas as pd

# Data Loading

In [18]:
# Loading the dataset from the local file.

DATASET_PATH = './data/penguins_size.csv'
dataset = pd.read_csv(DATASET_PATH)

# Data Analysis

## Top 5 rows

The first thing we are going to do is display the head of the dataset to get a first glimpse.

In [10]:
dataset.head()

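|   | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---------|-----------|------|------|-------|--------|--------|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |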


We can see that our dataset contains 7 attributes:
- 6 independent variables / features (island, culmen length, culmen depth, flipper length, body mass and sex)
- 1 dependent variable / label (species)

We can also already see one row with a total of 5 null values.
This is a good indication that we should probably remove this row during data preparation,
since it provides almost no value and would most likely add noise.
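
As a minimal sketch of how such mostly-empty rows could be found and dropped, the snippet below counts missing values per row and removes rows with too few filled-in attributes. The `sample` frame is a hand-typed stand-in for the five rows shown above (in the notebook, `dataset` would be used instead), and the threshold of 3 non-null values is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the first five rows shown above; the notebook would use `dataset`.
sample = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "island": ["Torgersen"] * 5,
    "culmen_length_mm": [39.1, 39.5, 40.3, np.nan, 36.7],
    "culmen_depth_mm": [18.7, 17.4, 18.0, np.nan, 19.3],
    "flipper_length_mm": [181.0, 186.0, 195.0, np.nan, 193.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 3450.0],
    "sex": ["MALE", "FEMALE", "FEMALE", np.nan, "FEMALE"],
})

# Number of missing values in each row.
nulls_per_row = sample.isnull().sum(axis=1)
print(nulls_per_row)

# Keep only rows with at least 3 non-null values (hypothetical threshold).
# Row 3 has just 2 non-null values (species and island), so it is dropped.
cleaned = sample.dropna(thresh=3)
print(cleaned)
```

Using `thresh` rather than a plain `dropna()` keeps rows that are missing only a single attribute, which may still be worth imputing instead of discarding.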

## Quick overview

After a first look at the data, it is a good idea to get an overview of the whole dataset
and analyze its samples and their attributes, to decide which steps should be performed during
data preparation.

In [11]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   culmen_length_mm   342 non-null    float64
 3   culmen_depth_mm    342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                334 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


We can see that 5 out of 7 attributes contain null values.
Neither the target attribute (label), species, nor the island attribute contains null values.
The other attributes will require some cleanup during the data preparation step.
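
The null counts can also be read off directly rather than derived from the `info()` output. The sketch below defines a small helper, `null_summary` (a hypothetical name, not part of pandas), and runs it on a toy frame. Applied to `dataset`, it should report 2 missing values for each of the four measurement columns and 10 for `sex`, matching the non-null counts above.

```python
import numpy as np
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values for each column."""
    counts = df.isnull().sum()
    return pd.DataFrame({
        "null_count": counts,
        "null_share": counts / len(df),
    })


# Toy data for illustration only; the notebook would pass `dataset`.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "body_mass_g": [3750.0, np.nan, 3800.0, 3250.0],
    "sex": ["MALE", np.nan, np.nan, "FEMALE"],
})

summary = null_summary(toy)
print(summary)
```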

In addition to handling null values, we will also have to transform the non-numerical attributes. There are 3 of them in the dataset (species, island, sex), and we will have to transform them into numerical values so they can be handled by our model.
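
One possible way to do this transformation, sketched here with pandas only, is to one-hot encode the categorical features and map the label to integer codes. This is an illustration on a toy frame, not necessarily the encoding used later in the notebook; scikit-learn encoders would be an alternative.

```python
import pandas as pd

# Toy data for illustration only; the notebook would use `dataset`.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe"],
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
})

# One-hot encode the categorical features: one indicator column per category.
features = pd.get_dummies(toy[["island", "sex"]], columns=["island", "sex"])

# Encode the label as integer codes; categories are ordered alphabetically.
species_cat = toy["species"].astype("category")
labels = species_cat.cat.codes
label_names = list(species_cat.cat.categories)

print(features)
print(labels.tolist(), label_names)
```

One-hot encoding avoids implying an order between islands, while a single integer column is enough for the label because classifiers treat it as a class identifier.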

## Analyzing non-numerical attributes

### species

In [16]:
dataset["species"].value_counts()

Adelie       152
Gentoo       124
Chinstrap     68
Name: species, dtype: int64

### island

In [15]:
dataset["island"].value_counts()

Biscoe       168
Dream        124
Torgersen     52
Name: island, dtype: int64

### sex

In [14]:
dataset["sex"].value_counts()

MALE      168
FEMALE    165
.           1
Name: sex, dtype: int64
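
The counts above include a single `.` entry, which is not a valid category and presumably stands for a missing value (an assumption; the data source does not explain it). A minimal sketch of treating any value other than `MALE` or `FEMALE` as missing, shown on a toy series rather than the real column:

```python
import numpy as np
import pandas as pd

# Toy series for illustration only; the notebook would use dataset["sex"].
sex = pd.Series(["MALE", "FEMALE", ".", np.nan, "MALE"], name="sex")

# Assumed set of valid categories.
valid = {"MALE", "FEMALE"}

# Keep valid entries and turn everything else (including ".") into NaN.
cleaned_sex = sex.where(sex.isin(valid))

print(cleaned_sex.value_counts(dropna=False))
```

After this step the `.` entry can be handled together with the other null values during data preparation.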

# Data Preparation

# Training

# Evaluation

# Prediction