# Palmer Penguins
***
![Penguins](https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png)

This notebook contains my analysis of the famous Palmer Penguins data set.

The data set is available [on GitHub](https://allisonhorst.github.io/palmerpenguins/).


**Disclaimer:** I used ChatGPT to generate ideas and sketches for the content of this notebook. The notebook is mainly my own work: ChatGPT sometimes suggested clearly incorrect ideas, and in any case I had to rework the code and text it generated to meet my own needs.

In [1]:
# Data frames.
import pandas as pd

# Plotting.
import matplotlib.pyplot as plt

# Numerical arrays.
import numpy as np


In [2]:
# Load the penguins data set.
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")

The following table shows all the variables available to work with: species, island, bill length and depth, flipper length, body mass, and sex.

In [4]:
# Let's have a look.
df

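|     | species | island    | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex    |
|-----|---------|-----------|----------------|---------------|-------------------|-------------|--------|
| 0   | Adelie  | Torgersen | 39.1           | 18.7          | 181.0             | 3750.0      | MALE   |
| 1   | Adelie  | Torgersen | 39.5           | 17.4          | 186.0             | 3800.0      | FEMALE |
| 2   | Adelie  | Torgersen | 40.3           | 18.0          | 195.0             | 3250.0      | FEMALE |
| 3   | Adelie  | Torgersen | NaN            | NaN           | NaN               | NaN         | NaN    |
| 4   | Adelie  | Torgersen | 36.7           | 19.3          | 193.0             | 3450.0      | FEMALE |
| ... | ...     | ...       | ...            | ...           | ...               | ...         | ...    |
| 339 | Gentoo  | Biscoe    | NaN            | NaN           | NaN               | NaN         | NaN    |
| 340 | Gentoo  | Biscoe    | 46.8           | 14.3          | 215.0             | 4850.0      | FEMALE |
| 341 | Gentoo  | Biscoe    | 50.4           | 15.7          | 222.0             | 5750.0      | MALE   |
| 342 | Gentoo  | Biscoe    | 45.2           | 14.8          | 212.0             | 5200.0      | FEMALE |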
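
Some rows in the table (for example rows 3 and 339) have no measurements at all. One possible way to quantify this is to count the missing values per column with `isna().sum()`. The sketch below uses a small hand-typed `sample` frame holding only the first five rows shown above, an assumption made so that it runs without downloading the full data set; `measurement_cols` is a helper name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hand-typed sample: the first five rows of the table above (assumption:
# used here instead of the full data set so the example runs offline).
sample = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "island": ["Torgersen"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, np.nan, 36.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, np.nan, 19.3],
    "flipper_length_mm": [181.0, 186.0, 195.0, np.nan, 193.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 3450.0],
    "sex": ["MALE", "FEMALE", "FEMALE", np.nan, "FEMALE"],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Rows in which every measurement column is missing.
measurement_cols = ["bill_length_mm", "bill_depth_mm",
                    "flipper_length_mm", "body_mass_g"]
rows_without_measurements = sample[sample[measurement_cols].isna().all(axis=1)]
print(rows_without_measurements)
```

Applied to the full `df`, the same two calls should show which columns are incomplete and which penguins were never measured.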


### We'll take the data apart a bit to get a better overview.
First we look at the first 30 rows, as well as the head and the tail of the table.

In [39]:
# Display the first 30 rows using iloc
print("First 30 rows:")
print(df.iloc[0:30])

# Display the first few rows using head
print("\n\nFirst few rows:")
print(df.head())

# Display the last few rows using tail
print("\n\nLast few rows:")
print(df.tail())


First 30 rows:
   species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0   Adelie  Torgersen            39.1           18.7              181.0   
1   Adelie  Torgersen            39.5           17.4              186.0   
2   Adelie  Torgersen            40.3           18.0              195.0   
3   Adelie  Torgersen             NaN            NaN                NaN   
4   Adelie  Torgersen            36.7           19.3              193.0   
5   Adelie  Torgersen            39.3           20.6              190.0   
6   Adelie  Torgersen            38.9           17.8              181.0   
7   Adelie  Torgersen            39.2           19.6              195.0   
8   Adelie  Torgersen            34.1           18.1              193.0   
9   Adelie  Torgersen            42.0           20.2              190.0   
10  Adelie  Torgersen            37.8           17.1              186.0   
11  Adelie  Torgersen            37.8           17.3              180.0   
12  Adelie
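
The slice `iloc[0:30]` excludes its end position, so it returns positions 0 to 29, which is 30 rows in total. A minimal sketch on a hypothetical ten-row frame (`df_toy` is not part of the penguins data) shows that `iloc[0:n]` and `head(n)` select the same rows:

```python
import pandas as pd

# Hypothetical ten-row frame, used only to illustrate the slicing rules.
df_toy = pd.DataFrame({"value": range(10)})

# iloc slices by position and excludes the end position: rows 0, 1, 2.
first_three_iloc = df_toy.iloc[0:3]

# head(n) returns the first n rows, so it matches the slice above.
first_three_head = df_toy.head(3)

# tail(n) returns the last n rows.
last_two = df_toy.tail(2)

print(first_three_iloc.equals(first_three_head))
print(last_two)
```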

How many penguins don't have their sex determined?

In [24]:
# Count the number of occurrences of each sex.
# dropna=False keeps NaN values as their own category in the result.
sex_counts = df['sex'].value_counts(dropna=False)

# Total number of penguins.
total_penguins = sex_counts.sum()

# Number of penguins with undetermined sex (NaN values).
# isna() is used because looking NaN up by key is fragile (NaN != NaN).
undetermined_sex_count = df['sex'].isna().sum()

# Number of penguins with determined sex.
determined_sex_count = total_penguins - undetermined_sex_count

print(f"Of {total_penguins} total penguins, {undetermined_sex_count} don't have their sex determined.")



Of 344 total penguins, 11 don't have their sex determined.
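
The same question can also be answered as a proportion. The sketch below assumes a small, made-up `sex` Series (hypothetical values, not the real data) and uses `value_counts(normalize=True)` to show each category's share, with missing values kept as their own category.

```python
import numpy as np
import pandas as pd

# Toy 'sex' column (hypothetical values, not the real data set).
sex = pd.Series(
    ["MALE", "FEMALE", "FEMALE", np.nan, "MALE", "FEMALE", np.nan, "MALE"],
    name="sex",
)

# Absolute counts, keeping NaN as its own category.
counts = sex.value_counts(dropna=False)

# Relative shares (they sum to 1 because NaN is included).
shares = sex.value_counts(dropna=False, normalize=True)

# Number of missing entries, counted directly.
missing = int(sex.isna().sum())

print(counts)
print(shares.round(3))
print(f"{missing} of {len(sex)} entries are missing.")
```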


In [12]:
# Islands.
df['island']

0      Torgersen
1      Torgersen
2      Torgersen
3      Torgersen
4      Torgersen
         ...    
339       Biscoe
340       Biscoe
341       Biscoe
342       Biscoe
343       Biscoe
Name: island, Length: 344, dtype: object

In [15]:
# Count of islands
island_counts = df['island'].value_counts()
number_of_islands = len(island_counts)

print("Number of islands:", number_of_islands)



Number of islands: 3
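
Counting the rows of `value_counts()` works, but pandas also offers `nunique()` and `unique()` for this. A minimal sketch, assuming a short hand-made `islands` Series rather than the real column:

```python
import pandas as pd

# Hand-made island column (hypothetical sample of the real values).
islands = pd.Series(
    ["Torgersen", "Torgersen", "Biscoe", "Dream", "Biscoe"], name="island"
)

# Number of distinct islands (NaN is ignored by default).
number_of_islands = islands.nunique()

# The distinct island names, sorted for a stable display order.
island_names = sorted(islands.unique())

print("Number of islands:", number_of_islands)
print("Islands:", island_names)
```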


In [16]:
# Group the DataFrame by 'island' and 'sex', then count the rows in each combination.
# Rows with a missing 'sex' are left out, because groupby drops NaN keys by default.
# This code was proposed by ChatGPT.
island_sex_counts = df.groupby(['island', 'sex']).size().reset_index(name='count')

print("Number and sex on each island:")
print(island_sex_counts)

Number and sex on each island:
      island     sex  count
0     Biscoe  FEMALE     80
1     Biscoe    MALE     83
2      Dream  FEMALE     61
3      Dream    MALE     62
4  Torgersen  FEMALE     24
5  Torgersen    MALE     23
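
The counts above add up to 333, not 344, because `groupby` leaves out rows whose key is NaN. If the penguins with undetermined sex should appear as their own group, `dropna=False` can be passed to `groupby` (available in pandas 1.1 and later). The sketch below assumes a tiny made-up frame called `toy`, not the real data.

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing 'sex' values (hypothetical data).
toy = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Biscoe", "Dream", "Dream", "Torgersen"],
    "sex": ["FEMALE", "MALE", np.nan, "FEMALE", "MALE", np.nan],
})

# Default behaviour: rows with NaN in a grouping key are dropped.
without_missing = (
    toy.groupby(["island", "sex"]).size().reset_index(name="count")
)

# dropna=False keeps NaN as a separate group.
with_missing = (
    toy.groupby(["island", "sex"], dropna=False).size().reset_index(name="count")
)

print(without_missing)
print(with_missing)
```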


In [10]:
# Count the number of penguins of each sex (NaN values are excluded by default).
df['sex'].value_counts()

sex
MALE      168
FEMALE    165
Name: count, dtype: int64

In [27]:
# Inspect the data types of each column.
df.dtypes

species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object
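
The text columns (`species`, `island`, `sex`) are stored as generic `object` values. Since each holds only a few distinct labels, one option is to convert them to the pandas `category` dtype. This is a sketch on a made-up frame called `toy`, not a step the analysis requires:

```python
import pandas as pd

# Made-up frame with the three text columns (hypothetical rows).
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Torgersen", "Torgersen", "Biscoe", "Biscoe"],
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
})

# Convert every column to the categorical dtype.
categorical = toy.astype("category")

print(categorical.dtypes)

# The distinct labels of a categorical column.
print(categorical["species"].cat.categories)
```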

In [11]:
# Describe the data set.
df.describe()


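|       | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|-------|----------------|---------------|-------------------|-------------|
| count | 342.0          | 342.0         | 342.0             | 342.0       |
| mean  | 43.92193       | 17.15117      | 200.915205        | 4201.754386 |
| std   | 5.459584       | 1.974793      | 14.061714         | 801.954536  |
| min   | 32.1           | 13.1          | 172.0             | 2700.0      |
| 25%   | 39.225         | 15.6          | 190.0             | 3550.0      |
| 50%   | 44.45          | 17.3          | 197.0             | 4050.0      |
| 75%   | 48.5           | 18.7          | 213.0             | 4750.0      |
| max   | 59.6           | 21.5          | 231.0             | 6300.0      |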
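
The `count` row is 342 rather than 344 because `describe()` skips missing values. The summary also pools all species together. A per-species summary could be sketched with `groupby` and `agg`, as below. The `toy` frame uses a handful of body masses typed in from the table near the top of the notebook plus one NaN, so its numbers describe only this sample, not the full data set.

```python
import numpy as np
import pandas as pd

# Small sample with one missing value (assumption: values typed in by hand).
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie",
                "Gentoo", "Gentoo", "Gentoo"],
    "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan,
                    4850.0, 5750.0, 5200.0],
})

# Summary statistics per species; NaN values are skipped, as in describe().
summary = toy.groupby("species")["body_mass_g"].agg(
    ["count", "mean", "min", "max"]
)

print(summary)
```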


## Two Variable Plots
***

In [25]:
# Get the flipper lengths.
plen = df['flipper_length_mm']

# Show.
print(plen)

# Type.
print(type(plen))

0      181.0
1      186.0
2      195.0
3        NaN
4      193.0
       ...  
339      NaN
340    215.0
341    222.0
342    212.0
343    213.0
Name: flipper_length_mm, Length: 344, dtype: float64
<class 'pandas.core.series.Series'>


In [26]:
# Just get the numpy array.
plen = plen.to_numpy()

# Show.
plen

array([181., 186., 195.,  nan, 193., 190., 181., 195., 193., 190., 186.,
       180., 182., 191., 198., 185., 195., 197., 184., 194., 174., 180.,
       189., 185., 180., 187., 183., 187., 172., 180., 178., 178., 188.,
       184., 195., 196., 190., 180., 181., 184., 182., 195., 186., 196.,
       185., 190., 182., 179., 190., 191., 186., 188., 190., 200., 187.,
       191., 186., 193., 181., 194., 185., 195., 185., 192., 184., 192.,
       195., 188., 190., 198., 190., 190., 196., 197., 190., 195., 191.,
       184., 187., 195., 189., 196., 187., 193., 191., 194., 190., 189.,
       189., 190., 202., 205., 185., 186., 187., 208., 190., 196., 178.,
       192., 192., 203., 183., 190., 193., 184., 199., 190., 181., 197.,
       198., 191., 193., 197., 191., 196., 188., 199., 189., 189., 187.,
       198., 176., 202., 186., 199., 191., 195., 191., 210., 190., 197.,
       193., 199., 187., 190., 191., 200., 185., 193., 193., 187., 188.,
       190., 192., 185., 190., 184., 195., 193., 18
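
The array still contains `nan` entries, which would turn a correlation or a fitted line into `nan` as well. A minimal sketch of one way to continue, assuming two short hand-typed arrays (`flipper` and `mass`, taken from the first rows of the table) in place of the full columns: mask out the missing pairs, compute the Pearson correlation with `np.corrcoef`, fit a straight line with `np.polyfit`, and plot both.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hand-typed sample of flipper lengths (mm) and body masses (g);
# an assumption standing in for the full columns of the data set.
flipper = np.array([181.0, 186.0, 195.0, np.nan, 193.0])
mass = np.array([3750.0, 3800.0, 3250.0, np.nan, 3450.0])

# Keep only the positions where both measurements are present.
mask = ~np.isnan(flipper) & ~np.isnan(mass)
x, y = flipper[mask], mass[mask]

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(x, y)[0, 1]

# Slope m and intercept c of the least-squares line y = m * x + c.
m, c = np.polyfit(x, y, 1)

# Scatter plot of the data with the fitted line on top.
fig, ax = plt.subplots()
ax.plot(x, y, "x", label="penguins")
x_line = np.array([x.min(), x.max()])
ax.plot(x_line, m * x_line + c, label="best-fit line")
ax.set_xlabel("Flipper length (mm)")
ax.set_ylabel("Body mass (g)")
ax.set_title(f"Flipper length vs body mass (r = {r:.2f})")
ax.legend()
```

In a notebook the figure renders inline. With the full data set, `df[['flipper_length_mm', 'body_mass_g']].dropna()` would be an alternative to the mask.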

### Further reading and references
***
https://archive.ics.uci.edu/dataset/690/palmer+penguins-3

https://github.com/mwaskom/seaborn-data/blob/master/penguins.csv

https://realpython.com/python-matplotlib-guide/#understanding-pltsubplots-notation

https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html

https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html


***

## End