# The palmerpenguins dataset

In this workshop, you'll work with a curated dataset of penguin observations from the Palmer Archipelago, Antarctica. The data were originally collected by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station LTER program](https://pallter.marine.rutgers.edu/) and are made available through the [palmerpenguins project](https://allisonhorst.github.io/palmerpenguins/).

Each row corresponds to one penguin observed in Antarctica, with information such as:

- `species` – penguin species (Adélie, Chinstrap, Gentoo)  
- `island` – island in the Palmer Archipelago  
- `bill_length_mm` – bill (beak) length in millimeters  
- `bill_depth_mm` – bill depth in millimeters  
- `flipper_length_mm` – flipper length in millimeters  
- `body_mass_g` – body mass in grams  
- `sex` – biological sex (male, female)  
- `year` – year of observation  

Check out the [palmerpenguins project page](https://allisonhorst.github.io/palmerpenguins/) for some nice visuals that illustrate penguin bill anatomy.

Below you'll find some suggestions to get started. Pick the level that best suits you and dig in, or ignore them and do your own thing. Happy coding!

##### **Beginner**
- New to Pandas? Start by taking a look at the [official documentation](https://pandas.pydata.org/docs/). A good first step is often to look at basic summaries (`.head()`, `.info()`, `.describe()`).
- Print the `.shape` of the DataFrame and list all column names. How many penguins and how many variables are there?
- It's always good to check out what kind of values are present in a certain column. For `species`, `island`, `sex`, and `year`, show the unique values and their counts using `.value_counts()`.
- Filter the data to keep only Adélie penguins (stored as `Adelie` in the `species` column). How many Adélies were measured on each island?
- Make a histogram of `flipper_length_mm` and another histogram of `body_mass_g`. Can you describe what the distribution looks like (symmetric, skewed, multi-modal)?
- Let's investigate average body mass. Can you figure out how to use [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) in order to compute the mean and standard deviation of `body_mass_g` for each `species`?
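
The beginner steps above can be sketched as follows. This is a minimal illustration, not a reference solution: the `toy` DataFrame and its values are made up so the block runs on its own, and in the workshop you would apply the same calls to the `df` loaded from `palmerpenguins.csv`.

```python
import pandas as pd

# Toy stand-in for the workshop data: made-up values, NOT real measurements.
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
        "island": ["Torgersen", "Biscoe", "Biscoe", "Biscoe", "Biscoe", "Dream"],
        "body_mass_g": [3700.0, 3800.0, 3600.0, 5000.0, 5200.0, 3750.0],
    }
)

# Rows and columns: .shape is (n_rows, n_columns).
n_rows, n_cols = toy.shape
print(n_rows, "rows,", n_cols, "columns:", list(toy.columns))

# Unique values and their counts for a categorical column.
species_counts = toy["species"].value_counts()
print(species_counts)

# A boolean mask keeps only the Adelie rows; then count them per island.
adelie = toy[toy["species"] == "Adelie"]
adelie_per_island = adelie["island"].value_counts()
print(adelie_per_island)

# Mean and standard deviation of body mass for each species.
mass_summary = toy.groupby("species")["body_mass_g"].agg(["mean", "std"])
print(mass_summary)
```

A group with a single row (Chinstrap in this toy data) gets `NaN` for the standard deviation, because pandas computes the sample standard deviation. For the histograms, `df["flipper_length_mm"].plot.hist()` is one option; it needs matplotlib installed.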


##### **Intermediate**
- Sometimes it is helpful to introduce new columns representing the interaction between others. Create a new column `bill_ratio = bill_length_mm / bill_depth_mm`. Compare the average `bill_ratio` across species and sexes using `groupby(["species", "sex"])`. Do some groups have "slimmer" bills than others?
- Let's build a mini EDA dashboard. Make a figure with three subplots: a histogram of `flipper_length_mm` for each species (one subplot per species). What differences in flipper lengths do you see between species?
- Time trends are always interesting. For each `year`, compute the mean `flipper_length_mm` and `body_mass_g` by species. Make a plot and ask yourself: are there noticeable year-to-year differences?
- For each island, summarize and visualize the distributions of `bill_length_mm` and `bill_depth_mm` (e.g. boxplots by island). Do some islands host penguins with systematically different bill shapes?
- [Linear regression](https://en.wikipedia.org/wiki/Linear_regression) is useful for statistically explaining the relationships between variables. Pick a particular penguin species and fit a [simple linear regression](https://en.wikipedia.org/wiki/Simple_linear_regression) predicting `body_mass_g` from `flipper_length_mm` using [statsmodels](https://www.statsmodels.org/stable/index.html) or [sklearn.linear_model.LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html). Is the relationship statistically significant? (statsmodels reports p-values in its fit summary; scikit-learn's `LinearRegression` does not.)
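
As a rough sketch of the derived-column and grouped-summary tasks above, the block below uses a small `toy` DataFrame with invented values (an assumption made purely so the example is self-contained). The same calls should carry over to the real `df`.

```python
import pandas as pd

# Made-up values for illustration only, NOT real measurements.
toy = pd.DataFrame(
    {
        "species": ["Adelie"] * 4 + ["Gentoo"] * 4,
        "sex": ["male", "female"] * 4,
        "year": [2007, 2007, 2008, 2008] * 2,
        "bill_length_mm": [40.0, 38.0, 41.0, 37.0, 50.0, 45.0, 49.0, 46.0],
        "bill_depth_mm": [20.0, 19.0, 20.5, 18.5, 16.0, 15.0, 15.5, 14.5],
        "flipper_length_mm": [190.0, 186.0, 194.0, 188.0, 220.0, 212.0, 222.0, 214.0],
    }
)

# Derived column: larger values mean a longer bill relative to its depth.
toy["bill_ratio"] = toy["bill_length_mm"] / toy["bill_depth_mm"]

# Average ratio for every (species, sex) combination.
ratio_by_group = toy.groupby(["species", "sex"])["bill_ratio"].mean()
print(ratio_by_group)

# Mean flipper length per year and species. unstack() moves species into
# columns, so each species can become one line in a plot.
flipper_by_year = (
    toy.groupby(["year", "species"])["flipper_length_mm"].mean().unstack("species")
)
print(flipper_by_year)
```

With matplotlib installed, `flipper_by_year.plot(marker="o")` draws one line per species, which is one way to look for year-to-year differences.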

##### **Advanced**
- Let's classify `species` using the numeric features `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, and `body_mass_g`. Start by splitting the data into train and test sets, and then fit a [multinomial logistic regression model](https://en.wikipedia.org/wiki/Multinomial_logistic_regression) (e.g. [sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)). After training, evaluate accuracy and inspect the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix). Which species are easiest/hardest to classify?
- Standardize the numeric features and apply [k-means](https://en.wikipedia.org/wiki/K-means_clustering) or a [Gaussian mixture model](https://en.wikipedia.org/wiki/Model-based_clustering#Gaussian_mixture_model) to cluster the penguins without using `species`. Then evaluate the result by comparing the cluster assignments against `species`. Where do the unsupervised clusters disagree with the biological species, and why might that be?
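
One possible shape for the two advanced tasks, using scikit-learn, is sketched below. To keep the block runnable without the CSV, it generates synthetic data: the group centers and spreads are invented for illustration and are not estimates from the real dataset. The pipeline, the stratified split, and the adjusted Rand index are choices made for this sketch, not requirements of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, adjusted_rand_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]

# Synthetic stand-in data: centers and spreads are INVENTED, not real estimates.
rng = np.random.default_rng(0)
centers = {
    "Adelie": [39.0, 18.0, 190.0, 3700.0],
    "Chinstrap": [49.0, 18.5, 196.0, 3750.0],
    "Gentoo": [47.0, 15.0, 217.0, 5000.0],
}
spread = [2.5, 1.0, 6.0, 400.0]
frames = []
for name, center in centers.items():
    block = pd.DataFrame(rng.normal(center, spread, size=(60, 4)), columns=features)
    block["species"] = name
    frames.append(block)
toy = pd.concat(frames, ignore_index=True)

# With the real data, rows with missing measurements have to be dropped first.
toy = toy.dropna(subset=features + ["species"])
X, y = toy[features], toy["species"]

# Stratify so that each species appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Putting the scaler in the pipeline means it is fit on the training split only.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
labels = sorted(centers)
cm = confusion_matrix(y_test, y_pred, labels=labels)  # rows: true, columns: predicted
print("accuracy:", accuracy)
print(pd.DataFrame(cm, index=labels, columns=labels))

# Unsupervised comparison: cluster without labels, then cross-tabulate.
X_scaled = StandardScaler().fit_transform(X)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
agreement = pd.crosstab(y, clusters, rownames=["species"], colnames=["cluster"])
ari = adjusted_rand_score(y, clusters)
print(agreement)
print("adjusted Rand index:", ari)
```

Cluster numbers are arbitrary, so read the cross-tabulation by looking for one dominant cluster per species rather than matching labels by position. `sklearn.mixture.GaussianMixture` can be swapped in for `KMeans` with the same `fit_predict` call.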

```python
import pandas as pd

df = pd.read_csv("palmerpenguins.csv")
df.head()
```

| Unnamed: 0 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | | | | | | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
