# S2DS technical assessment

---



# Background
This exercise uses the [Palmer penguins](https://allisonhorst.github.io/palmerpenguins/articles/art.html) dataset. It contains two tables: `penguins_raw.csv` holds the raw data, and `penguins.csv` is a curated subset of it. You will need both for this exercise. A description of the dataset, available [here](https://education.rstudio.com/blog/2020/07/palmerpenguins-cran/), includes the following background information:

> The `palmerpenguins` data contains size measurements, clutch observations, and blood isotope ratios for three penguin species observed on three islands in the Palmer Archipelago, Antarctica over a study period of three years. These data were collected from 2007 - 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data were imported directly from the Environmental Data Initiative (EDI) Data Portal, and are available for use by CC0 license (“No Rights Reserved”) in accordance with the Palmer Station Data Policy. We gratefully acknowledge Palmer Station LTER and the US LTER Network. Special thanks to Marty Downs (Director, LTER Network Office) for help regarding the data license & use. 

This exercise will help Pivigo understand your coding and technical skills. It is designed to be difficult to complete in the time allowed, so don't be discouraged if you can't complete all of the questions. 

**Please take the time to read each question carefully!** 

Good luck!

# Setup
Import libraries and load data.

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# penguins = pd.read_csv(r"C:\Users\Laura\Desktop\S2DS_technical2\data\penguins.csv")

# penguins_raw = pd.read_csv(r"C:\Users\Laura\Desktop\S2DS_technical2\data\penguins_raw.csv")

#penguins.head(10)
#penguins_raw.head(10)


!pwd
path_penguins = "data/penguins.csv"
path_penguins_raw = "data/penguins_raw.csv"

penguins = pd.read_csv(path_penguins)
penguins_raw = pd.read_csv(path_penguins_raw)

/Users/jon/repos/work_repos/s2ds/individual_mentoring/S2DS_technical_assessment


In [6]:
penguins.head()

| Unnamed: 0 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen |  |  |  |  |  | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |


## Question 01
How many columns are in the `penguins_raw.csv` dataset?

In [7]:
penguins_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   studyName            344 non-null    object 
 1   Sample Number        344 non-null    int64  
 2   Species              344 non-null    object 
 3   Region               344 non-null    object 
 4   Island               344 non-null    object 
 5   Stage                344 non-null    object 
 6   Individual ID        344 non-null    object 
 7   Clutch Completion    344 non-null    object 
 8   Date Egg             344 non-null    object 
 9   Culmen Length (mm)   342 non-null    float64
 10  Culmen Depth (mm)    342 non-null    float64
 11  Flipper Length (mm)  342 non-null    float64
 12  Body Mass (g)        342 non-null    float64
 13  Sex                  333 non-null    object 
 14  Delta 15 N (o/oo)    330 non-null    float64
 15  Delta 13 C (o/oo)    331 non-null    flo

**Answer:**

17 columns
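
`info()` prints the column count, but `DataFrame.shape` returns it as a number you can reuse in later cells. A minimal sketch, assuming a made-up three-column frame (`toy_raw` is a hypothetical stand-in for `penguins_raw`, and its values are invented):

```python
import pandas as pd

# Made-up stand-in for penguins_raw; only the shape matters here.
toy_raw = pd.DataFrame({
    "studyName": ["study_a", "study_a"],
    "Island": ["Torgersen", "Biscoe"],
    "Body Mass (g)": [3750.0, 3800.0],
})

# shape is (n_rows, n_columns), so index 1 is the column count.
n_columns = toy_raw.shape[1]
print(n_columns)
```

In the notebook the equivalent call would be `penguins_raw.shape[1]`.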

***

## Question 02
How many distinct islands are represented in the dataset?

In [8]:
penguins_raw["Island"].unique()

array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)

Your approach isn't wrong, but if you wanted to work with the answer to this question programmatically, I would code that in as I've done below. This becomes important when you want to work with this number in the future, for example in Q4 below. I see you did take this approach there, so this is probably a moot point!

In [10]:
len(penguins_raw["Island"].unique())

3
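
A slightly shorter option is `Series.nunique()`, which returns the count directly and skips missing values by default, whereas `unique()` keeps a missing value as its own entry. A small sketch on made-up values (`toy_islands` is a hypothetical stand-in for the `Island` column):

```python
import pandas as pd

# Made-up stand-in for penguins_raw["Island"], with one missing entry.
toy_islands = pd.Series(["Torgersen", "Biscoe", "Dream", "Biscoe", None])

# nunique() ignores missing values unless dropna=False is passed.
n_islands = toy_islands.nunique()

# unique() returns the missing value too, so its length is one larger here.
n_including_missing = len(toy_islands.unique())

print(n_islands, n_including_missing)
```

The `Island` column has no missing values according to `info()`, so both approaches should agree on the real data.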

**Answer:**

3 islands: Torgersen, Biscoe and Dream

***

## Question 03
How many Chinstrap Penguins are found on Dream Island in the `penguins.csv` dataset?


In [5]:
#penguins.head()

chinstrap_penguins = penguins[penguins["species"] == "Chinstrap"]
#chinstrap_penguins.info()
dream_island_chinstrap = chinstrap_penguins[chinstrap_penguins["island"] == "Dream"]
dream_island_chinstrap.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 68 entries, 276 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            68 non-null     object 
 1   island             68 non-null     object 
 2   bill_length_mm     68 non-null     float64
 3   bill_depth_mm      68 non-null     float64
 4   flipper_length_mm  68 non-null     float64
 5   body_mass_g        68 non-null     float64
 6   sex                68 non-null     object 
 7   year               68 non-null     int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 4.0+ KB


**Answer:**

68 penguins
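
Reading the count off `info()` works, but both filters can be combined into one boolean mask and summed, which gives the number directly. A minimal sketch on a made-up frame (`toy` is a hypothetical stand-in for `penguins`):

```python
import pandas as pd

# Made-up stand-in for penguins with just the two columns used in the filter.
toy = pd.DataFrame({
    "species": ["Chinstrap", "Chinstrap", "Adelie", "Gentoo"],
    "island": ["Dream", "Dream", "Dream", "Biscoe"],
})

# Combine both conditions with &; each side needs its own parentheses.
mask = (toy["species"] == "Chinstrap") & (toy["island"] == "Dream")

# True counts as 1, so summing the mask counts the matching rows.
n_chinstrap_dream = int(mask.sum())
print(n_chinstrap_dream)
```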

***

## Question 04
What is the date of the first recording (column `Date Egg`) in the `penguins_raw.csv`?

In [6]:
penguins_raw["Date Egg"] = pd.to_datetime(penguins_raw["Date Egg"])

first_date = penguins_raw["Date Egg"][0]

first_date

Timestamp('2007-11-11 00:00:00')

**Answer:**

date of first recording is 2007-11-11 
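
One thing to check: `penguins_raw["Date Egg"][0]` returns the date in the first row, which is the earliest recording only if the file is sorted chronologically. If "first recording" means the earliest date, `.min()` does not depend on row order. A sketch on made-up, deliberately unsorted dates (`toy_dates` is a hypothetical stand-in for the `Date Egg` column):

```python
import pandas as pd

# Made-up dates, not in chronological order.
toy_dates = pd.to_datetime(pd.Series(["2008-11-04", "2007-11-16", "2009-11-21"]))

first_row = toy_dates[0]    # date stored in row 0
earliest = toy_dates.min()  # earliest date, whatever the row order

print(first_row, earliest)
```

It is worth running `penguins_raw["Date Egg"].min()` on the real column to confirm whether it matches the answer above, since Question 05 depends on it.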

***

## Question 05
What day of week was this?

In [7]:
first_date.day_name()

'Sunday'

**Answer:**

it was Sunday

***

## Question 06
What is the median body mass of male Adelie penguins in the `penguins.csv` dataset?

In [8]:

adelie = penguins[penguins["species"] == "Adelie"]
adelie_males = adelie[adelie["sex"] == "male"]

median_male = adelie_males["body_mass_g"].median()

median_male
#penguins.head()

4000.0

**Answer:**

4000.0 g

***

## Question 07
What is the mean body mass of male Adelie penguins in the `penguins.csv` dataset (rounded to the nearest gram)?

In [9]:
mean_male = adelie_males["body_mass_g"].mean().round()

mean_male

4043.0

**Answer:**

4043.0 g
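
Questions 06 and 07 can also be answered in one step with `groupby`, which gives the median and mean for every species and sex combination at once. A minimal sketch on made-up values (`toy` is a hypothetical stand-in for `penguins`):

```python
import pandas as pd

# Made-up stand-in for penguins; one row has a missing sex.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    "sex": ["male", "male", "female", "male", None],
    "body_mass_g": [4000.0, 4100.0, 3400.0, 5500.0, 5000.0],
})

# groupby drops rows whose group key is missing (dropna=True by default).
summary = (
    toy.groupby(["species", "sex"])["body_mass_g"]
    .agg(["median", "mean"])
    .round()
)

# Select one group with a (species, sex) tuple.
adelie_male = summary.loc[("Adelie", "male")]
print(summary)
```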

***

## Question 08
Please show how you would create a plot showing the differences in the distributions of flipper length (`penguins.csv`) between males and females and **across species**.

In [37]:
male = penguins[penguins["sex"] == "male"]
female = penguins[penguins["sex"] == "female"]

#N = 3

#blue_bar = male["flipper_length_mm"]

#orange_bar = female["flipper_length_mm"]

#ind = np.arange(N)

#plt.figure(figsize=(10,5))

#width = 0.1

#plt.bar(ind, blue_bar , width, label='Male')
#plt.bar(ind + width, orange_bar, width, label='Female Penguins')

#plt.xlabel('Species')
#plt.ylabel('Flipper length (mm)')
#plt.title('Flipper Length Between Male and Female Penguins and Across Species')

# xticks()
#plt.xticks(ind + width / 2, penguins["species"].unique())

#plt.legend(loc='best')
#plt.show()

penguins.dtypes

species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
year                   int64
dtype: object
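
The commented-out bar chart would show one value per group, while the question asks about distributions, so a box plot (or histogram or violin plot) per group is a closer fit. One possible approach, sketched with matplotlib on made-up data: one panel per species, with one box per sex. The frame `demo` and its values are invented stand-ins that reuse the column names of `penguins.csv`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in with the same column names as penguins.csv, so the
# sketch runs without the data file. In the notebook, use `penguins` instead.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], 40),
    "sex": np.tile(["male", "female"], 60),
    "flipper_length_mm": rng.normal(200, 10, 120),
})

# Drop rows where sex or the measurement is missing before grouping.
clean = demo.dropna(subset=["sex", "flipper_length_mm"])

species = sorted(clean["species"].unique())
sexes = ["female", "male"]

# One panel per species, sharing the y-axis so panels are comparable.
fig, axes = plt.subplots(1, len(species), sharey=True, figsize=(10, 4))
for ax, sp in zip(axes, species):
    subset = clean[clean["species"] == sp]
    groups = [
        subset.loc[subset["sex"] == s, "flipper_length_mm"].to_numpy()
        for s in sexes
    ]
    ax.boxplot(groups)      # boxes are drawn at positions 1 and 2
    ax.set_xticks([1, 2])
    ax.set_xticklabels(sexes)
    ax.set_title(sp)
axes[0].set_ylabel("Flipper length (mm)")
fig.suptitle("Flipper length by sex and species")
```

Sharing the y-axis is a design choice: it makes differences between species visible at a glance, at the cost of some resolution within each panel.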

## Question 09
Please write a function that generalises the approach you used in Question 08 to show the distribution of any variable between males and females across species. The function should take one parameter as input, the variable name as a string.
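
One way such a function could look, as a sketch rather than a model answer: it takes the column name as its only parameter, reads the module-level `penguins` frame, and draws one box-plot panel per species with one box per sex. The function name `plot_by_sex_and_species` is hypothetical, and the `penguins` frame defined in the block is made-up stand-in data so that the example runs on its own.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in so the sketch runs on its own; in the notebook, skip this
# assignment and let the function use the `penguins` frame loaded in Setup.
rng = np.random.default_rng(1)
penguins = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], 40),
    "sex": np.tile(["male", "female"], 60),
    "flipper_length_mm": rng.normal(200, 10, 120),
    "body_mass_g": rng.normal(4200, 500, 120),
})


def plot_by_sex_and_species(variable):
    """Box plots of `variable` for each sex, one panel per species."""
    if variable not in penguins.columns:
        raise ValueError(f"{variable!r} is not a column of penguins")
    # Rows with a missing species, sex or measurement cannot be plotted.
    clean = penguins.dropna(subset=["species", "sex", variable])
    species = sorted(clean["species"].unique())
    sexes = sorted(clean["sex"].unique())
    # squeeze=False keeps `axes` two-dimensional even with a single species.
    fig, axes = plt.subplots(1, len(species), sharey=True, squeeze=False,
                             figsize=(4 * len(species), 4))
    for ax, sp in zip(axes[0], species):
        subset = clean[clean["species"] == sp]
        ax.boxplot([
            subset.loc[subset["sex"] == s, variable].to_numpy() for s in sexes
        ])
        ax.set_xticks(range(1, len(sexes) + 1))
        ax.set_xticklabels(sexes)
        ax.set_title(sp)
    axes[0, 0].set_ylabel(variable)
    fig.suptitle(f"{variable} by sex and species")
    return fig


fig = plot_by_sex_and_species("body_mass_g")
```

Returning the figure lets the caller save or adjust it; in a notebook the plot still renders at the end of the cell.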

## Question 10
One might predict that as penguin bills get longer, they would also increase in depth. What is the correlation between these two variables in the `penguins.csv` dataset (if you remove missing values)? Please round to two decimal places.
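
A possible way to compute this, sketched on made-up numbers: drop the rows with a missing value in either column, then use `Series.corr`, which computes the Pearson correlation by default. The helper name `bill_correlation` and the frame `toy` are hypothetical; no result for the real dataset is implied.

```python
import pandas as pd


def bill_correlation(df):
    """Pearson correlation of bill length and depth, rounded to 2 places."""
    # Keep only rows where both measurements are present.
    bills = df[["bill_length_mm", "bill_depth_mm"]].dropna()
    return round(bills["bill_length_mm"].corr(bills["bill_depth_mm"]), 2)


# Made-up values: depth is exactly twice the length, plus one incomplete row.
toy = pd.DataFrame({
    "bill_length_mm": [1.0, 2.0, 3.0, None],
    "bill_depth_mm": [2.0, 4.0, 6.0, 8.0],
})
toy_corr = bill_correlation(toy)
print(toy_corr)
```

In the notebook the call would be `bill_correlation(penguins)`.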

**Answer:**

***

## Question 11
The hypothesis posed in Question 10 appears to be incorrect. This is verified by the scatter plot below:

![](data/q10.png)

Please show us ways you might investigate this further. (Hint: Consider other variables that might be confounding the observed effect.)
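
One direction to explore, assuming species is the confounder worth checking first: compute the correlation within each species and compare it with the pooled value. The sketch below uses made-up numbers built so that the pooled correlation is negative while each group's correlation is positive; it illustrates the technique and says nothing about what the real data shows. The frame `toy` and the species labels `A` and `B` are hypothetical.

```python
import pandas as pd

# Made-up data: within each group depth rises with length, but group B has
# longer and shallower bills overall, which flips the pooled trend.
toy = pd.DataFrame({
    "species": ["A"] * 3 + ["B"] * 3,
    "bill_length_mm": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
    "bill_depth_mm": [10.0, 11.0, 12.0, 1.0, 2.0, 3.0],
})
cols = ["bill_length_mm", "bill_depth_mm"]

# Correlation over all rows, ignoring species.
pooled = toy["bill_length_mm"].corr(toy["bill_depth_mm"])

# Correlation computed separately inside each species.
per_species = (
    toy.dropna(subset=cols)
    .groupby("species")[cols]
    .apply(lambda g: g["bill_length_mm"].corr(g["bill_depth_mm"]))
)

print(round(pooled, 2))
print(per_species.round(2))
```

Running the same `groupby` on `penguins`, and colouring the scatter plot by species, would show whether the overall trend holds within each species. Sex and island could be checked the same way by changing the grouping column.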