# Palmer Penguins

This notebook contains my analysis of the famous Palmer Penguins dataset. 

![Penguins](https://allisonhorst.github.io/palmerpenguins/logo.png)

The dataset is available [on GitHub](https://allisonhorst.github.io/palmerpenguins/).




## Imports 

***

We use pandas for the DataFrame data structure.

It lets us load and investigate CSV files, among other features.

In [9]:
# Data frames.
import pandas as pd

## Load Data

***

Load the Palmer Penguins dataset from a URL.


In [10]:
# load the penguins data set.
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")


The data is now loaded and we can inspect it. 

In [11]:
# Let's have a look.
df

|     |species|island   |bill_length_mm|bill_depth_mm|flipper_length_mm|body_mass_g|sex   |
|----:|-------|---------|-------------:|------------:|----------------:|----------:|------|
|0    |Adelie |Torgersen|          39.1|         18.7|            181.0|     3750.0|MALE  |
|1    |Adelie |Torgersen|          39.5|         17.4|            186.0|     3800.0|FEMALE|
|2    |Adelie |Torgersen|          40.3|         18.0|            195.0|     3250.0|FEMALE|
|3    |Adelie |Torgersen|           NaN|          NaN|              NaN|        NaN|NaN   |
|4    |Adelie |Torgersen|          36.7|         19.3|            193.0|     3450.0|FEMALE|
|...  |...    |...      |           ...|          ...|              ...|        ...|...   |
|339  |Gentoo |Biscoe   |           NaN|          NaN|              NaN|        NaN|NaN   |
|340  |Gentoo |Biscoe   |          46.8|         14.3|            215.0|     4850.0|FEMALE|
|341  |Gentoo |Biscoe   |          50.4|         15.7|            222.0|     5750.0|MALE  |
|342  |Gentoo |Biscoe   |          45.2|         14.8|            212.0|     5200.0|FEMALE|


## Inspect Data

***

In [12]:
# Look at the first row. 
df.iloc[0]

species                 Adelie
island               Torgersen
bill_length_mm            39.1
bill_depth_mm             18.7
flipper_length_mm        181.0
body_mass_g             3750.0
sex                       MALE
Name: 0, dtype: object

In [13]:
# Sex of each penguin.
df['sex']

0        MALE
1      FEMALE
2      FEMALE
3         NaN
4      FEMALE
        ...  
339       NaN
340    FEMALE
341      MALE
342    FEMALE
343      MALE
Name: sex, Length: 344, dtype: object

In [14]:
# Count the number of Penguins of each sex.
df['sex'].value_counts()

sex
MALE      168
FEMALE    165
Name: count, dtype: int64
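
The two counts above sum to 333, while the DataFrame has 344 rows, so 11 penguins have no recorded sex. `value_counts()` skips missing values by default. As a minimal sketch, the example below shows how the missing entries could be counted as well. It uses a small sample copied from the first five rows shown earlier (the name `sample` is an assumption for illustration); in the notebook, the same calls would be applied to `df`.

```python
import pandas as pd

# Illustrative sample: the 'sex' values of the first five rows shown above.
# In the notebook itself, the full DataFrame `df` would be used instead.
sample = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "sex": ["MALE", "FEMALE", "FEMALE", None, "FEMALE"],
})

# dropna=False keeps missing values as their own entry in the counts.
counts = sample["sex"].value_counts(dropna=False)
print(counts)

# Number of rows with no recorded sex.
n_missing = sample["sex"].isna().sum()
print(n_missing)
```

Keeping the missing entries visible makes it easier to decide later whether to drop those rows or treat them as a separate group.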

In [15]:
# Describe the data set.
df.describe()

|     |bill_length_mm|bill_depth_mm|flipper_length_mm|body_mass_g|
|-----|-------------:|------------:|----------------:|----------:|
|count|         342.0|        342.0|            342.0|      342.0|
|mean |      43.92193|     17.15117|       200.915205|4201.754386|
|std  |      5.459584|     1.974793|        14.061714| 801.954536|
|min  |          32.1|         13.1|            172.0|     2700.0|
|25%  |        39.225|         15.6|            190.0|     3550.0|
|50%  |         44.45|         17.3|            197.0|     4050.0|
|75%  |          48.5|         18.7|            213.0|     4750.0|
|max  |          59.6|         21.5|            231.0|     6300.0|


## Tables

***

|Species    |Bill Length (mm)  |Body Mass (g)    |
|---------- |-----------------:|----------------:|
|Adelie     |              38.8|             3701|
|Chinstrap  |              48.8|             3733|
|Gentoo     |              47.5|             5076|
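
The table does not say how its values were obtained; they look like per-species means of bill length and body mass. Assuming that is the case, a summary of this kind could be sketched with `groupby` as follows. The sample rows are copied from the DataFrame output shown earlier, so the numbers printed here will differ from the table, which would need the full `df`.

```python
import pandas as pd

# Illustrative rows copied from the DataFrame output above (not the full data).
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.1, 39.5, 40.3, 46.8, 50.4, 45.2],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 4850.0, 5750.0, 5200.0],
})

# Mean of each measurement per species, rounded to one decimal place.
summary = (
    sample.groupby("species")[["bill_length_mm", "body_mass_g"]]
    .mean()
    .round(1)
)
print(summary)
```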

## Types of variables that could be used to model the variables in the data set

***

The Palmer Penguins data set is popular in the field of data science and can be used for various types of analysis. It is the basis for this Jupyter notebook. A variable is any attribute or characteristic that can be measured, manipulated, or controlled in statistics and research. Every study analyzes variables, which can describe a person, place, thing, or idea (in this case, penguins). A variable's value can change between groups or over time. There are two major types of variables: quantitative and qualitative. Quantitative variables involve numbers or amounts, whereas qualitative variables are non-numerical values or groupings. In the Palmer Penguins data set, body mass is an example of a quantitative variable, and species is an example of a qualitative variable.

Some common types of variables that could be used to analyze the Palmer Penguin data set include:

1. **Categorical Variables**: 
In statistics, a categorical variable (also called qualitative variable) is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property.[1]

   - Species: There are three different species of penguins in the dataset - Adelie, Chinstrap, and Gentoo.
   - Island: The penguins were observed on three different islands - Biscoe, Dream, and Torgersen.

   A suggestion for analysis within this category is to check whether penguins of the same species living in different environments (i.e. islands) differ in body mass or flipper length. The characteristics of each species can also be compared against each other.

2. **Numerical Variables**:
Numeric variables have values that describe a measurable quantity as a number, like 'how many' or 'how much'. Therefore numeric variables are quantitative variables. Numeric variables may be further described as either continuous or discrete: a continuous variable can take any value within a range, whereas a discrete variable takes only distinct, countable values.[2]

   - Numerical Measurements: Various measurements such as bill length, bill depth, flipper length, and body mass can be used as numerical variables for analysis.
   - Date: The date on which the observations were made can also be a numerical variable for time series analysis.
   
   A suggestion for analysis is to compare body mass at different times of the year (i.e. by date) to see whether body mass has a seasonal component and to investigate any related trends. Analysing body measurements such as mass and flipper length could reveal patterns or relationships, if any exist.

3. **Ordinal Variables**:
Ordinal data is a categorical, statistical data type where the variables have natural, ordered categories and the distances between the categories are not known.[3]

   - Clutch Completion: This variable indicates whether the penguin completed its clutch or not, and can be considered an ordinal variable.
   - Stage: The developmental stage of the penguin (Adult, Chick, or Juvenile) can also be treated as an ordinal variable.
   
   By analysing the completion of the clutch, success in reproduction can be assessed. If compared over years, an upward, downward, or stagnant trend could be identified. This would help identify potential issues with the survival of the species.


4. **Boolean Variables**:
A Boolean data type is a form of data with only two possible values (usually "true" and "false"); in this case, Male or Female.[4]

   - Sex: The sex of the penguin (Male or Female) can be represented as a boolean variable.
   
   Many different variables can be assessed using sex. Males can be analysed in comparison to females in relation to mass and body composition.  


5. **Derived Variables**:
A derived variable is one that is derived from two (or more) primary variables. Percentages, ratios, indices and rates are all derived variables. Typically they will be decimal, integer, or alphanumeric types.[5]

   - Body Mass Index (BMI): Calculated based on the body mass and flipper length, this can be a derived variable for further analysis.
   - Body Proportions: Ratios of different body measurements can be derived and used for analysis.

   Assessing a penguin's BMI or mass can provide additional insight into its overall health or body condition.

6. **Geospatial Variables**:
Geospatial data is information that describes objects, events or other features with a location on or near the surface of the earth. Geospatial data typically combines location information (usually coordinates on the earth) and attribute information (the characteristics of the object, event or phenomena concerned) with temporal information (the time or life span at which the location and attributes exist).[6]

   - Latitude and Longitude: The geographic coordinates of the islands where the penguins were observed can be used as geospatial variables to study any spatial patterns or distribution of penguin populations.

7. **Temporal Variables**:

   - Time of Day: The time of day when observations were made can be used as a temporal variable to investigate any diurnal patterns in penguin behavior or activity levels.

These are just a few examples of the types of variables that could be used to analyze the Palmer Penguins data set. The choice of variables depends on the specific research questions or analysis objectives that researchers have in mind. The questions being asked dictate which variables are used and how they are represented.
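
As a rough illustration of how some of the variable types listed above could be represented in pandas, the sketch below uses four rows copied from the output earlier in this notebook. The column names `is_male`, `bill_ratio`, and `mass_class`, and the 4000 g and 5000 g thresholds, are assumptions made for this example and are not part of the dataset.

```python
import pandas as pd

# Four rows copied from the DataFrame output shown earlier in the notebook.
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Torgersen", "Torgersen", "Biscoe", "Biscoe"],
    "bill_length_mm": [39.1, 39.5, 46.8, 50.4],
    "bill_depth_mm": [18.7, 17.4, 14.3, 15.7],
    "flipper_length_mm": [181.0, 186.0, 215.0, 222.0],
    "body_mass_g": [3750.0, 3800.0, 4850.0, 5750.0],
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
})

# Categorical: store species using the pandas category dtype.
sample["species"] = sample["species"].astype("category")

# Numerical summarised by a categorical: mean body mass per species.
mass_by_species = sample.groupby("species", observed=True)["body_mass_g"].mean()

# Boolean: encode sex as True/False (hypothetical column name).
sample["is_male"] = sample["sex"].map({"MALE": True, "FEMALE": False})

# Derived: ratio of bill length to bill depth (hypothetical column name).
sample["bill_ratio"] = sample["bill_length_mm"] / sample["bill_depth_mm"]

# Ordinal: bin body mass into ordered classes (thresholds are arbitrary).
sample["mass_class"] = pd.cut(
    sample["body_mass_g"],
    bins=[0, 4000, 5000, float("inf")],
    labels=["light", "medium", "heavy"],
)

print(mass_by_species)
print(sample[["sex", "is_male", "bill_ratio", "mass_class"]])
```

The full data set contains missing values, so on `df` the Boolean mapping would leave those rows as missing rather than `True` or `False`.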

## References:

***
1. (Wikipedia). Yates, Daniel S.; Moore, David S.; Starnes, Daren S. (2003). The Practice of Statistics (2nd ed.). New York: Freeman. ISBN 978-0-7167-4773-4. Archived from the original on 2005-02-09.
2. https://www.abs.gov.au/statistics/understanding-statistics/statistical-terms-and-concepts/variables#:~:text=Numeric%20variables%20have%20values%20that,variable%20is%20a%20numeric%20variable.
3. (Wikipedia). Agresti, Alan (2013). Categorical Data Analysis (3rd ed.). Hoboken, New Jersey: John Wiley & Sons. ISBN 978-0-470-46363-5.
4. https://en.wikipedia.org/wiki/Boolean
5. https://sites.google.com/view/ochrewiki/categories/taxonomy/derived-variables
6. https://www.ibm.com/topics/geospatial-data


## Bar chart of an appropriate variable 

***
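
This section has no content yet. As one possible sketch, a bar chart suits a categorical variable such as species. The example below assumes matplotlib is installed and uses a small made-up list of species labels (the name `sample` and its contents are illustrative only); in the notebook, `df["species"]` would be used instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative species labels only; replace with df["species"] in the notebook.
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo"],
})

# Count how many penguins belong to each species.
counts = sample["species"].value_counts()

# One bar per species, with bar height equal to the count.
fig, ax = plt.subplots()
ax.bar(counts.index.tolist(), counts.values)
ax.set_xlabel("Species")
ax.set_ylabel("Number of penguins")
ax.set_title("Penguins per species")
```

In Jupyter the figure is normally rendered when the cell finishes; in a plain script, `plt.show()` would be needed.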

## Histogram of an appropriate variable

***
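
This section has no content yet. As one possible sketch, a histogram suits a continuous variable such as body mass. The example below assumes matplotlib is installed and uses the body mass values visible in the DataFrame output earlier, including the two missing entries; the choice of five bins is arbitrary. In the notebook, `df["body_mass_g"]` would be used instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Body mass values visible in the output above; None marks the missing rows.
body_mass = pd.Series(
    [3750.0, 3800.0, 3250.0, None, 3450.0, None, 4850.0, 5750.0, 5200.0],
    name="body_mass_g",
)

# Drop missing values before plotting; the bin count is an arbitrary choice.
fig, ax = plt.subplots()
n, bins, patches = ax.hist(body_mass.dropna(), bins=5)
ax.set_xlabel("Body mass (g)")
ax.set_ylabel("Number of penguins")
ax.set_title("Distribution of body mass")
```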

## A correlation of two variables

***
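
This section has no content yet. As one possible sketch, the Pearson correlation between two numerical variables can be computed with `Series.corr`, which ignores rows where either value is missing. The pair chosen here, flipper length and body mass, is an assumption; the values are copied from the DataFrame output earlier, and in the notebook the same call would be made on the columns of `df`.

```python
import pandas as pd

# Values copied from the output above; None marks the rows with missing data.
sample = pd.DataFrame({
    "flipper_length_mm": [181.0, 186.0, 195.0, None, 193.0, 215.0, 222.0, 212.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, None, 3450.0, 4850.0, 5750.0, 5200.0],
})

# Pearson correlation coefficient (the default method for Series.corr).
r = sample["flipper_length_mm"].corr(sample["body_mass_g"])
print(round(r, 3))
```

A value close to 1 indicates that heavier penguins tend to have longer flippers, while a value near 0 would indicate no linear relationship.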

***

### End