# Fundamentals of Data Analysis Tasks

**Andrea Cignoni**

***

# Task 1

> The Collatz conjecture is a famous unsolved problem in mathematics. The problem is to prove that if you start with any positive integer $x$ and repeatedly apply the function $f(x)$ below, you always get stuck in the repeating sequence 1, 4, 2, 1, 4, 2, . . .

In [None]:
# This function defines the calculation applied to the input number 'x'
def f(x):
    if x % 2 == 0:  # Check whether the given integer is even
        return x // 2  # If it is even, it is divided by two
    else:
        return (3 * x) + 1  # If it is odd, it is multiplied by 3 and then 1 is added

In [None]:
# This function prints the sequence of integers generated by repeatedly applying the 'f' function
def collatz(x):
    while x != 1:  # The loop stops once the sequence reaches 1
        print(x, end=', ')
        x = f(x)  # Update 'x' by applying the function f defined above
    print(x)

In [None]:
collatz(1000)
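
As a complement to the printing function above, the following is a minimal sketch that collects the sequence in a list, so that it can be inspected or measured instead of only displayed. The name `collatz_sequence` and the input check are assumptions introduced here for illustration; they are not part of the original task. The function `f` is repeated so that the block is self-contained.

```python
def f(x):
    """Apply one Collatz step to the positive integer x."""
    if x % 2 == 0:
        return x // 2
    return 3 * x + 1


def collatz_sequence(x):
    """Return the Collatz sequence starting at x and ending at the first 1.

    Hypothetical helper, not part of the original notebook.
    """
    if x < 1:
        raise ValueError("x must be a positive integer")
    sequence = [x]
    while x != 1:
        x = f(x)
        sequence.append(x)
    return sequence


print(collatz_sequence(6))  # [6, 3, 10, 5, 16, 8, 4, 2, 1]
```

Returning a list rather than printing makes it easy to check, for example, how many steps a given starting value needs before reaching 1.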

***

# Task 2

> Give an overview of the famous penguins data set, explaining the types of variables it contains. Suggest the types of variables that should be used to model them in Python, explaining your rationale.

<div align="center">

# **PENGUINS DATA SET**
</div>

<div align="justify">
The Palmer penguins dataset is a collection of data about penguins in the Palmer Archipelago, Antarctica. The data were collected between 2007 and 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network, and cover 344 penguins living on three islands (Torgersen, Biscoe, and Dream).<br/>

The variables recorded are the following:

- Island name (Dream, Torgersen, or Biscoe);
- Species name (Adelie, Chinstrap, or Gentoo);
- Bill length (mm);
- Bill depth (mm);
- Flipper length (mm);
- Body mass (g);
- Sex.
</div>


<div align="center">

![Screenshot](Penguins.png) ![Screenshot](Penguins1.png)
</div>

<div align="justify">

**LOADING THE FILE AND LIBRARIES IMPORTED**<br/>

The Palmer penguins dataset was downloaded from the mwaskom/seaborn-data repository on [GitHub](https://github.com/mwaskom/). Since the file follows the standard CSV layout (a comma separates individual items and each record is on a new line), I opened it as a CSV file in my Python repository.

Before loading the data into a data frame, I imported the modules needed to analyse the variables in the file: Pandas, Matplotlib, NumPy and Seaborn.<br/>

- Pandas

Pandas is an open source Python package used for data science, data analysis and machine learning tasks. Common operations performed with Pandas include data cleansing, data fill, data normalization, merges and joins, data visualization, statistical analysis, data inspection, and loading and saving data. Here, I rely on Pandas for indexing the data frame, manipulating it and extracting sorted information from specified columns and rows. My main source for its usage is pandas.pydata.org.

- Matplotlib

Matplotlib is a Python library used to create 2D graphs and plots through scripts. The pyplot module is especially useful for controlling line styles, font properties, axis formatting and so on. It supports a wide variety of graphs and plots, such as histograms, bar charts, power spectra and error charts.

- Numpy

NumPy is used along with Matplotlib to provide an environment for effective and fast numeric computing. It furnishes a multidimensional array object and various derived objects (such as masked arrays and matrices).

- Seaborn

Lastly, in order to display a clear graphic overview of the whole dataset through pair plots, I imported Seaborn. This library is built on top of Matplotlib and supports the kind of exploratory analysis best suited to showing the results of my investigation.

</div>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Extracting penguin dataset's 7 variables for data representation
# File downloaded from (https://github.com/mwaskom/seaborn-data/blob/master/penguins.csv)

data = pd.read_csv('penguins.csv')

<div align="justify">

**DATA DESCRIPTION**

In order to give an overview of the data contained in the dataset, I have used a number of functions provided by Pandas; my point of reference is [realpython.com](https://realpython.com/pandas-dataframe/). As already pointed out, each penguin observation is described by 7 variables.
</div>

In [2]:
print("\n","Below are shown the first and the last five rows of the dataset")

print(data)


 Below are shown the first and the last five rows of the dataset
    species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0    Adelie  Torgersen            39.1           18.7              181.0   
1    Adelie  Torgersen            39.5           17.4              186.0   
2    Adelie  Torgersen            40.3           18.0              195.0   
3    Adelie  Torgersen             NaN            NaN                NaN   
4    Adelie  Torgersen            36.7           19.3              193.0   
..      ...        ...             ...            ...                ...   
339  Gentoo     Biscoe             NaN            NaN                NaN   
340  Gentoo     Biscoe            46.8           14.3              215.0   
341  Gentoo     Biscoe            50.4           15.7              222.0   
342  Gentoo     Biscoe            45.2           14.8              212.0   
343  Gentoo     Biscoe            49.9           16.1              213.0   

     body_mass_g     

In [3]:
# Check the DataFrame structure
print("\n","The following is the summary of the dataframe, including information about the column names and their corresponding data types:","\n")

print(data.info())


 This is the summary of the dataframe, including information about the column names, data types, and non-null values: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
None


<div align="justify">

**As shown by the above overview, the Palmer Penguins dataset includes two types of data: *strings* (stored by Pandas as `object`) for the "species", "island" and "sex" columns and *float64* for the "bill length", "bill depth", "flipper length" and "body mass" columns, also identifiable as 3 categorical variables and 4 numerical variables.**

</div>

<div align="justify">

**INFO ON EACH INDIVIDUAL VARIABLE**

I am using the following Pandas functions to give a detailed description of each individual variable. My point of reference for the following data manipulation is [SparkByExamples.com](https://sparkbyexamples.com/pandas/get-pandas-dataframe-shape/).

</div>

In [5]:
# Number of observations taken for each species
print("\n","Number of samples for each species:","\n")

# Examining the species variable with Pandas value_counts function

print(data['species'].value_counts())


 Number of samples for each class: 

Adelie       152
Gentoo       124
Chinstrap     68
Name: species, dtype: int64


In [None]:
print("\n","Levels of the variable island:","\n")

# Using the Pandas unique() function to show the different islands where the penguin observations were taken

island_levels = data['island'].unique()

# Print the levels of "island"
print(island_levels)

In [18]:
print("\n","Population of penguins species living on each island:","\n")

# Using the Pandas groupby function to show how the penguin population is spread across the three islands where the observations were taken

# size() counts the number of rows in each (island, species) group
# unstack() pivots the inner index level, "species", into columns; combinations that never occur appear as NaN
species_counts = data.groupby(['island', 'species']).size().unstack()
print(species_counts)

print("\n","Gender distribution of the whole penguins population observed:","\n")

# Count the number of penguins in each gender through the "value_counts" Pandas function
gender_counts = data['sex'].value_counts()

print(gender_counts)

print("\n","Gender distribution for each species on each island:","\n")

# Group the data by species and gender and count the occurrences with size();
# reset_index(name='count') turns the group labels back into columns and names the counts column 'count'
gender_distribution = data.groupby(['species', 'sex']).size().reset_index(name='count')

# Pivot the table to get the gender distribution as columns
gender_distribution = gender_distribution.pivot(index='species', columns='sex', values='count')

# Display the gender distribution by species
print(gender_distribution)


 Population of penguins species living on each island: 

species    Adelie  Chinstrap  Gentoo
island                              
Biscoe       44.0        NaN   124.0
Dream        56.0       68.0     NaN
Torgersen    52.0        NaN     NaN

 Gender distribution of the whole penguins population observed: 

MALE      168
FEMALE    165
Name: sex, dtype: int64

 Gender distribution for each species on each island: 

sex        FEMALE  MALE
species                
Adelie         73    73
Chinstrap      34    34
Gentoo         58    61


<div align="justify">

From the above reports it is clear that the *Adelie* species is present on all three islands, while the *Chinstrap* and *Gentoo* species are found only on the *Dream* and *Biscoe* islands respectively. As for the gender distribution, the sample contains an almost equal number of males (168) and females (165), and the two sexes are also evenly balanced within each species.

</div>
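
The `NaN` entries in the island and species table above mark combinations that never occur in the data. As a hedged alternative sketch, `pandas.crosstab` builds the same kind of table but reports absent combinations as 0, which keeps the counts as integers. The small data frame below is a made-up stand-in (a handful of hypothetical rows, not the real 344 observations) so that the block runs without `penguins.csv`.

```python
import pandas as pd

# Toy stand-in for the penguins data frame: hypothetical rows, for illustration only
toy = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Dream", "Dream", "Dream", "Torgersen"],
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Chinstrap", "Adelie"],
})

# Cross-tabulate island against species; absent combinations are reported as 0
counts = pd.crosstab(toy["island"], toy["species"])
print(counts)

# Number of islands on which each species appears at least once
islands_per_species = (counts > 0).sum()
print(islands_per_species)
```

With the real data frame, the same two calls would take `data["island"]` and `data["species"]` as arguments.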

In [12]:
print("\n","Displaying the mean of the bill length, bill depth, flipper length and body mass for each penguin species:","\n")

# Calculate the mean of each measure for each subset of the species variable with Pandas groupby and mean functions
mean_lengths = data.groupby('species')['bill_length_mm'].mean()

# Print the mean bill length for each subset
print(mean_lengths)

# Calculate the mean bill depth for each subset
mean_depths = data.groupby('species')['bill_depth_mm'].mean()

# Print the mean bill depth for each subset
print(mean_depths)

# Calculate the mean flipper length for each subset
mean_flipper_lengths = data.groupby('species')['flipper_length_mm'].mean()

# Print the mean flipper length for each subset
print(mean_flipper_lengths)

# Calculate the mean body mass for each subset
mean_body_mass = data.groupby('species')['body_mass_g'].mean()

# Print the mean body mass for each subset
print(mean_body_mass)


 Displaying the mean of the bill length, bill depth, flipper length and body mass for each penguin species: 

species
Adelie       38.791391
Chinstrap    48.833824
Gentoo       47.504878
Name: bill_length_mm, dtype: float64
species
Adelie       18.346358
Chinstrap    18.420588
Gentoo       14.982114
Name: bill_depth_mm, dtype: float64
species
Adelie       189.953642
Chinstrap    195.823529
Gentoo       217.186992
Name: flipper_length_mm, dtype: float64
species
Adelie       3700.662252
Chinstrap    3733.088235
Gentoo       5076.016260
Name: body_mass_g, dtype: float64


<div align="justify">

On average, the *Chinstrap* has the longest and deepest bill of the three species, although its mean bill depth (18.42 mm) is only marginally greater than that of the *Adelie* (18.35 mm). The *Gentoo* is by far the heaviest species and also has the longest flippers.

</div>
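
The four separate `groupby` calls above could also be condensed into a single one, as in the sketch below: selecting a list of columns before calling `mean()` returns one table with a row per species. The data frame is a made-up stand-in with hypothetical values and only two of the four measurements, so the numbers printed are not the real means.

```python
import pandas as pd

# Toy stand-in for the penguins data frame: hypothetical values, for illustration only
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.0, 38.0, 49.0, 48.0, 47.0, 48.0],
    "body_mass_g": [3700.0, 3600.0, 3700.0, 3800.0, 5000.0, 5200.0],
})

measurements = ["bill_length_mm", "body_mass_g"]

# One call returns a table with one row per species and one column per measurement
means = toy.groupby("species")[measurements].mean()
print(means)

# idxmax() reports, for each measurement, the species with the largest mean
largest = means.idxmax()
print(largest)
```

Note that `mean()` skips missing values by default, which matters for the real dataset because a few rows have no measurements recorded.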

<div align="justify">

Once the kind of data being analysed has been established, I can determine the graphical representation techniques best suited to comparing and examining them. My main point of reference for this topic is the [Towards Data Science](https://towardsdatascience.com/penguins-dataset-overview-iris-alternative-9453bb8c8d95) website.

</div>

<div align="justify">

**TYPES OF VARIABLES IN THE DATASET**


Variables can be *dependent* or *independent*, and this basic distinction depends on their roles in an experiment or in a statistical study. These two classes can be further characterised as *categorical* or *continuous*.

1. *Dependent variable*: A dependent variable is a variable whose outcome or result depends on, or is influenced by, changes in its causes (called the independent variable) ==> **The dependent variable is often plotted on the y-axis of a graph**

2. *Independent variable*: The independent variable is the variable that is intentionally changed or manipulated to observe its effect on the dependent variable ==> **The independent variable is often plotted on the x-axis of a graph**

For instance, a test mark can be considered a dependent variable, while the many factors that influence it, such as the hours students spent preparing for the exam or their IQ level, can be considered independent variables: the scores are expected to differ if these causes are modified. In data analysis, examining how the dependent variables vary when the independent variables are manipulated can uncover insights, and it is essential for drawing meaningful conclusions about cause-and-effect relationships. When exploring the relationship between two variables, it is common to represent the independent variable (the one suspected of influencing the other) on the x-axis and the dependent variable (the one suspected of being influenced) on the y-axis. In this way, it is possible to highlight how changes in the x-axis variable affect the y-axis variable. Here, my main studies are based on the article titled ["A Graphic Primer"](https://mathbench.umd.edu/modules/visualization_graph/page02.htm#:~:text=Scientists%20like%20to%20say%20that,left%20side%2C%20vertical%20one) found on https://mathbench.umd.edu/index.html.

As already mentioned, both dependent and independent variables can be categorized as either *categorical* or *continuous*. 

* *Categorical variables* are variables that represent qualitative characteristics or attributes ==> **In graphs, categorical variables are often represented using labels or names**

In relation to the Palmer Penguins dataset, it is necessary to examine further two variants of categorical variables: *nominal* and *dichotomous* variables. A nominal variable is a type of categorical variable where the categories have no inherent order or ranking. In other words, the categories are simply labels without any specific value or hierarchy associated with them. A dichotomous variable is a type of categorical variable that has only two possible categories or outcomes. These categories are typically labeled as "yes" or "no," "true" or "false," or "0" and "1".

In our case, the *species* and the *islands* are nominal categorical variables while the *sex* is also a nominal variable of the dichotomous type:

- Species: It represents the species of the penguin: Adelie, Chinstrap, or Gentoo. As already seen, Pandas reads the column where they are stored as strings (`object` dtype), as each species name is a label or category.

- Island: It represents the island where the penguin was observed: Torgersen, Biscoe, or Dream. Since these island names are also categories, they can likewise be represented using Python's string data type.

- Sex: It represents the gender of the penguin, which can be either male or female. The sex variable can be represented using Python's string data type, or as a binary variable (for instance 0 for male, 1 for female) using Python's integer or boolean data types.


* *Continuous variables*: Continuous variables are variables that represent numerical quantities and can take on any value within a certain range. They are typically measured on a continuous scale. Continuous variables are characterized by having an infinite number of possible values between any two values. Examples of continuous variables include height, weight, temperature, and time. 

In the Palmer Penguins dataset, the continuous variables are *bill length*, *bill depth*, *flipper length* and *body mass*; they are expressed as numbers, in millimetres and grams. For numerical variables, you can use Python's float data type (float) or integer data type (int) to represent the measurements. Use float if you want to preserve decimal places in the values, or use int if the measurements are whole numbers only. In this dataset float is the natural choice, since the bill measurements carry decimals and Pandas already loads all four columns as float64.

To understand the different types of variables, I have used as source of information [Laerd website](https://statistics.laerd.com/statistical-guides/types-of-variable.php/).
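
The modelling choices discussed above can be sketched in Pandas as follows. This is only one possible approach: the `category` dtype is suggested here as an alternative to plain strings for the nominal variables, and the `sex_code` column name and the 0/1 mapping are assumptions made for illustration. The data frame is a made-up stand-in with hypothetical rows.

```python
import pandas as pd

# Toy stand-in for the penguins data frame: hypothetical rows, for illustration only
toy = pd.DataFrame({
    "species": ["Adelie", "Chinstrap", "Gentoo"],
    "island": ["Torgersen", "Dream", "Biscoe"],
    "sex": ["MALE", "FEMALE", None],
    "bill_length_mm": [39.1, 48.8, 47.5],
})

# Dichotomous variable: optional 0/1 encoding (0 for male, 1 for female);
# the nullable 'Int64' dtype keeps a missing sex as <NA> instead of guessing a value
toy["sex_code"] = toy["sex"].map({"MALE": 0, "FEMALE": 1}).astype("Int64")

# Nominal variables: the 'category' dtype records the fixed set of labels
for column in ["species", "island", "sex"]:
    toy[column] = toy[column].astype("category")

# Continuous variables stay as float64
print(toy.dtypes)
```

Keeping the labels as categories rather than free-form strings makes the set of allowed values explicit, while the numeric code is only needed when a model requires numbers as input.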


**What are then the dependent variables that we should focus on when using the different independent variables?**

The element of the dataset that can be expected to vary as the other attributes change is the penguin's species. The aim can therefore be to establish a rule that predicts this dependent categorical variable from a minimum set of independent variables.

Having identified which variables are stored in the dataset and chosen the target, the next step is to describe the peculiarities of each species.

In order to achieve a clear portrait of the species of penguins, we can use the dataframe to depict and come to assumptions using the independent variables that characterise each dependent variable. Specifically, we can answer the three following questions:

1. How are the three penguins' species distributed over each island?

2. What is the difference in measurements between the three different species?

3. Is there any specific characteristic that most distinguishes one species from the others?

Once this description is complete, we can compare the species, build relationships between them and look for a possible linear boundary of discrimination, that is, the minimum set of conditions under which the dependent variable can be predicted from a given set of independent parameters.

</div>

***

# Task 3

> For each of the variables in the penguins data set, suggest what probability distribution from the numpy random distributions list is the most appropriate to model the variable.

<div align="justify">

**MODELING VARIABLES AND PROBABILITY DISTRIBUTION**

Modelling variables with probability distributions is a technique used to understand and describe the patterns and characteristics of data. The choice of a specific probability distribution depends on the type of variable analysed and the target pursued.

A probability distribution is a mathematical function that describes the probability of the different possible values of a variable, and it can be used to represent how often each possible value occurs in a sample. When we model a variable, we are essentially choosing the distribution that best represents the observed values.

**NOMINAL VARIABLES**

Considering the Palmer Penguins dataset, the categorical variables can be further classified into *nominal* with *three levels* (“species”, with the “Chinstrap”, “Adelie” and “Gentoo” subclasses, and “island”, with the “Dream”, “Torgersen” and “Biscoe” subclasses) and *dichotomous* with *two classes* (“sex”, with the “male” and “female” subclasses).

This difference in the number of levels calls for different probability distributions:

==> To model nominal variables with more than two states, such as the “species” and the “island” variables, we need the *MULTINOMIAL DISTRIBUTION*.

==> For the “sex” variable, the probability distribution chosen is the *BERNOULLI* distribution, where only two states are allowed.

The NumPy functions selected are, therefore, [**numpy.random.multinomial**](https://docs.scipy.org/doc/numpy-1.9.0/reference/generated/numpy.random.multinomial.html#numpy-random-multinomial) for the “species” and the “island” variables and [numpy.random.binomial](https://docs.scipy.org/doc/numpy-1.9.0/reference/generated/numpy.random.binomial.html#numpy-random-binomial) for the “sex” variable. NumPy has no dedicated Bernoulli function, but a Bernoulli trial is simply a binomial draw with a single trial (n = 1).

My main points of reference for the above assumptions are the following lectures found on the [Machine Learning & Simulation YouTube channel](https://www.youtube.com/watch?v=421uW9aZHio) and the [ritvikmath YouTube channel](https://www.youtube.com/watch?v=Dkc_hcVWDpA&t=3s).

</div>
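
As a rough illustration of the two choices above, the sketch below simulates the categorical variables using the proportions observed in the dataset (152, 124 and 68 penguins per species; 168 males and 165 females). It uses NumPy's newer `default_rng` generator interface rather than the legacy `numpy.random.multinomial` function linked above, which is a substitution made here; the seed and the coding of female as 1 are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for reproducibility

# Observed species counts from value_counts(): Adelie, Gentoo, Chinstrap
species_counts = np.array([152, 124, 68])
species_probs = species_counts / species_counts.sum()

# One multinomial draw: how 344 simulated penguins split across the three species
simulated_species = rng.multinomial(n=344, pvals=species_probs)
print(simulated_species)

# Observed sex counts: 168 male and 165 female (11 records have no sex recorded)
p_female = 165 / (168 + 165)

# Bernoulli trials are binomial draws with n=1: 1 means female, 0 means male
simulated_sex = rng.binomial(n=1, p=p_female, size=333)
print(simulated_sex.sum(), "simulated females out of", simulated_sex.size)
```

The same multinomial call would apply to the “island” variable, with the island proportions in place of the species proportions.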

<div align="justify">

**CONTINUOUS VARIABLES**

To select the probability distribution that best models the continuous variables (“bill length”, “bill depth”, “flipper length” and “body mass”), it is first necessary to check the actual data distribution, to see whether it reasonably fits a normal distribution or whether another distribution (e.g., log-normal, gamma, or others) might be a better fit. Tools such as histograms and statistical tests can be used to identify the right fit.

To get a sense of the data's distribution, I will therefore plot each variable as a histogram, since this tool groups the data into intervals and makes it possible to check whether the data are normally distributed, skewed, or have multiple modes (peaks).

</div>
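
The check described above can be sketched with NumPy alone, as below. Since the block is meant to run without the CSV file, it works on a simulated stand-in sample: the mean is the Adelie flipper length reported earlier, while the standard deviation of 6.5 mm, the sample size and the seed are hypothetical values chosen for illustration. The text histogram and the skewness formula are one simple way of inspecting the shape, not a formal normality test.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed

# Stand-in sample: mean from the Adelie flipper length reported above,
# standard deviation (6.5 mm) and size (150) are hypothetical values
sample = rng.normal(loc=189.95, scale=6.5, size=150)

# Group the values into 10 equal-width intervals, as a histogram would
counts, bin_edges = np.histogram(sample, bins=10)
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f} mm: {'#' * int(count)}")

# Sample skewness: values near 0 are consistent with a symmetric distribution
centred = sample - sample.mean()
skewness = (centred ** 3).mean() / (centred ** 2).mean() ** 1.5
print(f"skewness = {skewness:.3f}")
```

On the real data, the same two steps would be applied to each measurement column after dropping missing values, ideally separately for each species, since mixing species can produce several peaks.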


***

# End