# Fundamentals of Data Analysis Tasks

**Andrea Cignoni**

***

# Task 1

> The Collatz conjecture is a famous unsolved problem in mathematics. The problem is to prove that if you start with any positive integer $x$ and repeatedly apply the function $f(x)$ below, you always get stuck in the repeating sequence 1, 4, 2, 1, 4, 2, . . .

In [None]:
# This function applies one step of the Collatz calculation to the input number 'x'
def f(x):
    if x % 2 == 0:  # This "if" clause checks whether the given integer is even
        return x // 2  # If it is even, it is divided by two
    else:
        return (3 * x) + 1  # If it is odd, it is multiplied by 3 and then 1 is added

In [None]:
# This function prints the sequence of integers generated by repeatedly applying 'f'
def collatz(x):
    while x != 1:  # This 'while' loop stops once the sequence reaches 1
        print(x, end=', ')
        x = f(x)  # Update 'x' with the next value returned by the function 'f'
    print(x)

In [None]:
collatz(1000)
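
The printing function above is convenient for reading the sequence, but a version that returns the values makes them easier to inspect or count. The following is a minimal sketch under that assumption; the name `collatz_sequence` is my own illustrative choice and is not part of the task statement.

```python
def f(x):
    """Apply one Collatz step to the positive integer x."""
    return x // 2 if x % 2 == 0 else 3 * x + 1


def collatz_sequence(x):
    """Return the Collatz sequence that starts at x and ends at the first 1."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    sequence = [x]
    while x != 1:
        x = f(x)  # move to the next value in the sequence
        sequence.append(x)
    return sequence


print(collatz_sequence(6))
```

Returning a list also makes it possible to check the length of a sequence with `len(collatz_sequence(x))` instead of counting printed values by hand.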

***

# Task 2

> Give an overview of the famous penguins data set, explaining the types of variables it contains. Suggest the types of variables that should be used to model them in Python, explaining your rationale.

<div align="center">

# **PENGUINS DATA SET**
</div>

<div align="justify">
The Palmer penguins dataset is a collection of data about penguins in the Palmer Archipelago, Antarctica. The data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network, and cover 344 penguins living on three islands (Torgersen, Biscoe, and Dream).<br/>

The parameters considered are the following:

- Island name (Dream, Torgersen, or Biscoe);
- Species name (Adelie, Chinstrap, or Gentoo);
- Bill length (mm);
- Bill depth (mm);
- Flipper length (mm);
- Body mass (g);
- Sex.
</div>


<div align="center">

![Screenshot](Penguins.png) ![Screenshot](Penguins1.png)
</div>

<div align="justify">

**LOADING THE FILE AND LIBRARIES IMPORTED**<br/>

The famous Palmer penguins dataset was downloaded from the mwaskom/seaborn-data repository on [GitHub](https://github.com/mwaskom/). Since the information is readable and structured in the standard CSV tabular layout (a comma separates individual items and each record is on a new line), I have opened it as such in my Python repository.

Before loading the data into a data frame, I imported the modules needed to analyse all the variables represented in the file: Pandas, Matplotlib, NumPy and Seaborn.<br/>

- Pandas

Pandas is an open source Python package used for data science, data analysis and machine learning tasks. Common operations performed with Pandas include data cleansing, data filling, data normalization, merges and joins, data visualization, statistical analysis, data inspection, and loading and saving data. Here, I rely on Pandas for indexing the data frame, for manipulating it and for extracting sorted information from specified columns and rows. My main source for its usage is pandas.pydata.org

- Matplotlib

Matplotlib is a Python library used to create 2D graphs and plots through scripts. The pyplot module is especially useful for controlling line styles, font properties, axis formatting and so on. It supports a wide variety of graphs and plots, such as histograms, bar charts, power spectra and error charts.

- Numpy

NumPy is used alongside Matplotlib to provide an environment for fast and efficient numerical computing. NumPy provides a multidimensional array object and various derived objects (such as masked arrays and matrices).

- Seaborn

Lastly, in order to display a clear graphic overview of the whole dataset through pair plots, I have imported Seaborn. This library is built on top of the Matplotlib data visualization library, which makes it well suited to the exploratory analysis that best shows the results of my investigation.

</div>

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Reading the data set downloaded from https://github.com/mwaskom/seaborn-data/blob/master/penguins.csv
data = pd.read_csv('penguins.csv')

# Naming the penguin dataset's 7 variables for representation
data.columns = ['species', 'island', 'bill_lenght_mm', 'bill_depth_mm', 'flipper_lenght_mm', 'body_mass_g', 'sex']

# Suggested data types for the variables, keyed by the header names in the CSV file.
# This dictionary is for reference only and is not applied to the DataFrame:
# flipper length and body mass are whole numbers, but both columns contain
# missing values, so pandas stores them as float64.
dtypes = {
    "species": str,
    "island": str,
    "bill_length_mm": float,
    "bill_depth_mm": float,
    "flipper_length_mm": int,
    "body_mass_g": int,
    "sex": str
}

print(data.columns)

Index(['species', 'island', 'bill_lenght_mm', 'bill_depth_mm',
       'flipper_lenght_mm', 'body_mass_g', 'sex'],
      dtype='object')
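
A dictionary of data types such as the one above can also be handed to Pandas when the file is read, through the `dtype` parameter of `pd.read_csv`. The sketch below is only an illustration: the three-row CSV text is a hypothetical excerpt that I wrote for this example, and choosing `"category"` for the labels is my assumption rather than a requirement of the dataset.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt in the same layout as penguins.csv
csv_text = """species,island,bill_length_mm,body_mass_g
Adelie,Torgersen,39.1,3750
Adelie,Torgersen,,
Gentoo,Biscoe,46.1,4500
"""

# Data types requested at load time (an assumption made for this sketch)
requested_dtypes = {
    "species": "category",
    "island": "category",
    "bill_length_mm": "float64",
    "body_mass_g": "float64",  # float so that the empty field can become NaN
}

sample = pd.read_csv(io.StringIO(csv_text), dtype=requested_dtypes)

print(sample.dtypes)
```

Declaring the types while reading keeps the intended model of each variable next to the loading step instead of relying on what Pandas infers.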


<div align="justify">

**DATA DESCRIPTION**

In order to give a visual overview of the data contained in the dataset, I have used a number of functions provided by Pandas, following the guidance on realpython.com. As already pointed out, the observations of the penguins are described by 7 variables.
</div>

In [None]:
print("\n","Below are shown the first and the last five rows of the dataset")

print(data)

In [None]:
# Check the DataFrame structure
print("\n","This is the summary of the dataframe, including information about the column names, data types, and non-null values:","\n")

data.info()  # info() prints the summary itself and returns None, so it is not wrapped in print()

In [None]:
# This function displays summary statistics about the data
print("\n","This is the main statistical information about the dataset:","\n")

print(data.describe())

In [None]:
# Number of observations taken for each species
print("\n","Number of samples for each class:","\n")

print(data['species'].value_counts())

<div align="justify">

**PREPROCESSING THE DATA SET**


Before proceeding to my actual analysis, I have preprocessed the data contained in the penguins dataset. This standard procedure identifies missing or inconsistent values resulting from human or computer error so that they can be removed or corrected. Preprocessing can significantly improve the accuracy and quality of a dataset, making it more reliable. Once the data have been shown to be consistent and all unhelpful parts have been eliminated, the information can be transformed into a format that is processed more easily and effectively in data mining, machine learning and other data science tasks.
</div>

In [None]:
# Preprocessing the dataset: check for null values
print("\n","Below, the missing values that were found in the raw file:","\n")

print(data.isnull().sum())
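
Counting the missing values is only the first part of preprocessing; the rows that contain them still have to be handled. The sketch below shows two common options with `DataFrame.dropna`. It runs on a hypothetical four-row frame that I made up for illustration, so the counts it prints are not those of the real penguins file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the penguins data
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "bill_depth_mm": [18.7, np.nan, 14.3, 17.9],
    "sex": ["Male", None, "Female", None],
})

# Count the missing values in each column
missing_per_column = sample.isnull().sum()

# Option 1: drop every row that has at least one missing value
complete_rows = sample.dropna()

# Option 2: drop only the rows whose measurement is missing,
# keeping penguins whose sex was not recorded
measured_rows = sample.dropna(subset=["bill_depth_mm"])

print(missing_per_column)
print(len(complete_rows), len(measured_rows))
```

Which option is preferable depends on the analysis: the second keeps more observations when the sex column is not needed.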

<div align="justify">

**DATA TYPES**

In Pandas, the data types contained in the dataset are retrieved with the *dtypes* attribute:

</div>


In [4]:
print("\n","The Penguins dataset contains 4 numerical variables and 3 categorical variables:", "\n")

data_types = data.dtypes

print(data_types)


 The Penguins dataset contains 4 numerical variables and 3 categorical variables: 

species               object
island                object
bill_lenght_mm       float64
bill_depth_mm        float64
flipper_lenght_mm    float64
body_mass_g          float64
sex                   object
dtype: object
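
The split between numerical and categorical columns can also be obtained programmatically with `DataFrame.select_dtypes`, rather than by reading the `dtypes` listing. This is a minimal sketch on a hypothetical two-row frame; the column names mirror the dataset but the values are placeholders.

```python
import pandas as pd

# Hypothetical two-row frame with one numerical and two label columns
sample = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "island": ["Torgersen", "Biscoe"],
    "body_mass_g": [3750.0, 5000.0],
})

# Columns holding numbers (integers or floats)
numeric_columns = sample.select_dtypes(include="number").columns.tolist()

# Every remaining column is treated here as categorical
categorical_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_columns)
print(categorical_columns)
```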


<div align="justify">

Once I have established which kinds of data are being analysed, I can determine the graphical representation techniques best suited to comparing and examining them. My main point of reference for this topic is the [Towards Data Science](https://towardsdatascience.com/penguins-dataset-overview-iris-alternative-9453bb8c8d95) website.

</div>

<div align="justify">

# **TYPES OF VARIABLES IN THE DATASET**


Variables can be *dependent* or *independent*, and this basic distinction depends on their roles in an experiment or in a statistical study. These two classes can be further characterised as *categorical* or *continuous*.

1. *Dependent variable*: A dependent variable is a variable whose outcome or result depends on, or is influenced by, changes in another variable (called the independent variable) ==> **The dependent variable is often plotted on the y-axis of a graph**

2. *Independent variable*: The independent variable is the variable that is intentionally changed or manipulated to observe its effect on the dependent variable ==> **The independent variable is often plotted on the x-axis of a graph**

For instance, a test mark can be considered a dependent variable, while factors such as the hours students spent preparing for the exam or their IQ level can be considered independent variables: the scores are expected to differ if these causes are modified. In data analysis, examining how dependent variables vary when their independent variables change can uncover insights, and it is essential for drawing meaningful conclusions about cause-and-effect relationships. When exploring the relationship between two variables, it is common to represent the independent variable (the one you suspect influences the other) on the x-axis and the dependent variable (the one you suspect is influenced) on the y-axis. In this way, it is possible to highlight how changes in the x-axis variable affect the y-axis variable. Here, my main source is the article titled ["A Graphic Primer"](https://mathbench.umd.edu/modules/visualization_graph/page02.htm#:~:text=Scientists%20like%20to%20say%20that,left%20side%2C%20vertical%20one) found on https://mathbench.umd.edu/index.html.

As already mentioned, both dependent and independent variables can be categorized as either *categorical* or *continuous*. 

* *Categorical variables* are variables that represent qualitative characteristics or attributes ==> **In graphs, categorical variables are often represented using labels or names**

In relation to the Palmer Penguins dataset, two kinds of categorical variables need closer examination: *nominal* and *dichotomous* variables. A nominal variable is a type of categorical variable where the categories have no inherent order or ranking. In other words, the categories are simply labels without any specific value or hierarchy associated with them. A dichotomous variable is a type of categorical variable that has only two possible categories or outcomes. These categories are typically labeled as "yes" or "no," "true" or "false," or "0" and "1".

In our case, the *species* and the *islands* are nominal categorical variables while the *sex* is a dichotomous variable:

  - Species: It represents the species of penguin: Adelie, Chinstrap, or Gentoo. As already seen, Pandas reads the column where they are stored as the `object` data type (Python strings), as each species name is a label or category.

  - Island: It represents the island where the penguin was observed: Torgersen, Biscoe, or Dream. Since these island names are also categories, they can also be represented using Python's string data type.

  - Sex: It represents the sex of the penguin, which can be either male or female. The sex variable can be represented using Python's string data type, or as a binary variable using Python's integer (0 for male, 1 for female) or boolean data types.
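
The modelling choices described for the categorical variables can be sketched in Pandas as follows. This is one possible approach under my own assumptions: the four-row frame is hypothetical, the `is_female` column name is an illustrative choice, and the nullable `"boolean"` type is used so that a penguin whose sex was not recorded stays missing rather than being forced to true or false.

```python
import pandas as pd

# Hypothetical four-row frame; the third penguin has no recorded sex
sample = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "sex": ["Male", "Female", None, "Female"],
})

# Nominal variable: store the labels as an unordered categorical
sample["species"] = sample["species"].astype("category")

# Dichotomous variable: map the two labels to True/False,
# using the nullable boolean type so the missing entry is preserved
sample["is_female"] = (
    sample["sex"].map({"Female": True, "Male": False}).astype("boolean")
)

print(sample["species"].cat.categories.tolist())
print(sample["is_female"])
```

The categorical type stores each distinct label once and records that no ordering exists between the species, which matches the definition of a nominal variable.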


* *Continuous variables*: Continuous variables are variables that represent numerical quantities and can take on any value within a certain range. They are typically measured on a continuous scale. Continuous variables are characterized by having an infinite number of possible values between any two values. Examples of continuous variables include height, weight, temperature, and time. 

In the Palmer Penguins dataset, the continuous variables are *bill length*, *bill depth*, *flipper length* and *body mass*. They are expressed as numbers, in millimetres and grams.

For numerical variables, you can use Python's float data type (float) or integer data type (int) to represent the measurements. Use float if you want to preserve decimal places in the values, or use int if the measurements are whole numbers only.
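
One practical point affects the choice between float and int: a standard integer column cannot hold missing values, which is why a whole-number measurement with gaps ends up stored as a float. The sketch below shows one way around this, using Pandas' nullable `"Int64"` type. The three-row frame is hypothetical and the choice of `"Int64"` is my suggestion, not something prescribed by the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; the last penguin was not measured
sample = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, np.nan],
    "body_mass_g": [3750.0, 3800.0, np.nan],
})

# Bill length has decimal places, so it stays a float
bill_length_dtype = str(sample["bill_length_mm"].dtype)

# Body mass is a whole number with a gap: the nullable integer type
# keeps the values as integers and the gap as a missing value
sample["body_mass_g"] = sample["body_mass_g"].astype("Int64")

print(bill_length_dtype)
print(sample["body_mass_g"])
```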

To understand the different types of variables, I have used the [Laerd website](https://statistics.laerd.com/statistical-guides/types-of-variable.php/) as a source of information.

Once we have identified which variables are stored in our dataset, we have to determine what we want to represent. In this case my point of reference is the following article published on [Statistics By Jim](https://statisticsbyjim.com/basics/data-types/). 


</div>

***

# End