# Fundamentals of Data Analysis Tasks

**Andrea Cignoni**

***

# Task 1

> The Collatz conjecture is a famous unsolved problem in mathematics. The problem is to prove that if you start with any positive integer $x$ and repeatedly apply the function $f(x)$ below, you always get stuck in the repeating sequence 1, 4, 2, 1, 4, 2, . . .

In [None]:
# This function applies one step of the Collatz calculation to the input number 'x'
def f(x):
    if x % 2 == 0:  # This "if" clause checks whether the given integer is even
        return x // 2  # If it is even, it is divided by two
    else:
        return (3 * x) + 1  # If it is odd, it is multiplied by 3 and then 1 is added

In [None]:
# This function prints the sequence of integers generated by repeatedly applying the 'f' function
def collatz(x):
    while x != 1:  # This 'while' loop stops once the sequence reaches 1
        print(x, end=', ')
        x = f(x)  # Update 'x' with the next value returned by the 'f' function
    print(x)

In [None]:
collatz(1000)
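
As a complementary sketch, the same iteration can collect the values in a list instead of printing them, which makes the sequence easier to inspect or test. The name `collatz_sequence` is hypothetical and not part of the original notebook; `f` is redefined here only so that the block runs on its own.

```python
def f(x):
    """Apply one Collatz step to the positive integer x."""
    return x // 2 if x % 2 == 0 else 3 * x + 1


def collatz_sequence(x):
    """Return the Collatz sequence from x down to 1 (hypothetical helper)."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    sequence = [x]
    while x != 1:  # Stop once the sequence reaches 1
        x = f(x)
        sequence.append(x)
    return sequence


print(collatz_sequence(6))  # [6, 3, 10, 5, 16, 8, 4, 2, 1]
```

Returning a list rather than printing keeps the calculation separate from its presentation, so the result can be reused, for example to measure the length of the sequence with `len`.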

***

# Task 2

> Give an overview of the famous penguins data set, explaining the types of variables it contains. Suggest the types of variables that should be used to model them in Python, explaining your rationale.

<div align="center">

# **PENGUINS DATA SET**
</div>

<div align="justify">
The Palmer penguins dataset is a collection of data about penguins in the Palmer Archipelago, Antarctica. The data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network, and cover 344 penguins living on three islands (Torgersen, Biscoe, and Dream).<br/>

The parameters considered are the following:

- Island name (Dream, Torgersen, or Biscoe);
- Species name (Adelie, Chinstrap, or Gentoo);
- Bill length (mm);
- Bill depth (mm);
- Flipper length (mm);
- Body mass (g);
- Sex.
</div>
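
One possible way to model these variables in plain Python is sketched below: the categorical variables (species, island, sex) become enumerations, because they take a small fixed set of labels, while the measurements become optional floats, on the assumption that some values may be missing. The class names `Species`, `Island`, `Sex` and `Penguin`, the enumeration strings, and the sample record are illustrative assumptions, not part of the dataset itself.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Species(Enum):
    ADELIE = "Adelie"
    CHINSTRAP = "Chinstrap"
    GENTOO = "Gentoo"


class Island(Enum):
    BISCOE = "Biscoe"
    DREAM = "Dream"
    TORGERSEN = "Torgersen"


class Sex(Enum):
    # Label spelling is an assumption; adapt it to the file being read
    MALE = "Male"
    FEMALE = "Female"


@dataclass
class Penguin:
    species: Species                    # categorical
    island: Island                      # categorical
    bill_length_mm: Optional[float]     # continuous measurement
    bill_depth_mm: Optional[float]      # continuous measurement
    flipper_length_mm: Optional[float]  # continuous measurement
    body_mass_g: Optional[float]        # continuous measurement
    sex: Optional[Sex]                  # categorical, may be unrecorded


# Hypothetical record used only to show the types in action
example = Penguin(Species.ADELIE, Island.TORGERSEN, 39.1, 18.7, 181.0, 3750.0, Sex.MALE)
print(example.species.value, example.body_mass_g)
```

In a Pandas data frame the same rationale would suggest a categorical dtype for the three label columns and a floating-point dtype for the four measurements.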


<div align="center">

![Screenshot](Penguins.png) ![Screenshot](Penguins1.png)
</div>

<div align="justify">

**LOADING THE FILE AND LIBRARIES IMPORTED**<br/>

The famous Palmer penguins dataset is downloaded from the mwaskom/seaborn-data repository on [GitHub](https://github.com/mwaskom/). Since the file is readable and follows the standard CSV tabular layout (a comma separates individual items and each record is on a new line), I opened it as such in my Python repository.

Before loading the data into a data frame, I imported the modules needed to analyse the variables represented in the file: Pandas, Matplotlib, NumPy and Seaborn.<br/>

- Pandas

Pandas is an open source Python package used for data science, data analysis and machine learning tasks. Common operations performed with Pandas include data cleansing, data filling, data normalization, merges and joins, data visualization, statistical analysis, data inspection, and loading and saving data. Here, I rely on Pandas for indexing the data frame, manipulating it, and extracting information from specified columns and rows. My main source for its usage is pandas.pydata.org.

- Matplotlib

Matplotlib is a Python library used to create 2D graphs and plots through scripts. The pyplot module is especially useful for controlling line styles, font properties, axis formatting, etc. It supports a wide variety of graphs and plots, such as histograms, bar charts, power spectra and error charts.

- NumPy

NumPy is used alongside Matplotlib to provide an effective and fast environment for numeric computing. NumPy provides a multidimensional array object and various derived objects (such as masked arrays and matrices).

- Seaborn

Lastly, to display a clear graphic overview of the whole dataset through pair plots, I imported Seaborn. This library is built on top of the Matplotlib data visualization library and supports the kind of exploratory analysis best suited to showing the results of my investigation.

</div>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Reading the data set downloaded from (https://github.com/mwaskom/seaborn-data/blob/master/penguins.csv)
# The first row of the file holds the column names, so the default header handling is used

data = pd.read_csv('penguins.csv')

# Listing the penguin dataset's 7 variables (columns)

print(data.columns)

<div align="justify">

**DATA DESCRIPTION**

To give an overview of the data contained in the dataset, I used a number of functions provided by Pandas, with reference to realpython.com. As already pointed out, the observations of the penguins are described by seven parameters.
</div>

In [None]:
print("\n","The first and the last five rows of the dataset are shown below:")

print(data)

In [None]:
# Check the DataFrame structure
print("\n","This is the summary of the dataframe, including information about the column names, data types, and non-null values:","\n")

# info() prints its report directly and returns None, so it is not wrapped in print()
data.info()

In [None]:
# Display summary statistics for the numeric columns
print("\n","This is the main statistical information of the dataset:","\n")

print(data.describe())

In [None]:
# Number of observations taken for each species
print("\n","Number of samples for each class:","\n")

print(data['species'].value_counts())

<div align="justify">

**PREPROCESSING THE DATA SET**


Before proceeding to the actual analysis, I preprocessed the data contained in the penguins dataset. This standard procedure is used to remove missing or inconsistent data values resulting from human or computer error. Preprocessing can significantly improve the accuracy and quality of a dataset, making it more reliable. Once the data are shown to be consistent and all unhelpful parts are removed, the information can be transformed into a format that is more easily and effectively processed in data mining, machine learning and other data science tasks.
</div>

In [None]:
# Preprocessing the dataset: check for null values
print("\n","Number of missing values found in each column of the raw file:","\n")

print(data.isnull().sum())
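
Once the missing values have been counted, one common option is to drop the incomplete rows. The sketch below shows this on a small hand-made data frame rather than on the real file: the frame `sample` and its values are invented for illustration, and dropping rows is only one possible choice (filling in the gaps is another).

```python
import numpy as np
import pandas as pd

# Small illustrative frame (invented values, not taken from penguins.csv)
sample = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, np.nan, 3800.0],
    "sex": ["Male", "Female", None],
})

# Count the missing values per column, as in the cell above
print(sample.isnull().sum())

# Keep only the rows that have no missing value in any column
cleaned = sample.dropna().reset_index(drop=True)
print(cleaned)
```

Dropping rows is simple and keeps every remaining record complete, at the cost of discarding the valid measurements those rows also contained.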

***

# End