# Fundamentals of Data Analysis Tasks

Phelim Barry

***

***Task1***

>The Collatz conjecture is a famous unsolved problem in mathematics. The problem is to prove that if you start with any positive integer $x$ and repeatedly apply the function $f(x)$ below, you always get stuck in the repeating sequence 1, 4, 2, 1, 4, 2, . . .   
This task is to verify, using Python, that the conjecture is true for the first 10,000 positive integers.

If $x$ is even, $f(x) = x ÷ 2$   
If $x$ is odd, $f(x) = 3x + 1$

Define a function for $f(x)$ and apply the appropriate formula. Using the modulus operator we can determine if $x$ is even (i.e. no remainder when divided by 2) or else $x$ must be odd.

In [1]:
def f(x):
    if x % 2 == 0:          #x is even
        return x // 2
    else:                   #x is odd
        return (3 * x) + 1

Define a function that checks a single value of $x$. It applies $f(x)$ repeatedly until the result equals 1, at which point the conjecture is verified for that starting number and we can move on to the next one. Once every number from 1 to 10000 has been processed, the conjecture is verified for all of them. Note that if the conjecture were false for some starting number, the loop would never terminate.

In [2]:
def collatz(x):
    """Apply f repeatedly until x reaches 1 (never returns if 1 is not reached)."""
    while x != 1:
        x = f(x)            # call the f(x) function
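
The loop above discards the intermediate values. As a minimal sketch, assuming we also want to see the path a number takes, the variant below records every value visited. The names `collatz_sequence` and `max_steps` are hypothetical and not part of the original notebook; `max_steps` is an assumed safety cap so the loop cannot run forever.

```python
def f(x):
    """One Collatz step: halve even numbers, map odd x to 3x + 1."""
    return x // 2 if x % 2 == 0 else 3 * x + 1


def collatz_sequence(x, max_steps=10_000):
    """Return the list of values visited from x until 1 is reached.

    max_steps is an assumed safety cap (not in the original notebook).
    """
    sequence = [x]
    while x != 1:
        if len(sequence) > max_steps:
            raise RuntimeError(f"did not reach 1 within {max_steps} steps")
        x = f(x)
        sequence.append(x)
    return sequence


print(collatz_sequence(6))  # [6, 3, 10, 5, 16, 8, 4, 2, 1]
```

Raising an error after a fixed number of steps is only one way to guard against non-termination; the cap of 10,000 is an arbitrary choice for this sketch.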

Run the function and print a confirmation message once complete.

In [3]:
# Define start and end range values
start_range = 1
stop_range = 10001

for num in range(start_range, stop_range):
    collatz(num)        # call the collatz function

print(f'We have cycled through the values from {start_range} to {stop_range-1} and each time ended with a value of 1 \nindicating the conjecture is true for the first {stop_range-1} positive integers')


We have cycled through the values from 1 to 10000 and each time ended with a value of 1 
indicating the conjecture is true for the first 10000 positive integers


***Task2***


>The purpose of this task is to give an overview of the penguins data set explaining the types of variables it contains.

>We will also suggest the types of variables that should be used to model them in Python giving explanations for the rationale.

**Overview**

The Palmer Penguins dataset contains measurements of three different species of penguins from three islands in the Palmer Archipelago in Antarctica. The data were collected between 2007 and 2009 and made available by Dr. Kristen Gorman and other researchers from Palmer Station, Antarctica LTER.

Similar to the Iris dataset, it is often used to explore areas of data analysis such as correlation and regression.
 

The dataset has seven columns of data containing the following variables:

| Variable Name | Type | Description |
| --- | --- | --- |
| species | categorical | species of penguin (Adelie, Chinstrap or Gentoo) |
| island | categorical | island where the penguin was found (Biscoe, Dream or Torgersen) |
| sex | categorical | sex of the penguin (male or female) |
| bill_length_mm | numeric/real | length of the penguin's bill in mm |
| bill_depth_mm | numeric/real | depth of the penguin's bill in mm |
| flipper_length_mm | numeric/real | length of the penguin's flipper in mm |
| body_mass_g | numeric/real | body mass of the penguin in grams |
 

Categorical variables describe qualitative data: each value must come from a specific, limited list. In the penguins dataset the three categorical variables have very specific values. Species has three values: Adelie, Chinstrap and Gentoo. Island also has three values: Biscoe, Dream and Torgersen. Sex has two values: male and female. All three of these variables can be described as nominal, as they name something and their values have no inherent order.

Numeric variables are also known as quantitative variables because they measure something using a number. The penguins dataset contains four numeric variables, three measured in millimeters and one measured in grams. These are continuous variables and can be further described as ratio variables: they have a true zero, so both the differences and the ratios between values (of lengths, weights etc.) are meaningful. Because all four variables measure a physical quantity, they are best treated as real numbers. The presence of decimal places in the bill measurements also points to them being real numbers.

**Suggested Variable Types in Python**

The table below suggests a Python type for modeling each of the variables.


| Variable Name | Type | Python Variable Type |
| --- | --- | --- |
| species | categorical | string |
| island | categorical | string |
| sex | categorical | string |
| bill_length_mm | numeric/real | float |
| bill_depth_mm | numeric/real | float |
| flipper_length_mm | numeric/real | float |
| body_mass_g | numeric/real | float OR int |

species, island and sex should be modeled as type string: each contains a fixed set of values made up of alphabetic characters, and we won't be performing any mathematical calculations on them.
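
As a minimal sketch of this idea, the categorical values can be held as plain strings and checked against the fixed lists of allowed values given in the overview. The names `SPECIES`, `ISLANDS`, `SEXES` and `is_valid_penguin` are hypothetical and chosen only for illustration.

```python
# Allowed values for each categorical variable, as listed in the overview.
SPECIES = {"Adelie", "Chinstrap", "Gentoo"}
ISLANDS = {"Biscoe", "Dream", "Torgersen"}
SEXES = {"male", "female"}


def is_valid_penguin(species, island, sex):
    """Return True if all three strings are among the allowed values."""
    return species in SPECIES and island in ISLANDS and sex in SEXES


print(is_valid_penguin("Adelie", "Dream", "female"))    # True
print(is_valid_penguin("Emperor", "Dream", "female"))   # False
```

Sets are used here only because membership checks read naturally; a list or tuple would work equally well for lists this short.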

Numeric values in Python are usually defined as type ```int``` or ```float``` (```complex``` is also available). ```int``` values are whole numbers with no decimal places. They are ideal for addition or subtraction, but any arithmetic involving true division (the ```/``` operator), such as calculating a mean, returns a ```float``` in Python 3. ```float``` (floating-point) variables store values with decimal points. Arithmetic on float values always produces another float, which displays a digit after the decimal point even if that digit is 0 (for example $2.0$).

With this in mind, it would be best to model bill_length_mm and bill_depth_mm as type ```float``` because both contain decimal values.
flipper_length_mm and body_mass_g are recorded as whole numbers, so either could be stored in an ```int``` variable. However, a conversion such as grams to kilograms would introduce decimal places, so it would be best to store them as ```float``` also.
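
The behavior of ```int``` and ```float``` arithmetic described above can be checked with a short sketch. The body mass values and variable names below are assumptions chosen for illustration, not a reference to specific rows of the dataset.

```python
# Illustrative body masses in grams, stored as ints.
masses_g = [3750, 3800, 3250]

total_g = sum(masses_g)                 # int + int stays an int
mean_g = total_g / len(masses_g)        # true division always gives a float
mass_kg = masses_g[0] / 1000            # grams to kilograms introduces decimals

print(total_g, type(total_g).__name__)  # 10800 int
print(mean_g, type(mean_g).__name__)    # 3600.0 float
print(mass_kg, type(mass_kg).__name__)  # 3.75 float
print(4 / 2)                            # 2.0, a float even though the result is whole
```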


Maybe move this somewhere else - Task5 maybe or above as part of the overview

In total, there are 344 rows of data in the dataset. However, unlike the Iris dataset, the penguins dataset has some missing values as follows:

| Variable Name | Missing Count |
| --- | --- |
| sex | 11 |
| bill_length_mm | 2 |
| bill_depth_mm | 2 |
| flipper_length_mm | 2 |
| body_mass_g | 2 |
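
Counts like these can be produced in pandas with `isna().sum()`. The sketch below uses a tiny hand-built dataframe (the name `sample` and its rows are hypothetical, for illustration only) so that it runs without downloading anything; the same call could be applied to the full penguins dataframe.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the real dataset.
sample = pd.DataFrame({
    "sex": ["Male", None, "Female", None],
    "body_mass_g": [3750.0, None, 3250.0, 3450.0],
})

# isna() flags each missing cell; sum() counts the flags per column.
missing_counts = sample.isna().sum()
print(missing_counts)

# A numeric column holding missing values is stored as float64,
# because the missing-value marker NaN is itself a float.
print(sample["body_mass_g"].dtype)
```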

***Task3***

>Add Task3 description...

***Task4***

>Add Task4 description...

***Task5***

>Add Task5 description...

In [4]:
# Import pandas and seaborn and load the penguins data set into a new dataframe df
import pandas as pd
import seaborn as sns

df = sns.load_dataset('penguins')

#df

In [5]:
df.dtypes

species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object
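
The `object` dtype reported above is how pandas stores the string columns here. One option, sketched below on a few hypothetical rows (the dataframe `sample` is an assumption for illustration, not the real data), is to convert such columns to the pandas `category` dtype, which records the fixed set of allowed values explicitly.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the real dataset.
sample = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Dream"],
})

# Convert the string columns to the category dtype.
for column in ["species", "island"]:
    sample[column] = sample[column].astype("category")

print(sample.dtypes)
print(list(sample["species"].cat.categories))  # ['Adelie', 'Gentoo']
```

Whether this conversion is worthwhile depends on the analysis; plain strings remain a reasonable choice for a dataset of this size.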

***

## End