# Fundamentals of Data Analysis Tasks
Eilis Donohue - G00006088
***

## Task 1
Collatz Conjecture

References
https://stackoverflow.com/questions/39734485/python-combining-two-lists-and-removing-duplicates-in-a-functional-programming

https://stackoverflow.com/questions/59925384/python-remove-elements-that-are-greater-than-a-threshold-from-a-list

In [23]:
# import time to time the execution
import time

# the function to do one step of the Collatz sequence
def f(x):
  # if even, divide by 2 (integer division keeps the result an int)
  if x % 2 == 0:
    return x // 2
  # if odd, multiply by 3 and add 1
  else:
    return (x * 3) + 1

In [25]:
# import time to time the execution
import time
# Define some variables
# Define the number of positive integers we want to check satisfies the Collatz conjecture
number_of_integers = 10000

# Define empty set of proven integers
proved_set = set()
# Loop counter to check how many Collatz sequences are actually calculated
loop_counter = 0
st = time.time()
# Loop backwards from the required number of integers
for x in reversed(range(1, number_of_integers + 1)):
  # check if the number has already appeared in an earlier sequence
  if x not in proved_set:
    loop_counter = loop_counter + 1
    # Define an empty set to store the Collatz sequence for this number
    number_set = set()
    while x != 1:
      # call Collatz function
      x = f(x)
      # add each value to the Collatz sequence set
      # Could instead add only values <= number_of_integers to avoid having to remove them later
      number_set.add(x)
    # optionally remove the numbers bigger than number_of_integers so that we aren't storing them
    #number_set = {val for val in number_set if val<=number_of_integers}
    # Union the sets to remove duplicates and get the new proven set of numbers
    proved_set = proved_set | number_set
# Stop the timer once, after the loop has finished
et = time.time()
print(f"Collatz conjecture proven for first {number_of_integers} integers, completed in {loop_counter} loops in {et-st} seconds")



Collatz conjecture proven for first 10000 integers, completed in 2725 loops in 4.68923807144165 seconds


In [20]:
# Brute force - run every integer completely

import time
# Define some variables
# Define the number of positive integers we want to check satisfies the Collatz conjecture
number_of_integers = 1000000

# Loop counter to check how many Collatz sequences are calculated
loop_counter = 0
st = time.time()

# Loop forwards over every integer, with no check for numbers already seen
for x in range(1, number_of_integers + 1):
  loop_counter = loop_counter + 1
  # Define a list to store the Collatz sequence for this number
  number_set = [x]
  while x != 1:
    # call Collatz function
    x = f(x)
    # append each value to the Collatz sequence list
    number_set.append(x)
# Stop the timer once, after the loop has finished
et = time.time()
print(f"Collatz conjecture proven for first {number_of_integers} integers, completed in {loop_counter} loops in {et-st} seconds")


Collatz conjecture proven for first 1000000 integers, completed in 1000000 cycles in 46.73258876800537 seconds


It seems that running every integer to completion, without checking whether a number has already appeared in a previous sequence, is the faster approach: the brute-force version handles 1e6 integers in 46.7 s, whereas the set-based version took 4.7 s for only 1e4 integers. A further optimisation would be to cut a sequence short as soon as one of its values is found in the proven set. This is still to be investigated.
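The cut-off idea can be sketched as follows. This is only a minimal sketch and has not been timed against the cells above; the names `collatz_step` and `verify_up_to` are hypothetical and not part of the original notebook. Each sequence stops as soon as it reaches a value already known to reach 1, and only values within the range being checked are stored.

```python
import time

def collatz_step(n):
    """Return the next Collatz value, using integer arithmetic throughout."""
    return n // 2 if n % 2 == 0 else 3 * n + 1

def verify_up_to(limit):
    """Check that every integer from 1 to limit reaches 1.

    Returns the total number of Collatz steps that were calculated.
    """
    verified = {1}  # integers already known to reach 1
    steps = 0
    for start in range(2, limit + 1):
        path = []
        n = start
        # stop as soon as the sequence hits a value that is already verified
        while n not in verified:
            path.append(n)
            n = collatz_step(n)
            steps += 1
        # store only values within the range being checked
        verified.update(v for v in path if v <= limit)
    return steps

st = time.time()
steps = verify_up_to(10000)
et = time.time()
print(f"Checked the first 10000 integers in {steps} steps in {et - st} seconds")
```

Whether this beats the brute-force loop depends on the cost of the set lookups, so it would need to be timed on the same machine before drawing any conclusion.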

---
## Task 2
Give an overview of the famous penguins data set [2], explaining the types of variables it contains. Suggest the types of variables that should be used to model them in Python, explaining your rationale.

[2] mwaskom/seaborn-data: Data repository for seaborn examples. Aug. 30, 2023. URL: https://github.com/mwaskom/seaborn-data/blob/master/penguins.csv (visited on 08/30/2023).

Python datatypes:


In [38]:
# Use pandas to read in the penguins csv file stored in /data and get a preview of first 10 rows
import pandas as pd

penguins_df = pd.read_csv("data/penguins.csv")
print(penguins_df.head(10))

# Find the length of the dataframe
print(f"\n{len(penguins_df.index)} rows in dataframe.")

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             NaN            NaN                NaN   
4  Adelie  Torgersen            36.7           19.3              193.0   
5  Adelie  Torgersen            39.3           20.6              190.0   
6  Adelie  Torgersen            38.9           17.8              181.0   
7  Adelie  Torgersen            39.2           19.6              195.0   
8  Adelie  Torgersen            34.1           18.1              193.0   
9  Adelie  Torgersen            42.0           20.2              190.0   

   body_mass_g     sex  
0       3750.0    MALE  
1       3800.0  FEMALE  
2       3250.0  FEMALE  
3          NaN     NaN  
4       3450.0  FEMALE  
5       3650.0    MALE  
6       36

In [61]:
# Print the pandas assigned type for each variable in the dataframe
print(penguins_df.dtypes)


species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object


From the above, pandas identifies 4 variables as floating point numbers (float64), each displayed to one decimal place and all appearing to be measurements. The **float** type would be used in Python to model these variables. The float type also handles missing values, which appear as NaN.
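As a small illustration of how float values and NaN behave together, the sketch below uses a hypothetical four-value series loosely based on the flipper length preview above, rather than the real column. It assumes pandas and NumPy are installed.

```python
import math
import numpy as np
import pandas as pd

# Hypothetical sample values, not the real flipper_length_mm column
flipper = pd.Series([181.0, 186.0, np.nan, 193.0])

# NaN is itself a float, so the series is stored as float64
print(flipper.dtype)

# Each element is a numpy.float64, which is a subclass of Python's float
print(isinstance(flipper.iloc[0], float))

# The missing entry can be detected with math.isnan
print(math.isnan(flipper.iloc[2]))

# pandas skips NaN by default when computing the mean
print(flipper.mean())
```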

There are 3 variables containing textual data ('species', 'island' and 'sex'), each of which is a textual descriptor of the penguin.

The obvious Python datatype for these variables is the string type (**str**).

Identifying the unique data for each of the object/textual variable types in the dataframe:

In [62]:
print(penguins_df["species"].unique())
print(penguins_df["island"].unique())
print(penguins_df["sex"].unique())


['Adelie' 'Chinstrap' 'Gentoo']
['Torgersen' 'Biscoe' 'Dream']
['MALE' 'FEMALE' nan]


A summary description of the textual variables is given below. From the unique values and the summary description, and knowing that there are 344 entries in the dataset, it is clear that each species and island entry is one of 3 possibilities, and that there are 11 NaN values for the sex variable (344 entries less the 333 counted).

In [63]:
print(penguins_df["species"].describe())
print(penguins_df["island"].describe())
print(penguins_df["sex"].describe())


count        344
unique         3
top       Adelie
freq         152
Name: species, dtype: object
count        344
unique         3
top       Biscoe
freq         168
Name: island, dtype: object
count      333
unique       2
top       MALE
freq       168
Name: sex, dtype: object
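Since each textual variable takes only a handful of distinct values, one possible alternative to plain strings is the pandas `category` dtype. The sketch below is an assumption about how this could be done, not something run in the notebook: `sample` is a hypothetical miniature frame standing in for `penguins_df`. It also shows how missing entries can be counted directly with `isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for penguins_df
sample = pd.DataFrame({
    "species": ["Adelie", "Chinstrap", "Gentoo", "Adelie"],
    "sex": ["MALE", "FEMALE", np.nan, "FEMALE"],
})

# Count the missing entries in each column
missing = sample.isna().sum()
print(missing)

# Convert the textual columns to the category dtype
sample = sample.astype({"species": "category", "sex": "category"})
print(sample.dtypes)

# NaN is not treated as a category, so only the observed labels are listed
print(list(sample["sex"].cat.categories))
```

Applied to the full dataset, the same `isna().sum()` call would be one way to confirm the count of missing values derived above from the `describe()` output.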


In [60]:
import numpy as np
penguins_np = penguins_df.to_numpy()

print(penguins_np[3, 2])


nan


***
End