# Fundamentals of Data Analysis Tasks

**Stefania Verduga**

***

## Task 1

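The Collatz conjecture states that, starting from any positive integer, repeatedly applying a simple rule (halve the number if it is even, otherwise triple it and add one) eventually reaches 1.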

In [1]:
def f(x):
    # If x is even, divide it by two.
    if x % 2 == 0:
        return x // 2
    else:
        # If x is odd, multiply it by three and add one.
        return (3 * x) + 1

In [2]:
def collatz(x):
    # Print each term of the sequence until it reaches 1.
    while x != 1:
        print(x, end=', ')
        x = f(x)
    # Print the final term (1) and end the line.
    print(x)

In [3]:
collatz(1000)

1000, 500, 250, 125, 376, 188, 94, 47, 142, 71, 214, 107, 322, 161, 484, 242, 121, 364, 182, 91, 274, 137, 412, 206, 103, 310, 155, 466, 233, 700, 350, 175, 526, 263, 790, 395, 1186, 593, 1780, 890, 445, 1336, 668, 334, 167, 502, 251, 754, 377, 1132, 566, 283, 850, 425, 1276, 638, 319, 958, 479, 1438, 719, 2158, 1079, 3238, 1619, 4858, 2429, 7288, 3644, 1822, 911, 2734, 1367, 4102, 2051, 6154, 3077, 9232, 4616, 2308, 1154, 577, 1732, 866, 433, 1300, 650, 325, 976, 488, 244, 122, 61, 184, 92, 46, 23, 70, 35, 106, 53, 160, 80, 40, 20, 10, 5, 16, 8, 4, 2, 1
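The sequence printed above can also be collected into a list, which makes it easier to inspect its length or its peak value. The sketch below is one possible way to do that; `collatz_step` and `collatz_sequence` are hypothetical helper names that are not part of the original notebook.

```python
def collatz_step(x):
    """Apply one step of the Collatz rule to a positive integer."""
    return x // 2 if x % 2 == 0 else 3 * x + 1


def collatz_sequence(x):
    """Return the list of terms from x down to 1 (hypothetical helper)."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    terms = [x]
    while x != 1:
        x = collatz_step(x)
        terms.append(x)
    return terms


sequence = collatz_sequence(1000)
# Number of terms in the sequence and the largest value it reaches.
print(len(sequence), max(sequence))
```

For a starting value of 1000 this reports 112 terms with a peak of 9232, which agrees with the sequence printed by `collatz(1000)`.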


## Task 2

Give an overview of the penguins data set, explaining the types of variables it contains. Suggest the types of variables that should be used to model them in Python, explaining your rationale.

The Palmer penguins dataset is a collection of data about penguins in the Palmer Archipelago, Antarctica, composed of 3 islands. Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER.

The Palmer penguins data contains information about three penguin species in the Palmer Archipelago. The full data collected includes size measurements, clutch observations, and blood isotope ratios; the CSV file used here has 344 rows and 7 columns.

![Penguin's Species](https://miro.medium.com/v2/resize:fit:1400/1*KU-V8tWWQU3nDtw12-bQ_g.png)

Image by Allison Horst (https://allisonhorst.com/)

For each of the 344 penguins, numerical values of bill length/depth, flipper length and body mass were measured, and additional categorical characteristics like sex, species and island were recorded.

This dataset can be found at the following link:

https://github.com/mwaskom/seaborn-data/blob/master/penguins.csv

The previously downloaded CSV file is read with the pandas `read_csv` function and stored in a pandas DataFrame object named `df`.

The `info` method then summarizes the dataset: the columns it contains, the data type of each one, and the number of non-null values.

In [4]:
import pandas as pd
df = pd.read_csv('/Users/stefania/Fundamentals of data analysis/fundamentals-of-data-analysis-project/penguins.csv')
df.head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


The dataset consists of 344 observations with 7 columns referring to the features used to describe these observations. Three of these variables are nominal while the other four are numeric.

- species: penguin species (Chinstrap, Adélie, or Gentoo)
- island: island name (Dream, Torgersen, or Biscoe) in the Palmer Archipelago (Antarctica)
- bill_length_mm: bill length (mm)
- bill_depth_mm: bill depth (mm)
- flipper_length_mm: flipper length (mm)
- body_mass_g: body mass (g)
- sex: penguin sex

Let's analyze the type of variable for each one:

- **Species**: A categorical (nominal) variable giving the species of each penguin. The dataset gathers information about three species: Chinstrap, Adélie and Gentoo.

- **Island**: Like the previous variable, this is a categorical (nominal) variable. It names the island where each observation was collected (Torgersen, Dream or Biscoe).

- **Bill Length**: The length of the penguin's bill, measured along the culmen (the upper ridge of the bill). It is a numerical variable measured in millimeters.

- **Bill Depth**: The depth of the penguin's bill. It is a numerical variable measured in millimeters.

- **Flipper Length**: The length of the penguin's flipper. It is a numerical variable measured in millimeters.

- **Body Mass**: The weight of the penguin. It is a numerical variable measured in grams.

- **Sex**: A nominal variable with only two possible values, female or male. It is missing for 11 of the 344 penguins (333 non-null values).
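One way to reflect these variable types in Python is to store the nominal variables with the pandas `category` dtype and keep the measurements as `float64`, which also accommodates missing values as `NaN`. The following is a minimal sketch under that assumption; the three-row `sample` frame is a hypothetical stand-in built from rows shown earlier, so that the example runs without the CSV file.

```python
import pandas as pd

# Small stand-in for the penguins DataFrame (rows taken from df.head()).
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie"],
    "island": ["Torgersen", "Torgersen", "Torgersen"],
    "bill_length_mm": [39.1, 39.5, None],
    "bill_depth_mm": [18.7, 17.4, None],
    "flipper_length_mm": [181.0, 186.0, None],
    "body_mass_g": [3750.0, 3800.0, None],
    "sex": ["MALE", "FEMALE", None],
})

# Nominal variables become categoricals; measurements stay as floats.
categorical_columns = ["species", "island", "sex"]
typed = sample.astype({column: "category" for column in categorical_columns})

print(typed.dtypes)
```

The same `astype` call could be applied to `df` itself; a categorical column records the fixed set of allowed labels, which suits variables such as species, island and sex.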

To make the bill measurements easier to picture, the following illustration shows how bill length and bill depth are taken:

![Penguin's Bill Length vs Bill Depth](https://allisonhorst.github.io/palmerpenguins/reference/figures/culmen_depth.png)
Image by Allison Horst (https://allisonhorst.com/)

## Task 3

For each of the variables in the penguins data set, suggest what probability distribution from the numpy random distributions list is the most appropriate to model the variable.

According to the penguin dataset, we have the following variables:
- species
- island 
- bill_length_mm
- bill_depth_mm 
- flipper_length_mm
- body_mass_g
- sex

To work out the distribution of each variable, we need to take into account the number of observations for each one. We also need to import numpy to work with random variables.

In [6]:
# Import numpy.
import numpy as np

1. Species

The variable "species" in the penguins dataset is a categorical variable indicating the penguin species. Its distribution shows the frequency or proportion of each species in the dataset. We need to take into account that this variable has three possible outcomes: Chinstrap, Adélie or Gentoo.

The binomial distribution gives the probability of each possible count of one outcome when every observation takes one of two mutually exclusive values under a given set of parameters. Since species has three possible values, it cannot follow a binomial distribution. It can, however, follow a multinomial distribution, which generalizes the binomial to observations that fall into more than two nominal categories (here, three).

In [7]:
# Generate random data for the variable species.
species_counts = df["species"].value_counts()

# Calculate probabilities.
probabilities = species_counts / len(df)

number_of_observations = 150 

multinomial_variable = np.random.multinomial(n=number_of_observations, pvals=probabilities)
print(f"Simulated Observations: {multinomial_variable}")

Simulated Observations: [55 64 31]


Because `value_counts()` sorts the categories by descending frequency (Adélie, Gentoo, Chinstrap), the previous output indicates that there are 55 simulated observations for Adélie, 64 for Gentoo and 31 for Chinstrap.
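Since the multinomial draw returns bare counts, it can help to pair each count with its category label and to seed the generator so the result is reproducible. The sketch below shows one way to do this with NumPy's `default_rng`; the species counts are written out by hand as an assumption so that the example is self-contained (in the notebook they would come from `df["species"].value_counts()`), and the seed value is arbitrary.

```python
import numpy as np

# Species counts assumed for a self-contained example; in the notebook
# these would come from df["species"].value_counts().
species_counts = {"Adelie": 152, "Gentoo": 124, "Chinstrap": 68}

labels = list(species_counts)
counts = np.array(list(species_counts.values()))
probabilities = counts / counts.sum()

# A seeded generator makes the simulation reproducible.
rng = np.random.default_rng(seed=42)
simulated = rng.multinomial(n=150, pvals=probabilities)

# Pair every simulated count with the category it belongs to.
labelled = dict(zip(labels, simulated.tolist()))
print(labelled)
```

Keeping the labels next to the counts avoids misreading which position in the output array belongs to which category.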

2. Island

This variable is similar to the previous one. Its three possible outcomes are Torgersen, Dream and Biscoe, so, as with species, we can say that it follows a multinomial distribution.

In [8]:
# Generate random data for the variable island.
island_counts = df["island"].value_counts()

# Calculate probabilities.
probabilities = island_counts / len(df)

number_of_observations = 150 

multinomial_variable = np.random.multinomial(n=number_of_observations, pvals=probabilities)
print(f"Simulated Observations: {multinomial_variable}")

Simulated Observations: [77 53 20]


Because `value_counts()` sorts the categories by descending frequency (Biscoe, Dream, Torgersen), the previous output indicates that there are 77 simulated observations for Biscoe, 53 for Dream and 20 for Torgersen.

3. Bill_length_mm

For the bill length variable, we can assume, as a first approximation, that it follows a normal (Gaussian) distribution.
A normal distribution describes how the values of a characteristic are spread across a population, for example the height of people in a country or, in this case, the bill length of the penguins. The values cluster symmetrically around a central value, which is both the most frequent value and the middle value of all the observations. This central value coincides (at least approximately, in a sample) with the arithmetic mean, obtained by adding all the recorded values and dividing the result by the number of observations.
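The claim that the middle value and the arithmetic mean coincide for normally distributed data can be checked on a simulated sample. The sketch below is only illustrative: the `loc` and `scale` values are placeholders chosen to resemble a bill length in millimeters, not statistics computed from the dataset, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# loc and scale are illustrative placeholders, not values from the dataset.
sample = rng.normal(loc=44.0, scale=5.5, size=10_000)

sample_mean = sample.mean()
sample_median = np.median(sample)

# For a symmetric, bell-shaped distribution the two should be very close.
print(f"mean={sample_mean:.2f}, median={sample_median:.2f}")
```

If the same comparison on the real column showed a clear gap between mean and median, that would be a hint that the normal assumption deserves a closer look.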

In [9]:
# Generate random data for the variable bill_length_mm following a normal distribution.
mean_bill_length = df["bill_length_mm"].mean()
std_dev_bill_length = df["bill_length_mm"].std()

number_of_observations = 150 

# Simulate a normal distribution
bill_length_simulated_data = np.random.normal(loc=mean_bill_length, scale=std_dev_bill_length, size=number_of_observations)
print(f"Simulated Bill Lengths: {bill_length_simulated_data}")

Simulated Bill Lengths: [47.66765716 36.42775438 40.07405774 38.49705488 44.24976496 32.54915446
 42.14988544 35.62696594 41.6657833  40.3303799  44.10635307 40.27921118
 39.34158304 49.30489607 43.18868201 28.98552867 32.86707958 53.59075411
 44.79422909 34.18892922 35.53325277 47.37456696 47.38743868 46.6219442
 43.17013338 39.52036525 38.46215303 45.11577534 47.881817   32.95722051
 45.7094079  51.92067209 41.07779556 45.11725326 49.17559793 45.17119662
 35.4924268  41.91978104 33.48648186 44.35791759 41.40529027 44.75286552
 48.54090263 49.08650843 49.51886456 45.95153329 43.50375949 52.52037708
 36.7205917  42.40678686 45.61151987 45.30766987 47.64472005 39.49300772
 53.12749546 31.43882449 40.37599163 40.49209993 43.83358244 40.22493164
 47.99990257 39.66567403 37.76058727 57.50219759 52.88514941 43.90880575
 42.97144078 42.46105293 38.16803752 49.66027633 45.26560201 48.0763331
 52.18033657 46.74687602 55.26931366 48.40118966 41.31498319 44.99285759
 46.6150979  51.99561943 46.0

4. Bill_depth_mm 

As with the previous variable, and because it describes a natural physical measurement, we will assume that bill depth follows a normal (Gaussian) distribution.

In [10]:
# Generate random data for the variable bill_depth_mm following a normal distribution.
mean_bill_depth = df["bill_depth_mm"].mean()
std_dev_bill_depth = df["bill_depth_mm"].std()

number_of_observations = 150 

# Simulate a normal distribution
bill_depth_simulated_data = np.random.normal(loc=mean_bill_depth, scale=std_dev_bill_depth, size=number_of_observations)
print(f"Simulated Bill Depths: {bill_depth_simulated_data}")

Simulated Bill Depths: [19.0321071  14.22920099 15.70172326 18.59129279 17.27548668 17.12678741
 18.38747166 19.48526906 14.19971761 18.44819479 17.95122659 21.05388672
 17.17710784 14.99177133 17.80943924 16.46590192 17.00788694 16.901444
 17.48582803 17.3095427  15.11900309 15.75857167 15.7097517  17.61644221
 15.35837721 13.13361357 17.38591985 19.11662875 17.5283298  17.09004376
 18.61123824 17.80721658 16.61665079 14.34204744 14.44244739 14.46389071
 15.44451009 14.89243091 20.49036258 14.8639892  16.77560354 13.89738318
 19.15077571 17.28180662 17.21649155 17.64323018 16.10751952 19.08959957
 16.15649621 21.92650637 18.07273221 22.58091377 20.15830665 17.89335937
 15.36830966 16.56773871 18.33623142 17.06639814 17.57451466 14.86176329
 19.25409572 19.22855068 20.04135926 17.40348887 15.60105971 13.10163535
 14.90211706 17.99539051 13.47109125 14.89685703 17.37661282 19.2826428
 17.39113245 17.19987254 17.17612573 18.05227063 14.3021079  18.2117373
 19.83560168 16.71325466 14.380

5. Flipper_length_mm

Here again, since this variable describes a physical characteristic of the penguins, we can assume that it is normally distributed.

In [11]:
# Generate random data for the variable flipper_length_mm following a normal distribution.
mean_flipper_length = df["flipper_length_mm"].mean()
std_dev_flipper_length = df["flipper_length_mm"].std()

number_of_observations = 150 

# Simulate a normal distribution
flipper_length_simulated_data = np.random.normal(loc=mean_flipper_length, scale=std_dev_flipper_length, size=number_of_observations)
print(f"Simulated Flipper Lengths: {flipper_length_simulated_data}")

Simulated Flipper Lengths: [206.92526696 198.31390664 204.39452346 195.78947683 210.02072886
 213.73523558 182.90839755 192.20937126 181.01139434 190.13791905
 191.03090149 193.64967552 192.66600498 230.89715088 206.03103042
 198.00284154 190.04297875 188.04086528 202.09439424 213.86352683
 198.65804565 200.69100085 171.44736663 205.16359239 205.1095906
 208.00808533 190.17523341 190.95389724 217.78241114 172.98457607
 205.83529509 204.97584168 187.3669333  229.37699346 194.64221357
 209.75916563 219.61639123 177.48658753 216.96747693 194.62117264
 211.68936016 208.90680391 196.64686969 211.21813145 192.54840439
 192.55614128 200.26040866 209.94291651 188.25183615 191.10585258
 201.69087739 216.57422565 200.03141286 195.32306262 207.89417807
 224.78747404 223.5523912  229.5127267  188.30714078 196.09597364
 194.52438827 206.38521227 183.89497919 171.71247802 175.60690327
 224.95932363 166.49790641 217.24291022 201.50164786 200.21515662
 233.61908491 187.43143271 191.79280088 216.86875195 

6. Body_mass_g

In this case, we can also assume that the body mass variable follows a Normal Distribution.

In [12]:
# Generate random data for the variable body_mass_g following a normal distribution.
mean_body_mass = df["body_mass_g"].mean()
std_dev_body_mass = df["body_mass_g"].std()

number_of_observations = 150 

# Simulate a normal distribution
body_mass_simulated_data = np.random.normal(loc=mean_body_mass, scale=std_dev_body_mass, size=number_of_observations)
print(f"Simulated Body Mass: {body_mass_simulated_data}")

Simulated Body Mass: [5966.26595963 3128.0964076  4278.40615934 3382.89493763 3683.27578539
 5594.82511126 2935.8561805  3840.94460918 4042.13557761 4934.99463254
 4502.99558676 4898.30484544 3338.09284453 3931.03225485 4229.75568053
 4818.63808124 3475.91242261 4445.51369772 4958.68388385 3030.830839
 4510.04347844 3786.42526927 5762.55712816 3951.273291   5420.28989738
 5446.21742987 5197.77565078 5620.71868326 3845.07587641 3346.68076686
 5428.86442563 3526.47724847 4256.88618566 5326.21347852 4246.17711912
 3678.61154981 5176.5285663  3637.85288326 5228.28949511 4039.18510034
 4802.84117996 3868.14205481 3395.35219627 4144.98554956 5208.17692632
 3682.29517971 4940.841297   5298.41915963 5540.15198951 4152.13376126
 3591.62247987 4593.79782093 3917.98533794 2916.07171918 5501.41612441
 4311.22455222 4033.03056089 4479.02774241 2444.72097293 6082.84297022
 3271.75548827 5683.79583173 2728.09514799 4251.70145622 5248.71544454
 4414.11481826 5360.47242841 4926.40232512 2893.69565168 4
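A long array of simulated values is hard to judge by eye, so a compact alternative is to compare the summary statistics of the simulated sample with the parameters used to generate it. The sketch below assumes placeholder values for the mean and standard deviation (in the notebook they would be `df["body_mass_g"].mean()` and `df["body_mass_g"].std()`); `simulate_normal` is a hypothetical helper, and the seed is arbitrary.

```python
import numpy as np


def simulate_normal(mean, std_dev, size, rng):
    """Draw `size` values from a normal distribution (hypothetical helper)."""
    return rng.normal(loc=mean, scale=std_dev, size=size)


rng = np.random.default_rng(seed=1)

# Placeholder parameters; replace with the statistics of the real column.
target_mean, target_std = 4200.0, 800.0
simulated = simulate_normal(target_mean, target_std, size=150, rng=rng)

# ddof=1 gives the sample standard deviation, matching pandas' default.
print(f"target mean={target_mean:.1f}, simulated mean={simulated.mean():.1f}")
print(f"target std={target_std:.1f}, simulated std={simulated.std(ddof=1):.1f}")
```

With only 150 draws the simulated statistics will not match the targets exactly, but they should land in the same neighbourhood.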

7. Sex 

The variable sex has a different distribution. A binomial distribution is a probability distribution used when each trial has exactly two possible outcomes that are mutually exclusive.
In this case we can assume that the variable sex follows a binomial distribution, since it has only two possible outcomes, male or female, and they are mutually exclusive.

Accordingly, we generate a random sex variable following a binomial distribution.
The binomial distribution is described by two parameters: 'n', the number of trials performed, and 'p', the probability of success on each trial.

In [13]:
# We need to know the proportion of males and females in the original dataset.
sex_probabilities = df["sex"].value_counts(normalize=True)
print(sex_probabilities)

sex
MALE      0.504505
FEMALE    0.495495
Name: proportion, dtype: float64
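The two binomial parameters can be made explicit by drawing one trial per penguin, with "success" arbitrarily defined as the penguin being male. The following is a minimal sketch under that assumption, using the male proportion printed above and NumPy's `Generator.binomial`; the seed value and the variable names are illustrative choices, not part of the original notebook.

```python
import numpy as np

# Probability of "success" (male), taken from the proportions printed above.
p_male = 0.504505
# One trial per penguin in the dataset (344 rows).
n_penguins = 344

rng = np.random.default_rng(seed=7)

# Each draw is a single trial (n=1): 1 means male, 0 means female.
draws = rng.binomial(n=1, p=p_male, size=n_penguins)
simulated_sex = np.where(draws == 1, "MALE", "FEMALE")

n_males = int(draws.sum())
print(f"Simulated males: {n_males}, females: {n_penguins - n_males}")
```

Summing the individual trials gives the number of males, which is the quantity a binomial distribution with n = 344 and p = 0.504505 describes.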


In [14]:
# With the probabilities known, set the sample size to the number of rows in the dataset.
num_penguins = len(df)

# Generate random sexes based on the observed probabilities.
# With two categories, each draw is a single binomial (Bernoulli) trial.
random_sex = np.random.choice(sex_probabilities.index, size=num_penguins, p=sex_probabilities.values)

# Store the result in a new column so the original 'sex' data is not overwritten.
df['sex_simulated'] = random_sex

# Display the new 'sex' variable.
print(random_sex)

['MALE' 'FEMALE' 'MALE' 'MALE' 'FEMALE' 'FEMALE' 'FEMALE' 'MALE' 'MALE'
 'FEMALE' 'FEMALE' 'FEMALE' 'FEMALE' 'MALE' 'MALE' 'MALE' 'FEMALE'
 'FEMALE' 'FEMALE' 'MALE' 'MALE' 'FEMALE' 'FEMALE' 'FEMALE' 'MALE'
 'FEMALE' 'FEMALE' 'MALE' 'FEMALE' 'FEMALE' 'FEMALE' 'MALE' 'FEMALE'
 'MALE' 'MALE' 'MALE' 'MALE' 'MALE' 'FEMALE' 'FEMALE' 'FEMALE' 'MALE'
 'MALE' 'MALE' 'FEMALE' 'FEMALE' 'FEMALE' 'MALE' 'FEMALE' 'MALE' 'MALE'
 'FEMALE' 'MALE' 'FEMALE' 'FEMALE' 'FEMALE' 'MALE' 'MALE' 'MALE' 'MALE'
 'MALE' 'MALE' 'FEMALE' 'FEMALE' 'FEMALE' 'FEMALE' 'FEMALE' 'MALE'
 'FEMALE' 'MALE' 'MALE' 'MALE' 'MALE' 'MALE' 'MALE' 'FEMALE' 'FEMALE'
 'FEMALE' 'FEMALE' 'FEMALE' 'FEMALE' 'FEMALE' 'FEMALE' 'FEMALE' 'MALE'
 'MALE' 'MALE' 'FEMALE' 'MALE' 'FEMALE' 'FEMALE' 'MALE' 'MALE' 'FEMALE'
 'MALE' 'MALE' 'FEMALE' 'FEMALE' 'MALE' 'FEMALE' 'FEMALE' 'FEMALE'
 'FEMALE' 'FEMALE' 'MALE' 'MALE' 'FEMALE' 'FEMALE' 'FEMALE' 'FEMALE'
 'FEMALE' 'MALE' 'MALE' 'MALE' 'FEMALE' 'FEMALE' 'MALE' 'MALE' 'MALE'
 'MALE' 'MALE' 'FEMALE' '

***

## End