# Fundamentals of Data Analysis Tasks

**Stefania Verduga**

***

## Task 1

The Collatz conjecture

In [1]:
def f(x):
    # If x is even, divide it by two.
    if x % 2 == 0:
        return x // 2
    else:
        # If x is odd, triple it and add one.
        return (3 * x) + 1

In [2]:
def collatz(x):
    # Print each term of the sequence until it reaches 1.
    while x != 1:
        print(x, end=', ')
        x = f(x)
    # Print the final term, 1.
    print(x)

In [3]:
collatz(1000)

1000, 500, 250, 125, 376, 188, 94, 47, 142, 71, 214, 107, 322, 161, 484, 242, 121, 364, 182, 91, 274, 137, 412, 206, 103, 310, 155, 466, 233, 700, 350, 175, 526, 263, 790, 395, 1186, 593, 1780, 890, 445, 1336, 668, 334, 167, 502, 251, 754, 377, 1132, 566, 283, 850, 425, 1276, 638, 319, 958, 479, 1438, 719, 2158, 1079, 3238, 1619, 4858, 2429, 7288, 3644, 1822, 911, 2734, 1367, 4102, 2051, 6154, 3077, 9232, 4616, 2308, 1154, 577, 1732, 866, 433, 1300, 650, 325, 976, 488, 244, 122, 61, 184, 92, 46, 23, 70, 35, 106, 53, 160, 80, 40, 20, 10, 5, 16, 8, 4, 2, 1
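
The printed sequence above is for a single starting value. As a minimal sketch, the same rule can return the sequence as a list so that it can be checked for many starting values. The names `collatz_sequence` and `reaches_one_up_to`, and the limit of 10,000, are illustrative assumptions and are not part of the original notebook.

```python
def collatz_sequence(x):
    """Return the Collatz sequence starting at x as a list ending in 1."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    sequence = [x]
    while x != 1:
        # Same rule as f(x): halve even numbers, triple odd numbers and add one.
        x = x // 2 if x % 2 == 0 else 3 * x + 1
        sequence.append(x)
    return sequence


def reaches_one_up_to(limit):
    """Check that every starting value from 1 to limit reaches 1.

    The loop in collatz_sequence only stops at 1, so this function returning
    at all is the evidence that each sequence terminated.
    """
    return all(collatz_sequence(n)[-1] == 1 for n in range(1, limit + 1))


print(collatz_sequence(6))
print(reaches_one_up_to(10_000))
```

Returning a list instead of printing keeps the function reusable, for example to measure sequence lengths.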


***
## Task 2

Give an overview of the penguins data set, explaining the types of variables it contains. Suggest the types of variables that should be used to model them in Python, explaining your rationale.

The Palmer penguins dataset is a collection of data about penguins observed on three islands of the Palmer Archipelago, Antarctica. Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER.

The full Palmer penguins data cover three penguin species and include size measurements, clutch observations, and blood isotope ratios. The simplified version used here keeps the size measurements together with species, island and sex: 344 rows and 7 columns.

![Penguin's Species](https://miro.medium.com/v2/resize:fit:1400/1*KU-V8tWWQU3nDtw12-bQ_g.png)

Image by Allison Horst (https://allisonhorst.com/)

For each of the 344 penguins, numerical values of bill length/depth, flipper length and body mass were measured, and additional categorical characteristics like sex, species and island were recorded.

This dataset can be found at the following link:

https://github.com/mwaskom/seaborn-data/blob/master/penguins.csv

The previously downloaded CSV file is read using the pandas 'read_csv' function and stored in a pandas DataFrame object named 'df'.

Using the 'info' method we get the relevant information about this dataset, such as the data type and the number of non-null values of each column.

In [4]:
import pandas as pd
df = pd.read_csv('/Users/stefania/Fundamentals of data analysis/fundamentals-of-data-analysis-project/penguins.csv')
df.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


The dataset consists of 344 observations with 7 columns referring to the features used to describe these observations. Three of these variables are nominal while the other four are numeric.

- species: penguin species (Chinstrap, Adélie, or Gentoo)
- island: island name (Dream, Torgersen, or Biscoe) in the Palmer Archipelago (Antarctica)
- bill_length_mm: bill length (mm)
- bill_depth_mm: bill depth (mm)
- flipper_length_mm: flipper length (mm)
- body_mass_g: body mass (g)
- sex: penguin sex

Let's analyze the type of variable for each one:

- **Species**: A categorical (nominal) variable giving the species of each penguin. The dataset gathers information about three species: Chinstrap, Adélie and Gentoo.

- **Island**: Like the previous variable, this is a categorical (nominal) variable. It gives the name of the island where each observation was collected (Torgersen, Dream or Biscoe).

- **Bill Length**: The length of the culmen, the upper ridge of the penguin's bill. It is a numerical variable measured in millimeters.

- **Bill Depth**: The depth of the penguin's bill. It is a numerical variable measured in millimeters.

- **Flipper Length**: The length of the penguin's flipper. It is a numerical variable measured in millimeters.

- **Body Mass**: The weight of the penguin. It is a numerical variable measured in grams.

- **Sex**: The sex of the penguin. It is a nominal variable with only two possible values, female or male, and it is missing for 11 of the 344 penguins.
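
One possible way to model these variable types in pandas is sketched below: the nominal variables are stored with the `category` dtype and the measurements stay as `float64`, which can also represent missing values as NaN. The small `sample` frame is built from the five rows shown by `df.head()` above, and the names `sample` and `typed` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# The five rows shown by df.head(), with missing values written as None.
sample = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "island": ["Torgersen"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, None, 36.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, None, 19.3],
    "flipper_length_mm": [181.0, 186.0, 195.0, None, 193.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, None, 3450.0],
    "sex": ["MALE", "FEMALE", "FEMALE", None, "FEMALE"],
})

# Nominal variables become categories; the measurements remain floats.
typed = sample.astype({
    "species": "category",
    "island": "category",
    "sex": "category",
})

print(typed.dtypes)
```

The `category` dtype records the fixed set of labels explicitly, while floats are a natural fit for continuous measurements that may be missing.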

The following picture illustrates how the penguin's bill length and bill depth are measured:

![Penguin's Bill Length vs Bill Depth](https://allisonhorst.github.io/palmerpenguins/reference/figures/culmen_depth.png)
Image by Allison Horst (https://allisonhorst.com/)

***
## Task 3

For each of the variables in the penguins data set, suggest what probability distribution from the numpy random distributions list is the most appropriate to model the variable.

According to the penguin dataset, we have the following variables:
- species
- island 
- bill_length_mm
- bill_depth_mm 
- flipper_length_mm
- body_mass_g
- sex

In order to figure out the distribution of each variable, we need to take into account the number of observations for each one. We also need to import numpy to work with random variables.

In [6]:
# Import numpy.
import numpy as np

1. Species

The variable "species" in the penguins dataset is a categorical variable indicating the penguin species. The distribution of this variable shows the frequency or proportion of each species in the dataset. We need to take into account that this variable has three possible outcomes: Chinstrap, Adélie or Gentoo.

The binomial distribution gives the probability of the number of successes in a fixed number of independent trials, each of which has only two possible outcomes. Since species has three possible outcomes, it cannot follow a binomial distribution, but it can follow a multinomial distribution, which generalizes the binomial to more than two categories.

In [7]:
# Count the observed penguins of each species.
species_counts = df["species"].value_counts()

# Calculate the observed probability of each species.
probabilities = species_counts / len(df)

# Number of simulated penguins.
number_of_observations = 150

# Simulate species counts from a multinomial distribution.
multinomial_variable = np.random.multinomial(n=number_of_observations, pvals=probabilities)
print(f"Simulated Observations: {multinomial_variable}")

Simulated Observations: [72 53 25]


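The counts follow the order returned by 'value_counts', which sorts the categories from most to least frequent: Adélie, Gentoo and Chinstrap. The previous output therefore indicates 72 simulated observations for Adélie, 53 for Gentoo and 25 for Chinstrap.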
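
Because the simulated counts come back as a bare array, it is easy to attach them to the wrong label. The sketch below pairs each count with its category name. The counts 152, 124 and 68 are assumed here to be the species counts in this copy of the dataset, and the seed and the names `rng`, `pvals` and `labelled` are illustrative choices rather than part of the original notebook.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed is arbitrary).
rng = np.random.default_rng(seed=42)

# Assumed observed counts, ordered from most to least frequent.
species_counts = {"Adelie": 152, "Gentoo": 124, "Chinstrap": 68}
total = sum(species_counts.values())

# Observed proportions used as multinomial probabilities.
pvals = [count / total for count in species_counts.values()]

# Simulate 150 penguins and keep each count next to its species label.
simulated = rng.multinomial(n=150, pvals=pvals)
labelled = dict(zip(species_counts, simulated.tolist()))

print(labelled)
```

Keeping the labels and counts in one dictionary avoids relying on the reader remembering the category order.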

2. Island

This variable is similar to the previous one. The three possible outcomes for this variable are Torgersen, Dream and Biscoe. So, as with the previous variable, we can say that it follows a multinomial distribution.

In [8]:
# Count the observed penguins on each island.
island_counts = df["island"].value_counts()

# Calculate the observed probability of each island.
probabilities = island_counts / len(df)

# Number of simulated penguins.
number_of_observations = 150

# Simulate island counts from a multinomial distribution.
multinomial_variable = np.random.multinomial(n=number_of_observations, pvals=probabilities)
print(f"Simulated Observations: {multinomial_variable}")

Simulated Observations: [64 55 31]


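As before, the counts follow the order returned by 'value_counts', from most to least frequent: Biscoe, Dream and Torgersen. The previous output therefore indicates 64 simulated observations for Biscoe, 55 for Dream and 31 for Torgersen.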

3. Bill_length_mm

In the case of the bill length variable, we can assume that it follows a normal (Gaussian) distribution.
A normal distribution describes how the values of a characteristic are spread in a population, for example the height of people in a country or, in this case, the bill length of the penguins. For such characteristics, the most frequent value is also the value located in the middle of all the observations recorded in that population. If we calculate the arithmetic mean, adding all the recorded values and dividing the result by the number of observations, it will coincide with that value (or be very similar).
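
The claim that the mean and the middle value coincide for normally distributed data can be illustrated with a short sketch. The mean of 44.0 mm and standard deviation of 5.5 mm are illustrative assumptions, not values computed from the dataset, and the seed and sample size are arbitrary.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed is arbitrary).
rng = np.random.default_rng(seed=0)

# Illustrative parameters, assumed for this sketch only.
assumed_mean = 44.0
assumed_std = 5.5

# Draw a large sample from a normal distribution.
sample = rng.normal(loc=assumed_mean, scale=assumed_std, size=10_000)

# For normal data the mean and the median should be very close.
sample_mean = float(sample.mean())
sample_median = float(np.median(sample))

print(f"mean: {sample_mean:.2f}, median: {sample_median:.2f}")
```

If the observed bill lengths showed a large gap between mean and median, that would be a hint that the normal assumption deserves a closer look.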

In [9]:
# Generate random data for the variable bill_length_mm following a normal distribution.
mean_bill_length = df["bill_length_mm"].mean()
std_dev_bill_length = df["bill_length_mm"].std()

number_of_observations = 150 

# Simulate a normal distribution
bill_length_simulated_data = np.random.normal(loc=mean_bill_length, scale=std_dev_bill_length, size=number_of_observations)
print(f"Simulated Bill Lengths: {bill_length_simulated_data}")

Simulated Bill Lengths: [35.45815215 36.84078711 50.69904897 55.00991329 44.28911586 35.51303234
 48.39483325 36.97522463 52.72598771 37.49916522 39.65517751 43.27844258
 44.24482514 45.93840407 36.83885268 42.77835907 48.13784307 41.79471494
 41.26938717 43.21901224 49.04335517 46.45024838 54.85544529 40.78812746
 47.99970324 48.7008385  42.11027323 37.08775119 44.70805115 38.70854453
 30.15445264 49.82468534 44.00985272 48.62910492 38.08687046 42.76543268
 48.4118042  46.05444714 44.03037151 39.92875795 46.36178898 49.6583258
 53.0139964  43.15442237 40.82050775 52.86870781 41.86418625 38.70251744
 42.989581   52.87243195 41.57906504 46.8891296  36.09098817 46.87564422
 48.68198865 43.76384603 47.83033433 38.52001799 55.10865156 44.86948848
 52.61156252 40.92961534 40.11541065 51.61234497 47.72664391 42.78493928
 47.7117112  48.23426479 44.68558653 39.81477459 39.97289977 41.29247749
 42.03603166 48.20010436 33.01388386 47.13401268 49.02160096 39.61060168
 50.31222887 44.3956002  40.

4. Bill_depth_mm 

Just as with the previous variable, and since it is a natural phenomenon, we are going to consider that this variable follows a normal (Gaussian) distribution.

In [10]:
# Generate random data for the variable bill_depth_mm following a normal distribution.
mean_bill_depth = df["bill_depth_mm"].mean()
std_dev_bill_depth = df["bill_depth_mm"].std()

number_of_observations = 150 

# Simulate a normal distribution
bill_depth_simulated_data = np.random.normal(loc=mean_bill_depth, scale=std_dev_bill_depth, size=number_of_observations)
print(f"Simulated Bill Depths: {bill_depth_simulated_data}")

Simulated Bill Depths: [17.06836232 17.77810954 14.80868511 19.73340984 18.61265106 18.74168195
 19.04649489 15.07415153 19.59927887 23.82873551 18.25814201 19.90876467
 17.26998032 16.54513785 18.23225581 14.30822273 18.66963928 19.50230612
 18.30523519 16.69866795 16.25849605 17.66328034 21.49085092 14.13448174
 15.69931981 15.38238371 11.24801204 21.68945602 15.39621945 14.71585114
 18.93618706 17.35076791 16.78256133 15.78135914 18.30077301 14.93657949
 16.0746646  19.54389617 17.31819808 17.11036281 13.69036681 15.32915416
 17.61515634 16.1320801  19.50149136 15.15293169 18.84872409 13.24065403
 14.79816327 20.04153954 19.38370161 18.18430392 17.59565195 17.47697588
 14.41869832 16.08411416 15.33877285 18.33926301 17.1461142  16.56341833
 16.10669374 11.37482332 18.52376206 17.72606809 17.42912465 19.78889143
 16.39442901 18.92834823 16.18745878 15.32202517 18.05492242 21.2606031
 20.94974514 17.62167951 16.35206201 16.10866503 15.79907688 17.24404084
 13.72585896 20.78253954 14.

5. Flipper_length_mm

Again, as this variable describes a physical characteristic of the penguins, we can assume it is normally distributed.

In [11]:
# Generate random data for the variable flipper_length_mm following a normal distribution.
mean_flipper_length = df["flipper_length_mm"].mean()
std_dev_flipper_length = df["flipper_length_mm"].std()

number_of_observations = 150 

# Simulate a normal distribution
flipper_length_simulated_data = np.random.normal(loc=mean_flipper_length, scale=std_dev_flipper_length, size=number_of_observations)
print(f"Simulated Flipper Lengths: {flipper_length_simulated_data}")

Simulated Flipper Lengths: [205.76040286 179.7278969  193.48440943 206.42087209 211.24831954
 193.17372887 229.70353061 169.68879943 218.2435462  221.93934253
 215.52887413 178.29505682 202.74238217 169.23474678 201.72234461
 186.03934806 177.06752895 210.41517575 208.81591759 201.95895745
 198.70571542 208.84432969 184.97395297 212.56884875 199.28261938
 187.14307631 177.46421344 222.85753167 178.88945371 193.42129643
 201.41089279 186.77722492 212.15887583 207.21863684 204.67497605
 197.93642269 192.40657753 190.94555177 209.00189398 210.66689525
 207.15339852 208.9950184  204.98925256 200.87346786 188.80823416
 200.38261322 199.64769686 227.26489721 183.74804004 190.85943811
 234.14701356 222.57625426 201.72636702 201.41050583 199.93229423
 217.279716   198.6554775  243.11552213 177.62464173 199.16816089
 187.13386935 179.72565481 198.22517538 200.65247989 186.90573407
 181.06177539 197.63196371 178.4833419  194.15766197 201.86078549
 214.64987047 191.29453555 198.22660828 203.74203908

6. Body_mass_g

In this case, we can also assume that the body mass variable follows a Normal Distribution.

In [12]:
# Generate random data for the variable body_mass_g following a normal distribution.
mean_body_mass = df["body_mass_g"].mean()
std_dev_body_mass = df["body_mass_g"].std()

number_of_observations = 150 

# Simulate a normal distribution
body_mass_simulated_data = np.random.normal(loc=mean_body_mass, scale=std_dev_body_mass, size=number_of_observations)
print(f"Simulated Body Mass: {body_mass_simulated_data}")

Simulated Body Mass: [4770.17272131 5023.21050142 4479.70174628 6288.43450867 4252.83625662
 3577.68174367 3597.43440296 4550.32171622 2964.70644053 5605.11874874
 2947.96926475 3561.66239737 4503.5484538  4986.84565341 4643.16585907
 5703.94788548 5085.29944859 5290.02428879 4355.21484462 3536.31617231
 3268.75724571 4586.26724524 3947.0934752  4046.66249984 3674.61900772
 3796.85833857 3455.28575359 3644.82683721 4080.39495142 4424.8736931
 2474.12084348 4561.7008691  4407.85170983 4843.16310951 5464.28503164
 4275.39590013 3086.549698   4615.07476751 3544.72603066 3413.91512661
 4558.14297716 6050.85420409 2438.65311854 3959.15502533 4831.13033785
 4044.13974591 4902.45541111 6157.91016834 4669.6255872  3439.9415294
 3819.36125849 5331.53392212 5396.43973782 4074.53604502 5613.42228448
 4910.21655615 3309.43879152 3435.61824132 4624.47671079 4893.76037054
 4526.56665262 5192.68307616 4749.77137065 4667.34649211 3872.68272689
 3998.08280227 4437.48130723 3500.06077935 4207.97958075 4

7. Sex 

The variable sex has a different distribution. A binomial distribution is a probability distribution used when each trial has exactly two possible outcomes that are mutually exclusive.
So, in this case we can assume that the variable sex follows a binomial distribution, as it has only two possible outcomes, male or female, and they are mutually exclusive.

Accordingly, we generate a random sex variable following a binomial distribution.
The binomial distribution is described by two parameters: 'n', the number of trials performed, and 'p', the probability of success in each trial. For a single penguin, n = 1, the special case known as the Bernoulli distribution.

In [13]:
# We need to know the proportion of males and females in the original dataset.
sex_probabilities = df["sex"].value_counts(normalize=True)
print(sex_probabilities)

sex
MALE      0.504505
FEMALE    0.495495
Name: proportion, dtype: float64


In [14]:
# Once we have the probabilities, we set the sample size to the number of penguins in the dataset.
num_penguins = len(df)

# Generate random sexes based on the observed probabilities.
random_sex = np.random.choice(sex_probabilities.index, size=num_penguins, p=sex_probabilities.values)

# Replace the 'sex' column in the DataFrame with the simulated values.
df['sex'] = random_sex

# Display the simulated 'sex' variable.
print(random_sex)

['MALE' 'FEMALE' 'FEMALE' 'FEMALE' 'FEMALE' 'FEMALE' 'MALE' 'MALE'
 'FEMALE' 'FEMALE' 'FEMALE' 'MALE' 'MALE' 'MALE' 'MALE' 'MALE' 'MALE'
 'MALE' 'MALE' 'MALE' 'MALE' 'MALE' 'FEMALE' 'MALE' 'FEMALE' 'FEMALE'
 'FEMALE' 'FEMALE' 'FEMALE' 'FEMALE' 'MALE' 'MALE' 'MALE' 'MALE' 'MALE'
 'MALE' 'FEMALE' 'MALE' 'MALE' 'MALE' 'FEMALE' 'FEMALE' 'MALE' 'MALE'
 'FEMALE' 'FEMALE' 'MALE' 'FEMALE' 'MALE' 'MALE' 'FEMALE' 'MALE' 'MALE'
 'FEMALE' 'MALE' 'FEMALE' 'MALE' 'FEMALE' 'FEMALE' 'MALE' 'MALE' 'FEMALE'
 'MALE' 'FEMALE' 'FEMALE' 'MALE' 'MALE' 'FEMALE' 'FEMALE' 'MALE' 'FEMALE'
 'MALE' 'FEMALE' 'FEMALE' 'FEMALE' 'FEMALE' 'MALE' 'MALE' 'FEMALE'
 'FEMALE' 'MALE' 'MALE' 'MALE' 'FEMALE' 'FEMALE' 'FEMALE' 'FEMALE'
 'FEMALE' 'MALE' 'MALE' 'MALE' 'FEMALE' 'FEMALE' 'MALE' 'FEMALE' 'FEMALE'
 'MALE' 'MALE' 'FEMALE' 'MALE' 'MALE' 'FEMALE' 'FEMALE' 'MALE' 'FEMALE'
 'MALE' 'FEMALE' 'MALE' 'MALE' 'MALE' 'FEMALE' 'FEMALE' 'MALE' 'MALE'
 'MALE' 'FEMALE' 'FEMALE' 'FEMALE' 'MALE' 'MALE' 'FEMALE' 'MALE' 'MALE'
 'MALE' '
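
The cell above draws the labels with `np.random.choice`. As a hedged alternative, the same kind of sample can be drawn from the binomial distribution itself, using one trial per penguin. The probability 0.504505 is the proportion of males printed above; the seed and the names `rng`, `draws` and `simulated_sex` are illustrative assumptions.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed is arbitrary).
rng = np.random.default_rng(seed=1)

p_male = 0.504505   # observed proportion of males printed above
n_penguins = 344    # number of rows in the dataset

# One binomial trial per penguin: 1 means male, 0 means female.
draws = rng.binomial(n=1, p=p_male, size=n_penguins)

# Map the 0/1 outcomes back to the labels used in the dataset.
simulated_sex = np.where(draws == 1, "MALE", "FEMALE")

n_males = int(draws.sum())
print(f"Simulated males: {n_males} of {n_penguins}")
```

Summing the 0/1 draws gives the number of males, which is itself one draw from a binomial distribution with n = 344.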

***
## Task 4

Suppose you are flipping two coins, each with a probability p of giving heads. Plot the entropy of the total number of heads versus p.

The entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. For a random variable X with n possible outcomes x_1, …, x_n and probabilities P(x_1), …, P(x_n), the entropy is given by:

$$ H(X) = - \sum_{i} P(x_i) \log_2(P(x_i)) $$

When you toss a fair coin, the probability of getting heads or tails is the same.
In each case, the probability is ½ or 0.5. In other words, “heads” is one of two equally likely outcomes. The same is true for tails.
The probability of multiple independent events is found by multiplying the probabilities of the individual events. For example, the probability of getting heads and then tails (HT) is ½ x ½ = ¼.

A “fair coin” is one which has an equal probability of landing heads or tails in a coin toss. So, if the coin from this exercise is fair and has no defects, the probability of getting heads (p) is 50% and the probability of getting tails (1-p) is 50%. For other values of p, the same multiplication rule applies with p and 1-p in place of ½.

In [15]:
import matplotlib.pyplot as plt
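
The remaining step, the entropy of the total number of heads as a function of p, might be sketched as follows. It assumes the two flips are independent, so the number of heads is 0, 1 or 2 with probabilities (1-p)², 2p(1-p) and p². The function names and the number of plotted points are illustrative choices, and this is one possible approach rather than the notebook's own solution.

```python
import math


def heads_distribution(p):
    """Probabilities of 0, 1 and 2 heads for two independent flips."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]


def entropy_bits(probabilities):
    """Shannon entropy in bits; outcomes with zero probability contribute nothing."""
    return -sum(q * math.log2(q) for q in probabilities if q > 0)


def two_coin_entropy(p):
    """Entropy of the total number of heads when each coin shows heads with probability p."""
    return entropy_bits(heads_distribution(p))


def plot_two_coin_entropy(num_points=201):
    """Plot the entropy against p (matplotlib is imported only when plotting)."""
    import matplotlib.pyplot as plt

    ps = [i / (num_points - 1) for i in range(num_points)]
    hs = [two_coin_entropy(p) for p in ps]
    fig, ax = plt.subplots()
    ax.plot(ps, hs)
    ax.set_xlabel("p (probability of heads)")
    ax.set_ylabel("Entropy of number of heads (bits)")
    ax.set_title("Entropy of the total number of heads for two coins")
    plt.show()


print(two_coin_entropy(0.5))
```

Calling `plot_two_coin_entropy()` draws the curve, which is zero at p = 0 and p = 1 and symmetric around p = 0.5.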

***

## End