## Topic 3 - Bias 

<b> Student: Lais Coletta Pereira </b>
***

### Exercise 1

<b>1) Give three real-world examples of different types of cognitive bias.</b>


As discussed in this bias module, cognitive biases are tendencies or patterns in decision-making, usually shaped by cultural and personal experiences, that lead to distorted perceptions of reality. And while data might seem objective, it is collected and analyzed by humans and can therefore be biased.

According to the website https://www.metabase.com/blog/6-most-common-type-of-data-bias-in-data-analysis there are many different types of cognitive bias. A very common example in data analytics is <b>confirmation bias</b>: the tendency to search for, interpret, or remember information in a way that confirms our own preconceptions or hypotheses. One real-life example is a detective trying to identify the suspects of a crime under investigation. In theory, detectives must look at all the evidence objectively and seek additional data to solve the crime. However, they may have seen similar situations so many times in their career that they develop a “working theory” about what happened.
Once they start investigating, they already have a theory and may begin to search only for evidence consistent with it. They may interview witnesses who fit a certain stereotype and ask leading questions. Another example would be if I were developing an app that makes it easier for elderly people to shop online. I run a survey whose questions steer respondents toward confirming that this age group needs such an app and that my idea is good. I also compare the online purchase volumes of elderly and younger people. I then conclude that this segment of the population would like an app that makes it easier to buy what they need, when in reality they would prefer to go to a shop and buy the product in person.

Another type of cognitive bias is <b>selection bias</b>, which occurs when the sample analyzed is not representative of the population. One example would be a study of the long-term effects of COVID in the Irish population in which the investigator includes only healthy young adults in the trials, whereas the long-term effects predominantly affect the older population.
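
The effect of a non-representative sample can be sketched with a small simulation. Everything below is synthetic and purely illustrative: the age range, the assumed link between age and symptom risk, and all variable names are assumptions made for this sketch, not real COVID data.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the sketch is repeatable

# Entirely synthetic population: ages are made-up numbers for illustration
ages = rng.integers(18, 90, size=100_000)

# Hypothetical assumption: the risk of long-term symptoms rises linearly with age
risk = 0.05 + 0.25 * (ages - 18) / (90 - 18)
has_symptoms = rng.random(ages.size) < risk

# Rate in the whole (synthetic) population
population_rate = has_symptoms.mean()

# Biased sample: only adults under 35 are enrolled
young_only_rate = has_symptoms[ages < 35].mean()

# Representative sample: a simple random sample of 5,000 people
random_idx = rng.choice(ages.size, size=5_000, replace=False)
random_sample_rate = has_symptoms[random_idx].mean()

print(f"Whole population:  {population_rate:.3f}")
print(f"Young adults only: {young_only_rate:.3f}")
print(f"Random sample:     {random_sample_rate:.3f}")
```

Under these assumptions, the young-only sample underestimates the rate, while the random sample stays close to the population value.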

<b>Historical bias</b>, on the other hand, occurs when socio-cultural prejudices and beliefs are mirrored in systematic processes. For example, studies have found that machine learning models trained on Wikipedia produce gender-biased analogies, such as associating men with doctors and women with nurses, or men with commanders and women with school teachers. The models inherit the historical biases of society by learning from huge corpora of text and produce output that further reinforces those biases.

***

### Exercise 2

<b> Show that the difference between the standard deviation calculations is greatest for small sample sizes.</b>

As explained above, standard deviation is a key measure of how spread out the values in a data set are. A small standard deviation occurs when data points are fairly close to the mean. On the other hand, a large standard deviation occurs when values are less clustered around the mean.
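
For reference, and assuming the two calculations being compared are the usual population and sample formulas (the symbols are introduced here for illustration only), the standard deviation of values $x_1, \dots, x_n$ with mean $\bar{x}$ can be written as:

$$
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}
\qquad\text{and}\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}
$$

The only difference between the two is the divisor: $n$ for the population formula and $n - 1$ for the sample formula.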

In [3]:
# Create a list of values
data = [5, 10, 50, 100]

# Calculate the standard deviation manually (population formula)
# Number of values in the list
n = len(data)
# Calculate the mean
mean = sum(data) / n
# Sum the squared differences between each value and the mean, then divide by n to get the variance
var = sum((x - mean) ** 2 for x in data) / n
# The standard deviation is the square root of the variance
std = var ** 0.5

print(std)

38.140365755980895
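
The manual calculation above divides by n, which corresponds to the population formula. As a minimal sketch, assuming the second calculation is the sample formula that divides by n - 1, both can be compared on the same four values with Python's built-in `statistics` module (the variable names are chosen for this example):

```python
import statistics

values = [5, 10, 50, 100]

# Population formula: divide the sum of squared deviations by n
population_std = statistics.pstdev(values)

# Sample formula: divide the sum of squared deviations by n - 1
sample_std = statistics.stdev(values)

print(f"Population std (divide by n):   {population_std:.4f}")
print(f"Sample std (divide by n - 1):   {sample_std:.4f}")
print(f"Difference:                     {sample_std - population_std:.4f}")
```

With only four values, the sample formula gives a noticeably larger result than the population formula.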


For small sample sizes, the sample standard deviation is a less accurate estimate of the population standard deviation. This is because it is based on a smaller number of observations and is therefore more subject to sampling error. As the sample size increases, the sample standard deviation becomes a more accurate estimate of the population value.

Therefore, the difference between the standard deviation of a sample and the standard deviation of a population is greatest for small sample sizes.
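
One way to make this concrete, assuming the two calculations are the population formula (dividing by n) and the sample formula (dividing by n - 1): both use the same sum of squared deviations, so their ratio depends only on n and equals the square root of n / (n - 1). The helper `std_ratio` below is a hypothetical name introduced for this sketch.

```python
import math


def std_ratio(n):
    """Ratio of the sample (n - 1) formula to the population (n) formula for n values."""
    return math.sqrt(n / (n - 1))


# The ratio approaches 1 as n grows, so the two formulas converge
for n in [2, 5, 10, 50, 100, 1000]:
    ratio = std_ratio(n)
    print(f"n = {n:>4} | ratio = {ratio:.4f} | gap = {(ratio - 1) * 100:.2f}%")
```

The gap between the two formulas is largest for the smallest n and shrinks toward zero as n increases.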

To show in Python that the difference between the standard deviation calculations is greatest for small sample sizes, we can use the numpy library to generate random samples from a population with a known standard deviation and then calculate the standard deviation of samples of different sizes.

In [2]:
import numpy as np

# Set the seed for this experiment
np.random.seed(0)

# Set the population standard deviation
pop_std = 2

# Set the sample sizes to consider
sample_sizes = [5, 10, 50, 100]

# Generate samples from the population and calculate the sample standard deviation
sample_stds = []
for sample_size in sample_sizes:
    sample = np.random.normal(loc=0, scale=pop_std, size=sample_size)
    sample_std = np.std(sample, ddof=1)
    sample_stds.append(sample_std)

# Calculate the difference between the population standard deviation and the sample standard deviation
differences = [pop_std - sample_std for sample_std in sample_stds]

# Print the results
for sample_size, sample_std, difference in zip(sample_sizes, sample_stds, differences):
    print(f"Sample size: {sample_size} | Sample std: {sample_std:.2f} | Difference: {difference:.2f}")

Sample size: 5 | Sample std: 1.49 | Difference: 0.51
Sample size: 10 | Sample std: 1.34 | Difference: 0.66
Sample size: 50 | Sample std: 2.07 | Difference: -0.07
Sample size: 100 | Sample std: 2.00 | Difference: 0.00
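
Because the run above draws only one sample per size, the printed differences are noisy: the size-10 sample happens to be further from the population value than the size-5 sample. A possible refinement, sketched below under the same assumptions (normal population with standard deviation 2), is to repeat the draw many times per sample size and average the absolute difference. The number of repetitions, `n_trials`, is an arbitrary choice made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the run is repeatable

pop_std = 2
sample_sizes = [5, 10, 50, 100]
n_trials = 2000  # hypothetical number of repetitions per sample size

mean_abs_diff = {}
for n in sample_sizes:
    # Draw n_trials independent samples of size n (one sample per row)
    samples = rng.normal(loc=0, scale=pop_std, size=(n_trials, n))
    # Sample standard deviation (ddof=1) of each row
    sample_stds = samples.std(axis=1, ddof=1)
    # Average absolute distance from the known population standard deviation
    mean_abs_diff[n] = np.mean(np.abs(sample_stds - pop_std))

for n, diff in mean_abs_diff.items():
    print(f"Sample size: {n} | Mean absolute difference: {diff:.3f}")
```

Averaged over many repetitions, the difference shrinks steadily as the sample size grows, which a single draw per size cannot show reliably.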
