# Intro to Probability


Probability plays a vital role in data analytics by offering a mathematical foundation to measure uncertainty, enabling informed decision-making, and evaluating the chances of different outcomes. This is essential for conducting rigorous statistical analyses and developing accurate predictive models.

## Set Theory

In set theory, the "union" of two sets A and B is written $A \cup B$ and contains every element that belongs to A, to B, or to both.

In contrast, the "intersection" of two sets A and B is written $A \cap B$ and contains only the elements common to both sets.

### Example 1

Set theory, an important branch of mathematics, studies collections of distinct elements. It lays the groundwork for probability, where events are described as sets of outcomes.

- Let's consider two sets, A and B, where:
    - A represents the IDs of students who passed Mathematics
    - B represents the IDs of students who passed English



In [None]:
A = {1001, 1002, 1004, 1005, 1006, 1008, 1010}
B = {1003, 1004, 1005, 1006, 1007, 1008, 1009}

In [None]:
#Students who passed math or English or both
A.union(B)

In [None]:
#Students who passed both subjects
A.intersection(B)

In [None]:
#Students who passed math but not English
A.difference(B)

In [None]:
#Students who passed English but not math
B.difference(A)
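
The operations above are tied together by a counting identity, $|A \cup B| = |A| + |B| - |A \cap B|$. The sketch below checks it on the same two sets using Python's set operators; the variable names `union_size`, `intersection_size` and `only_one` are illustrative assumptions.

```python
# Same student-ID sets as in Example 1
A = {1001, 1002, 1004, 1005, 1006, 1008, 1010}
B = {1003, 1004, 1005, 1006, 1007, 1008, 1009}

union_size = len(A | B)          # | is the operator form of A.union(B)
intersection_size = len(A & B)   # & is the operator form of A.intersection(B)

# Without the subtraction, students who passed both subjects would be counted twice
print(union_size == len(A) + len(B) - intersection_size)

# Students who passed exactly one of the two subjects (symmetric difference)
only_one = A ^ B
print(sorted(only_one))
```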

## Probability

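How do we **estimate** probabilities? (frequentist approach)

We count how many times the outcome $x$ occurred, $n_x$, and divide by the total number of trials:

$$P(x) = \frac{n_x}{n_{total}}$$

For example, suppose we flip a coin 15 times:

- Flips = 15
- Heads = 7
- Tails = 8

$$P(H) = \frac{7}{15}$$
$$P(T) = \frac{8}{15}$$
$$P(H) \ne P(T)$$

These are estimates from a small sample, so they can differ from each other even if the coin is fair. They get closer to the true probabilities as the number of flips grows.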
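
As a rough illustration of the frequentist estimate, the sketch below first reproduces the 15-flip example and then simulates a coin. The simulated coin is assumed to be fair (true $P(H) = 0.5$), the seed is fixed only for reproducibility, and `estimate_probability` is a hypothetical helper name.

```python
import random

def estimate_probability(outcomes, target):
    """Frequentist estimate: occurrences of target divided by the number of trials."""
    return outcomes.count(target) / len(outcomes)

# The 15 flips from the example: 7 heads and 8 tails
flips = ["H"] * 7 + ["T"] * 8
print(f"P(H) = {estimate_probability(flips, 'H'):.3f}")
print(f"P(T) = {estimate_probability(flips, 'T'):.3f}")

# Simulated coin (assumption: fair, true P(H) = 0.5)
rng = random.Random(42)
for n_flips in (15, 1_000, 100_000):
    simulated = [rng.choice("HT") for _ in range(n_flips)]
    print(n_flips, round(estimate_probability(simulated, "H"), 3))
```

With more flips the estimate should settle near 0.5, which is why a 7-versus-8 split after 15 flips is not evidence of an unfair coin.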

### Example 2

Drawing Cards

- 1. What is the probability of drawing an Ace from a standard deck of 52 cards?
- 2. What is the probability of drawing either a red card or a King from a standard deck of 52 cards?
- 3. If you draw a card from a standard deck and it is red, what is the probability that it is a King?

In [None]:
#1.
# S = {all 52 cards}
# A = {ace_spades, ace_hearts, ace_diamonds, ace_clubs}

# P(Ace) = 4/52

#A standard deck has 52 cards
deck = 52

#A standard deck has 4 aces
aces = 4

probability_aces = aces/deck

print(f"The probability of drawing an Ace is {probability_aces: .2f}")

In [None]:
#2.
# P(Red) = 26/52 = 1/2; P(King) = 4/52
# P(Red ∪ King) = P(Red) + P(King) - P(Red ∩ King)
# We need to subtract the intersection to avoid counting the same cards twice
# P(Red ∪ King) = 26/52 + 4/52 - 2/52  (the red Kings are the King of diamonds and the King of hearts)
p_red_or_king = (26/52) + (4/52) - (2/52)

print(f"The probability of drawing a red or King card is {p_red_or_king: .2f}")

In [None]:
#3.
# Among the red cards, what is the probability to select a King?
#P(King|Red) = P(King ∩ Red) / P(Red)
p_king_after_red = (2/52) / (26/52) # or simply 2/26
print(f"The probability of drawing a King, knowing it's a red card, is {p_king_after_red: .2f}")
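
The three answers can also be cross-checked by building the deck explicitly and counting cards. This is a sketch under the usual assumption of 13 ranks in 4 suits with hearts and diamonds red; the names `all_cards` and `prob` are illustrative.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
red_suits = {"hearts", "diamonds"}

# Sample space: every (rank, suit) pair
all_cards = set(product(ranks, suits))

aces = {card for card in all_cards if card[0] == "A"}
kings = {card for card in all_cards if card[0] == "K"}
reds = {card for card in all_cards if card[1] in red_suits}

def prob(event):
    """Probability of an event when all cards are equally likely."""
    return Fraction(len(event), len(all_cards))

p_ace = prob(aces)                                        # question 1
p_red_or_king = prob(reds | kings)                        # question 2: union of events
p_king_given_red = Fraction(len(kings & reds), len(reds)) # question 3: restrict to red cards

print(p_ace, p_red_or_king, p_king_given_red)
```

Using `Fraction` keeps the results exact, so they can be compared directly with 4/52, 28/52 and 2/26.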

### Example 3

You are planning to go on a picnic today, but the morning is cloudy, so it might rain. The following data about rainy days might help you decide:

- Knowing that it is a rainy day, the probability that it started off cloudy is 50%
- The probability of any day (rainy or not) starting off cloudy is 40%
- This month is usually dry, so only 3 of 30 days (10%) tend to be rainy
- What is the probability of rain, given that the day started cloudy?

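**Bayes' Theorem**

Given the individual probabilities $P(x)$ and $P(y)$, the joint probability $P(x, y)$ can be written in two ways using conditional probabilities:

$$P(x, y) = P(x) \cdot P(y|x)$$
$$P(x, y) = P(y) \cdot P(x|y)$$

Setting the two expressions equal:

$$P(x) \cdot P(y|x) = P(y) \cdot P(x|y)$$

and solving for $P(x|y)$ gives Bayes' theorem:

$$P(x|y) = \frac{P(x) \cdot P(y|x)}{P(y)}$$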

In [None]:
# Bayes' theorem is very useful here, given the data we have available

# P(C|R) = 0.5
# P(C) = 0.4
# P(R) = 0.1

# P(R|C) = P(R) * P(C|R) / P(C)

p_cr = 0.5
p_c = 0.4
p_r = 0.1

p_rc = (p_r * p_cr) / p_c

print(f"The probability that it's rainy when it's cloudy is {p_rc: .2f}")
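
The same calculation can be wrapped in a small reusable function and sanity-checked by counting days. In this sketch the function name `bayes` and the 300-day block are assumptions chosen only so that every count is a whole number.

```python
def bayes(prior, likelihood, evidence):
    """Return P(x|y) = P(x) * P(y|x) / P(y)."""
    if evidence <= 0:
        raise ValueError("P(y) must be positive")
    return prior * likelihood / evidence

# x = rain, y = cloudy morning
p_rain_given_cloudy = bayes(prior=0.1, likelihood=0.5, evidence=0.4)
print(f"P(R|C) = {p_rain_given_cloudy:.3f}")

# Counting check on a hypothetical block of 300 days
days = 300
rainy = days * 10 // 100               # 10% of days are rainy
rainy_and_cloudy = rainy * 50 // 100   # 50% of rainy days start cloudy
cloudy = days * 40 // 100              # 40% of all days start cloudy
print(f"Counting check: {rainy_and_cloudy / cloudy:.3f}")
```

Both routes give the same value: among the cloudy mornings, only a small share belong to rainy days.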

## Probability Distributions

#### Discrete

##### **Binomial Distribution** - Probability of having k successes in n independent trials with individual success probability p.

The idea here is the following:

We're dealing with an experiment that has only **TWO** possible outcomes (it doesn't matter which ones, as long as there are exactly two). We then ask: if we repeat the experiment "n" times, what is the probability of getting exactly "k" successful events?

Examples:

* A coin flip can come up heads "H" or tails "T".
* An online transaction can be fraudulent "F" or not "N".

What we consider a "successful" event can be either of the two outcomes.
The two outcomes also do not need to be equally likely, as long as the success probability p is the same in every trial. The coin can be fair or not, and online transactions are mostly not fraudulent.
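
The probability of exactly k successes is commonly written $P(k, n, p) = \binom{n}{k} p^k (1-p)^{n-k}$. Below is a minimal standard-library sketch of that formula; the function name `binomial_pmf` and the coin examples are assumptions for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    # comb(n, k) counts the orderings; p**k * (1-p)**(n-k) is the probability of each one
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Fair coin: exactly 2 heads in 4 flips
print(f"{binomial_pmf(2, 4, 0.5):.4f}")

# Biased coin (hypothetical p = 0.7): exactly 2 heads in 4 flips
print(f"{binomial_pmf(2, 4, 0.7):.4f}")

# The probabilities over k = 0..n add up to 1
print(f"{sum(binomial_pmf(k, 4, 0.7) for k in range(5)):.4f}")
```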

Example:
- Before vaccines were available, the government had to contract suppliers in advance so that the entire population could be vaccinated as soon as the vaccines were ready. Suppose each company has a 40% probability of successfully developing a vaccine.
- Each supplier has the capacity to produce 10,000 vaccines.
- The population that needs to be vaccinated is 30,000 people.

**How many suppliers should the government contract?**

Let's say the government contracts 10 labs.
Is this enough to vaccinate these 30,000 people?

In [None]:
from scipy.stats import binom

n = 10
p = 0.40
binom_dist = binom(n,p)

# Each lab can provide 10K vaccines and we have 30K people, so we need at least 30K/10K = 3 successful labs.
# Let's see what the probability of exactly 3 successful labs is.
print(f"The probability of exactly 3 successful contacts when we contact 10 labs is: {binom_dist.pmf(3): .2f}")

# Having more than 3 successful labs is also fine. Let's compute the probability of
# exactly 4 successful labs if we contact 10.
print(f"The probability of exactly 4 successful contacts when we contact 10 labs is: {binom_dist.pmf(4): .2f}")

# exactly 5 successful labs if we contact 10.
print(f"The probability of exactly 5 successful contacts when we contact 10 labs is: {binom_dist.pmf(5): .2f}")

# The probability of 3 or more successes is: 1 - binom_dist.cdf(2)

In [None]:
# Calculate and store the probabilities for exactly 3, 4, and 5 successes
probabilities = [binom_dist.pmf(k) for k in range(3, 6)]

# Print the results
for k, prob in zip(range(3, 6), probabilities):
    print(f"The probability of having exactly {k} successful contacts when attempting 10 times is: {prob:.2f}")

In [None]:
import pandas as pd

# Generate and display probabilities in a table format
k_values = range(3, 6)
probabilities = [binom_dist.pmf(k) for k in k_values]
df = pd.DataFrame({'k': k_values, 'Probability': probabilities})
print(df)

In this problem we don't care how many successful contacts we have as long as we hit 3 or more. Therefore, we're looking for:

$$P(k \ge 3, n, p) = \sum_{k=3}^{10}P(k, n, p) = P(3, 10, 0.4) + P(4, 10, 0.4) + P(5, 10, 0.4) +...+ P(10, 10, 0.4)$$

This is called the "cumulative probability". We need to add the individual probabilities because we can reach our goal in several ways: with 3 successful contacts, with 4, 5,...and so on.

This probability can also be computed as 1 minus the probability of having fewer than 3 successful contacts, i.e.:

$$P(k \ge 3, n, p) = 1 - P(k < 3, n, p) = 1 - \sum_{k=0}^{2}P(k, n, p) = 1 - \left[P(0, 10, 0.4) + P(1, 10, 0.4) + P(2, 10, 0.4)\right]$$

In [None]:
print(f"The probability to have 3 or more successful contacts when we reach 10 labs is: {1 - binom.cdf(2, n, p): .2f}")
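
To see that the direct sum and the complement give the same number, here is a small check that uses only the standard library. The helper `binomial_pmf` is a hypothetical stand-in for `binom.pmf`, written out so the sketch is self-contained.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_labs, p_success = 10, 0.40

# Direct route: add P(k) for k = 3, 4, ..., 10
p_direct = sum(binomial_pmf(k, n_labs, p_success) for k in range(3, n_labs + 1))

# Complement route: 1 minus P(k) for k = 0, 1, 2
p_complement = 1 - sum(binomial_pmf(k, n_labs, p_success) for k in range(3))

print(f"Direct sum: {p_direct:.4f}")
print(f"Complement: {p_complement:.4f}")
```

The complement route needs only three terms instead of eight, which is why the CDF form is usually preferred.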

If instead of contacting 10 labs, we contact 15...

In [None]:
n = 15
p = 0.40
binom_dist = binom(n,p)
1- binom_dist.cdf(2)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.barplot(x=[i for i in range(n+1)], y=[binom_dist.pmf(i) for i in range(n + 1)]);

In [None]:
sns.barplot(x=[i for i in range(n + 1)], y=[1 - binom_dist.cdf(i - 1) for i in range(n + 1)]);
plt.xlabel("Number of Successful Contacts (k)")
plt.ylabel("Cumulative Probability of k or More Successes")
plt.show()


With 15 labs, the probability that three or more succeed in developing the vaccine rises to about 0.97.
- Is this enough? Of course, the more labs we contract, the better. However, there are costs associated with contracting labs that we need to take into consideration.
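
One way to frame the trade-off is to fix a target probability and look for the smallest number of labs that reaches it. The sketch below does this with the standard library; the target levels, the function names `prob_at_least` and `min_labs`, and the `max_labs` cap are all assumptions for illustration, not values from the problem.

```python
from math import comb

def prob_at_least(k_min, n, p):
    """P(X >= k_min) for a binomial with n trials and success probability p."""
    return 1 - sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min))

def min_labs(target, k_min=3, p=0.40, max_labs=100):
    """Smallest number of labs with P(at least k_min successes) >= target."""
    for n_labs in range(k_min, max_labs + 1):
        if prob_at_least(k_min, n_labs, p) >= target:
            return n_labs
    return None  # target not reachable within max_labs

# Hypothetical target levels
for target in (0.90, 0.95, 0.99):
    print(f"target {target:.2f}: contract {min_labs(target)} labs")
```

Each extra lab adds less probability than the previous one, so the stricter the target, the more contracts it costs to gain the last few percentage points.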

##### **Geometric Distribution**

Suppose you are applying for jobs, and the probability of receiving an interview invitation for each job application is p = 0.20.
Calculate the probability of getting an interview invitation on the third job application.

In [None]:
from scipy.stats import geom

p = 0.20
k = 3
geom_dist = geom(p)

print(f"The probability of getting an interview invitation exactly on the 3rd job application is {geom_dist.pmf(k): .2f}")

In [None]:
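n = 10  # number of applications to display
# The geometric distribution is defined for k = 1, 2, 3, ... (k = 0 has probability 0)
k_values = list(range(1, n + 1))
sns.barplot(x=k_values, y=[geom_dist.pmf(k) for k in k_values]);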

What is the probability that you need to submit at least 5 job applications to receive your first interview invitation?

In [None]:
print(f"The probability that the first interview invitation comes on the 5th application or later is {1 - geom_dist.cdf(4): .2f}")
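
Both geometric results follow from simple closed forms: the first success lands on trial k when the first k-1 trials fail and trial k succeeds, and it lands on trial k or later when the first k-1 trials all fail. A minimal sketch, with hypothetical function names:

```python
def geometric_pmf(k, p):
    """P(first success happens exactly on trial k), for k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p

def geometric_at_least(k, p):
    """P(first success happens on trial k or later): the first k-1 trials all fail."""
    return (1 - p) ** (k - 1)

p_invite = 0.20
print(f"Invitation exactly on the 3rd application: {geometric_pmf(3, p_invite):.3f}")
print(f"First invitation on the 5th or later:      {geometric_at_least(5, p_invite):.4f}")
```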

##### **Poisson Distribution**

It's used to count the number of events that occur in a given period of time.

Suppose that on average, there are 2 accidents per day at a certain intersection.

- What is the probability that there will be no accidents at the intersection tomorrow?

In [None]:
from scipy.stats import poisson

mu = 2 # average
poisson_dist = poisson(mu)

print(f"The probability that tomorrow we don't observe an accident is {poisson_dist.pmf(0): .2f}")

- What is the probability that there will be at least 5 accidents at the intersection tomorrow?

In [None]:
print(f"The probability that tomorrow we observe 5 or more accidents is {1 - poisson_dist.cdf(4): .2f}")

In [None]:
# Plot P(k accidents) for k = 0..7; the upper bound is arbitrary
k_values = list(range(0, 8))
sns.barplot(x=k_values, y=[poisson_dist.pmf(i) for i in k_values]);
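
The Poisson probabilities come from the formula $P(k, \mu) = \frac{e^{-\mu}\mu^k}{k!}$. The sketch below recomputes both answers with the standard library only; `poisson_pmf` and the variable names are assumptions for illustration.

```python
from math import exp, factorial

def poisson_pmf(k, mu):
    """Probability of observing exactly k events when the average count is mu."""
    return exp(-mu) * mu**k / factorial(k)

mu_accidents = 2  # average accidents per day

p_none = poisson_pmf(0, mu_accidents)
# At least 5 accidents: complement of 0, 1, 2, 3 or 4 accidents
p_five_or_more = 1 - sum(poisson_pmf(k, mu_accidents) for k in range(5))

print(f"No accidents tomorrow: {p_none:.4f}")
print(f"5 or more accidents:   {p_five_or_more:.4f}")
```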

#### Continuous

##### **Exponential Distribution**

Suppose the time (in minutes) between two consecutive arrivals at a bus stop follows an exponential distribution with a mean of 10 minutes.

- Calculate the probability that the time between arrivals is less than 8 minutes.

In [None]:
import numpy as np
from scipy.stats import expon
import seaborn as sns

# Lambda is the rate parameter, which is the inverse of the mean
lambda_rate = 0.1  # If the mean time between arrivals is 10 minutes, lambda is 1/10 per minute.
expon_dist = expon(scale=1/lambda_rate)  # Scale parameter is the inverse of lambda

# Calculate the probability that the time between arrivals is less than 8 minutes
time = 8
prob_less_than_time = expon_dist.cdf(time)  # CDF evaluated at 8 minutes
print(f"The probability that the time between arrivals is less than {time} minutes is {prob_less_than_time: .2f}")

# Plot the CDF from 0 to 8 minutes
time_values = np.linspace(0, time, 100)
sns.lineplot(x=time_values, y=expon_dist.cdf(time_values));

- Calculate the probability that the time between arrivals is more than 15 minutes.

In [None]:
print(f"The probability that the time between arrivals is more than 15 minutes is {1 - expon_dist.cdf(15): .2f}")
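
For the exponential distribution the CDF has the closed form $CDF(t) = 1 - e^{-\lambda t}$, so both answers can be verified by hand. A minimal sketch with a hypothetical helper `exponential_cdf`:

```python
from math import exp

def exponential_cdf(t, mean):
    """P(T <= t) for an exponential distribution with the given mean."""
    rate = 1 / mean  # lambda is the inverse of the mean
    return 1 - exp(-rate * t)

mean_gap = 10  # mean minutes between arrivals

p_less_than_8 = exponential_cdf(8, mean_gap)
p_more_than_15 = 1 - exponential_cdf(15, mean_gap)  # equals exp(-15/10)

print(f"P(T < 8)  = {p_less_than_8:.4f}")
print(f"P(T > 15) = {p_more_than_15:.4f}")
```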

##### **Normal (Gaussian) Distribution**

Suppose the heights of a population of students follow a normal distribution with a mean (μ) of 175 cm and a standard deviation (σ) of 7.5 cm.

- What is the probability that a randomly selected student is taller than 182 cm?

In [None]:
from scipy.stats import norm

mean = 175
std = 7.5

norm_dist = norm(loc = mean, scale = std)
print(f"The probability of the student's height being greater than 182 cm is {1 - norm_dist.cdf(182): .2f}")


- What is the probability that the height of a randomly selected student is between 170 cm and 175 cm?

$$P(170 \le h \le 175) = P(h \le 175) - P(h \le 170) = CDF(175) - CDF(170)$$

In [None]:
print(f"The probability of the student's height being between 170 cm and 175 cm is {norm_dist.cdf(175) - norm_dist.cdf(170): .2f}")
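
The same two probabilities can be reproduced with the standard library's `statistics.NormalDist`, which also makes the z-score step explicit. This is a cross-check sketch rather than a replacement for the `scipy` code; the variable names are illustrative.

```python
from statistics import NormalDist

heights = NormalDist(mu=175, sigma=7.5)

# z-score: how many standard deviations 182 cm lies above the mean
z_182 = (182 - heights.mean) / heights.stdev

p_taller = 1 - heights.cdf(182)
p_between = heights.cdf(175) - heights.cdf(170)

# Same tail probability, computed on the standard normal N(0, 1)
p_taller_z = 1 - NormalDist().cdf(z_182)

print(f"z-score of 182 cm: {z_182:.3f}")
print(f"P(h > 182) = {p_taller:.4f} (via z-score: {p_taller_z:.4f})")
print(f"P(170 <= h <= 175) = {p_between:.4f}")
```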


CDF(x) is the total probability of all values from the smallest possible value up to x. Everything above x makes up the rest, so:

$$1 = CDF(x) + P(X > x)$$
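
For a continuous distribution, "summing the probabilities" means measuring the area under the density curve. The sketch below approximates that area with the trapezoidal rule for the height example and compares it with the CDF. The helper `area_under_pdf`, the step count, and the cut-offs at 10 standard deviations from the mean are assumptions made for the illustration.

```python
from statistics import NormalDist

heights = NormalDist(mu=175, sigma=7.5)

def area_under_pdf(dist, lower, upper, steps=10_000):
    """Trapezoidal approximation of the area under dist.pdf between lower and upper."""
    width = (upper - lower) / steps
    total = 0.5 * (dist.pdf(lower) + dist.pdf(upper))
    for i in range(1, steps):
        total += dist.pdf(lower + i * width)
    return total * width

x = 182
# Practical stand-ins for minus and plus infinity
far_left = heights.mean - 10 * heights.stdev
far_right = heights.mean + 10 * heights.stdev

area_below = area_under_pdf(heights, far_left, x)   # approximates CDF(x)
area_above = area_under_pdf(heights, x, far_right)  # approximates P(X > x)

print(f"Area below {x}: {area_below:.4f}  CDF({x}): {heights.cdf(x):.4f}")
print(f"Area below + area above: {area_below + area_above:.4f}")
```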