## PROBABILITY FUNDAMENTALS

# Probability for Data Science

![Probability](https://drive.google.com/uc?export=view&id=1zlkvV68JIE0-L9ZIXuXJYi2wAyOrUd3P)


Probability is a crucial part of data science. It helps us make informed guesses and understand patterns in data, which is important for decision-making, prediction, and understanding uncertainty in our models. This guide will introduce you to the basic concepts of probability, explained in simple terms, and show how they relate to data science.


## What is Probability?

Probability measures how likely something is to happen. It is represented as a number between 0 and 1, where:
- 0 means an event will never happen (impossible),
- 1 means an event will always happen (certain).

In data science, probability helps us deal with uncertainty in data and make predictions.

**Example**:
The probability of flipping a coin and it landing on heads is 0.5 (or 50%), because there are two equally likely outcomes: heads or tails.



## Terminologies in Probability

a. **Experiment**
An experiment is any process that can be repeated, where the outcome is uncertain. Examples include flipping a coin, rolling a die, or running a machine learning model.

b. **Outcome**
An outcome is a possible result of an experiment. For example, if you roll a die, an outcome might be getting a 4.

c. **Event**
An event is a collection of outcomes. For example, rolling an even number on a die (2, 4, or 6) is an event.

d. **Sample Space**
The sample space is the set of all possible outcomes of an experiment. For a coin flip, the sample space is {Heads, Tails}. For rolling a die, the sample space is {1, 2, 3, 4, 5, 6}.
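
The four terms above map naturally onto Python sets. The snippet below is a minimal sketch for a single roll of a six-sided die; the variable names (`sample_space`, `even_event`, `outcome`) are chosen here for illustration only.

```python
# Sample space: every possible outcome of one roll of a six-sided die
sample_space = {1, 2, 3, 4, 5, 6}

# Event: a collection of outcomes, here "roll an even number"
even_event = {face for face in sample_space if face % 2 == 0}

# Outcome: one possible result of the experiment
outcome = 4

print(sorted(even_event))          # [2, 4, 6]
print(even_event <= sample_space)  # True: an event is a subset of the sample space
print(outcome in even_event)       # True: this outcome belongs to the event
```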



## Types of Probability

a. **Theoretical Probability**
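Theoretical probability is based on reasoning rather than observation. It's what we expect to happen in an ideal world where every outcome is equally likely, which is the assumption behind the formula below.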

**Formula**:
$$
P(A) = \frac{Number \ of \ favorable \ outcomes}{Total \ number \ of \ outcomes}
$$

**Example**:
The probability of rolling a 3 on a die is:
$$
P(3) = \frac{1}{6}
$$
There’s 1 favorable outcome (rolling a 3) out of 6 possible outcomes.
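
The counting formula can be sketched as a small helper. This assumes every outcome in the sample space is equally likely; the function name `theoretical_probability` is hypothetical, and `fractions.Fraction` is used so the result stays exact.

```python
from fractions import Fraction

def theoretical_probability(event, sample_space):
    """P(event) = favorable outcomes / total outcomes (equally likely outcomes)."""
    favorable = event & sample_space  # ignore anything outside the sample space
    return Fraction(len(favorable), len(sample_space))

die = {1, 2, 3, 4, 5, 6}

p_three = theoretical_probability({3}, die)
p_even = theoretical_probability({2, 4, 6}, die)

print(p_three)  # 1/6
print(p_even)   # 1/2
```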

b. **Experimental Probability**
Experimental probability is based on actual experiments and real-world data. It’s calculated by conducting experiments and recording the outcomes.

**Formula**:
$$P(A) = \frac {Number \ of \ times \ event \ occurs}{Total \ number \ of \ trials}$$

**Example**:
If you roll a die 100 times and you get a 3 twenty times, then the experimental probability of rolling a 3 is:
$$
P(3) = \frac{20}{100} = 0.2
$$
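
Experimental probability can be explored with a quick simulation. The sketch below uses Python's `random` module with a fixed seed so the run is repeatable; the helper name `experimental_probability` and the trial counts are assumptions made for this illustration. The estimate will usually differ from the theoretical 1/6 for small runs and tends to move closer to it as the number of trials grows.

```python
import random

def experimental_probability(target, trials, seed=0):
    """Estimate P(target) by simulating `trials` rolls of a fair six-sided die."""
    rng = random.Random(seed)  # seeded generator so results are repeatable
    hits = sum(1 for _ in range(trials) if rng.randint(1, 6) == target)
    return hits / trials

for trials in (100, 10_000, 100_000):
    estimate = experimental_probability(3, trials)
    print(f"{trials:>7} trials: P(3) is approximately {estimate:.4f}")

print(f"Theoretical value: {1 / 6:.4f}")
```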



## Probability Rules

a. **Addition Rule**

If two events are **mutually exclusive** (they cannot happen at the same time), the probability of either event happening is the sum of their individual probabilities.

For example, you are either jumping or standing, and it is either raining or not raining. In each pair, the two events cannot happen at the same time.

**Formula**:
$$
P(A \ or \ B) = P(A) + P(B)
$$

**Example**:
The probability of rolling a 1 or a 2 on a die is:
$$
P(1 \ or \ 2) = \frac{1}{6} + \frac{1}{6} = \frac{2}{6} = \frac{1}{3}
$$
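
The addition rule can be checked by counting outcomes directly. The sketch below (helper name `prob` is hypothetical) confirms the result for rolling a 1 or a 2, then shows why the rule is stated only for mutually exclusive events: when two events share outcomes, simply adding their probabilities counts the shared outcomes twice.

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event on a fair die, by counting outcomes."""
    return Fraction(len(event & die), len(die))

# Mutually exclusive events: no outcome in common
roll_1, roll_2 = {1}, {2}
assert roll_1 & roll_2 == set()
p_either = prob(roll_1 | roll_2)
print(p_either, prob(roll_1) + prob(roll_2))  # 1/3 1/3

# Overlapping events: "even" and "greater than 3" share 4 and 6
even, high = {2, 4, 6}, {4, 5, 6}
p_overlap_union = prob(even | high)
p_overlap_sum = prob(even) + prob(high)
print(p_overlap_union, p_overlap_sum)  # 2/3 1
```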

b. **Multiplication Rule**

If two events are **independent** (the outcome of one event does not affect the other), the probability of both events happening is the product of their individual probabilities.

**Formula**:
$$
P(A \ and \ B) = P(A) \ * \ P(B)
$$

**Example**:
The probability of flipping two heads in a row with a coin is:
$$
P(Heads, \ Heads) = \frac{1}{2} * \frac{1}{2} = \frac{1}{4}
$$
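
One way to confirm the multiplication rule is to list every outcome of two coin flips and count. The sketch below uses `itertools.product` to build the sample space of two flips; it assumes a fair coin, so all four outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

coin = ["Heads", "Tails"]

# All ordered outcomes of two flips: HH, HT, TH, TT
two_flips = list(product(coin, repeat=2))

favorable = [flips for flips in two_flips if flips == ("Heads", "Heads")]
p_two_heads = Fraction(len(favorable), len(two_flips))
p_heads = Fraction(1, 2)

print(len(two_flips))                    # 4
print(p_two_heads)                       # 1/4
print(p_two_heads == p_heads * p_heads)  # True
```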


## Conditional Probability

Conditional probability is the probability of one event happening, given that another event has already happened. It’s written as \(P(A|B)\), meaning the probability of A happening given that B has happened. An example is the probability that someone carries an umbrella given that it is raining.

**Formula**:
$$
P(A|B) = \frac{P(A \ and \ B)}{P(B)}
$$

**Example**:
Suppose 70% of people like coffee, and 40% of all people like both coffee and tea. The probability of someone liking tea, given that they like coffee, is:
$$
P(Tea|Coffee) = \frac{0.4}{0.7} \approx 0.57
$$
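
The same calculation in Python, as a minimal sketch. It reads the 40% as the share of all people who like both coffee and tea, which is the quantity the formula needs in its numerator; the group of 100 people is a made-up illustration, not data from a survey.

```python
# Probabilities taken from the example above
p_coffee = 0.7          # P(Coffee)
p_coffee_and_tea = 0.4  # P(Coffee and Tea), assumed to be a share of all people

p_tea_given_coffee = p_coffee_and_tea / p_coffee
print(f"P(Tea | Coffee) = {p_tea_given_coffee:.2f}")  # P(Tea | Coffee) = 0.57

# Same idea with counts: in a hypothetical group of 100 people,
# 70 like coffee and 40 of them also like tea
people_coffee = 70
people_both = 40
share_by_counts = people_both / people_coffee
print(f"{share_by_counts:.2f}")  # 0.57
```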


## Bayes' Theorem

Bayes' theorem is a formula used to update probabilities based on new information. It’s important in machine learning, especially in classification problems.

**Formula**:
$$
P(A|B) = \frac{P(B|A) \ * P(A)}{P(B)}
$$

**Example**:
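Let’s say 1% of a population has a disease, and a test for the disease is 90% accurate. If someone tests positive, Bayes' theorem helps us calculate the probability that they actually have the disease, considering both the accuracy of the test and the rarity of the disease. Because the disease is rare, most people tested are healthy, so false positives can outnumber true positives and this probability can be far lower than 90%.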
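
To put numbers on this example, the sketch below makes one explicit assumption that the text does not state: "90% accurate" is taken to mean the test is positive for 90% of people with the disease and negative for 90% of people without it. Under a different reading of "accurate" the result would change.

```python
# Prior: 1% of the population has the disease
p_disease = 0.01

# Assumption: "90% accurate" applies to both groups
p_pos_given_disease = 0.90     # positive result when the disease is present
p_pos_given_no_disease = 0.10  # positive result when the disease is absent

# Total probability of a positive test, P(Positive)
p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_no_disease * (1 - p_disease))

# Bayes' theorem: P(Disease | Positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_positive

print(f"P(Positive) = {p_positive:.3f}")                     # P(Positive) = 0.108
print(f"P(Disease | Positive) = {p_disease_given_pos:.3f}")  # P(Disease | Positive) = 0.083
```

Under these assumptions, only about 8% of positive results belong to people who actually have the disease, because the healthy group is so much larger.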


## Probability Distributions

In data science, probability distributions describe how the values of a random variable are distributed. Two common types are:

a. **Discrete Probability Distributions**: A discrete probability distribution applies to scenarios where the variable can take on specific, countable values (like rolling a die or flipping a coin).

_Example_: The probability of each face of a fair die is 1/6.

b. **Continuous Probability Distributions**: A continuous probability distribution applies when the variable can take on any value within a range (like measuring someone's height).

_Example_: The normal distribution (or bell curve) is a common continuous distribution in data science.
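
The two kinds of distribution can be sketched with the standard library alone. The discrete case is a dictionary that assigns a probability to each die face. The continuous case uses `statistics.NormalDist`, where probabilities come from ranges of values rather than single points. The height parameters (mean 170 cm, standard deviation 10 cm) are illustrative assumptions, not measured values.

```python
from fractions import Fraction
from statistics import NormalDist

# Discrete: each face of a fair die has probability 1/6
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
total_probability = sum(die_pmf.values())
print(total_probability)  # 1

# Continuous: heights modelled as a normal distribution (assumed parameters)
heights = NormalDist(mu=170, sigma=10)

# Probability that a height falls between 160 cm and 180 cm
p_range = heights.cdf(180) - heights.cdf(160)
print(f"{p_range:.4f}")  # 0.6827
```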

**More on this in the next three modules.**


## Why is Probability Important in Data Science?

In data science, probability helps with:
- **Prediction**: We use probability to make predictions about future events or trends.
- **Uncertainty**: Data is rarely perfect, so probability helps us deal with uncertainty.
- **Machine Learning**: Algorithms like Naive Bayes, decision trees, and more rely on probability to make decisions.

## Example

Suppose we have a dataset of a medical test's results. The dataset contains:

- Whether a patient has a disease (Disease: Yes/No).
- Whether their test result was positive (Test_Result: Positive/Negative).


Goals:

- Compute the probability of having the disease.
- Compute the conditional probability P(Disease | Positive Test).
- Check if the events "Disease" and "Positive Test" are independent.
- Apply Bayes' Theorem to verify P(Disease | Positive Test).

In [3]:
#import the required library
import pandas as pd

# Create the dataset
data = {
    "Patient_ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Disease": ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes", "No", "No"],
    "Test_Result": ["Positive", "Positive", "Positive", "Negative", "Positive", "Positive", "Negative", "Positive", "Positive", "Negative"]
}
df = pd.DataFrame(data)

In [4]:
df

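| | Patient_ID | Disease | Test_Result |
|---|---|---|---|
| 0 | 1 | Yes | Positive |
| 1 | 2 | No | Positive |
| 2 | 3 | Yes | Positive |
| 3 | 4 | No | Negative |
| 4 | 5 | No | Positive |
| 5 | 6 | Yes | Positive |
| 6 | 7 | No | Negative |
| 7 | 8 | Yes | Positive |
| 8 | 9 | No | Positive |
| 9 | 10 | No | Negative |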


In [5]:
#find out more information about the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Patient_ID   10 non-null     int64 
 1   Disease      10 non-null     object
 2   Test_Result  10 non-null     object
dtypes: int64(1), object(2)
memory usage: 372.0+ bytes


In [6]:
# Total number of patients
total_patients = df.shape[0]
total_patients

10

In [7]:
# 1. Probability of having the disease (P(Disease))
num_disease = len(df[df["Disease"] == "Yes"])
P_disease = num_disease / total_patients

print(f"P(Disease): {P_disease:.2f}")

P(Disease): 0.40


In [8]:
len(df[df["Disease"] == "Yes"])

4

This is the probability of having the disease, irrespective of the test result. Out of 10 patients, 4 have the disease. So, the probability is 40%.

This represents the prevalence of the disease in the dataset. Knowing the base rate helps understand how common the disease is in the population being studied.

In [9]:
# 2. Conditional Probability: P(Disease | Positive Test)

# Filter for positive tests
positive_tests = df[df["Test_Result"] == "Positive"]

# Count of total positive tests
num_positive_tests = len(positive_tests)

# Count of positive tests where disease is "Yes"
num_disease_given_positive = len(df[(df["Test_Result"] == "Positive") & (df["Disease"] == "Yes")])

# Calculate P(Disease | Positive Test)
P_disease_given_positive = num_disease_given_positive / num_positive_tests
print(f"P(Disease | Positive Test): {P_disease_given_positive:.2f}")

P(Disease | Positive Test): 0.57


In [11]:
# Count of positive tests where disease is "Yes"
num_disease_given_positive = len(df[(df["Test_Result"] == "Positive") & (df["Disease"] == "Yes")])
num_disease_given_positive

4

In [12]:
# Count of total positive tests
num_positive_tests = len(positive_tests)
num_positive_tests

7

In [13]:
df[df["Test_Result"] == "Positive"]

| | Patient_ID | Disease | Test_Result |
|---|---|---|---|
| 0 | 1 | Yes | Positive |
| 1 | 2 | No | Positive |
| 2 | 3 | Yes | Positive |
| 4 | 5 | No | Positive |
| 5 | 6 | Yes | Positive |
| 7 | 8 | Yes | Positive |
| 8 | 9 | No | Positive |


In [14]:
df["Disease"] == "Yes"

0     True
1    False
2     True
3    False
4    False
5     True
6    False
7     True
8    False
9    False
Name: Disease, dtype: bool

In [15]:
(df[(df["Test_Result"] == "Positive") & (df["Disease"] == "Yes")])

| | Patient_ID | Disease | Test_Result |
|---|---|---|---|
| 0 | 1 | Yes | Positive |
| 2 | 3 | Yes | Positive |
| 5 | 6 | Yes | Positive |
| 7 | 8 | Yes | Positive |


This is the conditional probability of having the disease given that the test result is positive. Out of the 7 patients who tested positive, 4 actually have the disease.

This result reflects how reliable a positive test is for diagnosing the disease. It’s often referred to as the Positive Predictive Value (PPV) of the test. A higher PPV means a positive test result is more likely to be correct.

In [16]:
# 3. Check independence: P(Disease ∩ Positive Test) vs P(Disease) * P(Positive Test)
P_positive = len(df[df["Test_Result"] == "Positive"]) / total_patients
P_disease_and_positive = num_disease_given_positive / total_patients
P_independence_check = P_disease * P_positive

print(f"P(Disease ∩ Positive Test): {P_disease_and_positive:.2f}")
print(f"P(Disease) * P(Positive Test): {P_independence_check:.2f}")

if abs(P_disease_and_positive - P_independence_check) < 1e-5:
    print("Disease and Positive Test are independent.")
else:
    print("Disease and Positive Test are NOT independent.")

P(Disease ∩ Positive Test): 0.40
P(Disease) * P(Positive Test): 0.28
Disease and Positive Test are NOT independent.


In [17]:
P_disease_and_positive

0.4

In [18]:
P_independence_check

0.27999999999999997

In [19]:
1e-5

0.00001

1e-05

1. The first result is the joint probability that a patient both has the disease and tests positive. It captures the overlap of the two events, "having the disease" and "testing positive", and reflects how often they occur together in the dataset.
2. The second result is the joint probability we would expect if "Disease" and "Positive Test" were independent, where independence means that the occurrence of one event does not influence the occurrence of the other.


The actual joint probability P(Disease ∩ Positive Test)=0.40 is not equal to the calculated product P(Disease)×P(Positive Test)=0.28. This discrepancy shows that the two events are not independent.

**What does this mean?**

- The probability of testing positive depends on whether the patient has the disease. In other words: Patients with the disease are much more likely to test positive compared to patients without the disease.
- This dependence is expected for a medical test that is designed to detect the disease. A test with no dependence on the disease status would not be useful!
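
The last goal, verifying P(Disease | Positive Test) with Bayes' theorem, can be sketched as follows. To keep the snippet runnable on its own, it rebuilds the `Disease` and `Test_Result` columns as plain Python lists instead of reusing `df`; the same counts could be taken from the DataFrame. It also compares how often each group tests positive, which is the dependence described above.

```python
# Same values as the Disease and Test_Result columns of the dataset
disease = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes", "No", "No"]
test_result = ["Positive", "Positive", "Positive", "Negative", "Positive",
               "Positive", "Negative", "Positive", "Positive", "Negative"]
rows = list(zip(disease, test_result))
total = len(rows)

n_disease = sum(1 for d, _ in rows if d == "Yes")
n_positive = sum(1 for _, t in rows if t == "Positive")
n_both = sum(1 for d, t in rows if d == "Yes" and t == "Positive")

p_disease = n_disease / total                 # P(Disease)
p_positive = n_positive / total               # P(Positive)
p_pos_given_disease = n_both / n_disease      # P(Positive | Disease)
p_pos_given_no_disease = (n_positive - n_both) / (total - n_disease)

# Bayes' theorem: P(Disease | Positive) = P(Positive | Disease) * P(Disease) / P(Positive)
p_bayes = p_pos_given_disease * p_disease / p_positive

# Direct count for comparison: diseased patients among the positive tests
p_direct = n_both / n_positive

print(f"P(Positive | Disease):    {p_pos_given_disease:.2f}")     # 1.00
print(f"P(Positive | No Disease): {p_pos_given_no_disease:.2f}")  # 0.50
print(f"Bayes:  {p_bayes:.2f}")   # 0.57
print(f"Direct: {p_direct:.2f}")  # 0.57
```

Both routes give about 0.57, matching the conditional probability computed earlier from the filtered rows.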

## Conclusion

Understanding probability is crucial for data science, as it allows us to handle uncertainty and make informed predictions. With these basics, you’ll have a solid foundation to explore more advanced topics in machine learning and statistics.

Feel free to reach out if you have questions or want more detailed explanations!