<a href="https://colab.research.google.com/github/DavidSenseman/BIO5853/blob/master/Lesson_02_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO5853: Biostatistics**

## **Lesson_02_1: Probability**

##### **Module II: Probability**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)

### Module II Material
* **Part 2.1: Probability**
* Part 2.2: Theoretical Probability Distributions
* Part 2.3: Sampling Distributions of the Mean

#### In this assignment you will learn about:

* Python sets
* Venn diagrams
* Experimental probability

### Google CoLab Instructions

The following code will map your GDrive to ```/content/drive``` and print out your Google GMAIL address.

In [None]:
# YOU MUST RUN THIS CELL FIRST
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: using Google CoLab")
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("Note: not using Google CoLab")
    COLAB = False

# **Probability**

**_Probability_** is a branch of mathematics that deals with the likelihood or chance of different outcomes occurring. It quantifies uncertainty and helps us make predictions about future events based on known data.

#### **Key Concepts in Probability:**

1. **Experiment:** An action or process that leads to one or more outcomes (e.g., rolling a die).
2. **Outcome:** A possible result of an experiment (e.g., rolling a 3).
3. **Event:** A set of one or more outcomes (e.g., rolling an odd number).
4. **Probability of an Event:** A measure of how likely an event is to occur, typically expressed as a number between 0 and 1, where 0 means the event will not occur and 1 means the event will certainly occur.

#### **Importance of Probability in Statistics:**

1. **Foundation for Statistical Inference:** Probability provides the theoretical foundation for making inferences about populations based on sample data. It helps in estimating population parameters and testing hypotheses.
2. **Decision Making:** Understanding probability allows us to make informed decisions in the face of uncertainty. For example, businesses use probability to assess risks and make strategic decisions.
3. **Predicting Outcomes:** Probability models can predict the likelihood of future events, which is crucial in fields like finance, insurance, and medicine.
4. **Understanding Randomness:** Probability helps in understanding and modeling random processes, such as the distribution of diseases in epidemiology or the behavior of particles in physics.
5. **Quality Control:** In manufacturing, probability is used to monitor and control the quality of products by analyzing the likelihood of defects.

## **Python Sets**

A Python **set** is an _unordered_ collection of items that contains **no** duplicates. As we will see, if you try to add an item that is already in a set, nothing happens. 

Python sets can contain numbers, strings or a mixture.

Since strings behave as collections, a string can be used as the argument for a call to `set()`. The resulting set will contain a **single-character string** for each unique character that appears in the argument. The order in which the elements of a set are printed will not necessarily bear any relation to the order in which they were added.
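The behavior of `set()` with a string argument can be sketched as follows. This is a minimal illustration; the string `"banana"` and the variable names `word` and `letters` are arbitrary choices made here, not part of the lesson.

```python
# Build a set from a string: each unique character becomes one element
word = "banana"          # arbitrary example string
letters = set(word)

# The printed order of a set may vary, so sort it for a stable display
print(sorted(letters))   # ['a', 'b', 'n']
print(len(letters))      # 3 unique characters, even though the string has 6
```

Sorting is used only to make the display predictable; the set itself remains unordered.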

### Example 1: Create a Set using Curly Braces `{ }`

The code below shows how to create a Python set called `A` using curly braces `{}`.

In [None]:
# Example 1: Create a set using curly braces

# Create set A
A = {1,2,3,4}

# Print set
print(A)

If your code is correct you should see the following output:
~~~text
{1, 2, 3, 4}
~~~

The numbers are enclosed in curly braces `{ }` which is Python's notation for a `set`.

### **Exercise 1: Create a Set using Curly Braces**

In the cell below, create a new set called `B` using curly braces and print it out. Set `B` should contain the integers `3,4,5` and `6`.

In [None]:
# Insert your code for Exercise 1 here



If your code is correct you should see the following output:
~~~text
{3, 4, 5, 6}
~~~


### Example 2: Create a Set using the `set()` function

The code below shows an alternative way to create a Python set called `A` using the `set()` function. 

In [None]:
# Example 2: Create a set using set() function

# Create set A
A = set([1,2,3,4])

# Print set
print(f"Set A = {A}")

If your code is correct you should see the following output:
~~~text
Set A = {1, 2, 3, 4}
~~~

The curly braces `{ }` are Python's notation for a set.

### **Exercise 2: Create a Set using the `set()` function**

In the cell below, create a new set called `B` and print it out. Set `B` should contain the integers `3,4,5` and `6`.

In [None]:
# Insert your code for Exercise 2 here



If your code is correct you should see the following output:
~~~text
Set B = {3, 4, 5, 6}
~~~


When you create a Python `set` it's up to you whether to use the curly braces notation or the `set()` function. There's no difference in the sets you create.
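A quick way to convince yourself that the two notations are equivalent is to compare the resulting sets. The sketch below also shows one caveat that applies only to the *empty* set; the variable names are illustrative.

```python
# The same set built two ways
A_braces = {1, 2, 3, 4}
A_func = set([1, 2, 3, 4])

print(A_braces == A_func)   # True: both notations give the same set

# Caveat: empty curly braces create a dictionary, not a set
print(type({}))             # <class 'dict'>
print(type(set()))          # <class 'set'>
```

So if you ever need an empty set, use `set()` rather than `{}`.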

### **Algebraic Set Operations**

In Python there are a number of operations and functions that work on different _collection_ types such as _sets_. In the next couple of examples we look at some of these operations.

### Example 3: Algebraic set operations - Union

The "addition" of one set with another is called the _union_ of the two sets. It is denoted as _A_ ∪ _B_. In Python, you can use the `|` operator to create a **union** of two sets as shown in the next cell. 


In [None]:
# Example 3: Union of 2 sets

# Create sets A and B
A = set([1,2,3,4])
B = set([3,4,5,6])

# Union of A, B
A_union_B = A | B

# Print the new set
print(f"Set A = {A}")
print(f"Set B = {B}")
print(f"Union of A and B = {A_union_B}")

If your code is correct you should see something similar to the following output:
~~~text
Set A = {1, 2, 3, 4}
Set B = {3, 4, 5, 6}
Union of A and B = {1, 2, 3, 4, 5, 6}
~~~
Notice that when we add these two sets together, only the numbers `5` and `6` were added to `A`, not the extra `3` and `4`. 

>Why?

Because every element in a set must be **unique**. Since our original set `A` already contained the numbers `3` and `4`, they were **not** added, only the new values, `5`, and `6`. In other words, a Python `set` can only contain one example of each element.
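As a small supplement, Python sets also have a `union()` method that gives the same result as the `|` operator. The sketch below uses the same sets as Example 3 and also shows that building a union returns a *new* set and leaves `A` unchanged.

```python
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

# union() is the method form of the | operator
A_union_B = A.union(B)

print(A_union_B == (A | B))   # True: both forms give the same set
print(sorted(A))              # [1, 2, 3, 4]: A itself is unchanged
print(len(A_union_B))         # 6, not 8, because 3 and 4 are kept only once
```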

### **Exercise 3: Try to Create a set with Duplicated Items**

Because each element in a set must be _unique_, when you try to create a set with duplicated items, you don't get an error, but only one item will be added to the set. 

In the cell below, create a new set called `DNABases` with `{'T', 'A', 'A', 'G', 'T', 'C', 'C'}` and then print out the set. 

In [None]:
# Insert your code for Exercise 3 here



If your code is correct, you should see: 
~~~text
Set DNABases = {'G', 'A', 'T', 'C'}
~~~~
Even though you defined the set `DNABases` with duplicated items, the set `DNABases` only contains one example of each item. As stated above, every element in a Python set must be **_unique_**.


### Example 4: Algebraic set operations - Intersection

Another algebraic set operation is **intersection**. The intersection of sets _A_ and _B_ is denoted as _A_ ∩ _B_. In Python, the ampersand symbol `&` is used to perform the intersection of two sets as shown in the code cell below. 

In [None]:
# Example 4: Find intersection of two sets using & operator

# Create sets A and B
A = set([1,2,3,4])
B = set([3,4,5,6])

# Use `&` to find their intersection
A_int_B = A & B

# Print out the intersection
print(f"The intersection of Set A and B = {A_int_B}")

If the code is correct you should see the following output:
~~~text
The intersection of Set A and B = {3, 4}
~~~
The intersection of _A_ and _B_ or _A_ ∩ _B_, is the set of elements that _both_ sets have in common. In this example, only the numbers `3`, and `4` were contained in both sets. 

### **Exercise 4: Algebraic set operations - Intersection**

In Example 4, set intersection was found using the `&` operator. Python also offers an `intersection()` method for accomplishing the same thing.  

In the cell below, use the `intersection()` method to find the intersection between sets, `A` and `B`. Call the new set created by the intersection method `A_intersection_B`.

Methods are called using _dot notation_. In this case, the `intersection()` **method** is attached (by the dot) to the first set and its argument is the second set.

In [None]:
# Insert your code for Exercise 4 here



If your code is correct you should see the following output:
~~~text
The intersection of Set A and B = {3, 4}
~~~

## **Venn Diagrams**

A **_Venn Diagram_** is a visual tool used to illustrate the relationships between different sets. It consists of overlapping circles, each representing a set, with the overlapping areas showing the common elements between the sets.

#### **Importance in Probability and Statistics**

1. **Visualizing Relationships:** Venn diagrams help in visualizing the logical relationships between different sets. This is particularly useful in probability and statistics where understanding the overlap and differences between events is crucial.
2. **Simplifying Complex Concepts:** They make it easier to grasp complex concepts such as unions, intersections, and complements of sets. For example, the probability of either event A or event B occurring (union) can be easily visualized.
3. **Calculating Probabilities:** Venn diagrams are used to calculate probabilities of combined events. For instance, they can show the probability of two events happening together (intersection) or the probability of at least one of the events happening (union).
4. **Identifying Mutually Exclusive Events:** They help in identifying mutually exclusive events, which are events that cannot happen at the same time. This is important for calculating probabilities accurately.
5. **Teaching Tool:** Venn diagrams are widely used in education to teach set theory and probability concepts. They provide a clear and intuitive way to understand and solve problems related to sets and probabilities.


## **Operations on Events and Probability**

In probability theory, an **_event_** is a set of outcomes of an experiment to which a probability is assigned. Essentially, it is a subset of the sample space, which includes all possible outcomes of the experiment. For example, rolling a `3` on a 6-sided die could be considered an event. More interesting examples of events would be a 30-year-old female being diagnosed with cervical cancer or an individual being born with the genetic defect that causes cystic fibrosis. An event either occurs, or it doesn't occur. In statistics, events are represented using uppercase letters (e.g. $A$, $B$, $C$). 
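The idea of an event as a subset of the sample space can be sketched with Python sets. This is a minimal illustration that assumes a fair six-sided die, so that every outcome is equally likely; the names `sample_space`, `event_odd` and `p_odd` are chosen here for illustration only.

```python
from fractions import Fraction

# Sample space: all possible outcomes of rolling a six-sided die
sample_space = {1, 2, 3, 4, 5, 6}

# Event: rolling an odd number (a subset of the sample space)
event_odd = {1, 3, 5}
print(event_odd <= sample_space)   # True: the event is a subset

# With equally likely outcomes (an assumption), P(event) = |event| / |sample space|
p_odd = Fraction(len(event_odd), len(sample_space))
print(f"P(odd) = {p_odd} = {float(p_odd):.2f}")   # P(odd) = 1/2 = 0.50
```

`Fraction` is used so the probability is shown exactly before it is converted to a decimal.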

Here is **FIGURE 5.1** on page 113 from your textbook. This figure uses Venn diagrams to illustrate the 3 operations that can be performed on events: (1) _intersection_, (2) _union_ and (3) _complement_.

![___](https://biologicslab.co/BIO5853/images/module_02/lesson_02_1_image02.png)

## **Intersection**

The **_intersection_** of two events $A$ and $B$, denoted $A ∩ B$, is defined as the event “both $A$ and $B$.” For example, let $A$ represent the event that a 30-year-old lives to see their 70th birthday, and $B$ the event that this person’s 30-year-old friend is still alive at age 70. The intersection of $A$ and $B$ would be the event that both individuals are alive at age 70. 

In Python, there is no easy way to recreate part (a) of Figure 5.1. 

## **Union**

The **_union_** of $A$ and $B$, denoted $A ∪ B$, is the event “either $A$ or $B$, or both $A$ and $B$.” In the example above, the union of $A$ and $B$ would be the event that either the first 30-year-old or their 30-year-old friend lives to age 70, or that they both live to be 70 years of age.  

The Python code for recreating part (b) of Figure 5.1 is shown in Example 5.

### Example 5: Draw Venn Diagram--Union

The code in the cell below creates a Venn diagram showing the union of sets $A$ and $B$. This figure is a Python recreation of **FIGURE 5.1 (b)** on page 113 in your textbook. 

_Code Description:_

In addition to importing the plotting library `matplotlib.pyplot`, it is also necessary to import a special venn diagram package by adding the line:

~~~text
from matplotlib_venn import venn2
~~~

The code that actually draws the Venn diagram is:

~~~text
# Draw Venn
venn2(subsets=[A, B], set_labels=('', ''), set_colors=('#44a1f6', '#ffdc10'), alpha=0.7)
~~~
The labels were set to `''` to hide them.

To make the figure match **FIGURE 5.1 (b)** as closely as possible, the following code chunk was added:

~~~text
# Cover over number in center of the circle 
plt.text(-0.48, -0.02, '.', fontsize=80, color='#64c1ff')
plt.text(0.37, -0.02, '.', fontsize=80, color='#ffec30')
plt.text(-0.05,-0.02, '.', fontsize=80, color='#f0ff00')
~~~

These three code lines are completely optional. All they do is cover-up a number that is normally printed in the center of each circle showing the number of elements inside the circle. 

In [None]:
# Example 5: Draw Venn diagram - Union

import matplotlib.pyplot as plt
from matplotlib_venn import venn2

# Define Sets
A = set([1,2,3,4])
B = set([4,5,6,7])

# Draw Venn
venn2(subsets=[A, B], set_labels=('', ''), set_colors=('#44a1f6', '#ffdc10'), alpha=0.7)

# Add custom text
plt.text(-0.55, -0.01, 'A', fontsize=24, color='black')
plt.text(0.45, -0.01, 'B', fontsize=24, color='black')
plt.text(-0.95, 0.40, 'S', fontsize=26, color='black')

# Cover over number in center of the circle 
plt.text(-0.48, -0.02, '.', fontsize=80, color='#64c1ff')
plt.text(0.37, -0.02, '.', fontsize=80, color='#ffec30')
plt.text(-0.05,-0.02, '.', fontsize=80, color='#f0ff00')

plt.axis('on')
plt.show()

If the code is correct you should see the following Venn diagram:

![___](https://biologicslab.co/BIO5853/images/module_02/lesson_02_1_image04.png)

This is a Python recreation of **FIGURE 5.1 (b)** in your textbook on page 113.

The figure legend in your textbook reads as follows:
>Venn diagrams representing the operations on events: (b) union of _A_ and _B_

## **Complement**

The **_complement_** of an event $A$, denoted $A^c$ or $\bar{A}$, is the event “not $A$.” Consequently, $A^c$ is the event that the first 30-year-old dies before reaching the age of 70.  
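With Python sets, the complement of an event can be sketched as a set difference from the sample space. The example below assumes a sample space `S` of six die outcomes, which is an illustrative choice and not taken from the textbook.

```python
# Assumed sample space: outcomes of a six-sided die
S = {1, 2, 3, 4, 5, 6}
A = {1, 2, 3, 4}

# Complement of A: everything in S that is not in A
A_complement = S - A
print(sorted(A_complement))        # [5, 6]

# An event and its complement never overlap, and together they fill S
print(A & A_complement == set())   # True
print(A | A_complement == S)       # True
```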

The Python code for recreating part (c) of Figure 5.1 is shown in Example 6.

### Example 6: Draw Venn Diagram--Complement

The code in the cell below recreates part **(c)** of **FIGURE 5.1** in your textbook. This figure shows the complement of set _A_. 

_Code Description:_

To create this figure, it was necessary to define the sets, $A$ and $B$, with **mutually exclusive** events, i.e., the numbers in set $A$ were all different than the numbers in set $B$. 

~~~text
# Define Mutually Exclusive Sets
A = set([1,2,3,4])
B = set([5,6,7,8])
~~~

It was also necessary to make the color of set $B$ completely white using the value `#ffffff`. Unfortunately, there is no easy way to change the background color using the command `venn2()` so it appears white. 

~~~text
# Draw Venn
venn2(subsets=[A, B], set_labels=('', ''), set_colors=('#44a1f6', '#ffffff'), alpha=0.7)
~~~


In [None]:
# Example 6: Draw Venn diagram--complement

import matplotlib.pyplot as plt
from matplotlib_venn import venn2

# Define Mutually Exclusive Sets
A = set([1,2,3,4])
B = set([5,6,7,8])

# Draw Venn
venn2(subsets=[A, B], set_labels=('', ''), set_colors=('#44a1f6', '#ffffff'), alpha=0.7)

# Add custom text
plt.text(-0.85, -0.01, 'A', fontsize=20, color='black')
plt.text(0.80, 0.20, "A$^c$", fontsize=20, color='black')
plt.text(-1.35, 0.30, 'S', fontsize=22, color='black')

# Cover over number in center of the circle 
plt.text(-0.68, -0.02, '.', fontsize=64, color='#64c1ff')
plt.text(0.55, -0.02, '.', fontsize=64, color='#ffffff')

plt.axis('on')
plt.show()

If the code is correct you should see the following Venn diagram:

![___](https://biologicslab.co/BIO5853/images/module_02/lesson_02_1_image05.png)

This is a Python recreation of **FIGURE 5.1 (c)** in your textbook on page 113.

The figure legend in your textbook reads as follows:
>Venn diagrams representing the operations on events: (c) complement of _A_

## **Mutually Exclusive Events**

In probability theory, **_mutually exclusive events_** are events that cannot occur at the same time. In other words, if one event happens, the other cannot. These events are also known as _disjoint_ events.

#### **Key Characteristics**

1. **No Overlap:** The probability of both events occurring together is zero. Mathematically, if ($A$) and ($B$) are mutually exclusive events, then:
   
   $$P(A \cap B) = 0 $$
   
2. **Addition Rule:** The probability of either event ($A$) or event ($B$) occurring is the sum of their individual probabilities. This is expressed as:
   
   $$ P(A \cup B) = P(A) + P(B) $$

#### **Examples**
* **Coin Toss:** When tossing a coin, getting heads and getting tails are mutually exclusive events because you cannot get both heads and tails on a single toss.
* **Dice Roll:** When rolling a six-sided die, the events of rolling a 2 and rolling a 5 are mutually exclusive because you cannot roll both numbers at the same time.
* **Card Draw:** In a standard deck of 52 cards, drawing a red card (hearts or diamonds) and drawing a black card (clubs or spades) are mutually exclusive events.
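The dice-roll example above can be checked with Python sets. This sketch assumes a fair six-sided die and uses a hypothetical helper function `prob()` defined here for illustration; `isdisjoint()` is the built-in set method that tests whether two sets share no elements.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space for one roll of a die
A = {2}                  # event: rolling a 2
B = {5}                  # event: rolling a 5

def prob(event):
    """Probability of an event, assuming equally likely outcomes."""
    return Fraction(len(event), len(S))

# Mutually exclusive events share no outcomes
print(A.isdisjoint(B))                      # True
print(prob(A & B))                          # 0

# Addition rule for mutually exclusive events
print(prob(A | B) == prob(A) + prob(B))     # True (both sides equal 1/3)
```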

In your textbook, Figure 5.2 on page 114 shows a Venn diagram of two mutually exclusive events, $A$ and $B$. The Python recreation of this figure is left for you to complete as **Exercise 6**.


### **Exercise 6: Draw Venn Diagram--Mutually Exclusive Events**

For **Exercise 6**, you are to recreate **FIGURE 5.2** from your textbook on page 114. For the most part, you can simply re-use the code in Example 6 above with the following changes:

1. When drawing your Venn diagram, change the colors using the following code chunk:

~~~text
set_colors=('#44a1f6', '#ffdc10')
~~~
This will change the color of set _B_ to yellow, instead of white.

2. When adding custom text, you need to change the following line:
> `plt.text(0.80, 0.20, "A$^c$", fontsize=20, color='black')`

to read as follows:
> `plt.text(0.80, -0.01, 'B', fontsize=20, color='black')`

3. Finally, you need to change the color of the dot covering the number of set _B_. Change the following line of code:
> `plt.text(0.55, -0.02, '.', fontsize=64, color='#ffffff')`

to read as follows:
> `plt.text(0.55, -0.02, '.', fontsize=64, color='#ffec30')`

In [None]:
# Insert your code for Exercise 6 here



If the code is correct you should see the following Venn diagram:

![___](https://biologicslab.co/BIO5853/images/module_02/lesson_02_1_image06.png)

This is a Python recreation of **FIGURE 5.2** in your textbook on page 114.

The figure legend in your textbook reads as follows:
>Venn diagrams representing two mutually exclusive events _A_ and _B_ 

## **Additive Rule**

When two events are mutually exclusive, the **_additive rule of probability_** states that the probability that either of the two events will occur is equal to the **sum** of the probabilities of the individual events; more explicitly:
$$ P(A ∪ B) = P(A) + P(B) $$   

### Example 7: Additive Rule

Example 7 shows how to work the following problem:

>Suppose we know that the probability that a newborn’s birth weight is less than 1500 grams is 0.014 and the probability that it is between 1500 and 2499 grams is 0.069. What is the probability that either of these two events will occur, or equivalently, the probability that the child weighs less than 2500 grams? 

In [None]:
# Example 7: Additive Rule

# Define Probabilities
P_A=0.014  # Probability birthwt is less than 1500 gm
P_B=0.069  # Probability birthwt is between 1500 and 2499 gm

# Additive Rule
P_total = P_A + P_B

# Print Result (3 decimal places avoids floating-point noise in the output)
print(f"The probability that a child will weigh less than 2500 gm={P_total:.3f}")

If the code is correct you should see the following output:
~~~text
The probability that a child will weigh less than 2500 gm=0.083
~~~

### **Exercise 7: Additive Rule**

Write the code to solve the following problem:

>Suppose we know that the probability that a newborn’s birth weight is less than 1500 grams is 0.014, the probability that it is between 1500 and 2499 grams is 0.069 and the probability that it is between 2500 and 2999 grams is 0.19. What is the probability that the child weighs less than 3000 grams?

_HINT:_ You will need to add the event $C$.


In [None]:
# Insert your code for Exercise 7 here



If your code is correct you should see the following output:
~~~text
The probability that a child will weigh less than 3000 gm=0.273
~~~

## **Additive Rule for Events That Are Not Mutually Exclusive**

If the events _A_ and _B_ are **not** mutually exclusive, as in **Figure 5.1(b)**, then the additive rule for mutually exclusive events no longer applies. Let $A$ be the event that a newborn’s birth weight is less than 1500 grams and $B$ the event that it is less than 2500 grams. Since the two events are able to occur simultaneously – consider a child whose birth weight is 1200 grams – there is some area in which they will overlap. If we were to simply sum the probabilities of the individual events, this area of overlap would be counted _twice_. Therefore, when two events are not mutually exclusive, the probability that either of the events will occur is equal to the sum of the individual probabilities _minus_ the probability of their intersection:  

$$ P(A ∪ B) = P(A) + P(B) − P(A ∩ B) $$    
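The general additive rule can be verified by counting outcomes in two overlapping sets. The sketch below assumes a fair six-sided die as the sample space and uses a hypothetical helper `prob()`; the sets `A` and `B` are illustrative and are not the birth-weight events described above.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # assumed sample space: one roll of a die
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}         # overlaps A in {3, 4}

def prob(event):
    """Probability of an event, assuming equally likely outcomes."""
    return Fraction(len(event), len(S))

# Left side: probability of the union, counted directly
direct = prob(A | B)

# Right side: sum of the probabilities minus the overlap counted twice
by_rule = prob(A) + prob(B) - prob(A & B)

print(direct, by_rule)      # 1 1
print(direct == by_rule)    # True
```

Simply adding `prob(A) + prob(B)` would give 4/3, which cannot be a probability; subtracting the intersection corrects the double counting.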

### Example 8: Additive Rule

Example 8 shows how to work the following problem:

>Suppose we know that the probability that a newborn’s birth weight is less than 1500 grams (_A_) is 0.014 and the probability that it is less than 2500 grams (_B_) is 0.083. Every newborn weighing less than 1500 grams also weighs less than 2500 grams, so the intersection of _A_ and _B_ is _A_ itself and P(_A_ ∩ _B_) = 0.014. What is the probability that either _A_ or _B_ will occur?

In [None]:
# Example 8: Additive Rule

# Define Probabilities
P_A = 0.014  # Probability birthwt is less than 1500 gm
P_B = 0.083  # Probability birthwt is less than 2500 gm
P_A_intersection_B = 0.014  # A lies entirely inside B, so P(A ∩ B) = P(A)

# Additive Rule
P_total = P_A + P_B - P_A_intersection_B 

# Print Result
print(f"The probability that either A or B will occur = {P_total:.3f}")

If the code is correct you should see the following output:
~~~text
The probability that either A or B will occur = 0.083
~~~

### **Exercise 8: Additive Rule**

Write the code to solve the following problem:

>Suppose we know that in the US the probability of a patient having hypertension is 48.1% and the probability of having Type II diabetes is 8.6%. Assume the probability of having both hypertension and diabetes is 2.5%. What is the probability that a person will have either hypertension _or_ diabetes?


In [None]:
# Insert your code for Exercise 8 here



If your code is correct you should see the following output:
~~~text
The probability that a patient will have either hypertension or diabetes = 0.542
~~~

## **Conditional Probability**

We are often interested in determining the probability that an event $B$ will occur given that we already know the outcome of another event $A$. Does the prior occurrence of $A$ cause the probability of $B$ to change? For instance, instead of the probability that a male has tuberculosis, we might want to know the probability of tuberculosis given that lung inflammation was noted on his chest X-ray, or the probability of tuberculosis given that he is infected with HIV. In this case, we are dealing with a **_conditional probability_**. Conditional probabilities provide a model for updating information as our knowledge about a situation evolves. As we will see in Chapter 6, conditional probabilities are useful for interpreting the results of diagnostic tests. The notation $P(B | A)$ is used to represent the probability of the event $B$ given that event $A$ has already occurred. 


### **Multiplicative Rule of Probability**

The **_multiplicative rule of probability_** is used to determine the probability of two events, $A$ and $B$, both occurring. The rule states that the probability of both events happening is the _product_ of the probability of the first event and the conditional probability of the second event given that the first event has occurred.

### **Formula**

For any two events $A$ and $B$:

$$ P(A \cap B) = P(A) \times P(B|A) $$

**Where:**

* $P(A \cap B)$ is the probability of both events $A$ and $B$ occurring.
* $P(A)$ is the probability of event $A$ occurring.
* $P(B|A)$ is the conditional probability of event $B$ occurring given that event $A$ has occurred.
  
**Special Case for Independent Events**

If events $A$ and $B$ are **_independent_**, the occurrence of one does not affect the occurrence of the other. In this case, the formula simplifies to:

$$ P(A \cap B) = P(A) \times P(B) $$

**Example**

Let’s say we have two events:

* Event $A$: a person has hypertension (probability 48.1%)
* Event $B$: a person has diabetes (probability 8.6%)

If we assume the events are independent (which they are not in reality, but for simplicity) and use the multiplicative rule for independent events:

$$ P(A \cap B) = P(A) \times P(B) = 0.481 \times 0.086 \approx 0.041 $$

So, the probability of a person having both hypertension and diabetes is approximately 4.1%.
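The arithmetic in this example can be reproduced in a few lines of Python. The sketch keeps the same simplifying assumption of independence; the variable names are illustrative. It also rearranges the multiplicative rule to recover the conditional probability, $P(B|A) = P(A \cap B) / P(A)$.

```python
P_A = 0.481   # probability of hypertension
P_B = 0.086   # probability of diabetes

# Multiplicative rule for independent events (independence is an assumption)
P_A_and_B = P_A * P_B
print(f"P(A and B) = {P_A_and_B:.3f}")    # P(A and B) = 0.041

# Rearranged multiplicative rule: P(B|A) = P(A and B) / P(A)
P_B_given_A = P_A_and_B / P_A
print(f"P(B|A) = {P_B_given_A:.3f}")      # P(B|A) = 0.086
```

Under independence, $P(B|A)$ comes out equal to $P(B)$: knowing that $A$ occurred does not change the probability of $B$.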

## **Total Probability Rule**

The **_Total Probability Rule_** (also known as the **Law of Total Probability**) is a fundamental concept in probability theory. It relates the probability of an event to the probabilities of several mutually exclusive and exhaustive events that partition the sample space.

#### **Definition**
The total probability rule states that if $B_1, B_2, \ldots, B_n$ are mutually exclusive and exhaustive events, then for any event $A$:

$$ P(A) = \sum_{i=1}^{n} P(A \cap B_i) = \sum_{i=1}^{n} P(A|B_i) \cdot P(B_i) $$

#### **Explanation**

* **Mutually Exclusive:** Events $B_1, B_2, \ldots, B_n$ do _not_ overlap; no two events can occur simultaneously.
* **Exhaustive:** The events cover the entire sample space; one of these events must occur.
* **Conditional Probability:** $P(A|B_i)$ is the probability of event $A$ occurring given that $B_i$ has occurred.

#### **Example**

Consider a medical test that can detect a disease with different accuracy rates depending on the patient’s age group. Let:

* $A$ be the event that the test is positive.
* $B_1$ be the event that the patient is under 40.
* $B_2$ be the event that the patient is 40-60.
* $B_3$ be the event that the patient is over 60.

If we know the probabilities of the test being positive in each age group and the probabilities of a patient being in each age group, we can use the total probability rule to find the overall probability of a positive test result.

The Python code for working this problem is shown in Example 9.

### Example 9: Total Probability Rule

Assume the following probabilities for events $B_1$, $B_2$ and $B_3$:
* $P(B_1)$ = 30%
* $P(B_2)$ = 50%
* $P(B_3)$ = 20%

Note that these 3 probabilities sum to 100%. In other words, together these 3 events are **_exhaustive_**.

And assume the following conditional probabilities:
* $P(A|B_1)$ = 0.5
* $P(A|B_2)$ = 0.7
* $P(A|B_3)$ = 1.0

Compute the probability that the test will be positive (event $A$).

_Code Description:_

The f-string in the `print()` statement at the end of the code cell demonstrates how you can adjust the number of decimal places that are printed by adding `:.2f` after the value to be printed:

~~~text
# Print the result with 2 decimal places
print(f"The total probability of event A is: {P_A:.2f}")
~~~~


In [None]:
# Example 9: Total Probability Rule

# Define the probabilities of the events B1, B2, and B3
P_B1 = 0.3
P_B2 = 0.5
P_B3 = 0.2

# Define the conditional probabilities P(A|B1), P(A|B2), and P(A|B3)
P_A_given_B1 = 0.5
P_A_given_B2 = 0.7
P_A_given_B3 = 1.0

# Calculate the total probability of A using the total probability rule
P_A = (P_A_given_B1 * P_B1) + (P_A_given_B2 * P_B2) + (P_A_given_B3 * P_B3)

# Print the result with 2 decimal places
print(f"The total probability of event A is: {P_A:.2f}")

If the code is correct, you should see the following output:

~~~text
The total probability of event A is: 0.70
~~~
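The same calculation can be written as a loop over the partition, which scales to any number of events $B_i$. This is only a sketch: the list names `priors` and `likelihoods` are chosen here for illustration, and the values are the ones used in Example 9.

```python
import math

# P(B1), P(B2), P(B3): probabilities of the events that partition the sample space
priors = [0.3, 0.5, 0.2]

# P(A|B1), P(A|B2), P(A|B3): conditional probabilities of a positive test
likelihoods = [0.5, 0.7, 1.0]

# Exhaustive events must have probabilities that sum to 1
print(math.isclose(sum(priors), 1.0))    # True

# Total probability rule: sum of P(A|Bi) * P(Bi)
P_A = sum(p_a_given_b * p_b for p_a_given_b, p_b in zip(likelihoods, priors))
print(f"The total probability of event A is: {P_A:.2f}")   # 0.70
```

`math.isclose()` is used for the check because sums of decimal fractions are not always exactly equal to 1 in floating-point arithmetic.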

### **Exercise 9: Total Probability Rule**

On page 117 in your textbook, there is an example that is used to illustrate the **Total Probability Rule**. 

The example starts as follows:

>In the United States in 2010, there were more than 27 million people who were 70 years of age or  older [126]. We can separate these people into three mutually exclusive categories: those who are  70–79 years of age, those who are 80–89 years of age, and those who are 90 years of age or older. 

And ends with:

>Plugging in the probabilities that an individual in this population is in each of the three age groups, and the conditional probabilities of dementia given age, we get
>
> $$ \begin{aligned}
> P(D) &= P(A_1)\,P(D|A_1) + P(A_2)\,P(D|A_2) + P(A_3)\,P(D|A_3) \\
>      &= (0.5963)(0.0497) + (0.3364)(0.2419) + (0.0673)(0.3736) \\
>      &= 0.0296 + 0.0814 + 0.0251 = 0.1361.
> \end{aligned} $$

For **Exercise 9** you are to use the code in Example 9 to work this example from your textbook. In other words, plug in the data values from the example into the Example 9 code to produce the answer `0.1361`. You will need to adjust the f print statement to generate 4 decimal places instead of two.

In [None]:
# Insert your code for Exercise 9 here



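Your answer for **Exercise 9** should be `0.1362`, which is very close to the answer of `0.1361` shown on page 119 in your textbook. The small difference is due to rounding: the textbook rounds each of the three products to four decimal places before adding them (0.0296 + 0.0814 + 0.0251 = 0.1361), whereas Python adds the unrounded products (0.13615...) and rounds only the final sum.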

The _total probability rule_, illustrated in **Figure 5.4** using a Venn diagram, is shown here:
 
![___](https://biologicslab.co/BIO5853/images/module_02/lesson_02_1_image01.png)


## **Relative Risk**

**_Relative risk (RR)_**, also known as the **risk ratio**, is a measure used in statistics and epidemiology to compare the probability of a certain event occurring in two different groups. Specifically, it is the ratio of the probability of the event occurring in the exposed group to the probability of the event occurring in the unexposed group.

#### **Formula**

$$ \text{Relative Risk (RR)} = \frac{P(\text{Event | Exposed})}{P(\text{Event | Unexposed})} $$

#### **Interpretation**
* **RR = 1:** The event is equally likely in both groups.
* **RR > 1:** The event is more likely in the exposed group.
* **RR < 1:** The event is less likely in the exposed group.


### Example 10: Relative Risk

The code in the cell below shows how to calculate the relative risk (RR) for the example in your textbook starting on page 121. The example begins as follows:

>Consider a study that examined the risk factors for breast cancer among females participating in the first National Health and Nutrition Examination Survey in the 1980s [128]. In a cohort study such as this, the exposure is measured at the onset of the investigation. Groups of individuals with and without the exposure are followed to look for occurrences of the outcome. In this breast cancer study, a female was considered to be “exposed” if she first gave birth at age 25 or older, and “unexposed” if she gave birth at a younger age. In a sample of 4540 study participants who gave birth to their first child before the age of 25, 65 were later diagnosed with breast cancer. Of the 1628 who first gave birth at age 25 or older, 31 were diagnosed with breast cancer. If we assume that the numbers are large enough to satisfy ...


In [None]:
# Example 10: Relative risk

from scipy.stats.contingency import relative_risk

# Define the number of cases and totals for exposed and control groups
exposed_cases = 31
exposed_total = 1628
control_cases = 65
control_total = 4540

# Calculate the relative risk
result = relative_risk(exposed_cases, exposed_total, control_cases, control_total)

# Print the relative risk
print(f'Relative Risk: {result.relative_risk:.2f}')

If the code is correct you should see the following output:
~~~text
Relative Risk: 1.33
~~~

This is the same result as shown in your textbook.
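
As a quick cross-check, the same value can be computed directly from the relative risk formula given above. The snippet below is only a sketch that reuses the counts from Example 10; the variable names `p_exposed`, `p_unexposed`, and `rr_manual` are illustrative and do not come from the textbook.

```python
# Counts from Example 10 (textbook example starting on page 121)
exposed_cases, exposed_total = 31, 1628    # first gave birth at age 25 or older
control_cases, control_total = 65, 4540    # first gave birth before age 25

# Conditional probabilities of the event in each group
p_exposed = exposed_cases / exposed_total
p_unexposed = control_cases / control_total

# Relative risk is the ratio of the two probabilities
rr_manual = p_exposed / p_unexposed

print(f'P(Event | Exposed):     {p_exposed:.4f}')
print(f'P(Event | Unexposed):   {p_unexposed:.4f}')
print(f'Relative Risk (manual): {rr_manual:.2f}')
```

The manual calculation should agree with the value returned by `relative_risk()`.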

### **Exercise 10: Relative Risk**

Suppose a study provides the following data:

* **Smokers:** 70 out of 1,000 smokers develop lung cancer.
* **Non-smokers:** 5 out of 1,000 non-smokers develop lung cancer.

Calculate the relative risk (RR) of developing lung cancer if the patient is a smoker.

In [None]:
# Insert your code for Exercise 10 here



If your code is correct you should see the following output:
~~~text
Relative Risk: 14.00
~~~

In other words, a person who smokes is 14 times as likely to develop lung cancer as someone who doesn't smoke.

## **Odds Ratio**

An **_odds ratio (OR)_** is a measure used in statistics to quantify the strength of the association between two events. It compares the odds of an event occurring in one group to the odds of it occurring in another group.

#### **Formula**
The odds ratio is calculated as:

$$ \text{OR} = \frac{\text{Odds of Event in Group 1}}{\text{Odds of Event in Group 2}} $$

Where the odds of an event are defined as the probability of the event occurring divided by the probability of the event not occurring.
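
To make the definition of odds concrete, here is a minimal sketch that converts a probability into odds. The helper function `odds()` and the three probabilities are hypothetical and are used only for illustration.

```python
def odds(p):
    """Convert a probability p (0 <= p < 1) into odds, p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError('p must satisfy 0 <= p < 1')
    return p / (1 - p)

# Illustrative probabilities (not taken from the textbook)
for p in (0.2, 0.5, 0.8):
    print(f'P = {p:.1f} -> odds = {odds(p):.2f}')
```

Note that a probability of 0.5 corresponds to odds of 1, meaning the event is as likely to occur as not.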

#### **Interpretation**
* **OR = 1:** The event is equally likely in both groups.
* **OR > 1:** The event is more likely in Group 1 than in Group 2.
* **OR < 1:** The event is less likely in Group 1 than in Group 2.


### Example 11: Odds Ratio

The code in the cell below shows how to calculate the **_odds ratio (OR)_** for the example in your textbook starting on page 123. The example begins as follows:

>Consider the following data taken from another study of the risk factors for breast cancer. This one is a case-control study that examined the effects of the use of oral contraceptives [130]. In a case-control study, investigators start by identifying groups of individuals with the outcome (the cases) and without the outcome (the controls). They then determine whether the exposure in question was present or absent for each individual. Among the 989 females in the study who had breast cancer, 273 had previously used oral contraceptives and 716 had not. Of the 9901 females who did not have breast cancer, 2641 had used oral contraceptives and 7260 had not. In a case-control study, the proportions of subjects with and without the disease are fixed by the investigator; therefore, the probabilities of disease in the exposed and unexposed groups cannot be determined. However, we are able to calculate the probability of exposure for both cases and controls. Consequently, using the second definition for the odds ratio ...



In [None]:
# Example 11: Odds Ratio

# Define the number of cases and totals
Grp1_exposed = 273
Grp2_exposed = 2641
Num_disease = 989
Num_no_disease = 9901

# Compute probabilities of exposure among cases and controls
P_Exposed_Disease = Grp1_exposed/Num_disease
P_Exposed_noDisease = Grp2_exposed/Num_no_disease

# Compute OR
odds_ratio = ((P_Exposed_Disease/(1-P_Exposed_Disease)) / (P_Exposed_noDisease/(1-P_Exposed_noDisease)))

# Print results
print(f'Odds Ratio: {odds_ratio:.2f}')

If the code is correct you should see the same answer, `1.05`, as given in your textbook on page 123. 

Your textbook interpreted the result as follows:
>These data imply that females with breast cancer have an odds of using oral contraceptives that is 1.05 times the odds of those without breast cancer. However – because of the mathematical equivalence of the two formulas for the odds ratio – we are also able to say that individuals who have used oral contraceptives have an odds of developing breast cancer that is 1.05 times the odds of nonusers. As with the relative risk, an odds ratio of 1.0 indicates that exposure does not have an effect on the probability of the outcome. An odds ratio greater than 1.0 means that there is an increased risk of the outcome among exposed individuals, and an odds ratio less than 1.0 means that there is a decreased  risk of the outcome among the exposed.  
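
The "mathematical equivalence of the two formulas" mentioned in the quotation can be checked numerically. The sketch below uses the counts from Example 11 and computes the odds ratio three ways: from the odds of exposure, from the odds of disease, and from the cross-product of the 2 x 2 table. The variable names are illustrative. Keep in mind that in a case-control study the odds of disease within each exposure group are not meaningful on their own, since the investigator fixes the number of cases and controls; only their ratio is interpretable.

```python
# Counts from Example 11 (textbook example starting on page 123)
cases_exposed, cases_unexposed = 273, 716            # with breast cancer
controls_exposed, controls_unexposed = 2641, 7260    # without breast cancer

# Formula 1: odds of exposure among cases vs. odds of exposure among controls
or_exposure = (cases_exposed / cases_unexposed) / (controls_exposed / controls_unexposed)

# Formula 2: odds of disease among exposed vs. odds of disease among unexposed
or_disease = (cases_exposed / controls_exposed) / (cases_unexposed / controls_unexposed)

# Both formulas reduce to the cross-product ratio of the 2 x 2 table
or_cross = (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)

print(f'OR from odds of exposure: {or_exposure:.2f}')
print(f'OR from odds of disease:  {or_disease:.2f}')
print(f'OR from cross-product:    {or_cross:.2f}')
```

All three values should match the odds ratio computed in Example 11.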

## **Lesson Turn-in**

When you have run all of the code cells, print a PDF of your Colab notebook and upload it to Canvas for grading.