<a href="https://colab.research.google.com/github/Rashween-Kaur/Python/blob/main/AT_Lesson_46_Reference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 46: Probability

**WARNING:** The reference notebook is meant **ONLY** for a teacher. Please **DO NOT** share it with any student. The contents of the reference notebook are meant only to prepare a teacher for a class. To conduct the class, use the class copy of the reference notebook.

|Particulars|Description|
|-|-|
|**Topic**|Probability|
|||
|**Class Description**|In this class, a student will learn to calculate the probability of an event|
|||
|**Class**|C46|
|||
|**Class Time**|45 minutes|
|||
|**Goals**|Calculate the probability of a person having a liver disease|
||Find out the age group of people who are more likely to have liver disease|
||Based on the probability, infer what age group of patients should be prioritised for testing for the liver disease|
|||
|**Teacher Resources**|Google Account|
||Link to Lesson 46 Colab reference notebook|
||Laptop with internet connectivity|
||Earphones with mic|
|||
|**Student Resources**|Google Account|
||Laptop with internet connectivity|
||Earphones with mic|

---

### Teacher-Student Activities

In this class, we will learn the concept of probability and binomial distribution.

Here we have a dataset of 583 patients who were tested for liver disease. Of these, 416 patients tested positive and 167 tested negative. The data was collected from the north-east of Andhra Pradesh, India. We need to find the probability that a patient has the disease. We also need to find the probability that a patient has the disease given that the patient is

- a juvenile i.e. a person whose age is less than 18 years.

- an adult i.e. a person whose age is greater than or equal to 18 years but less than 50 years.

- an elderly i.e. a person whose age is at least 50 years.

We also need to create a binomial distribution model to check what is the probability distribution of a patient having a disease.

The `Dataset` column is a class label used to divide groups into liver patients (liver disease) or not (no disease). This data set contains 441 male patient records and 142 female patient records.


Use these patient records to determine which patients have liver disease and which ones do not.

---

#### Data Description

The dataset contains the following columns (or features):

1. `Age`: Age of the patient. Any patient whose age exceeded 89 is listed as being of age "90".

2. `Gender`: Gender of the patient

3. `Total_Bilirubin`: Total Bilirubin

4. `Direct_Bilirubin`: Direct Bilirubin

5. `Alkaline_Phosphotase`: Alkaline Phosphatase

6. `Alamine_Aminotransferase`: Alanine Aminotransferase

7. `Aspartate_Aminotransferase`: Aspartate Aminotransferase

8. `Total_Protiens`: Total Proteins

9. `Albumin`: Albumin

10. `Albumin_and_Globulin_Ratio`: Ratio of Albumin to Globulin

11. `Dataset`: Whether a patient has the liver disease or not.
 `1` means a patient has the liver disease and `2` means a patient does not have the liver disease.

Link: https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/indian-liver-patients/indian_liver_patient.csv

#### Acknowledgements

This dataset was downloaded from the UCI ML Repository:

Lichman, M. (2013). [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset)). Irvine, CA: University of California, School of Information and Computer Science.

Source: https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset)

---

#### Activity 1: Probability

*Probability of an outcome or an event is defined as the ratio of the number of favourable outcomes to the total number of possible outcomes.*

Consider throwing a die.

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/fair-die.jpg' width=200>

It has six sides labelled as 1, 2, 3, 4, 5 and 6. So whenever you throw a die you will get either 1, 2, 3, 4, 5 or 6.

Suppose you are playing a game in which you need to get $6$ when you roll a die to win the game. So $6$ is your favourable outcome to win the game. Hence, the probability of getting $6$ is **one over six**, i.e. $\frac{1}{6}$, because there is only **one** favourable outcome (i.e. getting $6$ on the die) out of six possible outcomes.

Mathematically, the probability of an outcome or an event $E$ is given by $$P(E) = \frac{n(E)}{n(S)}$$

or

$$P(E) = \frac{\text{Number of favourable outcomes}}{\text{Total number of possible outcomes}}$$

where

- $E$ is a set containing favourable outcomes

- $n(E)$ is the number of items contained in the set $E$

- $S$ is a set of all possible outcomes which is also known as **sample space**

- $n(S)$ is the number of items contained in the set $S$

In the game of getting $6$ to win the game, the set of favourable outcome(s) is $E = \{6\}$ and the set of possible outcomes is $S = \{1, 2, 3, 4, 5, 6\}$

Hence, the probability of the outcome of getting $6$ is $\frac{1}{6}$

Let's change the rules of the game by saying that a player can take out their pawn only if they get a prime number. In this case, the set of favourable outcomes becomes $E = \{2, 3, 5\}$

So the probability of taking out the pawn or in other words, the probability of the outcome of getting either 2 or 3 or 5 is $\frac{3}{6} = \frac{1}{2}$ because there are three items in the set $E$.

**Note:**

- Probability is just a measure of how likely an event is to occur. It does not mean that a highly likely event will definitely occur. The probability of an event which will definitely occur is 1. Similarly, the probability of an event which will definitely NOT occur is 0.

- The sum of the probability of occurrence of an event and the probability of that event not occurring is always 1.
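
The die examples above can be sketched in a few lines of Python. This is only an illustration: the helper name `probability` is made up here, and it assumes that all outcomes in the sample space are equally likely.

```python
from fractions import Fraction

def probability(event, sample_space):
    """Return P(E) = n(E) / n(S), assuming equally likely outcomes."""
    # Keep only the favourable outcomes that actually belong to the sample space.
    favourable = set(event) & set(sample_space)
    return Fraction(len(favourable), len(sample_space))

S = {1, 2, 3, 4, 5, 6}               # sample space of a fair die
p_six = probability({6}, S)          # E = {6}
p_prime = probability({2, 3, 5}, S)  # E = {2, 3, 5}

print(p_six)        # 1/6
print(p_prime)      # 1/2
print(1 - p_prime)  # probability of NOT getting a prime number: 1/2
```

Using `Fraction` keeps the results exact, which makes it easy to see that the probability of an event and the probability of its complement add up to 1.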

With this idea in mind, let's find out the probability of a patient having the liver disease.

---

#### Activity 2: Data Preparation

Let's prepare the dataset for analysis by:

- Treating the null values (if there are any)

- Renaming the `Dataset` column with the name `Disease`, the `Alkaline_Phosphotase` column with the name `Alkaline_Phosphatase` and `Alamine_Aminotransferase` column with the name `Alanine_Aminotransferase`

- Labelling each patient as a juvenile, an adult or an elderly based on their age

- Encoding (or converting) the `Male` and `Female` values for the `Gender` column to the numeric values, i.e., `0` and `1`

In [None]:
# S2.1: Import the libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



In [None]:
# S2.2: Load the dataset
file_loc = 'https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/indian-liver-patients/indian_liver_patient.csv'
df = pd.read_csv(file_loc)
df.head()

| |Age|Gender|Total_Bilirubin|Direct_Bilirubin|Alkaline_Phosphotase|Alamine_Aminotransferase|Aspartate_Aminotransferase|Total_Protiens|Albumin|Albumin_and_Globulin_Ratio|Dataset|
|-|-|-|-|-|-|-|-|-|-|-|-|
|0|65|Female|0.7|0.1|187|16|18|6.8|3.3|0.9|1|
|1|62|Male|10.9|5.5|699|64|100|7.5|3.2|0.74|1|
|2|62|Male|7.3|4.1|490|60|68|7.0|3.3|0.89|1|
|3|58|Male|1.0|0.4|182|14|20|6.8|3.4|1.0|1|
|4|72|Male|3.9|2.0|195|27|59|7.3|2.4|0.4|1|


In [None]:
# S2.3: Get the dataset information.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Age                         583 non-null    int64  
 1   Gender                      583 non-null    object 
 2   Total_Bilirubin             583 non-null    float64
 3   Direct_Bilirubin            583 non-null    float64
 4   Alkaline_Phosphotase        583 non-null    int64  
 5   Alamine_Aminotransferase    583 non-null    int64  
 6   Aspartate_Aminotransferase  583 non-null    int64  
 7   Total_Protiens              583 non-null    float64
 8   Albumin                     583 non-null    float64
 9   Albumin_and_Globulin_Ratio  579 non-null    float64
 10  Dataset                     583 non-null    int64  
dtypes: float64(5), int64(5), object(1)
memory usage: 50.2+ KB


In [None]:
# S2.4: Check for the null values.
df.isnull().sum()

Age                           0
Gender                        0
Total_Bilirubin               0
Direct_Bilirubin              0
Alkaline_Phosphotase          0
Alamine_Aminotransferase      0
Aspartate_Aminotransferase    0
Total_Protiens                0
Albumin                       0
Albumin_and_Globulin_Ratio    4
Dataset                       0
dtype: int64

In [None]:
# S2.5: Treat the missing values and check for the null values again.
df['Albumin_and_Globulin_Ratio'] = df['Albumin_and_Globulin_Ratio'].fillna(df['Albumin_and_Globulin_Ratio'].median())
df['Albumin_and_Globulin_Ratio'].isnull().sum()

0
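
As a small illustration of the same idea on made-up numbers (the series below is hypothetical and not part of the dataset), `fillna()` can replace the missing entries with the median of the remaining values.

```python
import numpy as np
import pandas as pd

# Hypothetical ratios with two missing entries.
ratios = pd.Series([0.9, 0.74, np.nan, 1.0, np.nan, 0.4])

# median() skips NaN values by default.
median_ratio = ratios.median()
filled = ratios.fillna(median_ratio)

print(median_ratio)
print(filled.isnull().sum())
```

The median is often preferred over the mean for this kind of treatment because it is less affected by extreme values.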

**Renaming a Column**

To rename a column in a Pandas DataFrame, use the `rename()` function. It requires a dictionary as an input to the `columns` parameter. The dictionary must contain the old column names and new column names as a key-value pair respectively.

**Syntax:** `dataframe.rename(columns=dictionary_of_new_and_old_names)`

where `dictionary_of_new_and_old_names` is a dictionary containing old column names and new column names as a key-value pair respectively.

In [None]:
# S2.6: Rename the 'Dataset' column with the name 'Disease'. Also, rename the 'Alkaline_Phosphotase' column with 'Alkaline_Phosphatase'
# Also, rename the 'Alamine_Aminotransferase' column with 'Alanine_Aminotransferase'
df.rename(columns={'Dataset' : 'Disease',
                   'Alkaline_Phosphotase' : 'Alkaline_Phosphatase',
                   'Alamine_Aminotransferase': 'Alanine_Aminotransferase'}, inplace=True)

**Note:** By setting the `True` value to the `inplace` parameter, we are telling Pandas to modify the `df` DataFrame itself. Otherwise, `rename()` returns a new DataFrame with the renamed columns and leaves `df` unchanged, so the result would have to be assigned to a variable to keep it.
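
The effect of the `inplace` parameter can be seen in a minimal sketch on a toy DataFrame. The names `toy_df`, `renamed_df` and the columns `a` and `b` are made up for illustration.

```python
import pandas as pd

# Toy DataFrame; the column names are made up for illustration.
toy_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Without inplace=True, rename() returns a new DataFrame and leaves toy_df as it is.
renamed_df = toy_df.rename(columns={'a': 'alpha'})
print(list(toy_df.columns))      # ['a', 'b']
print(list(renamed_df.columns))  # ['alpha', 'b']

# With inplace=True, toy_df itself is modified and the call returns None.
toy_df.rename(columns={'b': 'beta'}, inplace=True)
print(list(toy_df.columns))      # ['a', 'beta']
```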

In [None]:
# S2.7: Check whether we have the desired column names.
df.columns

Index(['Age', 'Gender', 'Total_Bilirubin', 'Direct_Bilirubin',
       'Alkaline_Phosphatase', 'Alanine_Aminotransferase',
       'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin',
       'Albumin_and_Globulin_Ratio', 'Disease'],
      dtype='object')

**Labelling Age Group**

We need to label each patient as

- a juvenile i.e. a person whose age is less than 18 years.

- an adult i.e. a person whose age is greater than or equal to 18 years but less than 50 years.

- an elderly i.e. a person whose age is at least 50 years.

so that later we can find out the probability of a patient having the liver disease given that he/she is

- a juvenile

- an adult

- an elderly

In [None]:
# S2.8: Create a function which takes a Pandas series as an input and returns another Pandas series as an output containing items 1, 2 and 3.
def age_group(age_series):
  age_group_list = []
  for age in age_series:
    if age < 18:
      age_group_list.append(1) # 1 means juvenile: a patient whose age is less than 18 years.
    elif (age >= 18) and (age < 50):
      age_group_list.append(2) # 2 means adult: a patient whose age is at least 18 years but less than 50 years.
    else:
      age_group_list.append(3) # 3 means elderly: a patient whose age is at least 50 years.
  return pd.Series(data=age_group_list, index=age_series.index)

age_group_series = age_group(df['Age'])
age_group_series

0      3
1      3
2      3
3      3
4      3
      ..
578    3
579    2
580    3
581    2
582    2
Length: 583, dtype: int64
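
The same labelling can also be sketched without an explicit loop by using the `pd.cut()` function. This is an alternative to the function above, not a replacement for it, and the ages below are hypothetical values chosen to hit the boundaries of each group.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to hit the boundaries of each group.
ages = pd.Series([4, 17, 18, 49, 50, 90])

# right=False makes every bin include its left edge and exclude its right edge:
# [-inf, 18) -> 1 (juvenile), [18, 50) -> 2 (adult), [50, inf) -> 3 (elderly)
groups = pd.cut(ages, bins=[-np.inf, 18, 50, np.inf], right=False, labels=[1, 2, 3]).astype(int)

print(groups.tolist())  # [1, 1, 2, 2, 3, 3]
```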

In [None]:
# S2.9: Add a new column to the 'df' DataFrame.
df['Age_Group'] = age_group_series
df.head()

| |Age|Gender|Total_Bilirubin|Direct_Bilirubin|Alkaline_Phosphatase|Alanine_Aminotransferase|Aspartate_Aminotransferase|Total_Protiens|Albumin|Albumin_and_Globulin_Ratio|Disease|Age_Group|
|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|65|Female|0.7|0.1|187|16|18|6.8|3.3|0.9|1|3|
|1|62|Male|10.9|5.5|699|64|100|7.5|3.2|0.74|1|3|
|2|62|Male|7.3|4.1|490|60|68|7.0|3.3|0.89|1|3|
|3|58|Male|1.0|0.4|182|14|20|6.8|3.4|1.0|1|3|
|4|72|Male|3.9|2.0|195|27|59|7.3|2.4|0.4|1|3|


**Encoding Gender**

Encode (or convert) the `Male` and `Female` values for the `Gender` column to the numeric values, i.e., `0` and `1`. For this, we can use the `replace()` function.

In [None]:
# S2.10: Find the number of male and female patients before encoding.
df['Gender'].value_counts()

Male      441
Female    142
Name: Gender, dtype: int64

In [None]:
# S2.11: Encode the 'Male' & 'Female' values for the 'Gender' column to the numeric values, i.e. 0 and 1.
df['Gender'] = df['Gender'].replace({'Male' : 0, 'Female' : 1})

**Note:** The `replace()` function returns a new Pandas series containing the encoded values. By assigning it back to `df['Gender']`, we override the current values in the `Gender` column permanently. Without the assignment, the `df` DataFrame would keep the original gender values.

In [None]:
# S2.12: Find the number of male and female patients after encoding.
df['Gender'].value_counts()

0    441
1    142
Name: Gender, dtype: int64

---

#### Activity 3: Computing Probabilities

Let's answer the following questions:

1. What is the probability that a patient is a juvenile?

2. What is the probability that a patient is an adult?

3. What is the probability that a patient is an elderly?

4. What is the probability that a patient has the liver disease given that the patient is a juvenile?

5. What is the probability that a patient has the liver disease given that the patient is an adult?

6. What is the probability that a patient has the liver disease given that the patient is an elderly?

In [None]:
# S3.1 Find the number of patients who are juveniles, adults and elderlies.
df['Age_Group'].value_counts()

2    328
3    230
1     25
Name: Age_Group, dtype: int64

Juveniles are very few in number.

In [None]:
# S3.2: What is the probability that a patient is a juvenile? What is the probability that a patient is an adult?
# What is the probability that a patient is an elderly?
prob_juve = sum(df['Age_Group'] == 1) / df.shape[0]
prob_adult = sum(df['Age_Group'] == 2) / df.shape[0]
prob_elder = sum(df['Age_Group'] == 3) / df.shape[0]

print(f"Probability that a patient is a juvenile is {prob_juve:.2f}")
print(f"Probability that a patient is an adult is {prob_adult:.2f}")
print(f"Probability that a patient is an elderly is {prob_elder:.2f}")

Probability that a patient is a juvenile is 0.04
Probability that a patient is an adult is 0.56
Probability that a patient is an elderly is 0.39


The probability that a patient is an adult is the greatest amongst the three age groups because adults form the largest group in the dataset.

The sum of the above three probabilities should be one.

In [None]:
# S3.3: Calculate the sum of the above three probabilities.
prob_juve + prob_adult + prob_elder

1.0
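
The same three probabilities can also be obtained in a single call with `value_counts(normalize=True)`. The sketch below does not load the dataset. Instead, it rebuilds a stand-in series (the name `age_groups` is made up here) from the counts printed in the S3.1 output.

```python
import numpy as np
import pandas as pd

# Rebuild an 'Age_Group'-like series from the counts shown above:
# 25 juveniles (1), 328 adults (2) and 230 elderly patients (3).
age_groups = pd.Series(np.repeat([1, 2, 3], [25, 328, 230]))

# normalize=True divides each count by the total number of rows.
probs = age_groups.value_counts(normalize=True).sort_index()

print(probs.round(2))
print(probs.sum())
```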

The remaining three probabilities

- What is the probability that a patient has the liver disease given that the patient is a juvenile?

- What is the probability that a patient has the liver disease given that the patient is an adult?

- What is the probability that a patient has the liver disease given that the patient is an elderly?

are **conditional probabilities** because they have a condition involved in them.

- In the first case, the condition is that a patient is a juvenile.

- In the second case, the condition is that a patient is an adult.

- In the third case, the condition is that a patient is an elderly.
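
In symbols, writing $J$ for the event that a patient is a juvenile and $D$ for the event that a patient has the liver disease (this notation is introduced here only for illustration), the conditional probability and the joint probability are related by

$$P(D \mid J) = \frac{n(J \cap D)}{n(J)} = \frac{P(J \cap D)}{P(J)}$$

or, equivalently,

$$P(J \cap D) = P(J) \times P(D \mid J)$$

Here $P(D \mid J)$ is the fraction of juveniles who have the disease, whereas $P(J \cap D)$ is the fraction of all the patients who are juveniles and also have the disease.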

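So to solve the first case, we need to restrict our attention to the juveniles and find the fraction of them having the disease. This fraction is the conditional probability that a patient has the liver disease given that the patient is a juvenile. Multiplying it with the probability that a patient is a juvenile gives the joint probability that a patient is a juvenile and also has the liver disease.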

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/prob_juve_img1.png' width=700>

We will repeat the same process for the adults and the elderlies as well.

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/prob_juve_img2.png' width=700>




In [None]:
# S3.4: Find the probability that a patient has the liver disease given that they are a juvenile.
# Also, find the probability that a patient doesn't have the liver disease given that they are a juvenile.
juve_df = df[df['Age_Group'] == 1]
juve_prob_disease = sum(juve_df['Disease'] == 1) / juve_df.shape[0]
juve_prob_not_disease = sum(juve_df['Disease'] == 2) / juve_df.shape[0]

print(f"Probability of a juvenile having the disease is {juve_prob_disease:.2f}")
print(f"Probability of a juvenile NOT having the disease is {juve_prob_not_disease:.2f}")

# Multiplying each conditional probability by the probability of being a juvenile gives the joint probability.
print(f"\nProbability that a patient is a juvenile AND has the liver disease: {prob_juve * juve_prob_disease:.3f}")
print(f"Probability that a patient is a juvenile AND does NOT have the liver disease: {prob_juve * juve_prob_not_disease:.3f}")

Probability of a juvenile having the disease is 0.48
Probability of a juvenile NOT having the disease is 0.52

Probability that a patient is a juvenile AND has the liver disease: 0.021
Probability that a patient is a juvenile AND does NOT have the liver disease: 0.022


The probability tree now looks like this

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/prob_juve_img3.png' width=700>


In [None]:
# S3.5: Find the probability that a patient has the liver disease given that they are an adult.
# Also, find the probability that a patient doesn't have the liver disease given that they are an adult.
adult_df = df[df['Age_Group'] == 2]
adult_prob_disease = sum(adult_df['Disease'] == 1) / adult_df.shape[0]
adult_prob_not_disease = sum(adult_df['Disease'] == 2) / adult_df.shape[0]

print(f"Probability of an adult having the disease is {adult_prob_disease:.2f}")
print(f"Probability of an adult NOT having the disease is {adult_prob_not_disease:.2f}")

# Multiplying each conditional probability by the probability of being an adult gives the joint probability.
print(f"\nProbability that a patient is an adult AND has the liver disease: {prob_adult * adult_prob_disease:.3f}")
print(f"Probability that a patient is an adult AND does NOT have the liver disease: {prob_adult * adult_prob_not_disease:.3f}")

Probability of an adult having the disease is 0.70
Probability of an adult NOT having the disease is 0.30

Probability that a patient is an adult AND has the liver disease: 0.393
Probability that a patient is an adult AND does NOT have the liver disease: 0.170


The probability tree now looks like this

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/prob_juve_img4.png' width=700>


In [None]:
# S3.6: Find the probability that a patient has the liver disease given that they are an elderly.
# Also, find the probability that a patient doesn't have the liver disease given that they are an elderly.
elder_df = df[df['Age_Group'] == 3]
elder_prob_disease = sum(elder_df['Disease'] == 1) / elder_df.shape[0]
elder_prob_not_disease = sum(elder_df['Disease'] == 2) / elder_df.shape[0]

print(f"Probability of an elderly having the disease is {elder_prob_disease:.2f}")
print(f"Probability of an elderly NOT having the disease is {elder_prob_not_disease:.2f}")

# Multiplying each conditional probability by the probability of being an elderly gives the joint probability.
print(f"\nProbability that a patient is an elderly AND has the liver disease: {prob_elder * elder_prob_disease:.3f}")
print(f"Probability that a patient is an elderly AND does NOT have the liver disease: {prob_elder * elder_prob_not_disease:.3f}")

Probability of an elderly having the disease is 0.76
Probability of an elderly NOT having the disease is 0.24

Probability that a patient is an elderly AND has the liver disease: 0.300
Probability that a patient is an elderly AND does NOT have the liver disease: 0.094


The probability tree now looks like this

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/prob_juve_img5.png' width=700>


You can also collect the favourable cases, i.e., the juvenile patients having the disease, using the `&` operator and then calculate the joint probabilities directly.

In [None]:
# T3.1: Find the joint probability that a patient is a juvenile and has the liver disease.
# Also, find the joint probability that a patient is a juvenile and doesn't have the liver disease.
prob_juve_and_disease = df.loc[(df['Age_Group'] == 1) & (df['Disease'] == 1), 'Age_Group'].shape[0] / df.shape[0]
prob_juve_and_not_disease = df.loc[(df['Age_Group'] == 1) & (df['Disease'] == 2), 'Age_Group'].shape[0] / df.shape[0]

print(f"Probability that a patient is a juvenile AND has the liver disease: {prob_juve_and_disease:.3f}")
print(f"Probability that a patient is a juvenile AND does NOT have the liver disease: {prob_juve_and_not_disease:.3f}")

Probability that a patient is a juvenile AND has the liver disease: 0.021
Probability that a patient is a juvenile AND does NOT have the liver disease: 0.022


**Note:** The `&` operator combines two boolean conditions row by row, so it selects the rows for which both the conditions are true. It can be applied whether or not the two events are independent of each other. In this dataset, the proportion of patients having the disease differs across the age groups (0.48 for the juveniles, 0.70 for the adults and 0.76 for the elderlies), so age group and liver disease are statistically associated. Still, it doesn't mean that age is the cause of liver diseases.

It is the same as saying that tall players have an advantage over short players in basketball, yet it is not necessary that all tall players are good basketball players.
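
One way to check whether two events $A$ and $B$ are independent is to compare $P(A \cap B)$ with $P(A) \times P(B)$. The two values are equal only for independent events. The sketch below does this for the events "juvenile" and "has the disease". The count of 12 juveniles with the disease is not printed anywhere above. It is an assumption inferred from the rounded output 0.48 for 25 juveniles, and the tolerance of `1e-3` is an arbitrary choice for illustration.

```python
# Counts taken from the text and the outputs above.
n_total = 583    # all patients
n_juve = 25      # juveniles
n_disease = 416  # patients with the liver disease
# Assumption: inferred from 0.48 * 25 = 12; not printed in the notebook output.
n_juve_and_disease = 12

p_juve = n_juve / n_total
p_disease = n_disease / n_total
p_joint = n_juve_and_disease / n_total

print(f"P(juvenile and disease)  = {p_joint:.3f}")
print(f"P(juvenile) * P(disease) = {p_juve * p_disease:.3f}")

# For independent events the two values would be (nearly) equal.
independent = abs(p_joint - p_juve * p_disease) < 1e-3
print(independent)
```

Under these assumptions the two values differ, which suggests that being a juvenile and having the disease are not independent in this dataset.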

In [None]:
# S3.7: Find the remaining joint probabilities for the adults and the elderlies.
prob_adult_and_disease = df.loc[(df['Age_Group'] == 2) & (df['Disease'] == 1), 'Age_Group'].shape[0] / df.shape[0]
prob_adult_and_not_disease = df.loc[(df['Age_Group'] == 2) & (df['Disease'] == 2), 'Age_Group'].shape[0] / df.shape[0]

prob_elder_and_disease = df.loc[(df['Age_Group'] == 3) & (df['Disease'] == 1), 'Age_Group'].shape[0] / df.shape[0]
prob_elder_and_not_disease = df.loc[(df['Age_Group'] == 3) & (df['Disease'] == 2), 'Age_Group'].shape[0] / df.shape[0]

print(f"Probability that a patient is an adult AND has the liver disease: {prob_adult_and_disease:.3f}")
print(f"Probability that a patient is an adult AND does NOT have the liver disease: {prob_adult_and_not_disease:.3f}")

print(f"\nProbability that a patient is an elderly AND has the liver disease: {prob_elder_and_disease:.3f}")
print(f"Probability that a patient is an elderly AND does NOT have the liver disease: {prob_elder_and_not_disease:.3f}")

Probability that a patient is an adult AND has the liver disease: 0.393
Probability that a patient is an adult AND does NOT have the liver disease: 0.170

Probability that a patient is an elderly AND has the liver disease: 0.300
Probability that a patient is an elderly AND does NOT have the liver disease: 0.094


So the joint probabilities match the ones calculated earlier by multiplying each conditional probability with the probability of the corresponding age group.
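
Both kinds of probability can also be read off a contingency table built with the `pd.crosstab()` function. The sketch below uses a small made-up DataFrame (`toy_df`) rather than the liver dataset, so the numbers are purely illustrative. Setting `normalize='all'` gives the joint probabilities and setting `normalize='index'` gives the probabilities conditional on the age group.

```python
import pandas as pd

# Made-up data: Age_Group 1/2/3 as above, Disease 1 = disease, 2 = no disease.
toy_df = pd.DataFrame({
    'Age_Group': [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'Disease':   [1, 2, 1, 1, 1, 2, 1, 1, 1, 2],
})

# Joint probabilities: every cell is divided by the total number of rows.
joint = pd.crosstab(toy_df['Age_Group'], toy_df['Disease'], normalize='all')

# Conditional probabilities: every row is divided by its own row total.
conditional = pd.crosstab(toy_df['Age_Group'], toy_df['Disease'], normalize='index')

print(joint)
print(conditional)
```

All the cells of `joint` add up to 1, whereas each row of `conditional` adds up to 1.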

---

#### Activities

**Teacher Activities**

1. Probability (Class Copy)

   Link on Panel
    
2. Probability (Reference)

   Link on Panel

---