<a href="https://colab.research.google.com/github/GabeMaldonado/AIforMedicine/blob/master/AIforMed_C2_Theory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

$\mathfrak{Gabriel \ Maldonado}$



# Notes on AI for Medicine Specialization, Coursera
## Course II -- AI for Medical Prognosis

Prognosis is a branch of medicine that specializes in predicting the future health of patients. For example, given a patient's lab results, can we estimate the risk of having a heart attack over the next five or ten years? 
Machine learning is a powerful tool for prognosis and can provide a tremendous boost to this branch of medicine by using many different types of medical data to make accurate predictions about a patient's future health. 
Making a prognosis is clinically useful for a variety of reasons. 
1.   **Risk of Illness** It is useful for informing patients of their risk of developing an illness. For example, there are blood tests that are used to estimate the risk of developing breast and ovarian cancer.
2.   **Survival with Illness** It is also used to inform patients how long they can expect to survive with a certain illness. For example, cancer staging gives an estimate of the survival time of a patient with a particular type of cancer. 

Prognosis is also used for guiding treatment. In clinical practice, the predicted 10-year risk of heart attack is used to determine whether a patient should get drugs to reduce that risk. Another example is the **6-month mortality risk**. It is used for patients with terminal conditions that have become advanced and incurable, to determine who should receive end-of-life care. 

### Prognosis -- Inputs and Outputs

A prognostic model can be thought of as a system that takes a patient profile as input and outputs a **risk score** for that patient. The patient profile can include:
*   Clinical history -- includes major illnesses and previous medical procedures. 
*   Physical examinations -- includes vital signs such as temperature and blood pressure. 
*   Labs and imaging -- includes blood work, CT scans, etc. 

The prognostic model can take one or more of these pieces of information and produce the patient's risk score, which can be an arbitrary number or a probability. 

### Calculating $CHA_{2}DS_{2}-VASc$ (Chads vasc) Score

Here we will calculate the risk score for patients with **atrial fibrillation** (AF). 
Atrial fibrillation is a common abnormal heart rhythm that puts the patient at risk of stroke. (A stroke occurs when the blood flow to a region of the brain is cut off.) The $CHA_{2}DS_{2}-VASc$ model is used to calculate the one-year risk of stroke for patients with AF. The name comes from the conditions listed below. For each condition, the first number is the coefficient, the second is the patient's value for that condition (1 = yes, 0 = no), and the last is coefficient * value. These products are added to obtain the risk score.

<pre>
Condition                            Coeff  Value  Coeff * Value
C  - Congestive heart failure          1      0         0
H  - Hypertension                      1      1         1
A2 - Age 75 years or older             2      0         0
D  - Diabetes mellitus                 1      1         1
S2 - Stroke, TIA or TE                 2      0         0
V  - Vascular disease                  1      0         0
A  - Age 65 to 74                      1      1         1
Sc - Sex category (female)             1      0         0
                                                       ---
Score                                                   3
</pre>
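
The table above can be reproduced in a few lines of Python. This is a minimal sketch of the arithmetic only; the dictionary keys and the function name `chads_vasc_score` are illustrative names chosen here, not part of any library.

```python
# Coefficients copied from the table above; the key names are illustrative.
CHADS_VASC_COEFFS = {
    "congestive_heart_failure": 1,
    "hypertension": 1,
    "age_75_or_older": 2,
    "diabetes_mellitus": 1,
    "stroke_tia_te": 2,
    "vascular_disease": 1,
    "age_65_to_74": 1,
    "female": 1,
}


def chads_vasc_score(patient):
    """Sum coefficient * value over all conditions (value: 1 = yes, 0 = no).

    Conditions missing from the patient dictionary are treated as 0.
    """
    return sum(coeff * patient.get(name, 0)
               for name, coeff in CHADS_VASC_COEFFS.items())


# The example patient from the table: hypertension, diabetes, age 65 to 74.
example_patient = {"hypertension": 1, "diabetes_mellitus": 1, "age_65_to_74": 1}
print(chads_vasc_score(example_patient))  # 3
```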

### Calculating the Model for End-stage liver disease (MELD) Score

This score gives an estimate of the 3-month mortality for patients older than 12 who are on liver transplant waiting lists. It is one factor in determining how quickly a patient can get a liver transplant. 

Let's calculate the score for a 50-year-old woman with the following lab results:
*   Creatinine = 1.0 mg/dL
*   Bilirubin total = 2.0 mg/dL
*   INR = 1.1

<pre>
                     Coeff    Value        Coeff * Value
ln(Creatinine)       0.957    ln(1.0)            0
ln(Bilirubin total)  0.378    ln(2.0)           0.26 
ln(INR)              1.120    ln(1.1)           0.11
Intercept            0.643       1              0.643
-----------------------------------------------------------
Score = Sum * 10                     1.01 x 10 = 10.1, rounded to 10
</pre>

This score of 10 does not directly give the probability of survival at 3 months, but it is informative when compared with the MELD scores of other patients. 
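
The same calculation can be sketched in Python. This only reproduces the arithmetic of the table above (coefficient times the natural log of each lab value, plus the intercept, times 10); the function name `meld_score` is an illustrative choice, and any additional rules a clinical calculator might apply are not modeled here.

```python
import math

# Coefficients and intercept copied from the table above.
MELD_COEFFS = {"creatinine": 0.957, "bilirubin_total": 0.378, "inr": 1.120}
MELD_INTERCEPT = 0.643


def meld_score(creatinine, bilirubin_total, inr):
    """Return 10 * (intercept + sum of coefficient * ln(lab value))."""
    labs = {"creatinine": creatinine, "bilirubin_total": bilirubin_total, "inr": inr}
    linear = MELD_INTERCEPT + sum(MELD_COEFFS[name] * math.log(value)
                                  for name, value in labs.items())
    return 10 * linear


# Lab results of the example patient: creatinine 1.0, bilirubin 2.0, INR 1.1.
score = meld_score(creatinine=1.0, bilirubin_total=2.0, inr=1.1)
print(round(score, 1))  # 10.1
print(round(score))     # 10
```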






[Check out this notebook to see how to calculate risk scores in Python](https://github.com/GabeMaldonado/AIforMedicine/blob/master/AIforMed_C2_W1_Lab2.ipynb)

[Combining Features in pandas](https://github.com/GabeMaldonado/AIforMedicine/blob/master/AIforMed_C2_W1_Lab3.ipynb)

## Evaluating Prognostic Models

The basic idea behind evaluating a prognostic model is to see how well it performs on pairs of patients. To evaluate the risk scores, we need to know whether each patient actually had the event. Suppose the event is death within 10 years, and patient **A** died within 10 years but patient **B** did not. A good prognostic model should give patient A a higher risk score than patient B. 
In general, when the patient with the worse outcome has the higher risk score, the pair is called ***concordant***. When the patient with the worse outcome has the lower risk score, the pair is called ***not concordant*** or ***discordant***. Pairs in which both patients have the same score are called **risk ties**.
A pair where the outcomes are different is called a **permissible** pair. Only such pairs can be used to evaluate prognostic models. 

#### Evaluating the prognostic model:

#### C-Index
*   +1 for a permissible pair that is concordant
*   +0.5 for a permissible pair that is a risk tie
*   +0 for a permissible pair that is discordant

$$ \text{C-index} = \frac{\#\,\text{concordant pairs} + 0.5 \times \#\,\text{risk ties}}{\#\,\text{permissible pairs}} $$

### C-Index Interpretation 

$$ P(\text{score}(A) > \text{score}(B) \mid Y_{A} > Y_{B}) $$
*   Random Model score = 0.5
*   Perfect model score = 1.0
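
As a minimal sketch of the C-index defined above, the function below loops over all pairs of patients, keeps the permissible ones, and counts concordant pairs and risk ties. The function name `c_index`, the 0/1 outcome encoding (1 = had the event), and the example data are assumptions made for illustration.

```python
from itertools import combinations


def c_index(outcomes, scores):
    """C-index for binary outcomes (1 = event, 0 = no event) and risk scores."""
    concordant = 0
    risk_ties = 0
    permissible = 0
    for i, j in combinations(range(len(outcomes)), 2):
        if outcomes[i] == outcomes[j]:
            continue  # same outcome: the pair is not permissible
        permissible += 1
        # Identify which patient of the pair had the worse outcome.
        worse, better = (i, j) if outcomes[i] > outcomes[j] else (j, i)
        if scores[worse] > scores[better]:
            concordant += 1
        elif scores[worse] == scores[better]:
            risk_ties += 1
    if permissible == 0:
        raise ValueError("no permissible pairs: all outcomes are identical")
    return (concordant + 0.5 * risk_ties) / permissible


# Hypothetical data: 4 permissible pairs, 2 concordant, 1 risk tie, 1 discordant.
outcomes = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.4, 0.6]
print(c_index(outcomes, scores))  # 0.625
```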


[Calculating C-Index Lab](https://github.com/GabeMaldonado/AIforMedicine/blob/master/AIforMed_C2_W1_Lab4.ipynb)

## Decision Trees for Prognosis

We can use Decision Trees to build machine learning models. Decision trees are extremely useful in medical applications due to their ability to handle both continuous and categorical data, their interpretability, and the speed at which we can train them. 

ML models are often considered black boxes due to their complex inner workings, but in medicine the ability to explain and interpret a model may be critical for human acceptance and trust. 
We will build a prognostic model that uses age and systolic blood pressure to predict 10-year mortality risk. 

*   Systolic blood pressure is the pressure in your blood vessels when your heart beats. 

In this case, the decision tree divides the input space into three regions, each labeled high-risk or low-risk, using vertical and horizontal boundaries. The classifier can also be represented as a tree with an if-then structure: the decision tree asks a series of questions and classifies the patient based on the answers. 

**How is a decision tree built?**
At a high level:
*   Pick a variable and a value of that variable that partitions the data so that one partition contains mostly red points and the other contains mostly blue points, where the two colors mark the two outcome classes. We pick the variable and the value based on how well they split the data, blues to one side and reds to the other. 
*   Repeat the same process within each region until every partition is mostly red or mostly blue. 
*   Estimate the risk in each partition, for example as the fraction of patients in that partition who had the event. 
*   Binarize the output so that the model reports only whether an area is low-risk or high-risk. We can call a prediction high-risk if the predicted probability is greater than 50% and low-risk otherwise. 
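
The steps above can be sketched for a single split. The code below is a simplified illustration, not the splitting criterion of any particular library. It scores a candidate split by the number of patients that would be misclassified if each side predicted its majority class, which is an assumption made here for readability. The data, with rows of (age, systolic BP) and label 1 meaning the patient died within 10 years, is hypothetical.

```python
def best_split(rows, labels):
    """Return (feature_index, threshold, errors) for the best single split.

    A split sends rows with value <= threshold to the left side and the rest
    to the right side; errors counts the minority-class patients on each side.
    """
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must leave patients on both sides
            errors = sum(min(sum(side), len(side) - sum(side))
                         for side in (left, right))
            if best is None or errors < best[2]:
                best = (feature, threshold, errors)
    return best


def estimate_risk(labels):
    """Risk in a partition: the fraction of its patients who had the event."""
    return sum(labels) / len(labels)


def binarize(risk, cutoff=0.5):
    """Turn a risk estimate into a high-risk / low-risk label."""
    return "high-risk" if risk > cutoff else "low-risk"


# Hypothetical patients: (age, systolic BP); label 1 = died within 10 years.
rows = [(45, 120), (50, 130), (55, 125), (62, 150), (68, 140), (70, 135), (75, 160)]
labels = [0, 0, 0, 1, 0, 1, 1]

feature, threshold, errors = best_split(rows, labels)
older = [y for row, y in zip(rows, labels) if row[feature] > threshold]
younger = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
print(feature, threshold, errors)                           # 0 55 1
print(estimate_risk(older), binarize(estimate_risk(older)))      # 0.75 high-risk
print(estimate_risk(younger), binarize(estimate_risk(younger)))  # 0.0 low-risk
```

A full tree would apply `best_split` again inside each resulting partition until a stopping rule is reached.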

### Challenges when building Decision Trees

One major challenge when building decision trees is that if we don't stop growing them, they keep creating more and more partitions and can become overly complex. Such trees can fit the training data almost perfectly, which turns out to be a bad thing known as ***overfitting***: the model fits the training data so well that it is unable to generalize to other samples or real-world data. 
One way to combat overfitting is to control when we stop growing the tree, for example by setting the maximum depth the tree can grow to. Another popular way to combat overfitting is by building a ***Random Forest***. 

### Random Forests 

Random forests construct multiple decision trees and average their risk predictions. 
How is a random forest trained? There are two key concepts. 
1.   Each tree in the forest is constructed using a random sample of the patients. For instance, for the first tree we might draw P1, P2, and P1 again, which can happen because the random forest samples with replacement.
2.   The random forest algorithm also modifies the splitting procedure used to construct each decision tree so that it considers only a random subset of the features when creating decision boundaries. 
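
The two sources of randomness, and the averaging of tree predictions, can be sketched with the standard library alone. This is an illustration of the ideas rather than a full random forest; the function names and the feature names `"bmi"` and `"sex"` are hypothetical.

```python
import random


def bootstrap_sample(patient_ids, rng):
    """Draw as many patients as there are in the data, with replacement."""
    return [rng.choice(patient_ids) for _ in patient_ids]


def random_feature_subset(features, k, rng):
    """Pick k distinct features to consider when creating a decision boundary."""
    return rng.sample(features, k)


def average_risk(tree_predictions):
    """The forest's risk prediction is the average of the trees' predictions."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(0)  # seeded only so that repeated runs give the same draws
print(bootstrap_sample(["P1", "P2", "P3"], rng))   # may contain repeated patients
print(random_feature_subset(["age", "bp", "bmi", "sex"], 2, rng))
print(average_risk([0.2, 0.4, 0.6]))
```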

Random forests generally improve performance over single trees. For example, a single decision tree might reach a test accuracy of 0.71, while a random forest of 100 trees might reach 0.76. 

Random forests are called an ***ensemble*** learning method because they use multiple decision trees to obtain better prediction performance than could be obtained from any of the decision trees alone. 

There are other popular ensemble algorithms, including **Gradient Boosting, XGBoost, and LightGBM**, which can also achieve high performance when working with structured data in medicine and other domains. 



[Link for W2 Lab1, which covers Missing Values, Imputation, Masks, and Decision Trees](https://github.com/GabeMaldonado/AIforMedicine/blob/master/AIforMed_C2_W2_Lab1.ipynb)

## Survival Data

To model survival, we need to represent the data in a form that we can process. The primary challenge here is censored data, which is a particular form of missing data. 
In healthcare, missing data is a common and important issue. Let's imagine we are working with incomplete patient data. If we ran these data through a 'regular' machine learning pipeline to create a prognostic model, we would:
*   Create a train/test split
*   Exclude missing data, drop NaNs / rows
*   Create and run/train the model
*   Evaluate model using the test set

Let's say the model achieves a train accuracy of 0.87 and a test accuracy of 0.84. The accuracy on both sets is relatively high. 

Now let's say that we get a new test set, collected using the same method as the previous one, but containing no missing values. If we run this new test set through our random forest model, it might achieve a low accuracy of, say, 0.61. Why does this happen, considering the model performed well on the first sets?
This happens because the distributions of the old and new test sets are different. To check this, we can compare the distributions of the input variables in the two sets. If we find a discrepancy, for example the new data contains more patients under 40, then we should take a second look at the old data and compare it before and after dropping the rows with missing values. We might see that before dropping missing data, the set contained more patients under 40, so that dropping the missing values also eliminated the data for a lot of young patients. The data could be missing because, in the field, medical practitioners might not record blood pressure (BP) for young patients but do so for every older patient. We need to examine the data carefully before proceeding with the ML pipeline so we can avoid building a biased model. 
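
The effect described above can be checked with a small, made-up example: compare the share of patients under 40 before and after dropping rows with a missing BP value. The records and the helper name `fraction_under_40` are hypothetical.

```python
# Hypothetical records; None marks a BP value that was never recorded.
patients = [
    {"age": 25, "bp": None},
    {"age": 32, "bp": None},
    {"age": 38, "bp": 118},
    {"age": 45, "bp": 130},
    {"age": 58, "bp": 142},
    {"age": 67, "bp": 150},
]


def fraction_under_40(rows):
    """Share of patients younger than 40 in the given rows."""
    return sum(row["age"] < 40 for row in rows) / len(rows)


# Complete case analysis: keep only the rows with no missing values.
complete_cases = [row for row in patients if row["bp"] is not None]

print(fraction_under_40(patients))        # 0.5
print(fraction_under_40(complete_cases))  # 0.25
```

Here dropping the incomplete rows halves the share of young patients, so a model trained on the complete cases sees a different age distribution than the one it will face later.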

## Why is Data Missing?

To decide whether a complete case analysis (keeping only the rows with no missing values) would lead to bias, we need to understand ***why data are missing***. There are three categories of missing data:

1.   Missing Completely at Random -- The missingness does not depend on anything. For instance, a doctor could flip a coin to decide whether to record the BP for any patient, regardless of age or any other criterion. Here the probability of missing data is constant, let's say a 50% chance that the BP is not recorded for a patient. Data that is missing completely at random does not lead to a biased model. $$p(missing) = constant$$
2.   Missing at Random -- The missingness depends on an observed condition. For instance, the doctor might always record the BP if the patient is older than 40, but for patients younger than 40 use a coin flip as above to decide whether to record it. Here the probability of missing data is not constant, because the missingness is determined by a condition, in this case age. $$p(missing) \neq constant$$ 
$$p (missing | age < 40) = 0.5 \neq p(missing|age>40)=0 $$
3.   Missing Not at Random -- The missingness depends on information that is unobservable or unavailable in the data. For instance, on a particularly busy day at the doctor's office, the doctor might decide whether to record the BP by the flip of a coin, while on a day that is not busy there is plenty of time to record the BP for every patient. Whether the office was busy is not available in the data. 
$$p(missing) \neq constant$$ 
$$p (missing | busy) = 0.5 \neq p(missing|not \ busy)=0 $$
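
The three mechanisms can be simulated with coin flips, using the probabilities from the examples above. This is only a sketch; the function names are illustrative, and treating a patient aged exactly 40 as "always recorded" is an assumption, since the text only covers ages below and above 40.

```python
import random


def bp_missing_mcar(rng):
    """Missing completely at random: a coin flip for every patient."""
    return rng.random() < 0.5


def bp_missing_mar(age, rng):
    """Missing at random: coin flip under 40, always recorded otherwise."""
    return age < 40 and rng.random() < 0.5


def bp_missing_mnar(busy, rng):
    """Missing not at random: coin flip on busy days, always recorded otherwise.

    Whether the day was busy is not part of the patient data.
    """
    return busy and rng.random() < 0.5


rng = random.Random(0)  # seeded only so that repeated runs give the same draws
n = 10_000
print(sum(bp_missing_mcar(rng) for _ in range(n)) / n)      # close to 0.5
print(sum(bp_missing_mar(30, rng) for _ in range(n)) / n)   # close to 0.5
print(sum(bp_missing_mar(60, rng) for _ in range(n)) / n)   # exactly 0.0
print(sum(bp_missing_mnar(False, rng) for _ in range(n)) / n)  # exactly 0.0
```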




## Imputation

An alternative to complete case analysis is to fill in, or **impute**, the missing values. Imputation replaces missing data with an estimated value based on other available information. Two primary imputation methods are:
*   Mean Imputation -- replaces missing values with the mean of that feature. Note that for the test set, we should also replace missing values with the mean computed on the training set, because the test set is smaller and may not be representative of all of the data. Keep in mind that mean imputation does not preserve the relationship between the variables: when BP is plotted against age, all the imputed values lie on a horizontal line at the mean BP, regardless of age. 

*   Regression Imputation -- learns a linear model of the form 
$$BP =  coefficient_{age} \times age + offset$$
and replaces each missing value with the output of that linear function. Let's say that the linear equation we have is:
$$BP= 0.6 \times age + 115$$
We'd replace the missing value for a patient of age $57$ with:
$$BP = 0.6 \times 57 + 115 = 149.2 \approx 149$$
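
Both methods can be sketched in plain Python. The BP values below are made up, the function names are illustrative, and the regression coefficients are simply the ones from the example equation above rather than values fitted to data.

```python
def mean_impute(values, train_mean):
    """Replace missing values (None) with the mean computed on the training set."""
    return [train_mean if value is None else value for value in values]


def regression_impute_bp(age, coef_age=0.6, offset=115):
    """Estimate a missing BP from age using the example linear equation."""
    return coef_age * age + offset


# Hypothetical BP columns; None marks a missing value.
train_bp = [120, 130, None, 140]
test_bp = [None, 125]

# The mean comes from the observed training values only.
observed = [value for value in train_bp if value is not None]
train_mean = sum(observed) / len(observed)

print(train_mean)                          # 130.0
print(mean_impute(train_bp, train_mean))   # [120, 130, 130.0, 140]
print(mean_impute(test_bp, train_mean))    # [130.0, 125]
print(round(regression_impute_bp(57), 1))  # 149.2
```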


### Right Censoring
With right censoring, the time to an event is only known to exceed a certain value. There are two types of right censoring:
*   End-of-study censoring -- the patient reaches the end of the study without having had the event
*   Loss-to-follow-up censoring -- the patient drops out before the end of the study 
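
One common way to represent right-censored data is as a pair of the observed time and a flag saying whether the event was actually seen. The sketch below assumes that representation; the function name `observe`, the time unit (months), and the example numbers are hypothetical.

```python
def observe(event_time, study_end, dropout_time=None):
    """Return (observed_time, event_observed) for one patient.

    event_time is None if the patient never has the event; dropout_time is
    None if the patient stays in the study until study_end.
    """
    last_seen = study_end if dropout_time is None else min(dropout_time, study_end)
    if event_time is not None and event_time <= last_seen:
        return event_time, True   # the event was observed
    return last_seen, False       # right-censored: the time exceeds last_seen


print(observe(event_time=12, study_end=36))                     # (12, True)
print(observe(event_time=None, study_end=36))                   # (36, False), end of study
print(observe(event_time=None, study_end=36, dropout_time=20))  # (20, False), lost to follow-up
print(observe(event_time=50, study_end=36))                     # (36, False), event after study end
```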


## Hazard Functions

### Survival Probability:

$$S(t) = Pr(T > t)$$

### Survival to Hazard
Hazard, represented by the Greek letter lambda $\lambda$, is the risk of death at age $t$, given survival up to $t$:
$$\lambda(t) = Pr(T = t \mid T \geq t)$$
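
Both quantities can be estimated from a list of event times. The sketch below assumes integer event times starting at 1 and no censoring, which is a simplification made here, and the data is made up. Under these assumptions the two are linked by $S(t) = \prod_{s \leq t} (1 - \lambda(s))$, which the last function checks numerically.

```python
def survival(times, t):
    """Estimate S(t) = Pr(T > t) as the fraction of patients with T > t."""
    return sum(T > t for T in times) / len(times)


def hazard(times, t):
    """Estimate lambda(t) = Pr(T = t | T >= t) among patients still at risk."""
    at_risk = sum(T >= t for T in times)
    if at_risk == 0:
        raise ValueError("no patients at risk at time t")
    return sum(T == t for T in times) / at_risk


def survival_from_hazard(times, t):
    """Rebuild S(t) as the product of (1 - lambda(s)) for s = 1, ..., t."""
    result = 1.0
    for s in range(1, t + 1):
        result *= 1 - hazard(times, s)
    return result


# Hypothetical event times with no censoring.
times = [1, 2, 2, 3, 5]
print(survival(times, 2))              # 0.4
print(hazard(times, 2))                # 0.5
print(survival_from_hazard(times, 2))  # approximately 0.4
```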

