# Predicting a COVID-19 patient's risk of death
This notebook uses various Python-based machine learning and data science libraries to build a machine learning model that predicts, given a COVID-19 patient's current symptoms, status, and medical history, whether the patient is at high risk or not.


We're going to take the following approach:
1. Problem Definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Experimentation

## 1. Problem Definition
In a statement,
> Given clinical parameters about a patient, can we predict whether or not they are at high risk?

## 2. Data
> The original data came from Kaggle. https://www.kaggle.com/datasets/meirnizri/covid19-dataset

## 3. Evaluation
> If we can reach 90% accuracy or more at predicting whether or not a patient is at high risk during the proof of concept, that would be a good result.

## 4. Features
* sex: 1 for female and 2 for male.
* age: of the patient.
* classification: covid test findings.
  * Values
    * 1-3 mean that the patient was diagnosed with covid in different degrees.
    * 4 or higher means that the patient is not a carrier of covid or that the test is inconclusive.
* patient type: type of care the patient received in the unit.
  * 1 for returned home
  * 2 for hospitalization.
* pneumonia: whether the patient already has air sac inflammation or not.
* pregnancy: whether the patient is pregnant or not.
* diabetes: whether the patient has diabetes or not.
* copd: whether the patient has chronic obstructive pulmonary disease or not.
* asthma: whether the patient has asthma or not.
* inmsupr: whether the patient is immunosuppressed or not.
* hypertension: whether the patient has hypertension or not.
* cardiovascular: whether the patient has a heart or blood vessel related disease or not.
* renal chronic: whether the patient has chronic renal disease or not.
* other disease: whether the patient has another disease or not.
* obesity: whether the patient is obese or not.
* tobacco: whether the patient is a tobacco user or not.
* usmr: whether the patient was treated in a medical unit of the first, second or third level.
* medical unit: type of institution of the National Health System that provided the care.
* intubed: whether the patient was connected to a ventilator or not.
* icu: whether the patient was admitted to an Intensive Care Unit or not.
* date died: the date of death if the patient died, and 9999-99-99 otherwise.

## Preparing the tools
We're going to use pandas, NumPy, Matplotlib and seaborn for data analysis and visualization, and scikit-learn for modelling.

In [None]:
# Regular EDA (exploratory data analysis) and plotting libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
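import seaborn as sns
# The bare "seaborn" Matplotlib style name was deprecated in Matplotlib 3.6,
# so let seaborn apply its own default theme instead
sns.set_theme()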

# Models from sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Model Evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import RocCurveDisplay, ConfusionMatrixDisplay

## Load Data

In [None]:
df = pd.read_csv("drive/MyDrive/Data/Covid Data.csv")
df.head()

## Data Exploration (exploratory data analysis or EDA)

The goal here is to find out more about the data and become a subject matter expert on the dataset you're working with.

1. What question(s) are you trying to solve?
2. What kind of data do we have and how do we treat the different types?
3. What's missing from the data and how do you deal with it?
4. Where are the outliers and why should you care about them?
5. How can you add, change or remove features to get more out of your data?

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
# Finding the number of unique values by column
for column in df.columns:
  print(column,"=>\t",len(df[column].unique()))

### Data Preprocessing

Note:
* Some features that we expect to have just 2 unique values actually have 3 or 4. For example, the feature "PNEUMONIA" has 3 unique values (1, 2, 99), where 99 represents missing data. Hence we will keep only the rows whose values are 1 or 2 (values such as 97 and 99 mark missing data).

* In the "DATE_DIED" column, we have 971633 "9999-99-99" values, which represent patients who survived, so I will turn this feature into a "DEATH" feature that records whether the patient died or not.
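
The row filtering described above can be sketched on a toy frame. This is only an illustration under assumptions: the `toy` data and the `binary_columns` list are hypothetical, and the real dataset has many more columns. It also shows why the filter should be limited to the yes/no columns, since 97 and 99 are perfectly valid ages.

```python
import pandas as pd

# Toy frame using the dataset's coding: 1 = yes, 2 = no, 97/98/99 = missing.
toy = pd.DataFrame({
    "PNEUMONIA": [1, 2, 99, 2, 1],
    "DIABETES":  [2, 98, 1, 2, 1],
    "AGE":       [34, 97, 61, 45, 99],  # here 97 and 99 are real ages, not codes
})

# Hypothetical subset of yes/no columns, chosen for illustration only.
binary_columns = ["PNEUMONIA", "DIABETES"]

# Keep a row only if every yes/no column holds a valid code (1 or 2).
valid_rows = toy[binary_columns].isin([1, 2]).all(axis=1)
cleaned = toy[valid_rows]

print(cleaned)
```

Filtering with one boolean mask over a list of columns gives the same rows as applying one filter per column, and keeps the list of affected columns in a single place.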

In [None]:
# Make a copy of the original dataframe
df_tmp  = df.copy()

In [None]:
# Finding the columns that contain the missing-value codes 97, 98 or 99
for column in df_tmp.columns:
  if df_tmp[column].isin([97, 98, 99]).any():
    print(column)

In [None]:
df_tmp = df_tmp[(df_tmp.PNEUMONIA == 1) | (df_tmp.PNEUMONIA == 2)]
df_tmp = df_tmp[(df_tmp.DIABETES == 1) | (df_tmp.DIABETES == 2)]
df_tmp = df_tmp[(df_tmp.COPD == 1) | (df_tmp.COPD == 2)]
df_tmp = df_tmp[(df_tmp.ASTHMA == 1) | (df_tmp.ASTHMA == 2)]
df_tmp = df_tmp[(df_tmp.INMSUPR == 1) | (df_tmp.INMSUPR == 2)]
df_tmp = df_tmp[(df_tmp.HIPERTENSION == 1) | (df_tmp.HIPERTENSION == 2)]
df_tmp = df_tmp[(df_tmp.OTHER_DISEASE == 1) | (df_tmp.OTHER_DISEASE == 2)]
df_tmp = df_tmp[(df_tmp.CARDIOVASCULAR == 1) | (df_tmp.CARDIOVASCULAR == 2)]
df_tmp = df_tmp[(df_tmp.OBESITY == 1) | (df_tmp.OBESITY == 2)]
df_tmp = df_tmp[(df_tmp.RENAL_CHRONIC == 1) | (df_tmp.RENAL_CHRONIC == 2)]
df_tmp = df_tmp[(df_tmp.TOBACCO == 1) | (df_tmp.TOBACCO == 2)]

* If we plot the PREGNANT column, we see that all "97" values belong to males, and males cannot be pregnant, so we will convert 97 to 2.

In [None]:
pd.crosstab(df_tmp.PREGNANT, df_tmp.SEX).plot(kind="bar", 
                                              figsize=(10, 6),
                                              cmap="tab10");
plt.title("Pregnant according to Sex")
plt.xlabel("1 = Pregnant, 2 = Not pregnant, 97/98 = Missing")
plt.legend(["Female", "Male"])
plt.xticks(rotation=0);

In [None]:
# Converting 97 to 2
df_tmp.PREGNANT = df_tmp.PREGNANT.replace(97, 2)

# Getting rid of missing values (98)
df_tmp = df_tmp[(df_tmp.PREGNANT==1) | (df_tmp.PREGNANT==2)]
df_tmp.PREGNANT.value_counts()
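
A slightly more defensive variant of this conversion recodes 97 only where the patient is actually male, so that an unexpected 97 on a female row would be dropped as missing rather than silently turned into "not pregnant". The sketch below uses a hypothetical toy frame and is an alternative formulation, not the cell used above.

```python
import pandas as pd

toy = pd.DataFrame({
    "SEX":      [1, 2, 2, 1, 1],     # 1 = female, 2 = male
    "PREGNANT": [1, 97, 97, 98, 2],  # 97/98 = missing codes
})

# Recode 97 to 2 ("not pregnant") only on male rows.
male_with_97 = (toy["SEX"] == 2) & (toy["PREGNANT"] == 97)
toy.loc[male_with_97, "PREGNANT"] = 2

# Drop whatever is still outside the valid codes (here, the row with 98).
toy = toy[toy["PREGNANT"].isin([1, 2])]

print(toy)
```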

In [None]:
# Make the label from date-column
df_tmp["DEATH"] = [2 if date=="9999-99-99" else 1 for date in df_tmp.DATE_DIED]
df_tmp.DEATH.value_counts()

* The "INTUBED" and "ICU" features have too many missing values, so I will drop them. We also don't need the "DATE_DIED" column anymore because it has been converted into the "DEATH" feature.


In [None]:
df_tmp.drop(columns=["INTUBED","ICU","DATE_DIED"], inplace=True)

In [None]:
# Finding the number of unique values by column
for column in df_tmp.columns:
  print(column,"=>\t",len(df_tmp[column].unique()))

### Data Visualization

In [None]:
pd.crosstab(df_tmp.DEATH, df_tmp.SEX).plot(kind="bar", 
                                                       figsize=(10, 6), 
                                                       cmap="tab10")
plt.title("Death Frequency according to Sex")
plt.xlabel("1 = Death, 2 = Alive")
plt.ylabel("Amount")
plt.legend(["Female", "Male"])
plt.xticks(rotation=0);

In [None]:
df.AGE.plot(kind="hist",
            figsize=(10, 6),
            cmap="tab10");

In [None]:
pd.crosstab(df_tmp.DEATH, df_tmp.PATIENT_TYPE).plot(kind="bar", 
                                                       figsize=(10, 6), 
                                                       cmap="tab10")
plt.title("Death Frequency according to Patient type")
plt.xlabel("1 = Death, 2 = Alive")
plt.ylabel("Amount")
plt.legend(["returned home", "hospitalization"])
plt.xticks(rotation=0);

In [None]:
corr_matrix = df_tmp.corr()
fig, ax = plt.subplots(figsize=(20, 12))
ax = sns.heatmap(corr_matrix,
                 annot=True,
                 linewidths=0.5,
                 fmt=".2f",
                 cmap="YlGnBu")

In [None]:
# Drop the features that have a low correlation with the "DEATH" feature.
unrelevant_columns = ["SEX","PREGNANT","COPD","ASTHMA","INMSUPR","OTHER_DISEASE","CARDIOVASCULAR",
                      "OBESITY","TOBACCO"]

df_tmp.drop(columns=unrelevant_columns,inplace=True)
df_tmp.info()

Note:
* We got good accuracy with Logistic Regression.
* But accuracy can mislead us, so we have to check the other metrics.
* The F1 score shows that we predicted the patients who survived well, but we can't say the same for the patients who died.
* The confusion matrix shows the same thing. This problem is caused by the imbalanced dataset.
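
To see why accuracy alone can mislead on imbalanced data, consider a hypothetical "model" that always predicts the majority class. The labels below are made up for illustration (5 deaths out of 100 patients) and use the notebook's coding, where 1 means died and 2 means survived.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical labels: 5 patients died (1), 95 survived (2).
y_true = np.array([1] * 5 + [2] * 95)

# A trivial "model" that always predicts the majority class.
y_pred = np.full_like(y_true, 2)

acc = accuracy_score(y_true, y_pred)
# Treat "died" (1) as the positive class; zero_division=0 silences the
# warning raised when no positive predictions are made.
rec = recall_score(y_true, y_pred, pos_label=1, zero_division=0)
f1 = f1_score(y_true, y_pred, pos_label=1, zero_division=0)

print(f"accuracy={acc:.2f}, recall={rec:.2f}, f1={f1:.2f}")
```

The accuracy looks high even though not a single death is detected, which is why recall and F1 for the minority class are worth checking.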

In [None]:
ax = sns.countplot(x=df_tmp.DEATH,
                   palette=sns.color_palette("YlGnBu"))
plt.bar_label(ax.containers[0])
plt.title("Death Distribution");

### How To Solve the Imbalanced Dataset Problem
* Collecting more data
* Changing the performance metrics
* Resampling (undersampling or oversampling)
* Changing the algorithm
* Penalized models, etc.

We're going to use undersampling for this case because we already have a large number of patients.
* Undersampling: a technique to balance uneven datasets by keeping all of the data in the minority class and decreasing the size of the majority class.
* If we used oversampling, the number of rows would increase, which would be too many rows for the computer to handle.
* If I can't solve the problem with undersampling, I will try the other options.
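
The idea behind random undersampling can be sketched with pandas alone. This is a minimal illustration on a hypothetical toy frame, not the implementation used by imbalanced-learn: every class is randomly reduced to the size of the smallest class, so the minority class is kept in full.

```python
import pandas as pd

toy = pd.DataFrame({
    "AGE":   [30, 45, 60, 72, 80, 25, 51, 66, 39, 58],
    "DEATH": [1, 1, 2, 2, 2, 2, 2, 2, 2, 2],  # 1 = died, 2 = survived
})

# Size of the smallest class (here: 2 deaths).
minority_size = toy["DEATH"].value_counts().min()

# Randomly keep `minority_size` rows from each class.
balanced = (
    toy.groupby("DEATH", group_keys=False)
       .sample(n=minority_size, random_state=42)
)

print(balanced["DEATH"].value_counts())
```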

In [None]:
from imblearn.under_sampling import RandomUnderSampler

# Split data into x & y
x, y = df_tmp.drop("DEATH", axis=1), df_tmp["DEATH"]

rus = RandomUnderSampler(random_state=42)
x_resampled, y_resampled = rus.fit_resample(x, y)

In [None]:
ax = sns.countplot(x=y_resampled,
                   palette=sns.color_palette("YlGnBu"))
plt.bar_label(ax.containers[0])
plt.title("Death Distribution");

## 5. Modelling

In [None]:
# Create a function to split the data into train, validation and test sets
def train_val_test_split(X, 
                          y,
                          frac_train=0.6, 
                          frac_val=0.15, 
                          frac_test=0.25,
                          random_state=42):
    '''
    Splits features and target into three subsets (train, val, and test)
    following fractional ratios provided by the user, where each subset is
    stratified by the values of the target y (that is, each subset has
    the same relative frequency of the target values). It performs this
    splitting by running train_test_split() twice.

    Parameters
    ----------
    X : Features
    y : Target
    frac_train : float
    frac_val   : float
    frac_test  : float
        The ratios with which the dataframe will be split into train, val, and
        test data. The values should be expressed as float fractions and should
        sum to 1.0.
    random_state : int, None, or RandomStateInstance
        Value to be passed to train_test_split().

    Returns
    -------
    X_train, y_train, X_val, y_val, X_test, y_test :
        Dataframes containing the three splits.
    '''

    # Compare with a tolerance: exact float equality can fail for valid fractions.
    if not np.isclose(frac_train + frac_val + frac_test, 1.0):
        raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
                         (frac_train, frac_val, frac_test))

    # Split original dataframe into train and temp dataframes.
    X_train, X_temp, y_train, y_temp = train_test_split(X,
                                                        y,
                                                        stratify=y,
                                                        test_size=(1.0 - frac_train),
                                                        random_state=random_state)

    # Split the temp dataframe into val and test dataframes,
    # stratifying again so both keep the class balance.
    relative_frac_test = frac_test / (frac_val + frac_test)
    X_val, X_test, y_val, y_test = train_test_split(X_temp,
                                                    y_temp,
                                                    stratify=y_temp,
                                                    test_size=relative_frac_test,
                                                    random_state=random_state)

    return X_train, y_train, X_val, y_val, X_test, y_test
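
The arithmetic behind the two-stage split can be checked on synthetic data. The sketch below assumes a hypothetical balanced dataset of 1000 rows and the default fractions: the first split holds out 40% of the rows, and the second split sends 0.25 / (0.15 + 0.25) = 0.625 of that held-out part to the test set, which gives about 60% / 15% / 25% overall.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 rows, 2 features, perfectly balanced labels.
X = np.arange(2000).reshape(1000, 2)
y = np.array([1, 2] * 500)

frac_train, frac_val, frac_test = 0.6, 0.15, 0.25

# Stage 1: train vs. everything else (val + test).
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, stratify=y, test_size=1.0 - frac_train, random_state=42)

# Stage 2: split the remainder into val and test.
relative_frac_test = frac_test / (frac_val + frac_test)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, stratify=y_temp,
    test_size=relative_frac_test, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```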

In [None]:
len(x_resampled)

In [None]:
# Spliting the data into train, validation and test
X_train, y_train, X_val, y_val, X_test, y_test = train_val_test_split(X=x_resampled, y=y_resampled)

X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape, y_test.shape

Now that we've got our data split into training, validation and test sets, it's time to build a machine learning model.

We'll train it (find the patterns) on the training set.

And we'll test it (use the patterns) on the test set.

We're going to try 4 different machine learning models:
1. Logistic Regression
2. K-Nearest Neighbours classifier
3. RandomForestClassifier
4. Support Vector Machine

In [None]:
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "RandomForestClassifier": RandomForestClassifier(),
    "SVC": SVC(kernel="rbf")  # fixed typo: the parameter is `kernel`, not `kernal`
}

def fit_and_score(models, x_train, x_test, y_train, y_test):
  """
  Fits and evaluates given machine learning models.
  models : a dict of different Scikit-learn machine learning models
  x_train : training data (no labels)
  x_test : testing data (no labels)
  y_train : training labels
  y_test : test labels
  """

  # Setup random seed
  np.random.seed(42)

  # Make a dictionary to keep models score
  model_scores = {}

  # Loop through models
  for model_name, model in models.items():
    # Fit the model to the data
    clf = model.fit(x_train, y_train)
  
    # Evaluate the model and append its score to model_score
    model_scores[model_name] = clf.score(x_test, y_test)

  return model_scores

In [None]:
models_score = fit_and_score(models, X_train, X_test, y_train, y_test)
models_score

### Model Comparison

In [None]:
model_compare = pd.DataFrame(models_score, index=["accuracy"])

model_compare.T.plot(kind="bar",
                     figsize=(10, 6),
                     cmap="tab10")

plt.xticks(rotation=0);

* As we can see, the SVC model did not perform well compared to the other models, so we eliminate it.

Now we've got our baseline models... and we know a model's first predictions aren't always what we should base our next steps on. What should we do?

Let's look at the following:
* Hyperparameter tuning
* Feature importance
* Confusion matrix 
* Cross-validation
* Precision 
* Recall
* F1 score
* Classification report
* ROC curve
* Area under the curve (AUC)

## Hyperparameter tuning with RandomizedSearchCV

We're going to tune:
* LogisticRegression()
* RandomForestClassifier()
* KNeighborsClassifier()

... using RandomizedSearchCV

In [None]:
# Create a hyperparameter grid for LogisticRegression
log_reg_grid = {
    'penalty' : ['l2'],
    'C' : np.logspace(-4, 4, 20),
    'solver' : ['lbfgs','newton-cg','liblinear','sag','saga'],
}

# Create a hyperparameter grid for RandomForestClassifier
rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "max_features": ["sqrt", "log2", None],
           "max_depth": [3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2)}

# Create a hyperparameter grid for KNN
knn_grid = {'n_neighbors' : np.arange(1, 30, 1),
            'leaf_size': np.arange(1, 50, 1),
            'weights' : ['uniform','distance'],
            'p': [1, 2],
            'metric' : ['minkowski','euclidean','manhattan']}

Now that we've got hyperparameter grids set up for each of our models,
let's tune them using RandomizedSearchCV.
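
Before running a randomized search it can help to know how large each grid is, because `n_iter` only samples that many combinations. The sketch below repeats the three grids defined above and counts their combinations with scikit-learn's `ParameterGrid`; the `grids` dictionary is just a helper name for this illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

log_reg_grid = {
    "penalty": ["l2"],
    "C": np.logspace(-4, 4, 20),
    "solver": ["lbfgs", "newton-cg", "liblinear", "sag", "saga"],
}
rf_grid = {
    "n_estimators": np.arange(10, 1000, 50),
    "max_features": ["sqrt", "log2", None],
    "max_depth": [3, 5, 10],
    "min_samples_split": np.arange(2, 20, 2),
    "min_samples_leaf": np.arange(1, 20, 2),
}
knn_grid = {
    "n_neighbors": np.arange(1, 30, 1),
    "leaf_size": np.arange(1, 50, 1),
    "weights": ["uniform", "distance"],
    "p": [1, 2],
    "metric": ["minkowski", "euclidean", "manhattan"],
}

# Number of distinct hyperparameter combinations in each grid.
grids = {"log_reg": log_reg_grid, "rf": rf_grid, "knn": knn_grid}
grid_sizes = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}
print(grid_sizes)
```

With `n_iter=100`, the logistic regression search can cover its whole grid, while the random forest and KNN searches only try a small sample of the possible combinations.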

In [None]:
x_resampled.columns

In [None]:
# Setup RandomizedSearchCV for LogisticRegression
rs_log_reg = RandomizedSearchCV(LogisticRegression(max_iter=5000),
                                param_distributions=log_reg_grid,
                                n_iter=100,
                                cv=5,
                                verbose=1)

# Fit random hyperparameter search model for LogisticRegression
rs_log_reg.fit(X_val, y_val)

In [None]:
rs_log_reg.best_params_

In [None]:
# X_test_normal is never defined in this notebook; score on X_test like the other models
rs_log_score = rs_log_reg.score(X_test, y_test)
rs_log_score

In [None]:
# Setup RandomizedSearchCV for RandomForestClassifier
rs_rf = RandomizedSearchCV(RandomForestClassifier(),
                            param_distributions=rf_grid,
                            n_iter=100,
                            cv=5,
                            verbose=1)

# Fit random hyperparameter search model for RandomForestClassifier
rs_rf.fit(X_val, y_val)

In [None]:
rs_rf.best_params_

In [None]:
rf_score = rs_rf.score(X_test, y_test)
rf_score

In [None]:
# Setup RandomizedSearchCV for KNeighborsClassifier
rs_knn = RandomizedSearchCV(KNeighborsClassifier(),
                                param_distributions=knn_grid,
                                n_iter=100,
                                cv=5,
                                verbose=True)

# Fit random hyperparameter search model for KNeighborsClassifier
rs_knn.fit(X_val, y_val)

In [None]:
rs_knn.best_params_

In [None]:
knn_score = rs_knn.score(X_test, y_test)
knn_score

* Compare the tuned models

In [None]:
model_score = {
 'Logistic Regression': rs_log_score,
 'KNN': knn_score,
 'RandomForestClassifier': rf_score,
}

model_compare = pd.DataFrame(model_score, index=["accuracy"])
model_compare.T.plot(kind="bar",
                     figsize=(10, 6),
                     cmap="tab10")

plt.xticks(rotation=0);

## Hyperparameter tuning with GridSearchCV

Since our RandomForestClassifier model provides the best scores so far, we'll try and improve them again using GridSearchCV...

In [None]:
# Create a hyperparameter grid for RandomForestClassifier
rf_grid = {'bootstrap': [True],
            'max_depth': [80, 90, 100, 110],
            'max_features': [2, 3],
            'min_samples_leaf': [3, 4, 5],
            'min_samples_split': [8, 10, 12],
            'n_estimators': [100, 200, 300, 1000]}

# Setup grid hyperparameter search for RandomForestClassifier
gs_rf = GridSearchCV(estimator=RandomForestClassifier(),
                     param_grid=rf_grid,
                     cv=5,
                     verbose=True)

# Fit grid hyperparameter search model for RandomForestClassifier
gs_rf.fit(X_val, y_val)

In [None]:
gs_rf.best_params_

In [None]:
gs_rf.score(X_test, y_test)

## Evaluating our tuned machine learning classifier, beyond accuracy

* ROC curve and AUC Score
* Confusion Matrix
* Classification Report
* Precision
* Recall
* F1 Score

... and it would be great if cross-validation was used where possible.

To make comparisons and evaluate our trained model, first we need to make predictions.

In [None]:
# Creating a model with tuned model hyperparameters
clf = RandomForestClassifier(max_depth= 90,
                               max_features= 3,
                               min_samples_leaf= 4,
                               min_samples_split= 10,
                               n_estimators= 100)

clf.fit(X_train, y_train)

In [None]:
# Make predictions with new model
y_preds = clf.predict(X_test)
y_preds[:10]

In [None]:
# Plot ROC Curve and calculate AUC metrics
RocCurveDisplay.from_estimator(estimator=clf, X=X_test, y=y_test);

In [None]:
# confusion Matrix
print(confusion_matrix(y_true=y_test, y_pred=y_preds))

In [None]:
# Plot confusion matrix
ConfusionMatrixDisplay.from_estimator(estimator=clf, X=X_test, y=y_test);

* Now that we've got a ROC curve, an AUC metric and a confusion matrix, let's get a classification report as well as cross-validated precision, recall and F1 score.

In [None]:
print(classification_report(y_true=y_test, y_pred=y_preds))

### Calculate evaluation metrics using cross-validation

We're going to calculate the accuracy, precision, recall and F1 score of our model using cross-validation, and to do so we'll be using `cross_val_score()`.

In [None]:
# Cross-validate accuracy
cv_acc = cross_val_score(clf,
                         cv=5,
                         X=x_resampled, 
                         y=y_resampled,
                         scoring="accuracy")
cv_acc

In [None]:
# Cross-validate precision
cv_precision = cross_val_score(clf,
                         cv=5,
                         X=x_resampled, 
                         y=y_resampled,
                         scoring="precision")
cv_precision

In [None]:
# Cross-validate recall
cv_recall = cross_val_score(clf,
                            cv=5,
                            X=x_resampled, 
                            y=y_resampled,
                            scoring="recall")
cv_recall

In [None]:
# Cross-validate f1
cv_f1 = cross_val_score(clf,
                        cv=5,
                        X=x_resampled, 
                        y=y_resampled,
                        scoring="f1")
cv_f1
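
As an alternative to calling `cross_val_score()` once per metric, scikit-learn's `cross_validate()` accepts a list of scorers and fits each fold only once. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` and a small hypothetical forest (`demo_clf`), not the notebook's `clf` or resampled data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the resampled data (labels are 0/1 here).
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     random_state=42)
demo_clf = RandomForestClassifier(n_estimators=50, random_state=42)

metrics = ["accuracy", "precision", "recall", "f1"]

# One call evaluates all four metrics on the same 5 folds.
scores = cross_validate(demo_clf, X_demo, y_demo, cv=5, scoring=metrics)

# Results are stored under "test_<metric>" keys, one value per fold.
mean_scores = {m: scores[f"test_{m}"].mean() for m in metrics}
print(mean_scores)
```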

In [None]:
# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({"Accuracy": np.mean(cv_acc),
                           "Precision": np.mean(cv_precision),
                           "Recall": np.mean(cv_recall),
                           "F1": np.mean(cv_f1)},
                           index=[0])
cv_metrics

In [None]:
cv_metrics.T.plot.bar(title="Cross-validated classification metrics",
                      legend=False,
                      cmap="tab10");

### Feature Importance 

Feature importance is another way of asking, "Which features contributed most to the outcomes of the model and how did they contribute?"

Finding feature importance is different for each machine learning model. One way to find out how is to search for "(MODEL NAME) feature importance".

In [None]:
X_train.head()

In [None]:
# Check the feature importances (a random forest has no coefficients)
clf.feature_importances_

In [None]:
# Match feature_importances_ values to columns
feature_dic = dict(zip(X_train.columns, list(clf.feature_importances_)))
feature_dic

In [None]:
# Visualize feature importance
feature_df = pd.DataFrame(feature_dic, index=[0])
feature_df.T.plot.bar(title="Feature Importance", legend=False);
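
For reading the importances at a glance, it can help to put them in a sorted pandas Series. The sketch below is a self-contained illustration on synthetic data with hypothetical feature names (`feature_0` ... `feature_4`) and a hypothetical `demo_clf`; with the notebook's model, the same pattern would apply to `clf.feature_importances_` and `X_train.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names.
X_arr, y_demo = make_classification(n_samples=300, n_features=5,
                                    n_informative=3, n_redundant=0,
                                    random_state=42)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

demo_clf = RandomForestClassifier(n_estimators=50, random_state=42)
demo_clf.fit(X_demo, y_demo)

# Impurity-based importances: one value per feature, summing to 1.
importances = (
    pd.Series(demo_clf.feature_importances_, index=X_demo.columns)
      .sort_values(ascending=False)
)
print(importances)
```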