<a href="https://colab.research.google.com/github/COVID-19-Severity/DataSet/blob/main/COVID19_Severity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<hr>

# :: SCRIPT AUTHORS ::
## * Ademir Luiz do Prado
## * Alexandre de Fátima Cobre
<hr>

# MACHINE LEARNING (ML)
Development of a Machine Learning model for the prognosis of COVID-19 severity using laboratory biomarkers.
The data come from laboratory examinations of patients treated at the Hospital de Clínicas of the Federal University of Paraná.
# LEGEND:
### Sex:
##### 1=Female
##### 2=Male
### COVID:
##### * Total: 35,109 Positive Samples
##### * Non-Severe (Mild to Moderate): 7,719 samples
##### * Severe: 27,390 samples
### Classification Severity:
##### * Severe (Serious - Inpatients)
##### * Non-Severe (Mild to Moderate - Outpatients)
### Period of the Samples:
##### * March 2020 to September 2022

# OBJECTIVE:
Develop a Machine Learning model to predict the severity of COVID-19 and identify biomarkers associated with this severity in order to optimize priority in hospital care.
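
The sample counts in the legend imply a marked class imbalance, which matters when reading accuracy figures later on. A minimal sketch of the arithmetic, using only the counts stated above (the variable names are illustrative):

```python
# Counts taken from the LEGEND above
n_non_severe = 7_719
n_severe = 27_390
n_total = n_non_severe + n_severe  # should match the stated total of 35,109

share_severe = n_severe / n_total
share_non_severe = n_non_severe / n_total

# Accuracy of a trivial classifier that always predicts the majority class
majority_baseline_accuracy = max(share_severe, share_non_severe)

print(f"Total samples: {n_total}")
print(f"Severe: {share_severe:.1%} | Non-Severe: {share_non_severe:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Because most samples are Severe, a useful model should beat this baseline by a clear margin, and recall and F1 for the Non-Severe class are worth checking alongside accuracy.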

In [None]:
# PHASES:
# 1: Import the DataSet
# 2: Import the Pandas library for handling the DataSet
# 3: Remove unnecessary columns (features) from DataSet
# 4: Exploratory Analysis
# 5: Install the PyCaret library for automated Machine Learning (AutoML)
# 6: Import the PyCaret library
# 7: Perform data pre-processing
# 8: Build and compare models
# 9: Train the best model based on predictive performance metrics
#10: Extract the metric results from the top 5 models
#11: Write conclusions about the best identified model
#12: Save the model to make predictions on real analyses (Deploy)

In [None]:
# Phase 1: Import the DataSet

from google.colab import files
uploaded = files.upload()

In [None]:
# Phase 2: Import the Pandas library for handling the DataSet
import pandas as pd
DataSet = pd.read_csv("COVID19 DataSetSeverity.csv")
display(DataSet)
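
Before going further, it can help to confirm that the uploaded file contains the columns the later phases rely on. The sketch below is only an illustration: `check_required_columns` is a hypothetical helper, the toy frame stands in for the real CSV, its values and class labels are invented, and `CRP` is a made-up biomarker column name. Only `ID`, `COVID` and `Sex` come from this notebook.

```python
import pandas as pd

def check_required_columns(frame, required):
    """Return the required column names that are missing from `frame`."""
    return [name for name in required if name not in frame.columns]

# Toy stand-in for the real DataSet (values are invented for illustration)
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "COVID": ["Severe", "Non-Severe", "Severe"],
    "Sex": [1, 2, 1],
    "CRP": [12.3, 0.8, 9.1],  # hypothetical biomarker column
})

missing = check_required_columns(toy, ["ID", "COVID", "Sex"])
if missing:
    raise ValueError(f"Missing expected columns: {missing}")
print("All expected columns are present.")
```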

In [None]:
# Phase 3: Remove unnecessary columns (features) from DataSet
DataSetSeverity = DataSet.drop("ID", axis = 1)
display(DataSetSeverity)

In [None]:
# Phase 4: Exploratory Analysis
## 4.1. DataSet Information
DataSetSeverity.info()
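
`info()` reports non-null counts per column. A complementary view is the share of missing values per column, since pre-processing choices often depend on it. A small sketch on an invented toy frame (the biomarker column names and all values are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "COVID": ["Severe", "Non-Severe", "Severe", "Severe"],
    "Sex": [1, 2, 1, 2],
    "CRP": [12.3, np.nan, 9.1, 15.0],       # hypothetical biomarker column
    "Albumin": [3.1, 4.2, np.nan, np.nan],  # hypothetical biomarker column
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```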

In [None]:
## 4.2. Install and Import library for Descriptive Statistics
!pip install researchpy
import researchpy as rp
### 1: COVID Feature
rp.summary_cat(DataSetSeverity['COVID'])

In [None]:
### 2: Sex Feature
rp.summary_cat(DataSetSeverity['Sex'])

In [None]:
### 3: Biomarker Features
# Drop the categorical columns so that only the continuous biomarkers remain
DataStatistics = DataSetSeverity.drop(["COVID", "Sex"], axis = 1)
for statistical in DataStatistics.columns:
  display(rp.summary_cont(DataStatistics[statistical]))
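
The summaries above describe each biomarker over the whole data set. To compare the two severity groups numerically, one option is a grouped summary with pandas. This is a sketch on invented data, not the authors' analysis, and `CRP` is a hypothetical column name.

```python
import pandas as pd

toy = pd.DataFrame({
    "COVID": ["Severe", "Severe", "Severe", "Non-Severe", "Non-Severe", "Non-Severe"],
    "CRP": [10.0, 14.0, 12.0, 1.0, 3.0, 2.0],  # invented values
})

# Count, mean, standard deviation and median of the biomarker per class
per_class = toy.groupby("COVID")["CRP"].agg(["count", "mean", "std", "median"])
print(per_class)
```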

In [None]:
## 4.3. Analyzing the variation in biomarker levels between COVID-19 severity samples (SEVERE AND NON-SEVERE)

### 1: Import the Plotly library for plotting
import plotly.express as px

### 2: Create the plots
#      HISTOGRAM
#for biomarker in DataSetSeverity.columns:
#  if biomarker != 'COVID' and biomarker != 'Sex':
#   graphic = px.histogram(DataSetSeverity, x = biomarker, color = "COVID", text_auto = True)
#   graphic.show()

#      BOXPLOT
for biomarker in DataSetSeverity.columns:
    if biomarker != 'COVID' and biomarker != 'Sex':
        graphic = px.box(DataSetSeverity, x = DataSetSeverity.columns[0], y = biomarker, color = "COVID")
        graphic.show()

In [None]:
## Insights:
## In general, the levels of all biomarkers varied between SEVERE and NON-SEVERE COVID-19 samples.
## SEVERE samples showed altered laboratory measurements compared to NON-SEVERE samples.
## All variables are relevant for analyzing the two groups of samples.
## These differences between the groups justify an in-depth supervised Machine Learning study.
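
The statement that biomarker levels differ between groups can be quantified. One common choice, used here purely as an illustration and not as part of the original analysis, is the standardized mean difference (Cohen's d with a pooled standard deviation). The helper name `cohens_d` and the toy values are assumptions.

```python
import math
import pandas as pd

def cohens_d(group_a, group_b):
    """Standardized mean difference between two samples (pooled SD)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (group_a.mean() - group_b.mean()) / pooled_sd

severe = pd.Series([10.0, 14.0, 12.0])   # invented biomarker values
non_severe = pd.Series([1.0, 3.0, 2.0])  # invented biomarker values

d = cohens_d(severe, non_severe)
print(f"Cohen's d: {d:.2f}")
```

Applied per biomarker, a value near zero would indicate little separation between the groups, while larger absolute values indicate stronger separation.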

In [None]:
# Phase 5: Install the PyCaret library for automated Machine Learning (AutoML)
!pip install pycaret

In [None]:
# Phase 6: Import the PyCaret library
from pycaret import classification

In [None]:
# Phase 7: Perform data pre-processing
classification_setup = classification.setup(data = DataSetSeverity, target = "COVID")
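
`setup` prepares the data and holds out part of it as a test set. The idea of a class-preserving (stratified) hold-out can be sketched with pandas alone. The 70/30 ratio, the seed and the toy data are assumptions for illustration; the actual split behaviour depends on the installed PyCaret version and the arguments passed to `setup`.

```python
import pandas as pd

toy = pd.DataFrame({
    "COVID": ["Severe"] * 80 + ["Non-Severe"] * 20,
    "CRP": range(100),  # invented placeholder values, hypothetical column
})

# Sample 70% of the rows within each class so the class ratio is preserved
train = toy.groupby("COVID").sample(frac=0.7, random_state=42)
test = toy.drop(train.index)

print(train["COVID"].value_counts(normalize=True))
print(test["COVID"].value_counts(normalize=True))
```

Both subsets keep the 80/20 class ratio of the toy data, which is the property that matters when one class is much larger than the other.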

In [None]:
# Phase 8: Build and compare models
models = classification.compare_models()

In [None]:
# Phase 9: Train the best model based on predictive performance metrics
# First: The Light Gradient Boosting Machine (lightgbm) model achieved the best performance, so we create it first.
model_lightgbm = classification.create_model("lightgbm")

In [None]:
# Second: The Extreme Gradient Boosting (xgboost) model achieved the second-best performance.
model_xgboost = classification.create_model("xgboost")

In [None]:
# Third: The Random Forest Classifier (rf) model achieved the third-best performance.
model_rf = classification.create_model("rf")

In [None]:
# Fourth: The Extra Trees Classifier (et) model achieved the fourth-best performance.
model_et = classification.create_model("et")

In [None]:
# Fifth: The Gradient Boosting Classifier (gbc) model achieved the fifth-best performance.
model_gbc = classification.create_model("gbc")

In [None]:
# Phase 10: Extract the metric results from the top 5 models
# First: lightgbm model metrics
classification.evaluate_model(model_lightgbm)

In [None]:
# Second: xgboost model metrics
classification.evaluate_model(model_xgboost)

In [None]:
# Third: rf model metrics
classification.evaluate_model(model_rf)

In [None]:
# Fourth: et model metrics
classification.evaluate_model(model_et)

In [None]:
# Fifth: gbc model metrics
classification.evaluate_model(model_gbc)
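
`evaluate_model` opens an interactive view of plots and metrics for a trained model. To make the relationship between the main metrics explicit, the sketch below computes them from a confusion matrix. The counts are invented and treat Severe as the positive class; they are not results from this study.

```python
# Invented confusion-matrix counts (positive class = Severe)
tp, fp, fn, tn = 90, 10, 5, 45

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)    # share of predicted Severe that were Severe
recall = tp / (tp + fn)       # share of true Severe that were detected
specificity = tn / (tn + fp)  # share of true Non-Severe that were detected
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("Specificity", specificity), ("F1", f1)]:
    print(f"{name}: {value:.3f}")
```

With imbalanced classes, specificity and recall can diverge sharply even when accuracy looks high, so it is worth reading them side by side.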

In [None]:
# Plot the 10 most important biomarkers for the lightgbm model
classification.plot_model(model_lightgbm, plot = "feature")
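
The feature plot ranks predictors by the importance scores that the fitted model assigns to them. The ranking step itself is simple; the sketch below uses invented scores for hypothetical biomarkers. With a fitted LightGBM estimator, such scores are typically exposed through the scikit-learn style `feature_importances_` attribute, which is an assumption about the object returned by `create_model`.

```python
import pandas as pd

# Invented importance scores for hypothetical biomarkers
importance = pd.Series({
    "biomarker_a": 120,
    "biomarker_b": 340,
    "biomarker_c": 75,
    "biomarker_d": 210,
})

top_n = 3  # assumption: how many features to keep in the ranking
ranking = importance.nlargest(top_n)
print(ranking)
```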

In [None]:
# Phase 11: Write conclusions about the best identified model
# Several Machine Learning models were built to predict the severity of COVID-19 using biomarker data from patients with COVID-19.
# The Light Gradient Boosting Machine (lightgbm) model had the best predictive performance.
# The 5 most important biomarkers for the prognosis of COVID-19 severity, for the samples under study, were: C-reactive protein, Creatinine, Albumin, Lymphocytes and Erythrocytes.
# The next step is to develop the App so that the model can be used in Health Institutions.

In [None]:
# Phase 12: Save the model to make predictions on real analyses (Deploy)
classification.save_model(model_lightgbm, "BestModel-ML_LightGBM")
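
Once saved, the pipeline can be reloaded to score new samples. The sketch below assumes PyCaret's `load_model` and `predict_model` functions; `find_missing_features` and `predict_severity` are hypothetical helpers, and the feature names are placeholders. New samples must contain the same feature columns used in training, and the names of the prediction columns in the output depend on the PyCaret version.

```python
import pandas as pd

def find_missing_features(new_samples, training_features):
    """Return training feature names absent from the new samples."""
    return [name for name in training_features if name not in new_samples.columns]

def predict_severity(new_samples, model_name="BestModel-ML_LightGBM"):
    """Load the saved pipeline and score new samples (requires PyCaret)."""
    from pycaret import classification  # imported lazily; needs pycaret installed
    pipeline = classification.load_model(model_name)
    return classification.predict_model(pipeline, data=new_samples)

# Toy check with hypothetical feature names
new_samples = pd.DataFrame({"Sex": [1], "biomarker_a": [4.2]})
print(find_missing_features(new_samples, ["Sex", "biomarker_a", "biomarker_b"]))
```

Checking for missing features before calling `predict_severity` gives a clearer error than a failure inside the pipeline.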