# Dataset: GSE63311

The dataset investigates the gene expression signature associated with endotoxin tolerance, a condition in which repeated exposure to endotoxins such as lipopolysaccharide (LPS) alters the immune response. This condition has implications for sepsis pathogenesis, particularly for the immunosuppressive states observed in late-stage sepsis.

The study aims to:

1. Clarify the timing and mechanisms of immune dysfunction related to sepsis.

2. Define a gene expression signature characteristic of endotoxin tolerance.

3. Correlate the endotoxin tolerance signature with early sepsis to predict the development of sepsis and organ dysfunction.

## Study Design:

Participants:

The dataset includes RNA-seq data from 73 patients recruited at initial clinical presentation in an emergency ward.
Patients were grouped based on their condition:

1. Sepsis.

2. Other conditions (e.g., non-sepsis).

3. Non-urgent surgical controls for comparison.

------------------
Present genes: 52

Missing genes: 3 (CALCA, LBP, CCL25)

-----------------
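The present/missing counts above can be reproduced with a simple membership check. The sketch below assumes the counts refer to a gene panel (52 + 3 = 55 genes) compared against the gene identifiers available in the dataset; `split_panel`, `panel_subset` and `dataset_genes` are hypothetical names, and only an illustrative subset of genes is used because the full panel is not listed here.

```python
def split_panel(panel, dataset_genes):
    """Return (present, missing) panel genes, preserving panel order."""
    available = set(dataset_genes)
    present = [g for g in panel if g in available]
    missing = [g for g in panel if g not in available]
    return present, missing


# Illustrative subset only; the full panel is not listed in this document.
panel_subset = ["MAPK14", "SOCS3", "CALCA", "MMP9", "LBP", "CCL25"]
# Hypothetical gene identifiers found in the dataset.
dataset_genes = ["MMP9", "SOCS3", "MAPK14", "ACTB"]

present, missing = split_panel(panel_subset, dataset_genes)
print("Present genes:", len(present), present)
print("Missing genes:", len(missing), missing)
```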

## Dataset preparation:

The original dataset consists of mapped count data and was not normalized. We normalized it with the DESeq2 package to prepare it for downstream analysis. The prepared dataset contains only the sepsis patients and the control group (non-urgent surgical control patients) and is saved as (filtered_sepsis_vstGSE63311.csv). It contains 37 sepsis samples and 11 controls (rows = 48, cols = 54). The steps and code are saved in (GSE63311code.R).
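To give an idea of what count normalization does, here is a minimal Python sketch of the median-of-ratios size factor estimate that DESeq2 uses to correct for sequencing depth. It is not the code in GSE63311code.R, and it does not reproduce the variance-stabilizing transformation that the `vst` in the output file name suggests was applied; the function names and the toy counts are assumptions for illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: list of genes, each a list of raw counts (one per sample).

    Returns one size factor per sample using the median-of-ratios method.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if min(gene) <= 0:
            # A zero count makes the geometric mean zero, so the gene is skipped.
            continue
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


def normalize(counts):
    """Divide every count by the size factor of its sample."""
    sf = size_factors(counts)
    return [[c / s for c, s in zip(gene, sf)] for gene in counts]


# Toy data: sample 2 was sequenced exactly twice as deep as sample 1.
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
sf = size_factors(toy_counts)
normalized = normalize(toy_counts)
print(sf)
print(normalized[0])
```

After normalization the two samples in the toy data have the same value for every gene that differs only by sequencing depth.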

## Random forest analysis:

### Factor label target:

First, random forest analysis was run with the target label treated as a categorical factor (Sepsis and Control).

The random forest was trained and evaluated on 100 random splits of the data (repeated_splits_metrics63311.csv).
The metrics (MCC, F1, AUC, ...) were averaged over all splits and saved in (average_metrics63311.csv). Here are the results:
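The repeated-split procedure can be sketched as follows. The class sizes (37 sepsis, 11 controls) come from this dataset; the stratification, the 30% test fraction, the seed and the function names are assumptions, since the split settings are not stated here. Model training is left out; only the splitting and the averaging are shown.

```python
import random


def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx) with roughly equal class proportions."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


def average_metrics(per_split):
    """Average a list of {metric_name: value} dicts, one dict per split."""
    keys = per_split[0].keys()
    return {k: sum(m[k] for m in per_split) / len(per_split) for k in keys}


labels = ["Sepsis"] * 37 + ["Control"] * 11  # class sizes from this dataset
rng = random.Random(0)                        # assumed seed, for reproducibility
splits = [stratified_split(labels, 0.3, rng) for _ in range(100)]

train_idx, test_idx = splits[0]
print(len(train_idx), len(test_idx))
print(average_metrics([{"MCC": 1.0}, {"MCC": 0.5}]))
```

Stratifying keeps a few controls in every test set, which matters with only 11 controls: a test set without negatives would leave TNR and NPV undefined for that split.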


### Average metrics:

| Metric | Value |
|--------|-------|
| MCC | 0.863202491033685 |
| F1 | 0.965882783882784 |
| AUC | 0.981071428571429 |
| TPR | 0.964285714285714 |
| TNR | 0.89 |
| PPV | 0.971607142857143 |
| NPV | 0.911666666666667 |


**MCC**: 0.86 indicates a strong correlation between the predicted and actual labels, suggesting the model performs well on both classes.

**F1 Score**: 0.97 indicates that the model balances precision and recall well for the positive class. Because F1 ignores true negatives, it should be read together with MCC and TNR in an imbalanced dataset such as this one (37 sepsis vs. 11 controls).

**AUC** (Area Under the Curve): 0.98 implies that the model separates the two classes almost perfectly.

**TPR** (True Positive Rate / Sensitivity / Recall): 0.96 means the model correctly identified 96% of the positive cases.

**TNR** (True Negative Rate / Specificity): 0.89 shows that 89% of the negative cases were correctly classified.

**PPV** (Positive Predictive Value / Precision): 0.97 indicates that 97% of the model's positive predictions are correct.

**NPV** (Negative Predictive Value): 0.91 shows that 91% of the model's negative predictions are correct.
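For reference, the metrics above can be computed from a confusion matrix and from prediction scores as in the sketch below. These are the standard definitions; the function names, the choice of "Sepsis" as the positive class, and the example counts are assumptions for illustration and are not taken from the analysis files.

```python
import math


def confusion_counts(y_true, y_pred, positive="Sepsis"):
    """Return (tp, tn, fp, fn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def safe_div(a, b):
    """Return a / b, or 0.0 when the denominator is zero."""
    return a / b if b else 0.0


def classification_metrics(tp, tn, fp, fn):
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "MCC": safe_div(tp * tn - fp * fn, mcc_den),
        "F1": safe_div(2 * tp, 2 * tp + fp + fn),
        "TPR": safe_div(tp, tp + fn),
        "TNR": safe_div(tn, tn + fp),
        "PPV": safe_div(tp, tp + fp),
        "NPV": safe_div(tn, tn + fn),
    }


def auc(y_true, scores, positive="Sepsis"):
    """Probability that a random positive outscores a random negative.

    Ties count as 0.5. Both classes must be present in y_true.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test split: 10 sepsis and 4 controls, one control misclassified.
m = classification_metrics(tp=10, tn=3, fp=1, fn=0)
print({k: round(v, 3) for k, v in m.items()})

toy_auc = auc(["Sepsis", "Sepsis", "Control", "Control"], [0.9, 0.4, 0.5, 0.1])
print(toy_auc)
```

In the hypothetical split, a single misclassified control leaves F1 above 0.95 but pulls MCC down to about 0.83, which illustrates why MCC is the more informative summary when controls are scarce.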


### Feature removal analysis:

To investigate the importance of each gene for the model's predictions, one feature at a time was removed from the dataset and the metrics above were re-evaluated over 100 iterations.
(feature_removal_results63311.csv)

Plot of the feature removal results: (feature_removal_mcc_plot63311.png)


Based on the plot, the most important genes are those whose removal causes the largest drop in MCC (**MAPK14, SOCS3, CCL19, IFNA1, CXCL8 and MMP9**).

The less important genes are those whose removal does not noticeably change the MCC (**S100A8, S100A12, IFNG, IL1R2 and CD14**).
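The feature removal procedure can be sketched as a leave-one-feature-out loop. The `evaluate` callback stands in for "train the random forest on these features over repeated splits and return the mean MCC"; it, the placeholder gene names and the toy contributions are hypothetical and carry no real results.

```python
def feature_removal_drops(features, evaluate):
    """Score drop caused by removing each feature, largest drop first.

    evaluate(feature_subset) must return a score such as the mean MCC.
    """
    baseline = evaluate(features)
    drops = {}
    for removed in features:
        reduced = [f for f in features if f != removed]
        drops[removed] = baseline - evaluate(reduced)
    return dict(sorted(drops.items(), key=lambda kv: kv[1], reverse=True))


# Toy stand-in for model training: each placeholder gene adds a made-up
# amount to the score. These numbers are for illustration only.
toy_contribution = {"geneA": 0.05, "geneB": 0.03, "geneC": 0.0}


def toy_evaluate(feature_subset):
    return 0.78 + sum(toy_contribution[g] for g in feature_subset)


drops = feature_removal_drops(list(toy_contribution), toy_evaluate)
for gene, drop in drops.items():
    print(gene, round(drop, 3))
```

A large positive drop marks a gene the model relies on; a drop near zero can also mean the gene is redundant with other features rather than uninformative on its own.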


### Sanity check:

A sanity check was run to confirm that the model's performance depends on real signal in the data: noise was added to the dataset in increasing steps to check whether the MCC, F1 and AUC values decline.
Noise was added at 5 levels: 10%, 20%, 30%, 40% and 50% (sanity_check_results63311.csv).

The plot shows how the metrics change as the noise in the dataset increases (impact_of_noise_on_model_performance63311.png).

At noise level 0, the metrics (AUC, F1, MCC) are high, reflecting the model's performance on the original, unperturbed dataset.

AUC (red line) and F1 score (green line): beyond a noise level of 30%, performance declines sharply, indicating that the model struggles to maintain discriminative power under heavy noise.

MCC (blue line): MCC appears to degrade more rapidly than AUC and F1. MCC uses all four cells of the confusion matrix, so it is more sensitive to errors in either class, especially when the classes are imbalanced. The steep decline in MCC beyond moderate noise suggests a growing proportion of false predictions as noise increases.
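One way to implement such a noise sweep is sketched below. How the noise was actually generated is not described here, so this sketch assumes zero-mean Gaussian noise whose standard deviation is the noise level times the standard deviation of each column; randomly flipping a fraction of the labels would be another reading of "10% noise". All names and the toy data are hypothetical.

```python
import random
from statistics import pstdev


def add_noise(rows, level, rng):
    """Return a noisy copy of rows (list of samples, each a list of values).

    Assumed noise model: Gaussian, sd = level * column standard deviation.
    """
    n_cols = len(rows[0])
    col_sd = [pstdev([r[j] for r in rows]) for j in range(n_cols)]
    return [
        [v + rng.gauss(0.0, level * col_sd[j]) for j, v in enumerate(r)]
        for r in rows
    ]


toy_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # toy expression values
rng = random.Random(0)                               # assumed seed

noisy_by_level = {}
for level in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]:
    # In the real sweep, the model would be retrained and scored on each copy.
    noisy_by_level[level] = add_noise(toy_rows, level, rng)

print(noisy_by_level[0.0])
print(noisy_by_level[0.5])
```

Scaling the noise per column keeps the perturbation comparable across genes with very different expression ranges.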

--------------------------------

### Number label target:

Next, random forest analysis was run with the target label treated as a numeric variable (0 and 1).
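With a numeric 0/1 target, a random forest may be fitted as a regression model (R's randomForest, for example, does this when the response is not a factor), in which case its predictions are continuous values rather than class labels. Whether that happened here is an assumption; if it did, the predictions need a cut-off before MCC, F1 and the other class-based metrics can be computed. The sketch below uses a hypothetical helper and an assumed threshold of 0.5.

```python
def to_class_labels(predictions, threshold=0.5):
    """Convert continuous predictions to 0/1 labels (assumed cut-off: 0.5)."""
    return [1 if p >= threshold else 0 for p in predictions]


toy_predictions = [0.10, 0.50, 0.93]  # hypothetical regression outputs
toy_labels = to_class_labels(toy_predictions)
print(toy_labels)
```

The choice of threshold affects every metric except AUC, which is computed from the continuous scores directly.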

The random forest was trained and evaluated on 100 random splits of the data (repeated_splits_metrics-num63311.csv).
The metrics (MCC, F1, AUC, ...) were averaged over all splits and saved in (average_metrics-num63311.csv). Here are the results:


### Average metrics:

| Metric | Value |
|--------|-------|
| MCC | 0.792956891870448 |
| F1 | 0.870284604284604 |
| AUC | 0.885869047619048 |
| TPR | 0.862892857142857 |
| TNR | 0.854833333333333 |
| PPV | 0.884214285714286 |
| NPV | 0.811166666666667 |


**MCC**: 0.79 indicates a strong correlation between the predicted and actual labels, although it is lower than with the factor label (0.86).

**F1 Score**: 0.87 indicates that the model balances precision and recall reasonably well for the positive class.

**AUC** (Area Under the Curve): 0.89 implies that the model separates the two classes well, but not perfectly.

**TPR** (True Positive Rate / Sensitivity / Recall): 0.86 means the model correctly identified 86% of the positive cases.

**TNR** (True Negative Rate / Specificity): 0.85 shows that 85% of the negative cases were correctly classified.

**PPV** (Positive Predictive Value / Precision): 0.88 indicates that 88% of the model's positive predictions are correct.

**NPV** (Negative Predictive Value): 0.81 shows that 81% of the model's negative predictions are correct.


### Feature removal analysis:

To investigate the importance of each gene for the model's predictions, one feature at a time was removed from the dataset and the metrics above were re-evaluated over 100 iterations.
(feature_removal_results-num63311.csv)

Plot of the feature removal results: (feature_removal_mcc_plot-num63311.png)



Based on the plot, the most important genes are those whose removal causes the largest drop in MCC (**MAPK14, ITGAM, PLAUR, CX3CR1, IL6 and C5AR1**).

The less important genes are those whose removal does not noticeably change the MCC (**S100A8, CCL19, IFNB1, CCR7 and HIF1A**).


### Sanity check:

A sanity check was run to confirm that the model's performance depends on real signal in the data: noise was added to the dataset in increasing steps to check whether the MCC, F1 and AUC values decline.
Noise was added at 5 levels: 10%, 20%, 30%, 40% and 50% (sanity_check_results-num63311.csv).

The plot shows how the metrics change as the noise in the dataset increases (impact_of_noise_on_model_performance-num63311.png).

At noise level 0, the metrics (AUC, F1, MCC) are high, reflecting the model's performance on the original, unperturbed dataset.

AUC (red line) and F1 score (green line): beyond a noise level of 30%, performance declines sharply, indicating that the model struggles to maintain discriminative power under heavy noise.

MCC (blue line): MCC appears to degrade more rapidly than AUC and F1. MCC uses all four cells of the confusion matrix, so it is more sensitive to errors in either class, especially when the classes are imbalanced. The steep decline in MCC beyond moderate noise suggests a growing proportion of false predictions as noise increases.













