# Dataset: GSE28750

This study uses gene expression biomarkers to distinguish sepsis patients from patients with systemic inflammation caused by surgery. The goal is to develop a multi-marker diagnostic tool for early sepsis detection in a clinical setting where sepsis must be differentiated from SIRS.

## Study design

1. Sepsis patients (n=27): Met the 1992 Consensus Statement criteria and had microbiological evidence of infection.

2. Post-surgical patients (n=38): Blood collected within 24 hours after major surgery.

3. Healthy controls (n=20): Hospital staff with no known concurrent illness.


Overall Dataset Composition:

sepsis_data_GSE28750: 10 sepsis patients and 20 healthy controls (30 rows and 100 columns).


Present genes: 55

Missing genes: 0

The dataset was already normalized. The 55 genes of interest were selected and the data prepared for random forest analysis.

## Random Forest

We trained a random forest classifier on sepsis_data_GSE28750.csv to evaluate how accurately the expression of our 55 genes separates sepsis patients from healthy controls (Rf-samesplit_GSE28750code.R). The model was run 100 times, each time with a new random train/test split, and the MCC, F1 score, AUC and … were calculated for every repeat (repeated_splits_metrics_GSE28750.csv).
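The label counts shown later in this README (16/8 in training, 4/2 in testing) suggest a stratified 80/20 split. Below is a minimal base-R sketch of one such split. The function name `stratified_split`, the 0.8 fraction, the seed, and the toy labels are assumptions for illustration and are not taken from Rf-samesplit_GSE28750code.R.

```r
# Stratified train/test split in base R (illustrative sketch).
# Assumption: an 80/20 split within each class, which would reproduce
# the 16/8 (train) and 4/2 (test) label counts reported in this README.
stratified_split <- function(labels, train_frac = 0.8) {
  idx_by_class <- split(seq_along(labels), labels)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    n_train <- round(length(idx) * train_frac)
    # sample.int avoids the sample() pitfall when a class has one member
    idx[sample.int(length(idx), n_train)]
  }), use.names = FALSE)
  list(train = sort(train_idx),
       test  = setdiff(seq_along(labels), train_idx))
}

set.seed(1)  # hypothetical seed, only to make the sketch reproducible
labels <- factor(c(rep("HEALTHY", 20), rep("SEPSIS", 10)))
parts <- stratified_split(labels)
table(labels[parts$train])  # HEALTHY 16, SEPSIS 8
table(labels[parts$test])   # HEALTHY 4,  SEPSIS 2
```

Calling `stratified_split` inside a loop of 100 iterations, with a different seed or no seed reset between iterations, would give the repeated random splits described above.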

Average metrics:

Each metric averaged over the 100 iterations (average_metrics_GSE28750.csv):

| Metric | Average |
|--------|---------|
| MCC | 0.944868329805051 |
| F1 | 0.95 |
| AUC | 1 |
| TPR | 0.925 |
| TNR | 1 |
| PPV | 1 |
| NPV | 0.97 |
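To make the averaged metrics easier to interpret, here is a minimal base-R sketch that computes them from confusion-matrix counts, with SEPSIS as the positive class. The function name `binary_metrics` and the convention of returning MCC = 0 when the denominator is zero are assumptions. The 85/15 mix at the end is an inference from the arithmetic, not something read from repeated_splits_metrics_GSE28750.csv.

```r
# Confusion-matrix metrics in base R (illustrative sketch).
# PPV is NaN if nothing is predicted positive; that case is not handled here.
binary_metrics <- function(tp, fp, tn, fn) {
  denom <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  c(MCC = if (denom == 0) 0 else (tp * tn - fp * fn) / denom,
    F1  = 2 * tp / (2 * tp + fp + fn),
    TPR = tp / (tp + fn),
    TNR = tn / (tn + fp),
    PPV = tp / (tp + fp),
    NPV = tn / (tn + fn))
}

# Test set of 4 HEALTHY and 2 SEPSIS samples.
perfect    <- binary_metrics(tp = 2, fp = 0, tn = 4, fn = 0)
one_missed <- binary_metrics(tp = 1, fp = 0, tn = 4, fn = 1)
round(one_missed, 4)
# MCC 0.6325, F1 0.6667, TPR 0.5, TNR 1, PPV 1, NPV 0.8

# Hypothetical mix: 85 perfect splits and 15 with one missed sepsis case.
round(0.85 * perfect + 0.15 * one_missed, 4)
# MCC 0.9449, F1 0.95, TPR 0.925, TNR 1, PPV 1, NPV 0.97
```

With only two sepsis samples in each test set, a single misclassification moves the MCC for that split from 1 to about 0.63, so the averages reported here are consistent with a small number of splits in which one sepsis case was missed.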


## Feature removal test

Next, we tested which feature (gene) has the greatest impact on MCC. We repeated the 100-iteration run once per feature, each time removing that one feature and recalculating all metrics, to see which removal causes the largest drop in MCC (feature_removal_results_GSE28750.csv).



Top Features, whose removal caused the largest drop in MCC (**SOCS3.3, C3AR1, S100A9, S100A8, PLAUR.1**):
These are likely key biomarkers and should be prioritized in biological interpretation and downstream analysis. They are central to the model's predictions and could be explored further for biological relevance.

Bottom Features, whose removal caused the smallest drop in MCC (**ARG1, SOCS3, P2RX7, ARG1.1, OLFM4**):
These features might not be crucial for prediction in this context. Consider removing them to simplify the model or explore their redundancy with other features.
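The feature removal procedure can be sketched in base R as a leave-one-feature-out loop. In this sketch `score_fn` is a hypothetical stand-in for "run the 100 repeated random forest splits and return the mean MCC", `toy_score` is a deliberately simple placeholder, and the gene names `GENE_A`, `GENE_B`, and `GENE_C` are invented for illustration.

```r
# Leave-one-feature-out ranking (illustrative sketch, base R only).
feature_removal <- function(x, y, score_fn) {
  baseline <- score_fn(x, y)
  drops <- vapply(colnames(x), function(gene) {
    reduced <- x[, colnames(x) != gene, drop = FALSE]
    baseline - score_fn(reduced, y)   # positive = score fell without this gene
  }, numeric(1))
  sort(drops, decreasing = TRUE)      # largest drop first
}

# Toy scorer (assumption): best absolute correlation of any column with the label.
toy_score <- function(x, y) max(abs(cor(x, as.numeric(y))))

set.seed(1)
y <- factor(rep(c("HEALTHY", "SEPSIS"), times = c(20, 10)))
x <- cbind(GENE_A = as.numeric(y) + rnorm(30, sd = 0.1),  # informative
           GENE_B = rnorm(30),                            # pure noise
           GENE_C = rnorm(30))                            # pure noise
feature_removal(x, y, toy_score)  # GENE_A ranks first; the noise genes show no drop
```

With correlated features, such as several probes for the same gene, removing any single one may cause little or no drop because the others carry the same signal, so a small drop does not by itself show that a gene is uninformative.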

## Sanity check

Noise is artificially introduced into the dataset, and the model's performance metrics are observed as the noise level increases (sanity_check_results_GSE28750.csv).

Noise Level 0: Represents the original data with no added noise (baseline performance).

Noise Levels 10–50: Increasing levels of noise are added to the input data, simulating scenarios with reduced data quality or signal interference.
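This README does not state how the noise is generated or whether it is applied to the training data, the test data, or both. As one possible reading, the base-R sketch below treats "noise level p" as zero-mean Gaussian noise with a standard deviation equal to p% of each gene's own standard deviation. The function name `add_noise` and the toy matrix are assumptions.

```r
# Noise injection for a robustness check (illustrative sketch, base R only).
# Assumption: level_pct scales Gaussian noise relative to each column's SD.
add_noise <- function(x, level_pct) {
  col_sd <- apply(x, 2, sd)
  noise <- matrix(rnorm(length(x)), nrow = nrow(x))
  # Multiply each noise column by that gene's scaled standard deviation.
  x + sweep(noise, 2, col_sd * level_pct / 100, `*`)
}

set.seed(1)
x <- matrix(rnorm(30 * 55), nrow = 30)  # toy stand-in: 30 samples x 55 genes
for (level in c(0, 10, 20, 30, 40, 50)) {
  noisy <- add_noise(x, level)
  # In the real check, the model would be evaluated on `noisy` at this point.
  cat("noise", level, "% mean abs change:", round(mean(abs(noisy - x)), 3), "\n")
}
```

At level 0 the data is returned unchanged, which gives the baseline row of the results table.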


Performance drops significantly between 20% and 30% noise, suggesting that noise at this level corrupts important features.
A partial recovery at 40% noise may indicate that the model is adapting to randomness, although with only six test samples per split it could also reflect sampling variability; performance degrades again at 50% noise.
AUC remains stable, meaning that the model keeps its ranking ability but loses absolute classification accuracy, as shown by F1 and MCC.


The main limitation, again, is the small size of the dataset and the imbalance between sepsis and healthy samples, as the confusion matrix and label counts from a single split show:


    Confusion Matrix and Statistics

              Reference
    Prediction HEALTHY SEPSIS
       HEALTHY       4      0
       SEPSIS        0      2

                   Accuracy : 1
                     95% CI : (0.5407, 1)
        No Information Rate : 0.6667
        P-Value [Acc > NIR] : 0.08779

                      Kappa : 1

     Mcnemar's Test P-Value : NA

                Sensitivity : 1.0000
                Specificity : 1.0000
             Pos Pred Value : 1.0000
             Neg Pred Value : 1.0000
                 Prevalence : 0.3333
             Detection Rate : 0.3333
       Detection Prevalence : 0.3333
          Balanced Accuracy : 1.0000

           'Positive' Class : SEPSIS

    > table(data$Label)

    HEALTHY  SEPSIS
         20      10
    > table(train_data$Label)

    HEALTHY  SEPSIS
         16       8
    > table(test_data$Label)

    HEALTHY  SEPSIS
          4       2


--------------------

To address the class imbalance, we repeated the random forest analysis with SMOTE resampling.

## Random forest with SMOTE resampling
We trained a random forest classifier on sepsis_data_GSE28750.csv, this time applying SMOTE resampling to the training data, to evaluate how accurately the expression of our 55 genes separates sepsis patients from healthy controls (RF_samesplit_SMOTE_GSE28750code.R). The model was run 100 times, each time with a new random train/test split, and the MCC, F1 score, AUC and … were calculated for every repeat (repeated_splits_SMOTE_GSE28750.csv).

## Average metrics
Each metric averaged over the 100 iterations (average_metrics_GSE28750.csv):

| Metric | Average |
|--------|---------|
| MCC | 0.949570108523704 |
| F1 | 0.953333333333333 |
| AUC | 0.99 |
| TPR | 0.935 |
| TNR | 0.99 |
| PPV | 0.99 |
| NPV | 0.968 |




    > table(data$Label)

    HEALTHY  SEPSIS
         20      10
    > table(train_data$Label)

    HEALTHY  SEPSIS
         16       8
    > table(smote_train_data$Label)

    HEALTHY  SEPSIS
         16      16
    > table(test_data$Label)

    HEALTHY  SEPSIS
          4       2


After SMOTE, the training data is balanced (16 healthy and 16 sepsis samples), while the test data is left unchanged.
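To illustrate how SMOTE turns 8 sepsis training samples into 16, here is a minimal base-R sketch of the core idea: each synthetic sample is a random point on the line between a minority sample and one of its nearest minority-class neighbours. This is not the implementation used in RF_samesplit_SMOTE_GSE28750code.R or in any particular package. The function name `smote_sketch`, the default `k = 5`, and the toy matrix are assumptions.

```r
# SMOTE-style oversampling in base R (illustrative sketch).
smote_sketch <- function(x_min, n_new, k = 5) {
  k <- min(k, nrow(x_min) - 1)           # cannot have more neighbours than samples
  d <- as.matrix(dist(x_min))            # pairwise Euclidean distances
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(x_min),
                  dimnames = list(NULL, colnames(x_min)))
  for (i in seq_len(n_new)) {
    base <- sample.int(nrow(x_min), 1)   # random minority sample
    nn   <- order(d[base, ])[2:(k + 1)]  # its k nearest neighbours (1st is itself)
    mate <- nn[sample.int(k, 1)]
    gap  <- runif(1)                     # interpolation weight in [0, 1]
    synth[i, ] <- x_min[base, ] + gap * (x_min[mate, ] - x_min[base, ])
  }
  synth
}

set.seed(1)
sepsis_train <- matrix(rnorm(8 * 55), nrow = 8)  # toy stand-in: 8 sepsis x 55 genes
synthetic <- smote_sketch(sepsis_train, n_new = 8)
nrow(rbind(sepsis_train, synthetic))  # 16, matching the 16 HEALTHY training samples
```

Resampling only the training split, as the label counts above indicate, keeps synthetic samples out of the test set, so the reported metrics are still computed on real patients.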


## Feature removal test

As before, we tested which feature (gene) has the greatest impact on MCC. We repeated the 100-iteration run once per feature, each time removing that one feature and recalculating all metrics, to see which removal causes the largest drop in MCC (feature_removal_results_GSE28750.csv).


Top Features, whose removal caused the largest drop in MCC (**LBP, NLRP3, BCL2.3, S100A9**): These are likely key biomarkers and should be prioritized in biological interpretation and downstream analysis. They are central to the model's predictions and could be explored further for biological relevance.

Bottom Features, whose removal caused the smallest drop in MCC (**ARG1, BCL2.1, NLRP3.1, ARG1.1**): These features might not be crucial for prediction in this context. Consider removing them to simplify the model or explore their redundancy with other features.

## Sanity check

Noise is artificially introduced into the dataset, and the model's performance metrics are observed as the noise level increases (sanity_check_results_GSE28750.csv).

Noise Level 0: Represents the original data with no added noise (baseline performance).

Noise Levels 10–50: Increasing levels of noise are added to the input data, simulating scenarios with reduced data quality or signal interference.


Performance drops significantly between 10% and 20% noise, suggesting that noise at this level corrupts important features.
A partial recovery at 30% noise may indicate that the model is adapting to randomness, although with only six test samples per split it could also reflect sampling variability; performance degrades again at 50% noise.
AUC remains stable, meaning that the model keeps its ranking ability but loses absolute classification accuracy, as shown by F1 and MCC.










