# Dataset: GSE243217

GSE243217 is a prospective observational study conducted at Osaka University Graduate School of Medicine. The study aims to compare host immune responses in sepsis and COVID-19 patients by analyzing mRNA and miRNA expression profiles. The dataset includes transcriptomic data from:

- 22 sepsis patients
- 35 COVID-19 patients
- 15 healthy subjects (controls)

## Study Design:

- Sepsis diagnosis: sepsis patients were diagnosed using the Sepsis-3 criteria.
- Sample type: whole blood samples were collected and processed for RNA sequencing.

Of the 55 genes in our panel, 54 are present in this dataset. One gene (IFNB1) is missing.

The dataset was not provided in normalized form, so we retrieved the raw data for each sample, merged them into a single dataset, normalized it with DESeq2, and then applied a variance-stabilizing transformation.
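
To make the normalization step more concrete, here is a minimal Python sketch of the median-of-ratios idea behind DESeq2's size factors. It is not DESeq2 itself and does not cover the variance-stabilizing transformation. The function names and the toy counts are assumptions made for illustration only.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors; counts[g][s] is the raw count of gene g in sample s."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if min(gene) <= 0:
            continue  # a zero count gives no finite log geometric mean, so skip the gene
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for s, c in enumerate(gene):
            log_ratios[s].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]

def normalize(counts):
    """Divide every count by the size factor of its sample."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(gene, factors)] for gene in counts]

# Hypothetical toy counts: 3 genes x 2 samples, with sample 2 sequenced twice as deeply.
toy_counts = [[10, 20], [100, 200], [5, 10]]
toy_size_factors = size_factors(toy_counts)
toy_normalized = normalize(toy_counts)
```

In this toy case the second size factor is twice the first, so the normalized counts of each gene agree across the two samples.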

sepsis_dataGSE243217.csv contains the 22 sepsis patients and the 15 healthy controls (37 rows and 56 columns).

## Random Forest Analysis:

We trained a random forest classifier on sepsis_dataGSE243217.csv to evaluate how accurately the expression of our 55-gene panel distinguishes sepsis patients from healthy controls (RF-samesplit_GSE243217code.R). We ran the random forest 100 times, each time with a new random split of the data into training and test sets, and calculated the MCC, F1 score, AUC and … for all 100 repeats (repeated_splits_metrics_GSE243217.csv).
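
The repeated random splitting can be sketched as follows. This is a Python illustration, not the R script named above. The training fraction of 0.6 and the per-class rounding up are assumptions, chosen only because they reproduce the class counts reported for one split in this document (9 healthy and 14 sepsis samples in training). `stratified_split` is a hypothetical helper name.

```python
import math
import random

def stratified_split(labels, train_fraction, rng):
    """Return (train_idx, test_idx), drawing train_fraction of each class for training."""
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        shuffled = idx[:]
        rng.shuffle(shuffled)
        n_train = math.ceil(train_fraction * len(shuffled))  # assumed rounding rule
        train_idx += shuffled[:n_train]
        test_idx += shuffled[n_train:]
    return sorted(train_idx), sorted(test_idx)

# Class counts reported for this dataset: 15 healthy and 22 sepsis samples.
labels = ["healthy"] * 15 + ["Sepsis"] * 22
# One split per repeat, each with its own seed so that the splits differ.
splits = [stratified_split(labels, 0.6, random.Random(seed)) for seed in range(100)]
train_idx, test_idx = splits[0]
```

Each of the 100 splits would then be used to fit the model on `train_idx` and to score it on `test_idx`.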

## Average metrics:

The table below gives each metric averaged over the 100 iterations (average_metrics_GSE243217.csv).

| Metric | Mean |
|--------|------|
| MCC    | 1    |
| F1     | 1    |
| AUC    | 1    |
| TPR    | 1    |
| TNR    | 1    |
| PPV    | 1    |
| NPV    | 1    |


    > table(data$Label)

    healthy  Sepsis 
         15      22 

    > table(train_data$Label)
    healthy  Sepsis 
          9      14 

    > table(test_data$Label)
    healthy  Sepsis 
          6       8

    > confusion
    Confusion Matrix and Statistics

              Reference
    Prediction healthy Sepsis
       healthy       6      0
       Sepsis        0      8

                   Accuracy : 1          
                     95% CI : (0.7684, 1)
        No Information Rate : 0.5714     
        P-Value [Acc > NIR] : 0.0003958  

                      Kappa : 1          

     Mcnemar's Test P-Value : NA         

                Sensitivity : 1.0000     
                Specificity : 1.0000     
             Pos Pred Value : 1.0000     
             Neg Pred Value : 1.0000     
                 Prevalence : 0.5714     
             Detection Rate : 0.5714     
       Detection Prevalence : 0.5714     
          Balanced Accuracy : 1.0000     

           'Positive' Class : Sepsis
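
The reported metrics follow directly from the four cells of the confusion matrix above. The Python sketch below uses the standard formulas. `binary_metrics` is a hypothetical helper name and not a function from the analysis scripts.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Metrics computed from the four cells of a binary confusion matrix."""
    def ratio(num, den):
        return num / den if den else float("nan")
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "TPR": ratio(tp, tp + fn),              # sensitivity
        "TNR": ratio(tn, tn + fp),              # specificity
        "PPV": ratio(tp, tp + fp),              # positive predictive value
        "NPV": ratio(tn, tn + fn),              # negative predictive value
        "F1": ratio(2 * tp, 2 * tp + fp + fn),
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }

# Cells read from the confusion matrix above, with Sepsis as the positive class.
perfect = binary_metrics(tp=8, fp=0, tn=6, fn=0)
```

With no false positives and no false negatives, every entry of `perfect` equals 1, in line with the averages reported above.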


## Feature removal test

Next, we assessed which gene has the greatest impact on the MCC. For each gene in turn, we removed it from the feature set, repeated the 100 random-split iterations, and recalculated all metrics. The gene whose removal causes the largest drop in MCC has the greatest impact (feature_removal_results_GSE243217.csv).
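
The leave-one-feature-out procedure can be outlined as below. This is a sketch of the loop structure only. The `evaluate` callback stands in for "train a random forest on one random split and return its MCC" and is a hypothetical placeholder, as are the gene names in the toy example.

```python
from statistics import mean

def feature_removal_test(features, evaluate, n_repeats=100):
    """Rank features by the drop in mean MCC observed when each one is left out.

    evaluate(feature_subset, seed) must return the MCC of one train/test repeat.
    """
    def mean_mcc(subset):
        return mean(evaluate(subset, seed) for seed in range(n_repeats))

    baseline = mean_mcc(features)
    drops = {}
    for removed in features:
        kept = [f for f in features if f != removed]
        drops[removed] = baseline - mean_mcc(kept)
    ranked = sorted(drops.items(), key=lambda item: item[1], reverse=True)
    return baseline, ranked

# Toy stand-in for model training: the MCC is 1.0 only while GENE_A is available,
# so removing GENE_A should produce the largest drop.
def toy_evaluate(subset, seed):
    return 1.0 if "GENE_A" in subset else 0.5

baseline_mcc, ranked_drops = feature_removal_test(
    ["GENE_A", "GENE_B", "GENE_C"], toy_evaluate, n_repeats=5
)
```

If no single removal lowers the MCC, every drop is zero, which would mean that no single gene is indispensable for the classification.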


## Sanity check

Noise is artificially added to the dataset, and the model's performance metrics are tracked as the noise level increases (sanity_check_results_GSE243217.csv).

- Noise level 0: the original data with no added noise (baseline performance).
- Noise levels 10–50: increasing levels of noise are added to the input data, simulating reduced data quality or signal interference.
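
One possible way to add noise at these levels is sketched below in Python. The document does not specify how the noise was generated, so the scheme here is an assumption: Gaussian noise whose standard deviation is the given percentage of each gene's own standard deviation. `add_noise` and the toy matrix are hypothetical.

```python
import random
from statistics import pstdev

def add_noise(rows, noise_percent, rng):
    """Add Gaussian noise to each column, with sd = noise_percent% of that column's sd."""
    n_cols = len(rows[0])
    col_sd = [pstdev([row[j] for row in rows]) for j in range(n_cols)]
    scale = noise_percent / 100.0
    return [
        [value + rng.gauss(0.0, scale * col_sd[j]) for j, value in enumerate(row)]
        for row in rows
    ]

# Hypothetical expression matrix: 3 samples x 2 genes.
toy_expression = [[5.0, 1.0], [6.0, 3.0], [7.0, 5.0]]
noisy_by_level = {
    level: add_noise(toy_expression, level, random.Random(0))
    for level in (0, 10, 20, 30, 40, 50)
}
```

Level 0 returns the data unchanged, which gives the baseline against which the noisier versions are compared.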



MCC drops considerably more than the F1 score and reaches its lowest point at a noise level of 30. Because MCC uses all four cells of the confusion matrix and is sensitive to class imbalance, this suggests that noise at this level either skews the distribution of predicted classes or heavily compromises the classifier's ability to make correct predictions. The recovery at higher noise levels suggests that the model either stabilizes or becomes more random, which can raise MCC artificially if the predicted classes happen to balance by chance.

AUC remains at 1.000 throughout, which suggests that the classifier still distinguishes the two classes perfectly. Given that noise was added to the dataset, this may indicate overfitting, or that AUC does not capture the full impact of the noise.

The F1 score drops noticeably at noise levels of about 25-30, suggesting that precision, recall, or both are affected in this range. Its recovery at the highest noise levels implies that the model can regain its performance, which may indicate resilience to certain levels of noise.
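
As a worked illustration of why MCC can fall further than F1, assume a hypothetical test split in which all 8 sepsis samples are still detected but 2 of the 6 healthy samples are misclassified as sepsis (TP = 8, FN = 0, FP = 2, TN = 4). These counts are illustrative and are not taken from the results file.

$$
F_1 = \frac{2\,TP}{2\,TP + FP + FN} = \frac{16}{18} \approx 0.889
$$

$$
MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} = \frac{32}{\sqrt{10 \cdot 8 \cdot 6 \cdot 4}} \approx 0.730
$$

F1 ignores true negatives, so errors on the smaller healthy class cost it less than they cost MCC.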


We also ran the random forest with SMOTE applied to the training data to balance the classes. The results were unchanged, and all metrics were again 1.
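
The core idea of SMOTE, creating synthetic minority samples by interpolating between a sample and one of its nearest minority neighbors, can be sketched in Python as follows. This is a simplified illustration and not the implementation used in the R script. The function name, the choice of k = 5, and the two-gene toy data are assumptions.

```python
import math
import random

def smote(minority, n_synthetic, k, rng):
    """Create n_synthetic points by interpolating minority samples toward near neighbors."""
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Hypothetical 2-gene expression values for 9 healthy training samples. Five synthetic
# samples would bring this class up to the 14 sepsis samples of the training split.
rng = random.Random(0)
healthy_train = [[rng.gauss(5.0, 1.0), rng.gauss(2.0, 0.5)] for _ in range(9)]
synthetic_healthy = smote(healthy_train, n_synthetic=5, k=5, rng=rng)
```

Only the training data is resampled in this scheme, so the test split keeps its original class counts.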

We then ran a PCA to examine the variance in the dataset and to see how the samples cluster.

The samples cluster into clearly separated sepsis and healthy groups, which suggests strong gene expression differences between the two. Because PCA is an unsupervised technique, this clear separation suggests that a supervised classifier such as a random forest can learn the distinction easily.

AUC (area under the curve) = 1 means that the model ranks sepsis and healthy cases perfectly. Since PCA already shows the sepsis and healthy groups to be completely distinct, it is plausible that a classifier can separate them perfectly.

F1 score = 1 and MCC = 1 mean that there are no false positives and no false negatives. The distinct clusters in the PCA indicate that the gene expression patterns of sepsis and healthy samples do not overlap, so a simple decision boundary can be drawn between the two groups, leading to zero misclassifications.

A key concern was whether the model was merely memorizing the data (overfitting). However, PCA is an unsupervised method that never sees the labels, yet it still finds clear clusters. This indicates that the difference between sepsis and healthy samples is present in the expression data itself and is not an artifact of overfitting or data leakage.
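
The point that PCA never sees the labels can be illustrated with a small Python sketch that computes first-principal-component scores by power iteration. The two-gene toy data and the function name are hypothetical. A real analysis would use an established PCA routine rather than this hand-written loop.

```python
import math

def first_pc_scores(rows, n_iter=200):
    """Project centered samples onto the first principal component (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0 / math.sqrt(d)] * d  # starting direction
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(d)) for c in centered]

# Hypothetical expression of 2 genes: the first 3 rows are healthy, the last 3 sepsis.
# The labels are not passed to the function.
toy_samples = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [5.0, 6.1], [5.2, 5.9], [4.9, 6.0]]
pc1_scores = first_pc_scores(toy_samples)
```

In this toy case the healthy and sepsis samples end up on opposite sides of zero along the first component, even though the computation uses only the expression values.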

## Random forest with SMOTE resampling:

We trained a random forest classifier on sepsis_dataGSE243217.csv, this time with SMOTE resampling, to evaluate how accurately the expression of our 55-gene panel distinguishes sepsis patients from healthy controls (RF-Smote-GSE243217code.R). We ran the random forest 100 times, each time with a new random split of the data into training and test sets, and calculated the MCC, F1 score, AUC and … for all 100 repeats (repeated_splits_SMOTE_GSE243217.csv).

## Average metrics:

The table below gives each metric averaged over the 100 iterations (average_metrics_SMOTE_GSE243217.csv).

| Metric | Mean |
|--------|------|
| MCC    | 1    |
| F1     | 1    |
| AUC    | 1    |
| TPR    | 1    |
| TNR    | 1    |
| PPV    | 1    |
| NPV    | 1    |


## Feature removal test

Next, we assessed which gene has the greatest impact on the MCC. For each gene in turn, we removed it from the feature set, repeated the 100 random-split iterations, and recalculated all metrics. The gene whose removal causes the largest drop in MCC has the greatest impact (feature_removal_results_GSE243217.csv).


## Feature importance results:

Feature importance in a random forest indicates which features (genes, biomarkers, etc.) contribute most to classification performance. We extracted the importances in each iteration and then averaged them over all iterations (average_feature_importance_GSE243217.csv).

According to the averaged importances, genes such as CD177, S100A9, MMP8, ITGAM, S100A8, S100A12, and ARG1 are the most influential in distinguishing sepsis patients from healthy controls. These features reduce node impurity (uncertainty) the most when used in tree splits, and a higher Gini importance score means a more important feature.
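
The impurity reduction behind Gini-based importance can be shown with a short Python sketch. The parent node uses the 9 healthy and 14 sepsis training samples reported earlier. Both candidate splits are hypothetical, and a real forest accumulates such decreases over all splits and trees rather than over a single node.

```python
def gini(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_decrease(parent, left, right):
    """Impurity decrease of a split, weighting each child by its share of the samples."""
    n = sum(parent)
    weighted_children = sum(left) / n * gini(left) + sum(right) / n * gini(right)
    return gini(parent) - weighted_children

# Counts are (healthy, Sepsis).
parent = (9, 14)
perfect_split = gini_decrease(parent, left=(9, 0), right=(0, 14))  # a gene that separates the classes
weak_split = gini_decrease(parent, left=(5, 7), right=(4, 7))      # a gene that barely helps
```

A gene that produces pure child nodes removes all of the parent's impurity, whereas an uninformative gene removes almost none, which is why strongly discriminating genes rank highest.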

## Sanity check

Noise is artificially added to the dataset, and the model's performance metrics are tracked as the noise level increases (sanity_check_results_SMOTE_GSE243217.csv).

- Noise level 0: the original data with no added noise (baseline performance).
- Noise levels 10–50: increasing levels of noise are added to the input data, simulating reduced data quality or signal interference.




Performance drops sharply between 30% and 40% noise, suggesting that noise at this level corrupts important features.

The full recovery at 50% noise indicates that the model may be adapting to randomness. AUC remains stable, meaning that the model retains its ranking ability but loses absolute classification accuracy, as shown by F1 and MCC.
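
The difference between ranking ability and thresholded classification can be illustrated with a small Python sketch. The predicted probabilities and the 0.5 threshold are hypothetical and are not taken from the results files.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

def confusion_at(pos_scores, neg_scores, threshold):
    """Confusion-matrix cells when scores at or above the threshold are called positive."""
    tp = sum(s >= threshold for s in pos_scores)
    fp = sum(s >= threshold for s in neg_scores)
    return {"TP": tp, "FN": len(pos_scores) - tp, "FP": fp, "TN": len(neg_scores) - fp}

# Hypothetical predicted probabilities of Sepsis: every score has shifted upward,
# but each sepsis sample still scores higher than each healthy sample.
sepsis_scores = [0.95, 0.90, 0.85, 0.80]
healthy_scores = [0.70, 0.60, 0.45]
ranking_auc = auc(sepsis_scores, healthy_scores)
cells = confusion_at(sepsis_scores, healthy_scores, threshold=0.5)
```

Here the ranking is still perfect, so the AUC is 1, while two healthy samples cross the threshold and become false positives, which lowers F1 and MCC.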


## Mann-Whitney U test:

The Mann-Whitney U test (also called the Wilcoxon rank-sum test) is a non-parametric test for comparing two independent groups when the normality assumption required by the t-test is not met. We used it to compare the expression of each gene between sepsis patients and controls and to determine whether the expression levels differ significantly between the two groups (Man_W_GSE243217code.R).

A p-value < 0.05 means that the gene's expression differs significantly between sepsis and control samples.

A p-value ≥ 0.05 means that there is no strong evidence of a difference in expression levels.

The plot shows the top 25 significant genes:

Genes such as ARG1, S100A9, S100A8, CCR7, S100A12, ELANE, MMP8, GATA3, CD177, FCGRA1… show significantly different expression levels between sepsis patients and controls.
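
The test described in this section can be sketched in Python as below. This is a simplified illustration: it uses the normal approximation without tie or continuity correction, whereas statistical packages may use exact distributions or corrections, so p-values can differ slightly. The function name and the expression values are hypothetical.

```python
import math

def mann_whitney_u(x, y):
    """U statistic of x and a two-sided p-value from the normal approximation."""
    pooled = sorted((value, group) for group, values in enumerate((x, y)) for value in values)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # tied values share the average of ranks i+1..j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sigma
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p_value

# Hypothetical normalized expression of one gene in the two groups.
sepsis_expr = [9.1, 8.7, 9.5, 8.9, 9.8]
healthy_expr = [6.2, 5.9, 6.8, 6.1, 6.5]
u_stat, p_value = mann_whitney_u(sepsis_expr, healthy_expr)
```

In this toy example every sepsis value exceeds every healthy value, so U reaches its maximum of 25 and the approximate p-value falls below 0.05.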