In [1]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [2]:
import os

The analysis uses the results from the experiment file that trains and tests the fine-tuned MedSAM models on subsets of C-TRUS. I use the wilcoxon_compare.py script (provided in my GitHub) to compare results between experiments. It runs a Wilcoxon signed-rank test on each metric to check whether performance differences are statistically significant. Because the test is non-parametric, the paired differences do not need to be checked for normality.
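To make the test itself concrete, here is a minimal standard-library sketch of a two-sided Wilcoxon signed-rank test on paired per-sample scores. It is an illustration only and not the code in wilcoxon_compare.py, whose internals are not shown in this notebook. The function name `wilcoxon_signed_rank` is hypothetical, and the p-value uses a plain normal approximation without tie or continuity correction, so a library routine such as `scipy.stats.wilcoxon` is likely the better choice for real reporting.

```python
import math
from statistics import NormalDist


def wilcoxon_signed_rank(a, b):
    """Two-sided Wilcoxon signed-rank test on paired samples.

    Returns (w_plus, w_minus, p_value). The p-value comes from a normal
    approximation and is only a rough guide for small samples.
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")

    # Paired differences; zero differences carry no sign and are dropped.
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Sum the ranks separately for positive and negative differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Under the null hypothesis, w_plus is approximately normal.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return w_plus, w_minus, p_value
```

The two inputs must be aligned so that position i in both lists refers to the same test image. That is why the comparisons below always pair two per_sample_metrics.csv files evaluated on the same test set.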

Please make sure the files are placed in the corresponding directories if you want to run this script without errors, since it relies on the same directory layout I used to store the evaluations.
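A quick way to confirm the layout before running any comparison is to list the metric files that are missing. The sketch below assumes each evaluation run lives in its own folder containing per_sample_metrics.csv, as in the paths used throughout this notebook. The helper name `missing_inputs` is hypothetical.

```python
import os


def missing_inputs(eval_root, run_names, filename="per_sample_metrics.csv"):
    """Return the expected metric files that do not exist under eval_root."""
    expected = [os.path.join(eval_root, name, filename) for name in run_names]
    return [path for path in expected if not os.path.isfile(path)]


# Example: check the two runs needed for the Experiment 1 comparison.
runs = ["exp1_highTr_lowTe", "exp1_lowTr_lowTe"]
for path in missing_inputs("/content/drive/MyDrive/MedSAM/eval_results", runs):
    print("Missing:", path)
```

An empty result means every input for the listed runs is in place.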

# Experiment 1 - Analysis

In [3]:
!python /content/drive/MyDrive/MedSAM/fine_tune_scripts/wilcoxon_compare.py \
  --csv_a /content/drive/MyDrive/MedSAM/eval_results/exp1_highTr_lowTe/per_sample_metrics.csv \
  --csv_b /content/drive/MyDrive/MedSAM/eval_results/exp1_lowTr_lowTe/per_sample_metrics.csv \
  --label_a HighTr --label_b LowTr \
  --out_dir /content/drive/MyDrive/MedSAM/eval_results/results/exp1_highTr_vs_lowTr


Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp1_highTr_vs_lowTr/wilcoxon_results.csv
Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp1_highTr_vs_lowTr/wilcoxon_report.txt


# Experiment 2 - Analysis

Since experiment 2 involves more tests, it needs multiple comparisons to reach conclusions about the performance of the two models under noisy conditions. The first two comparisons compare the clean and noisy tests of each model (high and low). I then compare the noisy tests of the high and low models against each other to see whether one significantly outperformed the other.
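The three comparisons can also be written down as data and turned into command lines, which keeps the run names, labels, and output folders in one place. This is a sketch that only builds the argument lists. It uses the same flags and paths as the cells in this section, while the names `build_command` and `EXP2_COMPARISONS` are hypothetical.

```python
import posixpath

SCRIPT = "/content/drive/MyDrive/MedSAM/fine_tune_scripts/wilcoxon_compare.py"
EVAL_ROOT = "/content/drive/MyDrive/MedSAM/eval_results"


def build_command(run_a, run_b, label_a, label_b, out_name):
    """Build the argument list for one wilcoxon_compare.py invocation."""
    return [
        "python", SCRIPT,
        "--csv_a", posixpath.join(EVAL_ROOT, run_a, "per_sample_metrics.csv"),
        "--csv_b", posixpath.join(EVAL_ROOT, run_b, "per_sample_metrics.csv"),
        "--label_a", label_a,
        "--label_b", label_b,
        "--out_dir", posixpath.join(EVAL_ROOT, "results", out_name),
    ]


# (run A, run B, label A, label B, output folder) for each comparison.
EXP2_COMPARISONS = [
    ("exp2_highTr_clean", "exp2_highTr_noisy", "Clean", "Noisy",
     "exp2_highTr_clean_vs_noisy"),
    ("exp2_lowTr_clean", "exp2_lowTr_noisy", "Clean", "Noisy",
     "exp2_lowTr_clean_vs_noisy"),
    ("exp2_highTr_noisy", "exp2_lowTr_noisy", "HighTrNoisy", "LowTrNoisy",
     "exp2_highNoisy_vs_lowNoisy"),
]

commands = [build_command(*comparison) for comparison in EXP2_COMPARISONS]
```

Each entry in `commands` could then be passed to `subprocess.run(cmd, check=True)` instead of typing the shell command by hand.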

In [4]:
#high-quality trained (clean vs. noisy)
!python /content/drive/MyDrive/MedSAM/fine_tune_scripts/wilcoxon_compare.py \
  --csv_a /content/drive/MyDrive/MedSAM/eval_results/exp2_highTr_clean/per_sample_metrics.csv \
  --csv_b /content/drive/MyDrive/MedSAM/eval_results/exp2_highTr_noisy/per_sample_metrics.csv \
  --label_a Clean --label_b Noisy \
  --out_dir /content/drive/MyDrive/MedSAM/eval_results/results/exp2_highTr_clean_vs_noisy

Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp2_highTr_clean_vs_noisy/wilcoxon_results.csv
Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp2_highTr_clean_vs_noisy/wilcoxon_report.txt


In [5]:
#low-quality trained (clean vs. noisy)
!python /content/drive/MyDrive/MedSAM/fine_tune_scripts/wilcoxon_compare.py \
  --csv_a /content/drive/MyDrive/MedSAM/eval_results/exp2_lowTr_clean/per_sample_metrics.csv \
  --csv_b /content/drive/MyDrive/MedSAM/eval_results/exp2_lowTr_noisy/per_sample_metrics.csv \
  --label_a Clean --label_b Noisy \
  --out_dir /content/drive/MyDrive/MedSAM/eval_results/results/exp2_lowTr_clean_vs_noisy


Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp2_lowTr_clean_vs_noisy/wilcoxon_results.csv
Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp2_lowTr_clean_vs_noisy/wilcoxon_report.txt


In [6]:
#high vs. low quality trained models under synthetic noise
!python /content/drive/MyDrive/MedSAM/fine_tune_scripts/wilcoxon_compare.py \
  --csv_a /content/drive/MyDrive/MedSAM/eval_results/exp2_highTr_noisy/per_sample_metrics.csv \
  --csv_b /content/drive/MyDrive/MedSAM/eval_results/exp2_lowTr_noisy/per_sample_metrics.csv \
  --label_a HighTrNoisy --label_b LowTrNoisy \
  --out_dir /content/drive/MyDrive/MedSAM/eval_results/results/exp2_highNoisy_vs_lowNoisy


Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp2_highNoisy_vs_lowNoisy/wilcoxon_results.csv
Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp2_highNoisy_vs_lowNoisy/wilcoxon_report.txt


# Experiment 3 - Analysis

Experiment 3 introduces a mixed-trained fine-tuned MedSAM model, which was tested using the same process as the low and high models in experiments 1 and 2.

In [7]:
#experiment 1 with mixed trained model being compared to low quality trained model
!python /content/drive/MyDrive/MedSAM/fine_tune_scripts/wilcoxon_compare.py \
  --csv_a /content/drive/MyDrive/MedSAM/eval_results/exp3_mixTr_lowTe/per_sample_metrics.csv \
  --csv_b /content/drive/MyDrive/MedSAM/eval_results/exp1_lowTr_lowTe/per_sample_metrics.csv \
  --label_a MixTr --label_b LowTr \
  --out_dir /content/drive/MyDrive/MedSAM/eval_results/results/exp3_mixTr_vs_lowTr

Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp3_mixTr_vs_lowTr/wilcoxon_results.csv
Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp3_mixTr_vs_lowTr/wilcoxon_report.txt


In [8]:
#experiment 1 with mixed trained model being compared to high quality trained model
!python /content/drive/MyDrive/MedSAM/fine_tune_scripts/wilcoxon_compare.py \
  --csv_a /content/drive/MyDrive/MedSAM/eval_results/exp3_mixTr_lowTe/per_sample_metrics.csv \
  --csv_b /content/drive/MyDrive/MedSAM/eval_results/exp1_highTr_lowTe/per_sample_metrics.csv \
  --label_a MixTr --label_b HighTr \
  --out_dir /content/drive/MyDrive/MedSAM/eval_results/results/exp3_mixTr_vs_highTr

Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp3_mixTr_vs_highTr/wilcoxon_results.csv
Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp3_mixTr_vs_highTr/wilcoxon_report.txt


In [9]:
#experiment 2 with mixed trained model (noisy vs. clean)
!python /content/drive/MyDrive/MedSAM/fine_tune_scripts/wilcoxon_compare.py \
  --csv_a /content/drive/MyDrive/MedSAM/eval_results/exp3_mixTr_clean/per_sample_metrics.csv \
  --csv_b /content/drive/MyDrive/MedSAM/eval_results/exp3_mixTr_noisy/per_sample_metrics.csv \
  --label_a Clean --label_b Noisy \
  --out_dir /content/drive/MyDrive/MedSAM/eval_results/results/exp3_mixTr_clean_vs_noisy


Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp3_mixTr_clean_vs_noisy/wilcoxon_results.csv
Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp3_mixTr_clean_vs_noisy/wilcoxon_report.txt


# Experiment 4 - Analysis

Experiment 4 provides the overall performance of the three fine-tuned models on the entirety of the valid C-TRUS dataset. Using these results, I can compare the models and determine whether some outperform others. The results also allow a comparison with the baseline models tested in the C-TRUS papers and with the zero-shot test from the beginning.
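With three models there are exactly three pairwise comparisons. As a sketch, they can be enumerated with `itertools.combinations` rather than listed by hand. The folder and label naming below mirrors the cells in this section, and the helper name `exp4_pairs` is hypothetical.

```python
from itertools import combinations
import posixpath

EVAL_ROOT = "/content/drive/MyDrive/MedSAM/eval_results"
MODELS = ["highTr", "lowTr", "mixTr"]


def exp4_pairs(models=MODELS):
    """Return one settings dict per unordered pair of models."""
    pairs = []
    for a, b in combinations(models, 2):
        pairs.append({
            "csv_a": posixpath.join(EVAL_ROOT, f"exp4_{a}_full", "per_sample_metrics.csv"),
            "csv_b": posixpath.join(EVAL_ROOT, f"exp4_{b}_full", "per_sample_metrics.csv"),
            # Labels are the model names with the first letter capitalized.
            "label_a": a[0].upper() + a[1:],
            "label_b": b[0].upper() + b[1:],
            "out_dir": posixpath.join(EVAL_ROOT, "results", f"exp4_{a}_vs_{b}"),
        })
    return pairs


for pair in exp4_pairs():
    print(pair["label_a"], "vs", pair["label_b"], "->", pair["out_dir"])
```

The pairs come out in the same order as the three cells that follow: high vs. low, high vs. mixed, and low vs. mixed.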

In [10]:
#comparing high with low
!python /content/drive/MyDrive/MedSAM/fine_tune_scripts/wilcoxon_compare.py \
  --csv_a /content/drive/MyDrive/MedSAM/eval_results/exp4_highTr_full/per_sample_metrics.csv \
  --csv_b /content/drive/MyDrive/MedSAM/eval_results/exp4_lowTr_full/per_sample_metrics.csv \
  --label_a HighTr --label_b LowTr \
  --out_dir /content/drive/MyDrive/MedSAM/eval_results/results/exp4_highTr_vs_lowTr

Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp4_highTr_vs_lowTr/wilcoxon_results.csv
Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp4_highTr_vs_lowTr/wilcoxon_report.txt


In [11]:
#comparing high with mixed
!python /content/drive/MyDrive/MedSAM/fine_tune_scripts/wilcoxon_compare.py \
  --csv_a /content/drive/MyDrive/MedSAM/eval_results/exp4_highTr_full/per_sample_metrics.csv \
  --csv_b /content/drive/MyDrive/MedSAM/eval_results/exp4_mixTr_full/per_sample_metrics.csv \
  --label_a HighTr --label_b MixTr \
  --out_dir /content/drive/MyDrive/MedSAM/eval_results/results/exp4_highTr_vs_mixTr

Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp4_highTr_vs_mixTr/wilcoxon_results.csv
Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp4_highTr_vs_mixTr/wilcoxon_report.txt


In [12]:
#comparing low with mixed
!python /content/drive/MyDrive/MedSAM/fine_tune_scripts/wilcoxon_compare.py \
  --csv_a /content/drive/MyDrive/MedSAM/eval_results/exp4_lowTr_full/per_sample_metrics.csv \
  --csv_b /content/drive/MyDrive/MedSAM/eval_results/exp4_mixTr_full/per_sample_metrics.csv \
  --label_a LowTr --label_b MixTr \
  --out_dir /content/drive/MyDrive/MedSAM/eval_results/results/exp4_lowTr_vs_mixTr

Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp4_lowTr_vs_mixTr/wilcoxon_results.csv
Saved: /content/drive/MyDrive/MedSAM/eval_results/results/exp4_lowTr_vs_mixTr/wilcoxon_report.txt
