In [5]:
"""
The purpose of this Jupyter notebook is to determine the optimal
thresholds from ROC curves in order to suggest novel potential
therapeutic targets for VACV infection.

To be more precise, median-based intensity refinement turned out to
perform best, which is why its ROC curves are used to this end. One
threshold each is chosen for the early and the late intensities.

After median-based intensity refinement and min-max normalization have
been performed on the entire screen, the optimal thresholds are applied
to identify hits. The two thresholds are first applied separately to
all genes/proteins, yielding two separate lists of hits: a gene/protein
whose min-max normalized value is greater than or equal to the
threshold value is considered a hit. The two hit lists are then merged,
i.e. their union is formed.

The optimal threshold is determined using Youden's J index, which is
defined as the difference between TPR and FPR (i.e. TPR - FPR).
Specifically, the optimal threshold is the one associated with the
maximum value of Youden's J index.
"""
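
The hit-calling procedure described in the docstring above (min-max normalization, thresholding with "greater than or equal to", then the union of both hit lists) could be sketched roughly as follows. This is only an illustration under stated assumptions: the gene names and intensity values are made up, the helper names `min_max_normalize` and `call_hits` are hypothetical, and plain Python is used instead of pandas so that the sketch runs without the screen data. The two thresholds are the rounded values printed further down in this notebook.

```python
# Hypothetical refined intensities per gene (NOT real screen data)
early_raw = {"GENE_A": 12.0, "GENE_B": 30.0, "GENE_C": 48.0, "GENE_D": 20.0}
late_raw = {"GENE_A": 5.0, "GENE_B": 9.0, "GENE_C": 6.0, "GENE_D": 25.0}


def min_max_normalize(values):
    """Rescale a gene -> value mapping linearly to the range [0, 1]."""
    lowest, highest = min(values.values()), max(values.values())
    span = highest - lowest
    if span == 0:
        raise ValueError("All values are identical; cannot min-max scale")
    return {gene: (value - lowest) / span for gene, value in values.items()}


def call_hits(normalized, threshold):
    """Return the genes whose normalized value is >= the threshold."""
    return {gene for gene, value in normalized.items() if value >= threshold}


# Rounded optimal thresholds as reported by this notebook
early_threshold = 0.404
late_threshold = 0.286

# Apply each threshold separately, then merge the two hit lists (union)
early_hits = call_hits(min_max_normalize(early_raw), early_threshold)
late_hits = call_hits(min_max_normalize(late_raw), late_threshold)
merged_hits = early_hits | late_hits

print(sorted(merged_hits))
```

With these made-up values, GENE_B and GENE_C pass the early threshold and GENE_D passes the late one, so the union contains all three, while GENE_A is a hit in neither list.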


In [2]:
import pandas as pd

In [3]:
# Load the two TSV files containing FPR, TPR and threshold value triples
path_to_early_tsv = (
    "/Users/jacobanter/Documents/Code/VACV_screen/Processing_Dharmacon_"
    "pooled_genome_1_and_2_subset/roc_curves_and_related_data/refined_"
    "normalized_intensity_values/median_refinement/roc_curve_data_"
    "early_VoronoiCells.tsv"
)

early_roc_curve_df = pd.read_csv(
    path_to_early_tsv,
    sep="\t"
)

path_to_late_tsv = (
    "/Users/jacobanter/Documents/Code/VACV_screen/Processing_Dharmacon_"
    "pooled_genome_1_and_2_subset/roc_curves_and_related_data/refined_"
    "normalized_intensity_values/median_refinement/roc_curve_data_late_"
    "VoronoiCells.tsv"
)

late_roc_curve_df = pd.read_csv(
    path_to_late_tsv,
    sep="\t"
)
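
Since the same directory appears in every path in this notebook, one possible way to avoid repeating it is to build the paths with `pathlib`. The following is a minimal sketch, not a required change: the directory components are copied from the paths above, while the variable names `median_refinement_dir` and `path_to_optimal_thresholds_tsv` are merely suggestions. `pd.read_csv` and `DataFrame.to_csv` accept `Path` objects directly.

```python
from pathlib import Path

# Directory components copied from the paths used in this notebook
median_refinement_dir = (
    Path("/Users/jacobanter/Documents/Code/VACV_screen")
    / "Processing_Dharmacon_pooled_genome_1_and_2_subset"
    / "roc_curves_and_related_data"
    / "refined_normalized_intensity_values"
    / "median_refinement"
)

# Input files holding the FPR, TPR and threshold value triples
path_to_early_tsv = (
    median_refinement_dir / "roc_curve_data_early_VoronoiCells.tsv"
)
path_to_late_tsv = (
    median_refinement_dir / "roc_curve_data_late_VoronoiCells.tsv"
)

# Output file for the optimal thresholds (variable name is hypothetical)
path_to_optimal_thresholds_tsv = (
    median_refinement_dir / "optimal_thresholds.tsv"
)

print(path_to_early_tsv.name)
print(path_to_late_tsv.name)
```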

In [10]:
# Introduce a new column storing Youden's J index, i.e. the difference
# between TPR and FPR, for each row
early_roc_curve_df["J_index"] = (
    early_roc_curve_df["TPR"]
    -
    early_roc_curve_df["FPR"]
)

late_roc_curve_df["J_index"] = (
    late_roc_curve_df["TPR"]
    -
    late_roc_curve_df["FPR"]
)
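
To make the definition concrete, here is a small worked example of Youden's J index on a handful of made-up ROC points (these are not values from the screen). It also checks the equivalent formulation J = sensitivity + specificity - 1, which follows from sensitivity = TPR and specificity = 1 - FPR. Plain Python is used so that the sketch runs without the TSV files.

```python
import math

# Hypothetical (FPR, TPR, threshold) triples, NOT taken from the screen
roc_points = [
    (0.0, 0.0, 1.0),
    (0.1, 0.5, 0.7),
    (0.2, 0.8, 0.4),
    (0.5, 0.9, 0.2),
    (1.0, 1.0, 0.0),
]

# Youden's J index for each ROC point: J = TPR - FPR
j_values = [tpr - fpr for fpr, tpr, _ in roc_points]

# Equivalent formulation: J = sensitivity + specificity - 1
for (fpr, tpr, _), j in zip(roc_points, j_values):
    sensitivity = tpr
    specificity = 1.0 - fpr
    assert math.isclose(j, sensitivity + specificity - 1.0, abs_tol=1e-12)

# The optimal threshold belongs to the point with the largest J value;
# max() returns the first such point if several share the maximum
best_index = max(range(len(roc_points)), key=lambda i: j_values[i])
best_threshold = roc_points[best_index][2]

print(f"Maximum J index: {j_values[best_index]:.1f}")
print(f"Associated threshold: {best_threshold}")
```

In this toy example, the third point (FPR 0.2, TPR 0.8) has the largest J index, so its threshold would be chosen.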

In [12]:
# Overwrite the TSV files with the extended DataFrames (note that this
# modifies the input files in place)
early_roc_curve_df.to_csv(
    path_to_early_tsv,
    sep="\t",
    index=False
)

late_roc_curve_df.to_csv(
    path_to_late_tsv,
    sep="\t",
    index=False
)

In [19]:
# Now, for early and late intensities individually, determine the
# threshold associated with the maximum J index (if several rows share
# the maximum, idxmax returns the first of them)
early_best_threshold = early_roc_curve_df.loc[
    early_roc_curve_df["J_index"].idxmax(), "Threshold"
]

late_best_threshold = late_roc_curve_df.loc[
    late_roc_curve_df["J_index"].idxmax(), "Threshold"
]

print(
    f"Optimal threshold for early intensities: {early_best_threshold:.3f}"
)
print(
    f"Optimal threshold for late intensities: {late_best_threshold:.3f}"
)

Optimal threshold for early intensities: 0.404
Optimal threshold for late intensities: 0.286


In [35]:
# Collect the optimal thresholds in a pandas DataFrame and save them as
# a TSV file
optimal_thresholds_df = pd.DataFrame(
    data={
        "threshold_type": ["early", "late"],
        "threshold_value": [early_best_threshold, late_best_threshold]
    }
)

optimal_thresholds_df.to_csv(
    "/Users/jacobanter/Documents/Code/VACV_screen/Processing_Dharmacon_"
    "pooled_genome_1_and_2_subset/roc_curves_and_related_data/refined_"
    "normalized_intensity_values/median_refinement/optimal_thresholds.tsv",
    sep="\t",
    index=False
)

In [37]:
# As a sanity check, reload the TSV file storing the optimal thresholds
# and look up each value by its threshold type
optimal_thresholds_df = pd.read_csv(
    "/Users/jacobanter/Documents/Code/VACV_screen/Processing_Dharmacon_"
    "pooled_genome_1_and_2_subset/roc_curves_and_related_data/refined_"
    "normalized_intensity_values/median_refinement/optimal_thresholds.tsv",
    sep="\t",
    index_col="threshold_type"
)

print(
    optimal_thresholds_df.loc["early", "threshold_value"]
)

print(
    optimal_thresholds_df.loc["late", "threshold_value"]
)

0.4036220452743436
0.2860976796049103
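
The manual check above could also be turned into an automated round-trip test. The sketch below is one possible way to do this and makes two assumptions: an in-memory buffer stands in for the TSV file on disk, and the two threshold values are hard-coded from the output above. The `float_precision="round_trip"` option of `pd.read_csv` is an optional extra that asks pandas to parse floats so that they survive a write/read cycle unchanged; the comparison uses `math.isclose` regardless.

```python
import io
import math

import pandas as pd

# Threshold values copied from the output printed by this notebook
thresholds = {"early": 0.4036220452743436, "late": 0.2860976796049103}

thresholds_df = pd.DataFrame(
    data={
        "threshold_type": list(thresholds.keys()),
        "threshold_value": list(thresholds.values())
    }
)

# Write to an in-memory buffer instead of a file on disk (assumption
# made for illustration only)
buffer = io.StringIO()
thresholds_df.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)

reloaded_df = pd.read_csv(
    buffer,
    sep="\t",
    index_col="threshold_type",
    float_precision="round_trip"
)

# Verify that every threshold survives the write/read cycle
for threshold_type, expected in thresholds.items():
    actual = reloaded_df.loc[threshold_type, "threshold_value"]
    assert math.isclose(actual, expected, rel_tol=1e-12)

print("Round trip OK")
```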
