# Determine P-value threshold at a chosen FDR threshold

In this notebook, we describe how to determine a P-value threshold for a given dataset so that the false discovery rate (FDR) is kept below a chosen threshold. We use an empirical method to estimate the FDR at given P-value thresholds. For a given P-value threshold ``p_thresh``, we proceed in four steps:

1. Determine the number of significant interactions ``sig_num_o`` at ``p_thresh``.
2. Randomize the simple and twisted read pair counts of all interactions. We randomize an individual interaction with $n$ read pairs by drawing a simple count $s$ from a binomial distribution with $n$ trials and $p=0.5$ and then setting the twisted count to $t=n-s$.
3. Determine the number of randomized significant interactions ``sig_num_r`` at ``p_thresh``.
4. Use ``sig_num_r/sig_num_o`` to estimate the FDR at ``p_thresh``.
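The four steps above can be sketched in plain Python. This is a minimal sketch and not the code used by ``RandomizeInteractionSet``: the helper names ``binomial_p_value`` and ``estimate_fdr`` are hypothetical, and the P-value is assumed to come from a two-sided binomial test with $p=0.5$. That assumption is consistent with the reports further below, where six read pairs give a smallest possible P-value of 0.03125.

```python
import math
import random


def binomial_p_value(simple, twisted):
    """Two-sided binomial P-value for simple vs. twisted counts (assumed test, p = 0.5)."""
    n = simple + twisted
    k = max(simple, twisted)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for the two-sided test and capped at 1
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def estimate_fdr(counts, p_thresh, rng):
    """Estimate the FDR at p_thresh for a list of (simple, twisted) count pairs."""
    # Step 1: number of significant interactions in the observed data
    sig_num_o = sum(binomial_p_value(s, t) <= p_thresh for s, t in counts)
    # Step 2: redraw the simple count from Binomial(n, 0.5) and set twisted to n - s
    randomized = []
    for s, t in counts:
        n = s + t
        s_rand = sum(rng.random() < 0.5 for _ in range(n))
        randomized.append((s_rand, n - s_rand))
    # Step 3: number of significant interactions after randomization
    sig_num_r = sum(binomial_p_value(s, t) <= p_thresh for s, t in randomized)
    # Step 4: ratio of randomized to observed significant interactions
    fdr = sig_num_r / sig_num_o if sig_num_o > 0 else float("nan")
    return sig_num_o, sig_num_r, fdr


# Toy data: five strongly imbalanced and five balanced interactions
toy_counts = [(10, 0)] * 5 + [(3, 3)] * 5
print(estimate_fdr(toy_counts, 0.05, random.Random(0)))
```

The estimate depends on the random draw, so repeated runs with different seeds give different values, especially for small interaction sets.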

To find a P-value threshold for which the FDR is kept below a chosen threshold, we estimate the FDR for increasing P-value thresholds. Then we use the largest P-value threshold for which the estimated FDR is still below the chosen threshold. Note that we randomize each interaction only once and then use the same list of P-values of randomized interactions for all P-value thresholds.
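The scan over increasing P-value thresholds can be illustrated as follows. This is a sketch under stated assumptions: ``scan_p_value_thresholds`` is a hypothetical helper, the two input lists hold the P-values of the observed and the once-randomized interactions, and the last threshold whose estimated FDR is below the chosen FDR threshold is returned. How the library treats non-monotonic cases is not covered here.

```python
def scan_p_value_thresholds(pvals_observed, pvals_randomized, p_threshs, chosen_fdr_thresh):
    """Return (selected P-value threshold or None, list of result rows).

    p_threshs is assumed to be sorted in increasing order.
    """
    rows = []
    selected = None
    for p_thresh in p_threshs:
        # The same randomized P-values are reused for every threshold
        sig_num_o = sum(p <= p_thresh for p in pvals_observed)
        sig_num_r = sum(p <= p_thresh for p in pvals_randomized)
        fdr = sig_num_r / sig_num_o if sig_num_o > 0 else float("nan")
        rows.append({"P_VAL_TRESH": p_thresh, "SIG_NUM_R": sig_num_r,
                     "SIG_NUM_O": sig_num_o, "FDR": fdr})
        # Comparisons with nan are False, so empty thresholds are never selected
        if fdr < chosen_fdr_thresh:
            selected = p_thresh
    return selected, rows


# Toy P-values, for illustration only
observed = [0.001, 0.002, 0.003, 0.02, 0.04]
randomized = [0.015, 0.5, 0.6, 0.7, 0.8]
selected, rows = scan_p_value_thresholds(observed, randomized, [0.01, 0.02, 0.05], 0.1)
print(selected)
for row in rows:
    print(row)
```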

Usually the FDR grows monotonically with the P-value threshold. However, for datasets with poor signal-to-noise ratio, this may not be the case. Therefore we check this and issue a warning if applicable.
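A monotonicity check of this kind could look like the sketch below. The helper name ``find_fdr_decreases`` and the wording of the warning are hypothetical. The example values are the FDR estimates of the first five rows of the results table shown later in this notebook for test file 1.

```python
def find_fdr_decreases(fdr_values):
    """Return the indices at which the FDR is smaller than at the previous threshold."""
    return [i for i in range(1, len(fdr_values)) if fdr_values[i] < fdr_values[i - 1]]


# FDR estimates for the P-value thresholds 0.00025, 0.0005, 0.00075, 0.001 and 0.00125
fdr_values = [0.000000, 0.100000, 0.066667, 0.075000, 0.080000]
decreases = find_fdr_decreases(fdr_values)
if decreases:
    print("[WARNING] The estimated FDR does not grow monotonically with the "
          "P-value threshold (first decrease at row " + str(decreases[0]) + ").")
```

With only 20 to 30 significant interactions per threshold, a single randomized interaction shifts the estimate by several percentage points, which explains the drop from 0.1 to 0.066667 in this example.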

## Setting up the notebook

In [6]:
import sys
import os
from numpy import arange
import pandas
sys.path.append("..")
from diachr import DiachromaticInteractionSet
from diachr import RandomizeInteractionSet

In this notebook, we use the following classes:
- ``DiachromaticInteractionSet``
- ``RandomizeInteractionSet``

We use the ``DiachromaticInteractionSet`` to read, evaluate, categorize and write interactions to a file. The randomization analyses are carried out in the class ``RandomizeInteractionSet``, which operates on an already existing ``DiachromaticInteractionSet`` object that has been passed to its constructor.

## Test files

### Test file 1

The purpose of the first test file is to check whether the 

We have prepared a test file that contains the same number of interactions for consecutive P-value intervals. For this purpose, the following parameters must be specified:
- an interaction file
- a maximum P-value
- a P-value step size
- requested number of interactions per interval

Make sure that the interaction file has not previously been categorized with a P-value threshold that is smaller than the specified maximum P-value. Otherwise, interactions that do not have enough read pairs to be significant at the smaller threshold may have already been discarded.

In [42]:
interaction_file = '../tests/data/test_03/MK_0.06_evaluated_and_categorized_interactions.tsv.gz'
p_value_max = 0.05
p_value_step = 0.00025
i_count_per_range = 10

First, we load the interaction file into a ``DiachromaticInteractionSet``.

In [43]:
interaction_set_test = DiachromaticInteractionSet()
interaction_set_test.parse_file(interaction_file, verbose=False)
read_file_info_report = interaction_set_test.get_read_file_info_report()
print(read_file_info_report)

[INFO] Report on reading files:
	[INFO] Read interaction data from 1 files:
		[INFO] 9000910 interactions from: 
			[INFO] ../tests/data/test_03/MK_0.06_evaluated_and_categorized_interactions.tsv.gz
			[INFO] Set size: 9000910
	[INFO] The interaction set has 9000910 interactions.
[INFO] End of report.



Then we evaluate and categorize the interactions, using ``p_value_max`` as the threshold.

In [46]:
interaction_set_test.evaluate_and_categorize_interactions(p_value_max, verbose=False)
eval_cat_info_report = interaction_set_test.get_eval_cat_info_report()
print(eval_cat_info_report)

[INFO] Report on evaluation and categorization interactions:
	[INFO] P-value threshold: 0.0500000
	[INFO] Minimum number of read pairs required for significance: 6
	[INFO] Corresponding largest P-value: 0.0312500
	[INFO] Processed interactions: 9000910
	[INFO] Discarded interactions: 0
	[INFO] Not significant interactions (UI): 8266032
	[INFO] Significant interactions (DI): 734878
[INFO] End of report.
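The two report lines on the minimum number of read pairs can be reproduced with a short calculation. The sketch assumes a two-sided binomial test with $p=0.5$, for which the smallest P-value an interaction with $n$ read pairs can reach is $2 \cdot 0.5^n$. The function name is hypothetical.

```python
def min_read_pairs_for_significance(p_thresh):
    """Smallest n with 2 * 0.5**n <= p_thresh, and that smallest possible P-value."""
    if not 0.0 < p_thresh <= 1.0:
        raise ValueError("p_thresh must be in the interval ]0; 1]")
    n = 1
    # The most extreme outcome puts all n read pairs into one category
    while 2 * 0.5 ** n > p_thresh:
        n += 1
    return n, 2 * 0.5 ** n


print(min_read_pairs_for_significance(0.05))
```

For a threshold of 0.05 this gives 6 read pairs and a P-value of 0.03125, because five read pairs can reach at most $2 \cdot 0.5^5 = 0.0625$.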



Nothing is done. Interaction set remains unchanged.


We use the step size together with the maximum P-value to create a list with P-value thresholds.

In [48]:
p_threshs = arange(p_value_step, p_value_max + p_value_step, p_value_step)
print(p_threshs)

[0.00025 0.0005  0.00075 0.001   0.00125 0.0015  0.00175 0.002   0.00225
 0.0025  0.00275 0.003   0.00325 0.0035  0.00375 0.004   0.00425 0.0045
 0.00475 0.005   0.00525 0.0055  0.00575 0.006   0.00625 0.0065  0.00675
 0.007   0.00725 0.0075  0.00775 0.008   0.00825 0.0085  0.00875 0.009
 0.00925 0.0095  0.00975 0.01    0.01025 0.0105  0.01075 0.011   0.01125
 0.0115  0.01175 0.012   0.01225 0.0125  0.01275 0.013   0.01325 0.0135
 0.01375 0.014   0.01425 0.0145  0.01475 0.015   0.01525 0.0155  0.01575
 0.016   0.01625 0.0165  0.01675 0.017   0.01725 0.0175  0.01775 0.018
 0.01825 0.0185  0.01875 0.019   0.01925 0.0195  0.01975 0.02    0.02025
 0.0205  0.02075 0.021   0.02125 0.0215  0.02175 0.022   0.02225 0.0225
 0.02275 0.023   0.02325 0.0235  0.02375 0.024   0.02425 0.0245  0.02475
 0.025   0.02525 0.0255  0.02575 0.026   0.02625 0.0265  0.02675 0.027
 0.02725 0.0275  0.02775 0.028   0.02825 0.0285  0.02875 0.029   0.02925
 0.0295  0.02975 0.03    0.03025 0.0305  0.03075 0.031   0.0
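The NumPy documentation notes that the length of the ``arange`` output may not be numerically stable for non-integer steps and suggests ``numpy.linspace`` for such cases. As a hedged alternative, the thresholds can also be built from integer multiples of the step size, which fixes the number of thresholds explicitly. The sketch below uses plain Python and the parameter values from above. It is not what the notebook itself does.

```python
p_value_max = 0.05
p_value_step = 0.00025

# Number of thresholds, rounded to the nearest integer to absorb floating-point error
n_steps = round(p_value_max / p_value_step)

# Multiply the step size by an integer index instead of relying on a float stop value
p_threshs = [i * p_value_step for i in range(1, n_steps + 1)]

print(len(p_threshs), p_threshs[0], p_threshs[-1])
```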

For the parameters given above, this results in a list of 200 P-value thresholds. For each of the threshold values, we go through the interaction set until we have selected the required number of interactions with P-values within the current interval.

In [49]:
# Write the selected interactions to a new Diachromatic interaction file
with open('diachromatic_fdr_test_file.tsv', 'wt') as out_fh:
    i_count = 0  # cumulative number of selected interactions
    for p_thresh in p_threshs:
        i_count_range = 0  # number of interactions selected for the current range
        for d_inter in interaction_set_test.interaction_list:
            # Select interactions with P-values in the range ]p_thresh - p_value_step; p_thresh]
            if p_thresh - p_value_step < d_inter.get_pval() <= p_thresh:
                i_count_range += 1
                out_fh.write(d_inter.get_diachromatic_interaction_line() + '\n')
                i_count += 1
            if i_count_range == i_count_per_range:
                break
        if i_count_range < i_count_per_range:
            print("[WARNING] Could not select the required number (only "
                  + str(i_count_range) + " of " + str(i_count_per_range) +
                  ") of interactions for the P-value range ]"
                  + str(p_thresh - p_value_step) + ';' + str(p_thresh) + ']')
        # Report lower bound, upper bound, count for this range and cumulative count
        print(str(p_thresh - p_value_step) + '\t' + str(p_thresh) + '\t' + str(i_count_range) + '\t' + str(i_count))


0.0	0.00025	10	10
0.00025	0.0005	10	20
0.0005	0.00075	10	30
0.00075	0.001	10	40
0.001	0.00125	10	50
0.00125	0.0015	10	60
0.0015	0.00175	10	70
0.00175	0.002	10	80
0.002	0.0022500000000000003	10	90
0.0022500000000000003	0.0025000000000000005	10	100
0.0024999999999999996	0.00275	10	110
0.00275	0.003	10	120
0.003	0.0032500000000000003	10	130
0.0032500000000000003	0.0035000000000000005	10	140
0.0034999999999999996	0.00375	10	150
0.00375	0.004	10	160
0.004	0.00425	10	170
0.00425	0.0045000000000000005	10	180
0.0045000000000000005	0.004750000000000001	10	190
0.00475	0.005	10	200
0.005	0.00525	10	210
0.00525	0.0055000000000000005	10	220
0.0055	0.00575	10	230
0.00575	0.006	10	240
0.006	0.00625	10	250
0.00625	0.006500000000000001	10	260
0.006500000000000001	0.006750000000000001	10	270
0.00675	0.007	10	280
0.007	0.00725	10	290
0.00725	0.007500000000000001	10	300
0.0075	0.00775	10	310
0.00775	0.008	10	320
0.008	0.00825	10	330
0.00825	0.0085	10	340
0.0085	0.00875	10	350
0.00875	0.009000000000000001	

The code snippet above reports, for each P-value range, the number of selected interactions within this range and the cumulative number of selected interactions for this and all previous ranges.
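The selection above iterates over the interaction list once per P-value range. As an alternative sketch, the range of each P-value can be computed directly, so that a single pass suffices. This is not the notebook's method. The function ``select_per_interval`` is hypothetical and works on a plain sequence of P-values instead of interaction objects.

```python
import math


def select_per_interval(p_values, p_value_step, p_value_max, i_count_per_range):
    """Return, for each range ]k * step; (k + 1) * step], the indices of up to
    i_count_per_range P-values that fall into it."""
    n_ranges = round(p_value_max / p_value_step)
    selected = [[] for _ in range(n_ranges)]
    for idx, p in enumerate(p_values):
        # P-values of 0 or above the maximum belong to no range
        if not 0.0 < p <= p_value_max:
            continue
        # Values exactly on a boundary may land in a neighboring range due to rounding
        k = math.ceil(p / p_value_step) - 1
        k = min(max(k, 0), n_ranges - 1)
        if len(selected[k]) < i_count_per_range:
            selected[k].append(idx)
    return selected


# Toy P-values, for illustration only
toy_p_values = [0.0001, 0.0002, 0.0003, 0.00012, 0.06, 0.0499]
toy_selection = select_per_interval(toy_p_values, 0.00025, 0.05, 2)
print(toy_selection[0], toy_selection[1], toy_selection[-1])
```

The trade-off is that the range is derived from a floating-point division, whereas the nested loop compares each P-value against explicit range bounds.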

## Reading in a Diachromatic interaction file and evaluating and categorizing interactions

XXX

In [1]:
import sys
import os
from numpy import arange, exp, log
import pandas
sys.path.append("..")
from diachr import DiachromaticInteractionSet
from diachr import RandomizeInteractionSet

In [2]:
interaction_set = DiachromaticInteractionSet()
interaction_set.parse_file('diachromatic_fdr_test_file.tsv', verbose=False)
read_file_info_report = interaction_set.get_read_file_info_report()
print(read_file_info_report)

[INFO] Report on reading files:
	[INFO] Read interaction data from 1 files:
		[INFO] 2000 interactions from: 
			[INFO] diachromatic_fdr_test_file.tsv
			[INFO] Set size: 2000
	[INFO] The interaction set has 2000 interactions.
[INFO] End of report.



In [3]:
p_value_max = 0.05
interaction_set.evaluate_and_categorize_interactions(p_value_max, verbose=False)
eval_cat_info_report = interaction_set.get_eval_cat_info_report()
print(eval_cat_info_report)

[INFO] Report on evaluation and categorization interactions:
	[INFO] P-value threshold: 0.0500000
	[INFO] Minimum number of read pairs required for significance: 6
	[INFO] Corresponding largest P-value: 0.0312500
	[INFO] Processed interactions: 2000
	[INFO] Discarded interactions: 0
	[INFO] Not significant interactions (UI): 0
	[INFO] Significant interactions (DI): 2000
[INFO] End of report.



## Passing the interaction set to the ``RandomizeInteractionSet`` class

XXX

In [4]:
randomize_fdr = RandomizeInteractionSet(interaction_set=interaction_set)

In [5]:
chosen_fdr_thresh = 0.05
p_value_max = 0.05
p_value_step = 0.00025
fdr_info_dict = randomize_fdr.get_pval_tresh_at_chosen_fdr_tresh(chosen_fdr_thresh=chosen_fdr_thresh,
                                             pval_thresh_max=p_value_max,
                                             pval_thresh_step_size=p_value_step)
pandas.DataFrame(fdr_info_dict)

[WARING] XXX


| | P_VAL_TRESH | NNL_P_VAL_TRESH | SIG_NUM_R | SIG_NUM_O | FDR |
|---|---|---|---|---|---|
| 0 | 0.00025 | 8.294050 | 0 | 10 | 0.000000 |
| 1 | 0.00050 | 7.600902 | 2 | 20 | 0.100000 |
| 2 | 0.00075 | 7.195437 | 2 | 30 | 0.066667 |
| 3 | 0.00100 | 6.907755 | 3 | 40 | 0.075000 |
| 4 | 0.00125 | 6.684612 | 4 | 50 | 0.080000 |
| ... | ... | ... | ... | ... | ... |
| 195 | 0.04900 | 3.015935 | 77 | 1960 | 0.039286 |
| 196 | 0.04925 | 3.010846 | 77 | 1970 | 0.039086 |
| 197 | 0.04950 | 3.005783 | 79 | 1980 | 0.039899 |
| 198 | 0.04975 | 3.000745 | 79 | 1990 | 0.039698 |
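The derived columns of this table can be checked by hand. The sketch below recomputes them for the first five rows. It assumes that ``NNL_P_VAL_TRESH`` is the negative natural logarithm of ``P_VAL_TRESH``, which is inferred from the printed values, and that ``FDR`` is ``SIG_NUM_R/SIG_NUM_O`` as described at the beginning of this notebook. The helper name is hypothetical.

```python
import math


def recompute_columns(p_val_thresh, sig_num_r, sig_num_o):
    """Return (negative natural log of the threshold, estimated FDR)."""
    return -math.log(p_val_thresh), sig_num_r / sig_num_o


# (P_VAL_TRESH, SIG_NUM_R, SIG_NUM_O) copied from the first five rows of the table
first_rows = [
    (0.00025, 0, 10),
    (0.00050, 2, 20),
    (0.00075, 2, 30),
    (0.00100, 3, 40),
    (0.00125, 4, 50),
]
for p_val_thresh, sig_num_r, sig_num_o in first_rows:
    nnl, fdr = recompute_columns(p_val_thresh, sig_num_r, sig_num_o)
    print(f"{p_val_thresh:.5f}\t{nnl:.6f}\t{sig_num_r}\t{sig_num_o}\t{fdr:.6f}")
```

Because the test file contains exactly 10 interactions per P-value range, ``SIG_NUM_O`` grows by 10 with each threshold.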


## Dependency between the stability of the FDR estimate and the number of interactions

XXX

In [1]:
import sys
import os
from numpy import arange, exp, log
import pandas
sys.path.append("..")
from diachr import DiachromaticInteractionSet
from diachr import RandomizeInteractionSet

In [2]:
interaction_file = '../tests/data/test_03/MK_0.06_evaluated_and_categorized_interactions_top_10000.tsv.gz'
p_value_max = 0.05
p_value_step = 0.00025

In [3]:
p_value_max = 0.05
interaction_set_2 = DiachromaticInteractionSet()
interaction_set_2.parse_file(interaction_file, verbose=True)
read_file_info_report = interaction_set_2.get_read_file_info_report()
print(read_file_info_report)

[INFO] Parsing Diachromatic interaction file ...
	[INFO] ../tests/data/test_03/MK_0.06_evaluated_and_categorized_interactions_top_10000.tsv.gz
[INFO] ... done.
[INFO] Report on reading files:
	[INFO] Read interaction data from 1 files:
		[INFO] 10000 interactions from: 
			[INFO] ../tests/data/test_03/MK_0.06_evaluated_and_categorized_interactions_top_10000.tsv.gz
			[INFO] Set size: 10000
	[INFO] The interaction set has 10000 interactions.
[INFO] End of report.



In [4]:
p_value_max = 0.05
interaction_set_2.evaluate_and_categorize_interactions(p_value_max, verbose=True)
eval_cat_info_report = interaction_set_2.get_eval_cat_info_report()
print(eval_cat_info_report)

[INFO] Evaluate and categorize interactions ...
[INFO] ...done.
[INFO] Report on evaluation and categorization interactions:
	[INFO] P-value threshold: 0.0500000
	[INFO] Minimum number of read pairs required for significance: 6
	[INFO] Corresponding largest P-value: 0.0312500
	[INFO] Processed interactions: 10000
	[INFO] Discarded interactions: 0
	[INFO] Not significant interactions (UI): 9209
	[INFO] Significant interactions (DI): 791
[INFO] End of report.



In [5]:
randomize_fdr_2 = RandomizeInteractionSet(interaction_set=interaction_set_2)

In [38]:
chosen_fdr_thresh = 0.05
p_value_max = 0.05
p_value_step = 0.00025
fdr_info_dict = randomize_fdr_2.get_pval_tresh_at_chosen_fdr_tresh(chosen_fdr_thresh=chosen_fdr_thresh,
                                             pval_thresh_max=p_value_max,
                                             pval_thresh_step_size=p_value_step)

pandas.DataFrame(fdr_info_dict['RESULTS_TABLE']).loc[fdr_info_dict['ROW_INDEX'][0],:]

13	0.0035000000000000005	10	212	0.04716981132075472


P_VAL_TRESH          0.003500
NNL_P_VAL_TRESH      5.654992
SIG_NUM_R           10.000000
SIG_NUM_O          212.000000
FDR                  0.047170
Name: 13, dtype: float64

In [37]:
pandas.set_option('display.max_rows', None)
pandas.DataFrame(fdr_info_dict['RESULTS_TABLE'])

| | P_VAL_TRESH | NNL_P_VAL_TRESH | SIG_NUM_R | SIG_NUM_O | FDR |
|---|---|---|---|---|---|
| 0 | 0.00025 | 8.29405 | 2 | 103 | 0.019417 |
| 1 | 0.0005 | 7.600902 | 2 | 123 | 0.01626 |
| 2 | 0.00075 | 7.195437 | 3 | 133 | 0.022556 |
| 3 | 0.001 | 6.907755 | 5 | 147 | 0.034014 |
| 4 | 0.00125 | 6.684612 | 5 | 152 | 0.032895 |
| 5 | 0.0015 | 6.50229 | 5 | 163 | 0.030675 |
| 6 | 0.00175 | 6.348139 | 6 | 173 | 0.034682 |
| 7 | 0.002 | 6.214608 | 7 | 185 | 0.037838 |
| 8 | 0.00225 | 6.096825 | 7 | 189 | 0.037037 |
| 9 | 0.0025 | 5.991465 | 8 | 195 | 0.041026 |
