BackofenLab/IntaRNA-benchmark

IntaRNA-benchmark

Data and scripts to benchmark IntaRNA.

Dependencies

  • IntaRNA >= 2.0.0
  • python > 3.2
  • pandas python package
  • matplotlib python package

Setup

The input directory contains folders representing certain organisms (here Salmonella and E. coli) or benchmark data sets. Each has to contain two folders, one holding the query RNAs and one holding the target RNAs.

The output directory contains a folder for each callID (see parameters of calls.py) holding all the result files for that specific callID. This folder is initially empty and is filled using the benchmark scripts.

The bin folder holds the scripts used in the benchmarking process.

A required file is the verified_interactions.csv. The file contains interactions that were verified experimentally. The current data set covers interactions for

  • Escherichia coli (GenBank accession number NC_000913)
  • Salmonella typhimurium (NC_003197)

Theoretical background

The idea is to compare the output of different IntaRNA calls with the experimentally verified interactions. In order to achieve this, the calls.py script is used on the query and target files for each benchmark data set (in input folder) using the specified IntaRNA call. This results in a result file for each data set and query RNA. These files contain the interaction results produced by IntaRNA.

In each file, the results are ordered by energy, from the most favorable interaction (lowest energy) to the least favorable (highest energy). The hope is that the verified interactions for each query RNA are among the first entries of each file, i.e. that they have low energy. Therefore, a rank is stored for each entry in the verified_interactions file, representing the row index of that specific interaction in the corresponding IntaRNA output. For example, for sRNA ArcZ and mRNA STM1682 we check the Salmonella result file for ArcZ (ArcZ_NC_003197) and search for STM1682. The rank is then the row index at which STM1682 appears. The lower the rank, the better.
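The rank lookup described above can be sketched as follows. This is a minimal illustration, not the code of benchmark.py: it assumes a ";"-separated result table with the columns id1 (target), id2 (query) and E (energy), and it assumes ranks start at 1. The function name rank_of_target, the target names other than STM1682 and all energy values are made up for the example.

```python
import csv
import io


def rank_of_target(csv_text, target_id, sep=";"):
    """Return the 1-based rank of target_id after sorting by energy, or None."""
    rows = list(csv.DictReader(io.StringIO(csv_text), delimiter=sep))
    # Sort from the most favorable (lowest energy) to the least favorable row.
    rows.sort(key=lambda row: float(row["E"]))
    for rank, row in enumerate(rows, start=1):
        if row["id1"] == target_id:  # assumed: id1 holds the target name
            return rank
    return None  # the target does not appear in this result file


# Made-up miniature result table for the query ArcZ (illustrative values only).
example = """id1;id2;E
STM0001;ArcZ;-8.1
STM1682;ArcZ;-15.3
STM0002;ArcZ;-11.7
"""

print(rank_of_target(example, "STM1682"))  # prints 1
```

In this toy table STM1682 has the lowest energy, so it ends up in the first row after sorting.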

To visualize the results, a receiver operating characteristic (ROC) curve is created from the ranks determined earlier. The X-axis describes the number of target predictions per query RNA, while the Y-axis represents the number of true positives. That is, for each X, the number of ranks that are smaller than or equal to X is counted and plotted on the Y-axis. This way, multiple callIDs can be plotted in the same graph to compare their performance.
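The counting step behind the curve can be written in a few lines. The sketch below only illustrates the idea; the function name roc_counts and the example ranks are hypothetical and not taken from the plotting scripts.

```python
def roc_counts(ranks, end=200):
    """For each X in 1..end, count the ranks that are smaller than or equal to X."""
    return [sum(1 for rank in ranks if rank <= x) for x in range(1, end + 1)]


# Made-up ranks of five verified interactions for one callID.
ranks = [1, 3, 3, 7, 250]

curve = roc_counts(ranks, end=10)
print(curve)  # prints [1, 1, 3, 3, 3, 3, 4, 4, 4, 4]
```

The interaction with rank 250 never contributes here because it lies beyond the chosen upper bound.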

Scripts

The scripts are contained in the bin folder. The default values expect the scripts to be called from the main folder of the repository.

calls.py

Parameters:

  • intaRNAbinary (-b) the path of the intaRNA executable. Default: ../IntaRNA/src/bin/IntaRNA
  • infile (-i) location of the folder containing folders for each organism. The organism folders have to contain a query and a target folder holding the according fasta files. Default: ./input/
  • outfile (-o) location of the output folder. The script will add a folder for each callID. Default: ./output/
  • callID (-c) is a mandatory ID to differentiate between multiple calls of the script.
  • withED (-e) allows the precomputation of target ED-values in order to avoid recomputation.
  • callsOnly (-n) generates the calls and saves them in a log file without starting the process.
  • verified (-v) the path and file containing the verified interactions.
  • maxInteractionLength (-m) The maximum interaction length used in the precomputation of the target ED values. Default: 150

IMPORTANT: Arguments for IntaRNA can be added at the end of the script call and will be passed on to IntaRNA: python3 calls.py -c "callID" --"IntaRNA cmdLineArguments"

This script calls IntaRNA from intaRNAPath using the queries and targets for all data sets found within the inputPath (see above) and the additional parameterization provided by arg. The results of IntaRNA are written to stdout and redirected into an output file in the outputPath, where the callID is used to name the files. Several checks ensure that no files are overwritten and that the required files are available.
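How a single call might be assembled is sketched below. This is an assumption about the general approach, not the actual code of calls.py: the function name build_intarna_call and the FASTA paths are hypothetical. The IntaRNA options -q, -t, --outMode C (CSV output) and --seedBP are used as in IntaRNA 2; check them against the installed version.

```python
import shlex


def build_intarna_call(binary, query_fasta, target_fasta, extra_args=""):
    """Assemble the argument list for one IntaRNA run (nothing is executed)."""
    call = [binary, "-q", query_fasta, "-t", target_fasta, "--outMode", "C"]
    # Additional IntaRNA arguments are appended unchanged.
    call += shlex.split(extra_args)
    return call


call = build_intarna_call(
    "../IntaRNA/src/bin/IntaRNA",          # default binary path from the list above
    "input/Salmonella/query/ArcZ.fasta",   # hypothetical query file
    "input/Salmonella/target/NC_003197.fasta",  # hypothetical target file
    extra_args="--seedBP 7",
)
print(" ".join(call))
```

Such a list can be handed to subprocess.run, with stdout captured and written to the result file of the respective callID.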

The time (in seconds) and maximal memory usage (in megabytes) required to handle each call are also measured and recorded in tables. The tables are stored in the specified outputPath. The individual calls are also logged in a log file.
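One way to obtain such measurements on a Unix system is sketched below. It is not necessarily how calls.py does it: the function name run_and_measure is hypothetical, and the conversion assumes Linux, where ru_maxrss is reported in kilobytes (macOS reports bytes).

```python
import resource
import subprocess
import sys
import time


def run_and_measure(command):
    """Run a command and return (stdout, wall time in seconds, peak memory in MB)."""
    start = time.perf_counter()
    completed = subprocess.run(command, capture_output=True, text=True)
    seconds = time.perf_counter() - start
    # Peak resident set size over all child processes waited for so far.
    # This is a high-water mark, so it never decreases between calls.
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return completed.stdout, seconds, peak_kb / 1024.0


# Stand-in command; a real benchmark would pass the IntaRNA call here.
output, seconds, peak_mb = run_and_measure([sys.executable, "-c", "print('ok')"])
print(f"{seconds:.3f} s, {peak_mb:.1f} MB")
```

Because the value is a high-water mark across children, per-call memory figures are only reliable if each call is measured in a fresh parent process.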

When the withED option is set, the ED-values for all targets are precomputed and stored in a folder ED-values/'organism'/target_name/. They are then used by all further IntaRNA calls. If the target ED-values are already contained in the given folder, they are used directly without recomputation. This option was only tested with ViennaRNA version 2.4.4 and IntaRNA version 2.2.0 and might not work with older versions.

Finally, the script calls benchmark.py using the specified callID as benchID.

Output: (contained in the respective callID folder)

  • (query)_(target).csv -> intarna output for a specific query-target combination (FASTA names used)
  • calls.txt -> log file for the calls
  • runTime.csv -> table with runtimes for each query-target combination.
  • memoryUsage.csv -> table with memory usage for each query-target combination.

benchmark.py

Parameters:

  • infile (-i) the location of the file containing the experimentally verified interactions. Default: ../verified_interactions.csv
  • outfile (-o) the name of the output file. Default: /benchmark.csv
  • callDirs (-p) the location where the output of the calls.py script lies. Default: ../output/
  • callID (-c) mandatory ID to differentiate between multiple benchmarkings.

This script uses the output of the calls.py script. It is called automatically at the end of the calls.py script. It stores the verified interactions from the specified file in a dictionary and calculates the rank for each interaction. In order to achieve this, it reads the files created by the calls.py script and sorts the tables according to the energy. Once the files are sorted, the row-number for each interaction in the verified interactions file is determined. The resulting row-number is the rank for that interaction. The ranks are then stored in a CSV file.

Default Output: (contained in the respective callID folder)

  • benchmark.csv -> file containing the rank for each verified interaction.

plot.py

Parameters:

  • benchmarkFile (-i) mandatory benchmark file used to plot the results. (created using benchmark.py, possibly in combination with mergeBenchmarks.py)
  • outputFilePath (-o) the location and name of the output file. Default: IntaRNA2_benchmark.pdf
  • separator (-s) separator used for the csv files. Default: ;
  • config (-c) path to the required configuration file.
  • title (-t) the title of the main plot (currently not in config file to allow easier changing via script).
  • referenceID (-r) the ID used to create the reference curve for violin plots.
  • plottype (-p) the type of plot required (violin / TODO / TODO).
  • plottype (-a) create additional plots for the time and memory consumption.

THIS SCRIPT WILL REPLACE ALL OTHER PLOTTING SCRIPTS. KEY FUNCTIONALITY IS ALREADY AVAILABLE.

This plotting script requires a config.txt as provided in the GitHub repository. This allows complete customization of the plots without changing the code. The script can currently output combined ROC/violin plots showing the performance of a given IntaRNA call. It can also plot the time and memory consumption for the given call.

plot_performance.py

Parameters:

  • benchmarkFile (-i) mandatory benchmark file used to plot the results. (created using benchmark.py, possibly in combination with mergeBenchmarks.py)
  • outputFilePath (-o) the location and name of the output file. Default: IntaRNA2_benchmark.pdf
  • separator (-s) separator used for the csv files. Default: ;
  • end (-e) the upper bound of the number of target predictions. Default: 200
  • xlim (-x) specify an x-limit for the output. x_start/x_end (x is already bound by end, changing might lead to strange results)
  • ylim (-y) specify a y-limit for the output. y_start/y_end

UNDER REPLACEMENT: Will be removed once the plot.py script is fully functional.

This script uses a benchmark.csv file created by the benchmark.py script. For each callID present in the benchmark file, the ranks are used to create a receiver operating characteristic (ROC) curve. For each step from 1 to "end" (default 200), the number of ranks that are smaller than or equal to the current step is recorded. These are the desired true positives.

Default Output:

  • IntaRNA2_benchmark.pdf -> a pdf of a roc plot for all contained callIDs

plot_boxes.py

Parameters:

  • benchmarkFile (-i) mandatory benchmark file used to plot the results. (created using benchmark.py, possibly in combination with mergeBenchmarks.py)
  • outputFilePath (-o) the location and name of the output file. Default: IntaRNA2_benchmark.pdf
  • separator (-s) separator used for the csv files. Default: ;
  • title (-t) title for the plot
  • rankThreshold (-r) thresholds for which the boxplots are created. Default: 5 10 50 100 200
  • fixedID (-f) the callID for the reference curve (needed for the boxplots)

This script uses a benchmark.csv file created by the benchmark.py script. For each callID present in the benchmark file, the ranks are used to create a receiver operating characteristic (ROC) curve. The data plotted is the number of ranks smaller than or equal to the currently allowed number of target predictions, over the range [0-max(threshold)]. The data from the ROC curve is used to create a difference measure between each curve and the reference curve (defined by fixedID). This is visualized using boxplots for different, user-definable target prediction thresholds. The upper bound of the x-axis is taken from the thresholds. Default: 200.
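The text does not spell out the difference measure, so the sketch below uses one plausible choice purely as an assumption: the difference in true positive counts between each callID and the reference at every threshold. The function names, the callIDs "ref" and "test" and all ranks are made up; how plot_boxes.py aggregates its values into boxplots may differ.

```python
def roc_counts(ranks, end):
    """For each X in 1..end, count the ranks that are smaller than or equal to X."""
    return [sum(1 for rank in ranks if rank <= x) for x in range(1, end + 1)]


def differences_to_reference(ranks_by_call, fixed_id, thresholds=(5, 10, 50, 100, 200)):
    """Difference in true positives to the reference curve at each threshold."""
    end = max(thresholds)  # upper bound of the x-axis is taken from the thresholds
    reference = roc_counts(ranks_by_call[fixed_id], end)
    result = {}
    for call_id, ranks in ranks_by_call.items():
        if call_id == fixed_id:
            continue  # the reference is not compared with itself
        curve = roc_counts(ranks, end)
        # Index t - 1 holds the count for X = t.
        result[call_id] = {t: curve[t - 1] - reference[t - 1] for t in thresholds}
    return result


# Made-up ranks for a reference call and one call to compare.
ranks_by_call = {"ref": [1, 4, 12, 60], "test": [2, 3, 8, 300]}

diffs = differences_to_reference(ranks_by_call, "ref", thresholds=(5, 10, 50))
print(diffs)  # prints {'test': {5: 0, 10: 1, 50: 0}}
```

A positive value means the compared call recovers more verified interactions than the reference within that number of target predictions.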

mergeBenchmarks.py

Parameters:

  • outputFileName (-o) mandatory name and path of the output file.
  • outputPath (-d) location of the result directory (containing the folders of the individual callIDs).
  • benchID (-b) specific benchIDs to be merged, at least two. benchID1 benchID2 ...
  • all (-a) when set, all benchIDs in the outputPath are merged.

This script merges the benchmark files and the corresponding runTime and memoryUsage files of multiple or all benchIDs. The result is a single file holding the data of several benchIDs, so that all IDs can be plotted at once using plot_performance.py.

clearAll.py

Parameters:

  • outputPath (-f) the location of the output files that will be deleted. Default: ../output/
  • callID (-c) specific callIDs that will be deleted. callID1/callID2/...

Script to delete specific callIDs. If no callID is specified, all callIDs are deleted from the specified folder.
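The behavior described above can be sketched as follows. This is a hedged illustration rather than the code of clearAll.py: the function name clear_call_ids is hypothetical, and the example works on a temporary directory so that no real results are touched.

```python
import shutil
import tempfile
from pathlib import Path


def clear_call_ids(output_path, call_ids=None):
    """Delete the given callID folders, or every folder in output_path if none are given."""
    root = Path(output_path)
    if call_ids:
        # callIDs are given in the form callID1/callID2/...
        targets = [root / name for name in call_ids.split("/") if name]
    else:
        targets = [path for path in root.iterdir() if path.is_dir()]
    removed = []
    for folder in targets:
        if folder.is_dir():  # silently skip callIDs that do not exist
            shutil.rmtree(folder)
            removed.append(folder.name)
    return sorted(removed)


# Demonstration on a throwaway directory with three made-up callID folders.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("callA", "callB", "callC"):
        (Path(tmp) / name).mkdir()
    removed_some = clear_call_ids(tmp, "callA/callB")
    remaining = sorted(path.name for path in Path(tmp).iterdir())
    removed_rest = clear_call_ids(tmp)

print(removed_some, remaining, removed_rest)
```

Since shutil.rmtree deletes without confirmation, a real script would do well to print the folders first or ask before removing everything.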
