Cross-validate isoform predictions of TENNIS using a transcriptome assembly of deep long-read RNA-seq from GSE203583.

Download the `gtf` file from GSE203583 and put it in the `../data/GSE203583` directory.

In [None]:
# configurations
tennisMain      = "../programs/tennis.py"
longreadsgtf    = "../data/GSE203583/GSE203583_CIA.assembly.allTissues59K.gtf"
dm6_predictions = "./analyze_annotations/dm6.pred.gtf"

!mkdir -p long_read_support

Since chromosomes are named differently in `dm6.gtf` and `longreadsgtf`, we should first convert the names in `longreadsgtf` using cvbio.

In [None]:
longreads_chr_gtf = "../data/GSE203583/GSE203583_CIA.assembly.allTissues59K.chrname.gtf"
! cvbio UpdateContigNames \
    -i {longreadsgtf} \
    -o {longreads_chr_gtf} \
    -m ../data/chr_name.dm6.txt  \
    --comment-chars '#' \
    --columns 0 \
    --skip-missing true
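For readers without cvbio, the renaming step can be approximated in plain Python. This is only a sketch, not cvbio's implementation. It assumes a two-column, tab-delimited mapping file (old name, new name), passes `#` comment lines through unchanged, and assumes that records whose contig has no mapping are dropped when `skip_missing` is true. The names `load_mapping` and `rename_contigs`, and the example contigs, are illustrative.

```python
from typing import Dict, Iterable, Iterator


def load_mapping(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'old<TAB>new' lines into a dict (assumed mapping-file layout)."""
    mapping: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 2:
            mapping[fields[0]] = fields[1]
    return mapping


def rename_contigs(
    gtf_lines: Iterable[str],
    mapping: Dict[str, str],
    skip_missing: bool = True,
) -> Iterator[str]:
    """Rewrite column 0 (the contig name) of every GTF record."""
    for line in gtf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Comment lines are passed through unchanged (assumption).
            yield line
            continue
        fields = line.split("\t")
        new_name = mapping.get(fields[0])
        if new_name is None:
            if skip_missing:
                # Assumption: unmapped records are dropped.
                continue
            raise KeyError(f"no mapping for contig {fields[0]!r}")
        fields[0] = new_name
        yield "\t".join(fields)


# Made-up example data, not taken from GSE203583.
example_mapping = load_mapping(["2L\tchr2L", "X\tchrX"])
example_gtf = [
    "# header",
    '2L\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "g1";',
    'unplaced\tsrc\texon\t5\t50\t.\t-\t.\tgene_id "g2";',
]
renamed = list(rename_contigs(example_gtf, example_mapping))
```

In this example `renamed` keeps the header, rewrites `2L` to `chr2L`, and drops the record on the unmapped contig.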

Then we can evaluate TENNIS predictions by cross-validating them against the long-read transcriptome assembly. The principle is that a predicted isoform supported by real sequencing data is more likely to be a true positive (i.e., truly missing from the annotation).

In [None]:
# Check support from real data
!gffcompare -r {longreads_chr_gtf} -o long_read_support/comp_GSE203583 {dm6_predictions}
!cat long_read_support/comp_GSE203583.stats
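To make the notion of "support" concrete: gffcompare also writes a `.tmap` file with one row per query transcript, where class code `=` marks a query whose intron chain exactly matches a reference transcript. The sketch below counts class codes and reports the fraction of predictions with an exact match. It is not the `precision_recall_by_pctIn.py` script used later in this notebook; the function names are illustrative, and the example rows are made up and use only the first six `.tmap` columns.

```python
import csv
from collections import Counter
from typing import Iterable


def count_class_codes(tmap_lines: Iterable[str]) -> Counter:
    """Count gffcompare class codes in a tab-delimited .tmap file."""
    reader = csv.DictReader(tmap_lines, delimiter="\t")
    return Counter(row["class_code"] for row in reader)


def supported_fraction(codes: Counter) -> float:
    """Fraction of query transcripts with class code '=' (exact intron-chain match)."""
    total = sum(codes.values())
    return codes.get("=", 0) / total if total else 0.0


# Made-up example rows; a real .tmap file has additional columns.
example_tmap = [
    "ref_gene_id\tref_id\tclass_code\tqry_gene_id\tqry_id\tnum_exons",
    "G1\tT1\t=\tP1\tP1.1\t4",
    "G1\tT1\tj\tP1\tP1.2\t5",
    "-\t-\tu\tP2\tP2.1\t3",
    "G3\tT7\t=\tP3\tP3.1\t2",
]
codes = count_class_codes(example_tmap)
frac = supported_fraction(codes)
```

For a real run, pass an open file handle instead of the example list, e.g. `count_class_codes(open(path_to_tmap))`, where `path_to_tmap` is a placeholder for the `.tmap` path.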

Similarly, we can evaluate two randomized baseline approaches, `Rand1` and `RandX`.

In [None]:
dm6gtf = "../data/dm6.gtf"
!python {tennisMain} test -f Random1 -p 0.0 -o long_read_support/Rand1.dm6 --xi_gtf_file {dm6_predictions} {dm6gtf} 
!python {tennisMain} test -f RandomX -p 0.0 -o long_read_support/RandX.dm6 --xi_gtf_file {dm6_predictions} {dm6gtf}

In [None]:
# Check support from real data
!gffcompare -r {longreads_chr_gtf} -o long_read_support/comp_GSE203583_Rand1 long_read_support/Rand1.dm6.pred.gtf
!gffcompare -r {longreads_chr_gtf} -o long_read_support/comp_GSE203583_RandX long_read_support/RandX.dm6.pred.gtf

!cat long_read_support/comp_GSE203583_Rand1.stats
!cat long_read_support/comp_GSE203583_RandX.stats

Intron-chain level precision and recall can be found in the files `long_read_support/comp_GSE203583_Rand1.stats` and `long_read_support/comp_GSE203583_RandX.stats`. Denote them as `Rand1prec`, `Rand1rec`, `RandXprec`, and `RandXrec`. They are used as input to the plot script below.

In [None]:
# replace the following values with those from the `.stats` files
Rand1prec = 23.2
Rand1rec  = 149
RandXprec = 18.3
RandXrec  = 171
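Instead of copying the numbers by hand, they can be pulled out of the `.stats` text with a small parser. The sketch below assumes the usual gffcompare summary layout, with an `Intron chain level:` row holding sensitivity and precision and a `Matching intron chains:` count. Which field belongs in `Rand1rec` depends on what the plot script expects: the placeholder values above (e.g. 149) look like counts rather than percentages, so the sketch extracts the count as well. The function name and the example text are illustrative.

```python
import re
from typing import Dict


def parse_intron_chain_stats(stats_text: str) -> Dict[str, float]:
    """Extract intron-chain level numbers from gffcompare .stats text."""
    result: Dict[str, float] = {}
    level = re.search(
        r"Intron chain level:\s*([\d.]+)\s*\|\s*([\d.]+)\s*\|", stats_text
    )
    if level:
        result["sensitivity"] = float(level.group(1))
        result["precision"] = float(level.group(2))
    matching = re.search(r"Matching intron chains:\s*(\d+)", stats_text)
    if matching:
        result["matching_intron_chains"] = int(matching.group(1))
    return result


# Made-up excerpt in the assumed .stats layout.
example_stats = """\
#-----------------| Sensitivity | Precision  |
Intron chain level:     0.9     |    23.2    |

     Matching intron chains:     149
"""
parsed = parse_intron_chain_stats(example_stats)
```

With a real file, read it first, e.g. `parse_intron_chain_stats(open("long_read_support/comp_GSE203583_Rand1.stats").read())`. Missing fields are simply absent from the returned dictionary.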

In [None]:
import os
from typing import Optional

def get_symlink_target(symlink_path: str) -> Optional[str]:
    """Return the absolute path a symlink ultimately points to.

    Returns None if `symlink_path` is not a symlink, if the chain is
    circular, or if the final target does not exist.
    """
    # Check the link itself (not its resolved target)
    if not os.path.islink(symlink_path):
        return None

    seen = set()
    current = os.path.abspath(symlink_path)

    # Follow the chain one link at a time
    while os.path.islink(current):
        # Check for circular references
        if current in seen:
            return None
        seen.add(current)
        target = os.readlink(current)
        # Relative targets are relative to the directory holding the link
        if not os.path.isabs(target):
            target = os.path.join(os.path.dirname(current), target)
        current = os.path.normpath(target)

    # A broken link has no usable target
    return current if os.path.exists(current) else None

In [None]:
import sys
import os
from os.path import basename

# get the path to tennis/src dir
tennisSrc = tennisMain
while os.path.islink(tennisSrc):
    newTarget = os.readlink(tennisSrc)
    if not os.path.isabs(newTarget):
        newTarget = os.path.join(os.path.dirname(os.path.abspath(tennisSrc)), newTarget)
    tennisSrc = newTarget
tennisSrcDir = os.path.dirname(tennisSrc)
sys.path.insert(0, tennisSrcDir)

tmap = "./analyze_annotations/comp_GSE203583." + basename(dm6_predictions) + ".tmap"
pr_txt = "long_read_support/dm6.precision_recall_by_pctIn.txt"
!python ../scripts/precision_recall_by_pctIn.py {dm6_predictions} {tmap} {pr_txt} {tennisSrcDir}


Then we can make the plot. It is written to `long_read_support/dm6.plot.pdf`.

In [None]:
!python ../scripts/precision_recall_fig.py {pr_txt} {Rand1rec} {Rand1prec} {RandXrec} {RandXprec} long_read_support/dm6.plot.pdf