# Benchmark
Adopt the RNAlight approach, mostly.

Input files: one for cytoplasmic lncRNA, the other for nuclear lncRNA.   
Input file format: tab-delimited lines of 3 fields: transcipt ID, gene name, RNA sequence.    
Header line: ensembl_transcript_id name cdna 
Data lines: ENST00000371086	DLEU2L	GAAAGTTTTCACTGCATCT... 
Each lncRNA is placed in either file, depending on mean CNRCI over 14 cell lines from lncATLAS.   
The threshold is zero; positive CNRCI values are cytoplasmic and others are nuclear.    
Use the Ensembl transcript ID (prefix ENST) without any version number suffix.
Use the canonical RNA sequence from GenCode to represent each gene in lncATLAS.
Evaluate the model by cross-validation on the entire dataset (no test subset withheld).

From GenCode, download these two files and unzip them:
* annotation = gencode.v45.long_noncoding_RNAs.gff3
* sequence = gencode.v45.lncRNA_transcripts.fa

From lncATLAS, download this file and unzip it:
* lncATLAS_all_data_RCI.csv

Make sure all three files, or links to them, are in the current directory.

In [1]:
import traceback
import numpy as np

In [2]:
ATLAS='lncATLAS_all_data_RCI.csv'
GFF_FILE='gencode.v45.long_noncoding_RNAs.gff3'
FASTA_FILE='gencode.v45.lncRNA_transcripts.fa'

In [3]:
def get_canonical_ids(gff_file):
    ids = set()
    with open(gff_file,'r') as fin:
        for line in fin:
            if line.startswith('#'):
                continue
            fields = line.strip().split('\t')
            if fields[2] != 'transcript':
                continue
            if 'Ensembl_canonical' not in fields[8]:
                continue
            tag_value_pairs = fields[8].split(';')
            for pair in tag_value_pairs:
                if pair.startswith('ID='):
                    ID_plus_ver = pair[3:]
                    just_ID = ID_plus_ver.split('.')[0]
                    ids.add(just_ID)
                    break
    return ids

In [4]:
canon_ids = get_canonical_ids(GFF_FILE)
print( len(canon_ids), 'canonical lncRNA IDs')

20424 canonical lncRNA IDs


In [5]:
def load_mean_rci(filename,exclude=None):
    cnrci_lists = dict()  # key=ENSG_ID, value=list of CN-RCI
    with open (filename, 'r') as fin:
        header = None
        for line in fin:
            try:
                fields = line.strip().split(',')
                if header is None:
                    header = fields
                    continue
                gid = fields[0]
                cell_type = fields[1]
                rci_type = fields[2]
                rci_value = fields[3]
                gene_type = fields[6]
                if exclude is None or cell_type==exclude:
                    if gene_type=='nc' and\
                        rci_type=='CNRCI' and\
                        rci_value!='NA':
                        rci_value=float(rci_value)
                        if gid not in cnrci_lists.keys():
                            cnrci_lists[gid] = []
                        cnrci_lists[gid].append(rci_value)
            except Exception as e:
                print(line)
                traceback.print_exc()
                raise(e)
    cnrci_means = dict()
    # Compute the log2 of the mean ratio.
    # Avoid using the mean of the log2 ratios.
    for gene,values in cnrci_lists.items():
        if len(values)<=0:
            print('Should not have loaded a gene with no CNRCI values.')
            raise Exception
        antilogs = [2**x for x in values]
        big_mean = np.mean(antilogs)
        if np.isclose(big_mean,0):
            log_mean = -1000000 # neg infinity
        else:
            log_mean = np.log2(big_mean) 
        cnrci_means[gene] = log_mean
    return cnrci_means

In [6]:
mean_rcis = load_mean_rci(ATLAS,exclude='H1.hESC')
print( len(mean_rcis.keys()), 'genes with a mean CN-RCI value')

4923 genes with a mean CN-RCI value


In [7]:
def load_transcripts(seq_file,good_gids,good_tids):
    sequences = list()  # list of tuple
    seq = ''
    with open (seq_file,'r') as fin:
        loading_sequence = False
        for line in fin:
            line = line.strip()
            if line.startswith('>'):
                # Save the previous sequence
                if loading_sequence:
                    tup = (tid,gid,seq)
                    sequences.append(tup)
                    loading_sequence = False
                # Parse the defline like
                # >ENST00000456328.2|ENSG00000290825.1|-|OTTHUMT00000362751.1|DDX11L2-202|DDX11L2|1657|
                fields = line[1:].split('|')
                tid = fields[0].split('.')[0]
                gid = fields[1].split('.')[0]
                seq = ''
                if tid in good_tids and gid in good_gids:
                    loading_sequence = True
            elif loading_sequence:
                # Continuation of multi-line sequence
                seq += line
    if loading_sequence:
        # Save the last sequence
        tup = (tid,gid,seq)
        sequences.append(tup)    
    return sequences        

In [8]:
gene_ids = set(mean_rcis.keys())
print('Loading RNA sequence for canonical transcripts of genes with CN-RCI values...')
sequences = load_transcripts(FASTA_FILE,gene_ids,canon_ids)
print( len(sequences), 'sequences loaded')

Loading RNA sequence for canonical transcripts of genes with CN-RCI values...
4536 sequences loaded


In [9]:
DATAPATH = './'
cyt_file  = DATAPATH+'mean_RCI_positive.canonical.tsv'   
nuc_file  = DATAPATH+'mean_RCI_negative.canonical.tsv'

In [10]:
THRESHOLD = 0
def print_fasta_files(nuc_fn, cyt_fn, seqtups, mean_rcis):
    header = 'ensembl_transcript_id\tname\tcdna\n'
    nuc_handle = open(nuc_fn, 'w')
    nuc_handle.write(header)
    cyt_handle = open(cyt_fn, 'w') 
    cyt_handle.write(header)
    for tup in seqtups:
        tid,gid,seq = tup
        cnrci = mean_rcis[gid]
        string = tid+'\t'+gid+'\t'+seq+'\n'
        if cnrci > THRESHOLD:
            cyt_handle.write(string)
        else:
            nuc_handle.write(string)
    nuc_handle.close()
    cyt_handle.close()

In [11]:
print_fasta_files(nuc_file, cyt_file, sequences, mean_rcis)
print('Done')

Done


# 