# TACOS
Use the TACOS [web server](https://balalab-skku.org/TACOS/).   
Use cell line: A549.
This notebook is part 1 of 2. See notebook 202.   
Assume the TACOS_input directory exists.   
The notebook generates 100 random genes with CNRCI between -1 and +1, for each of the 10 TACOS cell lines.    
The result is 10 FASTA files.    

In [1]:
import traceback  # used by load_rci_truth to report parse errors
import numpy as np
np.random.seed(seed=1234)
FASTA_SIZE = 100
PREFIX='TACOS_input/' # must end in /
TACOS_CELL_LINES=['A549','GM12878','HELA','HEPG','HESC','HT1080','HUVEC','NHEK','SKMEL','SKNS']
ATLAS_CELL_LINES=['A549','GM12878','HeLa.S3','HepG2','H1.hESC','HT1080','HUVEC','NHEK','SK.MEL.5','SK.N.SH']

In [2]:
# Assume these files or links are in the current directory.
# This is from GenCode downloads
GENCODE = 'gencode.v45.lncRNA_transcripts.fa'
# This is from GenCode downloads
GFF_FILE='gencode.v45.long_noncoding_RNAs.gff3'
# This is from lncATLAS downloads
LNCATLAS = 'lncATLAS_all_data_RCI.csv'

Load canonical transcript IDs from the GenCode annotation file

In [3]:
def get_canonical_ids(gff_file):
    ids = set()
    with open(gff_file,'r') as fin:
        for line in fin:
            if line.startswith('#'):
                continue
            fields = line.strip().split('\t')
            if fields[2] != 'transcript':
                continue
            if 'Ensembl_canonical' not in fields[8]:
                continue
            tag_value_pairs = fields[8].split(';')
            for pair in tag_value_pairs:
                if pair.startswith('ID='):
                    ID_plus_ver = pair[3:]
                    just_ID = ID_plus_ver.split('.')[0]
                    ids.add(just_ID)
                    break
    return ids
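
To illustrate the attribute parsing above, here is a minimal sketch that splits one GFF3 attributes column into a dictionary. The helper name `parse_gff3_attributes` and the example attribute string are hypothetical and shortened for illustration; real GenCode lines carry more tags.

```python
def parse_gff3_attributes(column9):
    """Split a GFF3 attributes column (tag=value;tag=value) into a dict."""
    attributes = {}
    for pair in column9.strip().split(';'):
        if '=' in pair:
            tag, value = pair.split('=', 1)
            attributes[tag] = value
    return attributes

# Hypothetical, shortened attributes column (an assumption, not copied from the file).
example = ('ID=ENST00000456328.2;Parent=ENSG00000290825.1;'
           'gene_type=lncRNA;tag=basic,Ensembl_canonical')
attrs = parse_gff3_attributes(example)
transcript_id = attrs['ID'].split('.')[0]   # drop the version suffix
is_canonical = 'Ensembl_canonical' in attrs.get('tag', '').split(',')
print(transcript_id, is_canonical)
```

Checking the `tag` values explicitly is slightly stricter than the substring test used in `get_canonical_ids`, at the cost of a little more code.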

From lncATLAS file, load all gene IDs with mid-range CNRCI for this cell line

In [4]:
def load_rci_truth(filename,cell_line):
    gene_to_rci = dict()
    with open (filename, 'r') as fin:
        header = None
        for line in fin:
            try:
                fields = line.strip().split(',')
                if header is None:
                    header = fields
                    continue
                gid = fields[0]
                cell_type = fields[1]
                rci_type = fields[2]
                rci_value = fields[3]
                gene_type = fields[6]
                # check for numeric CNRCI in desired range, non-coding gene, desired cell line
                if gene_type=='nc' and\
                    cell_type==cell_line and\
                    rci_type=='CNRCI' and\
                    rci_value!='NA':
                    rci_value=float(rci_value)
                    if gid in gene_to_rci.keys():
                        raise Exception('Unexpected second value for gene',gid)
                    if rci_value >= -1 and rci_value <= 1:
                        gene_to_rci[gid]=rci_value
            except Exception as e:
                print(line)
                traceback.print_exc()
                raise(e)
    return gene_to_rci
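
The filter above can be exercised on a few made-up rows. This sketch assumes the same column positions as `load_rci_truth` (0 = gene ID, 1 = cell line, 2 = RCI type, 3 = value, 6 = gene type). The header names, gene IDs, and filler columns are invented for illustration, and `filter_cnrci` is a hypothetical helper name.

```python
import csv
import io

def filter_cnrci(rows, cell_line, low=-1.0, high=1.0):
    """Keep non-coding genes with a numeric CNRCI in [low, high] for one cell line."""
    gene_to_rci = {}
    for fields in rows:
        gid, cell_type, rci_type, rci_value = fields[0], fields[1], fields[2], fields[3]
        gene_type = fields[6]
        if gene_type != 'nc' or cell_type != cell_line:
            continue
        if rci_type != 'CNRCI' or rci_value == 'NA':
            continue
        value = float(rci_value)
        if low <= value <= high:
            gene_to_rci[gid] = value
    return gene_to_rci

# Made-up rows; header names and the filler columns are assumptions.
sample_csv = io.StringIO(
    'gene_id,cell,type,value,col4,col5,coding\n'
    'ENSG_A,A549,CNRCI,0.5,x,x,nc\n'       # kept
    'ENSG_B,A549,CNRCI,-2.3,x,x,nc\n'      # out of range
    'ENSG_C,A549,CNRCI,NA,x,x,nc\n'        # not numeric
    'ENSG_D,A549,CNRCI,0.1,x,x,coding\n'   # coding gene
    'ENSG_E,HUVEC,CNRCI,0.2,x,x,nc\n'      # other cell line
)
reader = csv.reader(sample_csv)
next(reader)  # skip the header row
selected = filter_cnrci(reader, 'A549')
print(selected)  # {'ENSG_A': 0.5}
```

Using `csv.reader` instead of `split(',')` would also handle quoted fields, should the real file contain any.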

From GenCode, get canonical transcript sequences for the selected genes

In [5]:
def load_transcripts(seq_file,good_gids,good_tids):
    sequences = list()  # list of tuple
    seq = ''
    with open (seq_file,'r') as fin:
        loading_sequence = False
        for line in fin:
            line = line.strip()
            if line.startswith('>'):
                # Save the previous sequence
                if loading_sequence:
                    tup = (tid,gid,seq)
                    sequences.append(tup)
                    loading_sequence = False
                # Parse the defline like
                # >ENST00000456328.2|ENSG00000290825.1|-|OTTHUMT00000362751.1|DDX11L2-202|DDX11L2|1657|
                fields = line[1:].split('|')
                tid = fields[0].split('.')[0]
                gid = fields[1].split('.')[0]
                seq = ''
                if tid in good_tids and gid in good_gids:
                    loading_sequence = True
            elif loading_sequence:
                # Continuation of multi-line sequence
                seq += line
    if loading_sequence:
        # Save the last sequence
        tup = (tid,gid,seq)
        sequences.append(tup)    
    return sequences        
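
As a small check of the defline parsing used above, this sketch applies the same split to the example defline quoted in the code comment. The helper name `parse_defline` is hypothetical.

```python
def parse_defline(defline):
    """Return (transcript ID, gene ID) without version suffixes."""
    fields = defline.lstrip('>').split('|')
    tid = fields[0].split('.')[0]
    gid = fields[1].split('.')[0]
    return tid, gid

defline = '>ENST00000456328.2|ENSG00000290825.1|-|OTTHUMT00000362751.1|DDX11L2-202|DDX11L2|1657|'
print(parse_defline(defline))  # ('ENST00000456328', 'ENSG00000290825')
```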

Write the selected sequences in FASTA format for TACOS

In [6]:
def write_fasta(output_filename, tuples, lookup):
    num_seqs = 0
    with open (output_filename,'w') as fout:
        for (tid,gid,seq) in tuples:
            rci = lookup[gid]
            defline = '>' + tid + ' ' + gid + ' ' + str(rci)
            print(defline, file=fout)
            print(seq, file=fout)
            num_seqs += 1
    return num_seqs
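
For reference, this sketch shows the record layout that `write_fasta` produces: a defline with transcript ID, gene ID, and CNRCI separated by spaces, followed by the sequence on one line. It writes to an in-memory buffer rather than a file. The helper name, the RCI value, and the sequence are made up for illustration.

```python
import io

def write_record(fout, tid, gid, rci, seq):
    """Write one FASTA record in the same layout as write_fasta."""
    print('>' + tid + ' ' + gid + ' ' + str(rci), file=fout)
    print(seq, file=fout)

buffer = io.StringIO()
# The RCI value and sequence here are invented for illustration.
write_record(buffer, 'ENST00000456328', 'ENSG00000290825', 0.25, 'ACGT')
text = buffer.getvalue()
print(text, end='')
```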

## Loop over TACOS cell lines, writing files

In [7]:
canon_ids = get_canonical_ids(GFF_FILE)
print( len(canon_ids), 'canonical lncRNA IDs\n')
for cell_line, real_name in zip(TACOS_CELL_LINES, ATLAS_CELL_LINES):
    print('Processing cell line:', real_name, cell_line)
    # From lncATLAS file, load all gene IDs with mid-range CNRCI for this cell line    
    rci_dict = load_rci_truth(LNCATLAS,real_name)
    print(len(rci_dict.keys()), 'gene-to-rci values for this cell line')
    # From GenCode, get canonical transcript sequences for the selected genes
    gene_ids = set(rci_dict.keys())  # a set gives fast membership tests in load_transcripts
    canonical_sequences = load_transcripts(GENCODE, gene_ids, canon_ids)
    print(len(canonical_sequences),'canonical sequences loaded')
    # Sample down from the available sequences
    # Sampling the list directly fails because numpy converts a list of tuples to a 2D array:
    ## sample_sequences = np.random.choice(canonical_sequences,FASTA_SIZE)
    # Sample indices instead. np.random.choice draws with replacement by default,
    # so the same sequence can be selected more than once.
    sample_indices = np.random.choice( len(canonical_sequences), FASTA_SIZE)
    # Indexing fails too because a Python list does not accept an array of indices:
    ## sample_sequences = canonical_sequences[sample_indices]
    sample_sequences = [canonical_sequences[j] for j in sample_indices]
    print(len(sample_sequences), 'sequences selected')
    # Print sequences in FASTA format for TACOS
    fn=PREFIX+cell_line+'.middle.fasta'
    num = write_fasta(fn,sample_sequences,rci_dict)
    print(num, 'sequences written to', fn)
    print()

20424 canonical lncRNA IDs

Processing cell line: A549 A549
997 gene-to-rci values for thiscell line
914 canonical sequences loaded
100 sequences selected
100 sequences written to TACOS_input/A549.middle.fasta

Processing cell line: GM12878 GM12878
1042 gene-to-rci values for thiscell line
962 canonical sequences loaded
100 sequences selected
100 sequences written to TACOS_input/GM12878.middle.fasta

Processing cell line: HeLa.S3 HELA
385 gene-to-rci values for thiscell line
346 canonical sequences loaded
100 sequences selected
100 sequences written to TACOS_input/HELA.middle.fasta

Processing cell line: HepG2 HEPG
702 gene-to-rci values for thiscell line
648 canonical sequences loaded
100 sequences selected
100 sequences written to TACOS_input/HEPG.middle.fasta

Processing cell line: H1.hESC HESC
2476 gene-to-rci values for thiscell line
2298 canonical sequences loaded
100 sequences selected
100 sequences written to TACOS_input/HESC.middle.fasta

Processing cell line: HT1080 HT1080
58
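
The sampling step in the loop calls `np.random.choice` with its default settings, which draw with replacement, so a FASTA file may contain the same transcript more than once. If unique sequences are wanted, one option is sketched below. It swaps in the standard library's `random.sample`, which draws without replacement and works directly on a list of tuples. The helper name and the toy records are hypothetical, and capping the size at the list length is an assumption about the desired behavior for small cell lines.

```python
import random

def sample_without_replacement(items, size, seed=1234):
    """Return up to `size` distinct items, chosen reproducibly."""
    rng = random.Random(seed)
    size = min(size, len(items))  # assumption: shrink the sample if too few items
    return rng.sample(items, size)

# Toy (tid, gid, seq) tuples invented for illustration.
records = [('T%d' % n, 'G%d' % n, 'ACGT') for n in range(10)]
sample = sample_without_replacement(records, 5)
print(len(sample), len(set(sample)))  # 5 5
```

Note that switching samplers changes which sequences are selected for a given seed, so files generated this way would differ from those of the run recorded above.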

Next, the user must submit the 10 FASTA files to the TACOS server.    
Should the submissions be automated? TACOS doesn't have an API, and automated submissions might tax the server.   
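
Since the uploads are manual, it may help to confirm each file's record count first. This is a minimal sketch that counts deflines in a FASTA file. The helper name `count_fasta_records` is hypothetical, and the demo writes a throwaway file to a temporary directory instead of touching `TACOS_input/`.

```python
import os
import tempfile

def count_fasta_records(path):
    """Count records by counting lines that start with '>'."""
    with open(path, 'r') as fin:
        return sum(1 for line in fin if line.startswith('>'))

# Demo on a throwaway file with two made-up records.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'DEMO.middle.fasta')
    with open(demo_path, 'w') as fout:
        fout.write('>T1 G1 0.1\nACGT\n>T2 G2 -0.4\nGGCC\n')
    demo_count = count_fasta_records(demo_path)
print(demo_count)  # 2
```

In the notebook itself, the same function could be called on `PREFIX + cell_line + '.middle.fasta'` for each entry in `TACOS_CELL_LINES` and compared against `FASTA_SIZE`.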