# GenCode

GenCode release 42
[LINK](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/)

This notebook runs before Train/Test Split.

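This notebook filters our GenCode sequence files, keeping only transcripts that pass both tests:
1. the gene has at least one non-NA RCI value in LncAtlas
1. the transcript's GenCode gene_type and transcript_type agree (both lncRNA or both protein_coding)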

This notebook does not generate the train/test split.
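
As a rough illustration of how the two filters combine, here is a minimal sketch on toy data. The record IDs and the helper name `keep` are hypothetical and not part of this notebook; a filter set to `None` is treated as "no filter", matching the convention used by the parser class below.

```python
# Hypothetical toy records: (transcript_id, gene_id) pairs, not real GenCode IDs.
records = [
    ('ENST_A.1', 'ENSG_1'),
    ('ENST_B.1', 'ENSG_1'),
    ('ENST_C.1', 'ENSG_2'),
]
genes_with_rci = {'ENSG_1'}                # stands in for the LncAtlas gene set
well_annotated = {'ENST_A.1', 'ENST_C.1'}  # stands in for the annotation-based set

def keep(tid, gid, allow_genes=None, allow_transcripts=None):
    """Return True if the record passes every filter that is set (None = no filter)."""
    gene_ok = allow_genes is None or gid in allow_genes
    tran_ok = allow_transcripts is None or tid in allow_transcripts
    return gene_ok and tran_ok

kept = [tid for tid, gid in records if keep(tid, gid, genes_with_rci, well_annotated)]
print(kept)
```

Only a transcript that is in both allow sets survives, so the output can be smaller than either filter alone would give.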

In [1]:
from datetime import datetime
print(datetime.now())

2022-10-27 18:35:03.022160


In [2]:
DATA_DIR = '/Users/jasonmiller/WVU/Localization/GenCode/'
# GenCode inputs
ANNOTATION = 'gencode.v42.annotation.gff3'
NONCODING_SEQUENCE = 'gencode.v42.lncRNA_transcripts.fa'   
CODING_SEQUENCE = 'gencode.v42.pc_transcripts.fa'    
# GenCode outputs
CODING_CSV = 'gencode.v42.pc_transcripts.csv'        
NONCODING_CSV = 'gencode.v42.lncRNA_transcripts.csv'   
# Atlas inputs
ATLAS_DIR = '/Users/jasonmiller/WVU/Localization/LncAtlas/'
ATLAS_FILE = 'lncATLAS_all_data_RCI.csv'

In [3]:
class fasta_reader():
    '''
    Parser for human transcripts FASTA file from GenCode.
    '''
    def __init__(self,infile,outfile,biotype):
        '''
        Biotype should reflect the filename: either 'pc' or 'lncRNA'.
        '''
        self.infile = infile
        self.outfile = outfile
        self.biotype = biotype
        self.FASTA_DEFCHAR = '>'  # signals a defline = definition line
        self.count_in = 0
        self.count_out = 0
        self.allow_genes = None
        self.allow_transcripts = None
        self.headers='transcript_id,gene_id,biotype,length,sequence\n'
    def allow_these_genes(self,genes:set):
        self.allow_genes = genes
    def allow_these_transcripts(self,trans:set):
        self.allow_transcripts = trans
    def print_one_sequence(self,handle,tran,gene,seq):
        allow_genes = self.allow_genes
        allow_trans = self.allow_transcripts
        if seq is not None:
            # sequence is None when we encounter the first defline
            if allow_genes is None or gene in allow_genes:
                if allow_trans is None or tran in allow_trans:
                    biotype = self.biotype
                    length = str(len(seq))
                    outstr = ','.join((tran,gene,biotype,length,seq))
                    handle.write(outstr+'\n')
                    self.count_out += 1
    def fasta_to_csv(self):
        with open(self.outfile,'w') as handle:
            handle.write(self.headers)
            with open(self.infile,'r') as fasta:
                transcript_id = None
                gene_id = None
                next_seq = None
                for line in fasta:
                    if line[0]==self.FASTA_DEFCHAR:
                        self.count_in += 1
                        # The defline starts with '>'
                        # The defline has fields separated by vertical bar
                        # Wrap up the previous sequence before moving on to the next.
                        self.print_one_sequence(handle,transcript_id,gene_id,next_seq)
                        tokens = line.split('|')
                        transcript_id = tokens[0][1:] # chop off '>'
                        gene_id = tokens[1]   
                        version_index=gene_id.find('.')   
                        if version_index>=0:
                            # chop off version number, as in ENSG00000198888.2
                            gene_id = gene_id[:version_index]  
                        next_seq = ""   # get ready for one to many sequence lines
                    else:
                        # In FASTA format, one sequence may continue to next line
                        next_seq = next_seq + line.strip()  
            self.print_one_sequence(handle,transcript_id,gene_id,next_seq)
        print(" Input sequences: %d"%self.count_in)
        print("Output sequences: %d"%self.count_out)

In [4]:
def load_atlas_genes(filepath):
    genes = set()
    with open (filepath,'r') as handle:
        header = None
        for row in handle:
            if header is None:
                header = row
            else:
                fields = row.split(',')
                gene_id = fields[0]
                value = fields[3]
                if (value != 'NA'):
                    genes.add(gene_id)  # set removes dupes
    return genes
print(datetime.now())
atlas_genes = load_atlas_genes(ATLAS_DIR+ATLAS_FILE)
print('Atlas good genes:', len(atlas_genes))

2022-10-27 18:35:03.106824
Atlas good genes: 25172


We keep transcripts with the following combinations of GenCode annotation:
1. gene_type=transcript_type=lncRNA
1. gene_type=transcript_type=protein_coding

The goal is to avoid these GenCode transcript types:
1. transcript_type=retained_intron
1. transcript_type=protein_coding_CDS_not_defined   
1. transcript_type=protein_coding_LoF (note: a partial string match on protein_coding would wrongly accept this type)
1. transcript_type=nonsense_mediated_decay
1. transcript_type=non_stop_decay
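
The partial-match pitfall noted above can be shown with a small sketch. The function names `keep_exact` and `keep_prefix` are hypothetical, introduced only for illustration; the exact-match version mirrors the rule that gene_type and transcript_type must be equal and be one of the two kept types.

```python
KEPT_TYPES = ('protein_coding', 'lncRNA')

def keep_exact(gene_type, transcript_type):
    """Keep only when both annotations are identical and name a kept type."""
    return gene_type == transcript_type and transcript_type in KEPT_TYPES

def keep_prefix(gene_type, transcript_type):
    """Buggy variant: a prefix test also accepts protein_coding_LoF and similar."""
    return (gene_type.startswith('protein_coding')
            and transcript_type.startswith('protein_coding'))

# Compare the two rules on a protein_coding gene with various transcript types.
for ttype in ('protein_coding', 'protein_coding_LoF',
              'protein_coding_CDS_not_defined', 'retained_intron'):
    print(ttype, keep_exact('protein_coding', ttype), keep_prefix('protein_coding', ttype))
```

The prefix test lets `protein_coding_LoF` and `protein_coding_CDS_not_defined` through, while the equality test rejects them.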

In [5]:
def load_annotated_transcripts(filepath):
    pc_tids = set()
    nc_tids = set()
    with open (filepath,'r') as handle:
        for row in handle:
            # Strip the newline so the last attribute never carries a trailing '\n'
            columns = row.rstrip('\n').split('\t')
            # Avoid comment lines
            if len(columns)>=9 and columns[2] == 'transcript':
                comments = columns[8]
                pairs = comments.split(';')
                tid = None
                gtype = None
                ttype = None
                for pair in pairs:
                    if pair.startswith('ID=ENST'):
                        tid = pair[3:]  
                    elif pair.startswith('gene_type='):
                        gtype = pair[10:]
                    elif pair.startswith('transcript_type='):
                        ttype = pair[16:]
                if ttype is not None:
                    if tid is None:
                        raise Exception('transcript type without ID')
                    if ttype==gtype: 
                        if ttype=='protein_coding':
                            pc_tids.add(tid)
                        elif ttype=='lncRNA':
                            nc_tids.add(tid)
    return pc_tids, nc_tids
print(datetime.now())
gencode_pc_transcripts,gencode_nc_transcripts = load_annotated_transcripts(DATA_DIR+ANNOTATION)
print('Gencode good pc transcripts:', len(gencode_pc_transcripts))
print('Gencode good nc transcripts:', len(gencode_nc_transcripts))

2022-10-27 18:35:03.625319
Gencode good pc transcripts: 89305
Gencode good nc transcripts: 56049


## mRNA
One gene can have several RNA transcripts, called isoforms.

Typical defline of the GenCode pc file:    

    >ENST00000641515.2|ENSG00000186092.7|OTTHUMG00000001094.4|OTTHUMT00000003223.4|OR4F5-201|OR4F5|2618|UTR5:1-60|CDS:61-1041|UTR3:1042-2618|
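
To make the defline layout concrete, here is a minimal sketch that splits the example above on the vertical bar. The helper name `parse_defline` is hypothetical; it keeps the version suffix on the transcript ID and drops it from the gene ID, which is the same convention the parser class in this notebook follows. Reading the seventh field as the transcript length is an assumption based on this one example.

```python
def parse_defline(defline):
    """Split a GenCode FASTA defline into transcript ID, gene ID, and length."""
    tokens = defline.rstrip('\n').split('|')
    transcript_id = tokens[0][1:]       # chop off the leading '>'
    gene_id = tokens[1].split('.')[0]   # drop the version, e.g. '.7'
    length = int(tokens[6])             # assumed: 7th field is the transcript length
    return transcript_id, gene_id, length

defline = ('>ENST00000641515.2|ENSG00000186092.7|OTTHUMG00000001094.4|'
           'OTTHUMT00000003223.4|OR4F5-201|OR4F5|2618|'
           'UTR5:1-60|CDS:61-1041|UTR3:1042-2618|')
print(parse_defline(defline))
```

Keeping the version on the transcript ID matters here because the transcript allow-list is built from versioned IDs in the annotation file.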

In [6]:
print(datetime.now())
infile = DATA_DIR + CODING_SEQUENCE
outfile = DATA_DIR + CODING_CSV
converter = fasta_reader(infile,outfile,'pc')
converter.allow_these_genes(atlas_genes)
converter.allow_these_transcripts(gencode_pc_transcripts)
converter.fasta_to_csv()
# Without gencode_pc_transcripts filter, Input sequences: 111053 Output sequences: 106395

2022-10-27 18:35:10.531331
 Input sequences: 111053
Output sequences: 85641


## lncRNA

Non-coding genes can also have different transcripts, called isoforms.

Typical defline of the GenCode lncRNA file:   

    >ENST00000456328.2|ENSG00000290825.1|-|OTTHUMT00000362751.1|DDX11L2-202|DDX11L2|1657|


In [7]:
print(datetime.now())
infile = DATA_DIR + NONCODING_SEQUENCE
outfile = DATA_DIR + NONCODING_CSV
converter = fasta_reader(infile,outfile,'lncRNA')
converter.allow_these_genes(atlas_genes)
converter.allow_these_transcripts(gencode_nc_transcripts)
converter.fasta_to_csv()
# Without gencode_nc_transcripts filter, Input sequences: 57936 Output sequences: 30139

2022-10-27 18:35:14.566913
 Input sequences: 57936
Output sequences: 29117


In [8]:
print(datetime.now())

2022-10-27 18:35:15.788251
