# GenCode 43
GenCode 43 was released 08.02.2023 (Feb 8).  
My download started Feb 27.   

Get FASTA and GFF3 from here: https://www.gencodegenes.org/human/  
Scroll down to Fasta files   
Click on Protein-coding transcript sequences = gencode.v43.pc_transcripts.fa   
Click on Long non-coding RNA transcript sequences = gencode.v43.lncRNA_transcripts.fa   
Scroll down to GFF3 files    
Click on Basic gene annotation on chromosomes (the main file for most users) = gencode.v43.basic.annotation.gff3

GenCode [biotypes](https://www.gencodegenes.org/pages/biotypes.html)
describes biotype values, for example 'lncRNA'.

GenCode [tags](https://www.gencodegenes.org/pages/tags.html)
describes tag values, for example 'appris_principal_1', which marks the transcript expected to code for the main functional isoform...
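
As a minimal sketch of where these tags show up: in the GFF3 file, column 9 holds name=value pairs separated by semicolons, and the `tag` value is a comma-separated list. The attribute string below is made up for illustration, and `parse_tags` is a hypothetical helper, not part of GenCode.

```python
def parse_tags(attributes):
    """Return the list of tags from a GFF3 column-9 attribute string."""
    for pair in attributes.strip().split(';'):
        name, _, value = pair.partition('=')
        if name == 'tag':
            return value.split(',')
    return []  # no tag attribute on this line

# Made-up attribute string in the style of a GenCode transcript line.
example = 'ID=ENST00000000001.1;gene_type=lncRNA;tag=basic,Ensembl_canonical'
tags = parse_tags(example)
print(tags)
print('Ensembl_canonical' in tags)
```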

We discovered that some transcripts are missing from the lncRNA and pc transcript FASTA files but are present in the all-transcripts file (gencode.v43.transcripts.fa). We switched to using that file.
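
One way to spot such gaps is to compare the transcript IDs on the deflines of two FASTA files. This is only a sketch: the two in-memory "files" are made-up miniatures, and `defline_ids` is a hypothetical helper.

```python
import io

def defline_ids(handle):
    """Collect the first '|'-separated token of every FASTA defline."""
    ids = set()
    for line in handle:
        if line.startswith('>'):
            ids.add(line[1:].strip().split('|')[0])
    return ids

# Made-up miniature stand-ins for the real FASTA files.
all_fasta = io.StringIO('>T1.1|G1.1|\nACGT\n>T2.1|G1.1|\nGGCC\n>T3.1|G2.1|\nTTAA\n')
subset_fasta = io.StringIO('>T1.1|G1.1|\nACGT\n>T3.1|G2.1|\nTTAA\n')

# IDs present in the all-transcripts file but absent from the subset file.
missing = defline_ids(all_fasta) - defline_ids(subset_fasta)
print(sorted(missing))  # ['T2.1']
```

With real data, the two `io.StringIO` objects would be replaced by open file handles.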

We also discovered that our filters (keeping only certain gene_type values, and requiring transcript_type to match gene_type) were removing genes listed in lncAtlas. We now load everything from GenCode and filter by what is in lncAtlas.

## Decide to use Basic Annotation
Using v42 in Sep 2022, 
we previously used the full "annotation" file, which has 252416 transcripts. 
Our filters reduced this to 114760 (29K non-coding and 86K coding).
The "basic.annotation" file had 117681 transcripts.
I expect that "basic" removes roughly the same transcripts that we filtered. 
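
The counts quoted above can be compared directly, as in this small sketch (the variable names are made up here). Similar totals do not show that the two sets hold the same transcripts; that would need a comparison of the IDs themselves.

```python
full_count = 252416      # transcripts in the full annotation file
filtered_count = 114760  # transcripts left after our filters
basic_count = 117681     # transcripts in the basic annotation file

# How many more transcripts "basic" keeps than our filters did.
difference = basic_count - filtered_count
print(difference)  # 2921

# Each subset as a percentage of the full annotation.
print(round(100 * filtered_count / full_count, 1))
print(round(100 * basic_count / full_count, 1))
```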

It is unclear exactly what "basic" means for lncRNA. According to the FAQ:

> The transcripts tagged as "basic" form part of a subset of representative transcripts for each gene. This subset prioritises full-length protein coding transcripts over partial or non-protein coding transcripts within the same gene, and intends to highlight those transcripts that will be useful to the majority of users.

In [1]:
from datetime import datetime
print(datetime.now())

2023-02-27 20:21:53.354527


In [2]:
# GenCode inputs
DATA_DIR = '/Users/jasonmiller/WVU/Localization/GenCode/GenCode43/'
ANNOTATION = 'gencode.v43.basic.annotation.gff3'
SEQUENCE = 'gencode.v43.transcripts.fa'   
# GenCode outputs
CODING_CSV_ALL = 'gencode_v43.all_pc_transcripts.csv'        
NONCODING_CSV_ALL = 'gencode_v43.all_lncRNA_transcripts.csv'   
CODING_CSV_CANON = 'gencode_v43.canon_pc_transcripts.csv'        
NONCODING_CSV_CANON = 'gencode_v43.canon_lncRNA_transcripts.csv'   
# Atlas inputs
ATLAS_DIR = '/Users/jasonmiller/WVU/Localization/LncAtlas/'
ATLAS_FILE = 'lncATLAS_all_data_RCI.csv'

## lncAtlas Gene IDs
Load the list of distinct genes in lncAtlas.
Only examine lines where field 2 is CNRCI and field 3 is not NA (fields are numbered from zero, as in the code).
Load the ENSG IDs.
Remove duplicate IDs.

In [3]:
def load_atlas_genes(filepath):
    '''Expect a CSV file.'''
    pc_genes = set()
    nc_genes = set()
    with open (filepath,'r') as handle:
        header = None
        for row in handle:
            if header is None:
                header = row
            else:
                fields = row.split(',')
                gene_id = fields[0]  # like ENSG00000000003
                rci_type = fields[2]   
                value = fields[3]
                gene_type = fields[5] # field 5 same as field 6 in all cases
                if (value != 'NA' and rci_type == 'CNRCI'):
                    if gene_type == 'coding':
                        pc_genes.add(gene_id)  # set removes dupes
                    elif gene_type == 'nc':
                        nc_genes.add(gene_id)
    return pc_genes, nc_genes

In [4]:
print(datetime.now())
atlas_pc_genes,atlas_nc_genes = load_atlas_genes(ATLAS_DIR+ATLAS_FILE)
print('Atlas pc genes:', len(atlas_pc_genes))
print('Atlas nc genes:', len(atlas_nc_genes))

2023-02-27 20:21:58.179577
Atlas pc genes: 17770
Atlas nc genes: 6768
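
The filtering rule can be checked on a few rows before trusting the counts. This is a sketch only: the header names, gene IDs, cell names, and the `RCIin` type below are made up; only the column positions follow `load_atlas_genes`.

```python
import csv
import io

# Made-up rows; only the column positions mirror load_atlas_genes.
sample = io.StringIO(
    'gene_id,source,rci_type,value,name,gene_type,biotype\n'
    'ENSG_A,cellX,CNRCI,1.5,a,nc,nc\n'
    'ENSG_A,cellY,CNRCI,0.2,a,nc,nc\n'          # duplicate gene, kept once
    'ENSG_B,cellX,CNRCI,NA,b,coding,coding\n'   # dropped: value is NA
    'ENSG_C,cellX,RCIin,2.0,c,coding,coding\n'  # dropped: not CNRCI
    'ENSG_D,cellX,CNRCI,-0.7,d,coding,coding\n'
)

pc_genes, nc_genes = set(), set()
reader = csv.reader(sample)
next(reader)  # skip the header line
for fields in reader:
    gene_id, rci_type, value, gene_type = fields[0], fields[2], fields[3], fields[5]
    if rci_type == 'CNRCI' and value != 'NA':
        if gene_type == 'coding':
            pc_genes.add(gene_id)
        elif gene_type == 'nc':
            nc_genes.add(gene_id)

print(sorted(pc_genes), sorted(nc_genes))  # ['ENSG_D'] ['ENSG_A']
```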


## GenCode annotation

In [5]:
def load_annotated_transcripts(filepath,canonical_transcripts=False):
    '''Expect a GFF3 file.'''
    tids = set()
    with open (filepath,'r') as handle:
        for row in handle:
            columns = row.split('\t')
            # Avoid comment lines and read only transcript lines.
            if len(columns)>=9 and columns[2] == 'transcript':
                # The data we need is in column 9, the GFF3 attributes column,
                # listed as name=value pairs separated by semicolons.
                comments = columns[8] 
                pairs = comments.split(';') 
                tid = None
                canonical = False
                for pair in pairs:
                    if pair.startswith('ID=ENST'):
                        tid = pair[3:] 
                    if pair.startswith('tag='):
                        if canonical_transcripts and 'Ensembl_canonical' in pair:
                            canonical = True
                if tid is None:
                    raise Exception('transcript type without ID')
                if not canonical_transcripts or canonical is True:
                    tids.add(tid)
    return tids

In [6]:
print(datetime.now())
gencode_transcripts = load_annotated_transcripts(DATA_DIR+ANNOTATION,False)
print('Gencode all transcripts:', len(gencode_transcripts))
canonical_transcripts = load_annotated_transcripts(DATA_DIR+ANNOTATION,True)
print('Gencode canonical transcripts:', len(canonical_transcripts))

2023-02-27 20:22:06.197686
Gencode all transcripts: 117725
Gencode canonical transcripts: 61009


## Make sequence files
Parse FASTA, extract sequences of interest, write CSV.
The rest of our pipeline uses CSV because it is easier to process than FASTA.
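
The key step is reading the two IDs off each defline. A minimal sketch, assuming fields separated by vertical bars with the transcript ID first and the gene ID second; the defline shown is made up, and `parse_defline` is a hypothetical helper.

```python
def parse_defline(defline):
    """Return (transcript_id, gene_id without version) from a defline."""
    tokens = defline.strip().lstrip('>').split('|')
    transcript_id = tokens[0]          # version kept, to match the GFF3 IDs
    gene_id = tokens[1].split('.')[0]  # version dropped, to match lncAtlas IDs
    return transcript_id, gene_id

# Made-up defline in the style of the GenCode transcripts FASTA.
example = '>ENST00000000001.1|ENSG00000000001.2|-|-|GENE-201|GENE|1234|lncRNA|'
print(parse_defline(example))  # ('ENST00000000001.1', 'ENSG00000000001')
```

The transcript version is kept because the transcript IDs loaded from the GFF3 include it, while the gene version is removed because the lncAtlas gene IDs carry none.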

In [10]:
class fasta_reader():
    '''Expect human transcripts FASTA file from GenCode.'''
    def __init__(self,infile,outfile,biotype):
        '''Biotype should be either 'pc' or 'lncRNA'.'''
        self.infile = infile
        self.outfile = outfile
        self.biotype = biotype
        self.FASTA_DEFCHAR = '>'  # signals a defline = definition line
        self.allow_genes = None
        self.allow_transcripts = None
        self.transcripts_per_gene = None
        self.headers='transcript_id,gene_id,biotype,length,sequence\n'
    def get_transcripts_per_gene(self):
        return self.transcripts_per_gene
    def allow_these_genes(self,genes:set):
        self.allow_genes = genes
    def allow_these_transcripts(self,trans:set):
        self.allow_transcripts = trans
    def print_one_sequence(self,handle,tran,gene,seq):
        allow_genes = self.allow_genes
        allow_trans = self.allow_transcripts
        if seq is not None:
            # sequence is None when we encounter the first defline
            if allow_genes is None or gene in allow_genes:
                if allow_trans is None or tran in allow_trans:
                    biotype = self.biotype
                    length = str(len(seq))
                    outstr = ','.join((tran,gene,biotype,length,seq))
                    handle.write(outstr+'\n')
                    self.count_out += 1
                    if gene not in self.transcripts_per_gene:
                        self.transcripts_per_gene[gene]=0
                    self.transcripts_per_gene[gene] += 1  
    def fasta_to_csv(self):
        self.count_in = 0
        self.count_out = 0
        self.transcripts_per_gene = {}
        with open(self.outfile,'w') as handle:
            handle.write(self.headers)
            with open(self.infile,'r') as fasta:
                transcript_id = None
                gene_id = None
                next_seq = None
                for line in fasta:
                    if line[0]==self.FASTA_DEFCHAR:
                        # Wrap up the previous sequence before moving on to the next.
                        self.print_one_sequence(handle,transcript_id,gene_id,next_seq)
                        # The defline starts with '>'
                        # The defline has fields separated by vertical bar
                        self.count_in += 1
                        tokens = line.split('|')
                        transcript_id = tokens[0][1:] # chop off '>'
                        gene_id = tokens[1]   
                        version_index=gene_id.find('.')   
                        if version_index>=0:
                            # chop off version number, like the 2 in ENSG00000198888.2
                            gene_id = gene_id[:version_index]  
                        next_seq = ""   # get ready for one to many sequence lines
                    else:
                        # In FASTA format, one sequence may continue to next line
                        next_seq = next_seq + line.strip()  
            # Reading FASTA, be sure to process the last sequence.
            self.print_one_sequence(handle,transcript_id,gene_id,next_seq)
        return self.count_in,self.count_out

In [11]:
print(datetime.now())
infile = DATA_DIR + SEQUENCE
outfile = DATA_DIR + CODING_CSV_ALL
converter = fasta_reader(infile,outfile,'pc')
converter.allow_these_genes(atlas_pc_genes)
converter.allow_these_transcripts(gencode_transcripts)
count_in,count_out=converter.fasta_to_csv()
print("All mRNA in GenCode and Atlas")
print(" Input sequences: %d" % count_in)
print("Output sequences: %d" % count_out)
print("Genes in Atlas: %d" % len(atlas_pc_genes))
print("Genes represented in output: %d" % len(converter.get_transcripts_per_gene()))

2023-02-27 20:23:19.763615
All mRNA in GenCode and Atlas
 Input sequences: 252913
Output sequences: 60538
Genes in Atlas: 17770
Genes represented in output: 17668


In [13]:
print(datetime.now())
infile = DATA_DIR + SEQUENCE
outfile = DATA_DIR + NONCODING_CSV_ALL
converter = fasta_reader(infile,outfile,'lncRNA')
converter.allow_these_genes(atlas_nc_genes)
converter.allow_these_transcripts(gencode_transcripts)
count_in,count_out=converter.fasta_to_csv()
print("All lncRNA in GenCode and Atlas")
print(" Input sequences: %d" % count_in)
print("Output sequences: %d" % count_out)
print("Genes in Atlas: %d" % len(atlas_nc_genes))
print("Genes represented in output: %d" % len(converter.get_transcripts_per_gene()))

2023-02-27 20:23:32.315434
All lncRNA in GenCode and Atlas
 Input sequences: 252913
Output sequences: 12419
Genes in Atlas: 6768
Genes represented in output: 6423


In [14]:
print(datetime.now())
infile = DATA_DIR + SEQUENCE
outfile = DATA_DIR + CODING_CSV_CANON
converter = fasta_reader(infile,outfile,'pc')
converter.allow_these_genes(atlas_pc_genes)
converter.allow_these_transcripts(canonical_transcripts)
count_in,count_out=converter.fasta_to_csv()
print("Canonical mRNA in GenCode and Atlas")
print(" Input sequences: %d" % count_in)
print("Output sequences: %d" % count_out)
print("Genes in Atlas: %d" % len(atlas_pc_genes))
print("Genes represented in output: %d" % len(converter.get_transcripts_per_gene()))

2023-02-27 20:23:37.229457
Canonical mRNA in GenCode and Atlas
 Input sequences: 252913
Output sequences: 17662
Genes in Atlas: 17770
Genes represented in output: 17645


In [15]:
print(datetime.now())
infile = DATA_DIR + SEQUENCE
outfile = DATA_DIR + NONCODING_CSV_CANON
converter = fasta_reader(infile,outfile,'lncRNA')
converter.allow_these_genes(atlas_nc_genes)
converter.allow_these_transcripts(canonical_transcripts)
count_in,count_out=converter.fasta_to_csv()
print("Canonical lncRNA in GenCode and Atlas")
print(" Input sequences: %d" % count_in)
print("Output sequences: %d" % count_out)
print("Genes in Atlas: %d" % len(atlas_nc_genes))
print("Genes represented in output: %d" % len(converter.get_transcripts_per_gene()))

2023-02-27 20:23:42.458862
Canonical lncRNA in GenCode and Atlas
 Input sequences: 252913
Output sequences: 5697
Genes in Atlas: 6768
Genes represented in output: 5688
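
Downstream code can read these CSV files back with the standard `csv` module. The sketch below uses made-up rows with the same header that `fasta_reader` writes. It also raises the per-field size limit, since the module's default is 131072 characters and a long transcript sequence might exceed it; whether any sequence in this data does was not checked here.

```python
import csv
import io

# Raise the per-field limit above the csv module's default of 131072.
csv.field_size_limit(10**7)

# Made-up rows with the header written by fasta_reader.
sample = io.StringIO(
    'transcript_id,gene_id,biotype,length,sequence\n'
    'T1.1,G1,lncRNA,8,ACGTACGT\n'
    'T2.1,G1,lncRNA,4,GGCC\n'
)

rows = list(csv.DictReader(sample))
for rec in rows:
    # The length column should agree with the stored sequence.
    assert int(rec['length']) == len(rec['sequence'])
    print(rec['transcript_id'], rec['gene_id'], rec['length'])
```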


In [16]:
if 'ENSG00000274628' in atlas_nc_genes:
    print('Hello ENSG00000274628')
tpg = converter.get_transcripts_per_gene()
if 'ENSG00000274628' not in tpg:
    print('Not there')

Hello ENSG00000274628


In [17]:
print(tpg['ENSG00000274628'])

1
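
Checking one gene at a time does not scale to the gap seen above (for canonical lncRNA, 6768 Atlas genes requested but 5688 represented). A set difference lists every missing gene at once. This is a sketch with made-up data; `missing_genes` is a hypothetical helper.

```python
def missing_genes(wanted, transcripts_per_gene):
    """Genes requested but absent from the converter's output."""
    return sorted(set(wanted) - set(transcripts_per_gene))

# Made-up stand-ins for atlas_nc_genes and get_transcripts_per_gene().
wanted = {'G1', 'G2', 'G3'}
tpg_example = {'G1': 2, 'G3': 1}
print(missing_genes(wanted, tpg_example))  # ['G2']
```

In the notebook, the call would be `missing_genes(atlas_nc_genes, tpg)`.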
