# GenCode 43
GenCode 43 was released 08.02.2023 (Feb 8).  
My download started Feb 27.   

Get FASTA and GFF3 from here: https://www.gencodegenes.org/human/  
Scroll down to Fasta files   
Click on Protein-coding transcript sequences = gencode.v43.pc_transcripts.fa   
Click on Long non-coding RNA transcript sequences = gencode.v43.lncRNA_transcripts.fa   
Scroll down to GFF3 files    
Click on Basic gene annotation on chromosomes (the main file for most users) = gencode.v43.basic.annotation.gff3

The GenCode [biotypes](https://www.gencodegenes.org/pages/biotypes.html) page
describes, for example, 'lncRNA'.

The GenCode [tags](https://www.gencodegenes.org/pages/tags.html) page
describes, for example, 'appris_principal_1', where the transcript is expected to code for the main functional isoform...

## Decide to use Basic Annotation
Using v42 in Sep 2022, 
we previously used the full "annotation" file which has 252416 transcripts. 
Our filters reduced this to 114760 (29K non-coding and 86K coding).
The "basic.annotation" file had 117681 transcripts.
I expect "basic" probably removes the same ones we filtered. 
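
The two counts are close (114760 vs 117681, a difference of 2921), but similar sizes alone do not show that the two sets hold the same transcripts. A minimal sketch of how the overlap could be checked follows. The transcript IDs and the helper name `compare_id_sets` are hypothetical, made up for illustration, and this comparison was not run in the notebook.

```python
# Hypothetical transcript IDs, for illustration only (not taken from GenCode).
filtered_v42 = {'ENST_A', 'ENST_B', 'ENST_C'}          # kept by our own filters
basic_v42 = {'ENST_B', 'ENST_C', 'ENST_D', 'ENST_E'}   # kept by "basic"

def compare_id_sets(ours, theirs):
    '''Return counts of shared IDs and of IDs unique to each set.'''
    return {
        'shared': len(ours & theirs),
        'only_ours': len(ours - theirs),
        'only_theirs': len(theirs - ours),
    }

overlap = compare_id_sets(filtered_v42, basic_v42)
print(overlap)
```

If "basic" really removes the same transcripts, 'only_ours' should be near zero when the real ID sets are compared.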

It is unclear exactly what "basic" means for lncRNA. According to the GenCode FAQ:
The transcripts tagged as "basic" form part of a subset of representative transcripts for each gene. This subset prioritises full-length protein coding transcripts over partial or non-protein coding transcripts within the same gene, and intends to highlight those transcripts that will be useful to the majority of users.

In [1]:
from datetime import datetime
print(datetime.now())

2023-02-27 17:45:19.541964


In [2]:
# GenCode inputs
DATA_DIR = '/Users/jasonmiller/WVU/Localization/GenCode/GenCode43/'
ANNOTATION = 'gencode.v43.basic.annotation.gff3'
NONCODING_SEQUENCE = 'gencode.v43.lncRNA_transcripts.fa'   
CODING_SEQUENCE = 'gencode.v43.pc_transcripts.fa'    
# GenCode outputs
CODING_CSV = 'gencode.v43.pc_transcripts.csv'        
NONCODING_CSV = 'gencode.v43.lncRNA_transcripts.csv'   
# Atlas inputs
ATLAS_DIR = '/Users/jasonmiller/WVU/Localization/LncAtlas/'
ATLAS_FILE = 'lncATLAS_all_data_RCI.csv'
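
Before running the slower cells, it may help to confirm that the downloaded files are where the settings say they are. The sketch below is one possible check, not part of the original notebook. The helper name `find_missing` and the placeholder `demo_dir` are assumptions; substitute the notebook's DATA_DIR.

```python
import os

def find_missing(directory, filenames):
    '''Return the filenames that are not regular files in the directory.'''
    return [name for name in filenames
            if not os.path.isfile(os.path.join(directory, name))]

# File names as configured in this notebook.
expected_inputs = [
    'gencode.v43.basic.annotation.gff3',
    'gencode.v43.lncRNA_transcripts.fa',
    'gencode.v43.pc_transcripts.fa',
]
demo_dir = '.'  # placeholder; use DATA_DIR in the notebook
missing = find_missing(demo_dir, expected_inputs)
print('Missing inputs:', missing)
```

Using `os.path.join` also avoids depending on the trailing slash that plain concatenation such as `DATA_DIR+ANNOTATION` requires.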

## lncAtlas Gene IDs
Load the list of distinct genes in lncAtlas.
Only examine lines where field 2 (counting from zero, the RCI type) is CNRCI and field 3 (the value) is not NA.
Load the ENSG IDs.
Remove duplicate IDs.
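
The filter described above can be tried on a few rows held in memory. This is a sketch only: the sample rows, cell names, gene IDs, and header names are hypothetical, and the column layout is an assumption inferred from the field indices used in this notebook. It uses the standard `csv` module instead of splitting on commas, which would also cope with quoted fields.

```python
import csv
import io

# Hypothetical rows; column order assumed to be ID, source, type, value, name.
SAMPLE = '''gene_id,source,rci_type,value,gene_name
ENSG_DEMO_1,cellA,CNRCI,0.5,NAME_1
ENSG_DEMO_1,cellB,CNRCI,NA,NAME_1
ENSG_DEMO_2,cellA,CNRCI,NA,NAME_2
ENSG_DEMO_3,cellA,OTHER,1.2,NAME_3
ENSG_DEMO_4,cellB,CNRCI,-1.1,NAME_4
'''

def atlas_genes_from_handle(handle):
    '''Collect gene IDs having at least one non-NA CNRCI value.'''
    reader = csv.reader(handle)
    next(reader, None)  # skip the header line
    genes = set()
    for fields in reader:
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        gene_id, rci_type, value = fields[0], fields[2], fields[3]
        if rci_type == 'CNRCI' and value != 'NA':
            genes.add(gene_id)  # set removes duplicates
    return genes

demo_genes = atlas_genes_from_handle(io.StringIO(SAMPLE))
print(sorted(demo_genes))
```

Here ENSG_DEMO_1 is kept because one of its two CNRCI rows has a value, while ENSG_DEMO_2 (only NA) and ENSG_DEMO_3 (wrong type) are dropped.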

In [3]:
def load_atlas_genes(filepath):
    '''Expect a CSV file.'''
    genes = set()
    with open (filepath,'r') as handle:
        header = None
        for row in handle:
            if header is None:
                header = row
            else:
                fields = row.split(',')
                gene_id = fields[0]  # like ENSG00000000003
                rci_type = fields[2]   
                value = fields[3]
                if (value != 'NA' and rci_type == 'CNRCI'):
                    genes.add(gene_id)  # set removes dupes
    return genes

In [4]:
print(datetime.now())
atlas_genes = load_atlas_genes(ATLAS_DIR+ATLAS_FILE)
print('Atlas good genes:', len(atlas_genes))

2023-02-27 17:45:19.636310
Atlas good genes: 24538


## GenCode annotation

In [5]:
def load_annotated_transcripts(filepath,all_transcripts=True):
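    '''Expect a GFF3 file. Return two sets of transcript IDs:
    protein_coding and lncRNA. Only transcripts whose transcript_type
    matches their gene_type are kept. If all_transcripts is False,
    keep only transcripts tagged Ensembl_canonical.'''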
    pc_tids = set()
    nc_tids = set()
    with open (filepath,'r') as handle:
        for row in handle:
            columns = row.split('\t')
            # Avoid comment lines and read only transcript lines.
            if len(columns)>=9 and columns[2] == 'transcript':
                # The data we need is in column 9 (GFF3 calls it "attributes"),
                # listed as name=value pairs separated by semicolons.
                comments = columns[8] 
                pairs = comments.split(';') 
                tid = None
                gtype = None
                ttype = None
                canonical = False
                for pair in pairs:
                    # Look for certain tags and retain their values.
                    if pair.startswith('ID=ENST'):
                        tid = pair[3:] 
                    elif pair.startswith('gene_type='):
                        gtype = pair[10:] 
                    elif pair.startswith('transcript_type='):
                        ttype = pair[16:]
                    elif not all_transcripts and pair.startswith('tag='):
                        if 'Ensembl_canonical' in pair:
                            canonical = True
                if ttype is not None:
                    if tid is None:
                        raise Exception('transcript type without ID')
                    if ttype==gtype: 
                        if all_transcripts or canonical is True:
                            if ttype=='protein_coding':
                                pc_tids.add(tid)
                            elif ttype=='lncRNA':
                                nc_tids.add(tid)
    return pc_tids, nc_tids
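
The name=value parsing used above can be illustrated on a single attribute string. The sketch below is an alternative formulation, not the notebook's method: it builds a dict rather than testing prefixes. The helper name `parse_attributes` and the sample string are hypothetical, and it assumes that multiple tags share one comma-separated `tag=` entry.

```python
def parse_attributes(column9):
    '''Split a GFF3 attribute column into a dict of name -> value.'''
    attributes = {}
    for pair in column9.strip().split(';'):
        if '=' in pair:
            name, value = pair.split('=', 1)  # split on the first '=' only
            attributes[name] = value
    return attributes

# Hypothetical attribute string, including the trailing newline of a file line.
demo_column = ('ID=ENST_DEMO.1;Parent=ENSG_DEMO.1;gene_type=lncRNA;'
               'transcript_type=lncRNA;tag=basic,Ensembl_canonical\n')

attrs = parse_attributes(demo_column)
tags = attrs.get('tag', '').split(',')
is_canonical = 'Ensembl_canonical' in tags
print(attrs['ID'], attrs['transcript_type'], is_canonical)
```

Stripping the line first matters because the last attribute on a line would otherwise carry the newline character.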

In [6]:
print(datetime.now())
gencode_pc_transcripts,gencode_nc_transcripts = \
    load_annotated_transcripts(DATA_DIR+ANNOTATION,True)
print('Gencode all pc transcripts:', len(gencode_pc_transcripts))
print('Gencode all nc transcripts:', len(gencode_nc_transcripts))
canonical_pc_transcripts,canonical_nc_transcripts = \
    load_annotated_transcripts(DATA_DIR+ANNOTATION,False)
print('Gencode canonical pc transcripts:', len(canonical_pc_transcripts))
print('Gencode canonical nc transcripts:', len(canonical_nc_transcripts))

2023-02-27 17:45:20.227234
Gencode all pc transcripts: 63752
Gencode all nc transcripts: 29147
Gencode canonical pc transcripts: 19688
Gencode canonical nc transcripts: 17153
