# Using GenCode v43
GenCode 43 was released 08.02.2023 (Feb 8).  

My download started Feb 27.   

Get FASTA and GFF3 from [GenCode](https://www.gencodegenes.org/human/),
and GenCode [biotypes](https://www.gencodegenes.org/pages/biotypes.html)
which describes GFF terms such as 'lncRNA',
and GenCode [tags](https://www.gencodegenes.org/pages/tags.html)
which describes GFF tags.

Which is the better tag for choosing the best transcript per gene?
We tried appris_principal_1, described as 
"Where the transcript expected to code for the main functional isoform...".
That gave us mRNA but not lncRNA.
Next we tried 'Ensembl_canonical' which gave us mRNA and lncRNA.
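
A minimal sketch of how such a tag check might look, assuming the tags sit in column 9 of the GFF3 as a comma-separated `tag=` attribute. The helper name `get_tags` and the sample attribute string are illustrative and not copied from the GenCode file.

```python
def get_tags(attributes):
    '''Return the set of tags found in a GFF3 column-9 string.'''
    for pair in attributes.strip().split(';'):
        if pair.startswith('tag='):
            return set(pair[len('tag='):].split(','))
    return set()  # no tag attribute on this line

# Illustrative attribute string (made-up IDs, abbreviated attributes).
sample = 'ID=ENST00000000001.1;gene_id=ENSG00000000001.1;tag=basic,Ensembl_canonical'
tags = get_tags(sample)
print('Ensembl_canonical' in tags)   # True
print('appris_principal_1' in tags)  # False
```

Testing membership in the split tag list avoids accidental substring matches between similarly named tags.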

Scroll down to Fasta files.
Previously, we used "Protein-coding transcript sequences" (gencode.v43.pc_transcripts.fa)
and "Long non-coding RNA transcript sequences" (gencode.v43.lncRNA_transcripts.fa).
But now we discovered some lncAtlas genes are missing.
Now we switch to using 
"Transcript sequences - CHR - Nucleotide sequences of all transcripts on the reference chromosomes" (gencode.v43.transcripts.fa) which is larger than both others combined.
This file does not contain every transcript in the GFF3 file; see below.

Scroll down to GFF3 files    
When we used v42 in Sep 2022, 
we previously used the full "annotation" file which has 252416 transcripts. 
Our filters reduced this to 114760 (29K non-coding and 86K coding).
So "basic" probably removes the same ones we filtered. 
So again we click on "Basic gene annotation on chromosomes - the main file for most users"(gencode.v43.basic.annotation.gff3). 
The "basic.annotation" file had 117681 transcripts in v42.
The "basic.annotation" file has 117725 transcripts in v43.

According to the FAQ...
The transcripts tagged as "basic" form part of a subset of representative transcripts for each gene. This subset prioritises full-length protein coding transcripts over partial or non-protein coding transcripts within the same gene, and intends to highlight those transcripts that will be useful to the majority of users.

The all-annotation file (gencode.v43.chr_patch_hapl_scaff.annotation.gff3) 
is almost twice as large as the basic-annotation file. 
The documentation says it contains genes not assigned to chromosomes
(these are on left-over contigs from the genome assembly)
as well as genes assigned to chromosomes.
We will use this.

GenCode includes two copies of any gene in the pseudoautosomal regions (PAR) of ChrX and ChrY.
For such a gene, the GFF and FASTA list the gene twice 
(but the versioned ID of one copy ends in \_PAR_Y).
Also, the GFF and FASTA list two canonical transcripts.
We decided to filter out transcripts with the PAR tag.

### Missing sequences
GenCode provides all annotation but not all sequences.
There are about 6500 genes in the all-annotations GFF
that are not in the all-transcripts FASTA.
That's because the FASTA only has transcripts for genes placed on chromosomes.
It omits genes on contigs whose chromosome position is unknown.

For example, ENSG00000278704 is a protein-coding gene with a one-exon transcript
ENST00000618686.1 which is marked as the canonical transcript.
The transcript is about 2000 nucleotides, and the CDS about 300.
But the chromosome location is GL000009.2, which is a contig not a chromosome.
This transcript does not appear in any FASTA file from GenCode. 
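
One way to list such genes is a set difference between the gene IDs seen in the GFF and those seen in the FASTA deflines. This is a rough sketch with tiny hand-picked ID sets standing in for the real files; `genes_without_sequence` is a hypothetical helper, not part of this notebook.

```python
def genes_without_sequence(gff_gene_ids, fasta_gene_ids):
    '''Return sorted gene IDs that are annotated in the GFF
    but have no transcript in the FASTA.'''
    return sorted(set(gff_gene_ids) - set(fasta_gene_ids))

# Toy inputs; in practice both sets would be parsed from the GenCode files.
gff_ids = {'ENSG00000278704', 'ENSG00000238009', 'ENSG00000124334'}
fasta_ids = {'ENSG00000238009', 'ENSG00000124334'}
print(genes_without_sequence(gff_ids, fasta_ids))  # ['ENSG00000278704']
```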

Also, lncAtlas is old. 
Some of its gene IDs no longer exist at Ensembl or GenCode.
One example is ENSG00000244952. 
This gene ID is retired and has no successors, according to Ensembl.

### What is lncRNA?
Previously, we tried to generate a database of trustable lncRNA (and mRNA for comparison).
We applied filters such as GFF transcript_type must be present, must be 'lncRNA', must match the gene_type, etc. Also, we used GenCode as our oracle for deciding whether a gene was lncRNA or protein-coding.

This time, we will use lncAtlas as the oracle and GenCode merely as a source for transcript sequences per gene. Note lncAtlas lists the Ensembl gene ID (but not transcript ID) and whether the gene is 'coding' or 'nc'. Here are some changes.

Gene ENSG00000274628 is listed as 'nc' in lncAtlas. It has negative CNRCI values indicating nuclear. Ensembl indicates this is a member of the Cytochrome P450 gene family. It is annotated "Transcribed unprocessed pseudogene". I think that means it arose by gene duplication, so it retains introns, and it is still transcribed -- a seriously messed up gene. This gene was left off the chromosomes in human genome assemblies until the latest release, GRCh38, which put it on Chr9, indicating it belongs to a seriously messy part of the genome. There is no RefSeq for this gene. This gene is not annotated as 'lncRNA' in the GenCode GFF. Instead, it is "gene_type=transcribed_unprocessed_pseudogene", one of 959 such transcripts. 

Is this a lncRNA? It is probably a mistake, not an evolved feature. But who are we to decide what's a lncRNA? Which database should we believe? We previously went with GenCode but this time we're going with lncAtlas.

### Why not use basic annotation?
Example: ENSG00000238009.
The canonical transcript is in GenCode's all-annotation but not its basic-annotation file!
Ensembl says lncRNA gene with 5 transcripts from Chr1 but no other information.
This gene is present and useful in the lncAtlas database.
The gene tag in the GFF is 'overlapping_locus'.
GenCode GFF has 2 transcripts, neither marked canonical.
Both have transcript_support_level_5.
Lower ID has level=2 and tag=not_best_in_genome_evidence; other has level=3.
(Level=1 means validated, Level=2 means manual, Level=3 means automatic.)
GenCode FASTA has 5 transcripts; only difference is highest ID has no OTTHUMG ID.
The one with lowest ID is not present in the GFF.
Lengths range from 336 to 2748.

In [1]:
from datetime import datetime
print(datetime.now())

2023-02-28 19:21:55.266198


In [2]:
# GenCode inputs
DATA_DIR = '/Users/jasonmiller/WVU/Localization/GenCode/GenCode43/'
ALL_ANNOTATION = 'gencode.v43.chr_patch_hapl_scaff.annotation.gff3'
SEQUENCE = 'gencode.v43.transcripts.fa'   
# GenCode outputs
CODING_CSV_ALL = 'gencode_v43.all_pc_transcripts.csv'        
NONCODING_CSV_ALL = 'gencode_v43.all_lncRNA_transcripts.csv'   
CODING_CSV_CANON = 'gencode_v43.canon_pc_transcripts.csv'        
NONCODING_CSV_CANON = 'gencode_v43.canon_lncRNA_transcripts.csv'   
CODING_CSV_LONG = 'gencode_v43.longest_pc_transcripts.csv'        
NONCODING_CSV_LONG = 'gencode_v43.longest_lncRNA_transcripts.csv'   
# Atlas inputs
ATLAS_DIR = '/Users/jasonmiller/WVU/Localization/LncAtlas/'
ATLAS_FILE = 'lncATLAS_all_data_RCI.csv'

## lncAtlas Gene IDs
Load the list of distinct genes in lncAtlas.
Only examine lines where field 2 is CNRCI and field 3 is not NA.
Load the ENSG IDs.
Remove duplicate IDs.
Use field 5 as indicator of 'coding' or 'nc'.

In [3]:
class Atlas_Parser():
    def __init__(self,filepath):
        '''Expect a CSV file.'''
        self.filepath = filepath
    def load_useful_genes(self):
        '''Return two gene sets (each free of duplicates) containing
        only genes with at least one CNRCI value in lncAtlas.'''
        pc_genes = set() 
        nc_genes = set()
        with open (self.filepath,'r') as handle:
            header = None
            for row in handle:
                if header is None:
                    header = row
                else:
                    fields = row.split(',')
                    gene_id = fields[0]   # like ENSG00000000003
                    rci_type = fields[2]  # like CNRCI or RCIin
                    value = fields[3]     # like -3.05 or NA
                    gene_type = fields[5] # coding or nc, same as field 6
                    if (value != 'NA' and rci_type == 'CNRCI'):
                        if gene_type == 'coding':
                            pc_genes.add(gene_id)  
                        elif gene_type == 'nc':
                            nc_genes.add(gene_id)
        return pc_genes, nc_genes # sets - no duplicates

In [4]:
print(datetime.now())
atlas_db = Atlas_Parser(ATLAS_DIR+ATLAS_FILE)
atlas_pc_genes,atlas_nc_genes = atlas_db.load_useful_genes()
atlas_db = None
print('Atlas pc genes with at least one CNRCI:', len(atlas_pc_genes))
print('Atlas nc genes with at least one CNRCI:', len(atlas_nc_genes))

2023-02-28 19:21:55.313104
Atlas pc genes with at least one CNRCI: 17770
Atlas nc genes with at least one CNRCI: 6768


## GenCode annotation
The GenCode GFF file indicates the canonical transcript per gene.

In [5]:
class GenCodeGFF_Parser():
    def __init__(self,filepath):
        '''Expect a GFF3 file.'''
        self.filepath=filepath
    def load_transcripts(self,canonical_transcripts_only=False):
        gids = dict()  # gene_id:transcript_id
        tids = dict()  # transcript_id:gene_id
        with open (self.filepath,'r') as handle:
            for row in handle:
                columns = row.split('\t')
                # Avoid comment lines and read only transcript lines.
                if len(columns)>=9 and columns[2] == 'transcript':
                    # The data we need is in column 9, the so-called comments,
                    # listed as name=value pairs separated by semicolon.
                    comments = columns[8].strip() 
                    pairs = comments.split(';') 
                    gid = None
                    tid = None
                    canonical = False
                    is_par = False
                    for pair in pairs:
                        if pair.startswith('ID=ENST'):
                            tid = pair[3:].split('.')[0] 
                        if pair.startswith('gene_id='):
                            gid = pair[8:].split('.')[0] 
                        if pair.startswith('tag='): 
                            tag_list = pair[4:].split(',')
                            # Some ChrX genes have an identical copy on ChrY.
                            if 'PAR' in tag_list:
                                is_par = True
                            if canonical_transcripts_only and 'Ensembl_canonical' in tag_list:
                                canonical = True
                    if gid is None:
                        print(row)
                        raise Exception('missing gene ID')
                    if tid is None:
                        print(row)
                        raise Exception('missing transcript ID')
                    if is_par:
                        continue  # skip the whole PAR transcript, not just this tag
                    if not canonical_transcripts_only or canonical:
                        if canonical and gid in gids:
                            print('WARN: Another canonical!',gid,tid)
                        gids[gid]=tid # either the canonical tid or the last tid
                        tids[tid]=gid
        return gids,tids 

In [6]:
# Count canonical transcripts
print(datetime.now())
gff_db = GenCodeGFF_Parser(DATA_DIR+ALL_ANNOTATION)
gencode_genes,gencode_transcripts = gff_db.load_transcripts(False)
print('Gencode all genes:', len(gencode_genes))
print('Gencode all transcripts:', len(gencode_transcripts))
gencode_genes,gencode_transcripts = gff_db.load_transcripts(True)
print('Gencode genes with canonical transcripts:', len(gencode_genes))
print('Gencode canonical transcripts:', len(gencode_transcripts))
gene_to_canonical_transcript = gencode_genes # keep this!
gff_db = None
gencode_genes = None
gencode_transcripts = None 

2023-02-28 19:21:55.944662
Gencode all genes: 69175
Gencode all transcripts: 273857
Gencode genes with canonical transcripts: 69175
Gencode canonical transcripts: 69175


In [7]:
def remove_dead_genes(atlas_set,gencode_set):
    dead_genes = set()
    for gid in atlas_set:
        if gid not in gencode_set:
            dead_genes.add(gid)
    new_set = atlas_set - dead_genes
    return new_set

atlas_pc_genes = remove_dead_genes(atlas_pc_genes,gene_to_canonical_transcript)
atlas_nc_genes = remove_dead_genes(atlas_nc_genes,gene_to_canonical_transcript)
        
print('After removing dead genes missing from GenCode GFF:')
print('Atlas pc genes with at least one CNRCI:', len(atlas_pc_genes))
print('Atlas nc genes with at least one CNRCI:', len(atlas_nc_genes))

After removing dead genes missing from GenCode GFF:
Atlas pc genes with at least one CNRCI: 17668
Atlas nc genes with at least one CNRCI: 6423


## Ensembl Canonical
Ensembl documentation of 
[canonical](https://www.ensembl.org/info/genome/genebuild/canonical.html)

* The Ensembl Canonical transcript is a single, representative transcript identified at every locus. For accurate analysis, we recommend that more than one transcripts at a locus may need to be considered, however, we designate a single Ensembl Canonical transcript per locus to provide consistency when only one transcript is required... 
* The Ensembl Canonical transcript for non-protein-coding gene biotypes are currently calculated as follows: lncRNAs: The Ensembl Canonical is the transcript at the locus with the longest genomic span.

Notes 
* Genomic span is not the same as transcript length. There could be a short transcript derived from splicing two exons that are widely-separated on the genome.
* Genomic span is not indicated in the lncAtlas or GenCode FASTA files. 
* Genomic span can be derived from columns 4 and 5 (start and end) of any 'transcript' line of the GFF, as end - start + 1.
* The transcript length is reported only in the FASTA file defline. (The sum of the exon lengths in the GFF might also work, but those are genomic coordinates so it could be off.)
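
To make the difference between genomic span and transcript length concrete, here is a small sketch. It assumes GFF coordinates are 1-based and inclusive, and it uses a made-up two-exon transcript; the function names are hypothetical.

```python
def genomic_span(start, end):
    '''Span on the genome, from GFF columns 4 (start) and 5 (end).'''
    return end - start + 1

def spliced_length(exons):
    '''Sum of exon lengths, given (start, end) pairs in GFF coordinates.'''
    return sum(end - start + 1 for start, end in exons)

# Made-up transcript: two short exons far apart on the genome.
exons = [(1000, 1199), (50000, 50299)]
span = genomic_span(exons[0][0], exons[-1][1])
print(span)                   # 49300
print(spliced_length(exons))  # 500
```

Because each exon contributes end - start + 1 bases, the exon sum should agree with the length in the FASTA defline, although that has not been checked here against the GenCode files.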

The GenCode FASTA of all transcripts has gene variants
that don't have an ensembl canonical transcript.
For example, for ID=ENSG00000124334, 
gene ID ENSG00000124334.18 has a canonical transcript,
but ENSG00000124334.18_PAR_Y is listed in the FASTA
and it does not have a canonical transcript (unless you parse away the PAR Y part).
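
A sketch of one way to parse away the PAR Y part while remembering that it was there. The helper `split_gencode_id` is hypothetical; it assumes the suffix is always the literal string `_PAR_Y` at the end of the versioned ID.

```python
def split_gencode_id(versioned_id):
    '''Split e.g. 'ENSG00000124334.18_PAR_Y' into
    (stable_id, version, is_par_y).'''
    suffix = '_PAR_Y'
    is_par_y = versioned_id.endswith(suffix)
    core = versioned_id[:-len(suffix)] if is_par_y else versioned_id
    stable_id, _, version = core.partition('.')
    return stable_id, version, is_par_y

print(split_gencode_id('ENSG00000124334.18'))        # ('ENSG00000124334', '18', False)
print(split_gencode_id('ENSG00000124334.18_PAR_Y'))  # ('ENSG00000124334', '18', True)
```

Note that `split('.')[0]` alone maps both forms to the same key, so code that strips versions that way can no longer tell the ChrX copy from the ChrY copy.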

## Canonical vs longest transcript
For lncRNA, Ensembl canonical means longest genomic span. 
Here, we show this is not the same as longest transcript. 
We get transcript lengths from the FASTA.
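
Picking the longest transcript per gene can be sketched compactly with `max` and a key function. The IDs and lengths below are made up, and `longest_per_gene` is a hypothetical helper; on ties, `max` keeps the first transcript it sees.

```python
def longest_per_gene(gene_to_tids, tid_to_len):
    '''Map each gene ID to its longest transcript ID (first one wins ties).'''
    return {gid: max(tids, key=lambda tid: tid_to_len[tid])
            for gid, tids in gene_to_tids.items() if tids}

# Made-up IDs and lengths, for illustration only.
toy_lists = {'geneA': ['tx1', 'tx2', 'tx3'], 'geneB': ['tx4']}
toy_lengths = {'tx1': 900, 'tx2': 2748, 'tx3': 336, 'tx4': 1200}
toy_longest = longest_per_gene(toy_lists, toy_lengths)
print(toy_longest)  # {'geneA': 'tx2', 'geneB': 'tx4'}
```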

In [8]:
# One pass through the FASTA file to get length of every transcript
transcript_to_length = dict()
transcript_to_gene = dict()
gene_to_transcript_list = dict()
with open(DATA_DIR+SEQUENCE,'r') as fasta:
    for line in fasta:
        if line[0]=='>':
            fields=line[1:].strip().split('|')
            tid=fields[0].split('.')[0]
            gid=fields[1].split('.')[0]
            tlen=int(fields[6])
            if gid not in gene_to_canonical_transcript:
                continue # gene absent from the GFF map (a _PAR_Y suffix is already stripped above)
            transcript_to_length[tid] = tlen
            transcript_to_gene[tid] = gid
            if gid in gene_to_transcript_list:
                gene_to_transcript_list[gid].append(tid)
            else:
                gene_to_transcript_list[gid] = [tid]
print("We have these data.")
print('gene_to_canonical_transcript:',len(gene_to_canonical_transcript))
print('gene_to_transcript_list     :',len(gene_to_transcript_list))
print('transcript_to_length        :',len(transcript_to_length))
print('transcript_to_gene          :',len(transcript_to_gene))
# See the note on missing sequences above.
# One example is ENSG00000278704.1

We have these data.
gene_to_canonical_transcript: 69175
gene_to_transcript_list     : 62656
transcript_to_length        : 252739
transcript_to_gene          : 252739


In [9]:
gene_to_longest_transcript = dict()
for gid in gene_to_transcript_list:
    for tid in gene_to_transcript_list[gid]:
        tlen = transcript_to_length[tid]
        if gid in gene_to_longest_transcript:
            prev_best_tid = gene_to_longest_transcript[gid]
            prev_best_len = transcript_to_length[prev_best_tid]
            if tlen > prev_best_len:
                gene_to_longest_transcript[gid]=tid
        else:
            gene_to_longest_transcript[gid]=tid
gain_in_genes = 0
gain_in_bases = 0
for gid in gene_to_transcript_list:
    canonical_id = gene_to_canonical_transcript[gid]
    canonical_len = transcript_to_length[canonical_id]
    longest_id = gene_to_longest_transcript[gid]
    longest_len = transcript_to_length[longest_id]
    if longest_len > canonical_len:
        gain_in_genes += 1
        gain_in_bases += longest_len - canonical_len
print('For all of GenCode (not just lncAtlas genes).')
print('Compare longest transcript per gene to canonical transcript.')
print('Genes with some gain:',gain_in_genes)
print('Gain in bases:',gain_in_bases)
print('Gain per transcript:',gain_in_bases/gain_in_genes)

For all of GenCode (not just lncAtlas genes).
Compare longest transcript per gene to canonical transcript.
Genes with some gain: 11485
Gain in bases: 11900500
Gain per transcript: 1036.177622986504


In [10]:
# Repeat for lncAtlas lncRNA only
gain_in_genes = 0  # reset; otherwise totals carry over from the previous cell
gain_in_bases = 0
atlas_dead_genes = set()
for gid in atlas_nc_genes:
    if gid not in gene_to_canonical_transcript:
        atlas_dead_genes.add(gid)
        continue
    canonical_id = gene_to_canonical_transcript[gid]
    canonical_len = transcript_to_length[canonical_id]
    longest_id = gene_to_longest_transcript[gid]
    longest_len = transcript_to_length[longest_id]
    if longest_len > canonical_len:
        gain_in_genes += 1
        gain_in_bases += longest_len - canonical_len
print('For non-coding gene IDs in lncAtlas:')
print('Dead gene IDs:',len(atlas_dead_genes))
print('Compare longest transcript per gene to canonical transcript.')
print('Genes with some gain:',gain_in_genes)
print('Gain in bases:',gain_in_bases)
print('Gain per transcript:',round(gain_in_bases/gain_in_genes,1))

For non-coding gene IDs in lncAtlas:
Dead gene IDs: 0
Compare longest transcript per gene to canonical transcript.
Genes with some gain: 1663
Gain in bases: 2373086
Gain per transcript: 1427.0


## Combine lncAtlas and GenCode to list transcripts
Only lncAtlas knows which genes have CNRCI values. 
Also, we rely on lncAtlas for the coding/non-coding label.
Note lncAtlas gene IDs are lacking the version suffix.
Only GenCode GFF knows the Ensembl canonical transcript per gene.

In [11]:
def intersect(atlas):
    canonical_tids = set()
    longest_tids = set()
    all_tids = set()
    for gid in atlas:
        if gid in gene_to_canonical_transcript:
            tid = gene_to_canonical_transcript[gid]
            canonical_tids.add(tid)
        if gid in gene_to_longest_transcript:
            tid = gene_to_longest_transcript[gid]
            longest_tids.add(tid)
        if gid in gene_to_transcript_list:
            tid_list = gene_to_transcript_list[gid]
            for tid in tid_list:
                all_tids.add(tid)
    return canonical_tids,longest_tids,all_tids

In [12]:
def bases(set_of_tid):
    total = 0
    for tid in set_of_tid:
        total += transcript_to_length[tid]
    return total

In [13]:
atlas_pc_canonical_tids,atlas_pc_longest_tids,atlas_pc_all_tids = intersect(atlas_pc_genes)
atlas_nc_canonical_tids,atlas_nc_longest_tids,atlas_nc_all_tids = intersect(atlas_nc_genes)

print('atlas_pc_canonical_transcripts:',len(atlas_pc_canonical_tids))
print('atlas_pc_longest_transcripts  :',len(atlas_pc_longest_tids))
print('atlas_pc_all_transcripts      :',len(atlas_pc_all_tids))
print('atlas_nc_canonical_transcripts:',len(atlas_nc_canonical_tids))
print('atlas_nc_longest_transcripts  :',len(atlas_nc_longest_tids))
print('atlas_nc_all_transcripts      :',len(atlas_nc_all_tids))

print('atlas_pc_canonical_bases:',bases(atlas_pc_canonical_tids))
print('atlas_pc_longest_bases  :',bases(atlas_pc_longest_tids))
print('atlas_pc_all_bases      :',bases(atlas_pc_all_tids))
print('atlas_nc_canonical_bases:',bases(atlas_nc_canonical_tids))
print('atlas_nc_longest_bases  :',bases(atlas_nc_longest_tids))
print('atlas_nc_all_bases      :',bases(atlas_nc_all_tids))

atlas_pc_canonical_transcripts: 17668
atlas_pc_longest_transcripts  : 17668
atlas_pc_all_transcripts      : 165199
atlas_nc_canonical_transcripts: 6423
atlas_nc_longest_transcripts  : 6423
atlas_nc_all_transcripts      : 28982
atlas_pc_canonical_bases: 68237096
atlas_pc_longest_bases  : 75256443
atlas_pc_all_bases      : 337985715
atlas_nc_canonical_bases: 10917647
atlas_nc_longest_bases  : 13290733
atlas_nc_all_bases      : 42685749


## Convert FASTA to CSV
Extract just the relevant transcripts from the FASTA.
Convert to one-line format for ease of downstream processing.
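
As a sketch of the one-line conversion, a small generator can join the wrapped sequence lines of each record before anything is written out. The name `iter_fasta` and the two toy records are assumptions for illustration; the deflines only imitate the vertical-bar layout of the GenCode file.

```python
import io

def iter_fasta(handle):
    '''Yield (defline, sequence) pairs from a FASTA file handle.'''
    defline, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith('>'):
            if defline is not None:
                yield defline, ''.join(chunks)  # finish the previous record
            defline, chunks = line[1:], []
        else:
            chunks.append(line)  # one sequence may span many lines
    if defline is not None:
        yield defline, ''.join(chunks)  # do not forget the last record

# Toy FASTA with a wrapped first sequence.
sample = io.StringIO('>T1.1|G1.1|\nACGT\nAC\n>T2.3|G2.1|\nGG\n')
for defline, seq in iter_fasta(sample):
    tid = defline.split('|')[0].split('.')[0]
    print(tid, len(seq), seq)
```

Collecting the lines in a list and joining once avoids repeated string concatenation on long transcripts.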

In [14]:
class fasta_reader():
    '''Expect human transcripts FASTA file from GenCode.'''
    def __init__(self,infile,outfile,biotype):
        '''Biotype is the label written to the CSV, e.g. 'pc' or 'nc'.'''
        self.infile = infile
        self.outfile = outfile
        self.biotype = biotype
        self.FASTA_DEFCHAR = '>'  # signals a defline = definition line
        self.headers='transcript_id,gene_id,biotype,length,sequence\n'
        self.allow_tids = None
    def _print_one_sequence(self,handle,tran,gene,seq):
        if seq is not None: # None means it was first defline in file
            if tran in self.allow_tids:
                biotype = self.biotype
                length = str(len(seq))
                outstr = ','.join((tran,gene,biotype,length,seq))
                handle.write(outstr+'\n')
                self.count_out += 1
    def fasta_to_csv(self,allow_tids):
        self.allow_tids = allow_tids
        self.count_in = 0
        self.count_out = 0
        with open(self.outfile,'w') as handle:
            handle.write(self.headers)
            with open(self.infile,'r') as fasta:
                transcript_id = None
                gene_id = None
                next_seq = None
                for line in fasta:
                    if line[0]==self.FASTA_DEFCHAR:
                        # Wrap up the previous sequence before moving on to the next.
                        self._print_one_sequence(handle,transcript_id,gene_id,next_seq)
                        # The defline starts with '>'
                        # The defline has fields separated by vertical bar
                        self.count_in += 1
                        line = line[1:].strip()  # strip defchar and newline
                        tokens = line.split('|')
                        transcript_id = tokens[0].split('.')[0] 
                        gene_id = tokens[1].split('.')[0]   
                        next_seq = ""   # get ready for one to many sequence lines
                    else:
                        # In FASTA format, one sequence may continue to next line
                        next_seq = next_seq + line.strip()  
            # Reading FASTA, be sure to process the last sequence.
            self._print_one_sequence(handle,transcript_id,gene_id,next_seq)
        return self.count_in,self.count_out

In [15]:
fasta_db = fasta_reader(DATA_DIR+SEQUENCE,DATA_DIR+CODING_CSV_ALL,'pc')
tin,tout=fasta_db.fasta_to_csv(atlas_pc_all_tids)
print('         GenCode transcripts:',tin)
print('Atlas all coding transcripts:',tout)

         GenCode transcripts: 252913
Atlas all coding transcripts: 165307


In [16]:
fasta_db = fasta_reader(DATA_DIR+SEQUENCE,DATA_DIR+CODING_CSV_CANON,'pc')
tin,tout=fasta_db.fasta_to_csv(atlas_pc_canonical_tids)
print('               GenCode transcripts:',tin)
print('Atlas canonical coding transcripts:',tout)

               GenCode transcripts: 252913
Atlas canonical coding transcripts: 17685


In [17]:
fasta_db = fasta_reader(DATA_DIR+SEQUENCE,DATA_DIR+CODING_CSV_LONG,'pc')
tin,tout=fasta_db.fasta_to_csv(atlas_pc_longest_tids)
print('               GenCode transcripts:',tin)
print('Atlas longest coding transcripts:',tout)

               GenCode transcripts: 252913
Atlas longest coding transcripts: 17685


In [18]:
fasta_db = fasta_reader(DATA_DIR+SEQUENCE,DATA_DIR+NONCODING_CSV_ALL,'nc')
tin,tout=fasta_db.fasta_to_csv(atlas_nc_all_tids)
print('             GenCode transcripts:',tin)
print('Atlas all non-coding transcripts:',tout)

             GenCode transcripts: 252913
Atlas all non-coding transcripts: 28997


In [19]:
fasta_db = fasta_reader(DATA_DIR+SEQUENCE,DATA_DIR+NONCODING_CSV_CANON,'nc')
tin,tout=fasta_db.fasta_to_csv(atlas_nc_canonical_tids)
print('                   GenCode transcripts:',tin)
print('Atlas canonical non-coding transcripts:',tout)

                   GenCode transcripts: 252913
Atlas canonical non-coding transcripts: 6432


In [20]:
fasta_db = fasta_reader(DATA_DIR+SEQUENCE,DATA_DIR+NONCODING_CSV_LONG,'nc')
tin,tout=fasta_db.fasta_to_csv(atlas_nc_longest_tids)
print('                   GenCode transcripts:',tin)
print('Atlas longest non-coding transcripts:',tout)

                   GenCode transcripts: 252913
Atlas longest non-coding transcripts: 6432
