# GenCode / Ensembl
The current release for human is Release 41 (GRCh38.p13).
All GenCode annotation refers to Ensembl IDs.
The sequence download pages at GenCode contain links to the Ensembl 
[FTP](https://useast.ensembl.org/info/data/ftp/index.html).  

Human sequence in FASTA: 
1. genomic DNA
1. transcript cDNA (spliced, so without introns, but including UTRs) 
1. transcript [CDS](http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/cds/) 
(coding sequence only, without introns or UTRs) 
1. noncoding [ncRNA](http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/ncrna/)   

Human gene annotation is available in [GFF](https://www.gencodegenes.org/human/).


In [1]:
DATA_DIR = '/Users/jasonmiller/WVU/Localization/GenCode/'

In [2]:
class fasta_reader():
    '''Parser for human transcripts FASTA file from Ensembl.'''
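    def __init__(self,infile,outfile):
        self.infile = infile
        self.outfile = outfile
        self.filter_type = None   # None means keep every biotype
        self.type_dict = {}       # biotype -> count, so accumulate_types() is always safe to call
        self.FASTA_DEFCHAR = '>'
        self.count=0
        self.headers='transcript_id,gene_id,biotype,length,sequence\n'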
    def accumulate_types(self,biotype):
        if biotype in self.type_dict:
            self.type_dict[biotype] = self.type_dict[biotype]+1
        else:
            self.type_dict[biotype]=1
    def print_one_sequence(self,handle,tran,gene,biotype,seq):
        if seq is not None:
            if self.filter_type is None or self.filter_type==biotype:
                length = str(len(seq))
                outstr = ','.join((tran,gene,biotype,length,seq))
                handle.write(outstr+'\n')
                self.count += 1
    def set_filter(self,word):
        self.filter_type = word
    def types_report(self):
        self.type_dict={}
        with open(self.infile,'r') as fasta:
            for line in fasta:
                if line[0]==self.FASTA_DEFCHAR:
                    tokens = line.split(' ')
                    longbiotype = tokens[5].strip()
                    biotype = longbiotype.split(':')[1]
                    self.accumulate_types(biotype)
        print("Input transcript biotypes report:")
        print(self.type_dict)
    def fasta_to_csv(self):
        self.type_dict={}
        with open(self.outfile,'w') as handle:
            handle.write(self.headers)
            with open(self.infile,'r') as fasta:
                transcript_id = None
                gene_id = None
                biotype = None
                next_seq = None
                for line in fasta:
                    if line[0]==self.FASTA_DEFCHAR:
                        # The defline starts with '>'
                        # The defline has key:value pairs separated by spaces
                        self.print_one_sequence(handle,transcript_id,gene_id,biotype,next_seq)
                        # Special case: descriptions come last and have embedded spaces
                        desc_index = line.find(' description:')
                        if desc_index>0:
                            desc_str = line[desc_index+1:]
                            line = line[:desc_index]  # line minus description
                        tokens = line.split(' ')
                        transcript_id = tokens[0][1:] # chop off '>'
                        gene_id = tokens[3][5:]    # chop off 'gene:'
                        version_index=gene_id.find('.')   # as in ENSG00000198888.2
                        if version_index>=0:
                            gene_id = gene_id[:version_index]  # chop off version
                        longbiotype = tokens[5].strip()  # chop off newlines
                        biotype = longbiotype.split(':')[1]  # chop off 'transcript_biotype:'
                        next_seq = ""   # get ready for one to many sequence lines
                    else:
                        next_seq = next_seq + line.strip()  # sequence continuation
            self.print_one_sequence(handle,transcript_id,gene_id,biotype,next_seq)
        print("Output sequences: %d"%self.count)

## mRNA
The cds file contains the coding sequence of each transcript, that is, 
the processed transcript minus the untranslated regions (UTRs); the introns are already spliced out.
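
To make the notion of a coding sequence concrete, here is a minimal sketch of a sanity check for one CDS string. It assumes the standard genetic code and assumes the record includes its stop codon; mitochondrial genes and incompletely annotated transcripts would be expected to fail it. The function name `looks_like_complete_cds` and the toy sequences are hypothetical, not part of this notebook.

```python
STOP_CODONS = {'TAA', 'TAG', 'TGA'}   # standard genetic code only (assumption)

def looks_like_complete_cds(seq):
    """Return True if seq starts with ATG, ends with a stop codon,
    and consists of whole codons (length divisible by 3)."""
    seq = seq.upper()
    return (len(seq) >= 6
            and len(seq) % 3 == 0
            and seq.startswith('ATG')
            and seq[-3:] in STOP_CODONS)

# Made-up toy sequences, not taken from the Ensembl file.
print(looks_like_complete_cds('ATGGCCTAA'))
print(looks_like_complete_cds('ATGGCCTA'))
```

A check like this is only a rough filter; it says nothing about whether the annotation itself is correct.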

Genes can have different transcripts called isoforms. 

Some protein_coding genes have degenerate, non-coding transcripts.
For example, a protein_coding gene might have one transcript
marked nonsense_mediated_decay because it is ill-formed.

Here, we filter for the protein_coding transcripts only.

Typical defline of cds file:    
`>ENST00000631435.1 cds chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]`
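
The positional parsing that the `fasta_reader` class applies to a defline like this one can be sketched in isolation. The sketch assumes the field order shown above (ID, sequence type, location, gene, gene_biotype, transcript_biotype) and would need adjusting if a file used a different order. The helper name `parse_defline` is hypothetical.

```python
def parse_defline(defline):
    """Split an Ensembl FASTA defline into (transcript_id, gene_id, biotype).
    Assumes the positional field order of the cds/ncrna files shown here."""
    # Drop the free-text description first, since it contains spaces.
    desc_index = defline.find(' description:')
    if desc_index > 0:
        defline = defline[:desc_index]
    tokens = defline.strip().split(' ')
    transcript_id = tokens[0].lstrip('>')              # chop off '>'
    gene_id = tokens[3][len('gene:'):].split('.')[0]   # chop off 'gene:' and the version
    biotype = tokens[5].split(':', 1)[1]               # chop off 'transcript_biotype:'
    return transcript_id, gene_id, biotype

example = ('>ENST00000631435.1 cds chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 '
           'gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene '
           'gene_symbol:TRBD1 description:T cell receptor beta diversity 1 '
           '[Source:HGNC Symbol;Acc:HGNC:12158]')
print(parse_defline(example))
```

Note that the transcript ID keeps its version suffix while the gene ID loses it, matching what the class writes to the CSV.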

In [3]:
# From https://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/cds/
FASTA_FILENAME='Homo_sapiens.GRCh38.cds.all.fa'
# FASTA_FILENAME='test.cds.fa'   # just for testing
CSV_FILENAME='Homo_sapiens.GRCh38.cds.csv'
infile = DATA_DIR + FASTA_FILENAME
outfile = DATA_DIR + CSV_FILENAME
converter = fasta_reader(infile,outfile)
converter.types_report()
converter.set_filter('protein_coding')
converter.fasta_to_csv()

Input transcript biotypes report:
{'TR_D_gene': 5, 'IG_D_gene': 64, 'TR_V_gene': 160, 'TR_C_gene': 8, 'IG_J_gene': 24, 'IG_C_gene': 34, 'IG_V_gene': 231, 'TR_J_gene': 93, 'protein_coding': 98078, 'protein_coding_LoF': 112, 'nonsense_mediated_decay': 21789, 'non_stop_decay': 114}
Output sequences: 98078
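
As a quick check on the generated CSV, the file can be read back with the standard `csv` module. The sketch below assumes the column names written by `fasta_to_csv` (`transcript_id,gene_id,biotype,length,sequence`) and runs on a tiny in-memory sample with made-up rows rather than the real file. The function name `summarize_lengths` is hypothetical.

```python
import csv
import io

def summarize_lengths(handle):
    """Return (row count, shortest, longest) sequence length from the CSV,
    verifying that the length column agrees with each sequence."""
    lengths = []
    for row in csv.DictReader(handle):
        seq_len = len(row['sequence'])
        if seq_len != int(row['length']):
            raise ValueError('length mismatch for ' + row['transcript_id'])
        lengths.append(seq_len)
    if not lengths:
        return 0, None, None
    return len(lengths), min(lengths), max(lengths)

# Made-up placeholder rows in the same layout as the real output file.
SAMPLE = ('transcript_id,gene_id,biotype,length,sequence\n'
          'TX1.1,GENE1,protein_coding,6,ATGTAA\n'
          'TX2.1,GENE2,protein_coding,9,ATGAAATAA\n')
print(summarize_lengths(io.StringIO(SAMPLE)))
```

For the real file, pass `open(outfile, newline='')` instead of the `StringIO` object. If a sequence is longer than the `csv` module's default field size limit, the limit may need raising with `csv.field_size_limit`.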


## lncRNA
The ncrna file contains non-coding sequence, that is, 
transcripts from non-coding genes. 

Non-coding genes can have different transcripts called isoforms. 

Some non-coding genes have degenerate transcripts.
For example, a lncRNA gene might have one transcript
marked retained_intron because it was not processed correctly.

Here, we filter for the lncRNA transcripts only.

Typical defline of ncrna file:   
`>ENST00000516993.1 ncrna chromosome:GRCh38:1:26593940:26594041:-1 gene:ENSG00000252802.1 gene_biotype:misc_RNA transcript_biotype:misc_RNA gene_symbol:Y_RNA description:Y RNA [Source:RFAM;Acc:RF00019]`
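
Because every field after the first two is a `key:value` pair, a defline can also be parsed by key instead of by position. This is a sketch of that alternative, not the approach the `fasta_reader` class takes. The helper name `defline_to_dict` and the dictionary keys `transcript_id` and `seq_type` are hypothetical labels chosen for illustration.

```python
def defline_to_dict(defline):
    """Return the fields of an Ensembl defline as a dict keyed by field name."""
    # The description comes last and contains spaces, so split it off first.
    head, sep, description = defline.strip().partition(' description:')
    tokens = head.split(' ')
    fields = {'transcript_id': tokens[0].lstrip('>'), 'seq_type': tokens[1]}
    for token in tokens[2:]:
        key, _, value = token.partition(':')   # split on the first ':' only
        fields[key] = value
    if sep:
        fields['description'] = description
    return fields

example = ('>ENST00000516993.1 ncrna chromosome:GRCh38:1:26593940:26594041:-1 '
           'gene:ENSG00000252802.1 gene_biotype:misc_RNA transcript_biotype:misc_RNA '
           'gene_symbol:Y_RNA description:Y RNA [Source:RFAM;Acc:RF00019]')
fields = defline_to_dict(example)
print(fields['gene'], fields['transcript_biotype'])
```

Looking fields up by name avoids relying on fixed token positions such as `tokens[3]` and `tokens[5]`, at the cost of a little more work per defline.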

In [4]:
# From http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/ncrna/
FASTA_FILENAME='Homo_sapiens.GRCh38.ncrna.fa'
# FASTA_FILENAME='test.ncrna.fa'   # just for testing
CSV_FILENAME='Homo_sapiens.GRCh38.ncrna.csv'
infile = DATA_DIR + FASTA_FILENAME
outfile = DATA_DIR + CSV_FILENAME
converter = fasta_reader(infile,outfile)
converter.types_report()
converter.set_filter('lncRNA')   
converter.fasta_to_csv()

Input transcript biotypes report:
{'snRNA': 2072, 'rRNA': 71, 'scRNA': 1, 'sRNA': 6, 'Mt_rRNA': 2, 'snoRNA': 1009, 'vault_RNA': 1, 'miRNA': 1924, 'Mt_tRNA': 22, 'scaRNA': 51, 'lncRNA': 55625, 'retained_intron': 648, 'TEC': 17, 'transcribed_unitary_pseudogene': 1, 'ribozyme': 8, 'misc_RNA': 2407}
Output sequences: 55625
