# GenCode / Ensembl
At the time of writing, the current GenCode release for human is Release 41 (GRCh38.p13), which corresponds to Ensembl release 107.
All GenCode annotation uses Ensembl IDs.
The sequence download pages at GenCode contain links to the Ensembl 
[FTP](https://useast.ensembl.org/info/data/ftp/index.html).  

Human data available for download (sequences in FASTA, annotation in GFF):
1. genomic DNA
1. transcript cDNA (spliced, without introns; includes UTRs)
1. transcript [CDS](http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/cds/) (coding sequence only; without introns or UTRs)
1. noncoding [ncRNA](http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/ncrna/)
1. gene annotation in [GFF](https://www.gencodegenes.org/human/)
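The Ensembl FTP directories serve these FASTA files gzip-compressed (`.fa.gz`). The following is a minimal sketch, not part of the original notebook, of a helper that opens a FASTA file as text whether or not it has been decompressed. The name `open_fasta` and the tiny demo record are hypothetical.

```python
import gzip
import os
import tempfile

def open_fasta(path):
    """Open a FASTA file for reading as text, gzipped or not (hypothetical helper)."""
    if path.endswith('.gz'):
        # 'rt' makes gzip decode bytes to str, like a normal text file.
        return gzip.open(path, 'rt')
    return open(path, 'r')

# Demo on a made-up, two-line gzipped FASTA file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'tiny.fa.gz')
    with gzip.open(path, 'wt') as out:
        out.write('>ENST_FAKE.1 cds\nATGAAATAG\n')
    with open_fasta(path) as fasta:
        lines = [line.strip() for line in fasta]
print(lines)
```

With a helper like this, the same line-by-line parsing loop can run on the downloaded `.fa.gz` file directly, without a separate decompression step.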


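Process the human CDS sequence file: convert each FASTA record to one CSV row (transcript ID, gene ID, sequence) and count the transcript biotypes.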

Typical defline of cds file:    
`>ENST00000631435.1 cds chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]`

Typical defline of ncrna file:   
`>ENST00000516993.1 ncrna chromosome:GRCh38:1:26593940:26594041:-1 gene:ENSG00000252802.1 gene_biotype:misc_RNA transcript_biotype:misc_RNA gene_symbol:Y_RNA description:Y RNA [Source:RFAM;Acc:RF00019]`
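After the transcript ID and the sequence type, every field in these deflines is a `key:value` pair, so a defline can be parsed by key rather than by position. The following is a minimal sketch, not part of the original notebook. The function name `parse_defline` and the dictionary keys `transcript_id` and `sequence_type` are hypothetical; the other keys are taken verbatim from the defline.

```python
def parse_defline(defline):
    """Split an Ensembl-style FASTA defline into a dict of its fields."""
    text = defline.strip()
    if text.startswith('>'):
        text = text[1:]
    # The description is free text containing spaces, so split it off first.
    head, sep, description = text.partition(' description:')
    tokens = head.split(' ')
    fields = {
        'transcript_id': tokens[0],
        'sequence_type': tokens[1] if len(tokens) > 1 else '',
    }
    for token in tokens[2:]:
        # Split on the first colon only; location values contain more colons.
        key, _, value = token.partition(':')
        fields[key] = value
    if sep:
        fields['description'] = description
    return fields

example = ('>ENST00000516993.1 ncrna chromosome:GRCh38:1:26593940:26594041:-1 '
           'gene:ENSG00000252802.1 gene_biotype:misc_RNA '
           'transcript_biotype:misc_RNA gene_symbol:Y_RNA '
           'description:Y RNA [Source:RFAM;Acc:RF00019]')
fields = parse_defline(example)
print(fields['gene'], fields['transcript_biotype'])
# prints: ENSG00000252802.1 misc_RNA
```

Looking fields up by key keeps working if a defline omits an optional field such as `gene_symbol`, whereas positional indexing relies on the field order staying fixed.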

In [12]:
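class fasta_reader():
    '''Convert an Ensembl FASTA file (cds or ncrna) to CSV and count transcript biotypes.'''
    def __init__(self,infile,outfile):
        self.infile = infile
        self.outfile = outfile
        self.type_dict = {}   # transcript biotype -> number of sequences
    def accumulate_types(self,biotype):
        '''Increment the count for one transcript biotype.'''
        self.type_dict[biotype] = self.type_dict.get(biotype,0)+1
    def print_one_sequence(self,handle,tran,gene,biotype,seq):
        '''Write one CSV row: transcript ID, gene ID, sequence.'''
        # seq is None only before the first defline has been read.
        if seq is not None:
            self.accumulate_types(biotype)
            outstr = ','.join((tran,gene,seq))
            handle.write(outstr+'\n')
    def fasta_to_csv(self):
        '''Read the FASTA file and write one CSV row per sequence.'''
        FASTA_DEFCHAR = '>'
        self.type_dict={}   # reset the counts for this run
        with open(self.outfile,'w') as handle:
            with open(self.infile,'r') as fasta:
                transcript_id = None
                gene_id = None
                biotype = None
                next_seq = None
                for line in fasta:
                    if line[0]==FASTA_DEFCHAR:
                        # A new defline ends the previous record, so write it out.
                        self.print_one_sequence(handle,transcript_id,gene_id,biotype,next_seq)
                        # The description is free text with spaces; drop it before splitting.
                        desc_index = line.find(' description:')
                        if desc_index>0:
                            line = line[:desc_index]
                        tokens = line.split(' ')
                        transcript_id = tokens[0][1:]    # strip the leading '>'
                        gene_id = tokens[3][5:]          # strip the 'gene:' prefix
                        longbiotype = tokens[5].strip()  # 'transcript_biotype:<type>'
                        biotype = longbiotype.split(':')[1]
                        next_seq = ""
                    else:
                        # Sequence lines are concatenated into one string per record.
                        next_seq = next_seq + line.strip()
            # Write the last record, which no following defline triggers.
            self.print_one_sequence(handle,transcript_id,gene_id,biotype,next_seq)
    def summary_report(self):
        '''Print the number of sequences seen per transcript biotype.'''
        print("Transcript biotypes report:")
        print(self.type_dict)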

In [13]:
# From https://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/cds/
DATA_DIR = '/Users/jasonmiller/WVU/MDPI/GenCode/'
FASTA_FILENAME='Homo_sapiens.GRCh38.cds.all.fa'
FASTA_FILENAME='test.cds.fa'   # just for testing
CSV_FILENAME='Homo_sapiens.GRCh38.cds.csv'
infile = DATA_DIR + FASTA_FILENAME
outfile = DATA_DIR + CSV_FILENAME
converter = fasta_reader(infile,outfile)
converter.fasta_to_csv()
converter.summary_report()

Transcript biotypes report:
{'protein_coding': 7, 'nonsense_mediated_decay': 2}


In [15]:
# From http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/ncrna/
DATA_DIR = '/Users/jasonmiller/WVU/MDPI/GenCode/'
FASTA_FILENAME='Homo_sapiens.GRCh38.ncrna.fa'
FASTA_FILENAME='test.ncrna.fa'   # just for testing
CSV_FILENAME='Homo_sapiens.GRCh38.ncrna.csv'
infile = DATA_DIR + FASTA_FILENAME
outfile = DATA_DIR + CSV_FILENAME
converter = fasta_reader(infile,outfile)
converter.fasta_to_csv()
converter.summary_report()

Transcript biotypes report:
{'misc_RNA': 18}
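The CSV files written above have no header row; each line holds a transcript ID, a gene ID, and a sequence. The following is a minimal sketch, not part of the original notebook, of reading such a file back and computing sequence lengths. The name `read_sequence_csv` and the demo rows are hypothetical.

```python
import csv
import os
import tempfile

def read_sequence_csv(path):
    """Yield (transcript_id, gene_id, sequence) tuples from a header-less CSV."""
    with open(path, newline='') as handle:
        for row in csv.reader(handle):
            # Skip blank or malformed lines rather than failing on them.
            if len(row) == 3:
                yield tuple(row)

# Demo on a made-up two-row file in the same three-column layout.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'tiny.csv')
    with open(path, 'w') as out:
        out.write('ENST_FAKE.1,ENSG_FAKE.1,ATGAAATAG\n')
        out.write('ENST_FAKE.2,ENSG_FAKE.1,ATGTAG\n')
    records = list(read_sequence_csv(path))

lengths = {tran: len(seq) for tran, gene, seq in records}
print(lengths)
# prints: {'ENST_FAKE.1': 9, 'ENST_FAKE.2': 6}
```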
