# Gene structure

A **gene** is a stretch of DNA that is **transcribed into RNA** and then **translated into amino acids**.

However, not all of the DNA sequence is translated into the final protein sequence: **introns are excluded**, and **exons can be combined** in different ways to give different products, known as **protein isoforms**.

The figure below shows the **P53 gene** and its structure, as provided by the corresponding [transcript page](https://www.ensembl.org/Homo_sapiens/Transcript/Summary?g=ENSG00000141510;r=17:7668402-7687538;t=ENST00000269305) in the **Ensembl database**.


![P53 gene structure (Ensembl)](figures/ensemble_p53.png)


- Exons --> boxes
    - Coding --> filled boxes
    - Non-coding (untranslated regions, UTR) --> white boxes
- Introns --> lines
- Antisense (reverse) direction: from 5' (right) --> 3' (left)
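
Because the gene lies on the reverse strand, its sequence reads as the reverse complement of the forward-strand reference. A minimal pure-Python sketch of that operation follows; the helper name `reverse_complement` and the toy sequence are illustrative assumptions (the notebook itself relies on Biopython's `Seq.reverse_complement()`).

```python
# Complement of each DNA base (upper-case A, C, G, T only; an assumption of this sketch)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string, read 5' --> 3'."""
    return dna.translate(COMPLEMENT)[::-1]


forward = "ATGGAGGAG"                  # toy forward-strand fragment
reverse = reverse_complement(forward)  # the same region read from the opposite strand
print(reverse)
```

Applying the function twice returns the original sequence, which is a quick sanity check for any strand conversion.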



# Genome assembly

In this notebook you will learn how to map fragments to a known reference sequence. This is a common task in genome assembly and population studies.

- Genome assembly --> identify overlapping regions of DNA fragments
- Population studies --> map sequenced fragments to a reference sequence and identify mutations / variations
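
The two tasks listed above can be sketched on toy data, assuming error-free fragments for the overlap step and a read whose position on the reference is already known for the variation step. The function names `suffix_prefix_overlap` and `mismatches`, and all sequences, are hypothetical and only meant as an illustration.

```python
def suffix_prefix_overlap(left: str, right: str, min_length: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for length in range(max_len, min_length - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def mismatches(reference: str, read: str, offset: int) -> list:
    """(position, reference base, read base) for each difference of a read placed at `offset`."""
    return [(offset + i, reference[offset + i], base)
            for i, base in enumerate(read)
            if reference[offset + i] != base]


# Genome assembly: merge two fragments through their overlapping region
fragment_a = "ATGGAGGAGCCGCAG"
fragment_b = "CCGCAGTCAGATCCT"
n = suffix_prefix_overlap(fragment_a, fragment_b)
merged = fragment_a + fragment_b[n:]
print(n, merged)

# Population studies: compare a read with the reference at a known position
print(mismatches(merged, "GAGCAGCAG", offset=6))
```

Real data contain sequencing errors and unknown read positions, which is why the rest of the notebook uses alignments rather than exact comparisons.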

Specifically, you will learn how to use pairwise alignments to map exon sequences to a reference DNA sequence.




### Glossary
- DNA → Full gene DNA sequence (exons + introns) - 19,170 bps
- cDNA → Complementary DNA / Transcript (exons, red + white blocks) - 2,512 bps
    - The same gene can have multiple transcripts (different combinations of exons)
    - Only one transcript is the canonical transcript; the others are called isoforms
- CDS → Coding DNA sequence (exons, red blocks) - 1,182 bps
- Protein → Translated sequence - 393 residues (1/3 of the CDS length, excluding the stop codon)
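
The lengths in the glossary are related by simple arithmetic. A short check, using only the numbers quoted above (the variable names are illustrative):

```python
gene_length = 19170   # bps, exons + introns
cdna_length = 2512    # bps, all exons (coding + UTR)
cds_length = 1182     # bps, coding part only, including the stop codon

codons = cds_length // 3          # the CDS is read three bases at a time
residues = codons - 1             # the stop codon does not code for a residue
utr_length = cdna_length - cds_length       # exon bases that are not translated
intron_length = gene_length - cdna_length   # bases removed by splicing

print(codons, residues, utr_length, intron_length)
```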


### Dataset
The dataset FASTA file (`data/ensembl_p53.fasta`) was created from the Ensembl page of P53 transcript 1 (see above) by clicking on:

>"Export data" -> "Genomics None" -> "Next" -> "Text"




In [15]:
from Bio.Seq import Seq
# from Bio.Alphabet import generic_dna, generic_protein  (only for Biopython < 1.78)
from Bio import SeqIO

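# Note: Bio.pairwise2 and Bio.SubsMat are deprecated in newer Biopython releases
# (replaced by Bio.Align.PairwiseAligner and Bio.Align.substitution_matrices)
from Bio import pairwise2
from Bio.SubsMat import MatrixInfo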

import re

In [42]:
# Parse FASTA
seq_records = SeqIO.parse("data/ensembl_p53.fasta", "fasta")

data = {}  # {sequence_type: seq_record}
for rec in seq_records:
#     print(rec)
    # Parse the sequence type (peptide, cds, cdna, utr5, utr3, <x>_exon, intron_<x>)
    seq_type = "_".join(rec.description.split(':')[0].split()[1:])
    data[seq_type] = rec
    

for record_type in data:
    print(record_type, data[record_type], len(data[record_type].seq), sep="\n")
    print()


# Turn a nucleotide sequence into a protein sequence
print("CDS\n{}\n\n".format(data["cds"].seq))

print("Complement\n{}\n\n".format(data["cds"].seq.complement()))

print("Reverse complement\n{}\n\n".format(data["cds"].seq.reverse_complement()))

print("Transcription (DNA --> RNA)\n{}\n\n".format(data["cds"].seq.transcribe()))

print("Translation (DNA --> RNA --> AA)\n{}\n\n".format(data["cds"].seq.transcribe().translate()))

print("Translation (implicit transcription, DNA --> AA)\n{}\n\n".format(data["cds"].seq.translate()))

print("Complement translation\n{}\n\n".format(data["cds"].seq.complement().transcribe().translate()))

print("Reverse complement translation\n{}\n\n".format(data["cds"].seq.reverse_complement().transcribe().translate()))

cdna
ID: ENST00000269305.9
Name: ENST00000269305.9
Description: ENST00000269305.9 cdna:protein_coding
Number of features: 0
Seq('CTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGA...CCA')
2512

cds
ID: ENST00000269305.9
Name: ENST00000269305.9
Description: ENST00000269305.9 cds:protein_coding
Number of features: 0
Seq('ATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACA...TGA')
1182

ENSE00003753508_exon
ID: ENST00000269305.9
Name: ENST00000269305.9
Description: ENST00000269305.9 ENSE00003753508 exon:protein_coding
Number of features: 0
Seq('CTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGA...TGG')
114

ENSE00002667911_exon
ID: ENST00000269305.9
Name: ENST00000269305.9
Description: ENST00000269305.9 ENSE00002667911 exon:protein_coding
Number of features: 0
Seq('CAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGAGCCGCAGTCAGATCCTAG...ACT')
102

ENSE00002419584_exon
ID: ENST00000269305.9
Name: ENST00000269305.9
Description: ENST00000269305.9 ENSE00002419584 exon:protein_coding
Number of feat

# Map exons to the cDNA and CDS

Find position of exons using two strategies:

- Exact match (e.g. regex)
- Alignment (pairwise2 module)
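
The exact-match strategy can be illustrated without regular expressions, using `str.find` on toy sequences. The helper `locate` and the three made-up exons are assumptions for this sketch; unlike the cells in this notebook, it returns `None` rather than `(0, 0)` when there is no match.

```python
def locate(exon: str, transcript: str):
    """Return the (start, end) span of `exon` in `transcript`, or None if absent."""
    start = transcript.find(exon)
    return None if start == -1 else (start, start + len(exon))


# Toy data: three made-up exons; the first one lies entirely before the start codon
exons = {"exon_1": "CTCAAAAG", "exon_2": "ATGGAGGAG", "exon_3": "CCGCAGTGA"}
cdna = "".join(exons.values())   # all exons spliced together
cds = cdna[cdna.find("ATG"):]    # from the start codon to the stop codon

for name, seq in exons.items():
    print(name, locate(seq, cdna), locate(seq, cds))
```

Exact matching only succeeds when the whole exon is contained in the target sequence, which is the limitation that motivates the alignment strategy.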

### pairwise2 module
The module provides several alignment functions. Their names follow the pattern **\<algorithm_type\>\<match_parameter\>\<gap_parameter\>**, e.g. **localxx()**, where the algorithm type is `global` or `local`.

The match parameters are (CODE  DESCRIPTION):
- **x**     No parameters. Identical characters have score of 1, otherwise 0
- **m**     A match score is the score of identical chars, otherwise mismatch score
- **d**     A dictionary returns the score of any pair of characters
- **c**     A callback function returns scores

The gap penalty parameters are (CODE  DESCRIPTION):
- **x**     No gap penalties
- **s**     Same open and extend gap penalties for both sequences
- **d**     The sequences have different open and extend gap penalties
- **c**     A callback function returns the gap penalties
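
To make the naming scheme concrete, the sketch below splits a function name into its three parts using the codes from the two lists above. The helper `describe` is hypothetical and is not part of Biopython; it only decodes names and performs no alignment.

```python
ALGORITHM_TYPES = ("global", "local")
MATCH_CODES = {
    "x": "identical characters score 1, otherwise 0",
    "m": "match score for identical characters, otherwise mismatch score",
    "d": "dictionary giving the score of any pair of characters",
    "c": "callback function returning the scores",
}
GAP_CODES = {
    "x": "no gap penalties",
    "s": "same open and extend gap penalties for both sequences",
    "d": "different open and extend gap penalties per sequence",
    "c": "callback function returning the gap penalties",
}


def describe(function_name: str) -> tuple:
    """Split a name such as 'localds' into (algorithm, match rule, gap rule)."""
    for algorithm in ALGORITHM_TYPES:
        suffix = function_name[len(algorithm):]
        if function_name.startswith(algorithm) and len(suffix) == 2:
            match_code, gap_code = suffix
            if match_code in MATCH_CODES and gap_code in GAP_CODES:
                return algorithm, MATCH_CODES[match_code], GAP_CODES[gap_code]
    raise ValueError(f"not a recognised alignment function name: {function_name!r}")


for name in ("globalxx", "localxs", "localds"):
    print(name, "-->", describe(name))
```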

Each align function returns a list of tuples, one per alignment. Each tuple contains:
    
- seq1 aligned
- seq2 aligned
- score
- alignment start position
- alignment end position
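
A small sketch of how these five fields can be used to judge an alignment. The tuple below is written by hand in the layout listed above (it is not real pairwise2 output), and its score assumes match 1 and non-match 0.

```python
# Hand-written tuple: (seq1 aligned, seq2 aligned, score, start, end)
toy_alignment = ("ATGGAGGAGCCG", "---GACGAG---", 5.0, 3, 9)

seq1_aligned, seq2_aligned, score, begin, end = toy_alignment

# Look only at the aligned window [begin, end)
window = list(zip(seq1_aligned[begin:end], seq2_aligned[begin:end]))
aligned = sum(1 for a, b in window if a != "-" and b != "-")   # positions without gaps
identical = sum(1 for a, b in window if a == b and a != "-")   # positions with the same letter

identity = identical / aligned
score_average = score / (end - begin)
print(f"aligned={aligned} identical={identical} "
      f"identity={identity:.2f} score_average={score_average:.2f}")
```

Restricting the comparison to the `[begin, end)` window matters for local alignments, where the returned sequences also contain the unaligned flanks.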



In [36]:
# Use exact match against the cDNA (regex); (0, 0) marks an exon without an exact match
matches = []
for k in sorted(data.keys()):
    if "exon" in k:
        match = re.search(str(data[k].seq), str(data["cdna"].seq))
        matches.append((data[k].description, match.span() if match else (0, 0)))
for match in sorted(matches, key=lambda x: x[1][0]):
    print(match)

('ENST00000269305.9 ENSE00003753508 exon:protein_coding', (0, 114))
('ENST00000269305.9 ENSE00002667911 exon:protein_coding', (114, 216))
('ENST00000269305.9 ENSE00002419584 exon:protein_coding', (216, 238))
('ENST00000269305.9 ENSE00003625790 exon:protein_coding', (238, 517))
('ENST00000269305.9 ENSE00003518480 exon:protein_coding', (517, 701))
('ENST00000269305.9 ENSE00003723991 exon:protein_coding', (701, 814))
('ENST00000269305.9 ENSE00003712342 exon:protein_coding', (814, 924))
('ENST00000269305.9 ENSE00003725258 exon:protein_coding', (924, 1061))
('ENST00000269305.9 ENSE00003786593 exon:protein_coding', (1061, 1135))
('ENST00000269305.9 ENSE00003545950 exon:protein_coding', (1135, 1242))
('ENST00000269305.9 ENSE00002037735 exon:protein_coding', (1242, 2512))


In [37]:
# Use exact match against the CDS (regex); (0, 0) marks an exon without an exact match
matches = []
for k in sorted(data.keys()):
    if "exon" in k:
        match = re.search(str(data[k].seq), str(data["cds"].seq))
        matches.append((data[k].description, match.span() if match else (0, 0)))
for match in sorted(matches, key=lambda x: x[1][0]):
    print(match)

# Why do some exons not align with the CDS?

('ENST00000269305.9 ENSE00002037735 exon:protein_coding', (0, 0))
('ENST00000269305.9 ENSE00002667911 exon:protein_coding', (0, 0))
('ENST00000269305.9 ENSE00003753508 exon:protein_coding', (0, 0))
('ENST00000269305.9 ENSE00002419584 exon:protein_coding', (74, 96))
('ENST00000269305.9 ENSE00003625790 exon:protein_coding', (96, 375))
('ENST00000269305.9 ENSE00003518480 exon:protein_coding', (375, 559))
('ENST00000269305.9 ENSE00003723991 exon:protein_coding', (559, 672))
('ENST00000269305.9 ENSE00003712342 exon:protein_coding', (672, 782))
('ENST00000269305.9 ENSE00003725258 exon:protein_coding', (782, 919))
('ENST00000269305.9 ENSE00003786593 exon:protein_coding', (919, 993))
('ENST00000269305.9 ENSE00003545950 exon:protein_coding', (993, 1100))


In [73]:
# Use alignments against the CDS
# Compute some metrics to evaluate alignment quality
# The score combines match scores (substitution matrix) and gap penalties

alignments = []
for k in sorted(data.keys()):
    if "exon" in k:
        
        # Global alignment, match 1, non-match 0, no gap penalties
#         alignment = pairwise2.align.globalxx(data["cds"].seq, data[k].seq, one_alignment_only=True)[0]
        
    
        # Local alignment
        
        # match 1, non-match 0, no gap penalties 
#         alignment = pairwise2.align.localxx(data["cds"].seq, data[k].seq, one_alignment_only=True)[0]
        
        # match 1, non-match 0, gap open -10, gap extension -0.5
#         alignment = pairwise2.align.localxs(data["cds"].seq, data[k].seq, -10, -0.5, one_alignment_only=True)[0]

        # identity matrix (match 6, non-match -1), gap open -200, gap extension -0.5
        alignment = pairwise2.align.localds(data["cds"].seq, data[k].seq, MatrixInfo.ident, -200, -0.5, one_alignment_only=True)[0]

        seq1_aligned, seq2_aligned, score, alignment_begin, alignment_end = alignment
        
        # Calculate average score per alignment position
        score_average = score / float(alignment_end - alignment_begin)

        alignments.append((k, alignment_begin, alignment_end, score, score_average, alignment))

        
# print(MatrixInfo.ident)
        
# Sort by starting position of the alignment
for k, alignment_begin, alignment_end, score, score_average, alignment in sorted(alignments, 
                                                                                 key=lambda ele: ele[1]):
    print(pairwise2.format_alignment(alignment))
    print("{} start:{:>4} end:{:>4} score:{:>6} score_average:{:>4.2f} ".format(k, alignment_begin, alignment_end, 
                                                                     score, score_average))
   


ENSE00002667911_exon start:  28 end: 102 score: 444.0 score_average:6.00 
ENSE00002037735_exon start:  60 end:1211 score:1124.0 score_average:0.98 
ENSE00002419584_exon start:  74 end:  96 score: 132.0 score_average:6.00 
ENSE00003625790_exon start:  96 end: 375 score:1674.0 score_average:6.00 
ENSE00003753508_exon start: 170 end: 280 score: 198.0 score_average:1.80 
ENSE00003518480_exon start: 375 end: 559 score:1104.0 score_average:6.00 
ENSE00003723991_exon start: 559 end: 672 score: 678.0 score_average:6.00 
ENSE00003712342_exon start: 672 end: 782 score: 660.0 score_average:6.00 
ENSE00003725258_exon start: 782 end: 919 score: 822.0 score_average:6.00 
ENSE00003786593_exon start: 919 end: 993 score: 444.0 score_average:6.00 
ENSE00003545950_exon start: 993 end:1100 score: 642.0 score_average:6.00 
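
To make the local-alignment scoring concrete, here is a minimal Smith-Waterman sketch in pure Python. It is not Biopython's implementation: it assumes a single linear gap penalty instead of separate open and extend penalties, and it returns only the best score and where that score ends. The function name, the default values (match 6 and mismatch -1, as in the identity matrix comment above) and the toy sequences are assumptions for illustration.

```python
def local_alignment_score(seq1, seq2, match=6, mismatch=-1, gap=-10):
    """Best local alignment score and its (exclusive) end index in each sequence."""
    best, best_end = 0, (0, 0)
    previous = [0] * (len(seq2) + 1)           # scores of the previous row
    for i, a in enumerate(seq1, start=1):
        current = [0]                          # first column is always 0
        for j, b in enumerate(seq2, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            if score > best:
                best, best_end = score, (i, j)
        previous = current
    return best, best_end


toy_cds = "ATGGAGGAGCCGCAG"
toy_exon = "CCCGAGGAGCC"       # three non-matching bases followed by a matching stretch
score, (end1, end2) = local_alignment_score(toy_cds, toy_exon)
print(score, end1, end2)
```

Here the eight matching bases give a score of 8 x 6 = 48, and the non-matching prefix is simply left out of the alignment instead of being penalised.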



## Questions

- Why, even with high gap penalties, do the exons that overlap both the CDS and a UTR not align well?

- Why does the non-overlapping exon have a non-zero average score?


In [91]:
# Translate the exons
# Identify those that code for protein fragments
# Align translated exons against the protein (AA) sequence
# Note: the DNA of an exon can have a non-zero phase

# ENSE00003753508 (1, 5'-UTR, non-coding)
# ENSE00002667911 (2, 5'-UTR + coding)
# ...
# ENSE00002037735 (11, coding + 3'-UTR)


# Use alignments (different parameters)
alignments = []
for k in sorted(data.keys()):
    if "exon" in k:
        
        # Try all possible reading frames
        for i in range(0, 3):
        
            # Translate the exon (remove stop codon symbols)
            exon_peptide = data[k].seq[i:].translate(stop_symbol="")

            alignment = pairwise2.align.localds(data["peptide"].seq, exon_peptide, MatrixInfo.blosum62, -10, -0.5, one_alignment_only=True)[0]

            seq1_aligned, seq2_aligned, score, alignment_begin, alignment_end = alignment
            
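            # Counted over the full padded sequences, not only the local alignment window
            identity = 0  # aligned positions with identical residues
            matches = 0   # aligned positions without a gap in either sequence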
            for l1, l2 in zip(seq1_aligned, seq2_aligned):
                if l1 != '-' and l2 != '-':
                    matches += 1
                    if l1 == l2:
                        identity += 1

            alignments.append((k, alignment_begin, alignment_end, score, identity, matches, alignment))

# Print the Blosum62 matrix
# print(sorted(MatrixInfo.blosum62.items(), key=lambda x: x[1], reverse=True))
        
# Sort by starting position of the alignment
for k, alignment_begin, alignment_end, score, identity, matches, alignment in sorted(alignments, 
                                                                                 key=lambda ele: ele[1]):
    if k in ['ENSE00002037735_exon', 'ENSE00003753508_exon', 'ENSE00002667911_exon']:
        
        
        # Exon id, start, end, score, aligned (non-gap) positions, identical positions
        print("{}{:>5}{:>5}{:>6.1f}{:>6}{:>4}".format(k, alignment_begin, alignment_end, 
                                                score, 
                                                matches, 
                                                identity))

#         print(alignment)



ENSE00003753508_exon    8   22  27.0    37   7
ENSE00002667911_exon    9   33 130.0    24  24
ENSE00002037735_exon   74  221  45.0   345  45
ENSE00002667911_exon   79  108  28.0    32  10
ENSE00003753508_exon   98  133  27.5    34  12
ENSE00002037735_exon  129  197  47.5   366  37
ENSE00003753508_exon  152  157  20.0    35   4
ENSE00002667911_exon  351  374  26.0    33   7
ENSE00002037735_exon  367  393 137.0    26  26
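
The reading-frame loop above can be illustrated without Biopython. The sketch below builds the standard genetic code table, translates a toy exon in its three forward frames, and keeps the frame whose peptide occurs in a short protein fragment. The toy exon, the function name `translate` and the use of a substring test instead of a local alignment are assumptions of this sketch; it also expects upper-case A, C, G, T only.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' marks a stop codon)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str, stop_symbol: str = "*") -> str:
    """Translate complete codons only; trailing bases that do not fill a codon are dropped."""
    usable = len(dna) - len(dna) % 3
    peptide = "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))
    return peptide.replace("*", stop_symbol)


protein = "MEEPQSDPSV"                  # first residues of the P53 CDS translation
toy_exon = "CCATGGAGGAGCCGCAGTCAGAT"    # two extra bases in front of the coding part

frames = {i: translate(toy_exon[i:], stop_symbol="") for i in range(3)}
in_frame = [i for i, peptide in frames.items() if peptide in protein]

for i, peptide in frames.items():
    print(i, peptide)
print("frame(s) matching the protein:", in_frame)
```

Only one of the three frames reproduces the protein sequence, which is why the notebook cell tries every offset before aligning.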
