# Exercise 007

<a href="https://colab.research.google.com/github/FAIRChemistry/PythonProgramming2025/blob/master/exercises/Exercise007.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [123]:
# Please execute this cell to download the necessary data
!wget https://raw.githubusercontent.com/JR-1991/PythonProgramming2025/master/scripts/utils.py
!wget https://raw.githubusercontent.com/JR-1991/PythonProgramming2025/master/data/single_sequence.fasta

from utils import CODON_TABLE, to_triplets

--2025-06-24 13:38:05--  https://raw.githubusercontent.com/JR-1991/PythonProgramming2025/master/scripts/utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1411 (1.4K) [text/plain]
Saving to: ‘utils.py.6’


2025-06-24 13:38:05 (19.1 MB/s) - ‘utils.py.6’ saved [1411/1411]

--2025-06-24 13:38:05--  https://raw.githubusercontent.com/JR-1991/PythonProgramming2025/master/data/single_sequence.fasta
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 877 [text/plain]
Saving to: ‘single_sequence.fasta.6’


2025-06-24 13:38:05 (51.3 MB/s)

# DNASequence class

Construct a `DNASequence` class that contains the following attributes:

* `id`
* `sequence`
* `organism`
* `gc_content`
* `length`
* `reverse_complement`

Next, implement methods for your class that perform the following tasks:

* `to_amino_acid`: Converts the nucleic acid sequence to an amino acid sequence.
* `align`: Takes another sequence and aligns it against the instance sequence.
* `__repr__`: Define how the contents of your class should be printed.
* `from_fasta`: Define a classmethod that parses a single FASTA entry into your class.

Demonstrate your class by parsing the `single_sequence.fasta` file either manually or via the `from_fasta`-classmethod.

**Tips**

> * Feel free to use the `get_identity` function from the previous exercise.
> * When implementing the `classmethod` make sure to check if the format is correct. We have so far followed the `>[Header]\n[Sequence]` format.
> * Translate your sequence using the supported `to_triplets` function and `CODON_TABLE` dictionary.
> * Not familiar with reverse complements? Find more info [here](http://genewarrior.com/docs/exp_revcomp.jsp)
> * Don't hesitate to use the `dataclass` decorator. It already takes care of some boilerplate for you. Learn how to implement `__post_init__` for further customization [here](https://docs.python.org/3/library/dataclasses.html#post-init-processing)
> * Python does not enforce type hints at runtime, so you have limited control over what flows into your class. [Pydantic](https://docs.pydantic.dev/latest/) is an excellent tool for solving this and other issues. Try it out to make your life easier!
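
The `dataclass` tip can be sketched as follows. This is a minimal illustration, not a reference solution: the class name `SequenceRecord`, the upper-casing step, and the restriction to the four bases `ACGT` are assumptions made for the example. The derived attributes are excluded from `__init__` and computed in `__post_init__`.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    """Hypothetical stand-in for the exercise class; names are illustrative."""

    id: str
    organism: str
    sequence: str
    # Derived attributes are not constructor arguments; they are filled in afterwards
    length: int = field(init=False)
    gc_content: float = field(init=False)

    def __post_init__(self):
        # Normalise and validate the input before deriving anything from it
        self.sequence = self.sequence.strip().upper()
        unknown = set(self.sequence) - set("ACGT")
        if not self.sequence or unknown:
            raise ValueError(f"Invalid DNA sequence, unexpected symbols: {unknown}")
        self.length = len(self.sequence)
        self.gc_content = (self.sequence.count("G") + self.sequence.count("C")) / self.length


record = SequenceRecord(id="seq1", organism="Example organism", sequence="atgcgt")
print(record.length, record.gc_content)  # 6 0.5
```

The decorator also generates a readable `__repr__`, so `print(record)` shows all fields without extra code.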

In [124]:
# Execute this cell to use all packages
%pip install biopython

from Bio import pairwise2


def get_identity(seq1: str, seq2: str):
    """Aligns two sequences using BioPython

    Args:
        seq1 (str): Query sequence to align to
        seq2 (str): Target sequence to align with

    Returns:
        float: Identity of the resulting alignment

    """
    return pairwise2.align.globalxx(seq1, seq2, score_only=True) / len(seq1)
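
To see what this ratio measures: `globalxx` awards one point per matching position and applies no mismatch or gap penalties, so its score corresponds to the length of the longest common subsequence of the two inputs. The pure-Python sketch below reproduces that idea without Biopython. The function names `lcs_length` and `identity_sketch` are hypothetical, and this is not how Biopython implements the alignment internally.

```python
def lcs_length(seq1: str, seq2: str) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    previous = [0] * (len(seq2) + 1)
    for a in seq1:
        current = [0]
        for j, b in enumerate(seq2, start=1):
            if a == b:
                # Extend the best alignment of both prefixes by one match
                current.append(previous[j - 1] + 1)
            else:
                # Skip a character in either sequence, whichever scores higher
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def identity_sketch(seq1: str, seq2: str) -> float:
    """Matches divided by the length of the first sequence."""
    return lcs_length(seq1, seq2) / len(seq1)


# Five of the six positions can be matched (ATG-GT)
print(lcs_length("ATGCGT", "ATGAGT"))  # 5
```

Because the score is divided by `len(seq1)` only, the value is not symmetric when the two sequences differ in length.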



In [135]:
class DNASequence:
    def __init__(self, id: int, organism: str, sequence: str):
        self.id = id
        self.organism = organism
        self.sequence = sequence
        self.length = len(self.sequence)
        # Fraction of G and C bases; an empty sequence would divide by zero here
        self.gc_content = (self.sequence.count('G') + self.sequence.count('C')) / self.length

    def reverse_complement(self):
        # Complement every base, then reverse the result
        complement_map = str.maketrans('ATCG', 'TAGC')
        return self.sequence.translate(complement_map)[::-1]

    def align(self, other: "DNASequence") -> float:
        if not isinstance(other, DNASequence):
            raise TypeError("Alignment must be done with a DNASequence.")
        return get_identity(self.sequence, other.sequence)

    @classmethod
    def from_fasta(cls, file_FASTA):
        # Expects alternating lines: a ">organism | id" header, then its sequence
        DNA_sequence = []
        with open(file_FASTA, "r") as handle:
            data_reading = [line.strip() for line in handle]
        for i in range(0, len(data_reading), 2):
            header = data_reading[i]
            # Check the ">[Header]\n[Sequence]" format before parsing
            if not header.startswith(">") or i + 1 >= len(data_reading):
                raise ValueError(f"Malformed FASTA entry at line {i + 1}")
            organism, seq_id = header.lstrip(">").split("|", 1)
            DNA_sequence.append(
                cls(id=seq_id.strip(), organism=organism.strip(), sequence=data_reading[i + 1])
            )
        return DNA_sequence
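
Without a `__repr__` method, `print` falls back to the default `<__main__.DNASequence object at 0x...>` form. The sketch below shows one possible representation. The cut-down class `MiniSequence` and the 10-base preview length are hypothetical choices for illustration; the same method body could be placed in `DNASequence`.

```python
class MiniSequence:
    """Hypothetical cut-down class used only to illustrate __repr__."""

    def __init__(self, id: str, organism: str, sequence: str):
        self.id = id
        self.organism = organism
        self.sequence = sequence

    def __repr__(self) -> str:
        # Show only the first bases so long sequences stay readable
        preview = self.sequence[:10] + ("..." if len(self.sequence) > 10 else "")
        return (
            f"{type(self).__name__}(id={self.id!r}, organism={self.organism!r}, "
            f"length={len(self.sequence)}, sequence={preview!r})"
        )


example = MiniSequence("seq1", "Example organism", "ATGCGTTCTCGCTATTTG")
print(example)
# MiniSequence(id='seq1', organism='Example organism', length=18, sequence='ATGCGTTCTC...')
```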

In [126]:
#check
complement_map = str.maketrans('ATCG', 'TAGC')
sequence = "ATGC"
# Complement each base, then reverse the string
reverse_complement = sequence.translate(complement_map)[::-1]
print(reverse_complement)  # Output: GCAT


GCAT


In [132]:
def to_amino_acid(sequence: str) -> str:
    """Translates a nucleotide sequence into an amino acid sequence.

    Args:
        sequence (str): A nucleotide sequence.

    Returns:
        str: The translated amino acid sequence.

    """

    BASES = {"A", "G", "C", "T"}

    # Subset check: a valid sequence does not have to contain all four bases
    if not set(sequence) <= BASES:
        raise ValueError(f"This sequence contains unknown bases! {set(sequence) - BASES}")

    if len(sequence) % 3 != 0:
        raise ValueError("The sequence length should be divisible by 3!")

    amino_seq = []
    triplets = to_triplets(sequence)

    for triplet in triplets:
        amino_acid = CODON_TABLE[triplet]
        amino_seq.append(amino_acid)

    return "".join(amino_seq)
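
A small worked example may help to check the translation logic by hand. `to_triplets` and `CODON_TABLE` come from `utils.py`, which is not shown here, so the sketch uses hypothetical stand-ins: `split_triplets` and a three-codon `MINI_CODON_TABLE`, with `_` marking a stop codon as in the output of this notebook. Note the subset comparison `<=`, which accepts sequences that happen to lack one of the four bases.

```python
# Hypothetical stand-ins for the helpers imported from utils.py
MINI_CODON_TABLE = {"ATG": "M", "GCT": "A", "TAA": "_"}


def split_triplets(sequence: str) -> list[str]:
    """Cut a sequence into consecutive, non-overlapping codons."""
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]


def translate(sequence: str) -> str:
    """Translate a DNA sequence using the reduced codon table above."""
    if not set(sequence) <= set("ACGT"):
        raise ValueError("Unknown bases in sequence")
    if len(sequence) % 3 != 0:
        raise ValueError("Length must be divisible by 3")
    return "".join(MINI_CODON_TABLE[codon] for codon in split_triplets(sequence))


print(split_triplets("ATGGCTTAA"))  # ['ATG', 'GCT', 'TAA']
print(translate("ATGGCTTAA"))       # MA_
```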

In [134]:
dna_sequences = DNASequence.from_fasta("single_sequence.fasta")

for dna in dna_sequences:
    print(dna)
    protein = to_amino_acid(dna.sequence)
    print(protein)
    print(dna.sequence)
    print("Reverse complement:", dna.reverse_complement())
dna1 = DNASequence(id=1, organism="Human", sequence="ATGCGT")
dna2 = DNASequence(id=2, organism="Mouse", sequence="ATGAGT")
print(dna1.align(dna2))

<__main__.DNASequence object at 0x7a0d39239f10>
MRSRYLLHQYFVQVQFAAPSPAPTDSMSYIIPYRLSLNINKMNICNT_LSYQL_TKKNHPNLDGFS_FRGTSCADEPLQISKFRLQPKRSGKKHFLPLGC_STHRNVHGSGDAYLHLPRPSPAPGVRSFCFDRAYDPPHLPYRRSRYSLPATYAWCAQG_SPESVERFLPRRRLLFSPWR_AVRSDLSAQSPRERQQRPRYAGTRRGCADRLHRRGSAGTVDRAKRSDPADRFPGTDVLASRGR_CPDARHRRFSRRDIGGLPTG_PRPGTQGRGSAVTDVYLHGLLK_
ATGCGTTCTCGCTATTTGTTACATCAATATTTTGTTCAGGTACAGTTTGCAGCGCCGTCGCCAGCGCCAACGGATTCCATGTCATATATTATTCCATATAGATTAAGTTTAAATATTAATAAAATGAATATTTGCAATACGTAATTATCTTACCAGCTATAGACAAAAAAAAACCATCCAAATCTGGATGGCTTTTCATAATTCAGAGGAACTAGCTGCGCTGACGAACCGCTTCAAATAAGCAAATTCCGGTTGCAACCGAAACGTTCAGGGAAGAAACACTTCCTGCCATTGGGATGCTGATCAACTCATCGCAATGTTCACGGGTCAGGCGACGCATACCTTCACCTTCCGCGCCCATCACCAGCGCCAGGCGTCCGGTCATTTTGCTTTGATAGAGCGTATGATCCGCCTCACCTGCCGTACCGACGATCCAGATATTCTCTTCCTGCAACATACGCATGGTGCGCGCAAGGTTAGTCACCCGAATCAGTGGAACGCTTTCTGCCGCGCCGCAGGCTACTTTTTTCGCCGTGGCGTTGAGCTGTGCGGAGCGATCTTTCGGCACAATCACCGCGTGAACGCCAGCAGCGTCCGCGCTACGCAGGCACGCGCCGAGGTTGTGCGGATCGGTTACACCGTCGAGGATCAGCAGGAACGGT