## Intro to object-oriented programming
### 1) What is it?
It's a paradigm in software design that groups data and functionality into objects
that are easy to use and reuse, and that give the end user access to complex functionality without having to worry about how it works.
Typical objects that you deal with in coding are, for example:
- pandas data frames
- matplotlib figures and axes
- pathlib Path objects
### 2) Why you should know about it
Unless you start writing reusable code, you won't need to worry about making your own classes. However, understanding the concept makes dealing with objects in your code a lot easier. For example, you will understand the difference between df.info(), a method, and df.columns, an attribute. Overall, it doesn't hurt to look a little more closely at how a programming language like Python is designed if you want to become proficient in data science.
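
The attribute/method distinction can be tried out on any object, not only a data frame. Here is a minimal sketch using a pathlib Path from the standard library; the path `data/MASTL.fasta` is only an illustrative assumption and the file does not need to exist.

```python
from pathlib import Path

# Hypothetical path, used only for illustration; nothing is read from disk
fasta_path = Path("data/MASTL.fasta")

# Attributes are read like plain values, without parentheses
file_name = fasta_path.name
extension = fasta_path.suffix

# Methods are functions attached to the object, so they are called with parentheses
text_path = fasta_path.with_suffix(".txt")

print(file_name, extension, text_path.name)
```

The same pattern applies to df.columns (no parentheses) and df.info() (parentheses).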

## The Python class
At the heart of Python OOP is the class. This tool allows you to design your
own objects with their own methods and attributes, and to understand what you are doing when using
other people's code.
So, for example, when you write:
df = pd.DataFrame()
or fig, ax = plt.subplots()
you are creating objects (instances) of classes: a DataFrame in the first case, and a Figure and an Axes in the second.

In [5]:
# Let's make a simple class to deal with a DNA sequence, for example.

# The first step is to define the class:

class DnaSequence: # Python class names are written in CamelCase
    pass

# Now we can create an instance of the class:
seq = DnaSequence()
print(type(seq))

<class '__main__.DnaSequence'>


Now let's do something with the class: we can have it hold data and also perform functions. In our case, for example, we would store metadata and the nucleotide sequence and
then perform steps such as generating a reverse complement.

In [6]:
class DnaSequence:
    def __init__(self, sequence, species, gene_name):
        self.sequence = sequence
        self.length = len(sequence)
        self.species = species
        self.gene_name = gene_name

OK, admittedly this looks strange. What's the self? And the __init__? You don't have to worry too much about it. self is just a way to refer to the object itself and attach things to it. The __init__ function runs automatically whenever a new object is created, and it is where you define the basic attributes that every object of the class starts with.
Let's use the class now to instantiate an actual sequence object that we can use.

In [7]:
sequence = 'ACTTTGAACCCAGTTGGCGGGAGTGGCTGC'
species = 'H. sapiens'
gene_name = 'MASTL_mRNA'

seq = DnaSequence(sequence, species, gene_name)
print(seq.sequence)
print(seq.length)
print(seq.species)
print(seq.gene_name)

ACTTTGAACCCAGTTGGCGGGAGTGGCTGC
30
H. sapiens
MASTL_mRNA
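
To see what self refers to, it can help to build two objects from the same class and check that each one keeps its own data. This is a small sketch; the second sequence, its species and its gene name are made-up values for illustration.

```python
class DnaSequence:
    def __init__(self, sequence, species, gene_name):
        # self is the new object being created; each assignment attaches data to it
        self.sequence = sequence
        self.length = len(sequence)
        self.species = species
        self.gene_name = gene_name

# Two objects built from the same class keep separate attributes
human_seq = DnaSequence("ACTTTGAACC", "H. sapiens", "MASTL_mRNA")
toy_seq = DnaSequence("ATGC", "hypothetical species", "toy_gene")  # made-up values

print(human_seq.length)  # 10
print(toy_seq.length)    # 4
print(vars(toy_seq))     # every attribute that __init__ attached to this object
```

The built-in vars() returns the attributes as a dictionary, which shows how close a data-only class is to a plain dict.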


So basically we have grouped a bunch of data (attributes) related to the sequence together. Not very exciting! We could have used a simple dictionary to do this.
Classes become more powerful when we give them functions (methods).
So data connected to a class are called attributes, and functions are called methods.


In [8]:
class DnaSequence:
    def __init__(self, sequence, species, gene_name):
        self.sequence = sequence
        self.length = len(sequence)
        self.species = species
        self.gene_name = gene_name

    def reverse_complement(self):
        complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
        return ''.join([complement[base] for base in self.sequence[::-1]])
    
sequence = 'ACTTTGAACCCAGTTGGCGGGAGTGGCTGC'
species = 'H. sapiens'
gene_name = 'MASTL_mRNA'

seq = DnaSequence(sequence, species, gene_name)
print(seq.sequence) # note: the attribute doesn't need parentheses
print(seq.reverse_complement()) # note: the method needs parentheses

ACTTTGAACCCAGTTGGCGGGAGTGGCTGC
GCAGCCACTCCCGCCAACTGGGTTCAAAGT


So what's the point of this?
1. It can be useful to group attributes and functions together when writing more complex code. It makes the code easier to maintain and to refactor.
2. It's a common way to package functionality that other people use, like pandas DataFrames or matplotlib figures and axes.

So now you should understand that with df.columns you are accessing an attribute of the object, and with df.info() or df.reset_index() you are calling a method: a function that someone wrote and attached to df to help you analyse data.
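
If you are ever unsure whether a name on an object is an attribute or a method, you can ask Python. The sketch below uses a stripped-down DnaSequence and a helper called describe, which is a hypothetical name introduced here for illustration. It relies on the built-in callable(), so an attribute that happens to store a function would also be reported as a method.

```python
class DnaSequence:
    def __init__(self, sequence):
        self.sequence = sequence

    def reverse_complement(self):
        complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
        return ''.join(complement[base] for base in reversed(self.sequence))

def describe(obj, name):
    """Return 'method' if obj.<name> can be called, otherwise 'attribute'."""
    return 'method' if callable(getattr(obj, name)) else 'attribute'

seq = DnaSequence('ATGC')
print(describe(seq, 'sequence'))            # attribute
print(describe(seq, 'reverse_complement'))  # method
```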

Let's explore this a bit more by comparing a more complex bioinformatics program written as a step-by-step recipe in a notebook with the same program written as a class that we import from another file.
Our aim now is to read in a FASTA file with the sequence information and metadata of an mRNA, and find the ORF (open reading frame) in that sequence.

In [20]:
# Step 1: read the file

fasta_dict = {}
with open('MASTL.fasta') as file:
    for line in file.read().splitlines():
        if line.startswith('>'):
            # Header line: store it and start a fresh sequence
            fasta_dict['metadata'] = line
            fasta_dict['sequence'] = ''
        else:
            # Sequence line: append it to what we have so far
            fasta_dict['sequence'] += line
print(fasta_dict['metadata'])
print(fasta_dict['sequence'])

>NM_001172303.3 Homo sapiens microtubule associated serine/threonine kinase like (MASTL), transcript variant 1, mRNA
ACTTTGAACCCAGTTGGCGGGAGTGGCTGCTCGCGGAGGGGCAGTGTCTGCGGGGCCGCTGTATGCTGTCCAGCGATGGATCCCACCGCGGGAAGCAAGAAGGAGCCTGGAGGAGGCGCGGCGACTGAGGAGGGCGTGAATAGGATCGCAGTGCCAAAACCGCCCTCCATTGAGGAATTCAGCATAGTGAAGCCCATTAGCCGGGGCGCCTTCGGGAAAGTGTATCTGGGGCAGAAAGGCGGCAAATTGTATGCAGTAAAGGTTGTTAAAAAAGCAGACATGATCAACAAAAATATGACTCATCAGGTCCAAGCTGAGAGAGATGCACTGGCACTAAGCAAAAGCCCATTCATTGTCCATTTGTATTATTCACTGCAGTCTGCAAACAATGTCTACTTGGTAATGGAATATCTTATTGGGGGAGATGTCAAGTCTCTCCTACATATATATGGTTATTTTGATGAAGAGATGGCTGTGAAATATATTTCTGAAGTAGCACTGGCTCTAGACTACCTTCACAGACATGGAATCATCCACAGGGACTTGAAACCGGACAATATGCTTATTTCTAATGAGGGTCATATTAAACTGACGGATTTTGGCCTTTCAAAAGTTACTTTGAATAGAGATATTAATATGATGGATATCCTTACAACACCATCAATGGCAAAACCTAGACAAGATTATTCAAGAACCCCAGGACAAGTGTTATCGCTTATCAGCTCGTTGGGATTTAACACACCAATTGCAGAAAAAAATCAAGACCCTGCAAACATCCTTTCAGCCTGTCTGTCTGAAACATCACAGCTTTCTCAAGGACTCGTATGCCCTATGTCTGTAGATCAAAAGGACACTACGCCTTATTCTAGCAAATTACTAAAAT

In [21]:
dna_sequence = fasta_dict['sequence']

# Define the stop codons
stop_codons = ['TAA', 'TAG', 'TGA']

# Initialize list to hold all open reading frames (ORFs)
orfs = []

# Find all ATG indices (start codons)
start_indices = []
index = dna_sequence.find('ATG')
while index != -1:
    start_indices.append(index)
    index = dna_sequence.find('ATG', index + 1)

# For each start codon, find the first stop codon in frame
for start_index in start_indices:
    for i in range(start_index, len(dna_sequence), 3):
        codon = dna_sequence[i:i+3]
        if codon in stop_codons:
            orfs.append(dna_sequence[start_index:i+3])
            break

# Find the longest ORF
longest_orf = max(orfs, key=len) if orfs else ''
print(longest_orf)

ATGGATCCCACCGCGGGAAGCAAGAAGGAGCCTGGAGGAGGCGCGGCGACTGAGGAGGGCGTGAATAGGATCGCAGTGCCAAAACCGCCCTCCATTGAGGAATTCAGCATAGTGAAGCCCATTAGCCGGGGCGCCTTCGGGAAAGTGTATCTGGGGCAGAAAGGCGGCAAATTGTATGCAGTAAAGGTTGTTAAAAAAGCAGACATGATCAACAAAAATATGACTCATCAGGTCCAAGCTGAGAGAGATGCACTGGCACTAAGCAAAAGCCCATTCATTGTCCATTTGTATTATTCACTGCAGTCTGCAAACAATGTCTACTTGGTAATGGAATATCTTATTGGGGGAGATGTCAAGTCTCTCCTACATATATATGGTTATTTTGATGAAGAGATGGCTGTGAAATATATTTCTGAAGTAGCACTGGCTCTAGACTACCTTCACAGACATGGAATCATCCACAGGGACTTGAAACCGGACAATATGCTTATTTCTAATGAGGGTCATATTAAACTGACGGATTTTGGCCTTTCAAAAGTTACTTTGAATAGAGATATTAATATGATGGATATCCTTACAACACCATCAATGGCAAAACCTAGACAAGATTATTCAAGAACCCCAGGACAAGTGTTATCGCTTATCAGCTCGTTGGGATTTAACACACCAATTGCAGAAAAAAATCAAGACCCTGCAAACATCCTTTCAGCCTGTCTGTCTGAAACATCACAGCTTTCTCAAGGACTCGTATGCCCTATGTCTGTAGATCAAAAGGACACTACGCCTTATTCTAGCAAATTACTAAAATCATGTCTTGAAACAGTTGCCTCCAACCCAGGAATGCCTGTGAAGTGTCTAACTTCTAATTTACTCCAGTCTAGGAAAAGGCTGGCCACATCCAGTGCCAGTAGTCAATCCCACACCTTCATATCCAGTGTGGAATCAGAATGCCACAGCAGTCCCAAATGGGAAAAAGATTGCCAGGAAAGTGATGAAGCAT
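
The ORF search above is easier to check on a sequence short enough to follow by hand. The sketch below wraps the same logic in a function; the name find_longest_orf and the toy sequence are assumptions made for illustration.

```python
def find_longest_orf(dna_sequence):
    """Return the longest ATG-to-stop stretch read in frame, or '' if there is none."""
    stop_codons = {'TAA', 'TAG', 'TGA'}
    orfs = []
    start = dna_sequence.find('ATG')
    while start != -1:
        # Walk codon by codon, staying in frame with this start codon
        for i in range(start, len(dna_sequence) - 2, 3):
            if dna_sequence[i:i + 3] in stop_codons:
                orfs.append(dna_sequence[start:i + 3])
                break
        start = dna_sequence.find('ATG', start + 1)
    return max(orfs, key=len) if orfs else ''

toy_sequence = 'CCATGAAATAGGG'  # made-up example
print(find_longest_orf(toy_sequence))  # ATGAAATAG
```

Note that the toy sequence contains TGA starting at its fourth base, right after the A of ATG. It is ignored because it is out of frame with the start codon, so the ORF runs on to the in-frame TAG.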

In [1]:
from dna_sequence import DnaSequence

dna_sequence = DnaSequence('MASTL.fasta')
print(dna_sequence.metadata)
print(dna_sequence.orf)

>NM_001172303.3 Homo sapiens microtubule associated serine/threonine kinase like (MASTL), transcript variant 1, mRNA
ATGGATCCCACCGCGGGAAGCAAGAAGGAGCCTGGAGGAGGCGCGGCGACTGAGGAGGGCGTGAATAGGATCGCAGTGCCAAAACCGCCCTCCATTGAGGAATTCAGCATAGTGAAGCCCATTAGCCGGGGCGCCTTCGGGAAAGTGTATCTGGGGCAGAAAGGCGGCAAATTGTATGCAGTAAAGGTTGTTAAAAAAGCAGACATGATCAACAAAAATATGACTCATCAGGTCCAAGCTGAGAGAGATGCACTGGCACTAAGCAAAAGCCCATTCATTGTCCATTTGTATTATTCACTGCAGTCTGCAAACAATGTCTACTTGGTAATGGAATATCTTATTGGGGGAGATGTCAAGTCTCTCCTACATATATATGGTTATTTTGATGAAGAGATGGCTGTGAAATATATTTCTGAAGTAGCACTGGCTCTAGACTACCTTCACAGACATGGAATCATCCACAGGGACTTGAAACCGGACAATATGCTTATTTCTAATGAGGGTCATATTAAACTGACGGATTTTGGCCTTTCAAAAGTTACTTTGAATAGAGATATTAATATGATGGATATCCTTACAACACCATCAATGGCAAAACCTAGACAAGATTATTCAAGAACCCCAGGACAAGTGTTATCGCTTATCAGCTCGTTGGGATTTAACACACCAATTGCAGAAAAAAATCAAGACCCTGCAAACATCCTTTCAGCCTGTCTGTCTGAAACATCACAGCTTTCTCAAGGACTCGTATGCCCTATGTCTGTAGATCAAAAGGACACTACGCCTTATTCTAGCAAATTACTAAAATCATGTCTTGAAACAGTTGCCTCCAACCCAGGAATGCCTGTGAAGTGTCTAACTTCTAATTTACTCCAGTCTAGGA
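
The file dna_sequence.py itself is not shown in this notebook. As a rough idea of what such a module could contain, here is a minimal sketch. The helper names _read_fasta and _find_longest_orf, and the assumption that the file holds a single FASTA record, are mine; the real module may be organised differently.

```python
# dna_sequence.py (hypothetical contents, written to match the usage above)

class DnaSequence:
    STOP_CODONS = ('TAA', 'TAG', 'TGA')

    def __init__(self, fasta_path):
        # Reading and analysis happen once, when the object is created
        self.metadata, self.sequence = self._read_fasta(fasta_path)
        self.orf = self._find_longest_orf()

    @staticmethod
    def _read_fasta(fasta_path):
        """Return (header line, sequence) for a single-record FASTA file."""
        metadata, chunks = '', []
        with open(fasta_path) as file:
            for line in file:
                line = line.strip()
                if line.startswith('>'):
                    metadata = line
                elif line:
                    chunks.append(line)
        return metadata, ''.join(chunks)

    def _find_longest_orf(self):
        """Return the longest in-frame ATG-to-stop stretch, or '' if there is none."""
        orfs = []
        start = self.sequence.find('ATG')
        while start != -1:
            for i in range(start, len(self.sequence) - 2, 3):
                if self.sequence[i:i + 3] in self.STOP_CODONS:
                    orfs.append(self.sequence[start:i + 3])
                    break
            start = self.sequence.find('ATG', start + 1)
        return max(orfs, key=len) if orfs else ''
```

With the logic packaged this way, the notebook only needs the import and one line to create the object, and the same class can be reused for any other FASTA file.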