<a href="https://colab.research.google.com/github/LEGENDFTW19/DNA-sequence-Biopython/blob/main/Basic_python_for_DNA_sequence_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is based on the Udemy course Machine Learning for Bioinformatics. Two tasks are implemented to demonstrate how the Python programming language can be used for DNA sequence analysis:
1. *Basic Python sequence analysis*
2. *Sequence analysis using the Biopython package*

**Author: Tushar Singh**

# Task 1

This section walks through the steps of **sequence analysis** using plain Python, i.e. without any external packages.

### Part 1: Counting bases

The four bases in DNA are **Adenine (A), Thymine (T), Guanine (G) and Cytosine (C)**.

In [None]:
# manually defined DNA sequence saved as a string
DNA_string = "3’-TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATAACCCCCCGATT-5’"

In [None]:
# define a function to count the different bases in the given DNA sequence
def count_base_in_DNAseq(DNA_string):
  """
  Args:
    DNA_string: str, the DNA sequence

  Returns: two dictionaries, one with the base count and the other with the
           base ratio in the given DNA sequence

  """
  base_dict = {'A':0,'C':0,'T':0,'G':0}   # number of occurrences of each base
  base_dict2 = {'A':0,'C':0,'T':0,'G':0}  # share of each base in the string

  for base in DNA_string:
    # characters that are not bases (such as the 3’- and -5’ markers) are skipped
    if base in base_dict:
      base_dict[base] = base_dict[base] + 1
      # note: len(DNA_string) also counts the marker characters, so the
      # ratios are relative to the full string length, not the number of bases
      base_dict2[base] = base_dict2[base] + 1/len(DNA_string)

  return base_dict, base_dict2

In [None]:
base_count, base_ratio = count_base_in_DNAseq(DNA_string)
print("Base pair count: ", base_count)
print("Base pair ratio: ", base_ratio)

Base pair count:  {'A': 25, 'C': 24, 'T': 31, 'G': 19}
Base pair ratio:  {'A': 0.238095238095238, 'C': 0.22857142857142848, 'T': 0.29523809523809524, 'G': 0.1809523809523809}
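
The counting step can also be written with `collections.Counter`. The following is a minimal sketch rather than the course's method: the helper name `base_composition` is hypothetical, and it assumes that only the characters A, C, G and T should contribute to the total, so the strand markers `3’-` and `-5’` are ignored when the fractions are computed.

```python
from collections import Counter

# Same sequence as in Part 1, including the 3’-/-5’ strand markers
dna_with_markers = "3’-TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATAACCCCCCGATT-5’"

def base_composition(sequence):
    """Return (counts, fractions) for the bases A, C, G and T.

    Hypothetical helper: characters other than A, C, G, T are ignored,
    and the fractions are taken over the number of bases found.
    """
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return {}, {}
    fractions = {base: counts[base] / total for base in "ACGT"}
    return dict(counts), fractions

counts, fractions = base_composition(dna_with_markers)
print("Base count:", counts)
print("Base fraction:", {base: round(value, 4) for base, value in fractions.items()})
```

Because the denominator here is the number of bases rather than the length of the whole string, the four fractions sum to 1.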


### Part 2: Complementary base pairs

In a DNA sequence, the complement of **adenine (A) is thymine (T), and the complement of guanine (G) is cytosine (C)**.

In [None]:
# initialize string to store the complement 
DNA_complement =""

for base in DNA_string:

  if base == "T":
    base = "A"
  elif base == "A":
    base = "T"
  elif base == "C":
    base = "G"
  elif base == "G":
    base = "C"
  elif base == "3":
    base = "5"
  elif base == "5":
    base = "3"

  DNA_complement = DNA_complement + base

In [None]:
print("The complement of " + DNA_string + " is \n ", DNA_complement)

The complement of 3’-TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATAACCCCCCGATT-5’ is 
  5’-ATGAGAGCAAGAACGTCGAACAGTCATGAAAGTCTTAGTACCACACGTACCATCTTACTGAGAATATTGCTTGAAGCTGTACCGTTATTGGGGGGCTAA-3’
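
The same base-by-base substitution can be expressed with Python's built-in `str.maketrans` and `str.translate`. This is a sketch under the assumption that the input contains only the characters handled by the loop above (the four bases plus the `3` and `5` of the strand markers); the names `COMPLEMENT_TABLE` and `complement` are introduced here for illustration.

```python
# Translation table: each base maps to its partner, and the strand
# markers 3 and 5 are swapped, mirroring the if/elif chain above.
COMPLEMENT_TABLE = str.maketrans("ACGT35", "TGCA53")

def complement(sequence):
    """Return the complement; characters not in the table are left unchanged."""
    return sequence.translate(COMPLEMENT_TABLE)

# Short excerpt (first 12 bases) of the sequence used in this notebook
excerpt = "3’-TACTCTCGTTCT-5’"
print(complement(excerpt))
```

A translation table avoids repeated string concatenation and keeps the pairing rules in one place.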


### Part 3: Transcription
In genomics, *transcription* is the process of making an RNA copy of a gene's DNA sequence. This copy, called messenger RNA (mRNA), carries the gene's protein information encoded in DNA.

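Because mRNA has the same sequence as the coding strand of the DNA, except that uracil (U) takes the place of thymine (T), the transcript of a coding-strand sequence can be obtained by simply **replacing 'T' with 'U'**.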

In [None]:
# initialize string to store the transcription
DNA_transcription =""

for base in DNA_string:
  
  if base == "T":
    base = "U"

  DNA_transcription = DNA_transcription + base  # concatenation

In [None]:
print("The transcription of " + DNA_string + " is \n ", DNA_transcription)

The transcription of 3’-TACTCTCGTTCTTGCAGCTTGTCAGTACTTTCAGAATCATGGTGTGCATGGTAGAATGACTCTTATAACGAACTTCGACATGGCAATAACCCCCCGATT-5’ is 
  3’-UACUCUCGUUCUUGCAGCUUGUCAGUACUUUCAGAAUCAUGGUGUGCAUGGUAGAAUGACUCUUAUAACGAACUUCGACAUGGCAAUAACCCCCCGAUU-5’
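
Which strand is used as input matters for the result. The sketch below separates the two cases, using hypothetical helper names: `transcribe_coding_strand` applies the plain T-to-U replacement, while `transcribe_template_strand` assumes the input is a template strand written 3’ to 5’ and pairs each base with its RNA partner, so the result reads 5’ to 3’.

```python
# RNA pairing for a template strand: A->U, C->G, G->C, T->A
TEMPLATE_TO_MRNA = str.maketrans("ACGT", "UGCA")

def transcribe_coding_strand(coding_dna):
    """mRNA from a coding strand (5’ to 3’): replace T with U."""
    return coding_dna.replace("T", "U")

def transcribe_template_strand(template_dna):
    """mRNA from a template strand written 3’ to 5’: pair every base."""
    return template_dna.translate(TEMPLATE_TO_MRNA)

template = "TACTCTCGTTCT"  # first 12 bases of DNA_string, read 3’ to 5’
coding = "ATGAGAGCAAGA"    # its complement, read 5’ to 3’

print(transcribe_template_strand(template))
print(transcribe_coding_strand(coding))
```

Both calls give the same mRNA, since the coding strand is the complement of the template strand.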


### Part 4: Translation
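In molecular biology and genetics, *translation* is the process in which ribosomes, in the cytoplasm or on the endoplasmic reticulum, synthesize proteins from the mRNA produced by transcription in the cell's nucleus. Transcription and translation together are called **gene expression**. During translation the mRNA is read three bases at a time; each triplet (codon) specifies one amino acid or a stop signal.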

In [None]:
# dictionary mapping every RNA codon to its amino acid (or STOP)

RNA_seq_dict = {
'UUU':'PHE','UUC':'PHE','UUA':'LEU','UUG':'LEU',
'UCU':'SER','UCC':'SER','UCA':'SER','UCG':'SER',
'UAU':'TYR','UAC':'TYR','UAA':'STOP','UAG':'STOP',
'UGU':'CYS','UGC':'CYS','UGA':'STOP','UGG':'TRP',
'CUU':'LEU','CUC':'LEU','CUA':'LEU','CUG':'LEU',
'CCU':'PRO','CCC':'PRO','CCA':'PRO','CCG':'PRO',
'CAU':'HIS','CAC':'HIS','CAA':'GLN','CAG':'GLN',
'CGU':'ARG','CGC':'ARG','CGA':'ARG','CGG':'ARG',
'AUU':'ILE','AUC':'ILE','AUA':'ILE','AUG':'MET',
'ACU':'THR','ACC':'THR','ACA':'THR','ACG':'THR',
'AAU':'ASN','AAC':'ASN','AAA':'LYS','AAG':'LYS',
'AGU':'SER','AGC':'SER','AGA':'ARG','AGG':'ARG',
'GUU':'VAL','GUC':'VAL','GUA':'VAL','GUG':'VAL',
'GCU':'ALA','GCC':'ALA','GCA':'ALA','GCG':'ALA',
'GAU':'ASP','GAC':'ASP','GAA':'GLU','GAG':'GLU',
'GGU':'GLY','GGC':'GLY','GGA':'GLY','GGG':'GLY'}


In [None]:
# manually defined RNA sequence string (the complement from Part 2 with 'T' replaced by 'U')
RNA_string = "AUGAGAGCAAGAACGUCGAACAGUCAUGAAAGUCUUAGUACCACACGUACCAUCUUACUGAGAAUAUUGCUUGAAGCUGUACCGUUAUUGGGGGGCUAA"

RNA_translation = ""

# counter to keep track of the current position in the string
counter = 0

while counter < len(RNA_string)-2: # -2 because 3 characters are read at a time
  RNA_subscript = RNA_string[counter: counter+3]
  # look up the codon; a triplet that is not in the dictionary is kept unchanged
  RNA_translation = RNA_translation + RNA_seq_dict.get(RNA_subscript, RNA_subscript)
  counter = counter + 3

In [None]:
print("The translation of " + RNA_string + " is \n ", RNA_translation)

The translation of AUGAGAGCAAGAACGUCGAACAGUCAUGAAAGUCUUAGUACCACACGUACCAUCUUACUGAGAAUAUUGCUUGAAGCUGUACCGUUAUUGGGGGGCUAA is 
  METARGALAARGTHRSERASNSERHISGLUSERLEUSERTHRTHRARGTHRILELEULEUARGILELEULEUGLUALAVALPROLEULEUGLYGLYSTOP
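
A hand-typed codon dictionary is easy to get wrong, so one alternative is to generate it. The sketch below is an illustration, not the course's approach: it builds the standard genetic code from the conventional 64-letter amino acid string (bases ordered U, C, A, G, one-letter amino acid codes, `*` for stop) and stops translating at the first stop codon. The names `CODON_TABLE` and `translate` are introduced here for the example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order, one-letter codes, '*' marks a stop codon
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(rna):
    """Translate an RNA string codon by codon, stopping at the first stop codon.

    Assumes the string contains only U, C, A and G; a trailing partial
    codon is ignored.
    """
    protein = []
    for start in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

rna = "AUGAGAGCAAGAACGUCGAACAGUCAUGAAAGUCUUAGUACCACACGUACCAUCUUACUGAGAAUAUUGCUUGAAGCUGUACCGUUAUUGGGGGGCUAA"
print(translate(rna))
```

One-letter codes keep the protein string compact and match the format that sequence libraries commonly print.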


# Task 2

In this section, the Python library **Biopython** is used to perform the same kinds of sequence analysis. Its built-in methods replace the manual loops of Task 1, which makes the code considerably more compact.

In [None]:
!pip install biopython

In [None]:
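# import the Biopython libraries
import Bio
from Bio.Seq import Seq
# note: newer Biopython releases replace GC with gc_fraction, which returns
# a fraction between 0 and 1 instead of a percentage
from Bio.SeqUtils import GC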

# initialize a DNA sequence
DNA = Seq("AGTACACTGGTT")

# complement
print("The given DNA sequence- ", DNA)
print("The complement of the DNA sequence- ", DNA.complement())

# count occurrences of a subsequence (here the dinucleotide "AC")
print("The base pair count for AC is ", DNA.count("AC"))

# base percent
A_percent = (DNA.count("A")/len(DNA))
print("Percent of A in DNA is ", A_percent*100)

# shorter method for base percent
print("GC percent in DNA " + str(GC(DNA)))

# transcribe
mRNA = DNA.transcribe()
print("Transcribe of DNA is ", mRNA)

# translate
amino_acids = mRNA.translate()
print("Translate of mRNA is ", amino_acids)

The given DNA sequence-  AGTACACTGGTT
The complement of the DNA sequence-  TCATGTGACCAA
The base pair count for AC is  2
Percent of A in DNA is  25.0
GC percent in DNA 41.666666666666664
Transcribe of DNA is  AGUACACUGGUU
Translate of mRNA is  STLV
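
As a cross-check on the GC figure printed above, the same percentage can be computed in plain Python. This is a minimal sketch with a hypothetical helper name, `gc_percent`; it assumes an unambiguous sequence made up only of A, C, G and T and does not try to reproduce how Biopython treats ambiguity codes.

```python
def gc_percent(sequence):
    """Percentage of G and C characters in the sequence (0.0 for an empty string)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc_count = sum(base in "GC" for base in sequence)
    return 100 * gc_count / len(sequence)

print(gc_percent("AGTACACTGGTT"))
```

For the 12-base example, 5 of the bases are G or C, which agrees with the value reported by Biopython.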


# Conclusions

This notebook illustrates how the Python programming language can be used to analyse DNA sequences. Compared with the manual process of finding the DNA complement, transcription and translation, the Biopython library makes the entire analysis more compact, and its built-in functions improve the readability of the code.