# GC percent for a DNA string

A function to calculate the GC percentage of a DNA string.

Sample DNA string

In [1]:
my_dna = "AATCGACAATAGCCCATTGATACAT"

In [2]:
print(my_dna)

AATCGACAATAGCCCATTGATACAT


## Function

In [3]:
def gc_percent(dna):
    """Description: Computes GC % for a given DNA string
    
    Arguments: 
       dna  :  DNA string
    
    Return: a float value 
    
    Example of usage:
       >>> gc = gc_percent(dna = "AATCGACAATAGCCCATTGATACAT")
       >>> print(gc)
       
       Output:
          36.0
    
    """
    count_G = dna.count('G')
    count_C = dna.count('C')
    dna_len = len(dna)
    gc = (count_G + count_C) / dna_len
    gc_pc = gc * 100
    # Equivalent one-liner:
    # gc_pc = (dna.count('G') + dna.count('C')) * 100 / len(dna)
    return gc_pc

In [4]:
gc_percent(dna = my_dna)

36.0

But what if the sequence contains a mix of upper- and lower-case letters?

In [5]:
my_dna = my_dna.upper()

Converting the string to a single case fixes the issue.
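To see why case matters, here is a small sketch. `str.count` is case-sensitive, so lowercase bases are missed unless the string is normalized first. The name `mixed_dna` is made up for illustration; it is the sample sequence with its first five letters lowercased.

```python
# Hypothetical mixed-case version of the sample sequence (illustration only)
mixed_dna = "aatcgACAATAGCCCATTGATACAT"

# str.count is case-sensitive: the lowercase 'c' and 'g' are not counted
raw_gc = mixed_dna.count('G') + mixed_dna.count('C')

# After normalizing to upper case, every G and C is counted
normalized = mixed_dna.upper()
normalized_gc = normalized.count('G') + normalized.count('C')

print(raw_gc, normalized_gc)  # prints: 7 9
```

Without normalization, two of the nine G/C bases are silently dropped from the count.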

However, if the DNA contains letters other than A, T, G, or C, the function neither returns a correct GC content nor raises an error.
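As a quick illustration of this silent failure, the sketch below applies the same formula to a string that is clearly not DNA. The name `naive_gc_percent` is hypothetical and simply mirrors the unvalidated calculation.

```python
def naive_gc_percent(dna):
    # Same formula as the unvalidated gc_percent: no check on the letters
    return (dna.count('G') + dna.count('C')) * 100 / len(dna)

# This string is not a DNA sequence, yet a number comes back without any error
result = naive_gc_percent("NOTADNASEQUENCE")
print(round(result, 2))
```

The single `C` among 15 characters is counted as if the input were valid, so the caller has no way to tell that the result is meaningless.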

We can write another function to check whether the DNA string is valid.

In [6]:
import re

def has_invalid_letters(sequence):
    """Description: checks whether the DNA string contains invalid letters
    
    Arguments: 
        sequence  :  DNA string
    
    Return: True if any character is not A, T, G, or C (either case), otherwise False
    
    Example of usage:
       >>> has_invalid_letters(sequence = "AATCGACAATAGCCCATTGATACAT")
       
       Output:
          False
    
    """
    
    # Match any character that is not A, T, G, or C (either case)
    pattern = re.compile(r'[^ATGCatgc]')
    return bool(pattern.search(sequence))

Now we can use this function inside our `gc_percent` function to check whether the DNA string is valid.

In [7]:
import re
def gc_percent(dna):
    """Description: Computes GC % for a given DNA string
    
    Arguments: 
       dna  :  DNA string
    
    Return: a float value, or None if the sequence contains invalid letters
    
    Example of usage:
       >>> gc = gc_percent(dna = "AATCGACAATAGCCCATTGATACAT")
       >>> print(gc)
       
       Output:
          Sequence contains only valid letters.
          36.0
    
    """
    dna = dna.upper()
    if has_invalid_letters(dna):
        print("Sequence contains invalid letters.")
    else:
        print("Sequence contains only valid letters.")
        gc_pc = (dna.count('G') + dna.count('C')) * 100 / len(dna)
        return gc_pc

In [8]:
gc_percent(dna = my_dna)

Sequence contains only valid letters.


36.0

In [9]:
gc_percent(dna = "NOTADNASEQUENCE")

Sequence contains invalid letters.
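Because the invalid branch only prints a message, the function returns `None` in that case, which a caller could easily overlook. One possible alternative, sketched here as an assumption rather than the approach used above, is to raise a `ValueError` instead. The name `gc_percent_strict` and the empty-sequence check are hypothetical additions.

```python
import re

def gc_percent_strict(dna):
    """Return the GC % of dna; raise ValueError for empty or non-ATGC input."""
    dna = dna.upper()
    if not dna:
        # Assumption: treat an empty sequence as an error to avoid dividing by zero
        raise ValueError("Sequence is empty.")
    if re.search(r'[^ATGC]', dna):
        raise ValueError("Sequence contains invalid letters.")
    return (dna.count('G') + dna.count('C')) * 100 / len(dna)

print(gc_percent_strict("AATCGACAATAGCCCATTGATACAT"))  # prints: 36.0

try:
    gc_percent_strict("NOTADNASEQUENCE")
except ValueError as err:
    print(err)  # prints: Sequence contains invalid letters.
```

With this variant the caller has to handle the error explicitly, so an invalid sequence cannot be mistaken for a computed value.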
