# PROJECT 2: GC Content Calculation 

Let's start with understanding GC content!

**What is GC Content?**

GC content is the percentage of nucleotides in a DNA or RNA sequence that are either guanine (G) or cytosine (C). The formula for calculating GC content is:

GC Content (%) = (Number of G's and C's / Total number of nucleotides) × 100
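
As a quick worked example of the formula, the sketch below applies it by hand to a short, made-up sequence (the sequence is purely illustrative and not taken from any database).

```python
# Worked example of the GC content formula on a short, made-up sequence.
sequence = "ATGCGCTA"  # hypothetical 8-nucleotide sequence

g_and_c = sequence.count("G") + sequence.count("C")  # 2 G's + 2 C's = 4
total = len(sequence)                                 # 8 nucleotides

gc_percent = g_and_c / total * 100
print(f"{g_and_c} / {total} x 100 = {gc_percent:.1f}%")  # 4 / 8 x 100 = 50.0%
```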

**Why is GC Content Important?**

**1. Stability of DNA:** Higher GC content generally increases the stability of the DNA molecule because G-C pairs form three hydrogen bonds (compared to two for A-T pairs), making the DNA more resistant to denaturation.

**2. Genomic Characterization:** Different organisms have different typical GC contents. This can be used to identify and characterize species or strains.

**3. PCR Design:** GC content is crucial for designing primers in PCR, as it affects the melting temperature of the primers.

**4. Gene Prediction:** GC content can help in gene prediction and finding coding regions, as coding regions often have a higher GC content compared to non-coding regions.
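
To make the link between GC content and primer melting temperature concrete, here is a minimal sketch using the Wallace rule, a rough rule of thumb for short oligonucleotides: Tm ≈ 2 × (A + T) + 4 × (G + C) in °C. The function name `wallace_tm` and both primer sequences are made up for illustration. Real primer design tools use more accurate thermodynamic models.

```python
def wallace_tm(primer: str) -> int:
    """Rough melting temperature (deg C) via the Wallace rule: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

# Two hypothetical 12-mers of equal length but different GC content
low_gc = "ATATTAGCTTAA"   # 2 of 12 bases are G or C
high_gc = "GCGCCAGCTGGC"  # 10 of 12 bases are G or C

print(wallace_tm(low_gc))   # 28
print(wallace_tm(high_gc))  # 44
```

At the same length, the GC-rich primer gets a noticeably higher estimate, which is why GC content matters when choosing primer pairs with similar melting temperatures.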

**Calculating GC Content in Python**

Let's use the fasta_data fetched from the database previously to calculate the GC content. Here's a step-by-step explanation:

**1. Define a Function to Calculate GC Content:** We will create a function that takes a sequence as input and returns its GC content.

**2. Parse the FASTA Data:** We will modify our existing code to parse the FASTA data and use the GC content function to calculate and print the GC content for each sequence.

## STEP 1. Import Libraries

In [2]:
import Bio

In [10]:
from Bio import SeqIO
from Bio import Entrez
from io import StringIO

## STEP 2. Fetch the FASTA Data from the Database

In [6]:
# Set email for NCBI Entrez
Entrez.email = "k26sangeetha@gmail.com"  # Replace with your email

In [7]:
# Function to fetch FASTA data from NCBI
def fetch_fasta_from_ncbi(query, database="nucleotide"):
    handle = Entrez.esearch(db=database, term=query, retmax=1)
    record = Entrez.read(handle)
    handle.close()
    if record["IdList"]:
        seq_id = record["IdList"][0]
        handle = Entrez.efetch(db=database, id=seq_id, rettype="fasta", retmode="text")
        fasta_data = handle.read()
        handle.close()
        return fasta_data
    else:
        return None

In [8]:
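# Example usage
# Note: this is a free-text search and only the first hit is fetched,
# so the returned record may not be the human gene (see the output below).
query = "Homo sapiens COX1"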
fasta_data = fetch_fasta_from_ncbi(query)
print(fasta_data)

>PP914118.1 Taenia solium isolate B cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial
TAGATTTTTTAATGTTTTCTTTACATTTAGCTGGTGTATCAAGTATTTTTAGTTCTATTAATTTTATATG
TACATTATATAGAGTTTTTATGACTAATATATTTTCTCGTACATCTATAGTGTTATGATCTTATTTATTT
ACATCTATCTTGTTATTGGTTACTTTACCTGTTTTGGCAGCCGCTGTTACTATGCTTCTATTTGATCGTA
AATTTAGTTCTGCGTTTTTTGATCCGTTAGGAGGTGGTGATCCTGTTTTATTTCAACATATGTTTTGATT
TTTTGGTCATCCTGAGGTTTATGTGTTAATTCTTCCGGGGTTTGGTATAATTAGTCATATATGTTTGAGT
ATAAGTATGTGTTCTGATGCTTTTGGCTTTTATGGGTTATTGTTTGCTATGTTTTCAATAGTATGTTTAG
GAAGAAGTGTATGAGGGCATCATATGTTTACGGTTGGGTTAGATGTTAAGACGGCTGTATTTTTTAGTTC
TGTTACTATGATAATTGGAGTGCCTACGGGGATTAAGGTTTTTACTTGGCTTTATATGCTTTTAAAATCT
CGTGTTAATAAGAGTGATCCGGTTTTATGATGAATAATTTCGTTTATAGTATTGTTTACATTTGGTGGTG
TAACTGGTATTATTCTATCTGCTTGTGTATTAGATAAAGTTCTTCATGATACTTGGTTTGTTGTTGCTCA
TTTTCATT
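
The record returned above is from Taenia solium, not Homo sapiens, because the free-text query matches any record that mentions those words. One possible way to narrow the search is to use Entrez field tags such as `[Organism]` and `[Gene]`. The helper `build_entrez_query` below is a hypothetical name introduced only for this sketch, and whether the narrowed query returns the record you want has not been checked here.

```python
def build_entrez_query(organism: str, gene: str) -> str:
    """Combine an organism and a gene name using Entrez field tags."""
    return f"{organism}[Organism] AND {gene}[Gene]"

# Build a more specific search term; it can be passed to the fetch function as `query`
query = build_entrez_query("Homo sapiens", "COX1")
print(query)  # Homo sapiens[Organism] AND COX1[Gene]
```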




**Function to Calculate GC Content**

In [1]:
# Function to calculate GC content
def calculate_gc_content(sequence):
    # Normalize case so lowercase (soft-masked) bases are counted too
    sequence = sequence.upper()
    # An empty sequence would cause a division by zero, so return 0.0 instead
    if len(sequence) == 0:
        return 0.0
    # Count the G and C nucleotides in the sequence
    gc_count = sequence.count("G") + sequence.count("C")
    # Proportion of G and C nucleotides, converted to a percentage
    return (gc_count / len(sequence)) * 100

The function **calculate_gc_content** takes a DNA sequence as input and returns the GC content as a percentage.
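
Real sequences often contain lowercase (soft-masked) bases or ambiguous characters such as N. One possible variant, sketched below under the assumption that ambiguous characters should be left out of the denominator, counts only A, C, G, T, and U. The name `gc_content_unambiguous` is hypothetical and is not part of Biopython.

```python
def gc_content_unambiguous(sequence) -> float:
    """GC percentage over A/C/G/T/U only; returns 0.0 if none are present."""
    seq = str(sequence).upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T") + seq.count("U")
    called = gc + at  # bases that are unambiguous
    return gc / called * 100 if called else 0.0

print(gc_content_unambiguous("ggccNNNNatat"))  # 50.0 (the four N's are ignored)
print(gc_content_unambiguous(""))              # 0.0
```

Whether to ignore ambiguous bases or count them in the total length is a design choice. Either is defensible as long as it is applied consistently across the sequences being compared.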

In [4]:
# Function to parse a FASTA string and calculate GC content
def parse_fasta_gc_content(fasta_string):
    # StringIO converts the FASTA string into a file-like object that SeqIO can parse
    fasta_io = StringIO(fasta_string)
    # SeqIO.parse yields one record per sequence in the FASTA data
    for record in SeqIO.parse(fasta_io, "fasta"):
        # Calculate the GC content of each sequence
        gc_content = calculate_gc_content(record.seq)
        print(f"ID: {record.id}")
        print(f"GC Content: {gc_content:.2f}%\n")

In [11]:
# Example usage
if fasta_data:
    parse_fasta_gc_content(fasta_data)
else:
    print("No data fetched.")

ID: PP914118.1
GC Content: 30.65%
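
To show roughly what the parsing step does, here is a simplified pure-Python sketch of reading FASTA text: a line starting with `>` begins a new record, the ID is the first word of that header, and the following lines are joined into one sequence. The function `iter_fasta` and the sample data are made up for illustration. This is not how SeqIO is implemented and it is not a replacement for it.

```python
from io import StringIO

def iter_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA-formatted file-like object."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header: emit the previous record, if there was one
            if record_id is not None:
                yield record_id, "".join(chunks)
            header = line[1:].split()
            # The ID is the first whitespace-delimited token after ">"
            record_id, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    # Emit the last record
    if record_id is not None:
        yield record_id, "".join(chunks)

# Hypothetical two-record FASTA string
fasta_text = ">seqA made-up example\nATGC\nGGCC\n>seqB\nATAT\n"
records = list(iter_fasta(StringIO(fasta_text)))
print(records)  # [('seqA', 'ATGCGGCC'), ('seqB', 'ATAT')]
```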



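**Let's find the GC content for a FASTA file stored locally**

The only change is the function's argument: it now takes a file path (file_path) instead of the FASTA data itself. Because it reuses the name **parse_fasta_gc_content**, this definition replaces the string-based version above.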

In [12]:
# Function to parse a FASTA file and calculate GC content
def parse_fasta_gc_content(file_path):
    for record in SeqIO.parse(file_path, "fasta"):
        gc_content = calculate_gc_content(record.seq)
        print(f"ID: {record.id}")
        print(f"GC Content: {gc_content:.2f}%\n")

In [13]:
fasta_file = "Example1.fasta"

In [14]:
import os

# Check that the file exists before parsing it
if os.path.exists(fasta_file):
    parse_fasta_gc_content(fasta_file)
else:
    print(f"File not found: {fasta_file}")

ID: sequence1
GC Content: 57.69%

ID: sequence2
GC Content: 41.94%

ID: sequence3
GC Content: 42.55%
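
Once there are several GC values, it is often handy to keep them in a data structure rather than only printing them. The sketch below uses the three percentages printed above, typed in by hand, to find the most GC-rich record and the average. The dictionary name `gc_by_id` is an assumption for illustration.

```python
from statistics import mean

# GC percentages copied from the printed output above
gc_by_id = {"sequence1": 57.69, "sequence2": 41.94, "sequence3": 42.55}

highest = max(gc_by_id, key=gc_by_id.get)  # ID with the largest GC content
average = mean(gc_by_id.values())          # unweighted mean of the three values

print(f"Highest GC: {highest} ({gc_by_id[highest]:.2f}%)")  # Highest GC: sequence1 (57.69%)
print(f"Average GC: {average:.2f}%")                        # Average GC: 47.39%
```

Note that this is an unweighted average of per-sequence percentages. It equals the GC content of all sequences pooled together only if the sequences have the same length.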

