
### DNA Sequence Analyzer

This section defines functions for analyzing DNA sequences. It starts with basic statistics: sequence length, per-base counts, and GC content (the percentage of bases that are G or C).

In [2]:
def analyze_dna_sequence(dna_sequence):
    """
    Analyzes a DNA sequence to provide its length, base counts, and GC content.

    Args:
        dna_sequence (str): The DNA sequence string (e.g., 'ATGCAGTG').

    Returns:
        dict: A dictionary containing the analysis results.
    """
    dna_sequence = dna_sequence.upper() # Convert to uppercase for consistent analysis

    # Calculate sequence length
    seq_length = len(dna_sequence)

    # Calculate base counts
    a_count = dna_sequence.count('A')
    t_count = dna_sequence.count('T')
    c_count = dna_sequence.count('C')
    g_count = dna_sequence.count('G')

    # Calculate GC content as a percentage of the full sequence length (0.0 if the sequence is empty)
    gc_content = ((c_count + g_count) / seq_length * 100) if seq_length > 0 else 0.0

    analysis_results = {
        "sequence_length": seq_length,
        "A_count": a_count,
        "T_count": t_count,
        "C_count": c_count,
        "G_count": g_count,
        "GC_content": round(gc_content, 2) # Round to 2 decimal places
    }

    return analysis_results

Let's test our `analyze_dna_sequence` function with a sample DNA sequence.

In [4]:
import pandas as pd

# Example usage
sample_dna = "ATGCGTACGTTAG"
analysis = analyze_dna_sequence(sample_dna)

# Display the results as a one-column table.
# `display` is provided by IPython/Jupyter; use print() outside a notebook.
display(pd.DataFrame([analysis]).T.rename(columns={0: 'Value'}))

| | Value |
|---|---|
| sequence_length | 13.0 |
| A_count | 3.0 |
| T_count | 4.0 |
| C_count | 2.0 |
| G_count | 4.0 |
| GC_content | 46.15 |
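
As a cross-check on the numbers above, the same statistics can be computed with `collections.Counter`, which also makes it easy to spot characters other than A, T, C, and G (for example the ambiguity code `N`). This is a minimal sketch rather than part of the original notebook: the helper name `count_bases` and the choice to report unexpected characters under an `other` key are assumptions made for illustration.

```python
from collections import Counter

def count_bases(dna_sequence):
    """Count A/T/C/G and lump every other character into 'other' (hypothetical helper)."""
    counts = Counter(dna_sequence.upper())
    result = {base: counts.get(base, 0) for base in "ATCG"}
    # Anything that is not one of the four standard bases is tallied separately
    result["other"] = sum(n for ch, n in counts.items() if ch not in "ATCG")
    return result

counts = count_bases("ATGCGTACGTTAG")
# Divide by the total character count, mirroring len(dna_sequence) in analyze_dna_sequence
gc_percent = round((counts["G"] + counts["C"]) / sum(counts.values()) * 100, 2)
print(counts)      # {'A': 3, 'T': 4, 'C': 2, 'G': 4, 'other': 0}
print(gc_percent)  # 46.15
```

A nonzero `other` count would indicate that the sequence length used in the GC calculation includes characters that are not standard bases, which may or may not be what you want.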


### Reading DNA Sequences from FASTA Files

FASTA is a text-based format for nucleotide or amino acid sequences, in which each nucleotide or amino acid is represented by a single-letter code. Each record begins with a single description line that starts with `>`, followed by one or more lines of sequence data.

Here's a typical structure:
```
>sequence_id_1 This is a description for sequence 1
ATGCGTACGTTAGCTAGCTAGCTAGCATGC
ATGCATGCATGCATGCATGCATGCATGCATGC
>sequence_id_2 Another sequence description
GCATGCATGCATGCATGCATGCATGCATGC
TAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
```

The function below parses FASTA content supplied as a string and returns the sequences keyed by ID, where the ID is the text between `>` and the first space of the description line.

In [5]:
from io import StringIO

def parse_fasta(fasta_content):
    """
    Parses a FASTA format string and returns a dictionary of sequences.

    Args:
        fasta_content (str): A string containing DNA sequences in FASTA format.

    Returns:
        dict: A dictionary where keys are sequence IDs and values are the DNA sequences.
    """
    sequences = {}
    current_id = None
    current_sequence_lines = []

    # Use StringIO to treat the string content as a file
    with StringIO(fasta_content) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # Skip blank lines
            if line.startswith('>'):
                # Store the previous record before starting a new one.
                # Compare against None so a record with an empty ID is not dropped.
                if current_id is not None:
                    sequences[current_id] = ''.join(current_sequence_lines)
                current_id = line[1:].split(' ')[0]  # Text up to the first space is the ID
                current_sequence_lines = []
            else:
                current_sequence_lines.append(line)

        # Add the last sequence
        if current_id is not None:
            sequences[current_id] = ''.join(current_sequence_lines)

    return sequences

# Example FASTA content
sample_fasta_content = """
>seq1_example Human DNA sequence fragment
ATGCGTACGTTAGCTAGCTAGCTAGCATGC
ATGCATGCATGCATGCATGCATGCATGCATGC
>seq2_example Another fragment
GCATGCATGCATGCATGCATGCATGCATGC
TAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
"""

# Parse the FASTA content
parsed_sequences = parse_fasta(sample_fasta_content)

# Analyze each sequence
all_analysis_results = {}
for seq_id, dna_sequence in parsed_sequences.items():
    print(f"\nAnalyzing sequence: {seq_id}")
    analysis = analyze_dna_sequence(dna_sequence)
    all_analysis_results[seq_id] = analysis
    display(pd.DataFrame([analysis]).T.rename(columns={0: 'Value'}))

# If you have a local FASTA file, you can read it like this:
# with open('your_sequence.fasta', 'r') as f:
#     fasta_file_content = f.read()
# parsed_sequences_from_file = parse_fasta(fasta_file_content)
# print(parsed_sequences_from_file)



Analyzing sequence: seq1_example


| | Value |
|---|---|
| sequence_length | 62.0 |
| A_count | 15.0 |
| T_count | 16.0 |
| C_count | 15.0 |
| G_count | 16.0 |
| GC_content | 50.0 |



Analyzing sequence: seq2_example


| | Value |
|---|---|
| sequence_length | 60.0 |
| A_count | 15.0 |
| T_count | 15.0 |
| C_count | 15.0 |
| G_count | 15.0 |
| GC_content | 50.0 |
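
The commented-out snippet in the cell above reads an entire file into memory before parsing it. For larger files, one alternative is to iterate over the open file handle and yield one record at a time. The sketch below assumes the same simple FASTA layout as the example. `iter_fasta` is a hypothetical name, and the temporary file merely stands in for a real FASTA file. Unlike `parse_fasta`, it also keeps the description text that follows the ID.

```python
import os
import tempfile

def iter_fasta(handle):
    """Yield (seq_id, description, sequence) tuples from an open FASTA file (hypothetical helper)."""
    seq_id, description, chunks = None, "", []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, description, "".join(chunks)
            # Text before the first space is the ID; the rest is the description
            seq_id, _, description = line[1:].partition(" ")
            chunks = []
        elif seq_id is not None:
            # Sequence lines that appear before any header are ignored
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, description, "".join(chunks)

# Write a small demo file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seq1_example Human DNA sequence fragment\nATGC\nGGTA\n"
              ">seq2_example Another fragment\nTTAA\n")
    demo_path = tmp.name

with open(demo_path) as f:
    records = list(iter_fasta(f))
os.remove(demo_path)

for seq_id, description, sequence in records:
    print(seq_id, "|", description, "|", sequence)
```

Each yielded sequence can be passed straight to `analyze_dna_sequence`, so only one record needs to be held in memory at a time.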
