1) Entrez: Fetch DNA sequences from NCBI.
- "Entrez" (pronounced "on-tray") is French for "enter". At NCBI, it is the name of the search and retrieval system that sits on top of their databases.

- The programmatic interface to that system is called the Entrez Programming Utilities (E-utilities).

- E-utilities is basically an API provided by NCBI that lets you programmatically search and retrieve biological data (genes, sequences, proteins, taxonomy, etc.) from their databases.

- So in short:
    Entrez = Gateway to NCBI databases via code.

- Biopython's Entrez module makes it easy to access this API without handling raw HTTP requests.



2) SeqIO: Read and parse sequences

3) PairwiseAligner (from Bio.Align): Align sequences and compute similarity scores. It replaces the older pairwise2 module, which is deprecated.

4) Phylo: Build and visualize phylogenetic trees (later on; I'm not sure if we will be doing this)
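
To make item 1 a bit more concrete, here is a rough sketch of the kind of URL the Entrez module builds for you behind the scenes. It only constructs the string and sends nothing over the network. The helper name `build_esearch_url` is made up for illustration, and the real module adds more parameters (such as your email and tool name) than shown here.

```python
from urllib.parse import urlencode

# Public base address of NCBI's E-utilities
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def build_esearch_url(term, db="nucleotide", retmax=1):
    """Hypothetical helper: build (but do not send) an esearch request URL."""
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode escapes spaces and brackets so the query is safe inside a URL
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"

url = build_esearch_url("Homo sapiens[Organism] AND mitochondrion[Filter]")
print(url)
```

With Biopython, the equivalent request is a single `Entrez.esearch(...)` call, and the XML response is parsed for you by `Entrez.read(...)`.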

In [3]:
from Bio import Entrez, SeqIO, Phylo
from Bio.Align import PairwiseAligner


### [1]
Each "result" from NCBI’s nucleotide database is basically a sequence record submitted by researchers or institutions. Think of it like a mini research entry.

A single result typically includes:
1. Metadata:
- Accession ID (unique code)
- Title (e.g., “Homo sapiens mitochondrial DNA, complete genome”)
- Organism name
- Gene or region info
- Source lab or publication

2. The nucleotide sequence:
- Could be a whole genome, mitochondrial DNA, a gene, or even a small region.

3. Optional annotations:
- Features (e.g., exons, CDS, introns)
- References or citations
- Length, molecule type (DNA/RNA)
- Other molecule info

So yeah, it’s like a mini report for a specific genetic sequence, contributed to NCBI's database. Different sources can submit overlapping or unique data for the same species.
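
One thing to keep in mind: the FASTA format requested in the fetch code further down carries only a small part of this mini report, namely a header line (accession plus title) and the sequence itself. The sketch below splits a FASTA record by hand to show that shape. The accession, title, and bases are made-up placeholders, not real NCBI data, and `parse_single_fasta` is a hypothetical helper.

```python
# Made-up FASTA text, shaped like what efetch returns with rettype="fasta"
fasta_text = """>XX000000.1 Example organism mitochondrion, complete genome
ACGTACGTACGTACGTACGT
TTGGCCAATTGGCCAA
"""

def parse_single_fasta(text):
    """Split one FASTA record into (accession, title, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA record must start with a '>' header line")
    header = lines[0][1:]                        # drop the leading '>'
    accession, _, title = header.partition(" ")  # first word is the ID
    sequence_text = "".join(lines[1:])           # sequence may span many lines
    return accession, title, sequence_text

accession, title, sequence_text = parse_single_fasta(fasta_text)
print(accession)
print(title)
print(len(sequence_text))
```

Biopython's SeqIO does this parsing for you (`seq_record.id`, `seq_record.description`, `seq_record.seq`). To get the richer metadata such as features and references, you would request the GenBank format instead (`rettype="gb"` in efetch, parsed with SeqIO's `"genbank"` format).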

In [None]:
# NCBI asks for an email so it can contact you about excessive usage before blocking access
Entrez.email = "gunanka.is22@bmsce.ac.in"

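# species_name is passed to esearch as an Entrez query, so use the exact scientific name
# (field tags such as [Organism] or [Filter] can be added to narrow the search)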
def fetch_sequence(species_name):
    """Fetches the DNA sequence of a given species from NCBI."""
    try:
        # [1]
        # Search for the species in NCBI nucleotide database

        # handle is like a file pointer or a temporary connection to the data you requested from NCBI.
        # It doesn't hold the data itself; it just points to where the data is streamed from.
        handle = Entrez.esearch(db="nucleotide", term=species_name, retmax=1)

        # Entrez.read() parses the XML behind the handle and returns a dictionary-like object
        record = Entrez.read(handle)

        # close the connection, good practice
        handle.close()

        if not record["IdList"]:
            print(f"No sequence found for {species_name}")
            return None

        print(f'Keys : {list(record.keys())}\n')
        print(record.items())

        # Fetch sequence using the first search result ID
        seq_id = record["IdList"][0]
        handle = Entrez.efetch(db="nucleotide", id=seq_id, rettype="fasta", retmode="text")
        seq_record = SeqIO.read(handle, "fasta-pearson")
        handle.close()

        return seq_record.seq  # Return just the DNA sequence
    except Exception as e:
        print(f"Error fetching sequence for {species_name}: {e}")
        return None

# Test it with an example species
sequence = fetch_sequence("Homo sapiens[Organism] AND mitochondrion[Filter]")


Keys : ['Count', 'RetMax', 'RetStart', 'IdList', 'TranslationSet', 'TranslationStack', 'QueryTranslation']

dict_items([('Count', '210630'), ('RetMax', '1'), ('RetStart', '0'), ('IdList', ['2938283260']), ('TranslationSet', [{'From': 'Homo sapiens[Organism]', 'To': '"Homo sapiens"[Organism]'}]), ('TranslationStack', [{'Term': '"Homo sapiens"[Organism]', 'Field': 'Organism', 'Count': '28920006', 'Explode': 'Y'}, {'Term': 'mitochondrion[Filter]', 'Field': 'Filter', 'Count': '7845076', 'Explode': 'N'}, 'AND']), ('QueryTranslation', '"Homo sapiens"[Organism] AND mitochondrion[Filter]')])
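
A quick note on reading this output: `Count` is the total number of matching records in the database, not the number we got back, and every value comes through as a string. We asked for `retmax=1`, so `IdList` holds a single ID. The sketch below uses a hand-copied subset of the result above as a plain dict. `summarize_search` is a hypothetical helper, and the paging idea assumes you would pass the next offset via esearch's `retstart` parameter.

```python
# Subset of the esearch result printed above, copied into a plain dict
search_result = {
    "Count": "210630",
    "RetMax": "1",
    "RetStart": "0",
    "IdList": ["2938283260"],
}

def summarize_search(result):
    """Hypothetical helper: convert the string fields and work out the next page offset."""
    total = int(result["Count"])        # all matches in the database
    returned = len(result["IdList"])    # IDs actually returned (capped by retmax)
    next_start = int(result["RetStart"]) + returned
    return {
        "total": total,
        "returned": returned,
        # None means there is nothing left to page through
        "next_retstart": next_start if next_start < total else None,
    }

summary = summarize_search(search_result)
print(summary)
```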


In [12]:
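# Print the first 100 bases in groups of three for readability
# (the grouping starts at the first base of the record, not at a known reading frame)
first_bases = str(sequence[:100])
print(' '.join(first_bases[i:i + 3] for i in range(0, len(first_bases), 3)))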

TAT TCT CTG TTC TTT CAT GGG GAA GCA GAT TTG GGT ACC ACC CAA GTA TTG ACT CAT CCA TCA ACA ACC GCT ATG TAT TTC GTA CAT TAC TGC CAG CCA C