# Pilot 3

I wanted to see whether genes other than the 16S rRNA gene could be used to identify the species of the bacteria. Searching online first, I found that the taxa were broken down further. Taking note of the list of organisms, I scripted a search for Endozoicomonas and coral. The results included many whole-genome shotgun sequences, which took a long time to align and tree search and in the end turned out to be useless for further analysis.

We use the Biopython package to download nucleotide datasets from NCBI's GenBank, then prepare them for alignment. The download steps follow [this guide](https://notebook.community/widdowquinn/Notebooks-Bioinformatics/Biopython_NCBI_Entrez_downloads).

## Set up the environment

In [12]:
from Bio import Entrez, SeqIO
from Bio.SeqRecord import SeqRecord
# from Bio.Align.Applications import ClustalwCommandline
import os

Set up the `ncbi_downloads` directory, and check that the list of organisms is in the file [`organism_list.txt`](../../datasets/shotgun_sequences/organism_list.txt).

In [13]:
# Create a new directory (if needed) for output/downloads
path = os.path.realpath("..\\..\\datasets\\shotgun_sequences\\ncbi_downloads")
print(path)
os.makedirs(path, exist_ok=True)

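# The organism list sits next to the download directory, not inside it
organism_list = os.path.join(os.path.dirname(path), "organism_list.txt")

if not os.path.isfile(organism_list):
    raise FileNotFoundError(f"No organism list at {organism_list}.")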

E:\Kaede\Documents\GitHub\BIOL417_CORAL\datasets\shotgun_sequences\ncbi_downloads


In [14]:
Entrez.email = "kaedeito@student.ubc.ca"
# This line sets the name of the tool that is making the queries
Entrez.tool = "alignment.ipynb"

## Sample search to download from NCBI
Here is an example using the generic search terms "Endozoicomonas" and "coral" in NCBI's GenBank nucleotide database.

The `Entrez.esearch(...)` function is documented [here](https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch).

In [15]:
from Bio.Entrez.Parser import DictionaryElement, IntegerElement, ListElement, NoneElement


def search_sample_coral_bacteria():
  term = "Endozoicomonas[Organism] AND coral[All Fields] AND biomol_genomic[PROP]"
  handle = Entrez.esearch(db="nucleotide",
    term=term,
    retmax=10,
    idtype="acc",
  )

  # Entrez.read parses the XML returned by NCBI into Python objects;
  # the matching accession IDs are stored under the 'IdList' key.
  record: DictionaryElement | ListElement | IntegerElement | NoneElement | None = Entrez.read(handle)
  handle.close()
  if isinstance(record, DictionaryElement):
    print(record['IdList'])

search_sample_coral_bacteria()

['NZ_CP120717.1', 'CP120717.1', 'NZ_JAHHPM000000000.1', 'NZ_JAHHPM010000603.1', 'NZ_JAHHPM010000602.1', 'NZ_JAHHPM010000601.1', 'NZ_JAHHPM010000600.1', 'NZ_JAHHPM010000599.1', 'NZ_JAHHPM010000598.1', 'NZ_JAHHPM010000597.1']
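
The ID list above contains both `NZ_CP120717.1` and `CP120717.1`. In NCBI's accession scheme the `NZ_` prefix marks a RefSeq record, so a pair like this is likely the same sequence listed in two databases, and an exact comparison of IDs would treat the two forms as distinct. Below is a minimal sketch for spotting such pairs. The helper `refseq_pairs` is hypothetical and not part of the notebook, and it assumes both forms carry the same version suffix.

```python
def refseq_pairs(accessions: list[str]) -> list[tuple[str, str]]:
    """Return (RefSeq, GenBank) pairs where both forms appear in the list."""
    present = set(accessions)
    pairs = []
    for acc in accessions:
        # Assumption: stripping "NZ_" from a RefSeq accession gives the
        # GenBank accession it was derived from (same version suffix).
        if acc.startswith("NZ_") and acc[3:] in present:
            pairs.append((acc, acc[3:]))
    return pairs


# First three IDs from the search output above
ids = ["NZ_CP120717.1", "CP120717.1", "NZ_JAHHPM000000000.1"]
print(refseq_pairs(ids))  # [('NZ_CP120717.1', 'CP120717.1')]
```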


## Download sequences for our organism list
Read the list of organisms to search from the file `organism_list.txt`.

Add the search term "coral" to each organism name, limit the results to 10 per taxon, and restrict the search to a specific range of sequence lengths.

In [16]:

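# Read one organism name per line; the with-block closes the file afterwards
with open("../../datasets/shotgun_sequences/organism_list.txt") as org_file:
  list_orgs = org_file.read().splitlines()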

def load_records():
  # One list of accession IDs per organism, in the same order as list_orgs
  list_of_accession_id: list[list[str]] = []

  min_seq_length = 1200
  max_seq_length = 10000

  # limit length of sequence
  length_search = f"({min_seq_length}[SLEN] : {max_seq_length}[SLEN])"

  for org in list_orgs:
    term = f"(\"{org}\"[Organism] OR {org}[All Fields]) AND coral[All Fields] AND {length_search} AND biomol_genomic[PROP]"
    handle = Entrez.esearch(db="nucleotide",
      term=term,
      retmax=10,
      idtype="acc",
    )
    record = Entrez.read(handle)
    if isinstance(record, DictionaryElement):
      list_of_accession_id.append(record['IdList'])

  return list_of_accession_id
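
To check what is actually sent to NCBI, it can help to build the query string for a single organism without calling `Entrez.esearch`. The sketch below mirrors the f-string used in `load_records()`. The helper name `build_search_term` and its default arguments are hypothetical, chosen to match the values in the cell above.

```python
def build_search_term(org: str, min_len: int = 1200, max_len: int = 10000) -> str:
    """Return the Entrez query string for one organism (hypothetical helper)."""
    # Restrict hits to sequences whose length lies between min_len and max_len
    length_search = f"({min_len}[SLEN] : {max_len}[SLEN])"
    return (
        f'("{org}"[Organism] OR {org}[All Fields]) '
        f"AND coral[All Fields] AND {length_search} AND biomol_genomic[PROP]"
    )


print(build_search_term("Endozoicomonas montiporae"))
```

Printing the term for one or two organisms before running the full loop makes it easier to paste the query into the NCBI web search and compare the hits by eye.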


A class to hold the search results for one organism, with methods to save them to a FASTA file and to load them back as `SeqRecord` objects.

In [17]:
from dataclasses import dataclass
from io import TextIOWrapper

@dataclass
class OrganismAccession:
  ids: list[str]
  organism: str
  handle: TextIOWrapper
  folder: str
  records: list[SeqRecord]

  def __init__(self, ids, organism):
    self.ids = ids
    self.organism = organism
    self.folder = f"..\\..\\datasets\\shotgun_sequences\\ncbi_downloads\\{self.organism}"
    os.makedirs(self.folder, exist_ok=True)
    self.handle = Entrez.efetch(db="nucleotide", id=ids, rettype="fasta", retmode="text")

  def download_save(self):
    with open(f"{self.folder}\\nucleotide.fasta", "w") as out_handle:
      out_handle.write(self.handle.read())
      print(f"{self.organism} Saved")

  def get_record(self):
    self.records: list[SeqRecord] = list(SeqIO.parse(f"{self.folder}\\nucleotide.fasta", "fasta"))

In [18]:
def download_sequences():
  """
  Download sequences from NCBI for the accession IDs returned by load_records()
  and save them in a folder named after the organism.
  Load the SeqRecords from the saved fasta file.
  """
  organisms_records: list[SeqRecord] = []

  len_org = len(list_orgs)
  list_of_accession_id = load_records()
  len_results = len(list_of_accession_id)
  if len_org != len_results:
    raise Exception("Length of organism list and results list are not the same")

  for i in range(len_org):
    org = list_orgs[i]
    ids = list_of_accession_id[i]
    organism_accession = OrganismAccession(ids, org)
    organism_accession.download_save()
    organism_accession.get_record()
    organisms_records.extend(organism_accession.records)

  return organisms_records


def remove_duplicates(organisms_records: list[SeqRecord]):
  """
  Remove duplicate records from the list of SeqRecords using ID.
  """
  # A set gives constant-time membership checks on the IDs seen so far
  seen_ids: set[str] = set()

  retained_records: list[SeqRecord] = []
  for fasta in organisms_records:
    # Keep only the first record encountered for each ID
    if fasta.id not in seen_ids:
      seen_ids.add(fasta.id)
      retained_records.append(fasta)

  return retained_records

In [19]:
organisms_records = download_sequences()

Endozoicomonas sp. ONNA2 Saved
Endozoicomonas acroporae Saved
Endozoicomonas sp. ONNA1 Saved
Endozoicomonas sp. SESOKO3 Saved
Endozoicomonas sp. SESOKO2 Saved
Endozoicomonas sp. SESOKO4 Saved
Endozoicomonas sp. YOMI1 Saved
Endozoicomonas sp. SESOKO1 Saved
Endozoicomonas sp. ISHI1 Saved
Endozoicomonas montiporae Saved
Endozoicomonas sp. G2_1 Saved
Endozoicomonas sp. Saved
Endozoicomonas sp. G2_2 Saved
Endozoicomonas coralli Saved
Endozoicomonas gorgoniicola Saved
Endozoicomonas euniceicola Saved


In [20]:
organisms_records = remove_duplicates(organisms_records)

In [21]:
SeqIO.write(organisms_records, os.path.join("../../outputs/pilot_3_shotgun", "all_unaligned.fasta"), "fasta")
print("Done writing unaligned fasta")

Done writing unaligned fasta


## Align the sequences
I aligned the [unaligned sequences](../../outputs/pilot_3_shotgun/all_unaligned.fasta) with MUSCLE v5.1 {cite:p}`Edgar_2022`.

```bash
muscle5.1 -input ../../outputs/pilot_3_shotgun/all_unaligned.fasta -output ../../outputs/pilot_3_shotgun/local/muscle/all_aligned.fasta
```
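
One quick sanity check after alignment is that every sequence in the output has the same length, since an aligner pads sequences with gap characters. The sketch below is a minimal, standard-library-only version of that check. The function names `read_fasta` and `is_aligned` are hypothetical, the toy data is made up for illustration, and in the notebook `SeqIO.parse` could replace the hand-rolled parser.

```python
def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {record id: sequence} (minimal, hypothetical helper)."""
    records: dict[str, str] = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record id is the first whitespace-delimited token of the header
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so concatenate them
            records[current_id] += line
    return records


def is_aligned(records: dict[str, str]) -> bool:
    """True if all sequences share one length, as expected of an alignment."""
    return len({len(seq) for seq in records.values()}) <= 1


# Toy data for illustration only, not real MUSCLE output
toy = ">seq1 example\nACG-T\nAA\n>seq2\nACGGTAA\n"
print(is_aligned(read_fasta(toy)))  # True
```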
