# Align and Download

We use the Biopython package to download our SRA data from NCBI. The SRA Toolkit then converts the SRA files to FASTQ, bowtie2 aligns the FASTQ reads to the reference genome, and samtools converts the resulting SAM files to BAM and sorts them.
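
The command-line steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not this notebook's actual pipeline: the accession `SRR0000000`, the index prefix `ref_index`, and the helper name `build_pipeline_commands` are placeholders, the reads are assumed to be single-end, and `fasterq-dump`, `bowtie2`, and `samtools` are assumed to be on the PATH. The function only builds the argument lists and does not run anything.

```python
def build_pipeline_commands(accession, index_prefix, out_dir="ncbi_downloads"):
  """Return one argument list per step, in the order the steps should run."""
  # File names are assumptions: fasterq-dump writes <accession>.fastq
  # for single-end runs.
  fastq = f"{out_dir}/{accession}.fastq"
  sam = f"{out_dir}/{accession}.sam"
  bam = f"{out_dir}/{accession}.bam"
  sorted_bam = f"{out_dir}/{accession}.sorted.bam"
  return [
    # 1. SRA -> FASTQ
    ["fasterq-dump", accession, "-O", out_dir],
    # 2. Align unpaired reads (-U) against a prebuilt index (-x), write SAM (-S)
    ["bowtie2", "-x", index_prefix, "-U", fastq, "-S", sam],
    # 3. SAM -> BAM
    ["samtools", "view", "-b", "-o", bam, sam],
    # 4. Sort the BAM by coordinate
    ["samtools", "sort", "-o", sorted_bam, bam],
  ]

# Placeholder accession and index prefix, for illustration only.
commands = build_pipeline_commands("SRR0000000", "ref_index")
for cmd in commands:
  print(" ".join(cmd))
```

To execute a step, each list could be passed to `subprocess.run(cmd, check=True)`, which stops the pipeline if a tool exits with an error.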

> We will then use bedtools to convert the BAM files to BED files.

The NCBI download steps follow [this guide](https://notebook.community/widdowquinn/Notebooks-Bioinformatics/Biopython_NCBI_Entrez_downloads).

In [None]:
from Bio import Entrez, SeqIO
# from Bio.Align.Applications import ClustalwCommandline
import os

In [None]:
# Create a new directory (if needed) for output/downloads
os.makedirs("ncbi_downloads", exist_ok=True)

In [None]:
Entrez.email = "kaedeito@student.ubc.ca"
# This line sets the name of the tool that is making the queries
Entrez.tool = "alignment.ipynb"

## Sample search to download from NCBI
Here is an example that searches NCBI's GenBank nucleotide database for genomic records matching the organism "Endozoicomonas" and the term "coral".

The `Entrez.esearch(...)` function is documented [here](https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch).

In [None]:


term = "Endozoicomonas[Organism] AND coral[All Fields] AND biomol_genomic[PROP]"
handle = Entrez.esearch(db="nucleotide",
  term=term,
  retmax=10,
  idtype="acc",
)

# Parse the XML returned by NCBI into a dictionary-like record,
# then close the handle.
record = Entrez.read(handle)
handle.close()
print(record['IdList'])
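
Besides `IdList`, the parsed ESearch result reports the total number of matching records in `Count`, which can exceed `retmax`. The following is a minimal sketch of checking whether a result was truncated. The helper name `summarize_search` is hypothetical, and the sample dictionary (with made-up accessions) stands in for a real `Entrez.read` result so the block runs offline.

```python
def summarize_search(record):
  """Return (total matches, IDs returned, whether the ID list was truncated)."""
  total = int(record["Count"])  # Count arrives as a string
  returned = len(record["IdList"])
  return total, returned, total > returned

# Made-up values, for illustration only.
sample = {"Count": "37", "IdList": ["AB000001.1", "AB000002.1"]}
total, returned, truncated = summarize_search(sample)
print(f"{returned} of {total} records returned; truncated: {truncated}")
```

If the list is truncated, raising `retmax` or paging with `retstart` retrieves the remaining IDs.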


## Download sequences for our organism list
Read the list of organisms to search for from "ncbi_downloads/organism_list.txt", one name per line.
Each query also requires the term "coral" and a sequence length of 1,200 to 10,000 bases, and returns at most 10 results per organism.

In [None]:

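# Read one organism name per line, skipping blank lines.
with open("ncbi_downloads/organism_list.txt") as org_file:
  list_orgs = [line.strip() for line in org_file if line.strip()]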

list_of_accession_id = []

min_seq_length = 1200
max_seq_length = 10000

length_search = f"({min_seq_length}[SLEN] : {max_seq_length}[SLEN])"

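for org in list_orgs:
  # Match the name as an organism or as free text, restricted to genomic,
  # coral-associated records within the length window.
  term = f'("{org}"[Organism] OR {org}[All Fields]) AND coral[All Fields] AND {length_search} AND biomol_genomic[PROP]'
  handle = Entrez.esearch(db="nucleotide",
    term=term,
    retmax=10,
    idtype="acc",
  )
  record = Entrez.read(handle)
  handle.close()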
  list_of_accession_id.append(record['IdList'])
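
To see exactly what query string the loop sends to NCBI, the term construction can be pulled into a small function and printed for one organism. This is a minimal sketch; the helper name `build_term` is hypothetical, and its defaults mirror the length limits used in this notebook.

```python
def build_term(org, min_len=1200, max_len=10000):
  """Build the Entrez query for one organism, as in the loop above."""
  length_search = f"({min_len}[SLEN] : {max_len}[SLEN])"
  return (
    f'("{org}"[Organism] OR {org}[All Fields]) '
    f"AND coral[All Fields] AND {length_search} AND biomol_genomic[PROP]"
  )

print(build_term("Endozoicomonas"))
# ("Endozoicomonas"[Organism] OR Endozoicomonas[All Fields]) AND coral[All Fields] AND (1200[SLEN] : 10000[SLEN]) AND biomol_genomic[PROP]
```

Pasting the printed term into the NCBI Nucleotide search box is a quick way to sanity-check the results before running the full loop.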


A class that holds the search results for one organism, with a method that writes the fetched sequences to a FASTA file.

In [None]:
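from dataclasses import dataclass, field
from typing import Any

@dataclass
class OrganismAccession:
  ids: list[str]
  organism: str
  # Set in __post_init__: the output folder and the open efetch handle.
  folder: str = field(init=False)
  handle: Any = field(init=False)

  def __post_init__(self):
    # os.path.join keeps the path portable across operating systems.
    self.folder = os.path.join("ncbi_downloads", self.organism)
    os.makedirs(self.folder, exist_ok=True)
    # Store the handle on the instance so download_save() can read from it.
    self.handle = Entrez.efetch(db="nucleotide", id=self.ids, rettype="fasta", retmode="text")

  def download_save(self):
    with open(os.path.join(self.folder, "nucleotide.fasta"), "w") as out_handle:
      out_handle.write(self.handle.read())
    self.handle.close()
    print(f"{self.organism} saved to {self.folder}")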

In [None]:
if len(list_orgs) != len(list_of_accession_id):
  raise ValueError("Length of organism list and results list are not the same")

for org, ids in zip(list_orgs, list_of_accession_id):
  if not ids:
    # Nothing matched this organism, so there is nothing to fetch.
    print(f"No results for {org}, skipping")
    continue
  organism_accession = OrganismAccession(ids, org)
  organism_accession.download_save()
  print(f"Saved {org}")
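
After the loop finishes, it can be worth confirming that each file holds the expected number of sequences. The following is a minimal sketch that counts FASTA header lines using only the standard library. `count_fasta_records` is a hypothetical helper, and the demo file is made up so the block runs without any downloads. Biopython's `SeqIO.parse(path, "fasta")` would be the more thorough choice, since it also parses the sequences themselves.

```python
import os
import tempfile

def count_fasta_records(path):
  """Count records in a FASTA file by counting header lines (those starting with '>')."""
  with open(path) as fasta:
    return sum(1 for line in fasta if line.startswith(">"))

# Demo on a small made-up file, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
  demo_path = os.path.join(tmp, "nucleotide.fasta")
  with open(demo_path, "w") as demo_file:
    demo_file.write(">seq1 example\nACGT\n>seq2 example\nGGCC\n")
  demo_count = count_fasta_records(demo_path)

print(demo_count)  # 2
```

For the files written by the loop above, the count for each organism should equal the number of accession IDs passed to `efetch`.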