# Align and Download

We use the Biopython package to download our SRA data from NCBI. The SRA Toolkit then converts the SRA files to FASTQ files, bowtie2 aligns the FASTQ reads to the reference genome, and samtools converts the resulting SAM files to BAM files and sorts them.

> We will then use bedtools to convert the BAM files to BED files.
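
The steps described above can be sketched as a list of command lines built in Python. This is only a minimal sketch under stated assumptions, not the notebook's own code: the run accession `SRR000001`, the file names, and the helper `build_pipeline_commands` are hypothetical placeholders, and the flags shown are the basic ones for single-end reads.

```python
from pathlib import Path


def build_pipeline_commands(run_id, reference_fasta, workdir="alignment"):
    """Return command lines for download -> FASTQ -> SAM -> sorted BAM -> BED.

    All names and paths here are illustrative placeholders (assumptions).
    """
    work = Path(workdir)
    index_prefix = work / "ref_index"
    fastq = work / f"{run_id}.fastq"  # assumes single-end reads
    sam = work / f"{run_id}.sam"
    bam = work / f"{run_id}.bam"
    sorted_bam = work / f"{run_id}.sorted.bam"
    return [
        ["prefetch", run_id],                                # download the SRA file
        ["fasterq-dump", run_id, "--outdir", str(work)],     # SRA -> FASTQ
        ["bowtie2-build", str(reference_fasta), str(index_prefix)],
        ["bowtie2", "-x", str(index_prefix), "-U", str(fastq), "-S", str(sam)],
        ["samtools", "view", "-b", "-o", str(bam), str(sam)],   # SAM -> BAM
        ["samtools", "sort", "-o", str(sorted_bam), str(bam)],  # sort the BAM
        ["bedtools", "bamtobed", "-i", str(sorted_bam)],        # BAM -> BED (stdout)
    ]


commands = build_pipeline_commands("SRR000001", "reference.fasta")
for command in commands:
    print(" ".join(command))
```

Each list could be passed to `subprocess.run(command, check=True)` once the tools are installed. `bedtools bamtobed` writes to standard output, so its result would need to be redirected to a `.bed` file.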

For how to download from NCBI, see [this guide](https://notebook.community/widdowquinn/Notebooks-Bioinformatics/Biopython_NCBI_Entrez_downloads).

In [4]:
from Bio import Entrez, SeqIO
# from Bio.Align.Applications import ClustalwCommandline
import os

In [5]:
# Create a new directory (if needed) for output/downloads
outdir = "ncbi_downloads"
os.makedirs(outdir, exist_ok=True)

In [6]:
Entrez.email = "kaedeito@student.ubc.ca"
# This line sets the name of the tool that is making the queries
Entrez.tool = "alignment.ipynb"

## Download reference genome

In [None]:
# The line below uses the Entrez.einfo() function to
# ask NCBI what databases are available. The result is
# 'stored' in a variable called 'handle'
handle = Entrez.einfo()

# In the line below, the response from NCBI is read
# into a record, that organises NCBI's response into
# something you can work with.
record = Entrez.read(handle)

# The line below carries out a search of the `assembly` database at NCBI,
# using the phrase `Ralstonia solanacearum` as the search query,
# and asks NCBI to return up to the first 100 results
handle = Entrez.esearch(db="assembly", term="Ralstonia solanacearum", retmax=100)

# This line converts the returned information from NCBI into a form we
# can use, as before.
record = Entrez.read(handle)
handle.close()
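
The parsed search result behaves like a dictionary, so the hit count and the returned UIDs can be read from it directly. The sketch below assumes the standard `Count` and `IdList` fields of an ESearch response. The helper name `summarise_search` and the sample values are hypothetical, and a plain dictionary stands in for the real record so the example runs without contacting NCBI.

```python
def summarise_search(record):
    """Return (total_hits, returned_ids) from a parsed ESearch record."""
    total_hits = int(record["Count"])       # NCBI reports the count as a string
    returned_ids = list(record["IdList"])   # UIDs, capped by retmax
    return total_hits, returned_ids


# Stand-in for the object returned by Entrez.read(handle); the values are made up.
example_record = {"Count": "250", "RetMax": "3", "IdList": ["111", "222", "333"]}

total, ids = summarise_search(example_record)
print(f"{total} assemblies match; {len(ids)} UIDs returned")
```

If the total is larger than the number of UIDs returned, the search was capped by `retmax` and would need a larger value or paging to retrieve the rest.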

## Download our list of files

In [19]:
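def download_save(outdir, list_genomes: list[str]):
  # Save each accession as a FASTA file under ncbi_downloads/<outdir>
  target_dir = os.path.join("ncbi_downloads", outdir)
  os.makedirs(target_dir, exist_ok=True)
  records = []
  for accession_id in list_genomes:
    handle = Entrez.efetch(db="nucleotide", id=accession_id, rettype="fasta", retmode="text")
    # A handle can only be read once, so keep the FASTA text rather than the handle
    fasta_text = handle.read()
    handle.close()
    records.append(fasta_text)
    with open(os.path.join(target_dir, f"{accession_id}.fasta"), "w") as out_handle:
      out_handle.write(fasta_text)
    print(f"{os.path.join(outdir, accession_id)} Saved")
  return records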
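
After a download it can be worth checking that each saved file really contains sequence records, since a failed request may leave an empty or malformed file. The following is a minimal standard-library sketch. The helper `fasta_headers` is hypothetical, and the demonstration uses a small made-up file rather than a real download.

```python
import os
import tempfile


def fasta_headers(path):
    """Return the header lines (without the leading '>') found in a FASTA file."""
    with open(path) as fasta_file:
        return [line[1:].strip() for line in fasta_file if line.startswith(">")]


# Demonstration on a temporary, made-up FASTA file.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.fasta")
    with open(demo_path, "w") as out_handle:
        out_handle.write(">seq1 example record\nACGTACGT\n>seq2\nGGCC\n")
    headers = fasta_headers(demo_path)

print(headers)
```

An empty list flags a file worth downloading again. Biopython's `SeqIO.parse(path, "fasta")` would give the same check with full sequence parsing.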

In [15]:
# list_genomes = ["GCF_001583435.1", "GCF_027366395.1", "GCF_023716865.1", "GCF_023822025.1", "GCF_032883915.1"]
with open("ncbi_downloads/list_16S.txt") as list_file:
  list_genomes_16S = list_file.read().splitlines()
records_16S = download_save("list_16S", list_genomes_16S)

Saved
Saved
Saved
Saved
Saved
Saved
Saved


In [21]:
list_genomes_16S_sp = open("ncbi_downloads/list_16S_sp.txt").read().splitlines()
records_16S_sp = download_save("list_16S_sp", list_genomes_16S_sp)

list_16S_sp\LN879492.1 Saved
list_16S_sp\KX780138.1 Saved
list_16S_sp\LN875493.1 Saved
list_16S_sp\MK633876.1 Saved
list_16S_sp\MH201322.1 Saved
list_16S_sp\LN626318.1 Saved
list_16S_sp\OQ618154.1 Saved
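
Both accession lists above are read with `splitlines()`, which passes blank lines and stray whitespace straight through to the download function. A slightly more defensive reader could look like the sketch below. The helper `read_accessions` and the temporary file are hypothetical, and the two accessions are reused from the output above purely as sample text.

```python
import os
import tempfile


def read_accessions(path):
    """Read one accession per line, skipping blank lines and repeated entries."""
    accessions = []
    with open(path) as list_file:
        for line in list_file:
            accession = line.strip()
            if accession and accession not in accessions:
                accessions.append(accession)
    return accessions


# Demonstration on a temporary list file with a blank line, trailing space, and a repeat.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_list = os.path.join(tmpdir, "list_demo.txt")
    with open(demo_list, "w") as out_handle:
        out_handle.write("LN879492.1\n\nKX780138.1 \nLN879492.1\n")
    accessions = read_accessions(demo_list)

print(accessions)
```

The returned list keeps the order of the file, so it could be passed to `download_save` in place of the raw `splitlines()` result.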
