# Genome Assembly 3-1
**Assembling Genomes: Short Reads vs. Long Reads**

**Xanthomonas Assembly**

Xanthomonas genomes are known for their complexity, characterized by numerous repeat elements and TAL effectors with highly repetitive sequences. These features pose significant challenges for genome assembly, requiring careful consideration of the sequencing technology and assembly methods used.

In this section, we will first assemble a Xanthomonas bacterial genome using Illumina short-read sequences. Short reads often struggle with repetitive regions, which can lead to fragmented assemblies. To address this, we will then perform a long-read assembly, which has the potential to resolve these repetitive elements more effectively.

By comparing the strengths and weaknesses of these two sequencing methods, we aim to evaluate their performance in assembling the challenging Xanthomonas genome.
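To make the repeat problem concrete, here is a toy sketch that is not part of the original workflow. It counts k-mers in a made-up sequence; the sequence, the function name `repeated_kmers`, and the values of k are illustrative assumptions. A k-mer seen more than once marks a position where fragments of that length cannot tell the copies apart, which is roughly the situation a short-read assembler faces inside a repeat.

```python
from collections import Counter

def repeated_kmers(seq: str, k: int) -> dict[str, int]:
    """Return the k-mers that occur more than once in seq (toy illustration)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n > 1}

# Hypothetical sequence: the unit "ACGTTGCA" appears twice, separated by unique sequence.
toy = "GGATC" + "ACGTTGCA" + "TTAGGC" + "ACGTTGCA" + "CCTAG"

# Short k: k-mers that fall inside the repeat are ambiguous.
print(repeated_kmers(toy, 6))

# k longer than the repeat: every k-mer reaches into a unique flank.
print(repeated_kmers(toy, 12))
```

Long reads play the role of the larger k here: a read that spans the whole repeat plus unique sequence on both sides can be placed unambiguously.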

## Install dependencies and tools

**Install miniconda**

In [None]:
# @title
!wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!bash ./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')
!conda config --add channels defaults
!conda config --add channels bioconda
!conda config --add channels conda-forge

In [None]:
!conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
!conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

accepted Terms of Service for https://repo.anaconda.com/pkgs/main
accepted Terms of Service for https://repo.anaconda.com/pkgs/r


**Install NanoPlot, Filtlong, Flye, QUAST, pysradb, ncbi-datasets-cli, and FastQC**

Trim Galore and SPAdes are installed in later cells.

In [None]:
# @title
!conda install -c conda-forge ncbi-datasets-cli -y
!conda install bioconda::nanoplot -y
!conda install -c bioconda filtlong -y
!conda install bioconda::flye -y
!conda install -c bioconda quast -y
!conda install bioconda::pysradb -y

In [None]:
!apt-get update -qq
!apt-get install -y fastqc

# Short-read assembly

Fetch Illumina sequences and run SPAdes

In [None]:
!pysradb search --title "Xanthomonas oryzae pv. oryzae"

In [None]:
!wget https://zenodo.org/record/14018699/files/SRR30576374_1.fastq.gz
!wget https://zenodo.org/record/14018699/files/SRR30576374_2.fastq.gz

Run quality control for the Illumina reads

In [None]:
!fastqc SRR30576374_1.fastq.gz
!fastqc SRR30576374_2.fastq.gz
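Besides the HTML report, FastQC writes a zip archive that contains a `summary.txt` file with one tab-separated line per module (status, module name, file name). The sketch below reads that file and lists the modules that did not pass. The archive name `SRR30576374_1_fastqc.zip` and the helper names are assumptions made for illustration; adjust them to the files actually present.

```python
import zipfile
from pathlib import Path

def parse_fastqc_summary(text: str) -> dict[str, str]:
    """Map each FastQC module name to its PASS/WARN/FAIL status."""
    statuses = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        status, module, _filename = line.split("\t", 2)
        statuses[module] = status
    return statuses

def read_fastqc_zip(zip_path: str) -> dict[str, str]:
    """Read summary.txt from a FastQC zip archive (archive layout is an assumption)."""
    with zipfile.ZipFile(zip_path) as archive:
        name = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        return parse_fastqc_summary(archive.read(name).decode())

zip_path = Path("SRR30576374_1_fastqc.zip")  # assumed FastQC output name
if zip_path.exists():
    for module, status in read_fastqc_zip(str(zip_path)).items():
        if status != "PASS":
            print(f"{status}\t{module}")
```

This gives a quick text overview; the HTML report remains the place to inspect the plots themselves.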

In [None]:
import os
from IPython.display import display, HTML  # IPython.core.display.display is deprecated

# Ask the user for the file name they want to display
file_name = input("Enter the name of the HTML file you want to display (include .html extension): ")

# Check if the file exists
if os.path.exists(file_name):
    # Open and read the HTML file
    with open(file_name, 'r') as file:
        html_content = file.read()
    display(HTML(html_content))  # Display the HTML content
else:
    print(f"File '{file_name}' not found. Please ensure the file exists in the current directory.")


In [None]:
%%bash
# 1️⃣  Install micromamba (lightweight conda)
wget -qO- https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba > /dev/null

# 2️⃣  Initialize the shell
eval "$(./bin/micromamba shell hook -s bash)"

# 3️⃣  Create an isolated environment (Python 3.12 avoids cutadapt conflict)
micromamba create -y -n tg -c conda-forge -c bioconda python=3.12 trim-galore cutadapt fastqc

# 4️⃣  Confirm installation
micromamba run -n tg trim_galore --version
micromamba run -n tg cutadapt --version


**Filter and Clip Sequences**

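Trim Galore quality-trims read ends that fall below a Phred score of 20 (its default cutoff) and removes adapter sequences. In addition, the command below clips 15 bases from the 5' end and 10 bases from the 3' end of each read, where nucleotide composition is often biased.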
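As a reminder of what a Phred cutoff of 20 means, the following minimal sketch converts Phred scores to error probabilities using P = 10^(-Q/10) and decodes a FASTQ quality string with the Phred+33 offset used by current Illumina data. The function names and the example quality string are hypothetical.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def decode_phred33(quality_string: str) -> list[int]:
    """Decode a FASTQ quality string that uses the Phred+33 ASCII offset."""
    return [ord(ch) - 33 for ch in quality_string]

# Q20 corresponds to roughly 1 expected error per 100 bases.
print(phred_to_error_prob(20))

# Hypothetical quality string: quality drops toward the 3' end.
qualities = decode_phred33("II5+#")
print(qualities)
print([q >= 20 for q in qualities])  # which bases meet the Q20 threshold
```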

In [None]:
%%bash
# Activate micromamba shell
eval "$(./bin/micromamba shell hook -s bash)"
micromamba run -n tg trim_galore --paired --clip_R1 15 --clip_R2 15 --three_prime_clip_R1 10 --three_prime_clip_R2 10 --fastqc SRR30576374_1.fastq.gz SRR30576374_2.fastq.gz


In [None]:
import os
from IPython.display import display, HTML  # IPython.core.display.display is deprecated

# Ask the user for the file name they want to display
file_name = input("Enter the name of the HTML file you want to display (include .html extension): ")

# Check if the file exists
if os.path.exists(file_name):
    # Open and read the HTML file
    with open(file_name, 'r') as file:
        html_content = file.read()
    display(HTML(html_content))  # Display the HTML content
else:
    print(f"File '{file_name}' not found. Please ensure the file exists in the current directory.")

**Run the Illumina assembler SPAdes with the --isolate option, which is designed for reads that originate from a single, pure isolate.**

Get SPAdes

In [None]:
#2022-10-21 Steven Tang
#!wget http://cab.spbu.ru/files/release3.9.0/SPAdes-3.9.0-Linux.tar.gz
#!tar -xzf SPAdes-3.9.0-Linux.tar.gz

#2023-09-15 Renald Legaspi
#Updated: SPAdes 3.9 to 3.15, since 3.9 no longer runs on Colab with the Python version it now provides.
#Fix: No longer installs the Linux tarball due to a segmentation fault issue. SPAdes is now compiled from source.
# !wget http://cab.spbu.ru/files/release3.15.5/SPAdes-3.15.5.tar.gz
# !tar -xzf SPAdes-3.15.5.tar.gz
# !cd SPAdes-3.15.5
# !./SPAdes-3.15.5/spades_compile.sh

#2023-09-18 Steven Tang
#Fix: Use precompiled SPAdes that works with Colab
!wget https://github.com/steventango/colab-spades/releases/download/v3.15.5/SPAdes-3.15.5-Colab.tar.gz
!tar -xzf SPAdes-3.15.5-Colab.tar.gz


from datetime import datetime
from google.colab import files
from pathlib import Path
import subprocess

Run Spades

In [None]:
# --careful tries to reduce the number of mismatches and short indels.
# It also runs MismatchCorrector, a post-processing tool that uses BWA.
# Recommended mostly for small and/or low-complexity genomes.

#2022-10-21 Steven Tang
#careful_mode = True

#2023-09-15 Renald Legaspi
#Updated: Careful mode may cause spades.py to crash due to insufficient RAM
# Kept for reference only: it is not passed to SPAdes, and the SPAdes manual
# lists --isolate as incompatible with --careful.
careful_mode = False

#2023-09-15 Renald Legaspi
#Colab no longer implements python2; thus 'python /path/spades.py' is used instead of 'python2 /path/spades.py'
pe1_filename = "SRR30576374_1_val_1.fq.gz"
pe2_filename = "SRR30576374_2_val_2.fq.gz"
output_directory = "spades_output"

process = subprocess.run(
    f'python ./bin/spades.py --isolate -1 {pe1_filename} -2 {pe2_filename} -o {output_directory}',
    capture_output=True,
    text=True,
    shell=True
)

# Show the SPAdes log only if the run failed
if process.returncode != 0:
    print(process.stdout)
    print(process.stderr)

The assembly is written to spades_output. We will compare it with the long-read assembly later.
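For a quick look at the short-read result before the full comparison, the sketch below parses the contig headers in `spades_output/contigs.fasta`. It assumes the header layout SPAdes 3.x typically writes (`>NODE_<id>_length_<bp>_cov_<k-mer coverage>`); the regular expression and the function name are illustrative and should be adjusted if the headers differ.

```python
import re
from pathlib import Path

# Assumed SPAdes header layout: >NODE_1_length_237403_cov_243.207
HEADER = re.compile(r"^>NODE_(\d+)_length_(\d+)_cov_([\d.]+)")

def parse_spades_headers(fasta_text: str) -> list[tuple[int, int, float]]:
    """Return (node id, length, k-mer coverage) for each SPAdes contig header."""
    records = []
    for line in fasta_text.splitlines():
        match = HEADER.match(line)
        if match:
            records.append((int(match[1]), int(match[2]), float(match[3])))
    return records

contigs = Path("spades_output/contigs.fasta")
if contigs.exists():
    records = parse_spades_headers(contigs.read_text())
    print(f"{len(records)} contigs, {sum(r[1] for r in records):,} bp in total")
    # Show the five longest contigs
    for node, length, cov in sorted(records, key=lambda r: -r[1])[:5]:
        print(f"NODE_{node}: {length:,} bp, coverage {cov:.1f}")
```

A large number of short contigs here is the kind of fragmentation expected when repeats are longer than the reads.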

# Long-read assembly

Fetch PacBio HiFi sequences and run the long-read assembler Flye

In [None]:
!wget https://zenodo.org/record/14018699/files/SRR30576370.fastq.gz

Run Quality Control for Long Reads

NanoPlot is a tool designed for quality control of Oxford Nanopore long reads. However, it can also be adapted for use with PacBio HiFi long reads to perform simple QC analysis.

In [None]:
!NanoPlot --fastq SRR30576370.fastq.gz -o nanoplot_output

Remove reads shorter than 1 kb and keep the best 90% of the read bases

In [None]:
!filtlong --min_length 1000 --keep_percent 90 SRR30576370.fastq.gz | gzip > filtered_SRR30576370.fastq.gz


Scoring long reads
  213,987 reads (1,244,010,289 bp)

Filtering long reads
  target: 1,119,609,260 bp
  keeping 1,119,611,298 bp

Outputting passed long reads
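
The numbers in this log can be checked by hand. The short sketch below recomputes the target from the total base count and the `--keep_percent 90` setting; the variable names are illustrative, and the values are copied from the output above. The small overshoot of kept bases over the target is presumably because reads are kept whole rather than truncated.

```python
total_bp = 1_244_010_289         # from the "Scoring long reads" line
keep_percent = 90                # value passed to --keep_percent
reported_target = 1_119_609_260  # from the "target:" line
kept_bp = 1_119_611_298          # from the "keeping" line

# Integer part of 90% of the input bases
target_bp = total_bp * keep_percent // 100
assert target_bp == reported_target

print(f"target: {target_bp:,} bp")
print(f"kept minus target: {kept_bp - target_bp:,} bp")
print(f"fraction of bases kept: {kept_bp / total_bp:.4%}")
```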



Run QC again and check results

In [None]:
!NanoPlot --fastq filtered_SRR30576370.fastq.gz -o filtered_nanoplot_output

Run Long-Read Assembler - Flye

Run the long-read assembler Flye with --asm-coverage 50, which limits the initial disjointig assembly to a subset of the longest reads providing about 50x coverage of the genome size given with --genome-size (5 Mb here). This conserves memory and compute time. The coverage can be increased as needed.
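The idea of a coverage-limited subset can be sketched in a few lines. This is a conceptual illustration under stated assumptions, not Flye's implementation: the function name and the toy read lengths are hypothetical, and the only property borrowed from the Flye documentation is that the subset is taken from the longest reads.

```python
def longest_reads_for_coverage(read_lengths, genome_size, coverage):
    """Pick the longest reads until their total length reaches coverage * genome_size."""
    target_bp = coverage * genome_size
    selected, total = [], 0
    for length in sorted(read_lengths, reverse=True):
        if total >= target_bp:
            break
        selected.append(length)
        total += length
    return selected, total

# Hypothetical read lengths for a toy 1,000 bp genome at 5x coverage.
lengths = [900, 200, 1500, 400, 1200, 800, 1000, 300]
selected, total = longest_reads_for_coverage(lengths, genome_size=1000, coverage=5)
print(selected, total)

# For this notebook: base budget for 50x of a 5 Mb genome.
print(f"{50 * 5_000_000:,} bp")
```

With roughly 1.1 Gb of filtered reads available, a 250 Mb budget means only a fraction of the data is needed for the initial assembly stage.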

Install Flye

In [None]:
!apt-get update
!apt-get install -y build-essential python3-dev zlib1g-dev libbz2-dev liblzma-dev git

# Download Flye source
!git clone https://github.com/fenderglass/Flye.git
%cd Flye

# Build Flye
!make

# Flye executable will be created in bin/
!chmod +x bin/flye

# Return to the working directory so that later cells can use relative paths
%cd /content

# Add Flye to PATH
import os
os.environ["PATH"] += ":/content/Flye/bin"

In [None]:
!flye --asm-coverage 50 --pacbio-hifi /content/filtered_SRR30576370.fastq.gz -o /content/flye_results --genome-size 5000000

# Assembly stats and comparisons

Compare both assemblies using QUAST

Evaluate metrics such as the number of contigs, total assembly size, and N50.

In [None]:
!mkdir -p assemblies/
!cp spades_output/contigs.fasta assemblies/spades_contigs.fasta
!cp flye_results/assembly.fasta assemblies/flye_contigs.fasta
!quast assemblies/spades_contigs.fasta assemblies/flye_contigs.fasta
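
To see how the headline metrics are defined, the following minimal sketch computes contig count, total length, and N50 directly from a FASTA file. The helper names are illustrative. The values may differ slightly from the QUAST report, since QUAST by default only counts contigs of at least 500 bp.

```python
from pathlib import Path

def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

def fasta_lengths(path):
    """Yield the sequence length of each record in a FASTA file."""
    length = None
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if length is not None:
                    yield length
                length = 0
            elif length is not None:
                length += len(line.strip())
    if length is not None:
        yield length

# Toy contig lengths: half of 1,000 bp is reached within the 400 and 300 bp contigs.
print(n50([100, 200, 300, 400]))

for fasta in ("assemblies/spades_contigs.fasta", "assemblies/flye_contigs.fasta"):
    if Path(fasta).exists():
        lengths = list(fasta_lengths(fasta))
        print(f"{fasta}: {len(lengths)} contigs, {sum(lengths):,} bp, N50 {n50(lengths):,}")
```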