# Genome Assembly 2
**Assembling Complex Genomes: Short Reads vs. Long Reads**


Xanthomonas genomes are known for their complexity, characterized by numerous repeat elements and TAL effectors with highly repetitive sequences. These features pose significant challenges for genome assembly, requiring careful consideration of the sequencing technology and assembly methods used.

In this section, we will first assemble a Xanthomonas bacterial genome using Illumina short-read sequences. Short reads often struggle with repetitive regions, which can lead to fragmented assemblies. To address this, we will then perform a long-read assembly, which has the potential to resolve these repetitive elements more effectively.

By comparing the strengths and weaknesses of these two sequencing methods, we aim to evaluate their performance in assembling the challenging Xanthomonas genome.

## Install dependencies and tools

**Install miniconda**

In [None]:
# @title
!wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!bash ./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')
!conda config --add channels defaults
!conda config --add channels bioconda
!conda config --add channels conda-forge

**Install FastQC, Trim Galore, NCBI Datasets CLI, SPAdes, NanoPlot, Filtlong, Flye, QUAST, and pysradb**

In [None]:
# @title
!conda install bioconda::fastqc -y
!conda install trim-galore -y
!conda install -c conda-forge ncbi-datasets-cli -y
!conda install bioconda::spades -y
!conda install bioconda::nanoplot -y
!conda install -c bioconda filtlong -y
!conda install bioconda::flye -y
!conda install -c bioconda quast -y
!conda install bioconda::pysradb -y

# Short reads assembly

Fetch Illumina sequences and run SPAdes

In [None]:
!pysradb search --title "Xanthomonas oryzae pv. oryzae"

In [None]:
!wget https://zenodo.org/record/14018699/files/SRR30576374_1.fastq.gz
!wget https://zenodo.org/record/14018699/files/SRR30576374_2.fastq.gz

Run quality control for the Illumina reads

In [None]:
!fastqc SRR30576374_1.fastq.gz
!fastqc SRR30576374_2.fastq.gz

**Filter and Clip Sequences**

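Trim Galore removes adapters and trims low-quality read ends (its default Phred cutoff is 20). In addition, we clip the first 15 bases (`--clip_R1`, `--clip_R2`) and the last 10 bases (`--three_prime_clip_R1`, `--three_prime_clip_R2`) of each read, where the FastQC report typically shows biased nucleotide composition. The `--fastqc` flag reruns FastQC on the trimmed reads.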

In [None]:
!trim_galore --paired --clip_R1 15 --clip_R2 15 --three_prime_clip_R1 10 --three_prime_clip_R2 10 --fastqc SRR30576374_1.fastq.gz SRR30576374_2.fastq.gz
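To make the Phred cutoff concrete, here is a minimal sketch of how a quality score maps to an error probability, using the standard definition Q = -10 * log10(P). The function names are illustrative and are not part of Trim Galore.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Print the error rate implied by a few common quality thresholds.
for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability = {phred_to_error_probability(q):.4%}")
```

A cutoff of Q20 therefore corresponds to an expected error rate of about 1 in 100 base calls, and Q30 to about 1 in 1,000.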

**Run the Illumina assembler SPAdes using the `--isolate` option.** This option is designed for cases where the reads originate from a single, pure isolate.

In [None]:
!spades.py --isolate -1 SRR30576374_1_val_1.fq.gz -2 SRR30576374_2_val_2.fq.gz -o spades_output

Your results are in `spades_output`. We will compare them with the long-read assembly.

# Long reads assembly

Fetch PacBio HiFi sequences and run the long-read assembler Flye

In [None]:
!wget https://zenodo.org/record/14018699/files/SRR30576370.fastq.gz

Run Quality Control for Long Reads

NanoPlot is a tool designed for quality control of Oxford Nanopore long reads. However, it can also be adapted for use with PacBio HiFi long reads to perform simple QC analysis.

In [None]:
!NanoPlot --fastq SRR30576370.fastq.gz -o nanoplot_output

Remove reads shorter than 1 kb and keep the best 90% of the remaining read bases

In [None]:
!filtlong --min_length 1000 --keep_percent 90 SRR30576370.fastq.gz | gzip > filtered_SRR30576370.fastq.gz
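As a rough illustration of what the length threshold does, the sketch below counts how many reads and bases would survive a minimum-length cutoff. It assumes standard four-line FASTQ records and covers only the length criterion; it does not reproduce Filtlong's quality scoring or the `--keep_percent` step. The function names are hypothetical helpers, not part of Filtlong.

```python
import gzip


def fastq_read_lengths(path):
    """Yield the sequence length of each record in a 4-line FASTQ file (plain or .gz)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # In a 4-line FASTQ record, the second line holds the sequence.
            if line_number % 4 == 1:
                yield len(line.strip())


def summarize_length_filter(lengths, min_length=1000):
    """Count reads and bases before and after dropping reads shorter than min_length."""
    lengths = list(lengths)
    kept = [n for n in lengths if n >= min_length]
    return {
        "reads_total": len(lengths),
        "reads_kept": len(kept),
        "bases_total": sum(lengths),
        "bases_kept": sum(kept),
    }


# Example usage on the file downloaded above (uncomment to run):
# print(summarize_length_filter(fastq_read_lengths("SRR30576370.fastq.gz")))
```

Comparing `bases_kept` with `bases_total` gives a quick sense of how much data the 1 kb threshold alone removes before the NanoPlot report is regenerated.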

Run QC again and check results

In [None]:
!NanoPlot --fastq filtered_SRR30576370.fastq.gz -o filtered_nanoplot_output

Run Long-Read Assembler - Flye

Run the long-read assembler Flye using only a subset of reads that provide 50x coverage of the genome. This approach helps conserve computational resources. The coverage can be increased as needed based on specific requirements.

In [None]:
!flye --asm-coverage 50 --pacbio-hifi filtered_SRR30576370.fastq.gz -o flye_results --genome-size 5000000
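The coverage arithmetic behind these options can be sketched as follows. Coverage is taken here as total read bases divided by genome size; the 5 Mb genome size matches the `--genome-size` value above, and the helper names are illustrative only. The total number of bases in your data set should be taken from the NanoPlot summary and is not assumed here.

```python
def estimated_coverage(total_bases: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases divided by genome size."""
    return total_bases / genome_size


def bases_for_target_coverage(target_coverage: float, genome_size: int) -> int:
    """Number of read bases needed to reach a target depth of coverage."""
    return int(target_coverage * genome_size)


GENOME_SIZE = 5_000_000  # same value passed to --genome-size

# Bases corresponding to 50x coverage of a 5 Mb genome.
print(bases_for_target_coverage(50, GENOME_SIZE))  # 250000000

# To estimate the depth of your own data, plug in the total bases reported
# by NanoPlot (the variable below is a placeholder you would define yourself):
# print(estimated_coverage(total_bases_from_nanoplot, GENOME_SIZE))
```

If the estimated coverage of the filtered reads is already close to or below 50x, the `--asm-coverage` setting is unlikely to reduce the workload much.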

# Assembly stats and comparisons

Compare Both Assemblies Using QUAST

Evaluate metrics such as the number of contigs, total assembly size, and N50.

In [None]:
!mkdir -p assemblies/
!cp spades_output/contigs.fasta assemblies/spades_contigs.fasta
!cp flye_results/assembly.fasta assemblies/flye_contigs.fasta
!quast assemblies/spades_contigs.fasta assemblies/flye_contigs.fasta
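To see what N50 measures, here is a minimal Python sketch that computes it directly from a FASTA file: contigs are sorted from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the assembly size. The helper names are illustrative. Results may differ slightly from the QUAST report, because QUAST by default ignores contigs shorter than 500 bp.

```python
def read_contig_lengths(fasta_path):
    """Return the length of every sequence in an uncompressed FASTA file."""
    lengths = []
    current = 0
    seen_header = False
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if seen_header:
                    lengths.append(current)
                seen_header = True
                current = 0
            elif seen_header:
                current += len(line)
    if seen_header:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the cumulative sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Example usage on the assemblies copied above (uncomment to run):
# for name in ("assemblies/spades_contigs.fasta", "assemblies/flye_contigs.fasta"):
#     lengths = read_contig_lengths(name)
#     print(name, "contigs:", len(lengths), "total:", sum(lengths), "N50:", n50(lengths))
```

A fragmented assembly with many short contigs has a low N50, while an assembly that resolves repeats into a few long contigs has an N50 close to the size of its largest replicon.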