This README is a detailed analysis of aligning two Fragaria iinumae genomes, combining Illumina and Nanopore sequencing data. The files in this GitHub repository are mainly a memento for us in case we continue this project in the future.
Note: As these genomes are not published, we have not uploaded the files here.
- Introduction
- Project
- Material
- Method
- Results
- Conclusions and Future Work
- Commands Used
- Acknowledgements
- References
Fragaria iinumae (2x) is one of the diploid progenitors of cultivated strawberry, Fragaria ×ananassa (8x) (Rousseau-Gueutin et al., 2009; Dimeglio et al., 2014). The Davis Lab at UNH sequenced the genome of F. iinumae in 2017 using the Illumina sequencing platform. This assembly is in pseudochromosome form. However, there are still some gaps or missing reads in the assembly. Recently, the genome of the same plant was sequenced in the Davis Lab using Oxford Nanopore sequencing technology. This data is in its raw form (220 fastq files). The purpose of the Nanopore sequencing is to fill in the gaps in the earlier assembly and check its accuracy.
For Gen 711/811, we are doing one part of this project: aligning the raw Nanopore reads to the pseudochromosomes of the Illumina-sequenced genome. We are interested in checking how accurate the pseudochromosome assembly is and how well the Nanopore reads align to it.
- Reference Genome: Pseudochromosomes assembly of F. iinumae (~198 Mb)
- Query Genome: Raw reads generated from Oxford Nanopore Sequencing (220 fastq files)
Both genomes were sequenced from one plant of F. iinumae.
The Project Workflow:
The major software tools used were BWA, SAMtools, and IGV. The exact commands used in this study are in a shell script file in the directory. Brief details about these tools follow:
BWA is software for aligning sequences against a reference genome. It offers 3 different algorithms (BWA-backtrack, BWA-SW, and BWA-MEM), and choosing the one suited to the read length makes the analysis more accurate. We used the mem algorithm at first because it aligns 70bp-1Mbp query sequences, and ours are 100-200 bp long. This did not generate correct SAM/BAM files, so we were advised to use bwasw, as it is better for long reads. The inputs are the reference genome sequence and the raw reads in fastq format, and the output is a SAM file. A SAM file is a human-readable text file that contains the results of aligning the raw reads to the reference sequence.
BWA can be cloned, built, and run with the following commands (replace "algorithm" with mem or bwasw):
git clone https://github.com/lh3/bwa.git
cd bwa
make
./bwa index reference.fasta
./bwa algorithm reference.fasta rawread.fastq > align.sam
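Because the choice between mem and bwasw depends on read length, it may help to check the lengths in a fastq file before aligning. The following is a minimal sketch, assuming the standard 4-line fastq layout (sequence on every second line of each record); the function name read_length_summary is ours and not part of BWA.

```shell
# read_length_summary FILE
# Print the number of reads and their minimum, maximum, and mean length.
# Assumes a 4-line FASTQ record: header, sequence, "+", qualities.
read_length_summary() {
    awk 'NR % 4 == 2 {
             n++
             len = length($0)
             sum += len
             if (n == 1 || len < min) min = len
             if (n == 1 || len > max) max = len
         }
         END {
             if (n > 0) printf "reads=%d min=%d max=%d mean=%.1f\n", n, min, max, sum / n
             else print "reads=0"
         }' "$1"
}
```

Calling read_length_summary rawread.fastq prints a single summary line that can be compared against the length range each BWA algorithm is designed for.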
SAMtools is software for parsing and manipulating the alignments in SAM/BAM files. We converted the SAM file into a BAM file, which is the binary, computer-readable version of the SAM file, and used the BAM file to generate a stats file. The stats file was then used to generate plots of coverage and depth statistics, which help show how well the Nanopore reads agree with the reference genome.
samtools view -b align.sam > align.bam
samtools stats input.bam > input.bam.stats
plot-bamstats -p sample# input.bam.stats
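The stats file is plain text, so the headline numbers can also be read without plotting. The sketch below assumes the usual samtools stats layout, in which summary lines start with SN and fields are tab-separated. The function names stats_summary and percent_mapped are ours, not SAMtools commands.

```shell
# stats_summary FILE
# Print the summary-numbers (SN) section of a samtools stats file.
stats_summary() {
    grep '^SN' "$1" | cut -f 2-
}

# percent_mapped FILE
# Percent of reads mapped, computed from two SN entries.
percent_mapped() {
    stats_summary "$1" | awk -F '\t' '
        $1 == "raw total sequences:" { total = $2 }
        $1 == "reads mapped:"        { mapped = $2 }
        END {
            if (total > 0) printf "%.2f\n", 100 * mapped / total
            else print "NA"
        }'
}
```

For example, percent_mapped input.bam.stats gives a quick check on how many of the Nanopore reads aligned to the pseudochromosomes.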
Integrative Genomics Viewer (IGV) is an interactive tool designed for the visual exploration of genomic data. We used IGV to load the reference genome together with the aligned reads and visually inspect the alignment. Keep in mind that index files for both the reference (.fai) and the sorted BAM file (.bai) are required for proper viewing.
We initiated the analysis by concatenating 15 fastq files (out of 220 total files) into one file and aligning this combined file to the reference genome.
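The concatenation step can be sketched as a small shell function. This is an illustration under assumptions, not the exact command we ran: it assumes uncompressed 4-line fastq files, and the function name concat_fastq is hypothetical.

```shell
# concat_fastq OUT FILE...
# Concatenate FASTQ files into OUT and print the number of reads written.
# Fails if the result is not a whole number of 4-line records.
concat_fastq() {
    out=$1
    shift
    cat "$@" > "$out" || return 1
    lines=$(wc -l < "$out")
    if [ $((lines % 4)) -ne 0 ]; then
        echo "warning: $out has $lines lines, not a multiple of 4" >&2
        return 1
    fi
    echo $((lines / 4))
}
```

A call such as concat_fastq subset15.fastq followed by the 15 chosen file names writes the combined file; the line-count check catches a truncated input before the long alignment step starts.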
We first examined the alignment results for the 15 concatenated fastq files to confirm that the pipeline works.
Mapped depth increases with the percent of mapped sequence, and there is a higher median depth for higher GC content. This suggests that sequences with higher GC content could be better sequenced.
This plot is empty even though we expect the coverage line to peak at 0 and taper off to 1, representing fewer areas with high depth. We are not sure why this plot is not showing depth, but when this file was aligned with the reference genome in IGV, substantial read depth was visible.
The long reads have a modal GC content of 38.4%, close to the 39.70% GC content reported by Qiao et al. (2021).
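As a cross-check on the plotted value, GC content can be computed directly from the reads. The sketch below reports the overall GC percentage across all bases, which is related to but not the same as the modal per-read value read off the plot. It assumes a 4-line fastq layout, and the function name gc_percent is ours.

```shell
# gc_percent FILE
# Overall GC percentage across all sequence lines of a 4-line FASTQ file.
gc_percent() {
    awk 'NR % 4 == 2 {
             seq = toupper($0)
             total += length(seq)
             # gsub returns the number of G/C characters it removed
             gc += gsub(/[GC]/, "", seq)
         }
         END {
             if (total > 0) printf "%.1f\n", 100 * gc / total
             else print "NA"
         }' "$1"
}
```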
The number of insertions and deletions is high for the first 20,000 cycles but then sharply decreases and stays steady.
These reads are high in Thymine (blue) and Adenine (green), as shown by the blue line and the faint green line beneath it.
The insertion and deletion patterns of the indels are very similar, with smaller indels more frequent than larger ones. The mode of the ins/del ratio, ranging between 50 and 60, indicates that insertions occur more frequently than deletions in these read samples, and that the inserted sequences are longer than the deleted sequences.
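The indel counts behind this plot can also be totalled from the stats file. This sketch assumes the samtools stats indel-distribution section, where lines start with ID and the columns after it are length, number of insertions, and number of deletions. The overall ratio it prints is not necessarily the quantity plotted by plot-bamstats, and the function name indel_totals is ours.

```shell
# indel_totals FILE
# Sum insertions and deletions over all lengths in the ID section
# of a samtools stats file and print their overall ratio.
indel_totals() {
    grep '^ID' "$1" | cut -f 2- | awk -F '\t' '
        { ins += $2; del += $3 }
        END {
            printf "insertions\t%d\n", ins
            printf "deletions\t%d\n", del
            if (del > 0) printf "ins_del_ratio\t%.2f\n", ins / del
        }'
}
```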
After confirming the pipeline's accuracy with the 15 concatenated fastq files, we proceeded with the analysis by concatenating all 220 fastq files into one file and aligning it with the reference genome.
The insertion and deletion patterns of the indels are similar to the results from the 15 concatenated files, with smaller indels more frequent than larger ones. The mode of the ins/del ratio is around 80, which means that insertions occur more frequently than deletions and that the inserted sequences are longer than the deleted sequences in these read samples. This is consistent with the results of the 15 concatenated files.
The number of insertions and deletions is high for the first 20,000 cycles but then sharply decreases and stays steady, which is consistent with the results of the 15 concatenated files.
These reads are high in Thymine (blue) and Adenine (green), consistent with the results of the 15 concatenated files. It also makes sense for the purine and pyrimidine to both be high, as they are complements.
This plot is empty even though we expect the coverage line to peak at 0 and taper off to 1, indicating fewer areas with high depth. We are not sure why this plot is not showing depth.
The long reads have a modal GC content of 38.4%, consistent with the results achieved for the 15 concatenated files.
In this section, our aim was to test the pipeline for errors. To achieve this, we randomly selected 15 fastq files for depth analysis. As we examined the reads, we observed similarities and anticipated that, with the inclusion of more reads, regions of both high and low depth would become identifiable.
This concatenation of 15 files already shows a pattern of regions with higher depth. Split reads and regions of repeated sequences can be observed, suggesting potential improvements to the reference genome assembly. Although not pictured here, reads pile up at the first instance of a repeated sequence. This means there will be higher depth at the instance of a repeated sequence nearest the beginning of the reference genome than at later instances of the same sequence, because alignment is order-based.
This image was generated by aligning all 220 concatenated files with the reference genome, indicating more depth compared to the 15 concatenated files.
Initially, we observed that the Nanopore sequencing generated 220 fastq files, likely due to multiple runs on the same sample over around 3 days. The analysis revealed similarities in read length between concatenated files (15 fastq files and 220 fastq files) viewed in IGV. We concluded that eliminating errors in the Nanopore genome assembly and aligning it to the reference genome can fill in the gaps of the reference genome.
These are the general commands we used for the analysis:
- Reference genome: FragariaPseudoChrsV2.fasta
- Query genome files: FAQ99085_pass_635cf375_2.fastq
bwa index FragariaPseudoChrsV2.fasta
bwa bwasw FragariaPseudoChrsV2.fasta FAQ99085_pass_635cf375_2.fastq > sample4.sam
samtools view -S -b sample4.sam > sample4.bam
samtools stats sample4.bam > sample4.bam.stats
samtools view sample4.bam | head
samtools sort sample4.bam -o sample4.sorted.bam
samtools view sample4.sorted.bam | head
samtools index sample4.sorted.bam
plot-bamstats -p my_output sample4.bam.stats
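The commands above can be chained into one function so that a failed step stops the run. This is a hedged sketch rather than the script in this repository: the function name align_reads and its three arguments are our own, and it runs samtools stats on the sorted BAM, which differs from the command order listed above.

```shell
# align_reads REF READS PREFIX
# Index the reference, align with bwa bwasw, then convert, sort,
# index, and summarise the alignment. Stops at the first failing step.
align_reads() {
    ref=$1
    reads=$2
    prefix=$3
    for tool in bwa samtools; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "missing required tool: $tool" >&2
            return 1
        }
    done
    bwa index "$ref" &&
    bwa bwasw "$ref" "$reads" > "$prefix.sam" &&
    samtools view -b "$prefix.sam" > "$prefix.bam" &&
    samtools sort "$prefix.bam" -o "$prefix.sorted.bam" &&
    samtools index "$prefix.sorted.bam" &&
    samtools stats "$prefix.sorted.bam" > "$prefix.sorted.bam.stats"
}
```

With the file names from this README, the call would be align_reads FragariaPseudoChrsV2.fasta FAQ99085_pass_635cf375_2.fastq sample4, after which plot-bamstats can be run on sample4.sorted.bam.stats.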
- Dr Jefferey Miller
- Dr Tom Davis
- Clayton Ludwig (PhD Candidate)