---
title: "Removing Host Reads"
editor: visual
jupyter: python3
---


Our next step is to remove human reads to prepare for downstream analysis. We use [minimap2](https://github.com/lh3/minimap2), a fast and flexible alignment tool, to map reads against a human reference genome. minimap2 writes its alignments in SAM format; Samtools then keeps only the reads that did not map to the reference, converts them to BAM, and sorts them.

To remove human reads, we use the reference genome GCF_000001405.26_GRCh38_genomic.fna.gz. If other host organisms need to be filtered (e.g., horses, sheep, or cattle), first remove human reads, because human contamination can be introduced during wet-lab handling, and then map against additional host references as needed.

Reference genomes for other species (e.g., horses, sheep, or cattle) are available under the path: /mnt/viro0002-data/workgroups_projects/Bioinformatics/DB/

``` bash
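# Map reads to the human reference, keep only the unmapped reads (-f 4), and write them to a sorted BAM
minimap2 -Y -t 32 -x map-ont -a /mnt/viro0002-data/workgroups_projects/Bioinformatics/DB/HG38/GCF_000001405.26_GRCh38_genomic.fna.gz all_reads_QC.fastq 2> /dev/null | samtools view -bf 4 - | samtools sort - > all_reads_nonhuman.bam
# Convert the unmapped (non-human) reads back to FASTQ
samtools fastq all_reads_nonhuman.bam > all_reads_QC_hg19.fastq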
```

🔹<strong style="color:darkblue">-Y</strong> – Use soft clipping for supplementary alignments. This keeps partial alignments from being "hard clipped", which helps retain information from partially mapped reads.\
🔹<strong style="color:darkblue">-t 32</strong> – Use 32 threads to speed up the alignment.\
🔹<strong style="color:darkblue">-x map-ont</strong> – Use preset parameters for Oxford Nanopore reads (optimized scoring and alignment behavior).\
🔹<strong style="color:darkblue">-a</strong> – Output SAM format (Sequence Alignment/Map). Required for downstream tools like samtools.\
🔹<strong style="color:darkblue">GRCh38_genomic.fna.gz</strong> – Path to the human genome reference (can be gzipped). Reads are aligned to this genome.\
🔹<strong style="color:darkblue">all_reads_QC.fastq</strong> – The input FASTQ file (your cleaned and quality-filtered reads).\
🔹<strong style="color:darkblue">2> /dev/null</strong> – Silences minimap2's stderr output (errors, warnings, and logs) so the console isn't cluttered.\
🔹<strong style="color:darkblue">-b</strong> – (samtools view) Output as BAM (compressed binary format for alignments).\
🔹<strong style="color:darkblue">-f 4</strong> – (samtools view) Keep only unmapped reads (reads that did not align to the reference genome).
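A quick sanity check after this step is to compare read counts before and after host removal. The sketch below is one way to do that in plain bash. The function names `count_fastq_reads` and `report_host_removal` are illustrative and not part of the original workflow, and the count assumes standard four-line FASTQ records (no wrapped sequence lines).

```bash
# Count reads in a FASTQ file.
# Assumption: every record is exactly 4 lines (header, sequence, '+', qualities).
count_fastq_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Print how many reads went in, how many were kept, and how many were removed.
# Usage: report_host_removal <input.fastq> <filtered.fastq>
report_host_removal() {
    local before after
    before=$(count_fastq_reads "$1")
    after=$(count_fastq_reads "$2")
    echo "input=${before} kept=${after} removed=$(( before - after ))"
}
```

For example, `report_host_removal all_reads_QC.fastq all_reads_QC_hg19.fastq` prints the input, kept, and removed read counts on one line.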

To save disk space, please remove the intermediate BAM file once the FASTQ file has been written. The BAM file holds the same unmapped (non-human) reads that are now stored in all_reads_QC_hg19.fastq, so it is no longer needed.

``` bash
rm all_reads_nonhuman.bam
```
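If an additional host needs to be filtered after the human pass, the same minimap2 and Samtools pattern can be wrapped in a small function and reused. This is a minimal sketch, not part of the original workflow: the function name `filter_host_reads` and the output file name in the example call are assumptions, and the reference file name is left as a placeholder because it depends on what is stored under the DB directory. Unlike the command above, it pipes the unmapped reads straight into `samtools fastq`, so no intermediate BAM file is written.

```bash
# Keep only the reads that do NOT map to a host reference and write them as FASTQ.
# Usage: filter_host_reads <reference.fna.gz> <input.fastq> <output.fastq> [threads]
# (hypothetical helper; the thread count defaults to 32 as in the command above)
filter_host_reads() {
    local ref=$1 reads=$2 out=$3 threads=${4:-32}
    minimap2 -Y -t "$threads" -x map-ont -a "$ref" "$reads" 2> /dev/null \
        | samtools view -b -f 4 - \
        | samtools fastq - > "$out"
}

# Example call (placeholder reference name, hypothetical output name):
# DB=/mnt/viro0002-data/workgroups_projects/Bioinformatics/DB
# filter_host_reads "$DB/<host_reference>.fna.gz" all_reads_QC_hg19.fastq all_reads_QC_hg19_nohost.fastq
```

Running the human pass first and feeding its output into this function follows the order recommended above: human reads are removed before any other host.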

::: callout-important
## Important

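You now have your quality-controlled, host-filtered reads **"all_reads_QC_hg19.fastq"**. We will be using these reads for all further analysis! Note that, despite the "hg19" in the file name, the reads were filtered against the GRCh38 reference.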
:::

Now we need to screen the sample, either through taxonomic classification or *de novo* assembly. Whereas mapping aligns reads to a known reference, assembly reconstructs the original sequence by overlapping and merging shorter reads, without using a reference.

![Illustration](Mapping_diff.jpg)