In [1]:
# 1: Install Required Tools
!apt-get update
!apt-get install -y bwa samtools bowtie2


Hit:1 https://cli.github.com/packages stable InRelease
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:4 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Get:10 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:12 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [9,398 kB]
Get:13 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1,288 kB]
Get:14 https://r

In [2]:
#  2: Setup Working Directory
!mkdir -p /content/chr21_demo
%cd /content/chr21_demo


/content/chr21_demo


In [3]:
# 3: Download chr21 Reference Genome
!wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr21.fa.gz
!gunzip -f chr21.fa.gz


--2025-10-30 07:46:43--  https://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr21.fa.gz
Resolving hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)|128.114.119.163|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12709705 (12M) [application/x-gzip]
Saving to: ‘chr21.fa.gz’


2025-10-30 07:46:44 (17.3 MB/s) - ‘chr21.fa.gz’ saved [12709705/12709705]



In [4]:
# 4: Extract a Small Region for Speed
!head -n 2000 chr21.fa > chr21_small.fa
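
Taking the first lines of a chromosome FASTA can return mostly `N` placeholder bases, because many assemblies begin with unsequenced gap regions. As a minimal sketch (the helper names `base_composition` and `fraction_n` and the toy input are illustrative and not part of the original notebook), the base composition of a FASTA can be checked before indexing:

```python
from collections import Counter


def base_composition(lines):
    """Count bases (case-insensitive) over FASTA lines, skipping headers."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and '>' header lines
        counts.update(line.upper())
    return counts


def fraction_n(counts):
    """Fraction of all counted bases that are 'N' (0.0 for empty input)."""
    total = sum(counts.values())
    return counts["N"] / total if total else 0.0


# Toy stand-in for a FASTA file: 16 of the 20 bases are N.
toy = [">toy", "NNNNNNNNNN", "NNNNNacgtN"]
counts = base_composition(toy)
print(counts["N"], fraction_n(counts))
```

Passing an open file handle for `chr21_small.fa` instead of the toy list would report the composition of the extracted region; a fraction close to 1.0 means there is little or no real sequence to align against.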


# Read Alignment with BWA

In [5]:
# 1: Index the Reference Genome
!bwa index chr21_small.fa


[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.02 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.01 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index chr21_small.fa
[main] Real time: 0.081 sec; CPU: 0.040 sec


In [6]:
# 2: Create a Small Query FASTA File
query_seq = """>query1
TGGAAGGACTGAGGTTGATAAAGTAAAGCCAAAGAACTAG
"""

with open("/content/query.fa", "w") as f:
    f.write(query_seq)


In [7]:
# 3: Align the Query Using BWA-MEM
!bwa mem chr21_small.fa /content/query.fa > /content/alignment_bwa_chr21.sam


[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 1 sequences (40 bp)...
[M::mem_process_seqs] Processed 1 reads in 0.000 CPU sec, 0.000 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem chr21_small.fa /content/query.fa
[main] Real time: 0.006 sec; CPU: 0.003 sec


In [16]:
# 4: View BWA Alignment Output
!head /content/alignment_bwa_chr21.sam

@SQ	SN:chr21	LN:99950
@PG	ID:bwa	PN:bwa	VN:0.7.17-r1188	CL:bwa mem chr21_small.fa /content/query.fa
query1	4	*	0	0	*	*	0	0	TGGAAGGACTGAGGTTGATAAAGTAAAGCCAAAGAACTAG	*	AS:i:0	XS:i:0


### Interpretation of BWA Alignment Output

The output displayed above is in SAM (Sequence Alignment/Map) format, which is a standard format for storing biological sequences aligned to a reference sequence. Here is a breakdown of the key lines and fields:

*   **`@SQ SN:chr21 LN:99950`**: This line is a header providing information about the reference sequence.
    *   `@SQ`: Indicates a sequence dictionary header line.
    *   `SN:chr21`: Reference sequence name, 'chr21' in this case.
    *   `LN:99950`: Reference sequence length, 99,950 bases. This matches the 1,999 sequence lines kept by `head -n 2000` (one line is the FASTA header) at 50 bases per line.

*   **`@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem chr21_small.fa /content/query.fa`**: This is a header line describing the program used for alignment.
    *   `@PG`: Indicates a program header line.
    *   `ID:bwa`: Program identifier.
    *   `PN:bwa`: Program name.
    *   `VN:0.7.17-r1188`: Version of the program.
    *   `CL:bwa mem chr21_small.fa /content/query.fa`: Command line used to run the program.

*   **`query1 4 * 0 0 * * 0 0 TGGAAGGACTGAGGTTGATAAAGTAAAGCCAAAGAACTAG * AS:i:0 XS:i:0`**: This line represents the alignment of a single read (query1). Each field is tab-separated.
    *   `query1`: Query template NAME (the name of the read).
    *   `4`: Bitwise FLAG (in this case, 4 indicates that the read is unmapped).
    *   `*`: Reference sequence name (RNAME). An asterisk here indicates the read is unmapped.
    *   `0`: 1-based leftmost mapping Position of the first matching base (POS). 0 indicates unmapped.
    *   `0`: Mapping quality (MAPQ). 0 for unmapped reads.
    *   `*`: CIGAR string (Compact Idiosyncratic Gapped Alignment Report). An asterisk indicates no alignment.
    *   `*`: Reference sequence name of the mate/next read in the template (RNEXT). An asterisk indicates no mate or unmapped mate.
    *   `0`: Position of the mate/next read (PNEXT). 0 indicates no mate or unmapped mate.
    *   `0`: Observed template LENgth (TLEN). 0 indicates no template or unmapped.
    *   `TGGAAGGACTGAGGTTGATAAAGTAAAGCCAAAGAACTAG`: Segment sequence (SEQ). This is the sequence of the read.
    *   `*`: ASCII of Phred-scaled base Quality plus 33 (QUAL). An asterisk indicates no quality scores.
    *   `AS:i:0`: Alignment score.
    *   `XS:i:0`: Suboptimal alignment score.

In summary, the query sequence `query1` did not map to the small reference `chr21_small.fa` with BWA-MEM: the flag is `4`, RNAME and CIGAR are `*`, and POS and MAPQ are `0`. The remaining values (RNEXT `*`, PNEXT `0`, TLEN `0`) are uninformative here, because any single-end read has them. The most likely cause is the reference itself: `head -n 2000` keeps only the first 99,950 bases of `chr21.fa`, and in hg38 the start of chr21 is an assembly gap written as `N`, so the extracted region probably contains no real sequence for `query1` to align to.
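
The field-by-field reading above can also be done programmatically. The following is a minimal sketch, not a full SAM parser: the names `parse_sam_record`, `decode_flag`, and `FLAG_BITS` are illustrative, and the flag bit meanings follow the SAM specification.

```python
# The 11 mandatory SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# Bit values of the FLAG field as defined by the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def parse_sam_record(line):
    """Split one tab-separated alignment line into named fields plus tags."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    tags = {}
    for tag in parts[11:]:
        name, typ, value = tag.split(":", 2)  # e.g. "AS:i:0"
        tags[name] = int(value) if typ == "i" else value
    record["tags"] = tags
    return record


def decode_flag(flag):
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# The BWA-MEM record shown in the output above.
bwa_line = ("query1\t4\t*\t0\t0\t*\t*\t0\t0\t"
            "TGGAAGGACTGAGGTTGATAAAGTAAAGCCAAAGAACTAG\t*\tAS:i:0\tXS:i:0")
rec = parse_sam_record(bwa_line)
print(rec["RNAME"], rec["POS"], rec["CIGAR"], decode_flag(rec["FLAG"]))
```

For real pipelines a dedicated library or `samtools view` is usually a better choice; the sketch only makes the column layout explicit.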

# Read Alignment with Bowtie2

In [9]:
# 1: Create Bowtie2 Working Directory
%cd /content
!mkdir -p bowtie2_chr21_demo
%cd bowtie2_chr21_demo


/content
/content/bowtie2_chr21_demo


In [10]:
%%bash
# 2: Manually Create a Small chr21 Reference FASTA
cat > chr21_small.fa << "EOF"
>chr21_small
AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG
EOF


In [11]:
# 3: Build Bowtie2 Index
!bowtie2-build chr21_small.fa chr21_small_index


Settings:
  Output files: "chr21_small_index.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  chr21_small.fa
Building a SMALL index
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 8
Using parameters --bmax 6 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 6 --dcv 1024
Constructing suffix-array element gen

In [13]:
# 4: Create a Small Query FASTA File
!echo -e ">query1\nAGCTTAGCTAGCTACCTAT" > query.fa

In [14]:
# 5: Align the Query Using Bowtie2
!bowtie2 -x chr21_small_index -f query.fa -S alignment_bowtie2_chr21.sam

1 reads; of these:
  1 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    1 (100.00%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
100.00% overall alignment rate


In [17]:
# 6: View Bowtie2 Alignment Output
!head alignment_bowtie2_chr21.sam

@HD	VN:1.0	SO:unsorted
@SQ	SN:chr21_small	LN:35
@PG	ID:bowtie2	PN:bowtie2	VN:2.4.4	CL:"/usr/bin/bowtie2-align-s --wrapper basic-0 -x chr21_small_index -f query.fa -S alignment_bowtie2_chr21.sam"
query1	0	chr21_small	1	42	19M	*	0	0	AGCTTAGCTAGCTACCTAT	IIIIIIIIIIIIIIIIIII	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:19	YT:Z:UU


### Interpretation of Bowtie2 Alignment Output

The output above is also in SAM format, showing the results from Bowtie2. Let's break down the key parts:

*   **`@HD VN:1.0 SO:unsorted`**: Header line indicating the SAM format version and sorting order.
    *   `@HD`: Header line.
    *   `VN:1.0`: SAM format version.
    *   `SO:unsorted`: The alignment records are not sorted.

*   **`@SQ SN:chr21_small LN:35`**: Header line for the reference sequence.
    *   `@SQ`: Sequence dictionary header.
    *   `SN:chr21_small`: Reference sequence name, matching the small FASTA we created.
    *   `LN:35`: Length of the reference sequence (35 base pairs).

*   **`@PG ID:bowtie2 PN:bowtie2 VN:2.4.4 CL:"/usr/bin/bowtie2-align-s --wrapper basic-0 -x chr21_small_index -f query.fa -S alignment_bowtie2_chr21.sam"`**: Header line describing the Bowtie2 program and command used.
    *   `@PG`: Program header.
    *   `ID`, `PN`, `VN`: Program identifier, name, and version.
    *   `CL`: Command line used for the alignment.

*   **`query1 0 chr21_small 1 42 19M * 0 0 AGCTTAGCTAGCTACCTAT IIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:19 YT:Z:UU`**: This is the alignment record for `query1`.
    *   `query1`: Query template NAME.
    *   `0`: Bitwise FLAG. `0` indicates that the read is mapped to the forward strand.
    *   `chr21_small`: Reference sequence name (RNAME) that the read mapped to.
    *   `1`: 1-based leftmost mapping Position (POS). The read starts aligning at the first base of the reference.
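    *   `42`: Mapping quality (MAPQ). A score indicating the confidence of the mapping. 42 is the highest value Bowtie2 assigns.
    *   `19M`: CIGAR string. `M` denotes an aligned position, which may be either a match or a mismatch, so `19M` means all 19 bases of the query align to the reference with no insertions, deletions, or clipping. That the alignment is also mismatch-free follows from the `NM` and `MD` tags.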
    *   `*`: Reference sequence name of the mate/next read (RNEXT). An asterisk means no mate.
    *   `0`: Position of the mate/next read (PNEXT). 0 means no mate.
    *   `0`: Observed template LENgth (TLEN). 0 for a single-end read.
    *   `AGCTTAGCTAGCTACCTAT`: Segment sequence (SEQ). The sequence of the read.
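    *   `IIIIIIIIIIIIIIIIIII`: ASCII of Phred-scaled base Quality plus 33 (QUAL). The query was supplied as FASTA (`-f`) and carries no quality scores, so Bowtie2 writes `I` (Phred 40) as a placeholder for every base.
    *   `AS:i:0`: Alignment score. In Bowtie2's default end-to-end mode the best possible score is 0, with penalties subtracted for mismatches and gaps, so 0 indicates a perfect match.
    *   `XN:i:0`, `XM:i:0`, `XO:i:0`, `XG:i:0`: Bowtie2-specific counts of ambiguous reference bases, mismatches, gap opens, and gap extensions in the alignment. All are 0 here.
    *   `NM:i:0`: Edit distance to the reference (mismatches plus inserted and deleted bases). 0 indicates a perfect match.
    *   `MD:Z:19`: String encoding mismatched and deleted reference bases. A bare `19` means 19 consecutive matching bases with no mismatch or deletion.
    *   `YT:Z:UU`: Bowtie2-specific tag indicating that the read was aligned as an unpaired read.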

In contrast to the BWA alignment, the Bowtie2 output shows a successful alignment (`FLAG 0`, `RNAME chr21_small`, `POS 1`, `CIGAR 19M`). This is because the small reference sequence created for the Bowtie2 example (`AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG`) contains the query sequence (`AGCTTAGCTAGCTACCTAT`) at the beginning.
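
To make the CIGAR and MAPQ values concrete, here is a minimal sketch that expands a CIGAR string into operations and converts a MAPQ value to the error probability it nominally represents. The function names are illustrative. The conversion assumes the Phred-style definition from the SAM specification (MAPQ = -10 log10 of the probability that the mapping position is wrong); aligners such as Bowtie2 compute MAPQ heuristically, so the probability is only a rough guide.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_CONSUMING = set("MIS=X")       # operations that advance along the read
REFERENCE_CONSUMING = set("MDN=X")   # operations that advance along the reference


def parse_cigar(cigar):
    """Return a list of (length, operation) pairs; '*' means no alignment."""
    if cigar == "*":
        return []
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def aligned_lengths(cigar):
    """Return (bases consumed on the read, bases consumed on the reference)."""
    ops = parse_cigar(cigar)
    query_len = sum(n for n, op in ops if op in QUERY_CONSUMING)
    ref_len = sum(n for n, op in ops if op in REFERENCE_CONSUMING)
    return query_len, ref_len


def mapq_to_error_probability(mapq):
    """Nominal probability that the mapping position is wrong."""
    return 10 ** (-mapq / 10)


print(parse_cigar("19M"))
print(aligned_lengths("19M"))
print(mapq_to_error_probability(42))
```

For the Bowtie2 record, `19M` covers 19 read bases and 19 reference bases, and a MAPQ of 42 corresponds to a nominal error probability of about 6.3e-05.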

## Comparison of BWA and Bowtie2 Alignment Outputs

Based on the alignment outputs we've examined:

*   **BWA-MEM Output:** The BWA-MEM alignment of `query1` against `chr21_small.fa` left the read **unmapped**. This is indicated by the bitwise flag `4`, the asterisks in RNAME and CIGAR, and the zero values for POS and MAPQ. The query found no suitable match within the first 2000 lines of `chr21.fa` that served as the BWA reference.

*   **Bowtie2 Output:** The Bowtie2 alignment of `query1` against `chr21_small_index` (built from a manually created small FASTA) was a **successful alignment**. This is indicated by the bitwise flag `0` (mapped to the forward strand), the RNAME `chr21_small`, a POS of `1` (mapping to the beginning of the reference), a MAPQ of `42`, and a CIGAR string of `19M` together with `NM:i:0` (a perfect match over the entire read length). The alignment succeeded because the manually created reference begins with the query sequence.

**Key Difference:** The primary difference in the outputs is the **mapping status** of the query read: BWA-MEM reported its read as unmapped, while Bowtie2 reported its read as successfully mapped. The two runs used different references and different query sequences, so this result reflects the inputs and says nothing about the relative performance of the two aligners. The small reference used for BWA did not contain the query sequence, leading to an unmapped result, whereas the small reference used for Bowtie2 was designed to contain the query sequence, resulting in a successful alignment.
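
Whether a short query occurs in a small reference at all can be checked without an aligner. The sketch below does an exact substring search on both strands; `find_exact` and `reverse_complement` are illustrative helper names, and the run of `N` bases is an assumed stand-in for a gap-only reference, not the actual content of `chr21_small.fa`.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A<->T, C<->G, N stays N)."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]


def find_exact(reference, query):
    """Return (1-based position, strand) of the first exact hit, or None."""
    reference = reference.upper()
    query = query.upper()
    pos = reference.find(query)
    if pos != -1:
        return pos + 1, "+"
    pos = reference.find(reverse_complement(query))
    if pos != -1:
        return pos + 1, "-"
    return None


# Reference and query from the Bowtie2 example.
bowtie2_ref = "AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG"
print(find_exact(bowtie2_ref, "AGCTTAGCTAGCTACCTAT"))

# BWA query against an assumed all-N reference (hypothetical stand-in).
gap_only_ref = "N" * 100
print(find_exact(gap_only_ref, "TGGAAGGACTGAGGTTGATAAAGTAAAGCCAAAGAACTAG"))
```

The first call finds the query at position 1 on the forward strand, matching the POS and FLAG that Bowtie2 reported; the second returns `None`. Exact search only handles perfect matches, so a real aligner is still needed once mismatches or gaps are involved.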