In [1]:
# Step 1: Install Required Tools
print("Step 1: Installing BWA, SAMtools, and Bowtie2.")
!apt-get install -qq bwa samtools > /dev/null
!apt-get install -qq bowtie2 > /dev/null
print("   -> Installation complete.")

Step 1: Installing BWA, SAMtools, and Bowtie2.
   -> Installation complete.


In [2]:
# Step 2: Create and Navigate to BWA Working Directory
!mkdir -p /content/reference_chr20_bwa
%cd /content/reference_chr20_bwa
print("\nStep 2: Changed directory to /content/reference_chr20_bwa")

/content/reference_chr20_bwa

Step 2: Changed directory to /content/reference_chr20_bwa


In [3]:
# Step 3: Download and Decompress hg38 Chromosome 20 Reference
print("\nStep 3: Downloading and unzipping hg38 Chromosome 20 (chr20.fa.gz)...")
!wget -q https://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr20.fa.gz
!gunzip -f chr20.fa.gz
print("   -> Download and decompression complete. File: chr20.fa")


Step 3: Downloading and unzipping hg38 Chromosome 20 (chr20.fa.gz)...
   -> Download and decompression complete. File: chr20.fa


In [13]:
# Step 4: Index Full Reference with Samtools and Get Length
# Samtools indexes the full FASTA file (chr20.fa) to create chr20.fa.fai
!samtools faidx chr20.fa

# Extract the length of chr20 from the .fai file and save it directly to a file
# Using a single command to avoid issues with shell variable scope in Colab
!awk '{if ($1=="chr20") print $2}' chr20.fa.fai > chr20_length.txt

# Read the length from the file to display it
with open("chr20_length.txt", "r") as f:
    chr20_len = f.read().strip()
print(f"Chromosome 20 Length: {chr20_len}")


print("Step 4: Full chr20 indexed and length extracted.")

Chromosome 20 Length: 64444167
Step 4: Full chr20 indexed and length extracted.
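
The `awk` one-liner above can also be done in Python, which avoids the intermediate `chr20_length.txt` file. The sketch below assumes the standard tab-separated `.fai` layout (name, length, offset, bases per line, bytes per line). The helper name `read_fai_lengths` is hypothetical, not part of samtools.

```python
import os

def read_fai_lengths(fai_path):
    """Return {sequence_name: length} parsed from a samtools .fai index."""
    lengths = {}
    with open(fai_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            lengths[fields[0]] = int(fields[1])
    return lengths

# Only runs when the index from Step 4 is present in the working directory.
if os.path.exists("chr20.fa.fai"):
    print("Chromosome 20 Length:", read_fai_lengths("chr20.fa.fai")["chr20"])
```

Returning a dictionary means the same helper would also work for a multi-sequence reference.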


# Read Alignment with BWA-MEM on chr20 (Random Match)

In [5]:
# Step 6: BWA Index the FULL Reference Genome
# BWA indexes the full chr20.fa file for the alignment step.
!bwa index chr20.fa
print("\nStep 6: BWA Indexing of FULL chr20 Complete.")

[bwa_index] Pack FASTA... 0.41 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=128888334, availableWord=21068624
[BWTIncConstructFromPacked] 10 iterations done. 34753182 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 64202446 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 90372990 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 113629422 characters processed.
[bwt_gen] Finished constructing BWT in 48 iterations.
[bwa_index] 31.83 seconds elapse.
[bwa_index] Update BWT... 0.30 sec
[bwa_index] Pack forward-only FASTA... 0.27 sec
[bwa_index] Construct SA from BWT and Occ... 15.33 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index chr20.fa
[main] Real time: 48.792 sec; CPU: 48.141 sec

Step 6: BWA Indexing of FULL chr20 Complete.


In [14]:
# Step 7: Generate Random Query FASTA File from Entire CHR 20
import random
import os

READ_LENGTH = 50

# Read the chromosome length saved in the previous cell
with open("chr20_length.txt", "r") as f:
    chr_len = int(f.read().strip())

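# Choose a random 1-based start position that leaves room for a full READ_LENGTH window.
# The last valid start is chr_len - READ_LENGTH + 1.
# Note: the window may land in an assembly gap (a run of N bases); such a read will not align.
random_start = random.randint(1, chr_len - READ_LENGTH + 1)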

# Define the genomic coordinate for samtools faidx
coord = f"chr20:{random_start}-{random_start + READ_LENGTH - 1}"

# Use samtools faidx to extract the sequence and save as a FASTA file
!samtools faidx chr20.fa {coord} > query_bwa_chr20_random.fa

print(f"Step 7: Random 50 bp sequence extracted from {coord} and saved to query_bwa_chr20_random.fa.")

Step 7: Random 50 bp sequence extracted from chr20:58634375-58634424 and saved to query_bwa_chr20_random.fa.
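
A random window can land in an assembly gap, where the reference contains only `N` bases and the extracted read cannot be aligned. One possible guard is to redraw until the window is free of `N`. This is a minimal sketch that assumes the chromosome sequence is already in memory as a Python string. The function name `pick_window` and its parameters are hypothetical.

```python
import random

def pick_window(sequence, read_length, rng=None, max_tries=1000):
    """Return (start, window) for a random N-free window; start is 1-based."""
    rng = rng or random.Random()
    last_start = len(sequence) - read_length + 1  # last start that still fits
    if last_start < 1:
        raise ValueError("sequence is shorter than the requested read length")
    for _ in range(max_tries):
        start = rng.randint(1, last_start)
        window = sequence[start - 1:start - 1 + read_length]
        if "N" not in window.upper():
            return start, window
    raise RuntimeError("no N-free window found within max_tries")

# Toy example: only the middle ten bases are usable.
toy = "NNNNN" + "ACGTACGTAC" + "NNNNN"
start, window = pick_window(toy, 10, rng=random.Random(0))
print(start, window)
```

The returned `start` can be used directly in a `chr20:start-end` coordinate for `samtools faidx`, since both are 1-based.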


In [15]:
# Step 8: Perform Alignment using BWA-MEM
# Align the random query against the full chr20 reference. The read was cut from this same
# reference, so an exact match is expected unless the window fell in an N gap or a repeat.
!bwa mem chr20.fa query_bwa_chr20_random.fa > /content/alignment_bwa_chr20_random.sam

print("\nStep 8: BWA-MEM Alignment complete.")

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 1 sequences (50 bp)...
[M::mem_process_seqs] Processed 1 reads in 0.001 CPU sec, 0.000 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem chr20.fa query_bwa_chr20_random.fa
[main] Real time: 0.068 sec; CPU: 0.065 sec

Step 8: BWA-MEM Alignment complete.


In [16]:
# Step 9: View and Interpret BWA Alignment Results
# Expected: Successful alignment (Flag 0, POS = random_start, CIGAR 50M)
print("\nBWA Alignment Results (SAM format - Expected Success):")
!head /content/alignment_bwa_chr20_random.sam


BWA Alignment Results (SAM format - Expected Success):
@SQ	SN:chr20	LN:64444167
@PG	ID:bwa	PN:bwa	VN:0.7.17-r1188	CL:bwa mem chr20.fa query_bwa_chr20_random.fa
chr20:58634375-58634424	0	chr20	58634375	60	50M	*	0	0	GGTCAGGTCTTCAGAGGGGAGACTCCTGCCCTGGTGTGCCCGGCTCCTGC	*	NM:i:0	MD:Z:50	AS:i:50	XS:i:0


Here's a simpler way to understand the main line of the alignment results:

`chr20:58634375-58634424        0       chr20   58634375        60      50M     *       0       0       GGTCAGGTCTTCAGAGGGGAGACTCCTGCCCTGGTGTGCCCGGCTCCTGC      *       NM:i:0  MD:Z:50 AS:i:50 XS:i:0`

Imagine you're looking for a specific sentence in a giant book (the human genome).

1.  **`chr20:58634375-58634424`**: This is like the **name of the sentence** you were looking for. In this case, the name tells us it came from a specific spot on chromosome 20.
2.  **`0`**: This is a **code** (the FLAG) made of on/off switches. A value of `0` means none of them is set: the sentence was found, it is right-side-up (forward strand, not reverse-complemented), it is not one half of a pair, and this is its main (primary) location.
3.  **`chr20`**: This tells us **which "chapter" or chromosome** in the book the sentence was found. It was found on chromosome 20.
4.  **`58634375`**: This is the **exact page or starting point** in that chapter where the sentence begins. It starts at position 58,634,375 on chromosome 20 (counting from 1).
5.  **`60`**: This is a **confidence score** (the mapping quality, MAPQ). 60 is the highest value BWA-MEM reports, so the program is very sure that this is the correct place in the book for your sentence.
6.  **`50M`**: This is a **description of how the sentence lines up** (the CIGAR string). `50M` means all 50 "words" (bases) line up one-to-one with the book, with nothing inserted or deleted. On its own, `M` covers both matching and mismatching bases, so the proof that every base is identical comes from the `NM` and `MD` notes below.
7.  **`*`, `0`, `0`**: These fields describe a read's partner (mate). They are empty here because the sentence wasn't part of a pair.
8.  **`GGTCAGGTCTTCAGAGGGGAGACTCCTGCCCTGGTGTGCCCGGCTCCTGC`**: This is the **actual sentence** that was found.
9.  **`*`**: This field holds how clear the "words" were in your original sentence (quality scores). The query came from a FASTA file, which carries no quality scores, so the field is left as `*`.
10. **`NM:i:0`, `MD:Z:50`, `AS:i:50`, `XS:i:0`**: These are extra notes confirming things like:
    *   `NM:i:0`: There were **zero mistakes** or differences between your sentence and the one in the book.
    *   `MD:Z:50`: A run of 50 matching bases with no mismatch positions listed.
    *   `AS:i:50`: The alignment score. With BWA-MEM's default of 1 point per matching base, 50 is the highest possible score for a 50-base read.
    *   `XS:i:0`: The score of the next-best location is 0, meaning no competing place in the book was found.
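
The field-by-field reading above can be sketched in code. The example below splits a SAM alignment line on tabs into the 11 mandatory columns defined by the SAM format, plus optional `TAG:TYPE:VALUE` notes. The function name `parse_sam_line` is hypothetical, and the sketch handles only single alignment lines, not `@` header lines.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

def parse_sam_line(line):
    """Parse one SAM alignment line into a dict of fields and tags."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    tags = {}
    for item in columns[len(SAM_FIELDS):]:
        tag, type_code, value = item.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    record["tags"] = tags
    return record

# The alignment line reported by BWA-MEM in Step 9.
bwa_line = ("chr20:58634375-58634424\t0\tchr20\t58634375\t60\t50M\t*\t0\t0\t"
            "GGTCAGGTCTTCAGAGGGGAGACTCCTGCCCTGGTGTGCCCGGCTCCTGC\t*\t"
            "NM:i:0\tMD:Z:50\tAS:i:50\tXS:i:0")
record = parse_sam_line(bwa_line)
print(record["RNAME"], record["POS"], record["MAPQ"], record["CIGAR"],
      record["tags"]["NM"])
```

For real analyses a dedicated SAM/BAM library would be the more robust choice. This sketch is only meant to show how the columns map to the explanation.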

Here's a summary of the mapping findings from the BWA alignment:

1.  **A Match Was Found:** BWA-MEM successfully found a place in the reference genome where your query sequence fits.
2.  **Where It Mapped:** The sequence mapped to **chromosome 20 (chr20)**.
3.  **Starting Point:** The alignment starts at **position 58,634,375** on chromosome 20.
4.  **High Confidence:** The mapping quality is **60**, the highest value BWA-MEM reports, meaning the program is extremely sure this is the correct location.
5.  **Perfect Match:** The **50M** shows that all 50 bases line up with the reference without insertions or deletions, and **NM:i:0** together with **MD:Z:50** confirms **zero mismatches**.
6.  **Single, Forward Alignment:** The flag `0` indicates an unpaired read whose primary alignment is on the forward strand of the reference.

In short, your random 50 bp sequence from chromosome 20 was found exactly where it was expected to be on chromosome 20 with perfect accuracy.
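
The flag value is a sum of bit switches, so `0` simply means no switch is set. A short sketch of decoding it follows. The bit values come from the SAM format. The function name `decode_flag` and the label strings are hypothetical.

```python
# (bit, meaning) pairs as defined for the SAM FLAG field.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper pair"),
    (0x4, "unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "failed quality checks"),
    (0x400, "duplicate"),
    (0x800, "supplementary alignment"),
]

def decode_flag(flag):
    """Return the list of meanings whose bits are set in a SAM flag."""
    return [meaning for bit, meaning in FLAG_BITS if flag & bit]

print(decode_flag(0))   # no bits set: unpaired, mapped, forward, primary
print(decode_flag(16))  # the same read mapped to the reverse strand
```

An empty list for flag `0` corresponds to the "single, forward alignment" described above. A read that failed to align would instead carry the `unmapped` bit (flag `4`).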

# Read Alignment with Bowtie2 on chr20 (Custom Match)

In [17]:
# Step 11: Create and Navigate to Bowtie2 Working Directory
%cd /content
!mkdir -p bowtie2_demo_chr20
%cd bowtie2_demo_chr20
print("\nStep 11: Changed directory to /content/bowtie2_demo_chr20")

/content
/content/bowtie2_demo_chr20

Step 11: Changed directory to /content/bowtie2_demo_chr20


In [18]:
%%bash
# Step 12: Create a Minimal Reference File (the %%bash magic must be the first line of the cell)
cat > chr20_minimal.fa <<'EOF'
>chr20_minimal_sequence
AGCTTAGCTAGCTACCTATTACGAT
EOF

echo "Step 12: Created chr20_minimal.fa."

Step 12: Created chr20_minimal.fa.


In [19]:
# Step 13: Build Bowtie2 Index
!bowtie2-build chr20_minimal.fa chr20_minimal_index
print("\nStep 13: Bowtie2 Indexing Complete.")

Settings:
  Output files: "chr20_minimal_index.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  chr20_minimal.fa
Building a SMALL index
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 6
Using parameters --bmax 5 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 5 --dcv 1024
Constructing suffix-array element

In [20]:
%%bash
# Step 14: Create Bowtie2 Query FASTA File (the %%bash magic must be the first line of the cell)
cat > query_bowtie2_chr20.fa <<'EOF'
>query1_bowtie2
AGCTTAGCTAGCTACCTAT
EOF

echo "Step 14: Query file for Bowtie2 created."

Step 14: Query file for Bowtie2 created.


In [21]:
# Step 15: Perform Alignment using Bowtie2
!bowtie2 -x chr20_minimal_index -f query_bowtie2_chr20.fa -S result_bowtie2_chr20.sam
print("\nStep 15: Bowtie2 Alignment complete.")

1 reads; of these:
  1 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    1 (100.00%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
100.00% overall alignment rate

Step 15: Bowtie2 Alignment complete.


In [22]:
# Step 16: View Bowtie2 Alignment Output results
print("\nBowtie2 Alignment Results (SAM format):")
!head result_bowtie2_chr20.sam


Bowtie2 Alignment Results (SAM format):
@HD	VN:1.0	SO:unsorted
@SQ	SN:chr20_minimal_sequence	LN:25
@PG	ID:bowtie2	PN:bowtie2	VN:2.4.4	CL:"/usr/bin/bowtie2-align-s --wrapper basic-0 -x chr20_minimal_index -f query_bowtie2_chr20.fa -S result_bowtie2_chr20.sam"
query1_bowtie2	0	chr20_minimal_sequence	1	42	19M	*	0	0	AGCTTAGCTAGCTACCTAT	IIIIIIIIIIIIIIIIIII	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:19	YT:Z:UU


Here's a simpler way to understand the main line of the Bowtie2 alignment results:

`query1_bowtie2  0       chr20_minimal_sequence  1       42      19M     *       0       0       AGCTTAGCTAGCTACCTAT     IIIIIIIIIIIIIIIIIII     AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:19 YT:Z:UU`

Think of this like finding a specific phrase (your query sequence) within a short paragraph (your minimal reference sequence).

1.  **`query1_bowtie2`**: This is the **name of the phrase** you were looking for.
2.  **`0`**: This is a **code** telling us the phrase was found right-side-up (not reversed) and is a single piece.
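3.  **`chr20_minimal_sequence`**: This tells us **where** the phrase was found: within the "chr20_minimal_sequence" paragraph.
4.  **`1`**: This is the **starting point** in that paragraph where the phrase begins. It starts at position 1.
5.  **`42`**: This is a **confidence score** (the mapping quality, MAPQ). 42 is the highest value Bowtie2 normally reports in its default end-to-end mode, so Bowtie2 is as sure as it can be that this is the correct location.
6.  **`19M`**: This is a **description of how the phrase lines up** (the CIGAR string). `19M` means all 19 "words" (bases) line up one-to-one with the paragraph, with nothing inserted or deleted. On its own, `M` covers both matching and mismatching bases, so the `NM` and `MD` notes below are what confirm that every base is identical.
7.  **`*`, `0`, `0`**: These fields describe a read's partner (mate). They are empty here because the phrase wasn't part of a pair.
8.  **`AGCTTAGCTAGCTACCTAT`**: This is the **actual phrase** that was found.
9.  **`IIIIIIIIIIIIIIIIIII`**: This field holds how clear the "words" were in your original phrase (quality scores). The query came from a FASTA file, which carries no quality scores, so Bowtie2 fills in the placeholder `I` for every base.
10. **`AS:i:0`, `XN:i:0`, `XM:i:0`, `XO:i:0`, `XG:i:0`, `NM:i:0`, `MD:Z:19`, `YT:Z:UU`**: These are extra notes confirming things like:
    *   `NM:i:0`: There were **zero mistakes** or differences between your phrase and the one in the paragraph.
    *   `MD:Z:19`: A run of 19 matching bases with no mismatch positions listed.
    *   `AS:i:0`: The alignment score. In Bowtie2's end-to-end mode the best possible score is 0 and every mismatch or gap subtracts from it, so 0 means nothing was penalized.
    *   `XM:i:0`, `XO:i:0`, `XG:i:0`: Zero mismatches, zero gap openings, and zero gap extensions.
    *   `XN:i:0`: No ambiguous bases in the stretch of reference covered by the alignment.
    *   `YT:Z:UU`: The read was unpaired.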
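
Because `M` in a CIGAR string does not distinguish matches from mismatches, a perfect match is best confirmed by combining the CIGAR string with the `NM` tag. The sketch below does that check. The function names `cigar_operations` and `is_perfect_match` are hypothetical.

```python
import re

def cigar_operations(cigar):
    """Split a CIGAR string such as '19M' into (length, operation) tuples."""
    return [(int(length), op)
            for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def is_perfect_match(cigar, nm, read_length):
    """True when the whole read aligns without gaps, clipping, or edits."""
    operations = cigar_operations(cigar)
    if len(operations) != 1:
        return False  # gaps or clipping split the CIGAR into several parts
    length, op = operations[0]
    return op in ("M", "=") and length == read_length and nm == 0

# Values taken from the Bowtie2 and BWA-MEM alignment lines in this notebook.
print(is_perfect_match("19M", 0, 19))
print(is_perfect_match("50M", 0, 50))
```

A read with `19M` but `NM:i:1` would fail this check, which is exactly the case that reading the CIGAR string alone would miss.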

Here's a summary of the mapping findings from the Bowtie2 alignment:

1.  **A Match Was Found:** Bowtie2 successfully found a place in the minimal reference sequence where your query sequence aligns.
2.  **Where It Mapped:** The sequence mapped to the **`chr20_minimal_sequence`**.
3.  **Starting Point:** The alignment starts at **position 1** of the minimal reference sequence.
4.  **High Confidence:** The mapping quality is **42**, the highest value Bowtie2 normally reports in end-to-end mode, indicating that no competing location was found.
5.  **Perfect Match:** The **`19M`** shows that all 19 bases line up with the reference without insertions or deletions, and the **`NM:i:0`** tag confirms **zero mismatches**.
6.  **Single, Forward Alignment:** The flag `0` indicates it's a single sequence aligning on the forward strand.

In summary, your query sequence perfectly matched the beginning of the `chr20_minimal_sequence` in the Bowtie2 alignment.

Here is a comparison of the mapping summary findings from the BWA-MEM and Bowtie2 alignments:

**Overall Finding:**

*   **BWA-MEM:** A match was found for the query sequence in the reference genome.
*   **Bowtie2:** A match was found for the query sequence in the minimal reference sequence.

**Reference Used:**

*   **BWA-MEM:** Aligned against the **full hg38 Chromosome 20 (chr20)**.
*   **Bowtie2:** Aligned against a **minimal, custom reference sequence (`chr20_minimal_sequence`)**.

**Query Sequence:**

*   **BWA-MEM:** A **random 50 bp sequence extracted directly from chr20**.
*   **Bowtie2:** A **custom 19 bp sequence** (`AGCTTAGCTAGCTACCTAT`) designed to match the minimal reference.

**Mapping Location:**

*   **BWA-MEM:** Mapped to **chr20 at position 58,634,375**. (This varied based on the random extraction).
*   **Bowtie2:** Mapped to **`chr20_minimal_sequence` at position 1**. (This was fixed due to the custom sequences).

**Confidence Score:**

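*   **BWA-MEM:** Mapping quality of **60**, the highest value BWA-MEM reports.
*   **Bowtie2:** Mapping quality of **42**, the highest value Bowtie2 normally reports in end-to-end mode. (Note: the tools calculate and cap this score differently, so 60 and 42 both signal top confidence and should not be compared directly.)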

**Match Quality:**

*   **BWA-MEM:** **Perfect match (50M, NM:i:0)** over the entire 50 bp query.
*   **Bowtie2:** **Perfect match (19M, NM:i:0)** over the entire 19 bp query.

**Alignment Type:**

*   Both tools reported a **single alignment on the forward strand** (Flag 0).

**In essence:**

Both BWA-MEM and Bowtie2 successfully found perfect matches for their respective query sequences in their corresponding reference sequences. The main differences lie in the scale of the reference used (full chromosome vs. minimal sequence) and the nature of the query (randomly extracted vs. custom).
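
The SAM format defines mapping quality as a Phred-scaled value, MAPQ = -10 log10(probability that the mapping position is wrong). The sketch below converts the two scores seen in this notebook back into that nominal probability. The function name is hypothetical, and since each aligner estimates MAPQ with its own heuristics, the results are rough indications and not calibrated error rates.

```python
def mapq_to_error_probability(mapq):
    """Convert a Phred-scaled MAPQ into its nominal error probability."""
    return 10 ** (-mapq / 10)

for tool, mapq in [("BWA-MEM", 60), ("Bowtie2", 42)]:
    probability = mapq_to_error_probability(mapq)
    print(f"{tool}: MAPQ {mapq} -> nominal error probability {probability:.1e}")
```

Both values correspond to a very small chance of a wrong placement, which fits the point that the two scores sit at the top of their respective scales.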