In [1]:
# Code Cell 1: Install Required Tools
print("Step 1: Installing BWA, SAMtools, and Bowtie2.")
!apt-get install -qq bwa samtools > /dev/null
!apt-get install -qq bowtie2 > /dev/null
print("   -> Installation complete.")

Step 1: Installing BWA, SAMtools, and Bowtie2.
   -> Installation complete.


In [2]:
# Code Cell 2: Create and Navigate to BWA Working Directory
!mkdir -p /content/reference_chr20_bwa
%cd /content/reference_chr20_bwa
print("\nStep 2: Changed directory to /content/reference_chr20_bwa")

/content/reference_chr20_bwa

Step 2: Changed directory to /content/reference_chr20_bwa


In [3]:
# Code Cell 3: Download and Decompress hg38 Chromosome 20 Reference
print("\nStep 3: Downloading and unzipping hg38 Chromosome 20 (chr20.fa.gz)...")
!wget -q https://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr20.fa.gz
!gunzip -f chr20.fa.gz
print("   -> Download and decompression complete. File: chr20.fa")


Step 3: Downloading and unzipping hg38 Chromosome 20 (chr20.fa.gz)...
   -> Download and decompression complete. File: chr20.fa


In [4]:
# Code Cell 4: Create Small Reference Subset AND Extract Real Query Sequence
# 1. Create a small reference file (~100 kb region)
!head -n 2000 chr20.fa > chr20_small.fa
print("\nStep 4a: Created a small working reference: chr20_small.fa.")

# 2. DYNAMIC GENERATION: Extract the first 50 bases from the new small reference.
# Caveat: chr20.fa begins with a run of N's (an unsequenced gap), so this query is all N's and is not a guaranteed hit.
!grep -v '>' chr20_small.fa | head -n 1 | cut -c 1-50 > chr20_query_seq.txt
print("Step 4b: DYNAMICALLY GENERATED a 50 bp query sequence from the start of chr20_small.fa.")


Step 4a: Created a small working reference: chr20_small.fa.
Step 4b: DYNAMICALLY GENERATED a 50 bp query sequence from the start of chr20_small.fa.


# Read Alignment with BWA-MEM on chr20 (Guaranteed Match)

In [5]:
# Code Cell 6: Index the Small Reference Genome
!bwa index chr20_small.fa
print("\nStep 6: BWA Indexing Complete.")

[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.02 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.01 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index chr20_small.fa
[main] Real time: 0.057 sec; CPU: 0.030 sec

Step 6: BWA Indexing Complete.


In [6]:
# Code Cell 7: Create Query FASTA File from Extracted Sequence
# This block reads the dynamically generated sequence and formats it.
import os

if os.path.exists("chr20_query_seq.txt"):
    with open("chr20_query_seq.txt", "r") as f:
        real_query_seq = f.read().strip()

    # Format the read as a FASTA entry
    query_seq_bwa = f""">chr20_start_read_50bp
{real_query_seq}
"""
    with open("/content/query_bwa_chr20_real.fa", "w") as f:
        f.write(query_seq_bwa)

    print("Step 7: Real query FASTA file created at /content/query_bwa_chr20_real.fa.")
else:
    print("Error: Could not find dynamically generated sequence file.")

Step 7: Real query FASTA file created at /content/query_bwa_chr20_real.fa.


In [7]:
# Code Cell 8: Perform Alignment using BWA-MEM
!bwa mem chr20_small.fa /content/query_bwa_chr20_real.fa > /content/alignment_bwa_chr20_real.sam
print("\nStep 8: BWA-MEM Alignment complete.")

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 1 sequences (50 bp)...
[M::mem_process_seqs] Processed 1 reads in 0.000 CPU sec, 0.000 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem chr20_small.fa /content/query_bwa_chr20_real.fa
[main] Real time: 0.004 sec; CPU: 0.002 sec

Step 8: BWA-MEM Alignment complete.


In [8]:
# Code Cell 9: View and Interpret BWA Alignment Results
print("\nBWA Alignment Results (SAM format - Expected Success):")
!head /content/alignment_bwa_chr20_real.sam


BWA Alignment Results (SAM format - Expected Success):
@SQ	SN:chr20	LN:99950
@PG	ID:bwa	PN:bwa	VN:0.7.17-r1188	CL:bwa mem chr20_small.fa /content/query_bwa_chr20_real.fa
chr20_start_read_50bp	4	*	0	0	*	*	0	0	NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN	*	AS:i:0	XS:i:0


Let's interpret the BWA alignment results line by line:

*   **`@SQ SN:chr20 LN:99950`**: This is a header line (`@SQ`) describing the reference sequence. `SN:chr20` indicates the sequence name is "chr20", and `LN:99950` indicates its length is 99,950 bases. That is the sequence length in `chr20_small.fa`, not the file size: `head -n 2000` kept the header line plus 1,999 sequence lines of 50 bases each.
*   **`@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem chr20_small.fa /content/query_bwa_chr20_real.fa`**: This is a header line (`@PG`) describing the program used for alignment. `ID:bwa` and `PN:bwa` identify the program as BWA. `VN:0.7.17-r1188` is the version number. `CL:bwa mem chr20_small.fa /content/query_bwa_chr20_real.fa` shows the command line used to generate this alignment.
*   **`chr20_start_read_50bp 4 * 0 0 * * 0 0 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN * AS:i:0 XS:i:0`**: This is the alignment record for our query sequence (`chr20_start_read_50bp`). Let's break down the fields:
    *   `chr20_start_read_50bp`: The name of the query sequence.
    *   `4`: The SAM flag. A value of 4 (bit `0x4`) indicates the read is unmapped: BWA-MEM found no alignment for this read against the reference. Although the sequence was copied from the reference, the query itself consists of 50 N's, because the first sequence line of `chr20_small.fa` lies in the N-filled gap at the start of the chromosome. A read with no A, C, G, or T bases provides no exact-match seeds (BWA-MEM's default minimum seed length is 19), so there is nothing to extend into an alignment.
    *   `*`: The reference sequence name where the read is aligned. Since the read is unmapped, this is `*`.
    *   `0`: The 1-based leftmost mapping position. 0 for unmapped reads.
    *   `0`: Mapping quality (MAPQ). 0 for unmapped reads.
    *   `*`: CIGAR string. Describes the alignment of the read to the reference. `*` for unmapped reads.
    *   `*`: Name of mate/next read. `*` for single-end reads.
    *   `0`: Position of mate/next read. 0 for single-end reads.
    *   `0`: Inferred template size. 0 for single-end reads.
    *   `NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN`: The query sequence.
    *   `*`: The query quality string. `*` as qualities were not provided in the input FASTA.
    *   `AS:i:0`: Alignment score. 0 for unmapped reads.
    *   `XS:i:0`: Suboptimal alignment score. 0 for unmapped reads.
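
The field-by-field breakdown above can also be done programmatically. The following is a minimal sketch, not a full SAM parser: `SamRecord` and `parse_sam_line` are hypothetical names introduced here, and only integer (`i`) tags are converted, with every other tag type kept as a string.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM columns plus a dict of optional tags."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict


def parse_sam_line(line: str) -> SamRecord:
    """Split one tab-separated alignment line into named fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    tags = {}
    for field in cols[11:]:
        # Optional fields look like TAG:TYPE:VALUE, e.g. AS:i:0
        tag, typ, value = field.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    return SamRecord(
        cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]),
        cols[5], cols[6], int(cols[7]), int(cols[8]), cols[9], cols[10], tags,
    )


# The BWA-MEM record shown above, rebuilt with explicit tabs.
record_line = (
    "chr20_start_read_50bp\t4\t*\t0\t0\t*\t*\t0\t0\t"
    + "N" * 50
    + "\t*\tAS:i:0\tXS:i:0"
)
rec = parse_sam_line(record_line)
print(rec.flag, rec.rname, rec.cigar, rec.tags)
```

Header lines (those starting with `@`) are not alignment records and would need to be skipped before calling this function.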

**Interpretation Summary:**

Despite the query sequence being extracted directly from the reference, the BWA-MEM alignment shows the read as unmapped (flag 4). The cause is visible in the SEQ field: the query consists entirely of 'N' characters, because the first 50 bases of `chr20_small.fa` are N's. BWA-MEM starts from exact matches between read and reference, and a read made only of ambiguous bases offers none, so no alignment is reported. To obtain a successful alignment, the query must come from a region of the reference that contains actual A, T, C, or G nucleotides.
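
One way to obtain such a query is to scan the reference for the first window that contains only A, C, G, or T. The sketch below is an illustration under stated assumptions: `first_unambiguous_window` is a hypothetical helper, the FASTA is assumed to hold a single sequence, and lowercase (soft-masked) bases are upper-cased before checking.

```python
from typing import Tuple


def first_unambiguous_window(fasta_text: str, length: int = 50) -> Tuple[int, str]:
    """Return (1-based start, bases) of the first window with only A/C/G/T."""
    # Join all sequence lines, skipping the '>' header line.
    seq = "".join(
        line.strip() for line in fasta_text.splitlines() if not line.startswith(">")
    ).upper()
    run_start = 0  # index where the current run of unambiguous bases began
    for i, base in enumerate(seq):
        if base not in "ACGT":
            run_start = i + 1  # an ambiguous base resets the run
        elif i - run_start + 1 == length:
            return run_start + 1, seq[run_start:i + 1]
    raise ValueError(f"no unambiguous window of length {length} found")


# Toy reference: 14 N's, then real bases, mimicking an N-padded chromosome start.
toy_fasta = ">toy\nNNNNNNNNNN\nNNNNACGTAC\nGTTNACGTAC\n"
start, window = first_unambiguous_window(toy_fasta, length=6)
print(start, window)  # → 15 ACGTAC
```

In the notebook, passing the contents of `chr20_small.fa` (for example via `open("chr20_small.fa").read()`) with `length=50` would give a query that is expected to align, provided the 100 kb subset extends past the initial run of N's.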

# Read Alignment with Bowtie2 on chr20 (Custom Match)

In [9]:
# Code Cell 11: Create and Navigate to Bowtie2 Working Directory
%cd /content
!mkdir -p bowtie2_demo_chr20
%cd bowtie2_demo_chr20
print("\nStep 11: Changed directory to /content/bowtie2_demo_chr20")

/content
/content/bowtie2_demo_chr20

Step 11: Changed directory to /content/bowtie2_demo_chr20


In [10]:
%%bash
# Code Cell 12: Create a Minimal Reference File
cat > chr20_minimal.fa <<'EOF'
>chr20_minimal_sequence
AGCTTAGCTAGCTACCTATTACGAT
EOF

echo "Step 12: Created chr20_minimal.fa."

Step 12: Created chr20_minimal.fa.


In [11]:
# Code Cell 13: Build Bowtie2 Index
!bowtie2-build chr20_minimal.fa chr20_minimal_index
print("\nStep 13: Bowtie2 Indexing Complete.")

Settings:
  Output files: "chr20_minimal_index.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  chr20_minimal.fa
Building a SMALL index
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 6
Using parameters --bmax 5 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 5 --dcv 1024
Constructing suffix-array element

In [12]:
%%bash
# Code Cell 14: Create Bowtie2 Query FASTA File
cat > query_bowtie2_chr20.fa <<'EOF'
>query1_bowtie2
AGCTTAGCTAGCTACCTAT
EOF

echo "Step 14: Query file for Bowtie2 created."

Step 14: Query file for Bowtie2 created.


In [13]:
# Code Cell 15: Perform Alignment using Bowtie2
!bowtie2 -x chr20_minimal_index -f query_bowtie2_chr20.fa -S result_bowtie2_chr20.sam
print("\nStep 15: Bowtie2 Alignment complete.")

1 reads; of these:
  1 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    1 (100.00%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
100.00% overall alignment rate

Step 15: Bowtie2 Alignment complete.


In [14]:
# Code Cell 16: View and Interpret Bowtie2 Alignment Results
print("\nBowtie2 Alignment Results (SAM format):")
!head result_bowtie2_chr20.sam


Bowtie2 Alignment Results (SAM format):
@HD	VN:1.0	SO:unsorted
@SQ	SN:chr20_minimal_sequence	LN:25
@PG	ID:bowtie2	PN:bowtie2	VN:2.4.4	CL:"/usr/bin/bowtie2-align-s --wrapper basic-0 -x chr20_minimal_index -f query_bowtie2_chr20.fa -S result_bowtie2_chr20.sam"
query1_bowtie2	0	chr20_minimal_sequence	1	42	19M	*	0	0	AGCTTAGCTAGCTACCTAT	IIIIIIIIIIIIIIIIIII	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:19	YT:Z:UU


## Comparison of BWA-MEM and Bowtie2 Alignment Results

Based on the alignment results generated in the previous steps, here's a comparison of how BWA-MEM and Bowtie2 performed with the provided reference and query sequences:

**BWA-MEM Alignment (with `chr20_small.fa` and the N-rich query):**

*   **Result:** The query sequence was reported as **unmapped** (SAM flag 4).
*   **Reasoning:** The query sequence was dynamically extracted from the first sequence line of `chr20_small.fa`, which consists entirely of 'N' characters. BWA-MEM relies on exact nucleotide matches to seed alignments before extending them, and a read composed solely of N's provides no such seeds, so the default parameters and scoring scheme cannot produce an alignment for it.
*   **Conclusion:** This demonstrates that BWA-MEM requires actual nucleotide information (A, T, C, G) in the query sequence to find a valid alignment under typical settings.

**Bowtie2 Alignment (with `chr20_minimal.fa` and a specific query):**

*   **Result:** The query sequence (`AGCTTAGCTAGCTACCTAT`) was reported as **mapped** (SAM flag 0) to the reference `chr20_minimal_sequence` with high confidence (MAPQ 42).
*   **Reasoning:** The `chr20_minimal.fa` reference and the `query_bowtie2_chr20.fa` query were specifically designed to have a perfect match. Bowtie2 successfully identified this perfect match, indicated by the SAM flag 0, a CIGAR string of `19M` (19 aligned positions), and an edit distance (NM) of 0.
*   **Conclusion:** This demonstrates Bowtie2's ability to accurately align a query sequence to a reference when a clear match exists.

**Overall Comparison:**

The key difference in the results is the mapping status of the query sequences. The BWA-MEM alignment failed because the query was composed of ambiguous 'N' bases, which are not suitable for alignment with its standard parameters. In contrast, the Bowtie2 alignment was successful because both the reference and query were specifically constructed to ensure a perfect match with unambiguous nucleotides. Because the two tools were given different references and different queries, the outcomes reflect the inputs rather than any difference in aligner capability.

This highlights the importance of:

1.  **Query Sequence Quality:** Alignment tools perform best with query sequences containing clear nucleotide information (A, T, C, G). Ambiguous bases ('N') can hinder or prevent successful alignment.
2.  **Reference and Query Suitability:** For successful alignment, the query sequence must have a sufficient degree of similarity to a region in the reference sequence, and the sequences should ideally not be composed of ambiguous bases in the region expected to align.
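
The mapping status compared above is encoded in the SAM flag, which is a bit field. As a minimal sketch, the bit meanings from the SAM specification can be decoded as follows; the names `SAM_FLAG_BITS`, `decode_flag`, and `is_mapped` are introduced here for illustration only.

```python
# Bit values and meanings as defined in the SAM specification.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag: int) -> list:
    """List the names of all bits set in a SAM flag."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


def is_mapped(flag: int) -> bool:
    """A read is mapped when the 0x4 bit is clear."""
    return not flag & 0x4


print(decode_flag(4), is_mapped(4))  # BWA-MEM record  → ['read unmapped'] False
print(decode_flag(0), is_mapped(0))  # Bowtie2 record → [] True
```

A flag of 0 therefore means that no bit is set: the read is unpaired, mapped, on the forward strand, and the record is the primary alignment.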

Let's interpret the Bowtie2 alignment results line by line:

*   **`@HD VN:1.0 SO:unsorted`**: This is a header line (`@HD`) specifying the SAM format version (`VN:1.0`) and that the alignments are unsorted (`SO:unsorted`).
*   **`@SQ SN:chr20_minimal_sequence LN:25`**: This is a header line (`@SQ`) describing the reference sequence. `SN:chr20_minimal_sequence` indicates the sequence name is "chr20_minimal_sequence", and `LN:25` indicates its length is 25 bases (the length of the single sequence in `chr20_minimal.fa`).
*   **`@PG ID:bowtie2 PN:bowtie2 VN:2.4.4 CL:"/usr/bin/bowtie2-align-s --wrapper basic-0 -x chr20_minimal_index -f query_bowtie2_chr20.fa -S result_bowtie2_chr20.sam"`**: This is a header line (`@PG`) describing the program used for alignment. `ID:bowtie2` and `PN:bowtie2` identify the program as Bowtie2. `VN:2.4.4` is the version number. `CL:...` shows the command line used to generate this alignment.
*   **`query1_bowtie2 0 chr20_minimal_sequence 1 42 19M * 0 0 AGCTTAGCTAGCTACCTAT IIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:19 YT:Z:UU`**: This is the alignment record for our query sequence (`query1_bowtie2`). Let's break down the fields:
    *   `query1_bowtie2`: The name of the query sequence.
    *   `0`: The SAM flag. A value of 0 indicates the read is mapped and is the primary alignment, mapping to the forward strand.
    *   `chr20_minimal_sequence`: The name of the reference sequence where the read is aligned.
    *   `1`: The 1-based leftmost mapping position on the reference. This indicates the alignment starts at the first base of `chr20_minimal_sequence`.
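    *   `42`: Mapping quality (MAPQ). 42 is the highest value Bowtie2 assigns and indicates high confidence that the reported position is correct.
    *   `19M`: CIGAR string. `19M` means that 19 bases of the read are aligned to 19 reference bases without insertions or deletions. The `M` operation covers both matches and mismatches, so the CIGAR alone does not prove identity; the `XM`, `NM`, and `MD` tags below confirm that all 19 positions match.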
    *   `*`: Name of mate/next read. `*` for single-end reads.
    *   `0`: Position of mate/next read. 0 for single-end reads.
    *   `0`: Inferred template size. 0 for single-end reads.
    *   `AGCTTAGCTAGCTACCTAT`: The query sequence.
    *   `IIIIIIIIIIIIIIIIIII`: The query quality string. The input was FASTA, which carries no qualities, so Bowtie2 fills in a placeholder: 'I' encodes Phred 40 in the Phred+33 scheme.
    *   `AS:i:0`: Alignment score. In Bowtie2's default end-to-end mode, the best possible score is 0 and penalties for mismatches and gaps are subtracted from it, so 0 indicates a perfect match.
    *   `XN:i:0`: Number of ambiguous bases (N's) in the reference region covered by the alignment. 0 indicates no N's were involved.
    *   `XM:i:0`: Number of mismatches in the alignment. 0 indicates a perfect match.
    *   `XO:i:0`: Number of gap opens. 0 indicates no gap opens.
    *   `XG:i:0`: Number of gap extensions. 0 indicates no gap extensions.
    *   `NM:i:0`: Edit distance to the reference. 0 indicates a perfect match.
    *   `MD:Z:19`: Encodes mismatched and deleted reference bases. `MD:Z:19` indicates a run of 19 matching bases with no mismatches or deletions.
    *   `YT:Z:UU`: Indicates the type of alignment with respect to pairing. `UU` indicates the read was not part of a pair (an unpaired alignment); it does not indicate uniqueness.
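
Since `M` in a CIGAR string does not distinguish matches from mismatches, a perfect match has to be confirmed from several fields together. The sketch below shows one way to do that check; `parse_cigar`, `reference_span`, and `is_perfect_match` are hypothetical helpers, and the check assumes an end-to-end alignment with no clipping.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> list:
    """Turn e.g. '19M' into [(19, 'M')]; '*' (unavailable) gives []."""
    if cigar == "*":
        return []
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuild the string to make sure nothing was silently skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def reference_span(ops: list) -> int:
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(n for n, op in ops if op in "MDN=X")


def is_perfect_match(cigar: str, nm: int, md: str, read_len: int) -> bool:
    """True when the whole read aligns with no mismatches or gaps."""
    return parse_cigar(cigar) == [(read_len, "M")] and nm == 0 and md == str(read_len)


ops = parse_cigar("19M")
print(ops, reference_span(ops))               # → [(19, 'M')] 19
print(is_perfect_match("19M", 0, "19", 19))   # → True
print(is_perfect_match("19M", 1, "10A8", 19)) # → False
```

The last call illustrates the point: a read with one mismatch still has the CIGAR `19M`, and only `NM` and `MD` reveal the difference.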

**Interpretation Summary:**

The Bowtie2 alignment results show that the query sequence `AGCTTAGCTAGCTACCTAT` successfully and uniquely aligned to the beginning of the `chr20_minimal_sequence` reference. The SAM flag of 0, the mapping position of 1, the high mapping quality (42), the CIGAR string of 19M, and the edit distance (NM:i:0) and mismatch (XM:i:0) counts of 0 all confirm this perfect alignment. This demonstrates Bowtie2's ability to correctly align a query sequence to its source reference.
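
For a reference this small, the reported position can be cross-checked with a plain string search. This is only a sanity check, not a substitute for an aligner: the hypothetical `exact_hits` helper below looks for exact forward-strand occurrences and ignores mismatches, gaps, and the reverse complement.

```python
def exact_hits(reference: str, query: str) -> list:
    """Return the 1-based start positions of every exact occurrence of query."""
    hits = []
    start = reference.find(query)
    while start != -1:
        hits.append(start + 1)  # convert 0-based index to SAM's 1-based POS
        start = reference.find(query, start + 1)
    return hits


reference = "AGCTTAGCTAGCTACCTATTACGAT"  # sequence in chr20_minimal.fa
query = "AGCTTAGCTAGCTACCTAT"            # sequence in query_bowtie2_chr20.fa

print(exact_hits(reference, query))  # → [1]
```

The single hit at position 1 agrees with the POS field and with Bowtie2's summary line reporting that the read aligned exactly one time.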