In [1]:
# Step 1: Install required tools
# -y answers the confirmation prompt so the install also works non-interactively
!apt-get install -y bwa samtools

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libhts3 libhtscodecs2
Suggested packages:
  cwltool
The following NEW packages will be installed:
  bwa libhts3 libhtscodecs2 samtools
0 upgraded, 4 newly installed, 0 to remove and 38 not upgraded.
Need to get 1,158 kB of archives.
After this operation, 2,736 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 bwa amd64 0.7.17-6 [195 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libhtscodecs2 amd64 1.1.1-3 [53.2 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libhts3 amd64 1.13+ds-2build1 [390 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy/universe amd64 samtools amd64 1.13-4 [520 kB]
Fetched 1,158 kB in 0s (2,961 kB/s)
Selecting previously unselected package bwa.
(Reading database ... 126455 files and directories currently installed.)
Preparing to un

In [2]:
# Step 2: Set Up the Working Directory
# Create the reference folder
!mkdir -p /content/reference
%cd /content/reference


/content/reference


In [3]:
# Step 3: Download chr22 Reference Genome
!wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr22.fa.gz
!gunzip -f chr22.fa.gz


--2025-10-30 06:15:36--  https://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr22.fa.gz
Resolving hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)|128.114.119.163|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12255678 (12M) [application/x-gzip]
Saving to: ‘chr22.fa.gz’


2025-10-30 06:15:37 (24.0 MB/s) - ‘chr22.fa.gz’ saved [12255678/12255678]



In [4]:
# Step 4: Extract a Small Region for Speed
# Extract first 2000 lines (~100 kb region)
!head -n 2000 chr22.fa > chr22_small.fa
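
Before indexing, it may be worth checking what the extracted region actually contains, because chromosome assemblies often begin with runs of `N` placeholder bases that no read can align to. The following is a minimal sketch, not part of the original notebook; the helper names `base_composition` and `n_fraction` are hypothetical, and the toy input stands in for the real extract.

```python
from collections import Counter

def base_composition(fasta_text):
    """Count bases in FASTA text, skipping header lines that start with '>'."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue
        counts.update(line.strip().upper())
    return counts

def n_fraction(fasta_text):
    """Return the fraction of bases that are 'N' (0.0 for empty input)."""
    counts = base_composition(fasta_text)
    total = sum(counts.values())
    return counts["N"] / total if total else 0.0

# Toy input (an assumption for illustration): 15 of 20 bases are N
example = ">demo\nNNNNNNNNNN\nNNNNNACGTA\n"
print(f"N fraction: {n_fraction(example):.2f}")  # N fraction: 0.75
```

In the notebook, the same check could be run on the real file with `n_fraction(open("chr22_small.fa").read())`; a value close to 1.0 would mean the extract holds almost no alignable sequence.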


# Read Alignment with the *BWA* Tool

In [5]:
# 1: Indexing the Reference Genome
!bwa index chr22_small.fa


[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.02 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.01 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index chr22_small.fa
[main] Real time: 0.071 sec; CPU: 0.039 sec


In [6]:
# 2: Creating a Small Query FASTA File
query_seq = """>query1
TGGAAGGACTGAGGTTGATAAAGTAAAGCCAAAGAACTAG
"""

with open("/content/query.fa", "w") as f:
    f.write(query_seq)


In [7]:
# 3: Align the Query Using BWA-MEM
!bwa mem chr22_small.fa /content/query.fa > /content/alignment_bwa.sam


[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 1 sequences (40 bp)...
[M::mem_process_seqs] Processed 1 reads in 0.000 CPU sec, 0.001 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem chr22_small.fa /content/query.fa
[main] Real time: 0.004 sec; CPU: 0.003 sec


In [8]:
# 4: Viewing the BWA Alignment results
!head /content/alignment_bwa.sam

@SQ	SN:chr22	LN:99950
@PG	ID:bwa	PN:bwa	VN:0.7.17-r1188	CL:bwa mem chr22_small.fa /content/query.fa
query1	4	*	0	0	*	*	0	0	TGGAAGGACTGAGGTTGATAAAGTAAAGCCAAAGAACTAG	*	AS:i:0	XS:i:0


# Detailed interpretation of the BWA alignment results (output of cell In [8])

The output is in SAM (Sequence Alignment/Map) format. Lines starting with @ are header lines; every other line is one alignment record with tab-separated fields.

    Line 1: @SQ SN:chr22 LN:99950
        @SQ: Header line indicating a sequence dictionary.
        SN:chr22: Sequence Name, which is the reference chromosome 22.
        LN:99950: Length of the reference sequence (the small region extracted).
    Line 2: @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem chr22_small.fa /content/query.fa
        @PG: Header line indicating a program record.
        ID:bwa: Program ID, which is bwa.
        PN:bwa: Program Name, also bwa.
        VN:0.7.17-r1188: Version of the program used.
        CL:bwa mem chr22_small.fa /content/query.fa: Command line used to generate the alignment.
    Line 3: query1 4 * 0 0 * * 0 0 TGGAAGGACTGAGGTTGATAAAGTAAAGCCAAAGAACTAG * AS:i:0 XS:i:0
        query1: The name of the query sequence.
        4: This is the SAM flag. A value of 4 indicates that the read is unmapped.
        *: RNAME (Reference sequence name). Since the read is unmapped, this is "*".
        0: POS (1-based leftmost mapping position). Since the read is unmapped, this is 0.
        0: MAPQ (Mapping Quality). Since the read is unmapped, this is 0.
        *: CIGAR string. Describes the alignment. Since the read is unmapped, this is "*".
        *: RNEXT (Reference sequence name of the next segment). Since it's a single-end read and unmapped, this is "*".
        0: PNEXT (Position of the next segment). Since it's a single-end read and unmapped, this is 0.
        0: TLEN (Template length). Since it's a single-end read and unmapped, this is 0.
        TGGAAGGACTGAGGTTGATAAAGTAAAGCCAAAGAACTAG: The query sequence.
        *: QUAL (Phred+33 encoded base quality scores). Not provided in this example, so "*".
        AS:i:0: Alignment score. 0 indicates no alignment was found.
        XS:i:0: Suboptimal alignment score. 0 indicates no suboptimal alignment was found.

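In summary, BWA-MEM could not map query1 to chr22_small.fa. A likely cause is the extract itself: the hg38 chr22 assembly begins with a long run of unsequenced N bases, so the first 2000 lines of chr22.fa may contain no real sequence for the 40 bp query to match.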
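
The SAM flag is a bit field, so a value such as 4 is easier to read once it is split into its individual bits. The sketch below is an illustration rather than part of the original notebook: the bit values follow the SAM specification, while the label strings and the `decode_flag` helper are hypothetical names chosen here.

```python
# Bit values from the SAM specification; the label strings are our own.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag):
    """Return the labels of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]

print(decode_flag(4))   # ['unmapped']
print(decode_flag(0))   # []
print(decode_flag(16))  # ['reverse_strand']
```

A flag of 0 has no bits set, which for a single-end read means it is mapped to the forward strand as a primary alignment.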

# Read Alignment with the Bowtie2 Tool

In [9]:
# 1: Create Bowtie2 Working Directory
%cd /content
!mkdir -p bowtie2_demo
%cd bowtie2_demo


/content
/content/bowtie2_demo


In [12]:
# 2. Install Bowtie2 and SAMtools
!apt-get -qq install bowtie2 samtools > /dev/null

In [13]:
# 3. Download the chr22 Reference Genome
# Download and decompress chr22
!wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr22.fa.gz
!gunzip -f chr22.fa.gz


--2025-10-30 06:33:34--  https://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr22.fa.gz
Resolving hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)|128.114.119.163|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12255678 (12M) [application/x-gzip]
Saving to: ‘chr22.fa.gz’


2025-10-30 06:33:35 (22.7 MB/s) - ‘chr22.fa.gz’ saved [12255678/12255678]



In [14]:
# 4: Reduce file size for quick indexing
# Extract the first 2000 lines (~100 kb region)
!head -n 2000 chr22.fa > chr22_small.fa

In [15]:
%%bash
# Overwrite chr22_small.fa with a 35 bp toy reference that contains the query used below
cat > chr22_small.fa <<'EOF'
>chr22_small
AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG
EOF

In [16]:
# 5. Build the Bowtie2 Index
!bowtie2-build chr22_small.fa chr22_small_index


Settings:
  Output files: "chr22_small_index.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  chr22_small.fa
Building a SMALL index
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 8
Using parameters --bmax 6 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 6 --dcv 1024
Constructing suffix-array element gen

In [17]:
%%bash
# 6. Create a Small Query FASTA File (the %%bash magic must be the first line of the cell)
cat  > query.fa <<'EOF'
>query1
AGCTTAGCTAGCTACCTAT
EOF

echo "Query read content"
cat query.fa



Query read content
>query1
AGCTTAGCTAGCTACCTAT
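
Before running the aligner, a plain string search can confirm whether the query occurs exactly in the reference. This is a minimal sketch and not how Bowtie2 works internally; `reverse_complement` and `find_exact` are hypothetical helpers, and the reference string is the 35 bases of the toy sequence with whitespace removed.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    comp = str.maketrans("ACGTN", "TGCAN")
    return seq.upper().translate(comp)[::-1]

def find_exact(reference, query):
    """Return (1-based position, strand) of the first exact match, or None."""
    reference = reference.upper()
    pos = reference.find(query.upper())
    if pos != -1:
        return pos + 1, "+"
    pos = reference.find(reverse_complement(query))
    if pos != -1:
        return pos + 1, "-"
    return None

reference = "AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG"  # toy reference, 35 bp
query = "AGCTTAGCTAGCTACCTAT"                      # query1, 19 bp
print(find_exact(reference, query))  # (1, '+')
```

An exact forward-strand hit at position 1 is what an aligner would be expected to report for this pair; a result of `None` would suggest the read needs mismatches or gaps to align, or does not belong to the reference at all.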


In [18]:
# 7. Align the Query Using Bowtie2
!bowtie2 -x chr22_small_index -f query.fa -S result_bowtie2.sam


1 reads; of these:
  1 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    1 (100.00%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
100.00% overall alignment rate


In [19]:
# 8. View the Bowtie2 Alignment Results
!head result_bowtie2.sam


@HD	VN:1.0	SO:unsorted
@SQ	SN:chr22_small	LN:35
@PG	ID:bowtie2	PN:bowtie2	VN:2.4.4	CL:"/usr/bin/bowtie2-align-s --wrapper basic-0 -x chr22_small_index -f query.fa -S result_bowtie2.sam"
query1	0	chr22_small	1	42	19M	*	0	0	AGCTTAGCTAGCTACCTAT	IIIIIIIIIIIIIIIIIII	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:19	YT:Z:UU


# Detailed interpretation of the Bowtie2 alignment results (output of cell In [19])

The output is in SAM (Sequence Alignment/Map) format. Lines starting with @ are header lines; every other line is one alignment record with tab-separated fields.

    Line 1: @HD VN:1.0 SO:unsorted
        @HD: File-level header line. When present, it is the first line of the file.
        VN:1.0: Format version.
        SO:unsorted: Sort order of the alignments (unsorted in this case).
    Line 2: @SQ SN:chr22_small LN:35
        @SQ: Header line indicating a sequence dictionary.
        SN:chr22_small: Sequence Name, which is the small reference sequence used for indexing.
        LN:35: Length of the reference sequence.
    Line 3: @PG ID:bowtie2 PN:bowtie2 VN:2.4.4 CL:"/usr/bin/bowtie2-align-s --wrapper basic-0 -x chr22_small_index -f query.fa -S result_bowtie2.sam"
        @PG: Header line indicating a program record.
        ID:bowtie2: Program ID, which is bowtie2.
        PN:bowtie2: Program Name, also bowtie2.
        VN:2.4.4: Version of the program used.
        CL:"/usr/bin/bowtie2-align-s --wrapper basic-0 -x chr22_small_index -f query.fa -S result_bowtie2.sam": Command line used to generate the alignment.
    Line 4: query1 0 chr22_small 1 42 19M * 0 0 AGCTTAGCTAGCTACCTAT IIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:19 YT:Z:UU
        query1: The name of the query sequence.
        0: This is the SAM flag. A value of 0 means no flag bits are set: the read is unpaired, mapped, and aligned to the forward strand.
        chr22_small: RNAME (Reference sequence name) to which the read is mapped.
        1: POS (1-based leftmost mapping position) on the reference sequence. The alignment starts at position 1.
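        42: MAPQ (Mapping Quality). 42 is the highest value Bowtie2 assigns, indicating a confident, unique placement.
        19M: CIGAR string. 19M means that 19 query bases are aligned to 19 reference bases without gaps. M covers both matches and mismatches; the XM, NM, and MD tags below show that all 19 are matches.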
        *: RNEXT (Reference sequence name of the next segment). Since it's a single-end read, this is "*".
        0: PNEXT (Position of the next segment). Since it's a single-end read, this is 0.
        0: TLEN (Template length). Since it's a single-end read, this is 0.
        AGCTTAGCTAGCTACCTAT: The query sequence.
        IIIIIIIIIIIIIIIIIII: QUAL (Phred+33 encoded base quality scores). "I" decodes to Phred 40 (ASCII 73 minus 33). FASTA input carries no qualities, so Bowtie2 fills in this placeholder value.
        AS:i:0: Alignment score. In Bowtie2's default end-to-end mode, 0 is the best possible score, because mismatches and gaps only subtract penalties.
        XN:i:0: Number of ambiguous bases in the reference. 0 means no ambiguous bases.
        XM:i:0: Number of mismatches in the alignment. 0 means no mismatches.
        XO:i:0: Number of gap opens. 0 means no gap opens.
        XG:i:0: Number of gap extensions. 0 means no gap extensions.
        NM:i:0: Edit distance to the reference. 0 means no differences (mismatches, insertions, or deletions).
        MD:Z:19: MD tag (mismatching positions). 19 means that 19 consecutive reference bases match the read, with no mismatches or deletions.
        YT:Z:UU: Type of alignment. UU indicates an unpaired alignment.
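
The field-by-field reading above can be reproduced programmatically. The following is a minimal sketch that splits the alignment line into the 11 mandatory SAM columns plus optional tags and decodes the Phred+33 qualities; `parse_sam_record` and `phred_scores` are hypothetical helpers written for this illustration, not functions from SAMtools or any library.

```python
MANDATORY = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
             "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

def parse_sam_record(line):
    """Split one SAM alignment line into mandatory fields and optional tags."""
    cols = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(MANDATORY, cols[:11]):
        record[name] = int(value) if name in INT_FIELDS else value
    tags = {}
    for field in cols[11:]:
        tag, typ, value = field.split(":", 2)  # e.g. "NM:i:0"
        tags[tag] = int(value) if typ == "i" else value
    record["tags"] = tags
    return record

def phred_scores(qual):
    """Decode a Phred+33 quality string; '*' means qualities are absent."""
    return [] if qual == "*" else [ord(c) - 33 for c in qual]

# The alignment line from the Bowtie2 output above
line = ("query1\t0\tchr22_small\t1\t42\t19M\t*\t0\t0\t"
        "AGCTTAGCTAGCTACCTAT\tIIIIIIIIIIIIIIIIIII\t"
        "AS:i:0\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:19\tYT:Z:UU")
rec = parse_sam_record(line)
print(rec["POS"], rec["MAPQ"], rec["CIGAR"])  # 1 42 19M
print(rec["tags"]["MD"], rec["tags"]["YT"])   # 19 UU
print(set(phred_scores(rec["QUAL"])))         # {40}
```

For real analyses a dedicated SAM/BAM library is usually a safer choice than hand-rolled parsing; the sketch only makes the column layout explicit.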

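In summary, Bowtie2 maps query1 to the chr22_small reference starting at position 1 on the forward strand, with no mismatches, insertions, or deletions. Note that this reference is the 35 bp toy sequence written in cell In [15], not the chr22 extract, so the result demonstrates the workflow rather than a real chr22 hit.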