# 🧬 Sequence Alignment Map (SAM) Format Introduction

## 📚 Introduction to the SAM Format

The **SAM format** (Sequence Alignment/Map) is a widely used **text-based format** for storing biological sequence alignment data, especially from **next-generation sequencing (NGS)** experiments.  
It was developed as part of the **SAMtools** project to efficiently handle **large volumes of sequence reads aligned** to a reference genome.

---

## 📂 Structure of a SAM File

Each line in the alignment section of a SAM file describes one alignment of a read, so a read with secondary or supplementary alignments occupies several lines.  
A SAM file consists of two parts:

- **Header Section (optional)**:  
  Lines beginning with `@` that describe metadata like reference sequences (`@SQ`) and program versions (`@PG`).
  
- **Alignment Section**:  
  A table where each row describes one read, with fields separated by **tabs**.
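
The two-part layout can be illustrated with a minimal sketch in plain Python. The function name `split_sam` and the three example lines are hypothetical and only serve the illustration; real files are better read with samtools or pysam.

```python
from typing import Iterable, List, Tuple


def split_sam(lines: Iterable[str]) -> Tuple[List[str], List[List[str]]]:
    """Separate '@' header lines from tab-split alignment records."""
    header, records = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore empty lines
        if line.startswith("@"):
            header.append(line)
        else:
            records.append(line.split("\t"))
    return header, records


# Hypothetical content: two header lines and one alignment record
example_sam = [
    "@SQ\tSN:chr1\tLN:248956422\n",
    "@PG\tID:aligner\tPN:aligner\n",
    "read123\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACTG\tIIII\n",
]

header, records = split_sam(example_sam)
print(len(header), "header lines;", len(records), "alignment record(s)")
print("First record maps to", records[0][2], "at position", records[0][3])
```

Checking the first character is sufficient here because the SAM specification does not allow `@` in read names.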

---

## 🔑 Main Fields in a SAM Alignment Entry

The **11 mandatory fields** are:

| Field | Description |
|:-----|:------------|
| QNAME | Query (read) name |
| FLAG | Bitwise flag describing the read status (e.g., paired, mapped, reversed) |
| RNAME | Reference sequence name (e.g., chromosome) |
| POS | 1-based leftmost mapping position of the first aligned (non-clipped) base |
| MAPQ | Mapping quality score |
| CIGAR | Compact representation of alignment (matches, insertions, deletions) |
| RNEXT | Reference name of the mate/next read |
| PNEXT | Position of the mate/next read |
| TLEN | Observed Template Length |
| SEQ | Sequence of the read |
| QUAL | Base quality scores |
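
The FLAG field packs several yes/no properties into one integer. The sketch below decodes it bit by bit; the bit values follow the SAM specification, while the short label names and the helper names `decode_flag` and `passes_filter` are chosen here for illustration only.

```python
# Bit values from the SAM specification; label names are illustrative
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list:
    """Return the labels of all bits that are set in a FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def passes_filter(flag: int, require: int = 0, exclude: int = 0) -> bool:
    """Keep a read if all 'require' bits are set and no 'exclude' bit is set."""
    return (flag & require) == require and (flag & exclude) == 0


print(decode_flag(0))    # forward-strand, mapped, unpaired read: no bits set
print(decode_flag(16))   # read aligned to the reverse strand
print(decode_flag(99))   # 99 = 1 + 2 + 32 + 64
print(passes_filter(99, require=0x1, exclude=0x4))
```

`passes_filter` mirrors the logic of the `-f` (require) and `-F` (exclude) options of `samtools view`.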

---

### 🧙‍♀️ Example SAM Line

In [None]:
#read123	0	chr1	100	255	4M	*	0	0	ACTG	IIII
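
As a rough sketch, the example line can be split into its 11 mandatory fields with plain Python. The names `SamRecord`, `parse_sam_line` and `phred_scores` are hypothetical helpers, and the quality decoding assumes the standard Phred+33 encoding of the QUAL field.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str]  # optional fields after the 11 mandatory ones


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line into typed fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("an alignment line needs at least 11 tab-separated fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


def phred_scores(qual: str) -> List[int]:
    """Convert a Phred+33 encoded QUAL string into integer quality scores."""
    if qual == "*":  # '*' means that no qualities are stored
        return []
    return [ord(char) - 33 for char in qual]


record = parse_sam_line("read123\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACTG\tIIII")
print(record.qname, record.rname, record.pos, record.cigar)
print(phred_scores(record.qual))
```

In this line, a MAPQ of 255 indicates that no mapping quality is available, and `*` in RNEXT means that no mate information is stored.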

---

## 📦 SAM vs BAM

- **SAM**: Text-based, human-readable.
- **BAM**: Binary version of SAM. Compressed for efficient storage and fast access.

# 🧵 The CIGAR String

The **CIGAR** string describes how the read aligns with the reference genome:

- Events are **length + type**.
- Common event types:
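  - `M`: Alignment match (can be a sequence match or a mismatch)
  - `I`: Insertion to the reference (bases present in the read, not in the reference)
  - `D`: Deletion from the reference (bases present in the reference, not in the read)
  - `N`: Skipped region on the reference (e.g., an intron in spliced RNA alignments)
  - `S`: Soft clipping (clipped sequences present in SEQ)
  - `H`: Hard clipping (clipped sequences NOT present in SEQ)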

---

### ✏️ Example: CIGAR String Explained

In [None]:
# CIGAR: 10M1I5M5D10M

# 10 matches
# 1 inserted base (in read, not in reference)
# 5 matches
# 5 deleted bases (present in reference, not read)
# 10 matches
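
The breakdown above can be reproduced with a small parser. This is a minimal sketch: the names `parse_cigar` and `cigar_lengths` are hypothetical, and the sets of query- and reference-consuming operations follow the SAM specification.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")      # operations that advance along the read
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, operation) tuples."""
    ops = [(int(length), op) for length, op in CIGAR_RE.findall(cigar)]
    # Re-assemble the string to detect characters the pattern did not cover
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar: str) -> tuple:
    """Return (bases consumed on the read, bases consumed on the reference)."""
    ops = parse_cigar(cigar)
    query_length = sum(n for n, op in ops if op in CONSUMES_QUERY)
    reference_length = sum(n for n, op in ops if op in CONSUMES_REFERENCE)
    return query_length, reference_length


print(parse_cigar("10M1I5M5D10M"))
print(cigar_lengths("10M1I5M5D10M"))  # (26, 30)
```

The read contributes 26 bases (10 + 1 + 5 + 10) while the alignment spans 30 reference bases (10 + 5 + 5 + 10), because the insertion only consumes the read and the deletion only consumes the reference.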

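> ⚡ Note: **Deleted** bases are absent from the SEQ field. The CIGAR string records only the length of a deletion; the deleted reference bases themselves are recorded in the optional **MD tag**. The MD tag is not written by default and has to be requested from the aligner or added afterwards (e.g., with `samtools calmd`). It is especially useful when the reference sequence should be reconstructed using only the SAM file.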

# 🧬 Nucleotide Modification Tags (MM/ML)

In **Oxford Nanopore Technologies (ONT)** data, SAM/BAM formats are enhanced to store additional biological information like **base modifications**.

**Base modifications** are stored using two special tags:

- **MM (Modified Bases)**: Lists which bases are modified and where.
- **ML (Modification Likelihoods)**: Lists probabilities of modifications.

---

### 🔎 Example with MM and ML Tags

In [None]:
#read123	0	chr1	100	255	4M	*	0	0	ACTG	IIII	MM:Z:C+m,5,2,1;	ML:B:C,200,180,150

**MM Tag**:  
- `MM:Z:C+m,5,2,1;`
- Meaning:
  - Modified base: Cytosine (`C`)
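  - Strand and modification type: `+` (strand) and `m` (modification code, here 5-methylcytosine)
  - Modified positions, given as the number of unmodified `C`s to skip before each modified one:
    - Skip 5 `C`s: the 6th `C` is modified
    - Skip 2 more `C`s: the 9th `C` is modified
    - Skip 1 more `C`: the 11th `C` is modified
  - The four-base SEQ in the example line is shortened for readability; a real read with these tags would contain at least 11 cytosines.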

**ML Tag**:  
- `ML:B:C,200,180,150`
- Meaning:
  - Probabilities:
    - 200/255 ≈ 78%
    - 180/255 ≈ 70%
    - 150/255 ≈ 59%

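> ⚡ Tip: Probabilities are stored as 8-bit integers (0–255). According to the SAM tags specification, a value N represents the probability interval from N/256 to (N+1)/256, so dividing by 255 or by 256 gives nearly the same estimate.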
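
The skip-count encoding of the MM tag and the scaling of the ML values can be sketched as follows. The helper names and the example sequence (`"CA"` repeated 11 times, so that enough cytosines are present) are hypothetical. The sketch also assumes a forward-strand read and a `+` strand code; reverse-strand alignments need extra handling because SEQ is stored reverse-complemented.

```python
def decode_mm_positions(seq: str, base: str, skips: list) -> list:
    """Translate MM skip counts into 0-based positions of modified bases in seq."""
    base_indices = [i for i, nucleotide in enumerate(seq) if nucleotide == base]
    positions = []
    cursor = 0  # index into base_indices
    for skip in skips:
        cursor += skip  # jump over the unmodified bases
        if cursor >= len(base_indices):
            raise ValueError("MM tag refers to more bases than the sequence contains")
        positions.append(base_indices[cursor])
        cursor += 1  # continue after the modified base
    return positions


def ml_to_probability_range(value: int) -> tuple:
    """Return the probability interval that an ML value (0-255) stands for."""
    return value / 256, (value + 1) / 256


sequence = "CA" * 11  # hypothetical read with cytosines at positions 0, 2, ..., 20
positions = decode_mm_positions(sequence, "C", [5, 2, 1])
print(positions)  # 6th, 9th and 11th cytosine

for position, ml_value in zip(positions, [200, 180, 150]):
    low, high = ml_to_probability_range(ml_value)
    print(f"position {position}: probability between {low:.3f} and {high:.3f}")
```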

# 🛠️ Tools for Handling SAM/BAM Files

| Tool | Description |
|:----|:------------|
| **samtools** | Command-line tool for general operations: sort, index, filter, summarize |
| **pysam** | Python package for fine-grained read manipulation (e.g., analyzing base modifications) |
| **Modkit** | Oxford Nanopore Technologies' command line tool for modification detection and analysis |
| **IGV (Integrative Genomics Viewer)** | GUI to visualize alignments, detect systematic errors like indels, coverage issues, or structural variants |

---

# 📖 Further Documentation

- **SAM Format Specification**:  
  👉 [https://samtools.github.io/hts-specs/SAMv1.pdf](https://samtools.github.io/hts-specs/SAMv1.pdf)

- **Tags like MM and ML (Extended SAM Tags)**:  
  👉 [https://samtools.github.io/hts-specs/SAMtags.pdf](https://samtools.github.io/hts-specs/SAMtags.pdf)
  
- **Samtools**:  
  👉 [https://www.htslib.org/doc/samtools.html](https://www.htslib.org/doc/samtools.html)

- **Pysam**:  
  👉 [https://pysam.readthedocs.io/en/latest/api.html](https://pysam.readthedocs.io/en/latest/api.html)

- **Modkit**:  
  👉 [https://nanoporetech.github.io/modkit/](https://nanoporetech.github.io/modkit/)

- **IGV**:  
  👉 [https://igv.org/doc/desktop/](https://igv.org/doc/desktop/)

--- 

# Cheat sheet samtools functionalities

Samtools is a command-line tool to view, summarize, and manipulate SAM and BAM files.

- **View SAM/BAM file**:
    - 👉 samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...] > [out.bam]
    - 👉 samtools view --help (Open Manual)
    - 👉 Check the manual for possible filter methods. The options -f (keep reads with all given FLAG bits set) and -F (discard reads with any given FLAG bit set) are especially helpful for selecting reads with specific properties.
    - 👉 Use the option -b to write BAM (binary) output, which always includes the header, and -h to keep the header when writing SAM output.
    - 👉 Check [https://broadinstitute.github.io/picard/explain-flags.html](https://broadinstitute.github.io/picard/explain-flags.html) to play with possible filter options. 
    
- **Sort SAM/BAM file**:
    - 👉 samtools sort [options...] [in.bam] > [out.bam]
    - 👉 samtools sort --help (To open Manual)
    - 👉 Many downstream analysis tools require sorted SAM/BAM files. By default, reads are sorted by reference sequence and leftmost alignment coordinate; use -n to sort by read name instead.

- **Index BAM file**:
    - 👉 samtools index [in.bam]
    - 👉 samtools index --help
    - 👉 Indexing requires a coordinate-sorted BAM file. The index is necessary to visualize the alignment with a genome viewer like IGV and to access reads by region.

- **Summarize the alignment file**:
    - 👉 samtools stats [OPTIONS] file.bam
    - 👉 samtools stats --help
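    - 👉 Produces comprehensive summary statistics of the alignment (e.g., read counts, mapping rates, read lengths). To count how many reads align to each chromosome of the reference, use samtools idxstats on an indexed BAM file.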

- **Index fasta files**:
    - 👉 samtools faidx [ref.fasta]
    - 👉 samtools faidx --help
    - 👉 IGV additionally needs indexed reference files in fasta format for visualization. 
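
The commands above are often chained: filter, sort, then index. The following sketch assembles such a chain as argument lists and only prints them by default. The file names (`reads.bam`, the `reads` prefix) and the helper names are hypothetical, and actually running the commands assumes that samtools is installed and on the PATH.

```python
import shlex
import subprocess


def samtools_pipeline(in_bam: str, out_prefix: str, exclude_flags: int = 4) -> list:
    """Build the commands to filter, coordinate-sort and index a BAM file."""
    filtered = f"{out_prefix}.filtered.bam"
    sorted_bam = f"{out_prefix}.sorted.bam"
    return [
        # -b: BAM output, -h: keep header, -F: drop reads with these FLAG bits set
        ["samtools", "view", "-h", "-b", "-F", str(exclude_flags), "-o", filtered, in_bam],
        ["samtools", "sort", "-o", sorted_bam, filtered],
        ["samtools", "index", sorted_bam],
    ]


def run_pipeline(commands: list, dry_run: bool = True) -> None:
    """Print each command; execute it only if dry_run is False."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)


commands = samtools_pipeline("reads.bam", "reads")  # hypothetical file names
run_pipeline(commands)  # dry run: nothing is executed
```

With the default `exclude_flags=4`, unmapped reads are removed before sorting. Passing the commands as lists avoids shell quoting problems with unusual file names.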
    


---
# Cheat sheet pysam
In contrast to samtools, pysam is a Python library for operating on SAM/BAM files. It gives access to each read at single-nucleotide resolution.  
In pysam, reads are accessed by iterating over the file with a for loop. A BAM file must be sorted and indexed before specific regions can be fetched with `fetch()`; iterating over the whole file with `until_eof=True` also works without an index.

In [None]:
import pysam
from pathlib import Path

#Define a path
bamfile_path = Path("/home/stefan/Synology/Data_course_SS2025/filtered_RNA004_UHRR_1_cancer_primary_alignment.bam")

#Initialize the bamfile 
bamfile = pysam.AlignmentFile(bamfile_path, mode="rb")

#Initialize a second BAM/SAM file if you want to write on it


#Run the for loop 
for read in bamfile.fetch(until_eof=True):
    
    #Print the read id 
    print(read.query_name)
    print("\n")
    
    #Access the query sequence of the read
    print(read.query_sequence[0:100]) #First 100 nucleotides only
    print("\n")
    
    #Access the CIGAR string or a tuple version of the CIGAR string
    print(read.cigarstring[0:100]) #First 100 characters only
    print(read.cigartuples[0:100]) #First 100 operations only, as (operation, length) tuples
    print("\n")
    
    #Access alignment pairs between query sequence and reference as (index_query, index_reference) tuples
    print(read.get_aligned_pairs()[0:100]) #First 100 pairs only
    print("\n")
    
    #Access read length
    print(read.query_length)
    print("\n")
    
    #Access available modification information
    print(read.modified_bases)
    print("\n")
    #The output of modified_bases is a dictionary whose keys are composed of three parts: ('Modified nucleotide', strand (0 = forward), 'Modification identifier')
    #Each value in the dictionary is a list of tuples: (position_on_query, modification_probability)
    #The modification probability is stored as a number between 0-255 (8-bit integer), which corresponds to a probability between 0-100%
    #To align all modifications to the reference, one needs to transfer the position on the query to the position on the reference.
    #The latter can be achieved by using the get_aligned_pairs function in combination with the modified_bases variable as shown below. 
    #matches_only=True returns only positions where query and reference are aligned (no None entries, no MD tag required)
    aligned_pairs = read.get_aligned_pairs(matches_only=True)
    alignment_dict = {}
    for index_query, index_reference in aligned_pairs:
        alignment_dict[index_query] = {"index_query":index_query,"index_reference":index_reference}
    #The transfer of the position on the reference can then be executed
    modification_object = read.modified_bases
    if modification_object is not None:
        #.get() returns an empty list if the read carries no m6A calls on the forward strand
        m6a_modifications_on_read = list(modification_object.get(('A', 0, 'a'), []))
        for m6a_modification in m6a_modifications_on_read: #m6a modification is the tuple element (position_on_query, modification_probability) 
            try:
                index_on_reference = alignment_dict[m6a_modification[0]]["index_reference"]
                probability_of_modification = m6a_modification[1] / 256
                print("Reference index: ",index_on_reference,";", "Probability of modification: ",probability_of_modification)
                #We will print only one base on this read for didactic reasons
                break
            except KeyError:
                #The modified base is not aligned to the reference (inserted or soft-clipped)
                continue
    #Explore the pysam documentation for more functions
    break

3b24d9c7-d599-48eb-a730-377a79796d50


GCGGAGCGAGCCGCCGGGAGGATGTGCGCCGAGCGCCCCGAGCCCCGCGCCGCCGCGCTTTGAGGGCCGCGGGCGAGAGGCACCTCCGCCGCCCCGGAAG


1S46M3D168M2I45M1I3M1D23M1D59M1D54M1D130M1I17M3I944M3D195M73242N126M181N116M439N104M1D113M2D37M1D11M
[(4, 1), (0, 46), (2, 3), (0, 168), (1, 2), (0, 45), (1, 1), (0, 3), (2, 1), (0, 23), (2, 1), (0, 59), (2, 1), (0, 54), (2, 1), (0, 130), (1, 1), (0, 17), (1, 3), (0, 944), (2, 3), (0, 195), (3, 73242), (0, 126), (3, 181), (0, 116), (3, 439), (0, 104), (2, 1), (0, 113), (2, 2), (0, 37), (2, 1), (0, 11), (3, 184), (0, 20), (2, 2), (0, 55), (2, 2), (0, 214), (3, 1434), (0, 231), (3, 326), (0, 395), (2, 1), (0, 173), (1, 2), (0, 92), (1, 3), (0, 9), (1, 1), (0, 350), (1, 1), (0, 100), (2, 1), (0, 67), (2, 2), (0, 301), (2, 6), (0, 123), (2, 1), (0, 25), (1, 1), (0, 3), (2, 1), (0, 57), (1, 2), (0, 66), (2, 3), (0, 25), (2, 1), (0, 78), (2, 1), (0, 16), (2, 1), (0, 147), (2, 3), (0, 561), (1, 1), (0, 118), (1, 1), (0, 142), (1, 1), (0, 74), (2, 1), (0
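
To make the transfer from query to reference positions more transparent, the sketch below rebuilds the mapping directly from CIGAR tuples like those printed above, without pysam. The function name, the short CIGAR (`1S3M2D2M1I2M`), the start position and the modification calls are hypothetical; the operation codes (0 = M, 1 = I, 2 = D, 3 = N, 4 = S, 5 = H, 6 = P, 7 = "=", 8 = X) are the ones pysam uses in `cigartuples`.

```python
def query_to_reference(cigartuples: list, reference_start: int) -> dict:
    """Map 0-based query positions to 0-based reference positions for aligned bases."""
    mapping = {}
    query_pos = 0
    reference_pos = reference_start
    for operation, length in cigartuples:
        if operation in (0, 7, 8):      # M, =, X: consume query and reference
            for offset in range(length):
                mapping[query_pos + offset] = reference_pos + offset
            query_pos += length
            reference_pos += length
        elif operation in (1, 4):       # I, S: consume the query only
            query_pos += length
        elif operation in (2, 3):       # D, N: consume the reference only
            reference_pos += length
        # H (5) and P (6) consume neither query nor reference
    return mapping


# Hypothetical alignment 1S3M2D2M1I2M starting at 0-based reference position 99
cigartuples = [(4, 1), (0, 3), (2, 2), (0, 2), (1, 1), (0, 2)]
mapping = query_to_reference(cigartuples, reference_start=99)
print(mapping)

# Hypothetical modification calls as (position_on_query, ML value)
modifications = [(2, 200), (6, 180)]
on_reference = [(mapping[pos], value / 256) for pos, value in modifications if pos in mapping]
print(on_reference)  # the call at query position 6 lies in an insertion and is dropped
```

Soft-clipped and inserted bases have no reference position, which is why the pysam code above catches a `KeyError` for them.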

---
# Cheat sheet modkit
Fortunately, there are tools like Oxford Nanopore Technologies' modkit, which already automate such an extraction. Modkit is written in Rust and reads BAM files through bindings to htslib, the C library that also underlies samtools and pysam, which makes it typically faster than an equivalent Python loop.
With modkit, the extraction of modification information is automated and wrapped into a handy command-line tool. 

