# 🧬 Sequence Alignment Map (SAM) Format Introduction

## 📚 Introduction to the SAM Format

The **SAM format** (Sequence Alignment/Map) is a widely used **text-based format** for storing biological sequence alignment data, especially from **next-generation sequencing (NGS)** experiments.  
It was developed as part of the **SAMtools** project to efficiently handle **large volumes of sequence reads aligned** to a reference genome.

---

## 📂 Structure of a SAM File

Each alignment line in a SAM file represents one alignment record of a read; a single read can produce more than one record (e.g., secondary or supplementary alignments).  
A SAM file consists of two parts:

- **Header Section (optional)**:  
  Lines beginning with `@` that describe metadata like reference sequences (`@SQ`) and program versions (`@PG`).
  
- **Alignment Section**:  
  A table where each row describes one alignment record, with fields separated by **tabs**.
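
The two parts can be told apart by the leading `@` alone. The following is a minimal sketch using only the Python standard library; the function name `split_sam` and the header values in the example are assumptions made for illustration.

```python
def split_sam(text):
    """Split SAM text into header lines (starting with '@') and alignment lines."""
    header, alignments = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("@"):
            header.append(line)
        else:
            alignments.append(line)
    return header, alignments


# Hypothetical header (two lines) followed by one alignment record.
example = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "read123\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACTG\tIIII\n"
)
header, alignments = split_sam(example)
print(len(header), "header lines,", len(alignments), "alignment line")
```

For real files a library such as pysam is usually the better choice; the sketch only illustrates the layout.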

---

## 🔑 Main Fields in a SAM Alignment Entry

The **11 mandatory fields** are:

| Field | Description |
|:-----|:------------|
| QNAME | Query (read) name |
| FLAG | Bitwise flag describing the read status (e.g., paired, mapped, reversed) |
| RNAME | Reference sequence name (e.g., chromosome) |
| POS | 1-based leftmost mapping position of the first aligned base (clipped bases are not counted) |
| MAPQ | Mapping quality score |
| CIGAR | Compact representation of alignment (matches, insertions, deletions) |
| RNEXT | Reference name of the mate/next read |
| PNEXT | Position of the mate/next read |
| TLEN | Observed Template Length |
| SEQ | Sequence of the read |
| QUAL | Base quality scores |
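
Because FLAG is a bitwise field, a single integer encodes several yes/no properties at once. The sketch below decodes it with the bit values defined in the SAM specification; the short names in `FLAG_BITS` and the function `decode_flag` are labels chosen here for illustration, not official identifiers.

```python
# Bit values as defined in the SAM specification; names are illustrative.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "last_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


print(decode_flag(99))  # 99 = 0x40 + 0x20 + 0x2 + 0x1
print(decode_flag(16))  # a read aligned to the reverse strand
```

A FLAG of `0` sets no bits, which describes an unpaired read mapped to the forward strand.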

---

### 🧙‍♀️ Example SAM Line

```text
read123	0	chr1	100	255	4M	*	0	0	ACTG	IIII
```
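
To make the field order concrete, the example record can be split on tabs and mapped onto the 11 mandatory fields. This is a minimal sketch; `SamRecord` and `parse_sam_line` are hypothetical names, and anything after the 11th column is kept as a list of raw optional tags.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str]


def parse_sam_line(line):
    """Parse one tab-separated alignment line into a SamRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),   # 1-based
        mapq=int(fields[4]),  # 255 means the mapping quality is not available
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],
        tags=fields[11:],     # optional TAG:TYPE:VALUE fields, left unparsed
    )


record = parse_sam_line("read123\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACTG\tIIII")
print(record.rname, record.pos, record.cigar)
```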

---

## 📦 SAM vs BAM

- **SAM**: Text-based, human-readable.
- **BAM**: Binary version of SAM. Compressed for efficient storage and fast access.
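
Since BAM is binary, the two formats can be told apart from the first few bytes. The sketch below assumes the layout given in the SAM specification: BAM is stored in BGZF blocks, which are gzip-compatible, and the decompressed stream starts with the magic bytes `BAM\1`. The helper name `looks_like_bam` is hypothetical.

```python
import gzip
import io
import zlib


def looks_like_bam(data: bytes) -> bool:
    """Return True if the bytes decompress as gzip/BGZF and start with the BAM magic."""
    try:
        with gzip.open(io.BytesIO(data), "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError, zlib.error):
        # Not gzip-compressed (e.g., plain-text SAM) or truncated input.
        return False


# Stand-in data for illustration only; a real BAM file has more structure.
fake_bam = gzip.compress(b"BAM\x01" + b"\x00" * 8)
plain_sam = b"read123\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACTG\tIIII\n"
print(looks_like_bam(fake_bam), looks_like_bam(plain_sam))
```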

# 🧵 The CIGAR String

The **CIGAR** string describes how the read aligns with the reference genome:

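- Each operation is written as **length + type** (e.g., `10M`).
- Common operation types:
  - `M`: Alignment match (can be a sequence match or a mismatch)
  - `I`: Insertion to the reference (bases present in the read, not in the reference)
  - `D`: Deletion from the reference (bases present in the reference, not in the read)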
  - `S`: Soft clipping (clipped sequences present in SEQ)
  - `H`: Hard clipping (clipped sequences NOT present in SEQ)
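
The length + type structure is simple enough to tokenize with a regular expression. A minimal sketch follows, assuming the full operation alphabet `MIDNSHP=X` from the SAM specification; `parse_cigar` is a hypothetical helper name.

```python
import re

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":
        return []  # '*' means the CIGAR is unavailable
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_PATTERN.findall(cigar)]


print(parse_cigar("3S10M1I5M"))
```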

---

### ✏️ Example: CIGAR String Explained

```python
# CIGAR: 10M1I5M5D10M

# 10 aligned bases (match or mismatch)
# 1 inserted base (in read, not in reference)
# 5 aligned bases (match or mismatch)
# 5 deleted bases (present in reference, not read)
# 10 aligned bases (match or mismatch)
```
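
A useful consequence of this breakdown is that the read and the reference cover different lengths. The sketch below adds up both, using the rule from the SAM specification that `M`, `I`, `S`, `=`, `X` consume read bases and `M`, `D`, `N`, `=`, `X` consume reference bases; `cigar_lengths` is a hypothetical helper name.

```python
import re

CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def cigar_lengths(cigar):
    """Return (bases consumed in the read, bases consumed on the reference)."""
    query_len = ref_len = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in CONSUMES_QUERY:
            query_len += length
        if op in CONSUMES_REFERENCE:
            ref_len += length
    return query_len, ref_len


query_len, ref_len = cigar_lengths("10M1I5M5D10M")
print(query_len, ref_len)  # 26 read bases, 30 reference bases
```

With a POS of 100, such an alignment would end at reference position 100 + 30 - 1 = 129.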

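> ⚡ Note: **Deleted** reference bases do not appear in the SEQ field. The CIGAR string records only the length of a deletion; the deleted bases themselves can be stored in the optional **MD tag**. The MD tag is not always written by default and may have to be requested from the aligner or added afterwards (e.g., with `samtools calmd`). It is especially useful when the reference over the aligned region should be reconstructed using only the SAM file.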
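
To illustrate what an MD tag carries, the sketch below tokenizes an MD value into runs of matching bases, mismatched reference bases, and deleted reference bases. The value `15^ACGTA10` is a hypothetical MD string for the CIGAR example above (the deleted bases `ACGTA` are invented, and the inserted base does not show up because MD only describes the reference). `parse_md` is a hypothetical helper name.

```python
import re

# A number (matching bases), '^' plus letters (deleted reference bases),
# or a single letter (reference base at a mismatch).
MD_TOKEN = re.compile(r"(\d+)|\^([A-Z]+)|([A-Z])")


def parse_md(md):
    """Turn an MD value into a list of ('match' | 'deletion' | 'mismatch', value) events."""
    events = []
    for match_len, deletion, mismatch in MD_TOKEN.findall(md):
        if match_len:
            if int(match_len) > 0:  # zero-length runs only act as separators
                events.append(("match", int(match_len)))
        elif deletion:
            events.append(("deletion", deletion))
        else:
            events.append(("mismatch", mismatch))
    return events


print(parse_md("15^ACGTA10"))
```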

# 🧬 Nucleotide Modification Tags (MM/ML)

SAM/BAM files can store additional biological information such as **base modifications** in optional tags. This is common for **Oxford Nanopore Technologies (ONT)** data.

**Base modifications** are stored using two special tags:

- **MM (Modified Bases)**: Lists which bases are modified and where.
- **ML (Modification Likelihoods)**: Lists probabilities of modifications.

---

### 🔎 Example with MM and ML Tags

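```text
read123	0	chr1	100	255	4M	*	0	0	ACTG	IIII	MM:Z:C+m,5,2,1;	ML:B:C,200,180,150
```

> ⚡ Note: Optional tags are separated by tabs, like all other fields. The 4-base SEQ is kept short for readability; a real read carrying this MM tag would need at least 11 `C`s.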

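**MM Tag**:  
- `MM:Z:C+m,5,2,1;`
- Meaning:
  - Canonical base: Cytosine (`C`)
  - Strand: `+` (modification observed on the sequenced strand)
  - Modification code: `m` (5-methylcytosine)
  - Modified positions, given as the number of `C`s to skip before each modified `C`:
    - `5`: skip 5 `C`s, so the 6th `C` is modified
    - `2`: skip 2 more `C`s, so the 9th `C` is modified
    - `1`: skip 1 more `C`, so the 11th `C` is modified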
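
The skip counts can be converted into positions within the read. The following is a minimal sketch under simplifying assumptions: a forward-strand read, a single MM entry, and a single-letter canonical base. The helper `decode_mm` and the repetitive example sequence are hypothetical; for real data, a library such as pysam handles strand and multi-entry cases.

```python
def decode_mm(seq, mm_entry):
    """Return 0-based SEQ indices of the modified bases for one MM entry.

    Assumes a forward-strand read and an entry such as 'C+m,5,2,1;'.
    """
    header, *deltas = mm_entry.rstrip(";").split(",")
    base = header[0]  # canonical base, e.g. 'C'
    base_positions = [i for i, b in enumerate(seq) if b == base]

    modified = []
    index = -1  # index into base_positions of the last modified base
    for delta in deltas:
        index += int(delta) + 1  # skip `delta` bases, then take the next one
        modified.append(base_positions[index])
    return modified


# Hypothetical read with 11 cytosines at odd 0-based positions.
seq = "AC" * 11
print(decode_mm(seq, "C+m,5,2,1;"))  # indices of the 6th, 9th and 11th C
```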

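**ML Tag**:  
- `ML:B:C,200,180,150`
- Meaning:
  - `B:C` declares an array of unsigned 8-bit integers, one per modified position listed in MM.
  - Probabilities (a value `N` covers the range `N/256` to `(N+1)/256`):
    - 200 → ≈ 78%
    - 180 → ≈ 70%
    - 150 → ≈ 59%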

> ⚡ Tip: Probabilities are stored as 8-bit integers (0–255).
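
Converting the stored integers back to probabilities is a one-line calculation. The sketch below follows the definition in the SAMtags specification, where an integer `N` stands for the probability range `N/256` to `(N+1)/256`; the function names `parse_ml` and `ml_to_probability_ranges` are hypothetical.

```python
def parse_ml(tag):
    """Parse a raw tag such as 'ML:B:C,200,180,150' into a list of integers."""
    name, type_code, payload = tag.split(":", 2)
    subtype, *numbers = payload.split(",")
    if (name, type_code, subtype) != ("ML", "B", "C"):
        raise ValueError(f"not an ML:B:C tag: {tag!r}")
    return [int(n) for n in numbers]


def ml_to_probability_ranges(values):
    """Map each 8-bit value N to its probability range (N/256, (N+1)/256)."""
    return [(v / 256, (v + 1) / 256) for v in values]


values = parse_ml("ML:B:C,200,180,150")
ranges = ml_to_probability_ranges(values)
for value, (low, high) in zip(values, ranges):
    print(f"{value}: {low:.3f} to {high:.3f}")
```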

# 🛠️ Tools for Handling SAM/BAM Files

| Tool | Description |
|:----|:------------|
| **samtools** | Command-line tool for general operations: sort, index, filter, summarize |
| **pysam** | Python package for fine-grained read manipulation (e.g., analyzing base modifications) |
| **IGV (Integrative Genomics Viewer)** | GUI to visualize alignments, detect systematic errors like indels, coverage issues, or structural variants |

---

# 📖 Further Documentation

- **SAM Format Specification**:  
  👉 [https://samtools.github.io/hts-specs/SAMv1.pdf](https://samtools.github.io/hts-specs/SAMv1.pdf)

- **Tags like MM and ML (Extended SAM Tags)**:  
  👉 [https://samtools.github.io/hts-specs/SAMtags.pdf](https://samtools.github.io/hts-specs/SAMtags.pdf)
  
- **Samtools**:  
  👉 [https://www.htslib.org/doc/samtools.html](https://www.htslib.org/doc/samtools.html)

- **Pysam**:  
  👉 [https://pysam.readthedocs.io/en/latest/api.html](https://pysam.readthedocs.io/en/latest/api.html)

- **IGV**:  
  👉 [https://igv.org/doc/desktop/](https://igv.org/doc/desktop/)