# __Lecture 17:__ working with immune repertoire sequencing data

Today we'll work with IMGT/V-QUEST AIRR-formatted annotations of human antibody heavy chain sequences.

Our goals:

1. Load and inspect AIRR-formatted sequence annotations  
2. Compute per-sequence mutation counts by comparing observed vs germline sequences  
3. Compare mutations in a CDR vs framework region using IMGT boundaries  
4. (if time) Extend mutation density analyses across all regions  

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 0. Loading and inspecting the data

I've annotated 50 human antibody heavy chain sequences using [IMGT/V-QUEST](https://www.imgt.org/IMGT_vquest/analysis).

Today we will explore these annotated sequences and eventually quantify how many mutations occur in different antibody regions.

Let's begin by loading the dataset and looking at a few key annotation fields.

We will focus on:

- `sequence_alignment`: observed nucleotide sequence aligned to a reference  
- `germline_alignment`: inferred germline sequence (assembled from the assigned V, D, and J genes) aligned to the same reference  
- `cdr*_start` / `cdr*_end`: boundaries of the CDR regions (* takes values 1, 2, or 3). These are **1-based indices** referring to the *ungapped* germline numbering scheme.
- `fwr*_start` / `fwr*_end`: boundaries of the framework regions (* takes values 1, 2, 3, or 4)

In [2]:
# Load the AIRR-formatted annotations from IMGT/V-QUEST
df = pd.read_csv("vquest_airr.tsv", sep="\t")

# Look at the first few rows to see what the table looks like
df.head()

Unnamed: 0,sequence_id,sequence,sequence_aa,rev_comp,productive,complete_vdj,vj_in_frame,stop_codon,locus,v_call,...,rearrangement_id,repertoire_id,rearrangement_set_id,sequence_analysis_category,d_number,5prime_trimmed_n_nb,3prime_trimmed_n_nb,insertions,deletions,junction_decryption
0,1,cagatcaccttgaaggagtctggtcctacgctggtgaaacccacac...,,F,T,T,T,F,IGH,Homsap IGHV2-5*02 F,...,,,,1 (noindelsearch),1,0,0,,,(13)0{3}-5(26)0{0}0(21)
1,2,gaggtgcagctggtggagtctgggggaggcttggtccagcctgggg...,,F,T,T,T,F,IGH,Homsap IGHV3-7*01 F,...,,,,1 (noindelsearch),1,0,0,,,(3)-8{4}-10(5)-3{4}-9(8)
2,3,caggtgcagctgcaggagtcgggcccaggactggtgaagccttcgg...,,F,T,T,T,F,IGH,Homsap IGHV4-59*01 F,...,,,,1 (noindelsearch),1,0,0,,,(10)-1{2}-1(11)-4{1}-5(12)
3,4,caggttcagctggtgcagtctggagctgaggtgaagaagcctgggg...,,F,T,T,T,F,IGH,Homsap IGHV1-18*01 F,...,,,,1 (noindelsearch),1,0,0,,,(10)-1{14}-7(11)-13{7}-11(21)
4,5,caggtgcagctggtgcagtctggggctgaggtgaagaagcctgggt...,,F,T,T,T,F,IGH,"Homsap IGHV1-69*01 F, or Homsap IGHV1-69D*01 F",...,,,,1 (noindelsearch),1,0,0,,,(8)-3{4}-1(15)+1{10}-7(25)


### Subsetting to the columns we will use

The IMGT output contains many fields (~125 columns).  
For this tutorial, let's keep only the columns needed for mutation counting and region analysis.

In [3]:
cols = [
    "sequence_id",
    "sequence_alignment",
    "germline_alignment",
    "cdr1_start", "cdr1_end",
    "cdr2_start", "cdr2_end",
    "cdr3_start", "cdr3_end",
    "fwr1_start", "fwr1_end",
    "fwr2_start", "fwr2_end",
    "fwr3_start", "fwr3_end",
    "fwr4_start", "fwr4_end",
]

df_subset = df[cols].copy()

df_subset.head()

Unnamed: 0,sequence_id,sequence_alignment,germline_alignment,cdr1_start,cdr1_end,cdr2_start,cdr2_end,cdr3_start,cdr3_end,fwr1_start,fwr1_end,fwr2_start,fwr2_end,fwr3_start,fwr3_end,fwr4_start,fwr4_end
0,1,cagatcaccttgaaggagtctggtcct...acgctggtgaaaccca...,cagatcaccttgaaggagtctggtcct...acgctggtgaaaccca...,76,105,157,177,292,348,1,75,106,156,178,291,349,381
1,2,gaggtgcagctggtggagtctggggga...ggcttggtccagcctg...,gaggtgcagctggtggagtctggggga...ggcttggtccagcctg...,76,99,151,174,289,306,1,75,100,150,175,288,307,339
2,3,caggtgcagctgcaggagtcgggccca...ggactggtgaagcctt...,caggtgcagctgcaggagtcgggccca...ggactggtgaagcctt...,76,99,151,171,286,315,1,75,100,150,172,285,316,348
3,4,caggttcagctggtgcagtctggagct...gaggtgaagaagcctg...,caggttcagctggtgcagtctggagct...gaggtgaagaagcctg...,76,99,151,174,289,345,1,75,100,150,175,288,346,378
4,5,caggtgcagctggtgcagtctggggct...gaggtgaagaagcctg...,caggtgcagctggtgcagtctggggct...gaggtgaagaagcctg...,76,99,151,174,289,345,1,75,100,150,175,288,346,378
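If a column name is misspelled or absent from a particular V-QUEST export, `df[cols]` raises a `KeyError`. The sketch below shows one way to check for missing columns before subsetting. It uses a tiny made-up TSV string (`toy_tsv`, hypothetical) in place of `vquest_airr.tsv`, and the `wanted` list is shortened for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row TSV standing in for the real V-QUEST export
toy_tsv = (
    "sequence_id\tsequence_alignment\tgermline_alignment\tcdr1_start\tcdr1_end\n"
    "1\tcag...acg\tcag...acg\t4\t6\n"
    "2\tcat...acg\tcag...acg\t4\t6\n"
)
toy_df = pd.read_csv(io.StringIO(toy_tsv), sep="\t")

# Shortened, illustrative column list; "fwr4_end" is deliberately absent from toy_tsv
wanted = ["sequence_id", "sequence_alignment", "germline_alignment",
          "cdr1_start", "cdr1_end", "fwr4_end"]

missing = [c for c in wanted if c not in toy_df.columns]
present = [c for c in wanted if c in toy_df.columns]

print("Missing columns:", missing)

# Subset only to the columns that are actually there
toy_subset = toy_df[present].copy()
```

With the real data, the same check would use `df` and the full `cols` list from the cell above.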


### Inspecting the first sequence

Before writing any code to count mutations, let's look closely at the aligned observed and germline sequences for the first entry.

This will help us understand the data formatting and decide how to compare aligned sequences properly.

In [4]:
first_seq = df_subset.iloc[0]

print("Germline alignment:\n", first_seq["germline_alignment"])
print("\nObserved alignment:\n", first_seq["sequence_alignment"])

Germline alignment:
 cagatcaccttgaaggagtctggtcct...acgctggtgaaacccacacagaccctcacgctgacctgcaccttctctgggttctcactcagc......actagtggagtgggtgtgggctggatccgtcagcccccaggaaaggccctggagtggcttgcactcatttattgggat.........gatgataagcgctacagcccatctctgaag...agcaggctcaccatcaccaaggacacctccaaaaaccaggtggtccttacaatgaccaacatggaccctgtggacacagccacatattactgtgcacacagacnnnactatgatagtagtggttattactacgctgaatacttccagcactggggccagggcaccctggtcaccgtctcctca

Observed alignment:
 cagatcaccttgaaggagtctggtcct...acgctggtgaaacccacacagaccctcacgctgacgtgcaccttctctgggttctcactcacg......actcgtggagtgggtgtgggctgggtccgtcagcccccagggaaggccctggagtggcttgcactcatttattgggat.........gatgatatccgctatagtgaatctatgaag...aacagactcagcattaccaaggacacccccaaaaaccaggtggtcctcacattgaccaacatggaccctgtggacacagccacatattactgtgcacacggacttcactttgatactattggttattactacgctgaatacttccagtactggagccagggcaccctggtcaccgtctcctca


__Question:__ Do you see positions where the nucleotides differ?  

Notice that IMGT uses `"."` characters as **numbering spacers** so that all sequences share a unified coordinate system (IMGT's standardized V-region numbering). These `"."` symbols are **not alignment gaps** and do **not** indicate identity with the germline. Instead, **they are positional placeholders**.

For our purposes today, these `"."` characters would interfere with counting mutations and with using CDR/FWR boundary coordinates. To keep things simple, we will remove all `"."` characters from both the observed and germline alignments.
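Removing the dots only keeps the two strings aligned if the dots sit at the same positions in both. Here is a minimal sketch of that sanity check on toy strings. The helper names `dot_positions` and `strip_dots` are hypothetical, and the toy sequences are not taken from the dataset.

```python
def dot_positions(aligned):
    """Return the set of 0-based indices that hold an IMGT spacer dot."""
    return {i for i, ch in enumerate(aligned) if ch == "."}


def strip_dots(aligned):
    """Remove every spacer dot from an aligned string."""
    return aligned.replace(".", "")


# Toy strings (hypothetical): one spacer block, one nucleotide difference
toy_germline = "cag...acg"
toy_observed = "cat...acg"

# If the dot layout matches, stripping dots preserves position-by-position alignment
same_layout = dot_positions(toy_germline) == dot_positions(toy_observed)
print("Same dot layout:", same_layout)

print(strip_dots(toy_germline), strip_dots(toy_observed))
```

If `same_layout` were ever `False` for a real row, the stripped strings could end up shifted relative to each other, and that row would deserve a closer look before counting mutations.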

In [5]:
# Remove IMGT numbering spacer dots so indices match CDR/FWR boundaries
df_subset["germline_alignment"] = df_subset["germline_alignment"].str.replace(".", "", regex=False)
df_subset["sequence_alignment"] = df_subset["sequence_alignment"].str.replace(".", "", regex=False)

df_subset.head()

Unnamed: 0,sequence_id,sequence_alignment,germline_alignment,cdr1_start,cdr1_end,cdr2_start,cdr2_end,cdr3_start,cdr3_end,fwr1_start,fwr1_end,fwr2_start,fwr2_end,fwr3_start,fwr3_end,fwr4_start,fwr4_end
0,1,cagatcaccttgaaggagtctggtcctacgctggtgaaacccacac...,cagatcaccttgaaggagtctggtcctacgctggtgaaacccacac...,76,105,157,177,292,348,1,75,106,156,178,291,349,381
1,2,gaggtgcagctggtggagtctgggggaggcttggtccagcctgggg...,gaggtgcagctggtggagtctgggggaggcttggtccagcctgggg...,76,99,151,174,289,306,1,75,100,150,175,288,307,339
2,3,caggtgcagctgcaggagtcgggcccaggactggtgaagccttcgg...,caggtgcagctgcaggagtcgggcccaggactggtgaagccttcgg...,76,99,151,171,286,315,1,75,100,150,172,285,316,348
3,4,caggttcagctggtgcagtctggagctgaggtgaagaagcctgggg...,caggttcagctggtgcagtctggagctgaggtgaagaagcctgggg...,76,99,151,174,289,345,1,75,100,150,175,288,346,378
4,5,caggtgcagctggtgcagtctggggctgaggtgaagaagcctgggt...,caggtgcagctggtgcagtctggggctgaggtgaagaagcctgggt...,76,99,151,174,289,345,1,75,100,150,175,288,346,378


With the spacer dots removed, we're ready to start counting mutations.

## 1. Counting total mutations per sequence

To begin, we will count the total number of nucleotide mutations in each sequence.

We will treat each position where the observed nucleotide differs from the germline nucleotide as a mutation.

Because we already removed the `"."` spacer characters from the alignments, the observed and germline sequences now:

- have the same length  
- align position-by-position

In [6]:
def count_mutations(seq_align, germ_align):
    """Count nucleotide mismatches between aligned query and germline sequences."""
    mismatches = 0

    for q, g in zip(seq_align, germ_align):
        if q != g:
            mismatches += 1

    return mismatches

Let's try this function on the first sequence to confirm it works as expected.

In [7]:
first_seq = df_subset.iloc[0]

first_mutations = count_mutations(
    first_seq["sequence_alignment"],
    first_seq["germline_alignment"]
)

first_mutations

29
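One caveat: the germline alignment printed earlier contains a short run of `n` characters near the junction, and `count_mutations` counts each of those positions as a mismatch. Whether that is desirable depends on the question being asked. Below is a sketch of a stricter variant, under the assumption that `n` marks positions with no assignable germline base. The name `count_mutations_skip_n` and its `ignore` parameter are hypothetical additions, not part of the lecture code.

```python
def count_mutations_skip_n(seq_align, germ_align, ignore="n"):
    """Count mismatches, skipping positions where either base is in `ignore`.

    Also raises if the two strings differ in length, because zip() would
    otherwise silently stop at the shorter one.
    """
    if len(seq_align) != len(germ_align):
        raise ValueError("observed and germline alignments differ in length")

    return sum(
        1
        for q, g in zip(seq_align, germ_align)
        if q != g and q not in ignore and g not in ignore
    )


# Toy strings (hypothetical): two masked germline positions, one real mismatch
toy_observed = "acgtacgt"
toy_germline = "acgnnagt"

plain = sum(1 for q, g in zip(toy_observed, toy_germline) if q != g)
strict = count_mutations_skip_n(toy_observed, toy_germline)
print(plain, strict)
```

On the toy pair, the plain count includes the two `n` positions while the stricter count does not, which is the difference to keep in mind when interpreting junction-heavy regions such as CDR3.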

### Exercise 1 (approx. 10 minutes)

**Goal:** Compute the total number of mutations for each sequence and visualize the distribution.

Steps:

1. Loop over all rows in `df_subset` and use `count_mutations()` to compute the mutation count
   *(Hint: use `df_subset.iterrows()`)*
2. Save the result to a new column named `mutation_count`  
3. Use `seaborn.histplot` to visualize the distribution  
4. (Optional) Compute the mean and median number of mutations

Fill in the code cells below.

In [8]:
# Step 1–2: compute total mutation counts

# your code here...

In [9]:
# Step 3: visualize mutation count distribution

# your code here...

In [10]:
# Step 4 (optional): summary statistics

# your code here...
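If you get stuck, here is one possible shape for steps 1, 2, and 4, shown on a hypothetical three-row table (`toy`) rather than on `df_subset`. It follows the `iterrows()` hint and redefines `count_mutations` compactly so the block runs on its own.

```python
import pandas as pd


def count_mutations(seq_align, germ_align):
    """Count positions where the observed and germline bases differ."""
    return sum(1 for q, g in zip(seq_align, germ_align) if q != g)


# Hypothetical toy table with the same column names as df_subset
toy = pd.DataFrame({
    "sequence_id": [1, 2, 3],
    "sequence_alignment": ["acgtac", "acgtaa", "ttgtac"],
    "germline_alignment": ["acgtac", "acgtac", "acgtac"],
})

# Steps 1-2: loop over rows, collect counts, store them in a new column
counts = []
for _, row in toy.iterrows():
    counts.append(
        count_mutations(row["sequence_alignment"], row["germline_alignment"])
    )
toy["mutation_count"] = counts

# Step 4: summary statistics
mean_count = toy["mutation_count"].mean()
median_count = toy["mutation_count"].median()
print(mean_count, median_count)
```

For step 3, a call along the lines of `sns.histplot(data=df_subset, x="mutation_count")` should draw the distribution once the column exists.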

## 2. Mutations in CDR vs framework regions

Mutations introduced by somatic hypermutation (SHM) tend to accumulate more densely in the **complementarity-determining regions (CDRs)**, which contact antigen, than in the **framework regions (FWRs)**.

IMGT provides **1-based** start and end positions for each region relative to the ungapped germline numbering:

- `cdr1_start`, `cdr1_end`, `cdr2_start`, `cdr2_end`, `cdr3_start`, `cdr3_end`  
- `fwr1_start`, `fwr1_end`, `fwr2_start`, `fwr2_end`, `fwr3_start`, `fwr3_end`, `fwr4_start`, `fwr4_end`  
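Because these coordinates are 1-based and inclusive while Python slices are 0-based and end-exclusive, an off-by-one error is easy to make. A small sketch of the conversion on a toy string follows. The helper name `imgt_slice` is hypothetical.

```python
def imgt_slice(seq, start, end):
    """Slice `seq` using 1-based, inclusive start/end coordinates.

    int() also handles coordinates that pandas has stored as floats.
    """
    return seq[int(start) - 1 : int(end)]


toy = "abcdefghij"

# Positions 3 through 6 (1-based, inclusive) are the letters c, d, e, f
region = imgt_slice(toy, 3, 6)
print(region)

# The region length agrees with end - start + 1
print(len(region) == 6 - 3 + 1)
```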

### Exercise 2 (approx. 10–15 minutes)

**Goal:** Compare mutation densities in CDR3 vs FWR3.

Perform the following steps:

1. Write a function to count mutations in a specific region (characterized by a start and end index). Hint: adapt the `count_mutations` function above.  
2. Use your function to compute the number of mutations in CDR3 and FWR3 for each sequence.  
3. Compute the region lengths:  
   - `cdr3_len = cdr3_end - cdr3_start + 1`  
   - `fwr3_len = fwr3_end - fwr3_start + 1`  
4. Compute mutation densities:  
   - `cdr3_density = cdr3_mut / cdr3_len`  
   - `fwr3_density = fwr3_mut / fwr3_len`  
5. Use seaborn to compare the two distributions (e.g., boxplot or violin plot).  
6. Answer: Which region is more mutated on average?

Fill in the scaffold below.

In [None]:
# Step 1: Write a function to count mutations in a specific region
def count_region_muts(seq_align, germ_align, start, end):
    """
    Count mismatches between seq and germ within the region defined by
    the IMGT start/end coordinates.

    IMGT boundaries are 1-based, inclusive.
    Python slicing uses 0-based indexing and excludes the end index.

    So for IMGT start/end:
        Python slice = seq[start-1 : end]
    """
    # your code here ...
    # Hint:
    # start_idx = int(start) - 1
    # end_idx = int(end)
    # seq_region = seq_align[start_idx:end_idx]
    # germ_region = germ_align[start_idx:end_idx]
    # then loop over seq_region and germ_region and count mismatches

    return None

In [None]:
# Step 2: compute cdr3_mut and fwr3_mut for each sequence

# your code here...


In [None]:
# Step 3: compute region lengths

# your code here...


In [None]:
# Step 4: compute mutation densities

# your code here...

In [None]:
# Step 5: compare mutation density distributions using seaborn

# your code here...
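For reference, here is one way steps 1 through 4 could fit together, sketched on a single hypothetical row (`toy`) with made-up region boundaries. It is not the only valid solution, and the column names simply mirror the ones suggested in the exercise.

```python
import pandas as pd


def count_region_muts(seq_align, germ_align, start, end):
    """Count mismatches inside a 1-based, inclusive region."""
    start_idx = int(start) - 1
    end_idx = int(end)
    seq_region = seq_align[start_idx:end_idx]
    germ_region = germ_align[start_idx:end_idx]
    return sum(1 for q, g in zip(seq_region, germ_region) if q != g)


# Hypothetical toy row: positions 1-8 play the role of FWR3, 9-12 of CDR3
toy = pd.DataFrame({
    "sequence_alignment": ["aaatccccgtcg"],
    "germline_alignment": ["aaaaccccgggg"],
    "fwr3_start": [1], "fwr3_end": [8],
    "cdr3_start": [9], "cdr3_end": [12],
})

for region in ["fwr3", "cdr3"]:
    # Step 2: mutation count per row for this region
    toy[f"{region}_mut"] = toy.apply(
        lambda row: count_region_muts(
            row["sequence_alignment"], row["germline_alignment"],
            row[f"{region}_start"], row[f"{region}_end"],
        ),
        axis=1,
    )
    # Step 3: region length (inclusive coordinates, hence the + 1)
    toy[f"{region}_len"] = toy[f"{region}_end"] - toy[f"{region}_start"] + 1
    # Step 4: mutation density
    toy[f"{region}_density"] = toy[f"{region}_mut"] / toy[f"{region}_len"]

print(toy[["fwr3_density", "cdr3_density"]])
```

For step 5, one option is to reshape the two density columns into long form with `DataFrame.melt` and pass the result to `sns.boxplot`. Keep in mind that any `n` characters in the germline junction are counted as CDR3 mismatches by this simple comparison.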


### Question

Based on your plot:

- Which region (CDR3 or FWR3) has higher mutation density?  
- Does this match what you expect biologically, given that CDRs directly contact antigen?

## 3. (Optional) Exercise 3: extend to all CDR and framework regions

If you have extra time:

Use the dictionary below to compute mutation density for **all** antibody regions and plot their average mutation densities.

This will help visualize the typical pattern of SHM:  
> CDRs highly mutated, FWRs more conserved.

In [16]:
regions = {
    "FWR1": ("fwr1_start", "fwr1_end"),
    "CDR1": ("cdr1_start", "cdr1_end"),
    "FWR2": ("fwr2_start", "fwr2_end"),
    "CDR2": ("cdr2_start", "cdr2_end"),
    "FWR3": ("fwr3_start", "fwr3_end"),
    "CDR3": ("cdr3_start", "cdr3_end"),
    "FWR4": ("fwr4_start", "fwr4_end"),
}

# your code here ...
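A rough sketch of how a dictionary like this can drive the loop is shown below. To keep it short and runnable without the real file, it uses plain Python records and a reduced two-region dictionary (`toy_regions`, `records`, and `mean_density` are all hypothetical names). With `df_subset`, the same idea applies row by row.

```python
# Reduced, hypothetical version of the regions dictionary
toy_regions = {
    "FWR1": ("fwr1_start", "fwr1_end"),
    "CDR1": ("cdr1_start", "cdr1_end"),
}

# Hypothetical records: positions 1-6 act as FWR1, 7-10 as CDR1
records = [
    {"sequence_alignment": "aaaaaaccgc", "germline_alignment": "aaaaaacccc",
     "fwr1_start": 1, "fwr1_end": 6, "cdr1_start": 7, "cdr1_end": 10},
    {"sequence_alignment": "ataaaactcc", "germline_alignment": "aaaaaacccc",
     "fwr1_start": 1, "fwr1_end": 6, "cdr1_start": 7, "cdr1_end": 10},
]


def region_density(rec, start_col, end_col):
    """Mismatches per nucleotide inside a 1-based, inclusive region."""
    start, end = int(rec[start_col]), int(rec[end_col])
    seq_region = rec["sequence_alignment"][start - 1 : end]
    germ_region = rec["germline_alignment"][start - 1 : end]
    muts = sum(1 for q, g in zip(seq_region, germ_region) if q != g)
    return muts / (end - start + 1)


# Average density per region across all records
mean_density = {}
for name, (start_col, end_col) in toy_regions.items():
    densities = [region_density(rec, start_col, end_col) for rec in records]
    mean_density[name] = sum(densities) / len(densities)

print(mean_density)
```

The resulting dictionary maps each region name to one number, which is a convenient shape for a bar plot of average densities.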

## 4. (Very optional extra challenge)

__Challenge__: Explore whether certain *sequence contexts* are more frequently mutated than others.

Steps:

1. For each mutated position, extract the **3-mer** centered at that position from the *germline* sequence  
2. Count occurrences of each 3-mer at mutated sites  
3. Which motifs are the most common?

This relates to SHM "hotspot" motifs (e.g., WRC/GYW).

In [17]:
# your code here...
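As a starting point, the steps above can be sketched with `collections.Counter`. The function name `mutated_3mers` is hypothetical. The sketch also makes two choices the challenge leaves open: it skips the first and last positions (which have no full 3-mer), and it skips any context containing `n`.

```python
from collections import Counter


def mutated_3mers(seq_align, germ_align):
    """Count germline 3-mers centered on each mutated position."""
    contexts = Counter()
    # Start at 1 and stop one short of the end so every site has both neighbours
    for i in range(1, len(germ_align) - 1):
        if seq_align[i] == germ_align[i]:
            continue
        kmer = germ_align[i - 1 : i + 2]
        if "n" in kmer:
            continue  # assumption: contexts with masked bases are not informative
        contexts[kmer] += 1
    return contexts


# Toy strings (hypothetical): two mutations, both in a "gct" germline context
toy_germline = "agctagct"
toy_observed = "agttagat"

toy_counts = mutated_3mers(toy_observed, toy_germline)
print(toy_counts.most_common(3))
```

To pool across the dataset, the per-sequence counters can be added together, since `Counter` objects support `+` and `update()`.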

## Wrap-up

In this tutorial, you:

- Loaded and inspected IMGT/V-QUEST AIRR-formatted data  
- Counted total mutations per sequence  
- Used IMGT region boundaries to compute mutation densities  
- Compared mutation densities between CDR and FWR regions  
- (Optionally) extended these analyses across all regions  
- (Optionally) explored sequence-context motifs associated with SHM  

These analyses form the foundation for understanding how somatic hypermutation shapes B cell receptor sequences during affinity maturation.