# AF3 Antibody-Antigen Complex Input Preparation

This notebook generates AlphaFold 3 JSON input files for modeling antibody-antigen binding complexes.

**Critical Requirement**: Each JSON file must contain **ALL THREE chains together**:
- Heavy chain (H) - from antibody data
- Light chain (L) - from antibody data  
- Antigen (Ag) - from immunogen sequences

This allows AF3 to predict the complete antibody-antigen complex structure and binding interface.

## Workflow Overview

1. Load and validate input data
2. Map and standardize immunogen names
3. Parse immunogen sequences
4. Generate test set (top 10 binders per immunogen)
5. Create AF3 JSON generator function
6. Generate JSON files for test set
7. Generate data provenance report

## 1. Load and Validate Input Data

**Data Sources:**
- `data/cleaned_data.csv` - Antibody sequences and binding data
- `data/immunogens.fasta` - Antigen sequences

**Validation Checkpoint**: Review loaded data before proceeding.

In [1]:
import json
from datetime import datetime
from pathlib import Path

import pandas as pd

# Track data provenance
provenance = {
    "timestamp": datetime.now().isoformat(),
    "input_files": {},
    "transformations": [],
    "outputs": {}
}

# Load antibody data
antibody_data_path = Path("data/cleaned_data.csv")
print(f"Loading antibody data from: {antibody_data_path}")
print(f"File exists: {antibody_data_path.exists()}")
print(f"File size: {antibody_data_path.stat().st_size / 1024:.2f} KB")
print(f"Last modified: {datetime.fromtimestamp(antibody_data_path.stat().st_mtime)}")

df = pd.read_csv(antibody_data_path)
provenance["input_files"]["cleaned_data.csv"] = {
    "path": str(antibody_data_path.absolute()),
    "size_bytes": antibody_data_path.stat().st_size,
    "modified": datetime.fromtimestamp(antibody_data_path.stat().st_mtime).isoformat(),
    "rows": len(df),
    "columns": len(df.columns)
}

print(f"\n✓ Loaded {len(df)} rows, {len(df.columns)} columns")
print(f"\nColumn names:")
print(df.columns.tolist())

Loading antibody data from: data/cleaned_data.csv
File exists: True
File size: 761.01 KB
Last modified: 2026-01-13 06:46:25.905068

✓ Loaded 606 rows, 138 columns

Column names:
['Unnamed: 0', 'Sample', 'H_N', 'H_N_IF', 'K_N', 'K_N_IF', 'L_N', 'L_N_IF', 'ONE_LIGHT_CHAIN', 'ONE_LIGHT_CHAIN_IF', 'SEQ_NUMBER_H', 'NUMBER_READS_FOR_CONSENSUS_H', 'INFRAME_H', 'FUNCTIONAL_H', 'PRODUCTIVE_H', 'PARTIAL_V_H', 'PARTIAL_J_H', 'VGENE_H', 'DGENE_H', 'DGENE_RF_H', 'JGENE_H', 'CDR3_AA_H', 'CDR3LENGTH_AA_H', 'VMUFREQ_H', 'UNIQUE_GROUPS_H', 'GROUP_COUNTS_H', 'VDJ_NT_H', 'VDJ_AA_H', 'VDJ_NUMNS_H', 'ORIGINAL_ID_H', 'VALID_H', 'SMUA_H', 'PROBVALLELE_H', 'PROBDALLELE_H', 'PROBJALLELE_H', 'RV_H', 'RD1_H', 'RD2_H', 'RJ_H', 'CDR3_H', 'CDR3LENGTH_H', 'NVDNS_H', 'NDJNS_H', 'AA@CYS1_H', 'AA@CYS2_H', 'AA@JINV_H', 'NVBASES_H', 'NVSUBSTITUTIONS_H', 'NVINSERTIONS_H', 'NVDELETIONS_H', 'NUMBER_DROPPED_LT10_H', 'SEQ_NUMBER_K', 'NUMBER_READS_FOR_CONSENSUS_K', 'INFRAME_K', 'FUNCTIONAL_K', 'PRODUCTIVE_K', 'PARTIAL_V_K', 'P

In [2]:
# Display sample of data
print("Sample of antibody data (first 3 rows, key columns):")
key_cols = ["Sample", "VDJ_AA_H", "VJ_AA_K"]
if all(col in df.columns for col in key_cols):
    display_cols = key_cols + [col for col in df.columns if any(x in col for x in ["1JPL", "HK", "Hong Kong"])]
    print(df[display_cols].head(3))
else:
    print("Warning: Some expected columns not found")
    print(f"Available columns containing 'AA': {[c for c in df.columns if 'AA' in c]}")

Sample of antibody data (first 3 rows, key columns):
  Sample                                           VDJ_AA_H  \
0  P10A1  EVQLVESGGDLVKPGGSLKLSCAASGFTFSTYGMSWVRQTPDKRLE...   
1  P10A9  QVQLQQPGAELVMPGASVKLSCKASGYTFTSYWMHWVKQRPGQGLE...   
2  P10B8  DVQLQESGPGLVKPSQFLSLTCSVTGYSITSGYYWNWIRQFPGNKL...   

                                             VJ_AA_K  1JPLm414_C_G1  \
0  DVLMTQTPLSLPVSLGDQASISCRSSQNIVHSNGNTYLQWYLQKPG...       19.30430   
1  DIQMTQTTSSLSASLGDRVTISCRASQDISNYLNWYQQKPDGTIKL...       13.64316   
2  DIVMTQSHKFMSTSVGDRVSITCKASQDVVTAVAWYQQKPGQSPEL...      823.81270   

   1JPLm414_T_G3_PAPRE  1JPL_4I_Avi  1JPL_WT_Avi  A/Hong Kong/1/1968  \
0          61101.92000      4.92734     11.94316            15.01113   
1              5.35732     10.03398     27.38594            14.69570   
2            197.29960      7.75127      7.38604            17.50137   

   HK68head avi bio  
0           6.88467  
1          13.60332  
2           5.13320  


In [3]:
# Load immunogen sequences
immunogen_fasta_path = Path("data/immunogens.fasta")
print(f"\nLoading immunogen sequences from: {immunogen_fasta_path}")
print(f"File exists: {immunogen_fasta_path.exists()}")
print(f"File size: {immunogen_fasta_path.stat().st_size / 1024:.2f} KB")
print(f"Last modified: {datetime.fromtimestamp(immunogen_fasta_path.stat().st_mtime)}")

# Parse FASTA file
immunogen_headers = []
immunogen_sequences_raw = {}

with open(immunogen_fasta_path, 'r') as f:
    current_header = None
    current_sequence = []
    
    for line in f:
        line = line.strip()
        if line.startswith('>'):
            if current_header is not None:
                immunogen_sequences_raw[current_header] = ''.join(current_sequence)
            current_header = line[1:]  # Remove '>'
            immunogen_headers.append(current_header)
            current_sequence = []
        else:
            if line:
                current_sequence.append(line)
    
    # Don't forget the last sequence
    if current_header is not None:
        immunogen_sequences_raw[current_header] = ''.join(current_sequence)

provenance["input_files"]["immunogens.fasta"] = {
    "path": str(immunogen_fasta_path.absolute()),
    "size_bytes": immunogen_fasta_path.stat().st_size,
    "modified": datetime.fromtimestamp(immunogen_fasta_path.stat().st_mtime).isoformat(),
    "num_sequences": len(immunogen_sequences_raw)
}

print(f"\n✓ Found {len(immunogen_sequences_raw)} immunogen sequences")
print(f"\nFASTA headers:")
for header in immunogen_headers:
    seq_len = len(immunogen_sequences_raw[header])
    print(f"  - {header} (length: {seq_len} aa)")


Loading immunogen sequences from: data/immunogens.fasta
File exists: True
File size: 1.26 KB
Last modified: 2026-01-26 16:17:35.367452

✓ Found 6 immunogen sequences

FASTA headers:
  - 1JPL-m4i4-C-G1 (length: 167 aa)
  - 1JPL-m4i4-T-G3 (length: 167 aa)
  - 1JPL-4i (length: 161 aa)
  - 1JPL WT (length: 166 aa)
  - H3/Johannesburg/94_avibio (length: 267 aa)
  - HK68head_WT (length: 267 aa)


**Validation Checkpoint**: Please review the loaded data above:
- ✓ Antibody data loaded with correct number of rows
- ✓ Sequence columns (VDJ_AA_H, VJ_AA_K) present
- ✓ Binding data columns present
- ✓ Immunogen sequences parsed correctly

If everything looks correct, proceed to the next section.
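
The FASTA parser above stores sequences in a dictionary keyed by header, so a repeated header would silently overwrite the earlier sequence. The following is a minimal sketch of a pre-check for that case. The helper name `find_duplicate_fasta_headers` and the toy records are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter
import io


def find_duplicate_fasta_headers(handle):
    """Return headers that occur more than once in a FASTA stream."""
    counts = Counter(
        line.strip()[1:] for line in handle if line.startswith(">")
    )
    return sorted(header for header, n in counts.items() if n > 1)


# Toy FASTA with a repeated header (made-up sequences, not real immunogens).
toy_fasta = io.StringIO(">antigen_A\nMKT\n>antigen_B\nGSS\n>antigen_A\nMKV\n")
duplicates = find_duplicate_fasta_headers(toy_fasta)
print(duplicates)
```

On the real input, the same helper could be called as `find_duplicate_fasta_headers(open(immunogen_fasta_path))`; an empty list means every header is unique.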

## 2. Map and Standardize Immunogen Names

**Goal**: Create a consistent mapping between:
- Binding data column names (e.g., `1JPLm414_C_G1`)
- FASTA headers (e.g., `1JPL-m4i4-C-G1`)
- Standardized names for use in JSON files

**Validation Checkpoint**: Review the mapping table and confirm it's correct before proceeding.

In [4]:
# Identify binding columns in the dataframe
binding_columns = [col for col in df.columns if any(x in col for x in ["1JPL", "HK", "Hong Kong", "Johannesberg"])]
print("Binding data columns found:")
for col in binding_columns:
    print(f"  - {col}")

print(f"\nFASTA headers from immunogen file:")
for header in immunogen_headers:
    print(f"  - {header}")

Binding data columns found:
  - 1JPLm414_C_G1
  - 1JPLm414_T_G3_PAPRE
  - 1JPL_4I_Avi
  - 1JPL_WT_Avi
  - A/Hong Kong/1/1968
  - HK68head avi bio
  - H3/Johannesberg/94 avi bio

FASTA headers from immunogen file:
  - 1JPL-m4i4-C-G1
  - 1JPL-m4i4-T-G3
  - 1JPL-4i
  - 1JPL WT
  - H3/Johannesburg/94_avibio
  - HK68head_WT


In [5]:
# Create mapping dictionary
# This maps: binding_column_name -> {standardized_name, fasta_header, notes}

immunogen_mapping = {
    "1JPLm414_C_G1": {
        "standardized_name": "1JPLm414_C_G1",
        "fasta_header": "1JPL-m4i4-C-G1",
        "notes": "C variant"
    },
    "1JPLm414_T_G3_PAPRE": {
        "standardized_name": "1JPLm414_T_G3",
        "fasta_header": "1JPL-m4i4-T-G3",
        "notes": "T variant"
    },
    "1JPL_4I_Avi": {
        "standardized_name": "1JPL_4I",
        "fasta_header": "1JPL-4i",
        "notes": "4i variant"
    },
    "1JPL_WT_Avi": {
        "standardized_name": "1JPL_WT",
        "fasta_header": "1JPL WT",
        "notes": "Wild type"
    },
    "A/Hong Kong/1/1968": {
        "standardized_name": "HK68",
        "fasta_header": "HK68head_WT",
        "notes": "Hong Kong 1968 head; same HA sequence as 'HK68head avi bio', but only one of the two constructs is trimerized"
    },
    "HK68head avi bio": {
        "standardized_name": "HK68",
        "fasta_header": "HK68head_WT",
        "notes": "Hong Kong 1968 head"
    },
    "H3/Johannesberg/94 avi bio": {
        "standardized_name": "H3_JHB_94",
        "fasta_header": "H3/Johannesburg/94_avibio",
        "notes": "H3 Johannesburg 1994"
    }
}

# Display mapping table
print("Immunogen Name Mapping:")
print("=" * 80)
print(f"{'Binding Column':<30} {'Standardized Name':<20} {'FASTA Header':<30}")
print("=" * 80)
for binding_col, mapping_info in immunogen_mapping.items():
    print(f"{binding_col:<30} {mapping_info['standardized_name']:<20} {mapping_info['fasta_header']:<30}")

provenance["transformations"].append({
    "step": "name_mapping",
    "description": "Created mapping between binding columns, standardized names, and FASTA headers",
    "mapping": immunogen_mapping
})

Immunogen Name Mapping:
Binding Column                 Standardized Name    FASTA Header                  
1JPLm414_C_G1                  1JPLm414_C_G1        1JPL-m4i4-C-G1                
1JPLm414_T_G3_PAPRE            1JPLm414_T_G3        1JPL-m4i4-T-G3                
1JPL_4I_Avi                    1JPL_4I              1JPL-4i                       
1JPL_WT_Avi                    1JPL_WT              1JPL WT                       
A/Hong Kong/1/1968             HK68                 HK68head_WT                   
HK68head avi bio               HK68                 HK68head_WT                   
H3/Johannesberg/94 avi bio     H3_JHB_94            H3/Johannesburg/94_avibio     


In [6]:
# Verify all FASTA headers are mapped
print("\nVerifying FASTA header coverage:")
for header in immunogen_headers:
    found = False
    for binding_col, mapping_info in immunogen_mapping.items():
        if mapping_info["fasta_header"] == header:
            found = True
            break
    status = "✓" if found else "✗ MISSING"
    print(f"  {status} {header}")

# Verify binding columns exist in dataframe
print("\nVerifying binding columns exist in dataframe:")
for binding_col in immunogen_mapping.keys():
    exists = binding_col in df.columns
    status = "✓" if exists else "✗ NOT FOUND"
    print(f"  {status} {binding_col}")


Verifying FASTA header coverage:
  ✓ 1JPL-m4i4-C-G1
  ✓ 1JPL-m4i4-T-G3
  ✓ 1JPL-4i
  ✓ 1JPL WT
  ✓ H3/Johannesburg/94_avibio
  ✓ HK68head_WT

Verifying binding columns exist in dataframe:
  ✓ 1JPLm414_C_G1
  ✓ 1JPLm414_T_G3_PAPRE
  ✓ 1JPL_4I_Avi
  ✓ 1JPL_WT_Avi
  ✓ A/Hong Kong/1/1968
  ✓ HK68head avi bio
  ✓ H3/Johannesberg/94 avi bio


**Validation Checkpoint**: Please review the mapping table above:
- ✓ All FASTA headers are mapped
- ✓ All binding columns exist in the dataframe
- ✓ Standardized names are consistent

If the mapping looks correct, proceed to the next section. If corrections are needed, update the `immunogen_mapping` dictionary above.
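
The mapping above was written by hand because the binding columns and FASTA headers differ in punctuation and spelling (for example `m414` vs. `m4i4`, `Johannesberg` vs. `Johannesburg`). As a cross-check, a fuzzy string match can propose candidate headers for each column. This is a sketch only: `normalize` and `suggest_fasta_headers` are hypothetical helpers, and the `cutoff` of 0.6 is simply the `difflib` default rather than a tuned value.

```python
import difflib
import re


def normalize(name):
    """Lowercase and drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def suggest_fasta_headers(binding_column, fasta_headers, n=3, cutoff=0.6):
    """Return up to n FASTA headers ranked by similarity to a binding column."""
    by_normalized = {normalize(h): h for h in fasta_headers}
    hits = difflib.get_close_matches(
        normalize(binding_column), list(by_normalized), n=n, cutoff=cutoff
    )
    return [by_normalized[h] for h in hits]


# Headers copied from the FASTA listing above.
fasta_headers = [
    "1JPL-m4i4-C-G1", "1JPL-m4i4-T-G3", "1JPL-4i",
    "1JPL WT", "H3/Johannesburg/94_avibio", "HK68head_WT",
]
for column in ["1JPLm414_C_G1", "H3/Johannesberg/94 avi bio", "A/Hong Kong/1/1968"]:
    print(f"{column!r} -> {suggest_fasta_headers(column, fasta_headers)}")
```

Suggestions like these are only a prompt for review. A name such as `A/Hong Kong/1/1968` shares few characters with `HK68head_WT` and may receive no suggestion at all, which is why the manual mapping remains the source of truth.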

## 3. Parse Immunogen Sequences

**Goal**: Create a dictionary of immunogen sequences keyed by standardized names for easy lookup.

**Validation Checkpoint**: Verify all sequences parsed correctly with correct lengths.

In [7]:
# Create dictionary: standardized_name -> sequence
immunogen_sequences = {}

for binding_col, mapping_info in immunogen_mapping.items():
    standardized_name = mapping_info["standardized_name"]
    fasta_header = mapping_info["fasta_header"]
    
    if fasta_header in immunogen_sequences_raw:
        immunogen_sequences[standardized_name] = immunogen_sequences_raw[fasta_header]
    else:
        print(f"WARNING: FASTA header '{fasta_header}' not found in parsed sequences")

# Display parsed sequences
print("Parsed Immunogen Sequences:")
print("=" * 100)
for std_name, seq in immunogen_sequences.items():
    seq_len = len(seq)
    first_20 = seq[:20]
    last_20 = seq[-20:] if len(seq) > 20 else seq
    print(f"\n{std_name}:")
    print(f"  Length: {seq_len} amino acids")
    print(f"  First 20 aa: {first_20}")
    print(f"  Last 20 aa: {last_20}")

provenance["transformations"].append({
    "step": "parse_immunogens",
    "description": "Parsed immunogen sequences from FASTA and created standardized dictionary",
    "num_sequences": len(immunogen_sequences),
    "sequences": {name: {"length": len(seq)} for name, seq in immunogen_sequences.items()}
})

Parsed Immunogen Sequences:

1JPLm414_C_G1:
  Length: 167 amino acids
  First 20 aa: MSGPEGQALREALEKAIDPQ
  Last 20 aa: GIIKEEPEIPTNETLEELGP

1JPLm414_T_G3:
  Length: 167 amino acids
  First 20 aa: MSGPEGQALREALEKAIDPV
  Last 20 aa: GIIKEEPEIPTNETLEELGP

1JPL_4I:
  Length: 161 amino acids
  First 20 aa: MAEAEGESLESWLNSATNPS
  Last 20 aa: MLKRQGIVQSDPPIPVDRTL

1JPL_WT:
  Length: 166 amino acids
  First 20 aa: MAEAEGESLESWLNKATNPS
  Last 20 aa: DPPIPVDRTLIPSPPPRPKN

HK68:
  Length: 267 amino acids
  First 20 aa: VQSSSTGKICNNPHRILDGI
  Last 20 aa: NDKPFQNVNKITYGACPKYV

H3_JHB_94:
  Length: 267 amino acids
  First 20 aa: VQSSPTGRICDSPHRILDGK
  Last 20 aa: NDKPFQNVNRITYGACPRYV


In [8]:
# Validate sequences (check for invalid amino acids)
valid_aa = set("ACDEFGHIKLMNPQRSTVWY")
print("\nSequence validation:")
for std_name, seq in immunogen_sequences.items():
    invalid_chars = set(seq) - valid_aa
    if invalid_chars:
        print(f"  ✗ {std_name}: Invalid characters found: {invalid_chars}")
    else:
        print(f"  ✓ {std_name}: Valid sequence ({len(seq)} aa)")


Sequence validation:
  ✓ 1JPLm414_C_G1: Valid sequence (167 aa)
  ✓ 1JPLm414_T_G3: Valid sequence (167 aa)
  ✓ 1JPL_4I: Valid sequence (161 aa)
  ✓ 1JPL_WT: Valid sequence (166 aa)
  ✓ HK68: Valid sequence (267 aa)
  ✓ H3_JHB_94: Valid sequence (267 aa)


**Validation Checkpoint**: Please verify:
- ✓ All sequences parsed correctly
- ✓ Sequence lengths are reasonable
- ✓ No invalid amino acid characters
- ✓ Sequences match expected immunogens

If everything looks correct, proceed to the next section.
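
One way to confirm that closely related variants are the sequences you expect is to list the positions where they differ. The sketch below compares two equal-length strings residue by residue; `point_differences` is a hypothetical helper, and the inputs are only the first 20 residues printed above, not the full 167-residue sequences.

```python
def point_differences(seq_a, seq_b):
    """List (position, residue_a, residue_b) for equal-length sequences (1-based)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Sequences differ in length; align them first")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


# First 20 residues of the two m4i4 variants, copied from the output above.
c_g1_start = "MSGPEGQALREALEKAIDPQ"
t_g3_start = "MSGPEGQALREALEKAIDPV"
differences = point_differences(c_g1_start, t_g3_start)
print(differences)  # [(20, 'Q', 'V')]
```

Running the same comparison on `immunogen_sequences["1JPLm414_C_G1"]` and `immunogen_sequences["1JPLm414_T_G3"]` would cover the full sequences. Variants of different length (such as `1JPL_4I` and `1JPL_WT`) need an alignment first.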

## 4. Generate Test Set (Top 10 Binders per Immunogen)

**Goal**: Identify the top 10 antibodies with highest binding signal for each immunogen to create a manageable test set.

**Validation Checkpoint**: Review the selected test set before generating JSON files.
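
Two binding columns in the Section 2 mapping share the standardized name `HK68`. The selection cell that follows keeps only the first binding column it encounters for each standardized name, so it helps to list such shared names before relying on the result. A minimal sketch, where `columns_by_standardized_name` is a hypothetical helper and `example_mapping` is a reduced copy of three entries from `immunogen_mapping`:

```python
from collections import defaultdict


def columns_by_standardized_name(mapping):
    """Group binding columns that share a standardized name."""
    groups = defaultdict(list)
    for binding_col, info in mapping.items():
        groups[info["standardized_name"]].append(binding_col)
    return dict(groups)


# Reduced copy of three entries from immunogen_mapping (Section 2).
example_mapping = {
    "A/Hong Kong/1/1968": {"standardized_name": "HK68"},
    "HK68head avi bio": {"standardized_name": "HK68"},
    "1JPL_WT_Avi": {"standardized_name": "1JPL_WT"},
}
shared = {
    name: cols
    for name, cols in columns_by_standardized_name(example_mapping).items()
    if len(cols) > 1
}
print(shared)  # {'HK68': ['A/Hong Kong/1/1968', 'HK68head avi bio']}
```

If both HK68 measurements should contribute to the test set, the two columns would need distinct standardized names or an explicit rule for combining them.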

In [9]:
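# Get unique standardized immunogen names (avoid duplicates).
# When several binding columns share a standardized name (HK68), only the
# first column in immunogen_mapping is used for ranking.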
unique_immunogens = {}
for binding_col, mapping_info in immunogen_mapping.items():
    std_name = mapping_info["standardized_name"]
    if std_name not in unique_immunogens:
        unique_immunogens[std_name] = binding_col

print("Immunogens to process:")
for std_name, binding_col in unique_immunogens.items():
    print(f"  - {std_name} (from column: {binding_col})")

# Generate test set: top 10 binders per immunogen
test_set = []
test_set_summary = {}

for std_name, binding_col in unique_immunogens.items():
    if binding_col not in df.columns:
        print(f"WARNING: Binding column '{binding_col}' not found in dataframe")
        continue
    
    # Get binding values
    binding_data = df[["Sample", "VDJ_AA_H", "VJ_AA_K", binding_col]].copy()
    binding_data = binding_data.dropna(subset=[binding_col, "VDJ_AA_H", "VJ_AA_K"])
    
    # Sort by binding value (descending) and take top 10
    top_binders = binding_data.nlargest(10, binding_col)
    
    test_set_summary[std_name] = {
        "binding_column": binding_col,
        "total_antibodies": len(binding_data),
        "top_10_binding_values": top_binders[binding_col].tolist(),
        "sample_ids": top_binders["Sample"].tolist()
    }
    
    # Add to test set
    for _, row in top_binders.iterrows():
        test_set.append({
            "sample_id": str(row["Sample"]),
            "immunogen_name": std_name,
            "binding_value": float(row[binding_col]),
            "hc_sequence": str(row["VDJ_AA_H"]),
            "lc_sequence": str(row["VJ_AA_K"])
        })

print(f"\n✓ Generated test set with {len(test_set)} antibody-immunogen pairs")

Immunogens to process:
  - 1JPLm414_C_G1 (from column: 1JPLm414_C_G1)
  - 1JPLm414_T_G3 (from column: 1JPLm414_T_G3_PAPRE)
  - 1JPL_4I (from column: 1JPL_4I_Avi)
  - 1JPL_WT (from column: 1JPL_WT_Avi)
  - HK68 (from column: A/Hong Kong/1/1968)
  - H3_JHB_94 (from column: H3/Johannesberg/94 avi bio)

✓ Generated test set with 60 antibody-immunogen pairs


In [10]:
# Display test set summary
print("\nTest Set Summary:")
print("=" * 100)
for std_name, summary in test_set_summary.items():
    print(f"\n{std_name}:")
    print(f"  Binding column: {summary['binding_column']}")
    print(f"  Total antibodies with data: {summary['total_antibodies']}")
    print(f"  Top 10 binding values: {[f'{v:.2f}' for v in summary['top_10_binding_values'][:5]]}... (showing first 5)")
    print(f"  Sample IDs: {summary['sample_ids'][:5]}... (showing first 5)")
    print(f"  Pairs to model: {len(summary['sample_ids'])}")


Test Set Summary:

1JPLm414_C_G1:
  Binding column: 1JPLm414_C_G1
  Total antibodies with data: 425
  Top 10 binding values: ['203160.47', '201835.22', '199423.56', '197486.53', '196183.56']... (showing first 5)
  Sample IDs: ['P31A3', 'P40G1', 'P28A7', 'P1A8', 'P35C12']... (showing first 5)
  Pairs to model: 10

1JPLm414_T_G3:
  Binding column: 1JPLm414_T_G3_PAPRE
  Total antibodies with data: 425
  Top 10 binding values: ['199567.39', '198764.94', '197054.60', '196172.72', '196095.97']... (showing first 5)
  Sample IDs: ['P40G1', 'P35A1', 'P39A3', 'P36B2', 'P28A7']... (showing first 5)
  Pairs to model: 10

1JPL_4I:
  Binding column: 1JPL_4I_Avi
  Total antibodies with data: 425
  Top 10 binding values: ['50441.56', '6345.71', '5136.67', '2198.22', '2198.22']... (showing first 5)
  Sample IDs: ['P40G1', 'P45B7', 'P55E2', 'P38F11', 'P38F11']... (showing first 5)
  Pairs to model: 10

1JPL_WT:
  Binding column: 1JPL_WT_Avi
  Total antibodies with data: 425
  Top 10 binding values: ['2

In [11]:
# Check for missing sequences in test set
missing_seqs = []
for pair in test_set:
    if pd.isna(pair["hc_sequence"]) or pair["hc_sequence"] == "nan" or not pair["hc_sequence"]:
        missing_seqs.append(f"{pair['sample_id']}: missing H chain")
    if pd.isna(pair["lc_sequence"]) or pair["lc_sequence"] == "nan" or not pair["lc_sequence"]:
        missing_seqs.append(f"{pair['sample_id']}: missing L chain")

if missing_seqs:
    print("WARNING: Missing sequences found:")
    for msg in missing_seqs:
        print(f"  - {msg}")
else:
    print("✓ All test set pairs have both H and L chain sequences")

# Save test set with provenance
test_set_provenance = {
    "timestamp": datetime.now().isoformat(),
    "selection_criteria": "Top 10 binders per immunogen (highest binding signal)",
    "total_pairs": len(test_set),
    # Record the number of pairs actually selected rather than assuming 10
    "pairs_per_immunogen": {name: len(summary["sample_ids"]) for name, summary in test_set_summary.items()},
    "excluded_pairs": len(missing_seqs)
}

provenance["transformations"].append({
    "step": "generate_test_set",
    "description": "Generated test set of top 10 binders per immunogen",
    "details": test_set_provenance
})

✓ All test set pairs have both H and L chain sequences


**Validation Checkpoint**: Please review the test set:
- ✓ Top binders selected for each immunogen
- ✓ Binding values are reasonable
- ✓ No missing sequences
- ✓ Total number of pairs is as expected (10 per immunogen)

If the test set looks correct, proceed to create the JSON generator function.
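
One detail in the summary above deserves a check before generating files: `P38F11` appears twice among the top `1JPL_4I` binders with the same binding value (2198.22), which suggests a repeated row in the input table (an inference from the printed output, not something the notebook verifies). Because output files are named `{sample_id}_{immunogen_name}.json`, a repeated pair would overwrite its own file. The sketch below counts repeated pairs in a list shaped like `test_set`; `duplicate_pairs` is a hypothetical helper and the toy values are placeholders.

```python
from collections import Counter


def duplicate_pairs(pairs):
    """Return (sample_id, immunogen_name) keys that occur more than once."""
    counts = Counter((p["sample_id"], p["immunogen_name"]) for p in pairs)
    return {key: n for key, n in counts.items() if n > 1}


# Toy entries shaped like test_set; binding values are placeholders.
toy_test_set = [
    {"sample_id": "P38F11", "immunogen_name": "1JPL_4I", "binding_value": 1.0},
    {"sample_id": "P38F11", "immunogen_name": "1JPL_4I", "binding_value": 1.0},
    {"sample_id": "P40G1", "immunogen_name": "1JPL_4I", "binding_value": 2.0},
]
repeated = duplicate_pairs(toy_test_set)
print(repeated)  # {('P38F11', '1JPL_4I'): 2}
```

If `duplicate_pairs(test_set)` is non-empty, one option is to call `df.drop_duplicates(subset=["Sample"])` before ranking. Whether dropping is appropriate depends on why the row is repeated in the source data.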

## 5. AF3 JSON Generator Function

**Goal**: Create a function that generates AlphaFold 3 JSON files with all three chains (H, L, Ag) together.

**Critical**: The JSON must include H, L, and antigen sequences in a single file to model the complex.

**Validation Checkpoint**: Test the function on a single example and verify the JSON structure.

In [12]:
def generate_af3_json(sample_id, immunogen_name, hc_seq, lc_seq, ag_seq, output_dir, model_seeds=None):
    """
    Generate AlphaFold 3 JSON input file for antibody-antigen complex.
    
    Parameters:
    -----------
    sample_id : str
        Sample identifier (e.g., "P10A1")
    immunogen_name : str
        Standardized immunogen name (e.g., "1JPLm414_C_G1")
    hc_seq : str
        Heavy chain amino acid sequence
    lc_seq : str
        Light chain amino acid sequence
    ag_seq : str
        Antigen amino acid sequence
    output_dir : Path
        Directory to save JSON file
    model_seeds : list of int, optional
        List of model seeds (default: [1])
    
    Returns:
    --------
    tuple : (json_data, output_path)
        json_data is the generated JSON structure (dict);
        output_path is the Path of the file that was written
    """
    # Use None as the default to avoid sharing one mutable list across calls
    if model_seeds is None:
        model_seeds = [1]
    
    # Validate sequences
    valid_aa = set("ACDEFGHIKLMNPQRSTVWY")
    
    for seq_name, seq in [("H", hc_seq), ("L", lc_seq), ("Ag", ag_seq)]:
        if not seq or pd.isna(seq) or str(seq) == "nan":
            raise ValueError(f"Missing or invalid {seq_name} chain sequence")
        # Strip surrounding whitespace first, matching what is written to the JSON
        invalid_chars = set(str(seq).strip()) - valid_aa
        if invalid_chars:
            raise ValueError(f"Invalid amino acids in {seq_name} chain: {invalid_chars}")
    
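    # Create JSON structure following AF3 format
    # NOTE: The DHVI AlphaFold 3 runner currently accepts JSON versions 1-3.
    # Use version=3 for compatibility.
    # NOTE: The AF3 input documentation describes chain IDs as uppercase letters.
    # Confirm that the runner accepts the mixed-case ID "Ag" before a large run.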
    json_data = {
        "name": f"{sample_id}_{immunogen_name}",
        "modelSeeds": model_seeds,
        "dialect": "alphafold3",
        "version": 3,
        "sequences": [
            {
                "protein": {
                    "id": "H",
                    "sequence": str(hc_seq).strip()
                }
            },
            {
                "protein": {
                    "id": "L",
                    "sequence": str(lc_seq).strip()
                }
            },
            {
                "protein": {
                    "id": "Ag",
                    "sequence": str(ag_seq).strip()
                }
            }
        ]
    }
    
    # Save to file
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    
    filename = f"{sample_id}_{immunogen_name}.json"
    output_path = output_dir / filename
    
    with open(output_path, 'w') as f:
        json.dump(json_data, f, indent=2)
    
    return json_data, output_path

print("✓ JSON generator function created")

✓ JSON generator function created
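
To make the structure check less manual, a small validator can confirm that a generated dictionary has the top-level keys written by `generate_af3_json` and exactly the three chains H, L, and Ag. This is a sketch under those assumptions only: `check_af3_input` is a hypothetical helper and does not validate against the full AlphaFold 3 input schema. The sequences in `example` are short placeholders.

```python
def check_af3_input(data):
    """Return a list of problems found in an AF3-style input dict (empty if none)."""
    problems = []
    for key in ("name", "modelSeeds", "dialect", "version", "sequences"):
        if key not in data:
            problems.append(f"missing top-level key: {key}")
    chain_ids = [
        entry["protein"]["id"]
        for entry in data.get("sequences", [])
        if "protein" in entry
    ]
    # All three chains must be present exactly once in the same file.
    if sorted(chain_ids) != ["Ag", "H", "L"]:
        problems.append(f"expected chains H, L, Ag; found {chain_ids}")
    return problems


# Placeholder sequences; real inputs come from generate_af3_json.
example = {
    "name": "toy_complex",
    "modelSeeds": [1],
    "dialect": "alphafold3",
    "version": 3,
    "sequences": [
        {"protein": {"id": "H", "sequence": "EVQL"}},
        {"protein": {"id": "L", "sequence": "DIVM"}},
        {"protein": {"id": "Ag", "sequence": "MSGP"}},
    ],
}
problems = check_af3_input(example)
print(problems)  # []
```

The same function can be applied to each file on disk by loading it with `json.load` first.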


In [13]:
# Test the function on a single example
test_sample = test_set[0]
print(f"Testing JSON generation with:")
print(f"  Sample ID: {test_sample['sample_id']}")
print(f"  Immunogen: {test_sample['immunogen_name']}")
print(f"  H chain length: {len(test_sample['hc_sequence'])} aa")
print(f"  L chain length: {len(test_sample['lc_sequence'])} aa")
print(f"  Antigen length: {len(immunogen_sequences[test_sample['immunogen_name']])} aa")

# Get antigen sequence
test_ag_seq = immunogen_sequences[test_sample['immunogen_name']]

# Generate test JSON
test_output_dir = Path("af3_inputs")
test_json, test_path = generate_af3_json(
    sample_id=test_sample['sample_id'],
    immunogen_name=test_sample['immunogen_name'],
    hc_seq=test_sample['hc_sequence'],
    lc_seq=test_sample['lc_sequence'],
    ag_seq=test_ag_seq,
    output_dir=test_output_dir
)

print(f"\n✓ Test JSON generated: {test_path}")
print(f"\nGenerated JSON structure:")
print(json.dumps(test_json, indent=2))

Testing JSON generation with:
  Sample ID: P31A3
  Immunogen: 1JPLm414_C_G1
  H chain length: 121 aa
  L chain length: 107 aa
  Antigen length: 167 aa

✓ Test JSON generated: af3_inputs/P31A3_1JPLm414_C_G1.json

Generated JSON structure:
{
  "name": "P31A3_1JPLm414_C_G1",
  "modelSeeds": [
    1
  ],
  "dialect": "alphafold3",
  "version": 3,
  "sequences": [
    {
      "protein": {
        "id": "H",
        "sequence": "QVQLKQSGPGLVQSSQSLSITCTVSGFSLTTYGVHWVRQSPGKGLEWLGVIWSGGSTDYNAAFISRLSISKDNSKSQVFFKMNSLQADDTAIYYCARKASYGSLFWYFDVWGTGTTVTVSS"
      }
    },
    {
      "protein": {
        "id": "L",
        "sequence": "DIVMTQSQKFMSTSVRDRVSVTCKASQNVGTNVAWYQQKPGQSPKALIYSASYRYSGVPDRFTGSGSGTDFTLTISNVQSEDLAEYFCQQYNTYPLTFGGGTKLEIK"
      }
    },
    {
      "protein": {
        "id": "Ag",
        "sequence": "MSGPEGQALREALEKAIDPQGRPWVRGTSLRWEYVLGFCDLVNKSPNGPQIAVRLLAEYIASPEPEVALNALVVLEACIENCGDKFIKEVSKPEFLNELEKIVSPEHLGKRIPEEVKERVLRLLYYLTRKYPDYTNIREAYEKLKEDGIIKEEPEIPTNETLEELGP"
      }
   

## 6. Generate JSON Files for Test Set

**Goal**: Generate JSON files for all antibody-immunogen pairs in the test set.

Progress will be tracked and a summary report will be generated.

In [14]:
# Set output directory
output_dir = Path("af3_inputs")
output_dir.mkdir(parents=True, exist_ok=True)

# Track generation progress
generation_stats = {
    "total_pairs": len(test_set),
    "files_created": 0,
    "files_skipped": 0,
    "errors": []
}

print(f"Generating JSON files for {len(test_set)} antibody-immunogen pairs...")
print(f"Output directory: {output_dir.absolute()}\n")

# Generate JSON files
for i, pair in enumerate(test_set, 1):
    sample_id = pair['sample_id']
    immunogen_name = pair['immunogen_name']
    
    try:
        # Get antigen sequence
        if immunogen_name not in immunogen_sequences:
            raise ValueError(f"Immunogen '{immunogen_name}' not found in sequences")
        
        ag_seq = immunogen_sequences[immunogen_name]
        
        # Generate JSON
        json_data, output_path = generate_af3_json(
            sample_id=sample_id,
            immunogen_name=immunogen_name,
            hc_seq=pair['hc_sequence'],
            lc_seq=pair['lc_sequence'],
            ag_seq=ag_seq,
            output_dir=output_dir
        )
        
        generation_stats["files_created"] += 1
        
        if i % 10 == 0:
            print(f"  Progress: {i}/{len(test_set)} files generated...")
            
    except Exception as e:
        generation_stats["files_skipped"] += 1
        error_msg = f"{sample_id}_{immunogen_name}: {str(e)}"
        generation_stats["errors"].append(error_msg)
        print(f"  ERROR: {error_msg}")

print(f"\n✓ Generation complete!")

Generating JSON files for 60 antibody-immunogen pairs...
Output directory: /cwork/hsb26/ab_seq_bind_analysis/af3_inputs

  Progress: 10/60 files generated...
  Progress: 20/60 files generated...
  Progress: 30/60 files generated...
  Progress: 40/60 files generated...
  Progress: 50/60 files generated...
  Progress: 60/60 files generated...

✓ Generation complete!


In [15]:
# Display generation summary
print("Generation Summary:")
print("=" * 80)
print(f"Total pairs in test set: {generation_stats['total_pairs']}")
print(f"Files created: {generation_stats['files_created']}")
print(f"Files skipped: {generation_stats['files_skipped']}")
print(f"Errors: {len(generation_stats['errors'])}")

if generation_stats['errors']:
    print("\nErrors encountered:")
    for error in generation_stats['errors']:
        print(f"  - {error}")

# List generated files
generated_files = list(output_dir.glob("*.json"))
print(f"\nGenerated files in {output_dir}: {len(generated_files)}")
if len(generated_files) <= 20:
    print("File list:")
    for f in sorted(generated_files):
        print(f"  - {f.name}")
else:
    print("First 10 files:")
    for f in sorted(generated_files)[:10]:
        print(f"  - {f.name}")
    print(f"  ... and {len(generated_files) - 10} more")

# Update provenance
provenance["outputs"]["json_files"] = {
    "directory": str(output_dir.absolute()),
    "total_files": len(generated_files),
    "naming_convention": "{sample_id}_{immunogen_name}.json",
    "generation_stats": generation_stats
}

Generation Summary:
Total pairs in test set: 60
Files created: 60
Files skipped: 0
Errors: 0

Generated files in af3_inputs: 59
First 10 files:
  - P10A1_1JPL_WT.json
  - P10A9_1JPL_WT.json
  - P12E11_H3_JHB_94.json
  - P13C1_1JPLm414_C_G1.json
  - P17A1_1JPLm414_T_G3.json
  - P17E9_H3_JHB_94.json
  - P1A8_1JPLm414_C_G1.json
  - P22H12_1JPLm414_T_G3.json
  - P23F10_1JPL_4I.json
  - P24E7_H3_JHB_94.json
  ... and 49 more


**Validation Checkpoint**: Please review:
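- ✓ Expected number of files created, and the same number present on disk (the run above reports 60 files created but only 59 on disk, which points to a duplicate sample/immunogen pair overwriting one file)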
- ✓ No unexpected errors
- ✓ File naming convention is correct
- ✓ Spot-check a few generated files to verify structure

If everything looks good, proceed to generate the provenance report.
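
The summary above counts 60 files created but finds 59 JSON files in the output directory. A direct comparison between expected filenames and the directory contents makes such gaps explicit. The sketch below uses the notebook's naming convention `{sample_id}_{immunogen_name}.json`; `compare_expected_to_disk` is a hypothetical helper, and the demonstration runs in a temporary directory with placeholder identifiers.

```python
from pathlib import Path
import tempfile


def compare_expected_to_disk(pairs, directory):
    """Compare expected JSON filenames with the files actually present."""
    expected = [f"{p['sample_id']}_{p['immunogen_name']}.json" for p in pairs]
    on_disk = {path.name for path in Path(directory).glob("*.json")}
    return {
        "expected_total": len(expected),
        "expected_unique": len(set(expected)),
        "missing": sorted(set(expected) - on_disk),
        "unexpected": sorted(on_disk - set(expected)),
    }


# Toy demonstration; identifiers are placeholders.
toy_pairs = [
    {"sample_id": "S1", "immunogen_name": "AgX"},
    {"sample_id": "S1", "immunogen_name": "AgX"},  # duplicate pair
    {"sample_id": "S2", "immunogen_name": "AgX"},
]
with tempfile.TemporaryDirectory() as tmp:
    for p in toy_pairs:
        # The second write for S1 overwrites the first, as in the real run.
        (Path(tmp) / f"{p['sample_id']}_{p['immunogen_name']}.json").write_text("{}")
    report = compare_expected_to_disk(toy_pairs, tmp)
print(report)
```

When `expected_total` exceeds `expected_unique`, the difference is the number of overwritten files. On a rerun of this notebook, `provenance_report.json` would appear under `unexpected` because it is saved to the same directory.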

## 7. Data Provenance Report

**Goal**: Generate a comprehensive report documenting all data sources, transformations, and outputs for reproducibility.

In [16]:
# Add final summary to provenance
provenance["summary"] = {
    "total_antibodies_in_dataset": len(df),
    "total_immunogens": len(immunogen_sequences),
    "test_set_size": len(test_set),
    "json_files_generated": generation_stats["files_created"],
    "output_directory": str(output_dir.absolute())
}

# Display provenance report
print("Data Provenance Report")
print("=" * 80)
print(f"\nTimestamp: {provenance['timestamp']}")

print("\nInput Files:")
print("-" * 80)
for filename, info in provenance["input_files"].items():
    print(f"\n{filename}:")
    print(f"  Path: {info['path']}")
    print(f"  Size: {info['size_bytes'] / 1024:.2f} KB")
    print(f"  Modified: {info['modified']}")
    if 'rows' in info:
        print(f"  Rows: {info['rows']}")
        print(f"  Columns: {info['columns']}")
    if 'num_sequences' in info:
        print(f"  Sequences: {info['num_sequences']}")

print("\nTransformations Applied:")
print("-" * 80)
for i, transform in enumerate(provenance["transformations"], 1):
    print(f"\n{i}. {transform['step']}:")
    print(f"   Description: {transform['description']}")
    if 'details' in transform:
        for key, value in transform['details'].items():
            if key != 'sequences':  # Skip full sequences in display
                print(f"   {key}: {value}")

print("\nOutputs:")
print("-" * 80)
for output_type, info in provenance["outputs"].items():
    print(f"\n{output_type}:")
    for key, value in info.items():
        if key != 'generation_stats':
            print(f"  {key}: {value}")

print("\nSummary:")
print("-" * 80)
for key, value in provenance["summary"].items():
    print(f"  {key}: {value}")

Data Provenance Report

Timestamp: 2026-02-01T11:58:09.101272

Input Files:
--------------------------------------------------------------------------------

cleaned_data.csv:
  Path: /cwork/hsb26/ab_seq_bind_analysis/data/cleaned_data.csv
  Size: 761.01 KB
  Modified: 2026-01-13T06:46:25.905068
  Rows: 606
  Columns: 138

immunogens.fasta:
  Path: /cwork/hsb26/ab_seq_bind_analysis/data/immunogens.fasta
  Size: 1.26 KB
  Modified: 2026-01-26T16:17:35.367452
  Sequences: 6

Transformations Applied:
--------------------------------------------------------------------------------

1. name_mapping:
   Description: Created mapping between binding columns, standardized names, and FASTA headers

2. parse_immunogens:
   Description: Parsed immunogen sequences from FASTA and created standardized dictionary

3. generate_test_set:
   Description: Generated test set of top 10 binders per immunogen
   timestamp: 2026-02-01T11:58:09.721515
   selection_criteria: Top 10 binders per immunogen (highest

In [17]:
# Save provenance report to file
provenance_file = Path("af3_inputs") / "provenance_report.json"
with open(provenance_file, 'w') as f:
    json.dump(provenance, f, indent=2)

print(f"\n✓ Provenance report saved to: {provenance_file}")
print(f"  This file contains complete data lineage for reproducibility.")


✓ Provenance report saved to: af3_inputs/provenance_report.json
  This file contains complete data lineage for reproducibility.
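
The provenance entries for the input files record path, size, and modification time. Those fields can change or coincide without reflecting the file contents, so a content checksum is a common addition for reproducibility. A minimal sketch using the standard library follows; `sha256_of_file` is a hypothetical helper, and the `sha256` key mentioned afterwards is an assumed addition rather than something the notebook writes.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Toy file standing in for data/immunogens.fasta.
with tempfile.TemporaryDirectory() as tmp:
    toy_file = Path(tmp) / "toy.fasta"
    toy_file.write_bytes(b">toy\nMKT\n")
    checksum = sha256_of_file(toy_file)
print(checksum)
```

In the notebook, the digest could be stored next to the existing fields, for example `provenance["input_files"]["immunogens.fasta"]["sha256"] = sha256_of_file(immunogen_fasta_path)`, before the report is saved.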
