# AI-Driven antibody design: enhancing CoV-AbDab with diffusion models and machine learning
## Module 2: Design antibody using diffusion model
The goal is to introduce sequence diversity into the CDR3 regions, which bind the epitope on the antigen (the SARS-CoV-2 spike protein), in order to optimize the binding affinity between antibody and antigen.

In [11]:

# download pdb file: Spike protein of SARS-CoV-2 in complex with antibody (2G1) (PDB ID: 7X08)
!wget https://files.rcsb.org/download/7X08.pdb

--2024-08-08 02:10:09--  https://files.rcsb.org/download/7X08.pdb
Resolving files.rcsb.org (files.rcsb.org)... 132.249.213.241
Connecting to files.rcsb.org (files.rcsb.org)|132.249.213.241|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: ‘7X08.pdb.1’

7X08.pdb.1              [                <=> ]   3.12M   832KB/s    in 3.7s    

2024-08-08 02:10:13 (866 KB/s) - ‘7X08.pdb.1’ saved [3268593]



## Fix the PDB file

In [12]:
import pdbfixer
def prepare_protein(
    pdb_file, ignore_missing_residues=True, ignore_terminal_missing_residues=True, ph=7.0
):
    """
    Use pdbfixer to prepare the protein from a PDB file. Hetero atoms such as ligands are
    removed and non-standard residues replaced. Missing atoms to existing residues are added.
    Missing residues are ignored by default, but can be included.

    Parameters
    ----------
    pdb_file: pathlib.Path or str
        PDB file containing the system to simulate.
    ignore_missing_residues: bool, optional
        If missing residues should be ignored or built.
    ignore_terminal_missing_residues: bool, optional
        If missing residues at the beginning and the end of a chain should be ignored or built.
    ph: float, optional
        pH value used to determine protonation state of residues

    Returns
    -------
    fixer: pdbfixer.pdbfixer.PDBFixer
        Prepared protein system.
    """
    fixer = pdbfixer.PDBFixer(str(pdb_file))
    fixer.removeHeterogens()  # co-crystallized ligands are unknown to PDBFixer
    fixer.findMissingResidues()  # identify missing residues, needed for identification of missing atoms

    # if missing terminal residues shall be ignored, remove them from the dictionary
    if ignore_terminal_missing_residues:
        chains = list(fixer.topology.chains())
        keys = fixer.missingResidues.keys()
        for key in list(keys):
            chain = chains[key[0]]
            if key[1] == 0 or key[1] == len(list(chain.residues())):
                del fixer.missingResidues[key]

    # if all missing residues shall be ignored, clear the dictionary
    if ignore_missing_residues:
        fixer.missingResidues = {}

    fixer.findNonstandardResidues()  # find non-standard residues
    fixer.replaceNonstandardResidues()  # replace non-standard residues with standard ones
    fixer.findMissingAtoms()  # find missing heavy atoms
    fixer.addMissingAtoms()  # add missing atoms and residues
    fixer.addMissingHydrogens(ph)  # add missing hydrogens
    return fixer
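
The terminal-residue filter inside `prepare_protein` relies on the layout of `fixer.missingResidues`: a dictionary keyed by `(chain index, residue index)` whose values are lists of missing residue names. The sketch below reproduces that filter on a plain dictionary so it can be tried without PDBFixer. The helper name `drop_terminal_gaps` and the example data are hypothetical, chosen only for illustration.

```python
def drop_terminal_gaps(missing, chain_lengths):
    """Return a copy of `missing` without gaps located at a chain start or end.

    missing: dict mapping (chain_index, residue_index) -> list of residue names,
             mirroring the layout of PDBFixer's `missingResidues`.
    chain_lengths: number of resolved residues in each chain, by chain index.
    """
    kept = {}
    for (chain_idx, res_idx), names in missing.items():
        at_start = res_idx == 0
        at_end = res_idx == chain_lengths[chain_idx]
        if not (at_start or at_end):
            kept[(chain_idx, res_idx)] = names
    return kept


# Hypothetical example: one N-terminal gap, one internal gap, one C-terminal gap.
missing = {(0, 0): ["MET", "ALA"], (0, 57): ["GLY", "SER"], (1, 120): ["LYS"]}
chain_lengths = [300, 120]
internal = drop_terminal_gaps(missing, chain_lengths)
print(internal)
```

Only the internal gap at `(0, 57)` survives, which is the set of residues PDBFixer would then be asked to build when `ignore_missing_residues=False`.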

In [3]:
pdb_path = '7X08.pdb'
prepared_protein = prepare_protein(pdb_path, ignore_missing_residues=False)

In [4]:
from openmm.app import PDBFile

# Define the path where you want to save the prepared protein
output_pdb_path = 'prepared_7X08.pdb'

# Save the prepared protein to the specified file
with open(output_pdb_path, 'w') as output_file:
    PDBFile.writeFile(prepared_protein.topology, prepared_protein.positions, output_file)

In [None]:
# View the prepared protein
import nglview as nv

# Create a view object for the prepared protein
#view = nv.show_pdbid("7x08")
view = nv.show_file(output_pdb_path)
# Display the view. The 3D view won't show in the saved notebook.
view

# make sure to install nglview with the listed version of dependencies from the link below
# https://github.com/nglviewer/nglview

![title11](ab_spike.png)

## Partial Diffusion of specified regions or CDR3 regions



In [None]:
# run RFdiffusion partial diffusion module to introduce sequence diversity to antibodies
!python-rfd scripts/run_inference.py inference.output_prefix=diversify_ab/diversified_antibody inference.input_pdb=pdb/prepared_spike_heavy_light_chain_trimmed.pdb 'contigmap.contigs=["65-65/0 214-214/0 211-211"]' 'contigmap.provide_seq=[0-64,65-149,175-278,279-366,377-489]' diffuser.partial_T=10 inference.num_designs=10



### Explanation of `contigmap`
`contigmap.contigs=["65-65/0 214-214/0 211-211"]` 

`contigmap.provide_seq=[0-64,65-149,175-278,279-366,377-489]`

In RFdiffusion, `contigmap.contigs` describes the chain layout of the input structure, and `contigmap.provide_seq` lists the residues whose sequence is kept as provided during partial diffusion.

`contigmap.contigs`: Tells RFdiffusion what is in the input PDB file. The first chain has 65 residues (`65-65`). `/0 ` marks a chain break (the trailing space ` ` is required). The next chain has 214 residues (`214-214/0 `), and the chain after that has 211 residues (`211-211`).
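
The chain lengths in the contig string have to match the input PDB file. As a minimal sketch, assuming a standard fixed-column PDB file with one `ATOM` record per atom, the lengths can be counted and the string assembled as follows. The helpers `chain_lengths_from_pdb`, `contig_string`, and `_atom_line` are hypothetical names, and the tiny two-chain structure is made up for the demonstration.

```python
def chain_lengths_from_pdb(pdb_text):
    """Count unique residues per chain, in order of first appearance."""
    seen = {}  # chain ID -> set of (residue number, insertion code)
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            chain_id = line[21]
            res_key = (line[22:26], line[26:27])
            seen.setdefault(chain_id, set()).add(res_key)
    return {chain: len(residues) for chain, residues in seen.items()}


def contig_string(lengths):
    """Join fixed-length chains with the '/0 ' chain-break token."""
    return "/0 ".join(f"{n}-{n}" for n in lengths)


def _atom_line(serial, chain, resnum, atom="CA"):
    # Minimal fixed-column ATOM record (coordinates omitted; demo only).
    return f"ATOM  {serial:5d}  {atom:<3s} GLY {chain}{resnum:4d} "


demo_pdb = "\n".join([
    _atom_line(1, "A", 1, "N"),
    _atom_line(2, "A", 1),
    _atom_line(3, "A", 2),
    _atom_line(4, "B", 1),
    _atom_line(5, "B", 2),
    _atom_line(6, "B", 3),
])
lengths = chain_lengths_from_pdb(demo_pdb)
print(lengths)
print(contig_string(lengths.values()))
print(contig_string([65, 214, 211]))
```

The last line reproduces the contig string used in this notebook from the three chain lengths 65, 214, and 211.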

`contigmap.provide_seq`: Specifies the residue ranges whose sequence is provided (kept fixed) during partial diffusion. Residue indices are 0-based and run continuously across all chains. Here only residues 150 to 174 and 367 to 376 are partially diffused, so the string lists every other range: `0-64,65-149,175-278,279-366,377-489`
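
Writing the complement ranges by hand is error-prone, so here is a minimal sketch that derives them from the chain lengths and the diffused ranges. The function name `fixed_sequence_ranges` is hypothetical. Splitting the ranges at chain boundaries is a choice made here to mirror the string above, not a documented RFdiffusion requirement.

```python
def fixed_sequence_ranges(chain_lengths, diffused):
    """Build a provide_seq-style string covering every residue not in `diffused`.

    chain_lengths: residues per chain, in file order.
    diffused: list of inclusive, 0-based (start, end) ranges to leave out.
    """
    diffused_idx = set()
    for start, end in diffused:
        diffused_idx.update(range(start, end + 1))

    ranges = []
    offset = 0
    for length in chain_lengths:
        run_start = None
        for i in range(offset, offset + length):
            if i not in diffused_idx:
                if run_start is None:
                    run_start = i  # a new fixed run begins here
            elif run_start is not None:
                ranges.append((run_start, i - 1))  # close the run before the gap
                run_start = None
        if run_start is not None:
            ranges.append((run_start, offset + length - 1))  # close at chain end
        offset += length
    return ",".join(f"{a}-{b}" for a, b in ranges)


provide_seq = fixed_sequence_ranges([65, 214, 211], [(150, 174), (367, 376)])
print(provide_seq)
```

For the chain lengths and diffused ranges used in this notebook, the result equals the `provide_seq` string shown above.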


In [None]:
view = nv.show_file('diversify_ab/diversified_antibody_0.pdb')
# Display the view. The 3D view won't show in the saved notebook.
view

![](ab_diffused.png)
Note that only the CDR3 loops of the heavy and light chains are partially diffused. The rest of the sequence remains unchanged.

## Use ProteinMPNN to generate sequences for the partially diffused structures
Feed `diversified_antibody_0.pdb` to the ProteinMPNN package using the following script.

List the positions of the residues to be redesigned in `design_only_positions`.

```bash
folder_with_pdbs="./pdb"

output_dir="./output_ab"
if [ ! -d $output_dir ]
then
    mkdir -p $output_dir
fi


path_for_parsed_chains=$output_dir"/parsed_pdbs.jsonl"
path_for_assigned_chains=$output_dir"/assigned_pdbs.jsonl"
path_for_fixed_positions=$output_dir"/fixed_pdbs.jsonl"
chains_to_design="A"
# Positions are 1-indexed along the chain (the first amino acid is 1); they are not PDB residue numbers.
design_only_positions="151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 368 369 370 371 372 373 374 375 376 377" #design only these residues; use flag --specify_non_fixed

python ../helper_scripts/parse_multiple_chains.py --input_path=$folder_with_pdbs --output_path=$path_for_parsed_chains

python ../helper_scripts/assign_fixed_chains.py --input_path=$path_for_parsed_chains --output_path=$path_for_assigned_chains --chain_list "$chains_to_design"

python ../helper_scripts/make_fixed_positions_dict.py --input_path=$path_for_parsed_chains --output_path=$path_for_fixed_positions --chain_list "$chains_to_design" --position_list "$design_only_positions" --specify_non_fixed

python ../protein_mpnn_run.py \
        --jsonl_path $path_for_parsed_chains \
        --chain_id_jsonl $path_for_assigned_chains \
        --fixed_positions_jsonl $path_for_fixed_positions \
        --out_folder $output_dir \
        --num_seq_per_target 2 \
        --sampling_temp "0.1" \
        --seed 37 \
        --batch_size 1
```
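
The `design_only_positions` list in the script is 1-indexed, whereas the ranges given to RFdiffusion are 0-indexed. A minimal sketch of the conversion, using a hypothetical helper `design_positions`, is:

```python
def design_positions(diffused_zero_based):
    """Convert inclusive 0-based ranges to a space-separated 1-based position list."""
    positions = []
    for start, end in diffused_zero_based:
        # Shift both ends by one; range() excludes its stop value, hence end + 2.
        positions.extend(range(start + 1, end + 2))
    return " ".join(str(p) for p in positions)


design_only_positions = design_positions([(150, 174), (367, 376)])
print(design_only_positions)
```

With the diffused ranges 150 to 174 and 367 to 376, this yields positions 151 to 175 and 368 to 377, matching the list in the script.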

Inspect file `output_ab/seqs/diversified_antibody_0.fa` and identify the following new sequences: 

Old CDRH3 sequence: `ARGLIRGIIMTGAFDI`

New CDRH3 sequence: `GKAKELNNKLNPSVTE`

Old CDRL3 sequence: `SSYAGSNNWV`

New CDRL3 sequence: `QSRTSNGGTK`
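
To quantify how far the redesigned loops moved from the originals, one simple measure is position-by-position identity. The sketch below applies it to the four sequences listed above. The function name `positional_identity` is hypothetical, and the comparison assumes the old and new loops have equal length and are aligned without gaps.

```python
def positional_identity(old, new):
    """Return (number of identical positions, total positions) for equal-length sequences."""
    if len(old) != len(new):
        raise ValueError("sequences must have equal length")
    matches = sum(a == b for a, b in zip(old, new))
    return matches, len(old)


cdrs = {
    "CDRH3": ("ARGLIRGIIMTGAFDI", "GKAKELNNKLNPSVTE"),
    "CDRL3": ("SSYAGSNNWV", "QSRTSNGGTK"),
}
for name, (old, new) in cdrs.items():
    matches, total = positional_identity(old, new)
    print(f"{name}: {matches}/{total} positions identical")
```

Here the CDRH3 loops share no identical positions (0 of 16) and the CDRL3 loops share one (1 of 10), so the redesigned loops differ almost entirely from the originals.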