# Monomer-optimized AlphaFold Inference pipeline 

This notebook demonstrates how to configure, compile, and run the monomer-optimized AlphaFold inference pipeline on Vertex AI Pipelines.

## Install and import required packages

In [None]:
%pip install -U kfp google-cloud-aiplatform google-cloud-pipeline-components

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import os
from google.cloud import aiplatform as vertex_ai
from kfp.v2 import compiler

## Configure environment settings

Change the values of the following parameters to reflect your environment.

- `PROJECT` - ID of your Google Cloud project
- `REGION` - GCP region where your resources are located
- `BUCKET_NAME` - GCS bucket to use for Vertex AI staging. It must be located in `REGION`
- `FILESTORE_IP` - IP address of your Filestore instance
- `FILESTORE_NETWORK_NAME` - Name of the VPC network that the Filestore instance is attached to


In [None]:
PROJECT = '<YOUR PROJECT ID>'  # Change to your project ID
REGION = '<YOUR REGION>'   # Change to your region
BUCKET_NAME = '<YOUR BUCKET NAME>'  # Change to your bucket name
FILESTORE_IP = '<FILESTORE IP ADDRESS>'  # Change to the IP address of your Filestore instance
FILESTORE_NETWORK_NAME = '<FILESTORE NETWORK NAME>'  # Change to the name of your Filestore instance's VPC network
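
Before moving on, it can help to confirm that none of the placeholders were left unedited. The following is a minimal sketch, not part of the original notebook: `check_settings` is a hypothetical helper, and the example values are made up for illustration.

```python
import ipaddress


def check_settings(settings):
    """Return a list of problems found in a dict of notebook settings."""
    problems = []
    for name, value in settings.items():
        # Unedited placeholders in this notebook look like '<YOUR PROJECT ID>'
        if not value or (value.startswith('<') and value.endswith('>')):
            problems.append(f'{name} still holds a placeholder: {value!r}')
    ip = settings.get('FILESTORE_IP', '')
    try:
        ipaddress.ip_address(ip)
    except ValueError:
        problems.append(f'FILESTORE_IP is not a valid IP address: {ip!r}')
    return problems


# Made-up example values; the last one is deliberately left as a placeholder
example = {
    'PROJECT': 'my-project',
    'REGION': 'us-central1',
    'BUCKET_NAME': 'my-bucket',
    'FILESTORE_IP': '10.0.0.2',
    'FILESTORE_NETWORK_NAME': '<FILESTORE NETWORK NAME>',
}
for problem in check_settings(example):
    print(problem)
```

In the notebook you would pass the real variables instead, for example `check_settings({'PROJECT': PROJECT, 'FILESTORE_IP': FILESTORE_IP})`.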

If you set up the sandbox environment using the provided Terraform configuration, you do not need to change the settings below. Otherwise, make sure that they are consistent with your environment.

- `FILESTORE_SHARE` - Filestore share with AlphaFold reference databases
- `FILESTORE_MOUNT_PATH` - Mount path for Filestore fileshare
- `MODEL_PARAMS` - GCS location of AlphaFold model parameters. The pipelines are configured to retrieve the parameters from the `<MODEL_PARAMS>/params` folder.

In [None]:
FILESTORE_SHARE = '/datasets'
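FILESTORE_MOUNT_PATH = '/mnt/nfs/alphafold'
# The IPython `!` syntax returns the command output as a list of lines
PROJECT_NUMBER = !(gcloud projects list --filter="projectId:{PROJECT}" --format="value(PROJECT_NUMBER)")
# Full resource name of the VPC network that the Filestore instance is attached to
FILESTORE_NETWORK = f'projects/{PROJECT_NUMBER[0]}/global/networks/{FILESTORE_NETWORK_NAME}'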
MODEL_PARAMS = f'gs://{BUCKET_NAME}'
IMAGE_URI = f'gcr.io/{PROJECT}/alphafold-components'
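
The cell above indexes `PROJECT_NUMBER[0]`, which raises an `IndexError` if the `gcloud` call returns nothing (for example, when the project ID is wrong or credentials are missing). A slightly more defensive variant could look like the sketch below. `build_network_path` is a hypothetical helper, not something the repository provides, and the project number in the demo is made up.

```python
def build_network_path(project_number_lines, network_name):
    """Build the full VPC network resource name from gcloud output lines."""
    # Drop empty lines; the list is empty when no project matched the filter
    numbers = [line.strip() for line in project_number_lines if line.strip()]
    if not numbers:
        raise ValueError(
            'gcloud returned no project number; check PROJECT and your credentials')
    if not numbers[0].isdigit():
        raise ValueError(f'Unexpected project number: {numbers[0]!r}')
    return f'projects/{numbers[0]}/global/networks/{network_name}'


# Made-up project number and network name, for illustration only
print(build_network_path(['123456789012'], 'my-vpc'))
```

In the notebook this would be called as `build_network_path(PROJECT_NUMBER, FILESTORE_NETWORK_NAME)`.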

## Configure and run the Optimized pipeline with custom settings

There are two types of parameters that can be used to customize Vertex Pipelines: compile time and runtime. The compile time parameters must be set before compiling the pipeline code. The runtime parameters can be supplied when starting a pipeline run.

In the AlphaFold inference pipelines, the compile time parameters are used to control settings like CPU/GPU configuration of compute nodes and the Filestore instance settings. The runtime parameters include a sequence to fold, model presets, the maximum date for template searches and more. 

The pipelines have been designed to retrieve compile time parameters from environment variables. This makes it easy to integrate a pipeline compilation step with CI/CD systems.
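
To illustrate the idea, here is a rough sketch of how a pipeline module might read its compile time settings from environment variables. This is an assumption about the general pattern, not the repository's actual code: `CompileTimeConfig`, `load_compile_time_config`, and the default values are all hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CompileTimeConfig:
    memory_limit: str
    cpu_limit: str
    gpu_limit: str
    gpu_type: str


def load_compile_time_config(env=None):
    """Read hardware settings from the environment, falling back to defaults."""
    env = os.environ if env is None else env
    # The defaults below are illustrative only, not the pipeline's real defaults
    return CompileTimeConfig(
        memory_limit=env.get('MEMORY_LIMIT', '30'),
        cpu_limit=env.get('CPU_LIMIT', '8'),
        gpu_limit=env.get('GPU_LIMIT', '1'),
        gpu_type=env.get('GPU_TYPE', 'nvidia-tesla-t4'),
    )


# A CI/CD job would export the variables before compiling the pipeline
config = load_compile_time_config({'GPU_TYPE': 'nvidia-tesla-a100'})
print(config)
```

Because the values are read when the pipeline module is imported and compiled, changing an environment variable afterwards has no effect on an already compiled pipeline definition.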

By default, the pipeline uses a `c2-standard-16` node to run the feature engineering step and `n1-standard-8` nodes with NVIDIA T4 GPUs to run prediction and relaxation. This hardware configuration is well suited to folding smaller proteins, roughly 1,000 residues or fewer.

To demonstrate how you can change the default pipeline settings, you will reconfigure the pipeline to use nodes with a single NVIDIA A100 GPU for prediction and relaxation.

### Configure and compile the pipeline

#### Set compile time parameters

At a minimum, you must configure:
- the settings of the Filestore instance that hosts the genetic databases,
- the URI of the Docker image that packages the custom KFP components, and
- the GCS location of the AlphaFold model parameters.

The other variables set in the cell below configure the hardware for the prediction and relaxation nodes.

In [None]:
os.environ['ALPHAFOLD_COMPONENTS_IMAGE'] = IMAGE_URI
os.environ['NFS_SERVER'] = FILESTORE_IP
os.environ['NFS_PATH'] = FILESTORE_SHARE
os.environ['NETWORK'] = FILESTORE_NETWORK
os.environ['MODEL_PARAMS_GCS_LOCATION'] = MODEL_PARAMS

# Host configuration for Inference
os.environ['MEMORY_LIMIT'] = '85'    # Amount of host memory (RAM)
os.environ['CPU_LIMIT'] = '12'       # Number of CPUs
os.environ['GPU_LIMIT'] = '1'        # Number of GPUs
os.environ['GPU_TYPE'] = 'nvidia-tesla-a100'  # GPU type

# Host configuration for Protein Relaxation
os.environ['RELAX_MEMORY_LIMIT'] = '85'    # Amount of host memory (RAM)
os.environ['RELAX_CPU_LIMIT'] = '12'       # Number of CPUs
os.environ['RELAX_GPU_LIMIT'] = '1'        # Number of GPUs
os.environ['RELAX_GPU_TYPE'] = 'nvidia-tesla-a100'  # GPU type
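
Note that every value in the cell above is a string: assigning a non-string value such as an integer to `os.environ` raises a `TypeError`. If you keep your settings in a dictionary, a small helper can do the conversion for you. The sketch below is illustrative, and `export_settings` is a hypothetical name.

```python
import os


def export_settings(settings, environ=os.environ):
    """Copy settings into the environment, converting every value to a string."""
    for name, value in settings.items():
        if value is None:
            raise ValueError(f'{name} is not set')
        environ[name] = str(value)


# Demo against a plain dict, so the real environment is left untouched
fake_env = {}
export_settings({'GPU_LIMIT': 1, 'GPU_TYPE': 'nvidia-tesla-a100'}, environ=fake_env)
print(fake_env)
```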

#### Compile the pipeline

In [None]:
from src.pipelines.alphafold_optimized_monomer import alphafold_monomer_pipeline as pipeline

pipeline_name = 'monomer-optimized-pipeline'
compiler.Compiler().compile(
    pipeline_func=pipeline,
    package_path=f'{pipeline_name}.json')
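
After compiling, you may want to confirm that the package file was written and contains valid JSON. The sketch below assumes only that the compiled package is a JSON object. `summarize_pipeline_package` is a hypothetical helper, and the demo runs on a tiny stand-in file rather than a real compiled pipeline.

```python
import json
import tempfile
from pathlib import Path


def summarize_pipeline_package(package_path):
    """Return the file size and top-level keys of a compiled pipeline package."""
    path = Path(package_path)
    spec = json.loads(path.read_text())
    if not isinstance(spec, dict):
        raise ValueError('Expected the package to contain a JSON object')
    return {'size_bytes': path.stat().st_size, 'top_level_keys': sorted(spec)}


# Stand-in file so the snippet runs without a compiled pipeline
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'demo-pipeline.json'
    demo_path.write_text(json.dumps({'example_key': 1}))
    print(summarize_pipeline_package(demo_path))
```

In the notebook you would call `summarize_pipeline_package(f'{pipeline_name}.json')`.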

### Configure runtime parameters

At a minimum, you need to specify the GCS location of your sequence, the maximum date for template searches, and the project and region in which to run the pipeline. With the default settings, the pipeline runs inference using the full version of BFD.



#### Copy the sample sequence to a GCS location

You can find a few sample sequences in the `sequences` folder.

For this demo, we will use `Q9Y490`, a medium-sized sequence.

In [None]:
sequence = 'Q9Y490.fasta'
gcs_sequence_path = f'gs://{BUCKET_NAME}/fasta/{sequence}'

! gsutil cp sequences/{sequence} {gcs_sequence_path}
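
Since the hardware guidance in this notebook is expressed in residues, it can be useful to count the residues in a FASTA file before choosing a configuration. The following is a minimal sketch; `count_residues` is a hypothetical helper and the sample record is a toy sequence, not one of the files in the `sequences` folder.

```python
def count_residues(fasta_text):
    """Return {header: residue_count} for each record in FASTA-formatted text."""
    counts = {}
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new record starts; the header is everything after '>'
            header = line[1:]
            counts[header] = 0
        elif header is not None:
            counts[header] += len(line)
    return counts


sample = '>toy_sequence\nMKTAYIAKQR\nQISFVKSHFS\n'
print(count_residues(sample))  # {'toy_sequence': 20}
```

To apply it to the demo sequence, read the local file first, for example `count_residues(open(f'sequences/{sequence}').read())`.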

#### Define the parameter values

In [None]:
params = {
    'sequence_path': gcs_sequence_path,
    'max_template_date': '2020-05-14',
    'project': PROJECT,
    'region': REGION,
    'is_run_relax': 'relax'
}
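
A malformed `max_template_date` is easier to catch before a run is submitted. The sketch below assumes the `YYYY-MM-DD` format used in the cell above; `validate_template_date` is a hypothetical helper, and rejecting future dates is an extra sanity check rather than a documented pipeline requirement.

```python
from datetime import date, datetime


def validate_template_date(value):
    """Parse a YYYY-MM-DD string and return it in normalized ISO form."""
    # strptime raises ValueError if the string does not match the format
    parsed = datetime.strptime(value, '%Y-%m-%d').date()
    if parsed > date.today():
        raise ValueError(f'max_template_date {value!r} is in the future')
    return parsed.isoformat()


print(validate_template_date('2020-05-14'))  # 2020-05-14
```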

### Submit a pipeline run

We recommend annotating each pipeline run with at least two labels. The first label groups multiple pipeline runs into a single experiment. The second identifies a given run within the experiment. Labels make it easier to find and analyze pipeline runs at scale. The third notebook, which demonstrates how to analyze pipeline runs, depends on these labels.

You can monitor the run by following the link that is printed when you execute the cell that submits the job.

In [None]:
vertex_ai.init(
    project=PROJECT,
    location=REGION,
    staging_bucket=f'gs://{BUCKET_NAME}/staging'
)

In [None]:
# Define the name of your experiment
# This name will be used to locate the PipelineJob
experiment_id = 'Q9Y490-experiment'
labels = {'experiment_id': experiment_id.lower(), 'sequence_id': sequence.split(sep='.')[0].lower()}

pipeline_job = vertex_ai.PipelineJob(
    display_name=pipeline_name,
    template_path=f'{pipeline_name}.json',
    pipeline_root=f'gs://{BUCKET_NAME}/pipeline_runs/{pipeline_name}',
    parameter_values=params,
    enable_caching=False,
    labels=labels
)

pipeline_job.run(sync=False)
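
The cell above lowercases the label values because Google Cloud labels broadly accept only lowercase letters, digits, underscores, and dashes, with values limited to 63 characters. If your experiment or sequence names may contain other characters, a helper along these lines can normalize them. `to_label_value` is a hypothetical name, and replacing invalid characters with a dash is just one possible choice.

```python
import re


def to_label_value(text, max_length=63):
    """Lowercase the text and replace characters that labels do not accept."""
    cleaned = re.sub(r'[^a-z0-9_-]', '-', text.lower())
    return cleaned[:max_length]


print(to_label_value('Q9Y490.fasta'))  # q9y490-fasta
```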

In [None]:
# Check the state of the pipeline
pipeline_job.state
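
Because the run was submitted with `sync=False`, the state has to be checked repeatedly until the run finishes. One way to do that is a small polling loop, sketched below. `wait_for_terminal_state` is a hypothetical helper that takes any callable returning a state name, and the demo uses a stand-in sequence of states instead of a real job.

```python
import time

# Names of the Vertex AI pipeline states after which a run no longer changes
TERMINAL_STATES = {
    'PIPELINE_STATE_SUCCEEDED',
    'PIPELINE_STATE_FAILED',
    'PIPELINE_STATE_CANCELLED',
}


def wait_for_terminal_state(get_state_name, poll_seconds=60, max_polls=120,
                            sleep=time.sleep):
    """Poll until the state is terminal, or raise after max_polls attempts."""
    for _ in range(max_polls):
        state = get_state_name()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f'No terminal state after {max_polls} polls')


# Stand-in for a real job: yields a fixed sequence of state names
fake_states = iter([
    'PIPELINE_STATE_PENDING',
    'PIPELINE_STATE_RUNNING',
    'PIPELINE_STATE_SUCCEEDED',
])
print(wait_for_terminal_state(lambda: next(fake_states), poll_seconds=0))
```

With a real job, this would presumably be called as `wait_for_terminal_state(lambda: pipeline_job.state.name)`, assuming that reading `state` refreshes the job from the service. The SDK also provides `pipeline_job.wait()`, which blocks until the run completes.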