# Quick Start: AlphaFold Inference with Vertex Pipelines [Monomer]

This quick start notebook demonstrates how to configure and run the Universal inference pipeline using the monomer presets.



## Install and import required packages

In [None]:
%pip install -U kfp google-cloud-aiplatform google-cloud-pipeline-components

In [None]:
%load_ext autoreload
%autoreload 2

In [1]:
import os
from google.cloud import aiplatform as vertex_ai
from kfp.v2 import compiler

## Configure environment settings

Change the values of the following parameters to reflect your environment.

- `PROJECT` - Project ID of your environment
- `REGION` - GCP region where your resources are located
- `BUCKET_NAME` - GCS bucket to use for Vertex staging. Must be located in `REGION`
- `FILESTORE_IP` - IP address of your Filestore instance
- `FILESTORE_NETWORK_NAME` - Filestore VPC name


In [2]:
PROJECT = 'alphafold-deploy-0525'  # Change to your project ID
REGION = 'us-central1'   # Change to your region
BUCKET_NAME = 'alphafold-deploy-0525-bucket'  # Change to your bucket name
FILESTORE_IP = '10.130.0.2'  # Change to the IP address of your Filestore instance
FILESTORE_NETWORK_NAME = 'default-alphafold'  # Change to the name of your Filestore VPC network
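
A small sanity check can catch common typos in these values before anything is submitted. The helper below is a hypothetical sketch, not part of the AlphaFold pipeline code. Its region pattern is a rough heuristic, and it assumes `BUCKET_NAME` holds the bare bucket name, because later cells build paths as `gs://{BUCKET_NAME}`.

```python
import ipaddress
import re


def check_environment_settings(project, region, bucket_name, filestore_ip):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    if not project:
        problems.append('PROJECT is empty')
    # Heuristic only: a region such as 'us-central1' has no zone suffix like '-a'.
    if not re.fullmatch(r'[a-z]+-[a-z]+\d+', region):
        problems.append(f'REGION {region!r} does not look like a region name')
    # Later cells prepend gs:// themselves, so the prefix must not be included here.
    if bucket_name.startswith('gs://'):
        problems.append('BUCKET_NAME should not include the gs:// prefix')
    try:
        ipaddress.ip_address(filestore_ip)
    except ValueError:
        problems.append(f'FILESTORE_IP {filestore_ip!r} is not a valid IP address')
    return problems


print(check_environment_settings(
    'my-project', 'us-central1', 'my-bucket', '10.130.0.2'))
```

In the notebook you would pass `PROJECT`, `REGION`, `BUCKET_NAME`, and `FILESTORE_IP` instead of the placeholder values.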

If you set up the sandbox environment using the provided Terraform configuration, you do not need to change the settings below. Otherwise, make sure that they are consistent with your environment.

- `FILESTORE_SHARE` - Filestore share with AlphaFold reference databases
- `FILESTORE_MOUNT_PATH` - Mount path for Filestore fileshare
- `MODEL_PARAMS` - GCS location of AlphaFold model parameters. The pipelines are configured to retrieve the parameters from the `<MODEL_PARAMS>/params` folder.


In [3]:
FILESTORE_SHARE = '/datasets'
FILESTORE_MOUNT_PATH = '/mnt/nfs/alphafold'
PROJECT_NUMBER = !(gcloud projects list --filter="projectId:{PROJECT}" --format="value(PROJECT_NUMBER)")  
FILESTORE_NETWORK = f'projects/{PROJECT_NUMBER[0]}/global/networks/{FILESTORE_NETWORK_NAME}'  
MODEL_PARAMS = f'gs://{BUCKET_NAME}'
IMAGE_URI = f'gcr.io/{PROJECT}/alphafold-components'

## Configure and run the Universal pipeline with monomer presets

There are two types of parameters that can be used to customize Vertex Pipelines: compile time and runtime. The compile time parameters must be set before compiling the pipeline code. The runtime parameters can be supplied when starting a pipeline run.

In the AlphaFold inference pipelines, the compile time parameters are used to control settings like CPU/GPU configuration of compute nodes and the Filestore instance settings. The runtime parameters include a sequence to fold, model presets, the maximum date for template searches and more. 

The pipelines have been designed to retrieve compile time parameters from environment variables. This makes it easy to integrate a pipeline compilation step with CI/CD systems.

By default, the pipeline uses a `c2-standard-16` node to run the feature engineering step and `n1-standard-8` nodes with NVIDIA T4 GPUs to run prediction and relaxation. For now, you will use the default settings. This hardware configuration is optimal for folding smaller proteins, roughly 1,000 residues or fewer.

Other notebooks demonstrate how to change the default hardware configuration.

### Set compile time parameters

At a minimum, you must configure:
- the settings of the Filestore instance that hosts the genetic databases,
- the URI of the Docker image that packages the custom KFP components, and
- the GCS location of the AlphaFold parameters.

In [4]:
os.environ['ALPHAFOLD_COMPONENTS_IMAGE'] = IMAGE_URI  # Defined above as gcr.io/{PROJECT}/alphafold-components
os.environ['NFS_SERVER'] = FILESTORE_IP
os.environ['NFS_PATH'] = FILESTORE_SHARE
os.environ['NETWORK'] = FILESTORE_NETWORK
os.environ['MODEL_PARAMS_GCS_LOCATION'] = MODEL_PARAMS
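
Because the pipeline reads these settings from the environment at compile time, a missing variable may only surface later as a confusing error. The sketch below checks that the five variables set above are present and non-empty. The helper name is hypothetical, and how the pipeline module itself handles a missing variable is not shown in this notebook.

```python
import os

# Variable names taken from the cell above.
REQUIRED_VARIABLES = (
    'ALPHAFOLD_COMPONENTS_IMAGE',
    'NFS_SERVER',
    'NFS_PATH',
    'NETWORK',
    'MODEL_PARAMS_GCS_LOCATION',
)


def missing_compile_time_variables(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


missing = missing_compile_time_variables()
if missing:
    print('Set these variables before compiling:', ', '.join(missing))
```

The same check could run as an early step in a CI/CD job, before the compilation step.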

### Compile the pipeline

In [6]:
from src.pipelines.alphafold_inference_pipeline import alphafold_inference_pipeline as pipeline

pipeline_name = 'universal-pipeline'
compiler.Compiler().compile(
    pipeline_func=pipeline,
    package_path=f'{pipeline_name}.json')
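
To confirm that compilation produced a readable file, you can load it and list its top-level keys. This is a minimal sketch with a hypothetical helper name. It only assumes that the compiled package is a JSON object, and it makes no assumption about which keys are present, since the layout depends on the KFP SDK version.

```python
import json
import os


def summarize_compiled_pipeline(package_path):
    """Return the file size and the sorted top-level keys of a compiled pipeline file."""
    with open(package_path) as package_file:
        spec = json.load(package_file)
    return {
        'size_bytes': os.path.getsize(package_path),
        'top_level_keys': sorted(spec),  # Iterating a dict yields its keys.
    }
```

In the notebook, `summarize_compiled_pipeline(f'{pipeline_name}.json')` would report on the file written by the cell above.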



### Configure runtime parameters

At a minimum, you need to configure the GCS location of your sequence, the maximum date for template searches, and the project and region in which to run the pipeline. With the default settings, the pipeline runs monomer inference using the small version of BFD.

#### Copy the sample sequence to a GCS location

You can find a few sample sequences in the `sequences` folder.

In [7]:
sequence = 'T1050.fasta'
gcs_sequence_path = f'gs://{BUCKET_NAME}/fasta/{sequence}'

! gsutil cp sequences/{sequence} {gcs_sequence_path}

Copying file://sequences/T1050.fasta [Content-Type=application/octet-stream]...
/ [1 files][  830.0 B/  830.0 B]                                                
Operation completed over 1 objects/830.0 B.                                      


In [8]:
params = {
    'sequence_path': gcs_sequence_path,
    'max_template_date': '2020-05-14',
    'model_preset': 'monomer',
    'project': PROJECT,
    'region': REGION
}
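
A quick local check of the runtime parameters can save a failed submission. The helper below is a hypothetical sketch. It assumes the pipeline expects `max_template_date` in `YYYY-MM-DD` form, as in the example value above, and a `gs://` URI for `sequence_path`. It does not validate `model_preset`.

```python
from datetime import datetime


def check_runtime_parameters(params):
    """Return a list of problems found in the runtime parameters."""
    problems = []
    if not str(params.get('sequence_path', '')).startswith('gs://'):
        problems.append('sequence_path should be a gs:// URI')
    try:
        # Assumed format, matching the example value '2020-05-14'.
        datetime.strptime(str(params.get('max_template_date', '')), '%Y-%m-%d')
    except ValueError:
        problems.append('max_template_date should use the YYYY-MM-DD format')
    for key in ('project', 'region'):
        if not params.get(key):
            problems.append(f'{key} is missing or empty')
    return problems


example = {
    'sequence_path': 'gs://my-bucket/fasta/T1050.fasta',
    'max_template_date': '2020-05-14',
    'model_preset': 'monomer',
    'project': 'my-project',
    'region': 'us-central1',
}
print(check_runtime_parameters(example))
```

In the notebook you would call `check_runtime_parameters(params)` on the dictionary defined above.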

### Submit a pipeline run

We recommend annotating each pipeline run with at least two labels. The first label groups multiple runs into a single experiment, and the second identifies a given run within that experiment. Labels make it easier to find and analyze pipeline runs at scale. The third notebook, which demonstrates how to analyze pipeline runs, depends on these labels.
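
Google Cloud resource labels generally allow only lowercase letters, digits, underscores, and dashes, with at most 63 characters per value. The sketch below normalizes arbitrary text into that form. The helper name is hypothetical, and the sketch assumes Vertex AI pipeline labels follow the general Google Cloud label rules.

```python
import re


def to_label_value(text, max_length=63):
    """Lowercase the text, replace disallowed characters with '-', and truncate."""
    value = re.sub(r'[^a-z0-9_-]', '-', text.lower())
    return value[:max_length]


print(to_label_value('T1050-experiment'))
```

The cell below lowercases the experiment and sequence IDs directly, which is enough for the sample values. A helper like this matters only for IDs that contain spaces, dots, or other punctuation.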

You will be able to monitor the run using the link printed by executing the cell.

In [9]:
experiment_id = 'T1050-experiment'
labels = {'experiment_id': experiment_id.lower(), 'sequence_id': sequence.split(sep='.')[0].lower()}

pipeline_job = vertex_ai.PipelineJob(
    display_name=pipeline_name,
    template_path=f'{pipeline_name}.json',
    pipeline_root=f'gs://{BUCKET_NAME}/pipeline_runs/{pipeline_name}',
    parameter_values=params,
    enable_caching=False,
    labels=labels
)

pipeline_job.run(sync=False)

Creating PipelineJob
PipelineJob created. Resource name: projects/569249717177/locations/us-central1/pipelineJobs/alphafold-inference-pipeline-20220606192735
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/569249717177/locations/us-central1/pipelineJobs/alphafold-inference-pipeline-20220606192735')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/alphafold-inference-pipeline-20220606192735?project=569249717177
PipelineJob projects/569249717177/locations/us-central1/pipelineJobs/alphafold-inference-pipeline-20220606192735 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/569249717177/locations/us-central1/pipelineJobs/alphafold-inference-pipeline-20220606192735 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/569249717177/locations/us-central1/pipelineJobs/alphafold-inference-pipeline-20220606192735 current state:
PipelineState.PIPELINE_STATE_R

In [10]:
# Check the state of the pipeline
pipeline_job.state

<PipelineState.PIPELINE_STATE_RUNNING: 3>
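
Because the run was started with `sync=False`, the notebook does not block until the pipeline finishes. One way to wait for completion is to poll the state until it reaches a terminal value. The sketch below is hypothetical and deliberately independent of the Vertex AI SDK: it takes any callable that returns the state name. It assumes the terminal state names are `PIPELINE_STATE_SUCCEEDED`, `PIPELINE_STATE_FAILED`, and `PIPELINE_STATE_CANCELLED`.

```python
import time

# Assumed terminal state names of the PipelineState enum.
TERMINAL_STATES = {
    'PIPELINE_STATE_SUCCEEDED',
    'PIPELINE_STATE_FAILED',
    'PIPELINE_STATE_CANCELLED',
}


def wait_for_terminal_state(get_state_name, poll_seconds=60,
                            timeout_seconds=6 * 3600, sleep=time.sleep):
    """Poll get_state_name() until it returns a terminal state or the timeout elapses."""
    waited = 0
    while True:
        name = get_state_name()
        if name in TERMINAL_STATES:
            return name
        if waited >= timeout_seconds:
            raise TimeoutError(f'Still in state {name} after {waited} seconds')
        sleep(poll_seconds)
        waited += poll_seconds
```

In the notebook, a call might look like `wait_for_terminal_state(lambda: pipeline_job.state.name)`, assuming that reading `pipeline_job.state` fetches the latest state each time it is accessed.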