## Take Home Assessment

**Disclaimer**: This assessment is a work in progress, so we apologise in advance for any hiccups. Any feedback is valuable!

**Setup**: You are provided with training code for a model that takes a protein's 3D structure and predicts the associated amino acid sequence. This notebook provides the steps required to download the code repository and the training data (a subset of the Protein Data Bank), alongside minimal code to call the training loop. Please fork the repository linked below and edit your own version.

**Compute**: You will be provided with a [Lambda](https://cloud.lambdalabs.com/) instance with an A10 GPU on an agreed day. For this we need your public key, and we will share an IP address for accessing the compute instance.

**Evaluation**: The following questions are intentionally open-ended; no specific answer is expected. The aim is to provide a semi-realistic setup of the kind you might encounter if you joined our team. We want to assess your ability to probe deep learning models and to propose solutions to the limitations you identify. Please write up your answers (e.g. with plots, tables, etc.) in your copy of the repository (in this notebook or in any other format of your choice) and push them to your fork. Please also document the steps you took to arrive at your answers. We will discuss them during the onsite interview. Please keep the time commitment under 4 hours.

**Questions**:
1. Log and profile the footprint of the training loop. Is there a bottleneck? If so, of what type?
2. What can you do to improve this? Implement some of your proposed solutions.
3. What issues will start to appear as we increase the model size? How could they be partially alleviated? Implement some of your proposed solutions.
4. What issues will start to appear as we increase the size of the training dataset (e.g. by using the AlphaFold database)?
  - a. How could one alleviate the longer training time?
  - b. What would be necessary to scale the data loading?
5. Log the average weight norm and activation norm throughout training. What do you observe?
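
As one possible starting point for question 1, the sketch below separates the time spent waiting for the next batch from the time spent in the training step, which is often enough to tell a data-loading bottleneck from a compute one. It is a minimal illustration using only the standard library; `StepTimer`, `timed_batches` and `profile_epoch` are hypothetical helper names and are not part of the provided repository, and `loader` and `train_step` stand in for whatever iterable and step function the training code exposes.

```python
import time
from collections import defaultdict


class StepTimer:
    """Accumulates wall-clock time per named phase of a training loop."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def add(self, phase, seconds):
        self.totals[phase] += seconds
        self.counts[phase] += 1

    def summary(self):
        # Share of total measured time per phase (guard against an empty timer).
        total = sum(self.totals.values()) or 1.0
        return {
            phase: {
                "seconds": seconds,
                "share": seconds / total,
                "calls": self.counts[phase],
            }
            for phase, seconds in self.totals.items()
        }


def timed_batches(loader, timer, phase="data"):
    """Yield batches from `loader`, charging the wait for each batch to `phase`."""
    iterator = iter(loader)
    while True:
        start = time.perf_counter()
        try:
            batch = next(iterator)
        except StopIteration:
            return
        timer.add(phase, time.perf_counter() - start)
        yield batch


def profile_epoch(loader, train_step, timer=None):
    """Run one pass over `loader`, timing data waits and `train_step` separately.

    Assumption: `train_step` returns only once its work is finished. GPU work
    is launched asynchronously, so on a GPU the step would need to synchronise
    before returning for the "compute" numbers to be meaningful.
    """
    if timer is None:
        timer = StepTimer()
    for batch in timed_batches(loader, timer):
        start = time.perf_counter()
        train_step(batch)
        timer.add("compute", time.perf_counter() - start)
    return timer
```

Calling `profile_epoch(loader, train_step).summary()` returns, per phase, the accumulated seconds, the share of total measured time, and the number of calls. With PyTorch on a GPU, calling `torch.cuda.synchronize()` at the end of the step keeps asynchronous kernel launches from being attributed to the wrong phase; `torch.profiler` gives a finer-grained view if this coarse split is not enough.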

In [None]:
# Get code from repo
!git clone https://github.com/Orion-Medicines/design_team_RE_itw.git
!mv design_team_RE_itw/* .
!rm -rf design_team_RE_itw
!rm -rf sample_data/

Cloning into 'design_team_RE_itw'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 16 (delta 5), reused 16 (delta 5), pack-reused 0
Receiving objects: 100% (16/16), 29.20 KiB | 3.24 MiB/s, done.
Resolving deltas: 100% (5/5), done.


In [None]:
# Download subset of training data
!wget https://files.ipd.uw.edu/pub/training_sets/pdb_2021aug02_sample.tar.gz
!tar xvf "pdb_2021aug02_sample.tar.gz"
!rm pdb_2021aug02_sample.tar.gz

--2024-06-24 13:30:40--  https://files.ipd.uw.edu/pub/training_sets/pdb_2021aug02_sample.tar.gz
Resolving files.ipd.uw.edu (files.ipd.uw.edu)... 128.95.160.134, 128.95.160.135, 2607:4000:406::160:135, ...
Connecting to files.ipd.uw.edu (files.ipd.uw.edu)|128.95.160.134|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49690915 (47M) [application/octet-stream]
Saving to: ‘pdb_2021aug02_sample.tar.gz’


2024-06-24 13:30:44 (15.0 MB/s) - ‘pdb_2021aug02_sample.tar.gz’ saved [49690915/49690915]

./pdb_2021aug02_sample/
./pdb_2021aug02_sample/README
./pdb_2021aug02_sample/list.csv
./pdb_2021aug02_sample/pdb/
./pdb_2021aug02_sample/pdb/l3/
./pdb_2021aug02_sample/pdb/l3/5l3p.pt
./pdb_2021aug02_sample/pdb/l3/5l3g_A.pt
./pdb_2021aug02_sample/pdb/l3/5l3f.pt
./pdb_2021aug02_sample/pdb/l3/5l3r_B.pt
./pdb_2021aug02_sample/pdb/l3/4l3o_G.pt
./pdb_2021aug02_sample/pdb/l3/1l3b_E.pt
./pdb_2021aug02_sample/pdb/l3/3l3t_C.pt
./pdb_2021aug02_sample/pdb/l3/6l3y_A.pt
./pdb_2021aug02_sam

In [None]:
from training.training import main as run_training

class MyArgs(object):
  def __init__(self):
    self.path_for_training_data = "/content/pdb_2021aug02_sample"
    self.path_for_outputs = "/content/test"
    self.previous_checkpoint = ""
    self.num_epochs = 2
    self.save_model_every_n_epochs = 5
    self.reload_data_every_n_epochs = 4
    self.num_examples_per_epoch = 200
    self.batch_size = 2000
    self.max_protein_length = 2000
    self.hidden_dim = 128
    self.num_encoder_layers = 3
    self.num_decoder_layers = 3
    self.num_neighbors = 32
    self.dropout = 0.1
    self.backbone_noise = 0.1
    self.rescut = 3.5
    self.debug = False
    self.gradient_norm = -1.0  # no norm

args = MyArgs()
run_training(args)




epoch: 1, step: 8, time: 6.0, train: 54.672, valid: 57.897, train_acc: 0.058, valid_acc: 0.033
epoch: 2, step: 16, time: 1.3, train: 42.546, valid: 36.189, train_acc: 0.058, valid_acc: 0.024
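
To turn log lines like the two above into tables or plots, they can be parsed into dictionaries first. The sketch below is one way to do this with the standard library; `parse_log_line` is a hypothetical helper, and it assumes every line keeps the `key: number` layout shown in the output above.

```python
import re

# Matches "key: number" pairs such as "epoch: 1" or "train_acc: 0.058".
LOG_PATTERN = re.compile(r"(\w+): (-?\d+(?:\.\d+)?)")


def parse_log_line(line):
    """Convert one training log line into a dict of numeric fields."""
    record = {}
    for key, value in LOG_PATTERN.findall(line):
        # Keep integers (epoch, step) as int and everything else as float.
        record[key] = float(value) if "." in value else int(value)
    return record


# The two log lines printed by the training run above.
log = """\
epoch: 1, step: 8, time: 6.0, train: 54.672, valid: 57.897, train_acc: 0.058, valid_acc: 0.033
epoch: 2, step: 16, time: 1.3, train: 42.546, valid: 36.189, train_acc: 0.058, valid_acc: 0.024
"""

records = [parse_log_line(line) for line in log.splitlines() if line.strip()]
for record in records:
    print(record)
```

The resulting list of dictionaries can be passed directly to a plotting or table library of your choice.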
