## Acknowledgments

Parts of this codebase were adapted from:

- https://github.com/vishakhpk/iter-extrapolation — which implements the iterative controlled extrapolation method
- https://github.com/huggingface/transformers — for model loading, fine-tuning, and tokenization

We thank the original authors for making their work openly available.

In [1]:
import itertools
import numpy as np
import pandas as pd

In [2]:
from utils.pair_data_generation_modules import (
    balance_pair_data,
    create_one_input_pair,
    generate_json_file,
    get_label,
    get_pair_data,
    load_json,
    save_json,
    save_to_jsonl,
)

#### Load Data

In [3]:
# Load the input CSV containing the variant name, the fold changes for Pb and Zn, and the variant sequence
data = pd.read_csv('input.csv')
data.columns = ["Variant", "Pb", "Zn", "seq"]

In [4]:
# Generate All Possible Pairs 
pairs = list(itertools.combinations(data.itertuples(index=False), 2))
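
The number of unordered pairs grows quadratically: n variants yield n(n-1)/2 pairs. The sketch below uses a toy frame with the same four columns as the notebook (the three rows are made-up values, not taken from `input.csv`) to show what each element of `pairs` looks like.

```python
import itertools
import math

import pandas as pd

# Toy data with the notebook's column names; the values are illustrative only.
toy = pd.DataFrame(
    {
        "Variant": ["A", "B", "C"],
        "Pb": [1.0, 2.0, 0.5],
        "Zn": [1.0, 0.4, 0.6],
        "seq": ["MKA", "MKV", "MKL"],
    }
)

toy_pairs = list(itertools.combinations(toy.itertuples(index=False), 2))

# Each pair is a tuple of two namedtuples, so fields can be read by name.
first, second = toy_pairs[0]
print(first.Variant, second.Variant, second.Pb - first.Pb)

# n variants give n * (n - 1) / 2 unordered pairs.
assert len(toy_pairs) == math.comb(len(toy), 2)
```

Because the full list is materialized in memory, the cost of this step scales with the square of the number of variants.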

In [5]:
# Define Threshold For Pair Differences in Each Fold Change Metric (Pb and Zn)
PB_THRESHOLD = 0.5  
ZN_THRESHOLD = 0.5 

In [6]:
# Generate Pair Data
pair_data_df = get_pair_data(pairs, pb_thresh = PB_THRESHOLD, zn_thresh = ZN_THRESHOLD)
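
The internals of `get_pair_data` are not shown in this notebook. As a rough illustration of how thresholds like these could turn a pair into a label such as `inc-dec`, here is a minimal sketch. Everything in it is an assumption: the helper names `direction` and `sketch_label`, the "mutant 2 minus mutant 1" convention, the `<Pb direction>-<Zn direction>` ordering, and the `ns` tag for differences within the threshold.

```python
def direction(delta: float, threshold: float) -> str:
    """Classify a fold-change difference as increase, decrease, or not significant."""
    if delta > threshold:
        return "inc"
    if delta < -threshold:
        return "dec"
    return "ns"  # hypothetical tag for differences within the threshold


def sketch_label(pb1, zn1, pb2, zn2, pb_thresh=0.5, zn_thresh=0.5) -> str:
    # Assumed convention: differences are mutant 2 minus mutant 1,
    # and the label reads "<Pb direction>-<Zn direction>".
    pb_dir = direction(pb2 - pb1, pb_thresh)
    zn_dir = direction(zn2 - zn1, zn_thresh)
    return f"{pb_dir}-{zn_dir}"


# Pb rises by 1.0 and Zn falls by 1.0, both beyond the 0.5 threshold.
print(sketch_label(pb1=1.0, zn1=2.0, pb2=2.0, zn2=1.0))
```

Under this reading, a pair counts as significant only when both metrics move by more than their thresholds. The actual rule in `utils.pair_data_generation_modules` may differ.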

In [7]:
# Balance the data so that each objective category has a similar number of pairs.
# Also check coverage to make sure every mutant appears at least once in the pair data.
balanced_df = balance_pair_data(pair_data_df, data)

Number of insignificant comparisons: 507996
Number of same direction significant comparisons: 83157
Number of opposite direction significant comparisons: 12198
Minimum number of appearances per mutant: 2
Maximum number of appearances per mutant: 342
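
The three comparison counts sum to 603,351, which is the number of unordered pairs among 1,099 variants. How `balance_pair_data` balances them is not shown here. One common approach is to downsample every category to the size of the smallest one and then count how often each mutant appears. The sketch below illustrates that approach only: the function names, the `Category` column, and the toy rows are all hypothetical.

```python
import pandas as pd


def sketch_balance(pairs_df: pd.DataFrame, category_col: str = "Category", seed: int = 0) -> pd.DataFrame:
    """Downsample every category to the size of the smallest category."""
    n_min = int(pairs_df[category_col].value_counts().min())
    return (
        pairs_df.groupby(category_col, group_keys=False)
        .sample(n=n_min, random_state=seed)
        .reset_index(drop=True)
    )


def appearance_counts(pairs_df: pd.DataFrame) -> pd.Series:
    """Count how often each mutant appears on either side of a pair."""
    return pd.concat([pairs_df["Mutant1"], pairs_df["Mutant2"]]).value_counts()


# Toy pair table; mutants and categories are made up for illustration.
toy_pairs_df = pd.DataFrame(
    {
        "Mutant1": ["A", "A", "A", "B", "B"],
        "Mutant2": ["B", "C", "D", "C", "D"],
        "Category": ["same", "same", "same", "opposite", "opposite"],
    }
)

balanced_toy = sketch_balance(toy_pairs_df)
counts = appearance_counts(balanced_toy)

# Coverage check: which mutants never appear in the balanced pairs?
missing = {"A", "B", "C", "D"} - set(counts.index)
print(balanced_toy["Category"].value_counts().to_dict(), missing)
```

Random downsampling can drop a mutant entirely, which is why a coverage check like the one reported above (minimum appearances per mutant) is worth running after balancing.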


In [8]:
# We take a look At the Balanced Pairs:
balanced_df.head()

| Unnamed: 0 | Mutant1 | Mutant2 | Label |
|---|---|---|---|
| 0 | K10A | D64K_K104V_G128M | inc-dec |
| 1 | K10A | D64K_K104V_L107C_G128I | inc-dec |
| 2 | K10A | D64K_K104V_L107C | inc-dec |
| 3 | K10A | M60L_D64K_K104V | inc-dec |
| 4 | K10A | L107C_G128I | inc-dec |


In [9]:
# Collect the pairs into a set of unique (Mutant1, Mutant2) tuples for generating the seq2seq input text
balanced_list_pairs = {(row["Mutant1"], row["Mutant2"]) for _, row in balanced_df.iterrows()}

In [10]:
# Check The Number of Unique Balanced Pairs
print(f"Number of Unique Balanced Pairs: {len(balanced_list_pairs)}")

Number of Unique Balanced Pairs: 24396


#### Create JSONL File for Seq2Seq Model Input

In [11]:
# Generate the JSON records for the seq2seq model input. This is our training data.
# The training data is twice as long as the number of unique balanced pairs,
# because each unique pair is used in both directions (source and target swapped).
training_data = generate_json_file(data, balanced_list_pairs)
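
The doubling described in the comments can be illustrated with a short sketch. The real record layout produced by `generate_json_file` is not shown in this notebook, so the `source` and `target` keys, the helper name `sketch_records`, and the toy sequences are assumptions. The real records presumably carry more information, such as the pair label.

```python
def sketch_records(seq_by_variant, unique_pairs):
    """Emit two records per unique pair: one in each direction."""
    records = []
    for mutant1, mutant2 in sorted(unique_pairs):
        seq1, seq2 = seq_by_variant[mutant1], seq_by_variant[mutant2]
        records.append({"source": seq1, "target": seq2})
        # Reverse the direction so the pair is seen both ways during training.
        records.append({"source": seq2, "target": seq1})
    return records


# Toy lookup and pair set; names and sequences are made up for illustration.
toy_sequences = {"A": "MKA", "B": "MKV", "C": "MKL"}
toy_unique_pairs = {("A", "B"), ("A", "C")}

toy_records = sketch_records(toy_sequences, toy_unique_pairs)

# Two records per unique pair.
assert len(toy_records) == 2 * len(toy_unique_pairs)
```

This matches the counts reported in the notebook: 24,396 unique balanced pairs and a training set of 48,792 records.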

In [12]:
# Give Dataset name for naming JSONL file
dataset_name = "PbrR"

In [13]:
# Save the list of dictionaries in JSON Lines format (note that the file name keeps a .json extension)
save_to_jsonl(f"pair_data_{dataset_name}.json", training_data)
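
JSON Lines stores one JSON object per line, so the file can be streamed record by record. The sketch below shows one way such a writer and a matching reader might look. It is a stand-in, not the actual `save_to_jsonl` from `utils.pair_data_generation_modules`, and the function names, record keys, and file name are hypothetical.

```python
import json
import os
import tempfile


def sketch_save_to_jsonl(path, records):
    """Write one JSON object per line (hypothetical stand-in for save_to_jsonl)."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def sketch_load_jsonl(path):
    """Read the file back, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Round-trip two toy records through a temporary file.
records = [{"source": "MKA", "target": "MKV"}, {"source": "MKV", "target": "MKA"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pair_data_demo.json")
    sketch_save_to_jsonl(path, records)
    loaded = sketch_load_jsonl(path)

assert loaded == records
```

A file written this way is not a single valid JSON document, so it should be read line by line rather than with one `json.load` call.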

In [14]:
# Print Length of Training Set 
print(f"Length of Training Set: {len(training_data)}")

Length of Training Set: 48792
