# Abstract
This notebook walks through running the supervised and semi-supervised fine-tuning pipelines implemented in this repo. It uses custom `CSV` files as datasets for fine-tuning BERT-like models.<br/>

I have included sample dataset files to help you get familiar with the expected input format. Please ensure that each dataset file has the following columns:
1. `smiles` -> The input molecular SMILES string.
2. `y` -> The label. For unlabelled samples used by the semi-supervised models, set `y` to `-1`.
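
Before launching a long run it can help to sanity-check a dataset file against the two columns listed above. The snippet below is a minimal sketch, not part of the repo: the helper name `summarize_dataset` is hypothetical, and it assumes the labels in `y` are numeric so that `-1` can be detected with a float comparison.

```python
import csv
import io

REQUIRED_COLUMNS = {"smiles", "y"}


def summarize_dataset(csv_file):
    """Check the required columns and count labelled vs. unlabelled rows.

    Hypothetical helper; assumes numeric labels, with -1 marking unlabelled rows.
    """
    reader = csv.DictReader(csv_file)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")

    labelled = unlabelled = 0
    for row in reader:
        if float(row["y"]) == -1.0:
            unlabelled += 1
        else:
            labelled += 1
    return {"labelled": labelled, "unlabelled": unlabelled}


# Tiny in-memory example; for a real file use: open(train_path, newline="")
sample = io.StringIO("smiles,y\nCCO,1\nc1ccccc1,0\nCC(=O)O,-1\n")
summary = summarize_dataset(sample)
print(summary)  # {'labelled': 2, 'unlabelled': 1}
```

A file used for purely supervised training would be expected to report zero unlabelled rows.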

## 0. Setup
Set up the environment by creating a conda virtual environment with the following command. <br/>
`conda create -n drug_discovery_v1 python==3.8.13` <br/>

Install the required dependencies.<br/>
`pip install -r requirements.txt`

## 1. Supervised Fine-tuning using custom files (MLM/MTR)

In [None]:
# Uncomment the line below and run this cell before the next one; change the file name so that it matches `out_file`
#!touch eval_result_supervised_augment.csv
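
Since the output file has to exist before training starts, a portable alternative to `!touch` (which relies on a Unix shell) is to create it from Python. This is a minimal sketch using only the standard library; `ensure_output_file` is a hypothetical helper name, not something the repo provides.

```python
import tempfile
from pathlib import Path


def ensure_output_file(path):
    """Create an empty file at `path` if it does not exist yet.

    Existing files are left in place; their contents are not truncated.
    """
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.touch(exist_ok=True)
    return out


# Demonstration in a temporary directory; in the notebook, pass `out_file` instead.
with tempfile.TemporaryDirectory() as tmp:
    created = ensure_output_file(Path(tmp) / "eval_result_supervised.csv")
    print(created.exists())  # True
```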

In [None]:
# Set up parameters/hyperparameters for training

from tqdm import tqdm

"""
Data related configuration
"""
datasets = ['custom'] # set always to custom when reading data from CSV
#path of the folder in which this notebook is located, can be obtained by running `pwd` command on linux
folder_path="/Users/shahrukh/Desktop/moleculenet-bert-ssl/notebooks" 
train_path=f'{folder_path}/data/train.csv' # replace this with full path to train split of your input data
valid_path=f'{folder_path}/data/val.csv' # replace this with full path to validation split of your input data
test_path=f'{folder_path}/data/test.csv' # replace this with full path to test split of your input data
out_file = f'{folder_path}/eval_result_supervised_augment.csv' # full path to output file, ensure this file already exists
SAMPLES_PER_CLASS = [200] # set to maximum number of samples in a class
N_AUGMENT = [2] # number of augmentations on each input sample

"""
Training related configuration
"""
N_TRIALS = 20 # number of times you want to repeat the training to obtain standard error later
EPOCHS = 20 # number of training epochs per trial
model_name_or_path= "shahrukhx01/smole-bert" #name of the model from huggingface model hub or path from file system
DO_SEMI_SUPERVISED_TRAINING=0 # SET this to `0` when you want to do supervised training 

## run the experiment here
for dataset in datasets:
    for SAMPLE in SAMPLES_PER_CLASS:
        for n_augment in N_AUGMENT:
            for i in tqdm(range(N_TRIALS)):
                !python ../pseudo_label/main.py --dataset-name={dataset} --epochs={EPOCHS} \
                --batch-size=16 --model-name-or-path={model_name_or_path} --samples-per-class={SAMPLE} \
                --eval-after={EPOCHS} --train-log=0 --train-ssl={DO_SEMI_SUPERVISED_TRAINING} --out-file={out_file} \
                --n-augment={n_augment} --train-path={train_path} --val-path={valid_path} --test-path={test_path}
                !cat {out_file}
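
The loop above repeats training `N_TRIALS` times so that a standard error can be computed afterwards. The sketch below shows that last step on a plain list of per-trial scores. It deliberately does not parse `out_file`, because the layout of that file is not described here; the scores are made-up placeholder values, and `mean_and_standard_error` is a hypothetical helper.

```python
import math
import statistics


def mean_and_standard_error(scores):
    """Return (mean, standard error of the mean) for per-trial scores."""
    if len(scores) < 2:
        raise ValueError("need at least two trials to estimate a standard error")
    mean = statistics.mean(scores)
    # Sample standard deviation (n - 1 in the denominator) divided by sqrt(n)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


# Placeholder values for illustration only, not results from this repo
trial_scores = [0.80, 0.82, 0.78, 0.84, 0.76]
mean, sem = mean_and_standard_error(trial_scores)
print(f"{mean:.3f} +/- {sem:.3f}")
```

Once the relevant metric column of the output file is known, its values can be passed in place of `trial_scores`.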

## 2. Semi-supervised Pseudo-label-based Fine-tuning using custom files

In [None]:
# Uncomment the line below and run this cell before the next one; change the file name so that it matches `out_file`
#!touch eval_result_pseudo_label.csv

In [None]:
# Set up parameters/hyperparameters for training

from tqdm import tqdm

"""
Data related configuration
"""
datasets = ['custom'] # set always to custom when reading data from CSV
#path of the folder in which this notebook is located, can be obtained by running `pwd` command on linux
folder_path="/Users/shahrukh/Desktop/moleculenet-bert-ssl/notebooks" 
train_path=f'{folder_path}/data/train_semi_supervised.csv' # replace this with full path to train split of your input data
valid_path=f'{folder_path}/data/val.csv' # replace this with full path to validation split of your input data
test_path=f'{folder_path}/data/test.csv' # replace this with full path to test split of your input data
out_file = f'{folder_path}/eval_result_pseudo_label.csv' # full path to output file, ensure this file already exists
SAMPLES_PER_CLASS = [200] # set to maximum number of samples in a class
N_AUGMENT = [2] # number of augmentations on each input sample

"""
Training related configuration
"""
N_TRIALS = 20 # number of times you want to repeat the training to obtain standard error later
EPOCHS = 20 # number of training epochs per trial
model_name_or_path= "shahrukhx01/smole-bert" #name of the model from huggingface model hub or path from file system
DO_SEMI_SUPERVISED_TRAINING=1 # SET this to `1` when you want to do Semi-supervised training 

## run the experiment here
for dataset in datasets:
    for SAMPLE in SAMPLES_PER_CLASS:
        for n_augment in N_AUGMENT:
            for i in tqdm(range(N_TRIALS)):
                !python ../pseudo_label/main.py --dataset-name={dataset} --epochs={EPOCHS} \
                --batch-size=16 --model-name-or-path={model_name_or_path} --samples-per-class={SAMPLE} \
                --eval-after={EPOCHS} --train-log=0 --train-ssl={DO_SEMI_SUPERVISED_TRAINING} --out-file={out_file} \
                --n-augment={n_augment} --train-path={train_path} --val-path={valid_path} --test-path={test_path}
                !cat {out_file}

## 3. Semi-supervised Co-training-based Fine-tuning using custom files

In [None]:
# Uncomment the line below and run this cell before the next one; change the file name so that it matches `out_file`
#!touch eval_result_co_training.csv

In [None]:
# Set up parameters/hyperparameters for training

from tqdm import tqdm

"""
Data related configuration
"""
datasets = ['custom'] # set always to custom when reading data from CSV
#path of the folder in which this notebook is located, can be obtained by running `pwd` command on linux
folder_path="/Users/shahrukh/Desktop/moleculenet-bert-ssl/notebooks" 
train_path=f'{folder_path}/data/train_semi_supervised.csv' # replace this with full path to train split of your input data
valid_path=f'{folder_path}/data/val.csv' # replace this with full path to validation split of your input data
test_path=f'{folder_path}/data/test.csv' # replace this with full path to test split of your input data
out_file = f'{folder_path}/eval_result_co_training.csv' # full path to output file, ensure this file already exists
SAMPLES_PER_CLASS = [200] # set to maximum number of samples in a class

"""
Training related configuration
"""
N_TRIALS = 20 # number of times you want to repeat the training to obtain standard error later
EPOCHS = 20 # number of training epochs per trial
model_name_or_path= "shahrukhx01/smole-bert" #name of the model from huggingface model hub or path from file system
posterior_threshold = 0.9 # value passed to the --posterior-threshold argument of co_training/main.py
DO_SEMI_SUPERVISED_TRAINING=1 # SET this to `1` when you want to do Semi-supervised training 

## run the experiment here
for dataset in datasets:
    for SAMPLE in SAMPLES_PER_CLASS:
        for i in tqdm(range(N_TRIALS)):
            !python ../co_training/main.py --dataset-name={dataset} --epochs={EPOCHS} \
            --batch-size=16 --model-name-or-path={model_name_or_path} --samples-per-class={SAMPLE} \
            --eval-after={EPOCHS} --train-log=0 --train-ssl={DO_SEMI_SUPERVISED_TRAINING} --out-file={out_file} \
            --train-path={train_path} --val-path={valid_path} --test-path={test_path} --posterior-threshold={posterior_threshold}
            !cat {out_file}
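
For intuition about `posterior_threshold`: in many semi-supervised setups, a threshold of this kind decides which unlabelled samples are confident enough to receive a pseudo-label. The sketch below illustrates that general idea only. It is an assumption about the role of the parameter, not the confirmed logic of `co_training/main.py`, and the function name and the `>=` comparison are hypothetical choices.

```python
def select_confident_pseudo_labels(posteriors, threshold=0.9):
    """Return (row_index, predicted_class) pairs whose top class probability
    is at least `threshold`.

    `posteriors` is a list of per-sample class-probability lists.
    Generic illustration; not taken from the repo's co-training code.
    """
    selected = []
    for index, probs in enumerate(posteriors):
        best_class = max(range(len(probs)), key=probs.__getitem__)
        if probs[best_class] >= threshold:
            selected.append((index, best_class))
    return selected


# Made-up posteriors for three unlabelled samples and two classes
posteriors = [[0.95, 0.05], [0.60, 0.40], [0.08, 0.92]]
confident = select_confident_pseudo_labels(posteriors, threshold=0.9)
print(confident)  # [(0, 0), (2, 1)]
```

Under this reading, a higher threshold admits fewer but more reliable pseudo-labels, while a lower one admits more samples at the cost of noisier labels.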