# **Creating Execution Environment**
Keep the original folder names, then run the cells below to enter the project folder and install the dependencies.

In [2]:
# Uncomment the two lines below to force the Colab runtime to restart
#import os
#os._exit(00)

Mount Google Drive to access the files

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Access the drive folder

In [4]:
cd drive

/content/drive


Access the MyDrive folder

In [5]:
cd MyDrive

/content/drive/MyDrive


Access the BioPrediction-PPI folder, which contains the BioPrediction files

In [6]:
cd BioPrediction-PPI

/content/drive/MyDrive/BioPrediction-PPI
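
The three `cd` cells above could also be collapsed into one step that fails early if a folder was renamed. The following is a minimal sketch, not part of BioPrediction-PPI: the helper name `enter_project` is hypothetical, and it assumes the project sits directly under `MyDrive` and contains `BioPrediction.py`.

```python
import os
from pathlib import Path


def enter_project(root="/content/drive/MyDrive", name="BioPrediction-PPI"):
    """Change into the project folder, or raise if it does not look right.

    `enter_project` is a hypothetical helper; the default path mirrors the
    folders visited by the `cd` cells in this notebook.
    """
    project = Path(root) / name
    # BioPrediction.py is the script this notebook calls, so use it as a marker.
    if not (project / "BioPrediction.py").is_file():
        raise FileNotFoundError(
            f"BioPrediction.py not found in {project}; "
            "check that the folder names were kept unchanged"
        )
    os.chdir(project)
    return Path.cwd()
```

In Colab, after mounting the drive, calling `enter_project()` with no arguments would replace the three cells; a renamed folder then produces an explicit error instead of a later "file not found" from the training command.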


Install the required Python libraries

In [7]:
print("Downloading dependencies")
!pip install attrs==23.2.0 biopython==1.83 bioservices==1.11.2 catboost==1.2.3 cattrs==23.2.3 colorlog==6.8.2 dgl==2.1.0 easydev==0.13.1 fst-pso==1.8.1 fuzzytm==2.0.5 gevent==24.2.1 greenlet==3.0.3 grequests==0.7.0 hyperopt==0.2.7 imbalanced-learn==0.12.0 line-profiler==4.1.2 miniful==0.0.6 networkx==2.8.8 node2vec==0.4.6 pexpect==4.9.0 platformdirs==4.2.0 polars==0.20.10 py4j==0.10.9.7 pyfume==0.2.25 graphviz==0.20.1 reportlab==4.1.0 requests-cache==1.2.0 scikit-learn==1.4.1.post1 shap==0.44.1 simpful==2.11.1 slicer==0.0.7 suds-community==1.1.2 torch==2.2.1 torchdata==0.7.1 typing-extensions==4.9.0 uniprot==1.3 url-normalize==1.4.3 xgboost==2.0.3 xmltodict==0.13.0 zope-event==5.0 > /dev/null 2>&1
print("Downloaded dependencies")

Downloading dependencies
Downloaded dependencies
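
Because the install cell redirects all `pip` output to `/dev/null`, a failed install prints the same two messages as a successful one. A quick check of a few pinned versions can confirm the environment. This is a sketch using the standard-library `importlib.metadata`; the helper `check_pins` and the subset of packages in `PINS` are illustrative choices, not part of the tool.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pins(pins):
    """Return {package: (expected, found)} for every pin that does not match.

    `found` is None when the package is not installed at all.
    """
    problems = {}
    for name, expected in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        if found != expected:
            problems[name] = (expected, found)
    return problems


# A subset of the versions pinned in the install cell (illustrative selection).
PINS = {
    "biopython": "1.83",
    "scikit-learn": "1.4.1.post1",
    "torch": "2.2.1",
    "dgl": "2.1.0",
    "xgboost": "2.0.3",
}

for name, (expected, found) in check_pins(PINS).items():
    print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The comparison is an exact string match, so a build with a local version suffix (for example a CUDA-tagged `torch` wheel) would be reported as a mismatch even though the base version is the same.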


# **Training a New Model**

To run BioPrediction-PPI, known interactions in the given context and all proteins in FASTA format are required.

**To start the tool**, after creating the environment, fill in the form below or run `BioPrediction.py` directly. The available options can be listed with: `!python BioPrediction.py -h`

The options are:

    -input_interactions_train: CSV file with the interaction table (the first and second columns hold the protein names, and the third holds the label: 1 for interaction, 0 for non-interaction), e.g., all-data/data_human_virus/Sars/interaction.csv

    -input_interactions_candidates: CSV file with the candidate interactions to predict (also three columns, with 2 in the third column for unlabeled candidates), e.g., all-data/data_human_virus/Sars/interaction.csv

    -sequences_dictionary: FASTA file with all the sequences, e.g., all-data/data_human_virus/Sars/dictionary.fasta

    The sequence dictionary must contain every sequence that appears in the training, test, and candidate sets.

    -topological_features: whether to use topological features to characterize the sequences (yes or no, default=yes)

    -output: output path, e.g., sars_experiment

    Pass the arguments as strings when running BioPrediction, as in this complete example:
    !python BioPrediction.py -input_interactions_train all-data/data_human_virus/Sars/interaction.csv -sequences_dictionary all-data/data_human_virus/Sars/dictionary.fasta -output sars_test -input_interactions_candidates all-data/data_human_virus/Sars/interaction.csv

The trained model files will be in the tool's folder with the name specified in the output.
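
Since the sequence dictionary must cover every protein named in the interaction tables, it can help to check this before a long training run. The sketch below is not part of BioPrediction-PPI and rests on assumptions that should be verified against your files: the CSV is comma-delimited, the first whitespace-delimited token of each FASTA header is the protein name used in the CSV, and any row whose third column is not 0, 1, or 2 is a header and can be skipped. The function names are hypothetical.

```python
import csv


def fasta_names(path):
    """Collect protein names from FASTA headers.

    Assumption: the name is the first whitespace-delimited token after '>'.
    """
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                if header:
                    names.add(header.split()[0])
    return names


def missing_proteins(interactions_csv, fasta_path):
    """Return the protein names used in the CSV but absent from the FASTA file."""
    known = fasta_names(fasta_path)
    missing = set()
    with open(interactions_csv, newline="") as handle:
        for row in csv.reader(handle):
            # Assumption: rows without a 0/1/2 label are headers or malformed.
            if len(row) < 3 or row[2].strip() not in {"0", "1", "2"}:
                continue
            for name in row[:2]:
                if name.strip() not in known:
                    missing.add(name.strip())
    return missing


# Example call with the paths used in this notebook:
# missing_proteins("all-data/data_human_virus/Sars/interaction.csv",
#                  "all-data/data_human_virus/Sars/dictionary.fasta")
```

An empty set means every protein in the table has a sequence; any names returned would need to be added to the FASTA file or removed from the table.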

In [7]:
# @title **Form to train a model using BioPrediction-PPI**
input_interaction_train = "all-data/data_human_virus/Sars/interaction.csv" # @param {"type":"string","placeholder":"This field is required"}
sequences_dictionary = "all-data/data_human_virus/Sars/dictionary.fasta" # @param {"type":"string","placeholder":"This field is required"}
output = "sars_final" # @param {"type":"string","placeholder":"This field is required"}
input_interactions_candidates = "all-data/data_human_virus/Sars/interaction.csv" # @param {"type":"string","placeholder":"optional"}
topology_features = "yes" # @param {"type":"string","placeholder":"default = yes"}


# Fall back to the default when the field is left empty
if topology_features == "":
  topology_features = "yes"

# Pass -input_interactions_candidates only when a candidates file was given
if input_interactions_candidates == "":
  !python BioPrediction.py -input_interactions_train {input_interaction_train} -sequences_dictionary {sequences_dictionary} -output {output} -topological_features {topology_features}
else:
  !python BioPrediction.py -input_interactions_train {input_interaction_train} -sequences_dictionary {sequences_dictionary} -input_interactions_candidates {input_interactions_candidates} -output {output} -topological_features {topology_features}

DGL backend not selected or invalid.  Assuming PyTorch for now.
Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)
input_interactions_train - all-data/data_human_virus/Sars/interaction.csv: Found File
sequences_dictionary - all-data/data_human_virus/Sars/dictionary.fasta: Found File
Make the folds
Topology features extraction fold1
Extracting topological features...
Topology features extraction fold2
Extracting topological features...
Topology features extraction fold3
Extracting topological features...
Topology features extraction fold4
Extracting topological features...
Topology features extraction fold5
Extracting topological features...
Extracting extructural features with MathFeature...
Starting the model training stage

The metrics for fold 1 are available in sars_final/folds_and_topology_feats/fold1/metrics_model_final.csv
The pred

# **Using a Pre-Trained Model to Predict Candidate Interactions**

A pre-trained model can be reused to predict new candidate interactions between the proteins in the FASTA file.

To do this, run `reuse.py`. Its options can be listed with: `!python reuse.py -h`

The options are:

    -input_interactions_candidates: CSV file with the candidate interactions to predict (three columns, with 2 in the third column for unlabeled candidates), e.g., all-data/data_human_virus/Sars/interaction.csv

    -trained_model_path: path to the trained model, e.g., sars_experiment

    -topological_features: whether topological features were used to characterize the sequences during training (yes or no, default=yes)

    -fold_prediction: the fold whose model will be used to predict the candidates, from 1 to 5 (default=1)

    Pass the arguments as strings when reusing a BioPrediction model, as in this complete example:
    !python reuse.py -trained_model_path sars_test -input_interactions_candidates all-data/data_human_virus/Sars/interaction.csv

    Alternatively, fill in the blank fields of the form below and run the cell.

In [8]:
# @title **Form to rerun a trained model to predict new candidates**
trained_model_path = "sars_final" # @param {"type":"string","placeholder":"This field is required"}
input_interactions_candidates = "all-data/data_human_virus/Sars/interaction.csv" # @param {"type":"string","placeholder":"This field is required"}
topology_features = "yes" # @param {"type":"string","placeholder":"yes"}
fold_prediction = "1" # @param {"type":"string","placeholder":"1"}

if topology_features == "":
  topology_features = "yes"

# Fall back to the default fold; kept as a string to match the form field
if fold_prediction == "":
  fold_prediction = "1"


!python reuse.py -trained_model_path {trained_model_path} -input_interactions_candidates {input_interactions_candidates} -topological_features {topology_features} -fold_prediction {fold_prediction}

DGL backend not selected or invalid.  Assuming PyTorch for now.
Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)
Fold number 1 was used
Predicted and saved interactor candidates in sars_final/folds_and_topology_feats/fold1/candidates_prediction.csv
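
A candidates file for `-input_interactions_candidates` needs three columns, with 2 in the third column for each unlabeled pair. The following is a minimal sketch of how such a file could be generated from two lists of protein names. It is not part of BioPrediction-PPI: the helper `write_candidates` and the protein names in the usage comment are hypothetical, and it assumes a comma-delimited file without a header row, which should be checked against the example file shipped with the tool.

```python
import csv
from itertools import product


def write_candidates(path, proteins_a, proteins_b):
    """Write every (a, b) pair as an unlabeled candidate and return the row count.

    Assumption: the tool reads a comma-delimited file with no header row.
    """
    count = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for a, b in product(proteins_a, proteins_b):
            writer.writerow([a, b, 2])  # 2 marks an unlabeled candidate
            count += 1
    return count


# Example call with hypothetical protein names:
# write_candidates("my_candidates.csv", ["HUMAN_1", "HUMAN_2"], ["VIRUS_1"])
```

Every protein written to the file must also have a sequence in the FASTA dictionary used when the model was trained.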
