# Example of running VHIP



In [1]:
import os

## Running VHIP

Import the two VHIP classes used in this notebook:

In [2]:
from vhip.predict_interactions import PredictInteractions
from vhip.mlmodel.build import BuildModel

Running VHIP takes two steps:
1. Building the machine learning model (using the best hyperparameters)
2. Applying the model to the data of interest

This notebook demonstrates the entire pipeline. Before running VHIP on your own data, make sure this notebook runs with the provided example; this confirms that the VHIP module is installed correctly.

### 1. Retrain the machine learning model 

The first step is to retrain the machine learning model using the best hyperparameters identified in the VHIP manuscript.

In [3]:
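# Path to the training data for the machine learning model
ml_training = "./ml_input.csv"

# Build and train the model; the trained model is reused in step 2
model = BuildModel(ml_training)
VHIP = model.build()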

The dataframe is made of 6925 rows and 4 columns!
Accuracy on training set: 1.000
Accuracy on test set: 0.873


Now that the model is loaded and trained, we can apply it to new datasets. 

For the input, you will need:
1. The directory of virus sequences. Each file should represent a unique virus.
2. The directory of host sequences. Each file should represent a unique host.
3. The blastn output of virus sequences against host sequences.
4. The blastn output of virus sequences against host spacers.

The example below also passes directories of annotated gene sequences for the viruses and the hosts.

In [4]:
virus_genomes_directory_path = "./virus_genomes/"
host_genomes_directory_path = "./host_genomes/"
virus_genes_directory_path = "./virus_genes/"
host_genes_directory_path = "./host_genes/"

blastn_path = "./StaphStudy_virusvhosts.tsv"
spacer_path = "./StaphStudy_virusvspacers_blastn.tsv"
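Before running the model, it can help to confirm that these paths exist. The following is a minimal sketch using only the Python standard library; the helper name `check_vhip_inputs` is hypothetical and not part of VHIP, and it only checks that directories are non-empty and files are present, not that their contents are valid.

```python
from pathlib import Path


def check_vhip_inputs(directories, files):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper (not part of VHIP). `directories` and `files` map a
    human-readable label to a path.
    """
    problems = []
    for label, path in directories.items():
        p = Path(path)
        if not p.is_dir():
            problems.append(f"{label}: directory not found: {path}")
        elif not any(child.is_file() for child in p.iterdir()):
            problems.append(f"{label}: directory contains no files: {path}")
    for label, path in files.items():
        if not Path(path).is_file():
            problems.append(f"{label}: file not found: {path}")
    return problems


# Same paths as in the cell above
problems = check_vhip_inputs(
    {
        "virus genomes": "./virus_genomes/",
        "host genomes": "./host_genomes/",
        "virus genes": "./virus_genes/",
        "host genes": "./host_genes/",
    },
    {
        "blastn (virus vs hosts)": "./StaphStudy_virusvhosts.tsv",
        "blastn (virus vs spacers)": "./StaphStudy_virusvspacers_blastn.tsv",
    },
)
for problem in problems:
    print(problem)
```

If nothing is printed, every directory contains at least one file and both blastn files were found.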

Let's also define the output filename and the number of CPU cores to use.

In [5]:
output_filename = "./predictions.tsv"
CPU_CORES = 6
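The value 6 suits the machine this example was run on. On a different machine, one option is to cap the requested number at the cores actually available. This is a small sketch using the standard library's `os.cpu_count()`; the helper name `choose_cpu_cores` is hypothetical and not part of VHIP.

```python
import os


def choose_cpu_cores(requested):
    """Clamp the requested core count to the range [1, available cores]."""
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, min(requested, available))


CPU_CORES = choose_cpu_cores(6)
print(f"Using {CPU_CORES} of {os.cpu_count()} available cores")
```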

### 2. Run the model


In [6]:
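# Point VHIP at the genome and annotated-gene directories
predictions = PredictInteractions(
    virus_genomes_directory_path,
    host_genomes_directory_path,
    virus_genes_directory_path,
    host_genes_directory_path,
)
predictions.model = VHIP  # use the model trained in step 1
# Register the virus-vs-host and virus-vs-spacer blastn results
predictions.add_blastn_files(blastn_path, spacer_path)
predictions.do_setup()
predictions.run_parallel(CPU_CORES)
predictions.predict()
predictions.save_predictions(output_filename)  # write to ./predictions.tsv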

SETUP - ...indexing genome fasta filenames for viruses and hosts...
SETUP - ...indexing annotated gene fasta filenames for viruses and hosts...
SETUP - ...initialize all pairs...
-------> There are 40 viral sequences
-------> There are 27 host sequences
-------> Total number of interactions: 1080
SETUP - ...getting fasta headers...
SETUP - ...process blastn and spacers output...
SETUP - ...calculate GC content and k-mer profiles...
-------> current pair: GCA_021090845.1_ASM2109084v1_genomic.fasta | Staphylococcus_pettenkoferi.fasta
-------> current pair: GCA_021090945.1_ASM2109094v1_genomic.fasta | Staphylococcus_pseudintermedius.fasta
-------> current pair: GCA_021090835.1_ASM2109083v1_genomic.fasta | Staphylococcus_hominis.fasta
-------> current pair: GCA_021090825.1_ASM2109082v1_genomic.fasta | Staphylococcus_pseudintermedius.fasta
-------> current pair: GCA_021090955.1_ASM2109095v1_genomic.fasta | Staphylococcus_pettenkoferi.fasta
-------> current pair: GCA_021090975.1_ASM2109097v1