# Example of running tool 

## blastn jobs

Before running VHIP to predict virus-host interactions, two blastn runs are needed:
1. blastn of virus sequences against host sequences
2. blastn of virus sequences against host spacers

Here are guidelines for running the blastn jobs.
First, combine the virus sequences into a single file (`cat ./viruses/*.fasta > allviruses.fasta`).
1. Run CRISPRCasFinder on the host sequences of interest (each host should be in its own `.fasta` file).  
    a. Extract the spacers and store them in a single `.fasta` file. Two helper scripts are included in the `helper_scripts/` folder:  
        i. To get a table of all spacers, use the `results_to_csv.sh` script.  
        ii. To convert the csv file into a fasta file of spacers, use the `csv_to_multifasta.py` script.  
    b. Make a blastn database from the spacers (`makeblastdb -in fastafilename.fasta -title thetitleyouwant -dbtype nucl`).  
    c. Run blastn of the viruses against the spacers. The output of this blastn is one of the two inputs needed by VHIP.  
2. Run blastn of the viruses against the hosts.  
    a. Combine all the hosts into a single file (`cat ./host_sequences/*.fasta > allhosts.fasta`).  
    b. Make a blastn database from the combined host file.  
    c. Run blastn of the viruses against the host sequences. The output of this blastn is the other input needed by VHIP.
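
The steps above can also be scripted. The following is a minimal sketch and not part of VHIP: the helper names (`concatenate_fasta`, `makeblastdb_cmd`, `blastn_cmd`) and the database and file names are hypothetical, and the tabular output format (`-outfmt 6`) is an assumption. Please check which blastn output format VHIP expects before relying on it.

```python
import shutil
import subprocess
from pathlib import Path


def concatenate_fasta(input_dir, output_file):
    """Python equivalent of `cat input_dir/*.fasta > output_file`.

    Returns the number of files that were combined.
    """
    fasta_files = sorted(Path(input_dir).glob("*.fasta"))
    with open(output_file, "wb") as out:
        for fasta in fasta_files:
            with open(fasta, "rb") as handle:
                shutil.copyfileobj(handle, out)
    return len(fasta_files)


def makeblastdb_cmd(fasta, db_name):
    """Build the makeblastdb command for a nucleotide database."""
    return ["makeblastdb", "-in", str(fasta), "-dbtype", "nucl",
            "-title", db_name, "-out", db_name]


def blastn_cmd(query, db_name, output, outfmt="6"):
    """Build the blastn command (outfmt 6 = tabular; an assumption here)."""
    return ["blastn", "-query", str(query), "-db", db_name,
            "-out", str(output), "-outfmt", outfmt]


def run(cmd):
    """Run a command and raise an error if it fails (requires BLAST+ installed)."""
    subprocess.run(cmd, check=True)
```

For example, `run(makeblastdb_cmd("allhosts.fasta", "hosts_db"))` followed by `run(blastn_cmd("allviruses.fasta", "hosts_db", "virus_vs_hosts.tsv"))` would cover step 2. The same two calls with the spacer fasta file would cover steps 1b and 1c.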

The blastn files for this example are included as part of this tutorial. 


#  Running VHIP 


First, let's load the `PredictInteractions` class that will make the predictions. Please make sure that your conda environment is set up correctly. The list of modules required to run this tool can be found in the `requirements.txt` file.


In [1]:
from vhip.predict_interactions import PredictInteractions

# path of saved machine learning model 
model_path = "../src/vhip/gbrt.skops"

Next, the user needs to define the following parameters:
1. Location of the virus fasta files. Each virus should be in its own separate file.
2. Location of the host fasta files. Each host should be in its own separate file as well.
3. Blastn results of viruses against hosts, and of viruses against spacers.
4. The filename for the output.
5. The number of CPU cores to be used.

IMPORTANT: Viruses and hosts have to be in separate folders. 
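
Before starting a long run, it can help to check these inputs. The function below is a small sketch and not part of VHIP: `check_inputs` is a hypothetical helper, and it assumes that sequence files use the `.fasta` extension as in the rest of this tutorial.

```python
from pathlib import Path


def check_inputs(virus_dir, host_dir, blastn_file, spacer_file):
    """Return a list of problems found with the input paths (empty if none)."""
    problems = []
    virus_dir, host_dir = Path(virus_dir), Path(host_dir)

    # Each directory must exist and contain at least one .fasta file.
    for label, directory in (("virus", virus_dir), ("host", host_dir)):
        if not directory.is_dir():
            problems.append(f"{label} directory not found: {directory}")
        elif not any(directory.glob("*.fasta")):
            problems.append(f"no .fasta files in {label} directory: {directory}")

    # Viruses and hosts have to be in separate folders.
    if (virus_dir.is_dir() and host_dir.is_dir()
            and virus_dir.resolve() == host_dir.resolve()):
        problems.append("viruses and hosts must be in separate folders")

    # Both blastn result files must exist.
    for label, file_path in (("virus-vs-host blastn", blastn_file),
                             ("virus-vs-spacer blastn", spacer_file)):
        if not Path(file_path).is_file():
            problems.append(f"{label} file not found: {file_path}")

    return problems
```

An empty list means that no problem was detected. Otherwise each entry describes one issue to fix before running the tool.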

Here, I will use the `test_set` folder as an example.

There are two different ways to use the tool: 
1. Make a prediction for each possible virus-host pair. 
2. Make predictions only for virus-host pairs of interest by providing a tsv file as an additional input.

In either case, the user needs to provide the virus and host sequences in separate folders, along with the blastn output files.

### 1. **Make a prediction for each possible virus-host pair**


Regarding the number of CPU cores, I strongly recommend using at least 6 if possible. With only 1 core, it takes around 40 minutes to run this tool on the test set. With 6 cores, it takes only about 5 minutes.
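
If you are not sure how many cores your machine has, the standard library can report it. This is a small sketch with a hypothetical helper name (`choose_cpu_cores`) and is not part of VHIP. It caps the requested number at what the machine reports.

```python
import os


def choose_cpu_cores(requested=6):
    """Return the requested number of cores, capped at what is available."""
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, min(requested, available))


print(choose_cpu_cores(6))
```

The returned value could then be used in place of the hard-coded `CPU_CORES` value.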

In this mode, the tool assumes that the user is interested in predicting interactions for every possible virus-host combination.
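
To get a feel for how many predictions this means, the number of pairs is the number of viruses multiplied by the number of hosts. The sketch below only illustrates that arithmetic and is not how VHIP builds its pairs internally. The filenames are placeholders, and the counts (40 viruses, 27 hosts) are those of the test set.

```python
from itertools import product


def all_pairs(virus_files, host_files):
    """Return every (virus, host) combination as a list of tuples."""
    return list(product(virus_files, host_files))


# Placeholder filenames matching the size of the test set.
viruses = [f"virus_{i}.fasta" for i in range(40)]
hosts = [f"host_{j}.fasta" for j in range(27)]

pairs = all_pairs(viruses, hosts)
print(len(pairs))  # 40 * 27 = 1080
```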

The output file will be saved to the path given by `output_filename` (in the cell below, the `output/` folder).


In [2]:
# USER INPUTS
virus_directory_path = './test_set/virus_sequences/'
host_directory_path = './test_set/host_sequences/'

blastn_path = './test_set/StaphStudy_virusvhosts.tsv'
spacer_path = './test_set/StaphStudy_virusvspacers_blastn.tsv'

output_filename = './output/test_allpossiblepairs_predictions.tsv'

CPU_CORES = 6

The code below computes the predictions. There is nothing for the user to change!

In [3]:
# run model 
predictions = PredictInteractions(virus_directory_path, host_directory_path)
predictions.add_blastn_files(blastn_path, spacer_path)
predictions.load_model(model_path)
predictions.do_setup()
predictions.run_parallel(CPU_CORES)
predictions.predict()
predictions.save_predictions(output_filename)

SETUP - ...indexing fasta filenames for viruses and hosts...
SETUP - ...initialize all pairs...
None
-------> There are 40 viral sequences
-------> There are 27 host sequences
-------> Total number of interactions: 1080
SETUP - ...getting fasta headers...
SETUP - ...process blastn and spacers output...
SETUP - ...calculate GC content and k-mer profiles...
-------> current pair: GCA_021090945.1_ASM2109094v1_genomic.fasta | Staphylococcus_pseudintermedius.fasta
-------> current pair: GCA_021090845.1_ASM2109084v1_genomic.fasta | Staphylococcus_pettenkoferi.fasta
-------> current pair: GCA_021090835.1_ASM2109083v1_genomic.fasta | Staphylococcus_hominis.fasta
-------> current pair: GCA_021090825.1_ASM2109082v1_genomic.fasta | Staphylococcus_pseudintermedius.fasta
-------> current pair: GCA_021090955.1_ASM2109095v1_genomic.fasta | Staphylococcus_pettenkoferi.fasta
-------> current pair: GCA_021090975.1_ASM2109097v1_genomic.fasta | Staphylococcus_hominis.fasta
-------> current pair: GCA_02109

### 2. **Predict only for virus-host pairs of interest**

For this example, I will use the same test set. This time, I will also provide the additional parameter that is needed: a tsv file containing the specific virus-host pairs to be tested.

In [4]:
pairs_of_interest_path = './example_VirusHost_pairlist.txt'


output_filename = './output/test_onlydesiredpairs_predictions.tsv'


With this parameter defined, now we can make the predictions!

In [5]:
predictions = PredictInteractions(virus_directory_path, host_directory_path, pairs_of_interest=pairs_of_interest_path)
predictions.add_blastn_files(blastn_path, spacer_path)
predictions.load_model(model_path)
predictions.do_setup()
predictions.run_parallel(CPU_CORES)
predictions.predict()
predictions.save_predictions(output_filename)


SETUP - ...indexing fasta filenames for viruses and hosts...
SETUP - ...initialize all pairs...
./example_VirusHost_pairlist.txt
reading pairs file
SETUP - ...getting fasta headers...
SETUP - ...process blastn and spacers output...
SETUP - ...calculate GC content and k-mer profiles...
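
Once the predictions have been saved, the output file can be inspected with the standard library. This is a minimal sketch: `read_tsv` is a hypothetical helper, and it only assumes that the file is tab-delimited, as the `.tsv` extension suggests. The column layout is not described in this tutorial, so no column names are assumed.

```python
import csv


def read_tsv(path):
    """Read a tab-delimited file and return its rows as lists of strings."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))
```

For example, `rows = read_tsv(output_filename)` followed by `print(rows[:5])` shows the first few lines of the predictions file.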
