![buscadoR logo - generated by DALL-E mini](https://raw.githubusercontent.com/TeamMacLean/buscadoR/main/logo.png)

# buscadoR Searches

This notebook helps you run the `hmmer`, `DeepTMHMM` and `BLAST` searches needed for RLK annotation using `buscadoR` (available [here](https://github.com/TeamMacLean/buscadoR)). After you've uploaded a protein FASTA file and run the programs, you'll have the search result files for loading into the `buscadoR` R package for further inspection and a simple text summary file of annotations and sequences.

The notebook is a Google Colab document - for an introduction to these see this [Colab intro video](https://www.youtube.com/watch?v=inN8seMm7UI).

In brief, the page you are looking at will run the code written below for you. Hit the little 'play' buttons to make each cell work. You'll need to run each one in turn.

The code runs on a computer (here called a 'runtime') in the Google cloud. The runtimes are free but limited. A runtime is available for 12 hours from the time you start it; after that it is erased and you'll need to start again. Naturally, you'll need a Google account to use this. You can choose to pay Google for more power should you need it.

## 0. Install software and files into environment

This is a setup step and should only need to be run once for each new instance of the notebook.

Wait for this step to complete before proceeding. It usually takes about 3 minutes.

In [None]:
#@title
from google.colab import drive
drive.mount('/content/drive')

#install hmmer
!rm -rf data/ hmm/ blast/ results/ sample_data/
!pip install -q condacolab
import condacolab
condacolab.install()
!conda config --add channels defaults
!conda config --add channels bioconda
!conda config --add channels conda-forge
!conda install hmmer
#get hmmer files, build db
!mkdir hmm
!wget https://github.com/TeamMacLean/buscador_hlp/raw/main/buscador.hmm -P hmm/
!hmmpress hmm/buscador.hmm
#install seqkit
!conda install -c bioconda seqkit
#install DeepTMHMM
!pip3 install -qU pybiolib
!mkdir data
!mkdir results

#install BLAST
!apt install -y ncbi-blast+
!mkdir blast
!wget https://github.com/TeamMacLean/buscador_hlp/raw/main/At_ecto.fa -P blast/

print("Software installation done!")


## 1. Upload protein FASTA file from your computer

The file may take a while to upload completely. Once uploaded, it is deduplicated with respect to both sequences and sequence names.

In [None]:
#@title
import time
from google.colab import files

DEDUP_FILE = "data/deduplicated_sequences.fa"
HMMER_RESULTS = "results/hmmer_results.txt"
HMMER_GZ = "results/hmmer_results.txt.gz"
IDS_FOR_TMHMM = "data/seq_ids_for_tmhmm.txt"
SEQS_FOR_TMHMM = "data/seqs_for_tmhmm.fa"
INPUT = "data/input.fa"

uploaded = files.upload()
fname = list(uploaded.keys())[0]
cmd = 'mv "{}" {}'.format(fname, INPUT)  # quote the name in case it contains spaces
!{cmd}
cmd = "seqkit rmdup -s -P {} | seqkit rmdup -n > {}".format(INPUT, DEDUP_FILE)
!{cmd}
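
To make the two deduplication passes concrete, here is a minimal pure-Python sketch of the idea: drop repeated sequences first, then drop repeated names, keeping the first occurrence each time. It is an illustration only, not how `seqkit` is implemented; the helper names `read_fasta` and `deduplicate` and the example records are made up for this sketch, and it assumes names are compared as the full header line and sequences by exact match.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def deduplicate(records):
    """Remove duplicate sequences, then duplicate headers, keeping first occurrences."""
    seen_seqs, by_sequence = set(), []
    for header, seq in records:
        if seq not in seen_seqs:
            seen_seqs.add(seq)
            by_sequence.append((header, seq))
    seen_headers, result = set(), []
    for header, seq in by_sequence:
        if header not in seen_headers:
            seen_headers.add(header)
            result.append((header, seq))
    return result


# Made-up records: 'b' repeats the sequence of 'a', and 'a' appears twice as a name
example = ">a\nMKT\nLLV\n>b\nMKTLLV\n>a\nGGG\n>c\nPPP\n"
unique = deduplicate(read_fasta(example))
print(unique)
```

In this toy input, `b` is removed in the first pass because its sequence matches `a`, and the second `a` is removed in the second pass because its name is already taken.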

## 2. Run `hmmer` on deduplicated sequences

In [None]:
#@title
!echo "Starting hmmer "; date;  hmmscan --domtblout {HMMER_RESULTS}  hmm/buscador.hmm {DEDUP_FILE} 1> /dev/null ; echo "Ending hmmer "; date;
!gzip {HMMER_RESULTS}
!echo "hmmer finished!"


## Download the `hmmer` results

Use the cell below to send the results to your machine.

Alternatively, use the file browser on the left to download - right click on the file `results/hmmer_results.txt.gz`

In [None]:
#@title
from google.colab import files
files.download('results/hmmer_results.txt.gz')

## Run `DeepTMHMM` on sequences with `hmmer` hits

This is split into two steps:

 1. Isolate the sequences that had `hmmer` hits.

In [None]:
#@title
HMMER_GZ = "results/hmmer_results.txt.gz"
!zcat {HMMER_GZ} | grep -o '^[^#]*' | tr -s ' ' | cut -d ' ' -f4 | sort | uniq > {IDS_FOR_TMHMM}
!seqkit grep -f {IDS_FOR_TMHMM} {DEDUP_FILE} -o {SEQS_FOR_TMHMM}
!seqkit split {SEQS_FOR_TMHMM} -s 100 --force
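
For readers who find the shell pipeline hard to follow, the sketch below shows roughly the same ID extraction in Python: skip comment lines, split each row on whitespace, and collect the fourth column, which holds the query sequence name in `hmmscan --domtblout` output. The function names and the example rows are made up for illustration; real rows have many more columns.

```python
import gzip


def query_ids(lines):
    """Return the sorted, unique query names (column 4) from domtblout lines."""
    ids = set()
    for line in lines:
        # Comment and blank lines carry no hits
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        if len(fields) >= 4:
            ids.add(fields[3])
    return sorted(ids)


def query_ids_from_gz(path):
    """Same as query_ids, but reads a gzipped results file."""
    with gzip.open(path, "rt") as handle:
        return query_ids(handle)


# Made-up, shortened rows: target name, accession, target length, query name, ...
example_rows = [
    "# target name  accession  tlen  query name ...",
    "modelA  acc1  264  seq2  -  900",
    "modelB  acc2   61  seq1  -  850",
    "modelB  acc2   61  seq2  -  900",
    "#",
]
hit_ids = query_ids(example_rows)
print(hit_ids)
```

On the notebook's own output, this could be called as `query_ids_from_gz("results/hmmer_results.txt.gz")`.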

Then:

 2. Run `DeepTMHMM` on the sequences with `hmmer` hits.

In [None]:
#@title
cmd = '''
FILES=data/seqs_for_tmhmm.fa.split/*.fa
NFILES=`ls data/seqs_for_tmhmm.fa.split/*.fa | wc -l`
i=1
for f in ${FILES}
do
  echo "run ${i} of ${NFILES} "
  biolib run DTU/DeepTMHMM --fasta ${f}
  cat biolib_results/TMRs.gff3 >> results/deeptmhmm_results.txt
  rm -rf biolib_results
  i=$((i+1))
done
rm -rf data/seqs_for_tmhmm.fa.split

echo "All DeepTMHMM runs complete" '''
!{cmd}
!gzip results/deeptmhmm_results.txt
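
The loop above works through the 100-sequence parts written by `seqkit split -s 100`, one `DeepTMHMM` run per part. As a minimal sketch of that batching idea only (the `batches` helper and the sequence names are hypothetical, and nothing here calls `biolib`):

```python
def batches(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 250 made-up sequence names split into parts of up to 100
seq_names = ["seq{}".format(i) for i in range(250)]
parts = list(batches(seq_names, size=100))
for number, part in enumerate(parts, start=1):
    print("run {} of {}: {} sequences".format(number, len(parts), len(part)))
```

Only the last part can be smaller than the batch size, so the number of runs is the sequence count divided by 100, rounded up.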

In [None]:
!pip install udocker
!udocker --allow-root install

In [None]:
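# Note: each `!` command runs in its own subshell, so this alias does not carry over to later commands or cells
!alias docker='udocker --allow-root'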

In [None]:
!udocker --allow-root pull hello-world  # Example: Pull a Docker image
!udocker --allow-root run hello-world  # Example: Run a Docker container

In [None]:
!apt-get install docker.io && biolib run --local 'DTU/DeepTMHMM:1.0.24' --fasta data/seqs_for_tmhmm.fa

In [None]:
!biolib run --local 'DTU/DeepTMHMM:1.0.24' --fasta data/seqs_for_tmhmm.fa

## Download the `DeepTMHMM` results

Use the cell below to send the results to your machine.

Alternatively, use the file browser on the left to download - right click on the file `results/deeptmhmm_results.txt.gz`

In [None]:
#@title
from google.colab import files
files.download('results/deeptmhmm_results.txt.gz')

## Run `BLAST` to find sequences with ectodomains


In [None]:
#@title
print("Starting BLASTP")
!date
cmd = "blastp -subject {} -query blast/At_ecto.fa -outfmt 6 > results/blast_tmp".format(SEQS_FOR_TMHMM)
!{cmd}
!date
with open("results/blast_tmp", "r") as blast, open("results/blast_results.txt", "w") as out:
  for line in blast:
    els = line.split("\t")
    # column 3 of BLAST tabular output is percent identity; keep hits at 50% or above
    if float(els[2]) >= 50.0:
      out.write(line)
!gzip results/blast_results.txt
!rm results/blast_tmp
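
The filter in the cell above relies on the default column order of BLAST tabular output (`-outfmt 6`), where the third column is percent identity. The sketch below names the columns explicitly to make that visible; the function name `filter_by_identity` and the example hits are made up for illustration.

```python
# Default column order of BLAST tabular output (-outfmt 6)
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_by_identity(lines, min_identity=50.0):
    """Keep tabular BLAST lines whose percent identity is at least min_identity."""
    kept = []
    for line in lines:
        if not line.strip():
            continue
        hit = dict(zip(OUTFMT6_COLUMNS, line.rstrip("\n").split("\t")))
        if float(hit["pident"]) >= min_identity:
            kept.append(line)
    return kept


# Made-up hits: one above, one below and one exactly at the threshold
example_hits = [
    "ecto_example\tseqA\t62.50\t200\t70\t2\t1\t198\t30\t229\t1e-80\t250\n",
    "ecto_example\tseqB\t41.00\t180\t95\t4\t5\t180\t12\t190\t1e-20\t90\n",
    "ecto_example\tseqC\t50.00\t150\t75\t0\t1\t150\t1\t150\t1e-40\t140\n",
]
kept = filter_by_identity(example_hits)
print("".join(kept), end="")
```

Because the notebook passes the ectodomain file as `-query` and your sequences as `-subject`, your sequence names appear in the `sseqid` column.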

## Download the `BLAST` results

Use the cell below to send the results to your machine.

Alternatively, use the file browser on the left to download - right click on the file `results/blast_results.txt.gz`

In [None]:
#@title
from google.colab import files
files.download('results/blast_results.txt.gz')

## Help

### My results don't download

Google Colab only permits file downloads up to a certain size (~10 MB). Every effort is made in the notebook to retain only essential information in the results files, but if a results file gets too big for Colab to send using the simple download interface, you'll need to transfer it to your Google Drive and get it from there. This document shows how to do that: [Google Colab Tutorial on external data](https://colab.research.google.com/notebooks/io.ipynb).
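
As a rough sketch of that fallback, the code below copies any result files that exist into a folder on Google Drive. It assumes Drive is already mounted at `/content/drive`, as the setup cell does, and that your Drive root appears as `MyDrive`. The folder name `buscador_results` and the helper `copy_to_drive` are made up for this example.

```python
import os
import shutil


def copy_to_drive(src, drive_dir="/content/drive/MyDrive/buscador_results"):
    """Copy one file into drive_dir (created if missing) and return the new path."""
    os.makedirs(drive_dir, exist_ok=True)
    return shutil.copy(src, drive_dir)


# Result files written by the notebook; only those that exist are copied
result_files = [
    "results/hmmer_results.txt.gz",
    "results/deeptmhmm_results.txt.gz",
    "results/blast_results.txt.gz",
]
for path in result_files:
    if os.path.exists(path):
        print("Copied to", copy_to_drive(path))
```

Once copied, the files can be downloaded from the Google Drive web interface in the usual way.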


In [None]:
!grep -c ">" data/seqs_for_tmhmm.fa
