# BLAST

Here is an example of how to take a multi-sequence FASTA file and, using NCBI BLAST, compare its sequences against the Swiss-Prot database on your own computer.

---
**OS** - This notebook was originally developed on macOS.

**Organization** - First, take some time to decide how you will organize directories on your machine.
One suggestion is to keep a central location for all bioinformatics programs and a separate location for BLAST databases, stored together with their relevant metadata.
 ___

## Download Stand-alone BLAST

See `ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/`

In [12]:
%%bash
cd /Applications/bioinfo/
curl -O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.12.0+-x64-macosx.tar.gz
tar -xf ncbi-blast-2.12.0+-x64-macosx.tar.gz
cd -

/Users/sr320/Documents/GitHub/code


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  144M  100  144M    0     0  1479k      0  0:01:39  0:01:39 --:--:-- 1417k


You could add the programs to your system PATH; however, I prefer to use absolute paths stored in variables.

In [17]:
# Set the path to the BLAST binaries on your local machine
bldir = "/Applications/bioinfo/ncbi-blast-2.12.0+/bin/"

In [19]:
# Show that the file path variable works
!{bldir}blastx -h

USAGE
  blastx [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-negative_seqidlist filename]
    [-taxids taxids] [-negative_taxids taxids] [-taxidlist filename]
    [-negative_taxidlist filename] [-ipglist filename]
    [-negative_ipglist filename] [-entrez_query entrez_query]
    [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
    [-subject subject_input_file] [-subject_loc range] [-query input_file]
    [-out output_file] [-evalue evalue] [-word_size int_value]
    [-gapopen open_penalty] [-gapextend extend_penalty]
    [-qcov_hsp_perc float_value] [-max_hsps int_value]
    [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value]
    [-sum_stats bool_value] [-max_intron_length length] [-seg SEG_options]
    [-soft_masking soft_masking
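
Before moving on, it can be worth confirming that the programs used later in this notebook are present and executable under `bldir`. The following is a minimal sketch, assuming the directory layout used above; `missing_programs` is a hypothetical helper name and not part of BLAST+.

```python
import os

def missing_programs(bin_dir, programs=("blastx", "makeblastdb")):
    """Return the names in `programs` that are not executable files in bin_dir."""
    missing = []
    for name in programs:
        path = os.path.join(bin_dir, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing

# Path taken from the cell above; adjust it for your own machine.
bldir = "/Applications/bioinfo/ncbi-blast-2.12.0+/bin/"
print(missing_programs(bldir))
```

An empty list means both programs were found; any name that is printed points to a path that needs fixing before the later cells will run.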

## Create a Blast Database

I would like to make a database from UniProt/Swiss-Prot. See https://www.uniprot.org/downloads

In [21]:
%%bash
cd /Users/sr320/Documents/wd_bio/blast/
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta.gz uniprot_sprot_r2021_03.fasta.gz
gunzip -k uniprot_sprot_r2021_03.fasta.gz
cd -

/Users/sr320/Documents/GitHub/code


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 86.0M  100 86.0M    0     0  1515k      0  0:00:58  0:00:58 --:--:-- 1761k


In [22]:

!{bldir}makeblastdb \
-in /Users/sr320/Documents/wd_bio/blast/uniprot_sprot_r2021_03.fasta \
-dbtype prot \
-out /Users/sr320/Documents/wd_bio/blast/uniprot_sprot_r2021_03



Building a new DB, current time: 08/18/2021 14:48:02
New DB name:   /Users/sr320/Documents/wd_bio/blast/uniprot_sprot_r2021_03
New DB title:  /Users/sr320/Documents/wd_bio/blast/uniprot_sprot_r2021_03.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 565254 sequences in 13.8158 seconds.




## Get a Query Sequence

**tmp directory** - Because this notebook is a template, I am setting up a `tmp` directory structure here.

In [32]:
%%bash
# -p creates parent directories as needed and does not fail if they already exist
mkdir -p tmp/data/ tmp/analyses/

In [35]:
# Download the file from a URL to a local location (-k skips TLS certificate verification)
!curl https://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa \
-k \
> tmp/data/Ab_4denovo_CLC6_a.fa

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1982k  100 1982k    0     0  1279k      0  0:00:01  0:00:01 --:--:-- 1279k


In [36]:
# Let's get a preview
!head tmp/data/Ab_4denovo_CLC6_a.fa

>solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_1
ACACCCCACCCCAACGCACCCTCACCCCCACCCCAACAATCCATGATTGAATACTTCATC
TATCCAAGACAAACTCCTCCTACAATCCATGATAGAATTCCTCCAAAAATAATTTCACAC
TGAAACTCCGGTATCCGAGTTATTTTGTTCCCAGTAAAATGGCATCAACAAAAGTAGGTC
TGGATTAACGAACCAATGTTGCTGCGTAATATCCCATTGACATATCTTGTCGATTCCTAC
CAGGATCCGGACTGACGAGATTTCACTGTACGTTTATGCAAGTCATTTCCATATATAAAA
TTGGATCTTATTTGCACAGTTAAATGTCTCTATGCTTATTTATAAATCAATGCCCGTAAG
CTCCTAATATTTCTCTTTTCGTCCGACGAGCAAACAGTGAGTTTACTGTGGCCTTCAGCA
AAAGTATTGATGTTGTAAATCTCAGTTGTGATTGAACAATTTGCCTCACTAGAAGTAGCC
TTC


In [37]:
# How many sequences? Let's count ">", since each contig has exactly one
!grep -c ">" tmp/data/Ab_4denovo_CLC6_a.fa

5490
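
Note that `grep -c ">"` counts every line containing `>` anywhere, not only header lines. A slightly stricter count, sketched here in Python, looks only at lines that begin with `>`; `count_fasta_records` is a hypothetical helper name introduced for illustration.

```python
def count_fasta_records(path):
    """Count lines that begin with '>' (one such line per FASTA record)."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

In this notebook, `count_fasta_records("tmp/data/Ab_4denovo_CLC6_a.fa")` would be expected to match the `grep` count as long as `>` appears only at the start of header lines.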


## Run Blast

In [39]:

!{bldir}blastx \
-query tmp/data/Ab_4denovo_CLC6_a.fa \
-db /Users/sr320/Documents/wd_bio/blast/uniprot_sprot_r2021_03 \
-out tmp/analyses/Ab_4-uniprot_blastx.tab \
-evalue 1E-20 \
-num_threads 8 \
-max_target_seqs 1 \
-outfmt 6
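
The `!{bldir}blastx ...` syntax only works inside IPython/Jupyter. Outside a notebook, the same command can be assembled as an argument list and passed to `subprocess.run`. The sketch below reuses the paths and options from the cell above; `build_blastx_command` is a hypothetical helper name, and the final call is left commented out because it needs BLAST+ and the database to be in place.

```python
import os
import subprocess

def build_blastx_command(bin_dir, query, db, out, evalue="1E-20", threads=8):
    """Assemble the blastx call from the cell above as an argument list."""
    return [
        os.path.join(bin_dir, "blastx"),
        "-query", query,
        "-db", db,
        "-out", out,
        "-evalue", str(evalue),
        "-num_threads", str(threads),
        "-max_target_seqs", "1",
        "-outfmt", "6",
    ]

cmd = build_blastx_command(
    "/Applications/bioinfo/ncbi-blast-2.12.0+/bin/",
    "tmp/data/Ab_4denovo_CLC6_a.fa",
    "/Users/sr320/Documents/wd_bio/blast/uniprot_sprot_r2021_03",
    "tmp/analyses/Ab_4-uniprot_blastx.tab",
)
print(" ".join(cmd))

# To actually run the search:
# subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems if a path ever contains spaces.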



In [40]:
!head tmp/analyses/Ab_4-uniprot_blastx.tab

solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3	sp|O42248|GBLP_DANRE	82.456	171	30	0	1	513	35	205	2.78e-103	301
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5	sp|Q08013|SSRG_RAT	75.385	65	16	0	3	197	121	185	1.39e-28	104
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_6	sp|P12234|MPCP_BOVIN	76.623	77	18	0	2	232	286	362	7.22e-24	98.6
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_9	sp|Q41629|ADT1_WHEAT	82.258	62	11	0	3	188	170	231	6.00e-28	104
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_13	sp|Q32NG4|GALD1_XENLA	54.444	90	40	1	1	270	140	228	1.49e-28	106
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_23	sp|Q9GNE2|RL23_AEDAE	97.222	72	2	0	67	282	14	85	6.73e-44	142
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_31	sp|B3EWZ9|HEPHL_ACRMI	56.589	129	53	1	2	379	26	154	1.78e-44	157
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_31	sp|B3EWZ9|HEPHL_ACRMI	44.715	123	64
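
With `-outfmt 6` and no extra format specifiers, BLAST+ writes twelve tab-separated columns in a fixed default order: `qseqid`, `sseqid`, `pident`, `length`, `mismatch`, `gapopen`, `qstart`, `qend`, `sstart`, `send`, `evalue`, `bitscore`. The sketch below attaches those names to a single output line; `parse_hit` is a hypothetical helper name, and the contig name in the example is shortened for readability.

```python
# Default column order for BLAST+ tabular output (-outfmt 6).
OUTFMT6_COLUMNS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)

def parse_hit(line):
    """Turn one tab-separated outfmt 6 line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(OUTFMT6_COLUMNS):
        raise ValueError(
            f"expected {len(OUTFMT6_COLUMNS)} fields, got {len(fields)}"
        )
    return dict(zip(OUTFMT6_COLUMNS, fields))

# Second row of the preview above, with the long contig name
# shortened to "contig_5" for illustration.
example = (
    "contig_5\tsp|Q08013|SSRG_RAT\t75.385\t65\t16\t0"
    "\t3\t197\t121\t185\t1.39e-28\t104"
)
hit = parse_hit(example)
print(hit["sseqid"], hit["pident"], hit["evalue"])
```

All values are kept as strings here; convert `pident`, `evalue`, and `bitscore` with `float()` if you need to filter or sort on them.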

In [41]:
# How many BLAST hit lines?
!wc -l tmp/analyses/Ab_4-uniprot_blastx.tab

764 tmp/analyses/Ab_4-uniprot_blastx.tab
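
The `wc -l` figure counts alignment rows rather than query sequences: in the preview above, the query ending in `contig_31` appears on two rows. To see how many distinct contigs received at least one hit, the first column can be collected into a set. This is a minimal sketch; `count_unique_queries` is a hypothetical helper name.

```python
def count_unique_queries(path):
    """Count distinct query IDs (first column) in a tabular BLAST output file."""
    queries = set()
    with open(path) as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                queries.add(line.split("\t", 1)[0])
    return len(queries)
```

Calling `count_unique_queries("tmp/analyses/Ab_4-uniprot_blastx.tab")` and comparing the result with the 5490 input sequences gives the fraction of contigs that have a Swiss-Prot match at this e-value cutoff.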
