# <span style="color:deeppink"> BLAST </span>

This notebook runs BLAST searches of highly conserved DNA methylation machinery proteins from humans, plants and fungi against the annotated Pst proteins.

The following proteins will be compared:

- In fungi:
    - RAD8 XP_001831325.2
    - DNMT1/MASC2 XP_001833175.2
    - DNMT2 XP_001828513.2

- In plants (At):
    - MET1 AT5G49160 sp|P34881|DNMT1_ARATH
    - CMT3 AT1G69770 sp|Q94F88|CMT3_ARATH
    - DRM2 AT5G14620 sp|Q9M548|DRM2_ARATH

- In humans:
    - DNMT1 NP_001124295.1
    
The BLAST will be performed using the following steps:
1. Copy the Pst protein FASTA files to the current folder.
2. Copy the amino acid sequence (FASTA) of each query from its database entry into a .txt file in the current directory.
3. Run makeblastdb to make BLAST databases from the protein FASTA files.
4. Use blastp to align each query protein sequence against each Pst protein database and generate output.
5. Check whether there are any high-scoring segment pairs (HSPs) in Pst for any of the genes.
6. Run MUSCLE for global alignment between the Pst proteins with HSPs and the query proteins.
7. Check whether the important domains annotated in the online database entry are present in the Pst candidate protein.
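
Step 4 pairs every query with every database. A minimal Python sketch of how those blastp command lines could be generated, using the file names that appear in this notebook; `blastp_command` and `all_commands` are hypothetical helpers that only build the argument lists and do not run BLAST:

```python
from itertools import product

# Query names and database files as they appear in this notebook.
QUERIES = ["rad8", "dnmt1_masc2", "dnmt2", "met1", "cmt3", "drm2", "dnmt1"]
DATABASES = {
    "h": "Pst_104E_v13_h_ctg.protein.fa",  # haplotigs
    "p": "Pst_104E_v13_p_ctg.protein.fa",  # primary contigs
}


def blastp_command(query, db_label, evalue="1e-03"):
    """Return the blastp argument list for one query against one database."""
    return [
        "blastp",
        "-db", DATABASES[db_label],
        "-query", f"{query}.txt",
        "-evalue", evalue,
        "-out", f"{query}_{db_label}_blast_results",
    ]


def all_commands():
    """Pair every query with every database (hypothetical helper)."""
    return [blastp_command(q, d) for q, d in product(QUERIES, DATABASES)]


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for cmd in all_commands():
        print(" ".join(cmd))
```

Each list could be passed to `subprocess.run` once the databases exist; printing first makes it easy to check the file names.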

In [44]:
#load modules
import Bio
from Bio import Align

In [8]:
%%bash

#go to the blast directory
cd /home/anjuni/blast_db

In [None]:
%%bash

#make blast databases from protein files
makeblastdb -in Pst_104E_v13_h_ctg.protein.fa -dbtype prot
makeblastdb -in Pst_104E_v13_p_ctg.protein.fa -dbtype prot

In [10]:
%%bash


cd /home/anjuni/blast_db
pwd

/home/anjuni/blast_db


In [17]:
%%bash

#go to the blast directory
cd /home/anjuni/blast_db

#run blast on rad8 as a test and view output
blastp -db Pst_104E_v13_h_ctg.protein.fa -query rad8.txt -evalue 1e-03 -out rad8_h_blast_results


### <span style="color:darkturquoise"> BLAST on the haplotigs </span>

In [34]:
%%bash

#go to the blast directory
cd /home/anjuni/blast_db

#run blastp for each query file (*.txt); output names drop the .txt extension
for x in *.txt
do
len=${#x}
blastp -db Pst_104E_v13_h_ctg.protein.fa -query ${x} -evalue 1e-03 -out ${x::len-4}_h_blast_results
echo ${x::len-4}_h_blast_results
done

cmt3_h_blast_results
dnmt1_masc2_h_blast_results
dnmt1_h_blast_results
dnmt2_h_blast_results
drm2_h_blast_results
met1_h_blast_results
rad8_h_blast_results


### <span style="color:limegreen"> BLAST on the primary contigs </span>

In [36]:
%%bash

#go to the blast directory
cd /home/anjuni/blast_db

#run blastp for each query file (*.txt); output names drop the .txt extension
for x in *.txt
do
len=${#x}
blastp -db Pst_104E_v13_p_ctg.protein.fa -query ${x} -evalue 1e-03 -out ${x::len-4}_p_blast_results
echo ${x::len-4}_p_blast_results
done

cmt3_p_blast_results
dnmt1_masc2_p_blast_results
dnmt1_p_blast_results
dnmt2_p_blast_results
drm2_p_blast_results
met1_p_blast_results
rad8_p_blast_results


### <span style='color:#ffbf00'> Making parseable output files </span>

In [38]:
%%bash

#go to the blast directory
cd /home/anjuni/blast_db

#rerun blastp for each query file (*.txt) with tabular output (-outfmt 6)
for x in *.txt
do
len=${#x}
blastp -db Pst_104E_v13_p_ctg.protein.fa -query ${x} -evalue 1e-03 -outfmt 6 -out ${x::len-4}_p_blast_6
echo ${x::len-4}_p_blast_6
done


for x in *.txt
do
len=${#x}
blastp -db Pst_104E_v13_h_ctg.protein.fa -query ${x} -evalue 1e-03 -outfmt 6 -out ${x::len-4}_h_blast_6
echo ${x::len-4}_h_blast_6
done


cmt3_p_blast_6
dnmt1_masc2_p_blast_6
dnmt1_p_blast_6
dnmt2_p_blast_6
drm2_p_blast_6
met1_p_blast_6
rad8_p_blast_6
cmt3_h_blast_6
dnmt1_masc2_h_blast_6
dnmt1_h_blast_6
dnmt2_h_blast_6
drm2_h_blast_6
met1_h_blast_6
rad8_h_blast_6
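
The `-outfmt 6` files are tab-separated with twelve default columns, so they can be read without any BLAST-specific library. A minimal sketch, assuming the default column order was not customised; `read_outfmt6` is a hypothetical helper and the example line is made up for illustration, not real output from this analysis:

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def read_outfmt6(handle):
    """Parse tabular BLAST output into a list of dicts with typed values."""
    rows = []
    for record in csv.reader(handle, delimiter="\t"):
        if not record:
            continue  # skip blank lines
        row = dict(zip(FIELDS, record))
        for key in INT_FIELDS:
            row[key] = int(row[key])
        for key in FLOAT_FIELDS:
            row[key] = float(row[key])
        rows.append(row)
    return rows


# Made-up line with the 12 default columns (illustration only).
EXAMPLE = "queryA\tsubjectB\t50.00\t100\t45\t2\t1\t98\t11\t108\t1e-20\t95.5\n"

if __name__ == "__main__":
    for hit in read_outfmt6(io.StringIO(EXAMPLE)):
        print(hit["qseqid"], hit["sseqid"], hit["pident"], hit["evalue"])
```

For a real file, pass an open file handle (for example `open("rad8_p_blast_6")`) instead of the `StringIO` object.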


### <span style="color:#fa7d00"> Observations </span>

Continue analysis:
- DNMT1/MASC2: 2 hits (one on p, one on h), similar in length to the query
- RAD8: 2 hits (one on p, one on h), similar in length to the query


Discarded:

- DNMT2: no hits on the p or h contigs
- DRM2: no hits on the p or h contigs

- DNMT1: 2 hits (p and h) that were too short and lacked most catalytic sites; same hits as for DNMT1/MASC2
- CMT3: 2 hits (p and h) that were too short and lacked most catalytic sites; same hits as for DNMT1/MASC2
- MET1: 2 hits (p and h) that were too short and lacked most catalytic sites; same hits as for DNMT1/MASC2

HSPs that were different in length to the query were discarded.
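
The length comparison can be written as a simple filter. A minimal sketch, assuming a 10% tolerance; the notebook does not state a threshold, so both the `tolerance` value and the `similar_length` helper are illustrative assumptions:

```python
def similar_length(query_len, subject_len, tolerance=0.1):
    """True when the subject length is within `tolerance` (a fraction) of the query length.

    The 10% default is an assumed cut-off, not one taken from this analysis.
    """
    return abs(subject_len - query_len) <= tolerance * query_len


if __name__ == "__main__":
    # Query and h_subject lengths reported in this notebook.
    print(similar_length(1253, 1248))  # DNMT1/MASC2 query vs Pst104E_20230
    print(similar_length(2184, 2204))  # RAD8 query vs Pst104E_28179
```

Both pairs pass at this tolerance, which matches the decision to continue with DNMT1/MASC2 and RAD8.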


Only two queries had hits of a similar length:

#### DNMT1/MASC2

<span style='color:darkred'> query: XP_001833175.2, len = 1253 </span>

<span style='color:purple'> h_subject: Pst104E_20230, len = 1248 </span>

Score = 206 bits (523),  Expect = 5e-54, Method: Compositional matrix adjust.
Identities = 241/925 (26%), Positives = 372/925 (40%), Gaps = 178/925 (19%)


<span style='color:darkblue'> p_subject: Pst104E_04293, len = 1248 </span>

Score = 206 bits (523),  Expect = 5e-54, Method: Compositional matrix adjust.
Identities = 241/925 (26%), Positives = 372/925 (40%), Gaps = 178/925 (19%)

#### RAD8

<span style='color:darkred'> query: XP_001831325.2, len = 2184 </span>

<span style='color:purple'> h_subject: Pst104E_28179, len = 2204 </span>

Score = 1158 bits (2996),  Expect = 0.0, Method: Compositional matrix adjust.
Identities = 643/1386 (46%), Positives = 837/1386 (60%), Gaps = 76/1386 (5%)



<span style='color:darkblue'> p_subject: Pst104E_12497, len = 2203 </span>

Score = 1159 bits (2997),  Expect = 0.0, Method: Compositional matrix adjust.
Identities = 643/1386 (46%), Positives = 838/1386 (60%), Gaps = 76/1386 (5%)

Catalytic domains in the Rad8 and Dnmt1/Masc2 matches are outlined in an Excel spreadsheet.

### Testing whether annotated domains from the NCBI entries are in the annotated protein file

In [9]:
%%bash

#testing out whether methylase domain is in these proteins
cd /home/anjuni/blast_db
grep 'PF00145' Supplemental_file_2_id_to_locus_tag.txt

Pst104E_17382	PF00176;PF00145
Pst104E_17447	PF00145
Pst104E_20230	PF00145
Pst104E_21609	PF00145;PF00176
Pst104E_28179	PF00145;PF00176;PF00271
Pst104E_01411	PF00145;PF00176
Pst104E_04293	PF00145
Pst104E_12497	PF00271;PF00145;PF00176


Pst104E_20230 and Pst104E_04293 match dnmt1/masc2.

Pst104E_12497 and Pst104E_28179 match rad8.

Pst104E_17382, Pst104E_17447, Pst104E_21609 and Pst104E_01411 are unknown.
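
The grep lookup can also be done in Python, which makes it easier to test for several domains at once. A minimal sketch, assuming each line has the two-column layout shown in the grep output (protein ID, tab, semicolon-separated IDs); the real annotation file may have more columns, and `proteins_with_domain` is a hypothetical helper:

```python
def proteins_with_domain(lines, domain_id):
    """Return {protein_id: [annotation IDs]} for lines that list `domain_id`."""
    hits = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip lines without an annotation column
        protein, annotations = parts[0], parts[1]
        ids = annotations.split(";")
        if domain_id in ids:
            hits[protein] = ids
    return hits


# Lines copied from the grep output above.
SAMPLE = [
    "Pst104E_17382\tPF00176;PF00145",
    "Pst104E_17447\tPF00145",
    "Pst104E_20230\tPF00145",
    "Pst104E_21609\tPF00145;PF00176",
    "Pst104E_28179\tPF00145;PF00176;PF00271",
    "Pst104E_01411\tPF00145;PF00176",
    "Pst104E_04293\tPF00145",
    "Pst104E_12497\tPF00271;PF00145;PF00176",
]

if __name__ == "__main__":
    # Proteins that carry the helicase C-terminal domain PF00271.
    print(sorted(proteins_with_domain(SAMPLE, "PF00271")))
```

Matching whole IDs after splitting on `;` avoids accidental substring hits that a plain text search could return.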

In [11]:
%%bash

#testing out whether methylase domain unique to dnmt1/masc2 is in these proteins
cd /home/anjuni/blast_db
grep 'COG0270' Supplemental_file_2_id_to_locus_tag.txt

Pst104E_20230	0957U@basNOG;0IF68@euNOG;0PK6H@fuNOG;12RZZ@opiNOG;COG0270@NOG
Pst104E_04293	0957U@basNOG;0IF68@euNOG;0PK6H@fuNOG;12RZZ@opiNOG;COG0270@NOG


### InterProScan Results

The amino acid sequences of the RAD8 and DNMT1/MASC2 queries and their matching subjects were submitted to the InterPro sequence search (link below) to detect catalytic domains in the subjects.
https://www.ebi.ac.uk/interpro/search/sequence-search
Date accessed: 17/07/2018

Rad8:
    
    Both subjects had matches for all the domains in the Rad8 query:
        S-adenosyl-L-methionine-dependent methyltransferase SSF53335
        SNF2-like, N-terminal domain superfamily G3DSA:3.40.50.10810
        P-loop containing nucleoside triphosphate hydrolase SSF52540
        C-5 cytosine methyltransferase PF00145
        Helicase superfamily 1/2, ATP-binding domain PS51192 SM00487
        SNF2-related, N-terminal domain PF00176
        Helicase, C-terminal PS51194 cd00079 PF00271 
        
        Importantly, the Rad8 match has the SNF2 domain and Methylase domain characteristic of Rad8 proteins.

Dnmt1/Masc2:
    
    Both subjects had matches for:
        S-adenosyl-L-methionine-dependent methyltransferase SSF53335
        C-5 cytosine methyltransferase PR00105 PF00145 PS51679
        Bromo adjacent homology (BAH) domain PS51038 SM00439
    
    Both subjects lacked matches for:
        C-5 cytosine methyltransferase TIGR00675
        DNA (cytosine-5)-methyltransferase 1-like PIRSF037404
        DNA (cytosine-5)-methyltransferase 1, replication foci domain PF12047
        DNA methylase, C-5 cytosine-specific, active site PS00094



 (delete later)
## potential computational way to check for sites in the subject:
### (currently manually collecting data in spreadsheet)

Idea for checking whether key sites are present in each candidate gene:
1. Get the position intervals of the key sites.
2. Make a list of these.
3. Run the list against the query to see whether each position holds a letter and not a space.
4. For the sites: put "yes" in a table if it is a letter and "no" if it is a space.
5. For the regions: record the number of letters (matches) and the number of spaces (mismatches), plus the match and mismatch percentages of the total.

Alternatively: all the best hits and their stats were listed.
Check whether they contain important domains.
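
A minimal Python sketch of the site-checking idea, assuming the query and subject come from a pairwise alignment of equal length in which gaps are written as `-` (the "spaces" in the notes above) and key sites are given as 1-based positions in the ungapped query. All function names and the toy sequences are hypothetical:

```python
def column_of_query_positions(aligned_query):
    """Map 1-based ungapped query positions to 0-based alignment columns."""
    mapping = {}
    position = 0
    for column, char in enumerate(aligned_query):
        if char != "-":
            position += 1
            mapping[position] = column
    return mapping


def site_present(aligned_query, aligned_subject, position):
    """True when the subject has a residue (not a gap) opposite the query site."""
    column = column_of_query_positions(aligned_query)[position]
    return aligned_subject[column] != "-"


def region_summary(aligned_query, aligned_subject, start, end):
    """Count subject residues and gaps across query positions start..end (inclusive)."""
    columns = column_of_query_positions(aligned_query)
    covered = sum(aligned_subject[columns[p]] != "-" for p in range(start, end + 1))
    total = end - start + 1
    return {
        "covered": covered,
        "gapped": total - covered,
        "percent_covered": 100.0 * covered / total,
    }


if __name__ == "__main__":
    # Toy alignment, not real data.
    query = "MK-TAYIAK"
    subject = "MKQ--YLAK"
    print("yes" if site_present(query, subject, 5) else "no")
    print(region_summary(query, subject, 2, 5))
```

This only tests whether a residue is aligned at each site. Checking that the residue is conserved would additionally compare the two characters in that column.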

### <span style='color:violet'> Global alignment using Muscle </span>

In [None]:
%%bash

# perform global alignment on the rad8 and dnmt1/masc2 query-subject pairs
muscle -in rad8_subject_h.txt -out rad8_h_muscle
muscle -in rad8_subject_p.txt -out rad8_p_muscle

muscle -in dnmt1_masc2_subject_h.txt -out dnmt1_masc2_h_muscle
muscle -in dnmt1_masc2_subject_p.txt -out dnmt1_masc2_p_muscle

Muscle output:

rad8_subject_h 2 seqs, max length 2203, avg  length 2193
00:00:00    23 MB(11%)  Iter   1  100.00%  K-mer dist pass 1
00:00:00    23 MB(11%)  Iter   1  100.00%  K-mer dist pass 2
00:00:00    35 MB(18%)  Iter   1  100.00%  Align node
00:00:00    35 MB(18%)  Iter   1  100.00%  Root alignment

rad8_subject_p 2 seqs, max length 2203, avg  length 2193
00:00:00    23 MB(11%)  Iter   1  100.00%  K-mer dist pass 1
00:00:00    23 MB(11%)  Iter   1  100.00%  K-mer dist pass 2
00:00:00    35 MB(18%)  Iter   1  100.00%  Align node
00:00:00    35 MB(18%)  Iter   1  100.00%  Root alignment

dnmt1_masc2_subject_h 2 seqs, max length 1253, avg  length 1250
00:00:00    23 MB(11%)  Iter   1  100.00%  K-mer dist pass 1
00:00:00    23 MB(11%)  Iter   1  100.00%  K-mer dist pass 2
00:00:00    29 MB(15%)  Iter   1  100.00%  Align node
00:00:00    29 MB(15%)  Iter   1  100.00%  Root alignment

dnmt1_masc2_subject_p 2 seqs, max length 1253, avg  length 1250
00:00:00    23 MB(11%)  Iter   1  100.00%  K-mer dist pass 1
00:00:00    23 MB(11%)  Iter   1  100.00%  K-mer dist pass 2
00:00:00    29 MB(15%)  Iter   1  100.00%  Align node
00:00:00    29 MB(15%)  Iter   1  100.00%  Root alignment

In [3]:
#align muscle output using biopython

aligner = Align.PairwiseAligner()

AttributeError: module 'Bio.Align' has no attribute 'PairwiseAligner'

In [2]:
from Bio import Align
aligner = Align.PairwiseAligner()

AttributeError: module 'Bio.Align' has no attribute 'PairwiseAligner'
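
This error usually means the installed Biopython predates `PairwiseAligner`, which appears to have been added in release 1.72. Since MUSCLE has already aligned the sequences, realignment is not strictly needed: with the `-in`/`-out` syntax used above, MUSCLE writes aligned FASTA by default, which can be read with the standard library alone. A minimal sketch under that assumption; `read_fasta` and `pairwise_identity` are hypothetical helpers and the sequences are toy data:

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the ID up to the first space
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def pairwise_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


if __name__ == "__main__":
    # Toy aligned FASTA, not real MUSCLE output.
    toy = ">query\nMK-TAYIAK\n>subject\nMKQ--YLAK\n"
    seqs = read_fasta(toy)
    print(round(pairwise_identity(seqs["query"], seqs["subject"]), 2))
```

For a real file, pass the contents of, for example, `rad8_h_muscle` to `read_fasta`. Note that this identity ignores gapped columns, so it is not directly comparable with the BLAST "Identities" fraction, which counts gaps in the denominator.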

### <span style='color:mediumorchid'> tblastn results: comparing amino acid sequence to nucleotide sequence </span>

In [1]:
%%bash

#run tblastn on all queries
#make the databases
cd /home/anjuni/blast_db

makeblastdb -in Pst_104E_v13_p_ctg.fa -dbtype nucl
makeblastdb -in Pst_104E_v13_h_ctg.fa -dbtype nucl



Building a new DB, current time: 07/26/2018 12:22:29
New DB name:   /home/anjuni/blast_db/Pst_104E_v13_p_ctg.fa
New DB title:  Pst_104E_v13_p_ctg.fa
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 156 sequences in 2.33927 seconds.


Building a new DB, current time: 07/26/2018 12:22:32
New DB name:   /home/anjuni/blast_db/Pst_104E_v13_h_ctg.fa
New DB title:  Pst_104E_v13_h_ctg.fa
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 475 sequences in 2.57479 seconds.


In [2]:
%%bash

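#run tblastn on every query file (*.txt) against the haplotig and primary contig nucleotide databases
cd /home/anjuni/blast_db

#loop over the query files; output names drop the .txt extension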
for x in *.txt
do
len=${#x}
tblastn -db Pst_104E_v13_h_ctg.fa -query ${x} -evalue 1e-03 -out ${x::len-4}_h_tblastn_results
echo ${x::len-4}_h_tblastn_results
done

for x in *.txt
do
len=${#x}
tblastn -db Pst_104E_v13_p_ctg.fa -query ${x} -evalue 1e-03 -out ${x::len-4}_p_tblastn_results
echo ${x::len-4}_p_tblastn_results
done

cmt3_h_tblastx_results
dnmt1_masc2_subject_h_h_tblastx_results
dnmt1_masc2_subject_p_h_tblastx_results
dnmt1_masc2_h_tblastx_results
dnmt1_h_tblastx_results
dnmt2_h_tblastx_results
drm2_h_tblastx_results
met1_h_tblastx_results
rad8_subject_h_h_tblastx_results
rad8_subject_p_h_tblastx_results
rad8_h_tblastx_results
Supplemental_file_2_id_to_locus_tag_h_tblastx_results
cmt3_p_tblastx_results
dnmt1_masc2_subject_h_p_tblastx_results
dnmt1_masc2_subject_p_p_tblastx_results
dnmt1_masc2_p_tblastx_results
dnmt1_p_tblastx_results
dnmt2_p_tblastx_results
drm2_p_tblastx_results
met1_p_tblastx_results
rad8_subject_h_p_tblastx_results
rad8_subject_p_p_tblastx_results
rad8_p_tblastx_results
Supplemental_file_2_id_to_locus_tag_p_tblastx_results
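
The listing above includes names such as `Supplemental_file_2_id_to_locus_tag_h_tblastx_results` and `rad8_subject_h_h_tblastx_results`, which shows that `*.txt` also picked up files that are not query sequences. One way to avoid this is to name the query files explicitly. A minimal Python sketch; `query_files` is a hypothetical helper and the name list is taken from the queries used in this notebook:

```python
from pathlib import Path

# Query names used in this notebook (one <name>.txt FASTA file each).
QUERY_NAMES = ("rad8", "dnmt1_masc2", "dnmt2", "met1", "cmt3", "drm2", "dnmt1")


def query_files(directory, names=QUERY_NAMES):
    """Return the existing <name>.txt query files, ignoring any other .txt files."""
    directory = Path(directory)
    candidates = (directory / f"{name}.txt" for name in names)
    return sorted(path for path in candidates if path.is_file())


if __name__ == "__main__":
    for path in query_files("."):
        print(path.name)
```

Moving the query files into their own directory, as done for the tabular tblastn run in `bed_tblastn`, has the same effect.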





In [4]:
%%bash

cd /home/anjuni/blast_db/bed_tblastn

#run tblastn with 6 format output
for x in *.txt
do
len=${#x}
tblastn -db Pst_104E_v13_p_ctg.fa -query ${x} -evalue 1e-03 -outfmt 6 -out ${x::len-4}_p_tblastn_6
echo ${x::len-4}_p_tblastn_6
done


for x in *.txt
do
len=${#x}
tblastn -db Pst_104E_v13_h_ctg.fa -query ${x} -evalue 1e-03 -outfmt 6 -out ${x::len-4}_h_tblastn_6
echo ${x::len-4}_h_tblastn_6
done

cmt3_p_tblastn_6
dnmt1_masc2_p_tblastn_6
dnmt1_p_tblastn_6
dnmt2_p_tblastn_6
drm2_p_tblastn_6
met1_p_tblastn_6
rad8_p_tblastn_6
cmt3_h_tblastn_6
dnmt1_masc2_h_tblastn_6
dnmt1_h_tblastn_6
dnmt2_h_tblastn_6
drm2_h_tblastn_6
met1_h_tblastn_6
rad8_h_tblastn_6


In [5]:
%%bash

cd /home/anjuni/blast_db/bed_tblastn
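#run bedtools intersect to find overlap between tblastn hits and annotated genes

#combine all tabular tblastn results (the explicit pattern stops all_6 matching itself on a rerun)
cat *_tblastn_6 > all_6
#write subject contig, start, end and query name as BED, putting the smaller coordinate first
awk '{if ($9 < $10) print $2"\t"$9"\t"$10"\t"$1;else print $2"\t"$10"\t"$9"\t"$1}' all_6 > all_6.bed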
bedtools intersect -a Pst_104E_v13_ph_ctg_combined_sorted_anno.gff3 -b all_6.bed > proteins_tblastn.bed
bedtools intersect -a all_6.bed -b Pst_104E_v13_ph_ctg_combined_sorted_anno.gff3  > tblastn_proteins.bed
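
The awk step writes the BLAST subject coordinates into the BED columns unchanged. BED intervals are 0-based and half-open, whereas BLAST coordinates are 1-based and inclusive, so a stricter conversion would subtract 1 from the start. A minimal Python sketch of that conversion, assuming the default 12-column `-outfmt 6` layout; `hit_to_bed` is a hypothetical helper, not part of the notebook, and the example row is made up:

```python
def hit_to_bed(fields):
    """Convert one outfmt 6 row (list of 12 strings) to a BED4 tuple.

    Columns 9 and 10 (sstart, send) are 1-based and may be reversed for
    minus-strand hits; BED wants the smaller coordinate first, 0-based.
    """
    query, subject = fields[0], fields[1]
    sstart, send = int(fields[8]), int(fields[9])
    start, end = min(sstart, send), max(sstart, send)
    return (subject, start - 1, end, query)


if __name__ == "__main__":
    # Made-up minus-strand hit (illustration only).
    row = "queryA\tcontig_1\t50.0\t100\t50\t0\t1\t100\t900\t601\t1e-10\t80".split("\t")
    print("\t".join(str(value) for value in hit_to_bed(row)))
```

The one-base difference rarely changes which genes a hit overlaps, but it matters when a hit ends exactly at a feature boundary.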

In [9]:
%%bash

cd /home/anjuni/blast_db/bed_tblastn
#get both features together
bedtools intersect -a all_6.bed -b Pst_104E_v13_ph_ctg_combined_sorted_anno.gff3 -wa -wb > tblastn.bed

#the results from this file were used to identify proteins matching the query proteins.
#while a few DNMT1 (human) matches were found, most matches were to Rad8