## Extract _P. generosa_ BLAST results for genes associated with methylation machinery

The list of methylation machinery gene names comes from this GitHub issue:

- [https://github.com/RobertsLab/resources/issues/1116](https://github.com/RobertsLab/resources/issues/1116)

In [1]:
%%bash
echo "TODAY'S DATE:"
date
echo "------------"
echo ""
#Display operating system info
lsb_release -a
echo ""
echo "------------"
echo "HOSTNAME: "; hostname 
echo ""
echo "------------"
echo "Computer Specs:"
echo ""
lscpu
echo ""
echo "------------"
echo ""
echo "Memory Specs"
echo ""
free -mh

TODAY'S DATE:
Fri Feb 26 11:36:59 PST 2021
------------

Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.2 LTS
Release:	20.04
Codename:	focal

------------
HOSTNAME: 
computer

------------
Computer Specs:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   45 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       8
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           165
Model name:                      Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
Stepping:                        2
CPU MHz:                         2399.998
BogoMIPS:                        4799.99
Hypervisor vendor:               VMware
Virtualization type:       

No LSB modules are available.


### Set variables

In [2]:
# Set data directories
%env data_dir=/home/samb/data/P_generosa
%env genes_gff=/home/samb/data/P_generosa/Panopea-generosa-vv0.74.a4.gene.gff3
%env unique_pgen_match_IDs=/home/samb/data/P_generosa/20210219_pgen_methylation-machinery_gene-IDs.txt
%env meth_machinery_list=/home/samb/data/P_generosa/20210219_methylation_list.txt
%env results_table=/home/samb/data/P_generosa/20210222_pgen_methylation-machinery_BLAST-evals.tab
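# Python variable (separate from the %env variable) so IPython can expand {data_dir} in the cd cell
data_dir="/home/samb/data/P_generosa"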

env: data_dir=/home/samb/data/P_generosa
env: genes_gff=/home/samb/data/P_generosa/Panopea-generosa-vv0.74.a4.gene.gff3
env: unique_pgen_match_IDs=/home/samb/data/P_generosa/20210219_pgen_methylation-machinery_gene-IDs.txt
env: meth_machinery_list=/home/samb/data/P_generosa/20210219_methylation_list.txt
env: results_table=/home/samb/data/P_generosa/20210222_pgen_methylation-machinery_BLAST-evals.tab


In [3]:
cd {data_dir}

/home/samb/data/P_generosa


## Download gene GFF, GFF checksums file, and GenSAS BLAST results

In [4]:
%%bash
wget --quiet https://gannet.fish.washington.edu/Atumefaciens/20190928_Pgenerosa_v074.a4_gensas_annotation/Panopea-generosa-vv0.74.a4.5d951bcf45b4b-diamond_functional.tab

wget --quiet https://gannet.fish.washington.edu/Atumefaciens/20190928_Pgenerosa_v074.a4_gensas_annotation/Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab

wget --quiet https://gannet.fish.washington.edu/Atumefaciens/20190928_Pgenerosa_v074.a4_gensas_annotation/checksums.md5

wget --quiet https://gannet.fish.washington.edu/Atumefaciens/20190928_Pgenerosa_v074.a4_gensas_annotation/Panopea-generosa-vv0.74.a4.gene.gff3
    
ls -lh

total 14M
-rw-rw-r-- 1 samb samb  147 Feb 19 10:59 20210219_methylation_list.txt
-rw-rw-r-- 1 samb samb 1.5M Oct  3  2019 Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab
-rw-rw-r-- 1 samb samb 1.3M Oct  3  2019 Panopea-generosa-vv0.74.a4.5d951bcf45b4b-diamond_functional.tab
-rw-rw-r-- 1 samb samb  11M Oct 14  2019 Panopea-generosa-vv0.74.a4.gene.gff3
-rw-rw-r-- 1 samb samb 6.0K Feb 19 20:33 checksums.md5


## Inspect files

In [5]:
%%bash
line="-----------------------------------------------------------"
for file in *
do
    echo ""
    echo "${line}"
    echo ""
    echo "${file}"
    echo ""
    head -n 15 "${file}"
    echo ""
done



-----------------------------------------------------------

20210219_methylation_list.txt

dnmt1
dnmt3a
dnmt3b
dnmt3l
mbd1
mbd2
mbd3
mbd4
mbd5
mbd6
mecp2
Baz2a
Baz2b
UHRF1
UHRF2


-----------------------------------------------------------

Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab

#
# Output is generated by GenSAS 7.x-5.0
#
#name     : mRNA
#start    : Start of alignment in subject
#end      : End of alignment in subject
#m_start  : Start of alignment in query
#m_end    : End of alignment in query
#al       : Alignment length
#score    : Row score of the match
#evalue   : E value of the match
#identity : Percentage of identical matches
mame	start	end	score	Accession	Match ID	m_start	m_end	E-value	identity	al
21910-PGEN_.00g000010.m01	121	229	165	Q86IC9	sp|Q86IC9|CAMT1_DICDI	11	122	8.93e-14	35.652	115
21910-PGEN_.00g000020.m01	147	467	968	P04177	sp|P04177|TY3H_RAT	20	339	3.47e-127	55.140	321


-----------------------------------------------------------

Panopea-g

## Verify GFF checksum

In [6]:
%%bash
diff <(md5sum Panopea-generosa-vv0.74.a4.gene.gff3 | cut -d " " -f1) <(grep "Panopea-generosa-vv0.74.a4.gene.gff3" checksums.md5 | cut -d " " -f1)
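
The `diff` above prints nothing when the two checksums match, so a successful check is silent. As a minimal sketch, the same comparison can be wrapped so that it reports the result explicitly. `verify_md5` is a hypothetical helper, not part of the original notebook, and the demo file is a throwaway created only so the sketch runs without the real GFF.

```shell
# verify_md5 (hypothetical helper): compare a file's MD5 with its entry
# in a checksums file and say whether they agree.
verify_md5() {
  file="$1"
  sums="$2"
  actual=$(md5sum "${file}" | cut -d " " -f1)
  # Find the checksum line naming the file, then keep only the hash field
  expected=$(grep -F "${file}" "${sums}" | head -n 1 | cut -d " " -f1)
  if [ -n "${expected}" ] && [ "${actual}" = "${expected}" ]; then
    echo "OK: ${file}"
  else
    echo "MISMATCH: ${file}" >&2
    return 1
  fi
}

# Demonstration on a throwaway file (assumption: stands in for the GFF)
demo_dir=$(mktemp -d)
printf 'example\n' > "${demo_dir}/demo.gff3"
md5sum "${demo_dir}/demo.gff3" > "${demo_dir}/checksums.md5"
verify_md5 "${demo_dir}/demo.gff3" "${demo_dir}/checksums.md5"
```

In the notebook this would presumably be called as `verify_md5 Panopea-generosa-vv0.74.a4.gene.gff3 checksums.md5`.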

## Find _P. generosa_ gene IDs matching the methylation machinery list

In [7]:
%%bash
# Pull out unique list of pgen IDs matching methylation machinery list
while read -r line
do

  # Test for empty line
  [ -z "${line}" ] && { echo "Empty line found in ${meth_machinery_list}."; exit 1; }

  # Search GFF for methylation gene name
  if grep --quiet --ignore-case "|${line}" "${genes_gff}"; then

    # Loop through matches, in case of multiple matches
    for match in $(grep --ignore-case "|${line}" "${genes_gff}" | awk -F'[=;]' '{print $2}')
    do
      # Print tab-delimited results
      printf "%s\t%s\n" "${match}" "${line}"
    done
  fi

# Overwrite (rather than append) so re-running the cell does not duplicate rows
done < "${meth_machinery_list}" | sort -k1,1 -u > "${unique_pgen_match_IDs}"

head ${unique_pgen_match_IDs}

PGEN_.00g104080	Baz2b
PGEN_.00g104170	Baz2b
PGEN_.00g116950	mbd5
PGEN_.00g186870	ctcf
PGEN_.00g192900	UHRF1
PGEN_.00g202750	mbd2
PGEN_.00g209890	mbd2
PGEN_.00g209900	mbd4
PGEN_.00g243700	egr1
PGEN_.00g249090	egr1
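
To illustrate how the gene ID is pulled out of a matching GFF line, here is a small sketch of the `grep` and `awk` steps on a single line. The GFF line is hypothetical: the scaffold name, coordinates, gene ID, and `Note` text are invented, and the sketch assumes that the attributes column starts with `ID=` and carries a SwissProt-style match ID.

```shell
# Hypothetical GFF3 gene line; every value here is invented for illustration
# and does not come from the real annotation.
gff_line=$(printf 'Scaffold_01\tGenSAS\tgene\t100\t900\t.\t+\t.\tID=PGEN_.00g000000;Name=PGEN_.00g000000;Note=sp|P00000|DNMT1_EXAMPLE')

# In grep's basic regex syntax "|" is a literal pipe, so "|dnmt1" only matches
# the gene name where it directly follows a "|" (case-insensitive with -i).
# Splitting on "=" or ";" then makes field 2 the value of the ID attribute.
gene_id=$(printf '%s\n' "${gff_line}" | grep -i "|dnmt1" | awk -F'[=;]' '{print $2}')
echo "${gene_id}"
```

Because the pattern is anchored only on the left, a name such as `mbd2` would also match any longer name beginning with it, which is worth keeping in mind when reading the table of matches.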


## Search BLAST tables for gene IDs and print to tab-delimited file

In [8]:
%%bash
# Write the header (overwrites any previous results table)
printf "%s\t%s\t%s\t%s\n" "Gene_ID" "gene_name" "BLASTp_evalue" "DIAMOND_evalue" > "${results_table}"
while read -r pgen_ID meth_machinery
do
  # Column 9 is the E-value; --fixed-strings keeps the "." in the gene ID literal,
  # and head keeps only the first matching row
  blastp=$(grep --fixed-strings "${pgen_ID}" Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab | head -n 1 | cut -f9)
  diamond=$(grep --fixed-strings "${pgen_ID}" Panopea-generosa-vv0.74.a4.5d951bcf45b4b-diamond_functional.tab | head -n 1 | cut -f9)
  printf "%s\t%s\t%s\t%s\n" "${pgen_ID}" "${meth_machinery}" "${blastp}" "${diamond}"
done < "${unique_pgen_match_IDs}" >> "${results_table}"
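
The lookup inside the loop can be sketched on its own to show which value ends up in the table. `get_evalue` is a hypothetical helper name. The demo table reuses the two BLAST rows shown in the file inspection output above, and `grep -F` is used here so that the `.` in the gene ID is matched literally.

```shell
# get_evalue (hypothetical helper): print the E-value (column 9) of the first
# row in a tab-delimited GenSAS functional table that contains the gene ID.
get_evalue() {
  grep -F "$1" "$2" | head -n 1 | cut -f9
}

# Two-row demo table copied from the BLAST rows shown earlier in the notebook
demo_table=$(mktemp)
printf '21910-PGEN_.00g000010.m01\t121\t229\t165\tQ86IC9\tsp|Q86IC9|CAMT1_DICDI\t11\t122\t8.93e-14\t35.652\t115\n' > "${demo_table}"
printf '21910-PGEN_.00g000020.m01\t147\t467\t968\tP04177\tsp|P04177|TY3H_RAT\t20\t339\t3.47e-127\t55.140\t321\n' >> "${demo_table}"

get_evalue "PGEN_.00g000020" "${demo_table}"
```

When a gene ID has no row in the table, the helper prints nothing, which is how empty cells arise in the results table.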


In [9]:
%%bash
column -t "${results_table}"

Gene_ID          gene_name  BLASTp_evalue  DIAMOND_evalue
PGEN_.00g104080  Baz2b      1.05e-98       5.4e-102
PGEN_.00g104170  Baz2b      3.09e-96       1.2e-109
PGEN_.00g116950  mbd5       6.40e-21       2.8e-20
PGEN_.00g186870  ctcf       1.25e-116
PGEN_.00g192900  UHRF1      2.32e-19
PGEN_.00g202750  mbd2       9.46e-82       2.6e-63
PGEN_.00g209890  mbd2       4.37e-19       9.2e-09
PGEN_.00g209900  mbd4       3.14e-32       8.0e-29
PGEN_.00g243700  egr1       6.24e-58       2.2e-23
PGEN_.00g249090  egr1       4.19e-18       2.6e-06
PGEN_.00g283000  dnmt1      5.03e-10
PGEN_.00g283010  dnmt1      0.0            7.3e-224
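
Several rows above have an empty `DIAMOND_evalue` field, meaning the gene ID was not found in the DIAMOND table. A minimal sketch for listing such genes directly from the tab-delimited file follows. The demo file holds three rows copied from the table above, and in the notebook the `awk` command would presumably be pointed at `${results_table}` instead.

```shell
# Demo copy of the header and three rows from the results table (tab-delimited)
demo_results=$(mktemp)
printf 'Gene_ID\tgene_name\tBLASTp_evalue\tDIAMOND_evalue\n' > "${demo_results}"
printf 'PGEN_.00g116950\tmbd5\t6.40e-21\t2.8e-20\n' >> "${demo_results}"
printf 'PGEN_.00g186870\tctcf\t1.25e-116\t\n' >> "${demo_results}"
printf 'PGEN_.00g283000\tdnmt1\t5.03e-10\t\n' >> "${demo_results}"

# Skip the header (NR > 1) and print gene ID and name where column 4 is empty
no_diamond=$(awk -F'\t' 'NR > 1 && $4 == "" {print $1 "\t" $2}' "${demo_results}")
printf '%s\n' "${no_diamond}"
```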
