## Extract _P. generosa_ gene IDs associated with methylation machinery

The list of methylation machinery gene IDs comes from this GitHub issue:

- [https://github.com/RobertsLab/resources/issues/1116](https://github.com/RobertsLab/resources/issues/1116)

In [1]:
%%bash
echo "TODAY'S DATE:"
date
echo "------------"
echo ""
#Display operating system info
lsb_release -a
echo ""
echo "------------"
echo "HOSTNAME: "; hostname 
echo ""
echo "------------"
echo "Computer Specs:"
echo ""
lscpu
echo ""
echo "------------"
echo ""
echo "Memory Specs"
echo ""
free -mh

TODAY'S DATE:
Fri Feb 19 21:04:59 PST 2021
------------

Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.2 LTS
Release:	20.04
Codename:	focal

------------
HOSTNAME: 
computer

------------
Computer Specs:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   45 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       8
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           165
Model name:                      Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
Stepping:                        2
CPU MHz:                         2399.998
BogoMIPS:                        4799.99
Hypervisor vendor:               VMware
Virtualization type:       

No LSB modules are available.


### Set variables

In [7]:
# Set data directories
%env data_dir=/home/samb/data/P_generosa
data_dir="/home/samb/data/P_generosa"

env: data_dir=/home/samb/data/P_generosa


In [9]:
cd {data_dir}

/home/samb/data/P_generosa


## Download gene GFF and checksums file

In [12]:
%%bash
wget --quiet https://gannet.fish.washington.edu/Atumefaciens/20190928_Pgenerosa_v074.a4_gensas_annotation/checksums.md5

wget --quiet https://gannet.fish.washington.edu/Atumefaciens/20190928_Pgenerosa_v074.a4_gensas_annotation/Panopea-generosa-vv0.74.a4.gene.gff3
    
ls -lh

total 11M
-rw-rw-r-- 1 samb samb  147 Feb 19 10:59 20210219_methylation_list.txt
-rw-rw-r-- 1 samb samb  11M Oct 14  2019 Panopea-generosa-vv0.74.a4.gene.gff3
-rw-rw-r-- 1 samb samb 6.0K Feb 19 20:33 checksums.md5


## Inspect files

In [17]:
%%bash
line="-----------------------------------------------------------"
for file in *
do
    echo ""
    echo "${line}"
    echo ""
    echo "${file}"
    echo ""
    head -n 5 "${file}"
    echo ""
done



-----------------------------------------------------------

20210219_methylation_list.txt

dnmt1
dnmt3a
dnmt3b
dnmt3l
mbd1


-----------------------------------------------------------

Panopea-generosa-vv0.74.a4.gene.gff3

##gff-version 3
##Generated using GenSAS, Monday 7th of October 2019 04:54:37 AM
##Project Name : Pgenerosa_v074
PGA_scaffold1__77_contigs__length_89643857	GenSAS_5d9637f372b5d-publish	gene	2	4719	.	+	.	ID=PGEN_.00g000010;Name=PGEN_.00g000010;original_ID=21510-PGEN_.00g234140;Alias=21510-PGEN_.00g234140;original_name=21510-PGEN_.00g234140;Notes=sp|Q86IC9|CAMT1_DICDI [BLAST protein vs protein (blastp) 2.7.1],PF01596.12 [Pfam 1.6]
PGA_scaffold1__77_contigs__length_89643857	GenSAS_5d9637f372b5d-publish	gene	19808	36739	.	-	.	ID=PGEN_.00g000020;Name=PGEN_.00g000020;original_ID=21510-PGEN_.00g234150;Alias=21510-PGEN_.00g234150;original_name=21510-PGEN_.00g234150;Notes=sp|P04177|TY3H_RAT [BLAST protein vs protein (blastp) 2.7.1],sp|P04177|TY3H_RAT [DIAMOND Functional 0.

## Verify GFF checksum

In [16]:
%%bash
diff <(md5sum Panopea-generosa-vv0.74.a4.gene.gff3 | cut -d " " -f1) <(grep "Panopea-generosa-vv0.74.a4.gene.gff3" checksums.md5 | cut -d " " -f1)
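
`diff` prints nothing when the two hashes agree, so the empty output of the command above indicates a match. A minimal sketch that makes the outcome explicit, assuming `checksums.md5` uses the standard `md5sum` layout (hash, whitespace, file name); the function name `verify_md5` is hypothetical and not part of the original notebook:

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare a file's MD5 with the value recorded in a checksums file.
# Assumes the checksums file uses the standard md5sum layout: "<hash>  <filename>".
verify_md5() {
    local file="$1"
    local checksums="$2"
    local expected observed

    # Fixed-string match on the file name; use the first matching entry only.
    expected=$(grep --fixed-strings "${file}" "${checksums}" | head -n 1 | awk '{print $1}')
    observed=$(md5sum "${file}" | awk '{print $1}')

    if [ -n "${expected}" ] && [ "${expected}" = "${observed}" ]; then
        echo "OK: ${file}"
    else
        echo "MISMATCH: ${file}"
        return 1
    fi
}
```

It could be called as `verify_md5 Panopea-generosa-vv0.74.a4.gene.gff3 checksums.md5`; a non-zero exit status signals a missing or different checksum.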

## _P. generosa_ gene IDs matching the methylation machinery list file

In [20]:
%%bash
while read -r line
do
    grep --ignore-case "|${line}" Panopea-generosa-vv0.74.a4.gene.gff3 | awk -F'[=;]' '{print $2}'
done < 20210219_methylation_list.txt | sort -u

PGEN_.00g104080
PGEN_.00g104170
PGEN_.00g116950
PGEN_.00g186870
PGEN_.00g192900
PGEN_.00g202750
PGEN_.00g209890
PGEN_.00g209900
PGEN_.00g243700
PGEN_.00g249090
PGEN_.00g283000
PGEN_.00g283010
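
The loop above reports only the unique gene IDs, so it does not show which search term produced which hit. A sketch that keeps that link, using the same case-insensitive `|term` pattern and the same ID extraction; the function name `map_terms_to_genes` is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<term><TAB><gene ID>" for every GFF line that
# contains "|<term>" (case-insensitive), mirroring the grep/awk loop above.
map_terms_to_genes() {
    local terms="$1"
    local gff="$2"
    local term

    while read -r term
    do
        # Skip blank lines in the term list.
        if [ -z "${term}" ]; then continue; fi
        # Field 2 after splitting on "=" or ";" is the value of the ID attribute.
        grep --ignore-case "|${term}" "${gff}" \
            | awk -F'[=;]' -v term="${term}" '{print term "\t" $2}'
    done < "${terms}" | sort -u
}
```

It could be called as `map_terms_to_genes 20210219_methylation_list.txt Panopea-generosa-vv0.74.a4.gene.gff3`.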


## Methylation machinery list entries with no match in the _P. generosa_ GFF

In [21]:
%%bash
while read -r line
do
    if grep --ignore-case --invert-match --quiet "|${line}" Panopea-generosa-vv0.74.a4.gene.gff3; then
        echo "${line}"
    fi
done < 20210219_methylation_list.txt

dnmt1
dnmt3a
dnmt3b
dnmt3l
mbd1
mbd2
mbd3
mbd4
mbd5
mbd6
mecp2
Baz2a
Baz2b
UHRF1
UHRF2
Kaiso
zbtb4
zbtb38b
zfp57
klf4
egr1
wt1
ctcf
tet1
tet2
tet3
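
Note that `grep --invert-match --quiet` exits with status 0 as soon as any single line lacks the pattern, which holds for essentially every term in a multi-gene GFF. That is why the output above contains every entry of the list, even though the previous section found 12 matching gene IDs. A sketch that instead negates the exit status of a plain match, so a term is printed only when no line contains it; the function name `list_unmatched_terms` is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each term that has no "|<term>" hit anywhere in the GFF.
list_unmatched_terms() {
    local terms="$1"
    local gff="$2"
    local term

    while read -r term
    do
        # Skip blank lines in the term list.
        if [ -z "${term}" ]; then continue; fi
        # grep --quiet exits 0 on the first match; "!" flips that, so this
        # branch runs only when no line in the GFF contains "|<term>".
        if ! grep --ignore-case --quiet "|${term}" "${gff}"; then
            echo "${term}"
        fi
    done < "${terms}"
}
```

It could be called as `list_unmatched_terms 20210219_methylation_list.txt Panopea-generosa-vv0.74.a4.gene.gff3`.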
