# GOterm Annotation

In this notebook, I'll annotate the CpG background and differentially methylated loci (DML) lists from `methylKit` and `DSS` with GOterms. I will use these annotations for gene enrichment analysis.

1. Create master annotation table
2. Match CpG background and DML lists with GOterms
3. Modify lists for downstream gene enrichment

## 0. Prepare to run script

### 0a. Set working directory

In [1]:
pwd

'/Users/yaamini/Documents/project-gigas-oa-meth/code'

In [3]:
cd ../output/

/Users/yaamini/Documents/project-gigas-oa-meth/output


In [4]:
!mkdir 11-GOterm-annotation

In [5]:
cd 11-GOterm-annotation/

/Users/yaamini/Documents/project-gigas-oa-meth/output/11-GOterm-annotation


## 1. Create master annotation table

### 1a. Format `DIAMOND blastx` output

In [6]:
#Download blastx output
!wget https://gannet.fish.washington.edu/spartina/project-gigas-oa-meth/output/blastx/20210605-cgigas-roslin-mito-blastx.outfmt6 \
--no-check-certificate

--2021-06-07 10:19:13--  https://gannet.fish.washington.edu/spartina/project-gigas-oa-meth/output/blastx/20210605-cgigas-roslin-mito-blastx.outfmt6
Resolving gannet.fish.washington.edu (gannet.fish.washington.edu)... 128.95.149.52
Connecting to gannet.fish.washington.edu (gannet.fish.washington.edu)|128.95.149.52|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 10442376 (10.0M)
Saving to: ‘20210605-cgigas-roslin-mito-blastx.outfmt6’


2021-06-07 10:19:14 (46.6 MB/s) - ‘20210605-cgigas-roslin-mito-blastx.outfmt6’ saved [10442376/10442376]



In [7]:
#Check output
!head 20210605-cgigas-roslin-mito-blastx.outfmt6
!wc -l 20210605-cgigas-roslin-mito-blastx.outfmt6

lcl|NC_047559.1_mrna_XM_034463066.1_26	sp|A1KZ92|PXDNL_HUMAN	24.891	229	153	6	867	1535	278	493	4.55e-05	51.2
lcl|NC_047559.1_mrna_XM_034463224.1_35	sp|Q7LHG5|YI31B_YEAST	50.658	152	71	1	1072	1515	617	768	8.84e-40	157
lcl|NC_047559.1_mrna_XM_034456887.1_39	sp|A1A5V9|ELP5_DANRE	28.873	284	194	4	356	1198	1	279	6.43e-29	118
lcl|NC_047559.1_mrna_XM_011455176.3_47	sp|A1A5V9|ELP5_DANRE	28.873	284	194	4	376	1218	1	279	7.04e-29	118
lcl|NC_047559.1_mrna_XM_011455177.3_48	sp|A1A5V9|ELP5_DANRE	28.873	284	194	4	245	1087	1	279	3.80e-29	118
lcl|NC_047559.1_mrna_XM_011419438.3_53	sp|A0A159BP93|CITB_MONRU	30.795	302	179	11	3053	2199	3	291	3.63e-29	123
lcl|NC_047559.1_mrna_XM_020064467.2_54	sp|A0A159BP93|CITB_MONRU	30.795	302	179	11	3307	2453	3	291	3.97e-29	123
lcl|NC_047559.1_mrna_XM_011419437.3_55	sp|O35077|GPDA_RAT	59.259	351	140	3	111	1160	1	349	4.39e-140	434
lcl|NC_047559.1_mrna_XM_011419460.2_56	sp|A0A159BP93|CITB_MONRU	30.795	302	179	11	1	855	3	291	4.56e-31	122
lcl|NC_047559.1_mrna_XM_011419435.3

In [9]:
#convert pipes to tab to isolate Uniprot accession code
!tr '|' '\t' < 20210605-cgigas-roslin-mito-blastx.outfmt6 \
> cgigas-roslin-mito-blastx.outfmt6.codeIsolated

In [149]:
!head cgigas-roslin-mito-blastx.outfmt6.codeIsolated
!wc -l cgigas-roslin-mito-blastx.outfmt6.codeIsolated

lcl	NC_047559.1_mrna_XM_034463066.1_26	sp	A1KZ92	PXDNL_HUMAN	24.891	229	153	6	867	1535	278	493	4.55e-05	51.2
lcl	NC_047559.1_mrna_XM_034463224.1_35	sp	Q7LHG5	YI31B_YEAST	50.658	152	71	1	1072	1515	617	768	8.84e-40	157
lcl	NC_047559.1_mrna_XM_034456887.1_39	sp	A1A5V9	ELP5_DANRE	28.873	284	194	4	356	1198	1	279	6.43e-29	118
lcl	NC_047559.1_mrna_XM_011455176.3_47	sp	A1A5V9	ELP5_DANRE	28.873	284	194	4	376	1218	1	279	7.04e-29	118
lcl	NC_047559.1_mrna_XM_011455177.3_48	sp	A1A5V9	ELP5_DANRE	28.873	284	194	4	245	1087	1	279	3.80e-29	118
lcl	NC_047559.1_mrna_XM_011419438.3_53	sp	A0A159BP93	CITB_MONRU	30.795	302	179	11	3053	2199	3	291	3.63e-29	123
lcl	NC_047559.1_mrna_XM_020064467.2_54	sp	A0A159BP93	CITB_MONRU	30.795	302	179	11	3307	2453	3	291	3.97e-29	123
lcl	NC_047559.1_mrna_XM_011419437.3_55	sp	O35077	GPDA_RAT	59.259	351	140	3	111	1160	1	349	4.39e-140	434
lcl	NC_047559.1_mrna_XM_011419460.2_56	sp	A0A159BP93	CITB_MONRU	30.795	302	179	11	1	855	3	291	4.56e-31	122
lcl	NC_047559.1_mrna_XM_011419435.3

In [147]:
#Extract column with query ID
#Convert _ to tab
#Extract column with transcript accession number (prefix dropped)
#Add XM_ to the front of each ID (any other prefix, e.g. NM_, is replaced by XM_)

!cut -f2 cgigas-roslin-mito-blastx.outfmt6.codeIsolated \
| tr "_" "\t" \
| cut -f5 \
| sed 's/^/XM_/' \
> cgigas-roslin-mito-blastx.outfmt6.transcriptID

In [148]:
!head cgigas-roslin-mito-blastx.outfmt6.transcriptID
!wc -l cgigas-roslin-mito-blastx.outfmt6.transcriptID

XM_034463066.1
XM_034463224.1
XM_034456887.1
XM_011455176.3
XM_011455177.3
XM_011419438.3
XM_020064467.2
XM_011419437.3
XM_011419460.2
XM_011419435.3
   93983 cgigas-roslin-mito-blastx.outfmt6.transcriptID
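The pipeline above drops the accession prefix and then prepends `XM_` to every ID, so a query whose ID carries a different prefix (for example `NM_`) would be relabeled. As a minimal sketch, assuming query IDs follow the `<chromosome>_mrna_<transcript accession>_<index>` pattern seen in the `head` output, the prefix can be kept by matching the accession directly. The function name `transcript_id_from_query` and the `NM_` query ID are hypothetical and used only for illustration.

```python
import re

# Matches a RefSeq-style accession such as XM_034463066.1 that follows "_mrna_".
# The pattern is an assumption based on the query IDs shown in the head output.
ACCESSION = re.compile(r"_mrna_([A-Z]{2}_\d+\.\d+)_\d+$")


def transcript_id_from_query(query_id):
    """Return the transcript accession embedded in a blastx query ID, or None."""
    match = ACCESSION.search(query_id)
    return match.group(1) if match else None


# Query ID copied from the blastx output above
print(transcript_id_from_query("NC_047559.1_mrna_XM_034463066.1_26"))
# Hypothetical query ID with an NM_ prefix, to show that the prefix is kept
print(transcript_id_from_query("NC_047564.1_mrna_NM_001305324.1_100"))
```

Returning `None` for IDs that do not match makes unexpected query formats easy to spot before they reach the joins.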


In [150]:
#Paste original transcript ID with codeIsolated file
#Check output

!paste cgigas-roslin-mito-blastx.outfmt6.transcriptID cgigas-roslin-mito-blastx.outfmt6.codeIsolated \
> cgigas-roslin-mito-blastx.outfmt6.codeIsolated.transcriptID
!head cgigas-roslin-mito-blastx.outfmt6.codeIsolated.transcriptID

XM_034463066.1	lcl	NC_047559.1_mrna_XM_034463066.1_26	sp	A1KZ92	PXDNL_HUMAN	24.891	229	153	6	867	1535	278	493	4.55e-05	51.2
XM_034463224.1	lcl	NC_047559.1_mrna_XM_034463224.1_35	sp	Q7LHG5	YI31B_YEAST	50.658	152	71	1	1072	1515	617	768	8.84e-40	157
XM_034456887.1	lcl	NC_047559.1_mrna_XM_034456887.1_39	sp	A1A5V9	ELP5_DANRE	28.873	284	194	4	356	1198	1	279	6.43e-29	118
XM_011455176.3	lcl	NC_047559.1_mrna_XM_011455176.3_47	sp	A1A5V9	ELP5_DANRE	28.873	284	194	4	376	1218	1	279	7.04e-29	118
XM_011455177.3	lcl	NC_047559.1_mrna_XM_011455177.3_48	sp	A1A5V9	ELP5_DANRE	28.873	284	194	4	245	1087	1	279	3.80e-29	118
XM_011419438.3	lcl	NC_047559.1_mrna_XM_011419438.3_53	sp	A0A159BP93	CITB_MONRU	30.795	302	179	11	3053	2199	3	291	3.63e-29	123
XM_020064467.2	lcl	NC_047559.1_mrna_XM_020064467.2_54	sp	A0A159BP93	CITB_MONRU	30.795	302	179	11	3307	2453	3	291	3.97e-29	123
XM_011419437.3	lcl	NC_047559.1_mrna_XM_011419437.3_55	sp	O35077	GPDA_RAT	59.259	351	140	3	111	1160	1	349	4.39e-140	434
XM_011419460.2

In [151]:
#Reduce the number of columns using awk: accession code, transcript ID, and e-value
#Sort, and save as a new file.
!awk -v OFS='\t' '{print $5, $1, $15}' < cgigas-roslin-mito-blastx.outfmt6.codeIsolated.transcriptID | sort \
> gigas-blast-sort.tab

In [152]:
!head gigas-blast-sort.tab

A0A060WQA3	XM_011455784.3	4.79e-06
A0A060WQA3	XM_011455785.3	4.65e-06
A0A061ACU2	XM_034483499.1	0.0
A0A061ACU2	XM_034483499.1	1.40e-12
A0A061ACU2	XM_034483499.1	2.13e-27
A0A061ACU2	XM_034483500.1	0.0
A0A061ACU2	XM_034483500.1	1.40e-12
A0A061ACU2	XM_034483500.1	2.44e-27
A0A061ACU2	XM_034483501.1	0.0
A0A061ACU2	XM_034483501.1	1.40e-12


### 1b. Join with GOterms

In [22]:
#Download Uniprot database with GOterm information
!curl -O http://owl.fish.washington.edu/halfshell/bu-alanine-wd/17-07-20/uniprot-SP-GO.sorted

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  340M  100  340M    0     0  26.1M      0  0:00:12  0:00:12 --:--:-- 50.6M


In [23]:
!head uniprot-SP-GO.sorted
!wc -l uniprot-SP-GO.sorted

A0A023GPI8	LECA_CANBL	reviewed	Lectin alpha chain (CboL) [Cleaved into: Lectin beta chain; Lectin gamma chain]		Canavalia boliviana	237			mannose binding [GO:0005537]; metal ion binding [GO:0046872]	mannose binding [GO:0005537]; metal ion binding [GO:0046872]	GO:0005537; GO:0046872
A0A023GPJ0	CDII_ENTCC	reviewed	Immunity protein CdiI	cdiI ECL_04450.1	Enterobacter cloacae subsp. cloacae (strain ATCC 13047 / DSM 30054 / NBRC 13535 / NCDC 279-56)	145					
A0A023PXA5	YA19A_YEAST	reviewed	Putative uncharacterized protein YAL019W-A	YAL019W-A	Saccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)	189					
A0A023PXB0	YA019_YEAST	reviewed	Putative uncharacterized protein YAR019W-A	YAR019W-A	Saccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)	110					
A0A023PXB5	IRC2_YEAST	reviewed	Putative uncharacterized membrane protein IRC2 (Increased recombination centers protein 2)	IRC2 YDR112W	Saccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)	102		integ

In [153]:
#Join the first column in the first file with the first column in the second file
#The files are tab delimited, and the output should also be tab delimited (-t $'\t')
!join -1 1 -2 1 -t $'\t' \
gigas-blast-sort.tab \
uniprot-SP-GO.sorted \
> gigas-blast-annot.tab
!head gigas-blast-annot.tab
!wc -l gigas-blast-annot.tab

A0A0A7DNP6	XM_001305294.1	8.75e-10	GRHLP_RUDPH	reviewed	Prepro-gonadotropin-releasing hormone-like protein (rp-GnRH) [Cleaved into: GnRH dodecapeptide; GnRH-associated peptide (GAP)]		Ruditapes philippinarum (Japanese littleneck clam) (Venerupis philippinarum)	94	neuropeptide signaling pathway [GO:0007218]	extracellular region [GO:0005576]		extracellular region [GO:0005576]; neuropeptide signaling pathway [GO:0007218]	GO:0005576; GO:0007218
A0A0B5A7M7	XM_001308866.2	3.97e-12	INS1_CONIM	reviewed	Con-Ins Im1 (Insulin 1) [Cleaved into: Con-Ins I1 B chain; Con-Ins I1 A chain]		Conus imperialis (Imperial cone)	150	glucose metabolic process [GO:0006006]	extracellular region [GO:0005576]	hormone activity [GO:0005179]	extracellular region [GO:0005576]; hormone activity [GO:0005179]; glucose metabolic process [GO:0006006]	GO:0005179; GO:0005576; GO:0006006
A0A0B5A7M7	XM_034475035.1	1.48e-12	INS1_CONIM	reviewed	Con-Ins Im1 (Insulin 1) [Cleaved into: Con-Ins I1 B chain; Con-Ins I1 A chain]		Conus
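
`join` matches lines from two sorted files on a shared key. The same idea can be sketched in Python with a dictionary lookup, which does not require sorted input. This is an illustration only, not the method used in this notebook. The function name `join_on_first_column` is hypothetical, the BLAST rows are copied from the outputs above, and the annotation rows are shortened to three columns (accession, entry name, GO IDs).

```python
from collections import defaultdict


def join_on_first_column(left_rows, right_rows):
    """Inner-join two lists of row tuples on their first element.

    Like `join -1 1 -2 1`, unmatched rows are dropped and the key is
    written once, followed by the remaining left and right columns.
    """
    right_by_key = defaultdict(list)
    for row in right_rows:
        right_by_key[row[0]].append(row[1:])

    joined = []
    for row in left_rows:
        for rest in right_by_key.get(row[0], []):
            joined.append(row + rest)
    return joined


# (accession, transcript ID, e-value)
blast_hits = [
    ("A0A060WQA3", "XM_011455784.3", "4.79e-06"),
    ("A0A0A7DNP6", "XM_001305294.1", "8.75e-10"),
    ("A0A0B5A7M7", "XM_001308866.2", "3.97e-12"),
    ("A0A0B5A7M7", "XM_034475035.1", "1.48e-12"),
]
# (accession, entry name, GO IDs), shortened from uniprot-SP-GO.sorted
annotations = [
    ("A0A0A7DNP6", "GRHLP_RUDPH", "GO:0005576; GO:0007218"),
    ("A0A0B5A7M7", "INS1_CONIM", "GO:0005179; GO:0005576; GO:0006006"),
]

for row in join_on_first_column(blast_hits, annotations):
    print("\t".join(row))
```

The hit for `A0A060WQA3` has no matching annotation row in this example, so it is dropped, just as `join` drops unpaired lines by default.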

In [154]:
#Extract columns 1 (accession), 2 (transcript ID), and 14 (GOterms)
#Save output
!cut -f1,2,14 gigas-blast-annot.tab \
> _blast-annot.tab
!head _blast-annot.tab

A0A0A7DNP6	XM_001305294.1	GO:0005576; GO:0007218
A0A0B5A7M7	XM_001308866.2	GO:0005179; GO:0005576; GO:0006006
A0A0B5A7M7	XM_034475035.1	GO:0005179; GO:0005576; GO:0006006
A0A0B5A8P8	XM_011417422.3	GO:0005179; GO:0005576; GO:0006006
A0A0F7YYX3	XM_011419035.3	GO:0005576
A0A0F7YYX3	XM_011419037.3	GO:0005576
A0A0F7YYX3	XM_011419038.3	GO:0005576
A0A0F7YYX3	XM_011419039.3	GO:0005576
A0A0F7YYX3	XM_020064360.2	GO:0005576
A0A0F7YZI5	XM_011442798.3	GO:0005179; GO:0005576


In [155]:
%%bash 

# Unfold GO terms: write one line per accession, transcript ID, and GO term.
# This script was originally written to address a specific problem that Rhonda was having

# input_file is the initial, "problem" file
# file is an intermediate file that most of the program works upon
# output_file is the final file produced by the script
input_file="_blast-annot.tab"
file="_intermediate.file"
output_file="_blast-GO-unfolded.tab"

# sed command substitutes the "; " sequence to a tab and writes the new format to a new file.
# This character sequence is how the GO terms are delimited in their field.
sed $'s/; /\t/g' "$input_file" > "$file"

# Identify first field containing a GO term.
# Search file with grep for "GO:" and pipe to awk.
# Awk sets tab as field delimiter (-F'\t'), runs a for loop that looks for "GO:" (~/GO:/), and then prints the field number.
# Awk results are piped to sort, which sorts unique by number (-ug).
# Sort results are piped to head to retrieve the lowest value (i.e. the top of the list; "-n1").
begin_goterms=$(grep "GO:" "$file" | awk -F'\t' '{for (i=1;i<=NF;i++) if($i ~/GO:/) print i}' | sort -ug | head -n1)

# While loop to process each line of the input file.
while read -r line
	do
	
	# Send contents of the current line to awk.
	# Set the field separator as a tab (-F'\t') and print the number of fields in that line.
	# Save the results of the echo/awk pipe (i.e. number of fields) to the variable "max_field".
	max_field=$(echo "$line" | awk -F'\t' '{print NF}')

	# Send contents of current line to cut.
	# Cut fields (i.e. retain those fields) 1-2 (accession and transcript ID).
	# Save the results of the echo/cut pipe (i.e. fields 1-2) to the variable "fixed_fields".
	fixed_fields=$(echo "$line" | cut -f1-2)

	# Since not all the lines contain the same number of fields (e.g. may not have GO terms),
	# evaluate the number of fields in each line to determine how to handle current line.

	# If the value in max_field is less than the field number where the GO terms begin,
	# then just print the current line (%s) followed by a newline (\n).
	if (( "$max_field" < "$begin_goterms" ))
		then printf "%s\n" "$line"
			else

			# Send contents of current line (which contains GO terms) to cut.
			# Cut fields (i.e. retain those fields) from the first GO term field to whatever the last field is in the current line.
			# Save the results of the echo/cut pipe (i.e. all the GO terms fields) to the variable "goterms".
			goterms=$(echo "$line" | cut -f"$begin_goterms"-"$max_field")
			
			# Assign values in the variable "goterms" to a new indexed array (called "array"), 
			# with tab delimiter (IFS=$'\t')
			IFS=$'\t' read -r -a array <<<"$goterms"
			
			# Iterate through each element of the array.
			# Print the first 2 fields (i.e. the fields stored in "fixed_fields") followed by a tab (%s\t).
			# Print the current element in the array (i.e. the current GO term) followed by a new line (%s\n).
			for element in "${!array[@]}"	
				do printf "%s\t%s\n" "$fixed_fields" "${array[$element]}"
			done
	fi

# Send the intermediate file into the while loop and send the output to the file named in "output_file".
done < "$file" > "$output_file"

In [156]:
!head _blast-GO-unfolded.tab

A0A0A7DNP6	XM_001305294.1	GO:0005576
A0A0A7DNP6	XM_001305294.1	GO:0007218
A0A0B5A7M7	XM_001308866.2	GO:0005179
A0A0B5A7M7	XM_001308866.2	GO:0005576
A0A0B5A7M7	XM_001308866.2	GO:0006006
A0A0B5A7M7	XM_034475035.1	GO:0005179
A0A0B5A7M7	XM_034475035.1	GO:0005576
A0A0B5A7M7	XM_034475035.1	GO:0006006
A0A0B5A8P8	XM_011417422.3	GO:0005179
A0A0B5A8P8	XM_011417422.3	GO:0005576
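
The unfolding step can also be sketched in Python. This is a minimal equivalent, assuming the input has three tab-separated columns (accession, transcript ID, and `; `-delimited GO IDs) as in `_blast-annot.tab`; it is not the script used above. The function name `unfold_go_terms` is hypothetical, and lines without GO terms are emitted with only the first two columns, much like the `if` branch of the bash loop.

```python
def unfold_go_terms(lines):
    """Yield one tab-separated line per (accession, transcript ID, GO ID)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        fixed_fields = fields[:2]
        go_field = fields[2] if len(fields) > 2 else ""
        if not go_field:
            # No GO terms on this line: keep the fixed fields only
            yield "\t".join(fixed_fields)
            continue
        for go_id in go_field.split("; "):
            yield "\t".join(fixed_fields + [go_id])


# Lines copied from the head of _blast-annot.tab
example = [
    "A0A0A7DNP6\tXM_001305294.1\tGO:0005576; GO:0007218\n",
    "A0A0F7YYX3\tXM_011419035.3\tGO:0005576\n",
]
for row in unfold_go_terms(example):
    print(row)
```

Because the function is a generator, it can be fed an open file handle and processes one line at a time.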


In [157]:
#Reorganize columns (GOterm, transcript ID) and sort
#gsort is GNU sort (installed with coreutils on macOS)
!awk '{print $3"\t"$2}' _blast-GO-unfolded.tab | gsort -V > _blast-GO-unfolded.sorted

In [158]:
!head _blast-GO-unfolded.sorted
!wc _blast-GO-unfolded.sorted

GO:0000002	XM_004596953.1
GO:0000002	XM_004599087.1
GO:0000002	XM_004599974.1
GO:0000002	XM_004602618.1
GO:0000002	XM_004604080.1
GO:0000002	XM_011413926.3
GO:0000002	XM_011416774.2
GO:0000002	XM_011428102.3
GO:0000002	XM_011430231.3
GO:0000002	XM_011434743.3
 1565846 3129025 40681933 _blast-GO-unfolded.sorted


### 1c. Join with GO Slim terms

In [30]:
#Get GO to GOSlim matching
!curl -O http://owl.fish.washington.edu/halfshell/bu-alanine-wd/17-07-20/GO-GOslim.sorted

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2314k  100 2314k    0     0   662k      0  0:00:03  0:00:03 --:--:--  663k


In [31]:
!head GO-GOslim.sorted
!wc -l GO-GOslim.sorted

GO:0000001	mitochondrion inheritance	cell organization and biogenesis	P
GO:0000002	mitochondrial genome maintenance	cell organization and biogenesis	P
GO:0000003	reproduction	other biological processes	P
GO:0000006	high affinity zinc uptake transmembrane transporter activity	transporter activity	F
GO:0000007	low-affinity zinc ion transmembrane transporter activity	transporter activity	F
GO:0000009	"alpha-1,6-mannosyltransferase activity"	other molecular function	F
GO:0000010	trans-hexaprenyltranstransferase activity	other molecular function	F
GO:0000011	vacuole inheritance	cell organization and biogenesis	P
GO:0000012	single strand break repair	DNA metabolism	P
GO:0000012	single strand break repair	stress response	P
   30796 GO-GOslim.sorted


In [162]:
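#Join files on GO ID to get GOslim for each query
#uniq removes adjacent duplicate lines
#Columns are then reordered to transcript ID, GO ID, GOterm, GOslim, and GO aspect before sorting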
!join -1 1 -2 1 -t $'\t' \
_blast-GO-unfolded.sorted \
GO-GOslim.sorted \
| uniq | awk -F'\t' -v OFS='\t' '{print $2, $1, $3, $4, $5}' \
| sort \
> Blastquery-GOslim.tab
!head Blastquery-GOslim.tab
!wc -l Blastquery-GOslim.tab

XM_	GO:0016021	integral to membrane	other membranes	C
XM_001305288.1	GO:0001501	skeletal system development	developmental processes	P
XM_001305288.1	GO:0005576	extracellular region	non-structural extracellular	C
XM_001305288.1	GO:0005595	collagen type XII	extracellular matrix	C
XM_001305288.1	GO:0005615	extracellular space	non-structural extracellular	C
XM_001305288.1	GO:0005788	endoplasmic reticulum lumen	ER/Golgi	C
XM_001305288.1	GO:0007155	cell adhesion	cell adhesion	P
XM_001305288.1	GO:0030020	extracellular matrix structural constituent conferring tensile strength	extracellular structural activity	F
XM_001305288.1	GO:0030199	collagen fibril organization	cell organization and biogenesis	P
XM_001305288.1	GO:0030574	collagen catabolic process	other metabolic processes	P
  754502 Blastquery-GOslim.tab
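
`uniq` only collapses duplicate lines that are adjacent, so running it before `sort` can leave duplicates that sorting later brings together. The sketch below mimics both orderings on a few made-up rows (hypothetical data, not taken from the files above) to show the difference. `uniq_adjacent` is a hypothetical helper that imitates `uniq`.

```python
from itertools import groupby


def uniq_adjacent(lines):
    """Imitate `uniq`: collapse runs of identical adjacent lines."""
    return [key for key, _ in groupby(lines)]


# Made-up rows in which the duplicate pair is not adjacent
rows = [
    "XM_1\tGO:0000012\tDNA metabolism",
    "XM_1\tGO:0000012\tstress response",
    "XM_1\tGO:0000012\tDNA metabolism",
]

uniq_then_sort = sorted(uniq_adjacent(rows))
sort_then_uniq = uniq_adjacent(sorted(rows))

print(len(uniq_then_sort))  # the duplicate survives when uniq runs first
print(len(sort_then_uniq))  # the duplicate is removed when sort runs first
```

On the command line, `sort -u` or `sort | uniq` removes all duplicate lines regardless of their original order.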


## 2. Match CpG background and DML lists with GOterms

### 2a. Obtain transcript and gene ID information from mRNA track

In [103]:
!head ../../genome-feature-files/cgigas_uk_roslin_v1_mRNA.gff
!wc -l ../../genome-feature-files/cgigas_uk_roslin_v1_mRNA.gff

NC_047559.1	Gnomon	mRNA	14114	15804	.	+	.	ID=rna-XM_034463183.1;Parent=gene-LOC109621113;Dbxref=GeneID:109621113,Genbank:XM_034463183.1;Name=XM_034463183.1;gbkey=mRNA;gene=LOC109621113;model_evidence=Supporting evidence includes similarity to: 78%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=uncharacterized LOC109621113;transcript_id=XM_034463183.1
NC_047559.1	Gnomon	mRNA	16867	19160	.	-	.	ID=rna-XM_034463195.1;Parent=gene-LOC117687066;Dbxref=GeneID:117687066,Genbank:XM_034463195.1;Name=XM_034463195.1;gbkey=mRNA;gene=LOC117687066;model_evidence=Supporting evidence includes similarity to: 98%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 14 samples with support for all annotated introns;product=uncharacterized LOC117687066;transcript_id=XM_034463195.1
NC_047559.1	Gnomon	mRNA	61887	71038	.	-	.	ID=rna-XM_034471753.1;Parent=gene-LOC117689737;Dbxref=GeneID:117689737,Gen

In [119]:
#Isolate column with ID information
#Convert multiple delimiters to tabs (- is listed last so tr treats it as a literal, not a range)
#Isolate transcript and gene ID columns
#Put gene ID first and ensure columns are tab-delimited
#Sort by gene ID
#Save output

!cut -f9 ../../genome-feature-files/cgigas_uk_roslin_v1_mRNA.gff \
| tr "=;:,-" "\t" \
| cut -f3,9 \
| awk '{print $2"\t"$1}' \
| sort \
> cgigas_uk_roslin_v1_mRNA.transcriptID.geneID

In [120]:
!head cgigas_uk_roslin_v1_mRNA.transcriptID.geneID
!wc -l cgigas_uk_roslin_v1_mRNA.transcriptID.geneID

105317035	XM_011413559.3
105317035	XM_034443160.1
105317035	XM_034443162.1
105317035	XM_034443163.1
105317035	XM_034443164.1
105317036	XM_034443166.1
105317036	XM_034443167.1
105317036	XM_034443168.1
105317040	XM_011413566.3
105317040	XM_034443178.1
   63341 cgigas_uk_roslin_v1_mRNA.transcriptID.geneID
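
The `tr`/`cut` pipeline relies on the transcript and gene IDs always landing in fields 3 and 9 after splitting. A sketch of a key-based alternative looks up the values by name instead. It assumes the GFF3 attribute column is a `;`-separated list of `key=value` pairs, with `Dbxref` holding comma-separated `database:ID` entries, as in the `head` output above. The function names are hypothetical, and the attribute string is shortened from the first mRNA record.

```python
def parse_gff_attributes(attribute_column):
    """Split a GFF3 attribute column into a {key: value} dictionary."""
    attributes = {}
    for pair in attribute_column.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = value
    return attributes


def gene_and_transcript_id(attribute_column):
    """Return (gene ID, transcript ID) for one mRNA record."""
    attributes = parse_gff_attributes(attribute_column)
    dbxref = dict(
        entry.split(":", 1)
        for entry in attributes.get("Dbxref", "").split(",")
        if ":" in entry
    )
    return dbxref.get("GeneID"), attributes.get("transcript_id")


# Attribute column shortened from the first record in the mRNA track
example = (
    "ID=rna-XM_034463183.1;Parent=gene-LOC109621113;"
    "Dbxref=GeneID:109621113,Genbank:XM_034463183.1;"
    "Name=XM_034463183.1;gbkey=mRNA;gene=LOC109621113;"
    "transcript_id=XM_034463183.1"
)
print(gene_and_transcript_id(example))
```

Looking values up by key keeps the result stable if a record carries extra attributes or lists them in a different order.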


### 2b. Female-DML

#### Format file

I want chr, start, end, and gene ID.

In [73]:
!head ../../output/10_DML-characterization/DML-pH-75-Cov5-Fem-Gene-wb.bed

NC_047559.1	2799230	2799232	83.0864530029626	NC_047559.1	Gnomon	gene	2623361	2849124	.	-	.	ID=gene-LOC117683566;Dbxref=GeneID:117683566;Name=LOC117683566;gbkey=Gene;gene=LOC117683566;gene_biotype=protein_coding
NC_047559.1	6531650	6531652	76.7441860465116	NC_047559.1	Gnomon	gene	6531377	6533346	.	-	.	ID=gene-LOC105340512;Dbxref=GeneID:105340512;Name=LOC105340512;gbkey=Gene;gene=LOC105340512;gene_biotype=protein_coding
NC_047559.1	8166690	8166692	77.3809523809524	NC_047559.1	Gnomon	gene	8162945	8170542	.	+	.	ID=gene-LOC105319125;Dbxref=GeneID:105319125;Name=LOC105319125;gbkey=Gene;gene=LOC105319125;gene_biotype=protein_coding
NC_047559.1	11117667	11117669	-77.0833333333333	NC_047559.1	Gnomon	gene	11073862	11129325	.	-	.	ID=gene-LOC105326856;Dbxref=GeneID:105326856;Name=LOC105326856;gbkey=Gene;gene=LOC105326856;gene_biotype=protein_coding
NC_047559.1	18206975	18206977	80.1851851851852	NC_047559.1	Gnomon	gene	18183683	18216623	.	+	.	ID=gene-LOC105347948;Dbxref=GeneID:105347948;Name=LO

In [81]:
#Isolate column with gene ID information
#Convert =, :, and ; to \t
#Isolate gene ID

!cut -f13 ../../output/10_DML-characterization/DML-pH-75-Cov5-Fem-Gene-wb.bed \
| tr "=:;" "\t" \
| cut -f5 \
> DML-pH-75-Cov5-Fem.GeneIDs

In [82]:
!head DML-pH-75-Cov5-Fem.GeneIDs
!wc -l DML-pH-75-Cov5-Fem.GeneIDs

117683566
105340512
105319125
105326856
105347948
105322687
105337787
105337787
105337787
105337787
     301 DML-pH-75-Cov5-Fem.GeneIDs


In [96]:
#Paste gene IDs with the DML file
#Keep gene ID, chr, start, and end
#Sort by gene ID for joining
!paste DML-pH-75-Cov5-Fem.GeneIDs ../../output/10_DML-characterization/DML-pH-75-Cov5-Fem-Gene-wb.bed \
| awk -F'\t' -v OFS='\t' '{print $1, $2, $3, $4}' \
| sort \
> DML-pH-75-Cov5-Fem.GeneIDs.geneOverlap

In [100]:
!head DML-pH-75-Cov5-Fem.GeneIDs.geneOverlap
!wc -l DML-pH-75-Cov5-Fem.GeneIDs.geneOverlap

105317473	NC_047561.1	12946567	12946569
105317885	NC_047562.1	13927784	13927786
105317888	NC_047565.1	30498760	30498762
105318419	NC_047564.1	51737035	51737037
105318826	NC_047567.1	33591814	33591816
105318949	NC_047562.1	47310318	47310320
105319080	NC_047562.1	46825715	46825717
105319092	NC_047564.1	26558120	26558122
105319092	NC_047564.1	26558163	26558165
105319092	NC_047564.1	26558169	26558171
     301 DML-pH-75-Cov5-Fem.GeneIDs.geneOverlap


#### Join with transcript IDs

In [134]:
#Join files to get transcript ID for DML
!join -1 1 -2 1 -t $'\t' \
DML-pH-75-Cov5-Fem.GeneIDs.geneOverlap \
cgigas_uk_roslin_v1_mRNA.transcriptID.geneID \
| uniq | awk -F'\t' -v OFS='\t' '{print $5, $1, $2, $3, $4}' \
| sort \
> DML-pH-75-Cov5-Fem.GeneIDs.geneOverlap.transcriptIDs

In [135]:
#Col names: transcript IDs, gene IDs, chr, start, end
!head DML-pH-75-Cov5-Fem.GeneIDs.geneOverlap.transcriptIDs
!wc -l DML-pH-75-Cov5-Fem.GeneIDs.geneOverlap.transcriptIDs

XM_011414688.3	105317888	NC_047565.1	30498760	30498762
XM_011414689.3	105317888	NC_047565.1	30498760	30498762
XM_011415535.3	105318419	NC_047564.1	51737035	51737037
XM_011415536.3	105318419	NC_047564.1	51737035	51737037
XM_011415537.3	105318419	NC_047564.1	51737035	51737037
XM_011416280.3	105318949	NC_047562.1	47310318	47310320
XM_011416487.3	105319080	NC_047562.1	46825715	46825717
XM_011416488.3	105319080	NC_047562.1	46825715	46825717
XM_011416528.3	105319125	NC_047559.1	8166690	8166692
XM_011416529.3	105319125	NC_047559.1	8166690	8166692
    1021 DML-pH-75-Cov5-Fem.GeneIDs.geneOverlap.transcriptIDs


#### Join with annotations

In [166]:
#Join files to get GO annotations for DML
!join -1 1 -2 1 -t $'\t' \
DML-pH-75-Cov5-Fem.GeneIDs.geneOverlap.transcriptIDs \
Blastquery-GOslim.tab \
| uniq \
| sort \
> DML-pH-75-Cov5-Fem.GeneIDs.geneOverlap.transcriptIDs.GOAnnot

In [167]:
!head DML-pH-75-Cov5-Fem.GeneIDs.geneOverlap.transcriptIDs.GOAnnot
!wc -l DML-pH-75-Cov5-Fem.GeneIDs.geneOverlap.transcriptIDs.GOAnnot

XM_011414688.3	105317888	NC_047565.1	30498760	30498762	GO:0005200	structural constituent of cytoskeleton	cytoskeletal activity	F
XM_011414688.3	105317888	NC_047565.1	30498760	30498762	GO:0005524	ATP binding	other molecular function	F
XM_011414688.3	105317888	NC_047565.1	30498760	30498762	GO:0005737	cytoplasm	other cellular component	C
XM_011414688.3	105317888	NC_047565.1	30498760	30498762	GO:0005885	Arp2/3 protein complex	cytoskeleton	C
XM_011414688.3	105317888	NC_047565.1	30498760	30498762	GO:0005903	brush border	other cellular component	C
XM_011414688.3	105317888	NC_047565.1	30498760	30498762	GO:0005911	cell-cell junction	other membranes	C
XM_011414688.3	105317888	NC_047565.1	30498760	30498762	GO:0005911	cell-cell junction	plasma membrane	C
XM_011414688.3	105317888	NC_047565.1	30498760	30498762	GO:0005925	focal adhesion	other membranes	C
XM_011414688.3	105317888	NC_047565.1	30498760	30498762	GO:0005925	focal adhesion	plasma membrane	C
XM_011414688.3	105317888	NC_047565.1	30498760	304
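
Taken together, the joins in this section expand each DML to every transcript of the overlapping gene, and then to every GOslim annotation of each transcript. A small in-memory sketch of that chain follows. The dictionaries are hypothetical stand-ins for the intermediate files, populated with a few values from the outputs above, and `annotate_dml` is a hypothetical name. `XM_011414689.3` is left out of the annotation dictionary only to illustrate how a transcript without annotations drops out of the final join.

```python
def annotate_dml(dml_rows, gene_to_transcripts, transcript_to_go):
    """Expand (gene ID, chr, start, end) rows to one row per transcript and GO term."""
    annotated = []
    for gene_id, chrom, start, end in dml_rows:
        for transcript_id in gene_to_transcripts.get(gene_id, []):
            for go_row in transcript_to_go.get(transcript_id, []):
                annotated.append((transcript_id, gene_id, chrom, start, end) + go_row)
    # Remove duplicate rows and sort, as sort | uniq would
    return sorted(set(annotated))


# Stand-in for DML-pH-75-Cov5-Fem.GeneIDs.geneOverlap
dml_rows = [("105317888", "NC_047565.1", "30498760", "30498762")]
# Stand-in for cgigas_uk_roslin_v1_mRNA.transcriptID.geneID
gene_to_transcripts = {"105317888": ["XM_011414688.3", "XM_011414689.3"]}
# Stand-in for Blastquery-GOslim.tab
transcript_to_go = {
    "XM_011414688.3": [
        ("GO:0005200", "structural constituent of cytoskeleton", "cytoskeletal activity", "F"),
        ("GO:0005524", "ATP binding", "other molecular function", "F"),
    ],
}

for row in annotate_dml(dml_rows, gene_to_transcripts, transcript_to_go):
    print("\t".join(row))
```

One consequence of this expansion is that a single DML can appear on many lines, once per transcript and GO term.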

### 2c. Indeterminate-DML

#### Format file

In [168]:
!head ../../output/10_DML-characterization/DML-pH-100-Cov5-Ind-Gene-wb.bed

NC_047559.1	738014	738016	-100	NC_047559.1	Gnomon	gene	732799	748662	.	+	.	ID=gene-LOC105328744;Dbxref=GeneID:105328744;Name=LOC105328744;gbkey=Gene;gene=LOC105328744;gene_biotype=protein_coding
NC_047559.1	1006145	1006147	100	NC_047559.1	Gnomon	gene	990880	1010116	.	+	.	ID=gene-LOC105320530;Dbxref=GeneID:105320530;Name=LOC105320530;gbkey=Gene;gene=LOC105320530;gene_biotype=protein_coding
NC_047559.1	1715466	1715468	100	NC_047559.1	Gnomon	gene	1712178	1732215	.	-	.	ID=gene-LOC105327355;Dbxref=GeneID:105327355;Name=LOC105327355;gbkey=Gene;gene=LOC105327355;gene_biotype=protein_coding
NC_047559.1	2193954	2193956	-100	NC_047559.1	Gnomon	gene	2191064	2198259	.	+	.	ID=gene-LOC105319884;Dbxref=GeneID:105319884;Name=LOC105319884;gbkey=Gene;gene=LOC105319884;gene_biotype=protein_coding
NC_047559.1	3595157	3595159	-100	NC_047559.1	Gnomon	gene	3595061	3605639	.	-	.	ID=gene-LOC105338359;Dbxref=GeneID:105338359;Name=LOC105338359;gbkey=Gene;gene=LOC105338359;gene_biotype=protein_coding
NC_0475

In [169]:
#Isolate column with gene ID information
#Convert =, :, and ; to \t
#Isolate gene ID

!cut -f13 ../../output/10_DML-characterization/DML-pH-100-Cov5-Ind-Gene-wb.bed \
| tr "=:;" "\t" \
| cut -f5 \
> DML-pH-100-Cov5-Ind.GeneIDs

In [170]:
!head DML-pH-100-Cov5-Ind.GeneIDs
!wc -l DML-pH-100-Cov5-Ind.GeneIDs

105328744
105320530
105327355
105319884
105338359
105338358
109617077
105326456
105323254
105323247
    2642 DML-pH-100-Cov5-Ind.GeneIDs


In [173]:
#Paste gene IDs with the DML file
#Keep gene ID, chr, start, and end
#Sort by gene ID for joining
!paste DML-pH-100-Cov5-Ind.GeneIDs ../../output/10_DML-characterization/DML-pH-100-Cov5-Ind-Gene-wb.bed \
| awk -F'\t' -v OFS='\t' '{print $1, $2, $3, $4}' \
| sort \
> DML-pH-100-Cov5-Ind.GeneIDs.geneOverlap

In [174]:
!head DML-pH-100-Cov5-Ind.GeneIDs.geneOverlap
!wc -l DML-pH-100-Cov5-Ind.GeneIDs.geneOverlap

105317044	NC_047560.1	1300547	1300549
105317044	NC_047560.1	1309141	1309143
105317066	NC_047561.1	798026	798028
105317076	NC_047568.1	32563018	32563020
105317114	NC_047563.1	39039437	39039439
105317154	NC_047567.1	17056242	17056244
105317173	NC_047561.1	17784403	17784405
105317194	NC_047563.1	7508457	7508459
105317220	NC_047562.1	48730553	48730555
105317220	NC_047562.1	48759032	48759034
    2642 DML-pH-100-Cov5-Ind.GeneIDs.geneOverlap


#### Join with transcript IDs

In [175]:
#Join files to get transcript ID for DML
!join -1 1 -2 1 -t $'\t' \
DML-pH-100-Cov5-Ind.GeneIDs.geneOverlap \
cgigas_uk_roslin_v1_mRNA.transcriptID.geneID \
| uniq | awk -F'\t' -v OFS='\t' '{print $5, $1, $2, $3, $4}' \
| sort \
> DML-pH-100-Cov5-Ind.GeneIDs.geneOverlap.transcriptIDs

In [176]:
#Col names: transcript IDs, gene IDs, chr, start, end
!head DML-pH-100-Cov5-Ind.GeneIDs.geneOverlap.transcriptIDs
!wc -l DML-pH-100-Cov5-Ind.GeneIDs.geneOverlap.transcriptIDs

NM_001305324.1	105343591	NC_047564.1	6960513	6960515
NM_001305324.1	105343591	NC_047564.1	6970460	6970462
NM_001305351.1	105327034	NC_047566.1	46030575	46030577
NM_001305361.1	105344976	NC_047564.1	10237861	10237863
NM_001305378.1	105344505	NC_047565.1	43638670	43638672
NM_001308886.1	105343370	NC_047568.1	29512915	29512917
XM_011413601.3	105317066	NC_047561.1	798026	798028
XM_011413602.3	105317066	NC_047561.1	798026	798028
XM_011413700.3	105317154	NC_047567.1	17056242	17056244
XM_011413706.3	105317154	NC_047567.1	17056242	17056244
    8775 DML-pH-100-Cov5-Ind.GeneIDs.geneOverlap.transcriptIDs


#### Join with annotations

In [179]:
#Join files to get GO annotations for DML
!join -1 1 -2 1 -t $'\t' \
DML-pH-100-Cov5-Ind.GeneIDs.geneOverlap.transcriptIDs \
Blastquery-GOslim.tab \
| sort \
| uniq \
> DML-pH-100-Cov5-Ind.GeneIDs.geneOverlap.transcriptIDs.GOAnnot

In [180]:
!head DML-pH-100-Cov5-Ind.GeneIDs.geneOverlap.transcriptIDs.GOAnnot
!wc -l DML-pH-100-Cov5-Ind.GeneIDs.geneOverlap.transcriptIDs.GOAnnot

XM_011413601.3	105317066	NC_047561.1	798026	798028	GO:0004672	protein kinase activity	kinase activity	F
XM_011413601.3	105317066	NC_047561.1	798026	798028	GO:0004707	MAP kinase activity	kinase activity	F
XM_011413601.3	105317066	NC_047561.1	798026	798028	GO:0004707	MAP kinase activity	signal transduction activity	F
XM_011413601.3	105317066	NC_047561.1	798026	798028	GO:0005524	ATP binding	other molecular function	F
XM_011413601.3	105317066	NC_047561.1	798026	798028	GO:0005634	nucleus	nucleus	C
XM_011413601.3	105317066	NC_047561.1	798026	798028	GO:0005737	cytoplasm	other cellular component	C
XM_011413601.3	105317066	NC_047561.1	798026	798028	GO:0005829	cytosol	cytosol	C
XM_011413601.3	105317066	NC_047561.1	798026	798028	GO:0006468	protein amino acid phosphorylation	protein metabolism	P
XM_011413601.3	105317066	NC_047561.1	798026	798028	GO:0007049	cell cycle	cell cycle and proliferation	P
XM_011413601.3	105317066	NC_047561.1	798026	798028	GO:0010468	regulation of gene expression	other met

### 2d. All 5x CpGs (CpG background)