# Characterizing the general methylation landscape

In this notebook, I will characterize the general methylation landscape. This will provide context I need to understand the significance of differentially methylated loci I obtain with `methylKit`. To characterize CpG methylation, I will use individual samples, as well as a union BEDgraph that concatenates all sample information.

1. Concatenate coverage information
2. Characterize methylation for each CpG dinucleotide in individual samples and union BEDgraph
2. Determine genomic location of methylated, sparsely methylated, and unmethylated CpGs

## 0. Set working directory

In [1]:
!pwd

/Users/yaamini/Documents/project-oyster-oa/code/Haws


In [2]:
cd ../output/

[Errno 2] No such file or directory: '../output/'
/Users/yaamini/Documents/project-oyster-oa/code/Haws


In [3]:
#!mkdir 06-methylation-landscape

In [4]:
cd 06-methylation-landscape/

/Users/yaamini/Documents/project-oyster-oa/code/Haws/06-methylation-landscape


In [5]:
#Install pandas for this notebook
import pandas as pd
print(pd.__version__)

0.18.1


## 1. Obtain sample BEDgraphs

In [6]:
#Download 5x bedgraphs
!wget -r \
--no-check-certificate --no-directories --no-parent --reject "index.html*" \
-P . \
-A "*5x.bedgraph" https://gannet.fish.washington.edu/spartina/project-oyster-oa/Haws/bismark-2/

--2021-05-17 20:50:07--  https://gannet.fish.washington.edu/spartina/project-oyster-oa/Haws/bismark-2/
Resolving gannet.fish.washington.edu (gannet.fish.washington.edu)... 128.95.149.52
Connecting to gannet.fish.washington.edu (gannet.fish.washington.edu)|128.95.149.52|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘./index.html.tmp’

index.html.tmp          [ <=>                ] 168.69K  --.-KB/s    in 0.004s  

2021-05-17 20:50:12 (45.3 MB/s) - ‘./index.html.tmp’ saved [172743]

Loading robots.txt; please ignore errors.
--2021-05-17 20:50:12--  https://gannet.fish.washington.edu/robots.txt
Reusing existing connection to gannet.fish.washington.edu:443.
HTTP request sent, awaiting response... 404 Not Found
2021-05-17 20:50:12 ERROR 404: Not Found.

Removing ./index.html.tmp since it should be rejected.

--2021-05-17 20:50:12--  https://gannet.fish.washington.edu/spartina/pr

In [7]:
#Check directory for all files
!ls -lh

total 11482968
-rw-r--r--  1 yaamini  staff   240M Mar 11 02:25 zr3644_10_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
-rw-r--r--  1 yaamini  staff   251M Mar 11 02:25 zr3644_11_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
-rw-r--r--  1 yaamini  staff   227M Mar 11 02:25 zr3644_12_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
-rw-r--r--  1 yaamini  staff   247M Mar 11 02:26 zr3644_13_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
-rw-r--r--  1 yaamini  staff   191M Mar 11 02:26 zr3644_14_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
-rw-r--r--  1 yaamini  staff   258M Mar 11 02:26 zr3644_15_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
-rw-r--r--  1 yaamini  staff   227M Mar 11 02:26 zr3644_16_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
-rw-r--r--  1 yaamini  staff   244M Mar 11 02:27 zr3644_17_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
-rw-r--r--  1 yaamini  staff   235M Mar 11 02:27 zr3644_18_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
-rw

In [8]:
#Obtain md5
!md5 *

MD5 (zr3644_10_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph) = f0dc26c38229b3640fa93fb29e1fa491
MD5 (zr3644_11_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph) = 0fd1e7003a0cb80de0e094cfdb8a7d0a
MD5 (zr3644_12_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph) = f4c8c3b70c40770c6d3376a2b7140925
MD5 (zr3644_13_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph) = 54cc1f3a915e03c34aa905fff5be2b63
MD5 (zr3644_14_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph) = cba3c994a0dc7502a64c5e0ae2c8727d
MD5 (zr3644_15_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph) = 9d54e2bd92b198b7ba4ab036d297b801
MD5 (zr3644_16_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph) = 54a28e3fd4ce60f6908ff11fc84e72c2
MD5 (zr3644_17_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph) = 650e074555739b4aac40df54abc79814
MD5 (zr3644_18_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph) = 2736303b8d17bce072892b08ac1ad978
MD5 (zr3644_19_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph) = 28ec1746f1cd1122eaea62bec434d38d


## 2. Concatenate coverage information

I will use `unionBedGraphs` to concatenate information for all loci across samples.

In [9]:
bedtoolsDirectory = "/Users/Shared/bioinformatics/bedtools2/bin/"

In [10]:
!{bedtoolsDirectory}unionBedGraphs -h


Tool:    bedtools unionbedg (aka unionBedGraphs)
Version: v2.26.0
Summary: Combines multiple BedGraph files into a single file,
	 allowing coverage comparisons between them.

Usage:   bedtools unionbedg [OPTIONS] -i FILE1 FILE2 .. FILEn
	 Assumes that each BedGraph file is sorted by chrom/start 
	 and that the intervals in each are non-overlapping.

Options: 
	-header		Print a header line.
			(chrom/start/end + names of each file).

	-names		A list of names (one/file) to describe each file in -i.
			These names will be printed in the header line.

	-g		Use genome file to calculate empty regions.
			- STRING.

	-empty		Report empty regions (i.e., start/end intervals w/o
			values in all files).
			- Requires the '-g FILE' parameter.

	-filler TEXT	Use TEXT when representing intervals having no value.
			- Default is '0', but you can use 'N/A' or any text.

	-examples	Show detailed usage examples.



### 2a. Create a union BEDgraph

In [11]:
%%bash

for f in *5x.bedgraph
do
/Users/Shared/bioinformatics/bedtools2/bin/sortBed \
-i ${f} \
> $(basename ${f%_5x.bedgraph})_5x.sort.bedgraph
done

In [12]:
!ls *sort*

zr3644_10_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3644_11_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3644_12_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3644_13_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3644_14_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3644_15_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3644_16_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3644_17_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3644_18_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3644_19_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3644_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3644_20_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3644_21_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3644_22_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3644_23_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3644_24_R1_val_1_val_1_v

In [41]:
#Create union BEDgraph from sorted files
#Include a header
#Use N/A when there is no data for a CpG in a sample
#Define sample IDs
#Use sorted bedgraphs
#Save output
!{bedtoolsDirectory}unionBedGraphs \
-header \
-filler N/A \
-names 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 \
-i \
*5x.sort.bedgraph \
> union_5x.bedgraph

In [42]:
#Check output
!head union_5x.bedgraph
!wc -l union_5x.bedgraph

chrom	start	end	1	2	3	4	5	6	7	8
NC_001276.1	34	36	0.000000	1.672241	0.000000	0.886918	0.647249	0.294985	0.735294	1.428571
NC_001276.1	123	125	0.531915	0.582751	0.218579	0.502513	0.446927	0.544070	0.146951	0.400534
NC_001276.1	305	307	0.313480	0.303490	0.344828	0.418410	0.480769	0.869565	0.195312	0.000000
NC_001276.1	433	435	0.243112	0.106496	0.207900	0.285103	0.437158	0.361011	0.446999	0.479616
NC_001276.1	457	459	0.148368	0.476644	0.191022	0.194426	0.096993	0.336134	0.348028	0.778643
NC_001276.1	482	484	0.409500	0.418848	0.110619	0.292612	0.425532	0.486855	0.197759	0.633714
NC_001276.1	609	611	0.072993	0.484966	0.285442	0.333556	0.632911	0.000000	0.174419	0.227790
NC_001276.1	781	783	0.330852	0.209644	0.461361	0.289226	0.417101	0.393314	0.592885	0.368098
NC_001276.1	826	828	0.435540	0.226757	0.241838	0.231660	0.228833	0.623701	0.373413	0.392670
 11238223 union_5x.bedgraph


### 2b. Manipulate with `pandas`

In [25]:
#Import union data into pandas
#Check head
df = pd.read_table("union_5x.bedgraph")
df.head(5)

Unnamed: 0,chrom,start,end,1,2,3,4,5,6,7,8
0,NC_001276.1,34,36,0.0,1.672241,0.0,0.886918,0.647249,0.294985,0.735294,1.428571
1,NC_001276.1,123,125,0.531915,0.582751,0.218579,0.502513,0.446927,0.54407,0.146951,0.400534
2,NC_001276.1,305,307,0.31348,0.30349,0.344828,0.41841,0.480769,0.869565,0.195312,0.0
3,NC_001276.1,433,435,0.243112,0.106496,0.2079,0.285103,0.437158,0.361011,0.446999,0.479616
4,NC_001276.1,457,459,0.148368,0.476644,0.191022,0.194426,0.096993,0.336134,0.348028,0.778643


In [26]:
#Average all samples for total genome methylation information and save as a new column
#NA are not included in averages
#Check output
df['total'] = df[['1', '2', '3', '4', '5', '6', '7', '8']].mean(axis=1)
df.tail(10)

Unnamed: 0,chrom,start,end,1,2,3,4,5,6,7,8,total
11238212,NW_022994998.1,54770,54772,3.333333,0.0,0.0,0.0,0.0,42.857143,0.0,0.0,5.77381
11238213,NW_022994998.1,54834,54836,0.0,3.125,0.0,0.0,0.0,0.0,0.0,0.0,0.390625
11238214,NW_022994998.1,54843,54845,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11238215,NW_022994998.1,54860,54862,0.0,0.0,0.0,0.0,0.0,0.0,8.333333,0.0,1.041667
11238216,NW_022994998.1,54872,54874,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11238217,NW_022994998.1,54934,54936,0.0,0.0,0.0,0.0,,0.0,0.0,,0.0
11238218,NW_022994998.1,54949,54951,0.0,0.0,0.0,0.0,,0.0,0.0,,0.0
11238219,NW_022994998.1,54953,54955,0.0,0.0,0.0,0.0,,0.0,0.0,,0.0
11238220,NW_022994998.1,54958,54960,0.0,0.0,0.0,28.571429,,0.0,0.0,,4.761905
11238221,NW_022994998.1,55001,55003,,,0.0,,,,,,0.0


In [35]:
#Save dataframe in a tabular format and include N/As. Do not include quotes.
df.to_csv("union-averages.bedgraph", sep = "\t", na_rep = "N/A", quoting = 3)

In [36]:
#Check pandas manipulations
!head union-averages.bedgraph

	chrom	start	end	1	2	3	4	5	6	7	8	total
0	NC_001276.1	34	36	0.0	1.672241	0.0	0.886918	0.647249	0.294985	0.735294	1.4285709999999998	0.70815725
1	NC_001276.1	123	125	0.5319149999999999	0.582751	0.218579	0.502513	0.446927	0.5440699999999999	0.146951	0.400534	0.42178
2	NC_001276.1	305	307	0.31348000000000004	0.30349	0.344828	0.41841000000000006	0.48076899999999995	0.8695649999999999	0.19531199999999999	0.0	0.36573174999999997
3	NC_001276.1	433	435	0.243112	0.10649600000000001	0.2079	0.285103	0.437158	0.361011	0.446999	0.479616	0.320924375
4	NC_001276.1	457	459	0.148368	0.47664399999999996	0.191022	0.194426	0.096993	0.336134	0.348028	0.778643	0.32128225
5	NC_001276.1	482	484	0.4095	0.41884799999999994	0.11061900000000001	0.292612	0.42553199999999997	0.48685500000000004	0.197759	0.633714	0.37192987499999997
6	NC_001276.1	609	611	0.07299299999999999	0.484966	0.28544200000000003	0.333556	0.632911	0.0	0.17441900000000002	0.22779000000000002	0.27650962500000004
7	NC_001276.1	781	783	0.33

In [37]:
#Remove header
#Keep chr, start, end, and the average
#Save output
! tail -n+2 union-averages.bedgraph \
| awk -F'\t' -v OFS='\t' '{print $2, $3, $4, $13}' \
> zr3616_union-averages_5x.bedgraph

In [38]:
#Check output: chr, start, end, average %meth
!head zr3616_union-averages_5x.bedgraph
!wc -l zr3616_union-averages_5x.bedgraph

NC_001276.1	34	36	0.70815725
NC_001276.1	123	125	0.42178
NC_001276.1	305	307	0.36573174999999997
NC_001276.1	433	435	0.320924375
NC_001276.1	457	459	0.32128225
NC_001276.1	482	484	0.37192987499999997
NC_001276.1	609	611	0.27650962500000004
NC_001276.1	781	783	0.38281012499999995
NC_001276.1	826	828	0.3443015
NC_001276.1	951	953	0.34978312500000003


## 3. Characterize methylation for each CpG dinucleotude

- methylated: ≥ 50%
- sparsely methylated: 10-50%
- unmethylated: ≤ 10%

In [44]:
#8 individual sample files + 1 union bedgraph
!find zr3616*5x.bedgraph

zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
zr3616_5_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
zr3616_6_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
zr3616_7_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
zr3616_8_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
zr3616_union-averages_5x.bedgraph


### 3a. Methylated loci

In [45]:
%%bash
for f in zr3616*5x.bedgraph
do
    awk '{if ($4 >= 50) { print $1, $2, $3, $4 }}' ${f} \
    > ${f}-Meth
done

In [46]:
!head *-Meth

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth <==
NC_047559.1 1547 1549 100.000000
NC_047559.1 1571 1573 87.500000
NC_047559.1 2267 2269 100.000000
NC_047559.1 2291 2293 100.000000
NC_047559.1 4073 4075 60.000000
NC_047559.1 4791 4793 64.705882
NC_047559.1 4835 4837 88.235294
NC_047559.1 4843 4845 81.250000
NC_047559.1 5605 5607 60.000000
NC_047559.1 5613 5615 66.666667

==> zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth <==
NC_047559.1 1571 1573 83.333333
NC_047559.1 2267 2269 100.000000
NC_047559.1 2291 2293 100.000000
NC_047559.1 2448 2450 100.000000
NC_047559.1 2988 2990 81.818182
NC_047559.1 3902 3904 57.142857
NC_047559.1 3916 3918 100.000000
NC_047559.1 4073 4075 90.909091
NC_047559.1 4270 4272 57.142857
NC_047559.1 4791 4793 71.428571

==> zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth <==
NC_047559.1 1571 1573 100.000000
NC_047559.1 2291 2293 100.000000
NC_047559.1 2448 2450 100.000000
NC_047559.1 2988 2990 100.000000
NC_047

In [47]:
!wc -l *-Meth

  911899 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
  898703 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
  922047 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
  952973 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
  866480 zr3616_5_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
  902192 zr3616_6_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
  955157 zr3616_7_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
  895682 zr3616_8_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
 1029894 zr3616_union-averages_5x.bedgraph-Meth
 8335027 total


In [53]:
#Get line counts for each fine
# Remove 10th line (total entries)
#Ensure output is tab-delimited
#Save output
!wc -l *-Meth \
| sed '10,$ d' \
| awk '{print $1"\t"$2}' \
> zr3616_5x-Meth-counts.txt

In [54]:
!head zr3616_5x-Meth-counts.txt

911899	zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
898703	zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
922047	zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
952973	zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
866480	zr3616_5_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
902192	zr3616_6_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
955157	zr3616_7_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
895682	zr3616_8_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
1029894	zr3616_union-averages_5x.bedgraph-Meth


### 3b. Sparsely methylated loci

In [55]:
%%bash
for f in zr3616*5x.bedgraph
do
    awk '{if ($4 < 50) { print $1, $2, $3, $4}}' ${f} \
    | awk '{if ($4 > 10) { print $1, $2, $3, $4 }}' \
    > ${f}-sparseMeth
done

In [56]:
!head *-sparseMeth

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth <==
NC_047559.1 4887 4889 20.000000
NC_047559.1 4909 4911 33.333333
NC_047559.1 5500 5502 25.000000
NC_047559.1 7716 7718 40.000000
NC_047559.1 7814 7816 47.058824
NC_047559.1 9237 9239 29.166667
NC_047559.1 9658 9660 37.500000
NC_047559.1 9661 9663 22.222222
NC_047559.1 10899 10901 12.195122
NC_047559.1 18234 18236 11.111111

==> zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth <==
NC_001276.1 4444 4446 14.285714
NC_047559.1 4397 4399 47.058824
NC_047559.1 4909 4911 42.857143
NC_047559.1 5605 5607 42.857143
NC_047559.1 5613 5615 37.500000
NC_047559.1 7716 7718 20.000000
NC_047559.1 7850 7852 40.000000
NC_047559.1 8970 8972 11.111111
NC_047559.1 8979 8981 14.285714
NC_047559.1 9658 9660 33.333333

==> zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth <==
NC_047559.1 4073 4075 42.857143
NC_047559.1 8970 8972 25.000000
NC_047559.1 8979 8981 11.111111

In [57]:
!wc -l *-sparseMeth

  689001 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
  689951 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
  658036 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
  706382 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
  630216 zr3616_5_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
  682731 zr3616_6_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
  628525 zr3616_7_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
  670381 zr3616_8_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
 1006406 zr3616_union-averages_5x.bedgraph-sparseMeth
 6361629 total


In [58]:
#Get line counts for each fine
# Remove 10th line (total entries)
#Ensure output is tab-delimited
#Save output
!wc -l *-sparseMeth \
| sed '10,$ d' \
| awk '{print $1"\t"$2}' \
> zr3616_5x-sparseMeth-counts.txt

In [59]:
!head zr3616_5x-sparseMeth-counts.txt

689001	zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
689951	zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
658036	zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
706382	zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
630216	zr3616_5_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
682731	zr3616_6_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
628525	zr3616_7_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
670381	zr3616_8_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
1006406	zr3616_union-averages_5x.bedgraph-sparseMeth


### 3c. Unmethylated loci

In [60]:
%%bash
for f in zr3616*5x.bedgraph
do
    awk '{if ($4 <= 10) { print $1, $2, $3, $4 }}' ${f} \
    > ${f}-unMeth
done

In [61]:
!head *-unMeth

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth <==
NC_001276.1 34 36 0.000000
NC_001276.1 123 125 0.531915
NC_001276.1 305 307 0.313480
NC_001276.1 433 435 0.243112
NC_001276.1 457 459 0.148368
NC_001276.1 482 484 0.409500
NC_001276.1 609 611 0.072993
NC_001276.1 781 783 0.330852
NC_001276.1 826 828 0.435540
NC_001276.1 951 953 0.305499

==> zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth <==
NC_001276.1 34 36 1.672241
NC_001276.1 123 125 0.582751
NC_001276.1 305 307 0.303490
NC_001276.1 433 435 0.106496
NC_001276.1 457 459 0.476644
NC_001276.1 482 484 0.418848
NC_001276.1 609 611 0.484966
NC_001276.1 781 783 0.209644
NC_001276.1 826 828 0.226757
NC_001276.1 951 953 0.276243

==> zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth <==
NC_001276.1 34 36 0.000000
NC_001276.1 123 125 0.218579
NC_001276.1 305 307 0.344828
NC_001276.1 433 435 0.207900
NC_001276.1 457 459 0.191022
NC_001276.1 482 484 0.110619
NC_001276.1 609 611 0.285442
NC

In [62]:
!wc -l *-unMeth

 6821711 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
 6879957 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
 6804581 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
 7030213 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
 6565845 zr3616_5_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
 6886482 zr3616_6_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
 6641450 zr3616_7_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
 6750245 zr3616_8_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
 9201922 zr3616_union-averages_5x.bedgraph-unMeth
 63582406 total


In [63]:
#Get line counts for each fine
# Remove 10th line (total entries)
#Ensure output is tab-delimited
#Save output
!wc -l *-unMeth \
| sed '10,$ d' \
| awk '{print $1"\t"$2}' \
> zr3616_5x-unMeth-counts.txt

In [64]:
!head zr3616_5x-unMeth-counts.txt

6821711	zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
6879957	zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
6804581	zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
7030213	zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
6565845	zr3616_5_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
6886482	zr3616_6_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
6641450	zr3616_7_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
6750245	zr3616_8_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
9201922	zr3616_union-averages_5x.bedgraph-unMeth


## 4. Characterize genomic location of CpGs

I will identify overlaps between CpG loci (methylated, sparsely methylated, unmethylated) and various genome feature tracks:

- gene
- exon UTR
- CDS
- intron
- upstream flanks
- downstream flanks
- intergenic regions
- lncRNA
- transposable elements

Since the exon track = exon UTR + CDS, and mRNA = exon + intron, I will not need to use those tracks separately.

### 4a. Create BEDfiles

In [67]:
#9 file types (8 samples + 1 union), 3 files per type (Meth, sparseMeth, unMeth) = 27 total
!find zr3616*5x.bedgraph-*
!find zr3616*5x.bedgraph-* | wc -l

zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
zr3616_5_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth
zr3616_5_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth
zr3616_5_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth
zr3616_6_R1_val

In [68]:
%%bash

for f in zr3616*5x.bedgraph-*
do
    awk '{print $1"\t"$2"\t"$3}' ${f} > ${f}.bed
    wc -l ${f}.bed
done

  911899 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed
  689001 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed
 6821711 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed
  898703 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed
  689951 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed
 6879957 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed
  922047 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed
  658036 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed
 6804581 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed
  952973 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed
  706382 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed
 7030213 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed
  866480 zr3616_5_R1_val_1_val_1_val_1_bismark_bt2_pe._5

### 4b. Gene

In [72]:
%%bash
for f in *bed
do
    /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
    -u \
    -a ${f} \
    -b ../../genome-feature-files/cgigas_uk_roslin_v1_gene.gff \
    > ${f}-Gene
done

In [73]:
#Check output
!head *Gene

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-Gene <==
NC_047559.1	360322	360324
NC_047559.1	360361	360363
NC_047559.1	361347	361349
NC_047559.1	364329	364331
NC_047559.1	375341	375343
NC_047559.1	375360	375362
NC_047559.1	376111	376113
NC_047559.1	376170	376172
NC_047559.1	376962	376964
NC_047559.1	377335	377337

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-Gene <==
NC_047559.1	10899	10901
NC_047559.1	18234	18236
NC_047559.1	61081	61083
NC_047559.1	68070	68072
NC_047559.1	100249	100251
NC_047559.1	100276	100278
NC_047559.1	100305	100307
NC_047559.1	100319	100321
NC_047559.1	100440	100442
NC_047559.1	100454	100456

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-Gene <==
NC_047559.1	10270	10272
NC_047559.1	10292	10294
NC_047559.1	10314	10316
NC_047559.1	10358	10360
NC_047559.1	10380	10382
NC_047559.1	10391	10393
NC_047559.1	10402	10404
NC_047559.1	10413	10415
NC_047559.1	10457	10459
NC_047559.1	10479	1048

In [74]:
#Count number of overlaps
!wc -l *Gene

  850326 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-Gene
  490366 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-Gene
 3727767 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-Gene
  836425 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-Gene
  487236 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-Gene
 3764776 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-Gene
  858969 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-Gene
  465804 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-Gene
 3726617 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-Gene
  887229 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-Gene
  498052 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-Gene
 3810154 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-G

In [79]:
#Get line counts for each fine
# Remove 28th line (total entries)
#Ensure output is tab-delimited
#Save output
!wc -l *-Gene \
| sed '28,$ d' \
| awk '{print $1"\t"$2}' \
> zr3616_5x-Gene-counts.txt

### 4c. Exon UTR

In [80]:
%%bash
for f in *bed
do
    /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
    -u \
    -a ${f} \
    -b ../../genome-feature-files/cgigas_uk_roslin_v1_exonUTR.gff \
    > ${f}-exonUTR
done

In [81]:
#Check output
!head *exonUTR

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-exonUTR <==
NC_047559.1	545229	545231
NC_047559.1	545256	545258
NC_047559.1	571906	571908
NC_047559.1	571929	571931
NC_047559.1	572049	572051
NC_047559.1	572233	572235
NC_047559.1	572245	572247
NC_047559.1	572263	572265
NC_047559.1	572332	572334
NC_047559.1	572453	572455

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-exonUTR <==
NC_047559.1	10899	10901
NC_047559.1	431124	431126
NC_047559.1	431205	431207
NC_047559.1	545687	545689
NC_047559.1	571583	571585
NC_047559.1	571838	571840
NC_047559.1	572075	572077
NC_047559.1	572153	572155
NC_047559.1	572367	572369
NC_047559.1	572615	572617

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-exonUTR <==
NC_047559.1	10920	10922
NC_047559.1	10950	10952
NC_047559.1	11000	11002
NC_047559.1	11026	11028
NC_047559.1	14214	14216
NC_047559.1	14232	14234
NC_047559.1	14243	14245
NC_047559.1	14259	14261
NC_047559.1	14319	14321
NC_0475

In [82]:
#Count number of overlaps
!wc -l *exonUTR

   52760 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-exonUTR
   35405 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-exonUTR
  435275 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-exonUTR
   49868 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-exonUTR
   35609 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-exonUTR
  440703 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-exonUTR
   53855 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-exonUTR
   34116 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-exonUTR
  435575 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-exonUTR
   55932 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-exonUTR
   35992 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-exonUTR
  440101 zr3616_4_R1_val_1_val_1_val_1_bismark

In [83]:
#Get line counts for each fine
# Remove 28th line (total entries)
#Ensure output is tab-delimited
#Save output
!wc -l *-exonUTR \
| sed '28,$ d' \
| awk '{print $1"\t"$2}' \
> zr3616_5x-exonUTR-counts.txt

### 4d. Intron

In [84]:
%%bash
for f in *bed
do
    /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
    -u \
    -a ${f} \
    -b ../../genome-feature-files/cgigas_uk_roslin_v1_intron.bed \
    > ${f}-intron
done

In [85]:
#Check output
!head *intron

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-intron <==
NC_047559.1	360322	360324
NC_047559.1	360361	360363
NC_047559.1	361347	361349
NC_047559.1	364329	364331
NC_047559.1	375341	375343
NC_047559.1	375360	375362
NC_047559.1	376111	376113
NC_047559.1	376170	376172
NC_047559.1	376962	376964
NC_047559.1	377335	377337

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-intron <==
NC_047559.1	18234	18236
NC_047559.1	61081	61083
NC_047559.1	68070	68072
NC_047559.1	100249	100251
NC_047559.1	100276	100278
NC_047559.1	100305	100307
NC_047559.1	100319	100321
NC_047559.1	100440	100442
NC_047559.1	100454	100456
NC_047559.1	101107	101109

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-intron <==
NC_047559.1	10270	10272
NC_047559.1	10292	10294
NC_047559.1	10314	10316
NC_047559.1	10358	10360
NC_047559.1	10380	10382
NC_047559.1	10391	10393
NC_047559.1	10402	10404
NC_047559.1	10413	10415
NC_047559.1	10457	10459
NC_047559.1	10

In [86]:
#Count number of overlaps
!wc -l *intron

  470535 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-intron
  384517 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-intron
 2438864 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-intron
  460539 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-intron
  380341 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-intron
 2472285 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-intron
  472620 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-intron
  368885 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-intron
 2440976 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-intron
  496028 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-intron
  395595 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-intron
 2510506 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x

In [87]:
#Get line counts for each fine
# Remove 28th line (total entries)
#Ensure output is tab-delimited
#Save output
!wc -l *-intron \
| sed '28,$ d' \
| awk '{print $1"\t"$2}' \
> zr3616_5x-intron-counts.txt

### 4e. Upstream flanks

In [88]:
%%bash
for f in *bed
do
    /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
    -u \
    -a ${f} \
    -b ../../genome-feature-files/cgigas_uk_roslin_v1_upstream.gff \
    > ${f}-upstreamFlanks
done

In [89]:
#Check output
!head *upstreamFlanks

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-upstreamFlanks <==
NC_047559.1	576634	576636
NC_047559.1	576752	576754
NC_047559.1	1468258	1468260
NC_047559.1	1800917	1800919
NC_047559.1	1800924	1800926
NC_047559.1	2253122	2253124
NC_047559.1	3763635	3763637
NC_047559.1	3763649	3763651
NC_047559.1	3763653	3763655
NC_047559.1	3763678	3763680

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-upstreamFlanks <==
NC_047559.1	9237	9239
NC_047559.1	9658	9660
NC_047559.1	9661	9663
NC_047559.1	335850	335852
NC_047559.1	335858	335860
NC_047559.1	335878	335880
NC_047559.1	335886	335888
NC_047559.1	335892	335894
NC_047559.1	576308	576310
NC_047559.1	576693	576695

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-upstreamFlanks <==
NC_047559.1	9122	9124
NC_047559.1	9140	9142
NC_047559.1	9159	9161
NC_047559.1	9240	9242
NC_047559.1	9664	9666
NC_047559.1	9774	9776
NC_047559.1	9781	9783
NC_047559.1	9787	9789
NC_047559.1	9795	979

In [90]:
#Count number of overlaps
!wc -l *upstreamFlanks

    5087 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-upstreamFlanks
   15399 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-upstreamFlanks
  373950 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-upstreamFlanks
    5059 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-upstreamFlanks
   16118 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-upstreamFlanks
  375561 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-upstreamFlanks
    5150 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-upstreamFlanks
   15296 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-upstreamFlanks
  374100 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-upstreamFlanks
    5462 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-upstreamFlanks
   16363 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph

In [91]:
#Get line counts for each fine
# Remove 28th line (total entries)
#Ensure output is tab-delimited
#Save output
!wc -l *-upstreamFlanks \
| sed '28,$ d' \
| awk '{print $1"\t"$2}' \
> zr3616_5x-upstreamFlanks-counts.txt

### 4f. Downstream flanks

In [92]:
%%bash
for f in *bed
do
    /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
    -u \
    -a ${f} \
    -b ../../genome-feature-files/cgigas_uk_roslin_v1_downstream.gff \
    > ${f}-downstreamFlanks
done

In [93]:
#Check output
!head *downstreamFlanks

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-downstreamFlanks <==
NC_047559.1	264885	264887
NC_047559.1	264911	264913
NC_047559.1	264924	264926
NC_047559.1	344440	344442
NC_047559.1	344447	344449
NC_047559.1	344477	344479
NC_047559.1	344549	344551
NC_047559.1	344794	344796
NC_047559.1	344812	344814
NC_047559.1	344829	344831

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-downstreamFlanks <==
NC_047559.1	258442	258444
NC_047559.1	264959	264961
NC_047559.1	265013	265015
NC_047559.1	265028	265030
NC_047559.1	265111	265113
NC_047559.1	326295	326297
NC_047559.1	326317	326319
NC_047559.1	344861	344863
NC_047559.1	345009	345011
NC_047559.1	433645	433647

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-downstreamFlanks <==
NC_047559.1	16061	16063
NC_047559.1	16105	16107
NC_047559.1	16112	16114
NC_047559.1	16220	16222
NC_047559.1	16260	16262
NC_047559.1	16289	16291
NC_047559.1	16310	1

In [94]:
#Count number of overlaps
!wc -l *downstreamFlanks

   21353 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-downstreamFlanks
   27281 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-downstreamFlanks
  313272 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-downstreamFlanks
   20128 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-downstreamFlanks
   27827 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-downstreamFlanks
  316409 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-downstreamFlanks
   21075 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-downstreamFlanks
   26439 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-downstreamFlanks
  314995 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-downstreamFlanks
   21718 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-downstreamFlanks
   28951 zr3616_4_R1_val_1_val_1_val_1_bismark

In [95]:
#Get line counts for each fine
# Remove 28th line (total entries)
#Ensure output is tab-delimited
#Save output
!wc -l *-downstreamFlanks \
| sed '28,$ d' \
| awk '{print $1"\t"$2}' \
> zr3616_5x-downstreamFlanks-counts.txt

### 4g. Intergenic regions

In [98]:
%%bash
for f in *bed
do
    /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
    -u \
    -a ${f} \
    -b ../../genome-feature-files/cgigas_uk_roslin_v1_intergenic.bed \
    > ${f}-intergenic
done

In [99]:
#Check output
!head *intergenic

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-intergenic <==
NC_047559.1	1547	1549
NC_047559.1	1571	1573
NC_047559.1	2267	2269
NC_047559.1	2291	2293
NC_047559.1	4073	4075
NC_047559.1	4791	4793
NC_047559.1	4835	4837
NC_047559.1	4843	4845
NC_047559.1	5605	5607
NC_047559.1	5613	5615

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-intergenic <==
NC_047559.1	4887	4889
NC_047559.1	4909	4911
NC_047559.1	5500	5502
NC_047559.1	7716	7718
NC_047559.1	7814	7816
NC_047559.1	23610	23612
NC_047559.1	24932	24934
NC_047559.1	24934	24936
NC_047559.1	26463	26465
NC_047559.1	26485	26487

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-intergenic <==
NC_047559.1	5883	5885
NC_047559.1	20252	20254
NC_047559.1	20297	20299
NC_047559.1	20319	20321
NC_047559.1	20341	20343
NC_047559.1	20363	20365
NC_047559.1	20385	20387
NC_047559.1	20407	20409
NC_047559.1	20429	20431
NC_047559.1	20451	20453

==> zr3616_2_R1_val_1_val_1_val_1_bismark_b

In [100]:
#Count number of overlaps
!wc -l *intergenic

   36707 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-intergenic
  157868 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-intergenic
 2438022 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-intergenic
   38562 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-intergenic
  160920 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-intergenic
 2454734 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-intergenic
   38486 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-intergenic
  152402 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-intergenic
 2420205 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-intergenic
   40300 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-intergenic
  165140 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-intergenic
 2544181 zr36

In [101]:
#Get line counts for each fine
# Remove 28th line (total entries)
#Ensure output is tab-delimited
#Save output
!wc -l *-intergenic \
| sed '28,$ d' \
| awk '{print $1"\t"$2}' \
> zr3616_5x-intergenic-counts.txt

### 4h. lncRNA

In [102]:
%%bash
for f in *bed
do
    /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
    -u \
    -a ${f} \
    -b ../../genome-feature-files/cgigas_uk_roslin_v1_lncRNA.gff \
    > ${f}-lncRNA
done

In [103]:
#Check output
!head *lncRNA

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-lncRNA <==
NC_047559.1	1664457	1664459
NC_047559.1	1664745	1664747
NC_047559.1	1664773	1664775
NC_047559.1	1664902	1664904
NC_047559.1	1665259	1665261
NC_047559.1	1688362	1688364
NC_047559.1	1688423	1688425
NC_047559.1	1688433	1688435
NC_047559.1	1688466	1688468
NC_047559.1	2139066	2139068

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-lncRNA <==
NC_047559.1	10899	10901
NC_047559.1	255751	255753
NC_047559.1	255789	255791
NC_047559.1	416920	416922
NC_047559.1	417337	417339
NC_047559.1	418447	418449
NC_047559.1	419514	419516
NC_047559.1	786896	786898
NC_047559.1	789201	789203
NC_047559.1	789687	789689

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-lncRNA <==
NC_047559.1	10270	10272
NC_047559.1	10292	10294
NC_047559.1	10314	10316
NC_047559.1	10358	10360
NC_047559.1	10380	10382
NC_047559.1	10391	10393
NC_047559.1	10402	10404
NC_047559.1	10413	10415
NC_047559.1	10

In [104]:
#Count number of overlaps
!wc -l *lncRNA

   17029 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-lncRNA
   25219 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-lncRNA
  220026 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-lncRNA
   16029 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-lncRNA
   25519 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-lncRNA
  223536 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-lncRNA
   16227 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-lncRNA
   24325 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-lncRNA
  222165 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-lncRNA
   17319 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-lncRNA
   26743 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-lncRNA
  226455 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x

In [105]:
#Get line counts for each fine
# Remove 28th line (total entries)
#Ensure output is tab-delimited
#Save output
!wc -l *-lncRNA \
| sed '28,$ d' \
| awk '{print $1"\t"$2}' \
> zr3616_5x-lncRNA-counts.txt

### 4i. Transposable elements

In [106]:
%%bash
for f in *bed
do
    /Users/Shared/bioinformatics/bedtools2/bin/intersectBed \
    -u \
    -a ${f} \
    -b ../../genome-feature-files/cgigas_uk_roslin_v1_rm.te.bed \
    > ${f}-TE
done

In [107]:
#Check output
!head *TE

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-TE <==
NC_047559.1	91258	91260
NC_047559.1	91312	91314
NC_047559.1	232127	232129
NC_047559.1	234978	234980
NC_047559.1	264885	264887
NC_047559.1	264911	264913
NC_047559.1	264924	264926
NC_047559.1	293248	293250
NC_047559.1	294921	294923
NC_047559.1	294970	294972

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-TE <==
NC_047559.1	26463	26465
NC_047559.1	26485	26487
NC_047559.1	26966	26968
NC_047559.1	27211	27213
NC_047559.1	44183	44185
NC_047559.1	46646	46648
NC_047559.1	47794	47796
NC_047559.1	50864	50866
NC_047559.1	50869	50871
NC_047559.1	50878	50880

==> zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-TE <==
NC_047559.1	15746	15748
NC_047559.1	24588	24590
NC_047559.1	26419	26421
NC_047559.1	26448	26450
NC_047559.1	26509	26511
NC_047559.1	26532	26534
NC_047559.1	26547	26549
NC_047559.1	26555	26557
NC_047559.1	26561	26563
NC_047559.1	26573	26575

==> zr3616_2_R1_val

In [108]:
#Count number of overlaps
!wc -l *TE

  258093 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-TE
  353014 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-TE
 2380940 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-TE
  260273 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-TE
  359780 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-TE
 2393690 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-TE
  256323 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-TE
  341791 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-TE
 2369534 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-TE
  264142 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-Meth.bed-TE
  373219 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-sparseMeth.bed-TE
 2492532 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph-unMeth.bed-TE
  236754 zr3616_5_R1

In [109]:
#Get line counts for each fine
# Remove 28th line (total entries)
#Ensure output is tab-delimited
#Save output
!wc -l *-TE \
| sed '28,$ d' \
| awk '{print $1"\t"$2}' \
> zr3616_5x-TE-counts.txt