# Characterizing the general methylation landscape

In this notebook, I will characterize the general methylation landscape. This will provide context I need to understand the significance of differentially methylated loci I obtain with `methylKit`. To characterize CpG methylation, I will use individual samples, as well as a union BEDgraph that concatenates all sample information.

1. Concatenate coverage information
2. Characterize methylation for each CpG dinucleotide in individual samples and union BEDgraph
3. Determine genomic location of methylated, sparsely methylated, and unmethylated CpGs

## 0. Set working directory

In [1]:
!pwd

/Users/yaamini/Documents/project-gigas-oa-meth/code


In [2]:
cd ../output/

/Users/yaamini/Documents/project-gigas-oa-meth/output


In [3]:
#!mkdir 09-methylation-landscape

In [4]:
cd 09-methylation-landscape/

/Users/yaamini/Documents/project-gigas-oa-meth/output/09-methylation-landscape


In [5]:
#Import pandas for this notebook and check the version
import pandas as pd
print(pd.__version__)

0.18.1


## 1. Obtain sample BEDgraphs

In [9]:
#Download 5x bedgraphs
!wget -r \
--no-check-certificate --no-directories --no-parent --reject "index.html*" \
-P . \
-A "*5x.bedgraph" https://gannet.fish.washington.edu/spartina/project-gigas-oa-meth/output/bismark-roslin/

--2021-05-08 16:59:53--  https://gannet.fish.washington.edu/spartina/project-gigas-oa-meth/output/bismark-roslin/
Resolving gannet.fish.washington.edu (gannet.fish.washington.edu)... 128.95.149.52
Connecting to gannet.fish.washington.edu (gannet.fish.washington.edu)|128.95.149.52|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘./index.html.tmp’

index.html.tmp          [ <=>                ]  57.65K  --.-KB/s    in 0.002s  

2021-05-08 16:59:55 (36.9 MB/s) - ‘./index.html.tmp’ saved [59037]

Loading robots.txt; please ignore errors.
--2021-05-08 16:59:55--  https://gannet.fish.washington.edu/robots.txt
Reusing existing connection to gannet.fish.washington.edu:443.
HTTP request sent, awaiting response... 404 Not Found
2021-05-08 16:59:55 ERROR 404: Not Found.

Removing ./index.html.tmp since it should be rejected.

--2021-05-08 16:59:55--  https://gannet.fish.washington.edu/s

In [15]:
#Check directory for all files
!ls -lh

total 5073480
-rw-r--r--  1 yaamini  staff   311M Mar 22 03:46 zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
-rw-r--r--  1 yaamini  staff   313M Mar 22 03:46 zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
-rw-r--r--  1 yaamini  staff   310M Mar 22 03:46 zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
-rw-r--r--  1 yaamini  staff   321M Mar 22 03:47 zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
-rw-r--r--  1 yaamini  staff   298M Mar 22 03:47 zr3616_5_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
-rw-r--r--  1 yaamini  staff   313M Mar 22 03:47 zr3616_6_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
-rw-r--r--  1 yaamini  staff   304M Mar 22 03:48 zr3616_7_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph
-rw-r--r--  1 yaamini  staff   307M Mar 22 03:48 zr3616_8_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph


In [13]:
#Obtain md5
!md5 *

MD5 (zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph) = b86b7514414eaa9cc41f4881970adfab
MD5 (zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph) = a0510824561539bc16ab6b44b979dcdd
MD5 (zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph) = bafe45ada4b7e3b899aa8d5b0fad91bf
MD5 (zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph) = 2f62012037096b13cbd16faed0edc9f0
MD5 (zr3616_5_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph) = e6fb2faee28b11bd1cf17e25da372db0
MD5 (zr3616_6_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph) = ec56e0ab4ffe2b851ba4e89de929b748
MD5 (zr3616_7_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph) = 5125d74b39de4e7d449c9fff3792d257
MD5 (zr3616_8_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.bedgraph) = 1ae82042c2fca091b6c98cfea601db6c


## 2. Concatenate coverage information

I will use `unionBedGraphs` to concatenate information for all loci across samples.
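
As a rough picture of what the union produces, the sketch below performs an outer join on two toy per-sample tables with `pandas`. The values and the `chrA` name are made up for illustration. The join only mimics `unionBedGraphs` when intervals are identical across samples (as with 2 bp CpG intervals), since the real tool also splits partially overlapping intervals.

```python
import pandas as pd

# Toy per-sample tables (hypothetical values, not from the real data)
sample_1 = pd.DataFrame(
    {"chrom": ["chrA", "chrA"], "start": [34, 123], "end": [36, 125], "1": [0.0, 50.0]}
)
sample_2 = pd.DataFrame(
    {"chrom": ["chrA", "chrA"], "start": [123, 305], "end": [125, 307], "2": [25.0, 100.0]}
)

# Keep every CpG seen in either sample; samples without data at a CpG get NaN
union = pd.merge(sample_1, sample_2, on=["chrom", "start", "end"], how="outer")
union = union.sort_values(["chrom", "start"]).reset_index(drop=True)

# Writing NaN as N/A plays the same role as -filler N/A
print(union.to_csv(sep="\t", na_rep="N/A", index=False))
```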

In [16]:
bedtoolsDirectory = "/Users/Shared/bioinformatics/bedtools2/bin/"

In [17]:
!{bedtoolsDirectory}unionBedGraphs -h


Tool:    bedtools unionbedg (aka unionBedGraphs)
Version: v2.26.0
Summary: Combines multiple BedGraph files into a single file,
	 allowing coverage comparisons between them.

Usage:   bedtools unionbedg [OPTIONS] -i FILE1 FILE2 .. FILEn
	 Assumes that each BedGraph file is sorted by chrom/start 
	 and that the intervals in each are non-overlapping.

Options: 
	-header		Print a header line.
			(chrom/start/end + names of each file).

	-names		A list of names (one/file) to describe each file in -i.
			These names will be printed in the header line.

	-g		Use genome file to calculate empty regions.
			- STRING.

	-empty		Report empty regions (i.e., start/end intervals w/o
			values in all files).
			- Requires the '-g FILE' parameter.

	-filler TEXT	Use TEXT when representing intervals having no value.
			- Default is '0', but you can use 'N/A' or any text.

	-examples	Show detailed usage examples.



### 2a. Create a union BEDgraph

In [20]:
%%bash

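#Sort each BEDgraph by chromosome and start position, since unionBedGraphs expects sorted input
for f in *5x.bedgraph
do
/Users/Shared/bioinformatics/bedtools2/bin/sortBed \
-i "${f}" \
> "$(basename "${f%_5x.bedgraph}")_5x.sort.bedgraph"
done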

In [21]:
!ls *sort*

zr3616_1_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3616_2_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3616_3_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3616_4_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3616_5_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3616_6_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3616_7_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
zr3616_8_R1_val_1_val_1_val_1_bismark_bt2_pe._5x.sort.bedgraph
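
Because `unionBedGraphs` assumes each input is sorted by chrom/start, a quick check of the sorted files can catch problems early. The sketch below is one possible check using only the standard library. The helper name `is_sorted_bedgraph` is hypothetical. It compares chromosome names as plain strings, which is an assumption about how the files were sorted.

```python
import glob


def is_sorted_bedgraph(lines):
    """Return True if the records are ordered by (chrom, start)."""
    previous = None
    for line in lines:
        # Skip blank lines and track/comment lines, which carry no coordinates
        if not line.strip() or line.startswith(("track", "#")):
            continue
        chrom, start = line.split("\t")[:2]
        key = (chrom, int(start))
        if previous is not None and key < previous:
            return False
        previous = key
    return True


# Report the status of every sorted BEDgraph in the current directory
for path in sorted(glob.glob("*5x.sort.bedgraph")):
    with open(path) as handle:
        print(path, is_sorted_bedgraph(handle))
```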


In [23]:
#Create union BEDgraph from sorted files
#Include a header
#Use N/A when there is no data for a CpG in a sample
#Define sample IDs
#Use sorted bedgraphs
#Save output
!{bedtoolsDirectory}unionBedGraphs \
-header \
-filler N/A \
-names 1 2 3 4 5 6 7 8 \
-i \
*5x.sort.bedgraph \
> zr3616_union_5x.bedgraph

In [24]:
#Check output
!head zr3616_union_5x.bedgraph
!wc -l zr3616_union_5x.bedgraph

chrom	start	end	1	2	3	4	5	6	7	8
NC_001276.1	34	36	0.000000	1.672241	0.000000	0.886918	0.647249	0.294985	0.735294	1.428571
NC_001276.1	123	125	0.531915	0.582751	0.218579	0.502513	0.446927	0.544070	0.146951	0.400534
NC_001276.1	305	307	0.313480	0.303490	0.344828	0.418410	0.480769	0.869565	0.195312	0.000000
NC_001276.1	433	435	0.243112	0.106496	0.207900	0.285103	0.437158	0.361011	0.446999	0.479616
NC_001276.1	457	459	0.148368	0.476644	0.191022	0.194426	0.096993	0.336134	0.348028	0.778643
NC_001276.1	482	484	0.409500	0.418848	0.110619	0.292612	0.425532	0.486855	0.197759	0.633714
NC_001276.1	609	611	0.072993	0.484966	0.285442	0.333556	0.632911	0.000000	0.174419	0.227790
NC_001276.1	781	783	0.330852	0.209644	0.461361	0.289226	0.417101	0.393314	0.592885	0.368098
NC_001276.1	826	828	0.435540	0.226757	0.241838	0.231660	0.228833	0.623701	0.373413	0.392670
 11238223 zr3616_union_5x.bedgraph


### 2b. Manipulate with `pandas`

In [25]:
#Import union data into pandas
#Check head
df = pd.read_table("zr3616_union_5x.bedgraph")
df.head(5)

Unnamed: 0,chrom,start,end,1,2,3,4,5,6,7,8
0,NC_001276.1,34,36,0.0,1.672241,0.0,0.886918,0.647249,0.294985,0.735294,1.428571
1,NC_001276.1,123,125,0.531915,0.582751,0.218579,0.502513,0.446927,0.54407,0.146951,0.400534
2,NC_001276.1,305,307,0.31348,0.30349,0.344828,0.41841,0.480769,0.869565,0.195312,0.0
3,NC_001276.1,433,435,0.243112,0.106496,0.2079,0.285103,0.437158,0.361011,0.446999,0.479616
4,NC_001276.1,457,459,0.148368,0.476644,0.191022,0.194426,0.096993,0.336134,0.348028,0.778643


In [26]:
#Average all samples for total genome methylation information and save as a new column
#Missing values (N/A in the file, NaN in pandas) are skipped, so each average uses only the samples with data at that CpG
#Check output
df['total'] = df[['1', '2', '3', '4', '5', '6', '7', '8']].mean(axis=1)
df.tail(10)

Unnamed: 0,chrom,start,end,1,2,3,4,5,6,7,8,total
11238212,NW_022994998.1,54770,54772,3.333333,0.0,0.0,0.0,0.0,42.857143,0.0,0.0,5.77381
11238213,NW_022994998.1,54834,54836,0.0,3.125,0.0,0.0,0.0,0.0,0.0,0.0,0.390625
11238214,NW_022994998.1,54843,54845,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11238215,NW_022994998.1,54860,54862,0.0,0.0,0.0,0.0,0.0,0.0,8.333333,0.0,1.041667
11238216,NW_022994998.1,54872,54874,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11238217,NW_022994998.1,54934,54936,0.0,0.0,0.0,0.0,,0.0,0.0,,0.0
11238218,NW_022994998.1,54949,54951,0.0,0.0,0.0,0.0,,0.0,0.0,,0.0
11238219,NW_022994998.1,54953,54955,0.0,0.0,0.0,0.0,,0.0,0.0,,0.0
11238220,NW_022994998.1,54958,54960,0.0,0.0,0.0,28.571429,,0.0,0.0,,4.761905
11238221,NW_022994998.1,55001,55003,,,0.0,,,,,,0.0
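
The last rows above show that an average can rest on very few samples. The CpG at index 11238221 has data in only one sample. The toy example below (made-up values) illustrates how `mean(axis=1)` skips missing values, and how `count(axis=1)` could record the number of samples behind each average. The `n_samples` column is an illustrative addition, not part of the original workflow.

```python
import numpy as np
import pandas as pd

# Toy data: the second CpG has a value in only one of three samples
toy = pd.DataFrame(
    {"1": [0.0, 0.0], "2": [28.571429, np.nan], "3": [np.nan, np.nan]}
)
samples = ["1", "2", "3"]

# mean() skips NaN by default, so each row is averaged over available samples only
toy["total"] = toy[samples].mean(axis=1)

# count() reports how many samples actually contributed to each average
toy["n_samples"] = toy[samples].count(axis=1)

print(toy)
```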


In [35]:
#Save dataframe as a tab-delimited file and write missing values as N/A
#quoting = 3 (csv.QUOTE_NONE) turns off quoting. The dataframe index is written as the first column.
df.to_csv("union-averages.bedgraph", sep = "\t", na_rep = "N/A", quoting = 3)

In [36]:
#Check pandas manipulations
!head union-averages.bedgraph

	chrom	start	end	1	2	3	4	5	6	7	8	total
0	NC_001276.1	34	36	0.0	1.672241	0.0	0.886918	0.647249	0.294985	0.735294	1.4285709999999998	0.70815725
1	NC_001276.1	123	125	0.5319149999999999	0.582751	0.218579	0.502513	0.446927	0.5440699999999999	0.146951	0.400534	0.42178
2	NC_001276.1	305	307	0.31348000000000004	0.30349	0.344828	0.41841000000000006	0.48076899999999995	0.8695649999999999	0.19531199999999999	0.0	0.36573174999999997
3	NC_001276.1	433	435	0.243112	0.10649600000000001	0.2079	0.285103	0.437158	0.361011	0.446999	0.479616	0.320924375
4	NC_001276.1	457	459	0.148368	0.47664399999999996	0.191022	0.194426	0.096993	0.336134	0.348028	0.778643	0.32128225
5	NC_001276.1	482	484	0.4095	0.41884799999999994	0.11061900000000001	0.292612	0.42553199999999997	0.48685500000000004	0.197759	0.633714	0.37192987499999997
6	NC_001276.1	609	611	0.07299299999999999	0.484966	0.28544200000000003	0.333556	0.632911	0.0	0.17441900000000002	0.22779000000000002	0.27650962500000004
7	NC_001276.1	781	783	0.33

In [37]:
#Remove header
#Keep chrom, start, end, and the average (fields 2-4 and 13, because field 1 is the pandas index)
#Save output
! tail -n+2 union-averages.bedgraph \
| awk -F'\t' -v OFS='\t' '{print $2, $3, $4, $13}' \
> zr3616_union-averages_5x.bedgraph

In [38]:
#Check output: chr, start, end, average %meth
!head zr3616_union-averages_5x.bedgraph
!wc -l zr3616_union-averages_5x.bedgraph

NC_001276.1	34	36	0.70815725
NC_001276.1	123	125	0.42178
NC_001276.1	305	307	0.36573174999999997
NC_001276.1	433	435	0.320924375
NC_001276.1	457	459	0.32128225
NC_001276.1	482	484	0.37192987499999997
NC_001276.1	609	611	0.27650962500000004
NC_001276.1	781	783	0.38281012499999995
NC_001276.1	826	828	0.3443015
NC_001276.1	951	953	0.34978312500000003
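
The `tail` and `awk` step could also be done directly in `pandas` by selecting the four columns and writing them without the header or index. This is a sketch of that alternative on a toy dataframe with made-up values. With the real data, the same `to_csv` call would be given a file path as its first argument.

```python
import pandas as pd

# Toy stand-in for the union dataframe (hypothetical values)
toy = pd.DataFrame(
    {
        "chrom": ["chrA", "chrA"],
        "start": [34, 123],
        "end": [36, 125],
        "1": [0.0, 1.0],
        "2": [1.0, 0.5],
        "total": [0.5, 0.75],
    }
)

# Keep chrom, start, end, and the average; drop the header and the index
text = toy[["chrom", "start", "end", "total"]].to_csv(
    sep="\t", header=False, index=False, na_rep="N/A"
)
print(text)
```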


## 3. Characterize methylation for each CpG dinucleotide
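
One way the three categories below could be assigned is by binning the average percent methylation with `pd.cut`. The cutoffs in this sketch (10% and 50%) are placeholders chosen for illustration. They are not defined anywhere in this notebook and would need to be replaced with the thresholds actually used.

```python
import numpy as np
import pandas as pd

# Hypothetical cutoffs in percent methylation; NOT taken from this notebook
SPARSE_MIN = 10.0
METHYLATED_MIN = 50.0

# Toy averages on the 0-100 percent scale, including a missing value
toy = pd.DataFrame({"total": [0.0, 9.99, 10.0, 49.9, 50.0, 100.0, np.nan]})

# Left-closed bins: [0, 10) unmethylated, [10, 50) sparsely methylated, [50, inf) methylated
toy["status"] = pd.cut(
    toy["total"],
    bins=[0.0, SPARSE_MIN, METHYLATED_MIN, np.inf],
    right=False,
    labels=["unmethylated", "sparsely methylated", "methylated"],
)

# Count CpGs per category; rows with a missing average stay unclassified
print(toy["status"].value_counts())
```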

### 3a. Methylated loci

### 3b. Sparsely methylated loci

### 3c. Unmethylated loci

## 4. Characterize genomic location of CpGs