Check code for identifying methylated CpGs #675

yaaminiv · 2019-04-07T22:38:03Z

Frequency distribution for %CpG methylation in C. virginica is very different from Mac's findings in C. gigas.

C. virginica:

Why is my distribution so different from what we expected (more CpGs with 0% methylation)?

kubu4 · 2019-04-08T14:48:06Z

Here are some things that have caught my eye.

Now that I know how many loci have at least 5x coverage in each control sample, I want to isolate all unique loci with 5x coverage.

In [17]:
!cat *5x.bedgraph | sort | uniq -u > 2019-03-18-Control-5x-CpG-Loci.bedgraph

In [18]:
#Confirm concatenation
!head 2019-03-18-Control-5x-CpG-Loci.bedgraph
NC_007175.2 10013 10014 5.12820512820513
NC_007175.2 1008 1009 1.45985401459854
NC_007175.2 1008 1009 10.5263157894737
NC_007175.2 1009 1010 0
NC_007175.2 1014 1015 0
NC_007175.2 1014 1015 2.63157894736842
NC_007175.2 1014 1015 2.73972602739726
NC_007175.2 1014 1015 7.69230769230769

Your code isn't isolating unique loci, as your comment indicates you'd like. This is confirmed by the fact that the same locus is present multiple times (e.g. 1014 1015). As such, this locus is contributing to multiple bins in your histogram.

The uniq command compares whole lines, and with -u it prints only lines that occur exactly once. The fourth column (methylation percentage) will rarely be identical across multiple occurrences of a single locus, so uniq -u won't collapse them: whenever a locus occurs multiple times in your concatenated bedgraph, you'll end up with multiple entries for it. (And if two lines do happen to be identical, -u drops both rather than keeping one.) Is this what you want?
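To make the difference concrete, here is a minimal sketch on toy data. The rows are taken from the head output above, with one exact duplicate line added purely for illustration; the variable names are hypothetical. It contrasts `sort | uniq -u` with deduplicating on the first three columns only.

```shell
# Toy stand-in for the concatenated bedgraph. The second "1009 1010 0" line
# is an illustrative addition, not real data.
loci='NC_007175.2 1008 1009 1.45985401459854
NC_007175.2 1008 1009 10.5263157894737
NC_007175.2 1009 1010 0
NC_007175.2 1009 1010 0
NC_007175.2 1014 1015 0
NC_007175.2 1014 1015 2.63157894736842'

# uniq -u keeps only lines that occur exactly once: the repeated
# "1009 1010 0" line disappears entirely, while loci 1008 and 1014
# still appear twice each because their fourth columns differ.
uniq_u_count=$(printf '%s\n' "$loci" | sort | uniq -u | wc -l | tr -d ' ')

# Deduplicate on the locus itself (columns 1-3), ignoring column 4.
unique_loci=$(printf '%s\n' "$loci" | awk '!seen[$1, $2, $3]++ {print $1, $2, $3}')
unique_count=$(printf '%s\n' "$unique_loci" | wc -l | tr -d ' ')

echo "lines kept by uniq -u: $uniq_u_count"   # 4
echo "unique loci: $unique_count"             # 3
```

Keying on columns 1-3 gives a count of distinct loci, but it discards the methylation values, so it only answers the "how many loci" question.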

I'm confused why the 2019-03-18-Control-5x-CpG-Loci.bedgraph output above has four columns. Just above that is this:

#Check columns for one of the file. I only need the chromosome, start position, and stop position
!head zr2096_1_s1_R1_val_1_bismark_bt2_pe.deduplicated.bismark.cov_5x.bedgraph
NC_007175.2 1579 1580
NC_007175.2 2180 2181
NC_007175.2 3383 3384
NC_007175.2 3394 3395

This shows that your 5x.bedgraph files only have three columns. How did that fourth column make it into 2019-03-18-Control-5x-CpG-Loci.bedgraph? What am I missing?

Then, there's also this part, which I don't think is a problem, just double-checking (I'm guessing the "2. Count loci with 1x coverage" heading is a typo?)

2. Count loci with 1x coverage
Since I did an MBD enrichment, it's not likely that I have all 14,458,703 CpG motifs represented in my dataset. I want to know how many CpG loci have at least 1x coverage across all of my samples.

2a. Filter 1x loci
In [12]:
%%bash
for f in *.cov
do
    awk '{print $1, $2-1, $2, $4, $5+$6}' ${f} | awk '{if ($5 >= 5) { print $1, $2-1, $2}}' \
> ${f}_5x.bedgraph
done

sr320 · 2019-04-08T15:17:31Z

Concatenation should be done at trimmed sequence stage - then run through Bismark.


kubu4 · 2019-04-08T15:32:18Z

> Concatenation should be done at trimmed sequence stage - then run through Bismark.

Any reason for concatenation? A list of FastQ files can be supplied to Bismark.
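As a rough illustration of supplying a list, the sketch below only builds and prints a command; it does not run Bismark. The file names and genome path are hypothetical, and the `--genome`, `-1` and `-2` options with comma-separated lists are written from memory of Bismark's usage text, so check them against `bismark --help`. Whether the results come out merged or as one output per input pair is also worth confirming in the documentation.

```shell
# Hypothetical trimmed FastQ names; substitute the real files.
r1_files="sample1_R1_val_1.fq.gz sample2_R1_val_1.fq.gz"
r2_files="sample1_R2_val_2.fq.gz sample2_R2_val_2.fq.gz"

# Join each space-separated list with commas (word splitting is intentional).
r1_list=$(printf '%s\n' $r1_files | paste -sd, -)
r2_list=$(printf '%s\n' $r2_files | paste -sd, -)

# Placeholder genome folder; the command is printed, not executed.
genome_dir="/path/to/genome"
bismark_cmd="bismark --genome $genome_dir -1 $r1_list -2 $r2_list"
echo "$bismark_cmd"
```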

sr320 · 2019-04-08T15:41:38Z

That's fine too - "concatenation" was just my slang for merge :) But as Sam is doing something similar, this is not high priority, as opposed to DML annotation, DMGs, DMRs


yaaminiv · 2019-04-08T20:52:48Z

> this is not high priority, as opposed to DML annotation, DMGs, DMRs

I'd like to do a chi-squared test comparing location of methylated CpGs to my DML list, so I'd like to straighten out this issue!

> Then, there's also this part, which I don't think is a problem, just double-checking (I'm guessing the "2. Count loci with 1x coverage" heading is a typo?)

Yup, just a typo!

> This shows that your 5x.bedgraph files only have three columns. How did that fourth column make it into 2019-03-18-Control-5x-CpG-Loci.bedgraph? What am I missing?

Excellent question. I was modifying this code yesterday from code I already wrote so it wasn't as clear as it should be.

The 2019-03-18-All-Unique-5x-CpGs.bedgraph file only has the chromosome, start position, and stop position. All I wanted to do was count the number of unique loci.

The 2019-03-18-Control-5x-CpG-Loci.bedgraph file has chromosome, start position, stop position, and % methylation. I deleted some code chunks that should have been there and made some changes to my naming scheme. But I guess that still doesn't get me what I want, which brings me to...

this:

> You'll always end up with multiple entries for a single locus, even if a locus occurs multiple times in your concatenated bedgraph. Is this what you want?

No, this is not what I want! I need to find some way to identify loci with multiple entries, then average their methylation percentages. Why would I need to do that with trimmed FastQ files in Bismark?

kubu4 · 2019-04-08T22:04:18Z

There's something else I noticed, which I think might be wrong.

Next, I'll filter loci from coverage files. I want four columns this time: chromosome, start position, stop position, and percent methylation.

In [16]:
%%bash
for f in *.cov
do
    awk '{print $1, $2-1, $2, $4, $5+$6}' ${f} | awk '{if ($5 >= 5) { print $1, $2-1, $2, $5+$6}}' \
> ${f}_control.5x.bedgraph
done

I don't think your awk statements are doing what you want them to do. I'll explain.

awk '{print $1, $2-1, $2, $4, $5+$6}' ${f} - This takes your input file (${f}) and prints the following:

Column 1
Column 2 - 1
Column 2
Column 4
Column 5 + Column 6

That produces a total of 5 columns of output.

| awk '{if ($5 >= 5) { print $1, $2-1, $2, $5+$6}}' - This takes the output from your first awk command and prints the following:

Column 1 (original column 1 of your input file)
Column 2 - 1 (this is already original column 2 - 1, so subtracting 1 again gives original column 2 - 2)
Column 2 (this is original column 2 - 1)
Column 5 + Column 6 (the first awk command emits only five columns, so column 6 is empty and this just prints column 5, the summed read counts; you get the same result, but there's no need to add columns 5 and 6 at this point)

I'd say the biggest concern here is that you end up subtracting 2 from the start position and 1 from the end position. Is this what you want to do here?

Personally, I'd rewrite the awk statement to the following, as I think it's easier to follow without passing through a pipe and having to keep track of changes to number of columns:

awk '{if ($5+$6 >= 5) { print $1, $2-1, $2, $5+$6}}' ${f}

Of course, if you intended to subtract 2 from your start position and 1 from your end position, then make those adjustments.
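A quick way to see the coordinate shift is to run both versions on a single line. This is a sketch only: the input line is made up (position 1581, 2 methylated and 3 unmethylated reads), and the variable names are hypothetical.

```shell
# One made-up line in coverage-file layout:
# chrom, start, end, % methylation, count methylated, count unmethylated
cov_line='NC_007175.2 1581 1581 40 2 3'

# Two-stage pipeline as written in the notebook: the second awk
# subtracts 1 from coordinates that were already shifted once.
piped=$(printf '%s\n' "$cov_line" \
  | awk '{print $1, $2-1, $2, $4, $5+$6}' \
  | awk '{if ($5 >= 5) { print $1, $2-1, $2, $5+$6}}')

# Single-pass rewrite suggested above: coordinates are shifted once.
single=$(printf '%s\n' "$cov_line" \
  | awk '{if ($5+$6 >= 5) { print $1, $2-1, $2, $5+$6}}')

echo "piped:  $piped"    # NC_007175.2 1579 1580 5
echo "single: $single"   # NC_007175.2 1580 1581 5
```

In both versions the last field printed is the summed read count, not the methylation percentage in column 4.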

Presumably, the bedgraph file below was made by your awk command above:

In [19]:
#Check the columns: <chromosome> <start position> <stop position> <percent methylation>
!head zr2096_1_s1_R1_val_1_bismark_bt2_pe.deduplicated.bismark.cov_control.5x.bedgraph
NC_007175.2 1579 1580 5
NC_007175.2 2180 2181 5
NC_007175.2 3383 3384 5
NC_007175.2 3394 3395 5
NC_007175.2 5413 5414 5
NC_007175.2 5415 5416 5
NC_007175.2 5426 5427 5
NC_007175.2 11101 11102 5
NC_007175.2 12881 12882 5
NC_007175.2 12985 12986 5

The values in the percent methylation column are all the same (and all integers). This doesn't seem right...
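One possible explanation, offered as an assumption rather than a diagnosis: the awk command prints `$5+$6` as its last field, which is the read depth, so a column of small integers at or above the 5x cutoff is what one would expect to see. The sketch below uses made-up rows to compare printing `$5+$6` with printing `$4`; it assumes the coverage-file columns are chromosome, start, end, percent methylation, count methylated, count unmethylated.

```shell
# Made-up coverage rows: chrom, start, end, % methylation, methylated, unmethylated
cov='NC_007175.2 1581 1581 40 2 3
NC_007175.2 2182 2182 25 2 6
NC_007175.2 3385 3385 0 0 2'

# Fourth output column is $5+$6, i.e. total read depth.
depth_col=$(printf '%s\n' "$cov" | awk '$5+$6 >= 5 {print $1, $2-1, $2, $5+$6}')

# Fourth output column is $4, i.e. percent methylation.
pct_col=$(printf '%s\n' "$cov" | awk '$5+$6 >= 5 {print $1, $2-1, $2, $4}')

echo "depth in column 4:"
echo "$depth_col"
echo "percent methylation in column 4:"
echo "$pct_col"
```

The third row has only 2 reads, so both versions drop it; the remaining rows show depths 5 and 8 in the first output and percentages 40 and 25 in the second.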

> No, this is not what I want! I need to find some way to identify loci with multiple entries, then average their methylation percentages.

So, you don't want to identify unique loci any more? If not, then you're already set: your file lists all loci with their corresponding methylation percentages. You need to find a way to compute average methylation at each locus.
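Averaging per locus could be sketched with awk arrays keyed on the first three columns. The rows below are copied from the head output earlier in the thread; the variable names are hypothetical. Note that this is an unweighted mean of percentages, which treats every sample equally regardless of read depth. Summing methylated and total read counts per locus would be an alternative if depth should matter.

```shell
# Rows copied from the concatenated bedgraph shown earlier:
# chrom, start, end, % methylation
loci='NC_007175.2 1008 1009 1.45985401459854
NC_007175.2 1008 1009 10.5263157894737
NC_007175.2 1009 1010 0
NC_007175.2 1014 1015 0
NC_007175.2 1014 1015 2.63157894736842
NC_007175.2 1014 1015 2.73972602739726
NC_007175.2 1014 1015 7.69230769230769'

# Accumulate the sum and count of column 4 per locus, then print
# chrom, start, end, mean % methylation, number of entries.
averaged=$(printf '%s\n' "$loci" \
  | awk '{key = $1 " " $2 " " $3; sum[key] += $4; n[key]++}
         END {for (k in sum) printf "%s %.4f %d\n", k, sum[k] / n[k], n[k]}' \
  | sort -k1,1 -k2,2n)

echo "$averaged"
```

The final sort is only there because awk's `for (k in sum)` loop visits keys in no guaranteed order.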

sr320 · 2019-04-08T22:07:25Z

I might be missing it - but there should only be one coverage file, one bedgraph file.

> No, this is not what I want! I need to find some way to identify loci with multiple entries, then average their methylation percentages. Why would I need to do that with trimmed FastQ files in Bismark?

This is how you would get the one file.

yaaminiv · 2019-04-08T23:22:51Z

> I don't think your awk statements are doing what you want them to do
> I'd say the biggest concern here is that you end up subtracting 2 from the start position and 1 from the end position. Is this what you want to do here?

I got that code from @sr320: https://github.com/sr320/nb-2019/blob/master/C_virginica/01-OAKL-3x-tracks.ipynb (I think the link is broken now so I'm not entirely sure where the actual code lives). I believe subtracting 2 from the start position and 1 from the end position is what I want to do to account for that different start position thing we encountered earlier. I'll modify the awk command as you suggest to make it cleaner.

> All of the values in the percent methylation column are all the same (and, all integers). This doesn't seem right...

...yeah I'm not sure what's happening there. It wasn't like this before I started messing with my code yesterday. The previous version of my notebook can be found here.

> So, you don't want to identify unique loci any more?
> I might be missing it - but there should only be one coverage file, one bedgraph file.

I'm confused. In this issue, I was instructed to use coverage files, identify 5x loci, then (somehow) combine all of the 5x loci from all control sample files to understand how many loci were methylated, partially methylated, and unmethylated. Based on this approach, I've used my code to pare down unique loci, but am missing the last step. I need a way, as @kubu4 points out, to calculate average methylation at each locus.

The alternative approach would be using FastQ trimmed files in bismark...? How would this work differently from what I am doing now?

yaaminiv · 2019-04-09T21:48:02Z

New approach: using file below to characterize general methylation trends

http://gannet.fish.washington.edu/Atumefaciens/20190312_cvir_gonad_bismark/total_reads_bismark/cvir_bsseq_all_pe_R1_bismark_bt2_pe.bismark.cov.gz

@kubu4 Is there a readme for this somewhere? Not sure what the columns are, but here are my best guesses:

chromosome
start pos
stop pos (same as column 2)
% methylation...?
coverage...?

yaaminiv · 2019-04-09T21:50:20Z

> Is there a readme for this somewhere? Not sure what the columns are, but here are my best guesses

Just realized columns 4-6 are methylation percentage, count methylated, and count unmethylated.

yaaminiv · 2019-04-09T22:19:26Z

I used the .cov file @kubu4 provided to identify methylated, partially methylated, and unmethylated loci. The breakdown is strange...

4,304,257 loci with 5x coverage
3,181,904 methylated
481,788 sparsely methylated
640,565 unmethylated

I did not expect that most loci would be methylated. There's probably something weird about how I'm using awk. My Jupyter notebook is here, but I've screenshotted the code I used below:
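The screenshot is not reproduced here. What follows is not the notebook code but a hypothetical sketch of this kind of classification, run on made-up rows. The cutoffs (at least 50% for methylated, above 10% and below 50% for sparsely methylated, 10% or less for unmethylated) are assumptions chosen for illustration and may differ from the ones actually used.

```shell
# Made-up coverage rows: chrom, start, end, % methylation, methylated, unmethylated
cov='chr1 100 100 80 8 2
chr1 200 200 30 3 7
chr1 300 300 0 0 5
chr1 400 400 100 2 0'

# Keep loci with at least 5x coverage, then bin them by column 4.
# Thresholds are assumptions, not values taken from the thread.
summary=$(printf '%s\n' "$cov" | awk '
  $5 + $6 >= 5 {
    total++
    if ($4 >= 50)      meth++
    else if ($4 > 10)  sparse++
    else               unmeth++
  }
  END {
    printf "total=%d methylated=%d sparse=%d unmethylated=%d\n",
           total, meth, sparse, unmeth
  }')

echo "$summary"   # total=3 methylated=1 sparse=1 unmethylated=1
```

The last row has 100% methylation but only 2 reads, so the 5x filter removes it before binning.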

kubu4 · 2019-04-10T04:19:18Z

A few things:

Coding in screenshot looks good to me.
How many total loci have coverage (i.e. >=1x)?
I'd recommend loading the data into a genome viewer (e.g. IGV) and convincing yourself it's legit.

yaaminiv · 2019-04-10T17:36:26Z

🎉
14,026,131 loci with ≥ 1x coverage. There are 14,458,703 CG motifs in the C. virginica genome.
Here are some screenshots from IGV. The methylated loci look legit...I'm just surprised the methylation level is that high.

yaaminiv · 2019-04-10T17:37:33Z

Another more zoomed out screenshot from IGV.

yaaminiv added the E2O-Pub label Apr 7, 2019

sr320 closed this as completed Apr 22, 2019
