Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check code for identifying methylated CpGs #675

Closed
yaaminiv opened this issue Apr 7, 2019 · 14 comments
Closed

Check code for identifying methylated CpGs #675

yaaminiv opened this issue Apr 7, 2019 · 14 comments
Labels

Comments

@yaaminiv
Copy link
Contributor

yaaminiv commented Apr 7, 2019

Frequency distribution for %CpG methylation in C. virginica is very different from Mac's findings in C. gigas.

C. virginica:

Screen Shot 2019-04-04 at 11 01 00 PM

Jupyter notebook

Code to create figure

Why is my distribution so different from what we expected (more CpGs with 0% methylation)?

@kubu4
Copy link
Contributor

kubu4 commented Apr 8, 2019

Here are some things that have caught my eye.


Now that I know how many loci have at least 5x coverage in each control sample, I want to isolate all unique loci with 5x coverage.

In [17]:
!cat *5x.bedgraph | sort | uniq -u > 2019-03-18-Control-5x-CpG-Loci.bedgraph

In [18]:
#Confirm concatenation
!head 2019-03-18-Control-5x-CpG-Loci.bedgraph
NC_007175.2 10013 10014 5.12820512820513
NC_007175.2 1008 1009 1.45985401459854
NC_007175.2 1008 1009 10.5263157894737
NC_007175.2 1009 1010 0
NC_007175.2 1014 1015 0
NC_007175.2 1014 1015 2.63157894736842
NC_007175.2 1014 1015 2.73972602739726
NC_007175.2 1014 1015 7.69230769230769

Your code isn't isolating unique loci, as your comment indicates you'd like. This is confirmed by the fact that the same locus is present multiple times (e.g. 1014 1015). As such, this locus is contributing to multiple bins in your histogram.

The uniq command finds unique lines. That fourth column (methylation percentage) is highly unlikely to ever be the same for multiple occurrences of a single locus, so uniq won't eliminate any lines. You'll always end up with multiple entries for a single locus, even if a locus occurs multiple times in your concatenated bedgraph. Is this what you want?


I'm confused why the 2019-03-18-Control-5x-CpG-Loci.bedgraph output above has four columns. Just above that is this:

#Check columns for one of the file. I only need the chromosome, start position, and stop position
!head zr2096_1_s1_R1_val_1_bismark_bt2_pe.deduplicated.bismark.cov_5x.bedgraph
NC_007175.2 1579 1580
NC_007175.2 2180 2181
NC_007175.2 3383 3384
NC_007175.2 3394 3395

This shows that your 5x.bedgraph files only have three columns. How did that fourth column make it into 2019-03-18-Control-5x-CpG-Loci.bedgraph? What am I missing?


Then, there's also this part, which I don't think is a problem, just double-checking ( I'm guessing the 2. Count loci with 1x coverage is a typo?)

2. Count loci with 1x coverage
Since I did an MBD enrichment, it's not likely that I have all 14,458,703 CpG motifs represented in my dataset. I want to know how many CpG loci have at least 1x coverage across all of my samples.

2a. Filter 1x loci
In [12]:
%%bash
for f in *.cov
do
    awk '{print $1, $2-1, $2, $4, $5+$6}' ${f} | awk '{if ($5 >= 5) { print $1, $2-1, $2}}' \
> ${f}_5x.bedgraph
done


@sr320
Copy link
Member

sr320 commented Apr 8, 2019 via email

@kubu4
Copy link
Contributor

kubu4 commented Apr 8, 2019

Concatenation should be done at trimmed sequence stage - then run through Bismark.

Any reason for concatenation? A list of FastQ files can be supplied to Bismark.

@sr320
Copy link
Member

sr320 commented Apr 8, 2019 via email

@yaaminiv
Copy link
Contributor Author

yaaminiv commented Apr 8, 2019

this is not high priority- as opposed to DML annotation, DMGs, DMRs

I'd like to do a chi-squared test comparing location of methylated CpGs to my DML list, so I'd like to straighten out this issue!

Then, there's also this part, which I don't think is a problem, just double-checking ( I'm guessing the 2. Count loci with 1x coverage is a typo?)

Yup, just a typo!

This shows that your 5x.bedgraph files only have three columns. How did that fourth column make it into 2019-03-18-Control-5x-CpG-Loci.bedgraph? What am I missing?

Excellent question. I was modifying this code yesterday from code I already wrote so it wasn't as clear as it should be.

The 2019-03-18-All-Unique-5x-CpGs.bedgraph file only has the chromosome, start position, and stop position. All I wanted to do was count the number of unique loci.

The 2019-03-18-Control-5x-CpG-Loci.bedgraph file has chromosome, start position, stop position, and % methylation. I deleted some code chunks that should have been there and made some changes to my naming scheme. But I guess that still doesn't get me what I want, which brings me to...

this:

You'll always end up with multiple entries for a single locus, even if a locus occurs multiple times in your concatenated bedgraph. Is this what you want?

No this is not what I want! I need to find some way to identify loci with multiple entries, then average methylation percentages. Why would I need to do that with trimmed FastQ files in bismark?

@kubu4
Copy link
Contributor

kubu4 commented Apr 8, 2019

There's something else I noticed, which I think might be wrong.

Next, I'll filter loci from coverage files. I want four columns this time: chromosome, start position, stop position, and percent methylation.

In [16]:
%%bash
for f in *.cov
do
    awk '{print $1, $2-1, $2, $4, $5+$6}' ${f} | awk '{if ($5 >= 5) { print $1, $2-1, $2, $5+$6}}' \
> ${f}_control.5x.bedgraph
done

I don't think your awk statements are doing what you want them to do. I'll explain.

  1. awk '{print $1, $2-1, $2, $4, $5+$6}' ${f} - This takes your input file (${f}) and prints the following:
  • Column 1
  • Column 2 - 1
  • Column 2
  • Column 4
  • Column 5 + Column 6

That produces a total of 5 columns of output.

  1. | awk '{if ($5 >= 5) { print $1, $2-1, $2, $5+$6}}' - This takes the output from your first awk command and prints the following:
  • Column 1 (original column 1 of your input file)
  • Column 2 -1 (original column 2 - 1 of your input file, subtracting 1 again; effectively subtracting 2 from original Column 2 of your input file)
  • Column 2 (original column 2 -1; effectively subtracting 1 from your original Column 2 of your input file)
  • Column 5 + Column 6 (since there is no 6th column coming from your previous awk command, this ends up printing Column 5 from your previous awk command; at the end of the day, you get the same result, but there's no need to add columns 5 & 6 at this point, because there is no column 6 any more).

I'd say the biggest concern here is that you end up subtracting 2 from the start position and 1 from the end position. Is this what you want to do here?

Personally, I'd rewrite the awk statement to the following, as I think it's easier to follow without passing through a pipe and having to keep track of changes to number of columns:

awk '{if ($5+$6 >= 5) { print $1, $2-1, $2, $5+$6}}' ${f}

Of course, if you intended to subtract 2 from your start position and 1 from your end position, then make those adjustments.


Presumably, the begraph file below was made by the your awk command above:

In [19]:
#Check the columns: <chromosome> <start position> <stop position> <percent methylation>
!head zr2096_1_s1_R1_val_1_bismark_bt2_pe.deduplicated.bismark.cov_control.5x.bedgraph
NC_007175.2 1579 1580 5
NC_007175.2 2180 2181 5
NC_007175.2 3383 3384 5
NC_007175.2 3394 3395 5
NC_007175.2 5413 5414 5
NC_007175.2 5415 5416 5
NC_007175.2 5426 5427 5
NC_007175.2 11101 11102 5
NC_007175.2 12881 12882 5
NC_007175.2 12985 12986 5

All of the values in the percent methylation column are all the same (and, all integers). This doesn't seem right...


No this is not what I want! I need to find some way to identify loci with multiple entries, then average methylation percentages.

So, you don't want to identify unique loci any more? If not, then you're already set. Your file has all loci listed and their corresponding methylation percentages. You need to find a way to compute average methylation at each loci.

@sr320
Copy link
Member

sr320 commented Apr 8, 2019

I might be missing it - but there should only be one coverage file, one bedgraph file.

No this is not what I want! I need to find some way to identify loci with multiple entries, then average methylation percentages. Why would I need to do that with trimmed FastQ files in bismark?

this is how you would get the one file

@yaaminiv
Copy link
Contributor Author

yaaminiv commented Apr 8, 2019

I don't think your awk statements are doing what you want them to do
I'd say the biggest concern here is that you end up subtracting 2 from the start position and 1 from the end position. Is this what you want to do here?

I got that code from @sr320: https://github.com/sr320/nb-2019/blob/master/C_virginica/01-OAKL-3x-tracks.ipynb (I think the link is broken now so I'm not entirely sure where the actual code lives). I believe subtracting 2 from the start position and 1 from the end position is what I want to do to account for that different start position thing we encountered earlier. I'll modify the awk command as you suggest to make it cleaner.

All of the values in the percent methylation column are all the same (and, all integers). This doesn't seem right...

...yeah I'm not sure what's happening there. It wasn't like this before I started messing with my code yesterday. The previous version of my notebook can be found here.

So, you don't want to identify unique loci any more?
I might be missing it - but there should only be one coverage file, one bedgraph file.

I'm confused. In this issue, I was instructed to use coverage files, identify 5x loci, then (somehow) combine all of the 5x loci from all control sample files to understand how many loci were methylated, partially methylated, and unmethylated. Based on this approach, I've used my code to pare down unique loci, but am missing the last step. I need a way, as @kubu4, points out, to calculate average methylation at each loci.

The alternative approach would be using FastQ trimmed files in bismark...? How would this work differently from what I am doing now?

@yaaminiv
Copy link
Contributor Author

yaaminiv commented Apr 9, 2019

New approach: using file below to characterize general methylation trends

http://gannet.fish.washington.edu/Atumefaciens/20190312_cvir_gonad_bismark/total_reads_bismark/cvir_bsseq_all_pe_R1_bismark_bt2_pe.bismark.cov.gz

Screen Shot 2019-04-09 at 2 46 03 PM

@kubu4 Is there a readme for this somewhere? Not sure what the columns are, but here are my best guesses:

  1. chromosome
  2. start pos
  3. stop pos (same as column 2)
  4. % methylation...?
  5. coverage...?

@yaaminiv
Copy link
Contributor Author

yaaminiv commented Apr 9, 2019

Is there a readme for this somewhere? Not sure what the columns are, but here are my best guesses

Just realized columns 4-6 are methylation percentage, count methylated, and count unmethylated

@yaaminiv
Copy link
Contributor Author

yaaminiv commented Apr 9, 2019

I used the .cov file @kubu4 provided to identify methylated, partially methylated, and unmethylated loci. The breakdown is strange...

  • 4,304,257 loci with 5x coverage
  • 3,181,904 methylated
  • 481,788 sparsely methylated
  • 640,565 unmethylated

I did not expect that most loci would be methylated. There's probably something weird about how I'm using awk. My Jupyter notebook is here, but I've screenshotted the code I used below:

Screen Shot 2019-04-09 at 3 18 53 PM

Screen Shot 2019-04-09 at 3 17 52 PM

@kubu4
Copy link
Contributor

kubu4 commented Apr 10, 2019

A few things:

  1. Coding in screenshot looks good to me.

  2. How many total loci with coverage (i.e. >=1x).

  3. I'd recommend loading the data into a genome viewer (e.g. IGV) and convincing yourself it's lefit.

@yaaminiv
Copy link
Contributor Author

  1. 🎉
  2. 14,026,131 loci with ≥ 1x coverage. There are 14,458,703 CG motifs in the C. virginica genome.
  3. Here are some screenshots from IGV. The methylated loci look legit...I'm just surprised the methylation level is that high.

Screen Shot 2019-04-10 at 10 34 09 AM

Screen Shot 2019-04-10 at 10 33 47 AM

@yaaminiv
Copy link
Contributor Author

Another more zoomed out screenshot from IGV.

Screen Shot 2019-04-10 at 10 37 14 AM

@sr320 sr320 closed this as completed Apr 22, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants