Check code for identifying methylated CpGs #675
Concatenation should be done at trimmed sequence stage - then run through Bismark.
On Apr 8, 2019, 7:48 AM -0700, kubu4 wrote:
Here are some things that have caught my eye.
Now that I know how many loci have at least 5x coverage in each control sample, I want to isolate all unique loci with 5x coverage.
In [17]:
!cat *5x.bedgraph | sort | uniq -u > 2019-03-18-Control-5x-CpG-Loci.bedgraph
In [18]:
#Confirm concatenation
!head 2019-03-18-Control-5x-CpG-Loci.bedgraph
NC_007175.2 10013 10014 5.12820512820513
NC_007175.2 1008 1009 1.45985401459854
NC_007175.2 1008 1009 10.5263157894737
NC_007175.2 1009 1010 0
NC_007175.2 1014 1015 0
NC_007175.2 1014 1015 2.63157894736842
NC_007175.2 1014 1015 2.73972602739726
NC_007175.2 1014 1015 7.69230769230769
Your code isn't isolating unique loci, as your comment indicates you'd like. This is confirmed by the fact that the same locus is present multiple times (e.g. 1014 1015). As such, this locus is contributing to multiple bins in your histogram.
The uniq command compares whole lines, and with -u it prints only lines that occur exactly once. The fourth column (methylation percentage) is highly unlikely to be identical across multiple occurrences of a single locus, so uniq won't eliminate any lines. Whenever a locus occurs multiple times in your concatenated bedgraph, you'll end up with multiple entries for it. Is this what you want?
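One way to get a single row per locus is to drop the methylation column before de-duplicating, so that only chromosome, start, and end are compared. A minimal sketch, assuming whitespace-delimited input; the `unique_loci` helper name and the temporary file are made up for illustration, and the toy rows are copied from the head output above:

```shell
# Hypothetical helper: collapse a 4-column bedgraph to unique loci.
unique_loci() {
    # Keep chromosome, start, end; drop the methylation column, then de-duplicate.
    awk '{print $1, $2, $3}' "$@" | LC_ALL=C sort -u
}

# Toy input: five rows covering three distinct loci
demo=$(mktemp)
printf '%s\n' \
  'NC_007175.2 1008 1009 1.45985401459854' \
  'NC_007175.2 1008 1009 10.5263157894737' \
  'NC_007175.2 1014 1015 0' \
  'NC_007175.2 1014 1015 2.63157894736842' \
  'NC_007175.2 1009 1010 0' > "$demo"

unique_count=$(unique_loci "$demo" | wc -l | tr -d ' ')
first_locus=$(unique_loci "$demo" | head -n 1)
echo "$unique_count unique loci; first is: $first_locus"
```

In the notebook, the same function could be pointed at `*5x.bedgraph` instead of the toy file.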
I'm confused why the 2019-03-18-Control-5x-CpG-Loci.bedgraph output above has four columns. Just above that is this:
#Check columns for one of the file. I only need the chromosome, start position, and stop position
!head zr2096_1_s1_R1_val_1_bismark_bt2_pe.deduplicated.bismark.cov_5x.bedgraph
NC_007175.2 1579 1580
NC_007175.2 2180 2181
NC_007175.2 3383 3384
NC_007175.2 3394 3395
This shows that your 5x.bedgraph files only have three columns. How did that fourth column make it into 2019-03-18-Control-5x-CpG-Loci.bedgraph? What am I missing?
Then there's also this part, which I don't think is a problem; I'm just double-checking (I'm guessing "2. Count loci with 1x coverage" is a typo?):
2. Count loci with 1x coverage
Since I did an MBD enrichment, it's not likely that I have all 14,458,703 CpG motifs represented in my dataset. I want to know how many CpG loci have at least 1x coverage across all of my samples.
2a. Filter 1x loci
In [12]:
%%bash
for f in *.cov
do
awk '{print $1, $2-1, $2, $4, $5+$6}' ${f} | awk '{if ($5 >= 5) { print $1, $2-1, $2}}' \
> ${f}_5x.bedgraph
done
|
Any reason for concatenation? A list of FastQ files can be supplied to Bismark. |
That’s fine too. "Concatenation" is just my slang for merge :)
But since Sam is doing something similar, this is not high priority compared with DML annotation, DMGs, and DMRs.
On Apr 8, 2019, 8:33 AM -0700, kubu4 wrote:
> Concatenation should be done at trimmed sequence stage - then run through Bismark.
Any reason for concatenation? A list of FastQ files can be supplied to Bismark.
|
I'd like to do a chi-squared test comparing location of methylated CpGs to my DML list, so I'd like to straighten out this issue!
Yup, just a typo!
Excellent question. I was modifying this code yesterday from code I already wrote so it wasn't as clear as it should be. The The this:
No, this is not what I want! I need to find some way to identify loci with multiple entries, then average their methylation percentages. Why would I need to do that with trimmed FastQ files in |
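For the first half of that (identifying loci with multiple entries), counting rows per chromosome/start/end key is one option. A rough sketch, with a hypothetical `multi_entry_loci` helper and toy rows taken from the concatenated bedgraph output quoted earlier; it assumes whitespace-delimited columns:

```shell
# Hypothetical helper: print each locus that has more than one row,
# followed by the number of rows found for it.
multi_entry_loci() {
    awk '{ n[$1 " " $2 " " $3]++ }
         END { for (key in n) if (n[key] > 1) print key, n[key] }' "$@" \
        | LC_ALL=C sort
}

# Toy input: two loci appear twice, one appears once
demo=$(mktemp)
printf '%s\n' \
  'NC_007175.2 1008 1009 1.45985401459854' \
  'NC_007175.2 1008 1009 10.5263157894737' \
  'NC_007175.2 1009 1010 0' \
  'NC_007175.2 1014 1015 0' \
  'NC_007175.2 1014 1015 2.63157894736842' > "$demo"

repeated=$(multi_entry_loci "$demo")
echo "$repeated"
```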
There's something else I noticed, which I think might be wrong.
I don't think your
That produces a total of 5 columns of output.
I'd say the biggest concern here is that you end up subtracting 2 from the start position and 1 from the end position. Is this what you want to do here? Personally, I'd rewrite the
Of course, if you intended to subtract 2 from your start position and 1 from your end position, then make those adjustments. Presumably, the bedgraph file below was made by your
All of the values in the percent methylation column are the same (and all integers). This doesn't seem right...
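To make the coordinate arithmetic explicit, the two chained awk calls could be collapsed into one, with the offsets passed as named variables. This is only a sketch: `filter_cov` is a hypothetical helper, the toy coverage rows are invented, and the column layout (chromosome, position, position, percent methylation, methylated count, unmethylated count) is assumed from the loop quoted earlier. With offsets of 2 and 1 it reproduces what the chained version does; with 1 and 0 it subtracts only once.

```shell
# Hypothetical helper: filter a coverage file by read depth in one awk pass.
# Usage: filter_cov FILE MIN_COVERAGE START_OFFSET END_OFFSET
filter_cov() {
    awk -v min="$2" -v start_off="$3" -v end_off="$4" \
        '($5 + $6) >= min { print $1, $2 - start_off, $2 - end_off }' "$1"
}

# Invented toy rows: the first has 6x coverage, the second only 2x
cov=$(mktemp)
printf '%s\n' \
  'NC_007175.2 1009 1009 50 3 3' \
  'NC_007175.2 1015 1015 0 0 2' > "$cov"

# The chained version from the loop quoted earlier, for comparison
chained=$(awk '{print $1, $2-1, $2, $4, $5+$6}' "$cov" \
    | awk '{if ($5 >= 5) { print $1, $2-1, $2}}')

same_as_chained=$(filter_cov "$cov" 5 2 1)   # start - 2, end - 1
subtract_once=$(filter_cov "$cov" 5 1 0)     # start - 1, end unchanged

echo "chained:       $chained"
echo "offsets 2, 1:  $same_as_chained"
echo "offsets 1, 0:  $subtract_once"
```

Which pair of offsets is correct depends on the coordinate convention you want in the bedgraph; the point is only that the choice is visible in one place.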
So, you don't want to identify unique loci any more? If not, then you're already set. Your file has all loci listed with their corresponding methylation percentages. You need to find a way to compute average methylation at each locus. |
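Averaging per locus can be done by accumulating a sum and a count for each chromosome/start/end key. A minimal sketch, assuming a whitespace-delimited, 4-column bedgraph with percent methylation in column 4; `mean_methylation` is a hypothetical name and the toy values are invented to keep the arithmetic easy to check. Note that this is an unweighted mean, so each sample counts equally regardless of its read depth.

```shell
# Hypothetical helper: unweighted mean of column 4 for each locus.
mean_methylation() {
    awk '{
            key = $1 " " $2 " " $3
            sum[key] += $4
            n[key]++
         }
         END {
            for (key in sum) printf "%s %.4f\n", key, sum[key] / n[key]
         }' "$@" | LC_ALL=C sort
}

# Invented toy rows: two loci, two entries each
demo=$(mktemp)
printf '%s\n' \
  'NC_007175.2 1008 1009 10' \
  'NC_007175.2 1008 1009 20' \
  'NC_007175.2 1014 1015 0' \
  'NC_007175.2 1014 1015 5' > "$demo"

means=$(mean_methylation "$demo")
echo "$means"
```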
I might be missing it, but there should only be one coverage file and one bedgraph file.
This is how you would get the one file |
I got that code from @sr320: https://github.com/sr320/nb-2019/blob/master/C_virginica/01-OAKL-3x-tracks.ipynb (I think the link is broken now so I'm not entirely sure where the actual code lives). I believe subtracting 2 from the start position and 1 from the end position is what I want to do to account for that different start position thing we encountered earlier. I'll modify the
...yeah I'm not sure what's happening there. It wasn't like this before I started messing with my code yesterday. The previous version of my notebook can be found here.
I'm confused. In this issue, I was instructed to use coverage files, identify 5x loci, then (somehow) combine all of the 5x loci from all control sample files to understand how many loci were methylated, partially methylated, and unmethylated. Based on this approach, I've used my code to pare down unique loci, but am missing the last step. I need a way, as @kubu4 points out, to calculate average methylation at each locus. The alternative approach would be using FastQ trimmed files in |
New approach: using the file below to characterize general methylation trends. @kubu4, is there a readme for this somewhere? I'm not sure what the columns are, but here are my best guesses:
|
Just realized columns 4-6 are methylation percentage, count methylated, and count unmethylated |
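With methylated and unmethylated counts in columns 5 and 6, a frequency table of percent methylation at well-covered loci can be tallied directly. The following is a sketch under stated assumptions: columns 1-3 are taken to be chromosome, start, and end; the 5x threshold mirrors the filter used earlier in the thread; the 10%-wide bins, the `methylation_histogram` name, and the toy rows are all invented for illustration.

```shell
# Hypothetical helper: count loci per percent-methylation bin,
# keeping only loci with at least `min` reads (methylated + unmethylated).
methylation_histogram() {
    awk -v min=5 -v width=10 '
        ($5 + $6) >= min {
            pct = 100 * $5 / ($5 + $6)          # recompute from the counts
            bin = int(pct / width) * width
            if (bin >= 100) bin = 100 - width   # fold 100% into the top bin
            count[bin]++
        }
        END {
            for (b = 0; b < 100; b += width)
                print b "-" (b + width), count[b] + 0
        }' "$1"
}

# Invented toy rows; the last one has only 2x coverage and is skipped
demo=$(mktemp)
printf '%s\n' \
  'chr1 10 10 100 5 0' \
  'chr1 20 20 0 0 6' \
  'chr1 30 30 50 3 3' \
  'chr1 40 40 100 2 0' > "$demo"

hist=$(methylation_histogram "$demo")
echo "$hist"
```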
I used the
I did not expect that most loci would be methylated. There's probably something weird about how I'm using |
A few things:
|
Frequency distribution for %CpG methylation in C. virginica is very different from Mac's findings in C. gigas.
C. virginica:
Jupyter notebook
Code to create figure
Why is my distribution so different from what we expected (more CpGs with 0% methylation)?