-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Plot annotation - missing info #6
Comments
Hi memglu!
t a conference. Will answer Monday.
Stuart
…On Wed, 5 Jun 2024 at 5:18 pm, MengLu-flw ***@***.***> wrote:
Hi Stuart and Natalia,
I’ve been using diem on my samples from the hybrid zones, and I think this
program has such unique value for hybridisation studies - many thanks for
making diem available for us :^)
---** A bit of background **---
My vcf.gz file (containing information from 21 main chromosomes only) was
failed to convert to diem input format with vcf2diem() in R (requested 80G
of MEM yet still having the out of memory issue). So, I switched to the
Python script written by Sam Ebdon (
https://github.com/samebdon/vcf2diem/blob/main/vcf2diem.py), and the
conversion was done without any problem with about 20G of MEM.
I then read the Python outputs into R, and carried on my analysis
following this tutorial (
https://cran.r-project.org/web/packages/diemr/vignettes/diemr-diagnostic-index-expecation-maximisation-in-r.html
).
---** The issue **---
I was able to produce the plot for each chromosome (please see the
attachment “Geum_SYM_CHR21_no_missingGENO.png”). However, there is no
individual/sample info or site info annotated on my plots (like Fig. 5 in
https://doi.org/10.1111/2041-210X.14010)
Do you know how I can produce plots like Fig. 5 in the original paper?
Many thanks in advance!
All the very best,
Meng
Geum_SYM_CHR21_no_missingGENO.png (view on web)
<https://github.com/StuartJEBaird/diem/assets/83467875/c89a32b7-8f53-466c-8a00-6fcbee5c3db6>
—
Reply to this email directly, view it on GitHub
<#6>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABDCV547B4XWO5NSGQAUSK3ZF426FAVCNFSM6AAAAABI3BULVSVHI2DSMVQWIX3LMV43ASLTON2WKOZSGMZTMMZTGMZTAMQ>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
One thing now: yes filter, but *after* running diem filter for high DI.
Will explain when back.
S
On Thu, 6 Jun 2024 at 3:17 pm, StuartJEBaird ***@***.***>
wrote:
… Hi memglu!
t a conference. Will answer Monday.
Stuart
On Wed, 5 Jun 2024 at 5:18 pm, MengLu-flw ***@***.***>
wrote:
> Hi Stuart and Natalia,
>
> I’ve been using diem on my samples from the hybrid zones, and I think
> this program has such unique value for hybridisation studies - many thanks
> for making diem available for us :^)
>
> ---** A bit of background **---
> My vcf.gz file (containing information from 21 main chromosomes only) was
> failed to convert to diem input format with vcf2diem() in R (requested 80G
> of MEM yet still having the out of memory issue). So, I switched to the
> Python script written by Sam Ebdon (
> https://github.com/samebdon/vcf2diem/blob/main/vcf2diem.py), and the
> conversion was done without any problem with about 20G of MEM.
>
> I then read the Python outputs into R, and carried on my analysis
> following this tutorial (
> https://cran.r-project.org/web/packages/diemr/vignettes/diemr-diagnostic-index-expecation-maximisation-in-r.html
> ).
>
> ---** The issue **---
> I was able to produce the plot for each chromosome (please see the
> attachment “Geum_SYM_CHR21_no_missingGENO.png”). However, there is no
> individual/sample info or site info annotated on my plots (like Fig. 5 in
> https://doi.org/10.1111/2041-210X.14010)
>
> Do you know how I can produce plots like Fig. 5 in the original paper?
>
> Many thanks in advance!
>
> All the very best,
> Meng
> Geum_SYM_CHR21_no_missingGENO.png (view on web)
> <https://github.com/StuartJEBaird/diem/assets/83467875/c89a32b7-8f53-466c-8a00-6fcbee5c3db6>
>
> —
> Reply to this email directly, view it on GitHub
> <#6>, or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/ABDCV547B4XWO5NSGQAUSK3ZF426FAVCNFSM6AAAAABI3BULVSVHI2DSMVQWIX3LMV43ASLTON2WKOZSGMZTMMZTGMZTAMQ>
> .
> You are receiving this because you are subscribed to this thread.Message
> ID: ***@***.***>
>
|
Hi Meng!
Sorry for the delay in my reply. I am now back from the conference.
On Wed, 5 Jun 2024 at 18:18, MengLu-flw ***@***.***> wrote:
Hi Stuart and Natalia,
I’ve been using diem on my samples from the hybrid zones, and I think this
program has such unique value for hybridisation studies - many thanks for
making diem available for us :^)
You are welcome.
---** A bit of background **---
My vcf.gz file (containing information from 21 main chromosomes only) was
failed to convert to diem input format with vcf2diem() in R (requested 80G
of MEM yet still having the out of memory issue). So, I switched to the
Python script written by Sam Ebdon (
https://github.com/samebdon/vcf2diem/blob/main/vcf2diem.py), and the
conversion was done without any problem with about 20G of MEM.
Excellent. Thank you for this information.
I then read the Python outputs into R, and carried on my analysis
following this tutorial (
https://cran.r-project.org/web/packages/diemr/vignettes/diemr-diagnostic-index-expecation-maximisation-in-r.html
).
---** The issue **---
I was able to produce the plot for each chromosome (please see the
attachment “Geum_SYM_CHR21_no_missingGENO.png”). However, there is no
individual/sample info or site info annotated on my plots (like Fig. 5 in
https://doi.org/10.1111/2041-210X.14010)
Do you know how I can produce plots like Fig. 5 in the original paper?
The different versions of *diem* are at different stages of development
when it comes to the visualisation step.
I will check with Natalia, check your second email, and get back to you.
Best
Stuart
… Many thanks in advance!
All the very best,
Meng
Geum_SYM_CHR21_no_missingGENO.png (view on web)
<https://github.com/StuartJEBaird/diem/assets/83467875/c89a32b7-8f53-466c-8a00-6fcbee5c3db6>
—
Reply to this email directly, view it on GitHub
<#6>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABDCV547B4XWO5NSGQAUSK3ZF426FAVCNFSM6AAAAABI3BULVSVHI2DSMVQWIX3LMV43ASLTON2WKOZSGMZTMMZTGMZTAMQ>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
Hi Again!
On Wed, 5 Jun 2024 at 18:28, MengLu-flw ***@***.***> wrote:
Hi again :^)
Within the same thread, I'd also like to ask a question about what is your
recommended practice for handling the missing genotypes in the input VCF.
The plot that I presented in the previous post is based on a VCF filtered
for "no missing genotype allowed".
Is this an option for Sam's vcf2diem? (I need to understand his filtering
better...will ask Sam and look at the code).
I also tried to run diem on a VCF with missing data (I only extracted the
first 1000 sites from this dataset to run diem), and the plot looks like
this:
Geum_SYM_CHR21_first_1000_sites.png (view on web)
<https://github.com/StuartJEBaird/diem/assets/83467875/3c505ee2-e343-4dfe-90d4-4a0315168fe7>
It seems to me that it would be better if I applied some filters to my VCF
before using diem (I am using whole genome resequencing data). What are
your general recommendations regarding this issue?
Looking forward to your reply!
I have two suggestions.
*First: *regarding your first analysis with ("no missing genotype allowed").
I would like to produce annotated output for you using the tools for
diemmca (the Mathematica version).
To add the annotation to the figure I need a list *individualIDs* (to go on
the Y axis),
and a *sites* 'bed'-like file for the sites analysed (to annotate the
x-axis).
The *individualIDs *must be in the order that the sites' states are encoded
for *diem*.
The 'bed'-like *sites* information needed is just the scaffold and refpos
of each site in the *diem* input.
I will also need the *diem* input and output:
I will polarise the *diem* input using the *diem* output (polarity column)
I will filter the polarised data using the *diem* output (diagnostic index
(DI) column)
I will plot the DI-filtered, polarised data while annotating with the
(DI-filtered) bed information (and individualIDs).
I have just talked with Natalia: she is going to send you tools in R so you
can do the same thing....then we can compare, to ensure no bugs!
*Second: *regarding your second analysis (with missing data).
You can see that maybe 10% of the 1000 sites in your missing-data plot are
data-full for most individuals.
These are the sites which will have high diagnostic index (DI) when all
sites are analysed by *diem*.
A filter that e.g. removes all sites with any missing data *before* *diem*
would be a very drastic censorship for the data you have.
Instead, to get the most from your data, it is best to run *diem* on all
the data (except invariant or empty sites),
and then *after* *diem* filter the sites by DI.
The difference *before* vs *after* is this:
Once we have done the analysis we have an efficient filter based on the
data.
Before we have done the analysis we can only guess at what an efficient
filter for this data might be!
Hope this is clear and helpful,
Best
Stuart
Best,
… Meng
—
Reply to this email directly, view it on GitHub
<#6 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABDCV53ML2GLY3KPJW54A7DZF44BNAVCNFSM6AAAAABI3BULVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNJQGQ4DEMBRHE>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
Dear Meng, Thank you for raising this issue. 1. Individual labels plotPolarized(gens, h)
axis(2, at = 1:length(inds), labels = inds[order(h)]) 2. Individual tick marks plotPolarized(gens, h)
# create a vector of colours
cols <- palette.colors(length(unique(taxa)), "Accent")[factor(taxa)]
# add coloured tick marks to the plot
invisible(Map(axis, side = 2, at = 1:length(inds), col.ticks = cols[order(h)], labels = "", lwd = 4)) Note that it is important to filter the sites after polarization as Stuart suggests. To obtain the filtered sites and updated hybrid indices, let's have # find the threshold to filter the top 2% of the most diagnostic sites
threshold <- quantile(DI, prob = .98)
# reduce the matrix of polarized genotypes to only include the most diagnostic markers
gens1 <- gens[, DI > threshold]
# update the individual hybrid indices from the filtered polarized genotypes
h1 <- apply(gens1,
MARGIN = 1,
FUN = \(x) pHetErrOnStateCount(sStateCount(x)))[1, ] The filtered polarized genotypes Let me know how this works for you. Best wishes, |
Hi Natalia - related to this (and sorry Meng for stealing your thread): where can I find the mapping from sequentially ordered markers to sites? I'd like to plot tract length distributions. |
Hi Simon!
Stuart here: Natalia and I were just discussing this across environments.
... this is how it is done in diemmca:
You should have a bed-like file for the refPositions (refPos) of your
markers - and you know their sequential order (seqOrd).
This means for each chromosome you can construct a table (matrix...
mapping):
refPos SeqOrd
32411000 1
32430000 2
32500000 3
etc
This allows one to define an interpolation from refPos to seqOrd.
Interrogate the interpolation for a range *Xticks*(refPos) on refPos (eg
1,2,3,4,5 Mb from start).
The interpolation answers are *Xticks*(seqOrd)
Tick the x-axis of a chromosome *diem* plot with
*Labels* *Xticks*(refPos)
*At positions* *Xticks*(seqOrd).
The method extends to plots showing multiple chromosomes by appropriate
'concatenation'.
This is a part of what might loosely be described as '*carpe*':
A wrapper of pre- and post-*diem* code to make user's lives easier.
is perhaps most fully developed in Mathematica, but we are expanding to
other platforms as time permits!
If you feel your code for doing this in eg R is neat - then consider
contributing it to the codebase...
Sam's vcf2diem has been super useful!
Best
Stuart
…On Tue, 11 Jun 2024 at 17:42, simonharnqvist ***@***.***> wrote:
Hi Natalia - related to this (and sorry Meng for stealing your thread):
where can I find the mapping from sequentially ordered markers to sites?
I'd like to plot tract length distributions.
—
Reply to this email directly, view it on GitHub
<#6 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABDCV54RSDQIM5LTNHU57X3ZG4LGPAVCNFSM6AAAAABI3BULVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNRRGA4DAMZYGY>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
Ah - I think the problem here is that Sam's script doesn't output anything like a refPos BED, whereas it looks like vcf2diem.r produces something like that:
If that's what I'm looking for then it sounds like I need to try the R version again |
Well spotted. I am just now modifying Sam's vcf2diem so that it outputs 2
similar bed-like files.
This should be easy - Sam's code is really clean.
I believe Natalia has updated the R vcf2diem in the light of the bug your
data threw.
Stuart
…On Wed, 12 Jun 2024 at 12:02, simonharnqvist ***@***.***> wrote:
Ah - I think the problem here is that Sam's script doesn't output anything
like a refPos BED, whereas it looks like vcf2diem.r produces something like
that:
paste0(filepath, "-omittedLoci.txt"),
paste0(filepath, "-includedLoci.txt")
If that's what I'm looking for then it sounds like I need to try the R
version again
—
Reply to this email directly, view it on GitHub
<#6 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABDCV5ZJPKA7RHY2RMPMEE3ZHAMDLAVCNFSM6AAAAABI3BULVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNRSGYYTCOJQGI>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
Simon, have you tried the -l option for Sam's vcf2diem?
S
On Wed, 12 Jun 2024 at 12:10, StuartJEBaird ***@***.***>
wrote:
… Well spotted. I am just now modifying Sam's vcf2diem so that it outputs 2
similar bed-like files.
This should be easy - Sam's code is really clean.
I believe Natalia has updated the R vcf2diem in the light of the bug your
data threw.
Stuart
On Wed, 12 Jun 2024 at 12:02, simonharnqvist ***@***.***>
wrote:
> Ah - I think the problem here is that Sam's script doesn't output
> anything like a refPos BED, whereas it looks like vcf2diem.r produces
> something like that:
>
> paste0(filepath, "-omittedLoci.txt"),
> paste0(filepath, "-includedLoci.txt")
>
> If that's what I'm looking for then it sounds like I need to try the R
> version again
>
> —
> Reply to this email directly, view it on GitHub
> <#6 (comment)>,
> or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/ABDCV5ZJPKA7RHY2RMPMEE3ZHAMDLAVCNFSM6AAAAABI3BULVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNRSGYYTCOJQGI>
> .
> You are receiving this because you commented.Message ID:
> ***@***.***>
>
|
Aha! Classic case of |
Dear Meng and Simon, I will be sending When you get a chance to test, could you let me know how the changes address your needs? Best, |
Hi Natalia, Thank you so much! I'll do so :^) Best, |
Hi Stuart and Natalia,
I’ve been using diem on my samples from the hybrid zones, and I think this program has such unique value for hybridisation studies - many thanks for making diem available for us :^)
---** A bit of background **---
My vcf.gz file (containing information from 21 main chromosomes only) was failed to convert to diem input format with vcf2diem() in R (requested 80G of MEM yet still having the out of memory issue). So, I switched to the Python script written by Sam Ebdon (https://github.com/samebdon/vcf2diem/blob/main/vcf2diem.py), and the conversion was done without any problem costing about 20G of MEM.
I then read the Python outputs into R, and carried on my analysis following this tutorial (https://cran.r-project.org/web/packages/diemr/vignettes/diemr-diagnostic-index-expecation-maximisation-in-r.html).
---** The issue **---
I was able to produce the plot for each chromosome. However, there is no individual/sample info or site info annotated on my plots (like Fig. 5 in https://doi.org/10.1111/2041-210X.14010)
Do you know how I can produce plots like Fig. 5 in the original paper?
Many thanks in advance!
All the very best,
Meng
P.S. Here is one example of my plots.
![Geum_SYM_CHR21_no_missingGENO](https://private-user-images.githubusercontent.com/83467875/336922265-c89a32b7-8f53-466c-8a00-6fcbee5c3db6.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjA2NzMyOTAsIm5iZiI6MTcyMDY3Mjk5MCwicGF0aCI6Ii84MzQ2Nzg3NS8zMzY5MjIyNjUtYzg5YTMyYjctOGY1My00NjZjLThhMDAtNmZjYmVlNWMzZGI2LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzExVDA0NDMxMFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTllMmZhNGU5ZjcwNmZhZWM1ODk2NWMxMTViM2FjMmNhMTgxMzNjMWVmMjFmYWQxNTY1NzBmOWY5YTliZWQwY2YmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.S4fWSvzmxIR_Ka57Avj5SdlVLcVbMUGPuQuri4iisIQ)
The text was updated successfully, but these errors were encountered: