Plot annotation - missing info #6

MengLu-flw · 2024-06-05T16:18:22Z

Hi Stuart and Natalia,

I’ve been using diem on my samples from the hybrid zones, and I think this program has such unique value for hybridisation studies - many thanks for making diem available for us :^)

---** A bit of background **---
My vcf.gz file (containing information from 21 main chromosomes only) was failed to convert to diem input format with vcf2diem() in R (requested 80G of MEM yet still having the out of memory issue). So, I switched to the Python script written by Sam Ebdon (https://github.com/samebdon/vcf2diem/blob/main/vcf2diem.py), and the conversion was done without any problem costing about 20G of MEM.

I then read the Python outputs into R, and carried on my analysis following this tutorial (https://cran.r-project.org/web/packages/diemr/vignettes/diemr-diagnostic-index-expecation-maximisation-in-r.html).

---** The issue **---
I was able to produce the plot for each chromosome. However, there is no individual/sample info or site info annotated on my plots (like Fig. 5 in https://doi.org/10.1111/2041-210X.14010)

Do you know how I can produce plots like Fig. 5 in the original paper?

Many thanks in advance!

All the very best,
Meng

P.S. Here is one example of my plots.

MengLu-flw · 2024-06-05T16:27:44Z

Hi again :^)

Within the same thread, I'd also like to ask a question about what is your recommended practice for handling the missing genotypes in the input VCF.

The plot that I presented in the previous post is based on a VCF filtered for "no missing genotype allowed".

I also tried to run diem on a VCF with missing data (I only extracted the first 1000 sites from this dataset to run diem), and the plot looks like this:

It seems to me that it would be better if I applied some filters to my VCF before using diem (I am using whole genome resequencing data). What are your general recommendations regarding this issue?

Looking forward to your reply!

Best,
Meng

StuartJEBaird · 2024-06-06T14:17:40Z

Hi memglu! t a conference. Will answer Monday. Stuart

…

On Wed, 5 Jun 2024 at 5:18 pm, MengLu-flw ***@***.***> wrote: Hi Stuart and Natalia, I’ve been using diem on my samples from the hybrid zones, and I think this program has such unique value for hybridisation studies - many thanks for making diem available for us :^) ---** A bit of background **--- My vcf.gz file (containing information from 21 main chromosomes only) was failed to convert to diem input format with vcf2diem() in R (requested 80G of MEM yet still having the out of memory issue). So, I switched to the Python script written by Sam Ebdon ( https://github.com/samebdon/vcf2diem/blob/main/vcf2diem.py), and the conversion was done without any problem with about 20G of MEM. I then read the Python outputs into R, and carried on my analysis following this tutorial ( https://cran.r-project.org/web/packages/diemr/vignettes/diemr-diagnostic-index-expecation-maximisation-in-r.html ). ---** The issue **--- I was able to produce the plot for each chromosome (please see the attachment “Geum_SYM_CHR21_no_missingGENO.png”). However, there is no individual/sample info or site info annotated on my plots (like Fig. 5 in https://doi.org/10.1111/2041-210X.14010) Do you know how I can produce plots like Fig. 5 in the original paper? Many thanks in advance! All the very best, Meng Geum_SYM_CHR21_no_missingGENO.png (view on web) <https://github.com/StuartJEBaird/diem/assets/83467875/c89a32b7-8f53-466c-8a00-6fcbee5c3db6> — Reply to this email directly, view it on GitHub <#6>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABDCV547B4XWO5NSGQAUSK3ZF426FAVCNFSM6AAAAABI3BULVSVHI2DSMVQWIX3LMV43ASLTON2WKOZSGMZTMMZTGMZTAMQ> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

StuartJEBaird · 2024-06-06T14:22:56Z

One thing now: yes filter, but *after* running diem filter for high DI. Will explain when back. S On Thu, 6 Jun 2024 at 3:17 pm, StuartJEBaird ***@***.***> wrote:

…

Hi memglu! t a conference. Will answer Monday. Stuart On Wed, 5 Jun 2024 at 5:18 pm, MengLu-flw ***@***.***> wrote: > Hi Stuart and Natalia, > > I’ve been using diem on my samples from the hybrid zones, and I think > this program has such unique value for hybridisation studies - many thanks > for making diem available for us :^) > > ---** A bit of background **--- > My vcf.gz file (containing information from 21 main chromosomes only) was > failed to convert to diem input format with vcf2diem() in R (requested 80G > of MEM yet still having the out of memory issue). So, I switched to the > Python script written by Sam Ebdon ( > https://github.com/samebdon/vcf2diem/blob/main/vcf2diem.py), and the > conversion was done without any problem with about 20G of MEM. > > I then read the Python outputs into R, and carried on my analysis > following this tutorial ( > https://cran.r-project.org/web/packages/diemr/vignettes/diemr-diagnostic-index-expecation-maximisation-in-r.html > ). > > ---** The issue **--- > I was able to produce the plot for each chromosome (please see the > attachment “Geum_SYM_CHR21_no_missingGENO.png”). However, there is no > individual/sample info or site info annotated on my plots (like Fig. 5 in > https://doi.org/10.1111/2041-210X.14010) > > Do you know how I can produce plots like Fig. 5 in the original paper? > > Many thanks in advance! > > All the very best, > Meng > Geum_SYM_CHR21_no_missingGENO.png (view on web) > <https://github.com/StuartJEBaird/diem/assets/83467875/c89a32b7-8f53-466c-8a00-6fcbee5c3db6> > > — > Reply to this email directly, view it on GitHub > <#6>, or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ABDCV547B4XWO5NSGQAUSK3ZF426FAVCNFSM6AAAAABI3BULVSVHI2DSMVQWIX3LMV43ASLTON2WKOZSGMZTMMZTGMZTAMQ> > . > You are receiving this because you are subscribed to this thread.Message > ID: ***@***.***> >

StuartJEBaird · 2024-06-11T08:23:35Z

Hi Meng! Sorry for the delay in my reply. I am now back from the conference.

On Wed, 5 Jun 2024 at 18:18, MengLu-flw ***@***.***> wrote: Hi Stuart and Natalia, I’ve been using diem on my samples from the hybrid zones, and I think this program has such unique value for hybridisation studies - many thanks for making diem available for us :^)

You are welcome.

---** A bit of background **--- My vcf.gz file (containing information from 21 main chromosomes only) was failed to convert to diem input format with vcf2diem() in R (requested 80G of MEM yet still having the out of memory issue). So, I switched to the Python script written by Sam Ebdon ( https://github.com/samebdon/vcf2diem/blob/main/vcf2diem.py), and the conversion was done without any problem with about 20G of MEM.

Excellent. Thank you for this information.

I then read the Python outputs into R, and carried on my analysis following this tutorial ( https://cran.r-project.org/web/packages/diemr/vignettes/diemr-diagnostic-index-expecation-maximisation-in-r.html ). ---** The issue **--- I was able to produce the plot for each chromosome (please see the attachment “Geum_SYM_CHR21_no_missingGENO.png”). However, there is no individual/sample info or site info annotated on my plots (like Fig. 5 in https://doi.org/10.1111/2041-210X.14010) Do you know how I can produce plots like Fig. 5 in the original paper?

The different versions of *diem* are at different stages of development when it comes to the visualisation step. I will check with Natalia, check your second email, and get back to you. Best Stuart

…

Many thanks in advance! All the very best, Meng Geum_SYM_CHR21_no_missingGENO.png (view on web) <https://github.com/StuartJEBaird/diem/assets/83467875/c89a32b7-8f53-466c-8a00-6fcbee5c3db6> — Reply to this email directly, view it on GitHub <#6>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABDCV547B4XWO5NSGQAUSK3ZF426FAVCNFSM6AAAAABI3BULVSVHI2DSMVQWIX3LMV43ASLTON2WKOZSGMZTMMZTGMZTAMQ> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

StuartJEBaird · 2024-06-11T10:12:15Z

Hi Again!

On Wed, 5 Jun 2024 at 18:28, MengLu-flw ***@***.***> wrote: Hi again :^) Within the same thread, I'd also like to ask a question about what is your recommended practice for handling the missing genotypes in the input VCF. The plot that I presented in the previous post is based on a VCF filtered for "no missing genotype allowed".

Is this an option for Sam's vcf2diem? (I need to understand his filtering better...will ask Sam and look at the code).

I also tried to run diem on a VCF with missing data (I only extracted the first 1000 sites from this dataset to run diem), and the plot looks like this: Geum_SYM_CHR21_first_1000_sites.png (view on web) <https://github.com/StuartJEBaird/diem/assets/83467875/3c505ee2-e343-4dfe-90d4-4a0315168fe7> It seems to me that it would be better if I applied some filters to my VCF before using diem (I am using whole genome resequencing data). What are your general recommendations regarding this issue? Looking forward to your reply!

I have two suggestions. *First: *regarding your first analysis with ("no missing genotype allowed"). I would like to produce annotated output for you using the tools for diemmca (the Mathematica version). To add the annotation to the figure I need a list *individualIDs* (to go on the Y axis), and a *sites* 'bed'-like file for the sites analysed (to annotate the x-axis). The *individualIDs *must be in the order that the sites' states are encoded for *diem*. The 'bed'-like *sites* information needed is just the scaffold and refpos of each site in the *diem* input. I will also need the *diem* input and output: I will polarise the *diem* input using the *diem* output (polarity column) I will filter the polarised data using the *diem* output (diagnostic index (DI) column) I will plot the DI-filtered, polarised data while annotating with the (DI-filtered) bed information (and individualIDs). I have just talked with Natalia: she is going to send you tools in R so you can do the same thing....then we can compare, to ensure no bugs! *Second: *regarding your second analysis (with missing data). You can see that maybe 10% of the 1000 sites in your missing-data plot are data-full for most individuals. These are the sites which will have high diagnostic index (DI) when all sites are analysed by *diem*. A filter that e.g. removes all sites with any missing data *before* *diem* would be a very drastic censorship for the data you have. Instead, to get the most from your data, it is best to run *diem* on all the data (except invariant or empty sites), and then *after* *diem* filter the sites by DI. The difference *before* vs *after* is this: Once we have done the analysis we have an efficient filter based on the data. Before we have done the analysis we can only guess at what an efficient filter for this data might be! Hope this is clear and helpful, Best Stuart Best,

…

Meng — Reply to this email directly, view it on GitHub <#6 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABDCV53ML2GLY3KPJW54A7DZF44BNAVCNFSM6AAAAABI3BULVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNJQGQ4DEMBRHE> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

nmartinkova · 2024-06-11T10:41:50Z

Dear Meng,

Thank you for raising this issue. diemr currently does not have an option in plotPolarized that would include individual labels in the plot. We have discussed how to improve the functionality and we will release an update soon. In the meantime, here are two alternatives how to identify individuals in the plot of polarized markers.

1. Individual labels
Provided that gens is the matrix of polarized genotypes, such as that resulting from importPolarized, h is a numeric vector of hybrid indices, and inds is a vector of individual labels, one can include them in the plot as:

plotPolarized(gens, h)
axis(2, at = 1:length(inds), labels = inds[order(h)])

2. Individual tick marks
Alternatively, one might prefer to show colour-coded tick marks. This option is especially useful when the number of individuals is large and the names overlap. For this plotting option, let's assume that taxa is a character vector which specifies the taxa or another level of groups of interest.

plotPolarized(gens, h)
# create a vector of colours
cols <- palette.colors(length(unique(taxa)), "Accent")[factor(taxa)]
# add coloured tick marks to the plot
invisible(Map(axis, side = 2, at = 1:length(inds), col.ticks = cols[order(h)], labels = "", lwd = 4))

Note that it is important to filter the sites after polarization as Stuart suggests. To obtain the filtered sites and updated hybrid indices, let's have DI that is a numeric vector of diagnostic indices from the diem analysis (to be found in the file MarkerDiagnosticsWithOptimalPolarities.txt or in the function output in the diemResult$DI$DI).

# find the threshold to filter the top 2% of the most diagnostic sites
threshold <- quantile(DI, prob = .98)
# reduce the matrix of polarized genotypes to only include the most diagnostic markers
gens1 <- gens[, DI > threshold]
# update the individual hybrid indices from the filtered polarized genotypes
h1 <- apply(gens1, 
    MARGIN = 1, 
    FUN = \(x) pHetErrOnStateCount(sStateCount(x)))[1, ]

The filtered polarized genotypes gens1 and the updated hybrid indices h1 can then be used for plotting as shown earlier.

Let me know how this works for you.

Best wishes,
Natalia

simonharnqvist · 2024-06-11T15:42:10Z

Hi Natalia - related to this (and sorry Meng for stealing your thread): where can I find the mapping from sequentially ordered markers to sites? I'd like to plot tract length distributions.

StuartJEBaird · 2024-06-11T15:59:50Z

Hi Simon! Stuart here: Natalia and I were just discussing this across environments. ... this is how it is done in diemmca: You should have a bed-like file for the refPositions (refPos) of your markers - and you know their sequential order (seqOrd). This means for each chromosome you can construct a table (matrix... mapping): refPos SeqOrd 32411000 1 32430000 2 32500000 3 etc This allows one to define an interpolation from refPos to seqOrd. Interrogate the interpolation for a range *Xticks*(refPos) on refPos (eg 1,2,3,4,5 Mb from start). The interpolation answers are *Xticks*(seqOrd) Tick the x-axis of a chromosome *diem* plot with *Labels* *Xticks*(refPos) *At positions* *Xticks*(seqOrd). The method extends to plots showing multiple chromosomes by appropriate 'concatenation'. This is a part of what might loosely be described as '*carpe*': A wrapper of pre- and post-*diem* code to make user's lives easier. is perhaps most fully developed in Mathematica, but we are expanding to other platforms as time permits! If you feel your code for doing this in eg R is neat - then consider contributing it to the codebase... Sam's vcf2diem has been super useful! Best Stuart

…

On Tue, 11 Jun 2024 at 17:42, simonharnqvist ***@***.***> wrote: Hi Natalia - related to this (and sorry Meng for stealing your thread): where can I find the mapping from sequentially ordered markers to sites? I'd like to plot tract length distributions. — Reply to this email directly, view it on GitHub <#6 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABDCV54RSDQIM5LTNHU57X3ZG4LGPAVCNFSM6AAAAABI3BULVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNRRGA4DAMZYGY> . You are receiving this because you commented.Message ID: ***@***.***>

simonharnqvist · 2024-06-12T10:02:08Z

Ah - I think the problem here is that Sam's script doesn't output anything like a refPos BED, whereas it looks like vcf2diem.r produces something like that:

      paste0(filepath, "-omittedLoci.txt"),
      paste0(filepath, "-includedLoci.txt")

If that's what I'm looking for then it sounds like I need to try the R version again

StuartJEBaird · 2024-06-12T10:10:53Z

Well spotted. I am just now modifying Sam's vcf2diem so that it outputs 2 similar bed-like files. This should be easy - Sam's code is really clean. I believe Natalia has updated the R vcf2diem in the light of the bug your data threw. Stuart

…

On Wed, 12 Jun 2024 at 12:02, simonharnqvist ***@***.***> wrote: Ah - I think the problem here is that Sam's script doesn't output anything like a refPos BED, whereas it looks like vcf2diem.r produces something like that: paste0(filepath, "-omittedLoci.txt"), paste0(filepath, "-includedLoci.txt") If that's what I'm looking for then it sounds like I need to try the R version again — Reply to this email directly, view it on GitHub <#6 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABDCV5ZJPKA7RHY2RMPMEE3ZHAMDLAVCNFSM6AAAAABI3BULVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNRSGYYTCOJQGI> . You are receiving this because you commented.Message ID: ***@***.***>

StuartJEBaird · 2024-06-12T10:23:40Z

Simon, have you tried the -l option for Sam's vcf2diem? S On Wed, 12 Jun 2024 at 12:10, StuartJEBaird ***@***.***> wrote:

…

Well spotted. I am just now modifying Sam's vcf2diem so that it outputs 2 similar bed-like files. This should be easy - Sam's code is really clean. I believe Natalia has updated the R vcf2diem in the light of the bug your data threw. Stuart On Wed, 12 Jun 2024 at 12:02, simonharnqvist ***@***.***> wrote: > Ah - I think the problem here is that Sam's script doesn't output > anything like a refPos BED, whereas it looks like vcf2diem.r produces > something like that: > > paste0(filepath, "-omittedLoci.txt"), > paste0(filepath, "-includedLoci.txt") > > If that's what I'm looking for then it sounds like I need to try the R > version again > > — > Reply to this email directly, view it on GitHub > <#6 (comment)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ABDCV5ZJPKA7RHY2RMPMEE3ZHAMDLAVCNFSM6AAAAABI3BULVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNRSGYYTCOJQGI> > . > You are receiving this because you commented.Message ID: > ***@***.***> >

simonharnqvist · 2024-06-12T10:42:23Z

Aha! Classic case of RTFM: Read The F... Fine Manual.
Thanks!

nmartinkova · 2024-06-14T06:26:25Z

Dear Meng and Simon,

I will be sending diemr 1.3 to CRAN today that implements the issues raised here. It includes also a new vignette on what the output files contain and how to use them.

When you get a chance to test, could you let me know how the changes address your needs?

Best,
Natalia

MengLu-flw · 2024-06-14T09:36:27Z

Hi Natalia,

Thank you so much! I'll do so :^)
Have a lovely day.

Best,
Meng

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Plot annotation - missing info #6

Plot annotation - missing info #6

MengLu-flw commented Jun 5, 2024 •

edited

Loading

MengLu-flw commented Jun 5, 2024

StuartJEBaird commented Jun 6, 2024 via email

StuartJEBaird commented Jun 6, 2024 via email

StuartJEBaird commented Jun 11, 2024 via email

StuartJEBaird commented Jun 11, 2024 via email

nmartinkova commented Jun 11, 2024

simonharnqvist commented Jun 11, 2024

StuartJEBaird commented Jun 11, 2024 via email

simonharnqvist commented Jun 12, 2024

StuartJEBaird commented Jun 12, 2024 via email

StuartJEBaird commented Jun 12, 2024 via email

simonharnqvist commented Jun 12, 2024

nmartinkova commented Jun 14, 2024

MengLu-flw commented Jun 14, 2024

Plot annotation - missing info #6

Plot annotation - missing info #6

Comments

MengLu-flw commented Jun 5, 2024 • edited Loading

MengLu-flw commented Jun 5, 2024

StuartJEBaird commented Jun 6, 2024 via email

StuartJEBaird commented Jun 6, 2024 via email

StuartJEBaird commented Jun 11, 2024 via email

StuartJEBaird commented Jun 11, 2024 via email

nmartinkova commented Jun 11, 2024

simonharnqvist commented Jun 11, 2024

StuartJEBaird commented Jun 11, 2024 via email

simonharnqvist commented Jun 12, 2024

StuartJEBaird commented Jun 12, 2024 via email

StuartJEBaird commented Jun 12, 2024 via email

simonharnqvist commented Jun 12, 2024

nmartinkova commented Jun 14, 2024

MengLu-flw commented Jun 14, 2024

MengLu-flw commented Jun 5, 2024 •

edited

Loading