Plot annotation - missing info #6

Open
MengLu-flw opened this issue Jun 5, 2024 · 14 comments

@MengLu-flw

MengLu-flw commented Jun 5, 2024

Hi Stuart and Natalia,

I’ve been using diem on my samples from the hybrid zones, and I think this program has such unique value for hybridisation studies - many thanks for making diem available for us :^)

---** A bit of background **---
My vcf.gz file (containing only the 21 main chromosomes) failed to convert to the diem input format with vcf2diem() in R: I requested 80 GB of memory and still ran out. So I switched to the Python script written by Sam Ebdon (https://github.com/samebdon/vcf2diem/blob/main/vcf2diem.py), and the conversion finished without any problem, using about 20 GB of memory.

I then read the Python outputs into R, and carried on my analysis following this tutorial (https://cran.r-project.org/web/packages/diemr/vignettes/diemr-diagnostic-index-expecation-maximisation-in-r.html).

---** The issue **---
I was able to produce the plot for each chromosome. However, there is no individual/sample info or site info annotated on my plots (like Fig. 5 in https://doi.org/10.1111/2041-210X.14010)

Do you know how I can produce plots like Fig. 5 in the original paper?

Many thanks in advance!

All the very best,
Meng

P.S. Here is one example of my plots.
[Image: Geum_SYM_CHR21_no_missingGENO]

@MengLu-flw
Author

Hi again :^)

Within the same thread, I'd also like to ask what your recommended practice is for handling missing genotypes in the input VCF.

The plot that I presented in the previous post is based on a VCF filtered for "no missing genotype allowed".

I also tried to run diem on a VCF with missing data (I only extracted the first 1000 sites from this dataset to run diem), and the plot looks like this:
[Image: Geum_SYM_CHR21_first_1000_sites]

It seems to me that it would be better if I applied some filters to my VCF before using diem (I am using whole genome resequencing data). What are your general recommendations regarding this issue?

Looking forward to your reply!

Best,
Meng

@StuartJEBaird
Owner

StuartJEBaird commented Jun 6, 2024 via email

@StuartJEBaird
Owner

StuartJEBaird commented Jun 6, 2024 via email

@StuartJEBaird
Owner

StuartJEBaird commented Jun 11, 2024 via email

@StuartJEBaird
Owner

StuartJEBaird commented Jun 11, 2024 via email

@nmartinkova
Collaborator

Dear Meng,

Thank you for raising this issue. diemr currently has no option in plotPolarized to include individual labels in the plot. We have discussed how to improve the functionality and we will release an update soon. In the meantime, here are two ways to identify individuals in the plot of polarized markers.

1. Individual labels
Provided that gens is the matrix of polarized genotypes, such as that resulting from importPolarized, h is a numeric vector of hybrid indices, and inds is a vector of individual labels, one can include them in the plot as:

plotPolarized(gens, h)
axis(2, at = 1:length(inds), labels = inds[order(h)])

2. Individual tick marks
Alternatively, one might prefer to show colour-coded tick marks. This option is especially useful when the number of individuals is large and the names overlap. For this plotting option, let's assume that taxa is a character vector which specifies the taxa or another level of groups of interest.

plotPolarized(gens, h)
# create a vector of colours
cols <- palette.colors(length(unique(taxa)), "Accent")[factor(taxa)]
# add coloured tick marks to the plot
invisible(Map(axis, side = 2, at = 1:length(inds), col.ticks = cols[order(h)], labels = "", lwd = 4))
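
To see how the colour vector lines up with individuals before applying it to a real plot, the indexing can be checked on toy data in base R. This is only a sketch: the values of taxa and h below are made up, and no diemr function is called.

```r
# Toy inputs (hypothetical values; replace with your own vectors)
taxa <- c("A", "B", "A", "C", "B")   # group of each individual
h    <- c(0.9, 0.1, 0.5, 0.3, 0.7)   # hybrid index of each individual

# one colour per taxon, then expanded to one colour per individual
pal  <- palette.colors(length(unique(taxa)), "Accent")
cols <- pal[factor(taxa)]

# reorder the colours by hybrid index, as done for col.ticks above
cols_plot <- cols[order(h)]
```

Individuals from the same taxon share a colour in cols, and cols_plot lists the colours from the lowest to the highest hybrid index.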

Note that it is important to filter the sites after polarization, as Stuart suggests. To obtain the filtered sites and updated hybrid indices, let DI be a numeric vector of diagnostic indices from the diem analysis (found in the file MarkerDiagnosticsWithOptimalPolarities.txt or in the function output as diemResult$DI$DI).

# find the threshold to filter the top 2% of the most diagnostic sites
threshold <- quantile(DI, probs = 0.98)
# reduce the matrix of polarized genotypes to only include the most diagnostic markers
gens1 <- gens[, DI > threshold]
# update the individual hybrid indices from the filtered polarized genotypes
h1 <- apply(gens1, 
    MARGIN = 1, 
    FUN = \(x) pHetErrOnStateCount(sStateCount(x)))[1, ]

The filtered polarized genotypes gens1 and the updated hybrid indices h1 can then be used for plotting as shown earlier.
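
As a quick sanity check of the filtering step, the same quantile logic can be tried on toy data in base R. Everything below is hypothetical (random DI values, a random genotype matrix, hand-picked hybrid indices) and only illustrates the indexing, not real diem output. Two details are worth noting: drop = FALSE keeps gens1 a matrix even if very few sites pass the threshold, and the labels should be ordered by the updated hybrid indices h1 rather than the original h.

```r
set.seed(1)
n_sites <- 1000
inds <- paste0("ind", 1:5)                      # hypothetical individual labels
DI   <- rnorm(n_sites, mean = -50, sd = 10)     # made-up diagnostic indices
gens <- matrix(sample(c("0", "1", "2"), length(inds) * n_sites, replace = TRUE),
               nrow = length(inds))             # made-up polarized genotypes

# keep the top 2% most diagnostic sites
threshold <- quantile(DI, probs = 0.98)
keep  <- DI > threshold
gens1 <- gens[, keep, drop = FALSE]             # stays a matrix even with one site
dim(gens1)                                      # 5 individuals, 20 sites

# labels reordered by the updated hybrid indices (values made up here)
h1 <- c(0.9, 0.1, 0.5, 0.3, 0.7)
labels_sorted <- inds[order(h1)]
```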

Let me know how this works for you.

Best wishes,
Natalia

@simonharnqvist

Hi Natalia - related to this (and sorry Meng for stealing your thread): where can I find the mapping from sequentially ordered markers to sites? I'd like to plot tract length distributions.

@StuartJEBaird
Owner

StuartJEBaird commented Jun 11, 2024 via email

@simonharnqvist

Ah - I think the problem here is that Sam's script doesn't output anything like a refPos BED, whereas it looks like vcf2diem.r produces something like that:

      paste0(filepath, "-omittedLoci.txt"),
      paste0(filepath, "-includedLoci.txt")

If that's what I'm looking for, then it sounds like I need to try the R version again.

@StuartJEBaird
Owner

StuartJEBaird commented Jun 12, 2024 via email

@StuartJEBaird
Owner

StuartJEBaird commented Jun 12, 2024 via email

@simonharnqvist

Aha! Classic case of RTFM: Read The F... Fine Manual.
Thanks!

@nmartinkova
Collaborator

Dear Meng and Simon,

I will be sending diemr 1.3 to CRAN today; it addresses the issues raised here. It also includes a new vignette on what the output files contain and how to use them.

When you get a chance to test, could you let me know how the changes address your needs?

Best,
Natalia

@MengLu-flw
Author

Hi Natalia,

Thank you so much! I'll do so :^)
Have a lovely day.

Best,
Meng
