Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LD Reference #14

Closed
samkleeman1 opened this issue May 6, 2021 · 14 comments
Closed

LD Reference #14

samkleeman1 opened this issue May 6, 2021 · 14 comments

Comments

@samkleeman1
Copy link

Hi,

Congratulations on the preprint and many thanks for making this available.

I am just about to make our own LD reference files according to your detailed tutorial, to cover EUR, AFR, CSA and EAS ancestries. I was wondering what your thoughts were about using the 1000 Genomes and Human Genome Diversity Project callset provided by gnomad for the input (MAF >1%, variants and samples passing their QC)? This would be instead of the phase 3 imputed data that you used in the preprint. Alternatively I wondered about using the Biobank data as input, as our summary statistics are all from UKB. Very grateful for any thoughts or suggestions on this matter, and I would be happy to make the reference files available for others in future.

Kind regards,

Sam Kleeman
PhD Student, Cold Spring Harbor Laboratory

@paturley
Copy link
Collaborator

paturley commented May 6, 2021

Hi Sam!

Thanks! We are excited to finally have it out for people to start using.

That's a great question. I can't think of a reason that you couldn't use any data set as a reference panel as long as it is a good match for the GWAS samples that are used and the genotyping is dense enough. That would be my main considerations with using the UKB. One one hand, if you are using a UKB GWAS, then the ancestry matching would be perfect. However, if the there are SNPs that aren't included in the UKB data (perhaps because they weren't genotyped or weren't imputed well enough) and if there are big LD differences between the genotyped and ungenotyped SNPs between ancestries, that could affect MAMA's performance.

Best,.
Patrick

@samkleeman1
Copy link
Author

Hi Patrick,

Many thanks for your detailed response, much appreciated. The imputation point is an important one, the recent TOPMED paper seems to suggest that HRC based imputation (i.e. what UKB used) for non-European ancestries does not perform very well. This might be a reason to use whole genome sequencing, which is directly genotyped rather than imputed. I am rather in two minds as to whether to go for the 1000G/HGDP panel (approx 3000 subjects, with well characterized ancestry assignments) versus using HRC-imputed UKB data.

Sam

@samkleeman1
Copy link
Author

Also - I am good to include four different ancestry groups as the input to python3 ../mama_ldscores.py

@paturley
Copy link
Collaborator

paturley commented May 6, 2021 via email

@samkleeman1
Copy link
Author

This line...

all_snps = pd.DataFrame(pd.concat(bims.values())[1].drop_duplicates())
...is creating some issues for me - in the UK biobank dataset there are some "duplicated" SNPs e.g. a single rsid refers to two or more single nucleotide polymorphisms e.g. A-->T and A-->G. This means that the following commands lead to the following error:

Length of values (9572556) does not match length of index (9564716)

Shall I just remove all "duplicated" SNPs from the input plink files?

@paturley
Copy link
Collaborator

paturley commented May 7, 2021

Hmm. This is from the tutorial? I think that the software should be robust to duplicates, but it's possible that the tutorial isn't. In the meantime, removing duplicates should probably solve this problem, but we will try to fix this in the longer run. @JonJala Can you check on this?

Thanks!

@samkleeman1
Copy link
Author

Yes this is from the tutorial.

On an unrelated note I think you could consider adding the following lines after 'clean the ancestry file' in https://github.com/JonJala/mama/blob/092297fe357349bb525602353ba3e3b2dc3b339e/legacy/mama_ldscore.py. Otherwise it does not correctly merge together the two pandas dataframes.

array_indivs.IDList["IID"] = pd.to_numeric(array_indivs.IDList["IID"])
df_ances.IID = pd.to_numeric(df_ances.IID)

@samkleeman1
Copy link
Author

This line in the tutorial creates the duplicate issue

all_snps = pd.DataFrame(pd.concat(bims.values())[1].drop_duplicates())

@JonJala
Copy link
Owner

JonJala commented May 7, 2021

Ah, ok. Thanks for pointing this stuff out. The tutorial is a pretty recent thing and not as robust as the MAMA software should hopefully be (especially the main method, after all the LD score creation is done), and clearly there are still some kinks to work out.

@ggoldman1
Copy link
Collaborator

Hi @samkleeman1 -- I'm going to help out with this.

Yes, I am assuming that the user does not have duplicated SNP's in their .bim file(s). Have you tried dropping duplicate SNPs in your .bim files (prior to running 1.make_defn_files.ipynb?) I think that's probably the cleanest way to deal with this.

@samkleeman1
Copy link
Author

Yes that's what I did, and that fixed the problem. But I guess we are removing most/all common multi-allelic SNPs, there are only a few thousands though.

Also re above, adding these lines works better

array_indivs.IDList["IID"] = array_indivs.IDList["IID"].astype(str)
df_ances.IID = df_ances.IID.astype(str)

@ggoldman1
Copy link
Collaborator

Ok cool, I'm glad you were able to resolve this! I'm just trying to think if there's a way to keep the multi-allelic SNPs without any issues. I'll need to revisit our LDSC construction and get back to you. Just out of curiosity, are you able to run the standard LDSC module with your data?

@samkleeman1
Copy link
Author

I would have to double check and get back to you re LDSC

@ggoldman1
Copy link
Collaborator

No worries! Was just curious but no need to deal with that (unless you want to).

Thanks for that suggestion on the mama_ldscore.py script, by the way! The second suggestion looks good to me (was worried with the first that casting to numeric would be problematic for IID's that contain non-numeric characters).

I'm going to close this out and add a note about dropping duplicate rsID's (at least for now) and look into your suggestion with mama_ldscore.py. Thanks again for raising this! Let us know if anything else comes up.

Best,
Grant

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants