#### Notebook to format genotypes for use with tensorQTL

WGS genotypes are typically stored per chromosome, as VCF or PLINK2 pfiles.
tensorQTL reads PLINK1 bfiles, so convert the pfiles to bfiles; since this is a small cohort, also merge the per-chromosome files into a single genome-wide fileset.

For now tensorQTL is still run with PLINK bfiles, but it is worth checking whether newer versions of tensorQTL can read VCF directly (I think they can).

In [1]:
!date

Wed Dec 14 16:17:54 EST 2022


#### import libraries

In [2]:
import concurrent.futures
from os.path import exists
from os import sched_getaffinity
from pandas import read_csv

#### set notebook variables

In [3]:
# naming
cohort = 'foundin'
amp_abbr = 'PP'
version = 'amppdv1'
cohort_version = f'{cohort}.{version}'

# directories
wrk_dir = '/home/gibbsr/working/foundin/foundin_qtl'
geno_dir = f'{wrk_dir}/genotypes'

# input files
pfiles = '{genodir}/{cohortversion}.chr{chr}'

# output files
genome_bfile = f'{geno_dir}/{cohort_version}.bfile'

# constant values
autosomes = [str(x) for x in range(1, 23)]
cpu_cnt = len(sched_getaffinity(0))
DEBUG = False

#### utility functions

In [4]:
def run_bash_cmd(this_cmd: str, verbose: bool=False):
    if verbose:
        print(this_cmd)  # echo the command before running it
    !{this_cmd}
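
The `!{this_cmd}` syntax only works inside IPython/Jupyter. If this step ever needs to run as a plain Python script, a minimal sketch using the standard-library `subprocess` module could look like the following. The name `run_cmd` is hypothetical and not part of the notebook, and the sketch assumes the command contains no shell features such as pipes or redirects.

```python
import shlex
import subprocess
import sys


def run_cmd(cmd: str, verbose: bool = False) -> int:
    """Run a command line without a shell and return its exit code."""
    if verbose:
        print(cmd)
    completed = subprocess.run(shlex.split(cmd), capture_output=True, text=True)
    if verbose or completed.returncode != 0:
        # surface whatever the tool printed so failures are not silent
        print(completed.stdout, completed.stderr, sep='\n')
    return completed.returncode


# example with a command that is always available: the running interpreter
exit_code = run_cmd(f'{shlex.quote(sys.executable)} -c "print(1)"')
```

Returning the exit code lets the caller decide whether a failed `plink` or `plink2` call should stop the notebook.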

#### convert from plink2 pfiles to plink bfiles

In [5]:
with concurrent.futures.ProcessPoolExecutor() as ppe:
    for chrom in autosomes:
        this_pfile = pfiles.format(genodir=geno_dir, cohortversion=cohort_version, chr=chrom)
        this_cmd = f'plink2 --pfile {this_pfile} --make-bed --out {this_pfile}.bfile --silent'
        ppe.submit(run_bash_cmd, this_cmd)    
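
Because the futures returned by `ppe.submit` are never inspected, an exception raised in a worker would not surface here, and `--silent` suppresses plink2's console output. One hedged way to confirm the conversion worked is to check that each expected `.bed/.bim/.fam` file exists afterwards. The helper name `missing_bfile_parts` is hypothetical.

```python
from pathlib import Path


def missing_bfile_parts(prefix: str) -> list:
    """Return the .bed/.bim/.fam paths that do not exist for a bfile prefix."""
    return [f'{prefix}.{ext}' for ext in ('bed', 'bim', 'fam')
            if not Path(f'{prefix}.{ext}').exists()]
```

In this notebook it could be called once per chromosome with the `{this_pfile}.bfile` prefix; an empty list means all three files are present.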

In [6]:
# merge the files into a single plink binary set
def frmt_merge_list_file(geno_dir, cohort_version, autosomes):
    merge_file_set = f'{geno_dir}/bfile_merge-list.txt'
    with open(merge_file_set, 'w') as file_handler:
        for chrom in autosomes:
            this_pfile = pfiles.format(genodir=geno_dir, cohortversion=cohort_version, chr=chrom)
            file_handler.write(f'{this_pfile}.bfile\n')
    return merge_file_set

def run_plink_bfile_merge(merge_file_set, genome_bfile):
    this_cmd = f'plink --merge-list {merge_file_set} --make-bed --allow-no-sex \
    --silent --out {genome_bfile} --maf 0.01 --geno 0.05 --hwe 0.000001'
    run_bash_cmd(this_cmd, verbose=DEBUG)

# merge the per chrom bfiles into a genome bfile
merge_file_set = frmt_merge_list_file(geno_dir, cohort_version, autosomes)
run_plink_bfile_merge(merge_file_set, genome_bfile)

# if there was a missnp problem, remove those variants and re-attempt the merge
if exists(f'{genome_bfile}-merge.missnp'):
    print('removing problem variants and retrying merge')
    with concurrent.futures.ProcessPoolExecutor() as ppe:
        for chrom in autosomes:
            this_pfile = pfiles.format(genodir=geno_dir, cohortversion=cohort_version, chr=chrom)
            this_cmd = f'plink2 --pfile {this_pfile} --make-bed --out {this_pfile}.bfile \
--silent --exclude {genome_bfile}-merge.missnp'
            ppe.submit(run_bash_cmd, this_cmd)           

    # try the merge again
    merge_file_set = frmt_merge_list_file(geno_dir, cohort_version, autosomes)
    run_plink_bfile_merge(merge_file_set, genome_bfile)

with matching IDs are all merged together; if this is not what you want (e.g.
you have a bunch of novel variants, all with ID "."), assign distinct IDs to
them (with e.g. --set-missing-var-ids) before rerunning this merge.
to length-80+ variant IDs; consider using a different naming scheme for long
indels and the like.
Error: 6239 variants with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
  /home/gibbsr/working/foundin/foundin_qtl/genotypes/foundin.amppdv1.bfile-merge.missnp.
  alleles probably remain in your data.  If LD between nearby SNPs is high,
  --flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
  that subset of the data to VCF (via e.g. '--recode vcf'), merging with
  another tool/script, and then importing the result; PLINK is not yet suited
  to handling them.
See https://www.cog-genomics.org/plink/1.9/data#merge3 for more discussion.
removing problem variants and retryi
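
The `-merge.missnp` file written by PLINK lists one variant ID per line. As a quick sanity check, the number of IDs about to be excluded can be compared with the count in the error message above. A minimal sketch, with the hypothetical helper name `read_variant_ids`:

```python
def read_variant_ids(path: str) -> list:
    """Read a one-ID-per-line file, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]
```

In this notebook the path would be `f'{genome_bfile}-merge.missnp'`, and `len(read_variant_ids(...))` gives the number of variants dropped before the second merge attempt.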

In [7]:
!ls {genome_bfile}*
!head {genome_bfile}.log
!tail {genome_bfile}.log

/home/gibbsr/working/foundin/foundin_qtl/genotypes/foundin.amppdv1.bfile.bed
/home/gibbsr/working/foundin/foundin_qtl/genotypes/foundin.amppdv1.bfile.bim
/home/gibbsr/working/foundin/foundin_qtl/genotypes/foundin.amppdv1.bfile.fam
/home/gibbsr/working/foundin/foundin_qtl/genotypes/foundin.amppdv1.bfile.log
/home/gibbsr/working/foundin/foundin_qtl/genotypes/foundin.amppdv1.bfile-merge.missnp
PLINK v1.90b6.21 64-bit (19 Oct 2020)
Options in effect:
  --allow-no-sex
  --geno 0.05
  --hwe 0.000001
  --maf 0.01
  --make-bed
  --merge-list /home/gibbsr/working/foundin/foundin_qtl/genotypes/bfile_merge-list.txt
  --out /home/gibbsr/working/foundin/foundin_qtl/genotypes/foundin.amppdv1.bfile
  --silent
(--maf/--max-maf/--mac/--max-mac).
8697174 variants and 119 people pass filters and QC.
Note: No phenotypes present.
--make-bed to
/home/gibbsr/working/foundin/foundin_qtl/genotypes/foundin.amppdv1.bfile.bed +
/home/gibbsr/working/foundin/foundin_qtl/genotypes/foundin.amppdv1.bfile.bim +
/home/g

#### IDs used in the analysis are prefixed 'PPMI', so replace the AMP-PD 'PP-' prefix

In [8]:
# read fam file and replace IDs
fam_df = read_csv(f'{genome_bfile}.fam', sep=r'\s+', header=None)
print(fam_df.shape)
if DEBUG:
    display(fam_df.head())
# do the replace
fam_df[0] = fam_df[1] = fam_df[0].str.replace('PP-', 'PPMI', regex=False)
print(fam_df.shape)
if DEBUG:
    display(fam_df.head())
# write corrected file
fam_df.to_csv(f'{genome_bfile}.fam', header=False, index=False, sep=' ')

(119, 6)
(119, 6)
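
To make the effect of the ID replacement concrete, here is a dependency-free sketch of the same rewrite applied to `.fam` lines held in memory. Like the cell above, it sets both the family ID and the individual ID to the renamed first column. The function name `rename_fam_ids` and the sample ID are made up for illustration.

```python
def rename_fam_ids(lines, old='PP-', new='PPMI'):
    """Rewrite FID and IID of .fam lines, setting both to the renamed FID."""
    renamed = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        new_id = fields[0].replace(old, new)
        fields[0] = fields[1] = new_id
        renamed.append(' '.join(fields))
    return renamed


example = ['PP-1234 PP-1234 0 0 0 -9']  # made-up sample ID
print(rename_fam_ids(example))  # ['PPMI1234 PPMI1234 0 0 0 -9']
```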


In [9]:
!date

Wed Dec 14 16:20:27 EST 2022
