# Genetic Data QC
This notebook documents the genetic data QC for the population-specific SMMAT analyses, using the updated AD phenotype data (https://github.com/gaow/alzheimers-family/blob/master/notebook/20221121_AD_pheno_update.ipynb).

Major updates to the phenotype data:

* Most missing age values have been filled in.
* Missing APOE4 information was updated from the sequence data.
* Controls under 60 years of age were excluded.
* For the European samples (n = 15), age values coded as, for example, 999 or 8027 were replaced with the correct age.
* Unaffected singletons were removed.
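
As a rough illustration of the age rules listed above, the sketch below flags implausible age codes for review and keeps only cases plus controls aged 60 or older in a toy table. The column layout (`IID`, `AD`, `AGE`), the 0/1 status coding, and the 110-year plausibility cap are assumptions made for this example; they are not taken from the actual phenotype files.

```shell
# Toy phenotype table; the columns and the 0/1 coding are hypothetical.
pheno=$(mktemp)
printf 'IID\tAD\tAGE\n' > "$pheno"
printf 'S1\t1\t72\nS2\t0\t58\nS3\t0\t65\nS4\t1\t999\nS5\t0\t8027\n' >> "$pheno"

# Flag ages outside a plausible range (the cap of 110 is an assumption).
flagged=$(awk -F'\t' 'NR > 1 && ($3 < 0 || $3 > 110) {print $1}' "$pheno" | paste -sd, -)

# Keep cases of any plausible age and controls aged 60 or older.
kept=$(awk -F'\t' 'NR > 1 && $3 >= 0 && $3 <= 110 && ($2 == 1 || $3 >= 60) {print $1}' "$pheno" | paste -sd, -)

echo "flagged for review: $flagged"
echo "kept: $kept"
rm -f "$pheno"
```

Flagged samples are only listed here; in the update described above their ages were corrected rather than dropped.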

Phenotype data:
 > /mnt/mfs/statgen/alzheimers-family/pheno/pheno_updated_20221121/
 
Genotype data: the jointly called WGS data for EFIGA and NIALOAD are available here:
 > /mnt/mfs/statgen/alzheimers-family/normalized_bed/normalized_merged_autosome.*  


# Split the genetic data per population and do QC

In [2]:
# Split the genotype file by population.
ml Singularity
for i in African European Hispanic; do
sos run ~/project2022/notebook/AD/xqtl-pipeline/pipeline/GWAS_QC.ipynb qc:1 \
  --cwd /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno \
  --genoFile /mnt/mfs/statgen/alzheimers-family/normalized_bed/normalized_merged_autosome.bed \
  --maf_filter 0.0 \
  --keep_samples /mnt/mfs/statgen/alzheimers-family/pheno/pheno_updated_20221121/$i.id \
  --name $i \
  --container /mnt/vast/hpc/csg/containers/lmm.sif
done

INFO: Running basic QC filters: Filter SNPs and select individuals
INFO: basic QC filters is completed.
INFO: basic QC filters output:   /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno/normalized_merged_autosome.African.filtered.bed
INFO: Workflow qc (ID=wcbedcdbe1e821720) is executed successfully with 1 completed step.
INFO: Running basic QC filters: Filter SNPs and select individuals
INFO: basic QC filters is completed.
INFO: basic QC filters output:   /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno/normalized_merged_autosome.European.filtered.bed
INFO: Workflow qc (ID=wad4e79bae7bb194f) is executed successfully with 1 completed step.
INFO: Running basic QC filters: Filter SNPs and select individuals
INFO: basic QC filters is completed.
INFO: basic QC filters output:   /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno/normalized_m
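
After a split like the one above, it can be worth confirming that every sample in the keep list actually made it into the output `.fam` file. The following is a minimal sketch on toy files; it assumes a two-column FID/IID keep list, and the file contents are made up for illustration.

```shell
# Toy stand-ins for a keep list and the resulting .fam file.
# A two-column FID/IID keep list is assumed here.
keep=$(mktemp); fam=$(mktemp)
printf 'F1 S1\nF1 S2\nF2 S3\n' > "$keep"
printf 'F1 S1 0 0 1 2\nF2 S3 0 0 2 1\n' > "$fam"

# First pass records FID/IID pairs from the .fam file;
# second pass prints keep-list samples that were not seen.
missing=$(awk 'NR == FNR {seen[$1 FS $2]; next} !(($1 FS $2) in seen) {print $2}' "$fam" "$keep" | paste -sd, -)

echo "requested: $(wc -l < "$keep"), present: $(wc -l < "$fam"), missing: ${missing:-none}"
rm -f "$keep" "$fam"
```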

In [3]:
for i in African European Hispanic; do
sos run ~/project2022/notebook/AD/xqtl-pipeline/pipeline/GWAS_QC.ipynb king \
  --cwd /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno/King \
  --container /mnt/vast/hpc/csg/containers/lmm.sif \
  --genoFile /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno/normalized_merged_autosome.$i.filtered.bed \
  --maf_filter 0.0 
done

INFO: Running king_1: Inference of relationships in the sample to identify closely related individuals
INFO: king_1 is completed.
INFO: king_1 output:   /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno/King/normalized_merged_autosome.African.filtered.kin0
INFO: Running king_2: Select a list of unrelated individual with an attempt to maximize the unrelated individuals selected from the data
INFO: king_2 is completed.
INFO: king_2 output:   /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno/King/normalized_merged_autosome.African.filtered.related_id
INFO: Running king_3: Split genotype data into related and unrelated samples, if related individuals are detected
INFO: king_3 is completed.
INFO: king_3 output:   /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno/King/normalized_merged_autosome.African.filtered.unrelated.bed /mnt/mfs/statgen/
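
To get a quick feel for how many related pairs the `.kin0` table contains, one option is to count pairs at or above a kinship cutoff. The sketch below runs on a toy table with made-up values. The cutoff of 0.0884 is the lower bound KING documents for second-degree relationships and is used here only as an illustration; it is not necessarily the threshold the pipeline applies. The kinship column is located by header name because the exact header layout of the real file is an assumption.

```shell
# Toy .kin0 table; the values are invented for illustration.
kin0=$(mktemp)
printf 'FID1\tID1\tFID2\tID2\tN_SNP\tHetHet\tIBS0\tKinship\n' > "$kin0"
printf 'F1\tS1\tF1\tS2\t1000\t0.12\t0.001\t0.2500\n' >> "$kin0"
printf 'F1\tS1\tF2\tS3\t1000\t0.10\t0.020\t0.0100\n' >> "$kin0"
printf 'F2\tS3\tF2\tS4\t1000\t0.11\t0.010\t0.1000\n' >> "$kin0"

threshold=0.0884   # illustrative cutoff, not necessarily the pipeline's setting

# Find the kinship column from the header, then count pairs at or above the cutoff.
n_related=$(awk -v t="$threshold" '
  NR == 1 { for (i = 1; i <= NF; i++) if (toupper($i) == "KINSHIP") col = i; next }
  col && $col + 0 >= t + 0 { n++ }
  END { print n + 0 }' "$kin0")

echo "pairs with kinship >= $threshold: $n_related"
rm -f "$kin0"
```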

# Generate QCed genoFile without LD pruning to use in the GMMAT analysis


In [4]:
for i in African European Hispanic; do
# unrelated individuals
sos run ~/project2022/notebook/AD/xqtl-pipeline/pipeline/GWAS_QC.ipynb qc_no_prune \
    --cwd /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno_qced/ \
    --genoFile /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno/normalized_merged_autosome.$i.filtered.bed \
    --remove_samples /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno/King/normalized_merged_autosome.$i.filtered.related_id \
    --maf_filter 0.0 \
    --geno_filter 0.1 \
    --mind_filter 0.1 \
    --hwe_filter 5e-08 \
    --name unrelated \
    --container /mnt/mfs/statgen/containers/lmm.sif
# related individuals same set of variants
sos run ~/project2022/notebook/AD/xqtl-pipeline/pipeline/GWAS_QC.ipynb qc:1 \
    --cwd /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno_qced/ \
    --genoFile /mnt/mfs/statgen/alzheimers-family/AD_common_variants/PCA/normalized_merged_autosome.$i.filtered.bed \
    --keep_samples /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno/King/normalized_merged_autosome.$i.filtered.related_id \
    --keep_variants /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno_qced/normalized_merged_autosome.$i.filtered.unrelated.filtered.bim \
    --maf_filter 0.0 \
    --geno_filter 0.1 \
    --mind_filter 0.1 \
    --hwe_filter 0 \
    --name related \
    --container /mnt/mfs/statgen/containers/lmm.sif 
done

INFO: Running qc_no_prune: Filter SNPs and select individuals
INFO: qc_no_prune is completed.
INFO: qc_no_prune output:   /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno_qced/normalized_merged_autosome.African.filtered.unrelated.filtered.bed
INFO: Workflow qc_no_prune (ID=w28688c7066998731) is executed successfully with 1 completed step.
INFO: Running basic QC filters: Filter SNPs and select individuals
INFO: basic QC filters is completed.
INFO: basic QC filters output:   /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno_qced/normalized_merged_autosome.African.filtered.related.filtered.extracted.bed
INFO: Workflow qc (ID=w7af314c82261ff6d) is executed successfully with 1 completed step.
INFO: Running qc_no_prune: Filter SNPs and select individuals
INFO: qc_no_prune is completed.
INFO: qc_no_prune output:   /mnt/mfs/statgen/alzheimers-family/
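
Because the related samples are meant to carry the same variant set as the unrelated samples, comparing the two `.bim` files before merging is a cheap sanity check. The sketch below does this on toy `.bim` files with made-up variant IDs, where one variant is deliberately missing from the second file to show what a mismatch looks like.

```shell
# Toy .bim files (chr, variant ID, cM, bp, A1, A2); contents are made up.
bim_unrel=$(mktemp); bim_rel=$(mktemp)
printf '1\tchr1:100:A:G\t0\t100\tG\tA\n1\tchr1:200:C:T\t0\t200\tT\tC\n' > "$bim_unrel"
printf '1\tchr1:100:A:G\t0\t100\tG\tA\n' > "$bim_rel"

# Compare sorted variant IDs (column 2); comm -3 prints IDs found in only one file.
ids_unrel=$(mktemp); ids_rel=$(mktemp)
cut -f2 "$bim_unrel" | sort > "$ids_unrel"
cut -f2 "$bim_rel" | sort > "$ids_rel"
n_diff=$(comm -3 "$ids_unrel" "$ids_rel" | wc -l | tr -d ' ')

echo "variants present in only one file: $n_diff"
rm -f "$bim_unrel" "$bim_rel" "$ids_unrel" "$ids_rel"
```

A count of zero would indicate that both filesets list the same variant IDs.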

In [None]:
# merge two data-sets
bash: container= '/mnt/mfs/statgen/containers/lmm.sif'
for i in African European Hispanic; do
    plink --bfile /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno_qced/normalized_merged_autosome.$i.filtered.related.filtered.extracted \
         --bmerge /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno_qced/normalized_merged_autosome.$i.filtered.unrelated.filtered.bed \
                  /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno_qced/normalized_merged_autosome.$i.filtered.unrelated.filtered.bim \
                  /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno_qced/normalized_merged_autosome.$i.filtered.unrelated.filtered.fam \
        --make-bed --keep-allele-order --memory 800000 --out /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno_qced/geno_qced.$i
done
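
Once the merge has run, a simple check is that the merged `.fam` file holds as many samples as the related and unrelated inputs combined, after accounting for any sample that appears in both. The sketch below uses toy `.fam` files in place of the real outputs; the merged file is simulated with `cat`, so it only illustrates the bookkeeping.

```shell
# Toy .fam files standing in for the related, unrelated, and merged outputs.
rel=$(mktemp); unrel=$(mktemp); merged=$(mktemp)
printf 'F1 S1 0 0 1 2\nF1 S2 S1 0 2 1\n' > "$rel"
printf 'F2 S3 0 0 1 1\nF3 S4 0 0 2 2\n' > "$unrel"
cat "$rel" "$unrel" > "$merged"   # stand-in for the real merged .fam

# Count FID/IID pairs that occur in both inputs; these would appear once after merging.
n_overlap=$(cat "$rel" "$unrel" | awk '{print $1, $2}' | sort | uniq -d | wc -l | tr -d ' ')

n_expected=$(( $(wc -l < "$rel") + $(wc -l < "$unrel") - n_overlap ))
n_merged=$(wc -l < "$merged" | tr -d ' ')

if [ "$n_merged" -eq "$n_expected" ]; then status=ok; else status=mismatch; fi
echo "merged samples: $n_merged (expected $n_expected): $status"
rm -f "$rel" "$unrel" "$merged"
```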

In [None]:
plink --bfile /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno_qced/normalized_merged_autosome.African.filtered.related.filtered.extracted \
         --bmerge /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno_qced/normalized_merged_autosome.African.filtered.unrelated.filtered.bed \
                  /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno_qced/normalized_merged_autosome.African.filtered.unrelated.filtered.bim \
                  /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno_qced/normalized_merged_autosome.African.filtered.unrelated.filtered.fam \
        --make-bed --keep-allele-order --memory 80000 --out /mnt/mfs/statgen/alzheimers-family/AD_rare_variants/geno_qced/African_rare