# Performing Group-Specific GWASes (Risk)
- **Project:** GP2 AFR-AAC meta-GWAS 
- **Version:** Python/3.9
- **Last Updated:** 21-FEB-2023
    - **Update Description:** Running GWASes (release 4, no indels, age, sex, PCs 1-10 as covariates)

## Notebook Overview
- Running GWASes per group (no indels, age, sex, PCs 1-10 as covariates)

### CHANGELOG
- 15-FEB-2023: Notebook started
- 21-FEB-2023: Running GWASes (no indels, age, sex, PCs 1-10 as covariates)

---
# Data Overview 

| ANCESTRY | DATASET         | CASES | CONTROLS | TOTAL   | ARRAY                     | NOTES                   |
|:--------:|:---------------:|:-----:|:--------:|:-------:|:-------------------------:|:-----------------------:|
| AFR      | IPDGC – Nigeria | 304   | 285      | 589     | NeuroChip                 | .                       |
| AFR      | GP2             | 711   | 1,011    | 1,722   | NeuroBooster              | .                       |
| AAC      | GP2             | 185   | 1,149    | 1,334   | NeuroBooster              | .                       |
| AAC      | 23andMe         | 288   | 193,985  | 194,273 | Omni Express & GSA & 550k | Summary statistics only |
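
As a quick sanity check on the table above, the sketch below confirms that each reported total equals cases plus controls and prints the case fraction per dataset. The counts are copied from the table; the dictionary layout and the `summarize` helper are illustrative names, not part of the original notebook.

```python
# Counts copied from the Data Overview table: (cases, controls, reported total)
datasets = {
    "AFR / IPDGC - Nigeria": (304, 285, 589),
    "AFR / GP2": (711, 1011, 1722),
    "AAC / GP2": (185, 1149, 1334),
    "AAC / 23andMe": (288, 193985, 194273),
}


def summarize(counts):
    """Return one record per dataset with a total check and the case fraction."""
    rows = []
    for label, (cases, controls, total) in counts.items():
        rows.append({
            "dataset": label,
            "total_matches": cases + controls == total,
            "case_fraction": cases / total,
        })
    return rows


summary = summarize(datasets)
for row in summary:
    print(f"{row['dataset']:<24} total OK: {row['total_matches']}  "
          f"case fraction: {row['case_fraction']:.4f}")
```

The case fraction makes the case/control imbalance of each dataset explicit, which is worth keeping in mind when comparing results across groups.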

# Getting Started

## Importing packages

In [3]:
## Import the necessary packages 
import os
import numpy as np
import pandas as pd
import math
import numbers
import sys
import subprocess
import statsmodels.api as sm
import scipy
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

## Print out package versions
## Getting packages loaded into this notebook and their versions to allow for reproducibility
    # Repurposed code from stackoverflow here: https://stackoverflow.com/questions/40428931/package-for-listing-version-of-packages-used-in-a-jupyter-notebook

## Import packages 
import pkg_resources
import types
from datetime import date
today = date.today()
date = today.strftime("%d-%b-%Y").upper()

## Define function 
# Some packages are weird and have different imported names vs. system/pip names
# Unfortunately, there is no systematic way to get pip names from a package's imported name. You'll have to add exceptions to this list manually!
poorly_named_packages = {
    "PIL": "Pillow",
    "sklearn": "scikit-learn"
}

def get_imports():
    # Copy the items first so the loop is unaffected if globals() changes while iterating
    for name, val in list(globals().items()):
        if isinstance(val, types.ModuleType):
            # Split ensures you get root package, not just imported function
            name = val.__name__.split(".")[0]

        elif isinstance(val, type):
            name = val.__module__.split(".")[0]

        else:
            # Plain variables and functions do not identify a package, so skip them
            continue

        yield poorly_named_packages.get(name, name)

## Get a list of packages imported 
imports = list(set(get_imports()))

# The only way I found to get the version of the root package from only the name of the package is to cross-check the names of installed packages vs. imported packages
requirements = []
for m in pkg_resources.working_set:
    if m.project_name in imports and m.project_name!="pip":
        requirements.append((m.project_name, m.version))

## Print out packages and versions 
print(f"PACKAGE VERSIONS ({date})")
for r in requirements:
    print("\t{}=={}".format(*r))

PACKAGE VERSIONS (21-FEB-2023)
	matplotlib==3.5.2
	numpy==1.22.4
	scipy==1.8.1
	pandas==1.4.3
	statsmodels==0.13.2
	seaborn==0.11.2
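
`pkg_resources` has been deprecated in recent setuptools releases, so a standard-library alternative may be useful for future reruns. The sketch below uses `importlib.metadata` (available from Python 3.8) to look up versions for an explicit list of distribution names. The helper name `package_versions` and the hard-coded package list are assumptions for illustration; the list mirrors the packages printed above.

```python
from importlib import metadata


def package_versions(names):
    """Map each distribution name to its installed version, or None if it is not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution (pip) names, taken from the package list printed above
packages = ["matplotlib", "numpy", "scipy", "pandas", "statsmodels", "seaborn"]

for name, version in package_versions(packages).items():
    print(f"\t{name}=={version}" if version else f"\t{name} (not installed)")
```

Unlike the cross-check against `globals()`, this approach needs the package names to be listed by hand, but it does not depend on what happens to be defined in the notebook session.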


# Run GWASes

- [x] IPDGC – Nigeria – AFR - NC
- [x] GP2 release 5 – AFR (with Nigerian NB)
- [x] GP2 release 5 – AAC

## IPDGC – Nigeria – AFR - NC

In [13]:
%%bash

plink2 \
--bfile ${NG_AFR_NEUROCHIP} \
--maf 0.05 \
--logistic \
--ci 0.95 \
--snps-only \
--covar ${WORK_DIR}/data/AFR/NIGERIAN-NC/NIGERIAN-NEUROCHIP-AFR-covariate-wAGE-FEB2023.txt \
--covar-name AGE_ANALYSIS,SEX,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
--covar-variance-standardize \
--out ${WORK_DIR}/data/AFR/NIGERIAN-NC/NIGERIAN-NEUROCHIP-AFR-GWAS-MAF005-FEB2023
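
The same flags are repeated for every group in this notebook, so the command could be assembled in one place. The sketch below only builds the argument list and does not run plink2. The function name `build_gwas_command`, its parameters, and the placeholder paths are hypothetical; the flags themselves are the ones used in the cells of this notebook.

```python
import shlex


def build_gwas_command(bfile, covar_file, out_prefix, n_pcs=10,
                       include_age=True, require_pheno=False):
    """Assemble a plink2 argument list with the flags used in this notebook (nothing is executed)."""
    covariates = (["AGE_ANALYSIS"] if include_age else []) + ["SEX"]
    covariates += [f"PC{i}" for i in range(1, n_pcs + 1)]
    cmd = [
        "plink2",
        "--bfile", bfile,
        "--maf", "0.05",
        "--logistic",
        "--ci", "0.95",
        "--snps-only",
        "--covar", covar_file,
        "--covar-name", ",".join(covariates),
        "--covar-variance-standardize",
        "--out", out_prefix,
    ]
    if require_pheno:
        cmd += ["--require-pheno", "PHENO1"]
    return cmd


# Placeholder paths for illustration only
cmd = build_gwas_command("path/to/bfile_prefix", "path/to/covariates.txt", "path/to/output_prefix")
print(shlex.join(cmd))
```

If this pattern were adopted, the list could be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues and keeps the covariate list identical across groups.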

In [19]:
%%bash 
cd ${WORK_DIR}/data/AFR/NIGERIAN-NC

# Keep the header plus the additive-model (ADD) rows only; the per-covariate rows are dropped
head -1 NIGERIAN-NEUROCHIP-AFR-GWAS-MAF005-FEB2023.PHENO1.glm.logistic.hybrid > NIGERIAN-NEUROCHIP-AFR-GWAS-MAF005-FEB2023-SUMMARYSTATS.txt
grep "ADD" NIGERIAN-NEUROCHIP-AFR-GWAS-MAF005-FEB2023.PHENO1.glm.logistic.hybrid >> NIGERIAN-NEUROCHIP-AFR-GWAS-MAF005-FEB2023-SUMMARYSTATS.txt

# Keep chromosome, position, variant ID and the P-value (column 15 in this output)
awk '{print $1, $2, $3, $15}' NIGERIAN-NEUROCHIP-AFR-GWAS-MAF005-FEB2023-SUMMARYSTATS.txt > NIGERIAN-NEUROCHIP-AFR-GWAS-MAF005-FEB2023-SUMMARYSTATS-FILTERED.txt
# Genome-wide significance threshold: P <= 5E-8
awk '$4 <= 0.00000005' NIGERIAN-NEUROCHIP-AFR-GWAS-MAF005-FEB2023-SUMMARYSTATS-FILTERED.txt > NIGERIAN-NEUROCHIP-AFR-GWAS-MAF005-FEB2023-HITS.txt
head NIGERIAN-NEUROCHIP-AFR-GWAS-MAF005-FEB2023-HITS.txt

## no hits 

<div class="alert alert-block alert-info">
<b>Results for IPDGC - Nigerian - NeuroChip:</b> 
<ul>
    <li>No genome-wide significant hits</li>
</ul>
</div>
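
The shell filtering above selects the P-value by position (`$15`), which depends on the exact column layout of the plink2 output. A minimal sketch of the same filter that looks columns up by header name instead is shown below. The column names `#CHROM`, `POS`, `ID`, `TEST` and `P` are assumed to match the header of the `.glm.logistic.hybrid` file and should be checked against it; the function name and the toy rows are made up for illustration and are not real results.

```python
import io

GENOME_WIDE_P = 5e-8  # same threshold as the awk filter


def genome_wide_hits(handle, p_threshold=GENOME_WIDE_P):
    """Return (chrom, pos, id, p) for additive-model rows with P <= p_threshold."""
    header = handle.readline().split()
    col = {name: i for i, name in enumerate(header)}
    hits = []
    for line in handle:
        fields = line.split()
        if not fields or fields[col["TEST"]] != "ADD":
            continue  # skip blank lines and per-covariate rows
        try:
            p = float(fields[col["P"]])
        except ValueError:
            continue  # non-numeric P-values such as "NA"
        if p <= p_threshold:
            hits.append((fields[col["#CHROM"]], int(fields[col["POS"]]), fields[col["ID"]], p))
    return hits


# Toy input with a reduced, hypothetical set of columns
toy = """#CHROM POS ID TEST P
1 1000 chr1:1000:A:G ADD 0.12
1 1000 chr1:1000:A:G SEX 1e-12
1 2000 chr1:2000:C:T ADD 1e-9
1 3000 chr1:3000:G:A ADD NA
"""
hits = genome_wide_hits(io.StringIO(toy))
print(hits)
```

Matching on the `TEST` column exactly also avoids keeping rows where the string "ADD" happens to appear elsewhere in the line, which a plain `grep "ADD"` would retain.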

## GP2 release 5 – AFR (includes Nigerian NB)

### Notes
- The study covariate is not included because some studies contribute only one phenotype (cases only or controls only)
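
A study that contributes only cases or only controls cannot have its own effect estimated in a logistic model, because the study indicator perfectly predicts the phenotype for those samples. The sketch below shows one way such studies could be identified before choosing covariates. The study labels, the phenotype coding (1 = control, 2 = case, following the PLINK convention) and the function name are assumptions for illustration and do not come from the covariate file.

```python
from collections import defaultdict


def single_phenotype_studies(samples):
    """samples: iterable of (study, phenotype) pairs. Returns studies with only one phenotype."""
    seen = defaultdict(set)
    for study, phenotype in samples:
        seen[study].add(phenotype)
    return sorted(study for study, phenotypes in seen.items() if len(phenotypes) == 1)


# Hypothetical samples: STUDY_A has cases and controls, STUDY_B has cases only
toy_samples = [
    ("STUDY_A", 1), ("STUDY_A", 2), ("STUDY_A", 2),
    ("STUDY_B", 2), ("STUDY_B", 2),
]
flagged = single_phenotype_studies(toy_samples)
print(flagged)
```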

In [15]:
%%bash

plink2 \
--bfile ${UPDATED_GP2_v5_AFR} \
--maf 0.05 \
--logistic \
--ci 0.95 \
--covar ${WORK_DIR}/data/masterfile_updated_GP2_v5_covariateFile_wAGE_FEB2023.txt \
--covar-name AGE_ANALYSIS,SEX,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
--snps-only \
--require-pheno PHENO1 \
--covar-variance-standardize \
--out ${WORK_DIR}/data/AFR/GP2-v5-AFR-wNIGERIAN-NB/GP2-v5-AFR-wNIGERIAN-NB-GWAS-MAF005-FEB2023

In [20]:
%%bash 
cd ${WORK_DIR}/data/AFR/GP2-v5-AFR-wNIGERIAN-NB/

head -1 GP2-v5-AFR-wNIGERIAN-NB-GWAS-MAF005-FEB2023.PHENO1.glm.logistic.hybrid > GP2-v5-AFR-wNIGERIAN-NB-GWAS-MAF005-FEB2023-SUMMARYSTATS.txt
grep "ADD" GP2-v5-AFR-wNIGERIAN-NB-GWAS-MAF005-FEB2023.PHENO1.glm.logistic.hybrid >> GP2-v5-AFR-wNIGERIAN-NB-GWAS-MAF005-FEB2023-SUMMARYSTATS.txt

awk '{print $1, $2, $3, $15}' GP2-v5-AFR-wNIGERIAN-NB-GWAS-MAF005-FEB2023-SUMMARYSTATS.txt > GP2-v5-AFR-wNIGERIAN-NB-GWAS-MAF005-FEB2023-SUMMARYSTATS-FILTERED.txt
awk '$4 <= 0.00000005' GP2-v5-AFR-wNIGERIAN-NB-GWAS-MAF005-FEB2023-SUMMARYSTATS-FILTERED.txt > GP2-v5-AFR-wNIGERIAN-NB-GWAS-MAF005-FEB2023-HITS.txt # P <= 5E-8
cat GP2-v5-AFR-wNIGERIAN-NB-GWAS-MAF005-FEB2023-HITS.txt

1 155235878 chr1:155235878:G:T 1.00617e-08


<div class="alert alert-block alert-info">
<b>Results for GP2 AFR - Release v5:</b> 
<ul>
    <li>1 genome-wide significant hit</li>
    <li>chr1:155235878:G:T (hg38; P=1.00617E-08)</li>
</ul>
</div>
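
For downstream lookups it can help to split the variant ID into its parts and to express the P-value on the -log10 scale used in Manhattan plots. The sketch below does both for the hit reported above. It assumes the ID follows a `chromosome:position:allele:allele` pattern; the field names `allele_1` and `allele_2` are deliberately neutral because the ID alone does not say which allele is the effect allele.

```python
import math


def parse_variant_id(variant_id):
    """Split an ID of the assumed form chrom:pos:allele:allele into named fields."""
    chrom, pos, allele_1, allele_2 = variant_id.split(":")
    return {"chrom": chrom, "pos": int(pos), "allele_1": allele_1, "allele_2": allele_2}


hit = parse_variant_id("chr1:155235878:G:T")
p_value = 1.00617e-08
threshold = 5e-8

print(hit)
print(f"-log10(P) = {-math.log10(p_value):.3f}, threshold = {-math.log10(threshold):.3f}")
```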

---
## GP2 release 5 – AAC 

In [26]:
%%bash

plink2 \
--bfile ${UPDATED_GP2_v5_AAC} \
--maf 0.05 \
--logistic \
--ci 0.95 \
--covar ${WORK_DIR}/data/masterfile_updated_GP2_v5_covariateFile_wAGE_FEB2023.txt \
--covar-name AGE_ANALYSIS,SEX,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
--snps-only \
--require-pheno PHENO1 \
--covar-variance-standardize \
--out ${WORK_DIR}/data/AAC/GP2-v5-AAC/GP2-v5-AAC-GWAS-MAF005-FEB2023

In [29]:
%%bash 
cd ${WORK_DIR}/data/AAC/GP2-v5-AAC/

head -1 GP2-v5-AAC-GWAS-MAF005-FEB2023.PHENO1.glm.logistic.hybrid > GP2-v5-AAC-GWAS-MAF005-FEB2023-SUMMARYSTATS.txt
grep "ADD" GP2-v5-AAC-GWAS-MAF005-FEB2023.PHENO1.glm.logistic.hybrid >> GP2-v5-AAC-GWAS-MAF005-FEB2023-SUMMARYSTATS.txt

awk '{print $1, $2, $3, $15}' GP2-v5-AAC-GWAS-MAF005-FEB2023-SUMMARYSTATS.txt > GP2-v5-AAC-GWAS-MAF005-FEB2023-SUMMARYSTATS-FILTERED.txt
awk '$4 <= 0.00000005' GP2-v5-AAC-GWAS-MAF005-FEB2023-SUMMARYSTATS-FILTERED.txt > GP2-v5-AAC-GWAS-MAF005-FEB2023-HITS.txt # P <= 5E-8
head GP2-v5-AAC-GWAS-MAF005-FEB2023-HITS.txt

<div class="alert alert-block alert-info">
<b>Results for GP2 AAC - Release v5:</b> 
<ul>
    <li>No genome-wide significant hits</li>
</ul>
</div>

## 23andMe Summary Statistics 

In [14]:
%%bash

cd ${WORK_DIR}/data/23andMe

head -1 AAC_23andMe_MAF0.05.hg38.noindels.newMarkerIDs.tab
awk '$7 <= 0.00000005' AAC_23andMe_MAF0.05.hg38.noindels.newMarkerIDs.tab # P <= 5E-8

markerID	effect_allele	alt_allele	effect	stderr	N	pvalue


<div class="alert alert-block alert-info">
<b>Results for 23andMe AAC Summary Stats:</b> 
<ul>
    <li>No genome-wide significant hits</li>
</ul>
</div>
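
The 23andMe file is tab-delimited with a named `pvalue` column, so the same check can be written against the column name rather than its position. The sketch below uses the header shown in the cell output above; the function name and the two data rows are invented for illustration and are not taken from the summary statistics.

```python
import csv
import io


def significant_rows(handle, p_column="pvalue", p_threshold=5e-8):
    """Return rows (as dicts) from a tab-delimited file whose P-value is at or below the threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = []
    for row in reader:
        try:
            p = float(row[p_column])
        except (TypeError, ValueError):
            continue  # missing or non-numeric P-values
        if p <= p_threshold:
            kept.append(row)
    return kept


# Header copied from the file; the data rows are toy values
rows = [
    ["markerID", "effect_allele", "alt_allele", "effect", "stderr", "N", "pvalue"],
    ["chr1:1000:A:G", "A", "G", "0.05", "0.04", "1000", "0.21"],
    ["chr2:2000:C:T", "C", "T", "0.90", "0.10", "1000", "2e-19"],
]
toy = "\n".join("\t".join(r) for r in rows) + "\n"
toy_hits = significant_rows(io.StringIO(toy))
print([h["markerID"] for h in toy_hits])
```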

--- 
# chr1 EUR - GP2 Release 4

In [30]:
%%bash

plink2 \
--bfile ${WORK_DIR}/data/other/GP2_v5_EUR_chr1/chr1_EUR_release4 \
--maf 0.05 \
--logistic \
--ci 0.95 \
--covar ${WORK_DIR}/data/other/GP2_v5_EUR_chr1/GP2_v5_EUR_updated_covariateFile_FEB2023.txt \
--covar-name SEX,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
--snps-only \
--out ${WORK_DIR}/data/other/GP2_v5_EUR_chr1/GP2_v5_EUR_chr1_FEB2023

In [32]:
%%bash 
cd ${WORK_DIR}/data/other/GP2_v5_EUR_chr1

head -1 GP2_v5_EUR_chr1_FEB2023.PHENO1.glm.logistic.hybrid > GP2_v5_EUR_chr1_FEB2023-SUMMARYSTATS.txt
grep "ADD" GP2_v5_EUR_chr1_FEB2023.PHENO1.glm.logistic.hybrid >> GP2_v5_EUR_chr1_FEB2023-SUMMARYSTATS.txt

awk '{print $1, $2, $3, $15}' GP2_v5_EUR_chr1_FEB2023-SUMMARYSTATS.txt > GP2_v5_EUR_chr1_FEB2023-SUMMARYSTATS-FILTERED.txt
awk '$4 <= 0.00000005' GP2_v5_EUR_chr1_FEB2023-SUMMARYSTATS-FILTERED.txt > GP2_v5_EUR_chr1_FEB2023-HITS.txt # P <= 5E-8
# no hits

In [None]:
%%bash 
cd ${WORK_DIR}/data/other/GP2_v5_EUR_chr1

## Grep out the AFR-AAC SNPs
echo "After imputation, before GWAS"
grep -E -f ${WORK_DIR}/data/AFR-AAC-META/genomewide-hits.txt ${WORK_DIR}/data/other/GP2_v5_EUR_chr1/chr1_EUR_release4.bim | wc -l
echo ""

echo "After MAF>5% GWAS"
head -1 GP2_v5_EUR_chr1_FEB2023-SUMMARYSTATS.txt > ${WORK_DIR}/data/other/GP2_v5_EUR_chr1/GP2_v5_EUR_chr1_FEB2023-extractAACAFRhits.txt
grep -E -f ${WORK_DIR}/data/AFR-AAC-META/genomewide-hits.txt GP2_v5_EUR_chr1_FEB2023-SUMMARYSTATS.txt >> ${WORK_DIR}/data/other/GP2_v5_EUR_chr1/GP2_v5_EUR_chr1_FEB2023-extractAACAFRhits.txt
# Note: the line count below includes the header row written by head -1 above
wc -l ${WORK_DIR}/data/other/GP2_v5_EUR_chr1/GP2_v5_EUR_chr1_FEB2023-extractAACAFRhits.txt

# After imputation, before GWAS
# 17

# After MAF>5% GWAS
# 16 
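
`grep -E -f` treats each line of the hits file as a regular expression and matches it anywhere in the line, so a count based on exact variant IDs can serve as a cross-check. The sketch below compares a list of IDs against the second column of a `.bim` file, which holds the variant ID in the PLINK format. It assumes the hits file contains one variant ID per line; the function name and toy data are illustrative only.

```python
def ids_present_in_bim(hit_ids, bim_lines):
    """Return the sorted hit IDs that appear exactly in column 2 of the .bim lines."""
    wanted = {h.strip() for h in hit_ids if h.strip()}
    # .bim columns: chromosome, variant ID, cM position, bp position, allele 1, allele 2
    present = {line.split()[1] for line in bim_lines if line.strip()}
    return sorted(wanted & present)


# Toy data: two of the three requested IDs are present
toy_hit_ids = ["chr1:1000:A:G", "chr1:2000:C:T", "chr1:3000:G:A"]
toy_bim = [
    "1\tchr1:1000:A:G\t0\t1000\tG\tA",
    "1\tchr1:2000:C:T\t0\t2000\tT\tC",
    "1\tchr1:2000:C:TA\t0\t2000\tTA\tC",
]
found = ids_present_in_bim(toy_hit_ids, toy_bim)
print(len(found), found)
```

In the toy data, the ID `chr1:2000:C:TA` would be matched by a substring search for `chr1:2000:C:T` but is not counted here, which is the kind of difference an exact-match check is meant to expose.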