# Zygosity per Group
- **Project:** GP2 AFR-AAC meta-GWAS 
- **Version:** Python/3.9
- **Status:** COMPLETE
- **Started:** 22-FEB-2023
- **Last Updated:** 22-FEB-2023
    - **Update Description:**  Notebook started

## Notebook Overview
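- Get the zygosity (genotype) distribution of rs3115534 (chr1:155235878:G:T) by phenotype and sex in each cohort, then compare genotype frequencies between cases and controls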

### CHANGELOG
- 22-FEB-2023: Notebook started 


---
# Data Overview 

| ANCESTRY |     DATASET     | CASES | CONTROLS |  TOTAL  |           ARRAY           |          NOTES          |
|:--------:|:---------------:|:-----:|:--------:|:-------:|:-------------------------:|:-----------------------:|
|    AFR   | IPDGC – Nigeria |  304  |    285   |   589   |         NeuroChip         |            .            |
|    AFR   |       GP2       |  711  |   1,011  |  1,722  |        NeuroBooster       |            .            |
|    AAC   |       GP2       |  185  |   1,149  |  1,334  |        NeuroBooster       |            .            |
|    AAC   |     23andMe     |  288  |  193,985 | 194,273 | Omni Express & GSA & 550k | Summary statistics only |


# Getting Started

## Importing packages

In [3]:
## Import the necessary packages 
import os
import numpy as np
import pandas as pd
import math
import numbers
import sys
import subprocess
import statsmodels.api as sm
import scipy
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

## Print out package versions
## Getting packages loaded into this notebook and their versions to allow for reproducibility
    # Repurposed code from stackoverflow here: https://stackoverflow.com/questions/40428931/package-for-listing-version-of-packages-used-in-a-jupyter-notebook

## Import packages 
import pkg_resources
import types
from datetime import date
today = date.today()
date = today.strftime("%d-%b-%Y").upper()

## Define function 
def get_imports():
    # Some packages have different imported names vs. system/pip names
    # There is no systematic way to get pip names from a package's imported name, so add exceptions to this dict manually
    poorly_named_packages = {
        "PIL": "Pillow",
        "sklearn": "scikit-learn"
    }
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            # Split ensures you get root package, not just imported function
            name = val.__name__.split(".")[0]
        elif isinstance(val, type):
            name = val.__module__.split(".")[0]
        else:
            # Skip plain variables so their names are not mistaken for package names
            continue

        yield poorly_named_packages.get(name, name)

## Get a list of packages imported 
imports = list(set(get_imports()))

# The only way I found to get the version of the root package from only the name of the package is to cross-check the names of installed packages vs. imported packages
requirements = []
for m in pkg_resources.working_set:
    if m.project_name in imports and m.project_name!="pip":
        requirements.append((m.project_name, m.version))

## Print out packages and versions 
print(f"PACKAGE VERSIONS ({date})")
for r in requirements:
    print("\t{}=={}".format(*r))

PACKAGE VERSIONS (22-FEB-2023)
	matplotlib==3.5.2
	numpy==1.22.4
	scipy==1.8.1
	pandas==1.4.3
	statsmodels==0.13.2
	seaborn==0.11.2
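
Recent setuptools releases deprecate `pkg_resources`. As a hedged alternative, the same version listing could be sketched with the standard-library `importlib.metadata` module (available since Python 3.8). The helper name `package_versions` and the hard-coded `wanted` list are assumptions made for illustration; they are not part of the original notebook.

```python
from importlib import metadata


def package_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Distribution is not installed in this environment
            versions[name] = None
    return versions


# Hypothetical list: the distributions printed by the cell above
wanted = ["matplotlib", "numpy", "scipy", "pandas", "statsmodels", "seaborn"]

for name, version in package_versions(wanted).items():
    print(f"\t{name}=={version}")
```

Unlike the `globals()` scan, this version needs the package list to be maintained by hand, but it does not depend on setuptools being installed.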


# IPDGC – Nigeria – AFR – NeuroChip

## Pulling out rs3115534 in PLINK

In [None]:
%%bash

module load plink/1.9

plink --bfile ${NG_AFR_NEUROCHIP} \
--snp chr1:155235878:G:T \
--allow-no-sex \
--recodeA --out ${NG_AFR_NEUROCHIP}.rs3115534

## Processing .raw file in Python

In [7]:
## Read in .raw file
recode = pd.read_csv(f"{NG_AFR_NEUROCHIP}.rs3115534.raw", sep=" ")

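## Recode some of the columns to make reading summaries easier 
## PLINK codes: PHENOTYPE 1 = control, 2 = case; SEX 1 = male, 2 = female
## Any other code (e.g. -9 or 0 for missing) maps to NaN and is left out of the counts
recode['Phenotype'] = recode['PHENOTYPE'].map({1: "Control", 2: "Case"})
recode['Sex'] = recode['SEX'].map({1: "Male", 2: "Female"})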

zygosity =  {0 : 'T/T', 
             1 : 'G/T', 
             2 : 'G/G'}

recode['1:155235878:G:T_G'] = recode['chr1:155235878:G:T_G'].map(zygosity)
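
The same read-and-recode step is repeated for each cohort in this notebook, so it could be wrapped in one function. The sketch below is a minimal version under stated assumptions: `load_raw`, `GENOTYPE_COLUMN`, and the `Zygosity` column name are hypothetical, and the in-memory toy file uses made-up sample IDs. It maps the standard PLINK codes explicitly, so missing phenotype or sex codes become NaN.

```python
import io

import pandas as pd

GENOTYPE_COLUMN = "chr1:155235878:G:T_G"  # .raw column counting copies of the G allele
ZYGOSITY = {0: "T/T", 1: "G/T", 2: "G/G"}


def load_raw(path_or_buffer, genotype_column=GENOTYPE_COLUMN):
    """Read a PLINK .raw file and add readable Phenotype, Sex and Zygosity columns."""
    raw = pd.read_csv(path_or_buffer, sep=" ")
    # PLINK codes: PHENOTYPE 1 = control, 2 = case; SEX 1 = male, 2 = female
    raw["Phenotype"] = raw["PHENOTYPE"].map({1: "Control", 2: "Case"})
    raw["Sex"] = raw["SEX"].map({1: "Male", 2: "Female"})
    # Missing genotype calls (NA in the file) stay NaN after mapping
    raw["Zygosity"] = raw[genotype_column].map(ZYGOSITY)
    return raw


# Toy .raw content (made up for illustration; not taken from any cohort)
toy_raw = io.StringIO(
    "FID IID PAT MAT SEX PHENOTYPE chr1:155235878:G:T_G\n"
    "F1 S1 0 0 1 2 2\n"
    "F2 S2 0 0 2 1 0\n"
    "F3 S3 0 0 2 2 NA\n"
    "F4 S4 0 0 1 1 1\n"
)
toy = load_raw(toy_raw)
print(toy[["IID", "Phenotype", "Sex", "Zygosity"]])
```

With the real data, the call would take the path of the `.raw` file written by PLINK in place of the toy buffer.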

### Summaries of Zygosity by Phenotype 

In [8]:
recode.groupby(['Phenotype'])['IID'].count()

Phenotype
Case       304
Control    285
Name: IID, dtype: int64

In [9]:
recode.groupby(['Phenotype', '1:155235878:G:T_G'])['IID'].count()

Phenotype  1:155235878:G:T_G
Case       G/G                   38
           G/T                  120
           T/T                  107
Control    G/G                   18
           G/T                  105
           T/T                  121
Name: IID, dtype: int64
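
The genotype counts above add up to fewer individuals than the phenotype totals (265 of 304 cases and 244 of 285 controls), presumably because individuals without a genotype call are dropped by the grouping. A minimal sketch of how those individuals could be counted explicitly is shown below; the toy data frame and the `"Missing"` label are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration; not taken from any cohort)
toy = pd.DataFrame({
    "IID": ["S1", "S2", "S3", "S4", "S5", "S6"],
    "Phenotype": ["Case", "Case", "Case", "Control", "Control", "Control"],
    "Zygosity": ["G/G", "G/T", np.nan, "T/T", "T/T", np.nan],
})

# Label missing calls before grouping so they appear as their own category
counts = (
    toy.assign(Zygosity=toy["Zygosity"].fillna("Missing"))
       .groupby(["Phenotype", "Zygosity"])["IID"]
       .count()
)
print(counts)
```

Here every individual lands in exactly one row, so the per-genotype counts add up to the phenotype totals.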

### Summaries of Zygosity by Phenotype and Sex

In [10]:
recode.groupby(['Phenotype','Sex'])['IID'].count()

Phenotype  Sex   
Case       Female     80
           Male      224
Control    Female     96
           Male      189
Name: IID, dtype: int64

In [11]:
recode.groupby(['Phenotype', '1:155235878:G:T_G','Sex'])['IID'].count()

Phenotype  1:155235878:G:T_G  Sex   
Case       G/G                Female    12
                              Male      26
           G/T                Female    38
                              Male      82
           T/T                Female    24
                              Male      83
Control    G/G                Female     4
                              Male      14
           G/T                Female    33
                              Male      72
           T/T                Female    42
                              Male      79
Name: IID, dtype: int64

# GP2 – AFR (includes public AFR+Nigerian NB)

## Pulling out rs3115534 in PLINK

In [None]:
%%bash

module load plink/1.9

plink --bfile ${UPDATED_GP2_v4_AFR} \
--snp chr1:155235878:G:T \
--allow-no-sex \
--recodeA --out ${UPDATED_GP2_v4_AFR}.rs3115534

## Processing .raw file in Python

In [13]:
## Read in .raw file
recode = pd.read_csv(f"{UPDATED_GP2_v4_AFR}.rs3115534.raw", sep=" ")

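## Recode some of the columns to make reading summaries easier 
## PLINK codes: PHENOTYPE 1 = control, 2 = case; SEX 1 = male, 2 = female
## Any other code (e.g. -9 or 0 for missing) maps to NaN and is left out of the counts
recode['Phenotype'] = recode['PHENOTYPE'].map({1: "Control", 2: "Case"})
recode['Sex'] = recode['SEX'].map({1: "Male", 2: "Female"})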

zygosity =  {0 : 'T/T', 
             1 : 'G/T', 
             2 : 'G/G'}

recode['1:155235878:G:T_G'] = recode['chr1:155235878:G:T_G'].map(zygosity)

### Summaries of Zygosity by Phenotype 

In [14]:
recode.groupby(['Phenotype'])['IID'].count()

Phenotype
Case        711
Control    1011
Name: IID, dtype: int64

In [15]:
recode.groupby(['Phenotype', '1:155235878:G:T_G'])['IID'].count()

Phenotype  1:155235878:G:T_G
Case       G/G                   92
           G/T                  278
           T/T                  317
Control    G/G                   31
           G/T                  330
           T/T                  625
Name: IID, dtype: int64

### Summaries of Zygosity by Phenotype and Sex

In [16]:
recode.groupby(['Phenotype','Sex'])['IID'].count()

Phenotype  Sex   
Case       Female    206
           Male      505
Control    Female    448
           Male      563
Name: IID, dtype: int64

In [17]:
recode.groupby(['Phenotype', '1:155235878:G:T_G','Sex'])['IID'].count()

Phenotype  1:155235878:G:T_G  Sex   
Case       G/G                Female     21
                              Male       71
           G/T                Female     80
                              Male      198
           T/T                Female     98
                              Male      219
Control    G/G                Female     12
                              Male       19
           G/T                Female    133
                              Male      197
           T/T                Female    292
                              Male      333
Name: IID, dtype: int64

# GP2 – AAC

## Pulling out rs3115534 in PLINK

In [None]:
%%bash

module load plink/1.9

plink --bfile ${UPDATED_GP2_v4_AAC} \
--snp chr1:155235878:G:T \
--allow-no-sex \
--recodeA --out ${UPDATED_GP2_v4_AAC}.rs3115534

## Processing .raw file in Python

In [19]:
## Read in .raw file
recode = pd.read_csv(f"{UPDATED_GP2_v4_AAC}.rs3115534.raw", sep=" ")

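## Recode some of the columns to make reading summaries easier 
## PLINK codes: PHENOTYPE 1 = control, 2 = case; SEX 1 = male, 2 = female
## Any other code (e.g. -9 or 0 for missing) maps to NaN and is left out of the counts
recode['Phenotype'] = recode['PHENOTYPE'].map({1: "Control", 2: "Case"})
recode['Sex'] = recode['SEX'].map({1: "Male", 2: "Female"})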

zygosity =  {0 : 'T/T', 
             1 : 'G/T', 
             2 : 'G/G'}

recode['1:155235878:G:T_G'] = recode['chr1:155235878:G:T_G'].map(zygosity)

### Summaries of Zygosity by Phenotype 

In [20]:
recode.groupby(['Phenotype'])['IID'].count()

Phenotype
Case        185
Control    1149
Name: IID, dtype: int64

In [21]:
recode.groupby(['Phenotype', '1:155235878:G:T_G'])['IID'].count()

Phenotype  1:155235878:G:T_G
Case       G/G                   11
           G/T                   61
           T/T                  111
Control    G/G                   18
           G/T                  274
           T/T                  848
Name: IID, dtype: int64

### Summaries of Zygosity by Phenotype and Sex

In [22]:
recode.groupby(['Phenotype','Sex'])['IID'].count()

Phenotype  Sex   
Case       Female     80
           Male      105
Control    Female    714
           Male      435
Name: IID, dtype: int64

In [23]:
recode.groupby(['Phenotype', '1:155235878:G:T_G','Sex'])['IID'].count()

Phenotype  1:155235878:G:T_G  Sex   
Case       G/G                Female      6
                              Male        5
           G/T                Female     22
                              Male       39
           T/T                Female     52
                              Male       59
Control    G/G                Female      9
                              Male        9
           G/T                Female    175
                              Male       99
           T/T                Female    526
                              Male      322
Name: IID, dtype: int64

# Summaries

## Adding up all AFR data
| AFR            | Genotype | Count | Frequency (count / group total) | Fold difference            |
|----------------|----------|-------|---------------------------------|----------------------------|
| Total cases    |          | 1015  |                                 |                            |
| Total controls |          | 1296  |                                 |                            |
| Cases          | G/G      | 130   | 0.12808                         | 3.38755 (cases / controls) |
|                | G/T      | 398   | 0.39212                         | 1.16824 (cases / controls) |
|                | T/T      | 424   | 0.41773                         |                            |
| Controls       | G/G      | 49    | 0.03781                         |                            |
|                | G/T      | 435   | 0.33565                         |                            |
|                | T/T      | 746   | 0.57562                         | 1.37795 (controls / cases) |
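
The arithmetic behind the AFR table can be reproduced from the per-cohort genotype counts printed earlier in this notebook. The sketch below is one way to do it, not necessarily how the table was originally produced; the variable names are hypothetical. Frequencies divide by the total number of individuals in each group, as in the table.

```python
import pandas as pd

genotypes = ["G/G", "G/T", "T/T"]

# Genotype counts copied from the IPDGC Nigeria and GP2 AFR outputs above
ipdgc_nigeria = pd.DataFrame({"Case": [38, 120, 107], "Control": [18, 105, 121]}, index=genotypes)
gp2_afr = pd.DataFrame({"Case": [92, 278, 317], "Control": [31, 330, 625]}, index=genotypes)

# Group totals include individuals without a genotype call
group_totals = pd.Series({"Case": 304 + 711, "Control": 285 + 1011})

pooled = ipdgc_nigeria + gp2_afr                   # element-wise sum of the two cohorts
freq = pooled.div(group_totals, axis="columns")    # count / group total
case_over_control = freq["Case"] / freq["Control"]
control_over_case = freq["Control"] / freq["Case"]

print(pooled)
print(freq.round(5))
print(case_over_control.round(5))
print(control_over_case.round(5))
```

The first two entries of `case_over_control` and the last entry of `control_over_case` correspond to the fold differences in the table.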

## Summing up AAC data
| AAC (GP2 only) | Genotype | Count | Frequency (count / group total) | Fold difference            |
|----------------|----------|-------|---------------------------------|----------------------------|
| Total cases    |          | 185   |                                 |                            |
| Total controls |          | 1149  |                                 |                            |
| Cases          | G/G      | 11    | 0.05946                         | 3.79550 (cases / controls) |
|                | G/T      | 61    | 0.32973                         | 1.38270 (cases / controls) |
|                | T/T      | 111   | 0.60000                         |                            |
| Controls       | G/G      | 18    | 0.01567                         |                            |
|                | G/T      | 274   | 0.23847                         |                            |
|                | T/T      | 848   | 0.73803                         | 1.23006 (controls / cases) |
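
The frequencies in these summary tables divide by all individuals in a group, including those without a genotype call, so the three genotype frequencies within a group do not sum to 1. As a hedged illustration using the GP2 AAC counts, the sketch below compares that denominator with an alternative that uses genotyped individuals only. The alternative is an assumption shown for comparison, not the method used in this notebook, and the variable names are hypothetical.

```python
import pandas as pd

genotypes = ["G/G", "G/T", "T/T"]

# Genotype counts and group totals copied from the GP2 AAC outputs above
counts = pd.DataFrame({"Case": [11, 61, 111], "Control": [18, 274, 848]}, index=genotypes)
group_totals = pd.Series({"Case": 185, "Control": 1149})

genotyped = counts.sum()              # individuals with a genotype call, per group
missing = group_totals - genotyped    # individuals without a genotype call

freq_all = counts.div(group_totals, axis="columns")   # denominator used in the table
freq_called = counts.div(genotyped, axis="columns")   # alternative: genotyped only

print(missing)
print(freq_all.round(5))
print(freq_called.round(5))
```

Because only a few AAC individuals lack a call, the two sets of frequencies differ only slightly in this cohort.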

## Conclusions
- The G/G genotype is 3.39x more frequent in AFR cases than in AFR controls
- The G/G genotype is 3.80x more frequent in AAC cases than in AAC controls
- The G/T genotype is 1.17x more frequent in AFR cases than in AFR controls
- The G/T genotype is 1.38x more frequent in AAC cases than in AAC controls
- The T/T genotype is 1.38x more frequent in AFR controls than in AFR cases
- The T/T genotype is 1.23x more frequent in AAC controls than in AAC cases
