# Data Preparation

- Project: Multiancestry LRRK2 p.A419V analysis
- Version: Python/3.10.12
- Created: 05-MAY-2025
- Last Update: 12-JUNE-2025

## Notebook Overview

- Remove related individuals
- File preparation
- Baseline differences in sex and family history (FHx)
- Hardy-Weinberg equilibrium (HWE) check
- Allele frequency by sex

# Getting started

## Load python libraries

In [2]:
# Import necessary packages
import os
import pandas as pd
import numpy as np
from io import StringIO
from firecloud import api as fapi
from IPython.display import display, HTML
import urllib.parse
from google.cloud import bigquery
import sys

# Define function
# Utility routine for printing a shell command before executing it
def shell_do(command):
    print(f'Executing: {command}', file=sys.stderr)
    !$command
    
def shell_return(command):
    print(f'Executing: {command}', file=sys.stderr)
    output = !$command
    return '\n'.join(output)

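The `!$command` syntax used by the helpers above only works inside IPython. A minimal sketch of an equivalent that also runs as a plain Python script is shown here, assuming a shell is available. The name `shell_return_py` is hypothetical and not part of the original notebook.

```python
import subprocess
import sys


def shell_return_py(command: str) -> str:
    """Print a shell command to stderr, run it, and return its stdout as text.

    Hypothetical stand-in for the notebook's `shell_return`; it relies on
    `subprocess.run` instead of IPython's `!` syntax.
    """
    print(f'Executing: {command}', file=sys.stderr)
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout.rstrip('\n')


print(shell_return_py('echo hello'))
```

Unlike the IPython version, this sketch returns only standard output; error output stays in `result.stderr` and would need to be handled separately.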


## Install R and load packages

In [None]:
pip install rpy2

In [3]:
%load_ext rpy2.ipython

In [4]:
%%R
install.packages("tidyverse")
install.packages("data.table")
install.packages("qqman")

library(tidyverse)
library(data.table)
library(qqman)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


Installing package into ‘/home/jupyter/packages’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/tidyverse_2.0.0.tar.gz'
Content type 'application/x-gzip' length 704618 bytes (688 KB)
downloaded 688 KB


The downloaded source packages are in
	‘/tmp/RtmpEei0DZ/downloaded_packages’
Installing package into ‘/home/jupyter/packages’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/data.table_1.17.4.tar.gz'
Content type 'application/x-gzip' length 5839682 bytes (5.6 MB)
downloaded 5.6 MB


The downloaded source packages are in
	‘/tmp/RtmpEei0DZ/downloaded_packages’
Installing package into ‘/home/jupyter/packages’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/qqman_0.1.9.tar.gz'
Content type 'application/x-gzip' length 1337980 bytes (1.3 MB)
downloaded 1.3 MB


The downloaded source packages are in
	‘/tmp/RtmpEei0DZ/downloaded_packages’
data.table 1.17.4 using 4 threads (see ?getDTthreads).  Latest news

# Remove related individuals

In [7]:
WORK_DIR = '/home/jupyter/A419V_release9'
labels = ['AAC', 'AFR', 'AJ', 'AMR', 'CAH', 'CAS', 'EAS', 'EUR', 'MDE', 'SAS']

for label in labels:
    df = pd.read_csv(f'{WORK_DIR}/{label}/{label}_release9.related')
    related_list = ['first_deg', 'second_deg']
    related = df[df['REL'].isin(related_list)]
    
    # Keep one of the samples that flag as related
    related = related.drop_duplicates(subset = 'IID1', keep = 'first')
    
    # Print out the numbers of related individuals
    print(label, len(related))
    
    # Save out files
    related['IID1'].to_csv(f'{WORK_DIR}/{label}/{label}_related_ids.samples', sep = '\t', header = False, index = False)

AAC 10
AFR 69
AJ 139
AMR 58
CAH 22
CAS 74
EAS 227
EUR 1407
MDE 22
SAS 46
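To make the filtering logic concrete, here is a minimal sketch on a toy relatedness table. The sample IDs and the `third_deg` value are made up for illustration; only the `IID1`, `IID2` and `REL` column names and the two kept categories come from the cell above.

```python
import pandas as pd

# Toy relatedness table (made-up IDs): one row per related pair
pairs = pd.DataFrame({
    'IID1': ['S1', 'S1', 'S2', 'S4'],
    'IID2': ['S2', 'S3', 'S3', 'S5'],
    'REL':  ['first_deg', 'second_deg', 'first_deg', 'third_deg'],
})

# Keep first- and second-degree pairs, as in the cell above
related = pairs[pairs['REL'].isin(['first_deg', 'second_deg'])]

# One row per IID1, so each flagged sample appears once in the removal list
to_remove = related.drop_duplicates(subset='IID1', keep='first')['IID1'].tolist()
print(to_remove)
```

In this toy case `S1` and `S2` are listed for removal and `S3` is kept, so the printed counts correspond to the number of samples written to the removal list rather than the number of related pairs.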


In [None]:
%%bash
WORK_DIR='/home/jupyter/A419V_release9'
cd $WORK_DIR

labels=('AAC' 'AFR' 'AJ' 'AMR' 'CAH' 'CAS' 'EAS' 'EUR' 'MDE' 'SAS')

for label in "${labels[@]}"
do

    # Strip the trailing "_s1" from the sample IDs so they match the PLINK files
    sed 's/_s1$//' "${label}/${label}_related_ids.samples" > tmp
    mv tmp "${label}/${label}_related_ids.samples"
    
done
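The same clean-up can be sketched in Python. This is an illustrative alternative, not the notebook's method: the helper name `strip_suffix` and the sample IDs are made up, and it assumes Python 3.9 or later for `str.removesuffix`. It removes `_s1` only when it ends the ID.

```python
def strip_suffix(sample_id: str, suffix: str = '_s1') -> str:
    """Drop `suffix` only when it ends the ID; other occurrences are left alone."""
    return sample_id.removesuffix(suffix)


# Made-up IDs: a normal case, an ID containing "_s1" mid-string, and one without it
ids = ['SAMPLE_000001_s1', 'SAMPLE_s1x_000002_s1', 'SAMPLE_000003']
cleaned = [strip_suffix(i) for i in ids]
print(cleaned)
```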

In [None]:
%%bash
WORK_DIR='/home/jupyter/A419V_release9'
cd $WORK_DIR

labels=('AAC' 'AFR' 'AJ' 'AMR' 'CAH' 'CAS' 'EAS' 'EUR' 'MDE' 'SAS')

for label in "${labels[@]}"
do

    /home/jupyter/plink2 \
        --pfile "${label}/${label}_release9" \
        --remove "${label}/${label}_related_ids.samples" \
        --make-bed \
        --out "${label}/${label}_release9_remove_related"

done

# File preparation

## PLINK file

In [None]:
# Recode phenotype (Control = 1, PD = 2, anything else = -9) and sex (Male = 1, Female = 2) to PLINK coding
master = pd.read_csv(f"{WORK_DIR}/master_key_release9_final.csv")

master.loc[(master["baseline_GP2_phenotype"] != "PD") & (master["baseline_GP2_phenotype"] != "Control"), "baseline_GP2_phenotype"] = -9
master.loc[master["baseline_GP2_phenotype"] == "Control", "baseline_GP2_phenotype"] = 1
master.loc[master["baseline_GP2_phenotype"] == "PD", "baseline_GP2_phenotype"] = 2

master.loc[master["biological_sex_for_qc"] == "Male", "biological_sex_for_qc"] = 1
master.loc[master["biological_sex_for_qc"] == "Female", "biological_sex_for_qc"] = 2
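The recoding above can also be expressed with lookup dictionaries. The sketch below runs on toy rows with illustrative values. Coding unmapped sex values as 0 is an assumption based on PLINK's convention for unknown sex; the cell above leaves such values unchanged.

```python
import pandas as pd

# Toy rows standing in for the master key (values are illustrative only)
toy = pd.DataFrame({
    'baseline_GP2_phenotype': ['PD', 'Control', 'Other', 'PD'],
    'biological_sex_for_qc': ['Male', 'Female', 'Female', 'Unknown'],
})

pheno_codes = {'Control': 1, 'PD': 2}
sex_codes = {'Male': 1, 'Female': 2}

# Unmapped phenotypes become -9 (missing); unmapped sex becomes 0 (assumed: unknown)
toy['PHENO'] = toy['baseline_GP2_phenotype'].map(pheno_codes).fillna(-9).astype(int)
toy['SEX'] = toy['biological_sex_for_qc'].map(sex_codes).fillna(0).astype(int)
print(toy[['PHENO', 'SEX']])
```

Writing to new columns keeps the original text labels available for checking, and the resulting columns are integer-typed rather than mixed strings and numbers.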

In [None]:
# Rename the columns
master.rename(columns={
    "baseline_GP2_phenotype": "PHENO",
    "biological_sex_for_qc": "SEX",
    "age_at_sample_collection": "AGE",
    "GP2ID": "IID",
    "family_history_for_qc": "FHX",
}, inplace = True)

In [None]:
# Save it to update PLINK file

# Copy the slices so that insert() below does not act on a view of master
sex = master[["IID", "SEX"]].copy()
pheno = master[["IID", "PHENO"]].copy()

sex.insert(0, "FID", 0)
pheno.insert(0, "FID", 0)

sex.to_csv(f"{WORK_DIR}/sex_update.txt", sep = "\t", header = False, index = False)
pheno.to_csv(f"{WORK_DIR}/pheno_update.txt", sep = "\t", header = False, index = False)

In [None]:
%%bash
WORK_DIR='/home/jupyter/A419V_release9'
cd $WORK_DIR

labels=('AAC' 'AFR' 'AJ' 'AMR' 'CAH' 'CAS' 'EAS' 'EUR' 'MDE' 'SAS')

for label in "${labels[@]}"
do
    /home/jupyter/plink1.9 \
        --bfile "${label}/${label}_release9_remove_related" \
        --update-sex sex_update.txt \
        --pheno pheno_update.txt \
        --make-bed \
        --out "${label}/${label}_release9_remove_related_updated"
done

## Covariate file

In [None]:
labels = ['AAC', 'AFR', 'AJ', 'AMR', 'CAH', 'CAS', 'EAS', 'EUR', 'MDE', 'SAS']

for label in labels:

    # Keep separate covariate files per ancestry
    master_red = master[(master["nba_label"] == label) & (master["nba"] == 1)]
    master_red_base = master_red[["IID", "PHENO", "SEX", "AGE", "FHX"]]
    master_red_base.to_csv(f"{WORK_DIR}/{label}/{label}_other_cov.txt", sep = "\t", header = True, index = False)

In [None]:
labels = ['AAC', 'AFR', 'AJ', 'AMR', 'CAH', 'CAS', 'EAS', 'EUR', 'MDE', 'SAS']

for label in labels:
    
    # PLINK .fam columns: family ID, sample ID, paternal ID, maternal ID, sex, phenotype
    fam = pd.read_csv(f'{WORK_DIR}/{label}/{label}_release9_remove_related_updated.fam', sep = r'\s+', names = ['FID', 'IID', 'PAT', 'MAT', 'SEX', 'PHENO'])
    fam = fam[['FID', 'IID', 'SEX', 'PHENO']]
    
    # The eigenvec file is expected to hold FID, IID and 10 PCs
    new_pcs = pd.read_csv(f'{WORK_DIR}/{label}/{label}_release9.eigenvec', sep = '\t')
    new_pcs.columns = ['FID', 'IID', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8', 'PC9', 'PC10']
    
    # Merge with the PCs and drop samples without a phenotype
    merge1 = pd.merge(fam, new_pcs, on = ['FID', 'IID'])
    merge1 = merge1[~merge1['PHENO'].isna()].copy()
    merge1['PHENO'] = merge1['PHENO'].astype(int)
    
    # Add age to get the full covariate file
    age = master[["IID", "AGE"]]
    merge2 = pd.merge(merge1, age, on = "IID", how = "left")
    merge2 = merge2[['FID', 'IID','SEX', 'PHENO', 'AGE', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8', 'PC9', 'PC10']]
    
    merge2.to_csv(f'{WORK_DIR}/{label}/{label}_covar.txt', sep = '\t',  index = False)
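Because the inner merge silently drops samples that have no PCs, a quick check can be useful before writing the covariate file. The sketch below uses toy tables with made-up IDs and relies on the `validate` and `indicator` options of `pd.merge`. It illustrates one possible check and is not a step from the original notebook.

```python
import pandas as pd

# Toy .fam-like and eigenvec-like tables (made-up IDs and values)
fam = pd.DataFrame({'FID': [0, 0, 0], 'IID': ['A', 'B', 'C'],
                    'SEX': [1, 2, 1], 'PHENO': [2, 1, 2]})
pcs = pd.DataFrame({'FID': [0, 0], 'IID': ['A', 'B'],
                    'PC1': [0.01, -0.02]})

# validate raises MergeError if an ID is duplicated on either side;
# indicator adds a _merge column showing where each row came from
checked = pd.merge(fam, pcs, on=['FID', 'IID'], how='left',
                   validate='one_to_one', indicator=True)

# Samples present in the .fam table but missing from the PC table
missing_pcs = checked.loc[checked['_merge'] == 'left_only', 'IID'].tolist()
print(missing_pcs)
```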

# Demographics Documentation

## Phenotype per ancestry

In [8]:
pheno_df = pd.DataFrame()
labels = ['AAC', 'AFR', 'AJ', 'AMR', 'CAH', 'CAS', 'EAS', 'EUR', 'MDE', 'SAS']
for label in labels:
    
    fam      = pd.read_csv(f'{WORK_DIR}/{label}/{label}_release9_remove_related_updated.fam', sep = r'\s+', names = ["FID", "IID", "PAT", "MAT", "SEX", "PHENO"])
    fam_filt = fam[fam["PHENO"] != -9]
    
    # Number of controls and cases
    nCon    = fam_filt[fam_filt["PHENO"] == 1].shape[0]
    nCase   = fam_filt[fam_filt["PHENO"] == 2].shape[0]
    
    # Percentage of controls and cases
    pCon    = round( (nCon / (nCon + nCase)) * 100, 2)
    pCase     = round( (nCase / (nCon + nCase)) * 100, 2)
    
    tmp      = pd.DataFrame([{"nCon": nCon, "pCon": pCon, "nCase": nCase, "pCase": pCase, "label" : label}])
    pheno_df   = pd.concat([pheno_df, tmp])
    
pheno_df

| nCon | pCon | nCase | pCase | label |
|---|---|---|---|---|
| 827 | 70.99 | 338 | 29.01 | AAC |
| 1667 | 62.91 | 983 | 37.09 | AFR |
| 824 | 32.53 | 1709 | 67.47 | AJ |
| 1428 | 41.6 | 2005 | 58.4 | AMR |
| 310 | 32.49 | 644 | 67.51 | CAH |
| 329 | 33.23 | 661 | 66.77 | CAS |
| 2379 | 42.7 | 3192 | 57.3 | EAS |
| 5492 | 26.37 | 15332 | 73.63 | EUR |
| 197 | 29.06 | 481 | 70.94 | MDE |
| 199 | 35.99 | 354 | 64.01 | SAS |
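The per-ancestry counts and percentages can also be computed in one pass with `pd.crosstab`. The sketch below assumes the per-ancestry `.fam` tables have already been stacked into a single DataFrame with a `label` column, which the notebook itself does not do; the toy rows are made up.

```python
import pandas as pd

# Toy stacked table: one row per sample with its ancestry label and phenotype
toy = pd.DataFrame({
    'label': ['AAC', 'AAC', 'AAC', 'AJ', 'AJ', 'AJ', 'AJ'],
    'PHENO': [1, 1, 2, 1, 2, 2, -9],
})

# Drop missing phenotypes, then count controls (1) and cases (2) per ancestry
known = toy[toy['PHENO'].isin([1, 2])]
counts = (pd.crosstab(known['label'], known['PHENO'])
            .rename(columns={1: 'nCon', 2: 'nCase'}))

total = counts['nCon'] + counts['nCase']
counts['pCon'] = (counts['nCon'] / total * 100).round(2)
counts['pCase'] = (counts['nCase'] / total * 100).round(2)
print(counts)
```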


## Sex per ancestry

In [9]:
sex_df = pd.DataFrame()
labels = ['AAC', 'AFR', 'AJ', 'AMR', 'CAH', 'CAS', 'EAS', 'EUR', 'MDE', 'SAS']
for label in labels:
    
    fam      = pd.read_csv(f'{WORK_DIR}/{label}/{label}_release9_remove_related_updated.fam', sep = r'\s+', names = ["FID", "IID", "PAT", "MAT", "SEX", "PHENO"])
    fam_filt = fam[fam["PHENO"] != -9]
    
    # Number of male and female
    nMale    = fam_filt[fam_filt["SEX"] == 1].shape[0]
    nFem     = fam_filt[fam_filt["SEX"] == 2].shape[0]
    
    # Percentage of male and female
    pMale    = round( (nMale / (nMale + nFem)) * 100, 2)
    pFem     = round( (nFem / (nMale + nFem)) * 100, 2)
    
    tmp      = pd.DataFrame([{"nMale": nMale, "pMale": pMale, "nFem": nFem, "pFem": pFem, "label" : label}])
    sex_df   = pd.concat([sex_df, tmp])
    
sex_df

| nMale | pMale | nFem | pFem | label |
|---|---|---|---|---|
| 481 | 41.29 | 684 | 58.71 | AAC |
| 1470 | 55.47 | 1180 | 44.53 | AFR |
| 1553 | 61.31 | 980 | 38.69 | AJ |
| 1631 | 47.51 | 1802 | 52.49 | AMR |
| 500 | 52.41 | 454 | 47.59 | CAH |
| 452 | 45.66 | 538 | 54.34 | CAS |
| 3454 | 62.0 | 2117 | 38.0 | EAS |
| 12627 | 60.64 | 8197 | 39.36 | EUR |
| 417 | 61.5 | 261 | 38.5 | MDE |
| 358 | 64.74 | 195 | 35.26 | SAS |


## Age 

In [10]:
# Control Age
age_df = pd.DataFrame()
labels = ['AAC', 'AFR', 'AJ', 'AMR', 'CAH', 'CAS', 'EAS', 'EUR', 'MDE', 'SAS']
for label in labels:
    
    covar          = pd.read_csv(f'{WORK_DIR}/{label}/{label}_covar.txt', sep = '\t')
    covar_con      = covar[covar["PHENO"] == 1]
    covar_con_filt = covar_con[~covar_con['AGE'].isna()]
    
    # Calculate number of individuals with missing age 
    nMissing  = covar_con[covar_con['AGE'].isna()].shape[0]
    nComplete = covar_con[~covar_con['AGE'].isna()].shape[0]
    
    mean_age  = round(covar_con_filt['AGE'].mean(), 2)
    stdev     = round(covar_con_filt['AGE'].std(), 2) # Standard deviation
    tmp       = pd.DataFrame([{"nComplete": nComplete, "nMissing": nMissing, "mean_age":mean_age, "std":stdev, "label" : label}])
    age_df    = pd.concat([age_df, tmp])
    
age_df

| nComplete | nMissing | mean_age | std | label |
|---|---|---|---|---|
| 775 | 52 | 64.94 | 11.46 | AAC |
| 949 | 718 | 62.5 | 16.0 | AFR |
| 627 | 197 | 63.11 | 11.47 | AJ |
| 1374 | 54 | 59.83 | 8.45 | AMR |
| 299 | 11 | 47.2 | 19.33 | CAH |
| 239 | 90 | 54.86 | 6.13 | CAS |
| 1307 | 1072 | 62.41 | 11.13 | EAS |
| 3179 | 2313 | 64.72 | 12.51 | EUR |
| 193 | 4 | 55.58 | 8.84 | MDE |
| 123 | 76 | 54.15 | 17.43 | SAS |


In [11]:
# PD Age
age_df = pd.DataFrame()
labels = ['AAC', 'AFR', 'AJ', 'AMR', 'CAH', 'CAS', 'EAS', 'EUR', 'MDE', 'SAS']
for label in labels:
    
    covar           = pd.read_csv(f'{WORK_DIR}/{label}/{label}_covar.txt', sep = '\t')
    covar_case      = covar[covar["PHENO"] == 2]
    covar_case_filt = covar_case[~covar_case['AGE'].isna()]
    
    # Calculate number of individuals with missing age
    nMissing  = covar_case[covar_case['AGE'].isna()].shape[0]
    nComplete = covar_case[~covar_case['AGE'].isna()].shape[0]
    
    mean_age  = round(covar_case_filt['AGE'].mean(), 2)
    stdev     = round(covar_case_filt['AGE'].std(), 2) # Standard deviation
    tmp       = pd.DataFrame([{"nComplete": nComplete, "nMissing": nMissing, "mean_age":mean_age, "std":stdev, "label" : label}])
    age_df    = pd.concat([age_df, tmp])
    
age_df

| nComplete | nMissing | mean_age | std | label |
|---|---|---|---|---|
| 287 | 51 | 65.9 | 11.55 | AAC |
| 270 | 713 | 63.97 | 11.82 | AFR |
| 1617 | 92 | 70.26 | 9.75 | AJ |
| 1916 | 89 | 62.83 | 12.25 | AMR |
| 556 | 88 | 61.51 | 12.44 | CAH |
| 528 | 133 | 61.13 | 10.87 | CAS |
| 2612 | 580 | 64.26 | 11.54 | EAS |
| 11906 | 3426 | 67.27 | 10.76 | EUR |
| 333 | 148 | 64.07 | 12.23 | MDE |
| 316 | 38 | 60.57 | 12.46 | SAS |
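The control and PD age summaries follow the same steps, so they can be produced together with a grouped aggregation. The sketch below uses toy ages; the column names `PHENO` and `AGE` match the covariate files, everything else is illustrative. As in the cells above, `std` is the sample standard deviation (pandas default, `ddof=1`).

```python
import pandas as pd

# Toy covariate rows: PHENO 1 = control, 2 = PD; one control has no recorded age
toy = pd.DataFrame({
    'PHENO': [1, 1, 1, 2, 2, 2],
    'AGE': [60.0, 70.0, None, 55.0, 65.0, 75.0],
})

# Named aggregation: one output column per summary statistic
summary = toy.groupby('PHENO')['AGE'].agg(
    nComplete='count',                    # non-missing ages
    nMissing=lambda s: s.isna().sum(),    # missing ages
    mean_age='mean',
    std='std',
).round(2)
print(summary)
```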


# Compare differences between p.A419V PD carriers and PD non-carriers 

In [None]:
labels = ['AAC', 'AFR', 'AJ', 'AMR', 'CAH', 'CAS', 'EAS', 'EUR', 'MDE', 'SAS']
for label in labels:

    master_red = master[(master["nba_label"] == label) & (master["nba"] == 1)]
    master_red_base = master_red[["IID", "PHENO", "SEX", "AGE", "FHX"]]
    master_red_base.to_csv(f"{WORK_DIR}/{label}/{label}_other_cov.txt", sep = "\t", header = True, index = False)

In [None]:
%%bash
WORK_DIR=/home/jupyter/A419V_release9
cd $WORK_DIR

labels=('AAC' 'AFR' 'AJ' 'AMR' 'CAH' 'CAS' 'EAS' 'EUR' 'MDE' 'SAS')

# Extract p.A419V
for label in "${labels[@]}"
do

    /home/jupyter/plink1.9 \
    --bfile ${label}/${label}_release9_remove_related_updated \
    --chr 12 \
    --from-bp 40252984 \
    --to-bp 40252984 \
    --missing \
    --make-bed \
    --out ${label}/${label}_release9_remove_related_a419v
    
done

In [None]:
%%bash
WORK_DIR=/home/jupyter/A419V_release9
cd $WORK_DIR

labels=('AAC' 'AFR' 'AJ' 'AMR' 'CAH' 'CAS' 'EAS' 'EUR' 'MDE' 'SAS')

for label in "${labels[@]}"
do

    /home/jupyter/plink1.9 \
    --bfile ${label}/${label}_release9_remove_related_a419v \
    --recode A \
    --out ${label}/${label}_release9_remove_related_a419v
    
done

## Family History

In [13]:
%%R

labels <- c("CAH", "EUR", "EAS", "CAS")

df <- data.table()

for (label in labels){
    
    raw <- read.delim(paste0("/home/jupyter/A419V_release9/", label, "/", label, "_release9_remove_related_a419v.raw"), sep = " ")
    cov <- read.delim(paste0("/home/jupyter/A419V_release9/", label, "/", label, "_other_cov.txt"), sep = "\t")

    merged <- merge(raw, cov, by = "IID")

    merge_filtered <- merged %>% 
                      filter(PHENOTYPE == 2) %>%
                      subset(FHX %in% c("No", "Yes"))

    table_fhx <- table(merge_filtered$exm994472_A, merge_filtered$FHX)

    p          <- fisher.test(table_fhx)$p.value
    total      <- nrow(merge_filtered)
    withFHx    <- merge_filtered %>% filter(FHX == "Yes") %>% nrow()
    withoutFHx <- merge_filtered %>% filter(FHX == "No") %>% nrow()
    
    withFHx_carrier <- merge_filtered %>% filter(FHX == "Yes" & exm994472_A != 0) %>% nrow()
    withFHx_noncarrier <- merge_filtered %>% filter(FHX == "Yes" & exm994472_A == 0) %>% nrow()

    total_carrier <- merge_filtered %>% filter(exm994472_A != 0) %>% nrow()
    total_noncarrier <- merge_filtered %>% filter(exm994472_A == 0) %>% nrow()
    
    tmp <- data.table(anc = label, ntotal = total, nFHX = withFHx, nWithoutFHX = withoutFHx, p_val = p, 
                      FHx_c = withFHx_carrier, FHx_nc = withFHx_noncarrier,
                      total_c = total_carrier, total_nc = total_noncarrier)
    df  <- rbind(df, tmp)

}

df

      anc ntotal  nFHX nWithoutFHX     p_val FHx_c FHx_nc total_c total_nc
   <char>  <int> <int>       <int>     <num> <int>  <int>   <int>    <int>
1:    CAH    508   102         406 0.1814038     2    100       4      504
2:    EUR  11630  2959        8671 0.3559985     3   2953      11    11610
3:    EAS   2711   664        2047 0.2906378     8    656      53     2657
4:    CAS    516    59         457 1.0000000     1     58      11      504
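To show what the Fisher's exact test computes for a 2x2 carrier-by-family-history table, here is a minimal standard-library sketch. The function name `fisher_exact_2x2` is hypothetical. It sums the hypergeometric probabilities of all tables with the same margins that are no more likely than the observed one, which is the usual two-sided definition. The counts are taken from the CAH row above.

```python
from math import comb


def fisher_exact_2x2(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell given the margins
        return comb(col1, x) * comb(n - col1, row1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    # Small relative tolerance guards against floating-point ties
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))


# CAH: 4 carriers (2 with FHx, 2 without), 504 non-carriers (100 with FHx, 404 without)
p = fisher_exact_2x2(2, 2, 100, 404)
print(round(p, 4))
```

The result should be close to the `p_val` of about 0.1814 reported for CAH. The sketch only covers 2x2 tables, so it would not apply if homozygous carriers added a third genotype row.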


## Sex

In [14]:
%%R

labels <- c("CAH", "EUR", "EAS", "CAS")

df <- data.table()

for (label in labels){
    
    raw <- read.delim(paste0("/home/jupyter/A419V_release9/", label, "/", label, "_release9_remove_related_a419v.raw"), sep = " ")
    cov <- read.delim(paste0("/home/jupyter/A419V_release9/", label, "/", label, "_other_cov.txt"), sep = "\t")

    merged <- merge(raw, cov, by = "IID")

    merge_filtered <- merged %>% 
                      filter(PHENOTYPE == 2) %>%
                      subset(SEX.y %in% c(1, 2))

    table_sex <- table(merge_filtered$exm994472_A, merge_filtered$SEX.y)

    p          <- fisher.test(table_sex)$p.value
    total      <- nrow(merge_filtered)
    male       <- merge_filtered %>% filter(SEX.y == 1) %>% nrow()
    female     <- merge_filtered %>% filter(SEX.y == 2) %>% nrow()
    
    male_carrier       <- merge_filtered %>% filter(SEX.y == 1 & exm994472_A != 0) %>% nrow()
    male_noncarrier    <- merge_filtered %>% filter(SEX.y == 1 & exm994472_A == 0) %>% nrow()
    
    total_carrier <- merge_filtered %>% filter(exm994472_A != 0) %>% nrow()
    total_noncarrier <- merge_filtered %>% filter(exm994472_A == 0) %>% nrow()
    
    tmp <- data.table(anc = label, ntotal = total, nMale = male, nFemale = female, p_val = p, 
                      male_c = male_carrier, male_nc = male_noncarrier, 
                      total_c = total_carrier, total_nc = total_noncarrier)
    df  <- rbind(df, tmp)

}

df

      anc ntotal nMale nFemale       p_val male_c male_nc total_c total_nc
   <char>  <int> <int>   <int>       <num>  <int>   <int>   <int>    <int>
1:    CAH    644   369     275 0.051800339      4     365      14      630
2:    EUR  15332  9549    5783 0.040907116     15    9528      33    15288
3:    EAS   3192  1767    1425 0.087092101     37    1729      81     3109
4:    CAS    661   301     360 0.008410676      1     299      13      647


# HWE check

In [None]:
%%bash
WORK_DIR=/home/jupyter/A419V_release9
cd $WORK_DIR

labels=('AAC' 'AFR' 'AJ' 'AMR' 'CAH' 'CAS' 'EAS' 'EUR' 'MDE' 'SAS')

for label in "${labels[@]}"
do

    /home/jupyter/plink1.9 \
    --bfile ${label}/${label}_release9_remove_related_a419v \
    --hardy \
    --out ${label}/${label}_release9_remove_related_a419v_hwe
    
done

In [15]:
%%bash
WORK_DIR=/home/jupyter/A419V_release9
cd $WORK_DIR

ancestry_labels=('AAC' 'AFR' 'AJ' 'AMR' 'CAH' 'CAS' 'EAS' 'EUR' 'MDE' 'SAS')

for label in "${ancestry_labels[@]}";
do
    
    echo ${label}
    cat ${label}/${label}_release9_remove_related_a419v_hwe.hwe
    
done

AAC
 CHR              SNP     TEST   A1   A2                 GENO   O(HET)   E(HET)            P 
  12   Seq_rs34594498      ALL    A    G             0/0/1203        0        0            1
  12   Seq_rs34594498      AFF    A    G              0/0/338        0        0            1
  12   Seq_rs34594498    UNAFF    A    G              0/0/824        0        0            1
  12        exm994472      ALL    A    G             0/0/1207        0        0            1
  12        exm994472      AFF    A    G              0/0/338        0        0            1
  12        exm994472    UNAFF    A    G              0/0/827        0        0            1
AFR
AJ
 CHR              SNP     TEST   A1   A2                 GENO   O(HET)   E(HET)            P 
  12   Seq_rs34594498      ALL    A    G             0/1/3076 0.000325 0.0003249            1
  12   Seq_rs34594498      AFF    A    G             0/1/1705 0.0005862 0.000586            1
  12   Seq_rs34594498    UNAFF    A    G              0

cat: AFR/AFR_release9_remove_related_a419v_hwe.hwe: No such file or directory


 CHR              SNP     TEST   A1   A2                 GENO   O(HET)   E(HET)            P 
  12   Seq_rs34594498      ALL    A    G             0/1/3492 0.0002863 0.0002862            1
  12   Seq_rs34594498      AFF    A    G             0/0/2005        0        0            1
  12   Seq_rs34594498    UNAFF    A    G             0/0/1428        0        0            1
  12        exm994472      ALL    A    G             0/1/3491 0.0002864 0.0002863            1
  12        exm994472      AFF    A    G             0/0/2004        0        0            1
  12        exm994472    UNAFF    A    G             0/0/1428        0        0            1
CAH
 CHR              SNP     TEST   A1   A2                 GENO   O(HET)   E(HET)            P 
  12   Seq_rs34594498      ALL    A    G             0/13/968  0.01325  0.01316            1
  12   Seq_rs34594498      AFF    A    G             0/13/630  0.02022  0.02001            1
  12   Seq_rs34594498    UNAFF    A    G              0/0/31
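The `O(HET)` and `E(HET)` columns in the `.hwe` output can be reproduced from the three `GENO` counts. The sketch below is a minimal illustration with a hypothetical helper name, `het_rates`. It does not reproduce the exact-test `P` column.

```python
def het_rates(hom_a1: int, het: int, hom_a2: int) -> tuple[float, float]:
    """Return (observed, expected) heterozygosity from genotype counts."""
    n = hom_a1 + het + hom_a2
    if n == 0:
        raise ValueError('no genotyped samples')
    p = (2 * hom_a1 + het) / (2 * n)   # A1 allele frequency
    observed = het / n                 # fraction of heterozygous samples
    expected = 2 * p * (1 - p)         # Hardy-Weinberg expectation, 2pq
    return observed, expected


# CAH, Seq_rs34594498, ALL: GENO = 0/13/968
obs, exp = het_rates(0, 13, 968)
print(f'{obs:.4g} {exp:.4g}')
```

For these counts the two values round to 0.01325 and 0.01316, matching the CAH `ALL` row in the output above.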

# LRRK2 p.A419V allele frequency by sex

In [16]:
# For Male
WORK_DIR='/home/jupyter/A419V_release9'
sex = 1

geno_df = pd.DataFrame()
ancestry_labels = ['AAC', 'AJ', 'AMR', 'CAS','MDE', 'SAS', 'EAS', 'EUR', 'CAH']

for label in ancestry_labels:

    raw = pd.read_csv(f"{WORK_DIR}/{label}/{label}_release9_remove_related_a419v.raw", sep = r"\s+")
    raw_male = raw[(raw["SEX"] == sex) & (raw["PHENOTYPE"] != -9)]
    
    nCon = raw_male[raw_male["PHENOTYPE"] == 1].shape[0]
    nCase = raw_male[raw_male["PHENOTYPE"] == 2].shape[0]

    hom_ref = raw_male[raw_male["exm994472_A"] == 0].shape[0]
    het     = raw_male[raw_male["exm994472_A"] == 1].shape[0]
    hom_alt = raw_male[raw_male["exm994472_A"] == 2].shape[0]
    
    alt_freq = ( het + (hom_alt*2) ) / ( (hom_ref + het + hom_alt)*2 )
    
    tmp       = pd.DataFrame([{"nCon": nCon, "nCase": nCase, "hom_ref": hom_ref, "het": het, "hom_alt":hom_alt, "alt_freq": alt_freq, "label" : label}])
    geno_df   = pd.concat([geno_df, tmp])

geno_df

| nCon | nCase | hom_ref | het | hom_alt | alt_freq | label |
|---|---|---|---|---|---|---|
| 298 | 183 | 481 | 0 | 0 | 0.0 | AAC |
| 425 | 1128 | 1551 | 1 | 0 | 0.000322 | AJ |
| 502 | 1129 | 1630 | 0 | 0 | 0.0 | AMR |
| 151 | 301 | 448 | 3 | 0 | 0.003326 | CAS |
| 118 | 299 | 416 | 0 | 0 | 0.0 | MDE |
| 133 | 225 | 357 | 0 | 0 | 0.0 | SAS |
| 1687 | 1767 | 3402 | 42 | 0 | 0.006098 | EAS |
| 3078 | 9549 | 12602 | 16 | 0 | 0.000634 | EUR |
| 131 | 369 | 496 | 4 | 0 | 0.004 | CAH |


In [18]:
# For Female
WORK_DIR='/home/jupyter/A419V_release9'
sex = 2

geno_df = pd.DataFrame()
ancestry_labels = ['AAC', 'AJ', 'AMR', 'CAS','MDE', 'SAS', 'EAS', 'EUR', 'CAH']

for label in ancestry_labels:

    raw = pd.read_csv(f"{WORK_DIR}/{label}/{label}_release9_remove_related_a419v.raw", sep = r"\s+")
    raw_fem = raw[(raw["SEX"] == sex) & (raw["PHENOTYPE"] != -9)]
    
    nCon = raw_fem[raw_fem["PHENOTYPE"] == 1].shape[0]
    nCase = raw_fem[raw_fem["PHENOTYPE"] == 2].shape[0]

    hom_ref = raw_fem[raw_fem["exm994472_A"] == 0].shape[0]
    het     = raw_fem[raw_fem["exm994472_A"] == 1].shape[0]
    hom_alt = raw_fem[raw_fem["exm994472_A"] == 2].shape[0]
    
    alt_freq = ( het + (hom_alt*2) ) / ( (hom_ref + het + hom_alt)*2 )
    
    tmp       = pd.DataFrame([{"nCon": nCon, "nCase": nCase, "hom_ref": hom_ref, "het": het, "hom_alt":hom_alt, "alt_freq": alt_freq, "label" : label}])
    geno_df   = pd.concat([geno_df, tmp])

geno_df

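| nCon | nCase | hom_ref | het | hom_alt | alt_freq | label |
|---|---|---|---|---|---|---|
| 529 | 155 | 684 | 0 | 0 | 0.0 | AAC |
| 399 | 581 | 980 | 0 | 0 | 0.0 | AJ |
| 926 | 876 | 1802 | 0 | 0 | 0.0 | AMR |
| 178 | 360 | 524 | 14 | 0 | 0.013011 | CAS |
| 79 | 182 | 261 | 0 | 0 | 0.0 | MDE |
| 66 | 129 | 195 | 0 | 0 | 0.0 | SAS |
| 692 | 1425 | 2062 | 52 | 2 | 0.013233 | EAS |
| 2414 | 5783 | 8171 | 17 | 2 | 0.001282 | EUR |
| 179 | 275 | 444 | 10 | 0 | 0.011013 | CAH |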
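The allele frequency in both cells is the number of alternate alleles divided by twice the number of genotyped samples. A minimal sketch of that formula as a reusable function follows. The name `alt_allele_freq` is hypothetical, and the counts come from the EAS rows of the two tables above.

```python
def alt_allele_freq(hom_ref: int, het: int, hom_alt: int) -> float:
    """Alternate allele frequency from genotype counts (diploid samples)."""
    n = hom_ref + het + hom_alt
    if n == 0:
        return float('nan')  # no genotyped samples
    # Each heterozygote carries one alternate allele, each homozygote two
    return (het + 2 * hom_alt) / (2 * n)


eas_female = alt_allele_freq(2062, 52, 2)
eas_male = alt_allele_freq(3402, 42, 0)
print(round(eas_female, 6), round(eas_male, 6))
```

Samples with a missing genotype are excluded from both numerator and denominator, which is also what the cells above do by counting only the values 0, 1 and 2.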
