# TMEM175 - Single gene analysis in GP2 NeuroBooster genotyping data (all ancestries)

`GP2 ❤️ Open Science 😍`

**Last updated:** June 2025


**Project:** GP2 TMEM175

**Version:** Python/3.10.15, R/4.3.3


## Description

This notebook contains the code and workflow used in the study: **“TMEM175, SCARB2 and CTSB associations with Parkinson's disease risk across populations”**.

This notebook performs single-variant and gene-based analyses of TMEM175 using GP2 data.

## Notebook Overview

1. Loading Python libraries
   * Set paths
   * Make working directory

2. Installing packages

3. Create a covariate file with GP2 data

4. Annotation of the gene TMEM175

5. Association analysis to compare allele frequencies between cases and controls

6. GLM analysis adjusting for sex, age, PC1-5

7. Burden tests (SKAT-O, SKAT, CMC, Zeggini, MB, FP, CMC-Wald)

8. Conditional analysis

## 1. Loading Python libraries

In [None]:
# Use pathlib for file path manipulation
import pathlib

# Import numpy
import numpy as np

# Import pandas for tabular data
import pandas as pd

# Import plotnine: a ggplot2-compatible Python plotting package
from plotnine import *

# Always show all columns in a Pandas DataFrame
pd.set_option('display.max_columns', None)

### Set paths

In [None]:
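# expanduser() resolves '~'; joining a '~/...' string onto Path.home() would leave a literal '~' folder in the path
REL7_PATH = pathlib.Path('~/workspace/path_to_release7').expanduser()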
!ls -hal {REL7_PATH}
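
A note on the tilde in these paths: `pathlib` does not expand `~` by itself, and a string starting with `~/` that is joined onto `Path.home()` keeps `~` as a literal folder name. A minimal sketch of the difference, using the same placeholder suffix `workspace/path_to_release7` as the cell above:

```python
import pathlib

home = pathlib.Path.home()

# Joining a '~/...' string onto home keeps '~' as a literal path component
joined = pathlib.Path(home, '~/workspace/path_to_release7')

# expanduser() replaces the leading '~' with the home directory
expanded = pathlib.Path('~/workspace/path_to_release7').expanduser()

print(joined)
print(expanded)
```

Shell commands run with `!` and pandas readers expand `~` on their own, which is why string paths such as `WORK_DIR` still work in later cells.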

### Make working directory

In [5]:
WORK_DIR = "~/workspace/ws_files/TMEM175/"

## 2. Installing packages

In [None]:
# Make sure all tools are installed
! ls /home/jupyter/tools

LICENSE		       plink2				rvtests
annovar		       plink2_linux_x86_64_latest.zip	toy.map
annovar.latest.tar.gz  plink_linux_x86_64_20190304.zip	toy.ped
plink		       prettify


In [None]:
# Make sure the tools are executable

! chmod u+x /home/jupyter/tools/plink
! chmod u+x /home/jupyter/tools/plink2
! chmod u+x /home/jupyter/tools/rvtests/executable/rvtest

In [None]:
%%bash
# Make one working directory per ancestry
for ancestry in AAC AFR AJ AMR CAS EAS EUR FIN MDE SAS CAH
do
    # -p creates missing parent folders and does not fail if the folder already exists
    mkdir -p ~/workspace/ws_files/TMEM175/TMEM175_"$ancestry"
done
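
The same folder layout can also be created from Python, which keeps the ancestry list in one place for the later loops. This is only a sketch: the helper name `make_ancestry_dirs` is hypothetical, and the demo writes into a temporary directory instead of the real workspace.

```python
import pathlib
import tempfile

ANCESTRIES = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

def make_ancestry_dirs(base_dir, gene='TMEM175', ancestries=ANCESTRIES):
    """Create <base_dir>/<gene>_<ancestry> for each ancestry and return the paths."""
    base = pathlib.Path(base_dir).expanduser()
    created = []
    for ancestry in ancestries:
        out_dir = base / f'{gene}_{ancestry}'
        # parents=True creates missing parent folders; exist_ok=True makes reruns safe
        out_dir.mkdir(parents=True, exist_ok=True)
        created.append(out_dir)
    return created

# Demonstrate in a throwaway directory rather than the real workspace
with tempfile.TemporaryDirectory() as tmp:
    dirs = make_ancestry_dirs(tmp)
    make_ancestry_dirs(tmp)  # a second call does not raise
    names = [p.name for p in dirs]
    all_exist = all(p.is_dir() for p in dirs)

print(names[0], len(names), all_exist)
```

In the notebook itself the call would be `make_ancestry_dirs('~/workspace/ws_files/TMEM175')`.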

## 3. Create a covariate file with GP2 data

In [4]:
CLINICAL_DATA_PATH = pathlib.Path(REL7_PATH, 'clinical_data/master_key_release7_final_vwb.csv')

In [None]:
# Let's load the master key
key = pd.read_csv(CLINICAL_DATA_PATH, low_memory=False)
print(key.shape)
key.head()

In [None]:
# Subset to keep only a few columns (copy so the rename does not act on a view of the original frame)
key = key[['GP2sampleID', 'baseline_GP2_phenotype_for_qc', 'biological_sex_for_qc', 'age_at_sample_collection', 'age_of_onset', 'label']].copy()
# Rename the columns
key = key.rename(columns={'GP2sampleID': 'IID',
                          'baseline_GP2_phenotype_for_qc': 'phenotype',
                          'biological_sex_for_qc': 'SEX',
                          'age_at_sample_collection': 'AGE',
                          'age_of_onset': 'AAO'})
key

In [None]:
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
    print(f'WORKING ON: {ancestry}')
    
    # Subset to keep ancestry of interest 
    ancestry_key = key[key['label']==ancestry].copy()
    ancestry_key = ancestry_key.reset_index(drop=True)
    
    # Load information about related individuals in the ancestry analyzed
    related_df = pd.read_csv(f'{REL7_PATH}/meta_data/related_samples/{ancestry}_release7_vwb.related')
    print(f'Related individuals: {related_df.shape}')
    
    # Make a list of just one set of related people
    related_list = list(related_df['IID1'])
    
    # Check value counts of related and remove only one related individual
    ancestry_key = ancestry_key[~ancestry_key["IID"].isin(related_list)]
    
    # Check size
    print(f'Unrelated individuals: {ancestry_key.shape}')
    
    # Recode phenotype to PLINK-style values: PD = 2, Control = 1
    # Any other phenotype is left missing (<NA>), not -9 (see the RVtests note further down)
    pheno_mapping = {"PD": 2, "Control": 1}
    ancestry_key['PHENO'] = ancestry_key['phenotype'].map(pheno_mapping).astype('Int64')
    
    # Print value counts of pheno (a bare expression inside a loop displays nothing)
    print(ancestry_key['PHENO'].value_counts(dropna=False).to_string())
    
    # Get the PCs
    pcs = pd.read_csv(f'{REL7_PATH}/meta_data/qc_metrics/projected_pcs_vwb.csv')
    
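    # Select the sample ID and the first 5 PCs (columns 1-6 of the file, renamed here)
    selected_columns = ['IID', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5']
    pcs = pd.DataFrame(data=pcs.iloc[:, 1:7].values, columns=selected_columns)
    
    # Drop the first row
    # NOTE: pd.read_csv already consumes the header line, so this removes a data row
    # unless the file carries a second header row; check the raw file before relying on it
    pcs = pcs.drop(0)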
    
    # Reset the index to remove any potential issues
    pcs = pcs.reset_index(drop=True)
    
    # Check size
    print(f'PCs: {pcs.shape}')
    
    # Check value counts of SEX
    sex_og_values = ancestry_key['SEX'].value_counts(dropna=False)
    print(f'Sex value counts - original:\n {sex_og_values.to_string()}')
    
    # Recode sex to PLINK-style values: Female = 2, Male = 1
    # Any other value (e.g. "Other/Unknown/Not Reported") is left missing (<NA>), not -9
    sex_mapping = {"Female": 2, "Male": 1}
    ancestry_key['SEX'] = ancestry_key['SEX'].map(sex_mapping).astype('Int64')
    
    # Check value counts of SEX after recoding
    sex_recode_values = ancestry_key['SEX'].value_counts(dropna=False)
    print(f'Sex value counts - recoded:\n{sex_recode_values.to_string()}')
    
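    # Make covariate file
    # Inner merge keeps only samples of this ancestry that passed the relatedness filter;
    # a left merge on the all-ancestry PC table would retain every sample with PCs
    df = pd.merge(pcs, ancestry_key, on='IID', how='inner')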
    print(f'Check columns for covariate file: {df.columns}')
    
    # Make additional columns - FID, fatid and matid - these are needed for RVtests!!
    # RVtests needs the first 5 columns to be fid, iid, fatid, matid and sex otherwise it does not run correctly
    # Uppercase column name is ok
    # See https://zhanxw.github.io/rvtests/#phenotype-file
    df['FID'] = 0
    df['FATID'] = 0
    df['MATID'] = 0
    
    # Clean up and keep columns we need 
    final_df = df[['FID','IID', 'FATID', 'MATID', 'SEX', 'AGE', 'PHENO','PC1', 'PC2', 'PC3', 'PC4', 'PC5']].copy()
    
    ## DO NOT replace missing values with -9 as this is misinterpreted by RVtests - needs to be nonnumeric
    # Leave missing values as NA
    
    # Check number of PD cases missing age
    pd_missAge = final_df[(final_df['PHENO']==2)&(final_df['AGE'].isna())]
    print(f'Number of PD cases missing age: {pd_missAge.shape[0]}')
    
    # Check number of controls missing age
    control_missAge = final_df[(final_df['PHENO']==1)&(final_df['AGE'].isna())]
    print(f'Number of controls missing age: {control_missAge.shape[0]}')
    
    # Make file of sample IDs to keep 
    samples_toKeep = final_df[['FID', 'IID']].copy()
    samples_toKeep.to_csv(f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}/{ancestry}.samplestoKeep', sep = '\t', index=False, header=None)
    
    # Make your covariate file
    # Included na_rep to write out missing/NA values explicitly as string/text, not as blank otherwise they are misread in RVtests
    final_df.to_csv(f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}/{ancestry}_covariate_file.txt', sep = '\t', na_rep='NA', index=False)

WORKING ON: AFR
Related individuals: (228, 9)
Unrelated individuals: (2752, 6)
PCs: (58209, 6)
Sex value counts - original:
 SEX
Male                          1553
Female                        1197
Other/Unknown/Not Reported       2
Sex value counts - recoded:
SEX
1       1553
2       1197
<NA>       2
Check columns for covariate file: Index(['IID', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'phenotype', 'SEX', 'AGE',
       'AAO', 'label', 'PHENO'],
      dtype='object')
Number of PD cases missing age: 835
Number of controls missing age: 809
WORKING ON: AAC
Related individuals: (28, 9)
Unrelated individuals: (1181, 6)
PCs: (58209, 6)
Sex value counts - original:
 SEX
Female                        690
Male                          489
Other/Unknown/Not Reported      2
Sex value counts - recoded:
SEX
2       690
1       489
<NA>      2
Check columns for covariate file: Index(['IID', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'phenotype', 'SEX', 'AGE',
       'AAO', 'label', 'PHENO'],
      dtype='obje
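
The covariate-building loop above is long, so here is a sketch of its main steps on a toy table: dropping one member of each related pair, recoding phenotype and sex with unmapped values left missing, adding the leading RVtests columns, and writing missing values as the text `NA`. All sample IDs and values are made up, and the relatedness table is reduced to two columns.

```python
import io
import pandas as pd

# Toy master key (hypothetical IDs and values)
key = pd.DataFrame({
    'IID': ['S1', 'S2', 'S3', 'S4'],
    'phenotype': ['PD', 'Control', 'Other', 'PD'],
    'SEX': ['Male', 'Female', 'Other/Unknown/Not Reported', 'Female'],
    'AGE': [61.0, None, 70.0, 55.0],
})

# Toy relatedness table: drop one member (IID1) of each related pair
related = pd.DataFrame({'IID1': ['S4'], 'IID2': ['S1']})
key = key[~key['IID'].isin(related['IID1'])].copy()

# Unmapped categories become <NA> rather than -9
key['PHENO'] = key['phenotype'].map({'PD': 2, 'Control': 1}).astype('Int64')
key['SEX'] = key['SEX'].map({'Female': 2, 'Male': 1}).astype('Int64')

# Leading columns in the order RVtests expects
key['FID'] = 0
key['FATID'] = 0
key['MATID'] = 0
cov = key[['FID', 'IID', 'FATID', 'MATID', 'SEX', 'AGE', 'PHENO']]

# na_rep='NA' writes missing values as the text NA instead of an empty field
buffer = io.StringIO()
cov.to_csv(buffer, sep='\t', na_rep='NA', index=False)
text = buffer.getvalue()
print(text)
```

Writing to an in-memory buffer makes it easy to inspect the exact text that would land in the covariate file before running it on the real data.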

## 4. Annotation of the gene TMEM175

* Extract the region using PLINK

* Extract TMEM175 gene

TMEM175 coordinates: 
* Chromosome 4:932387-958656 (GRCh38/hg38)
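
The window passed to PLINK in this section (882387 to 1008656) corresponds to the gene coordinates padded by 50 kb on each side. A small sketch of that arithmetic, where the 50 kb flank size is inferred from the numbers rather than stated in the notebook:

```python
# TMEM175 gene boundaries on chromosome 4 (GRCh38/hg38), as listed above
GENE_START = 932387
GENE_END = 958656

# Assumed flank size: 50 kb on each side
FLANK_BP = 50_000

# Clamp at 1 so the window never starts before the chromosome
from_bp = max(1, GENE_START - FLANK_BP)
to_bp = GENE_END + FLANK_BP

print(from_bp, to_bp)  # 882387 1008656
```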

In [None]:
# Extract the TMEM175 region (gene coordinates ± 50 kb) using plink2
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'

    ! /home/jupyter/tools/plink2 \
    --pfile {REL7_PATH}/imputed_genotypes/{ancestry}/chr4_{ancestry}_release7_vwb \
    --chr 4 \
    --from-bp 882387 \
    --to-bp 1008656 \
    --make-bed \
    --out {WORK_DIR}/{ancestry}_TMEM175

In [None]:
# Visualize bim file
WORK_DIR = '/home/jupyter/workspace/ws_files/TMEM175/'
! head {WORK_DIR}/TMEM175_EUR/EUR_TMEM175.bim

4	chr4:882390:C:A	0	882390	A	C
4	chr4:882397:C:T	0	882397	T	C
4	chr4:882422:C:T	0	882422	T	C
4	chr4:882432:G:A	0	882432	A	G
4	chr4:882454:C:T	0	882454	T	C
4	chr4:882471:G:A	0	882471	A	G
4	chr4:882483:C:G	0	882483	G	C
4	chr4:882498:C:G	0	882498	G	C
4	chr4:882507:G:A	0	882507	A	G
4	chr4:882517:C:A	0	882517	A	C
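
A PLINK 1 `.bim` file has six tab-separated columns without a header: chromosome, variant ID, genetic distance, base-pair position, allele 1 and allele 2. The sketch below loads the first rows shown above into pandas and checks that the positions fall inside the extraction window; the column names are labels chosen here, not part of the file.

```python
import io
import pandas as pd

# First rows of the EUR .bim file shown above
bim_text = """4\tchr4:882390:C:A\t0\t882390\tA\tC
4\tchr4:882397:C:T\t0\t882397\tT\tC
4\tchr4:882422:C:T\t0\t882422\tT\tC
"""

# The six .bim columns; the names are our own labels
BIM_COLUMNS = ['chrom', 'variant_id', 'cm', 'pos', 'allele1', 'allele2']

bim = pd.read_csv(io.StringIO(bim_text), sep='\t', header=None, names=BIM_COLUMNS)

# Check that every variant falls inside the extraction window
in_window = bim['pos'].between(882387, 1008656)
print(bim)
print(in_window.all())
```

On the real data, `io.StringIO(bim_text)` would be replaced by the path to the `.bim` file.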


In [None]:
# Visualize fam file
! head {WORK_DIR}/TMEM175_EUR/EUR_TMEM175.fam

In [21]:
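# Save the first sample of each .fam file; it is used below to build a small single-sample VCF for annotation
for ancestry in ancestries: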
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}/'
    
    ! head -n 1 {WORK_DIR}/{ancestry}_TMEM175.fam > {WORK_DIR}/{ancestry}_s1.txt

In [None]:
! head /home/jupyter/workspace/ws_files/TMEM175/TMEM175_EUR/EUR_s1.txt

### Turn binary files into VCF

In [None]:
for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
    
    # Subset the binary files to the single sample listed in {ancestry}_s1.txt
    ! /home/jupyter/tools/plink2 \
    --bfile {WORK_DIR}/{ancestry}_TMEM175 \
    --keep {WORK_DIR}/{ancestry}_s1.txt \
    --make-bed \
    --out {WORK_DIR}/{ancestry}_TMEM175_v1

PLINK v2.00a6LM 64-bit Intel (4 Jul 2024)      www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175_v1.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175
  --keep /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_s1.txt
  --make-bed
  --out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175_v1

Start time: Wed Jul 31 10:13:40 2024
52223 MiB RAM detected, ~50482 available; reserving 26111 MiB for main
workspace.
Using up to 8 compute threads.
2643 samples (1163 females, 1480 males; 2643 founders) loaded from
/home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175.fam.
3853 variants loaded from
/home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175.bim.
1 binary phenotype loaded (942 cases, 1679 controls).
--keep: 1 sample remaining.
1 sample (0 females, 1 male; 1 found

In [None]:
for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
    
    # Turn binary files into VCF
    ! /home/jupyter/tools/plink2 \
    --bfile {WORK_DIR}/{ancestry}_TMEM175_v1 \
    --export vcf id-paste=fid \
    --out {WORK_DIR}/{ancestry}_TMEM175_v1

PLINK v2.00a6LM 64-bit Intel (4 Jul 2024)      www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175_v1.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175_v1
  --export vcf-fid
  --out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175_v1

Start time: Wed Jul 31 10:15:08 2024
Note: --export 'vcf-fid' modifier is deprecated.  Use 'vcf' + 'id-paste=fid'.
52223 MiB RAM detected, ~50568 available; reserving 26111 MiB for main
workspace.
Using up to 8 compute threads.
1 sample (0 females, 1 male; 1 founder) loaded from
/home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175_v1.fam.
3853 variants loaded from
/home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175_v1.bim.
1 binary phenotype loaded (1 case, 0 controls).
--export vcf to
/home/jupyter/workspace/ws_files/TMEM175/TMEM175_

In [None]:
## Bgzip and Tabix (zip and index the file)
for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
    ! bgzip -f {WORK_DIR}/{ancestry}_TMEM175_v1.vcf
    ! tabix -f -p vcf {WORK_DIR}/{ancestry}_TMEM175_v1.vcf.gz 
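
Because bgzip output is gzip-compatible, the compressed VCF can be inspected with Python's standard `gzip` module. The helper below, whose name `vcf_sample_ids` is hypothetical, returns the sample IDs from the `#CHROM` header line, which is one way to confirm that the file holds a single sample. The demo runs on a tiny made-up VCF written to a temporary folder.

```python
import gzip
import os
import tempfile

def vcf_sample_ids(vcf_gz_path):
    """Return the sample IDs from the #CHROM header line of a (b)gzipped VCF."""
    with gzip.open(vcf_gz_path, 'rt') as handle:
        for line in handle:
            if line.startswith('#CHROM'):
                # The first 9 columns are fixed VCF fields; sample IDs follow
                return line.rstrip('\n').split('\t')[9:]
            if not line.startswith('#'):
                # Reached the variant records without finding a header line
                break
    return []

# Demonstrate on a tiny throwaway VCF (hypothetical sample ID and record)
demo_vcf = (
    '##fileformat=VCFv4.2\n'
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE_1\n'
    '4\t882390\tchr4:882390:C:A\tC\tA\t.\t.\tPR\tGT\t0/0\n'
)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.vcf.gz')
    with gzip.open(demo_path, 'wt') as handle:
        handle.write(demo_vcf)
    samples = vcf_sample_ids(demo_path)

print(samples)
```

When pointing the helper at a real file, pass an absolute path or call `os.path.expanduser` first, since `gzip.open` does not expand `~`.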

### Annotate using ANNOVAR

In [None]:
# annotate using ANNOVAR

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
    
    ! perl /home/jupyter/tools/annovar/table_annovar.pl {WORK_DIR}/{ancestry}_TMEM175_v1.vcf.gz /home/jupyter/tools/annovar/humandb/ -buildver hg38 \
    -out {WORK_DIR}/{ancestry}_TMEM175.annovar \
    -remove -protocol refGene,clinvar_20140902 \
    -operation g,f \
    --nopolish \
    -nastring . \
    -vcfinput


NOTICE: Running with system command <convert2annovar.pl  -includeinfo -allsample -withfreq -format vcf4 /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175_v1.vcf.gz > /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175.annovar.avinput>
NOTICE: Finished reading 3860 lines from VCF file
NOTICE: A total of 3853 locus in VCF file passed QC threshold, representing 3628 SNPs (2760 transitions and 868 transversions) and 225 indels/substitutions
NOTICE: Finished writing allele frequencies based on 3628 SNP genotypes (2760 transitions and 868 transversions) and 225 indels/substitutions for 1 samples

NOTICE: Running with system command </home/jupyter/tools/annovar/table_annovar.pl /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175.annovar.avinput /home/jupyter/tools/annovar/humandb/ -buildver hg38 -outfile /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175.annovar -remove -protocol refGene,clinvar_20140902 -operation g,f --nopolish -nastri

### AAC

In [None]:
# Read in ANNOVAR multianno file (explicit path, since WORK_DIR is reassigned inside the loops above)
gene = pd.read_csv('~/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,882400,882400,G,A,intronic,GAK,.,.,.,.,0.0,.,.,4,882400,chr4:882400:G:A,G,A,.,.,PR,GT,0/0
1,4,882432,882432,G,A,intronic,GAK,.,.,.,.,0.0,.,.,4,882432,chr4:882432:G:A,G,A,.,.,PR,GT,0/0
2,4,882455,882455,G,A,intronic,GAK,.,.,.,.,0.0,.,.,4,882455,chr4:882455:G:A,G,A,.,.,PR,GT,0/0
3,4,882471,882471,G,A,intronic,GAK,.,.,.,.,0.0,.,.,4,882471,chr4:882471:G:A,G,A,.,.,PR,GT,0/0
4,4,882498,882498,C,G,intronic,GAK,.,.,.,.,0.0,.,.,4,882498,chr4:882498:C:G,C,G,.,.,PR,GT,0/0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3444,4,1008499,1008499,C,T,intergenic,IDUA;FGFRL1,dist=3935;dist=3127,.,.,.,0.0,.,.,4,1008499,chr4:1008499:C:T,C,T,.,.,PR,GT,0/0
3445,4,1008515,1008515,C,T,intergenic,IDUA;FGFRL1,dist=3951;dist=3111,.,.,.,0.0,.,.,4,1008515,chr4:1008515:C:T,C,T,.,.,PR,GT,0/0
3446,4,1008535,1008535,G,C,intergenic,IDUA;FGFRL1,dist=3971;dist=3091,.,.,.,0.0,.,.,4,1008535,chr4:1008535:G:C,G,C,.,.,PR,GT,0/0
3447,4,1008649,1008649,C,T,intergenic,IDUA;FGFRL1,dist=4085;dist=2977,.,.,.,0.0,.,.,4,1008649,chr4:1008649:C:T,C,T,.,.,PR,GT,0/0


In [30]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic           2746
exonic              287
intergenic          175
UTR3                102
downstream           58
UTR5                 42
upstream             36
splicing              2
exonic;splicing       1
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [32]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV         155
synonymous SNV            123
stopgain                    4
nonframeshift deletion      3
frameshift deletion         2
Name: count, dtype: int64
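
The exonic filter above uses an exact match on `Func.refGene`, so the variant labelled `exonic;splicing` is left out of the `ExonicFunc.refGene` counts. The same two summaries are also repeated for every ancestry. The sketch below wraps them in one helper (the name `summarize_annotation` is hypothetical) and splits the label on `;` so combined labels count as exonic, while labels such as `ncRNA_exonic` do not. The toy rows are made up.

```python
import pandas as pd

def summarize_annotation(gene_df):
    """Return (functional-class counts, exonic-function counts) for a multianno table."""
    func_counts = gene_df['Func.refGene'].value_counts()
    # Split on ';' so 'exonic;splicing' also counts as exonic
    is_exonic = gene_df['Func.refGene'].str.split(';').apply(lambda parts: 'exonic' in parts)
    exonic_counts = gene_df.loc[is_exonic, 'ExonicFunc.refGene'].value_counts()
    return func_counts, exonic_counts

# Toy table with hypothetical rows
toy = pd.DataFrame({
    'Func.refGene': ['intronic', 'exonic', 'exonic;splicing', 'ncRNA_exonic'],
    'ExonicFunc.refGene': ['.', 'nonsynonymous SNV', 'synonymous SNV', '.'],
})
func_counts, exonic_counts = summarize_annotation(toy)
print(func_counts)
print(exonic_counts)
```

Whether `exonic;splicing` variants belong in the coding set is an analysis choice; the exact-match filter remains valid if they should be excluded.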

### AFR

In [None]:
# Read in ANNOVAR multianno file (explicit path, since WORK_DIR is reassigned inside the loops above)
gene = pd.read_csv('~/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,882400,882400,G,A,intronic,GAK,.,.,.,.,0,.,.,4,882400,chr4:882400:G:A,G,A,.,.,PR,GT,0/0
1,4,882432,882432,G,A,intronic,GAK,.,.,.,.,0,.,.,4,882432,chr4:882432:G:A,G,A,.,.,PR,GT,0/0
2,4,882454,882454,C,T,intronic,GAK,.,.,.,.,0,.,.,4,882454,chr4:882454:C:T,C,T,.,.,PR,GT,0/0
3,4,882471,882471,G,A,intronic,GAK,.,.,.,.,0,.,.,4,882471,chr4:882471:G:A,G,A,.,.,PR,GT,0/0
4,4,882498,882498,C,G,intronic,GAK,.,.,.,.,0,.,.,4,882498,chr4:882498:C:G,C,G,.,.,PR,GT,0/0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3848,4,1008499,1008499,C,T,intergenic,IDUA;FGFRL1,dist=3935;dist=3127,.,.,.,0,.,.,4,1008499,chr4:1008499:C:T,C,T,.,.,PR,GT,0/0
3849,4,1008524,1008524,G,T,intergenic,IDUA;FGFRL1,dist=3960;dist=3102,.,.,.,0,.,.,4,1008524,chr4:1008524:G:T,G,T,.,.,PR,GT,0/0
3850,4,1008620,1008620,G,T,intergenic,IDUA;FGFRL1,dist=4056;dist=3006,.,.,.,0,.,.,4,1008620,chr4:1008620:G:T,G,T,.,.,PR,GT,0/0
3851,4,1008649,1008649,C,T,intergenic,IDUA;FGFRL1,dist=4085;dist=2977,.,.,.,0,.,.,4,1008649,chr4:1008649:C:T,C,T,.,.,PR,GT,0/0


In [34]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic           3069
exonic              314
intergenic          203
UTR3                118
downstream           63
UTR5                 45
upstream             40
exonic;splicing       1
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [36]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV          161
synonymous SNV             143
frameshift deletion          4
nonframeshift deletion       3
stopgain                     2
nonframeshift insertion      1
Name: count, dtype: int64
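
The annotated window spans more than TMEM175 (the tables list GAK, IDUA and FGFRL1, for instance), so the counts above cover every gene in the region. A hedged sketch of how the table could be restricted to TMEM175 alone: the helper `subset_to_gene` is hypothetical, it assumes that `Gene.refGene` joins multiple genes with `;`, and it drops intergenic rows because those list the flanking genes rather than a gene the variant sits in. The toy rows are made up.

```python
import pandas as pd

def subset_to_gene(gene_df, symbol='TMEM175'):
    """Keep non-intergenic rows whose Gene.refGene field lists `symbol`."""
    has_gene = gene_df['Gene.refGene'].str.split(';').apply(lambda genes: symbol in genes)
    # Intergenic rows name the neighbouring genes, so exclude them
    not_intergenic = gene_df['Func.refGene'] != 'intergenic'
    return gene_df[has_gene & not_intergenic].copy()

# Toy table with hypothetical rows
toy = pd.DataFrame({
    'Gene.refGene': ['GAK', 'TMEM175', 'TMEM175', 'GAK;TMEM175', 'IDUA;FGFRL1'],
    'Func.refGene': ['intronic', 'exonic', 'intronic', 'intergenic', 'intergenic'],
})
tmem175_only = subset_to_gene(toy)
print(tmem175_only)
```

Applying the functional-class and exonic-function counts to `subset_to_gene(gene)` would then give per-gene numbers.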

### AJ

In [None]:
# Read in ANNOVAR multianno file (explicit path, since WORK_DIR is reassigned inside the loops above)
gene = pd.read_csv('~/workspace/ws_files/TMEM175/TMEM175_AJ/AJ_TMEM175.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,882498,882498,C,G,intronic,GAK,.,.,.,.,0.0,.,.,4,882498,chr4:882498:C:G,C,G,.,.,PR,GT,0/0
1,4,882530,882530,C,T,intronic,GAK,.,.,.,.,0.0,.,.,4,882530,chr4:882530:C:T,C,T,.,.,PR,GT,0/0
2,4,882611,882611,T,C,intronic,GAK,.,.,.,.,0.0,.,.,4,882611,chr4:882611:T:C,T,C,.,.,PR,GT,0/0
3,4,882745,882745,G,A,exonic,GAK,.,synonymous SNV,"GAK:NM_001318134:exon11:c.C1242T:p.H414H,GAK:N...",.,0.0,.,.,4,882745,chr4:882745:G:A,G,A,.,.,PR,GT,0/0
4,4,882801,882801,C,T,exonic,GAK,.,nonsynonymous SNV,"GAK:NM_001318134:exon11:c.G1186A:p.A396T,GAK:N...",.,0.0,.,.,4,882801,chr4:882801:C:T,C,T,.,.,PR,GT,0/0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1280,4,1008337,1008337,C,T,intergenic,IDUA;FGFRL1,dist=3773;dist=3289,.,.,.,1.0,.,.,4,1008337,chr4:1008337:C:T,C,T,.,.,PR,GT,1/1
1281,4,1008371,1008371,C,-,intergenic,IDUA;FGFRL1,dist=3807;dist=3255,.,.,.,0.0,.,.,4,1008370,chr4:1008370:TC:T,TC,T,.,.,PR,GT,0/0
1282,4,1008480,1008480,A,G,intergenic,IDUA;FGFRL1,dist=3916;dist=3146,.,.,.,0.0,.,.,4,1008480,chr4:1008480:A:G,A,G,.,.,PR,GT,0/0
1283,4,1008538,1008538,A,G,intergenic,IDUA;FGFRL1,dist=3974;dist=3088,.,.,.,0.0,.,.,4,1008538,chr4:1008538:A:G,A,G,.,.,PR,GT,0/0


In [38]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic           1024
exonic              125
intergenic           60
UTR3                 43
downstream           20
upstream              6
UTR5                  6
exonic;splicing       1
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [40]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV         66
synonymous SNV            55
frameshift deletion        2
stopgain                   1
nonframeshift deletion     1
Name: count, dtype: int64

### AMR

In [None]:
# Read in ANNOVAR multianno file (explicit path, since WORK_DIR is reassigned inside the loops above)
gene = pd.read_csv('~/workspace/ws_files/TMEM175/TMEM175_AMR/AMR_TMEM175.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,882397,882397,C,T,intronic,GAK,.,.,.,.,0,.,.,4,882397,chr4:882397:C:T,C,T,.,.,PR,GT,0/0
1,4,882400,882400,G,A,intronic,GAK,.,.,.,.,0,.,.,4,882400,chr4:882400:G:A,G,A,.,.,PR,GT,0/0
2,4,882432,882432,G,A,intronic,GAK,.,.,.,.,0,.,.,4,882432,chr4:882432:G:A,G,A,.,.,PR,GT,0/0
3,4,882454,882454,C,T,intronic,GAK,.,.,.,.,0,.,.,4,882454,chr4:882454:C:T,C,T,.,.,PR,GT,0/0
4,4,882498,882498,C,G,intronic,GAK,.,.,.,.,0,.,.,4,882498,chr4:882498:C:G,C,G,.,.,PR,GT,0/0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1982,4,1008395,1008395,C,T,intergenic,IDUA;FGFRL1,dist=3831;dist=3231,.,.,.,0,.,.,4,1008395,chr4:1008395:C:T,C,T,.,.,PR,GT,0/0
1983,4,1008480,1008480,A,G,intergenic,IDUA;FGFRL1,dist=3916;dist=3146,.,.,.,0,.,.,4,1008480,chr4:1008480:A:G,A,G,.,.,PR,GT,0/0
1984,4,1008515,1008515,C,T,intergenic,IDUA;FGFRL1,dist=3951;dist=3111,.,.,.,0,.,.,4,1008515,chr4:1008515:C:T,C,T,.,.,PR,GT,0/0
1985,4,1008571,1008571,G,A,intergenic,IDUA;FGFRL1,dist=4007;dist=3055,.,.,.,0,.,.,4,1008571,chr4:1008571:G:A,G,A,.,.,PR,GT,0/0


In [42]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic           1596
exonic              178
intergenic          104
UTR3                 56
downstream           23
UTR5                 17
upstream             12
exonic;splicing       1
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [44]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV          89
synonymous SNV             85
frameshift deletion         2
stopgain                    1
nonframeshift insertion     1
Name: count, dtype: int64

### CAH

In [None]:
# Read in ANNOVAR multianno file (explicit path, since WORK_DIR is reassigned inside the loops above)
gene = pd.read_csv('~/workspace/ws_files/TMEM175/TMEM175_CAH/CAH_TMEM175.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,882400,882400,G,A,intronic,GAK,.,.,.,.,0,.,.,4,882400,chr4:882400:G:A,G,A,.,.,PR,GT,0/0
1,4,882432,882432,G,A,intronic,GAK,.,.,.,.,0,.,.,4,882432,chr4:882432:G:A,G,A,.,.,PR,GT,0/0
2,4,882454,882454,C,T,intronic,GAK,.,.,.,.,0,.,.,4,882454,chr4:882454:C:T,C,T,.,.,PR,GT,0/0
3,4,882498,882498,C,G,intronic,GAK,.,.,.,.,0,.,.,4,882498,chr4:882498:C:G,C,G,.,.,PR,GT,0/0
4,4,882524,882524,A,G,intronic,GAK,.,.,.,.,0,.,.,4,882524,chr4:882524:A:G,A,G,.,.,PR,GT,0/0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3085,4,1008480,1008480,A,G,intergenic,IDUA;FGFRL1,dist=3916;dist=3146,.,.,.,0,.,.,4,1008480,chr4:1008480:A:G,A,G,.,.,PR,GT,0/0
3086,4,1008515,1008515,C,T,intergenic,IDUA;FGFRL1,dist=3951;dist=3111,.,.,.,0,.,.,4,1008515,chr4:1008515:C:T,C,T,.,.,PR,GT,0/0
3087,4,1008570,1008570,C,T,intergenic,IDUA;FGFRL1,dist=4006;dist=3056,.,.,.,0,.,.,4,1008570,chr4:1008570:C:T,C,T,.,.,PR,GT,0/0
3088,4,1008649,1008649,C,T,intergenic,IDUA;FGFRL1,dist=4085;dist=2977,.,.,.,0,.,.,4,1008649,chr4:1008649:C:T,C,T,.,.,PR,GT,0/0


In [46]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic           2467
exonic              260
intergenic          168
UTR3                 87
downstream           46
UTR5                 31
upstream             30
exonic;splicing       1
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [48]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV         139
synonymous SNV            113
stopgain                    4
frameshift deletion         2
nonframeshift deletion      1
frameshift insertion        1
Name: count, dtype: int64

### CAS

In [None]:
# Read in ANNOVAR multianno file (explicit path, since WORK_DIR is reassigned inside the loops above)
gene = pd.read_csv('~/workspace/ws_files/TMEM175/TMEM175_CAS/CAS_TMEM175.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,882429,882429,G,C,intronic,GAK,.,.,.,.,0,.,.,4,882429,chr4:882429:G:C,G,C,.,.,PR,GT,0/0
1,4,882432,882432,G,A,intronic,GAK,.,.,.,.,0,.,.,4,882432,chr4:882432:G:A,G,A,.,.,PR,GT,0/0
2,4,882468,882468,C,G,intronic,GAK,.,.,.,.,0,.,.,4,882468,chr4:882468:C:G,C,G,.,.,PR,GT,0/0
3,4,882498,882498,C,G,intronic,GAK,.,.,.,.,0,.,.,4,882498,chr4:882498:C:G,C,G,.,.,PR,GT,0/0
4,4,882501,882501,C,A,intronic,GAK,.,.,.,.,0,.,.,4,882501,chr4:882501:C:A,C,A,.,.,PR,GT,0/0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1950,4,1008371,1008371,C,-,intergenic,IDUA;FGFRL1,dist=3807;dist=3255,.,.,.,0,.,.,4,1008370,chr4:1008370:TC:T,TC,T,.,.,PR,GT,0/0
1951,4,1008480,1008480,A,G,intergenic,IDUA;FGFRL1,dist=3916;dist=3146,.,.,.,0,.,.,4,1008480,chr4:1008480:A:G,A,G,.,.,PR,GT,0/0
1952,4,1008538,1008538,A,G,intergenic,IDUA;FGFRL1,dist=3974;dist=3088,.,.,.,0,.,.,4,1008538,chr4:1008538:A:G,A,G,.,.,PR,GT,0/0
1953,4,1008570,1008570,C,T,intergenic,IDUA;FGFRL1,dist=4006;dist=3056,.,.,.,0,.,.,4,1008570,chr4:1008570:C:T,C,T,.,.,PR,GT,0/0


In [50]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic           1568
exonic              167
intergenic           96
UTR3                 62
UTR5                 26
downstream           22
upstream             13
exonic;splicing       1
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [52]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV      89
synonymous SNV         73
stopgain                4
frameshift deletion     1
Name: count, dtype: int64

### EAS

In [None]:
# Read in ANNOVAR multianno file (explicit path, since WORK_DIR is reassigned inside the loops above)
gene = pd.read_csv('~/workspace/ws_files/TMEM175/TMEM175_EAS/EAS_TMEM175.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,882390,882390,C,G,intronic,GAK,.,.,.,.,0,.,.,4,882390,chr4:882390:C:G,C,G,.,.,PR,GT,0/0
1,4,882432,882432,G,A,intronic,GAK,.,.,.,.,0,.,.,4,882432,chr4:882432:G:A,G,A,.,.,PR,GT,0/0
2,4,882501,882501,C,A,intronic,GAK,.,.,.,.,0,.,.,4,882501,chr4:882501:C:A,C,A,.,.,PR,GT,0/0
3,4,882507,882507,G,A,intronic,GAK,.,.,.,.,0,.,.,4,882507,chr4:882507:G:A,G,A,.,.,PR,GT,0/0
4,4,882530,882530,C,T,intronic,GAK,.,.,.,.,1,.,.,4,882530,chr4:882530:C:T,C,T,.,.,PR,GT,1/1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3166,4,1008570,1008570,C,T,intergenic,IDUA;FGFRL1,dist=4006;dist=3056,.,.,.,0,.,.,4,1008570,chr4:1008570:C:T,C,T,.,.,PR,GT,0/0
3167,4,1008634,1008634,A,G,intergenic,IDUA;FGFRL1,dist=4070;dist=2992,.,.,.,0,.,.,4,1008634,chr4:1008634:A:G,A,G,.,.,PR,GT,0/0
3168,4,1008636,1008636,C,A,intergenic,IDUA;FGFRL1,dist=4072;dist=2990,.,.,.,0,.,.,4,1008636,chr4:1008636:C:A,C,A,.,.,PR,GT,0/0
3169,4,1008644,1008644,T,C,intergenic,IDUA;FGFRL1,dist=4080;dist=2982,.,.,.,0,.,.,4,1008644,chr4:1008644:T:C,T,C,.,.,PR,GT,0/0


In [54]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic           2518
exonic              285
intergenic          174
UTR3                 99
UTR5                 43
downstream           33
upstream             17
exonic;splicing       1
splicing              1
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [56]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV         164
synonymous SNV            109
stopgain                    6
frameshift deletion         4
nonframeshift deletion      2
Name: count, dtype: int64

### MDE

In [None]:
# Read in ANNOVAR multianno file (explicit path, since WORK_DIR is reassigned inside the loops above)
gene = pd.read_csv('~/workspace/ws_files/TMEM175/TMEM175_MDE/MDE_TMEM175.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,882432,882432,G,A,intronic,GAK,.,.,.,.,0.0,.,.,4,882432,chr4:882432:G:A,G,A,.,.,PR,GT,0/0
1,4,882471,882471,G,A,intronic,GAK,.,.,.,.,0.0,.,.,4,882471,chr4:882471:G:A,G,A,.,.,PR,GT,0/0
2,4,882498,882498,C,G,intronic,GAK,.,.,.,.,0.0,.,.,4,882498,chr4:882498:C:G,C,G,.,.,PR,GT,0/0
3,4,882517,882517,C,A,intronic,GAK,.,.,.,.,0.0,.,.,4,882517,chr4:882517:C:A,C,A,.,.,PR,GT,0/0
4,4,882530,882530,C,T,intronic,GAK,.,.,.,.,0.5,.,.,4,882530,chr4:882530:C:T,C,T,.,.,PR,GT,0/1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1940,4,1008371,1008371,C,-,intergenic,IDUA;FGFRL1,dist=3807;dist=3255,.,.,.,0.0,.,.,4,1008370,chr4:1008370:TC:T,TC,T,.,.,PR,GT,0/0
1941,4,1008480,1008480,A,G,intergenic,IDUA;FGFRL1,dist=3916;dist=3146,.,.,.,0.5,.,.,4,1008480,chr4:1008480:A:G,A,G,.,.,PR,GT,0/1
1942,4,1008515,1008515,C,T,intergenic,IDUA;FGFRL1,dist=3951;dist=3111,.,.,.,0.0,.,.,4,1008515,chr4:1008515:C:T,C,T,.,.,PR,GT,0/0
1943,4,1008521,1008521,T,C,intergenic,IDUA;FGFRL1,dist=3957;dist=3105,.,.,.,0.0,.,.,4,1008521,chr4:1008521:T:C,T,C,.,.,PR,GT,0/0


In [58]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic           1547
exonic              177
intergenic           94
UTR3                 59
downstream           34
UTR5                 22
upstream             11
exonic;splicing       1
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [60]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
synonymous SNV         91
nonsynonymous SNV      84
frameshift deletion     1
stopgain                1
Name: count, dtype: int64

### SAS

In [None]:
# Read in ANNOVAR multianno file (explicit path, since WORK_DIR is reassigned inside the loops above)
gene = pd.read_csv('~/workspace/ws_files/TMEM175/TMEM175_SAS/SAS_TMEM175.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,882397,882397,C,T,intronic,GAK,.,.,.,.,0,.,.,4,882397,chr4:882397:C:T,C,T,.,.,PR,GT,0/0
1,4,882530,882530,C,T,intronic,GAK,.,.,.,.,0,.,.,4,882530,chr4:882530:C:T,C,T,.,.,PR,GT,0/0
2,4,882566,882566,G,-,intronic,GAK,.,.,.,.,0,.,.,4,882565,chr4:882565:AG:A,AG,A,.,.,PR,GT,0/0
3,4,882611,882611,T,C,intronic,GAK,.,.,.,.,0,.,.,4,882611,chr4:882611:T:C,T,C,.,.,PR,GT,0/0
4,4,882678,882678,C,G,intronic,GAK,.,.,.,.,0,.,.,4,882678,chr4:882678:C:G,C,G,.,.,PR,GT,0/0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1723,4,1008246,1008246,C,T,intergenic,IDUA;FGFRL1,dist=3682;dist=3380,.,.,.,0,.,.,4,1008246,chr4:1008246:C:T,C,T,.,.,PR,GT,0/0
1724,4,1008337,1008337,C,T,intergenic,IDUA;FGFRL1,dist=3773;dist=3289,.,.,.,0.5,.,.,4,1008337,chr4:1008337:C:T,C,T,.,.,PR,GT,0/1
1725,4,1008371,1008371,C,-,intergenic,IDUA;FGFRL1,dist=3807;dist=3255,.,.,.,0,.,.,4,1008370,chr4:1008370:TC:T,TC,T,.,.,PR,GT,0/0
1726,4,1008480,1008480,A,G,intergenic,IDUA;FGFRL1,dist=3916;dist=3146,.,.,.,0,.,.,4,1008480,chr4:1008480:A:G,A,G,.,.,PR,GT,0/0


In [62]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic           1412
exonic              143
intergenic           72
UTR3                 50
UTR5                 24
downstream           18
upstream              8
exonic;splicing       1
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [64]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV      75
synonymous SNV         64
stopgain                3
frameshift deletion     1
Name: count, dtype: int64

### FIN

In [None]:
# Read in ANNOVAR multianno file
gene = pd.read_csv(f'{WORK_DIR}/TMEM175_FIN/FIN_TMEM175.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,882530,882530,C,T,intronic,GAK,.,.,.,.,0.5,.,.,4,882530,chr4:882530:C:T,C,T,.,.,PR,GT,0/1
1,4,882611,882611,T,C,intronic,GAK,.,.,.,.,0,.,.,4,882611,chr4:882611:T:C,T,C,.,.,PR,GT,0/0
2,4,882849,882849,G,A,intronic,GAK,.,.,.,.,0,.,.,4,882849,chr4:882849:G:A,G,A,.,.,PR,GT,0/0
3,4,883018,883018,G,A,intronic,GAK,.,.,.,.,0,.,.,4,883018,chr4:883018:G:A,G,A,.,.,PR,GT,0/0
4,4,883133,883133,G,A,intronic,GAK,.,.,.,.,0,.,.,4,883133,chr4:883133:G:A,G,A,.,.,PR,GT,0/0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
600,4,1008009,1008009,C,T,intergenic,IDUA;FGFRL1,dist=3445;dist=3617,.,.,.,0.5,.,.,4,1008009,chr4:1008009:C:T,C,T,.,.,PR,GT,0/1
601,4,1008209,1008209,C,T,intergenic,IDUA;FGFRL1,dist=3645;dist=3417,.,.,.,0,.,.,4,1008209,chr4:1008209:C:T,C,T,.,.,PR,GT,0/0
602,4,1008337,1008337,C,T,intergenic,IDUA;FGFRL1,dist=3773;dist=3289,.,.,.,1,.,.,4,1008337,chr4:1008337:C:T,C,T,.,.,PR,GT,1/1
603,4,1008371,1008371,C,-,intergenic,IDUA;FGFRL1,dist=3807;dist=3255,.,.,.,0,.,.,4,1008370,chr4:1008370:TC:T,TC,T,.,.,PR,GT,0/0


In [66]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic           491
exonic              50
intergenic          29
UTR3                20
downstream           6
upstream             5
UTR5                 3
exonic;splicing      1
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [68]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
synonymous SNV       27
nonsynonymous SNV    23
Name: count, dtype: int64

### EUR

In [None]:
# Read in ANNOVAR multianno file
gene = pd.read_csv(f'{WORK_DIR}/TMEM175_EUR/EUR_TMEM175.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,882390,882390,C,A,intronic,GAK,.,.,.,.,0,.,.,4,882390,chr4:882390:C:A,C,A,.,.,PR,GT,0/0
1,4,882397,882397,C,T,intronic,GAK,.,.,.,.,0,.,.,4,882397,chr4:882397:C:T,C,T,.,.,PR,GT,0/0
2,4,882422,882422,C,T,intronic,GAK,.,.,.,.,0,.,.,4,882422,chr4:882422:C:T,C,T,.,.,PR,GT,0/0
3,4,882432,882432,G,A,intronic,GAK,.,.,.,.,0,.,.,4,882432,chr4:882432:G:A,G,A,.,.,PR,GT,0/0
4,4,882454,882454,C,T,intronic,GAK,.,.,.,.,0,.,.,4,882454,chr4:882454:C:T,C,T,.,.,PR,GT,0/0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9499,4,1008570,1008570,C,T,intergenic,IDUA;FGFRL1,dist=4006;dist=3056,.,.,.,0,.,.,4,1008570,chr4:1008570:C:T,C,T,.,.,PR,GT,0/0
9500,4,1008571,1008571,G,A,intergenic,IDUA;FGFRL1,dist=4007;dist=3055,.,.,.,0,.,.,4,1008571,chr4:1008571:G:A,G,A,.,.,PR,GT,0/0
9501,4,1008596,1008596,C,T,intergenic,IDUA;FGFRL1,dist=4032;dist=3030,.,.,.,.,.,.,4,1008596,chr4:1008596:C:T,C,T,.,.,PR,GT,./.
9502,4,1008600,1008600,C,T,intergenic,IDUA;FGFRL1,dist=4036;dist=3026,.,.,.,0,.,.,4,1008600,chr4:1008600:C:T,C,T,.,.,PR,GT,0/0


In [75]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic           7451
exonic              951
intergenic          474
UTR3                266
downstream          152
UTR5                114
upstream             90
splicing              5
exonic;splicing       1
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [77]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV          548
synonymous SNV             367
nonframeshift deletion      12
stopgain                    10
frameshift deletion          9
frameshift insertion         4
nonframeshift insertion      1
Name: count, dtype: int64
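
The per-ancestry cells above repeat the same steps: read the ANNOVAR table, count `Func.refGene`, then count `ExonicFunc.refGene` among exonic variants. A minimal sketch of a reusable helper is shown below. The function name `summarize_annotation` and the toy DataFrame are hypothetical and only illustrate the pattern on ANNOVAR-style columns; they are not part of the original workflow.

```python
import pandas as pd


def summarize_annotation(gene: pd.DataFrame) -> tuple[pd.Series, pd.Series]:
    """Return (functional-region counts, exonic-function counts)."""
    # Count variants per functional region (intronic, exonic, UTR3, ...)
    func_counts = gene["Func.refGene"].value_counts()
    # Restrict to strictly 'exonic' rows, as in the cells above
    coding = gene[gene["Func.refGene"] == "exonic"]
    exonic_counts = coding["ExonicFunc.refGene"].value_counts()
    return func_counts, exonic_counts


# Toy data (invented) with the two ANNOVAR columns used above
toy = pd.DataFrame({
    "Func.refGene": ["intronic", "exonic", "exonic", "UTR3", "exonic"],
    "ExonicFunc.refGene": [".", "synonymous SNV", "nonsynonymous SNV", ".", "nonsynonymous SNV"],
})

func_counts, exonic_counts = summarize_annotation(toy)
print(func_counts)
print(exonic_counts)
```

With the real data, the same call could be applied to each `{ancestry}_TMEM175.annovar.hg38_multianno.txt` table inside a loop over `ancestries`.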

### Make lists of variants to keep

In [None]:
# Make lists of variants to keep - all coding (exonic) variants, and nonsynonymous (missense, as coded in ANNOVAR) plus loss-of-function (LoF) variants

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
    print(f'WORKING ON: {ancestry}')
    
    # Read in ANNOVAR multianno file
    gene = pd.read_csv(f'{WORK_DIR}/{ancestry}_TMEM175.annovar.hg38_multianno.txt', sep = '\t')
    
    # Print number of variants in the different categories
    results = [] 
    
    utr5 = gene[gene['Func.refGene']== 'UTR5']
    intronic = gene[gene['Func.refGene']== 'intronic']
    exonic = gene[gene['Func.refGene']== 'exonic']
    utr3 = gene[gene['Func.refGene']== 'UTR3']
    coding_nonsynonymous = gene[(gene['Func.refGene'] == 'exonic') & (gene['ExonicFunc.refGene'] == 'nonsynonymous SNV')]
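    # Note: this selects every exonic variant that is not a nonsynonymous SNV, so it also includes LoF and in-frame indel variants
    coding_synonymous = gene[(gene['Func.refGene'] == 'exonic') & (gene['ExonicFunc.refGene'] != 'nonsynonymous SNV')]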
    lof = exonic[(exonic['ExonicFunc.refGene'] == 'stopgain') | (exonic['ExonicFunc.refGene'] == 'stoploss') | (exonic['ExonicFunc.refGene'] == 'frameshift deletion') | (exonic['ExonicFunc.refGene'] == 'frameshift insertion')]
    nonsynonymous_lof = pd.concat([coding_nonsynonymous, lof])
    
    print({ancestry})
    print('Total variants: ', len(gene))
    print("Intronic: ", len(intronic))
    print('UTR3: ', len(utr3))
    print('UTR5: ', len(utr5))
    print("Total exonic: ", len(exonic))
    print('  Synonymous: ', len(coding_synonymous))
    print("  Nonsynonymous: ", len(coding_nonsynonymous))
    print("nonsynonymous_lof: ", len(nonsynonymous_lof))
    results.append((gene, intronic, utr3, utr5, exonic, coding_synonymous, coding_nonsynonymous, nonsynonymous_lof))
    print('\n')
    
    # Save in PLINK range format - nonsynonymous (missense) plus loss-of-function variants
    # ANNOVAR codes missense variants as 'nonsynonymous SNV'; other protein-altering variants (e.g. stopgain/stoploss or frameshift) have their own ExonicFunc.refGene values
    variants_toKeep = nonsynonymous_lof[['Chr', 'Start', 'End', 'Gene.refGene']].copy()
    variants_toKeep.to_csv(f'{WORK_DIR}/{ancestry}_TMEM175.nonsynonymous_lof.variantstoKeep.txt', sep="\t", index=False, header=False)
    
    # Save in PLINK format - all coding variants
    variants_toKeep2 = exonic[['Chr', 'Start', 'End', 'Gene.refGene']].copy()
    variants_toKeep2.to_csv(f'{WORK_DIR}/{ancestry}_TMEM175.exonic.variantstoKeep.txt', sep="\t", index=False, header=False)

WORKING ON: AFR
{'AFR'}
Total variants:  3853
Intronic:  3069
UTR3:  118
UTR5:  45
Total exonic:  314
  Synonymous:  153
  Nonsynonymous:  161
nonsynonymous_lof:  167


WORKING ON: AAC
{'AAC'}
Total variants:  3449
Intronic:  2746
UTR3:  102
UTR5:  42
Total exonic:  287
  Synonymous:  132
  Nonsynonymous:  155
nonsynonymous_lof:  161


WORKING ON: FIN
{'FIN'}
Total variants:  605
Intronic:  491
UTR3:  20
UTR5:  3
Total exonic:  50
  Synonymous:  27
  Nonsynonymous:  23
nonsynonymous_lof:  23


WORKING ON: AMR
{'AMR'}
Total variants:  1987
Intronic:  1596
UTR3:  56
UTR5:  17
Total exonic:  178
  Synonymous:  89
  Nonsynonymous:  89
nonsynonymous_lof:  92


WORKING ON: EAS
{'EAS'}
Total variants:  3171
Intronic:  2518
UTR3:  99
UTR5:  43
Total exonic:  285
  Synonymous:  121
  Nonsynonymous:  164
nonsynonymous_lof:  174


WORKING ON: CAH
{'CAH'}
Total variants:  3090
Intronic:  2467
UTR3:  87
UTR5:  31
Total exonic:  260
  Synonymous:  121
  Nonsynonymous:  139
nonsynonymous_lof:  146
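
In the loop above, the "Synonymous" count is every exonic variant that is not a `nonsynonymous SNV`, so stop-gain, frameshift and in-frame indel variants are counted there too (for AFR, 153 + 161 = 314 total exonic). The sketch below shows one way to split the exonic variants into non-overlapping classes. The names `LOF_CLASSES` and `classify_exonic` and the toy rows are hypothetical and for illustration only.

```python
import pandas as pd

# ExonicFunc.refGene values treated as loss-of-function in the loop above
LOF_CLASSES = ["stopgain", "stoploss", "frameshift deletion", "frameshift insertion"]


def classify_exonic(gene: pd.DataFrame) -> dict:
    """Split exonic variants into mutually exclusive classes."""
    exonic = gene[gene["Func.refGene"] == "exonic"]
    func = exonic["ExonicFunc.refGene"]
    named = ["synonymous SNV", "nonsynonymous SNV", *LOF_CLASSES]
    return {
        "synonymous": exonic[func == "synonymous SNV"],
        "nonsynonymous": exonic[func == "nonsynonymous SNV"],
        "lof": exonic[func.isin(LOF_CLASSES)],
        # e.g. nonframeshift insertions/deletions
        "other": exonic[~func.isin(named)],
    }


# Toy rows (invented) covering each class
toy = pd.DataFrame({
    "Func.refGene": ["exonic"] * 5 + ["intronic"],
    "ExonicFunc.refGene": [
        "synonymous SNV", "nonsynonymous SNV", "stopgain",
        "frameshift deletion", "nonframeshift deletion", ".",
    ],
})

classes = classify_exonic(toy)
for name, subset in classes.items():
    print(f"{name}: {len(subset)}")
```

The four class sizes add up to the number of exonic variants, which gives a quick sanity check on real data.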




## 5. Association analysis to compare allele frequencies between cases and controls
* ALL variants

* assoc

* glm

### ASSOC

In [None]:
# Run case-control analysis using plink assoc for all variants, not adjusting for any covariates
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:

    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
    
    ! /home/jupyter/tools/plink \
    --bfile {WORK_DIR}/{ancestry}_TMEM175 \
    --keep {WORK_DIR}/{ancestry}.samplestoKeep \
    --assoc \
    --allow-no-sex \
    --ci 0.95 \
    --maf 0.01 \
    --out {WORK_DIR}/{ancestry}_TMEM175.all

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175.all.log.
Options in effect:
  --allow-no-sex
  --assoc
  --bfile /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175
  --ci 0.95
  --keep /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC.samplestoKeep
  --maf 0.01
  --out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175.all

52223 MB RAM detected; reserving 26111 MB for main workspace.
3449 variants loaded from .bim file.
1111 people (455 males, 656 females) loaded from .fam.
1086 phenotype values loaded from .fam.
--keep: 1111 people remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 1111 founders and 0 nonfounders present.
Calculating allele frequencies...

In [None]:
# Test whether TMEM175 p.M393T is detectable in AFR when restricting to rare variants (--max-maf 0.01)
WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_AFR'

! /home/jupyter/tools/plink \
--bfile {WORK_DIR}/AFR_TMEM175 \
--keep {WORK_DIR}/AFR.samplestoKeep \
--allow-no-sex \
--max-maf 0.01 \
--ci 0.95 \
--assoc \
--out {WORK_DIR}/AFR_TMEM175.all.test

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175.all.test.log.
Options in effect:
  --allow-no-sex
  --assoc
  --bfile /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175
  --ci 0.95
  --keep /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR.samplestoKeep
  --max-maf 0.01
  --out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175.all.test

12984 MB RAM detected; reserving 6492 MB for main workspace.
3853 variants loaded from .bim file.
2643 people (1480 males, 1163 females) loaded from .fam.
2621 phenotype values loaded from .fam.
--keep: 2643 people remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2643 founders and 0 nonfounders present.
Calculating allele frequencies...

In [10]:
WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_AFR'
freq = pd.read_csv(f'{WORK_DIR}/AFR_TMEM175.all.test.assoc', sep=r'\s+')
freq.to_csv(f'{WORK_DIR}/AFR.all_nonadj.test.csv')

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
    print(f'WORKING ON: {ancestry}')
    
    # Look at assoc results, only variants with nominal p-value < 0.05
    freq = pd.read_csv(f'{WORK_DIR}/{ancestry}_TMEM175.all.assoc', sep=r'\s+')
    sig_all_nonadj = freq[freq['P']<0.05]
    
    print(f'Variants with p-value < 0.05: {sig_all_nonadj.shape}')
    
    # Save FREQ to csv
    freq.to_csv(f'{WORK_DIR}/{ancestry}.all_nonadj.csv')

WORKING ON: AAC
Variants with p-value < 0.05: (26, 13)
WORKING ON: AFR
Variants with p-value < 0.05: (90, 13)
WORKING ON: AJ
Variants with p-value < 0.05: (115, 13)
WORKING ON: AMR
Variants with p-value < 0.05: (7, 13)
WORKING ON: CAS
Variants with p-value < 0.05: (38, 13)
WORKING ON: EAS
Variants with p-value < 0.05: (200, 13)
WORKING ON: EUR
Variants with p-value < 0.05: (202, 13)
WORKING ON: FIN
Variants with p-value < 0.05: (9, 13)
WORKING ON: MDE
Variants with p-value < 0.05: (84, 13)
WORKING ON: SAS
Variants with p-value < 0.05: (24, 13)
WORKING ON: CAH
Variants with p-value < 0.05: (54, 13)
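
The output above prints the `.shape` of the filtered table, so the first number is the count of nominally significant variants and 13 is the number of columns in the `.assoc` file. A small sketch of how the nominal hits could be inspected and ranked is given below. The three rows are invented for illustration, and the column layout is the one assumed for `plink --assoc --ci 0.95`, consistent with the 13 columns reported above.

```python
import io

import pandas as pd

# Invented rows in the assumed 13-column .assoc layout (not real results)
toy_assoc = """\
CHR SNP BP A1 F_A F_U A2 CHISQ P OR SE L95 U95
4 chr4:900000:A:G 900000 G 0.12 0.10 A 4.2 0.04 1.23 0.10 1.01 1.49
4 chr4:900100:C:T 900100 T 0.05 0.05 C 0.01 0.92 1.00 0.15 0.75 1.34
4 chr4:900200:G:A 900200 A 0.30 0.25 G 9.8 0.0017 1.29 0.08 1.10 1.51
"""

# Whitespace-delimited read, as in the cell above
freq = pd.read_csv(io.StringIO(toy_assoc), sep=r"\s+")

# Keep nominally significant variants and rank them by p-value
nominal = freq[freq["P"] < 0.05].sort_values("P")
top = nominal[["SNP", "A1", "F_A", "F_U", "OR", "L95", "U95", "P"]]

print(f"{len(nominal)} of {len(freq)} variants with p-value < 0.05")
print(top.to_string(index=False))
```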


## 6. GLM analysis adjusting for gender, age, PC1-5

In [None]:
# Run case-control analysis with covariates
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:

    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'

    ! /home/jupyter/tools/plink2 \
    --bfile {WORK_DIR}/{ancestry}_TMEM175 \
    --keep {WORK_DIR}/{ancestry}.samplestoKeep \
    --allow-no-sex \
    --maf 0.01 \
    --ci 0.95 \
    --glm \
    --covar {WORK_DIR}/{ancestry}_covariate_file.txt \
    --covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
    --covar-variance-standardize \
    --neg9-pheno-really-missing \
    --out {WORK_DIR}/{ancestry}_TMEM175.all_adj

PLINK v2.00a6LM 64-bit Intel (4 Jul 2024)      www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175.all_adj.log.
Options in effect:
  --allow-no-sex
  --bfile /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175
  --ci 0.95
  --covar /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_covariate_file.txt
  --covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5
  --covar-variance-standardize
  --glm
  --keep /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC.samplestoKeep
  --maf 0.01
  --neg9-pheno-really-missing
  --out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175.all_adj

Start time: Wed Jul 31 11:04:25 2024
Note: --allow-no-sex no longer has any effect.  (Missing-sex samples are
automatically excluded from association analysis when sex is a covariate, and
treated normally otherwise.)
52223 MiB RAM detected, ~50540 available

In [None]:
# Test whether TMEM175 p.M393T is detectable in AFR when restricting to rare variants (--max-maf 0.01)
WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_AFR'

! /home/jupyter/tools/plink2 \
--bfile {WORK_DIR}/AFR_TMEM175 \
--keep {WORK_DIR}/AFR.samplestoKeep \
--allow-no-sex \
--max-maf 0.01 \
--ci 0.95 \
--glm \
--covar {WORK_DIR}/AFR_covariate_file.txt \
--covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
--covar-variance-standardize \
--neg9-pheno-really-missing \
--out {WORK_DIR}/AFR_TMEM175.all_adj.test

PLINK v2.0.0-a.6.0LM 64-bit Intel (11 Nov 2024)    cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175.all_adj.test.log.
Options in effect:
  --allow-no-sex
  --bfile /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175
  --ci 0.95
  --covar /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_covariate_file.txt
  --covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5
  --covar-variance-standardize
  --glm
  --keep /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR.samplestoKeep
  --max-maf 0.01
  --neg9-pheno-really-missing
  --out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AFR/AFR_TMEM175.all_adj.test

Start time: Wed Mar  5 10:28:20 2025
Note: --allow-no-sex no longer has any effect.  (Missing-sex samples are
automatically excluded from association analysis when sex is a covariate, and
treated normally otherwise.)
12984 MiB RAM detected, ~1

In [None]:
# Read in plink glm results
WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_AFR'
assoc = pd.read_csv(f'{WORK_DIR}/AFR_TMEM175.all_adj.test.PHENO1.glm.logistic.hybrid', sep=r'\s+')
assoc_add = assoc[assoc['TEST']=="ADD"]
assoc_add.to_csv(f'{WORK_DIR}/AFR.all_adj.test.csv')



In [None]:
# Process results from plink glm analysis including covariates
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
    print(f'WORKING ON: {ancestry}')
    
    # Read in plink glm results
    assoc = pd.read_csv(f'{WORK_DIR}/{ancestry}_TMEM175.all_adj.PHENO1.glm.logistic.hybrid', sep=r'\s+')
    
    # Filter for the additive test only - these are the per-variant results (other rows are covariate effects)
    assoc_add = assoc[assoc['TEST']=="ADD"]
    
    # Check if there are any significant (p < 0.05) variants
    significant = assoc_add[assoc_add['P']<0.05]

    print(f'There are {len(significant)} variants with p-value < 0.05 in glm')
    
    # Check if there are any genome-wide significant (p < 5e-8) variants
    GWsignificant = assoc_add[assoc_add['P']<5e-8]

    print(f'There are {len(GWsignificant)} variants with p-value < 5e-8 in glm')
    
    # Save assoc_add to csv
    assoc_add.to_csv(f'{WORK_DIR}/{ancestry}.all_adj.csv')

WORKING ON: AAC
There are 24 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: AFR
There are 46 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: AJ
There are 121 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: AMR
There are 7 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: CAS
There are 26 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: EAS
There are 44 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: EUR
There are 148 variants with p-value < 0.05 in glm
There are 40 variants with p-value < 5e-8 in glm
WORKING ON: FIN
There are 15 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: MDE
There are 124 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: SAS
There are 42 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: CAH
There are 27 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
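
The per-ancestry counts printed above could also be gathered into a single table rather than read from the log. The following is a minimal sketch under that assumption. The helper name `count_hits` and the toy rows are hypothetical; the `TEST` and `P` columns are the ones used in the cell above.

```python
import pandas as pd


def count_hits(assoc: pd.DataFrame, nominal: float = 0.05, genome_wide: float = 5e-8) -> dict:
    """Count tested, nominal and genome-wide significant variants (ADD rows only)."""
    add = assoc[assoc["TEST"] == "ADD"]
    return {
        "n_tested": len(add),
        "n_nominal": int((add["P"] < nominal).sum()),
        "n_genome_wide": int((add["P"] < genome_wide).sum()),
    }


# Toy glm-style rows (invented): one ADD row and one covariate row per variant
toy = pd.DataFrame({
    "ID": ["v1", "v1", "v2", "v2", "v3", "v3"],
    "TEST": ["ADD", "SEX", "ADD", "SEX", "ADD", "SEX"],
    "P": [0.03, 0.5, 1e-9, 0.4, 0.7, 0.01],
})

# One row per ancestry; 'TOY' stands in for a real ancestry label
summary = pd.DataFrame.from_dict({"TOY": count_hits(toy)}, orient="index")
print(summary)
```

Covariate rows such as `SEX` are excluded before counting, so a small covariate p-value does not inflate the number of variant hits.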




## 7. Burden test (SkatO, Skat, cmc, zeggini, mb, fp, cmcWald)

In [None]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']
variant_classes = ['exonic', 'nonsynonymous_lof']

# Loop over all ancestries and the 2 variant classes - extract all coding (exonic) and nonsynonymous + LoF variants as bgzipped, indexed VCFs for RVtests
for ancestry in ancestries:
    for variant_class in variant_classes:
        
        WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
        
        # Print a progress message (for debugging purposes)
        print(f'Running plink to extract {variant_class} variants for ancestry: {ancestry}')
        
        # Extract relevant variants
        ! /home/jupyter/tools/plink2 \
        --pfile {REL7_PATH}/imputed_genotypes/{ancestry}/chr4_{ancestry}_release7_vwb \
        --keep {WORK_DIR}/{ancestry}.samplestoKeep \
        --extract range {WORK_DIR}/{ancestry}_TMEM175.{variant_class}.variantstoKeep.txt \
        --recode vcf-iid \
        --out {WORK_DIR}/{ancestry}_TMEM175.{variant_class}
        
        # Print a progress message (for debugging purposes)
        print(f'Running bgzip and tabix for {variant_class} variants for ancestry: {ancestry}')
        
        # Bgzip and Tabix (zip and index the file)
        ! bgzip -f {WORK_DIR}/{ancestry}_TMEM175.{variant_class}.vcf
        ! tabix -f -p vcf {WORK_DIR}/{ancestry}_TMEM175.{variant_class}.vcf.gz
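
The range files passed to `--extract range` are built from the ANNOVAR `Start` and `End` columns. For deletions, ANNOVAR reports the deleted bases only, so `Start` is one base after the VCF position stored in `Otherinfo5` (for example `Start` 882566 for `chr4:882565:AG:A` in the SAS table). If the range filter compares each variant's base-pair position against `[Start, End]`, such deletions could fall outside their own range. The sketch below flags these rows and widens the range to include the VCF position. It is an illustration with toy rows copied from the SAS table, not part of the original workflow, and the label column value is a placeholder.

```python
import pandas as pd

# Two rows copied from the SAS ANNOVAR table above (a deletion and a SNV)
toy = pd.DataFrame({
    "Chr": [4, 4],
    "Start": [882566, 882611],
    "End": [882566, 882611],
    "Otherinfo5": [882565, 882611],  # VCF position
    "Otherinfo6": ["chr4:882565:AG:A", "chr4:882611:T:C"],  # variant ID
})

# Variants whose VCF position lies outside the ANNOVAR [Start, End] interval
outside = toy[(toy["Otherinfo5"] < toy["Start"]) | (toy["Otherinfo5"] > toy["End"])]
print("Outside own range:", outside["Otherinfo6"].tolist())

# Widen each range so that it always contains the VCF position
ranges = pd.DataFrame({
    "Chr": toy["Chr"],
    "Start": toy[["Start", "Otherinfo5"]].min(axis=1),
    "End": toy[["End", "Otherinfo5"]].max(axis=1),
    "Label": "TMEM175",  # placeholder set label
})
print(ranges.to_csv(sep="\t", index=False, header=False))
```

Comparing the number of variants in the exported VCF with the number of rows in the range file is a simple way to check whether any variants were dropped.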

In [None]:
# Run RVtests
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']
variant_classes = ['exonic', 'nonsynonymous_lof']

for ancestry in ancestries:
    for variant_class in variant_classes:
        
        WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
        
        # Print a progress message (for debugging purposes)
        print(f'Running RVtests for {variant_class} variants for ancestry: {ancestry}')
        
        # RVtests with covariates 
        # Make sure the pheno and covariate file starts with these first 5 columns: fid, iid, fatid, matid, sex
        # The pheno-name flag only works when the pheno/covar file is structured properly
        ! /home/jupyter/tools/rvtests/executable/rvtest --noweb --hide-covar \
        --out {WORK_DIR}/{ancestry}_TMEM175.burden.{variant_class} \
        --kernel skato \
        --inVcf {WORK_DIR}/{ancestry}_TMEM175.{variant_class}.vcf.gz \
        --pheno {WORK_DIR}/{ancestry}_covariate_file.txt \
        --pheno-name PHENO \
        --gene TMEM175 \
        --geneFile ~/workspace/ws_files/TMEM175/refFlat.txt \
        --covar {WORK_DIR}/{ancestry}_covariate_file.txt \
        --covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
        --freqUpper 0.01
# Alternative test selection (not run in this cell): --burden cmc,zeggini,mb,fp,cmcWald --kernel skat,skato

### EUR

In [5]:
WORK_DIR = "~/workspace/ws_files/TMEM175"

In [None]:
# Check EUR all_coding variant results
# Next step - burden test
! cat {WORK_DIR}/TMEM175_EUR/EUR_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	20918	174	109	323773	0.9	0.0304464


In [None]:
# Make sure the pheno and covariate file starts with these first 5 columns: fid, iid, fatid, matid, sex
# The pheno-name flag only works when the pheno/covar file is structured properly
! /home/jupyter/tools/rvtests/executable/rvtest --noweb --hide-covar \
--out {WORK_DIR}/TMEM175_EUR/EUR_TMEM175.burdenTEST.exonic \
--burden cmc,zeggini,mb,fp,cmcWald \
--inVcf {WORK_DIR}/TMEM175_EUR/EUR_TMEM175.exonic.vcf.gz \
--pheno {WORK_DIR}/TMEM175_EUR/EUR_covariate_file.txt \
--pheno-name PHENO \
--gene TMEM175 \
--geneFile ~/workspace/ws_files/TMEM175/refFlat.txt \
--covar {WORK_DIR}/TMEM175_EUR/EUR_covariate_file.txt \
--covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
--freqUpper 0.01

In [None]:
# View the burden test (CMC Wald) results for exonic variants in EUR
! cat {WORK_DIR}/TMEM175_EUR/EUR_TMEM175.burdenTEST.exonic.CMCWald.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	NonRefSite	Beta	SE	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	20918	174	109	1500	0.0708554	0.0620621	0.253585
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	20918	174	109	1500	-0.576489	0.0316899	6.02168e-74
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	20918	174	109	1500	0.0309305	0.00133732	2.37789e-118
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	20918	174	109	1500	-0.0224515	0.0086344	0.00931601
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	20918	174	109	1500	0.0140758	0.00750917	0.0608638
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:9324

In [12]:
! cat {WORK_DIR}/TMEM175_EUR/EUR_TMEM175.burdenTEST.exonic.CMC.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	NonRefSite	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	20918	174	109	1500	0.253509


In [9]:
! cat {WORK_DIR}/TMEM175_EUR/EUR_TMEM175.burdenTEST.exonic.Fp.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	20918	174	109	0.0365734


In [10]:
! cat {WORK_DIR}/TMEM175_EUR/EUR_TMEM175.burdenTEST.exonic.MadsonBrowning.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	20918	174	109	10000	6504	6.10068	1000	0	0.153752


In [11]:
! cat {WORK_DIR}/TMEM175_EUR/EUR_TMEM175.burdenTEST.exonic.Zeggini.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	20918	174	109	0.227047


In [None]:
# Check EUR nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_EUR/EUR_TMEM175.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	20918	105	65	107233	0.0560423	10000	10000	107233	537	0	0.0537
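
In the permutation-based outputs above, the reported `PermPvalue` appears to equal `NumGreater / ActualPerm`, and the permutations seem to stop early once `NumGreater` reaches 1000. This is an observation from the numbers in this notebook, not a statement from the RVtests documentation. The check below reproduces it for two rows copied from the EUR outputs.

```python
# (label, NumGreater, NumEqual, ActualPerm, reported PermPvalue), copied from the outputs above
rows = [
    ("EUR exonic MadsonBrowning", 1000, 0, 6504, 0.153752),
    ("EUR nonsynonymous_lof Skat", 537, 0, 10000, 0.0537),
]

estimates = []
for label, num_greater, num_equal, actual_perm, reported in rows:
    # Empirical p-value: fraction of permutations with a larger statistic
    estimate = num_greater / actual_perm
    estimates.append(estimate)
    print(f"{label}: computed {estimate:.6f}, reported {reported}")
```

Because `NumEqual` is 0 in both rows, these numbers cannot show whether ties are counted in the numerator.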


In [None]:
# Check EUR nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_EUR/EUR_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	20918	105	65	53616.4	0	0.101898


### AAC

In [None]:
# Check AAC all_coding variant results
! cat {WORK_DIR}/TMEM175_AAC/AAC_TMEM175.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	925	57	45	5938.41	0.692352	10000	1323	5938.41	1000	0	0.755858


In [None]:
# Check AAC all_coding variant results
! cat {WORK_DIR}/TMEM175_AAC/AAC_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	925	57	45	2977.49	1	0.5655


In [None]:
# Check AAC nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_AAC/AAC_TMEM175.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	925	38	31	3451.67	0.742505	10000	1275	3451.67	1000	0	0.784314


In [None]:
# Check AAC nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_AAC/AAC_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	925	38	31	1792.92	1	0.607447
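
The per-ancestry `cat` cells in this section show one result file at a time. A sketch of how the tab-separated RVtests outputs could be combined into one comparison table is shown below. The helper name `read_skato` is hypothetical, the two rows are the AAC SkatO results from above, and the `RANGE` field is shortened to a single interval for readability.

```python
import io

import pandas as pd

HEADER = "Gene\tRANGE\tN_INFORMATIVE\tNumVar\tNumPolyVar\tQ\trho\tPvalue\n"
# AAC SkatO rows copied from the outputs above (RANGE shortened)
AAC_EXONIC = HEADER + "TMEM175\t4:932459-958655\t925\t57\t45\t2977.49\t1\t0.5655\n"
AAC_NONSYN_LOF = HEADER + "TMEM175\t4:932459-958655\t925\t38\t31\t1792.92\t1\t0.607447\n"


def read_skato(handle, ancestry: str, variant_class: str) -> pd.DataFrame:
    """Read one SkatO .assoc file (path or file-like) and tag it with its origin."""
    df = pd.read_csv(handle, sep="\t")
    df.insert(0, "ancestry", ancestry)
    df.insert(1, "variant_class", variant_class)
    keep = ["ancestry", "variant_class", "N_INFORMATIVE", "NumVar", "NumPolyVar", "rho", "Pvalue"]
    return df[keep]


summary = pd.concat(
    [
        read_skato(io.StringIO(AAC_EXONIC), "AAC", "exonic"),
        read_skato(io.StringIO(AAC_NONSYN_LOF), "AAC", "nonsynonymous_lof"),
    ],
    ignore_index=True,
)
print(summary.to_string(index=False))
```

With real files, `handle` would be a path such as `{WORK_DIR}/TMEM175_{ancestry}/{ancestry}_TMEM175.burden.{variant_class}.SkatO.assoc`, expanded with `pathlib.Path(...).expanduser()` if it starts with `~`.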


### AFR

In [None]:
# Check AFR all_coding variant results
! cat {WORK_DIR}/TMEM175_AFR/AFR_TMEM175.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	1062	60	43	7846.37	0.311869	10000	2587	7846.37	1000	0	0.386548


In [None]:
# Check AFR all_coding variant results
! cat {WORK_DIR}/TMEM175_AFR/AFR_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	1062	60	43	3923.18	0	0.495811


In [None]:
# Check AFR nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_AFR/AFR_TMEM175.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	1062	38	27	6145.09	0.371636	10000	2581	6145.09	1000	0	0.387447


In [None]:
# Check AFR nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_AFR/AFR_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	1062	38	27	3072.54	0	0.56138


### AJ

In [None]:
# Check AJ all_coding variant results
! cat {WORK_DIR}/TMEM175_AJ/AJ_TMEM175.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	1410	20	15	2660.25	0.548165	10000	1693	2660.25	1000	0	0.590667


In [None]:
# Check AJ all_coding variant results
! cat {WORK_DIR}/TMEM175_AJ/AJ_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	1410	20	15	594.277	1	0.740559


In [None]:
# Check AJ nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_AJ/AJ_TMEM175.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	1410	13	10	2189.49	0.52697	10000	1745	2189.49	1000	0	0.573066


In [None]:
# Check AJ nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_AJ/AJ_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	1410	13	10	1094.74	0	0.708875


### AMR

In [None]:
# Check AMR all_coding variant results
! cat {WORK_DIR}/TMEM175_AMR/AMR_TMEM175.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	447	36	26	4759.54	0.274054	10000	4048	4759.54	1000	0	0.247036


In [None]:
# Check AMR all_coding variant results
! cat {WORK_DIR}/TMEM175_AMR/AMR_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	447	36	26	2379.77	0	0.433785


In [None]:
# Check AMR nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_AMR/AMR_TMEM175.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	447	18	14	3656.15	0.188299	10000	6542	3656.15	1000	0	0.152858


In [None]:
# Check AMR nonsynonymous/LoF variant results
! cat {WORK_DIR}/TMEM175_AMR/AMR_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	447	18	14	1828.07	0	0.311882


### CAH

In [None]:
# Check CAH all_coding variant results
! cat {WORK_DIR}/TMEM175_CAH/CAH_TMEM175.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	693	43	34	4011.03	0.742795	10000	1290	4011.03	1000	0	0.775194


In [None]:
# Check CAH all_coding variant results
! cat {WORK_DIR}/TMEM175_CAH/CAH_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	693	43	34	14226.6	1	0.0312151


In [None]:
# Check CAH nonsynonymous/LoF variant results
! cat {WORK_DIR}/TMEM175_CAH/CAH_TMEM175.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	693	26	22	3545.62	0.531319	10000	1628	3545.62	1000	0	0.614251


In [None]:
# Check CAH nonsynonymous/LoF variant results
! cat {WORK_DIR}/TMEM175_CAH/CAH_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	693	26	22	10667.7	1	0.0308928


### CAS

In [None]:
# Check CAS all_coding variant results
! cat {WORK_DIR}/TMEM175_CAS/CAS_TMEM175.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	658	30	19	589.951	0.984523	10000	1000	589.951	1000	0	1


In [None]:
# Check CAS all_coding variant results
! cat {WORK_DIR}/TMEM175_CAS/CAS_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	658	30	19	447.27	1	0.767453


In [None]:
# Check CAS nonsynonymous/LoF variant results
! cat {WORK_DIR}/TMEM175_CAS/CAS_TMEM175.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	658	22	11	445.823	0.237621	10000	4284	445.823	1000	0	0.233427


In [None]:
# Check CAS nonsynonymous/LoF variant results
! cat {WORK_DIR}/TMEM175_CAS/CAS_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	658	22	11	222.912	0	0.351341


### EAS

In [None]:
# Check EAS all_coding variant results
! cat {WORK_DIR}/TMEM175_EAS/EAS_TMEM175.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	2523	45	24	6681.36	0.298387	10000	3613	6681.36	1000	0	0.276778


In [None]:
# Check EAS all_coding variant results
! cat {WORK_DIR}/TMEM175_EAS/EAS_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	2523	45	24	3340.68	0	0.45024


In [None]:
# Check EAS nonsynonymous/LoF variant results
! cat {WORK_DIR}/TMEM175_EAS/EAS_TMEM175.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	2523	31	15	7571.83	0.0863703	10000	10000	7571.83	648	0	0.0648


In [None]:
# Check EAS nonsynonymous/LoF variant results
! cat {WORK_DIR}/TMEM175_EAS/EAS_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	2523	31	15	3785.91	0	0.141074


### FIN

In [None]:
# Check FIN all_coding variant results
! cat {WORK_DIR}/TMEM175_FIN/FIN_TMEM175.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	40	2	0	NA	NA	NA	NA	NA	NA	NA	NA


In [None]:
# Check FIN all_coding variant results
! cat {WORK_DIR}/TMEM175_FIN/FIN_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	40	2	0	NA	NA	NA


In [None]:
# Check FIN nonsynonymous/LoF variant results
! cat {WORK_DIR}/TMEM175_FIN/FIN_TMEM175.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	40	2	0	NA	NA	NA	NA	NA	NA	NA	NA


In [None]:
# Check FIN nonsynonymous/LoF variant results
! cat {WORK_DIR}/TMEM175_FIN/FIN_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	40	2	0	NA	NA	NA


### MDE

In [None]:
# Check MDE all_coding variant results
! cat {WORK_DIR}/TMEM175_MDE/MDE_TMEM175.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	379	30	13	234.147	0.708651	10000	1101	234.147	1000	0	0.908265


In [None]:
# Check MDE all_coding variant results
! cat {WORK_DIR}/TMEM175_MDE/MDE_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	379	30	13	149.403	1	0.681695


In [None]:
# Check MDE nonsynonymous/LoF variant results
! cat {WORK_DIR}/TMEM175_MDE/MDE_TMEM175.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	379	15	6	58.3182	1	10000	1013	58.3182	1000	0	0.987167


In [None]:
# Check MDE nonsynonymous/LoF variant results
! cat {WORK_DIR}/TMEM175_MDE/MDE_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	379	15	6	29.0677	1	0.839732


### SAS

In [None]:
# Check SAS all_coding variant results
! cat {WORK_DIR}/TMEM175_SAS/SAS_TMEM175.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	296	17	10	1810.68	0.10187	10000	10000	1810.68	430	0	0.043


In [None]:
# Check SAS all_coding variant results
! cat {WORK_DIR}/TMEM175_SAS/SAS_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	296	17	10	2890.96	1	0.0183843


In [None]:
# Check SAS nonsynonymous/LoF variant results
! cat {WORK_DIR}/TMEM175_SAS/SAS_TMEM175.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	296	12	8	1859.12	0.0974177	10000	10000	1859.12	397	0	0.0397


In [None]:
# Check SAS nonsynonymous/LoF variant results
! cat {WORK_DIR}/TMEM175_SAS/SAS_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932459-958655,4:932386-958655,4:932459-958655	296	12	8	2650.87	1	0.0230414
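
The per-ancestry outputs above all share the same tab-separated layout, so they can be gathered into a single table instead of being inspected one `cat` at a time. The following is a minimal sketch that assumes the file naming pattern used in the cells above; `parse_assoc` and `collect_results` are hypothetical helpers, not part of rvtests.

```python
import csv
import io
from pathlib import Path


def parse_assoc(text):
    """Parse the tab-separated content of one .assoc file into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


def collect_results(work_dir, ancestries, variant_set, test):
    """Read <work_dir>/TMEM175_<ANC>/<ANC>_TMEM175.burden.<variant_set>.<test>.assoc
    for each ancestry and tag every row with its ancestry label."""
    rows = []
    for anc in ancestries:
        path = (Path(work_dir).expanduser() / f"TMEM175_{anc}"
                / f"{anc}_TMEM175.burden.{variant_set}.{test}.assoc")
        if not path.exists():  # skip ancestries without an output file
            continue
        for row in parse_assoc(path.read_text()):
            rows.append({"ancestry": anc, **row})
    return rows


# Example using the SAS SKAT-O line shown above (RANGE shortened for readability)
example = (
    "Gene\tRANGE\tN_INFORMATIVE\tNumVar\tNumPolyVar\tQ\trho\tPvalue\n"
    "TMEM175\t4:932459-958655\t296\t12\t8\t2650.87\t1\t0.0230414\n"
)
parsed = parse_assoc(example)
print(parsed[0]["Pvalue"])
```

A call such as `collect_results(WORK_DIR, ["AJ", "AMR", "CAH"], "nonsynonymous_lof", "SkatO")` would then return one row per ancestry; values are kept as strings so that `NA` entries (as in FIN) pass through unchanged.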


### Change format of association file in EUR

In [None]:
# Read the summary statistics into Python; skip this step if your data are already loaded
df = pd.read_csv(f'{WORK_DIR}/TMEM175_EUR/EUR.all_adj.csv')

In [8]:
df

Unnamed: 0.1,Unnamed: 0,#CHROM,POS,ID,REF,ALT,PROVISIONAL_REF?,A1,OMITTED,A1_FREQ,FIRTH?,TEST,OBS_CT,OR,LOG(OR)_SE,L95,U95,Z_STAT,P,ERRCODE
0,0,4,882530,chr4:882530:C:T,C,T,Y,T,C,0.164467,N,ADD,20837,1.034410,0.030398,0.974580,1.097910,1.112890,0.265757,.
1,8,4,882611,chr4:882611:T:C,T,C,Y,C,T,0.051185,N,ADD,20885,0.931946,0.050102,0.844780,1.028110,-1.406740,0.159504,.
2,16,4,883018,chr4:883018:G:A,G,A,Y,A,G,0.050599,N,ADD,20880,0.928749,0.050369,0.841441,1.025120,-1.467500,0.142240,.
3,24,4,883133,chr4:883133:G:A,G,A,Y,A,G,0.012734,N,ADD,20811,0.809012,0.096557,0.669523,0.977562,-2.194990,0.028164,.
4,32,4,883483,chr4:883483:G:A,G,A,Y,A,G,0.035357,N,ADD,20901,0.883241,0.058911,0.786926,0.991344,-2.107530,0.035071,.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
392,3136,4,1007954,chr4:1007954:T:C,T,C,Y,T,C,0.279743,N,ADD,18971,0.913461,0.025919,0.868217,0.961063,-3.492280,0.000479,.
393,3144,4,1008009,chr4:1008009:C:T,C,T,Y,C,T,0.444961,N,ADD,18814,0.949197,0.023666,0.906174,0.994262,-2.203110,0.027587,.
394,3152,4,1008209,chr4:1008209:C:T,C,T,Y,T,C,0.036468,N,ADD,20717,1.002060,0.059756,0.891315,1.126570,0.034499,0.972480,.
395,3160,4,1008337,chr4:1008337:C:T,C,T,Y,C,T,0.279831,N,ADD,18965,0.913397,0.025912,0.868167,0.960983,-3.495880,0.000473,.


In [None]:
# First select only the relevant columns: chromosome, base-pair position and p-value
export_ldassoc = df[['#CHROM', 'POS', 'P']].copy()

In [10]:
export_ldassoc

Unnamed: 0,#CHROM,POS,P
0,4,882530,0.265757
1,4,882611,0.159504
2,4,883018,0.142240
3,4,883133,0.028164
4,4,883483,0.035071
...,...,...,...
392,4,1007954,0.000479
393,4,1008009,0.027587
394,4,1008209,0.972480
395,4,1008337,0.000473


In [None]:
# Rename the #CHROM column to drop the hash, which may confuse LDassoc
export_ldassoc = export_ldassoc.rename(columns={'#CHROM': 'CHROM'})

In [12]:
export_ldassoc

Unnamed: 0,CHROM,POS,P
0,4,882530,0.265757
1,4,882611,0.159504
2,4,883018,0.142240
3,4,883133,0.028164
4,4,883483,0.035071
...,...,...,...
392,4,1007954,0.000479
393,4,1008009,0.027587
394,4,1008209,0.972480
395,4,1008337,0.000473


In [None]:
# Then export as a tab-separated, not comma-separated file

export_ldassoc.to_csv('EUR.all_adj.formatted.tab', sep = '\t', index=False)

## 8. Conditional analysis
### Pearson and conditional analysis with GLM (rs6599388)

In [None]:
%load_ext rpy2.ipython

In [None]:
WORK_DIR = "~/workspace/ws_files/TMEM175/"

command = f"""
/home/jupyter/tools/plink \
--bfile {WORK_DIR}/TMEM175_AJ/AJ_TMEM175 \
--chr 4 \
--from-bp 945299 \
--to-bp 945299 \
--recode A \
--out {WORK_DIR}/TMEM175_AJ/TMEM175_rs6599388 \
--double-id
"""

!{command}

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/TMEM175//TMEM175_AJ/TMEM175_rs6599388.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files/TMEM175//TMEM175_AJ/AJ_TMEM175
  --chr 4
  --double-id
  --from-bp 945299
  --out /home/jupyter/workspace/ws_files/TMEM175//TMEM175_AJ/TMEM175_rs6599388
  --recode A
  --to-bp 945299

12984 MB RAM detected; reserving 6492 MB for main workspace.
1 out of 1285 variants loaded from .bim file.
2655 people (1648 males, 1007 females) loaded from .fam.
1703 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2655 founders and 0 nonfounders present.
Calculating allele frequencies... [progress output truncated]

In [None]:
rs6599388 = pd.read_csv(f'{WORK_DIR}/TMEM175_AJ/TMEM175_rs6599388.raw',sep=" ")

### Pearson and conditional analysis with GLM (rs74391911)

In [None]:
WORK_DIR = "~/workspace/ws_files/TMEM175/"

command = f"""
/home/jupyter/tools/plink \
--bfile {WORK_DIR}/TMEM175_AJ/AJ_TMEM175 \
--chr 4 \
--from-bp 956095 \
--to-bp 956095 \
--recode A \
--out {WORK_DIR}/TMEM175_AJ/TMEM175_rs74391911 \
--double-id
"""

!{command}

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/TMEM175//TMEM175_AJ/TMEM175_rs74391911.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files/TMEM175//TMEM175_AJ/AJ_TMEM175
  --chr 4
  --double-id
  --from-bp 956095
  --out /home/jupyter/workspace/ws_files/TMEM175//TMEM175_AJ/TMEM175_rs74391911
  --recode A
  --to-bp 956095

12984 MB RAM detected; reserving 6492 MB for main workspace.
1 out of 1285 variants loaded from .bim file.
2655 people (1648 males, 1007 females) loaded from .fam.
1703 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2655 founders and 0 nonfounders present.
Calculating allele frequencies... [progress output truncated]

In [None]:
rs74391911 = pd.read_csv(f'{WORK_DIR}/TMEM175_AJ/TMEM175_rs74391911.raw',sep=" ")

In [None]:
total = pd.merge(rs6599388, rs74391911, on="IID")
total

Unnamed: 0,FID_x,IID,PAT_x,MAT_x,SEX_x,PHENOTYPE_x,chr4:945299:C:T_C,FID_y,PAT_y,MAT_y,SEX_y,PHENOTYPE_y,chr4:956095:C:G_G
0,0,APGS_000021_s1,0,0,1,2,0.0,0,0,0,1,2,0.0
1,0,APGS_000475_s1,0,0,1,2,2.0,0,0,0,1,2,0.0
2,0,APGS_000486_s1,0,0,1,2,1.0,0,0,0,1,2,0.0
3,0,APGS_000646_s1,0,0,2,2,0.0,0,0,0,2,2,0.0
4,0,APGS_000808_s1,0,0,1,2,1.0,0,0,0,1,2,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2650,0,VIPD_000746_s1,0,0,1,2,1.0,0,0,0,1,2,0.0
2651,0,VIPD_000748_s1,0,0,1,2,1.0,0,0,0,1,2,0.0
2652,0,YMS_000030_s1,0,0,1,2,0.0,0,0,0,1,2,0.0
2653,0,YMS_000031_s1,0,0,1,1,1.0,0,0,0,1,1,0.0
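
PLINK's `--recode A` names each genotype column `<variant ID>_<counted allele>`, so `chr4:945299:C:T_C` holds the number of C alleles per sample while `chr4:956095:C:G_G` counts G. The sketch below splits such a header so the counted allele can be checked before interpreting effect directions; `split_raw_column` is a hypothetical helper, and the `chr:pos:ref:alt` reading of the ID is assumed from the summary statistics table earlier in the notebook.

```python
def split_raw_column(column):
    """Split a PLINK .raw genotype column name into (variant_id, counted_allele)."""
    variant_id, _, counted = column.rpartition("_")
    if not variant_id:
        raise ValueError(f"not a .raw genotype column: {column!r}")
    return variant_id, counted


variant_id, counted = split_raw_column("chr4:945299:C:T_C")
chrom, pos, ref, alt = variant_id.split(":")

print(variant_id, counted)   # chr4:945299:C:T C
print(counted == ref)        # True: this column counts the first allele in the ID
```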


In [None]:
cov = pd.read_csv(f'{WORK_DIR}/TMEM175_AJ/AJ_covariate_file.txt', sep = '\t')
cov

Unnamed: 0,FID,IID,FATID,MATID,SEX,AGE,PHENO,PC1,PC2,PC3,PC4,PC5
0,0,APGS_000002_s1,0,0,,,,-29.483298,-38.473005,-9.398904,-10.842865,25.744648
1,0,APGS_000003_s1,0,0,,,,-33.068958,-34.288862,-11.116951,-8.056146,22.924043
2,0,APGS_000004_s1,0,0,,,,-28.204441,-30.969481,-15.883093,-7.004351,18.647007
3,0,APGS_000005_s1,0,0,,,,-35.039753,-39.565822,-12.522753,-9.339179,21.956533
4,0,APGS_000006_s1,0,0,,,,-30.279907,-34.988905,-10.664628,-7.407177,26.357548
...,...,...,...,...,...,...,...,...,...,...,...,...
58204,0,YMS_000049_s1,0,0,,,,-33.320502,-37.170809,-9.867466,-6.818637,18.344976
58205,0,YMS_000050_s1,0,0,,,,-33.550740,-38.934634,-13.550840,-11.161991,26.601555
58206,0,YMS_000051_s1,0,0,,,,-36.415950,-36.634710,-9.623981,-8.333470,19.165537
58207,0,YMS_000052_s1,0,0,,,,55.200734,-12.367759,-2.246785,-0.681236,5.281365


In [None]:
total = pd.merge(total, cov, on="IID")
total

Unnamed: 0,FID_x,IID,PAT_x,MAT_x,SEX_x,PHENOTYPE_x,chr4:945299:C:T_C,FID_y,PAT_y,MAT_y,SEX_y,PHENOTYPE_y,chr4:956095:C:G_G,FID,FATID,MATID,SEX,AGE,PHENO,PC1,PC2,PC3,PC4,PC5
0,0,APGS_000021_s1,0,0,1,2,0.0,0,0,0,1,2,0.0,0,0,0,1.0,,2.0,-28.028682,-35.735363,-6.580637,-17.501411,-23.026102
1,0,APGS_000475_s1,0,0,1,2,2.0,0,0,0,1,2,0.0,0,0,0,1.0,,2.0,-28.154728,-36.708128,-5.730325,-20.750956,-21.729629
2,0,APGS_000486_s1,0,0,1,2,1.0,0,0,0,1,2,0.0,0,0,0,1.0,,2.0,-29.810620,-37.561248,-9.373941,-16.773748,-27.482223
3,0,APGS_000646_s1,0,0,2,2,0.0,0,0,0,2,2,0.0,0,0,0,2.0,,2.0,-31.967732,-34.309571,-8.603258,-17.143810,-21.027547
4,0,APGS_000808_s1,0,0,1,2,1.0,0,0,0,1,2,0.0,0,0,0,1.0,,2.0,-29.929420,-33.911640,-9.610071,-18.087631,-24.559511
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2650,0,VIPD_000746_s1,0,0,1,2,1.0,0,0,0,1,2,0.0,0,0,0,1.0,74.0,2.0,-31.455498,-35.058387,-9.777408,-18.399794,-27.254806
2651,0,VIPD_000748_s1,0,0,1,2,1.0,0,0,0,1,2,0.0,0,0,0,1.0,73.0,2.0,-31.319358,-39.960596,-10.932457,-17.389794,-1.963005
2652,0,YMS_000030_s1,0,0,1,2,0.0,0,0,0,1,2,0.0,0,0,0,1.0,72.0,2.0,-29.635489,-34.956373,-6.466043,-14.761441,-20.758634
2653,0,YMS_000031_s1,0,0,1,1,1.0,0,0,0,1,1,0.0,0,0,0,1.0,69.0,1.0,-32.862131,-35.306282,-9.166195,-15.693077,-21.088431


In [None]:
total = total[(total['PHENOTYPE_x']==1) | (total['PHENOTYPE_x']==2)]
total

Unnamed: 0,FID_x,IID,PAT_x,MAT_x,SEX_x,PHENOTYPE_x,chr4:945299:C:T_C,FID_y,PAT_y,MAT_y,SEX_y,PHENOTYPE_y,chr4:956095:C:G_G,FID,FATID,MATID,SEX,AGE,PHENO,PC1,PC2,PC3,PC4,PC5
0,0,APGS_000021_s1,0,0,1,2,0.0,0,0,0,1,2,0.0,0,0,0,1.0,,2.0,-28.028682,-35.735363,-6.580637,-17.501411,-23.026102
1,0,APGS_000475_s1,0,0,1,2,2.0,0,0,0,1,2,0.0,0,0,0,1.0,,2.0,-28.154728,-36.708128,-5.730325,-20.750956,-21.729629
2,0,APGS_000486_s1,0,0,1,2,1.0,0,0,0,1,2,0.0,0,0,0,1.0,,2.0,-29.810620,-37.561248,-9.373941,-16.773748,-27.482223
3,0,APGS_000646_s1,0,0,2,2,0.0,0,0,0,2,2,0.0,0,0,0,2.0,,2.0,-31.967732,-34.309571,-8.603258,-17.143810,-21.027547
4,0,APGS_000808_s1,0,0,1,2,1.0,0,0,0,1,2,0.0,0,0,0,1.0,,2.0,-29.929420,-33.911640,-9.610071,-18.087631,-24.559511
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2650,0,VIPD_000746_s1,0,0,1,2,1.0,0,0,0,1,2,0.0,0,0,0,1.0,74.0,2.0,-31.455498,-35.058387,-9.777408,-18.399794,-27.254806
2651,0,VIPD_000748_s1,0,0,1,2,1.0,0,0,0,1,2,0.0,0,0,0,1.0,73.0,2.0,-31.319358,-39.960596,-10.932457,-17.389794,-1.963005
2652,0,YMS_000030_s1,0,0,1,2,0.0,0,0,0,1,2,0.0,0,0,0,1.0,72.0,2.0,-29.635489,-34.956373,-6.466043,-14.761441,-20.758634
2653,0,YMS_000031_s1,0,0,1,1,1.0,0,0,0,1,1,0.0,0,0,0,1.0,69.0,1.0,-32.862131,-35.306282,-9.166195,-15.693077,-21.088431


In [None]:
total = total[["IID",'PHENOTYPE_x',"chr4:945299:C:T_C","chr4:956095:C:G_G", "SEX", "AGE", "PC1","PC2","PC3","PC4","PC5"]]

In [None]:
total.columns = ["IID",'PHENOTYPE',"TMEM175_rs6599388","TMEM175_rs74391911","SEX", "AGE", "PC1","PC2","PC3","PC4","PC5"]
total

Unnamed: 0,IID,PHENOTYPE,TMEM175_rs6599388,TMEM175_rs74391911,SEX,AGE,PC1,PC2,PC3,PC4,PC5
0,APGS_000021_s1,2,0.0,0.0,1.0,,-28.028682,-35.735363,-6.580637,-17.501411,-23.026102
1,APGS_000475_s1,2,2.0,0.0,1.0,,-28.154728,-36.708128,-5.730325,-20.750956,-21.729629
2,APGS_000486_s1,2,1.0,0.0,1.0,,-29.810620,-37.561248,-9.373941,-16.773748,-27.482223
3,APGS_000646_s1,2,0.0,0.0,2.0,,-31.967732,-34.309571,-8.603258,-17.143810,-21.027547
4,APGS_000808_s1,2,1.0,0.0,1.0,,-29.929420,-33.911640,-9.610071,-18.087631,-24.559511
...,...,...,...,...,...,...,...,...,...,...,...
2650,VIPD_000746_s1,2,1.0,0.0,1.0,74.0,-31.455498,-35.058387,-9.777408,-18.399794,-27.254806
2651,VIPD_000748_s1,2,1.0,0.0,1.0,73.0,-31.319358,-39.960596,-10.932457,-17.389794,-1.963005
2652,YMS_000030_s1,2,0.0,0.0,1.0,72.0,-29.635489,-34.956373,-6.466043,-14.761441,-20.758634
2653,YMS_000031_s1,1,1.0,0.0,1.0,69.0,-32.862131,-35.306282,-9.166195,-15.693077,-21.088431


In [None]:
total.to_csv(f'{WORK_DIR}/TMEM175_AJ/TMEM175_rs6599388_rs74391911.csv')

In [None]:
%%R
total <- read.csv("~/workspace/ws_files/TMEM175/TMEM175_AJ/TMEM175_rs6599388_rs74391911.csv")

In [None]:
%%R
head(total)

  X            IID PHENOTYPE TMEM175_rs6599388 TMEM175_rs74391911 SEX AGE
1 0 APGS_000021_s1         2                 0                  0   1  NA
2 1 APGS_000475_s1         2                 2                  0   1  NA
3 2 APGS_000486_s1         2                 1                  0   1  NA
4 3 APGS_000646_s1         2                 0                  0   2  NA
5 4 APGS_000808_s1         2                 1                  0   1  NA
6 5 APGS_001059_s1         2                 0                  0   1  NA
        PC1       PC2       PC3       PC4       PC5
1 -28.02868 -35.73536 -6.580637 -17.50141 -23.02610
2 -28.15473 -36.70813 -5.730325 -20.75096 -21.72963
3 -29.81062 -37.56125 -9.373941 -16.77375 -27.48222
4 -31.96773 -34.30957 -8.603258 -17.14381 -21.02755
5 -29.92942 -33.91164 -9.610071 -18.08763 -24.55951
6 -28.19315 -35.13333 -4.791519 -16.36048 -22.63633


In [None]:
%%R
dep_corr<- glm(PHENOTYPE ~ TMEM175_rs6599388 + TMEM175_rs74391911 + SEX + AGE + PC1 + PC2 + PC3 + PC4 + PC5, data=total)
summary(dep_corr)


Call:
glm(formula = PHENOTYPE ~ TMEM175_rs6599388 + TMEM175_rs74391911 + 
    SEX + AGE + PC1 + PC2 + PC3 + PC4 + PC5, data = total)

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         2.1225957  0.2629897   8.071 1.49e-15 ***
TMEM175_rs6599388  -0.0564446  0.0137131  -4.116 4.08e-05 ***
TMEM175_rs74391911 -0.2827724  0.0685212  -4.127 3.90e-05 ***
SEX                -0.1090329  0.0201904  -5.400 7.81e-08 ***
AGE                 0.0031262  0.0009613   3.252  0.00117 ** 
PC1                 0.0035181  0.0060532   0.581  0.56120    
PC2                 0.0071335  0.0051950   1.373  0.16993    
PC3                 0.0001873  0.0051913   0.036  0.97123    
PC4                -0.0071116  0.0052951  -1.343  0.17947    
PC5                 0.0022921  0.0025337   0.905  0.36581    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.1266327)

    Null deviance: 187.71  on 1408 
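
The `glm` call above uses R's default Gaussian family (hence the dispersion parameter in the output), so the estimates are on the scale of the 1/2 phenotype code. For a binary outcome, a logistic model is a common alternative; in R that would correspond to a 0/1 outcome with `family = binomial`. The sketch below shows such a fit in plain Python, assuming the phenotype is recoded as `PHENOTYPE - 1`. `fit_logistic`, `solve` and the toy data are illustrative only and are not the analysis used in the study.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_logistic(rows, y, n_iter=25):
    """Newton-Raphson logistic regression; rows hold predictors without the intercept."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]  # X'WX, the observed information
        for xi, yi in zip(X, y):
            eta = sum(bj * v for bj, v in zip(beta, xi))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (yi - p) * xi[i]
                for j in range(k):
                    info[i][j] += w * xi[i] * xi[j]
        step = solve(info, grad)
        beta = [bj + sj for bj, sj in zip(beta, step)]
    return beta


# Toy data: 40 non-carriers (10 cases) and 40 carriers (20 cases), coded 1/2 as in PLINK
pheno = [2] * 10 + [1] * 30 + [2] * 20 + [1] * 20
carrier = [[0.0]] * 40 + [[1.0]] * 40
y = [p - 1 for p in pheno]  # recode 1/2 to 0/1

beta = fit_logistic(carrier, y)
print(f"intercept = {beta[0]:.4f}, log(OR) = {beta[1]:.4f}, OR = {math.exp(beta[1]):.2f}")
```

With a single binary predictor the fitted coefficient equals the log odds ratio of the 2x2 table, which makes the toy example easy to check by hand; additional covariates (second variant, sex, age, PCs) would simply be extra entries in each row.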

### GCTA conditional analysis

In [None]:
# Read in summary statistics
sumstats = pd.read_csv(f'{WORK_DIR}/TMEM175_EUR/EUR.all_adj.csv')
sumstats

Unnamed: 0.1,Unnamed: 0,#CHROM,POS,ID,REF,ALT,PROVISIONAL_REF?,A1,OMITTED,A1_FREQ,FIRTH?,TEST,OBS_CT,OR,LOG(OR)_SE,L95,U95,Z_STAT,P,ERRCODE
0,0,4,882530,chr4:882530:C:T,C,T,Y,T,C,0.164467,N,ADD,20837,1.034410,0.030398,0.974580,1.097910,1.112890,0.265757,.
1,8,4,882611,chr4:882611:T:C,T,C,Y,C,T,0.051185,N,ADD,20885,0.931946,0.050102,0.844780,1.028110,-1.406740,0.159504,.
2,16,4,883018,chr4:883018:G:A,G,A,Y,A,G,0.050599,N,ADD,20880,0.928749,0.050369,0.841441,1.025120,-1.467500,0.142240,.
3,24,4,883133,chr4:883133:G:A,G,A,Y,A,G,0.012734,N,ADD,20811,0.809012,0.096557,0.669523,0.977562,-2.194990,0.028164,.
4,32,4,883483,chr4:883483:G:A,G,A,Y,A,G,0.035357,N,ADD,20901,0.883241,0.058911,0.786926,0.991344,-2.107530,0.035071,.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
392,3136,4,1007954,chr4:1007954:T:C,T,C,Y,T,C,0.279743,N,ADD,18971,0.913461,0.025919,0.868217,0.961063,-3.492280,0.000479,.
393,3144,4,1008009,chr4:1008009:C:T,C,T,Y,C,T,0.444961,N,ADD,18814,0.949197,0.023666,0.906174,0.994262,-2.203110,0.027587,.
394,3152,4,1008209,chr4:1008209:C:T,C,T,Y,T,C,0.036468,N,ADD,20717,1.002060,0.059756,0.891315,1.126570,0.034499,0.972480,.
395,3160,4,1008337,chr4:1008337:C:T,C,T,Y,C,T,0.279831,N,ADD,18965,0.913397,0.025912,0.868167,0.960983,-3.495880,0.000473,.


In [None]:
# Format summary statistics for GCTA-COJO
# First get the log odds ratio - this is required for COJO
# 1) For a case-control study, the effect size should be log(odds ratio) with its corresponding standard error.
sumstats_formatted = sumstats.copy()
sumstats_formatted['b'] = np.log(sumstats_formatted['OR'])
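
As a quick check of the conversion, the first row of the table above can be worked through by hand. This sketch only uses values copied from that row and the usual Wald formulas, with 1.96 as the normal quantile for the 95% interval (an assumption about how the reported interval was computed).

```python
import math

# Values copied from the first row of the summary statistics shown above
odds_ratio = 1.034410
se_log_or = 0.030398

b = math.log(odds_ratio)                 # effect size on the log-odds scale
z = b / se_log_or                        # Wald statistic
l95 = math.exp(b - 1.96 * se_log_or)     # lower 95% confidence limit for the OR
u95 = math.exp(b + 1.96 * se_log_or)     # upper 95% confidence limit for the OR

print(f"b = {b:.6f}, z = {z:.3f}, 95% CI = ({l95:.4f}, {u95:.4f})")
```

The results agree with the `Z_STAT`, `L95` and `U95` columns to within rounding, which confirms that `LOG(OR)_SE` can be passed to COJO as the standard error of `b` without further transformation.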

In [None]:
sumstats_formatted

Unnamed: 0.1,Unnamed: 0,#CHROM,POS,ID,REF,ALT,PROVISIONAL_REF?,A1,OMITTED,A1_FREQ,FIRTH?,TEST,OBS_CT,OR,LOG(OR)_SE,L95,U95,Z_STAT,P,ERRCODE,b
0,0,4,882530,chr4:882530:C:T,C,T,Y,T,C,0.164467,N,ADD,20837,1.034410,0.030398,0.974580,1.097910,1.112890,0.265757,.,0.033831
1,8,4,882611,chr4:882611:T:C,T,C,Y,C,T,0.051185,N,ADD,20885,0.931946,0.050102,0.844780,1.028110,-1.406740,0.159504,.,-0.070480
2,16,4,883018,chr4:883018:G:A,G,A,Y,A,G,0.050599,N,ADD,20880,0.928749,0.050369,0.841441,1.025120,-1.467500,0.142240,.,-0.073917
3,24,4,883133,chr4:883133:G:A,G,A,Y,A,G,0.012734,N,ADD,20811,0.809012,0.096557,0.669523,0.977562,-2.194990,0.028164,.,-0.211942
4,32,4,883483,chr4:883483:G:A,G,A,Y,A,G,0.035357,N,ADD,20901,0.883241,0.058911,0.786926,0.991344,-2.107530,0.035071,.,-0.124157
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
392,3136,4,1007954,chr4:1007954:T:C,T,C,Y,T,C,0.279743,N,ADD,18971,0.913461,0.025919,0.868217,0.961063,-3.492280,0.000479,.,-0.090515
393,3144,4,1008009,chr4:1008009:C:T,C,T,Y,C,T,0.444961,N,ADD,18814,0.949197,0.023666,0.906174,0.994262,-2.203110,0.027587,.,-0.052139
394,3152,4,1008209,chr4:1008209:C:T,C,T,Y,T,C,0.036468,N,ADD,20717,1.002060,0.059756,0.891315,1.126570,0.034499,0.972480,.,0.002058
395,3160,4,1008337,chr4:1008337:C:T,C,T,Y,C,T,0.279831,N,ADD,18965,0.913397,0.025912,0.868167,0.960983,-3.495880,0.000473,.,-0.090585


In [None]:
# Now select just the necessary columns for COJO
sumstats_export = sumstats_formatted[['ID', 'A1', 'OMITTED', 'A1_FREQ', 'b', 'LOG(OR)_SE', 'P', 'OBS_CT']].copy()

In [None]:
sumstats_export

Unnamed: 0,ID,A1,OMITTED,A1_FREQ,b,LOG(OR)_SE,P,OBS_CT
0,chr4:882530:C:T,T,C,0.164467,0.033831,0.030398,0.265757,20837
1,chr4:882611:T:C,C,T,0.051185,-0.070480,0.050102,0.159504,20885
2,chr4:883018:G:A,A,G,0.050599,-0.073917,0.050369,0.142240,20880
3,chr4:883133:G:A,A,G,0.012734,-0.211942,0.096557,0.028164,20811
4,chr4:883483:G:A,A,G,0.035357,-0.124157,0.058911,0.035071,20901
...,...,...,...,...,...,...,...,...
392,chr4:1007954:T:C,T,C,0.279743,-0.090515,0.025919,0.000479,18971
393,chr4:1008009:C:T,C,T,0.444961,-0.052139,0.023666,0.027587,18814
394,chr4:1008209:C:T,T,C,0.036468,0.002058,0.059756,0.972480,20717
395,chr4:1008337:C:T,C,T,0.279831,-0.090585,0.025912,0.000473,18965


In [None]:
# Rename columns following COJO format
sumstats_export = sumstats_export.rename(columns = {'ID':'SNP', 'OMITTED':'A2', 'A1_FREQ':'freq', 'LOG(OR)_SE':'se', 'P':'p', 'OBS_CT':'N'})

In [None]:
sumstats_export

Unnamed: 0,SNP,A1,A2,freq,b,se,p,N
0,chr4:882530:C:T,T,C,0.164467,0.033831,0.030398,0.265757,20837
1,chr4:882611:T:C,C,T,0.051185,-0.070480,0.050102,0.159504,20885
2,chr4:883018:G:A,A,G,0.050599,-0.073917,0.050369,0.142240,20880
3,chr4:883133:G:A,A,G,0.012734,-0.211942,0.096557,0.028164,20811
4,chr4:883483:G:A,A,G,0.035357,-0.124157,0.058911,0.035071,20901
...,...,...,...,...,...,...,...,...
392,chr4:1007954:T:C,T,C,0.279743,-0.090515,0.025919,0.000479,18971
393,chr4:1008009:C:T,C,T,0.444961,-0.052139,0.023666,0.027587,18814
394,chr4:1008209:C:T,T,C,0.036468,0.002058,0.059756,0.972480,20717
395,chr4:1008337:C:T,C,T,0.279831,-0.090585,0.025912,0.000473,18965


In [None]:
# Export
sumstats_export.to_csv(f'{WORK_DIR}/TMEM175_EUR/EUR.all_adj.sumstats.ma', sep = '\t', index=False)
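
Before passing the exported file to GCTA, it can help to confirm that the columns are in the order COJO reads them (SNP, A1, A2, freq, b, se, p, N) and that the numeric fields are plausible. The sketch below is one way to do that; `check_cojo_ma` is a hypothetical helper and the header check reflects the column names chosen in this notebook.

```python
import csv
import io

COJO_COLUMNS = ["SNP", "A1", "A2", "freq", "b", "se", "p", "N"]


def check_cojo_ma(handle):
    """Return a list of problems found in a tab-separated COJO .ma file."""
    reader = csv.DictReader(handle, delimiter="\t")
    problems = []
    if reader.fieldnames != COJO_COLUMNS:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems
    for line_no, row in enumerate(reader, start=2):
        freq, se, p = float(row["freq"]), float(row["se"]), float(row["p"])
        if not 0.0 <= freq <= 1.0:
            problems.append(f"line {line_no}: freq out of range")
        if se <= 0.0:
            problems.append(f"line {line_no}: non-positive se")
        if not 0.0 <= p <= 1.0:
            problems.append(f"line {line_no}: p out of range")
    return problems


# One row taken from the export shown above
sample = (
    "SNP\tA1\tA2\tfreq\tb\tse\tp\tN\n"
    "chr4:882530:C:T\tT\tC\t0.164467\t0.033831\t0.030398\t0.265757\t20837\n"
)
print(check_cojo_ma(io.StringIO(sample)))
```

For the real file, open it with `pathlib.Path(...).expanduser().open()` first, because Python's built-in `open` does not expand the `~` in `WORK_DIR`.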

### Install GCTA

In [None]:
! wget https://yanglab.westlake.edu.cn/software/gcta/bin/gcta-1.94.1-linux-kernel-3-x86_64.zip

--2024-08-07 11:14:11--  https://yanglab.westlake.edu.cn/software/gcta/bin/gcta-1.94.1-linux-kernel-3-x86_64.zip
Resolving yanglab.westlake.edu.cn (yanglab.westlake.edu.cn)... 124.160.108.195, 42.247.30.189, 2001:250:6413:1002:250:56ff:10:195
Connecting to yanglab.westlake.edu.cn (yanglab.westlake.edu.cn)|124.160.108.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14333133 (14M) [application/zip]
Saving to: ‘gcta-1.94.1-linux-kernel-3-x86_64.zip’


2024-08-07 11:14:32 (824 KB/s) - ‘gcta-1.94.1-linux-kernel-3-x86_64.zip’ saved [14333133/14333133]



In [None]:
! unzip gcta-1.94.1-linux-kernel-3-x86_64.zip

Archive:  gcta-1.94.1-linux-kernel-3-x86_64.zip
   creating: gcta-1.94.1-linux-kernel-3-x86_64/
  inflating: __MACOSX/._gcta-1.94.1-linux-kernel-3-x86_64  
  inflating: gcta-1.94.1-linux-kernel-3-x86_64/gcta64  
  inflating: __MACOSX/gcta-1.94.1-linux-kernel-3-x86_64/._gcta64  
  inflating: gcta-1.94.1-linux-kernel-3-x86_64/.DS_Store  
  inflating: __MACOSX/gcta-1.94.1-linux-kernel-3-x86_64/._.DS_Store  
  inflating: gcta-1.94.1-linux-kernel-3-x86_64/test.bim  
  inflating: __MACOSX/gcta-1.94.1-linux-kernel-3-x86_64/._test.bim  
  inflating: gcta-1.94.1-linux-kernel-3-x86_64/MIT_License.txt  
  inflating: __MACOSX/gcta-1.94.1-linux-kernel-3-x86_64/._MIT_License.txt  
  inflating: gcta-1.94.1-linux-kernel-3-x86_64/test.fam  
  inflating: __MACOSX/gcta-1.94.1-linux-kernel-3-x86_64/._test.fam  
  inflating: gcta-1.94.1-linux-kernel-3-x86_64/README.txt  
  inflating: __MACOSX/gcta-1.94.1-linux-kernel-3-x86_64/._README.txt  
  inflating: gcta-1.94.1-linux-kernel-3-x86_64/test.bed  
  inflat

In [None]:
! mv ~/workspace/ws_files/gcta-1.94.1-linux-kernel-3-x86_64 /home/jupyter/tools/

In [None]:
! ls /home/jupyter/tools/

LICENSE				   plink			    prettify
annovar				   plink2			    rvtests
annovar.latest.tar.gz		   plink2_linux_x86_64_latest.zip   toy.map
gcta-1.94.1-linux-kernel-3-x86_64  plink_linux_x86_64_20190304.zip  toy.ped


In [None]:
! /home/jupyter/tools/gcta-1.94.1-linux-kernel-3-x86_64/gcta64 --version

/bin/bash: line 1: /home/jupyter/tools/gcta-1.94.1-linux-kernel-3-x86_64/gcta64: Permission denied


In [None]:
! chmod u+x /home/jupyter/tools/gcta-1.94.1-linux-kernel-3-x86_64/gcta64

In [None]:
! /home/jupyter/tools/gcta-1.94.1-linux-kernel-3-x86_64/gcta64 --version

*******************************************************************
* Genome-wide Complex Trait Analysis (GCTA)
* version v1.94.1 Linux
* Built at Nov 15 2022 21:14:25, by GCC 8.5
* (C) 2010-present, Yang Lab, Westlake University
* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>
*******************************************************************
Analysis started at 11:22:48 UTC on Wed Aug 07 2024.
Hostname: jupyterlabvertexai20240708

Error: the --out option is missing.
An error occurs, please check the options or data


In [None]:
# Select independently associated SNPs by stepwise model selection at a p-value threshold of 5e-8
# Change --cojo-p to use a different significance threshold
# --bfile points to the reference genotype data in PLINK binary format, e.g. the GP2 data used to run the GWAS

! /home/jupyter/tools/gcta-1.94.1-linux-kernel-3-x86_64/gcta64 --bfile {WORK_DIR}/TMEM175_EUR/EUR_TMEM175 \
--maf 0.01 \
--cojo-file {WORK_DIR}/TMEM175_EUR/EUR.all_adj.sumstats.ma \
--cojo-p 5e-8 \
--cojo-slct \
--out {WORK_DIR}/TMEM175_EUR/EUR.all_adj.COJO

*******************************************************************
* Genome-wide Complex Trait Analysis (GCTA)
* version v1.94.1 Linux
* Built at Nov 15 2022 21:14:25, by GCC 8.5
* (C) 2010-present, Yang Lab, Westlake University
* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>
*******************************************************************
Analysis started at 11:25:36 UTC on Wed Aug 07 2024.
Hostname: jupyterlabvertexai20240708

Accepted options:
--bfile /home/jupyter/workspace/ws_files/TMEM175//TMEM175_EUR/EUR_TMEM175
--maf 0.01
--cojo-file /home/jupyter/workspace/ws_files/TMEM175//TMEM175_EUR/EUR.all_adj.sumstats.ma
--cojo-p 5e-08
--cojo-slct
--out /home/jupyter/workspace/ws_files/TMEM175//TMEM175_EUR/EUR.all_adj.COJO


Reading PLINK FAM file from [/home/jupyter/workspace/ws_files/TMEM175//TMEM175_EUR/EUR_TMEM175.fam].
38839 individuals to be included from [
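
Between runs of this selection step, only the ancestry, the p-value threshold and the output prefix change, so the call can be assembled by a small helper. The following is a minimal sketch rather than part of the original workflow: `build_cojo_cmd` and its `tag` argument are hypothetical names, and only flags that already appear in the cell above are used.

```python
from pathlib import Path

# Binary path as used in the cells above
GCTA = "/home/jupyter/tools/gcta-1.94.1-linux-kernel-3-x86_64/gcta64"

def build_cojo_cmd(work_dir, ancestry, p_threshold=5e-8, maf=0.01, tag="COJO"):
    """Return the gcta64 --cojo-slct call as an argument list (hypothetical helper)."""
    base = Path(work_dir).expanduser() / f"TMEM175_{ancestry}"
    return [
        GCTA,
        "--bfile", str(base / f"{ancestry}_TMEM175"),
        "--maf", str(maf),
        "--cojo-file", str(base / f"{ancestry}.all_adj.sumstats.ma"),
        "--cojo-p", str(p_threshold),
        "--cojo-slct",
        "--out", str(base / f"{ancestry}.all_adj.{tag}"),
    ]

cmd = build_cojo_cmd("/home/jupyter/workspace/ws_files/TMEM175", "EUR")
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` instead of the `!` shell escape if a failing GCTA run should stop the notebook.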

In [None]:
! head {WORK_DIR}/TMEM175_EUR/EUR.all_adj.COJO.cma.cojo

Chr	SNP	bp	refA	freq	b	se	p	n	freq_geno	bC	bC_se	pC
4	chr4:882530:C:T	882530	T	0.164467	0.0338312	0.0303975	0.265725	20489.9	0.164537	-0.0279995	0.0303977	0.356994
4	chr4:882611:T:C	882611	C	0.0511851	-0.0704804	0.0501022	0.159507	21340.5	0.0504798	-0.0330724	0.0501033	0.5092
4	chr4:883018:G:A	883018	A	0.0505987	-0.0739168	0.0503694	0.142242	21346	0.0499974	-0.0364219	0.0503708	0.469634
4	chr4:883133:G:A	883133	A	0.0127337	-0.211942	0.0965569	0.0281642	22193.8	0.0132536	-0.170603	0.0965652	0.0772761
4	chr4:883483:G:A	883483	A	0.0353572	-0.124157	0.0589113	0.0350719	21976.2	0.0366916	-0.0786346	0.0589159	0.181977
4	chr4:883624:C:T	883624	T	0.181816	0.0947191	0.0295052	0.00132619	20080.8	0.179007	-0.0154726	0.029512	0.600084
4	chr4:883837:G:A	883837	A	0.400635	-0.031951	0.022935	0.163586	20597	0.404776	-0.028599	0.0229355	0.212423
4	chr4:883921:T:C	883921	C	0.448883	-0.0438296	0.02274	0.0539267	20335.1	0.452722	-0.0401386	0.0227415	0.0775649
4	chr4:884131:C:T	884131	T	0.0117174	0.0247414
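
In the `.cma.cojo` file, `p` is the marginal p-value from the input summary statistics and `pC` is the p-value after conditioning on the selected SNP(s). The sketch below uses only the complete rows printed above, reduced to three columns, and lists the variants that are nominally associated on their own (`p < 0.05`) but not after conditioning. The 0.05 cut-off is an illustrative assumption, not a threshold used elsewhere in this notebook.

```python
import csv
import io

# Complete rows of EUR.all_adj.COJO.cma.cojo copied from the output above
# (columns reduced to SNP, marginal p, conditional pC)
CMA_TEXT = (
    "SNP\tp\tpC\n"
    "chr4:882530:C:T\t0.265725\t0.356994\n"
    "chr4:882611:T:C\t0.159507\t0.5092\n"
    "chr4:883018:G:A\t0.142242\t0.469634\n"
    "chr4:883133:G:A\t0.0281642\t0.0772761\n"
    "chr4:883483:G:A\t0.0350719\t0.181977\n"
    "chr4:883624:C:T\t0.00132619\t0.600084\n"
    "chr4:883837:G:A\t0.163586\t0.212423\n"
    "chr4:883921:T:C\t0.0539267\t0.0775649\n"
)

NOMINAL = 0.05  # illustrative cut-off (assumption)

rows = list(csv.DictReader(io.StringIO(CMA_TEXT), delimiter="\t"))

# Variants whose nominal signal disappears once the selected SNP is conditioned on
explained = [
    r["SNP"] for r in rows
    if float(r["p"]) < NOMINAL and float(r["pC"]) >= NOMINAL
]
print(explained)
```

With the real file, the same filter can be applied after reading `EUR.all_adj.COJO.cma.cojo` with `pd.read_csv(..., sep='\t')`.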

In [None]:
! head {WORK_DIR}/TMEM175_EUR/EUR.all_adj.COJO.jma.cojo

Chr	SNP	bp	refA	freq	b	se	p	n	freq_geno	bJ	bJ_se	pJ	LD_r
4	chr4:958159:T:C	958159	C	0.203811	0.258912	0.0290424	4.87943e-19	18930.3	0.198229	0.258912	0.0291025	5.76097e-19	0


In [None]:
! head {WORK_DIR}/TMEM175_EUR/EUR.all_adj.COJO.ldr.cojo

SNP	chr4:958159:T:C	
chr4:958159:T:C	1	


In [None]:
! head {WORK_DIR}/TMEM175_EUR/EUR.all_adj.COJO.log

*******************************************************************
* Genome-wide Complex Trait Analysis (GCTA)
* version v1.94.1 Linux
* Built at Nov 15 2022 21:14:25, by GCC 8.5
* (C) 2010-present, Yang Lab, Westlake University
* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>
*******************************************************************
Analysis started at 11:25:36 UTC on Wed Aug 07 2024.
Hostname: jupyterlabvertexai20240708



In [None]:
# Select multiple associated SNPs using the relaxed p-value threshold 4.59e-4 (instead of 5e-08)
# Can change the p-value for significance if needed
# Bfile is referring to the full dataset in plink binary format, e.g. GP2 - whatever you used to run the GWAS

! /home/jupyter/tools/gcta-1.94.1-linux-kernel-3-x86_64/gcta64 --bfile {WORK_DIR}/TMEM175_EUR/EUR_TMEM175 \
--maf 0.01 \
--cojo-file {WORK_DIR}/TMEM175_EUR/EUR.all_adj.sumstats.ma \
--cojo-p 4.59e-4 \
--cojo-slct \
--out {WORK_DIR}/TMEM175_EUR/EUR.all_adj.LDprune.COJO

[0;32m[0m*******************************************************************
[0;32m[0m* Genome-wide Complex Trait Analysis (GCTA)
[0;32m[0m* version v1.94.1 Linux
[0;32m[0m* Built at Nov 15 2022 21:14:25, by GCC 8.5
[0;32m[0m* (C) 2010-present, Yang Lab, Westlake University
[0;32m[0m* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>
[0;32m[0m*******************************************************************
[0;32mAnalysis started [0mat 12:48:26 UTC on Tue Aug 20 2024.
[0;32m[0mHostname: jupyterlabvertexai20240708
[0;32m[0m
Accepted options:
--bfile /home/jupyter/workspace/ws_files/TMEM175//TMEM175_EUR/EUR_TMEM175
--maf 0.01
--cojo-file /home/jupyter/workspace/ws_files/TMEM175//TMEM175_EUR/EUR.all_adj.sumstats.ma
--cojo-p 0.000459
--cojo-slct
--out /home/jupyter/workspace/ws_files/TMEM175//TMEM175_EUR/EUR.all_adj.LDprune.COJO


Reading PLINK FAM file from [/home/jupyter/workspace/ws_files/TMEM175//TMEM175_EUR/EUR_TMEM175.fam].
38839 individuals to be incl

In [None]:
# Or you can run COJO and select just the top 10 independent SNPs (can be below GWAS significance p-value)
# Select top 10 SNPs
! /home/jupyter/tools/gcta-1.94.1-linux-kernel-3-x86_64/gcta64 --bfile {WORK_DIR}/TMEM175_EUR/EUR_TMEM175 \
--maf 0.01 \
--cojo-file {WORK_DIR}/TMEM175_EUR/EUR.all_adj.sumstats.ma \
--cojo-top-SNPs 10 \
--out {WORK_DIR}/TMEM175_EUR/EUR.all_adj.top10

[0;32m[0m*******************************************************************
[0;32m[0m* Genome-wide Complex Trait Analysis (GCTA)
[0;32m[0m* version v1.94.1 Linux
[0;32m[0m* Built at Nov 15 2022 21:14:25, by GCC 8.5
[0;32m[0m* (C) 2010-present, Yang Lab, Westlake University
[0;32m[0m* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>
[0;32m[0m*******************************************************************
[0;32mAnalysis started [0mat 11:33:10 UTC on Wed Aug 07 2024.
[0;32m[0mHostname: jupyterlabvertexai20240708
[0;32m[0m
Accepted options:
--bfile /home/jupyter/workspace/ws_files/TMEM175//TMEM175_EUR/EUR_TMEM175
--maf 0.01
--cojo-file /home/jupyter/workspace/ws_files/TMEM175//TMEM175_EUR/EUR.all_adj.sumstats.ma
--cojo-top-SNPs 10
--out /home/jupyter/workspace/ws_files/TMEM175//TMEM175_EUR/EUR.all_adj.top10


Reading PLINK FAM file from [/home/jupyter/workspace/ws_files/TMEM175//TMEM175_EUR/EUR_TMEM175.fam].
38839 individuals to be included from [/home/j

In [None]:
! head {WORK_DIR}/TMEM175_EUR/EUR.all_adj.top10.cma.cojo

Chr	SNP	bp	refA	freq	b	se	p	n	freq_geno	bC	bC_se	pC
4	chr4:882530:C:T	882530	T	0.164467	0.0338312	0.0303975	0.265725	20489.9	0.164537	-0.277937	0.105335	0.00832474
4	chr4:882611:T:C	882611	C	0.0511851	-0.0704804	0.0501022	0.159507	21340.5	0.0504798	-0.122516	0.0814926	0.132737
4	chr4:883018:G:A	883018	A	0.0505987	-0.0739168	0.0503694	0.142242	21346	0.0499974	-0.114817	0.0418044	0.00602305
4	chr4:883483:G:A	883483	A	0.0353572	-0.124157	0.0589113	0.0350719	21976.2	0.0366916	0.177534	0.0999408	0.075669
4	chr4:883624:C:T	883624	T	0.181816	0.0947191	0.0295052	0.00132619	20080.8	0.179007	-0.0724453	0.0497149	0.145057
4	chr4:883837:G:A	883837	A	0.400635	-0.031951	0.022935	0.163586	20597	0.404776	0.282062	0.0325328	4.31799e-18
4	chr4:883921:T:C	883921	C	0.448883	-0.0438296	0.02274	0.0539267	20335.1	0.452722	0.110611	0.0509097	0.0298032
4	chr4:884131:C:T	884131	T	0.0117174	0.0247414	0.104187	0.812291	20698.7	0.0122093	-0.136355	0.0806028	0.0907057
4	chr4:884260:C:T	884260	T	0.335987	0.0328447	0

In [None]:
! head {WORK_DIR}/TMEM175_EUR/EUR.all_adj.top10.jma.cojo

Chr	SNP	bp	refA	freq	b	se	p	n	freq_geno	bJ	bJ_se	pJ	LD_r
4	chr4:883133:G:A	883133	A	0.0127337	-0.211942	0.0965569	0.0281642	22193.8	0.0132536	-0.277937	0.105335	0.00832474	-0.0174097
4	chr4:905058:T:C	905058	C	0.0220098	0.103396	0.0776335	0.182912	20053.8	0.0211291	-0.122516	0.0814926	0.132737	-0.0460043
4	chr4:931588:C:T	931588	T	0.098972	-0.14224	0.0364747	9.63155e-05	21915.3	0.0994334	-0.114817	0.0418044	0.00602305	0.372451
4	chr4:933370:G:A	933370	A	0.0153177	0.0255118	0.0925403	0.782792	20143.3	0.0149901	0.177534	0.0999408	0.075669	-0.0298242
4	chr4:939809:A:G	939809	G	0.0536428	-0.111124	0.048938	0.0231648	21395.3	0.0540681	-0.0724453	0.0497149	0.145057	-0.117727
4	chr4:958159:T:C	958159	C	0.203811	0.258912	0.0290424	4.87943e-19	18930.3	0.198229	0.282062	0.0325328	4.31799e-18	-0.121267
4	chr4:983473:C:T	983473	T	0.0623297	0.02634	0.0460307	0.567168	21010.5	0.063255	0.110611	0.0509097	0.0298032	-0.0387902
4	chr4:988557:C:T	988557	T	0.022547	0.0935541	0.076603	0.221978	20117.6	0.02
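
Because `--cojo-top-SNPs` returns a fixed number of SNPs regardless of significance, it helps to check which of them would also pass the threshold used elsewhere in this notebook. The sketch below ranks the complete rows printed above by their joint p-value (`pJ`) and keeps those below 4.59e-4. Only three columns are copied, and the truncated last row is left out.

```python
import csv
import io

# Complete rows of EUR.all_adj.top10.jma.cojo copied from the output above
# (columns reduced to SNP, marginal p, joint pJ)
JMA_TEXT = (
    "SNP\tp\tpJ\n"
    "chr4:883133:G:A\t0.0281642\t0.00832474\n"
    "chr4:905058:T:C\t0.182912\t0.132737\n"
    "chr4:931588:C:T\t9.63155e-05\t0.00602305\n"
    "chr4:933370:G:A\t0.782792\t0.075669\n"
    "chr4:939809:A:G\t0.0231648\t0.145057\n"
    "chr4:958159:T:C\t4.87943e-19\t4.31799e-18\n"
    "chr4:983473:C:T\t0.567168\t0.0298032\n"
)

THRESHOLD = 4.59e-4  # same value as passed to --cojo-p in this notebook

rows = list(csv.DictReader(io.StringIO(JMA_TEXT), delimiter="\t"))

# Rank SNPs from strongest to weakest joint association
ranked = sorted(rows, key=lambda r: float(r["pJ"]))
passing = [r["SNP"] for r in ranked if float(r["pJ"]) < THRESHOLD]

for r in ranked:
    print(r["SNP"], r["pJ"])
print("Below threshold:", passing)
```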

### Use P-value threshold of 4.59e-4

In [None]:
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
    
    # Select multiple associated SNPs using the p-value threshold 4.59e-4
    # Can change the p-value for significance if needed
    # Bfile is referring to the full dataset in plink binary format, e.g. GP2 - whatever you used to run the GWAS
    ! /home/jupyter/tools/gcta-1.94.1-linux-kernel-3-x86_64/gcta64 \
    --bfile {WORK_DIR}/{ancestry}_TMEM175 \
    --maf 0.01 \
    --cojo-file {WORK_DIR}/{ancestry}.all_adj.sumstats.ma \
    --cojo-p 4.59e-4 \
    --cojo-slct \
    --out {WORK_DIR}/{ancestry}.all_adj.ldprune.COJO

[0;32m[0m*******************************************************************
[0;32m[0m* Genome-wide Complex Trait Analysis (GCTA)
[0;32m[0m* version v1.94.1 Linux
[0;32m[0m* Built at Nov 15 2022 21:14:25, by GCC 8.5
[0;32m[0m* (C) 2010-present, Yang Lab, Westlake University
[0;32m[0m* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>
[0;32m[0m*******************************************************************
[0;32mAnalysis started [0mat 13:32:01 UTC on Tue Aug 20 2024.
[0;32m[0mHostname: jupyterlabvertexai20240708
[0;32m[0m
Accepted options:
--bfile /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AMR/AMR_TMEM175
--maf 0.01
--cojo-file /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AMR/AMR.all_adj.sumstats.ma
--cojo-p 0.000459
--cojo-slct
--out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AMR/AMR.all_adj.ldprune.COJO


Reading PLINK FAM file from [/home/jupyter/workspace/ws_files/TMEM175/TMEM175_AMR/AMR_TMEM175.fam].
646 individuals to be included f
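
The loop above overwrites `WORK_DIR` on every iteration, so the variable ends up pointing at the last ancestry's folder and has to be reset afterwards. One way to avoid this is to derive the per-ancestry paths from a fixed base directory. This is a sketch under assumptions: `BASE_DIR`, `ANCESTRIES` and `cojo_inputs` are hypothetical names, and the file layout is the one used in the loop above.

```python
from pathlib import Path

# Fixed base directory (same location as WORK_DIR); never reassigned
BASE_DIR = Path("~/workspace/ws_files/TMEM175").expanduser()

# Sorted list instead of a set, so the processing order is reproducible
ANCESTRIES = sorted({'AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS',
                     'EUR', 'FIN', 'MDE', 'SAS', 'CAH'})

def cojo_inputs(ancestry, base_dir=BASE_DIR):
    """Return the per-ancestry COJO input and output paths (hypothetical helper)."""
    anc_dir = Path(base_dir) / f"TMEM175_{ancestry}"
    return {
        "bfile": anc_dir / f"{ancestry}_TMEM175",
        "ma": anc_dir / f"{ancestry}.all_adj.sumstats.ma",
        "out": anc_dir / f"{ancestry}.all_adj.ldprune.COJO",
    }

for ancestry in ANCESTRIES:
    paths = cojo_inputs(ancestry)
    # The PLINK prefix has no extension, so test for the .bed file
    ready = Path(f"{paths['bfile']}.bed").exists() and paths["ma"].exists()
    print(ancestry, "ready" if ready else "missing inputs")
```

Checking for the inputs first makes it obvious when an ancestry is skipped because its summary statistics have not been formatted yet.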

In [None]:
WORK_DIR = "~/workspace/ws_files/TMEM175/"

In [None]:
# Prepare file for COJO
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    # Read in summary statistics
    sumstats = pd.read_csv(f'{WORK_DIR}/TMEM175_{ancestry}/{ancestry}.all_adj.csv')
    
    # Format summary statistics for GCTA-COJO
    # First get the log odds ratio - this is required for COJO
    # 1) For a case-control study, the effect size should be log (odds ratio) with its corresponding standard error.
    sumstats_formatted = sumstats.copy()
    sumstats_formatted['b'] = np.log(sumstats_formatted['OR'])
    
    # Now select just the necessary columns for COJO
    sumstats_export = sumstats_formatted[['ID', 'A1', 'OMITTED', 'A1_FREQ', 'b', 'LOG(OR)_SE', 'P', 'OBS_CT']].copy()
    
    # Rename columns following COJO format
    sumstats_export = sumstats_export.rename(columns = {'ID':'SNP', 'OMITTED':'A2', 'A1_FREQ':'freq', 'LOG(OR)_SE':'se', 'P':'p', 'OBS_CT':'N'})
    
    # Export
    sumstats_export.to_csv(f'{WORK_DIR}/TMEM175_{ancestry}/{ancestry}.all_adj.sumstats.ma', sep = '\t', index=False)
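
The conversion above can be checked on a single variant. The sketch below uses the additive-test row for `chr4:882530:C:T` from the EAS GLM output shown later in this notebook and relies only on the standard library. It computes `b = log(OR)`, confirms that `b / se` reproduces PLINK's `Z_STAT`, and prints the row in the eight-column layout produced by the cell above.

```python
import math

# ADD-test row for chr4:882530:C:T, copied from the EAS GLM output in this notebook
row = {"ID": "chr4:882530:C:T", "A1": "T", "OMITTED": "C", "A1_FREQ": 0.278626,
       "OR": 1.079170, "LOG(OR)_SE": 0.071422, "Z_STAT": 1.066800,
       "P": 0.286062, "OBS_CT": 2489}

b = math.log(row["OR"])   # COJO effect size: log(odds ratio)
se = row["LOG(OR)_SE"]    # already on the log(OR) scale, so used unchanged

# Sanity check: the Wald statistic b / se should match PLINK's Z_STAT
assert math.isclose(b / se, row["Z_STAT"], rel_tol=1e-3)

header = ["SNP", "A1", "A2", "freq", "b", "se", "p", "N"]
fields = [row["ID"], row["A1"], row["OMITTED"], row["A1_FREQ"],
          round(b, 6), se, row["P"], row["OBS_CT"]]
ma_line = "\t".join(str(f) for f in fields)

print("\t".join(header))
print(ma_line)
```

If the check fails for a real file, the standard error is probably on the odds-ratio scale rather than the log scale and needs converting before it is passed to COJO.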

In [None]:
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
    
    # Select the top 10 independently associated SNPs (no p-value threshold is applied)
    # Can change the number of SNPs if needed
    # Bfile is referring to the full dataset in plink binary format, e.g. GP2 - whatever you used to run the GWAS
    ! /home/jupyter/tools/gcta-1.94.1-linux-kernel-3-x86_64/gcta64 \
    --bfile {WORK_DIR}/{ancestry}_TMEM175 \
    --maf 0.01 \
    --cojo-file {WORK_DIR}/{ancestry}.all_adj.sumstats.ma \
    --cojo-top-SNPs 10 \
    --out {WORK_DIR}/{ancestry}.all_adj.top10

[0;32m[0m*******************************************************************
[0;32m[0m* Genome-wide Complex Trait Analysis (GCTA)
[0;32m[0m* version v1.94.1 Linux
[0;32m[0m* Built at Nov 15 2022 21:14:25, by GCC 8.5
[0;32m[0m* (C) 2010-present, Yang Lab, Westlake University
[0;32m[0m* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>
[0;32m[0m*******************************************************************
[0;32mAnalysis started [0mat 13:04:39 UTC on Tue Aug 20 2024.
[0;32m[0mHostname: jupyterlabvertexai20240708
[0;32m[0m
Accepted options:
--bfile /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AMR/AMR_TMEM175
--maf 0.01
--cojo-file /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AMR/AMR.all_adj.sumstats.ma
--cojo-top-SNPs 10
--out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AMR/AMR.all_adj.top10


Reading PLINK FAM file from [/home/jupyter/workspace/ws_files/TMEM175/TMEM175_AMR/AMR_TMEM175.fam].
646 individuals to be included from [/home/jupyter

### Change format of association file

In [None]:
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    WORK_DIR = "~/workspace/ws_files/TMEM175/"
    print(f'WORKING ON: {ancestry}')
    
    df = pd.read_csv(f'{WORK_DIR}/TMEM175_{ancestry}/{ancestry}.all_adj.csv')
    
    # First just select relevant columns: chromosome, bp position and p-value
    export_ldassoc = df[['# CHROM', 'POS', 'P']].copy()
    
    # Rename the '# CHROM' column to drop the hash sign, which may confuse LDassoc
    export_ldassoc = export_ldassoc.rename(columns={'# CHROM': 'CHROM'}) 
    
    # Then export as a tab-separated, not comma-separated file
    export_ldassoc.to_csv(f'{WORK_DIR}/TMEM175_{ancestry}/{ancestry}.all_adj.formatted.tab', sep = '\t', index=False)

WORKING ON: AMR
WORKING ON: CAS
WORKING ON: MDE
WORKING ON: AJ
WORKING ON: AFR
WORKING ON: EAS
WORKING ON: FIN
WORKING ON: AAC
WORKING ON: CAH
WORKING ON: SAS
WORKING ON: EUR
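
The reformatting step can be illustrated without the workspace files. The sketch below applies the same column selection and renaming to one in-memory row using the standard library. The `# CHROM` label follows the cell above, and the row values are taken from the EAS GLM table in this notebook purely for illustration.

```python
import csv
import io

# One comma-separated row in the layout assumed for {ancestry}.all_adj.csv
csv_text = (
    "# CHROM,POS,ID,P\n"
    "4,882530,chr4:882530:C:T,0.286062\n"
)

reader = csv.DictReader(io.StringIO(csv_text))
out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")

# Header without the hash sign, then chromosome, position and p-value only
writer.writerow(["CHROM", "POS", "P"])
for r in reader:
    writer.writerow([r["# CHROM"], r["POS"], r["P"]])

print(out.getvalue())
```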


### Re-run and validate the GLM analysis for the EAS ancestry

In [None]:
WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_EAS'

! /home/jupyter/tools/plink2 \
--pfile {REL7_PATH}/imputed_genotypes/EAS/chr4_EAS_release7_vwb \
--chr 4 \
--from-bp 882387 \
--to-bp 1008656 \
--make-bed \
--out {WORK_DIR}/EAS_TMEM175

In [None]:
# ASSOC analysis

WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_EAS'
    
! /home/jupyter/tools/plink \
--bfile {WORK_DIR}/EAS_TMEM175 \
--keep {WORK_DIR}/EAS.samplestoKeep \
--assoc \
--allow-no-sex \
--ci 0.95 \
--maf 0.01 \
--out {WORK_DIR}/EAS_TMEM175.all

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/TMEM175/TMEM175_EAS/EAS_TMEM175.all.log.
Options in effect:
  --allow-no-sex
  --assoc
  --bfile /home/jupyter/workspace/ws_files/TMEM175/TMEM175_EAS/EAS_TMEM175
  --ci 0.95
  --keep /home/jupyter/workspace/ws_files/TMEM175/TMEM175_EAS/EAS.samplestoKeep
  --maf 0.01
  --out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_EAS/EAS_TMEM175.all

52223 MB RAM detected; reserving 26111 MB for main workspace.
3171 variants loaded from .bim file.
5167 people (3278 males, 1889 females) loaded from .fam.
5123 phenotype values loaded from .fam.
--keep: 5167 people remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 5167 founders and 0 nonfounders present.
Calculating allele frequencies...

In [None]:
# Look at assoc results
WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_EAS'
freq = pd.read_csv(f'{WORK_DIR}/EAS_TMEM175.all.assoc', sep=r'\s+')
freq

Unnamed: 0,CHR,SNP,BP,A1,F_A,F_U,A2,CHISQ,P,OR,SE,L95,U95
0,4,chr4:882530:C:T,882530,T,0.29880,0.28300,C,3.0680,0.07984,1.0800,0.04382,0.9909,1.1770
1,4,chr4:882611:T:C,882611,C,0.06649,0.07835,T,5.3640,0.02056,0.8379,0.07645,0.7213,0.9733
2,4,chr4:883018:G:A,883018,A,0.06649,0.07814,G,5.1880,0.02275,0.8403,0.07649,0.7233,0.9762
3,4,chr4:883483:G:A,883483,A,0.07547,0.06904,G,1.5690,0.21040,1.1010,0.07663,0.9472,1.2790
4,4,chr4:883624:C:T,883624,T,0.40600,0.41950,C,1.8970,0.16840,0.9456,0.04061,0.8733,1.0240
...,...,...,...,...,...,...,...,...,...,...,...,...,...
372,4,chr4:1007954:T:C,1007954,T,0.42810,0.42160,C,0.4384,0.50790,1.0270,0.04030,0.9490,1.1110
373,4,chr4:1008009:C:T,1008009,T,0.37580,0.37980,C,0.1604,0.68880,0.9831,0.04266,0.9042,1.0690
374,4,chr4:1008064:G:C,1008064,C,0.02647,0.02219,G,1.9000,0.16810,1.1980,0.13130,0.9263,1.5500
375,4,chr4:1008209:C:T,1008209,T,0.10880,0.11340,C,0.5450,0.46040,0.9539,0.06397,0.8415,1.0810
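
As a quick validation of the table above, the odds ratio and its 95% confidence interval can be recomputed from the allele frequencies in cases (`F_A`) and controls (`F_U`). The sketch below does this for row 0 (`chr4:882530:C:T`). Because the inputs are the rounded values printed in the table, the result is expected to agree with the reported `OR`, `L95` and `U95` only to about two or three decimals.

```python
import math

# Row 0 of the --assoc table above (chr4:882530:C:T)
f_a, f_u = 0.29880, 0.28300   # A1 frequency in cases / controls
se = 0.04382                  # reported standard error of log(OR)

# Allelic odds ratio: odds of A1 in cases divided by odds of A1 in controls
odds_ratio = (f_a / (1 - f_a)) / (f_u / (1 - f_u))

# 95% confidence interval on the log scale, then back-transformed
z = 1.959964                  # two-sided 95% normal quantile
l95 = math.exp(math.log(odds_ratio) - z * se)
u95 = math.exp(math.log(odds_ratio) + z * se)

print(round(odds_ratio, 3), round(l95, 3), round(u95, 3))
```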


In [None]:
# Save FREQ to csv
freq.to_csv(f'{WORK_DIR}/EAS.all_nonadj.csv')

In [None]:
# GLM analysis

WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_EAS'

! /home/jupyter/tools/plink2 \
--bfile {WORK_DIR}/EAS_TMEM175 \
--keep {WORK_DIR}/EAS.samplestoKeep \
--allow-no-sex \
--maf 0.01 \
--ci 0.95 \
--glm \
--covar {WORK_DIR}/EAS_covariate_file.txt \
--covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
--covar-variance-standardize \
--neg9-pheno-really-missing \
--out {WORK_DIR}/EAS_TMEM175.all_adj

PLINK v2.00a6LM 64-bit Intel (4 Jul 2024)      www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/TMEM175/TMEM175_EAS/EAS_TMEM175.all_adj.log.
Options in effect:
  --allow-no-sex
  --bfile /home/jupyter/workspace/ws_files/TMEM175/TMEM175_EAS/EAS_TMEM175
  --ci 0.95
  --covar /home/jupyter/workspace/ws_files/TMEM175/TMEM175_EAS/EAS_covariate_file.txt
  --covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5
  --covar-variance-standardize
  --glm
  --keep /home/jupyter/workspace/ws_files/TMEM175/TMEM175_EAS/EAS.samplestoKeep
  --maf 0.01
  --neg9-pheno-really-missing
  --out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_EAS/EAS_TMEM175.all_adj

Start time: Thu Aug 29 12:42:49 2024
Note: --allow-no-sex no longer has any effect.  (Missing-sex samples are
automatically excluded from association analysis when sex is a covariate, and
treated normally otherwise.)
52223 MiB RAM detected, ~50486 available

In [None]:
# Read in the PLINK GLM results
WORK_DIR = '~/workspace/ws_files/TMEM175/TMEM175_EAS'
assoc = pd.read_csv(f'{WORK_DIR}/EAS_TMEM175.all_adj.PHENO1.glm.logistic.hybrid', sep=r'\s+')
assoc



Unnamed: 0,#CHROM,POS,ID,REF,ALT,PROVISIONAL_REF?,A1,OMITTED,A1_FREQ,FIRTH?,TEST,OBS_CT,OR,LOG(OR)_SE,L95,U95,Z_STAT,P,ERRCODE
0,4,882530,chr4:882530:C:T,C,T,Y,T,C,0.278626,N,ADD,2489,1.079170,0.071422,0.938201,1.241320,1.066800,2.860620e-01,.
1,4,882530,chr4:882530:C:T,C,T,Y,T,C,0.278626,N,SEX,2489,0.804336,0.044091,0.737746,0.876936,-4.938340,7.878890e-07,.
2,4,882530,chr4:882530:C:T,C,T,Y,T,C,0.278626,N,AGE,2489,1.510690,0.047695,1.375870,1.658720,8.650160,5.142550e-18,.
3,4,882530,chr4:882530:C:T,C,T,Y,T,C,0.278626,N,PC1,2489,0.978250,0.044623,0.896327,1.067660,-0.492802,6.221530e-01,.
4,4,882530,chr4:882530:C:T,C,T,Y,T,C,0.278626,N,PC2,2489,0.997227,0.116869,0.793077,1.253930,-0.023759,9.810450e-01,.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3011,4,1008337,chr4:1008337:C:T,C,T,Y,C,T,0.419561,N,PC1,2505,0.974019,0.044508,0.892651,1.062800,-0.591460,5.542120e-01,.
3012,4,1008337,chr4:1008337:C:T,C,T,Y,C,T,0.419561,N,PC2,2505,0.980706,0.117453,0.779045,1.234570,-0.165874,8.682560e-01,.
3013,4,1008337,chr4:1008337:C:T,C,T,Y,C,T,0.419561,N,PC3,2505,0.679168,0.102293,0.555782,0.829945,-3.782130,1.554910e-04,.
3014,4,1008337,chr4:1008337:C:T,C,T,Y,C,T,0.419561,N,PC4,2505,0.562607,0.066139,0.494206,0.640475,-8.696510,3.422460e-18,.


In [None]:
# Keep only the additive (ADD) test rows: these are the per-variant results
assoc_add = assoc[assoc['TEST']=="ADD"]

In [None]:
# Save assoc_add to csv
assoc_add.to_csv(f'{WORK_DIR}/EAS.all_adj.csv')
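
The GLM output holds one row per variant and model term, which is why the filter on `TEST == "ADD"` is needed before the results are used downstream. The sketch below repeats that filter on three rows copied from the table above. As a further consistency check, it recomputes the two-sided p-value from `Z_STAT` under a normal approximation. `two_sided_p` is a hypothetical helper, not a PLINK function.

```python
import math

# Three rows from the GLM table above, reduced to the columns used here
rows = [
    {"ID": "chr4:882530:C:T", "TEST": "ADD", "Z_STAT": 1.066800, "P": 2.860620e-01},
    {"ID": "chr4:882530:C:T", "TEST": "SEX", "Z_STAT": -4.938340, "P": 7.878890e-07},
    {"ID": "chr4:882530:C:T", "TEST": "AGE", "Z_STAT": 8.650160, "P": 5.142550e-18},
]

# Covariate rows (SEX, AGE, PCs) describe the covariates, not the variant
add_rows = [r for r in rows if r["TEST"] == "ADD"]

def two_sided_p(z):
    """Two-sided normal p-value for a Wald Z statistic (hypothetical helper)."""
    return math.erfc(abs(z) / math.sqrt(2))

for r in add_rows:
    print(r["ID"], "reported P:", r["P"], "recomputed P:", two_sided_p(r["Z_STAT"]))
```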