# Extract variant carriers and perform annotation using GP2 WGS and CES data (PLINK files from Release 8)

## Exploring the Global Landscape of Rare Causal and Common High-Risk Variants in Parkinson’s Disease

`GP2 ❤️ Open Science 😍`

## Description:

This notebook contains the code and workflow used in the study: **“Exploring the Global Landscape of Rare Causal and Common High-Risk Variants in Parkinson’s Disease”**.

In this notebook we extract variant carriers and perform annotation using GP2 WGS and CES data (PLINK files from Release 8).

### Outline:

* **0. Set Up**

* **1. Install software and define paths**
    * 1.1. Install plink
    * 1.2. Install bcftools
    * 1.3. Install ANNOVAR
    * 1.4. Create working directory and set paths

* **2. Create and edit .bed file with genes of interest and genomic coordinates**

* **3. Extract variant carriers from WGS data**
    * AAC (as an example)

* **4. CES variant extraction**
    * AAC (as an example)

## 0. Set Up

In [None]:
## Use the os package to interact with the environment
import os

## Bring in Pandas for Dataframe functionality
import pandas as pd

import subprocess

## Numpy for basics
import numpy as np

## Use pathlib for file path manipulation
import pathlib

## Use StringIO for working with file contents
from io import StringIO

## Enable IPython to display matplotlib graphs
import matplotlib.pyplot as plt
%matplotlib inline

## Import the IPython HTML rendering for displaying links to Google Cloud Console
from IPython.display import display, HTML

## Import urllib modules for building URLs to Google Cloud Console
import urllib.parse

## BigQuery for querying data
from google.cloud import bigquery

## Import sys
import sys

## 1. Install software and define paths

### 1.1. Install plink

In [None]:
%%capture
%%bash

# Install plink 1.9
mkdir -p ~/tools
cd ~/tools/
if test -e ~/tools/plink1.9; then
    echo "Plink is already installed"
else
    echo "Plink is not installed"
    cd ~/tools/

    wget http://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20190304.zip 

    unzip -o plink_linux_x86_64_20190304.zip
    mv plink plink1.9
fi

In [None]:
%%bash

# chmod plink 1.9 to make sure you have permission to run the program
chmod u+x ~/tools/plink1.9

In [None]:
%%capture
%%bash

# Install plink 2.0
cd ~/tools/
if test -e ~/tools/plink2; then
    echo "Plink2 is already installed"
else
    echo "Plink2 is not installed"
    cd ~/tools/

    wget http://s3.amazonaws.com/plink2-assets/plink2_linux_x86_64_latest.zip

    unzip -o plink2_linux_x86_64_latest.zip
fi

In [None]:
%%bash

# chmod plink 2 to make sure you have permission to run the program
chmod u+x ~/tools/plink2

### 1.2. Install bcftools

In [None]:
%%capture
%%bash 

#Install bcftools
cd /home/jupyter/tools/

if test -e /home/jupyter/tools/bcftools; then
    echo "bcftools is already installed in /home/jupyter/tools/"
else
    echo -e "Downloading bcftools \n    -------"
    git clone --recurse-submodules https://github.com/samtools/htslib.git
    git clone https://github.com/samtools/bcftools.git
    cd bcftools
    make
    echo -e "\n bcftools cloned and built in /home/jupyter/tools/bcftools \n "

fi

In [None]:
%%bash

# chmod the bcftools binary (built inside the cloned repository) to make sure you have permission to run the program
chmod +x /home/jupyter/tools/bcftools/bcftools

### 1.3. Install ANNOVAR

In [None]:
%%capture
%%bash

# Install ANNOVAR: the download link below is provided after registering on the ANNOVAR website
# https://www.openbioinformatics.org/annovar/annovar_download_form.php

if test -e ~/tools/annovar ; then
    echo "annovar is already installed in ~/tools/annovar"
else
    echo "annovar is not installed"
    cd ~/tools

    wget http://www.openbioinformatics.org/annovar/download/0wgxR2rIVP/annovar.latest.tar.gz

    tar xvfz annovar.latest.tar.gz

fi

In [None]:
%%capture
%%bash

# Install ANNOVAR: Download resources for annotation

cd ~/tools/annovar
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar refGene humandb/
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar clinvar_20240917 humandb/
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar dbnsfp47a humandb/
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar gnomad41_genome humandb/

### 1.4. Create working directory and set paths

In [None]:
# Create a folder on your workspace
print("Making a working directory")
!mkdir -p /home/jupyter/workspace/ws_files/your_directory
!mkdir -p /home/jupyter/workspace/ws_files/your_directory/bed_files
!mkdir -p /home/jupyter/workspace/ws_files/your_directory/temp_files
!mkdir -p /home/jupyter/workspace/ws_files/your_directory/results

workdir="/home/jupyter/workspace/ws_files/your_directory"

## 2. Create and edit .bed file with genes of interest and genomic coordinates

#### Create a .txt file with the gene information (include all genes in which you want to extract variant carriers)
Use the following format: ```CHR START END GENE```

**CHR** refers to the chromosome the gene is located on (e.g., chr1)

**START** refers to the chromosomal position at which the gene starts

**END** refers to the chromosomal position at which the gene ends

**GENE** refers to the gene name (optional)

Use the Ensembl genome browser to obtain this information (https://useast.ensembl.org/index.html). Coordinates should be on GRCh38/hg38, the build used for the annotation steps in this notebook.

In [None]:
# Convert your .txt file into a .bed file
input_file = "/home/jupyter/workspace/ws_files/GP2_R8_CES_monogenic/docs/chrom_pos.txt"
output_file = "/home/jupyter/workspace/ws_files/GP2_R8_CES_monogenic/docs/chrom_pos.bed"

with open(input_file, "r") as infile, open(output_file, "w") as outfile:
    for line in infile:
        cleaned_line = "\t".join(line.strip().split())  # Replace spaces with actual tabs
        outfile.write(cleaned_line + "\n")

print("Conversion complete! File saved as:", output_file)

In [None]:
# Read in the bed file written by the previous cell
bed = pd.read_csv(output_file, sep='\t', header=None, names=['chr','start_bp','stop_bp','gene'])
bed
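
The extraction loop in Section 3 passes one BED file per chromosome (`{dir_bed}/{chrom}.bed`) to plink2, while the cells above produce a single combined file. A minimal sketch of one way to split it, assuming the column names used for `bed` above; the helper name `split_bed_by_chrom` and the toy coordinates are hypothetical and are not taken from the study:

```python
import pathlib
import tempfile

import pandas as pd


def split_bed_by_chrom(bed: pd.DataFrame, out_dir: str) -> list:
    """Write one tab-separated, header-less BED file per chromosome (hypothetical helper)."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # sort=False keeps chromosomes in the order they first appear in the table
    for chrom, rows in bed.groupby("chr", sort=False):
        path = out / f"{chrom}.bed"
        rows.to_csv(path, sep="\t", header=False, index=False)
        written.append(str(path))
    return written


# Toy table with placeholder coordinates (NOT real gene positions)
toy_bed = pd.DataFrame(
    {
        "chr": ["chr1", "chr1", "chr4"],
        "start_bp": [100, 500, 1000],
        "stop_bp": [200, 900, 2000],
        "gene": ["GENE_A", "GENE_B", "GENE_C"],
    }
)

toy_dir = tempfile.mkdtemp()
bed_paths = split_bed_by_chrom(toy_bed, toy_dir)
print(bed_paths)
```

With the real table, the call would be `split_bed_by_chrom(bed, dir_bed)`. Note that `--extract bed1` interprets coordinates as 1-based; `bed0` is the plink2 modifier for 0-based, UCSC-style BED files.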

## 3. Extract variant carriers from WGS data

### AAC (as an example)

In [None]:
# Find unique chromosomes in .bed files
unique_chr=bed['chr'].unique().tolist()
unique_chr

In [None]:
# Define all the variables

INPUT_PLINK_DIR="/home/jupyter/workspace/path/to/release8/wgs/deepvariant_joint_calling/plink"
PLINK2_PATH="/home/jupyter/tools/plink2"
dir_bed="/home/jupyter/workspace/ws_files/your_directory/bed_files"
TEMP_DIR="/home/jupyter/workspace/ws_files/your_directory/temp_files"
Ancestry="AAC"

# Make the output directory if it does not exist
os.makedirs(TEMP_DIR, exist_ok=True)

# Extract variants we need from each chromosome
for chrom in unique_chr:
    # Construct the command as a list
    # --mac 1 keeps only variants with a minor allele count of at least 1 in this
    # ancestry group, which drops monomorphic sites and keeps the files small
    # (no allele frequency filter is applied at this step)
    cmd = [
        PLINK2_PATH,
        "--pfile", f"{INPUT_PLINK_DIR}/{Ancestry}/{chrom}_{Ancestry}_release8",
        "--mac", "1",
        "--extract", "bed1", f"{dir_bed}/{chrom}.bed",
        "--make-pgen",
        "--out", f"{TEMP_DIR}/{chrom}"
    ]

    subprocess.run(cmd, check=True)
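
Because `subprocess.run(..., check=True)` stops at the first failing chromosome, it can help to confirm that every input fileset is present before starting the loop. A minimal sketch, assuming uncompressed `.pgen`/`.pvar`/`.psam` files; `missing_pfile_parts` is a hypothetical helper, not part of PLINK or of the original workflow, and the prefix below is a placeholder in a temporary directory:

```python
import pathlib
import tempfile


def missing_pfile_parts(prefix: str) -> list:
    """Return the .pgen/.pvar/.psam paths that do not exist for a PLINK 2 prefix."""
    return [
        f"{prefix}{ext}"
        for ext in (".pgen", ".pvar", ".psam")
        if not pathlib.Path(f"{prefix}{ext}").exists()
    ]


# Toy demonstration: create two of the three files and report the missing one
toy_dir = pathlib.Path(tempfile.mkdtemp())
toy_prefix = str(toy_dir / "chr1_AAC_release8")
for ext in (".pgen", ".pvar"):
    pathlib.Path(toy_prefix + ext).touch()

missing = missing_pfile_parts(toy_prefix)
missing_names = [pathlib.Path(m).name for m in missing]
print(missing_names)
```

In the loop above, the same check could be run on `f"{INPUT_PLINK_DIR}/{Ancestry}/{chrom}_{Ancestry}_release8"` for each chromosome before any plink2 call is made.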

In [None]:
%%bash

TEMP_DIR="/home/jupyter/workspace/ws_files/your_directory/temp_files"
Ancestry="AAC"

# Merge all the chrs and convert the final file to .vcf
# For merging across pfiles https://www.cog-genomics.org/plink/2.0/data

# List the files, sort by chromosome and remove .pgen from the filename
ls ${TEMP_DIR}/*.pgen | sort -V  | sed 's/\.pgen//g' > ${TEMP_DIR}/pfiles.list

~/tools/plink2 --pmerge-list ${TEMP_DIR}/pfiles.list \
                           --recode vcf \
                           --out ${TEMP_DIR}/${Ancestry}

In [None]:
%%bash
# bgzip the vcf
TEMP_DIR="/home/jupyter/workspace/ws_files/your_directory/temp_files"
Ancestry="AAC"

bgzip -c ${TEMP_DIR}/${Ancestry}.vcf > ${TEMP_DIR}/annovar_input_${Ancestry}.vcf.gz
tabix -p vcf ${TEMP_DIR}/annovar_input_${Ancestry}.vcf.gz

In [None]:
%%bash

workdir="/home/jupyter/workspace/ws_files/your_directory"
TEMP_DIR="/home/jupyter/workspace/ws_files/your_directory/temp_files"
Ancestry="AAC"

# Run ANNOVAR annotation
perl ~/tools/annovar/table_annovar.pl \
    ${TEMP_DIR}/annovar_input_${Ancestry}.vcf.gz \
    /home/jupyter/tools/annovar/humandb/ \
    -buildver hg38 \
    -out ${workdir}/results/wgs_mac1_final_${Ancestry}.annovar \
    -remove \
    -protocol refGene,gnomad41_genome,clinvar_20240917,dbnsfp47a \
    -operation g,f,f,f \
    -nastring . \
    -polish \
    -vcfinput

In [None]:
%%bash 

head /home/jupyter/workspace/ws_files/your_directory/results/wgs_mac1_final_AAC.annovar.hg38_multianno.txt

In [None]:
workdir="/home/jupyter/workspace/ws_files/your_directory"
# Show columns of multianno.txt output file
anno=pd.read_csv(f'{workdir}/results/wgs_mac1_final_AAC.annovar.hg38_multianno.txt',sep='\t',dtype=str)

# Find variant ID column
anno['Otherinfo6']

In [None]:
anno.columns.tolist()

In [None]:
# Select the columns to keep
basic_cols= anno.columns.tolist()[0:10]
additional_cols_to_keep=['Otherinfo6',
                         'gnomad41_genome_fafmax_faf95_max',
                         'CLNDN',
                         'CLNSIG',
                         'CADD_phred']
            
all_cols_to_keep= basic_cols+ additional_cols_to_keep
all_cols_to_keep

In [None]:
# Subset the columns and rename the variant ID column
# (renaming on a new object avoids pandas' SettingWithCopyWarning)
AAC_df = anno[all_cols_to_keep].rename(columns={'Otherinfo6': 'var_id'})

AAC_df

In [None]:
# Count occurrences of each value
value_counts = AAC_df["ExonicFunc.refGene"].value_counts()

# Display counts
print(value_counts)

#### 3.1 Filter variants

In [None]:
# Define the values you want to keep
keep_values = ["exonic", "splicing", "exonic;splicing"]  # Add more if needed

# Subset the dataframe
filtered_AAC_df = AAC_df[AAC_df["Func.refGene"].isin(keep_values)]

# Display the filtered dataframe
print (filtered_AAC_df.head())


In [None]:
# Filter out synonymous SNVs
filtered_AAC_df = filtered_AAC_df[filtered_AAC_df["ExonicFunc.refGene"] != "synonymous SNV"]

# Display the filtered dataframe
print (filtered_AAC_df.head())
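
The two filtering steps above can be checked on a small toy table before running them on the full annotation. This is a sketch under stated assumptions: the helper name `keep_coding_nonsynonymous` and the toy rows are hypothetical, while the column names and category labels are the ones used in the cells above.

```python
import pandas as pd


def keep_coding_nonsynonymous(df: pd.DataFrame) -> pd.DataFrame:
    """Keep exonic/splicing rows and drop synonymous SNVs (hypothetical helper)."""
    keep_values = ["exonic", "splicing", "exonic;splicing"]
    in_region = df["Func.refGene"].isin(keep_values)
    not_synonymous = df["ExonicFunc.refGene"] != "synonymous SNV"
    return df[in_region & not_synonymous].copy()


# Toy annotation rows (placeholder variant IDs, not real data)
toy_anno = pd.DataFrame(
    {
        "var_id": ["v1", "v2", "v3", "v4"],
        "Func.refGene": ["exonic", "intronic", "splicing", "exonic"],
        "ExonicFunc.refGene": ["nonsynonymous SNV", ".", ".", "synonymous SNV"],
    }
)

toy_filtered = keep_coding_nonsynonymous(toy_anno)
print(toy_filtered["var_id"].tolist())
```

The splicing row survives because its `ExonicFunc.refGene` value is the `.` placeholder set by `-nastring .`, which is not equal to `synonymous SNV`.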

In [None]:
# Save the filtered output
filtered_AAC_df.to_csv(f"{workdir}/results/filtered_multianno_AAC.tsv", sep="\t", index=False)

# Write out 'var_id' to extract from plink files
filtered_AAC_df['var_id'].to_csv(f"{workdir}/results/AAC_var_to_extract.txt",index=False,header=False)


#### 3.2 Extract carrier IDs and genotypes

In [None]:
%%bash
workdir="/home/jupyter/workspace/ws_files/your_directory"
TEMP_DIR="/home/jupyter/workspace/ws_files/your_directory/temp_files"
Ancestry="AAC"

~/tools/plink2 --pfile ${TEMP_DIR}/${Ancestry} \
               --extract ${workdir}/results/${Ancestry}_var_to_extract.txt \
               --recode A \
               --out ${TEMP_DIR}/${Ancestry}_geno
               

In [None]:
TEMP_DIR="/home/jupyter/workspace/ws_files/your_directory/temp_files"
Ancestry="AAC"

aac_var = pd.read_csv(f'{TEMP_DIR}/{Ancestry}_geno.raw', sep=r'\s+')
aac_var

In [None]:
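# Transpose the dataframe so that rows are variants and columns are samples
var_col = aac_var.columns[6:]  # genotype columns follow the six leading sample columns
d = aac_var.drop(columns=['FID','PAT','MAT','SEX','PHENOTYPE'])
sample = aac_var[['IID','PHENOTYPE']]

# Keep samples with at least one genotype value <= 1. Each .raw value is the count
# of the allele named in the column suffix (here the reference allele),
# so 1 = heterozygous carrier and 0 = homozygous carrier
t = d[(d[var_col] <= 1).any(axis=1)].T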
t.columns = t.iloc[0]
t=t.iloc[1:]
t.reset_index(inplace=True)

t

In [None]:
# Strip the trailing '_<ref_allele>' suffix so the variant ID matches the one in the annotation
t['index'] = t['index'].str.rsplit('_', n=1).str[0]
t.rename({'index':'var_id'},axis=1,inplace=True)
t
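
The suffix handling can be illustrated in isolation. A small sketch, assuming `.raw` column names of the form `<variant ID>_<counted allele>`; the IDs below are placeholders. Splitting from the right with `n=1` leaves variant IDs that themselves contain underscores intact:

```python
import pandas as pd

# Placeholder .raw-style column names: "<variant ID>_<counted allele>"
raw_cols = pd.Series(["chr1:100:A:G_A", "chr4:2000:C:T_C", "rs_example_1_G"])

# Split once, from the right, so only the final "_<allele>" is removed
parts = raw_cols.str.rsplit("_", n=1)
var_ids = parts.str[0]
counted_alleles = parts.str[1]

print(var_ids.tolist())
print(counted_alleles.tolist())
```

Keeping the counted allele alongside the ID is a cheap way to verify which allele the 0/1/2 values refer to before assigning zygosity.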

In [None]:
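# Genotype columns are the sample IDs (everything except the ID and carrier-list columns)
sample_cols = [c for c in t.columns if c not in ('var_id', 'hom_carrier', 'het_carrier')]
t['hom_carrier'] = t[sample_cols].apply(lambda row: row[row == 0].index.tolist(), axis=1)
t['het_carrier'] = t[sample_cols].apply(lambda row: row[row == 1].index.tolist(), axis=1)

# Store hom and het separately to later explode the dataframe correctly
hom = t[['var_id','hom_carrier']]
het = t[['var_id','het_carrier']]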

In [None]:
# Check hom as an example of what to expect
hom

In [None]:
# Split each carrier list so that every carrier ID gets its own row
hom = hom.explode('hom_carrier', ignore_index=True)

hom

In [None]:
# Keep just the variants with carriers
hom= hom.loc[~hom['hom_carrier'].isnull()]

hom

In [None]:
# Merge with annotation
out_hom = pd.merge(hom,filtered_AAC_df, on='var_id',how='left')
out_hom['zygosity'] = 'hom'

# Rename column
out_hom.rename({'hom_carrier':'carrier_id'},axis=1,inplace=True)

out_hom

In [None]:
# Repeat the same for het
het = het.explode('het_carrier', ignore_index=True)
het= het.loc[~het['het_carrier'].isnull()]

het

In [None]:
# Merge with annotation
out_het = pd.merge(het,filtered_AAC_df, on='var_id',how='left')
out_het['zygosity'] = 'het'

# Rename column
out_het.rename({'het_carrier':'carrier_id'},axis=1,inplace=True)

out_het
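
The apply-and-explode steps above produce one row per variant and carrier. As an alternative sketch (not the approach used in this notebook), the same long table can be derived with `DataFrame.melt`. The toy IDs are placeholders, and the 0 = hom, 1 = het coding follows the convention used in the cells above:

```python
import pandas as pd

# Toy .raw-style table: one row per sample, one column per variant.
# Coding as in the cells above: 0 = hom carrier, 1 = het carrier, 2 = non-carrier
toy_raw = pd.DataFrame(
    {
        "IID": ["S1", "S2", "S3"],
        "varA": [2, 1, 0],
        "varB": [1, 2, 2],
    }
)

# Reshape to one row per (sample, variant) pair
long_df = toy_raw.melt(id_vars="IID", var_name="var_id", value_name="dosage")
long_df = long_df.rename(columns={"IID": "carrier_id"})

# Keep carriers only and translate the dosage into a zygosity label
carriers = long_df[long_df["dosage"].isin([0, 1])].copy()
carriers["zygosity"] = carriers["dosage"].map({0: "hom", 1: "het"})
carriers = carriers.drop(columns="dosage").reset_index(drop=True)
print(carriers)
```

The resulting table has the same `var_id`, `carrier_id` and `zygosity` columns as `out_hom` and `out_het`, so it could be merged with the annotation in the same way.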

In [None]:
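# Check for potential compound heterozygotes: carriers with two or more
# heterozygous variants in the same gene (phase is not known from these data)
out_het[out_het.duplicated(["Gene.refGene","carrier_id"], keep=False)].sort_values(["Gene.refGene","carrier_id"])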

In [None]:
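# Label potential compound heterozygotes
# (a boolean mask returns an empty dataframe when no carrier qualifies,
# whereas pd.concat raises a ValueError when given no groups)
comphet = out_het[out_het.duplicated(["Gene.refGene","carrier_id"], keep=False)].copy()
comphet['zygosity'] = 'comphet'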
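
The candidate selection can be checked on toy data. This sketch uses `DataFrame.duplicated(..., keep=False)`, which yields an empty dataframe when no gene and carrier pair repeats, whereas passing an empty generator to `pd.concat` raises a `ValueError`. All IDs and gene names are placeholders:

```python
import pandas as pd

# Toy heterozygous calls: S1 carries two variants in GENE_A
toy_het = pd.DataFrame(
    {
        "var_id": ["v1", "v2", "v3", "v4"],
        "carrier_id": ["S1", "S1", "S2", "S1"],
        "Gene.refGene": ["GENE_A", "GENE_A", "GENE_A", "GENE_B"],
    }
)

# Flag every row whose (gene, carrier) pair occurs more than once
is_multi_het = toy_het.duplicated(["Gene.refGene", "carrier_id"], keep=False)
toy_comphet = toy_het[is_multi_het].copy()
toy_comphet["zygosity"] = "comphet"
print(toy_comphet)

# With no repeated pair the result is simply empty, with no exception raised
subset = toy_het.iloc[2:]
no_hits = subset[subset.duplicated(["Gene.refGene", "carrier_id"], keep=False)]
print(len(no_hits))
```

Rows flagged this way are candidates only: without phasing, two heterozygous variants in one gene may sit on the same haplotype.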

In [None]:
# Get everything and write out
merged_df_AAC = pd.concat([out_het,out_hom,comphet],axis=0)

# Save the dataset
merged_df_AAC.to_csv(f"{workdir}/results/results_AAC.tsv",sep='\t',index=False)
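
Since the combined table stacks `out_het`, `out_hom` and `comphet`, carriers labeled `comphet` also appear in the `het` rows. A sketch of a per-gene summary that counts unique carriers rather than rows, using hypothetical toy values in place of `merged_df_AAC`:

```python
import pandas as pd

# Toy combined results (placeholder IDs): S1 appears as het and as comphet
toy_results = pd.DataFrame(
    {
        "carrier_id": ["S1", "S1", "S2", "S1", "S1"],
        "var_id": ["v1", "v2", "v3", "v1", "v2"],
        "Gene.refGene": ["GENE_A"] * 5,
        "zygosity": ["het", "het", "hom", "comphet", "comphet"],
    }
)

# Count distinct carriers per gene and zygosity label
summary = (
    toy_results.groupby(["Gene.refGene", "zygosity"])["carrier_id"]
    .nunique()
    .rename("n_carriers")
    .reset_index()
)
print(summary)
```

Counting with `nunique` avoids inflating totals when one carrier contributes several rows for the same gene.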

### Now repeat the same steps for all other ancestries (AFR, AJ, AMR, CAH, CAS, EAS, EUR, FIN, MDE, SAS)!
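
One way to avoid copying the cells for each ancestry is to build the per-chromosome commands in a loop. A minimal sketch, assuming the same file naming pattern holds for every ancestry; `build_extract_cmd` is a hypothetical helper, the paths are placeholders, and writing outputs into a per-ancestry subfolder is an assumption that differs from the cells above:

```python
ANCESTRIES = ["AAC", "AFR", "AJ", "AMR", "CAH", "CAS", "EAS", "EUR", "FIN", "MDE", "SAS"]


def build_extract_cmd(plink2, input_dir, bed_dir, temp_dir, ancestry, chrom):
    """Build the per-chromosome plink2 extraction command for one ancestry (hypothetical helper)."""
    return [
        plink2,
        "--pfile", f"{input_dir}/{ancestry}/{chrom}_{ancestry}_release8",
        "--mac", "1",
        "--extract", "bed1", f"{bed_dir}/{chrom}.bed",
        "--make-pgen",
        # Assumption: a per-ancestry subfolder keeps outputs from overwriting each other
        "--out", f"{temp_dir}/{ancestry}/{chrom}",
    ]


# Placeholder paths and chromosomes; the commands are only built and printed here
cmds = [
    build_extract_cmd("plink2", "/path/to/plink", "/path/to/bed_files", "/path/to/temp_files", anc, chrom)
    for anc in ANCESTRIES
    for chrom in ["chr1", "chr4"]
]
print(len(cmds))
print(" ".join(cmds[0]))
```

Each command could then be passed to `subprocess.run(cmd, check=True)` as in Section 3, after creating the per-ancestry output folder with `os.makedirs`.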

## 4. CES variant extraction

### AAC (as an example)

In [None]:
# To extract variants from CES data, use the same script but update the input path to the genetic files
INPUT_PLINK_DIR="/home/jupyter/workspace/path/to/release8/clinical_exomes/deepvariant_joint_calling/plink"

In [None]:
# Create a folder on your workspace
print("Making a working directory")
!mkdir -p /home/jupyter/workspace/ws_files/GP2_R8_AAC_monogenic
!mkdir -p /home/jupyter/workspace/ws_files/GP2_R8_monogenic/bed_files
!mkdir -p /home/jupyter/workspace/ws_files/GP2_R8_AAC_monogenic/temp_files
!mkdir -p /home/jupyter/workspace/ws_files/GP2_R8_AAC_monogenic/results

workdir="/home/jupyter/workspace/ws_files/GP2_R8_AAC_monogenic"

In [None]:
# Find unique chromosomes in .bed files
unique_chr=bed['chr'].unique().tolist()
unique_chr

In [None]:
# Define all the variables

INPUT_PLINK_DIR="/home/jupyter/workspace/path/to/release8/clinical_exomes/deepvariant_joint_calling/plink"
PLINK2_PATH="/home/jupyter/tools/plink2"
dir_bed="/home/jupyter/workspace/ws_files/GP2_R8_monogenic/bed_files"
TEMP_DIR="/home/jupyter/workspace/ws_files/GP2_R8_AAC_monogenic/temp_files"
Ancestry="AAC"

# Make the output directory if it does not exist
os.makedirs(TEMP_DIR, exist_ok=True)

# Extract variants we need from each chromosome
for chrom in unique_chr:
    # Construct the command as a list
    # --mac 1 keeps only variants with a minor allele count of at least 1 in this
    # ancestry group, which drops monomorphic sites and keeps the files small
    # (no allele frequency filter is applied at this step)
    cmd = [
        PLINK2_PATH,
        "--pfile", f"{INPUT_PLINK_DIR}/{Ancestry}/{chrom}_{Ancestry}_release8",
        "--mac", "1",
        "--extract", "bed1", f"{dir_bed}/{chrom}.bed",
        "--make-pgen",
        "--out", f"{TEMP_DIR}/{chrom}"
    ]

    subprocess.run(cmd, check=True)

In [None]:
%%bash

TEMP_DIR="/home/jupyter/workspace/ws_files/GP2_R8_AAC_monogenic/temp_files"
Ancestry="AAC"

# Merge all the chrs and convert the final file to .vcf
# For merging across pfiles https://www.cog-genomics.org/plink/2.0/data

# List the files, sort by chromosome and remove .pgen from the filename
ls ${TEMP_DIR}/*.pgen | sort -V  | sed 's/\.pgen//g' > ${TEMP_DIR}/pfiles.list

~/tools/plink2 --pmerge-list ${TEMP_DIR}/pfiles.list \
                           --recode vcf \
                           --out ${TEMP_DIR}/${Ancestry}

In [None]:
%%bash
# bgzip the vcf
TEMP_DIR="/home/jupyter/workspace/ws_files/GP2_R8_AAC_monogenic/temp_files"
Ancestry="AAC"

bgzip -c ${TEMP_DIR}/${Ancestry}.vcf > ${TEMP_DIR}/annovar_input_${Ancestry}.vcf.gz
tabix -p vcf ${TEMP_DIR}/annovar_input_${Ancestry}.vcf.gz

In [None]:
%%bash

TEMP_DIR="/home/jupyter/workspace/ws_files/GP2_R8_AAC_monogenic/temp_files"

# Run ANNOVAR annotation
perl ~/tools/annovar/table_annovar.pl \
    /home/jupyter/workspace/ws_files/GP2_R8_AAC_monogenic/temp_files/annovar_input_AAC.vcf.gz \
    /home/jupyter/tools/annovar/humandb/ \
    -buildver hg38 \
    -out /home/jupyter/workspace/ws_files/GP2_R8_AAC_monogenic/results/wgs_mac1_final_AAC.annovar \
    -remove \
    -protocol refGene,gnomad41_genome,clinvar_20240917,dbnsfp47a \
    -operation g,f,f,f \
    -nastring . \
    -polish \
    -vcfinput

In [None]:
%%bash 

head /home/jupyter/workspace/ws_files/GP2_R8_AAC_monogenic/results/wgs_mac1_final_AAC.annovar.hg38_multianno.txt

In [None]:
workdir="/home/jupyter/workspace/ws_files/GP2_R8_AAC_monogenic"
# Show columns of multianno.txt output file
anno=pd.read_csv(f'{workdir}/results/wgs_mac1_final_AAC.annovar.hg38_multianno.txt',sep='\t',dtype=str)

# Find variant ID column
anno['Otherinfo6']

In [None]:
anno.columns.tolist()

In [None]:
# Select the columns to keep
basic_cols= anno.columns.tolist()[0:10]
additional_cols_to_keep=['Otherinfo6',
                         'gnomad41_genome_fafmax_faf95_max',
                         'CLNDN',
                         'CLNSIG',
                         'CADD_phred']
            
all_cols_to_keep= basic_cols+ additional_cols_to_keep
all_cols_to_keep

In [None]:
# Subset the columns and rename the variant ID column
# (renaming on a new object avoids pandas' SettingWithCopyWarning)
AAC_df = anno[all_cols_to_keep].rename(columns={'Otherinfo6': 'var_id'})

AAC_df

In [None]:
# Count occurrences of each value
value_counts = AAC_df["ExonicFunc.refGene"].value_counts()

# Display counts
print(value_counts)

#### 4.1 Filter variants

In [None]:
# Define the values you want to keep
keep_values = ["exonic", "splicing", "exonic;splicing"]  # Add more if needed

# Subset the dataframe
filtered_AAC_df = AAC_df[AAC_df["Func.refGene"].isin(keep_values)]

# Display the filtered dataframe
print (filtered_AAC_df.head())


In [None]:
# Filter out synonymous SNVs
filtered_AAC_df = filtered_AAC_df[filtered_AAC_df["ExonicFunc.refGene"] != "synonymous SNV"]

# Display the filtered dataframe
print (filtered_AAC_df.head())

In [None]:
# Save the filtered output
filtered_AAC_df.to_csv(f"{workdir}/results/filtered_multianno_AAC.tsv", sep="\t", index=False)

# Write out 'var_id' to extract from plink files
filtered_AAC_df['var_id'].to_csv(f"{workdir}/results/AAC_var_to_extract.txt",index=False,header=False)


#### 4.2 Extract carrier IDs and genotypes

Extract only the variants of interest identified in the ANNOVAR annotation, starting from the merged PLINK files generated above.

In [None]:
%%bash
workdir="/home/jupyter/workspace/ws_files/GP2_R8_AAC_monogenic"
TEMP_DIR="/home/jupyter/workspace/ws_files/GP2_R8_AAC_monogenic/temp_files"
Ancestry="AAC"

~/tools/plink2 --pfile ${TEMP_DIR}/${Ancestry} \
               --extract ${workdir}/results/${Ancestry}_var_to_extract.txt \
               --recode A \
               --out ${TEMP_DIR}/${Ancestry}_geno
               

In [None]:
TEMP_DIR="/home/jupyter/workspace/ws_files/GP2_R8_AAC_monogenic/temp_files"
Ancestry="AAC"

aac_var = pd.read_csv(f'{TEMP_DIR}/{Ancestry}_geno.raw', sep=r'\s+')
aac_var

From this point, similar code can be run for NBA data to collect variant carriers.

In [None]:
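# Transpose the dataframe so that rows are variants and columns are samples
var_col = aac_var.columns[6:]  # genotype columns follow the six leading sample columns
d = aac_var.drop(columns=['FID','PAT','MAT','SEX','PHENOTYPE'])
sample = aac_var[['IID','PHENOTYPE']]

# Keep samples with at least one genotype value <= 1. Each .raw value is the count
# of the allele named in the column suffix (here the reference allele),
# so 1 = heterozygous carrier and 0 = homozygous carrier
t = d[(d[var_col] <= 1).any(axis=1)].T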
t.columns = t.iloc[0]
t=t.iloc[1:]
t.reset_index(inplace=True)

t

In [None]:
# Strip the trailing '_<ref_allele>' suffix so the variant ID matches the one in the annotation
t['index'] = t['index'].str.rsplit('_', n=1).str[0]
t.rename({'index':'var_id'},axis=1,inplace=True)
t

In [None]:
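# Genotype columns are the sample IDs (everything except the ID and carrier-list columns)
sample_cols = [c for c in t.columns if c not in ('var_id', 'hom_carrier', 'het_carrier')]
t['hom_carrier'] = t[sample_cols].apply(lambda row: row[row == 0].index.tolist(), axis=1)
t['het_carrier'] = t[sample_cols].apply(lambda row: row[row == 1].index.tolist(), axis=1)

# Store hom and het separately to later explode the dataframe correctly
hom = t[['var_id','hom_carrier']]
het = t[['var_id','het_carrier']]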



In [None]:
# Check hom as an example of what to expect
hom

In [None]:
# Split each carrier list so that every carrier ID gets its own row
hom = hom.explode('hom_carrier', ignore_index=True)

hom

In [None]:
# Keep just the variants with carriers
hom= hom.loc[~hom['hom_carrier'].isnull()]

hom

In [None]:
# Merge with annotation
out_hom = pd.merge(hom,filtered_AAC_df, on='var_id',how='left')
out_hom['zygosity'] = 'hom'

# Rename column
out_hom.rename({'hom_carrier':'carrier_id'},axis=1,inplace=True)

out_hom

In [None]:
# Repeat the same for het
het = het.explode('het_carrier', ignore_index=True)
het= het.loc[~het['het_carrier'].isnull()]

het

In [None]:
# Merge with annotation
out_het = pd.merge(het,filtered_AAC_df, on='var_id',how='left')
out_het['zygosity'] = 'het'

# Rename column
out_het.rename({'het_carrier':'carrier_id'},axis=1,inplace=True)

out_het


In [None]:
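# Check for potential compound heterozygotes: carriers with two or more
# heterozygous variants in the same gene (phase is not known from these data)
out_het[out_het.duplicated(["Gene.refGene","carrier_id"], keep=False)].sort_values(["Gene.refGene","carrier_id"])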


In [None]:
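# Label potential compound heterozygotes
# (a boolean mask returns an empty dataframe when no carrier qualifies,
# whereas pd.concat raises a ValueError when given no groups)
comphet = out_het[out_het.duplicated(["Gene.refGene","carrier_id"], keep=False)].copy()
comphet['zygosity'] = 'comphet'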

In [None]:
# Get everything and write out
merged_df_AAC = pd.concat([out_het,out_hom,comphet],axis=0)

# Save the dataset
merged_df_AAC.to_csv(f"{workdir}/results/merged_genotypes_AAC_mac1.tsv",sep='\t',index=False)

### Now repeat the same steps for all other ancestries (AFR, AJ, AMR, CAH, CAS, EAS, EUR, FIN, MDE, SAS)!