A large (~1 GB) variant file was first filtered with bcftools for the ERAP2 genomic positions, and its metadata header was removed. After this processing in Ubuntu, the file was converted to a tabular CSV format that retains the essential variant information for downstream haplotype assembly, statistical analysis and visualization in Python.
**A tab-delimited variant file (ERAP2_mod.csv) was read in 20,000-row chunks using pandas (v2.1.4) in Python 3. Chunks were concatenated into a single DataFrame to enable memory-efficient processing.**

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
mylist=[]
for chunk in pd.read_csv('/hpc/dla_lti/araja/hapsnew/ERAP2homedir/ERAP2_mod.csv', chunksize=20000, sep="\t"):
    mylist.append(chunk)
big_dataERAP2=pd.concat(mylist, axis=0)
del mylist
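
Concatenating every chunk rebuilds the full table in memory, so peak usage drops mainly when each chunk is trimmed before it is kept. A minimal sketch of that idea follows, using a small in-memory toy table in place of ERAP2_mod.csv. The sample column `S1`, the toy rows and the list of unwanted columns are assumptions for illustration only.

```python
import io
import pandas as pd

# Toy tab-delimited, VCF-like table standing in for ERAP2_mod.csv (hypothetical rows)
toy_tsv = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n"
    "5\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\n"
    "5\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1|1\n"
    "5\t300\t.\tG\tA\t.\tPASS\t.\tGT\t0|0\n"
)

unwanted = ['#CHROM', 'ID', 'QUAL', 'FILTER', 'INFO', 'FORMAT']
chunks = []
# chunksize=2 is used only to force several chunks on this tiny example
for chunk in pd.read_csv(io.StringIO(toy_tsv), sep="\t", chunksize=2):
    # Drop the unneeded columns per chunk, before the chunk is stored
    chunks.append(chunk.drop(columns=unwanted))

toy_df = pd.concat(chunks, ignore_index=True)
print(toy_df.columns.tolist())
```

With this pattern the unneeded columns never accumulate across chunks, which may matter for a file with many sample columns.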

**Checking for duplicated rows, if any**

In [None]:
big_dataERAP2[big_dataERAP2.duplicated(keep=False)]
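
The cell above only displays rows that occur more than once. If any are found, they could be removed with `DataFrame.drop_duplicates`. This is a minimal sketch on a toy frame whose rows are invented for illustration:

```python
import pandas as pd

# Toy frame with one fully duplicated row (hypothetical values)
toy = pd.DataFrame({
    'POS': [100, 100, 200],
    'REF': ['A', 'A', 'C'],
    'ALT': ['G', 'G', 'T'],
})

# keep=False marks every copy of a duplicated row, which is useful for inspection
dupes = toy[toy.duplicated(keep=False)]

# keep='first' retains one copy of each duplicated row
deduped = toy.drop_duplicates(keep='first').reset_index(drop=True)
print(len(dupes), len(deduped))
```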

In [None]:
##Removing the columns that are not required
big_dataERAP2 = big_dataERAP2.drop(
    columns=['QUAL', 'FILTER', 'INFO', '#CHROM', 'FORMAT', 'ID']
)

**Reading the variant frequency file calculated using PLINK 2.0**
plink2 --vcf ERAP2.vcf --freq --out Plink_ERAP2
awk '{print $1, $5}' Plink_ERAP2.afreq > Plink_ERAP2.csv




In [None]:
ERAP2_PLINK=pd.read_csv('/hpc/dla_lti/araja/haplotypes/Plink_ERAP2.csv')
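
By default, awk's `print $1, $5` joins the printed fields with a single space, so a file written that way is space-delimited even when it is named `.csv`. A minimal sketch of reading such a file follows, assuming a header of `POS` and `ALT_FREQS`. The toy content is hypothetical, and the real column names depend on the PLINK output and on any later processing of the file.

```python
import io
import pandas as pd

# Hypothetical space-delimited content standing in for Plink_ERAP2.csv
toy_txt = "POS ALT_FREQS\n100 0.12\n200 0.03\n"

# sep=r"\s+" splits on runs of whitespace instead of commas
freq = pd.read_csv(io.StringIO(toy_txt), sep=r"\s+")
print(freq)
```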

In [None]:
###Add frequency data to the ERAP2 genotype file###
# Left-join on the shared POS column so that every genotype row is kept
ERAP2AlleleFreqPlinkMerge = big_dataERAP2.merge(ERAP2_PLINK, how='left', on='POS')
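
A left merge keeps every genotype row even when no frequency is available for its position, so it can be worth checking which positions went unmatched. A minimal sketch, using the `validate` and `indicator` options of `DataFrame.merge` on two toy frames (values and the `ALT_FREQS` column name are assumptions for illustration):

```python
import pandas as pd

# Toy genotype and frequency tables (hypothetical values)
geno = pd.DataFrame({'POS': [100, 200, 300], 'REF': ['A', 'C', 'G']})
freq = pd.DataFrame({'POS': [100, 200], 'ALT_FREQS': [0.12, 0.03]})

# validate='m:1' raises an error if POS is not unique in the frequency table;
# indicator=True adds a '_merge' column recording where each row came from
merged = geno.merge(freq, on='POS', how='left', validate='m:1', indicator=True)

# Rows present only in the genotype table have no frequency attached
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched['POS'].tolist())
```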

**The gnomAD file for ERAP2 variants was downloaded, filtered for the variants of interest using the allele frequency reported for the European population, and saved as ERAP2_var_GnomAD.csv**

In [None]:
ERAP2_gnomeAD=pd.read_csv('/hpc/dla_lti/araja/haplotypes/ERAP2_var_GnomAD.csv')

In [None]:
###Adding rsIDs annotation etc by merging GnomAD data####
df1 = ERAP2AlleleFreqPlinkMerge
df2 = ERAP2_gnomeAD
ERAP2_Plink_GnomADmerge = df1.merge(df2, on='POS', how='left')

In [None]:
####Keep only the missense variants and the splice variant just below exon 10, and filter for MAF >= 1/10000####
# Annotation classes to discard
Exclude = ['synonymous_variant','5_prime_UTR_variant', '3_prime_UTR_variant', 'intron_variant', 'stop_gained', 'frameshift_variant']
ERAP2_Plink_GnomADmerge1 = ERAP2_Plink_GnomADmerge[~ERAP2_Plink_GnomADmerge.VEP_Annotation.isin(Exclude)]
# Keep variants with an allele frequency of at least 1 in 10,000
ERAP2_Plink_GnomADmerge2=ERAP2_Plink_GnomADmerge1[ERAP2_Plink_GnomADmerge1.Allele_Freq >= 0.0001]
# Drop positions that have no gnomAD annotation
ERAP2_Plink_GnomADmerge2=ERAP2_Plink_GnomADmerge2.dropna(subset=['VEP_Annotation'])
# Display the missense variants among the retained rows
ERAP2_Plink_GnomADmerge2.loc[ERAP2_Plink_GnomADmerge2['VEP_Annotation'] == 'missense_variant']
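
The filter above works by exclusion, so any annotation that is not named in `Exclude` passes through. A minimal sketch of the same three steps on a toy table shows which rows survive. The rows, including the `splice_region_variant` label, are invented for illustration.

```python
import pandas as pd

# Toy annotated table (hypothetical positions, annotations and frequencies)
toy = pd.DataFrame({
    'POS': [100, 200, 300, 400, 500],
    'VEP_Annotation': ['missense_variant', 'intron_variant',
                       'splice_region_variant', None, 'missense_variant'],
    'Allele_Freq': [0.25, 0.40, 0.01, 0.30, 0.00001],
})

exclude = ['synonymous_variant', '5_prime_UTR_variant', '3_prime_UTR_variant',
           'intron_variant', 'stop_gained', 'frameshift_variant']

keep = (
    ~toy['VEP_Annotation'].isin(exclude)   # not in an excluded class
    & (toy['Allele_Freq'] >= 0.0001)       # frequency of at least 1 in 10,000
    & toy['VEP_Annotation'].notna()        # annotation must be present
)
filtered = toy[keep]
print(filtered['POS'].tolist())
```

Combining the conditions into one boolean mask gives the same rows as the step-by-step version and avoids the intermediate DataFrames.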

In [None]:
###Save the file for further analysis###
# Write mode (the default) is used so that re-running the cell does not append duplicate rows and headers
ERAP2_Plink_GnomADmerge2.to_csv('ERAP2_Full_new_March23_final.csv', sep=",", index=False)