## Variant data selection and preprocessing (__ClinVar__)

### Load and filter ClinVar missense variants

In [2]:
# FTP site:              https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/
# Downloaded file:       variant_summary.txt.gz	last modified: 2025-02-09

import pandas as pd
import numpy as np
import re
import warnings
from Bio.Data import IUPACData

warnings.filterwarnings("ignore")

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', None)

### Load data

In [85]:
data = pd.read_csv('../data/clinvar/variant_summary.txt', sep='\t')

In [86]:
data.head()

Unnamed: 0,#AlleleID,Type,Name,GeneID,GeneSymbol,HGNC_ID,ClinicalSignificance,ClinSigSimple,LastEvaluated,RS# (dbSNP),nsv/esv (dbVar),RCVaccession,PhenotypeIDS,PhenotypeList,Origin,OriginSimple,Assembly,ChromosomeAccession,Chromosome,Start,Stop,ReferenceAllele,AlternateAllele,Cytogenetic,ReviewStatus,NumberSubmitters,Guidelines,TestedInGTR,OtherIDs,SubmitterCategories,VariationID,PositionVCF,ReferenceAlleleVCF,AlternateAlleleVCF,SomaticClinicalImpact,SomaticClinicalImpactLastEvaluated,ReviewStatusClinicalImpact,Oncogenicity,OncogenicityLastEvaluated,ReviewStatusOncogenicity,SCVsForAggregateGermlineClassification,SCVsForAggregateSomaticClinicalImpact,SCVsForAggregateOncogenicityClassification
0,15041,Indel,NM_014855.3(AP5Z1):c.80_83delinsTGCTGTAAACTGTAACTGTAAA (p.Arg27_Ile28delinsLeuLeuTer),9907,AP5Z1,HGNC:22197,Pathogenic,1,"Jun 25, 2024",397704705,-,RCV000000012|RCV004998069,"MONDO:MONDO:0013342,MedGen:C3150901,OMIM:613647,Orphanet:306511|MedGen:C3661900",Hereditary spastic paraplegia 48|not provided,germline;unknown,germline,GRCh37,NC_000007.13,7,4820844,4820847,na,na,7p22.1,"criteria provided, multiple submitters, no conflicts",3,-,N,"ClinGen:CA215070,OMIM:613653.0001",3,2,4820844,GGAT,TGCTGTAAACTGTAACTGTAAA,-,-,-,-,-,-,SCV001451119|SCV005622007,-,-
1,15041,Indel,NM_014855.3(AP5Z1):c.80_83delinsTGCTGTAAACTGTAACTGTAAA (p.Arg27_Ile28delinsLeuLeuTer),9907,AP5Z1,HGNC:22197,Pathogenic,1,"Jun 25, 2024",397704705,-,RCV000000012|RCV004998069,"MONDO:MONDO:0013342,MedGen:C3150901,OMIM:613647,Orphanet:306511|MedGen:C3661900",Hereditary spastic paraplegia 48|not provided,germline;unknown,germline,GRCh38,NC_000007.14,7,4781213,4781216,na,na,7p22.1,"criteria provided, multiple submitters, no conflicts",3,-,N,"ClinGen:CA215070,OMIM:613653.0001",3,2,4781213,GGAT,TGCTGTAAACTGTAACTGTAAA,-,-,-,-,-,-,SCV001451119|SCV005622007,-,-
2,15042,Deletion,NM_014855.3(AP5Z1):c.1413_1426del (p.Leu473fs),9907,AP5Z1,HGNC:22197,Pathogenic,1,"Jun 29, 2010",397704709,-,RCV000000013,"MONDO:MONDO:0013342,MedGen:C3150901,OMIM:613647,Orphanet:306511",Hereditary spastic paraplegia 48,germline,germline,GRCh37,NC_000007.13,7,4827361,4827374,na,na,7p22.1,no assertion criteria provided,1,-,N,"OMIM:613653.0002,ClinGen:CA215072",1,3,4827360,GCTGCTGGACCTGCC,G,-,-,-,-,-,-,SCV000020156,-,-
3,15042,Deletion,NM_014855.3(AP5Z1):c.1413_1426del (p.Leu473fs),9907,AP5Z1,HGNC:22197,Pathogenic,1,"Jun 29, 2010",397704709,-,RCV000000013,"MONDO:MONDO:0013342,MedGen:C3150901,OMIM:613647,Orphanet:306511",Hereditary spastic paraplegia 48,germline,germline,GRCh38,NC_000007.14,7,4787730,4787743,na,na,7p22.1,no assertion criteria provided,1,-,N,"OMIM:613653.0002,ClinGen:CA215072",1,3,4787729,GCTGCTGGACCTGCC,G,-,-,-,-,-,-,SCV000020156,-,-
4,15043,single nucleotide variant,NM_014630.3(ZNF592):c.3136G>A (p.Gly1046Arg),9640,ZNF592,HGNC:28986,Uncertain significance,0,"Jun 29, 2015",150829393,-,RCV000000014,"MONDO:MONDO:0033005,MedGen:C4551772,OMIM:251300,Orphanet:2065,Orphanet:83472",Galloway-Mowat syndrome 1,germline,germline,GRCh37,NC_000015.9,15,85342440,85342440,na,na,15q25.3,no assertion criteria provided,1,-,N,"ClinGen:CA210674,UniProtKB:Q92610#VAR_064583,OMIM:613624.0001",1,4,85342440,G,A,-,-,-,-,-,-,SCV000020157,-,-


In [87]:
data["GeneSymbol"].nunique()

38979

In [88]:
len(data)

6548628

In [89]:
for count, col in enumerate(data.columns, start=1):
    print(count, col)

1 #AlleleID
2 Type
3 Name
4 GeneID
5 GeneSymbol
6 HGNC_ID
7 ClinicalSignificance
8 ClinSigSimple
9 LastEvaluated
10 RS# (dbSNP)
11 nsv/esv (dbVar)
12 RCVaccession
13 PhenotypeIDS
14 PhenotypeList
15 Origin
16 OriginSimple
17 Assembly
18 ChromosomeAccession
19 Chromosome
20 Start
21 Stop
22 ReferenceAllele
23 AlternateAllele
24 Cytogenetic
25 ReviewStatus
26 NumberSubmitters
27 Guidelines
28 TestedInGTR
29 OtherIDs
30 SubmitterCategories
31 VariationID
32 PositionVCF
33 ReferenceAlleleVCF
34 AlternateAlleleVCF
35 SomaticClinicalImpact
36 SomaticClinicalImpactLastEvaluated
37 ReviewStatusClinicalImpact
38 Oncogenicity
39 OncogenicityLastEvaluated
40 ReviewStatusOncogenicity
41 SCVsForAggregateGermlineClassification
42 SCVsForAggregateSomaticClinicalImpact
43 SCVsForAggregateOncogenicityClassification


The following columns are the most relevant for the next steps:
- __AlleleID (1)__: Unique identifier for each allele in ClinVar.
- __GeneSymbol (5)__: Identifies the affected gene. 
- __ClinicalSignificance (7)__: Defines pathogenicity classification (e.g., benign, pathogenic, etc).
- __ClinSigSimple (8)__: Simplified numeric version of the previous column (1 if at least one classification is pathogenic or likely pathogenic, 0 otherwise).
- __LastEvaluated (9)__: Date of the last variant classification update.
- __ReviewStatus (25)__: Indicates the confidence level of the variant classification.
- __Chromosome (19)__ and __PositionVCF (32)__: Gives variant location.
- __ReferenceAlleleVCF (33)__ and __AlternateAlleleVCF (34)__: Specifies the genetic variant.

In [90]:
data[["#AlleleID", "GeneSymbol", "ClinicalSignificance", "ClinSigSimple", 
      "LastEvaluated", "ReviewStatus", "Chromosome", "PositionVCF", 
      "ReferenceAlleleVCF", "AlternateAlleleVCF"]].head()

Unnamed: 0,#AlleleID,GeneSymbol,ClinicalSignificance,ClinSigSimple,LastEvaluated,ReviewStatus,Chromosome,PositionVCF,ReferenceAlleleVCF,AlternateAlleleVCF
0,15041,AP5Z1,Pathogenic,1,"Jun 25, 2024","criteria provided, multiple submitters, no conflicts",7,4820844,GGAT,TGCTGTAAACTGTAACTGTAAA
1,15041,AP5Z1,Pathogenic,1,"Jun 25, 2024","criteria provided, multiple submitters, no conflicts",7,4781213,GGAT,TGCTGTAAACTGTAACTGTAAA
2,15042,AP5Z1,Pathogenic,1,"Jun 29, 2010",no assertion criteria provided,7,4827360,GCTGCTGGACCTGCC,G
3,15042,AP5Z1,Pathogenic,1,"Jun 29, 2010",no assertion criteria provided,7,4787729,GCTGCTGGACCTGCC,G
4,15043,ZNF592,Uncertain significance,0,"Jun 29, 2015",no assertion criteria provided,15,85342440,G,A


### Filter by criteria

### 1. Keep only SNVs (remove insertions, deletions, complex variants).

In [91]:
data['Type'].unique()

array(['Indel', 'Deletion', 'single nucleotide variant', 'Duplication',
       'Microsatellite', 'Insertion', 'Variation', 'Complex',
       'Translocation', 'Inversion', 'copy number gain', 'fusion',
       'copy number loss', 'protein only', 'Tandem duplication'],
      dtype=object)

In [92]:
data_filter1 = data[data.Type =='single nucleotide variant'].reset_index(drop=True)
len(data_filter1)

5944878

##### --> Parse "Name" column and filter variants

In [None]:
# convert the 3-letter Aa codes to 1-letter codes
three_to_one = IUPACData.protein_letters_3to1

print(three_to_one.keys())
print(three_to_one.values())

dict_keys(['Ala', 'Cys', 'Asp', 'Glu', 'Phe', 'Gly', 'His', 'Ile', 'Lys', 'Leu', 'Met', 'Asn', 'Pro', 'Gln', 'Arg', 'Ser', 'Thr', 'Val', 'Trp', 'Tyr'])
dict_values(['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'])


In [None]:
# function to extract the variant in 3-letter and 1-letter format
def extract_variant(name):
    # find protein variant descriptions in the Name column
    match = re.search(r'p\.([A-Za-z]+[0-9]+[A-Za-z]+)', name)    # e.g., p.Arg123His
    if match:
        # extract full variant string (e.g., "Arg123His")
        three_letter_variant = match.group(1)
        # keep the original 3-letter form so it can be returned alongside the 1-letter form
        variant_3letter = three_letter_variant

        # first 3 letters (original Aa)
        three_letter_from = three_letter_variant[:3]
        # extract the numeric position
        position = re.search(r'\d+', three_letter_variant).group()
        # last 3 letters (mutated Aa)
        three_letter_to = three_letter_variant[-3:]

        # convert 3-letter to 1-letter (e.g., "Arg123His" to "R123H")
        one_letter_from = three_to_one.get(three_letter_from.capitalize(), three_letter_from)
        one_letter_to = three_to_one.get(three_letter_to.capitalize(), three_letter_to)
        variant = f"{one_letter_from}{position}{one_letter_to}" 

        return variant_3letter, variant   # e.g., "Arg123His", "R123H"

    else:
        return None, None

In [None]:
# addition of 2 new cols
data_filter1[['Variant (3-letter)', 'Variant']] = data_filter1['Name'].apply(
    lambda x: pd.Series(extract_variant(x)))

In [96]:
data_filter1[["#AlleleID", "GeneSymbol", "Variant (3-letter)", "Variant"]].head()

Unnamed: 0,#AlleleID,GeneSymbol,Variant (3-letter),Variant
0,15043,ZNF592,Gly1046Arg,G1046R
1,15043,ZNF592,Gly1046Arg,G1046R
2,15044,FOXRED1,Gln232Ter,Q232Ter
3,15044,FOXRED1,Gln232Ter,Q232Ter
4,15045,FOXRED1,Asn430Ser,N430S
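
Row-wise `apply` with `pd.Series` tends to be slow on millions of rows, so the same parsing can be sketched in vectorized form. This is a minimal sketch and not the method used above: `extract_missense` and `THREE_TO_ONE` are hypothetical names, the mapping is written out so the block runs without Biopython, and the stricter pattern (exactly three letters on each side, enclosed in parentheses) is an assumption about how `Name` is formatted. The third example name is made up for illustration.

```python
import pandas as pd

# Standard 20 amino acids, 3-letter -> 1-letter (hypothetical stand-in for
# IUPACData.protein_letters_3to1 so the example has no Biopython dependency)
THREE_TO_ONE = {
    'Ala': 'A', 'Cys': 'C', 'Asp': 'D', 'Glu': 'E', 'Phe': 'F',
    'Gly': 'G', 'His': 'H', 'Ile': 'I', 'Lys': 'K', 'Leu': 'L',
    'Met': 'M', 'Asn': 'N', 'Pro': 'P', 'Gln': 'Q', 'Arg': 'R',
    'Ser': 'S', 'Thr': 'T', 'Val': 'V', 'Trp': 'W', 'Tyr': 'Y',
}

def extract_missense(names: pd.Series) -> pd.DataFrame:
    """Vectorized extraction of '(p.Xxx123Yyy)' from a Series of ClinVar names."""
    parts = names.str.extract(r'\(p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\)')
    parts.columns = ['ref3', 'pos', 'alt3']

    out = pd.DataFrame(index=names.index)
    out['Variant (3-letter)'] = parts['ref3'] + parts['pos'] + parts['alt3']
    # codes outside the 20 standard amino acids (e.g. 'Ter') map to NaN,
    # so the 1-letter variant is NaN for anything that is not a missense change
    out['Variant'] = (parts['ref3'].map(THREE_TO_ONE)
                      + parts['pos']
                      + parts['alt3'].map(THREE_TO_ONE))
    return out

names = pd.Series([
    'NM_014630.3(ZNF592):c.3136G>A (p.Gly1046Arg)',
    'NM_014855.3(AP5Z1):c.1413_1426del (p.Leu473fs)',
    'NM_000000.0(GENE):c.694C>T (p.Gln232Ter)',   # hypothetical name, for illustration only
])
example = extract_missense(names)
print(example)
```

With this variant of the parser, `example['Variant'].notna()` plays the role of the validity check, because stop codons and non-standard codes never receive a 1-letter form.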


In [None]:
# function to check if both parts of the variant are valid Aas
def is_valid_variant(variant_3letter):
    valid_amino_acids = set(three_to_one.keys())  # set of valid 3-letter Aas

    if pd.isna(variant_3letter):
        return False

    # extract the 3-letter Aa codes (first 3 and last 3 chars)
    three_letter_from = variant_3letter[:3]
    three_letter_to = variant_3letter[-3:]
    
    # check if both are in the valid set and no "Ter" (stop codon)
    if (three_letter_from in valid_amino_acids and three_letter_to in valid_amino_acids 
        and 'Ter' not in variant_3letter):
        return True
    else:
        return False

In [98]:
# .copy() so that new columns can be added later without writing to a view of data_filter1
data_filter1_1 = data_filter1[data_filter1['Variant (3-letter)'].apply(is_valid_variant)].copy()

In [99]:
data_filter1_1[["#AlleleID", "GeneSymbol", "Variant (3-letter)", "Variant"]].head()

Unnamed: 0,#AlleleID,GeneSymbol,Variant (3-letter),Variant
0,15043,ZNF592,Gly1046Arg,G1046R
1,15043,ZNF592,Gly1046Arg,G1046R
4,15045,FOXRED1,Asn430Ser,N430S
5,15045,FOXRED1,Asn430Ser,N430S
6,15046,NUBPL,Gly56Arg,G56R


In [100]:
# invalid or ambiguous variants were removed
len(data_filter1_1)

3426329

### 2. Retain variants last evaluated in 2021 or later.

In [101]:
data_filter1_1[['LastEvaluated']].head()

Unnamed: 0,LastEvaluated
0,"Jun 29, 2015"
1,"Jun 29, 2015"
4,"Jun 06, 2024"
5,"Jun 06, 2024"
6,"Jul 05, 2022"


In [102]:
# convert LastEvaluated column to datetime format
data_filter1_1['LastEvaluated'] = pd.to_datetime(data_filter1_1['LastEvaluated'], errors='coerce')

# new column extracting the year and converting it to integer
data_filter1_1['LastEvaluated (Year)'] = data_filter1_1['LastEvaluated'].dt.year.astype('Int64')

In [103]:
data_filter1_1[['LastEvaluated', 'LastEvaluated (Year)']].head()

Unnamed: 0,LastEvaluated,LastEvaluated (Year)
0,2015-06-29,2015
1,2015-06-29,2015
4,2024-06-06,2024
5,2024-06-06,2024
6,2022-07-05,2022
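
The date conversion can also be made explicit about the expected format. The sketch below assumes that all dates follow the `"Jun 29, 2015"` pattern seen above and that missing dates appear as a placeholder such as `"-"`; both are assumptions about the file, and the small Series is made up for illustration.

```python
import pandas as pd

dates = pd.Series(["Jun 29, 2015", "Jun 06, 2024", "-", "Jul 05, 2022"])

# an explicit format avoids per-element format guessing;
# anything that does not match becomes NaT instead of raising
parsed = pd.to_datetime(dates, format="%b %d, %Y", errors="coerce")

# nullable integer dtype keeps missing years as <NA> rather than float NaN
years = parsed.dt.year.astype("Int64")

# rows without a date are treated as not recent
recent = (years >= 2021).fillna(False)
print(pd.DataFrame({"date": parsed, "year": years, "recent": recent}))
```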


In [104]:
# filter variants where year is 2021 or later
data_filter2 = data_filter1_1[data_filter1_1['LastEvaluated (Year)'] >= 2021]

In [105]:
data_filter2[['LastEvaluated', 'LastEvaluated (Year)']].head()

Unnamed: 0,LastEvaluated,LastEvaluated (Year)
4,2024-06-06,2024
5,2024-06-06,2024
6,2022-07-05,2022
7,2022-07-05,2022
8,2024-11-01,2024


In [106]:
len(data_filter2)

3112595

In [107]:
data_filter1_1.groupby('LastEvaluated (Year)').size()

LastEvaluated (Year)
1965          2
1973          2
1976          2
1977          2
1979          2
1980          4
1981          4
1982          2
1983          8
1984         10
1985          4
1986         10
1987         14
1988         16
1989         60
1990         80
1991        114
1992        192
1993        176
1994        176
1995        240
1996        166
1997        226
1998        282
1999        294
2000        314
2001        494
2002        393
2003        374
2004        384
2005        288
2006        392
2007        442
2008        424
2009        494
2010        540
2011        984
2012       1620
2013       4998
2014       4900
2015       7295
2016      16063
2017      28332
2018      61288
2019      76755
2020      71901
2021     344526
2022     766916
2023     963595
2024    1034214
2025       3344
dtype: int64

In [108]:
data_filter2.groupby('LastEvaluated (Year)').size()

LastEvaluated (Year)
2021     344526
2022     766916
2023     963595
2024    1034214
2025       3344
dtype: int64

### 3. Exclude variants with zero-star review status, VUS and conflicting classifications.

##### --> Remove VUS and conflicting classifications

In [109]:
data_filter2.ClinicalSignificance.unique()

array(['Likely pathogenic',
       'Conflicting classifications of pathogenicity',
       'Pathogenic/Pathogenic, low penetrance; other; risk factor',
       'Pathogenic/Likely pathogenic/Pathogenic, low penetrance; other',
       'Uncertain significance', 'Pathogenic/Likely pathogenic',
       'Pathogenic', 'Likely benign', 'Benign', 'Benign/Likely benign',
       'Conflicting classifications of pathogenicity; risk factor',
       'drug response', 'Benign; drug response',
       'Conflicting classifications of pathogenicity; association; risk factor',
       'Benign/Likely benign; other',
       'Conflicting classifications of pathogenicity; other',
       'Pathogenic/Likely pathogenic; risk factor',
       'no classifications from unflagged records', 'not provided',
       'Likely benign; other', 'Benign; other', 'Pathogenic; risk factor',
       'Conflicting classifications of pathogenicity; association',
       'Benign/Likely benign; association', 'Benign; risk factor',
       'Ben

In [110]:
data_filter3 = data_filter2[data_filter2['ClinicalSignificance'].isin(['Pathogenic','Likely pathogenic',
                                                                       'Pathogenic/Likely pathogenic', 
                                                                       'Benign', 'Likely benign', 
                                                                       'Benign/Likely benign'])]

In [111]:
data_filter3.ClinicalSignificance.unique()

array(['Likely pathogenic', 'Pathogenic/Likely pathogenic', 'Pathogenic',
       'Likely benign', 'Benign', 'Benign/Likely benign'], dtype=object)

In [112]:
len(data_filter3)

277997

##### --> Filter out variants with zero-star review status

In [113]:
# somatic classification
data_filter3.ReviewStatusClinicalImpact.unique()

array(['-', 'criteria provided, multiple submitters',
       'no assertion criteria provided',
       'criteria provided, single submitter'], dtype=object)

In [114]:
# germline classification
data_filter3.ReviewStatus.unique()

array(['criteria provided, single submitter',
       'criteria provided, multiple submitters, no conflicts',
       'reviewed by expert panel', 'no assertion criteria provided'],
      dtype=object)

In [115]:
data_filter3.ReviewStatus.value_counts()

ReviewStatus
criteria provided, single submitter                     176317
criteria provided, multiple submitters, no conflicts     86823
no assertion criteria provided                           10115
reviewed by expert panel                                  4742
Name: count, dtype: int64

In [116]:
data_filter3_1 = data_filter3[data_filter3['ReviewStatus'] != 'no assertion criteria provided'].reset_index(drop=True)

In [117]:
data_filter3_1.ReviewStatus.value_counts()

ReviewStatus
criteria provided, single submitter                     176317
criteria provided, multiple submitters, no conflicts     86823
reviewed by expert panel                                  4742
Name: count, dtype: int64

In [118]:
len(data_filter3_1)

267882
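
The review-status filter can equivalently be expressed in terms of ClinVar's star ratings. The mapping below is a sketch: `REVIEW_STARS` is a hypothetical name, it covers only the statuses observed above plus `practice guideline`, and treating any unmapped status as zero stars is a conservative assumption rather than a ClinVar rule.

```python
import pandas as pd

# star rating per germline review status (hypothetical helper mapping)
REVIEW_STARS = {
    "practice guideline": 4,
    "reviewed by expert panel": 3,
    "criteria provided, multiple submitters, no conflicts": 2,
    "criteria provided, single submitter": 1,
    "no assertion criteria provided": 0,
}

status = pd.Series([
    "criteria provided, single submitter",
    "no assertion criteria provided",
    "reviewed by expert panel",
    "criteria provided, multiple submitters, no conflicts",
])

# assumption: statuses missing from the mapping count as zero stars
stars = status.map(REVIEW_STARS).fillna(0).astype(int)

# keep variants with at least one star
kept = status[stars >= 1]
print(kept)
```

Filtering on `stars >= 1` keeps the same three categories as the inequality on `'no assertion criteria provided'` used above, and it makes a stricter threshold a one-character change.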

##### --> Parse Chromosome column

In [119]:
data_filter3_1.Chromosome.unique()

array(['11', '6', '10', '16', '22', '15', '7', '1', '8', '21', '5', '19',
       '4', '3', '17', '12', '20', '9', '18', '2', '14', '13', 'MT', 'Y',
       'X', 'na', 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 'Un'], dtype=object)

In [120]:
data_filter3_2 = data_filter3_1[~data_filter3_1.Chromosome.isin(['na', 'Un'])].reset_index(drop=True)

In [121]:
# after removing invalid or missing chromosome data
len(data_filter3_2)

267735

In [122]:
data_filter3_2.Chromosome.unique()

array(['11', '6', '10', '16', '22', '15', '7', '1', '8', '21', '5', '19',
       '4', '3', '17', '12', '20', '9', '18', '2', '14', '13', 'MT', 'Y',
       'X', 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19, 20, 21, 22], dtype=object)

In [123]:
data_filter3_2['Chromosome'] = data_filter3_2['Chromosome'].astype(str)
data_filter3_2['Chromosome'].nunique()

25

In [124]:
data_filter3_2['Chromosome'].unique()

array(['11', '6', '10', '16', '22', '15', '7', '1', '8', '21', '5', '19',
       '4', '3', '17', '12', '20', '9', '18', '2', '14', '13', 'MT', 'Y',
       'X'], dtype=object)

Now the 'Chromosome' column contains the expected chromosome identifiers:
- autosomes (1-22)
- sex chromosomes (X, Y)
- mitochondrial DNA (MT)
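
The mixed integer and string values seen earlier typically arise when `read_csv` infers the column type chunk by chunk. A minimal sketch of the clean-up on a made-up Series is shown below; filtering against an explicit whitelist instead of a blacklist of `'na'` and `'Un'` is a design choice of this example, not what the cells above do.

```python
import pandas as pd

# made-up sample mimicking the mixed int/str values in the Chromosome column
chrom = pd.Series(['11', 7, 'MT', 'na', 22, 'Un', 'X'], dtype=object)

# cast first, so that 7 and '7' are treated as the same chromosome
chrom_str = chrom.astype(str)

# explicit list of expected identifiers: autosomes, sex chromosomes, mitochondrial DNA
valid = [str(i) for i in range(1, 23)] + ['X', 'Y', 'MT']
clean = chrom_str[chrom_str.isin(valid)].reset_index(drop=True)
print(clean.tolist())
```

An alternative is to pass `dtype={'Chromosome': str}` to `pd.read_csv` when loading the file, so that the column is a string from the start.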

### 4. Keep variants with AF < 0.01 in gnomAD v2.1.

At this point, we need to run the data through VEP to obtain gnomAD allele frequencies.

But first, the __assembly issue__ must be addressed.

In [125]:
data_filter3_2.Assembly.unique()

array(['GRCh37', 'GRCh38'], dtype=object)

In [126]:
# sort so that GRCh38 rows come before GRCh37 rows ('GRCh38' > 'GRCh37' as strings)
data_filter3_2 = data_filter3_2.sort_values(by='Assembly', ascending=False).reset_index(drop=True)
data_filter3_2.Assembly.value_counts()

Assembly
GRCh37    133870
GRCh38    133865
Name: count, dtype: int64

Here, we notice that most variants appear twice with the same Allele ID, once per assembly, differing only in the Assembly column and the assembly-specific coordinates.

We therefore verify that this is the case and then drop the duplicates.

In [127]:
# Function to identify Allele ID groups with inconsistent data
def check_duplicates(df):
    cols_to_check=['#AlleleID', 'Type', 'Name', 'GeneID', 'GeneSymbol', 'HGNC_ID',
       'ClinicalSignificance', 'ClinSigSimple', 'LastEvaluated', 'RS# (dbSNP)',
       'nsv/esv (dbVar)', 'RCVaccession', 'PhenotypeIDS', 'PhenotypeList',
       'Origin', 'OriginSimple', 'Cytogenetic', 'ReviewStatus', 
       'NumberSubmitters', 'Guidelines', 'TestedInGTR', 'OtherIDs', 
       'SubmitterCategories', 'VariationID', 'Variant (3-letter)', 'Variant']

    grouped = df.groupby('#AlleleID')
    
    # find groups where any column (except Assembly) has different values
    different_values = {}
    for name, group in grouped:
        if len(group) > 1:  # only check groups with multiple variants
            for col in cols_to_check:
                unique_values = group[col].nunique()
                if unique_values > 1:
                    if name not in different_values:
                        different_values[name] = []
                    different_values[name].append(col)

    if different_values:
        print("Found AlleleIDs with different values in columns other than Assembly:")
        for allele_id, columns in different_values.items():
            print(f"\nAlleleID {allele_id} has different values in columns: {columns}")
    else:
        print("All rows with the same #AlleleID have identical values (except possibly Assembly)")

    return df['#AlleleID'].duplicated().sum()

In [128]:
total_duplicates = check_duplicates(data_filter3_2)

All rows with the same #AlleleID have identical values (except possibly Assembly)
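
The same consistency check can be sketched without an explicit Python loop over groups, which may matter for run time on large tables. The frame below is made up for illustration, and `report` is a hypothetical name.

```python
import pandas as pd

# made-up example: AlleleID 2 has conflicting gene symbols across assemblies
df = pd.DataFrame({
    '#AlleleID': [1, 1, 2, 2, 3],
    'Assembly': ['GRCh37', 'GRCh38', 'GRCh37', 'GRCh38', 'GRCh38'],
    'GeneSymbol': ['A', 'A', 'B', 'C', 'D'],
    'ClinicalSignificance': ['Benign', 'Benign', 'Pathogenic', 'Pathogenic', 'Benign'],
})
cols_to_check = ['GeneSymbol', 'ClinicalSignificance']

# number of distinct values per AlleleID and column (NaN is ignored, as in the loop version)
n_unique = df.groupby('#AlleleID')[cols_to_check].nunique()
inconsistent = n_unique.gt(1)

# AlleleID -> list of columns that disagree within the group
report = {
    int(allele_id): [col for col in cols_to_check if row[col]]
    for allele_id, row in inconsistent[inconsistent.any(axis=1)].iterrows()
}
print(report)
```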


In [129]:
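# rows were sorted with GRCh38 first, so keep='first' retains the GRCh38 record when both assemblies are present
data_filter4 = data_filter3_2.drop_duplicates(subset=['#AlleleID'], keep='first')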
data_filter4.Assembly.value_counts()

Assembly
GRCh38    133845
GRCh37        10
Name: count, dtype: int64

Next, we split the data by assembly before running VEP.

This is necessary because coordinates from different assemblies must __not__ be mixed.

In [130]:
for assembly in ['GRCh37', 'GRCh38']:
    file = data_filter4[data_filter4['Assembly'] == assembly].copy()
    file.to_csv(f'../data/clinvar/clinvar_data_preVEP_{assembly.lower()}.csv', index=False)

VEP accepts VCF input, so the following function writes the variants in a minimal VCF layout.

In [5]:
def create_vcf(df, outputfile):
    # necessary columns for VCF format
    vcf_columns = [
        'Chromosome',
        'PositionVCF',
        'RS# (dbSNP)',
        'ReferenceAlleleVCF',
        'AlternateAlleleVCF'
    ]

    vcf_df = df[vcf_columns].copy()
    vcf_df['Chromosome'] = vcf_df['Chromosome'].astype(str)
    vcf_df['Chromosome'] = vcf_df['Chromosome'].str.replace('chr', '', case=False)

    chrom_order = ([str(i) for i in range(1, 23)] + ['X', 'Y', 'MT'])

    vcf_df['Chromosome'] = pd.Categorical(vcf_df['Chromosome'], categories=chrom_order, ordered=True)
    vcf_df['PositionVCF'] = pd.to_numeric(vcf_df['PositionVCF'])
    vcf_df = vcf_df.sort_values(['Chromosome', 'PositionVCF'])
    
    # add remaining required VCF columns
    vcf_df['ID'] = vcf_df['RS# (dbSNP)'].fillna('.')
    vcf_df['QUAL'] = '.'  
    vcf_df['FILTER'] = '.'  
    vcf_df['INFO'] = '.'  
    
    # reorder and rename columns to match VCF format
    vcf_df = vcf_df[['Chromosome', 'PositionVCF', 'ID', 'ReferenceAlleleVCF', 'AlternateAlleleVCF', 'QUAL', 'FILTER', 'INFO']]
    vcf_df.columns = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO']

    vcf_df.to_csv(outputfile, sep='\t', index=False)
    print(f"VCF file created at: {outputfile}")
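
As a variation on the function above, the sketch below also writes a `##fileformat` meta line and replaces missing dbSNP identifiers with `.`. It assumes that an `RS# (dbSNP)` value of `-1` marks a missing identifier and that an `rs` prefix is wanted in the ID column; `write_minimal_vcf` is a hypothetical helper, and sorting is omitted for brevity.

```python
import io
import pandas as pd

def write_minimal_vcf(df, handle):
    """Write a header line plus the eight fixed VCF columns to an open text handle."""
    handle.write("##fileformat=VCFv4.2\n")

    # assumption: -1 (or NaN) in 'RS# (dbSNP)' means "no dbSNP ID"
    ids = df['RS# (dbSNP)'].apply(
        lambda rs: f"rs{int(rs)}" if pd.notna(rs) and int(rs) > 0 else "."
    )

    vcf = pd.DataFrame({
        '#CHROM': df['Chromosome'].astype(str),
        'POS': df['PositionVCF'].astype(int),
        'ID': ids,
        'REF': df['ReferenceAlleleVCF'],
        'ALT': df['AlternateAlleleVCF'],
        'QUAL': '.',
        'FILTER': '.',
        'INFO': '.',
    })
    vcf.to_csv(handle, sep='\t', index=False)

# two SNVs taken from the tables shown in this notebook
sample = pd.DataFrame({
    'Chromosome': [15, '1'],
    'PositionVCF': [85342440, 228212279],
    'RS# (dbSNP)': [150829393, -1],
    'ReferenceAlleleVCF': ['G', 'G'],
    'AlternateAlleleVCF': ['A', 'A'],
})

buffer = io.StringIO()
write_minimal_vcf(sample, buffer)
vcf_lines = buffer.getvalue().splitlines()
print("\n".join(vcf_lines))
```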

After running VEP for the fourth filter criterion (gnomAD), we rearrange the output and continue cleaning the dataset.

In [4]:
def parse_output(file_path):
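    # locate the header row of the VEP tab-delimited output
    header_line = None
    with open(file_path, 'r') as file:
        for i, line in enumerate(file):
            if line.startswith("#Uploaded_variation"):
                header_line = i
                break
    if header_line is None:
        raise ValueError(f"No '#Uploaded_variation' header found in {file_path}")

    df = pd.read_csv(file_path, delimiter='\t', skiprows=header_line, low_memory=False)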
    #print(df.columns)
    print('initial file length:', len(df))

    # rename the first column (Uploaded_variation)
    df.columns = df.columns.str.replace('#', '')
    
    df2= df[df.Protein_position != '-'].reset_index(drop=True).copy()

    df_filtered = df2[df2['Amino_acids'].str.contains('/')]
    df_filtered = df_filtered[~df_filtered['Amino_acids'].str.contains(r'\*')]

    # create Chromosome, PositionVCF, variant columns before merging with the original clinvar df
    df_filtered[['Chromosome', 'PositionVCF_dashed']] = df_filtered['Location'].str.split(':', expand=True)
    df_filtered['PositionVCF'] = df_filtered.apply(lambda x: x['PositionVCF_dashed'].split('-')[0], axis=1)
    
    df_filtered['PositionVCF'] = df_filtered['PositionVCF'].astype(int)
    df_filtered['Variant'] = df_filtered.apply(lambda x: x['Amino_acids'].split('/')[0] + str(x['Protein_position']) + x['Amino_acids'].split('/')[1], axis=1)
    
    # the Allele column holds the variant (alternate) allele that VEP used to calculate the consequence.
    # VEP's default tab-delimited output has no separate reference-allele column,
    # so only the alternate allele is recovered here; the reference allele is taken
    # from the original ClinVar data after merging.
    
    df_filtered['AlternateAlleleVCF'] = df_filtered['Allele']

    df_filtered2 = df_filtered[df_filtered['Consequence'].str.contains('missense_variant', na=False)].copy()
    
    # the same variant is reported once per transcript (canonical + isoforms);
    # drop duplicates based on the columns below, keeping the first occurrence
    df_filtered3 = df_filtered2.drop_duplicates(subset=['Chromosome','PositionVCF','Allele','Gene','Feature_type','CDS_position','Protein_position','Variant'], keep='first').copy()
    df_filtered3['GeneSymbol'] = df_filtered3['Extra'].str.extract(r'SYMBOL=([^;]+)')
    df_filtered3['HGNC_ID'] = df_filtered3['Extra'].str.extract(r'HGNC_ID=([^;]+)')
    
    print('final file length:',len(df_filtered3))
    
    return df_filtered3


def merge_original_and_vepout(df1, df2):
    df1['Chromosome'] = df1['Chromosome'].astype(str)
    df2['Chromosome'] = df2['Chromosome'].astype(str)
    df1 = df1.reset_index(drop=True)
    df2 = df2.reset_index(drop=True)
    merged_df = df1.merge(
        df2,
        on=['Chromosome', 'PositionVCF', 'AlternateAlleleVCF', 'Variant', 'GeneSymbol', 'HGNC_ID'],
        how='left')
    return merged_df

First, for GRCh38:

In [6]:
data_grch38 = pd.read_csv('../data/clinvar/clinvar_data_preVEP_grch38.csv')
create_vcf(data_grch38, '../data/clinvar/clinvar_data_inputVEP_grch38.vcf')

VCF file created at: ../data/clinvar/clinvar_data_inputVEP_grch38.vcf


To run VEP, the following command was executed:


In [134]:
#  ./vep -i clinvar_data_inputVEP_grch38.vcf -o clinvar_data_outputVEP_grch38.txt --offline \
#      --assembly GRCh38 \
#      --symbol --transcript_version --ccds --protein --uniprot \
#      --hgvs --fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
#      --af --af_1kg --af_gnomade --af_gnomadg --max_af

In [7]:
output_grch38 = "../data/clinvar/clinvar_data_outputVEP_grch38.txt"
df = parse_output(output_grch38)

data_filter4_1_grch38 = merge_original_and_vepout(data_grch38, df)

# same rows, but more columns after merging
len(data_filter4_1_grch38) == len(data_grch38)

initial file length: 1613017
final file length: 339455


True
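
The row-count comparison confirms that the left merge did not duplicate any ClinVar rows. The sketch below shows two `merge` options that make such checks explicit: `validate` raises if the right-hand keys are not unique, and `indicator` records which rows found no VEP match. The two small frames are made up from values shown in this notebook and use fewer key columns than the real merge.

```python
import pandas as pd

clinvar = pd.DataFrame({
    'Chromosome': ['1', '5', '19'],
    'PositionVCF': [228212279, 88752007, 41988012],
    'AlternateAlleleVCF': ['A', 'C', 'G'],
    'Variant': ['A166T', 'I147V', 'L94P'],
})
# pretend VEP returned a usable missense row for only two of the three variants
vep = pd.DataFrame({
    'Chromosome': ['1', '5'],
    'PositionVCF': [228212279, 88752007],
    'AlternateAlleleVCF': ['A', 'C'],
    'Variant': ['A166T', 'I147V'],
    'Consequence': ['missense_variant', 'missense_variant'],
})

keys = ['Chromosome', 'PositionVCF', 'AlternateAlleleVCF', 'Variant']
merged = clinvar.merge(
    vep, on=keys, how='left',
    validate='many_to_one',   # raises MergeError if a key occurs more than once in vep
    indicator=True,           # adds a '_merge' column: 'both' or 'left_only'
)

unmatched = int((merged['_merge'] == 'left_only').sum())
print(len(merged), unmatched)
```

Rows flagged `left_only` are the ones that end up with `NaN` in the VEP columns such as `Consequence`.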

In [8]:
data_filter4_1_grch38.Consequence.unique()

array(['missense_variant', nan, 'missense_variant,splice_region_variant',
       'missense_variant,NMD_transcript_variant',
       'missense_variant,splice_region_variant,NMD_transcript_variant'],
      dtype=object)

In [13]:
data_filter4_1_grch38['gnomADe_AF'] = data_filter4_1_grch38['Extra'].str.extract(r'gnomADe_AF=([^;]+)')
data_filter4_1_grch38['gnomADg_AF'] = data_filter4_1_grch38['Extra'].str.extract(r'gnomADg_AF=([^;]+)')

data_filter4_1_grch38[['gnomADe_AF','gnomADg_AF']].head()

Unnamed: 0,gnomADe_AF,gnomADg_AF
0,0.001133,0.0009133
1,,
2,,
3,,
4,,


In [14]:
len(data_filter4_1_grch38[data_filter4_1_grch38.gnomADe_AF.notna()])

101504
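
The extracted frequencies are strings, so they need a numeric conversion before the AF < 0.01 threshold can be applied. The sketch below shows one way to do that on made-up `Extra` strings (the `SYMBOL=XYZ` entry is hypothetical). Keeping variants that have no gnomAD frequency at all, on the reasoning that absence from gnomAD suggests rarity, is an assumption of this example and not a rule stated above.

```python
import pandas as pd

extra = pd.Series([
    "IMPACT=MODERATE;SYMBOL=OBSCN;gnomADe_AF=0.001133;gnomADg_AF=0.0009133",
    "IMPACT=MODERATE;SYMBOL=MEF2C",                      # no gnomAD annotation
    "IMPACT=MODERATE;SYMBOL=XYZ;gnomADe_AF=0.25",        # hypothetical common variant
    None,                                                # variant without any VEP match
])

# anchor the key at the start of a field so that it cannot match inside another key
af_text = extra.str.extract(r'(?:^|;)gnomADe_AF=([^;]+)')[0]
af = pd.to_numeric(af_text, errors='coerce')

# assumption: variants absent from gnomAD are kept as rare
keep = af.isna() | (af < 0.01)
print(pd.DataFrame({'gnomADe_AF': af, 'keep': keep}))
```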

Then, for GRCh37:

In [11]:
data_grch37 = pd.read_csv('../data/clinvar/clinvar_data_preVEP_grch37.csv')
create_vcf(data_grch37, '../data/clinvar/clinvar_data_inputVEP_grch37.vcf')

VCF file created at: ../data/clinvar/clinvar_data_inputVEP_grch37.vcf


To run VEP, the following command was executed:


In [56]:
#  ./vep -i clinvar_data_inputVEP_grch37.vcf -o clinvar_data_outputVEP_grch37.txt --offline \
#      --assembly GRCh37 \
#      --symbol --transcript_version --ccds --protein --uniprot \
#      --hgvs --fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
#      --af --af_1kg --af_gnomade --af_gnomadg --max_af

In [15]:
output_grch37 = "../data/clinvar/clinvar_data_outputVEP_grch37.txt"
df = parse_output(output_grch37)
df['HGNC_ID'] = df['HGNC_ID'].apply(lambda x: 'HGNC:' + str(x) if pd.notna(x) else x)

data_filter4_1_grch37 = merge_original_and_vepout(data_grch37, df)

# same rows, but more columns after merging
len(data_filter4_1_grch37) == len(data_grch37)

initial file length: 20
final file length: 12


True

In [16]:
data_filter4_1_grch37.Consequence.unique()

array(['missense_variant'], dtype=object)

In [17]:
data_filter4_1_grch37['gnomADe_AF'] = data_filter4_1_grch37['Extra'].str.extract(r'gnomADe_AF=([^;]+)')

len(data_filter4_1_grch37[data_filter4_1_grch37.gnomADe_AF.notna()])

10

Finally, we combine the GRCh37 and GRCh38 results into a single table.

In [18]:
# cols to keep, from original dataset
cols = ['#AlleleID', 'Type', 'Name', 'GeneID', 'GeneSymbol', 'HGNC_ID', 
        'ClinicalSignificance', 'ClinSigSimple', 'RS# (dbSNP)', 'nsv/esv (dbVar)', 
        'RCVaccession', 'PhenotypeIDS', 'PhenotypeList', 'Origin', 'OriginSimple', 
        'Assembly', 'ChromosomeAccession', 'Chromosome', 'Start', 'Stop', 'Cytogenetic', 
        'ReviewStatus', 'NumberSubmitters', 'OtherIDs', 'SubmitterCategories', 
        'VariationID', 'PositionVCF', 'ReferenceAlleleVCF', 'AlternateAlleleVCF', 
        'Variant (3-letter)', 'Variant', 'LastEvaluated (Year)']

# cols to add, after obtaining gnomAD info
cols_to_take = ['Uploaded_variation', 'Location', 'Allele', 'Gene', 'Feature', 
                'Feature_type', 'Consequence', 'cDNA_position', 'CDS_position', 
                'Protein_position', 'Amino_acids', 'Codons', 'Existing_variation', 
                'Extra', 'PositionVCF_dashed', 'gnomADe_AF']

cols.extend(cols_to_take)

len(cols)

48

In [19]:
data_filter4_2 = pd.concat([data_filter4_1_grch38,data_filter4_1_grch37])
data_filter4_2 = data_filter4_2[cols]
len(data_filter4_2)
data_filter4_2.head()

Unnamed: 0,#AlleleID,Type,Name,GeneID,GeneSymbol,HGNC_ID,ClinicalSignificance,ClinSigSimple,RS# (dbSNP),nsv/esv (dbVar),RCVaccession,PhenotypeIDS,PhenotypeList,Origin,OriginSimple,Assembly,ChromosomeAccession,Chromosome,Start,Stop,Cytogenetic,ReviewStatus,NumberSubmitters,OtherIDs,SubmitterCategories,VariationID,PositionVCF,ReferenceAlleleVCF,AlternateAlleleVCF,Variant (3-letter),Variant,LastEvaluated (Year),Uploaded_variation,Location,Allele,Gene,Feature,Feature_type,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,Extra,PositionVCF_dashed,gnomADe_AF
0,1809239,single nucleotide variant,NM_001386125.1(OBSCN):c.496G>A (p.Ala166Thr),84033,OBSCN,HGNC:15719,Benign/Likely benign,0,-1,-,RCV003418466|RCV004050307,MedGen:C3661900|MedGen:CN169374,not provided|not specified,germline,germline,GRCh38,NC_000001.11,1,228212279,228212279,1q42.13,"criteria provided, multiple submitters, no conflicts",2,ClinGen:CA1431438,2,1744457,228212279,G,A,Ala166Thr,A166T,2023,-1.0,1:228212279,A,ENSG00000154358,ENST00000284548.16,Transcript,missense_variant,637,496,166,A/T,Gca/Aca,rs555146765,"IMPACT=MODERATE;STRAND=1;SYMBOL=OBSCN;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:15719;CCDS=CCDS1570.2;ENSP=ENSP00000284548;SWISSPROT=Q5VST9.202;UNIPARC=UPI0000425971;UNIPROT_ISOFORM=Q5VST9-3;HGVSc=ENST00000284548.16:c.496G>A;HGVSp=ENSP00000284548.11:p.Ala166Thr;AF=0.0004;AFR_AF=0.0008;AMR_AF=0.0014;EAS_AF=0;EUR_AF=0;SAS_AF=0;gnomADe_AF=0.001133;gnomADe_AFR_AF=0.0002016;gnomADe_AMR_AF=0.0004112;gnomADe_ASJ_AF=9.352e-05;gnomADe_EAS_AF=0;gnomADe_FIN_AF=0.0004933;gnomADe_MID_AF=0.0005405;gnomADe_NFE_AF=0.001353;gnomADe_REMAINING_AF=0.0005889;gnomADe_SAS_AF=0;gnomADg_AF=0.0009133;gnomADg_AFR_AF=0.0003389;gnomADg_AMI_AF=0;gnomADg_AMR_AF=0.001063;gnomADg_ASJ_AF=0;gnomADg_EAS_AF=0;gnomADg_FIN_AF=0.0001037;gnomADg_MID_AF=0;gnomADg_NFE_AF=0.00156;gnomADg_REMAINING_AF=0.0004812;gnomADg_SAS_AF=0;MAX_AF=0.00156;MAX_AF_POPS=gnomADg_NFE;CLIN_SIG=likely_benign,benign;PHENO=1",228212279,0.001133
1,2059503,single nucleotide variant,NM_001205293.3(CACNA1E):c.3965C>T (p.Ser1322Phe),777,CACNA1E,HGNC:1392,Likely benign,0,-1,-,RCV002806411,MedGen:C3661900,not provided,germline,germline,GRCh38,NC_000001.11,1,181755373,181755373,1q25.3,"criteria provided, single submitter",1,ClinGen:CA343623382,2,1993841,181755373,C,T,Ser1322Phe,S1322F,2022,-1.0,1:181755373,T,ENSG00000198216,ENST00000367570.6,Transcript,missense_variant,4856,3965,1322,S/F,tCc/tTc,COSV62407041,IMPACT=MODERATE;STRAND=1;SYMBOL=CACNA1E;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:1392;CCDS=CCDS53443.1;ENSP=ENSP00000356542;SWISSPROT=Q15878.209;UNIPARC=UPI000044D37D;UNIPROT_ISOFORM=Q15878-3;HGVSc=ENST00000367570.6:c.3965C>T;HGVSp=ENSP00000356542.1:p.Ser1322Phe;SOMATIC=1;PHENO=1,181755373,
2,2058909,single nucleotide variant,NM_002397.5(MEF2C):c.439A>G (p.Ile147Val),4208,MEF2C,HGNC:6996,Benign,0,-1,-,RCV002828380,"MONDO:MONDO:0013266,MedGen:C3150700,OMIM:613443,Orphanet:228384,Orphanet:664410","Intellectual disability, autosomal dominant 20",germline,germline,GRCh38,NC_000005.10,5,88752007,88752007,5q14.3,"criteria provided, single submitter",1,ClinGen:CA360423982,2,2004843,88752007,T,C,Ile147Val,I147V,2022,-1.0,5:88752007,C,ENSG00000081189,ENST00000437473.6,Transcript,missense_variant,1090,439,147,I/V,Atc/Gtc,-,IMPACT=MODERATE;STRAND=-1;SYMBOL=MEF2C;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:6996;CCDS=CCDS47245.1;ENSP=ENSP00000396219;SWISSPROT=Q06413.210;UNIPARC=UPI0000040635;UNIPROT_ISOFORM=Q06413-1;HGVSc=ENST00000437473.6:c.439A>G;HGVSp=ENSP00000396219.2:p.Ile147Val,88752007,
3,2058962,single nucleotide variant,NM_152296.5(ATP1A3):c.281T>C (p.Leu94Pro),478,ATP1A3,HGNC:801,Pathogenic,1,-1,-,RCV002795956,"MONDO:MONDO:0007496,MedGen:C1868681,OMIM:128235,Orphanet:71517",Dystonia 12,germline,germline,GRCh38,NC_000019.10,19,41988012,41988012,19q13.2,"criteria provided, single submitter",1,ClinGen:CA406056371,2,1992440,41988012,A,G,Leu94Pro,L94P,2022,-1.0,19:41988012,G,ENSG00000105409,ENST00000441343.5,Transcript,missense_variant,419,281,94,L/P,cTg/cCg,-,IMPACT=MODERATE;STRAND=-1;SYMBOL=ATP1A3;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:801;ENSP=ENSP00000411503;TREMBL=A0A0A0MT26.64;UNIPARC=UPI0000E5A1FE;HGVSc=ENST00000441343.5:c.281T>C;HGVSp=ENSP00000411503.1:p.Leu94Pro,41988012,
4,2059150,single nucleotide variant,NM_001100.4(ACTA1):c.794A>G (p.Gln265Arg),58,ACTA1,HGNC:129,Pathogenic,1,-1,-,RCV002796062,"MONDO:MONDO:0008070,MedGen:C3711389,OMIM:161800,Orphanet:98904",Actin accumulation myopathy,germline,germline,GRCh38,NC_000001.11,1,229432008,229432008,1q42.13,"criteria provided, single submitter",1,ClinGen:CA345146455,2,1992629,229432008,T,C,Gln265Arg,Q265R,2023,-1.0,1:229432008,C,ENSG00000143632,ENST00000366683.4,Transcript,missense_variant,907,794,265,Q/R,cAg/cGg,CM992127,IMPACT=MODERATE;STRAND=-1;SYMBOL=ACTA1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:129;ENSP=ENSP00000355644;TREMBL=A6NL76.108;UNIPARC=UPI000C755200;HGVSc=ENST00000366683.4:c.794A>G;HGVSp=ENSP00000355644.4:p.Gln265Arg;SOMATIC=1;PHENO=1,229432008,


In [20]:
print(data_filter4_2['Extra'].iloc[0])

IMPACT=MODERATE;STRAND=1;SYMBOL=OBSCN;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:15719;CCDS=CCDS1570.2;ENSP=ENSP00000284548;SWISSPROT=Q5VST9.202;UNIPARC=UPI0000425971;UNIPROT_ISOFORM=Q5VST9-3;HGVSc=ENST00000284548.16:c.496G>A;HGVSp=ENSP00000284548.11:p.Ala166Thr;AF=0.0004;AFR_AF=0.0008;AMR_AF=0.0014;EAS_AF=0;EUR_AF=0;SAS_AF=0;gnomADe_AF=0.001133;gnomADe_AFR_AF=0.0002016;gnomADe_AMR_AF=0.0004112;gnomADe_ASJ_AF=9.352e-05;gnomADe_EAS_AF=0;gnomADe_FIN_AF=0.0004933;gnomADe_MID_AF=0.0005405;gnomADe_NFE_AF=0.001353;gnomADe_REMAINING_AF=0.0005889;gnomADe_SAS_AF=0;gnomADg_AF=0.0009133;gnomADg_AFR_AF=0.0003389;gnomADg_AMI_AF=0;gnomADg_AMR_AF=0.001063;gnomADg_ASJ_AF=0;gnomADg_EAS_AF=0;gnomADg_FIN_AF=0.0001037;gnomADg_MID_AF=0;gnomADg_NFE_AF=0.00156;gnomADg_REMAINING_AF=0.0004812;gnomADg_SAS_AF=0;MAX_AF=0.00156;MAX_AF_POPS=gnomADg_NFE;CLIN_SIG=likely_benign,benign;PHENO=1


In [62]:
len(data_filter4_2[data_filter4_2.Consequence.isna()])

1619

In [63]:
data_filter4_2.Consequence.value_counts()

Consequence
missense_variant                                                 127328
missense_variant,splice_region_variant                             3742
missense_variant,NMD_transcript_variant                            1120
missense_variant,splice_region_variant,NMD_transcript_variant        46
Name: count, dtype: int64

In [64]:
# data_filter4 is the cleaned dataset before splitting by Assembly
# data_filter4_2 is the cleaned dataset after splitting, running VEP and merging again
len(data_filter4_2)==len(data_filter4)

True

In [99]:
data_filter4_2.Assembly.unique()

array(['GRCh38', 'GRCh37'], dtype=object)

In [66]:
data_filter4_2['gnomADe_AF'] = data_filter4_2['Extra'].str.extract(r'gnomADe_AF=([^;]+)')
data_filter4_2['gnomADg_AF'] = data_filter4_2['Extra'].str.extract(r'gnomADg_AF=([^;]+)')

data_filter4_2[['gnomADe_AF','gnomADg_AF']].head()

Unnamed: 0,gnomADe_AF,gnomADg_AF
0,0.001133,0.0009133
1,,
2,,
3,,
4,,


The columns 'gnomADe_AF' and 'gnomADg_AF' hold the allele frequencies (AFs) of each variant 
in the gnomAD Exomes (gnomADe_AF) and gnomAD Genomes (gnomADg_AF) datasets, respectively. 

The new column 'gnomAD_AF' combines these frequencies, prioritizing Exomes (gnomADe_AF) 
and falling back to Genomes (gnomADg_AF) when the Exomes value is missing. 

The final filtering step keeps rare variants with a combined gnomAD_AF < 0.01.

Variants with no frequency in either dataset are dropped, so every variant that remains has a frequency value.

In [67]:
data_filter4_2['gnomADe_AF'] = pd.to_numeric(data_filter4_2['gnomADe_AF'], errors='coerce')
data_filter4_2['gnomADg_AF'] = pd.to_numeric(data_filter4_2['gnomADg_AF'], errors='coerce')

# new column 'gnomAD_AF' that fills missing exomes AF with genomes AF
data_filter4_2['gnomAD_AF'] = data_filter4_2['gnomADe_AF'].fillna(data_filter4_2['gnomADg_AF'])

# keep rare variants (gnomAD_AF < 0.01); NaN < 0.01 evaluates to False, so variants
# without any gnomAD frequency are excluded by the same comparison
data_filter4_3 = data_filter4_2[data_filter4_2['gnomAD_AF'] < 0.01]

In [68]:
print(len(data_filter4_3))
data_filter4_3.head(2)

96686


Unnamed: 0,#AlleleID,Type,Name,GeneID,GeneSymbol,HGNC_ID,ClinicalSignificance,ClinSigSimple,RS# (dbSNP),nsv/esv (dbVar),...,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,Extra,PositionVCF_dashed,gnomADe_AF,gnomADg_AF,gnomAD_AF
0,1809239,single nucleotide variant,NM_001386125.1(OBSCN):c.496G>A (p.Ala166Thr),84033,OBSCN,HGNC:15719,Benign/Likely benign,0,-1,-,...,496,166,A/T,Gca/Aca,rs555146765,IMPACT=MODERATE;STRAND=1;SYMBOL=OBSCN;SYMBOL_S...,228212279,0.001133,0.000913,0.001133
6,2059238,single nucleotide variant,NM_006734.4(HIVEP2):c.1529C>T (p.Ser510Leu),3097,HIVEP2,HGNC:4921,Likely benign,0,-1,-,...,1529,510,S/L,tCa/tTa,rs531761193,IMPACT=MODERATE;STRAND=-1;SYMBOL=HIVEP2;SYMBOL...,142773210,3.1e-05,1.3e-05,3.1e-05


In [69]:
len(data_filter4_3[data_filter4_3.gnomAD_AF.notna()])

96686

In [70]:
print(data_filter4_3["Extra"].iloc[0])

IMPACT=MODERATE;STRAND=1;SYMBOL=OBSCN;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:15719;CCDS=CCDS1570.2;ENSP=ENSP00000284548;SWISSPROT=Q5VST9.202;UNIPARC=UPI0000425971;UNIPROT_ISOFORM=Q5VST9-3;HGVSc=ENST00000284548.16:c.496G>A;HGVSp=ENSP00000284548.11:p.Ala166Thr;AF=0.0004;AFR_AF=0.0008;AMR_AF=0.0014;EAS_AF=0;EUR_AF=0;SAS_AF=0;gnomADe_AF=0.001133;gnomADe_AFR_AF=0.0002016;gnomADe_AMR_AF=0.0004112;gnomADe_ASJ_AF=9.352e-05;gnomADe_EAS_AF=0;gnomADe_FIN_AF=0.0004933;gnomADe_MID_AF=0.0005405;gnomADe_NFE_AF=0.001353;gnomADe_REMAINING_AF=0.0005889;gnomADe_SAS_AF=0;gnomADg_AF=0.0009133;gnomADg_AFR_AF=0.0003389;gnomADg_AMI_AF=0;gnomADg_AMR_AF=0.001063;gnomADg_ASJ_AF=0;gnomADg_EAS_AF=0;gnomADg_FIN_AF=0.0001037;gnomADg_MID_AF=0;gnomADg_NFE_AF=0.00156;gnomADg_REMAINING_AF=0.0004812;gnomADg_SAS_AF=0;MAX_AF=0.00156;MAX_AF_POPS=gnomADg_NFE;CLIN_SIG=likely_benign,benign;PHENO=1


In [71]:
data_filter4_3 = data_filter4_3.sort_values(by='LastEvaluated (Year)', ascending=False)
data_filter4_3['HGVSp'] = data_filter4_3['Extra'].str.extract(r'HGVSp=([^;]+)')

cols_check = ['Type', 'GeneID','Gene', 'GeneSymbol', 'Feature','HGNC_ID',
      'Assembly','ChromosomeAccession', 'Chromosome', 
      'HGVSp','Protein_position', 'Amino_acids','Variant']

data_filter4_4 = data_filter4_3.drop_duplicates(subset=cols_check, keep='first')

print(len(data_filter4_3))
print(len(data_filter4_4))

96686
96544


### 5. Keep genes with at least one pathogenic variant of any type.

In [72]:
data_filter4_3.ClinicalSignificance.unique()

array(['Pathogenic/Likely pathogenic', 'Likely pathogenic', 'Pathogenic',
       'Likely benign', 'Benign', 'Benign/Likely benign'], dtype=object)

In [73]:
def filter_pathogenic_genes(df):
    # pathogenic terms
    pathogenic_terms = ['Pathogenic', 'Likely pathogenic', 'Pathogenic/Likely pathogenic']
    
    # filter for genes that have at least one pathogenic variant
    pathogenic_genes = df[df['ClinicalSignificance'].isin(pathogenic_terms)]['GeneSymbol'].unique()
    
    # filter df to retain only variants for those genes
    df_filtered = df[df['GeneSymbol'].isin(pathogenic_genes)]
    
    return df_filtered

In [74]:
data_filter5 = filter_pathogenic_genes(data_filter4_4)

In [75]:
len(data_filter5)

49187

Notice that only GRCh38 remains. All the pathogenic variants in the filtered dataset are from this assembly.

In [106]:
data_filter5["Assembly"].value_counts()

Assembly
GRCh38    49187
Name: count, dtype: int64

### Final parsing steps

In [76]:
def classify_significance(clinical_significance):
    if any(term in clinical_significance for term in ['Pathogenic', 'Likely pathogenic']):
        return 'P'
    elif any(term in clinical_significance for term in ['Benign', 'Likely benign']):
        return 'B'
    return 'Other'

In [77]:
cleaned_ClinVar_dataset = data_filter5.copy()
cleaned_ClinVar_dataset['BinaryClinicalSignificance'] = data_filter5['ClinicalSignificance'].apply(classify_significance)

In [78]:
cleaned_ClinVar_dataset.BinaryClinicalSignificance.value_counts()

BinaryClinicalSignificance
B    34405
P    14782
Name: count, dtype: int64

In [79]:
len(cleaned_ClinVar_dataset.GeneSymbol.unique())

2156

In [93]:
len(cleaned_ClinVar_dataset)

49187

In [81]:
cleaned_ClinVar_dataset.head()

Unnamed: 0,#AlleleID,Type,Name,GeneID,GeneSymbol,HGNC_ID,ClinicalSignificance,ClinSigSimple,RS# (dbSNP),nsv/esv (dbVar),...,Amino_acids,Codons,Existing_variation,Extra,PositionVCF_dashed,gnomADe_AF,gnomADg_AF,gnomAD_AF,HGVSp,BinaryClinicalSignificance
119639,29901,single nucleotide variant,NM_182894.3(VSX2):c.679C>T (p.Arg227Trp),338917,VSX2,HGNC:1975,Pathogenic/Likely pathogenic,1,121912545,-,...,R/W,Cgg/Tgg,"rs121912545,CM042327",IMPACT=MODERATE;STRAND=1;SYMBOL=VSX2;SYMBOL_SO...,74259701,3e-06,2e-05,3e-06,ENSP00000261980.2:p.Arg227Trp,P
93641,104480,single nucleotide variant,NM_000180.4(GUCY2D):c.307G>A (p.Glu103Lys),3000,GUCY2D,HGNC:4689,Likely pathogenic,1,61749668,-,...,E/K,Gag/Aag,"rs61749668,CM077936",IMPACT=MODERATE;STRAND=1;SYMBOL=GUCY2D;SYMBOL_...,8003354,0.000191,3.3e-05,0.000191,ENSP00000254854.4:p.Glu103Lys,P
59576,3734350,single nucleotide variant,NM_000441.2(SLC26A4):c.1335G>C (p.Leu445Phe),5172,SLC26A4,HGNC:8818,Likely pathogenic,1,-1,-,...,L/F,ttG/ttC,rs1355468475,IMPACT=MODERATE;STRAND=1;SYMBOL=SLC26A4;SYMBOL...,107694474,,7e-06,7e-06,ENSP00000494017.1:p.Leu445Phe,P
59575,3734348,single nucleotide variant,NM_000441.2(SLC26A4):c.1279T>C (p.Ser427Pro),5172,SLC26A4,HGNC:8818,Pathogenic,1,-1,-,...,S/P,Tct/Cct,rs758015694,IMPACT=MODERATE;STRAND=1;SYMBOL=SLC26A4;SYMBOL...,107694418,2e-06,,2e-06,ENSP00000494017.1:p.Ser427Pro,P
59572,3734343,single nucleotide variant,NM_000441.2(SLC26A4):c.1207G>T (p.Ala403Ser),5172,SLC26A4,HGNC:8818,Likely pathogenic,1,-1,-,...,A/S,Gcc/Tcc,"rs1791527351,COSV107219136",IMPACT=MODERATE;STRAND=1;SYMBOL=SLC26A4;SYMBOL...,107690181,1e-06,,1e-06,ENSP00000494017.1:p.Ala403Ser,P


### And some statistics

In [82]:
def count_p_and_b_per_gene(df):
    # group by 'GeneSymbol' and 'BinaryClinicalSignificance', then count
    counts = df.groupby(['GeneSymbol', 'BinaryClinicalSignificance']).size().unstack(fill_value=0)

    # rename by label rather than by position, so the mapping stays correct
    # even if one class is absent or an extra class (e.g. 'Other') is present
    counts = counts.rename(columns={'B': 'B_count', 'P': 'P_count'})
    
    return counts

In [83]:
counts_per_gene = count_p_and_b_per_gene(cleaned_ClinVar_dataset)
counts_per_gene

Unnamed: 0_level_0,B_count,P_count
GeneSymbol,Unnamed: 1_level_1,Unnamed: 2_level_1
AAAS,2,8
AARS1,21,4
AARS2,19,6
AASS,9,1
ABAT,2,3
...,...,...
ZMYND10,9,4
ZMYND11,12,2
ZNF341,21,1
ZNF408,7,3


In [84]:
counts_per_gene[(counts_per_gene.P_count>=30) & (counts_per_gene.B_count>= 30)].sort_values(by=['P_count','B_count'], ascending=False)

Unnamed: 0_level_0,B_count,P_count
GeneSymbol,Unnamed: 1_level_1,Unnamed: 2_level_1
USH2A,90,133
TP53,65,86
COL4A3,36,83
PKHD1,38,79
COL7A1,237,76
FBN1,48,68
COL4A4,47,68
RYR1,47,62
COL4A5,113,60
SCN1A,44,56


In [85]:
cleaned_ClinVar_dataset['PositionVCF'] = pd.to_numeric(cleaned_ClinVar_dataset['PositionVCF'])

# sort by chromosome and position
cleaned_ClinVar_dataset = cleaned_ClinVar_dataset.sort_values(['Chromosome', 'PositionVCF'])
cleaned_ClinVar_dataset = cleaned_ClinVar_dataset.reset_index(drop=True)

In [96]:
cleaned_ClinVar_dataset["Assembly"].unique()

array(['GRCh38'], dtype=object)

In [86]:
# this file serves as the filtered database from which variants_pipeline.sh retrieves variants
cleaned_ClinVar_dataset.to_csv('../data/clinvar/cleaned_ClinVar_dataset.csv', index=False)

In [87]:
cleaned_ClinVar_dataset[['Name','Chromosome','ReferenceAlleleVCF','PositionVCF','AlternateAlleleVCF']].head()

Unnamed: 0,Name,Chromosome,ReferenceAlleleVCF,PositionVCF,AlternateAlleleVCF
0,NM_198576.4(AGRN):c.11G>C (p.Arg4Pro),1,G,1020183,C
1,NM_198576.4(AGRN):c.125A>C (p.Glu42Ala),1,A,1020297,C
2,NM_198576.4(AGRN):c.494C>T (p.Pro165Leu),1,C,1035307,T
3,NM_198576.4(AGRN):c.773C>T (p.Thr258Ile),1,C,1041218,T
4,NM_198576.4(AGRN):c.1058A>G (p.Gln353Arg),1,A,1041583,G


Now that the dataset is cleaned, we convert it into VCF format for input to the VEP tool, so we can obtain predictions from the chosen pathogenicity predictors.
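For readers who do not have the notebook's `create_vcf` helper at hand, the conversion can be sketched as below. This is only an illustration under stated assumptions, not the helper actually used: the function name `write_minimal_vcf`, the toy rows, the output file name, and the choice to write `.` when ClinVar reports `-1` for a missing dbSNP ID are all hypothetical.

```python
import os
import tempfile

import pandas as pd


def write_minimal_vcf(df, path):
    """Write a minimal VCF from a 5-column frame (hypothetical helper).

    Expected column order: chromosome, position, ID, REF, ALT.
    """
    with open(path, "w") as fh:
        fh.write("##fileformat=VCFv4.2\n")
        fh.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
        for chrom, pos, rsid, ref, alt in df.itertuples(index=False, name=None):
            # Assumption: -1 marks a missing dbSNP ID; VCF uses "." for missing IDs
            vid = "." if str(rsid) in ("-1", "nan") else str(rsid)
            fh.write(f"{chrom}\t{pos}\t{vid}\t{ref}\t{alt}\t.\t.\t.\n")


# Toy rows modelled on the first AGRN variants shown above
toy = pd.DataFrame({
    "Chromosome": ["1", "1"],
    "PositionVCF": [1020183, 1020297],
    "RS# (dbSNP)": [539283387, -1],
    "ReferenceAlleleVCF": ["G", "A"],
    "AlternateAlleleVCF": ["C", "C"],
})

out_path = os.path.join(tempfile.gettempdir(), "toy_inputVEP.vcf")
write_minimal_vcf(toy, out_path)
```

QUAL, FILTER and INFO are left as `.` because VEP only needs the position and alleles to annotate a variant.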

In [88]:
create_vcf(cleaned_ClinVar_dataset[['Chromosome', 'PositionVCF', 'RS# (dbSNP)', 
                                    'ReferenceAlleleVCF', 'AlternateAlleleVCF']],
                                    "../data/clinvar/cleaned_Clinvar_dataset_inputVEP.vcf")

VCF file created at: ../data/clinvar/cleaned_Clinvar_dataset_inputVEP.vcf


Command run in the script

In [90]:
#    /home/aitanadiaz/ensembl-vep/./vep -i "$input_vcf" -o "$output_file" --offline \
#        --assembly $assembly \
#        --symbol --transcript_version --ccds --protein --uniprot --canonical \
#        --hgvs --fasta /home/aitanadiaz/ensembl-vep/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
#        --af --af_1kg --af_gnomade --af_gnomadg --max_af \
#        --sift b --polyphen b \
#        --plugin AlphaMissense,file=/home/aitanadiaz/ensembl-vep/plugins/AlphaMissense_${assembly}.tsv.gz \
#        --plugin Blosum62 \
#        --plugin CADD,snv=/home/aitanadiaz/ensembl-vep/plugins/whole_genome_SNVs_${assembly}.tsv.gz \
#        --plugin ClinPred,file=/home/aitanadiaz/ensembl-vep/plugins/ClinPred_${assembly}_tabbed.tsv.gz \
#        --plugin dbNSFP,/home/aitanadiaz/ensembl-vep/plugins/dbNSFP5.1a.grch38.gz,VEST4_score,VEST4_rankscore,BayesDel_addAF_pred,BayesDel_addAF_score \
#        --plugin EVE,file=/home/aitanadiaz/ensembl-vep/plugins/EVE/eve_merged.vcf.gz \
#        --plugin PrimateAI,/home/aitanadiaz/ensembl-vep/plugins/PrimateAI_scores_v0.2_${assembly}_sorted.tsv.bgz \
#        --plugin REVEL,file=/home/aitanadiaz/ensembl-vep/plugins/new_tabbed_revel_${assembly}.tsv.gz
#        #--plugin BayesDel,file=/home/aitanadiaz/ensembl-vep/plugins/BayesDel_170824_addAF/BayesDel_170824_addAF_all_scores.txt.gz \
#        #--plugin VARITY,file=/home/aitanadiaz/ensembl-vep/plugins/varity_all_predictions.tsv.gz

### Read VEP output (predictions added!)

Notice we are obtaining predictions only for the following predictors:
- __CADD__
- __EVE__
- __AlphaMissense__
- __BayesDel__
- __REVEL__
- __VEST4__

Missing ones will be run and added after parsing the VEP results.

In [2]:
VEP_output = pd.read_csv(
    '../data/clinvar/cleaned_Clinvar_dataset_outputVEP.txt', 
    sep='\t', 
    comment='#', 
    header=None
)

# actual header
with open('../data/clinvar/cleaned_Clinvar_dataset_outputVEP.txt') as f:
    for line in f:
        if line.startswith("#Uploaded_variation"):  
            columns = line.strip("#").strip().split("\t")
            break

VEP_output.columns = columns

In [3]:
VEP_output.head()

Unnamed: 0,Uploaded_variation,Location,Allele,Gene,Feature,Feature_type,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,Extra
0,539283387,1:1020183,C,ENSG00000188157,ENST00000379370.7,Transcript,missense_variant,64,11,4,R/P,cGg/cCg,rs539283387,IMPACT=MODERATE;STRAND=1;SYMBOL=AGRN;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:329;CANONICAL=YES;CCDS=CCDS30551.1;ENSP=ENSP00000368678;SWISSPROT=O00468.212;UNIPARC=UPI00001D7C8B;UNIPROT_ISOFORM=O00468-6;SIFT=tolerated_low_confidence(0.22);PolyPhen=unknown(0);HGVSc=ENST00000379370.7:c.11G>C;HGVSp=ENSP00000368678.2:p.Arg4Pro;AF=0.0096;AFR_AF=0.0348;AMR_AF=0.0029;EAS_AF=0;EUR_AF=0;SAS_AF=0;gnomADe_AF=0.0009945;gnomADe_AFR_AF=0.04268;gnomADe_AMR_AF=0.002747;gnomADe_ASJ_AF=0;gnomADe_EAS_AF=3.934e-05;gnomADe_FIN_AF=0;gnomADe_MID_AF=0.002163;gnomADe_NFE_AF=3.783e-05;gnomADe_REMAINING_AF=0.002043;gnomADe_SAS_AF=4.178e-05;gnomADg_AF=0.01198;gnomADg_AFR_AF=0.04195;gnomADg_AMI_AF=0;gnomADg_AMR_AF=0.003163;gnomADg_ASJ_AF=0;gnomADg_EAS_AF=0;gnomADg_FIN_AF=0;gnomADg_MID_AF=0;gnomADg_NFE_AF=0.0001334;gnomADg_REMAINING_AF=0.008621;gnomADg_SAS_AF=0;MAX_AF=0.04268;MAX_AF_POPS=gnomADe_AFR;CLIN_SIG=benign;PHENO=1;BLOSUM62=-2;CADD_PHRED=11.57;CADD_RAW=1.122195;ClinPred=0.00471446055936884;PrimateAI=0.819412112236;REVEL=0.130;VEST4_rankscore=0.04419;VEST4_score=0.071
1,539283387,1:1020183,C,ENSG00000188157,ENST00000620552.4,Transcript,5_prime_UTR_variant,61,-,-,-,-,rs539283387,IMPACT=MODIFIER;STRAND=1;SYMBOL=AGRN;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:329;ENSP=ENSP00000484607;TREMBL=A0A087X208.70;UNIPARC=UPI0004E4CB7F;HGVSc=ENST00000620552.4:c.-404G>C;AF=0.0096;AFR_AF=0.0348;AMR_AF=0.0029;EAS_AF=0;EUR_AF=0;SAS_AF=0;gnomADe_AF=0.0009945;gnomADe_AFR_AF=0.04268;gnomADe_AMR_AF=0.002747;gnomADe_ASJ_AF=0;gnomADe_EAS_AF=3.934e-05;gnomADe_FIN_AF=0;gnomADe_MID_AF=0.002163;gnomADe_NFE_AF=3.783e-05;gnomADe_REMAINING_AF=0.002043;gnomADe_SAS_AF=4.178e-05;gnomADg_AF=0.01198;gnomADg_AFR_AF=0.04195;gnomADg_AMI_AF=0;gnomADg_AMR_AF=0.003163;gnomADg_ASJ_AF=0;gnomADg_EAS_AF=0;gnomADg_FIN_AF=0;gnomADg_MID_AF=0;gnomADg_NFE_AF=0.0001334;gnomADg_REMAINING_AF=0.008621;gnomADg_SAS_AF=0;MAX_AF=0.04268;MAX_AF_POPS=gnomADe_AFR;CLIN_SIG=benign;PHENO=1;CADD_PHRED=11.57;CADD_RAW=1.122195;PrimateAI=0.819412112236
2,757604648,1:1020297,C,ENSG00000188157,ENST00000379370.7,Transcript,missense_variant,178,125,42,E/A,gAg/gCg,rs757604648,"IMPACT=MODERATE;STRAND=1;SYMBOL=AGRN;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:329;CANONICAL=YES;CCDS=CCDS30551.1;ENSP=ENSP00000368678;SWISSPROT=O00468.212;UNIPARC=UPI00001D7C8B;UNIPROT_ISOFORM=O00468-6;SIFT=deleterious_low_confidence(0.02);PolyPhen=probably_damaging(0.996);HGVSc=ENST00000379370.7:c.125A>C;HGVSp=ENSP00000368678.2:p.Glu42Ala;gnomADe_AF=8.23e-05;gnomADe_AFR_AF=3.741e-05;gnomADe_AMR_AF=0;gnomADe_ASJ_AF=0;gnomADe_EAS_AF=0.002266;gnomADe_FIN_AF=0;gnomADe_MID_AF=0;gnomADe_NFE_AF=6.699e-06;gnomADe_REMAINING_AF=0.0001103;gnomADe_SAS_AF=0.0004259;gnomADg_AF=0.0001846;gnomADg_AFR_AF=2.414e-05;gnomADg_AMI_AF=0;gnomADg_AMR_AF=0.0001311;gnomADg_ASJ_AF=0;gnomADg_EAS_AF=0.004673;gnomADg_FIN_AF=0;gnomADg_MID_AF=0;gnomADg_NFE_AF=1.474e-05;gnomADg_REMAINING_AF=0;gnomADg_SAS_AF=0;MAX_AF=0.004673;MAX_AF_POPS=gnomADg_EAS;CLIN_SIG=benign,likely_benign;PHENO=1;BLOSUM62=-1;CADD_PHRED=24.8;CADD_RAW=4.272024;ClinPred=0.169498920597853;PrimateAI=0.942414879799;REVEL=0.270;VEST4_rankscore=0.31372;VEST4_score=0.279"
3,757604648,1:1020297,C,ENSG00000188157,ENST00000620552.4,Transcript,5_prime_UTR_variant,175,-,-,-,-,rs757604648,"IMPACT=MODIFIER;STRAND=1;SYMBOL=AGRN;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:329;ENSP=ENSP00000484607;TREMBL=A0A087X208.70;UNIPARC=UPI0004E4CB7F;HGVSc=ENST00000620552.4:c.-290A>C;gnomADe_AF=8.23e-05;gnomADe_AFR_AF=3.741e-05;gnomADe_AMR_AF=0;gnomADe_ASJ_AF=0;gnomADe_EAS_AF=0.002266;gnomADe_FIN_AF=0;gnomADe_MID_AF=0;gnomADe_NFE_AF=6.699e-06;gnomADe_REMAINING_AF=0.0001103;gnomADe_SAS_AF=0.0004259;gnomADg_AF=0.0001846;gnomADg_AFR_AF=2.414e-05;gnomADg_AMI_AF=0;gnomADg_AMR_AF=0.0001311;gnomADg_ASJ_AF=0;gnomADg_EAS_AF=0.004673;gnomADg_FIN_AF=0;gnomADg_MID_AF=0;gnomADg_NFE_AF=1.474e-05;gnomADg_REMAINING_AF=0;gnomADg_SAS_AF=0;MAX_AF=0.004673;MAX_AF_POPS=gnomADg_EAS;CLIN_SIG=benign,likely_benign;PHENO=1;CADD_PHRED=24.8;CADD_RAW=4.272024;PrimateAI=0.942414879799"
4,140954236,1:1035307,T,ENSG00000188157,ENST00000379370.7,Transcript,missense_variant,547,494,165,P/L,cCt/cTt,"rs140954236,COSV99062194","IMPACT=MODERATE;STRAND=1;SYMBOL=AGRN;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:329;CANONICAL=YES;CCDS=CCDS30551.1;ENSP=ENSP00000368678;SWISSPROT=O00468.212;UNIPARC=UPI00001D7C8B;UNIPROT_ISOFORM=O00468-6;SIFT=tolerated_low_confidence(0.06);PolyPhen=possibly_damaging(0.64);HGVSc=ENST00000379370.7:c.494C>T;HGVSp=ENSP00000368678.2:p.Pro165Leu;AF=0.0030;AFR_AF=0.0106;AMR_AF=0.0014;EAS_AF=0;EUR_AF=0;SAS_AF=0;gnomADe_AF=0.0001917;gnomADe_AFR_AF=0.006959;gnomADe_AMR_AF=0.0003578;gnomADe_ASJ_AF=0;gnomADe_EAS_AF=0;gnomADe_FIN_AF=0;gnomADe_MID_AF=0.0005201;gnomADe_NFE_AF=1.799e-06;gnomADe_REMAINING_AF=0.0004306;gnomADe_SAS_AF=0;gnomADg_AF=0.001937;gnomADg_AFR_AF=0.006855;gnomADg_AMI_AF=0;gnomADg_AMR_AF=0.000392;gnomADg_ASJ_AF=0;gnomADg_EAS_AF=0;gnomADg_FIN_AF=0;gnomADg_MID_AF=0;gnomADg_NFE_AF=1.47e-05;gnomADg_REMAINING_AF=0.001418;gnomADg_SAS_AF=0;MAX_AF=0.0106;MAX_AF_POPS=AFR;CLIN_SIG=likely_benign;SOMATIC=0,1;PHENO=1,1;BLOSUM62=-3;CADD_PHRED=15.58;CADD_RAW=1.757188;ClinPred=0.0527309163900522;PrimateAI=0.515500366688;REVEL=0.324;VEST4_rankscore=0.28880;VEST4_score=0.226,0.257,.,."


In [4]:
VEP_output["Extra"].head(1)

0    IMPACT=MODERATE;STRAND=1;SYMBOL=AGRN;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:329;CANONICAL=YES;CCDS=CCDS30551.1;ENSP=ENSP00000368678;SWISSPROT=O00468.212;UNIPARC=UPI00001D7C8B;UNIPROT_ISOFORM=O00468-6;SIFT=tolerated_low_confidence(0.22);PolyPhen=unknown(0);HGVSc=ENST00000379370.7:c.11G>C;HGVSp=ENSP00000368678.2:p.Arg4Pro;AF=0.0096;AFR_AF=0.0348;AMR_AF=0.0029;EAS_AF=0;EUR_AF=0;SAS_AF=0;gnomADe_AF=0.0009945;gnomADe_AFR_AF=0.04268;gnomADe_AMR_AF=0.002747;gnomADe_ASJ_AF=0;gnomADe_EAS_AF=3.934e-05;gnomADe_FIN_AF=0;gnomADe_MID_AF=0.002163;gnomADe_NFE_AF=3.783e-05;gnomADe_REMAINING_AF=0.002043;gnomADe_SAS_AF=4.178e-05;gnomADg_AF=0.01198;gnomADg_AFR_AF=0.04195;gnomADg_AMI_AF=0;gnomADg_AMR_AF=0.003163;gnomADg_ASJ_AF=0;gnomADg_EAS_AF=0;gnomADg_FIN_AF=0;gnomADg_MID_AF=0;gnomADg_NFE_AF=0.0001334;gnomADg_REMAINING_AF=0.008621;gnomADg_SAS_AF=0;MAX_AF=0.04268;MAX_AF_POPS=gnomADe_AFR;CLIN_SIG=benign;PHENO=1;BLOSUM62=-2;CADD_PHRED=11.57;CADD_RAW=1.122195;ClinPred=0.00471446055936884;PrimateAI=0.8194121122

In [5]:
VEP_output.columns

Index(['Uploaded_variation', 'Location', 'Allele', 'Gene', 'Feature',
       'Feature_type', 'Consequence', 'cDNA_position', 'CDS_position',
       'Protein_position', 'Amino_acids', 'Codons', 'Existing_variation',
       'Extra'],
      dtype='object')

After obtaining the VEP output, we need to **parse** it to extract **relevant** information. The `Extra` column in the VEP output contains multiple annotations in a single field, so we extract these as separate formal columns. Additionally, predictor outputs (e.g., SIFT, PolyPhen) are split into distinct columns for labels and scores.  

This step ensures that the data is structured correctly for analysis. Once parsed, we will merge this processed VEP output with our original dataset, using common columns to retain all the initial information while incorporating the VEP predictions.  
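The parsing idea can be sketched on toy data as follows. This is not the code in `parsing_ClinVar`; the helper names `parse_extra` and `split_label_score` and the toy `Extra` strings are assumptions made for illustration, and the sketch only relies on the `key=value;key=value` layout and the `label(score)` pattern visible in the output above.

```python
import re

import pandas as pd


def parse_extra(extra):
    """Split a VEP 'Extra' string into a {key: value} dict."""
    return dict(item.split("=", 1) for item in extra.split(";") if "=" in item)


def split_label_score(value):
    """Split 'label(score)' (SIFT/PolyPhen style) into (label, float score)."""
    if not isinstance(value, str):
        return (None, None)  # annotation absent for this row
    match = re.fullmatch(r"(.+)\(([^()]+)\)", value)
    if match is None:
        return (None, None)
    return (match.group(1), float(match.group(2)))


# Toy rows: shortened versions of the Extra strings shown above
toy = pd.DataFrame({"Extra": [
    "IMPACT=MODERATE;SYMBOL=AGRN;SIFT=tolerated_low_confidence(0.22);PolyPhen=unknown(0);REVEL=0.130",
    "IMPACT=MODIFIER;SYMBOL=AGRN;CADD_PHRED=11.57",
]})

# One column per annotation key; keys missing from a row become NaN
parsed = pd.DataFrame([parse_extra(e) for e in toy["Extra"]])

# Separate the SIFT label from its numeric score
labels_scores = [split_label_score(v) for v in parsed["SIFT"]]
parsed["SIFT_label"] = [ls[0] for ls in labels_scores]
parsed["SIFT_score"] = [ls[1] for ls in labels_scores]
```

Values extracted this way are still strings (for example `REVEL`), so numeric predictor columns would need a `pd.to_numeric` pass before analysis.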

In [6]:
import parsing_ClinVar as parse

input_file = '../data/clinvar/cleaned_Clinvar_dataset_outputVEP.txt'  
output_file = '../data/clinvar/cleaned_Clinvar_dataset_parsed.csv'  

parse.parse_vep_output(input_file, output_file)  

In [4]:
output_file = '../data/clinvar/cleaned_Clinvar_dataset_parsed.csv' 

In [3]:
df = pd.read_csv(output_file)  

In [8]:
df.head()

Unnamed: 0,Existing_variation,Location,Gene,Feature,Feature_type,Canonical,Consequence,Swissprot,Uniparc,Uniprot_isoform,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,GeneSymbol,HGNC_ID,SIFT_label,SIFT_score,PolyPhen_label,PolyPhen_score,BayesDel_label,CADD_PHRED_score,CADD_RAW_score,ClinPred_score,VEST4_score,VEST4_rankscore,EVE_label,EVE_score,REVEL_score,PrimateAI_score,AM_label,AM_score
0,rs539283387,1:1020183,ENSG00000188157,ENST00000379370.7,Transcript,YES,missense_variant,O00468.212,UPI00001D7C8B,O00468-6,64,11,4,R/P,cGg/cCg,AGRN,HGNC:329,tolerated_low_confidence,0.22,unknown,0.0,,11.57,1.122195,0.004714,0.071,0.04419,,,0.13,0.819412,,
1,rs539283387,1:1020183,ENSG00000188157,ENST00000620552.4,Transcript,,5_prime_UTR_variant,,UPI0004E4CB7F,,61,-,-,-,-,AGRN,HGNC:329,,,,,,11.57,1.122195,,0.0,,,,,0.819412,,
2,rs757604648,1:1020297,ENSG00000188157,ENST00000379370.7,Transcript,YES,missense_variant,O00468.212,UPI00001D7C8B,O00468-6,178,125,42,E/A,gAg/gCg,AGRN,HGNC:329,deleterious_low_confidence,0.02,probably_damaging,0.996,,24.8,4.272024,0.169499,0.279,0.31372,,,0.27,0.942415,,
3,rs757604648,1:1020297,ENSG00000188157,ENST00000620552.4,Transcript,,5_prime_UTR_variant,,UPI0004E4CB7F,,175,-,-,-,-,AGRN,HGNC:329,,,,,,24.8,4.272024,,0.0,,,,,0.942415,,
4,"rs140954236,COSV99062194",1:1035307,ENSG00000188157,ENST00000379370.7,Transcript,YES,missense_variant,O00468.212,UPI00001D7C8B,O00468-6,547,494,165,P/L,cCt/cTt,AGRN,HGNC:329,tolerated_low_confidence,0.06,possibly_damaging,0.64,,15.58,1.757188,0.052731,0.257,0.2888,,,0.324,0.5155,,


The VEP output contains multiple rows for the same variant, because each variant is annotated against every transcript it overlaps. For BRCA1, for example, the parsed VEP output has 2,559 rows. To stay consistent with our original dataset, we need to carefully select the columns used for merging, so that we add only the pathogenicity predictions without introducing redundant or erroneous variant rows.

In [44]:
df[df["GeneSymbol"] == "BRCA1"].shape[0]

2559

In the cleaned ClinVar dataset, BRCA1 has 82 variants. After merging, we expect to retain the same number of variants, with the only difference being the added prediction columns.

In [9]:
cleaned_ClinVar_dataset[cleaned_ClinVar_dataset["GeneSymbol"] == "BRCA1"].shape[0]

82

In [9]:
df.columns

Index(['Existing_variation', 'Location', 'Gene', 'Feature', 'Feature_type',
       'Canonical', 'Consequence', 'Swissprot', 'Uniparc', 'Uniprot_isoform',
       'cDNA_position', 'CDS_position', 'Protein_position', 'Amino_acids',
       'Codons', 'GeneSymbol', 'HGNC_ID', 'SIFT_label', 'SIFT_score',
       'PolyPhen_label', 'PolyPhen_score', 'BayesDel_label',
       'CADD_PHRED_score', 'CADD_RAW_score', 'ClinPred_score', 'VEST4_score',
       'VEST4_rankscore', 'EVE_label', 'EVE_score', 'REVEL_score',
       'PrimateAI_score', 'AM_label', 'AM_score'],
      dtype='object')

An important step is to calculate the coverage of each predictor in the dataset, that is, the percentage of non-NaN values in each predictor score column, reported as a table sorted by coverage.

The function is defined here for later use.

In [14]:
def check_coverage(df):
    predictors = [i for i in df.columns if '_score' in i]
    adding = []
    predictor_groups = {'CADD': ['CADD_RAW_score', 'CADD_PHRED_score']}

    for predictor in predictors:
        if predictor in predictor_groups.get('CADD', []):
            coverage_type = 'RAW' if 'RAW' in predictor else 'PHRED'
            predictor_label = f'CADD ({coverage_type})'
        else:
            predictor_label = predictor.split('_')[0]
            if predictor_label == 'AM':
                predictor_label = 'AlphaMissense' 

        tmp = df[df[predictor].notna()]  
        coverage = round(100 * len(tmp) / len(df), 2)
        adding.append([predictor_label, coverage])

    table = pd.DataFrame(adding, columns=['Predictor', 'Coverage'])
    table = table.sort_values('Coverage', ascending=False).reset_index(drop=True)
    return table
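As a quick illustration of what the coverage table contains, here is a vectorised sketch of the same idea on toy data. The function name `coverage_table` and the toy scores are assumptions; unlike `check_coverage`, this version keeps the raw column names instead of relabelling them.

```python
import numpy as np
import pandas as pd


def coverage_table(df):
    """Percentage of non-NaN values per '*_score' column, highest first."""
    score_cols = [c for c in df.columns if c.endswith("_score")]
    # notna() gives booleans; their mean is the fraction of covered rows
    coverage = (df[score_cols].notna().mean() * 100).round(2)
    return (coverage.sort_values(ascending=False)
                    .rename_axis("Predictor")
                    .reset_index(name="Coverage"))


# Toy scores: CADD covers every row, REVEL three of four, EVE one of four
toy = pd.DataFrame({
    "REVEL_score": [0.13, 0.27, np.nan, 0.32],
    "EVE_score": [np.nan, np.nan, np.nan, 0.95],
    "CADD_PHRED_score": [11.57, 24.8, 15.58, 20.9],
})

table = coverage_table(toy)
print(table)
```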

### Final step: merge VEP output with ClinVar original dataset

A Python script was used to merge ClinVar data with VEP predictions by first filtering ClinVar variants for the specified gene and then matching them with VEP entries based on *Feature*, *Existing_variation*, and *Codons*. If multiple matches exist, the row with the fewest missing values is selected. This ensures that only the relevant pathogenicity predictions are added without redundancy. Unmatched variants are logged, and the final dataset retains the original ClinVar structure with the additional prediction columns.
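The matching logic can be sketched as below. This is not the contents of `merging_ClinVar.py`: the function name `merge_predictions`, the toy frames, and the use of a pandas left merge are assumptions, and the per-gene filtering step is omitted for brevity.

```python
import numpy as np
import pandas as pd

KEYS = ["Feature", "Existing_variation", "Codons"]


def merge_predictions(clinvar, vep, keys=KEYS):
    """Left-merge VEP predictions onto ClinVar rows, one VEP row per key set."""
    pred_cols = [c for c in vep.columns if c not in keys]
    # Among VEP rows sharing the same keys, keep the one with the fewest NaNs
    best = (vep.assign(_n_missing=vep[pred_cols].isna().sum(axis=1))
               .sort_values("_n_missing", kind="stable")
               .drop_duplicates(subset=keys, keep="first")
               .drop(columns="_n_missing"))
    # how="left" keeps every ClinVar row, matched or not
    merged = clinvar.merge(best, on=keys, how="left", indicator=True)
    unmatched = merged.loc[merged["_merge"] == "left_only", keys]
    return merged.drop(columns="_merge"), unmatched


# Toy ClinVar rows: the second one has no counterpart in the VEP table
clinvar = pd.DataFrame({
    "Feature": ["ENST1", "ENST2"],
    "Existing_variation": ["rs1", "rs2"],
    "Codons": ["cGg/cCg", "gAg/gCg"],
    "ClinicalSignificance": ["Benign", "Pathogenic"],
})

# Toy VEP rows: two candidates for the first variant, one unrelated row
vep = pd.DataFrame({
    "Feature": ["ENST1", "ENST1", "ENST9"],
    "Existing_variation": ["rs1", "rs1", "rs9"],
    "Codons": ["cGg/cCg", "cGg/cCg", "aaa/aat"],
    "REVEL_score": [np.nan, 0.13, 0.5],
    "CADD_PHRED_score": [11.57, 11.57, 20.0],
})

merged, unmatched = merge_predictions(clinvar, vep)
```

Because the merge is a left join on de-duplicated keys, the merged frame has exactly as many rows as the ClinVar input, which is the property checked for BRCA1 above.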

In [10]:
#    python3 merging_ClinVar.py

In [63]:
clinvar_with_preds = pd.read_csv('../data/clinvar/cleaned_ClinVar_with_preds.csv')
len(clinvar_with_preds)

49187

In [7]:
clinvar_with_preds = clinvar_with_preds.sort_values(['GeneSymbol'])
clinvar_with_preds = clinvar_with_preds.reset_index(drop=True)

In [8]:
clinvar_with_preds.head(3)

Unnamed: 0,#AlleleID,Type,Name,GeneID,GeneSymbol,HGNC_ID,ClinicalSignificance,ClinSigSimple,RS# (dbSNP),nsv/esv (dbVar),RCVaccession,PhenotypeIDS,PhenotypeList,Origin,OriginSimple,Assembly,ChromosomeAccession,Chromosome,Start,Stop,Cytogenetic,ReviewStatus,NumberSubmitters,OtherIDs,SubmitterCategories,VariationID,PositionVCF,ReferenceAlleleVCF,AlternateAlleleVCF,Variant (3-letter),Variant,LastEvaluated (Year),Uploaded_variation,Location,Allele,Gene,Feature,Feature_type,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,Extra,PositionVCF_dashed,gnomADe_AF,gnomADg_AF,gnomAD_AF,HGVSp,BinaryClinicalSignificance,SIFT_label,SIFT_score,PolyPhen_label,PolyPhen_score,BayesDel_label,BayesDel_score,CADD_PHRED_score,CADD_RAW_score,ClinPred_score,VEST4_score,VEST4_rankscore,EVE_label,EVE_score,REVEL_score,PrimateAI_score,AM_label,AM_score,Uniprot_acc
0,317931,single nucleotide variant,NM_015665.6(AAAS):c.1597G>A (p.Gly533Arg),8086,AAAS,HGNC:13666,Benign/Likely benign,0,34451260,-,RCV000343022|RCV000886859,"MONDO:MONDO:0009279,MedGen:C0271742,OMIM:231550,Orphanet:869|MedGen:C3661900",Glucocorticoid deficiency with achalasia|not provided,germline,germline,GRCh38,NC_000012.12,12,53307533,53307533,12q13.13,"criteria provided, multiple submitters, no conflicts",3,ClinGen:CA6598796,2,309718,53307533,C,T,Gly533Arg,G533R,2024,34451260.0,12:53307533,T,ENSG00000094914,ENST00000209873.9,Transcript,missense_variant,1742,1597,533,G/R,Ggg/Agg,rs34451260,"IMPACT=MODERATE;STRAND=-1;SYMBOL=AAAS;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:13666;CCDS=CCDS8856.1;ENSP=ENSP00000209873;SWISSPROT=Q9NRG9.206;UNIPARC=UPI0000039E40;UNIPROT_ISOFORM=Q9NRG9-1;HGVSc=ENST00000209873.9:c.1597G>A;HGVSp=ENSP00000209873.4:p.Gly533Arg;AF=0.0124;AFR_AF=0.0439;AMR_AF=0.0043;EAS_AF=0;EUR_AF=0.001;SAS_AF=0;gnomADe_AF=0.0007928;gnomADe_AFR_AF=0.02712;gnomADe_AMR_AF=0.001879;gnomADe_ASJ_AF=3.827e-05;gnomADe_EAS_AF=0;gnomADe_FIN_AF=0;gnomADe_MID_AF=0.001214;gnomADe_NFE_AF=2.428e-05;gnomADe_REMAINING_AF=0.002053;gnomADe_SAS_AF=9.275e-05;gnomADg_AF=0.007867;gnomADg_AFR_AF=0.02713;gnomADg_AMI_AF=0;gnomADg_AMR_AF=0.003529;gnomADg_ASJ_AF=0;gnomADg_EAS_AF=0;gnomADg_FIN_AF=0;gnomADg_MID_AF=0;gnomADg_NFE_AF=0.0001029;gnomADg_REMAINING_AF=0.004748;gnomADg_SAS_AF=0;MAX_AF=0.0439;MAX_AF_POPS=AFR;CLIN_SIG=benign,likely_benign;PHENO=1",53307533,0.000793,0.007867,0.000793,ENSP00000209873.4:p.Gly533Arg,B,deleterious_low_confidence,0.0,unknown,0.0,T,-0.362112,20.9,2.95181,0.062144,0.42,0.45743,,,0.257,0.52671,likely_benign,0.1192,Q9NRG9
1,20084,single nucleotide variant,NM_015665.6(AAAS):c.787T>C (p.Ser263Pro),8086,AAAS,HGNC:13666,Pathogenic/Likely pathogenic,1,121918550,-,RCV000005348|RCV000311283|RCV000415076|RCV000624696,"MONDO:MONDO:0009279,MedGen:C0271742,OMIM:231550,Orphanet:869|MedGen:C3661900|Human Phenotype Ontology:HP:0002313,Human Phenotype Ontology:HP:0007191,MedGen:C0037771;Human Phenotype Ontology:HP:0001282,Human Phenotype Ontology:HP:0001347,Human Phenotype Ontology:HP:0006820,Human Phenotype Ontology:HP:0007184,Human Phenotype Ontology:HP:0007318,MONDO:MONDO:0007774,MedGen:C0151889,OMIM:145290;Human Phenotype Ontology:HP:0001352,Human Phenotype Ontology:HP:0003487,MedGen:C0034935|MeSH:D030342,MedGen:C0950123",Glucocorticoid deficiency with achalasia|not provided|Spastic paraparesis;Hyperreflexia;Babinski sign|Inborn genetic diseases,germline;unknown,germline,GRCh38,NC_000012.12,12,53309624,53309624,12q13.13,"criteria provided, multiple submitters, no conflicts",11,"ClinGen:CA117228,UniProtKB:Q9NRG9#VAR_012806,OMIM:605378.0007",3,5045,53309624,A,G,Ser263Pro,S263P,2024,121918600.0,12:53309624,G,ENSG00000094914,ENST00000209873.9,Transcript,missense_variant,932,787,263,S/P,Tca/Cca,"rs121918550,CM010150","IMPACT=MODERATE;STRAND=-1;SYMBOL=AAAS;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:13666;CCDS=CCDS8856.1;ENSP=ENSP00000209873;SWISSPROT=Q9NRG9.206;UNIPARC=UPI0000039E40;UNIPROT_ISOFORM=Q9NRG9-1;HGVSc=ENST00000209873.9:c.787T>C;HGVSp=ENSP00000209873.4:p.Ser263Pro;AF=0.0002;AFR_AF=0;AMR_AF=0;EAS_AF=0;EUR_AF=0.001;SAS_AF=0;gnomADe_AF=7.801e-05;gnomADe_AFR_AF=2.988e-05;gnomADe_AMR_AF=0;gnomADe_ASJ_AF=0;gnomADe_EAS_AF=0;gnomADe_FIN_AF=0.0005994;gnomADe_MID_AF=0;gnomADe_NFE_AF=6.925e-05;gnomADe_REMAINING_AF=6.627e-05;gnomADe_SAS_AF=0;gnomADg_AF=5.908e-05;gnomADg_AFR_AF=0;gnomADg_AMI_AF=0;gnomADg_AMR_AF=0;gnomADg_ASJ_AF=0;gnomADg_EAS_AF=0;gnomADg_FIN_AF=0.0002824;gnomADg_MID_AF=0;gnomADg_NFE_AF=8.819e-05;gnomADg_REMAINING_AF=0;gnomADg_SAS_AF=0;MAX_AF=0.001;MAX_AF_POPS=EUR;CLIN_SIG=pathogenic/likely_pathogenic,pathogenic;SOMATIC=0,1;PHENO=1,1",53309624,7.8e-05,5.9e-05,7.8e-05,ENSP00000209873.4:p.Ser263Pro,P,deleterious,0.0,probably_damaging,0.992,D,0.25831,29.2,5.211912,0.980945,0.983,0.9927,Pathogenic,0.951397,0.844,0.71898,likely_pathogenic,0.9852,Q9NRG9
2,1312858,single nucleotide variant,NM_015665.6(AAAS):c.500C>T (p.Ala167Val),8086,AAAS,HGNC:13666,Pathogenic,1,1017700992,-,RCV003557743|RCV004765890,"MedGen:C3661900|MONDO:MONDO:0009279,MedGen:C0271742,OMIM:231550,Orphanet:869",not provided|Glucocorticoid deficiency with achalasia,germline,germline,GRCh38,NC_000012.12,12,53314796,53314796,12q13.13,"criteria provided, multiple submitters, no conflicts",2,ClinGen:CA237335767,2,2735891,53314796,G,A,Ala167Val,A167V,2024,1017701000.0,12:53314796,A,ENSG00000094914,ENST00000209873.9,Transcript,missense_variant,645,500,167,A/V,gCa/gTa,"rs1017700992,CM065949","IMPACT=MODERATE;STRAND=-1;SYMBOL=AAAS;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:13666;CCDS=CCDS8856.1;ENSP=ENSP00000209873;SWISSPROT=Q9NRG9.206;UNIPARC=UPI0000039E40;UNIPROT_ISOFORM=Q9NRG9-1;HGVSc=ENST00000209873.9:c.500C>T;HGVSp=ENSP00000209873.4:p.Ala167Val;gnomADe_AF=3.421e-06;gnomADe_AFR_AF=0;gnomADe_AMR_AF=0;gnomADe_ASJ_AF=0;gnomADe_EAS_AF=0;gnomADe_FIN_AF=0;gnomADe_MID_AF=0;gnomADe_NFE_AF=4.497e-06;gnomADe_REMAINING_AF=0;gnomADe_SAS_AF=0;MAX_AF=4.497e-06;MAX_AF_POPS=gnomADe_NFE;CLIN_SIG=pathogenic;SOMATIC=0,1;PHENO=1,1",53314796,3e-06,,3e-06,ENSP00000209873.4:p.Ala167Val,P,deleterious,0.03,probably_damaging,0.969,D,0.571933,34.0,5.972388,0.998261,0.98,0.99026,Pathogenic,0.853707,0.92,0.760144,likely_pathogenic,0.8614,Q9NRG9


In [4]:
clinvar_with_preds['GeneSymbol'].nunique()

2156

In [9]:
clinvar_with_preds['GeneSymbol'].value_counts()

GeneSymbol
DNAH11       1023
KMT2D         808
NEB           593
ADGRV1        586
SACS          495
DNAH5         444
TTN           322
COL7A1        313
ABCA4         301
ATRX          295
DMD           287
KMT2B         284
CACNA1H       280
APOB          278
MACF1         253
CHD7          245
KAT6A         240
RAI1          234
NOTCH1        233
SETBP1        226
USH2A         223
FLNA          209
LDLR          206
COL6A3        205
TSC2          201
DYNC1H1       192
FBN2          182
EP300         182
ANKRD11       181
COL5A1        179
ATP7A         175
ARID1B        173
COL4A5        173
SETX          168
COL11A1       164
NBAS          161
MECP2         159
MTOR          156
SON           152
TP53          151
FLNB          148
PKD1          146
GRIN2A        145
ARID1A        143
CREBBP        142
ATP7B         140
L1CAM         140
ABL1          138
MED13L        137
COL2A1        136
PTCH1         134
COL12A1       130
NSD1          128
NOTCH3        127
COL11A2       125

In [10]:
clinvar_with_preds[clinvar_with_preds['GeneSymbol'] == "BRCA1"].head(2)

Unnamed: 0,#AlleleID,Type,Name,GeneID,GeneSymbol,HGNC_ID,ClinicalSignificance,ClinSigSimple,RS# (dbSNP),nsv/esv (dbVar),RCVaccession,PhenotypeIDS,PhenotypeList,Origin,OriginSimple,Assembly,ChromosomeAccession,Chromosome,Start,Stop,Cytogenetic,ReviewStatus,NumberSubmitters,OtherIDs,SubmitterCategories,VariationID,PositionVCF,ReferenceAlleleVCF,AlternateAlleleVCF,Variant (3-letter),Variant,LastEvaluated (Year),Uploaded_variation,Location,Allele,Gene,Feature,Feature_type,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,Extra,PositionVCF_dashed,gnomADe_AF,gnomADg_AF,gnomAD_AF,HGVSp,BinaryClinicalSignificance,SIFT_label,SIFT_score,PolyPhen_label,PolyPhen_score,BayesDel_label,BayesDel_score,CADD_PHRED_score,CADD_RAW_score,ClinPred_score,VEST4_score,VEST4_rankscore,EVE_label,EVE_score,REVEL_score,PrimateAI_score,AM_label,AM_score,Uniprot_acc
6567,46226,single nucleotide variant,NM_007294.4(BRCA1):c.5434C>G (p.Pro1812Ala),672,BRCA1,HGNC:1100,Pathogenic/Likely pathogenic,1,1800751,-,RCV000031251|RCV000496797|RCV000484398|RCV001390965|RCV000574861,"MONDO:MONDO:0011450,MedGen:C2676676,OMIM:604370,Orphanet:145|MedGen:CN169374|MedGen:C3661900|MONDO:MONDO:0003582,MeSH:D061325,MedGen:C0677776,Orphanet:145|MONDO:MONDO:0015356,MeSH:D009386,MedGen:C0027672,Orphanet:140162","Breast-ovarian cancer, familial, susceptibility to, 1|not specified|not provided|Hereditary breast ovarian cancer syndrome|Hereditary cancer-predisposing syndrome",germline;inherited;not applicable,germline,GRCh38,NC_000017.11,17,43047676,43047676,17q21.31,"criteria provided, multiple submitters, no conflicts",15,ClinGen:CA003596,2,37670,43047676,G,C,Pro1812Ala,P1812A,2022,1800751.0,17:43047676,C,ENSG00000012048,ENST00000357654.9,Transcript,missense_variant,5547,5434,1812,P/A,Cca/Gca,"rs1800751,CM032862","IMPACT=MODERATE;STRAND=-1;SYMBOL=BRCA1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:1100;CCDS=CCDS11453.1;ENSP=ENSP00000350283;SWISSPROT=P38398.275;UNIPARC=UPI0000126AC8;UNIPROT_ISOFORM=P38398-1;HGVSc=ENST00000357654.9:c.5434C>G;HGVSp=ENSP00000350283.3:p.Pro1812Ala;gnomADe_AF=6.84e-07;gnomADe_AFR_AF=0;gnomADe_AMR_AF=0;gnomADe_ASJ_AF=0;gnomADe_EAS_AF=0;gnomADe_FIN_AF=0;gnomADe_MID_AF=0;gnomADe_NFE_AF=8.993e-07;gnomADe_REMAINING_AF=0;gnomADe_SAS_AF=0;MAX_AF=8.993e-07;MAX_AF_POPS=gnomADe_NFE;CLIN_SIG=uncertain_significance,pathogenic,pathogenic/likely_pathogenic;SOMATIC=0,1;PHENO=1,1",43047676,6.84e-07,,6.84e-07,ENSP00000350283.3:p.Pro1812Ala,P,deleterious,0.01,benign,0.031,D,0.097989,21.9,3.234223,0.986117,0.714,0.71542,Uncertain,0.428204,0.639,0.516515,likely_benign,0.171,P38398
6568,70304,single nucleotide variant,NM_007294.4(BRCA1):c.5585A>T (p.His1862Leu),672,BRCA1,HGNC:1100,Benign/Likely benign,0,80357183,-,RCV000049059|RCV000112707|RCV000774923|RCV001356964|RCV003237433|RCV003607227,"MONDO:MONDO:0003582,MeSH:D061325,MedGen:C0677776,Orphanet:145|MONDO:MONDO:0011450,MedGen:C2676676,OMIM:604370,Orphanet:145|MONDO:MONDO:0015356,MeSH:D009386,MedGen:C0027672,Orphanet:140162|MONDO:MONDO:0007254,MedGen:C0006142|MedGen:C3661900|MONDO:MONDO:0016419,MedGen:C0346153,OMIM:114480,Orphanet:227535","Hereditary breast ovarian cancer syndrome|Breast-ovarian cancer, familial, susceptibility to, 1|Hereditary cancer-predisposing syndrome|Malignant tumor of breast|not provided|Familial cancer of breast",germline;unknown,germline,GRCh38,NC_000017.11,17,43045685,43045685,17q21.31,"criteria provided, multiple submitters, no conflicts",8,"ClinGen:CA003734,UniProtKB:P38398#VAR_070519",2,55637,43045685,T,A,His1862Leu,H1862L,2024,80357183.0,17:43045685,A,ENSG00000012048,ENST00000357654.9,Transcript,missense_variant,5698,5585,1862,H/L,cAc/cTc,rs80357183,"IMPACT=MODERATE;STRAND=-1;SYMBOL=BRCA1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:1100;CCDS=CCDS11453.1;ENSP=ENSP00000350283;SWISSPROT=P38398.275;UNIPARC=UPI0000126AC8;UNIPROT_ISOFORM=P38398-1;HGVSc=ENST00000357654.9:c.5585A>T;HGVSp=ENSP00000350283.3:p.His1862Leu;gnomADe_AF=4.79e-06;gnomADe_AFR_AF=0;gnomADe_AMR_AF=0;gnomADe_ASJ_AF=0;gnomADe_EAS_AF=0;gnomADe_FIN_AF=0;gnomADe_MID_AF=0;gnomADe_NFE_AF=6.297e-06;gnomADe_REMAINING_AF=0;gnomADe_SAS_AF=0;gnomADg_AF=6.571e-06;gnomADg_AFR_AF=0;gnomADg_AMI_AF=0;gnomADg_AMR_AF=0;gnomADg_ASJ_AF=0;gnomADg_EAS_AF=0;gnomADg_FIN_AF=0;gnomADg_MID_AF=0;gnomADg_NFE_AF=1.47e-05;gnomADg_REMAINING_AF=0;gnomADg_SAS_AF=0;MAX_AF=1.47e-05;MAX_AF_POPS=gnomADg_NFE;CLIN_SIG=uncertain_significance,benign,likely_benign;PHENO=1",43045685,4.79e-06,7e-06,4.79e-06,ENSP00000350283.3:p.His1862Leu,B,deleterious_low_confidence,0.01,benign,0.013,D,0.220853,10.58,1.021643,0.253849,0.308,0.34533,,,0.553,0.296682,likel
y_benign,0.094,P38398


In [11]:
clinvar_with_preds[['SIFT_score', 'PolyPhen_score', 'CADD_RAW_score', 'CADD_PHRED_score', 'ClinPred_score', 'VEST4_score', 'VEST4_rankscore', 'EVE_score', 'REVEL_score', 'PrimateAI_score', 'AM_score']].describe()

Unnamed: 0,SIFT_score,PolyPhen_score,CADD_RAW_score,CADD_PHRED_score,ClinPred_score,VEST4_score,VEST4_rankscore,EVE_score,REVEL_score,PrimateAI_score,AM_score
count,47562.0,45581.0,49187.0,49187.0,48977.0,49187.0,47239.0,31381.0,43752.0,48568.0,44992.0
mean,0.146499,0.450637,3.28159,20.868921,0.443935,0.498975,0.529627,0.437151,0.426532,0.55898,0.360627
std,0.265088,0.435525,1.814761,8.64625,0.414668,0.330537,0.32172,0.294033,0.318185,0.198613,0.340495
min,0.0,0.0,-4.563263,0.001,1.7e-05,0.0,4e-05,0.012372,0.0,0.160219,0.0265
25%,0.0,0.006,2.037965,16.93,0.042429,0.19,0.23722,0.158716,0.144,0.389804,0.0866
50%,0.02,0.301,3.625519,23.1,0.256858,0.468,0.53034,0.379827,0.335,0.56107,0.1702
75%,0.15,0.972,4.727878,26.5,0.958861,0.832,0.83995,0.703409,0.731,0.722608,0.659425
max,1.0,1.0,12.062742,56.0,1.0,1.0,0.99999,0.999872,1.0,0.978576,1.0


In [12]:
clinvar_with_preds[['SIFT_score', 'PolyPhen_score', 'CADD_RAW_score', 'CADD_PHRED_score', 'ClinPred_score', 'VEST4_score', 'VEST4_rankscore', 'EVE_score', 'REVEL_score', 'PrimateAI_score', 'AM_score']].isna().sum()

SIFT_score           1625
PolyPhen_score       3606
CADD_RAW_score          0
CADD_PHRED_score        0
ClinPred_score        210
VEST4_score             0
VEST4_rankscore      1948
EVE_score           17806
REVEL_score          5435
PrimateAI_score       619
AM_score             4195
dtype: int64

We use the following function to evaluate the coverage of each predictor in the dataset. Not every variant receives a score from every predictor; the reasons vary and include the type of variant, the design of the prediction model, and the absence of the data the model relies on. This check identifies predictors with a high percentage of missing values (NaNs), which would make them uninformative for a large portion of the dataset.

In [15]:
check_coverage(clinvar_with_preds)

Unnamed: 0,Predictor,Coverage
0,VEST4,100.0
1,CADD (RAW),100.0
2,CADD (PHRED),100.0
3,BayesDel,99.87
4,ClinPred,99.57
5,PrimateAI,98.74
6,SIFT,96.7
7,PolyPhen,92.67
8,AlphaMissense,91.47
9,REVEL,88.95
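
The percentages above are consistent with coverage being the share of variants that have a non-missing score. The following is a minimal sketch of that arithmetic only, not the actual `check_coverage` implementation. It starts from the missing-value counts printed earlier in this notebook; `coverage_percent` is a hypothetical helper name, and BayesDel is omitted because its missing count is not listed above.

```python
# Missing-value counts as printed by the isna().sum() cell in this notebook.
missing_counts = {
    "SIFT": 1625,
    "PolyPhen": 3606,
    "CADD (RAW)": 0,
    "CADD (PHRED)": 0,
    "ClinPred": 210,
    "VEST4": 0,
    "REVEL": 5435,
    "PrimateAI": 619,
    "AlphaMissense": 4195,
}
n_variants = 49187  # total rows, taken from the describe() count for CADD


def coverage_percent(n_missing, n_total):
    """Percentage of variants that have a score, rounded to two decimals."""
    return round(100 * (1 - n_missing / n_total), 2)


coverage = {name: coverage_percent(n, n_variants) for name, n in missing_counts.items()}

# Print predictors from highest to lowest coverage.
for name, pct in sorted(coverage.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<14}{pct}")
```

Under this definition, SIFT comes out at 96.7 and REVEL at 88.95, matching the table.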


In [16]:
clinvar_with_preds.BinaryClinicalSignificance.value_counts()

BinaryClinicalSignificance
B    34405
P    14782
Name: count, dtype: int64

To make merging with Humsavar easier and to identify variants more reliably, we add a column with UniProt IDs.

In [22]:
clinvar_with_preds["Extra"].iloc[0].split(";")

['IMPACT=MODERATE',
 'STRAND=-1',
 'SYMBOL=AAAS',
 'SYMBOL_SOURCE=HGNC',
 'HGNC_ID=HGNC:13666',
 'CCDS=CCDS8856.1',
 'ENSP=ENSP00000209873',
 'SWISSPROT=Q9NRG9.206',
 'UNIPARC=UPI0000039E40',
 'UNIPROT_ISOFORM=Q9NRG9-1',
 'HGVSc=ENST00000209873.9:c.1597G>A',
 'HGVSp=ENSP00000209873.4:p.Gly533Arg',
 'AF=0.0124',
 'AFR_AF=0.0439',
 'AMR_AF=0.0043',
 'EAS_AF=0',
 'EUR_AF=0.001',
 'SAS_AF=0',
 'gnomADe_AF=0.0007928',
 'gnomADe_AFR_AF=0.02712',
 'gnomADe_AMR_AF=0.001879',
 'gnomADe_ASJ_AF=3.827e-05',
 'gnomADe_EAS_AF=0',
 'gnomADe_FIN_AF=0',
 'gnomADe_MID_AF=0.001214',
 'gnomADe_NFE_AF=2.428e-05',
 'gnomADe_REMAINING_AF=0.002053',
 'gnomADe_SAS_AF=9.275e-05',
 'gnomADg_AF=0.007867',
 'gnomADg_AFR_AF=0.02713',
 'gnomADg_AMI_AF=0',
 'gnomADg_AMR_AF=0.003529',
 'gnomADg_ASJ_AF=0',
 'gnomADg_EAS_AF=0',
 'gnomADg_FIN_AF=0',
 'gnomADg_MID_AF=0',
 'gnomADg_NFE_AF=0.0001029',
 'gnomADg_REMAINING_AF=0.004748',
 'gnomADg_SAS_AF=0',
 'MAX_AF=0.0439',
 'MAX_AF_POPS=AFR',
 'CLIN_SIG=benign,likely_benign

In [64]:
clinvar_with_preds["OtherIDs"].head()

0                                                ClinGen:CA6598796
1    ClinGen:CA117228,UniProtKB:Q9NRG9#VAR_012806,OMIM:605378.0007
2                                              ClinGen:CA237335767
3                                              ClinGen:CA385043710
4                               ClinGen:CA6599181,OMIM:605378.0011
Name: OtherIDs, dtype: object

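Finally, we extract UniProt IDs from the variant annotations. These IDs can appear in different formats and in different columns, so we check several places in order of preference: the canonical UNIPROT_ISOFORM entry in 'Extra', the UniProtKB entry in 'OtherIDs', and then the SWISSPROT and TREMBL entries in 'Extra'.

This gives us a way to identify which protein each variant maps to.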

In [65]:
def extract_uniprot_id(extra, other_ids):
    # check for UNIPROT_ISOFORM in the 'Extra' column
    if pd.notna(extra):
        match_isoform = re.search(r'UNIPROT_ISOFORM=([A-Z0-9]+)-1\b', extra)
        if match_isoform:
            return match_isoform.group(1)
    
    # check for UniProtKB in the 'OtherIDs' column (and get part before #)
    if pd.notna(other_ids):
        match_uniprotkb = re.search(r'UniProtKB:([A-Z0-9]+)(?=#)', other_ids)
        if match_uniprotkb:
            return match_uniprotkb.group(1)
    
    # check for SWISSPROT in the 'Extra' column (and get part before dot)
    if pd.notna(extra):
        match_swiss = re.search(r'SWISSPROT=([A-Z0-9]+)', extra)
        if match_swiss:
            return match_swiss.group(1).split('.')[0]
    
    # check for TREMBL in the 'Extra' column (and get part before dot)
    if pd.notna(extra):
        match_trembl = re.search(r'TREMBL=([A-Z0-9]+)', extra)
        if match_trembl:
            return match_trembl.group(1).split('.')[0]

    return None
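
The fallback order used by `extract_uniprot_id` can also be written as a list of patterns tried in sequence. The sketch below is an illustrative, pandas-free variant (`extract_uniprot_id_sketch` and `PATTERNS` are hypothetical names, and missing values are detected with `isinstance` rather than `pd.notna`). It is run on annotation fragments shortened from the AAAS rows shown in this notebook.

```python
import re

# (field, pattern) pairs in the same priority order as extract_uniprot_id.
PATTERNS = [
    ("extra", re.compile(r"UNIPROT_ISOFORM=([A-Z0-9]+)-1\b")),
    ("other_ids", re.compile(r"UniProtKB:([A-Z0-9]+)(?=#)")),
    ("extra", re.compile(r"SWISSPROT=([A-Z0-9]+)")),
    ("extra", re.compile(r"TREMBL=([A-Z0-9]+)")),
]


def extract_uniprot_id_sketch(extra, other_ids):
    """Return the first accession found, or None if no pattern matches."""
    fields = {"extra": extra, "other_ids": other_ids}
    for field, pattern in PATTERNS:
        text = fields[field]
        if isinstance(text, str):  # skips NaN/None without needing pandas
            match = pattern.search(text)
            if match:
                return match.group(1)
    return None


# Fragments shortened from the 'Extra' and 'OtherIDs' values of the AAAS rows.
extra_full = "ENSP=ENSP00000209873;SWISSPROT=Q9NRG9.206;UNIPROT_ISOFORM=Q9NRG9-1"
other = "ClinGen:CA117228,UniProtKB:Q9NRG9#VAR_012806,OMIM:605378.0007"

from_isoform = extract_uniprot_id_sketch(extra_full, other)        # isoform entry wins
from_other_ids = extract_uniprot_id_sketch(None, other)            # falls back to OtherIDs
from_swissprot = extract_uniprot_id_sketch("SWISSPROT=Q9NRG9.206", None)
no_id = extract_uniprot_id_sketch("IMPACT=MODERATE", "ClinGen:CA6598796")

print(from_isoform, from_other_ids, from_swissprot, no_id)
```

Because the capture group `[A-Z0-9]+` cannot contain a dot, the version suffix in `SWISSPROT=Q9NRG9.206` is never captured, so all three sources yield the same accession here.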

We load a list of reviewed UniProt IDs (Swiss-Prot entries), which are manually curated and therefore more reliable than automatically annotated ones.

We use this list to filter out less trustworthy protein mappings, since some variants map to unreviewed IDs.

In [66]:
with open('uniprotkb_reviewed_true_AND_organism_id_2025_04_09.list') as f:
    reviewed_ids = set(line.strip() for line in f if line.strip())

In [67]:
print(sorted(reviewed_ids)[0:10])
print(len(reviewed_ids))

['A0A024R1R8', 'A0A024RBG1', 'A0A075B6H7', 'A0A075B6H8', 'A0A075B6H9', 'A0A075B6I0', 'A0A075B6I1', 'A0A075B6I3', 'A0A075B6I4', 'A0A075B6I6']
20417


In [None]:
# extract_uniprot_id expects the 'Extra' and 'OtherIDs' annotation strings of each row
clinvar_with_preds['UniprotID'] = clinvar_with_preds.apply(lambda row: extract_uniprot_id(row['Extra'], row['OtherIDs']), axis=1)

We then filter the dataset to keep only variants that map to reviewed UniProt IDs.

After filtering, we check whether any gene still maps to more than one UniProt ID.

In [68]:
reviewed_df = clinvar_with_preds[clinvar_with_preds['UniprotID'].isin(reviewed_ids)].copy()

In [69]:
print(reviewed_df['UniprotID'].isna().sum())
print(reviewed_df['GeneSymbol'].nunique())
print(reviewed_df['UniprotID'].nunique())

0
2105
2105
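
The missing-value count of 0 follows from the filter itself: a missing ID is never a member of the reviewed set, so variants without an extracted UniProt ID are dropped together with the unreviewed ones. A small pure-Python sketch of the same membership test, using made-up records (all IDs and variant names below are placeholders, not real entries):

```python
# Made-up records for illustration; the IDs are placeholders,
# not real UniProt accessions.
reviewed_ids = {"REVIEWED1", "REVIEWED2"}
variants = [
    {"variant": "A10V", "UniprotID": "REVIEWED1"},    # reviewed
    {"variant": "G20R", "UniprotID": "UNREVIEWED1"},  # unreviewed
    {"variant": "S30P", "UniprotID": None},           # no ID extracted
    {"variant": "H40L", "UniprotID": "REVIEWED2"},    # reviewed
]

# Same idea as df[df['UniprotID'].isin(reviewed_ids)]: a missing ID is never
# in the set, so it is filtered out along with the unreviewed IDs.
reviewed_variants = [v for v in variants if v["UniprotID"] in reviewed_ids]

kept = [v["variant"] for v in reviewed_variants]
n_missing = sum(v["UniprotID"] is None for v in reviewed_variants)
print(kept)
print(n_missing)
```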


In [80]:
reviewed_df.shape

(46050, 70)

In [83]:
reviewed_df.BinaryClinicalSignificance.value_counts()

BinaryClinicalSignificance
B    32051
P    13999
Name: count, dtype: int64

In [79]:
(reviewed_df.groupby('GeneSymbol')['UniprotID']
 .nunique()
 .gt(1)
 .loc[lambda x: x])

GeneSymbol
ERCC6    True
GNAS     True
Name: UniprotID, dtype: bool

These two genes, ERCC6 and GNAS, are each linked to more than one reviewed UniProt ID, which could reflect alternative isoforms or distinct protein products that were reviewed separately.

This is unusual but not necessarily wrong. Since it has little bearing on the analysis, we kept both mappings.
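
The boolean output above flags the genes but does not show which IDs they map to. One way to list them is sketched below in plain Python, on made-up (gene, ID) pairs; the `GENE_*` and `ID_*` names are placeholders and do not correspond to the real ERCC6 or GNAS accessions.

```python
from collections import defaultdict

# Made-up (gene, UniProt ID) pairs; GENE_* and ID_* are placeholders.
pairs = [
    ("GENE_A", "ID_1"),
    ("GENE_A", "ID_1"),
    ("GENE_B", "ID_2"),
    ("GENE_B", "ID_3"),
    ("GENE_C", "ID_4"),
]

# Collect the distinct IDs seen for each gene.
ids_per_gene = defaultdict(set)
for gene, uniprot_id in pairs:
    ids_per_gene[gene].add(uniprot_id)

# Keep only the genes that map to more than one distinct ID.
multi_mapped = {gene: sorted(ids) for gene, ids in ids_per_gene.items() if len(ids) > 1}
print(multi_mapped)
```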

In [81]:
np.savetxt('../data/clinvar/uniprot_ids.txt', reviewed_df['UniprotID'].unique(), fmt='%s')

In [82]:
reviewed_df.to_csv('../data/clinvar/cleaned_ClinVar_with_preds.csv', index=False)

This notebook walks through the data preprocessing step of my thesis in order to explain the methodology in a clear and structured way.

In practice, however, the whole process runs within a single script that calls various Python functions. As described in the report, given a gene name or a list of gene names as input, the final output is a CSV file with the added predictions.

*For more detailed demonstrations and explanations of how the program works, please refer to the __GitHub repository__ linked to my thesis project.*