# Data Manipulation Before CADD Annotations

In [1]:
import pandas as pd
import numpy as np
import io

Because these dataframes are very large, we limit the display to 8 rows using the following code:

In [2]:
pd.set_option('display.max_rows', 8)

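To load a vcf file into a dataframe, we define the following function. It skips the '##' meta-information lines and expects a '#CHROM' column line. Files without that line are read with their first record as the column names, as seen later in this notebook.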

In [3]:
def read_vcf(path):
    with open(path, 'r') as f:
        # Skip '##' meta-information lines; keep the '#CHROM' line and the records.
        lines = [line for line in f if not line.startswith('##')]
    dtypes = {'#CHROM': str, 'POS': int, 'ID': str, 'REF': str,
              'ALT': str, 'QUAL': str, 'FILTER': str, 'INFO': str}
    return pd.read_csv(io.StringIO(''.join(lines)), dtype=dtypes, sep='\t').rename(columns={'#CHROM': 'CHROM'})
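
ClinVar releases are often distributed gzip-compressed. As a minimal sketch, assuming a hypothetical helper name `read_vcf_any` that is not part of this notebook, the same reading logic can be applied to either a plain or a `.gz` file:

```python
import gzip
import io

import pandas as pd


def read_vcf_any(path):
    """Hypothetical variant of read_vcf that also accepts gzip-compressed files."""
    # Pick the opener from the file extension; 'rt' is text mode in both cases.
    opener = gzip.open if str(path).endswith('.gz') else open
    with opener(path, 'rt') as f:
        # Drop '##' meta-information lines but keep the '#CHROM' column line.
        lines = [line for line in f if not line.startswith('##')]
    dtypes = {'#CHROM': str, 'POS': int, 'ID': str, 'REF': str,
              'ALT': str, 'QUAL': str, 'FILTER': str, 'INFO': str}
    return pd.read_csv(io.StringIO(''.join(lines)), dtype=dtypes,
                       sep='\t').rename(columns={'#CHROM': 'CHROM'})
```

Like `read_vcf`, this sketch still assumes the file contains a '#CHROM' column line.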

# ClinVar Analysis Below:

First, we import all variants from the ClinVar dataset and read them into a dataframe:

In [4]:
clinvar_data = read_vcf('clinvar_20190701.vcf')
clinvar_data

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
0,1,1014042,475283,G,A,.,.,AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619;...
1,1,1014122,542074,C,T,.,.,AF_ESP=0.00015;AF_EXAC=0.00010;ALLELEID=514926...
2,1,1014143,183381,C,T,.,.,"ALLELEID=181485;CLNDISDB=MedGen:C4015293,OMIM:..."
3,1,1014179,542075,C,T,.,.,"ALLELEID=514896;CLNDISDB=MedGen:C4015293,OMIM:..."
...,...,...,...,...,...,...,...,...
453956,MT,15965,9571,A,G,.,.,"ALLELEID=24610;CLNDISDB=MedGen:C1838867,OMIM:5..."
453957,MT,15967,9572,G,A,.,.,"ALLELEID=24611;CLNDISDB=MedGen:C0162672,OMIM:5..."
453958,MT,15990,9570,C,T,.,.,ALLELEID=24609;CLNDISDB=Human_Phenotype_Ontolo...
453959,NW_009646201.1,83614,17735,TC,T,.,.,"ALLELEID=32774;CLNDISDB=MedGen:C0000778,OMIM:6..."


Using the INFO column of the ClinVar data, we can split the variants into two separate groups, benign and pathogenic. We search the INFO column for the Clinical Significance field (CLNSIG) and keep the variants labelled Benign or Pathogenic, excluding combined labels such as Benign/Likely_benign.

Below are all the benign variants from the ClinVar database:

In [5]:
# 'CLNSIG=Benign/Likely' is a prefix of every combined label, so one exclusion covers them all.
benign_variants = clinvar_data[clinvar_data['INFO'].str.contains('CLNSIG=Benign')
                               & ~clinvar_data['INFO'].str.contains('CLNSIG=Benign/Likely')]
benign_variants

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
0,1,1014042,475283,G,A,.,.,AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619;...
4,1,1014217,475278,C,T,.,.,AF_ESP=0.00515;AF_EXAC=0.00831;AF_TGP=0.00339;...
5,1,1014228,402986,G,A,.,.,AF_ESP=0.40158;AF_EXAC=0.37025;AF_TGP=0.33886;...
11,1,1014451,475281,C,T,.,.,AF_ESP=0.00987;AF_EXAC=0.00772;AF_TGP=0.01558;...
...,...,...,...,...,...,...,...,...
453895,MT,15235,377041,A,G,.,.,ALLELEID=363919;CLNDISDB=MedGen:CN517202;CLNDN...
453922,MT,15470,143892,T,C,.,.,"ALLELEID=153617;CLNDISDB=MedGen:C0006142,OMIM:..."
453939,MT,15746,235623,A,G,.,.,ALLELEID=237304;CLNDISDB=MedGen:CN517202;CLNDN...
453947,MT,15884,252455,G,C,.,.,ALLELEID=247210;CLNDISDB=MedGen:CN169374;CLNDN...


Below are all the pathogenic variants from the ClinVar database:

In [6]:
pathogenic_variants = clinvar_data[
    clinvar_data['INFO'].str.contains('CLNSIG=Pathogenic')
    & ~clinvar_data['INFO'].str.contains('CLNSIG=non-Pathogenic')
    & ~clinvar_data['INFO'].str.contains('CLNSIG=Pathogenic/Likely_pathogenic')
    & ~clinvar_data['INFO'].str.contains('CLNSIG=Pathogenic/LikelyPathogenic')
]
pathogenic_variants

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
2,1,1014143,183381,C,T,.,.,"ALLELEID=181485;CLNDISDB=MedGen:C4015293,OMIM:..."
8,1,1014316,161455,C,CG,.,.,"ALLELEID=171289;CLNDISDB=MedGen:C4015293,OMIM:..."
9,1,1014359,161454,G,T,.,.,AF_EXAC=0.00001;ALLELEID=171288;CLNDISDB=MedGe...
24,1,1022225,243036,G,A,.,.,AF_EXAC=0.00001;ALLELEID=244110;CLNDISDB=MedGe...
...,...,...,...,...,...,...,...,...
453933,MT,15615,9678,G,A,.,.,ALLELEID=24717;CLNDISDB=Human_Phenotype_Ontolo...
453943,MT,15812,9675,G,A,.,.,ALLELEID=24714;CLNDISDB=Human_Phenotype_Ontolo...
453957,MT,15967,9572,G,A,.,.,"ALLELEID=24611;CLNDISDB=MedGen:C0162672,OMIM:5..."
453958,MT,15990,9570,C,T,.,.,ALLELEID=24609;CLNDISDB=Human_Phenotype_Ontolo...
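
Substring matching requires listing every combined label that should be excluded. An alternative, sketched here with a hypothetical helper `clnsig_equals` and toy INFO strings that are not taken from ClinVar, is to extract the CLNSIG value and compare it exactly:

```python
import pandas as pd


def clnsig_equals(df, label):
    """Return rows whose CLNSIG field is exactly `label` (hypothetical helper)."""
    # Capture everything between 'CLNSIG=' and the next ';' (or the end of the string).
    clnsig = df['INFO'].str.extract(r'(?:^|;)CLNSIG=([^;]+)', expand=False)
    # Rows without a CLNSIG field yield NaN, which never equals the label.
    return df[clnsig == label]


# Toy INFO strings for illustration only.
toy = pd.DataFrame({'INFO': [
    'ALLELEID=1;CLNSIG=Benign;CLNVC=x',
    'ALLELEID=2;CLNSIG=Benign/Likely_benign',
    'ALLELEID=3;CLNSIG=Pathogenic',
    'ALLELEID=4;CLNSIG=Pathogenic/Likely_pathogenic',
    'ALLELEID=5',
]})
benign_toy = clnsig_equals(toy, 'Benign')
pathogenic_toy = clnsig_equals(toy, 'Pathogenic')
```

Exact matching is stricter than the substring filter: it also drops any value that merely starts with the label, so the two approaches may select different rows on the real data.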


## ClinVar data mapped to nORF regions using BedTools Below:

Using the BedTools intersect tool in an Ubuntu command line, we located the variants that lie in nORFs. These are the only variants we are interested in for this study. The code for bedtools intersect can be found in the Google doc entitled sORFc_Felix's_work_replication Finalized 27/6.
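
The exact commands live in that Google doc and are not reproduced here. Purely as an illustration, the sketch below builds a `bedtools intersect` call from Python. The function names and file names are hypothetical, and the choice of options (`-a`, `-b`, `-header`, and optionally `-split`) is an assumption about how such a run could look, not a record of the command actually used:

```python
import shutil
import subprocess


def bedtools_intersect_cmd(variants_vcf, regions_bed, split=False):
    """Build a `bedtools intersect` argument list (file names are placeholders)."""
    # -header prints the header of the -a file before the results.
    cmd = ['bedtools', 'intersect', '-a', variants_vcf, '-b', regions_bed, '-header']
    if split:
        # -split treats the blocks of a BED12 entry as separate intervals.
        cmd.append('-split')
    return cmd


def run_intersect(variants_vcf, regions_bed, out_vcf, split=False):
    """Run the command and write its output to `out_vcf` (needs bedtools on PATH)."""
    if shutil.which('bedtools') is None:
        raise RuntimeError('bedtools not found on PATH')
    with open(out_vcf, 'w') as out:
        subprocess.run(bedtools_intersect_cmd(variants_vcf, regions_bed, split),
                       stdout=out, check=True)


cmd = bedtools_intersect_cmd('clinvar.vcf', 'norfs.bed', split=True)
print(' '.join(cmd))
```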

The following code gives us all variants which mapped to nORF regions from the ClinVar database:

In [12]:
clinvar_variants_mapped_to_nORFs = read_vcf('clinvar_mapped_to_norfs_with_header.vcf')
clinvar_variants_mapped_to_nORFs

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
0,chrchr10,100154912,476196,G,A,.,.,AF_EXAC=0.00001;ALLELEID=459810;CLNDISDB=MedGe...
1,chrchr10,100154922,226426,G,A,.,.,"ALLELEID=228227;CLNDISDB=MedGen:C4284588,OMIM:..."
2,chrchr10,100183753,596294,C,T,.,.,ALLELEID=587355;CLNDISDB=MedGen:CN517202;CLNDN...
3,chrchr10,100183802,226427,C,A,.,.,"ALLELEID=228228;CLNDISDB=MedGen:C4284588,OMIM:..."
...,...,...,...,...,...,...,...,...
121810,chrchrX,9760717,98631,C,T,.,.,AF_EXAC=0.00002;ALLELEID=104521;CLNDISDB=MedGe...
121811,chrchrX,9760724,98628,C,T,.,.,ALLELEID=104518;CLNDISDB=MedGen:CN517202;CLNDN...
121812,chrchrX,9760730,98627,C,G,.,.,ALLELEID=104517;CLNDISDB=MedGen:CN517202;CLNDN...
121813,chrchrX,9760731,98626,A,G,.,.,ALLELEID=104516;CLNDISDB=MedGen:CN517202;CLNDN...


Below is the dataframe for the coding region only. This dataset was produced using bedtools intersect with the -split option. This dataframe is central to our study, because the two classifiers we will build once all dataframes are created depend on whether variants are located in the coding or non-coding regions.

In [13]:
clinvar_coding_region_variants_only = read_vcf('clinvar_mapped_to_norfs_split_with_header.vcf')
clinvar_coding_region_variants_only

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
0,chrchr10,100154912,476196,G,A,.,.,AF_EXAC=0.00001;ALLELEID=459810;CLNDISDB=MedGe...
1,chrchr10,100154922,226426,G,A,.,.,"ALLELEID=228227;CLNDISDB=MedGen:C4284588,OMIM:..."
2,chrchr10,100183753,596294,C,T,.,.,ALLELEID=587355;CLNDISDB=MedGen:CN517202;CLNDN...
3,chrchr10,100183802,226427,C,A,.,.,"ALLELEID=228228;CLNDISDB=MedGen:C4284588,OMIM:..."
...,...,...,...,...,...,...,...,...
101101,chrchrX,9760717,98631,C,T,.,.,AF_EXAC=0.00002;ALLELEID=104521;CLNDISDB=MedGe...
101102,chrchrX,9760724,98628,C,T,.,.,ALLELEID=104518;CLNDISDB=MedGen:CN517202;CLNDN...
101103,chrchrX,9760730,98627,C,G,.,.,ALLELEID=104517;CLNDISDB=MedGen:CN517202;CLNDN...
101104,chrchrX,9760731,98626,A,G,.,.,ALLELEID=104516;CLNDISDB=MedGen:CN517202;CLNDN...


Below is the dataframe for ClinVar benign variants mapped to coding nORF regions. It is obtained by keeping the variants whose Clinical Significance (CLNSIG) is benign only (not likely_benign).

In [14]:
# 'CLNSIG=Benign/Likely' is a prefix of every combined label, so one exclusion covers them all.
clinvar_coding_region_benign_variants_only = clinvar_coding_region_variants_only[
    clinvar_coding_region_variants_only['INFO'].str.contains('CLNSIG=Benign')
    & ~clinvar_coding_region_variants_only['INFO'].str.contains('CLNSIG=Benign/Likely')
]
clinvar_coding_region_benign_variants_only

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
18,chrchr10,100987606,136593,G,T,.,.,AF_TGP=0.41214;ALLELEID=140296;CLNDISDB=MedGen...
57,chrchr10,102065918,284200,C,G,.,.,AF_TGP=0.00260;ALLELEID=268437;CLNDISDB=MedGen...
82,chrchr10,102396271,474786,G,A,.,.,AF_ESP=0.00431;AF_EXAC=0.00140;AF_TGP=0.00579;...
87,chrchr10,102399466,541631,C,T,.,.,AF_EXAC=0.00158;AF_TGP=0.00339;ALLELEID=525432...
...,...,...,...,...,...,...,...,...
101076,chrchrX,85964016,255991,T,C,.,.,AF_ESP=0.16937;AF_EXAC=0.19154;AF_TGP=0.12000;...
101079,chrchrX,85978770,497462,G,C,.,.,AF_ESP=0.00246;AF_EXAC=0.00079;AF_TGP=0.00318;...
101081,chrchrX,85978816,377662,T,A,.,.,AF_ESP=0.01335;AF_EXAC=0.01456;AF_TGP=0.00742;...
101087,chrchrX,93671995,208906,G,C,.,.,ALLELEID=205531;CLNDISDB=Human_Phenotype_Ontol...


Below is the dataframe for ClinVar pathogenic variants mapped to coding nORF regions. It is obtained by keeping the variants whose Clinical Significance (CLNSIG) is pathogenic only (not likely_pathogenic).

In [15]:
clinvar_coding_region_pathogenic_variants_only = clinvar_coding_region_variants_only[
    clinvar_coding_region_variants_only['INFO'].str.contains('CLNSIG=Pathogenic')
    & ~clinvar_coding_region_variants_only['INFO'].str.contains('CLNSIG=non-Pathogenic')
    & ~clinvar_coding_region_variants_only['INFO'].str.contains('CLNSIG=Pathogenic/Likely_pathogenic')
    & ~clinvar_coding_region_variants_only['INFO'].str.contains('CLNSIG=Pathogenic/LikelyPathogenic')
]
clinvar_coding_region_pathogenic_variants_only

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
1,chrchr10,100154922,226426,G,A,.,.,"ALLELEID=228227;CLNDISDB=MedGen:C4284588,OMIM:..."
3,chrchr10,100183802,226427,C,A,.,.,"ALLELEID=228228;CLNDISDB=MedGen:C4284588,OMIM:..."
7,chrchr10,100246864,504028,AT,A,.,.,ALLELEID=495446;CLNDISDB=MedGen:CN517202;CLNDN...
11,chrchr10,100253422,419610,G,A,.,.,AF_ESP=0.00023;AF_EXAC=0.00004;ALLELEID=407798...
...,...,...,...,...,...,...,...,...
101084,chrchrX,85981796,279771,C,A,.,.,ALLELEID=265177;CLNDISDB=MedGen:CN517202;CLNDN...
101094,chrchrX,9759332,10517,C,T,.,.,"ALLELEID=25556;CLNDISDB=MedGen:C0342684,OMIM:3..."
101098,chrchrX,9759390,10516,A,G,.,.,"ALLELEID=25555;CLNDISDB=MedGen:C0342684,OMIM:3..."
101099,chrchrX,9759390,10519,A,T,.,.,"ALLELEID=25558;CLNDISDB=MedGen:C0342684,OMIM:3..."


Now we isolate the non-coding region variants using the concat call below. It stacks all the variants mapped to nORFs together with two copies of the coding-region variants, so every coding-region variant occurs at least twice. drop_duplicates(keep=False) then removes every row that occurs more than once. What remains are the variants in the non-coding region only:

In [16]:
clinvar_noncoding_region_variants_only = pd.concat([clinvar_variants_mapped_to_nORFs, clinvar_coding_region_variants_only, clinvar_coding_region_variants_only]).drop_duplicates(keep=False)
clinvar_noncoding_region_variants_only

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
20,chrchr10,100987662,298487,G,A,.,.,"ALLELEID=319783;CLNDISDB=MedGen:C0342782,Orpha..."
25,chrchr10,100987970,298492,C,T,.,.,AF_TGP=0.00539;ALLELEID=309197;CLNDISDB=MedGen...
26,chrchr10,100988106,298493,T,C,.,.,"ALLELEID=309201;CLNDISDB=MedGen:C0342782,Orpha..."
27,chrchr10,100988240,380984,C,T,.,.,ALLELEID=371478;CLNDISDB=MedGen:CN169374;CLNDN...
...,...,...,...,...,...,...,...,...
121783,chrchrX,85965588,438064,T,C,.,.,ALLELEID=431849;CLNDISDB=Human_Phenotype_Ontol...
121784,chrchrX,85968493,389953,C,G,.,.,AF_TGP=0.00053;ALLELEID=378679;CLNDISDB=MedGen...
121785,chrchrX,85968616,425508,G,C,.,.,AF_TGP=0.00026;ALLELEID=413842;CLNDISDB=MedGen...
121796,chrchrX,9653674,283429,C,T,.,.,AF_EXAC=0.00013;ALLELEID=267666;CLNDISDB=MedGe...
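
The same "rows in one dataframe but not in another" operation can also be written as an explicit anti-join. The sketch below uses a hypothetical helper `rows_not_in` and toy data; it is an alternative formulation, not the method used in this notebook:

```python
import pandas as pd


def rows_not_in(all_rows, subset, keys):
    """Return rows of `all_rows` whose `keys` do not occur in `subset` (anti-join)."""
    # One marker row per distinct key combination in the subset.
    marker = subset[keys].drop_duplicates()
    # indicator=True adds a '_merge' column saying where each row was found.
    merged = all_rows.merge(marker, on=keys, how='left', indicator=True)
    return merged[merged['_merge'] == 'left_only'].drop(columns='_merge')


keys = ['CHROM', 'POS', 'REF', 'ALT']
all_toy = pd.DataFrame({'CHROM': ['chr1', 'chr1', 'chr2'], 'POS': [10, 20, 30],
                        'REF': ['A', 'C', 'G'], 'ALT': ['T', 'G', 'A']})
coding_toy = all_toy.iloc[[1]]
noncoding_toy = rows_not_in(all_toy, coding_toy, keys)
```

Two differences from the concat approach are worth keeping in mind: `merge` returns a fresh index rather than the original row labels, and rows that are duplicated within the full set itself are kept here, whereas drop_duplicates(keep=False) would remove them. In both approaches the key columns must hold the same types in both dataframes for rows to match.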


Now let us take the benign variants from the non-coding set:

In [17]:
# 'CLNSIG=Benign/Likely' is a prefix of every combined label, so one exclusion covers them all.
clinvar_noncoding_region_benign_variants_only = clinvar_noncoding_region_variants_only[
    clinvar_noncoding_region_variants_only['INFO'].str.contains('CLNSIG=Benign')
    & ~clinvar_noncoding_region_variants_only['INFO'].str.contains('CLNSIG=Benign/Likely')
]
clinvar_noncoding_region_benign_variants_only

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
73,chrchr10,100989312,136588,G,A,.,.,AF_ESP=0.05359;AF_TGP=0.04553;ALLELEID=140291;...
102,chrchr10,100990864,136589,C,T,.,.,AF_ESP=0.25819;AF_EXAC=0.27515;AF_TGP=0.37600;...
103,chrchr10,100990866,136590,T,C,.,.,AF_ESP=0.25573;AF_EXAC=0.27520;AF_TGP=0.37600;...
108,chrchr10,100991026,136591,C,A,.,.,AF_ESP=0.28510;AF_TGP=0.37580;ALLELEID=140294;...
...,...,...,...,...,...,...,...,...
121408,chrchrX,71132767,213614,CCTCTTCTCTTCTCTTCTCTTCTCTT,C,.,.,ALLELEID=210570;CLNDISDB=MedGen:CN169374;CLNDN...
121410,chrchrX,71132767,95249,CCTCTT,C,.,.,ALLELEID=101148;CLNDISDB=MedGen:CN169374;CLNDN...
121412,chrchrX,71132767,95251,CCTCTTCTCTTCTCTTCTCTTCTCTTCTCTT,C,.,.,ALLELEID=101150;CLNDISDB=MedGen:CN169374;CLNDN...
121709,chrchrX,78118027,558817,C,T,.,.,AF_ESP=0.26615;AF_EXAC=0.27274;AF_TGP=0.34782;...


We will do the same for the pathogenic variants below:

In [18]:
clinvar_noncoding_region_pathogenic_variants_only = clinvar_noncoding_region_variants_only[clinvar_noncoding_region_variants_only['INFO'].str.contains('CLNSIG=Pathogenic') 
                                                                                           & ~clinvar_noncoding_region_variants_only['INFO'].str.contains('CLNSIG=non-Pathogenic') 
                                                                                           & ~clinvar_noncoding_region_variants_only['INFO'].str.contains('CLNSIG=Pathogenic/Likely_pathogenic') 
                                                                                           & ~clinvar_noncoding_region_variants_only['INFO'].str.contains('CLNSIG=Pathogenic/LikelyPathogenic')]
clinvar_noncoding_region_pathogenic_variants_only

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
38,chrchr10,100988540,225837,CT,C,.,.,"ALLELEID=227652;CLNDISDB=MedGen:C1849096,OMIM:..."
51,chrchr10,100989084,488187,C,A,.,.,"ALLELEID=481146;CLNDISDB=MedGen:C4015307,OMIM:..."
54,chrchr10,100989118,4628,G,A,.,.,AF_EXAC=0.00002;ALLELEID=19667;CLNDISDB=MedGen...
56,chrchr10,100989154,4620,G,T,.,.,"ALLELEID=19659;CLNDISDB=MedGen:C1836439,OMIM:6..."
...,...,...,...,...,...,...,...,...
121537,chrchrX,74529428,212188,G,GC,.,.,"ALLELEID=209181;CLNDISDB=MedGen:C0795889,OMIM:..."
121688,chrchrX,77902641,39768,A,G,.,.,"ALLELEID=48367;CLNDISDB=MedGen:C3550921,OMIM:3..."
121696,chrchrX,78003237,11786,G,A,.,.,"ALLELEID=26825;CLNDISDB=MedGen:C0022716,OMIM:3..."
121713,chrchrX,78122954,9955,G,A,.,.,"ALLELEID=24994;CLNDISDB=MedGen:C1970848,OMIM:3..."


# Human Derived SNVs Below:


#### hg19:

Below is the entire benign variant dataset for the human derived EPO 6-way primates file mapped to hg19 which we received from Dr. Martin Kircher:

In [4]:
human_derived_SNV_data = read_vcf('humanDerived_SNVs.vcf')
human_derived_SNV_data


Unnamed: 0,1,379177,.,G,T
0,1,379274,.,G,C
1,1,379476,.,T,A
2,1,379631,.,G,C
3,1,379724,.,G,A
...,...,...,...,...,...
15684027,X,155172105,.,T,C
15684028,X,155172597,.,T,C
15684029,X,155173020,.,T,C
15684030,X,155173117,.,C,T


We need to run liftover on these variants to convert them from hg19 to hg38 coordinates. To do this, we must first reshape the dataframe so that it can be written as a vcf file usable from the command line:

In [None]:
# Add three placeholder columns (QUAL, FILTER, INFO) to reach the 8 vcf columns.
human_derived_SNV_data['..'] = '.'
human_derived_SNV_data['...'] = '.'
human_derived_SNV_data['....'] = '.'
# The first record was read in as the column names, so re-insert it as a data row.
human_derived_SNV_data.loc[-1] = ['1', '379177', '.', 'G', 'T', '.', '.', '.']
human_derived_SNV_data.index = human_derived_SNV_data.index + 1  # shifting index
human_derived_SNV_data = human_derived_SNV_data.sort_index()
# Blank out the column names before export.
human_derived_SNV_data.columns = ['', '', '', '', '', '', '', '']
human_derived_SNV_data

The code below writes a vcf header for the data. After running it, we use the command line to convert the file to a bed file so that liftover can be run on it. This is documented in the sORFc_Felix's_work_replication Finalized 27/6.

In [None]:
header = """##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/
#CHROM POS ID REF ALT QUAL FILTER INFO
"""

output_VCF = "humanderivedSNVsreal.vcf"
with open(output_VCF, 'w') as vcf:
    vcf.write(header)

human_derived_SNV_data.to_csv(output_VCF, sep="\t", mode='a', index=False)
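
The conversion to bed format was done on the command line and is not shown here. As a rough sketch of what that conversion involves for single-nucleotide variants, the hypothetical function below turns 1-based vcf positions into 0-based, half-open bed intervals. The column names, the 'chr' prefixing, and the toy rows are assumptions for illustration:

```python
import pandas as pd


def snv_vcf_to_bed(df):
    """Convert SNV rows (CHROM, POS, REF, ALT) into BED-style intervals."""
    pos = df['POS'].astype(int)
    chrom = df['CHROM'].astype(str)
    # Add the 'chr' prefix only where it is missing.
    chrom = chrom.where(chrom.str.startswith('chr'), 'chr' + chrom)
    return pd.DataFrame({
        'chrom': chrom,
        'start': pos - 1,  # BED start is 0-based
        'end': pos,        # BED end is exclusive, so an SNV spans [POS-1, POS)
        'name': df['REF'] + '>' + df['ALT'],
    })


toy = pd.DataFrame({'CHROM': ['1', 'X'], 'POS': [379177, 155173117],
                    'REF': ['G', 'C'], 'ALT': ['T', 'T']})
bed = snv_vcf_to_bed(toy)
# bed.to_csv('toy.bed', sep='\t', header=False, index=False)  # write without a header row
```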

#### hg38:

Below is the entire benign variant dataset for the human derived EPO 6-way primates file mapped to hg38. Because the classifier is built entirely on hg38 data, this is the only human derived file used in the rest of the analysis, and the hg19 file is disregarded.

In [9]:
human_derived_SNV_data_hg38 = read_vcf('humanderivedSNVsreal_hg38.vcf')
human_derived_SNV_data_hg38

Unnamed: 0,chr1,440160,.,G,T
0,chr1,440063,.,G,C
1,chr1,439861,.,T,A
2,chr1,439706,.,G,C
3,chr1,439613,.,G,A
...,...,...,...,...,...
15679200,chrX,155942441,.,T,C
15679201,chrX,155942933,.,T,C
15679202,chrX,155943356,.,T,C
15679203,chrX,155943453,.,C,T


### Exporting the dataframe as a vcf file to use bedtools on:

In order to use this dataframe with bedtools intersect, we must have it in vcf format. The following code adds three placeholder columns so that we reach the standard 8 columns of a vcf file, re-inserts as a data row the record that was read in as the column names, and then blanks the column names, because bedtools intersect cannot run when column names are present. I also had to manipulate the file using the command line, which is documented in the Google doc mentioned previously.

In [10]:
human_derived_SNV_data_hg38['..'] = '.'
human_derived_SNV_data_hg38['...'] = '.'
human_derived_SNV_data_hg38['....'] = '.'
human_derived_SNV_data_hg38.loc[-1] = ['chr1', '440160', '.', 'G', 'T', '.', '.', '.']
human_derived_SNV_data_hg38.index = human_derived_SNV_data_hg38.index + 1  # shifting index
human_derived_SNV_data_hg38 = human_derived_SNV_data_hg38.sort_index()
human_derived_SNV_data_hg38.columns = ['', '', '', '', '', '', '', '']
human_derived_SNV_data_hg38

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8
0,chr1,440160,.,G,T,.,.,.
1,chr1,440063,.,G,C,.,.,.
2,chr1,439861,.,T,A,.,.,.
3,chr1,439706,.,G,C,.,.,.
...,...,...,...,...,...,...,...,...
15679201,chrX,155942441,.,T,C,.,.,.
15679202,chrX,155942933,.,T,C,.,.,.
15679203,chrX,155943356,.,T,C,.,.,.
15679204,chrX,155943453,.,C,T,.,.,.


We will need the following code to create the header of the vcf file:

In [11]:
header = """##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/
#CHROM POS ID REF ALT QUAL FILTER INFO
"""

output_VCF = "humanderivedSNVsreal_hg38forbedtoolsintersect.vcf"
with open(output_VCF, 'w') as vcf:
    vcf.write(header)

human_derived_SNV_data_hg38.to_csv(output_VCF, sep="\t", mode='a', index=False)
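
A more compact way to produce such a file, sketched below with a hypothetical helper `write_minimal_vcf` and a single toy row, is to write a tab-delimited '#CHROM' line by hand and pass `header=False` to `to_csv`, so the dataframe's own column names are never written and do not need to be blanked first:

```python
import os
import tempfile

import pandas as pd

VCF_COLUMNS = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO']


def write_minimal_vcf(df, path):
    """Write an 8-column dataframe as a minimal tab-delimited VCF (hypothetical helper)."""
    with open(path, 'w') as vcf:
        vcf.write('##fileformat=VCFv4.1\n')
        # The VCF column line is tab-delimited.
        vcf.write('\t'.join(VCF_COLUMNS) + '\n')
        # header=False: the column line was already written above.
        df.to_csv(vcf, sep='\t', header=False, index=False)


toy = pd.DataFrame([['chr1', 440160, '.', 'G', 'T', '.', '.', '.']])
out_path = os.path.join(tempfile.mkdtemp(), 'toy.vcf')
write_minimal_vcf(toy, out_path)
with open(out_path) as fh:
    written = fh.read()
```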

## After bedtools intersect, we got the following dataframe which shows the human derived variants in the nORF regions:

In [12]:
human_derived_variants_mapped_to_nORFs = read_vcf('human_derived_mapped_to_norfs.vcf')
human_derived_variants_mapped_to_nORFs

Unnamed: 0,chr10,1000013,.,G,A
0,chr10,100020652,.,G,A
1,chr10,1000297,.,T,G
2,chr10,1000555,.,A,T
3,chr10,1000567,.,G,C
...,...,...,...,...,...
1020174,chrX,99719539,.,A,G
1020175,chrX,99719939,.,C,A
1020176,chrX,99720084,.,G,A
1020177,chrX,99721008,.,G,A


Let us clean this up to make it look more like a VCF file:

In [13]:
human_derived_variants_mapped_to_nORFs.loc[-1] = ['chr10', '1000013', '.', 'G', 'A']
human_derived_variants_mapped_to_nORFs.index = human_derived_variants_mapped_to_nORFs.index + 1  # shifting index
human_derived_variants_mapped_to_nORFs = human_derived_variants_mapped_to_nORFs.sort_index()
human_derived_variants_mapped_to_nORFs.columns = ['CHROM', 'POS', 'ID', 'REF', 'ALT']
human_derived_variants_mapped_to_nORFs

Unnamed: 0,CHROM,POS,ID,REF,ALT
0,chr10,1000013,.,G,A
1,chr10,100020652,.,G,A
2,chr10,1000297,.,T,G
3,chr10,1000555,.,A,T
...,...,...,...,...,...
1020175,chrX,99719539,.,A,G
1020176,chrX,99719939,.,C,A
1020177,chrX,99720084,.,G,A
1020178,chrX,99721008,.,G,A
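
The first record ends up as the column names because the intersect output has no '#CHROM' line. One way to avoid re-inserting that row by hand, sketched here with a hypothetical function `read_headerless_variants` and toy input, is to tell pandas that the file has no header and to supply the column names directly:

```python
import io

import pandas as pd


def read_headerless_variants(source, names=('CHROM', 'POS', 'ID', 'REF', 'ALT')):
    """Read tab-separated variant rows that lack a '#CHROM' line (hypothetical helper)."""
    # header=None keeps the first record as data; names supplies the column labels.
    return pd.read_csv(source, sep='\t', header=None, names=list(names),
                       dtype={'CHROM': str, 'POS': int})


# Two toy rows standing in for a headerless bedtools output file.
toy_file = io.StringIO('chr10\t1000013\t.\tG\tA\nchr10\t100020652\t.\tG\tA\n')
toy = read_headerless_variants(toy_file)
```

Reading the file this way also keeps POS as an integer in every row, which matters when rows from different dataframes are later compared.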


Below are the variants mapped to the coding region of nORFs only:

In [14]:
human_derived_coding_region_variants_only = read_vcf('human_derived_mapped_to_norfs_with_split.vcf')
human_derived_coding_region_variants_only

Unnamed: 0,chr10,100020652,.,G,A
0,chr10,100190879,.,T,C
1,chr10,100233196,.,G,A
2,chr10,100267615,.,C,T
3,chr10,100347207,.,T,C
...,...,...,...,...,...
53793,chrX,99719539,.,A,G
53794,chrX,99719939,.,C,A
53795,chrX,99720084,.,G,A
53796,chrX,99721008,.,G,A


Making this look more like our favorite vcf format, we get the following:

In [15]:
human_derived_coding_region_variants_only.loc[-1] = ['chr10', '100020652', '.', 'G', 'A']
human_derived_coding_region_variants_only.index = human_derived_coding_region_variants_only.index + 1  # shifting index
human_derived_coding_region_variants_only = human_derived_coding_region_variants_only.sort_index()
human_derived_coding_region_variants_only.columns = ['CHROM', 'POS', 'ID', 'REF', 'ALT']
human_derived_coding_region_variants_only

Unnamed: 0,CHROM,POS,ID,REF,ALT
0,chr10,100020652,.,G,A
1,chr10,100190879,.,T,C
2,chr10,100233196,.,G,A
3,chr10,100267615,.,C,T
...,...,...,...,...,...
53794,chrX,99719539,.,A,G
53795,chrX,99719939,.,C,A
53796,chrX,99720084,.,G,A
53797,chrX,99721008,.,G,A


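Below are the variants mapped to the non-coding region of nORFs only. Note that chr10:100020652 still appears in the output even though it belongs to the coding set. The most likely cause is that its POS value was re-inserted as a string in the coding dataframe while it was read as an integer in the full nORF dataframe, so drop_duplicates does not treat the two rows as equal: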

In [16]:
human_derived_noncoding_region_variants_only = pd.concat([human_derived_variants_mapped_to_nORFs, human_derived_coding_region_variants_only, human_derived_coding_region_variants_only]).drop_duplicates(keep=False)
human_derived_noncoding_region_variants_only

Unnamed: 0,CHROM,POS,ID,REF,ALT
0,chr10,1000013,.,G,A
1,chr10,100020652,.,G,A
2,chr10,1000297,.,T,G
3,chr10,1000555,.,A,T
...,...,...,...,...,...
1020168,chrX,9931817,.,T,C
1020169,chrX,9931818,.,G,A
1020170,chrX,9931993,.,T,C
1020171,chrX,9932000,.,C,T


# HGMD Database Shown Below

## The dataframe below is for GRCh37 - We will not be using it in our analysis

In [18]:
read_vcf('HGMD_PRO_2016.4_hg19_best.vcf')

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
0,1,865595,CM1613956,A,G,.,.,CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_...
1,1,874491,CM1613954,C,T,.,.,CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_...
2,1,877523,CM1511864,C,G,.,.,CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_...
3,1,879286,CS1613955,A,C,.,.,CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_...
...,...,...,...,...,...,...,...,...
174266,Y,6931938,CM121018,G,C,.,.,CLASS=DM?;MUT=ALT;GENE=TBL1Y;STRAND=+;DNA=NM_0...
174267,Y,6938305,CM121019,C,T,.,.,CLASS=DM?;MUT=ALT;GENE=TBL1Y;STRAND=+;DNA=NM_0...
174268,Y,14847658,CD993525,TTAAG,T,.,.,CLASS=DM;MUT=ALT;GENE=USP9Y;STRAND=+;DNA=NM_00...
174269,Y,16952726,CM086695,A,G,.,.,CLASS=DM;MUT=ALT;GENE=NLGN4Y;STRAND=+;DNA=NM_0...


## This dataframe is for GRCh38 and we will be using this for bedtools  

Using a text editor (Sublime Text 3), I removed the meta-information header of the file, which seemed to be causing trouble when creating the dataframe you see below. I kept the column header line, so the dataframe shows the standard vcf columns:

In [17]:
hgmd_for_hg38 = read_vcf('hgmddatabasenew_noheader.vcf')
hgmd_for_hg38

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
0,1,930215,CM1613956,A,G,.,.,CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_...
1,1,939111,CM1613954,C,T,.,.,CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_...
2,1,942143,CM1511864,C,G,.,.,CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_...
3,1,943906,CS1613955,A,C,.,.,CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_...
...,...,...,...,...,...,...,...,...
232614,Y,9467296,CD1112014,GCC,G,.,.,CLASS=DM?;MUT=ALT;GENE=TSPY1;STRAND=+;DNA=NM_0...
232615,Y,9467303,CD1112015,GC,G,.,.,CLASS=DM?;MUT=ALT;GENE=TSPY1;STRAND=+;DNA=NM_0...
232616,Y,12735724,CD993525,TTAAG,T,.,.,CLASS=DM;MUT=ALT;GENE=USP9Y;STRAND=+;DNA=NM_00...
232617,Y,14840846,CM086695,A,G,.,.,CLASS=DM;MUT=ALT;GENE=NLGN4Y;STRAND=+;DNA=NM_0...


In order to export to a vcf file, I removed the column headers:

In [21]:
hgmd_for_hg38.columns = ['', '', '', '', '', '', '', '']
hgmd_for_hg38

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8
0,1,930215,CM1613956,A,G,.,.,CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_...
1,1,939111,CM1613954,C,T,.,.,CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_...
2,1,942143,CM1511864,C,G,.,.,CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_...
3,1,943906,CS1613955,A,C,.,.,CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_...
...,...,...,...,...,...,...,...,...
232614,Y,9467296,CD1112014,GCC,G,.,.,CLASS=DM?;MUT=ALT;GENE=TSPY1;STRAND=+;DNA=NM_0...
232615,Y,9467303,CD1112015,GC,G,.,.,CLASS=DM?;MUT=ALT;GENE=TSPY1;STRAND=+;DNA=NM_0...
232616,Y,12735724,CD993525,TTAAG,T,.,.,CLASS=DM;MUT=ALT;GENE=USP9Y;STRAND=+;DNA=NM_00...
232617,Y,14840846,CM086695,A,G,.,.,CLASS=DM;MUT=ALT;GENE=NLGN4Y;STRAND=+;DNA=NM_0...


Just as with the Human Derived files, I need to create a header in order to use this file with bedtools, shown below:

In [17]:
header = """##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/
#CHROM POS ID REF ALT QUAL FILTER INFO
"""

output_VCF = "hgmd_for_bedtools_intersect.vcf"
with open(output_VCF, 'w') as vcf:
    vcf.write(header)

hgmd_for_hg38.to_csv(output_VCF, sep="\t", mode='a', index=False)

# After Running HGMD through Bedtools Intersect

Below are all the variants that were mapped to nORF regions for the HGMD dataset:

In [58]:
hgmd_variants_mapped_to_nORFs = read_vcf('hgmd_mapped_to_norfs_real.vcf')
hgmd_variants_mapped_to_nORFs

Unnamed: 0,chr10,100154922,CM140970,G,A,.,..1,"CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_006459.3:c.763C>T;PROT=NP_006450.2:p.R255*;DB=rs876657413;PHEN=""Spastic_paraplegia_62"";RANKSCORE=0.99"
0,chr10,100183802,CM140971,C,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
1,chr10,100253438,CI1824020,A,AT,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
2,chr10,100256298,CD162836,TG,T,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
3,chr10,100262050,CM162834,C,G,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
...,...,...,...,...,...,...,...,...
64630,chrX,9760731,CM981395,A,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64631,chrX,9760732,CI183806,G,GA,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64632,chrX,9760736,CD171619,GC,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64633,chrX,9760741,CI115195,A,AG,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...


As with others, we must manipulate this to get to the correct vcf format:

In [59]:
hgmd_variants_mapped_to_nORFs.loc[-1] = ['chr10', '100154922', 'CM140970', 'G', 'A', '.', '.', 'CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_006459.3:c.763C>T;PROT=NP_006450.2:p.R255*;DB=rs876657413;PHEN="Spastic_paraplegia_62";RANKSCORE=0.99']
hgmd_variants_mapped_to_nORFs.index = hgmd_variants_mapped_to_nORFs.index + 1  # shifting index
hgmd_variants_mapped_to_nORFs = hgmd_variants_mapped_to_nORFs.sort_index()
hgmd_variants_mapped_to_nORFs.columns = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO']
hgmd_variants_mapped_to_nORFs

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
0,chr10,100154922,CM140970,G,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
1,chr10,100183802,CM140971,C,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
2,chr10,100253438,CI1824020,A,AT,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
3,chr10,100256298,CD162836,TG,T,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
...,...,...,...,...,...,...,...,...
64631,chrX,9760731,CM981395,A,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64632,chrX,9760732,CI183806,G,GA,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64633,chrX,9760736,CD171619,GC,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64634,chrX,9760741,CI115195,A,AG,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...


Since we want the best classifier possible, we filter out all variants whose pathogenicity is uncertain. These are the variants with 'CLASS=DM?' in the INFO column, which may not be pathogenic, so we cut them from the dataframe. This process is shown below:

First, we need to locate these uncertain variants:

In [77]:
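# str.contains treats its pattern as a regular expression by default, so 'CLASS=DM?;' matches
# 'CLASS=DM;' (the '?' makes the 'M' optional). Excluding DM, DP, DFP, FP and R leaves the 'CLASS=DM?' rows.
finding_the_uncertainty = hgmd_variants_mapped_to_nORFs[~hgmd_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=DM?;') & 
                                                        ~hgmd_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=DP') 
                                                        & ~hgmd_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=DFP') 
                                                        & ~hgmd_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=FP') 
                                                        & ~hgmd_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=R')]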
finding_the_uncertainty

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
18,chr10,100989064,CM193135,G,A,.,.,CLASS=DM?;MUT=ALT;GENE=TWNK;STRAND=+;DNA=NM_02...
23,chr10,100989148,CM1720497,G,A,.,.,CLASS=DM?;MUT=ALT;GENE=TWNK;STRAND=+;DNA=NM_02...
41,chr10,100989312,CM012076,G,A,.,.,CLASS=DM?;MUT=ALT;GENE=TWNK;STRAND=+;DNA=NM_02...
53,chr10,100989400,CM182916,A,G,.,.,CLASS=DM?;MUT=ALT;GENE=TWNK;STRAND=+;DNA=NM_02...
...,...,...,...,...,...,...,...,...
64581,chrX,86027503,CM137329,A,G,.,.,CLASS=DM?;MUT=ALT;GENE=CHM;STRAND=-;DNA=NM_000...
64593,chrX,8623669,CM133761,C,T,.,.,CLASS=DM?;MUT=ALT;GENE=ANOS1;STRAND=-;DNA=NM_0...
64595,chrX,8623670,CM183496,A,T,.,.,CLASS=DM?;MUT=ALT;GENE=ANOS1;STRAND=-;DNA=NM_0...
64598,chrX,87518266,CM1417120,C,G,.,.,CLASS=DM?;MUT=ALT;GENE=KLHL4;STRAND=+;DNA=NM_0...
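
Because `str.contains` interprets its pattern as a regular expression by default, a '?' in a pattern is not matched literally. The toy example below (the INFO strings are made up for illustration) shows the difference, along with an alternative that extracts the CLASS value and compares it exactly:

```python
import pandas as pd

# Toy INFO strings for illustration only.
toy = pd.DataFrame({'INFO': [
    'CLASS=DM;MUT=ALT',
    'CLASS=DM?;MUT=ALT',
    'CLASS=DP;MUT=ALT',
    'CLASS=R;MUT=REF',
]})

# As a regular expression, 'M?' means "an optional M", so this matches 'CLASS=DM;'.
as_regex = toy['INFO'].str.contains('CLASS=DM?;')

# regex=False treats the pattern as a literal string, so this matches 'CLASS=DM?;'.
as_literal = toy['INFO'].str.contains('CLASS=DM?;', regex=False)

# Extract the CLASS value once and compare it exactly.
hgmd_class = toy['INFO'].str.extract(r'(?:^|;)CLASS=([^;]+)', expand=False)
uncertain = toy[hgmd_class == 'DM?']
```

With the extracted class in hand, later filters can be written as plain comparisons (for example `hgmd_class == 'DM'`) instead of chains of substring tests.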


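Next, we cut these from the dataframe since we do not need them. We do this with the same concat and drop_duplicates approach that we used for ClinVar and the human derived variants when searching for the non-coding region. This leaves all variants mapped to the nORF regions except the uncertain 'CLASS=DM?' ones. The remaining classes that do not pertain to our study are removed in the following steps: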

In [47]:
hgmd_disease_related_variants_mapped_to_nORFs = pd.concat([hgmd_variants_mapped_to_nORFs, finding_the_uncertainty, finding_the_uncertainty]).drop_duplicates(keep=False)
hgmd_disease_related_variants_mapped_to_nORFs

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
0,chr10,100154922,CM140970,G,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
1,chr10,100183802,CM140971,C,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
2,chr10,100253438,CI1824020,A,AT,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
3,chr10,100256298,CD162836,TG,T,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
...,...,...,...,...,...,...,...,...
64631,chrX,9760731,CM981395,A,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64632,chrX,9760732,CI183806,G,GA,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64633,chrX,9760736,CD171619,GC,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64634,chrX,9760741,CI115195,A,AG,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...


Now we will filter out every variant with 'CLASS=R', since these variants also do not pertain to our study. I repeat the same process as above to do this:

In [54]:
only_r = hgmd_disease_related_variants_mapped_to_nORFs[hgmd_disease_related_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=R')]
only_r

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
75,chr10,100990866,CS1715008,T,C,.,.,CLASS=R;MUT=REF;GENE=TWNK;STRAND=+;DNA=NM_0218...
291,chr10,119677227,CM117921,C,T,.,.,CLASS=R;MUT=ALT;GENE=BAG3;STRAND=+;DNA=NM_0042...
476,chr10,13298236,CM001290,G,A,.,.,CLASS=R;MUT=ALT;GENE=PHYH;STRAND=-;DNA=NM_0062...
3266,chr1,119955217,CM1110715,A,T,.,.,CLASS=R;MUT=ALT;GENE=NOTCH2;STRAND=-;DNA=NM_02...
...,...,...,...,...,...,...,...,...
62204,chrX,18575402,CM081205,G,A,.,.,CLASS=R;MUT=ALT;GENE=CDKL5;STRAND=+;DNA=NM_003...
64303,chrX,74524444,CM1613116,G,A,.,.,CLASS=R;MUT=ALT;GENE=SLC16A2;STRAND=+;DNA=NM_0...
64587,chrX,8623612,CD146454,CA,C,.,.,CLASS=R;MUT=ALT;GENE=ANOS1;STRAND=-;DNA=NM_000...
64588,chrX,8623617,CM146452,G,T,.,.,CLASS=R;MUT=ALT;GENE=ANOS1;STRAND=-;DNA=NM_000...


Removing the 'CLASS=R' variants shown above from the dataframe, we get:

In [55]:
hgmd_disease_related_variants_mapped_to_nORFs_no_r = pd.concat([hgmd_disease_related_variants_mapped_to_nORFs, only_r, only_r]).drop_duplicates(keep=False)
hgmd_disease_related_variants_mapped_to_nORFs_no_r

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
0,chr10,100154922,CM140970,G,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
1,chr10,100183802,CM140971,C,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
2,chr10,100253438,CI1824020,A,AT,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
3,chr10,100256298,CD162836,TG,T,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
...,...,...,...,...,...,...,...,...
64631,chrX,9760731,CM981395,A,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64632,chrX,9760732,CI183806,G,GA,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64633,chrX,9760736,CD171619,GC,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64634,chrX,9760741,CI115195,A,AG,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...


We will follow the same process as above, but this time for 'CLASS=FP', since these are not known to be pathogenic.

In [63]:
only_fp = hgmd_disease_related_variants_mapped_to_nORFs[hgmd_disease_related_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=FP') 
                                                        & ~hgmd_disease_related_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=DFP')]
hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp = pd.concat([hgmd_disease_related_variants_mapped_to_nORFs_no_r, only_fp, only_fp]).drop_duplicates(keep=False)
hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
0,chr10,100154922,CM140970,G,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
1,chr10,100183802,CM140971,C,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
2,chr10,100253438,CI1824020,A,AT,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
3,chr10,100256298,CD162836,TG,T,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
...,...,...,...,...,...,...,...,...
64631,chrX,9760731,CM981395,A,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64632,chrX,9760732,CI183806,G,GA,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64633,chrX,9760736,CD171619,GC,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64634,chrX,9760741,CI115195,A,AG,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...


We will also take out 'CLASS=DFP' and 'CLASS=DP' below using the same steps as above:

In [71]:
only_dfp = hgmd_disease_related_variants_mapped_to_nORFs[hgmd_disease_related_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=DFP')]
hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_no_dfp = pd.concat([hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp, only_dfp, only_dfp]).drop_duplicates(keep=False)
hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_no_dfp

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
0,chr10,100154922,CM140970,G,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
1,chr10,100183802,CM140971,C,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
2,chr10,100253438,CI1824020,A,AT,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
3,chr10,100256298,CD162836,TG,T,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
...,...,...,...,...,...,...,...,...
64631,chrX,9760731,CM981395,A,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64632,chrX,9760732,CI183806,G,GA,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64633,chrX,9760736,CD171619,GC,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64634,chrX,9760741,CI115195,A,AG,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...


In [75]:
only_dp = hgmd_disease_related_variants_mapped_to_nORFs[hgmd_disease_related_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=DP')]
hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_no_dfp_no_dp = pd.concat([hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_no_dfp, only_dp, only_dp]).drop_duplicates(keep=False)
hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_no_dfp_no_dp

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
0,chr10,100154922,CM140970,G,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
1,chr10,100183802,CM140971,C,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
2,chr10,100253438,CI1824020,A,AT,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
3,chr10,100256298,CD162836,TG,T,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
...,...,...,...,...,...,...,...,...
64631,chrX,9760731,CM981395,A,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64632,chrX,9760732,CI183806,G,GA,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64633,chrX,9760736,CD171619,GC,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64634,chrX,9760741,CI115195,A,AG,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...


To check that no variants without 'CLASS=DM' remain, we run the following code. The empty output confirms that the filtering worked:

In [76]:
not_dm = hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_no_dfp_no_dp[~hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_no_dfp_no_dp['INFO'].str.contains('CLASS=DM')]
not_dm

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
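
Because `str.contains('CLASS=DM')` is a substring test, it is also satisfied by 'CLASS=DM?'. Here that is harmless, since those rows were removed in the first step, but a stricter check is easy to write. The sketch below uses a toy Series (hypothetical values) and assumes, as in the rows shown throughout this notebook, that CLASS is the first key of the INFO string.

```python
import pandas as pd

# Toy INFO values (hypothetical); the second one is not a plain DM entry.
info = pd.Series(['CLASS=DM;GENE=A', 'CLASS=DM?;GENE=B', 'CLASS=DM;GENE=C'])

# Substring test as in the check cell: 'CLASS=DM' is a prefix of 'CLASS=DM?'.
loose = info.str.contains('CLASS=DM')

# Stricter test: INFO must begin with 'CLASS=DM' immediately followed by ';'.
strict = info.str.startswith('CLASS=DM;')

print(loose.tolist())
print(strict.tolist())
```

With the strict form, a final check such as `strict.all()` would only pass when every remaining row is exactly 'CLASS=DM'.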


#### A compact version of the code that gives the same result is below. The steps above were run one at a time so that they are easier to follow.

In [60]:
finding_the_uncertainty = hgmd_variants_mapped_to_nORFs[~hgmd_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=DM?;') & 
                                                        ~hgmd_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=DP') 
                                                        & ~hgmd_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=DFP') 
                                                        & ~hgmd_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=FP') 
                                                        & ~hgmd_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=R')]
hgmd_disease_related_variants_mapped_to_nORFs = pd.concat([hgmd_variants_mapped_to_nORFs, finding_the_uncertainty, finding_the_uncertainty]).drop_duplicates(keep=False)
only_r = hgmd_disease_related_variants_mapped_to_nORFs[hgmd_disease_related_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=R')]
hgmd_disease_related_variants_mapped_to_nORFs_no_r = pd.concat([hgmd_disease_related_variants_mapped_to_nORFs, only_r, only_r]).drop_duplicates(keep=False)
only_fp = hgmd_disease_related_variants_mapped_to_nORFs[hgmd_disease_related_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=FP') 
                                                        & ~hgmd_disease_related_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=DFP')]
hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp = pd.concat([hgmd_disease_related_variants_mapped_to_nORFs_no_r, only_fp, only_fp]).drop_duplicates(keep=False)
only_dfp = hgmd_disease_related_variants_mapped_to_nORFs[hgmd_disease_related_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=DFP')]
hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_no_dfp = pd.concat([hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp, only_dfp, only_dfp]).drop_duplicates(keep=False)
only_dp = hgmd_disease_related_variants_mapped_to_nORFs[hgmd_disease_related_variants_mapped_to_nORFs['INFO'].str.contains('CLASS=DP')]
hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_no_dfp_no_dp = pd.concat([hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_no_dfp, only_dp, only_dp]).drop_duplicates(keep=False)
hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_no_dfp_no_dp

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
0,chr10,100154922,CM140970,G,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
1,chr10,100183802,CM140971,C,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
2,chr10,100253438,CI1824020,A,AT,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
3,chr10,100256298,CD162836,TG,T,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
...,...,...,...,...,...,...,...,...
64631,chrX,9760731,CM981395,A,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64632,chrX,9760732,CI183806,G,GA,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64633,chrX,9760736,CD171619,GC,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
64634,chrX,9760741,CI115195,A,AG,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
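
Since every step above removes one value of the CLASS key, the whole chain can in principle be expressed as a single pass that extracts the CLASS value and keeps only 'DM'. This is a sketch on toy data, not the method used in this notebook; the names `toy`, `variant_class`, and `dm_only` are hypothetical, and it assumes that 'CLASS=' appears only once in each INFO string.

```python
import pandas as pd

# Toy stand-in for hgmd_variants_mapped_to_nORFs (hypothetical rows).
toy = pd.DataFrame({
    'ID': ['v1', 'v2', 'v3', 'v4', 'v5', 'v6'],
    'INFO': [
        'CLASS=DM;MUT=ALT', 'CLASS=DM?;MUT=ALT', 'CLASS=R;MUT=REF',
        'CLASS=FP;MUT=ALT', 'CLASS=DFP;MUT=ALT', 'CLASS=DP;MUT=ALT',
    ],
})

# Capture everything after 'CLASS=' up to (not including) the next ';'.
# expand=False returns a Series because there is a single capture group.
variant_class = toy['INFO'].str.extract(r'CLASS=([^;]+)', expand=False)

# Keep only rows whose class is exactly 'DM'.
dm_only = toy[variant_class == 'DM']
print(dm_only)
```

Calling `variant_class.value_counts()` before filtering would also show how many variants fall into each class.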


Now we will create a dataframe for the coding-region nORF variants only, using the output of bedtools intersect run with the -split option:

In [61]:
hgmd_coding_region_variants_only = read_vcf('hgmd_mapped_to_norfs_real_with_split.vcf')
hgmd_coding_region_variants_only

Unnamed: 0,chr10,100154922,CM140970,G,A,.,..1,"CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_006459.3:c.763C>T;PROT=NP_006450.2:p.R255*;DB=rs876657413;PHEN=""Spastic_paraplegia_62"";RANKSCORE=0.99"
0,chr10,100183802,CM140971,C,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
1,chr10,100253438,CI1824020,A,AT,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
2,chr10,100256298,CD162836,TG,T,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
3,chr10,100262050,CM162834,C,G,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
...,...,...,...,...,...,...,...,...
59425,chrX,9760731,CM981395,A,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59426,chrX,9760732,CI183806,G,GA,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59427,chrX,9760736,CD171619,GC,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59428,chrX,9760741,CI115195,A,AG,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...


Because this file has no header line, pandas has used the first record as the column names. We apply the same technique as above to get the dataframe into a usable form. Conveniently, the first entry is identical to the one in our previous code, so we can reuse that code line for line, changing only the name of the dataframe:

In [62]:
hgmd_coding_region_variants_only.loc[-1] = ['chr10', '100154922', 'CM140970', 'G', 'A', '.', '.', 'CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_006459.3:c.763C>T;PROT=NP_006450.2:p.R255*;DB=rs876657413;PHEN="Spastic_paraplegia_62";RANKSCORE=0.99']
hgmd_coding_region_variants_only.index = hgmd_coding_region_variants_only.index + 1  # shifting index
hgmd_coding_region_variants_only = hgmd_coding_region_variants_only.sort_index()
hgmd_coding_region_variants_only.columns = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO']
hgmd_coding_region_variants_only

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
0,chr10,100154922,CM140970,G,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
1,chr10,100183802,CM140971,C,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
2,chr10,100253438,CI1824020,A,AT,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
3,chr10,100256298,CD162836,TG,T,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
...,...,...,...,...,...,...,...,...
59426,chrX,9760731,CM981395,A,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59427,chrX,9760732,CI183806,G,GA,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59428,chrX,9760736,CD171619,GC,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59429,chrX,9760741,CI115195,A,AG,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
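
A possible alternative to re-inserting the first record by hand is to tell pandas that the file has no header and to supply the column names directly. The sketch below reads a small in-memory example (the records and the names `VCF_COLUMNS`, `headerless_vcf`, and `df` are hypothetical) rather than the real file. Note that it parses every POS value with the same dtype, whereas the manual approach inserts the first POS as a string; that difference would matter if the result were later compared row by row with a dataframe built the manual way.

```python
import io
import pandas as pd

VCF_COLUMNS = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO']

# Toy headerless VCF body (hypothetical records, tab-separated, no '#CHROM' line).
headerless_vcf = (
    "chr10\t100\tv1\tG\tA\t.\t.\tCLASS=DM;GENE=A\n"
    "chr10\t200\tv2\tC\tT\t.\t.\tCLASS=DM?;GENE=B\n"
)

# header=None stops pandas from using the first record as column names;
# names= supplies the standard VCF column labels instead.
df = pd.read_csv(io.StringIO(headerless_vcf), sep='\t', header=None,
                 names=VCF_COLUMNS, dtype={'CHROM': str, 'POS': int})
print(df)
```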


Next, we will cut out the uncertainties like we did above, leaving only the truly pathogenic variants:

In [63]:
finding_the_uncertainty_coding_region = hgmd_coding_region_variants_only[~hgmd_coding_region_variants_only['INFO'].str.contains('CLASS=DM?;') & 
                                                        ~hgmd_coding_region_variants_only['INFO'].str.contains('CLASS=DP') 
                                                        & ~hgmd_coding_region_variants_only['INFO'].str.contains('CLASS=DFP') 
                                                        & ~hgmd_coding_region_variants_only['INFO'].str.contains('CLASS=FP') 
                                                        & ~hgmd_coding_region_variants_only['INFO'].str.contains('CLASS=R')]
finding_the_uncertainty_coding_region

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
30,chr10,101580292,CD175017,TG,T,.,.,CLASS=DM?;MUT=ALT;GENE=POLL;STRAND=-;DNA=NM_01...
44,chr10,102109470,CM188932,G,A,.,.,CLASS=DM?;MUT=ALT;GENE=LDB1;STRAND=-;DNA=NM_00...
45,chr10,102157525,CM1416537,T,G,.,.,CLASS=DM?;MUT=ALT;GENE=NOLC1;STRAND=+;DNA=NM_0...
54,chr10,102375375,CM160744,G,A,.,.,CLASS=DM?;MUT=ALT;GENE=GBF1;STRAND=+;DNA=NM_00...
...,...,...,...,...,...,...,...,...
59376,chrX,86027503,CM137329,A,G,.,.,CLASS=DM?;MUT=ALT;GENE=CHM;STRAND=-;DNA=NM_000...
59388,chrX,8623669,CM133761,C,T,.,.,CLASS=DM?;MUT=ALT;GENE=ANOS1;STRAND=-;DNA=NM_0...
59390,chrX,8623670,CM183496,A,T,.,.,CLASS=DM?;MUT=ALT;GENE=ANOS1;STRAND=-;DNA=NM_0...
59393,chrX,87518266,CM1417120,C,G,.,.,CLASS=DM?;MUT=ALT;GENE=KLHL4;STRAND=+;DNA=NM_0...
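
It is worth spelling out why the negated filter returns the 'CLASS=DM?' rows. `str.contains` treats its pattern as a regular expression by default, so in 'CLASS=DM?;' the `?` makes the preceding `M` optional rather than matching a literal question mark. The pattern therefore matches 'CLASS=DM;', and negating it (together with the other classes) leaves only the 'DM?' rows. A small sketch on a toy Series (hypothetical values) shows the difference between the two modes:

```python
import pandas as pd

info = pd.Series(['CLASS=DM;GENE=A', 'CLASS=DM?;GENE=B'])

# Default regex=True: 'M?' means "an optional M", so the pattern matches
# 'CLASS=DM;' but not the literal text 'CLASS=DM?;'.
as_regex = info.str.contains('CLASS=DM?;')

# regex=False: the pattern is compared character for character.
as_literal = info.str.contains('CLASS=DM?;', regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```

Passing `regex=False` (or escaping the `?`) is the safer choice whenever the intent is to match the question mark itself.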


Again, we will cut these out of the dataframe using the concatenation function:

In [64]:
hgmd_disease_related_variants_mapped_to_nORFs_coding_region = pd.concat([hgmd_coding_region_variants_only, finding_the_uncertainty_coding_region, finding_the_uncertainty_coding_region]).drop_duplicates(keep=False)
hgmd_disease_related_variants_mapped_to_nORFs_coding_region

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
0,chr10,100154922,CM140970,G,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
1,chr10,100183802,CM140971,C,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
2,chr10,100253438,CI1824020,A,AT,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
3,chr10,100256298,CD162836,TG,T,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
...,...,...,...,...,...,...,...,...
59426,chrX,9760731,CM981395,A,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59427,chrX,9760732,CI183806,G,GA,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59428,chrX,9760736,CD171619,GC,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59429,chrX,9760741,CI115195,A,AG,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...


We will also remove the 'CLASS=R' variants from this dataframe, since they are not needed for our classifier:

In [65]:
only_r_coding_region = hgmd_disease_related_variants_mapped_to_nORFs_coding_region[hgmd_disease_related_variants_mapped_to_nORFs_coding_region['INFO'].str.contains('CLASS=R')]
only_r_coding_region

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
223,chr10,119677227,CM117921,C,T,.,.,CLASS=R;MUT=ALT;GENE=BAG3;STRAND=+;DNA=NM_0042...
402,chr10,13298236,CM001290,G,A,.,.,CLASS=R;MUT=ALT;GENE=PHYH;STRAND=-;DNA=NM_0062...
2966,chr1,119955217,CM1110715,A,T,.,.,CLASS=R;MUT=ALT;GENE=NOTCH2;STRAND=-;DNA=NM_02...
5900,chr11,6390705,CM154141,T,C,.,.,CLASS=R;MUT=ALT;GENE=SMPD1;STRAND=+;DNA=NM_000...
...,...,...,...,...,...,...,...,...
57205,chrX,18575402,CM081205,G,A,.,.,CLASS=R;MUT=ALT;GENE=CDKL5;STRAND=+;DNA=NM_003...
59121,chrX,74524444,CM1613116,G,A,.,.,CLASS=R;MUT=ALT;GENE=SLC16A2;STRAND=+;DNA=NM_0...
59382,chrX,8623612,CD146454,CA,C,.,.,CLASS=R;MUT=ALT;GENE=ANOS1;STRAND=-;DNA=NM_000...
59383,chrX,8623617,CM146452,G,T,.,.,CLASS=R;MUT=ALT;GENE=ANOS1;STRAND=-;DNA=NM_000...


Now we will cut these rows out of the dataframe. This is shown below:

In [66]:
hgmd_disease_related_variants_mapped_to_nORFs_no_r_coding_region = pd.concat([hgmd_disease_related_variants_mapped_to_nORFs_coding_region, only_r_coding_region, only_r_coding_region]).drop_duplicates(keep=False)
hgmd_disease_related_variants_mapped_to_nORFs_no_r_coding_region

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
0,chr10,100154922,CM140970,G,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
1,chr10,100183802,CM140971,C,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
2,chr10,100253438,CI1824020,A,AT,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
3,chr10,100256298,CD162836,TG,T,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
...,...,...,...,...,...,...,...,...
59426,chrX,9760731,CM981395,A,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59427,chrX,9760732,CI183806,G,GA,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59428,chrX,9760736,CD171619,GC,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59429,chrX,9760741,CI115195,A,AG,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...


Just like for the whole region, we do this again with 'CLASS=FP' for the coding region:

In [67]:
only_fp_coding_region = hgmd_coding_region_variants_only[hgmd_coding_region_variants_only['INFO'].str.contains('CLASS=FP') 
                                                        & ~hgmd_coding_region_variants_only['INFO'].str.contains('CLASS=DFP')]
hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_coding_region = pd.concat([hgmd_disease_related_variants_mapped_to_nORFs_no_r_coding_region, only_fp_coding_region, only_fp_coding_region]).drop_duplicates(keep=False)
hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_coding_region

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
0,chr10,100154922,CM140970,G,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
1,chr10,100183802,CM140971,C,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
2,chr10,100253438,CI1824020,A,AT,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
3,chr10,100256298,CD162836,TG,T,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
...,...,...,...,...,...,...,...,...
59426,chrX,9760731,CM981395,A,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59427,chrX,9760732,CI183806,G,GA,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59428,chrX,9760736,CD171619,GC,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59429,chrX,9760741,CI115195,A,AG,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...


We again do this for 'CLASS=DFP':

In [68]:
only_dfp_coding_region = hgmd_coding_region_variants_only[hgmd_coding_region_variants_only['INFO'].str.contains('CLASS=DFP')]
hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_no_dfp_coding_region = pd.concat([hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_coding_region, only_dfp_coding_region, only_dfp_coding_region]).drop_duplicates(keep=False)
hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_no_dfp_coding_region

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
0,chr10,100154922,CM140970,G,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
1,chr10,100183802,CM140971,C,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
2,chr10,100253438,CI1824020,A,AT,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
3,chr10,100256298,CD162836,TG,T,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
...,...,...,...,...,...,...,...,...
59426,chrX,9760731,CM981395,A,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59427,chrX,9760732,CI183806,G,GA,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59428,chrX,9760736,CD171619,GC,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59429,chrX,9760741,CI115195,A,AG,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...


We again do this for 'CLASS=DP':

In [69]:
only_dp_coding_region = hgmd_coding_region_variants_only[hgmd_coding_region_variants_only['INFO'].str.contains('CLASS=DP')]
hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_no_dfp_no_dp_coding_region = pd.concat([hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_no_dfp_coding_region, only_dp_coding_region, only_dp_coding_region]).drop_duplicates(keep=False)
hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_no_dfp_no_dp_coding_region

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
0,chr10,100154922,CM140970,G,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
1,chr10,100183802,CM140971,C,A,.,.,CLASS=DM;MUT=ALT;GENE=ERLIN1;STRAND=-;DNA=NM_0...
2,chr10,100253438,CI1824020,A,AT,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
3,chr10,100256298,CD162836,TG,T,.,.,CLASS=DM;MUT=ALT;GENE=CWF19L1;STRAND=-;DNA=NM_...
...,...,...,...,...,...,...,...,...
59426,chrX,9760731,CM981395,A,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59427,chrX,9760732,CI183806,G,GA,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59428,chrX,9760736,CD171619,GC,G,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...
59429,chrX,9760741,CI115195,A,AG,.,.,CLASS=DM;MUT=ALT;GENE=GPR143;STRAND=-;DNA=NM_0...


Again, it is possible to condense the code above for the coding region into one cell. I will not do this here, however.

Now we can use the above dataframes to create a separate dataframe showing only the non-coding region variants using, again, the concat function. This will give us the true number of HGMD variants found in the non-coding regions of nORFs:

In [70]:
hgmd_noncoding_region_variants_only = pd.concat([hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_no_dfp_no_dp, hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_no_dfp_no_dp_coding_region, hgmd_disease_related_variants_mapped_to_nORFs_no_r_no_fp_no_dfp_no_dp_coding_region]).drop_duplicates(keep=False)
hgmd_noncoding_region_variants_only

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
9,chr10,100988295,CM114899,C,T,.,.,CLASS=DM;MUT=ALT;GENE=TWNK;STRAND=+;DNA=NM_021...
10,chr10,100988415,CM164756,A,T,.,.,CLASS=DM;MUT=ALT;GENE=TWNK;STRAND=+;DNA=NM_021...
11,chr10,100988457,CM127719,C,T,.,.,CLASS=DM;MUT=ALT;GENE=TWNK;STRAND=+;DNA=NM_021...
12,chr10,100988526,CM1610318,A,G,.,.,CLASS=DM;MUT=ALT;GENE=TWNK;STRAND=+;DNA=NM_021...
...,...,...,...,...,...,...,...,...
64559,chrX,85964053,CS1810957,C,G,.,.,CLASS=DM;MUT=ALT;GENE=CHM;STRAND=-;DNA=NM_0003...
64560,chrX,85964054,CS1723659,T,C,.,.,CLASS=DM;MUT=ALT;GENE=CHM;STRAND=-;DNA=NM_0003...
64561,chrX,85965588,CS173873,T,C,.,.,CLASS=DM;MUT=ALT;GENE=CHM;STRAND=-;DNA=NM_0003...
64562,chrX,85968639,CS032064,A,T,.,.,CLASS=DM;MUT=ALT;GENE=CHM;STRAND=-;DNA=NM_0003...
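
The subtraction of one dataframe from another can also be written as an anti-join, which makes the intent explicit. The sketch below uses toy dataframes (the names `all_norf`, `coding`, `merged`, and `noncoding` and their contents are hypothetical) and pandas' `merge` with `indicator=True`. Unlike the concat approach, the merge result gets a fresh index, so the original row labels are not preserved.

```python
import pandas as pd

# Toy stand-ins (hypothetical): all nORF-mapped variants and the coding-region subset.
all_norf = pd.DataFrame({'CHROM': ['chr1', 'chr1', 'chrX'],
                         'POS': [100, 200, 300],
                         'ID': ['v1', 'v2', 'v3']})
coding = pd.DataFrame({'CHROM': ['chr1'], 'POS': [200], 'ID': ['v2']})

# A left merge with indicator=True adds a '_merge' column recording whether
# each row of all_norf also appears in coding.
merged = all_norf.merge(coding, on=['CHROM', 'POS', 'ID'],
                        how='left', indicator=True)

# Rows found only on the left side are the non-coding variants.
noncoding = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(noncoding)
```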
