# Index
- [Imports](#Imports)
- [CTEPG](#CTEPG-(Catching-The-Elusive-Predictable-Genes)-15-06-2020---22-06-2020)
- [Dyslipid database creation](#Creating-dyslipid-dataset)
- [New vanilla model validation](#AUC-validation-vanilla-new-model)
- [Cross validated models](#CV-Models)
- [XGB models out of CV models](#New-Models)
- [Changes in prediction new model](#AUC-validation-vanilla-new-model)
- [Correct new threshold](#Correct-probability-prediction)
- [AUC preparation](#Preparing-for-AUC-analysis)

# Imports

In [63]:
# Importing some shizzle.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from scipy import stats
import time
import requests
import datetime
import sys
import os
import pathlib
import seaborn as sns
from sklearn.metrics import roc_auc_score
import math
import json
import glob
import gzip
import pickle
import xgboost as xgb
from utilities import perform_stats, calc_z_scores, get_header
from sklearn.metrics import recall_score 

# Defining some import and export locations
location = 'rjsietsma'
read_loc = '/run/media/rjsietsma/evo2tb/linux/Datafiles/'
data_expor_loc = '/home/'+location+'/Documents/School/Master_DSLS/Final_Thesis/Past_initial_data/'
img_output_dir = '/home/'+location+'/PycharmProjects/dsls_master_thesis/side_scripts/output_img/'

# CTEPG (Catching The Elusive Predictable Genes) 15-06-2020 - 22-06-2020
<br/><br/>

What have I done:


Planned for this week:
- Make a model on the poorly performing genes and investigate further.
    - First, make models of [the original training data](https://zenodo.org/record/3564757#.XudiZ2ixUUE) and models of various gene panels (but keep the variants of those panels within the bigger model). See whether they improve, and by how much.
        - _Note: the training data needs to be converted to [Variant Call Format (VCF)](http://samtools.github.io/hts-specs/VCFv4.1.pdf) before use, which means it needs a `##fileformat` header line. Alternatively, the annotated file can be downloaded directly from Shuang's GCC tmp storage._
        - ___Note 2: Shuang's data has almost half as many variants as the newer VKGL and ClinVar datasets, with 3913 benign and 1116 pathogenic samples.___
    - Then make a single model of the original training data for CADD 1.6 annotations.
    - Then make multiple models of the original training data for CADD 1.6 annotations.
    - Investigate improvements.
    - Then choose between:
        - a newer / larger dataset, or
        - custom features that might be relevant to the cause (preferred)
- Make single plot of panel vs all others combined (except for 5GPM).
- Useful: [convert GRCh37 coordinates to GRCh38](https://genome.ucsc.edu/cgi-bin/hgLiftOver) with UCSC LiftOver.
- Start investigating what is required for the data sources to be used in the training phase.
- Create outline for article.
    - For static plots: label each point with the term name, followed by the number of genes as (n=x).
    - For the first draft: its due date is the last week of the resit exams, on the 6th of July.
- _Optionally_ :
    - Find possible way to annotate genes for co-factors, sub-family etc.
    - Interactive plot of all variants labeled as benign or pathogenic, with their CAPICE score (y-axis) and location (x-axis), to show the variant itself.
<br/><br/>
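
The VCF conversion mentioned in the note above could look roughly like this. It is a minimal sketch, assuming the training table exposes `#Chrom`, `Pos`, `Ref` and `Alt` columns (as `train.txt.gz` does further down); the helper name `to_minimal_vcf` is hypothetical and all optional VCF fields are left as `.`.

```python
import io

import pandas as pd


def to_minimal_vcf(variants: pd.DataFrame) -> str:
    """Render variants as minimal VCF 4.1 text (hypothetical helper)."""
    buffer = io.StringIO()
    # The VCF specification requires the fileformat line to come first.
    buffer.write("##fileformat=VCFv4.1\n")
    buffer.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
    columns = zip(variants["#Chrom"], variants["Pos"], variants["Ref"], variants["Alt"])
    for chrom, pos, ref, alt in columns:
        # ID, QUAL, FILTER and INFO are unknown here, so they are written as '.'.
        buffer.write(f"{chrom}\t{pos}\t.\t{ref}\t{alt}\t.\t.\t.\n")
    return buffer.getvalue()


# Two variants taken from the result tables in this notebook, for illustration.
example = pd.DataFrame({
    "#Chrom": ["14", "8"],
    "Pos": [68196054, 41554106],
    "Ref": ["GCCCTG", "C"],
    "Alt": ["G", "T"],
})
vcf_text = to_minimal_vcf(example)
```

Annotation tools may additionally expect sorted records or contig header lines; that is not covered here.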

# Creating dyslipid dataset

In [3]:
file_loc = os.path.join(read_loc, 'train.txt.gz')
header = get_header(file_loc, '#Chrom')
train = pd.read_csv(file_loc, compression='gzip', names=header, comment='#', sep='\t', low_memory=False)
train

Unnamed: 0,#Chrom,Allergy/Immunology/Infectious,Alt,AnnoType,Audiologic/Otolaryngologic,Biochemical,CCDS,CDSpos,Cardiovascular,ConsDetail,...,revel,sift,source,tOverlapMotifs,targetScan,to_be_deleted,verPhCons,verPhyloP,inTest,sample_weight
0,14,False,G,CodingTranscript,False,False,CCDS9787.1,806.0,False,frameshift,...,,,vkgl,,,False,1.000,5.843,False,1.0
1,20,False,T,CodingTranscript,True,False,CCDS13112.1,1899.0,True,"frameshift,stop_gained",...,,,vkgl,,,False,1.000,4.670,False,1.0
2,20,False,C,CodingTranscript,True,False,CCDS13112.1,2118.0,True,frameshift,...,,,vkgl,,,False,1.000,5.043,False,1.0
3,20,False,A,CodingTranscript,True,False,CCDS13112.1,1586.0,True,frameshift,...,,,vkgl,,,False,1.000,6.221,False,1.0
4,20,False,A,Intergenic,True,False,,,True,downstream,...,,,vkgl,,,False,1.000,6.368,False,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
334596,17,False,A,CodingTranscript,False,False,CCDS32642.1,1563.0,False,stop_gained,...,,,unknown,,,False,1.000,6.031,False,0.8
334597,17,False,T,CodingTranscript,False,False,CCDS32642.1,2029.0,False,stop_gained,...,,,unknown,,,False,1.000,4.100,False,0.8
334598,10,False,T,CodingTranscript,False,False,CCDS7431.1,1216.0,False,stop_gained,...,,,unknown,,,False,1.000,5.852,False,0.8
334599,2,False,T,CodingTranscript,False,False,CCDS2382.1,2998.0,False,stop_gained,...,,,unknown,,,False,0.031,2.213,False,0.8


In [4]:
with open('./umcg_genepanels.json', 'r') as json_file:
    genes = json.load(json_file)
dislipid_genes = genes['Hart- en vaatziekten']
genelist = []
for key, value in dislipid_genes.items():
    if key.lower().startswith('dyslipid'):
        for g in value:
            if g not in genelist:
                genelist.append(g)
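
The same de-duplication can be written more compactly with `dict.fromkeys`, which keeps first-seen order. This is only a sketch: the `panels` dictionary below is a toy stand-in for `genes['Hart- en vaatziekten']`, and its panel names and gene lists are illustrative, not the real panel contents.

```python
# Toy stand-in for the cardiovascular section of umcg_genepanels.json (illustrative only).
panels = {
    "Dyslipidemie A": ["LDLR", "APOB"],
    "dyslipidemie B": ["APOB", "PCSK9"],
    "Cardiomyopathie": ["MYH7"],
}

# dict.fromkeys drops duplicates while preserving the order of first appearance.
genelist = list(dict.fromkeys(
    gene
    for panel, panel_genes in panels.items()
    if panel.lower().startswith("dyslipid")
    for gene in panel_genes
))
```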

In [5]:
dislipid_subset = train.loc[train['GeneName'].isin(genelist)]
dislipid_subset

Unnamed: 0,#Chrom,Allergy/Immunology/Infectious,Alt,AnnoType,Audiologic/Otolaryngologic,Biochemical,CCDS,CDSpos,Cardiovascular,ConsDetail,...,revel,sift,source,tOverlapMotifs,targetScan,to_be_deleted,verPhCons,verPhyloP,inTest,sample_weight
142,16,False,T,CodingTranscript,False,False,CCDS10772.1,848.0,True,frameshift,...,,,vkgl,,,False,0.928,2.614,False,1.0
148,19,False,CCGGCGAGGTGCAGGCCATGCT,CodingTranscript,False,False,CCDS12647.1,409.0,True,protein_altering,...,,,vkgl,,,True,0.863,0.839,False,1.0
149,2,False,C,CodingTranscript,False,False,CCDS1703.1,13028.0,True,frameshift,...,,,vkgl,,,False,0.000,0.058,False,1.0
150,2,False,G,CodingTranscript,False,False,CCDS1703.1,28.0,True,frameshift,...,,,vkgl,,,False,0.021,-0.103,False,1.0
151,2,False,C,CodingTranscript,False,False,CCDS1703.1,2534.0,True,frameshift,...,,,vkgl,,,False,0.653,0.251,False,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
334284,12,False,T,CodingTranscript,False,False,CCDS8685.1,1093.0,False,stop_gained,...,,,unknown,,,False,0.999,3.072,False,0.8
334285,12,False,A,CodingTranscript,False,False,CCDS8685.1,1475.0,False,stop_gained,...,,,unknown,,,False,0.580,0.563,False,0.8
334286,12,False,T,CodingTranscript,False,False,CCDS8685.1,1537.0,False,stop_gained,...,,,unknown,,,False,0.003,0.643,False,0.8
334287,12,False,G,CodingTranscript,False,False,CCDS8685.1,1553.0,False,stop_gained,...,,,unknown,,,False,0.510,2.765,False,0.8


In [6]:
dislipid_subset['label'].value_counts()

Benign        3913
Pathogenic    1116
Name: label, dtype: int64

# CV Models

In [7]:
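# Load the fitted search object and keep its best estimator.
with open('./models/xgb_weightedSample_randomsearch_v2.pickle.dat', 'rb') as model_file:
    model = pickle.load(model_file)
xgbmodel = model.best_estimator_
optimal_params = xgbmodel.get_params()
# Drop 'missing' (NaN for this model) before writing the parameters to JSON.
optimal_params.pop('missing')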
with open('./optimal_capice_v2_params.json', 'w+') as json_file:
    json.dump(optimal_params, json_file, indent=4)
optimal_params

{'objective': 'binary:logistic',
 'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 1,
 'gamma': 0,
 'gpu_id': -1,
 'importance_type': 'gain',
 'interaction_constraints': '',
 'learning_rate': 0.12794736796290956,
 'max_delta_step': 0,
 'max_depth': 15,
 'min_child_weight': 1,
 'monotone_constraints': '()',
 'n_estimators': 420,
 'n_jobs': 8,
 'num_parallel_tree': 1,
 'random_state': 0,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'subsample': 1,
 'tree_method': 'exact',
 'validate_parameters': 1,
 'verbosity': 1}
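
The `missing` entry is popped before the parameters are dumped. One plausible reason (an assumption, the notebook does not state it) is that `missing` is NaN for these models, and Python's `json` module writes NaN as the bare token `NaN`, which strict JSON parsers reject. A small sketch with a made-up parameter subset:

```python
import json
import math

# Made-up subset of the parameters, with 'missing' set to NaN as in the model repr.
params = {"max_depth": 15, "n_estimators": 420, "missing": float("nan")}

# Default behaviour: NaN is written as a bare token that is not valid strict JSON.
lenient = json.dumps(params)

# allow_nan=False makes the problem explicit by raising ValueError.
try:
    json.dumps(params, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False

# Dropping NaN entries first gives portable JSON that round-trips cleanly.
clean = {
    key: value
    for key, value in params.items()
    if not (isinstance(value, float) and math.isnan(value))
}
round_tripped = json.loads(json.dumps(clean, allow_nan=False))
```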

In [8]:
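# Unpack the dict as keyword arguments; a positional dict is not interpreted as the parameter set.
model = xgb.XGBClassifier(**optimal_params)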

In [9]:
with open('./models/xgb_weightedSample_randomsearch_dislipid.pickle.dat', 'rb') as model_file:
    model_dislipid = pickle.load(model_file)
xgbmodel_dislipid = model_dislipid.best_estimator_
optimal_params_dislipid = xgbmodel_dislipid.get_params()
optimal_params_dislipid.pop('missing')
with open('./optimal_capice_dislipid_params.json', 'w+') as json_file:
    json.dump(optimal_params_dislipid, json_file, indent=4)
optimal_params_dislipid

{'objective': 'binary:logistic',
 'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 1,
 'gamma': 0,
 'gpu_id': -1,
 'importance_type': 'gain',
 'interaction_constraints': '',
 'learning_rate': 0.1968770013672272,
 'max_delta_step': 0,
 'max_depth': 16,
 'min_child_weight': 1,
 'monotone_constraints': '()',
 'n_estimators': 427,
 'n_jobs': 8,
 'num_parallel_tree': 1,
 'random_state': 0,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'subsample': 1,
 'tree_method': 'exact',
 'validate_parameters': 1,
 'verbosity': 1}

# New Models

In [10]:
with open('./xgbmodels/xgb_booster_v2.pickle.dat', 'rb') as model_file:
    model = pickle.load(model_file)
model

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.12794736796290956, max_delta_step=0, max_depth=15,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=420, n_jobs=8, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=1)

In [11]:
with open('./xgbmodels/xgb_booster_dyslipid.pickle.dat', 'rb') as model_file:
    model_dislipid = pickle.load(model_file)
model_dislipid

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.1968770013672272, max_delta_step=0, max_depth=16,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=427, n_jobs=8, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=1)

# AUC validation vanilla new model

In [12]:
train_original = pd.read_csv('/home/rjsietsma/Documents/School/Master_DSLS/Final_Thesis/Initial_Data_exploration/train_results.txt', sep='\t', low_memory=False)
train_original

Unnamed: 0,chr,pos,ref,alt,GeneName,Consequence,PHRED,probabilities,prediction,combined_prediction
0,17,41246652,ACATTC,GA,BRCA1,FRAME_SHIFT,26.600,9.999933e-01,Pathogenic,Pathogenic
1,19,11216246,TGCAAGGACAAATCTGAC,CCGACTG,LDLR,FRAME_SHIFT,35.000,9.999907e-01,Pathogenic,Pathogenic
2,19,11216251,GGACAAATCTGACGA,AACTGCGGTAAACTGCGGTAAACT,LDLR,FRAME_SHIFT,34.000,9.999896e-01,Pathogenic,Pathogenic
3,19,11216262,ACG,CA,LDLR,FRAME_SHIFT,34.000,9.999894e-01,Pathogenic,Pathogenic
4,2,47702328,GTTGA,TTTC,MSH2,FRAME_SHIFT,35.000,9.999891e-01,Pathogenic,Pathogenic
...,...,...,...,...,...,...,...,...,...,...
334596,14,64653189,T,C,MIR548AZ,SYNONYMOUS,14.370,2.765540e-07,Neutral,Neutral
334597,17,10419945,A,G,MYHAS,SYNONYMOUS,0.611,2.524258e-07,Neutral,Neutral
334598,17,10419849,T,G,MYHAS,SYNONYMOUS,9.827,2.164549e-07,Neutral,Neutral
334599,16,88804658,G,A,LOC100289580,SYNONYMOUS,14.100,1.984229e-07,Neutral,Neutral


In [13]:
train_new = pd.read_csv('/home/rjsietsma/PycharmProjects/dsls_master_thesis/side_scripts/test_output/train_results_v2.txt',
                       sep='\t', low_memory=False)
important_info = train_new['chr_pos_ref_alt'].str.split("_", expand=True)
train_new['chr'] = important_info[0]
train_new['pos'] = important_info[1].astype(np.int64)
train_new['ref'] = important_info[2]
train_new['alt'] = important_info[3]
train_new.drop(columns=['chr_pos_ref_alt'], inplace=True)
train_new

Unnamed: 0,GeneName,Consequence,PHRED,probabilities,prediction,combined_prediction,chr,pos,ref,alt
0,RDH12,FRAME_SHIFT,35.000,1,Pathogenic,Pathogenic,14,68196054,GCCCTG,G
1,ANK1,CANONICAL_SPLICE,32.000,1,Pathogenic,Pathogenic,8,41554106,C,T
2,HNF1A,STOP_GAINED,35.000,1,Pathogenic,Pathogenic,12,121431481,C,T
3,HNF1A,STOP_GAINED,39.000,1,Pathogenic,Pathogenic,12,121435345,C,T
4,HNF1A,INFRAME,19.860,1,Pathogenic,Pathogenic,12,121431471,CAAG,C
...,...,...,...,...,...,...,...,...,...,...
334596,SPG11,NON_SYNONYMOUS,15.230,0,Neutral,Neutral,15,44952677,C,T
334597,SPG11,NON_SYNONYMOUS,19.320,0,Neutral,Neutral,15,44952632,A,G
334598,SPG11,SYNONYMOUS,9.169,0,Neutral,Neutral,15,44943927,G,C
334599,SPG11,NON_SYNONYMOUS,13.460,0,Neutral,Neutral,15,44943897,C,A
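
The `chr_pos_ref_alt` split is repeated for several result files in this notebook, so it could be factored into one function. A minimal sketch, assuming the key always has exactly four underscore-separated parts (a contig name containing an underscore would break it); `split_variant_key` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def split_variant_key(frame: pd.DataFrame, key: str = "chr_pos_ref_alt") -> pd.DataFrame:
    """Return a copy with the key column split into chr, pos, ref and alt."""
    parts = frame[key].str.split("_", expand=True)
    # drop() returns a new frame, so the caller's DataFrame is left untouched.
    out = frame.drop(columns=[key])
    out["chr"] = parts[0]  # kept as string, since 'X' and 'Y' are not numeric
    out["pos"] = parts[1].astype(np.int64)
    out["ref"] = parts[2]
    out["alt"] = parts[3]
    return out


# Two keys rebuilt from rows shown above, for illustration.
demo = pd.DataFrame({
    "chr_pos_ref_alt": ["14_68196054_GCCCTG_G", "8_41554106_C_T"],
    "PHRED": [35.0, 32.0],
})
demo_split = split_variant_key(demo)
```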


In [14]:
train_original.dtypes

chr                     object
pos                      int64
ref                     object
alt                     object
GeneName                object
Consequence             object
PHRED                  float64
probabilities          float64
prediction              object
combined_prediction     object
dtype: object

In [15]:
train_new.dtypes

GeneName                object
Consequence             object
PHRED                  float64
probabilities            int64
prediction              object
combined_prediction     object
chr                     object
pos                      int64
ref                     object
alt                     object
dtype: object

In [16]:
merge = train_original[['chr', 'pos','ref','alt','prediction']].merge(train_new[['chr', 'pos','ref','alt','prediction']],
                                                                     on=['chr', 'pos','ref','alt'])
# Rows where the original and the new model disagree on the predicted class.
mismatches = merge[merge['prediction_x'] != merge['prediction_y']]
print(f"There is a "
      f"{mismatches.shape[0] / train_original.shape[0] * 100}% mismatch.")

There is a 5.093230444619114% mismatch.
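
The percentage above counts mismatches over the merged rows but divides by the number of rows in `train_original`. If the inner merge drops or duplicates variants (for example because `chr` has a different dtype on the two sides), those two row counts differ. A hedged sketch of how this could be checked with `indicator=True`, using small made-up frames rather than the real result files:

```python
import pandas as pd

# Made-up predictions from an "old" and a "new" model.
old = pd.DataFrame({
    "chr": ["1", "1", "2"],
    "pos": [10, 20, 30],
    "ref": ["A", "C", "G"],
    "alt": ["T", "G", "A"],
    "prediction": ["Pathogenic", "Neutral", "Neutral"],
})
new = pd.DataFrame({
    "chr": ["1", "1", "3"],
    "pos": [10, 20, 40],
    "ref": ["A", "C", "G"],
    "alt": ["T", "G", "A"],
    "prediction": ["Pathogenic", "Pathogenic", "Neutral"],
})

keys = ["chr", "pos", "ref", "alt"]
# An outer merge with indicator=True records where each row came from.
merged = old.merge(new, on=keys, how="outer", indicator=True)

# Only variants present in both files can be compared.
shared = merged[merged["_merge"] == "both"]
mismatch_pct = (shared["prediction_x"] != shared["prediction_y"]).mean() * 100

# Variants found in only one of the two files.
unmatched = int((merged["_merge"] != "both").sum())
```

When `unmatched` is zero and the keys are unique, dividing by either row count gives the same percentage.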


In [22]:
test_original = pd.read_csv('/home/rjsietsma/Documents/School/Master_DSLS/Final_Thesis/Initial_Data_exploration/test_results.txt', sep='\t', low_memory=False)
tellPathogenic_pred = lambda x: "Pathogenic" if x > 0.02 else 'Neutral'
test_original['prediction'] = [tellPathogenic_pred(probability) for probability in test_original['capice']]
test_original.rename(columns={'#Chrom': 'chr', 'Pos':'pos', 'Ref': 'ref', 'Alt':'alt'}, inplace=True)
test_original

Unnamed: 0,chr,pos,ref,alt,max_AF,Consequence,label,revel,clinpred,sift,provean,PHRED,fathmm_score,capice,ponp2,prediction
0,21,33974174,C,G,0.000058,STOP_LOST,LB/B,,,,,15.050,,0.003275,,Neutral
1,X,99661625,G,C,0.000037,SYNONYMOUS,LB/B,,,0.000,-0.0,0.806,,0.000075,,Neutral
2,17,29509638,C,T,0.000000,SYNONYMOUS,LB/B,,,0.647,-0.0,10.760,,0.001511,,Neutral
3,21,35742999,C,T,0.000133,SYNONYMOUS,LB/B,,,0.000,-0.0,18.640,,0.000621,,Neutral
4,1,2160973,G,A,0.000000,SYNONYMOUS,LB/B,,,0.000,-0.0,16.340,0.008252,0.000012,,Neutral
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10837,17,29556342,G,A,0.000000,SYNONYMOUS,LP/P,,,0.000,-0.0,11.450,,0.000139,,Neutral
10838,11,5248177,A,T,0.000344,SYNONYMOUS,LP/P,,,0.000,-0.0,17.970,,0.069934,,Pathogenic
10839,15,48787324,T,C,0.000000,SYNONYMOUS,LP/P,,,0.664,-0.0,12.170,0.019530,0.909190,,Pathogenic
10840,19,17947957,G,A,0.000015,SYNONYMOUS,LP/P,,,0.000,-0.0,10.990,,0.001238,,Neutral


In [19]:
test_new = pd.read_csv('/home/rjsietsma/PycharmProjects/dsls_master_thesis/side_scripts/test_output/test_results_v2.txt', sep='\t', low_memory=False)
important_info = test_new['chr_pos_ref_alt'].str.split("_", expand=True)
test_new['chr'] = important_info[0]
test_new['pos'] = important_info[1].astype(np.int64)
test_new['ref'] = important_info[2]
test_new['alt'] = important_info[3]
test_new.drop(columns=['chr_pos_ref_alt'], inplace=True)
test_new

Unnamed: 0,GeneName,Consequence,PHRED,probabilities,prediction,combined_prediction,chr,pos,ref,alt
0,MSH2,NON_SYNONYMOUS,24.30,1,Pathogenic,Pathogenic,2,47672768,T,A
1,PHIP,FRAME_SHIFT,34.00,1,Pathogenic,Pathogenic,6,79692624,CTTCT,C
2,BRCA1,FRAME_SHIFT,24.20,1,Pathogenic,Pathogenic,17,41244488,TGGAA,T
3,CHEK2,FRAME_SHIFT,26.40,1,Pathogenic,Pathogenic,22,29121049,CA,C
4,BRCA1,FRAME_SHIFT,32.00,1,Pathogenic,Pathogenic,17,41243468,GC,G
...,...,...,...,...,...,...,...,...,...,...
10837,CDCP2,FRAME_SHIFT,23.00,0,Neutral,Neutral,1,54605318,T,TC
10838,ASXL1,FRAME_SHIFT,15.46,0,Neutral,Neutral,20,31022934,GT,G
10839,KIAA0196,FRAME_SHIFT,35.00,0,Neutral,Pathogenic,8,126051130,AGT,A
10840,BRCA1,NON_SYNONYMOUS,28.10,0,Neutral,Neutral,17,41203088,A,C


In [23]:
merge = test_original[['chr', 'pos','ref','alt','prediction']].merge(test_new[['chr', 'pos','ref','alt','prediction']],
                                                                     on=['chr', 'pos','ref','alt'])
# Rows where the original and the new model disagree on the predicted class.
mismatches = merge[merge['prediction_x'] != merge['prediction_y']]
print(f"There is a "
      f"{mismatches.shape[0] / test_original.shape[0] * 100}% mismatch.")

There is a 27.23667220070098% mismatch.


# Correct probability prediction

### Currently marked as markdown to prevent executing again when re-launching the notebook

thresholds = np.arange(0,1,0.001)
data = pd.read_csv('/home/rjsietsma/PycharmProjects/dsls_master_thesis/side_scripts/test_output/train_results_v3.txt',
                  sep='\t', low_memory=False)
important_info = data['chr_pos_ref_alt'].str.split("_", expand=True)
data['chr'] = important_info[0]
data['pos'] = important_info[1].astype(np.int64)
data['ref'] = important_info[2]
data['alt'] = important_info[3]
data.drop(columns=['chr_pos_ref_alt'], inplace=True)
data

train_in = pd.read_csv('/run/media/rjsietsma/evo2tb/linux/Datafiles/train.txt.gz',
                      compression='gzip', sep='\t', low_memory=False)
train_in

data = data.merge(train_in[['#Chrom', 'Pos', 'Ref', 'Alt', 'label']],
                           left_on=['chr','pos','ref','alt'],
                           right_on=['#Chrom','Pos','Ref','Alt'])
data

drop_labels = ['#Chrom', 'Pos', 'Ref', 'Alt']
for x in data.columns:
    if x.endswith('_x') or x.endswith('_y'):
        drop_labels.append(x)
data.drop(columns=drop_labels, inplace=True)

def apply_func_thresholding(probability, threshold):
    return_value = 0
    if probability > threshold:
        return_value = 1
    return return_value

# Assign the result back; in-place replace on a selected column is unreliable in newer pandas.
data['label'] = data['label'].replace({'Pathogenic': 1, 'Benign': 0})

true_recall = 0
true_threshold = 0
for threshold in thresholds:
    data['pred_label'] = data['probabilities'].apply(lambda x: apply_func_thresholding(x, threshold))
    y_pred = np.array(data['pred_label'])
    y_true = np.array(data['label'])
    # recall_score expects (y_true, y_pred); the swapped order would compute precision instead.
    recall = recall_score(y_true, y_pred, zero_division=0)
    if 0.94 <= recall <= 0.96:
        true_recall = recall
        true_threshold = threshold
        break
print(f'Recall {true_recall} is found at threshold {true_threshold}.')
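
The threshold search can also be done without adding a column per threshold. The following is a sketch, not the method used for CAPICE itself: it computes recall directly as the fraction of true positives scoring above the threshold, and returns the first threshold whose recall is within `tolerance` of `target`. The function name, its defaults and the toy data are assumptions.

```python
import numpy as np


def threshold_for_recall(y_true, scores, target=0.95, tolerance=0.01, step=0.001):
    """Return (threshold, recall) for the first threshold near the target recall."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    # Recall only depends on the scores of the truly positive samples.
    positive_scores = scores[y_true]
    if positive_scores.size == 0:
        return None
    for threshold in np.arange(0, 1, step):
        recall = np.mean(positive_scores > threshold)
        if abs(recall - target) <= tolerance:
            return float(threshold), float(recall)
    return None


# Toy data: 20 pathogenic variants (one with a low score) and 5 benign ones.
labels = np.array([1] * 20 + [0] * 5)
scores = np.array([0.0305] + [0.9] * 19 + [0.01] * 5)
result = threshold_for_recall(labels, scores)
```

This needs real-valued probabilities; with scores already rounded to 0 and 1, every threshold between them gives the same recall.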

# Preparing for AUC analysis


In [48]:
train_out = train_new.copy()
train_out

Unnamed: 0,GeneName,Consequence,PHRED,probabilities,prediction,combined_prediction,chr,pos,ref,alt
0,RDH12,FRAME_SHIFT,35.000,1,Pathogenic,Pathogenic,14,68196054,GCCCTG,G
1,ANK1,CANONICAL_SPLICE,32.000,1,Pathogenic,Pathogenic,8,41554106,C,T
2,HNF1A,STOP_GAINED,35.000,1,Pathogenic,Pathogenic,12,121431481,C,T
3,HNF1A,STOP_GAINED,39.000,1,Pathogenic,Pathogenic,12,121435345,C,T
4,HNF1A,INFRAME,19.860,1,Pathogenic,Pathogenic,12,121431471,CAAG,C
...,...,...,...,...,...,...,...,...,...,...
334596,SPG11,NON_SYNONYMOUS,15.230,0,Neutral,Neutral,15,44952677,C,T
334597,SPG11,NON_SYNONYMOUS,19.320,0,Neutral,Neutral,15,44952632,A,G
334598,SPG11,SYNONYMOUS,9.169,0,Neutral,Neutral,15,44943927,G,C
334599,SPG11,NON_SYNONYMOUS,13.460,0,Neutral,Neutral,15,44943897,C,A


In [49]:
train_in = pd.read_csv('/run/media/rjsietsma/evo2tb/linux/Datafiles/train.txt.gz',
                      compression='gzip', sep='\t', low_memory=False)
train_in

Unnamed: 0,#Chrom,Allergy/Immunology/Infectious,Alt,AnnoType,Audiologic/Otolaryngologic,Biochemical,CCDS,CDSpos,Cardiovascular,ConsDetail,...,revel,sift,source,tOverlapMotifs,targetScan,to_be_deleted,verPhCons,verPhyloP,inTest,sample_weight
0,14,False,G,CodingTranscript,False,False,CCDS9787.1,806.0,False,frameshift,...,,,vkgl,,,False,1.000,5.843,False,1.0
1,20,False,T,CodingTranscript,True,False,CCDS13112.1,1899.0,True,"frameshift,stop_gained",...,,,vkgl,,,False,1.000,4.670,False,1.0
2,20,False,C,CodingTranscript,True,False,CCDS13112.1,2118.0,True,frameshift,...,,,vkgl,,,False,1.000,5.043,False,1.0
3,20,False,A,CodingTranscript,True,False,CCDS13112.1,1586.0,True,frameshift,...,,,vkgl,,,False,1.000,6.221,False,1.0
4,20,False,A,Intergenic,True,False,,,True,downstream,...,,,vkgl,,,False,1.000,6.368,False,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
334596,17,False,A,CodingTranscript,False,False,CCDS32642.1,1563.0,False,stop_gained,...,,,unknown,,,False,1.000,6.031,False,0.8
334597,17,False,T,CodingTranscript,False,False,CCDS32642.1,2029.0,False,stop_gained,...,,,unknown,,,False,1.000,4.100,False,0.8
334598,10,False,T,CodingTranscript,False,False,CCDS7431.1,1216.0,False,stop_gained,...,,,unknown,,,False,1.000,5.852,False,0.8
334599,2,False,T,CodingTranscript,False,False,CCDS2382.1,2998.0,False,stop_gained,...,,,unknown,,,False,0.031,2.213,False,0.8


In [50]:
train_out = train_out.merge(train_in[['#Chrom', 'Pos', 'Ref', 'Alt', 'label']],
                           left_on=['chr','pos','ref','alt'],
                           right_on=['#Chrom','Pos','Ref','Alt'])
train_out

Unnamed: 0,GeneName,Consequence,PHRED,probabilities,prediction,combined_prediction,chr,pos,ref,alt,#Chrom,Pos,Ref,Alt,label
0,RDH12,FRAME_SHIFT,35.000,1,Pathogenic,Pathogenic,14,68196054,GCCCTG,G,14,68196054,GCCCTG,G,Pathogenic
1,ANK1,CANONICAL_SPLICE,32.000,1,Pathogenic,Pathogenic,8,41554106,C,T,8,41554106,C,T,Pathogenic
2,HNF1A,STOP_GAINED,35.000,1,Pathogenic,Pathogenic,12,121431481,C,T,12,121431481,C,T,Pathogenic
3,HNF1A,STOP_GAINED,39.000,1,Pathogenic,Pathogenic,12,121435345,C,T,12,121435345,C,T,Pathogenic
4,HNF1A,INFRAME,19.860,1,Pathogenic,Pathogenic,12,121431471,CAAG,C,12,121431471,CAAG,C,Pathogenic
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
334596,SPG11,NON_SYNONYMOUS,15.230,0,Neutral,Neutral,15,44952677,C,T,15,44952677,C,T,Benign
334597,SPG11,NON_SYNONYMOUS,19.320,0,Neutral,Neutral,15,44952632,A,G,15,44952632,A,G,Benign
334598,SPG11,SYNONYMOUS,9.169,0,Neutral,Neutral,15,44943927,G,C,15,44943927,G,C,Benign
334599,SPG11,NON_SYNONYMOUS,13.460,0,Neutral,Neutral,15,44943897,C,A,15,44943897,C,A,Benign


In [51]:
drop_labels = ['#Chrom', 'Pos', 'Ref', 'Alt']
for x in train_out.columns:
    if x.endswith('_x') or x.endswith('_y'):
        drop_labels.append(x)
drop_labels

['#Chrom', 'Pos', 'Ref', 'Alt']

In [52]:
train_out.drop(columns=drop_labels, inplace=True)

In [53]:
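# Assign the results back; in-place replace on a selected column is unreliable in newer pandas.
train_out['prediction'] = train_out['prediction'].replace({'Pathogenic': 1, 'Neutral': 0})
train_out['label'] = train_out['label'].replace({'Pathogenic': 1, 'Benign': 0})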
train_out

Unnamed: 0,GeneName,Consequence,PHRED,probabilities,prediction,combined_prediction,chr,pos,ref,alt,label
0,RDH12,FRAME_SHIFT,35.000,1,1,Pathogenic,14,68196054,GCCCTG,G,1
1,ANK1,CANONICAL_SPLICE,32.000,1,1,Pathogenic,8,41554106,C,T,1
2,HNF1A,STOP_GAINED,35.000,1,1,Pathogenic,12,121431481,C,T,1
3,HNF1A,STOP_GAINED,39.000,1,1,Pathogenic,12,121435345,C,T,1
4,HNF1A,INFRAME,19.860,1,1,Pathogenic,12,121431471,CAAG,C,1
...,...,...,...,...,...,...,...,...,...,...,...
334596,SPG11,NON_SYNONYMOUS,15.230,0,0,Neutral,15,44952677,C,T,0
334597,SPG11,NON_SYNONYMOUS,19.320,0,0,Neutral,15,44952632,A,G,0
334598,SPG11,SYNONYMOUS,9.169,0,0,Neutral,15,44943927,G,C,0
334599,SPG11,NON_SYNONYMOUS,13.460,0,0,Neutral,15,44943897,C,A,0


In [54]:
train_out['probabilities'].value_counts()

0    294979
1     39622
Name: probabilities, dtype: int64
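
Once `label` is numeric, the AUC can be computed overall and per gene with `roc_auc_score`. A minimal sketch on made-up scores (the gene names are real, the labels and probabilities are not); genes with only one class are skipped because AUC is undefined for them. Note that the `probabilities` column above only holds 0 and 1, so an AUC on it would describe a single operating point rather than a ranking.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Made-up labels and continuous scores for two genes.
scored = pd.DataFrame({
    "GeneName": ["LDLR"] * 4 + ["APOB"] * 4,
    "label": [1, 1, 0, 0, 1, 0, 1, 0],
    "probabilities": [0.9, 0.8, 0.2, 0.1, 0.7, 0.6, 0.4, 0.3],
})

overall_auc = roc_auc_score(scored["label"], scored["probabilities"])

# AUC per gene, restricted to genes that have both classes present.
per_gene_auc = {
    gene: roc_auc_score(group["label"], group["probabilities"])
    for gene, group in scored.groupby("GeneName")
    if group["label"].nunique() == 2
}
```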