## TreeHarmonizer

TreeHarmonizer is a utility that is used to place called variants onto a pre-existing phylogenetic tree, allowing for visualization of variant trajectories and evolutionary progression. TreeHarmonizer was developed with single nucleotide (SNV), structural (SV), and copy number (CNA) variants in mind, allowing for placement of each variant type.

As of this verion 1.0, TreeHarmonizer works as a jupyter notebook, designed for the paper "Long-read sequencing of melanoma subclones reveals multifaceted and parallel tumor progression", by Liu & Goretsky, et al. 

It is being actively developed to serve as a standalone utility for multiple variant calling inputs and trees alike.

### Dependencies

TreeHarmonizer requires a Python 3.6 environment with the following packages -
* pandas
* io
* functools 
* os
* intervaltree (https://pypi.org/project/intervaltree/)
* ete3 (Can be installed with conda - https://etetoolkit.org/download/)
* jupyter

TreeHarmonizer is being updated to work with ete4, which will allow for Python versions > 3.6 when used with jupyter.

## Load Dependencies

In [None]:
import importlib
import pandas as pd
import TreeHarmonizer_utils as th_utils
importlib.reload(th_utils)

pd.set_option('display.width', 5000)
pd.set_option("display.expand_frame_repr", True)
pd.set_option("display.max_colwidth", 1000)
pd.set_option('display.max_columns', None)

## Import SNV and Severus Data

In [5]:
# Load SNV data into a merged DataFrame
dv_merged = th_utils.generate_merged_df(caller_name="dv_new", caller_path="/data/KolmogorovLab/agoretsky/Latest_Variant_Calls_01_13_25/dv_1.6.1_calls_fvb_exclusion_subset_final")

# Load Severus data into a merged DataFrame
severus = th_utils.generate_severus_df(severus_path="/data/KolmogorovLab/agoretsky/Latest_Variant_Calls_01_13_25/severus_pon_Mar11/somatic_SVs/severus_somatic_conf2_manual_reinclude.vcf", simple_name=True)

# Load tree input via newick string and parse it into various components
imported_tree, non_terminals, terminals, non_terminal_paths, terminal_paths, non_terminal_leaves, terminal_paths_o_keys, non_terminal_paths_without_N1 = th_utils.get_tree_data()

# Add informative columns to the merged DataFrames that were lost from original merging.
dv_merged['CHROM'] = dv_merged['KEY'].str.split(":").str[0]
dv_merged['POS'] = dv_merged['KEY'].str.split(":").str[1]
dv_merged['REF'] = dv_merged['KEY'].str.split(":").str[2]
dv_merged['ALT'] = dv_merged['KEY'].str.split(":").str[3]

severus['CHROM'] = severus['CHROM'].astype(str)
severus['POS'] = severus['POS'].astype(int)
dv_merged['CHROM'] = dv_merged['CHROM'].astype(str)
dv_merged['POS'] = dv_merged['POS'].astype(int)

# Rename INFO COLUMNS to avoid conflicts
dv_merged.rename(columns={'INFO': 'DV_INFO'}, inplace=True)
severus.rename(columns={'INFO': 'SEV_INFO'}, inplace=True)

Root node name:  N0


## Process Severus Variants

In [6]:
# Filter out second breakpoint pair
severus_no_break = severus[severus["ID"].str.contains("_2")==False]
severus_filtered = severus_no_break.copy(deep=True)

# Filter for only deletions
severus_filtered_del = severus_filtered[severus_filtered["SEV_INFO"].str.contains("SVTYPE=DEL")==True]

# Create critical columns, note that these are NOT SETS, JUST ARRAYS (MRCA is a string)
# Chain warning turned off, as we are not modifying the dataframe
pd.options.mode.chained_assignment = None

def severus_called_sublines_helper(row):
    # GT:VAF:hVAF:DR:DV
    # Prior filter method was:
    # return [col for col in severus_data.columns if col.startswith('C') and col != "CHROM" and row[col] != "./.:0:0,0,0:0:0"]
    # A VCF bug in the current Severus version (as of 03.11.2025) requires the following filter method to be used:
    # FILTER HAS BEEN CHANGED TO THE FOLLOWING
    output_subline_list = []
    for col in severus_filtered_del.columns:
        if col.startswith('C') and col != "CHROM":
            internal_sev_data = row[col].split(":")
            DV = int(internal_sev_data[4])
            if DV > 0:
                output_subline_list.append(col)
    return output_subline_list

severus_filtered_del['SEVERUS_SUBLINES'] = severus_filtered_del.apply(severus_called_sublines_helper, axis=1)
severus_filtered_del['SEVERUS_MRCA'] = severus_filtered_del.apply(lambda row: th_utils.common_ancestor_helper(row, "SEVERUS_SUBLINES", input_tree=imported_tree), axis=1)
severus_filtered_del['SEVERUS_MRCA_TERMINALS'] = severus_filtered_del.apply(lambda row: imported_tree.search_nodes(name=row['SEVERUS_MRCA'])[0].get_leaf_names(), axis=1)

# Return chained assignment warning to default
pd.options.mode.chained_assignment = 'warn'

severus_filtered_del.index=severus_filtered_del['ID']

## Import and process Wakhan CNA numbers

In [7]:
# Create interval tree of all severus deletion ranges
wakhan_cna_trees_per_chromosome = {}
wakhan_loh_trees_per_chromosome = {}
wakhan_cna_1_only_trees_per_chromosome = {}
wakhan_cna_0_only_trees_per_chromosome = {}

# Create blank interval trees per chromosome
for sub in th_utils.all_sublines_with_c_char:
    for chrom in th_utils.chromosomes_with_13:
        wakhan_cna_trees_per_chromosome.update({sub + "-" + chrom: th_utils.it.IntervalTree()})
        wakhan_loh_trees_per_chromosome.update({sub + "-" + chrom: th_utils.it.IntervalTree()})
        wakhan_cna_1_only_trees_per_chromosome.update({sub + "-" + chrom: th_utils.it.IntervalTree()})
        wakhan_cna_0_only_trees_per_chromosome.update({sub + "-" + chrom: th_utils.it.IntervalTree()})


for subline in th_utils.all_sublines:
    wk_copy_num = th_utils.read_bed_updated("/data/KolmogorovLab/agoretsky/Latest_Variant_Calls_01_13_25/cna_wakhan_04_16_25_unphased_severus_fvb_exclusion/C" + str(subline) + "/bed_output/C" + str(subline) + "_copynumbers_segments.bed")
    wk_loh = th_utils.read_bed_updated("/data/KolmogorovLab/agoretsky/Latest_Variant_Calls_01_13_25/cna_wakhan_04_16_25_unphased_severus_fvb_exclusion/C" + str(subline) + "/bed_output/C" + str(subline) + "_loh_segments.bed")

    wk_copy_num['chr'] = wk_copy_num['chr'].astype(str)
    wk_loh['chr'] = wk_loh['chr'].astype(str)

    # Filter wakhan down to only autosomes
    wk_copy_num = th_utils.keep_rows_by_values(wk_copy_num, 'chr', th_utils.chromosomes_with_13)
    wk_copy_num['copynumber_state'] = wk_copy_num['copynumber_state'].astype(int)
    wk_loh = th_utils.keep_rows_by_values(wk_loh, 'chr', th_utils.chromosomes_with_13)

    # For each copy num entry, add to its respective interval tree, with the interval being the copy num range, data being the copy num metadata
    for index, row in wk_copy_num.iterrows():
        wakhan_cna_trees_per_chromosome["C" + str(subline) + "-" + str(row['chr'])].addi(int(row['start']), int(row['end']) + 1, ("C" + str(subline), row['copynumber_state'], row['coverage'], row['confidence'], row['svs_breakpoints_ids']))
    
    for index, row in wk_loh.iterrows():
        wakhan_loh_trees_per_chromosome["C" + str(subline) + "-" + str(row['chr'])].addi(int(row['start']), int(row['end']) + 1, "C" + str(subline))

    for index, row in wk_copy_num.iterrows():
        if row['copynumber_state'] == 1:
            wakhan_cna_1_only_trees_per_chromosome["C" + str(subline) + "-" + str(row['chr'])].addi(int(row['start']), int(row['end']) + 1, ("C" + str(subline), row['copynumber_state'], row['coverage'], row['confidence'], row['svs_breakpoints_ids']))
    
    for index, row in wk_copy_num.iterrows():
        if row['copynumber_state'] == 0:
            wakhan_cna_0_only_trees_per_chromosome["C" + str(subline) + "-" + str(row['chr'])].addi(int(row['start']), int(row['end']) + 1, ("C" + str(subline), row['copynumber_state'], row['coverage'], row['confidence'], row['svs_breakpoints_ids']))

# Create a copy of the merged dataframe to modify
df_merged_copy_wakhan = dv_merged.copy(deep=True)

def in_wakhan_loh_helper(row):
    for subline in th_utils.all_sublines:
        if wakhan_loh_trees_per_chromosome["C" + str(subline) + "-" + row['CHROM']].overlaps_point(int(row['POS'])):
            return True
    return False
        
def loh_sublines_helper(row):
    output = []
    for subline in th_utils.all_sublines:
        if wakhan_loh_trees_per_chromosome["C" + str(subline) + "-" + row['CHROM']].overlaps_point(int(row['POS'])):
            output.append("C" + str(subline))
    return output

def in_copy_num_of_1_helper(row):
    for subline in th_utils.all_sublines:
        if wakhan_cna_1_only_trees_per_chromosome["C" + str(subline) + "-" + row['CHROM']].overlaps_point(int(row['POS'])):
            return True
    return False

def in_copy_num_of_0_helper(row):
    for subline in th_utils.all_sublines:
        if wakhan_cna_0_only_trees_per_chromosome["C" + str(subline) + "-" + row['CHROM']].overlaps_point(int(row['POS'])):
            return True
    return False

def copy_num_1_sublines_helper(row):
    output = []
    for subline in th_utils.all_sublines:
        if wakhan_cna_1_only_trees_per_chromosome["C" + str(subline) + "-" + row['CHROM']].overlaps_point(int(row['POS'])):
            output.append("C" + str(subline))
    return output

def copy_num_0_sublines_helper(row):
    output = []
    for subline in th_utils.all_sublines:
        if wakhan_cna_0_only_trees_per_chromosome["C" + str(subline) + "-" + row['CHROM']].overlaps_point(int(row['POS'])):
            output.append("C" + str(subline))
    return output

def get_cn_meta_helper(row, subline_col, interval_tree_dict, input_tree):
    metadata = {}
    for subline in row[subline_col]:
        chrom = row['CHROM']
        pos = int(row['POS'])
        intervals = interval_tree_dict[subline + "-" + chrom][pos]
        metadata[subline] = intervals.pop().data
    return metadata
    

df_merged_copy_wakhan['in_wakhan_loh'] = df_merged_copy_wakhan.apply(lambda row: in_wakhan_loh_helper(row), axis=1)
df_merged_copy_wakhan['loh_sublines'] = df_merged_copy_wakhan.apply(lambda row: loh_sublines_helper(row), axis=1)

# PER CN Loss Type Column Creation (For debugging and closer analysis)
df_merged_copy_wakhan['IN_CN_1'] = df_merged_copy_wakhan.apply(lambda row: in_copy_num_of_1_helper(row), axis=1)
df_merged_copy_wakhan['IN_CN_0'] = df_merged_copy_wakhan.apply(lambda row: in_copy_num_of_0_helper(row), axis=1)

df_merged_copy_wakhan['CN_1_SUBLINES'] = df_merged_copy_wakhan.apply(lambda row: copy_num_1_sublines_helper(row), axis=1)
df_merged_copy_wakhan['CN_0_SUBLINES'] = df_merged_copy_wakhan.apply(lambda row: copy_num_0_sublines_helper(row), axis=1)

df_merged_copy_wakhan['CN_1_MRCA'] = df_merged_copy_wakhan.apply(lambda row: th_utils.common_ancestor_helper(row, "CN_1_SUBLINES", input_tree=imported_tree) if row['IN_CN_1'] else float("nan"), axis=1)
df_merged_copy_wakhan['CN_0_MRCA'] = df_merged_copy_wakhan.apply(lambda row: th_utils.common_ancestor_helper(row, "CN_0_SUBLINES", input_tree=imported_tree) if row['IN_CN_0'] else float("nan"), axis=1)

df_merged_copy_wakhan['CN_1_MRCA_TERMINALS'] = df_merged_copy_wakhan.apply(lambda row: imported_tree.search_nodes(name=row['CN_1_MRCA'])[0].get_leaf_names() if row['IN_CN_1'] else float("nan"), axis=1)
df_merged_copy_wakhan['CN_0_MRCA_TERMINALS'] = df_merged_copy_wakhan.apply(lambda row: imported_tree.search_nodes(name=row['CN_0_MRCA'])[0].get_leaf_names() if row['IN_CN_0'] else float("nan"), axis=1)

# CN 1 and 0 Combined for final union regenotyping
df_merged_copy_wakhan['IN_CN_1_0'] = df_merged_copy_wakhan.apply(lambda row: row['IN_CN_1'] and row['IN_CN_0'], axis=1)
df_merged_copy_wakhan['CN_1_0_SUBLINES'] = df_merged_copy_wakhan.apply(lambda row: row['CN_1_SUBLINES'] + row['CN_0_SUBLINES'], axis=1)
df_merged_copy_wakhan['CN_1_0_MRCA'] = df_merged_copy_wakhan.apply(lambda row: th_utils.common_ancestor_helper(row, "CN_1_0_SUBLINES", input_tree=imported_tree) if row['IN_CN_1_0'] else float("nan"), axis=1)
df_merged_copy_wakhan['CN_1_0_MRCA_TERMINALS'] = df_merged_copy_wakhan.apply(lambda row: imported_tree.search_nodes(name=row['CN_1_0_MRCA'])[0].get_leaf_names() if row['IN_CN_1_0'] else float("nan"), axis=1)

In [8]:
cna_to_merge = df_merged_copy_wakhan[['CHROM', 'POS', 'IN_CN_1_0', 'CN_1_0_SUBLINES', 'CN_1_0_MRCA_TERMINALS', 'IN_CN_1', 'IN_CN_0', 'CN_0_SUBLINES', 'CN_1_SUBLINES', 'CN_0_MRCA_TERMINALS', 'CN_1_MRCA_TERMINALS']].copy(deep=True)

## Create MRCA records and metadata, severus MRCA metadata.

In [9]:
MARKED_dv = dv_merged.copy(deep=True)

# Add Wakhan CNA data to MARKED_dv
MARKED_dv = MARKED_dv.merge(cna_to_merge, on=['CHROM', 'POS'], how='left')

MARKED_dv['DV_SUBLINES'] = MARKED_dv.apply(lambda row: [col for col in MARKED_dv.columns if col.startswith('C') and col != "CHROM" and not col.startswith("CN") and pd.notna(row[col])], axis=1)
MARKED_dv['DV_MRCA'] = MARKED_dv.apply(lambda row: th_utils.common_ancestor_helper(row, "DV_SUBLINES", input_tree=imported_tree), axis=1)
MARKED_dv['DV_MRCA_TERMINALS'] = MARKED_dv.apply(lambda row: imported_tree.search_nodes(name=row['DV_MRCA'])[0].get_leaf_names(), axis=1)
MARKED_dv['DV_SUBLINE_COUNT'] = MARKED_dv.apply(lambda row: len(row['DV_SUBLINES']), axis=1)

# Create interval tree of all severus deletion ranges
severus_internal_trees_per_chromosome = {}

# Create blank interval trees per chromosome
for chrom in th_utils.chromosomes_with_13:
    severus_internal_trees_per_chromosome.update({chrom: th_utils.it.IntervalTree()})

# For each deletion, add to its respective interval tree, with the interval being the deletion range, data being the deletion metadata
# Severus deletions, as of Mar 11 version, are (, ], aka start exclusive, end inclusive.
# Therefore we add 1 to the start position to make it inclusive, and add 1 to the end position to make it inclusive (interval tree is inclusive on lower, exclusive on upper, this has been verified.)
for index, row in severus_filtered_del.iterrows():
    severus_internal_trees_per_chromosome[row['CHROM']].addi(int(row['POS'] + 1), int(row['SEV_INFO'].split(";")[3].split("=")[1]) + 1, (row['ID'], row['SEVERUS_MRCA'], row['SEVERUS_SUBLINES']))

# Create new column in df_merged which states if variant position is in a severus deletion
MARKED_dv['IN_SEVERUS_DELETION'] = MARKED_dv.apply(lambda row: len(severus_internal_trees_per_chromosome[row['CHROM']][int(row['POS'])]) > 0, axis=1)
MARKED_dv['SEVERUS_SUBLINES'] = MARKED_dv.apply(lambda row: severus_internal_trees_per_chromosome[row['CHROM']][int(row['POS'])].pop()[2][2] if row['IN_SEVERUS_DELETION'] else float("nan"), axis=1)
MARKED_dv['SEVERUS_MRCA'] = MARKED_dv.apply(lambda row: severus_internal_trees_per_chromosome[row['CHROM']][int(row['POS'])].pop()[2][1] if row['IN_SEVERUS_DELETION'] else float("nan"), axis=1)
MARKED_dv['SEVERUS_MRCA_TERMINALS'] = MARKED_dv.apply(lambda row: imported_tree.search_nodes(name=row['SEVERUS_MRCA'])[0].get_leaf_names() if row['IN_SEVERUS_DELETION'] else float("nan"), axis=1)
MARKED_dv['SEVERUS_ID'] = MARKED_dv.apply(lambda row: severus_internal_trees_per_chromosome[row['CHROM']][int(row['POS'])].pop()[2][0] if row['IN_SEVERUS_DELETION'] else float("nan"), axis=1)

## Apply parisomony assumption due to lack of phasing data in many regions

In [10]:
# If any of the severus deletion sublines called MATCH a subline an SNV called subline
# We make the assumption that deletion occurred on the other haplotype, and therefore not relevant to that SNV.
# If the following col is True, do not use said deletion for this SNV.

MARKED_dv['SEVERUS_OTHER_HAPLO_BOOL'] = MARKED_dv.apply(lambda row: len(set(row['DV_SUBLINES']).intersection(set(row['SEVERUS_SUBLINES']))) > 0 if row['IN_SEVERUS_DELETION'] else float("nan"), axis=1)
MARKED_dv['CN_OTHER_HAPLO_BOOL'] = MARKED_dv.apply(lambda row: len(set(row['DV_SUBLINES']).intersection(set(row['CN_1_0_SUBLINES']))) > 0 if row['IN_CN_1_0'] else float("nan"), axis=1)

## Create columns that reflect effects of regenotyping with only Severus deletions.

In [11]:
# Union of DV and Severus deletion sublines
MARKED_dv['REGENO_COMBINED_SUBLINES_SEVERUS_ONLY'] = MARKED_dv.apply(lambda row: list(set(row['SEVERUS_SUBLINES']).union(set(row['DV_SUBLINES']))) if (row['IN_SEVERUS_DELETION'] == True and row['SEVERUS_OTHER_HAPLO_BOOL'] == False) else float("nan"), axis=1)

# Common ancestor of DV and Severus deletion sublines
MARKED_dv['REGENO_MRCA_SEVERUS_ONLY'] = MARKED_dv.apply(lambda row: th_utils.common_ancestor_helper(row, 'REGENO_COMBINED_SUBLINES_SEVERUS_ONLY', input_tree=imported_tree) if (row['IN_SEVERUS_DELETION'] and row['SEVERUS_OTHER_HAPLO_BOOL'] == False) else float("nan"), axis=1)

# TRUE/FALSE if the common ancestor of DV and Severus deletion sublines differs from the DV common ancestor
MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] = MARKED_dv.apply(lambda row: row['REGENO_MRCA_SEVERUS_ONLY'] != row['DV_MRCA'] if (row['IN_SEVERUS_DELETION'] and (row['SEVERUS_OTHER_HAPLO_BOOL'] == False)) else float("nan"), axis=1)

# Terminal nodes for the CHANGED common ancestor when including sublines from Severus deletion
MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_TERMINALS'] = MARKED_dv.apply(lambda row: imported_tree.search_nodes(name=row['REGENO_MRCA_SEVERUS_ONLY'])[0].get_leaf_names() if row['IN_SEVERUS_DELETION'] and (row['SEVERUS_OTHER_HAPLO_BOOL'] == False)else float("nan"), axis=1)

# TRUE/FALSE if the union of sublines from DV and Severus deletion differs from the DV sublines (MAY NOT DIFFER, IF DOESN'T, VARIANT WAS STILL CALLED IN DEL POSITION, IMPLIES OTHER HAPLOTYPE FOR DEL)
MARKED_dv['REGENO_SUBLINES_DIFFER_SEVERUS_ONLY'] = MARKED_dv.apply(lambda row: set(row['REGENO_COMBINED_SUBLINES_SEVERUS_ONLY']) != set(row['DV_SUBLINES']) if row['IN_SEVERUS_DELETION'] and (row['SEVERUS_OTHER_HAPLO_BOOL'] == False) else float("nan"), axis=1)

## Create columns that reflect effects of regenotyping with only Wakhan losses (CN=0 or CN=1)

In [12]:
# Union of DV and CN 1 sublines
MARKED_dv['REGENO_COMBINED_SUBLINES_CN_1_0_ONLY'] = MARKED_dv.apply(lambda row: list(set(row['CN_1_0_SUBLINES']).union(set(row['DV_SUBLINES']))) if row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False else float("nan"), axis=1)

# Common ancestor of DV and CN 1 sublines
MARKED_dv['REGENO_MRCA_CN_1_0_ONLY'] = MARKED_dv.apply(lambda row: th_utils.common_ancestor_helper(row, 'REGENO_COMBINED_SUBLINES_CN_1_0_ONLY', input_tree=imported_tree) if row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False else float("nan"), axis=1)

# TRUE/FALSE if the common ancestor of DV and CN 1 sublines differs from the DV common ancestor
MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] = MARKED_dv.apply(lambda row: row['REGENO_MRCA_CN_1_0_ONLY'] != row['DV_MRCA'] if row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False else float("nan"), axis=1)

# Terminal nodes for the CHANGED common ancestor when including sublines from CN 1
MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_TERMINALS'] = MARKED_dv.apply(lambda row: imported_tree.search_nodes(name=row['REGENO_MRCA_CN_1_0_ONLY'])[0].get_leaf_names() if row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False else float("nan"), axis=1)

# TRUE/FALSE if the union of sublines from DV and CN 1 differs from the DV sublines (MAY NOT DIFFER, IF DOESN'T, VARIANT WAS STILL CALLED IN DEL POSITION, IMPLIES OTHER HAPLOTYPE FOR DEL)
MARKED_dv['REGENO_SUBLINES_DIFFER_CN_1_0_ONLY'] = MARKED_dv.apply(lambda row: set(row['REGENO_COMBINED_SUBLINES_CN_1_0_ONLY']) != set(row['DV_SUBLINES']) if row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False else float("nan"), axis=1)

## Create columns that reflect effects of regenoptying using both Severus and Wakhan losses.

* Additionally, create dramatic shift column relfecting if regenotyping would dramatically shift the MRCA up the tree, herein defined as a subline reinclusion rate of >= 100%.

In [None]:
def combined_sublines_union_helper(row):
    dv_set = set(row['DV_SUBLINES'])
    CN_set = set()
    SEV_set = set()
    if row['IN_CN_1_0']:
        CN_set = set(row['CN_1_0_SUBLINES'])
    if row['IN_SEVERUS_DELETION']:
        SEV_set = set(row['SEVERUS_SUBLINES'])
    if SEV_set or CN_set:
        return list(dv_set.union(CN_set).union(SEV_set))
    else:
        return float("nan")

# Union of DV and CN 1 sublines
MARKED_dv['REGENO_COMBINED_SUBLINES_UNION'] = MARKED_dv.apply(lambda row: combined_sublines_union_helper(row), axis=1)

# Common ancestor of DV and CN 1 sublines
MARKED_dv['REGENO_MRCA_UNION'] = MARKED_dv.apply(lambda row: th_utils.common_ancestor_helper(row, 'REGENO_COMBINED_SUBLINES_UNION', input_tree=imported_tree) if (row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False) or (row['IN_SEVERUS_DELETION'] and row['SEVERUS_OTHER_HAPLO_BOOL'] == False) else float("nan"), axis=1)

# TRUE/FALSE if the common ancestor of DV and CN 1 sublines differs from the DV common ancestor
MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] = MARKED_dv.apply(lambda row: row['REGENO_MRCA_UNION'] != row['DV_MRCA'] if (row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False) or (row['IN_SEVERUS_DELETION'] and row['SEVERUS_OTHER_HAPLO_BOOL'] == False) else float("nan"), axis=1)

# Terminal nodes for the CHANGED common ancestor when including sublines from CN 1
MARKED_dv['REGENO_MRCA_UNION_TERMINALS'] = MARKED_dv.apply(lambda row: imported_tree.search_nodes(name=row['REGENO_MRCA_UNION'])[0].get_leaf_names() if (row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False) or (row['IN_SEVERUS_DELETION'] and row['SEVERUS_OTHER_HAPLO_BOOL'] == False) else float("nan"), axis=1)

# TRUE/FALSE if the union of sublines from DV and CN 1 differs from the DV sublines (MAY NOT DIFFER, IF DOESN'T, VARIANT WAS STILL CALLED IN DEL POSITION, IMPLIES OTHER HAPLOTYPE FOR DEL)
MARKED_dv['REGENO_SUBLINES_DIFFER_UNION'] = MARKED_dv.apply(lambda row: set(row['REGENO_COMBINED_SUBLINES_UNION']) != set(row['DV_SUBLINES']) if (row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False) or (row['IN_SEVERUS_DELETION'] and row['SEVERUS_OTHER_HAPLO_BOOL'] == False) else float("nan"), axis=1)

def tree_shift_helper(row):
    if len(row['REGENO_COMBINED_SUBLINES_UNION']) > (len(row['DV_SUBLINES']) * 2):
        return True
    else:
        return False
        

MARKED_dv['DRAMATIC_SHIFT'] = MARKED_dv.apply(lambda row: tree_shift_helper(row) if (row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False) or (row['IN_SEVERUS_DELETION'] and row['SEVERUS_OTHER_HAPLO_BOOL'] == False) else float("nan"), axis=1)

## Load the minimum subline support per clade size thresholds.

In [None]:
minimum_subline_support_per_clade_size_requirement = {
    1: 1,
    2: 2,
    3: 2,
    4: 3,
    5: 4,
    7: 5,
    8: 6,
    12: 10,
    16: 13,
    23: 19
}

## If other clade sizes are necessary, formulaic version of threshold is below, commented out.

# clade_sizes = set()
# # Get a list of all the clade sizes in the tree
# for clade in imported_tree.traverse():
#     if clade.is_leaf():
#         continue
#     clade_sizes.add(len(clade.get_leaves()))
# 
# formulaic_subline_support_per_clade_size_requirement = {}
# 
# cur_tree_clade_sizes = clade_sizes
# cur_tree_fn_rate = 0.15
# for clade_size in cur_tree_clade_sizes:
#     if clade_size < 2:
#         formulaic_subline_support_per_clade_size_requirement[clade_size] = 1
#     elif clade_size == 2:
#         formulaic_subline_support_per_clade_size_requirement[clade_size] = 2
#     else:
#         # Calculate the support requirement based on the FN rate
#         support_requirement = int(clade_size * (1 - float(cur_tree_fn_rate)))
#         formulaic_subline_support_per_clade_size_requirement[clade_size] = support_requirement
#
# minimum_subline_support_per_clade_size_requirement = formulaic_subline_support_per_clade_size_requirement

## Support prior to regenotyping and post regenotyping calculation

In [17]:
# Original Pass or Fail calculation
MARKED_dv['DV_MINIMUM_SUPPORT_MET'] = MARKED_dv.apply(lambda row: len(row['DV_SUBLINES']) >= minimum_subline_support_per_clade_size_requirement[len(row['DV_MRCA_TERMINALS'])], axis=1)
MARKED_dv['SEVERUS_MINIMUM_SUPPORT_MET'] = MARKED_dv.apply(lambda row: len(row['SEVERUS_SUBLINES']) >= minimum_subline_support_per_clade_size_requirement[len(row['SEVERUS_MRCA_TERMINALS'])] if row['IN_SEVERUS_DELETION'] else float("nan"), axis=1)
MARKED_dv['CN_1_0_MINIMUM_SUPPORT_MET'] = MARKED_dv.apply(lambda row: len(row['CN_1_0_SUBLINES']) >= minimum_subline_support_per_clade_size_requirement[len(row['CN_1_0_MRCA_TERMINALS'])] if row['IN_CN_1_0'] else float("nan"), axis=1)

# Post REGENO Pass or Fail calculation
MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] = MARKED_dv.apply(lambda row: len(row['REGENO_COMBINED_SUBLINES_SEVERUS_ONLY']) >= minimum_subline_support_per_clade_size_requirement[len(row['REGENO_MRCA_SEVERUS_ONLY_TERMINALS'])] if row['IN_SEVERUS_DELETION'] and row['SEVERUS_OTHER_HAPLO_BOOL'] == False else float("nan"), axis=1)
MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] = MARKED_dv.apply(lambda row: len(row['REGENO_COMBINED_SUBLINES_CN_1_0_ONLY']) >= minimum_subline_support_per_clade_size_requirement[len(row['REGENO_MRCA_CN_1_0_ONLY_TERMINALS'])] if row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False else float("nan"), axis=1)
MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] = MARKED_dv.apply(lambda row: len(row['REGENO_COMBINED_SUBLINES_UNION']) >= minimum_subline_support_per_clade_size_requirement[len(row['REGENO_MRCA_UNION_TERMINALS'])] if (row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False) or (row['IN_SEVERUS_DELETION'] and row['SEVERUS_OTHER_HAPLO_BOOL'] == False) else float("nan"), axis=1)

# Do not allow dramatic clade shifting
MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] = MARKED_dv.apply(lambda row: len(row['REGENO_COMBINED_SUBLINES_UNION']) >= minimum_subline_support_per_clade_size_requirement[len(row['REGENO_MRCA_UNION_TERMINALS'])] if (row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False) or (row['IN_SEVERUS_DELETION'] and row['SEVERUS_OTHER_HAPLO_BOOL'] == False) and row['DRAMATIC_SHIFT'] == False else float("nan"), axis=1)

## Regenotyping Category Scoring, Severus Only

In [None]:
# Scoring System
# Pass and Fail imply passing the minimum subline (leaf) threshold for consideration as part of a clade.

# CAT 1: Pass -> pass (unchanged MRCA)
# CAT 2: Pass -> pass (changed MRCA)
# CAT 3: Fail -> pass (unchanged MRCA)
# CAT 4: Fail -> pass (changed MRCA)
#
# CAT 5: Fail -> fail (unchanged MRCA)
# CAT 6: Fail -> fail (changed MRCA)
# CAT 7: Pass -> fail (unchanged MRCA), impossible
# CAT 8: Pass -> fail (changed MRCA)

# To calculate each category:
# ORIGPASS/FAIL = MARKED_DV['DV_MINIMUM_SUPPORT_MET'] == True/False
# REGENO_PASS/FAIL = MARKED_DV['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == True/False
# MRCA_CHANGE = MARKED_DV['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == True/False

# CAT 1
cat1_SEVERUS_ONLY = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == False)]
print("CAT 1: Pass -> Pass (unchanged MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == False)]))

# CAT 2
cat2_SEVERUS_ONLY = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == True)]
print("CAT 2: Pass -> Pass (changed MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == True)]))

# CAT 3
cat3_SEVERUS_ONLY = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == False)]
print("CAT 3: Fail -> Pass (unchanged MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == False)]))

# CAT 4
cat4_SEVERUS_ONLY = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == True)]
print("CAT 4: Fail -> Pass (changed MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == True)]))

# CAT 5
cat5_SEVERUS_ONLY = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == False)]
print("CAT 5: Fail -> Fail (unchanged MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == False)]))

# CAT 6
cat6_SEVERUS_ONLY = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == True)]
print("CAT 6: Fail -> Fail (changed MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == True)]))

# CAT 7
cat7_SEVERUS_ONLY = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == False)]
print("CAT 7: Pass -> Fail (unchanged MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == False)]))

# CAT 8
cat8_SEVERUS_ONLY = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == True)]
print("CAT 8: Pass -> Fail (changed MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == True)]))

print("Total Vars in Severus Deletions: ", len(MARKED_dv[MARKED_dv['IN_SEVERUS_DELETION'] == True]))

CAT 1: Pass -> Pass (unchanged MRCA):  556
CAT 2: Pass -> Pass (changed MRCA):  1166
CAT 3: Fail -> Pass (unchanged MRCA):  47
CAT 4: Fail -> Pass (changed MRCA):  0
CAT 5: Fail -> Fail (unchanged MRCA):  381
CAT 6: Fail -> Fail (changed MRCA):  127
CAT 7: Pass -> Fail (unchanged MRCA):  0
CAT 8: Pass -> Fail (changed MRCA):  19368
Total Vars in Severus Deletions:  24306


## Regenotyping Category Scoring, CN 1 and CN 0 only

In [None]:
# Scoring System
# Pass and Fail imply passing the minimum subline (leaf) threshold for consideration as part of a clade.

# CAT 1: Pass -> pass (unchanged MRCA)
# CAT 2: Pass -> pass (changed MRCA)
# CAT 3: Fail -> pass (unchanged MRCA)
# CAT 4: Fail -> pass (changed MRCA)
#
# CAT 5: Fail -> fail (unchanged MRCA)
# CAT 6: Fail -> fail (changed MRCA)
# CAT 7: Pass -> fail (unchanged MRCA), impossible
# CAT 8: Pass -> fail (changed MRCA)

# To calculate each category:
# ORIGPASS/FAIL = MARKED_DV['DV_MINIMUM_SUPPORT_MET'] == True/False
# REGENO_PASS/FAIL = MARKED_DV['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == True/False
# MRCA_CHANGE = MARKED_DV['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == True/False

# CAT 1
cat1_CN_1_0_ONLY = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] == False)]
print("CAT 1: Pass -> Pass (unchanged MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] == False)]))

# CAT 2
cat2_CN_1_0_ONLY = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] == True)]
print("CAT 2: Pass -> Pass (changed MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] == True)]))

# CAT 3
cat3_CN_1_0_ONLY = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] == False)]
print("CAT 3: Fail -> Pass (unchanged MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] == False)]))

# CAT 4
cat4_CN_1_0_ONLY = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] == True)]
print("CAT 4: Fail -> Pass (changed MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] == True)]))

# CAT 5
cat5_CN_1_0_ONLY = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] == False)]
print("CAT 5: Fail -> Fail (unchanged MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] == False)]))

# CAT 6
cat6_CN_1_0_ONLY = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] == True)]
print("CAT 6: Fail -> Fail (changed MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] == True)]))

# CAT 7
cat7_CN_1_0_ONLY = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] == False)]
print("CAT 7: Pass -> Fail (unchanged MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] == False)]))

# CAT 8
cat8_CN_1_0_ONLY = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] == True)]
print("CAT 8: Pass -> Fail (changed MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] == True)]))

print("Total Vars in CN 1 or CN 0 Regions: ", len(MARKED_dv[MARKED_dv['IN_CN_1_0'] == True]))

CAT 1: Pass -> Pass (unchanged MRCA):  0
CAT 2: Pass -> Pass (changed MRCA):  0
CAT 3: Fail -> Pass (unchanged MRCA):  0
CAT 4: Fail -> Pass (changed MRCA):  0
CAT 5: Fail -> Fail (unchanged MRCA):  67
CAT 6: Fail -> Fail (changed MRCA):  26
CAT 7: Pass -> Fail (unchanged MRCA):  0
CAT 8: Pass -> Fail (changed MRCA):  94
Total Vars in CN 1 or CN 0 Regions:  278


## Regenotyping Category scoring - Severus Dels and Wakhan CNA

In [None]:
# Scoring System
# Pass and Fail imply passing the minimum subline (leaf) threshold for consideration as part of a clade.

# CAT 1: Pass -> pass (unchanged MRCA)
# CAT 2: Pass -> pass (changed MRCA)
# CAT 3: Fail -> pass (unchanged MRCA)
# CAT 4: Fail -> pass (changed MRCA)
#
# CAT 5: Fail -> fail (unchanged MRCA)
# CAT 6: Fail -> fail (changed MRCA)
# CAT 7: Pass -> fail (unchanged MRCA), impossible
# CAT 8: Pass -> fail (changed MRCA)

# To calculate each category:
# ORIGPASS/FAIL = MARKED_DV['DV_MINIMUM_SUPPORT_MET'] == True/False
# REGENO_PASS/FAIL = MARKED_DV['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == True/False
# MRCA_CHANGE = MARKED_DV['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == True/False

# CAT 1
cat1_UNION = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == False)]
print("CAT 1: Pass -> Pass (unchanged MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == False)]))

# CAT 2
cat2_UNION = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == True)]
print("CAT 2: Pass -> Pass (changed MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == True)]))

# CAT 3
cat3_UNION = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == False)]
print("CAT 3: Fail -> Pass (unchanged MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == False)]))

# CAT 4
cat4_UNION = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == True)]
print("CAT 4: Fail -> Pass (changed MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == True)]))

# CAT 5
cat5_UNION = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == False)]
print("CAT 5: Fail -> Fail (unchanged MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == False)]))

# CAT 6
cat6_UNION = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == True)]
print("CAT 6: Fail -> Fail (changed MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == True)]))

# CAT 7
cat7_UNION = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == False)]
print("CAT 7: Pass -> Fail (unchanged MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == False)]))

# CAT 8
cat8_UNION = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == True)]
print("CAT 8: Pass -> Fail (changed MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == True)]))

print()

print("Total Vars in CN 1, CN 0, or Sev DEL Regions: ", len(MARKED_dv[MARKED_dv['IN_CN_1_0'] == True]) + len(MARKED_dv[MARKED_dv['IN_SEVERUS_DELETION'] == True]))

CAT 1: Pass -> Pass (unchanged MRCA):  556
CAT 2: Pass -> Pass (changed MRCA):  1164
CAT 3: Fail -> Pass (unchanged MRCA):  47
CAT 4: Fail -> Pass (changed MRCA):  0
CAT 5: Fail -> Fail (unchanged MRCA):  448
CAT 6: Fail -> Fail (changed MRCA):  153
CAT 7: Pass -> Fail (unchanged MRCA):  0
CAT 8: Pass -> Fail (changed MRCA):  19430

Total Vars in CN 1, CN 0, or Sev DEL Regions:  24584


In [22]:
# Debugging output for each category
for x in range(1, 9):
    if x == 7:
        continue
    print("Cat" + str(x) + " PRE")
    cat = eval("cat" + str(x) + "_UNION")
    if cat.empty:
        print("Cat " + str(x) + " is empty\n")
        continue
    cat['DV_SUBLINE_COUNT'] = cat.apply(lambda row: len(row['DV_SUBLINES']), axis=1)
    print(cat.value_counts('DV_SUBLINE_COUNT'))
    print("Cat" + str(x) + " POST REGENO")
    cat['REGENO_SUBLINE_COUNT'] = cat.apply(lambda row: len(row['REGENO_COMBINED_SUBLINES_UNION']), axis=1)
    print(cat.value_counts('REGENO_SUBLINE_COUNT'))

Cat1 PRE
DV_SUBLINE_COUNT
22    206
21    166
20     88
19     83
4       4
6       3
3       3
5       2
11      1
dtype: int64
Cat1 POST REGENO
REGENO_SUBLINE_COUNT
23    207
22    165
21     88
20     83
5       4
7       3
4       3
6       2
12      1
dtype: int64
Cat2 PRE
DV_SUBLINE_COUNT
1    811
2    344
7      3
4      3
5      2
6      1
dtype: int64
Cat2 POST REGENO
REGENO_SUBLINE_COUNT
2    797
3    350
5      6
6      5
8      3
7      2
4      1
dtype: int64
Cat3 PRE
DV_SUBLINE_COUNT
18    41
4      2
3      2
2      2
dtype: int64
Cat3 POST REGENO
REGENO_SUBLINE_COUNT
19    39
5      2
4      2
3      2
23     1
22     1
dtype: int64
Cat4 PRE
Cat 4 is empty

Cat5 PRE
DV_SUBLINE_COUNT
2     99
3     75
4     54
17    36
16    26
6     25
15    23
5     22
14    18
7     16
12    15
13    10
10     9
11     8
8      7
9      5
dtype: int64
Cat5 POST REGENO
REGENO_SUBLINE_COUNT
3     96
5     58
4     58
18    36
7     30
17    27
16    23
6     22
8     20
15    18
13    1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # Remove the CWD from sys.path while we load stuff.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  del sys.path[0]


REGENO_SUBLINE_COUNT
2     17726
3       903
4       425
5       167
6        82
8        47
9        38
7        18
13        9
17        8
12        3
16        2
15        1
14        1
dtype: int64


## Regenotyping Category scoring - Severus Dels and Wakhan CNA, with no dramatic changes.

In [25]:
# Scoring System
# Pass and Fail imply passing the minimum subline (leaf) threshold for consideration as part of a clade.

# CAT 1: Pass -> pass (unchanged MRCA)
# CAT 2: Pass -> pass (changed MRCA)
# CAT 3: Fail -> pass (unchanged MRCA)
# CAT 4: Fail -> pass (changed MRCA)
#
# CAT 5: Fail -> fail (unchanged MRCA)
# CAT 6: Fail -> fail (changed MRCA)
# CAT 7: Pass -> fail (unchanged MRCA), impossible
# CAT 8: Pass -> fail (changed MRCA)

# To calculate each category:
# ORIGPASS/FAIL = MARKED_DV['DV_MINIMUM_SUPPORT_MET'] == True/False
# REGENO_PASS/FAIL = MARKED_DV['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == True/False
# MRCA_CHANGE = MARKED_DV['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == True/False

# CAT 1
cat1_UNION_NO_DRAMATIC = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] == True) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == False)]
cat1_UNION_NO_DRAMATIC['ADDED_REGENO_SUBLINES'] = cat1_UNION_NO_DRAMATIC.apply(lambda row: set(row['REGENO_COMBINED_SUBLINES_UNION']) - set(row['DV_SUBLINES']), axis=1)
print("CAT 1: Pass -> Pass (unchanged MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] == True) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == False)]))

# CAT 2
cat2_UNION_NO_DRAMATIC = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] == True) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == True)]
cat2_UNION_NO_DRAMATIC['ADDED_REGENO_SUBLINES'] = cat2_UNION_NO_DRAMATIC.apply(lambda row: set(row['REGENO_COMBINED_SUBLINES_UNION']) - set(row['DV_SUBLINES']), axis=1)
print("CAT 2: Pass -> Pass (changed MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] == True) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == True)]))

# CAT 3
cat3_UNION_NO_DRAMATIC = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] == True) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == False)]
cat3_UNION_NO_DRAMATIC['ADDED_REGENO_SUBLINES'] = cat3_UNION_NO_DRAMATIC.apply(lambda row: set(row['REGENO_COMBINED_SUBLINES_UNION']) - set(row['DV_SUBLINES']), axis=1)
print("CAT 3: Fail -> Pass (unchanged MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] == True) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == False)]))

# CAT 4
cat4_UNION_NO_DRAMATIC = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] == True) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == True)]
if not cat4_UNION_NO_DRAMATIC.empty:
    cat4_UNION_NO_DRAMATIC['ADDED_REGENO_SUBLINES'] = cat4_UNION_NO_DRAMATIC.apply(lambda row: set(row['REGENO_COMBINED_SUBLINES_UNION']) - set(row['DV_SUBLINES']), axis=1)
print("CAT 4: Fail -> Pass (changed MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] == True) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == True)]))

# CAT 5
cat5_UNION_NO_DRAMATIC = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] == False) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == False)]
cat5_UNION_NO_DRAMATIC['ADDED_REGENO_SUBLINES'] = cat5_UNION_NO_DRAMATIC.apply(lambda row: set(row['REGENO_COMBINED_SUBLINES_UNION']) - set(row['DV_SUBLINES']), axis=1)
print("CAT 5: Fail -> Fail (unchanged MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] == False) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == False)]))

# CAT 6
cat6_UNION_NO_DRAMATIC = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] == False) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == True)]
cat6_UNION_NO_DRAMATIC['ADDED_REGENO_SUBLINES'] = cat6_UNION_NO_DRAMATIC.apply(lambda row: set(row['REGENO_COMBINED_SUBLINES_UNION']) - set(row['DV_SUBLINES']), axis=1)
print("CAT 6: Fail -> Fail (changed MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == False) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] == False) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == True)]))

# CAT 7
cat7_UNION_NO_DRAMATIC = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] == False) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == False)]
if not cat7_UNION_NO_DRAMATIC.empty:
    cat7_UNION_NO_DRAMATIC['ADDED_REGENO_SUBLINES'] = cat7_UNION_NO_DRAMATIC.apply(lambda row: set(row['REGENO_COMBINED_SUBLINES_UNION']) - set(row['DV_SUBLINES']), axis=1)
print("CAT 7: Pass -> Fail (unchanged MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] == False) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == False)]))

# CAT 8
cat8_UNION_NO_DRAMATIC = MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] == False) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == True)]
cat8_UNION_NO_DRAMATIC['ADDED_REGENO_SUBLINES'] = cat8_UNION_NO_DRAMATIC.apply(lambda row: set(row['REGENO_COMBINED_SUBLINES_UNION']) - set(row['DV_SUBLINES']), axis=1)
print("CAT 8: Pass -> Fail (changed MRCA): ", len(MARKED_dv[(MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True) & (MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] == False) & (MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == True)]))

print("Total Vars in CN 1, CN 0, or Sev DEL Regions: ", len(MARKED_dv[MARKED_dv['IN_CN_1_0'] == True]) + len(MARKED_dv[MARKED_dv['IN_SEVERUS_DELETION'] == True]))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user

CAT 1: Pass -> Pass (unchanged MRCA):  556
CAT 2: Pass -> Pass (changed MRCA):  1150
CAT 3: Fail -> Pass (unchanged MRCA):  47
CAT 4: Fail -> Pass (changed MRCA):  0
CAT 5: Fail -> Fail (unchanged MRCA):  448
CAT 6: Fail -> Fail (changed MRCA):  153
CAT 7: Pass -> Fail (unchanged MRCA):  0
CAT 8: Pass -> Fail (changed MRCA):  19352
Total Vars in CN 1, CN 0, or Sev DEL Regions:  24584


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [26]:
for x in range(1, 9):
    if x == 7:
        continue
    print("Cat" + str(x) + " PRE")
    cat = eval("cat" + str(x) + "_UNION_NO_DRAMATIC")
    if cat.empty:
        print("Cat " + str(x) + " is empty\n")
        continue
    cat['DV_SUBLINE_COUNT'] = cat.apply(lambda row: len(row['DV_SUBLINES']), axis=1)
    print(cat.value_counts('DV_SUBLINE_COUNT'))
    print("Cat" + str(x) + " POST REGENO")
    cat['REGENO_SUBLINE_COUNT'] = cat.apply(lambda row: len(row['REGENO_COMBINED_SUBLINES_UNION']), axis=1)
    print(cat.value_counts('REGENO_SUBLINE_COUNT'))

Cat1 PRE
DV_SUBLINE_COUNT
22    206
21    166
20     88
19     83
4       4
6       3
3       3
5       2
11      1
dtype: int64
Cat1 POST REGENO
REGENO_SUBLINE_COUNT
23    207
22    165
21     88
20     83
5       4
7       3
4       3
6       2
12      1
dtype: int64
Cat2 PRE
DV_SUBLINE_COUNT
1    797
2    344
7      3
4      3
5      2
6      1
dtype: int64
Cat2 POST REGENO
REGENO_SUBLINE_COUNT
2    797
3    343
8      3
5      3
7      2
6      1
4      1
dtype: int64
Cat3 PRE
DV_SUBLINE_COUNT
18    41
4      2
3      2
2      2
dtype: int64
Cat3 POST REGENO
REGENO_SUBLINE_COUNT
19    39
5      2
4      2
3      2
23     1
22     1
dtype: int64
Cat4 PRE
Cat 4 is empty

Cat5 PRE
DV_SUBLINE_COUNT
2     99
3     75
4     54
17    36
16    26
6     25
15    23
5     22
14    18
7     16
12    15
13    10
10     9
11     8
8      7
9      5
dtype: int64
Cat5 POST REGENO
REGENO_SUBLINE_COUNT
3     96
5     58
4     58
18    36
7     30
17    27
16    23
6     22
8     20
15    18
13    1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':


REGENO_SUBLINE_COUNT
2     17726
3       852
4       425
5       155
6        67
8        47
9        38
7        18
13        9
17        8
12        3
16        2
15        1
14        1
dtype: int64


## Reconcile passing categories back to VCF, UNION NO Dramatic change

In [27]:
# Category 1 is kept as is, no change to MRCA, and passes anyway
FINAL_VAR_DF = cat1_UNION_NO_DRAMATIC.copy(deep=True)

# Category 2, Need to look through the details. 
FINAL_VAR_DF = pd.concat([FINAL_VAR_DF, cat2_UNION_NO_DRAMATIC], ignore_index=True)

# Category 3, Regenotyped with an unchanged MRCA, include all.
FINAL_VAR_DF = pd.concat([FINAL_VAR_DF, cat3_UNION_NO_DRAMATIC], ignore_index=True)

# Category 4, Regenotyped with a changed MRCA from originally failing.
FINAL_VAR_DF = pd.concat([FINAL_VAR_DF, cat4_UNION_NO_DRAMATIC], ignore_index=True)

In [28]:
# Blanket Default VCF Header
default_header = """##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=RefCall,Description="Genotyping model thinks this site is reference.">
##FILTER=<ID=LowQual,Description="Confidence in this variant being real is below calling threshold.">
##FILTER=<ID=NoCall,Description="Site has depth=0 resulting in no call.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position (for use with symbolic alleles)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block.">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Read depth for each allele">
##FORMAT=<ID=VAF,Number=A,Type=Float,Description="Variant allele fractions.">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled genotype likelihoods rounded to the closest integer">
##FORMAT=<ID=MED_DP,Number=1,Type=Integer,Description="Median DP observed within the GVCF block rounded to the nearest integer.">
##DeepVariant_version=1.6.0
##contig=<ID=1,length=195471971>
##contig=<ID=10,length=130694993>
##contig=<ID=11,length=122082543>
##contig=<ID=12,length=120129022>
##contig=<ID=13,length=120421639>
##contig=<ID=14,length=124902244>
##contig=<ID=15,length=104043685>
##contig=<ID=16,length=98207768>
##contig=<ID=17,length=94987271>
##contig=<ID=18,length=90702639>
##contig=<ID=19,length=61431566>
##contig=<ID=2,length=182113224>
##contig=<ID=3,length=160039680>
##contig=<ID=4,length=156508116>
##contig=<ID=5,length=151834684>
##contig=<ID=6,length=149736546>
##contig=<ID=7,length=145441459>
##contig=<ID=8,length=129401213>
##contig=<ID=9,length=124595110>\n"""

## Add REGENO entry to variants that are regenotyped rather than called originally.

In [29]:
internal_nodes = [f'N{i}' for i in range(1, 23)]
private_nodes = [f'O{i}' for i in range(1, 25) if i != 2]

for index, row in FINAL_VAR_DF.iterrows():
    if 'ADDED_REGENO_SUBLINES' in row and isinstance(row['ADDED_REGENO_SUBLINES'], set):
        for subline in row['ADDED_REGENO_SUBLINES']:
            if subline in FINAL_VAR_DF.columns:
                FINAL_VAR_DF.at[index, subline] = "REGENO"

columns_to_keep = ['KEY', 'CHROM', 'POS', 'REF', 'ALT', 'REGENO_MRCA_UNION'] + [f'C{i}' for i in range(1, 25) if i != 2]
FINAL_VAR_DF = FINAL_VAR_DF[columns_to_keep]

In [30]:
# Add back all of the non-regenotyped variants.
dv_merged_copy = dv_merged.copy(deep=True)
dv_merged_copy = dv_merged_copy[~dv_merged_copy['KEY'].isin(FINAL_VAR_DF['KEY'])]

print(len(dv_merged_copy))
print(len(dv_merged))
print(len(FINAL_VAR_DF))

FINAL_VAR_DF.index = FINAL_VAR_DF['KEY']
FINAL_VAR_DF = FINAL_VAR_DF.drop(columns=['KEY'])
dv_merged_copy.index = dv_merged_copy['KEY']
dv_merged_copy = dv_merged_copy.drop(columns=['KEY'])

dv_merged_copy = dv_merged_copy.append(FINAL_VAR_DF)

448157
449910
1753


#### Meteadata for FINAL_MRCA column which will show to which internal branch / node this variant is placed at.

In [31]:
dv_merged_copy['DV_SUBLINES'] = dv_merged_copy.apply(lambda row: [col for col in dv_merged_copy.columns if col.startswith('C') and col != "CHROM" and not col.startswith("CN") and pd.notna(row[col])], axis=1)
dv_merged_copy['DV_MRCA'] = dv_merged_copy.apply(lambda row: th_utils.common_ancestor_helper(row, "DV_SUBLINES", input_tree=imported_tree), axis=1)
dv_merged_copy['DV_MRCA_TERMINALS'] = dv_merged_copy.apply(lambda row: imported_tree.search_nodes(name=row['DV_MRCA'])[0].get_leaf_names(), axis=1)
dv_merged_copy['DV_MINIMUM_SUPPORT_MET'] = dv_merged_copy.apply(lambda row: len(row['DV_SUBLINES']) >= minimum_subline_support_per_clade_size_requirement[len(row['DV_MRCA_TERMINALS'])], axis=1)

## Create FINAL_MRCA column - where variant is placed at.

In [32]:
def final_mrca_helper(row):
    if type(row['REGENO_MRCA_UNION']) == float and pd.isna(row['REGENO_MRCA_UNION']):
        if row['DV_MINIMUM_SUPPORT_MET']:
            return row['DV_MRCA']
        else:
            return float('nan')
    else:
        return row['REGENO_MRCA_UNION']

dv_merged_copy['FINAL_MRCA'] = dv_merged_copy.apply(lambda row: final_mrca_helper(row), axis=1)

## Show amount of unplaced variants

In [33]:
nan_count = dv_merged_copy['FINAL_MRCA'].isna().sum()
print(f"Number of rows with NaN in FINAL_MRCA: {nan_count}")

print(nan_count / len(dv_merged_copy) * 100)

Number of rows with NaN in FINAL_MRCA: 14791
3.287546398168523


## Subsection merged DF into sub_df per internal and private node for VCF creation.

In [34]:
internal_node_dfs = {}
private_node_dfs = {}

dv_merged_without_unplaced = dv_merged_copy.dropna(subset=['FINAL_MRCA'])

for internal_node in internal_nodes:
    internal_node_dfs[internal_node] = dv_merged_without_unplaced[dv_merged_without_unplaced['FINAL_MRCA'] == internal_node]

for private_node in private_nodes:
    private_node_dfs[private_node] = dv_merged_without_unplaced[dv_merged_without_unplaced['FINAL_MRCA'] == private_node]

## Prepare exclusive DF formatting for VCF creation.

In [35]:
exclusive_dfs = {**internal_node_dfs, **private_node_dfs}

for node in exclusive_dfs:
    exclusive_dfs[node] = exclusive_dfs[node][['CHROM', 'POS', 'REF', 'ALT'] + [col for col in exclusive_dfs[node].columns if col.startswith('C') and col != "CHROM"]]
    exclusive_dfs[node]['ID'] = "."
    exclusive_dfs[node]['QUAL'] = "."
    exclusive_dfs[node]['FILTER'] = "PASS"
    exclusive_dfs[node]['INFO'] = "."
    exclusive_dfs[node]['FORMAT'] = "GT:GQ:DP:MIN_DP:AD:VAF:PL:MED_DP"
    # Fill Empty cells with "./.:0:0:0:0,0:0:0:0"
    # exclusive_dfs[node].fillna("./.:0:0:0:0,0:0:0:0", inplace=True)

    # Fill Empty cells with "./.:0:0,0,0:0:0:" (As of unphased Sev data)
    exclusive_dfs[node].fillna("./.:0:0,0,0:0:0", inplace=True)

    # Reorder columns
    exclusive_dfs[node] = exclusive_dfs[node][['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT'] + [col for col in exclusive_dfs[node].columns if col.startswith('C') and col != "CHROM"]]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/p

## Write Exclusive VCFs

In [None]:
for node in exclusive_dfs:
    print(node, ": ", len(exclusive_dfs[node]))
    th_utils.write_vcf(exclusive_dfs[node], f"/data/KolmogorovLab/agoretsky/Latest_Variant_Calls_01_13_25/regenotyped_snv_04_16_25_exclusive/{node}.vcf", default_header)

N1 :  34343
N2 :  348
N3 :  349
N4 :  413
N5 :  737
N6 :  1476
N7 :  5719
N8 :  139
N9 :  159
N10 :  3998
N11 :  574
N12 :  164
N13 :  797
N14 :  2158
N15 :  237
N16 :  1016
N17 :  706
N18 :  9470
N19 :  1098
N20 :  1640
N21 :  2742
N22 :  2529
O1 :  27902
O3 :  12099
O4 :  18917
O5 :  17337
O6 :  7179
O7 :  6544
O8 :  11093
O9 :  21245
O10 :  15729
O11 :  20205
O12 :  14452
O13 :  18199
O14 :  10883
O15 :  6266
O16 :  22784
O17 :  25174
O18 :  7358
O19 :  27759
O20 :  7260
O21 :  10978
O22 :  24886
O23 :  22294
O24 :  7764


## Write Cumulative VCFs

In [None]:
cumulative_dfs = {}
write_vcf_now_cumulative = True

for key, value in non_terminal_paths.items():
    merged_for_key = pd.concat([exclusive_dfs[x] for x in value], ignore_index=True)
    cumulative_dfs.update({key: merged_for_key})
    print("SNV Count for node: " + str(key) + " " + str(len(merged_for_key)))

    if write_vcf_now_cumulative:
        th_utils.write_vcf(merged_for_key, f"/data/KolmogorovLab/agoretsky/Latest_Variant_Calls_01_13_25/regenotyped_snv_04_16_25_cumulative/{key}.vcf", default_header)

SNV Count for node: N1 34343
SNV Count for node: N8 34482
SNV Count for node: N2 34691
SNV Count for node: N12 34646
SNV Count for node: N9 34641
SNV Count for node: N4 35104
SNV Count for node: N3 35040
SNV Count for node: N16 35662
SNV Count for node: N13 35443
SNV Count for node: N10 38639
SNV Count for node: O5 51978
SNV Count for node: O23 57398
SNV Count for node: N5 35841
SNV Count for node: O19 62799
SNV Count for node: O17 60214
SNV Count for node: N17 36368
SNV Count for node: O13 53861
SNV Count for node: N14 37601
SNV Count for node: O9 56688
SNV Count for node: N11 39213
SNV Count for node: O4 57556
SNV Count for node: N6 37317
SNV Count for node: N7 41560
SNV Count for node: N19 37466
SNV Count for node: N18 45838
SNV Count for node: N15 37838
SNV Count for node: O24 45365
SNV Count for node: O1 67115
SNV Count for node: O22 64099
SNV Count for node: O10 53046
SNV Count for node: O12 51769
SNV Count for node: O3 53659
SNV Count for node: O14 52443
SNV Count for node: N20 