## TreeHarmonizer

TreeHarmonizer places called variants onto a pre-existing phylogenetic tree, allowing visualization of variant trajectories and evolutionary progression. It was developed with single nucleotide (SNV), structural (SV), and copy number (CNA) variants in mind, and supports placement of each variant type.

As of version 0.1, TreeHarmonizer works as a Jupyter notebook, designed for the paper "Long-read sequencing of single cell-derived melanoma subclones reveals divergent and parallel genomic and epigenomic evolutionary trajectories", by Liu & Goretsky, et al.

* Preprint available on [bioRxiv](https://www.biorxiv.org/content/10.1101/2025.08.28.672865v1)
* Unaligned BAM files available on [NCBI SRA](https://www.ncbi.nlm.nih.gov/bioproject/1307171)
* Further project files available on [Zenodo](https://doi.org/10.5281/zenodo.16883901)

### Dependencies

----- Please consult the main project README.md for installation instructions -----

TreeHarmonizer requires a Python 3.6 environment with the following packages:
* pandas
* io (Python standard library)
* functools (Python standard library)
* os (Python standard library)
* intervaltree (https://pypi.org/project/intervaltree/)
* ete3
* jupyter

## How To Run

1. Load Dependencies in the cell below
2. Fill in input parameters per instructions
3. Run all following cells

## Load Dependencies

In [26]:
import importlib
import pandas as pd
import TreeHarmonizer_utils as th_utils
importlib.reload(th_utils)
import subprocess

pd.set_option('display.width', 5000)
pd.set_option("display.expand_frame_repr", True)
pd.set_option("display.max_colwidth", 1000)
pd.set_option('display.max_columns', None)

## Input Parameters

### SNV Path
* Expects the following folder structure within the SNV path: `[snv_path]/[sample_name]/[sample_name].vcf`
  * SNV VCFs are assumed to follow the standard VCF 4.2 format, as output by DeepVariant.

  * The sample name within the VCF file is assumed to be the same as `[sample_name]`.

  * By default, every folder within the path is assumed to be a sample, except those that begin with an underscore `_`.

  * To limit the folders further, add the `predefined_sample_list` argument to the call to `th_utils.generate_merged_df` in the Import SNV and Severus Data cell.

  * The argument value should be a list of directory names within `snv_path`. VCF names are still expected to match the sample name.
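
The folder rules above can be sketched as follows. This is a minimal illustration only and does not reproduce the internals of `th_utils.generate_merged_df`; `expected_vcf_paths` is a hypothetical helper, and the sample names are made up.

```python
import os
import tempfile

def expected_vcf_paths(snv_path, predefined_sample_list=None):
    """Map each sample folder under snv_path to its expected VCF path (hypothetical helper)."""
    snv_path = os.path.abspath(snv_path)
    paths = {}
    for name in sorted(os.listdir(snv_path)):
        # Only folders count as samples; folders starting with "_" are skipped.
        if name.startswith("_") or not os.path.isdir(os.path.join(snv_path, name)):
            continue
        # Optional restriction, mirroring the idea of predefined_sample_list.
        if predefined_sample_list is not None and name not in predefined_sample_list:
            continue
        paths[name] = os.path.join(snv_path, name, name + ".vcf")
    return paths

# Toy directory layout with hypothetical sample names.
with tempfile.TemporaryDirectory() as root:
    for name in ("C1", "C2", "_scratch"):
        os.makedirs(os.path.join(root, name))
    all_samples = list(expected_vcf_paths(root))
    subset = list(expected_vcf_paths(root, predefined_sample_list=["C2"]))
    c1_vcf_name = os.path.basename(expected_vcf_paths(root)["C1"])
```

Here `_scratch` is ignored because of its leading underscore, and passing a list restricts the result to the named folders.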

### SV Path
* Expects a direct path to the Severus output VCF of all somatic SVs.

### CNA Path
* Follows the expected path structure of Wakhan output.
* Expects the following folder structure within the CNA path: `[cna_path]/[sample_name]/bed_output/[sample_name]_copynumbers_segments.bed`

### FN rate
* False negative rate, expected as a float with 0 < fn < 1. Default is 0.15 (15%).
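
A minimal sanity check for this parameter might look like the sketch below. It assumes the rate must lie strictly between 0 and 1; `validate_fn_rate` is a hypothetical helper and not part of `th_utils`.

```python
def validate_fn_rate(fn_rate):
    """Return fn_rate as a float, or raise if it is not strictly between 0 and 1."""
    fn_rate = float(fn_rate)
    if not 0.0 < fn_rate < 1.0:
        raise ValueError(f"fn_rate must satisfy 0 < fn_rate < 1, got {fn_rate}")
    return fn_rate

fn_rate = validate_fn_rate(0.15)  # the notebook default
```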

### Tree Path (In development, unavailable)
* Currently only expects the original tree for the paper.
* Expects a phylogenetic tree in the standard Newick format.

### Newick Format (In development, unavailable)
* Newick strings are read and processed by the ete3 package. Please choose the format number that represents the structure of your Newick string.

### Write VCFs
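* `True` or `False`: whether to write VCFs per tree node. Set separately for exclusive and cumulative output (`write_exclusive_vcfs`, `write_cumulative_vcfs`).
* `False` results in only internal notebook variables being populated.

### Write Path
* Path for VCF output per tree node, one each for exclusive and cumulative output (`write_path_exclusive`, `write_path_cumulative`).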

In [None]:
# Modify paths with your file structures below.
# Paths are prepopulated for example data provided in the repository. 
# Relative paths get converted to absolute paths via os.path.abspath in the th_utils functions.

snv_path = "./example_data/snv/"
cna_path = "./example_data/cna/"
sv_path = "./example_data/sv/severus_chr1.vcf"
fn_rate = 0.15
write_exclusive_vcfs = True
write_cumulative_vcfs = True
write_path_exclusive = "./output_vcfs/exclusive"
write_path_cumulative = "./output_vcfs/cumulative"

## Import SNV and Severus Data

In [28]:
# Load SNV data into a merged DataFrame
dv_merged, sample_list = th_utils.generate_merged_df(caller_path=snv_path)

# Load Severus data into a merged DataFrame
severus = th_utils.generate_severus_df(severus_path=sv_path, simple_name=True)

# Load tree input via newick string and parse it into various components
imported_tree, non_terminals, terminals, non_terminal_paths, terminal_paths, non_terminal_leaves, terminal_paths_o_keys, non_terminal_paths_without_N1 = th_utils.get_tree_data()

# Add informative columns to the merged DataFrames that were lost from original merging.
dv_merged['CHROM'] = dv_merged['KEY'].str.split(":").str[0]
dv_merged['POS'] = dv_merged['KEY'].str.split(":").str[1]
dv_merged['REF'] = dv_merged['KEY'].str.split(":").str[2]
dv_merged['ALT'] = dv_merged['KEY'].str.split(":").str[3]

severus['CHROM'] = severus['CHROM'].astype(str)
severus['POS'] = severus['POS'].astype(int)
dv_merged['CHROM'] = dv_merged['CHROM'].astype(str)
dv_merged['POS'] = dv_merged['POS'].astype(int)

# Rename INFO COLUMNS to avoid conflicts
dv_merged.rename(columns={'INFO': 'DV_INFO'}, inplace=True)
severus.rename(columns={'INFO': 'SEV_INFO'}, inplace=True)
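
The `KEY` handling in the cell above can be illustrated on a single value. This sketch assumes the key layout `CHROM:POS:REF:ALT`, inferred from the split indices used above; `VariantKey`, `parse_key`, and the example key are hypothetical.

```python
from typing import NamedTuple

class VariantKey(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str

def parse_key(key):
    """Split a 'CHROM:POS:REF:ALT' key into typed fields."""
    chrom, pos, ref, alt = key.split(":")[:4]
    # POS is converted to int so it can be used for interval lookups later.
    return VariantKey(chrom, int(pos), ref, alt)

example = parse_key("chr1:12345:A:G")  # made-up variant
```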

## Process Severus Variants

In [29]:
# Filter out second breakpoint pair
severus_no_break = severus[severus["ID"].str.contains("_2")==False]
severus_filtered = severus_no_break.copy(deep=True)

# Filter for only deletions
severus_filtered_del = severus_filtered[severus_filtered["SEV_INFO"].str.contains("SVTYPE=DEL")==True]

# Create critical columns, note that these are NOT SETS, JUST ARRAYS (MRCA is a string)
# Chain warning turned off, as we are not modifying the dataframe
pd.options.mode.chained_assignment = None

def severus_called_sublines_helper(row):
    # Per-sample fields are formatted as GT:VAF:hVAF:DR:DV.
    # The prior filter method compared the whole field against the empty call:
    # return [col for col in severus_data.columns if col.startswith('C') and col != "CHROM" and row[col] != "./.:0:0,0,0:0:0"]
    # A VCF bug in the current Severus version (as of 03.11.2025) requires the filter below instead:
    # a subline counts as called only if its DV value (fifth field) is greater than 0.
    output_subline_list = []
    for col in severus_filtered_del.columns:
        if col.startswith('C') and col != "CHROM":
            internal_sev_data = row[col].split(":")
            DV = int(internal_sev_data[4])
            if DV > 0:
                output_subline_list.append(col)
    return output_subline_list

severus_filtered_del['SEVERUS_SUBLINES'] = severus_filtered_del.apply(severus_called_sublines_helper, axis=1)
severus_filtered_del['SEVERUS_MRCA'] = severus_filtered_del.apply(lambda row: th_utils.common_ancestor_helper(row, "SEVERUS_SUBLINES", input_tree=imported_tree), axis=1)
severus_filtered_del['SEVERUS_MRCA_TERMINALS'] = severus_filtered_del.apply(lambda row: imported_tree.search_nodes(name=row['SEVERUS_MRCA'])[0].get_leaf_names(), axis=1)

# Return chained assignment warning to default
pd.options.mode.chained_assignment = 'warn'

severus_filtered_del.index=severus_filtered_del['ID']
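
The DV-based filter in `severus_called_sublines_helper` can be shown on one record. This is a simplified sketch with a plain dict standing in for a DataFrame row; `supporting_sublines` and the field values are hypothetical.

```python
def supporting_sublines(sample_fields):
    """Return sublines whose DV (fifth field of GT:VAF:hVAF:DR:DV) is above 0."""
    supported = []
    for subline, field in sample_fields.items():
        dv = int(field.split(":")[4])
        if dv > 0:
            supported.append(subline)
    return supported

# Made-up per-sample fields for one deletion.
record = {
    "C1": "0/1:0.45:0.45,0,0:12:10",
    "C2": "./.:0:0,0,0:0:0",
    "C3": "0/0:0.10:0,0.10,0:18:2",
}
called = supporting_sublines(record)
```

In this toy record, `C3` counts as called despite its `0/0` genotype, because only the DV field is inspected.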

## Import and process Wakhan CNA numbers

In [30]:
# Create interval trees of Wakhan copy number segments: all segments, CN=1 only, and CN=0 only
wakhan_cna_trees_per_chromosome = {}
wakhan_cna_1_only_trees_per_chromosome = {}
wakhan_cna_0_only_trees_per_chromosome = {}

# Create blank interval trees per sample and chromosome
for sub in sample_list:
    for chrom in th_utils.autosomes:
        wakhan_cna_trees_per_chromosome.update({sub + "-" + chrom: th_utils.it.IntervalTree()})
        wakhan_cna_1_only_trees_per_chromosome.update({sub + "-" + chrom: th_utils.it.IntervalTree()})
        wakhan_cna_0_only_trees_per_chromosome.update({sub + "-" + chrom: th_utils.it.IntervalTree()})


for subline in sample_list:
    wk_copy_num = th_utils.read_bed_updated(cna_path + "/" + str(subline) + "/bed_output/" + str(subline) + "_copynumbers_segments.bed")
    wk_copy_num['chr'] = wk_copy_num['chr'].astype(str)

    # Filter wakhan down to only autosomes
    wk_copy_num = th_utils.keep_rows_by_values(wk_copy_num, 'chr', th_utils.autosomes)
    wk_copy_num['copynumber_state'] = wk_copy_num['copynumber_state'].astype(int)

    # For each copy number entry, add it to the interval tree of its subline and chromosome.
    # The interval is the segment range (end + 1, since the interval tree is upper-exclusive); the data is the segment metadata.
    # Segments with a copy number of 1 or 0 are additionally added to the matching loss-only tree.
    for index, row in wk_copy_num.iterrows():
        tree_key = str(subline) + "-" + str(row['chr'])
        start, end = int(row['start']), int(row['end']) + 1
        segment_data = (str(subline), row['copynumber_state'], row['coverage'], row['confidence'], row['svs_breakpoints_ids'])
        wakhan_cna_trees_per_chromosome[tree_key].addi(start, end, segment_data)
        if row['copynumber_state'] == 1:
            wakhan_cna_1_only_trees_per_chromosome[tree_key].addi(start, end, segment_data)
        elif row['copynumber_state'] == 0:
            wakhan_cna_0_only_trees_per_chromosome[tree_key].addi(start, end, segment_data)

# Create a copy of the merged dataframe to modify
df_merged_copy_wakhan = dv_merged.copy(deep=True)

def in_copy_num_of_1_helper(row):
    for subline in sample_list:
        if wakhan_cna_1_only_trees_per_chromosome[str(subline) + "-" + row['CHROM']].overlaps_point(int(row['POS'])):
            return True
    return False

def in_copy_num_of_0_helper(row):
    for subline in sample_list:
        if wakhan_cna_0_only_trees_per_chromosome[str(subline) + "-" + row['CHROM']].overlaps_point(int(row['POS'])):
            return True
    return False

def copy_num_1_sublines_helper(row):
    output = []
    for subline in sample_list:
        if wakhan_cna_1_only_trees_per_chromosome[str(subline) + "-" + row['CHROM']].overlaps_point(int(row['POS'])):
            output.append(str(subline))
    return output

def copy_num_0_sublines_helper(row):
    output = []
    for subline in sample_list:
        if wakhan_cna_0_only_trees_per_chromosome[str(subline) + "-" + row['CHROM']].overlaps_point(int(row['POS'])):
            output.append(str(subline))
    return output

def get_cn_meta_helper(row, subline_col, interval_tree_dict, input_tree):
    metadata = {}
    for subline in row[subline_col]:
        chrom = row['CHROM']
        pos = int(row['POS'])
        intervals = interval_tree_dict[subline + "-" + chrom][pos]
        metadata[subline] = intervals.pop().data
    return metadata

# PER CN Loss Type Column Creation (For debugging and closer analysis)
df_merged_copy_wakhan['IN_CN_1'] = df_merged_copy_wakhan.apply(lambda row: in_copy_num_of_1_helper(row), axis=1)
df_merged_copy_wakhan['IN_CN_0'] = df_merged_copy_wakhan.apply(lambda row: in_copy_num_of_0_helper(row), axis=1)

df_merged_copy_wakhan['CN_1_SUBLINES'] = df_merged_copy_wakhan.apply(lambda row: copy_num_1_sublines_helper(row), axis=1)
df_merged_copy_wakhan['CN_0_SUBLINES'] = df_merged_copy_wakhan.apply(lambda row: copy_num_0_sublines_helper(row), axis=1)

df_merged_copy_wakhan['CN_1_MRCA'] = df_merged_copy_wakhan.apply(lambda row: th_utils.common_ancestor_helper(row, "CN_1_SUBLINES", input_tree=imported_tree) if row['IN_CN_1'] else float("nan"), axis=1)
df_merged_copy_wakhan['CN_0_MRCA'] = df_merged_copy_wakhan.apply(lambda row: th_utils.common_ancestor_helper(row, "CN_0_SUBLINES", input_tree=imported_tree) if row['IN_CN_0'] else float("nan"), axis=1)

df_merged_copy_wakhan['CN_1_MRCA_TERMINALS'] = df_merged_copy_wakhan.apply(lambda row: imported_tree.search_nodes(name=row['CN_1_MRCA'])[0].get_leaf_names() if row['IN_CN_1'] else float("nan"), axis=1)
df_merged_copy_wakhan['CN_0_MRCA_TERMINALS'] = df_merged_copy_wakhan.apply(lambda row: imported_tree.search_nodes(name=row['CN_0_MRCA'])[0].get_leaf_names() if row['IN_CN_0'] else float("nan"), axis=1)

# CN 1 and 0 Combined for final union regenotyping
df_merged_copy_wakhan['IN_CN_1_0'] = df_merged_copy_wakhan.apply(lambda row: row['IN_CN_1'] and row['IN_CN_0'], axis=1)
df_merged_copy_wakhan['CN_1_0_SUBLINES'] = df_merged_copy_wakhan.apply(lambda row: row['CN_1_SUBLINES'] + row['CN_0_SUBLINES'], axis=1)
df_merged_copy_wakhan['CN_1_0_MRCA'] = df_merged_copy_wakhan.apply(lambda row: th_utils.common_ancestor_helper(row, "CN_1_0_SUBLINES", input_tree=imported_tree) if row['IN_CN_1_0'] else float("nan"), axis=1)
df_merged_copy_wakhan['CN_1_0_MRCA_TERMINALS'] = df_merged_copy_wakhan.apply(lambda row: imported_tree.search_nodes(name=row['CN_1_0_MRCA'])[0].get_leaf_names() if row['IN_CN_1_0'] else float("nan"), axis=1)

In [31]:
cna_to_merge = df_merged_copy_wakhan[['CHROM', 'POS', 'IN_CN_1_0', 'CN_1_0_SUBLINES', 'CN_1_0_MRCA_TERMINALS', 'IN_CN_1', 'IN_CN_0', 'CN_0_SUBLINES', 'CN_1_SUBLINES', 'CN_0_MRCA_TERMINALS', 'CN_1_MRCA_TERMINALS']].copy(deep=True)
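
The point lookups behind `CN_1_SUBLINES` and `CN_0_SUBLINES` can be sketched without the `intervaltree` package. The plain-Python stand-in below treats each segment end as inclusive, which is what the `end + 1` passed to `addi` achieves for the upper-exclusive interval tree. The function names and the toy segments are hypothetical.

```python
def states_at(segments, pos):
    """Copy number states of all (start, end_inclusive, state) segments covering pos."""
    # Equivalent to the half-open interval [start, end + 1).
    return [state for start, end, state in segments if start <= pos < end + 1]

def sublines_with_loss(segments_by_subline, chrom, pos, loss_states=(0, 1)):
    """Sublines that have a CN=0 or CN=1 segment covering chrom:pos."""
    hits = []
    for subline, per_chrom in segments_by_subline.items():
        if any(s in loss_states for s in states_at(per_chrom.get(chrom, []), pos)):
            hits.append(subline)
    return hits

# Made-up segments per subline and chromosome.
toy_segments = {
    "C1": {"chr1": [(100, 200, 1), (201, 300, 2)]},
    "C2": {"chr1": [(100, 300, 2)]},
    "C3": {"chr1": [(150, 200, 0)]},
}
at_200 = sublines_with_loss(toy_segments, "chr1", 200)
at_201 = sublines_with_loss(toy_segments, "chr1", 201)
```

Position 200 is the last base of the loss segments in `C1` and `C3`, so both are reported; position 201 falls in a CN=2 segment or outside all loss segments.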

## Create MRCA records and metadata, severus MRCA metadata.

In [32]:
MARKED_dv = dv_merged.copy(deep=True)

# Add Wakhan CNA data to MARKED_dv
MARKED_dv = MARKED_dv.merge(cna_to_merge, on=['CHROM', 'POS'], how='left')

MARKED_dv['DV_SUBLINES'] = MARKED_dv.apply(lambda row: [col for col in MARKED_dv.columns if col.startswith('C') and col != "CHROM" and not col.startswith("CN") and pd.notna(row[col])], axis=1)
MARKED_dv['DV_MRCA'] = MARKED_dv.apply(lambda row: th_utils.common_ancestor_helper(row, "DV_SUBLINES", input_tree=imported_tree), axis=1)
MARKED_dv['DV_MRCA_TERMINALS'] = MARKED_dv.apply(lambda row: imported_tree.search_nodes(name=row['DV_MRCA'])[0].get_leaf_names(), axis=1)
MARKED_dv['DV_SUBLINE_COUNT'] = MARKED_dv.apply(lambda row: len(row['DV_SUBLINES']), axis=1)

# Create interval tree of all severus deletion ranges
severus_internal_trees_per_chromosome = {}

# Create blank interval trees per chromosome
for chrom in th_utils.autosomes:
    severus_internal_trees_per_chromosome.update({chrom: th_utils.it.IntervalTree()})

# For each deletion, add to its respective interval tree, with the interval being the deletion range, data being the deletion metadata
# Severus deletion coordinates, as of the Mar 11 version, are (start, end]: start exclusive, end inclusive.
# The interval tree is inclusive on the lower bound and exclusive on the upper bound (this has been verified).
# Therefore we add 1 to the start so that it is the first deleted base, and add 1 to the end so that the last deleted base is still covered.
for index, row in severus_filtered_del.iterrows():
    severus_internal_trees_per_chromosome[row['CHROM']].addi(int(row['POS'] + 1), int(row['SEV_INFO'].split(";")[3].split("=")[1]) + 1, (row['ID'], row['SEVERUS_MRCA'], row['SEVERUS_SUBLINES']))

# Create new column in df_merged which states if variant position is in a severus deletion
MARKED_dv['IN_SEVERUS_DELETION'] = MARKED_dv.apply(lambda row: len(severus_internal_trees_per_chromosome[row['CHROM']][int(row['POS'])]) > 0, axis=1)
MARKED_dv['SEVERUS_SUBLINES'] = MARKED_dv.apply(lambda row: severus_internal_trees_per_chromosome[row['CHROM']][int(row['POS'])].pop()[2][2] if row['IN_SEVERUS_DELETION'] else float("nan"), axis=1)
MARKED_dv['SEVERUS_MRCA'] = MARKED_dv.apply(lambda row: severus_internal_trees_per_chromosome[row['CHROM']][int(row['POS'])].pop()[2][1] if row['IN_SEVERUS_DELETION'] else float("nan"), axis=1)
MARKED_dv['SEVERUS_MRCA_TERMINALS'] = MARKED_dv.apply(lambda row: imported_tree.search_nodes(name=row['SEVERUS_MRCA'])[0].get_leaf_names() if row['IN_SEVERUS_DELETION'] else float("nan"), axis=1)
MARKED_dv['SEVERUS_ID'] = MARKED_dv.apply(lambda row: severus_internal_trees_per_chromosome[row['CHROM']][int(row['POS'])].pop()[2][0] if row['IN_SEVERUS_DELETION'] else float("nan"), axis=1)
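
The MRCA and terminal lookups used throughout this cell can be pictured on a toy tree. The sketch below is a plain-Python stand-in using a child-to-parent map; it does not reproduce `th_utils.common_ancestor_helper`, whose internals are not shown here, and the node names are made up. In ete3 itself, comparable functionality is offered by `get_common_ancestor` and `get_leaf_names`.

```python
def path_to_root(node, parent):
    """Nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def mrca(leaves, parent):
    """Most recent common ancestor of the given leaves (a single leaf is its own MRCA)."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    common = path_to_root(leaves[0], parent)
    for leaf in leaves[1:]:
        ancestors = set(path_to_root(leaf, parent))
        # Keep the order (deepest first) while dropping non-shared nodes.
        common = [node for node in common if node in ancestors]
    return common[0]

def leaves_under(node, parent):
    """Sorted names of all leaves descending from `node`."""
    internal = set(parent.values())
    leaves = [n for n in parent if n not in internal]
    return sorted(leaf for leaf in leaves if node in path_to_root(leaf, parent))

# Toy tree: N1 is the root, N2 groups C1 and C2, and C3 hangs off the root.
toy_parent = {"N2": "N1", "C1": "N2", "C2": "N2", "C3": "N1"}
```

With this tree, a variant called in `C1` and `C2` maps to `N2`, while one called in `C1` and `C3` maps to the root `N1`.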

## Apply parsimony assumption due to lack of phasing data in many regions

In [33]:
# If any subline in which the Severus deletion was called is also a subline in which the SNV was called,
# we assume that the deletion occurred on the other haplotype and is therefore not relevant to that SNV.
# If the following column is True, do not use said deletion for this SNV.
# The same check is applied to Wakhan CN losses (CN_OTHER_HAPLO_BOOL).

MARKED_dv['SEVERUS_OTHER_HAPLO_BOOL'] = MARKED_dv.apply(lambda row: len(set(row['DV_SUBLINES']).intersection(set(row['SEVERUS_SUBLINES']))) > 0 if row['IN_SEVERUS_DELETION'] else float("nan"), axis=1)
MARKED_dv['SEVERUS_OTHER_HAPLO_BOOL'] = MARKED_dv['SEVERUS_OTHER_HAPLO_BOOL'].astype('boolean')
MARKED_dv['CN_OTHER_HAPLO_BOOL'] = MARKED_dv.apply(lambda row: len(set(row['DV_SUBLINES']).intersection(set(row['CN_1_0_SUBLINES']))) > 0 if row['IN_CN_1_0'] else float("nan"), axis=1)
MARKED_dv['CN_OTHER_HAPLO_BOOL'] = MARKED_dv['CN_OTHER_HAPLO_BOOL'].astype('boolean')
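
The other-haplotype rule can be reduced to a set intersection, and regenotyping to a set union. The sketch below works on plain lists instead of DataFrame rows; the function names and subline names are hypothetical.

```python
def other_haplotype(snv_sublines, loss_sublines):
    """True if any subline carries both the SNV call and the loss call."""
    return bool(set(snv_sublines) & set(loss_sublines))

def regenotyped_sublines(snv_sublines, loss_sublines):
    """Union of SNV and loss sublines, or None when the loss is assumed to be on the other haplotype."""
    if other_haplotype(snv_sublines, loss_sublines):
        return None
    return sorted(set(snv_sublines) | set(loss_sublines))

# SNV seen in C1 and C2, deletion seen in C3: C3 may have lost the SNV, so it is re-included.
reincluded = regenotyped_sublines(["C1", "C2"], ["C3"])
# SNV and deletion both seen in C2: the deletion is treated as irrelevant to this SNV.
skipped = regenotyped_sublines(["C1", "C2"], ["C2", "C3"])
```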


## Create columns that reflect effects of regenotyping with only Severus deletions.

In [34]:
# Union of DV and Severus deletion sublines
MARKED_dv['REGENO_COMBINED_SUBLINES_SEVERUS_ONLY'] = MARKED_dv.apply(lambda row: list(set(row['SEVERUS_SUBLINES']).union(set(row['DV_SUBLINES']))) if (row['IN_SEVERUS_DELETION'] == True and row['SEVERUS_OTHER_HAPLO_BOOL'] == False) else float("nan"), axis=1)

# Common ancestor of DV and Severus deletion sublines
MARKED_dv['REGENO_MRCA_SEVERUS_ONLY'] = MARKED_dv.apply(lambda row: th_utils.common_ancestor_helper(row, 'REGENO_COMBINED_SUBLINES_SEVERUS_ONLY', input_tree=imported_tree) if (row['IN_SEVERUS_DELETION'] and row['SEVERUS_OTHER_HAPLO_BOOL'] == False) else float("nan"), axis=1)

# TRUE/FALSE if the common ancestor of DV and Severus deletion sublines differs from the DV common ancestor
MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] = MARKED_dv.apply(lambda row: row['REGENO_MRCA_SEVERUS_ONLY'] != row['DV_MRCA'] if (row['IN_SEVERUS_DELETION'] and (row['SEVERUS_OTHER_HAPLO_BOOL'] == False)) else float("nan"), axis=1)
MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] = MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'].astype('boolean')

# Terminal nodes for the CHANGED common ancestor when including sublines from Severus deletion
MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_TERMINALS'] = MARKED_dv.apply(lambda row: imported_tree.search_nodes(name=row['REGENO_MRCA_SEVERUS_ONLY'])[0].get_leaf_names() if row['IN_SEVERUS_DELETION'] and (row['SEVERUS_OTHER_HAPLO_BOOL'] == False) else float("nan"), axis=1)

# TRUE/FALSE if the union of sublines from DV and Severus deletion differs from the DV sublines (MAY NOT DIFFER, IF DOESN'T, VARIANT WAS STILL CALLED IN DEL POSITION, IMPLIES OTHER HAPLOTYPE FOR DEL)
MARKED_dv['REGENO_SUBLINES_DIFFER_SEVERUS_ONLY'] = MARKED_dv.apply(lambda row: set(row['REGENO_COMBINED_SUBLINES_SEVERUS_ONLY']) != set(row['DV_SUBLINES']) if row['IN_SEVERUS_DELETION'] and (row['SEVERUS_OTHER_HAPLO_BOOL'] == False) else float("nan"), axis=1)
MARKED_dv['REGENO_SUBLINES_DIFFER_SEVERUS_ONLY'] = MARKED_dv['REGENO_SUBLINES_DIFFER_SEVERUS_ONLY'].astype('boolean')

## Create columns that reflect effects of regenotyping with only Wakhan losses (CN=0 or CN=1)

In [35]:
# Union of DV and CN 1/0 sublines
MARKED_dv['REGENO_COMBINED_SUBLINES_CN_1_0_ONLY'] = MARKED_dv.apply(lambda row: list(set(row['CN_1_0_SUBLINES']).union(set(row['DV_SUBLINES']))) if row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False else float("nan"), axis=1)

# Common ancestor of DV and CN 1/0 sublines
MARKED_dv['REGENO_MRCA_CN_1_0_ONLY'] = MARKED_dv.apply(lambda row: th_utils.common_ancestor_helper(row, 'REGENO_COMBINED_SUBLINES_CN_1_0_ONLY', input_tree=imported_tree) if row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False else float("nan"), axis=1)

# TRUE/FALSE if the common ancestor of DV and CN 1/0 sublines differs from the DV common ancestor
MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] = MARKED_dv.apply(lambda row: row['REGENO_MRCA_CN_1_0_ONLY'] != row['DV_MRCA'] if row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False else float("nan"), axis=1)
MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] = MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'].astype('boolean')

# Terminal nodes for the CHANGED common ancestor when including sublines from CN 1/0
MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_TERMINALS'] = MARKED_dv.apply(lambda row: imported_tree.search_nodes(name=row['REGENO_MRCA_CN_1_0_ONLY'])[0].get_leaf_names() if row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False else float("nan"), axis=1)

# TRUE/FALSE if the union of sublines from DV and CN 1/0 differs from the DV sublines (MAY NOT DIFFER, IF DOESN'T, VARIANT WAS STILL CALLED IN CN LOSS POSITION, IMPLIES OTHER HAPLOTYPE FOR LOSS)
MARKED_dv['REGENO_SUBLINES_DIFFER_CN_1_0_ONLY'] = MARKED_dv.apply(lambda row: set(row['REGENO_COMBINED_SUBLINES_CN_1_0_ONLY']) != set(row['DV_SUBLINES']) if row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False else float("nan"), axis=1)
MARKED_dv['REGENO_SUBLINES_DIFFER_CN_1_0_ONLY'] = MARKED_dv['REGENO_SUBLINES_DIFFER_CN_1_0_ONLY'].astype('boolean')

## Create columns that reflect effects of regenotyping using both Severus and Wakhan losses.

* Additionally, create a dramatic shift column reflecting whether regenotyping would dramatically shift the MRCA up the tree, herein defined as a subline reinclusion rate of > 100% (the combined subline count is more than double the original DV subline count).

In [36]:
def combined_sublines_union_helper(row):
    dv_set = set(row['DV_SUBLINES'])
    CN_set = set()
    SEV_set = set()
    if row['IN_CN_1_0']:
        CN_set = set(row['CN_1_0_SUBLINES'])
    if row['IN_SEVERUS_DELETION']:
        SEV_set = set(row['SEVERUS_SUBLINES'])
    if SEV_set or CN_set:
        return list(dv_set.union(CN_set).union(SEV_set))
    else:
        return float("nan")

# Union of DV, CN 1/0, and Severus deletion sublines
MARKED_dv['REGENO_COMBINED_SUBLINES_UNION'] = MARKED_dv.apply(lambda row: combined_sublines_union_helper(row), axis=1)

# Common ancestor of the combined (union) sublines
MARKED_dv['REGENO_MRCA_UNION'] = MARKED_dv.apply(lambda row: th_utils.common_ancestor_helper(row, 'REGENO_COMBINED_SUBLINES_UNION', input_tree=imported_tree) if (row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False) or (row['IN_SEVERUS_DELETION'] and row['SEVERUS_OTHER_HAPLO_BOOL'] == False) else float("nan"), axis=1)

# TRUE/FALSE if the common ancestor of the combined sublines differs from the DV common ancestor
MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] = MARKED_dv.apply(lambda row: row['REGENO_MRCA_UNION'] != row['DV_MRCA'] if (row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False) or (row['IN_SEVERUS_DELETION'] and row['SEVERUS_OTHER_HAPLO_BOOL'] == False) else float("nan"), axis=1)
MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] = MARKED_dv['REGENO_MRCA_UNION_DIFFERS'].astype('boolean')

# Terminal nodes for the CHANGED common ancestor when including sublines from CN 1/0 and Severus deletions
MARKED_dv['REGENO_MRCA_UNION_TERMINALS'] = MARKED_dv.apply(lambda row: imported_tree.search_nodes(name=row['REGENO_MRCA_UNION'])[0].get_leaf_names() if (row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False) or (row['IN_SEVERUS_DELETION'] and row['SEVERUS_OTHER_HAPLO_BOOL'] == False) else float("nan"), axis=1)

# TRUE/FALSE if the combined sublines differ from the DV sublines (MAY NOT DIFFER, IF DOESN'T, VARIANT WAS STILL CALLED IN DEL POSITION, IMPLIES OTHER HAPLOTYPE FOR DEL)
MARKED_dv['REGENO_SUBLINES_DIFFER_UNION'] = MARKED_dv.apply(lambda row: set(row['REGENO_COMBINED_SUBLINES_UNION']) != set(row['DV_SUBLINES']) if (row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False) or (row['IN_SEVERUS_DELETION'] and row['SEVERUS_OTHER_HAPLO_BOOL'] == False) else float("nan"), axis=1)

def tree_shift_helper(row):
    # Dramatic shift: the regenotyped union has more than twice as many sublines as the original DV call
    return len(row['REGENO_COMBINED_SUBLINES_UNION']) > (len(row['DV_SUBLINES']) * 2)
        

MARKED_dv['DRAMATIC_SHIFT'] = MARKED_dv.apply(lambda row: tree_shift_helper(row) if (row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False) or (row['IN_SEVERUS_DELETION'] and row['SEVERUS_OTHER_HAPLO_BOOL'] == False) else float("nan"), axis=1)
MARKED_dv['DRAMATIC_SHIFT'] = MARKED_dv['DRAMATIC_SHIFT'].astype('boolean')
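
The boundary of the dramatic shift test is easiest to see with counts. This sketch follows the strict `>` comparison in `tree_shift_helper`; `is_dramatic_shift` and the subline names are hypothetical.

```python
def is_dramatic_shift(dv_sublines, union_sublines):
    """True if regenotyping more than doubles the number of supporting sublines."""
    return len(union_sublines) > 2 * len(dv_sublines)

dv = ["C1", "C2"]
# Two sublines re-included: exactly doubled, which the strict comparison does not flag.
doubled = is_dramatic_shift(dv, ["C1", "C2", "C3", "C4"])
# Three sublines re-included: more than doubled, flagged as a dramatic shift.
more_than_doubled = is_dramatic_shift(dv, ["C1", "C2", "C3", "C4", "C5"])
```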

## Load the minimum subline support per clade size thresholds.

In [37]:
# Prebuilt minimum subline support per clade size requirement
# Uncomment below in order to use the prebuilt version rather than unnecessarily recalculating

# minimum_subline_support_per_clade_size_requirement = {
#     1: 1,
#     2: 2,
#     3: 2,
#     4: 3,
#     5: 4,
#     7: 5,
#     8: 6,
#     12: 10,
#     16: 13,
#     23: 19
# }

# The formulaic version of the threshold is computed below and covers every clade size present in the tree.

clade_sizes = set()
# Collect the size of every clade in the tree. Leaves are included (size 1), matching the prebuilt table above.
for clade in imported_tree.traverse():
    clade_sizes.add(len(clade.get_leaves()))

formulaic_subline_support_per_clade_size_requirement = {}

cur_tree_clade_sizes = clade_sizes
cur_tree_fn_rate = fn_rate # Default 15% as set in Cell 2, can be changed based on user preference
for clade_size in cur_tree_clade_sizes:
    if clade_size < 2:
        formulaic_subline_support_per_clade_size_requirement[clade_size] = 1
    elif clade_size == 2:
        formulaic_subline_support_per_clade_size_requirement[clade_size] = 2
    else:
        # Calculate the support requirement based on the FN rate
        support_requirement = int(clade_size * (1 - float(cur_tree_fn_rate)))
        formulaic_subline_support_per_clade_size_requirement[clade_size] = support_requirement

minimum_subline_support_per_clade_size_requirement = formulaic_subline_support_per_clade_size_requirement
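
The threshold rule in this cell can be written as a small standalone function. The sketch below mirrors the branches above (1 for single-leaf clades, 2 for pairs, otherwise the clade size scaled by `1 - fn_rate` and truncated); `minimum_support` is a hypothetical name, and the clade sizes are the ones listed in the commented-out prebuilt table.

```python
def minimum_support(clade_size, fn_rate=0.15):
    """Minimum number of supporting sublines required for a clade of the given size."""
    if clade_size < 2:
        return 1
    if clade_size == 2:
        return 2
    # int() truncates, so a clade of 7 with a 15% FN rate needs int(5.95) = 5 sublines.
    return int(clade_size * (1 - float(fn_rate)))

prebuilt_sizes = (1, 2, 3, 4, 5, 7, 8, 12, 16, 23)
thresholds = {size: minimum_support(size) for size in prebuilt_sizes}
```

With the default rate of 0.15, these values agree with the prebuilt table, for example 5 of 7 and 19 of 23 sublines.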

## Support prior to regenotyping and post regenotyping calculation

In [38]:
# Original Pass or Fail calculation
MARKED_dv['DV_MINIMUM_SUPPORT_MET'] = MARKED_dv.apply(lambda row: len(row['DV_SUBLINES']) >= minimum_subline_support_per_clade_size_requirement[len(row['DV_MRCA_TERMINALS'])], axis=1)
MARKED_dv['DV_MINIMUM_SUPPORT_MET'] = MARKED_dv['DV_MINIMUM_SUPPORT_MET'].astype('boolean')
MARKED_dv['SEVERUS_MINIMUM_SUPPORT_MET'] = MARKED_dv.apply(lambda row: len(row['SEVERUS_SUBLINES']) >= minimum_subline_support_per_clade_size_requirement[len(row['SEVERUS_MRCA_TERMINALS'])] if row['IN_SEVERUS_DELETION'] else float("nan"), axis=1)
MARKED_dv['SEVERUS_MINIMUM_SUPPORT_MET'] = MARKED_dv['SEVERUS_MINIMUM_SUPPORT_MET'].astype('boolean')
MARKED_dv['CN_1_0_MINIMUM_SUPPORT_MET'] = MARKED_dv.apply(lambda row: len(row['CN_1_0_SUBLINES']) >= minimum_subline_support_per_clade_size_requirement[len(row['CN_1_0_MRCA_TERMINALS'])] if row['IN_CN_1_0'] else float("nan"), axis=1)
MARKED_dv['CN_1_0_MINIMUM_SUPPORT_MET'] = MARKED_dv['CN_1_0_MINIMUM_SUPPORT_MET'].astype('boolean')

# Post REGENO Pass or Fail calculation
MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] = MARKED_dv.apply(lambda row: len(row['REGENO_COMBINED_SUBLINES_SEVERUS_ONLY']) >= minimum_subline_support_per_clade_size_requirement[len(row['REGENO_MRCA_SEVERUS_ONLY_TERMINALS'])] if row['IN_SEVERUS_DELETION'] and row['SEVERUS_OTHER_HAPLO_BOOL'] == False else float("nan"), axis=1)
MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] = MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'].astype('boolean')
MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] = MARKED_dv.apply(lambda row: len(row['REGENO_COMBINED_SUBLINES_CN_1_0_ONLY']) >= minimum_subline_support_per_clade_size_requirement[len(row['REGENO_MRCA_CN_1_0_ONLY_TERMINALS'])] if row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False else float("nan"), axis=1)
MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] = MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'].astype('boolean')
MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] = MARKED_dv.apply(lambda row: len(row['REGENO_COMBINED_SUBLINES_UNION']) >= minimum_subline_support_per_clade_size_requirement[len(row['REGENO_MRCA_UNION_TERMINALS'])] if (row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False) or (row['IN_SEVERUS_DELETION'] and row['SEVERUS_OTHER_HAPLO_BOOL'] == False) else float("nan"), axis=1)
MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] = MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'].astype('boolean')

# Do not allow dramatic clade shifting
MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] = MARKED_dv.apply(lambda row: len(row['REGENO_COMBINED_SUBLINES_UNION']) >= minimum_subline_support_per_clade_size_requirement[len(row['REGENO_MRCA_UNION_TERMINALS'])] if ((row['IN_CN_1_0'] and row['CN_OTHER_HAPLO_BOOL'] == False) or (row['IN_SEVERUS_DELETION'] and row['SEVERUS_OTHER_HAPLO_BOOL'] == False)) and row['DRAMATIC_SHIFT'] == False else float("nan"), axis=1)
MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] = MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'].astype('boolean')

## Regenotyping Category Scoring

In [39]:
# Generate all possible regenotyping category scoring
# Severus only, CNA only, union of both, and the one that is used - union of both with dramatic shifts disallowed

#original_support_met = MARKED_dv['DV_MINIMUM_SUPPORT_MET'] == True
#regeno_cna_only_support_met = MARKED_dv['REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'] == True
#mrca_change_cna_only = MARKED_dv['REGENO_MRCA_CN_1_0_ONLY_DIFFERS'] == True
#regeno_sv_only_support_met = MARKED_dv['REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'] == True
#mrca_change_sv_only = MARKED_dv['REGENO_MRCA_SEVERUS_ONLY_DIFFERS'] == True
#regeno_union_support_met = MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET'] == True
#mrca_change_union = MARKED_dv['REGENO_MRCA_UNION_DIFFERS'] == True
#regeno_union_no_dramatic_shift_support_met = MARKED_dv['REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'] == True  


# Define function blocks to return the masks for each scenario
def get_scenario_columns(scenario_name):
    if scenario_name == 'severus':
        regeno_col = 'REGENO_SEVERUS_ONLY_MINIMUM_SUPPORT_MET'
        mrca_col = 'REGENO_MRCA_SEVERUS_ONLY_DIFFERS'
    elif scenario_name == 'cna':
        regeno_col = 'REGENO_CN_1_0_ONLY_MINIMUM_SUPPORT_MET'
        mrca_col = 'REGENO_MRCA_CN_1_0_ONLY_DIFFERS'
    elif scenario_name == 'both':
        regeno_col = 'REGENO_UNION_MINIMUM_SUPPORT_MET'
        mrca_col = 'REGENO_MRCA_UNION_DIFFERS'
    elif scenario_name == 'both_no_dramatic':
        regeno_col = 'REGENO_UNION_MINIMUM_SUPPORT_MET_NO_DRAMATIC_SHIFT'
        mrca_col = 'REGENO_MRCA_UNION_DIFFERS'
    else:
        raise ValueError(f"Unknown scenario: {scenario_name}")
    
    return ('DV_MINIMUM_SUPPORT_MET', regeno_col, mrca_col)

scenarios = {
    0: 'severus',
    1: 'cna',
    2: 'both',
    3: 'both_no_dramatic'
}

cat_dvs_no_dramatic = []

for option_num, scenario_name in scenarios.items():
    print(f"--- Processing Option {option_num}: {scenario_name} ---")

    orig_col, regeno_col, mrca_col = get_scenario_columns(scenario_name)

    # Explicit TRUE and FALSE masks are built instead of indexing with the boolean columns directly,
    # because these columns contain missing values (NaN), which cause problems with direct boolean indexing.
    
    original_support_TRUE = (MARKED_dv[orig_col] == True)
    original_support_FALSE = (MARKED_dv[orig_col] == False)

    regeno_support_TRUE = (MARKED_dv[regeno_col] == True)
    regeno_support_FALSE = (MARKED_dv[regeno_col] == False)

    mrca_change_TRUE = (MARKED_dv[mrca_col] == True)
    mrca_change_FALSE = (MARKED_dv[mrca_col] == False)
    
    cat1_dv = MARKED_dv[(original_support_TRUE)  & (regeno_support_TRUE)  & (mrca_change_FALSE)]
    cat2_dv = MARKED_dv[(original_support_TRUE)  & (regeno_support_TRUE)  & (mrca_change_TRUE)]
    cat3_dv = MARKED_dv[(original_support_FALSE) & (regeno_support_TRUE)  & (mrca_change_FALSE)]
    cat4_dv = MARKED_dv[(original_support_FALSE) & (regeno_support_TRUE)  & (mrca_change_TRUE)]
    cat5_dv = MARKED_dv[(original_support_FALSE) & (regeno_support_FALSE) & (mrca_change_FALSE)]
    cat6_dv = MARKED_dv[(original_support_FALSE) & (regeno_support_FALSE) & (mrca_change_TRUE)]
    cat7_dv = MARKED_dv[(original_support_TRUE)  & (regeno_support_FALSE) & (mrca_change_FALSE)]
    cat8_dv = MARKED_dv[(original_support_TRUE)  & (regeno_support_FALSE) & (mrca_change_TRUE)]

    cat_dvs = [cat1_dv, cat2_dv, cat3_dv, cat4_dv, cat5_dv, cat6_dv, cat7_dv, cat8_dv]

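    # Add column and save the cat_dvs for the no_dramatic option for later use.
    # DataFrame.assign returns a new DataFrame, so the filtered slices are not modified in place
    # (in-place assignment on a slice triggers a SettingWithCopyWarning).
    if option_num == 3:
        for i, cat_df in enumerate(cat_dvs):
            if not cat_df.empty:
                cat_dvs[i] = cat_df.assign(ADDED_REGENO_SUBLINES=cat_df.apply(lambda row: set(row['REGENO_COMBINED_SUBLINES_UNION']) - set(row['DV_SUBLINES']), axis=1))
        cat_dvs_no_dramatic = cat_dvs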

    for i, cat_df in enumerate(cat_dvs):
        print(f"Category {i+1} DV Count: {len(cat_df)}")
    print("\n" + "="*50 + "\n")
        

--- Processing Option 0: severus ---
Category 1 DV Count: 2
Category 2 DV Count: 2
Category 3 DV Count: 0
Category 4 DV Count: 0
Category 5 DV Count: 2
Category 6 DV Count: 0
Category 7 DV Count: 0
Category 8 DV Count: 208


--- Processing Option 1: cna ---
Category 1 DV Count: 0
Category 2 DV Count: 0
Category 3 DV Count: 0
Category 4 DV Count: 0
Category 5 DV Count: 0
Category 6 DV Count: 0
Category 7 DV Count: 0
Category 8 DV Count: 0


--- Processing Option 2: both ---
Category 1 DV Count: 2
Category 2 DV Count: 2
Category 3 DV Count: 0
Category 4 DV Count: 0
Category 5 DV Count: 2
Category 6 DV Count: 0
Category 7 DV Count: 0
Category 8 DV Count: 208


--- Processing Option 3: both_no_dramatic ---
Category 1 DV Count: 2
Category 2 DV Count: 2
Category 3 DV Count: 0
Category 4 DV Count: 0
Category 5 DV Count: 2
Category 6 DV Count: 0
Category 7 DV Count: 0
Category 8 DV Count: 208
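
The explicit masks matter because a column holding True, False, and NaN does not behave like a clean boolean column. A minimal sketch on a hypothetical toy frame (the column names `orig`, `regeno`, and `mrca` are stand-ins, not the notebook's real columns) shows that a NaN row matches neither the `== True` nor the `== False` mask, so it lands in none of the eight categories:

```python
import pandas as pd

# Toy data (hypothetical): the last row has a missing value in 'orig'.
toy = pd.DataFrame({
    "orig":   [True, False, float("nan")],
    "regeno": [True, True, True],
    "mrca":   [False, True, False],
})

orig_true = toy["orig"] == True    # NaN compares as False
orig_false = toy["orig"] == False  # NaN compares as False here too

# Same pattern as categories 1 and 4 in the cell above.
cat1 = toy[orig_true & (toy["regeno"] == True) & (toy["mrca"] == False)]
cat4 = toy[orig_false & (toy["regeno"] == True) & (toy["mrca"] == True)]

# Rows matching neither mask are exactly the rows with a missing value.
uncategorized = toy[~(orig_true | orig_false)]
print(len(cat1), len(cat4), len(uncategorized))
```

If the category counts do not sum to the number of rows in the input frame, rows with missing values are a likely cause.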






## Reconcile passing categories back to VCF, Union (SV and CNA), no dramatic change

In [40]:
# Category 1: supported originally and after regenotyping, MRCA unchanged. Kept as is.
FINAL_VAR_DF = cat_dvs_no_dramatic[0].copy(deep=True)

# Category 2: supported originally and after regenotyping, but the MRCA changed.
# Included here; these variants warrant a closer look at the details.
FINAL_VAR_DF = pd.concat([FINAL_VAR_DF, cat_dvs_no_dramatic[1]], ignore_index=True)

# Category 3: failed support originally, passes after regenotyping, MRCA unchanged. Include all.
FINAL_VAR_DF = pd.concat([FINAL_VAR_DF, cat_dvs_no_dramatic[2]], ignore_index=True)

# Category 4: failed support originally, passes after regenotyping, MRCA changed.
FINAL_VAR_DF = pd.concat([FINAL_VAR_DF, cat_dvs_no_dramatic[3]], ignore_index=True)

In [41]:
# Blanket Default VCF Header
default_header = """##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=RefCall,Description="Genotyping model thinks this site is reference.">
##FILTER=<ID=LowQual,Description="Confidence in this variant being real is below calling threshold.">
##FILTER=<ID=NoCall,Description="Site has depth=0 resulting in no call.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position (for use with symbolic alleles)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block.">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Read depth for each allele">
##FORMAT=<ID=VAF,Number=A,Type=Float,Description="Variant allele fractions.">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled genotype likelihoods rounded to the closest integer">
##FORMAT=<ID=MED_DP,Number=1,Type=Integer,Description="Median DP observed within the GVCF block rounded to the nearest integer.">
##DeepVariant_version=1.6.0
##contig=<ID=1,length=195471971>
##contig=<ID=10,length=130694993>
##contig=<ID=11,length=122082543>
##contig=<ID=12,length=120129022>
##contig=<ID=13,length=120421639>
##contig=<ID=14,length=124902244>
##contig=<ID=15,length=104043685>
##contig=<ID=16,length=98207768>
##contig=<ID=17,length=94987271>
##contig=<ID=18,length=90702639>
##contig=<ID=19,length=61431566>
##contig=<ID=2,length=182113224>
##contig=<ID=3,length=160039680>
##contig=<ID=4,length=156508116>
##contig=<ID=5,length=151834684>
##contig=<ID=6,length=149736546>
##contig=<ID=7,length=145441459>
##contig=<ID=8,length=129401213>
##contig=<ID=9,length=124595110>\n"""

## Add REGENO entry to variants that are regenotyped rather than called originally.

In [42]:
internal_nodes = [f'N{i}' for i in range(1, 23)]
private_nodes = [f'O{i}' for i in range(1, 25) if i != 2]

for index, row in FINAL_VAR_DF.iterrows():
    if 'ADDED_REGENO_SUBLINES' in row and isinstance(row['ADDED_REGENO_SUBLINES'], set):
        for subline in row['ADDED_REGENO_SUBLINES']:
            if subline in FINAL_VAR_DF.columns:
                FINAL_VAR_DF.at[index, subline] = "REGENO"

columns_to_keep = ['KEY', 'CHROM', 'POS', 'REF', 'ALT', 'REGENO_MRCA_UNION'] + [f'C{i}' for i in range(1, 25) if i != 2]
FINAL_VAR_DF = FINAL_VAR_DF[columns_to_keep]

## Add back non-regenotyped variants to the final output dataframe

In [43]:
# Add back all of the non-regenotyped variants.
dv_merged_copy = dv_merged.copy(deep=True)
dv_merged_copy = dv_merged_copy[~dv_merged_copy['KEY'].isin(FINAL_VAR_DF['KEY'])]

print(len(dv_merged_copy))
print(len(dv_merged))
print(len(FINAL_VAR_DF))

FINAL_VAR_DF.index = FINAL_VAR_DF['KEY']
FINAL_VAR_DF = FINAL_VAR_DF.drop(columns=['KEY'])
dv_merged_copy.index = dv_merged_copy['KEY']
dv_merged_copy = dv_merged_copy.drop(columns=['KEY'])

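# DataFrame.append was removed in pandas 2.0; pd.concat does the same job and also works on older versions.
dv_merged_copy = pd.concat([dv_merged_copy, FINAL_VAR_DF])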

36359
36363
4
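
The three printed counts are a quick consistency check: 36359 untouched variants plus 4 regenotyped variants gives the original 36363. A minimal sketch of the same anti-join and concatenation pattern, using hypothetical toy frames in place of `dv_merged` and `FINAL_VAR_DF`:

```python
import pandas as pd

# Hypothetical stand-ins for dv_merged and FINAL_VAR_DF.
all_calls = pd.DataFrame({
    "KEY": ["v1", "v2", "v3", "v4"],
    "C1": ["0/1", None, "0/1", None],
})
regenotyped = pd.DataFrame({
    "KEY": ["v2", "v4"],
    "C1": ["REGENO", "REGENO"],
})

# Anti-join: drop every original row whose KEY also appears in the regenotyped set.
untouched = all_calls[~all_calls["KEY"].isin(regenotyped["KEY"])]

# Stack the two disjoint sets, indexed by KEY as in the cell above.
combined = pd.concat([untouched.set_index("KEY"), regenotyped.set_index("KEY")])

# No key should be lost or duplicated.
assert len(combined) == len(all_calls)
assert combined.index.is_unique
```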


#### Metadata for the FINAL_MRCA column, which records the internal branch / node on which each variant is placed.

In [44]:
dv_merged_copy['DV_SUBLINES'] = dv_merged_copy.apply(lambda row: [col for col in dv_merged_copy.columns if col.startswith('C') and col != "CHROM" and not col.startswith("CN") and pd.notna(row[col])], axis=1)
dv_merged_copy['DV_MRCA'] = dv_merged_copy.apply(lambda row: th_utils.common_ancestor_helper(row, "DV_SUBLINES", input_tree=imported_tree), axis=1)
dv_merged_copy['DV_MRCA_TERMINALS'] = dv_merged_copy.apply(lambda row: imported_tree.search_nodes(name=row['DV_MRCA'])[0].get_leaf_names(), axis=1)
dv_merged_copy['DV_MINIMUM_SUPPORT_MET'] = dv_merged_copy.apply(lambda row: len(row['DV_SUBLINES']) >= minimum_subline_support_per_clade_size_requirement[len(row['DV_MRCA_TERMINALS'])], axis=1)
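
The four columns above encode one idea: find the MRCA of the sublines carrying a variant, count the terminals under that MRCA, and require a minimum number of supporting sublines for a clade of that size. The sketch below illustrates this in plain Python on a hypothetical toy tree. It is a stand-in for the notebook's ete3 tree and `th_utils.common_ancestor_helper`, not their implementation, and the thresholds in `MIN_SUPPORT` are invented for illustration.

```python
# Hypothetical toy tree given as child -> parent; N1 is the root.
PARENT = {"N2": "N1", "C1": "N2", "C3": "N2", "C4": "N1"}

def path_to_root(node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def mrca(leaves):
    """Most recent common ancestor: the lowest node shared by every leaf's path."""
    shared = set(path_to_root(leaves[0]))
    for leaf in leaves[1:]:
        shared &= set(path_to_root(leaf))
    return next(n for n in path_to_root(leaves[0]) if n in shared)

def terminals(node):
    """All leaves at or below `node` (a leaf is a node that is nobody's parent)."""
    leaves = [n for n in PARENT if n not in PARENT.values()]
    return [leaf for leaf in leaves if node in path_to_root(leaf)]

# Hypothetical thresholds: clade size -> minimum number of supporting sublines.
MIN_SUPPORT = {1: 1, 2: 2, 3: 2}

supporting = ["C1", "C3"]
node = mrca(supporting)
support_met = len(supporting) >= MIN_SUPPORT[len(terminals(node))]
print(node, terminals(node), support_met)
```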

## Create FINAL_MRCA column: the node at which each variant is placed

In [45]:
def final_mrca_helper(row):
    # Prefer the regenotyped MRCA when one exists. Otherwise fall back to the original
    # DV_MRCA, but only if it meets the minimum subline support; if not, leave the variant unplaced.
    regeno_mrca = row['REGENO_MRCA_UNION']
    if isinstance(regeno_mrca, float) and pd.isna(regeno_mrca):
        if row['DV_MINIMUM_SUPPORT_MET']:
            return row['DV_MRCA']
        return float('nan')
    return regeno_mrca

dv_merged_copy['FINAL_MRCA'] = dv_merged_copy.apply(final_mrca_helper, axis=1)
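
The same placement rule can also be written without a row-wise `apply`. The sketch below is an alternative formulation, not the notebook's code, and runs on a hypothetical toy frame. It assumes the index is unique and that `DV_MINIMUM_SUPPORT_MET` is a plain boolean column.

```python
import pandas as pd

# Hypothetical toy frame with the three columns the helper reads.
toy = pd.DataFrame(
    {
        "REGENO_MRCA_UNION": ["N7", float("nan"), float("nan")],
        "DV_MRCA": ["N3", "N5", "N9"],
        "DV_MINIMUM_SUPPORT_MET": [True, True, False],
    },
    index=["v1", "v2", "v3"],
)

# Keep DV_MRCA only where support is met (NaN elsewhere) ...
fallback = toy["DV_MRCA"].where(toy["DV_MINIMUM_SUPPORT_MET"])
# ... and use it to fill the rows that have no regenotyped MRCA.
toy["FINAL_MRCA"] = toy["REGENO_MRCA_UNION"].fillna(fallback)
print(toy["FINAL_MRCA"].tolist())
```

Here `v1` keeps its regenotyped MRCA, `v2` falls back to `DV_MRCA`, and `v3` stays unplaced because its support is insufficient.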

## Show number of unplaced variants

In [46]:
nan_count = dv_merged_copy['FINAL_MRCA'].isna().sum()
print(f"Number of rows with NaN in FINAL_MRCA: {nan_count}")

print(nan_count / len(dv_merged_copy) * 100)

Number of rows with NaN in FINAL_MRCA: 458
1.2595220416357287


## Subset merged DF into one sub-DF per internal and private node for VCF creation.

In [47]:
internal_node_dfs = {}
private_node_dfs = {}

dv_merged_without_unplaced = dv_merged_copy.dropna(subset=['FINAL_MRCA'])

for internal_node in internal_nodes:
    internal_node_dfs[internal_node] = dv_merged_without_unplaced[dv_merged_without_unplaced['FINAL_MRCA'] == internal_node]

for private_node in private_nodes:
    private_node_dfs[private_node] = dv_merged_without_unplaced[dv_merged_without_unplaced['FINAL_MRCA'] == private_node]
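
For larger trees, a single `groupby` pass avoids rescanning the frame once per node. The sketch below is an alternative under stated assumptions: the toy frame and the `nodes` list are hypothetical, and nodes without any variants are given an empty frame so that every node still gets an entry, as in the loops above.

```python
import pandas as pd

# Hypothetical placed variants and node list.
placed = pd.DataFrame({
    "POS": [100, 200, 300],
    "FINAL_MRCA": ["N1", "O3", "N1"],
})
nodes = ["N1", "N2", "O3"]

# One pass over the frame: node name -> rows placed at that node.
groups = {name: group for name, group in placed.groupby("FINAL_MRCA")}

# groupby only yields nodes that occur, so fall back to an empty frame with the same columns.
per_node = {node: groups.get(node, placed.iloc[0:0]) for node in nodes}
print({node: len(df) for node, df in per_node.items()})
```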

## Prepare exclusive DF formatting for VCF creation.

In [48]:
exclusive_dfs = {**internal_node_dfs, **private_node_dfs}

for node in exclusive_dfs:
    # Take an explicit copy so the column assignments below do not trigger SettingWithCopyWarning
    exclusive_dfs[node] = exclusive_dfs[node][['CHROM', 'POS', 'REF', 'ALT'] + [col for col in exclusive_dfs[node].columns if col.startswith('C') and col != "CHROM"]].copy()
    exclusive_dfs[node]['ID'] = "."
    exclusive_dfs[node]['QUAL'] = "."
    exclusive_dfs[node]['FILTER'] = "PASS"
    exclusive_dfs[node]['INFO'] = "."
    exclusive_dfs[node]['FORMAT'] = "GT:GQ:DP:MIN_DP:AD:VAF:PL:MED_DP"
    # Earlier placeholder for empty cells, kept for reference: "./.:0:0:0:0,0:0:0:0"
    # exclusive_dfs[node].fillna("./.:0:0:0:0,0:0:0:0", inplace=True)

    # Fill empty cells with "./.:0:0,0,0:0:0" (as of unphased Sev data)
    exclusive_dfs[node].fillna("./.:0:0,0,0:0:0", inplace=True)

    # Reorder columns into VCF order: fixed fields first, then one column per subline
    exclusive_dfs[node] = exclusive_dfs[node][['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT'] + [col for col in exclusive_dfs[node].columns if col.startswith('C') and col != "CHROM"]]


## Write Exclusive VCFs

In [54]:
for node in exclusive_dfs:
    print(node, ": ", len(exclusive_dfs[node]))
    if write_exclusive_vcfs:
        # Strip only the trailing slash so that absolute paths stay absolute
        path_prefix = write_path_exclusive.rstrip("/")
        subprocess.run(['mkdir', '-p', path_prefix])
        th_utils.write_vcf(exclusive_dfs[node], f"{path_prefix}/{node}.vcf", default_header)

N1 :  2879
N2 :  38
N3 :  23
N4 :  38
N5 :  27
N6 :  67
N7 :  539
N8 :  12
N9 :  11
N10 :  302
N11 :  30
N12 :  16
N13 :  79
N14 :  195
N15 :  23
N16 :  131
N17 :  51
N18 :  874
N19 :  78
N20 :  145
N21 :  268
N22 :  208
O1 :  2077
O3 :  1020
O4 :  1625
O5 :  1472
O6 :  609
O7 :  576
O8 :  911
O9 :  1700
O10 :  1293
O11 :  1583
O12 :  1331
O13 :  1487
O14 :  983
O15 :  528
O16 :  2006
O17 :  2145
O18 :  619
O19 :  2340
O20 :  554
O21 :  834
O22 :  1909
O23 :  1703
O24 :  566
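
`th_utils.write_vcf` is defined in the utilities module and is not shown in this notebook. Since `default_header` contains only `##` meta-information lines, a writer has to emit the `#CHROM ...` column line itself before the records. The sketch below shows one way that could look. `write_vcf_sketch` is a hypothetical name, and this is an assumption about the general shape rather than the actual implementation.

```python
import io
import pandas as pd

def write_vcf_sketch(df, handle, header):
    """Write meta-information lines, the column line, then tab-separated records."""
    handle.write(header)
    # VCF column line: '#' followed by the tab-separated column names.
    handle.write("#" + "\t".join(df.columns) + "\n")
    df.to_csv(handle, sep="\t", header=False, index=False)

# Hypothetical one-line header and single record.
header = "##fileformat=VCFv4.2\n"
records = pd.DataFrame({
    "CHROM": ["1"], "POS": [12345], "ID": ["."], "REF": ["A"], "ALT": ["G"],
    "QUAL": ["."], "FILTER": ["PASS"], "INFO": ["."], "FORMAT": ["GT"],
    "C1": ["0/1"],
})

buffer = io.StringIO()
write_vcf_sketch(records, buffer, header)
vcf_text = buffer.getvalue()
print(vcf_text)
```

The sketch drops the DataFrame index with `index=False`. In the notebook the frames are indexed by `KEY`, so the real writer must either do the same or handle the index explicitly.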


## Write Cumulative VCFs

In [53]:
cumulative_dfs = {}

for key, value in non_terminal_paths.items():
    merged_for_key = pd.concat([exclusive_dfs[x] for x in value], ignore_index=True)
    cumulative_dfs.update({key: merged_for_key})
    print(str(key), ": ", str(len(merged_for_key)))
    #print("SNV Count for node: " + str(key) + " " + str(len(merged_for_key)))

    if write_cumulative_vcfs:
        # Strip only the trailing slash so that absolute paths stay absolute
        path_prefix = write_path_cumulative.rstrip("/")
        subprocess.run(['mkdir', '-p', path_prefix])
        th_utils.write_vcf(cumulative_dfs[key], f"{path_prefix}/{key}.vcf", default_header)

N1 :  2879
N8 :  2891
N2 :  2917
N12 :  2907
N9 :  2902
N4 :  2955
N3 :  2940
N16 :  3038
N13 :  2986
N10 :  3204
O5 :  4374
O23 :  4658
N5 :  2982
O19 :  5280
O17 :  5085
N17 :  3089
O13 :  4525
N14 :  3181
O9 :  4686
N11 :  3234
O4 :  4829
N6 :  3049
N7 :  3521
N19 :  3167
N18 :  3963
N15 :  3204
O24 :  3747
O1 :  5311
O22 :  5143
O10 :  4342
O12 :  4380
O3 :  4541
O14 :  4504
N20 :  3312
O11 :  4750
O18 :  4582
O15 :  4491
O21 :  4038
O6 :  3813
N21 :  3580
O16 :  5318
N22 :  3788
O8 :  4491
O20 :  4342
O7 :  4364
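
Each cumulative count should equal the sum of the exclusive counts along the path from the root to that node. The sketch below checks this for a few nodes using the exclusive counts printed earlier. The paths are assumptions inferred from that arithmetic; the real `non_terminal_paths` mapping is built elsewhere in the notebook.

```python
# Exclusive counts copied from the "Write Exclusive VCFs" output.
exclusive_counts = {"N1": 2879, "N2": 38, "N8": 12, "N12": 16}

# Assumed root-to-node paths (hypothetical, consistent with the printed totals).
paths = {
    "N1": ["N1"],
    "N8": ["N1", "N8"],
    "N2": ["N1", "N2"],
    "N12": ["N1", "N8", "N12"],
}

# Cumulative count = sum of exclusive counts over every node on the path.
cumulative_counts = {
    node: sum(exclusive_counts[step] for step in path)
    for node, path in paths.items()
}
print(cumulative_counts)
```

Under these assumed paths the sums reproduce the printed cumulative values for N1, N8, N2, and N12 (2879, 2891, 2917, and 2907).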
