# Calling WGDs

If you run MEDICC2 in the standard (and recommended) way from the command line, the WGD status is written to both the `_summary.tsv` file and the `_copynumber_events_df.tsv` file (these are calculated using approach 1, see below).
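
As a rough illustration of how the events file could be inspected outside of MEDICC2, the sketch below filters a tab-separated events table for WGD rows using only the standard library. The `type` column and the `'wgd'` value are taken from the notebook code further down; the `sample_id` column name, the helper `wgd_nodes_from_events` and the toy file content are assumptions made for illustration.

```python
import csv
import io


def wgd_nodes_from_events(handle):
    """Return the sorted names of all nodes that have a row of type 'wgd'."""
    reader = csv.DictReader(handle, delimiter="\t")
    return sorted({row["sample_id"] for row in reader if row["type"] == "wgd"})


# Toy stand-in for a `_copynumber_events_df.tsv` file (made-up content)
toy_events = io.StringIO(
    "sample_id\tchrom\ttype\n"
    "internal_1\tchr1\tgain\n"
    "internal_1\tchr1\twgd\n"
    "sample_A\tchr2\tloss\n"
)

wgd_nodes = wgd_nodes_from_events(toy_events)
print(wgd_nodes)
```

For a file on disk, the same helper could be called on a handle opened with `open(path, newline="")`, provided the column names match.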

In this notebook, we show how to call WGDs inside Python using MEDICC2.
There are two approaches to do this. The standard way (Approach 1) runs MEDICC2's main routine, which reconstructs a phylogenetic tree of the samples. It then reconstructs the ancestral states (i.e. internal nodes) and infers events for every branch of the tree. If MEDICC2 detects a WGD in a branch, all nodes that lie downstream of that branch are considered to have a WGD.
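
The propagation rule in Approach 1 (a WGD on a branch marks every sample below it) can be sketched on a toy tree. This is a minimal pure-Python illustration, not MEDICC2 code: the tree, the node names and the helper `leaves_below` are all hypothetical.

```python
def leaves_below(children, node):
    """Collect the names of all leaves in the subtree rooted at `node`."""
    kids = children.get(node, [])
    if not kids:
        return [node]
    leaves = []
    for kid in kids:
        leaves.extend(leaves_below(children, kid))
    return leaves


# Hypothetical tree: each key is an internal node, each value lists its children
children = {
    "root": ["diploid", "internal_1"],
    "internal_1": ["sample_A", "internal_2"],
    "internal_2": ["sample_B", "sample_C"],
}

# Nodes whose incoming branch carries a WGD event (made up for this example)
wgd_branches = ["internal_2"]

# Start with no WGD anywhere, then mark every leaf downstream of a WGD branch
wgd_status = {leaf: False for leaf in leaves_below(children, "root")}
for node in wgd_branches:
    for leaf in leaves_below(children, node):
        wgd_status[leaf] = True

print(wgd_status)
```

Only `sample_B` and `sample_C` sit below `internal_2`, so only those two leaves are marked.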

In the second approach, we calculate the optimal copy-number events directly from the diploid state for every sample individually. If the phylogenetic tree and the ancestral states are correct, this leads to exactly the same events and therefore the same WGD calls as in approach 1.
In some cases, however, MEDICC2 is not able to accurately infer the tree structure and/or the ancestral states. This mainly happens in single-cell experiments with hundreds to thousands of samples and a high noise-to-signal ratio. In such cases, MEDICC2 may miss a WGD on the corresponding branch.

If the WGD calls differ between the two approaches, the phylogeny reconstruction has likely gone wrong, i.e. the reconstructed tree is not accurate. In this case, we recommend trusting the WGD calls made per sample rather than those derived from the tree.

Additionally, approach 2 can be run with bootstrapping, where the WGD detection is repeated multiple times on resampled inputs. This is the most accurate approach for WGD detection but also the most time-intensive. We usually apply a threshold of 5% to the fraction of bootstrap runs in which a WGD is detected. See the MEDICC2 publication for details.
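
The bootstrap idea can be illustrated with a small standard-library sketch. It assumes that a chromosome-wise bootstrap resamples whole chromosomes with replacement, and it uses a deliberately crude `toy_detector` (a hypothetical stand-in, not MEDICC2's WGD detection) so that the example runs on its own. All names below are made up for illustration.

```python
import random


def chr_wise_resample(profile, rng):
    """Draw whole chromosomes with replacement from `profile` (chromosome -> copy numbers)."""
    chroms = list(profile)
    return [profile[c] for c in rng.choices(chroms, k=len(chroms))]


def toy_detector(chromosomes):
    """Hypothetical stand-in for a WGD test: True if most segments have copy number >= 2."""
    segments = [cn for chrom in chromosomes for cn in chrom]
    return sum(cn >= 2 for cn in segments) > len(segments) / 2


def bootstrap_wgd_fraction(profile, detector, n_bootstrap=100, seed=0):
    """Fraction of bootstrap replicates in which `detector` reports a WGD."""
    rng = random.Random(seed)
    hits = sum(detector(chr_wise_resample(profile, rng)) for _ in range(n_bootstrap))
    return hits / n_bootstrap


# Made-up per-allele copy numbers for one sample
profile = {"chr1": [2, 2, 2], "chr2": [2, 2], "chr3": [1]}
fraction = bootstrap_wgd_fraction(profile, toy_detector)

# Assumed reading of the 5% threshold: call a WGD if the fraction exceeds it
has_wgd = fraction > 0.05
print(fraction, has_wgd)
```

Because whole chromosomes are resampled, a sample whose apparent WGD rests on a few chromosomes will produce an intermediate fraction rather than 0 or 1.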

# Imports and definitions

In [1]:
#%% import and load data
import os
from pathlib import Path
import sys
import pandas as pd

sys.path.append('..')
import medicc
import fstlib

In [2]:
fst = medicc.io.read_fst()
fst_nowgd = medicc.io.read_fst(no_wgd=True)
symbol_table = fst.input_symbols()

# Run WGD detection for all samples in Gundem et al. 2015

In [3]:
data_folder = "../examples/gundem_et_al_2015"
                   
patients = [f.split('_')[0] for f in os.listdir(data_folder) if 'input_df.tsv' in f]
patients.sort()

In [4]:
results = dict()
for patient in patients:
    # Create DataFrame
    input_df = medicc.io.read_and_parse_input_data(
        os.path.join(data_folder, "{}_input_df.tsv".format(patient)))
    wgd_status = pd.DataFrame(index=input_df.index.get_level_values('sample_id').unique())
    wgd_status['in_tree'] = False
    wgd_status['individually'] = False
    wgd_status['individually_bootstrap'] = 0

    results[patient] = wgd_status



## Approach 1: Create phylogenetic tree and infer WGD status from tree

In [5]:
for patient in patients:
    wgd_status = results[patient]

    # Load data
    input_df = medicc.io.read_and_parse_input_data(
        os.path.join(data_folder, "{}_input_df.tsv".format(patient)))

    # run MEDICC2's main routine 
    sample_labels, pairwise_distances, nj_tree, final_tree, output_df, events_df = medicc.main(
        input_df,
        fst,
        reconstruct_events=True)

    # Find WGD nodes from the phylogenetic tree
    wgd_nodes = events_df.loc[events_df['type'] == 'wgd'].index.get_level_values(0).unique()
    for wgd_node in wgd_nodes:
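        # All terminal nodes (samples) below the WGD branch inherit the WGD status
        wgd_clade = list(final_tree.find_clades(wgd_node))[0]
        wgd_samples = [x.name for x in wgd_clade.get_terminals() if x.name is not None]
        wgd_status.loc[wgd_samples, 'in_tree'] = True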
    



## Approach 2: Calculate WGD status for each sample individually

In [6]:
for patient in patients:
    wgd_status = results[patient]

    # Load data
    input_df = medicc.io.read_and_parse_input_data(
        os.path.join(data_folder, "{}_input_df.tsv".format(patient)))
    
    samples = input_df.index.get_level_values('sample_id').unique()
    for sample in samples:
        cur_wgd_status = medicc.core.detect_wgd(input_df, sample, total_cn=False)
        if cur_wgd_status:
            wgd_status.loc[sample, 'individually'] = True

    results[patient] = wgd_status



## Approach 2.5: Calculate WGD status for each sample individually with bootstrap

In [7]:
N_bootstrap = 100

for patient in patients:
    wgd_status = results[patient]

    # Load data
    input_df = medicc.io.read_and_parse_input_data(
        os.path.join(data_folder, "{}_input_df.tsv".format(patient)))

    samples = input_df.index.get_level_values('sample_id').unique()
    for _ in range(N_bootstrap):
        cur_bootstrap_df = medicc.bootstrap.chr_wise_bootstrap_df(input_df)
        for sample in samples:
            has_wgd = medicc.core.detect_wgd(cur_bootstrap_df, sample)
            if has_wgd:
                wgd_status.loc[sample, 'individually_bootstrap'] += 1

    wgd_status['individually_bootstrap'] =  wgd_status['individually_bootstrap'] / N_bootstrap
    results[patient] = wgd_status



## Results

In [8]:
for patient in patients:
    print(patient)
    if results[patient].any().any():
        print('WGD detected')
        display(results[patient])
    else:
        print('no WGD detected')
    print('')

PTX004
no WGD detected

PTX005
no WGD detected

PTX006
no WGD detected

PTX007
WGD detected


| sample_id | in_tree | individually | individually_bootstrap |
|---|---|---|---|
| LAdrenalMet_A21D-0096_CRUK_PC_0096_M3 | False | False | 0.0 |
| LClavicleLNMet_A21I-0096_CRUK_PC_0096_M8 | False | False | 0.0 |
| LIliacCrestSoftMet_A21J-0096_CRUK_PC_0096_M9 | False | False | 0.0 |
| LRib5MassMet_A21A-0096_CRUK_PC_0096_M1 | False | False | 0.0 |
| MultLiverMet13_A21H-0096_CRUK_PC_0096_M7 | False | False | 0.0 |
| RRibNodularMet_A21F-0096_CRUK_PC_0096_M5 | True | True | 1.0 |
| SingleLiverMet2_A21G-0096_CRUK_PC_0096_M6 | False | False | 0.0 |
| SingleLiverMet4_A21E-0096_CRUK_PC_0096_M4 | False | False | 0.0 |
| SingleLiverMet8_A21C-0096_CRUK_PC_0096_M2 | False | False | 0.0 |
| diploid | False | False | 0.0 |



PTX008
WGD detected


| sample_id | in_tree | individually | individually_bootstrap |
|---|---|---|---|
| ApicalProstateCA_A22C-0016_CRUK_PC_0016_T1 | False | False | 0.04 |
| BladderMet_A22G-0016_CRUK_PC_0016_M5 | False | False | 0.18 |
| LAdrenalMet_A22E-0016_CRUK_PC_0016_M3 | False | False | 0.12 |
| LHumerusBoneMarrowMet_A22A-0016_CRUK_PC_0016_M1 | False | False | 0.04 |
| LPelvicLN5Met_A22K-0016_CRUK_PC_0016_M9 | False | False | 0.1 |
| LPelvicLN8Met_A22I-0016_CRUK_PC_0016_M7 | False | False | 0.11 |
| PelvicLN7Met_A22H-0016_CRUK_PC_0016_M6 | False | False | 0.21 |
| RAdrenalMet_A22F-0016_CRUK_PC_0016_M4 | False | False | 0.36 |
| RPelvicLN12Met_A22J-0016_CRUK_PC_0016_M8 | False | False | 0.03 |
| SeminalVesicleMet_A22D-0016_CRUK_PC_0016_M2 | False | False | 0.15 |



PTX009
no WGD detected

PTX010
WGD detected


| sample_id | in_tree | individually | individually_bootstrap |
|---|---|---|---|
| ProstateCA_A29C-0017_CRUK_PC_0017_T1 | True | True | 1.0 |
| RSuperficialInguinalLNMetA1_A29A-0017_CRUK_PC_0017_M1 | True | True | 1.0 |
| diploid | False | False | 0.0 |



PTX011
WGD detected


| sample_id | in_tree | individually | individually_bootstrap |
|---|---|---|---|
| LAdrenalMet_A31E-0018_CRUK_PC_0018_M3 | True | True | 1.0 |
| Prostate1-1-2CA_A31C-0018_CRUK_PC_0018_T1 | False | False | 0.0 |
| RIngLNMet_A31D-0018_CRUK_PC_0018_M2 | True | True | 1.0 |
| RRib7Met_A31F-0018_CRUK_PC_0018_M4 | True | True | 1.0 |
| RSubduralMet_A31A-0018_CRUK_PC_0018_M1 | True | True | 1.0 |
| diploid | False | False | 0.0 |



PTX012
WGD detected


| sample_id | in_tree | individually | individually_bootstrap |
|---|---|---|---|
| LCervicalLNMet1-2_A32D-0019_CRUK_PC_0019_M2 | True | True | 1.0 |
| LSubclavicularLNMet1-5_A32E-0019_CRUK_PC_0019_M3 | True | True | 1.0 |
| Prostate10-1-3CA_A32C-0019_CRUK_PC_0019_T1 | True | True | 1.0 |
| RHumerusMet1-12_A32F-0019_CRUK_PC_0019_M4 | True | True | 1.0 |
| RRib8Met1_A32A-0019_CRUK_PC_0019_M1 | True | True | 1.0 |
| diploid | False | False | 0.0 |



PTX013
no WGD detected



NOTE: Here we find 4 patients with WGDs. Approaches 1 and 2 lead to the same results for all patients, which is expected for small trees and high-quality copy-number inputs.
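
For larger cohorts, the agreement between approaches 1 and 2 could be checked programmatically rather than by eye. The sketch below is one way to do this on plain dictionaries; the helper `discordant_samples` and the row named `hypothetical_sample` are made up for illustration, while the other two rows are copied from the PTX011 output above.

```python
def discordant_samples(wgd_table):
    """Return the samples whose tree-based and per-sample WGD calls differ."""
    return sorted(
        sample
        for sample, row in wgd_table.items()
        if row["in_tree"] != row["individually"]
    )


toy_table = {
    # Copied from the PTX011 output
    "LAdrenalMet_A31E-0018_CRUK_PC_0018_M3": {"in_tree": True, "individually": True},
    "diploid": {"in_tree": False, "individually": False},
    # Made-up discordant row, only here to show what a mismatch looks like
    "hypothetical_sample": {"in_tree": False, "individually": True},
}

flagged = discordant_samples(toy_table)
print(flagged)
```

With the DataFrames built in this notebook, `results[patient].to_dict('index')` should yield a dictionary of the same shape.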

Bootstrapping with the usual threshold of 5% leads to a different WGD result for patient PTX008. However, the fraction of bootstrap runs that detect a WGD is comparatively low, so this result should be treated with care and the original copy-number profiles inspected.
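
To make the effect of the threshold concrete, the following sketch applies it to the PTX008 bootstrap fractions listed in the output above. It assumes that a sample is called WGD when its fraction exceeds the threshold; the variable names are made up for this example.

```python
# Bootstrap fractions for PTX008, copied from the output above
ptx008_fractions = {
    "ApicalProstateCA_A22C-0016_CRUK_PC_0016_T1": 0.04,
    "BladderMet_A22G-0016_CRUK_PC_0016_M5": 0.18,
    "LAdrenalMet_A22E-0016_CRUK_PC_0016_M3": 0.12,
    "LHumerusBoneMarrowMet_A22A-0016_CRUK_PC_0016_M1": 0.04,
    "LPelvicLN5Met_A22K-0016_CRUK_PC_0016_M9": 0.1,
    "LPelvicLN8Met_A22I-0016_CRUK_PC_0016_M7": 0.11,
    "PelvicLN7Met_A22H-0016_CRUK_PC_0016_M6": 0.21,
    "RAdrenalMet_A22F-0016_CRUK_PC_0016_M4": 0.36,
    "RPelvicLN12Met_A22J-0016_CRUK_PC_0016_M8": 0.03,
    "SeminalVesicleMet_A22D-0016_CRUK_PC_0016_M2": 0.15,
}

THRESHOLD = 0.05  # assumed meaning: call a WGD if the fraction exceeds this value

bootstrap_calls = {sample: frac > THRESHOLD for sample, frac in ptx008_fractions.items()}
n_called = sum(bootstrap_calls.values())
print(f"{n_called} of {len(bootstrap_calls)} samples exceed the threshold")
```

Under this reading, 7 of the 10 PTX008 samples exceed the threshold even though neither approach 1 nor approach 2 calls a WGD, which is why the profiles deserve a closer look.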

# Detecting multiple WGDs

Here, we show how to detect multiple WGDs in your data. The data from patient PTX011 is adjusted such that sample `LAdrenalMet_A31E-0018_CRUK_PC_0018_M3` exhibits two WGDs and sample `Prostate1-1-2CA_A31C-0018_CRUK_PC_0018_T1` exhibits three WGDs.

In [18]:
patient = 'PTX011'

# Load data
input_df = medicc.io.read_and_parse_input_data(
    os.path.join(data_folder, "{}_input_df.tsv".format(patient)))
samples = input_df.index.get_level_values('sample_id').unique()
columns = ['any WGDs', 'two WGDs', 'three or more WGDs']

# Adjust two samples such that they exhibit multiple WGDs
rename_dict = {samples[0]: 'two_WGD_sample', samples[1]: 'three_WGD_sample'}
input_df = input_df.reset_index()
input_df['sample_id'] = input_df['sample_id'].replace(rename_dict)
input_df = input_df.set_index(['sample_id', 'chrom', 'start', 'end'])
input_df.loc['two_WGD_sample'] = '3'
input_df.loc['three_WGD_sample'] = '6'
samples = input_df.index.get_level_values('sample_id').unique()
multiple_wgd_status = pd.DataFrame(False, index=samples, columns=columns)

for sample in samples:
    for n_wgd, col in zip([None, 1, 2], columns):
        cur_wgd_status = medicc.core.detect_wgd(input_df, sample, total_cn=False, n_wgd=n_wgd)
        if cur_wgd_status:
            multiple_wgd_status.loc[sample, col] = True



In [17]:
multiple_wgd_status

| sample_id | any WGDs | two WGDs | three or more WGDs |
|---|---|---|---|
| LAdrenalMet_A31E-0018_CRUK_PC_0018_M3 | False | False | False |
| Prostate1-1-2CA_A31C-0018_CRUK_PC_0018_T1 | False | False | False |
| RIngLNMet_A31D-0018_CRUK_PC_0018_M2 | True | False | False |
| RRib7Met_A31F-0018_CRUK_PC_0018_M4 | True | False | False |
| RSubduralMet_A31A-0018_CRUK_PC_0018_M1 | True | False | False |
| diploid | False | False | False |
| two_WGD_sample | True | True | |
| three_WGD_sample | True | True | True |
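
The three boolean columns can be collapsed into a single label per sample. The sketch below assumes the columns are nested (a sample with three or more WGDs is also flagged in the `two WGDs` and `any WGDs` columns, as the `three_WGD_sample` row suggests); the helper `wgd_count_label` is hypothetical.

```python
def wgd_count_label(any_wgd, two_wgd, three_plus_wgd):
    """Collapse the three nested WGD flags into a single label."""
    if three_plus_wgd:
        return "3+"
    if two_wgd:
        return "2"
    if any_wgd:
        return "1"
    return "0"


# Rows copied from the table above: (any WGDs, two WGDs, three or more WGDs)
rows = {
    "diploid": (False, False, False),
    "RIngLNMet_A31D-0018_CRUK_PC_0018_M2": (True, False, False),
    "three_WGD_sample": (True, True, True),
}

labels = {sample: wgd_count_label(*flags) for sample, flags in rows.items()}
print(labels)
```

Checking the strictest flag first keeps the function correct even though several flags are true at once for the same sample.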
