# Define Event Boundaries

This notebook shows where the gain and loss regions fall on a chromosome, so you can decide which boundaries to define for a particular event you observed in the plots.

Use the chart at the end of the notebook to determine which boundaries make the most sense. The chart is interactive as long as JavaScript is not blocked in your browser: if you mouse over a region, a box pops up showing the start and end of that region, as well as the number of cancer types in which at least 20% of patients show that event:

![tooltip_example](data/tooltip_example.png)

You can also refer to the gain/loss region tables in this notebook, and the plots from the previous notebooks. Also keep in mind where the centromere is.

For each arm, record the start and end boundaries for the event on that arm in the `00_setup/00_set_arm_parameters.ipynb` notebook in the folder for that arm.

For example, if I were looking at 6 cancer types, I would ideally choose a boundary that includes every region where all 6 cancer types have the event and excludes regions where only a few do. However, the regions in the table won't always coincide neatly with the plot, so sometimes you need to include a region with just 5 cancer types to more closely match what you see.
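The selection rule above can be sketched in code. This is only an illustration, not part of the notebook's pipeline: the `suggest_boundary` helper and its `min_count` parameter are hypothetical, and the table is a toy frame that borrows the `start`, `end`, and `counts` column names from the region tables produced later in this notebook.

```python
import pandas as pd

def suggest_boundary(regions, min_count):
    """Return (start, end) spanning every region with counts >= min_count.

    Hypothetical helper; returns None if no region meets the threshold.
    """
    keep = regions[regions["counts"] >= min_count]
    if keep.empty:
        return None
    return int(keep["start"].min()), int(keep["end"].max())

# Toy region table (made-up coordinates, not real data)
toy_regions = pd.DataFrame({
    "start":  [0, 100, 400, 900],
    "end":    [100, 400, 900, 1200],
    "counts": [1, 6, 5, 2],
})

strict = suggest_boundary(toy_regions, min_count=6)   # only the 6-count region
relaxed = suggest_boundary(toy_regions, min_count=5)  # also admits the 5-count region
print(strict, relaxed)
```

Because the helper takes the smallest start and largest end, the span can include lower-count regions that lie between qualifying ones, so the result should still be checked against the plot.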

## Setup

In [1]:
import cnvutils
import json
import numpy as np
import os
import pandas as pd
import seaborn as sns
import altair as alt

In [2]:
gen_params = cnvutils.load_params(os.path.join("..", "data", "gen_params.json"))
PANCAN = gen_params["PANCAN"]

chr_params = cnvutils.load_params(os.path.join("data", "chr_params.json"))
CANCER_TYPES = chr_params["CHR_CANCER_TYPES"]
CHROMOSOME = chr_params["CHROMOSOME"]
CUTOFF_PERCENT = chr_params["GENE_CNV_PROPORTION_CUTOFF"]

In [3]:
counts = pd.read_csv(os.path.join("data", f"chr{CHROMOSOME:0>2}_cnv_counts_{'harmonized' if PANCAN else 'AWG'}.tsv"), sep='\t')

In [4]:
# For each cancer type, calculate the minimum number of patients that need to have a CNV amplification
# or deletion of a gene for us to consider that gene significantly amplified or deleted in that cancer
# type, based on the CUTOFF_PERCENT parameter. As of 12 Feb 2022, this is set at 20% of the total number
# of patients in the cancer type.
cutoffs = dict()

for cancer_type in CANCER_TYPES:
    cutoffs[cancer_type] = counts[counts["cancer"] == cancer_type]["cancer_type_total_patients"].iloc[0] * CUTOFF_PERCENT
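To make the cutoff arithmetic concrete, here is a small sketch on made-up numbers. The column names mirror those used above; the patient totals and the 0.2 proportion are illustrative assumptions, not values read from the real parameter files.

```python
import pandas as pd

# Toy stand-in for the counts table (values are invented)
toy_counts = pd.DataFrame({
    "cancer": ["luad", "luad", "brca"],
    "cancer_type_total_patients": [110, 110, 95],
})
toy_cutoff_percent = 0.2  # assumed proportion, i.e. 20%

# Take one patient total per cancer type and scale it by the proportion
toy_cutoffs = (
    toy_counts.groupby("cancer")["cancer_type_total_patients"].first()
    * toy_cutoff_percent
).to_dict()
print(toy_cutoffs)
```

With these toy totals, a gene would need to be altered in more than about 22 `luad` patients or 19 `brca` patients to pass the comparison used in the cells that follow.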

## Find Loss Regions

For each cancer type, find regions that are lost in 20% or more of patients (or a different percentage, if the `CUTOFF_PERCENT` parameter has been changed).

In [5]:
df = counts
loss_event_locations = dict()
for cancer in CANCER_TYPES:
    
    df_loss = df[(df.variable == 'loss') & (df.cancer == cancer)].sort_values('start_bp')
    values = list(df_loss.value)
    loss_events = list()
    start = None
    for i in range(0, len(values)):
        val = values[i]
        if val > cutoffs[cancer]:
            if start is None:
                start = i
        else:
            if start is not None:
                loss_events.append((start, i))
                start = None
    if start is not None:
        loss_events.append((start, len(values)-1))
    event_locations = list()
    for event in loss_events:
        start_bp = df_loss.iloc[event[0]].start_bp
        end_bp = df_loss.iloc[event[1]].start_bp
        event_locations.append((start_bp, end_bp-start_bp))
    loss_event_locations[cancer] = event_locations

In [6]:
loss_event_locations["luad"]

[(406428.0, 40124162.0)]
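The output above is a list of `(start_bp, length_bp)` tuples, one per contiguous run of genes above the cutoff. The run-finding step can be sketched on a plain list; `find_runs` is a hypothetical name and the values are invented, but the logic follows the loop in the loss-region cell above.

```python
def find_runs(values, cutoff):
    """Return (first_index, end_index) pairs for runs where value > cutoff.

    As in the loop above, end_index is the first position after the run,
    or the last position when the run reaches the end of the list.
    """
    runs = []
    start = None
    for i, val in enumerate(values):
        if val > cutoff:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            runs.append((start, i))  # the run ended just before i
            start = None
    if start is not None:
        runs.append((start, len(values) - 1))
    return runs

toy_values = [1, 5, 6, 2, 7, 8]
toy_runs = find_runs(toy_values, cutoff=4)
print(toy_runs)
```

Each index pair is then translated to base-pair positions by looking up `start_bp` at both indices, which is how the start and length in the tuple are obtained.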

In [7]:
loss_event_patients = list()
for cancer in loss_event_locations.keys():
    events = loss_event_locations[cancer]
    for event in events:
        start = event[0]
        end = event[0] + event[1]
        loss_event_patients.append((start, 1))  # 1 marks the start of a region
        loss_event_patients.append((end, 0))    # 0 marks the end of a region
# Sort the markers by position so they can be swept in order
loss_event_patients.sort()

In [8]:
count = 0
current_bp = 0
start = list()
end = list()
size = list()
total = list()
for patient in loss_event_patients:
    if patient[0] != current_bp:
        start.append(current_bp)
        end.append(patient[0])
        size.append(patient[0]-current_bp)
        total.append(count)
        current_bp = patient[0]
    if patient[1] == 1:
        count += 1
    else:
        count -= 1
loss_event_data = pd.DataFrame({'start': start, 'end': end, 'counts': total, 'length': size}).sort_values('start')
loss_event_data

| | start | end | counts | length |
|---|---|---|---|---|
| 0 | 0.0 | 166049.0 | 0 | 166049.0 |
| 1 | 166049.0 | 406428.0 | 1 | 240379.0 |
| 2 | 406428.0 | 2935353.0 | 6 | 2528925.0 |
| 3 | 2935353.0 | 7355517.0 | 7 | 4420164.0 |
| 4 | 7355517.0 | 8701937.0 | 6 | 1346420.0 |
| 5 | 8701937.0 | 12104389.0 | 7 | 3402452.0 |
| 6 | 12104389.0 | 12721906.0 | 6 | 617517.0 |
| 7 | 12721906.0 | 31639222.0 | 7 | 18917316.0 |
| 8 | 31639222.0 | 36784324.0 | 6 | 5145102.0 |
| 9 | 36784324.0 | 37695782.0 | 5 | 911458.0 |


The `counts` column is the number of cancer types in which that region is lost.
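The table is built with a sweep over sorted region endpoints: each region start adds one to a running count and each region end subtracts one. A minimal sketch of that idea on invented intervals follows; `count_overlaps` is a hypothetical helper, not part of `cnvutils`.

```python
def count_overlaps(intervals):
    """Split the axis at every endpoint and count intervals covering each piece.

    intervals: list of (start, end) pairs. Returns (start, end, count) rows,
    beginning at position 0 like the table above.
    """
    # 1 marks an interval opening, 0 marks it closing
    points = sorted(
        [(s, 1) for s, _ in intervals] + [(e, 0) for _, e in intervals]
    )
    rows = []
    count = 0
    current = 0
    for pos, is_start in points:
        if pos != current:
            # Close the piece that ends here, using the count before this marker
            rows.append((current, pos, count))
            current = pos
        count += 1 if is_start else -1
    return rows

toy_intervals = [(10, 50), (20, 40)]
toy_rows = count_overlaps(toy_intervals)
print(toy_rows)
```

As in the notebook's table, nothing is emitted past the last endpoint, so the final row ends where the last region ends.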

## Find Gain Regions

For each cancer type, find regions that are gained in 20% or more of patients (or a different percentage, if the `CUTOFF_PERCENT` parameter has been changed).

In [9]:
df = counts
gain_event_locations = dict()
for cancer in CANCER_TYPES:
   
    df_gain = df[(df.variable == 'gain') & (df.cancer == cancer)].sort_values('start_bp')
    values = list(df_gain.value)
    gain_events = list()
    start = None
    for i in range(0, len(values)):
        val = values[i]
        if val > cutoffs[cancer]:
            if start is None:
                start = i
        else:
            if start is not None:
                gain_events.append((start, i))
                start = None
    if start is not None:
        gain_events.append((start, len(values)-1))
    event_locations = list()
    for event in gain_events:
        start_bp = df_gain.iloc[event[0]].start_bp
        end_bp = df_gain.iloc[event[1]].start_bp
        event_locations.append((start_bp, end_bp-start_bp))
    gain_event_locations[cancer] = event_locations

In [10]:
gain_event_patients = list()
for cancer in gain_event_locations.keys():
    events = gain_event_locations[cancer]
    for event in events:
        start = event[0]
        end = event[0] + event[1]
        gain_event_patients.append((start, 1))  # 1 marks the start of a region
        gain_event_patients.append((end, 0))    # 0 marks the end of a region
# Sort the markers by position so they can be swept in order
gain_event_patients.sort()

In [11]:
count = 0
current_bp = 0
start = list()
end = list()
size = list()
total = list()
for patient in gain_event_patients:
    if patient[0] != current_bp:
        start.append(current_bp)
        end.append(patient[0])
        size.append(patient[0]-current_bp)
        total.append(count)
        current_bp = patient[0]
    if patient[1] == 1:
        count += 1
    else:
        count -= 1
gain_event_data = pd.DataFrame({'start': start, 'end': end, 'counts': total, 'length': size}).sort_values('start')
gain_event_data

| | start | end | counts | length |
|---|---|---|---|---|
| 0 | 0.0 | 8701937.0 | 0 | 8701937.0 |
| 1 | 8701937.0 | 12104389.0 | 1 | 3402452.0 |
| 2 | 12104389.0 | 14084845.0 | 0 | 1980456.0 |
| 3 | 14084845.0 | 18170477.0 | 1 | 4085632.0 |
| 4 | 18170477.0 | 18391282.0 | 0 | 220805.0 |
| 5 | 18391282.0 | 20246165.0 | 1 | 1854883.0 |
| 6 | 20246165.0 | 22089150.0 | 0 | 1842985.0 |
| 7 | 22089150.0 | 22165140.0 | 1 | 75990.0 |
| 8 | 22165140.0 | 30384511.0 | 0 | 8219371.0 |
| 9 | 30384511.0 | 31639222.0 | 1 | 1254711.0 |


The `counts` column is the number of cancer types in which that region is gained.

## Plot Loss and Gain Regions

Plotting the loss and gain regions makes them easy to see at a glance, which helps when deciding on event boundaries.

In [12]:
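def make_pancan_gain_loss_region_plot(chrm, gains, losses):
    """Plot gain and loss regions for a chromosome next to its cytoband diagram."""
    
    # Join the gain and loss data; loss counts are negated so they plot left of zero
    gains = gains.assign(event="gain")
    losses = losses.assign(event="loss", counts=losses["counts"] * -1)
    events = pd.concat([gains, losses])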
    
    # Make the gains and losses plot
    events_plot = alt.Chart(events).mark_rect().encode(
        x=alt.X(
            "counts",
            title="Number of cancers with gain (positive) or loss (negative)",
        ),
        y=alt.Y(
            "start",
            title=None,
            axis=alt.Axis(
                labels=False,
                ticks=False,
                values=list(range(0, events["end"].astype(int).max(), 5000000)),
            )
        ),
        y2="end",
        color=alt.Color(
            "event",
            scale=alt.Scale(
                domain=["gain", "loss"],
                range=["darkred", "darkblue"],
            ),
        ),
        tooltip=alt.Tooltip(
            ["start", "end", "counts"],
            format=","
        ),
    )
    
    # Get the cytoband plot
    cytobands = cnvutils.make_cytoband_plot(chrm)
    
    # Concatenate the plots
    events_plot = alt.hconcat(
        cytobands,
        events_plot,
        bounds="flush"
    ).resolve_scale(
        color="independent",
        y="shared",
    )
    
    return events_plot
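The function begins by stacking the two region tables into one frame, negating the loss counts so that losses extend to the left of zero and gains to the right. A minimal sketch of that step on toy frames, using `pd.concat` (all values are invented):

```python
import pandas as pd

toy_gains = pd.DataFrame({"start": [0, 100], "end": [100, 300], "counts": [0, 2]})
toy_losses = pd.DataFrame({"start": [0, 50], "end": [50, 300], "counts": [1, 6]})

toy_events = pd.concat(
    [
        toy_gains.assign(event="gain"),
        # Negate loss counts so they plot on the opposite side of zero
        toy_losses.assign(event="loss", counts=toy_losses["counts"] * -1),
    ],
    ignore_index=True,
)
print(toy_events)
```

Keeping both event types in one long-format frame with an `event` column is what lets a single chart color the bars by event type.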

## Chart for Deciding Event Boundaries

If you mouse over the gain and loss regions on the chart below (and JavaScript is not blocked in your browser), a box will appear showing the start and end of each region and the number of cancer types that have that event in that region in at least 20% of their patients. (The cancer type count is negative for loss regions.)

For each arm, record the start and end boundaries for the event on that arm in the `00_setup/00_set_arm_parameters.ipynb` notebook in the folder for that arm.

In [13]:
make_pancan_gain_loss_region_plot(CHROMOSOME, gain_event_data, loss_event_data)

