# SCNA analysis step 3: Find and summarize SCNA events

- Input: SCNA table
- Output: Event table (summary of individual genes' SCNA reads)
    1. Sort SCNA table by these columns in this order, all ascending: [cancer_type, Patient_ID, chromosome, first]
    2. Identify individual events.
        1. We will define an event as a set of adjacent genes that are all up or all down. If we want to get fancy later, and if values between genes are directly comparable, we can also check for values not deviating too far. (We probably wouldn't be able to define an acceptable range without first identifying the whole potential group.)
        2. Start a counter at zero
        3. Iterate over dataframe, assigning the current value of the counter to each row. Before assignment, increment the counter if any of the following conditions are met:
            1. We are onto a new chromosome or new sample
            2. The current value has a different sign than the previous value, or didn't pass the cutoff. However, allow some gaps--start experimenting with 1 or 2.
    3. Summarize each event.
        1. Group by event number created in the previous step
        2. For each group, create the following values in a new summary dataframe:
            1. A list of the genes contained in it
            2. The min of the "first" column
            3. The max of the "last" column
            4. The average of the "cna_val" column
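As a rough illustration of step 2, here is a minimal sketch of the strict version of the event numbering, with no cutoff check and no gap tolerance. The toy table and its values are made up for illustration; only the column names follow this notebook.

```python
import numpy as np
import pandas as pd

# Toy SCNA table (hypothetical values; column names mirror this notebook)
toy = pd.DataFrame({
    "cancer_type": ["brca"] * 6,
    "Patient_ID": ["P1"] * 6,
    "chromosome": ["1", "1", "1", "1", "2", "2"],
    "first": [100, 200, 300, 400, 100, 200],
    "cna_val": [0.5, 0.4, -0.3, -0.6, -0.5, -0.4],
})

keys = ["cancer_type", "Patient_ID", "chromosome"]
toy = toy.sort_values(keys + ["first"]).reset_index(drop=True)

# A row starts a new block when its (cancer type, sample, chromosome) changes
block = toy.groupby(keys, sort=False).ngroup()
new_block = block.diff() != 0  # The first row compares against NaN, so it is True

# A row also starts a new event when the sign of cna_val flips
sign_flip = np.sign(toy["cna_val"]).diff() != 0

# The running count of "event starts" gives each row its event number
toy["event"] = (new_block | sign_flip).cumsum() - 1
print(toy["event"].tolist())
```

The loop used later in this notebook is needed because tolerating small gaps requires looking ahead, which this vectorized form does not do.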

## Setup

In [9]:
import cptac
import pandas as pd
import numpy as np
import datetime
import os

TIME_START = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
MAX_SKIP = 1

# Note: If you are running this yourself, you will need to first run
# the step 2 notebook (02_get_locations) and sub in the appropriate output
# file name for STEP2_FILE_NAME. We didn't include the output files in
# the repository because they exceed GitHub's 100 MB per file limit.

STEP2_DIR = "02_outputs"
STEP2_FILE_NAME = "locations_scna_cutoff_0.2_20200706_092210.tsv.gz"
STEP2_FILE_PATH = os.path.join(STEP2_DIR, STEP2_FILE_NAME)

STEP3_DIR = "03_outputs"
if not os.path.isdir(STEP3_DIR):
    os.mkdir(STEP3_DIR)
    
STEP3_FILE_PATH = os.path.join(STEP3_DIR, f"summary_{TIME_START}_skip_{MAX_SKIP}_from_{STEP2_FILE_NAME}")

In [10]:
print(STEP3_FILE_PATH)

03_outputs/summary_20200708_154024_skip_1_from_locations_scna_cutoff_0.2_20200706_092210.tsv.gz


In [11]:
cnas = pd.read_csv(STEP2_FILE_PATH, sep="\t", dtype={"Database_ID": "O", "chromosome": "O"})

## Sort genes by location

Because the table is big and we're using Python, this uses a lot of RAM. To mitigate that, we perform these operations inside functions so that when the functions end, variables inside them will pass out of scope and be garbage collected.

In [12]:
def sort_cnas(table):
    return table.sort_values(by=["cancer_type", "Patient_ID", "chromosome", "first"])

def reset_drop_index(table):
    return table.reset_index(drop=True)

In [13]:
cnas = sort_cnas(cnas)
cnas = reset_drop_index(cnas)

## Identify individual events

- States:
    1. Last value
    2. Current chromosome
    3. Are we on a potentially new event?
    
    
- Goal:
    - Iterate through the genes
    - If we find a set of genes that is longer than MAX_SKIP and differs from the event we've been on, make it a new event.
    - If the set is no longer than MAX_SKIP and the event continues after it, just omit those genes--assign the group as NaN


- Method:
    - If we reach a row that has a different sign or does not pass the cutoff, look ahead and decide whether to skip it, by checking that the gap is no longer than MAX_SKIP and that the event really continues after it.
        - Look ahead using index lookup
        - If skip:
            - Insert as many NaNs as were in the gap
        - If not skip:
            - Increment counter to indicate a new event, and continue as normal
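The look-ahead decision can be sketched on a plain list of booleans. This is a simplified illustration, not the notebook's implementation: the function name `gap_to_skip` and the `continues` list are hypothetical, and the check for a new cancer type, sample, or chromosome is left out.

```python
def gap_to_skip(continues, start, max_skip):
    """Return how many rows to skip starting at `start`, or 0 for a new event.

    `continues[i]` is True when row i has the same sign as the current
    event and passes the cutoff. `continues[start]` is expected to be False.
    """
    gap_width = 1   # The row at `start` is itself part of the gap
    cont_width = 0  # Rows after the gap that resume the event

    # Look ahead at most 2 * max_skip rows; slicing truncates at the end
    for resumes in continues[start + 1: start + 1 + 2 * max_skip]:
        if resumes:
            cont_width += 1
        else:
            # A resumed stretch that was too short is absorbed into the gap
            gap_width += 1 + cont_width
            cont_width = 0

        if gap_width > max_skip:
            return 0  # Gap too big to skip
        if cont_width >= gap_width:
            return gap_width  # Event resumes for at least as long as the gap

    return 0  # Ran out of rows without confirming the event resumes


print(gap_to_skip([True, True, False, True, True], start=2, max_skip=1))
print(gap_to_skip([True, False, False, True], start=1, max_skip=1))
```

In the first call the single-gene gap is followed by a gene that resumes the event, so one row is skipped; in the second the gap is two genes wide, which exceeds `max_skip`, so a new event starts.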

In [None]:
# We're going to use pandas.DataFrame.itertuples to iterate over
# the CNA dataframe. It will return each row as a tuple. Below
# we assign to variables the indexes in a row's tuple that will
# correspond to the values of the different columns for the row.
# We pass None to the name parameter of itertuples and look values
# up by position, instead of using named tuples, because plain
# tuples are more efficient.
index_idx = 0
Patient_ID_idx = 1
gene_idx = 2
Database_ID_idx = 3
cna_val_idx = 4
cancer_type_idx = 5
passes_idx = 6
chromosome_idx = 7
start_idx = 8
end_idx = 9
first_idx = 10
last_idx = 11

## Set the variables we'll use as we iterate over the table

# This will be a list of numbers that we will later append as 
# a column to our CNA dataframe. All genes that we determine
# are part of the same event will have the same number in this
# column. All genes that we skip will have np.nan.
groups = []

# We will use this counter to keep track of which event we
# are on, so we know what group number to append to the 
# groups list for each gene.
counter = 0

# If we've looked ahead and determined that there are some
# genes that make a gap in the event, but the gap is small 
# enough to ignore, we'll use this counter to signal to
# skip that many genes.
num_ahead_to_skip = 0

# These variables will help us keep track of when it would
# be appropriate to say we're on a new event.
current_cancer = cnas.loc[0, "cancer_type"] # If we get to a new cancer type, we're obviously onto a new event
current_sample = cnas.loc[0, "Patient_ID"] # Same if we get to a new sample...
current_chr = cnas.loc[0, "chromosome"] # or a new chromosome.

# If the next gene's cna_val has a different sign than the previous 
# gene, it may be a new event. We'll figure that out below :D
current_val = cnas.loc[0, "cna_val"]

for row in cnas.itertuples(index=True, name=None):
    
    # We will execute this block if we've previously looked ahead 
    # in the table and determined that there's a set of genes to 
    # skip and not count in our current event.
    if num_ahead_to_skip > 0:
        groups.append(np.nan)
        num_ahead_to_skip -= 1
        continue
    
    # We always say we're on a new event if we reach a new 
    # cancer type, sample, or chromosome
    if (
        row[cancer_type_idx] != current_cancer or
        row[Patient_ID_idx] != current_sample or
        row[chromosome_idx] != current_chr
    ):
        counter += 1
    
    # If the cna_val changes sign, let's see if it's just a small,
    # ignorable gap in the event, or if we should count it as a new
    # event.
    elif (
        np.sign(row[cna_val_idx]) != np.sign(current_val) or 
        not row[passes_idx]
    ):
        
        # Slice out the next 2 * MAX_SKIP rows. Positional slicing with
        # iloc truncates at the end of the table instead of raising, so
        # near the end we simply get fewer (possibly zero) rows.
        num_rows = 2 * MAX_SKIP
        curr_index = row[index_idx]
        next_rows = cnas.iloc[(curr_index + 1): (curr_index + 1) + num_rows, :]
        
        # Keep track of how big this gap is, including the gene
        # we're on right now
        gap_width = 1
        
        # Keep track of how many genes after this gap continue 
        # the event we were on before
        cont_current_width = 0
        
        # Default to not skipping, in case the look-ahead loop below
        # ends without reaching a decision (e.g. at the end of the table)
        skip = False
        
        for ahead_row in next_rows.itertuples(index=True, name=None):
            # See if the gap is short enough to skip. Also require that
            # there be more genes after the gap that continue the current
            # trend, than are in the gap
            
            if (
                ahead_row[cancer_type_idx] != current_cancer or
                ahead_row[Patient_ID_idx] != current_sample or
                ahead_row[chromosome_idx] != current_chr
            ):
                # We reached a new cancer type, sample, or chromosome
                # before we got to a point where we could skip the gap. So,
                # don't skip the gap.
                skip = False
                break
            
            elif (
                np.sign(ahead_row[cna_val_idx]) == np.sign(current_val) and 
                ahead_row[passes_idx]
            ):
                # The conditions directly above were true, so this gene 
                # continued the event we were on, so it counts as continuing
                # the event past the gap.
                cont_current_width += 1
    
            else:
                # The current gene continues the gap.
                gap_width += 1
                
                # Because the cont_current_width segment (if there was one) that 
                # was continuing the event after the gap was shorter than the 
                # first continuous portion of the gap (because the 
                # 'cont_current_width >= gap_width' condition below never evaluated
                # to True), that segment is not counted as continuing the event 
                # anymore, and contributes to the gap.
                gap_width += cont_current_width
                
                # We have to start over trying to create a new contiguous
                # segment that continues the current event.
                cont_current_width = 0
                
            if gap_width > MAX_SKIP:
                # We found a gap too big to skip.
                skip = False
                break
                
            elif cont_current_width >= gap_width:
                # If we got here, there was a continuous segment after the gap 
                # that was longer than the gap, and continued the event we were 
                # on, and the gap was shorter than MAX_SKIP (because the check 
                # above was False), so we'll skip the gap.
                skip = True
                break
            
        if skip:
            
            # Indicate that we're skipping the gene we're on right now
            groups.append(np.nan) 
            
            # Indicate how many additional genes to skip
            num_ahead_to_skip = gap_width - 1 # gap_width already counts the current gene

            # This 'continue' applies to the outer for loop--the one iterating
            # over the entire CNA table. We skip the rest of this iteration
            # of the loop, because we're saying that the event hasn't changed
            # yet, so there's no need to update the variables below.
            continue
        else:
            assert num_ahead_to_skip == 0
            
            # The gap was too big to skip. We're on a new event.
            counter += 1
            
    # Append the group number indicating which event this current gene is part of
    groups.append(counter)
    
    # Update which cancer type, sample, chromosome, and value we're on
    current_cancer = row[cancer_type_idx]
    current_sample = row[Patient_ID_idx]
    current_chr = row[chromosome_idx]
    current_val = row[cna_val_idx]

We will manually convert the group list to a numpy array before appending it, to save RAM.

In [None]:
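# Skipped genes are marked with np.nan, which an unsigned integer
# dtype cannot represent, so we use a float dtype here.
groups_ar = np.array(groups, dtype=np.float64)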

In [None]:
cnas = cnas.assign(event=groups_ar)

## Drop rows that don't pass
We didn't drop them earlier because we still needed their record of when a group of genes has breaks.

In [None]:
cnas = cnas[cnas["passes"]]

## Drop rows we skipped

In [None]:
cnas = cnas[pd.notnull(cnas["event"])]

## Summarize events

In [None]:
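# Note: After the filtering above, the index labels are no longer
# contiguous, so a label lookup like x[0] would fail for most groups.
# The "first" aggregation takes the first row of each group by position.
summary = cnas.groupby("event").agg(**{
    "chromosome": ("chromosome", "first"),
    "cancer_type": ("cancer_type", "first"),
    "Patient_ID": ("Patient_ID", "first"),
    "genes": ("gene", list),
    "start": ("first", "min"),
    "end": ("last", "max"),
    "num_genes": ("gene", "size"),
    "avg_cna": ("cna_val", "mean") # TODO: Weight this by gene length!
})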
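The TODO about weighting the average by gene length could be approached along the following lines. This is only a sketch on made-up data, and it assumes that a gene's length can be approximated as `last - first`, which this notebook does not confirm.

```python
import numpy as np
import pandas as pd

# Toy table with hypothetical values; column names mirror this notebook
toy = pd.DataFrame({
    "event": [0, 0, 1],
    "first": [100, 300, 50],
    "last": [200, 600, 150],
    "cna_val": [0.2, 0.6, -0.4],
})

# Assumption: gene length is the distance between "first" and "last"
toy["length"] = toy["last"] - toy["first"]
toy["weighted"] = toy["cna_val"] * toy["length"]

# Weighted mean per event = sum(value * length) / sum(length)
sums = toy.groupby("event")[["weighted", "length"]].sum()
weighted_avg = sums["weighted"] / sums["length"]
print(weighted_avg)
```

Summing two helper columns per group avoids a Python-level function call for every event, which matters on a table this large.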

In [None]:
summary.to_csv(STEP3_FILE_PATH, index=False, compression="gzip", sep="\t")

In [None]:
summary