# SCNA analysis step 3: Find and summarize SCNA events

1. Summarize SCNA events
    - Input: SCNA table
    - Output: Event table (summary of individual genes' SCNA reads)
        1. Sort SCNA table by these columns in this order, all ascending: [cancer_type, Patient_ID, chromosome, first]
        2. Identify individual events.
            1. We will define an event as a set of adjacent genes that are all up or all down. If we want to get fancy later, and if values between genes are directly comparable, we can also check for values not deviating too far. (We probably wouldn't be able to define an acceptable range without first identifying the whole potential group.)
            2. Start a counter at zero
            3. Iterate over dataframe, assigning the current value of the counter to each row. Before assignment, increment the counter if any of the following conditions are met:
                1. We are onto a new chromosome or new sample
                2. The current value has a different sign than the previous value, or didn't pass the cutoff
        3. Summarize each event.
            1. Group by event number created in the previous step
            2. For each group, create the following values in a new summary dataframe:
                1. A list of the genes contained in it
                2. The min of the "first" column
                3. The max of the "last" column
                4. The average of the "cna_val" column

## Setup

In [1]:
import cptac
import pandas as pd
import numpy as np
import datetime
import os

TIME_START = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')

STEP2_DIR = "02_outputs"
STEP2_FILE_NAME = "locations_scna_cutoff_0.2_20200706_092210.tsv.gz"
STEP2_FILE_PATH = os.path.join(STEP2_DIR, STEP2_FILE_NAME)

STEP3_DIR = "03_outputs"
os.makedirs(STEP3_DIR, exist_ok=True)

STEP3_FILE_PATH = os.path.join(STEP3_DIR, f"summary_{TIME_START}_from_{STEP2_FILE_NAME}")

In [2]:
print(STEP3_FILE_PATH)

03_outputs/summary_20200706_142325_from_locations_scna_cutoff_0.2_20200706_092210.tsv.gz


In [3]:
cnas = pd.read_csv(STEP2_FILE_PATH, sep="\t", dtype={"Database_ID": "O", "chromosome": "O"})

## Sort genes by location

Because the table is big and pandas often returns a new copy from each operation, this step uses a lot of RAM. To mitigate that, we perform these operations inside functions, so that when a function returns, its local variables go out of scope and any intermediate objects they referenced can be garbage collected.

In [4]:
def sort_cnas(table):
    return table.sort_values(by=["cancer_type", "Patient_ID", "chromosome", "first"])

def reset_drop_index(table):
    return table.reset_index(drop=True)

In [5]:
cnas = sort_cnas(cnas)
cnas = reset_drop_index(cnas)
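
Note that `chromosome` is read as a string column, so the sort orders chromosomes lexicographically ("1", "10", "2", ...) rather than numerically. That should not matter for this step, since event detection only needs the genes within one sample and chromosome to be adjacent and ordered by `first`. A minimal sketch on a made-up table (gene names and positions are illustrative, not taken from the real data):

```python
import pandas as pd

# Toy rows, deliberately out of order; all values are made up.
toy = pd.DataFrame({
    "cancer_type": ["br", "br", "br", "br"],
    "Patient_ID": ["P1", "P1", "P1", "P1"],
    "chromosome": ["2", "10", "1", "1"],
    "first": [500, 100, 300, 200],
    "gene": ["G4", "G3", "G2", "G1"],
})

# Same sort keys as sort_cnas, followed by the same index reset.
toy_sorted = (
    toy.sort_values(by=["cancer_type", "Patient_ID", "chromosome", "first"])
       .reset_index(drop=True)
)

# Chromosome "10" sorts before "2" because the column holds strings.
print(toy_sorted[["chromosome", "first", "gene"]])
```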

## Identify individual events

In [6]:
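groups = []
counter = 0

# Seed the "previous row" state with the first row, so the first row
# only starts a new event if it fails the cutoff.
current_chr = cnas.loc[0, "chromosome"]
current_val = cnas.loc[0, "cna_val"]

# name=None yields plain tuples, so fields are read by position.
# Positions are inferred from the variables they update:
# row[3] is cna_val, row[5] is passes, row[6] is chromosome.
for row in cnas.itertuples(index=False, name=None):
    # Start a new event if this row fails the cutoff, is on a new
    # chromosome, or has a different sign than the previous row.
    if (not row[5] or
        row[6] != current_chr or
        np.sign(row[3]) != np.sign(current_val)):

        counter += 1

    groups.append(counter)
    current_chr = row[6]
    current_val = row[3]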
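
The same numbering can also be sketched with vectorized pandas operations. This is an alternative, not the method used in this notebook: `assign_events` is a hypothetical helper, it refers to columns by name instead of by tuple position, and it adds an explicit `Patient_ID` comparison to cover the "new sample" rule from the outline, which the loop appears to cover only indirectly through the chromosome check. It assumes a non-empty, already sorted table with a boolean `passes` column.

```python
import numpy as np
import pandas as pd

def assign_events(table):
    """Return one event number per row of an already sorted table."""
    # A row starts a new event if it fails the cutoff, or if its
    # chromosome, sample, or sign differs from the previous row.
    boundary = (
        ~table["passes"]
        | (table["chromosome"] != table["chromosome"].shift())
        | (table["Patient_ID"] != table["Patient_ID"].shift())
        | (np.sign(table["cna_val"]) != np.sign(table["cna_val"].shift()))
    )
    # The first row has no predecessor; as in the loop, it only starts
    # a new event if it fails the cutoff.
    boundary.iloc[0] = not table["passes"].iloc[0]
    # A running count of boundaries gives the event number.
    return boundary.cumsum().astype(np.uint32)

# Made-up example rows, already sorted.
toy = pd.DataFrame({
    "Patient_ID": ["P1", "P1", "P1", "P1", "P1", "P1", "P2"],
    "chromosome": ["1", "1", "1", "1", "1", "2", "2"],
    "cna_val": [0.5, 0.6, 0.1, 0.7, -0.4, -0.5, -0.6],
    "passes": [True, True, False, True, True, True, True],
})

events = assign_events(toy)
print(events.tolist())  # [0, 0, 1, 1, 2, 3, 4]
```

The non-passing third row shares an event number with the row after it, which is harmless because non-passing rows are dropped before summarizing. The last row gets its own event only because of the `Patient_ID` comparison: its chromosome and sign match the row before it.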

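We manually convert the list to a NumPy array with an explicit 32-bit unsigned dtype before adding it to the table, to save RAM. If we passed the list directly, pandas would store the values as 64-bit integers.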

In [7]:
ar = np.array(groups, dtype=np.uint32)
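
As a rough illustration of the saving, a `uint32` array stores each event number in 4 bytes, while a Python list holds a pointer per element on top of the separate integer objects. The sketch below uses a made-up list length; the list size it reports depends on the platform and Python version.

```python
import sys
import numpy as np

n = 100_000  # illustrative length, far smaller than the real table
groups_list = list(range(n))
groups_arr = np.array(groups_list, dtype=np.uint32)

# Size of the list object itself (its pointers), excluding the int objects.
list_bytes = sys.getsizeof(groups_list)
# Size of the array's data buffer: 4 bytes per element.
arr_bytes = groups_arr.nbytes

print(f"list pointers: {list_bytes} bytes, uint32 array: {arr_bytes} bytes")

# uint32 can hold event numbers up to this value before overflowing.
max_event = np.iinfo(np.uint32).max
```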

In [8]:
cnas = cnas.assign(event=ar)

## Drop rows that don't pass

We didn't drop these rows earlier because we needed them to mark where a run of adjacent genes is interrupted. Now that every row has an event number, they are no longer needed.

In [9]:
cnas = cnas[cnas["passes"]]

## Summarize events

In [10]:
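summary = cnas.groupby("event").agg(**{
    # These three columns are constant within an event, so take the first value.
    # Use .iloc[0] for a positional lookup; x[0] would look up the index label 0.
    "chromosome": ("chromosome", lambda x: x.iloc[0]),
    "cancer_type": ("cancer_type", lambda x: x.iloc[0]),
    "Patient_ID": ("Patient_ID", lambda x: x.iloc[0]),
    "genes": ("gene", list),
    "start": ("first", "min"),
    "end": ("last", "max"),
    "num_genes": ("gene", "size"),
    "avg_cna": ("cna_val", "mean")
})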
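
To make the aggregation concrete, here is a toy-sized version using pandas' named aggregation with built-in aggregator names. The table and its values are made up; only the column names follow the notebook.

```python
import pandas as pd

# Made-up rows that already carry event numbers and have passed the cutoff.
toy = pd.DataFrame({
    "event": [0, 0, 2, 2, 2],
    "chromosome": ["1", "1", "1", "1", "1"],
    "gene": ["A", "B", "C", "D", "E"],
    "first": [100, 300, 900, 1200, 1500],
    "last": [250, 450, 1000, 1600, 1550],
    "cna_val": [0.5, 0.7, -0.3, -0.6, -0.9],
})

toy_summary = toy.groupby("event").agg(
    chromosome=("chromosome", "first"),  # constant within an event
    genes=("gene", list),                # all genes in the event
    start=("first", "min"),              # leftmost start position
    end=("last", "max"),                 # rightmost end position
    num_genes=("gene", "size"),
    avg_cna=("cna_val", "mean"),
)

print(toy_summary)
```

In event 2, gene D ends at 1600, after the last gene E ends at 1550, which is why `end` takes the maximum of `last` rather than the final row's value.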

In [11]:
summary.to_csv(STEP3_FILE_PATH, index=False, compression="gzip", sep="\t")
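
One caveat when reading this file back: the `genes` column holds Python lists, which `to_csv` writes as their string representation, so `read_csv` returns strings rather than lists. The sketch below shows one way to recover them, assuming gene names are plain strings. It uses a made-up table and an in-memory buffer in place of the real output file.

```python
import ast
import io
import pandas as pd

# Made-up summary rows with a list-valued column.
toy_summary = pd.DataFrame({
    "chromosome": ["1", "1"],
    "genes": [["A", "B"], ["C"]],
    "num_genes": [2, 1],
})

# Write and read back through an in-memory buffer instead of a file.
buffer = io.StringIO()
toy_summary.to_csv(buffer, index=False, sep="\t")
buffer.seek(0)
loaded = pd.read_csv(buffer, sep="\t", dtype={"chromosome": "O"})

# After the round trip, each entry is a string such as "['A', 'B']".
raw_first = loaded.loc[0, "genes"]

# ast.literal_eval safely parses the string back into a list.
loaded["genes"] = loaded["genes"].apply(ast.literal_eval)
print(loaded.loc[0, "genes"])
```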

In [12]:
summary

event,chromosome,cancer_type,Patient_ID,genes,start,end,num_genes,avg_cna
0,1,br,CPT000814,"[DDX11L1, FAM138A, OR4F5, OR4F29, OR4F16, FAM8...",11869.0,3438621.0,77,-0.270000
402,1,br,CPT000814,[RN7SL371P],30918469.0,30918735.0,1,-0.471000
403,1,br,CPT000814,"[PUM1, NKAIN1, SNRNP40, ZCCHC17, FABP3, SERINC...",30931506.0,31632518.0,8,1.259000
490,1,br,CPT000814,"[CSF3R, GRIK3, MIR4255, RNA5SP43, ZC3H12A, MEA...",36466043.0,38859772.0,26,1.930692
1134,1,br,CPT000814,"[NOTCH2, SEC22B, PPIAL4A, LINC00623, FCGR1B, H...",119911553.0,155934413.0,279,1.759244
1135,1,br,CPT000814,"[RXFP4, ARHGEF2, SSR2, UBQLN4, LAMTOR2, RAB25,...",155941710.0,203744081.0,452,1.680850
1136,1,br,CPT000814,[SNORA77],203729581.0,203729705.0,1,-0.484000
1137,1,br,CPT000814,"[SNORA77, LAX1, ZC3H11A, ZBED6, SNRPE, LINC003...",203729581.0,228555901.0,219,0.544452
1138,1,br,CPT000814,[RNA5SP162],228558296.0,228558339.0,1,-0.500000
1139,1,br,CPT000814,"[RNA5S1, RNA5S2, RNA5S3, RNA5S4, RNA5S5, RNA5S...",228610268.0,228746664.0,20,0.585000
