# Check for confounding variables

This notebook uses chi-squared tests to look for clinical variables that are associated with the presence or absence of a chromosome event (8q gain or 8p loss), since such variables could act as confounders.

- Get clinical tables
- Get event tables
- Binarize clinical columns as needed
- For each binary column in the clinical table, make a contingency table of that column against each event column
- Run a chi-squared test on each contingency table and save the results
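
The steps above can be sketched as a single loop. This is a minimal illustration on made-up data, not the notebook's final implementation: the helper name `chi2_scan`, the toy column `older_than_60`, and the toy values are all assumptions introduced here. Columns with only one level (such as an all-female `Gender` column) are skipped, because a contingency table needs at least two rows and two columns.

```python
import pandas as pd
import scipy.stats


def chi2_scan(df, event_cols, clinical_cols):
    """Run a chi-squared test for every (event, clinical column) pair.

    Hypothetical helper: returns one row per tested pair.
    """
    rows = []
    for event in event_cols:
        for col in clinical_cols:
            ct = pd.crosstab(df[event], df[col])
            # A test needs at least a 2x2 table; skip constant columns.
            if ct.shape[0] < 2 or ct.shape[1] < 2:
                continue
            chi2, p, dof, _ = scipy.stats.chi2_contingency(ct)
            rows.append(
                {"event": event, "variable": col, "chi2": chi2, "p": p, "dof": dof}
            )
    return pd.DataFrame(rows)


# Toy data standing in for a joined clinical + event table.
toy = pd.DataFrame({
    "8p_loss": [True, True, False, False, True, False, True, False],
    "older_than_60": [True, True, False, False, True, False, False, True],
    "Gender": ["F"] * 8,  # single level, so it is skipped
})

results = chi2_scan(toy, ["8p_loss"], ["older_than_60", "Gender"])
print(results)
```

The resulting table could then be written out with `results.to_csv(...)` to cover the "save the results" step.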

In [19]:
import pandas as pd
import numpy as np
import os
import cptac
import altair as alt
import scipy.stats

In [2]:
pd.options.display.max_columns = None

In [3]:
dss = {
    "brca": cptac.Brca,
#     "ccrcc": cptac.Ccrcc,
    "colon": cptac.Colon,
#     "endometrial": cptac.Endometrial,
#     "gbm": cptac.Gbm,
    "hnscc": cptac.Hnscc,
    "lscc": cptac.Lscc,
    "luad": cptac.Luad,
    "ovarian": cptac.Ovarian
}

In [4]:
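def load_tables(cancer_type):
    """Join a cancer type's clinical table with its chromosome 8 event table."""

    # Load the dataset
    ds = dss[cancer_type]()

    # Get the clinical table
    clin = ds.get_clinical()

    # Get the event table and rename its columns after the chromosome arm events
    event = (
        pd.read_csv(f"{cancer_type}_has_event.tsv", sep="\t", index_col=0)
        .rename(columns={"gain_event": "8q_gain", "loss_event": "8p_loss"})
    )

    # Keep only the samples present in both tables
    joined = clin.join(event, how="inner")

    return joined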

## BRCA

In [5]:
brca = load_tables("brca")

In [6]:
brca_cols = [
    "Age.in.Month",
    "Race",
    "Stage",
    "PAM50",
    "NMF.v2.1",
]

# Gender is not used because all samples are female
# TODO: Bin age, perhaps by decade
# TODO: Consolidate stage subcategories
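
One possible way to handle the two TODO items is sketched below on toy data. The new column names `Age.decade` and `Stage.major` are assumptions introduced for illustration, as is the choice to collapse substages (for example IIA and IIB) into their major stage; other groupings may suit the analysis better.

```python
import pandas as pd

# Toy rows using the same column names and stage labels as the BRCA table.
toy = pd.DataFrame({
    "Age.in.Month": [420, 612, 745, 890, 551],
    "Stage": ["Stage IA", "Stage IIA", "Stage IIB", "Stage IIIC", "Stage III"],
})

# Bin age by decade: convert months to years, then floor to a multiple of 10.
age_years = toy["Age.in.Month"] / 12
toy["Age.decade"] = (age_years // 10 * 10).astype(int)

# Collapse substages into the major stage. The alternation lists longer
# numerals first so that "III" is not matched as "I" or "II".
toy["Stage.major"] = toy["Stage"].str.extract(r"^Stage (IV|III|II|I)", expand=False)

print(toy)
```

Both derived columns are categorical with only a few levels, so they can be passed to `pd.crosstab` directly.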

In [7]:
brca = brca.assign(Age=brca["Age.in.Month"] / 12)

alt.Chart(brca).mark_bar().encode(
    alt.X("Age:Q", bin=alt.Bin()),
    y='count()',
)

In [8]:
for col in brca_cols:
    if col != "Age.in.Month":
        print(brca[col].value_counts())
        print()

white                        78
asian                        19
black.or.african.american    14
hispanic.or.latino            4
Name: Race, dtype: int64

Stage IIA     50
Stage IIIA    22
Stage IIB     20
Stage III      4
Stage IA       4
Stage IIIC     4
Stage IIIB     3
Name: Stage, dtype: int64

LumA      57
Basal     29
LumB      17
Her2      14
Normal     5
Name: PAM50, dtype: int64

C3    33
C2    32
C1    26
C4    25
Name: NMF.v2.1, dtype: int64



In [17]:
ct = pd.crosstab(brca["8p_loss"], brca["NMF.v2.1"])

In [18]:
ct

NMF.v2.1  C1  C2  C3  C4
8p_loss
False     14  15  19  14
True      12  17  14  11


In [21]:
chi2, p, dof, exp_freq = scipy.stats.chi2_contingency(ct)

In [22]:
p

0.8377783047508794
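
As a sanity check on this result, the sketch below rebuilds the same contingency table by hand and inspects the expected frequencies. The chi-squared approximation is commonly considered unreliable when expected cell counts are small (a frequent rule of thumb is below 5), so it is worth confirming that this table is not affected. The names `expected` and `min_expected` are introduced here for illustration.

```python
import pandas as pd
import scipy.stats

# The 8p_loss x NMF.v2.1 contingency table shown above, entered by hand.
ct = pd.DataFrame(
    [[14, 15, 19, 14], [12, 17, 14, 11]],
    index=pd.Index([False, True], name="8p_loss"),
    columns=pd.Index(["C1", "C2", "C3", "C4"], name="NMF.v2.1"),
)

chi2, p, dof, exp_freq = scipy.stats.chi2_contingency(ct)

# Label the expected counts so they can be compared cell by cell with ct.
expected = pd.DataFrame(exp_freq, index=ct.index, columns=ct.columns)
min_expected = expected.min().min()

print(f"chi2={chi2:.3f}, dof={dof}, p={p:.4f}")
print(f"smallest expected count: {min_expected:.2f}")
```

Here every expected count is well above 5, so the large p-value can be read as no evidence of association between 8p loss and the NMF cluster in this cohort.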

## Colon

In [11]:
colon = load_tables("colon")

## HNSCC

In [12]:
hnscc = load_tables("hnscc")

## LSCC

In [13]:
lscc = load_tables("lscc")

## LUAD

In [14]:
luad = load_tables("luad")

## Ovarian

In [15]:
ovarian = load_tables("ovarian")

Clinical variables to use:
- Age
- Gender
- Race
- Tumor stage/grade
- Histology/subtype
- Mutation status of TP53 and any other gene mutated in at least 10% (maybe 5%) of samples
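
The mutation-frequency cutoff in the last item could be applied roughly as follows. This is a sketch on toy data: the long-format layout with `Patient_ID` and `Gene` columns, the helper name `mutation_status`, and the sample IDs are all assumptions, and the real mutation tables may need reshaping first. Note that the frequency is computed over all samples, including those with no mutation calls.

```python
import pandas as pd

# Toy long-format mutation calls: one row per (sample, mutated gene).
mutations = pd.DataFrame({
    "Patient_ID": ["S1", "S1", "S2", "S3", "S4", "S4"],
    "Gene": ["TP53", "PIK3CA", "TP53", "TP53", "GATA3", "TP53"],
})
# Full cohort of ten samples; S5 to S10 have no mutation calls at all.
all_samples = [f"S{i}" for i in range(1, 11)]


def mutation_status(mutations, samples, min_freq=0.10):
    """Return a boolean sample x gene table for genes mutated often enough.

    Hypothetical helper: min_freq is the minimum fraction of samples
    that must carry a mutation in the gene.
    """
    status = (
        pd.crosstab(mutations["Patient_ID"], mutations["Gene"]) > 0
    ).reindex(samples, fill_value=False)
    freq = status.mean()  # fraction of samples mutated, per gene
    return status.loc[:, freq >= min_freq]


# A 20% cutoff is used here only so the toy data shows genes being dropped.
status = mutation_status(mutations, all_samples, min_freq=0.20)
print(status.sum())
```

Each retained column is already binary, so it can be cross-tabulated against `8q_gain` or `8p_loss` without further binarization.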

brca

- Replicate_Measurement_IDs
- Sample_Tumor_Normal
- Age.in.Month
- Gender
- Race
- Human.Readable.Label
- Experiment
- Channel
- Stage
- PAM50
- NMF.v2.1
- ER
- PR
- ER.IHC.Score
- PR.IHC.Score
- Coring.or.Excision
- Ischemia.Time.in.Minutes
- Ischemia.Decade
- Necrosis
- Tumor.Cellularity
- Total.Cellularity
- In.CR
- QC.status

colon

- Sample_Tumor_Normal
- Age
- CEA
- Gender
- Lymphatic_Invasion
- Mucinous
- Perineural_Invasion
- Polyps_History
- Polyps_Present
- Stage
- Subsite
- Synchronous_Tumors
- Tumor.Status
- Vascular_Invasion
- Vital.Status
- pathalogy_N_stage
- pathalogy_T_stage

hnscc

- Sample_Tumor_Normal
- Cored_Sample
- P16
- age
- alcohol_consum
- clinic_staging_dist_metas
- country
- follow_up_days
- follow_up_is_contact
- follow_up_vital_status
- gender
- histologic_grade
- histologic_type
- num_pack_years_sm
- num_smoke_per_day
- num_yrs_alc_con
- patho_staging_curated
- patho_staging_orignial
- patho_staging_pn
- patho_staging_pt
- smoke_age_start
- smoke_age_stop
- smoking_history
- smoking_inferred_binary
- smoking_second_hand
- tumor_focality
- tumor_necrosis
- tumor_site_curated
- tumor_site_original
- tumor_size_cm

lscc

- Sample_Tumor_Normal
- Smoking.History
- Stage
- Country.of.Origin
- Age
- Gender
- Ethnicity
- Cigarettes.per.Day
- Pack.Years.Smoked
- Secondhand.Smoke

luad

- Sample.IDs
- Sample_Tumor_Normal
- Smoking.Status
- Stage
- Region.of.Origin
- Country.of.Origin
- Age
- Gender
- Ethnicity
- Height.cm
- Weight.kg
- BMI
- Cigarettes.per.Day
- Pack.Years.Smoked
- Smoking.History
- Secondhand.Smoke

ovarian

- Sample_Tumor_Normal
- Participant_Procurement_Age
- Participant_Gender
- Participant_Race
- Participant_Ethnicity
- Participant_Jewish_Heritage
- Participant_History_Malignancy
- Participant_History_Chemotherapy
- Participant_History_Neo-adjuvant_Treatment
- Participant_History_Radiation_Therapy
- Participant_History_Hormonal_Therapy
- Aliquots_Plasma
- Blood_Collection_Time
- Blood_Collection_Method
- Anesthesia_Time
- Tumor_Surgical_Devascularized_Time
- Tumor_Sample_Number
- Tumor_Sample_1_Weight
- Tumor_Sample_1_LN2_Time
- Tumor_Sample_1_Ischemia_Time
- Tumor_Sample_2_Weight
- Tumor_Sample_2_LN2_Time
- Tumor_Sample_2_Ischemia_Time
- Tumor_Sample_3_Weight
- Tumor_Sample_3_LN2_Time
- Tumor_Sample_3_Ischemia_Time
- Tumor_Sample_4_Weight
- Tumor_Sample_4_LN2_Time
- Tumor_Sample_4_Ischemia_Time
- Tumor_Sample_5_Weight
- Tumor_Sample_5_LN2_Time
- Tumor_Sample_5_Ischemia_Time
- Normal_Sample_Number
- Normal_Sample_1_Surgical_Devascularized_Time
- Normal_Sample_1_Weight
- Normal_Sample_1_LN2_Time
- Normal_Sample_1_Ischemia_Time
- Normal_Sample_2_Surgical_Devascularized_Time
- Normal_Sample_2_Weight
- Normal_Sample_2_LN2_Time
- Normal_Sample_2_Ischemia_Time
- Normal_Sample_3_Surgical_Devascularized_Time
- Normal_Sample_3_Weight
- Normal_Sample_3_LN2_Time
- Normal_Sample_3_Ischemia_Time
- Normal_Sample_4_Surgical_Devascularized_Time
- Normal_Sample_4_Weight
- Normal_Sample_4_LN2_Time
- Normal_Sample_4_Ischemia_Time
- Normal_Sample_5_Surgical_Devascularized_Time
- Normal_Sample_5_Weight
- Normal_Sample_5_LN2_Time
- Normal_Sample_5_Ischemia_Time
- Origin_Site_Disease
- Anatomic_Site_Tumor
- Anatomic_Lateral_Position_Tumor
- Histological_Subtype
- Method_of_Pathologic_Diagnosis
- Tumor_Stage_Ovary_FIGO
- Tumor_Grade
- Tumor_Residual_Disease_Max_Diameter
- Days_Between_Collection_And_Last_Contact
- Vital_Status
- Days_Between_Collection_And_Death
- Tumor_Status
- Review_Of_Initial_Pathological_Findings
- Pathology_Review_Consistent_With_Diagnosis
- Adjuvant_Radiation_Therapy
- Adjuvant_Pharmaceutical_Therapy
- Adjuvant_Immunotherapy
- Adjuvant_Hormone_Therapy
- Adjuvant_Targeted_Molecular_Therapy
- Response_After_Surgery_And_Adjuvant_Therapies
- New_Tumor_Event_After_Initial_Treatment
- New_Tumor_Event_Type
- New_Tumor_Event_Site
- Other_New_Tumor_Event_Site
- Days_Between_Collection_And_New_Tumor_Event
- New_Tumor_Event_Diagnosis
- New_Tumor_Event_Surgery
- Days_Between_Collection_And_New_Tumor_Event_Surgery
- New_Tumor_Event_Chemotherapy
- New_Tumor_Event_Immunotherapy
- New_Tumor_Event_Hormone_Therapy
- New_Tumor_Event_Targeted_Molecular_Therapy