# Check for confounding variables

This notebook uses chi-squared tests to look for clinical variables that are associated with whether a sample has a chromosome event.

- Get clinical tables
- Get event tables
- Binarize clinical columns as needed
- For each binary column in the clinical table, make a contingency table of that column against the event status
- Run a chi-squared test and save the results
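
The last three steps can be sketched on toy data as follows. This is only an illustration under stated assumptions: the helper `chi2_for_column`, the sample IDs, and the median split used to binarize age are hypothetical, and the real event table may be laid out differently. It relies on `pandas.crosstab` and `scipy.stats.chi2_contingency`.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_for_column(clinical_col, has_event):
    """Chi-squared test of independence between one binary clinical
    column and a boolean event column (hypothetical helper)."""
    # Align the two Series on sample ID and drop samples missing either value
    joined = pd.concat(
        [clinical_col.rename("clinical"), has_event.rename("event")],
        axis=1,
        join="inner",
    ).dropna()
    # Contingency table: rows are clinical categories, columns are event status
    table = pd.crosstab(joined["clinical"], joined["event"])
    chi2, p, dof, expected = chi2_contingency(table)
    return table, chi2, p


# Toy data, not taken from any CPTAC dataset
samples = [f"S{i}" for i in range(8)]
age = pd.Series([45, 52, 61, 70, 38, 66, 58, 73], index=samples)
has_event = pd.Series(
    [False, False, True, True, False, True, False, True], index=samples
)

# Binarize a continuous column with a median split (one possible choice)
older = age > age.median()

table, chi2, p = chi2_for_column(older, has_event)
print(table)
print(f"chi2={chi2:.3f}, p={p:.4f}")
```

Note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default. With small cell counts, an exact test may be a better fit than chi-squared.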

In [1]:
import pandas as pd
import numpy as np
import os
import cptac

In [2]:
dss = {
    "brca": cptac.Brca,
#     "ccrcc": cptac.Ccrcc,
    "colon": cptac.Colon,
#     "endometrial": cptac.Endometrial,
#     "gbm": cptac.Gbm,
    "hnscc": cptac.Hnscc,
    "lscc": cptac.Lscc,
    "luad": cptac.Luad,
    "ovarian": cptac.Ovarian
}

In [3]:
clins = {}
events = {}

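def find_confounding_vars_cancer_type(cancer_type):
    """Load the clinical and event tables for one cancer type and cache them in `clins` and `events`."""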
    # Load the dataset
    ds = dss[cancer_type]()
    
    # Get the clinical table
    clin = ds.get_clinical()
    
    # Get the event table
    event = pd.read_csv(f"{cancer_type}_has_event.tsv", sep="\t", index_col=0)
    
    clins[cancer_type] = clin
    events[cancer_type] = event

In [4]:
for cancer_type in dss.keys():
    find_confounding_vars_cancer_type(cancer_type)

Checking that lscc index is up-to-date...
Checking that luad index is up-to-date...

Clinical variables to use:
- Age
- Gender
- Race
- Tumor stage/grade
- Histology/subtype
- Mutation status for TP53 and any other gene mutated in at least 10% (possibly 5%) of samples
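
One way to apply the mutation-frequency cutoff is sketched below. It assumes a long-format mutation table with one row per mutation call and hypothetical column names `sample` and `gene`; the helper `frequently_mutated_genes` and the toy data are illustrative only. The threshold is a parameter because the list leaves the choice between 10% and 5% open.

```python
import pandas as pd


def frequently_mutated_genes(mutations, n_samples, min_freq=0.10):
    """Return the fraction of samples mutated for each gene that meets
    `min_freq` (hypothetical helper; column names are assumptions)."""
    # Count each gene at most once per sample
    per_gene = mutations.groupby("gene")["sample"].nunique()
    freq = per_gene / n_samples
    return freq[freq >= min_freq].sort_values(ascending=False)


# Toy mutation calls, not taken from any CPTAC dataset
all_samples = ["S1", "S2", "S3", "S4", "S5"]
toy = pd.DataFrame({
    "sample": ["S1", "S1", "S2", "S3", "S3", "S4"],
    "gene": ["TP53", "KRAS", "TP53", "TP53", "TP53", "PIK3CA"],
})

# A high cutoff is used here only so the toy example filters something out
freq = frequently_mutated_genes(toy, n_samples=len(all_samples), min_freq=0.5)

# Binary mutation-status column for one gene, indexed by sample ID
tp53_samples = toy.loc[toy["gene"] == "TP53", "sample"]
tp53_mutated = pd.Series(all_samples, index=all_samples).isin(tp53_samples)
```

The resulting boolean Series has the same shape as a binarized clinical column, so it can be tested against the event status in the same way.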

In [2]:
cancer_types = [
    "brca",
    "colon",
    "hnscc",
    "lscc",
    "luad",
    "ovarian"
]

clinical_vars = [
    "age",
    "gender",
    "race",
    "tumor_stage",
    "subtype",
    # And mutation statuses for frequently mutated genes
]

cancer_clinical_vars = pd.DataFrame(index=pd.MultiIndex.from_product([cancer_types, clinical_vars]))
cancer_clinical_vars

Unnamed: 0,Unnamed: 1
brca,age
brca,gender
brca,race
brca,tumor_stage
brca,subtype
colon,age
colon,gender
colon,race
colon,tumor_stage
colon,subtype
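
The empty frame above could later be filled with the dataset-specific column name for each variable, since the clinical tables name the same variable differently (for example `Age.in.Month` in brca and `Age` in colon, as the column listings further down show). The sketch below is a guess at such a mapping: treating `PAM50` as the brca subtype column and leaving colon race and subtype unmapped are assumptions, and `column_map` and `lookup` are hypothetical names.

```python
import pandas as pd

# Hypothetical mapping: standardized variable -> dataset-specific column.
# None marks a variable with no obvious column in that dataset.
column_map = {
    "brca": {
        "age": "Age.in.Month",
        "gender": "Gender",
        "race": "Race",
        "tumor_stage": "Stage",
        "subtype": "PAM50",  # assumption: PAM50 used as the subtype
    },
    "colon": {
        "age": "Age",
        "gender": "Gender",
        "race": None,
        "tumor_stage": "Stage",
        "subtype": None,
    },
}

# Flatten into one row per (cancer type, variable) pair
records = [
    (cancer, var, col)
    for cancer, variables in column_map.items()
    for var, col in variables.items()
]
lookup = pd.DataFrame(
    records, columns=["cancer_type", "clinical_var", "column"]
).set_index(["cancer_type", "clinical_var"])

# Look up the column to use for a given cancer type and variable
brca_age_col = lookup.loc[("brca", "age"), "column"]
```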


In [7]:
for cancer in clins.keys():
    df = clins[cancer]
    print("------------------------------------------------------------------------")
    print(cancer)
    print()
    for col in df.columns:
        print(col)
    print()

------------------------------------------------------------------------
brca

Replicate_Measurement_IDs
Sample_Tumor_Normal
Age.in.Month
Gender
Race
Human.Readable.Label
Experiment
Channel
Stage
PAM50
NMF.v2.1
ER
PR
ER.IHC.Score
PR.IHC.Score
Coring.or.Excision
Ischemia.Time.in.Minutes
Ischemia.Decade
Necrosis
Tumor.Cellularity
Total.Cellularity
In.CR
QC.status

------------------------------------------------------------------------
colon

Sample_Tumor_Normal
Age
CEA
Gender
Lymphatic_Invasion
Mucinous
Perineural_Invasion
Polyps_History
Polyps_Present
Stage
Subsite
Synchronous_Tumors
Tumor.Status
Vascular_Invasion
Vital.Status
pathalogy_N_stage
pathalogy_T_stage

------------------------------------------------------------------------
hnscc

Sample_Tumor_Normal
Cored_Sample
P16
age
alcohol_consum
clinic_staging_dist_metas
country
follow_up_days
follow_up_is_contact
follow_up_vital_status
gender
histologic_grade
histologic_type
num_pack_years_sm
num_smoke_per_day
num_yrs_alc_con
patho_s