# Check for confounding variables

This notebook uses chi-squared tests to look for clinical variables that are associated with the presence or absence of a chromosome 8 copy-number event (8p loss or 8q gain).

- Get clinical tables
- Get event tables
- Simplify clinical columns into a small number of categories as needed
- For each selected categorical column in the clinical table, make a contingency table of that column against each event column
- Run a chi-squared test on each contingency table and collect the results
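
The last two steps can be sketched on toy data. This is only an illustration: the column names `has_event` and `stage` and all values below are made up and do not come from the CPTAC tables.

```python
import pandas as pd
import scipy.stats

# Hypothetical toy data: one row per sample.
toy = pd.DataFrame({
    "has_event": [True, True, False, False, True, False, True, False] * 5,
    "stage": ["I", "II", "I", "II", "II", "I", "II", "I"] * 5,
})

# Contingency table: rows are event status, columns are the clinical category.
table = pd.crosstab(toy["has_event"], toy["stage"])

# chi2_contingency returns the test statistic, the p-value, the degrees of
# freedom, and the expected frequencies under independence.
chi2, p, dof, expected = scipy.stats.chi2_contingency(table)

print(table)
print(f"chi2={chi2:.3f}, p={p:.4f}, dof={dof}")
```

The expected frequencies are returned alongside the p-value, which makes it possible to check the test's assumptions before trusting the result.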

In [1]:
import pandas as pd
import numpy as np
import os
import cptac
import altair as alt
import scipy.stats

In [2]:
pd.options.display.max_columns = None
pd.options.display.max_colwidth = None

In [3]:
dss = {
    "brca": cptac.Brca,
#     "ccrcc": cptac.Ccrcc,
    "colon": cptac.Colon,
#     "endometrial": cptac.Endometrial,
#     "gbm": cptac.Gbm,
    "hnscc": cptac.Hnscc,
    "lscc": cptac.Lscc,
    "luad": cptac.Luad,
    "ovarian": cptac.Ovarian
}

In [4]:
def load_tables(cancer_type):
    
    # Load the dataset
    ds = dss[cancer_type]()
    
    # Get the clinical table
    clin = ds.get_clinical()
    
    # Get the event table
    event = (
        pd.read_csv(f"{cancer_type}_has_event.tsv", sep="\t", index_col=0)
        .rename(columns={"gain_event": "8q_gain", "loss_event": "8p_loss"})
    )
    
    # Keep only the samples that appear in both tables
    joined = clin.join(event, how="inner")
    
    return joined

In [5]:
def test_cnv_association(df, test_cols, cnv_col):
    """Run a chi-squared test of each column in test_cols against cnv_col.
    
    Returns a one-column DataFrame ("pval") indexed by column name. If the
    expected frequencies violate the test's assumptions, the entry is a
    message describing the violation instead of a p-value.
    """
    
    pvals = {}
    
    for col in test_cols:
        
        # Create contingency table
        contingency_table = pd.crosstab(df[cnv_col], df[col])
        
        # Run test
        chi2, p, dof, exp_freq = scipy.stats.chi2_contingency(contingency_table)
        
        # Check assumptions: No group has expected frequency < 1, and no more
        # than 20% of groups have expected frequency < 5.
        if (exp_freq < 1).any():
            pvals[col] = "At least one expected frequency was < 1."
        elif (exp_freq < 5).sum() > 0.2 * exp_freq.size:
            pvals[col] = "More than 20% of groups had expected frequency < 5."
        else:
            pvals[col] = p
        
    return pd.Series(pvals, name="pval").to_frame()
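
The rule of thumb checked inside `test_cnv_association` can be isolated into a small helper to make its behavior easier to see. This is a sketch under that assumption; `expected_counts_ok` is a hypothetical name and the tables below are made-up expected frequencies.

```python
import numpy as np

def expected_counts_ok(expected):
    """Return True if expected counts satisfy the rule of thumb used above."""
    expected = np.asarray(expected, dtype=float)
    # No cell may have an expected count below 1.
    if (expected < 1).any():
        return False
    # At most 20% of cells may have an expected count below 5.
    return bool((expected < 5).mean() <= 0.2)

print(expected_counts_ok([[10, 12], [8, 9]]))        # True
print(expected_counts_ok([[10, 12, 3], [8, 9, 2]]))  # False: 2 of 6 cells < 5
print(expected_counts_ok([[10, 0.5], [8, 9]]))       # False: a cell is < 1
```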

## BRCA

In [6]:
brca = load_tables("brca")

### Simplify the age column
For the age column, we convert months to years, bin ages into 15-year groups, and merge all groups at or above 75 years into a single group.

In [7]:
brca = brca.assign(Age=brca["Age.in.Month"] // 12)
brca = brca.assign(Age_group=(brca["Age"] // 15) * 15)
brca = brca.assign(Age_group=brca["Age_group"].where(cond=(brca["Age_group"] < 75) | (pd.isnull(brca["Age"])), other=75))

In [8]:
brca["Age_group"].value_counts(dropna=False).sort_index()

30.0    12
45.0    36
60.0    38
75.0    19
NaN     17
Name: Age_group, dtype: int64
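
The same kind of binning can also be written with `pd.cut`, which produces labelled groups in one step. The sketch below uses made-up ages in months; the bin edges and labels are illustrative choices, not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ages in months, including one missing value.
age_months = pd.Series([400, 550, 610, 760, 905, 1010, np.nan])
age_years = age_months // 12

# Left-closed bins [30, 45), [45, 60), [60, 75), [75, inf).
# Ages below 30 would fall outside the bins and become NaN.
bins = [30, 45, 60, 75, np.inf]
labels = ["30-44", "45-59", "60-74", "75+"]
age_group = pd.cut(age_years, bins=bins, labels=labels, right=False)

print(age_group.tolist())
```

Unlike the floor-division approach, the open-ended top bin does not need a separate `where` step, and missing ages stay missing.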

### Simplify the stage column

We will also simplify the "Stage" column.

In [9]:
brca["Stage"].value_counts(dropna=False).sort_index()

Stage IA       4
Stage IIA     50
Stage IIB     20
Stage III      4
Stage IIIA    22
Stage IIIB     3
Stage IIIC     4
NaN           15
Name: Stage, dtype: int64

Because there are only 4 Stage I samples, we will group them with Stage II.

In [10]:
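def simplify_stage(stage):
    # Leave missing values as they are
    if pd.isna(stage):
        return stage
    # Check "Stage III" first, because "Stage I" is a prefix of it
    elif stage.startswith("Stage III"):
        return "III"
    # Matches Stage I, IA, II, IIA, IIB. Note that "Stage IV" would also match
    # this prefix; the BRCA table has no Stage IV samples.
    elif stage.startswith("Stage I"):
        return "I or II"
    else:
        return stage

brca = brca.assign(Simplified_Stage=brca["Stage"].apply(simplify_stage))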

In [11]:
brca["Simplified_Stage"].value_counts(dropna=False).sort_index()

I or II    74
III        33
NaN        15
Name: Simplified_Stage, dtype: int64

### Race column

There are too few samples in the hispanic.or.latino group (4) to satisfy the expected-frequency requirements of the chi-squared test. Should we drop the category, or use a permutation test instead?

In [12]:
brca["Race"].value_counts(dropna=False)

white                        78
asian                        19
black.or.african.american    14
NaN                           7
hispanic.or.latino            4
Name: Race, dtype: int64
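
One possible answer to the question above is a Monte Carlo permutation test: keep the chi-squared statistic, but estimate its null distribution by shuffling one of the two variables instead of relying on the large-sample approximation. The sketch below is an assumption about how this could be done, not something this notebook implements; `permutation_chi2_pvalue` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
import scipy.stats

def permutation_chi2_pvalue(x, y, n_perm=1000, seed=0):
    """Estimate a p-value for association between two categorical Series."""
    # Drop samples where either variable is missing.
    data = pd.DataFrame({"x": x, "y": y}).dropna()
    x_vals = data["x"].to_numpy()
    y_vals = data["y"].to_numpy()

    def chi2_stat(a, b):
        table = pd.crosstab(a, b)
        # correction=False gives the plain Pearson chi-squared statistic.
        return scipy.stats.chi2_contingency(table, correction=False)[0]

    observed = chi2_stat(x_vals, y_vals)
    rng = np.random.default_rng(seed)
    # Shuffling y breaks any association while keeping both margins fixed.
    # The small tolerance guards against floating-point ties.
    n_extreme = sum(
        chi2_stat(x_vals, rng.permutation(y_vals)) >= observed - 1e-9
        for _ in range(n_perm)
    )
    # The add-one correction keeps the estimate strictly above zero.
    return (n_extreme + 1) / (n_perm + 1)

# Toy data with a strong association and one small category ("c").
toy = pd.DataFrame({
    "event": [True] * 30 + [False] * 30,
    "group": ["a"] * 25 + ["b"] * 3 + ["c"] * 2
             + ["a"] * 10 + ["b"] * 18 + ["c"] * 2,
})
p_perm = permutation_chi2_pvalue(toy["event"], toy["group"], n_perm=200)
print(p_perm)
```

The smallest p-value this estimate can return is `1 / (n_perm + 1)`, so `n_perm` limits the resolution of the result.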

### Run chi-squared tests
Now we will run chi-squared tests to look for association between each variable and CNV events.

In [13]:
brca_cols = [
    "Age_group",
    "Race",
    "Simplified_Stage",
    "PAM50",
    "NMF.v2.1",
]
# Gender is not tested because all BRCA samples are female

In [14]:
test_cnv_association(
    df=brca,
    test_cols=brca_cols,
    cnv_col="8p_loss"
)

| Variable | pval |
| --- | --- |
| Age_group | 0.245893 |
| Race | More than 20% of groups had expected frequency < 5. |
| Simplified_Stage | 0.973413 |
| PAM50 | 0.668204 |
| NMF.v2.1 | 0.937665 |


In [15]:
test_cnv_association(
    df=brca,
    test_cols=brca_cols,
    cnv_col="8q_gain"
)

| Variable | pval |
| --- | --- |
| Age_group | 0.837975 |
| Race | More than 20% of groups had expected frequency < 5. |
| Simplified_Stage | 0.67463 |
| PAM50 | 0.00112059 |
| NMF.v2.1 | 0.000460971 |


## Colon

In [16]:
colon = load_tables("colon")

### Simplify the Age column

In [17]:
colon = colon.assign(Age_years=colon["Age"] // 12)
colon = colon.assign(Age_group=(colon["Age_years"] // 15) * 15)

In [18]:
colon["Age_group"].value_counts(dropna=False).sort_index()

30.0     4
45.0    31
60.0    48
75.0    19
90.0     1
NaN      2
Name: Age_group, dtype: int64

In [19]:
# Merge the single sample in the 90 group into the 75 group
colon = colon.assign(
    Age_group=colon["Age_group"].where(cond=(colon["Age_group"] < 75) | (pd.isnull(colon["Age"])), other=75)
)
# Merge the 30 and 45 groups; the combined group keeps the label 30
colon = colon.assign(
    Age_group=colon["Age_group"].where(cond=(colon["Age_group"] > 45) | (pd.isnull(colon["Age"])), other=30)
)

In [20]:
colon["Age_group"].value_counts(dropna=False).sort_index()

30.0    35
60.0    48
75.0    20
NaN      2
Name: Age_group, dtype: int64

### Simplify the Stage column

In [21]:
colon["Stage"].value_counts(dropna=False).sort_index()

Stage I      12
Stage II     42
Stage III    44
Stage IV      7
Name: Stage, dtype: int64

In [22]:
# Combine Stage III and Stage IV into one group
colon = colon.assign(
    Simplified_Stage=colon["Stage"].replace({"Stage III": "Stage III or IV", "Stage IV": "Stage III or IV"})
)

In [23]:
colon["Simplified_Stage"].value_counts(dropna=False).sort_index()

Stage I            12
Stage II           42
Stage III or IV    51
Name: Simplified_Stage, dtype: int64

### Run chi-squared tests

In [24]:
colon_cols = [
    "Age_group",
    "Gender",
    "Simplified_Stage",
    "Mucinous"
]

In [25]:
test_cnv_association(
    df=colon,
    test_cols=colon_cols,
    cnv_col="8p_loss"
)

| Variable | pval |
| --- | --- |
| Age_group | 0.04483 |
| Gender | 0.299017 |
| Simplified_Stage | 0.249238 |
| Mucinous | 0.230789 |


In [26]:
test_cnv_association(
    df=colon,
    test_cols=colon_cols,
    cnv_col="8q_gain"
)

| Variable | pval |
| --- | --- |
| Age_group | 0.81527 |
| Gender | 0.765412 |
| Simplified_Stage | 0.677828 |
| Mucinous | 0.626902 |


## HNSCC

In [42]:
hnscc = load_tables("hnscc")

### Group ages

In [43]:
hnscc = hnscc.assign(Age_group=(hnscc["age"] // 10) * 10)

In [44]:
hnscc["Age_group"].value_counts(dropna=False).sort_index()

20.0     1
40.0     5
50.0    37
60.0    48
70.0    14
80.0     3
NaN      1
Name: Age_group, dtype: int64

In [45]:
# Merge the 80 group into the 70 group
hnscc = hnscc.assign(
    Age_group=hnscc["Age_group"].where(cond=(hnscc["Age_group"] < 70) | (pd.isnull(hnscc["age"])), other=70)
)
# Merge the 20 and 40 groups into the 50 group
hnscc = hnscc.assign(
    Age_group=hnscc["Age_group"].where(cond=(hnscc["Age_group"] > 50) | (pd.isnull(hnscc["age"])), other=50)
)

In [46]:
hnscc["Age_group"].value_counts(dropna=False).sort_index()

50.0    43
60.0    48
70.0    17
NaN      1
Name: Age_group, dtype: int64

### Simplify alcohol consumption column

We combine the past-drinker group (only 3 samples) with the group of current drinkers at or below the moderate limit (2 drinks per day for men, 1 for women).

We also replace the "history not available" group with NaN.

In [47]:
hnscc["alcohol_consum"].value_counts(dropna=False)

Alcohol consumption equal to or less than 2 drinks per day for men and 1 drink or less per day for women    44
Alcohol consumption history not available                                                                   23
Lifelong non-drinker                                                                                        21
Alcohol consumption more than 2 drinks per day for men and more than 1 drink per day for women              11
NaN                                                                                                          7
Consumed alcohol in the past, but currently a non-drinker                                                    3
Name: alcohol_consum, dtype: int64

In [48]:
hnscc["alcohol_consum"] = hnscc["alcohol_consum"].replace(
    to_replace="Consumed alcohol in the past, but currently a non-drinker",
    value="Alcohol consumption equal to or less than 2 drinks per day for men and 1 drink or less per day for women"
).replace(
    to_replace="Alcohol consumption history not available",
    value=np.nan
)

In [49]:
hnscc["alcohol_consum"].value_counts(dropna=False)

Alcohol consumption equal to or less than 2 drinks per day for men and 1 drink or less per day for women    47
NaN                                                                                                         30
Lifelong non-drinker                                                                                        21
Alcohol consumption more than 2 drinks per day for men and more than 1 drink per day for women              11
Name: alcohol_consum, dtype: int64

### Simplify smoking history column

Combine all the "current reformed" groups and set the "history not available" group to NaN.

In [50]:
hnscc["smoking_history"].value_counts(dropna=False)

Current smoker: Includes daily and non-daily smokers                38
Smoking history not available                                       21
Lifelong non-smoker: Less than 100 cigarettes smoked in lifetime    21
Current reformed smoker within past 15 years                        14
Current reformed smoker, more than 15 years                         10
Current reformed smoker, years unknown                               4
NaN                                                                  1
Name: smoking_history, dtype: int64

In [51]:
hnscc["smoking_history"] = hnscc["smoking_history"].replace(
    to_replace="Smoking history not available",
    value=np.nan
).replace(
    to_replace="Current reformed smoker, years unknown",
    value="Current reformed smoker"
).replace(
    to_replace="Current reformed smoker within past 15 years",
    value="Current reformed smoker"
).replace(
    to_replace="Current reformed smoker, more than 15 years",
    value="Current reformed smoker"
)

In [52]:
hnscc["smoking_history"].value_counts(dropna=False)

Current smoker: Includes daily and non-daily smokers                38
Current reformed smoker                                             28
NaN                                                                 22
Lifelong non-smoker: Less than 100 cigarettes smoked in lifetime    21
Name: smoking_history, dtype: int64
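
Chained `replace` calls like the ones above can also be expressed as a single mapping from old to new categories. The sketch below uses a made-up Series with shortened, hypothetical labels rather than the real `smoking_history` values.

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened category labels.
smoking = pd.Series([
    "Current smoker",
    "Reformed smoker, within 15 years",
    "Reformed smoker, more than 15 years",
    "Reformed smoker, years unknown",
    "History not available",
    "Lifelong non-smoker",
])

# One dictionary lists every category that is merged or set to missing.
collapse = {
    "Reformed smoker, within 15 years": "Reformed smoker",
    "Reformed smoker, more than 15 years": "Reformed smoker",
    "Reformed smoker, years unknown": "Reformed smoker",
    "History not available": np.nan,
}
simplified = smoking.replace(collapse)

print(simplified.value_counts(dropna=False))
```

Keeping the mapping in one place makes it easier to review which categories were merged before running the tests.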

### Simplify tumor site column

Combine the two pharynx categories, and put lip with Oral cavity.

In [38]:
hnscc["tumor_site_curated"].value_counts(dropna=False)

Oral cavity    49
Larynx         47
Oropharynx      6
Lip             4
Hypopharynx     2
NaN             1
Name: tumor_site_curated, dtype: int64

In [53]:
hnscc["tumor_site_curated"] = hnscc["tumor_site_curated"].replace(
    to_replace="Oropharynx",
    value="Pharynx"
).replace(
    to_replace="Hypopharynx",
    value="Pharynx"
).replace(
    to_replace="Lip",
    value="Oral cavity"
)

In [54]:
hnscc["tumor_site_curated"].value_counts(dropna=False)

Oral cavity    53
Larynx         47
Pharynx         8
NaN             1
Name: tumor_site_curated, dtype: int64

### Simplify stage column

Combine the Stage I and Stage II groups.

In [58]:
hnscc["patho_staging_curated"].value_counts(dropna=False)

Stage IV     45
Stage III    32
Stage II     24
Stage I       7
NaN           1
Name: patho_staging_curated, dtype: int64

In [59]:
hnscc["patho_staging_curated"] = hnscc["patho_staging_curated"].replace(
    to_replace="Stage I",
    value="Stage I/II"
).replace(
    to_replace="Stage II",
    value="Stage I/II"
)

In [60]:
hnscc["patho_staging_curated"].value_counts(dropna=False)

Stage IV      45
Stage III     32
Stage I/II    31
NaN            1
Name: patho_staging_curated, dtype: int64

### Run chi-squared tests

In [61]:
hnscc_cols = [
    "Age_group",
    "alcohol_consum",
#     "gender", # Only 14 women vs. 94 men. Chi-squared assumption not met: More than 20% of groups had expected frequency < 5.
    "histologic_grade",
#     "histologic_type", # 97 of 104 samples are "Squamous cell carcinoma, conventional"
    "patho_staging_curated",
    "smoking_history",
    "tumor_site_curated"
]

In [62]:
test_cnv_association(
    df=hnscc,
    test_cols=hnscc_cols,
    cnv_col="8p_loss"
)

| Variable | pval |
| --- | --- |
| Age_group | 0.611143 |
| alcohol_consum | 0.565015 |
| histologic_grade | 0.320023 |
| patho_staging_curated | 0.053354 |
| smoking_history | 0.294374 |
| tumor_site_curated | 0.56374 |


In [63]:
test_cnv_association(
    df=hnscc,
    test_cols=hnscc_cols,
    cnv_col="8q_gain"
)

| Variable | pval |
| --- | --- |
| Age_group | 0.104835 |
| alcohol_consum | 0.26648 |
| histologic_grade | 0.930855 |
| patho_staging_curated | 0.272547 |
| smoking_history | 0.078895 |
| tumor_site_curated | 0.739403 |


## LSCC

In [None]:
lscc = load_tables("lscc")

## LUAD

In [None]:
luad = load_tables("luad")

## Ovarian

In [None]:
ovarian = load_tables("ovarian")

Clinical variables to use:
- Age
- Gender
- Race
- Tumor stage/grade
- Histology/subtype
- TP53 and other mutation status, for any gene mutated in at least 10% (maybe 5%) of samples
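
The mutation-frequency cutoff in the last item could be applied along these lines. This is a sketch on a made-up boolean table; the gene names other than TP53, the values, and the layout (samples as rows, genes as columns) are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical boolean table: rows are samples, columns are genes,
# True means the sample carries a mutation in that gene.
mutated = pd.DataFrame({
    "TP53": [True, True, False, True, False, True, False, True, True, False],
    "GENE_A": [False] * 8 + [True] * 2,
    "GENE_B": [False] * 10,
})

min_frequency = 0.10  # the 10% cutoff; 0.05 is the looser alternative

# The mean of a boolean column is the fraction of mutated samples.
frequency = mutated.mean()
candidate_genes = frequency[frequency >= min_frequency].index.tolist()

print(candidate_genes)
```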

brca

- Replicate_Measurement_IDs
- Sample_Tumor_Normal
- Age.in.Month
- Gender
- Race
- Human.Readable.Label
- Experiment
- Channel
- Stage
- PAM50
- NMF.v2.1
- ER
- PR
- ER.IHC.Score
- PR.IHC.Score
- Coring.or.Excision
- Ischemia.Time.in.Minutes
- Ischemia.Group
- Necrosis
- Tumor.Cellularity
- Total.Cellularity
- In.CR
- QC.status

colon

- Sample_Tumor_Normal
- Age
- CEA
- Gender
- Lymphatic_Invasion
- Mucinous
- Perineural_Invasion
- Polyps_History
- Polyps_Present
- Stage
- Subsite
- Synchronous_Tumors
- Tumor.Status
- Vascular_Invasion
- Vital.Status
- pathalogy_N_stage
- pathalogy_T_stage

hnscc

- Sample_Tumor_Normal
- Cored_Sample
- P16
- age
- alcohol_consum
- clinic_staging_dist_metas
- country
- follow_up_days
- follow_up_is_contact
- follow_up_vital_status
- gender
- histologic_grade
- histologic_type
- num_pack_years_sm
- num_smoke_per_day
- num_yrs_alc_con
- patho_staging_curated
- patho_staging_orignial
- patho_staging_pn
- patho_staging_pt
- smoke_age_start
- smoke_age_stop
- smoking_history
- smoking_inferred_binary
- smoking_second_hand
- tumor_focality
- tumor_necrosis
- tumor_site_curated
- tumor_site_original
- tumor_size_cm

lscc

- Sample_Tumor_Normal
- Smoking.History
- Stage
- Country.of.Origin
- Age
- Gender
- Ethnicity
- Cigarettes.per.Day
- Pack.Years.Smoked
- Secondhand.Smoke

luad

- Sample.IDs
- Sample_Tumor_Normal
- Smoking.Status
- Stage
- Region.of.Origin
- Country.of.Origin
- Age
- Gender
- Ethnicity
- Height.cm
- Weight.kg
- BMI
- Cigarettes.per.Day
- Pack.Years.Smoked
- Smoking.History
- Secondhand.Smoke

ovarian

- Sample_Tumor_Normal
- Participant_Procurement_Age
- Participant_Gender
- Participant_Race
- Participant_Ethnicity
- Participant_Jewish_Heritage
- Participant_History_Malignancy
- Participant_History_Chemotherapy
- Participant_History_Neo-adjuvant_Treatment
- Participant_History_Radiation_Therapy
- Participant_History_Hormonal_Therapy
- Aliquots_Plasma
- Blood_Collection_Time
- Blood_Collection_Method
- Anesthesia_Time
- Tumor_Surgical_Devascularized_Time
- Tumor_Sample_Number
- Tumor_Sample_1_Weight
- Tumor_Sample_1_LN2_Time
- Tumor_Sample_1_Ischemia_Time
- Tumor_Sample_2_Weight
- Tumor_Sample_2_LN2_Time
- Tumor_Sample_2_Ischemia_Time
- Tumor_Sample_3_Weight
- Tumor_Sample_3_LN2_Time
- Tumor_Sample_3_Ischemia_Time
- Tumor_Sample_4_Weight
- Tumor_Sample_4_LN2_Time
- Tumor_Sample_4_Ischemia_Time
- Tumor_Sample_5_Weight
- Tumor_Sample_5_LN2_Time
- Tumor_Sample_5_Ischemia_Time
- Normal_Sample_Number
- Normal_Sample_1_Surgical_Devascularized_Time
- Normal_Sample_1_Weight
- Normal_Sample_1_LN2_Time
- Normal_Sample_1_Ischemia_Time
- Normal_Sample_2_Surgical_Devascularized_Time
- Normal_Sample_2_Weight
- Normal_Sample_2_LN2_Time
- Normal_Sample_2_Ischemia_Time
- Normal_Sample_3_Surgical_Devascularized_Time
- Normal_Sample_3_Weight
- Normal_Sample_3_LN2_Time
- Normal_Sample_3_Ischemia_Time
- Normal_Sample_4_Surgical_Devascularized_Time
- Normal_Sample_4_Weight
- Normal_Sample_4_LN2_Time
- Normal_Sample_4_Ischemia_Time
- Normal_Sample_5_Surgical_Devascularized_Time
- Normal_Sample_5_Weight
- Normal_Sample_5_LN2_Time
- Normal_Sample_5_Ischemia_Time
- Origin_Site_Disease
- Anatomic_Site_Tumor
- Anatomic_Lateral_Position_Tumor
- Histological_Subtype
- Method_of_Pathologic_Diagnosis
- Tumor_Stage_Ovary_FIGO
- Tumor_Grade
- Tumor_Residual_Disease_Max_Diameter
- Days_Between_Collection_And_Last_Contact
- Vital_Status
- Days_Between_Collection_And_Death
- Tumor_Status
- Review_Of_Initial_Pathological_Findings
- Pathology_Review_Consistent_With_Diagnosis
- Adjuvant_Radiation_Therapy
- Adjuvant_Pharmaceutical_Therapy
- Adjuvant_Immunotherapy
- Adjuvant_Hormone_Therapy
- Adjuvant_Targeted_Molecular_Therapy
- Response_After_Surgery_And_Adjuvant_Therapies
- New_Tumor_Event_After_Initial_Treatment
- New_Tumor_Event_Type
- New_Tumor_Event_Site
- Other_New_Tumor_Event_Site
- Days_Between_Collection_And_New_Tumor_Event
- New_Tumor_Event_Diagnosis
- New_Tumor_Event_Surgery
- Days_Between_Collection_And_New_Tumor_Event_Surgery
- New_Tumor_Event_Chemotherapy
- New_Tumor_Event_Immunotherapy
- New_Tumor_Event_Hormone_Therapy
- New_Tumor_Event_Targeted_Molecular_Therapy