### Quick Exploration of Copy Number Variation (CNV)

This notebook provides a quick way to explore the CNV status of specific genes across patients in the lower-grade glioma (LGG) and breast cancer (BRCA) cohorts.

CNV segments were classified according to their ploidy state for each patient:
- **Amplified** if the copy number exceeded 2,
- **Deleted** if it was below 2,
- **Neutral** if it equalled 2.
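
The rule above can be sketched as a small function. This is only an illustration of the thresholds listed here, not the preprocessing code that produced the CNV file; the function name `classify_copy_number` and the exact label strings are assumptions.

```python
def classify_copy_number(copy_number: float, neutral: float = 2.0) -> str:
    """Map a segment's copy number to one of the three states listed above."""
    if copy_number > neutral:
        return "Amplified"
    if copy_number < neutral:
        return "Deleted"
    return "Neutral"


# Illustrative values only, not taken from the cohort data.
for cn in (1, 2, 3):
    print(cn, classify_copy_number(cn))
```

The `neutral` parameter defaults to the diploid value of 2 used in this notebook.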
 
Genes such as **TP53** appeared **amplified** in some patients. TP53 is a well-known **prognostic factor** in LGG and plays a critical role in tumor suppression.

Understanding which genes are altered in each cancer type can guide the **tailoring of SAMVAE** to specific cancer contexts. This notebook facilitates the exploration of gene-level CNV alterations, enabling researchers to identify candidate markers or patterns relevant to tumor biology and personalized modeling.


In [12]:
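import pandas as pd
from IPython.display import display  # explicit import so the cell also runs outside a notebook kernel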

# --------- User-defined parameters ---------
symbol = 'TP53'          # Gene symbol
patient_idx = 27         # Index of the patient
cancer_type = 'lgg'      # Cancer type: 'lgg' or 'brca'
# -------------------------------------------

# Paths based on cancer type
symbols_file_path = f'../../data_download/Frequently_mutated_genes/frequently-mutated-genes_{cancer_type}.tsv'
cnv_file_path = f'../data_preprocessing/raw_data/omic_data/cnv/{cancer_type}/consolidated_segments_by_gene.tsv'

# Load gene symbols to gene_id mapping
df_symbols = pd.read_csv(symbols_file_path, sep='\t')

# Function to get gene_id from a symbol
def get_gene_id(symbol_to_find):
    result = df_symbols[df_symbols['symbol'] == symbol_to_find]
    if not result.empty:
        return result['gene_id'].values[0]
    else:
        return None

# Get gene_id
gene_id = get_gene_id(symbol)

# Display all frequently mutated genes for this cancer type
print(f"\nFrequently mutated genes in {cancer_type.upper()}:\n")
display(df_symbols)

# If symbol is found, load CNV data and display result
if gene_id is None:
    print(f"\nThe symbol '{symbol}' was not found in the frequently mutated genes file.")
else:
    # Load CNV file
    df_cnv = pd.read_csv(cnv_file_path, sep='\t')

    # Function to get CNV value
    def get_cnv_value_for_gene_and_patient(gene_id, patient_idx):
        if gene_id not in df_cnv.iloc[:, 0].values:
            return f"\nThe gene ID '{gene_id}' was not found in the CNV file."
        if patient_idx < 1 or patient_idx >= df_cnv.shape[1]:
            return f"\nInvalid patient index. Must be between 1 and {df_cnv.shape[1]-1}."
        row = df_cnv[df_cnv.iloc[:, 0] == gene_id]
        patient_column = df_cnv.columns[patient_idx]
        value = row[patient_column].values[0]
        return f"\nThe CNV value for gene '{symbol}' (ID: {gene_id}) in patient '{patient_column}' (Index: {patient_idx}) is: {value}"

    # Show CNV result
    result = get_cnv_value_for_gene_and_patient(gene_id, patient_idx)
    print(result)



Frequently mutated genes in LGG:



| Unnamed: 0 | gene_id | symbol | name | cytoband | type | num_cohort_ssm_affected_cases | num_cohort_ssm_cases | cohort_ssm_affected_cases_percentage | num_gdc_ssm_affected_cases | num_gdc_ssm_cases | gdc_ssm_affected_cases_percentage | num_cohort_cnv_cases | num_cohort_cnv_gain_cases | cohort_cnv_gain_cases_percentage | num_cohort_cnv_loss_cases | cohort_cnv_loss_cases_percentage | num_mutations | annotations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ENSG00000138413 | IDH1 | isocitrate dehydrogenase (NADP(+)) 1 | 2q34 | protein_coding | 394 | 513 | 76.80 | 600 | 16508 | 3.63 | 515 | 7 | 1.36 | 13 | 2.52 | 4 | Cancer Gene Census |
| 1 | ENSG00000141510 | TP53 | tumor protein p53 | 17p13.1 | protein_coding | 240 | 513 | 46.78 | 4964 | 16508 | 30.07 | 515 | 6 | 1.17 | 10 | 1.94 | 169 | Cancer Gene Census |
| 2 | ENSG00000085224 | ATRX | ATRX chromatin remodeler | Xq21.1 | protein_coding | 187 | 513 | 36.45 | 817 | 16508 | 4.95 | 515 | 0 | 0.00 | 0 | 0.00 | 190 | Cancer Gene Census |
| 3 | ENSG00000079432 | CIC | capicua transcriptional repressor | 19q13.2 | protein_coding | 110 | 513 | 21.44 | 464 | 16508 | 2.81 | 515 | 23 | 4.47 | 236 | 45.83 | 96 | Cancer Gene Census |
| 4 | ENSG00000162613 | FUBP1 | far upstream element binding protein 1 | 1p31.1 | protein_coding | 45 | 513 | 8.77 | 229 | 16508 | 1.39 | 515 | 12 | 2.33 | 183 | 35.53 | 42 | Cancer Gene Census |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 706 | ENSG00000213672 | NCKIPSD | NCK interacting protein with SH3 domain | 3p21.31 | protein_coding | 0 | 513 | 0.00 | 119 | 16508 | 0.72 | 515 | 5 | 0.97 | 41 | 7.96 | 0 | Cancer Gene Census |
| 707 | ENSG00000214562 | NUTM2D | NUT family member 2D | 10q23.2 | protein_coding | 0 | 513 | 0.00 | 59 | 16508 | 0.36 | 515 | 1 | 0.19 | 94 | 18.25 | 0 | Cancer Gene Census |
| 708 | ENSG00000245848 | CEBPA | CCAAT enhancer binding protein alpha | 19q13.11 | protein_coding | 0 | 513 | 0.00 | 63 | 16508 | 0.38 | 515 | 26 | 5.05 | 198 | 38.45 | 0 | Cancer Gene Census |
| 709 | ENSG00000261652 | C15orf65 | chromosome 15 open reading frame 65 | 15q21.3 | protein_coding | 0 | 513 | 0.00 | 10 | 16508 | 0.06 | 515 | 7 | 1.36 | 56 | 10.87 | 0 | Cancer Gene Census |



The CNV value for gene 'TP53' (ID: ENSG00000141510) in patient '0f679102-f96c-4390-9537-14cdfe977782' (Index: 27) is: Amplified
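
A single lookup like the one above can be extended to the whole cohort by counting how many patients fall into each CNV state for a gene. The sketch below is a possible approach, not part of the original notebook. It assumes the CNV table has gene IDs in its first column and one status label per patient in the remaining columns. The helper `count_cnv_states`, the toy table `df_cnv_demo`, and its patient names and values are hypothetical.

```python
import pandas as pd

# Toy stand-in for the CNV table: first column holds gene IDs, the other
# columns hold one status label per patient. All values here are invented.
df_cnv_demo = pd.DataFrame({
    "gene_id": ["ENSG00000141510", "ENSG00000138413"],
    "patient_a": ["Amplified", "Neutral"],
    "patient_b": ["Neutral", "Neutral"],
    "patient_c": ["Deleted", "Neutral"],
    "patient_d": ["Neutral", "Amplified"],
})


def count_cnv_states(df: pd.DataFrame, gene_id: str) -> pd.Series:
    """Return the number of patients in each CNV state for one gene."""
    row = df[df.iloc[:, 0] == gene_id]
    if row.empty:
        raise KeyError(f"gene ID {gene_id!r} not found in the CNV table")
    # Skip the gene ID column and tally the labels in the patient columns.
    return row.iloc[0, 1:].value_counts()


tp53_counts = count_cnv_states(df_cnv_demo, "ENSG00000141510")
print(tp53_counts)
```

With the real file, passing `df_cnv` and the `gene_id` resolved earlier in place of the toy inputs should give the same kind of per-state tally, provided the table follows the assumed layout.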
