## Step 1: Load and Inspect Gene Expression Data (Transcriptomics Layer)

This dataset contains log2(FPKM + 1)-transformed gene expression values for TCGA-BRCA samples.
Each row is a gene (identified by its Ensembl ID), and each column is a sample (identified by its TCGA barcode).
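
Because the values are stored as log2(FPKM + 1), a stored value `x` corresponds to `2**x - 1` FPKM, and a stored `0.0` means an FPKM of zero. A minimal sketch of that round trip is shown here; the helper names `to_log2_fpkm` and `to_fpkm` are hypothetical and are not part of the dataset or of `pandas`.

```python
import math


def to_log2_fpkm(fpkm: float) -> float:
    """Forward transform used by the dataset: log2(FPKM + 1)."""
    return math.log2(fpkm + 1.0)


def to_fpkm(log2_value: float) -> float:
    """Inverse transform: recover FPKM from a log2(FPKM + 1) value."""
    return 2.0 ** log2_value - 1.0


# A stored value of 0.0 corresponds to an FPKM of exactly 0.
zero_fpkm = to_fpkm(0.0)

# Round trip on an illustrative stored value of the size seen in this matrix.
stored = 3.655386
recovered = to_log2_fpkm(to_fpkm(stored))

print(zero_fpkm, round(to_fpkm(stored), 2), round(recovered, 6))
```

The "+ 1" pseudocount is what keeps unexpressed genes at `0.0` instead of negative infinity after the log transform.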

We will:
- Load the dataset using `pandas`
- Preview the shape and gene/sample identifiers
- Check if our gene of interest (*TP53*) is included


In [7]:
# Import required library
import pandas as pd

# Load the expression data
df_expr = pd.read_csv("TCGA-BRCA.star_fpkm.tsv.gz", sep="\t", index_col=0)

# Print the shape of the expression matrix
print("The shape of expression matrix: ", df_expr.shape)  # (Ensembl IDs x samples)

# Show first few rows
df_expr.head()

The shape of expression matrix:  (60660, 1226)


Ensembl_ID,TCGA-D8-A146-01A,TCGA-AQ-A0Y5-01A,TCGA-C8-A274-01A,TCGA-BH-A0BD-01A,TCGA-B6-A1KC-01B,TCGA-AC-A62V-01A,TCGA-AO-A0J5-01A,TCGA-BH-A0B1-01A,TCGA-A2-A0YM-01A,TCGA-AO-A03N-01B,...,TCGA-E2-A1IG-01A,TCGA-E9-A1NA-01A,TCGA-D8-A1JP-01A,TCGA-AR-A252-01A,TCGA-D8-A1XL-01A,TCGA-BH-A0EI-01A,TCGA-E2-A1IO-01A,TCGA-E2-A15R-01A,TCGA-B6-A0IP-01A,TCGA-A1-A0SN-01A
ENSG00000000003.15,3.81619,2.034638,4.823,3.028003,2.865503,2.122341,2.805272,4.146924,4.286985,2.798652,...,3.06188,2.407924,3.271246,3.25031,3.677587,4.426251,4.245123,1.234747,5.072925,1.9568
ENSG00000000005.6,1.796473,0.134221,0.0,1.058801,0.166972,0.275722,0.428571,0.1133,0.129217,0.0,...,0.037453,0.042924,0.023326,1.961364,0.0,0.101516,0.229219,0.136716,0.172488,0.015212
ENSG00000000419.13,4.971102,5.159173,5.107052,4.595068,4.693615,5.189662,3.824147,5.073178,4.8279,5.183113,...,4.533395,4.531319,5.222078,4.425009,5.722215,5.189046,4.907208,5.077508,4.634012,6.164261
ENSG00000000457.14,2.656428,2.324868,3.407869,2.659925,2.249506,1.033723,2.664597,2.31632,1.67735,1.612824,...,2.238512,2.089295,3.258172,2.244644,2.216982,1.99534,2.328664,2.986848,2.738357,2.391108
ENSG00000000460.17,1.395556,1.088888,2.505002,2.473917,1.458435,1.148739,1.141171,1.78442,2.24373,1.174598,...,1.210015,1.184407,2.191405,1.32262,1.754845,1.500853,1.140451,2.033934,1.839194,1.681719


### Step 1B: Locate TP53 in Expression Matrix

The expression matrix uses Ensembl gene IDs instead of gene symbols like TP53.

To extract TP53 expression:
- Find its Ensembl ID (`ENSG00000141510`)
- Locate that row in the expression matrix
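
The index labels in this matrix carry a version suffix (for example `ENSG00000000003.15`), so an exact lookup on the unversioned ID would not find the row. One option, sketched here on a two-gene toy matrix built from values in the preview, is to strip the suffix and compare for equality, which is stricter than a substring search. The variable names (`toy_expr`, `unversioned`, `is_tp53`) are illustrative only.

```python
import pandas as pd

# Toy genes-x-samples matrix with versioned Ensembl IDs as the index.
toy_expr = pd.DataFrame(
    {
        "TCGA-D8-A146-01A": [3.81619, 3.655386],
        "TCGA-AQ-A0Y5-01A": [2.034638, 3.304263],
    },
    index=pd.Index(["ENSG00000000003.15", "ENSG00000141510.18"], name="Ensembl_ID"),
)

# Drop everything after the first dot to get the unversioned gene ID.
unversioned = toy_expr.index.str.split(".").str[0]

# Boolean mask that is True only for the TP53 row.
is_tp53 = unversioned == "ENSG00000141510"

# Select the matching row; the original versioned label is preserved.
tp53_toy = toy_expr.loc[is_tp53]
print(tp53_toy)
```

Matching on the unversioned ID keeps the lookup working if a later data release bumps the version number.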


In [11]:
# Find index labels containing the TP53 Ensembl ID (labels include a version suffix, e.g. ".18")
tp53_rows = [i for i in df_expr.index if "ENSG00000141510" in i]

# View TP53 expression values across samples
df_expr.loc[tp53_rows].T.head()  # Transpose for readability

Ensembl_ID,ENSG00000141510.18
TCGA-D8-A146-01A,3.655386
TCGA-AQ-A0Y5-01A,3.304263
TCGA-C8-A274-01A,4.243738
TCGA-BH-A0BD-01A,4.410416
TCGA-B6-A1KC-01B,3.838962


### Step 1C: Clean and Rename TP53 Expression Data

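We transpose the TP53 expression row so that each sample becomes a row, name the value column `TP53_Expression`, and move the sample barcodes into a `Sample_ID` column.

- Descriptive column names make the table easier to interpret
- An explicit `Sample_ID` column makes merging and plotting easier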
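
Assigning a single column name only works when exactly one row matched the TP53 ID. A minimal sketch of that guard, combined with the reshaping step, might look like the following; `gene_row_to_table` is a hypothetical helper and the toy matrix reuses two values from the output above.

```python
import pandas as pd


def gene_row_to_table(expr: pd.DataFrame, row_labels: list, value_name: str) -> pd.DataFrame:
    """Turn one gene row of a genes-x-samples matrix into a two-column table."""
    if len(row_labels) != 1:
        raise ValueError(f"expected exactly one matching row, found {len(row_labels)}")
    series = expr.loc[row_labels[0]]  # one value per sample, indexed by barcode
    return (
        series.rename(value_name)      # name of the value column
        .rename_axis("Sample_ID")      # name of the barcode column
        .reset_index()
    )


# Toy matrix with a single TP53 row.
toy_expr = pd.DataFrame(
    {"TCGA-D8-A146-01A": [3.655386], "TCGA-AQ-A0Y5-01A": [3.304263]},
    index=pd.Index(["ENSG00000141510.18"], name="Ensembl_ID"),
)

tp53_table = gene_row_to_table(toy_expr, ["ENSG00000141510.18"], "TP53_Expression")
print(tp53_table)
```

Failing loudly on zero or multiple matches avoids silently building a table from the wrong gene.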


In [23]:
# Convert to DataFrame for merging later
tp53_expr_df = df_expr.loc[tp53_rows].T  # transpose to make samples rows
tp53_expr_df.columns = ["TP53_Expression"]
tp53_expr_df.index.name = "Sample_ID"
tp53_expr_df.reset_index(inplace=True)

# Save as a table
tp53_expr_df.to_csv("TP53_expression_table.tsv", sep="\t", index=False)

tp53_expr_df.head()

,Sample_ID,TP53_Expression
0,TCGA-D8-A146-01A,3.655386
1,TCGA-AQ-A0Y5-01A,3.304263
2,TCGA-C8-A274-01A,4.243738
3,TCGA-BH-A0BD-01A,4.410416
4,TCGA-B6-A1KC-01B,3.838962


## Step 2: Load and Filter Mutation Data (Genomics Layer)

We load the TCGA-BRCA somatic mutation file.  
Then we filter for rows where the mutated gene is TP53 and extract the corresponding sample barcodes.

We'll use this to label samples as:
- 'Mutated' if TP53 is altered
- 'Wild-type' if TP53 is not mutated


In [25]:
# Load mutation data
df_mut = pd.read_csv("TCGA-BRCA.somaticmutation_wxs.tsv.gz", sep="\t")

# Preview columns
print("Mutation data columns: ", df_mut.columns.tolist())

# Preview first few rows
df_mut.head()

Mutation data columns:  ['sample', 'gene', 'chrom', 'start', 'end', 'ref', 'alt', 'Tumor_Sample_Barcode', 'Amino_Acid_Change', 'effect', 'callers', 'dna_vaf']


,sample,gene,chrom,start,end,ref,alt,Tumor_Sample_Barcode,Amino_Acid_Change,effect,callers,dna_vaf
0,TCGA-A1-A0SO-01A,MIB2,chr1,1625276,1625299,CCCTCCGCAGGCAAGCCGGCGGAG,-,TCGA-A1-A0SO-01A-22D-A099-09,p.X356_splice,splice_acceptor_variant;coding_sequence_varian...,mutect2;varscan2,0.3
1,TCGA-A1-A0SO-01A,VPS13D,chr1,12283654,12283654,T,C,TCGA-A1-A0SO-01A-22D-A099-09,p.I1851T,missense_variant,muse;mutect2;varscan2,0.179775
2,TCGA-A1-A0SO-01A,PRAMEF8,chr1,13281907,13281907,C,A,TCGA-A1-A0SO-01A-22D-A099-09,p.V297F,missense_variant,muse;mutect2,0.031496
3,TCGA-A1-A0SO-01A,NBPF1,chr1,16576355,16576355,C,G,TCGA-A1-A0SO-01A-22D-A099-09,p.E677D,missense_variant,muse;mutect2,0.086957
4,TCGA-A1-A0SO-01A,SRSF4,chr1,29148880,29148880,C,T,TCGA-A1-A0SO-01A-22D-A099-09,p.G339S,missense_variant,muse;mutect2;varscan2,0.134615
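
In the rows previewed above, `sample` holds a 16-character sample barcode and `Tumor_Sample_Barcode` holds the longer aliquot barcode; the former is the first 16 characters of the latter. The expression matrix columns use the same 16-character form, which is why `sample` is the column to match on. A small sketch of that relationship (the helper name `sample_id_from_aliquot` is hypothetical):

```python
def sample_id_from_aliquot(aliquot_barcode: str) -> str:
    """Return the 16-character sample-level prefix of a TCGA aliquot barcode."""
    return aliquot_barcode[:16]


# Aliquot barcode copied from the first row of the preview.
aliquot = "TCGA-A1-A0SO-01A-22D-A099-09"
sample_id = sample_id_from_aliquot(aliquot)

print(sample_id)  # TCGA-A1-A0SO-01A
```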


In [37]:
# Filter rows where the mutated gene is TP53
df_tp53_mut = df_mut[df_mut["gene"] == "TP53"]

# Get unique sample barcodes that have TP53 mutation
mutated_samples = df_tp53_mut["sample"].unique()
print("Number of samples with TP53 mutation:", len(mutated_samples))
pd.Series(mutated_samples).head()

Number of samples with TP53 mutation: 339


0    TCGA-A1-A0SO-01A
1    TCGA-LL-A5YP-01A
2    TCGA-BH-A0RX-01A
3    TCGA-AN-A0FL-01A
4    TCGA-A2-A0T1-01A
dtype: object

## Step 3: Label Samples as "Mutated" or "Wild-type"

Now that we have the TP53-mutated sample list, we label each sample in our TP53 expression dataset.

- If a sample is found in the mutated list → label as "Mutated"
- Otherwise → label as "Wild-type"


In [42]:
# Create mutation status column
tp53_expr_df["Mutation_Status"] = tp53_expr_df["Sample_ID"].apply(
    lambda x: "Mutated" if x in mutated_samples else "Wild-type")

# Check counts
tp53_expr_df["Mutation_Status"].value_counts()

Mutation_Status
Wild-type    888
Mutated      338
Name: count, dtype: int64
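
The counts above show 338 samples labelled "Mutated", one fewer than the 339 unique TP53-mutated barcodes found in Step 2, which suggests that at least one mutated sample has no column in the expression matrix. A set difference makes such mismatches visible. The sketch below uses toy barcodes (the barcode `TCGA-XX-0000-01A` is made up) and also shows a vectorized alternative to `apply` based on `Series.isin`.

```python
import pandas as pd

# Toy stand-ins for the expression sample IDs and the mutated-sample list.
expr_samples = pd.Series(
    ["TCGA-D8-A146-01A", "TCGA-A2-A0YM-01A", "TCGA-AO-A03N-01B"],
    name="Sample_ID",
)
mutated = ["TCGA-A2-A0YM-01A", "TCGA-XX-0000-01A"]  # second barcode is made up

# Mutated barcodes that have no matching expression sample.
missing = sorted(set(mutated) - set(expr_samples))

# Vectorized labelling: membership test, then map booleans to labels.
status = expr_samples.isin(mutated).map({True: "Mutated", False: "Wild-type"})

print("Mutated samples without expression data:", missing)
print(status.value_counts())
```

Note that "Wild-type" here means "no TP53 row in the mutation file", so a sample that is absent from the mutation data altogether would also receive this label.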

In [43]:
tp53_expr_df.head(10)

,Sample_ID,TP53_Expression,Mutation_Status
0,TCGA-D8-A146-01A,3.655386,Wild-type
1,TCGA-AQ-A0Y5-01A,3.304263,Wild-type
2,TCGA-C8-A274-01A,4.243738,Wild-type
3,TCGA-BH-A0BD-01A,4.410416,Wild-type
4,TCGA-B6-A1KC-01B,3.838962,Wild-type
5,TCGA-AC-A62V-01A,3.350724,Wild-type
6,TCGA-AO-A0J5-01A,3.95939,Wild-type
7,TCGA-BH-A0B1-01A,3.405107,Wild-type
8,TCGA-A2-A0YM-01A,4.156995,Mutated
9,TCGA-AO-A03N-01B,4.440009,Wild-type


## Step 4: Export the Final Multi-Omics Dataset

This dataset includes the TP53 expression value and TP53 mutation status for each of the 1,226 TCGA-BRCA samples in the expression matrix.

We save it as a `.csv` file for statistical analysis and visualization in R.


In [48]:
# Save the final dataframe to CSV for statistical analysis and visualization in R
tp53_expr_df.to_csv("TP53_expression_mutation_status.csv", index=False)

print("File saved successfully!")

File saved successfully!
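
Before handing the file to R, it can be worth confirming that the table survives a write/read round trip. A minimal sketch, using an in-memory buffer instead of a file on disk and three rows copied from the table in Step 3 (the names `final_table` and `reloaded` are illustrative):

```python
import io

import pandas as pd

# Three rows copied from the labelled table.
final_table = pd.DataFrame(
    {
        "Sample_ID": ["TCGA-D8-A146-01A", "TCGA-AQ-A0Y5-01A", "TCGA-A2-A0YM-01A"],
        "TP53_Expression": [3.655386, 3.304263, 4.156995],
        "Mutation_Status": ["Wild-type", "Wild-type", "Mutated"],
    }
)

# Write to an in-memory CSV, then read it back.
buffer = io.StringIO()
final_table.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Basic structural checks on the reloaded table.
same_columns = list(reloaded.columns) == list(final_table.columns)
valid_labels = bool(reloaded["Mutation_Status"].isin(["Mutated", "Wild-type"]).all())
print(same_columns, valid_labels, reloaded.shape)
```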
