## **Breast Cancer Project**

---

### **Raw Data File Preprocessing and Formatting**

This notebook prepares and formats raw RNA-Seq data (TPM values) for breast cancer classification. It merges GDC (cancer) and GTEx (healthy) data, aligns gene features, and saves clean matrices for downstream experiments. This preprocessing is shared across all three experimental pipelines described in the project.

---

### **Setup and Configuration**

The data is organized into the following folders under `data/`:
- `raw_gdc_data/`: Raw cancer sample files downloaded from the GDC portal (each sample in its own folder).
- `raw_gtx_data/`: Raw healthy sample file from GTEx in `.gct` format.
- `initial/`: Preprocessed and aligned matrices after merging and cleaning.
- `interim/`: Final feature matrices and label files used for training and testing.
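
Before running the notebook, it can help to confirm that this layout exists. The snippet below is a minimal sketch, not part of the original pipeline: the helper name `check_data_layout` is hypothetical, and it only checks that the four folders listed above are present.

```python
import os

def check_data_layout(data_dir="data"):
    """Return the expected subfolders that are missing under data_dir."""
    expected = ["raw_gdc_data", "raw_gtx_data", "initial", "interim"]
    return [name for name in expected
            if not os.path.isdir(os.path.join(data_dir, name))]

missing = check_data_layout("data")
if missing:
    print("Missing folders:", ", ".join(missing))
else:
    print("All expected data folders are present.")
```

The `initial/` and `interim/` folders are created later in this notebook, so only the two raw-data folders need to exist beforehand.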

In [None]:
# Import required libraries
import os
import pandas as pd
import numpy as np
import seaborn as sns
import shutil
import pickle
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, permutation_test_score
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, ConfusionMatrixDisplay, roc_auc_score, roc_curve, auc,
)
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

In [None]:
# Define data folder structure
DATA_DIR = "data"
RAW_GDC_DIR = os.path.join(DATA_DIR, "raw_gdc_data")
RAW_GTX_DIR = os.path.join(DATA_DIR, "raw_gtx_data")

INITIAL_DIR = os.path.join(DATA_DIR, "initial")
INTERIM_DIR = os.path.join(DATA_DIR, "interim")

In [None]:
# Create directories if they don't exist
os.makedirs(INITIAL_DIR, exist_ok=True)
os.makedirs(INTERIM_DIR, exist_ok=True)

---

### **Step 1: Initial Processing of GDC Raw Sample Files (Cancer Data)**

This block processes RNA-Seq files downloaded from the GDC portal. Each `.tsv` file contains gene expression data for a single cancer sample. From each file, the column `tpm_unstranded` is extracted, indexed by `gene_id`.

These per-sample files are merged into a single matrix where:
- **Rows** = genes,
- **Columns** = samples,
- **Values** = TPM expression values.

This process produces the file `gdc_data.csv`, saved to the `data/initial/` folder.

The merge is computationally expensive, so it is designed to be skipped once completed unless the raw data changes.
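
One way to decide whether the merge can be skipped is to compare file modification times. This is a sketch under assumptions: the notebook does not say how skipping is implemented, the helper `needs_rebuild` is hypothetical, and using the sample sheet as the only input is a simplification.

```python
import os

def needs_rebuild(output_path, input_paths):
    """Return True if the output is missing or older than any existing input."""
    if not os.path.exists(output_path):
        return True
    output_mtime = os.path.getmtime(output_path)
    return any(os.path.getmtime(p) > output_mtime
               for p in input_paths if os.path.exists(p))

merged_path = os.path.join("data", "initial", "gdc_data.csv")
sample_sheet = os.path.join("data", "raw_gdc_data", "gdc_sample_sheet.tsv")

if needs_rebuild(merged_path, [sample_sheet]):
    print("gdc_data.csv is missing or stale; run the merge step.")
else:
    print("gdc_data.csv is up to date; the merge step can be skipped.")
```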

---

##### **Step 1.1: Merge All GDC Cancer Sample Files**

This step reads individual `.tsv` files downloaded from the GDC portal, each representing one cancer sample. From each file, the `tpm_unstranded` column is extracted, renamed to a unique sample ID (e.g., `c_0001`, `c_0002`, ...), and merged by gene ID into a single dataframe. The final merged dataset is saved as `gdc_data.csv` in the `data/initial/` directory.
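
The merge logic can be illustrated on toy data. The sketch below is not the notebook's implementation: it collects one column per sample and joins them in a single `pd.concat` call instead of merging inside the loop, and it assumes `gene_id` values are unique within each file. The gene names and TPM values are made up.

```python
import pandas as pd

# Two toy per-sample tables standing in for GDC .tsv files (values are made up)
sample_a = pd.DataFrame({"gene_id": ["g1", "g2", "g3"],
                         "tpm_unstranded": [1.0, 2.0, 3.0]})
sample_b = pd.DataFrame({"gene_id": ["g2", "g3", "g4"],
                         "tpm_unstranded": [5.0, 6.0, 7.0]})

columns = []
for counter, table in enumerate([sample_a, sample_b], start=1):
    sample_id = f"c_{counter:04d}"  # c_0001, c_0002, ...
    series = table.set_index("gene_id")["tpm_unstranded"].rename(sample_id)
    columns.append(series)

# Outer join on gene_id: a gene missing from a sample becomes NaN
merged = pd.concat(columns, axis=1, join="outer").sort_index()
merged.index.name = "gene_id"
print(merged)
```

Joining all columns at once avoids repeatedly copying a growing DataFrame, which may matter with more than a thousand samples.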


In [None]:
# Define paths for GDC sample sheet and data files
gdc_sample_sheet_path = os.path.join(RAW_GDC_DIR, 'gdc_sample_sheet.tsv')
gdc_merged_file_path = os.path.join(INITIAL_DIR, 'gdc_data.csv')

sample_sheet_file_handler = pd.read_csv(gdc_sample_sheet_path, sep='\t')

# Initialize an empty DataFrame to store the combined data
gdc_full_df = pd.DataFrame()

sample_counter = 1
# Iterate through each file listed in the sample sheet
for index, row in sample_sheet_file_handler.iterrows():
    folder_id = row['File ID']
    file_name = row['File Name']
    
    # Construct the full file path
    sample_file_path = os.path.join(RAW_GDC_DIR, folder_id, file_name)
    
    # Read the TSV file, skipping row 0 and rows 2-5, which are not gene records
    try:
        sample_data = pd.read_csv(sample_file_path, sep='\t', skiprows=[0, 2, 3, 4, 5])

        # Extract the gene ID and unstranded TPM columns (copy to avoid modifying a view)
        relevant_data = sample_data[['gene_id', 'tpm_unstranded']].copy()

        # Rename the TPM column to a sequential sample ID (c_0001, c_0002, ...)
        sample_number = 'c_' + str(sample_counter).zfill(4)
        relevant_data.columns = ['gene_id', sample_number]
        
        # Merge with the combined data
        if gdc_full_df.empty:
            gdc_full_df = relevant_data
        else:
            gdc_full_df = pd.merge(gdc_full_df, relevant_data, on='gene_id', how='outer')

        sample_counter += 1
            
    except Exception as e:
        print(f"Error processing file {sample_file_path}: {e}")

gdc_full_df.to_csv(gdc_merged_file_path, index=False)

In [None]:
# Load and display the first few rows of the merged GDC dataset
gdc_merged_file_path = os.path.join(INITIAL_DIR, 'gdc_data.csv')

gdc_merged_file_df = pd.read_csv(gdc_merged_file_path, index_col=0)
print("Shape of initial GDC Data Set", gdc_merged_file_df.shape)
gdc_merged_file_df.head(5)

##### **Step 1.2: Split GDC Data into Model Training and Unseen Testing**

The full GDC cancer dataset is split into:
- `train_num = 1000` samples for model building and training
- `unseen_num = 231` samples for final testing

Samples are randomly selected from the full dataset to ensure unbiased evaluation. The result is two files:
- `gdc_data_training.csv`
- `gdc_data_testing.csv`

Both are saved in `data/initial/`.
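
The split itself can be sketched as follows. This is an illustration, not the notebook's code: it uses a seeded NumPy generator so the split is reproducible, whereas the cell below calls `np.random.choice` without a seed. The helper name `split_sample_columns` and the seed value are assumptions.

```python
import numpy as np

def split_sample_columns(sample_columns, n_train, seed=42):
    """Randomly split column names into disjoint train and unseen lists."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(sample_columns)
    train = sorted(shuffled[:n_train].tolist())
    unseen = sorted(shuffled[n_train:].tolist())
    return train, unseen

sample_columns = [f"c_{i:04d}" for i in range(1, 1232)]  # 1231 toy sample IDs
train_cols, unseen_cols = split_sample_columns(sample_columns, n_train=1000)
print(len(train_cols), len(unseen_cols))
```

With a fixed seed, rerunning the cell yields the same training and unseen sets, which makes downstream results easier to compare across runs.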

In [None]:
# Number of cancer samples used for model building;
# the remaining samples are kept separately as unseen data for testing.
cancer_data_number_of_samples_for_model_builiding = 1000

In [None]:
# Define path
gdc_data_file_path = os.path.join(INITIAL_DIR, 'gdc_data.csv')

gdc_full_data_few_rows_df = pd.read_csv(gdc_data_file_path, nrows=5)
columns = gdc_full_data_few_rows_df.columns.tolist()

gene_info_columns = columns[0:1]
sample_columns = columns[1:]

# Split the samples into model-building (training) data and unseen testing data
gdc_total_number_of_samples = len(sample_columns)         # 1231 samples in the TCGA breast cancer data
gdc_train_num = cancer_data_number_of_samples_for_model_builiding
gdc_unseen_num = gdc_total_number_of_samples - gdc_train_num

# Randomly select gdc_train_num sample columns as training data
gdc_chosen_columns_for_training = np.random.choice(sample_columns, gdc_train_num, replace=False).tolist()
gdc_final_columns_for_training = gene_info_columns + gdc_chosen_columns_for_training

# Get the remaining columns as testing data
gdc_chosen_columns_for_testing = [col for col in sample_columns if col not in gdc_chosen_columns_for_training]
gdc_final_columns_for_testing = gene_info_columns + gdc_chosen_columns_for_testing

# Load the dataset again but only with the selected columns
gdc_train_df = pd.read_csv(gdc_data_file_path, usecols=gdc_final_columns_for_training)
gdc_unseen_df = pd.read_csv(gdc_data_file_path, usecols=gdc_final_columns_for_testing)

# Save the training and testing data to new files
gdc_training_file_path = os.path.join(INITIAL_DIR, 'gdc_data_training.csv')
gdc_unseen_file_path = os.path.join(INITIAL_DIR, 'gdc_data_testing.csv')

gdc_train_df.to_csv(gdc_training_file_path, index=False)
gdc_unseen_df.to_csv(gdc_unseen_file_path, index=False)

In [None]:
print("Shape of initial GDC Model Building Data Set", gdc_train_df.shape)
gdc_train_df.head(5)

In [None]:
print("Shape of initial GDC Testing Data Set", gdc_unseen_df.shape)
gdc_unseen_df.head(5)

---

### **Step 2: Initial Processing of GTEx Raw Sample File (Non-Cancer Data)**

This step processes the transcriptomic data from GTEx, originally stored in `.gct` format, which contains expression levels for thousands of genes across thousands of healthy samples.

It includes:
1. Conversion of `.gct` to `.csv`, skipping metadata rows.
2. Random selection of healthy samples.
3. Standardization of sample names to the format `h_xxxx`.
4. Saving the result as `gtx_data.csv` in `data/initial/`.

Like the GDC step, this is a time-consuming process and can be skipped once completed.

---

In [None]:
# Define paths
gtx_full_gct_file_path = os.path.join(RAW_GTX_DIR, 'gtx_raw_data.gct')
gtx_full_csv_file_path = os.path.join(RAW_GTX_DIR, 'gtx_raw_data.csv')

##### **Step 2.1: Convert Raw GCT to Raw CSV file**
This block defines the `read_gct()` function, which skips the top two header lines from the `.gct` file and loads the remaining expression matrix into a DataFrame. The `convert_gct_to_csv()` function wraps this logic and saves the result as a temporary `.csv` file.
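
A GCT file starts with a version line and a dimensions line before the tab-separated table. The sketch below parses a tiny in-memory example and checks the declared dimensions against what was loaded. The function name `read_gct_checked` and the toy gene and sample names are hypothetical, and the dimension check is an addition that `read_gct()` does not perform.

```python
import io
import pandas as pd

# Toy GCT content: version line, "<genes>\t<samples>" line, then the table
gct_text = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "ENSG_A\tGENE_A\t0.5\t1.5\t2.5\n"
    "ENSG_B\tGENE_B\t3.0\t4.0\t5.0\n"
)

def read_gct_checked(handle):
    """Parse a GCT stream and verify the declared matrix dimensions."""
    version = handle.readline().strip()
    n_genes, n_samples = (int(x) for x in handle.readline().split())
    df = pd.read_csv(handle, sep="\t")
    # The two leading columns (Name, Description) are not samples
    expected = (n_genes, n_samples + 2)
    if df.shape != expected:
        raise ValueError(f"GCT {version}: expected {expected}, got {df.shape}")
    return df

toy = read_gct_checked(io.StringIO(gct_text))
print(toy.shape)
```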

In [None]:
def read_gct(file_path):
    with open(file_path, 'r') as f:
        # Skip the first two header lines
        for _ in range(2):
            next(f)
        # Read the rest of the file into a pandas DataFrame
        df = pd.read_csv(f, sep='\t')
    return df

In [None]:
def convert_gct_to_csv(gct_file, csv_file):
    df = read_gct(gct_file)
    df.to_csv(csv_file, index=False)

In [None]:
convert_gct_to_csv(gtx_full_gct_file_path, gtx_full_csv_file_path)

##### **Step 2.2: Select Healthy Samples Randomly to Create GTEx Dataset**
A fixed number of healthy samples (default: 1231) are randomly selected from the converted `.csv` GTEx matrix. The sample names are renamed using the format `h_0001`, `h_0002`, etc., and the final dataset is saved as `gtx_data.csv` in the `data/initial/` folder.
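
The renaming step can be illustrated on a toy table. This sketch also keeps a mapping from the original sample IDs to the new `h_xxxx` names, which the cell below does not store. The sample IDs, the fixed `selected` list (a stand-in for the random selection), and the `id_map` variable are all illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["g1", "g2"],
    "Description": ["GENE1", "GENE2"],
    "GTEX-AAA": [1.0, 2.0],
    "GTEX-BBB": [3.0, 4.0],
    "GTEX-CCC": [5.0, 6.0],
})

selected = ["GTEX-AAA", "GTEX-CCC"]  # stand-in for the random selection
subset = toy[["Name"] + selected].rename(columns={"Name": "gene_id"})

# Map each original sample ID to a sequential h_xxxx name
id_map = {old: f"h_{i:04d}" for i, old in enumerate(selected, start=1)}
subset = subset.rename(columns=id_map)

print(subset.columns.tolist())
print(id_map)
```

Keeping `id_map` makes it possible to trace a renamed sample back to its GTEx identifier later.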


In [None]:
gtx_number_of_samples_to_be_selected = 1231    # Matches the 1231 GDC cancer samples (1000 training + 231 testing)

In [None]:
gtx_full_csv_file_path = os.path.join(RAW_GTX_DIR, 'gtx_raw_data.csv')
gtx_full_df = pd.read_csv(gtx_full_csv_file_path, nrows=5)
columns = gtx_full_df.columns.tolist()

# The first two columns are 'Name' and 'Description', we keep them and sample the rest
gene_info_columns = columns[0:2]
sample_columns = columns[2:]

# Randomly select sample columns from the dataset
chosen_samples_columns = np.random.choice(sample_columns, gtx_number_of_samples_to_be_selected, replace=False).tolist()
final_samples_columns = gene_info_columns + chosen_samples_columns

# Load the dataset again but only with the selected columns
gtx_selected_data_df = pd.read_csv(gtx_full_csv_file_path, usecols=final_samples_columns)
gtx_selected_data_df = gtx_selected_data_df.rename(columns={'Name': 'gene_id'})
gtx_selected_data_df = gtx_selected_data_df.drop('Description', axis=1)

# Changing the sample ids to h_xxxx format
num_samples = gtx_selected_data_df.shape[1] - 1
new_sample_names = [f"h_{i:04d}" for i in range(1, num_samples + 1)]
gtx_selected_data_df.columns = [gtx_selected_data_df.columns[0]] + new_sample_names

# Save the sampled data to a new CSV file
gtx_selected_samples_file_path = os.path.join(INITIAL_DIR, 'gtx_data.csv')
gtx_selected_data_df.to_csv(gtx_selected_samples_file_path, index=False)

In [None]:
# Load and display the first few rows of the selected GTEx dataset
gtx_data_final_df = pd.read_csv(gtx_selected_samples_file_path, index_col=0)
print("Shape of initial GTEx Data Set", gtx_data_final_df.shape)
gtx_data_final_df.head(5)

##### **Step 2.3: Divide GTEx Data into Training and Testing Sets**

From the selected GTEx samples:
- `1000` samples are randomly chosen for model training
- The remaining `231` samples are used as unseen test data

Resulting files:
- `gtx_data_training.csv`
- `gtx_data_testing.csv`

Both are stored in `data/initial/`.
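
After splitting, a quick sanity check can confirm that no sample appears in both files and that none was dropped. This is a minimal sketch: the helper `check_split` is hypothetical and is run here on toy sample IDs instead of the saved CSV files.

```python
def check_split(all_columns, train_columns, test_columns):
    """Summarize whether a train/test column split is disjoint and complete."""
    train, test = set(train_columns), set(test_columns)
    return {
        "disjoint": train.isdisjoint(test),
        "complete": (train | test) == set(all_columns),
        "n_train": len(train),
        "n_test": len(test),
    }

all_ids = [f"h_{i:04d}" for i in range(1, 1232)]  # 1231 toy sample IDs
report = check_split(all_ids, all_ids[:1000], all_ids[1000:])
print(report)
```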

In [None]:
gtx_selected_samples_file_path_ = os.path.join(INITIAL_DIR, 'gtx_data.csv')
gtx_selected_samples_few_rows_df = pd.read_csv(gtx_selected_samples_file_path_, nrows=5)
gtx_columns = gtx_selected_samples_few_rows_df.columns.tolist()

# The first column is 'gene_id'; the remaining columns are samples
gtx_gene_info_columns = gtx_columns[0:1]
gtx_samples_columns = gtx_columns[1:]

# Split data into data for model building and training and unseen data for testing
gtx_total_num = len(gtx_samples_columns)
gtx_train_num = 1000
gtx_unseen_num = gtx_total_num - gtx_train_num      # The rest of samples will be used for testing as unseen data

# Randomly select gtx_train_num sample columns from the dataset as training data
gtx_chosen_columns_for_training = np.random.choice(gtx_samples_columns, gtx_train_num, replace=False).tolist()
gtx_final_columns_for_training = gtx_gene_info_columns + gtx_chosen_columns_for_training

# Select the rest of samples from the dataset as testing data
gtx_chosen_columns_for_testing = [col for col in gtx_samples_columns if col not in gtx_chosen_columns_for_training]
gtx_final_columns_for_testing = gtx_gene_info_columns + gtx_chosen_columns_for_testing

# Load the dataset again but only with the selected columns
gtx_train_df = pd.read_csv(gtx_selected_samples_file_path_, usecols=gtx_final_columns_for_training)
gtx_unseen_df = pd.read_csv(gtx_selected_samples_file_path_, usecols=gtx_final_columns_for_testing)

# Save the training and unseen testing data to new CSV files
gtx_training_file_path = os.path.join(INITIAL_DIR, 'gtx_data_training.csv')
gtx_testing_file_path = os.path.join(INITIAL_DIR, 'gtx_data_testing.csv')

gtx_train_df.to_csv(gtx_training_file_path, index=False)
gtx_unseen_df.to_csv(gtx_testing_file_path, index=False)

In [None]:
print("Shape of initial GTEx Training Data Set", gtx_train_df.shape)
gtx_train_df.head(5)

In [None]:
print("Shape of initial GTEx Testing Data Set", gtx_unseen_df.shape)
gtx_unseen_df.head(5)

---

### **Step 3: Create Final Data - Merge and Label Cancer and Non-Cancer**

---

##### **Step 3.1: Create Training Data (Combined Cancer and non-Cancer)**

This step merges the GTEx (non-cancer) and GDC (cancer) RNA-Seq datasets and prepares them for machine learning. The goal is to align the gene expression profiles from both sources, assign binary labels, and generate a unified dataset.

This step:
1. **Loads** `gtx_data_training.csv` and `gdc_data_training.csv` from `data/initial/`.
2. **Assigns labels**:
   - `0` for GTEx (healthy samples),
   - `1` for GDC (cancer samples).
3. **Aligns gene features** by keeping only the genes (rows) shared between GTEx and GDC.
4. **Combines** both datasets and **transposes** the result so that:
   - Each **row** is a sample,
   - Each **column** is a gene (TPM value).
5. **Separates** the combined data into a feature matrix and a corresponding label vector.

##### Output files (in `data/interim/`):
- `training_data_features.csv`: Combined matrix of all training samples with aligned gene features (samples × genes)
- `training_data_labels.csv`: Binary labels for each sample (0 = GTEx, 1 = GDC)

These files serve as input to the next steps (statistical feature selection and dimensionality reduction) and to the modeling pipelines.
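
The alignment and labeling can be sketched on toy data. This illustration builds the label vector separately instead of adding a `label` row before transposing, as the cell below does. The gene names, sample IDs, and values are made up.

```python
import pandas as pd

# Toy gene x sample matrices indexed by gene_id
healthy = pd.DataFrame({"h_0001": [1.0, 2.0, 3.0], "h_0002": [1.5, 2.5, 3.5]},
                       index=pd.Index(["g1", "g2", "g3"], name="gene_id"))
cancer = pd.DataFrame({"c_0001": [9.0, 8.0, 7.0]},
                      index=pd.Index(["g2", "g3", "g4"], name="gene_id"))

# Keep only the genes present in both sources
common_genes = healthy.index.intersection(cancer.index)

# Stack samples side by side, then transpose to samples x genes
features = pd.concat([healthy.loc[common_genes], cancer.loc[common_genes]],
                     axis=1).T

# 0 = healthy (GTEx), 1 = cancer (GDC), in the same order as the feature rows
labels = pd.Series([0] * healthy.shape[1] + [1] * cancer.shape[1],
                   index=features.index, name="label")

print(features)
print(labels.tolist())
```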

In [None]:
gtx_file_path = os.path.join(INITIAL_DIR, 'gtx_data_training.csv')
gdc_file_path = os.path.join(INITIAL_DIR, 'gdc_data_training.csv')

# Load the datasets with headers
gtx__df = pd.read_csv(gtx_file_path, header=0)
gdc__df = pd.read_csv(gdc_file_path, header=0)

# Ensure that the 'gene_id' column is the index for both datasets
gtx__df.set_index('gene_id', inplace=True)
gdc__df.set_index('gene_id', inplace=True)

# Add labels row: 0 for GTEx (healthy) and 1 for GDC (cancer)
gtx_labels = pd.DataFrame([0] * gtx__df.shape[1], index=gtx__df.columns, columns=['label']).transpose()
gdc_labels = pd.DataFrame([1] * gdc__df.shape[1], index=gdc__df.columns, columns=['label']).transpose()

# Concatenate labels and data
gtx__df = pd.concat([gtx_labels, gtx__df])
gdc__df = pd.concat([gdc_labels, gdc__df])

# Find common genes (rows)
common_genes = gtx__df.index.intersection(gdc__df.index)

# Filter both datasets to keep only the common genes
gtx_df_aligned = gtx__df.loc[common_genes]
gdc_df_aligned = gdc__df.loc[common_genes]

# Combine both datasets
combined_data = pd.concat([gtx_df_aligned, gdc_df_aligned], axis=1)

# Transpose the data to have samples as rows and genes as columns
combined_data = combined_data.transpose()

# Separate features and labels
labels = combined_data['label']
features = combined_data.drop(columns=['label'])

# Save preprocessed features and labels to CSV
output_file_prefix = os.path.join(INTERIM_DIR, 'training_data')
features.to_csv(output_file_prefix + '_features.csv', index=False)
labels.to_csv(output_file_prefix + '_labels.csv', index=False)

print("Data Merging Complete")
print("Shape of features:", features.shape)
print("Shape of labels:", labels.shape)

##### **Step 3.2: Create Unseen Testing Data (Combined Cancer and non-Cancer)**

This step merges the unseen test samples from GTEx and GDC to create the final test dataset. It:
- Loads `gtx_data_testing.csv` and `gdc_data_testing.csv` from `data/initial/`
- Aligns gene features to ensure consistency
- Assigns binary labels:
  - `0` for GTEx (healthy)
  - `1` for GDC (cancer)
- Combines the datasets into:
  - `testing_data_features.csv`: matrix of samples × genes
  - `testing_data_labels.csv`: corresponding label vector

These files are saved to `data/interim/` and used only for final evaluation of trained models.
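
Because the feature files are saved without sample or split metadata, a trained model relies on the test columns matching the training columns in both content and order. The sketch below shows one way to enforce that. It is an optional check, not part of the original pipeline, and the toy frames stand in for the saved feature files.

```python
import pandas as pd

# Toy feature matrices (samples x genes); the test columns are in another order
train_features = pd.DataFrame([[1.0, 2.0, 3.0]], columns=["g1", "g2", "g3"])
test_features = pd.DataFrame([[30.0, 10.0, 20.0]], columns=["g3", "g1", "g2"])

# Fail early if the test data lacks any gene the model was trained on
missing = train_features.columns.difference(test_features.columns)
if len(missing) > 0:
    raise ValueError(f"Test data lacks {len(missing)} training genes")

# Reorder the test columns to match the training column order
test_aligned = test_features[train_features.columns]
print(test_aligned.columns.tolist())
```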

In [None]:
gtx_test_file_path = os.path.join(INITIAL_DIR, 'gtx_data_testing.csv')
gdc_test_file_path = os.path.join(INITIAL_DIR, 'gdc_data_testing.csv')

# Load the datasets with headers
gtx_test__df = pd.read_csv(gtx_test_file_path, header=0)
gdc_test__df = pd.read_csv(gdc_test_file_path, header=0)

# Ensure that the 'gene_id' column is the index for both datasets
gtx_test__df.set_index('gene_id', inplace=True)
gdc_test__df.set_index('gene_id', inplace=True)

# Add labels row: 0 for GTEx (healthy) and 1 for GDC (cancer)
gtx_test_labels = pd.DataFrame([0] * gtx_test__df.shape[1], index=gtx_test__df.columns, columns=['label']).transpose()
gdc_test_labels = pd.DataFrame([1] * gdc_test__df.shape[1], index=gdc_test__df.columns, columns=['label']).transpose()

# Concatenate labels and data
gtx_test__df = pd.concat([gtx_test_labels, gtx_test__df])
gdc_test__df = pd.concat([gdc_test_labels, gdc_test__df])

# Find common genes (rows)
common_genes = gtx_test__df.index.intersection(gdc_test__df.index)

# Filter both datasets to keep only the common genes
gtx_test_df_aligned = gtx_test__df.loc[common_genes]
gdc_test_df_aligned = gdc_test__df.loc[common_genes]

# Combine both datasets
combined_data_test = pd.concat([gtx_test_df_aligned, gdc_test_df_aligned], axis=1)

# Transpose the data to have samples as rows and genes as columns
combined_data_test = combined_data_test.transpose()

# Separate features and labels
labels_test = combined_data_test['label']
features_test = combined_data_test.drop(columns=['label'])

# Save preprocessed features and labels to CSV
output_file_prefix = os.path.join(INTERIM_DIR, 'testing_data')
features_test.to_csv(output_file_prefix + '_features.csv', index=False)
labels_test.to_csv(output_file_prefix + '_labels.csv', index=False)

print("Data Merging Complete")
print("Shape of features:", features_test.shape)
print("Shape of labels:", labels_test.shape)