# TCGA Data Joining
---

Joins the preprocessed TCGA data from the Pan-Cancer paper (https://www.ncbi.nlm.nih.gov/pubmed/29625048) into a single, clean dataset.

The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. This joint effort between the National Cancer Institute and the National Human Genome Research Institute began in 2006, bringing together researchers from diverse disciplines and multiple institutions.

## Importing the necessary packages

In [None]:
import os                                  # os handles directory/workspace changes
import torch                               # PyTorch to create and apply deep learning models
import numpy as np                         # NumPy, used here to set the random seed

In [None]:
# Path to the dataset files
data_path = 'cleaned/'

In [None]:
import modin.pandas as pd                  # Optimized distributed version of Pandas

Allow pandas to show more rows and columns:

In [None]:
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 1000)

Set the random seed for reproducibility:

In [None]:
np.random.seed(42)

## Joining the normalized data

### Loading the data

In [None]:
import pandas                              # Plain pandas, used below to read the CSV files

In [None]:
rna_df = pandas.read_csv(f'{data_path}normalized/rna.csv')
rna_df = rna_df.rename(columns={'Unnamed: 0': 'sample_id'})

rna_df.head()

In [None]:
copy_num_df = pandas.read_csv(f'{data_path}normalized/copy_number_ratio.csv')
copy_num_df.head()

In [None]:
cdr_df = pandas.read_csv(f'{data_path}normalized/clinical_outcome.csv')
cdr_df.head()

### Joining dataframes

#### Checking the length of the ID in the dataframes

In [None]:
rna_df.sample_id.str.len().describe()

In [None]:
copy_num_df.sample_id.str.len().describe()

In [None]:
cdr_df.sample_id.str.len().describe()
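
The ID lengths matter because the joins in this notebook use barcode prefixes as keys. The sketch below assumes the standard TCGA barcode layout, in which the first 12 characters identify the participant and the first 15 add the two-digit sample type code. The barcodes are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical barcodes, for illustration only
ids = pd.Series(['TCGA-02-0001-01C-01D-0182-01',
                 'TCGA-02-0001-10A-01D-0182-01'])

# First 12 characters: project, tissue source site, participant
participant_id = ids.str.slice(stop=12)

# First 15 characters: participant plus the two-digit sample type code
sample_id = ids.str.slice(stop=15)

print(participant_id.tolist())
print(sample_id.tolist())
```

Both toy barcodes share a participant prefix but differ in the sample prefix, which is why a participant-level join can return several rows for one patient.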

#### Joining RNA, copy number, and clinical outcome data

In [None]:
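# Match RNA and copy number rows on the first 15 characters of the sample ID
rna_df['sample_cpy_id'] = rna_df['sample_id'].str.slice(stop=15)
copy_num_df['sample_cpy_id'] = copy_num_df['sample_id'].str.slice(stop=15)
tcga_df = rna_df.merge(copy_num_df, how='inner', on='sample_cpy_id')

# Match clinical outcomes on the first 12 characters (the participant part of the ID);
# 'sample_id_x' is the RNA sample ID after the merge above
tcga_df['participant_id'] = tcga_df['sample_id_x'].str.slice(stop=12)
cdr_df['participant_id'] = cdr_df['sample_id'].str.slice(stop=12)

tcga_df = tcga_df.merge(cdr_df, how='inner', on='participant_id')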

# --- Clean up redundant ID columns ---
id_cols_to_drop = [col for col in tcga_df.columns if '_id' in col and col != 'participant_id']
tcga_df.drop(columns=id_cols_to_drop, inplace=True)

# --- One sample per patient ---
# groupby(...).sample draws one random row per participant and keeps the participant_id column
tcga_df = tcga_df.groupby('participant_id').sample(n=1, random_state=42).reset_index(drop=True)
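
The join and de-duplication steps can be sanity-checked on toy data. This is a minimal sketch, not the notebook's actual data: the barcodes and the `gene_a` and `cnv_a` columns are hypothetical. It also assumes sample-level keys are unique on both sides, which `validate='one_to_one'` enforces, and uses `groupby(...).sample` as one way to keep a single row per participant.

```python
import pandas as pd

# Hypothetical toy frames; column names and barcodes are made up
rna = pd.DataFrame({
    'sample_id': ['TCGA-AA-0001-01A', 'TCGA-AA-0001-02A', 'TCGA-AA-0002-01A'],
    'gene_a': [0.1, 0.2, 0.3],
})
cnv = pd.DataFrame({
    'sample_id': ['TCGA-AA-0001-01B', 'TCGA-AA-0001-02B', 'TCGA-AA-0003-01A'],
    'cnv_a': [1.0, 1.1, 0.9],
})

# Build the 15-character join key on both sides
rna['sample_cpy_id'] = rna['sample_id'].str.slice(stop=15)
cnv['sample_cpy_id'] = cnv['sample_id'].str.slice(stop=15)

# validate raises MergeError if a key is duplicated on either side
merged = rna.merge(cnv, how='inner', on='sample_cpy_id', validate='one_to_one')
merged['participant_id'] = merged['sample_id_x'].str.slice(stop=12)

# Keep one randomly chosen row per participant
one_per_patient = (
    merged.groupby('participant_id')
    .sample(n=1, random_state=42)
    .reset_index(drop=True)
)

print(len(merged), len(one_per_patient))
```

Only the two samples of participant `TCGA-AA-0001` appear in both toy frames, so the inner join keeps two rows and the de-duplication reduces them to one.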


### Performing imputation

Remove columns in which 40% or more of the values are missing:

In [None]:
nan_percent_thrsh = 40

# Calculate missing percentage per column
missing_percent = tcga_df.isnull().mean() * 100

# Keep only columns with less than threshold
tcga_df = tcga_df.loc[:, missing_percent < nan_percent_thrsh]
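
To see how the threshold filter behaves, here is a small sketch on a made-up frame. The column names are hypothetical and chosen only to show which side of the 40% threshold each column falls on.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 60%, 20%, and 0% missing values per column
toy = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0, 2.0],
    'some_missing': [np.nan, 1.0, 2.0, 3.0, 4.0],
    'complete': [1.0, 2.0, 3.0, 4.0, 5.0],
})

threshold = 40

# isnull().mean() gives the fraction of missing values per column
missing_percent = toy.isnull().mean() * 100

# Boolean column mask: keep columns strictly below the threshold
kept = toy.loc[:, missing_percent < threshold]

print(list(kept.columns))
```

Because the comparison is strict, a column with exactly 40% missing values would be dropped as well.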


Fill the remaining missing values with 0:

In [None]:
# Fill all missing values with 0; reassigning avoids modifying a slice of the filtered dataframe in place
tcga_df = tcga_df.fillna(0)

# Show the first few rows
tcga_df.head()
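
Filling with 0 applies to every column, including any non-numeric ones. If that is not desired, one possible variation (an assumption, not what this notebook does) is to restrict the fill to numeric columns. The column names below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one text column
toy = pd.DataFrame({
    'gene_a': [0.5, np.nan, 1.5],
    'tumor_type': ['BRCA', None, 'LUAD'],
})

# Select numeric columns only and fill their missing values with 0
numeric_cols = toy.select_dtypes(include='number').columns
toy[numeric_cols] = toy[numeric_cols].fillna(0)

print(toy)
```

In this variation the missing `tumor_type` entry stays missing and can be handled separately.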


### Saving the data

In [None]:
# --- Save merged dataset ---
tcga_df.to_csv(f'{data_path}normalized/tcga.csv', index=False)