# TCGA Data Joining
---

Joining the preprocessed TCGA data from the Pancancer paper (https://www.ncbi.nlm.nih.gov/pubmed/29625048) into a single, clean dataset.

The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. This joint effort between the National Cancer Institute and the National Human Genome Research Institute began in 2006, bringing together researchers from diverse disciplines and multiple institutions.

## Importing the necessary packages

In [None]:
import os                                  # os handles directory/workspace changes
import torch                               # PyTorch to create and apply deep learning models

In [None]:
# Debugging packages
import pixiedust                           # Debugging in Jupyter Notebook cells

In [None]:
# Change to parent directory (presumably "Documents")
os.chdir("../../..")
# Path to the dataset files
data_path = 'data/TCGA-Pancancer/cleaned/'

In [None]:
import modin.pandas as pd                  # Optimized distributed version of Pandas
import data_utils as du                    # Data science and machine learning relevant methods

Allow pandas to show more columns and rows:

In [None]:
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 1000)

Set the random seed for reproducibility:

In [None]:
du.set_random_seed(42)

## Joining the normalized data

### Loading the data

In [None]:
rppa_df = pd.read_csv(f'{data_path}normalized/rppa.csv')
rppa_df.head()

In [None]:
rna_df = pd.read_csv(f'{data_path}normalized/rna.csv')
rna_df.head()

In [None]:
dna_mthltn_df = pd.read_csv(f'{data_path}normalized/dna_methylation.csv')
dna_mthltn_df.head()

In [None]:
mirna_df = pd.read_csv(f'{data_path}normalized/mirna.csv')
mirna_df.head()

In [None]:
copy_num_df = pd.read_csv(f'{data_path}normalized/copy_number_ratio.csv')
copy_num_df.head()

In [None]:
pur_plo_df = pd.read_csv(f'{data_path}normalized/purity_ploidy.csv')
pur_plo_df.head()

In [None]:
cdr_df = pd.read_csv(f'{data_path}normalized/clinical_outcome.csv')
cdr_df.head()

### Joining dataframes

#### Checking the length of the ID in the dataframes

In [None]:
rppa_df.sample_id.str.len().describe()

In [None]:
rppa_df[rppa_df.sample_id.str.len() == 19]

In [None]:
rna_df.sample_id.str.len().describe()

In [None]:
dna_mthltn_df.sample_id.str.len().describe()

In [None]:
mirna_df.sample_id.str.len().describe()

In [None]:
pur_plo_df.sample_id.str.len().describe()

In [None]:
copy_num_df.sample_id.str.len().describe()

In [None]:
cdr_df.sample_id.str.len().describe()
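
The joins below truncate the IDs to 19, 15 and 12 characters. As a rough illustration of what those prefixes correspond to, here is a small sketch in plain Python. It assumes the IDs follow the standard hyphen-separated TCGA barcode layout; the example barcode and the helper name `barcode_prefixes` are hypothetical and are not part of the notebook's data or of `data_utils`.

```python
def barcode_prefixes(barcode):
    """Split a TCGA-style barcode into the prefixes used as join keys."""
    return {
        # Project, tissue source site and participant, e.g. TCGA-02-0001
        'participant_id': barcode[:12],
        # Participant plus sample type, without the vial letter
        'sample_cpy_id': barcode[:15],
        # Participant, sample, vial and portion, without the analyte letter
        'sample_portion_id': barcode[:19],
    }


# Hypothetical barcode, used only to show where each prefix ends
example_barcode = 'TCGA-02-0001-01C-01D-0182-01'
prefixes = barcode_prefixes(example_barcode)
for name, value in prefixes.items():
    print(f'{name}: {value}')
```

Shorter prefixes identify coarser entities, so they match more rows across data types but can also map several samples to the same key.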

#### Joining RPPA with RNA data

In [None]:
tcga_df = rppa_df
tcga_df['sample_portion_id'] = tcga_df['sample_id'].str.slice(stop=19)
tcga_df[['sample_id', 'sample_portion_id']].head()

In [None]:
rna_df['sample_portion_id'] = rna_df['sample_id'].str.slice(stop=19)
rna_df[['sample_id', 'sample_portion_id']].head()

In [None]:
# Only 13 matches; ignore RPPA data, at least for now
tcga_df = tcga_df.merge(rna_df, how='inner', on='sample_portion_id')
tcga_df

#### Joining RNA with DNA Methylation data

In [None]:
# 0 matches
tcga_df = rna_df.merge(dna_mthltn_df, how='inner', on='sample_id')
tcga_df

#### Joining RNA with miRNA data

In [None]:
# Only 705 matches
tcga_df = rna_df.merge(mirna_df, how='inner', on='sample_id')
tcga_df

#### Joining RNA with purity/ploidy data

In [None]:
# 0 matches
tcga_df = rna_df.merge(pur_plo_df, how='inner', on='sample_id')
tcga_df

#### Joining DNA Methylation with miRNA data

In [None]:
# 0 matches
tcga_df = dna_mthltn_df.merge(mirna_df, how='inner', on='sample_id')
tcga_df

#### Joining DNA Methylation with purity/ploidy data

In [None]:
# 0 matches
tcga_df = dna_mthltn_df.merge(pur_plo_df, how='inner', on='sample_id')
tcga_df

#### Joining miRNA with purity/ploidy data

In [None]:
# 0 matches
tcga_df = mirna_df.merge(pur_plo_df, how='inner', on='sample_id')
tcga_df
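
Several of the joins above return no rows, which suggests that the full IDs differ in their trailing fields between data types. One way to probe this before merging is to count how many IDs two tables share at different prefix lengths. The sketch below uses plain pandas rather than modin, toy barcodes, and a hypothetical helper `count_shared_ids`; none of these come from the notebook.

```python
import pandas as pd


def count_shared_ids(left_ids, right_ids, n_chars=None):
    """Count distinct IDs present in both Series after truncating to n_chars."""
    left = left_ids.str.slice(stop=n_chars)
    right = right_ids.str.slice(stop=n_chars)
    return len(set(left) & set(right))


# Toy barcodes: the first entry of each Series differs only after character 19
left_ids = pd.Series(['TCGA-AA-0001-01A-11R-0001-07',
                      'TCGA-AA-0002-01A-11R-0001-07'])
right_ids = pd.Series(['TCGA-AA-0001-01A-11D-0002-01',
                       'TCGA-AA-0003-01A-11D-0002-01'])

# Number of shared IDs at full length and at each truncation used in this notebook
overlap_by_length = {n: count_shared_ids(left_ids, right_ids, n)
                     for n in (28, 19, 15, 12)}
print(overlap_by_length)
```

A jump in the overlap count at a given length is a hint that the prefix of that length is a usable join key.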

#### Joining RNA with copy number ratio data

In [None]:
tcga_df = rna_df
tcga_df['sample_cpy_id'] = tcga_df['sample_id'].str.slice(stop=15)
tcga_df[['sample_id', 'sample_cpy_id']].head()

In [None]:
copy_num_df['sample_cpy_id'] = copy_num_df['sample_id'].str.slice(stop=15)
copy_num_df[['sample_id', 'sample_cpy_id']].head()

In [None]:
# 9848 matches! Now that's more like it!
tcga_df = tcga_df.merge(copy_num_df, how='inner', on='sample_cpy_id')
tcga_df

#### Joining RNA with copy number ratio and with clinical data

In [None]:
tcga_df['participant_id'] = tcga_df['sample_id_x'].str.slice(stop=12)
tcga_df[['sample_id_x', 'participant_id']].head()

In [None]:
cdr_df['participant_id'] = cdr_df['sample_id'].str.slice(stop=12)
cdr_df[['sample_id', 'participant_id']].head()

In [None]:
# 9825 matches! Now that's more like it!
tcga_df = tcga_df.merge(cdr_df, how='inner', on='participant_id')
tcga_df
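
Merging sample-level rows with participant-level clinical data on a truncated key can silently drop or duplicate rows. The following sketch, on toy data with hypothetical column names and plain pandas instead of modin, shows how the `indicator` and `validate` arguments of `merge` can make the outcome of such a join visible.

```python
import pandas as pd

# Toy sample-level table: participant P1 has two samples
omics_df = pd.DataFrame({'participant_id': ['P1', 'P1', 'P2'],
                         'expression': [0.1, 0.2, 0.3]})
# Toy participant-level table: one row per participant
clinical_df = pd.DataFrame({'participant_id': ['P1', 'P3'],
                            'age': [60, 55]})

# validate raises an error if the right-hand keys are not unique;
# indicator adds a _merge column saying where each row came from
merged_df = omics_df.merge(clinical_df, how='outer', on='participant_id',
                           indicator=True, validate='many_to_one')
match_counts = merged_df['_merge'].value_counts().to_dict()
print(match_counts)
```

Rows tagged `both` are the ones an inner join would keep; the other two tags count what it would discard from each side.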

### Removing redundant ID columns

In [None]:
id_columns = [col for col in tcga_df if '_id' in col]
id_columns

In [None]:
tcga_df[id_columns].head()

In [None]:
id_columns.remove('participant_id')
tcga_df = tcga_df.drop(columns=id_columns)
tcga_df

In [None]:
id_columns = [col for col in tcga_df if '_id' in col]
id_columns

In [None]:
tcga_df[id_columns].head()

### Removing repeated patient data

To keep the machine learning models from overfitting to specific patients, we randomly select a single sample for each patient that has several, so that every patient contributes exactly one sample.

In [None]:
tcga_df[tcga_df.participant_id == 'TCGA-SR-A6MX']

In [None]:
tcga_df[tcga_df.participant_id == 'TCGA-SR-A6MX'].sample(n=1, random_state=du.random_seed)

In [None]:
n_samples_per_patient = tcga_df.participant_id.value_counts()
n_samples_per_patient

In [None]:
list(n_samples_per_patient.index)

In [None]:
oversampled_participants = [participant for participant in list(n_samples_per_patient.index)
                            if n_samples_per_patient[participant] > 1]
oversampled_participants

In [None]:
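# Shuffle the rows, keep the first row of each participant and restore the row order.
# Assigning a one-row sample back onto a multi-row mask would not drop the extra rows.
tcga_df = (tcga_df.sample(frac=1, random_state=du.random_seed)
                  .drop_duplicates(subset='participant_id')
                  .sort_index())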
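
As a small check of the one-sample-per-patient idea, the sketch below draws one random row per participant on toy data and verifies the result. It uses plain pandas `groupby(...).sample` rather than modin, and the IDs, values and seed are hypothetical.

```python
import pandas as pd

# Toy data: participant A has three samples, B has one, C has two
toy_df = pd.DataFrame({
    'participant_id': ['A', 'A', 'A', 'B', 'C', 'C'],
    'value': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Draw exactly one row from each participant's group
one_per_patient_df = (toy_df.groupby('participant_id')
                            .sample(n=1, random_state=42))

# Every participant should now appear exactly once
samples_per_patient = one_per_patient_df.participant_id.value_counts()
print(samples_per_patient)
```

Checking `value_counts()` after the selection is a cheap way to confirm that no participant still has more than one row.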

### Performing imputation

Checking for missing values:

In [None]:
du.search_explore.dataframe_missing_values(tcga_df)

Remove columns with too high a percentage of missing values (>40%):

In [None]:
tcga_df = du.data_processing.remove_cols_with_many_nans(tcga_df, nan_percent_thrsh=40, inplace=True)
du.search_explore.dataframe_missing_values(tcga_df)

Imputation:

In [None]:
tcga_df = du.data_processing.missing_values_imputation(tcga_df, method='zero',
                                                       id_column='participant_id', inplace=True)
tcga_df.head()
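
The internals of the `du` helpers are not shown in this notebook. As an assumption about the intended effect, the two steps above can be approximated in plain pandas: drop every column whose share of missing values exceeds the threshold, then fill the remaining gaps with zeros. The toy dataframe and variable names below are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: column 'a' has 20% missing values, column 'b' has 60%
toy_df = pd.DataFrame({
    'participant_id': [0, 1, 2, 3, 4],
    'a': [1.0, np.nan, 3.0, 4.0, 5.0],
    'b': [np.nan, np.nan, np.nan, 4.0, 5.0],
})

# Percentage of missing values in each column
nan_percent = toy_df.isna().mean() * 100

# Keep only the columns at or below the 40% threshold
kept_df = toy_df.loc[:, nan_percent <= 40]

# Zero imputation of the remaining missing values
imputed_df = kept_df.fillna(0)
print(imputed_df)
```

Zero imputation is only neutral when zero is a sensible default for the feature, for example after the data has been normalized to zero mean.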

### Saving the data

In [None]:
tcga_df.to_csv(f'{data_path}normalized/tcga.csv')

### Experimenting with tensor conversion

In [None]:
tcga_df = pd.read_csv(f'{data_path}normalized/tcga.csv')
tcga_df.head()

In [None]:
tcga_df.participant_id.value_counts()

In [None]:
tcga_df.dtypes

Remove the original string ID column and use the numeric one instead:

In [None]:
tcga_df = tcga_df.drop(columns=['participant_id'])
tcga_df = tcga_df.rename(columns={'Unnamed: 0': 'participant_id'})
tcga_df.head()
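
The rename above relies on `to_csv` writing the row index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. The sketch below reproduces that round trip in memory with plain pandas and a toy dataframe (hypothetical IDs and values), so nothing is written to disk.

```python
import io

import pandas as pd

toy_df = pd.DataFrame({'participant_id': ['TCGA-AA-0001', 'TCGA-AA-0002'],
                       'value': [1.0, 2.0]})

# Write to an in-memory buffer; the index is saved as a column without a name
buffer = io.StringIO()
toy_df.to_csv(buffer)
buffer.seek(0)

# On reload, the unnamed index column shows up as 'Unnamed: 0'
reloaded_df = pd.read_csv(buffer)
reloaded_columns = list(reloaded_df.columns)

# Swap the string ID for the numeric row index
numeric_id_df = (reloaded_df.drop(columns=['participant_id'])
                            .rename(columns={'Unnamed: 0': 'participant_id'}))
print(numeric_id_df)
```

The numeric ID obtained this way is simply the row index at save time, so it is only stable as long as the saved file is not regenerated with a different row order.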

Convert the label to a numeric format:

In [None]:
tcga_df.tumor_type_label.value_counts()

In [None]:
tcga_df['tumor_type_label'], label_dict = du.embedding.enum_categorical_feature(tcga_df, 'tumor_type_label', 
                                                                                nan_value=0, clean_name=False)
tcga_df.tumor_type_label.value_counts()

In [None]:
label_dict
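
How `du.embedding.enum_categorical_feature` assigns its codes is not shown here. As a rough plain-pandas analogue, under the assumption that missing labels map to 0 and known labels to positive integers, `pd.factorize` can produce a similar encoding. The toy labels and variable names are hypothetical.

```python
import pandas as pd

# Toy labels, including one missing value
labels = pd.Series(['BRCA', 'LUAD', None, 'BRCA'])

# factorize numbers the labels in order of appearance and marks missing values as -1
codes, uniques = pd.factorize(labels)

# Shift by one so that missing values become 0 and real labels start at 1
encoded_labels = codes + 1
label_to_code = {label: index + 1 for index, label in enumerate(uniques)}
print(encoded_labels, label_to_code)
```

Keeping the dictionary alongside the encoded column makes it possible to map model predictions back to tumor type names later.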

In [None]:
tcga_df.dtypes

In [None]:
tcga_tsr = torch.from_numpy(tcga_df.to_numpy())
tcga_tsr
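
`torch.from_numpy` only accepts arrays with a numeric or boolean dtype, so any remaining string column would make the conversion fail. A possible pre-check, sketched here with plain pandas on a toy dataframe with hypothetical column names, is to list the non-numeric columns and build a `float32` array from the rest.

```python
import pandas as pd

toy_df = pd.DataFrame({'participant_id': [0, 1],
                       'gene_expression': [0.5, 1.5],
                       'label_text': ['a', 'b']})

# Columns that would force an object array and break the tensor conversion
non_numeric_columns = list(toy_df.select_dtypes(exclude='number').columns)

# Numeric-only array with a single, tensor-friendly dtype
numeric_array = (toy_df.drop(columns=non_numeric_columns)
                       .to_numpy(dtype='float32'))
print(non_numeric_columns, numeric_array.dtype, numeric_array.shape)
```

An array built this way can be passed to `torch.from_numpy` directly; note that casting an integer ID column to `float32` loses exactness for very large ID values.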