#### Open Notebook in Colab
To open the notebook in Google Colab, click the link below:
[Open in Colab](https://colab.research.google.com/github/JunetaeKim/GCSP-HBDA/blob/main/Week2/Lecture1.2.ipynb)

### Topic: Processing and Merging TCGA Data
##### Goals: Process and merge TCGA (The Cancer Genome Atlas) clinical, cancer-type, and gene-count data.
##### We'll work through each step, from importing the necessary libraries to saving the final merged dataset.

<img src="https://github.com/JunetaeKim/GCSP-HBDA/raw/main/Week2/figure1.2.1.png" alt="Image" width="600"/>

#### 1. Importing Libraries 📚
##### Objective: Load the necessary Python libraries for data processing.

In [None]:
import pandas as pd  # Essential for data manipulation and analysis
import numpy as np  # Useful for numerical operations
import requests  # Allows us to make HTTP requests to download data
import os  # Helps in interacting with the operating system
import warnings  # Manages warning messages
import sys  # Provides system-specific parameters and functions

#### Explanation: 
##### We start by importing several libraries. 
##### pandas handles the tabular data, and numpy provides numerical operations.
##### requests downloads the data files over HTTP.
##### os handles file and directory operations, warnings controls warning messages, and sys exposes interpreter-level settings.

#### 2. Setting File Paths 📁
##### Objective: Define where our data files will be stored and ensure the necessary directories exist.

In [None]:
# Specify relative paths.
FILE_DIR = './'
DATA_DIR = os.path.join(FILE_DIR, "SourceData")
RSUBREAD_FOLDER = os.path.join(DATA_DIR, "rsubread")

# Create directories if they do not exist.
if not os.path.exists(RSUBREAD_FOLDER):
    os.makedirs(RSUBREAD_FOLDER)

# Specify paths for the data files.
clinical_variables_path = os.path.join(RSUBREAD_FOLDER, "clinical_variables.txt.gz")
cancer_type_path = os.path.join(RSUBREAD_FOLDER, "cancer_types.txt.gz")
rsubread_gene_counts_path = os.path.join(RSUBREAD_FOLDER, "gene_counts.txt.gz")

#### Explanation: 
##### Here, we set up the file paths.
##### os.path.join helps in creating platform-independent paths. 
##### We also check if the RSUBREAD_FOLDER directory exists, and if not, we create it to ensure our data has a place to be stored.
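
##### As a small illustration of the two calls above, the sketch below builds a nested path inside a temporary directory and creates it.
##### The temporary directory and the names base_dir, demo_folder, and demo_file are placeholders for this example only.
##### os.makedirs(..., exist_ok=True) is shown as an alternative to the explicit os.path.exists check.

```python
import os
import tempfile

# Hypothetical base directory, used so the example leaves the working directory alone.
base_dir = tempfile.mkdtemp()
demo_folder = os.path.join(base_dir, "SourceData", "rsubread")

# exist_ok=True makes the call safe to repeat: no error if the folder is already there.
os.makedirs(demo_folder, exist_ok=True)
os.makedirs(demo_folder, exist_ok=True)

demo_file = os.path.join(demo_folder, "clinical_variables.txt.gz")
print(os.path.isdir(demo_folder))
print(os.path.basename(demo_file))
```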

#### 3. Downloading Data 💾
##### Objective: Download the necessary data files if they are not already present.

In [None]:
print("____ Downloading data ____ \n")

# Clinical Variables
if not os.path.exists(clinical_variables_path):
    print("Started Download of Clinical Variables...")
    clinical_variables_url = r"https://ftp.ncbi.nlm.nih.gov/geo/series/GSE62nnn/GSE62944/suppl/GSE62944%5F06%5F01%5F15%5FTCGA%5F24%5F548%5FClinical%5FVariables%5F9264%5FSamples%2Etxt%2Egz"
    r = requests.get(clinical_variables_url)
    r.raise_for_status()  # Stop on an HTTP error instead of saving an error page
    with open(clinical_variables_path, "wb") as f:
        f.write(r.content)
    print("Done.")
else:
    print("Raw data exists. Skipping Download.")

# Cancer Types
if not os.path.exists(cancer_type_path):
    print("Started Download of Cancer Types...")
    cancer_type_url = r"https://ftp.ncbi.nlm.nih.gov/geo/series/GSE62nnn/GSE62944/suppl/GSE62944%5F06%5F01%5F15%5FTCGA%5F24%5FCancerType%5FSamples%2Etxt%2Egz"
    r = requests.get(cancer_type_url)
    r.raise_for_status()  # Stop on an HTTP error instead of saving an error page
    with open(cancer_type_path, "wb") as f:
        f.write(r.content)
    print("Done.")

# Gene Counts
if not os.path.exists(rsubread_gene_counts_path):
    print("Started Download of Gene Counts...")
    rsubread_gene_counts_url = r"https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1536nnn/GSM1536837/suppl/GSM1536837%5F06%5F01%5F15%5FTCGA%5F24%2Etumor%5FRsubread%5FFeatureCounts%2Etxt%2Egz"
    r = requests.get(rsubread_gene_counts_url)
    r.raise_for_status()  # Stop on an HTTP error instead of saving an error page
    with open(rsubread_gene_counts_path, "wb") as f:
        f.write(r.content)
    print("Done.")

#### Explanation: 
##### In this step, we download the clinical variables, cancer types, and gene counts data if they are not already present in the specified paths. 
##### The requests.get method fetches the data from the provided URLs, and the content is written to the respective file paths.
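
##### The download-if-missing pattern above repeats three times, so it could be wrapped in one helper.
##### The sketch below is one possible version, not the notebook's own code: download_if_missing is a hypothetical name, and it uses the standard library's urllib.request instead of requests so that it can be demonstrated on a local file:// URL without network access.
##### It streams the response to disk in chunks, which may help with a large file such as the gene counts.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_if_missing(url, dest_path):
    """Download url to dest_path unless the file already exists.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(dest_path):
        return False
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file at dest_path.
    tmp_path = dest_path + ".part"
    with urllib.request.urlopen(url) as response, open(tmp_path, "wb") as f:
        shutil.copyfileobj(response, f)  # copies in chunks, not all at once
    os.replace(tmp_path, dest_path)
    return True


# Demonstration on a local file:// URL (made-up content, no network needed).
demo_dir = tempfile.mkdtemp()
source = Path(demo_dir) / "source.txt"
source.write_bytes(b"example payload")
target = os.path.join(demo_dir, "copy.txt")

first = download_if_missing(source.as_uri(), target)   # file is fetched
second = download_if_missing(source.as_uri(), target)  # file exists, so skipped
print(first, second)
```

##### With real data, the same helper would be called once per URL and path pair defined in the cell above.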



#### 4. Reading Data 📄
##### Objective: Load the downloaded data into pandas DataFrames for further processing.

In [None]:
print("Opening downloaded data...")

clinical_variables = pd.read_csv(clinical_variables_path, sep="\t", compression="gzip")
cancer_types = pd.read_csv(cancer_type_path, sep="\t", header=0, names=["patient_id", "tumor_type"], compression="gzip")
gene_counts = pd.read_csv(rsubread_gene_counts_path, sep="\t", compression="gzip")


#### Explanation: 
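##### We use pd.read_csv with sep="\t" and compression="gzip" to read the gzipped, tab-separated files directly into pandas DataFrames.
##### For the cancer-type file, names= assigns the column names patient_id and tumor_type.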
##### This step makes it easy to manipulate and analyze the data in subsequent steps.
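
##### To see what pd.read_csv does with this kind of file, the sketch below writes a tiny gzipped, tab-separated table and reads it back.
##### The file contents are made up, and the blank first header cell is an assumption about the real files, suggested by the "Unnamed: 0" column name used in the next section.

```python
import gzip
import os
import tempfile

import pandas as pd

# Made-up table: genes in rows, samples in columns, blank first header cell.
demo_path = os.path.join(tempfile.mkdtemp(), "toy_counts.txt.gz")
with gzip.open(demo_path, "wt") as f:
    f.write("\tS1\tS2\nGENE_A\t5\t7\nGENE_B\t0\t3\n")

toy = pd.read_csv(demo_path, sep="\t", compression="gzip")

# pandas labels the header cell that had no name "Unnamed: 0".
print(list(toy.columns))
print(toy.shape)
```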

#### 5. Data Preprocessing and Merging 🔄
##### 5.1 Processing Clinical Variables 🩺
##### Objective: Clean and preprocess the clinical variables data.



In [None]:
# Drop unnecessary columns and set index
clinical_variables = clinical_variables.drop(columns=["Unnamed: 1", "Unnamed: 2"])
clinical_variables.set_index("Unnamed: 0", inplace=True)

# Select the needed variables (they are rows at this point),
# then transpose so that each row is a patient
clinical_variables = clinical_variables.loc[["vital_status", "last_contact_days_to", "death_days_to", 'gender', 'birth_days_to', 'race'], :]
clinical_variables = clinical_variables.T

# Drop rows with missing values in specific columns
clinical_variables = clinical_variables.dropna(subset=["vital_status"])
clinical_variables = clinical_variables.dropna(subset=["last_contact_days_to", "death_days_to", "race"])

# Filter out invalid data
clinical_variables = clinical_variables.loc[clinical_variables.vital_status != "[Not Available]", :]
clinical_variables = clinical_variables.loc[(clinical_variables.birth_days_to != "[Not Available]") & (clinical_variables.birth_days_to != '[Completed]'), :]
clinical_variables = clinical_variables.loc[clinical_variables.gender != "[Not Available]", :]

# Filter specific races
races_to_keep = [
    'AMERICAN INDIAN OR ALASKA NATIVE',
    'ASIAN',
    'BLACK OR AFRICAN AMERICAN',
    'NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER',
    'WHITE'
]
clinical_variables = clinical_variables[clinical_variables['race'].isin(races_to_keep)]

# Convert birth days to age and gender to binary
clinical_variables['birth_days_to'] = clinical_variables['birth_days_to'].astype('float32') * -1 / 365.
clinical_variables['gender'] = pd.get_dummies(clinical_variables['gender'])['MALE']

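# Calculate survival time: days to death for deceased patients,
# days to last contact for living patients (-1 marks any other status).
# .loc[mask, column] avoids chained assignment, which may not modify the DataFrame.
clinical_variables["time"] = pd.Series(-1, index=clinical_variables.index, dtype=object)
mask = clinical_variables["vital_status"] == "Dead"
clinical_variables.loc[mask, "time"] = clinical_variables.loc[mask, "death_days_to"]
mask = clinical_variables["vital_status"] == "Alive"
clinical_variables.loc[mask, "time"] = clinical_variables.loc[mask, "last_contact_days_to"]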

# Filter usable data points
mask = (clinical_variables.time != "[Not Available]") & (clinical_variables.time != "[Discrepancy]") & (clinical_variables.time != "[Completed]")
clinical_variables = clinical_variables.loc[mask]

# Convert time to numeric and filter non-positive values
clinical_variables.time = pd.to_numeric(clinical_variables.time)
clinical_variables = clinical_variables.loc[clinical_variables.time > 0]

# Set event indicator: True if the patient died, False if still alive.
# Every remaining row is "Dead" or "Alive": other statuses kept time = -1 and were dropped above.
clinical_variables["event"] = clinical_variables["vital_status"] == "Dead"

# Select and rename columns
clinical_variables = clinical_variables.loc[:, ["time", "event", 'gender', 'birth_days_to', 'race']]
clinical_variables.reset_index(inplace=True)
clinical_variables.rename(columns={"index": "patient_id", 'birth_days_to': 'age'}, inplace=True)

print("Done.")

#### Explanation: This section involves extensive data cleaning:

##### 1. Dropping unnecessary columns and rows with missing values to ensure we only work with relevant and complete data.
##### 2. Filtering data based on specific criteria such as valid values and certain races.
##### 3. Transforming data by converting birth days to age and encoding gender as binary.
##### 4. Calculating survival time and setting event indicators for further analysis.
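
##### The transformation steps are easier to follow on a few rows, so here is a minimal sketch on made-up records.
##### It is a compact variant, not the cell above: it picks the time column with Series.where and relies on pd.to_numeric(errors="coerce") to turn placeholder strings into NaN, instead of filtering each placeholder by name.
##### The patient IDs and values are invented for illustration.

```python
import pandas as pd

# Made-up records; the placeholder strings are ones filtered in the cell above.
toy = pd.DataFrame(
    {
        "vital_status": ["Dead", "Alive", "Alive", "Dead"],
        "death_days_to": ["120", "[Not Available]", "[Not Available]", "[Discrepancy]"],
        "last_contact_days_to": ["[Not Available]", "800", "0", "[Not Available]"],
        "birth_days_to": ["-21900", "-14600", "-18250", "-25550"],
        "gender": ["MALE", "FEMALE", "FEMALE", "MALE"],
    },
    index=["P1", "P2", "P3", "P4"],
)

is_dead = toy["vital_status"] == "Dead"

# Days to death for deceased patients, days to last contact for the rest.
raw_time = toy["death_days_to"].where(is_dead, toy["last_contact_days_to"])

# errors="coerce" turns any non-numeric placeholder into NaN.
toy["time"] = pd.to_numeric(raw_time, errors="coerce")
toy["event"] = is_dead

# birth_days_to is a negative day count here, so flip the sign and divide by 365.
toy["age"] = pd.to_numeric(toy["birth_days_to"]) * -1 / 365.0
toy["gender"] = (toy["gender"] == "MALE").astype(int)

# NaN > 0 is False, so unusable and non-positive times are dropped together.
toy = toy.loc[toy["time"] > 0, ["time", "event", "gender", "age"]]
print(toy)
```

##### Only P1 and P2 survive: P3 has a follow-up time of 0 and P4 has no usable time.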

##### 5.2 Merging Data 🔗
##### Objective: Combine the clinical variables, cancer types, and gene counts into a single dataset.



In [None]:
print("Merging with cancer types.")
patients = pd.merge(cancer_types, clinical_variables, on=["patient_id"])
print("Done.")

print("Merging with gene counts.")
gene_counts.set_index("Unnamed: 0", inplace=True)
gene_counts = gene_counts.T
gene_counts.reset_index(inplace=True)
gene_counts.rename(columns={"index": "patient_id"}, inplace=True)
print("Done.")

print("Merging all together.")
full_data = pd.merge(patients, gene_counts, on=["patient_id"])
print("Done.")

#### Explanation: 
##### This part involves merging different datasets:
##### 1. Merging clinical variables with cancer types to create a combined patient dataset.
##### 2. Preparing gene counts by transposing the table so that each row is a sample, then exposing the sample ID as a patient_id column.
##### 3. Merging the patient dataset with the gene counts on patient_id to create the final dataset, which keeps only patients present in both tables.
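
##### The sketch below replays the transpose-and-merge steps on made-up data to show which rows survive.
##### pd.merge defaults to an inner join; how="inner" is written out here, validate="one_to_one" is added as an optional check for duplicate IDs, and the outer merge with indicator=True is an extra diagnostic that the notebook itself does not run.

```python
import pandas as pd

# Made-up patient table and gene-count table (genes in rows, samples in columns).
patients = pd.DataFrame(
    {"patient_id": ["P1", "P2", "P3"], "tumor_type": ["BRCA", "LUAD", "BRCA"]}
)
counts = pd.DataFrame(
    {"Unnamed: 0": ["GENE_A", "GENE_B"], "P1": [5, 0], "P2": [7, 3], "P4": [1, 1]}
)

# Transpose so each row is a sample, then expose the sample ID as a column.
counts = counts.set_index("Unnamed: 0").T
counts = counts.reset_index().rename(columns={"index": "patient_id"})

# Inner join: only IDs present in both tables are kept.
merged = pd.merge(patients, counts, on="patient_id", how="inner", validate="one_to_one")

# Outer join with an indicator column, to see what the inner join discarded.
audit = pd.merge(patients, counts, on="patient_id", how="outer", indicator=True)

print(merged["patient_id"].tolist())
print(audit["_merge"].value_counts().to_dict())
```

##### Here P3 has no gene counts and P4 has no clinical record, so neither appears in merged.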

#### 6. Saving the Merged Data 💾
##### Objective: Save the final merged dataset for future use.

In [None]:
print("Saving merged data...")
full_data.to_pickle(os.path.join(RSUBREAD_FOLDER, "complete_data_merged.pickle"))

#### Explanation: 
##### Finally, we save the merged dataset to a file using the pickle format. 
##### This allows us to easily load and reuse the data later without having to repeat the preprocessing steps.
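
##### Loading the file back is a single call to pd.read_pickle.
##### The sketch below round-trips a small made-up DataFrame through a temporary file; the file name toy_merged.pickle is a placeholder.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the merged dataset.
toy = pd.DataFrame(
    {"patient_id": ["P1", "P2"], "time": [120.0, 800.0], "event": [True, False]}
)

demo_path = os.path.join(tempfile.mkdtemp(), "toy_merged.pickle")
toy.to_pickle(demo_path)
restored = pd.read_pickle(demo_path)

# Values and column dtypes both come back unchanged.
print(restored.equals(toy))
print((restored.dtypes == toy.dtypes).all())
```

##### Pickle files can execute code when loaded, so only read pickles that come from a source you trust.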