# **[Project] Cancer Subtype Classification**

# Introduction

The [TCGA Kidney Cancers Dataset](https://archive.ics.uci.edu/dataset/892/tcga+kidney+cancers) is a bulk RNA-seq dataset that contains transcriptome profiles (i.e., gene expression quantification data) of patients diagnosed with one of three subtypes of kidney cancer.
This dataset can be used to predict the kidney cancer subtype of a patient from the normalized transcriptome profile data.

The normalized transcriptome profile data is given as **TPM** and **FPKM** for each gene.

> TPM (Transcripts Per Million) and FPKM (Fragments Per Kilobase Million) are two common methods for quantifying gene expression in RNA sequencing data.
> They both aim to account for the differences in sequencing depth and transcript length when estimating gene expression levels.
>
> **TPM** (Transcripts Per Million):
> - TPM is a measure of gene expression that normalizes for both library size (sequencing depth) and transcript length.
> - The main idea behind TPM is to express the abundance of a transcript relative to the total number of transcripts in a sample, scaled to one million.
>
> **FPKM** (Fragments Per Kilobase Million):
> - FPKM is another method for quantifying gene expression, which is commonly used in older RNA-seq analysis pipelines. It's similar in concept to TPM but differs in the way it's calculated.
> - FPKM also normalizes for library size and transcript length, but it measures gene expression as the number of fragments (i.e., read pairs in paired-end sequencing) per kilobase of exon model per million mapped fragments.
>
> Because TPM values sum to one million in every sample, they are easier to compare across samples than FPKM values, which makes TPM a preferred choice in many modern RNA-seq analysis workflows.
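
As a rough illustration of the two definitions above, the following sketch computes FPKM and TPM for a single sample. The fragment counts, gene lengths, and variable names are made up for this example, and details such as how gene length is defined vary between pipelines.

```python
import numpy as np

# Hypothetical toy data: fragment counts for 3 genes in one sample,
# with gene lengths in base pairs (all numbers are made up).
counts = np.array([100.0, 200.0, 700.0])
lengths_bp = np.array([1_000.0, 2_000.0, 4_000.0])

lengths_kb = lengths_bp / 1_000.0

# FPKM: fragments per kilobase of gene length, per million fragments in the sample.
fpkm = counts / lengths_kb / (counts.sum() / 1e6)

# TPM: normalize by length first, then rescale so the sample sums to one million.
rate = counts / lengths_kb
tpm = rate / rate.sum() * 1e6

print("FPKM:", fpkm)
print("TPM: ", tpm)
print("TPM sum:", tpm.sum())
```

Within one sample the two measures differ only by a constant factor, so TPM can also be obtained by rescaling the FPKM values to sum to one million.
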

We provide one dataset for each kidney cancer subtype:

- [TCGA-KICH](https://portal.gdc.cancer.gov/projects/TCGA-KICH): kidney chromophobe (chromophobe renal cell carcinoma)
- [TCGA-KIRC](https://portal.gdc.cancer.gov/projects/TCGA-KIRC): kidney renal clear cell carcinoma
- [TCGA-KIRP](https://portal.gdc.cancer.gov/projects/TCGA-KIRP): kidney renal papillary cell carcinoma

> This and _much_ more data is openly available on the [NCI Genomic Data Commons (GDC) Data Portal](https://portal.gdc.cancer.gov/).

# Data access

There are two ways to access the data: via the TNT homepage or the GDC Data Portal.

## Download from the TNT homepage (_recommended_)

The download from the TNT homepage is straightforward:

In [2]:
#! wget http://www.tnt.uni-hannover.de/edu/vorlesungen/AMLG/data/project-cancer-classification.tar.gz
#! tar -xzvf project-cancer-classification.tar.gz
#! mv -v project-cancer-classification/ data/
#! rm -v project-cancer-classification.tar.gz

In the `data/` folder you will now find many files in the [TSV format](https://en.wikipedia.org/wiki/Tab-separated_values) ([CSV](https://en.wikipedia.org/wiki/Comma-separated_values)-like with tabs as delimiter) containing the normalized transcriptome profile data.

To start, you can read a TSV file into a [pandas](https://pandas.pydata.org) [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) using the [`pandas.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas-read-csv) function with the `sep` parameter set to `\t`:
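
A minimal sketch of such a call is shown here on an in-memory stand-in for one file, so that it runs without the download. The file layout used (one comment line, then a header row, then summary rows without TPM values) is an assumption based on how the files are read later in this notebook, and the gene identifiers are placeholders.

```python
import io

import pandas as pd

# Hypothetical miniature of one gene-expression file (placeholder values).
tsv_text = (
    "# gene-model: EXAMPLE\n"
    "gene_id\tgene_name\ttpm_unstranded\n"
    "N_unmapped\t\t\n"
    "ENSG_EXAMPLE_1\tGENE_A\t8.2367\n"
    "ENSG_EXAMPLE_2\tGENE_B\t0.0\n"
)

# sep="\t" selects the tab as delimiter; header=1 takes the column
# names from the second line and ignores the line before it.
df = pd.read_csv(io.StringIO(tsv_text), sep="\t", header=1)
print(df)
```

For a real file, pass its path in place of the `io.StringIO` object.
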

## Download from the GDC Data Portal

The data can also be accessed via the GDC Data Portal.

A convenient way to download multiple files from the GDC Data Portal is to use a manifest file generated by the portal.
After generating a manifest file, initiate the download using the [GDC Data Transfer Tool](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool) by supplying the `-m` or `--manifest` option, followed by the location and name of the manifest file.

We provide the following manifest files (in the `gdc-data-portal` folder) for the datasets:

- `gdc_manifest.tcga-kich-geq.txt` (91 files)
- `gdc_manifest.tcga-kirc-geq.txt` (614 files)
- `gdc_manifest.tcga-kirp-geq.txt` (323 files)
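
The file counts listed above can be checked before starting a download. The sketch below assumes that a manifest is a tab-separated table with one header row and one row per file; the helper name `count_manifest_entries` and the two-row example manifest are hypothetical.

```python
import io

import pandas as pd

# Hypothetical two-file manifest; column names and values are placeholders.
manifest_text = (
    "id\tfilename\n"
    "uuid-1\tsample-1.rna_seq.augmented_star_gene_counts.tsv\n"
    "uuid-2\tsample-2.rna_seq.augmented_star_gene_counts.tsv\n"
)


def count_manifest_entries(source):
    """Return the number of data rows in a tab-separated manifest."""
    return len(pd.read_csv(source, sep="\t"))


print(count_manifest_entries(io.StringIO(manifest_text)))
```

For the real manifests, pass the file path (for example one of the files in the `gdc-data-portal` folder) and compare the result with the expected count.
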

> We also provide `metadata.*.json` files containing extensive dataset metadata.

Assuming that the GDC Data Transfer Tool is available as `gdc-client`, the following commands can be used to download the data.

```shell
mkdir --parents data/tcga-kich-geq/
mkdir --parents data/tcga-kirc-geq/
mkdir --parents data/tcga-kirp-geq/

gdc-client download --manifest gdc_manifest.tcga-kich-geq.txt --dir data/tcga-kich-geq/
gdc-client download --manifest gdc_manifest.tcga-kirc-geq.txt --dir data/tcga-kirc-geq/
gdc-client download --manifest gdc_manifest.tcga-kirp-geq.txt --dir data/tcga-kirp-geq/
```

In [3]:
### In the following sections, we concatenate the TPM data of all patients and cancer types into a single DataFrame.
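
The per-subtype steps in the following cells can also be condensed into one reusable function. This is only a sketch under stated assumptions: the helper name `load_tpm_matrix` is hypothetical, the file layout is inferred from the `read_csv` calls in this notebook, and the demo files are made up so that the code runs without the real data. Unlike the cells below, it aligns samples on the `gene_id` column, since identifiers are a safer key than gene names if a name occurs more than once.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_tpm_matrix(folder, label):
    """Build a samples x genes TPM table from every */*.tsv file below `folder`."""
    columns = []
    for tsv_path in sorted(Path(folder).glob("*/*.tsv")):
        table = pd.read_csv(tsv_path, sep="\t", header=1)
        # Summary rows (e.g. N_unmapped) carry no TPM value, so drop them.
        table = table.dropna(subset=["tpm_unstranded"])
        sample = table.set_index("gene_id")["tpm_unstranded"]
        columns.append(sample.rename(tsv_path.parent.name))
    # One column per sample, then transpose to one row per sample.
    matrix = pd.concat(columns, axis=1).T.reset_index(drop=True)
    matrix["cancer_type"] = label
    return matrix


# Demo on two made-up sample folders (placeholder genes and values).
with tempfile.TemporaryDirectory() as tmp:
    for name, values in {"sample-a": (1.5, 0.0), "sample-b": (2.5, 4.0)}.items():
        sample_dir = Path(tmp) / name
        sample_dir.mkdir()
        (sample_dir / "counts.tsv").write_text(
            "# comment line\n"
            "gene_id\tgene_name\ttpm_unstranded\n"
            "N_unmapped\t\t\n"
            f"G1\tGENE_A\t{values[0]}\n"
            f"G2\tGENE_B\t{values[1]}\n"
        )
    demo = load_tpm_matrix(tmp, "kidney chromophobe")

print(demo)
```

Calling the function once per subtype folder and concatenating the three results would give the same kind of table as the cells below, without the placeholder column.
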

In [4]:
import pandas as pd
import os
FILENAME = "original_tpm_data.csv"

In [5]:
tsv_file_path = "data/tcga-kich-geq/00ddf8c2-039f-409f-a2ed-b29e18395dd4/f07b7c4c-5f30-4c51-9eb1-4f873ad49c56.rna_seq.augmented_star_gene_counts.tsv"
main_folder_path_kich = "data/tcga-kich-geq"
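# Placeholder frame that the per-sample TPM columns are concatenated onto
dfs = pd.DataFrame({'Column1': [1, 2, 3, 4, 5]})

# Read one reference TSV file; header=1 takes the column names from the second line
df = pd.read_csv(tsv_file_path, sep='\t', header=1)

# Keep the values of the second column to label the genes after transposing
second_column_values = df.iloc[:, 1].tolist()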

for subfolder in os.listdir(main_folder_path_kich):
    subfolder_path = os.path.join(main_folder_path_kich, subfolder)
    if os.path.isdir(subfolder_path):

        # Iterate through all the files in the subfolder
        for filename in os.listdir(subfolder_path):
            if filename.endswith(".tsv"):  # Consider only TSV files; adjust the condition for other file types

                file_path = os.path.join(subfolder_path, filename)
                df1 = pd.read_csv(file_path, sep='\t', header=1)
                df1 = df1['tpm_unstranded']

                dfs = pd.concat([dfs, df1], axis=1)
               
print(dfs)

       Column1  tpm_unstranded  tpm_unstranded  tpm_unstranded  \
0          1.0             NaN             NaN             NaN   
1          2.0             NaN             NaN             NaN   
2          3.0             NaN             NaN             NaN   
3          4.0             NaN             NaN             NaN   
4          5.0          8.2367          4.0204         30.3101   
...        ...             ...             ...             ...   
60659      NaN          0.0000          0.0000          0.0000   
60660      NaN          3.6421          2.4549         10.8131   
60661      NaN          0.0000          0.0000          0.0000   
60662      NaN          0.0035          0.0944          0.0336   
60663      NaN          0.7835          0.4324          1.5405   

       tpm_unstranded  tpm_unstranded  tpm_unstranded  tpm_unstranded  \
0                 NaN             NaN             NaN             NaN   
1                 NaN             NaN             NaN        

In [6]:
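# Transpose so that each row is one sample and each column is one gene
transposed = dfs.transpose()
# Label the columns with the values read from the reference file
transposed.columns = second_column_values
# Drop the first four columns, which hold no TPM values
transposed = transposed.drop(columns=transposed.columns[:4])
transposed['cancer_type'] = 'kidney chromophobe'
# Drop the first row, which holds the placeholder 'Column1' values
transposed = transposed.drop(transposed.index[0])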

# Reset the index if needed
transposed = transposed.reset_index(drop=True)
print(transposed)

      TSPAN6    TNMD      DPM1    SCYL3  C1orf112      FGR      CFH    FUCA2  \
0     8.2367  0.0000   25.8092   0.9317    0.2657   2.9539   1.4042  15.7151   
1     4.0204  0.1582   16.6135   0.5691    0.1217   1.4957   0.1268  10.0106   
2    30.3101  0.0000  194.5276   4.1172    4.2266   9.3357  11.7886  63.2781   
3   192.6733  2.2687  166.1158  13.8005    2.6671   9.7596   9.5741  62.3076   
4    29.3118  0.2938   56.6818   1.8314    0.4773   0.5403   0.5084  25.3080   
..       ...     ...       ...      ...       ...      ...      ...      ...   
86   33.4924  1.4589  103.9655   3.1484    0.9286   2.0726   1.3779  29.8234   
87   55.9520  4.3021   83.0649   4.2308    1.6038   1.2908   0.6652  41.0652   
88   95.9856  0.2360   72.5829   3.1023    0.7469   3.2856   4.7755  34.4884   
89   26.4694  0.5510   97.8056   3.8770    1.2197   0.8360   1.0726  34.5722   
90   50.0542  0.8695   80.1825   9.3399    2.0911  26.5636  65.1863  46.7395   

       GCLC     NFYA  ...  AC092910.4  

In [7]:
main_folder_path_kirc = "data/tcga-kirc-geq"
dfs = pd.DataFrame({'Column1': [1, 2, 3, 4, 5]})


for subfolder in os.listdir(main_folder_path_kirc):
    subfolder_path = os.path.join(main_folder_path_kirc, subfolder)
    if os.path.isdir(subfolder_path):

        # Iterate through all the files in the subfolder
        for filename in os.listdir(subfolder_path):
            if filename.endswith(".tsv"):  # Consider only TSV files; adjust the condition for other file types
                
                file_path = os.path.join(subfolder_path, filename)
                df1 = pd.read_csv(file_path, sep='\t', header=1)
                df1 = df1['tpm_unstranded']

                dfs = pd.concat([dfs, df1], axis=1)
               
print(dfs)


       Column1  tpm_unstranded  tpm_unstranded  tpm_unstranded  \
0          1.0             NaN             NaN             NaN   
1          2.0             NaN             NaN             NaN   
2          3.0             NaN             NaN             NaN   
3          4.0             NaN             NaN             NaN   
4          5.0         50.3093         37.8239        138.9149   
...        ...             ...             ...             ...   
60659      NaN          0.0000          0.0000          0.0000   
60660      NaN         23.9462         15.1532         13.6630   
60661      NaN          0.0000          0.0000          0.0000   
60662      NaN          0.0550          0.0360          0.0120   
60663      NaN          0.5038          0.3439          0.2740   

       tpm_unstranded  tpm_unstranded  tpm_unstranded  tpm_unstranded  \
0                 NaN             NaN             NaN             NaN   
1                 NaN             NaN             NaN        

In [8]:
transposed2 = dfs.transpose()
transposed2.columns = second_column_values
transposed2 = transposed2.drop(columns=transposed2.columns[:4])
transposed2['cancer_type'] = 'kidney renal clear cell carcinoma'
transposed2 = transposed2.drop(transposed2.index[0])

# Reset the index if needed
transposed2 = transposed2.reset_index(drop=True)
print(transposed2)

       TSPAN6    TNMD     DPM1   SCYL3  C1orf112      FGR      CFH    FUCA2  \
0     50.3093  1.8636  85.7241  9.2839    3.2253  17.8309  25.0306  59.9089   
1     37.8239  0.3392  51.3701  4.6496    1.7095  14.5073  29.0526  63.3354   
2    138.9149  1.2867  77.2900  3.8629    1.7255  17.9696  79.2434  44.6616   
3     38.1085  0.6410  76.2203  8.0788    2.6212  19.9282  20.6766  89.3928   
4     27.4275  0.2703  84.7318  5.4551    4.1345  43.9371  59.1496  68.9828   
..        ...     ...      ...     ...       ...      ...      ...      ...   
609   40.2582  0.6677  92.2171  7.2686    2.8743  11.3818  44.8882  53.6990   
610   25.9545  0.0877  81.0769  6.7891    2.4068  14.9653   9.7083  46.9017   
611   42.9272  0.2773  87.4014  7.6856    2.9137  12.8583   9.1551  71.2842   
612   26.3359  0.7459  74.7579  6.2894    2.4227  14.1762   5.5126  72.1782   
613   31.6082  0.0686  66.2066  8.4483    2.2465  10.5946  73.9473  38.7188   

        GCLC     NFYA  ...  AC092910.4  AC073611.1 

In [9]:
main_folder_path_kirp = "data/tcga-kirp-geq"

dfs = pd.DataFrame({'Column1': [1, 2, 3, 4, 5]})


for subfolder in os.listdir(main_folder_path_kirp):
    subfolder_path = os.path.join(main_folder_path_kirp, subfolder)
    if os.path.isdir(subfolder_path):

        # Iterate through all the files in the subfolder
        for filename in os.listdir(subfolder_path):
            if filename.endswith(".tsv"):  # Consider only TSV files; adjust the condition for other file types

                file_path = os.path.join(subfolder_path, filename)
                df1 = pd.read_csv(file_path, sep='\t', header=1)
                df1 = df1['tpm_unstranded']

                dfs = pd.concat([dfs, df1], axis=1)
               
print(dfs)


       Column1  tpm_unstranded  tpm_unstranded  tpm_unstranded  \
0          1.0             NaN             NaN             NaN   
1          2.0             NaN             NaN             NaN   
2          3.0             NaN             NaN             NaN   
3          4.0             NaN             NaN             NaN   
4          5.0         47.3502         41.6238         52.8734   
...        ...             ...             ...             ...   
60659      NaN          0.0000          0.0000          0.0000   
60660      NaN         27.3330         16.8132         15.0506   
60661      NaN          0.0000          0.0000          0.0000   
60662      NaN          0.0144          0.0275          0.0183   
60663      NaN          0.4671          0.6814          1.1538   

       tpm_unstranded  tpm_unstranded  tpm_unstranded  tpm_unstranded  \
0                 NaN             NaN             NaN             NaN   
1                 NaN             NaN             NaN        

In [10]:
transposed3 = dfs.transpose()
transposed3.columns = second_column_values
transposed3 = transposed3.drop(columns=transposed3.columns[:4])
transposed3['cancer_type'] = 'kidney renal papillary cell carcinoma'
transposed3 = transposed3.drop(transposed3.index[0])

# Reset the index if needed
transposed3 = transposed3.reset_index(drop=True)
print(transposed3)

       TSPAN6    TNMD     DPM1   SCYL3  C1orf112      FGR      CFH    FUCA2  \
0     47.3502  1.2197  70.3675  5.4522    1.0902   6.2784  10.8631  50.5269   
1     41.6238  0.1193  59.4621  2.0215    0.7523  42.3647  14.4453  94.5218   
2     52.8734  0.0398  56.9389  3.0381    1.9973   6.2178  35.8620  55.3259   
3     85.7232  0.5334  63.9742  8.5221    2.6660   2.4944  53.5122  84.3195   
4     15.2345  0.3393  62.0003  2.4412    0.9320   2.6651   0.7739  46.1160   
..        ...     ...      ...     ...       ...      ...      ...      ...   
318   27.6946  0.0719  59.5989  1.8498    0.6931  29.8034   7.6052  78.9924   
319  108.7368  1.1551  63.8355  6.2989    1.2240   4.8013   5.7891  61.8564   
320   25.1217  1.3316  46.6076  2.2971    0.6072  15.4585   1.1822  73.1605   
321   89.4600  0.0000  73.2885  1.0064    1.0095   2.0277   1.1458  78.9180   
322   35.9497  0.2206  40.8623  2.8142    0.6925   2.3872   0.8893  70.1397   

        GCLC     NFYA  ...  AC092910.4  AC073611.1 

In [11]:
FILENAME = "original_tpm_data.csv"

final_df = pd.concat([transposed, transposed2, transposed3], ignore_index=True)
print(final_df)
final_df.to_csv(FILENAME, index=False)

        TSPAN6    TNMD      DPM1    SCYL3  C1orf112      FGR      CFH  \
0       8.2367  0.0000   25.8092   0.9317    0.2657   2.9539   1.4042   
1       4.0204  0.1582   16.6135   0.5691    0.1217   1.4957   0.1268   
2      30.3101  0.0000  194.5276   4.1172    4.2266   9.3357  11.7886   
3     192.6733  2.2687  166.1158  13.8005    2.6671   9.7596   9.5741   
4      29.3118  0.2938   56.6818   1.8314    0.4773   0.5403   0.5084   
...        ...     ...       ...      ...       ...      ...      ...   
1023   27.6946  0.0719   59.5989   1.8498    0.6931  29.8034   7.6052   
1024  108.7368  1.1551   63.8355   6.2989    1.2240   4.8013   5.7891   
1025   25.1217  1.3316   46.6076   2.2971    0.6072  15.4585   1.1822   
1026   89.4600  0.0000   73.2885   1.0064    1.0095   2.0277   1.1458   
1027   35.9497  0.2206   40.8623   2.8142    0.6925   2.3872   0.8893   

        FUCA2     GCLC     NFYA  ...  AC092910.4  AC073611.1  AC136977.1  \
0     15.7151   4.5194   4.9240  ...      0.000

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
# Load the data into a DataFrame
data = pd.read_csv(filepath_or_buffer=FILENAME)

# Split the data into features (X) and target variable (y)
X = data.drop('cancer_type', axis=1)
y = data['cancer_type']

# Split the data into training and test sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(data[data.cancer_type == "kidney chromophobe"].shape[0])

data.head()

91


| | TSPAN6 | TNMD | DPM1 | SCYL3 | C1orf112 | FGR | CFH | FUCA2 | GCLC | NFYA | ... | AC092910.4 | AC073611.1 | AC136977.1 | AC078856.1 | AC008763.4 | AL592295.6 | AC006486.3 | AL391628.1 | AP006621.6 | cancer_type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 8.2367 | 0.0 | 25.8092 | 0.9317 | 0.2657 | 2.9539 | 1.4042 | 15.7151 | 4.5194 | 4.924 | ... | 0.0 | 0.1066 | 0.0 | 0.0 | 0.0 | 3.6421 | 0.0 | 0.0035 | 0.7835 | kidney chromophobe |
| 1 | 4.0204 | 0.1582 | 16.6135 | 0.5691 | 0.1217 | 1.4957 | 0.1268 | 10.0106 | 3.0583 | 2.1101 | ... | 0.0 | 0.2003 | 0.0 | 0.0 | 0.0 | 2.4549 | 0.0 | 0.0944 | 0.4324 | kidney chromophobe |
| 2 | 30.3101 | 0.0 | 194.5276 | 4.1172 | 4.2266 | 9.3357 | 11.7886 | 63.2781 | 8.2357 | 61.3211 | ... | 0.0 | 0.5563 | 0.0 | 0.0 | 0.0 | 10.8131 | 0.0 | 0.0336 | 1.5405 | kidney chromophobe |
| 3 | 192.6733 | 2.2687 | 166.1158 | 13.8005 | 2.6671 | 9.7596 | 9.5741 | 62.3076 | 7.1604 | 30.2865 | ... | 0.0 | 0.1799 | 0.0 | 0.0 | 0.0 | 11.7993 | 0.0 | 0.2841 | 0.5695 | kidney chromophobe |
| 4 | 29.3118 | 0.2938 | 56.6818 | 1.8314 | 0.4773 | 0.5403 | 0.5084 | 25.308 | 5.7213 | 5.0955 | ... | 0.5162 | 0.1739 | 0.0 | 0.0 | 0.0 | 6.1702 | 0.0 | 0.0129 | 0.1659 | kidney chromophobe |


In [13]:
print("kidney chromophobe", data[data.cancer_type == "kidney chromophobe"].shape[0])
print("kidney renal papillary cell carcinoma", data[data.cancer_type == "kidney renal papillary cell carcinoma"].shape[0])
print("kidney renal clear cell carcinoma", data[data.cancer_type == "kidney renal clear cell carcinoma"].shape[0])

kidney chromophobe 91
kidney renal papillary cell carcinoma 323
kidney renal clear cell carcinoma 614
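
The class counts above are imbalanced, which matters when splitting the data. The following sketch runs on a small synthetic table rather than the real expression data, and shows one possible way to count classes with `value_counts()`, keep the class proportions in both splits with the `stratify` argument of `train_test_split`, and fit a decision tree as a simple baseline. The gene names, class labels, and random values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the expression table: 3 made-up genes, 3 imbalanced classes.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((100, 3)), columns=["GENE_A", "GENE_B", "GENE_C"])
toy["cancer_type"] = np.repeat(["kich", "kirc", "kirp"], [10, 60, 30])

# value_counts() reports the size of every class in one call.
print(toy["cancer_type"].value_counts())

features = toy.drop(columns="cancer_type")
target = toy["cancer_type"]

# stratify=target keeps the class proportions the same in both splits.
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)
print(t_test.value_counts())

# Fit a decision tree as a simple baseline and report held-out accuracy.
model = DecisionTreeClassifier(random_state=42).fit(f_train, t_train)
print("test accuracy:", model.score(f_test, t_test))
```

Because the synthetic features are random, the reported accuracy is not meaningful; the point is only the mechanics of the stratified split.
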
