<a href="https://colab.research.google.com/github/0m0kenny/0m0kenny/blob/main/Neurob_Prog.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting the Prognosis of Neuroblastoma Patients via RNA Sequencing Data

## Abstract
                
RNA-Seq reveals an unprecedented complexity of the neuroblastoma transcriptome and is suitable for clinical endpoint prediction [ microarray ]

### Experiment Description  

We generated gene expression profiles from 498 primary neuroblastomas using RNA-Seq and microarrays. We sought to systematically evaluate the capability of RNA deep-sequencing (RNA-Seq)-based classification for clinical endpoint prediction in comparison to microarray-based ones. The neuroblastoma cohort was randomly divided into training and validation sets (**Please note:** <em>in the following we refer to this validation set as test set</em>), and 360 predictive models on six clinical endpoints were generated and evaluated. While prediction performances did not differ considerably between the two technical platforms, the RNA-Seq data processing pipelines, or feature levels (i.e., gene, transcript, and exon junction levels), RNA-Seq models based on the AceView database performed best on most endpoints. Collectively, our study reveals an unprecedented complexity of the neuroblastoma transcriptome, and provides guidelines for the development of gene expression-based predictive classifiers using high-throughput technologies.

Sample clinical characteristics definitions:

* sex:
    <ul>
    <li>M = male</li>
    <li>F = female</li>
    </ul>
    
* age at diagnosis: The age in days at diagnosis
    <ul>
    <li>integer</li>
    </ul>

* high risk: Clinically considered as high-risk neuroblastoma
    <ul>
    <li>yes = 1</li>
    <li>no = 0</li>
    </ul>


* INSS stage: Disease stage according to International Neuroblastoma Staging System ([INSS](https://www.cancer.org/cancer/neuroblastoma/detection-diagnosis-staging/staging.html))
    <ul>
    <li>1</li>
    <li>2</li>
    <li>3</li>
    <li>4</li>
    <li>4S</li>
    </ul>


* progression: Occurrence of a tumor progression event
    <ul>
    <li>yes = 1</li>
    <li>no = 0</li>
    </ul>



* death from disease: Occurrence of death from the disease
    <ul>
    <li>yes = 1</li>
    <li>no = 0</li>
    </ul>





Gene expression of 498 neuroblastoma samples was quantified by RNA sequencing as well as by microarray analyses in order to understand the neuroblastoma transcriptome and predict clinical endpoints.


## Task

The task is to predict the missing values in the validation set (from here on called the test set) using the RNA-Seq data.

# Import necessary packages

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt # plotting and visualisation
import seaborn as sns # nicer (easier) visualisation
%matplotlib inline
from sklearn.preprocessing import PowerTransformer

# for file paths and saving (os.path is available via os)
import os

# Set up directory and filenames

In [4]:
data_dir = '..{}data'.format(os.path.sep)
#load in files
fn_fpkm             = '/content/drive/MyDrive/Colab Notebooks/DataSet2/log2FPKM.tsv'
fn_patient_info     = '/content/drive/MyDrive/Colab Notebooks/DataSet2/patientInfo.tsv'
fn_prop_intensities = '/content/drive/MyDrive/Colab Notebooks/DataSet2/allProbIntensities.tsv'
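
Because the paths above are absolute Google Drive paths, a quick existence check before loading can make a wrong path easier to spot. The following is a minimal sketch, not part of the original notebook: `find_missing_files` is a hypothetical helper, and the temporary file only stands in for a real data file.

```python
import os
import tempfile


def find_missing_files(paths):
    """Return the subset of `paths` that does not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]


# Demonstration with one real (temporary) file and one made-up path.
with tempfile.TemporaryDirectory() as tmp_dir:
    existing = os.path.join(tmp_dir, "log2FPKM.tsv")
    with open(existing, "w") as handle:
        handle.write("00gene_id\tNB001\n")
    absent = os.path.join(tmp_dir, "does_not_exist.tsv")

    # Only the made-up path should be reported as missing.
    missing = find_missing_files([existing, absent])

print(missing)
```

In the notebook itself, the helper could be called as `find_missing_files([fn_fpkm, fn_patient_info, fn_prop_intensities])`.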




# Load the RNA-Seq data

This part already sets the indices in the DataFrame. Please feel free to change as required.

In [5]:
# read the tab-separated expression table and rename the gene ID column
df_fpkm = pd.read_csv(fn_fpkm, sep='\t').rename({'00gene_id': 'gene_id'}, axis=1)
# set row/column headers: genes as rows, sample IDs as columns
df_fpkm = df_fpkm.set_index('gene_id')
df_fpkm.columns.name = 'ID'

# view the data
df_fpkm.head()


| gene_id | NB001 | NB002 | NB003 | NB004 | NB005 | NB006 | NB007 | NB008 | NB009 | NB010 | ... | NB489 | NB490 | NB491 | NB492 | NB493 | NB494 | NB495 | NB496 | NB497 | NB498 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1/2-SBSRNA4 | 0.834381 | 0.743094 | 0.909414 | 0.795775 | 0.90554 | 0.869154 | 1.811352 | 0.59924 | 0.981855 | 1.066399 | ... | 0.997977 | 1.003559 | 0.842437 | 1.057873 | 0.805515 | 0.491331 | 0.868249 | 0.911379 | 0.660139 | 1.152988 |
| A1BG | 1.910053 | 0.941996 | 1.950857 | 1.989477 | 1.942946 | 1.927608 | 1.617745 | 2.161291 | 1.436439 | 2.159797 | ... | 2.336929 | 2.83636 | 1.205317 | 2.439868 | 1.649027 | 1.451425 | 1.493852 | 1.641241 | 1.994978 | 1.289534 |
| A1BG-AS1 | 1.453191 | 0.640614 | 1.156765 | 1.525277 | 1.365043 | 0.899212 | 1.304178 | 1.189205 | 0.771248 | 1.114787 | ... | 1.182908 | 1.367371 | 0.643751 | 1.096815 | 0.925425 | 0.933275 | 1.208723 | 0.904511 | 1.529221 | 1.102866 |
| A1CF | 0.005102 | 0.005902 | 0.005192 | 0.0 | 0.025347 | 0.005682 | 0.0 | 0.0 | 0.02188 | 0.0 | ... | 0.024298 | 0.007295 | 0.0 | 0.006678 | 0.005746 | 0.004998 | 0.004853 | 0.0 | 0.02278 | 0.01872 |
| A2LD1 | 0.580151 | 0.738233 | 0.927667 | 0.936497 | 0.924853 | 0.739038 | 1.018705 | 0.546324 | 0.666877 | 0.86585 | ... | 0.673627 | 1.401265 | 0.837443 | 0.939849 | 0.743496 | 0.957837 | 0.812093 | 0.488748 | 1.068072 | 0.782887 |
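
scikit-learn estimators expect one row per sample, whereas the table above has one row per gene. A minimal sketch of the transposition on a made-up three-sample frame follows; `toy_fpkm` and its rounded values are illustrative only, and the real `df_fpkm` could presumably be handled the same way.

```python
import pandas as pd

# Toy expression table in the same orientation as df_fpkm:
# genes as rows (index name 'gene_id'), samples as columns (name 'ID').
toy_fpkm = pd.DataFrame(
    {
        "NB001": [0.83, 1.91],
        "NB002": [0.74, 0.94],
        "NB003": [0.91, 1.95],
    },
    index=pd.Index(["1/2-SBSRNA4", "A1BG"], name="gene_id"),
)
toy_fpkm.columns.name = "ID"

# Transpose so that each row is a sample and each column is a gene.
toy_samples = toy_fpkm.T

print(toy_samples.shape)  # (number of samples, number of genes)
```

After transposing, the sample IDs form the row index, so the frame can be matched against the patient table by `ID`.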


### Load the patient factors, including the potential endpoints

This part already sets the indices in the DataFrame. Please feel free to change as required.
Please note that the `FactorValues` should have a 1-to-1 correspondence to the factors described in the abstract.

In [6]:
df_patient_info = pd.read_csv(fn_patient_info, sep='\t').set_index('ID')
df_patient_info.columns.name = 'FactorValues'

df_patient_info.head()

| ID | FactorValue..Sex. | FactorValue..age.at.diagnosis. | FactorValue..death.from.disease. | FactorValue..high.risk. | FactorValue..inss.stage. | FactorValue..progression. |
|---|---|---|---|---|---|---|
| NB498 | female | 530 | NaN | NaN | NaN | NaN |
| NB497 | female | 379 | 0.0 | 0.0 | 1.0 | 0.0 |
| NB496 | male | 132 | NaN | NaN | NaN | NaN |
| NB495 | male | 163 | 0.0 | 0.0 | 1.0 | 0.0 |
| NB494 | male | 56 | NaN | NaN | NaN | NaN |
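
The `FactorValue..<name>.` column labels map one-to-one onto the factors in the abstract, but they are awkward to type. One possible way to derive shorter names is sketched below; `short_name` is a hypothetical helper that is not part of the original notebook, and the toy rows mirror the two labelled patients shown above.

```python
import pandas as pd

# Toy frame using the column names from the patient table.
toy_info = pd.DataFrame(
    {
        "FactorValue..Sex.": ["female", "male"],
        "FactorValue..age.at.diagnosis.": [379, 163],
        "FactorValue..death.from.disease.": [0.0, 0.0],
        "FactorValue..high.risk.": [0.0, 0.0],
        "FactorValue..inss.stage.": ["1", "1"],
        "FactorValue..progression.": [0.0, 0.0],
    },
    index=pd.Index(["NB497", "NB495"], name="ID"),
)


def short_name(column):
    """Drop the 'FactorValue..' prefix and trailing dot, then use snake_case."""
    core = column.replace("FactorValue..", "", 1).rstrip(".")
    return core.replace(".", "_").lower()


# rename() accepts a function and applies it to every column label.
toy_short = toy_info.rename(columns=short_name)

print(list(toy_short.columns))
```

The rest of this notebook keeps the original column names, so the renaming is optional.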


#### Divide into training and external test sets

Some of the factor values are NaN for some patient **ID**s.
Every row where this information is missing is a real validation entry. We can use this to create two separate DataFrames: one for training and one for validation (testing).

The task is to predict the missing values using the RNA-Seq data.



In [7]:
# training rows have a known 'death from disease' label; test rows do not
df_patient_info_train  = df_patient_info[df_patient_info['FactorValue..death.from.disease.'].notna()]
df_patient_info_test   = df_patient_info[df_patient_info['FactorValue..death.from.disease.'].isna()]
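
The same NaN mask can be used to pick the matching expression profiles for each set. The sketch below illustrates this on made-up data: `toy_info`, `toy_expr`, the gene names and all values are assumptions, and the expression table is assumed to have one row per sample.

```python
import numpy as np
import pandas as pd

# Toy patient table: NaN marks patients whose endpoint must be predicted.
toy_info = pd.DataFrame(
    {"FactorValue..death.from.disease.": [np.nan, 0.0, np.nan, 1.0]},
    index=pd.Index(["NB004", "NB003", "NB002", "NB001"], name="ID"),
)

# Toy expression table with samples as rows (made-up gene names and values).
toy_expr = pd.DataFrame(
    np.arange(8.0).reshape(4, 2),
    index=["NB001", "NB002", "NB003", "NB004"],
    columns=["geneA", "geneB"],
)

# Boolean mask: True where the label is known.
is_labelled = toy_info["FactorValue..death.from.disease."].notna()
train_ids = toy_info.index[is_labelled]
test_ids = toy_info.index[~is_labelled]

# .loc with a list of IDs returns the rows in the order of that list,
# so features and labels stay aligned.
X_train_toy = toy_expr.loc[train_ids]
X_test_toy = toy_expr.loc[test_ids]

print(list(train_ids), list(test_ids))
```

Selecting rows by ID rather than by position avoids silent misalignment when the two tables are sorted differently, as they are in the outputs above.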



In [None]:
df_patient_info_train.head()

| ID | FactorValue..Sex. | FactorValue..age.at.diagnosis. | FactorValue..death.from.disease. | FactorValue..high.risk. | FactorValue..inss.stage. | FactorValue..progression. |
|---|---|---|---|---|---|---|
| NB497 | female | 379 | 0.0 | 0.0 | 1 | 0.0 |
| NB495 | male | 163 | 0.0 | 0.0 | 1 | 0.0 |
| NB493 | male | 190 | 0.0 | 0.0 | 1 | 0.0 |
| NB491 | male | 2326 | 0.0 | 1.0 | 4 | 1.0 |
| NB489 | female | 865 | 0.0 | 1.0 | 4 | 0.0 |


### Declare the y_train and y_test targets from the patient info

In [8]:
y_train_death = df_patient_info_train['FactorValue..death.from.disease.'].astype(int)
y_train_risk = df_patient_info_train['FactorValue..high.risk.'].astype(int)
y_train_prog = df_patient_info_train['FactorValue..progression.'].astype(int)
y_train_stage = df_patient_info_train['FactorValue..inss.stage.'].astype(str)
y_train_age = df_patient_info_train['FactorValue..age.at.diagnosis.']
y_train_sex = df_patient_info_train['FactorValue..Sex.']

# the test-set endpoints are missing (NaN); these are the values to be predicted
y_test_death = df_patient_info_test['FactorValue..death.from.disease.']
y_test_risk = df_patient_info_test['FactorValue..high.risk.']
y_test_prog = df_patient_info_test['FactorValue..progression.']
y_test_stage = df_patient_info_test['FactorValue..inss.stage.']
y_test_age = df_patient_info_test['FactorValue..age.at.diagnosis.']
y_test_sex = df_patient_info_test['FactorValue..Sex.']
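
The INSS stage is the only endpoint with more than two classes, and the `4S` label rules out a plain integer cast. One possible way to obtain integer class codes is a pandas categorical built from the five stages listed in the abstract. The sketch below uses made-up patient IDs and labels, and treating the stages as unordered categories is an assumption rather than a clinical statement.

```python
import pandas as pd

# The five stage labels from the clinical characteristics list.
INSS_STAGES = ["1", "2", "3", "4", "4S"]

# Made-up stage labels for four hypothetical patients.
toy_stage = pd.Series(
    ["1", "4", "4S", "2"],
    index=["P1", "P2", "P3", "P4"],
    name="inss_stage",
)

# Fixing the category list keeps the codes stable even if a stage
# happens to be absent from the training data.
stage_cat = pd.Categorical(toy_stage, categories=INSS_STAGES)
stage_codes = pd.Series(stage_cat.codes, index=toy_stage.index, name="inss_stage_code")

# Map the codes back to the original labels, e.g. after prediction.
decoded = [INSS_STAGES[code] for code in stage_codes]

print(stage_codes.tolist(), decoded)
```

Any label outside `INSS_STAGES` would receive the code `-1`, which makes unexpected values easy to detect.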

In [None]:
#df_patient_info_test.head()