# Neuroblastoma Predictive Modelling

### Project Objective:
The goal of this project is to develop a machine learning model that can predict clinical outcomes in neuroblastoma patients using RNA-Seq gene expression data. Several clinical outcome fields contain missing values. This project aims to train a model that can accurately predict those outcomes and impute the missing values accordingly.

### Project Aims:

- **Explore and understand** the dataset to identify patterns, distributions, and any preprocessing needs.

- **Select clinical endpoints** as target variables for prediction:
  
  - **Death from Disease:**  
    Occurrence of death from the disease  
    (yes = 1, no = 0)
  
  - **High Risk:**  
    Clinically considered as high-risk neuroblastoma  
    (yes = 1, no = 0)
  
  - **INSS Stage:**  
    Disease stage according to the International Neuroblastoma Staging System  
    (1, 2, 3, 4, or 4S)
  
  - **Progression:**  
    Occurrence of a tumor progression event  
    (yes = 1, no = 0)

- **Build classification models** to predict each selected clinical endpoint using RNA-Seq gene expression data.

- **Evaluate model performance** using appropriate metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.

- **Generate predictions** for the missing clinical outcomes in the test set using the trained models.


## Part 1: Import libraries

In [1]:
# Data handling
import pandas as pd
import numpy as np

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
    roc_auc_score,
    roc_curve
)


## Part 2: Load Data

Gene expression of 498 neuroblastoma samples was quantified by RNA sequencing.

In [2]:
# Load data

# Gene expression data
df_rnaseq = pd.read_csv("./data/log2FPKM.tsv", sep = "\t", index_col = 0)

# Patient Info data
df_patient_info = pd.read_csv("./data/patientInfo.tsv", sep = "\t", index_col=0)

# Sort rows by patient ID in ascending order
df_patient_info = df_patient_info.sort_index(ascending=True)


In [5]:
#df_rnaseq
#df_patient_info

**Observations:**
- There are missing values in some of the samples in both gene expression and patient info data

**Inference:**
- We can use the samples with missing values as the test set
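The observation above can be quantified by counting missing values per column. The following is a minimal sketch on a toy table; `missing_summary`, `toy_info` and its column names are hypothetical stand-ins, not part of the real dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]


# Toy stand-in for the patient info table (values are made up for illustration)
toy_info = pd.DataFrame(
    {"outcome_a": [1.0, np.nan, 0.0, np.nan], "age": [10, 20, 30, 40]},
    index=["s1", "s2", "s3", "s4"],
)

summary = missing_summary(toy_info)
print(summary)
```

Applied to `df_patient_info` and `df_rnaseq`, the same helper would list only the columns that actually contain gaps.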

## Part 3: Feature Engineering

**Objectives:**
- Match the sample names in the gene expression data with the patient info data
- Merge the two datasets
- Check whether the same samples have missing values across all clinical outcomes in the combined dataset
- Separate samples with missing values as df_test and non-missing values as df_train
- Standardise the features

In [None]:
# Extract the patient IDs from the df_patient_info 
patient_ids = df_patient_info.index.values
#patient_ids

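# Rename the columns of df_rnaseq to the sorted patient IDs
# (positional assignment: assumes the columns are already in the same order as patient_ids)
df_rnaseq.columns = patient_ids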

# Transpose RNA-seq data
df_rnaseq_t = df_rnaseq.T
#df_rnaseq_t
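Because the renaming step assigns IDs by position, a few cheap checks can catch obvious mismatches before the transpose. The sketch below uses a hypothetical helper, `rename_columns_positionally`, and toy data. It can only verify the number and uniqueness of the IDs; it cannot confirm that the column order really matches the patient order, which still has to be established from the data source.

```python
import pandas as pd


def rename_columns_positionally(expr: pd.DataFrame, sample_ids) -> pd.DataFrame:
    """Rename expression columns to sample_ids, failing loudly on obvious mismatches."""
    sample_ids = pd.Index(sample_ids)
    if len(sample_ids) != expr.shape[1]:
        raise ValueError(
            f"{expr.shape[1]} expression columns but {len(sample_ids)} sample IDs"
        )
    if not sample_ids.is_unique:
        raise ValueError("sample IDs contain duplicates")
    renamed = expr.copy()  # leave the caller's frame untouched
    renamed.columns = sample_ids
    return renamed


# Toy genes x samples matrix (made-up values)
toy_expr = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=["geneA", "geneB"],
    columns=["c1", "c2", "c3"],
)
toy_ids = ["p1", "p2", "p3"]

# After renaming, transpose so that rows are samples and columns are genes
toy_expr_t = rename_columns_positionally(toy_expr, toy_ids).T
print(toy_expr_t.shape)
```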

In [None]:
# Merge dataframes
df_merged = df_patient_info.merge(df_rnaseq_t, left_index=True, right_index=True)
#df_merged
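An index merge silently drops samples that appear in only one table. As a sketch, and assuming one row per patient in each table, the `validate` and `indicator` options of `DataFrame.merge` can make such cases visible. The frames below are toy data with made-up IDs.

```python
import pandas as pd

# Toy tables: p3 has no expression data, p4 has no clinical record
info = pd.DataFrame({"outcome": [1, 0, 1]}, index=["p1", "p2", "p3"])
expr = pd.DataFrame({"geneA": [0.1, 0.2, 0.3]}, index=["p1", "p2", "p4"])

# An outer merge keeps every sample; "_merge" records which table(s) it came from.
# validate="one_to_one" raises if an ID is duplicated on either side.
merged = info.merge(
    expr,
    left_index=True,
    right_index=True,
    how="outer",
    validate="one_to_one",
    indicator=True,
)

unmatched = merged[merged["_merge"] != "both"]
print(unmatched)
```

If `unmatched` is empty on the real data, the default inner merge used in the notebook loses no samples.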

In [7]:
# Check missing values in clinical outcomes
df_merged[['FactorValue..death.from.disease.',
           'FactorValue..high.risk.',
           'FactorValue..inss.stage.',
           'FactorValue..progression.']].isna().sum()

FactorValue..death.from.disease.    249
FactorValue..high.risk.             249
FactorValue..inss.stage.            249
FactorValue..progression.           249
dtype: int64

In [8]:
# List the four outcome columns
outcome_cols = [
    'FactorValue..death.from.disease.',
    'FactorValue..high.risk.',
    'FactorValue..inss.stage.',
    'FactorValue..progression.'
]

# For each column, find the index of missing rows
missing_indices_per_outcome = [df_merged[df_merged[col].isna()].index for col in outcome_cols]

# Check if all sets of missing indices are the same
all_same = all(missing_indices_per_outcome[0].equals(idx) for idx in missing_indices_per_outcome[1:])

print("Are missing values aligned across all outcomes?", all_same)

Are missing values aligned across all outcomes? True
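The same alignment check can also be expressed row-wise: each sample should have either all outcomes missing or none missing. This is a minimal sketch on toy data; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy outcome table: s2 and s4 lack every outcome, s1 and s3 are complete
toy = pd.DataFrame(
    {
        "outcome_a": [1.0, np.nan, 0.0, np.nan],
        "outcome_b": [0.0, np.nan, 1.0, np.nan],
    },
    index=["s1", "s2", "s3", "s4"],
)
toy_outcome_cols = ["outcome_a", "outcome_b"]

na_mask = toy[toy_outcome_cols].isna()

# A row is consistent if its outcomes are either all missing or all present
rows_consistent = na_mask.all(axis=1) | ~na_mask.any(axis=1)
aligned = bool(rows_consistent.all())

# Samples with only some outcomes missing would need separate handling
partially_missing = toy.index[~rows_consistent].tolist()
print(aligned, partially_missing)
```

Compared with comparing index objects, this form also reports which samples break the pattern.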


In [9]:
# Get patient IDs with missing values (can use any one outcome since they're aligned)
missing_patient_ids = df_merged[df_merged['FactorValue..death.from.disease.'].isna()].index

# Create test set: patients with missing clinical outcome values
df_test = df_merged.loc[missing_patient_ids]

# Create train set: patients with complete outcome information
df_train = df_merged.drop(index=missing_patient_ids)

In [10]:
# Confirm no missing values in train
print("Train missing values per outcome:")
print(df_train[outcome_cols].isna().sum())

# Confirm all values are missing in test
print("\nTest missing values per outcome:")
print(df_test[outcome_cols].isna().sum())

Train missing values per outcome:
FactorValue..death.from.disease.    0
FactorValue..high.risk.             0
FactorValue..inss.stage.            0
FactorValue..progression.           0
dtype: int64

Test missing values per outcome:
FactorValue..death.from.disease.    249
FactorValue..high.risk.             249
FactorValue..inss.stage.            249
FactorValue..progression.           249
dtype: int64


In [11]:
# Define target outcome columns
target_outcomes = [
    'FactorValue..death.from.disease.',
    'FactorValue..high.risk.',
    'FactorValue..inss.stage.',
    'FactorValue..progression.'
]

# Drop unnecessary columns from X_train (Sex and age)
X_train = df_train.drop(columns=outcome_cols + ['FactorValue..Sex.', 'FactorValue..age.at.diagnosis.'])

# For the test set, we do the same (drop clinical columns and the ones not relevant to predictions)
X_test = df_test.drop(columns=outcome_cols + ['FactorValue..Sex.', 'FactorValue..age.at.diagnosis.'])

In [None]:
# Standardise the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # for real test prediction later
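One caveat: a scaler fitted on all labelled samples has already seen any rows that are later held out for validation. A possible alternative, sketched here on synthetic data rather than the real expression matrix, is to wrap the scaler and classifier in a pipeline so that scaling is refitted inside each cross-validation fold. `X_demo` and `y_demo` are made-up stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem: 100 samples, 20 features, signal in the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 20))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

# The pipeline fits StandardScaler on the training part of each fold only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# With an integer cv and a classifier, scikit-learn uses stratified folds
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="roc_auc")
print(scores.mean())
```

The fitted pipeline can then be applied to unscaled test features directly, since it carries its own scaler.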

**Observations:**
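- Exactly 249 samples had missing values for all four clinical outcomes in the combined dataset. This is exactly half of all samples (498)
- df_train had no missing outcome values, and all outcome values in df_test were missing
- Age and Sex features were removed as they are not being considered in this project
- Three target outcomes (death from disease, high risk, progression) are binary; INSS stage has five classes (1, 2, 3, 4, 4S)

**Inference:**
- Since three of the targets have 2 outcomes (0, 1), a binary classification model such as logistic regression can be used to predict them in the df_test dataset. INSS stage is a multiclass target; scikit-learn's LogisticRegression also accepts multiclass targets, so the same estimator is used for all four outcomes.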
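The number of classes per target can be checked directly rather than assumed. The sketch below uses a toy table with hypothetical column names; the five stage labels follow the INSS coding listed in the project aims.

```python
import pandas as pd

# Toy targets: one binary outcome and one staged outcome (made-up rows)
toy_targets = pd.DataFrame(
    {
        "death": [0, 1, 0, 1, 0],
        "stage": ["1", "2", "3", "4", "4S"],
    }
)

# nunique() counts distinct non-missing values in each column
n_classes = toy_targets.nunique()
binary_targets = n_classes[n_classes == 2].index.tolist()
multiclass_targets = n_classes[n_classes > 2].index.tolist()
print(binary_targets, multiclass_targets)
```

On the real data, `df_train[outcome_cols].nunique()` would give the corresponding counts.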

## Part 4: Train models for each clinical outcome

**Objectives:**
- Train 4 logistic regression models (one for each outcome)
- Use accuracy, precision, recall and F1-score on a held-out 20% validation split to evaluate the models

In [13]:
# Dictionary to store trained models
models = {}

for outcome in target_outcomes:
    y = df_train[outcome]  # current target labels
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train_scaled, y)
    models[outcome] = model  # save model for later

In [14]:
for outcome in target_outcomes:
    y = df_train[outcome]

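    # Split known data for evaluation
    # (note: X_train_scaled was standardised on all labelled samples,
    #  so the validation rows have already influenced the scaler)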
    X_subtrain, X_val, y_subtrain, y_val = train_test_split(
        X_train_scaled, y, test_size=0.2, random_state=42, stratify=y
    )

    # Train a new model for this split
    model = LogisticRegression(max_iter=1000)
    model.fit(X_subtrain, y_subtrain)
    
    # Predict and evaluate
    y_pred = model.predict(X_val)
    print(f"Evaluation for: {outcome}")
    print(classification_report(y_val, y_pred))
    print("-" * 50)

Evaluation for: FactorValue..death.from.disease.
              precision    recall  f1-score   support

         0.0       0.97      0.80      0.88        40
         1.0       0.53      0.90      0.67        10

    accuracy                           0.82        50
   macro avg       0.75      0.85      0.77        50
weighted avg       0.88      0.82      0.83        50

--------------------------------------------------
Evaluation for: FactorValue..high.risk.
              precision    recall  f1-score   support

         0.0       0.97      0.94      0.95        33
         1.0       0.89      0.94      0.91        17

    accuracy                           0.94        50
   macro avg       0.93      0.94      0.93        50
weighted avg       0.94      0.94      0.94        50

--------------------------------------------------
Evaluation for: FactorValue..inss.stage.
              precision    recall  f1-score   support

           1       0.23      0.25      0.24        12
     

**Observations:**
1. Model for predicting death from disease:
    - Good accuracy (82%)
    - Low precision for predicting death (0.53) but high precision for predicting no death (0.97)
    - High recall for predicting death (0.90)

2. Model for predicting high risk:
    - High accuracy (94%)
    - High precision and recall for both the high-risk and non-high-risk classes

3. Model for predicting INSS stage:
    - Low accuracy (42%)
    - Low precision and recall for all stages

4. Model for predicting disease progression:
    - Decent accuracy (74%)
    - Low precision for predicting disease progression (0.59) but high precision for predicting no disease progression (0.91)
    - High recall for predicting disease progression (0.89)
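
The death-from-disease figures can be reproduced by hand. The counts below are not printed in the notebook; they are inferred from the reported support (40 and 10) and recall values (0.80 and 0.90), so treat them as a reconstruction.

```python
# Counts implied by the death-from-disease report (inferred, not printed in the notebook)
tp, fn = 9, 1    # recall for class 1 = 0.90 on 10 positives
tn, fp = 32, 8   # recall for class 0 = 0.80 on 40 negatives

precision_pos = tp / (tp + fp)   # 9 / 17: share of predicted deaths that were real
recall_pos = tp / (tp + fn)      # 9 / 10: share of real deaths that were caught
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(
    f"precision={precision_pos:.2f} recall={recall_pos:.2f} "
    f"f1={f1_pos:.2f} accuracy={accuracy:.2f}"
)
# prints: precision=0.53 recall=0.90 f1=0.67 accuracy=0.82
```

The eight false positives out of 17 predicted deaths are what pull the precision down to 0.53, even though only one actual death was missed.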


**Inferences:**
1. Model for predicting death from disease:
    - When this model predicts death, it is wrong almost half the time (precision = 0.53), which means there are many false positives.
    - It catches most actual deaths (recall = 0.90), which means few false negatives.
    - This model is useful when missing a death case is costlier than wrongly predicting one.

2. Model for predicting high risk:
    - The model performs consistently well for both high-risk and non-high-risk patients, with high precision and recall for both classes.
    - It has few false positives and false negatives. This makes this model a reliable classifier.

3. Model for predicting INSS stage:
    - This model performs poorly across all stages with both low precision and low recall.
    - This is likely because the outcome is multiclass (stages 1, 2, 3, 4, and 4S) and the classes are imbalanced, which a linear model such as logistic regression may struggle with.
    - A non-linear model such as Random Forest may be better suited to this task, as it can capture more complex patterns; this would need to be tested.

4. Model for predicting disease progression:
    - This model is good at identifying actual progression cases (recall = 0.89), meaning few false negatives.
    - The precision for progression is lower (0.59), indicating many false positives: the model often predicts progression for patients who did not progress.
    - This model is useful when catching progression cases is more important than avoiding false alarms.
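
The Random Forest suggestion for INSS stage could be tested with a side-by-side comparison. The sketch below runs on a synthetic five-class problem from `make_classification`, not on the RNA-Seq data, so its scores say nothing about neuroblastoma. The hyperparameters and the choice of macro-averaged F1 are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic five-class problem standing in for the staged outcome
X_demo, y_demo = make_classification(
    n_samples=250,
    n_features=30,
    n_informative=8,
    n_classes=5,
    n_clusters_per_class=1,
    random_state=0,
)

candidates = {
    # Scaling lives inside the pipeline so it is refitted per fold
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    # Tree ensembles do not need feature scaling
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Macro F1 weights every class equally, which matters when classes are imbalanced
macro_f1 = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring="f1_macro").mean()
    for name, est in candidates.items()
}
print(macro_f1)
```

Running the same loop on the labelled samples with the INSS stage column as the target would show whether the non-linear model actually helps.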