# Building ML Pipelines
In this project, I will be using a dataset containing bone marrow transplantation characteristics for pediatric patients from UCI's Machine Learning Repository.

I will use this dataset to build a pipeline that contains all preprocessing and data cleaning steps, and then select the best classifier to predict patient survival.

### About the data set
* donor_age - Age of the donor at the time of hematopoietic stem cells apheresis

* donor_age_below_35 - Is donor age less than 35 (yes, no)

* donor_ABO - ABO blood group of the donor of hematopoietic stem cells (0, A, B, AB)

* donor_CMV - Presence of cytomegalovirus infection in the donor of hematopoietic stem cells prior to transplantation (present, absent)

* recipient_age - Age of the recipient of hematopoietic stem cells at the time of transplantation

* recipient_age_below_10 - Is recipient age below 10 (yes, no)

* recipient_age_int - Age of the recipient discretized to intervals (0,5], (5, 10], (10, 20]

* recipient_gender - Gender of the recipient (female, male)

* recipient_body_mass - Body mass of the recipient of hematopoietic stem cells at the time of the transplantation
* …
* survival_status - Survival status (0 - alive, 1 - dead)

### Import necessary libraries

In [3]:
import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

from scipy.io import arff

### Load the data set as a dataframe

In [4]:
data = arff.loadarff('bone-marrow.arff')
df = pd.DataFrame(data[0])
df.drop(columns=['Disease'], inplace=True)
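To see what `arff.loadarff` hands back before it is wrapped in a dataframe, here is a minimal sketch on a tiny in-memory ARFF document. The attribute names `age` and `status` and the two data rows are made up for illustration and are not part of the bone marrow dataset.

```python
import io

import pandas as pd
from scipy.io import arff

# Hypothetical two-attribute ARFF content, held in memory for illustration only
toy_arff = io.StringIO("""@relation toy
@attribute age numeric
@attribute status {0,1}
@data
12.5,1
7.0,0
""")

# loadarff returns a (records, metadata) pair; records is a NumPy structured array
records, meta = arff.loadarff(toy_arff)
toy_df = pd.DataFrame(records)

print(meta.names())   # attribute names in file order
print(toy_df.dtypes)  # numeric attributes load as float64, nominal ones as byte strings
```

Nominal attributes arrive as byte strings rather than numbers, so a frame loaded this way needs a numeric conversion step before modelling.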

### Prepare the data

In [5]:
# Convert all columns to numeric, coerce errors to null values
for c in df.columns:
    df[c] = pd.to_numeric(df[c], errors='coerce')

# Make sure binary columns are encoded as 0 and 1
for c in df.columns[df.nunique()==2]:
    df[c] = (df[c]==1)*1.0

# Calculate the number of unique values for each column
print('Count of unique values in each column:')
print(df.nunique())

Count of unique values in each column:
Recipientgender           2
Stemcellsource            2
Donorage                187
Donorage35                2
IIIV                      2
Gendermatch               2
DonorABO                  4
RecipientABO              4
RecipientRh               2
ABOmatch                  2
CMVstatus                 4
DonorCMV                  2
RecipientCMV              2
Riskgroup                 2
Txpostrelapse             2
Diseasegroup              2
HLAmatch                  4
HLAmismatch               2
Antigen                   4
Alel                      5
HLAgrI                    7
Recipientage            125
Recipientage10            2
Recipientageint           3
Relapse                   2
aGvHDIIIIV                2
extcGvHD                  2
CD34kgx10d6             183
CD3dCD34                182
CD3dkgx10d8             163
Rbodymass               130
ANCrecovery              18
PLTrecovery              50
time_to_aGvHD_III_IV     28
survival_
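The two cleaning loops above can be checked on a small frame. This is a sketch with hypothetical column names (`dose`, `flag`) and made-up values. It shows that unparseable entries become `NaN`, and that the binary re-encoding maps every value other than 1, including a missing one, to 0.0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'dose' has one unparseable entry, 'flag' is coded 1/2 with a gap
toy = pd.DataFrame({
    'dose': ['1.5', '2.0', '3.5', 'n/a'],
    'flag': [1.0, 2.0, np.nan, 1.0],
})

# Same conversion as above: anything that cannot be parsed becomes NaN
for c in toy.columns:
    toy[c] = pd.to_numeric(toy[c], errors='coerce')

# Same binary encoding as above: applies only to columns with exactly two distinct values
# (nunique() ignores NaN, so 'flag' qualifies and 'dose' does not)
for c in toy.columns[toy.nunique() == 2]:
    toy[c] = (toy[c] == 1) * 1.0

print(toy)
```

Because a comparison with `NaN` is false, the missing `flag` entry ends up as 0.0 rather than staying missing. It may be worth checking whether any two-valued column in the real data had missing entries before this step.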

In [6]:
# Set target, survival_status as y; features as X
X = df.drop(columns=['survival_time', 'survival_status'])
y = df.survival_status

In [7]:
# Define list of numeric and categorical columns based on number of unique values
num_cols = X.columns[X.nunique()>7]
cat_cols = X.columns[X.nunique()<=7]
# Print columns with missing values
print('Columns with missing values:')
print(X.columns[X.isnull().sum()>0])

Columns with missing values:
Index(['RecipientABO', 'CMVstatus', 'Antigen', 'Alel', 'CD3dCD34',
       'CD3dkgx10d8', 'Rbodymass'],
      dtype='object')
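The split by number of unique values is a heuristic, and it can be sketched on a toy frame. The column names and the `threshold` variable are illustrative assumptions; only the cutoff value of 7 comes from the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one continuous-looking column and one low-cardinality column
toy_X = pd.DataFrame({
    'age': np.linspace(1.0, 18.0, 10),               # 10 distinct values
    'blood_group': [0, 1, 2, 3, 0, 1, 2, 3, 0, 1],   # 4 distinct values
})

threshold = 7  # same cutoff as in the cell above
toy_num_cols = toy_X.columns[toy_X.nunique() > threshold]
toy_cat_cols = toy_X.columns[toy_X.nunique() <= threshold]

print('numeric:', list(toy_num_cols))
print('categorical:', list(toy_cat_cols))
```

A numeric feature that happens to take few distinct values would be treated as categorical under this rule, so the resulting lists are worth a quick manual check.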


In [8]:
# Split data into train/test set
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
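With a small dataset, a random 80/20 split can leave the two survival classes unevenly represented in the test set. The sketch below shows the optional `stratify` argument of `train_test_split` on made-up data. It is not used in the split above and is offered only as a variation to consider.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, two features, perfectly balanced binary target
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.array([0] * 10 + [1] * 10)

# stratify=toy_y keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0, stratify=toy_y
)

print(X_tr.shape, X_te.shape)
print('positives in test set:', int(y_te.sum()))
```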

In [9]:
x_train.head(5)

| | Recipientgender | Stemcellsource | Donorage | Donorage35 | IIIV | Gendermatch | DonorABO | RecipientABO | RecipientRh | ABOmatch | ... | Relapse | aGvHDIIIIV | extcGvHD | CD34kgx10d6 | CD3dCD34 | CD3dkgx10d8 | Rbodymass | ANCrecovery | PLTrecovery | time_to_aGvHD_III_IV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 144 | 1.0 | 0.0 | 30.279452 | 0.0 | 0.0 | 0.0 | 1 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 10.96 | 10.611083 | 1.03 | 14.8 | 15.0 | 19.0 | 1000000.0 |
| 8 | 1.0 | 1.0 | 32.641096 | 0.0 | 0.0 | 0.0 | 2 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 1.0 | 1.0 | 23.54 | 3.772555 | 6.24 | 20.5 | 15.0 | 14.0 | 1000000.0 |
| 150 | 1.0 | 0.0 | 23.09589 | 0.0 | 0.0 | 1.0 | 2 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 0.0 | 1.88 | 7.910679 | 0.24 | 62.0 | 23.0 | 1000000.0 | 1000000.0 |
| 89 | 0.0 | 0.0 | 28.276712 | 0.0 | 1.0 | 0.0 | -1 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 1.0 | 1.0 | 1.31 | 14.642869 | 0.09 | 72.5 | 21.0 | 1000000.0 | 1000000.0 |
| 177 | 0.0 | 1.0 | 34.167123 | 0.0 | 1.0 | 0.0 | 0 | -1.0 | 1.0 | 1.0 | ... | 0.0 | 1.0 | 1.0 | 11.45 | 1.671314 | 6.85 | 49.0 | 13.0 | 14.0 | 1000000.0 |


### Create preprocessing pipelines

* #### create categorical preprocessing pipeline

In [12]:
# Use the mode (most frequent value) to fill in missing values, then one-hot encode
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use that name on newer versions
cat_vals = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(sparse=False, drop='first', handle_unknown='ignore'))
])

* #### create numerical preprocessing pipeline

In [13]:
# Using mean to fill in missing values and standard scaling of features
num_vals = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler())
])
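The numerical pipeline can be traced by hand on three made-up values (the column name `mass` is hypothetical). The missing entry is replaced by the mean of the observed values, and scaling then centres the column at zero with unit variance.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric column: observed values 10 and 30, one missing entry
toy_num = pd.DataFrame({'mass': [10.0, np.nan, 30.0]})

toy_num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),   # NaN -> mean of observed values (20.0)
    ('scale', StandardScaler()),                   # subtract the mean, divide by the standard deviation
])

scaled = toy_num_pipe.fit_transform(toy_num)
print(scaled.ravel())
```

Because the imputed entry equals the column mean, it lands exactly at 0 after scaling on the data the pipeline was fitted on.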

#### create a column transformer that preprocesses the numerical and categorical features separately

In [15]:
preprocess = ColumnTransformer(transformers=[
    ('cat_process', cat_vals, cat_cols),
    ('num_process', num_vals, num_cols)
])