# Building ML Pipelines
In this project, I will use a dataset containing bone marrow transplantation characteristics for pediatric patients, taken from UCI's Machine Learning Repository.

I will use this dataset to build a pipeline that contains all preprocessing and data cleaning steps, and then select the best classifier to predict patient survival.
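As a rough preview of that plan, here is a minimal sketch of a preprocessing-plus-classifier pipeline tuned with a grid search. It runs on a small synthetic table, not the real data: the `toy` frame, its random values, the made-up target `y`, and the grid over `clf__C` are all assumptions for illustration. Only the column names are borrowed from the data set description, and the steps in the actual project may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption): random values, not real patients.
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "donor_age": rng.normal(33, 8, n),
    "recipient_body_mass": rng.normal(35, 15, n),
    "recipient_gender": rng.choice(["female", "male"], n),
})
toy.loc[::10, "donor_age"] = np.nan  # inject some missing values
# Made-up binary target, only so the classifier has something to fit.
y = (toy["recipient_body_mass"] + rng.normal(0, 5, n) > 35).astype(int)

num_cols = ["donor_age", "recipient_body_mass"]
cat_cols = ["recipient_gender"]

# Numeric columns: fill gaps with the mean, then standardize.
num_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
# Categorical columns: fill gaps with the mode, then one-hot encode.
cat_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", num_steps, num_cols),
    ("cat", cat_steps, cat_cols),
])

# Preprocessing and classifier chained into one estimator.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    toy, y, random_state=0, stratify=y
)

# Search over the regularization strength of the classifier step.
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)
```

Keeping the imputers, scaler, and encoder inside the pipeline means they are refit on each cross-validation training fold, so no statistics from the held-out fold leak into preprocessing.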

### About the data set
* donor_age - Age of the donor at the time of hematopoietic stem cells apheresis

* donor_age_below_35 - Is donor age less than 35 (yes, no)

* donor_ABO - ABO blood group of the donor of hematopoietic stem cells (0, A, B, AB)

* donor_CMV - Presence of cytomegalovirus infection in the donor of hematopoietic stem cells prior to transplantation (present, absent)

* recipient_age - Age of the recipient of hematopoietic stem cells at the time of transplantation

* recipient_age_below_10 - Is recipient age below 10 (yes, no)

* recipient_age_int - Age of the recipient discretized to intervals (0,5], (5, 10], (10, 20]

* recipient_gender - Gender of the recipient (female, male)

* recipient_body_mass - Body mass of the recipient of hematopoietic stem cells at the time of the transplantation
* …
* survival_status - Survival status (0 - alive, 1 - dead)

### Import necessary libraries

In [3]:
import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

from scipy.io import arff

### Load the data set as a dataframe

In [4]:
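# loadarff returns a (records, metadata) tuple
data = arff.loadarff('bone-marrow.arff')
# Build the DataFrame from the records array
df = pd.DataFrame(data[0])
# Remove the 'Disease' column before modeling
df.drop(columns=['Disease'], inplace=True)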
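One thing to keep in mind when loading ARFF files this way: `scipy.io.arff.loadarff` returns nominal attributes as byte strings (for example `b'male'`), and a `?` in a numeric attribute becomes `NaN`. The sketch below shows one possible way to decode the nominal columns. It uses a tiny in-memory ARFF document, and the `ARFF_TEXT` string, the `demo` frame, and its three rows are hypothetical stand-ins, not the contents of `bone-marrow.arff`.

```python
from io import StringIO

import pandas as pd
from scipy.io import arff

# Hypothetical miniature ARFF document (assumption), standing in for
# 'bone-marrow.arff'. The '?' marks a missing numeric value.
ARFF_TEXT = """@relation demo
@attribute donor_age numeric
@attribute recipient_gender {female,male}
@attribute survival_status {0,1}
@data
22.8,male,0
35.1,female,1
?,male,1
"""

# loadarff accepts a file-like object as well as a file name.
records, meta = arff.loadarff(StringIO(ARFF_TEXT))
demo = pd.DataFrame(records)

# Use the ARFF metadata to find the nominal attributes.
nominal_cols = [
    name for name, kind in zip(meta.names(), meta.types())
    if kind == "nominal"
]

# Nominal values arrive as bytes; decode them to regular strings.
for col in nominal_cols:
    demo[col] = demo[col].str.decode("utf-8")

# The target is nominal in this toy file, so cast it to an integer.
demo["survival_status"] = demo["survival_status"].astype(int)
```

Whether this decoding step is needed for the real file depends on how its attributes are declared, so it is worth checking `df.dtypes` and a few rows after loading.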