Bone Marrow Transplants in Children 
=====
A Classification Analysis
----
**Source:** Survival Prediction of Children Undergoing Hematopoietic Stem Cell Transplantation
        https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9527434/

In [2]:
library(tidyverse)
library(tidymodels)
library(repr) 

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

Step 1: Read in the Data
---

In [9]:

# skip = 109 skips the ARFF header lines; col.names supplies the column names by hand.
# R sanitises names with spaces or braces, e.g. " Recipientgender {1,0}" becomes X.Recipientgender..1.0.
bone_marrow_transplant_data <- read.table(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00565/bone-marrow.arff",
    skip = 109, sep = ",",
    col.names = c(
        " Recipientgender {1,0}", "Stemcellsource", "donor.age", "Donorage35", "IIIV", "Gendermatch",
        "DonorABO", "RecipientABO {1,-1,2,0}", "RecipientRh {1,0}", "ABOmatch {0,1}",
        "CMVstatus", "DonorCMV", "RecipientCMV", "Disease {ALL,AML,chronic,nonmalignant,lymphoma}",
        "Riskgroup {1,0}", "Txpostrelapse {0,1}", "Diseasegroup {1,0}", "HLAmatch", "HLAmismatch",
        "Antigen", "Allel", "HLAgrI", "Recipientage", "Recipientage10 {0,1}", "Recipientageint {0,1,2}",
        "Relapse", "aGvHDIIIIV {0,1}", "extcGvHD {1,0}", "CD34kgx10d6_numeric", "CD3dCD34_numeric",
        "CD3dkgx10d8_numeric", "Rbodymass", "ANCrecovery", "PLTrecovery",
        "time_to_aGvHD_III_IV numeric", "survival_time numeric", "survival_status numeric"
    )
)

head(bone_marrow_transplant_data)

|   | X.Recipientgender..1.0. | Stemcellsource | donor.age | Donorage35 | IIIV | Gendermatch | DonorABO | RecipientABO..1..1.2.0. | RecipientRh..1.0. | ABOmatch..0.1. | ⋯ | extcGvHD..1.0. | CD34kgx10d6_numeric | CD3dCD34_numeric | CD3dkgx10d8_numeric | Rbodymass | ANCrecovery | PLTrecovery | time_to_aGvHD_III_IV.numeric | survival_time.numeric | survival_status.numeric |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | ⋯ | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 1 | 0 | 23.34247 | 0 | 1 | 0 | -1 | -1 | 1 | 0 | ⋯ | 1 | 4.5 | 11.078295 | 0.41 | 20.6 | 16 | 37 | 1000000 | 163 | 1 |
| 2 | 1 | 0 | 26.39452 | 0 | 1 | 0 | -1 | -1 | 1 | 0 | ⋯ | 1 | 7.94 | 19.01323 | 0.42 | 23.4 | 23 | 20 | 1000000 | 435 | 1 |
| 3 | 0 | 0 | 39.68493 | 1 | 1 | 0 | 1 | 2 | 1 | 1 | ⋯ | ? | 4.25 | 29.481647 | 0.14 | 50.0 | 23 | 29 | 19 | 53 | 1 |
| 4 | 0 | 1 | 33.3589 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | ⋯ | 1 | 51.85 | 3.972255 | 13.05 | 9.0 | 14 | 14 | 1000000 | 2043 | 0 |
| 5 | 1 | 0 | 27.39178 | 0 | 0 | 0 | 2 | 0 | 1 | 1 | ⋯ | 1 | 3.27 | 8.412758 | 0.39 | 40.0 | 16 | 70 | 1000000 | 2800 | 0 |
| 6 | 0 | 1 | 34.52055 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | ⋯ | ? | 17.78 | 2.406248 | 7.39 | 51.0 | 17 | 29 | 18 | 41 | 1 |
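
The read step above types the column names in by hand and counts header lines with `skip = 109`. As a hedged alternative, the sketch below derives the names from the `@attribute` lines of an ARFF header and reads `?` as `NA` through `na.strings`. It runs on a toy four-column stand-in for the file: the `arff_lines` vector, its attribute declarations, and the names `col_names` and `toy_data` are assumptions made for illustration, while the data values are copied from rows shown in this notebook. It illustrates the idea rather than the exact header of `bone-marrow.arff`.

```r
# Toy stand-in for an ARFF file (hypothetical header, values from this notebook).
arff_lines <- c(
  "@relation toy",
  "@attribute donor.age numeric",
  "@attribute CMVstatus {3,0,2,1}",
  "@attribute Rbodymass numeric",
  "@attribute Relapse {0,1}",
  "@data",
  "23.34247,0,20.6,1",
  "27.39178,?,40,0",
  "34.52055,?,51,0"
)

# Keep the @attribute lines and extract the second token as the column name.
attr_lines <- grep("^@attribute", arff_lines, ignore.case = TRUE, value = TRUE)
col_names <- sub(
  "^@attribute[[:space:]]+([^[:space:]]+).*$", "\\1",
  attr_lines, ignore.case = TRUE
)

# Everything after the @data marker is plain comma-separated data.
data_start <- grep("^@data", arff_lines, ignore.case = TRUE)
data_lines <- arff_lines[(data_start + 1):length(arff_lines)]

# na.strings = "?" turns the ARFF missing-value marker into NA.
toy_data <- read.table(
  text = data_lines, sep = ",",
  col.names = col_names, na.strings = "?"
)

str(toy_data)
```

With `?` mapped to `NA`, a column that contains missing entries is parsed as a number with `NA` values rather than as text.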


Step 2: Clean and Wrangle Data into Tidy Format (and choose which columns we need to use for analysis)
-----


**Class that we are Predicting**

Relapse = Whether the patient relapsed after the transplant
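
Because `Relapse` is the class label, it can help to check how balanced the two classes are before modelling. The following is a minimal sketch, assuming the 0/1 coding seen in the printed data and using only the `Relapse` values of the ten rows displayed in the Step 2 output; the names `relapse_shown`, `class_counts`, and `class_share` are hypothetical.

```r
# Relapse values of the ten rows printed in this notebook (not the full data set).
relapse_shown <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)

# Count observations per class, then convert the counts to proportions.
class_counts <- table(relapse_shown)
class_share <- prop.table(class_counts)

print(class_counts)
print(class_share)
```

Once `clean_transplant_data` has been created, the same check on the full data would be `table(clean_transplant_data$Relapse)`.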

**Rationale Behind Choosing These Variables**

donor.age = Age of the donor. Donor age affects the health of the HPCs, which can influence transplant success and therefore the chance of relapse.

CMVstatus = Cytomegalovirus infection compatibility between the donor and the recipient. A higher value means lower compatibility.

HLAmatch = MHC match between the donor and the recipient (a lower value is a better match and so supports better acceptance of the transplant)

Antigen = Number of antigens that differ between the donor and the recipient (-1 means no differences; larger numbers mean more differences)

Allel = Number of alleles that differ between the donor and the recipient (-1 means no differences; larger numbers mean more differences)

Recipientage = Age of the recipient of hematopoietic stem cells at the time of transplantation. 

Rbodymass = Body mass of the recipient of hematopoietic stem cells at the time of transplantation

ANCrecovery = Time to neutrophil recovery, defined as a neutrophil count > 0.5 x 10^9/L (note: this is important because neutrophils derive from the myeloid progenitors, which in turn derive from the HPCs included in bone marrow transplants)

PLTrecovery = Time to platelet recovery, defined as a platelet count > 50000/mm3


In [10]:
clean_transplant_data <- bone_marrow_transplant_data |>
    select(donor.age, CMVstatus, HLAmatch, Antigen, Allel, Recipientage, Relapse, Rbodymass, ANCrecovery, PLTrecovery)
clean_transplant_data

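| donor.age | CMVstatus | HLAmatch | Antigen | Allel | HLAgrI | Recipientage | Relapse | Rbodymass | ANCrecovery | PLTrecovery |
|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<dbl>` | `<int>` | `<chr>` | `<int>` | `<int>` |
| 23.34247 | 0 | 0 | -1 | -1 | 0 | 4.0 | 1 | 20.6 | 16 | 37 |
| 26.39452 | 2 | 0 | -1 | -1 | 0 | 6.6 | 1 | 23.4 | 23 | 20 |
| 39.68493 | 1 | 0 | -1 | -1 | 0 | 18.1 | 0 | 50 | 23 | 29 |
| 33.35890 | 0 | 1 | 1 | 0 | 1 | 1.3 | 0 | 9 | 14 | 14 |
| 27.39178 | ? | 0 | -1 | -1 | 0 | 8.9 | 0 | 40 | 16 | 70 |
| 34.52055 | ? | 0 | -1 | -1 | 0 | 14.4 | 0 | 51 | 17 | 29 |
| 21.43562 | 1 | 3 | 1 | 2 | 7 | 18.2 | 0 | 56 | 22 | 58 |
| 32.64110 | 2 | 0 | -1 | -1 | 0 | 7.9 | 0 | 20.5 | 15 | 14 |
| 28.78356 | 2 | 1 | 0 | 1 | 3 | 4.7 | 0 | 16.5 | 16 | 17 |
| 29.73151 | 1 | 1 | 0 | 1 | 2 | 1.9 | 0 | 10.5 | 12 | 13 |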
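
Several of the selected columns are still `<chr>`, apparently because missing values are encoded as `?`, and `Relapse` is an integer, whereas tidymodels classification models expect a factor outcome. The sketch below shows one possible clean-up on a three-row toy data frame built from values printed above. The names `toy_transplant` and `toy_clean` and the labels `no`/`yes` (assuming 1 means the patient relapsed) are illustrative assumptions, not part of the original analysis.

```r
library(dplyr)

# Three rows copied from the printed output; character columns mimic the <chr> types.
toy_transplant <- data.frame(
  donor.age = c(23.34247, 26.39452, 27.39178),
  CMVstatus = c("0", "2", "?"),
  Rbodymass = c("20.6", "23.4", "40"),
  Relapse   = c(1L, 1L, 0L)
)

toy_clean <- toy_transplant |>
  mutate(
    # Replace the "?" marker with NA first, then convert the text to numbers.
    across(where(is.character), ~ as.numeric(na_if(.x, "?"))),
    # Assumption: 0 = no relapse, 1 = relapse.
    Relapse = factor(Relapse, levels = c(0, 1), labels = c("no", "yes"))
  )

str(toy_clean)
```

Applying `na_if()` before `as.numeric()` avoids the coercion warning that `as.numeric("?")` would raise, and keeping the rows with `NA` leaves the choice between dropping and imputing them to a later step.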
