Bone Marrow Transplants in Children 
=====
A Classification Analysis
----

In [2]:
library(tidyverse)
library(tidymodels)
library(repr) 

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

Step 1: Read in the Data
---

In [36]:

# Column names follow the attribute order in the ARFF header. The level lists
# ({1,0}) and type keywords (numeric) are dropped so that every name is a
# valid R identifier and matches the names used in later cells.
column_names <- c(
    "Recipientgender", "Stemcellsource", "donor.age", "Donorage35", "IIIV",
    "Gendermatch", "DonorABO", "RecipientABO", "RecipientRh", "ABOmatch",
    "CMVstatus", "DonorCMV", "RecipientCMV", "Disease", "Riskgroup",
    "Txpostrelapse", "Diseasegroup", "HLAmatch", "HLAmismatch", "Antigen",
    "Allel", "HLAgrI", "Recipientage", "Recipientage10", "Recipientageint",
    "Relapse", "aGvHDIIIIV", "extcGvHD", "CD34kgx10d6", "CD3dCD34",
    "CD3dkgx10d8", "Rbodymass", "ANCrecovery", "PLTrecovery",
    "time_to_aGvHD_III_IV", "survival_time", "survival_status"
)

# skip = 109 jumps over the ARFF header lines that precede the data rows.
# Missing values are coded as "?" in this file, so any column that contains
# one is read as character (<chr>) rather than as a number.
bone_marrow_transplant_data <- read.table(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00565/bone-marrow.arff",
    skip = 109, sep = ",", col.names = column_names
)

head(bone_marrow_transplant_data)

,Recipientgender,Stemcellsource,donor.age,Donorage35,IIIV,Gendermatch,DonorABO,RecipientABO,RecipientRh,ABOmatch,⋯,extcGvHD,CD34kgx10d6,CD3dCD34,CD3dkgx10d8,Rbodymass,ANCrecovery,PLTrecovery,time_to_aGvHD_III_IV,survival_time,survival_status
,<int>,<int>,<dbl>,<int>,<int>,<int>,<int>,<chr>,<chr>,<chr>,⋯,<chr>,<dbl>,<chr>,<chr>,<chr>,<int>,<int>,<int>,<int>,<int>
1,1,0,23.34247,0,1,0,-1,-1,1,0,⋯,1,4.5,11.078295,0.41,20.6,16,37,1000000,163,1
2,1,0,26.39452,0,1,0,-1,-1,1,0,⋯,1,7.94,19.01323,0.42,23.4,23,20,1000000,435,1
3,0,0,39.68493,1,1,0,1,2,1,1,⋯,?,4.25,29.481647,0.14,50.0,23,29,19,53,1
4,0,1,33.3589,0,0,0,1,2,0,1,⋯,1,51.85,3.972255,13.05,9.0,14,14,1000000,2043,0
5,1,0,27.39178,0,0,0,2,0,1,1,⋯,1,3.27,8.412758,0.39,40.0,16,70,1000000,2800,0
6,0,1,34.52055,0,1,0,0,1,0,1,⋯,?,17.78,2.406248,7.39,51.0,17,29,18,41,1
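
The `col.names` vector in the cell above mirrors the attribute declarations in the ARFF header. As a rough sketch, assuming the file follows the usual ARFF layout (`@attribute <name> <type>` lines followed by an `@data` marker), the names could be pulled from the header instead of typed by hand, and `na.strings = "?"` would turn the `?` markers visible in the output into `NA`. The inline `arff_text` and the `demo` data frame are made-up stand-ins, not the real file.

```r
# Minimal inline stand-in for an ARFF file (hypothetical rows, not real data).
arff_text <- c(
  "% comment lines start with a percent sign",
  "@relation demo",
  "@attribute Recipientgender {1,0}",
  "@attribute Donorage numeric",
  "@attribute extcGvHD {1,0}",
  "@data",
  "1,23.5,1",
  "0,39.7,?"
)

# Attribute declarations look like "@attribute <name> <type>"; keep only <name>.
attr_lines <- grep("^@attribute", arff_text, ignore.case = TRUE, value = TRUE)
col_names <- sub("^@attribute\\s+(\\S+).*$", "\\1", attr_lines,
                 ignore.case = TRUE, perl = TRUE)

# Everything after the "@data" marker is plain comma-separated values.
data_start <- grep("^@data", arff_text, ignore.case = TRUE)
data_lines <- arff_text[(data_start + 1):length(arff_text)]

# na.strings = "?" makes read.table() treat "?" as a missing value,
# so the affected column stays numeric instead of becoming character.
demo <- read.table(text = data_lines, sep = ",",
                   col.names = col_names, na.strings = "?")
str(demo)
```

Deriving the names this way keeps them in step with the file, at the cost of assuming that no attribute name is quoted or contains spaces.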


Step 2: Clean and Wrangle the Data into Tidy Format and Choose the Columns for Analysis
-----


Meaning of Column Names

- Recipientgender: Male = 1, Female = 0
- Stemcellsource: Source of hematopoietic stem cells (Peripheral blood = 1, Bone marrow = 0)
- donor.age (`Donorage` in the source file): Age of the donor at the time of hematopoietic stem cell apheresis
- Donorage35: Donor age <35 = 0, Donor age >=35 = 1
- IIIV: Development of acute graft versus host disease stage II, III, or IV (Yes = 1, No = 0)
- Gendermatch: Compatibility of the donor and recipient according to their gender (Female to Male = 1, Other = 0)
- DonorABO: ABO blood group of the donor of hematopoietic stem cells (O = 0, A = 1, B = -1, AB = 2)
- RecipientABO: ABO blood group of the recipient of hematopoietic stem cells (O = 0, A = 1, B = -1, AB = 2)
- RecipientRh: Presence of the Rh factor on the recipient's red blood cells ('+' = 1, '-' = 0)
- ABOmatch: Compatibility of the donor and the recipient of hematopoietic stem cells according to ABO blood group (the source description gives matched = 1, mismatched = 1; the two codes cannot both be 1, so check the data before relying on this column)
- CMVstatus: Serological compatibility of the donor and the recipient of hematopoietic stem cells according to cytomegalovirus infection prior to transplantation (the higher the value, the lower the compatibility)
- DonorCMV: Presence of cytomegalovirus infection in the donor of hematopoietic stem cells prior to transplantation (presence = 1, absence = 0)
- RecipientCMV: Presence of cytomegalovirus infection in the recipient of hematopoietic stem cells prior to transplantation (presence = 1, absence = 0)
- Disease: Type of disease (ALL, AML, chronic, nonmalignant, lymphoma)
- Riskgroup: High risk = 1, Low risk = 0
- Txpostrelapse: The second bone marrow transplantation after relapse (No = 0, Yes = 1)
- Diseasegroup: Type of disease (malignant = 1, nonmalignant = 0)
- HLAmatch: Compatibility of antigens of the main histocompatibility complex of the donor and the recipient of hematopoietic stem cells according to ALL international BFM SCT 2008 criteria (10/10 = 0, 9/10 = 1, 8/10 = 2, 7/10 = 3 (allele/antigens))
- HLAmismatch: HLA matched = 0, HLA mismatched = 1
- Antigen: Number of antigens that differ between the donor and the recipient (-1 = no differences, 0 = one difference, 1 = two differences, 2 = three differences)
- Allel: Number of alleles that differ between the donor and the recipient (-1 = no differences, 0 = one difference, 1 = two differences, 2 = three differences, 3 = four differences)
- HLAgrI: The type of difference between the donor and the recipient (HLA matched = 0, the difference is in only one antigen = 1, the difference is in only one allele = 2, the difference is only in DRB1 cell = 3, two differences (two alleles or two antigens) = 4, two differences (two alleles or two antigens) = 5)
- Recipientage: Age of the recipient of hematopoietic stem cells at the time of transplantation
- Recipientage10: Recipient age <10 = 0, Recipient age >=10 = 1
- Recipientageint: Recipient age in (0, 5] = 0, (5, 10] = 1, (10, 20] = 2
- Relapse: Reoccurrence of the disease (No = 0, Yes = 1)
- aGvHDIIIIV: Development of acute graft versus host disease stage III or IV (Yes = 0, No = 1)
- extcGvHD: Development of extensive chronic graft versus host disease (Yes = 0, No = 1)
- CD34kgx10d6: CD34+ cell dose per kg of recipient body weight (10^6/kg)
- CD3dCD34: CD3+ cell to CD34+ cell ratio
- CD3dkgx10d8: CD3+ cell dose per kg of recipient body weight (10^8/kg)
- Rbodymass: Body mass of the recipient of hematopoietic stem cells at the time of transplantation
- ANCrecovery: Time to neutrophil recovery, defined as neutrophil count >0.5 x 10^9/L
- PLTrecovery: Time to platelet recovery, defined as platelet count >50000/mm3
- time_to_aGvHD_III_IV: Time to development of acute graft versus host disease stage III or IV
- survival_time: numeric
- survival_status
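
Many of the columns in the legend store categories as integer codes. A minimal sketch of turning such codes into labelled factors is shown here, using base R's `factor()`. The `toy` data frame holds made-up values, and the ABO mapping (O = 0, A = 1, B = -1, AB = 2) is one reading of the legend rather than something confirmed against the data.

```r
# Toy rows using the codes from the legend (hypothetical values, not real patients).
toy <- data.frame(
  Stemcellsource = c(1, 0, 1),
  DonorABO       = c(0, -1, 2),
  Relapse        = c(0, 1, 0)
)

# factor() maps each stored code (levels) to a readable label (labels).
toy$Stemcellsource <- factor(toy$Stemcellsource, levels = c(1, 0),
                             labels = c("Peripheral blood", "Bone marrow"))

# Assumed reading of the ABO codes: O = 0, A = 1, B = -1, AB = 2.
toy$DonorABO <- factor(toy$DonorABO, levels = c(0, 1, -1, 2),
                       labels = c("O", "A", "B", "AB"))

toy$Relapse <- factor(toy$Relapse, levels = c(0, 1), labels = c("No", "Yes"))

toy
```

Labelled factors make summaries and plots easier to read and stop a model from treating a code such as -1 as a quantity.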


In [39]:
# Keep only the columns used in the analysis. These names must match the
# col.names supplied to read.table() in Step 1.
clean_transplant_data <- bone_marrow_transplant_data |>
    select(donor.age, CMVstatus, RecipientCMV, HLAmatch, Antigen, Allel,
           HLAgrI, Recipientage, Relapse, Rbodymass, ANCrecovery, PLTrecovery)
clean_transplant_data
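
Selecting columns is only part of the cleaning. The `head()` output in Step 1 shows several columns, `Rbodymass` among them, typed as `<chr>` because the file marks missing values with `?`. The sketch below shows one way to handle that with dplyr. The `toy` data frame is made up, and both the use of `Relapse` as a class label and the choice to drop incomplete rows are assumptions, not decisions taken from this notebook.

```r
library(dplyr)

# Toy rows (hypothetical values) mimicking a column that arrives as character
# because the file marks missing values with "?".
toy <- data.frame(
  Recipientage = c(9.6, 4.0, 18.1),
  Rbodymass    = c("20.6", "?", "50.0"),
  Relapse      = c(0L, 1L, 0L)
)

cleaned <- toy |>
  # Turn the "?" marker into NA, then parse the remaining text as numbers.
  mutate(Rbodymass = as.numeric(na_if(Rbodymass, "?"))) |>
  # A column used as a class label should be a factor, not an integer.
  mutate(Relapse = factor(Relapse, levels = c(0, 1), labels = c("No", "Yes"))) |>
  # Drop rows with a missing value (one possible policy; imputation is another).
  filter(!is.na(Rbodymass))

cleaned
```

Converting `?` to `NA` before parsing avoids the coercion warning that `as.numeric()` would otherwise raise on the marker.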
