## <center>INTRACRANIAL HEMORRHAGES DATABASE CLEANING AND CURATION</center>

After anonymization, the following data cleaning and curation steps are carried out before continuing with data analysis and modeling:

1. Load the database
2. Check variables: datatypes and values
3. Change datatypes when appropriate
4. Change values when appropriate
5. Delete non-useful variables
6. Change column names
7. Check the cleaned database
8. Save the cleaned and curated database and the database metadata

## Data Curation

### 1. Load Data

In [1]:
# Import every column as character: calling as.numeric() on a factor returns its level codes
# (e.g. 0 becomes 1) instead of the original values. The code '1234' marks missing data.
df= read.csv('Databases/ICH_database_anonymized.csv', sep=',', na.strings=c('1234'), colClasses=c('character'))
head(df)

X,Marca.temporal,SEXO..1.Hombre..2.Mujer.,Fecha.de.ingreso,Fecha.de.alta,Hospital.de.Procedencia,Fecha.de.TC,Fecha.de.análisis.de.sangre,Empeoramiento.clínico.después.del.TC..0.no..1.empeora.a.causa.del.hematoma..2.muere.a.causa.del.hematoma..3.empeora.por.otra.causa.derivada...Tener.en.cuenta.los.quirúrgicos.en.el.análisis,Secuelas..0.no..1.si..2.si.muere.a.causa.del.hematoma..3.muere.a.causa.de.otra.complicación..4.le.dan.el.alta.y.muere.en.los.siguientes.3.meses.por.el.hematoma...De.momento.todo.se.considera.secuela..cuadrantanopsias.....,...,ADE......1234...si.no.dispone.de.datos.,C.H.C.M...g.dL...1234...si.no.dispone.de.datos.,V.P.M...fL...1234...si.no.dispone.de.datos.,HCM..pg...1234...si.no.dispone.de.datos.,INR..1234...si.no.dispone.de.datos.,Fibrinógeno..mg.dL...1234...si.no.dispone.de.datos.,Fibrinógeno.máximo.registrado.durante.el.ingreso,Time.between.head.CT.scan.and.blood.analysis..days.,Age.at.the.hospital.admission.date..years.,Survival.days.after.admission..days.
1,1,1,anonymized,anonymized,1,anonymized,anonymized,2,2,...,17.0,33.0,9.0,28.1,4.21,344.0,618.0,0,74,4
2,2,2,anonymized,anonymized,2,anonymized,anonymized,0,1,...,14.7,32.5,10.7,29.8,,,1081.0,0,81,2128
3,3,2,anonymized,anonymized,1,anonymized,anonymized,0,0,...,14.0,33.1,8.7,30.1,3.16,298.0,470.0,0,78,2388
4,4,2,anonymized,anonymized,1,anonymized,anonymized,2,2,...,15.8,34.1,7.6,19.3,1.09,344.0,344.0,0,79,7
5,5,1,anonymized,anonymized,2,anonymized,anonymized,0,1,...,13.4,33.5,7.8,30.7,1.01,,,0,86,1016
6,6,2,anonymized,anonymized,1,anonymized,anonymized,2,2,...,13.4,32.8,7.9,32.1,0.98,332.0,332.0,0,88,5
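
The comment in the loading cell refers to a common R pitfall. The following is a minimal sketch on a hypothetical toy vector (not taken from the database) of how a value of 0 can turn into 1 when a factor is converted to numeric, and why importing as character avoids it:

```r
# Toy column standing in for a 0/1-coded variable (hypothetical data)
x_chr <- c("0", "1", "0", "1")

# If the column had been imported as a factor, as.numeric() would return
# the internal level codes (1, 2), not the original values (0, 1)
x_fct <- factor(x_chr)
codes <- as.numeric(x_fct)

# Converting from character keeps the original values
values <- as.numeric(x_chr)

# Safe conversion when only a factor is available
recovered <- as.numeric(as.character(x_fct))

print(codes)
print(values)
```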


### 2. Check variables: datatypes and values

In [2]:
str(df, list.len=ncol(df))

'data.frame':	300 obs. of  162 variables:
 $ X                                                                                                                                                                                                                                                                                                                                                            : chr  "1" "2" "3" "4" ...
 $ Marca.temporal                                                                                                                                                                                                                                                                                                                                               : chr  "1" "2" "3" "4" ...
 $ SEXO..1.Hombre..2.Mujer.                                                                                                                                                                             
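
Besides `str()`, a compact per-variable overview can help decide which datatype each column should get. This is a sketch on a hypothetical toy data frame; the column names and the `overview` table are illustrative assumptions, not part of the original notebook:

```r
# Toy data imported as character (hypothetical values)
toy <- data.frame(
  sex = c("1", "2", "2", "1"),
  inr = c("4.21", NA, "3.16", "1.09"),
  stringsAsFactors = FALSE
)

# Number of distinct non-missing values and number of missing values per column
overview <- data.frame(
  variable   = names(toy),
  n_distinct = vapply(toy, function(col) length(unique(col[!is.na(col)])), integer(1)),
  n_missing  = vapply(toy, function(col) sum(is.na(col)), integer(1)),
  row.names  = NULL
)
print(overview)
```

Columns with few distinct values are candidates for factors, while columns with many distinct numeric-looking values are candidates for numerics.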

### 3. Change datatypes when appropriate

In [3]:
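# Column indices grouped by target datatype
numerics= c(1,2,13,15,18,38:44,46,88:93,109,122:126,140:162)
factors= c(3,6,9,10,14,16,17,19:37,45,47:78,79:87,94:108,110:120,127:139)
# Columns kept as character (listed for reference; not converted below)
characters= c(4,5,7,8,11,12,121)

# Values that cannot be parsed as numbers become NA; the coercion warnings are silenced
df[,numerics]= suppressWarnings(lapply(df[,numerics], as.numeric))
df[,factors]= lapply(df[,factors], as.factor)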
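
Because the coercion warnings are suppressed, it may be worth listing which original strings would silently become `NA`. This is a minimal sketch on hypothetical toy data; `failed_values` is an illustrative helper, not a function from the notebook:

```r
# Toy character columns with some non-numeric entries (hypothetical values)
toy <- data.frame(
  age = c("74", "81", "unknown", NA),
  inr = c("4.21", "", "3,16", "1.09"),
  stringsAsFactors = FALSE
)

# Return the original strings that as.numeric() cannot parse
# (values that were already missing are not reported)
failed_values <- function(col) {
  converted <- suppressWarnings(as.numeric(col))
  unique(col[is.na(converted) & !is.na(col)])
}

problems <- lapply(toy, failed_values)
print(problems)
```

Running such a check before the conversion shows whether the new `NA` values come from genuinely missing data or from typos such as a decimal comma.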

### 4. Change values when appropriate

In [4]:
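# Column 9 (clinical worsening after the CT scan) contains some free-text entries;
# map them onto the numeric coding (0 = no, 1 = worsens, 2 = dies) and set empty strings to NA
levels(df[,9])[which(levels(df[,9]) == 'Bien')]= 0
levels(df[,9])[which(levels(df[,9]) == '1. Plejia de la mano. ')]= 1
levels(df[,9])[which(levels(df[,9]) == 'Empeora')]= 1
levels(df[,9])[which(levels(df[,9]) == 'M')]= 2
levels(df[,9])[which(levels(df[,9]) == '')]= NA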
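
When several levels have to be recoded, the same relabelling can be driven by a lookup vector. This is a sketch of one possible alternative on a hypothetical toy factor, not the method used in the cell above; `recode_map` is an illustrative name:

```r
# Toy factor mixing numeric codes and free-text entries (hypothetical values)
outcome <- factor(c("0", "Bien", "Empeora", "2", "M", ""))

# Lookup: free-text level -> numeric code
recode_map <- c(Bien = "0", Empeora = "1", M = "2")

# Relabel the levels; duplicated labels are merged by `levels<-`
lv <- levels(outcome)
hit <- lv %in% names(recode_map)
lv[hit] <- recode_map[lv[hit]]
lv[lv == ""] <- NA          # empty strings become missing values
levels(outcome) <- lv

print(as.character(outcome))
```

Keeping the mapping in one vector makes the recoding easier to review than a sequence of single assignments.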

### 5. Delete non-useful variables

In [5]:
# Variables with a single distinct value (the same for all patients), with redundant information...
non_useful= c(2,4,5,7,8,11,12,37,44,57,63,69,71,75,76,77,83,121,135,136)
df[1,non_useful]

Marca.temporal,Fecha.de.ingreso,Fecha.de.alta,Fecha.de.TC,Fecha.de.análisis.de.sangre,FECHA.DE.NACIMIENTO..día..mes..año.,Antecedentes.familiares..0...no.tiene..1234...si.no.hay.datos.,Tipo.diabétes,Tratamiento..nº.fármacos.que.toma...Inmunoterápico.,Antidiabéticos..0.no..1..sí...Tiazolinidadionas..pioglitazona.....,Antidiabéticos..0.no..1..sí...Gliflozinas..empagliflozina......,Hipolipemiantes..Antiagregantes..y.Anticoagulantes..0...no..1...sí...Otros.antiagregantes.,Hipolipemiantes..Antiagregantes..y.Anticoagulantes..0...no..1...sí...HBPM.,Hipolipemiantes..Antiagregantes..y.Anticoagulantes..0...no..1...sí...Apixaban.,Hipolipemiantes..Antiagregantes..y.Anticoagulantes..0...no..1...sí...Edoxaban.,Hipolipemiantes..Antiagregantes..y.Anticoagulantes..0...no..1...sí...Betrixaban.,Sintomatología.y.AP..0.No..1..Sí..1234...No.se.dispone...Otra.sintomatología.,Fecha.de.mortalidad,Causa.del.sangrado..0...no..1...sí..1234...no.se.sabe...Post.proceso.intervencionista.de.otro.tipo.,Causa.del.sangrado..0...no..1...sí..1234...no.se.sabe...Sangrado.neoplásico.primario.secundario.
1,anonymized,anonymized,anonymized,anonymized,anonymized,,2,0,0,0,0,0,0,0,0,1,anonymized,0,0
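
The indices above are given by hand. As a cross-check, columns with a single distinct value could be detected programmatically. This is a sketch on hypothetical toy data that covers only the "same value for all patients" criterion; redundant variables still need manual review:

```r
# Toy data with two uninformative columns (hypothetical values)
toy <- data.frame(
  patient = c(1, 2, 3),
  betrixaban = c("0", "0", "0"),
  family_history = c(NA, NA, NA),
  sex = c("1", "2", "2"),
  stringsAsFactors = FALSE
)

# A column carries no information if it has at most one distinct non-missing value
is_constant <- vapply(
  toy,
  function(col) length(unique(col[!is.na(col)])) <= 1,
  logical(1)
)
constant_idx <- which(is_constant)
print(constant_idx)
```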


In [6]:
df_cleaned= df[,-non_useful]

### 6. Change variable names

In [7]:
df_metadata= read.csv('Databases/ICH_database_anonymized_metadata.csv', sep=',', colClasses=c('character'))
df_cleaned_metadata= df_metadata[-non_useful,-2]
head(df_cleaned_metadata,1)

Variable_Name,Variable_Label,Variable_Definition,R_Datatype,Python_Datatype,Pandas_Datatype,Values,Maximum_Number_of_Different_Values_in_the_Dataset,Comment,Type_of_Variable
Patient Number,patient,Patient index,numeric,int,int64,From 1 to 300,300,,Auxiliary


In [8]:
colnames(df_cleaned)= df_cleaned_metadata[['Variable_Label']]
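
Renaming by position relies on the metadata rows being in the same order as the database columns. This is a minimal sketch of sanity checks that could be run before the assignment; the toy objects are hypothetical and only the `Variable_Label` column name is taken from the metadata shown above:

```r
# Toy database and metadata (hypothetical values)
toy_data <- data.frame(X = 1:3, SEXO = c("1", "2", "2"))
toy_meta <- data.frame(Variable_Label = c("patient", "sex"), stringsAsFactors = FALSE)

labels <- toy_meta[["Variable_Label"]]
stopifnot(
  length(labels) == ncol(toy_data),   # one label per column
  !anyDuplicated(labels),             # labels are unique
  all(labels == make.names(labels))   # labels are syntactically valid R names
)

colnames(toy_data) <- labels
print(colnames(toy_data))
```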

### 7. Check everything was correctly changed

In [9]:
str(df_cleaned, list.len=ncol(df_cleaned))

'data.frame':	300 obs. of  142 variables:
 $ patient                      : num  1 2 3 4 5 6 7 8 9 10 ...
 $ sex                          : Factor w/ 2 levels "1","2": 1 2 2 2 1 2 2 1 1 2 ...
 $ hospital                     : Factor w/ 4 levels "1","2","3","NA": 1 2 1 1 2 1 1 1 1 1 ...
 $ follow_up                    : Factor w/ 4 levels "0","1","2","3": 3 1 1 3 1 3 1 1 3 1 ...
 $ final_outcome                : Factor w/ 6 levels "0","1","2","3",..: 3 2 1 3 2 3 1 1 3 2 ...
 $ nfamily_medhist              : num  17 4 8 8 4 5 0 5 9 6 ...
 $ tobacco                      : Factor w/ 4 levels "0","1","2","NA": 3 1 1 1 1 1 1 1 3 1 ...
 $ n_tobacco                    : num  NA 0 0 0 0 0 0 0 0 0 ...
 $ drugs                        : Factor w/ 4 levels "0","1","2","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ alcohol                      : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 2 1 1 1 4 1 ...
 $ g_alcohol                    : num  NA 0 0 0 0 0 0 0 NA 0 ...
 $ ht                           : Factor w/ 2 
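
Some factors in the output above (for example `hospital` and `tobacco`) have a level literally named "NA". Since `na.strings` was set to `'1234'` on import, the string "NA" is no longer read as a missing value. This is a sketch on hypothetical toy data of how such levels could be found and, if desired, turned into real missing values; whether that is appropriate for this database is an assumption:

```r
# Toy factors, one of them with a literal "NA" level (hypothetical values)
toy <- data.frame(
  hospital = factor(c("1", "2", "NA", "1")),
  sex = factor(c("1", "2", "2", "1"))
)

# Flag factor columns that contain the string "NA" as a level
has_na_level <- vapply(
  toy,
  function(col) is.factor(col) && "NA" %in% levels(col),
  logical(1)
)
print(names(toy)[has_na_level])

# One possible fix: turn the literal "NA" level into a real missing value
fixed <- toy
for (name in names(toy)[has_na_level]) {
  levels(fixed[[name]])[levels(fixed[[name]]) == "NA"] <- NA
}
print(levels(fixed$hospital))
```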

### 8. Save the cleaned and curated database and the database metadata

In [10]:
# Save the new database
write.csv(df_cleaned, 'Databases/ICH_database.csv', row.names=FALSE)
saveRDS(df_cleaned, file='Databases/ICH_database.rds')

# Save the metadata of the new database
write.csv(df_cleaned_metadata, 'Databases/ICH_database_metadata.csv', row.names=FALSE)
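
The database is saved both as CSV and as RDS. This is a short sketch on hypothetical toy data, written to temporary files, of the practical difference: the RDS file restores the R datatypes set during curation, while the CSV file has to be re-typed after loading.

```r
# Toy curated data with a numeric and a factor column (hypothetical values)
toy <- data.frame(patient = c(1, 2, 3), sex = factor(c("1", "2", "2")))

# Temporary paths so that no project files are touched
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write.csv(toy, csv_path, row.names = FALSE)
saveRDS(toy, file = rds_path)

from_csv <- read.csv(csv_path)
from_rds <- readRDS(rds_path)

print(class(from_csv$sex))  # the factor is read back as integer
print(class(from_rds$sex))  # the factor is preserved
```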