## <center>INTRACRANIAL HEMORRHAGES DATABASE - ANONYMIZATION 2</center>

In a first step, all patient identifiers (IDs, ...) were removed. This Notebook carries out a second anonymization step: the anonymization of the dates. Dates can help to identify patients, so it is safer to anonymize every variable that contains one. To avoid losing information, three new variables are derived from the dates and added to the database before the dates are overwritten:
- Time between head CT scan and blood analysis (days)
- Age at the hospital admission date (years)
- Survival days after admission (days)

This Notebook conducts the second anonymization step in the following sub-steps:
1. Load data
2. Convert the date columns to the Date type
3. Generate the new variables
4. Anonymize the dates and save the anonymized database.

### Load libraries

In [1]:
library(data.table)

### 1. Load Data

In [2]:
df= read.csv('Databases/ICH_database_nonredudant_pseudoanonymized.csv', sep=',')
head(df,1)

X,Marca.temporal,SEXO..1.Hombre..2.Mujer.,Fecha.de.ingreso,Fecha.de.alta,Hospital.de.Procedencia,Fecha.de.TC,Fecha.de.análisis.de.sangre,Empeoramiento.clínico.después.del.TC..0.no..1.empeora.a.causa.del.hematoma..2.muere.a.causa.del.hematoma..3.empeora.por.otra.causa.derivada...Tener.en.cuenta.los.quirúrgicos.en.el.análisis,Secuelas..0.no..1.si..2.si.muere.a.causa.del.hematoma..3.muere.a.causa.de.otra.complicación..4.le.dan.el.alta.y.muere.en.los.siguientes.3.meses.por.el.hematoma...De.momento.todo.se.considera.secuela..cuadrantanopsias.....,...,Hematocrito......1234...si.no.dispone.de.datos.,Plaquetas..10.3.uL...1234...si.no.dispone.de.datos.,VCM..fL...1234...si.no.dispone.de.datos.,ADE......1234...si.no.dispone.de.datos.,C.H.C.M...g.dL...1234...si.no.dispone.de.datos.,V.P.M...fL...1234...si.no.dispone.de.datos.,HCM..pg...1234...si.no.dispone.de.datos.,INR..1234...si.no.dispone.de.datos.,Fibrinógeno..mg.dL...1234...si.no.dispone.de.datos.,Fibrinógeno.máximo.registrado.durante.el.ingreso
1,1,1,10-10-2012,14-10-2012,1,10-10-2012,10-10-2012,2,2,...,40.1,107,85,17,33,9,28.1,4.21,344,618
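
The dotted column names in the output above come from `read.csv`, which by default (`check.names=TRUE`) passes every header through `make.names`. The sketch below illustrates the effect; the raw header "Fecha de ingreso" is an assumption made for illustration, since the original file is not shown here.

```r
# Hypothetical raw header, assumed for illustration only
raw_name <- "Fecha de ingreso"
clean_name <- make.names(raw_name)
print(clean_name)

# One of the variables created later in this Notebook: spaces and
# parentheses are each replaced by a dot
new_name <- "Age at the hospital admission date (years)"
mangled <- make.names(new_name)
print(mangled)
```

The same conversion would apply to the new variable names if the anonymized file were read back with the default settings of `read.csv`.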


### 2. Convert the date columns to the Date type

In [3]:
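# Column indices of the date variables, grouped by the format in which they are stored
dates1= c(4,5,7,8)   # day-month-year: admission, discharge, CT scan, blood analysis
dates2= c(11,121)    # year-month-day: birth, death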

df[,dates1]= lapply(df[,dates1], function(x) as.Date(x, format='%d-%m-%Y'))
df[,dates2]= lapply(df[,dates2], function(x) as.Date(x, format='%Y-%m-%d'))

In [4]:
df[0,c(dates1,dates2)]

Fecha.de.ingreso,Fecha.de.alta,Fecha.de.TC,Fecha.de.análisis.de.sangre,FECHA.DE.NACIMIENTO..día..mes..año.,Fecha.de.mortalidad
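
The conversion depends on each column being parsed with the matching format string. A minimal sketch on toy values (not taken from the database) shows the two layouts and what happens when the format does not match:

```r
# Toy values mimicking the two layouts handled in this step
dmy <- c("10-10-2012", "14-10-2012")   # day-month-year
ymd <- c("1950-03-25", "1900-01-01")   # year-month-day

parsed_dmy <- as.Date(dmy, format = "%d-%m-%Y")
parsed_ymd <- as.Date(ymd, format = "%Y-%m-%d")

# A format that does not match raises no error here; the result is NA,
# so counting NAs after the conversion is a cheap sanity check
wrong <- as.Date("2012-10-14", format = "%d-%m-%Y")
n_missing <- sum(is.na(c(parsed_dmy, parsed_ymd)))
print(n_missing)
```

Comparing the number of NA values before and after the conversion is one way to catch a column that was assigned to the wrong format group.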


### 3. Generate the new variables
- Time between head CT scan and blood analysis (days)
- Age at the hospital admission date (years)
- Survival days after admission (days)

In [5]:
# Age in completed years between two dates.
# from, to: single-column data frames of Date values (e.g. df['Fecha.de.ingreso'])
agecalc= function(from, to){
    # format() on a data frame returns a data frame of strings
    from_years= format(from,'%Y')
    from_months= format(from,'%m')
    from_days= format(from,'%d')
    
    to_years= format(to,'%Y')
    to_months= format(to,'%m')
    to_days= format(to,'%d')
    
    len= nrow(from_years)
    ages=rep(0,len)
    
    for (idx_date in seq_len(len)) {
        
        from_year= as.numeric(from_years[[1]][idx_date])
        from_month= as.numeric(from_months[[1]][idx_date])
        from_day= as.numeric(from_days[[1]][idx_date])
        
        to_year= as.numeric(to_years[[1]][idx_date])
        to_month= as.numeric(to_months[[1]][idx_date])
        to_day= as.numeric(to_days[[1]][idx_date])
        
        # Subtract one year when the birthday has not been reached yet in the 'to' year
        if (to_month > from_month) ages[idx_date]= to_year - from_year
        if (to_month < from_month) ages[idx_date]= to_year - from_year - 1
        if (to_month == from_month && to_day >= from_day) ages[idx_date]= to_year - from_year
        if (to_month == from_month && to_day < from_day) ages[idx_date]= to_year - from_year - 1
        
        }
    return(ages)
    }
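
The rule applied by `agecalc` (difference in years, minus one when the birthday has not yet been reached) can also be written without a loop. The following is a sketch of a vectorized alternative, not the function used in this Notebook: `age_in_years` is a hypothetical name, and it takes plain Date vectors rather than single-column data frames.

```r
# Hypothetical vectorized alternative to the loop-based helper
age_in_years <- function(birth, ref) {
    year_diff <- as.numeric(format(ref, "%Y")) - as.numeric(format(birth, "%Y"))
    # Month and day packed as MMDD: one comparison tells whether
    # the birthday is still pending in the reference year
    birthday_pending <- as.numeric(format(ref, "%m%d")) < as.numeric(format(birth, "%m%d"))
    year_diff - birthday_pending
}

# Toy dates: day before the birthday, the birthday itself, and later in the year
birth <- as.Date(c("1950-06-15", "1950-06-15", "1950-06-15"))
ref   <- as.Date(c("2012-06-14", "2012-06-15", "2012-12-01"))
ages <- age_in_years(birth, ref)
print(ages)
```

With the database, such a helper would be called on extracted columns, for example `df[['Fecha.de.ingreso']]` instead of `df['Fecha.de.ingreso']`.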

In [6]:
# Patients still alive at the end of the study are coded with the date '1900-01-01'; replace it with the end-of-study date, '2020-06-30'
df['Fecha.de.mortalidad']= with(df, fifelse(Fecha.de.mortalidad == '1900-01-01', as.Date('2020-06-30', format='%Y-%m-%d'), Fecha.de.mortalidad))
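
The cell above relies on `fifelse` from data.table, which keeps the attributes of its inputs. The sketch below, on toy values and using base R only, illustrates why base `ifelse` would be a poor substitute here, and shows an indexed assignment that gives the same replacement without any package.

```r
death <- as.Date(c("2015-02-03", "1900-01-01"))   # toy values
sentinel <- as.Date("1900-01-01")                 # code for "still alive"
study_end <- as.Date("2020-06-30")

# Base ifelse() drops the Date class and returns plain numbers
stripped <- ifelse(death == sentinel, study_end, death)
print(class(stripped))

# Indexed assignment keeps the Date class
fixed <- death
fixed[fixed == sentinel] <- study_end
print(fixed)
```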

In [7]:
# Generate the new variables
df['Time between head CT scan and blood analysis (days)']= df['Fecha.de.TC'] - df['Fecha.de.análisis.de.sangre']
df['Age at the hospital admission date (years)']= agecalc(df['FECHA.DE.NACIMIENTO..día..mes..año.'], df['Fecha.de.ingreso'])
df['Survival days after admission (days)']= df['Fecha.de.mortalidad'] - df['Fecha.de.ingreso']
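
Subtracting two Date values yields a `difftime` object measured in days. A small sketch on toy dates shows the result and how to turn it into plain numbers if that is preferred for later analysis:

```r
ct    <- as.Date(c("2012-10-10", "2012-10-12"))   # toy CT scan dates
blood <- as.Date(c("2012-10-10", "2012-10-11"))   # toy blood analysis dates

gap <- ct - blood                               # difftime object
gap_days <- as.numeric(gap, units = "days")     # plain numeric values
print(gap_days)
```

With the order used above (CT scan date minus blood analysis date), a positive value means that the blood analysis took place before the CT scan.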

In [8]:
# Negative ages (patients born in the hospital) are set to 0 years
df['Age at the hospital admission date (years)'][df['Age at the hospital admission date (years)'] < 0] = 0

# A true survival of 1234 days would be confused with the NA code of the original database (1234), so it is stored as 1236 instead
idx_1234= which(df['Survival days after admission (days)']== 1234)

if (length(idx_1234)> 0){
    df[idx_1234,'Survival days after admission (days)']= 1236
    print(idx_1234)
}

[1] 58
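
The printed index shows that one row was affected. The same collision handling can be sketched on a toy vector; the survival times below are invented for illustration.

```r
na_code <- 1234                   # missing-data code of the original database
survival <- c(12, 1234, 2800)     # toy survival times in days

# Positions where a real value coincides with the missing-data code
collides <- which(survival == na_code)
survival[collides] <- 1236        # same replacement value as in the cell above
print(survival)
```

The price of this approach is a two-day shift for the affected patient, in exchange for keeping 1234 unambiguous as the missing-data code.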


### 4. Anonymize the date values and save the database

In [9]:
df[,c(dates1,dates2)]= 'anonymized'
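
Before saving, it may be worth checking that no date-like string survives in any other column. The following is only a sketch on a toy data frame: the column `note` and its content are hypothetical, and the pattern covers just the two date layouts handled in this Notebook.

```r
toy <- data.frame(
    admission = "anonymized",
    note      = c("stable", "seen on 14-10-2012"),   # a date hidden in free text
    stringsAsFactors = FALSE
)

# Matches dd-mm-yyyy or yyyy-mm-dd anywhere in a value
date_pattern <- "[0-9]{2}-[0-9]{2}-[0-9]{4}|[0-9]{4}-[0-9]{2}-[0-9]{2}"

# TRUE for every column that still holds at least one date-like string
has_dates <- sapply(toy, function(col) any(grepl(date_pattern, as.character(col))))
leaking <- names(has_dates)[has_dates]
print(leaking)
```

Applied to `df`, an empty result would suggest that the six date columns were the only ones carrying dates in these two layouts.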

In [10]:
write.csv(df, 'Databases/ICH_database_anonymized.csv', row.names=FALSE)