In [11]:
library(tidyverse)
library(feather)
library(stringdist)

data_folder_path <- "C:\\Users\\javier\\WorkSpace"


# Load ICD10fi and ICD10who

In [56]:
# load ICD10fi
thl_icd10fi <- read_feather(file.path(data_folder_path,
                                      "ICD10fi",
                                      "THL_ICD10fi.feather") )

In [57]:
# load concept table and get the ICD10who
concept <- read_feather(file.path(data_folder_path,
                                      "OMOP_vocabulary_v5",
                                      "CONCEPT.feather") )
concept_icd10 <- concept  %>% filter(vocabulary_id == "ICD10")


# Study codes

**Summary of ICD10fi coding marks**

The Finnish version of ICD10 has two types of additional codes:

- Classification codes: describe ranges of codes other than the conventional ICD10 hierarchy
    - `Code1-Code2` : from Code1 to Code2
- Reason codes: combine codes to add more information on what caused the diagnosis; there are 4 marks
   - `Code1*Code2` : ”Oirekoodi” (symptom code), Code2 indicates an additional symptom
   - `Code1+Code2` : ”Syykoodi” (reason code), Code2 indicates the reason for Code1
   - `Code1#Code2` : ”ATC-koodi” (ATC code), Code2 is an ATC code indicating the medicine that caused Code1
   - `Code1&Code2` : ”Kasvainkoodi” (tumour code), Code2 is an endocrinological disorder code that caused Code1
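
The marks listed above can be separated from the codes with a regular expression. The following is a minimal sketch in base R, not part of the original analysis: `split_icd10fi_code` is a hypothetical helper and the example strings are illustrative, not taken from the THL table (which already provides a `Code1` column).

```r
# Hypothetical helper (an assumption, not part of the notebook):
# split an ICD10fi CodeId into first code, combination mark, and second code.
split_icd10fi_code <- function(code_id) {
  # Marks from the summary: "-" (range) and "*", "+", "#", "&" (reason codes)
  pattern  <- "^([^-*+#&]+)([-*+#&])(.+)$"
  has_mark <- grepl(pattern, code_id)
  data.frame(
    CodeId = code_id,
    Code1  = ifelse(has_mark, sub(pattern, "\\1", code_id), code_id),
    Mark   = ifelse(has_mark, sub(pattern, "\\2", code_id), NA_character_),
    Code2  = ifelse(has_mark, sub(pattern, "\\3", code_id), NA_character_),
    stringsAsFactors = FALSE
  )
}

# Illustrative strings only
example_codes <- c("A00-A09", "M01.0*A39.8", "G63.2+E11.4", "T88.7#N02BE01", "A09")
parsed <- split_icd10fi_code(example_codes)
parsed
```

Codes without any mark keep their full `CodeId` as `Code1` and get `NA` for `Mark` and `Code2`.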


### How many are classification codes?

In [58]:
thl_icd10fi  %>% nrow

In [129]:
thl_icd10fi  %>% distinct(CodeId)  %>% nrow

In [59]:
thl_icd10fi_classif <- thl_icd10fi  %>% filter( grepl("-", CodeId))

In [60]:
thl_icd10fi_classif  %>% nrow

### How many are a direct match?

In [61]:
# match ICD10fi codes to the ICD10who codes 
thl_icd10fi_match_icd10who <- inner_join(thl_icd10fi,
                concept_icd10 %>% rename(CodeId = concept_code),
                by = "CodeId")


In [62]:
thl_icd10fi_match_icd10who %>% nrow

#### Do these agree in the definition?

In [63]:
# Count the string distance (number of character edits) between the ICD10fi and ICD10who English names
thl_icd10fi_match_icd10who  %>%  select(CodeId, English_name, concept_name, ShortName) %>% 
        mutate(same_name = stringdist(English_name,concept_name) )  %>% 
        count(same_name, sort = T) #%>% head(10)

same_name,n
<dbl>,<int>
0,8463
,226
2,150
3,149
20,114
4,91
1,84
7,32
17,26
10,20


Most names (8463) match exactly; the 226 NA values are codes with no English name in ICD10fi.
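
To illustrate where the `0` and `NA` rows come from, here is a small sketch on toy names. It uses base R's `adist()` (Levenshtein distance) as a stand-in for `stringdist()`, so the numbers for non-identical names may differ from the notebook's (the `stringdist` default method also counts transpositions of adjacent characters as one edit). The names below are made up for the example.

```r
# Toy names, not taken from the THL or OMOP tables
fi_name  <- c("Cholera", NA, "Paratrachoma")
who_name <- c("Cholera", "Cholera", "Chlamydial conjunctivitis")

# adist() returns a matrix over all pairs; the diagonal holds the
# row-wise distances fi_name[i] vs who_name[i]
name_dist <- diag(adist(fi_name, who_name))

# Identical names give 0, a missing name gives NA
table(name_dist, useNA = "ifany")
```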

#### Why the large differences?

In [64]:
# List the codes whose ICD10fi and ICD10who English names differ by a string distance greater than 21
thl_icd10fi_match_icd10who  %>%  select(CodeId, English_name, concept_name, ShortName) %>% 
        mutate(same_name = stringdist(English_name,concept_name) )  %>% 
        rename(ICD10fi_name = English_name, ICD10who_name = concept_name) %>% 
        filter(same_name>21)

CodeId,ICD10fi_name,ICD10who_name,ShortName,same_name
<chr>,<chr>,<chr>,<chr>,<dbl>
A09,Diarrhoea and gastroenteritis of presumed infectious origin,Other gastroenteritis and colitis of infectious and unspecified origin,Tartt. olet. ripuli/g-e-iitti,40
A49.1,"Streptococcal infection, unspecified","Streptococcal and enterococcal infection, unspecified site","Streptokokki, sij ei-määr",22
A74.0,Paratrachoma,Chlamydial conjunctivitis,Paratrakooma,22
B96.5,Pseudomonas (aeruginosa)(mallei)(pseudomallei) as the cause of diseases classified to other chapters,Pseudomonas (aeruginosa) as the cause of diseases classified to other chapters,Pseudomonakset sair. aih.,22
C81.1,Nodular sclerosis,Nodular sclerosis classical Hodgkin lymphoma,Sidekudoskyhmyinen Hodgkinin t,27
C81.2,Mixed cellularity,Mixed cellularity classical Hodgkin lymphoma,Sekasoluinen Hodgkinin tauti,27
C81.3,Lymphocytic depletion,Lymphocyte depleted classical Hodgkin lymphoma,Vähälymfosyyttinen Hodgkinin t,28
C82,Follicular [nodular] non-Hodgkin's lymphoma,Follicular lymphoma,Nodul. non-Hodgkin-lymfooma,24
C83.6,Undifferentiated (diffuse),Undifferentiated (diffuse) non-Hodgkin's lymphoma,"Diff.non-Hodgk,erilaistum.solu",23
C83.9,"Diffuse non-Hodgkin's lymphoma, unspecified","Non-follicular (diffuse) lymphoma, unspecified",Diff.non-Hodgkin-lymfooma NAS,22


The concepts seem to be the same, only described in a different way.

### How many more match if only the diagnosis code is taken?

In [85]:
# match ICD10fi's diagnosis codes (Code1) to the ICD10who codes
matched_ids <- c(thl_icd10fi_classif$CodeId, thl_icd10fi_match_icd10who$CodeId)

thl_icd10fi_code1_match_icd10who <- inner_join( thl_icd10fi  %>% filter(!(CodeId %in% matched_ids)) , 
                concept_icd10 %>% rename(Code1 = concept_code),
                by = "Code1")


In [87]:
thl_icd10fi_code1_match_icd10who  %>% nrow

### How many ICD10fi diagnosis codes do not exist in ICD10who?

In [88]:
matched_ids <- c(thl_icd10fi_classif$CodeId, thl_icd10fi_match_icd10who$CodeId, thl_icd10fi_code1_match_icd10who$CodeId)

thl_icd10fi_new <- thl_icd10fi  %>% filter(!(CodeId %in% matched_ids)) 

In [91]:
# which non-composite codes are new
thl_icd10fi_new %>%
select(CodeId, English_name, ShortName)

CodeId,English_name,ShortName
<chr>,<chr>,<chr>
A28.10,,Kissanraapaisut suun al. ilm.
A28.11,,Kaulan imus. kissanraapaisut.
A28.19,,Muu tai määr. kissanraapaisut.
A31.80,,M.bact. intr.cell.tulehd.suus.
A31.81,,M.bact. chelonei tulehd. suu
A31.84,,Muu tai määr. mykobakt. aih
A31.89,Other mycobacterial infections,Muu mykobakt infektio
A50.50,,Kupan aih. Parrotin uurteet
A50.51,,Hutchinsonin etuhampaat
A50.52,,Nuppuhammas Mulberryn molaari


In [92]:
# which new codes sit at a higher hierarchy level (CodeId shorter than 6 characters)
thl_icd10fi_new  %>% 
select(CodeId, English_name, ShortName) %>% 
mutate(group = nchar(CodeId)) %>% 
filter(group < 6)

CodeId,English_name,ShortName,group
<chr>,<chr>,<chr>,<int>
B07.9,Verruca simplex,Tavallinen syylä,5
F61.0,Mixed personality disorders,Sekamuotoiset persoonallisuushäiriöt,5
F61.1,,Häiritsevä persoonallisuusmuutos,5
W71,Drowning to ice,Vajoaminen jäihin,3
X85.0,"Assault by drugs, medicaments and biological substances by spouse or partner",Pahoinpitely lääkeaineilla.0,5
X85.2,"Assault by drugs, medicaments and biological substances by friend",Pahoinpitely lääkeaineilla.2,5
X85.8,"Assault by drugs, medicaments and biological substances by other known person",Pahoinpitely lääkeaineilla.8,5
X85.9,"Assault by drugs, medicaments and biological substances by unknown person",Pahoinpitely lääkeaineilla.9,5
X90.0,Assault by unspecified chemical or noxious substance by spouse or partner,Pahoinpit.kem.aineilla NAS.0,5
X90.2,Assault by unspecified chemical or noxious substance by friend,Pahoinpit.kem.aineilla NAS.2,5


In [94]:
# which chapters (first letter of the code) have the most new codes
thl_icd10fi_new  %>% 
select(CodeId, English_name, ShortName) %>% 
mutate(group = str_sub(CodeId, 0, 1)) %>% count(group, sort = T)

group,n
<chr>,<int>
C,770
F,497
Q,348
K,313
D,130
I,105
E,89
G,62
H,50
X,48


# Proposed matching  

1. Ignore the new ICD10fi classification codes; these are not supposed to be used as diagnoses
2. Match ICD10fi to ICD10who based only on the diagnosis code `Code1`
3. Match the new ICD10fi codes that do not exist in ICD10who to their parent code
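
Step 3 can be sketched on toy data as follows. This is a base R illustration under stated assumptions: `parent_code` is a hypothetical helper, and `who_codes` is a made-up stand-in for `concept_icd10$concept_code`.

```r
# Hypothetical helper: drop the last n characters of a code,
# then remove a trailing dot if one is left
parent_code <- function(code, n = 1) {
  shortened <- substr(code, 1, nchar(code) - n)
  sub("\\.$", "", shortened)
}

# Toy stand-in for the ICD10who codes (an assumption for this example)
who_codes <- c("A28.1", "A31.8", "X85", "B99")

# Toy "new" ICD10fi codes with no direct match
new_codes <- c("A28.10", "A31.84", "X85.0", "B99.12")

# One level up: the parent code
parents      <- parent_code(new_codes, n = 1)
found_parent <- parents %in% who_codes

# Two levels up, only for codes whose parent was not found
grandparents <- parent_code(new_codes[!found_parent], n = 2)
found_grandparent <- grandparents %in% who_codes
```

Removing the trailing dot matters for codes such as `X85.0`, where dropping one character leaves `X85.` rather than the three-character code `X85`.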

In [122]:
# ICD10fi codes that match an ICD10who code exactly
thl_icd10fi_clas <- thl_icd10fi  %>% 
    filter( CodeId %in% concept_icd10$concept_code)  %>% 
    mutate( ICD10who = CodeId , ICD10who_match_level = 0 )

In [150]:
# match diagnosis codes to the parent (one level up) code
thl_icd10fi_new_1 <- inner_join( thl_icd10fi_new  %>% 
                                # remove the last character of the code 
                                mutate(ICD10who = str_sub(Code1, 0, -2))%>% 
                                # remove the trailing dot, if one is left
                                mutate(ICD10who = sub("\\.$", "", ICD10who)) , 
                concept_icd10 %>% rename(ICD10who = concept_code),
                by = "ICD10who")


In [151]:
thl_icd10fi_new_1  %>%  nrow

In [161]:
# match the remaining diagnosis codes to the grandparent (two levels up) code
# note: left_join keeps every remaining code, whether or not the shortened code exists in ICD10who
thl_icd10fi_new_2 <- left_join( thl_icd10fi_new  %>% filter( !(CodeId %in% thl_icd10fi_new_1$CodeId)) %>%  
                                # remove the last two characters of the code 
                                mutate(ICD10who = str_sub(Code1, 0, -3))%>% 
                                # remove the trailing dot, if one is left
                                mutate(ICD10who = sub("\\.$", "", ICD10who)) , 
                concept_icd10 %>% rename(ICD10who = concept_code),
                by = "ICD10who")

In [162]:
thl_icd10fi_new_2 %>%  nrow

In [173]:
# Join all 
thl_icd10fi_matched <- bind_rows(
    # classification codes
    thl_icd10fi  %>% filter( CodeId %in%thl_icd10fi_classif$CodeId)  %>% 
    mutate(ICD10who = as.character(NA), ICD10who_match_level = "classification" ), 
    # perfect matches
    thl_icd10fi  %>% filter( CodeId %in%thl_icd10fi_match_icd10who$CodeId)  %>% 
    mutate(ICD10who = CodeId, ICD10who_match_level = "full_match" ), 
    # matched on the diagnosis code (Code1)
    thl_icd10fi  %>% filter( CodeId %in%thl_icd10fi_code1_match_icd10who$CodeId)  %>% 
    mutate(ICD10who = Code1, ICD10who_match_level = "diagnose_match" ),
    # matched to the parent code
    thl_icd10fi_new_1 %>% select(CodeId:ICD10who)%>% 
    mutate(ICD10who_match_level = "diagnose_match_parent" ),
    # matched to the grandparent code
    thl_icd10fi_new_2 %>% select(CodeId:ICD10who)%>% 
    mutate(ICD10who_match_level =  "diagnose_match_grandparent" )
)%>% arrange(CodeId)  %>% 
mutate(ICD10who_match_level = factor(ICD10who_match_level, 
                                     levels = c("classification", 
                                                "full_match", 
                                                "diagnose_match",
                                                "diagnose_match_parent", 
                                                "diagnose_match_grandparent")
                                    )
      )

In [174]:
thl_icd10fi_matched  %>% count(ICD10who_match_level)

ICD10who_match_level,n
<fct>,<int>
classification,298
full_match,9587
diagnose_match,2105
diagnose_match_parent,2587
diagnose_match_grandparent,104
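
As a quick reading aid, the counts above can be turned into shares of the whole table. The sketch below simply copies the five counts from the output; the variable names are illustrative and not part of the original notebook.

```r
# Counts copied from the table above
match_counts <- c(
  classification             = 298,
  full_match                 = 9587,
  diagnose_match             = 2105,
  diagnose_match_parent      = 2587,
  diagnose_match_grandparent = 104
)

# Total number of rows and the percentage share of each match level
total_codes  <- sum(match_counts)
match_shares <- round(100 * match_counts / total_codes, 1)
match_shares
```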


In [175]:
# save the ICD10fi table with its ICD10who matches
write_feather(thl_icd10fi_matched, file.path(data_folder_path,
                                      "ICD10fi",
                                      "THL_ICD10fi_matched_ICD10who.feather") )