# Which SNOMED CT codes for HbA1c are useful?

The purpose of this notebook is to figure out which of the SNOMED CT codes that relate to HbA1c are useful for the study of HbA1c values.

The motivation is that a simple study of the numeric values returned by querying the [OpenSAFELY list of HbA1c SNOMED CT codes](https://www.opencodelists.org/codelist/opensafely/glycated-haemoglobin-hba1c-tests-numerical-value/) gives some strange results. One would typically expect values between 40 and 80 mmol/mol, which correspond to approximately 5.8% to 9.5% in the old percentage units. Instead, I am seeing values in the tens of thousands and negative values.

In this notebook, I want to see whether the problems arise from particular SNOMED CT codes, and whether my script to identify and convert the old percentage units into mmol/mol is performing correctly.

(Note: The formula for converting the old percentage units to mmol/mol can be found [here](https://ebmcalc.com/GlycemicAssessment.htm).)
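
For reference, the conversion applied in the code cells of this notebook uses the constants 2.15 and 10.929. It can be written as below, where the symbol names are only illustrative and $x_{\%}$ denotes a value in the old percentage units.

$$
x_{\text{mmol/mol}} = \left( x_{\%} - 2.15 \right) \times 10.929
\qquad\Longleftrightarrow\qquad
x_{\%} = \frac{x_{\text{mmol/mol}}}{10.929} + 2.15
$$

As a check on the expected range, 40 mmol/mol corresponds to $40 / 10.929 + 2.15 \approx 5.81\%$ and 80 mmol/mol corresponds to $80 / 10.929 + 2.15 \approx 9.47\%$.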

### Set up.

In [44]:
if( !"pacman" %in% installed.packages() ){ install.packages( "pacman" ) }
pacman::p_load(
    bigrquery
    ,tidyverse
    )

# Setup connection to GCP.
project_id = "yhcr-prd-bradfor-bia-core"
con <- DBI::dbConnect( drv = bigquery(), project = project_id ) %>% suppressWarnings()

# Define R tibbles from GCP tables.
r_tbl_srcode <- dplyr::tbl( con, "CB_FDM_PrimaryCare.tbl_srcode" )

# Define the list of SNOMED CT codes.
codes_SNOMED_test_of_interest <-
    readr::read_csv(file = paste0( 'codelists/', 'opensafely-glycated-haemoglobin-hba1c-tests-3e5b1269.csv' ),
                    col_types = cols( code = col_character(), term = col_character() ) )$code

### Distributional summary statistics without transformation to mmol/mol.

In [5]:
# Provide descriptive statistics of each SNOMED code.
r_tbl_srcode %>%
dplyr::select( person_id, dateevent, numericvalue, snomedcode ) %>%
dplyr::filter( snomedcode %in% codes_SNOMED_test_of_interest ) %>%
dplyr::collect() %>%
dplyr::group_by( snomedcode ) %>%
dplyr::summarise(
    n = n()
    ,max = max( numericvalue %>% as.numeric() )
    ,mean = mean( numericvalue %>% as.numeric() )
    ,median = numericvalue %>% as.numeric() %>% quantile( 0.5 )
    ,min = min( numericvalue %>% as.numeric() )
) %>%
dplyr::ungroup()

snomedcode,n,max,mean,median,min
<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
1003671000000109,314378,27300,5.308858761,6.5,-8.9
1049301000000100,4480,144,42.781785714,41.0,0.0
365845005,2612,0,-0.005359877,0.0,-1.0
999791000000106,3020974,85000,48.637394356,42.0,-1.0


### Distributional summary statistics _with_ transformation to mmol/mol.

In [6]:
# Provide descriptive statistics of each SNOMED code, after converting values that contain a decimal point from % to mmol/mol.
r_tbl_srcode %>%
dplyr::select( person_id, dateevent, numericvalue, snomedcode ) %>%
dplyr::filter( snomedcode %in% codes_SNOMED_test_of_interest ) %>%
dplyr::mutate(
        numericvalue = 
            dplyr::if_else(
                sql("REGEXP_CONTAINS( numericvalue, r'\\.')" )
                ,( numericvalue %>% as.numeric() %>% `-`( 2.15 ) ) %>% `*`( 10.929 ) %>% as.character()
                ,numericvalue
            )
    ) %>%
dplyr::collect() %>%
dplyr::group_by( snomedcode ) %>%
dplyr::summarise(
    n = n()
    ,max = max( numericvalue %>% as.numeric() )
    ,mean = mean( numericvalue %>% as.numeric() )
    ,median = numericvalue %>% as.numeric() %>% quantile( 0.5 )
    ,min = min( numericvalue %>% as.numeric() )
) %>%
dplyr::ungroup()

snomedcode,n,max,mean,median,min
<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
1003671000000109,314378,27300.0,37.503532839,44.26245,-120.76545
1049301000000100,4480,442.0781,44.285178549,41.0,0.0
365845005,2612,0.0,-0.005359877,0.0,-1.0
999791000000106,3020974,85000.0,49.805432963,42.0,-22.09844


### Discussion.

Firstly, I note that only four of the eleven SNOMED CT codes in the list are ever used in the Connected Bradford dataset.

Secondly, I conclude that:  
1. `1003671000000109` looks like old percentage values that the transformation corrects, but there are spuriously low initial values (e.g. -8.9) and spuriously high initial values (e.g. 27,300).
    - ACTION: Keep, transform, bandpass-filter between 30 and some reasonable maximum.
2. `1049301000000100` looks like the standard mmol/mol value, but the transformation is erroneously applied to it whenever a recorded value contains a decimal point.
    - ACTION: Keep, do not transform. There is no need to filter, but I will apply the filter anyway so that everything is bounded in the same range.
3. `365845005` is nonsense: its values all lie between -1 and 0.
    - ACTION: Remove.
4. `999791000000106`
    - ACTION: Keep, transform, bandpass-filter between 30 and some reasonable maximum.
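
To illustrate point 2, here is a minimal local sketch of the decimal-point heuristic used in the transformation cell above. The helper name `convert_if_decimal` and the input values are illustrative assumptions and are not part of the analysis. It mimics the BigQuery `REGEXP_CONTAINS` test with base R's `grepl()`.

```r
# Local sketch of the decimal-point heuristic (hypothetical helper, toy inputs).
convert_if_decimal <- function( x ) {
    value <- as.numeric( x )
    # Treat any value written with a decimal point as an old percentage value.
    has_decimal_point <- grepl( ".", x, fixed = TRUE )
    ifelse( has_decimal_point, ( value - 2.15 ) * 10.929, value )
}

# Toy inputs: a percentage, an integer mmol/mol value, a decimal mmol/mol value and a negative value.
converted <- convert_if_decimal( c( "6.5", "42", "42.6", "-8.9" ) )
print( converted )
```

Under this heuristic, the mmol/mol-looking input `"42.6"` is converted to about 442.08, which is consistent with the maximum of 442.0781 reported for `1049301000000100`. Likewise, `"-8.9"` becomes about -120.77, which is consistent with the minimum reported for `1003671000000109`.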

### Apply new rules and check.

In [43]:
# Provide descriptive statistics of each SNOMED code, after applying the per-code rules and the bandpass filter.
r_tbl_srcode %>%
dplyr::select( person_id, dateevent, numericvalue, snomedcode ) %>%
dplyr::filter( snomedcode %in% codes_SNOMED_test_of_interest ) %>%
# Apply transformations, where appropriate
dplyr::mutate( numericvalue = numericvalue %>% as.numeric() ) %>%
dplyr::mutate(
        numericvalue = 
            dplyr::case_when(
               snomedcode %in%
                    c( '1003671000000109'
                      ,'999791000000106' ) ~ numericvalue %>%
                                             `-`( 2.15 ) %>%
                                             `*`( 10.929 ) %>%
                                             round()
                ,snomedcode == '1049301000000100' ~ numericvalue
                ,snomedcode == '365845005' ~ NA_real_
                # Any other code is unexpected, so treat its value as missing.
                ,TRUE ~ NA_real_
            )
    ) %>%
# Apply bandpass filter.
dplyr::mutate(
    numericvalue = 
        dplyr::if_else(
            numericvalue < 30 | numericvalue > 100
            ,NA
            ,numericvalue )
) %>%
# Collect and summarise.
dplyr::collect() %>%
tidyr::drop_na() %>%
dplyr::group_by( snomedcode ) %>%
dplyr::summarise(
    n = n()
    ,max = max( numericvalue )
    ,mean = mean( numericvalue )
    ,median = numericvalue %>% quantile( 0.5 )
    ,min = min( numericvalue )
) %>%
dplyr::ungroup()

snomedcode,n,max,mean,median,min
<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
1003671000000109,54,98,52.24074,47,30
1049301000000100,3648,100,48.77599,43,30
999791000000106,807219,100,46.964,42,30


The new rules give sensible results. I will incorporate them into the `RESHAPE_cohort_generator.r` script.