# Extracting prescription quantity and frequency.
The purpose of this notebook is to develop a script to extract prescription quantity and frequency from the free-text field `medicationdosage` in the `tbl_srprimarycaremedication` table.

Some pharmacy lingo will be needed. The initialisms below might be separated by periods, `.`, or slashes, `/`, and are sometimes capitalised. They can also be concatenated without spaces, e.g. achs = ac + hs = before meals and at bedtime.

| Initialism | Meaning |
| ------- | ------- |
| a | ante, before |
| ac | ante cibum (before food) |
| ad lib | as desired |
| am | in the morning |
| aq | aqua, take with water |
| bib | drink |
| bd, bds, bid | bis die (twice daily) |
| et | and |
| c\* | with \* (where \* stands for the other thing) |
| cap | capsule |
| h | hour |
| hs | bedtime or half-strength |
| i, ii, iii, ... | 1, 2, 3, etc. tablets |
| id | intradermal, inject into the skin |
| im | intramuscular, inject into muscle |
| in | intranasal, through the nose |
| inh | inhaler |
| inj | injection |
| neb | nebuliser, inhalant medication |
| nbm, npo | nil by mouth |
| nocte, noct | at night |
| od | omni die (every day) |
| om | omni mane (every morning) |
| on | omni nocte (every night) |
| p | post, after |
| pc | post cibum (after food) |
| pm | in the evening |
| pn | per neb, by nebuliser |
| po | per os, orally, by mouth |
| pr | rectally |
| prn | pro re nata (when required) |
| q\* | every \* (where \* is a wildcard for a timepoint or interval, e.g. q2h = every 2 hours, q1d = qd = every day, qam = every morning) |
| qds, qid | quater die sumendum (to be taken four times daily) |
| qod | every other day |
| qs | a sufficient amount |
| qqh | quarta quaque hora (every four hours) |
| s\* | sans\*, without\* (where \* stands for the other thing) |
| sc | subcutaneous, inject under the skin |
| stat | immediately |
| syr | syrup |
| tab | tablet |
| tds, tid | ter die sumendum (to be taken three times daily) |
| u | unit |
| ud | as directed |
| w\* | with \* (where \* stands for the other thing) |
| wf | with food |
| x\* | if \* is a number, then \*-many tablets. If \* is a number and/or letter, then \*-many tablets per interval (e.g. d = day) or \*-many intervals |

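The normalisation described above (ignore capitalisation, periods and slashes) can be sketched in base R as follows. This is a minimal illustration rather than the notebook's method: the helper names `normalise_initialism()` and `lookup_frequency()` and the `daily_frequency` lookup are hypothetical, and the lookup only covers the daily-frequency initialisms listed in the table.

```r
# Strip periods and slashes and lower-case the text, so that "B.D.", "b/d"
# and "bd" all compare equal.
normalise_initialism <- function(x) {
  gsub("[./]", "", tolower(x))
}

# Hypothetical lookup of doses per day, taken from the table above.
daily_frequency <- c(
  od = 1, om = 1, on = 1,
  bd = 2, bds = 2, bid = 2,
  tds = 3, tid = 3,
  qds = 4, qid = 4
)

# Return the doses per day, or NA when the initialism is not a frequency.
lookup_frequency <- function(x) {
  unname(daily_frequency[normalise_initialism(x)])
}

print(lookup_frequency(c("B.D.", "t/d/s", "QDS", "prn")))  # 2 3 4 NA
```

Concatenated initialisms such as achs are not split by this sketch; they would need a separate step.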

Sources:

- https://bnf.nice.org.uk/about/abbreviations-and-symbols/
- https://www.nhs.uk/nhs-app/nhs-app-help-and-support/health-records-in-the-nhs-app/abbreviations-commonly-found-in-medical-records/
- https://www.drugs.com/article/prescription-abbreviations.html#
- https://chartercollege.edu/news-hub/72-abbreviations-every-pharmacy-tech-needs-know/


## Strategy

### Morning, evening, before bed, etc.
I want a total count of tablets per day so I need to combine the morning, evening, etc. prescriptions. My plan is to parse these prescriptions into separate columns that I will later sum:

| n_morning | n_evening | n_night | n_bedtime | n_mealtime\* |
| --------- | --------- | ------- | --------- | ------------ |
| ...       | ...       | ...     | ...       | ...          |

\*`n_mealtime` will include "before food", "after food", and "with food".

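The summing step can be sketched in base R once the counts have been parsed into the columns above. The values below are invented purely for illustration, and the `n_daily` column name is hypothetical. Treating a missing count as zero (`na.rm = TRUE`) is an assumption, not something the notebook specifies.

```r
# Invented counts for three prescriptions, one row per prescription.
doses <- data.frame(
  n_morning  = c(1, 2, 0),
  n_evening  = c(0, 0, 0),
  n_night    = c(0, 1, 0),
  n_bedtime  = c(0, 0, 0),
  n_mealtime = c(0, 0, 4)
)

count_cols <- c("n_morning", "n_evening", "n_night", "n_bedtime", "n_mealtime")

# Total tablets per day is the row-wise sum of the time-of-day counts.
# Assumption: a missing count contributes nothing to the total.
doses$n_daily <- rowSums(doses[, count_cols], na.rm = TRUE)

print(doses$n_daily)
```

Here the second row stands for "2 in the morning and 1 at night", giving 3 tablets per day.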

### Units vs tablets.
I will treat a tablet as a unit so that there is a common language.



Install packages.

In [8]:
#######################
## Install packages. ##
#######################
library( pacman )
pacman::p_load( bigrquery, tidyverse, readr )
bq_auth()

Set and load requisites.

In [9]:
##############################
## Set and load requisites. ##
##############################

# Setup connection to GCP.
project_id <- "yhcr-prd-phm-bia-core"
con <- DBI::dbConnect( drv = bigquery(), project = project_id )

# Define R tibbles from GCP tables.
r_tbl_BNF_DMD_SNOMED_lkp <- dplyr::tbl( con, "CB_LOOKUPS.tbl_BNF_DMD_SNOMED_lkp" )
r_tbl_srprimarycaremedication <- dplyr::tbl( con, "CB_FDM_PrimaryCare_V7.tbl_srprimarycaremedication" )

# Clinical code lists (BNF, SNOMED-CT, etc).
codes_BNF_meds_of_interest <-
    readr::read_csv(file = 'ciaranmci-bnf-section-61-drugs-for-diabetes-207573b7.csv',
                    col_types = cols( code = col_character(), term = col_character() ) )$code
names_meds_of_interest <-
    r_tbl_BNF_DMD_SNOMED_lkp %>%
    dplyr::filter( BNF_Code %in% codes_BNF_meds_of_interest ) %>%
    dplyr::select( DMplusD_ProductDescription )

Apply transforms to the free-text in the `medicationdosage` column to get the contents into a machine-readable format.

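A caution for anyone testing these patterns on a local character vector rather than through the database connection: `stringr::str_replace()` rewrites only the first match in each string, while `stringr::str_replace_all()` rewrites every match. The results table for this cell contains rows with two `QUANT` tags in one string, which suggests that the database-side translation replaces every match. That has not been verified here, so treat it as an assumption. The dosage string below is invented for illustration.

```r
library(stringr)

dosage <- "take two in the morning and two at night"

# str_replace() rewrites only the first match in each string.
first_only <- str_replace(dosage, "two", "QUANT = (2)")

# str_replace_all() rewrites every match.
every_match <- str_replace_all(dosage, "two", "QUANT = (2)")

print(first_only)   # "take QUANT = (2) in the morning and two at night"
print(every_match)  # "take QUANT = (2) in the morning and QUANT = (2) at night"
```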
In [19]:
n_row_display <- 200

qry <- r_tbl_srprimarycaremedication %>%
# Select every record that has a prescription for any diabetes medication.
dplyr::inner_join( names_meds_of_interest, by = join_by( nameofmedication == DMplusD_ProductDescription ) ) %>%
# Filter out all 'as directed' dosages.
dplyr::filter( sql( 'NOT CONTAINS_SUBSTR(medicationdosage, \'directed\' )' ) ) %>%
# Filter for tablets and capsules, only.
dplyr::filter( medicationquantity %LIKE% '%tablet%' %OR% medicationquantity %LIKE% '%capsule%' ) %>%
# Select unique `medicationdosage` values, only.
dplyr::select( medicationdosage ) %>%
dplyr::distinct() %>%

#####################
## Transformations ##
#####################
###dplyr::mutate_at(                      # Use dplyr::mutate_at() if I am replacing the column. Otherwise, use dplyr::mutate().
###    .vars = vars( medicationdosage )
###    ,funs(
dplyr::mutate(
    transformed = 
        medicationdosage %>%
        # Transform X: Remove periods, `.`.
        ###stringr::str_replace(., "\\.",              "" ) %>%
        stringr::str_replace("\\.",              "" ) %>%
        
        # Transform X: Make everything lower case.
        stringr::str_to_lower() %>% # `tolower()` is a base R function; the stringr equivalent is `str_to_lower()`.
        
        # Transform X: "and " to +.
        stringr::str_replace( "\\ (and)|\\&\\ ",               " + " ) %>%
        
        # Transform X: Tidy away typos.
        stringr::str_replace( "(ty)[a-z]",              "t" ) %>% # Still experimental. This is trying to fix typos like "tywo" instead of "two".
        stringr::str_replace( "(afte)r?",                "" ) %>%
        stringr::str_replace( "(befor)e?",               "" ) %>%
        stringr::str_replace( "(me)[a-z]?(al)s?",        "" ) %>%
        stringr::str_replace( "\\ (ime)",                  "" ) %>%
        stringr::str_replace( "\\ (-time)",                "" ) %>%
        stringr::str_replace( "\\ (time)\\ ",              "" ) %>%
        stringr::str_replace( "(eve)[a-z](ing)",              "" ) %>%
        stringr::str_replace( "(m[a-z]?orn?ing)",              "" ) %>%
                
        # Transform X: Handle special case of "for one week".
        stringr::str_replace( "(for)?\\ (one)?1?\\ (week)s?",         " INTERVAL = (1) " ) %>% # Ideally, this set of transforms would handle all the "week1", "week 2", etc. instructions.
        stringr::str_replace( "(for)?\\ (two)?2?\\ (week)s?",         " INTERVAL = (2) " ) %>% 
        stringr::str_replace( "(for)?\\ (three)?3?\\ (week)s?",       " INTERVAL = (3) " ) %>% 
        stringr::str_replace( "(for)?\\ (four)?4?\\ (week)s?",        " INTERVAL = (4) " ) %>% 
        
        # Transform X: Homogenise intervals, e.g. hourly, daily.
        stringr::str_replace( "(a day)",                     "daily" ) %>%
        stringr::str_replace( "(aday)",                      "daily" ) %>%
        stringr::str_replace( "(each day)",                "daily" ) %>%
        stringr::str_replace( "(per day)",                "daily" ) %>%
        stringr::str_replace( "/\\ ?(day)",                "daily" ) %>%
        stringr::str_replace( "(every day)",               "daily" ) %>%
        stringr::str_replace( "(^|\\ )(od)\\ ",               "daily" ) %>%
        
        # Transform X: Remove unnecessary 'tablet'.
        stringr::str_replace( "\\ ?(tab)(le[a-z]?t)?s?",       "" ) %>%
        
        # Transform X: Remove masses given in grams so that the numbers don't get interpreted as counts, later.
        stringr::str_replace( "[0-9]+\\ g(\\ |$)",       "" ) %>% # The alternation is grouped so that `$` applies only to the trailing delimiter.
    
        # Transform X: Remove unnecessary 'take' and its typos.
        stringr::str_replace( "(ta)[a-z]e?(ing)?n?",               "" ) %>%
    
        # Transform X: Remove unnecessary ' to be '.
        stringr::str_replace( "\\ (to be)\\ ",               "" ) %>%
        
        # Transform X: Remove unnecessary 'use '.
        stringr::str_replace( "(use)[0-9]?\\ ",             "" ) %>%
        
        # Transform X: Ignore 'working up to' and its variants.
        #stringr::str_replace( "work(ing)? up to ",  "" ) %>% # This might be important information, so I'll leave it in for now.
        #stringr::str_replace( "wean up gently to",  "" ) %>%
        #stringr::str_replace( "work up slowly to",  "" ) %>%
        # (up)\\ ?(titrate)
        
        # Transform X: Ignore notions of 'with'.
        stringr::str_replace( "(with)/?\\ ?(the)?", "" ) %>%
        stringr::str_replace( "(w/)\\ ",                "" ) %>%
        stringr::str_replace( "\\ c\\ ",                "" ) %>%
        stringr::str_replace( "\\ ?w\\ ",             "" ) %>%
        
        # Transform X: Ignore 'as needed' and similar.
        stringr::str_replace( "(as needed)",          "" ) %>%
        stringr::str_replace( "(when needed)",        "" ) %>%
        stringr::str_replace( "(as necessary)",       "" ) %>%
        stringr::str_replace( "(when necessary)",     "" ) %>%
        stringr::str_replace( "(as required)",        "" ) %>%
        stringr::str_replace( "(when required)",      "" ) %>%
        stringr::str_replace( "(when req)",           "" ) %>%
        stringr::str_replace( "(as desired)",         "" ) %>%
        stringr::str_replace( "(when desired)",       "" ) %>%
        stringr::str_replace( "(ad lib)",             "" ) %>%
        stringr::str_replace( "( qs )",               "" ) %>%
        
        # Transform X: Fix typos.
        stringr::str_replace( "!",                  "1" ) %>%
        
        # Transform X: Convert number words to digits.
        stringr::str_replace( "(1/2)|(half)|(every alternate day)|(alt days)|(every other day)",         " QUANT = (0.5) " ) %>%
        stringr::str_replace( "(one)|(\\ |^)i\\ |(whole)",                     " QUANT = (1) " ) %>%
        stringr::str_replace( "(two)|(\\ |^)ii\\ ",                     " QUANT = (2) " ) %>%
        stringr::str_replace( "(three)|(\\ |^)iii\\ ",                   " QUANT = (3) " ) %>%
        stringr::str_replace( "(four)|(\\ |^)iv\\ ",                    " QUANT = (4) " ) %>% # I hope patients don't take more than 4 of the same tablet in a day.
        
        # Transform X: Remove leading 'x' in front of digits.
        stringr::str_replace( "x\\ ?1",                    " QUANT = (1) " ) %>%
        stringr::str_replace( "x\\ ?2",                    " QUANT = (2) " ) %>%
        stringr::str_replace( "x\\ ?3",                    " QUANT = (3) " ) %>%
        stringr::str_replace( "x\\ ?4",                    " QUANT = (4) " ) %>%
        
        # Transform X: Leading quantities.
        stringr::str_replace( "1\\ ",                     " QUANT = (1) " ) %>%
        stringr::str_replace( "2\\ ",                     " QUANT = (2) " ) %>%
        stringr::str_replace( "3\\ ",                     " QUANT = (3) " ) %>%
        stringr::str_replace( "4\\ ",                     " QUANT = (4) " ) %>%
        stringr::str_replace( "5\\ ",                     " QUANT = (5) " ) %>%
        
        # Transform X: Convert multiplier words to digits. I assume a daily base for my frequency, e.g. daily = 1.
        stringr::str_replace( "\\ ?(times)",                              "" ) %>%
        stringr::str_replace( "(twice)|(2x)|((\\ x)?(bd)s?\\ ?)",           " FREQ = (2) " ) %>%
        stringr::str_replace( "(once)|(daily)|(\\ x?(od)s?\\ ?)",       " FREQ = (1) " ) %>% # The preceding space is not optional because, otherwise, we transform "blood" to "blo FREQ = (1)".
        stringr::str_replace( "\\ ?x?[0-9]?(td)s?\\ ?",                   " FREQ = (3) " ) %>%
        stringr::str_replace( "(every 8 hours)",                          " FREQ = (3) " ) %>%
        stringr::str_replace( "\\ ?x?[0-9]?(qd)s?\\ ?",                   " FREQ = (4) " ) %>%
        
        # Transform X: Convert time-of-day instructions to capitalised Latin initialisms, or Latin words.
        stringr::str_replace( "(in the morning)",          " WHEN = (A.M.) " ) %>%
        stringr::str_replace( "(at )?(morning)[a-z]?",     " WHEN = (A.M.) " ) %>%
        stringr::str_replace( "\\ [0-9]?(o.m.)(\\ |$)",    " WHEN = (A.M.) " ) %>%
        stringr::str_replace( "\\ [0-9]?(om)(\\ |$)",      " WHEN = (A.M.) " ) %>%
        stringr::str_replace( "(\\ |^[0-9])?(am)(\\ |$)",  " WHEN = (A.M.) " ) %>%
        stringr::str_replace( "(in the evening)",          " WHEN = (P.M.) " ) %>%
        stringr::str_replace( "(midday)",                  " WHEN = (P.M.) " ) %>%
        stringr::str_replace( "(evening)[a-z]?",           " WHEN = (P.M.) " ) %>%
        
        stringr::str_replace( "\\ ?(pm)$?",                " WHEN = (P.M.) " ) %>%
        stringr::str_replace( "(noct)e?",                  " WHEN = (NOCTE) " ) %>%
        stringr::str_replace( "(in the night)",            " WHEN = (NOCTE) " ) %>%
        stringr::str_replace( "(at night)",                " WHEN = (NOCTE) " ) %>%
        stringr::str_replace( "(night)[a-z]?",             " WHEN = (NOCTE) " ) %>%
        stringr::str_replace( "\\ [0-9]?(o.n.)(\\ |$)",                  " WHEN = (NOCTE) " ) %>% # Not confident about replacing "on" with NOCTE.
        stringr::str_replace( "(wf)",                                                              " WHEN = (MEALTIME) " ) %>%
        stringr::str_replace( "(before)?(after)?(with)?\\ (food)s?",              " WHEN = (MEALTIME) " ) %>%
        stringr::str_replace( "(before)?(after)?(with)?\\ (meal)s?",              " WHEN = (MEALTIME) " ) %>%
        stringr::str_replace( "(before)?(after)?(with)?\\ (break)\\ ?(fast)s?",   " WHEN = (MEALTIME) " ) %>%
        stringr::str_replace( "(before)?(after)?(with)?\\ (lunch)(time)?",        " WHEN = (MEALTIME) " ) %>%
        stringr::str_replace( "(eve)(ning)?\\ (meal)",                             " WHEN = (MEALTIME) " ) %>%
        stringr::str_replace( "(before)?(after)?(with)?\\ (dinner)s?",            " WHEN = (MEALTIME) " ) %>%
        stringr::str_replace( "(before)?(after)?(with)?\\ (supper)s?",            " WHEN = (MEALTIME) " ) %>%
        stringr::str_replace( "(before)?(after)?(with)?\\ (tea)\\ ?(time)?s?(\\ meal)?",     " WHEN = (MEALTIME) " ) %>%
        
       
        
        # Transform X: Remove spiel about "your practice will contact you..."
        stringr::str_replace( "(your practice will contact you).*$", "" ) %>%
        
        # Transform X: Remove spiel about "whilst taking steroids..."
        stringr::str_replace( "(whilst taking steroids)",    "" ) %>%
        
       
        
        
        # Transform X: Remove scraps that result from other transforms.
        stringr::str_replace( "( in the )",                     "" ) %>%
        stringr::str_replace( "\\ at\\ ",                     "" ) %>%
        stringr::str_replace( "\\ (time)",                     "" ) %>%
        stringr::str_replace( "(every)",                     "" ) %>%
        stringr::str_replace( "(main)",                    "" ) %>%
        stringr::str_replace( "(each)",                    "" ) %>%
        stringr::str_replace( "\\ \\ ",                        " " ) %>%
        stringr::str_replace( "\\ -\\ ",                       " " ) %>%
        stringr::str_replace( "^?\\ a\\ ",                       " " )
        
    ###)
) %>%
# Filter out prosaic dosages that give week-by-week explanations.
dplyr::filter( sql( 'NOT REGEXP_CONTAINS(medicationdosage, r\'w(e*)?ks?\\ ?(-)?[0-9]\' )' ) ) %>%
# Drop empty dosages and duplicates, then pull the first rows into R.
dplyr::filter( medicationdosage != "" ) %>%
dplyr::distinct() %>%
dplyr::collect() %>%
head( n = n_row_display )
qry


### DISPLAY ###

options(repr.matrix.max.rows = n_row_display)
#qry %>% dplyr::filter( medicationdosage != "" ) %>% distinct() %>% collect() %>% head( n = n_row_display )

##############
### TRICKY ###
##############
# 1-2 IN THE MORNING


| medicationdosage `<chr>` | transformed `<chr>` |
| ------- | ------- |
| Take Half A Tablet Daily With Breakfast | QUANT = (0.5) FREQ = (1) WHEN = (MEALTIME) |
| 2 before breakfast and another 2 before teatime meal | QUANT = (2) WHEN = (MEALTIME) + another QUANT = (2) WHEN = (MEALTIME) |
| take one at teatime | QUANT = (1) WHEN = (MEALTIME) |
| Take 1 daily | QUANT = (1) FREQ = (1) |
| Take 2 daily | QUANT = (2) FREQ = (1) |
| half tab twice a day | QUANT = (0.5) FREQ = (2) FREQ = (1) |
| Two Tablets To Be Taken With Breakfast And Two To be Taken With Evening Meal | QUANT = (2) WHEN = (MEALTIME) + QUANT = (2) |
| TAKE 2 TABS IN AM & 1 TAB AT NIGHT | QUANT = (2) in WHEN = (A.M.) + QUANT = (1) WHEN = (NOCTE) |
| one bd with meal | QUANT = (1) FREQ = (2) |
| take one OM | QUANT = (1) WHEN = (A.M.) |

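A possible next step is to turn the tagged strings above into a number of tablets per day. The sketch below is a base R illustration, not part of the notebook: the function names `extract_tag()` and `tablets_per_day()` are hypothetical. It assumes that `+` separates independent dose instructions, that a part without a `FREQ` tag is taken once a day, and that only the first `QUANT` and the first `FREQ` in each part count. The inputs are rows copied from the output above.

```r
# Pull the first number that follows a tag, e.g. "QUANT = (0.5)" gives 0.5.
# Returns NA when the tag is absent.
extract_tag <- function(part, tag) {
  pattern <- paste0(tag, " = \\(([0-9.]+)\\)")
  hit <- regmatches(part, regexec(pattern, part))[[1]]
  if (length(hit) < 2) NA_real_ else as.numeric(hit[2])
}

# Sum quantity x frequency over the "+"-separated parts of one string.
tablets_per_day <- function(transformed) {
  parts <- strsplit(transformed, "+", fixed = TRUE)[[1]]
  per_part <- vapply(parts, function(part) {
    quant <- extract_tag(part, "QUANT")
    freq <- extract_tag(part, "FREQ")
    if (is.na(freq)) freq <- 1  # Assumption: no FREQ tag means once a day.
    quant * freq
  }, numeric(1))
  sum(per_part)
}

print(tablets_per_day("QUANT = (0.5) FREQ = (1) WHEN = (MEALTIME)"))                  # 0.5
print(tablets_per_day("QUANT = (1) FREQ = (2)"))                                      # 2
print(tablets_per_day("QUANT = (2) in WHEN = (A.M.) + QUANT = (1) WHEN = (NOCTE)"))   # 3
```

A part with no `QUANT` tag yields `NA`, which flags the row for manual review rather than silently counting it as zero.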

Work with the `medicationquantity` column to see whether the expected prescription duration can be determined.

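The underlying arithmetic is duration in days = total tablets supplied / tablets taken per day. The base R sketch below illustrates this under stated assumptions: the function names `total_tablets()` and `expected_duration_days()` are hypothetical, anything after a dash is assumed to be the drug strength, and all remaining numbers are assumed to multiply together (so "3*112 tablets" is read as 3 packs of 112). The quantity strings are taken from this column's output; the daily dose of 2 is invented.

```r
# Total number of tablets implied by a `medicationquantity` string.
total_tablets <- function(quantity) {
  cleaned <- tolower(quantity)
  # Assumption: text after a dash is the strength, e.g. " - 500 mg".
  cleaned <- sub(" *-.*$", "", cleaned)
  numbers <- as.numeric(regmatches(cleaned, gregexpr("[0-9]+", cleaned))[[1]])
  # Assumption: several numbers mean packs x pack size.
  if (length(numbers) == 0) NA_real_ else prod(numbers)
}

# Days of supply, given the number of tablets taken per day.
expected_duration_days <- function(quantity, tablets_per_day) {
  total_tablets(quantity) / tablets_per_day
}

print(total_tablets("112 tablet(s) - 500 mg"))                            # 112
print(total_tablets("3*112 tablets"))                                     # 336
print(expected_duration_days("3*112 tablets", tablets_per_day = 2))       # 168
```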
In [11]:
qry <- r_tbl_srprimarycaremedication %>%
# Select every record that has a prescription for any diabetes medication.
dplyr::inner_join( names_meds_of_interest, by = join_by( nameofmedication == DMplusD_ProductDescription ) ) %>%
# Filter out all 'as directed' dosages.
dplyr::filter( sql( 'NOT CONTAINS_SUBSTR(medicationdosage, \'directed\' )' ) ) %>%
# Filter for tablets and capsules, only.
dplyr::filter( medicationquantity %LIKE% '%tablet%' %OR% medicationquantity %LIKE% '%capsule%' ) %>%
# Select unique `medicationquantity` values, only.
dplyr::select( medicationquantity ) %>%
dplyr::distinct() %>%
dplyr::mutate(
    transformed = 
        medicationquantity %>%
        
        # Transform X: Remove parentheses.
        stringr::str_replace("[()]",              "" ) %>%
        
        # Transform X: Make everything lower case.
        stringr::str_to_lower() %>% # `tolower()` is a base R function; the stringr equivalent is `str_to_lower()`.
        
        # Get rid of the drug mass, e.g. grams and mg.
        stringr::str_replace( "[0-9]*\\ ?(mg|grams?)",              "" ) %>% # The units are grouped so that the digits are removed for both.
        
        
        # Transform X: Remove unnecessary 'tablet'.
        stringr::str_replace( "(tab)(let)?s?",              "" ) %>%
        
        # Transform " pack of " to "*", which is used more often, it seems.
        stringr::str_replace( "\\ (pack)s?\\ (of)\\ ",              "*" ) %>%
        
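        # Transform the implied multiplications.
        # Note: `(^1\\*|x)` groups as `^1\\*` OR `x`, so the first pattern also tags any "x" as MULTI = (1 TIMES), e.g. "14 x tablets".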
        stringr::str_replace( "(^1\\*|x)|(\\*1\\ )",              " MULTI = (1 TIMES) " ) %>%
        stringr::str_replace( "(^2\\*|x)|(\\*2\\ )",              " MULTI = (2 TIMES) " ) %>%
        stringr::str_replace( "(^3\\*|x)|(\\*3\\ )",              " MULTI = (3 TIMES) " ) %>%
        stringr::str_replace( "(^4\\*|x)|(\\*4\\ )",              " MULTI = (4 TIMES) " ) %>%
        stringr::str_replace( "(^5\\*|x)|(\\*5\\ )",              " MULTI = (5 TIMES) " ) %>%
        stringr::str_replace( "(^6\\*|x)|(\\*6\\ )",              " MULTI = (6 TIMES) " ) %>%
        stringr::str_replace( "(^7\\*|x)|(\\*7\\ )",              " MULTI = (7 TIMES) " ) %>%
        stringr::str_replace( "(^8\\*|x)|(\\*8\\ )",              " MULTI = (8 TIMES) " ) %>%
        stringr::str_replace( "(^9\\*|x)|(\\*9\\ )",              " MULTI = (9 TIMES) " ) %>%
        stringr::str_replace( "(^10\\*|x)|(\\*10\\ )",              " MULTI = (10 TIMES) " ) %>%
        stringr::str_replace( "(^11\\*|x)|(\\*11\\ )",              " MULTI = (11 TIMES) " ) %>%
        stringr::str_replace( "(^12\\*|x)|(\\*12\\ )",              " MULTI = (12 TIMES) " ) %>%
        
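        # Find realistic multiples of 28, which is a month's worth of tablets.
        # Note: the leading `(^|\\ )?` is optional, so these patterns also match inside longer numbers, e.g. "56" inside "256".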
        stringr::str_replace( "(^|\\ )?(28)\\ ",              " QUANT = (1 MONTHS) " ) %>%
        stringr::str_replace( "(^|\\ )?(56)\\ ",              " QUANT = (2 MONTHS) " ) %>%
        stringr::str_replace( "(^|\\ )?(84)\\ ",              " QUANT = (3 MONTHS) " ) %>%
        stringr::str_replace( "(^|\\ )?(112)\\ ",             " QUANT = (4 MONTHS) " ) %>%
        stringr::str_replace( "(^|\\ )?(140)\\ ",             " QUANT = (5 MONTHS) " ) %>%
        stringr::str_replace( "(^|\\ )?(168)\\ ",             " QUANT = (6 MONTHS) " ) %>%
        stringr::str_replace( "(^|\\ )?(196)\\ ",             " QUANT = (7 MONTHS) " ) %>%
        stringr::str_replace( "(^|\\ )?(224)\\ ",             " QUANT = (8 MONTHS) " ) %>%
        stringr::str_replace( "(^|\\ )?(252)\\ ",             " QUANT = (9 MONTHS) " ) %>%
        stringr::str_replace( "(^|\\ )?(280)\\ ",             " QUANT = (10 MONTHS) " ) %>%
        stringr::str_replace( "(^|\\ )?(308)\\ ",             " QUANT = (11 MONTHS) " ) %>%
        stringr::str_replace( "(^|\\ )?(336)\\ ",             " QUANT = (12 MONTHS) " ) %>%
        
        # Tidy up the implied arithmetic operations.
        #stringr::str_replace( "\\ +\\ ",             "+" ) %>%
        stringr::str_replace( "(\\ \\*)|(\\*\\ )",             "*" ) %>%
        
        # Clean up scraps caused by previous transforms.
        stringr::str_replace( "\\ ?-\\ ?",             "" )
        
        

)

n_row_display <- 200
options(repr.matrix.max.rows = n_row_display)
qry %>% distinct() %>% collect()

| medicationquantity `<chr>` | transformed `<chr>` |
| ------- | ------- |
| 112 tablet(s) - 500 mg | QUANT = (4 MONTHS) |
| 7 tablets - 500 mg | 7 |
| 420 tablet | 420 |
| 126 tablet(s) | 126 |
| 14 x tablets | 14 MULTI = (1 TIMES) |
| 3 tablet - 80 mg | 3 |
| 160 tablets | 160 |
| 256 tablet | 2 QUANT = (2 MONTHS) |
| 320 tablet | 320 |
| 3*112 tablets | MULTI = (3 TIMES) QUANT = (4 MONTHS) |

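Once the tablet count is available as a number, the multiples-of-28 check can be done arithmetically instead of with one pattern per multiple. The sketch below is a base R illustration under that assumption. The function name `months_supplied()` and its `per_month` argument are hypothetical, and the 28-tablet month is the notebook's own convention. The counts are taken from the output above.

```r
# Number of 28-tablet months in a count, or NA when the count is not an
# exact multiple of `per_month`.
months_supplied <- function(n_tablets, per_month = 28) {
  ifelse(n_tablets %% per_month == 0, n_tablets / per_month, NA_real_)
}

print(months_supplied(c(112, 256, 420, 126)))  # 4 NA 15 NA
```

Working on whole numbers avoids partial matches such as "56" inside "256", and it covers counts beyond 12 months, such as 420, without extra patterns.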