# Intro

In CDHB, the registrars' quarterly reporting statistics are based on the number of reports recorded in the RIS. Regardless of the number of body parts within a study, if only one report is written, then the study is counted once.
This project uses `ce_description` as a surrogate to calculate the actual number of radiographs examined. Besides causing unnecessary stress by undercounting, the current method has a more concerning side effect: multi-part studies will not be reported as promptly as single-part studies.

This project aims to provide a more accurate way of tracking the progress of each individual registrar, as well as improving the incentives for reporting multi-part studies.

## Goal
1. Write a parser to parse the free-text field `ce_description` and count the number of body parts examined
2. Generate a report for individual registrars

## Expected difficulties

- Change of wording across time
- Typos within the free text description field
- Symbols are sometimes included
- Misnomers are used
- Inconsistencies in naming of the same study
- Added descriptors that are irrelevant to the study type, such as AFTER CT

# Method

## Inspection

In [1]:
import re
import pandas as pd
import numpy as np

In [4]:
data = pd.read_csv("ExamDataForTubo_202108.csv", usecols=["ce_description", "ex_type", "RegRep", "reporter", "@ TrainingYear", "@Phase", "@ RptGrp", "@YearMonth"])

In [5]:
xrs = data[data.ex_type == "XR"]
xrs

Unnamed: 0,ce_description,reporter,ex_type,RegRep,@ TrainingYear,@ RptGrp,@Phase,@YearMonth
0,CHEST,ALO,XR,ALO,5,Plain Film & Fluoro,Phase 2,202108
1,CHEST ABDOMEN,SDA,XR,SDA,4,Plain Film & Fluoro,Phase 2,202108
3,CHEST,ALO,XR,ALO,5,Plain Film & Fluoro,Phase 2,202108
4,CHEST ABDOMEN,ALO,XR,ALO,5,Plain Film & Fluoro,Phase 2,202108
5,CHEST,SDA,XR,SDA,4,Plain Film & Fluoro,Phase 2,202108
...,...,...,...,...,...,...,...,...
151369,L FOOT,AS,XR,KSH,1,Plain Film & Fluoro,Phase 1,201401
151372,R SHOULDER BILATERAL HANDS FEET,GJW,XR,KSH,1,Plain Film & Fluoro,Phase 1,201401
151376,CHEST,DL,XR,KSH,1,Plain Film & Fluoro,Phase 1,201312
151377,CHEST,LJW1,XR,LJW1,2,Plain Film & Fluoro,Phase 1,201810


### Lexicon

In [6]:
ce_description = xrs['ce_description']
ce_description_counts = pd.Series(ce_description).value_counts()

In [7]:
ce_description_counts[ce_description_counts > 100]

CHEST                   45447
ABDOMEN                  4876
CHEST ABDOMEN            3046
R HIP                    2547
L HIP                    2321
R KNEE                   1969
L SPINE                  1902
L KNEE                   1801
PELVIS                   1741
C SPINE                  1293
R SHOULDER                918
R HAND                    877
L SHOULDER                852
R ANKLE                   846
L FINGER                  801
L ANKLE                   770
R FINGER                  715
R FOOT                    697
L FOOT                    663
L HAND                    647
R KNEE POST OP            644
R HIP POST OP             634
L WRIST                   606
R WRIST                   575
L HIP POST OP             540
L KNEE POST OP            520
R FEMUR                   505
L FEMUR                   502
OPG                       481
T SPINE                   430
BILATERAL HIPS            423
L ELBOW                   416
CHEST 1                   415
L THUMB   

In [8]:
# total number of individual descriptors, e.g. CHEST, ABDOMEN, LEFT, CERVICAL etc.
len(set(data[data.ex_type == "XR"].ce_description.str.cat(sep=' ').split(" ")))

543

In [9]:
# total number of individual descriptors with symbols such as .,-` removed
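# regex=True is required in pandas >= 2.0, where str.replace treats the pattern literally by default
descriptors = data[data.ex_type == "XR"].ce_description.str.replace(r"[^\w\s]", " ", regex=True).str.cat(sep=' ').split(" ")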
print(len(set(descriptors)))

469


In [10]:
counts = pd.Series(descriptors).value_counts()
counts

CHEST      52614
L          18330
R          15825
ABDOMEN     8503
HIP         7640
           ...  
1445HRS        1
WORKER         1
A6             1
THORACO        1
SACRO          1
Length: 469, dtype: int64

In [11]:
counts[counts > 100]

CHEST        52614
L            18330
R            15825
ABDOMEN       8503
HIP           7640
SPINE         6251
KNEE          5976
POST          2767
OP            2767
SHOULDER      2681
PELVIS        2622
ANKLE         2605
BILATERAL     2332
FOOT          2175
HAND          2095
WRIST         2067
C             1842
FINGER        1662
FEMUR         1479
              1467
T             1269
ELBOW         1264
TIB            939
FIB            935
THUMB          869
FEET           675
HIPS           673
OPG            624
HANDS          613
1              556
KNEES          543
FOREARM        542
HUMERUS        475
TOE            394
WHOLE          381
IN             375
SCAPHOID       268
CLAVICLE       228
OPR            221
2              214
WRISTS         184
MANDIBLE       150
FACIAL         149
BONES          148
JOINTS         135
KUB            129
SI             122
3              120
NECK           117
APR            103
dtype: int64

#### Data inhomogeneity
By inspecting the keywords above, we found several sources of inhomogeneity.

- Typos such as ABDOMON, CALCANEOUS, FINFGER
- Missing or wrong delimiters such as CHESTABDOMEN, CHEST.ABDOMEN, LHAND, CHEST+ABDOMEN
- Unintended symbols such as CHEST`
- Unknown identifiers such as R16, 23, REFUGEE, OPR

Fortunately, fewer than 50 keywords appear more than 100 times across the nearly 75k radiographs.
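The inhomogeneities listed above could be handled by a normalization pass before any counting takes place. The following is a minimal sketch, not the project's actual parser: `TYPO_FIXES`, `KNOWN_PARTS`, `split_glued` and `normalize` are hypothetical names, and the two lookup tables contain only the examples mentioned in this section.

```python
import re

# Hypothetical correction table; a real one would be built by reviewing
# the low-frequency descriptors.
TYPO_FIXES = {"ABDOMON": "ABDOMEN", "CALCANEOUS": "CALCANEUS", "FINFGER": "FINGER"}

# Hypothetical vocabulary used to split glued tokens such as CHESTABDOMEN.
KNOWN_PARTS = ["CHEST", "ABDOMEN", "HAND"]


def split_glued(token):
    """Split a token such as CHESTABDOMEN or LHAND into known pieces."""
    if token in KNOWN_PARTS:
        return [token]
    # Side prefix glued to a known part, e.g. LHAND -> L HAND.
    if token[:1] in ("L", "R") and token[1:] in KNOWN_PARTS:
        return [token[0], token[1:]]
    # Two known parts glued together, e.g. CHESTABDOMEN -> CHEST ABDOMEN.
    for part in KNOWN_PARTS:
        rest = token[len(part):]
        if token.startswith(part) and rest in KNOWN_PARTS:
            return [part, rest]
    # Anything else is passed through unchanged.
    return [token]


def normalize(description):
    """Return a cleaned list of tokens for one ce_description value."""
    # Replace every character that is not a word character or whitespace
    # with a space, so CHEST.ABDOMEN and CHEST+ABDOMEN become two tokens.
    cleaned = re.sub(r"[^\w\s]", " ", description.upper())
    tokens = []
    for token in cleaned.split():
        token = TYPO_FIXES.get(token, token)
        tokens.extend(split_glued(token))
    return tokens


print(normalize("CHEST.ABDOMEN LHAND"))  # ['CHEST', 'ABDOMEN', 'L', 'HAND']
```

Unknown identifiers such as R16 or REFUGEE pass through this sketch untouched, so a later step would still have to decide whether to ignore them.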

## Parsing

### Central region
- Simple central regions are counted by occurrence, such as CHEST, ABDOMEN, PELVIS
- Multi-part central regions are parsed by their modifiers, such as C T L SPINE
- Acronyms are used inconsistently, such as SOFT TISSUE NECK and ST NECK
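The first two rules above can be sketched as follows. This is an illustration under stated assumptions rather than the code in `src.parser`: `SIMPLE_CENTRAL`, `SPINE_LEVELS` and `count_central` are hypothetical names, the input is assumed to be already tokenized, and an L directly before SPINE is read as a lumbar level rather than a left-side marker.

```python
# Hypothetical vocabularies, limited to the examples in this section.
SIMPLE_CENTRAL = {"CHEST", "ABDOMEN", "PELVIS"}
SPINE_LEVELS = {"C", "T", "L"}


def count_central(tokens):
    """Count central body parts in a list of descriptor tokens."""
    count = 0
    for i, token in enumerate(tokens):
        if token in SIMPLE_CENTRAL:
            # Simple regions count once per occurrence.
            count += 1
        elif token == "SPINE":
            # Walk backwards over the run of level letters before SPINE,
            # so that C T L SPINE counts as three parts.
            levels = 0
            j = i - 1
            while j >= 0 and tokens[j] in SPINE_LEVELS:
                levels += 1
                j -= 1
            # A bare SPINE with no level letter still counts as one part.
            count += max(levels, 1)
    return count


print(count_central("C T L SPINE".split()))    # 3
print(count_central("CHEST ABDOMEN".split()))  # 2
```

The acronym inconsistency is not covered here; one option would be to map SOFT TISSUE to ST during normalization so that both spellings reach the counter in the same form.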

### Peripheral region
- Body parts after BILATERAL are not consistently plural, such as BILATERAL ANKLE, FINGER, ELBOW

### Counting conflicts

- Modifiers and noun plurality can be inconsistent, such as L FEET or BILATERAL HAND. In these cases, the modifier takes precedence.
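One way to implement the precedence rule is to track the most recent side modifier and fall back to plurality only when no modifier has been seen. The sketch below is hypothetical and is not the project's parser: `PERIPHERAL`, `IRREGULAR_PLURALS`, `singular` and `count_peripheral` are illustrative names, and it assumes that a modifier applies to every peripheral noun that follows it until the next modifier.

```python
# Hypothetical vocabulary of peripheral body parts in singular form.
PERIPHERAL = {"HAND", "WRIST", "FOOT", "ANKLE", "FINGER", "ELBOW",
              "SHOULDER", "HIP", "KNEE"}
IRREGULAR_PLURALS = {"FEET": "FOOT"}


def singular(token):
    """Return (singular form, whether the token was plural)."""
    if token in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[token], True
    if token.endswith("S") and token[:-1] in PERIPHERAL:
        return token[:-1], True
    return token, False


def count_peripheral(tokens):
    """Count peripheral body parts; a side modifier overrides plurality."""
    count = 0
    side = None  # most recent of L, R, BILATERAL
    for token in tokens:
        if token in ("L", "R", "BILATERAL"):
            side = token
            continue
        noun, is_plural = singular(token)
        if noun not in PERIPHERAL:
            continue
        if side == "BILATERAL":
            count += 2  # BILATERAL HAND counts as two
        elif side in ("L", "R"):
            count += 1  # L FEET counts as one
        else:
            count += 2 if is_plural else 1  # no modifier: trust plurality
    return count


print(count_peripheral("L FEET".split()))          # 1
print(count_peripheral("BILATERAL HAND".split()))  # 2
```

Carrying the modifier forward means BILATERAL HANDS WRISTS FEET counts as six parts in this sketch.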

In [13]:
from src.parser import parse

In [14]:
parts_sum = xrs['ce_description'].apply(parse)

In [15]:
result = pd.concat([xrs, parts_sum.rename('parts_sum')], axis=1)

In [16]:
result.to_csv("result.csv")

## Analysis

### Percentage of multi-part studies

In [17]:
result[result['parts_sum'] > 5]

Unnamed: 0,ce_description,reporter,ex_type,RegRep,@ TrainingYear,@ RptGrp,@Phase,@YearMonth,parts_sum
1271,BILATERAL HANDS WRISTS FEET,TUS,XR,TUS,2,Plain Film & Fluoro,Phase 1,202108,6
1276,CHEST HIPS KNEES HANDS FEET,RRY,XR,RRY,4,Plain Film & Fluoro,Phase 2,202108,9
2577,BILATERAL SHOULDERS HANDS FEET,LRH,XR,LRH,5,Plain Film & Fluoro,Phase 2,202108,6
2989,L SPINE BILATERAL SHOULDERS HIPS HANDS,MSU,XR,MSU,2,Plain Film & Fluoro,Phase 1,202108,7
3176,CHEST BILATERAL HANDS KNEES FEET X-RAY,DMR,XR,DMR,3,Plain Film & Fluoro,Phase 1,202108,7
...,...,...,...,...,...,...,...,...,...
148661,CHEST BILATERAL WRISTS HANDS FEET,DL,XR,KSH,2,Plain Film & Fluoro,Phase 1,201503,7
150359,BILATERAL WRISTS HANDS FEET,AS,XR,KSH,1,Plain Film & Fluoro,Phase 1,201409,6
150752,CHEST R HUM FOREARM FEMUR TIB FIB L FOR,JAC2,XR,KSH,1,Plain Film & Fluoro,Phase 1,201403,6
151273,SHOULDERS HANDS HIPS L SPINE,IAC,XR,KSH,1,Plain Film & Fluoro,Phase 1,201401,7


9360 / 75000 = 12.5%
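A share like the one above can be computed directly from the `parts_sum` column. The snippet below is a small sketch on made-up rows, assuming that "multi-part" means `parts_sum > 1`; the toy frame and its values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for `result`; the values are invented for illustration.
toy = pd.DataFrame({
    "ce_description": ["CHEST", "CHEST ABDOMEN", "R HIP", "BILATERAL HANDS WRISTS FEET"],
    "parts_sum": [1, 2, 1, 6],
})

# Assumption: a study is multi-part when more than one body part is counted.
multi = toy["parts_sum"] > 1

# The mean of a boolean Series is the fraction of True values.
share = multi.mean()

print(f"{multi.sum()} / {len(toy)} = {share:.1%}")  # 2 / 4 = 50.0%
```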

### Group by year and registrar

In [18]:
registrars = result.groupby("RegRep")

In [19]:
cur = registrars.size()

In [20]:
new = registrars['parts_sum'].sum()

In [23]:
pd.DataFrame({"current": cur, "new": new, "change": new/cur})

RegRep,current,new,change
ADB,4506,5188,1.151354
AJV,534,614,1.149813
ALO,4970,6255,1.258551
BDR,1168,1460,1.25
CCF,3916,4410,1.126149
DMR,5941,7705,1.29692
ESG,4526,5292,1.169244
IVM,2923,3552,1.21519
KCK,797,946,1.186951
KSH,6982,8194,1.173589


In [None]:
grouped = result.groupby(["RegRep", "@ TrainingYear", "@ RptGrp"], sort=False).size()

In [None]:
pd.DataFrame(grouped).sort_values(["RegRep", "@ TrainingYear"]).head(30)
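To compare the current and proposed counts per registrar and training year in one table, the report count and the `parts_sum` total can be aggregated together. This is a sketch on invented rows: the registrar codes AAA and BBB and all values are hypothetical, and only the column names `RegRep`, `@ TrainingYear` and `parts_sum` come from the dataset above.

```python
import pandas as pd

# Toy stand-in for `result`; registrar codes and counts are invented.
toy = pd.DataFrame({
    "RegRep": ["AAA", "AAA", "AAA", "BBB"],
    "@ TrainingYear": [1, 1, 2, 1],
    "parts_sum": [1, 3, 2, 2],
})

summary = toy.groupby(["RegRep", "@ TrainingYear"]).agg(
    current=("parts_sum", "size"),  # one per report, as counted today
    new=("parts_sum", "sum"),       # one per body part
)
# Ratio of the proposed count to the current count.
summary["change"] = summary["new"] / summary["current"]

print(summary)
```

Named aggregation keeps both counts in a single frame, so the ratio can be added as an ordinary column.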

# Result

- There are 74849 plain radiographs recorded from 2013-12 to 2020-10, each of which may include one or more body parts.
- There are 383 unique identifiers initially, and 287 unique identifiers after filtration.

# Limitations

- Unintended filtration of studies with typos
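One possible mitigation, shown here only as a sketch, is to suggest a correction for each unknown token before discarding it, so that likely typos can be reviewed instead of being filtered out. The example uses `difflib.get_close_matches` from the Python standard library; `VOCAB`, `suggest` and the 0.8 cutoff are hypothetical choices and not part of the project.

```python
from difflib import get_close_matches

# Hypothetical vocabulary; a real one would hold every accepted descriptor.
VOCAB = ["CHEST", "ABDOMEN", "FINGER", "CALCANEUS"]


def suggest(token, cutoff=0.8):
    """Return the closest known descriptor, or None if nothing is close."""
    if token in VOCAB:
        return token
    matches = get_close_matches(token, VOCAB, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for token in ["FINFGER", "ABDOMON", "REFUGEE"]:
    print(token, "->", suggest(token))
```

A high cutoff keeps unrelated identifiers such as REFUGEE from being mapped onto a body part, at the cost of missing more heavily garbled typos.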

# Appendix

## Descriptor related

In [58]:
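def write_raw_descriptors():
    """Write every unique descriptor to raw_descriptors.txt, one per line."""
    # Relies on `descriptor_set` having been defined in an earlier cell.
    with open("raw_descriptors.txt", "w") as f:
        for term in descriptor_set:
            f.write(f"{term}\n")

def write_descriptor_stats(s, fn):
    """Write a descriptor frequency table for Series `s` to `{fn}.txt`."""
    with open(f"{fn}.txt", "w") as f:
        # Series.iteritems() was removed in pandas 2.0; items() is equivalent.
        for term, count in s.value_counts().items():
            f.write(f"{term}\t\t{count}\n")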

In [17]:
pd.Series(raw_descriptors).to_csv("ce_descriptions.txt", index=False, header=False)

In [59]:
write_descriptor_stats(filtered_descriptor_series, "cleaned_descriptor_stats")

## Parse and then append a new column onto the raw dataset

In [1]:
import re
import pandas as pd
import numpy as np

In [2]:
from src.parser import parse

In [3]:
filename = "ExamDataForTubo_202109_2.csv"
outname = "ExamDataParsed_202109_2.csv"

In [4]:
data = pd.read_csv(filename)

In [5]:
parts_sum = data['ce_description'].apply(parse)

In [6]:
parsed = pd.concat([data, parts_sum.rename('parts_sum')], axis=1)

In [7]:
# `parsed` already holds the raw data plus the parts_sum column
result = parsed

In [8]:
result.to_csv(outname)