# Intro

Use the free-text field `ce_description` as a surrogate for the actual number of radiographs examined.

### Goal
1. Write a parser for the free-text field `ce_description` that counts the number of body parts examined
2. Generate a report for each individual registrar

In [66]:
import re
import pandas as pd
import numpy as np

In [67]:
data = pd.read_csv("ExamDataForTubo.csv", usecols=["ce_description", "ex_type", "RegRep", "reporter", "@ TrainingYear", "@Phase", "@ RptGrp", "@YearMonth"])

In [68]:
data[data.ex_type == "XR"]

Unnamed: 0,ce_description,reporter,ex_type,RegRep,@ TrainingYear,@ RptGrp,@Phase,@YearMonth
0,CHEST,DL,XR,KSH,1.0,Plain Film & Fluoro,Phase 1,201312
4,L ANKLE,AS,XR,KSH,1.0,Plain Film & Fluoro,Phase 1,201401
5,R ANKLE,AS,XR,KSH,1.0,Plain Film & Fluoro,Phase 1,201401
6,R ANKLE,AS,XR,KSH,1.0,Plain Film & Fluoro,Phase 1,201401
7,L FOOT,AS,XR,KSH,1.0,Plain Film & Fluoro,Phase 1,201401
...,...,...,...,...,...,...,...,...
119067,CHEST ABDOMEN,WAB,XR,WAB,1.0,Plain Film & Fluoro,Phase 1,202010
119070,ABDOMEN,DMR,XR,DMR,2.0,Plain Film & Fluoro,Phase 1,202010
119071,CHEST,WJI,XR,WJI,2.0,Plain Film & Fluoro,Phase 1,202010
119074,ABDOMEN,DMR,XR,DMR,2.0,Plain Film & Fluoro,Phase 1,202010


In [None]:
grouped = data.groupby(["RegRep", "@ TrainingYear", "@ RptGrp"], sort=False).size()

In [None]:
pd.DataFrame(grouped).sort_values(["RegRep", "@ TrainingYear"]).head(30)

# Method
## Data cleaning

Sources of data inhomogeneity:

- Typos such as ABDOMON, CALCANEOUS, FINFGER
- Missing or wrong delimiters such as CHESTABDOMEN, CHEST.ABDOMEN, LHAND, CHEST+ABDOMEN
- Unintended symbols such as CHEST`
- Unknown identifiers such as R16, 23, REFUGEE, OPR
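
The inhomogeneities listed above could be handled by a small normalisation step before tokenising. The sketch below is only an illustration, not the cleaning code used in this notebook: `TYPO_FIXES`, `KNOWN_PARTS` and `normalise` are hypothetical names, and the corrected spellings and the vocabulary are assumptions that would need to be extended for real data.

```python
import re

# Hypothetical typo corrections; the target spellings are assumptions.
TYPO_FIXES = {"ABDOMON": "ABDOMEN", "FINFGER": "FINGER"}

# Hypothetical vocabulary used to split terms that were run together.
KNOWN_PARTS = ["CHEST", "ABDOMEN", "HAND"]


def normalise(description):
    """Return a list of cleaned tokens for one free-text description."""
    # Treat common stray delimiters and symbols as spaces.
    text = re.sub(r"[,./`+]", " ", description.upper())
    tokens = []
    for token in text.split():
        token = TYPO_FIXES.get(token, token)
        # Split a fused side prefix such as LHAND into L and HAND.
        side = re.fullmatch(r"([LR])(\w+)", token)
        if side and side.group(2) in KNOWN_PARTS:
            tokens.extend([side.group(1), side.group(2)])
            continue
        # Split two fused known parts such as CHESTABDOMEN.
        for part in KNOWN_PARTS:
            rest = token[len(part):]
            if token.startswith(part) and rest in KNOWN_PARTS:
                tokens.extend([part, rest])
                break
        else:
            # No split applied, keep the token as it is.
            tokens.append(token)
    return tokens
```

Unknown identifiers such as R16 or REFUGEE pass through unchanged in this sketch; deciding whether to drop them is left to a later filtering step.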

In [108]:
descriptors = data[data.ex_type == "XR"].ce_description

In [62]:
raw_descriptor_list = data[data.ex_type == "XR"].ce_description.str.cat(sep=' ').split(" ")
print(len(set(raw_descriptor_list)))

383


In [63]:
raw_descriptor_list = data[data.ex_type == "XR"].ce_description.str.replace(r"[,./`]", " ", regex=True).str.cat(sep=' ').split(" ")
print(len(set(raw_descriptor_list)))

340


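Filter the terms with a regex that keeps only those beginning with a letter. This drops empty strings, purely numeric terms, and terms that start with a symbol. Because `re.match` anchors only at the start of the string, terms that begin with a letter and continue with digits or dashes are kept.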

In [56]:
filtered_descriptor_series = pd.Series([term for term in raw_descriptor_list if re.match(r"[A-Za-z][\w]*", term)])

In [57]:
filtered_descriptor_series.value_counts()

CHEST         42663
L             12982
R             11094
ABDOMEN        7153
HIP            6108
              ...  
ULTRASOUND        1
LFINGER           1
ACETABULUM        1
NASAL             1
PORTACATH         1
Length: 287, dtype: int64

## Parsing

### Central region
- Simple central regions, such as CHEST, ABDOMEN, PELVIS, are counted by their occurrence
- Multi-part central regions, such as C T L SPINE, are parsed by their modifiers
- Acronyms are used inconsistently, such as SOFT TISSUE NECK and ST NECK
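
One possible reading of these rules is sketched below. It is not the notebook's parser: `SIMPLE_CENTRAL`, `SPINE_LEVELS` and `count_central` are hypothetical names. The sketch assumes that each modifier directly before SPINE counts as one region, that C, T and L stand for cervical, thoracic and lumbar, and that SOFT TISSUE NECK and ST NECK each count once.

```python
SIMPLE_CENTRAL = {"CHEST", "ABDOMEN", "PELVIS"}
SPINE_LEVELS = {"C", "T", "L"}  # assumed: cervical, thoracic, lumbar


def count_central(tokens):
    """Count central body regions in a list of upper-case tokens."""
    count = 0
    for i, token in enumerate(tokens):
        if token in SIMPLE_CENTRAL:
            count += 1
        elif token == "NECK":
            # SOFT TISSUE NECK and ST NECK are both treated as one region.
            count += 1
        elif token == "SPINE":
            # Walk backwards over the modifiers directly before SPINE.
            levels = 0
            j = i - 1
            while j >= 0 and tokens[j] in SPINE_LEVELS:
                levels += 1
                j -= 1
            # A bare SPINE with no modifier still counts once.
            count += max(levels, 1)
    return count
```

For example, `count_central("C T L SPINE".split())` gives 3, while `count_central("ST NECK".split())` gives 1. Checking for SPINE before interpreting L avoids confusing the lumbar modifier with the left-side modifier used for peripheral parts.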

### Peripheral region
- Body parts after BILATERAL are not consistently plural, such as BILATERAL ANKLE, FINGER, ELBOW

### Counting conflicts

- Modifiers and noun plurality can disagree, such as L FEET, BILATERAL HAND. In these cases, the modifier takes precedence.
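
The precedence rule can be illustrated with a minimal sketch. `SIDE_COUNTS` and `count_peripheral` are hypothetical names, and the sketch assumes that every peripheral body part is directly preceded by a side modifier; nouns without a modifier are not counted here.

```python
SIDE_COUNTS = {"L": 1, "R": 1, "BILATERAL": 2}


def count_peripheral(description):
    """Count peripheral parts for a description such as 'L FEET'."""
    tokens = description.split()
    total = 0
    i = 0
    while i < len(tokens):
        if tokens[i] in SIDE_COUNTS and i + 1 < len(tokens):
            # The modifier decides the count; the noun's plurality is ignored.
            total += SIDE_COUNTS[tokens[i]]
            i += 2  # skip the noun the modifier applies to
        else:
            i += 1
    return total
```

Under these assumptions, `count_peripheral("L FEET")` gives 1 and `count_peripheral("BILATERAL HAND")` gives 2. A full parser would also need to exclude spine descriptions, where L is a spinal level rather than a side.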

In [110]:
descriptors[descriptors.str.match(r"BILATERAL \w+[^ST]\s*$")]

2584          BILATERAL ANKLE
11418          BILATERAL HAND
11793         BILATERAL ELBOW
12963          BILATERAL HAND
21417         BILATERAL ANKLE
22511        BILATERAL FEMORA
22553      BILATERAL SHOULDER
46066           BILATERAL HIP
51483         BILATERAL ELBOW
56345           BILATERAL HIP
58590       BILATERAL FOREARM
62429          BILATERAL KNEE
65054     BILATERAL CALCANEUM
66981          BILATERAL KNEE
76898           BILATERAL HIP
77303           BILATERAL HIP
81895          BILATERAL KNEE
82150           BILATERAL HIP
83300         BILATERAL ELBOW
89533         BILATERAL FEMUR
89559           BILATERAL HIP
94085       BILATERAL WRISTS 
97016      BILATERAL SHOULDER
102551        BILATERAL FEMUR
102883      BILATERAL FOREARM
106374     BILATERAL SHOULDER
106648          BILATERAL HIP
Name: ce_description, dtype: object
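
Note that `BILATERAL WRISTS ` appears in the output above even though the noun is plural: its trailing space satisfies `[^ST]`. The same pattern also skips singular nouns that end in S or T, such as FOOT or WRIST. A possible alternative, sketched here on a few sample strings with plain `re`, strips whitespace first and checks the noun against an explicit list. `PLURALS` is a hypothetical and incomplete list, and `is_singular_bilateral` is an illustrative helper, not part of the notebook.

```python
import re

# Sample values only; in the notebook these would come from `descriptors`.
samples = [
    "BILATERAL ANKLE",
    "BILATERAL WRISTS ",
    "BILATERAL WRIST",
    "BILATERAL FEET",
    "L FOOT",
]

# Hypothetical, incomplete list of plural nouns.
PLURALS = {"WRISTS", "FEET", "HANDS", "KNEES"}


def is_singular_bilateral(description):
    """True if the description is BILATERAL plus one noun not in PLURALS."""
    m = re.fullmatch(r"BILATERAL (\w+)", description.strip())
    return bool(m) and m.group(1) not in PLURALS


singular = [s for s in samples if is_singular_bilateral(s)]
```

With these samples, `singular` contains only `BILATERAL ANKLE` and `BILATERAL WRIST`.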

In [109]:
descriptors[descriptors.str.match(r".* NECK")]

125                     SOFT TISSUE NECK
277                     SOFT TISSUE NECK
1822                    SOFT TISSUE NECK
2180      CHEST ABDOMEN SOFT TISSUE NECK
5850                    SOFT TISSUE NECK
                       ...              
107773                           ST NECK
110410                           ST NECK
112342                           ST NECK
114566                           ST NECK
115965                           ST NECK
Name: ce_description, Length: 110, dtype: object

In [111]:
descriptors[descriptors.str.match(r".* NG")]

33529     CHEST ABDOMEN FOR NG TUBE
40678     CHEST ABDOMEN FOR NG TUBE
41633             CHEST FOR NG TUBE
43229                      CHEST NG
47169     CHEST/ABDOMEN FOR NG TUBE
50134            CHEST NG PLACEMENT
63358     CHEST ABDOMEN FOR NG TUBE
67726     CHEST ABDOMEN FOR NG TUBE
67728     CHEST ABDOMEN FOR NG TUBE
75066     CHEST ABDOMEN FOR NG TUBE
75077     CHEST ABDOMEN FOR NG TUBE
87427     CHEST/ABDOMEN FOR NG TUBE
87473     CHEST/ABDOMEN FOR NG TUBE
92255     CHEST ABDOMEN FOR NG TUBE
98832        CHEST/ABDO FOR NG TUBE
102464    CHEST/ABDOMEN FOR NG TUBE
102465    CHEST/ABDOMEN FOR NG TUBE
102498    CHEST/ABDOMEN FOR NG TUBE
103558    CHEST/ABDOMEN FOR NG TUBE
107209                     CHEST NG
113164    CHEST ABDOMEN FOR NG TUBE
114903    CHEST ABDOMEN FOR NG TUBE
115720    CHEST ABDOMEN FOR NG TUBE
115753    CHEST ABDOMEN FOR NG TUBE
118077    CHEST/ABDOMEN FOR NG TUBE
118747    CHEST/ABDOMEN FOR NG TUBE
118748    CHEST/ABDOMEN FOR NG TUBE
118784    CHEST/ABDOMEN FOR 

In [118]:
descriptors[descriptors.str.match(r".*COCHLEA.*")]

45542    L COCHLEAR IMPLANT
46939      COCHLEAR IMPLANT
Name: ce_description, dtype: object

In [150]:
descriptors[descriptors.str.match(r"(?<!BILATERAL )\w+")]

0                 CHEST
4               L ANKLE
5               R ANKLE
6               R ANKLE
7                L FOOT
              ...      
119067    CHEST ABDOMEN
119070          ABDOMEN
119071            CHEST
119074          ABDOMEN
119081            CHEST
Name: ce_description, Length: 74842, dtype: object
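
The pattern above returns almost every row (74842) because `str.match` anchors at the start of the string, where there is no preceding text for the lookbehind `(?<!BILATERAL )` to reject. If the intent was to select descriptions that do not begin with BILATERAL, a negative lookahead is one option. The sketch below uses plain `re.match` on sample strings, which applies the same start-anchored matching; the sample list is illustrative only.

```python
import re

samples = ["CHEST", "L ANKLE", "BILATERAL HIP"]

# A lookbehind at position 0 has nothing to inspect, so every sample matches.
lookbehind = [s for s in samples if re.match(r"(?<!BILATERAL )\w+", s)]

# A lookahead inspects the text that follows position 0 and rejects BILATERAL.
lookahead = [s for s in samples if re.match(r"(?!BILATERAL )\w+", s)]
```

Here `lookbehind` keeps all three samples, while `lookahead` keeps only `CHEST` and `L ANKLE`.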

In [58]:
def write_raw_descriptors(descriptor_set):
    # Write each unique descriptor to raw_descriptors.txt, one per line.
    with open("raw_descriptors.txt", "w") as f:
        for term in descriptor_set:
            f.write(f"{term}\n")

def write_descriptor_stats(s, fn):
    # Write a descriptor frequency table to <fn>.txt, one "term<TAB><TAB>count" row per line.
    # Series.items() replaces iteritems(), which was removed in pandas 2.0.
    with open(f"{fn}.txt", "w") as f:
        for term, count in s.value_counts().items():
            f.write(f"{term}\t\t{count}\n")

In [59]:
write_descriptor_stats(filtered_descriptor_series, "cleaned_descriptor_stats")

# Result

- There are 74849 plain radiographs recorded from 2013-12 to 2020-10, each of which may include one or more body parts.
- There are 383 unique identifiers initially, 340 after replacing stray delimiters with spaces, and 287 after filtration.

# Limitations

- Studies whose `ce_description` contains typos may be filtered out unintentionally.