# Intro

Use `ce_description` as a surrogate to calculate the actual number of radiographs examined.
This provides a more accurate way of tracking the progress of each individual registrar and improves the incentives for reporting multi-part studies.

## Goal
1. Write a parser to parse the free-text field `ce_description` and count the number of body parts examined
2. Generate a report for individual registrars

## Expected difficulties

- Change of wording
- Typos
- Symbols
- Misnomers
- Inconsistencies
- Added descriptors

# Method

## Inspection

In [2]:
import re
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv("ExamDataForTubo.csv", usecols=["ce_description", "ex_type", "RegRep", "reporter", "@ TrainingYear", "@Phase", "@ RptGrp", "@YearMonth"])

In [35]:
xrs = data[data.ex_type == "XR"]
xrs

Unnamed: 0,ce_description,reporter,ex_type,RegRep,@ TrainingYear,@ RptGrp,@Phase,@YearMonth
0,CHEST,DL,XR,KSH,1.0,Plain Film & Fluoro,Phase 1,201312
4,L ANKLE,AS,XR,KSH,1.0,Plain Film & Fluoro,Phase 1,201401
5,R ANKLE,AS,XR,KSH,1.0,Plain Film & Fluoro,Phase 1,201401
6,R ANKLE,AS,XR,KSH,1.0,Plain Film & Fluoro,Phase 1,201401
7,L FOOT,AS,XR,KSH,1.0,Plain Film & Fluoro,Phase 1,201401
...,...,...,...,...,...,...,...,...
119067,CHEST ABDOMEN,WAB,XR,WAB,1.0,Plain Film & Fluoro,Phase 1,202010
119070,ABDOMEN,DMR,XR,DMR,2.0,Plain Film & Fluoro,Phase 1,202010
119071,CHEST,WJI,XR,WJI,2.0,Plain Film & Fluoro,Phase 1,202010
119074,ABDOMEN,DMR,XR,DMR,2.0,Plain Film & Fluoro,Phase 1,202010


In [20]:
ce_description = xrs['ce_description']
ce_description_counts = pd.Series(ce_description).value_counts()

In [21]:
ce_description_counts[ce_description_counts > 100]

CHEST              36951
ABDOMEN             3851
CHEST ABDOMEN       2871
R HIP               2063
L HIP               1906
L SPINE             1581
R KNEE              1430
PELVIS              1340
L KNEE              1300
C SPINE             1046
OPG                  810
R SHOULDER           589
L FINGER             575
R HAND               552
L SHOULDER           545
R FINGER             515
R ANKLE              502
R HIP POST OP        498
R FOOT               490
R KNEE POST OP       484
R FEMUR              435
L FOOT               432
L HAND               427
L ANKLE              423
L HIP POST OP        395
L FEMUR              388
L KNEE POST OP       387
CHEST 1              335
T SPINE              325
L WRIST              318
R WRIST              297
BILATERAL HIPS       291
WHOLE SPINE          268
L ELBOW              266
BILATERAL KNEES      239
L THUMB              235
T L SPINE            221
R THUMB              207
R ELBOW              205
L TIB FIB            194


In [38]:
# total number of individual descriptors, e.g. CHEST, ABDOMEN, LEFT, CERVICAL etc.
len(set(data[data.ex_type == "XR"].ce_description.str.cat(sep=' ').split(" ")))

383

In [84]:
# total number of individual descriptors with symbols such as .,-` removed
descriptors = data[data.ex_type == "XR"].ce_description.str.replace(r"[^\w\s]", " ", regex=True).str.cat(sep=' ').split(" ")
print(len(set(descriptors)))

330


In [86]:
counts = pd.Series(descriptors).value_counts()
counts

CHEST         42663
L             12991
R             11094
ABDOMEN        7153
HIP            6108
              ...  
ULTRASOUND        1
LFINGER           1
ILIAC             1
ACETABULUM        1
SACRO             1
Length: 330, dtype: int64

In [87]:
counts[counts > 100]

CHEST        42663
L            12991
R            11094
ABDOMEN       7153
HIP           6108
SPINE         4870
KNEE          4323
OP            2081
POST          2081
PELVIS        1982
SHOULDER      1740
ANKLE         1474
C             1458
FOOT          1380
BILATERAL     1379
HAND          1339
FINGER        1205
WRIST         1201
FEMUR         1171
OPG           1054
               925
T              903
ELBOW          774
TIB            626
FIB            622
THUMB          526
HIPS           473
1              437
KNEES          347
HUMERUS        330
HANDS          328
FEET           323
FOREARM        305
IN             279
WHOLE          275
TOE            251
OPR            237
MANDIBLE       237
FACIAL         215
BONES          213
2              159
KUB            132
CLAVICLE       129
NECK           115
dtype: int64

### Data inhomogeneity
Inspecting the keywords above reveals several sources of inhomogeneity.

- Typos such as ABDOMON, CALCANEOUS, FINFGER
- Missing or wrong delimiters such as CHESTABDOMEN, CHEST.ABDOMEN, LHAND, CHEST+ABDOMEN
- Unintended symbols such as CHEST`
- Unknown identifiers such as R16, 23, REFUGEE, OPR

Fortunately, fewer than 50 keywords appear more than 100 times across the nearly 75k radiographs.
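
The clean-up implied by the list above can be sketched as follows. This is a minimal illustration, not the code in `src.parser`: `TYPO_FIXES`, `KNOWN_PARTS`, `normalise` and `split_fused` are hypothetical names, and the two lookup tables contain only the examples mentioned in this section.

```python
import re

# Hypothetical typo map built from the examples above; a real map would be
# curated from the full descriptor frequency table.
TYPO_FIXES = {"ABDOMON": "ABDOMEN", "CALCANEOUS": "CALCANEUS", "FINFGER": "FINGER"}

# Hypothetical vocabulary used to split fused tokens such as CHESTABDOMEN.
KNOWN_PARTS = ["CHEST", "ABDOMEN", "HAND", "FINGER"]


def split_fused(token):
    """Split tokens such as CHESTABDOMEN or LHAND into known parts."""
    if token in KNOWN_PARTS:
        return [token]
    # Laterality prefix fused to a known part, e.g. LHAND -> L HAND.
    if token[:1] in ("L", "R") and token[1:] in KNOWN_PARTS:
        return [token[0], token[1:]]
    # Two known parts fused together, e.g. CHESTABDOMEN -> CHEST ABDOMEN.
    for part in KNOWN_PARTS:
        if token.startswith(part) and token[len(part):] in KNOWN_PARTS:
            return [part, token[len(part):]]
    # Unknown tokens (R16, REFUGEE, ...) are passed through unchanged.
    return [token]


def normalise(description):
    """Return a list of cleaned, upper-case tokens for one description."""
    # Replace anything that is not a word character or whitespace with a space,
    # which handles CHEST.ABDOMEN, CHEST+ABDOMEN and CHEST`.
    text = re.sub(r"[^\w\s]", " ", description.upper())
    tokens = []
    for token in text.split():
        token = TYPO_FIXES.get(token, token)
        tokens.extend(split_fused(token))
    return tokens


print(normalise("CHEST+ABDOMEN"))
print(normalise("LHAND"))
```

Unknown identifiers are deliberately left in the token list so that a later step can decide whether to ignore or flag them.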

## Parsing

### Central region
- Simple central regions, such as CHEST, ABDOMEN and PELVIS, are counted once per occurrence
- Multi-part central regions are counted by their modifiers, such as C T L SPINE
- Acronyms are used inconsistently, such as SOFT TISSUE NECK and ST NECK
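
One way to express the central-region rules is sketched below. It is an assumption-laden illustration rather than the logic in `src.parser`: `SIMPLE_CENTRAL`, `SPINE_MODIFIERS` and `count_central` are hypothetical names, the input is assumed to be already tokenised, and a bare SPINE (or WHOLE SPINE) is assumed to count once.

```python
# Assumed vocabulary; not taken from the original src.parser module.
SIMPLE_CENTRAL = {"CHEST", "ABDOMEN", "PELVIS"}
SPINE_MODIFIERS = {"C", "T", "L"}


def count_central(tokens):
    """Count central body parts in a list of upper-case tokens."""
    total = 0
    pending_modifiers = 0  # C/T/L tokens seen since the last body part
    for token in tokens:
        if token in SIMPLE_CENTRAL:
            total += 1
            pending_modifiers = 0
        elif token in SPINE_MODIFIERS:
            pending_modifiers += 1
        elif token == "SPINE":
            # Each preceding modifier is one spinal region;
            # a SPINE without modifiers is assumed to count once.
            total += max(pending_modifiers, 1)
            pending_modifiers = 0
        else:
            # Any other token (e.g. HIP after L) discards pending modifiers,
            # so a laterality L is not mistaken for a lumbar L.
            pending_modifiers = 0
    return total


print(count_central("C T L SPINE".split()))
print(count_central("CHEST ABDOMEN".split()))
```

Because L means both "left" and "lumbar", the sketch only treats it as a spinal modifier when SPINE follows before any other body part.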

### Peripheral region
- Body parts after BILATERAL are not consistently plural, such as BILATERAL ANKLE, FINGER, ELBOW

### Counting conflicts

- Modifiers and noun plurality can disagree, such as L FEET or BILATERAL HAND. In these cases, the modifier takes precedence.
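
The precedence rule can be sketched as below. This is a hedged illustration, not the `src.parser` implementation: `PLURALS`, `SIDE_COUNT` and `count_peripheral` are hypothetical names, the vocabulary is deliberately tiny, and the sketch assumes a modifier keeps applying to every peripheral part that follows it until another modifier appears.

```python
# Assumed singular/plural vocabulary; a real table would cover every body part.
PLURALS = {
    "HANDS": "HAND",
    "FEET": "FOOT",
    "KNEES": "KNEE",
    "ANKLES": "ANKLE",
    "WRISTS": "WRIST",
}
SINGULARS = set(PLURALS.values())

# Number of sides implied by each laterality modifier.
SIDE_COUNT = {"L": 1, "R": 1, "BILATERAL": 2, "BILAT": 2}


def count_peripheral(tokens):
    """Count peripheral body parts, letting the modifier override plurality."""
    total = 0
    sides = None  # sides implied by the most recent modifier, if any
    for token in tokens:
        if token in SIDE_COUNT:
            sides = SIDE_COUNT[token]
        elif token in PLURALS or token in SINGULARS:
            if sides is not None:
                # Modifier takes precedence: L FEET -> 1, BILATERAL HAND -> 2.
                total += sides
            else:
                # No modifier: fall back on plurality.
                total += 2 if token in PLURALS else 1
    return total


print(count_peripheral("L FEET".split()))
print(count_peripheral("BILATERAL HAND".split()))
```

Under these assumptions an unmodified plural such as HANDS counts as two parts, while the same word after L or R counts as one.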

In [52]:
from src.parser import parse

In [53]:
parts_sum = xrs['ce_description'].apply(parse)

In [54]:
result = pd.concat([xrs, parts_sum.rename('parts_sum')], axis=1)

In [55]:
result.to_csv("result.csv")

## Analysis

### Percentage of multi-part studies

In [69]:
result[result['parts_sum'] > 5]

Unnamed: 0,ce_description,reporter,ex_type,RegRep,@ TrainingYear,@ RptGrp,@Phase,@YearMonth,parts_sum
44,WRISTS HANDS FEET,DOK,XR,KSH,1.0,Plain Film & Fluoro,Phase 1,201401,6
124,SHOULDERS HANDS HIPS L SPINE,IAC,XR,KSH,1.0,Plain Film & Fluoro,Phase 1,201401,7
605,CHEST R HUM FOREARM FEMUR TIB FIB L FOR,JAC2,XR,KSH,1.0,Plain Film & Fluoro,Phase 1,201403,6
1078,BILATERAL WRISTS HANDS FEET,AS,XR,KSH,1.0,Plain Film & Fluoro,Phase 1,201409,6
2839,CHEST BILATERAL WRISTS HANDS FEET,DL,XR,KSH,2.0,Plain Film & Fluoro,Phase 1,201503,7
...,...,...,...,...,...,...,...,...,...
114602,BILATERAL HANDS WRISTS FEET,RRY,XR,RRY,3.0,Plain Film & Fluoro,Phase 1,202009,6
115068,BILATERAL HANDS WRISTS FEET,RRY,XR,RRY,3.0,Plain Film & Fluoro,Phase 1,202009,6
116022,CHEST BILATERAL HANDS WRISTS FEET,RRY,XR,RRY,3.0,Plain Film & Fluoro,Phase 1,202009,7
116262,BILAT SHOULDER ELBOWS WRIST HANDS KNEES,RRY,XR,RRY,3.0,Plain Film & Fluoro,Phase 1,202009,10


9360 / 75000 = 12.5%
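
The cell above lists only studies with more than five parts. A share of multi-part studies could be computed along the following lines, assuming "multi-part" means `parts_sum > 1`. The `toy` frame is a made-up stand-in for `result`, so the printed figure is illustrative only.

```python
import pandas as pd

# Toy stand-in for the `result` frame; the values are illustrative only.
toy = pd.DataFrame({"parts_sum": [1, 1, 2, 6, 1, 3, 1, 1]})

# Boolean mask of studies that cover more than one body part.
multi_part = toy["parts_sum"] > 1

# The mean of a boolean Series is the fraction of True values.
share = multi_part.mean()
print(f"{multi_part.sum()} / {len(toy)} = {share:.1%}")
```

Replacing `toy` with `result` applies the same calculation to the parsed data.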

### Group by year and registrar

In [62]:
registrars = result.groupby("RegRep")

In [63]:
cur = registrars.size()

In [64]:
new = registrars['parts_sum'].sum()

In [65]:
pd.DataFrame({"current": cur, "new": new, "change": new/cur})

RegRep,current,new,change
ADB,1350,1554,1.151111
ALO,3986,4903,1.230055
BDR,360,430,1.194444
CCF,2664,2954,1.108859
DMR,4004,4987,1.245504
ESG,3261,3779,1.158847
GJL,9409,10894,1.157828
IVM,1917,2422,1.263432
KSH,5799,6870,1.184687
LJS,3919,4528,1.155397


In [None]:
grouped = result.groupby(["RegRep", "@ TrainingYear", "@ RptGrp"], sort=False).size()

In [None]:
pd.DataFrame(grouped).sort_values(["RegRep", "@ TrainingYear"]).head(30)

# Result

- There are 74849 plain radiographs recorded from 2013-12 to 2020-10, each of which may include one or more body parts.
- There are 383 unique identifiers initially and 287 after filtration.

# Limitations

- Unintended filtration of studies with typos

# Appendix

In [58]:
def write_raw_descriptors():
    # Write every unique descriptor to a file, one per line.
    # Relies on `descriptor_set` being defined earlier in the notebook session.
    with open("raw_descriptors.txt", "w") as f:
        for term in descriptor_set:
            f.write(f"{term}\n")

def write_descriptor_stats(s, fn):
    # Write a descriptor frequency table (term, count) to "<fn>.txt".
    with open(f"{fn}.txt", "w") as f:
        # Series.iteritems() was removed in pandas 2.0; items() is the equivalent.
        for term, count in s.value_counts().items():
            f.write(f"{term}\t\t{count}\n")

In [17]:
pd.Series(raw_descriptors).to_csv("ce_descriptions.txt", index=False, header=False)

In [59]:
write_descriptor_stats(filtered_descriptor_series, "cleaned_descriptor_stats")