non-target structure indicator: ROOT_structure~_PRVxx_spatial
target structure indicator: TARGETclassifier#_imaging#_structure_DOSEfx-cropping

ROI: structureIndicator^custom

# TG-263 Renaming

## Background
Before a cancer patient is treated with radiation, they undergo a CT scan on which clinicians outline structures such as targets (tumors) and organs to aid in treatment planning. The American Association of Physicists in Medicine (AAPM) Task Group Report 263 (TG-263) provides guidelines for standardized structure nomenclature. TG-263 provides a [spreadsheet](https://www.aapm.org/pubs/reports/RPT_263_Supplemental/TG263_Nomenclature_Worksheet_20170815.xls) listing some valid names according to their guidelines.

### Non-target Nomenclature
A TG-263-compliant non-target name is in the format `[<root>_]<structure>[~][_PRV<xx>][_<spatial>]`, where:
* `<root>` is a standard category root. See constant `STD_CAT_ROOTS` below.
* `<structure>` is the main part of the name and the only required part.
* `~` indicates a partial structure.
* `_PRV<xx>` indicates a uniform expansion of the structure by `<xx>` mm, where `<xx>` is two digits.
* `_<spatial>` is a spatial categorization. See constant `SPTL_CATS` below.

As an example, the organ name `A_Aorta` consists of a standard category root, `A`, and the main part, `Aorta`. The organ name `Brainstem~_PRV05` is a 5 mm expansion of the partial brainstem.
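
The non-target format above can be sketched as a regular expression. This is a minimal illustration, not part of TG-263 itself: the pattern `NON_TARGET_RE` and the helper `parse_non_target` are hypothetical names, the two constant lists are copied from the **Constants** section of this notebook, and the pattern checks only the form of a name, not whether `<structure>` is a name the report actually lists.

```python
import re
from typing import Dict, Optional

# Copied from the Constants section of this notebook
SPTL_CATS = ['L', 'R', 'A', 'P', 'I', 'S', 'RUL', 'RLL', 'RML', 'LUL', 'LLL', 'NAdj', 'Dist', 'Prox']
STD_CAT_ROOTS = ['A', 'V', 'LN', 'CN', 'Glnd', 'Bone', 'Musc', 'Spc', 'VB', 'Sinus']

# Hypothetical pattern: one named group per part of the format
NON_TARGET_RE = re.compile(
    r'^(?:(?P<root>' + '|'.join(STD_CAT_ROOTS) + r')_)?'  # [<root>_]
    r'(?P<structure>[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*?)'      # <structure>, as few underscore-separated pieces as possible
    r'(?P<partial>~)?'                                     # [~]
    r'(?:_PRV(?P<prv>\d{2}))?'                             # [_PRV<xx>]
    r'(?:_(?P<spatial>' + '|'.join(SPTL_CATS) + r'))?$'    # [_<spatial>]
)


def parse_non_target(name: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the parts of a non-target name, or None if the name does not fit the format"""
    match = NON_TARGET_RE.match(name)
    return match.groupdict() if match else None


print(parse_non_target('A_Aorta'))
print(parse_non_target('Brainstem~_PRV05'))
```

Parts that are absent from a name come back as `None`, so `parse_non_target('A_Aorta')` has `root='A'` and `structure='Aorta'` with every other part `None`.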

### Target Nomenclature
A target name matches the format `<target>[<classifier>][<number>][_<image modality><number>][_<structure>][_(<cGy dose>|<Gy dose>Gy)[x<fractions>]][-<crop>]`, where:
* `<target>` is the target type.
* `<classifier>` is a target subtype.
* `<number>` is the number of that target type/subtype in the structure set.
* `<image modality><number>` gives the number of the image that the target is drawn on, in a sequence of images of that modality.
* `<structure>` is a valid non-target name.
* `<cGy dose>` is a radiation prescription dose in cGy, `<Gy dose>` is a prescription in Gy, and `<fractions>` is a number of fractions (treatments) in the prescription. If a number of fractions is provided, the dose is the fractional prescription (dose per treatment), not the total prescription.
* `<crop>` is a two-digit mm amount that the target is cropped back from the external (body) structure.

For example, `PTVp_CT1` is a primary PTV on the first CT image in a CT series; `CTV_Lung_L_5000` is a CTV for a left lung prescription of 5000 cGy; and `GTV2_2Gyx35-03` is the second GTV (perhaps for a case with more than one tumor) in a 35-fraction plan with 2 Gy per fraction, cropped back 3 mm from the external structure. Realistically, most targets have a couple of these name parts, at most.
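
The target format can be sketched the same way. The pattern `TARGET_RE` and the helper `parse_target` below are hypothetical, and several choices are assumptions made for illustration: the target types are limited to the four that appear in this notebook (`GTV`, `CTV`, `ITV`, `PTV`), any run of lowercase letters is accepted as a classifier, and the image modalities are assumed to be `CT`, `MR`, or `PT`. Check these against the report before relying on them.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern: one named group per part of the target format
TARGET_RE = re.compile(
    r'^(?P<target>GTV|CTV|ITV|PTV)'                          # <target> (assumed subset of target types)
    r'(?P<classifier>[a-z]+)?'                               # [<classifier>] (any lowercase run, an assumption)
    r'(?P<number>\d+)?'                                      # [<number>]
    r'(?:_(?P<modality>CT|MR|PT)(?P<image_number>\d+))?'     # [_<image modality><number>] (assumed modalities)
    r'(?:_(?P<structure>[A-Za-z]\w*?))?'                     # [_<structure>], as short as possible
    r'(?:_(?:(?P<dose_gy>\d+(?:\.\d+)?)Gy|(?P<dose_cgy>\d+))'  # [_(<cGy dose>|<Gy dose>Gy)
    r'(?:x(?P<fractions>\d+))?)?'                            #   [x<fractions>]]
    r'(?:-(?P<crop>\d{2}))?$'                                # [-<crop>]
)


def parse_target(name: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the parts of a target name, or None if the name does not fit the format"""
    match = TARGET_RE.match(name)
    return match.groupdict() if match else None


for example in ['PTVp_CT1', 'CTV_Lung_L_5000', 'GTV2_2Gyx35-03']:
    print(example, parse_target(example))
```

The `<structure>` group is matched lazily so that a trailing dose such as `_5000` is read as a dose rather than absorbed into the structure name.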

## Analysis Overview
It would be helpful to be able to automatically rename a structure to its closest match in the TG-263 spreadsheet. Note that this spreadsheet is anything but exhaustive, and it only includes basic names without many modifiers. This notebook is an investigation into various strategies that could be used in an automated renaming task.
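
To make "closest match" concrete, here is a minimal baseline sketch using only the standard library's `difflib`. It is not one of the strategies investigated in this notebook; the helper `closest_name` and the short `valid_names` list are hypothetical stand-ins for the spreadsheet's name column.

```python
from difflib import SequenceMatcher
from typing import List


def closest_name(name: str, valid_names: List[str]) -> str:
    """Return the valid name with the highest case-insensitive similarity ratio to `name`"""
    return max(valid_names, key=lambda valid: SequenceMatcher(None, name.lower(), valid.lower()).ratio())


# Hypothetical stand-in for the spreadsheet's names
valid_names = ['A_Aorta', 'A_Carotid', 'SpinalCord', 'Cavity_Oral']

print(closest_name('carotid', valid_names))
print(closest_name('aorta', valid_names))
```

A whole-string comparison like this ignores the structure of a TG-263 name, which is one reason the rest of the notebook looks at individual name parts.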

### Constants
First, define some helpful constants. `SPTL_CATS` and `STD_CAT_ROOTS` are spatial categorizations and standard category roots, respectively, from TG-263.

In [23]:
TG263_FILEPATH = 'TG263_Nomenclature_Worksheet_20170815.xls'

SPTL_CATS = ['L', 'R', 'A', 'P', 'I', 'S', 'RUL', 'RLL', 'RML', 'LUL', 'LLL', 'NAdj', 'Dist', 'Prox']
STD_CAT_ROOTS = ['A', 'V', 'LN', 'CN', 'Glnd', 'Bone', 'Musc', 'Spc', 'VB', 'Sinus']

## ETL
Next, read in the TG-263 spreadsheet data.

In [24]:
tg263 = pd.read_excel(TG263_FILEPATH)
tg263.head()

Unnamed: 0,Target Type,Major Category,Minor Category,Anatomic Group,N Characters,TG263-Primary Name,TG-263-Reverse Order Name,Description,FMAID,Unnamed: 9
0,Anatomic,Artery,Aorta,Thorax,7,A_Aorta,Aorta_A,Aorta,3734.0,
1,Anatomic,Artery,Aorta,Thorax,11,A_Aorta_Asc,Asc_Aorta_A,Ascending Aorta,3736.0,
2,Anatomic,Artery,Brachiocephalic,Thorax,15,A_Brachiocephls,Brachiocephls_A,Brachiocephalic Artery,3932.0,
3,Anatomic,Artery,Carotid,Head and Neck,9,A_Carotid,Carotid_A,Common Carotid Artery,3939.0,
4,Anatomic,Artery,Carotid,Head and Neck,11,A_Carotid_L,L_Carotid_A,Carotid Artery,4058.0,


The data needs some cleaning. Keep only the first six columns, which drops the `TG-263-Reverse Order Name`, `Description`, `FMAID`, and trailing `Unnamed` columns, and rename the `TG263-Primary Name` column to `Name`.

In [35]:
tg263 = tg263.iloc[:, :6].rename(columns={'TG263-Primary Name': 'Name'})
tg263.head()

Unnamed: 0,Target Type,Major Category,Minor Category,Anatomic Group,N Characters,Name
0,Anatomic,Artery,Aorta,Thorax,7,A_Aorta
1,Anatomic,Artery,Aorta,Thorax,11,A_Aorta_Asc
2,Anatomic,Artery,Brachiocephalic,Thorax,15,A_Brachiocephls
3,Anatomic,Artery,Carotid,Head and Neck,9,A_Carotid
4,Anatomic,Artery,Carotid,Head and Neck,11,A_Carotid_L


Standardize `Target Type` values of `Non_Anatomic` to `Non-Anatomic`, and `Anatomic Group` values of `Limbs` to `Limb`, and correct a few obvious `Name` typos.

In [32]:
tg263.loc[tg263['Target Type'] == 'Non_Anatomic', 'Target Type'] = 'Non-Anatomic'
tg263.loc[tg263['Anatomic Group'] == 'Limbs', 'Anatomic Group'] = 'Limb'

tg263.loc[tg263['Name'] == 'Colon_PTVxx', 'Name'] = 'Colon_PRVxx'
tg263.loc[tg263['Name'] == 'LN_lliac_Int_R', 'Name'] = 'LN_Iliac_Int_R' 

Remove structures that are sums or differences of two other structures (e.g., `Lungs-CTV` or `Jejunum_Ileum`). Use a helper function, `is_sum_or_diff`, that categorizes a name as a sum/difference if both operands (before and after the `-` or `_`) are valid names in the spreadsheet.

In [41]:
def is_sum_or_diff(name: str) -> bool:
    """Determines whether a structure name represents a sum or difference of two other structures
    
    A difference is in the format <name 1>-<name 2>, a sum as <name 1>_<name 2>, where both <name 1> and <name 2> (or their left-sided `_L` versions) are valid names in the `Name` column. Only the first `-` or `_` in the name is tried as the operator.
    
    Arguments
    ---------
    name: The ROI name
    
    Returns
    -------
    True if the name is a sum/difference, False otherwise
    
    Examples
    --------
    is_sum_or_diff('E-PTVxxxx') -> False
    is_sum_or_diff('Lung_L-CTV') -> True
    """
    name_vals = set(tg263['Name'])

    def is_valid(part: str) -> bool:
        # An operand counts if it, or its left-sided variant, is in the spreadsheet
        return part in name_vals or part + '_L' in name_vals

    for sep in ('-', '_'):  # '-' could be a difference, '_' could be a sum
        if sep in name:
            pt_1, pt_2 = name.split(sep, 1)  # Split at the first separator only
            if is_valid(pt_1) and is_valid(pt_2):
                return True
    return False

In [45]:
len_b4 = len(tg263)
tg263 = tg263.loc[~tg263['Name'].apply(is_sum_or_diff)]
print('Number of names:\n\tBefore removing sums/differences: ' + str(len_b4) + '\n\tAfter: ' + str(len(tg263)))

Number of names:
	Before removing sums/differences: 698
	After: 698


## Clustering Name Parts
For the first part of the analysis, attempt to cluster parts of a TG-263-compliant name. The goal is to divide each name into the parts discussed in the **Background** section above.

### ETL
Name parts are separated by underscores or changes from lowercase to uppercase. Using the helper function `custom_split`, create a unique set of parts from all names in the dataset.

In [9]:
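def custom_split(name: str) -> List[str]:
    """Split the structure name into its component parts
    
    Arguments
    ---------
    name: The structure name

    Returns
    -------
    The parts of the name, split at separators (-, +, _, &, whitespace) and at changes from lowercase to uppercase
    """
    parts = []
    for part in re.split(r'[-+_&\s]', name):
        matches = re.findall(r'[A-Z][a-z]+|[A-Z]+s?', part)
        # Keep an all-uppercase or all-lowercase part whole, unless the regex already returns it
        if (part.isupper() or part.islower()) and part not in matches:
            parts.append(part)
        parts.extend(matches)
    return parts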

In [46]:
import re
from typing import Dict, List

import matplotlib.pyplot as plt

import nltk
from nltk.corpus import stopwords

import numpy as np

import pandas as pd

from sklearn.cluster import KMeans

from strsimpy.cosine import Cosine
from strsimpy.damerau import Damerau
from strsimpy.jaro_winkler import JaroWinkler
from strsimpy.levenshtein import Levenshtein
from strsimpy.longest_common_subsequence import LongestCommonSubsequence
from strsimpy.metric_lcs import MetricLCS
from strsimpy.ngram import NGram
from strsimpy.normalized_levenshtein import NormalizedLevenshtein
from strsimpy.optimal_string_alignment import OptimalStringAlignment
from strsimpy.qgram import QGram
from strsimpy.weighted_levenshtein import WeightedLevenshtein

pd.options.display.max_rows = 999

In [12]:
def normalized_pos(l: str) -> Dict[str, float]:
    """Calculates the "percentile" positions of each character in a string
    
    Arguments
    ---------
    l: The string whose characters to calculate normalized positions for
    
    Returns
    -------
    A dictionary with the characters as keys and the normalized positions as values
    
    Example
    -------
    normalized_pos('TG-263') -> {'T': 0.0, 'G': 0.2, '-': 0.4, '2': 0.6, '6': 0.8, '3': 1.0}
    """
    length = len(l)
    d = {}
    for i, item in enumerate(l):
        d[item] = 0 if length == 1 else i / (length - 1)
    return d

In [20]:
def mode_prop(grp):
    """Finds the most common first name part in a group of names, and its relative frequency

    Arguments
    ---------
    grp: A Series of structure names

    Returns
    -------
    A one-element Series whose index is the most common first part (the text before the first underscore) and whose value is the proportion of names that start with it
    """
    start_parts = grp.str.split('_').apply(lambda l: l[0])
    cts = start_parts.value_counts(normalize=True)  # Sorted from most to least frequent
    return cts.iloc[[0]]

In [10]:
name_parts = pd.Series(np.unique([part for parts in tg263['Name'].apply(custom_split) for part in parts]))
name_parts.head()

0                    A
1           Acetabulum
2          Acetabulums
3                  Act
4              Adrenal
5                  Air
6                  All
7                 Anal
8                 Anus
9                Aorta
10              Aortic
11              Apical
12            Appendix
13           Arytenoid
14                 Asc
15           Ascending
16              Atrium
17                  Ax
18              Azygos
19                 Bag
20                Base
21                 Bed
22                Bile
23             Bladder
24                Body
25               Bolus
26                Bone
27               Boost
28               Bowel
29            Brachial
30         Brachioceph
31       Brachiocephls
32        Brachiocephs
33               Brain
34           Brainstem
35              Breast
36             Breasts
37          Bronchpulm
38         Bronchpulms
39            Bronchus
40                Bulb
41                   C
42                  C1
43         

In [428]:
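def extract_nums(name):
    """Split a structure name into parts, keeping each number (integer or decimal) as its own part"""
    # finditer returns an iterator, so build a list to allow indexing and len()
    matches = list(re.finditer(r'\d*\.?\d+|\d+\.?\d*', name))
    if not matches:
        return custom_split(name)
    parts = []
    for i, match in enumerate(matches):
        # Text between the previous number (or the start of the name) and this number
        before_part = name[:match.start()] if i == 0 else name[matches[i - 1].end():match.start()]
        parts.extend(custom_split(before_part))
        parts.append(match.group())
        if i == len(matches) - 1:
            # Text after the last number
            after_part = name[match.end():]
            parts.extend(custom_split(after_part))
    return parts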

In [19]:
name_parts = tg263['Name'].str.split('_').apply(normalized_pos)
name_parts = pd.DataFrame(pd.DataFrame(list(name_parts), index=None).mean()).reset_index()
name_parts = name_parts.rename(columns={'index': 'Name Part', 0: 'Avg Normalized Position'}).sort_values('Name Part')
name_parts['Length'] = name_parts['Name Part'].str.len()
for col in tg263.columns[:5]:
    name_parts['Is ' + col] = name_parts['Name Part'].isin(tg263[col])
name_parts

Unnamed: 0,Name Part,Avg Normalized Position,Length,Is Target Type,Is Major Category,Is Minor Category,Is Anatomic Group,Is N Characters
0,A,0.0,1,False,False,False,False,False
23,Acetabulum,0.0,10,False,False,False,False,False
24,Acetabulums,0.0,11,False,False,False,False,False
62,Act,1.0,3,False,False,False,False,False
151,Adrenal,0.5,7,False,False,True,False,False
25,AirWay,0.0,6,False,False,False,False,False
365,All,1.0,3,False,False,False,False,False
79,Anal,1.0,4,False,False,False,False,False
28,Anus,0.0,4,False,False,True,False,False
1,Aorta,0.75,5,False,False,True,False,False


In [21]:
mode_start_freqs = tg263.groupby('Major Category')['Name'].apply(mode_prop).sort_values(ascending=False)
mode_start_freqs = mode_start_freqs.reset_index().rename(columns={'level_1': 'Name Part', 'Name': 'Freq'})
mode_start_freqs['Starts Most in Major Category'] = mode_start_freqs['Freq'] > 0.5
old_name_parts = name_parts.copy()
name_parts = old_name_parts.merge(mode_start_freqs, how='left', on='Name Part')
name_parts.head()

Unnamed: 0,Name Part,Avg Normalized Position,Length,Is Target Type,Is Major Category,Is Minor Category,Is Anatomic Group,Is N Characters,Major Category,Freq,Starts Most in Major Category
0,A,0.0,1,False,False,False,False,False,Artery,1.0,True
1,Acetabulum,0.0,10,False,False,False,False,False,,,
2,Acetabulums,0.0,11,False,False,False,False,False,,,
3,Act,1.0,3,False,False,False,False,False,,,
4,Adrenal,0.5,7,False,False,True,False,False,,,


In [295]:
n_clusters = 20
clustered_name_parts = name_parts.copy()
# Name parts with no match in the merge above have NaN here; treat them as False
clustered_name_parts['Starts Most in Major Category'] = clustered_name_parts['Starts Most in Major Category'].fillna(False)
X = clustered_name_parts[['Avg Normalized Position', 'Is Target Type', 'Is Minor Category', 'Is Anatomic Group', 'Starts Most in Major Category']]
model = KMeans(n_clusters=n_clusters)
model.fit(X)
y_hat = model.predict(X)
clustered_name_parts['Cluster'] = y_hat
clustered_name_parts = clustered_name_parts.sort_values(['Cluster', 'Name Part'])
# Print the name parts assigned to each cluster
for cluster in range(n_clusters):
    names = clustered_name_parts.loc[clustered_name_parts['Cluster'] == cluster, 'Name Part']
    print(f'{cluster}:\t{", ".join(names)}')

0:	A, Bladder, Bolus, CTV, Carina, Cervix, Chestwall, Cricopharyngeus, Diaphragm, E-PTV, Esophagus, GTV, Gallbladder, Glnd, Glottis, Hardpalate, IDL, ITV, LN, Laryngl, Lig, Lips, Liver, Musc, PTV, Perineum, Spc, SpinalCanal, Stomach, Valve
1:	Acetabulum, Acetabulums, AirWay, Bag, BileDuct, Bone, BoneMarrow, Boost, BrachialPlex, BrachialPlexs, Breasts, CN, Canal, Cartlg, Cavernosum, Cavity, Cecum, Cist, CribriformPlate, Dens, Duodenum, Edema, Eval, Eyes, Femurs, Foley, Genitals, GreatVes, GrowthPlate, Hemispheres, Hippocampi, Hypothalmus, Ileum, Jejunum, Kidneys, Knee, Larynx, Leads, Lung, Lungs, Malleus, Markers, Mediastinum, Nasalconcha, Nasopharynx, Nose, Nrv, OpticChiasm, OpticNrv, Ovaries, Ovary, Pacemaker, Palate, PancJejuno, Parametrium, Parotids, PenileBulb, Penis, Pericardium, Pons, Postop, Preop, Proc, ProstateBed, Prosthesis, PubicSymphys, Rectal, Rectum, Retinas, Rib01, Rib02, Rib03, Rib04, Rib05, Rib06, Rib07, Rib08, Rib09, Rib10, Rib11, Rib12, SacralPlex, Scalp, Scar, Scro

In [254]:
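# Restore the Series of unique name parts (the clustering cells reassigned `name_parts` to a DataFrame)
name_parts = pd.Series(np.unique([part for parts in tg263['Name'].apply(custom_split) for part in parts]))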

In [431]:
# Maps a noncompliant name to the TG-263 name it should be renamed to
test_cases = {
    'aorta': 'A_Aorta',
    'carotid': 'A_Carotid',
    'oral cav': 'Cavity_Oral',
    'Vocal Cords': 'VocalCords',
    'c2': 'VB_C2',
    'cord': 'SpinalCord',
    'Ribs': 'Rib',
}

In [440]:
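def ins_cost(char):
    """Cost of inserting `char`"""
    #if not char.isalpha():
        #return 0.125
    return 1

def del_cost(char):
    """Cost of deleting `char`"""
    return 1

def sub_cost(char_a, char_b):
    """Cost of substituting `char_a` with `char_b`

    At 3, a substitution costs more than a deletion plus an insertion (2), so the cheapest edit path never uses one
    """
    VOWELS = 'aeiouy'
    #if char_a.lower() in VOWELS and char_b in VOWELS:
        #return 1.75
    #if char_a.lower() == char_b.lower():
        #return 0
    return 3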

In [452]:
tests = ['carotid', 'aorta', 'oral cav', 'Vocal Cords', 'c2', 'cord', 'Ribs']

metrics = {#'Levenshtein': Levenshtein(), # Min # of insertions, deletions, and replacements
           #'Normalized Levenshtein': NormalizedLevenshtein(),  # 1 - Lev / longer length
           'Weighted Levenshtein': WeightedLevenshtein(substitution_cost_fn=sub_cost, insertion_cost_fn=ins_cost, deletion_cost_fn=del_cost),
           #'Damerau-Levenshtein': Damerau(),
           #'Optimal string alignment': OptimalStringAlignment(),
           #'Jaro-Winkler': JaroWinkler(),
           #'Longest Common Subsequence': LongestCommonSubsequence(),
           #'Metric LCS': MetricLCS(),
           #'2-gram': NGram(2),
           #'3-gram': NGram(3),
           #'4-gram': NGram(4),
          }
# Build one DataFrame per (test part, metric) pair and concatenate once at the end;
# DataFrame.append was removed in pandas 2.0
sim_dfs = []
for test in tests:
    for test_part in custom_split(test):
        for name, obj in metrics.items():
            # Distance from every TG-263 name part to this part of the test name
            measures = name_parts.apply(obj.distance, args=(test_part,))
            new_df = pd.DataFrame({'TG-263 Name Part': name_parts, 'Measure': measures})
            new_df['Incorrect Name'] = test
            new_df['Incorrect Name Part'] = test_part
            new_df['Metric'] = name
            sim_dfs.append(new_df)
sim = pd.concat(sim_dfs, ignore_index=True)
sim.head()

Unnamed: 0,TG-263 Name Part,Measure,Incorrect Name,Incorrect Name Part,Metric
0,A,5.0,Ribs,Ribs,Weighted Levenshtein
1,Acetabulum,12.0,Ribs,Ribs,Weighted Levenshtein
2,Acetabulums,11.0,Ribs,Ribs,Weighted Levenshtein
3,Act,7.0,Ribs,Ribs,Weighted Levenshtein
4,Adrenal,11.0,Ribs,Ribs,Weighted Levenshtein


In [453]:
most_similar = sim.loc[sim.groupby(['Incorrect Name', 'Incorrect Name Part', 'Metric'])['Measure'].idxmin()]
most_similar.reset_index(drop=True).set_index(['Incorrect Name', 'Incorrect Name Part', 'TG-263 Name Part'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Measure,Metric
Incorrect Name,Incorrect Name Part,TG-263 Name Part,Unnamed: 3_level_1,Unnamed: 4_level_1
Ribs,Ribs,Rib,1.0,Weighted Levenshtein
Vocal Cords,Cords,Cords,0.0,Weighted Levenshtein
Vocal Cords,Vocal,Vocal,0.0,Weighted Levenshtein
aorta,aorta,Aorta,2.0,Weighted Levenshtein
c2,c2,C2,2.0,Weighted Levenshtein
carotid,carotid,Carotid,2.0,Weighted Levenshtein
cord,cord,Cord,2.0,Weighted Levenshtein
oral cav,cav,Sclav,2.0,Weighted Levenshtein
oral cav,oral,Oral,2.0,Weighted Levenshtein


In [462]:
from itertools import permutations
most_similar.groupby('Incorrect Name').apply(lambda grp: ['_'.join(perm) for perm in permutations(grp['Incorrect Name Part'])])

Incorrect Name
Ribs                               [Ribs]
Vocal Cords    [Cords_Vocal, Vocal_Cords]
aorta                             [aorta]
c2                                   [c2]
carotid                         [carotid]
cord                               [cord]
oral cav             [cav_oral, oral_cav]
dtype: object

In [464]:
# Preprocessing
tg263_names = tg263['Name'].str.lower()
tg263_names.head()

0            a_aorta
1        a_aorta_asc
2    a_brachiocephls
3          a_carotid
4        a_carotid_l
Name: Name, dtype: object

In [8]:
import nltk
# Token generation
# BOW
#nltk.download('punkt')
name = 'oral cav'
name = re.sub(r'\W|\s+', ' ', name)
nltk.word_tokenize(name)

['oral', 'cav']