# Term Classification

To discover relationships between the records and find common themes, it would help to know which words used in the descriptions are medical in nature. Further detail, such as whether a term is an ingredient in a medical recipe, a body part or a notable person, would also be helpful.

The fuzzy nature of what constitutes a medical term, together with the project's very limited timescale, means that automating this classification work is unlikely to be the best approach. Instead, having a human subject expert classify the terms should give good accuracy for a predictable amount of time, so long as the number of terms is not too large.

In [1]:
import re

# We'll use the Natural Language Toolkit (NLTK) for text tokenisation and lemmatisation
import nltk
import pandas as pd

In [2]:
columns = ['title', 'summary', 'material', 'date_start', 
           'date_end', 'width', 'height', 'columns', 'lines']

data = pd.read_json('../medical-data/genizah-medical.json', orient='index')[columns]
data.index.name = 'classmark'
data.head()

classmark,title,summary,material,date_start,date_end,width,height,columns,lines
MS-OR-01080-00001-00063,Medical,"Pharmacopoeia, containing diagrams and symbols...",paper,0500-01-01,1899-12-31,14.3,21.2,1.0,21.0
MS-OR-01080-00001-00072,Medical,"Discussion of various medical treatments, regi...",vellum,0500-01-01,1899-12-31,16.8,36.4,1.0,22.0
MS-OR-01080-00001-00081,Medical,"Medical work on the composition of the body, c...",paper,0500-01-01,1899-12-31,16.8,25.4,1.0,12.0
MS-OR-01080-00001-00087,Medical,Recto: a short medical recipe. Verso: a respon...,paper,1213-01-01,1233-12-31,,,1.0,5.0
MS-OR-01080-00002-00070,Medical,Autograph draft of a medical work by Moses Mai...,paper,1100-01-01,1199-12-31,22.8,31.5,1.0,35.0


In [3]:
# A few words are single-quoted with apostrophes (e.g. "'mace'" in MS-TS-AS-00167-00137),
# which is tokenised to "'mace", "'". Replacing the apostrophes with curly single quotes avoids this.
# regex=True is passed explicitly because pandas 2.0+ treats the pattern as a literal string by default.
summaries = data.summary.str.replace(r"'([a-zA-Z]+)(?: +(?:[a-zA-Z]+))*'", r'‘\1’', regex=True)
summaries = summaries.str.lower()
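
The effect of this substitution can be checked on a single string with the standard-library `re` module. This is only a minimal sketch: it reuses the pattern from the cell above, but the sample sentence is made up for illustration and is not taken from the data.

```python
import re

# Same pattern as the cell above; the sample text is a made-up illustration.
QUOTED = r"'([a-zA-Z]+)(?: +(?:[a-zA-Z]+))*'"

sample = "Recipe mentioning 'mace' and pepper."

# Swap the apostrophes around the quoted word for curly quotes, then lower-case.
converted = re.sub(QUOTED, r"‘\1’", sample).lower()
print(converted)
```

With curly quotes in place, the tokeniser no longer has a plain apostrophe to split off from the quoted word.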

In [4]:
# Detect and ignore tokens which are of no interest
def ignore(token):
    return bool(len(token) < 2 or re.search(r'[0-9]|^\.+$', token))

# Tokenise the text, drop ignored tokens and lemmatise the rest (e.g. "diagrams" -> "diagram")
lem = nltk.WordNetLemmatizer()
def tokenise(text, ignore=ignore):
    return [lem.lemmatize(w) for w in nltk.word_tokenize(text) if not ignore(w)]
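
As a quick illustration of which tokens the filter discards, the sketch below re-declares `ignore` so that it runs on its own, then applies it to a handful of hand-picked tokens. The sample tokens are hypothetical examples, not drawn from the data.

```python
import re

def ignore(token):
    # Same rule as above: skip tokens shorter than two characters,
    # tokens containing a digit, and tokens made up solely of full stops.
    return bool(len(token) < 2 or re.search(r'[0-9]|^\.+$', token))

# Hand-picked sample tokens (assumed for illustration only).
samples = ['a', ',', '12th', '...', 'mace', 'of']
kept = [t for t in samples if not ignore(t)]
print(kept)  # ['mace', 'of']
```

Note that short function words such as "of" pass the filter, so they still reach the list handed to the human classifier.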

In [5]:
tokens = pd.DataFrame.from_records(
    ((classmark, token) for classmark, summary in summaries.items() for token in tokenise(summary)),
    columns='classmark token'.split()
)
tokens.head()

,classmark,token
0,MS-OR-01080-00001-00063,pharmacopoeia
1,MS-OR-01080-00001-00063,containing
2,MS-OR-01080-00001-00063,diagram
3,MS-OR-01080-00001-00063,and
4,MS-OR-01080-00001-00063,symbol


In [6]:
unique_tokens = pd.DataFrame({'tokens': tokens.token.unique()})
unique_tokens.head()

,tokens
0,pharmacopoeia
1,containing
2,diagram
3,and
4,symbol


We've only got ~4000 unique tokens, making human classification of tokens practical.

In [7]:
len(unique_tokens)

3817

Create the all-tokens.csv file.

In [8]:
unique_tokens.sort_values('tokens').to_csv('../medical-data/all-tokens.csv', index=False)
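
Once a subject expert has annotated the file, the labels could be joined back onto the per-record tokens. The sketch below is an assumption about how that might look, not part of the original workflow: the `category` column, its values, the classmarks and the small in-memory tables are all hypothetical stand-ins for the real annotated file and the `tokens` table built earlier.

```python
import io

import pandas as pd

# Hypothetical annotated version of all-tokens.csv. The 'category' column and
# its labels are assumptions; an empty value means "not a term of interest".
classified_csv = io.StringIO(
    "tokens,category\n"
    "mace,ingredient\n"
    "liver,body part\n"
    "containing,\n"
)
classified = pd.read_csv(classified_csv)

# Stand-in for the (classmark, token) table; the classmarks are made up.
tokens = pd.DataFrame({
    'classmark': ['EXAMPLE-1', 'EXAMPLE-1', 'EXAMPLE-2'],
    'token': ['mace', 'containing', 'liver'],
})

# Attach each token's category, then keep only the tokens that were classified.
labelled = tokens.merge(
    classified, left_on='token', right_on='tokens', how='left'
).drop(columns='tokens')
medical = labelled.dropna(subset=['category'])
print(medical)
```

A left join keeps every row of the token table, so unclassified tokens stay visible in `labelled` until they are dropped explicitly.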