# Medical Terms Dataset

Following on from [02-term-classification.ipynb](./02-term-classification.ipynb), we have a [manually produced spreadsheet](../medical-data/mel-words-tagged_2018-07-20.csv) with one token per row and several columns, each marking with an `x` whether the row's token belongs to that column's type. This layout is designed for ease of human data entry, so we'll restructure it for easier analysis.

In [1]:
import pandas as pd



In [2]:
raw_tags = pd.read_csv('../medical-data/mel-words-tagged_2018-07-20.csv')

In [3]:
raw_tags.head()

,Token,Is medical,Is medical.1,Token issues,Authorities,Ingredients,Ailments,Body Parts,Treatments
0,!,,,,,,,,
1,',,,,,,,,
2,'book,,,,,,,,
3,'mace,,,x,,x,,,
4,'s,,,,,,,,


In [4]:
tidy_tags = raw_tags.rename(columns={
        'Token': 'token',
        'Token issues': 'is_junk',
        'Authorities': 'is_authority',
        'Ingredients': 'is_ingredient',
        'Ailments': 'is_ailment',
        'Body Parts': 'is_body_part',
        'Treatments': 'is_treatment'}
    ).set_index('token')

flag_columns = [c for c in tidy_tags.columns if c.startswith('is_')]

# Keep only the flag columns and convert the 'x' marks to booleans
tidy_tags = tidy_tags[flag_columns] == 'x'

tidy_tags.head()

token,is_junk,is_authority,is_ingredient,is_ailment,is_body_part,is_treatment
!,False,False,False,False,False,False
',False,False,False,False,False,False
'book,False,False,False,False,False,False
'mace,True,False,True,False,False,False
's,False,False,False,False,False,False


In [5]:
ignored_tags = ['medical', 'treat', 'treatment', 'treating', 'simple']

is_junk_token = tidy_tags['is_junk']
is_empty_row = ~tidy_tags[flag_columns].any(axis='columns')

# Ignore bad tokens and non-medical (all-false) rows
tidy_tags = tidy_tags[~(is_junk_token | is_empty_row)]
tidy_tags = tidy_tags.drop(columns='is_junk')
tidy_tags = tidy_tags.drop(index=ignored_tags)
tidy_tags.head()

token,is_authority,is_ingredient,is_ailment,is_body_part,is_treatment
abcess,False,False,True,False,False
abcesses,False,False,True,False,False
abdomen,False,False,False,True,False
abdominal,False,False,False,True,False
abnormal,False,False,True,False,False


Every remaining token is labeled with exactly one category, so we can represent each token as a (token, type) pair.

In [6]:
(tidy_tags.sum(axis=1) == 1).all()

True
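
If the check above had returned `False`, the next step would be to find the offending tokens. A minimal sketch of that, assuming a small made-up frame (`toy_tags` and its rows are illustrative and not taken from the spreadsheet):

```python
import pandas as pd

# Hypothetical stand-in for tidy_tags; 'balm' is deliberately tagged twice.
toy_tags = pd.DataFrame(
    {
        'is_ingredient': [True, False, True],
        'is_ailment': [False, True, False],
        'is_treatment': [False, False, True],
    },
    index=pd.Index(['mace', 'abcess', 'balm'], name='token'),
)

# Count the True flags per token; exactly one is expected.
types_per_token = toy_tags.sum(axis='columns')

# Tokens with zero or several types would need manual review.
violations = toy_tags[types_per_token != 1]
print(violations.index.tolist())
```

Listing the violating rows, rather than only asserting, makes it easier to go back and correct the spreadsheet by hand.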

In [7]:
token_has_type = tidy_tags.stack()
token_has_type.index.names = ['token', 'type']
token_has_type.head(15)

token     type         
abcess    is_authority     False
          is_ingredient    False
          is_ailment        True
          is_body_part     False
          is_treatment     False
abcesses  is_authority     False
          is_ingredient    False
          is_ailment        True
          is_body_part     False
          is_treatment     False
abdomen   is_authority     False
          is_ingredient    False
          is_ailment       False
          is_body_part      True
          is_treatment     False
dtype: bool

In [8]:
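# Keep only the True (token, type) pairs, then move 'type' out of the index
typed_tokens = pd.DataFrame(token_has_type[token_has_type]).reset_index(level='type')[['type']]
# regex=True is needed on pandas 2.0+, where str.replace matches literally by default
typed_tokens['type'] = typed_tokens['type'].str.replace(r'^is_', '', regex=True)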
typed_tokens.head()

token,type
abcess,ailment
abcesses,ailment
abdomen,body_part
abdominal,body_part
abnormal,ailment
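
Because each remaining token has exactly one `True` flag, the same (token, type) table could also be built without stacking. The sketch below is an alternative, not what this notebook ran; `toy_tags` and `toy_types` are made-up names standing in for `tidy_tags` and `typed_tokens`.

```python
import pandas as pd

# Hypothetical stand-in for tidy_tags, with one True flag per row.
toy_tags = pd.DataFrame(
    {
        'is_ailment': [True, False],
        'is_body_part': [False, True],
    },
    index=pd.Index(['abcess', 'abdomen'], name='token'),
)

# idxmax returns the label of the first maximum in each row; with exactly
# one True per row, that is the column name of the token's type.
toy_types = toy_tags.idxmax(axis='columns').rename('type').to_frame()

# Strip the 'is_' prefix; regex=True is required on pandas 2.0+.
toy_types['type'] = toy_types['type'].str.replace(r'^is_', '', regex=True)
print(toy_types)
```

This shortcut is only safe while the one-type-per-token check holds: for a row with several `True` flags, `idxmax` would silently keep the first one.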


In [9]:
typed_tokens.to_csv('../medical-data/token-types.csv')
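
When the file is loaded again later, passing `index_col='token'` restores the index that was written out. A minimal round-trip sketch, assuming made-up rows and an in-memory buffer in place of the real file (`toy_typed`, `buffer`, and `reloaded` are illustrative names):

```python
import io

import pandas as pd

# Hypothetical stand-in for typed_tokens.
toy_typed = pd.DataFrame(
    {'type': ['ailment', 'body_part']},
    index=pd.Index(['abcess', 'abdomen'], name='token'),
)

# Write to an in-memory buffer instead of '../medical-data/token-types.csv'.
buffer = io.StringIO()
toy_typed.to_csv(buffer)
buffer.seek(0)

# index_col='token' turns the first CSV column back into the index.
reloaded = pd.read_csv(buffer, index_col='token')
print(reloaded.loc['abdomen', 'type'])
```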