# Parsing and Pre-Processing GENIA Text Data

The data is provided in XML format, with the part of speech (POS) of each word identified. Biological entities are additionally labelled with their entity type. We parse this file, tag each word with its labelled POS, and assign each word a label indicating whether or not it is a biological entity.
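
To make the markup concrete, here is a minimal sketch that parses a hand-written fragment modelled on the corpus. The fragment is invented for illustration, and the sketch uses the standard-library `xml.etree.ElementTree` rather than the BeautifulSoup parser used in this notebook; only the tag and attribute names (`sentence`, `cons`, `w`, `c`, `lex`, `sem`) are taken from the data.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; not copied from the corpus.
snippet = (
    '<sentence>'
    '<cons lex="IL-2_gene" sem="G#DNA_domain_or_region">'
    '<w c="NN">IL-2</w> <w c="NN">gene</w>'
    '</cons> '
    '<w c="VBZ">requires</w> '
    '<cons lex="CD28" sem="G#protein_molecule"><w c="NN">CD28</w></cons>'
    '<w c=".">.</w>'
    '</sentence>'
)

sentence = ET.fromstring(snippet)

# Every <w> element carries its POS tag in the 'c' attribute.
word_pos = [(w.text, w.get('c')) for w in sentence.iter('w')]

# A <cons> element with a 'sem' attribute marks a biological entity;
# 'lex' holds its name with spaces replaced by underscores.
entities = [c.get('lex') for c in sentence.iter('cons') if c.get('sem')]

print(word_pos)
print(entities)
```

Note that the multi-word entity `IL-2_gene` appears only in `lex`, while the POS tags are attached to the individual words inside it.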

In [None]:
!curl -o 'data' https://raw.githubusercontent.com/Shkev/Biomedical-Named-Entity-Recognition-SVM/main/data/GENIAcorpus3.02.merged.xml

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15.4M  100 15.4M    0     0  8457k      0  0:00:01  0:00:01 --:--:-- 8453k


In [None]:
import bs4
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm
import numpy as np
from typing import List, Dict, Tuple, Set

data_file_path = '/content/data'

## Extracting Words and Named Entities from Texts

Here we parse the XML data file to get the POS of each word and to determine whether each word has a biological meaning. Words with a biological meaning are labelled 1; all other words are labelled -1.

In [None]:
connectors = ('(AND ', '(OR ', '(AND/OR ', '(TO ', '(BUT_NOT ', '(AS_WELL_AS ', '(VERSUS ', '(NEITHER_NOR ', '(THAN ', '(AND_NOT ', '(NOT_ONLY_BUT_ALSO ')
def parse_name_connectors(text: str) -> List[str]:
    """
    Parses through nested connectors in labelled entities and returns the separated entities

    Args:
        text (str): The string to be parsed for individual entities
      
    Returns:
        A list containing all entities that are contained in the given text string
    """
    if text.startswith(connectors):
        # we can just split on spaces bc spaces are replaced by '_' in all named entity entries
        # so splitting won't split the names of the entities
        components = text.split()
        components = [comp for comp in components if comp+' ' not in connectors]
        components = [comp[:-1] if comp.endswith(')') else comp for comp in components]
        return components
    return [text.strip()]

def get_words_part_of_speech(sentence: bs4.element.Tag, words_pos_dict: Dict[str, Set[str]]) -> None:
    """
    Extracts words in given sentence and their respective part of speech.
    Updates words_pos_dict (passed by reference) to contain these words and their POS.

    Args:
        sentence (bs4.element.Tag): The sentence to extract the words and POS from.
        words_pos_dict (dict): The dictionary to update with the words and POS.
    """
    words = sentence.find_all('w')
    fragments = []
    for word in words:
        text = word.text.strip()

        if word['c'] == '*':
            fragments.append(text)
            continue
        fragments.append(text)
        text = ''.join(fragments)
        fragments.clear()

        if text not in words_pos_dict:
            words_pos_dict[text] = set()
        # words that can be multiple POS indicated with '|' between each POS tag
        pos = word['c'].split('|')
        words_pos_dict[text].update(pos)

def get_named_bio_entities(sentence: bs4.element.Tag, named_bio_entities: Set[str]) -> None:
    """
    Extracts named biological entities in given sentence.
    Updates named_bio_entities (passed by reference) to contain these entities.

    Args:
        sentence (bs4.element.Tag): The sentence to extract the named biological entities from.
        named_bio_entities (set): The set to update with the named biological entities.
    """
    bio_entities = sentence.find_all('cons')
    for ent in bio_entities:
        if ent.has_attr('lex'):
            text = ent['lex'].strip()
            # if it has semantic meaning labelled it is a bio named entity
            if ent.has_attr('sem'):
                text = parse_name_connectors(text)
                named_bio_entities.update(text)
        else:
            print("This constituent has no text: ", ent)

def extract_data_from_xml(data_file_path: str) -> Tuple[Dict[str, Set[str]], Set[str]]:
    """
    Extracts the words, their POS tags, and the biological named entities
    from the xml file.
    Note that punctuation marks are kept in the data as was done in the presented paper.
    However, they are not attached to any of the words.
    
    Args:
        data_file_path (str): The path to the xml file.
        
    Returns:
        A dictionary containing all words and their POS and 
        a set of all biomedical named entities in the texts.
    """
    words_pos_dict = dict() # dict of set
    named_bio_entities = set()
    with open(data_file_path, "r") as f:
        xml = f.read()
        print("Parsing XML file with BeautifulSoup...")
        soup = BeautifulSoup(xml, 'xml')
        print("Parsing Done. Now extracting info...")
        articles = soup.find_all('article')
        for art in tqdm(articles):
            sentences = art.find_all('sentence')
            for sent in sentences:
                get_words_part_of_speech(sent, words_pos_dict)
                get_named_bio_entities(sent, named_bio_entities)
    print("Done")
    return words_pos_dict, named_bio_entities
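
As a worked example of the connector handling: coordinated entities are stored in `lex` as prefix expressions such as `(AND a b)`. The sketch below is a hypothetical variant of `parse_name_connectors` (the name `split_coordinated_entities` and the sample strings are made up for illustration). Unlike the function above, which drops at most one trailing `)` per token, it strips closing parentheses only while they are unbalanced. That choice is an assumption about how nested expressions, and names that themselves contain parentheses, should be treated.

```python
from typing import List

CONNECTORS = ('(AND', '(OR', '(AND/OR', '(TO', '(BUT_NOT', '(AS_WELL_AS',
              '(VERSUS', '(NEITHER_NOR', '(THAN', '(AND_NOT',
              '(NOT_ONLY_BUT_ALSO')

def split_coordinated_entities(text: str) -> List[str]:
    """Hypothetical variant: split a prefix expression into entity names."""
    text = text.strip()
    if not text.startswith(tuple(c + ' ' for c in CONNECTORS)):
        return [text]
    names = []
    for token in text.split():
        if token in CONNECTORS:
            continue  # operator token, not an entity name
        # Drop only the ')' characters that close a connector, i.e. those
        # without a matching '(' inside the token itself.
        while token.endswith(')') and token.count(')') > token.count('('):
            token = token[:-1]
        names.append(token)
    return names

print(split_coordinated_entities('(AND c-fos_gene c-jun_gene)'))
print(split_coordinated_entities('(AND a_gene (OR b_gene c_gene))'))
print(split_coordinated_entities('IL-2_receptor_(IL-2R)'))
```

Swapping this in for the original function could change the number of extracted entities, so any counts reported in this notebook would need to be recomputed.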


In [None]:
words_pos_dict, named_bio_ent = extract_data_from_xml(data_file_path)

Parsing XML file with BeautifulSoup...
Parsing Done. Now extracting info...


  0%|          | 0/2000 [00:00<?, ?it/s]

This constituent has no text:  <cons sem="G#cell_line"><cons sem="G#cell_line"><w c="NN">THP-1</w></cons> <w c="JJ">mononuclear</w> <w c="NN">phagocyte</w> <w c="NN">cell</w> <w c="NN">line</w></cons>
This constituent has no text:  <cons sem="G#cell_line"><w c="NN">THP-1</w></cons>
Done


The paper says that each named entity is labelled with the POS extracted from the dataset. However, the dataset only assigns POS tags to single words, not to multi-word entities. I assume that the authors restricted themselves to single-word biological named entities so that the POS could be determined accurately and efficiently. There may be ways to determine the POS of a multi-word phrase (perhaps with external libraries), but I am not aware of any, so I proceed with single-word entities only. Also, many of the multi-word entities have single-word entities nested inside them, which in some cases can stand in for the whole entity for the purposes of the data.

In [None]:
single_word_bio_ent = named_bio_ent.intersection(set(words_pos_dict.keys()))
len(single_word_bio_ent)

4805

## Labels

Each data point is assigned a 1 if it is a biological entity and a -1 if it is not.

In [None]:
# maps each word to its label
word_labels = {word : 1 if word in single_word_bio_ent else -1 for word in words_pos_dict.keys()}
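
Because only some of the words receive the label 1 (4805 single-word entities here), it can help to check the class balance before training. A minimal sketch on made-up toy data; `toy_words` and `toy_entities` are hypothetical stand-ins for `words_pos_dict` and `single_word_bio_ent`:

```python
from collections import Counter

toy_words = {'IL-2': {'NN'}, 'gene': {'NN'}, 'and': {'CC'}, 'CD28': {'NN'}}
toy_entities = {'IL-2', 'CD28', 'NF-kappa_B'}  # multi-word name never matches a word

# Same labelling rule as above: 1 for entity words, -1 otherwise.
toy_labels = {w: 1 if w in toy_entities else -1 for w in toy_words}
balance = Counter(toy_labels.values())

print(toy_labels)
print(balance[1] / len(toy_labels))  # share of words labelled as entities
```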

## Features

We use the extracted data (POS of named entities and the entities themselves) to construct the feature vectors for each word which will be given to the model to train on.

As described in the paper, we use both the POS of the words as well as structural features for the words.

In [None]:
# stored as dict of dict. Maps word to feature dict
# for each sub-dict, key is index of feature and value is 1
# features that word does not have are excluded (all values in the dicts should be 1)
word_features = dict()

### POS Features

In [None]:
pos_tags = set().union(*words_pos_dict.values())
num_pos_feat = len(pos_tags)

Not sure why, but there are more POS tags in this dataset than the study reports extracting from theirs. This may be due to a discrepancy in the version of the dataset being used (the version is not specified in the paper).

A unique index is assigned to each POS tag. The value of the index does not matter as long as it is consistent throughout the rest of the code.

These indices are used to assign features to each word. If a word has a certain POS, the entry `idx:1` is added to its feature dict (where `idx` is the index corresponding to that POS tag).

In [None]:
pos_indices = {pos : idx for (idx, pos) in enumerate(pos_tags)}
pos_indices

{'.': 0,
 '': 1,
 'SYM': 2,
 'NNPS': 3,
 'XT': 4,
 'NNS': 5,
 'WP$': 6,
 '-': 7,
 ':': 8,
 'MD': 9,
 'CC': 10,
 'VBD': 11,
 ')': 12,
 'VBP': 13,
 'NN': 14,
 'PDT': 15,
 "''": 16,
 'WRB': 17,
 'TO': 18,
 'RP': 19,
 'IN': 20,
 'FW': 21,
 'JJS': 22,
 'CD': 23,
 ',': 24,
 'PP': 25,
 'LS': 26,
 '(': 27,
 'WDT': 28,
 'CT': 29,
 'JJR': 30,
 'POS': 31,
 'RBR': 32,
 'PRP': 33,
 '``': 34,
 'VBN': 35,
 'WP': 36,
 'RB': 37,
 'NNP': 38,
 'PRP$': 39,
 'VBG': 40,
 'RBS': 41,
 'DT': 42,
 'EX': 43,
 'N': 44,
 'JJ': 45,
 'VBZ': 46,
 'VB': 47}
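
Iterating over a Python set of strings does not guarantee the same order from one interpreter session to the next, so the indices above can change between runs. If reproducible indices are wanted (for example, to reuse a saved model), one option is to sort the tags first. A minimal sketch with a made-up subset of tags; this is an alternative, not what the notebook does:

```python
toy_tags = {'NN', 'JJ', 'VBZ', 'CC', '.'}

# sorted() fixes the order, so each tag always gets the same index.
toy_pos_indices = {pos: idx for idx, pos in enumerate(sorted(toy_tags))}
print(toy_pos_indices)
```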

In [None]:
# adding features
word_features = {word: {pos_indices[pos]: 1 for pos in pos_set} for (word, pos_set) in words_pos_dict.items()}
word_features

{'IL-2': {14: 1},
 'gene': {14: 1},
 'expression': {14: 1},
 'and': {10: 1},
 'NF-kappa': {14: 1},
 'B': {14: 1},
 'activation': {14: 1},
 'through': {20: 1},
 'CD28': {14: 1},
 'requires': {46: 1},
 'reactive': {45: 1},
 'oxygen': {14: 1},
 'production': {14: 1},
 'by': {20: 1},
 '5-lipoxygenase': {14: 1, 45: 1},
 '.': {0: 1},
 'Activation': {14: 1},
 'of': {20: 1},
 'the': {42: 1, 29: 1},
 'surface': {14: 1},
 'receptor': {14: 1},
 'provides': {46: 1},
 'a': {14: 1, 40: 1, 4: 1, 42: 1, 26: 1, 29: 1},
 'major': {45: 1},
 'costimulatory': {14: 1, 45: 1},
 'signal': {14: 1, 13: 1, 47: 1, 45: 1},
 'for': {20: 1},
 'T': {14: 1, 45: 1},
 'cell': {14: 1},
 'resulting': {40: 1},
 'in': {37: 1, 20: 1, 21: 1, 14: 1, 19: 1},
 'enhanced': {35: 1, 11: 1, 45: 1},
 'interleukin-2': {14: 1},
 '(': {27: 1},
 ')': {1: 1, 12: 1},
 'proliferation': {14: 1},
 'In': {20: 1, 21: 1},
 'primary': {45: 1},
 'lymphocytes': {5: 1},
 'we': {33: 1},
 'show': {13: 1, 47: 1},
 'that': {28: 1, 20: 1, 42: 1},
 'ligat

### Structural Features

The paper uses a set of 22 structural features for the words (i.e., word contains certain characters, contains numbers, etc.). We implement the rules for each of these features and compute them for each word. Each feature has a unique index assigned to it (*starting from 48* since the last POS feature has index 47) which will be used to assign the feature to the word in the feature dict.

In [None]:
def capital_letter_indices(s: str) -> List[int]:
  return [idx for idx in range(len(s)) if s[idx].isupper()]

def str_contains_digit(s: str) -> bool:
  return any(ch.isdigit() for ch in s)

def str_contains_letter(s: str) -> bool:
  return any(ch.isalpha() for ch in s)

def all_capital(s: str) -> bool:
  # feat 11
  return s.isalpha() and s.isupper()

def all_lower(s: str) -> bool:
  # feat 20
  return s.isalpha() and s.islower()


In [None]:
# testing rule functions
assert all_capital('aaaa') == False
assert all_capital('AhEllO') == False
assert all_capital('HELLO') == True
assert all_capital('HELLO3') == False

assert capital_letter_indices("hELlo") == [1, 2]
assert capital_letter_indices('hello') == []

assert str_contains_digit('b5b') == True
assert str_contains_digit('hello') == False

assert str_contains_letter('b5b') == True
assert str_contains_letter('hello') == True
assert str_contains_letter('12345') == False

In [None]:
num_structural_feat = 22

# dict containing functions that return true if given string has the i-th feature
# indices for features matches those given in paper
structural_rules = {
    1: lambda s: s.isdigit(),
    2: lambda s: s.count('/') == 1,
    3: lambda s: s.count('/') == 2,
    4: lambda s: '$' in s,
    5: lambda s: '%' in s,
    6: lambda s: ',' in s,
    7: lambda s: '.' in s,
    8: lambda s: ':' in s,
    9: lambda s: '-' in s,
    10: lambda s: str_contains_letter(s) and str_contains_digit(s) and s.count('/') > 0,
    11: all_capital,
    12: lambda s: 0 in capital_letter_indices(s) and '.' in s,
    13: lambda s: len(capital_letter_indices(s)) > 0 and '.' in s,
    14: lambda s: str_contains_letter(s) and '$' in s,
    15: lambda s: str_contains_letter(s) and '.' in s,
    16: lambda s: len(capital_letter_indices(s)) > 0,
    17: lambda s: str_contains_letter(s) and str_contains_digit(s),
    18: lambda s: 0 in capital_letter_indices(s), # first letter capital
    19: lambda s: any(idx > 0 and idx < len(s)-1 for idx in capital_letter_indices(s)), # capital letter in middle of word
    20: all_lower,
    21: lambda s: str_contains_letter(s) and '-' in s
}
# 22nd feature assigned to any word that doesn't have any of the other 21 features
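
To see how a rule number turns into a feature index, here is a minimal sketch that applies a subset of the rules to two words. It assumes 48 POS features, as in the index table above, so rule `i` lands at index `48 + (i - 1)`; `toy_rules` and `structural_indices` are illustrative names only.

```python
NUM_POS_FEAT = 48  # assumption: matches the 48 POS tags listed earlier

def has_letter(s: str) -> bool:
    return any(ch.isalpha() for ch in s)

def has_digit(s: str) -> bool:
    return any(ch.isdigit() for ch in s)

def capital_positions(s: str) -> list:
    return [i for i, ch in enumerate(s) if ch.isupper()]

# Subset of the rules, keyed by their rule number.
toy_rules = {
    9: lambda s: '-' in s,
    16: lambda s: len(capital_positions(s)) > 0,
    17: lambda s: has_letter(s) and has_digit(s),
    18: lambda s: 0 in capital_positions(s),
    19: lambda s: any(0 < i < len(s) - 1 for i in capital_positions(s)),
    20: lambda s: s.isalpha() and s.islower(),
    21: lambda s: has_letter(s) and '-' in s,
}

def structural_indices(word: str) -> list:
    """Return the sorted feature indices of the rules that fire for word."""
    return sorted(NUM_POS_FEAT + (i - 1) for i, rule in toy_rules.items() if rule(word))

print(structural_indices('IL-2'))  # [56, 63, 64, 65, 66, 68]
print(structural_indices('gene'))  # [67]
```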

In [None]:
for word in word_features.keys():
  word_features[word].update({num_pos_feat + (idx - 1) : 1 for (idx, f) in structural_rules.items() if f(word)})
  # POS features occupy indices 0..num_pos_feat-1, so a max index below num_pos_feat
  # means none of the first 21 structural rules fired
  if max(word_features[word].keys()) < num_pos_feat:
    word_features[word].update({num_pos_feat + (num_structural_feat - 1): 1})

In [None]:
word_features

{'IL-2': {14: 1, 56: 1, 63: 1, 64: 1, 65: 1, 66: 1, 68: 1},
 'gene': {14: 1, 67: 1},
 'expression': {14: 1, 67: 1},
 'and': {10: 1, 67: 1},
 'NF-kappa': {14: 1, 56: 1, 63: 1, 65: 1, 66: 1, 68: 1},
 'B': {14: 1, 58: 1, 63: 1, 65: 1},
 'activation': {14: 1, 67: 1},
 'through': {20: 1, 67: 1},
 'CD28': {14: 1, 63: 1, 64: 1, 65: 1, 66: 1},
 'requires': {46: 1, 67: 1},
 'reactive': {45: 1, 67: 1},
 'oxygen': {14: 1, 67: 1},
 'production': {14: 1, 67: 1},
 'by': {20: 1, 67: 1},
 '5-lipoxygenase': {14: 1, 45: 1, 56: 1, 64: 1, 68: 1},
 '.': {0: 1, 54: 1},
 'Activation': {14: 1, 63: 1, 65: 1},
 'of': {20: 1, 67: 1},
 'the': {42: 1, 29: 1, 67: 1},
 'surface': {14: 1, 67: 1},
 'receptor': {14: 1, 67: 1},
 'provides': {46: 1, 67: 1},
 'a': {14: 1, 40: 1, 4: 1, 42: 1, 26: 1, 29: 1, 67: 1},
 'major': {45: 1, 67: 1},
 'costimulatory': {14: 1, 45: 1, 67: 1},
 'signal': {14: 1, 13: 1, 47: 1, 45: 1, 67: 1},
 'for': {20: 1, 67: 1},
 'T': {14: 1, 45: 1, 58: 1, 63: 1, 65: 1},
 'cell': {14: 1, 67: 1},
 'res

### Creating Final Label and Feature Sets

The SVM model requires the features as a list of dicts and the labels as a list of values. We convert the label and feature dictionaries into two such lists, ordered so that the i-th entry of each refers to the same word (keeping this alignment easy was the entire purpose of using dictionaries in the previous sections).
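
For inspection, or for libraries that expect dense input, a sparse feature dict can be expanded into a fixed-length 0/1 vector. A minimal sketch, assuming 70 features in total (48 POS plus 22 structural); `to_dense` is a hypothetical helper, not part of libsvm:

```python
NUM_FEATURES = 48 + 22  # assumption: 48 POS tags + 22 structural features

def to_dense(sparse: dict, size: int = NUM_FEATURES) -> list:
    """Expand {index: 1} into a list with 1 at each listed index, else 0."""
    row = [0] * size
    for idx, value in sparse.items():
        row[idx] = value
    return row

dense = to_dense({14: 1, 67: 1})  # the features shown above for 'gene'
print(sum(dense), dense[14], dense[67])
```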

In [None]:
y = []
x = []
for word in words_pos_dict.keys():
  y.append(word_labels[word])
  x.append(word_features[word])
y = np.asarray(y)
x = np.asarray(x)

### Tests

In [None]:
# make sure features constructed correctly
unique_feat_entries = set().union(*[d.values() for d in word_features.values()])
assert unique_feat_entries == {1} or len(unique_feat_entries) == 0, "features contain entries other than 1. Note that if they are 0, they should not be included"

## Training SVM Model

We train an SVM model (using libsvm) to identify words that are related to the biological domain.

### Dataset Split

The data is split into a training and a testing dataset. The training set is used to determine the best model parameters and train the model. The testing set is used to determine the performance of the trained model on unseen data.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
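
The call above draws a different random split on every run and does not force the two sets to have the same share of entities. A minimal standard-library sketch of a seeded, stratified split is given below to illustrate the idea; `stratified_split` is a hypothetical helper. In scikit-learn the same effect is usually obtained by passing `stratify=y` and `random_state` to `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=0):
    """Return (train_idx, test_idx) with roughly equal class shares in both."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed makes the split repeatable
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

toy_y = [1] * 10 + [-1] * 40
train_idx, test_idx = stratified_split(toy_y)
print(len(train_idx), len(test_idx))
print(sum(toy_y[i] == 1 for i in test_idx))  # positives in the test set
```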

### K-Fold Cross-validation

We use 10-fold cross-validation to determine the best model from the training data. Once the best model is determined, it is evaluated on the testing data.
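
Accuracy alone can look good on imbalanced labels, which is why precision and recall on the entity class are tracked as well. As a reminder of what those two numbers mean for labels in {1, -1}, here is a small standard-library sketch; `precision_recall` is an illustrative helper and the toy labels are made up. The notebook itself relies on scikit-learn's `precision_score` and `recall_score`.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

toy_true = [1, 1, 1, 1, -1, -1, -1, -1]
toy_pred = [1, 1, -1, -1, 1, -1, -1, -1]
print(precision_recall(toy_true, toy_pred))
```

In the toy data, two of the three predicted entities are correct (precision 2/3) and two of the four true entities are found (recall 1/2).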

In [None]:
from libsvm.svmutil import *
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score, recall_score

In [None]:
skf = StratifiedKFold(n_splits=10, shuffle=True)
skf.get_n_splits(x_train, y_train)

acc = []
precisions = []
recalls = []
# -s 0: C-SVC, -t 0: linear kernel, -c 2: cost parameter C = 2
param = svm_parameter('-s 0 -t 0 -c 2')
for i, (train_idx, val_idx) in enumerate(skf.split(x_train, y_train)):
  print(f'Fold {i}:')
  prob = svm_problem(y_train[train_idx], x_train[train_idx])
  m = svm_train(prob, param)
  pred_label, pred_acc, pred_val = svm_predict(y_train[val_idx], x_train[val_idx], m)
  prec = precision_score(y_train[val_idx], pred_label, average='binary')
  rec = recall_score(y_train[val_idx], pred_label, average='binary')
  print(f'Precision = {prec}')
  print(f'Recall = {rec}')

  acc.append(pred_acc[0])
  precisions.append(prec)
  recalls.append(rec)
  print('')

avg_acc = sum(acc) / len(acc)
avg_prec = sum(precisions) / len(precisions)
avg_rec = sum(recalls) / len(recalls)
print(f'Average Accuracy of model Across all Folds: {avg_acc}')
print(f'Average Precision of model on Biological Entities Across all Folds: {avg_prec}')
print(f'Average Recall of model on Biological Entities Across all Folds: {avg_rec}')

Fold 0:
Accuracy = 84.9462% (1343/1581) (classification)
Precision = 0.7100840336134454
Recall = 0.5

Fold 1:
Accuracy = 84.3137% (1333/1581) (classification)
Precision = 0.68
Recall = 0.5029585798816568

Fold 2:
Accuracy = 84.2505% (1332/1581) (classification)
Precision = 0.6731517509727627
Recall = 0.5118343195266272

Fold 3:
Accuracy = 82.4794% (1304/1581) (classification)
Precision = 0.6196078431372549
Recall = 0.46745562130177515

Fold 4:
Accuracy = 83.8077% (1325/1581) (classification)
Precision = 0.6736401673640168
Recall = 0.4749262536873156

Fold 5:
Accuracy = 86.401% (1366/1581) (classification)
Precision = 0.7421875
Recall = 0.56047197640118

Fold 6:
Accuracy = 85.5696% (1352/1580) (classification)
Precision = 0.7165354330708661
Recall = 0.5384615384615384

Fold 7:
Accuracy = 83.5443% (1320/1580) (classification)
Precision = 0.6547619047619048
Recall = 0.4881656804733728

Fold 8:
Accuracy = 83.5443% (1320/1580) (classification)
Precision = 0.6598360655737705
Recall = 0.47633

In [None]:
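# note: this scores the model from the last fold on the full dataset (x, y),
# which includes its own training data; for a held-out estimate, pass y_test and x_test instead
p_label, p_acc, p_val = svm_predict(y, x, m)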

Accuracy = 84.5571% (19093/22580) (classification)
