In [1]:
import pandas as pd
from tqdm import tqdm

# Showcase of the lazy_nlp_pipeline

## Task definition

Let's say we need to perform some simple rule-based pattern matching. In this example we'll match mentions of Russian [military unit numbers](https://en.wikipedia.org/wiki/Military_Unit_Number) (Russian: Войсковая часть).
The dataset is a collection of texts, mainly in Russian, from the Telegram channel "Ищи своих", which includes many mentions of Russian military units.

The dataset includes 11,000 rows with `msg_id` and `message` columns.

In [2]:
df = pd.read_csv('./msgs_ischi_svoih.csv', usecols=['msg_id', 'message'])
df

Unnamed: 0,msg_id,message
0,4,Гаврилович Роман Александрович и \nВоробьев Де...
1,5,Подписаться | Бот для поиска своих | Резервный...
2,7,Подписаться | Бот для поиска своих | Резервный...
3,38,Плотников Сергей Витальевич \nг. Прокопьевск К...
4,40,Командир танкового батальона 35 мотострелковой...
...,...,...
10995,15640,❗️Во время послания путин процитировал одного ...
10996,15641,❗️Коротко подытожим сказанное \n\nПодписаться ...
10997,15642,❗️Самый предсказуемый спич ever\n\nМемы придум...
10998,15643,❗️Тем временем спецоперация идет по плану\n\n▪...


The task is to extract patterns that contain a 5-digit id (occasionally with a single-letter prefix) preceded by "military unit" (or a few common variations of those words). Here are some examples of successful matches:
- **в/ч 22179** (most common abbreviation)
- **в/ч Л-12265** (id contains single letter prefix)
- **войсковой части 20924** (here the words are inflected according to sentence context)
- **военная часть 91701** (technically incorrect but common usage of word "военная" instead of "войсковая")
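
To make the target concrete, here is a rough regular-expression approximation of the pattern described above. It is only a sketch and not part of either solution below: `UNIT_RE` is a hypothetical name, inflected forms are approximated by word stems instead of lemmas, and the optional token before the id is assumed to be limited to "№".

```python
import re

# Hypothetical regex baseline; stems stand in for lemma-based matching.
UNIT_RE = re.compile(
    r"""
    (?<!\w)                                    # do not start inside a word
    (?:
        в\s*[/\\.]?\s*ч\.?                     # abbreviations such as в/ч, в.ч., вч
      | (?:войсков|военн|воинск)\w*\s+част\w*  # full words in any inflection
    )
    \s*(?:№\s*)?                               # optional whitespace and number sign
    (?P<unit_id>(?:[а-яёa-z]-?)?\d{5})(?!\d)   # optional letter prefix + exactly 5 digits
    """,
    re.IGNORECASE | re.VERBOSE,
)

def find_unit_id(text):
    """Return the first unit id found in `text`, or None."""
    m = UNIT_RE.search(text)
    return m.group("unit_id") if m else None

for example in ["в/ч 22179", "в/ч Л-12265", "войсковой части 20924",
                "военная часть 91701", "в/ч 123456"]:
    print(example, "->", find_unit_id(example))
```

The last example has six digits and is deliberately rejected. A regex like this gets unwieldy quickly, which is one motivation for the token-based patterns used in the rest of the notebook.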

## spaCy solution

spaCy provides a fairly easy way to construct patterns and match them using existing pretrained pipelines. Its major drawbacks are the linearity of those patterns (spaCy has no nested patterns) and execution speed: the full pipeline is applied to every document, whereas the most computationally expensive steps could be run only on documents that were not already filtered out by simpler rules.
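
The idea of filtering with cheap rules first can be sketched with the standard library alone. This is an illustration rather than something the notebook does: `prefilter` and `FIVE_DIGITS` are hypothetical names, and the sketch assumes that a message can only match if it contains a standalone run of exactly five digits.

```python
import re

# Assumption: a matching message must contain a standalone 5-digit run.
FIVE_DIGITS = re.compile(r"(?<!\d)\d{5}(?!\d)")

def prefilter(texts):
    """Return (index, text) pairs for texts that might contain a unit id."""
    return [(i, t) for i, t in enumerate(texts) if FIVE_DIGITS.search(t)]

sample = [
    "Командир в/ч 22179 сообщил",    # candidate: has a 5-digit run
    "Подписаться | Бот для поиска",  # no digits at all
    "Телефон 89001234567",           # 11 digits, not a standalone 5-digit run
]
candidates = prefilter(sample)
print([i for i, _ in candidates])

# Only the candidates would then go through the expensive pipeline, e.g.
# docs = spacy_nlp.pipe(text for _, text in candidates)
```

Keeping the original indices makes it possible to map results back to `msg_id` after the filtered texts have been processed.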

In [3]:
import re

import spacy
from spacy import displacy, Language

In [4]:
# Retokenize to split letters from digits (and word characters from punctuation)
# when there are no spaces in between, e.g. "в/ч02511" -> "в", "/", "ч", "02511".
@Language.factory("re_tokenize")
def re_tokenize(nlp, name):
    return ReTokenize()

class ReTokenize:
    # Zero-width boundaries: digit/non-digit and word/non-word transitions.
    BOUNDARY = re.compile(r'(?<=\d)(?=\D)|(?<=\D)(?=\d)|(?<=\w)(?=\W)|(?<=\W)(?=\w)')

    def __call__(self, doc):
        with doc.retokenize() as retokenizer:
            for token in doc:
                parts = self.BOUNDARY.split(token.text)
                if len(parts) == 1:
                    continue
                # Attach every sub-token to the first sub-token of the split.
                retokenizer.split(token, parts, [(token, 0) for _ in parts])
        return doc
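
The effect of the splitting regex is easiest to see on plain strings, outside of spaCy. The snippet below is a small standalone check that reuses the same expression; `SPLIT_RE` is just a local name for this illustration, and splitting on empty matches requires Python 3.7 or newer.

```python
import re

# Same zero-width boundary expression as in the retokenizer component.
SPLIT_RE = re.compile(r'(?<=\d)(?=\D)|(?<=\D)(?=\d)|(?<=\w)(?=\W)|(?<=\W)(?=\w)')

for raw in ['в/ч02511', 'Л-12265', '22179']:
    print(raw, '->', SPLIT_RE.split(raw))
```

A token that contains no boundary comes back as a single-element list, which is why the component skips it.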

In [5]:
# constructing pattern

spacy_nlp = spacy.load("ru_core_news_sm", exclude=['ner'])
re_tokenizer = spacy_nlp.add_pipe("re_tokenize", "split_num_nonnum", before='tok2vec')

mil_unit_ruler = spacy_nlp.add_pipe("span_ruler", "mil_unit_span_ruler",
                                    config={"annotate_ents": True})
patterns = [
    {"label": "MIL_UNIT", "pattern": [
        {"LOWER": "в"},
        {"NORM": {"IN": ["/", "\\", "."]}},
        {"LOWER": "ч"},
        {"NORM": ".", "OP": "?"},
    ]},
    {"label": "MIL_UNIT", "pattern": [
        {"LOWER": "вч"},
    ]},
    {"label": "MIL_UNIT", "pattern": [
        {"LEMMA": "войсковой"},
        {"LEMMA": "часть"},
    ]},
    {"label": "MIL_UNIT", "pattern": [
        {"LEMMA": "военный"},
        {"LEMMA": "часть"},
    ]},
    {"label": "MIL_UNIT", "pattern": [
        {"LEMMA": "воинский"},
        {"LEMMA": "часть"},
    ]},
]
mil_unit_ruler.add_patterns(patterns)

mil_unit_x_ruler = spacy_nlp.add_pipe("span_ruler", name='mil_unit_x_span_ruler',
                                      config={"overwrite": True, "annotate_ents": True})
patterns = [
    {"label": "MIL_UNIT_X", "pattern": [
        {"ENT_TYPE": "MIL_UNIT", "OP": "+"},
        {"OP": "{0,1}"},
        {"IS_ALPHA": True, "LENGTH": 1, "OP": "?"},
        {"NORM": "-", "OP": "?"},
        {"IS_DIGIT": True, "LENGTH": 5},
    ]},
]
mil_unit_x_ruler.add_patterns(patterns)

In [6]:
%%time
# actual matching

docs = list(tqdm(spacy_nlp.pipe(df.message), total=df.shape[0]))

matches = []
n_matches = 0
for doc, msg_id in zip(docs, df.msg_id):
    have_matches = False
    for i in doc.ents:
        if i.label_ in ['MIL_UNIT_X']:
            have_matches = True
            n_matches += 1
    if not have_matches:
        continue
    matches.append((doc, msg_id))

n_matched_msgs = len(matches)
print(f'{n_matched_msgs=} {n_matches=}')

100%|█████████████████████████████████████| 11000/11000 [02:03<00:00, 89.15it/s]

n_matched_msgs=161 n_matches=172
CPU times: user 2min 3s, sys: 341 ms, total: 2min 3s
Wall time: 2min 3s





In [7]:
%%time
# matching in parallel

docs = list(tqdm(spacy_nlp.pipe(df.message, n_process=-1), total=df.shape[0]))

matches = []
n_matches = 0
for doc, msg_id in zip(docs, df.msg_id):
    have_matches = False
    for i in doc.ents:
        if i.label_ in ['MIL_UNIT_X']:
            have_matches = True
            n_matches += 1
    if not have_matches:
        continue
    matches.append((doc, msg_id))

n_matched_msgs = len(matches)
print(f'{n_matched_msgs=} {n_matches=}')

100%|████████████████████████████████████| 11000/11000 [00:35<00:00, 311.11it/s]


n_matched_msgs=161 n_matches=172
CPU times: user 25.2 s, sys: 672 ms, total: 25.9 s
Wall time: 35.5 s


In [8]:
# Render some results

for doc, msg_id in matches[:4]:
    print(f'{msg_id=}')
    displacy.render(doc, style="ent")
    print('#'*80)

msg_id=4


################################################################################
msg_id=50


################################################################################
msg_id=69


################################################################################
msg_id=83


################################################################################


In [9]:
print(f'{n_matched_msgs=} {n_matches=}')

n_matched_msgs=161 n_matches=172


## lazy_nlp_pipeline solution

`lazy_nlp_pipeline` provides a way to construct nested patterns and match them lazily. In some cases this can significantly speed up matching compared to the spaCy solution.
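
Why laziness can help is illustrated by the toy sketch below. It is not how `lazy_nlp_pipeline` is implemented: `LazyToken`, `expensive_lemmatize` and `match_unit_ids` are hypothetical names, and the lemma table is a stand-in for a real lemmatizer. The point is only that an expensive attribute is computed on demand, after a cheap check on the id token has already passed.

```python
from functools import cached_property

LEMMA_CALLS = 0  # counts how often the "expensive" step actually runs

def expensive_lemmatize(word):
    """Stand-in for a costly lemmatizer (a tiny lookup table here)."""
    global LEMMA_CALLS
    LEMMA_CALLS += 1
    table = {"войсковой": "войсковой", "части": "часть"}
    return table.get(word.lower(), word.lower())

class LazyToken:
    def __init__(self, text):
        self.text = text

    @cached_property
    def lemma(self):
        # Computed only on first access, then cached on the instance.
        return expensive_lemmatize(self.text)

def match_unit_ids(tokens):
    """Yield the index of each 5-digit token preceded by 'войсковой часть'."""
    for i, tok in enumerate(tokens):
        if not (tok.text.isdigit() and len(tok.text) == 5):
            continue  # cheap check first: most tokens stop here
        if i >= 2 and tokens[i - 1].lemma == "часть" and tokens[i - 2].lemma == "войсковой":
            yield i

tokens = [LazyToken(w) for w in "он служил в войсковой части 20924 под Москвой".split()]
hits = list(match_unit_ids(tokens))
print(hits, LEMMA_CALLS)
```

Out of eight tokens, only the two that directly precede the id are ever lemmatized.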

In [10]:
from lazy_nlp_pipeline import NLP, Pattern as P, TokenPattern as TP, WordPattern as WP

In [11]:
# constructing pattern

# "military unit" prefix pattern is built as either of 3 options:
mil_unit_pattern = P(         # 1) abbreviations like "в/ч", "в\ч", "в.ч.", "в.ч", "в ч", etc.
    TP('в', ignore_case=True),
    (TP('/') | TP('\\') | TP('.'))[0:1],
    TP('ч', ignore_case=True),
    TP('.')[0:1],
) | P(                        # 2) single token "вч", "ВЧ", "Вч", "вЧ"
    TP('вч', ignore_case=True),
) | P(                        # 3) full words, matched by lemmas
    WP(lemma='войсковой') | WP(lemma='военный') | WP(lemma='воинский'),
    WP(lemma='часть'),
)

# full pattern consists of "military unit" subpattern + optional one token + id
mil_unit_x_pattern = P(
    mil_unit_pattern,                          # "military unit"
    TP()[0:1],                                 # optional any token, e.g. "№" or anything else
    P(                                         # id
        P(                                         # optional one letter + dash prefix
            TP(isalpha=True, max_len=1),
            TP('-')[0:1],
        )[0:1],
        TP(isnumeric=True, min_len=5, max_len=5),  # 5-digit number

        as_attribute='unit_id',  # save text matched with this subpattern as "unit_id" attribute of full match
    ),
)
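
The `|` and `[0:1]` syntax used above can be supported in plain Python through `__or__` and `__getitem__`. The toy combinators below show one way such nesting could work over a list of string tokens. They are an assumption-laden sketch, not the library's implementation: `Pat`, `Tok`, `Seq`, `Alt` and `Repeat` are hypothetical classes, and they match forward only.

```python
class Pat:
    """Toy pattern: match(tokens, i) yields every end index reachable from i."""
    def __or__(self, other):
        return Alt(self, other)

    def __getitem__(self, s):          # pattern[lo:hi] -> repeat lo..hi times
        return Repeat(self, s.start or 0, s.stop)

class Tok(Pat):
    def __init__(self, text=None, pred=None):
        self.text, self.pred = text, pred

    def match(self, tokens, i):
        if i < len(tokens):
            t = tokens[i]
            if (self.text is None or t.lower() == self.text) and \
               (self.pred is None or self.pred(t)):
                yield i + 1

class Seq(Pat):
    def __init__(self, *parts):
        self.parts = parts

    def match(self, tokens, i, k=0):
        if k == len(self.parts):
            yield i
            return
        for j in self.parts[k].match(tokens, i):
            yield from self.match(tokens, j, k + 1)

class Alt(Pat):
    def __init__(self, a, b):
        self.a, self.b = a, b

    def match(self, tokens, i):
        yield from self.a.match(tokens, i)
        yield from self.b.match(tokens, i)

class Repeat(Pat):
    def __init__(self, inner, lo, hi):
        self.inner, self.lo, self.hi = inner, lo, hi

    def match(self, tokens, i, n=0):
        if n >= self.lo:
            yield i
        if n < self.hi:
            for j in self.inner.match(tokens, i):
                yield from self.match(tokens, j, n + 1)

# Nested pattern mirroring the structure above (abbreviation branches only).
prefix = Seq(Tok('в'), (Tok('/') | Tok('\\') | Tok('.'))[0:1], Tok('ч'), Tok('.')[0:1]) \
    | Seq(Tok('вч'))
unit_id = Seq(
    Seq(Tok(pred=lambda t: len(t) == 1 and t.isalpha()), Tok('-')[0:1])[0:1],
    Tok(pred=lambda t: len(t) == 5 and t.isdigit()),
)
full = Seq(prefix, Tok()[0:1], unit_id)

ends = list(full.match(['в', '/', 'ч', 'Л', '-', '12265'], 0))
print(ends)
```

Because every pattern is an object with the same `match` interface, a sub-pattern such as `prefix` can be reused inside larger patterns, which is the kind of nesting that flat spaCy token patterns lack.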

In [12]:
%%time
# actual matching

lazy_nlp = NLP(project_name='mil_unit_matching')

docs = [lazy_nlp(text) for text in df.message]
for d, msg_id in zip(docs, df.msg_id):
    d.lazy_attributes['msg_id'] = msg_id

matches = list(lazy_nlp.match_patterns([mil_unit_x_pattern], texts=docs, backward=True))

n_matches = len(matches)
n_matched_msgs = len(set(s.doc for s in matches))
print(f'{n_matched_msgs=} {n_matches=}')

n_matched_msgs=161 n_matches=174
CPU times: user 8.45 s, sys: 64 ms, total: 8.52 s
Wall time: 8.52 s


`lazy_nlp_pipeline` doesn't provide parallel processing out of the box, but it can be implemented with the standard `multiprocessing` module:

In [13]:
import multiprocessing

In [14]:
%%time
# matching in parallel

def match(doc):
    return list(mil_unit_x_pattern.match(doc, forward=False))

lazy_nlp = NLP(project_name='mil_unit_matching')

docs = [lazy_nlp(text) for text in df.message]
for d, msg_id in zip(docs, df.msg_id):
    d.lazy_attributes['msg_id'] = msg_id

with multiprocessing.Pool(multiprocessing.cpu_count()) as p:
    sub_matches = p.map(match, docs)

# flatten the per-document lists of matches into a single list
matches = [m for sub in sub_matches for m in sub]

n_matches = len(matches)
n_matched_msgs = len(set(s.doc for s in matches))
print(f'{n_matched_msgs=} {n_matches=}')

n_matched_msgs=161 n_matches=174
CPU times: user 4.76 s, sys: 1.38 s, total: 6.14 s
Wall time: 7.08 s


In [15]:
# Render some results

for m in matches[:4]:  # `m`, not `match`, to avoid shadowing the match() helper defined above
    msg_id = m.doc.msg_id
    print(f'{msg_id=}')
    print(f'match={m!r}')
    displacy.render([{"text": m.doc.text,
                      "ents": [{"start": m.start_char, "end": m.end_char, "label": "MIL_UNIT_X"}]}],
                    style="ent", manual=True)
    print('#'*80)

msg_id=4
match=Span('в/ч02511')[83:91 doc=140100198609248 {unit_id: '02511'}]


################################################################################
msg_id=50
match=Span('в/ч41659')[26:34 doc=140100198836368 {unit_id: '41659'}]


################################################################################
msg_id=69
match=Span('в/ч Л-12265')[37:48 doc=140100198834832 {unit_id: 'Л-12265'}]


################################################################################
msg_id=83
match=Span('в/ч 22179')[27:36 doc=140100198829264 {unit_id: '22179'}]


################################################################################


In [16]:
print(f'{n_matched_msgs=} {n_matches=}')

n_matched_msgs=161 n_matches=174


## Comparison

`lazy_nlp_pipeline` can outperform spaCy in some cases thanks to lazy evaluation, even though, unlike spaCy, it doesn't support multiprocessing out of the box. Furthermore, as shown here, `lazy_nlp_pipeline` can be sped up further with multiprocessing.

Here is a comparison of the wall-clock time needed to process the dataset:

sequential time (parallel time)<br>
spaCy: 123s (36s)<br>
lazy_nlp_pipeline: 9s (7s)<br>
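
The relative speedup follows directly from the wall times reported in the cell outputs. The short calculation below uses the unrounded values from those outputs; the dictionary layout and names are only for illustration.

```python
# Wall times in seconds, copied from the cell outputs above.
wall_time = {
    "spaCy": {"sequential": 123.0, "parallel": 35.5},
    "lazy_nlp_pipeline": {"sequential": 8.52, "parallel": 7.08},
}

speedup = {
    mode: wall_time["spaCy"][mode] / wall_time["lazy_nlp_pipeline"][mode]
    for mode in ("sequential", "parallel")
}
for mode, ratio in speedup.items():
    print(f"{mode}: lazy_nlp_pipeline is about {ratio:.1f}x faster")
```

These ratios describe a single run on one machine and one dataset, so they should be read as indicative rather than as a general benchmark.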

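`lazy_nlp_pipeline` also provides the ability to use nested patterns.

The results are close but not identical: both solutions match 161 messages, with 172 individual matches for spaCy and 174 for `lazy_nlp_pipeline`.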

spaCy, on the other hand, has a massive ecosystem and many more features, both within its rule-based pattern matching and beyond.