# Making SpaCy Pipelines Performant 

Date: 2021-01-01  
Author: Jason Beach  
Categories: Best_Practice, Introduction_Tutorial, Data_Science 
Tags: nlp, development, spacy

<!--eofm-->

Grab the following:

* ~~create env with spacy2~~
* ~~use spacy2 and spacy3 in same notebook~~ => can use in same dir, but not same notebook
* performance
* sentence2vec, doc2vec
* ~~pattern-builder~~

In this post we demonstrate how to install spaCy 2 together with two useful modules:

* pattern-builder
* neuralcoref

In [1]:
from platform import python_version
print( python_version() )

3.9.7


In [2]:
import spacy

In [3]:
spacy.__version__

'2.3.7'

## Linguistics for NLP

## Dependency Matcher

## Pattern Builder

To install the `spacy-pattern-builder` package, try:

```bash
git clone https://github.com/sai-prasanna/spacy-pattern-builder.git
cd spacy-pattern-builder
pipenv install -r requirements.txt
cd ..
pipenv install -e spacy-pattern-builder/
```

In [4]:
# Import a SpaCy model, parse a string to create a Doc object
import en_core_web_sm

text = 'We introduce efficient methods for fitting Boolean models to molecular data.'
nlp = en_core_web_sm.load()
doc = nlp(text)

from spacy_pattern_builder import build_dependency_pattern

In [5]:
# Provide a list of tokens we want to match.
match_tokens = [doc[i] for i in [0, 1, 3]]  # [We, introduce, methods]

''' Note that these tokens must be fully connected. That is,
all tokens must have a path to all other tokens in the list,
without needing to traverse tokens outside of the list.
Otherwise, spacy-pattern-builder will raise a TokensNotFullyConnectedError.
You can get a connected set that includes your tokens with the following: '''
from spacy_pattern_builder import util
connected_tokens = util.smallest_connected_subgraph(match_tokens, doc)
assert match_tokens == connected_tokens  # In this case, the tokens we provided are already fully connected

# Specify the token attributes / features to use
feature_dict = {  # This is equal to the default feature_dict
    'DEP': 'dep_',
    'TAG': 'tag_'
}

# Build the pattern
pattern = build_dependency_pattern(doc, match_tokens, feature_dict=feature_dict)
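
The "fully connected" requirement described in the cell above can be illustrated without spaCy. The following is only a minimal sketch: the `HEADS` list is hand-written for the example sentence (an assumption, not actual parser output), and `is_fully_connected` is a hypothetical helper, not part of `spacy-pattern-builder`.

```python
# Hand-written head indices for the example sentence (an assumption for
# illustration, not actual parser output). HEADS[i] is the index of token i's
# syntactic head; the root points to itself.
WORDS = ['We', 'introduce', 'efficient', 'methods', 'for', 'fitting',
         'Boolean', 'models', 'to', 'molecular', 'data', '.']
HEADS = [1, 1, 3, 1, 3, 4, 7, 5, 5, 10, 8, 1]


def is_fully_connected(token_idxs, heads):
    """Return True if the chosen tokens form a single connected subgraph,
    using only head/child edges between tokens inside the selection."""
    selected = set(token_idxs)
    if not selected:
        return True
    # Undirected adjacency restricted to the selected tokens.
    neighbours = {i: set() for i in selected}
    for i in selected:
        head = heads[i]
        if head != i and head in selected:
            neighbours[i].add(head)
            neighbours[head].add(i)
    # Depth-first search from an arbitrary selected token.
    stack = [min(selected)]
    seen = set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(neighbours[node] - seen)
    return seen == selected


print(is_fully_connected([0, 1, 3], HEADS))  # We, introduce, methods -> True
print(is_fully_connected([0, 3], HEADS))     # We, methods (the linking verb is missing) -> False
```

Selecting only `We` and `methods` fails because the path between them runs through `introduce`, which is outside the selection.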

In [6]:
from pprint import pprint
pprint(pattern)  # In the format consumed by SpaCy's DependencyMatcher:
'''
[{'PATTERN': {'DEP': 'ROOT', 'TAG': 'VBP'}, 'SPEC': {'NODE_NAME': 'node1'}},
 {'PATTERN': {'DEP': 'nsubj', 'TAG': 'PRP'},
  'SPEC': {'NBOR_NAME': 'node1', 'NBOR_RELOP': '>', 'NODE_NAME': 'node0'}},
 {'PATTERN': {'DEP': 'dobj', 'TAG': 'NNS'},
  'SPEC': {'NBOR_NAME': 'node1', 'NBOR_RELOP': '>', 'NODE_NAME': 'node3'}}]
'''

[{'PATTERN': {'DEP': 'ROOT', 'TAG': 'VBP'}, 'SPEC': {'NODE_NAME': 'node1'}},
 {'PATTERN': {'DEP': 'nsubj', 'TAG': 'PRP'},
  'SPEC': {'NBOR_NAME': 'node1', 'NBOR_RELOP': '>', 'NODE_NAME': 'node0'}},
 {'PATTERN': {'DEP': 'dobj', 'TAG': 'NNS'},
  'SPEC': {'NBOR_NAME': 'node0', 'NBOR_RELOP': '$--', 'NODE_NAME': 'node3'}}]

In [14]:
# Create a matcher and add the newly generated pattern
from spacy.matcher import DependencyMatcher

matcher = DependencyMatcher(doc.vocab)
matcher.add('pattern', None, pattern)

# And get matches
matches = matcher(doc)
for match_id, token_idxs in matches:
    print(f'match: {match_id}')
    print(f'token: {token_idxs}')
    tokens = [doc[i] for i in token_idxs[0]]    # KEY CHANGE: token_idxs is now a list of index lists; only the first is used here, so loop over it if there can be more than one
    tokens = sorted(tokens, key=lambda w: w.i)  # Make sure tokens are in their original order
    print(tokens)  # [We, introduce, methods]

match: 15329811787164753587
token: [[1, 0, 3]]
[We, introduce, methods]
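
Because each match carries a list of index lists, reading only `token_idxs[0]` drops any further subtrees. Here is a minimal sketch of the extra loop, using plain Python data copied from the output above so that it runs without spaCy. `collect_matches` is a hypothetical helper, and other spaCy versions may return a different shape.

```python
# Values copied from the output above; the word list is the tokenised example sentence.
words = ['We', 'introduce', 'efficient', 'methods', 'for', 'fitting',
         'Boolean', 'models', 'to', 'molecular', 'data', '.']
matches = [(15329811787164753587, [[1, 0, 3]])]


def collect_matches(matches, words):
    """Return one list of words, in sentence order, per matched subtree."""
    found = []
    for match_id, subtrees in matches:
        for token_idxs in subtrees:  # visit every subtree, not only subtrees[0]
            found.append([words[i] for i in sorted(token_idxs)])
    return found


print(collect_matches(matches, words))  # [['We', 'introduce', 'methods']]
```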


In [17]:
text = 'We introduce a slightly different but still efficient method for this test.'
doc = nlp(text)    # reuse the model loaded above so the matcher and the doc share one vocab

# And get matches
matches = matcher(doc)
for match_id, token_idxs in matches:
    print(f'match: {match_id}')
    print(f'token: {token_idxs}')
    tokens = [doc[i] for i in token_idxs[0]]    # token_idxs is a list of index lists; only the first is used here
    tokens = sorted(tokens, key=lambda w: w.i)  # Make sure tokens are in their original order
    print(tokens)  # hoped for: [We, introduce, method]

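**Note:** I think the cell above is still broken: no match is shown for the second sentence. Two things are worth checking. First, the pattern printed earlier differs from the expected one in its last node (`'NBOR_NAME': 'node0', 'NBOR_RELOP': '$--'` instead of `'node1'` and `'>'`). Second, the pattern requires `'TAG': 'NNS'` on the `dobj` node, while the singular `method` would normally be tagged `NN`.

Take a look at the earlier fix: https://github.com/cyclecycle/spacy-pattern-builder/pull/2/commits/3b430d9f78a52117af86d12797bd8e2ee02fc0fb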

## NeuralCoref

To install the `neuralcoref` package, try:

```bash
git clone https://github.com/huggingface/neuralcoref.git
cd neuralcoref
pip install -r requirements.txt
cd ..
pip install -e neuralcoref/
```

In [1]:
# Load your usual SpaCy model (one of SpaCy English models)
import spacy
nlp = spacy.load("en_core_web_sm")

# Add neural coref to SpaCy's pipe
import neuralcoref
neuralcoref.add_to_pipe(nlp)

# You're done. You can now use NeuralCoref the same way you usually work with a spaCy document's annotations.
doc = nlp(u'My sister has a dog. She loves him.')

doc._.has_coref
doc._.coref_clusters

[My sister: [My sister, She], a dog: [a dog, him]]
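
To make the cluster output concrete, here is a minimal sketch of what resolving the coreferences means, written in plain Python so that it runs without NeuralCoref. The token list and span offsets are typed in by hand from the output above, and `resolve` is a hypothetical helper; NeuralCoref's README documents a `doc._.coref_resolved` attribute for the real thing.

```python
tokens = ['My', 'sister', 'has', 'a', 'dog', '.', 'She', 'loves', 'him', '.']

# Each cluster: (main mention span, [all mention spans]) as (start, end) token offsets.
clusters = [
    ((0, 2), [(0, 2), (6, 7)]),   # My sister: [My sister, She]
    ((3, 5), [(3, 5), (8, 9)]),   # a dog: [a dog, him]
]


def resolve(tokens, clusters):
    """Replace every non-main mention with the tokens of its cluster's main mention."""
    replacements = {}
    for main, mentions in clusters:
        for span in mentions:
            if span != main:
                replacements[span[0]] = (span[1], tokens[main[0]:main[1]])
    resolved, i = [], 0
    while i < len(tokens):
        if i in replacements:
            end, main_tokens = replacements[i]
            resolved.extend(main_tokens)
            i = end      # skip past the replaced mention
        else:
            resolved.append(tokens[i])
            i += 1
    return resolved


print(' '.join(resolve(tokens, clusters)))
```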

## Performance Profiling

* `%time`: time the execution of a single statement
* `%timeit`: time repeated execution of a single statement for more accuracy; `%%timeit -n1 -r1` runs the cell exactly once, much like `%time`
* `%prun`: run code with the profiler
* `%lprun`: run code with the line-by-line profiler
* `%memit`: measure the memory use of a single statement
* `%mprun`: run code with the line-by-line memory profiler
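
Note that `%lprun`, `%memit` and `%mprun` are not built in: they come from the `line_profiler` and `memory_profiler` packages, which must be installed and loaded with `%load_ext`. Outside a notebook, the standard library offers rough counterparts to some of these magics. The following is a sketch only: `work` is a hypothetical stand-in for a pipeline call, and `tracemalloc` tracks Python allocations rather than process memory, so it is only a coarse substitute for `%memit`.

```python
import cProfile
import io
import pstats
import time
import timeit
import tracemalloc


def work(n=10_000):
    """Hypothetical stand-in for a pipeline call."""
    return sum(len(str(i)) for i in range(n))


# Roughly what %time does: one wall-clock measurement.
start = time.perf_counter()
work()
elapsed = time.perf_counter() - start

# Roughly what %timeit does: repeat the statement and keep the best run.
best = min(timeit.repeat(work, number=10, repeat=3)) / 10

# Roughly what %prun does: collect per-function call statistics.
profiler = cProfile.Profile()
profiler.runcall(work)
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats('cumulative').print_stats(5)

# A coarse stand-in for %memit: peak memory allocated by Python objects.
tracemalloc.start()
work()
_, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f'single run: {elapsed:.4f}s, best of 3x10: {best:.6f}s, peak: {peak_bytes} bytes')
print(report.getvalue())
```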

### Generate data

In [22]:
import os

In [33]:
dir_path = './resources/Test_Performance'
file_name = 'test_file'
file_path = os.path.join(dir_path, file_name)

In [None]:
os.makedirs(dir_path, exist_ok=True)    # also creates ./resources if it is missing

In [39]:
with open(file_path, 'w') as file:
    for idx in range(100):
        # each line repeats the same sentence 10 times
        file.write(f'This is sentence {idx+1} used for testing our NLP pipeline. ' * 10 + '\n')

In [41]:
! ls resources/Test_Performance/

test_file


### Methods to read data

In [44]:
#naive
def iter_text_lines(file_path):
    with open(file_path, 'r') as file:
        for line in file:
            yield line

line_iter = iter_text_lines(file_path)    # named line_iter to avoid shadowing the built-in iter()

In [45]:
next(line_iter)

'This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. \n'

In [48]:
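#batch lines returned
def iter_text_batch(file_object, batch_size=10):
    """Lazy function (generator) to read a file batch by batch.
    Default batch size: 10 lines."""
    while True:
        data = []
        for idx in range(batch_size):
            line = file_object.readline()
            if not line:    # readline() returns '' at end of file
                break
            data.append(line)
        if not data:        # nothing left to read: stop instead of yielding empty batches forever
            break
        yield data

file_obj = open(file_path)
batch_iter = iter_text_batch(file_obj, 3)    # named batch_iter to avoid shadowing the built-in iter()

In [49]:
next(batch_iter)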

['This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. \n',
 'This is sentence 2 used for testing our NLP pipeline. This is sentence 2 used for testing our NLP pipeline. This is sentence 2 used for testing our NLP pipeline. This is sentence 2 used for testing our NLP pipeline. This is sentence 2 used for testing our NLP pipeline. This is sentence 2 used for testing our NLP pipeline. This is sentence 2 used for testing our NLP pipeline. This is sentence 2 used for testing our NLP pipeline. This is sentence 2 

In [50]:
file_obj.close()

In [51]:
LIMIT = 2
with open(file_path) as file:
    for idx, batch in enumerate(iter_text_batch(file)):
        print(batch)
        if (idx + 1) >= LIMIT:
            break

['This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. \n', 'This is sentence 2 used for testing our NLP pipeline. This is sentence 2 used for testing our NLP pipeline. This is sentence 2 used for testing our NLP pipeline. This is sentence 2 used for testing our NLP pipeline. This is sentence 2 used for testing our NLP pipeline. This is sentence 2 used for testing our NLP pipeline. This is sentence 2 used for testing our NLP pipeline. This is sentence 2 used for testing our NLP pipeline. This is sentence 2 u
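
The batching idea behind `iter_text_batch` is not tied to file objects. As an alternative sketch, the generator below groups any iterable of lines and also emits the final, shorter batch. `iter_batches` is a hypothetical name, and the `io.StringIO` object merely stands in for the test file.

```python
import io


def iter_batches(lines, batch_size=10):
    """Yield lists of up to `batch_size` items from any iterable of lines."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:   # leftover lines that did not fill a whole batch
        yield batch


# In-memory stand-in for the test file: 7 short lines.
fake_file = io.StringIO(''.join(f'sentence {i}\n' for i in range(1, 8)))
sizes = [len(batch) for batch in iter_batches(fake_file, batch_size=3)]
print(sizes)  # [3, 3, 1]
```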

In [52]:
def gen_text_chunk(file_object, chunksize):
    """Lazy function (generator) to read a file line by line,
    in runs of `chunksize` lines, until the file is exhausted."""
    while True:
        for idx in range(chunksize):
            data = file_object.readline()
            if not data:
                return    # end of file: stop the generator (a bare break would loop forever)
            yield data

In [53]:
class iter_chunk_strm(object):
    
    def __init__(self, file_path, number_of_chunks, chunksize):
        self.file_path = file_path
        self.number_of_chunks = number_of_chunks
        self.chunksize = chunksize
        
    def __iter__(self):
        with open(self.file_path) as file:
            for chunk_idx in range(self.number_of_chunks):
                for idx, batch in enumerate( gen_text_chunk(file, self.chunksize) ):
                    yield batch
                    if (idx + 1) >= self.chunksize:
                        break
                        
iterator = (iter_chunk_strm(file_path, 2, chunksize=3)).__iter__()

In [54]:
next( iterator )

'This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. This is sentence 1 used for testing our NLP pipeline. \n'
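
One reason to wrap the stream in a class with `__iter__`, as `iter_chunk_strm` does, is that such an object can be looped over more than once, whereas a plain generator is exhausted after a single pass. Here is a minimal sketch of the difference, using a temporary file; `LineStream` and `line_generator` are hypothetical names for illustration.

```python
import os
import tempfile


class LineStream:
    """Iterable that reopens the file on each pass, so it can be looped over repeatedly."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as handle:
            for line in handle:
                yield line.rstrip('\n')


def line_generator(path):
    """Plain generator: exhausted after a single pass."""
    with open(path) as handle:
        for line in handle:
            yield line.rstrip('\n')


# Temporary three-line file used only for this demonstration.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('one\ntwo\nthree\n')

stream = LineStream(tmp.name)
gen = line_generator(tmp.name)

stream_passes = (len(list(stream)), len(list(stream)))
gen_passes = (len(list(gen)), len(list(gen)))
os.remove(tmp.name)

print(stream_passes)  # (3, 3)
print(gen_passes)     # (3, 0)
```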

### Test methods

In [None]:
import spacy

In [None]:
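nlp = spacy.load("en_core_web_sm")
# The two custom components below are not defined in this post. Passing a string name
# to add_pipe is spaCy 3 syntax; spaCy 2.x expects the component callable itself.
#nlp.add_pipe("input_component", last=True)
#nlp.add_pipe("mdl_large_fasttext", last=True)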

In [None]:
%%timeit -n1 -r1
result = []
text_stream = iter_chunk_strm(file_path, 1, chunksize=1000)    #next( text_stream.__iter__() )
#next( nlp.pipe(text_stream) )
for doc in nlp.pipe(text_stream, n_process=1, batch_size=1000):
    result.append(doc)

In [None]:
%%timeit -n1 -r1
BATCH = 10
result = []
line_iter = iter_text_lines(file_path)
for idx, doc in enumerate( nlp.pipe(line_iter, n_process=1, batch_size=BATCH) ):
    result.append(doc)
    if idx > BATCH:    # stop early, once idx exceeds BATCH
        break

In [None]:
%%timeit -n1 -r1
BATCH = 1000
result = []
my_array = list(iter_text_lines(file_path))    # assumption: my_array holds all lines of the test file in memory
for idx, doc in enumerate( nlp.pipe(my_array[0:1000], n_process=1, batch_size=BATCH) ):
    result.append(doc)
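
The cells above time the same loop over different inputs and batch sizes. Outside a notebook, a small harness along the following lines could make that comparison explicit. This is only a sketch: `toy_pipe` is a hypothetical stand-in that mimics batch-wise consumption of a text stream and does no NLP, so the real pipeline's `pipe` method would take its place.

```python
import time


def toy_pipe(texts, batch_size=1000):
    """Hypothetical stand-in for a pipeline's pipe method: buffers texts into
    batches and yields one result per text (here, just the token count)."""
    batch = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            for item in batch:
                yield len(item.split())
            batch = []
    for item in batch:   # final, partially filled batch
        yield len(item.split())


def time_run(make_texts, batch_size):
    """Time one full pass. make_texts is a zero-argument callable returning a
    fresh iterable, so a consumed generator is never reused between runs."""
    start = time.perf_counter()
    results = list(toy_pipe(make_texts(), batch_size=batch_size))
    return len(results), time.perf_counter() - start


lines = [f'This is sentence {i + 1} used for testing our NLP pipeline.' for i in range(100)]
for batch_size in (10, 1000):
    n_docs, seconds = time_run(lambda: (line for line in lines), batch_size)
    print(f'batch_size={batch_size}: {n_docs} docs in {seconds:.6f}s')
```

Passing a factory rather than the iterable itself keeps each timed run independent, which matters when the input is a generator.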

In [27]:
import shutil
shutil.rmtree(dir_path)    # os.rmdir() only removes empty directories, and test_file is still there

## References

* [iterators for data processing](https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/)
* [inside spacy's pipe method](https://explosion.ai/blog/multithreading-with-cython)
* [3 approaches to improving spacy performance](https://prrao87.github.io/blog/spacy/nlp/performance/2020/05/02/spacy-multiprocess.html) 
* [using spacy with spark distributed processing](https://haridas.in/run-spacy-jobs-on-apache-spark.html)
* [different python profilers](https://jakevdp.github.io/PythonDataScienceHandbook/01.07-timing-and-profiling.html)