Modern encoders usually involve more than one stage of encoding. Developers manually build pipelines to convert their text into vectors, yet many of the steps are atomic building blocks that could be reused. On top of that, some encoders are slow, so everyone ends up reinventing caches. The encoder library provides a simple way to initialize encoders and construct pipelines.
```bash
pip install encoder-lib[bert_embedded,bert_client]
```
Let's create a thin BERT client for bert-as-service:
```python
from encoders.encoder_factory import EncoderFactory

encoder_conf_dict = {
    "default": {
        "type": "bert_client",
        "input_dim": 1,
        "output_dim": 768,
        "params": {
            "port": 5555,
            "port_out": 5556,
            "ip": "localhost",
            "timeout": 5000,
        }
    }
}

encoder_factory = EncoderFactory(encoder_conf_dict)
encoder = encoder_factory.get_encoder("default")

documents_list = ["Hello World!"]
vectors = encoder.encode(documents_list)
```
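Assuming a bert-as-service instance is listening on the configured ports, each document is mapped to a vector with output_dim components. A quick sanity check (the NumPy-array-like return value mirrors the bert-as-service client and is an assumption here):

```python
# One row per input document, output_dim (768) values per row,
# assuming a NumPy-array-like return value.
print(vectors.shape)  # expected: (1, 768)
```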
Cool, we have an encoder, but each request over the network takes time. Let's enhance the encoder by adding a simple in-memory cache:
```python
from encoders.encoder_factory import EncoderFactory

encoder_conf_dict = {
    "default": {
        "type": "bert_client",
        "input_dim": 1,
        "output_dim": 768,
        "params": {
            "port": 5555,
            "port_out": 5556,
            "ip": "localhost",
            "timeout": 5000,
        },
        "cache": {
            "type": "simple"
        }
    }
}

encoder_factory = EncoderFactory(encoder_conf_dict)
encoder = encoder_factory.get_encoder("default")

documents_list = ["Hello World!"]

# The encoder sends a request over the network
vectors = encoder.encode(documents_list)

# This call takes the vectors from the cache
vectors = encoder.encode(documents_list)
```
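To make the effect of the cache visible, you can time an uncached and a cached call (a rough sketch; the absolute numbers depend entirely on your bert-as-service deployment):

```python
import time

fresh_documents = ["A sentence that has not been encoded yet."]

start = time.perf_counter()
encoder.encode(fresh_documents)   # network round trip to bert-as-service
first_call = time.perf_counter() - start

start = time.perf_counter()
encoder.encode(fresh_documents)   # served from the in-memory cache
second_call = time.perf_counter() - start

print(f"first call: {first_call:.3f}s, cached call: {second_call:.6f}s")
```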
The simple cache stores data in memory without any memory restriction. Besides that, we can save warm-up time by loading pre-computed vectors from a file:
```python
encoder_conf_dict = {
    "default": {
        "type": "bert_client",
        "input_dim": 1,
        "output_dim": 768,
        "params": {
            "port": 5555,
            "port_out": 5556,
            "ip": "localhost",
            "timeout": 5000,
        },
        "cache": {
            "type": "simple",
            "params": {
                "path_desc": {
                    "type": "absolute",
                    "file": "/cache/bert_cache.pkl"
                }
            }
        }
    }
}
```
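With such a configuration, documents whose vectors are already present in the dumped file are served without any network traffic. A minimal sketch, assuming /cache/bert_cache.pkl was produced beforehand (for instance with the simple_dump_to_pickle method mentioned in the release notes below):

```python
from encoders.encoder_factory import EncoderFactory

encoder_factory = EncoderFactory(encoder_conf_dict)
encoder = encoder_factory.get_encoder("default")

# Vectors found in /cache/bert_cache.pkl are returned from the warmed-up cache;
# only unseen documents trigger requests to bert-as-service.
vectors = encoder.encode(["Hello World!"])
```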
A path object is a flexible description of a file location. The current version of the path object supports:
- Absolute path - allows you to specify the full path to a file:

  ```yaml
  path_desc:
    type: absolute
    file: full_file_path
  ```
- Relative path - allows you to specify a path relative to a base directory. The full file name is split into two parts: the relative part is stored in the "file" parameter, while the base part is read from an OS environment variable, which makes your config transferable to other machines (see the resolution sketch after the example below).

  ```yaml
  path_desc:
    type: relative
    file: relative_file_name
    os_env: ENV_VAR
  ```
Example:

```yaml
path_desc:
  type: relative
  os_env: BERT_HOME
  file: "cache/bert_cache.pkl"
```
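As referenced above, here is a minimal illustrative sketch of how such a path description would typically be resolved (the library's actual resolution code may differ, and the resolve_path_desc helper below is hypothetical):

```python
import os


def resolve_path_desc(path_desc: dict) -> str:
    """Illustrative resolver for the path_desc structures shown above."""
    if path_desc["type"] == "absolute":
        return path_desc["file"]
    # Relative path: the base directory comes from an environment variable.
    base_dir = os.environ[path_desc["os_env"]]  # e.g. BERT_HOME=/opt/bert
    return os.path.join(base_dir, path_desc["file"])


print(resolve_path_desc({
    "type": "relative",
    "os_env": "BERT_HOME",
    "file": "cache/bert_cache.pkl",
}))  # -> /opt/bert/cache/bert_cache.pkl when BERT_HOME=/opt/bert
```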
The library currently provides the following encoder types (example configurations for each are shown below):

- Bert-as-Service client
- Bert embedded
- TF-IDF
- Composite vectoriser
```yaml
example_bert_client:
  type: bert_client
  input_dim: 1
  output_dim: 768
  params:
    port: 5555
    port_out: 5556
    ip: localhost
    timeout: 5000

example_bert_embedded:
  type: bert_embedded
  verbose: True
  input_dim: 1
  output_dim: 768
  params:
    seq_len: 25
    graph:
      path_desc:
        type: relative
        os_env: BERT_HOME
        file: model_for_inference.pbtxt
    vocab:
      path_desc:
        type: relative
        os_env: BERT_HOME
        file: vocab.txt

example_composite:
  type: composite
  params:
    encoders:
      - example_bert_client

example_tf_idf:
  type: tfidf
  params:
    path_desc:
      type: absolute
      file: /dumped_tf_idf/model.pkl
```
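Any of these named configurations can be fed to the factory shown earlier. A short sketch (the encoders.yaml file name is hypothetical, and it assumes the referenced services and model files are actually available):

```python
import yaml  # PyYAML, assumed to be installed

from encoders.encoder_factory import EncoderFactory

# Hypothetical file containing the YAML configuration above.
with open("encoders.yaml") as f:
    encoder_conf_dict = yaml.safe_load(f)

encoder_factory = EncoderFactory(encoder_conf_dict)

# The composite encoder wraps the encoders listed under "encoders"
# (here only example_bert_client).
composite_encoder = encoder_factory.get_encoder("example_composite")
vectors = composite_encoder.encode(["Hello World!"])
```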
Release notes:

- Added the seq_len parameter for BERTFeatureExtractor
- Added the verbose parameter for BaseEncoder and all child classes
- Added the simple_dump_to_pickle method for dumping an EncoderCache
- Added base functionality for the BERT and TF-IDF encoders