## BySearch
BySearch implements lightweight, effective semantic search pipelines behind a simple API by combining modern vector storages with lightweight open-source language models.

My goal is to create a simple, approachable solution that helps developers who are not familiar with NLP models and technologies build semantic search engines for their mother tongue and integrate them into their projects.

In [1]:
from datasets import load_dataset
import pandas as pd

from bysearch import Engine
from bysearch.pipelines import HuggingFacePipeline, ONNXPipeline
from bysearch.backends import DatasetBackend, PineconeBackend, ChromaBackend 


Let's walk through an example with Belarusian texts.

The main class of the package is Engine. It organizes the whole dataflow, from raw texts to vector storage. Engine manages the other modules, which convert texts into text embeddings and communicate with the chosen vector storage. Engine provides a unified API across models and vector storages; at the moment it supports upserting text collections, searching by prompt, and deleting by ID.

Engine needs two components:
1. EmbeddingPipeline for text embeddings generation.
2. DataBackend for communication with vector storage.

The text collection to upsert should be wrapped in a Pandas DataFrame or a Hugging Face Dataset.
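As a minimal sketch of the expected input shape, the snippet below wraps a few strings in a Pandas DataFrame with one ID column and one text column. The sample sentences and the column names 'id' and 'text' are illustrative assumptions; the names only need to match what is passed to the backend.

```python
import pandas as pd

# Hypothetical sample texts; any collection of strings would do.
texts = [
    "Мінск - сталіца Беларусі.",
    "Уладзімір Караткевіч - беларускі пісьменнік.",
    "Нёман - рака ў Беларусі.",
]

# One row per text: a unique 'id' column and a 'text' column,
# matching the column names given to the backend.
frame = pd.DataFrame({"id": range(len(texts)), "text": texts})
print(frame)
```

If a Hugging Face Dataset is preferred, `datasets.Dataset.from_pandas(frame)` converts the same frame.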

Let's take a look at the DataBackend classes. At the moment BySearch implements backends for Pinecone and Chroma storages, plus a simple local backend based on Hugging Face datasets. Every backend requires two common parameters, text_column_name and id_column_name, which name the text and ID columns both in the input collection and in the vector storage. The remaining parameters differ per backend and are used to establish a connection with the corresponding vector storage. A DataBackend connects to an existing vector storage, or creates the storage first and then connects to it.

In [2]:
# Simple local backend based on Hugging Face datasets.
# Upserted data will be stored in RAM inside the Python session.
# Doesn't support the delete operation and doesn't track copies during upsert.
# Recommended to use only during tests.
# backend = DatasetBackend(
#     text_column_name='text', 
#     id_column_name='id'
# )

# Backend for communication with a Pinecone storage.
# Pinecone is a commercial closed-source storage accessed through an API.
# backend = PineconeBackend(
#     text_column_name='text', 
#     id_column_name='id', 
#     api_key='your key', 
#     environment='gcp-starter', 
#     index_name='your index name'
# )

# Backend for communication with a Chroma storage.
# Chroma is an open-source database that supports RAM storage, disk storage and server storage. 
backend = ChromaBackend(
    text_column_name='text', 
    id_column_name='id', 
    type='persistent', 
    collection_name='by-embeddings'
)

Now let's look at the EmbeddingPipeline classes. Pipelines generate vector embeddings from input texts using deep text models. At the moment BySearch supports transformer models from the Hugging Face hub (https://huggingface.co/models), models in ONNX format, and a Hugging Face to ONNX converter for faster text processing.

We will create a RoBERTa-based pipeline for Belarusian texts.

The choice of model depends on the languages it supports. You can use a multilingual model trained on the required languages or a language-specific model.

In [3]:
# HuggingFacePipeline implements text embeddings generation 
# using any text model from Hugging Face hub.
# Recommended only for tests
# if the model can be converted into ONNX format.
# pipeline = HuggingFacePipeline(
#     model='KoichiYasuoka/roberta-small-belarusian', 
#     max_context_length=127
# )

# ONNXPipeline implements text embeddings generation
# using any text embedding model in ONNX format.
# ONNX models are optimized for fast inference,
# so this should be a good option for long-term usage.
# This pipeline also has a from_hugging_face method
# that automatically converts Hugging Face models into ONNX format.
pipeline = ONNXPipeline.from_hugging_face(
    model='KoichiYasuoka/roberta-small-belarusian', 
    onnx_save_path='onnx\\model.onnx', 
    max_context_length=127, 
    dummy_input=['What is Lorem Ipsum?Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. Why do we use it?It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using, making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).']
)

Some weights of RobertaModel were not initialized from the model checkpoint at KoichiYasuoka/roberta-small-belarusian and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Prepare data from a collection of Belarusian texts and create the Engine.

In [4]:
# Data loading and id column insertion.
# ID column could be any text metadata column,
# but some storages, particularly Pinecone storage, have limits on length of ID values.
dataset = load_dataset('mc4', 'be', split='validation')
dataset = dataset.add_column('id', list(range(len(dataset))))
# Engine creation from the chosen pipeline and backend.
# The dataset parameter is optional; a dataset can be upserted into the storage after initialization.
search = Engine(dataset=dataset, pipeline=pipeline, backend=backend)

Map: 100%|██████████| 1712/1712 [00:41<00:00, 41.22 examples/s]


Now we will search by prompt. The search operation returns a Pandas DataFrame with the top-k texts most similar to the prompt, in the format (similarity_score, id_column, text_column, *other_columns). The achievable search quality depends on the model size, the text size, and the number of different topics covered by each single text in the collection.
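To make the output format concrete, here is a minimal sketch of a top-k lookup that returns a DataFrame laid out as (score, id, text, other columns). It is not BySearch's implementation: the `top_k` helper, the toy 2-D vectors, and the choice of Euclidean distance are all assumptions made for illustration. The outputs below list rows in ascending order of score, which is why the sketch treats the score as a distance and puts the smallest value first.

```python
import numpy as np
import pandas as pd

def top_k(query_vec, doc_vecs, frame, k=5):
    """Return the k rows of `frame` closest to `query_vec`, smallest distance first."""
    # Euclidean distance between the query and every document vector (assumed metric).
    distances = np.linalg.norm(doc_vecs - query_vec, axis=1)
    # Indices of the k smallest distances, in ascending order.
    order = np.argsort(distances)[:k]
    result = frame.iloc[order].reset_index(drop=True)
    # Put the score first, followed by the original columns.
    result.insert(0, "score", distances[order])
    return result

# Toy collection and 2-D "embeddings", made up for illustration only.
frame = pd.DataFrame({"id": [10, 11, 12], "text": ["a", "b", "c"], "url": ["u1", "u2", "u3"]})
doc_vecs = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])
print(top_k(np.array([0.0, 0.0]), doc_vecs, frame, k=2))
```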

In [5]:
rez = search.search('аповесць беларускага пісьменніка Уладзіміра Караткевіча', verbose=False)
rez

Unnamed: 0,score,id,text,timestamp,url
0,6.785421,901,﻿ Стаўленне да вайны — характарыстыка духоўнас...,2020-08-04T17:20:46Z,http://zviazda.by/be/news/20150508/1431035548-...
1,6.891847,77,Мікалай Якаўлевіч Нікіфароўскі — Вікіпедыя\nМі...,2020-08-03T21:55:48Z,https://be.m.wikipedia.org/wiki/%D0%9C%D1%96%D...
2,7.098675,344,"Арлоў Уладзімір, Айчына, частка першая - Белар...",2019-11-12T09:06:47Z,http://ww.w.kamunikat.org/katalohbht.html?pub_...
3,7.118645,1473,"""Беларусь 3"" ушануе памяць Валерыя Чкалава | Б...",2019-02-24T06:42:12Z,http://3belarus.by/be/news/belarus-3-ushanue-p...
4,7.238077,506,Прэзентацыя кнігі «Хуш Килесез — калі ласка!: ...,2019-02-17T06:57:53Z,http://mininform.gov.by/news/stuzhka-navin/pre...


Let's delete the texts returned by the search and run the search again with the same prompt.

Warning: the Pinecone remote backend has a storage update delay, so the results of an operation may not be visible immediately after it completes.

In [6]:
search.delete(rez['id'].tolist())
rez = search.search('аповесць беларускага пісьменніка Уладзіміра Караткевіча', verbose=False)
rez

Unnamed: 0,score,id,text,timestamp,url
0,7.278406,408,Аляксей Белы адзначае юбілей\nГалоўная » Навін...,2020-03-29T15:38:08Z,https://lit-bel.org/news/Alyaksey-Beli-adznach...
1,7.476275,1358,«Дарогамі Максіма Танка»\nОпубликовано: 18.09....,2018-03-23T16:48:12Z,https://www.sb.by/articles/darogam-maks-ma-tan...
2,7.612226,544,"Творчая сустрэча «Родны край, я цябе апяваю»\n...",2019-12-12T14:06:21Z,http://kb.brl.by/index.php/home?view=featured
3,7.784564,120,Прэм'ера ''Смешныя людзі'' - 12.10.2018 - Наци...,2018-10-18T09:00:55Z,https://www.kvitki.by/rus/bileti/teatr/drama/p...
4,7.789008,56,Вязень лагера Дахау | Дзяннiца\nГлавная Вязень...,2017-01-17T09:03:39Z,http://dzyannica.by/node/11144


Now we will upsert a copy of the text collection.

In [7]:
# Create a copy of the text collection.
new_dataset = load_dataset('mc4', 'be', split='validation')
new_dataset = new_dataset.add_column('id', list(range(len(new_dataset))))
# The upsert method updates existing data or inserts new data into your storage.
search.upsert(dataset=new_dataset)
rez = search.search('аповесць беларускага пісьменніка Уладзіміра Караткевіча', verbose=False, k=10)
rez

Map: 100%|██████████| 1712/1712 [00:44<00:00, 38.80 examples/s]


Unnamed: 0,score,id,text,timestamp,url
0,6.629228,1656,﻿ Новыя паступленні музея імя Суворава | А. М....,2017-08-18T16:34:45Z,http://ikobrin.ru/martinov-52.php
1,6.785421,901,﻿ Стаўленне да вайны — характарыстыка духоўнас...,2020-08-04T17:20:46Z,http://zviazda.by/be/news/20150508/1431035548-...
2,6.891847,77,Мікалай Якаўлевіч Нікіфароўскі — Вікіпедыя\nМі...,2020-08-03T21:55:48Z,https://be.m.wikipedia.org/wiki/%D0%9C%D1%96%D...
3,7.098675,344,"Арлоў Уладзімір, Айчына, частка першая - Белар...",2019-11-12T09:06:47Z,http://ww.w.kamunikat.org/katalohbht.html?pub_...
4,7.118645,1473,"""Беларусь 3"" ушануе памяць Валерыя Чкалава | Б...",2019-02-24T06:42:12Z,http://3belarus.by/be/news/belarus-3-ushanue-p...
5,7.238077,506,Прэзентацыя кнігі «Хуш Килесез — калі ласка!: ...,2019-02-17T06:57:53Z,http://mininform.gov.by/news/stuzhka-navin/pre...
6,7.278406,408,Аляксей Белы адзначае юбілей\nГалоўная » Навін...,2020-03-29T15:38:08Z,https://lit-bel.org/news/Alyaksey-Beli-adznach...
7,7.476275,1358,«Дарогамі Максіма Танка»\nОпубликовано: 18.09....,2018-03-23T16:48:12Z,https://www.sb.by/articles/darogam-maks-ma-tan...
8,7.612226,544,"Творчая сустрэча «Родны край, я цябе апяваю»\n...",2019-12-12T14:06:21Z,http://kb.brl.by/index.php/home?view=featured
9,7.784564,120,Прэм'ера ''Смешныя людзі'' - 12.10.2018 - Наци...,2018-10-18T09:00:55Z,https://www.kvitki.by/rus/bileti/teatr/drama/p...


As a result, the deleted texts have reappeared in the storage, and the other texts were not duplicated.
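The behaviour above follows from upsert being keyed by ID: an existing ID is overwritten and a missing ID is inserted. The sketch below reproduces the same delete-then-upsert sequence with a plain dictionary standing in for the vector storage. The `upsert` and `delete` helpers are hypothetical stand-ins, not the BySearch or Chroma API.

```python
def upsert(storage, records):
    """Insert new records and overwrite existing ones, keyed by 'id'."""
    for record in records:
        storage[record["id"]] = record

def delete(storage, ids):
    """Remove records with the given ids, ignoring ids that are absent."""
    for record_id in ids:
        storage.pop(record_id, None)

# A toy collection of five records, made up for illustration.
collection = [{"id": i, "text": f"text {i}"} for i in range(5)]
storage = {}

upsert(storage, collection)
delete(storage, [1, 3])
print(sorted(storage))  # ids 1 and 3 are gone

upsert(storage, collection)
print(sorted(storage))  # all five ids are back, each stored once
```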