# Lexical tokenization - query time tokenization

Let's walk through a basic introduction to lexical search.

### Who you are:

An ML engineer who is comfortable enough with the Python data stack (pandas, numpy, etc.) and wants to understand traditional search engines (e.g. Elasticsearch).

### What this is

A run through of the core concepts behind lexical search.


## This notebook: query-time tokenization

We [previously discussed index-time tokenization](https://colab.research.google.com/drive/1Mz2H05900XlNdnV_IXveDukYeEV3HABi). Now, how does this apply at query time? If we search for "doug complaint", we presumably want to find messages where Doug had a complaint.

In [None]:
!pip install searcharray

from searcharray import SearchArray
import pandas as pd
import numpy as np

Collecting searcharray
  Downloading searcharray-0.0.72-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading searcharray-0.0.72-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.7 MB)
Installing collected packages: searcharray
Successfully installed searcharray-0.0.72


## Tokenize and index

Last time we built a slightly smarter tokenizer. Nothing too fancy, but good enough to make basic matching work.

In [None]:
from string import punctuation


def better_tokenize(text):
    lowercased = text.lower()
    without_punctuation = lowercased.translate(str.maketrans('', '', punctuation))
    split = without_punctuation.split()
    return split


chat_transcript = [
  "Hi this is Doug, I have a complaint about the weather",
  "Doug, this is Tom, support for Earth's Climate, how can we help?",
  "Tom, can I speak to your manager?",
  "Hi, this is Sue, Tom's boss. What can I do for you?",
  "I'd like to complain about the ski conditions in West Virginia",
  "Oh doug thats terrible, lets see what we can do."
]

msgs = pd.DataFrame({"name": ["Doug", "Doug", "Tom", "Sue", "Doug", "Sue"],
                     "msg": chat_transcript})
msgs['msg_tokenized'] = SearchArray.index(msgs['msg'],
                                          tokenizer=better_tokenize)
msgs

2025-09-17 14:41:13,977 - searcharray.indexing - INFO - Indexing begins w/ 4 workers

2025-09-17 14:41:13,982 - searcharray.indexing - INFO - 0 Batch Start tokenization

2025-09-17 14:41:13,985 - searcharray.indexing - INFO - Tokenizing 6 documents

2025-09-17 14:41:13,995 - searcharray.indexing - INFO - Tokenization -- vstacking

2025-09-17 14:41:13,997 - searcharray.indexing - INFO - Tokenization -- DONE

2025-09-17 14:41:14,000 - searcharray.indexing - INFO - Inverting docs->terms

2025-09-17 14:41:14,002 - searcharray.indexing - INFO - Encoding positions to bit array

2025-09-17 14:41:14,005 - searcharray.indexing - INFO - Batch tokenization complete

2025-09-17 14:41:14,009 - searcharray.indexing - INFO - (main thread) Processing 1 batch results

2025-09-17 14:41:14,010 - searcharray.indexing - INFO - Indexing from tokenization complete

| | name | msg | msg_tokenized |
|---|---|---|---|
| 0 | Doug | Hi this is Doug, I have a complaint about the ... | Terms({'have', 'is', 'weather', 'about', 'this... |
| 1 | Doug | Doug, this is Tom, support for Earth's Climate... | Terms({'is', 'tom', 'how', 'this', 'we', 'clim... |
| 2 | Tom | Tom, can I speak to your manager? | Terms({'manager', 'tom', 'your', 'to', 'can', ... |
| 3 | Sue | Hi, this is Sue, Tom's boss. What can I do for... | Terms({'sue', 'is', 'boss', 'do', 'what', 'thi... |
| 4 | Doug | I'd like to complain about the ski conditions ... | Terms({'west', 'about', 'in', 'the', 'conditio... |
| 5 | Sue | Oh doug thats terrible, lets see what we can do. | Terms({'see', 'thats', 'do', 'we', 'terrible',... |


## Search with two terms

What happens when we search for "doug complaint", meaning we want to find the messages where Doug had a complaint?

In [None]:
QUERY = "doug complaint"
matches = msgs['msg_tokenized'].array.score(QUERY) > 0
msgs[matches]

| | name | msg | msg_tokenized |
|---|---|---|---|


### Again nothing matched!?

What gives!? Nothing matched here either!!!

I tricked you. SearchArray's `score` expects one of two things:

1. A string, which it treats as a single term
2. A list of strings, which it treats as a phrase

So when we passed `doug complaint`, we literally searched for a single token `doug complaint`, as if we had indexed word bigrams. No such token exists in the index, so nothing matched.
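
To make the term-versus-phrase distinction concrete, here is a toy sketch in plain Python. It does not use SearchArray or reflect its internals; `has_term` and `has_phrase` are hypothetical helpers, and the sketch assumes "phrase" has its usual meaning of adjacent tokens in order.

```python
from string import punctuation


def better_tokenize(text):
    # Same tokenizer as in the notebook: lowercase, strip punctuation, split
    lowercased = text.lower()
    without_punctuation = lowercased.translate(str.maketrans('', '', punctuation))
    return without_punctuation.split()


def has_term(doc_tokens, term):
    # A string query is looked up as ONE token, exactly as given
    return term in doc_tokens


def has_phrase(doc_tokens, phrase):
    # A list query must appear as adjacent tokens, in order
    n = len(phrase)
    return any(doc_tokens[i:i + n] == phrase
               for i in range(len(doc_tokens) - n + 1))


doc = better_tokenize("Hi this is Doug, I have a complaint about the weather")

print(has_term(doc, "doug complaint"))         # False: no single token looks like this
print(has_term(doc, "doug"))                   # True
print(has_phrase(doc, ["doug", "complaint"]))  # False: the two words are not adjacent
print(has_phrase(doc, ["a", "complaint"]))     # True
```

Neither the raw string nor the phrase form finds the first message, which is why the terms need to be searched individually and then combined.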

We'll see why SearchArray doesn't want to tokenize queries for you in a bit, but our problem has an easy fix:

## Fix by tokenizing the query?

In [None]:
query_tokenized = better_tokenize(QUERY)
query_tokenized

['doug', 'complaint']
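
A side benefit of reusing `better_tokenize` for the query is that the query gets the same normalization as the indexed text. The sketch below repeats the tokenizer so it runs on its own; the messier query strings are made up for illustration.

```python
from string import punctuation


def better_tokenize(text):
    # Same tokenizer as in the notebook: lowercase, strip punctuation, split
    lowercased = text.lower()
    without_punctuation = lowercased.translate(str.maketrans('', '', punctuation))
    return without_punctuation.split()


# Differently typed versions of the same query collapse to the same tokens
for raw_query in ["doug complaint", "Doug, complaint!", "DOUG   complaint?"]:
    print(better_tokenize(raw_query))  # ['doug', 'complaint'] each time

# Index-time quirks carry over too: "Earth's" was indexed as 'earths',
# so the query has to be normalized the same way to find it
print(better_tokenize("Earth's"))  # ['earths']
```

If the query were tokenized differently from the documents, terms could silently fail to match even when the words look identical to a human.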

In [None]:
matches = np.zeros(len(msgs), dtype=bool)     # start with no matches; builtin bool works across NumPy versions
for query_token in query_tokenized:
    matches |= (msgs['msg_tokenized'].array.score(query_token) > 0)

msgs[matches]

| | name | msg | msg_tokenized |
|---|---|---|---|
| 0 | Doug | Hi this is Doug, I have a complaint about the ... | Terms({'have', 'is', 'weather', 'about', 'this... |
| 1 | Doug | Doug, this is Tom, support for Earth's Climate... | Terms({'is', 'tom', 'how', 'this', 'we', 'clim... |
| 5 | Sue | Oh doug thats terrible, lets see what we can do. | Terms({'see', 'thats', 'do', 'we', 'terrible',... |


### Fixed! (kinda)

1. We tried searching for `doug complaint` as a single string, which matched nothing
2. We switched to tokenizing `doug complaint` -> `['doug', 'complaint']`
3. We accept a match on either term as a match
4. We end up with 3 matches (despite only one document mentioning both `doug` AND `complaint`)

## OK Actually fix - require all terms to match

In the previous loop, we `|=` together each term's matches. We would call this an "OR query", since a match on any term is enough. Let's change our loop to an AND query.

One reason to give YOU control, and not SearchArray, is so you can make decisions like this!

In [None]:
matches = np.ones(len(msgs), dtype=bool)     # NOTICE <- init to np.ones
for query_token in query_tokenized:
    matches &= (msgs['msg_tokenized'].array.score(query_token) > 0)     # &= -- AND equals

msgs[matches]

| | name | msg | msg_tokenized |
|---|---|---|---|
| 0 | Doug | Hi this is Doug, I have a complaint about the ... | Terms({'have', 'is', 'weather', 'about', 'this... |
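
The OR and AND loops differ only in their starting value and combining operator, so they can be folded into one helper. This is a minimal sketch in plain Python rather than numpy; `combine_matches` is a hypothetical name, and the per-term flags are written out by hand from the results above (`doug` appears in rows 0, 1 and 5, `complaint` only in row 0).

```python
def combine_matches(per_term_matches, mode="and"):
    """Combine one list of per-document booleans per query term."""
    if mode not in ("and", "or"):
        raise ValueError(f"mode must be 'and' or 'or', got {mode!r}")
    combiner = all if mode == "and" else any
    # zip(*...) regroups the flags so each tuple holds one document's
    # flags across all query terms
    return [combiner(flags) for flags in zip(*per_term_matches)]


# Hand-written per-term matches for the six chat messages
doug = [True, True, False, False, False, True]
complaint = [True, False, False, False, False, False]

or_matches = combine_matches([doug, complaint], mode="or")
and_matches = combine_matches([doug, complaint], mode="and")

print([i for i, m in enumerate(or_matches) if m])   # [0, 1, 5]
print([i for i, m in enumerate(and_matches) if m])  # [0]
```

One edge case worth knowing about in the numpy loops: if the query tokenizes to nothing, the AND loop leaves its `np.ones` array untouched and matches every row, while the OR loop leaves its `np.zeros` array untouched and matches none.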


## Breadcrumbs for Elasticsearch, Vespa, etc

Most search engines have a query DSL that controls how query terms must match. For AND / OR behavior specifically, Elasticsearch and friends have a [boolean query](https://www.elastic.co/guide/en/elasticsearch/reference/8.18/query-dsl-bool-query.html), while the [match](https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-match-query) and [multi match](https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-multi-match-query) queries let you pass a query string and control whether its terms are combined with AND or OR. Vespa has [YQL `contains`](https://docs.vespa.ai/en/reference/query-language-reference.html), which is the workhorse for deciding whether a document is considered a match.
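
As a rough sketch of what the same OR / AND choice looks like in those engines, the request bodies below are built as plain Python dicts and strings. Nothing is sent to a server. The field name `msg` is an assumption borrowed from our DataFrame, and a real index would need a matching mapping or schema.

```python
import json

QUERY = "doug complaint"
FIELD = "msg"  # assumed field name, mirroring the DataFrame column

# Elasticsearch match query: the engine tokenizes the query string,
# and "operator" decides whether any term or every term must match
match_or = {"query": {"match": {FIELD: {"query": QUERY, "operator": "or"}}}}
match_and = {"query": {"match": {FIELD: {"query": QUERY, "operator": "and"}}}}

# Elasticsearch bool query: one clause per term.
# "must" behaves like our &= loop, "should" like our |= loop
tokens = QUERY.split()
clauses = [{"match": {FIELD: token}} for token in tokens]
bool_and = {"query": {"bool": {"must": clauses}}}
bool_or = {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}

# Vespa YQL: one "contains" per term, joined with and / or
vespa_and = "select * from sources * where " + " and ".join(
    f'{FIELD} contains "{token}"' for token in tokens
)

print(json.dumps(match_and, indent=2))
print(json.dumps(bool_and, indent=2))
print(vespa_and)
```

The shape is the same as in our loops: tokenize the query, match each term, then decide how the per-term results combine.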