# Lexical tokenization - Searching over fields

Let's walk through a basic introduction to lexical search.

### Who you are:

An ML engineer comfortable enough with the Python data stack (pandas, numpy, etc.) who wants to understand traditional search engines (e.g. Elasticsearch).

### What this is

A run through of the core concepts behind lexical search.


## This notebook: Term centric search

We [previously discussed controlling index and query time tokenization](https://colab.research.google.com/drive/1RGNkq4SOZMvlFvpHq3IKgNJdCTlHqiek). But what happens if you're searching multiple fields?

In [None]:
!pip install searcharray pystemmer

from searcharray import SearchArray
import pandas as pd
import numpy as np
import Stemmer




## Tokenize and index

Tokenize and index two fields:

1. The name (who's chatting)
2. Their message

In [None]:
from string import punctuation


def better_tokenize(text):
    lowercased = text.lower()
    without_punctuation = lowercased.translate(str.maketrans('', '', punctuation))
    split = without_punctuation.split()
    return split


chat_transcript = [
  "Hi this is Doug, I have a complaint about the weather",
  "Doug, this is Tom, support for Earth's Climate, how can we help you doug?",
  "Tom, can I speak to your manager?",
  "Hi, this is Sue, Tom's boss. What can I do for you?",
  "I have complaints about the ski conditions in West Virginia",
  "Oh doug thats terrible, lets see what we can do."
]

msgs = pd.DataFrame({"name": ["Doug", "Doug", "Tom", "Sue", "Doug", "Sue"],
                     "msg": chat_transcript})
msgs['msg_tokenized'] = SearchArray.index(msgs['msg'],
                                          tokenizer=better_tokenize)
msgs['name_tokenized'] = SearchArray.index(msgs['name'],
                                          tokenizer=better_tokenize)
msgs

2025-06-20 14:19:51,975 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2025-06-20 14:19:51,979 - searcharray.indexing - INFO - 0 Batch Start tokenization
2025-06-20 14:19:51,983 - searcharray.indexing - INFO - Tokenizing 6 documents
2025-06-20 14:19:51,986 - searcharray.indexing - INFO - Tokenization -- vstacking
2025-06-20 14:19:51,989 - searcharray.indexing - INFO - Tokenization -- DONE
2025-06-20 14:19:51,991 - searcharray.indexing - INFO - Inverting docs->terms
2025-06-20 14:19:51,993 - searcharray.indexing - INFO - Encoding positions to bit array
2025-06-20 14:19:51,996 - searcharray.indexing - INFO - Batch tokenization complete
2025-06-20 14:19:51,999 - searcharray.indexing - INFO - (main thread) Processing 1 batch results
2025-06-20 14:19:52,004 - searcharray.indexing - INFO - Indexing from tokenization complete
2025-06-20 14:19:52,007 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2025-06-20 14:19:52,008 - searcharray.indexing - INFO - 0 Batch Start tokenization
2025-06-20 14:19:52,017 - searcharray.indexing - INFO - Tokenizing 6 documents
2025-06-20 14:19:52,035 - searcharray.indexing - INFO - Tokenization -- vstacking
2025-06-20 14:19:52,041 - searcharray.indexing - INFO - Tokenization -- DONE
2025-06-20 14:19:52,045 - searcharray.indexing - INFO - Inverting docs->terms
2025-06-20 14:19:52,049 - searcharray.indexing - INFO - Encoding positions to bit array
2025-06-20 14:19:52,050 - searcharray.indexing - INFO - Batch tokenization complete
2025-06-20 14:19:52,059 - searcharray.indexing - INFO - (main thread) Processing 1 batch results
2025-06-20 14:19:52,068 - searcharray.indexing - INFO - Indexing from tokenization complete

Unnamed: 0,name,msg,msg_tokenized,name_tokenized
0,Doug,"Hi this is Doug, I have a complaint about the ...","Terms({'a', 'is', 'i', 'weather', 'this', 'hi'...",Terms({'doug'})
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'for', 'tom', 'can', 'you', 'is', 'how'...",Terms({'doug'})
2,Tom,"Tom, can I speak to your manager?","Terms({'tom', 'can', 'your', 'i', 'speak', 'to...",Terms({'tom'})
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'for', 'what', 'can', 'you', 'is', 'i',...",Terms({'sue'})
4,Doug,I have complaints about the ski conditions in ...,"Terms({'ski', 'complaints', 'i', 'west', 'in',...",Terms({'doug'})
5,Sue,"Oh doug thats terrible, lets see what we can do.","Terms({'what', 'can', 'terrible', 'do', 'we', ...",Terms({'sue'})


## Use naive TF\*IDF again

Recall we created a naive TF\*IDF similarity function last time. Let's use that!

In [None]:
from searcharray.similarity import Similarity

def tf_idf(term_freqs: np.ndarray,        # TF array of every doc in the index
           doc_freqs: np.ndarray,         # Doc freq array of every term (> 1 if a phrase)
           doc_lens: np.ndarray,          # Every document's length (same shape as TF)
           avg_doc_lens: int,             # avg doc length of corpus
           num_docs: int) -> np.ndarray:  # total number of docs in corpus
    # Naive TF*IDF: term frequency divided by (document frequency + 1)
    return term_freqs / (doc_freqs + 1)


## Repeat our search from last time

In [None]:
QUERY = "doug complaint"
query_tokenized = better_tokenize(QUERY)

# ACCUMULATE SCORES
scores = np.zeros(len(msgs))
for query_token in query_tokenized:
    # PASS SIMILARITY
    score = msgs['msg_tokenized'].array.score(query_token,
                                              similarity=tf_idf)
    print(f"Term '{query_token}' score: {score}")
    scores += score


msgs['scores'] = scores
msgs.sort_values('scores', ascending=False)

Term 'doug' score: [0.25 0.5  0.   0.   0.   0.25]
Term 'complaint' score: [0.5 0.  0.  0.  0.  0. ]


Unnamed: 0,name,msg,msg_tokenized,name_tokenized,scores
0,Doug,"Hi this is Doug, I have a complaint about the ...","Terms({'a', 'is', 'i', 'weather', 'this', 'hi'...",Terms({'doug'}),0.75
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'for', 'tom', 'can', 'you', 'is', 'how'...",Terms({'doug'}),0.5
5,Sue,"Oh doug thats terrible, lets see what we can do.","Terms({'what', 'can', 'terrible', 'do', 'we', ...",Terms({'sue'}),0.25
2,Tom,"Tom, can I speak to your manager?","Terms({'tom', 'can', 'your', 'i', 'speak', 'to...",Terms({'tom'}),0.0
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'for', 'what', 'can', 'you', 'is', 'i',...",Terms({'sue'}),0.0
4,Doug,I have complaints about the ski conditions in ...,"Terms({'ski', 'complaints', 'i', 'west', 'in',...",Terms({'doug'}),0.0
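
The per-term scores printed above can be checked by hand. The following is a minimal sketch in plain Python, assuming the same lowercase-and-strip-punctuation tokenization and the same `term_freq / (doc_freq + 1)` formula; `naive_tf_idf` is a hypothetical helper written for this illustration, not part of SearchArray.

```python
from collections import Counter
from string import punctuation

docs = [
    "Hi this is Doug, I have a complaint about the weather",
    "Doug, this is Tom, support for Earth's Climate, how can we help you doug?",
    "Tom, can I speak to your manager?",
    "Hi, this is Sue, Tom's boss. What can I do for you?",
    "I have complaints about the ski conditions in West Virginia",
    "Oh doug thats terrible, lets see what we can do.",
]


def tokenize(text):
    # Same steps as better_tokenize: lowercase, drop punctuation, split on whitespace
    return text.lower().translate(str.maketrans('', '', punctuation)).split()


def naive_tf_idf(term, docs):
    # Hypothetical helper: one score per document for a single term
    term_counts = [Counter(tokenize(doc)) for doc in docs]
    doc_freq = sum(1 for counts in term_counts if term in counts)
    return [counts[term] / (doc_freq + 1) for counts in term_counts]


doug_scores = naive_tf_idf("doug", docs)
complaint_scores = naive_tf_idf("complaint", docs)
print(doug_scores)
print(complaint_scores)
```

`doug` appears in three messages (twice in the second), so each occurrence is worth 1/4. `complaint` appears in only one message, so its single occurrence is worth 1/2, and `complaints` in the fifth message is a different token that scores nothing.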


## Tokenization detour: We notice a bit of a tokenization problem

Our boss comes along and notices that when someone searches for `complaint`, we don't match `complaints`. Can we solve this with a better tokenizer?

We can add a **stemmer**, an algorithmic (i.e. sometimes naive) way of reducing words to root forms. Below we use a basic Snowball stemmer.

In [None]:
stemmer = Stemmer.Stemmer('english')
stemmer.stemWord("complaint"), stemmer.stemWord("complaints")

('complaint', 'complaint')

In [None]:
def even_better_tokenize(text):
    lowercased = text.lower()
    without_punctuation = lowercased.translate(str.maketrans('', '', punctuation))
    split = without_punctuation.split()
    return [stemmer.stemWord(tok) for tok in split]

even_better_tokenize("I have complaints about this complaint!")

['i', 'have', 'complaint', 'about', 'this', 'complaint']
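
To see why stemming has to happen on both sides, here is a toy sketch that needs no stemming library. `toy_stem` is a hypothetical one-rule stand-in (strip a trailing "s"), deliberately much cruder than the Snowball algorithm used above; the point is only that the indexed text and the query pass through the same function.

```python
from string import punctuation


def toy_stem(token):
    # Hypothetical rule for illustration only, NOT Snowball:
    # strip a trailing "s" from longer words that do not end in "ss"
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def toy_tokenize(text):
    lowered = text.lower().translate(str.maketrans('', '', punctuation))
    return [toy_stem(tok) for tok in lowered.split()]


# Index time and query time use the same tokenizer
indexed = toy_tokenize("I have complaints about the ski conditions")
query = toy_tokenize("complaint")
matches = [tok for tok in query if tok in indexed]
print(indexed)
print(matches)
```

Because `complaints` was reduced to `complaint` at index time, the query term now finds it. If only one side were stemmed, the two tokens could disagree and the match would be lost.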

### Reindex with stemming added

In [None]:

chat_transcript = [
  "Hi this is Doug, I have a complaint about the weather",
  "Doug, this is Tom, support for Earth's Climate, how can we help you doug?",
  "Tom, can I speak to your manager?",
  "Hi, this is Sue, Tom's boss. What can I do for you?",
  "I have complaints about the ski conditions in West Virginia",
  "Oh doug thats terrible, lets see what we can do."
]

msgs = pd.DataFrame({"name": ["Doug", "Doug", "Tom", "Sue", "Doug", "Sue"],
                     "msg": chat_transcript})
msgs['msg_tokenized'] = SearchArray.index(msgs['msg'],
                                          tokenizer=even_better_tokenize)
msgs['name_tokenized'] = SearchArray.index(msgs['name'],
                                          tokenizer=even_better_tokenize)
msgs

2025-06-20 14:19:52,201 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2025-06-20 14:19:52,204 - searcharray.indexing - INFO - 0 Batch Start tokenization
2025-06-20 14:19:52,207 - searcharray.indexing - INFO - Tokenizing 6 documents
2025-06-20 14:19:52,210 - searcharray.indexing - INFO - Tokenization -- vstacking
2025-06-20 14:19:52,212 - searcharray.indexing - INFO - Tokenization -- DONE
2025-06-20 14:19:52,214 - searcharray.indexing - INFO - Inverting docs->terms
2025-06-20 14:19:52,217 - searcharray.indexing - INFO - Encoding positions to bit array
2025-06-20 14:19:52,219 - searcharray.indexing - INFO - Batch tokenization complete
2025-06-20 14:19:52,221 - searcharray.indexing - INFO - (main thread) Processing 1 batch results
2025-06-20 14:19:52,225 - searcharray.indexing - INFO - Indexing from tokenization complete
2025-06-20 14:19:52,228 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2025-06-20 14:19:52,232 - searcharray.indexing - INFO - 0 Batch Start tokenization
2025-06-20 14:19:52,234 - searcharray.indexing - INFO - Tokenizing 6 documents
2025-06-20 14:19:52,236 - searcharray.indexing - INFO - Tokenization -- vstacking
2025-06-20 14:19:52,237 - searcharray.indexing - INFO - Tokenization -- DONE
2025-06-20 14:19:52,239 - searcharray.indexing - INFO - Inverting docs->terms
2025-06-20 14:19:52,240 - searcharray.indexing - INFO - Encoding positions to bit array
2025-06-20 14:19:52,242 - searcharray.indexing - INFO - Batch tokenization complete
2025-06-20 14:19:52,243 - searcharray.indexing - INFO - (main thread) Processing 1 batch results
2025-06-20 14:19:52,245 - searcharray.indexing - INFO - Indexing from tokenization complete

Unnamed: 0,name,msg,msg_tokenized,name_tokenized
0,Doug,"Hi this is Doug, I have a complaint about the ...","Terms({'a', 'is', 'i', 'weather', 'this', 'hi'...",Terms({'doug'})
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'for', 'tom', 'earth', 'can', 'you', 'i...",Terms({'doug'})
2,Tom,"Tom, can I speak to your manager?","Terms({'tom', 'can', 'your', 'i', 'speak', 'to...",Terms({'tom'})
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'for', 'tom', 'what', 'can', 'you', 'is...",Terms({'sue'})
4,Doug,I have complaints about the ski conditions in ...,"Terms({'ski', 'i', 'west', 'condit', 'about', ...",Terms({'doug'})
5,Sue,"Oh doug thats terrible, lets see what we can do.","Terms({'what', 'can', 'do', 'that', 'we', 'dou...",Terms({'sue'})


### Search again

In [None]:
QUERY = "doug complaint"
# Tokenize the query the same way the index was built (with stemming)
query_tokenized = even_better_tokenize(QUERY)

# ACCUMULATE SCORES
scores = np.zeros(len(msgs))
for query_token in query_tokenized:
    # PASS SIMILARITY
    score = msgs['msg_tokenized'].array.score(query_token,
                                              similarity=tf_idf)
    print(f"Term '{query_token}' score: {score}")
    scores += score


msgs['scores'] = scores
msgs.sort_values('scores', ascending=False)

Term 'doug' score: [0.25 0.5  0.   0.   0.   0.25]
Term 'complaint' score: [0.33333333 0.         0.         0.         0.33333333 0.        ]


Unnamed: 0,name,msg,msg_tokenized,name_tokenized,scores
0,Doug,"Hi this is Doug, I have a complaint about the ...","Terms({'a', 'is', 'i', 'weather', 'this', 'hi'...",Terms({'doug'}),0.583333
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'for', 'tom', 'earth', 'can', 'you', 'i...",Terms({'doug'}),0.5
4,Doug,I have complaints about the ski conditions in ...,"Terms({'ski', 'i', 'west', 'condit', 'about', ...",Terms({'doug'}),0.333333
5,Sue,"Oh doug thats terrible, lets see what we can do.","Terms({'what', 'can', 'do', 'that', 'we', 'dou...",Terms({'sue'}),0.25
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'for', 'tom', 'what', 'can', 'you', 'is...",Terms({'sue'}),0.0
2,Tom,"Tom, can I speak to your manager?","Terms({'tom', 'can', 'your', 'i', 'speak', 'to...",Terms({'tom'}),0.0


## New Problem: multi term, multi field

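Notice a fairly simple problem above: we sum the scores, but nothing biases the ranking toward documents where both `complaint` and `doug` match.

With naive score summing, a document that matches `doug` twice (for example, once in the name and once in the message) counts just as much as one that matches `doug` and `complaint`, when clearly the latter is closer to the user's information need. Below we search both fields and sum every field's score for every term.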

In [None]:
QUERY = "doug complaint"
FIELDS = ["msg_tokenized", "name_tokenized"]
query_tokenized = even_better_tokenize(QUERY)

# ACCUMULATE SCORES
scores = np.zeros(len(msgs))
for query_token in query_tokenized:
    field_scores = np.zeros(len(msgs))
    for field in FIELDS:
        score = msgs[field].array.score(query_token,
                                        similarity=tf_idf)
        # Add this field's score to the term's running total
        print(f"Field {field}, Term '{query_token}' score: {score}")
        field_scores += score
    print(f"Term '{query_token}' score: {field_scores}")
    scores += field_scores
    print(f"Scores now: {scores}")


msgs['scores'] = scores
msgs.sort_values('scores', ascending=False)

Field msg_tokenized, Term 'doug' score: [0.25 0.5  0.   0.   0.   0.25]
Field name_tokenized, Term 'doug' score: [0.25 0.25 0.   0.   0.25 0.  ]
Term 'doug' score: [0.5  0.75 0.   0.   0.25 0.25]
Scores now: [0.5  0.75 0.   0.   0.25 0.25]
Field msg_tokenized, Term 'complaint' score: [0.33333333 0.         0.         0.         0.33333333 0.        ]
Field name_tokenized, Term 'complaint' score: [0. 0. 0. 0. 0. 0.]
Term 'complaint' score: [0.33333333 0.         0.         0.         0.33333333 0.        ]
Scores now: [0.83333333 0.75       0.         0.         0.58333333 0.25      ]


Unnamed: 0,name,msg,msg_tokenized,name_tokenized,scores
0,Doug,"Hi this is Doug, I have a complaint about the ...","Terms({'a', 'is', 'i', 'weather', 'this', 'hi'...",Terms({'doug'}),0.833333
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'for', 'tom', 'earth', 'can', 'you', 'i...",Terms({'doug'}),0.75
4,Doug,I have complaints about the ski conditions in ...,"Terms({'ski', 'i', 'west', 'condit', 'about', ...",Terms({'doug'}),0.583333
5,Sue,"Oh doug thats terrible, lets see what we can do.","Terms({'what', 'can', 'do', 'that', 'we', 'dou...",Terms({'sue'}),0.25
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'for', 'tom', 'what', 'can', 'you', 'is...",Terms({'sue'}),0.0
2,Tom,"Tom, can I speak to your manager?","Terms({'tom', 'can', 'your', 'i', 'speak', 'to...",Terms({'tom'}),0.0


### Term-centric - take the max per term

You might hear the odd-sounding word "dismax" in lexical search. It stands for disjunction maximum: "disjunction" roughly means an OR query across fields, and "maximum" means each term takes its best field score rather than the sum of its field scores.

This is often referred to as a [**term centric** search](https://medium.com/@ansuaggarwal/elasticsearch-field-centric-vs-term-centric-approach-f754b6e7d51c).

In [None]:
QUERY = "doug complaint"
FIELDS = ["msg_tokenized", "name_tokenized"]
query_tokenized = even_better_tokenize(QUERY)

# ACCUMULATE SCORES
scores = np.zeros(len(msgs))
for query_token in query_tokenized:
    field_scores = np.zeros(len(msgs))
    for field in FIELDS:
        score = msgs[field].array.score(query_token,
                                        similarity=tf_idf)
        # Take maximum between field_scores and this field's score
        print(f"Field {field}, Term '{query_token}' score: {score}")
        field_scores = np.maximum(field_scores, score)
    print(f"Term '{query_token}' score: {field_scores}")
    scores += field_scores
    print(f"Scores now: {scores}")


msgs['scores'] = scores
msgs.sort_values('scores', ascending=False)

Field msg_tokenized, Term 'doug' score: [0.25 0.5  0.   0.   0.   0.25]
Field name_tokenized, Term 'doug' score: [0.25 0.25 0.   0.   0.25 0.  ]
Term 'doug' score: [0.25 0.5  0.   0.   0.25 0.25]
Scores now: [0.25 0.5  0.   0.   0.25 0.25]
Field msg_tokenized, Term 'complaint' score: [0.33333333 0.         0.         0.         0.33333333 0.        ]
Field name_tokenized, Term 'complaint' score: [0. 0. 0. 0. 0. 0.]
Term 'complaint' score: [0.33333333 0.         0.         0.         0.33333333 0.        ]
Scores now: [0.58333333 0.5        0.         0.         0.58333333 0.25      ]


Unnamed: 0,name,msg,msg_tokenized,name_tokenized,scores
0,Doug,"Hi this is Doug, I have a complaint about the ...","Terms({'a', 'is', 'i', 'weather', 'this', 'hi'...",Terms({'doug'}),0.583333
4,Doug,I have complaints about the ski conditions in ...,"Terms({'ski', 'i', 'west', 'condit', 'about', ...",Terms({'doug'}),0.583333
1,Doug,"Doug, this is Tom, support for Earth's Climate...","Terms({'for', 'tom', 'earth', 'can', 'you', 'i...",Terms({'doug'}),0.5
5,Sue,"Oh doug thats terrible, lets see what we can do.","Terms({'what', 'can', 'do', 'that', 'we', 'dou...",Terms({'sue'}),0.25
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...","Terms({'for', 'tom', 'what', 'can', 'you', 'is...",Terms({'sue'}),0.0
2,Tom,"Tom, can I speak to your manager?","Terms({'tom', 'can', 'your', 'i', 'speak', 'to...",Terms({'tom'}),0.0
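
To compare the two ways of combining fields side by side, here is a small sketch in plain Python. The per-field scores are copied from the printed output above rather than recomputed, and `combine` is a hypothetical helper written for this illustration.

```python
# Per-field, per-term scores copied from the printed output above
field_scores = {
    "msg_tokenized": {
        "doug": [0.25, 0.5, 0.0, 0.0, 0.0, 0.25],
        "complaint": [1 / 3, 0.0, 0.0, 0.0, 1 / 3, 0.0],
    },
    "name_tokenized": {
        "doug": [0.25, 0.25, 0.0, 0.0, 0.25, 0.0],
        "complaint": [0.0] * 6,
    },
}


def combine(field_scores, terms, across_fields, num_docs=6):
    # For each term, collapse the per-field scores of every doc with
    # `across_fields` (sum or max), then add the result to the doc's total
    totals = [0.0] * num_docs
    for term in terms:
        per_field = [scores[term] for scores in field_scores.values()]
        for doc_id, doc_scores in enumerate(zip(*per_field)):
            totals[doc_id] += across_fields(doc_scores)
    return totals


terms = ["doug", "complaint"]
summed = combine(field_scores, terms, sum)
dismax = combine(field_scores, terms, max)
print("sum:", [round(s, 3) for s in summed])
print("max:", [round(s, 3) for s in dismax])
```

When field scores are summed, document 1 (which only matches `doug`, in both fields) scores 0.75 and outranks document 4, which matches both query terms but scores 0.583. Taking the max per term caps the credit for `doug`, so document 4 ties document 0 at 0.583 and moves ahead of document 1 at 0.5.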


## Breadcrumbs for Elasticsearch, Vespa etc

In Elasticsearch, the [multi match](https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-multi-match-query) query is term-centric when using the `cross_fields` type. In Vespa, you control the scoring math directly in a ranking expression, e.g. `bm25(title) + bm25(description) + bm25(tags) + nativeRank`.
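
As a rough sketch, a `cross_fields` request body for the two fields in this notebook might look like the following. The field names `msg` and `name` assume a hypothetical index mapped with those text fields; the dictionary is only built and serialized here, not sent to a cluster.

```python
import json

# Hypothetical request body: assumes an index with text fields "msg" and "name"
search_body = {
    "query": {
        "multi_match": {
            "query": "doug complaint",
            "type": "cross_fields",  # score each term across the listed fields
            "fields": ["msg", "name"],
        }
    }
}

print(json.dumps(search_body, indent=2))
```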