In [1]:
pip install krixik









In [2]:
import sys 
sys.path.append('..')
from dotenv import load_dotenv
import os
load_dotenv()

LUCAS_STAGING_API_KEY=os.getenv('LUCAS_STAGING_API_KEY')
LUCAS_STAGING_API_URL=os.getenv('LUCAS_STAGING_API_URL')

# import Krixik
from krixik import krixik
krixik.init(api_key = LUCAS_STAGING_API_KEY, 
            api_url = LUCAS_STAGING_API_URL)

import json
def json_print(data):
    print(json.dumps(data, indent=2))

%load_ext autoreload
%autoreload 2 

SUCCESS: You are now authenticated.


---


# Three Types of Keyword Search

What is Keyword Search?

ChatGPT 3.5 defines it thus: "*Keyword search works by entering words or phrases into a search engine, which then scans its index or database for relevant content containing those keywords.*"

Accurate enough. In other words, to break it down:

- A keyword index is generated from an input file. The index lists every keyword that appears at least once in the document, along with the number of each line on which it appears.
- The user inputs a query, which is broken down into keywords.
- For every query keyword that is in the index, the search returns that keyword and every line number on which it appears in the original document.
- If every query keyword is in the document, the final output lists each query keyword individually, with all the line numbers on which it appears.

[What is a keyword? A keyword is every word that is not a ["stop word"](https://krixik-docs.readthedocs.io/en/latest/system/search_methods/keyword_search_method/#stop-words). Stop words are words so common that keyword search algorithms ignore them.]
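The steps above can be sketched in a few lines of plain Python. This is a minimal illustration of the general idea, not Krixik's implementation: the tiny `STOP_WORDS` set, the regex tokenizer, and the function names `build_keyword_index` and `keyword_search` are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "on", "in", "is", "it", "to"}


def tokenize(line):
    """Lowercase a line and split it into words."""
    return re.findall(r"[a-z']+", line.lower())


def build_keyword_index(text):
    """Map each keyword to the 1-based line numbers it appears on."""
    index = defaultdict(list)
    for line_number, line in enumerate(text.splitlines(), start=1):
        # set() so a word repeated within one line is recorded once for that line
        for word in set(tokenize(line)):
            if word not in STOP_WORDS:
                index[word].append(line_number)
    return dict(index)


def keyword_search(index, query):
    """Return {keyword: line_numbers} for each query keyword found in the index."""
    keywords = [w for w in tokenize(query) if w not in STOP_WORDS]
    return {k: index[k] for k in keywords if k in index}


sample_text = "The horse grazed on the green pasture.\nA green hill rose behind the horse."
index = build_keyword_index(sample_text)
print(keyword_search(index, "the green horse unicorn"))
# prints {'green': [1, 2], 'horse': [1, 2]}
```

Note that "the" is dropped as a stop word and "unicorn" is silently skipped because it never appears in the sample text.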

### Three Types of Keyword Search

We've all used keyword search before; it's the "basic" search function. A user has a word or phrase in mind, types it into a search bar on a website (e.g. Wikipedia), and the results returned all include the words searched for, likely ranked in some order.

In this article we'll look at three different ways to think of and build keyword search. This is by no means an exhaustive list; our intention is to highlight that, even with something as common and often-used as keyword search, developers with the right tools can always build something better or more accurate for their particular use case.

They are the following:

- Single-word keyword search (a.k.a. vanilla keyword search)
- Multi-word phrase search
- Multi-word cluster search

[Note that these are not standard labels; they just sound reasonable at this writing.]

Let's take a look at each one of these. Although they vary in level of complexity, they are all keyword search functions that we can build with [Krixik](https://krixik-docs.readthedocs.io/en/latest/). We'll show you how to do just that.

### Single-Word Keyword Search

This is the "vanilla" keyword search. After creating an index out of a document (which in Krixik is done by [processing](https://krixik-docs.readthedocs.io/en/latest/system/parameters_processing_files_through_pipelines/process_method/) a file through a keyword search [pipeline](https://krixik-docs.readthedocs.io/en/latest/system/pipeline_creation/components_of_a_krixik_pipeline/)), a query is submitted and searched against that index.

The search function then goes down the keyword index, checking for each of the queried words at every step of the way. Each time there's a match, it returns to the user the word, the line number, and, since we're using Krixik [keyword search](https://krixik-docs.readthedocs.io/en/latest/modules/database_modules/keyword-db_module/), the word number within the line. There is no ranking or sorting of any kind: the output is in document order, so a word that's printed on Line 1 of the document is returned before a word that's printed on Line 2.
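To make the "document order" behavior concrete, here is a small standalone sketch in plain Python. It is not Krixik code. The function name `positional_hits` is hypothetical. The sketch assumes word numbers are 1-based and count every word on the line, and that the query already consists of keywords only.

```python
import re


def positional_hits(text, query):
    """Return (keyword, line_number, word_number) for every match, in document order."""
    wanted = set(re.findall(r"[a-z']+", query.lower()))
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        words = re.findall(r"[a-z']+", line.lower())
        for word_number, word in enumerate(words, start=1):
            if word in wanted:
                hits.append((word, line_number, word_number))
    return hits


sample_text = "The horse grazed on the green pasture.\nA green hill rose behind the horse."
for hit in positional_hits(sample_text, "horse green"):
    print(hit)
# prints:
# ('horse', 1, 2)
# ('green', 1, 6)
# ('green', 2, 2)
# ('horse', 2, 7)
```

The hits come back in the order they occur in the text, regardless of the order of the words in the query.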

To show this in action, we'll first need to [create](https://krixik-docs.readthedocs.io/en/latest/system/pipeline_creation/create_pipeline/) a simple [single-module](https://krixik-docs.readthedocs.io/en/latest/examples/single_module_pipelines/single_keyword-db/) Krixik [keyword search](https://krixik-docs.readthedocs.io/en/latest/modules/database_modules/keyword-db_module/) pipeline. That's accomplished as follows:

In [3]:
# instantiate a single-module pipeline with a keyword-db module
pipeline_1 = krixik.create_pipeline(name='my_simple_keyword-db_pipeline',
                                    module_chain=['keyword-db'])

The pipeline is now instantiated. Let's process a file through it. We'll go with something meaty, like the [Project Gutenberg](https://gutenberg.org/) version of Jane Austen's [<u>Pride and Prejudice</u>](https://www.gutenberg.org/ebooks/1342):

In [4]:
# process Pride and Prejudice through the pipeline
pipeline_1.process(local_file_path='./test_files/Pride_and_Prejudice.txt')

INFO: hydrated input modules: {'module_1': {'model': 'sqlite', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_iljdbepopt.txt
INFO: expire_time was not set by user - setting to default of 1800 seconds
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 1800 seconds, at Sun Jun  9 22:15:26 2024 UTC
INFO: my_simple_keyword-db_pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: 89b91968-78ca-b302-c27e-17657cce35e4
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - module_1 processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output downloaded.


{'status_code': 200,
 'pipeline': 'my_simple_keyword-db_pipeline',
 'request_id': '47f8acf7-7985-476e-b414-acdacd1fe2f8',
 'file_id': 'd4c5baf5-e4e1-441f-a1ea-b3a81baf3483',
 'message': 'SUCCESS - output fetched for file_id d4c5baf5-e4e1-441f-a1ea-b3a81baf3483.Output saved to location(s) listed in process_output_files.',
 'process_output': None,
 'process_output_files': ['c:\\Users\\Lucas\\Desktop\\Content/d4c5baf5-e4e1-441f-a1ea-b3a81baf3483.db']}

Lastly, our query. We'll use a few words that are likely to appear often in a novel like this one:

In [5]:
# run a vanilla keyword search query on the novel we've just processed through the pipeline

pipeline_1.keyword_search(query='woman stared gentleman wealthy landscape sadness romance heart scoundrel',
                          symbolic_directory_paths=['/*'])

{'status_code': 500,
 'request_id': '6b2b8b35-347e-44e1-83c1-2b16aea71550',
 'message': 'FAILURE: Error querying user files',
 'items': []}

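When the query succeeds, vanilla keyword search works as described: the output is our queried words listed in order of appearance (with as many repetitions as is applicable for this file). Note that the run captured above returned a `500` status with an empty `items` list, so it does not display those results.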
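Since a search response carries a `status_code`, a `message`, and an `items` list, it can be worth checking the status before reading any results. The helper below is a hypothetical convenience written for this article, not part of the Krixik client. It assumes a successful response uses status code `200` (as the process call earlier does) and keeps its results under `items`.

```python
def extract_items(response):
    """Return the `items` list from a search response, or raise if the request failed."""
    status = response.get("status_code")
    if status != 200:
        raise RuntimeError(f"keyword search failed ({status}): {response.get('message')}")
    return response.get("items", [])


# A response shaped like the failed one shown above
failed_response = {
    "status_code": 500,
    "message": "FAILURE: Error querying user files",
    "items": [],
}

try:
    extract_items(failed_response)
except RuntimeError as err:
    print(err)
# prints: keyword search failed (500): FAILURE: Error querying user files
```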

### Multi-Word Phrase Search

What if you only want words when they appear together in a certain order? For instance, you may wish to search for "the horse grazed on the green pasture", but not want individual instances of "horse", "grazed", "green", or "pasture". You only want instances where that specific combination of words appears.

You can also build this version of keyword search with [Krixik](https://krixik-docs.readthedocs.io/en/latest/), but it'll take more than a single-module pipeline. You'll need to leverage the output of basic keyword search, which importantly includes line numbers *and word numbers*, and add a couple of extra steps to put it together.
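As a rough illustration of how line and word numbers make phrase matching possible, here is a standalone sketch in plain Python. It is not the Krixik workflow. The `keyword_hits` function is a stand-in for keyword search output, and the sketch assumes that output can be reduced to `(keyword, line_number, word_number)` tuples in which word numbers count every word on the line. The stop-word list and all function names are hypothetical.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "on", "in", "is", "it", "to"}


def words_of(line):
    return re.findall(r"[a-z']+", line.lower())


def keyword_hits(text):
    """Stand-in for keyword search output: (keyword, line_number, word_number) tuples."""
    return [
        (word, line_number, word_number)
        for line_number, line in enumerate(text.splitlines(), start=1)
        for word_number, word in enumerate(words_of(line), start=1)
        if word not in STOP_WORDS
    ]


def phrase_search(hits, phrase):
    """Return the line number of each place where the phrase's keywords line up."""
    # Expected keyword at each phrase position; None marks a stop word (absent from hits).
    expected = [None if w in STOP_WORDS else w for w in words_of(phrase)]
    # Trim leading and trailing stop words so the pattern starts and ends on a keyword.
    while expected and expected[0] is None:
        expected.pop(0)
    while expected and expected[-1] is None:
        expected.pop()
    if not expected:
        return []
    occupied = {(line, pos): word for word, line, pos in hits}
    lines = []
    for word, line, pos in hits:
        if word == expected[0] and all(
            occupied.get((line, pos + offset)) == want
            for offset, want in enumerate(expected)
        ):
            lines.append(line)
    return lines


sample_text = (
    "The horse grazed on the green pasture.\n"
    "A green pasture fed the horse.\n"
    "The horse grazed near the green pasture."
)
print(phrase_search(keyword_hits(sample_text), "the horse grazed on the green pasture"))
# prints [1]
```

Because stop words never reach the index, this approach cannot tell "on the" apart from "in the". It only checks that the keywords sit at the right relative positions and that no other keyword sits between them. It also only matches phrases that fall within a single line.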

We begin by ###Need Jeremy's Help Here. Section probably calls for a couple different code cells and short explanatory markdown cells###

In [None]:
# First of probably a few different code cells laying out how to accomplish the above.

As expected, our output is a listing of only the line numbers in which our queried-for phrase shows up exactly as input, ranked in order of appearance. In this case, that phrase is only present ###once/twice/thrice###.

### Multi-Word Cluster Search

Multi-word cluster search is what you would use if you wanted to find *clusters* of words. In other words, you have a set of words to search for, and you want to know where several of those words are grouped together, whether within a single line or across neighboring lines.

As opposed to the two previous types of keyword search, where results are presented in order of appearance, our results here should be ranked by cluster size. In other words, if I search for 20 different words, then output should first show any clusters in which there are e.g. 20 words, then 19, etc. The numbers wouldn't work quite like this, given that you must allow for word repetition, but that's the general idea.

You might also wish to add an additional argument, `window_size`. With it you could determine how many words or lines you are allowing for the clusters to form in. So if there's a 15-word cluster that's spread across 60 words and your `window_size` is 80, that would be returned just fine, but if it were the same scenario with a `window_size` of 40, then that cluster would not be returned as a single cluster, given that the 15 keywords are never within 40 total words of each other.
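The ranking and `window_size` ideas can be sketched in plain Python as follows. This is one possible interpretation, not a Krixik feature. The function name `cluster_search` and the `min_keywords` parameter are hypothetical. The sketch measures `window_size` in words counted across the whole document, assumes the query consists of keywords only, and ranks clusters by their number of *distinct* query keywords.

```python
import re


def cluster_search(text, query, window_size=40, min_keywords=2):
    """Return clusters of query keywords, largest distinct-keyword count first."""
    wanted = set(re.findall(r"[a-z']+", query.lower()))

    # Collect (global_word_position, line_number, keyword) for every match, in document order.
    hits = []
    position = 0
    for line_number, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[a-z']+", line.lower()):
            position += 1
            if word in wanted:
                hits.append((position, line_number, word))

    # One candidate per starting hit: all hits within window_size words of it.
    candidates = []
    for i, (start, _, _) in enumerate(hits):
        j = i
        while j < len(hits) and hits[j][0] < start + window_size:
            j += 1
        distinct = {word for _, _, word in hits[i:j]}
        if len(distinct) >= min_keywords:
            candidates.append((len(distinct), i, j))

    # Rank by distinct-keyword count, then greedily keep clusters that share no hits.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    used, clusters = set(), []
    for size, i, j in candidates:
        if used.isdisjoint(range(i, j)):
            used.update(range(i, j))
            clusters.append({
                "distinct_keywords": size,
                "lines": sorted({line for _, line, _ in hits[i:j]}),
                "keywords": sorted({word for _, _, word in hits[i:j]}),
            })
    return clusters


sample_text = (
    "The wealthy gentleman stared at the landscape.\n"
    "Nothing happened for a long while.\n"
    "Her heart was full of sadness."
)
query = "wealthy gentleman landscape heart sadness"
for cluster in cluster_search(sample_text, query, window_size=6):
    print(cluster)
# prints:
# {'distinct_keywords': 3, 'lines': [1], 'keywords': ['gentleman', 'landscape', 'wealthy']}
# {'distinct_keywords': 2, 'lines': [3], 'keywords': ['heart', 'sadness']}
```

With `window_size=20` the same query returns a single five-keyword cluster spanning lines 1 and 3, which mirrors the trade-off described above: a wider window merges keywords that a narrower window keeps apart.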

Let's take a look at how we would build this with [Krixik](https://krixik-docs.readthedocs.io/en/latest/). As a hint, you would once again need to leverage line and word numbers.

First you ###Jeremy help here please###

In [None]:
# First of probably a few different code cells laying out how to accomplish the above.

The queried-for keywords have been returned as we had hoped, with the biggest, densest clusters at the top of our output.

### Conclusion

The above are three different ways to think about keyword search and to build with the basic keyword search algorithm, in this case with [Krixik](https://krixik-docs.readthedocs.io/en/latest/) [keyword search](https://krixik-docs.readthedocs.io/en/latest/modules/database_modules/keyword-db_module/) [pipelines](https://krixik-docs.readthedocs.io/en/latest/system/pipeline_creation/components_of_a_krixik_pipeline/).

Do any of these work for what you seek to accomplish? Could you further build towards what you need, either further developing one of the above or building in another direction altogether?

Even with other more evolved search methods arising, keyword search will continue to be a valuable tool in developer arsenals. So think creatively about what you could accomplish with it... the ideas in this article are only an appetizer.