# Re-embed Dad Jokes with Ollama

This notebook re-embeds the Dad Jokes KB (see `Build Dad Jokes KB.ipynb`) using an embedding model served by a locally running Ollama instance!

It creates a new KB (stored in `dad_jokes_ollama.sqlite.gz`) that holds the Ollama embeddings.

In [1]:
import svs

In [2]:
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
)

In [3]:
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# !!!       IF YOU RUN OLLAMA ON A DIFFERENT HOST OR DIFFERENT PORT            !!!
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

# Below is the default. Uncomment it and change it if you need to.
# import os
# os.environ['OLLAMA_BASE_URL'] = 'http://127.0.0.1:11434'
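
Before embedding thousands of documents, it can help to confirm that Ollama is reachable at the configured address and to see which models have been pulled. The following is a minimal sketch using only the standard library and Ollama's `/api/tags` endpoint. The helper names (`resolve_base_url`, `model_names`, `list_local_models`) are hypothetical and are not part of `svs`, and stripping a trailing slash from the URL is an assumption made here for convenience.

```python
import json
import os
import urllib.request
from typing import Optional

DEFAULT_BASE_URL = 'http://127.0.0.1:11434'  # same default as in the cell above


def resolve_base_url() -> str:
    """Return OLLAMA_BASE_URL if set, else the default (trailing slash removed)."""
    return os.environ.get('OLLAMA_BASE_URL', DEFAULT_BASE_URL).rstrip('/')


def model_names(tags_payload: dict) -> list:
    """Extract the model names from a parsed `/api/tags` response."""
    return [m['name'] for m in tags_payload.get('models', [])]


def list_local_models(timeout: float = 5.0) -> Optional[list]:
    """Return the locally pulled model names, or None if the server is unreachable."""
    url = f'{resolve_base_url()}/api/tags'
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return model_names(json.load(resp))
    except OSError:  # covers URLError, refused connections, and timeouts
        return None
```

Calling `list_local_models()` returns `None` when nothing answers at the configured address. If it returns a list that lacks the model you want, run `ollama pull <model>` first.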

## Step 1: Load the Old KB

In [4]:
old_kb = svs.KB('./dad_jokes.sqlite.gz')
old_kb

2025-01-27 19:20:29,561 - svs.util - INFO - resolve_to_local_uncompressed_file('./dad_jokes.sqlite.gz'): found gzipped file
2025-01-27 19:20:29,562 - svs.util - INFO - resolve_to_local_uncompressed_file('./dad_jokes.sqlite.gz'): starting gunzip...
2025-01-27 19:20:29,734 - svs.util - INFO - resolve_to_local_uncompressed_file('./dad_jokes.sqlite.gz'): finished gunzip!


<svs.kb.KB at 0x7a18cab309b0>

## Step 2: Create the New KB

In [5]:
# At the time of writing, the best Ollama embedding models seem to be:
#   - 'nomic-embed-text' or
#   - 'mxbai-embed-large'
#
# But feel free to update the code below with a *new* model if you'd like!
#
# Also note: You have to run `ollama pull <model>` before running this code.
#            Otherwise you'll get an error telling you to do so!

embed_function = svs.make_ollama_embeddings_func(
    model = 'nomic-embed-text',
    truncate = False,
)

new_kb = svs.KB('./dad_jokes_ollama.sqlite', embed_function, force_fresh_db=True)
new_kb

<svs.kb.KB at 0x7a18cab32900>

## Step 3: Copy Old to New

In [6]:
%%time

with old_kb.bulk_query_docs() as old_q:
    with new_kb.bulk_add_docs() as new_add_doc:
        for old_doc in old_q.dfs_traversal():
            new_add_doc(old_doc['text'])

2025-01-27 19:21:34,227 - svs.kb - INFO - starting bulk-add (as new database transaction)
2025-01-27 19:21:34,310 - svs.kb - INFO - getting 4213 document embeddings...
2025-01-27 19:24:20,755 - svs.kb - INFO - *DONE*: got 4213 document embeddings
2025-01-27 19:24:20,755 - svs.kb - INFO - invalidating cached vectors; they'll be re-built next time you `retrieve()`
2025-01-27 19:24:20,756 - svs.kb - INFO - ending bulk-add (committing the database transaction)


CPU times: user 845 ms, sys: 97.7 ms, total: 943 ms
Wall time: 2min 46s
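
To put the wall time in perspective, the two log timestamps around the embedding step give the embedding throughput. The short calculation below uses only values copied from the log output above. It works out to roughly 25 jokes per second, or about 40 ms per joke, on the machine that produced this run.

```python
from datetime import datetime

LOG_TIME_FORMAT = '%Y-%m-%d %H:%M:%S,%f'

# Timestamps of the "getting ..." and "*DONE*: got ..." log lines above.
start = datetime.strptime('2025-01-27 19:21:34,310', LOG_TIME_FORMAT)
end = datetime.strptime('2025-01-27 19:24:20,755', LOG_TIME_FORMAT)
n_docs = 4213

elapsed_seconds = (end - start).total_seconds()
docs_per_second = n_docs / elapsed_seconds
ms_per_doc = 1000 * elapsed_seconds / n_docs

print(f'{elapsed_seconds:.1f} s total, '
      f'{docs_per_second:.1f} docs/s, '
      f'{ms_per_doc:.1f} ms/doc')
```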


In [7]:
old_kb.close()
new_kb.close(vacuum=True, also_gzip=True)

2025-01-27 19:24:20,776 - svs.kb - INFO - invalidating cached vectors; they'll be re-built next time you `retrieve()`
2025-01-27 19:24:21,028 - svs.kb - INFO - invalidating cached vectors; they'll be re-built next time you `retrieve()`
2025-01-27 19:24:21,029 - svs.kb - INFO - KB.close(): starting gzip...
2025-01-27 19:24:21,663 - svs.kb - INFO - KB.close(): finished gzip: dad_jokes_ollama.sqlite.gz
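
For readers curious about what the gzip step amounts to, here is a minimal standard-library sketch that compresses a file to `<name>.gz` next to the original. It illustrates the general technique only. It is not the code `svs` runs inside `KB.close()`, and the helper name `gzip_file` is hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src: str) -> Path:
    """Write `<src>.gz` next to `src` and return the new path.

    This sketch leaves the original file in place.
    """
    src_path = Path(src)
    dst_path = src_path.with_name(src_path.name + '.gz')
    # Stream the bytes through so large files never sit fully in memory.
    with open(src_path, 'rb') as f_in, gzip.open(dst_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst_path
```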


## Demo!

Let's re-open the new KB and run a query against it, just as a quick demo.

In [8]:
kb = svs.KB('./dad_jokes_ollama.sqlite')  # <-- it will remember the embedding func!

In [9]:
%%time

kb.retrieve('cats', n=3)

2025-01-27 19:24:21,675 - svs.kb - INFO - retrieving 3 documents with query string: cats
2025-01-27 19:24:21,677 - svs.kb - INFO - re-building cached vectors...
2025-01-27 19:24:21,929 - svs.kb - INFO - re-building cached vectors... DONE!
2025-01-27 19:24:21,961 - svs.kb - INFO - got embedding for query!
2025-01-27 19:24:21,967 - svs.kb - INFO - computed 4213 cosine similarities
2025-01-27 19:24:21,968 - svs.kb - INFO - retrieved top 3 documents


CPU times: user 246 ms, sys: 17.9 ms, total: 264 ms
Wall time: 293 ms


[{'score': 0.7133579254150391,
  'doc': {'id': 1181,
   'parent_id': None,
   'level': 0,
   'text': 'Siamese cats are a great choice for a cat lover on a budget. You get two for the price of one.',
   'embedding': True,
   'meta': None}},
 {'score': 0.7095077633857727,
  'doc': {'id': 1543,
   'parent_id': None,
   'level': 0,
   'text': 'What do cats call their human form? Their purr-sona.',
   'embedding': True,
   'meta': None}},
 {'score': 0.704696536064148,
  'doc': {'id': 2817,
   'parent_id': None,
   'level': 0,
   'text': 'An English cat named ABC challenges a French cat named 123 to a swim across the English Channel, from the UK to France. They both swim hard, but only the English cat makes it. What happened to the other cat? Well, un deux trois quatre cinq.',
   'embedding': True,
   'meta': None}}]
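
The log lines above mention computing cosine similarities and keeping the top 3 documents. As a rough illustration of that idea, the sketch below scores toy vectors against a query vector and returns the best matches. It is plain Python written for this explanation, not the implementation inside `svs`. The function names are hypothetical, and the code assumes all vectors are non-zero and the same length.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n(query_vec, doc_vecs, n=3):
    """Return the n best (score, doc_id) pairs, highest score first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


# Toy 2-D "embeddings"; real embedding vectors have far more dimensions.
toy_docs = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
best = top_n([1.0, 0.0], toy_docs, n=2)
```

Here document 1 points in the same direction as the query and ranks first. Document 3 shares only part of that direction and ranks second.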

In [10]:
kb.close()

2025-01-27 19:24:21,988 - svs.kb - INFO - invalidating cached vectors; they'll be re-built next time you `retrieve()`
