# On-Device RAG-Powered AI Agent Demo

In this notebook, we explore how to build your own on-device, LLM-powered AI agent that uses RAG (Retrieval-Augmented Generation) to correctly answer questions about the characters of the Wizarding World of Harry Potter.

To do this, we are going to combine [Ollama](https://github.com/ollama/ollama) as our local inference engine, [Gemma](https://ai.google.dev/gemma) as our local LLM, our newly released [RETSim](https://arxiv.org/abs/2311.17264) ultra-fast near-duplicate text embeddings, and [USearch](https://github.com/unum-cloud/usearch) for efficient indexing and retrieval. 

For those who want a more detailed write-up, you can read it at [Wingardium Trivia-osa! On-Device Sorting Hatbot Powered by Gemma, Ollama, USearch, and RETSim](https://elie.net/blog/ai/wingardium-trivia-osa-on-device-sorting-hatbot-powered-by-gemma-ollama-usearch-and-retsim)

## Setup

First things first, we are installing the packages we need for Ollama to run Gemma locally, and UniSim to index data with RETSim and retrieve it with USearch.

In [None]:
# installing dependencies
!pip install -U tqdm Iprogress unisim ollama tabulate

In [22]:
# Importing libraries
import json
from tabulate import tabulate
from tqdm.auto import tqdm
import ollama
import unisim

## Pre-Flight checks
Let's quickly test that Ollama, Gemma, and UniSim are all set up and working, downloading the latest Gemma version if needed.



In [55]:
# Making sure Gemma is installed with Ollama otherwise installing it
MODEL = 'gemma'
try:
    ollama.show(MODEL)
except Exception as e:
    print(f"can't find {MODEL}: {e} installing it")
    ollama.pull(MODEL)
info = ollama.show(MODEL)
print(f"{MODEL.capitalize()} {info['details']['parameter_size']} loaded")

Gemma 9B loaded


In [24]:
# small wrapper function to make generation easier and check it all works
# we use generate as we are going for a RAG style system so streaming is not useful
def generate(prompt: str) -> str:
    res = ollama.generate(model=MODEL, prompt=prompt)
    if res['done']:
        return res['response']
    else:
        raise Exception(f"Generation failed: {res['done_reason']}")

generate("Hello Gemma it is Elie")

"Hello Elie! 👋 It's lovely to hear from you. How can I help you today? 😊"

In [25]:
# initializing TextSim for near-duplicate text similarity
VERBOSE = True  # interactive demo so we want to see what happens
txtsim = unisim.TextSim(verbose=VERBOSE)
# check it works as intended
sim_value = txtsim.similarity("Gemma", "Gemmaa")
if sim_value > 0.9:
    print(f"Similarity {sim_value} - TextSim works as intended")
else:
    print(f"Similarity {sim_value} - Something is very wrong with TextSim")

[Loading model]
|-model_id: text/retsim/v1
|-model path: /Users/elieb/git/unisim/unisim/embedder/models/text/retsim/v1.onnx
|-input: ['unk__340', 512, 24], tensor(float)
|-output: ['unk__341', 256], tensor(float)
INFO: UniSim is storing a copy of the indexed data
INFO: If you are using large data corpus, consider disabling this behavior using store_data=False
INFO: Accelerator is not available, using CPU
Similarity 0.9258707761764526 - TextSim works as intended
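
To build some intuition for why a one-letter typo still scores high, here is a toy stand-in. It is not RETSim: it represents each string as character-trigram counts and compares them with cosine similarity, whereas RETSim uses a learned neural embedding. The helper names `trigram_vector` and `cosine` are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter


def trigram_vector(text: str) -> Counter:
    """Count the overlapping character trigrams of a padded, lowercased string."""
    padded = f"  {text.lower().strip()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# A near-duplicate should score well above an unrelated string
typo_score = cosine(trigram_vector("Gemma"), trigram_vector("Gemmaa"))
other_score = cosine(trigram_vector("Gemma"), trigram_vector("Ollama"))
print(f"typo: {typo_score:.2f} / unrelated: {other_score:.2f}")
```

The absolute numbers differ from what TextSim reports; only the ordering (typo variant closer than an unrelated word) carries over.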


## Testing the model without retrieval augmentation

Before building the RAG pipeline, let’s evaluate how much Gemma knows about the Wizarding World by asking a few questions of increasing difficulty. Some questions contain typos so we can also see how they affect the model's performance. Each question has a `type` field that describes what kind of test it is.

In [26]:
questions = [
    {'q': 'Which School is Harry Potter part of?', 'type': 'basic fact'},
    {'q': 'Who is ermionne?', 'type': 'typo'},
    {'q': 'What is Aberforth job?', 'type': 'harder fact'},
    {'q': 'what is dubldore job?', 'type': 'harder fact and typo'},
    {'q': 'Which school is  Nympadora from?', 'type': 'hard fact'},
]

### Direct Generation Answers

In [56]:
# Let’s run those through Gemma via Ollama and see what type of answer we get.
for q in tqdm(questions, desc="Generating direct answers"):
    q['direct'] = generate(q['q'])


Generating direct answers: 100%|██████████| 5/5 [00:28<00:00,  5.74s/it]


### Questions Results

In [28]:
print("[answers without retrieval]\n")
for q in questions:
    a =  q['direct'][:100].replace('\n', ' ')
    print(f"Q:{q['q']}? (type: {q['type']})")
    print(f"Direct answer: {a}..")
    print("")

[answers without retrieval]

Q:Which School is Harry Potter part of?? (type: basic fact)
Direct answer: Hogwarts School of Witchcraft and Wizardry is the school that Harry Potter attends...

Q:Who is ermionne?? (type: typo)
Direct answer: Ermionne is a French fashion designer known for her colorful and playful designs, primarily focused ..

Q:What is Aberforth job?? (type: harder fact)
Direct answer: Aberforth is a fictional character in the Harry Potter series of books and films. He does not have a..

Q:what is dubldore job?? (type: harder fact and typo)
Direct answer: **Dublador** is a voice actor who provides voices for characters in animated films, television shows..

Q:Which school is  Nympadora from?? (type: hard fact)
Direct answer: Nympadora is a character from the book series "Harry Potter" and did not attend any school. She is a..



> Overall, the results are not great, so let's use UniSim and RETSim to index the Harry Potter data and get better results using the RAG pattern.

**Note**: Some of those mistakes can probably be reduced and the answers improved by using a better prompt. Feel free to experiment with prompt tuning. 
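
As one possible starting point for that experimentation, the sketch below wraps a raw question in a short instruction that gives the model some context. The `build_prompt` helper and its wording are assumptions made for illustration; they have not been evaluated against Gemma, so treat the phrasing as something to iterate on.

```python
def build_prompt(question: str, domain: str = "the Harry Potter books and films") -> str:
    """Wrap a raw question with a short instruction that sets the context."""
    question = question.strip()
    return (
        f"You are answering trivia questions about {domain}. "
        "Names in the question may contain typos; assume the closest matching character. "
        f"Answer in one sentence.\nQuestion: {question}"
    )


print(build_prompt("Who is ermionne?"))
```

The resulting string can be passed to the `generate()` wrapper defined earlier in place of the bare question.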

## Indexing Harry Potter characters data 

The first step in building our RAG pipeline is to load the data, compute the embeddings, and index them. We index only the character names using RETSim embeddings; during retrieval, we return the data associated with the matched name to give the LLM additional context.

The data used is from the Kaggle [Characters in Harry Potter Books dataset](https://www.kaggle.com/datasets/zez000/characters-in-harry-potter-books)

Each character record has a name and a few additional fields. Our game plan is to use UniSim's TextSim to perform typo-resilient name lookup and retrieve the relevant fields to help Gemma answer questions about the characters.


| Field               | Description                | Retrieval strategy |
| :------------------ | :------------------------- | :----------------- |
| Name                | Character name             | UniSim embedding   |
| Descr               | Character info             | Retrieved          |
| Link                | Link to wiki               | Retrieved          |
| Gender              | Character gender           | Retrieved          |
| Species/Race        | Character species or race  | Retrieved          |
| Blood School        | Magic school               | Retrieved          |
| Profession          | Character profession       | Retrieved          |



In [58]:
with open('data/harry_potter_characters.json') as f:  # close the file once loaded
    raw_data = json.load(f)
CHARACTERS_INFO = {}  # we are deduping the data using the name as key
for d in raw_data:
    name = d['Name'].lower().strip()
    CHARACTERS_INFO[name] = d
print(f'{len(CHARACTERS_INFO)} characters loaded from harry_potter_characters.json')

1350 characters loaded from harry_potter_characters.json
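
To see what keying on the normalized name does to duplicates, here is a small self-contained sketch. The `index_by_name` helper and the sample records are made up for illustration and are not taken from the dataset; the point is only that a later record with the same normalized name silently overwrites the earlier one.

```python
def index_by_name(records: list[dict]) -> tuple[dict, int]:
    """Key records by normalized name; return the index and the collision count."""
    index = {}
    collisions = 0
    for record in records:
        key = record["Name"].lower().strip()
        if key in index:
            collisions += 1  # the later record replaces the earlier one
        index[key] = record
    return index, collisions


# Made-up sample records, for illustration only
sample = [
    {"Name": "Hannah Abbott", "Descr": "first copy"},
    {"Name": " hannah abbott ", "Descr": "second copy"},
    {"Name": "Abraxas Malfoy", "Descr": "only copy"},
]
index, collisions = index_by_name(sample)
print(f"{len(index)} unique names, {collisions} duplicate(s) overwritten")
```

If keeping the first occurrence matters more than keeping the last, checking `key in index` before assigning would be the natural variation.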


In [59]:
# indexing data with TextSim
txtsim.reset_index()  # clean up in case we run this cell multiple times
idx = txtsim.add(list(CHARACTERS_INFO.keys()))
txtsim.indexed_data[:10]  # display what we added

['mrs. abbott',
 'hannah abbott',
 'abel treetops',
 'euan abercrombie',
 'aberforth dumbledore',
 'abernathy',
 'abraham peasegood',
 'abraham potter',
 'abraxas malfoy',
 'achilles tolliver']

Let’s quickly write a `lookup()` function to make it easier to query the index,
and test it on one of my favorite characters, Newt Scamander, but with a typo in his name.



In [60]:
# writing a small lookup function wrapper and testing it
def lookup(name: str, k: int = 3, threshold: float = 0.9, verbose: bool = False) -> list[dict]:
    """Return the records of the indexed names that best match `name`."""
    data = []
    query = [name.lower().strip()]  # search expects a list of queries
    lkp = txtsim.search(query, similarity_threshold=threshold, k=k)
    res = lkp.results[0]  # single query, so a single result
    # visualize the results for the query using .visualize
    if verbose:
        txtsim.visualize(res)
    for m in res.matches:
        if m.is_match:
            data.append(CHARACTERS_INFO[m.data])

    # no match above the threshold? then let's use the best-ranked result
    if not data:
        data.append(CHARACTERS_INFO[res.matches[0].data])
    return data

r = lookup("New Scamramber", verbose=True)   # verbose to show all the matches
print('')
print('[best lookup result]')
print(f"name: {r[0]['Name']} / School: {r[0]['School']} / Profession: {r[0]['Profession']}")
print(f"Description: {r[0]['Descr']}")

Query 0: "new scamramber"
Most similar matches:

  idx  is_match      similarity  text
-----  ----------  ------------  -----------------------
 1005  False               0.81  newt scamander
 1006  False               0.71  newt scamander's mother
 1172  False               0.63  sam

[best lookup result]
name: Newt Scamander / School: Hogwarts - Hufflepuff / Profession: Magizoologist
Description: Newton “Newt” Scamander is a famous magizoologist and author of Fantastic Beasts and Where To Find Them (PS5) as well as a number of other books. Now retired, Scamander lives in Dorset with his wife Porpentina (FB). He received the Order of Merlin, second… 
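
The threshold-then-fallback behavior seen above (no candidate reaches 0.9, so the top-ranked one is used) can be sketched without UniSim. This toy version swaps in `difflib.SequenceMatcher` from the standard library as the similarity measure, so its scores will not match the RETSim ones; `toy_lookup` is a hypothetical name, and the candidate list is copied from the output above.

```python
from difflib import SequenceMatcher

# Candidate names copied from the visualized matches above
NAMES = ["newt scamander", "newt scamander's mother", "sam"]


def toy_lookup(query: str, names: list[str], k: int = 3, threshold: float = 0.9) -> list[str]:
    """Return the names scoring at or above threshold, or the single best name if none do."""
    query = query.lower().strip()
    scored = sorted(
        ((SequenceMatcher(None, query, name).ratio(), name) for name in names),
        reverse=True,
    )[:k]
    matches = [name for score, name in scored if score >= threshold]
    # Fallback: nothing cleared the threshold, so keep the top-ranked candidate
    if not matches and scored:
        matches = [scored[0][1]]
    return matches


print(toy_lookup("New Scamramber", NAMES))
```

Always falling back to the best candidate keeps the pipeline answering, at the cost of occasionally returning an unrelated character when the query name is not in the index at all.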


## Building a RAG with Gemma and UniSim

The RAG implementation consists of four steps:
1. Ask Gemma for the name of the character so we can look it up. Given that we have access to a powerful LLM, using it to extract the named entity is, I think, the simplest and most robust way to do so.

2. Retrieve the nearest match info from our UniSim index.

3. Replace the name in the user query with the looked-up name to fix the typo (a very important and often overlooked step), then inject the retrieved information into the query.

4. Answer the user’s question and impress them with our extensive knowledge of the Wizarding World of Harry Potter!

This translates to the following simple code, which relies on the `generate()` and `lookup()` helper functions defined earlier.

In [61]:
def rag(prompt: str, k: int = 5, threshold: float = 0.9, verbose: bool = False) -> str:
    # normalizing the prompt
    prompt = prompt.lower().strip()

    # ask Gemma who is the character
    char_prompt = f"In the following sentence: '{prompt}' who is the subject? reply only with name."
    if verbose:
        print(f"Char prompt: {char_prompt}")
    character = generate(char_prompt)
    if verbose:
        print(f"Character: '{character}'")

    # lookup the character
    data = lookup(character, k=k, threshold=threshold, verbose=verbose)

    # augmented prompt
    # replace the name in the prompt with the one in the rag
    prompt = prompt.replace(character.lower().strip(), data[0]['Name'].lower().strip())

    aug_prompt = f"Using the following data: {data} answer the following question: '{prompt}'. Don't mention your sources - just the answer."

    if verbose:
        print(f"Augmented prompt: {aug_prompt}")
    response = generate(aug_prompt)

    return response
# rag(questions[-1]['q'], verbose=True)
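
One fragile spot in `rag()` is the `prompt.replace(...)` call: it only fixes the typo if the name returned by the model appears verbatim in the prompt. Replies that include markdown emphasis (the direct answers above show Gemma using `**`), quotes, or a trailing period would not match. The sketch below shows one possible cleanup step; `clean_entity` is a hypothetical helper, not part of the original pipeline.

```python
import re


def clean_entity(raw: str) -> str:
    """Normalize an LLM-extracted name: drop markup, quotes, and edge punctuation."""
    stripped = raw.strip()
    text = stripped.splitlines()[0] if stripped else ""  # keep the first line only
    text = re.sub(r'[*_`"]', "", text)                   # markdown emphasis and double quotes
    text = text.strip(" .,:;!?'")                        # punctuation left at the edges
    return re.sub(r"\s+", " ", text).lower()             # collapse whitespace, lowercase


for reply in ["**Dubldore**.", " 'Ermionne'\n", "Nympadora"]:
    print(repr(clean_entity(reply)))
```

Under this assumption, `character = clean_entity(generate(char_prompt))` could replace the plain `generate(char_prompt)` call before the lookup and the name substitution.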

## RAG answers vs direct answer generation
Let's see our RAG in action and compare it to the directly generated answers we got before.


In [33]:
# generate direct and RAG answers using Gemma
for q in tqdm(questions, desc="Generating rag answers"):
    q['direct'] = generate(q['q'])  # redoing in case the questions were changed
    q['rag'] = rag(q['q'])


Generating rag answers: 100%|██████████| 5/5 [00:28<00:00,  5.69s/it]


Here are the answers we got from Gemma with retrieval and how they compare to the directly generated ones.
Overall, the RAG-powered answers are more precise (Harry is correctly placed in Hogwarts - Gryffindor),
have the right context (e.g., Albus Dumbledore's job), and can answer very specific questions (Nymphadora Tonks's school).


In [34]:
print("[Answers with RAG vs without]\n")
for q in questions:
    a =  q['direct'][:100].replace('\n', ' ')
    print(f"Q:{q['q']}? (type: {q['type']})")
    print(f"Direct answer: {a}..")
    r =  q['rag'][:100].replace('\n', ' ')
    print(f'RAG answer: {r}')
    print("")

[Answers with RAG vs without]

Q:Which School is Harry Potter part of?? (type: basic fact)
Direct answer: Hogwarts School of Witchcraft and Wizardry is the school that Harry Potter attends...
RAG answer: Harry Potter is part of Hogwarts - Gryffindor.

Q:Who is ermionne?? (type: typo)
Direct answer: Ermionne is a fictional character from Greek mythology, most famously known for her role in the stor..
RAG answer: Hermione Granger is a resourceful, principled, and brilliant witch known for her academic prowess an

Q:What is Aberforth job?? (type: harder fact)
Direct answer: Aberforth is a fictional character in the Harry Potter series of books and films. He does not have a..
RAG answer: Aberforth was a barman.

Q:what is dubldore job?? (type: harder fact and typo)
Direct answer: **Dublador** is a voice-over artist who provides voices for characters, narration, or other elements..
RAG answer: Headmaster at Hogwarts School.

Q:Which school is  Nympadora from?? (type: hard fact)
Direct answer: N