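# Typo-resilient RAG with Gemma and UniSim

This notebook uses UniSim's TextSim for typo-resilient name lookup and feeds the retrieved records to Gemma (served through ollama) to answer questions about Harry Potter characters.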

In [210]:
# installing dependencies
!pip install -U tqdm Iprogress unisim ollama tabulate


In [1]:
# Importing libraries
import json
from tabulate import tabulate
from tqdm.auto import tqdm
import ollama
import unisim




INFO: Loaded backend
INFO: Using ONNX with CPU


## Pre-Flight checks
Quick checks that ollama, Gemma, and UniSim are all set up and working.

In [212]:
# Make sure Gemma is available in ollama; pull it if it is missing
MODEL = 'gemma'
try:
    ollama.show(MODEL)
except Exception as e:
    print(f"Can't find {MODEL} ({e}); pulling it")
    ollama.pull(MODEL)
ollama.show(MODEL)



{'license': 'Gemma Terms of Use \n\nLast modified: February 21, 2024\n\nBy using, reproducing, modifying, distributing, performing or displaying any portion or element of Gemma, Model Derivatives including via any Hosted Service, (each as defined below) (collectively, the "Gemma Services") or otherwise accepting the terms of this Agreement, you agree to be bound by this Agreement.\n\nSection 1: DEFINITIONS\n1.1 Definitions\n(a) "Agreement" or "Gemma Terms of Use" means these terms and conditions that govern the use, reproduction, Distribution or modification of the Gemma Services and any terms and conditions incorporated by reference.\n\n(b) "Distribution" or "Distribute" means any transmission, publication, or other sharing of Gemma or Model Derivatives to a third party, including by providing or making Gemma or its functionality available as a hosted service via API, web access, or any other electronic or remote means ("Hosted Service").\n\n(c) "Gemma" means the set of machine learni

In [213]:
# Small wrapper to make generation easier and to check that everything works.
# We use generate() without streaming: a RAG-style pipeline needs the complete response.
def generate(prompt: str) -> str:
    res = ollama.generate(model=MODEL, prompt=prompt)
    if res['done']:
        return res['response']
    else:
        raise RuntimeError(f"Generation failed: {res['done_reason']}")

generate("Hello Gemma it is Elie")

"Hello Elie! 👋 It's lovely to hear from you. How can I help you today?"

In [2]:
# Initializing TextSim for near-duplicate text similarity
VERBOSE = True  # interactive demo, so we want to see what happens
txtsim = unisim.TextSim(verbose=VERBOSE)
# check that it works as intended
sim_value = txtsim.similarity("Gemma", "Gemmaa")
if sim_value > 0.9:
    print(f"Similarity {sim_value} - TextSim works as intended")
else:
    print(f"Similarity {sim_value} - Something is very wrong with TextSim")

[Loading model]
|-model_id: text/retsim/v1
|-model path: /Users/elieb/git/unisim/unisim/embedder/models/text/retsim/v1.onnx
|-input: ['unk__340', 512, 24], tensor(float)
|-output: ['unk__341', 256], tensor(float)
INFO: UniSim is storing a copy of the indexed data
INFO: If you are using large data corpus, consider disabling this behavior using store_data=False
INFO: Accelerator is not available, using CPU
Similarity 0.9258707761764526 - TextSim works as intended
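
As a rough intuition for why "Gemma" and "Gemmaa" score high, the sketch below computes a cosine similarity over character-bigram counts. This is only a stand-in: TextSim uses a learned embedding model, so its scores will differ, and the helper names `char_bigrams` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def char_bigrams(text: str) -> Counter:
    """Count overlapping character bigrams of the normalized text."""
    text = text.lower().strip()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Bigram counts are NOT the TextSim embedding; they only illustrate the idea.
near = cosine_similarity(char_bigrams("Gemma"), char_bigrams("Gemmaa"))
far = cosine_similarity(char_bigrams("Gemma"), char_bigrams("Hogwarts"))
print(f"near-duplicate: {near:.2f}, unrelated: {far:.2f}")
```

A near-duplicate shares most of its bigrams with the original, so its score stays close to 1, while an unrelated string shares none.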


## Testing the model without retrieval augmentation

In [215]:
questions = [
    {'q': 'Which School is Harry Potter part of?', 'type': 'basic fact'},
    {'q': 'Who is ermionne?', 'type': 'typo'},
    {'q': 'What is Aberforth job?', 'type': 'harder fact'},
    {'q': "'what is dubldore job?'", 'type': 'harder fact and typo'},
]

In [216]:
# generate direct answers using Gemma
for q in tqdm(questions, desc="Generating direct answers"):
    q['direct'] = generate(q['q'])


Generating direct answers: 100%|██████████| 4/4 [00:16<00:00,  4.19s/it]


In [217]:
print("[Answers without retrieval]\n")
for q in questions:
    a = q['direct'][:100].replace('\n', ' ')
    print(f"Q: {q['q']} (type: {q['type']})")
    print(f"Direct answer: {a}..")
    print("")

[Answers without retrieval]

Q: Which School is Harry Potter part of? (type: basic fact)
Direct answer: Hogwarts School of Witchcraft and Wizardry..

Q: Who is ermionne? (type: typo)
Direct answer: Errmionne is a character in the League of Legends universe. She is a young woman from the Freljord r..

Q: What is Aberforth job? (type: harder fact)
Direct answer: Aberforth Dumbledore is a wizard and a Potions Master at Hogwarts School of Witchcraft and Wizardry...

Q: 'what is dubldore job?' (type: harder fact and typo)
Direct answer: **Dublador** is a voice-over artist who records dialogue for animated films, television shows, video..



**The results are poor: Gemma hallucinates on the misspelled names and gets Aberforth's job wrong. Let's use UniSim to index Harry Potter character data and improve the answers with the RAG pattern.**

## Indexing Harry Potter characters data 
Data from Kaggle: [Characters in Harry Potter Books](https://www.kaggle.com/datasets/zez000/characters-in-harry-potter-books)

Each character has a name and a few descriptive fields. Our game plan is to use
UniSim's TextSim to perform typo-resilient name lookup and retrieve the relevant fields
to help Gemma answer questions about the characters.


| Field               | Description           | Retrieval strategy |
| :------------------ | :-------------------- | :----------------- |
| Name                | Character name        | UniSim embedding   |
| Descr               | Character description | Retrieved          |
| Link                | Link to wiki          | Retrieved          |
| Gender              | Character gender      | Retrieved          |
| Species/Race        | Species or race       | Retrieved          |
| Blood               | Blood status          | Retrieved          |
| School              | Magic school          | Retrieved          |
| Profession          | Profession            | Retrieved          |


In [3]:
# load the raw records and dedupe them, using the normalized name as key
with open('data/harry_potter_characters.json') as f:
    raw_data = json.load(f)
CHARACTERS_INFO = {}  # a later record overwrites an earlier one with the same name
for d in raw_data:
    name = d['Name'].lower().strip()
    CHARACTERS_INFO[name] = d
print(f'{len(CHARACTERS_INFO)} characters loaded from harry_potter_characters.json')

1350 characters loaded from harry_potter_characters.json
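
Keying a dictionary on the normalized name silently keeps only the last record for each name. The sketch below shows the same deduplication on toy records and also reports which keys collided. The `dedupe_by_name` helper and the toy records are hypothetical and are not taken from the Kaggle dataset.

```python
def dedupe_by_name(records: list[dict]) -> tuple[dict, list[str]]:
    """Key records by normalized name; a later record overwrites an earlier one."""
    by_name: dict[str, dict] = {}
    collisions: list[str] = []
    for record in records:
        key = record["Name"].lower().strip()
        if key in by_name:
            collisions.append(key)  # remember which names appeared more than once
        by_name[key] = record
    return by_name, collisions


# Hypothetical toy records for illustration only.
toy_records = [
    {"Name": "Hannah Abbott", "School": "Hogwarts"},
    {"Name": " hannah abbott ", "School": "Hogwarts - Hufflepuff"},
    {"Name": "Abernathy", "School": "Unknown"},
]
characters, collisions = dedupe_by_name(toy_records)
print(f"{len(characters)} unique names, collisions: {collisions}")
```

Tracking collisions is optional, but it makes it easy to see how many records the name-based key dropped.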


In [4]:
# indexing the character names with TextSim
txtsim.reset_index()  # clean up in case we run this cell multiple times
idx = txtsim.add(list(CHARACTERS_INFO.keys()))
txtsim.indexed_data[:10]  # display what we added

['mrs. abbott',
 'hannah abbott',
 'abel treetops',
 'euan abercrombie',
 'aberforth dumbledore',
 'abernathy',
 'abraham peasegood',
 'abraham potter',
 'abraxas malfoy',
 'achilles tolliver']

In [7]:
# Small lookup wrapper around TextSim search, tested below
def lookup(name: str, k: int = 3, threshold: float = 0.9, verbose: bool = False) -> list[dict]:
    data = []
    query = [name.lower().strip()]
    lkp = txtsim.search(query, similarity_threshold=threshold, k=k)
    res = lkp.results[0]  # single query, so a single result
    if verbose:
        txtsim.visualize(res)  # print the match table for this query
    for m in res.matches:
        if m.is_match:
            data.append(CHARACTERS_INFO[m.data])

    # no match above the threshold? fall back to the best-scoring candidate
    if not data:
        data.append(CHARACTERS_INFO[res.matches[0].data])
    return data

lookup("abraxxas malfoy", verbose=True)   # verbose to show all the matches

Query 0: "abraxxas malfoy"
Most similar matches:

  idx  is_match      similarity  text
-----  ----------  ------------  ---------------
    8  True                0.99  abraxas malfoy
  904  False               0.72  nicholas malfoy
  260  False               0.72  brutus malfoy


[{'Name': 'Abraxas Malfoy',
  'Link': 'https://www.hp-lexicon.org/character/malfoy-family/abraxas-malfoy/',
  'Descr': 'Abraxas Malfoy was a wizard who was believed to be involved in a plot which led to the first Muggle-born Minister for Magic, Nobby Leach, leaving office owing to a mysterious illness (Pm). Abraxas died of Dragon Pox (HBP9). ',
  'Gender': 'Male',
  'Species/Race': 'Wizard',
  'Blood': 'Pure blood',
  'School': 'Unknown',
  'Profession': 'Unknown'}]
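
The lookup above combines two rules: keep every candidate at or above the threshold, and if none qualifies, fall back to the single best candidate. The sketch below reproduces that logic with the standard library's `difflib.SequenceMatcher` as a stand-in scorer. This is an assumption for illustration only: the scores differ from TextSim's, and `lookup_sketch` is a hypothetical name.

```python
from difflib import SequenceMatcher


def lookup_sketch(query: str, names: list[str], k: int = 3, threshold: float = 0.9) -> list[str]:
    """Return names scoring >= threshold among the top k, else the best candidate."""
    query = query.lower().strip()
    # Score every name, then keep the k highest-scoring candidates.
    scored = sorted(
        ((SequenceMatcher(None, query, name).ratio(), name) for name in names),
        reverse=True,
    )[:k]
    matches = [name for score, name in scored if score >= threshold]
    if not matches and scored:
        matches = [scored[0][1]]  # fallback: best candidate below the threshold
    return matches


names = [
    "abraxas malfoy",
    "nicholas malfoy",
    "brutus malfoy",
    "albus dumbledore",
    "aberforth dumbledore",
]
print(lookup_sketch("abraxxas malfoy", names))  # small typo: clears the threshold
print(lookup_sketch("dubldore", names))         # heavy typo: falls back to the best candidate
```

The fallback keeps the pipeline from returning nothing, at the cost of occasionally returning a wrong character when the query is far from every indexed name.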

## Gemma + UniSim RAG

In [202]:
def rag(prompt: str, k: int = 5, threshold: float = 0.9, verbose: bool = False) -> str:
    # normalize the prompt
    prompt = prompt.lower().strip()

    # ask Gemma which character the question is about
    char_prompt = f"In the following sentence: '{prompt}' who is the subject? reply only with name."
    if verbose:
        print(f"Char prompt: {char_prompt}")
    character = generate(char_prompt).strip()
    if verbose:
        print(f"Character: '{character}'")

    # look up the character
    data = lookup(character, k=k, threshold=threshold, verbose=verbose)

    # replace the (possibly misspelled) name in the prompt with the retrieved one
    extracted = character.lower()
    if extracted:  # replacing an empty string would insert the name between every character
        prompt = prompt.replace(extracted, data[0]['Name'].lower().strip())

    # augmented prompt
    aug_prompt = f"Using the following data: {data} answer the following question: '{prompt}'. Don't mention your sources - just the answer."

    if verbose:
        print(f"Augmented prompt: {aug_prompt}")
    response = generate(aug_prompt)

    return response


rag('what is dubldore job?', verbose=True)

Char prompt: In the following sentence: 'what is dubldore job?' who is the subject? reply only with name.
Character: 'Dubldore'
Query 0: "dubldore"
Most similar matches:

  idx  is_match      similarity  text
-----  ----------  ------------  --------------------
   32  False               0.74  albus dumbledore
    4  False               0.72  aberforth dumbledore
   86  False               0.71  ariana dumbledore
  114  False               0.69  aurelius dumbledore
  433  False               0.69  kendra dumbledore
Augmented prompt: Using the following data: [{'Name': 'Albus Dumbledore', 'Link': 'https://www.hp-lexicon.org/character/dumbledore-family/albus-dumbledore/', 'Descr': 'Albus Dumbledore was the Headmaster of Hogwarts for over thirty years, a time period that encompassed both of Voldemort’s attempts to take over the Wizarding world. Considered to be the most powerful wizard of his time, Dumbledore was awarded the Order of Merlin, First Class, and was the Supreme Mugwump… ', '

'Albus Dumbledore was the Headmaster at Hogwarts School.'
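
The prompt-building step can be isolated from the model calls and tested on its own. The sketch below is a minimal version of that step under stated assumptions: `build_augmented_prompt` is a hypothetical helper, and the record is a trimmed toy example rather than a real dataset entry.

```python
def build_augmented_prompt(question: str, extracted_name: str, records: list[dict]) -> str:
    """Swap the extracted (possibly misspelled) name for the retrieved one, then add the data."""
    question = question.lower().strip()
    extracted = extracted_name.lower().strip()
    canonical = records[0]["Name"].lower().strip()
    if extracted:  # str.replace with an empty needle would insert text between every character
        question = question.replace(extracted, canonical)
    return (
        f"Using the following data: {records} answer the following question: "
        f"'{question}'. Don't mention your sources - just the answer."
    )


# Trimmed toy record; the 'Profession' value is an assumption for illustration.
toy_records = [{"Name": "Albus Dumbledore", "Profession": "Headmaster"}]
augmented = build_augmented_prompt("what is dubldore job?", "Dubldore\n", toy_records)
print(augmented)
```

Separating this step makes it possible to check the name substitution without running Gemma or TextSim.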

In [205]:
# generate direct and RAG answers using Gemma
for q in tqdm(questions, desc="Generating rag answers"):
    q['direct'] = generate(q['q'])  # regenerated in case the questions were changed
    q['rag'] = rag(q['q'])


Generating rag answers: 100%|██████████| 4/4 [00:23<00:00,  5.90s/it]


In [208]:
print("[Answers with RAG vs without]\n")
for q in questions:
    a = q['direct'][:100].replace('\n', ' ')
    print(f"Q: {q['q']} (type: {q['type']})")
    print(f"Direct answer: {a}..")
    r = q['rag'][:100].replace('\n', ' ')
    print(f'RAG answer: {r}')
    print("")

[Answers with RAG vs without]

Q: Which School is Harry Potter part of? (type: basic fact)
Direct answer: Hogwarts School of Witchcraft and Wizardry..
RAG answer: Hogwarts - Gryffindor

Q: Who is ermionne? (type: typo)
Direct answer: Ermionne is a character from Greek mythology, most famously known for her role in the story of Herac..
RAG answer: Hermione Granger is a resourceful, principled and brilliant witch known for her intelligence and clo

Q: What is Aberforth job? (type: harder fact)
Direct answer: Aberforth Dumbledore is a character in the Harry Potter series of books and films. He is an eccentri..
RAG answer: Aberforth Dumbledore was a barman at the Hog's Head in Hogsmeade.

Q: 'what is dubldore job?' (type: harder fact and typo)
Direct answer: **Dubldore is a customer engagement platform that helps businesses create interactive digital experi..
RAG answer: Albus Dumbledore was the Headmaster at Hogwarts School.
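
One crude way to summarize the comparison is to check whether each answer contains a keyword one would expect in a correct reply. The sketch below does this on the answers printed above. The keywords are assumptions chosen for illustration, `hit_rate` is a hypothetical helper, and substring matching is only a rough proxy for correctness.

```python
# Answers copied from the comparison printed above (already truncated to 100 characters).
# The "keyword" values are assumptions picked for this illustration.
rows = [
    {"keyword": "hogwarts",
     "direct": "Hogwarts School of Witchcraft and Wizardry",
     "rag": "Hogwarts - Gryffindor"},
    {"keyword": "hermione",
     "direct": "Ermionne is a character from Greek mythology, most famously known for her role in the story of Herac",
     "rag": "Hermione Granger is a resourceful, principled and brilliant witch known for her intelligence and clo"},
    {"keyword": "barman",
     "direct": "Aberforth Dumbledore is a character in the Harry Potter series of books and films. He is an eccentri",
     "rag": "Aberforth Dumbledore was a barman at the Hog's Head in Hogsmeade."},
    {"keyword": "headmaster",
     "direct": "**Dubldore is a customer engagement platform that helps businesses create interactive digital experi",
     "rag": "Albus Dumbledore was the Headmaster at Hogwarts School."},
]


def hit_rate(rows: list[dict], column: str) -> float:
    """Fraction of rows whose answer in `column` contains the expected keyword."""
    hits = sum(row["keyword"] in row[column].lower() for row in rows)
    return hits / len(rows)


print(f"direct: {hit_rate(rows, 'direct'):.2f}, rag: {hit_rate(rows, 'rag'):.2f}")
```

With only four questions this is a sanity check rather than an evaluation, but the same loop extends to a larger question set.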

