In [1]:
%%capture --no-stdout
%reload_ext watermark
%watermark -uniz --author "Prayson W. Daniel" -vm -p ollama,torch

Author: Prayson W. Daniel

Last updated: 2024-11-14T19:38:08.956407+01:00

Python implementation: CPython
Python version       : 3.11.10
IPython version      : 8.29.0

ollama: 0.3.3
torch : 2.5.1

Compiler    : Clang 15.0.0 (clang-1500.3.9.4)
OS          : Darwin
Release     : 23.5.0
Machine     : arm64
Processor   : arm
CPU cores   : 16
Architecture: 64bit



# Using BM25s, Embedding + Cosine, LLM (via Ollama)
```sh
ollama pull mxbai-embed-large
````

In [2]:
import bm25s
import ollama
import torch
import torch.nn.functional as F

In [3]:
tests_cases = {
    "case1": "I like ice cream",
    "case2": "I don't like ice cream",
    "case3": "I hate ice cream",
    "case4": "I enjoy ice cream",
}

In [4]:
# Best Matching 25 (= TF-IDF with document length normalization) does not capture negation.
# 🤷🏾‍♂️ Removing stopwords and applying lemmatization aims to improve matching accuracy.
# Case1 and Case4 are incorrectly far apart with a score of 0.08731071 (Fails to capture semantic meaning).

corpus = [case for case in tests_cases.values()]
retriever = bm25s.BM25(corpus=corpus)
retriever.index(bm25s.tokenize(corpus))

In [5]:
query = tests_cases.get("case1")
retriever.retrieve(bm25s.tokenize(query), k=4)

Results(documents=array([['I like ice cream', "I don't like ice cream",
        'I enjoy ice cream', 'I hate ice cream']], dtype='<U22'), scores=array([[0.37451115, 0.32753414, 0.08731071, 0.08731071]], dtype=float32))

In [6]:
# Embedding with Cosine Similarity
# Case1 and Case4 are correctly identified as similar with a score of 0.9540 (captures semantic meaning).
# However, Case1 and Case2 are also closely matched, even though I would prefer them to be further apart due to negation.


results = {
    case: torch.Tensor(
        ollama.embeddings(
            model="nomic-embed-text",
            prompt=blob,
        ).get("embedding")
    ).reshape(1, -1)
    for case, blob in tests_cases.items()
}

In [7]:
(
    F.cosine_similarity(results["case1"], results["case1"]),
    F.cosine_similarity(results["case1"], results["case2"]),
    F.cosine_similarity(results["case1"], results["case3"]),
    F.cosine_similarity(results["case1"], results["case4"]),
)

(tensor([1.0000]), tensor([0.8803]), tensor([0.8436]), tensor([0.9540]))

In [8]:
# Power of LLM
# Case1 and Case4 are correctly identified as similar.
# Bravo 🤘🏾 Case1 and Case2 are considered dissimilar.

In [9]:
client = ollama.Client(host="http://localhost:11434")

In [10]:
blob1 = tests_cases.pop("case1")
for case, blob in tests_cases.items():
    print("\n -- case -- \n")
    print(f"- case1: {blob1}")
    print(f"  {case}: {blob}\n")

    prompt = f"How similiar are: {blob1} and {blob}. Give a short reasons why. ca. 2 sentence"

    output = client.generate(
        model="llama3.2",
        prompt=prompt,
        options={
            "seed": 42,
            "temperature": 0.0,
            "num_ctx": 2048,  # must be set for reproducibility
        },
    )

    print(output["response"])


 -- case -- 

- case1: I like ice cream
  case2: I don't like ice cream

These two statements are contradictory, making them highly dissimilar. The similarity lies in the fact that they both express a preference or lack thereof for ice cream, but the tone and intention behind each statement are opposite, with one being positive and the other negative.

 -- case -- 

- case1: I like ice cream
  case3: I hate ice cream

These two statements are contradictory, as they express opposite emotions towards the same thing (ice cream). The similarity lies in their extreme nature, with one statement being an enthusiastic endorsement and the other a strong rejection, highlighting the intensity of the speaker's feelings.

 -- case -- 

- case1: I like ice cream
  case4: I enjoy ice cream

The two sentences are very similar, as they both express a positive sentiment towards ice cream. The only difference is that the first sentence uses "like" to describe the preference, while the second sentence us