Anki can export notes to a plain-text `.txt` file. This file contains everything we need to modify the notes and update Anki's database later.

Let's start by checking the content of this file.

In [1]:
!head -n 25 ../data/Selected\ Notes.txt

#separator:tab
#html:true
#tags column:6
"<img src=""paste-d0ff77498ff8dde85ba00ae8b7c4bb6032d8483d.jpg"">"	Headboard&nbsp;				english
"<img src=""paste-334a3566ffa4cab66033c10810e8d06af8fda194.jpg"">"	Towel				english
"<img src=""paste-d9689dc830d3f333e81b9b7058d5b25517064954.jpg"">"	Jug				english
What command does create a soft link?	```bash<br>$ ln -s &lt;file_name&gt; &lt;link_name&gt;<br>```				linux
In the `ln -s` command, what is the order of file name and link name?	```bash<br>$ ln -s &lt;file_name&gt; &lt;link_name&gt;<br>```				linux
In the `zip` command, what is the option to specify the destination?	"```bash<br>$ unzip &lt;file&gt;<span style=""color: rgb(0, 0, 0);""> -d &lt;path&gt;<br>```<br><br></span><img src=""paste-92e15adfe1d216e9ba6f170e4033b292b7b15756.jpg"">"				linux
What command does extract files from a zip archive?	```bash<br>$ unzip &lt;file&gt;<br>```				linux
What is the command to list the content of a directory?	```bash<br>$ ls &lt;path&gt;<br>```				lin
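
The header lines starting with `#` describe how the rest of the file is laid out: fields are tab-separated, they contain HTML, and tags live in column 6. Fields that contain double quotes are wrapped in quotes, with the inner quotes doubled. As a rough sketch of how such a file could be read with the standard library alone (this is not how `anki_ai` does it; `parse_anki_txt` and `SAMPLE` are hypothetical names introduced here, and the tab separator is assumed rather than read from the header):

```python
import csv
import io

# Two rows copied from the export above, preceded by the same header lines.
SAMPLE = (
    "#separator:tab\n"
    "#html:true\n"
    "#tags column:6\n"
    '"<img src=""paste-334a3566ffa4cab66033c10810e8d06af8fda194.jpg"">"\tTowel\t\t\t\tenglish\n'
    '"<img src=""paste-d9689dc830d3f333e81b9b7058d5b25517064954.jpg"">"\tJug\t\t\t\tenglish\n'
)


def parse_anki_txt(text):
    """Split an Anki text export into header options and data rows."""
    header = {}
    data_lines = []
    for line in text.splitlines():
        # Header lines look like "#key:value" and only appear before the data.
        if line.startswith("#") and not data_lines:
            key, _, value = line[1:].partition(":")
            header[key] = value
        else:
            data_lines.append(line)
    # Assumption: the separator is a tab, as declared in the sample header.
    # csv.reader un-doubles the quotes inside quoted fields.
    reader = csv.reader(io.StringIO("\n".join(data_lines)), delimiter="\t")
    return header, list(reader)


header, rows = parse_anki_txt(SAMPLE)
tags_index = int(header["tags column"]) - 1  # the header counts columns from 1
for row in rows:
    print(row[0], "|", row[1], "|", row[tags_index])
```

The tags column is 1-indexed in the header, hence the `- 1` before using it as a list index.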

### Load data

We can use the `anki_ai` library to load the notes from the file, and start exploring the content.

In [2]:
from anki_ai.domain.model import Deck

In [3]:
deck = Deck("default")
deck.read_txt(fpath="../data/Selected Notes.txt")
deck[:20]

[Note(uuid=56ba9f7a-e06e-4b0e-b904-56e95d78186f, front=What command does create a soft link?, back=```bash<br>$ ln -s &lt;file_name&gt; &lt;link_name&gt;<br>```, tags=['linux'],
 Note(uuid=399eb632-f31f-4557-9d0c-9a102f0de372, front=In the `ln -s` command, what is the order of file name and link name?, back=```bash<br>$ ln -s &lt;file_name&gt; &lt;link_name&gt;<br>```, tags=['linux'],
 Note(uuid=434bd9e0-9eff-4242-9943-c50a98a20ba8, front=What command does extract files from a zip archive?, back=```bash<br>$ unzip &lt;file&gt;<br>```, tags=['linux'],
 Note(uuid=145059c2-9165-4805-aa07-8a8cb5d4c775, front=What is the command to list the content of a directory?, back=```bash<br>$ ls &lt;path&gt;<br>```, tags=['linux'],
 Note(uuid=704f1d0b-bca6-4aae-98c6-b825fd66f7ae, front=What is the command to print text to the terminal window?, back=```bash<br>$ echo ...<br>```, tags=['linux'],
 Note(uuid=85ba69bb-d55c-408e-b3f3-6726e21d5750, front=What is the command to create a new file?, back=```ba

### Find duplicate notes (using semantic search)

The Anki client offers basic functionality for identifying repeated notes, based on an exact string comparison of the front and back fields. This is a good starting point, but it misses notes that are semantically similar, or even identical in meaning, without matching character for character. Such near-duplicates become common as notes accumulate over a long period of time.
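
To make the limitation concrete, here is a small sketch using two fronts from this deck that ask the same question with two words swapped. The variable names are illustrative only, and the word-level comparison is a quick sanity check, not a measure of semantic similarity:

```python
front_a = "In the `ln -s` command, what is the order of file name and link name?"
front_b = "In the `ln -s` command, what is the order of link name and file name?"

# An exact string comparison treats these as two unrelated notes.
exact_match = front_a == front_b

# Yet both fronts are built from exactly the same words, in a different order.
same_words = sorted(front_a.split()) == sorted(front_b.split())

print("exact match:", exact_match)
print("same words:", same_words)
```

An exact comparison reports no match here, even though both notes test the same piece of knowledge.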

We can use an embedding model to create sentence embeddings for the front and back fields and identify notes that are semantically very similar, even when they are not lexically equal. To do that, let's use one of the embedding models in `sentence-transformers` to generate embeddings for the front of our notes, and store them in Qdrant.

In [4]:
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")


In [5]:
from qdrant_client import QdrantClient, models

qdrant = QdrantClient(":memory:")  # create in-memory Qdrant instance for testing

In [6]:
qdrant.create_collection(
    collection_name="anki_deck",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),  # vector size is defined by the model used
        distance=models.Distance.COSINE,
    ),
)

True
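
The collection is configured with `Distance.COSINE`, so the scores returned by a search reflect the angle between two embedding vectors and not their length. As a rough illustration of what that score measures, here is a plain-Python sketch on toy 2-dimensional vectors. The `cosine_similarity` helper is hypothetical and written for this example; Qdrant computes the score internally, and real embeddings have many more dimensions:

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score close to 1, whatever their length.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors score 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```

A score near 1 therefore means two fronts point in almost the same direction in embedding space, which is what a high similarity threshold selects for.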

In [7]:
notes = [
    {"uuid": n.uuid, "front": n.front, "back": n.back, "tags": n.tags} for n in deck
]
notes[:10]

[{'uuid': UUID('56ba9f7a-e06e-4b0e-b904-56e95d78186f'),
  'front': 'What command does create a soft link?',
  'back': '```bash<br>$ ln -s &lt;file_name&gt; &lt;link_name&gt;<br>```',
  'tags': ['linux']},
 {'uuid': UUID('399eb632-f31f-4557-9d0c-9a102f0de372'),
  'front': 'In the `ln -s` command, what is the order of file name and link name?',
  'back': '```bash<br>$ ln -s &lt;file_name&gt; &lt;link_name&gt;<br>```',
  'tags': ['linux']},
 {'uuid': UUID('434bd9e0-9eff-4242-9943-c50a98a20ba8'),
  'front': 'What command does extract files from a zip archive?',
  'back': '```bash<br>$ unzip &lt;file&gt;<br>```',
  'tags': ['linux']},
 {'uuid': UUID('145059c2-9165-4805-aa07-8a8cb5d4c775'),
  'front': 'What is the command to list the content of a directory?',
  'back': '```bash<br>$ ls &lt;path&gt;<br>```',
  'tags': ['linux']},
 {'uuid': UUID('704f1d0b-bca6-4aae-98c6-b825fd66f7ae'),
  'front': 'What is the command to print text to the terminal window?',
  'back': '```bash<br>$ echo ...<br>`

In [8]:
qdrant.upload_points(
    collection_name="anki_deck",
    points=[
        models.PointStruct(
            id=idx, vector=encoder.encode(note["front"]).tolist(), payload=note
        )
        for idx, note in enumerate(notes)
    ],
)

In [9]:
hits = qdrant.search(
    collection_name="anki_deck",
    query_vector=encoder.encode("attention").tolist(),
    limit=3,
)
for hit in hits:
    print(hit.payload, "score:", hit.score)

{'uuid': '6e4bbe15-9ebe-4c9f-a37b-fe69b4a72be4', 'front': 'What are the three main types of attention mechanisms?', 'back': '* Bidirectional (unmasked) self-attention<br>* Unidirectional (masked) self-attention<br>* Cross-attention', 'tags': ['llm']} score: 0.5707765511858129
{'uuid': 'eb36119f-c1f6-4ab5-9580-b21e4f95ecd8', 'front': 'What is the purpose of the query, key, and value vectors in attention mechanisms?', 'back': 'To compute the relevance of context tokens and combine their information', 'tags': ['llm']} score: 0.5335623046635898
{'uuid': '89d152a4-ddc5-4c9d-aa1c-38e479274507', 'front': '"What did the ""Attention is all you need"" paper showed?"', 'back': 'That the Transformer architecture outperformed recurrent neural networks (RNNs) on machine translation tasks, both in terms of translation quality and training cost', 'tags': ['nlp']} score: 0.5147806504913404


In [10]:
from difflib import Differ
from pprint import pprint

differ = Differ()

for note in deck:
    hits = qdrant.search(
        collection_name="anki_deck",
        query_vector=encoder.encode(note.front).tolist(),
        limit=3,
        score_threshold=0.95,
    )
    if len(hits) > 1:  # the note always matches itself, so more than one hit means a candidate duplicate
        print(f"ORIGINAL: {note}\n")
        for hit in hits:
            if str(note.uuid) != hit.payload["uuid"]:
                print(f"POTENTIAL DUPLICATE ({hit.score:.2%}): {hit.payload}\n")
                result = differ.compare([note.front], [hit.payload["front"]])
                pprint(list(result))
        print("---------------------------------------\n")

ORIGINAL: Note(uuid=399eb632-f31f-4557-9d0c-9a102f0de372, front=In the `ln -s` command, what is the order of file name and link name?, back=```bash<br>$ ln -s &lt;file_name&gt; &lt;link_name&gt;<br>```, tags=['linux']

POTENTIAL DUPLICATE (99.61%): {'uuid': '35e9f3a7-2bc0-4716-b7ae-392dafac5be7', 'front': 'In the `ln -s` command, what is the order of link name and file name?', 'back': '```bash<br>$ ln -s &lt;file_name&gt; &lt;link_name&gt;<br>```', 'tags': ['linux']}

['- In the `ln -s` command, what is the order of file name and link name?',
 '?                                              ^ ^^           ^^^\n',
 '+ In the `ln -s` command, what is the order of link name and file name?',
 '?                                              ^ ^^          ++ ^\n']
---------------------------------------

ORIGINAL: Note(uuid=fedd3223-56f5-4332-832d-752aa4dfb7ea, front="How can we compute the dot product between&nbsp;<span style=""color: rgb(32, 33, 34);"">two vectors&nbsp;</span>$\vec{a}$ and