Anki can export notes to a plain-text `.txt` file. This file contains everything we need to modify the notes and update the Anki database later.

Let's start by checking the content of this file.

In [1]:
!head -n 10 ../data/Selected\ Notes\ v8.txt

#separator:tab
#html:true
#guid column:1
#notetype column:2
#deck column:3
#tags column:9
D?H@y-%%r	KaTeX and Markdown Basic (Color)	Default	"<img src=""paste-d0ff77498ff8dde85ba00ae8b7c4bb6032d8483d.jpg"">"	Headboard				english
IjfKk}wnb@	KaTeX and Markdown Basic (Color)	Default	"<img src=""paste-334a3566ffa4cab66033c10810e8d06af8fda194.jpg"">"	Towel				english
"G1Z_~#;mLc"	KaTeX and Markdown Basic (Color)	Default	"<img src=""paste-d9689dc830d3f333e81b9b7058d5b25517064954.jpg"">"	Jug				english
Azd65{j+,q	KaTeX and Markdown Basic (Color)	Default	Command to create a soft link	```bash<br>$ ln -s &lt;file&gt; &lt;link&gt;<br>```				linux
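
The `#` lines at the top describe how to read the rows that follow: the separator, and which 1-based column holds the GUID, note type, deck, and tags. As a rough illustration of that layout, here is a minimal sketch that parses a shortened, made-up sample with the standard library only. The `parse_export` helper and the sample rows are hypothetical and are not part of `anki_ai`; the sketch also assumes a tab separator and space-separated tags.

```python
import csv
import io

# Shortened, made-up sample that mimics the layout of the export shown above.
sample = (
    "#separator:tab\n"
    "#html:true\n"
    "#guid column:1\n"
    "#notetype column:2\n"
    "#deck column:3\n"
    "#tags column:9\n"
    'id-1\tBasic\tDefault\t"<img src=""a.jpg"">"\tHeadboard\t\t\t\tenglish\n'
    "id-2\tBasic\tDefault\tCommand to create a soft link\tln -s\t\t\t\tlinux\n"
)


def parse_export(text):
    """Split an export into header metadata and note rows (hypothetical helper)."""
    lines = text.splitlines()
    header = {}
    n_header = 0
    for line in lines:
        if not line.startswith("#"):
            break  # the header ends at the first non-comment line
        key, _, value = line[1:].partition(":")
        header[key] = value
        n_header += 1
    # Assumption: the separator is a tab, as declared by "#separator:tab".
    body = io.StringIO("\n".join(lines[n_header:]))
    rows = list(csv.reader(body, delimiter="\t"))
    return header, rows


header, rows = parse_export(sample)

# Header columns are 1-based, Python lists are 0-based.
guid_idx = int(header["guid column"]) - 1
tags_idx = int(header["tags column"]) - 1
parsed = [{"guid": r[guid_idx], "tags": r[tags_idx].split()} for r in rows]
print(parsed)
```

Note how the `csv` module turns the doubled quotes in the image field back into a single pair of quotes around the file name.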


### Load data

We can use the `anki_ai` library to load the notes from the file, and start exploring the content.

In [2]:
from anki_ai.domain.deck import Deck

In [3]:
deck = Deck()
deck.read_txt(fpath="../data/Selected Notes v8.txt", exclude_tags=["personal"])
deck[:10]

[Note(guid='D?H@y-%%r', front='<img src="paste-d0ff77498ff8dde85ba00ae8b7c4bb6032d8483d.jpg">', back='Headboard', tags=['english'], notetype='KaTeX and Markdown Basic (Color)', deck_name='Default'),
 Note(guid='IjfKk}wnb@', front='<img src="paste-334a3566ffa4cab66033c10810e8d06af8fda194.jpg">', back='Towel', tags=['english'], notetype='KaTeX and Markdown Basic (Color)', deck_name='Default'),
 Note(guid='G1Z_~#;mLc', front='<img src="paste-d9689dc830d3f333e81b9b7058d5b25517064954.jpg">', back='Jug', tags=['english'], notetype='KaTeX and Markdown Basic (Color)', deck_name='Default'),
 Note(guid='Azd65{j+,q', front='Command to create a soft link', back='```bash<br>$ ln -s &lt;file&gt; &lt;link&gt;<br>```', tags=['linux'], notetype='KaTeX and Markdown Basic (Color)', deck_name='Default'),
 Note(guid='BGL!8$wV<W', front='In the `ln -s` command, what is the order of file name and link name?', back='```bash<br>$ ln -s &lt;file_name&gt; &lt;link_name&gt;<br>```', tags=['linux'], notetype='KaTe

### Find duplicate notes (using semantic search)

The Anki client offers some basic functionality to identify repeated notes, based on an exact string comparison of a note's front and back fields. This is a good starting point, but it misses notes that are semantically similar, or even equivalent, without matching character for character. That happens frequently when notes are added over a long period of time.
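
To make that limitation concrete, here is a minimal sketch of exact-match duplicate detection. It illustrates the idea only and is not Anki's actual implementation. The first two notes are taken from the deck with simplified backs; the third GUID is a made-up literal copy of the first note.

```python
from collections import defaultdict

notes = [
    {
        "guid": "Azd65{j+,q",
        "front": "Command to create a soft link",
        "back": "ln -s <file> <link>",
    },
    {
        "guid": "BGL!8$wV<W",
        "front": "In the `ln -s` command, what is the order of file name and link name?",
        "back": "ln -s <file_name> <link_name>",
    },
    {
        "guid": "hypothetical-copy",  # made-up literal copy of the first note
        "front": "Command to create a soft link",
        "back": "ln -s <file> <link>",
    },
]


def exact_duplicates(notes):
    """Group GUIDs whose front and back fields match character for character."""
    groups = defaultdict(list)
    for note in notes:
        groups[(note["front"], note["back"])].append(note["guid"])
    return [guids for guids in groups.values() if len(guids) > 1]


print(exact_duplicates(notes))
```

Only the literal copy is reported. The second note asks about the same command in different words, so an exact comparison never pairs it with the first one.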

We can use an embedding model to create sentence embeddings for the note fields and use them to find notes that are semantically very similar even when their text differs. To do that, let's use one of the embedding models in `sentence-transformers` to generate embeddings for the front of our notes, and store them in `qdrant`.
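
The collection in this notebook compares embeddings with cosine distance, so it helps to see what that score measures. Here is a minimal sketch on toy vectors, using only the standard library; real sentence embeddings have many more dimensions, and `cosine_similarity` is a hypothetical helper rather than a function from either library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Vectors that point in the same direction score close to 1 regardless of their length, and unrelated (orthogonal) vectors score 0. A score threshold for duplicates is a cut-off on this value.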

In [4]:
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

In [5]:
from qdrant_client import QdrantClient, models

qdrant = QdrantClient(":memory:")  # create in-memory Qdrant instance for testing

In [6]:
qdrant.create_collection(
    collection_name="anki_deck",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),  # vector size is set by the embedding model
        distance=models.Distance.COSINE,
    ),
)

True

In [7]:
notes = [
    {"guid": n.guid, "front": n.front, "back": n.back, "tags": n.tags} for n in deck
]
notes[:10]

[{'guid': 'D?H@y-%%r',
  'front': '<img src="paste-d0ff77498ff8dde85ba00ae8b7c4bb6032d8483d.jpg">',
  'back': 'Headboard',
  'tags': ['english']},
 {'guid': 'IjfKk}wnb@',
  'front': '<img src="paste-334a3566ffa4cab66033c10810e8d06af8fda194.jpg">',
  'back': 'Towel',
  'tags': ['english']},
 {'guid': 'G1Z_~#;mLc',
  'front': '<img src="paste-d9689dc830d3f333e81b9b7058d5b25517064954.jpg">',
  'back': 'Jug',
  'tags': ['english']},
 {'guid': 'Azd65{j+,q',
  'front': 'Command to create a soft link',
  'back': '```bash<br>$ ln -s &lt;file&gt; &lt;link&gt;<br>```',
  'tags': ['linux']},
 {'guid': 'BGL!8$wV<W',
  'front': 'In the `ln -s` command, what is the order of file name and link name?',
  'back': '```bash<br>$ ln -s &lt;file_name&gt; &lt;link_name&gt;<br>```',
  'tags': ['linux']},
 {'guid': 'be:y>MF$Ae',
  'front': 'In the `zip` command, what is the option to specify the destination?',
  'back': '```bash<br>$ unzip &lt;file&gt; -d &lt;path&gt;<br>```<br><br><img src="paste-92e15adfe1d

In [8]:
qdrant.upload_points(
    collection_name="anki_deck",
    points=[
        models.PointStruct(
            id=idx, vector=encoder.encode(note["front"]).tolist(), payload=note
        )
        for idx, note in enumerate(notes)
    ],
)

In [9]:
hits = qdrant.query_points(
    collection_name="anki_deck",
    query=encoder.encode("attention").tolist(),
    limit=3,
)
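# query_points returns a response object; the scored matches are in `hits.points`
top_hit = hits.points[0]  # print only the best match
print(top_hit.payload, "score:", top_hit.score)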

{'guid': 'IVX4]l>.K$', 'front': 'What does the attention mechanism do?', 'back': 'It lets the decoder assign a different amount of weight, or "attention", to each of the encoder states at every decoding timestep<br><br><img src="paste-6efd6ecd3fe7a7d3dbc1e34b5698ec4f5f2368fc.jpg">', 'tags': ['nlp']} score: 0.6690907923016808
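
Several notes in this deck have a front that is only an `<img>` tag. The text encoder never sees the picture, only the tag and its hashed file name, so these fronts carry no information about what the card shows. One option, not part of the original workflow, is to detect such fronts and leave them out of the semantic comparison. The `is_image_only` helper below is a hypothetical sketch of that check.

```python
import re

# Matches a front made of one or more <img> tags and nothing else.
IMG_ONLY = re.compile(r"^\s*(?:<img\b[^>]*>\s*)+$", re.IGNORECASE)


def is_image_only(front):
    """Return True when the field contains only <img> tags and whitespace."""
    return bool(IMG_ONLY.match(front))


fronts = [
    '<img src="paste-d0ff77498ff8dde85ba00ae8b7c4bb6032d8483d.jpg">',
    "Command to create a soft link",
]
text_fronts = [f for f in fronts if not is_image_only(f)]
print(text_fronts)
```

The same check could be applied when building the list of notes to embed, or inside a duplicate-search loop to skip image-only cards.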


In [10]:
from difflib import Differ
from pprint import pprint

differ = Differ()

cnt = 0
for note in deck:
    hits = qdrant.query_points(
        collection_name="anki_deck",
        query=encoder.encode(note.front).tolist(),
        limit=3,
        score_threshold=0.95,
    )
    if len(hits.points) > 1:  # the note matches itself, so more than one hit means a candidate duplicate
        print(f"ORIGINAL: {note}\n")
        for hit in hits.points:
            if str(note.guid) != hit.payload["guid"]:
                print(f"POTENTIAL DUPLICATE ({hit.score:.2%}): {hit.payload}\n")
                result = differ.compare([note.front], [hit.payload["front"]])
                pprint(list(result))
        print("\n---------------------------------------\n")
        cnt += 1
        if cnt > 10:
            break

ORIGINAL: guid='D?H@y-%%r' front='<img src="paste-d0ff77498ff8dde85ba00ae8b7c4bb6032d8483d.jpg">' back='Headboard' tags=['english'] notetype='KaTeX and Markdown Basic (Color)' deck_name='Default'

POTENTIAL DUPLICATE (98.15%): {'guid': 'L<}Geu>7)g', 'front': '<img src="paste-d0059484db4597ce31817cd328cf2a8d7ca598c6.jpg">', 'back': 'Tubeless', 'tags': ['cycling']}

['- <img src="paste-d0ff77498ff8dde85ba00ae8b7c4bb6032d8483d.jpg">',
 '+ <img src="paste-d0059484db4597ce31817cd328cf2a8d7ca598c6.jpg">']
POTENTIAL DUPLICATE (97.74%): {'guid': 'GRGQ&._SC&', 'front': '<img src="paste-c41ed6497526ac56d4917668c24c3dfd8718bc7e.jpg">', 'back': 'Fence', 'tags': ['english']}

['- <img src="paste-d0ff77498ff8dde85ba00ae8b7c4bb6032d8483d.jpg">',
 '+ <img src="paste-c41ed6497526ac56d4917668c24c3dfd8718bc7e.jpg">']

---------------------------------------

ORIGINAL: guid='G1Z_~#;mLc' front='<img src="paste-d9689dc830d3f333e81b9b7058d5b25517064954.jpg">' back='Jug' tags=['english'] notetype='KaTeX and M