Anki notes can be exported in a convenient `.txt` format. This file contains everything we need to modify the notes and later update the database.

Let's start by checking the content of this file.

In [1]:
!head -n 25 ../data/Selected\ Notes.txt

#separator:tab
#html:true
#tags column:6
"<img src=""paste-d0ff77498ff8dde85ba00ae8b7c4bb6032d8483d.jpg"">"	Headboard&nbsp;				english
"<img src=""paste-334a3566ffa4cab66033c10810e8d06af8fda194.jpg"">"	Towel				english
"<img src=""paste-d9689dc830d3f333e81b9b7058d5b25517064954.jpg"">"	Jug				english
What command does create a soft link?	```bash<br>$ ln -s &lt;file_name&gt; &lt;link_name&gt;<br>```				linux
In the `ln -s` command, what is the order of file name and link name?	```bash<br>$ ln -s &lt;file_name&gt; &lt;link_name&gt;<br>```				linux
In the `zip` command, what is the option to specify the destination?	"```bash<br>$ unzip &lt;file&gt;<span style=""color: rgb(0, 0, 0);""> -d &lt;path&gt;<br>```<br><br></span><img src=""paste-92e15adfe1d216e9ba6f170e4033b292b7b15756.jpg"">"				linux
What command does extract files from a zip archive?	```bash<br>$ unzip &lt;file&gt;<br>```				linux
What is the command to list the content of a directory?	```bash<br>$ ls &lt;path&gt;<br>```				lin

### Load data

We can use the `anki_ai` library to load the notes from the file, and start exploring the content.

In [2]:
from anki_ai.domain.model import Note, Deck

In [3]:
deck = Deck("default")
deck.from_txt(fpath="../data/Selected Notes.txt")
deck._collection[:20]

Was not able to process line 0: #separator:tab

Was not able to process line 1: #html:true

Was not able to process line 2: #tags column:6

Was not able to process line 366: "<div>What pandas DataFrame method and arguments can be used to create a histogram with 50 bins for the 'age' column of a DataFrame called 'users'?</div>

Was not able to process line 378: "What is the command to return <b><font color=""#ef2929"">all unique values</font></b> for a variable?"	"<center><table class=""highlighttable""><tbody><tr><td><div class=""linenodiv"" style=""background-color: #f0f0f0; padding-right: 10px""><pre style=""line-height: 125%"">1</pre></div></td><td class=""code""><div class=""highlight"" style=""background: #f8f8f8""><pre style=""line-height: 125%"">transactions<span style=""color: #666666"">.</span>t_dat<span style=""color: #666666"">.</span>unique()

Was not able to process line 379: </pre></div>

Was not able to process line 380: </td></tr></tbody></table></center><br>"				pandas

[Note(front=What command does create a soft link?, back=```bash<br>$ ln -s &lt;file_name&gt; &lt;link_name&gt;<br>```, tags=['linux'],
 Note(front=In the `ln -s` command, what is the order of file name and link name?, back=```bash<br>$ ln -s &lt;file_name&gt; &lt;link_name&gt;<br>```, tags=['linux'],
 Note(front=What command does extract files from a zip archive?, back=```bash<br>$ unzip &lt;file&gt;<br>```, tags=['linux'],
 Note(front=What is the command to list the content of a directory?, back=```bash<br>$ ls &lt;path&gt;<br>```, tags=['linux'],
 Note(front=What is the command to print text to the terminal window?, back=```bash<br>$ echo ...<br>```, tags=['linux'],
 Note(front=What is the command to create a new file?, back=```bash<br>$ touch ...<br>```, tags=['linux'],
 Note(front=What is the command to create a new directory?, back=```bash<br>mkdir ...<br>```, tags=['linux'],
 Note(front=What is the command to search text for patterns?, back=```bash<br>$ grep ...<br>```, tags=['li
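
The warnings above point at two kinds of lines: the `#` header lines at the top of the export, and notes whose quoted fields span several lines. As a rough sketch (not how `anki_ai` parses the file), Python's `csv` module can handle both cases, assuming the export is tab-separated and uses doubled quotes for escaping, as the `head` output suggests. The function name `parse_anki_export` and the sample text are hypothetical.

```python
import csv
import io


def parse_anki_export(text):
    """Split an Anki text export into header metadata and note rows.

    Assumes tab-separated fields and CSV-style quoting (hypothetical helper).
    """
    headers = {}
    data_lines = []
    for line in text.splitlines(keepends=True):
        # Only the leading "#key:value" lines are metadata; a later line
        # starting with "#" could be the continuation of a quoted field.
        if line.startswith("#") and not data_lines:
            key, _, value = line[1:].strip().partition(":")
            headers[key] = value
        else:
            data_lines.append(line)
    # newline="" keeps line breaks inside quoted fields untouched.
    reader = csv.reader(io.StringIO("".join(data_lines), newline=""), delimiter="\t")
    return headers, [row for row in reader if row]


# Made-up sample: three header lines, one plain note, one multi-line quoted note.
sample = (
    "#separator:tab\n"
    "#html:true\n"
    "#tags column:6\n"
    "Towel question\tTowel\t\t\t\tenglish\n"
    '"<div>first line\nsecond line</div>"\t"say ""hi"""\t\t\t\tpandas\n'
)
headers, rows = parse_anki_export(sample)
print(headers)
print(len(rows), "notes parsed")
```

Reading the header lines into a dictionary also makes the tags column index available to later steps instead of hard-coding it.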

### Find duplicate notes (using semantic search)

The Anki client offers basic functionality for identifying repeated notes, based on an exact string comparison of the front and back fields. This is a good starting point, but it misses notes that are semantically similar, or even identical in meaning, without matching character for character. Such near-duplicates become common as notes accumulate over a long period of time.
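
To make the limitation concrete, here is a minimal sketch of exact-match duplicate detection. It mirrors the comparison described above rather than Anki's actual code; the helper `exact_duplicates` and the third, reworded note are made up for illustration.

```python
from collections import defaultdict


def exact_duplicates(notes):
    """Group note indices whose front and back fields are identical strings."""
    groups = defaultdict(list)
    for idx, note in enumerate(notes):
        groups[(note["front"], note["back"])].append(idx)
    return [ids for ids in groups.values() if len(ids) > 1]


notes_sample = [
    {"front": "What command does create a soft link?", "back": "ln -s"},
    {"front": "What command does create a soft link?", "back": "ln -s"},
    # Hypothetical rewording of the same question: same meaning, different text.
    {"front": "Which command creates a symbolic link?", "back": "ln -s"},
]

duplicate_groups = exact_duplicates(notes_sample)
print(duplicate_groups)  # [[0, 1]]
```

The first two notes are grouped, while the reworded third note goes unnoticed even though it asks the same thing.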

We can use an embedding model to create sentence embeddings for the front and back fields and identify notes that are semantically very similar even when they are not lexically equal. To do that, let's use one of the embedding models in `sentence-transformers` to generate embeddings for the front of our notes, and store them in `qdrant`.
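
Semantic similarity between two embeddings is commonly measured with cosine similarity, the angle-based score in which 1.0 means the vectors point in the same direction. The sketch below computes it in plain Python on toy three-dimensional vectors; the vectors are invented stand-ins, not real model output (real embeddings from `all-MiniLM-L6-v2` have many more dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings of three note fronts (assumed values).
soft_link = [0.9, 0.1, 0.0]
symbolic_link = [0.8, 0.2, 0.1]
list_directory = [0.1, 0.9, 0.2]

similar_score = cosine_similarity(soft_link, symbolic_link)
different_score = cosine_similarity(soft_link, list_directory)
print(f"similar: {similar_score:.3f}, different: {different_score:.3f}")
```

Notes that mean the same thing should score close to 1.0, while unrelated notes score much lower.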

In [4]:
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

In [5]:
# `models` exposes VectorParams, Distance and PointStruct used in the cells below
from qdrant_client import QdrantClient, models

qdrant = QdrantClient(":memory:")  # create in-memory Qdrant instance for testing

In [6]:
qdrant.create_collection(
    collection_name="anki_deck",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),  # vector size is defined by used model
        distance=models.Distance.COSINE,
    ),
)

True

In [7]:
notes = [{"front": n.front, "back": n.back, "tags": n.tags} for n in deck._collection]
notes[:10]

[{'front': 'What command does create a soft link?',
  'back': '```bash<br>$ ln -s &lt;file_name&gt; &lt;link_name&gt;<br>```',
  'tags': ['linux']},
 {'front': 'In the `ln -s` command, what is the order of file name and link name?',
  'back': '```bash<br>$ ln -s &lt;file_name&gt; &lt;link_name&gt;<br>```',
  'tags': ['linux']},
 {'front': 'What command does extract files from a zip archive?',
  'back': '```bash<br>$ unzip &lt;file&gt;<br>```',
  'tags': ['linux']},
 {'front': 'What is the command to list the content of a directory?',
  'back': '```bash<br>$ ls &lt;path&gt;<br>```',
  'tags': ['linux']},
 {'front': 'What is the command to print text to the terminal window?',
  'back': '```bash<br>$ echo ...<br>```',
  'tags': ['linux']},
 {'front': 'What is the command to create a new file?',
  'back': '```bash<br>$ touch ...<br>```',
  'tags': ['linux']},
 {'front': 'What is the command to create a new directory?',
  'back': '```bash<br>mkdir ...<br>```',
  'tags': ['linux']},
 {'front

In [8]:
qdrant.upload_points(
    collection_name="anki_deck",
    points=[
        models.PointStruct(
            id=idx, vector=encoder.encode(note["front"]).tolist(), payload=note
        )
        for idx, note in enumerate(notes)
    ],
)

In [9]:
hits = qdrant.search(
    collection_name="anki_deck",
    query_vector=encoder.encode("attention").tolist(),
    limit=3,
)
for hit in hits:
    print(hit.payload, "score:", hit.score)

{'front': 'What are the three main types of attention mechanisms?', 'back': '* Bidirectional (unmasked) self-attention<br>* Unidirectional (masked) self-attention<br>* Cross-attention', 'tags': ['llm']} score: 0.5707765739729703
{'front': 'What is the purpose of the query, key, and value vectors in attention mechanisms?', 'back': 'To compute the relevance of context tokens and combine their information', 'tags': ['llm']} score: 0.533562386527354
{'front': '"What did the ""Attention is all you need"" paper showed?"', 'back': 'That the Transformer architecture outperformed recurrent neural networks (RNNs) on machine translation tasks, both in terms of translation quality and training cost', 'tags': ['nlp']} score: 0.5147807080540934
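
The same similarity scores can be used to flag candidate duplicates across the whole deck rather than for a single query. The following is a brute-force sketch under stated assumptions: the vectors are toy values, the helper `near_duplicate_pairs` is hypothetical, and the `0.9` threshold is an arbitrary starting point that would need tuning on real notes.

```python
import math
from itertools import combinations


def normalize(vec):
    """Scale a non-zero vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def near_duplicate_pairs(vectors, threshold=0.9):
    """Return (i, j, score) for every pair at or above the threshold, best first."""
    unit = [normalize(v) for v in vectors]
    pairs = []
    for i, j in combinations(range(len(unit)), 2):
        score = sum(x * y for x, y in zip(unit[i], unit[j]))
        if score >= threshold:
            pairs.append((i, j, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy vectors (assumed values): notes 0 and 1 are close, as are notes 2 and 3.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.95, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
]

candidate_pairs = near_duplicate_pairs(toy_vectors)
for i, j, score in candidate_pairs:
    print(f"notes {i} and {j} look similar (score {score:.3f})")
```

Comparing every pair scales quadratically with the number of notes, so for a large deck it is likely more practical to query the vector store with each note's own embedding and keep only the top few hits.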
