# Formatting Anki Flashcards

## Setup

In this notebook, we are exploring the best way to prompt an LLM to improve the formatting of Anki flashcards. 

Approaches we are including:
1. Simple prompt
2. Simple prompt + Chain of Thought
3. Two-step process (critique → refine)
4. Agent

For this notebook, we are going to use [`unsloth/Qwen3-14B-GGUF`](https://huggingface.co/unsloth/Qwen3-14B-GGUF). It is larger and more recent than [`meta-llama/Llama-3.1-8B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), so we expect better results. We chose `Qwen3-14B` because it is a strong open-weight LLM that fits in 24GB of VRAM and delivers acceptable inference speed.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import requests
from anki.collection import Collection

from addon.application.use_cases.note_counter import is_note_marked_for_review
from addon.infrastructure.configuration.settings import AddonConfig

In [3]:
# Open an existing collection
col = Collection("/home/gianluca/.local/share/Anki2/User 1/collection.anki2")

# Do something with the collection
print(f"Number of notes: {col.note_count()}")
print(f"Number of cards: {col.card_count()}")

Number of notes: 3362
Number of cards: 3498


### Connect to Inference Server

In [4]:
from __future__ import annotations

def display_addon_note(addon_note: AddonNote) -> None:
    print(f"Front: {addon_note.front}\nBack: {addon_note.back}\nTags: {addon_note.tags}")

In [5]:
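from addon.infrastructure.external_services.openai import OpenAIClient
from addon.infrastructure.llm.schemas import AddonNoteChanges  # used below to validate guided JSON output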

def answer(
    input: str | list[dict],
    guided_json = None,
    **kwargs
):
    """Helper function to prompt LLM.
    
    input: Either a string (completions API) or list of message dicts (chat completion)
    kwargs: Can override config values like max_tokens, temperature, etc.
    """
    # Create config with any overrides
    config = AddonConfig.create_nullable(kwargs)
    
    # Create OpenAI client
    client = OpenAIClient.create(config)
    
    # Build kwargs for the run method
    run_kwargs = {}
    if guided_json is not None:
        run_kwargs["guided_json"] = guided_json
    
    # Run the inference
    content = client.run(input, **run_kwargs)
    
    # Strip the empty thinking block from the response
    cleaned_content = content.replace('<think>\n\n</think>\n\n', '')
    print(f"Content: {cleaned_content}")

    if guided_json is not None:
        # Validate the cleaned text so a leftover thinking block cannot break JSON parsing
        suggested_changes = AddonNoteChanges.model_validate_json(cleaned_content)
        return (suggested_changes, None)

    # Note: OpenAIClient doesn't currently return reasoning_content separately
    # so we return None for backward compatibility
    return (cleaned_content, None)

## Completions vs. Chat Completions API

In [6]:
content, reasoning_content = answer(
    input="Respond only with one word, lowercase, without punctuation. What is the Italian word for 'hello'? /no_think",
    mode="v1/completions",
    max_tokens=20,
)

Content: 

Assistant:

</think>

ciao


In [7]:
content, reasoning_content = answer(
    input="Respond only with one word, lowercase, without punctuation. What is the Italian word for 'hello'? /no_think",
    mode="v1/completions",
    max_tokens=20,
)

Content: 

Assistant:

Assistant:

ciao

Assistant:

Assistant:

ciao

Assistant:

Assistant:

c


In [8]:
content, reasoning_content = answer(
    input="Respond only with one word, lowercase, without punctuation. What is the Italian word for 'hello'? /no_think",
    mode="v1/completions",
    max_tokens=20,
)

Content: 

# The user is asking for the Italian word for 'hello'. I need to provide the correct


In [9]:
content, reasoning_content = answer(
    input=[{"role": "system", "content": "Respond only with one word, lowercase, without punctuation. What is the Italian word for 'hello'? /no_think"}],
    mode="v1/chat/completions",
    max_tokens=20,
)

Content: 

ciao


For Qwen 3, the Chat Completions API appears to work much better. In our simple test, three identical calls to the Completions API gave three different outputs, and Qwen 3 tended to repeat itself or ramble until it hit the `max_tokens` limit.
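To make the difference between the two endpoints concrete, here is a minimal sketch of the request bodies an OpenAI-compatible server expects. `build_payload` is a hypothetical helper, not part of the add-on. One plausible explanation for the behaviour above is that the Completions endpoint treats the prompt as raw text to continue, while the Chat Completions endpoint typically wraps the messages in the model's chat template first; we have not verified this on our server.

```python
from typing import Any


def build_payload(prompt: str, mode: str, max_tokens: int = 20, role: str = "user") -> dict[str, Any]:
    """Return the JSON body for an OpenAI-compatible endpoint (hypothetical helper)."""
    if mode == "v1/completions":
        # Plain text in, plain text continuation out
        return {"prompt": prompt, "max_tokens": max_tokens}
    if mode == "v1/chat/completions":
        # A list of role-tagged messages; the server is expected to apply the chat template
        return {"messages": [{"role": role, "content": prompt}], "max_tokens": max_tokens}
    raise ValueError(f"Unsupported mode: {mode}")
```

The `role` argument defaults to `user`; the cells above pass the prompt with the `system` role instead.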

We also notice that, even though we ask Qwen 3 not to emit "thinking tokens" (`/no_think`), they still show up in the `content` field. Thinking tokens are returned in both the `content` and `reasoning` fields, and the two do not match.
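The exact-match `replace` in `answer` only removes an empty thinking block. A more tolerant cleaner could be sketched as follows. `strip_thinking` is a hypothetical helper, and treating everything before a stray `</think>` as reasoning is an assumption based on the outputs above, not documented server behaviour.

```python
import re

# Matches a complete thinking block, including an empty one
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove thinking blocks and anything preceding a dangling closing tag."""
    text = _THINK_BLOCK.sub("", text)
    if "</think>" in text:
        # Assumption: text before an unmatched closing tag is leftover reasoning
        text = text.split("</think>")[-1]
    return text.strip()
```

This does not help with the repetition seen on the Completions API; it only cleans up responses that already contain a usable answer.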

## Create Evaluation Dataset

We have everything we need now to tell the LLM to make some changes to our Anki flashcards.

Let's pull a few notes currently marked for review.

In [10]:
deck_id = col.decks.current()["id"]
query = f"did:{deck_id}"
note_ids = col.find_notes(query)

NUM_NOTES_NEEDED = 10

flagged_notes = []
for note_id in note_ids:
    if is_note_marked_for_review(col, note_id):
        note = col.get_note(note_id)
        flagged_notes.append(note)

In [11]:
print(f"Number of flagged notes: {len(flagged_notes)}")

Number of flagged notes: 288
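The cell above defines `NUM_NOTES_NEEDED` but collects every flagged note. If only a small sample is needed, the scan could stop early. The sketch below shows one way to do that; `take_matching` is a hypothetical helper, not part of the add-on.

```python
from itertools import islice
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def take_matching(items: Iterable[T], predicate: Callable[[T], bool], limit: int) -> list[T]:
    """Return at most `limit` items that satisfy `predicate`, stopping as soon as enough are found."""
    matching = (item for item in items if predicate(item))
    return list(islice(matching, limit))


# Possible usage in this notebook (not executed here):
# flagged_ids = take_matching(
#     note_ids, lambda nid: is_note_marked_for_review(col, nid), NUM_NOTES_NEEDED
# )
```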


In [12]:
from addon.application.services.formatter_service import AnkiNoteAdapter

addon_note = AnkiNoteAdapter.to_addon_note(flagged_notes[0])
display_addon_note(addon_note)

Front: Most important thing to accelerate memory lookups
Back: Data availability on GPU memory<br><br><img src="Screenshot from 2022-02-07 11-31-00.png"><br><br>Ref.: <a href="https://www.nvidia.com/en-us/on-demand/session/gtcfall21-a31230/">https://www.nvidia.com/en-us/on-demand/session/gtcfall21-a31230/</a>
Tags: ['recsys']


## Simple Prompt

In [13]:
prompt = f"Look at this flashcard. How would you improve it? Keep in mind that flashcards should be atomic, concise, and accurate.\n\n{addon_note}"
print(prompt)

Look at this flashcard. How would you improve it? Keep in mind that flashcards should be atomic, concise, and accurate.

AddonNote(front='Most important thing to accelerate memory lookups', back='Data availability on GPU memory<br><br><img src="Screenshot from 2022-02-07 11-31-00.png"><br><br>Ref.: <a href="https://www.nvidia.com/en-us/on-demand/session/gtcfall21-a31230/">https://www.nvidia.com/en-us/on-demand/session/gtcfall21-a31230/</a>', guid='OM_`a7R~Dp', tags=['recsys'], notetype=<AddonNoteType.BASIC: 'basic'>, deck_name=None)


In [14]:
content, reasoning_content = answer(
    input=[{"role": "system", "content": prompt}],
    max_tokens=1_000
)

Content: 

Here’s an improved version of the flashcard, focusing on atomicity, conciseness, and accuracy:

---

**Front:**  
"Key factor for accelerating GPU memory lookups"  

**Back:**  
"Data must be resident in GPU memory (not CPU memory) to enable fast access during computation.  
Ref.: [NVIDIA GTC 2021 Talk](https://www.nvidia.com/en-us/on-demand/session/gtcfall21-a31230/)"  

---

**Improvements made:**  
1. **Front:** Simplified and clarified the question to focus on the specific context (GPU memory lookups).  
2. **Back:** Removed the image (non-essential for memorization) and replaced it with a concise explanation of the core concept.  
3. **Reference:** Retained the citation but formatted it as a direct link for clarity.  
4. **Conciseness:** Trimmed extraneous text to focus on the atomic fact: *data residency in GPU memory* as the critical enabler.  

This version adheres to flashcard best practices by prioritizing clarity, brevity, and actionable knowledge.


Despite the very simple prompt, the model does a good job on this flashcard. However, we probably need structured output (constrained decoding) to ensure the LLM respects a predefined format. This will make extracting the suggestions easier.
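To illustrate why a fixed JSON shape simplifies extraction, here is a standard-library sketch of parsing such a response into a typed object. `NoteChanges` is a hypothetical stand-in for the project's schema, and its fields (`front`, `back`, `tags`) are an assumption based on the outputs in this notebook.

```python
import json
from dataclasses import dataclass, field


@dataclass
class NoteChanges:
    """Hypothetical stand-in for the project's schema; the fields are assumed."""
    front: str
    back: str
    tags: list[str] = field(default_factory=list)


def parse_note_changes(raw: str) -> NoteChanges:
    """Parse a JSON response and fail loudly if required fields are missing."""
    data = json.loads(raw)
    missing = {"front", "back"} - data.keys()
    if missing:
        raise ValueError(f"Missing fields: {sorted(missing)}")
    return NoteChanges(front=data["front"], back=data["back"], tags=list(data.get("tags", [])))
```

With free-form Markdown output like the response above, the same extraction would need fragile pattern matching on headings such as `**Front:**`.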

## Structured Output

Let's reuse the `pydantic` schema we have already defined for the main project.

In [15]:
from addon.infrastructure.llm.schemas import AddonNoteChanges

content, reasoning_content = answer(
    input=[{"role": "system", "content": prompt}],
    guided_json=AddonNoteChanges.model_json_schema(),
    max_tokens=1_000,
)

Content: {

"front": "What is the most important factor to accelerate memory lookups in GPU-based systems?",  
"back": "Data availability on GPU memory. Ensuring data is resident in GPU memory reduces latency and accelerates access during computations.\n\nRef.: https://www.nvidia.com/en-us/on-demand/session/gtcfall21-a31230/",  
"tags": ["recsys", "gpu-optimization"]  

}


In [16]:
display_addon_note(content)

Front: What is the most important factor to accelerate memory lookups in GPU-based systems?
Back: Data availability on GPU memory. Ensuring data is resident in GPU memory reduces latency and accelerates access during computations.

Ref.: https://www.nvidia.com/en-us/on-demand/session/gtcfall21-a31230/
Tags: ['recsys', 'gpu-optimization']


## Chat Completions + Structured Output + Old Prompt + Reasoning

Let's compare it with the current version we have in the `main` branch.

In [17]:
from addon.application.services.formatter_service import NoteFormatter
from addon.infrastructure.external_services.openai import OpenAIClient

config = AddonConfig.create_nullable({
    "mode": "v1/chat/completions",
    "url": "http://iamgianluca.ddns.net:8080/v1/completions",
    "max_tokens": 5_000,  # might fail from time to time, depending
})
for k, v in config.__dict__.items():
    if k == "url":  # hide inference server url
        continue
    print(f"{k}: {v}")

model_name: ./Qwen3-14B-Q8_0.gguf
temperature: 0.7
max_tokens: 5000
top_p: 0.8
top_k: 20
min_p: 0.0


In [18]:
openai = OpenAIClient.create(config)
formatter = NoteFormatter(openai)
new_note = formatter.format(note=addon_note)

display_addon_note(new_note)

Front: Accelerate memory lookups: Key factor
Back: Data availability on GPU memory

<img src="Screenshot from 2022-02-07 11-31-00.png">

Ref.: <a href="https://www.nvidia.com/en-us/on-demand/session/gtcfall21-a31230/">https://www.nvidia.com/en-us/on-demand/session/gtcfall21-a31230/</a>
Tags: ['recsys']


The current production version barely changes the original flashcard and focuses mostly on formatting. This is likely due to the highly specific prompt. A shorter, more generic prompt, combined with reasoning, could lead to better results.
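To put a rough number on how much a formatter changes a field, one option is a character-level similarity score from the standard library. This is only a sketch: `change_ratio` is a hypothetical helper, and the score says nothing about whether a change is an improvement.

```python
from difflib import SequenceMatcher


def change_ratio(before: str, after: str) -> float:
    """Return 0.0 for identical strings and 1.0 for strings that share no matching blocks."""
    return 1.0 - SequenceMatcher(None, before, after).ratio()


# Possible usage in this notebook (not executed here):
# change_ratio(addon_note.front, new_note.front)
```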

For reference, below is the original note.

In [19]:
print("ORIGINAL NOTE\n")
display_addon_note(addon_note)

ORIGINAL NOTE

Front: Most important thing to accelerate memory lookups
Back: Data availability on GPU memory<br><br><img src="Screenshot from 2022-02-07 11-31-00.png"><br><br>Ref.: <a href="https://www.nvidia.com/en-us/on-demand/session/gtcfall21-a31230/">https://www.nvidia.com/en-us/on-demand/session/gtcfall21-a31230/</a>
Tags: ['recsys']


## Reasoning + Structured Output + Simple Prompt

Let's let the LLM reason while still performing constrained decoding.

In [20]:
from addon.infrastructure.llm.schemas import AddonNoteChanges

prompt = f"Look at this flashcard. How would you improve it? Keep in mind that flashcards should be atomic, concise, and accurate. /think \n\n{addon_note}"
content, reasoning_content = answer(
    input=[{"role": "system", "content": prompt}],
    guided_json=AddonNoteChanges.model_json_schema(),
    max_tokens=1_000,
)

Content: {

"front": "What is the most critical factor for accelerating memory lookups in GPU computing?",  
"back": "Data availability on GPU memory. (Visual aid: Screenshot from NVIDIA GTC Fall 2021 session on GPU memory optimization)\n\nReference: [NVIDIA GTC Fall 2021 Session](https://www.nvidia.com/en-us/on-demand/session/gtcfall21-a31230/)",  
"tags": ["recsys", "GPU", "memory-optimization"]  

}


In [21]:
display_addon_note(content)

Front: What is the most critical factor for accelerating memory lookups in GPU computing?
Back: Data availability on GPU memory. (Visual aid: Screenshot from NVIDIA GTC Fall 2021 session on GPU memory optimization)

Reference: [NVIDIA GTC Fall 2021 Session](https://www.nvidia.com/en-us/on-demand/session/gtcfall21-a31230/)
Tags: ['recsys', 'GPU', 'memory-optimization']


In [22]:
col.close()

## TODO

- [x] Check Qwen 3's instruction-following capabilities with Completions and Chat Completions API
- [x] Simple prompt
- [x] Simple prompt + Constrained Decoding
- [x] Simple prompt + Constrained Decoding + Reasoning
- [ ] Simple prompt + Chain of Thought
- [ ] Cleaner representation of existing addon note in the prompt passed to the LLM
- [ ] Pass note to change as `user` instead of `system` role
- [ ] Two-step process (critique → refine)
- [ ] Agent
- [ ] Check if we have a class that we can use to read the collection from the hard disk and convert it to `AddonNote` instead of `Note`. That way we can operate with domain objects instead of external dependencies. The same class should be used to convert the `AddonNote` back to `Note` (so maybe we should keep track of the `note_id`).
- [ ] Once we have that class, we should update the rest of the codebase accordingly (e.g., note formatter, note counter, etc.)
- [ ] ...
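
As a starting point for three of the open items (cleaner note representation, passing the note with the `user` role, and the two-step critique → refine process), here is a minimal sketch. Every name in it is hypothetical, the prompts are placeholders, and `llm` stands for any callable that takes chat messages and returns text; none of this has been run against the inference server.

```python
from typing import Callable

Message = dict[str, str]
LLM = Callable[[list[Message]], str]  # hypothetical: chat messages in, text out

GUIDELINES = "Flashcards should be atomic, concise, and accurate."


def render_note(front: str, back: str, tags: list[str]) -> str:
    """Render only the fields the model needs, instead of the full object repr."""
    return f"Front: {front}\nBack: {back}\nTags: {', '.join(tags)}"


def build_messages(instruction: str, note_text: str) -> list[Message]:
    """Put the instructions in the system message and the note in the user message."""
    return [
        {"role": "system", "content": f"{instruction} {GUIDELINES}"},
        {"role": "user", "content": note_text},
    ]


def critique_then_refine(llm: LLM, note_text: str) -> str:
    """Two-step process: ask for a critique, then ask for a rewrite that addresses it."""
    critique = llm(build_messages("List the weaknesses of this flashcard.", note_text))
    refine_input = f"{note_text}\n\nCritique:\n{critique}"
    return llm(build_messages("Rewrite this flashcard to address the critique.", refine_input))
```

Keeping `llm` as a plain callable means the same function can be exercised with a stub before wiring it to the real client.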