# Formatting Anki Flashcards

## Setup

In this notebook, we explore how best to prompt an LLM to improve the formatting of Anki flashcards.

We compare the following approaches:
1. Simple prompt
2. Simple prompt + Chain of Thought
3. Two-step process (critique → refine)
4. Agent
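
To make the two-step approach concrete, here is a minimal sketch of a critique → refine pass. The function name `critique_then_refine`, the `LLM` type alias, and both prompt wordings are assumptions made for illustration; they are not part of the addon.

```python
from typing import Callable

# Hypothetical type: any function that maps a prompt string to the model's reply.
LLM = Callable[[str], str]


def critique_then_refine(card: str, llm: LLM) -> tuple[str, str]:
    """Run a two-step pass: ask for a critique, then ask for a rewrite."""
    critique_prompt = (
        "Critique this flashcard. Flashcards should be atomic, concise, "
        f"and accurate.\n\n{card}"
    )
    critique = llm(critique_prompt)

    # The second call sees both the original card and the critique.
    refine_prompt = (
        "Rewrite the flashcard so that it addresses the critique.\n\n"
        f"Flashcard:\n{card}\n\nCritique:\n{critique}"
    )
    refined = llm(refine_prompt)
    return critique, refined
```

Because the model is passed in as a plain callable, the same function can be exercised with a stub in tests and with a thin wrapper around a real inference call in the notebook.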

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import requests
from anki.collection import Collection

from addon.application.use_cases.note_counter import is_note_marked_for_review
from addon.infrastructure.configuration.settings import AddonConfig

In [3]:
# Open an existing collection
col = Collection("/home/gianluca/.local/share/Anki2/User 1/collection.anki2")

# Do something with the collection
print(f"Number of notes: {col.note_count()}")
print(f"Number of cards: {col.card_count()}")

Number of notes: 3362
Number of cards: 3498


### Connect to Inference Server

For this notebook, we are going to use [`unsloth/Qwen3-14B-GGUF`](https://huggingface.co/unsloth/Qwen3-14B-GGUF). This is a larger and more recent LLM than [`meta-llama/Llama-3.1-8B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), which should lead to better results. We are also going to use the Chat Completions API.

In [4]:
from enum import Enum


class Mode(Enum):
    """Response API not implemented since currently not supported by vLLM."""

    COMPLETION = "v1/completions"
    CHAT_COMPLETION = "v1/chat/completions"


def answer(
    prompt: str,
    mode: Mode = Mode.CHAT_COMPLETION,
    **kwargs
):
    """Helper function to prompt LLM.

    Handles both Chat Completion and Completion API format.
    kwargs can override config values like max_tokens, temperature, etc.
    """
    # Overwrite the default API mode and any other kwargs
    config_overrides = {"mode": mode.value, **kwargs}
    config = AddonConfig.create_nullable(config_overrides)
    
    if mode == Mode.CHAT_COMPLETION:
        prompt_param_name = "messages"
        prompt_param_value = [{"role": "system", "content": prompt}]
    else:
        prompt_param_name = "prompt"
        prompt_param_value = prompt

    payload = {
        "model": config.model_name,
        prompt_param_name: prompt_param_value,
        "max_tokens": config.max_tokens,
        "temperature": config.temperature,
        "top_p": config.top_p,
        "min_p": config.min_p,
        "top_k": config.top_k,
    }

    response = requests.post(config.url, json=payload)
    print(response)

    # Check for errors
    if response.status_code != 200:
        print(f"Error response: {response.text}")
        raise RuntimeError(f"Server returned {response.status_code}: {response.text}")

    # Parse the body once and reuse it
    choice = response.json()["choices"][0]

    if mode == Mode.CHAT_COMPLETION:
        message = choice["message"]
        content = message["content"]
        # Not every server returns this field, so fall back to None
        reasoning_content = message.get("reasoning_content")
        # Strip the empty thinking block outside the f-string: backslashes
        # inside f-string expressions are a SyntaxError before Python 3.12
        visible_content = content.replace("<think>\n\n</think>\n\n", "")
        print(f"Content: {visible_content}")
        print(f"Reasoning: {reasoning_content}")
        return (content, reasoning_content)

    # Mode.COMPLETION
    content = choice["text"]
    print(f"Content: {content}")
    return (content, None)

### Completions vs Chat Completions API

In [5]:
content, reasoning_content = answer(
    prompt="Respond only with one word, lowercase, without punctuation. What is the Italian word for 'hello'? /no_think",
    mode=Mode.CHAT_COMPLETION, 
    max_tokens=20
)

<Response [200]>
Content: ciao
Reasoning: None


In [6]:
content, reasoning_content = answer(
    "Respond only with one word, lowercase, without punctuation. What is the Italian word for 'hello'? /no_think",
    mode=Mode.COMPLETION, 
    max_tokens=20
)

<Response [200]>
Content: 

# The user is asking for the Italian word for 'hello'. I need to provide the correct


In [7]:
content, reasoning_content = answer(
    "Respond only with one word, lowercase, without punctuation. What is the Italian word for 'hello'? /no_think",
    mode=Mode.COMPLETION, 
    max_tokens=20
)

<Response [200]>
Content: 

Assistant:

Assistant:

ciao

Assistant:

Assistant:

ciao

Assistant:

Assistant:

c


For Qwen 3, the Chat Completions API appears to work much better. In our simple test case, when using the Completions API, Qwen 3 tends to enter a pattern where it repeats itself until reaching the `max_tokens` limit.
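
A likely explanation is that the Chat Completions endpoint applies the model's chat template on the server, while the Completions endpoint sends the raw string as-is. The sketch below wraps a message by hand before sending it through the Completions API. It assumes the served model expects ChatML-style markers (`<|im_start|>`, `<|im_end|>`), and the helper name `to_chatml` is hypothetical. We have not verified that this removes the repetition.

```python
def to_chatml(user_message: str) -> str:
    """Wrap a single user message in ChatML-style markers.

    Assumption: the served model expects the ChatML format; check the
    model's chat template before relying on this.
    """
    return (
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


raw_prompt = to_chatml("What is the Italian word for 'hello'? /no_think")
print(raw_prompt)
```

The resulting string could then be passed as the `prompt` argument with `mode=Mode.COMPLETION`.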

We also notice that, despite the `/no_think` instruction, Qwen 3 still emits thinking tokens in the `content` field. The thinking tokens are also returned in both the `content` and `reasoning_content` fields, and the two versions do not match.
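
Since thinking tokens can leak into `content`, it may help to strip them before using the answer. This is a minimal sketch, assuming the model delimits its reasoning with `<think>` and `</think>` tags; the helper name `strip_thinking` is hypothetical.

```python
import re

# Matches a <think>...</think> block (including an empty one) and the
# whitespace that follows it.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_thinking(content: str) -> str:
    """Remove every complete <think>...</think> block from the content."""
    return THINK_BLOCK.sub("", content)


print(strip_thinking("<think>\n\n</think>\n\nciao"))
```

An unclosed block, for example one cut off by `max_tokens`, is left untouched, so truncated responses remain easy to spot.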

## Select a few flashcards for our offline evaluation

We now have everything we need to ask the LLM to improve our Anki flashcards.

Let's pull a few notes currently marked for review.

In [9]:
deck_id = col.decks.current()["id"]
query = f"did:{deck_id}"
note_ids = col.find_notes(query)

NUM_NOTES_NEEDED = 10

flagged_notes = []
for note_id in note_ids:
    if is_note_marked_for_review(col, note_id):
        note = col.get_note(note_id)
        flagged_notes.append(note)

In [10]:
print(f"Number of flagged notes: {len(flagged_notes)}")

Number of flagged notes: 284
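
The cell above defines `NUM_NOTES_NEEDED` but collects every flagged note. One way to cut the list down to a fixed-size evaluation set is a seeded random sample. The helper name `sample_notes` and the default seed are assumptions made for this sketch.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def sample_notes(notes: Sequence[T], k: int, seed: int = 0) -> list[T]:
    """Return up to k notes, chosen reproducibly for a given seed."""
    rng = random.Random(seed)
    # Cap k so that a short list does not raise ValueError.
    return rng.sample(list(notes), min(k, len(notes)))
```

In this notebook it could be called as `sample_notes(flagged_notes, NUM_NOTES_NEEDED)`. Fixing the seed keeps the evaluation set stable across runs, which makes the prompting approaches easier to compare.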


In [17]:
from addon.application.services.formatter_service import AnkiNoteAdapter

addon_note = AnkiNoteAdapter.to_addon_note(flagged_notes[0])
print(addon_note)

AddonNote(front='Can bagging be performed in parallel?', back='Yes', guid='EbzX4(?/sg', tags=['ml'], notetype=<AddonNoteType.BASIC: 'basic'>, deck_name=None)
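
The repr of `AddonNote` includes fields such as `guid` and `notetype` that the model probably does not need when the note is embedded in a prompt. Below is a minimal sketch of a prompt built from the card text only. The helper name `build_prompt` and the `Front:`/`Back:` layout are assumptions, and we have not tested whether this changes the quality of the responses.

```python
def build_prompt(front: str, back: str) -> str:
    """Build the improvement prompt from the card text only."""
    return (
        "Look at this flashcard. How would you improve it? Keep in mind that "
        "flashcards should be atomic, concise, and accurate.\n\n"
        f"Front: {front}\n"
        f"Back: {back}"
    )


print(build_prompt("Can bagging be performed in parallel?", "Yes"))
```

With the note above, this would be called as `build_prompt(addon_note.front, addon_note.back)`.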


In [25]:
prompt = f"Look at this flashcard. How would you improve it? Keep in mind that flashcards should be atomic, concise, and accurate.\n\n{addon_note}"
prompt

"Look at this flashcard. How would you improve it? Keep in mind that flashcards should be atomic, concise, and accurate.\n\nAddonNote(front='Can bagging be performed in parallel?', back='Yes', guid='EbzX4(?/sg', tags=['ml'], notetype=<AddonNoteType.BASIC: 'basic'>, deck_name=None)"

In [28]:
content, reasoning_content = answer(
    prompt=prompt,
    mode=Mode.CHAT_COMPLETION, 
    max_tokens=1_000
)

<Response [200]>
Content: <think>
Okay, let's see. The user provided a flashcard about whether bagging can be performed in parallel. The front is the question, back is the answer. The user wants to know how to improve it, keeping in mind that flashcards should be atomic, concise, and accurate.

First, I need to check if the current flashcard meets those criteria. The question is clear and direct, but maybe it's a bit vague. "Can bagging be performed in parallel?" is a yes/no question. The answer is "Yes," which is concise. But maybe the answer could be more informative. The user might want to know why it's possible or the conditions under which it can be done.

However, the original instruction says the flashcard should be atomic and concise. So adding more details might make it less concise. But maybe the answer could be a bit more precise. For example, explaining that bagging involves training models independently, which allows parallelization. But that might be too much for a flashc

In [11]:
# col.close()

## TODO

- [x] Check Qwen 3's instruction-following capabilities with the Completions and Chat Completions APIs
- [ ] Simple prompt
- [ ] Simple prompt + Chain of Thought
- [ ] Two-step process (critique → refine)
- [ ] Agent
- [ ] Check if we have a class that we can use to read the collection from disk and convert it to `AddonNote` instead of `Note`. That way we can operate on domain objects instead of external dependencies. The same class should be used to convert the `AddonNote` back to `Note` (so maybe we should keep track of the `note_id`).
- [ ] Once we have that class, we should update the rest of the codebase accordingly (e.g., note formatter, note counter, etc.)
- [ ] ...