TODO:
- Move pre-existing sound file check to before the API query for the audio url
- Add indexing to the dump to make searching more efficient
- Put file path steps in UI
- Handle special characters (breaks both Anki Import bc UTF & file lookup currently)
- Add part of speech parsing/tag
- Add check for irregular verbs
     - if irregular, manual review (vs. long term: automate creation of irregular past-tense cards?)
- Synonyms...
- Uncouple fetch audio method & translationPair class?

DONE: 
- Switch from scraping to dump + supported API lookups
- Add retry logic to fetch wiki page
- Skip header of transcribed file. 
- Add skip audio file fetch if file already exists
- Add tags for sound retrieval (sound, nosound)
- Move more of main script into functions for a cleaner look.

#### Initialization

In [1]:
from bs4 import BeautifulSoup ## scraping
import requests, os, time, csv, random
## dump integration
import bz2 
import xml.etree.ElementTree as ET
import re

In [2]:
class TranslationPair:
    def __init__(self, dutch_word, english_translation, processed_vocab):
        self.dutch_word = dutch_word
        self.english_translation = english_translation
        self.processed_vocab = processed_vocab
        self.wiki_url = None
        self.wiki_page = None
        self.audio_url = None
        self.tags = []

In [3]:
# initialize session for page fetches
session = requests.Session()
HEADERS = {
    "User-Agent": "AnkiDutchAudioFetcher/1.0 (https://github.com/KathrynMercer/Anki-Deck)",
    "Accept": "application/json"
    }

session.headers.update(HEADERS)

In [4]:
# vocab file
vocab_file_path = r"C:\Users\wisery\Downloads\Untitled spreadsheet - Main Deck__00Kroongetuige.csv"

# prepare save locations
audio_save_folder = r"C:\Users\wisery\AppData\Roaming\Anki2\User 1\collection.media"
vocab_output_location = r'C:\Users\wisery\Data Science Projects\Anki Deck\audio_added_vocab.csv'

## path to dictionary dump
dump_path= r"C:\Users\wisery\Downloads\nlwiktionary-20260101-pages-meta-current.xml.bz2"

## Regex to identify audio files
AUDIO_PATTERNS = [
    # modern pattern {{audio|file.ogg|...}}
    re.compile(r"\{\{audio\|(.+?\.ogg)"),
    # legacy pattern [[Bestand:file.ogg]] - do these exist?                  
    re.compile(r"\[\[(?:Bestand|File):([^|\]]+\.ogg)")          
]

## API to lookup audio file links
API_URL = "https://commons.wikimedia.org/w/api.php"

#r'C:\Users\wisery\AppData\Roaming\Anki2\User 1\collection.media' #default Anki media storage location

# r"C:\Users\wisery\Data Science Projects\Anki Deck\test sound files"
# test data storage

os.makedirs(audio_save_folder, exist_ok=True)
existing_sound_files = os.listdir(audio_save_folder) # to compare to prevent re-downloading files

In [5]:
def get_wait_time():
    wait_time = random.uniform(1, 2)
    return wait_time

In [6]:
# Predefined dictionary of parts of speech to match on for parsing + corresponding English tag in Anki deck
PARTS_OF_SPEECH = {
    "Achtervoegsel": "phrase",
    "Achterzetsel": "phrase",
    "Afkorting": "phrase",
    "Bijvoeglijk_naamwoord": "adjective",
    "Bijwoord": "adverb",
    "Telbijwoord": "adverb",
    "Voornaamwoordelijk bijwoord": "adverb",
    "Contractie": "phrase",
    "Frase": "phrase",
    "Invoegsel": "phrase",
    "Lidwoord": "phrase",
    "Omvoegsel": "phrase",
    "Partikel": "phrase",
    "Telwoord": "phrase",
    "Hoofdtelwoord": "phrase",
    "Onbepaald_hoofdtelwoord": "phrase",
    "Rangtelwoord": "phrase",
    "Onbepaald_rangtelwoord": "phrase",
    "Verdelingsgetal": "phrase",
    "Vragend_telwoord": "phrase",
    "Tussenwerpsel": "phrase",
    "Voegwoord": "phrase",
    "Voornaamwoord": "phrase",
    "Aanwijzend_voornaamwoord": "phrase",
    "Betrekkelijk_voornaamwoord": "phrase",
    "Bezittelijk_voornaamwoord": "phrase",
    "Onbepaald_voornaamwoord": "phrase",
    "Persoonlijk_voornaamwoord": "phrase",
    "Temporeel_voornaamwoord": "phrase",
    "Uitroepend_voornaamwoord": "phrase",
    "Vragend_voornaamwoord": "phrase",
    "Wederkerend_voornaamwoord": "phrase",
    "Wederkerig_voornaamwoord": "phrase",
    "Voorvoegsel": "phrase",
    "Voorzetsel": "phrase",
    "Werkwoord": "verb",
    "Zelfstandig_naamwoord": "noun",
    "Eigennaam": "noun",
    "Cijfer": "phrase",
    "Leesteken": "phrase",
    "Letter": "phrase",
    "Symbool": "phrase"
}

#### Word Processing

In [7]:
# Read csv file (assumes column 1 = Dutch word/phrase & column 2 = English translation)
# remove article if noun; remove preposition if present; replace spaces with _ if multiple words
# generates TranslationPair objects for each word/phrase

def process_vocab(vocab_file_path):
    with open(vocab_file_path, newline='', encoding = 'utf-8-sig') as csvfile:
        vocab_list = csv.reader(csvfile, delimiter=',', quotechar='"')
        processed_vocab_list = []
        
        next(vocab_list, None)  # skip the headers

        for word_pair in vocab_list:
            dutch_phrase = word_pair[0]
            english_phrase = word_pair[1]

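            # check for article or reflexive & remove if present
            segmented_dutch = dutch_phrase.split(' ')
            # length check: a bare 'de'/'het'/'zich' has no following word to keep
            if segmented_dutch[0] in ('de', 'het', 'zich') and len(segmented_dutch) > 1:
                word = segmented_dutch[1]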
                #print(f'{word_pair[0]} had an article; final word: {word}')

            # if there is no article...
            else:
                # is the phrase made up of multiple words?
                if len(segmented_dutch) > 1:

                    # check for preposition & remove if present
                    if segmented_dutch[1].startswith('('):
                        word = segmented_dutch[0]
                        #print(f'{word_pair[0]} had a preposition/reflexive; final word:  {word}')

                    # not a preposition but multiple words
                    else:
                        word = word_pair[0].replace(' ', '_') 
                        #print(f'{word_pair[0]} is multiple words; final word:  {word}')

                # no second word found
                else:
                    word = word_pair[0] 
                    #print(f'{word_pair[0]} has no article/preposition; final word:     {word}') 

            # Make list of processed words & their translations
            word_pair = TranslationPair(dutch_phrase, english_phrase, word)
            processed_vocab_list.append(word_pair)
    
    return processed_vocab_list
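
The normalisation rules in `process_vocab` can be checked in isolation. The sketch below restates them as a standalone helper. `normalise_dutch` is a hypothetical name that does not exist in the notebook, the sample phrases are illustrative, and the guard for a bare article is an added assumption about the intended behaviour.

```python
def normalise_dutch(phrase):
    """Return the lookup form of a Dutch phrase (hypothetical helper mirroring process_vocab)."""
    parts = phrase.split(' ')
    # leading article or reflexive: keep the word that follows it
    if parts[0] in ('de', 'het', 'zich') and len(parts) > 1:
        return parts[1]
    # second token in parentheses, e.g. "houden (van)": keep only the first word
    if len(parts) > 1 and parts[1].startswith('('):
        return parts[0]
    # anything else: join multiple words with underscores (single words pass through)
    return phrase.replace(' ', '_')


for sample in ['de hond', 'houden (van)', 'op zoek', 'uiterst']:
    print(sample, '->', normalise_dutch(sample))
```

Keeping the rules in a small pure function like this would make it easy to add cases (for example, separable verbs) without touching the CSV handling.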
    
    

#### Retrieving & Processing Wiki Pages

In [None]:
def fetch_wiki_page_from_dump(word, dump_path):
    # Stream the compressed dump one <page> at a time and stop at the first title match
    with bz2.open(dump_path, "rt", encoding="utf-8") as f:
        context = ET.iterparse(f, events=("end",))
        for event, elem in context:
            if not elem.tag.endswith("page"):
                continue
            title = elem.findtext("./{*}title")
            if title == word.processed_vocab:
                revision = elem.find("./{*}revision/{*}text")
                if revision is not None:
                    word.wiki_page = revision.text or ""
                else:
                    print(f"no revision found for: {word.processed_vocab}")
                elem.clear()
                return  # done; skips the "not found" message below
            # free every non-matching page too, so memory does not grow over the whole dump
            elem.clear()
    print(f"No wikipage found in dump for: {word.processed_vocab}")
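
The TODO mentions making dump searches more efficient. As written, every word reopens and rescans the dump from the start. One possible alternative is a single pass that collects all wanted titles at once. The sketch below assumes that approach: `collect_pages` and `local_name` are hypothetical helpers, and the small XML string only stands in for the real dump.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def collect_pages(xml_stream, wanted_titles):
    """Scan the dump once and return {title: wikitext} for the wanted titles that exist."""
    wanted = set(wanted_titles)
    found = {}
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        if title in wanted:
            found[title] = text
        elem.clear()  # release the page whether or not it matched
        if len(found) == len(wanted):
            break  # every wanted title has been seen; no need to read further
    return found


# Hand-written stand-in for the dump (assumption: same <page>/<title>/<revision>/<text> layout)
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page><title>uiterst</title><revision><text>{{audio|nl-uiterst.ogg|Audio}}</text></revision></page>
  <page><title>tralies</title><revision><text>geen audio</text></revision></page>
</mediawiki>"""

pages = collect_pages(io.BytesIO(SAMPLE), ["uiterst", "ontbreekt"])
print(pages)
```

With the real dump, the stream would come from `bz2.open(dump_path, "rb")` and the titles from the `processed_vocab` attribute of each word, so the dump is read once per run instead of once per word.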

##### Parsing Data from Wiki Page

In [None]:
def get_audio_download_url(filename):
    params = {
        "action": "query",
        "titles": f"File:{filename}",
        "prop": "imageinfo",
        "iiprop": "url",
        "format": "json"
    }
    ##print(params)
    
    # reuse the shared session (headers already set) and avoid hanging on a stalled request
    response = session.get(API_URL, params=params, timeout=30)

    if response.status_code != 200:
        print("HTTP error:", response.status_code)
        print(response.text[:500])
        return None
    else: 
        data = response.json()
    
    pages = data.get("query", {}).get("pages", {})
    for page in pages.values():
        if "imageinfo" in page:
            return page["imageinfo"][0]["url"]

    return None

In [None]:
def parse_audio_file_from_dump(word):
    if not word.wiki_page:
        print(f"No wikitext available for {word.processed_vocab}")
        return
    
    for pattern in AUDIO_PATTERNS:
        matches = pattern.findall(word.wiki_page)
        if matches:
            if len(matches) > 1:
                print(f"{word.processed_vocab} has multiple audio files; using the first.")
            filename = matches[0].replace("{{pn}}", word.processed_vocab)
            #print(f"audio filename: {filename}")
            # Lookup the actual URL from the WikiMedia API
            word.audio_url = get_audio_download_url(filename) 
            return
        else:
            print(f"No audio links found for {word.processed_vocab} with {pattern}") 
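
To see what the two `AUDIO_PATTERNS` actually capture, the sketch below runs them on hand-written wikitext snippets. `extract_audio_filename` is a hypothetical helper, and the sample strings are illustrative rather than copied from real pages. It skips the API lookup and only shows the filename extraction and the `{{pn}}` substitution.

```python
import re

AUDIO_PATTERNS = [
    re.compile(r"\{\{audio\|(.+?\.ogg)"),                  # {{audio|file.ogg|...}}
    re.compile(r"\[\[(?:Bestand|File):([^|\]]+\.ogg)"),    # [[Bestand:file.ogg]]
]


def extract_audio_filename(wikitext, page_name):
    """Return the first .ogg filename found, with {{pn}} replaced by the page name."""
    for pattern in AUDIO_PATTERNS:
        match = pattern.search(wikitext)
        if match:
            return match.group(1).replace("{{pn}}", page_name)
    return None


print(extract_audio_filename("{{audio|nl-{{pn}}.ogg|Audio}}", "uiterst"))
print(extract_audio_filename("[[Bestand:nl-huis.ogg|thumb]]", "huis"))
print(extract_audio_filename("geen audio hier", "tralies"))
```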

In [11]:
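# parsing parts of speech from a scraped (HTML) wiki page
# NOTE: expects word.wiki_page to be a BeautifulSoup object; pages read from the dump are raw
# wikitext strings, so an empty list is returned for them until a wikitext parser is added
# returns the list of matching English tags (nothing is stored on the translation pair yet)

def parse_part_of_speech(word):
    if isinstance(word.wiki_page, str):
        return []
    # Find all h4 headings
    h4_elements = word.wiki_page.find_all('h4')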

    # Store parts of speech
    matched_parts_of_speech = []

    for h4 in h4_elements:
        # Check if the id matches a known part of speech
        if 'id' in h4.attrs and h4['id'] in PARTS_OF_SPEECH:
            matched_parts_of_speech.append(PARTS_OF_SPEECH.get(h4['id']))
           
    # Remove duplicates and return
    final_list = list(set(matched_parts_of_speech))
    if len(final_list) > 0:
        print(f'Parts of speech identified for {word.dutch_word}: {final_list}')
    else:
        print(f'No parts of speech identified for {word.dutch_word}')
    return final_list

In [12]:
# Parse relevant info from wiki page 
def parse_wiki_page(word):
    # if wiki_page exists, parse audiofile from page
    if word.wiki_page:
        print(f"Parsing audio for: {word.processed_vocab}")
        parse_audio_file_from_dump(word)
        ##parse_part_of_speech(word)

    else: print(f'No parsing performed for {word.processed_vocab}.')

#### Audio File Retrieval & Processing

In [13]:
# Check to see if the audio file already exists. 

def audio_file_preexisting_check(word):
    # True if the expected file name is already in the media folder listing
    return f"Nl-{word.processed_vocab}.ogg" in existing_sound_files

In [14]:
# Fetch audio files for a word from wiktionary page
# If multiple audio files are listed, the user is informed & the first is used.

def fetch_audio(word, save_folder):
    file_name = f"Nl-{word.processed_vocab}.ogg" # matches default naming convention for manual file download
    file_path = os.path.join(save_folder, file_name)

    if not word.audio_url:
        print(f'No audio file retrieved for {word.processed_vocab}.')
        return None

    for attempt in range(5):
        audio_response = session.get(word.audio_url)
        if audio_response.status_code == 200:
            # open the file only after a successful download, so a failed attempt
            # never leaves an empty .ogg that later looks like a pre-existing file
            with open(file_path, 'wb') as f:
                f.write(audio_response.content)
            print(f"Downloaded: {file_name} from {word.audio_url}")
            return file_name # exit after downloading successfully
        print(f"Failed to download: {file_name} from {word.audio_url} (status {audio_response.status_code}); attempt = {attempt}")
        time.sleep(30) # retry in 30 seconds
    return None # all attempts failed
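
The retry behaviour can be exercised without touching the network. The sketch below is a simplified stand-in, not the notebook's function: `download_with_retry` is a hypothetical helper that takes the fetch function as a parameter, the fake fetcher and the `example.org` URL are made up, and the file is written only after a successful response.

```python
import os
import tempfile


def download_with_retry(fetch, url, file_path, attempts=5, wait=lambda: None):
    """Call fetch(url) up to `attempts` times; write file_path only on a 200 response."""
    for _attempt in range(attempts):
        status, content = fetch(url)
        if status == 200:
            with open(file_path, "wb") as f:
                f.write(content)
            return True
        wait()  # in real use this could be: lambda: time.sleep(30)
    return False


# Fake fetcher for the demo: fails twice, then succeeds
responses = iter([(503, b""), (503, b""), (200, b"OggS-demo-bytes")])


def fake_fetch(url):
    return next(responses)


target = os.path.join(tempfile.mkdtemp(), "Nl-demo.ogg")
ok = download_with_retry(fake_fetch, "https://example.org/demo.ogg", target)
print(ok, os.path.exists(target))
```

Passing the fetcher in as an argument is a design choice for testability; it would also be one way to approach the TODO about uncoupling audio fetching from the `TranslationPair` class.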

In [15]:
# modify card formatting & tags based on if audio file exists

def audio_based_formatting(word, audio_file_name):
    # if no audio exists, tag is nosound
    if (not audio_file_name) and (not word.audio_url):
        #print('no audiofile reported')
        dutch_complete = word.dutch_word
        word.tags.append('nosound')

    # if audio url exists but file does not, notify user, do not tag, and continue as if no sound file exists
    elif (not audio_file_name):
        dutch_complete = word.dutch_word
        print(f'Sound file not downloaded for {word.processed_vocab}, but url recorded. Manual review recommended.')

    # if audio exists & was downloaded, modify formatting to play in Anki
    else:
        #print('audiofile reported')
        dutch_complete = f'{word.dutch_word} \n[sound:{str(audio_file_name)}]'
        word.tags.append('sound')

    return dutch_complete

In [16]:
def audio_processing(word):
    # Check to see if audio file pre-exists in specified save folder
    # If so, skip download + set file name based on manual convention
    # If no, fetch_audio and save in specified folder
    audio_file_name = None
    preexisting = audio_file_preexisting_check(word)

    if(preexisting):
        audio_file_name = 'Nl-'+ word.processed_vocab +'.ogg'
        print(f'Audio previously downloaded for {word.processed_vocab}. Skipping download...')
    else:
        time.sleep(get_wait_time())
        audio_file_name = fetch_audio(word, audio_save_folder)

    # Update formatting of card & tags before returning to main script
    dutch_complete = audio_based_formatting(word, audio_file_name)

    return dutch_complete
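
The first TODO item (checking for an existing sound file before querying the API) could look roughly like the sketch below. It is an illustration under assumptions: `resolve_audio_file` and `fake_lookup` are hypothetical, the lookup is keyed on the processed word for simplicity, and the existing files are held in a set rather than the list the notebook uses.

```python
def resolve_audio_file(processed_vocab, existing_files, lookup_url):
    """Return (file_name, url); lookup_url is only called when the file is missing."""
    file_name = f"Nl-{processed_vocab}.ogg"
    if file_name in existing_files:
        return file_name, None  # already on disk, so no API request is made
    return file_name, lookup_url(processed_vocab)


calls = []  # records which words triggered a lookup


def fake_lookup(processed_vocab):
    calls.append(processed_vocab)
    return f"https://example.org/nl-{processed_vocab}.ogg"


existing = {"Nl-uiterst.ogg"}
print(resolve_audio_file("uiterst", existing, fake_lookup))
print(resolve_audio_file("tralies", existing, fake_lookup))
print(calls)
```

Only the word without a local file reaches the lookup, which is the saving the TODO is after: in the captured output further down, the API parameters are printed even for words whose audio was already downloaded.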

#### Write to file

In [17]:
def write_to_file(write_data, file_location):
    # explicit utf-8 so accented characters are not written in the platform default encoding
    with open(file_location, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        for word_entry in write_data:
            writer.writerow(word_entry)
    print('File complete.')
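
The TODO notes that special characters currently break the Anki import. Assuming that is at least partly an encoding mismatch (the notebook does not say), a quick round-trip check like the one below can confirm that rows survive being written and read back as UTF-8. The rows and the temporary file are made up for the demonstration.

```python
import csv
import os
import tempfile

# Illustrative rows: accented characters plus the embedded newline used for [sound:...] fields
rows = [
    ["één \n[sound:Nl-één.ogg]", "one", "sound"],
    ["coördinatie", "coordination", "nosound"],
]

path = os.path.join(tempfile.mkdtemp(), "roundtrip.csv")

# Write with an explicit encoding instead of relying on the platform default
with open(path, "w", newline="", encoding="utf-8") as csvfile:
    csv.writer(csvfile).writerows(rows)

# Read back with the same encoding and compare
with open(path, newline="", encoding="utf-8") as csvfile:
    read_back = list(csv.reader(csvfile))

print(read_back == rows)
```

If the round trip holds but the import or file lookup still fails, the cause is more likely in how file names are matched than in the CSV itself.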


## Main

In [18]:
# prepare words       
words = process_vocab(vocab_file_path) # yields list of TranslationPair objects
print('All vocab processed.')

write_data = []
# Fetch & parse wiki page + fetch audio file for each word in vocab list
for word in words:
    print(f"Fetching for: {word.processed_vocab}")
    fetch_wiki_page_from_dump(word, dump_path)
    if word.wiki_page:
        print(f"Parsing {word.processed_vocab}")
        parse_wiki_page(word)
        audio_formatted_dutch = audio_processing(word)
        write_data.append([audio_formatted_dutch, word.english_translation, ' '.join(word.tags)])
    else:
        write_data.append([word.dutch_word, word.english_translation, ' '.join(word.tags)])

print('Sound files collected; writing to file.')

#write_to_file(write_data, vocab_output_location)

All vocab processed.
Fetching for: uiterst
Parsing uiterst
Parsing audio for: uiterst
audio filename: nl-uiterst.ogg
{'action': 'query', 'titles': 'File:nl-uiterst.ogg', 'prop': 'imageinfo', 'iiprop': 'url', 'format': 'json'}
Audio previously downloaded for uiterst. Skipping download...
Fetching for: tralies
Parsing tralies
Parsing audio for: tralies
audio filename: nl-tralies.ogg
{'action': 'query', 'titles': 'File:nl-tralies.ogg', 'prop': 'imageinfo', 'iiprop': 'url', 'format': 'json'}
Audio previously downloaded for tralies. Skipping download...
Fetching for: inlichtingen
Parsing inlichtingen
Parsing audio for: inlichtingen
audio filename: nl-inlichtingen.ogg
{'action': 'query', 'titles': 'File:nl-inlichtingen.ogg', 'prop': 'imageinfo', 'iiprop': 'url', 'format': 'json'}
Audio previously downloaded for inlichtingen. Skipping download...
Fetching for: oorbellen
Parsing oorbellen
Parsing audio for: oorbellen
audio filename: nl-oorbellen.ogg
{'action': 'query', 'titles': 'File:nl-oorbe

In [19]:
write_to_file(write_data, vocab_output_location)

File complete.
