# Search matching exemplars

Working up code to use for searching CLDR exemplar data

In [1]:
import os
from pathlib import Path
from bs4 import BeautifulSoup
import re
import numpy as np

## Get the data files

Try out `os.listdir()`, `os.scandir()` and `Path` to see which is most convenient. (Supposedly `pathlib` is considered "cleaner" and better for cross-platform use.)

In [2]:
# Path to the CLDR files:
cldr_data_pathname = "../data/cldr/common/main"

In [3]:
os.listdir(cldr_data_pathname)[:4]

['aa.xml', 'aa_DJ.xml', 'aa_ER.xml', 'aa_ET.xml']

In [4]:
file_objects = [f for f in os.scandir(cldr_data_pathname) if f.is_file()]

In [5]:
file_objects[0].path

'../data/cldr/common/main\\aa.xml'

In [6]:
os.path.splitext(file_objects[0].name)

('aa', '.xml')

In [7]:
# Use a generator expression to provide an iterator that returns (name, ext) tuples
[name for name,ext in (os.path.splitext(f.name) for f in file_objects) if ext==".xml"][:8]

['aa', 'aa_DJ', 'aa_ER', 'aa_ET', 'ab', 'ab_GE', 'af', 'af_NA']

In [8]:
list(Path(cldr_data_pathname).iterdir())[:4]

[WindowsPath('../data/cldr/common/main/aa.xml'),
 WindowsPath('../data/cldr/common/main/aa_DJ.xml'),
 WindowsPath('../data/cldr/common/main/aa_ER.xml'),
 WindowsPath('../data/cldr/common/main/aa_ET.xml')]

In [9]:
cldr_data_path = Path(cldr_data_pathname)

Regarding the next cell: `glob()` does wildcard matching of filenames. The `Path` class has a `glob()` method, and Python also has a standalone `glob` module, e.g.:
```python
import glob
files = glob.glob("common/main/*.xml")
```

In [40]:
# Take the first XML file and inspect its Path attributes
f = next(cldr_data_path.glob("*.xml"))
print(f"f: {f}")
print(f"type(f): {type(f)}")
print(f"f.as_posix(): {f.as_posix()}")
print(f"f.name: {f.name}")
print(f"f.stem: {f.stem}")
print(f"f.suffix: {f.suffix}")
print(f"f.parts: {f.parts}")
print(f"f.absolute(): {f.absolute()}")
dir(f)


f: ..\data\cldr\common\main\aa.xml
type(f): <class 'pathlib._local.WindowsPath'>
f.as_posix(): ../data/cldr/common/main/aa.xml
f.name: aa.xml
f.stem: aa
f.suffix: .xml
f.parts: ('..', 'data', 'cldr', 'common', 'main', 'aa.xml')
f.absolute(): D:\Source\gh-pc\cldr_examplars\notebooks\..\data\cldr\common\main\aa.xml


['__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__firstlineno__',
 '__format__',
 '__fspath__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rtruediv__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__static_attributes__',
 '__str__',
 '__subclasshook__',
 '__truediv__',
 '_drv',
 '_filter_trailing_slash',
 '_format_parsed_parts',
 '_from_parsed_parts',
 '_from_parsed_string',
 '_glob_selector',
 '_globber',
 '_hash',
 '_max_symlinks',
 '_parse_path',
 '_parts_normcase',
 '_parts_normcase_cached',
 '_pattern_str',
 '_raw_path',
 '_raw_paths',
 '_remove_leading_dot',
 '_remove_trailing_slash',
 '_resolving',
 '_root',
 '_stack',
 '_str',
 '_str_normcase',
 '_str_normcase_cached',
 '_tail',
 '_tail_cached',
 '_unsupported_msg',
 'absolute',
 'anchor',
 'as_posix',
 'as_uri',
 'chmod'

The `Path` class from `pathlib` looks very flexible!
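
As a minimal sketch of how the `Path` features explored above might be combined, the following helper collects the stems of the XML files in a directory. The function name `locale_stems` is hypothetical (it does not appear elsewhere in this notebook), and the result is sorted explicitly because `glob()` does not promise any particular order.

```python
from pathlib import Path


def locale_stems(directory, pattern="*.xml"):
    """Return the sorted file stems (names without extension) matching pattern.

    Hypothetical helper for illustration only.
    """
    return sorted(
        p.stem                      # e.g. "aa_DJ" for "aa_DJ.xml"
        for p in Path(directory).glob(pattern)
        if p.is_file()              # skip any directories that happen to match
    )
```

With the data path used in this notebook, `locale_stems(cldr_data_pathname)[:4]` should give the same four names seen earlier, assuming the directory contents are unchanged.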

## Extracting exemplar data

The following will use BeautifulSoup to extract the exemplar data elements from the CLDR data files. Note that there are different exemplar types. The main exemplar data doesn't include any element attributes, but the other kinds of exemplar data include a `type` attribute. E.g.,

```xml
	<characters>
		<exemplarCharacters>[a b t s e c k x i d q r f g o l m n u w h y]</exemplarCharacters>
		<exemplarCharacters type="auxiliary">[j p v z]</exemplarCharacters>
		<exemplarCharacters type="index">[A B T S E C K X I D Q R F G O L M N U W H Y]</exemplarCharacters>
	</characters>
```
We'll be interested in the main and auxiliary exemplars for a language.
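
A rough sketch of picking out just the main and auxiliary exemplars from a snippet like the one above. To keep it dependency-free, it uses the standard library's `xml.etree.ElementTree` rather than the BeautifulSoup approach used in the rest of this notebook. The function name `exemplars_by_type` and the label `"main"` for the element without a `type` attribute are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The snippet shown above, wrapped in an <ldml> root element.
SAMPLE = """<ldml>
  <characters>
    <exemplarCharacters>[a b t s e c k x i d q r f g o l m n u w h y]</exemplarCharacters>
    <exemplarCharacters type="auxiliary">[j p v z]</exemplarCharacters>
    <exemplarCharacters type="index">[A B T S E C K X I D Q R F G O L M N U W H Y]</exemplarCharacters>
  </characters>
</ldml>"""


def exemplars_by_type(xml_text, wanted=("main", "auxiliary")):
    """Map exemplar type -> raw exemplar string for the wanted types.

    Hypothetical helper: an element with no type attribute is labelled "main".
    """
    root = ET.fromstring(xml_text)
    result = {}
    for el in root.iter("exemplarCharacters"):
        kind = el.get("type", "main")   # missing attribute means main exemplars
        if kind in wanted:
            result[kind] = el.text or ""
    return result


print(exemplars_by_type(SAMPLE))
```

A file with no `exemplarCharacters` elements simply yields an empty dictionary.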

Also note: in addition to region-independent language data, CLDR has region-specific "locale" data that can include region-specific data such as time and date formats. In general, the region-specific files don't duplicate the region-independent data, such as exemplar data. Potentially, however, a language could have region-specific variants that use different character repertoires in their orthographies (`yo` vs. `yo-BJ` appears to be an example of this). For some languages, different communities use different scripts to write the language, and for these cases CLDR has script variants (e.g., Azeri written in Latin, Cyrillic or Arabic script). These will have different exemplar data.
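
To make the language / script / region distinction concrete, here is a simplified sketch that splits a file stem such as `az_Cyrl_AZ` into its parts. The pattern is an assumption about the common shape of these file names (lowercase language code, optional four-letter script, optional two-letter or three-digit region, optional variant), not a full locale identifier parser, and `split_locale` is a hypothetical name.

```python
import re

# Simplified pattern for stems like "aa", "yo_BJ", "az_Cyrl", "az_Cyrl_AZ".
_LOCALE_RE = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:_(?P<script>[A-Z][a-z]{3}))?"
    r"(?:_(?P<region>[A-Z]{2}|[0-9]{3}))?"
    r"(?:_(?P<variant>[A-Za-z0-9]+))?$"
)


def split_locale(stem):
    """Return a dict of locale parts, or None if the stem doesn't fit the pattern."""
    m = _LOCALE_RE.match(stem)
    return m.groupdict() if m else None


for stem in ("aa", "yo_BJ", "az_Cyrl_AZ"):
    print(stem, split_locale(stem))
```

Parts that are absent come back as `None`, so region-independent files can be recognised by checking that `region` is `None`. Stems that don't fit the pattern (such as `root`) return `None` and would need separate handling.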

Thus, all files should be checked for exemplar data, though some of the data files will not have any. The BeautifulSoup `find_all()` method returns an empty result list (not `None`) when the specified element isn't found; in the following snippet, we'll show the length of the `find_all()` results in each case to illustrate this.

In [11]:
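# Sort the paths: glob() doesn't guarantee an order, and the early break below relies on one
for file in sorted(cldr_data_path.glob("*.xml")):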
    filename = file.stem
    if filename > "af": break

    # open and parse with BeautifulSoup
    with file.open(encoding="utf-8") as f:
        soup = BeautifulSoup(f,"lxml-xml")
        exemplars = soup.find_all("exemplarCharacters")
        print(f"{filename} exemplar count: {len(exemplars)}")
        for ex in exemplars:
            if "type" in ex.attrs:
                print(f"{filename}, type {ex['type']}: {ex.text}")
            else:
                print(f"{filename}: {ex.text}")


aa exemplar count: 3
aa: [a b t s e c k x i d q r f g o l m n u w h y]
aa, type auxiliary: [j p v z]
aa, type index: [A B T S E C K X I D Q R F G O L M N U W H Y]
aa_DJ exemplar count: 0
aa_ER exemplar count: 0
aa_ET exemplar count: 0
ab exemplar count: 5
ab: [а ә б в г {гә} {гь} ӷ {ӷә} {ӷь} д {дә} е ж {жә} {жь} з ӡ {ӡә} и к {кә} {кь} қ {қә} {қь} ҟ {ҟә} {ҟь} л м н о п ԥ р с т {тә} ҭ {ҭә} у ф х {хә} {хь} ҳ {ҳә} ц {цә} ҵ {ҵә} ч ҷ ҽ ҿ џ {џь} ш {шә} {шь} ы ь ҩ]
ab, type auxiliary: [{а́} ҕ {ҕә} {ҕь} {е́} {и́} {о́} ҧ {у́} {ы́}]
ab, type index: [А Б В Г {ГӘ} {ГЬ} Ӷ {ӶӘ} {ӶЬ} Д {ДӘ} Е Ж {ЖӘ} {ЖЬ} З Ӡ {ӠӘ} И К {КӘ} {КЬ} Қ {ҚӘ} {ҚЬ} Ҟ {ҞӘ} {ҞЬ} Л М Н О П Ԥ Р С Т {ТӘ} Ҭ {ҬӘ} У Ф Х {ХӘ} {ХЬ} Ҳ {ҲӘ} Ц {ЦӘ} Ҵ {ҴӘ} Ч Ҷ Ҽ Ҿ Џ {ЏЬ} Ш {ШӘ} {ШЬ} Ы Ҩ]
ab, type numbers: [\- ‑ , % ‰ + 0 1 2 3 4 5 6 7 8 9]
ab, type punctuation: [\- ‐‑ – — , ; \: ! ? . … '‘‚ "“„ « » ( ) \[ \] \{ \} § @ * / \& #]
ab_GE exemplar count: 0
af exemplar count: 4
af: [aáâ b c d eéèêë f g h iîï j k l m n oôö p q r s t uû v w x y z]
a

Repeat the above, but only show main and auxiliary exemplars.

In [55]:
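# Sort the paths: glob() doesn't guarantee an order, and the early break below relies on one
for file in sorted(cldr_data_path.glob("*.xml")):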
    filename = file.stem
    if filename > "b": break

    # open and parse with BeautifulSoup
    with file.open(encoding="utf-8") as f:
        soup = BeautifulSoup(f,"lxml-xml")
        exemplars = soup.find_all("exemplarCharacters")
        for ex in exemplars:
            if "type" in ex.attrs and ex["type"] == "auxiliary":
                print(f"{filename}, type {ex['type']}: {ex.text}")
            elif "type" not in ex.attrs:
                print(f"{filename}: {ex.text}")
    

aa: [a b t s e c k x i d q r f g o l m n u w h y]
aa, type auxiliary: [j p v z]
ab: [а ә б в г {гә} {гь} ӷ {ӷә} {ӷь} д {дә} е ж {жә} {жь} з ӡ {ӡә} и к {кә} {кь} қ {қә} {қь} ҟ {ҟә} {ҟь} л м н о п ԥ р с т {тә} ҭ {ҭә} у ф х {хә} {хь} ҳ {ҳә} ц {цә} ҵ {ҵә} ч ҷ ҽ ҿ џ {џь} ш {шә} {шь} ы ь ҩ]
ab, type auxiliary: [{а́} ҕ {ҕә} {ҕь} {е́} {и́} {о́} ҧ {у́} {ы́}]
af: [aáâ b c d eéèêë f g h iîï j k l m n oôö p q r s t uû v w x y z]
af, type auxiliary: [àåäã æ ç íì óò úùü ý]
agq: [aàâǎā b c d eèêěē ɛ{ɛ̀}{ɛ̂}{ɛ̌}{ɛ̄} f g h iìîǐī ɨ{ɨ̀}{ɨ̂}{ɨ̌}{ɨ̄} k l m n ŋ oòôǒō ɔ{ɔ̀}{ɔ̂}{ɔ̌}{ɔ̄} p s t uùûǔū ʉ{ʉ̀}{ʉ̂}{ʉ̌}{ʉ̄} v w y z ʔ]
agq, type auxiliary: [q r x]
ak: [a b d e ɛ f g h i k l m n o ɔ p r s t u w y]
ak, type auxiliary: [áäã c éë í j óö q ü v z]
am: [ሀ ሁ ሂ ሃ ሄ ህ ሆ ለ ሉ ሊ ላ ሌ ል ሎ ሏ ሐ ሑ ሒ ሓ ሔ ሕ ሖ ሗ መ ሙ ሚ ማ ሜ ም ሞ ሟ ሠ ሡ ሢ ሣ ሤ ሥ ሦ ሧ ረ ሩ ሪ ራ ሬ ር ሮ ሯ ሰ ሱ ሲ ሳ ሴ ስ ሶ ሷ ሸ ሹ ሺ ሻ ሼ ሽ ሾ ሿ ቀ ቁ ቂ ቃ ቄ ቅ ቆ ቈ ቊ ቋ ቌ ቍ በ ቡ ቢ ባ ቤ ብ ቦ ቧ ቨ ቩ ቪ ቫ ቬ ቭ ቮ ቯ ተ ቱ ቲ ታ ቴ ት ቶ ቷ ቸ ቹ ቺ ቻ ቼ ች ቾ ቿ ኀ ኁ ኂ ኃ ኄ ኅ ኆ ኈ ኊ ኋ ኌ ኍ ነ ኑ ኒ ና ኔ ን ኖ ኗ ኘ ኙ ኚ

Note that the exemplar data isn't simply a list of space-delimited characters. Items enclosed in braces, `{...}`, are Unicode character sequences that are treated as a single unit. For example, in Dutch, "ij" is a distinct element of the alphabet, and so is listed in the exemplar data as `{ij}`. The data uses Unicode normalization form NFC, so when an orthography uses an accented letter, the precomposed character is listed if one is encoded. In some cases, however, a grapheme can only be represented in Unicode as a combining mark sequence, and such sequences are listed in braces. For example, "m̀" is a grapheme in Yoruba, and this combining mark sequence is listed in the exemplar data as `{m̀}`.
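
The points above can be sketched as a small tokenizer. This is a simplified illustration under stated assumptions, not a full parser for the exemplar syntax: it handles the outer brackets, brace-enclosed sequences, backslash-escaped characters, and unspaced runs such as `aáâ` seen in the output above, but it does not handle ranges or other escape forms. The name `parse_exemplars` is hypothetical.

```python
import unicodedata


def parse_exemplars(text):
    """Split an exemplar string such as "[a b {ij} \\- c]" into its elements.

    Simplified sketch: no support for ranges (a-z) or \\uXXXX escapes.
    """
    text = text.strip()
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]                 # drop the outer brackets
    items = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1                        # whitespace only separates items
        elif ch == "{":
            end = text.find("}", i)
            if end == -1:
                raise ValueError("unclosed '{' in exemplar string")
            items.append(text[i + 1:end])  # brace sequence is one unit
            i = end + 1
        elif ch == "\\" and i + 1 < len(text):
            items.append(text[i + 1])     # escaped literal, e.g. \- or \[
            i += 2
        else:
            items.append(ch)              # a single character element
            i += 1
    return items


print(parse_exemplars("[a b {ij} c]"))

# NFC composes a + combining acute into the precomposed letter,
# but m + combining grave has no precomposed form and stays a sequence.
print(len(unicodedata.normalize("NFC", "a\u0301")))  # 1
print(len(unicodedata.normalize("NFC", "m\u0300")))  # 2
```

Treating each brace sequence as a single list element keeps multi-character graphemes intact, which matters when matching text against an orthography's repertoire.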