Let's read the file:

In [7]:
f = "varcon.txt"

with open(f, "r") as file:
    content = file.read()


Reminder: file looks like this:

```
# abettor <verified> (level 50)
A Bv C: abettor / Av B: abetter
A Bv C: abettors / Av B: abetters
A Bv C: abettor's / Av B: abetter's

# adviser <verified> (level 20)
A B C: adviser / A. Bv C.: advisor
A B C: advisers / A. Bv C.: advisors
A B C: adviser's / A. Bv C.: advisor's
## The oxford dictionary has this to say:
##     The spellings adviser and advisor are both correct. Adviser is more
##     common, but advisor is also widely used , especially in North
##     America. Adviser may be seen as less formal, while advisor often
##     suggests an official position.

# airstrike (level 55)
_: airstrike / _v: air_strike
_: airstrikes / _v: air_strikes
_: airstrike's / _v: air_strike's

# apprize (level 60)
A Z: apprize / B: apprise | value highly
A Z: apprize's / B: apprise's | value highly
A Z: apprizing / B: apprising | value highly
A Z: apprized / B: apprised | value highly
A Z: apprizes / B: apprises | value highly
A B: apprise | inform
A B: apprise's | inform
A B: apprising | inform
A B: apprised | inform
A B: apprises | inform

```

My idea:
* Comments like the one above (lines starting with `##`) will not be read and sorted manually.
* I will look only at categories `A` for the American variety and `B` or `Z` for the British variety.
* For now I will ignore variant markers such as `v` (e.g. `abettor` will be declared American and `abetter` will be declared British).
* It was noted that some words are underscore delimited instead of space delimited. This will be taken into account, <span style="color:red">but this means we would have to process word-N-grams, which we will not do in the first implementation.</span>
* For now the `<verified>` flag and `(level NN)` information will not be propagated.
* After inspecting the lexicon, I assume the American version is given first.
* The lines with inconsistent formatting will be discarded. (See last example.)

What should we do with a case like this: `A Z: abnormalizing / B: abnormalising`?

The `B` version is clearly more informative: if we see `abnormalising`, the spelling is surely British, whereas `abnormalizing` could be American or "-ize" British. In this case I propose we record only the definitively British form.
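
The rule above can be sketched on a single line of the lexicon using only string methods. This is a minimal illustration, not the implementation used below: the helper name `classify_pair` is hypothetical, and it assumes exactly two variants separated by ` / `.

```python
def classify_pair(line: str) -> dict:
    """Hypothetical helper: label the spellings on one two-variant line."""
    entries = []
    for part in line.split(" / "):
        flags, _, word = part.strip().partition(": ")
        # Keep only the exact tags A, B and Z; variant tags such as "Bv" are dropped.
        kept = {flag for flag in flags.split() if flag in {"A", "B", "Z"}}
        entries.append((kept, word))
    (flags1, word1), (flags2, word2) = entries

    result = {}
    if flags1 == {"A"} and flags2 == {"B"}:
        result[word1] = "A"
        result[word2] = "B"
    elif flags1 == {"A", "Z"} and flags2 == {"B"}:
        # The -ize form may be American or "-ize" British, so record only the B form.
        result[word2] = "B"
    return result


print(classify_pair("A Z: abnormalizing / B: abnormalising"))
print(classify_pair("A Bv C: abettor / Av B: abetter"))
```

The first call records only `abnormalising`, while the second records both spellings because each side reduces to a single unambiguous tag.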

In [8]:
items = content.split("# ")[1:]
items[3].split("\n")

['abnormalizing (level 95)', 'A Z: abnormalizing / B: abnormalising', '', '']
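
One thing to keep in mind with `content.split("# ")`: the separator also occurs inside `## ` comment lines, so entries with comments are cut into extra fragments. These fragments appear to be harmless for the processing below, because the first line of each item is skipped and so are lines starting with `#`. As a possible alternative, and only as a sketch on a made-up `sample` string, the split can be anchored to line starts with `re.split`:

```python
import re

# A shortened sample in the same format as the lexicon file.
sample = (
    "# adviser <verified> (level 20)\n"
    "A B C: adviser / A. Bv C.: advisor\n"
    "## The oxford dictionary has this to say:\n"
    "\n"
    "# airstrike (level 55)\n"
    "_: airstrike / _v: air_strike\n"
)

# Plain split: also cuts inside the "## " comment line.
naive_items = sample.split("# ")[1:]

# Anchored split: only a "# " at the start of a line begins a new item.
anchored_items = re.split(r"(?m)^# ", sample)[1:]

print(len(naive_items))
print(len(anchored_items))
```

With the anchored version, each item keeps its comment lines attached to the entry they belong to.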

In [1]:
import logging

import parse  # imported as a module so that the built-in compile() is not shadowed

# Compile both line patterns once instead of on every call to process_item.
p = parse.compile("{flags1}: {version1} / {flags2}: {version2}")
p2 = parse.compile("{flags1}: {version1} / {flags2}: {version2} / {flags3}: {version3}")

def process_flags(flags: str) -> set:
    """Keep only the exact tags A, B and Z; variant tags such as 'Bv' are dropped."""
    return {flag for flag in flags.split(" ") if flag in ("A", "B", "Z")}

def process_item(item: str) -> dict:
    """Map each spelling in one lexicon entry to 'A' (American) or 'B' (British)."""
    resulting_dict = dict()
    lines = item.split("\n")
    for line in lines[1:]:
        if "|" in line:
            continue
        if line.startswith("#") or line == "":
            continue
        if line.count("/") == 1:
            try:
                results = p.parse(line)
                flags1 = process_flags(results["flags1"])
                flags2 = process_flags(results["flags2"])
                version1 = results["version1"].replace("_", " ").casefold()
                version2 = results["version2"].replace("_", " ").casefold()
                
                if (flags1, flags2) == ({"A"}, {"B"}):
                    resulting_dict[version1] = "A"
                    resulting_dict[version2] = "B"
                if (flags1, flags2) == ({"A", "Z"}, {"B"}):
                    resulting_dict[version2] = "B"
            except Exception as e:
                logging.debug(f"Found error {e} for line:")
                logging.debug(line)
        elif line.count("/") == 2:
            try:
                results = p2.parse(line)
                flags1 = process_flags(results["flags1"])
                flags2 = process_flags(results["flags2"])
                flags3 = process_flags(results["flags3"])
                version1 = results["version1"].replace("_", " ").casefold()
                version2 = results["version2"].replace("_", " ").casefold()
                version3 = results["version3"].replace("_", " ").casefold()

                # Reset for every line so values from a previous line cannot leak in.
                american = british = None
                for flags, version in ((flags1, version1), (flags2, version2), (flags3, version3)):
                    if "A" in flags:
                        american = version
                    if "B" in flags:
                        british = version

                # Record the pair only if both varieties were found and they differ.
                if american is not None and british is not None and british != american:
                    resulting_dict[american] = "A"
                    resulting_dict[british] = "B"
            except Exception as e:
                logging.debug(f"Found error {e} for line:")
                logging.debug(line)
        else:
            logging.warning(f"Weird formatting with 0 or >2 slashes:\n{line}")
    return resulting_dict

def get_lexicon():
    f = "varcon.txt"
    with open(f, "r") as file:
        content = file.read()
    items = content.split("# ")[1:]
    results = {}
    for item in items:
        results.update(process_item(item))

    return results

get_lexicon()

A B D 1: amoebas / Av Bv Dv 1: amoebae / Av Dv 2: amebas / Av Dv 2: amebae
A B 1 2: aunties
A B: backward
A Av B: battleaxes
_: boatswain / _v: bosun / _V: bo'sun / _V: bos'n / _V: bo's'n
_: boatswains / _v: bosuns / _V: bo's'ns / _V: bo'suns / _V: bos'ns
_: boatswain's / _v: bosun's / _V: bo'sun's / _V: bo's'n's / _V: bos'n's
A B: buss
A B: busses
A B: bussing
A B: bussed
_ _.: cabbies
_: camelhair 
_: cesarean / _v: caesarean / _V 1: caesarian / _V 2: cesarian
_: cesareans / _v: caesareans / _V 1: caesarians / _V 2: cesarians
_: cesarean's / _v: caesarean's / _V 1: caesarian's / _V 2: cesarian's
A C: chilies / AV Cv: chiles / AV: chilis / AV B: chillies
_: chutzpah / _V 1: hutzpa / _V 2: chutzpa / _V 3: hutzpah
_: chutzpah's / _V 1: hutzpa's / _V 2: chutzpa's / _V 3: hutzpah's
_: chutzpahes / _V 1: hutzpas / _V 2: chutzpas / _V 3: hutzpahes
A B 1 2: dissed
A B 1 2: dissing
A AV B C Cv: distilled
A AV B C Cv: distilling
_ 1 2: dogies
A B C Cv: enthralling
A B C Cv: enthralled
A C Dv: 

{'abettor': 'A',
 'abetter': 'B',
 'abettors': 'A',
 'abetters': 'B',
 "abettor's": 'A',
 "abetter's": 'B',
 'abnormalise': 'B',
 'abnormalised': 'B',
 'abnormalising': 'B',
 'abolitionise': 'B',
 'abolitionised': 'B',
 'abolitionising': 'B',
 'abridgment': 'A',
 'abridgement': 'B',
 'abridgments': 'A',
 'abridgements': 'B',
 "abridgment's": 'A',
 "abridgement's": 'B',
 'academise': 'B',
 'academised': 'B',
 'academising': 'B',
 'acalephe': 'A',
 'acalephae': 'B',
 'accessorise': 'B',
 'accessorised': 'B',
 'accessorising': 'B',
 'accessorises': 'B',
 'acclimatisable': 'B',
 'acclimatisation': 'B',
 "acclimatisation's": 'B',
 'acclimatise': 'B',
 'acclimatised': 'B',
 'acclimatising': 'B',
 'acclimatises': 'B',
 'acclimatiser': 'B',
 'acclimatisers': 'B',
 'accorage': 'A',
 'accourage': 'B',
 'accoraged': 'A',
 'accouraged': 'B',
 'accoraging': 'A',
 'accouraging': 'B',
 'accorages': 'A',
 'accourages': 'B',
 'accouter': 'A',
 'accoutre': 'B',
 'accoutered': 'A',
 'accoutred': 'B',
 'a

In [2]:
"///".count("/")

3