# Preprocessing training set
This notebook preprocesses the training set from the raw data files, including:

- item, ort, pho: ../raw/Chang/6kdict
- pho to encoded pho: ../raw/Chang/mapping
- sem: ../raw/Chang/6ksem.mat
- wf: ../raw/ELP/elp_wf.csv

Outputs:
- training set (in a custom training/testing set format): ../train.pkl.gz
- human readable training set: ../train.csv

In [1]:
import pickle, gzip
import pandas as pd
import tensorflow as tf
from scipy.io import loadmat
from data_wrangling import (
    trim_unused_slots, 
    get_duplicates, 
    ort_to_binary,
    pho_to_binary,
    gen_pkey
)

# Basic checking

Load training set raw data

In [2]:
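train = pd.read_csv(
    "../raw/Chang/6kdict",
    sep=r"[\t| ]",  # regex separator: split on tab, pipe, or space
    engine="python",  # regex separators are only supported by the Python parser
    header=None,
    names=["word", "ort", "pho", "sampling_weight"],
    na_filter=False,
)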

train



Unnamed: 0,word,ort,pho,sampling_weight
0,a,____a_________,___^______,1.000000
1,ace,____a_ce______,___es_____,0.077460
2,ache,____a_che_____,___ek_____,0.161245
3,ached,____a_ched____,___ekt____,0.232379
4,aches,____a_ches____,___eks____,0.141421
...,...,...,...,...
6224,zoo,___zoo________,__zu______,0.527257
6225,zoom,___zoom_______,__zum_____,0.126491
6226,zoomed,___zoomed_____,__zumd____,0.077460
6227,zooms,___zooms______,__zumz____,0.077460


Remove unused slots

In [3]:
train['ort'] = trim_unused_slots(train.ort)
train['pho'] = trim_unused_slots(train.pho)

We have these slots: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
Removing unused slots: [0, 11, 12, 13]
We have these slots: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Removing unused slots: [8, 9]


AssertionError: 

Failed representation lenght safety check... 
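
The internals of `trim_unused_slots` are not shown in this notebook. As a rough sketch of the idea, assuming a slot counts as unused when it holds the padding character `_` in every item, and assuming all representations must share one length (the check that failed above), a stand-alone version might look like the following. The name `trim_unused_slots_sketch` and its plain-list interface are hypothetical and are not the actual helper in `data_wrangling`.

```python
def trim_unused_slots_sketch(reps, pad="_"):
    """Drop slot positions that contain only `pad` in every representation.

    Hypothetical illustration; not the helper from data_wrangling.
    """
    lengths = {len(r) for r in reps}
    # Slots can only be compared position by position if every
    # representation has the same number of slots.
    assert len(lengths) == 1, f"Inconsistent representation lengths: {sorted(lengths)}"
    n_slots = lengths.pop()
    unused = [i for i in range(n_slots) if all(r[i] == pad for r in reps)]
    keep = [i for i in range(n_slots) if i not in unused]
    trimmed = ["".join(r[i] for i in keep) for r in reps]
    return trimmed, unused


# Toy input: only slot 0 is padding in every item.
trimmed, unused = trim_unused_slots_sketch(["__a_", "_ba_", "__ab"])
```

In this sketch the length check runs before any slot is removed, so a single malformed item stops the trimming instead of silently shifting slots.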

In [4]:
print(set(len(x) for x in train['pho']))  # The length check failed on the phoneme column: some words have 11 slots in pho

{10, 11}


The pho representations have either 10 or 11 slots, so we need to find the 11-slot items.

In [5]:
train.loc[train.pho.str.len() >= 11]

Unnamed: 0,word,ort,pho,sampling_weight
900,clutched,_clu_tched,_kl^tCt____,0.232379
1187,dang,__da_ng___,__d@ng_____,0.05


For some reason, two words have 11 slots in pho. The extra slot appears to be the trailing one, so we trim it manually.

In [6]:
train.loc[train.pho == "_kl^tCt____",'pho'] = "_kl^tCt___"
train.loc[train.pho == "__d@ng_____",'pho'] = "__d@ng____"
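
Matching the two strings literally works for this dataset, but the same repair can also be written as a guarded helper. The sketch below is an illustration, not part of the original pipeline: `trim_trailing_padding` is a hypothetical name, and it assumes that any extra slots sit at the end of the string and contain only the padding character `_`.

```python
def trim_trailing_padding(rep, expected_len, pad="_"):
    """Cut `rep` down to `expected_len` slots, but only if the surplus is padding."""
    extra = rep[expected_len:]
    if not extra:
        return rep  # already short enough, nothing to trim
    if set(extra) != {pad}:
        # Refuse to drop real tokens; this case needs manual inspection.
        raise ValueError(f"Cannot trim {rep!r}: surplus slots {extra!r} are not padding")
    return rep[:expected_len]


# The two 11-slot items found above, trimmed to 10 slots.
fixed = [trim_trailing_padding(p, 10) for p in ["_kl^tCt____", "__d@ng_____"]]
```

With pandas, the helper could be applied to the whole column through `train.pho.map(...)`. Raising an error instead of truncating blindly keeps a surplus non-padding token from being lost unnoticed.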

Redo the slot trimming in pho

In [7]:
train['pho'] = trim_unused_slots(train.pho)

We have these slots: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Removing unused slots: [8, 9]


Check slot usage

In [8]:
slots_usage = {slot: {x[slot] for x in train.pho} for slot in range(len(train.pho[0]))}
print(slots_usage)

{0: {'_', 's'}, 1: {'S', 'p', 'k', 't', 'f', '_', 'b', 'g', 'T', 's', 'd'}, 2: {'f', 'g', 'b', 'z', 'm', 'C', 'n', 't', 'r', '_', 'w', 's', 'y', 'd', 'l', 'S', 'p', 'T', 'v', 'h', 'k', 'D', 'J'}, 3: {'o', 'I', 'e', 'u', 'a', 'U', 'W', 'i', 'O', '@', 'Y', '^', 'A', 'E'}, 4: {'f', 'g', 'b', 'z', 'Z', 'm', 'C', 'n', 't', 'r', '_', 's', 'd', 'l', 'S', 'p', 'T', 'v', 'D', 'k', 'J'}, 5: {'m', 'C', 'S', 'p', 'n', 't', 'k', 'f', 'v', 'J', '_', 'g', 'b', 'T', 's', 'l', 'd', 'z'}, 6: {'t', 'J', '_', 'T', 's', 'd', 'z'}, 7: {'t', '_', 'z', 's'}}


Check that every unique token in the data can be found in the mapping file.

In [9]:
unique_token = {x for slot in slots_usage.values() for x in slot}
print(unique_token)
print(f"{len(unique_token)} unique tokens in the training set")

{'f', 'i', 'g', 'b', '@', 'z', 'Y', 'E', 'm', 'C', 'Z', 'e', 'U', 'n', 't', 'r', '_', 'w', 's', 'y', 'd', 'l', 'S', 'o', 'I', 'p', 'u', 'a', 'O', 'T', '^', 'A', 'v', 'h', 'k', 'D', 'W', 'J'}
38 unique tokens in the training set


In [10]:
pho_map = gen_pkey(key_file="../raw/Chang/mapping")

In [11]:
token_with_mapping = set(pho_map.keys())
print(token_with_mapping)
print(f"{len(token_with_mapping)} unique tokens in the mapping file")

{'f', 'i', 'g', 'b', '@', 'z', 'E', 'Z', 'm', 'C', 'Y', 'e', 'U', 'n', 't', 'r', '_', 's', 'w', 'y', 'd', 'l', 'S', 'o', 'I', 'p', 'a', 'u', 'O', 'T', '^', 'A', 'v', 'D', 'k', 'h', 'W', 'J', 'G'}
39 unique tokens in the mapping file


In [12]:
print("This token is not used in the data, but exists in the mapping file:")
set(pho_map.keys()).difference(unique_token)

This token is not used in the data, but exists in the mapping file:


{'G'}

`G` exists in the mapping file but is not used in the dataset.

In [13]:
print("This token is used in the data, but does not exist in the mapping file:")
unique_token.difference(pho_map.keys())

This token is used in the data, but does not exist in the mapping file:


set()

All tokens can be found in pho_map, so it should be safe to use this mapping dict with the current training set.
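
The two set differences above can be folded into one reusable check. This is a minimal sketch under the assumption that each character of a representation is one token and that the mapping is a dict keyed by token; `token_coverage` is a hypothetical name, and the toy mapping values are placeholders, not the real encoded phonemes.

```python
def token_coverage(representations, mapping):
    """Return (tokens used but not mapped, tokens mapped but never used)."""
    used = set("".join(representations))
    known = set(mapping)
    return used - known, known - used


# Toy mapping; the integer values are placeholders for illustration only.
toy_map = {"_": 0, "S": 1, "^": 2, "t": 3, "m": 4, "@": 5, "n": 6, "G": 7}
missing, unused = token_coverage(["__S^t___", "__m@n___"], toy_map)
```

Only the first set matters for safety: an unmapped token would break the encoding step, while an unused mapping entry such as `G` is harmless.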

# Get word frequency
Since Chang did not provide the raw word frequency, we need to obtain it from elsewhere. Here we use SUBTLWF from the ELP (English Lexicon Project) file.

In [14]:
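# '#' marks a missing value in this file. `na_filter` is a boolean flag (default True),
# so it cannot carry the marker; '#' is kept as a string and handled when parsing SUBTLWF.
df_wf = pd.read_csv('../raw/ELP/elp_wf.csv', index_col=None, thousands=',')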
df_wf.sample(10)

Unnamed: 0,Occurrences,Word,Length,Freq_HAL,Log_Freq_HAL,SUBTLWF
3699,1,queer,5,2459,7.808,5.800
3180,1,notes,5,42861,10.666,24.610
57,1,as,2,2642511,14.787,2217.020
4382,1,slag,4,579,6.361,0.760
5168,1,that,4,5262331,15.476,14111.310
5680,1,weed,4,3010,8.01,11.760
1779,1,fro,3,1163,7.059,1.140
2547,1,kilns,5,77,4.344,#
3258,1,paid,4,50044,10.821,85.670
836,1,clone,5,62986,11.051,2.510


Parse SUBTLWF as a float: strip thousands separators and treat the missing-value marker `#` as 0.

In [15]:
# regex=False makes it explicit that these are literal replacements
df_wf['SUBTLWF'] = df_wf['SUBTLWF'].str.replace(',', '', regex=False)
df_wf['SUBTLWF'] = df_wf['SUBTLWF'].str.replace('#', '0', regex=False)
df_wf['SUBTLWF'] = df_wf['SUBTLWF'].astype(float)
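
The same parsing rule can be written for a single value in plain Python, which makes the handling of the `#` marker easier to test in isolation. `parse_frequency` is a hypothetical helper, and mapping a missing frequency to 0 simply mirrors the cell above; whether 0 is the right substitute is a modelling choice.

```python
def parse_frequency(raw, missing_marker="#"):
    """Convert a raw frequency string such as '14111.310' or '#' to a float."""
    text = str(raw).strip().replace(",", "")  # drop thousands separators
    if text == missing_marker:
        return 0.0  # assumption carried over from the notebook: missing means 0
    return float(text)


parsed = [parse_frequency(v) for v in ["5.800", "14111.310", "#", "1,234.5"]]
```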

Merge the word frequencies into the training set by word, then drop the original sampling weights.

In [16]:
wf_for_merge = df_wf[['Word', 'SUBTLWF']].rename(columns={'Word': 'word', 'SUBTLWF': 'wf'})
train = train.merge(wf_for_merge, on='word', how='left')
train.pop('sampling_weight')

0       1.000000
1       0.077460
2       0.161245
3       0.232379
4       0.141421
          ...   
6224    0.527257
6225    0.126491
6226    0.077460
6227    0.077460
6228    0.252982
Name: sampling_weight, Length: 6229, dtype: float64
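
A left merge can go wrong in two quiet ways: a word that is absent from the frequency table ends up without a frequency, and a word listed twice in the frequency table duplicates rows in the training set. The plain-Python sketch below illustrates both checks on toy data. `left_join_wf` is a hypothetical name and `"qwzx"` is a made-up word used only to trigger the missing case.

```python
def left_join_wf(words, wf_rows):
    """Attach a frequency to each word, keeping the order and length of `words`."""
    lookup = {}
    for word, wf in wf_rows:
        if word in lookup:
            # A duplicated key would multiply rows in a real merge.
            raise ValueError(f"Duplicate frequency entry for {word!r}")
        lookup[word] = wf
    merged = [(w, lookup.get(w)) for w in words]
    missing = [w for w, wf in merged if wf is None]
    return merged, missing


merged, missing = left_join_wf(
    ["man", "swim", "qwzx"],
    [("man", 1845.75), ("swim", 31.8)],
)
```

In pandas, the `validate="many_to_one"` and `indicator=True` arguments of `merge` can serve a similar purpose.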

# Check semantics (WordNet)

In [17]:
wn_repr = loadmat('../raw/Chang/6ksem.mat')
wn_repr.keys()

dict_keys(['__header__', '__version__', '__globals__', 'numFeature', 'semantic_vector', 'maxFeature', 'tarWord'])

Safety check: is 6ksem.mat sorted in the same order as 6kdict?

In [18]:
wn_word_seq = [str(wn_repr['tarWord'][0, x][0]) for x in range(len(train))]  # one entry per training item
assert wn_word_seq == train.word.tolist()
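
A bare `assert` only reports that the two sequences differ. When the order check fails, it helps to know where the first difference is. The helper below is a small sketch for that purpose; `first_mismatch` is a hypothetical name and is not used elsewhere in this notebook.

```python
from itertools import zip_longest


def first_mismatch(left, right):
    """Return (index, left_item, right_item) for the first difference, else None."""
    for i, (a, b) in enumerate(zip_longest(left, right)):
        if a != b:
            return i, a, b  # a or b is None when one sequence is shorter
    return None


same = first_mismatch(["a", "ace", "ache"], ["a", "ace", "ache"])
diff = first_mismatch(["a", "ace", "ache"], ["a", "ache", "ace"])
```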

Extract the semantic vectors from the WordNet representation.

In [19]:
sem = wn_repr['semantic_vector'].astype(float)
sem.shape

(6229, 2446)

Word frequency is counted per orthographic form (ort), but the dataset contains some homographs. We therefore need to adjust the word frequency so that homographs are not over-sampled.

In [20]:
homographs = get_duplicates(train, 'ort')
print(f"There are {len(homographs)} homographs in the training set") 

There are 39 homographs in the training set
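
From the way `homographs` is used in this notebook, `get_duplicates` appears to return a mapping from each duplicated value to the number of rows that share it. A minimal sketch of that behaviour on a plain list, using `collections.Counter`, might look like this. The name `get_duplicates_sketch` and the list-based interface are assumptions; the real helper takes a DataFrame and a column name.

```python
from collections import Counter


def get_duplicates_sketch(values):
    """Map each value that occurs more than once to its number of occurrences."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}


dups = get_duplicates_sketch(["_shu_t____", "_shu_t____", "__ma_n____"])
```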


Show a homograph example.

In [21]:
train.loc[train.ort=='_shu_t____']

Unnamed: 0,word,ort,pho,wf
4496,shut,_shu_t____,__S^t___,263.82
4497,shut,_shu_t____,__S^t___,263.82


Now adjust the word frequency so that duplicated words are not over-sampled: each entry's frequency is divided by the number of entries sharing its orthography.

In [22]:
def adjust_wf(row):
    """Divide a word's frequency by the number of entries sharing its orthography."""
    if row.ort in homographs:
        return row.wf / homographs[row.ort]
    return row.wf


train['adjusted_wf'] = train.apply(adjust_wf, axis=1)

Print some examples to make sure it is working as expected.

In [23]:
train.loc[train.word.isin(['shut', 'man', 'swim'])]

Unnamed: 0,word,ort,pho,wf,adjusted_wf
3014,man,__ma_n____,__m@n___,1845.75,1845.75
4496,shut,_shu_t____,__S^t___,263.82,131.91
4497,shut,_shu_t____,__S^t___,263.82,131.91
5265,swim,_swi_m____,_swIm___,31.8,31.8
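
Beyond eyeballing a few rows, the adjustment can be checked against an invariant: for every orthographic form, the adjusted frequencies of its entries should add up to the original frequency. The sketch below tests this on the four rows shown above. `check_adjusted_totals` is a hypothetical helper, and it assumes that all entries sharing an ort carry the same original `wf`.

```python
from collections import defaultdict


def check_adjusted_totals(rows, tol=1e-9):
    """rows: iterable of (ort, wf, adjusted_wf). Return the orts that violate the invariant."""
    totals = defaultdict(float)
    original = {}
    for ort, wf, adjusted in rows:
        totals[ort] += adjusted
        original[ort] = wf  # assumed identical across entries of the same ort
    return {
        ort: (totals[ort], original[ort])
        for ort in totals
        if abs(totals[ort] - original[ort]) > tol
    }


violations = check_adjusted_totals([
    ("__ma_n____", 1845.75, 1845.75),
    ("_shu_t____", 263.82, 131.91),
    ("_shu_t____", 263.82, 131.91),
    ("_swi_m____", 31.8, 31.8),
])
```

An empty result means every group sums back to its original frequency.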


Save a human-readable CSV copy of the training set for convenient reference.

In [24]:
train.to_csv('../train.csv')  # Checkpoint

# Encode training set

In [25]:
ort_np, ort_tokenizers = ort_to_binary(train.ort)

Token count: defaultdict(<class 'int'>, {'_': 6004, 'c': 4, 'p': 3, 's': 178, 't': 40})
Token count: defaultdict(<class 'int'>, {'_': 3772, 'b': 192, 'c': 393, 'h': 71, 'd': 80, 'f': 168, 'g': 174, 'k': 37, 'p': 191, 'q': 78, 'r': 5, 's': 735, 't': 236, 'w': 97})
Token count: defaultdict(<class 'int'>, {'_': 210, 'b': 305, 'l': 795, 'r': 1090, 'c': 305, 'h': 672, 'z': 24, 'd': 224, 'w': 305, 'f': 212, 'g': 174, 'n': 214, 'j': 97, 'k': 86, 'm': 289, 'p': 378, 's': 235, 'u': 83, 't': 391, 'v': 79, 'y': 61})
Token count: defaultdict(<class 'int'>, {'a': 1663, 'e': 1145, 'i': 1130, 'o': 1493, 'u': 737, 'y': 61})
Token count: defaultdict(<class 'int'>, {'_': 4374, 'i': 274, 'u': 241, 'w': 237, 'y': 89, 'a': 434, 'e': 353, 'h': 3, 'o': 224})
Token count: defaultdict(<class 'int'>, {'_': 239, 'c': 304, 'd': 347, 'f': 158, 'g': 270, 'l': 693, 'm': 377, 'r': 791, 's': 594, 'n': 915, 'p': 357, 't': 565, 'e': 35, 'x': 35, 'b': 168, 'z': 51, 'k': 187, 'u': 4, 'v': 126, 'q': 7, 'h': 4, 'w': 2})
... (output truncated)
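
The output above suggests that `ort_to_binary` handles each slot separately, with its own token inventory. The implementation is not shown here, so the following is only a sketch of slot-wise one-hot encoding in plain Python: every slot gets one unit per token seen in that slot, and the per-slot vectors are concatenated. `slotwise_one_hot` is a hypothetical name, and details such as whether the padding token `_` gets its own unit are assumptions that may differ from the real helper.

```python
def slotwise_one_hot(reps):
    """One-hot encode each slot with its own vocabulary and concatenate the slots."""
    n_slots = len(reps[0])
    # One sorted vocabulary per slot, built from the tokens seen in that slot.
    vocabs = [sorted({r[i] for r in reps}) for i in range(n_slots)]
    encoded = []
    for r in reps:
        row = []
        for i, vocab in enumerate(vocabs):
            row.extend(1.0 if r[i] == token else 0.0 for token in vocab)
        encoded.append(row)
    return encoded, vocabs


# Toy input: slot 0 has two possible tokens, slot 1 has one.
encoded, vocabs = slotwise_one_hot(["_a", "ba"])
```

In this scheme the width of the encoding is the sum of the per-slot vocabulary sizes, so it depends on the data as well as on the number of slots.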

Save orthographic tokenizer for later use.

In [26]:
tokenizer_jsons = [t.to_json() for t in ort_tokenizers]
with open("../tokenizer/ort_tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer_jsons, f)

Package dataset

In [27]:
train_package = {
    "id": train.word.index.tolist(),
    "item": train.word.tolist(),
    "wf": train.adjusted_wf.tolist(),
    "ort": tf.constant(ort_np, dtype=tf.float32),
    "pho": tf.constant(pho_to_binary(train.pho, mapping=pho_map), dtype=tf.float32),
    "sem": tf.constant(sem, dtype=tf.float32),
    "graphem": train.ort.tolist(),  # 10 slots, vowel at slot 3 (0-indexed)
    "phoneme": train.pho.tolist(),  # 8 slots, vowel at slot 3 (0-indexed)
}



Verify dimensions are correct

In [28]:
print(len(train_package["id"]))
print(len(train_package["item"]))
print(len(train_package["wf"]))
print(train_package["ort"].shape)
print(train_package["pho"].shape)
print(train_package["sem"].shape)
print(len(train_package["graphem"]))
print(len(train_package["phoneme"]))

6229
6229
6229
(6229, 119)
(6229, 200)
(6229, 2446)
6229
6229
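
Printing the sizes works, but the same check can fail loudly if a field ever gets out of sync. The sketch below compares the leading dimension of every field in a package. `check_row_counts` and `FakeTensor` are hypothetical names; `FakeTensor` only stands in for a tensor so that the example runs without TensorFlow, and the sketch assumes tensor-like objects expose a `.shape` whose first entry is the row count.

```python
def leading_size(value):
    """Number of rows: shape[0] for tensor-like objects, len() otherwise."""
    shape = getattr(value, "shape", None)
    return shape[0] if shape is not None else len(value)


def check_row_counts(package):
    """Assert that every field of the package has the same number of rows."""
    counts = {key: leading_size(value) for key, value in package.items()}
    assert len(set(counts.values())) == 1, f"Row counts differ: {counts}"
    return counts


class FakeTensor:
    """Stand-in for a tensor, for illustration only."""

    def __init__(self, shape):
        self.shape = shape


counts = check_row_counts({"item": ["a", "ace"], "ort": FakeTensor((2, 119))})
```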


Export

In [29]:
with gzip.open("../train.pkl.gz", "wb") as f:
    pickle.dump(train_package, f)
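
Reading the file back is the mirror image of the export. The round trip below is a minimal sketch on a toy dictionary written to a temporary directory; `save_package` and `load_package` are hypothetical helper names, and the toy values are not real training data. Loading the real `train.pkl.gz` presumably also requires TensorFlow to be installed, since the package stores tensors.

```python
import gzip
import os
import pickle
import tempfile


def save_package(package, path):
    """Write a package as a gzip-compressed pickle."""
    with gzip.open(path, "wb") as f:
        pickle.dump(package, f)


def load_package(path):
    """Read a gzip-compressed pickle written by save_package."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


toy = {"item": ["a", "ace"], "wf": [1.0, 2.0]}  # toy values for illustration
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl.gz")
    save_package(toy, path)
    restored = load_package(path)
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be opened this way.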