## Final Runnable Morpheme Decomposer Model

This notebook is a **runnable Hungarian morpheme decomposer**. The files it needs are in the same folder: several CSV dataframes and the trained Keras H5 models.
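
Before running the cells, it can help to confirm that every file the notebook loads is actually present. The sketch below is not part of the original notebook: `REQUIRED_FILES` only collects the file names that appear in the `pd.read_csv` and `load_model` calls of this notebook, and `missing_files` is a hypothetical helper name.

```python
from pathlib import Path

# File names copied from the read_csv / load_model calls in this notebook.
REQUIRED_FILES = [
    'adverbs.csv', 'conjs.csv', 'dets.csv', 'nums_aggreg.csv',
    'nums_multipl.csv', 'nums_iter.csv', 'nums.csv', 'onos.csv',
    'postps.csv', 'preps.csv', 'prevs.csv', 'utt_ints.csv',
    'nums_extended.csv',
    'nums_conv_model_extended2.h5', 'adjectives_conv_model.h5',
    'determiners_conv_model2.h5', 'nouns_conv_model3.h5',
    'verbs_conv_model2.h5', 'types_conv_model3.h5',
]

def missing_files(folder='.', required=REQUIRED_FILES):
    """Return the required file names that do not exist in `folder`."""
    folder = Path(folder)
    return [name for name in required if not (folder / name).is_file()]

# Usage: an empty list means every CSV and H5 file was found.
# print(missing_files('.'))
```

An empty result means cells 2 and 3 should at least not fail on a missing file; it says nothing about whether the files have the expected columns or model architecture.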

**1.** Importing the necessary packages:

In [1]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import random
import math
random_state = 777
import nltk
import tensorflow as tf
from tensorflow.keras.models import load_model
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras import regularizers
from tensorflow.keras.optimizers import Adam
from keras.callbacks import EarlyStopping

Using TensorFlow backend.


**2.** Loading the dataframes for set-similarity recognition, appending additional words, and sorting the lists:

In [2]:
adverbs = pd.read_csv('adverbs.csv')
adverbs_list = list(adverbs['word'])

def_arts_list = ['a', 'az']
indef_arts_list = ['egy']

conjs = pd.read_csv('conjs.csv')
conjs_list = list(conjs['word'])

dets = pd.read_csv('dets.csv')
dets_list = list(dets['word'])
new_dets = 'egyugyanaz egyugyanez egyugyanezen egyugyanakként egyugyanekként ugyanaz ugyanez ugyanakként ugyanekként mindez e ugyanakkor ugyanekkor egyugyanakkor egyugyanekkor'.split(' ')
dets_list += new_dets
dets_list.sort()

nums_aggreg = pd.read_csv('nums_aggreg.csv')
nums_aggreg_list = list(nums_aggreg['word'])
new_nums_aggreg = 'akármennyien ketten hárman négyen öten hatan heten nyolcan kilencen tizen tizenegyen huszonhatan harminckilencen negyvenöten ötvenheten hatvanketten hetvennyolcan nyolcvannégyen kilencvenhárman sokszázan sokezren sokszázezren sokmillióan sokszázmillióan százmilióan sokmillárdan sokbillióan sokbilliárdan billiárdan billióan soktrillióan soktrilliárdan trilliárdan trillióan páran jópáran'.split(' ')
nums_aggreg_list += new_nums_aggreg
nums_aggreg_list = list(set(nums_aggreg_list))
nums_aggreg_list.sort()

nums_multipl = pd.read_csv('nums_multipl.csv')
nums_multipl_list = list(nums_multipl['word'])
new_nums_multipl = 'kétszerte négyszerte hétszerte kilencszerte százszorta ezerszerte miliószorta millárdszorta billiószorta billiárdszorta trilliószorta trilliárdszorta ötvenhetedszerte nyolcvankilencedszerte tizennegyedszerte hatvanegyedszerte harmincötödszörte hetvenkettedszerte negyvenhatdoszorta huszonharmadszorta kilencvennyolcadszorta'.split(' ')
nums_multipl_list += new_nums_multipl
nums_multipl_list.sort()

nums_iter = pd.read_csv('nums_iter.csv')
nums_iter_list = list(nums_iter['word'])
new_nums_iter = 'előszörre elsőre negyedszerre ötödszörre hetedszerre nyolcadszorra kilencedszerre tizedszerre századszorra ezredszerre milliárdadszorra billomodszorra billiárdadszorra trilliomodszorra trilliárdadszorra negyvenhatodszorra kilencvenharmadszorra huszonegyedszerre tizennegyedszerre nyolcvankettedszerre hetvenhetedszerre hatvankilencedszerre ötvennyolcadszorra harmincötödszörre'.split(' ')
nums_iter_list += new_nums_iter
nums_iter_list.sort()

nums = pd.read_csv('nums.csv')
nums.drop('Unnamed: 0', axis=1, inplace=True)
nums_list = list(set(nums['stem']))
new_nums = 'kilencvenöt nyolcvannégy ötvenhét hatvanhárom huszonhat harmincnyolc tizenegy hetvenkilenc negyvenkettő sokszáz sokezer sokszázezer sokmillió sokmilliárd billió sokbillió billiárd sokbilliárd trillió soktrillió trilliárd soktrilliárd többszáz többezer többmillió többmilliárd többtrillió többtrilliárd'.split(' ')
nums_list += new_nums
nums_list.sort()

nums_order_list = 'első második harmadik negyedik ötödik hatodik hetedik nyolcadik kilencedik tizedik hányadik akárhányadik valahányadik sokadik többedik negyvenkettedik ötvenegyedik tizenkilencedik hetvenötödik huszonnegyedik hatvanharmadik harminchatodik kilencvennyolcadik nyolcvanhetedik századik sokszázadik többszázadik ezredik sokezredik többezredik milliomodik sokmilliomodik többmilliomodik millárdadik sokmillárdadik többmillárdadik billiomodik sokbilliomodik billárdadik sokbillárdadik trilliomodik soktrilliomodik többtrilliomodik trillárdadik soktrillárdadik többtrillárdadik'.split(' ')

nums_conv_model = load_model('nums_conv_model_extended2.h5')

onos = pd.read_csv('onos.csv')
onos_list = list(onos['word'])

postps = pd.read_csv('postps.csv')
postps_list = list(set(postps['word']))
postps_list.sort()

preps = pd.read_csv('preps.csv')
preps_list = list(preps['word'])
preps_list += ['pro']

prevs = pd.read_csv('prevs.csv')
prevs_list = list(set(prevs['word']))
prevs_list.sort()

utt_ints = pd.read_csv('utt_ints.csv')
utt_ints_list = list(set(utt_ints['word']))
utt_ints_list.sort()

nums_extended = pd.read_csv('nums_extended.csv')
all_nums_list = list(nums_extended['word'])
all_nums_list.sort()

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


**3.** This code block defines the helper functions:

- An encoder and a decoder for the neural network inputs.
- Word-similarity and set-similarity functions for word types that contain only a few unique words.
- Morpheme decomposer functions for adjectives, determiners, nouns, numbers and verbs. Each uses a trained Keras model for prediction.
- A word type predictor function, which also uses a trained Keras model.
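
To make the encoder's input format concrete, here is a minimal sketch of the one-hot encoding with explicit input checks. It reuses the alphabet `chars` and the maximum length `W = 44` from the notebook; `safe_one_hot_encode` is a hypothetical name, and lowercasing the input is an assumption that the original `one_hot_encode` does not make.

```python
import numpy as np

# Same alphabet and maximum word length as in the notebook.
chars = " 0123456789.:,;!%&'*_-=~\\()|[]{}aáäbcdeéfghiíjklmnoóöőpqrstuúüűvwxyz"
encode_dict = {ch: i for i, ch in enumerate(chars)}
M = len(encode_dict)   # alphabet size (one column per character)
W = 44                 # maximum word length (one row per position)

def safe_one_hot_encode(word):
    """One-hot encode `word` into a (W, M) matrix, with explicit input checks.

    Hypothetical variant of one_hot_encode: it lowercases the input (an
    assumption) and raises a readable ValueError instead of KeyError/IndexError.
    """
    word = word.lower()
    if len(word) > W:
        raise ValueError(f'word longer than {W} characters: {word!r}')
    unknown = sorted(set(word) - set(chars))
    if unknown:
        raise ValueError(f'unsupported characters {unknown} in {word!r}')
    ohe = np.zeros((W, M))
    idx = np.array([encode_dict[ch] for ch in word], dtype=int)
    ohe[np.arange(len(idx)), idx] = 1   # one 1 per used row; unused rows stay all-zero
    return ohe

print(safe_one_hot_encode('ház').shape)
```

Each word becomes a fixed-size matrix with one row per character position, so words of different lengths can be stacked into one batch for the models.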

In [3]:
chars = " 0123456789.:,;!%&'*_-=~\\()|[]{}aáäbcdeéfghiíjklmnoóöőpqrstuúüűvwxyz"
encode_dict = {}
decode_dict = {}

for c in range(len(chars)):
    encode_dict[chars[c]] = c
    decode_dict[c] = chars[c]

def encode(w):
    ret = []
    for c in w:
        ret.append(encode_dict[c])
    return np.array(ret)

def decode(a):
    ret = []
    for i in a:
        ret.append(decode_dict[i])
    return ''.join(ret)

M = len(encode_dict)
W = 44
def one_hot_encode(w):
    e = encode(w)
    ohe = np.zeros((W, M))
    ohe[np.arange(len(e)),e] = 1
    return ohe 

def one_hot_encode_col(col):
    return np.array([one_hot_encode(w) for w in col])

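def word_similarity(w1, w2):
    # Despite the name, this is a normalized edit distance: 0 for identical words, 1 for maximally different ones.
    return nltk.edit_distance(w1, w2) / max(len(w1),len(w2))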

def set_similarity(word, word_set, word_set_avg_len=None):
    sim_full_sum = 0
    sim_end_sum = 0
    counter = 0
    
    if word_set_avg_len is not None:
        avg_len = word_set_avg_len
    else:
        avg_len = int(np.mean([len(w) for w in word_set]))

    for w in word_set:
        sim_full = 1 - word_similarity(word, w)
        sim_end = 1 -  word_similarity(word[max(0,len(word)-avg_len):], w)
        if sim_full > 0.5:
            counter += 1
            sim_full_sum += sim_full
        if sim_end > 0.5:
            counter += 1
            sim_end_sum += sim_end
    
    return ((sim_full_sum + sim_end_sum) / 2) * (counter / len(word_set))

adjectives_conv_model = load_model('adjectives_conv_model.h5')
def morphemes_adjective(adjective):
    example = one_hot_encode_col([adjective])
    prediction = adjectives_conv_model.predict(example).round()
    morpheme_names = ['ANP', 'ANP<PLUR>', 'CAS<ABL>', 'CAS<ACC>', 'CAS<ADE>', 'CAS<ALL>', 'CAS<CAU>', 'CAS<DAT>', 'CAS<DEL>', 'CAS<ELA>', 'CAS<ESS>', 'CAS<FOR>', 'CAS<ILL>', 'CAS<INE>', 'CAS<INS>', 'CAS<SBL>', 'CAS<SUE>', 'CAS<TEM>', 'CAS<TER>', 'CAS<TRA>', 'PLUR', 'PLUR<FAM>', 'POSS', 'POSS<1>', 'POSS<1><PLUR>', 'POSS<2>', 'POSS<2><PLUR>', 'POSS<PLUR>']
    return {'stem': adjective[:int(prediction[0][-1])], 'morphemes': [morpheme_names[i] for i in range(len(morpheme_names)) if prediction[0][i] == 1]}

determiners_conv_model = load_model('determiners_conv_model2.h5')
def morphemes_determiner(determiner):
    example = one_hot_encode_col([determiner])
    prediction = determiners_conv_model.predict(example).round()
    morpheme_names = ['ANP', 'ANP<PLUR>', 'CAS<ABL>', 'CAS<ACC>', 'CAS<ADE>', 'CAS<ALL>', 'CAS<CAU>', 'CAS<DAT>', 'CAS<DEL>', 'CAS<ELA>', 'CAS<FOR>', 'CAS<ILL>', 'CAS<INE>', 'CAS<INS>', 'CAS<SBL>', 'CAS<SUE>', 'CAS<TEM>', 'CAS<TRA>', 'PLUR', 'POSS<2><PLUR>']
    return {'stem': determiner[:int(prediction[0][-1])], 'morphemes': [morpheme_names[i] for i in range(len(morpheme_names)) if prediction[0][i] == 1]}

nouns_conv_model = load_model('nouns_conv_model3.h5')
def morphemes_noun(noun):
    example = one_hot_encode_col([noun])
    prediction = nouns_conv_model.predict(example).round()
    morpheme_names = ['ANP', 'ANP<PLUR>', 'CAS<ABL>', 'CAS<ACC>', 'CAS<ADE>', 'CAS<ALL>', 'CAS<CAU>', 'CAS<DAT>', 'CAS<DEL>', 'CAS<ELA>', 'CAS<ESS>', 'CAS<FOR>', 'CAS<ILL>', 'CAS<INE>', 'CAS<INS>', 'CAS<SBL>', 'CAS<SUE>', 'CAS<TEM>', 'CAS<TER>', 'CAS<TRA>', 'PERS', 'PERS<1>', 'PERS<2>', 'PLUR', 'PLUR<ANP>', 'PLUR<FAM>', 'POSS', 'POSS<1>', 'POSS<1><PLUR>', 'POSS<2>', 'POSS<2><PLUR>', 'POSS<PLUR>', 'POSTP<ALATT>', 'POSTP<ALÁ>', 'POSTP<ALÓL>', 'POSTP<ELLEN>', 'POSTP<ELLENÉRE>', 'POSTP<ELÉ>', 'POSTP<ELÉBE>', 'POSTP<ELŐL>', 'POSTP<ELŐTT>', 'POSTP<FELETT>', 'POSTP<FELÉ>', 'POSTP<FELÜL>', 'POSTP<FELŐL>', 'POSTP<FÖLIBE>', 'POSTP<FÖLÉ>', 'POSTP<FÖLÜL>', 'POSTP<HELYETT>', 'POSTP<IRÁNT>', 'POSTP<KÖRÉ>', 'POSTP<KÖRÖTT>', 'POSTP<KÖRÜL>', 'POSTP<KÖZBEN>', 'POSTP<KÖZIBE>', 'POSTP<KÖZÉ>', 'POSTP<KÖZÖTT>', 'POSTP<KÖZÜL>', 'POSTP<LÉTÉRE>', 'POSTP<MELLETT>', 'POSTP<MELLÉ>', 'POSTP<MELLŐL>', 'POSTP<MIATT>', 'POSTP<MÖGÉ>', 'POSTP<MÖGÖTT>', 'POSTP<MÖGÜL>', 'POSTP<NÉLKÜL>', 'POSTP<RÉSZÉRE>', 'POSTP<RÉSZÉRŐL>', 'POSTP<SZERINT>', 'POSTP<SZÁMÁRA>', 'POSTP<UTÁN>', 'POSTP<VÉGBŐL>', 'POSTP<VÉGETT>', 'POSTP<VÉGRE>', 'POSTP<ÁLTAL>', 'POSTP<ÓTA>']
    return {'stem': noun[:int(prediction[0][-1])], 'morphemes': [morpheme_names[i] for i in range(len(morpheme_names)) if prediction[0][i] == 1]}

nums_conv_model = load_model('nums_conv_model_extended2.h5')
def morphemes_num(num):
    example = one_hot_encode_col([num])
    prediction = nums_conv_model.predict(example).round()
    morpheme_names = ['AGGREG', 'ITER', 'MULTIPL', 'ORDER', 'COUNT']
    return {'morphemes': [morpheme_names[i] for i in range(len(morpheme_names)) if prediction[0][i] == 1]}

verbs_conv_model = load_model('verbs_conv_model2.h5')
def morphemes_verb(verb):
    example = one_hot_encode_col([verb])
    prediction = verbs_conv_model.predict(example).round()
    morpheme_names = ['COND', 'COND<PAST>', 'DEF', 'INF', 'MODAL', 'PAST', 'PERS', 'PERS<1<OBJ<2>>>', 'PERS<1>', 'PERS<2>', 'PLUR', 'SUBJUNC-IMP']
    return {'stem': verb[:int(prediction[0][-1])], 'morphemes': [morpheme_names[i] for i in range(len(morpheme_names)) if prediction[0][i] == 1]}

word_type_nn = load_model('types_conv_model3.h5')
def predict_basic_word_type(w):
    nn_input = one_hot_encode_col([w])
    return word_type_nn.predict(nn_input)[0]

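The set-similarity score can be tried without the CSV files. The sketch below follows the logic of `set_similarity` above, under two assumptions: a plain Levenshtein function stands in for `nltk.edit_distance`, and `toy_set` is a hand-picked illustrative list, not the contents of `postps.csv`.

```python
import numpy as np

def edit_distance(a, b):
    """Plain Levenshtein distance (stand-in for nltk.edit_distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_distance(w1, w2):
    # The quantity the notebook calls word_similarity: 0 = identical words.
    return edit_distance(w1, w2) / max(len(w1), len(w2))

def set_similarity(word, word_set):
    avg_len = int(np.mean([len(w) for w in word_set]))
    tail = word[max(0, len(word) - avg_len):]   # last avg_len characters
    sim_full_sum = sim_end_sum = 0.0
    counter = 0
    for w in word_set:
        sim_full = 1 - normalized_distance(word, w)
        sim_end = 1 - normalized_distance(tail, w)
        if sim_full > 0.5:      # the whole word is close to w
            counter += 1
            sim_full_sum += sim_full
        if sim_end > 0.5:       # the word ending is close to w
            counter += 1
            sim_end_sum += sim_end
    # Mean of the two sums, scaled by the share of close matches in the set.
    return ((sim_full_sum + sim_end_sum) / 2) * (counter / len(word_set))

toy_set = ['alatt', 'alá', 'alól', 'mellett', 'mellé']   # illustrative only
for candidate in ['alatt', 'alat', 'kutya']:
    print(candidate, round(set_similarity(candidate, toy_set), 3))
```

Because the result is multiplied by `counter / len(word_set)`, scores shrink as the reference set grows, which is one plausible reason the cutoffs used later are small values such as 0.01.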

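The `morphemes_*` functions all read the model output the same way. As a sketch of that layout, assuming (as the functions above imply) that the first `len(morpheme_names)` entries are 0/1 flags and the last entry is the stem length, the decoding step can be isolated like this. `decode_prediction` is a hypothetical helper, and `fake_row` is a hand-written stand-in, not a real model prediction.

```python
import numpy as np

def decode_prediction(word, prediction_row, morpheme_names):
    """Turn one model output row into a stem and a list of morpheme tags.

    Assumed layout: one 0/1 flag per morpheme name, followed by the stem
    length as the last entry.
    """
    row = np.asarray(prediction_row).round()
    stem_len = int(row[-1])
    morphemes = [name for name, flag in zip(morpheme_names, row) if flag == 1]
    return {'stem': word[:stem_len], 'morphemes': morphemes}

verb_morphemes = ['COND', 'COND<PAST>', 'DEF', 'INF', 'MODAL', 'PAST', 'PERS',
                  'PERS<1<OBJ<2>>>', 'PERS<1>', 'PERS<2>', 'PLUR', 'SUBJUNC-IMP']

# Hand-written stand-in for a model output row; NOT a real prediction.
fake_row = [0.1, 0.0, 0.9, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.2]
print(decode_prediction('csinálhatja', fake_row, verb_morphemes))
```

Since the stem is always a prefix of the input word, this scheme cannot recover stems that change under suffixation, which is visible in some of the example outputs later in the notebook.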

**4.** Finally, this code block performs the word type prediction and the morpheme decomposition. The first function, `expand_morphemes_dict`, only formats the results for output. The second, `predict_word_type`, combines everything from the previous code blocks.

The idea is the following:

- Suppose the word type predictor returns 0.62 for the noun class of an input word: this is the predictor's estimate of how likely the word is to be a noun.
- The code block defines a cutoff value for each word type; for nouns it is 0.47.
- A score above 0.47 is predicted as a noun, and a score at or below it is not. In the example, 0.62 passes.
- If a word passes, the function also calls the noun decomposer to return the predicted morphemes of the word.
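
The cutoff logic described above can be sketched in isolation. The cutoffs and the type order are copied from `predict_word_type`; the helper name `types_above_cutoff` and the score vector are made up for illustration, and the sketch leaves out the exact-match and set-similarity rules that the real function applies first.

```python
import numpy as np

# Cutoffs copied from predict_word_type; order follows the type model's outputs.
WORD_TYPES = ['ADJ', 'DET', 'NOUN', 'NUM', 'VERB']
CUTOFFS = {'ADJ': 0.115, 'DET': 0.006, 'NOUN': 0.47, 'NUM': 0.002, 'VERB': 0.5}

def types_above_cutoff(scores):
    """Return every word type whose score exceeds its cutoff.

    If no type passes, fall back to the single highest-scoring type.
    """
    passed = [t for t, s in zip(WORD_TYPES, scores) if s > CUTOFFS[t]]
    if not passed:
        passed = [WORD_TYPES[int(np.argmax(scores))]]
    return passed

# Made-up scores; the noun score is the 0.62 from the text above.
print(types_above_cutoff([0.05, 0.001, 0.62, 0.0005, 0.30]))
```

Because each type has its own cutoff, one word can pass for several types at once, which is why some of the example outputs list more than one word type.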

In [51]:
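def expand_morphemes_dict(word_type, m):
    if len(m['morphemes']) == 0:
        return word_type
    # morphemes_num returns no 'stem' key, so add the stem only when it is present
    stem_part = (', stem: ' + str(m['stem'])) if 'stem' in m else ''
    return str(word_type) + stem_part + ', morphemes: ' + ', '.join(m['morphemes'])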

def predict_word_type(w):
    
    pred_list = []
    
    #### exact matches ####
    
    # article
    if w in ['a', 'az']:
        pred_list.append('ART<DEF>')
    elif w == 'egy':
        pred_list.append('ART<INDEF>')
    
    # preverb
    if w in prevs_list:
        pred_list.append('PREV')
    
    #### set-similarities ####
    
    # adverb
    if w in adverbs_list:
        pred_list.append('ADV')
    elif set_similarity(w, adverbs_list) > 0.01:
        pred_list.append('ADV')
    
    # num
    is_num = False
    if w in all_nums_list:
        is_num = True
    elif set_similarity(w, all_nums_list) > 0.02:
        is_num = True
    
    # conjunction
    if w in conjs_list:
        pred_list.append('CONJ')
    elif set_similarity(w, conjs_list) > 0.01:
        pred_list.append('CONJ')
    
    # determiners
    is_determiner = False
    if w in dets_list:
        is_determiner = True
    elif set_similarity(w, dets_list) > 0.01:
        is_determiner = True
    
    # onomatopoeia
    if w in onos_list:
        pred_list.append('ONO')
    elif set_similarity(w, onos_list) > 0.0158:
        pred_list.append('ONO')
    
    # postposition
    if w in postps_list:
        pred_list.append('POSTP')
    elif set_similarity(w, postps_list) > 0.15:
        pred_list.append('POSTP')
    
    # preposition
    if w in preps_list:
        pred_list.append('PREP')
    elif set_similarity(w, preps_list) > 0.25:
        pred_list.append('PREP')
    
    # utterance / interjection
    if w in utt_ints_list:
        pred_list.append('UTT-INT')
    elif set_similarity(w, utt_ints_list) > 0.027:
        pred_list.append('UTT-INT')
    
    #### NNs ####
    
    # word type
    types = predict_basic_word_type(w)
    
    # adjective
    if types[0] > 0.115:
        adj_dict = morphemes_adjective(w)
        pred_list.append(expand_morphemes_dict('ADJ', adj_dict))
    
    # determiner
    if is_determiner or types[1] > 0.006:
        det_dict = morphemes_determiner(w)
        pred_list.append(expand_morphemes_dict('DET', det_dict))
    
    # noun
    if types[2] > 0.47:
        noun_dict = morphemes_noun(w)
        pred_list.append(expand_morphemes_dict('NOUN', noun_dict))
    
    # num
    if is_num or types[3] > 0.002:
        num_dict = morphemes_num(w)
        pred_list.append('NUM, type: ' + ', '.join(num_dict['morphemes']))
    
    # verb
    if types[4] > 0.5:
        verb_dict = morphemes_verb(w)
        pred_list.append(expand_morphemes_dict('VERB', verb_dict))
    
    if len(pred_list) == 0:
        
        word_type_list = ['ADJ', 'DET', 'NOUN', 'NUM', 'VERB']
        t = word_type_list[np.argmax(types)]
        
        if t == 'ADJ':
            m_dict = morphemes_adjective(w)
        elif t == 'DET':
            m_dict = morphemes_determiner(w)
        elif t == 'NOUN':
            m_dict = morphemes_noun(w)
        elif t == 'NUM':
            m_dict = morphemes_num(w)
        elif t == 'VERB':
            m_dict = morphemes_verb(w)
        
        pred_list.append(expand_morphemes_dict(t, m_dict))
    
    return '\n'.join(pred_list)

**5.** Example predictions:

In [5]:
print(predict_word_type('csinálhatja'))

VERB, stem: csinál, morphemes: DEF, MODAL


In [6]:
print(predict_word_type('csináltatok'))

VERB, stem: csinált, morphemes: PAST, PERS<2>, PLUR


In [7]:
print(predict_word_type('házaitokban'))

NOUN, stem: házait, morphemes: CAS<INE>, PLUR


In [8]:
print(predict_word_type('száz'))

NUM, type: COUNT


In [9]:
print(predict_word_type('blablabla'))

ONO
NOUN, stem: blablab, morphemes: CAS<ILL>


In [10]:
print(predict_word_type('szépek'))

VERB, stem: szép, morphemes: PERS<1>


In [11]:
print(predict_word_type('ülve'))

NOUN


In [12]:
print(predict_word_type('őz'))

NOUN
NUM, type: COUNT


In [13]:
print(predict_word_type('és'))

CONJ
NOUN


In [14]:
print(predict_word_type('többször'))

NOUN
NUM, type: COUNT


In [15]:
print(predict_word_type('elvétve'))

ADV
CONJ
ADJ
NOUN


In [16]:
print(predict_word_type('összegyűlik'))

VERB


In [17]:
print(predict_word_type('megcsinálhat'))

VERB, stem: megcsinál, morphemes: MODAL


In [18]:
print(predict_word_type('megcsinálgattuk'))

VERB, stem: megcsinálgat, morphemes: DEF, PAST, PERS<1>, PLUR


In [19]:
print(predict_word_type('kutyáinknak'))

NOUN, stem: kutyá, morphemes: CAS<DAT>, PLUR


In [20]:
print(predict_word_type('kutyátokra'))

NOUN, stem: kutyát, morphemes: CAS<SBL>, PLUR


In [21]:
print(predict_word_type('napsugár'))

NOUN


In [22]:
print(predict_word_type('napnál'))

NOUN


In [23]:
print(predict_word_type('naphoz'))

NOUN


In [24]:
print(predict_word_type('napoz'))

VERB


In [25]:
print(predict_word_type('nappal'))

ADV
NOUN


In [26]:
print(predict_word_type('napjainkban'))

NOUN, stem: napja, morphemes: CAS<INE>, PLUR, POSS<1><PLUR>


In [27]:
print(predict_word_type('napozóktól'))

ADJ, stem: napozó, morphemes: CAS<ABL>, PLUR
NOUN, stem: napozó, morphemes: CAS<ABL>, PLUR


In [28]:
print(predict_word_type('napjukból'))

NOUN, stem: napj, morphemes: CAS<ELA>


In [29]:
print(predict_word_type('falat'))

NOUN


In [30]:
print(predict_word_type('halat'))

ADV
NOUN


In [74]:
print(predict_word_type('kutat'))

ADJ
NOUN


In [31]:
print(predict_word_type('avagy'))

CONJ
NOUN


In [73]:
print(predict_word_type('ugyanilyen'))

ADV
ADJ
DET, stem: ugyani, morphemes: CAS<SUE>


In [75]:
print(predict_word_type('fogadnátok'))

VERB, stem: fogad, morphemes: COND, DEF, PERS<2>, PLUR


In [61]:
print(predict_word_type('ugyanattól'))

ADV
ADJ
DET
NOUN, stem: ugyanat, morphemes: CAS<ABL>


In [82]:
print(predict_word_type('százkilencszer'))

NOUN
NUM, type: ITER


In [86]:
print(predict_word_type('megtanulhattatok'))

VERB, stem: megtanul, morphemes: MODAL, PAST, PERS<2>, PLUR


In [87]:
print(predict_word_type('kinyújtóztak'))

VERB, stem: kinyújtóz, morphemes: PAST, PLUR


In [93]:
print(predict_word_type('megszentségtelenít'))

VERB, stem: megszentségtelen, morphemes: DEF


In [98]:
print(predict_word_type('átlós'))

ADJ
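
Since `predict_word_type` returns a formatted string, downstream code may want it back in structured form. The following is a minimal sketch of such a parser, assuming the output format shown in the examples above; `parse_prediction` is a hypothetical helper that does not exist in the notebook.

```python
def parse_prediction(output):
    """Parse the multi-line string returned by predict_word_type into dicts."""
    results = []
    for line in output.splitlines():
        if not line.strip():
            continue
        entry = {'type': None, 'stem': None, 'morphemes': []}
        # Split at the field markers first; the tag list itself is ', '-separated.
        head, _, morphs = line.partition(', morphemes: ')
        if morphs:
            entry['morphemes'] = morphs.split(', ')
        # NUM lines use ', type: ' instead of ', morphemes: '.
        head, _, num_type = head.partition(', type: ')
        if num_type:
            entry['morphemes'] = num_type.split(', ')
        head, _, stem = head.partition(', stem: ')
        if stem:
            entry['stem'] = stem
        entry['type'] = head
        results.append(entry)
    return results

# Output strings copied from the examples above.
print(parse_prediction('VERB, stem: csinál, morphemes: DEF, MODAL'))
print(parse_prediction('NOUN\nNUM, type: COUNT'))
```

A cleaner long-term design might be for `predict_word_type` to return the dictionaries directly and format them only when printing, but that would change the notebook's interface.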
