# 1. Introduction
This document demonstrates the features of the framework, which will be fully available at github.com/OlgunDursun/TurkishMorphologyFramework.

In [1]:
!pip install transformers

from tmf import Word
from tmf import Affix
from tmf import Analyzer
import pandas as pd
import csv





# 2. Resources

The framework relies on several resources, all of which are fully extensible.

## 2.1. Lexicon (roots dictionary)
The lexicon is compiled from several sources, including TDK's online dictionary.
It is stored in .csv format. The next cells load it and preview a few sample entries.

In [2]:

lexicon = {}

with open("tmf/resources/lexicon.csv", encoding="utf8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        word = row['entry']
        # Every column except 'entry' describes one reading of the word.
        features = {k: v for k, v in row.items() if k != 'entry'}
        # Homographs share a key, so each key maps to a list of readings.
        lexicon.setdefault(word.lower(), []).append(features)


In [3]:

print(lexicon['mahsul'])
print(lexicon['tahsil'])
print(lexicon['tercih'])
print(lexicon['kavim'])


[{'variants': 'mahsul', 'type_en': 'noun', 'suffixation': 'lü', 'origin': 'Arabic', 'semitic_root': 'HŞl', 'semitic_meter': 'mafˁūl', 'morph_features': '', 'definition': 'Ürün'}]
[{'variants': 'tahsil', 'type_en': 'noun', 'suffixation': '', 'origin': 'Arabic', 'semitic_root': 'HŞl', 'semitic_meter': 'tafˁīl', 'morph_features': '', 'definition': 'Parayı alma, toplama'}]
[{'variants': 'tercih', 'type_en': 'noun', 'suffixation': '', 'origin': 'Arabic', 'semitic_root': 'rch', 'semitic_meter': 'tafˁīl', 'morph_features': '', 'definition': 'Yeğleme'}]
[{'variants': 'kavim,kavm', 'type_en': 'noun', 'suffixation': 'vmi', 'origin': 'Arabic', 'semitic_root': 'Kwm', 'semitic_meter': 'faˁl', 'morph_features': '', 'definition': 'Aralarında töre, dil ve kültür ortaklığı bulunan, boy ve soy bakımından da birbirine bağlı insan topluluğu, budun'}]


Things to note here:


*   'mahsul' and 'tahsil' are words of Arabic origin with the same Semitic root, with different meters.
*   'tahsil' and 'tercih' have the same meter but different roots.
*   'kavim' has two variants: 'kavim' and 'kavm', where some suffixes cause the letter 'i' to drop.
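
The first two observations can be checked mechanically by grouping entries on a feature. The snippet below is a minimal sketch: `sample_lexicon` copies only the relevant fields from the output above, and `group_by` is a hypothetical helper, not part of `tmf`.

```python
from collections import defaultdict

# Sample entries copied from the printed output above (only the fields used here).
sample_lexicon = {
    "mahsul": [{"semitic_root": "HŞl", "semitic_meter": "mafˁūl"}],
    "tahsil": [{"semitic_root": "HŞl", "semitic_meter": "tafˁīl"}],
    "tercih": [{"semitic_root": "rch", "semitic_meter": "tafˁīl"}],
    "kavim": [{"semitic_root": "Kwm", "semitic_meter": "faˁl"}],
}

def group_by(lexicon, field):
    """Map each value of `field` to the sorted list of entries that carry it."""
    groups = defaultdict(set)
    for entry, readings in lexicon.items():
        for features in readings:
            value = features.get(field)
            if value:  # skip readings that leave this field empty
                groups[value].add(entry)
    return {value: sorted(entries) for value, entries in groups.items()}

by_root = group_by(sample_lexicon, "semitic_root")
by_meter = group_by(sample_lexicon, "semitic_meter")
print(by_root["HŞl"])      # entries sharing a root
print(by_meter["tafˁīl"])  # entries sharing a meter
```

The same helper should work on the full `lexicon` dictionary loaded above, since it has the same shape (entry mapped to a list of feature dictionaries).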



## 2.2. Affixes

Affixes are stored in an .xlsx file that is maintained by hand.


In [4]:
def from_excel_row(row):
    affix_id = row["affix_id"]
    representation = row["representation"]
    variants = row["variants"].split(",")
    input_pos = row["input_pos"].split(",")
    output_pos = row["output_pos"].split(",")
    input_features = row["input_features"]
    output_features = row["output_features"]
    wipe_features = row["wipe_features"]
    positional_type = row["positional_type"]
    functional_type = row["functional_type"]
    peculiarity = row["peculiarity"]
    example = row["example"]
    metadata = row["metadata"]
    return Affix(affix_id, representation, variants, input_pos, output_pos, 
                input_features, output_features, wipe_features, positional_type, 
                functional_type, peculiarity, example, metadata)

reader = pd.read_excel('tmf/resources/affixes.xlsx')
affixes = {}
for _, row in reader.iterrows():
    affix = from_excel_row(row)
    affixes[affix.affix_id] = affix


print('Five random samples from the affixes:\n--------------------------------------')
print(reader.sample(5).to_markdown())

Five random samples from the affixes:
--------------------------------------
|     | affix_id   | representation   | variants                                        | input_pos         | output_pos   |   input_features | output_features   |   wipe_features | positional_type   | functional_type   |   peculiarity | example             |   metadata |
|----:|:-----------|:-----------------|:------------------------------------------------|:------------------|:-------------|-----------------:|:------------------|----------------:|:------------------|:------------------|--------------:|:--------------------|-----------:|
| 109 | DER110     | lAş              | laş,leş                                         | adjective,noun    | verb         |              nan | nan               |             nan | suffix            | derivational      |           nan | haberleş-, koyulaş- |        nan |
| 201 | INFL060    | kH(n)            | ki,kin,kü,kün                                   | pronoun       
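
To give a feel for what the `variants` column is for, here is a minimal sketch that checks which stored variants a stem ends with. The variants and examples are taken from the DER110 row above; `matching_variants` is a hypothetical helper, and the analyzer's actual matching logic is not shown in this document.

```python
# Variants copied from the DER110 row above; the helper name is hypothetical.
der110_variants = "laş,leş".split(",")

def matching_variants(stem, variants):
    """Return the variants that `stem` ends with, longest first."""
    hits = [v for v in variants if stem.endswith(v)]
    return sorted(hits, key=len, reverse=True)

for stem in ("haberleş", "koyulaş", "gözlük"):
    print(stem, matching_variants(stem, der110_variants))
```

Storing the surface variants explicitly means no vowel-harmony rules are needed at lookup time in this sketch; a plain suffix comparison is enough.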

## 2.3. Constraint resources

Constraint resources are utility files that constrain the analysis. This demonstration includes two of them: forbidden combinations and overriding segmentations.

Forbidden combinations are pairs of affix IDs that may not occur together in the specified order.

Overriding segmentations hard-code the segmentation of specific words.
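
As a rough illustration of the override idea, the sketch below parses lines in a word-plus-segmentation format (segments joined by `+`) and prefers the hard-coded segmentation when one exists. `parse_override`, `segment` and the `fallback` callable are hypothetical names; how the analyzer itself consumes the override table is not shown here.

```python
def parse_override(line):
    """Split 'word seg+men+ta+tion' into (word, [segments])."""
    word, segmentation = line.split()
    return word, segmentation.split("+")

def segment(word, overrides, fallback):
    """Use the hard-coded segmentation when one exists, else call `fallback`."""
    if word in overrides:
        return overrides[word]
    return fallback(word)

sample_lines = ["buluşma bul+uş+ma", "buluş bul+uş", "birlikler bir+lik+ler"]
sample_overrides = dict(parse_override(line) for line in sample_lines)

# `fallback` stands in for the real analyzer; here it returns the word unsegmented.
print(segment("buluşma", sample_overrides, fallback=lambda w: [w]))
print(segment("kitap", sample_overrides, fallback=lambda w: [w]))
```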

In [5]:
with open("tmf/resources/forbidden_combinations.txt", "r", encoding = "utf-8") as f:
    forbidden_combinations = f.read().splitlines()

forbidden_combinations = [tuple(combination.split("\t")) for combination in forbidden_combinations]

print('Forbidden combinations examples')
print(forbidden_combinations[:3])


with open("tmf/resources/overriding_segmentations.txt", "r", encoding = "utf-8") as f:
    overriding_segmentations = f.read().splitlines()
    overrides = {}
    for item in overriding_segmentations:
        a, b = item.split()
        overrides[a] = b.split("+")

print('Overriding segmentations examples')
print(overriding_segmentations[:3])


Forbidden combinations examples
[('DER048', 'INFL096'), ('DER060', 'DER020'), ('INFL106', 'INFL022')]
Overriding segmentations examples
['buluşma bul+uş+ma', 'buluş bul+uş', 'birlikler bir+lik+ler']
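
One way to apply the forbidden pairs is to scan a candidate affix sequence for any adjacent pair that appears in the list. This is only a sketch: treating "go together" as strict adjacency is an assumption, `first_forbidden_pair` is a hypothetical helper, and the example sequences are made up for illustration.

```python
# Pairs copied from the printed examples above.
forbidden = {("DER048", "INFL096"), ("DER060", "DER020"), ("INFL106", "INFL022")}

def first_forbidden_pair(affix_ids, forbidden_pairs):
    """Return the first adjacent (left, right) pair that is forbidden, or None."""
    for pair in zip(affix_ids, affix_ids[1:]):
        if pair in forbidden_pairs:
            return pair
    return None

# Made-up sequences: order matters, so only the first one is rejected.
print(first_forbidden_pair(["DER115", "DER060", "DER020"], forbidden))
print(first_forbidden_pair(["DER020", "DER060"], forbidden))
```

Keeping the pairs in a set rather than a list makes each membership test constant time, which matters if many hypotheses are filtered per word.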


# 3. Morphological Analyzer

## 3.1. Basic properties

The morphological parser relies on two main classes, Word and Affix, which are instantiated when an analysis starts.

Here is how they are initialized:


```
class Affix:
    def __init__(self, affix_id, representation, variants, input_pos, output_pos,
                 input_features, output_features, wipe_features, positional_type,
                 functional_type, peculiarity, example, metadata):
        self.affix_id = affix_id
        self.representation = representation
        self.variants = variants
        self.input_pos = input_pos
        self.output_pos = output_pos
        self.input_features = self.parse_features(input_features)
        self.output_features = self.parse_features(output_features)
        self.wipe_features = self.parse_features(wipe_features)
        self.positional_type = positional_type
        self.functional_type = functional_type
        self.peculiarity = peculiarity
        self.example = example
        self.metadata = metadata


class Word:
    def __init__(self, surface_form, deep_form, prefix, root, stem, suffixes, morph_features, pos = "noun"):
        self.surface_form = surface_form
        if type(deep_form) != list:
            self.deep_form = [deep_form]
        else:
            self.deep_form = deep_form
        self.prefix = prefix
        self.root = root
        self.stem = stem
        if type(suffixes) != list:
            self.suffixes = [suffixes]
        else:
            self.suffixes = suffixes
            
        self.morph_features = morph_features
        self.pos = pos
```
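
The `Word` constructor wraps `deep_form` and `suffixes` in a list when they are not lists already. The snippet below isolates that behavior in a minimal sketch; `as_list` and `WordSketch` are hypothetical stand-ins, not the real `tmf.Word`.

```python
def as_list(value):
    """Wrap a single value in a list; return lists unchanged."""
    return value if isinstance(value, list) else [value]

class WordSketch:
    """Hypothetical stand-in for tmf.Word, reduced to the two normalized fields."""
    def __init__(self, deep_form, suffixes):
        self.deep_form = as_list(deep_form)
        self.suffixes = as_list(suffixes)

single = WordSketch("gözlükçülük", "DER115")
multi = WordSketch(["göz", "lük"], ["DER115"])
print(single.deep_form, single.suffixes)
print(multi.deep_form, multi.suffixes)
```

Under this rule an empty list stays empty, while a non-list value such as `None` would be wrapped as `[None]`.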



## 3.2. Analyzing a word

To analyze a word morphologically, call the analyzer with the input word, lexicon, affixes, forbidden combinations and overriding segmentations.

In [6]:
word = 'gözlükçülük'

analyses = Analyzer().analyze(word, lexicon, affixes, forbidden_combinations, overrides)


How many analyses do we have for this word?

In [7]:
print(len(set(analyses)))


17


Phew! Do these analyses even make sense?

To see the analyses in a somewhat readable format, let's iterate through them and print some of their properties.

In [8]:
for hyp in set(analyses):
    hyp.update_pos_and_morph_features(affixes)
    print('Surface form:', hyp.surface_form)
    print('Deep form:', hyp.deep_form)
    print('Suffixes:', hyp.suffixes)
    print('PoS:', hyp.pos)
    print('Morph features:', hyp.morph_features)
    print()

Surface form: gözlükçülük
Deep form: ['göz', 'lük', 'çül', 'ü', 'k']
Suffixes: ['DER115', 'DER078', 'INFL008', 'INFL108']
PoS: ['noun']
Morph features: {'Case': 'Nom', 'Number': 'Plur', 'Person': '1', 'Number[psor]': 'Sing', 'Person[psor]': '3', 'Tense': 'Pres', 'Polarity': 'Pos'}

Surface form: gözlükçülük
Deep form: ['göz', 'lük', 'çül', 'ü', 'k']
Suffixes: ['DER115', 'DER078', 'INFL011', 'INFL108']
PoS: ['noun']
Morph features: {'Case': 'Acc', 'Number': 'Plur', 'Person': '1', 'Tense': 'Pres', 'Polarity': 'Pos'}

Surface form: gözlükçülük
Deep form: ['gözlükçü', 'lü', 'k']
Suffixes: ['DER113', 'DER027']
PoS: ['verb']
Morph features: {'Number': 'Sing', 'Person': '3', 'Polarity': 'Pos'}

Surface form: gözlükçülük
Deep form: ['gözlükçülük']
Suffixes: []
PoS: noun
Morph features: {'Case': 'Nom', 'Number': 'Sing', 'Person': '3'}

Surface form: gözlükçülük
Deep form: ['göz', 'lük', 'çü', 'lük']
Suffixes: ['DER115', 'DER076', 'DER115']
PoS: ['noun']
Morph features: {'Case': 'Nom', 'Number':

Answer: Nope. Big nope. Let's add an overriding segmentation for this word to make it work better.

In [9]:
overrides['gözlükçülük'] = 'göz+lük+çü+lük'.split('+')

In [10]:
analyses = Analyzer().analyze(word, lexicon, affixes, forbidden_combinations, overrides)

for hyp in set(analyses):
    hyp.update_pos_and_morph_features(affixes)
    print('Surface form:', hyp.surface_form)
    print('Deep form:', hyp.deep_form)
    print('Suffixes:', hyp.suffixes)
    print('PoS:', hyp.pos)
    print('Morph features:', hyp.morph_features)
    print()

Surface form: gözlükçülük
Deep form: ['göz', 'lük', 'çü', 'lük']
Suffixes: ['DER115', 'DER076', 'DER116']
PoS: ['adjective']
Morph features: {'Number': 'Sing', 'Person': '3'}

Surface form: gözlükçülük
Deep form: ['göz', 'lük', 'çü', 'lük']
Suffixes: ['DER115', 'DER076', 'DER115']
PoS: ['noun']
Morph features: {'Case': 'Nom', 'Number': 'Sing', 'Person': '3'}



Now this makes much more sense. We have 'gözlükçülük' either as an adjective or a noun. Let's take a look at some other scenarios for the analyzer.

First, let's write a wrapper analysis printing function so that we don't have to repeat the code block each time.

In [11]:
def analysis_wrapper(word_input):

    analyses = Analyzer().analyze(word_input, roots=lexicon, affixes=affixes,
                                    forbidden_combinations=forbidden_combinations, overrides=overrides)
    for hyp in set(analyses):
        hyp.update_pos_and_morph_features(affixes)
        print('Surface form:', hyp.surface_form)
        print('Deep form:', hyp.deep_form)
        print('Suffixes:', hyp.suffixes)
        print('PoS:', hyp.pos)
        print('Morph features:', hyp.morph_features)
        print()

Analyzing a word of Arabic origin, together with a Turkish suffix:

In [12]:
analysis_wrapper('ihtişamla')

Surface form: ihtişamla
Deep form: ['Hşm', 'iftiˁāl', 'la']
Suffixes: ['iftiˁāl', 'DER104']
PoS: ['verb']
Morph features: {'Number': 'Sing', 'Person': '3', 'Polarity': 'Pos'}

Surface form: ihtişamla
Deep form: ['Hşm', 'iftiˁāl', 'la']
Suffixes: ['iftiˁāl', 'DER105']
PoS: ['noun']
Morph features: {'Case': 'Nom', 'Number': 'Sing', 'Person': '3'}

Surface form: ihtişamla
Deep form: ['Hşm', 'iftiˁāl', 'l', 'a']
Suffixes: ['iftiˁāl', 'DER009', 'INFL027']
PoS: ['adjective']
Morph features: {'Number': 'Sing', 'Person': '3', 'Case': 'Dat'}

Surface form: ihtişamla
Deep form: ['Hşm', 'iftiˁāl', 'la']
Suffixes: ['iftiˁāl', 'INFL017']
PoS: ['noun']
Morph features: {'Case': 'Ins', 'Number': 'Sing', 'Person': '3'}
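
The deep forms above split 'ihtişam' into a root ('Hşm') and a meter ('iftiˁāl'). In the traditional notation for Semitic patterns, the letters f, ˁ and l mark the slots for the first, second and third root consonant. The sketch below fills those slots; `fill_meter` is a hypothetical helper, it only handles three-consonant roots, and it is not how the framework itself is shown to work. Mapping the result to Turkish orthography (for example 'H' to 'h', 'ā' to 'a') is a separate step that is not covered here.

```python
# Slot letters of the meter notation: f, ˁ and l stand for the 1st, 2nd and 3rd root consonant.
SLOTS = "fˁl"

def fill_meter(root, meter):
    """Substitute the three root consonants into the meter's slot positions."""
    if len(root) != len(SLOTS):
        raise ValueError("this sketch only handles triliteral roots")
    mapping = dict(zip(SLOTS, root))
    return "".join(mapping.get(ch, ch) for ch in meter)

print(fill_meter("Hşm", "iftiˁāl"))
print(fill_meter("Kwm", "faˁl"))
```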



Analyzing an unknown proper name with a suffix (Zimbabwean'ın):

In [13]:
analysis_wrapper('Zimbabwean\'ın')

Surface form: Zimbabwean'ın
Deep form: ['Zimbabwean', 'ın']
Suffixes: ['DER058']
PoS: ['adverb']
Morph features: {'Case': 'Nom', 'Number': 'Sing', 'Person': '3'}

Surface form: Zimbabwean'ın
Deep form: ['Zimbabwean', 'ın']
Suffixes: ['INFL006']
PoS: ['noun']
Morph features: {'Case': 'Nom', 'Number': 'Sing', 'Person': '3', 'Number[psor]': 'Sing', 'Person[psor]': '2'}

Surface form: Zimbabwean'ın
Deep form: ['Zimbabwean', 'ın']
Suffixes: ['INFL015']
PoS: ['noun']
Morph features: {'Case': 'Gen', 'Number': 'Sing', 'Person': '3'}

Surface form: Zimbabwean'ın
Deep form: ['Zimbabwean', 'ın']
Suffixes: ['INFL008']
PoS: ['noun']
Morph features: {'Case': 'Nom', 'Number': 'Sing', 'Person': '3', 'Number[psor]': 'Sing', 'Person[psor]': '3'}



How are numbers handled?

In [14]:
analysis_wrapper('93')

Surface form: 93
Deep form: ['93']
Suffixes: []
PoS: number
Morph features: {'NumType': 'Card'}
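
Note that the printed analyses report PoS either as a one-element list (`['noun']`) or as a bare string (`noun`, `number`). If downstream code needs a uniform value, a small normalizer helps. This is a minimal sketch; `pos_label` is a hypothetical helper, not part of `tmf`.

```python
def pos_label(pos):
    """Return PoS as a single string whether it arrives as 'noun' or ['noun']."""
    if isinstance(pos, str):
        return pos
    return ",".join(pos)

for raw in (["noun"], "number", ["adjective"]):
    print(pos_label(raw))
```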



# 4. Morphological Disambiguator

The morphological disambiguator is a BERT model with additional heads for disambiguation. It is based on https://huggingface.co/dbmdz/bert-base-turkish-cased.
The .bin file for the model is not stored on GitHub for now. Running the next cell downloads it into the correct directory.

In [15]:
!wget -O tmf/resources/multihead_pos_morph_6.model/pytorch_model.bin 'https://www.dropbox.com/scl/fi/4b1sg25fsotxs4k6epma9/pytorch_model.bin\?rlkey=mxebcfkmwto3weomk3s2djyom&dl=1'
from tmf.disambiguator import *

In [16]:
def parse_sentence(sentence):
    # For every token, collect the candidate PoS tags and feature sets proposed by the analyzer.
    disamb_input_sent = {'sentence': [], 'pos_subset': [], 'morph_subset': []}

    for word in sentence.split():
        disamb_input_sent['sentence'].append(word)
        pos_subset = []
        morph_subset = []
        analyses = Analyzer().analyze(word, lexicon, affixes, forbidden_combinations, overrides)
        for hyp in analyses:
            pos_subset.append(convert_tag(hyp.pos))
            morph_subset.append(hyp.morph_features)
        disamb_input_sent['pos_subset'].append(pos_subset)
        disamb_input_sent['morph_subset'].append(morph_subset)

    # Let the disambiguator choose one reading per token, then print the result.
    sentence, pos, morph = infer(disamb_input_sent)
    for token, tag, feats in zip(sentence, pos, morph):
        print('Word: {} | POS: {} | Morph. Features: {}'.format(token, tag, feats))
    return sentence, pos, morph
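
`parse_sentence` hands `infer` the candidate PoS tags and feature sets for each token. How `infer` uses them is not shown here; one plausible reading is that the candidates restrict the model's choice. The sketch below illustrates that idea in plain Python. `pick_from_subset` and the scores are hypothetical and not part of `tmf`.

```python
def pick_from_subset(scores, allowed):
    """Return the highest-scoring label among `allowed`; use all labels if none apply."""
    candidates = {label: s for label, s in scores.items() if label in allowed}
    if not candidates:
        candidates = scores
    return max(candidates, key=candidates.get)

# Hypothetical per-label scores for one token, as a tagging head might produce.
scores = {"NOUN": 1.2, "ADJ": 2.5, "VERB": 0.3}

print(pick_from_subset(scores, allowed={"NOUN", "VERB"}))  # ADJ is excluded
print(pick_from_subset(scores, allowed=set()))             # no constraint available
```

The fallback to the unconstrained scores covers tokens for which the analyzer returns no candidates.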

Let's try disambiguating 'gözlükçülük' in a sentence, as we have two alternatives for it.

In [17]:
parse_sentence('gözlükçülük işi yapıyorum .')

Word: gözlükçülük | POS: NOUN | Morph. Features: {'Case': 'Nom', 'Number': 'Sing', 'Person': '3'}
Word: işi | POS: NOUN | Morph. Features: {'Case': 'Nom', 'Number': 'Sing', 'Person': '3'}
Word: yapıyorum | POS: VERB | Morph. Features: {'Polarity': 'Pos', 'Person': '3', 'Number': 'Sing', 'Tense': 'Pres', 'Mood': 'Imp'}
Word: . | POS: PUNCT | Morph. Features: {}


(['gözlükçülük', 'işi', 'yapıyorum', '.'],
 ['NOUN', 'NOUN', 'VERB', 'PUNCT'],
 [{'Case': 'Nom', 'Number': 'Sing', 'Person': '3'},
  {'Case': 'Nom', 'Number': 'Sing', 'Person': '3'},
  {'Polarity': 'Pos',
   'Person': '3',
   'Number': 'Sing',
   'Tense': 'Pres',
   'Mood': 'Imp'},
  {}])