# Tutorial 

## Part-III

In this live session we will **(a)** reorganize the tokenized input into analyses, **(b)** parse each analysis into its root and a stream of affixes (a second level of tokenization), **(c)** parse that stream into separate morphemes (a third level), **(d)** extract the set of morphemes, and **(e)** export the results in *csv* format.

### * imports

In [1]:
import glob
from pprint import pprint

### ** globals

In [2]:
input_path = "./in/"
output_path = "./out/"

### *** read input

In [3]:
def read_and_store_input(input_path):
    # initialize dictionary
    data = {}
    # read file names from the input directory
    input_files = glob.glob(input_path+'*.txt')
    for file_path in input_files:
        # create an empty list with a key as the name of the current file
        file_name = file_path.split('/')[-1]
        data[file_name]=[]
        # 'with' closes the file when done; utf-8 is an assumption (Turkish characters)
        with open(file_path, 'r', encoding='utf-8') as content:
            # populate the list with content from file
            for line in content:
                # check if line has content
                if line.strip() != "":
                    data[file_name].append(line.strip())
    return data

In [4]:
data = read_and_store_input(input_path)

In [5]:
data

{'00001131_pp_ma.txt': ['(nü[JJ])+[Proper=False]/(nü[JJ])+[Proper=True] (Peri[NNP]+[PersonNumber=A3sg]+[Possessive=Pnon]+DA[Case=Loc])+[Proper=True]/(Peride[NNP]+[PersonNumber=A3sg]+[Possessive=Pnon]+[Case=Bare])+[Proper=True]',
  '(Hakan[NNP]+[PersonNumber=A3sg]+[Possessive=Pnon]+[Case=Bare])+[Proper=True]/(Hakan[NNP]+[PersonNumber=A3sg]+[Possessive=Pnon]+[Case=Nom])+[Proper=True] (Akdoğan[NNP]+[PersonNumber=A3sg]+[Possessive=Pnon]+[Case=Bare])+[Proper=True]/(Akdoğan[NNP]+[PersonNumber=A3sg]+[Possessive=Pnon]+[Case=Nom])+[Proper=True]',
  '(Roma[NNP]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+[Case=Bare])+[Proper=True]/(Roma[NNP]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+[Case=Nom])+[Proper=True]',
  '(7[CD])+[Proper=False]/(7[CD])+[Proper=True]',
  '(Koca[NNP]+[PersonNumber=A3sg]+[Possessive=Pnon]+[Case=Bare])+[Proper=True]/(Koca[NNP]+[PersonNumber=A3sg]+[Possessive=Pnon]+[Case=Nom])+[Proper=True] (duvar[NN]+[PersonNumber=A3sg]+[Possessive=Pnon]+[Case=Bare])+[Proper=False]/(duvar[NN]+[Per

## database

**current**: 
```
    {   
        file_1:[line_1, ..., line_n],
        ...,
        file_m:[line_1, ..., line_k]
    }
```

**envisioned**:
```
    {
        file_name: {
                        word_index: {
                                        analysis_index: {
                                                            'root'     :root,
                                                            'affixes'  :[affixes],
                                                            'pos'      :pos_tag,
                                                            'is_proper':bool
                                                        }
                                    }
                    }      
    }
```
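
To make the target layout concrete, here is a small hand-written instance and the key chain that reaches one analysis. The file name `toy.txt` and all values are made up for illustration; the inner keys follow the names that the cells further down end up using (`'affixes'`, `'is_proper'`).

```python
# toy instance of the envisioned layout (hypothetical file name and values)
toy = {
    'toy.txt': {
        0: {  # word_index
            0: {'root': 'Sefer', 'affixes': ['NDA[Case=Loc]'],
                'pos': 'NNP', 'is_proper': True},
            1: {'root': 'sefer', 'affixes': None,
                'pos': 'NN', 'is_proper': False},
        }
    }
}

# file name -> word index -> analysis index -> field
first = toy['toy.txt'][0][0]
print(first['root'], first['pos'])

# number of competing analyses for word 0
n_analyses = len(toy['toy.txt'][0])
print(n_analyses)
```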

### (a-1) tokenize lines into streams of analyses 
Note that each stream corresponds to a word in the original data. The associated **delimiter** is whitespace: `split()` without arguments splits on any run of spaces, tabs or newlines.

In [6]:
def tokenizer_l2s(data):
    # 'l' for line, 's' for stream
    # traverse data
    for file, content in data.items():
        stream_index = 0
        streams = {}
        for line in content:
            # split() with no argument splits on any run of whitespace
            for stream in line.split():
                streams[stream_index] = stream
                stream_index += 1
                # {file_name:{stream_index:stream,...}}
        # make changes in situ, i.e. update
        data[file] = streams

In [7]:
tokenizer_l2s(data)

In [8]:
data['00001131_pp_ma.txt'][40]

'(Sefer[NNP]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc])+[Proper=True]/(Sefer[NNP]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc])+[Proper=True]'
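
The choice between `split()` and `split(' ')` matters as soon as a line contains a doubled space or a tab. A minimal sketch on a made-up line (the tokens `a[X]`, `b[Y]`, `c[Z]` are placeholders, not analyzer output):

```python
line = 'a[X]  b[Y]\tc[Z]'   # made-up line with a double space and a tab

# no argument: any run of whitespace counts as one delimiter
by_whitespace = line.split()

# explicit ' ': every single space is a delimiter, tabs are not
by_space = line.split(' ')

print(by_whitespace)
print(by_space)
```

With `split(' ')` the double space yields an empty stream and the tab is not split at all, which would shift every later `stream_index`.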

### (a-2) tokenize streams into individual analyses
The associated **delimiter** is a forward slash, `'/'`.

In [9]:
def tokenizer_s2a(data):
    # 's' for stream, 'a' for analysis
    # traverse data
    for file, streams in data.items():
        for stream_index, stream in streams.items():
            analyses = {}
            for index, analysis in enumerate(stream.split('/')):
                analyses[index] = analysis
            # make changes in situ
            data[file][stream_index] = analyses

In [10]:
tokenizer_s2a(data)

In [11]:
data['00001131_pp_ma.txt'][40]

{0: '(Sefer[NNP]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc])+[Proper=True]',
 1: '(Sefer[NNP]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc])+[Proper=True]'}
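
Steps (a-1) and (a-2) can also be tried on a single line without touching `data`. The line below is pieced together from two analyses shown in the output of `In [5]`; the sketch assumes that `'/'` never occurs inside an analysis.

```python
line = ('(nü[JJ])+[Proper=False]/(nü[JJ])+[Proper=True] '
        '(7[CD])+[Proper=False]/(7[CD])+[Proper=True]')

# (a-1): one stream per word, split on whitespace
streams = dict(enumerate(line.split()))

# (a-2): one entry per analysis, split on '/'
analyses = {i: dict(enumerate(s.split('/'))) for i, s in streams.items()}

print(analyses[1][0])
```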

### (b-1) tokenize each analysis into a tuple: (morphology, is_proper)
The associated **delimiter** is a custom string, in this case `'+[Proper='`. For the reason, see the manual of the morphological analyzer used in tutorial part II ([here](https://github.com/google-research/turkish-morphology)).

In [17]:
def tokenizer_a2t(data):
    # 'a' for analysis, 't' for tuple
    for file, streams in data.items():
        for stream_index, stream in streams.items():
            for analysis_index, analysis in stream.items():
                morphology, is_proper = analysis.split('+[Proper=')
                is_proper = is_proper.strip(']')
                if is_proper == 'False':
                    is_proper = False
                elif is_proper == 'True':
                    is_proper = True
                else:
                    raise ValueError('unexpected value')
                tup = (morphology, is_proper)
                # update
                data[file][stream_index][analysis_index] = tup

In [18]:
tokenizer_a2t(data)

In [19]:
data['00001131_pp_ma.txt'][40]

{0: ('(Sefer[NNP]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc])',
  True),
 1: ('(Sefer[NNP]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc])',
  True)}
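
For a single string, the same step can be written a little more defensively. The sketch below uses `str.rpartition`, so only the last occurrence of the delimiter is used, and it reports a missing delimiter explicitly. `split_proper` is a hypothetical helper name, not part of the notebook.

```python
def split_proper(analysis):
    """Split '(...)+[Proper=True]' into ('(...)', True)."""
    delimiter = '+[Proper='
    # rpartition splits at the LAST occurrence of the delimiter
    morphology, sep, flag = analysis.rpartition(delimiter)
    if not sep:
        raise ValueError('missing delimiter: ' + analysis)
    flag = flag.rstrip(']')
    if flag not in ('True', 'False'):
        raise ValueError('unexpected value: ' + flag)
    return morphology, flag == 'True'

result = split_proper(
    '(Sefer[NNP]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc])+[Proper=True]')
print(result)
```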

### (b-2, c) tokenize morphology into roots, affixes and PoS tags
The associated **delimiters** are the plus, minus and opening square bracket characters: `'+'`, `'-'` and `'['` respectively.

In [20]:
def init_values():
    values = {
                'root':None,
                'affixes': None,
                'pos':None,
                'is_proper':None
            }
    return values

def tokenizer_m2c(data):
    # 'm' for morphology, 'c' for category
    for file, streams in data.items():
        for stream_index, stream in streams.items():
            for analysis_index, analysis in stream.items():
                morphology, is_proper = analysis
                # clean morphology
                # this will remove grouping info.
                morphology = ''.join(ch for ch in morphology if ch not in '()')
                # split morphology
                parts = [part for chunk in morphology.split('+') for part in chunk.split('-')]
                root_and_pos = parts[0]
                root = root_and_pos.split('[')[0]
                pos = root_and_pos.split('[')[1].strip(']')
                if len(parts) > 1:
                    affixes = parts[1:]
                else:
                    affixes = None
                # init values
                values = init_values()
                values['root'] = root
                values['affixes'] = affixes
                values['pos'] = pos
                values['is_proper'] = is_proper
                # update
                data[file][stream_index][analysis_index] = values

In [21]:
tokenizer_m2c(data)

In [23]:
pprint(data['00001131_pp_ma.txt'][40])

{0: {'affixes': ['[PersonNumber=A3sg]', 'Hn[Possessive=P2sg]', 'NDA[Case=Loc]'],
     'is_proper': True,
     'pos': 'NNP',
     'root': 'Sefer'},
 1: {'affixes': ['[PersonNumber=A3sg]', 'SH[Possessive=P3sg]', 'NDA[Case=Loc]'],
     'is_proper': True,
     'pos': 'NNP',
     'root': 'Sefer'}}
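
The cell above keeps each affix as one string such as `'Hn[Possessive=P2sg]'`. If a finer split is wanted, one possible sketch separates the part before the first bracket from the bracketed labels. `split_affix` is a hypothetical helper, and reading the two parts as "form" and "labels" is an assumption of this sketch rather than something the notebook states.

```python
import re

def split_affix(affix):
    """'Hn[Possessive=P2sg]' -> ('Hn', ['Possessive=P2sg'])."""
    # everything before the first '[' (may be empty)
    form = affix.split('[', 1)[0]
    # the content of each bracketed group, in order
    labels = re.findall(r'\[([^\]]*)\]', affix)
    return form, labels

print(split_affix('Hn[Possessive=P2sg]'))
print(split_affix('[Polarity=Pos][NOMP]'))
```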


In [24]:
pprint(data)

{'00001131_pp_ma.txt': {0: {0: {'affixes': None,
                                'is_proper': False,
                                'pos': 'JJ',
                                'root': 'nü'},
                            1: {'affixes': None,
                                'is_proper': True,
                                'pos': 'JJ',
                                'root': 'nü'}},
                        1: {0: {'affixes': ['[PersonNumber=A3sg]',
                                            '[Possessive=Pnon]',
                                            'DA[Case=Loc]'],
                                'is_proper': True,
                                'pos': 'NNP',
                                'root': 'Peri'},
                            1: {'affixes': ['[PersonNumber=A3sg]',
                                            '[Possessive=Pnon]',
                                            '[Case=Bare]'],
                                'is_proper': True,
                                

                        61: {0: {'affixes': ['[PersonNumber=A3sg]',
                                             '[Possessive=Pnon]',
                                             '[Case=Bare]'],
                                 'is_proper': False,
                                 'pos': 'NN',
                                 'root': 'ben'},
                             1: {'affixes': ['[PersonNumber=A3sg]',
                                             '[Possessive=Pnon]',
                                             '[Case=Bare]'],
                                 'is_proper': True,
                                 'pos': 'NN',
                                 'root': 'ben'}},
                        62: {0: {'affixes': ['Hn[Derivation=Pass]',
                                             '[Polarity=Pos][NOMP]',
                                             'YAcAk[Derivation=FutNom]',
                                             '[PersonNumber=A3sg]',
                                    

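### (d) extract a set of roots and affixes
The catalog built in `In [25]` below collects every distinct root and affix; its affixes become the affix columns of the export.

### (e-1) export the data in *csv* format
Let the fields be `'file_index'`, `'word_index'`, `'analysis_index'`, `'root'`, `'pos'`, `'is_proper'` and one `'afx:...'` column per affix type in the catalog.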

In [25]:
def catalog(data):
    cat = {'roots':[], 'affixes':[]}
    for file, words in data.items():
        for w_index, word in words.items():
            for a_index, analysis in word.items():
                root = analysis['root']
                if root not in cat['roots']:
                    cat['roots'].append(root)
                affixes = analysis['affixes']
                # check if morph. analysis yielded any affixes
                if affixes:
                    for affix in affixes:
                        if affix not in cat['affixes']:
                            cat['affixes'].append(affix)
    return cat

In [26]:
cat = catalog(data)
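
`catalog` tests membership in a list, which scans the list on every lookup. An alternative sketch that keeps the same first-seen order uses dictionary keys instead (dictionaries preserve insertion order in Python 3.7+). `catalog_fast` and the toy data are hypothetical; the layout assumed is the one produced by `tokenizer_m2c`.

```python
def catalog_fast(data):
    roots, affixes = {}, {}
    for words in data.values():
        for word in words.values():
            for analysis in word.values():
                # dict keys are unique and keep insertion order
                roots[analysis['root']] = None
                for affix in analysis['affixes'] or []:
                    affixes[affix] = None
    return {'roots': list(roots), 'affixes': list(affixes)}

# made-up data in the layout produced by tokenizer_m2c
toy = {'toy.txt': {
    0: {0: {'root': 'Sefer', 'pos': 'NNP', 'is_proper': True,
            'affixes': ['[PersonNumber=A3sg]', 'NDA[Case=Loc]']},
        1: {'root': 'Sefer', 'pos': 'NNP', 'is_proper': True,
            'affixes': ['[PersonNumber=A3sg]']}},
    1: {0: {'root': '7', 'pos': 'CD', 'is_proper': False,
            'affixes': None}},
}}

toy_cat = catalog_fast(toy)
print(toy_cat)
```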

In [27]:
pprint(cat)

{'affixes': ['[PersonNumber=A3sg]',
             '[Possessive=Pnon]',
             'DA[Case=Loc]',
             '[Case=Bare]',
             '[Case=Nom]',
             'Hn[Possessive=P2sg]',
             '[Polarity=Pos]',
             'Hyor[TenseAspectMood=Prog1]',
             'YDH[Copula=PastCop]',
             'n[PersonNumber=V2sg]',
             'NDA[Case=Loc]',
             'NHn[Case=Gen]',
             '[Copula=PresCop]',
             '[PersonNumber=V3pl]',
             'YA[Derivation=Able]',
             'mA[Polarity=Neg][NOMP]',
             'YAcAk[Derivation=FutNom]',
             'SH[Possessive=P3sg]',
             '[Polarity=Pos][NOMP]',
             'mA[Derivation=Nonf]',
             'YA[Case=Dat]',
             '[ComplementType=CBare]',
             '[Case=Bare][NOMP]',
             'lH[Derivation=From]',
             'NA[Case=Dat]',
             'Hl[Derivation=Pass]',
             'mAk[Derivation=Inf]',
             'Ar[Derivation=AorNom]',
             '[Case=Bare][NN]',

In [28]:
fields = ['file_index', 'word_index', 'analysis_index', 'root', 'pos', 'is_proper'] + ['afx:'+affix for affix in cat['affixes']]

In [31]:
print(len(fields))
pprint(fields)

64
['file_index',
 'word_index',
 'analysis_index',
 'root',
 'pos',
 'is_proper',
 'afx:[PersonNumber=A3sg]',
 'afx:[Possessive=Pnon]',
 'afx:DA[Case=Loc]',
 'afx:[Case=Bare]',
 'afx:[Case=Nom]',
 'afx:Hn[Possessive=P2sg]',
 'afx:[Polarity=Pos]',
 'afx:Hyor[TenseAspectMood=Prog1]',
 'afx:YDH[Copula=PastCop]',
 'afx:n[PersonNumber=V2sg]',
 'afx:NDA[Case=Loc]',
 'afx:NHn[Case=Gen]',
 'afx:[Copula=PresCop]',
 'afx:[PersonNumber=V3pl]',
 'afx:YA[Derivation=Able]',
 'afx:mA[Polarity=Neg][NOMP]',
 'afx:YAcAk[Derivation=FutNom]',
 'afx:SH[Possessive=P3sg]',
 'afx:[Polarity=Pos][NOMP]',
 'afx:mA[Derivation=Nonf]',
 'afx:YA[Case=Dat]',
 'afx:[ComplementType=CBare]',
 'afx:[Case=Bare][NOMP]',
 'afx:lH[Derivation=From]',
 'afx:NA[Case=Dat]',
 'afx:Hl[Derivation=Pass]',
 'afx:mAk[Derivation=Inf]',
 'afx:Ar[Derivation=AorNom]',
 'afx:[Case=Bare][NN]',
 'afx:lAr[Derivation=Fam]',
 'afx:YlA[Case=Ins]',
 'afx:mHş[Derivation=PerNom]',
 'afx:CH[Derivation=Agt]',
 'afx:[Case=Bare][VB]',
 'afx:lA[Derivat

In [35]:
def to_csv(data, fields, affix_cat, output_path):
    # note: fields are joined with plain commas, so roots must not contain ','
    with open(output_path+'data.csv', 'w', encoding='utf-8') as csv:
        # write the header once, not once per input file
        csv.write(','.join(fields)+'\n')
        for file_name, content in data.items():
            f_index = file_name.split('_')[0]
            for w_index, word in content.items():
                for a_index, analysis in word.items():
                    affixes = analysis['affixes'] or []
                    row = [f_index, str(w_index), str(a_index),
                           analysis['root'], analysis['pos'],
                           str(int(analysis['is_proper']))]
                    # one column per affix type: 1 if present in this analysis
                    row += ['1' if cat in affixes else '0' for cat in affix_cat]
                    # join avoids a trailing comma, which would add an empty field
                    csv.write(','.join(row)+'\n')

In [36]:
to_csv(data, fields, cat['affixes'], output_path)
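
To check an export, the file can be read back with the standard `csv` module. The sketch below reads from an in-memory string instead of `./out/data.csv` so that it runs on its own; the two rows are made up and carry only two affix columns.

```python
import csv
import io

sample = (
    'file_index,word_index,analysis_index,root,pos,is_proper,'
    'afx:[PersonNumber=A3sg],afx:NDA[Case=Loc]\n'
    '00001131,40,0,Sefer,NNP,1,1,1\n'
    '00001131,40,1,Sefer,NNP,1,1,0\n'
)

rows = list(csv.DictReader(io.StringIO(sample)))

# DictReader stores surplus values under the key None and fills
# missing values with None, so both indicate a malformed row
well_formed = all(None not in row and None not in row.values() for row in rows)

# analyses that carry the locative affix
with_loc = [row['analysis_index'] for row in rows if row['afx:NDA[Case=Loc]'] == '1']
print(well_formed, with_loc)
```

A row that has more or fewer values than the header has fields makes `well_formed` false, which is a quick way to catch stray delimiters.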

### (e-2) export the catalog into two files 
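
The notebook stops at this heading, so what follows is only one possible reading of it: write the roots and the affixes to two separate text files, one item per line. The function name `export_catalog` and the file names `roots.txt` and `affixes.txt` are assumptions, and the sketch writes to a temporary directory rather than to `output_path`.

```python
import os
import tempfile

def export_catalog(cat, output_path):
    # one file per catalog key, one item per line
    for key in ('roots', 'affixes'):
        target = os.path.join(output_path, key + '.txt')
        with open(target, 'w', encoding='utf-8') as out:
            for item in cat[key]:
                out.write(item + '\n')

# made-up catalog in the shape returned by catalog()
toy_cat = {'roots': ['Sefer', 'duvar'],
           'affixes': ['[Case=Bare]', 'NDA[Case=Loc]']}

out_dir = tempfile.mkdtemp()
export_catalog(toy_cat, out_dir)

with open(os.path.join(out_dir, 'affixes.txt'), encoding='utf-8') as f:
    written = f.read().splitlines()
print(written)
```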