# Data exploration and transformation

Here, we will:
- read in the UniMorph dataset for Georgian.
- inspect the structure of the "tag" column.
- choose lemma and tag as our features and form as our target value.
- read in the KartuVerbs dataset.
- drop opaque columns, such as id and sub_id.
- transform the dataset to resemble the UniMorph style, except that the verbal noun is treated as the lemma form, as in the KartuVerbs project. This leaves the dataset more streamlined.
- split each dataset into feature and target values and save the cleaned data as two separate CSV files.

## UniMorph data

In [74]:
# import libraries 
import pandas as pd
import numpy as np

# Read in UniMorph
um = pd.read_csv(r"C:\Users\Home\Desktop\Python Scripts\kat-master\kat", sep='\t', header=None, names=['lemma', 'form', 'tag'])

In [75]:
# data exploration
print(um.head())

     lemma       form                tag
0  შეუძლია   შემიძლია  V;ARGNO1S;IND;PRS
1  შეუძლია   შეგიძლია  V;ARGNO2S;IND;PRS
2  შეუძლია    შეუძლია  V;ARGNO3S;IND;PRS
3  შეუძლია  შეგვიძლია  V;ARGNO1P;IND;PRS
4  შეუძლია  შეგიძლიათ  V;ARGNO2P;IND;PRS


In [76]:
# check for duplicates
print("num of duplicates:" + str(um.duplicated().sum()))
print("num of unique words:" + str(um.form.nunique()))
print("num of unique lemmas:" + str(um.lemma.nunique()))
print("num of unique tags:" + str(um.tag.nunique()))
print("num of unique word-lemma pairs:" + str(um[['form', 'lemma']].drop_duplicates().shape[0]))
print("num of unique word-tag pairs:" + str(um[['form', 'tag']].drop_duplicates().shape[0]))

num of duplicates:0
num of unique words:75998
num of unique lemmas:3852
num of unique tags:443
num of unique word-lemma pairs:76204
num of unique word-tag pairs:91951
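
With 443 distinct tag bundles, it helps to look inside the "tag" column. The sketch below splits a few bundles on `;` and counts the individual features. The `sample` frame is a stand-in built from rows shown in this notebook; on the real data the same calls would run on `um`.

```python
import pandas as pd

# Stand-in for `um`, using tag values that appear in the outputs of this notebook.
sample = pd.DataFrame({
    "tag": [
        "V;ARGNO1S;IND;PRS",
        "V;ARGNO2S;IND;PRS",
        "V;ARGNO1P;ARGAC3P;SBJV;PRF",
        "V;V.MSDR;PRF",
    ]
})

# One row per individual feature, then a frequency table over all features.
features = sample["tag"].str.split(";").explode()
feature_counts = features.value_counts()

# Number of features in each tag bundle (separators + 1).
bundle_sizes = sample["tag"].str.count(";") + 1

print(feature_counts)
print(bundle_sizes.tolist())
```

The bundle sizes differ between finite forms and verbal nouns (`V.MSDR`), so the tag is a variable-length feature list rather than a fixed set of slots.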


In [77]:
# Filter for verbs
um = um[um.tag.str.contains(r'\bV;', na=False)]
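
To see what the pattern `\bV;` keeps and drops, the sketch below applies it to a handful of tags. The non-verb tags are made-up illustrations, not values taken from the dataset. It also compares the result with a stricter `startswith` test, which is an alternative and not what the notebook uses.

```python
import pandas as pd

# Toy tags: the first and last appear in this notebook, the others are illustrative only.
tags = pd.Series([
    "V;ARGNO1S;IND;PRS",   # finite verb
    "N;NOM;SG",            # hypothetical noun tag
    "V.PTCP;PST",          # hypothetical participle tag: "V" is followed by ".", not ";"
    None,                  # missing tag
    "V;V.MSDR;PRF",        # verbal noun listed under V
])

# Same filter as in the cell above: a word boundary, then "V;".
mask_contains = tags.str.contains(r"\bV;", na=False)

# Stricter alternative: the bundle must begin with "V;".
mask_startswith = tags.str.startswith("V;", na=False)

print(mask_contains.tolist())
print(mask_startswith.tolist())
```

On these toy tags both masks agree. They would differ only for a bundle where `V;` occurs somewhere other than the start.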

In [78]:
# explore data again; a notebook cell only displays its last expression, so only the tail is shown
um.tail()


      lemma        form                         tag
21049  წერს  დაგვეწეროს  V;ARGNO1P;ARGAC3P;SBJV;PRF
21050  წერს   დაგეწეროთ  V;ARGNO2P;ARGAC3P;SBJV;PRF
21051  წერს    დაეწეროთ  V;ARGNO3P;ARGAC3P;SBJV;PRF
21052  წერს      დაწერა                V;V.MSDR;PRF
21053  წერს        წერა               V;V.MSDR;IPFV


In [86]:
# split into features and target 
um_features = um.drop(columns=['form'])
um_target = um['form']

In [87]:
# save the cleaned data
um_features.to_csv(r"C:\Users\Home\Desktop\Python Scripts\kat-master\um_features.csv", index=False)
um_target.to_csv(r"C:\Users\Home\Desktop\Python Scripts\kat-master\um_target.csv", index=False)

## KartuVerbs data

In [79]:
# read in kartuverbs
kv = pd.read_csv(r"C:\Users\Home\Desktop\Python Scripts\KartuVerbs-main\data_vn+", sep=';')

In [80]:
# explore kv
print(kv.head())
print(kv.tail())

         form tense_in_paradigm  person number preverb pre2    root sf2  \
0   ვაბეზარობ           present       1     sg       -    ვ  აბეზარ  ობ   
1   ჰაბეზარობ           present       2     sg       -    ჰ  აბეზარ  ობ   
2    აბეზარობ           present       2     sg       -    -  აბეზარ  ობ   
3  ჰაბეზარობს           present       3     sg       -    ჰ  აბეზარ  ობ   
4   აბეზარობს           present       3     sg       -    -  აბეზარ  ობ   

  caus_sf ending tsch_class morph_type sub_id  id          vn  
0       -      -         MV     active    1-1   1  *აბეზარობა  
1       -      -         MV     active    1-1   1  *აბეზარობა  
2       -      -         MV     active    1-1   1  *აბეზარობა  
3       -      ს         MV     active    1-1   1  *აბეზარობა  
4       -      ს         MV     active    1-1   1  *აბეზარობა  
                  form tense_in_paradigm  person number preverb pre2    root  \
99994    დაუბზრიალებდე       conj-future       2     sg      და    უ  ბზრიალ   
99995

### Drop opaque features to bring the dataframe closer to the UniMorph format

The dropped columns are:

- id
- sub_id
- preverb
- pre2
- root
- sf2
- caus_sf
- ending
Reasons:
- id and sub_id are 1:1 identifiers for the verb and carry no morphological information.
- Most importantly, the model would not learn any morphological patterns if it learned from such identifiers.
- We also want the model to learn the remaining morphological information (such as affixes and apophony) on its own, so the segmentation columns are removed.
- Unlike in the UniMorph data, vn (the verbal noun) is taken as the lemma form here.

In [81]:
# strip asterisk from vn column
kv['vn'] = kv['vn'].str.replace(r'\*', '', regex=True)
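
The regex above removes every asterisk in the column. A minimal sketch of three equivalent ways to do this, assuming the asterisk only ever appears as a leading marker as in the output shown earlier (the second toy value, without an asterisk, is an illustration):

```python
import pandas as pd

# Toy column: one starred value as in the KartuVerbs output, one plain value.
vn = pd.Series(["*აბეზარობა", "აბეზარობა"])

# Regex version, as used in the notebook: "\*" matches a literal asterisk.
stripped_regex = vn.str.replace(r"\*", "", regex=True)

# Literal version: no escaping needed when regex=False.
stripped_literal = vn.str.replace("*", "", regex=False)

# Prefix-only version: removes leading asterisks and leaves any others in place.
stripped_prefix = vn.str.lstrip("*")

print(stripped_regex.tolist())
```

If an asterisk could also occur inside a value, only `lstrip` would preserve it, so the choice depends on what the marker means in the source data.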

In [82]:
# drop opaque features
kv = kv.drop(columns=['id', 'sub_id', 'preverb', 'pre2', 'root', 'sf2', 'caus_sf', 'ending'])

In [83]:
kv.head()

         form tense_in_paradigm  person number tsch_class morph_type         vn
0   ვაბეზარობ           present       1     sg         MV     active  აბეზარობა
1   ჰაბეზარობ           present       2     sg         MV     active  აბეზარობა
2    აბეზარობ           present       2     sg         MV     active  აბეზარობა
3  ჰაბეზარობს           present       3     sg         MV     active  აბეზარობა
4   აბეზარობს           present       3     sg         MV     active  აბეზარობა
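
The notebook keeps the grammatical information in separate columns. If a single UniMorph-like `tag` column were wanted instead, the columns could be joined as sketched below. The helper `to_unimorph_like` and the tag format `V;TENSE;PERSON;NUMBER` are hypothetical; they are not the official UniMorph labels for Georgian, which would need a proper mapping.

```python
import pandas as pd

# Toy rows taken from the kv.head() output above.
kv_sample = pd.DataFrame({
    "form": ["ვაბეზარობ", "ჰაბეზარობ", "აბეზარობს"],
    "tense_in_paradigm": ["present", "present", "present"],
    "person": [1, 2, 3],
    "number": ["sg", "sg", "sg"],
    "vn": ["აბეზარობა", "აბეზარობა", "აბეზარობა"],
})

def to_unimorph_like(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: build lemma/form/tag columns from KartuVerbs-style columns."""
    tag = (
        "V;"
        + df["tense_in_paradigm"].str.upper()
        + ";"
        + df["person"].astype(str)
        + ";"
        + df["number"].str.upper()
    )
    # The verbal noun (vn) plays the role of the lemma.
    return pd.DataFrame({"lemma": df["vn"], "form": df["form"], "tag": tag})

um_like = to_unimorph_like(kv_sample)
print(um_like)
```

Keeping the columns separate, as the notebook does, avoids committing to a tag vocabulary at this stage.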


In [84]:
# split into features and target
kv_features = kv.drop(columns=['form'])
kv_target = kv['form']

In [85]:
# save the cleaned data to two csv files
kv_features.to_csv(r"C:\Users\Home\Desktop\Python Scripts\KartuVerbs-main\data_vn+_features.csv", index=False)
kv_target.to_csv(r"C:\Users\Home\Desktop\Python Scripts\KartuVerbs-main\data_vn+_target.csv", index=False)
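
Because both files are written with `index=False`, row order is the only link between a feature row and its target. A minimal round-trip check, using a temporary directory and toy rows instead of the real paths, could look like this (the file names are placeholders):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for a features frame and a target series (rows from the UniMorph tail output).
features = pd.DataFrame({
    "lemma": ["წერს", "წერს"],
    "tag": ["V;V.MSDR;PRF", "V;V.MSDR;IPFV"],
})
target = pd.Series(["დაწერა", "წერა"], name="form")

with tempfile.TemporaryDirectory() as tmp:
    f_path = Path(tmp) / "features.csv"   # placeholder file names
    t_path = Path(tmp) / "target.csv"

    # Write with an explicit encoding so the Georgian script survives on any platform.
    features.to_csv(f_path, index=False, encoding="utf-8")
    target.to_csv(t_path, index=False, encoding="utf-8")

    features_back = pd.read_csv(f_path, encoding="utf-8")
    target_back = pd.read_csv(t_path, encoding="utf-8")["form"]

# The two files must have the same number of rows to stay aligned.
rows_match = len(features_back) == len(target_back)
print(rows_match)
```

The same length check can be run on the real files after saving, before they are passed to a model.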