# Conditional Random Fields for Surface Segmentation Demo 18/08/2020
## Tumi Moeng

## Data Cleaning
First, the data we received needs to be cleaned, and the surface segmentation
 forms need to be generated for use by the models. To this end I created a
 class called 'DataCleaner', which keeps the fields we need from the received
 data, discards the ones we don't need, and generates the additional data the
 models require.<br>
 This class takes the data in the following form:<br>
 ngezinkonzo&emsp;khonzo&emsp;P&emsp;[RelConc]-nga[NPre]-i[NPrePre]-zin[BPre]-konzo[NStem]<br>
 And converts it to the following form:<br>
 ngezinkonzo | nge-zin-konzo | nga[NPre]i[NPrePre]zin[BPre]konzo[NStem]
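
The third column of the cleaned form can be illustrated with a minimal sketch. This is not the actual 'DataCleaner' code: the function names `flatten_analysis` and `parse_raw_line` are hypothetical, and it assumes the raw fields are whitespace-separated and that hyphens only occur between morphemes. It drops tags that carry no morpheme text (such as `[RelConc]`) and removes the hyphens. Generating the surface segmentation in the second column (for example `nga` + `i` becoming `nge`) is not attempted here.

```python
def flatten_analysis(analysis: str) -> str:
    """Drop tag-only units and join the remaining morpheme[Tag] units."""
    units = analysis.split("-")
    # A unit that starts with "[" has a tag but no morpheme text, so skip it.
    kept = [unit for unit in units if not unit.startswith("[")]
    return "".join(kept)


def parse_raw_line(line: str) -> tuple[str, str]:
    """Return (word, flattened analysis) from one raw line (assumed layout)."""
    fields = line.split()
    word, analysis = fields[0], fields[-1]
    return word, flatten_analysis(analysis)


raw = "ngezinkonzo\tkhonzo\tP\t[RelConc]-nga[NPre]-i[NPrePre]-zin[BPre]-konzo[NStem]"
word, labelled = parse_raw_line(raw)
print(word + " | " + labelled)
```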

In [None]:
from morphology.DataCleaner import DataCleaner
languages = ["zulu", "swati", "ndebele", "xhosa"]

for lang in languages:
    print("Language: " + lang)
    inputFile = DataCleaner(lang + ".train.conll")
    inputFile.reformat(lang + ".clean.train")
    inputFile = DataCleaner(lang + ".test.conll")
    inputFile.reformat(lang + ".clean.test")
    print(lang + " cleaning complete.\n#############################################")

## Preparation for Hyper-parameter Optimisation
Secondly, hyper-parameter optimisation is needed to make our models as good as
 possible, and that requires a validation set to use in the optimisation process.
 To this end, I created a class called 'ValidationSet' (in the 'ValidationSetCreation'
 module) that extracts about 10% of the entries from the training set and puts them
 into a validation set. This 10% is roughly the same number of entries as the test
 set contains.
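
The idea of holding out roughly 10% of the training entries can be sketched as follows. This is an illustration only, not the 'ValidationSet' implementation: the function name `split_validation`, the random sampling, and the fixed seed are all assumptions, since the actual selection strategy is not described here.

```python
import random


def split_validation(entries, fraction=0.1, seed=0):
    """Split entries into (train, dev), holding out about `fraction` of them."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n_dev = round(len(entries) * fraction)
    dev_indices = set(rng.sample(range(len(entries)), n_dev))
    # Preserve the original order within each split.
    train = [e for i, e in enumerate(entries) if i not in dev_indices]
    dev = [e for i, e in enumerate(entries) if i in dev_indices]
    return train, dev


entries = ["entry_%d" % i for i in range(100)]
train, dev = split_validation(entries)
print(len(train), len(dev))
```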

In [None]:
from morphology.ValidationSetCreation import ValidationSet

for lang in languages:
    print("Language: " + lang)
    file_name = lang + ".clean.train.conll"
    # print(file_name)
    inputFile = ValidationSet(file_name)
    file_name = lang + ".clean.dev.conll"
    # print(file_name)
    inputFile.create_validation_set(file_name)
    print(lang + " validation set complete.\n#############################################")


## Conditional Random Field using SKLearn_CRFSuite
Finally, we can now train the CRF on the data we have prepared. We pass in the
name of the language it is to be trained on as a string, and it returns two
outputs. The first is a list of lists of the labels the CRF predicted for the
words in the test set. The second is a list of lists of the actual labels for the
 words in the test set. Each inner list of labels has the following form:<br>
 For the word 'komthombo', which segments to 'ko-m-thombo', the inner list would be
 [B,E,S,B,M,M,M,M,E] where:<br>
 B = Beginning of Segment<br>
 M = Middle of Segment<br>
 E = End of Segment<br>
 S = Single Length Segment<br>
 The results method, which takes the list of predicted answers and the list of
 actual answers, prints the results of the CRF to the console, such as the
 <b>precision</b>, <b>recall</b> and <b>F1-Score</b>.
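
The labelling scheme above can be made concrete with a short sketch. The helper names `segmentation_to_bmes` and `bmes_to_segmentation` are hypothetical and are not part of 'BaselineCRF'. They only show how a hyphen-separated segmentation maps to B/M/E/S labels and back.

```python
def segmentation_to_bmes(segmented: str) -> list[str]:
    """Convert a hyphen-separated segmentation into one label per character."""
    labels = []
    for segment in segmented.split("-"):
        if len(segment) == 1:
            labels.append("S")  # single-character segment
        elif len(segment) > 1:
            labels.extend(["B"] + ["M"] * (len(segment) - 2) + ["E"])
    return labels


def bmes_to_segmentation(word: str, labels: list[str]) -> str:
    """Rebuild the hyphen-separated segmentation from a word and its labels."""
    segments, current = [], ""
    for char, label in zip(word, labels):
        current += char
        if label in ("E", "S"):  # these labels close a segment
            segments.append(current)
            current = ""
    if current:  # keep any unterminated trailing segment
        segments.append(current)
    return "-".join(segments)


labels = segmentation_to_bmes("ko-m-thombo")
print(labels)
print(bmes_to_segmentation("komthombo", labels))
```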

In [None]:
from BaselineCRF import BaselineCRF
crf = BaselineCRF("zulu")
predicted_ans, actual_ans = crf.surface_segmentation()
# Each element is a list of labels, so convert it to a string before concatenating
print("First Value of Predicted Answer List: " + str(predicted_ans[0]))
print("First Value of Actual Answer List: " + str(actual_ans[0]))
crf.results(predicted_ans, actual_ans)