# PARSER: A model for word segmentation
## A rough guide

### 1. Start with all of the important imports that we will be using in this notebook
For some reason sys.path.append does not want to import the PARSER module, so we have to import it like this. We also define two folders that hold some datasets. In the cell after that we create a variable called parameters, so that we can configure the model, for example to turn off logging. You can turn this on or off as you please by modifying the parameters.


In [3]:
import sys
import os
root_folder = os.getcwd()
PARSER_folder = os.path.join(root_folder,"PARSER")
sys.path.append(PARSER_folder)

import PARSER
import PARSER_Tester
import data_handeling

from PARSER_class import PARSER

data_folder_gabriel = os.path.join(root_folder,"Data","Gabriel Original Datasets")
data_folder_parser = os.path.join(root_folder,"Data","PARSER Pre-saved")

In [None]:
# For simplicity and consistency in this tutorial, we use a fixed seed for the random-number generator in the PARSER model.
# If you wish, you can change the parameters at this stage, e.g. to enable logging, use larger percept sizes, and so forth.
parameters = PARSER.get_default_parameters()
parameters['random_seed'] = 10

### 2. What will we run the model on? Corpus creation
The PARSER model is built to run on any type of string input. You simply create a string and feed it to the model.

This example shows how to create a simple string and feed it to the model. 

In [None]:
input_string_simple = "ababababababababbbabbabbabbabbabba"
PARSER.run(input_string_simple)

The next examples show how to create an input string from one of Gabriel's datasets or from the datasets that come pre-made with the original PARSER model (the PARSER.exe programme).
For simplicity we first create two dictionaries containing the names of all the papers in each dataset collection.

In [None]:
paper_titles_gabriel = {i:title for i,title in enumerate(data_handeling.get_paper_names(data_folder_gabriel))}
print(paper_titles_gabriel)
paper_titles_parser = {i:title for i,title in enumerate(data_handeling.get_paper_names(data_folder_parser))}
print(paper_titles_parser)

Now we have multiple datasets to choose from! The next examples show how to extract the data from one of them. This is rather simple, and we only need the data_handeling module to do it. Here we work with the second dataset in the list of Gabriel's papers (index 1). In all of these examples we load the information directly from the datasets rather than reusing the variables that already hold the data.
1. First, we get all the data from a dataset.
2. Then we display the categories of the dataset.
3. Finally, we get the strings from a certain category in the dataset.

In [None]:
paper_title = paper_titles_gabriel.get(1)
paper_data = data_handeling.get_paper_data(data_folder_gabriel,paper_title)
#paper_data

In [None]:
paper_categories = data_handeling.get_paper_categories(data_folder_gabriel, paper_title)
#paper_categories

In [None]:
# Using the code above, we know that the categories in that dataset are [Exposure, Test, Consistent, Violating].
# We want to use the exposure strings for training a model, so we extract only these.
paper_categories = data_handeling.get_paper_categories(data_folder_gabriel, paper_title)
paper_categories

In [None]:
paper_category_strings = data_handeling.get_paper_strings(data_folder_gabriel, paper_title, ['Exposure'])
#paper_category_strings

Before we can train the model, we need to generate input for it. This can also be done with the data_handeling module. The function "generate_input_data()" takes two parameters: the strings of the category you want to train on (in the form of a dictionary: {'category': [strings]}) and the number of items.

In [None]:
number_of_items = 200
input_string_example = data_handeling.generate_input_data(paper_category_strings, number_of_items)
input_string_example
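
To make the idea concrete, here is a minimal sketch of what such a generator might do. It is not the actual `data_handeling.generate_input_data()` implementation: it assumes that the function draws `number_of_items` strings at random (with replacement) from the supplied categories and concatenates them. The function name `sketch_generate_input_data`, its `seed` argument and the example strings are hypothetical.

```python
import random


def sketch_generate_input_data(category_strings, number_of_items, seed=None):
    """Concatenate randomly drawn strings into one training string.

    category_strings: dict of the form {'category': [strings]}.
    number_of_items: how many strings to draw (with replacement).
    seed: optional seed so the same input can be regenerated (assumption).
    """
    rng = random.Random(seed)
    # Pool the strings of every supplied category into one list.
    pool = [s for strings in category_strings.values() for s in strings]
    if not pool:
        raise ValueError("no strings to sample from")
    return "".join(rng.choice(pool) for _ in range(number_of_items))


# Made-up exposure strings, only for illustration.
example_strings = {"Exposure": ["acgf", "adcf", "acf", "cg"]}
sketch_a = sketch_generate_input_data(example_strings, 5, seed=10)
sketch_b = sketch_generate_input_data(example_strings, 5, seed=10)
print(sketch_a == sketch_b)  # same seed, same string
```

Under these assumptions, fixing the seed is one way to get a reproducible input string without pasting a long constant into the notebook.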

In this notebook, however, I want things to be consistent so that one can follow along all the way. Just as we fixed the random seed, we want the input to be fixed as well. Therefore I add a constant input string here.

In [None]:
input_string = 'acgfcgacgfacfcgacgfacgfcgadcfacfcgadcfcacgfcgacfcadcfadcfcacgfadcfcacgfcgadcfadcgfcacfcadcfcadcfadcfcadcfadcfcgacgfcgacfcgacfcacgfcgadcfadcfcgadcfadcgfcadcfcacfcgadcgfcacgfcgadcgfcadcfcgacgfcgadcgfcacgfacgfcgadcfcgacfcadcfcgacgfcgadcfcgadcfcadcgfcadcfadcfcgadcgfcadcfcgacfcadcfcadcgfcacfcgacfcadcgfcacfcacgfacfcgadcfcgacfcgadcfcadcfcgacgfadcfcadcgfcacgfacfcgadcfcacgfadcfcacfcadcfadcgfcacfcadcfadcfcgadcfcgacfcgacgfacgfcgacgfcgadcgfcacfcacgfcgadcfadcfacgfacfcacfcacfcgadcgfcadcfacgfadcfcacgfadcfadcfadcfacgfcgacfcadcfcacgfcgadcgfcacgfcgacfcacgfacgfcgacfcacgfcgacgfacfcacgfacgfcgadcgfcacfcgadcfcadcfcgacgfcgacgfcgacgfacfcgadcfcgacgfacgfcgadcfadcfcadcfacfcacgfacgfcgadcfcacgfcgadcgfcacgfcgacgfcgadcfacfcgadcfadcfacgfacfcadcfadcgfcacfcgacfcadcgfcadcfadcgfcadcfcacfcadcfcadcfcacfcacfcgadcfcacgfcgadcfcadcfadcfcadcfcgacgfacfcadcfacfcgadcfcacgfadcfacgfadcfcacgfadcfcgacfcacfcgadcfcacgfadcgfcadcfcacfcgadcfacgfadcfacgfcgadcfcacgfadcfadcfcadcfacgfcgacgfadcgfcadcfcgadcfcgadcfcgacfcacfcgadcfadcf'

### 3. Training
Now that we have an input string built from the exposure strings of a dataset, we are ready to feed it to the model and see if it can learn the chunks/grammar. The return value of the PARSER_Trainer (called through PARSER.train in the cell that follows) is a triplet that unpacks into the percept_shaper, the primitives and the parameters. One can avoid the triple return by running the model directly with the PARSER.run function. The PARSER_Trainer is mostly useful if one wants to save results and do multiple runs. If you do not want a new random seed for each created model, it is important that you create one parameters variable and use it consistently: the first time you run the model, the random seed saves itself as an entry in the parameters, so that seed is reused in later runs with the same parameters. Alternatively, you can set a random seed manually.

In [None]:
trained_model, primitives, _ = PARSER.train(input_string, parameters=parameters)
trained_model

#trained_model = PARSER.run(input_string)
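
The seed behaviour described in this section can be sketched as follows. This illustrates the pattern only and is not the PARSER code itself: the helper `sketch_get_rng` is hypothetical, and only the `'random_seed'` key is taken from this notebook.

```python
import random


def sketch_get_rng(parameters):
    """Return a seeded random generator, storing the seed in `parameters`.

    If 'random_seed' is missing, a fresh seed is drawn and saved in the
    dictionary, so later runs with the same dictionary reuse it.
    """
    if parameters.get("random_seed") is None:
        parameters["random_seed"] = random.randrange(2**32)
    return random.Random(parameters["random_seed"])


shared_parameters = {}  # no seed set yet
first_run = sketch_get_rng(shared_parameters).random()
second_run = sketch_get_rng(shared_parameters).random()
print(first_run == second_run)  # True: the saved seed is reused

fixed_parameters = {"random_seed": 10}  # seed set manually
manual_a = sketch_get_rng(fixed_parameters).random()
manual_b = sketch_get_rng({"random_seed": 10}).random()
print(manual_a == manual_b)  # True: equal seeds give equal draws
```

With this pattern, passing a fresh empty dictionary on every run would produce a new seed each time, while reusing one dictionary keeps the runs comparable.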

### 4. Testing
Now that we have a trained model, we can test how well it has learned the grammar we tried to teach it. We can ask the model whether a single string is consistent with the grammar it has learned, or we can ask whether a set of strings is consistent. So let's first get the testing strings from the data, then test the model on a single string from this set, and then on the entire set.

Normally, a trained model would have a function to test itself on input, something along the lines of "trained_model.predict(input)" that returns a bool or similar. This is not the case here. The reason is that a trained model here is no more than a list of tuples, each consisting of a chunk and its weight, which does not deserve its own object. For simplicity I therefore made it a dictionary. The testing function is kept separate for multiple reasons:
1. The analysis of the model can be done in multiple ways. My way is just one of many, so having a separate test class makes sense.
2. The goal is to have many different models, not just PARSER, output similar results. All of these models would then need the same testing function inside them, which is unnecessary if the function lives outside the model itself.
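
As an illustration of that data structure, here is a minimal sketch of a chunk-to-weight dictionary and one way to inspect it from outside the model. The chunks, the weights, the threshold value and the helper `chunks_above_threshold` are all made up for this example; the real trained_model may contain different entries and may be analysed differently by PARSER_Tester.

```python
# Hypothetical trained model: each key is a chunk, each value is its weight.
example_model = {"ac": 3.2, "gf": 1.7, "cg": 0.4, "adcf": 5.1}


def chunks_above_threshold(model, threshold=1.0):
    """Return (chunk, weight) pairs with weight >= threshold, heaviest first."""
    kept = [(chunk, weight) for chunk, weight in model.items() if weight >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


strong_chunks = chunks_above_threshold(example_model, threshold=1.0)
print(strong_chunks)
```

Because the model is plain data, a helper like this can live in a separate testing module and be reused for any other model that outputs the same kind of dictionary.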

In [None]:
# Single input: 
testing_data = data_handeling.get_paper_strings(data_folder_gabriel, paper_title, ['Test', 'Consistent','Violating'])
single__test_input = testing_data.get('Consistent')[1]
PARSER_Tester.give_single_input_to_model(trained_model, primitives, single__test_input, True)

In [None]:
# Multiple inputs: 
multiple_test_input = testing_data.get('Test')
PARSER_Tester.test_all_string_in_list(trained_model,primitives,multiple_test_input,True)

In [None]:
# Now let's test the strings that in theory are consistent with the grammar. (Answer should be 4)
consistent_test_input = testing_data.get('Consistent')
PARSER_Tester.test_all_string_in_list(trained_model,primitives,consistent_test_input,True)

In [None]:
# And then let's test the strings that are supposedly in violation of the grammar. (Answer should be 1)
violation_test_input = testing_data.get('Violating')
PARSER_Tester.test_all_string_in_list(trained_model,primitives,violation_test_input,True)

### 5. Analytics
Here we explain and explore analytical methods for analysing and comparing models. For simplicity, we use standard machine-learning measurements to calculate the various scores of our model. A deeper explanation of the testing procedure (how we count a string as consistent or violating) can be read __elsewhere__.

Here is an explanation of the various measurements:
- True positive   = A consistent string classified as consistent
- True negative   = A violating string classified as violating
- False positive  = A violating string classified as consistent
- False negative  = A consistent string classified as violating
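
The four outcomes listed here can be counted and turned into scores as in the sketch below. The notebook does not say which scores are used, so accuracy, precision, recall and F1 are shown as the usual textbook choices. The function names and the example labels are hypothetical; `True` stands for "consistent" and `False` for "violating".

```python
def count_outcomes(expected, predicted):
    """Count TP, TN, FP and FN, where True means 'consistent'."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, guess in zip(expected, predicted):
        if truth and guess:
            counts["tp"] += 1  # consistent classified as consistent
        elif not truth and not guess:
            counts["tn"] += 1  # violating classified as violating
        elif not truth and guess:
            counts["fp"] += 1  # violating classified as consistent
        else:
            counts["fn"] += 1  # consistent classified as violating
    return counts


def scores_from_counts(tp, tn, fp, fn):
    """Compute standard classification scores from the four counts."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Made-up labels, only for illustration.
expected = [True, True, True, False, False]
predicted = [True, True, False, False, True]
counts = count_outcomes(expected, predicted)
scores = scores_from_counts(**counts)
print(counts)
print(scores)
```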


TODO:
Make a method that gets all of the possible A->B combinations of a training set and the consistent set.
- possible = the number of unique A->B's in the training + testing set (the number of possible learning nodes)
- learnt = the number of unique A->B's in the model (the number of learnt nodes)

learnt/possible = the proportion of learnt nodes.
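
A minimal sketch of this calculation is given below. It rests on an assumption that the notes do not confirm: an "A->B" is taken to be a pair of adjacent symbols within a string, and the A->B's of the model are taken to be the adjacent pairs inside its chunks. The function names and the example strings are hypothetical.

```python
def transitions(strings):
    """Return the set of adjacent (A, B) symbol pairs found in the strings."""
    return {(s[i], s[i + 1]) for s in strings for i in range(len(s) - 1)}


def learnt_fraction(training_strings, consistent_strings, model_chunks):
    """Fraction of possible A->B transitions that also occur in the model."""
    possible = transitions(training_strings) | transitions(consistent_strings)
    # Only count model transitions that are possible in the data.
    learnt = transitions(model_chunks) & possible
    return len(learnt) / len(possible) if possible else 0.0


# Made-up data, only for illustration.
training = ["acgf", "adcf"]
consistent = ["acf"]
chunks = ["ac", "cgf", "dc"]
fraction = learnt_fraction(training, consistent, chunks)
print(round(fraction, 3))
```

In this made-up example there are six possible transitions and the chunks cover four of them, so the learnt fraction is 4/6.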

Ok, this works now. Trying to plot it though, and it looks kind of strange. But it works. It is the method main3 in the temp_main script. Of course, re-organize it, and then put it into parser_trainer, parser_tester or something like that.


Okay, plotly is fantastic. But the problem becomes the random-number generator: how do we keep this consistent over multiple runs if we restart the model every time?