# Tokenization process
Given the dictionary returned by DatasetCreator, build a nested dictionary in which every lemma (token) of each expression is replaced by a number.

First, import the libraries, create a DatasetCreator object, and generate a random function.


In [2]:
%load_ext autoreload
%autoreload 

from eq_learner.DatasetCreator import DatasetCreator
from sympy import sin, Symbol, log, exp 
import numpy as np

x = Symbol('x')
basis_functions = [x,sin,log,exp]
fun_generator = DatasetCreator(basis_functions,max_linear_terms=1, max_binomial_terms=1,max_compositions=1,max_N_terms=1,division_on=False)
string, dictionary =  fun_generator.generate_fun()
print("\n\n\n String format:", string)
print("\n\nDictionary format (consistent in the order): \n", dictionary)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


NameError: name 'string' is not defined

Now, let's tokenize.

In [2]:
from eq_learner.processing import tokenization

Tokenization is organized as a three-step process: first we segment each element of the mathematical expression, then we map each unique element to a number, and finally we flatten the result into a single list of labels.
<ul>
<li>extract_terms does the first task.</li>
<li>numberize_terms does the second task.</li>
<li>flatten_seq does the third task.</li>
</ul>
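
The three steps above can be sketched in plain Python. This is only an illustration of the idea, not the eq_learner implementation: the vocabulary, the start/end labels, and the helper names `segment`, `numberize`, and `flatten` are all hypothetical.

```python
import re

# Hypothetical vocabulary, for illustration only; eq_learner defines its own mapping.
VOCAB = {"x": 1, "sin": 2, "exp": 3, "log": 4, "(": 5, ")": 6,
         "**": 7, "*": 8, "+": 9, "1": 14, "2": 15}
START, END = 12, 13  # hypothetical start/end-of-sequence labels

# '**' is listed before '*' so the power operator stays a single token.
TOKEN_RE = re.compile(r"\*\*|[A-Za-z_]+|\d+|[()+*]")


def segment(expr):
    """Step 1: split an expression string into its elements."""
    return TOKEN_RE.findall(expr.replace(" ", ""))


def numberize(tokens, vocab):
    """Step 2: map each element to its integer label (KeyError if unknown)."""
    return [vocab[token] for token in tokens]


def flatten(groups, plus_label):
    """Step 3: join the per-term sequences with '+' and add the delimiters."""
    flat = [START]
    for index, sequence in enumerate(groups):
        if index > 0:
            flat.append(plus_label)
        flat.extend(sequence)
    flat.append(END)
    return flat


terms = ["log(x)", "exp(x)*log(x)"]
segmented = [segment(term) for term in terms]
numbered = [numberize(tokens, VOCAB) for tokens in segmented]
final_sketch = flatten(numbered, VOCAB["+"])
print(segmented[0])
print(final_sketch)
```

Keeping the three steps as separate functions mirrors the split described above and makes each intermediate result easy to inspect.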

In [3]:
tmp = tokenization.extract_terms(dictionary)
print(tmp)

{'Single': [['utf-8', 'log', '(', 'x', ')', '', '']], 'binomial': [['utf-8', 'exp', '(', 'x', ')', '*', 'log', '(', 'x', ')', '', '']], 'N_terms': [['utf-8', 'exp', '(', '2', '*', 'x', ')', '*', 'sin', '(', 'x', ')', '**', '2', '', '']], 'compositions': [['utf-8', 'log', '(', 'x', '**', '2', '+', 'sin', '(', 'x', ')', '+', '1', ')', '', '']], 'division': []}


In [4]:
tmp = tokenization.numberize_terms(tmp)
print(tmp)

{'Single': [[4, 5, 1, 6]], 'binomial': [[3, 5, 1, 6, 8, 4, 5, 1, 6]], 'N_terms': [[3, 5, 15, 8, 1, 6, 8, 2, 5, 1, 6, 7, 15]], 'compositions': [[4, 5, 1, 7, 15, 9, 2, 5, 1, 6, 9, 14, 6]], 'division': []}


In [5]:
final = tokenization.flatten_seq(tmp)
print(final)

[12, 4, 5, 1, 6, 9, 3, 5, 1, 6, 8, 4, 5, 1, 6, 9, 3, 5, 15, 8, 1, 6, 8, 2, 5, 1, 6, 7, 15, 9, 4, 5, 1, 7, 15, 9, 2, 5, 1, 6, 9, 14, 6, 13]


Some details regarding the tokenized sentence:



In [38]:
import pandas as pd
print("len:", len(final))
series = pd.DataFrame({"Tokens": final, "Words":  tokenization.apply_inverse_mapping(final)})
series

len: 35


| | Tokens | Words |
|---|---|---|
| 0 | -1 | |
| 1 | 4 | log |
| 2 | 5 | ( |
| 3 | 1 | x |
| 4 | 6 | ) |
| 5 | 9 | + |
| 6 | 3 | exp |
| 7 | 5 | ( |
| 8 | 1 | x |
| 9 | 6 | ) |


In [39]:
series.value_counts()

Tokens  Words
 6      )        6
 5      (        6
 1      x        6
 9      +        3
 8      *        3
 13     2        2
 4      log      2
 3      exp      2
 2      sin      2
 7      **       1
-1               1
-2               1
dtype: int64

To check that everything went smoothly, we can use the get_string method to go back to a string-based representation. The result is the same expression, although the terms may appear in a different order.

In [40]:
print("original:", str(string), "\n\n")

print("Result:", tokenization.get_string(final))

original: x*sin(x)**2 + exp(x)*sin(x) + log(x) + log(exp(2*x)) 


Result: log(x)+exp(x)*sin(x)+x*sin(x)**2+log(exp(2*x))
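
The round trip can be sketched as an inverse lookup followed by a join. The vocabulary and delimiter labels below are hypothetical, and `decode` and `top_level_terms` are illustrative helpers rather than part of eq_learner. Because the term order can differ between the two strings, as in the output above, the sketch compares the sorted top-level terms instead of the raw strings. This is a purely textual check: it would not recognize factors reordered inside a term.

```python
# Hypothetical vocabulary, for illustration only; eq_learner defines its own mapping.
VOCAB = {"x": 1, "sin": 2, "exp": 3, "log": 4, "(": 5, ")": 6,
         "**": 7, "*": 8, "+": 9, "1": 14, "2": 15}
START, END = 12, 13  # hypothetical start/end-of-sequence labels
INVERSE = {number: symbol for symbol, number in VOCAB.items()}


def decode(tokens):
    """Turn a flat token list back into an expression string."""
    # The delimiters carry no symbol, so they are skipped.
    return "".join(INVERSE[t] for t in tokens if t not in (START, END))


def top_level_terms(expr):
    """Split on '+' signs outside parentheses and return the sorted terms."""
    terms, depth, current = [], 0, ""
    for char in expr.replace(" ", ""):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if char == "+" and depth == 0:
            terms.append(current)
            current = ""
        else:
            current += char
    terms.append(current)
    return sorted(terms)


tokens = [12, 4, 5, 1, 6, 9, 3, 5, 1, 6, 8, 2, 5, 1, 6, 13]
decoded = decode(tokens)
print(decoded)

original = "x*sin(x)**2 + exp(x)*sin(x) + log(x) + log(exp(2*x))"
result = "log(x)+exp(x)*sin(x)+x*sin(x)**2+log(exp(2*x))"
print(top_level_terms(original) == top_level_terms(result))
```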


Everything done above for a single expression can also be applied to batches, with automatic padding.

In [41]:
fun_generator = DatasetCreator(basis_functions,max_linear_terms=1, max_binomial_terms=1,max_compositions=1,max_N_terms=1,division_on=False)
support = np.arange(-20,20,0.1)
string, dictionary =  fun_generator.generate_batch(support,20)
dictionary

[{'Single': [log(x)],
  'binomial': [exp(x)*sin(x)],
  'N_terms': [sin(x)**6],
  'compositions': [exp(x*sin(x) + x)],
  'division': []},
 {'Single': [log(x)],
  'binomial': [exp(x)*log(x)],
  'N_terms': [x**3*log(x)**3],
  'compositions': [sin(x*log(x) + 1)],
  'division': []},
 {'Single': [log(x)],
  'binomial': [exp(x)*sin(x)],
  'N_terms': [exp(5*x)],
  'compositions': [log(sin(x)**2 + sin(x))],
  'division': []},
 {'Single': [log(x)],
  'binomial': [x*log(x)],
  'N_terms': [x**2*log(x)**2],
  'compositions': [sin(x*exp(x) + 1)],
  'division': []},
 {'Single': [log(x)],
  'binomial': [exp(x)*log(x)],
  'N_terms': [x**3],
  'compositions': [0],
  'division': []},
 {'Single': [sin(x)],
  'binomial': [exp(x)*sin(x)],
  'N_terms': [exp(4*x)],
  'compositions': [sin(1)],
  'division': []},
 {'Single': [sin(x)],
  'binomial': [exp(2*x)],
  'N_terms': [x**5],
  'compositions': [exp(x*sin(x))],
  'division': []},
 {'Single': [sin(x)],
  'binomial': [sin(x)**2],
  'N_terms': [sin(x)**4],
  '

In [42]:
tokenization.pipeline(dictionary)


AttributeError: 'str' object has no attribute 'items'
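
The automatic padding mentioned for batches can be illustrated independently of the library. The sketch below assumes a padding label of 0 and a hypothetical helper `pad_batch`; the label and padding side that eq_learner actually uses may differ.

```python
PAD = 0  # hypothetical padding label, assumed not to clash with any token


def pad_batch(sequences, pad_value=PAD):
    """Right-pad every token sequence to the length of the longest one."""
    max_len = max((len(seq) for seq in sequences), default=0)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


batch = [
    [12, 4, 5, 1, 6, 13],
    [12, 3, 5, 1, 6, 8, 4, 5, 1, 6, 13],
]
padded = pad_batch(batch)
for row in padded:
    print(row)
```

After padding, all rows have the same length, so the batch can be stacked into a rectangular array.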