This notebook demonstrates what the solution can do and describes the decisions made during its development.

First of all, the project was implemented with the aim of covering a wide range of general applications rather than efficiently solving particular problems. As a result, the solution is slow and memory-inefficient.

The idea is to parse the input text into a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph). The graph starts as a chain of tokens in the order they appear in the text; different analysers then add vertices and edges to it in order to perform named entity recognition tasks.

In [1]:
from tokenization import text_to_tokens
from dag import EntitiesDAG
from analyser import (punctuation_analyser,
                      spacing_analyser,
                      integer_analyser,
                     )

# Tokenization

Tokenization is implemented as simply as possible: it keeps every run of consecutive letters (EN, UA, RU alphabets) or consecutive digits together in a single token and breaks everything else into separate tokens.<br>
The idea is to unite multiple tokens into more meaningful entities during later analysis, while reducing the need to split tokens further for the sake of analysis.

No characters are omitted during tokenization; spaces are valid tokens as well.
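The rule above can be sketched with a single regular expression. This is only an illustration, not the actual `text_to_tokens` from the `tokenization` module: the function name `sketch_text_to_tokens` and the exact character classes are assumptions made here.

```python
import re

# Hypothetical pattern (an assumption, not the project's code):
#   1. a run of EN / UA / RU letters, or
#   2. a run of ASCII digits, or
#   3. any single remaining character (DOTALL lets "." match "\n" too).
_TOKEN_RE = re.compile(
    r"[A-Za-zА-Яа-яЁёІіЇїЄєҐґ]+|[0-9]+|.",
    re.DOTALL,
)


def sketch_text_to_tokens(text):
    """Split text into tokens without dropping any character."""
    return _TOKEN_RE.findall(text)


tokens = sketch_text_to_tokens("UINT64_C(0x123) expands")
print(tokens)
# Nothing is omitted, so joining the tokens restores the input text.
print("".join(tokens) == "UINT64_C(0x123) expands")
```

Because the last alternative matches any single character, every character of the input ends up in exactly one token.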

In [2]:
input_text = 'UINT64_C(0x123) expands to a literal'
print(input_text)

UINT64_C(0x123) expands to a literal


In [3]:
tokens = text_to_tokens(input_text)
print(tokens)

['UINT', '64', '_', 'C', '(', '0', 'x', '123', ')', ' ', 'expands', ' ', 'to', ' ', 'a', ' ', 'literal']


Another example:

In [4]:
input_text = """Українські народ-
ні прислів'я та приказки😃"""
tokens = text_to_tokens(input_text)
print(tokens)

['Українські', ' ', 'народ', '-', '\n', 'ні', ' ', 'прислів', "'", 'я', ' ', 'та', ' ', 'приказки', '😃']


# DAG

`EntitiesDAG` is a data structure designed to store information about a text.

`EntitiesDAG` consists of `BaseEntity` objects. Each entity stores references to its next and previous entities, as well as to the entities that are deduced (at least partially) from it.<br>
When initialized from tokens, `EntitiesDAG` is created as a chain `ConnectingEntity -> TextEntity -> ... -> ConnectingEntity`, where every `TextEntity` is initialized from a token and is surrounded by `ConnectingEntity` objects.
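To make the chain structure more concrete, here is a minimal sketch of how such a chain could be built. All class and function names (`SketchEntity`, `SketchConnecting`, `SketchText`, `build_chain`, `walk`) are hypothetical and chosen for illustration; the real `BaseEntity`, `TextEntity` and `ConnectingEntity` classes may be organised differently.

```python
class SketchEntity:
    """Hypothetical stand-in for a DAG vertex (not the project's BaseEntity)."""

    def __init__(self):
        self.previous = []  # entities that come right before this one
        self.next = []      # entities that come right after this one
        self.derived = []   # entities deduced (at least partially) from this one


class SketchConnecting(SketchEntity):
    def __str__(self):
        return "•"


class SketchText(SketchEntity):
    def __init__(self, text):
        super().__init__()
        self.text = text

    def __str__(self):
        return self.text


def link(left, right):
    """Add an edge left -> right and remember it on both sides."""
    left.next.append(right)
    right.previous.append(left)


def build_chain(tokens):
    """Build connector -> text -> connector -> ... and return the first vertex."""
    first = SketchConnecting()
    current = first
    for token in tokens:
        text = SketchText(token)
        connector = SketchConnecting()
        link(current, text)
        link(text, connector)
        current = connector
    return first


def walk(first):
    """Follow the first outgoing edge of every vertex until the chain ends."""
    node = first
    while True:
        yield node
        if not node.next:
            break
        node = node.next[0]


chain = build_chain(["pandas", " ", "designed"])
print("".join(str(node) for node in walk(chain)))
```

In this sketch every token is surrounded by connectors, so a chain built from `n` tokens has `2 * n + 1` vertices.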

In [5]:
input_text = '''pandas  designed to make working
with "relational" or "labeled" data both easy and intuitive.'''
tokens = text_to_tokens(input_text)
dag = EntitiesDAG(tokens)
dag

EntitiesDAG<pandas  designed to make working
with "relational" or "label...sy and intuitive.>

Representing such a DAG in a pretty way requires significant additional coding and wasn't implemented properly. Instead, a `pprint` method was implemented; it prints all entities so that every entity appears further to the right than the entities that precede it. `pprint` also supports a `max_width` argument to print the DAG on multiple lines if it doesn't fit on one.

`pprint` uses the `__str__` method to print entities.<br>
`TextEntity` objects are printed as their Python representation (without quotes).<br>
`ConnectingEntity` is represented by the "•" symbol (BULLET).
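For a plain chain of tokens, the printing rules above can be approximated as follows. This is a simplified sketch that works on a flat token list instead of a real `EntitiesDAG`; the function name `sketch_pprint_lines`, the default width and the dashed separator between chunks are assumptions.

```python
def sketch_pprint_lines(tokens, max_width=80):
    """Render a token chain and cut it into chunks of at most max_width characters."""
    # repr() makes characters such as "\n" visible; [1:-1] drops the quotes.
    shown = [repr(token)[1:-1] for token in tokens]
    # Every token is surrounded by the bullet that stands for a connector.
    rendered = "•" + "•".join(shown) + "•"
    chunks = [rendered[i:i + max_width] for i in range(0, len(rendered), max_width)]
    lines = []
    for index, chunk in enumerate(chunks):
        if index > 0:
            lines.append("-" * max_width)  # separator between wrapped parts
        lines.append(chunk)
    return lines


for line in sketch_pprint_lines(["a", " ", "b", "\n", "c"], max_width=5):
    print(line)
```

The sketch ignores entities added by analysers, which is the part that makes a proper layout difficult.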

In [6]:
dag.pprint(max_width=70)

•pandas• • •designed• •to• •make• •working•\n•with• •"•relational•"• •
----------------------------------------------------------------------
or• •"•labeled•"• •data• •both• •easy• •and• •intuitive•.•


# Analysers: Punctuation, Spacing, Numbers

Further work with a DAG is done by analysers, which add new entities to an existing DAG.<br>
For example, `spacing_analyser`, `punctuation_analyser` and `integer_analyser` each match predefined sequences of `TextEntity` objects and add new entities to the DAG when a match is found.
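The matching idea can be illustrated on a flat list of tokens. The sketch below is not how the analysers in the `analyser` module are written: `sketch_analyse` is a hypothetical helper that returns `(start, end, label)` spans instead of adding entities to a DAG, and the matching rules are guesses based on the labels `pprint` shows (`␣`, `Punct<...>`, `N<...>`).

```python
import string


def sketch_analyse(tokens):
    """Return (start, end, label) spans for whitespace, punctuation and integers."""
    found = []
    # Single-token patterns.
    for i, token in enumerate(tokens):
        if token.isdigit():
            found.append((i, i + 1, f"N<{token}>"))
        elif token.isspace():
            found.append((i, i + 1, "␣"))
        elif len(token) == 1 and token in string.punctuation:
            found.append((i, i + 1, f"Punct<{token}>"))
    # A multi-token pattern: three consecutive dots form an ellipsis.
    for i in range(len(tokens) - 2):
        if tokens[i:i + 3] == [".", ".", "."]:
            found.append((i, i + 3, "Punct<...>"))
    return found


for span in sketch_analyse(["int", "64", "?", "\n", ".", ".", "."]):
    print(span)
```

Note that the three dots are reported both individually and as one ellipsis; keeping overlapping interpretations side by side is what a DAG allows and a flat token list does not.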

In [7]:
input_text = '''How to convert variable into numpy.int64?
\t- Answer is simple...'''
tokens = text_to_tokens(input_text)
dag = EntitiesDAG(tokens)
dag.pprint()

•How• •to• •convert• •variable• •into• •numpy•.•int•64•?•\n•\t•-• •Answer• •is• 
--------------------------------------------------------------------------------
•simple•.•.•.•


In [8]:
spacing_analyser.analyse(dag)
punctuation_analyser.analyse(dag)
integer_analyser.analyse(dag)

dag.pprint()

•How• •to• •convert• •variable• •into• •numpy•.•int•64•?•\n•\t•-• •Answer• •is• 
                                                                                
                                                                                
                                                                                
                                                                 ␣        ␣    ␣
                                                         ␣  ␣  Punct<->
                                                       Punct<?>
                                                    N<64>
     ␣    ␣         ␣          ␣      ␣       Punct<.>
--------------------------------------------------------------------------------
•simple•.•.•.•
            Punct<.>
          Punct<.>
        Punct<.>
        Punct<...>


P.S. Due to current limitations of the `pprint` method, the resulting representation of all entities of a DAG contains a lot of empty space.