r0ller edited this page Jun 30, 2023 · 257 revisions

  • For the impatient: how does text to code work?
  • What is this project about?
  • Platform and use case examples
  • Disclaimer
  • User Data Privacy
  • How does it work?
  • Development platform
  • Unsupervised machine learning aided language modelling
  • Modelling a language
  • How to build
  • Tests (NLTK based)
  • Near term goals
  • Long term goals
  • Technical Guide and Documentation

For the impatient

How does text to code work? The full answer isn't short, but you can jump straight to the tutorial.

What is this project about?

In a nutshell, it is about natural language processing. Alice stands for 'A Language Interpreter as semantiC Experiment'. As stated in the abstract of Yacc: "Computer program input generally has some structure; in fact, every computer program that does input can be thought of as defining an input language which it accepts. An input language may be as complex as a programming language, or as simple as a sequence of numbers."

Instead of learning such program-specific input languages, I'm trying to build a reusable library which acts as a human interface (simply called 'hi') to turn natural language text into structured analyses. These analyses can be used to extract all the information necessary to translate the text into a language that computers understand. The translation process is carried out in two steps: first, morphological, syntactic and semantic analyses are carried out and their results are handed over to the caller/client program in a JSON structure; then the analyses need to be parsed by the client to generate a translation or to extract information for tagging and classification. Examples for different platforms/use cases are provided in the project to demonstrate how this can be done.


Platform and use case examples

Android: You can currently make phone calls or look for contacts in Hungarian and English (even offline, if a dictionary is available for download for your language of choice), like:

list contacts with Peter

keress névjegyeket Péterrel

Check it out in Play Store in English
Check it out in Play Store in Hungarian

Javascript (browsers or Node.js): You can find an example of embedding the compiled js lib into a website, which demonstrates how sentences about searching for a location on a map can be interpreted, like:

show location of thai restaurants in erding

Check it out on the project page

Desktop: In this use case, file handling commands are interpreted, currently tuned for filtered file and directory listing using logical expressions like:

list symlinked or executable directories in directory abc that are not empty !

Clone the project and build it on your desktop according to the How to build section. Please note that the desktop version now handles punctuation (notice the exclamation mark at the end of the example sentence).


Disclaimer

This software is provided "as is" and without any warranty. Use it at your own risk under the license of GPL v3.


User Data Privacy

This software does not store your data.


How does it work?

The interpreter itself relies on the following components: a lexical analyser, a syntactic parser (call it parser), a morphological analyser, a semantic parser (call it interpreter) and a database connector. The lexical analyser used to be built with Lex but was later replaced with a hi-specific function; the morphological analyser is built with foma, the syntactic parser with bison, and the database connector is based on SQLite3.

The lexical analyser scans the input and validates its words using the stems returned by foma, which are checked against the lexicon, i.e. the dictionary in the database. The parser validates the input against the syntactic rules, while the interpreter checks the semantics and translates the command.
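The scan-and-validate step can be sketched roughly as follows; a toy lookup table stands in for both the foma morphological analyser and the SQLite lexicon, and all names are illustrative rather than part of the hi code base:

```python
# Toy sketch of the scan-and-validate step. A lookup table stands in for
# the foma morphological analyser and the db lexicon; nothing here is
# part of the actual hi code base.

STEMS = {"list": "list", "contacts": "contact", "with": "with", "peter": "peter"}
LEXICON = {"list", "contact", "with", "peter"}  # stems known to the model db

def validate(sentence):
    """Return the stems of the input words, rejecting unknown words."""
    stems = []
    for word in sentence.lower().split():
        stem = STEMS.get(word)              # 'morphological analysis'
        if stem is None or stem not in LEXICON:
            raise ValueError("unknown word: " + word)
        stems.append(stem)                  # validated against the lexicon
    return stems

print(validate("list contacts with Peter"))  # ['list', 'contact', 'with', 'peter']
```

In the real pipeline the validated input then goes on to the syntactic and semantic stages.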

The library itself contains only one function: you pass it the command, and it returns the analyses in a JSON structure, which the client must parse to assemble the corresponding shell script (for browsers and Android: JavaScript). To execute the shell script, I created a main program which takes commands from the CLI, passes them to the mentioned library function and executes the resulting shell script in a child process. This way, it really feels like talking to a machine: one simply invokes the executable by typing 'hi' and, after hitting enter, the command can be formulated in English.
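To illustrate the division of labour between the library and the client, here is a minimal hypothetical client-side sketch; the JSON shape is invented for illustration only and is not the actual structure returned by the library:

```python
import json
import subprocess

# The library hands back analyses as JSON; the client parses them and
# assembles a command which the main program runs in a child process.
# This JSON shape is invented for illustration only.
analyses = json.loads('{"action": "echo", "arguments": ["hello", "from", "hi"]}')

# Assemble the command from the parsed analyses ...
cmd = [analyses["action"], *analyses["arguments"]]
# ... and execute it in a child process, as the 'hi' main program does
# with the assembled shell script.
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout.strip())  # hello from hi
```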


Development platform

The interpreter is nowadays developed and tested on NetBSD (though Linux would fit the bill as well), as I don't really have time to get all the development environment tools (e.g. the Android NDK, Emscripten) working on Minix3, which I used earlier. The development language has changed in the meantime from C to C++, though the C heritage can still be seen here and there. The shell scripts aim to be POSIX compliant; where they aren't, it's considered a bug.


Unsupervised machine learning aided language modelling

This is highly experimental, but I finally prepared an end-to-end toolchain to integrate it. The machine learning tool is Alignment Based Learning (ABL). In addition, there is a preparation step that first breaks down the words of a text corpus (currently without punctuation) into morpheme tags using foma, based on the foma fst assigned to the language in the language model db. You can invoke it as:

prep_abl /path/to/dbfile.db /path/to/abl/training/corpus <language id> /path/to/output/file/name pun|nopun -d<delimiter> lex|nolex thread_cap

The first parameter is the language model db file, which must contain the content prepared in the tables LANGUAGES, GCAT, SYMBOLS and LEXICON, plus the minimum data required by the foreign key constraints in other tables; for the machine learning phase you don't need any other syntax- or semantics-related content. The language id is one of the ids available in your language model db, which at the same time identifies the foma fst. The output of prep_abl is the preprocessed corpus for the training. The remaining options are:

  • pun|nopun: specifies whether punctuation marks in the text shall be taken into account (default: nopun).
  • -d<delimiter>: either left at its default (newline) by specifying nothing after the option (-d), or set to a single character (e.g. -d.).
  • lex|nolex: tells the tool whether the stems identified by the stemmer shall be put in the db (default: nolex).
  • thread_cap: sets the maximum number of cores to be used (default: 1).

In the end, you will get several files named like the output file you entered, but suffixed. Only the ones suffixed _cons are relevant, in case you want to see which words in the corpus could not be analysed; the file you entered as the 4th parameter (the output file) is the one you need to feed to the ABL tool, as it contains the morpheme tags (instead of the words) in the right sequence.

Of course, you need to know how to use the ABL tools, but that project (see link above) has nice documentation and the tools have short help texts as well. However, for the impatient, this is how I invoked the ABL commands for my test corpus:

abl_align -a a -p b -e -i /path/to/corpus/file -o /path/to/corpus/file/name/aligned

abl_cluster -i /path/to/aligned/corpus/file -o /path/to/corpus/file/name/clustered

abl_select -s b -i /path/to/clustered/corpus/file -o /path/to/corpus/file/name/selected

Once you have done the training, you need to put the rules learned from the corpus into your language model db. As ABL does not provide the grammar rules directly, I had to write a postprocessing tool that extracts them from the ABL output. You need to invoke it as:

proc_abl /path/to/abl_select/output/file <language id> [/path/to/dbfile.db]

The parameters should be self-explanatory. The language model db is optional so that you can make test runs without writing the grammar rules and symbols into the db.

As mentioned at the beginning, this is highly experimental, so there's a lot of room for improvement: e.g. the machine-generated symbols for the rules are pretty hard to read, there's no conversion to right recursion, etc. Besides all that, this will NOT give you the semantics; that you'll still need to write yourself.

If you want to test the grammar the machine built, you can do so using the test tools stex and stax. Ideally, stax should give you back the sentences you had in the corpus used in the ABL preparation step.


Modelling a language

If you'd like to create your own model for a language, you'll need to think over the following:

  • Phonology
  • Morphology
  • Lexicon
  • Grammar (syntax)
  • Semantics

You have the possibility to maintain your own rules for all of those. Unfortunately, documentation is lagging behind, but you can ask for help either by email or by creating an issue describing your problem and I'll try to help. Some technical help can be found in the technical documentation, but it's not always up to date either, so as usual the best way is to browse the source, especially the hi_db.sql file and, for modelling, the content sql files created for the different platforms.

The rules for phonology and morphology belong to foma, so please check the technical documentation for examples and for links to the original foma documentation to be able to create your own morphological analyser. There are two analysers in development that I usually use: a Hungarian one and an English one. Depending on the target, the morphological analyser can be built as:

desktop:

make desktop_fst DESKTOPFOMAPATH=/path/to/your/foma/file DESKTOPLEXCFILES=/your/lexc/files/directory

Android:

make android_fst ANDROIDFOMAPATH=/path/to/your/foma/file ANDROIDLEXCFILES=/your/lexc/files/directory

javascript:

make js_fst JSFOMAPATH=/path/to/your/foma/file JSLEXCFILES=/your/lexc/files/directory

The grammar rules can either be coded manually in a bison file, as shown in the corresponding section of the technical documentation, or you can simply enter your syntactic rules in the grammar db table of your content sql file, specifying the language id for which the rule is relevant, the parent symbol, the head symbol and the non-head symbol, as if it were a bison rule like A->B C. However, in order to add a linguistic feature to a node (like main_verb), you have to put the corresponding code snippet of the action in the action field of the rule's grammar table entry.
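As a rough illustration of what such a rule entry could look like, here is a hypothetical sketch using an invented column layout; consult hi_db.sql for the real schema of the grammar table:

```python
import sqlite3

# Hypothetical sketch of a grammar table entry for a rule A -> B C.
# The column layout is invented for demonstration; the real schema
# is defined in hi_db.sql.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE grammar(
    lid INTEGER,           -- language id the rule is relevant for
    parent_symbol TEXT,    -- A
    head_symbol TEXT,      -- B
    non_head_symbol TEXT,  -- C
    action TEXT            -- snippet attaching a feature (e.g. main_verb)
)""")
db.execute("INSERT INTO grammar VALUES (1, 'VP', 'V', 'NP', 'main_verb')")
row = db.execute(
    "SELECT parent_symbol, head_symbol, non_head_symbol FROM grammar"
).fetchone()
print(row)  # ('VP', 'V', 'NP')
```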

After you've created your content sql file with the lexicon, grammar, semantic rules, etc. you have to create a db file from it as follows:

desktop (supposing you have your mycontent.sql in the subdirectory of the project directory build/hi_desktop):

make desktop_parser_db NATIVEPARSERDBNAME=mymodel.db NATIVEPARSERDBCONTENT=build/hi_desktop/mycontent.sql

Android (supposing you have your mycontent.sql in the subdirectory of the project directory build/hi_android):

make android_parser_db ANDROIDPARSERDBNAME=mymodel.db ANDROIDPARSERDBCONTENT=build/hi_android/mycontent.sql

javascript (supposing you have your mycontent.sql in the subdirectory of the project directory build/hi_js):

make js_parser_db JSPARSERDBNAME=mymodel.db JSPARSERDBCONTENT=build/hi_js/mycontent.sql

Now, you can generate the bison source from your db file:

desktop (with mymodel.db in build/hi_desktop):

make desktop_bison_parser NATIVEPARSERDBNAME=mymodel.db

Android (with mymodel.db in build/hi_android):

make android_bison_parser ANDROIDPARSERDBNAME=mymodel.db

javascript (with mymodel.db in build/hi_js):

make js_bison_parser JSPARSERDBNAME=mymodel.db

If you have action snippets and functor implementations, you can also pass their locations via the following parameters for the corresponding target:

DESKTOPACTIONSNIPPETS, DESKTOPFUNCTORPATH
ANDROIDACTIONSNIPPETS, ANDROIDFUNCTORPATH
JSACTIONSNIPPETS, JSFUNCTORPATH

Once you have your foma fst file, db file and bison source, you can build your own interpreter out of these.


How to build

A makefile is now available, but it's pretty bare-bones, with no external dependency checks and a minimal target dependency setup. Until further documentation appears here, use the help target to find out more by simply typing:

make help

That will give you all the parameters that can be used for each target, with the dependencies listed at the end. When building on Linux, the dev packages need to be installed as well, so besides e.g. sqlite3, rapidjson, flex, bison and foma you'll also need rapidjson-dev, libsqlite3-dev and libfoma-dev. If the dependencies are installed on the target system, there are only a few steps to get it up and running:

NetBSD Desktop:

make desktop_parser
make shared_native_lib
make desktop_client

Linux (Ubuntu) Desktop:

make desktop_parser
make shared_native_lib INCLUDEDIRS="-I. -I/usr/include" COMMONLIBDIRS=/usr/lib/x86_64-linux-gnu
make desktop_client
  • Now you have an executable by default in build/hi_desktop called 'hi' which interprets the text input entered. (If you want a debug build, you can pass DEBUG=yes to the targets shared_native_lib and desktop_client.)

Android (requires Android NDK):

make android_parser
make arm32_lib NDK32BITTOOLCHAINDIR=/your/32bit/android/NDK/toolchain/directory
make arm64_lib NDK64BITTOOLCHAINDIR=/your/64bit/android/NDK/toolchain/directory
  • Build your android project that links the library file. Please refer to the hi_android directory, which contains an example project that you can directly import into Android Studio. If you'd like to replace the library file in the example with the one you compiled, copy yours into the hi_android/hi/app/src/main/jniLibs/arm64-v8a or hi_android/hi/app/src/main/jniLibs/armeabi-v7a directory.

Javascript (requires Emscripten):

make js_parser
make embedded_js_lib EMSCRIPTENDIR=/your/emscripten/directory
  • Now you have a js file by default in build/hi_js/embedded/ which you can use in the index.html file after modifying it according to your needs.

Tests (NLTK based)

Testing a language model is generally pretty difficult, as grammars can generate a huge number of sentences even for a relatively small set of words. For the time being, the only thing I could come up with is to make use of NLTK's sentence generation capabilities. This means that you'll need Python 3.6 and NLTK installed.

There are currently two tools in the tests directory. stex is a wrapper around NLTK that generates every possible sentence structure with the terminal symbols of the word forms specified. I'd recommend restricting the generation by depth instead of by number of sentences: as soon as the generator runs into a recursive rule (which may happen even in the first sentence), it ends up in an infinite loop and crashes. As a hint, when I was testing the grammar for the desktop use case (file/directory listing using logical operators), the syntax trees built for such sentences reached at least a depth of 10, so below that I did not get any results. Once you're satisfied with the result set, redirect its output to a file to feed the other tool, called stax. Invoke stex like:

`./stex /path/to/dbfile.db <language id> <sentence nr limit>n|<tree_depth>d list,of,all,wordforms,to,be,generated`

To make the stex output lines unique, there's now a small script which you can invoke as:

`remove_stex_output_duplicates.sh /path/to/stex_output_file`

The script will generate a file with the same name as the stex output, suffixed with _unique. stax simply takes the output of stex (preferably made unique):

`./stax /path/to/stex_output_file [/path/to/prep_abl/output/file]`

It generates the word forms from the terminal symbols (tokens) in the sentence structures, so you'll get sentences that your grammar accepts for the given set of words. This is effectively a language equivalence (weak equivalence) test between the sentences from which the grammar was induced and the ones generated by that grammar. (For a strong/structural equivalence test you'd need the grammar that generated the original sentences, in order to compare the shape of the grammar rules.) Optionally, if you provide the output of the ABL preparation step as the second parameter, you'll get some basic statistics on how the sentences generated from the machine-learned grammar rules relate to the sentences used for the training.
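To see why restricting by depth matters, here is a toy, depth-limited CFG generator in plain Python (not the NLTK API); with the recursive NP rule below, enumeration without a depth cap would never terminate:

```python
# Toy depth-limited CFG expansion (not the NLTK API). The recursive NP
# rule makes unbounded enumeration loop forever; a depth cap prunes it.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["dir"], ["NP", "and", "NP"]],  # recursive rule
    "VP": [["is", "empty"]],
}

def generate(symbol, depth):
    if symbol not in GRAMMAR:   # terminal symbol
        return [[symbol]]
    if depth == 0:              # depth budget exhausted: prune this branch
        return []
    out = []
    for rhs in GRAMMAR[symbol]:
        seqs = [[]]
        for sym in rhs:
            seqs = [s + t for s in seqs for t in generate(sym, depth - 1)]
        out.extend(seqs)
    return out

sentences = [" ".join(s) for s in generate("S", 4)]
print(sentences[0])  # dir is empty
```

Raising the depth bound admits longer recursive expansions ("dir and dir and dir ..."), which mirrors why a sufficient depth was needed to get any results for the desktop grammar.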


Near term goals

  • Introducing and supporting more syntactic categories*
  • Enhancing the lexicon*
  • Handling compound sentences
  • Handling defining relative clause >> done
  • Supporting noun specific adjective interpretation (getting rid of classification problems posed by the semantic tree) >> done
  • Reengineering Yacc source to avoid recompiling if only dictionary changes but syntactic rules aren't changed >> done
  • Resolving conflicting adjectives*
  • Handling statements >> done

Long term goals

  • Handling questions >> done
  • Introducing interaction (Did you mean ...?)
  • Context handling* (e.g. Copy all non-executable files to directory abc. Copy those files to def as well.)
  • Machine learning*
  • Support any language as source of translation (just by technically providing the possibility in DB) >> done
  • Partial/error tolerant sentence analysis, simply by omitting words that cannot be analysed or can be analysed but don't fit in the given syntactic model. If no sentence analysis can be carried out, even giving back just the morphological analysis of words may make sense as in a mobile phone use case users may just say one word commands. >> done

*=Under development