Changing language and creating new dictionaries

Marcel Izgin edited this page Dec 16, 2016 · 3 revisions

Many people have asked for ACAT to support new languages, so here is a short tutorial. I did this on Windows 7, creating a new French database from a French book available on the Project Gutenberg website (https://www.gutenberg.org/). Keep in mind that texts written by the users themselves are still the better source. I wanted this documentation to be as accessible as possible, because ACAT could improve the lives of a lot of people, but it is unfortunately only available in English. Only a few simple steps are needed to switch it to your language, so language does not have to be an obstacle.

Note: Don't hesitate to modify or improve this tutorial. It's not perfect, but it could be with your help ;)

1. Introduction

To create the database, you need the tool "text2ngram". It is installed with presage during the installation of ACAT, and its default directory is "C:\Program Files (x86)\presage\bin". At the end of the ACAT installation, this tool generates the default word prediction database from a text file. The text file used by default is: http://mattmahoney.net/dc/text8.zip

As you can see, this text is in English, which is why all the predicted words are in English when you use ACAT for the first time. To use your own language, all you have to do is find a text in your language and build a new database from it. That is what I am about to explain.

2. First step: Finding a good text to create the database

  • It's important to choose a text that matches the way the user talks. For example, if you choose a very old text with archaic words and syntax, the prediction will not be very effective and will suggest archaic words. So find something written in today's language.

  • An autobiography could be a good idea because it is written in the first person singular, which is how the user is going to talk.

  • As for the size of the text file, I still don't know what is best. The default text file (text8) is 100 MB and contains about 17,000,000 words. But we don't need a text file that big, because ACAT (presage) automatically learns new words. The most important thing is to give it a good base for the first uses, so I think one book is enough.

  • Try to find something containing as few special characters as possible.

For my personal use I chose the book "A se tordre": https://www.gutenberg.org/files/13834/13834-0.txt. It's not a recent one, but it should do the job.
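A Project Gutenberg file also contains an English header and license footer around the book itself, which would mix English words into the new database. The sketch below shows one possible way to keep only the book. It assumes the file marks the book with lines beginning with "*** START OF" and "*** END OF"; check your own file, because the exact wording may differ. The function name is made up for this example.

```python
def strip_gutenberg_boilerplate(text: str) -> str:
    """Keep only the lines between the START and END marker lines.

    Assumption: the markers begin with "*** START OF" and "*** END OF".
    If no marker is found, the text is returned as it was (trimmed).
    """
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        marker = line.strip().upper()
        if marker.startswith("*** START OF"):
            start = i + 1          # the book begins after this line
        elif marker.startswith("*** END OF"):
            end = i                # the book stops before this line
            break
    return "\n".join(lines[start:end]).strip()
```

Read the downloaded file as UTF-8, pass its content to the function and save the result as your working text file.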

3. Formatting the text for text2ngram

To be parsed correctly by text2ngram, the input text you chose must be a plain text file. Don't use PDF, JPEG or similar formats. A simple ".txt" file in UTF-8 is perfect.

But even with a simple text file you will have to make some modifications to remove the commas, periods, special characters, line breaks, numbers... I also chose to put every letter in lowercase. There are several ways to make these changes, but I used Notepad++ (you can download it here: https://notepad-plus-plus.org/download/). Open your text file with Notepad++ and select the whole text (Ctrl+A). From there you can:

  • delete the line breaks (Edit > Line Operations > Join Lines and Remove Empty Lines),
  • put all the characters in lowercase (Edit > Convert Case to > lowercase),
  • remove commas, periods and every other special character (Search > Replace..., type the character in "Find what" and leave "Replace with" empty).

At the end, your whole book should appear on just one big line.
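If you prefer a script to the manual steps in Notepad++, the same clean-up can be sketched in Python. This is only a sketch under a few assumptions: the function names are made up for this example, the input is UTF-8, and every non-letter character is replaced by a space (not by nothing), so that words joined by an apostrophe or a hyphen are split and not glued together.

```python
from pathlib import Path


def normalize_text(raw: str) -> str:
    """Lowercase the text and keep only letters separated by single spaces."""
    text = raw.lower()
    # Anything that is not a letter (punctuation, digits, line breaks)
    # becomes a space. Accented letters such as "é" count as letters.
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text)
    # Collapse runs of spaces so the whole book ends up on one line.
    return " ".join(cleaned.split())


def normalize_file(source: str, target: str) -> None:
    """Read a UTF-8 text file, normalize it and write the result."""
    # "utf-8-sig" also accepts files that start with a byte order mark.
    raw = Path(source).read_text(encoding="utf-8-sig")
    Path(target).write_text(normalize_text(raw), encoding="utf-8")
```

For example, `normalize_file("13834-0.txt", "ebook_fr.txt")` would write the cleaned book to "ebook_fr.txt", assuming the downloaded file is in the current folder.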

4. Text2ngram usage

I'm not going to explain in detail how text2ngram works because I'm not very familiar with it. For more information about this tool, see the official page: http://homepages.inf.ed.ac.uk/lzhang10/ngram.html. In short, text2ngram parses a text, converts it into n-grams (an n-gram is a group of n consecutive words) and puts the n-grams into a database (SQLite).
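To make the idea of an n-gram concrete, here is a small Python illustration that counts the n-grams of a short sentence. It is not how text2ngram is implemented, only a sketch of the concept, and the function name is made up for this example.

```python
from collections import Counter


def count_ngrams(text: str, n: int) -> Counter:
    """Count every group of n consecutive words in the text."""
    words = text.split()
    # Slide a window of n words over the text, one word at a time.
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


bigrams = count_ngrams("le chat dort le chat mange", 2)
print(bigrams[("le", "chat")])  # the pair "le chat" appears twice
```

A word predictor can use such counts to suggest the word that most often follows the words already typed.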

4.1

First, open a command line window as an administrator (type "cmd" in the search field of the Start menu, then right-click cmd.exe and select "Run as administrator").

Then go to the folder where text2ngram.exe is installed (type this command: "cd C:\Program Files (x86)\presage\bin"). Now you can use the tool. For my text file I decided to use 5-grams, simply because it didn't work with lower-order n-grams. I assume this is because my text file has a lot of identical 3-grams and 4-grams, and there seems to be a limit on identical n-grams.

4.2

Put your text file (I named mine "ebook_fr.txt") in the same directory as text2ngram (this keeps the commands shorter). Then here are the command lines to create a 5-gram database from your text file (execute them one by one):

`text2ngram -n5 -f sqlite -o C:\Intel\ACAT\Users\ACAT\WordPredictors\Presage\database_fr.db ebook_fr.txt`
`text2ngram -n4 -a -f sqlite -o C:\Intel\ACAT\Users\ACAT\WordPredictors\Presage\database_fr.db ebook_fr.txt`
`text2ngram -n3 -a -f sqlite -o C:\Intel\ACAT\Users\ACAT\WordPredictors\Presage\database_fr.db ebook_fr.txt`
`text2ngram -n2 -a -f sqlite -o C:\Intel\ACAT\Users\ACAT\WordPredictors\Presage\database_fr.db ebook_fr.txt`
`text2ngram -n1 -a -f sqlite -o C:\Intel\ACAT\Users\ACAT\WordPredictors\Presage\database_fr.db ebook_fr.txt`

If everything worked fine, the database "database_fr.db" has been created in "C:\Intel\ACAT\Users\ACAT\WordPredictors\Presage" from the text file "ebook_fr.txt".
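One way to check that the database is not empty is to open it with Python's built-in sqlite3 module and count the rows of each table. The sketch below does not assume any table names: it simply lists whatever tables text2ngram created. The function name is made up for this example.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def table_row_counts(db_path: str) -> dict:
    """Return {table name: number of rows} for every table in the database."""
    # sqlite3.connect() would silently create a missing file, so check first.
    if not Path(db_path).is_file():
        raise FileNotFoundError(db_path)
    with closing(sqlite3.connect(db_path)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
```

Calling `table_row_counts(r"C:\Intel\ACAT\Users\ACAT\WordPredictors\Presage\database_fr.db")` should then return a dictionary with one entry per table. An empty dictionary, or tables with zero rows, would suggest that one of the commands failed.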

5. Configure ACAT to use the new database

The last thing to do is to tell ACAT to use your new database. Go to the directory "C:\Intel\ACAT\Users\ACAT" and open the file "PresageWordPredictorSettings.xml", with Notepad++ for example. Then replace the line <DatabaseFileName>database.db</DatabaseFileName> with <DatabaseFileName>database_fr.db</DatabaseFileName> (use your own database file name).
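The same change can be made with a few lines of Python. This is a sketch that assumes the element is written exactly as shown above, without attributes. It works on the file content as text, so the rest of the settings file is left untouched. The function name is made up for this example.

```python
import re


def set_database_file_name(settings_xml: str, new_name: str) -> str:
    """Replace the content of every <DatabaseFileName> element."""
    pattern = re.compile(
        r"(<DatabaseFileName>)(.*?)(</DatabaseFileName>)", re.DOTALL
    )
    # A function is used as replacement so that backslashes in new_name
    # are copied literally.
    updated, count = pattern.subn(
        lambda m: m.group(1) + new_name + m.group(3), settings_xml
    )
    if count == 0:
        raise ValueError("no <DatabaseFileName> element found")
    return updated


before = "<DatabaseFileName>database.db</DatabaseFileName>"
print(set_database_file_name(before, "database_fr.db"))
```

To apply it, read "PresageWordPredictorSettings.xml" into a string, pass it to the function and write the result back with the same encoding. Keeping a backup copy of the original file is a good precaution.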

Now you should be able to launch ACAT, and the predicted words will be in your language. Enjoy!