nlmk is a small NLP library specialized for the Macedonian language. It focuses on localizing the tokenizer and the stopwords, and it also provides document analysis. People familiar with nltk (Python) should find it easy to pick up. It is also designed to work with large files (texts).
nlmk requires the following third-party libraries:
- pyparsing-1.5.7
nlmk can also run with pypy. Please be careful to install the correct pyparsing version.
Display part of text, specified as a sentence-slice.
Examples:
python run.py sentences corpus/racin.txt 7
python run.py sentences corpus/racin.txt :2
python run.py sentences corpus/racin.txt 3:10
python run.py sentences corpus/racin.txt 80:
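The sentence-slice arguments above look like Python slice notation. The following is a minimal sketch of how such a spec could be parsed, assuming zero-based, Python-style slicing semantics. The `parse_slice` helper is hypothetical and is not part of nlmk.

```python
def parse_slice(spec):
    """Turn a spec such as '7', ':2', '3:10' or '80:' into a slice object.

    Hypothetical helper for illustration; nlmk may interpret specs differently.
    """
    if ":" not in spec:
        # A single number selects exactly one sentence.
        i = int(spec)
        # For i == -1 the stop would be 0, so fall back to None (end of list).
        return slice(i, i + 1 or None)
    start, _, stop = spec.partition(":")
    return slice(int(start) if start else None,
                 int(stop) if stop else None)


# Stand-in for a list of sentences read from a document.
sentences = ["sentence %d" % i for i in range(100)]

for spec in ("7", ":2", "3:10", "80:"):
    selected = sentences[parse_slice(spec)]
    print(spec, "->", len(selected), "sentence(s)")
```

Returning a `slice` object keeps the parsing separate from the tokenizing, so the same helper works on any sequence of sentences.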
Display each occurrence of a word inside a fixed-length window (default: 9).
Examples:
python run.py concordance corpus/racin.txt филозофија
python run.py concordance corpus/racin.txt филозофија 2
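To illustrate the idea of a concordance, here is a minimal sketch. It assumes the window is measured in tokens and centered on the matched word, which may differ from what nlmk actually does. The `concordance` function below is hypothetical, not nlmk's API.

```python
def concordance(tokens, word, window=9):
    """Yield the tokens surrounding each occurrence of `word`.

    Assumption: `window` is the total number of tokens shown,
    with the matched word in the middle.
    """
    half = window // 2
    for i, token in enumerate(tokens):
        if token == word:
            # Clamp the left edge so it never goes below index 0.
            yield tokens[max(0, i - half): i + half + 1]


tokens = "таа филозофија е стара филозофија".split()
for context in concordance(tokens, "филозофија", window=3):
    print(" ".join(context))
```

Because the function is a generator, contexts are produced one at a time, which suits large token lists.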
Use the nlmk.ngramgen module, or call it through the run.py caller.
Example:
python run.py ngramgen corpus/racin.txt 10 2 1
This will generate unigrams, bigrams and trigrams:
- the unigrams (words) show up at least 10 times
- the bigrams occur at least 2 times
- the trigrams occur at least 1 time (all trigrams)
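The per-order frequency thresholds described above can be sketched in a few lines of plain Python. This is only an illustration of the technique under the assumption that each threshold is a minimum occurrence count; the function names are hypothetical and do not come from nlmk.ngramgen.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over all n-grams (as tuples) in `tokens`."""
    return zip(*(tokens[i:] for i in range(n)))


def frequent_ngrams(tokens, n, min_count):
    """Count n-grams and keep those seen at least `min_count` times."""
    counts = Counter(ngrams(tokens, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}


tokens = "a b a b a c".split()
# One threshold per n-gram order, mirroring the three numbers on the command line.
for n, min_count in ((1, 3), (2, 2), (3, 1)):
    print(n, frequent_ngrams(tokens, n, min_count))
```

A threshold of 1 keeps every n-gram of that order, which matches the "all trigrams" case in the list above.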
Use the nlmk.tagger module, or call it through the run.py caller.
Example:
First, build a tagger from one or more documents. The following command builds a tagger called sociology:
python run.py build-tagger sociology corpus/obezvrednuvanje.na.trudot.txt corpus/rabotni.sporovi.txt
The tagger can then be used to tag other documents:
python run.py tag corpus/racin.txt sociology
Use the nlmk.corpus module, or call it through the run.py caller.
Example:
This will give the term frequency distribution:
python run.py tf corpus/racin.txt
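For readers unfamiliar with the term, a term frequency distribution simply counts how often each token occurs in a document. The sketch below shows the idea using only the standard library. It assumes a naive regex tokenizer and lowercasing, and it does no stopword filtering, so it is not a substitute for nlmk's Macedonian-aware tokenizer. The `term_frequencies` name is hypothetical.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase word tokens in `text`.

    Assumption: words are runs of Unicode word characters,
    which covers Cyrillic text in Python 3.
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


tf = term_frequencies("Трудот е труд. Трудот!")
# Print terms from most to least frequent.
for term, count in tf.most_common():
    print(term, count)
```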