Releases: lex-lingo/lingo

v1.10.2

11 Feb 12:36
  • Fixed regression introduced in 1.9.0 where source form was assumed to be a Lingo::Language::Token (issue reported by Leonhard Maylein).

v1.10.1

11 Feb 12:36
  • Fixed regression introduced in 1.8.6 where renamed constant in Lingo::Attendee::VectorFilter was not reflected in Lingo::Srv (issue #16 by @svelsae).

v1.10.0

11 Feb 12:35
  • Dropped support for Ruby 2.0.
  • Updated dependency versions.

v1.9.0

11 Feb 12:34
  • Dropped support for Ruby 1.9.
  • Removed support for deprecated options and attendee names (old → new):
    • Lingo::Language::Grammar:
      compositum → compound
    • Lingo::Attendee::TextReader:
      lir-record-pattern → records
    • Lingo::Config:
      multiworder → multi_worder,
      objectfilter → object_filter,
      textreader → text_reader,
      textwriter → text_writer,
      vectorfilter → vector_filter,
      wordsearcher → word_searcher
  • Lingo::Attendee::TextWriter learned format directives for ext option (currently supported are: %c = config name, %l = language name, %d = current date, %t = current time).
  • Lingo::Attendee::Sequencer remembers word form of sequences.
  • Updated and extended English system dictionary and suffix list.
  • Fixed errors with XML input (issue #15 by Thomas Berger).
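The ext format directives listed for Lingo::Attendee::TextWriter can be pictured with a small expansion routine. This is a minimal sketch, not Lingo's implementation: the helper name expand_ext and the date and time formats are assumptions made for illustration; only the four directives themselves come from the notes above.

```ruby
# Hypothetical helper (not part of Lingo): expands the four directives
# named in the release notes inside an extension template.
def expand_ext(template, config:, language:, now: Time.now)
  values = {
    'c' => config,                  # %c = config name
    'l' => language,                # %l = language name
    'd' => now.strftime('%Y%m%d'),  # %d = current date (format is an assumption)
    't' => now.strftime('%H%M%S')   # %t = current time (format is an assumption)
  }
  # Only the four known directives are matched; anything else is left as-is.
  template.gsub(/%([cldt])/) { values[$1] }
end
```

For instance, under these assumptions a template such as '%c-%l.vec' would expand to the config name and language name joined by a hyphen, followed by .vec.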

v1.8.7

15 Feb 13:02
  • Added Lingo::Attendee::LsiFilter to correlate semantically related terms
    (LSI) over the "corpus" of all files processed during a single program
    invocation; requires lsi4r which in turn requires rb-gsl. [EXPERIMENTAL:
    Interface may be changed or removed in next release.]
  • Added Lingo::Attendee::HalFilter to correlate semantically related terms
    (HAL) over individual documents; requires hal4r which in turn requires
    rb-gsl. [EXPERIMENTAL: Interface may be changed or removed in next
    release.]
  • Added Lingo::Attendee::AnalysisFilter and associated lingoctl tooling.
  • Multiword dictionaries can now identify hyphenated variants (e.g.
    automatic data-processing); set hyphenate: true in the
    dictionary config.
  • Lingo::Attendee::Tokenizer no longer considers hyphens at word edges as part
    of the word. As a consequence, Lingo::Attendee::Dehyphenizer has been
    dropped.
  • Dropped Lingo::Attendee::NonewordFilter; use Lingo::Attendee::VectorFilter
    with option lexicals: '\?' instead.
  • Lingo::Attendee::TextReader and Lingo::Attendee::TextWriter learned
    encoding option to read/write text that is not UTF-8 encoded;
    configuration files and dictionaries still need to be UTF-8, though.
  • Lingo::Attendee::TextReader and Lingo::Attendee::TextWriter learned to
    read/write Gzip-compressed files (file extension .gz or .gzip).
  • Lingo::Attendee::Sequencer learned to recognize 0 in the pattern to match
    number tokens.
  • Fixed Lingo::Attendee::TextReader to recognize BOM in input files; does not
    apply to input read from STDIN.
  • Fixed regression introduced in 1.8.6 where Lingo::Attendee::Debugger would
    no longer work immediately behind Lingo::Attendee::TextReader.
  • Fixed lingoctl copy commands when overwriting existing files.
  • Refactored Lingo::Database::Crypter into a module.
  • JRuby 9000 compatibility.
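The Gzip and BOM items above for Lingo::Attendee::TextReader describe behaviour that is easy to illustrate with Ruby's standard library. The following is a rough sketch under stated assumptions: read_text is a hypothetical helper, not Lingo's API, and it only handles UTF-8 input.

```ruby
require 'zlib'

# Hypothetical helper (not part of Lingo): reads a text file, decompressing
# it when the name ends in .gz or .gzip, and drops a leading UTF-8 BOM.
def read_text(path)
  data =
    if path.end_with?('.gz', '.gzip')
      Zlib::GzipReader.open(path) { |gz| gz.read }
    else
      File.binread(path)
    end

  # Treat the bytes as UTF-8 and strip the byte order mark, if any.
  data.force_encoding(Encoding::UTF_8).delete_prefix("\uFEFF")
end
```

Deciding by file extension keeps the caller unaware of compression; input from STDIN has no name to inspect, which fits the note that BOM handling does not apply there.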

v1.8.6

09 Feb 10:29
  • Lingo::Attendee::VectorFilter learned pos option to print position and
    byte offset with each word.
  • Lingo::Attendee::VectorFilter learned tfidf option to sort results based
    on their tf–idf score; the document
    frequencies are calculated over the "corpus" of all files processed during
    a single program invocation.
  • Lingo::Attendee::VectorFilter learned tokens option to filter on
    Lingo::Language::Token in addition to Lingo::Language::Word.
  • Lingo::Attendee::VectorFilter no longer supports debug (as well as
    prompt and preamble); use Lingo::Attendee::DebugFilter instead.
  • Lingo::Attendee::TextReader no longer removes line endings; option chomp
    is obsolete.
  • Lingo::Attendee::TextReader passes byte offset to the following attendee.
  • Lingo::Attendee::Tokenizer records token's byte offset.
  • Lingo::Attendee::Tokenizer records token's sequence position.
  • Lingo::Attendee::Tokenizer learned skip-tags option to skip over
    specified tags' contents.
  • Lingo::Attendee subclasses warn when invalid or obsolete options or names
    are used.
  • Changed German infix substitution /en to ch/chen in order to prevent
    overly aggressive identifications.
  • Internal refactoring and API changes.
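To make the tfidf item above concrete, here is a minimal sketch of a textbook tf–idf ranking in which document frequencies are taken over all documents seen in one run. It is an illustration only: the function name tfidf and the exact weighting (relative term frequency times the natural log of N/df) are assumptions and may differ from what Lingo::Attendee::VectorFilter computes.

```ruby
# docs: array of documents, each given as an array of terms.
# Returns, per document, [term, score] pairs sorted by descending score.
def tfidf(docs)
  n = docs.size.to_f

  # Document frequency: in how many documents does each term occur?
  df = Hash.new(0)
  docs.each { |terms| terms.uniq.each { |t| df[t] += 1 } }

  docs.map do |terms|
    counts = terms.each_with_object(Hash.new(0)) { |t, h| h[t] += 1 }

    counts
      .map { |t, c| [t, (c.to_f / terms.size) * Math.log(n / df[t])] }
      .sort_by { |t, score| [-score, t] }  # ties broken alphabetically
  end
end
```

A term that occurs in every document gets a score of zero under this weighting, so it sinks to the bottom of each ranking.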

v1.8.5

02 Oct 13:33
  • Dictionary values (projections) are no longer sorted; hence, order of
    definition affects processing.
  • Lexicals in Lingo::Language::Word are no longer sorted; in particular,
    compound parts keep their original order.
  • Lexicals in Lingo::Language::Word are no longer cleaned from duplicates.
  • Compiled dictionaries are updated whenever the Lingo version or their
    configuration changes, not only when the source file's size or modification
    time changes.
  • Lingo::Attendee::Synonymer learned compound-parts option to also
    generate synonyms for compound parts when set to true.
  • Lingo::Attendee::TextReader learned better PDF-to-text conversion using the
    pdftotext command; specify filter: pdftotext in the config.
  • Lingo::Attendee::VectorFilter learned dict option to print words in
    dictionary format (viz. Lingo::Database::Source::WordClass).
  • Lingo::Attendee::VectorFilter learned preamble option to print current
    configuration to the beginning of the log file (debug: 'true');
    set preamble: false to disable.
  • Multiword dictionaries compiled from base forms can now generate inflected
    adjectives based on the gender of the head noun; set inflect: true
    in the dictionary config.
  • Lingo::Database::Source::WordClass supports gender information being encoded
    in the dictionary as well as shorthand notation for multiple word
    classes/genders.
  • Lingo::Database::Source::WordClass supports compounds being encoded in the
    dictionary (appending + to their parts' word classes is
    recommended).
  • Lingo::Database::Source removes leading and trailing whitespace from
    dictionary lines.
  • Lingo::Database::Crypter uses OpenSSL to encrypt/decrypt dictionaries.
    Note: Can't decrypt dictionaries encrypted with the old scheme anymore.
  • Lingo::Attendee::Tokenizer learned subset of MediaWiki syntax.
  • Eliminated pathological behaviour of the URLS rule in
    Lingo::Attendee::Tokenizer.
  • Fixed regression introduced in 1.8.2 where combine: all would no
    longer work in Lingo::Attendee::MultiWorder.
  • Updated and extended Russian dictionaries. (Yulia Dorokhova, Thomas Müller)
  • lingoctl no longer overwrites existing files without confirmation.
  • lingoctl learned archive command.
  • Dictionary cleanup.
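The rule above for when compiled dictionaries are updated (Lingo version, configuration, source size, modification time) can be sketched as a fingerprint comparison. This is one possible design under assumptions, not Lingo's actual mechanism; dictionary_fingerprint and stale? are hypothetical names.

```ruby
require 'digest'

# Hypothetical helper (not Lingo's code): derives a fingerprint from
# everything that should invalidate a compiled dictionary.
def dictionary_fingerprint(source_path, version:, config:)
  stat = File.stat(source_path)

  parts = [
    version,                                   # library version
    config.sort_by { |k, _| k.to_s }.inspect,  # configuration, independent of key order
    stat.size,                                 # source file size
    stat.mtime.to_f                            # source file modification time
  ]

  Digest::SHA256.hexdigest(parts.join("\n"))
end

# A compiled dictionary is stale when its stored fingerprint no longer matches.
def stale?(stored_fingerprint, source_path, version:, config:)
  stored_fingerprint != dictionary_fingerprint(source_path, version: version, config: config)
end
```

Folding all four inputs into one digest means a single stored value is enough to decide whether recompilation is needed.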

v1.8.4

16 Sep 08:29
  • Lingo::Attendee::Sequencer accepts regular expression patterns.
  • Lingo::Attendee::Sequencer substitutes 0 in the format string for the
    matched pattern.
  • Lingo::Attendee::NonewordFilter learned dict option to print nonewords
    in dictionary format.
  • Added progress reporting to Lingo::Attendee::TextReader for STDIN.
  • lingoctl demo reports successful initialization.
  • Russian localization for Lingo::Web. (Yulia Dorokhova, Thomas Müller)
  • Lingo::Web learned parameter hl to set UI language.
  • Lingo::Web displays the configuration in use.
  • Lingo::Srv accepts array of query strings in addition to single query
    string.
  • Meeting config takes precedence over language config.
  • When dictionary entries are rejected during conversion, the location of the
    reject file will be shown.
  • LIR record number defaults to match string in absence of capture group.
  • Optionally prevent Lingo from sorting any results by setting the
    LINGO_NO_SORT environment variable.
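The LIR record number fallback mentioned above (use the whole match when the pattern has no capture group) amounts to a one-line rule. The sketch below assumes a hypothetical helper record_id and illustrative patterns; it does not reproduce Lingo's record handling.

```ruby
# Hypothetical helper (not part of Lingo): returns the record number for a
# line, or nil when the line does not match the record pattern at all.
def record_id(line, pattern)
  m = pattern.match(line)
  return nil unless m

  # First capture group if there is one, otherwise the entire match.
  m[1] || m[0]
end
```

Note that in this sketch a capture group that exists but did not take part in the match also falls back to the whole match.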

v1.8.3

16 Sep 08:31
Compare
Choose a tag to compare
  • Fixed regression introduced in 1.8.2 where reading input from STDIN was no
    longer possible.
  • Fixed regression introduced in 1.8.2 where Lingo would no longer run on Ruby
    1.9.2.
  • Fixed length limit handling for multibyte characters in SDBM store.
  • Fixed encoding issue in SDBM store.
  • Fixed issue with BOM in config files.
  • Modified character handling to accept any Unicode letter (Alphabetic)
    and digit (Decimal Number).
  • Modified Lingo::Attendee::Tokenizer to use only hard-coded tokenization
    rules.
  • Modified Lingo::Attendee::VectorFilter option lexicals to be
    case-sensitive.
  • Improved overall performance and memory usage; Lingo::Attendee::Sequencer
    changed the order in which sequences are inserted into the stream.
  • Eliminated performance penalty caused by Lingo::Attendee::Abbreviator.
  • Added Russian language support. (Yulia Dorokhova, Thomas Müller)
  • Added fields option to Lingo::Attendee::TextReader to cut off field
    labels; defaults to true in record (LIR) mode.
  • Added skip option to Lingo::Attendee::TextReader to skip lines matching
    the given pattern.
  • Added src option to Lingo::Attendee::VectorFilter to print "source" part
    of compounds.
  • Added lingosrv and lingoweb executables. The former provides a simple
    HTTP endpoint with JSON output; the latter serves a demo web interface.
  • Refactored internal caching.
  • Made dependency on Ruby version >= 1.9.2 explicit.
  • Removed reporting facility (options --perfmon and --status).
  • Added --profile option to collect profiling information while running.
  • Deprecated Lingo::Language::Grammar option compositum (now compound),
    Lingo::Config option textreader (now text_reader), and
    Lingo::Attendee::TextReader option lir-record-pattern (now records);
    they will be removed in Lingo 1.9.
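The character-handling change above (any Unicode letter or decimal digit) corresponds to Unicode property classes that Ruby's regular expressions expose directly. As a minimal sketch, with WORD_CHARS and word_runs being hypothetical names rather than Lingo's tokenization rules:

```ruby
# \p{Alpha} matches characters with the Unicode Alphabetic property,
# \p{Nd} matches the Decimal Number general category.
WORD_CHARS = /[\p{Alpha}\p{Nd}]+/

# Hypothetical helper: returns the runs of letters and digits in a text.
def word_runs(text)
  text.scan(WORD_CHARS)
end
```

Because the classes are property-based, Latin, Cyrillic and other scripts are handled by the same expression without per-language character lists.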

v1.8.2

16 Sep 08:32
  • Reduced memory usage of Lingo::Attendee::VectorFilter (as well as
    Lingo::Attendee::NonewordFilter) when sorting is disabled; set
    sort: false in the config.
  • Added Lingo::Attendee::Stemmer (implementing Porter's algorithm for suffix
    stripping).
  • Added progress reporting to Lingo::Attendee::TextReader; set
    progress: true in the config.
  • Added directory and glob processing to Lingo::Attendee::TextReader (new
    options glob and recursive).
  • Renamed Lingo::Attendee::TextReader option lir-record-pattern to
    records.
  • Fixed Lingo::Attendee::Debugger to forward all objects so it can be
    inserted between any two attendees in the config.
  • Fixed regression introduced in 1.8.0 where Lingo would not use existing
    compiled dictionary when source file is not present.
  • Fixed "invalid byte sequence in UTF-8" on Windows for SDBM store.
  • Enabled pluggable (compiled) dictionaries and storage backends.
  • Extensive internal refactoring and cleanup. (Finished for now.)
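Directory and glob processing, as added to Lingo::Attendee::TextReader above, can be approximated with Dir.glob. This is a sketch under assumptions: input_files is a hypothetical helper, and the default pattern '*.txt' is chosen for illustration only; the option names glob and recursive are taken from the notes.

```ruby
# Hypothetical helper (not part of Lingo): expands a path into the list of
# files to process. A plain file is returned as-is; a directory is searched
# with the given glob, optionally descending into subdirectories.
def input_files(path, glob: '*.txt', recursive: false)
  return [path] unless File.directory?(path)

  pattern = recursive ? File.join(path, '**', glob) : File.join(path, glob)

  # Keep regular files only and return them in a stable order.
  Dir.glob(pattern).select { |f| File.file?(f) }.sort
end
```

The '**' component matches zero or more directory levels, so the recursive variant still includes files directly inside the given directory.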