libstemmer: update to 2.1.0.
Snowball 2.1.0 (2021-01-21)
===========================

C/C++
-----

* Fix decoding of 4-byte UTF-8 sequences in `grouping` checks.  This bug
  affected Unicode codepoints U+40000 to U+7FFFF and U+C0000 to U+FFFFF and
  doesn't affect any of the stemming algorithms we currently ship (#138,
  reported by Stephane Carrez).
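
For reference, a 4-byte UTF-8 sequence carries 21 payload bits: 3 from the lead byte and 6 from each of the three continuation bytes.  The Python sketch below decodes such a sequence and checks the boundary codepoints of the ranges named above.  It is illustrative only and is not the Snowball C runtime's code; `decode_utf8_4byte` is a hypothetical helper name.

```python
def decode_utf8_4byte(seq: bytes) -> int:
    """Decode one 4-byte UTF-8 sequence to a codepoint (illustrative helper)."""
    if len(seq) != 4 or (seq[0] & 0xF8) != 0xF0:
        raise ValueError("not a 4-byte UTF-8 sequence")
    if any((b & 0xC0) != 0x80 for b in seq[1:]):
        raise ValueError("bad continuation byte")
    # 3 payload bits from the lead byte, then 6 bits from each continuation byte.
    return (
        ((seq[0] & 0x07) << 18)
        | ((seq[1] & 0x3F) << 12)
        | ((seq[2] & 0x3F) << 6)
        | (seq[3] & 0x3F)
    )


# Boundary codepoints of the two ranges named in the entry above.
AFFECTED_BOUNDS = (0x40000, 0x7FFFF, 0xC0000, 0xFFFFF)
for cp in AFFECTED_BOUNDS:
    assert decode_utf8_4byte(chr(cp).encode("utf-8")) == cp
```

Both affected ranges have bit 18 of the codepoint set, and that bit comes from the lead byte's payload.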

Python
------

* Fix snowballstemmer.algorithms() method (#132, reported by kkaiser).

* Update code to generate trove language classifiers for PyPI.  All the
  natural languages we previously had stemmers for have now been added to
  PyPI's list, but Armenian and Yiddish aren't on it.  Patch from Dmitry
  Shachnev.

Java
----

Code Quality Improvements
-------------------------

* Suppress GCC warning in compiler code.

* Use `const` pointers more in C runtime.

* Only use spaces for indentation in JavaScript code.  Change proposed by Emily
  Marigold Klassen in #123; this seems to be the modern JavaScript norm.

New Code Generators
-------------------

* Add Ada generator from Stephane Carrez (#135).

New Snowball Language Features
------------------------------

* `lenof` and `sizeof` can now be applied to a literal string, which can be
  useful if you want to do calculations on cursor values.

  This change actually simplifies the language a little, since you can now use
  a literal string in any read-only context which accepts a string variable.
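
As a rough illustration of the difference between the two operators, the Python sketch below models `lenof` as a count of characters and `sizeof` as a count of storage slots.  It assumes UTF-8 storage where one slot is one byte (as in the C backend); other backends may use a different slot size.  The two functions are stand-ins, not Snowball itself.

```python
def lenof(s: str) -> int:
    """Number of characters, modelled here as Unicode codepoints."""
    return len(s)


def sizeof(s: str) -> int:
    """Number of slots, assuming UTF-8 storage where one slot is one byte."""
    return len(s.encode("utf-8"))


word = "na\u00efve"  # the third character needs two bytes in UTF-8
print(lenof(word), sizeof(word))
```

For pure ASCII literals the two values coincide; they differ as soon as the literal contains a multi-byte character.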

Code generation improvements
----------------------------

* General:

  + Fix bugs in the code generated to handle failure of `goto`, `gopast` or
    `try` inside `setlimit` or string-`$`.  This affected all languages (though
    the issue with `try` wasn't present for C).  These bugs don't affect any of
    the stemming algorithms we currently ship.  Reported by Stefan Petkovic on
    snowball-discuss.

  + Change `hop` with a negative argument to work as documented.  The manual
    says a negative argument to hop will raise signal f, but the implementation
    for all languages was actually to move the cursor in the opposite direction
    to `hop` with a positive argument.  The implemented behaviour is
    problematic as it allows invalidating implicitly saved cursor values by
    modifying the string outside the current region, so we've decided it's best
    to fix the implementation to match the documentation.

    The only Snowball code we're aware of which relies on this was the original
    version of the new Yiddish stemming algorithm, which has been updated not
    to rely on this.

    The compiler now issues a warning for `hop` with a constant negative
    argument (internally now converted to `false`), and for `hop` with a
    constant zero argument (internally now converted to `true`).

  + Canonicalise `among` actions equivalent to `()` such as `(true)` which
    previously resulted in an extra case in the among, and for Python
    we'd generate invalid Python code (`if` or `elif` with an empty body).
    Bug revealed by Assaf Urieli's Yiddish stemmer in #137.

  + Eliminate variables whose values are never used - they no longer have
    corresponding member variables, etc, and no code is generated for any
    assignments to them.

  + Don't generate anything for an unused `grouping`.

  + Stop warning "grouping X defined but not used" for a `grouping` which is
    only used to define another `grouping`.
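
The `hop` change described above can be modelled with a small Python sketch.  It is a toy model of the documented forward-mode behaviour working on character indexes, not the generated code; the function name and the use of `None` to stand for signal f are assumptions made for illustration.

```python
from typing import Optional


def hop(cursor: int, limit: int, n: int) -> Optional[int]:
    """Toy model of forward `hop n`: return the new cursor, or None for signal f."""
    if n < 0:
        # Documented behaviour, now implemented: a negative argument fails
        # rather than moving the cursor backwards.
        return None
    if cursor + n > limit:
        # Not enough characters left before the limit.
        return None
    return cursor + n


print(hop(0, 5, 2))   # moves forward by two characters
print(hop(3, 5, -1))  # fails instead of moving backwards
```

In this model `hop` with a constant zero argument always succeeds and leaves the cursor alone, which matches the compiler now treating it as `true`.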

* C/C++:

  + Store booleans in same array as integers.  This means each boolean is
    stored as an int instead of an unsigned char which means 4 bytes instead of
    1, but we save a pointer (4 or 8 bytes) in struct SN_env which is a win for
    all the current stemmers.  For an algorithm which uses both integers and
    booleans, we also save the overhead of allocating a block on the heap, and
    potentially improve data locality.

  + Eliminate duplicate generated C comment for sliceto.

* Pascal:

  + Avoid generating unused variables.  The Pascal code generated for the
    stemmers we ship is now warning free (tested with fpc 3.2.0).

* Python:

  + End `if`-chain with `else` where possible, avoiding a redundant test
    of the variable being switched on.  This optimisation kicks in for an
    `among` where all cases have commands.  This change seems to speed up `make
    check_python_arabic` by a few percent.
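
The Python `if`-chain optimisation can be pictured as follows.  This is a hand-written sketch of the shape of the code, not actual generator output; `among_var` and the returned strings are placeholders.

```python
def dispatch_before(among_var: int) -> str:
    # Every branch tests among_var, including the last one.
    if among_var == 1:
        return "action 1"
    elif among_var == 2:
        return "action 2"
    elif among_var == 3:
        return "action 3"
    return "no action"


def dispatch_after(among_var: int) -> str:
    # When every case has a command, the final test is redundant,
    # so the chain can end with a plain `else`.
    if among_var == 1:
        return "action 1"
    elif among_var == 2:
        return "action 2"
    else:
        return "action 3"


for value in (1, 2, 3):
    assert dispatch_before(value) == dispatch_after(value)
```

The two forms agree only because `among_var` is known to take one of the listed values, which is why the optimisation is limited to an `among` where all cases have commands.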

New stemming algorithms
-----------------------

* Add Serbian stemmer from stef4np (#113).

* Add Yiddish stemmer from Assaf Urieli (#137).

* Add Armenian stemmer from Astghik Mkrtchyan.  It's been on the website for
  over a decade, and included in Xapian for over 9 years without any negative
  feedback.

Behavioural changes to existing algorithms
------------------------------------------

Optimisations to existing algorithms
------------------------------------

* kraaij_pohlmann: Use `$v = limit` instead of `do (tolimit setmark v)` since
  this generates simpler code, and also matches the code other algorithm
  implementations use.

  Probably for languages like C with optimising compilers the compiler
  will generate equivalent code anyway, but e.g. for Python this should be
  an improvement.
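
To see why the two forms are equivalent, the Python sketch below models both on a minimal stemmer state.  The `Env` class and function names are hypothetical and only the cursor, limit and `v` are modelled.

```python
class Env:
    """Minimal stand-in for stemmer state (hypothetical, for illustration)."""

    def __init__(self, text: str) -> None:
        self.cursor = 0
        self.limit = len(text)
        self.v = 0


def via_tolimit_setmark(env: Env) -> None:
    saved = env.cursor      # `do` remembers the cursor
    env.cursor = env.limit  # `tolimit` moves the cursor to the limit
    env.v = env.cursor      # `setmark v` records the cursor in v
    env.cursor = saved      # `do` restores the cursor afterwards


def via_assignment(env: Env) -> None:
    env.v = env.limit       # `$v = limit`


a, b = Env("fietsen"), Env("fietsen")
via_tolimit_setmark(a)
via_assignment(b)
assert (a.cursor, a.v) == (b.cursor, b.v)
```

The assignment form reaches the same state with one store instead of a save, a move, a store and a restore.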

Code clarity improvements to existing algorithms
------------------------------------------------

* hindi.sbl: Fix comment typo.

Compiler
--------

* Don't count `$x = x + 1` as initialising or using `x`, so it's now handled
  like `$x += 1` already is.

* Comments are now only included in the generated code if the command line
  option `-comments` is specified.

  The comments in the generated code are useful if you're trying to debug the
  compiler, and perhaps also if you are trying to debug your Snowball code, but
  for everyone else they just bloat the code, which becomes more of an issue as
  the number of languages we support grows.

* `-parentclassname` is not only for java and csharp so don't disable it if
  those backends are disabled.

* `-syntax` now reports the value for each numeric literal.

* Report location for excessive get nesting error.

* Internally the compiler now represents negated literal numbers as a simple
  `c_number` rather than `c_neg` applied to a `c_number` with a positive value.
  This simplifies optimisations that want to check for a constant numeric
  expression.
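
The effect of the new representation on constant checks can be sketched in Python.  The node kinds `c_number` and `c_neg` are taken from the entry above; the `Node` class and both functions are hypothetical and do not mirror the compiler's actual data structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    kind: str                         # "c_number" or "c_neg"
    value: Optional[int] = None       # set for c_number
    operand: Optional["Node"] = None  # set for c_neg


def fold_negation(node: Node) -> Node:
    """Rewrite c_neg(c_number n) as c_number(-n); leave other nodes alone."""
    operand = node.operand
    if node.kind == "c_neg" and operand is not None and operand.kind == "c_number":
        return Node("c_number", value=-operand.value)
    return node


def constant_value(node: Node) -> Optional[int]:
    """With negation folded, a constant check is a single kind test."""
    return node.value if node.kind == "c_number" else None


folded = fold_negation(Node("c_neg", operand=Node("c_number", value=3)))
assert constant_value(folded) == -3
```

Without the folding step, `constant_value` would need a second case for `c_neg` wrapping a `c_number`.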

Build system
------------

* Link binaries with LDFLAGS if it's set, which is needed for some platforms
  (e.g. OpenEmbedded).  Patch from Andreas Müller (#120).

* Add missing dependencies of algorithms.go rule.

Testsuite
---------

* C: Add stemtest for low-level regression tests.

Documentation
-------------

* Document a C99 compiler as a requirement for building the snowball compiler
  (but the C code it generates should still work with any ISO C compiler).

  A few declarations mixed with code crept in some time ago (which nobody's
  complained about), so this is really just formally documenting a requirement
  which already existed.

* README: Explain what Snowball is and what Stemming is (#131, reported by Sean
  Kelly).

* CONTRIBUTING.rst: Expand section on adding a new generator.

* For Python snowballstemmer module include global NEWS instead of
  Python-specific CHANGES.rst and use README.rst as the long description.
  Patch from Dmitry Shachnev (#119).

* COPYING: Update and incorporate Python backend licensing information which
  was previously in a separate file.
wiz committed Feb 18, 2021
1 parent 9ad6e2c commit 7b0c3c6
Showing 3 changed files with 14 additions and 27 deletions.
7 changes: 3 additions & 4 deletions textproc/libstemmer/Makefile
@@ -1,8 +1,7 @@
-# $NetBSD: Makefile,v 1.2 2020/08/31 18:11:43 wiz Exp $
+# $NetBSD: Makefile,v 1.3 2021/02/18 10:26:56 wiz Exp $
 
-DISTNAME= snowball-2.0.0
-PKGNAME= libstemmer-2.0.0
-PKGREVISION= 1
+DISTNAME= snowball-2.1.0
+PKGNAME= ${DISTNAME:S/snowball/libstemmer/}
 CATEGORIES= textproc
 MASTER_SITES= ${MASTER_SITE_GITHUB:=snowballstem/}
 GITHUB_PROJECT= snowball
12 changes: 6 additions & 6 deletions textproc/libstemmer/distinfo
@@ -1,8 +1,8 @@
-$NetBSD: distinfo,v 1.1 2020/04/14 14:07:50 ryoon Exp $
+$NetBSD: distinfo,v 1.2 2021/02/18 10:26:56 wiz Exp $
 
-SHA1 (snowball-2.0.0.tar.gz) = b152bbebca34505d963f3cfb6b726859d5b83b66
-RMD160 (snowball-2.0.0.tar.gz) = f5dc4e6caeb65120eeb36d9f45dd758a8024c881
-SHA512 (snowball-2.0.0.tar.gz) = 7da7c653d41bf03f3fb2f0b4a8963572fc97319fe44e82c1fc7882ba440e60e5947ed7fb722f7e78592d5ea862e3d733880f9f656236e40c1d5306e70a80a1b1
-Size (snowball-2.0.0.tar.gz) = 179986 bytes
-SHA1 (patch-GNUmakefile) = dc58eaec3de72fb93cf2393631b1bdc7d31be7cf
+SHA1 (snowball-2.1.0.tar.gz) = 4a4c82c1619052442bd2049f7d12c4afa752e524
+RMD160 (snowball-2.1.0.tar.gz) = ecdc9606e494447e1f85ff89076f45cec9f0a3dd
+SHA512 (snowball-2.1.0.tar.gz) = 1efd7d8ab58852987e83247048244882c517e32237c8cb3c0558b66ecfb075733ce8805ebb76041e6e7d6664c236054effe66838e7c524ee529ce869aa8134f0
+Size (snowball-2.1.0.tar.gz) = 220324 bytes
+SHA1 (patch-GNUmakefile) = 0a0c0a1760338fc55374e88b4ab853b47dc24ea0
 SHA1 (patch-libstemmer_symbol.map) = 0122f03d0ac54dae908ffd873f1ae4a6e502a56f
22 changes: 5 additions & 17 deletions textproc/libstemmer/patches/patch-GNUmakefile
@@ -1,10 +1,10 @@
-$NetBSD: patch-GNUmakefile,v 1.1 2020/04/14 14:07:50 ryoon Exp $
+$NetBSD: patch-GNUmakefile,v 1.2 2021/02/18 10:26:56 wiz Exp $
 
 * Build dynamic library, from archlinux.
 
---- GNUmakefile.orig 2019-10-02 03:27:17.000000000 +0000
+--- GNUmakefile.orig 2021-01-21 04:50:09.000000000 +0000
 +++ GNUmakefile
-@@ -151,10 +151,10 @@ C_OTHER_OBJECTS = $(C_OTHER_SOURCES:.c=.
+@@ -162,10 +162,10 @@ C_OTHER_OBJECTS = $(C_OTHER_SOURCES:.c=.
  JAVA_CLASSES = $(JAVA_SOURCES:.java=.class)
  JAVA_RUNTIME_CLASSES=$(JAVARUNTIME_SOURCES:.java=.class)

@@ -18,25 +18,13 @@ $NetBSD: patch-GNUmakefile,v 1.1 2020/04/14 14:07:50 ryoon Exp $

  clean:
  	rm -f $(COMPILER_OBJECTS) $(RUNTIME_OBJECTS) \
-@@ -179,7 +179,7 @@ clean:
- 	-rmdir $(js_output_dir)
- 
- snowball: $(COMPILER_OBJECTS)
--	$(CC) $(CFLAGS) -o $@ $^
-+	$(CC) $(CFLAGS) ${LDFLAGS} -o $@ $^
- 
- $(COMPILER_OBJECTS): $(COMPILER_HEADERS)
- 
-@@ -200,8 +200,11 @@ libstemmer/libstemmer.o: libstemmer/modu
+@@ -212,6 +212,9 @@ libstemmer/libstemmer.o: libstemmer/modu
  libstemmer.o: libstemmer/libstemmer.o $(RUNTIME_OBJECTS) $(C_LIB_OBJECTS)
  	$(AR) -cru $@ $^
  
 +libstemmer.so: libstemmer/libstemmer.o $(RUNTIME_OBJECTS) $(C_LIB_OBJECTS)
 +	$(CC) $(CFLAGS) -shared $(LDFLAGS) -Wl,-soname,libstemmer.so.0,-version-script,libstemmer/symbol.map -o $@.0.0.0 $^
 +
  stemwords: $(STEMWORDS_OBJECTS) libstemmer.o
--	$(CC) $(CFLAGS) -o $@ $^
-+	$(CC) $(CFLAGS) ${LDFLAGS} -o $@ $^
+ 	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ $^
  
- csharp_stemwords: $(CSHARP_STEMWORDS_SOURCES) $(CSHARP_RUNTIME_SOURCES) $(CSHARP_SOURCES)
- 	$(MCS) -unsafe -target:exe -out:$@ $(CSHARP_STEMWORDS_SOURCES) $(CSHARP_RUNTIME_SOURCES) $(CSHARP_SOURCES)
