
Releases: quanteda/quanteda

CRAN v4.0.2

25 Apr 17:40

Minor fixes:

  • A failing test caused by C++ code related to fcm() and how tokens objects are re-indexed.

  • An undeclared package ‘quanteda.textstats’ in Rd xrefs.

CRAN v4.0.1

09 Apr 07:52

Fixed:

  • A failing test caused by the ever-shifting behaviour of the Matrix package and R-devel on r-devel-linux-x86_64-debian-clang and r-devel-linux-x86_64-debian-gcc.

  • An undeclared package ‘quanteda.textstats’ in Rd xrefs.

  • An installation failure on r-devel-linux-x86_64-fedora-gcc due to searching for TBB in all the wrong places.

CRAN v4.0

04 Apr 16:45

quanteda 4.0.0

Changes and additions

  • Introduces tokens_xptr objects, which extend tokens objects with external pointers for greater efficiency. Once tokens objects are converted to tokens_xptr objects using as.tokens_xptr(), the tokens_*.tokens_xptr() methods are called automatically. (See the combined sketch at the end of this list.)

  • Improved the C++ functions to allow users to change the number of threads for parallel computing more flexibly using quanteda_options(). The value of threads can be changed in the middle of an analysis pipeline.

  • Makes "word4" the default (word) tokeniser, with improved efficiency, language handling, and customisation options.

  • Replaced all occurrences of the magrittr %>% pipe with the R pipe |> introduced in R 4.1, although the %>% pipe is still re-exported and therefore available to all users of quanteda without loading any additional packages.

  • Added min_ntoken and max_ntoken to tokens_subset() and dfm_subset() to make it easy to extract documents based on the number of tokens. This is equivalent to selecting documents using ntoken().

  • Added a new argument apply_if that allows a tokens-based operation to apply only to documents that meet a logical condition. This argument has been added to tokens_select(), tokens_compound(), tokens_replace(), tokens_split(), and tokens_lookup(). This is similar to applying purrr::map_if() to a tokens object, but is implemented within the function so that it can be performed efficiently in C++.

  • Added new arguments append_key, separator and concatenator to tokens_lookup(). These allow tokens matched by dictionary values to be retained with their keys appended to them, separated by separator. The addition of the concatenator argument allows additional control at the lookup stage for tokens that will be concatenated from having matched multi-word dictionary values. (#2324)

  • Added a new argument remove_padding to ntoken() and ntype() that allows pads left over from tokens_remove(x, padding = TRUE) to be excluded from the counts. Because pads are counted by default, ntype() now returns a different number of types than before when pads exist. (#2336)

  • Removed the dependency on RcppParallel to improve the stability of the C++ code. This change requires users of Linux-like operating systems to install the Intel TBB library manually to enable parallel computing.
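A minimal sketch combining several of these v4.0 additions, using the built-in data_corpus_inaugural corpus; the thread counts, patterns, dictionary, and the Year condition are illustrative only, and exact results depend on your build.

```r
library(quanteda)

quanteda_options(threads = 2)                  # threads used by the parallel C++ code
toks <- tokens(data_corpus_inaugural)

## subsetting by token count, and pad-aware counting
toks_long <- tokens_subset(toks, min_ntoken = 1000)
toks_pad  <- tokens_remove(toks, stopwords("en"), padding = TRUE)
ntoken(toks_pad)[1:3]                          # pads are counted by default
ntoken(toks_pad, remove_padding = TRUE)[1:3]   # pads excluded

## external-pointer tokens: tokens_*() methods dispatch on tokens_xptr automatically
xtoks <- as.tokens_xptr(toks)
xtoks <- tokens_select(xtoks, stopwords("en"), selection = "remove",
                       apply_if = docvars(toks, "Year") >= 1900)

## dictionary lookup that keeps matched tokens with their keys appended
dict  <- dictionary(list(liberty = c("freedom", "liberty")))
dfmat <- dfm(tokens_lookup(xtoks, dict, append_key = TRUE, separator = "/"))

quanteda_options(threads = 4)                  # the thread count can change mid-pipeline
```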

Removals

  • bootstrap_dfm() was removed for character and corpus objects. The correct way to bootstrap sentences is now to tokenise them as sentences and then bootstrap them from the dfm. This is consistent with requiring the user to tokenise objects prior to forming dfms or other "downstream" objects.

  • dfm() no longer works on character or corpus objects, only on tokens or other dfm objects. This was deprecated in v3 and removed in v4.

  • Very old dfm() arguments that were no longer documented but still worked with warnings (such as stem = TRUE) are removed.

  • Deprecated or renamed arguments formerly passed to tokens() that mapped to the v3 arguments with a warning are removed.

  • Methods for readtext objects are removed, since these are data.frame objects that are straightforward to convert into a corpus object.

  • topfeatures() no longer works on an fcm object. (#2141)

Deprecations

  • Some on-the-fly calculations applied to character or corpus objects that require a temporary tokenisation are now deprecated. This includes:

    • nsentence() -- use lengths(tokens(x, what = "sentence")) instead;
    • ntype() -- use ntype(tokens(x)) instead;
    • ntoken() -- use ntoken(tokens(x)) instead; and
    • char_ngrams() -- use tokens_ngrams(tokens(x)) instead.
  • corpus.kwic() is deprecated, with the suggestion to form a corpus using tokens_select(x, window = ...) instead.

CRAN v3.3.0

07 Apr 22:48

Changes and additions

  • Implements a "word4" tokeniser based on new RBBI (RuleBasedBreakIterator) rules, implemented in a new .yml file whose defaults represent a significant improvement in pattern handling for words, sentences, and other forms of patterns. These rules are customised from the ICU break rules, with the standard and customised rules now found in the breakrules/ system folder, so that they can, in principle, be edited and changed by users. (See the sketch after this list.)

  • Other minor changes:

    • changes how elapsed time is recorded, by creating a global environment to record these in (aaa.R)
    • improves several of the R-coded patterns that apply to "word2":
      • the hashtag pattern (pattern_hashtag)
      • the separator pattern (by adding \\p{M})
      • the URL pattern
    • creates a new tokens_restore(), implemented in C++, to replace the older preserve_special() that rejoined splits created by the default stringi tokeniser machinery.
    • makes some technical improvements to internal tokenisation functions, such as moving the ellipsis (...) argument to the end of the function signature, to allow more modularity in developing future tokenisers.
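As a rough illustration of where these editable break rules live (a sketch assuming a standard installation, with the folder name taken from the note above):

```r
library(quanteda)
## list the RBBI break-rule files shipped in the package's breakrules/ folder
list.files(system.file("breakrules", package = "quanteda"))
```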

Bug fixes and stability enhancements

  • dfm_group() now works correctly with an empty dfm (#2225).
  • convert(x, to = "stm") is no longer vulnerable to large numbers of removed features, as in #2189.

CRAN v3.2.4

12 Dec 10:57

Fixes test failures caused by recent changes to Matrix package behaviours.

CRAN v3.2.3

29 Aug 08:01

Bug fixes and stability enhancements

  • Matrix package calls updated for compatibility with Matrix 1.4.2. (#2182)
  • Changes to C++ code for fcm() to prevent some (chance) errors downstream in LSX. (#2181)

CRAN v3.2.2

09 Aug 09:18

Bug fixes and stability enhancements

  • fcm() computes the marginal frequency of upper-case tokens correctly (#2176).
  • tokens_chunk() keeps all the docid, including those of empty documents, in the original object.
  • tokens_select() recycles values when the length of startpos or endpos is less than ndoc(x).
  • tokens_lookup() and dfm_lookup() can apply very large dictionaries (more than 100,000 keys).

CRAN v3.2.0

01 Dec 09:59

Bug fixes and stability enhancements

  • dfm() returns a dfm with identical column order even if tokens_compound() or tokens_ngrams() is used upstream (#2100).
  • dfm_group() with NA values in a grouping variable now drops those, similar to the behaviour of tokens_group() and corpus_group() (#2134).

Changes and additions

  • char_wordstem() now has a new argument, check_whitespace, which allows it not to throw an error when lower-casing text containing a whitespace character.
  • dfm_remove() now has a new argument padding = FALSE that, when TRUE, collects the counts of the removed features in the first column. This produces results consistent with a dfm built from tokens from which some tokens have been removed with padding = TRUE (#2152).
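A minimal sketch of the new padding behaviour, using the built-in inaugural corpus; the pattern is illustrative:

```r
library(quanteda)
dfmat <- dfm(tokens(data_corpus_inaugural))
## counts of the removed features are collected in the first column
dfm_remove(dfmat, pattern = stopwords("en"), padding = TRUE)
```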

CRAN v3.1.0

17 Aug 16:46

Bug fixes and stability enhancements

  • Improved and more consistent handling of empty corpus, tokens and dfm objects, to address #2110.
  • rbind.dfm() now preserves docvars (#2109).
  • Document name for Biden's 2021 Inaugural Address in data_corpus_inaugural is now consistent with all other documents.
  • Fixed #2127, which caused subsetting to change document names.

Changes and additions

  • phrase() now has a separator argument (#2124).

Deprecations

  • phrase() methods for tokens, collocations, and lists are deprecated in favour of as.phrase(). (#2129)

CRAN v3.0.0

06 Apr 09:08

Summary

quanteda 3.0 is a major release that improves functionality, completes the modularisation of the package begun in v2.0, further improves function consistency by removing previously deprecated functions, and enhances workflow stability and consistency by deprecating some shortcut steps built into some functions.

Changes and additions

  • Modularisation: We have now separated the textplot_*() functions from the main package into a separate package quanteda.textplots, and the textstat_*() functions from the main package into a separate package quanteda.textstats. This completes the modularisation begun in v2 with the move of the textmodel_*() functions to the separate package quanteda.textmodels. quanteda now consists of core functions for textual data processing and management.

  • The package dependency structure is now greatly reduced by eliminating some unnecessary package dependencies through modularisation, and by addressing complex downstream dependencies in packages such as stopwords. v3 should serve as a more lightweight and more consistent platform for other text analysis packages to build on.

  • We have added non-standard evaluation for the by and groups arguments to access object docvars (see the sketch at the end of this list):

    • The *_sample() functions' argument by, and groups in the *_group() functions, now take unquoted document variable (docvar) names directly, similar to the way the subset argument works in the *_subset() functions.
    • Quoted docvar names no longer work, as these will be evaluated literally.
    • The by = "document" option formerly sampled from docid(x), but this functionality is now removed. Instead, use by = docid(x) to replicate this functionality.
    • For groups, the default is now docid(x), which is now documented more completely. See ?groups and ?docid.
  • dfm() has a new argument, remove_padding, for removing the "pads" left behind after removing tokens with padding = TRUE. (For other extensive changes to dfm(), see "Deprecated" below.)

  • tokens_group(), formerly internal-only, is now exported.

  • corpus_sample(), dfm_sample(), and tokens_sample() now work consistently (#2023).

  • The kwic() return object structure has been redefined, and is now built with an option to use a new function index() that returns token spans following a pattern search (see the sketch at the end of this list). (#2045 and #2065)

  • The punctuation regular expression and that for matching social media usernames have now been redefined so that the valid Twitter username @_ is now counted as a "tag" rather than as "punctuation". (#2049)

  • The data object data_corpus_inaugural has been updated to include the Biden 2021 inaugural address.

  • A new system of validators for input types now provides better argument type and value checking, with more consistent error messages for invalid types or values.

  • Upon startup, we now message the console with the Unicode and ICU version information. Because we removed our redefinition of View() (see below), the former conflict warning is now gone.

  • as.character.corpus() now has a use.names = TRUE argument, similar to as.character.tokens() (but with a different default value).
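A brief sketch of the non-standard evaluation for groups/by and the new index() function, using the built-in inaugural corpus (Party is one of its docvars; the pattern and sizes are illustrative):

```r
library(quanteda)
toks  <- tokens(data_corpus_inaugural)
dfmat <- dfm(toks, remove_padding = TRUE)

dfm_group(dfmat, groups = Party)          # unquoted docvar name
dfm_sample(dfmat, size = 1, by = Party)   # sample within each party
index(toks, pattern = "govern*")          # token spans matching a pattern
```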

Deprecations

The main potentially breaking changes in version 3 relate to the deprecation or
elimination of shortcut steps that allowed functions that required tokens inputs
to skip the tokens creation step. We did this to require users to take more
direct control of tokenization options, or to substitute the alternative
tokeniser of their choice (and then coercing it to tokens via as.tokens()).
This also allows our function behaviour to be more consistent, with each
function performing a single task, rather than combining functions (such as
tokenisation and constructing a matrix).

The most common example involves constructing a dfm directly from a character
or corpus object. Formerly, this would construct a tokens object internally
before creating the dfm, and allowed passing arguments to tokens() via ....
This is now deprecated, although still functional with a warning.

We strongly encourage either creating a tokens object first, or piping the
tokens return to dfm() using %>%. (See examples below.)
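For example (a sketch; remove_punct is just one of the tokens() options that might be applied):

```r
library(quanteda)

## create the tokens object explicitly, then construct the dfm
toks  <- tokens(data_corpus_inaugural, remove_punct = TRUE)
dfmat <- dfm(toks)

## or pipe the tokens return to dfm()
dfmat <- data_corpus_inaugural %>%
  tokens(remove_punct = TRUE) %>%
  dfm()
```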

We have also deprecated direct character or corpus inputs to kwic(), since
this also requires a tokenised input.

The full listing of deprecations is:

  • dfm.character() and dfm.corpus() are deprecated. Users should create a tokens object first, and input that to dfm().

  • dfm(): As of version 3, only tokens objects are supported as inputs to dfm(). Calling dfm() for character or corpus objects is still functional, but issues a warning. Convenience passing of arguments to tokens() via ... for dfm() is also deprecated, but undocumented, and functions only with a warning. Users should now create a tokens object (using tokens()) from character or corpus inputs before calling dfm().

  • kwic(): As of version 3, only tokens objects are supported as inputs to kwic(). Calling kwic() for character or corpus objects is still functional, but issues a warning. Passing arguments to tokens() via ... in kwic() is now disabled. Users should now create a tokens object (using tokens()) from character or corpus inputs before calling kwic().

  • Shortcut arguments to dfm() are now deprecated. These are still active, with a warning, although they are no longer documented. These are:

    • stem -- use tokens_wordstem() or dfm_wordstem() instead.
    • select, remove -- use tokens_select() / dfm_select() or tokens_remove() / dfm_remove() instead.
    • dictionary, thesaurus -- use tokens_lookup() or dfm_lookup() instead.
    • valuetype, case_insensitive -- these are disabled; for the deprecated arguments that take these qualifiers, they are fixed to the defaults "glob" and TRUE.
    • groups -- use tokens_group() or dfm_group() instead.
  • texts() and texts<- are deprecated.

    • Use as.character.corpus() to turn a corpus into a simple named character vector.
    • Use corpus_group() instead of texts(x, groups = ...) to aggregate texts by a grouping variable.
    • Use [<- instead of texts()<- for replacing texts in a corpus object.

Removals

  • See note above under "Changes" about the textplot_*() and textstat_*() functions.

  • The following functions have been removed:

    • all methods and legacy functions for the defunct "corpuszip" corpus variant
    • View() functions
    • as.wfm() and as.DocumentTermMatrix() (the same functionality is available via convert())
    • metadoc() and metacorpus()
    • corpus_trimsentences() (replaced by corpus_trim())
    • all of the tortl functions
  • dfm objects can no longer be used as a pattern in dfm_select() (formerly deprecated).

  • dfm_sample():

    • no longer has a margin argument. Instead, dfm_sample() now samples only on documents, the same as corpus_sample() and tokens_sample(); and
    • no longer works with by = "document" -- use by = docid(x) instead.
  • dictionary_edit(), char_edit(), and list_edit() are removed.

  • dfm_weight() - formerly deprecated "scheme" options are now removed.

  • tokens() - formerly deprecated options remove_hyphens and remove_twitter are now removed. (Use split_hyphens instead, and the default tokenizer always now preserves Twitter and other social media tags.)

  • Special versions of head() and tail() for corpus, dfm, and fcm objects are now removed, since the base methods work fine for these objects. The main consequence was the removal of the nf option from the methods for dfm and fcm objects, which limited the number of features. This can be accomplished using the index operator [ instead, or for printing, by specifying print(x, max_nfeat = 6L) (for instance).
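For example, roughly equivalent replacements for the removed nf behaviour (a sketch):

```r
library(quanteda)
dfmat <- dfm(tokens(data_corpus_inaugural))
dfmat[, 1:6]                     # limit features with the index operator
print(dfmat, max_nfeat = 6L)     # or limit the features shown when printing
```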

Bug fixes and stability enhancements

  • Fixed a bug causing topfeatures(x, group = something) to fail with weighted dfms (#2032).

  • kwic() is more stable and does not crash when a vector is supplied as the window argument (#2008).

  • Fixed quanteda_options() to allow the use of multi-threading with more than two threads.

  • Mentions of the now-removed ngrams option in dfm(x, ...) have now been removed from the dfm documentation. (#1990)

  • Handling of some early-cycle v2 dfm objects is improved, to ensure that they are updated to the latest object format. (#2097)