Releases: quanteda/quanteda

CRAN v1.4.0

30 Jan 22:04

Bug fixes and stability enhancements

  • Fixed bug in dfm_compress() and dfm_group() that changed or deleted docvars attributes of dfm objects (#1506).
  • Fixed a bug in textplot_xray() that caused incorrect facet labels when a pattern contained multiple list elements or values (#1514).
  • kwic() now correctly returns the pattern associated with each match as the "keywords" attribute, for all pattern types (#1515).
  • Implemented some improvements in efficiency and computation of unusual edge cases for textstat_simil() and textstat_dist().

New features

  • textstat_lexdiv() now works on tokens objects, not just dfm objects. New methods of lexical diversity now include MATTR (the Moving-Average Type-Token Ratio, Covington & McFall 2010) and MSTTR (Mean Segmental Type-Token Ratio).
  • New function tokens_split() allows splitting a single token into multiple tokens based on a pattern match. (#1500)
  • New function tokens_chunk() allows splitting tokens into new documents of equally-sized "chunks". (#1520)
  • New function textstat_entropy() computes entropy for a dfm across feature or document margins.
  • The documentation for textstat_readability() is vastly improved, now detailing all formulas and providing full references.
  • New function dfm_match() allows a user to specify the features in a dfm according to a fixed vector of feature names, including those of another dfm. Replaces dfm_select(x, pattern) where pattern was a dfm.
  • A new argument vertex_labelsize added to textplot_network() to allow more precise control of label sizes, either globally or individually.
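
The MATTR and MSTTR measures named in the list above can be illustrated outside of R. The following is a minimal Python sketch of the two definitions, not quanteda's implementation: the function names, the `window` and `segment` parameters, and the fallback to a plain type-token ratio for inputs shorter than one window or segment are all assumptions made for illustration.

```python
from typing import Sequence


def ttr(tokens: Sequence[str]) -> float:
    """Type-token ratio: number of distinct tokens divided by total tokens."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    return len(set(tokens)) / len(tokens)


def mattr(tokens: Sequence[str], window: int) -> float:
    """Moving-Average TTR: mean TTR over every run of `window` consecutive tokens."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(tokens) <= window:
        # Assumption: fall back to the plain TTR when the text fits in one window.
        return ttr(tokens)
    ratios = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)


def msttr(tokens: Sequence[str], segment: int) -> float:
    """Mean Segmental TTR: mean TTR over consecutive, non-overlapping segments.

    Trailing tokens that do not fill a whole segment are discarded.
    """
    if segment < 1:
        raise ValueError("segment must be at least 1")
    n_segments = len(tokens) // segment
    if n_segments == 0:
        # Assumption: fall back to the plain TTR when no full segment exists.
        return ttr(tokens)
    ratios = [ttr(tokens[k * segment:(k + 1) * segment]) for k in range(n_segments)]
    return sum(ratios) / len(ratios)
```

The difference between the two is only in how the windows are laid out: MATTR slides one token at a time, so every token except those near the edges contributes to several windows, while MSTTR partitions the text so each token is counted in at most one segment.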

Behaviour changes

  • tokens.tokens(x, remove_hyphens = TRUE), where x was generated with remove_hyphens = FALSE, now handles the tokens in a similar way to tokens.character(x, remove_hyphens = TRUE) applied to the original character input. (#1498)

CRAN v1.3.14

19 Nov 20:01

quanteda v.1.3.14

Bug fixes and stability enhancements

  • Improved the robustness of textstat_keyness() (#1482).
  • Improved the accuracy of sparsity reporting for the print method of a dfm (#1473).

New Features

  • Added the following measures to textstat_lexdiv(): Yule's K, Simpson's D, and Herdan's Vm.
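
For readers unfamiliar with these measures, here is a minimal Python sketch of two of them, Yule's K and Simpson's D, written from their standard textbook definitions. It is not quanteda's code; the function names are hypothetical, and Herdan's Vm is left out.

```python
from collections import Counter
from typing import Sequence


def yules_k(tokens: Sequence[str]) -> float:
    """Yule's K: 10^4 * (sum of squared type frequencies - N) / N^2."""
    n = len(tokens)
    if n == 0:
        raise ValueError("tokens must not be empty")
    freqs = Counter(tokens).values()
    return 1e4 * (sum(f * f for f in freqs) - n) / (n * n)


def simpsons_d(tokens: Sequence[str]) -> float:
    """Simpson's D: chance that two tokens drawn without replacement share a type."""
    n = len(tokens)
    if n < 2:
        raise ValueError("at least two tokens are required")
    freqs = Counter(tokens).values()
    return sum(f * (f - 1) for f in freqs) / (n * (n - 1))
```

Both measures grow as a text becomes more repetitive, which is the opposite direction from a type-token ratio, so higher values indicate lower lexical diversity.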

CRAN v1.3.13

01 Nov 21:25

Bug fixes and stability enhancements

  • Fixed a bug causing incorrect counting in fcm(x, ordered = TRUE). (#1413) Also set the condition that window can be of size 1 (formerly the limit was 2 or greater).
  • Fixed deprecation warnings from adding a dfm as docvars; this now imports the feature names as docvar names automatically. (related to #1417)
  • Fixed behaviour from tokens(x, what = "fasterword", remove_separators = TRUE) so that it correctly splits words separated by \n and \t characters. (#1420)
  • Add error checking for functions taking dfm inputs in case a dfm has empty features (#1419).
  • For textstat_readability(), fixed a bug in Dale-Chall-based measures and in the Spache word list measure. These were caused by an incorrect lookup mechanism but also by limited implementation of the wordlists. The new wordlists include all of the variations called for in the original measures, but using fast fixed matching. (#1410)
  • Fixed problems with basic dfm operations (rowMeans(), rowSums(), colMeans(), colSums()) caused by not having access to the Matrix package methods. (#1428)
  • Fixed a problem in textplot_scale1d() when the input is a predicted wordscores object with se.fit = TRUE (#1440).
  • Improved the stability of textplot_network(). (#1460)
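
To make the fcm() item above more concrete, the sketch below shows what ordered co-occurrence counting within a window can look like, including the window size of 1 that this release permits. It is a Python illustration under a stated assumption, namely that "ordered" means a pair is counted only when the first token precedes the second. The function name is hypothetical, and quanteda's actual counting rules (for example weighting, or its treatment of repeated tokens) may differ.

```python
from collections import Counter
from typing import Sequence, Tuple


def ordered_window_counts(tokens: Sequence[str], window: int = 1) -> Counter:
    """Count (target, following) pairs where `following` occurs within
    `window` positions after `target`. Pairs are directional."""
    if window < 1:
        raise ValueError("window must be at least 1")
    counts: Counter = Counter()
    for i, target in enumerate(tokens):
        # Look only forwards, so (a, b) and (b, a) are tallied separately.
        for following in tokens[i + 1:i + 1 + window]:
            counts[(target, following)] += 1
    return counts
```

With `window = 1` only immediately adjacent pairs are counted, so the result is a table of directed bigram counts.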

New Features

  • Added new argument intermediate to textstat_readability(x, measure, intermediate = FALSE), which if TRUE returns intermediate quantities used in the computation of readability statistics. Useful for verification or direct use of the intermediate quantities.
  • Added a new separator argument to kwic() to allow a user to define which characters will be added between tokens returned from a keywords in context search. (#1449)
  • Reimplemented textstat_dist() and textstat_simil() in C++ for enhanced performance. (#1210)
  • Added a tokens_sample() function (#1478).

Behaviour changes

  • Removed the Hamming distance method from textstat_dist() (#1443), based on the reasoning in #1442.
  • Removed the "chisquared" and "chisquared2" distance measures from textstat_simil(). (#1442)

(not accepted by CRAN 😞) v1.3.10

05 Oct 20:08

Prepared for and submitted to CRAN. This is the version current with the publication of the JOSS article about quanteda.

CRAN v1.3.0

05 Jun 18:43

New Features

  • Added to = "tripletlist" output type for convert(), to convert a dfm into a simple triplet list. (#1321)
  • Added tokens_tortl() and char_tortl() to add markers for right-to-left language tokens and character objects. (#1322)

Behaviour changes

  • Improved corpus.kwic() by adding new arguments split_context and extract_keyword.
  • dfm_remove(x, selection = anydfm) is now equivalent to dfm_remove(x, selection = featnames(anydfm)). (#1320)
  • Improved consistency of predict.textmodel_nb() returns, and added type = argument. (#1329)

Bug fixes

  • Fixed a bug in textmodel_affinity() that caused failure when the input dfm had been compiled with tolower = FALSE. (#1338)
  • Fixed a bug affecting tokens_lookup() and dfm_lookup() when nomatch is used. (#1347)
  • Fixed a problem whereby NA texts created a "document" (or tokens) containing "NA" (#1372)

CRAN v1.2.0

16 Apr 12:23

New Features

  • Added an nsentence() method for spacyr parsed objects. (#1289)

Bug fixes and stability enhancements

  • Fix bug in nsyllable() that incorrectly handled cased words, and returned wrong names with use.names = TRUE. (#1282)
  • Fix the overwriting of summary.character() caused by previous import of the network package namespace. (#1285)
  • dfm_smooth() now correctly sets the smooth value in the dfm (#1274). Arithmetic operations on dfm objects are now much more consistent and do not drop attributes of the dfm, as sometimes happened with earlier versions.

Behaviour changes

  • tokens_toupper() and tokens_tolower() no longer remove unused token types. Solves #1278.
  • dfm_trim() now takes more options, and these are implemented more consistently. min_termfreq and max_termfreq have replaced min_count and max_count, and these can be modified using a termfreq_type argument. (Similar options are implemented for docfreq_type.) Solves #1253, #1254.
  • textstat_simil() and textstat_dist() now take valid dfm indexes for the relevant margin for the selection argument. Previously, this could also be a direct vector or matrix for comparison, but this is no longer allowed. Solves #1266.
  • Improved performance for dfm_group() (#1295).

CRAN v1.1.1

08 Mar 10:19

Changed the default number of threads to 2.

CRAN v1.1.0

06 Mar 15:21

New Features

  • Added as.dfm() methods for tm DocumentTermMatrix and TermDocumentMatrix objects. (#1222)
  • predict.textmodel_wordscores() now includes an include_reftexts argument to exclude training texts from the predicted model object (#1229). The default behaviour is include_reftexts = TRUE, producing the same behaviour as existed before the introduction of this argument. This allows rescaling based on the reference documents (since rescaling requires prediction on the reference documents) but provides an easy way to exclude the reference documents from the predicted quantities.
  • textplot_wordcloud() now uses code entirely internal to quanteda, instead of using the wordcloud package.

Bug fixes and stability enhancements

  • Eliminated unnecessary dependency on the digest package.
  • Updated the vignette title to be less generic.
  • Improved the robustness of dfm_trim() and dfm_weight() for previously weighted dfm objects and when supplied thresholds are proportions instead of counts. (#1237)
  • Fixed a problem in summary.corpus(x, n = 101) when ndoc(x) > 100 (#1242).
  • Fixed a problem in predict.textmodel_wordscores(x, rescaling = "mv") that always reset the reference values for rescaling to the first and second documents (#1251).
  • Issues in the color generation and labels for textplot_keyness() are now resolved (#1233, #1233).

Performance improvements

  • textmodel methods are now exported, to facilitate extension packages for other textmodel methods (e.g. wordshoal).

Behaviour changes

  • Changed the default in textmodel_wordfish() to sparse = FALSE, in response to #1216.
  • dfm_group() now preserves docvars that are constant for the group aggregation (#1228).

CRAN v1.0.0

29 Jan 09:31

New Features

  • Added vertex_labelfont to textplot_network().
  • Added textmodel_lsa() for Latent Semantic Analysis models.
  • Added textmodel_affinity() for the Perry and Benoit (2017) class affinity scaling model.
  • Added Chinese stopwords.
  • Added a pkgdown vignette for applications in the Chinese language.
  • Added textplot_network() function.
  • The stopwords() function and the associated internal data object data_char_stopwords have been removed from quanteda, and replaced by equivalent functionality in the stopwords package.
  • Added tokens_subset(), now consistent with other *_subset() functions (#1149).

Bug fixes and stability enhancements

  • Performance has been improved for fcm() and for textmodel_wordfish().
  • dfm() now correctly passes through all ... arguments to tokens(). (#1121)
  • All dfm_*() functions now work correctly with empty dfm objects. (#1133)
  • Fixed a bug in dfm_weight() for named weight vectors (#1150)
  • Fixed a bug preventing textplot_influence() from working (#1116).

Behaviour Changes

  • The convenience wrappers to convert() are simplified and no longer exported. To convert a dfm, convert() is now the only official function.
  • nfeat() replaces nfeature(), which is now deprecated. (#1134)
  • textmodel_wordshoal() has been removed, and relocated to a new package (wordshoal).
  • The generic wrapper function textmodel(), which used to be a gateway to specific textmodel_*() functions, has been removed.
  • (Most of) the textmodel_*() functions have been reimplemented to make their behaviour consistent with the lm/glm() families of models, including especially how the predict, summary, and coef methods work (#1007, #108).
  • The GitHub home for the repository has been moved to https://github.com/quanteda/quanteda.

CRAN v0.99.22

13 Nov 11:23

New Features

  • tokens_select() has a new window argument, permitting selection within an asymmetric window around the pattern of selection. (#521)
  • tokens_replace() now allows token types to be substituted directly and quickly.
  • Added a spacy_parse method for corpus objects. Also restored quanteda methods for spacyr spacy_parsed objects.

Bug fixes and stability enhancements

  • Improved documentation for textmodel_nb() (#1010), and made output quantities from the fitted NB model regular matrix objects instead of Matrix classes.

Behaviour Changes

  • All of the deprecated functions are now removed. (#991)
  • tokens_group() is now significantly faster.
  • The deprecated "list of characters" tokenize() function and all methods associated with the tokenizedTexts object types have been removed.
  • Added convenience functions for keeping tokens or features: tokens_keep(), dfm_keep(), and fcm_keep(). (#1037)
  • textmodel_NB() has been replaced by textmodel_nb().