Releases: quanteda/quanteda
CRAN v1.4.0
Bug fixes and stability enhancements
- Fixed bug in dfm_compress() and dfm_group() that changed or deleted docvars attributes of dfm objects (#1506).
- Fixed a bug in textplot_xray() that caused incorrect facet labels when a pattern contained multiple list elements or values (#1514).
- kwic() now correctly returns the pattern associated with each match as the "keywords" attribute, for all pattern types (#1515).
- Implemented some improvements in efficiency and computation of unusual edge cases for textstat_simil() and textstat_dist().
New features
- textstat_lexdiv() now works on tokens objects, not just dfm objects. New methods of lexical diversity now include MATTR (the Moving-Average Type-Token Ratio, Covington & McFall 2010) and MSTTR (Mean Segmental Type-Token Ratio).
- New function tokens_split() allows splitting single tokens into multiple tokens based on a pattern match. (#1500)
- New function tokens_chunk() allows splitting tokens into new documents of equally-sized "chunks". (#1520)
- New function textstat_entropy() computes entropy for a dfm across feature or document margins.
- The documentation for textstat_readability() is vastly improved, now detailing all formulas and providing full references.
- New function dfm_match() allows a user to specify the features in a dfm according to a fixed vector of feature names, including those of another dfm. Replaces dfm_select(x, pattern) where pattern was a dfm.
- A new argument vertex_labelsize added to textplot_network() to allow more precise control of label sizes, either globally or individually.
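As a rough illustration of what the two new lexical diversity measures compute, here is a minimal Python sketch. It is not quanteda's implementation or API: the function names (ttr, chunk, msttr, mattr) are hypothetical, and the handling of a short trailing segment and of texts shorter than the window are assumptions made for this example. The chunk() helper also mirrors the idea behind splitting tokens into equally sized chunks.

```python
from typing import List, Sequence


def ttr(tokens: Sequence[str]) -> float:
    """Type-token ratio: distinct tokens divided by total tokens (non-empty input)."""
    return len(set(tokens)) / len(tokens)


def chunk(tokens: Sequence[str], size: int) -> List[Sequence[str]]:
    """Split tokens into consecutive chunks of `size`; the last one may be shorter."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def msttr(tokens: Sequence[str], segment_size: int) -> float:
    """Mean TTR over non-overlapping, full-length segments."""
    # Assumption for this sketch: a shorter trailing segment is dropped.
    segments = [s for s in chunk(tokens, segment_size) if len(s) == segment_size]
    if not segments:
        # Assumption: with less than one full segment, fall back to plain TTR.
        return ttr(tokens)
    return sum(ttr(s) for s in segments) / len(segments)


def mattr(tokens: Sequence[str], window: int) -> float:
    """Mean TTR over every sliding window of `window` consecutive tokens."""
    if len(tokens) <= window:
        # Assumption: a text no longer than the window falls back to plain TTR.
        return ttr(tokens)
    n_windows = len(tokens) - window + 1
    return sum(ttr(tokens[i:i + window]) for i in range(n_windows)) / n_windows


tokens = "a b a b c c".split()
print(chunk(tokens, 4))   # two chunks, the second shorter than the first
print(msttr(tokens, 2))   # mean TTR of three 2-token segments
print(mattr(tokens, 2))   # mean TTR of five overlapping 2-token windows
```

Both measures average the type-token ratio over fixed-size pieces of text, which makes them less sensitive to document length than a single TTR. MSTTR uses disjoint segments, while MATTR slides the window one token at a time.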
Behaviour changes
- tokens.tokens(x, remove_hyphens = TRUE) where x was generated with remove_hyphens = FALSE now behaves similarly to how the same tokens would be handled had this option been called on character input as tokens.character(x, remove_hyphens = TRUE). (#1498)
CRAN v1.3.14
CRAN v1.3.13
Bug fixes and stability enhancements
- Fixed a bug causing incorrect counting in fcm(x, ordered = TRUE). (#1413) Also set the condition that window can be of size 1 (formerly the limit was 2 or greater).
- Fixed deprecation warnings from adding a dfm as docvars, and this now imports the feature names as docvar names automatically. (related to #1417)
- Fixed behaviour from tokens(x, what = "fasterword", remove_separators = TRUE) so that it correctly splits words separated by \n and \t characters. (#1420)
- Add error checking for functions taking dfm inputs in case a dfm has empty features (#1419).
- For textstat_readability(), fixed a bug in Dale-Chall-based measures and in the Spache word list measure. These were caused by an incorrect lookup mechanism but also by limited implementation of the wordlists. The new wordlists include all of the variations called for in the original measures, but using fast fixed matching. (#1410)
- Fixed problems with basic dfm operations (rowMeans(), rowSums(), colMeans(), colSums()) caused by not having access to the Matrix package methods. (#1428)
- Fixed a problem in textplot_scale1d() when the input is a predicted wordscores object with se.fit = TRUE (#1440).
- Improved the stability of textplot_network(). (#1460)
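To make the ordered co-occurrence counting with a window of size 1 concrete, here is a minimal Python sketch. It is an illustration only, not quanteda's fcm() code: the function name ordered_cooccurrence is hypothetical, and the sketch assumes that "ordered" means a pair (a, b) is counted only when b follows a within the window, with no distance weighting.

```python
from collections import Counter
from typing import Sequence, Tuple


def ordered_cooccurrence(tokens: Sequence[str], window: int = 1) -> Counter:
    """Count (target, following) pairs where `following` occurs at most
    `window` positions after `target` in the token sequence."""
    if window < 1:
        raise ValueError("window must be at least 1")
    counts: Counter = Counter()
    for i, target in enumerate(tokens):
        # Only look forward, so (a, b) and (b, a) are counted separately.
        for following in tokens[i + 1:i + 1 + window]:
            counts[(target, following)] += 1
    return counts


tokens = "a b c a b".split()
adjacent = ordered_cooccurrence(tokens, window=1)  # immediate neighbours only
wider = ordered_cooccurrence(tokens, window=2)     # up to two tokens ahead
print(adjacent)
print(wider)
```

With a window of 1, only directly adjacent pairs are counted, which is the case that the lower limit of 2 previously ruled out.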
New Features
- Added new argument intermediate to textstat_readability(x, measure, intermediate = FALSE), which if TRUE returns intermediate quantities used in the computation of readability statistics. Useful for verification or direct use of the intermediate quantities.
- Added a new separator argument to kwic() to allow a user to define which characters will be added between tokens returned from a keywords in context search. (#1449)
- Reimplemented textstat_dist() and textstat_simil() in C++ for enhanced performance. (#1210)
- Added a tokens_sample() function (#1478).
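The role of a separator in a keywords-in-context search can be sketched as follows. This is a simplified Python illustration, not quanteda's kwic(): the function name kwic_sketch, its parameters, and the exact-match rule for the keyword are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def kwic_sketch(
    tokens: Sequence[str],
    keyword: str,
    window: int = 2,
    separator: str = " ",
) -> List[Tuple[str, str, str]]:
    """Return (pre, keyword, post) for each exact match of `keyword`,
    joining the context tokens on each side with `separator`."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            pre = separator.join(tokens[max(0, i - window):i])
            post = separator.join(tokens[i + 1:i + 1 + window])
            rows.append((pre, tok, post))
    return rows


tokens = "the quick brown fox jumps".split()
print(kwic_sketch(tokens, "brown"))                 # contexts joined by spaces
print(kwic_sketch(tokens, "brown", separator="_"))  # contexts joined by "_"
```

A configurable separator matters mostly for languages or tokenizations where a space between tokens is not the natural way to display running text.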
Behaviour changes
(not accepted by CRAN 😞) v1.3.10
Prepared for and submitted to CRAN. This was the version current with the publication of the JOSS article about quanteda.
CRAN v1.3.0
New Features
- Added to = "tripletlist" output type for convert(), to convert a dfm into a simple triplet list. (#1321)
- Added tokens_tortl() and char_tortl() to add markers for right-to-left language tokens and character objects. (#1322)
Behaviour changes
- Improved corpus.kwic() by adding new arguments split_context and extract_keyword.
- dfm_remove(x, selection = anydfm) is now equivalent to dfm_remove(x, selection = featnames(anydfm)). (#1320)
- Improved consistency of predict.textmodel_nb() returns, and added type = argument. (#1329)
Bug fixes
CRAN v1.2.0
New Features
- Added an nsentence() method for spacyr parsed objects. (#1289)
Bug fixes and stability enhancements
- Fix bug in nsyllable() that incorrectly handled cased words, and returned wrong names with use.names = TRUE. (#1282)
- Fix the overwriting of summary.character() caused by previous import of the network package namespace. (#1285)
- dfm_smooth() now correctly sets the smooth value in the dfm (#1274). Arithmetic operations on dfm objects are now much more consistent and do not drop attributes of the dfm, as sometimes happened with earlier versions.
Behaviour changes
- tokens_toupper() and tokens_tolower() no longer remove unused token types. Solves #1278.
- dfm_trim() now takes more options, and these are implemented more consistently. min_termfreq and max_termfreq have replaced min_count and max_count, and these can be modified using a termfreq_type argument. (Similar options are implemented for docfreq_type.) Solves #1253, #1254.
- textstat_simil() and textstat_dist() now take valid dfm indexes for the relevant margin for the selection argument. Previously, this could also be a direct vector or matrix for comparison, but this is no longer allowed. Solves #1266.
- Improved performance for dfm_group() (#1295).
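The idea of a frequency threshold whose meaning depends on a type argument can be sketched in a few lines of Python. This is a conceptual illustration rather than quanteda's dfm_trim(): the function name trim_by_termfreq and the dict-per-document representation are hypothetical, and only two threshold types ("count" and "prop") are shown as an assumption for this example.

```python
from collections import Counter
from typing import Dict, List, Optional


def trim_by_termfreq(
    docs: List[Dict[str, int]],
    min_termfreq: Optional[float] = None,
    max_termfreq: Optional[float] = None,
    termfreq_type: str = "count",
) -> List[Dict[str, int]]:
    """Keep only features whose total frequency across all documents lies
    within [min_termfreq, max_termfreq], read as counts or proportions."""
    totals: Counter = Counter()
    for doc in docs:
        totals.update(doc)  # adds each document's feature counts

    if termfreq_type == "count":
        freq = dict(totals)
    elif termfreq_type == "prop":
        grand_total = sum(totals.values())
        freq = {f: c / grand_total for f, c in totals.items()}
    else:
        raise ValueError("termfreq_type must be 'count' or 'prop' in this sketch")

    keep = {
        f for f, v in freq.items()
        if (min_termfreq is None or v >= min_termfreq)
        and (max_termfreq is None or v <= max_termfreq)
    }
    return [{f: c for f, c in doc.items() if f in keep} for doc in docs]


docs = [{"a": 3, "b": 1}, {"a": 1, "c": 1}]
print(trim_by_termfreq(docs, min_termfreq=2))                          # threshold as a count
print(trim_by_termfreq(docs, min_termfreq=0.5, termfreq_type="prop"))  # threshold as a share
```

Separating the threshold value from its type means the same argument can express "at least 2 occurrences" or "at least half of all tokens" without a second pair of arguments.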
CRAN v1.1.1
Changed the default number of threads to 2.
CRAN v1.1.0
New Features
- Added as.dfm() methods for tm DocumentTermMatrix and TermDocumentMatrix objects. (#1222)
- predict.textmodel_wordscores() now includes an include_reftexts argument to exclude training texts from the predicted model object (#1229). The default behaviour is include_reftexts = TRUE, producing the same behaviour as existed before the introduction of this argument. This allows rescaling based on the reference documents (since rescaling requires prediction on the reference documents) but provides an easy way to exclude the reference documents from the predicted quantities.
- textplot_wordcloud() now uses code entirely internal to quanteda, instead of using the wordcloud package.
Bug fixes and stability enhancements
- Eliminated unnecessary dependency on the digest package.
- Updated the vignette title to be less generic.
- Improved the robustness of dfm_trim() and dfm_weight() for previously weighted dfm objects and when supplied thresholds are proportions instead of counts. (#1237)
- Fixed a problem in summary.corpus(x, n = 101) when ndoc(x) > 100 (#1242).
- Fixed a problem in predict.textmodel_wordscores(x, rescaling = "mv") that always reset the reference values for rescaling to the first and second documents (#1251).
- Issues in the color generation and labels for textplot_keyness() are now resolved (#1233, #1233).
Performance improvements
- textmodel methods are now exported, to facilitate extension packages for other textmodel methods (e.g. wordshoal).
Behaviour changes
CRAN v1.0.0
New Features
- Added vertex_labelfont to textplot_network().
- Added textmodel_lsa() for Latent Semantic Analysis models.
- Added textmodel_affinity() for the Perry and Benoit (2017) class affinity scaling model.
- Added Chinese stopwords.
- Added a pkgdown vignette for applications in the Chinese language.
- Added textplot_network() function.
- The stopwords() function and the associated internal data object data_char_stopwords have been removed from quanteda, and replaced by equivalent functionality in the stopwords package.
- Added tokens_subset(), now consistent with other *_subset() functions (#1149).
Bug fixes and stability enhancements
- Performance has been improved for fcm() and for textmodel_wordfish().
- dfm() now correctly passes through all ... arguments to tokens(). (#1121)
- All dfm_*() functions now work correctly with empty dfm objects. (#1133)
- Fixed a bug in dfm_weight() for named weight vectors (#1150)
- Fixed a bug preventing textplot_influence() from working (#1116).
Behaviour Changes
- The convenience wrappers to convert() are simplified and no longer exported. To convert a dfm, convert() is now the only official function.
- nfeat() replaces nfeature(), which is now deprecated. (#1134)
- textmodel_wordshoal() has been removed, and relocated to a new package (wordshoal).
- The generic wrapper function textmodel(), which used to be a gateway to specific textmodel_*() functions, has been removed.
- (Most of) the textmodel_*() have been reimplemented to make their behaviour consistent with the lm/glm() families of models, including especially how the predict, summary, and coef methods work (#1007, #108).
- The GitHub home for the repository has been moved to https://github.com/quanteda/quanteda.
CRAN v0.99.22
New Features
- tokens_select() has a new window argument, permitting selection within an asymmetric window around the pattern of selection. (#521)
- tokens_replace() now allows token types to be substituted directly and quickly.
- Added a spacy_parse method for corpus objects. Also restored quanteda methods for spacyr spacy_parsed objects.
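Selection within an asymmetric window can be pictured with the following Python sketch. It illustrates the concept only and is not quanteda's tokens_select(): the function name select_with_window is hypothetical, matching is by exact token membership in a set, and the window is assumed to be given as a (before, after) pair.

```python
from typing import Iterable, List, Sequence, Tuple


def select_with_window(
    tokens: Sequence[str],
    pattern: Iterable[str],
    window: Tuple[int, int] = (0, 0),
) -> List[str]:
    """Keep tokens that match `pattern`, plus up to `window[0]` tokens
    before and `window[1]` tokens after each match, in original order."""
    wanted = set(pattern)
    before, after = window
    keep = [False] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in wanted:
            start = max(0, i - before)
            stop = min(len(tokens), i + after + 1)
            for j in range(start, stop):
                keep[j] = True
    return [tok for tok, flag in zip(tokens, keep) if flag]


tokens = "a b c d e f".split()
print(select_with_window(tokens, {"d"}))                 # the match only
print(select_with_window(tokens, {"d"}, window=(1, 2)))  # one before, two after
```

Marking positions in a boolean mask first, then filtering once, keeps overlapping windows from duplicating tokens.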
Bug fixes and stability enhancements
- Improved documentation for textmodel_nb() (#1010), and made output quantities from the fitted NB model regular matrix objects instead of Matrix classes.
Behaviour Changes
- All of the deprecated functions are now removed. (#991)
- tokens_group() is now significantly faster.
- The deprecated "list of characters" tokenize() function and all methods associated with the tokenizedTexts object types have been removed.
- Added convenience functions for keeping tokens or features: tokens_keep(), dfm_keep(), and fcm_keep(). (#1037)
- textmodel_NB() has been replaced by textmodel_nb().