Awesome NLP in R



Chinese Text Segmentation

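  • jiebaR - A C++-based word segmentation package for R; supports keyword extraction, simhash, Hamming distance, and more (top recommendation).

  • cppjieba - A C++ word segmentation tool.

  • THULAC - An efficient Chinese lexical analysis toolkit from Tsinghua University; currently supports only Python, C++, and Java.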

Document-Term Matrix

Text Regression and Document Similarity

  • textreg - n-Gram Text Regression, aka Concise Comparative Summarization

  • textreuse - Provides classes and functions to detect document similarity and text reuse in text corpora.

  • stringdist - Approximate String Matching and String Distance Functions. Can compute Hamming distance, among other metrics.

Quantitative Analysis of Textual Data

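  • quanteda - Written in C++; analysis is built around the "dfm"; depends on packages such as stringi and data.table, and is fairly efficient. Topic modeling additionally requires loading a dedicated package such as lda or topicmodels. Quick-start guide; GitHub repository.

  • topicmodels - Provides an interface to the C code for Latent Dirichlet Allocation (LDA) models and Correlated Topics Models (CTM). More details here.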

  • lda - Implements latent Dirichlet allocation (LDA) and related models.

  • LDAvis - Interactive visualization of topic models.

  • mallet - This package allows you to train topic models in mallet and load results directly into R. Java-based.

  • stm - Estimation of the Structural Topic Model. C-based.


Implementations from Awesome R

Packages for Natural Language Processing.

  • tm - A comprehensive text mining framework for R.

  • openNLP - Apache OpenNLP Tools Interface.

  • koRpus - An R Package for Text Analysis.

  • zipfR - Statistical models for word frequency distributions.

  • NLP - Basic functions for Natural Language Processing.

  • syuzhet - Extracts sentiment from text using three different sentiment dictionaries.

  • SnowballC - Snowball stemmers based on the C libstemmer UTF-8 library.

CRAN Task View