
Awesome NLP in R

R packages for natural language processing, collected by 小赵 (Xiao Zhao):

Chinese Text Segmentation

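  • jiebaR - A C++-based Chinese word segmentation package for R; supports keyword extraction, simhash, Hamming distance, and more (top recommendation).

  • cppjieba - A Chinese word segmentation tool written in C++.

  • THULAC - An efficient Chinese lexical analysis toolkit from Tsinghua University; currently supports only Python, C++, and Java.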

Document-Term Matrix
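
As a rough illustration of what a document-term matrix is, the sketch below builds one in base R, without any of the packages listed here. The toy documents and the variable names (docs, tokens, vocab, dtm) are made up for this example, and the whitespace tokenizer is a deliberate simplification.

```r
# Toy corpus; the two documents are invented for illustration.
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog sat on the log")

# Naive tokenization: lowercase, then split on whitespace.
tokens <- strsplit(tolower(docs), "\\s+")
names(tokens) <- names(docs)

# Vocabulary: sorted unique terms across all documents.
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per term; cells hold term counts.
dtm <- matrix(0L, nrow = length(docs), ncol = length(vocab),
              dimnames = list(names(docs), vocab))
for (i in seq_along(tokens)) {
  counts <- table(tokens[[i]])
  dtm[i, names(counts)] <- as.integer(counts)
}

print(dtm)
```

Real text-mining packages store this matrix in a sparse format, since most cells are zero for any realistic vocabulary.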

Text Regression and Document Similarity

  • textreg - n-Gram Text Regression, aka Concise Comparative Summarization

  • textreuse - provides classes and functions to detect document similarity and text reuse in text corpora.

  • stringdist - Approximate String Matching and String Distance Functions. Can compute Hamming distance, among other metrics.
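
For readers unfamiliar with the Hamming distance mentioned above, here is a minimal base-R sketch. The helper name hamming_distance is hypothetical and is not part of stringdist; it only shows the definition: the number of positions at which two equal-length strings differ.

```r
# Hypothetical helper (not from any package): Hamming distance
# between two strings of equal length.
hamming_distance <- function(a, b) {
  # The distance is only defined for strings of the same length.
  stopifnot(nchar(a) == nchar(b))
  chars_a <- strsplit(a, "")[[1]]
  chars_b <- strsplit(b, "")[[1]]
  # Count the positions where the characters differ.
  sum(chars_a != chars_b)
}

hamming_distance("karolin", "kathrin")  # 3
```

In practice the package's own implementation is the better choice; in stringdist the metric is selected with method = "hamming".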

Quantitative Analysis of Textual Data

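  • quanteda - Written in C++; analysis is built around the "dfm" (document-feature matrix), and it depends on packages such as stringi and data.table, so it is fairly efficient. Topic modeling requires loading a separate package such as lda or topicmodels. Quick-start guide; GitHub repository.

  • topicmodels - Provides an interface to the C code for Latent Dirichlet Allocation (LDA) models and Correlated Topics Models (CTM). See here for more.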

  • lda - Implements latent Dirichlet allocation (LDA) and related models.

  • LDAvis - Interactive visualization of topic models.

  • mallet - This package allows you to train topic models in mallet and load results directly into R. Java-based.

  • stm - Estimation of the Structural Topic Model. C-based.

For comparisons and examples of topic models, see: https://github.com/trinker/topicmodels_learning#r-resources

Implementations from Awesome R

Packages for Natural Language Processing.

  • tm - A comprehensive text mining framework for R.

  • openNLP - Apache OpenNLP Tools Interface.

  • koRpus - An R Package for Text Analysis.

  • zipfR - Statistical models for word frequency distributions.

  • NLP - Basic functions for Natural Language Processing.

  • syuzhet - Extracts sentiment from text using three different sentiment dictionaries.

  • SnowballC - Snowball stemmers based on the C libstemmer UTF-8 library.

CRAN Task View

https://cran.r-project.org/web/views/NaturalLanguageProcessing.html