
quanteda: Quantitative Analysis of Textual Data
===============================================

An R package for managing and analyzing text, by Ken Benoit and Paul Nulty.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as meta-data for documents and for the collection as a whole.

quanteda includes tools to make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them, with or without stopwords or stemming, or by segmenting them into sentence or paragraph units.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, to facilitate computation of confidence intervals on textual statistics using non-parametric bootstrapping applied to the original texts as data.

quanteda includes a suite of sophisticated tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.

Once converted into a quantitative matrix (known as a "dfm", for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.

How to Install
--------------

You can download the files and build the package from source, or you can use the devtools package to install it directly from GitHub.

Some preliminaries:

1.  To install the package from GitHub, you will need to install the `devtools` package, using (from R):

    ```S
    install.packages("devtools")
    ```

2.  To access the topic modelling functionality, you will need to install the package **topicmodels** from CRAN.
    On most platforms, this will be done automatically when you install quanteda, but it will fail on OS X unless you have some additional tools installed. [See instructions on installing **topicmodels** on OS X Mavericks or Yosemite here.](http://www.kenbenoit.net/how-to-install-the-r-package-topicmodels-on-os-x/)

3.  You should install [Will Lowe's **austin** package](http://conjugateprior.org/software/austin/).

    ```S
    install.packages("austin", repos="http://r-forge.r-project.org", type="source")
    ```

4.  You should install the additional corpus data from **quantedaData** using

    ```S
    ## devtools required to install quantedaData from GitHub
    devtools::install_github("kbenoit/quantedaData")
    ```

To install `quanteda`, we recommend you use the **dev branch**:

```S
devtools::install_github("kbenoit/quanteda", ref="dev")

## ALTERNATIVELY, to install the latest master branch version:
devtools::install_github("kbenoit/quanteda")
```

Documentation
-------------

An introductory vignette is in progress and can be viewed [here](http://pnulty.github.io).

In-depth tutorials in the form of a gitbook will be available [here](http://kbenoit.github.io/quanteda).

Examples for any function can also be seen using (for instance, for `corpus()`):

```S
example(corpus)
```

There are also some demo functions that show off some of the package capabilities, such as `demo(quanteda)`.
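
Before running the example, it can help to confirm which of the packages named in the installation steps are already available. The following is a minimal sketch using only base R's `requireNamespace()`. The variable names `pkgs` and `installed` are illustrative, and nothing here is part of quanteda itself.

```r
## Sketch: report which of the packages mentioned in "How to Install"
## can be loaded. Base R only; not part of quanteda.
pkgs <- c("devtools", "topicmodels", "austin", "quantedaData", "quanteda")

## requireNamespace() returns TRUE if the package namespace can be loaded
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

if (any(!installed)) {
  message("Not yet installed: ", paste(pkgs[!installed], collapse = ", "))
}
```

Any package reported as missing can then be installed with the corresponding command from the installation steps.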

Example
-------

```{r quanteda_example}
library(quanteda)

# create a corpus from the immigration texts from UK party platforms
uk2010immigCorpus <- corpus(uk2010immig,
                            docvars=data.frame(party=names(uk2010immig)),
                            notes="Immigration-related sections of 2010 UK party manifestos",
                            enc="UTF-8")
uk2010immigCorpus
summary(uk2010immigCorpus, showmeta=TRUE)

# key words in context for "deport", 3 words of context
kwic(uk2010immigCorpus, "deport", 3)

# create a dfm, removing stopwords
mydfm <- dfm(uk2010immigCorpus, stopwords=TRUE)
dim(mydfm)              # basic dimensions of the dfm
topfeatures(mydfm, 20)  # 20 top words
if (Sys.info()['sysname']=="Darwin") quartz() # open nicer window, Mac only
plot(mydfm)             # word cloud
```
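
To make the idea of a document-feature matrix concrete, here is a toy version built with base R only. It is an illustrative sketch, not how quanteda's `dfm()` is implemented. The two texts, the one-word stopword list, and the names `texts`, `stopwords`, `tokens`, `features`, and `toy_dfm` are all hypothetical.

```r
## Illustrative only: a tiny document-feature matrix in base R.
## This is NOT quanteda's implementation of dfm().
texts <- c(doc1 = "Immigration policy and border policy",
           doc2 = "Border control and immigration limits")  # hypothetical texts
stopwords <- c("and")                                         # hypothetical stopword list

## lower-case, split on non-letters, drop empty strings and stopwords
tokens <- lapply(texts, function(x) {
  tok <- strsplit(tolower(x), "[^a-z]+")[[1]]
  tok[nzchar(tok) & !(tok %in% stopwords)]
})

## one row per document, one column per feature, cells are counts
features <- sort(unique(unlist(tokens)))
toy_dfm <- t(vapply(tokens,
                    function(tok) tabulate(match(tok, features),
                                           nbins = length(features)),
                    FUN.VALUE = integer(length(features))))
colnames(toy_dfm) <- features

dim(toy_dfm)                               # documents x features
sort(colSums(toy_dfm), decreasing = TRUE)  # most frequent features overall
```

The last two lines play the same role as `dim(mydfm)` and `topfeatures(mydfm, 20)` in the example above: the first reports the matrix dimensions and the second ranks features by total count.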