R functions for Quantitative Analysis of Textual Data

LongLingMichael/quanteda

 
 

quanteda: Quantitative Analysis of Textual Data
===============================================

An R package for managing and analyzing text, by Ken Benoit and Paul Nulty.

quanteda makes it easy to manage texts in the form of a
corpus, defined as a collection of texts that includes document-level
variables specific to each text, as well as meta-data for documents
and for the collection as a whole. quanteda includes tools to make it
easy and fast to manipulate the texts in a corpus, for
instance by tokenizing them, with or without stopwords or stemming, or
to segment them by sentence or paragraph units. 
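
As a rough illustration of what tokenizing, stopword removal, and sentence segmentation involve, here is a minimal base R sketch. It does not use quanteda's own functions; the helper `tokenize_simple()` and the stopword list `my_stopwords` are hypothetical names invented for this example, and the splitting rules are far cruder than what a real tokenizer would do.

```r
# Base R sketch only; not quanteda's implementation.
txt <- "Immigration policy matters. We will cap the number of migrants."

# Segment into sentences: split on whitespace that follows ., ! or ?
sentences <- unlist(strsplit(txt, "(?<=[.!?])\\s+", perl = TRUE))

# Tokenize: lower-case, then split on any run of non-letters.
tokenize_simple <- function(x) {
  toks <- unlist(strsplit(tolower(x), "[^a-z]+"))
  toks[nzchar(toks)]  # drop empty strings left by leading separators
}
tokens <- tokenize_simple(txt)

# Remove a small, hypothetical stopword list.
my_stopwords <- c("we", "will", "the", "of")
tokens_nostop <- tokens[!tokens %in% my_stopwords]

print(sentences)
print(tokens_nostop)
```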

quanteda implements
bootstrapping methods for texts that make it easy to resample texts
from pre-defined units, to facilitate computation of confidence
intervals on textual statistics using techniques of non-parametric
bootstrapping, but applied to the original texts as data. quanteda
includes a suite of sophisticated tools to extract features of the
texts into a quantitative matrix, where these features can be defined
according to a dictionary or thesaurus, including the declaration of
collocations to be treated as single features. 
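
The resampling idea described above can be sketched in base R. This is only an illustration of non-parametric bootstrapping applied to sentence units, not quanteda's own bootstrapping code; the toy sentences, the helper `rel_freq()`, and the choice of a percentile interval are all assumptions made for the example.

```r
set.seed(42)  # make the resampling reproducible

# Hypothetical toy document, already segmented into sentences.
sentences <- c("we will cap immigration",
               "immigration has enriched our culture",
               "we will deport foreign criminals",
               "the economy needs skilled workers",
               "border controls will be strengthened")

# Textual statistic: relative frequency of a word among all tokens.
rel_freq <- function(sents, word) {
  toks <- unlist(strsplit(sents, " ", fixed = TRUE))
  mean(toks == word)
}

# Non-parametric bootstrap: resample whole sentences with replacement
# and recompute the statistic on each resampled "document".
boot_stats <- replicate(1000, {
  resampled <- sample(sentences, length(sentences), replace = TRUE)
  rel_freq(resampled, "will")
})

observed <- rel_freq(sentences, "will")
ci <- quantile(boot_stats, c(0.025, 0.975))  # percentile confidence interval
print(observed)
print(ci)
```

Resampling sentences rather than individual words keeps each unit internally intact, which is the point of bootstrapping from pre-defined text units.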

Once converted into a
quantitative matrix (known as a "dfm" for document-feature matrix),
the textual features can be analyzed using quantitative methods for
describing, comparing, or scaling texts, or used to train machine
learning methods for class prediction.
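
To make the shape of a document-feature matrix concrete, the following base R sketch builds one by hand: rows are documents, columns are features, and cells are counts. It is a simplified stand-in, not quanteda's `dfm()`; the two toy documents and the object name `dfm_simple` are hypothetical.

```r
# Two hypothetical, pre-cleaned documents.
docs <- c(doc1 = "immigration cap immigration",
          doc2 = "economy immigration workers")

# Tokenize on single spaces and collect the sorted vocabulary.
toks <- strsplit(docs, " ", fixed = TRUE)
features <- sort(unique(unlist(toks)))

# Count every feature in every document (one row per document).
dfm_simple <- t(vapply(
  toks,
  function(x) as.integer(table(factor(x, levels = features))),
  integer(length(features))
))
colnames(dfm_simple) <- features

print(dfm_simple)

# Features ranked by total count across documents.
print(sort(colSums(dfm_simple), decreasing = TRUE))
```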


How to Install
--------------

You can download the files and build the package from source, or you can use the `devtools` package to install it directly from GitHub.

Some preliminaries:

1.  To install the package from GitHub, you will need to install the `devtools` package, using (from R):

    ```S
    install.packages("devtools")
    ```

2.  To access the topic modelling functionality, you will need to install the package **topicmodels** from CRAN.  On most platforms, this will be done automatically when you install quanteda, but it will fail on OS X unless you have some additional tools installed.  [See instructions on installing **topicmodels** on OS X Mavericks or Yosemite here.](http://www.kenbenoit.net/how-to-install-the-r-package-topicmodels-on-os-x/)

3.  You should install [Will Lowe's **austin** package](http://conjugateprior.org/software/austin/).

    ```S
    install.packages("austin", repos="http://r-forge.r-project.org", type="source")
    ```

4.  You should install the additional corpus data from **quantedaData** using

    ```S
    ## devtools is required to install quantedaData from GitHub
    devtools::install_github("kbenoit/quantedaData")
    ```

To install `quanteda`, we recommend you use the **dev branch**:

```S
devtools::install_github("kbenoit/quanteda", ref="dev")

## ALTERNATIVELY, to install the latest master branch version:
devtools::install_github("kbenoit/quanteda")
```
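
After working through the steps above, it can help to confirm which packages actually installed. The helper below is a small sketch, not part of quanteda; `missing_packages()` is a hypothetical name, and it relies only on base R's `requireNamespace()`, which also loads the namespace of each package it finds.

```r
# Return the packages in `pkgs` that are not currently installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

# Packages named in the installation steps above.
print(missing_packages(c("devtools", "austin", "quantedaData", "quanteda")))
```

An empty result means every listed package was found.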

Documentation
-------------

An introductory vignette is in progress and can be viewed [here](http://pnulty.github.io).

In-depth tutorials in the form of a gitbook will be available [here](http://kbenoit.github.io/quanteda).

Examples for any function can also be seen using (for instance, for `corpus()`):
```S
example(corpus)
```

There are also some demo functions that show off some of the package capabilities, such 
as `demo(quanteda)`.


Example
-------

```{r quanteda_example}
library(quanteda)
# create a corpus from the immigration texts from UK party platforms
uk2010immigCorpus <- corpus(uk2010immig,
                            docvars=data.frame(party=names(uk2010immig)),
                            notes="Immigration-related sections of 2010 UK party manifestos",
                            enc="UTF-8")
uk2010immigCorpus
summary(uk2010immigCorpus, showmeta=TRUE)

# key words in context for "deport", 3 words of context
kwic(uk2010immigCorpus, "deport", 3)

# create a dfm, removing stopwords
mydfm <- dfm(uk2010immigCorpus, stopwords=TRUE)
dim(mydfm)              # basic dimensions of the dfm
topfeatures(mydfm, 20)  # 20 top words
if (Sys.info()['sysname']=="Darwin") quartz() # open nicer window, Mac only
plot(mydfm)             # word cloud     
```
