quanteda: Quantitative Analysis of Textual Data
===============================================

An R package for managing and analyzing text, by Ken Benoit and Paul Nulty.

quanteda makes it easy to manage texts in the form of a
corpus, defined as a collection of texts that includes document-level
variables specific to each text, as well as meta-data for documents
and for the collection as a whole. quanteda includes tools to make it
easy and fast to manipulate the texts in a corpus, for
instance by tokenizing them, with or without stopwords or stemming, or
by segmenting them into sentence or paragraph units.

quanteda implements
bootstrapping methods for texts that make it easy to resample texts
from pre-defined units, to facilitate computation of confidence
intervals on textual statistics using techniques of non-parametric
bootstrapping, but applied to the original texts as data. quanteda
includes a suite of sophisticated tools to extract features of the
texts into a quantitative matrix, where these features can be defined
according to a dictionary or thesaurus, including the declaration of
collocations to be treated as single features. 
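The bootstrapping idea described above can be illustrated with a minimal base-R sketch. It does not use quanteda's own resampling functions; the toy sentences, the `mean_word_length()` helper, and the choice of 1000 replicates are all assumptions made for illustration only.

```S
# Base-R sketch only; none of these names are quanteda functions.
set.seed(42)  # make the resampling reproducible

# A toy "document" already segmented into sentences (hypothetical data)
sentences <- c("We will cap immigration.",
               "Immigration has enriched our culture and economy.",
               "We will strengthen border controls.",
               "Students and skilled workers remain welcome.")

# Textual statistic of interest: mean word length over a set of sentences
mean_word_length <- function(sents) {
  words <- unlist(strsplit(tolower(sents), "[^a-z]+"))
  words <- words[nzchar(words)]  # drop empty strings left by the split
  mean(nchar(words))
}

# Resample sentences with replacement and recompute the statistic each time
boot_stats <- replicate(1000, {
  resampled <- sample(sentences, length(sentences), replace = TRUE)
  mean_word_length(resampled)
})

observed <- mean_word_length(sentences)
ci <- quantile(boot_stats, c(0.025, 0.975))  # percentile 95% interval
```

Here the resampling unit is the sentence, so the interval reflects variation in the original text rather than in an already-computed word count vector.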

Once converted into a
quantitative matrix (known as a "dfm" for document-feature matrix),
the textual features can be analyzed using quantitative methods for
describing, comparing, or scaling texts, or used to train machine
learning methods for class prediction.
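To make the idea of a document-feature matrix concrete, here is a small base-R sketch that builds one by hand: one row per document, one column per feature, and cells holding counts. It is not how quanteda's `dfm()` is implemented; the two toy texts and the name `toy_dfm` are hypothetical.

```S
# Base-R sketch only; quanteda's dfm() does this (and much more) for you.
texts <- c(doc1 = "immigration policy and border control",
           doc2 = "border control and asylum policy policy")

# Tokenize on non-letters and drop any empty strings
tokens <- strsplit(tolower(texts), "[^a-z]+")
tokens <- lapply(tokens, function(x) x[nzchar(x)])

# The feature set is the sorted vocabulary across all documents
features <- sort(unique(unlist(tokens)))

# Count how often each feature occurs in each document
toy_dfm <- t(vapply(tokens,
                    function(tok) tabulate(match(tok, features),
                                           nbins = length(features)),
                    integer(length(features))))
colnames(toy_dfm) <- features

toy_dfm
```

Once texts are in this form, ordinary matrix tools such as `rowSums()` or `colSums()` apply directly, which is what makes the downstream quantitative methods possible.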


How to Install
--------------

You can download the files and build the package from source, or you can use the `devtools` package to install quanteda directly from GitHub.

Some preliminaries:

1.  To install the package from GitHub, you will need to install the `devtools` package, using (from R):

    ```S
    install.packages("devtools")
    ```

2.  To access the topic modelling functionality, you will need to install the **topicmodels** package from CRAN.  On most platforms, this will be done automatically when you install quanteda, but it will fail on OS X unless you have some additional tools installed.  [See instructions on installing **topicmodels** on OS X Mavericks or Yosemite here.](http://www.kenbenoit.net/how-to-install-the-r-package-topicmodels-on-os-x/)

3.  You should install [Will Lowe's **austin** package](http://conjugateprior.org/software/austin/).

    ```S
    install.packages("austin", repos="http://r-forge.r-project.org", type="source")
    ```

4.  You should install the additional corpus data from **quantedaData** using

    ```S
    ## devtools is required to install quantedaData from GitHub
    devtools::install_github("kbenoit/quantedaData")
    ```

To install `quanteda`, we recommend you use the **dev branch**:

```S
devtools::install_github("kbenoit/quanteda", ref="dev")

## ALTERNATIVELY, to install the latest master branch version:
devtools::install_github("kbenoit/quanteda")
```

Documentation
-------------

An introductory vignette is in progress and can be viewed [here](http://pnulty.github.io).

In-depth tutorials in the form of a gitbook will be available [here](http://kbenoit.github.io/quanteda).

Examples for any function can also be run using `example()`, for instance for `corpus()`:

```S
example(corpus)
```

There are also some demo functions that show off some of the package capabilities, such as `demo(quanteda)`.


Example
-------

```{r quanteda_example}
library(quanteda)
# create a corpus from the immigration texts from UK party platforms
uk2010immigCorpus <- corpus(uk2010immig,
                            docvars=data.frame(party=names(uk2010immig)),
                            notes="Immigration-related sections of 2010 UK party manifestos",
                            enc="UTF-8")
uk2010immigCorpus
summary(uk2010immigCorpus, showmeta=TRUE)

# key words in context for "deport", 3 words of context
kwic(uk2010immigCorpus, "deport", 3)

# create a dfm, removing stopwords
mydfm <- dfm(uk2010immigCorpus, stopwords=TRUE)
dim(mydfm)              # basic dimensions of the dfm
topfeatures(mydfm, 20)  # 20 top words
if (Sys.info()['sysname']=="Darwin") quartz() # open nicer window, Mac only
plot(mydfm)             # word cloud     
```