NLP Ch2 Accessing Text Corpora and Lexical Resources
PeppermintT edited this page Feb 12, 2020 · 3 revisions
Text corpora can be structured in different ways:
- A plain collection of texts is the simplest kind, e.g. the Gutenberg corpus, which is simply a selection of texts that are out of copyright.
- Some corpora are grouped into categories. Categories can be of many types, e.g. genre, language, source, or author (e.g. the Brown corpus, which is categorized by genre).
- Sometimes categories overlap, because an item can fall under two or more categories (e.g. Reuters news articles).
- Temporal structure shows language use over time (e.g. the Inaugural Address corpus for the USA).
The basic functions for accessing a corpus are:
- The raw() function gives us the contents of a file without any linguistic processing. For example, len(gutenberg.raw('blake-poems.txt')) tells us how many characters occur in the text, including the spaces between words. Called with no arguments, raw() returns the raw content of the whole corpus.
- raw(fileids=[f1,f2,f3]): the raw content of the specified files
- raw(categories=[c1,c2]): the raw content of the specified categories
- fileids(): the files of the corpus
- fileids([categories]): the files of the corpus corresponding to these categories
- categories(): the categories of the corpus
- categories([fileids]): the categories of the corpus corresponding to these files
- sents(): the sentences of the whole corpus
- sents(fileids=[f1,f2,f3]): the sentences of the specified fileids
- sents(categories=[c1,c2]): the sentences of the specified categories
- words(): the words of the whole corpus
- words(fileids=[f1,f2,f3]): the words of the specified fileids
- words(categories=[c1,c2]): the words of the specified categories
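
To make the relationship between these functions concrete, here is a minimal sketch of a toy corpus with overlapping categories. It is not NLTK's implementation: the ToyCorpus class, the DOCS data, the file names, and the naive regex tokenizer are all made up for illustration, and only the method names are taken from the list above. It needs nothing beyond the Python standard library.

```python
import re

# Hypothetical miniature corpus: file id -> (set of categories, text).
# File names, categories, and texts are invented for this sketch.
DOCS = {
    "a.txt": ({"grain", "wheat"}, "Wheat prices rose. Exports fell."),
    "b.txt": ({"corn", "grain"}, "Corn output was flat."),
    "c.txt": ({"gold"}, "Gold hit a record high."),
}


def _as_list(value):
    """Accept either a single string or a list of strings."""
    return [value] if isinstance(value, str) else list(value)


def _tokenize(text):
    """Naive tokenizer: runs of word characters, or runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


class ToyCorpus:
    """Toy stand-in that reuses the method names listed above (not NLTK's code)."""

    def __init__(self, docs):
        self._docs = docs

    def fileids(self, categories=None):
        if categories is None:
            return sorted(self._docs)
        wanted = set(_as_list(categories))
        # A file matches if it carries at least one of the wanted categories.
        return sorted(f for f, (cats, _) in self._docs.items() if cats & wanted)

    def categories(self, fileids=None):
        files = self.fileids() if fileids is None else _as_list(fileids)
        found = set()
        for f in files:
            found |= self._docs[f][0]
        return sorted(found)

    def _texts(self, fileids, categories):
        """Texts of the selected files; with no selection, the whole corpus."""
        if categories is not None:
            files = self.fileids(categories)
        elif fileids is not None:
            files = _as_list(fileids)
        else:
            files = self.fileids()
        return [self._docs[f][1] for f in files]

    def raw(self, fileids=None, categories=None):
        # Unprocessed text; files are joined with a newline in this sketch.
        return "\n".join(self._texts(fileids, categories))

    def words(self, fileids=None, categories=None):
        return _tokenize(self.raw(fileids, categories))

    def sents(self, fileids=None, categories=None):
        sentences = []
        for text in self._texts(fileids, categories):
            # Naive split after sentence-final punctuation.
            for piece in re.split(r"(?<=[.!?])\s+", text):
                if piece:
                    sentences.append(_tokenize(piece))
        return sentences


corpus = ToyCorpus(DOCS)
print(corpus.fileids("grain"))             # files in one category
print(corpus.categories("a.txt"))          # categories of one file (they overlap)
print(len(corpus.raw("c.txt")))            # character count, spaces included
print(corpus.words("c.txt"))               # punctuation counts as a token here
print(corpus.sents(categories=["wheat"]))  # each sentence is a list of words
```

Note that "a.txt" belongs to both "grain" and "wheat", so it is returned when asking for either category. This is the overlapping structure described for the Reuters articles. A real corpus reader would use proper word and sentence tokenizers rather than the regular expressions assumed here.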