NLP Ch2 Accessing Text Corpora and Lexical Resources
PeppermintT edited this page Feb 12, 2020
·
3 revisions
Text corpora come in several kinds of structure.
- The simplest kind is an unstructured collection of texts, e.g. the Gutenberg corpus, which is simply a selection of texts that are out of copyright.
- Some corpora group their texts into categories. Categories can be based on genre, language, source, author and so on (e.g. the Brown corpus, which is categorized by genre).
- Sometimes categories overlap, because an item can fall under two or more categories (e.g. Reuters news articles).
- Some corpora have temporal structure, showing language use over time (e.g. the Inaugural Address corpus for the USA).
- The raw() function returns the contents of a file as a single string, without any linguistic processing. So, for example, len(gutenberg.raw('blake-poems.txt')) tells us how many characters occur in the text, including the spaces between words, punctuation and line breaks.
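To make the point about raw() concrete, here is a minimal sketch of what len() measures on a raw string. The helper raw_stats and the toy sample string are hypothetical names introduced for illustration. The NLTK part assumes that NLTK is installed and that the Gutenberg corpus has been downloaded (nltk.download('gutenberg')); if either is missing, only the toy sample is printed.

```python
def raw_stats(raw_text):
    """Count characters in a raw string in three different ways."""
    return {
        # len() counts every character: letters, spaces, punctuation, newlines
        "chars": len(raw_text),
        # the same count with all whitespace characters left out
        "non_whitespace": sum(not c.isspace() for c in raw_text),
        # a rough word count: split on whitespace, punctuation stays attached
        "whitespace_tokens": len(raw_text.split()),
    }

# Toy stand-in for the string that raw() would return (an assumption, not corpus data)
sample = "To be, or not\nto be"
stats = raw_stats(sample)
print(stats)

try:
    from nltk.corpus import gutenberg

    blake = gutenberg.raw('blake-poems.txt')
    print(raw_stats(blake))
    # words() tokenizes the text, so this number is much smaller than len(blake)
    print(len(gutenberg.words('blake-poems.txt')))
except (ImportError, LookupError):
    # NLTK or the Gutenberg data is not available; the toy sample above still ran
    pass
```

The gap between "chars" and "non_whitespace" is the whitespace that raw() keeps, which is why the length of the raw string is a character count and not a letter count or a word count.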