NLP Ch2 Accessing Text Corpora and Lexical Resources

PeppermintT edited this page Feb 12, 2020 · 3 revisions

Text Corpus Structure

Text corpora can be structured in several different ways.

  1. The simplest kind is a plain collection of texts, e.g. the Gutenberg corpus, which is simply a selection of texts that are out of copyright.
  2. Some corpora group their texts into categories such as genre, language, source or author (e.g. the Brown corpus, which is categorized by genre).
  3. Sometimes categories overlap, because an item can fall under two or more categories (e.g. Reuters news articles).
  4. Some corpora have a temporal structure, showing language use over time (e.g. the Inaugural Address corpus for the USA).
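A rough sketch of how the four structures above look through NLTK's corpus readers. It assumes NLTK is installed and the gutenberg, brown, reuters and inaugural data have been downloaded with nltk.download(). The helper names inaugural_year and describe_corpora are mine, for illustration only. The NLTK import sits inside the function so the block still loads when the data is not available.

```python
def inaugural_year(fileid):
    """Inaugural file ids begin with a four-digit year, e.g. '1789-Washington.txt'."""
    return int(fileid[:4])


def describe_corpora():
    """Print one small fact per corpus structure (needs NLTK plus its corpus data)."""
    # Imported here (not at module level) so the helper above works without NLTK.
    from nltk.corpus import gutenberg, brown, reuters, inaugural

    # 1. Plain collection: just a list of file ids.
    print(gutenberg.fileids()[:3])

    # 2. Categorized: each text belongs to a category (genre, for Brown).
    print(brown.categories()[:3])

    # 3. Overlapping categories: one document can carry several labels.
    first_doc = reuters.fileids()[0]
    print(first_doc, reuters.categories(first_doc))

    # 4. Temporal: the year is encoded at the start of each file id.
    print(sorted({inaugural_year(f) for f in inaugural.fileids()})[:3])


# Call describe_corpora() once the corpus data has been downloaded.
```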

Some key functions for accessing corpora (I worked with Gutenberg):

  • The raw() function gives us the contents of the file without any linguistic processing. So, for example, len(gutenberg.raw('blake-poems.txt')) tells us how many characters occur in the text, including the spaces between words.
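A minimal sketch of what that length means. The string returned by raw() is an ordinary Python string, so len() counts every character, spaces and punctuation included. The helper text_stats and the sample line are my own illustration, not part of NLTK. The commented lines at the end show how the same helper could be applied to the Blake file, assuming NLTK and the Gutenberg data are installed.

```python
def text_stats(raw_text):
    """Count characters (including whitespace) and whitespace-separated tokens."""
    return {
        "characters": len(raw_text),
        "whitespace_tokens": len(raw_text.split()),
    }


# A short stand-in string, used here so the example runs without any corpus data.
sample = "Piping down the valleys wild"
print(text_stats(sample))  # {'characters': 28, 'whitespace_tokens': 5}

# With NLTK and the Gutenberg data available, the same helper works on raw():
# from nltk.corpus import gutenberg
# print(text_stats(gutenberg.raw('blake-poems.txt')))
```

The character count is larger than the sum of the word lengths because the four spaces are counted too.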
