# Accessing Text Corpora and Lexical Resources

https://www.nltk.org/book/ch02.html

1. What are some useful text corpora and lexical resources, and how can we access them with Python?
2. Which Python constructs are most helpful for this work?
3. How do we avoid repeating ourselves when writing Python code?

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/. We begin by getting the Python interpreter to load the NLTK package, then ask to see nltk.corpus.gutenberg.fileids(), the file identifiers in this corpus:

In [None]:
# Get Gutenberg Corpus

import nltk
nltk.download("gutenberg")
nltk.download('punkt')

from nltk.corpus import gutenberg

In [None]:
 
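# List the file identifiers in the Gutenberg corpus
print(gutenberg.fileids())

# For each text: average word length, average sentence length, lexical diversity
for fileid in gutenberg.fileids():
  num_chars = len(gutenberg.raw(fileid))    # characters, including spaces
  num_words = len(gutenberg.words(fileid))  # word and punctuation tokens
  num_sents = len(gutenberg.sents(fileid))  # sentences
  num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))  # distinct lowercased tokens
  print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)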


This program displays three statistics for each text: average word length, average sentence length, and the average number of times each vocabulary item appears in the text (our lexical diversity score). Observe that average word length appears to be a general property of English, since it has a recurrent value of 4. (In fact, the average word length is really 3, not 4, since the num_chars variable counts space characters.) By contrast, average sentence length and lexical diversity appear to be characteristics of particular authors.
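
As a minimal sketch of the three ratios described above, the following uses a tiny hand-tokenized sample in place of the output of `gutenberg.raw`, `gutenberg.words`, and `gutenberg.sents`, so it runs without downloading anything. The function names `text_statistics` and `mean_token_length` are illustrative and not part of NLTK. The second function averages over alphabetic tokens only, which is one possible way to keep spaces and punctuation out of the word-length figure.

```python
def text_statistics(raw, words, sents):
    """Return (chars per token, tokens per sentence, tokens per vocabulary item)."""
    vocab = set(w.lower() for w in words)
    return (len(raw) / len(words), len(words) / len(sents), len(words) / len(vocab))


def mean_token_length(words):
    """Average length of alphabetic tokens only (no spaces, no punctuation)."""
    alpha = [w for w in words if w.isalpha()]
    return sum(len(w) for w in alpha) / len(alpha)


# Hand-made sample standing in for a corpus file (an assumption for illustration)
raw = "The cat sat. The dog sat too."
sents = [["The", "cat", "sat", "."], ["The", "dog", "sat", "too", "."]]
words = [w for s in sents for w in s]

stats = text_statistics(raw, words, sents)
token_len = mean_token_length(words)
print(stats)
print(token_len)
```

The first ratio divides by the token count but counts every character of the raw string, spaces included, which is why it overstates word length.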

In [None]:
emma = nltk.corpus.gutenberg.words("austen-emma.txt")

In [None]:
len(emma)

In [None]:
# unlike in chapter 1, where the texts came preloaded as Text objects, we have to wrap the corpus words in nltk.Text to use concordance

emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
emma.concordance("melancholy")

In [None]:
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
len(macbeth_sentences)

In [None]:
print(macbeth_sentences[1906])

In [None]:
# length (in tokens) of the longest sentence, then every sentence of that length
longest_len = max(len(s) for s in macbeth_sentences)
print(longest_len)
print([s for s in macbeth_sentences if len(s) == longest_len])

# List of Corpora

The first section of Ch. 2 provides information on the available corpora:
* Gutenberg Corpus
* Web and Chat Text
* Brown Corpus
* Reuters Corpus
* Inaugural Address Corpus
* Annotated Text Corpora
* Multi-language (Universal Declaration of Human Rights)
* See also the NLTK corpus how-to: https://www.nltk.org/howto/corpus.html

In [None]:
help(nltk.corpus.reader)

In [None]:
# raw(), words(), and sents() return the text as characters, tokens, and sentences (lists of tokens)

raw = gutenberg.raw("burgess-busterbrown.txt")
print(raw[1:20])
words = gutenberg.words("burgess-busterbrown.txt")
print(words[1:20])
sents = gutenberg.sents("burgess-busterbrown.txt")
print(sents[1:20])

The Adventures of B
['The', 'Adventures', 'of', 'Buster', 'Bear', 'by', 'Thornton', 'W', '.', 'Burgess', '1920', ']', 'I', 'BUSTER', 'BEAR', 'GOES', 'FISHING', 'Buster', 'Bear']
[['I'], ['BUSTER', 'BEAR', 'GOES', 'FISHING'], ['Buster', 'Bear', 'yawned', 'as', 'he', 'lay', 'on', 'his', 'comfortable', 'bed', 'of', 'leaves', 'and', 'watched', 'the', 'first', 'early', 'morning', 'sunbeams', 'creeping', 'through', 'the', 'Green', 'Forest', 'to', 'chase', 'out', 'the', 'Black', 'Shadows', '.'], ['Once', 'more', 'he', 'yawned', ',', 'and', 'slowly', 'got', 'to', 'his', 'feet', 'and', 'shook', 'himself', '.'], ['Then', 'he', 'walked', 'over', 'to', 'a', 'big', 'pine', '-', 'tree', ',', 'stood', 'up', 'on', 'his', 'hind', 'legs', ',', 'reached', 'as', 'high', 'up', 'on', 'the', 'trunk', 'of', 'the', 'tree', 'as', 'he', 'could', ',', 'and', 'scratched', 'the', 'bark', 'with', 'his', 'great', 'claws', '.'], ['After', 'that', 'he', 'yawned', 'until', 'it', 'seemed', 'as', 'if', 'his', 'jaws', 'wou

## Loading your own Corpus

In [None]:
from nltk.corpus import PlaintextCorpusReader

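corpus_root = "YOUR FILE PATH"  # directory containing your text files

# '.*' matches every file under corpus_root
wordlists = PlaintextCorpusReader(corpus_root, '.*')

print(wordlists.fileids())

# replace 'connectives' with one of the file ids printed above
wordlists.words('connectives')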
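
To try the reader without preparing a directory first, here is a small sketch that writes two throwaway files to a temporary folder and loads them. It assumes NLTK is installed; the file names and their contents are made up for illustration. Only `fileids()` and `words()` are called, and `words()` uses the reader's default word tokenizer.

```python
import os
import shutil
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Build a throwaway corpus directory; names and contents are hypothetical.
corpus_root = tempfile.mkdtemp()
samples = {
    "first.txt": "The quick brown fox jumps.",
    "second.txt": "Over the lazy dog it goes.",
}
for name, text in samples.items():
    with open(os.path.join(corpus_root, name), "w", encoding="utf8") as f:
        f.write(text)

# The pattern restricts the reader to .txt files under corpus_root
reader = PlaintextCorpusReader(corpus_root, r".*\.txt")
file_ids = reader.fileids()
first_words = list(reader.words("first.txt"))
print(file_ids)
print(first_words)

# Remove the temporary directory again
shutil.rmtree(corpus_root, ignore_errors=True)
```

A narrower pattern than `'.*'` helps when the directory also holds files that are not plain text.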

# More Python: Reusing Code

In [None]:
# function

def lexical_diversity(text):
  return len(text) / len(set(text))

lexical_diversity(emma)


24.63538599411087

In [None]:
# function w/ local variables; note that this version returns vocab_size / word_count, the reciprocal of the version above

def lexical_diversity(my_text_data):
  word_count = len(my_text_data)
  vocab_size = len(set(my_text_data))
  diversity_score = vocab_size / word_count
  return diversity_score

lexical_diversity(emma)



0.04059201671283136

In [None]:
nltk.download('genesis')
from nltk.corpus import genesis
kjv = genesis.words("english-kjv.txt")
lexical_diversity(kjv)

[nltk_data] Downloading package genesis to /root/nltk_data...
[nltk_data]   Unzipping corpora/genesis.zip.


0.06230453042623537

# Lexical Resources

A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information such as part of speech and sense definitions. Lexical resources are secondary to texts, and are usually created and enriched with the help of texts. 

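* Wordlist corpora (NLTK includes some corpora that are nothing more than wordlists; we can use them to find unusual or misspelled words in a text corpus)
* Corpus of stopwords (high-frequency words such as "the" and "to" that are often filtered out before further processing)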
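
As a rough sketch of how a wordlist can flag unusual or misspelled words, the function below subtracts a known vocabulary from the vocabulary of a text. The tiny `wordlist` here is a stand-in assumption; with NLTK data installed, a real wordlist corpus such as `nltk.corpus.words.words()` could be passed instead.

```python
def unusual_words(text, vocabulary):
    """Return the alphabetic tokens of text that do not occur in vocabulary."""
    text_vocab = set(w.lower() for w in text if w.isalpha())
    known = set(w.lower() for w in vocabulary)
    return sorted(text_vocab - known)


# Toy inputs, made up for illustration
text = ["The", "qick", "brown", "fox", "jumpd", "."]
wordlist = ["the", "quick", "brown", "fox", "jumped"]

unusual = unusual_words(text, wordlist)
print(unusual)
```

Lowercasing both sides keeps sentence-initial capitals from being reported as unknown words.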


In [None]:
nltk.download("stopwords")
from nltk.corpus import stopwords
stopwords.words('english')

In [None]:
def content_fraction(text):
  # a set makes each membership test constant-time, which matters on large corpora
  stopword_set = set(nltk.corpus.stopwords.words("english"))
  content = [w for w in text if w.lower() not in stopword_set]
  return len(content)/len(text)

content_fraction(emma)


[nltk_data] Downloading package reuters to /root/nltk_data...


0.735240435097661

In [None]:
# with the help of the stopwords we can filter out over a quarter of the words of the text 

nltk.download('reuters')
content_fraction(nltk.corpus.reuters.words())
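
The share of words removed by the stopword filter is one minus the content fraction, so a content fraction of about 0.735 corresponds to roughly 26.5% of the words being filtered out. The sketch below shows the same arithmetic on a toy token list; the three-word stopword set is a made-up stand-in for NLTK's English list, and the function takes the set as a parameter so that it runs without any downloads.

```python
def content_fraction(tokens, stopword_set):
    """Fraction of tokens that are not stopwords (case-insensitive)."""
    content = [w for w in tokens if w.lower() not in stopword_set]
    return len(content) / len(tokens)


# Toy data: 8 tokens, 3 of which are in the hypothetical stopword set
toy_stopwords = {"the", "on", "a"}
tokens = ["The", "cat", "sat", "on", "a", "mat", "today", "quietly"]

fraction = content_fraction(tokens, toy_stopwords)
filtered = 1 - fraction
print(fraction, filtered)
```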

WordNet is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets. We'll begin by looking at synonyms and how they are accessed in WordNet.

https://wordnet.princeton.edu/