# Discover Insights into Classic Texts

Novels and other texts offer insight into ideologies and places that are often unfamiliar to the reader at first. By reading a written piece, you uncover the author's opinions on the chosen topic and come to understand both the topic and how the author thinks.

In this project we will perform a natural language parsing analysis to gain deeper insight into one of two famous and often-discussed novels in the public domain: <a href="http://www.gutenberg.org/ebooks/174" target="_blank" rel="noopener noreferrer">Oscar Wilde's _The Picture of Dorian Gray_</a> or <a href="http://www.gutenberg.org/ebooks/6130" target="_blank" rel="noopener noreferrer">Homer's _The Iliad_</a>!

By the end of this project, we will have found the main topics of discussion in the chosen novel and can begin to discern some of the author's thoughts and beliefs!

## Import and Preprocess Text Data

1. Import the text of your choice, convert it to lowercase, and assign it to a variable. Also import other useful functions.

In [1]:
from nltk.tokenize import PunktSentenceTokenizer, word_tokenize
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def word_sentence_tokenize(text):
  
  # create a PunktSentenceTokenizer
  sentence_tokenizer = PunktSentenceTokenizer(text)
  
  # sentence tokenize text
  sentence_tokenized = sentence_tokenizer.tokenize(text)
  
  # create a list to hold word tokenized sentences
  word_tokenized = list()
  
  # for-loop through each tokenized sentence in sentence_tokenized
  for tokenized_sentence in sentence_tokenized:
    # word tokenize each sentence and append to word_tokenized
    word_tokenized.append(word_tokenize(tokenized_sentence))
    
  return word_tokenized

[nltk_data] Downloading package punkt to /Users/miltonmc5/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/miltonmc5/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
from collections import Counter

# function that pulls chunks out of chunked sentence and finds the most common chunks
def np_chunk_counter(chunked_sentences):

    # create a list to hold chunks
    chunks = list()

    # for-loop through each chunked sentence to extract noun phrase chunks
    for chunked_sentence in chunked_sentences:
        for subtree in chunked_sentence.subtrees(filter=lambda t: t.label() == 'NP'):
            chunks.append(tuple(subtree))

    # create a Counter object
    chunk_counter = Counter()

    # for-loop through the list of chunks
    for chunk in chunks:
        # increase counter of specific chunk by 1
        chunk_counter[chunk] += 1

    # return 30 most frequent chunks
    return chunk_counter.most_common(30)

# function that pulls verb phrase chunks out of chunked sentences and finds the most common chunks
def vp_chunk_counter(chunked_sentences):

    # create a list to hold chunks
    chunks = list()

    # for-loop through each chunked sentence to extract verb phrase chunks
    for chunked_sentence in chunked_sentences:
        for subtree in chunked_sentence.subtrees(filter=lambda t: t.label() == 'VP'):
            chunks.append(tuple(subtree))

    # create a Counter object
    chunk_counter = Counter()

    # for-loop through the list of chunks
    for chunk in chunks:
        # increase counter of specific chunk by 1
        chunk_counter[chunk] += 1

    # return 30 most frequent chunks
    return chunk_counter.most_common(30)
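
Both counter functions rely on the same idea: each chunk is converted to a tuple so that it is hashable and can be used as a `Counter` key. The following is a minimal sketch of that counting step on its own, using a small hand-made list (`toy_chunks` is a hypothetical stand-in for the chunks extracted from a real text).

```python
from collections import Counter

# Hypothetical chunks, shaped like the tuples built by tuple(subtree)
toy_chunks = [
    (('the', 'DT'), ('plain', 'NN')),
    (('hector', 'NN'),),
    (('the', 'DT'), ('plain', 'NN')),
]

# Passing the list to Counter is equivalent to looping and adding 1 per chunk
chunk_counter = Counter(toy_chunks)

# most_common(n) returns (chunk, count) pairs, highest count first
top_chunk = chunk_counter.most_common(1)
print(top_chunk)
```

Because identical tuples compare equal, the two occurrences of "the plain" are counted under one key.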

In [3]:
from nltk import pos_tag, RegexpParser

# import text of choice here
with open("the_iliad.txt", encoding='utf-8') as text_file:
    the_iliad = text_file.read().lower()


2. With the text imported, we need to split the text into individual sentences and then individual words. This allows us to perform a sentence-by-sentence parsing analysis.

In [4]:
# sentence and word tokenize text here
word_tokenized_text = word_sentence_tokenize(the_iliad)

In [5]:
# store and print any word tokenized sentence here
single_word_tokenized_sentence = word_tokenized_text[90]
print(single_word_tokenized_sentence)


['consistency', 'is', 'no', 'less', 'pertinacious', 'and', 'exacting', 'in', 'its', 'demands', '.']
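
To make the shape of `word_tokenized_text` concrete, here is a rough sketch of the same two-stage idea (split into sentences, then split each sentence into words) using only the standard library. It is a deliberately naive stand-in built on `re`, not NLTK's Punkt tokenizer, and both the function name `naive_word_sentence_tokenize` and the sample string are made up for illustration.

```python
import re

def naive_word_sentence_tokenize(text):
    # Split after ., ! or ? when followed by whitespace (far cruder than Punkt)
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Split each sentence into word tokens and standalone punctuation marks
    return [re.findall(r"\w+|[^\w\s]", sentence) for sentence in sentences if sentence]

sample = "sing, o goddess. the anger of achilles!"
tokens = naive_word_sentence_tokenize(sample)
print(tokens)
```

The result is a list of sentences, each of which is itself a list of tokens, which is why a single index such as `word_tokenized_text[90]` returns one whole sentence.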


## Part-of-speech Tag Text

4. Next we can part-of-speech tag each sentence to allow for syntax parsing. The list named `pos_tagged_text` will hold each part-of-speech tagged sentence from the novel.

In [6]:
pos_tagged_text = []

5. Loop through each word tokenized sentence in `word_tokenized_text` and part-of-speech tag each sentence using `nltk`'s `pos_tag()` function. Append the result to `pos_tagged_text`.

In [7]:
for ws in word_tokenized_text:
    pos_tagged_text.append(pos_tag(ws))

6. Save any part-of-speech tagged sentence in `pos_tagged_text` to a variable named `single_pos_sentence`. Print `single_pos_sentence` as a check to visualize what you have done so far!

In [20]:
# store and print any part-of-speech tagged sentence here
single_pos_sentence = pos_tagged_text[90]
print(single_pos_sentence)

[('consistency', 'NN'), ('is', 'VBZ'), ('no', 'DT'), ('less', 'RBR'), ('pertinacious', 'JJ'), ('and', 'CC'), ('exacting', 'VBG'), ('in', 'IN'), ('its', 'PRP$'), ('demands', 'NNS'), ('.', '.')]
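
Each tagged sentence is a list of `(word, tag)` pairs, so it can be inspected with ordinary Python. As a small sketch, the snippet below copies the sentence printed above into a variable (named `tagged_sentence` here purely for illustration) and pulls out the nouns and the tag frequencies.

```python
from collections import Counter

# Copy of the tagged sentence printed above
tagged_sentence = [('consistency', 'NN'), ('is', 'VBZ'), ('no', 'DT'),
                   ('less', 'RBR'), ('pertinacious', 'JJ'), ('and', 'CC'),
                   ('exacting', 'VBG'), ('in', 'IN'), ('its', 'PRP$'),
                   ('demands', 'NNS'), ('.', '.')]

# Count how often each part-of-speech tag occurs in the sentence
tag_counts = Counter(tag for _word, tag in tagged_sentence)

# Keep every word whose tag starts with NN (singular and plural nouns)
nouns = [word for word, tag in tagged_sentence if tag.startswith('NN')]
print(nouns)
```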


## Chunk Sentences

7. Now that we have part-of-speech tagged our text, we can move on to syntax parsing.

   Begin by defining a piece of chunk grammar `np_chunk_grammar` that will chunk a noun phrase.

In [29]:
np_chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"

8. Create an `nltk` `RegexpParser` object named `np_chunk_parser` using the noun phrase chunk grammar we defined as an argument.

In [30]:
np_chunk_parser = RegexpParser(np_chunk_grammar)
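
To see what this grammar matches, it can help to run it on one short sentence first. The sketch below uses a hand-tagged sentence (invented for illustration, so no tagger model is needed) and assumes only that `nltk` is installed.

```python
from nltk import RegexpParser

# Hand-tagged sentence, made up for this example
tagged = [('the', 'DT'), ('swift', 'JJ'), ('hero', 'NN'),
          ('threw', 'VBD'), ('a', 'DT'), ('spear', 'NN')]

# Optional determiner, any number of adjectives, then a singular noun
parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(tagged)

# Collect each NP subtree as a tuple of (word, tag) pairs
np_chunks = [tuple(subtree)
             for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')]
for chunk in np_chunks:
    print(chunk)
```

Note that `<NN>` matches only the exact tag `NN`, so plural nouns (`NNS`) and proper nouns (`NNP`) fall outside this grammar.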

9. Define a piece of chunk grammar named `vp_chunk_grammar` that will chunk a verb phrase of the following form: a noun phrase, followed by a verb `VB`, followed by an optional adverb `RB`.

In [31]:
vp_chunk_grammar = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"

10. Create an `nltk` `RegexpParser` object named `vp_chunk_parser` using the verb phrase chunk grammar we defined as an argument.

In [32]:
vp_chunk_parser = RegexpParser(vp_chunk_grammar)
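
As a quick check of the verb phrase grammar, the sketch below parses a hand-tagged sentence (invented for illustration) that contains one verb phrase with an adverb and one without. It assumes only that `nltk` is installed.

```python
from nltk import RegexpParser

# Hand-tagged sentence, made up for this example
tagged = [('the', 'DT'), ('hero', 'NN'), ('said', 'VBD'), ('so', 'RB'),
          ('and', 'CC'), ('hector', 'NN'), ('fell', 'VBD')]

# Noun phrase, then any verb tag (VB, VBD, VBZ, ...), then an optional adverb
parser = RegexpParser("VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}")
tree = parser.parse(tagged)

# Collect each VP subtree as a tuple of (word, tag) pairs
vp_chunks = [tuple(subtree)
             for subtree in tree.subtrees(filter=lambda t: t.label() == 'VP')]
for chunk in vp_chunks:
    print(chunk)
```

The `<VB.*>` pattern accepts every verb tag that starts with `VB`, while `<RB.?>?` makes the trailing adverb optional, so both "the hero said so" and "hector fell" are chunked.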

11. Create two empty lists `np_chunked_text` and `vp_chunked_text` that will hold the chunked sentences from our text. 

In [37]:
np_chunked_text = []
vp_chunked_text = []

12. Loop through each part-of-speech tagged sentence in `pos_tagged_text` and noun phrase chunk each sentence using our `RegexpParser`'s `.parse()` method. Append the result to `np_chunked_text`.

13. Within the same loop we defined in the previous task, verb phrase chunk each part-of-speech tagged sentence using your `RegexpParser`'s `.parse()` method. Append the result to `vp_chunked_text`.

In [38]:
for ws in pos_tagged_text:
    np_chunked_text.append(np_chunk_parser.parse(ws))
    vp_chunked_text.append(vp_chunk_parser.parse(ws))

## Analyze Chunks

14. Now that we have chunked our novel, we can analyze the chunk frequencies to gain insights.

    A function `np_chunk_counter()` that returns the `30` most common NP-chunks from a list of chunked sentences is defined at the beginning of the code.

In [40]:
most_common_np_chunks = np_chunk_counter(np_chunked_text)
print(most_common_np_chunks)

[((('hector', 'NN'),), 322), ((('i', 'NN'),), 277), ((('jove', 'NN'),), 257), ((('troy', 'NN'),), 208), ((('vain', 'NN'),), 195), ((('war', 'NN'),), 193), ((('son', 'NN'),), 170), ((('thou', 'NN'),), 158), ((('the', 'DT'), ('plain', 'NN')), 157), ((('the', 'DT'), ('field', 'NN')), 154), ((('the', 'DT'), ('ground', 'NN')), 138), ((('death', 'NN'),), 134), ((('hand', 'NN'),), 134), ((('greece', 'NN'),), 128), ((('heaven', 'NN'),), 127), ((('fate', 'NN'),), 127), ((('thee', 'NN'),), 122), ((('breast', 'NN'),), 121), ((('the', 'DT'), ('trojan', 'NN')), 120), ((('the', 'DT'), ('god', 'NN')), 119), ((('the', 'DT'), ('war', 'NN')), 117), ((('the', 'DT'), ('greeks', 'NN')), 116), ((('blood', 'NN'),), 115), ((('homer', 'NN'),), 112), ((('the', 'DT'), ('king', 'NN')), 105), ((('rage', 'NN'),), 103), ((('force', 'NN'),), 103), ((('care', 'NN'),), 99), ((('head', 'NN'),), 98), ((('man', 'NN'),), 97)]
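
The nested tuples in this output are hard to scan. One possible way to make them easier to read is to join the words of each chunk into a plain phrase, as sketched below on three entries copied from the output above (`chunk_to_phrase` is a hypothetical helper, not part of the project code).

```python
# Three (chunk, count) entries copied from the printed output
top_np_chunks = [
    ((('hector', 'NN'),), 322),
    ((('jove', 'NN'),), 257),
    ((('the', 'DT'), ('plain', 'NN')), 157),
]

def chunk_to_phrase(chunk):
    # Drop the tags and join the words with spaces
    return ' '.join(word for word, _tag in chunk)

readable = [(chunk_to_phrase(chunk), count) for chunk, count in top_np_chunks]
for phrase, count in readable:
    print(f"{phrase}: {count}")
```

The same helper works unchanged on the verb phrase counts, since they share the `(chunk, count)` structure.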


15. A function `vp_chunk_counter()` that returns the `30` most common VP-chunks from a list of chunked sentences is also defined at the beginning of the code.

In [41]:
most_common_vp_chunks = vp_chunk_counter(vp_chunked_text)
print(most_common_vp_chunks)

[((("'t", 'NN'), ('is', 'VBZ')), 19), ((('i', 'NN'), ('am', 'VBP')), 11), ((("'t", 'NN'), ('was', 'VBD')), 11), ((('the', 'DT'), ('hero', 'NN'), ('said', 'VBD')), 9), ((('i', 'NN'), ('know', 'VBP')), 8), ((('i', 'NN'), ('saw', 'VBD')), 8), ((('the', 'DT'), ('scene', 'NN'), ('lies', 'VBZ')), 7), ((('i', 'NN'), ('was', 'VBD')), 6), ((('confess', 'NN'), ("'d", 'VBD')), 6), ((('the', 'DT'), ('scene', 'NN'), ('is', 'VBZ')), 6), ((('view', 'NN'), ("'d", 'VBD')), 5), ((('i', 'NN'), ('felt', 'VBD')), 5), ((('i', 'NN'), ('bear', 'VBP')), 5), ((('hector', 'NN'), ('is', 'VBZ')), 5), ((('vain', 'NN'), ('was', 'VBD')), 5), ((('homer', 'NN'), ('was', 'VBD')), 4), ((('i', 'NN'), ('have', 'VBP')), 4), ((('hunger', 'NN'), ('was', 'VBD')), 4), ((('glory', 'NN'), ('lost', 'VBN')), 4), ((('i', 'NN'), ('see', 'VBP')), 4), ((('war', 'NN'), ('be', 'VB')), 4), ((('the', 'DT'), ('weapon', 'NN'), ('stood', 'VBD')), 4), ((('i', 'NN'), ('go', 'VBP')), 4), ((('the', 'DT'), ('silence', 'NN'), ('broke', 'VBD')), 4),