# Discover Insights into Classic Texts

Novels and other texts contain insights into ideologies and places that are often initially unknown to the reader. By reading a written piece, you uncover the author's opinions on their chosen topic and come to understand both the topic and how the author thinks.

In this project you will perform a natural language parsing analysis to gain deeper insight into one of two famous and often-discussed novels in the public domain: <a href="http://www.gutenberg.org/ebooks/174" target="_blank" rel="noopener noreferrer">Oscar Wilde's _The Picture of Dorian Gray_</a> or <a href="http://www.gutenberg.org/ebooks/6130" target="_blank" rel="noopener noreferrer">Homer's _The Iliad_</a>! Fear not if you haven't heard of or read either novel. One of the beauties of natural language parsing with regular expressions is the ability to gain insight into lengthy pieces of text without a formal read!

By the end of this project, you will find out the main topics of discussion in the novel of your choosing and can begin to discern some of the author's thoughts and beliefs!

## Import and Preprocess Text Data

1. Given to you in the downloadable kit are text files for _The Picture of Dorian Gray_, named `dorian_gray.txt`, and _The Iliad_, named `the_iliad.txt`, sourced from <a href="https://www.gutenberg.org/" target="_blank" rel="noopener noreferrer">Project Gutenberg</a>. Import the text of your choosing, convert it to lowercase, and name it `text` using the following line of code.

   ```py
   text = open("_______.txt",encoding='utf-8').read().lower()
   ```
   
   Replace the blank with the name of the text file for the novel you choose to analyze!

In [1]:
from nltk import pos_tag, RegexpParser
# run the helper notebooks from the kit; this makes word_sentence_tokenize()
# and the chunk counter functions available in this notebook
%run tokenize_words.ipynb
%run chunk_counters.ipynb

# import text of choice here
text = open('dorian_gray.txt', encoding = 'utf-8').read().lower()
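
The one-line `open(...).read()` call above never closes the file explicitly. A minimal sketch of an alternative that uses a `with` block is shown below. The helper name `load_text` is hypothetical (it is not part of the kit), and the demo writes a throwaway file so the sketch runs without the kit's text files.

```python
import os
import tempfile


def load_text(path, encoding="utf-8"):
    """Read a text file and return its contents in lowercase."""
    # The with-block closes the file handle even if reading fails.
    with open(path, encoding=encoding) as f:
        return f.read().lower()


# Demo on a temporary file (an assumption for illustration only).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", encoding="utf-8", delete=False
) as tmp:
    tmp.write("The Picture of Dorian Gray")
    demo_path = tmp.name

demo_text = load_text(demo_path)
os.remove(demo_path)
print(demo_text)  # the picture of dorian gray
```

In the notebook itself, the equivalent call would be `text = load_text('dorian_gray.txt')`.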

2. With the text imported, now you need to split the text into individual sentences and then individual words. This allows you to perform a sentence-by-sentence parsing analysis!

   Provided to you in the downloadable kit is a customized function `word_sentence_tokenize()` that will sentence tokenize a text and then word tokenize each sentence, returning a list of word tokenized sentences. Call the function with `text` as an argument and save the result to a variable named `word_tokenized_text`.

In [2]:
# sentence and word tokenize text here
word_tokenized_text = word_sentence_tokenize(text)

3. Save any word tokenized sentence in `word_tokenized_text` to a variable named `single_word_tokenized_sentence`. Print `single_word_tokenized_sentence` as a check to visualize what you have done so far!

In [3]:
# store and print any word tokenized sentence here
single_word_tokenized_sentence = word_tokenized_text[54]

single_word_tokenized_sentence

['as',
 'soon',
 'as',
 'you',
 'have',
 'one',
 ',',
 'you',
 'seem',
 'to',
 'want',
 'to',
 'throw',
 'it',
 'away',
 '.']
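
To get a feel for what "sentence tokenize, then word tokenize" means, here is a rough sketch that uses only Python's `re` module. It is not the kit's `word_sentence_tokenize()`; the function name `simple_word_sentence_tokenize` and its splitting rules are assumptions made for illustration, and a real tokenizer handles abbreviations and contractions far more carefully.

```python
import re


def simple_word_sentence_tokenize(text):
    """Split text into sentences, then split each sentence into tokens."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # A token is a run of word characters/apostrophes, or a single
    # punctuation character on its own.
    return [re.findall(r"[\w']+|[^\w\s]", s) for s in sentences if s]


sample = "as soon as you have one, you seem to want to throw it away. i felt pale!"
tokens = simple_word_sentence_tokenize(sample)
print(tokens[0])
# ['as', 'soon', 'as', 'you', 'have', 'one', ',', 'you', 'seem', 'to',
#  'want', 'to', 'throw', 'it', 'away', '.']
```

Unlike this sketch, the tokenizer used in the notebook splits contractions, which is why tokens such as `"n't"` show up in the chunk counts later on.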

## Part-of-speech Tag Text

4. Next you can part-of-speech tag each sentence to allow for syntax parsing! Begin by creating a list named `pos_tagged_text` that will hold each part-of-speech tagged sentence from the novel.

In [4]:
# create a list to hold part-of-speech tagged sentences here
pos_tagged_text = []

5. Loop through each word tokenized sentence in `word_tokenized_text` and part-of-speech tag each sentence using `nltk`'s `pos_tag()` function. Append the result to `pos_tagged_text`.

In [5]:
# create a for loop through each word tokenized sentence here
for sentence in word_tokenized_text:
    # part-of-speech tag each sentence and append to list of pos-tagged sentences here
    pos_tagged_text.append(pos_tag(sentence))

    


6. Save any part-of-speech tagged sentence in `pos_tagged_text` to a variable named `single_pos_sentence`. Print `single_pos_sentence` as a check to visualize what you have done so far!

In [6]:
# store and print any part-of-speech tagged sentence here
single_pos_sentence = pos_tagged_text[157]

single_pos_sentence

[('when', 'WRB'),
 ('our', 'PRP$'),
 ('eyes', 'NNS'),
 ('met', 'VBD'),
 (',', ','),
 ('i', 'VB'),
 ('felt', 'VBD'),
 ('that', 'IN'),
 ('i', 'NN'),
 ('was', 'VBD'),
 ('growing', 'VBG'),
 ('pale', 'NN'),
 ('.', '.')]
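
A tagged sentence is just a list of `(word, tag)` tuples, so plain Python is enough to inspect it. The sketch below copies the sentence printed above and counts its tags. Note that the two occurrences of `'i'` received the tags `VB` and `NN` rather than a pronoun tag; this is likely a side effect of lowercasing the text, and it explains why `'i'` appears as a noun in the chunk counts later.

```python
from collections import Counter

# The tagged sentence from the output above, copied by hand.
single_pos_sentence = [
    ('when', 'WRB'), ('our', 'PRP$'), ('eyes', 'NNS'), ('met', 'VBD'),
    (',', ','), ('i', 'VB'), ('felt', 'VBD'), ('that', 'IN'), ('i', 'NN'),
    ('was', 'VBD'), ('growing', 'VBG'), ('pale', 'NN'), ('.', '.'),
]

# Count how often each part-of-speech tag occurs in the sentence.
tag_counts = Counter(tag for _word, tag in single_pos_sentence)
print(tag_counts.most_common(2))  # [('VBD', 3), ('NN', 2)]

# Collect every tag assigned to the token 'i'.
i_tags = [tag for word, tag in single_pos_sentence if word == 'i']
print(i_tags)  # ['VB', 'NN']
```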

## Chunk Sentences

7. Now that you have part-of-speech tagged your text, you can move on to syntax parsing!

   Begin by defining a piece of chunk grammar `np_chunk_grammar` that will chunk a noun phrase. Remember, a noun phrase consists of an optional determiner `DT`, followed by any number of adjectives `JJ`, followed by a noun `NN`.

In [7]:
# define noun phrase chunk grammar here
np_chunk_grammar = 'NP: {<DT>?<JJ>*<NN>}'



8. Create an `nltk` `RegexpParser` object named `np_chunk_parser` using the noun phrase chunk grammar you defined as an argument.

In [8]:
# create noun phrase RegexpParser object here
np_chunk_parser = RegexpParser(np_chunk_grammar)
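
To see what the noun phrase grammar matches, it can help to run it on one small sentence first. The sketch below uses a hand-tagged sentence that is hypothetical (it is not taken from the novel), and the names `demo_np_parser` and `demo_np_tree` are chosen for illustration only.

```python
from nltk import RegexpParser

# Hand-tagged example sentence (an assumption for illustration).
tagged_sentence = [
    ('the', 'DT'), ('old', 'JJ'), ('painter', 'NN'),
    ('opened', 'VBD'), ('a', 'DT'), ('door', 'NN'), ('.', '.'),
]

# Same grammar as above: optional determiner, any adjectives, one noun.
demo_np_parser = RegexpParser('NP: {<DT>?<JJ>*<NN>}')
demo_np_tree = demo_np_parser.parse(tagged_sentence)

# Keep only the subtrees labelled NP and list their (word, tag) leaves.
np_chunks = [
    subtree.leaves()
    for subtree in demo_np_tree.subtrees(filter=lambda t: t.label() == 'NP')
]
print(np_chunks)
# [[('the', 'DT'), ('old', 'JJ'), ('painter', 'NN')], [('a', 'DT'), ('door', 'NN')]]
```

Because the grammar asks for exactly `<NN>`, plural nouns tagged `NNS` are not chunked; widening the pattern to `<NN.*>` would be one way to include them.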


9. Define a piece of chunk grammar named `vp_chunk_grammar` that will chunk a verb phrase of the following form: noun phrase, followed by a verb `VB`, followed by an optional adverb `RB`.

In [9]:
# define verb phrase chunk grammar here
vp_chunk_grammar = 'VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}'

10. Create an `nltk` `RegexpParser` object named `vp_chunk_parser` using the verb phrase chunk grammar you defined as an argument.

In [10]:
# create verb phrase RegexpParser object here
vp_chunk_parser = RegexpParser(vp_chunk_grammar)
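
The verb phrase grammar can be tried out on a couple of small sentences before running it on the whole novel. In the sketch below the tagged sentences are hand-written and hypothetical, and `demo_vp_parser` is a name chosen for illustration. It shows that `<VB.*>` matches any verb tag (`VBD`, `VBZ`, and so on) and that the trailing adverb is optional.

```python
from nltk import RegexpParser

# Same grammar as above: noun phrase, then a verb, then an optional adverb.
demo_vp_parser = RegexpParser('VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}')

# Hand-tagged example sentences (an assumption for illustration).
tagged_sentences = [
    [('the', 'DT'), ('old', 'JJ'), ('painter', 'NN'),
     ('smiled', 'VBD'), ('sadly', 'RB'), ('.', '.')],
    [('life', 'NN'), ('is', 'VBZ'), ('short', 'JJ'), ('.', '.')],
]

vp_chunks = []
for tagged in tagged_sentences:
    tree = demo_vp_parser.parse(tagged)
    # Collect just the words of every VP chunk found in this sentence.
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'VP'):
        vp_chunks.append([word for word, _tag in subtree.leaves()])

print(vp_chunks)
# [['the', 'old', 'painter', 'smiled', 'sadly'], ['life', 'is']]
```

In the second sentence the adjective `short` is left out of the chunk, because the grammar only allows an adverb after the verb.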

11. Create two empty lists `np_chunked_text` and `vp_chunked_text` that will hold the chunked sentences from your text. 

In [11]:
# create a list to hold noun phrase chunked sentences and a list to hold verb phrase chunked sentences here
np_chunked_text = []
vp_chunked_text = []



12. Loop through each part-of-speech tagged sentence in `pos_tagged_text` and noun phrase chunk each sentence using your `RegexpParser`'s `.parse()` method. Append the result to `np_chunked_text`. Within the same loop, verb phrase chunk each part-of-speech tagged sentence using your `RegexpParser`'s `.parse()` method. Append the result to `vp_chunked_text`.

In [12]:
# create a for loop through each pos-tagged sentence here
for sentence in pos_tagged_text:
    # chunk each sentence and append to list here
    np_chunked_text.append(np_chunk_parser.parse(sentence))
    vp_chunked_text.append(vp_chunk_parser.parse(sentence))
    


## Analyze Chunks

13. Now that you have chunked your novel, you can analyze the chunk frequencies to gain insights!

    A function `np_chunk_counter()` that returns the `30` most common NP-chunks from a list of chunked sentences has been imported for you in the code block for task 1. Call `np_chunk_counter()` with `np_chunked_text` as an argument and save the result to a variable named `most_common_np_chunks`. Print `most_common_np_chunks`. What sticks out to you about the most common noun phrase chunks? Are you surprised by anything? Open **Discover Insights into Classic Texts_Solution.ipynb** to see our analysis.
    
    Want to see how `np_chunk_counter()` works? Open **chunk_counters.ipynb** from the kit you downloaded and inspect `np_chunk_counter()`.

In [13]:
# store and print the most common NP-chunks here
most_common_np_chunks = np_chunk_counter(np_chunked_text)

most_common_np_chunks

[((('i', 'NN'),), 963),
 ((('henry', 'NN'),), 200),
 ((('lord', 'NN'),), 197),
 ((('life', 'NN'),), 170),
 ((('harry', 'NN'),), 136),
 ((('dorian', 'JJ'), ('gray', 'NN')), 127),
 ((('something', 'NN'),), 126),
 ((('nothing', 'NN'),), 93),
 ((('basil', 'NN'),), 85),
 ((('the', 'DT'), ('world', 'NN')), 70),
 ((('everything', 'NN'),), 69),
 ((('anything', 'NN'),), 68),
 ((('hallward', 'NN'),), 68),
 ((('the', 'DT'), ('man', 'NN')), 61),
 ((('the', 'DT'), ('room', 'NN')), 60),
 ((('face', 'NN'),), 57),
 ((('the', 'DT'), ('door', 'NN')), 56),
 ((('love', 'NN'),), 55),
 ((('art', 'NN'),), 52),
 ((('course', 'NN'),), 51),
 ((('the', 'DT'), ('picture', 'NN')), 46),
 ((('the', 'DT'), ('lad', 'NN')), 45),
 ((('head', 'NN'),), 44),
 ((('round', 'NN'),), 44),
 ((('hand', 'NN'),), 44),
 ((('sibyl', 'NN'),), 41),
 ((('the', 'DT'), ('table', 'NN')), 40),
 ((('the', 'DT'), ('painter', 'NN')), 38),
 ((('sir', 'NN'),), 38),
 ((('a', 'DT'), ('moment', 'NN')), 38)]
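
If you are curious how such a counter could work, here is a minimal sketch. It is not the kit's `np_chunk_counter()`; the function name `count_chunks`, its parameters, and the tiny hand-built trees are assumptions made for illustration. The idea is to walk each chunked sentence, keep the subtrees with the wanted label, and tally them with a `Counter`.

```python
from collections import Counter

from nltk.tree import Tree


def count_chunks(chunked_sentences, label, top_n=30):
    """Return the top_n most common chunks carrying the given label."""
    counter = Counter()
    for tree in chunked_sentences:
        for subtree in tree.subtrees(filter=lambda t: t.label() == label):
            # A chunk is stored as a tuple of (word, tag) pairs so it
            # can be used as a dictionary key.
            counter[tuple(subtree)] += 1
    return counter.most_common(top_n)


# Two hand-built chunked sentences (hypothetical, for illustration).
demo_chunked = [
    Tree('S', [Tree('NP', [('the', 'DT'), ('world', 'NN')]),
               ('is', 'VBZ'),
               Tree('NP', [('art', 'NN')])]),
    Tree('S', [Tree('NP', [('the', 'DT'), ('world', 'NN')]),
               ('waits', 'VBZ')]),
]

demo_counts = count_chunks(demo_chunked, 'NP')
print(demo_counts)
# [((('the', 'DT'), ('world', 'NN')), 2), ((('art', 'NN'),), 1)]
```

The result has the same shape as the output above: each entry pairs a tuple of `(word, tag)` pairs with its count.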

14. A function `vp_chunk_counter()` that returns the `30` most common VP-chunks from a list of chunked sentences has been imported for you in the code block for task 1. Call `vp_chunk_counter()` with `vp_chunked_text` as an argument and save the result to a variable named `most_common_vp_chunks`. Print `most_common_vp_chunks`. What sticks out to you about the most common verb phrase chunks? Are you surprised by anything? Open **Discover Insights into Classic Texts_Solution.ipynb** to see our analysis.

    Want to see how `vp_chunk_counter()` works? Open **chunk_counters.ipynb** from the kit you downloaded and inspect `vp_chunk_counter()`.

In [14]:
# store and print the most common VP-chunks here
most_common_vp_chunks = vp_chunk_counter(vp_chunked_text)

most_common_vp_chunks

[((('i', 'NN'), ('am', 'VBP')), 101),
 ((('i', 'NN'), ('was', 'VBD')), 40),
 ((('i', 'NN'), ('want', 'VBP')), 37),
 ((('i', 'NN'), ('know', 'VBP')), 33),
 ((('i', 'NN'), ('do', 'VBP'), ("n't", 'RB')), 32),
 ((('i', 'NN'), ('have', 'VBP')), 32),
 ((('i', 'NN'), ('had', 'VBD')), 31),
 ((('i', 'NN'), ('suppose', 'VBP')), 17),
 ((('i', 'NN'), ('think', 'VBP')), 16),
 ((('i', 'NN'), ('am', 'VBP'), ('not', 'RB')), 14),
 ((('i', 'NN'), ('thought', 'VBD')), 13),
 ((('i', 'NN'), ('believe', 'VBP')), 12),
 ((('dorian', 'JJ'), ('gray', 'NN'), ('was', 'VBD')), 11),
 ((('i', 'NN'), ('am', 'VBP'), ('so', 'RB')), 11),
 ((('henry', 'NN'), ('had', 'VBD')), 11),
 ((('i', 'NN'), ('did', 'VBD'), ("n't", 'RB')), 9),
 ((('i', 'NN'), ('met', 'VBD')), 9),
 ((('i', 'NN'), ('said', 'VBD')), 9),
 ((('i', 'NN'), ('am', 'VBP'), ('quite', 'RB')), 8),
 ((('i', 'NN'), ('see', 'VBP')), 8),
 ((('i', 'NN'), ('did', 'VBD'), ('not', 'RB')), 7),
 ((('i', 'NN'), ('have', 'VBP'), ('ever', 'RB')), 7),
 ((('life', 'NN'), ('has

## Go Further On Your Own!

16. Amazing! You have performed a syntax parsing analysis on a novel and gained insight into both the meaning of the text and how the author thinks, without reading a page!

    Now's your chance to get creative. Is there a different pattern of parts-of-speech you want to identify and count in the novel you selected? Add a new piece of chunk grammar and repeat the process of chunking. What do you find?

In [19]:
pattern_chunk_grammar = 'PAT: {<DT>?<JJ>*<NN><VB.*>(<RB.?>|<JJ>)}'
# build the parser once instead of once per sentence
pat_chunk_parser = RegexpParser(pattern_chunk_grammar)
pat_chunked_text = []

for sentence in pos_tagged_text:
    pat_chunked_text.append(pat_chunk_parser.parse(sentence))

most_common_pat_chunks = pat_chunk_counter(pat_chunked_text)
most_common_pat_chunks

[((('i', 'NN'), ('do', 'VBP'), ("n't", 'RB')), 34),
 ((('i', 'NN'), ('am', 'VBP'), ('not', 'RB')), 14),
 ((('i', 'NN'), ('am', 'VBP'), ('afraid', 'JJ')), 12),
 ((('i', 'NN'), ('am', 'VBP'), ('so', 'RB')), 11),
 ((('i', 'NN'), ('did', 'VBD'), ("n't", 'RB')), 10),
 ((('i', 'NN'), ('am', 'VBP'), ('quite', 'RB')), 8),
 ((('i', 'NN'), ('did', 'VBD'), ('not', 'RB')), 7),
 ((('i', 'NN'), ('have', 'VBP'), ('ever', 'RB')), 7),
 ((('i', 'NN'), ('was', 'VBD'), ('afraid', 'JJ')), 5),
 ((('i', 'NN'), ('am', 'VBP'), ('too', 'RB')), 4),
 ((('i', 'NN'), ('have', 'VBP'), ('not', 'RB')), 3),
 ((('i', 'NN'), ('have', 'VBP'), ('never', 'RB')), 3),
 ((('i', 'NN'), ('am', 'VBP'), ('awfully', 'RB')), 3),
 ((('i', 'NN'), ('am', 'VBP'), ('sure', 'JJ')), 3),
 ((('i', 'NN'), ('had', 'VBD'), ('ever', 'RB')), 3),
 ((('i', 'NN'), ('was', 'VBD'), ('not', 'RB')), 3),
 ((('i', 'NN'), ('am', 'VBP'), ('sorry', 'JJ')), 3),
 ((('i', 'NN'), ('had', 'VBD'), ('never', 'RB')), 3),
 ((('i', 'NN'), ('sha', 'VBP'), ("n't", 'RB')

17. Not the biggest fan of _The Picture of Dorian Gray_ or _The Iliad_? No worries! Included in the downloadable kit is a blank text file named `my_text.txt`. Open the file and copy any text of your choice (novel, script, article, etc.) into it. Save the file and then return to this notebook (**Discover Insights into Classic Texts.ipynb**). Change the file name passed to `open()` in task 1 to `my_text.txt` and rerun this notebook to perform a syntax parsing analysis on your text! What insights or deeper meanings did you discover?