# Insights into Classic Texts

# NLP

## Natural Language Processing analysis of Oscar Wilde's "The Picture of Dorian Gray" and Homer's "The Iliad".

#### Using regular-expression chunk parsing to get an idea of some of the authors' thoughts and beliefs, and to gain insight into lengthy texts without reading the entire books.

#### These books came from the public domain:

- <a href="http://www.gutenberg.org/ebooks/174" target="_blank" rel="noopener noreferrer">Oscar Wilde's _The Picture of Dorian Gray_</a>
- <a href="http://www.gutenberg.org/ebooks/6130" target="_blank" rel="noopener noreferrer">Homer's _The Iliad_</a>



## Import the Data

In [1]:
from nltk import pos_tag, RegexpParser
import import_ipynb
from tokenize_words import word_sentence_tokenize
from chunk_counters import np_chunk_counter, vp_chunk_counter

# import text of choice here
text = open("dorian_gray.txt",encoding='utf-8').read().lower()
#text = open("the_iliad.txt",encoding='utf-8').read().lower()
#text = open("my_text.txt",encoding='utf-8').read().lower()
print(text)

importing Jupyter notebook from tokenize_words.ipynb
importing Jupyter notebook from chunk_counters.ipynb
the picture of dorian gray

by

oscar wilde




the preface

the artist is the creator of beautiful things.  to reveal art and
conceal the artist is art's aim.  the critic is he who can translate
into another manner or a new material his impression of beautiful
things.

the highest as the lowest form of criticism is a mode of autobiography.
those who find ugly meanings in beautiful things are corrupt without
being charming.  this is a fault.

those who find beautiful meanings in beautiful things are the
cultivated.  for these there is hope.  they are the elect to whom
beautiful things mean only beauty.

there is no such thing as a moral or an immoral book.  books are well
written, or badly written.  that is all.

the nineteenth century dislike of realism is the rage of caliban seeing
his own face in a glass.

the nineteenth century dislike of romanticism is the rage of caliban
not 

## Split the Data

### Split the text into individual sentences and then individual words. 

In [2]:
import nltk
nltk.download('punkt')

# sentence and word tokenize text here
word_tokenized_text = word_sentence_tokenize(text)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/wranglerdeb/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
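
The `word_sentence_tokenize` helper lives in the local `tokenize_words` notebook, which is not shown here. As a rough, standard-library-only sketch of the shape of its result (a list of sentences, each a list of word tokens), the hypothetical function below splits on sentence-ending punctuation and then on words. It is an approximation for illustration, not the helper itself, and NLTK's Punkt tokenizer handles abbreviations and contractions differently.

```python
import re

def rough_word_sentence_tokenize(text):
    """Hypothetical stand-in: return a list of sentences, each a list of tokens."""
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Keep runs of word characters (with internal apostrophes) and single punctuation marks.
    token_pattern = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
    return [token_pattern.findall(sentence) for sentence in sentences if sentence]

sample = ("the artist is the creator of beautiful things.  "
          "to reveal art and conceal the artist is art's aim.")
tokenized = rough_word_sentence_tokenize(sample)
print(len(tokenized))   # number of sentences found
print(tokenized[0])     # tokens of the first sentence
```

Each inner list can then be passed to a part-of-speech tagger one sentence at a time.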


In [3]:
# store and print any word tokenized sentence here
single_word_tokenized_sentence = word_tokenized_text[100]
print(single_word_tokenized_sentence)



## POS

## Part-of-speech Tag Text

In [4]:
# create a list to hold part-of-speech tagged sentences here
pos_tagged_text = list()

In [5]:
nltk.download('averaged_perceptron_tagger')
# create a for loop through each word tokenized sentence here
for word_tokenized_sentence in word_tokenized_text:
    # part-of-speech tag each sentence and append to list of pos-tagged sentences here
    pos_tagged_text.append(pos_tag(word_tokenized_sentence))

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/wranglerdeb/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [7]:
# store and print any part-of-speech tagged sentence here
single_pos_sentence = pos_tagged_text[100]
print(single_pos_sentence)

[('it', 'PRP'), ('seems', 'VBZ'), ('to', 'TO'), ('be', 'VB'), ('the', 'DT'), ('one', 'CD'), ('thing', 'NN'), ('that', 'WDT'), ('can', 'MD'), ('make', 'VB'), ('modern', 'JJ'), ('life', 'NN'), ('mysterious', 'JJ'), ('or', 'CC'), ('marvellous', 'JJ'), ('to', 'TO'), ('us', 'PRP'), ('.', '.')]


- 'VB' means Verb, base form.
- 'VBZ' means Verb, 3rd person singular present tense.
- 'PRP' means Personal Pronoun ("I", "you", "he", "she", "it", "we", "they").
- 'CD' means Cardinal Number.
- 'JJ' means Adjective.
- 'WDT' means Wh-determiner: words like 'that', 'which', 'what', 'whatever', 'whichever'.
- 'MD' means Modal verb ("can", "will", "should").
- 'CC' means Coordinating Conjunction ("and", "or", "but").
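
To make tagged output easier to read, the tags can be looked up in a small dictionary. This is a minimal sketch: `TAG_DESCRIPTIONS` and `describe_tags` are hypothetical names introduced here, and the dictionary covers only the Penn Treebank tags that appear in the sample sentence above.

```python
# Hypothetical lookup table for the tags seen in the sample sentence.
TAG_DESCRIPTIONS = {
    "PRP": "personal pronoun",
    "VBZ": "verb, 3rd person singular present",
    "TO": "the word 'to'",
    "VB": "verb, base form",
    "DT": "determiner",
    "CD": "cardinal number",
    "NN": "noun, singular",
    "WDT": "wh-determiner",
    "MD": "modal verb",
    "JJ": "adjective",
    "CC": "coordinating conjunction",
    ".": "sentence-ending punctuation",
}

def describe_tags(tagged_sentence):
    """Pair each word with a readable description of its tag."""
    return [(word, TAG_DESCRIPTIONS.get(tag, "unknown tag: " + tag))
            for word, tag in tagged_sentence]

# First few (word, tag) pairs copied from the tagged sentence above.
sample_tagged = [('it', 'PRP'), ('seems', 'VBZ'), ('to', 'TO'), ('be', 'VB'),
                 ('the', 'DT'), ('one', 'CD'), ('thing', 'NN')]
for word, description in describe_tags(sample_tagged):
    print(f"{word:>6}  {description}")
```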

## Chunk Sentences

### Noun Phrase

In [10]:
# define noun phrase chunk grammar here
np_chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"

In [11]:
# create noun phrase RegexpParser object here
np_chunk_parser = RegexpParser(np_chunk_grammar)

### Verb Phrase

In [12]:
# define verb phrase chunk grammar here
vp_chunk_grammar = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"

In [13]:
# create verb phrase RegexpParser object here
vp_chunk_parser = RegexpParser(vp_chunk_grammar)
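
To see what the two tag patterns match, the sketch below treats a tagged sentence as a string of `<TAG>` tokens and applies an ordinary regular expression to it. It uses only the standard library and illustrates the idea; it is not NLTK's `RegexpParser`. `tag_pattern_to_regex` and `find_chunks` are hypothetical helpers, and they take only the part of the grammar inside the braces. The second sample sentence is made up for illustration.

```python
import re

def tag_pattern_to_regex(pattern):
    """Turn a tag pattern such as '<DT>?<JJ>*<NN>' into a plain regex string."""
    def convert(match):
        # Inside a tag, '.' should match any character except the angle brackets.
        body = match.group(1).replace(".", "[^<>]")
        return f"(?:<(?:{body})>)"
    return re.sub(r"<([^>]+)>", convert, pattern)

def find_chunks(tagged_sentence, pattern):
    """Return each matching run of (word, tag) pairs as a tuple."""
    regex = re.compile(tag_pattern_to_regex(pattern))
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_sentence)
    chunks = []
    for match in regex.finditer(tag_string):
        start = tag_string.count("<", 0, match.start())  # index of first token
        length = match.group().count("<")                # number of tokens matched
        chunks.append(tuple(tagged_sentence[start:start + length]))
    return chunks

# Tagged sentence copied from the part-of-speech output above.
sentence = [('it', 'PRP'), ('seems', 'VBZ'), ('to', 'TO'), ('be', 'VB'),
            ('the', 'DT'), ('one', 'CD'), ('thing', 'NN'), ('that', 'WDT'),
            ('can', 'MD'), ('make', 'VB'), ('modern', 'JJ'), ('life', 'NN'),
            ('mysterious', 'JJ'), ('or', 'CC'), ('marvellous', 'JJ'),
            ('to', 'TO'), ('us', 'PRP'), ('.', '.')]
np_chunks = find_chunks(sentence, "<DT>?<JJ>*<NN>")
print(np_chunks)  # [(('thing', 'NN'),), (('modern', 'JJ'), ('life', 'NN'))]

# Made-up sentence, tagged the way the lowercased text tends to be tagged.
made_up = [('i', 'NN'), ('do', 'VBP'), ("n't", 'RB'), ('know', 'VB'), ('.', '.')]
vp_chunks = find_chunks(made_up, "<DT>?<JJ>*<NN><VB.*><RB.?>?")
print(vp_chunks)  # [(('i', 'NN'), ('do', 'VBP'), ("n't", 'RB'))]
```

Note that "the one thing" is not a single noun phrase here: 'one' is tagged 'CD', which the pattern does not allow between the determiner and the noun. The verb phrase pattern also requires a noun immediately before the verb, so a verb without a preceding 'NN' is never chunked.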

In [14]:
# create a list to hold noun phrase chunked sentences and a list to hold verb phrase chunked sentences here
np_chunked_text = list()
vp_chunked_text = list()

In [15]:
# create a for loop through each pos-tagged sentence here
for pos_tagged_sentence in pos_tagged_text:
    # chunk each sentence and append to list here
    np_chunked_text.append(np_chunk_parser.parse(pos_tagged_sentence))

In [16]:
# create a for loop through each pos-tagged sentence here
for pos_tagged_sentence in pos_tagged_text:
    # chunk each sentence and append to lists here
    vp_chunked_text.append(vp_chunk_parser.parse(pos_tagged_sentence))

## Analyze Chunks

### The novel is now "chunked".  The np_chunk_counter tool will count the frequency of the 30 most common NOUN PHRASES in the book. 

In [19]:
# store and print the most common NP-chunks here
most_common_np_chunks = np_chunk_counter(np_chunked_text)
print(most_common_np_chunks)

[((('i', 'NN'),), 963), ((('henry', 'NN'),), 200), ((('lord', 'NN'),), 197), ((('life', 'NN'),), 170), ((('harry', 'NN'),), 136), ((('dorian', 'JJ'), ('gray', 'NN')), 127), ((('something', 'NN'),), 126), ((('nothing', 'NN'),), 93), ((('basil', 'NN'),), 85), ((('the', 'DT'), ('world', 'NN')), 70), ((('everything', 'NN'),), 69), ((('anything', 'NN'),), 68), ((('hallward', 'NN'),), 68), ((('the', 'DT'), ('man', 'NN')), 61), ((('the', 'DT'), ('room', 'NN')), 60), ((('face', 'NN'),), 57), ((('the', 'DT'), ('door', 'NN')), 56), ((('love', 'NN'),), 55), ((('art', 'NN'),), 52), ((('course', 'NN'),), 51), ((('the', 'DT'), ('picture', 'NN')), 46), ((('the', 'DT'), ('lad', 'NN')), 45), ((('head', 'NN'),), 44), ((('round', 'NN'),), 44), ((('hand', 'NN'),), 44), ((('sibyl', 'NN'),), 41), ((('the', 'DT'), ('table', 'NN')), 40), ((('the', 'DT'), ('painter', 'NN')), 38), ((('sir', 'NN'),), 38), ((('a', 'DT'), ('moment', 'NN')), 38)]


### Tags that appear in the NOUN PHRASE counts above:
- 'NN' means Noun, singular.
- 'DT' means Determiner (words like "the" and "a").
- 'JJ' means Adjective.

Note that 'i' is tagged 'NN' rather than 'PRP', most likely because the text was lowercased before tagging.
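
`np_chunk_counter` comes from the local `chunk_counters` notebook, which is not shown here. As a minimal sketch of the counting step, assuming each sentence has already been reduced to a list of chunk tuples, `collections.Counter` produces output in the same `(chunk, count)` shape as above. `count_chunks` is a hypothetical name, and the real helper presumably walks NLTK tree objects instead of plain lists.

```python
from collections import Counter

def count_chunks(chunked_sentences, top_n=30):
    """Count identical chunks across sentences and return the most common ones."""
    counts = Counter()
    for chunks in chunked_sentences:
        counts.update(chunks)  # each chunk is a hashable tuple of (word, tag) pairs
    return counts.most_common(top_n)

# Toy data: three "sentences", each already reduced to its chunks.
toy = [
    [(('the', 'DT'), ('world', 'NN')), (('life', 'NN'),)],
    [(('life', 'NN'),)],
    [(('life', 'NN'),), (('the', 'DT'), ('world', 'NN'))],
]
top_chunks = count_chunks(toy, top_n=2)
print(top_chunks)  # [((('life', 'NN'),), 3), ((('the', 'DT'), ('world', 'NN')), 2)]
```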

In [20]:
# store and print the most common VP-chunks here
most_common_vp_chunks = vp_chunk_counter(vp_chunked_text)
print(most_common_vp_chunks)

[((('i', 'NN'), ('am', 'VBP')), 101), ((('i', 'NN'), ('was', 'VBD')), 40), ((('i', 'NN'), ('want', 'VBP')), 37), ((('i', 'NN'), ('know', 'VBP')), 33), ((('i', 'NN'), ('do', 'VBP'), ("n't", 'RB')), 32), ((('i', 'NN'), ('have', 'VBP')), 32), ((('i', 'NN'), ('had', 'VBD')), 31), ((('i', 'NN'), ('suppose', 'VBP')), 17), ((('i', 'NN'), ('think', 'VBP')), 16), ((('i', 'NN'), ('am', 'VBP'), ('not', 'RB')), 14), ((('i', 'NN'), ('thought', 'VBD')), 13), ((('i', 'NN'), ('believe', 'VBP')), 12), ((('dorian', 'JJ'), ('gray', 'NN'), ('was', 'VBD')), 11), ((('i', 'NN'), ('am', 'VBP'), ('so', 'RB')), 11), ((('henry', 'NN'), ('had', 'VBD')), 11), ((('i', 'NN'), ('did', 'VBD'), ("n't", 'RB')), 9), ((('i', 'NN'), ('met', 'VBD')), 9), ((('i', 'NN'), ('said', 'VBD')), 9), ((('i', 'NN'), ('am', 'VBP'), ('quite', 'RB')), 8), ((('i', 'NN'), ('see', 'VBP')), 8), ((('i', 'NN'), ('did', 'VBD'), ('not', 'RB')), 7), ((('i', 'NN'), ('have', 'VBP'), ('ever', 'RB')), 7), ((('life', 'NN'), ('has', 'VBZ')), 7), ((('i'

### The vp_chunk_counter tool counted the frequency of the 30 most common VERB PHRASES in the book.
- 'VBP' means Verb, non-3rd person singular present tense.
- 'VBD' means Verb, past tense.
- 'RB' means Adverb.