# Parsing with Regular Expressions
#### Discover Insights into Classic Texts

Novels and other texts contain insights into ideologies and places that are often unfamiliar to the reader at first. By reading a written piece, you uncover the author's opinions on their chosen topic and come to understand both the topic and how the author thinks.

In this project, you will perform a natural language parsing analysis to gain deeper insight into one of two famous and often-discussed works in the public domain: Oscar Wilde’s The Picture of Dorian Gray or Homer’s The Iliad. Fear not if you haven’t read or heard of them: one of the beauties of natural language parsing with regular expressions is the ability to gain insight into lengthy pieces of text without a formal read!

By the end of this project, you will know the main topics of discussion in the text of your choosing and can begin to discern some of the author’s thoughts and beliefs!

Project from Codecademy https://www.codecademy.com/paths/data-science/tracks/natural-language-processing-dsp/modules/parsing-with-regular-expressions-dsp/projects/nlp-regex-parsing-project <br>
Adapted to work outside the Codecademy platform 

- Importing toolkits for Natural Language Processing: NLTK is a leading platform for building Python programs that work with human language data (www.nltk.org)
- Importing Counter: used to count tokens and highlight the main topics of discussion in the novel

In [1]:
import nltk
from nltk import pos_tag, RegexpParser
from collections import Counter

# Uncomment on the first run to download the tokenizer and tagger models.
# If NLTK raises a LookupError, its message names the resource to download.
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

In [2]:
# importing text of choice here (lowercased so that counts ignore capitalization)
with open("the_iliad.txt", encoding="utf-8") as text_file:
    text = text_file.read().lower()

In [3]:
# sentence and word tokenizing text here

sent_text = nltk.sent_tokenize(text) # this gives us a list of sentences

# now looping over each sentence and tokenizing it separately
word_tokenized_text = []

for sentence in sent_text:
    tokenized_text = nltk.word_tokenize(sentence)
    word_tokenized_text.append(tokenized_text)

In [4]:
# storing and printing any word tokenized sentence here

single_word_tokenized_sentence = word_tokenized_text[100]
print(single_word_tokenized_sentence)

['and', 'this', 'difficulty', 'attaches', 'itself', 'more', 'closely', 'to', 'an', 'age', 'in', 'which', 'progress', 'has', 'gained', 'a', 'strong', 'ascendency', 'over', 'prejudice', ',', 'and', 'in', 'which', 'persons', 'and', 'things', 'are', ',', 'day', 'by', 'day', ',', 'finding', 'their', 'real', 'level', ',', 'in', 'lieu', 'of', 'their', 'conventional', 'value', '.']


In [5]:
# creating a list to hold part-of-speech tagged sentences here
pos_tagged_text = list()


In [6]:
# creating a for loop through each word tokenized sentence here

for word_tokenized_sentence in word_tokenized_text:
    # part-of-speech tagging each sentence and appending to list of pos-tagged sentences here
    pos_tagged_result = pos_tag(word_tokenized_sentence)
    pos_tagged_text.append(pos_tagged_result)

In [7]:
# storing and printing any part-of-speech tagged sentence here

single_pos_sentence = pos_tagged_text[100]
print(single_pos_sentence)

[('and', 'CC'), ('this', 'DT'), ('difficulty', 'NN'), ('attaches', 'VBZ'), ('itself', 'PRP'), ('more', 'RBR'), ('closely', 'RB'), ('to', 'TO'), ('an', 'DT'), ('age', 'NN'), ('in', 'IN'), ('which', 'WDT'), ('progress', 'NN'), ('has', 'VBZ'), ('gained', 'VBN'), ('a', 'DT'), ('strong', 'JJ'), ('ascendency', 'NN'), ('over', 'IN'), ('prejudice', 'NN'), (',', ','), ('and', 'CC'), ('in', 'IN'), ('which', 'WDT'), ('persons', 'NNS'), ('and', 'CC'), ('things', 'NNS'), ('are', 'VBP'), (',', ','), ('day', 'NN'), ('by', 'IN'), ('day', 'NN'), (',', ','), ('finding', 'VBG'), ('their', 'PRP$'), ('real', 'JJ'), ('level', 'NN'), (',', ','), ('in', 'IN'), ('lieu', 'NN'), ('of', 'IN'), ('their', 'PRP$'), ('conventional', 'JJ'), ('value', 'NN'), ('.', '.')]


In [8]:
# defining noun phrase chunk grammar here:
# an optional determiner, any number of adjectives, then a singular or mass noun
chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"

In [9]:
# creating noun phrase RegexpParser object here
np_chunk_parser = RegexpParser(chunk_grammar)
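
To see what this grammar actually matches, it can help to run it on a tiny sentence that has been tagged by hand. The sketch below is only an illustration: the sentence and its tags are made up rather than taken from the novel, which also means no tokenizer or tagger data is needed.

```python
from nltk import RegexpParser

# Hand-tagged toy sentence (hypothetical, not taken from the novel).
tagged = [
    ("the", "DT"), ("swift", "JJ"), ("chariot", "NN"),
    ("crossed", "VBD"),
    ("a", "DT"), ("plain", "NN"),
    ("near", "IN"), ("ships", "NNS"),
]

parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(tagged)

# Collect the words of every NP subtree, in sentence order.
np_chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(np_chunks)
```

Note that `ships` is left outside any chunk: it is tagged `NNS`, and `<NN>` only matches the exact tag `NN`. The same thing happens to `persons` and `things` in the parsed sentence printed further down.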

In [10]:
# defining verb phrase chunk grammar here:
# a noun phrase, followed by a verb of any form, then an optional adverb
vp_chunk_grammar = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"

In [11]:
# creating verb phrase RegexpParser object here
vp_chunk_parser = RegexpParser(vp_chunk_grammar)
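
The `<VB.*>` and `<RB.?>` parts of the verb phrase grammar use regular-expression wildcards inside the tag names. The sketch below uses Python's `re` module to show which tags each pattern would cover. It assumes that a tag pattern has to match the whole tag, and it is an approximation for illustration, not NLTK's exact internal translation; the list of tags is just a small sample.

```python
import re

# Tag patterns from the VP grammar, written as plain regular expressions.
# Assumption: each pattern must match the entire tag.
patterns = {"VB.*": [], "RB.?": []}

# A small sample of Penn Treebank tags to test against.
tags = ["VB", "VBD", "VBZ", "VBG", "RB", "RBR", "RBS", "NN", "JJ"]

for pattern, matched in patterns.items():
    for tag in tags:
        if re.fullmatch(pattern, tag):
            matched.append(tag)

print(patterns)
```

So `<VB.*>` accepts any verb form, while `<RB.?>` accepts a plain, comparative, or superlative adverb; the trailing `?` after `<RB.?>` in the grammar makes the adverb optional.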

In [12]:
# creating a list to hold noun phrase chunked sentences and a list to hold verb phrase chunked sentences here
np_chunked_text = []
vp_chunked_text = []

In [13]:
# creating a for loop through each pos-tagged sentence here
for pos_tagged_sentence in pos_tagged_text:
    # chunking each sentence and appending to lists here

    # for noun phrase chunks first
    np_chunked_text.append(np_chunk_parser.parse(pos_tagged_sentence))
    # and verb phrase chunks also
    vp_chunked_text.append(vp_chunk_parser.parse(pos_tagged_sentence))

print(np_chunked_text[100])

(S
  and/CC
  (NP this/DT difficulty/NN)
  attaches/VBZ
  itself/PRP
  more/RBR
  closely/RB
  to/TO
  (NP an/DT age/NN)
  in/IN
  which/WDT
  (NP progress/NN)
  has/VBZ
  gained/VBN
  (NP a/DT strong/JJ ascendency/NN)
  over/IN
  (NP prejudice/NN)
  ,/,
  and/CC
  in/IN
  which/WDT
  persons/NNS
  and/CC
  things/NNS
  are/VBP
  ,/,
  (NP day/NN)
  by/IN
  (NP day/NN)
  ,/,
  finding/VBG
  their/PRP$
  (NP real/JJ level/NN)
  ,/,
  in/IN
  (NP lieu/NN)
  of/IN
  their/PRP$
  (NP conventional/JJ value/NN)
  ./.)


In [14]:
# defining the np_chunk_counter function

def np_chunk_counter(list_of_trees):
    """Print the 30 most common (word, tag) tokens found inside NP chunks."""
    token_list = []
    for tree in list_of_trees:
        for subtree in tree.subtrees():
            if subtree.label() == "NP":
                # each element of an NP subtree is a (word, tag) tuple
                for token in subtree:
                    token_list.append(token)
    # the result is printed, not returned
    print(Counter(token_list).most_common(30))



In [15]:
# defining the vp_chunk_counter function

def vp_chunk_counter(list_of_trees):
    """Print the 30 most common (word, tag) tokens found inside VP chunks."""
    token_list = []
    for tree in list_of_trees:
        for subtree in tree.subtrees():
            if subtree.label() == "VP":
                # each element of a VP subtree is a (word, tag) tuple
                for token in subtree:
                    token_list.append(token)
    # the result is printed, not returned
    print(Counter(token_list).most_common(30))
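
The two helper functions above count the individual `(word, tag)` tokens that occur inside chunks. One possible variation is to count whole chunks instead, by turning each chunk into a tuple so that it can serve as a `Counter` key. The sketch below contrasts the two approaches on a few made-up chunks; plain lists stand in for the NLTK subtrees, and the data is hypothetical.

```python
from collections import Counter

# Stand-ins for NP chunks pulled out of parsed sentences: each chunk is a
# list of (word, tag) pairs. These toy chunks are hypothetical.
chunks = [
    [("the", "DT"), ("war", "NN")],
    [("the", "DT"), ("plain", "NN")],
    [("the", "DT"), ("war", "NN")],
    [("hector", "NN")],
]

# Counting individual tokens, as the helper functions above do.
token_counts = Counter(token for chunk in chunks for token in chunk)

# Counting whole chunks: a tuple is hashable, so it can be a Counter key.
chunk_counts = Counter(tuple(chunk) for chunk in chunks)

print(token_counts.most_common(1))
print(chunk_counts.most_common(1))
```

Token counts tend to be dominated by determiners such as `the`, whereas chunk counts surface complete phrases such as `the war`. Which view is more useful depends on the question being asked.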

In [16]:
# printing the most common tokens inside NP-chunks here
# (np_chunk_counter prints its result and returns None, so there is nothing to store)
np_chunk_counter(np_chunked_text)

[(('the', 'DT'), 10044), (('a', 'DT'), 1770), (('thy', 'JJ'), 523), (('this', 'DT'), 432), (('hector', 'NN'), 417), (('war', 'NN'), 356), (('jove', 'NN'), 311), (('i', 'NN'), 288), (('son', 'NN'), 284), (('day', 'NN'), 269), (('god', 'NN'), 264), (('great', 'JJ'), 253), (('hand', 'NN'), 252), (('plain', 'NN'), 248), (('troy', 'NN'), 248), (('an', 'DT'), 232), (('no', 'DT'), 218), (('field', 'NN'), 207), (('fight', 'NN'), 204), (('vain', 'NN'), 201), (('chief', 'NN'), 199), (('some', 'DT'), 194), (('thou', 'NN'), 194), (('man', 'NN'), 191), (('force', 'NN'), 191), (('fate', 'NN'), 190), (('race', 'NN'), 187), (('ground', 'NN'), 184), (('rage', 'NN'), 184), (('trojan', 'NN'), 180)]


Analysis for The Iliad

Looking at the most common tokens inside NP-chunks, you can identify important characters such as hector and jove by their frequency. An important location, troy, is also mentioned often, and the high count for war points to war as a central theme. Keep in mind that these counts are for individual tokens inside chunks, so determiners such as the and a dominate the top of the list. Some tags are also questionable (i and thou are labelled NN), which may be a side effect of lowercasing the text before tagging.

In [17]:
# printing the most common tokens inside VP-chunks here
# (vp_chunk_counter prints its result and returns None, so there is nothing to store)
vp_chunk_counter(vp_chunked_text)

[(('the', 'DT'), 963), (('is', 'VBZ'), 269), (('was', 'VBD'), 212), (('i', 'NN'), 167), (('a', 'DT'), 133), (('hector', 'NN'), 95), (('be', 'VB'), 91), (('has', 'VBZ'), 86), (('had', 'VBD'), 75), (('this', 'DT'), 72), (("'d", 'VBD'), 68), (('thy', 'JJ'), 47), (('hero', 'NN'), 45), (("'t", 'NN'), 44), (('stood', 'VBD'), 43), (('were', 'VBD'), 40), (('gave', 'VBD'), 38), (('fell', 'VBD'), 37), (('not', 'RB'), 36), (('are', 'VBP'), 35), (('great', 'JJ'), 35), (('lies', 'VBZ'), 35), (('thou', 'NN'), 35), (('homer', 'NN'), 34), (('came', 'VBD'), 34), (('jove', 'NN'), 34), (('battle', 'NN'), 33), (('day', 'NN'), 31), (('flew', 'VBD'), 30), (('no', 'DT'), 29)]


Analysis for The Iliad

Looking at the most common tokens inside VP-chunks, you can see that verb phrases of the form you defined in your chunk grammar appear far less often in The Iliad than noun phrases: the most frequent token occurs 963 times here, compared with 10044 times in the NP counts. This may indicate a style of writing, such as poetry, that does not follow conventional prose word order. Even when chunks are not found, their absence can give you insight!
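
One way to make the comparison between noun phrases and verb phrases more direct is to count how many chunks each grammar produces. The sketch below does this for a single hand-tagged sentence. The sentence is made up, and `count_chunks` is a hypothetical helper written for this illustration, not part of NLTK.

```python
from nltk import RegexpParser

# Hand-tagged toy sentence (hypothetical), so no tagger data is required.
tagged = [
    ("the", "DT"), ("great", "JJ"), ("hero", "NN"), ("stood", "VBD"),
    ("on", "IN"), ("the", "DT"), ("plain", "NN"),
    ("by", "IN"), ("a", "DT"), ("ship", "NN"),
]

np_parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
vp_parser = RegexpParser("VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}")


def count_chunks(parser, sentence, label):
    """Return how many subtrees with the given label the parser finds."""
    tree = parser.parse(sentence)
    return sum(1 for _ in tree.subtrees(filter=lambda t: t.label() == label))


np_total = count_chunks(np_parser, tagged, "NP")
vp_total = count_chunks(vp_parser, tagged, "VP")
print(np_total, vp_total)
```

Because every VP chunk in this grammar has to contain a noun phrase followed by a verb, there can never be more VP chunks than NP chunks in a sentence. Summing these counts over `pos_tagged_text` would show how large the gap is for the whole text.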