# Discovering Insights from Homer's Iliad with NLP

Novels and other texts carry insights into ideologies and places that are often unfamiliar to the reader at first. By reading a written piece, we uncover the author's opinions on the chosen topic and come to understand both the topic and how the author thinks.

In this project I perform a natural language parsing analysis to gain deeper insight into [Homer’s *The Iliad*](http://www.gutenberg.org/ebooks/6130)!

One of the beauties of natural language parsing with regular expressions is the ability to gain insight into lengthy pieces of text without a formal read!

> ***Project Goal***
> - Find the main topics of discussion in the novel and begin to discern some of the author’s thoughts and beliefs
>
>    "Without Actually Reading The Novel"!

In [None]:
from nltk import pos_tag, RegexpParser
from tokenize_words import word_sentence_tokenize
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

### Custom counter Function
Two functions, `np_chunk_counter()` and `vp_chunk_counter()`, are defined here. They return the 30 most common *noun-phrase chunks* and *verb-phrase chunks*, respectively, from a list of chunked sentences.

In [17]:
from collections import Counter


def np_chunk_counter(chunked_sentences):
    chunks = list()
    for chunked_sentence in chunked_sentences:
        for subtree in chunked_sentence.subtrees(filter=lambda t: t.label() == 'NP'):
            chunks.append(tuple(subtree))
    chunk_counter = Counter()
    for chunk in chunks:
        chunk_counter[chunk] += 1
    return chunk_counter.most_common(30)




def vp_chunk_counter(chunked_sentences):
    chunks = list()
    for chunked_sentence in chunked_sentences:
        for subtree in chunked_sentence.subtrees(filter=lambda t: t.label() == 'VP'):
            chunks.append(tuple(subtree))
    chunk_counter = Counter()
    for chunk in chunks:
        chunk_counter[chunk] += 1
    return chunk_counter.most_common(30)
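
The two counters differ only in the chunk label they filter on, so they could be folded into one function that takes the label as a parameter. The following is a minimal sketch of that idea, not part of the original notebook: the name `chunk_counter`, its `label` and `top_n` parameters, and the hand-built toy trees are all assumptions made for illustration.

```python
from collections import Counter

from nltk import Tree


def chunk_counter(chunked_sentences, label, top_n=30):
    """Return the `top_n` most common chunks carrying `label` (hypothetical helper)."""
    counter = Counter(
        tuple(subtree)  # a flat chunk becomes a hashable tuple of (word, tag) pairs
        for sentence in chunked_sentences
        for subtree in sentence.subtrees(filter=lambda t: t.label() == label)
    )
    return counter.most_common(top_n)


# Two tiny hand-built trees standing in for real RegexpParser output.
toy_sentences = [
    Tree('S', [Tree('NP', [('the', 'DT'), ('war', 'NN')]), ('rages', 'VBZ')]),
    Tree('S', [Tree('NP', [('the', 'DT'), ('war', 'NN')]), ('ends', 'VBZ'),
               Tree('NP', [('troy', 'NN')])]),
]

top_np = chunk_counter(toy_sentences, 'NP')
print(top_np)
```

With this shape, `chunk_counter(chunked, 'NP')` and `chunk_counter(chunked, 'VP')` would play the roles of the two functions above.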


## Importing and Preprocessing Text Data

The text of The Iliad, saved as `the_iliad.txt`, is sourced from [Project Gutenberg](https://www.gutenberg.org/). Here is my [text file](the_iliad.txt) of that novel. I convert it to lowercase when loading it into my workspace.

In [18]:
# Read the whole file and lowercase it; the `with` block closes the file afterwards
with open('the_iliad.txt', encoding='utf-8') as f:
    text = f.read().lower()


word_tokenized_text = word_sentence_tokenize(text)


# Single tokenized sentence
single_word_tokenized_sentence = word_tokenized_text[100]
print(single_word_tokenized_sentence)

['he', 'appears', 'as', 'the', 'enunciator', 'of', 'opinions', 'as', 'different', 'in', 'their', 'tone', 'as', 'those', 'of', 'the', 'writers', 'who', 'have', 'handed', 'them', 'down', '.']
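
The `word_sentence_tokenize` helper is imported from a local `tokenize_words` module whose source is not shown here. To illustrate the shape of data it produces (a list of sentences, each a list of word tokens), here is a rough standard-library stand-in. It is an assumption for illustration only: the function name, the regular expressions, and the sample text are mine, and a regex split like this will not match NLTK's tokenizers on edge cases such as possessives or abbreviations.

```python
import re


def simple_word_sentence_tokenize(text):
    """Hypothetical stand-in: split text into sentences, then sentences into tokens."""
    # Rough heuristic: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # A token is a word (optionally with internal apostrophes) or one punctuation mark.
    token_pattern = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
    return [token_pattern.findall(sentence) for sentence in sentences if sentence]


sample = "it now remains we launch a bark. he appears as the enunciator of opinions."
tokenized = simple_word_sentence_tokenize(sample)
print(tokenized[1])
```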


## Parts of Speech Tagging

In [21]:
pos_tagged_text = []
for sentence in word_tokenized_text:
  pos_tagged_text.append(pos_tag(sentence))



# Single sentence with part-of-speech tags
single_pos_sentence = pos_tagged_text[707]

print(single_pos_sentence)

[('but', 'CC'), ('this', 'DT'), ('when', 'WRB'), ('time', 'NN'), ('requires.', 'NN'), ('--', ':'), ('it', 'PRP'), ('now', 'RB'), ('remains', 'VBZ'), ('we', 'PRP'), ('launch', 'VBP'), ('a', 'DT'), ('bark', 'NN'), ('to', 'TO'), ('plough', 'VB'), ('the', 'DT'), ('watery', 'NN'), ('plains', 'VBZ'), (',', ','), ('and', 'CC'), ('waft', 'VBD'), ('the', 'DT'), ('sacrifice', 'NN'), ('to', 'TO'), ('chrysa', 'VB'), ("'s", 'POS'), ('shores', 'NNS'), (',', ','), ('with', 'IN'), ('chosen', 'JJ'), ('pilots', 'NNS'), (',', ','), ('and', 'CC'), ('with', 'IN'), ('labouring', 'JJ'), ('oars', 'NNS'), ('.', '.')]


## Chunk Grammar

- I have defined a **noun-phrase** chunk grammar (optional determiner + any number of adjectives + noun) and
- a **verb-phrase** chunk grammar (noun phrase + verb + optional adverb) here.

In [22]:
np_chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"
np_chunk_parser = RegexpParser(np_chunk_grammar)

vp_chunk_grammar = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"
vp_chunk_parser = RegexpParser(vp_chunk_grammar)
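
To see what these two grammars capture, it can help to run them on a single short sentence. The sketch below uses the same grammar strings; the sentence and its tags are written by hand for illustration (they are not taken from the text or produced by `pos_tag`), and the variable names are assumptions.

```python
from nltk import RegexpParser

np_parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
vp_parser = RegexpParser("VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}")

# Hand-tagged example sentence (tags assigned manually, not by a tagger).
tagged = [('the', 'DT'), ('brave', 'JJ'), ('hero', 'NN'),
          ('spoke', 'VBD'), ('loudly', 'RB'), ('.', '.')]

np_tree = np_parser.parse(tagged)
vp_tree = vp_parser.parse(tagged)

# Collect the (word, tag) pairs inside each matched chunk.
np_chunks = [st.leaves() for st in np_tree.subtrees(filter=lambda t: t.label() == 'NP')]
vp_chunks = [st.leaves() for st in vp_tree.subtrees(filter=lambda t: t.label() == 'VP')]

print(np_chunks)
print(vp_chunks)
```

The NP grammar stops at the noun, while the VP grammar only matches when that noun is immediately followed by a verb, which makes it a much stricter pattern.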

## Chunk Sentence

- I have stored the **noun-phrase** chunked sentences of the novel in `np_chunked_text` and
- the **verb-phrase** chunked sentences of the novel in `vp_chunked_text`

In [22]:
np_chunked_text = []
vp_chunked_text = []
for sentence in pos_tagged_text:
  np_chunked_text.append(np_chunk_parser.parse(sentence))
  vp_chunked_text.append(vp_chunk_parser.parse(sentence))

## Novel Analysis

The functions `np_chunk_counter()` and `vp_chunk_counter()` that I defined earlier return the 30 most common **NP-chunks** and **VP-chunks** from a list of chunked sentences. I print `most_common_np_chunks` and `most_common_vp_chunks` below.

In [23]:
most_common_np_chunks = np_chunk_counter(np_chunked_text)
print(most_common_np_chunks)

[((('hector', 'NN'),), 322), ((('i', 'NN'),), 277), ((('jove', 'NN'),), 257), ((('troy', 'NN'),), 208), ((('vain', 'NN'),), 195), ((('war', 'NN'),), 193), ((('son', 'NN'),), 170), ((('thou', 'NN'),), 158), ((('the', 'DT'), ('plain', 'NN')), 157), ((('the', 'DT'), ('field', 'NN')), 154), ((('the', 'DT'), ('ground', 'NN')), 138), ((('death', 'NN'),), 134), ((('hand', 'NN'),), 134), ((('greece', 'NN'),), 128), ((('heaven', 'NN'),), 127), ((('fate', 'NN'),), 127), ((('thee', 'NN'),), 122), ((('breast', 'NN'),), 121), ((('the', 'DT'), ('trojan', 'NN')), 120), ((('the', 'DT'), ('god', 'NN')), 119), ((('the', 'DT'), ('war', 'NN')), 117), ((('the', 'DT'), ('greeks', 'NN')), 116), ((('blood', 'NN'),), 115), ((('homer', 'NN'),), 112), ((('the', 'DT'), ('king', 'NN')), 105), ((('rage', 'NN'),), 103), ((('force', 'NN'),), 103), ((('care', 'NN'),), 99), ((('head', 'NN'),), 98), ((('man', 'NN'),), 97)]
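
The raw output above is hard to scan because each chunk is a nested tuple of `(word, tag)` pairs. A small formatting helper can flatten each chunk into a readable phrase. This is only a sketch: `format_chunk_counts` and its `limit` parameter are hypothetical names, and the sample list holds a handful of entries copied from the output above rather than the full result.

```python
def format_chunk_counts(most_common_chunks, limit=5):
    """Turn ((word, tag), ...), count entries into 'phrase: count' strings (hypothetical helper)."""
    lines = []
    for chunk, count in most_common_chunks[:limit]:
        phrase = ' '.join(word for word, _tag in chunk)  # drop the POS tags
        lines.append(f'{phrase}: {count}')
    return lines


# A few entries copied from the printed NP-chunk output.
sample_np_chunks = [
    ((('hector', 'NN'),), 322),
    ((('i', 'NN'),), 277),
    ((('jove', 'NN'),), 257),
    ((('troy', 'NN'),), 208),
    ((('the', 'DT'), ('plain', 'NN')), 157),
]

formatted = format_chunk_counts(sample_np_chunks)
for line in formatted:
    print(line)
```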


### Result

- Looking at `most_common_np_chunks`, we can identify characters of importance in the text, such as **hector** and **jove**, based on their frequency. Additionally, a location of importance, **troy**, is mentioned often. **A theme of war** is also implied by its high frequency count.

In [24]:
most_common_vp_chunks = vp_chunk_counter(vp_chunked_text)
print(most_common_vp_chunks)

[((("'t", 'NN'), ('is', 'VBZ')), 19), ((('i', 'NN'), ('am', 'VBP')), 11), ((("'t", 'NN'), ('was', 'VBD')), 11), ((('the', 'DT'), ('hero', 'NN'), ('said', 'VBD')), 9), ((('i', 'NN'), ('know', 'VBP')), 8), ((('i', 'NN'), ('saw', 'VBD')), 8), ((('the', 'DT'), ('scene', 'NN'), ('lies', 'VBZ')), 7), ((('i', 'NN'), ('was', 'VBD')), 6), ((('confess', 'NN'), ("'d", 'VBD')), 6), ((('the', 'DT'), ('scene', 'NN'), ('is', 'VBZ')), 6), ((('view', 'NN'), ("'d", 'VBD')), 5), ((('i', 'NN'), ('felt', 'VBD')), 5), ((('i', 'NN'), ('bear', 'VBP')), 5), ((('hector', 'NN'), ('is', 'VBZ')), 5), ((('vain', 'NN'), ('was', 'VBD')), 5), ((('homer', 'NN'), ('was', 'VBD')), 4), ((('i', 'NN'), ('have', 'VBP')), 4), ((('hunger', 'NN'), ('was', 'VBD')), 4), ((('glory', 'NN'), ('lost', 'VBN')), 4), ((('i', 'NN'), ('see', 'VBP')), 4), ((('war', 'NN'), ('be', 'VB')), 4), ((('the', 'DT'), ('weapon', 'NN'), ('stood', 'VBD')), 4), ((('i', 'NN'), ('go', 'VBP')), 4), ((('the', 'DT'), ('silence', 'NN'), ('broke', 'VBD')), 4),

### Result

- Looking at `most_common_vp_chunks`, it appears that verb phrases of the form defined in the chunk grammar do not appear as often in The Iliad as noun phrases: the most frequent verb phrase occurs 19 times, against 322 for the most frequent noun phrase.

This may indicate a style of writing that does not follow conventional grammatical structure (here, poetry). Even when chunks are not found, their absence can give insights!