# Discovering Insights in Texts

We will be performing a syntax parsing analysis on a novel. The goal is to gain insight into the meaning of the text, the main topics of discussion, and the author's writing style. 

## Helper Functions

### Tokenization

We'll use the NLTK library to tokenize the text into sentences, then tokenize each sentence into words. The result is a list of word-tokenized sentences, that is, a list in which each item is the list of tokens for one sentence.

From NLTK, we import word_tokenize and PunktSentenceTokenizer. We use PunktSentenceTokenizer instead of sent_tokenize because sent_tokenize relies on a pre-trained model, and our text might be very different from the text that model was trained on. Passing our text to the PunktSentenceTokenizer constructor trains the tokenizer on that text first; the trained tokenizer is then used to split the text into sentences.

In [15]:
from nltk.tokenize import PunktSentenceTokenizer, word_tokenize

def wordSentenceTokenize(text):
    # create the tokenizer
    sentenceTokenizer = PunktSentenceTokenizer(text)

    # sentence tokenize the text
    sentenceTokenized = sentenceTokenizer.tokenize(text)

    # a list to hold our word tokenized sentences
    wordTokenized = []
    
    # for every sentence, word tokenize it and add it to the list
    for sentence in sentenceTokenized:
        wordTokenized.append(word_tokenize(sentence))
    
    return wordTokenized
    

### Chunking and Visualization

Using regular expressions, we can define patterns of part-of-speech tags and find chunks of words in the sentences whose tags match those patterns. This can help give insight into the meaning of a text.

The helper functions below find the 30 most common noun phrase and verb phrase chunks in the text.

In [16]:
from collections import Counter

def npChunkCounter(chunkedSentences):
    # a list to hold chunks
    chunks = []

    # for every chunked sentence, extract the noun phrase chunks and add them to the list
    for chunkedSentence in chunkedSentences:
        # NP is a user defined label to represent 'noun phrase' chunks
        for subtree in chunkedSentence.subtrees(filter=lambda t: t.label() == 'NP'): 
            chunks.append(tuple(subtree))
    
    # create a Counter object
    chunkCounter = Counter()

    # for every chunk in chunks, increase the counter of the specific chunk by 1 (works like a dict)
    for chunk in chunks:
        chunkCounter[chunk] += 1

    # return the 30 most frequent noun phrase chunks
    return chunkCounter.most_common(30)

def vpChunkCounter(chunkedSentences):
    # a list to hold chunks
    chunks = []

    # for every chunked sentence, extract the verb phrase chunks and add them to the list
    for chunkedSentence in chunkedSentences:
        # VP is a user defined label to represent 'verb phrase' chunks
        for subtree in chunkedSentence.subtrees(filter=lambda t: t.label() == 'VP'):
            chunks.append(tuple(subtree))
    
    # create a Counter object
    chunkCounter = Counter()

    # for every chunk in chunks, increase the counter of the specific chunk by 1 (works like a dict)
    for chunk in chunks:
        chunkCounter[chunk] += 1

    # return the 30 most frequent verb phrase chunks
    return chunkCounter.most_common(30)
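
The counting loop in both helpers above can be collapsed, because Counter accepts an iterable directly. Here is a minimal sketch of that idea; the function name countChunks and the sample chunks are made up for illustration and are not part of the notebook.

```python
from collections import Counter

def countChunks(chunks, n=30):
    """Return the n most frequent chunks as (chunk, count) pairs."""
    # Counter(iterable) tallies every item, replacing the explicit += 1 loop
    return Counter(chunks).most_common(n)

# Hypothetical chunks in the same shape the helpers build: tuples of (word, tag) pairs
sampleChunks = [
    (("the", "DT"), ("door", "NN")),
    (("life", "NN"),),
    (("the", "DT"), ("door", "NN")),
]

topChunks = countChunks(sampleChunks, n=2)
print(topChunks)
```

Chunks have to be hashable to be counted, which is why the helpers convert each subtree to a tuple first.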

## Syntax Parsing

In [None]:
from nltk import pos_tag, RegexpParser
import pprint
pp = pprint.PrettyPrinter(indent = 4)

# Import text from file and lowercase it; the with-block closes the file afterwards
textLoc = './dorian_gray.txt'
with open(textLoc, encoding='utf-8') as textFile:
    text = textFile.read().lower()

# Tokenize the text into a list of word tokenized sentences
wordTokenizedText = wordSentenceTokenize(text)

# Sanity Check for word sentence tokenization
testTokenizedSentence = wordTokenizedText[20]
# print(testTokenizedSentence)

# Create a list to host part-of-speech tagged sentences
posTaggedText = []

# for every word tokenized sentence, pos tag it and add that to the list
for tokenizedSentence in wordTokenizedText:
    posTaggedText.append(pos_tag(tokenizedSentence))

# Sanity Check for pos tagging
testPosSentence = posTaggedText[20]
# print(testPosSentence)

# define noun phrase and verb phrase chunk grammar 
npChunkGrammar = "NP: {<DT>?<JJ>*<NN>}"
vpChunkGrammar = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"

# create np and vp RegexpParser objects
npChunkParser = RegexpParser(npChunkGrammar)
vpChunkParser = RegexpParser(vpChunkGrammar)

# create a list to hold the np and vp chunked sentences
npChunkedText = []
vpChunkedText = []

# for every pos-tagged sentence, chunk the sentence and add them to the appropriate list
for posSentence in posTaggedText:
    npChunkedText.append(npChunkParser.parse(posSentence))
    vpChunkedText.append(vpChunkParser.parse(posSentence))

# Store the 30 most common np and vp chunks (they are printed in the Analysis section)
mostCommonNPChunks = npChunkCounter(npChunkedText)
mostCommonVPChunks = vpChunkCounter(vpChunkedText)
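
To see what the two chunk grammars above actually match, it can help to run them on one short sentence. The sketch below uses a sentence I tagged by hand (the tags are an assumption for illustration, not output from pos_tag), so it needs no tagger model; the variable names are made up for this example.

```python
from nltk import RegexpParser

# Hand-tagged sentence: these tags are assumed for illustration, not produced by pos_tag
taggedSentence = [
    ("the", "DT"), ("old", "JJ"), ("painter", "NN"),
    ("smiled", "VBD"), ("at", "IN"), ("the", "DT"), ("picture", "NN"),
]

# Same grammars as in the cell above
npParser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
vpParser = RegexpParser("VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}")

def extractChunks(parser, sentence, label):
    """Parse one tagged sentence and return its chunks with the given label as tuples."""
    tree = parser.parse(sentence)
    return [tuple(subtree) for subtree in tree.subtrees(filter=lambda t: t.label() == label)]

npChunks = extractChunks(npParser, taggedSentence, "NP")
vpChunks = extractChunks(vpParser, taggedSentence, "VP")

print(npChunks)  # two noun phrases: 'the old painter' and 'the picture'
print(vpChunks)  # one match: 'the old painter smiled'
```

Note that the VP pattern is really a noun phrase followed by a verb and an optional adverb, so it captures subject-plus-verb pairs rather than verb phrases in the strict grammatical sense. Also, the tag NN on its own matches only singular common nouns, so plural nouns (NNS) and proper nouns (NNP) are not chunked by these grammars.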

## Analysis

In [22]:
pp.pprint(mostCommonNPChunks)

[   ((('i', 'NN'),), 963),
    ((('henry', 'NN'),), 200),
    ((('lord', 'NN'),), 197),
    ((('life', 'NN'),), 170),
    ((('harry', 'NN'),), 136),
    ((('dorian', 'JJ'), ('gray', 'NN')), 127),
    ((('something', 'NN'),), 126),
    ((('nothing', 'NN'),), 93),
    ((('basil', 'NN'),), 85),
    ((('the', 'DT'), ('world', 'NN')), 70),
    ((('everything', 'NN'),), 69),
    ((('anything', 'NN'),), 68),
    ((('hallward', 'NN'),), 68),
    ((('the', 'DT'), ('man', 'NN')), 61),
    ((('the', 'DT'), ('room', 'NN')), 60),
    ((('face', 'NN'),), 57),
    ((('the', 'DT'), ('door', 'NN')), 56),
    ((('love', 'NN'),), 55),
    ((('art', 'NN'),), 52),
    ((('course', 'NN'),), 51),
    ((('the', 'DT'), ('picture', 'NN')), 46),
    ((('the', 'DT'), ('lad', 'NN')), 45),
    ((('head', 'NN'),), 44),
    ((('round', 'NN'),), 44),
    ((('hand', 'NN'),), 44),
    ((('sibyl', 'NN'),), 41),
    ((('the', 'DT'), ('table', 'NN')), 40),
    ((('the', 'DT'), ('painter', 'NN')), 38),
    ((('sir', 'NN'),)

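Looking at mostCommonNPChunks, we can identify characters of importance such as henry, harry, dorian gray, and basil. Their names are among the most frequent noun phrases in the text! (The top entry, 'i', is the lowercased pronoun 'I', which the tagger labels as NN here.)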

Moreover, the noun phrases 'the picture' and 'the painter' both appear in the top 30, as does 'art'. Possibly this is a story about an artist? I don't know; I've never read the text!

In [23]:
pp.pprint(mostCommonVPChunks)

[   ((('i', 'NN'), ('am', 'VBP')), 101),
    ((('i', 'NN'), ('was', 'VBD')), 40),
    ((('i', 'NN'), ('want', 'VBP')), 37),
    ((('i', 'NN'), ('know', 'VBP')), 33),
    ((('i', 'NN'), ('do', 'VBP'), ("n't", 'RB')), 32),
    ((('i', 'NN'), ('have', 'VBP')), 32),
    ((('i', 'NN'), ('had', 'VBD')), 31),
    ((('i', 'NN'), ('suppose', 'VBP')), 17),
    ((('i', 'NN'), ('think', 'VBP')), 16),
    ((('i', 'NN'), ('am', 'VBP'), ('not', 'RB')), 14),
    ((('i', 'NN'), ('thought', 'VBD')), 13),
    ((('i', 'NN'), ('believe', 'VBP')), 12),
    ((('dorian', 'JJ'), ('gray', 'NN'), ('was', 'VBD')), 11),
    ((('i', 'NN'), ('am', 'VBP'), ('so', 'RB')), 11),
    ((('henry', 'NN'), ('had', 'VBD')), 11),
    ((('i', 'NN'), ('did', 'VBD'), ("n't", 'RB')), 9),
    ((('i', 'NN'), ('met', 'VBD')), 9),
    ((('i', 'NN'), ('said', 'VBD')), 9),
    ((('i', 'NN'), ('am', 'VBP'), ('quite', 'RB')), 8),
    ((('i', 'NN'), ('see', 'VBP')), 8),
    ((('i', 'NN'), ('did', 'VBD'), ('not', 'RB')), 7),
    ((('i', 'NN

Analyzing mostCommonVPChunks, we find something interesting about the theme of the text. The verb phrases 'i want', 'i know', and 'i have' occur frequently (37, 33, and 32 times respectively), which suggests a theme of desire and need.
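
The nested tuples printed above are hard to scan. A small formatter can turn each (chunk, count) pair into a readable line; this is only a sketch, the function name formatChunkCounts is hypothetical, and the sample pairs are copied from the noun phrase output above.

```python
def formatChunkCounts(chunkCounts):
    """Turn (chunk, count) pairs into 'phrase: count' strings, dropping the POS tags."""
    lines = []
    for chunk, count in chunkCounts:
        # each chunk is a tuple of (word, tag) pairs; keep only the words
        phrase = " ".join(word for word, _tag in chunk)
        lines.append(f"{phrase}: {count}")
    return lines

# Two entries taken from the noun phrase output above
sampleCounts = [
    ((("dorian", "JJ"), ("gray", "NN")), 127),
    ((("the", "DT"), ("world", "NN")), 70),
]

formatted = formatChunkCounts(sampleCounts)
for line in formatted:
    print(line)
```

The same function should work on mostCommonNPChunks and mostCommonVPChunks, since both are lists of (chunk, count) pairs returned by most_common.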