# Analyzing Textual Data with NLTK

In this project, I explore Natural Language Processing (NLP) with the Natural Language Toolkit (NLTK). My primary objective is to analyze a text document and extract meaningful insights through tokenization, part-of-speech tagging, and syntactic parsing.

**Data Source**: The text corpus is sourced from [Prof. Dr. Marc Hellmuth](https://math-inf.uni-greifswald.de/storages/uni-greifswald/fakultaet/mnf/mathinf/hellmuth/Teaching/AlgorDatastrWS1819/Goethe--Faust.txt) of the University of Greifswald.

**Key Project Objectives:**

1. **Import Libraries**: I start by importing essential libraries for text processing, including NLTK and custom functions for tokenization and chunk counting.

2. **Load Data**: The text document from "Faust" by Goethe is loaded, converted to lowercase, and prepared for analysis.

3. **Tokenize Text**: I tokenize the text into sentences and further split them into words, enabling granular analysis.

4. **Part-of-Speech Tagging**: Each word in the text is tagged with its respective part of speech, providing linguistic context and structure to the text.

5. **Syntactic Parsing**: I perform syntactic parsing by chunking sentences into noun phrases (NPs) and verb phrases (VPs) using predefined grammars.

6. **Visualize Chunks**: To gain a visual understanding, I search for specific words, like "klug," and visualize sentences containing these words as trees, revealing their grammatical structure.

7. **Analyze Chunks**: I analyze the most common NP-chunks and VP-chunks in the text. This analysis provides insights into the prominent characters, entities, and themes within the text.

**Summary:**  
My journey through "Faust" by Goethe unveils fascinating aspects of the text. I discover the significance of characters like "Faust" and "Margarete" based on their frequent appearance as NP-chunks. Additionally, my analysis of VP-chunks sheds light on Goethe's unique writing style and the absence of conventional grammatical structures, contributing to a deeper understanding of the text's literary nuances. This code project exemplifies the power of NLP techniques in unraveling the intricacies of textual data.

*Note: The code provided serves as an illustrative example and can be applied to a wide range of textual analyses.*

-----

## Import Libraries

In [65]:
from nltk import pos_tag, RegexpParser, Tree
import pprint
import import_ipynb
from functions.tokenize_words import word_sentence_tokenize
from functions.chunk_counters import np_chunk_counter, vp_chunk_counter

## Load Data

In [66]:
# Import text (the context manager closes the file once it has been read)
with open("text/faust.txt", encoding="utf-8") as f:
    text = f.read().lower()

## Tokenize

In [67]:
# Split the text into individual sentences and word tokenize them
word_tokenized_text = word_sentence_tokenize(text)

# Specify the sentence index you want to print
sentence_index = 100

# Print a single word-tokenized sentence
if sentence_index < len(word_tokenized_text):
    selected_sentence = word_tokenized_text[sentence_index]
    print("Selected Sentence:")
    print(selected_sentence)
else:
    print(f"Sentence index {sentence_index} is out of range.")


Selected Sentence:
['was', 'hilft', 'es', 'viel', 'von', 'stimmung', 'reden', '?']
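
The source of `word_sentence_tokenize` is not shown in this notebook. As a rough illustration of the shape it returns (a list of sentences, each a list of word tokens), here is a minimal standard-library stand-in. The name `simple_word_sentence_tokenize` and both regular expressions are assumptions; the project's helper presumably relies on NLTK's tokenizers and may split differently in edge cases. The second sample sentence is only there to show the sentence split.

```python
import re

# Hypothetical stand-in, not the project's actual helper.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")      # whitespace that follows ., ! or ?
_TOKEN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")      # a word, or a single punctuation mark

def simple_word_sentence_tokenize(text):
    """Split text into sentences, then each sentence into word and punctuation tokens."""
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    return [_TOKEN.findall(sentence) for sentence in sentences]

sample = "was hilft es viel von stimmung reden? dem zaudernden erscheint sie nie."
tokenized_sample = simple_word_sentence_tokenize(sample)
print(tokenized_sample[0])
```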


## Part-of-speech Tagging

In [68]:
# Create a list to hold part-of-speech tagged sentences
pos_tagged_text = [pos_tag(sentence) for sentence in word_tokenized_text]

# Check if the specified sentence index is within a valid range
if 0 <= sentence_index < len(pos_tagged_text):
    # Select and print a part-of-speech tagged sentence
    selected_sentence = pos_tagged_text[sentence_index]
    print("Selected Part-of-Speech Tagged Sentence:")
    print(selected_sentence)
else:
    # Handle the case where the sentence index is out of range
    print(f"Sentence index {sentence_index} is out of range.")


Selected Part-of-Speech Tagged Sentence:
[('was', 'VBD'), ('hilft', 'JJ'), ('es', 'JJ'), ('viel', 'NN'), ('von', 'NN'), ('stimmung', 'NN'), ('reden', 'NN'), ('?', '.')]
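
The tags printed above do not fit the German words well (for example, "hilft" is a verb but is tagged `JJ`). NLTK's `pos_tag` uses an English-trained tagger by default, which likely explains this. One quick way to see how skewed the tags are is to count them. The helper name `tag_distribution` is hypothetical, and the sample is just the single sentence printed above.

```python
from collections import Counter

def tag_distribution(tagged_sentences):
    """Count how often each POS tag occurs across all tagged sentences."""
    return Counter(tag for sentence in tagged_sentences for _, tag in sentence)

# The tagged sentence printed above, used here as a tiny sample
sample_tagged = [[('was', 'VBD'), ('hilft', 'JJ'), ('es', 'JJ'), ('viel', 'NN'),
                  ('von', 'NN'), ('stimmung', 'NN'), ('reden', 'NN'), ('?', '.')]]

tag_counts = tag_distribution(sample_tagged)
print(tag_counts.most_common())
```

In the notebook, the same call on the full corpus would be `tag_distribution(pos_tagged_text)`.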


## Syntax Parsing

### Chunk Sentences

In [69]:
# Define noun phrase and verb phrase chunk grammars
np_chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"
vp_chunk_grammar = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"

# Create RegexpParser objects for both noun phrases and verb phrases
np_chunk_parser = RegexpParser(np_chunk_grammar)
vp_chunk_parser = RegexpParser(vp_chunk_grammar)

# Create lists to hold noun phrase chunked sentences and verb phrase chunked sentences
np_chunked_text = list()
vp_chunked_text = list()

# Iterate through each part-of-speech tagged sentence
for pos_tagged_sentence in pos_tagged_text:
    # Perform noun phrase chunking and append to the noun phrase chunked list
    np_chunked_text.append(np_chunk_parser.parse(pos_tagged_sentence))
    
    # Perform verb phrase chunking and append to the verb phrase chunked list
    vp_chunked_text.append(vp_chunk_parser.parse(pos_tagged_sentence))
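
To see what the two grammars actually match, it can help to run them on a short hand-tagged sentence. The English sentence and its tags below are made up for illustration and do not come from the corpus.

```python
from nltk import RegexpParser

# Same grammars as in the cell above
np_parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
vp_parser = RegexpParser("VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}")

# Hand-tagged toy sentence (an assumption for illustration)
toy_tagged = [("the", "DT"), ("old", "JJ"), ("scholar", "NN"),
              ("studied", "VBD"), ("restlessly", "RB"), (".", ".")]

np_tree = np_parser.parse(toy_tagged)
vp_tree = vp_parser.parse(toy_tagged)

# Collect each chunk as a tuple of (word, tag) pairs
np_chunks = [tuple(st) for st in np_tree.subtrees(lambda t: t.label() == "NP")]
vp_chunks = [tuple(st) for st in vp_tree.subtrees(lambda t: t.label() == "VP")]
print(np_chunks)
print(vp_chunks)
```

Note that the VP grammar starts with the same pattern as the NP grammar, so a VP-chunk here is a noun phrase followed by a verb and an optional adverb.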


### Visualize Chunks

To visually inspect the part-of-speech tagged chunks, we can use the tree's `pretty_print` method. In the following, I search for the word "klug" and visualize every sentence that contains it as a tree.

In [70]:
# Define a function to search for a word in the parsed text
def search_word_in_parsed_text(parsed_text, target_word):
    # Initialize an empty list to store matching sentences
    matching_sentences = []

    # Iterate through each parsed sentence
    for sentence in parsed_text:
        # Extract the words from the sentence
        words = [word[0] for word in sentence.leaves()]
        
        # Check if the target word is in the list of words
        if target_word in words:
            # If found, add the sentence to the list of matching sentences
            matching_sentences.append(sentence)
    
    return matching_sentences

# Search for the word "klug" in the parsed text
target_word = "klug"
matching_sentences = search_word_in_parsed_text(np_chunked_text, target_word)

# Pretty print each matching sentence as an nltk.Tree object (if any)
if matching_sentences:
    for sentence in matching_sentences:
        tree = Tree.fromstring(str(sentence))
        print("Pretty Print of Matching Sentence:")
        tree.pretty_print()
else:
    print(f"The word '{target_word}' was not found in the parsed text.")


Pretty Print of Matching Sentence:
                                                                                                                                                                        S                                                                                                                                                                                                                
   _____________________________________________________________________________________________________________________________________________________________________|___________________________________________________________________________________________________________________________________________________________________________________________________________      
  |      |       |       |     |     |      |   |      |      |   |   |   |    |      |          |       |          NP           NP         NP        NP        NP      NP           NP                 NP      

## Analyze Chunks

Moving on to the analysis of the text, we can use how often specific chunks occur to draw inferences about a text's content and themes.

In [71]:
# Store and print the most common NP-chunks
most_common_np_chunks = np_chunk_counter(np_chunked_text)
print(most_common_np_chunks)


[((('faust', 'NN'),), 444), ((('der', 'NN'),), 422), ((('die', 'NN'),), 354), ((('zu', 'NN'),), 253), ((('du', 'NN'),), 202), ((('goethe', 'NN'),), 201), ((('ein', 'NN'),), 179), ((('sie', 'NN'),), 173), ((('mir', 'NN'),), 168), ((('mit', 'NN'),), 162), ((('ihr', 'NN'),), 139), ((('dem', 'NN'),), 135), ((('es', 'NN'),), 126), ((('auf', 'NN'),), 118), ((('und', 'NN'),), 114), ((('er', 'NN'),), 113), ((('wie', 'NN'),), 109), ((('nicht', 'NN'),), 109), ((('doch', 'NN'),), 99), ((('ist', 'NN'),), 99), ((('mich', 'NN'),), 93), ((('da', 'NN'),), 92), ((('man', 'NN'),), 82), ((('daß', 'NN'),), 77), ((('margarete', 'NN'),), 73), ((('den', 'NN'),), 70), ((('im', 'NN'),), 69), ((('wir', 'NN'),), 69), ((('dir', 'NN'),), 69), ((('von', 'NN'),), 66)]
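
The source of `np_chunk_counter` is not shown either. The following sketch shows one way such a counter could be written so that it produces output in the same shape as the list above. The function name `count_chunks`, its parameters, and the toy trees (including their tags) are assumptions for illustration, not the project's implementation.

```python
from collections import Counter
from nltk import Tree

def count_chunks(chunked_sentences, label, top_n=30):
    """Return the top_n most frequent chunks that carry the given label."""
    counter = Counter()
    for sentence in chunked_sentences:
        # Visit only subtrees whose label matches, e.g. "NP" or "VP"
        for subtree in sentence.subtrees(lambda t: t.label() == label):
            counter[tuple(subtree)] += 1
    return counter.most_common(top_n)

# Two hand-built chunked sentences (hypothetical data)
toy_chunked = [
    Tree("S", [Tree("NP", [("faust", "NN")]), ("spricht", "VBZ")]),
    Tree("S", [Tree("NP", [("faust", "NN")]), Tree("NP", [("margarete", "NN")])]),
]
toy_counts = count_chunks(toy_chunked, "NP")
print(toy_counts)
```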


**Summary:**  
Looking at `most_common_np_chunks`, which holds the `30` most common NP-chunks that `np_chunk_counter()` returns from a list of chunked sentences, we can identify names of importance in the text. Specifically, "Faust," "Goethe," and "Margarete" stand out by their frequency of occurrence. This fits what we know about the work: it is authored by Goethe and features a protagonist called "Faust" who falls in love with "Margarete". Most of the remaining entries are function words such as "der," "die," and "zu" that the tagger labeled as nouns.
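
Because many of the top entries are function words, a simple filter can make the names easier to spot. This is a minimal sketch: the helper `content_chunks` and the hand-picked `FUNCTION_WORDS` set are assumptions and would need extending for real use.

```python
# Small hand-picked set of German function words (an assumption; extend as needed)
FUNCTION_WORDS = {"der", "die", "zu", "du", "ein", "sie", "mir", "mit",
                  "ihr", "dem", "es", "auf", "und", "er"}

def content_chunks(chunk_counts, stopwords=FUNCTION_WORDS):
    """Keep only chunks in which no token is a listed function word."""
    return [(chunk, n) for chunk, n in chunk_counts
            if not any(word in stopwords for word, _ in chunk)]

# First six entries of the NP-chunk output above
sample_counts = [((('faust', 'NN'),), 444), ((('der', 'NN'),), 422), ((('die', 'NN'),), 354),
                 ((('zu', 'NN'),), 253), ((('du', 'NN'),), 202), ((('goethe', 'NN'),), 201)]
filtered = content_chunks(sample_counts)
print(filtered)
```

In the notebook, the call would be `content_chunks(most_common_np_chunks)`.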

In [72]:
# Store and print the most common VP-chunks
most_common_vp_chunks = vp_chunk_counter(vp_chunked_text)
print(most_common_vp_chunks)

[((('bei', 'NN'), ("'m", 'VBP')), 15), ((('und', 'NN'), ('was', 'VBD')), 13), ((('die', 'NN'), ('welt', 'VBD')), 7), ((('ich', 'NN'), ('mich', 'VBD')), 5), ((('nicht', 'NN'), ('was', 'VBD')), 5), ((('auch', 'NN'), ('was', 'VBD')), 4), ((('ein', 'NN'), ('ganzes', 'VBZ')), 3), ((('der', 'NN'), ('welt', 'VBD')), 3), ((('ich', 'NN'), ('was', 'VBD')), 3), ((('sich', 'NN'), ('die', 'VBP')), 3), ((('für', 'NN'), ('was', 'VBD')), 3), ((('herrlich', 'JJ'), ('wie', 'NN'), ('am', 'VBP')), 2), ((('sessel', 'NN'), ('am', 'VBP')), 2), ((('zeichen', 'NN'), ('des', 'VBZ')), 2), ((('man', 'NN'), ('sie', 'VBZ')), 2), ((('bess', 'NN'), ("'re", 'VBP')), 2), ((('mich', 'NN'), ('was', 'VBD')), 2), ((('nicht', 'NN'), ('die', 'VBP')), 2), ((('mit', 'NN'), ('tausend', 'VBP')), 2), ((('die', 'NN'), ('sich', 'VBD')), 2), ((('verflucht', 'NN'), ('was', 'VBD')), 2), ((('dir', 'NN'), ('was', 'VBD')), 2), ((('gern', 'NN'), ('was', 'VBD')), 2), ((('mir', 'NN'), ('ein', 'VBP')), 2), ((('wir', 'NN'), ('knicken', 'VBN')

**Summary:**  
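Looking at `most_common_vp_chunks`, we can see that verb phrases of the form we defined in our chunk grammar appear far less often in Faust than noun phrases: the most frequent VP-chunk occurs 15 times, compared with 444 for the most frequent NP-chunk. This points to Goethe's style of writing, which is verse and does not necessarily follow conventional prose sentence structure. The counts should be read with some care, though, because `pos_tag` uses an English-trained tagger by default and mis-tags many German words (for example, "welt" as a verb). The absence of chunks thus gives us insight as well!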