
# Research on Syntactic, Structural, and Plot Analysis Techniques for Palimpsest

This notebook summarizes research on advanced techniques for analyzing text documents at syntactic, 
structural, and narrative levels for the Palimpsest text analysis tool.



## Introduction

Palimpsest is a text analysis tool designed for comparing and analyzing large documents. 
This research explores techniques for analyzing documents beyond simple string and semantic matching:

1. **Syntactic Analysis**: Techniques for examining grammatical structure and syntactic patterns
2. **Structural Analysis**: Methods for analyzing document organization and structure 
3. **Plot and Narrative Analysis**: Approaches for understanding narrative elements and plot development

These higher-level analyses complement the string matching and semantic matching techniques
covered in our previous research notebook.


In [None]:

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from IPython.display import Markdown, display
from collections import defaultdict

# Function to display markdown content
def md(text):
    display(Markdown(text))



## 1. Syntactic Analysis Techniques

Syntactic analysis examines the grammatical structure of texts, identifying patterns in 
sentence construction, grammar usage, and syntactic style. These techniques help identify 
authorial fingerprints and stylistic influences across documents.


In [None]:

# Syntactic analysis techniques comparison
syntactic_techniques = {
    "Part-of-Speech and Parse Analysis": [
        ("POS Tagging Distribution", "Statistical analysis of POS tag frequencies", "spaCy, NLTK", "Medium"),
        ("Syntactic n-grams", "Sequences of POS tags rather than words", "NLTK, custom", "Medium"),
        ("Dependency Parse Tree Analysis", "Examining grammatical relations between words", "spaCy, Stanford NLP", "High"),
        ("Constituency Parsing", "Hierarchical phrase structure analysis", "NLTK, AllenNLP", "High")
    ],
    
    "Grammar and Style Analysis": [
        ("Readability Metrics", "Flesch-Kincaid, Coleman-Liau, SMOG indexes", "textstat", "Low"),
        ("Sentence Complexity Analysis", "Clausal density, subordination", "spaCy + custom rules", "Medium"),
        ("Grammar Rule Usage", "Passive voice, nominalization frequencies", "LanguageTool, custom rules", "Medium"),
        ("Rhetorical Device Detection", "Anaphora, epistrophe, etc.", "Custom implementations", "High")
    ],
    
    "Syntactic Similarity Measurement": [
        ("Tree Edit Distance", "Comparing parse trees", "APTED, Zhang-Shasha", "High"),
        ("Syntactic Embeddings", "Neural representations of syntactic structures", "Berkeley Neural Parser", "High"),
        ("Function Word Stylometry", "Analysis of function word patterns", "Custom implementations", "Medium"),
        ("Syntactic Motif Analysis", "Recurring grammatical patterns", "Custom graph-based analysis", "High")
    ]
}

# Display the syntactic techniques with their properties
for category, techniques in syntactic_techniques.items():
    md(f"### {category}")
    
    table_data = []
    for name, desc, tools, complexity in techniques:
        table_data.append([name, desc, tools, complexity])
    
    df = pd.DataFrame(table_data, 
                     columns=["Technique", "Description", "Tools", "Implementation Complexity"])
    display(df)
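
Of the techniques listed above, readability metrics are the cheapest to compute. The sketch below implements the Flesch Reading Ease formula directly, assuming a rough vowel-group syllable estimate. The helper names `count_syllables` and `flesch_reading_ease` are illustrative, and a library such as textstat would likely give more careful syllable counts.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

score = flesch_reading_ease("The cat sat on the mat. It was happy.")
print(f"Flesch Reading Ease: {score:.1f}")
```

Higher scores indicate easier text. Because the syllable heuristic is approximate, scores are best compared between documents processed the same way rather than read as absolute values.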


In [None]:

# Example implementation of syntactic fingerprinting
def create_syntactic_fingerprint(text):
    """
    Creates a syntactic fingerprint from text based on POS patterns
    
    Args:
        text: Input text to analyze
        
    Returns:
        Dictionary of syntactic features
    """
    # We'll use NLTK for POS tagging since spaCy might not be available
    import nltk
    from nltk.tokenize import word_tokenize, sent_tokenize
    
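    # Fetch the NLTK resources if they are missing. nltk.download() normally
    # signals failure by returning False, so the return values are checked too.
    # Newer NLTK releases (3.9+) look for the 'punkt_tab' and '..._eng' names.
    resources = ['punkt', 'punkt_tab',
                 'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng']
    try:
        downloaded = [nltk.download(name, quiet=True) for name in resources]
    except Exception:
        downloaded = []
    if not any(downloaded):
        print("NLTK resources could not be downloaded, continuing with limited functionality")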
    
    # Process the text
    sentences = sent_tokenize(text)
    tokens = [word_tokenize(sent) for sent in sentences]
    pos_tags = [nltk.pos_tag(sent) for sent in tokens]
    
    # Initialize fingerprint features
    fingerprint = {
        'pos_trigram_dist': defaultdict(int),
        'sentence_length_dist': defaultdict(int),
        'function_word_ratio': 0,
        'punctuation_ratio': 0
    }
    
    # Calculate POS trigrams
    all_pos = [tag for sent in pos_tags for _, tag in sent]
    for i in range(len(all_pos) - 2):
        trigram = (all_pos[i], all_pos[i+1], all_pos[i+2])
        fingerprint['pos_trigram_dist'][trigram] += 1
    
    # Calculate sentence length distribution
    for sent in tokens:
        length = len(sent)
        bin_size = 10
        bin_index = length // bin_size
        fingerprint['sentence_length_dist'][bin_index] += 1
    
    # Calculate function word ratio
    function_pos = {'IN', 'DT', 'CC', 'PRP', 'PRP$', 'WDT', 'WP', 'WRB', 'TO'}
    function_words = sum(1 for sent in pos_tags for word, tag in sent if tag in function_pos)
    total_words = sum(len(sent) for sent in tokens)
    fingerprint['function_word_ratio'] = function_words / total_words if total_words > 0 else 0
    
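    # Count punctuation tokens. A token counts as punctuation when every one of
    # its characters is a punctuation mark, which also covers NLTK's quote
    # tokens (`` and '') and multi-character tokens such as '...'.
    punct_chars = set('.,:;!?()[]{}"\'`')
    punctuation = sum(1 for sent in tokens for word in sent
                      if all(ch in punct_chars for ch in word))
    fingerprint['punctuation_ratio'] = punctuation / total_words if total_words > 0 else 0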
    
    return fingerprint

# Example text
example_text1 = """
The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet.
The concept of a pangram has been used by font designers and others to display examples of fonts.
"""

example_text2 = """
A swift amber-colored vulpine animal leaps across the indolent canine. This particular sequence
of words incorporates all 26 letters found in the English alphabet.
"""

# Create fingerprints
fingerprint1 = create_syntactic_fingerprint(example_text1)
fingerprint2 = create_syntactic_fingerprint(example_text2)

# Display some key features
print("Sample of POS trigrams in Text 1:")
top_trigrams1 = sorted(fingerprint1['pos_trigram_dist'].items(), 
                       key=lambda x: x[1], reverse=True)[:5]
for trigram, count in top_trigrams1:
    print(f"  {trigram}: {count}")

print("\nSample of POS trigrams in Text 2:")
top_trigrams2 = sorted(fingerprint2['pos_trigram_dist'].items(), 
                       key=lambda x: x[1], reverse=True)[:5]
for trigram, count in top_trigrams2:
    print(f"  {trigram}: {count}")

print("\nFunction word ratio:")
print(f"  Text 1: {fingerprint1['function_word_ratio']:.3f}")
print(f"  Text 2: {fingerprint2['function_word_ratio']:.3f}")

md("### Syntactic Similarity Libraries and Tools")
md("""
Key libraries for syntactic analysis in Palimpsest:
- **spaCy**: Industrial-strength NLP with dependency parsing and POS tagging
- **NLTK**: Natural Language Toolkit with parsing and POS analysis
- **textstat**: Library for calculating readability metrics
- **Stanza / CoreNLP**: Stanford NLP Group's Python pipeline (Stanza, formerly StanfordNLP) and Java toolkit (CoreNLP) for parsing
- **Pattern**: Web mining and NLP toolkit with POS tagging and chunk parsing
""")
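
The cell above builds two fingerprints but stops short of comparing them. A minimal sketch of one possible comparison follows, assuming cosine similarity over the sparse POS-trigram counts is an acceptable measure. The function name `cosine_similarity_sparse` and the toy counts are illustrative and not part of Palimpsest.

```python
import math

def cosine_similarity_sparse(counts_a, counts_b):
    """Cosine similarity between two sparse count dictionaries."""
    # Only keys present in both dictionaries contribute to the dot product
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[k] * counts_b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy POS-trigram counts shaped like fingerprint['pos_trigram_dist']
trigrams_a = {('DT', 'JJ', 'NN'): 3, ('JJ', 'NN', 'VBZ'): 1}
trigrams_b = {('DT', 'JJ', 'NN'): 2, ('NN', 'IN', 'DT'): 2}

similarity = cosine_similarity_sparse(trigrams_a, trigrams_b)
print(f"POS trigram similarity: {similarity:.3f}")
```

With real fingerprints, the call would be `cosine_similarity_sparse(fingerprint1['pos_trigram_dist'], fingerprint2['pos_trigram_dist'])`. Very short texts produce few trigrams, so scores on samples this small are not very informative.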



## 2. Structural Analysis Techniques

Structural analysis examines how text is organized at higher levels, including paragraph structures,
section organization, and document architecture. These techniques help identify structural similarities
and differences between texts.


In [None]:

# Structural analysis techniques comparison
structural_techniques = {
    "Document Segmentation": [
        ("TextTiling", "Topic-based text segmentation", "NLTK", "Medium"),
        ("C99 Algorithm", "Matrix-based text segmentation", "Custom", "Medium"),
        ("TopicTiling", "Topic model-based segmentation", "Custom + LDA", "High"),
        ("BERTopic Segmentation", "BERT-based topic segmentation", "BERTopic", "High")
    ],
    
    "Hierarchical Structure Analysis": [
        ("Section Hierarchy Detection", "Identifying document hierarchy from headings", "Custom + NLP", "Medium"),
        ("Argumentation Mining", "Identifying argument structures", "ArguMiner, ATHAR", "High"),
        ("Rhetorical Structure Theory", "Discourse relations between text segments", "RST parsers", "Very High"),
        ("Document Architecture Analysis", "Identifying structural components", "GATE, Custom", "High")
    ],
    
    "Structural Comparison": [
        ("Structural Fingerprinting", "Generate document structure fingerprints", "Custom implementations", "Medium"),
        ("XML/DOM Distance", "Tree-based structural comparison", "Tree edit distance algorithms", "Medium"),
        ("Structure Visualization", "Visualizing document structure", "NetworkX, D3.js", "Medium"),
        ("Fractal Structure Analysis", "Self-similarity in text structure", "Custom implementations", "High")
    ]
}

# Display the structural techniques with their properties
for category, techniques in structural_techniques.items():
    md(f"### {category}")
    
    table_data = []
    for name, desc, tools, complexity in techniques:
        table_data.append([name, desc, tools, complexity])
    
    df = pd.DataFrame(table_data, 
                     columns=["Technique", "Description", "Tools", "Implementation Complexity"])
    display(df)
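
To make the segmentation idea concrete, here is a deliberately simplified sketch in the spirit of TextTiling. It places a boundary wherever the lexical similarity between adjacent paragraphs drops below a threshold. This is not the TextTiling algorithm itself (which smooths scores over token windows), and the names `find_boundaries` and `threshold` are assumptions made for illustration.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercase whitespace tokens with their counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter objects."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def find_boundaries(paragraphs, threshold=0.1):
    """Return each index i where a topic boundary falls before paragraphs[i]."""
    bags = [bag_of_words(p) for p in paragraphs]
    return [i for i in range(1, len(bags))
            if cosine(bags[i - 1], bags[i]) < threshold]

paragraphs = [
    "cats chase mice",
    "mice fear cats",
    "stocks rose today",
    "today stocks fell",
]
boundaries = find_boundaries(paragraphs)
print(boundaries)
```

On real documents, stop-word removal and stemming would normally precede the similarity step, and the threshold would need tuning per corpus.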


In [None]:

# Example implementation of document structure extraction
import re

def extract_document_structure(text):
    """
    Extract structural components from a document
    
    Args:
        text: Document text
        
    Returns:
        Dictionary representing document structure
    """
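    # Split into paragraphs on blank lines; heading lines are not counted as paragraphs
    paragraphs = []
    for block in re.split(r'\n\s*\n', text):
        body = '\n'.join(line for line in block.splitlines()
                         if not re.match(r'#{1,6}\s', line)).strip()
        if body:
            paragraphs.append(body)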
    
    structure = {
        'paragraph_count': len(paragraphs),
        'paragraph_lengths': [len(p) for p in paragraphs],
        'section_hierarchy': [],
        'section_depths': [],
        'section_relations': []
    }
    
    # Simple section detection (can be enhanced with ML)
    section_pattern = re.compile(r'^(#{1,6})\s+(.+)$', re.MULTILINE)
    sections = [(len(m.group(1)), m.group(2)) for m in section_pattern.finditer(text)]
    
    if sections:
        structure['section_hierarchy'] = sections
        structure['section_depths'] = [depth for depth, _ in sections]
        
        # Create section hierarchy graph
        G = nx.DiGraph()
        stack = [(0, "ROOT", 0)]  # (index, title, depth)
        
        for i, (depth, title) in enumerate(sections):
            # Pop elements from stack that are at same or greater depth
            while stack and stack[-1][2] >= depth:
                stack.pop()
                
            if stack:
                parent_idx = stack[-1][0]
                G.add_edge(parent_idx, i + 1)  # +1 because 0 is ROOT
                
            stack.append((i + 1, title, depth))
            
        structure['section_relations'] = list(G.edges())
    
    return structure

# Example document with structure
example_doc = """
# Document Title

This is an introduction paragraph that sets up the document.
It spans multiple lines to form a single paragraph.

## First Section

This is the content of the first section.
It might contain some important information.

### Subsection 1.1

Deeper nested content with more specific details.

## Second Section

Another top-level section with its own content.
This demonstrates a different branch in the document hierarchy.

### Subsection 2.1

More detailed information in a subsection.

#### Even Deeper Subsection

We can go quite deep in the hierarchy if needed.
"""

# Extract structure
doc_structure = extract_document_structure(example_doc)

# Display structure information
print("Document Structure Analysis:")
print(f"Paragraph count: {doc_structure['paragraph_count']}")
lengths = doc_structure['paragraph_lengths']
avg_length = sum(lengths) / len(lengths) if lengths else 0.0  # avoid dividing by zero on empty input
print(f"Average paragraph length: {avg_length:.1f} characters")
print("\nSection hierarchy:")
for depth, title in doc_structure['section_hierarchy']:
    print(f"{'  ' * (depth-1)}{title}")

# Visualize the section hierarchy
if doc_structure['section_relations']:
    G = nx.DiGraph(doc_structure['section_relations'])
    plt.figure(figsize=(10, 6))
    pos = nx.spring_layout(G)
    
    # Create labels
    labels = {0: "ROOT"}
    for i, (depth, title) in enumerate(doc_structure['section_hierarchy']):
        labels[i + 1] = title
    
    nx.draw(G, pos, with_labels=False, node_size=500, node_color="lightblue")
    nx.draw_networkx_labels(G, pos, labels, font_size=10)
    plt.title("Document Section Hierarchy")
    plt.tight_layout()
    plt.show()
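
The `section_depths` list returned above can double as a simple structural fingerprint. The sketch below compares two such lists with the standard library's `difflib.SequenceMatcher`, assuming that similarity of heading-depth sequences is a reasonable proxy for structural similarity. The function `structure_similarity` and the second depth list are hypothetical.

```python
from difflib import SequenceMatcher

def structure_similarity(depths_a, depths_b):
    """Similarity ratio in [0, 1] between two heading-depth sequences."""
    # autojunk=False keeps difflib from discarding frequently repeated depths
    matcher = SequenceMatcher(None, depths_a, depths_b, autojunk=False)
    return matcher.ratio()

depths_doc1 = [1, 2, 3, 2, 3, 4]   # heading depths of example_doc above
depths_doc2 = [1, 2, 2, 3, 2]      # hypothetical second document

score = structure_similarity(depths_doc1, depths_doc2)
print(f"Structural similarity: {score:.2f}")
```

This ignores section titles and lengths. A tree edit distance over the section graph would capture more, at a higher implementation cost.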



## 3. Plot and Narrative Analysis

Plot and narrative analysis focuses on understanding how stories unfold, identifying 
character relationships, plot arcs, and narrative techniques. These approaches help identify 
similar narrative structures across texts.


In [None]:

# Narrative analysis techniques comparison
narrative_techniques = {
    "Character Analysis": [
        ("Named Entity Recognition", "Identifying characters and places", "spaCy, NLTK", "Low"),
        ("Character Network Analysis", "Social network of characters", "NetworkX + NER", "Medium"),
        ("Character Attribute Extraction", "Identifying character traits", "Custom + NLP", "High"),
        ("Dialogue Attribution", "Associating dialogue with speakers", "Custom ML + rules", "High")
    ],
    
    "Plot Structure Analysis": [
        ("Event Detection", "Identifying key events in narrative", "Custom NLP + ML", "High"),
        ("Story Arc Identification", "Detecting narrative arcs", "Custom + sentiment analysis", "High"),
        ("Narrative Tempo Analysis", "Pacing of narrative", "Custom implementation", "Medium"),
        ("Plot Comparison", "Compare plot structures", "Custom algorithms", "Very High")
    ],
    
    "Narrative Techniques": [
        ("Focalization Analysis", "Point of view detection", "Custom ML classifiers", "High"),
        ("Temporal Structure", "Chronology, flashbacks, etc.", "Custom implementation", "High"),
        ("Narrative Modes", "Showing vs. telling, etc.", "Custom rules + ML", "High"),
        ("Intertextuality Detection", "References to other texts", "Knowledge bases + NLP", "Very High")
    ]
}

# Display the narrative techniques with their properties
for category, techniques in narrative_techniques.items():
    md(f"### {category}")
    
    table_data = []
    for name, desc, tools, complexity in techniques:
        table_data.append([name, desc, tools, complexity])
    
    df = pd.DataFrame(table_data, 
                     columns=["Technique", "Description", "Tools", "Implementation Complexity"])
    display(df)


In [None]:

# Example implementation of narrative element extraction
import re

def extract_narrative_elements(text):
    """
    Extract narrative elements from text
    
    Args:
        text: Document text
        
    Returns:
        Dictionary of narrative features
    """
    # Use NLTK for NER if spaCy is not available
    import nltk
    from nltk.tokenize import word_tokenize, sent_tokenize
    
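    # Fetch the NLTK resources if they are missing; the '_tab' and '_eng'
    # variants are the names newer NLTK releases (3.9+) look for.
    resources = ['punkt', 'punkt_tab',
                 'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng',
                 'maxent_ne_chunker', 'maxent_ne_chunker_tab', 'words']
    try:
        downloaded = [nltk.download(name, quiet=True) for name in resources]
    except Exception:
        downloaded = []
    if not any(downloaded):
        print("NLTK resources could not be downloaded, continuing with limited functionality")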
    
    # Process the text
    sentences = sent_tokenize(text)
    tokens = [word_tokenize(sent) for sent in sentences]
    pos_tags = [nltk.pos_tag(sent) for sent in tokens]
    
    # Try to use NLTK's NER capabilities
    try:
        ner_chunks = [nltk.ne_chunk(sent) for sent in pos_tags]
        # Extract named entities
        entities = []
        for tree in ner_chunks:
            for subtree in tree:
                if hasattr(subtree, 'label'):
                    entity = ' '.join([word for word, tag in subtree.leaves()])
                    label = subtree.label()
                    entities.append((entity, label))
    except Exception:
        # Fallback when the NE chunker is unavailable: simple capitalized word heuristic
        entities = []
        for sent in tokens:
            for word in sent:
                if word[0].isupper() and len(word) > 1:
                    entities.append((word, 'UNKNOWN'))
    
    # Create narrative structure
    narrative = {
        'characters': [],
        'character_mentions': defaultdict(int),
        'character_network': defaultdict(set),
        'dialogue_ratio': 0,
        'narrative_pacing': [],
        'locations': []
    }
    
    # Extract characters (named entities that are people)
    characters = set(name for name, label in entities if label == 'PERSON')
    narrative['characters'] = list(characters)
    
    # Extract locations
    locations = set(name for name, label in entities if label in ('GPE', 'LOCATION', 'FACILITY'))
    narrative['locations'] = list(locations)
    
    # Count character mentions
    for name, label in entities:
        if label == 'PERSON' and name in characters:
            narrative['character_mentions'][name] += 1
    
    # Simple character network construction
    # Characters appearing in the same sentence are connected
    for i, sent in enumerate(sentences):
        sent_characters = set()
        for char in characters:
            if char in sent:
                sent_characters.add(char)
        
        # Connect all characters in this sentence
        for char1 in sent_characters:
            for char2 in sent_characters:
                if char1 != char2:
                    narrative['character_network'][char1].add(char2)
    
    # Estimate dialogue ratio
    dialogue_pattern = re.compile(r'"[^"]*"')
    dialogue_matches = dialogue_pattern.findall(text)
    dialogue_chars = sum(len(m) for m in dialogue_matches)
    narrative['dialogue_ratio'] = dialogue_chars / len(text) if len(text) > 0 else 0
    
    # Simple narrative pacing measurement
    # Divide into up to 10 segments and measure event density
    segment_count = max(1, min(10, len(sentences)))  # array_split needs at least one section
    segments = np.array_split(sentences, segment_count)
    
    for segment in segments:
        # Count named entities and action verbs as proxy for events
        event_count = 0
        for sent in segment:
            # Count entities in this segment
            for name, _ in entities:
                if name in sent:
                    event_count += 1
            # Count action verbs (simplified)
            for word in word_tokenize(sent):
                if word.lower() in ('run', 'jump', 'fight', 'kill', 'die', 'move', 'attack', 'defend',
                                   'go', 'come', 'leave', 'arrive', 'start', 'end', 'begin', 'finish'):
                    event_count += 1
        
        segment_word_count = sum(len(word_tokenize(sent)) for sent in segment)
        narrative['narrative_pacing'].append(event_count / segment_word_count if segment_word_count > 0 else 0)
    
    return narrative

# Example narrative text
example_narrative = """
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: 
once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 
"and what is the use of a book," thought Alice "without pictures or conversations?"

So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), 
whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, 
when suddenly a White Rabbit with pink eyes ran close by her.

There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say 
to itself, "Oh dear! Oh dear! I shall be late!" (when she thought it over afterwards, it occurred to her that she ought 
to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of 
its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind 
that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with 
curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole 
under the hedge.

In another moment down went Alice after it, never once considering how in the world she was to get out again.
"""

# Extract narrative elements
narrative_elements = extract_narrative_elements(example_narrative)

# Display results
print("Character Analysis:")
print(f"Characters identified: {', '.join(narrative_elements['characters'])}")
print(f"Locations identified: {', '.join(narrative_elements['locations'])}")
print(f"Dialogue ratio: {narrative_elements['dialogue_ratio']:.2f}")

# Create character network visualization if characters were found
if narrative_elements['character_network']:
    G = nx.Graph()
    for char in narrative_elements['characters']:
        G.add_node(char)
    
    for char1, connected in narrative_elements['character_network'].items():
        for char2 in connected:
            G.add_edge(char1, char2)
    
    plt.figure(figsize=(8, 6))
    pos = nx.spring_layout(G)
    nx.draw(G, pos, with_labels=True, node_color="lightgreen", 
            node_size=3000, font_size=10, font_weight="bold")
    plt.title("Character Relationship Network")
    plt.tight_layout()
    plt.show()

# Plot narrative pacing
plt.figure(figsize=(10, 4))
plt.plot(narrative_elements['narrative_pacing'], marker='o')
plt.title("Narrative Pacing")
plt.xlabel("Segment")
plt.ylabel("Event Density")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
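
The character network above records only whether two characters ever share a sentence. A small extension, sketched here under the same co-occurrence assumption, counts how often each pair co-occurs so that edges can carry weights. The function `cooccurrence_weights` and the sample sentences are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_weights(sentences, characters):
    """Count how many sentences each pair of characters shares."""
    weights = Counter()
    for sentence in sentences:
        present = sorted(c for c in characters if c in sentence)
        # Each unordered pair is counted once per sentence
        for pair in combinations(present, 2):
            weights[pair] += 1
    return weights

# Hypothetical sample input, not taken from the source text
sample_sentences = [
    "Alice followed the Rabbit down the hole.",
    "The Rabbit ignored Alice and checked his watch.",
    "Alice met the Queen.",
]
sample_characters = ["Alice", "Rabbit", "Queen"]

weights = cooccurrence_weights(sample_sentences, sample_characters)
for (a, b), w in weights.most_common():
    print(f"{a} - {b}: {w}")
```

Each weight can be attached to a NetworkX edge with `G.add_edge(a, b, weight=w)` and then used to scale edge widths in the plot.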



## 4. Implementation Approaches for Palimpsest

This section outlines practical approaches for implementing the above techniques
in the Palimpsest tool, with particular focus on handling large documents efficiently.



### 4.1 Multi-level Analysis Framework

To effectively integrate these techniques into Palimpsest, we recommend a layered approach:

1. **Preprocessing Pipeline**:
   - Document segmentation into logical units (paragraphs, sections)
   - Syntactic analysis and fingerprinting at various levels of granularity
   - Structural feature extraction
   - Narrative element identification

2. **Analysis Levels**:
   - Surface level: Basic syntax features, readability, structural elements
   - Middle level: Syntactic patterns, section organization, character networks
   - Deep level: Plot arcs, intertextual connections, narrative techniques

3. **Efficient Implementation Strategies**:
   - Use sparse representations for syntactic features
   - Implement incremental analysis for very large documents
   - Apply clustering techniques to identify related document sections
   - Run independent analysis tasks in parallel (process-based for CPU-bound work in CPython, where threads share the GIL)
   
4. **Visualization Framework**:
   - Syntactic heat maps showing similarity regions
   - Document structure comparisons using tree/graph visualizations 
   - Character network visualizations
   - Plot arc comparisons using line charts and overlays
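
The sparse-representation and incremental-analysis strategies above can be sketched together as follows. This is a minimal illustration, assuming paragraphs arrive as an iterable of strings and using whitespace word counts as a stand-in for whatever feature extractor Palimpsest ends up using. The names `iter_chunks`, `incremental_word_counts`, and `max_chars` are hypothetical.

```python
from collections import Counter

def iter_chunks(paragraphs, max_chars=2000):
    """Yield lists of paragraphs, starting a new list when max_chars would be exceeded."""
    chunk, size = [], 0
    for paragraph in paragraphs:
        if chunk and size + len(paragraph) > max_chars:
            yield chunk
            chunk, size = [], 0
        chunk.append(paragraph)
        size += len(paragraph)
    if chunk:
        yield chunk

def incremental_word_counts(paragraphs, max_chars=2000):
    """Accumulate a sparse feature vector (a Counter) one chunk at a time."""
    totals = Counter()
    for chunk in iter_chunks(paragraphs, max_chars):
        for paragraph in chunk:
            totals.update(paragraph.lower().split())
    return totals

paragraphs = ["the cat sat", "the dog ran", "a cat ran"]
chunks = list(iter_chunks(paragraphs, max_chars=22))
totals = incremental_word_counts(paragraphs, max_chars=22)
print(chunks)
print(totals.most_common(3))
```

Because `iter_chunks` is a generator, only one chunk needs to be in memory at a time when the paragraphs themselves are streamed from disk. The `Counter` stores only features that actually occur, which is the sparse representation.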



## 5. Recommendations for Palimpsest

Based on our research, we recommend the following approaches for implementing advanced text analysis in Palimpsest:


In [None]:

# Key recommendations
recommendations = [
    ("Multi-tier Analysis Framework", 
     "Implement a layered approach that allows both quick surface analysis and deep structural comparison"),
    
    ("Adaptive Processing", 
     "Apply more intensive techniques only to promising document sections identified by lightweight methods"),
    
    ("Comparative Visualization", 
     "Develop visualization tools that can show syntactic, structural, and narrative similarities simultaneously"),
    
    ("Incremental Analysis", 
     "Support progressive analysis where results are refined as more intensive techniques are applied"),
    
    ("Cross-domain Integration", 
     "Combine insights from syntactic, structural, and narrative analysis with previously researched string and semantic matching"),
    
    ("Configurable Analysis Pipeline", 
     "Allow users to select which analysis techniques to apply based on their research questions"),
    
    ("Ground-truth Validation", 
     "Include tools for users to validate and correct automated analysis, improving accuracy over time")
]

# Display recommendations in a table format
recommendation_df = pd.DataFrame(recommendations, columns=["Recommendation", "Description"])
display(recommendation_df)
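
The "Adaptive Processing" recommendation could look roughly like the sketch below: a cheap word-overlap filter decides which section pairs receive the costly comparison. `difflib.SequenceMatcher` stands in for an expensive analysis such as tree edit distance. The function names, the `threshold` value, and the sample sections are assumptions for illustration.

```python
from difflib import SequenceMatcher

def jaccard(a, b):
    """Cheap similarity: overlap of lowercase word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def adaptive_compare(sections_a, sections_b, threshold=0.3):
    """Run the costly comparison only on pairs that pass the cheap filter."""
    results = []
    for i, sec_a in enumerate(sections_a):
        for j, sec_b in enumerate(sections_b):
            if jaccard(sec_a, sec_b) < threshold:
                continue  # skip unpromising pairs
            # Stand-in for an expensive analysis
            detailed = SequenceMatcher(None, sec_a, sec_b).ratio()
            results.append((i, j, detailed))
    return results

doc_a = ["the fox jumps over the dog", "stock prices fell sharply"]
doc_b = ["the fox leaps over the dog", "rain is expected tomorrow"]

matches = adaptive_compare(doc_a, doc_b)
for i, j, score in matches:
    print(f"section {i} vs section {j}: {score:.2f}")
```

Only one of the four section pairs passes the filter here, so the detailed comparison runs once instead of four times. The savings grow with document size, at the risk of missing pairs that the cheap filter rejects.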



## 6. Libraries for Implementation

The following libraries provide essential functionality for implementing the proposed techniques in Palimpsest:


In [None]:

# Libraries for implementation
libraries = {
    "Syntactic Analysis": [
        ("spaCy", "Industrial-strength NLP with syntactic analysis"),
        ("NLTK", "Natural Language Toolkit with various linguistic tools"),
        ("CoreNLP", "Stanford's NLP suite with advanced parsing"),
        ("stanza", "Stanford NLP Group's Python library (neural pipeline plus CoreNLP client)"),
        ("textstat", "Text statistics and readability metrics")
    ],
    
    "Structural Analysis": [
        ("NetworkX", "Network analysis and visualization"),
        ("scikit-learn", "Machine learning for structure classification"),
        ("LDA implementations", "Topic modeling for section analysis"),
        ("pygraphviz", "Graph visualization"),
        ("BeautifulSoup", "HTML/XML parsing for structured documents")
    ],
    
    "Narrative Analysis": [
        ("BookNLP", "NLP pipeline for narrative text"),
        ("LitBank", "Annotated literary dataset (entities, events, coreference) for training and evaluation"),
        ("NLTK's sentiment tools", "Sentiment analysis for plot arcs"),
        ("NetworkX", "Character relationship network analysis"),
        ("SpanMarker", "Entity and span detection models")
    ]
}

for category, libs in libraries.items():
    md(f"### {category}")
    
    table_data = []
    for name, desc in libs:
        table_data.append([name, desc])
    
    df = pd.DataFrame(table_data, columns=["Library", "Description"])
    display(df)



## 7. Conclusion

This research has identified advanced techniques for syntactic, structural, and plot analysis 
that can be integrated into Palimpsest for comprehensive text comparison. By combining these 
approaches with the previously researched string matching and semantic matching techniques, 
Palimpsest will be able to provide multi-layered analysis of textual similarities and differences.

The integration of these techniques will allow users to:
1. Identify authorial fingerprints through syntactic analysis
2. Discover structural patterns and organizational influences
3. Compare narrative techniques and plot development across texts
4. Visualize multi-faceted textual relationships

This combined approach makes Palimpsest suitable for literary analysis,
comparative literature studies, and historical document research.
