In [None]:
````xml
<VSCode.Cell language="raw">
---
title: "Text Corpus Analyzer - OOP Project"
format: 
  html:
    code-fold: false
---
</VSCode.Cell>
<VSCode.Cell language="markdown">
# 📚 Text Corpus Analyzer

A data-driven OOP project analyzing literary works from great authors:
- **Fyodor Dostoevsky** - Russian psychological realism
- **Albert Camus** - French absurdism  
- **Erich Maria Remarque** - German war literature

We'll build classes to analyze writing styles and compare authors.
</VSCode.Cell>
<VSCode.Cell language="markdown">
## 📦 Import Libraries
</VSCode.Cell>
<VSCode.Cell language="python">
import json
import re
from collections import Counter
from typing import List, Dict
</VSCode.Cell>
<VSCode.Cell language="markdown">
## 📝 Document Class

Represents a single text document (excerpt from a book)
</VSCode.Cell>
<VSCode.Cell language="python">
class Document:
    """Represents a single text document with analysis methods"""
    
    # Common English stop words (small selection)
    STOP_WORDS = {
        'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have', 'i',
        'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you', 'do', 'at',
        'this', 'but', 'his', 'by', 'from', 'they', 'we', 'her', 'she',
        'or', 'an', 'will', 'my', 'one', 'all', 'would', 'there', 'their',
        'what', 'so', 'if', 'who', 'which', 'when', 'can', 'has', 'had',
        'were', 'been', 'is', 'was', 'are', 'am'
    }
    
    def __init__(self, author: str, title: str, text: str, theme: str = None, year: int = None):
        self.author = author
        self.title = title
        self.text = text
        self.theme = theme
        self.year = year
        self._words = None  # Cache for processed words
        
    def get_words(self) -> List[str]:
        """Get list of words (lowercase, no punctuation)"""
        if self._words is None:
            # Remove punctuation and convert to lowercase
            text = self.text.lower()
            text = re.sub(r'[^a-z\s]', '', text)
            self._words = text.split()
        return self._words
    
    def word_count(self) -> int:
        """Total number of words"""
        return len(self.get_words())
    
    def unique_word_count(self) -> int:
        """Number of unique words"""
        return len(set(self.get_words()))
    
    def vocabulary_richness(self) -> float:
        """Ratio of unique words to total words"""
        total = self.word_count()
        if total == 0:
            return 0
        return self.unique_word_count() / total
    
    def average_word_length(self) -> float:
        """Average length of words in characters"""
        words = self.get_words()
        if not words:
            return 0
        return sum(len(word) for word in words) / len(words)
    
    def sentence_count(self) -> int:
        """Estimate number of sentences"""
        return len(re.findall(r'[.!?]+', self.text))
    
    def average_sentence_length(self) -> float:
        """Average words per sentence"""
        sentences = self.sentence_count()
        if sentences == 0:
            return 0
        return self.word_count() / sentences
    
    def top_words(self, n: int = 10, exclude_stopwords: bool = True) -> List[tuple]:
        """Most common words"""
        if exclude_stopwords:
            words = [w for w in self.get_words() if w not in self.STOP_WORDS]
        else:
            words = self.get_words()
        return Counter(words).most_common(n)
    
    def contains_word(self, word: str) -> bool:
        """Check if document contains a specific word"""
        return word.lower() in self.get_words()
    
    def __repr__(self):
        return f"Document('{self.title}' by {self.author})"
    
    def summary(self) -> str:
        """Generate a summary of the document"""
        top_5 = ', '.join(w for w, _ in self.top_words(5))
        return f"""
📄 {self.title}
✍️  Author: {self.author}
🏷️  Theme: {self.theme}
📅 Year: {self.year}

📊 Statistics:
   • Total words: {self.word_count()}
   • Unique words: {self.unique_word_count()}
   • Vocabulary richness: {self.vocabulary_richness():.2%}
   • Avg word length: {self.average_word_length():.2f} chars
   • Sentences: {self.sentence_count()}
   • Words per sentence: {self.average_sentence_length():.1f}

🔝 Top words: {top_5}
"""
</VSCode.Cell>
<VSCode.Cell language="markdown">
## ✍️ Author Class

Represents an author with multiple documents
</VSCode.Cell>
<VSCode.Cell language="python">
class Author:
    """Represents an author with multiple documents"""
    
    def __init__(self, name: str):
        self.name = name
        self.documents: List[Document] = []
    
    def add_document(self, document: Document):
        """Add a document to this author's collection"""
        if document.author == self.name:
            self.documents.append(document)
        else:
            raise ValueError(f"Document author mismatch: '{document.author}' != '{self.name}'")
    
    def document_count(self) -> int:
        """Number of documents"""
        return len(self.documents)
    
    def total_words(self) -> int:
        """Total words across all documents"""
        return sum(doc.word_count() for doc in self.documents)
    
    def average_document_length(self) -> float:
        """Average words per document"""
        if not self.documents:
            return 0
        return self.total_words() / len(self.documents)
    
    def vocabulary_size(self) -> int:
        """Total unique words used"""
        all_words = []
        for doc in self.documents:
            all_words.extend(doc.get_words())
        return len(set(all_words))
    
    def average_word_length(self) -> float:
        """Mean of the per-document average word lengths (each document counts equally)"""
        if not self.documents:
            return 0
        return sum(doc.average_word_length() for doc in self.documents) / len(self.documents)
    
    def average_sentence_length(self) -> float:
        """Mean of the per-document average sentence lengths (each document counts equally)"""
        if not self.documents:
            return 0
        return sum(doc.average_sentence_length() for doc in self.documents) / len(self.documents)
    
    def favorite_words(self, n: int = 10) -> List[tuple]:
        """Most frequently used words across all documents"""
        all_words = []
        for doc in self.documents:
            words = [w for w in doc.get_words() if w not in Document.STOP_WORDS]
            all_words.extend(words)
        return Counter(all_words).most_common(n)
    
    def themes(self) -> Dict[str, int]:
        """Count documents by theme"""
        theme_list = [doc.theme for doc in self.documents if doc.theme]
        return dict(Counter(theme_list))
    
    def find_documents_with_word(self, word: str) -> List[Document]:
        """Find documents containing a word"""
        return [doc for doc in self.documents if doc.contains_word(word)]
    
    def __repr__(self):
        return f"Author('{self.name}', {len(self.documents)} documents)"
    
    def summary(self) -> str:
        """Detailed summary of the author"""
        themes = self.themes()
        top_words = ', '.join(w for w, _ in self.favorite_words(5))
        
        return f"""
✍️  {self.name}
{'=' * (len(self.name) + 3)}

📚 Overview:
   • Documents: {self.document_count()}
   • Total words: {self.total_words():,}
   • Unique vocabulary: {self.vocabulary_size():,}
   • Avg document length: {self.average_document_length():.0f} words

📝 Writing Style:
   • Avg word length: {self.average_word_length():.2f} chars
   • Avg sentence length: {self.average_sentence_length():.1f} words

💭 Themes: {', '.join(f'{k} ({v})' for k, v in themes.items())}

🔝 Favorite words: {top_words}
"""
</VSCode.Cell>
<VSCode.Cell language="markdown">
## 📚 Corpus Class

Manages the entire collection of documents from multiple authors
</VSCode.Cell>
<VSCode.Cell language="python">
class Corpus:
    """A collection of documents from multiple authors"""
    
    def __init__(self, name: str = "Literary Corpus"):
        self.name = name
        self.documents: List[Document] = []
        self.authors: Dict[str, Author] = {}
    
    def add_document(self, document: Document):
        """Add a document to the corpus"""
        self.documents.append(document)
        
        # Add to author's collection
        if document.author not in self.authors:
            self.authors[document.author] = Author(document.author)
        self.authors[document.author].add_document(document)
    
    def load_from_json(self, filepath: str):
        """Load documents from a JSON file"""
        with open(filepath, 'r', encoding='utf-8') as f:
            data = json.load(f)
        
        for item in data:
            doc = Document(
                author=item['author'],
                title=item['title'],
                text=item['text'],
                theme=item.get('theme'),
                year=item.get('year')
            )
            self.add_document(doc)
    
    def get_author(self, name: str) -> Author:
        """Get Author object by name"""
        return self.authors.get(name)
    
    def author_names(self) -> List[str]:
        """List all author names"""
        return sorted(self.authors.keys())
    
    def compare_authors(self, metric: str) -> Dict[str, float]:
        """
        Compare authors on a specific metric.
        Options: 'avg_word_length', 'avg_sentence_length', 'vocabulary_size', 'total_words'
        """
        result = {}
        for author_name, author in self.authors.items():
            if metric == 'avg_word_length':
                result[author_name] = author.average_word_length()
            elif metric == 'avg_sentence_length':
                result[author_name] = author.average_sentence_length()
            elif metric == 'vocabulary_size':
                result[author_name] = author.vocabulary_size()
            elif metric == 'total_words':
                result[author_name] = author.total_words()
        
        # Sort by value, highest first
        return dict(sorted(result.items(), key=lambda x: x[1], reverse=True))
    
    def find_by_theme(self, theme: str) -> List[Document]:
        """Find all documents with a specific theme"""
        return [doc for doc in self.documents if doc.theme == theme]
    
    def all_themes(self) -> List[str]:
        """Get all unique themes"""
        themes = set(doc.theme for doc in self.documents if doc.theme)
        return sorted(themes)
    
    def search(self, word: str) -> List[Document]:
        """Find documents containing a word"""
        return [doc for doc in self.documents if doc.contains_word(word)]
    
    def __repr__(self):
        return f"Corpus('{self.name}', {len(self.authors)} authors, {len(self.documents)} docs)"
    
    def summary(self) -> str:
        """Overview of the corpus"""
        total_words = sum(author.total_words() for author in self.authors.values())
        
        return f"""
📚 {self.name}
{'=' * (len(self.name) + 3)}

📊 Statistics:
   • Authors: {len(self.authors)}
   • Documents: {len(self.documents)}
   • Total words: {total_words:,}
   • Themes: {len(self.all_themes())}

✍️  Authors: {', '.join(self.author_names())}

🏷️  Themes: {', '.join(self.all_themes())}
"""
</VSCode.Cell>
<VSCode.Cell language="markdown">
---

# 🔬 Analysis Examples
</VSCode.Cell>
<VSCode.Cell language="markdown">
## 1️⃣ Load the Data
</VSCode.Cell>
<VSCode.Cell language="python">
# Create a corpus and load all authors
corpus = Corpus("Great 19th and 20th Century Writers")
corpus.load_from_json('data/authors.json')

print(corpus.summary())
</VSCode.Cell>
<VSCode.Cell language="markdown">
## 2️⃣ Analyze a Single Document
</VSCode.Cell>
<VSCode.Cell language="python">
# Look at the first document
doc = corpus.documents[0]
print(doc.summary())
</VSCode.Cell>
<VSCode.Cell language="markdown">
## 3️⃣ Compare All Authors
</VSCode.Cell>
<VSCode.Cell language="python">
# Print summary for each author
for author_name in corpus.author_names():
    author = corpus.get_author(author_name)
    print(author.summary())
    print("="*80, "\n")
</VSCode.Cell>
<VSCode.Cell language="markdown">
## 4️⃣ Compare Authors on Specific Metrics
</VSCode.Cell>
<VSCode.Cell language="python">
print("📏 WHO USES LONGER WORDS?")
for author, length in corpus.compare_authors('avg_word_length').items():
    print(f"   {author:30} {length:.2f} characters")

print("\n📐 WHO WRITES LONGER SENTENCES?")
for author, length in corpus.compare_authors('avg_sentence_length').items():
    print(f"   {author:30} {length:.1f} words")

print("\n📖 WHO HAS THE RICHEST VOCABULARY?")
for author, vocab in corpus.compare_authors('vocabulary_size').items():
    print(f"   {author:30} {vocab:,} unique words")
</VSCode.Cell>
<VSCode.Cell language="markdown">
## 5️⃣ Explore Themes
</VSCode.Cell>
<VSCode.Cell language="python">
print("🏷️  DOCUMENTS BY THEME:\n")

for theme in corpus.all_themes():
    docs = corpus.find_by_theme(theme)
    print(f"{theme} ({len(docs)} docs):")
    for doc in docs:
        print(f"   • {doc.title} by {doc.author}")
    print()
</VSCode.Cell>
<VSCode.Cell language="markdown">
## 6️⃣ Search for Specific Words
</VSCode.Cell>
<VSCode.Cell language="python">
# Which documents mention "death"?
print("💀 Documents containing 'death':\n")
death_docs = corpus.search('death')
for doc in death_docs:
    print(f"   • {doc.title} ({doc.author})")

print(f"\n📊 {len(death_docs)} out of {len(corpus.documents)} documents mention death")

print("\n" + "="*80 + "\n")

# Which documents mention "freedom"?
print("🕊️ Documents containing 'freedom':\n")
freedom_docs = corpus.search('freedom')
for doc in freedom_docs:
    print(f"   • {doc.title} ({doc.author})")

print(f"\n📊 {len(freedom_docs)} out of {len(corpus.documents)} documents mention freedom")
</VSCode.Cell>
<VSCode.Cell language="markdown">
## 7️⃣ Analyze Author's Favorite Words
</VSCode.Cell>
<VSCode.Cell language="python">
print("🎯 TOP 10 WORDS FOR EACH AUTHOR:\n")

for author_name in corpus.author_names():
    author = corpus.get_author(author_name)
    print(f"{author_name}:")
    
    for word, count in author.favorite_words(10):
        print(f"   {word:15} {count:3} times")
    print()
</VSCode.Cell>
<VSCode.Cell language="markdown">
## 8️⃣ Find Who Writes About Specific Topics
</VSCode.Cell>
<VSCode.Cell language="python">
# Check which author talks most about "war"
print("⚔️ WHO TALKS ABOUT WAR?\n")

for author_name in corpus.author_names():
    author = corpus.get_author(author_name)
    war_docs = author.find_documents_with_word('war')
    if war_docs:
        print(f"{author_name}: {len(war_docs)} document(s)")
        for doc in war_docs:
            print(f"   • {doc.title}")
</VSCode.Cell>
<VSCode.Cell language="markdown">
---

# 🎯 Exercises for Students

1. **Add a new method**: Create a `longest_words(n)` method in the `Document` class that returns the `n` longest words

2. **Comparison method**: Add a method to compare two authors directly

3. **Time analysis**: Add methods to analyze documents by year/decade

4. **Word pairs**: Extend Document to find common word pairs (bigrams)

5. **Inheritance**: Create specialized classes like `NovelExcerpt` or `PhilosophicalText` that inherit from Document

6. **Visualization**: Use matplotlib to create bar charts comparing authors

7. **More data**: Add JSON data for other authors (Tolstoy, Kafka, Hemingway)

8. **Export**: Add methods to export analysis results to CSV

9. **Statistics**: Calculate and compare the standard deviation of sentence lengths across authors

10. **Themes analysis**: Create a new `Theme` class that analyzes documents by theme
</VSCode.Cell>
````

In [None]:
---
title: "Text Corpus Analyzer - OOP Project"
format: 
  html:
    code-fold: false
---

# 📚 Text Corpus Analyzer

A data-driven OOP project analyzing literary works from great authors:
- **Fyodor Dostoevsky** - Russian psychological realism
- **Albert Camus** - French absurdism  
- **Erich Maria Remarque** - German war literature

We'll build classes to analyze writing styles, themes, and linguistic patterns.

## 📦 Import Libraries

In [None]:
import json
import re
from collections import Counter
from pathlib import Path
import math
from typing import List, Dict, Set
import string

## 📝 Document Class

Represents a single text document (excerpt from a book)
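
Most of the statistics in this class rest on two small ideas: the regex tokenization and the ratio of unique words to total words. The sketch below isolates them as standalone functions. The names `tokenize` and `type_token_ratio` are illustrative only and not part of the class; the regular expression is the one the class uses. Stripping every character outside `a-z` fuses contractions (`don't` becomes `dont`) and drops accented letters, which matters for translated literature.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text, drop everything except a-z and whitespace, then split."""
    cleaned = re.sub(r'[^a-z\s]', '', text.lower())
    return cleaned.split()


def type_token_ratio(words: List[str]) -> float:
    """Unique words divided by total words; 0.0 for empty input."""
    if not words:
        return 0.0
    return len(set(words)) / len(words)


sample = "Don't panic. Don't!"
words = tokenize(sample)
print(words)                    # ['dont', 'panic', 'dont']
print(type_token_ratio(words))  # 2 unique words out of 3
print(tokenize("café"))         # ['caf'] because the accented letter is removed
```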

In [None]:
class Document:
    """Represents a single text document with analysis methods"""
    
    # Common English stop words
    STOP_WORDS = {
        'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have', 'i',
        'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you', 'do', 'at',
        'this', 'but', 'his', 'by', 'from', 'they', 'we', 'say', 'her', 'she',
        'or', 'an', 'will', 'my', 'one', 'all', 'would', 'there', 'their',
        'what', 'so', 'up', 'out', 'if', 'about', 'who', 'get', 'which', 'go',
        'me', 'when', 'make', 'can', 'like', 'time', 'no', 'just', 'him', 'know',
        'take', 'people', 'into', 'year', 'your', 'good', 'some', 'could', 'them',
        'see', 'other', 'than', 'then', 'now', 'look', 'only', 'come', 'its', 'over',
        'think', 'also', 'back', 'after', 'use', 'two', 'how', 'our', 'work',
        'first', 'well', 'way', 'even', 'new', 'want', 'because', 'any', 'these',
        'give', 'day', 'most', 'us', 'is', 'was', 'are', 'been', 'has', 'had',
        'were', 'said', 'did', 'having', 'may', 'should', 'am'
    }
    
    def __init__(self, author: str, title: str, text: str, theme: str = None, year: int = None):
        self.author = author
        self.title = title
        self.text = text
        self.theme = theme
        self.year = year
        self._tokens = None  # Cache for tokens
        self._words = None   # Cache for words only
        
    def tokenize(self) -> List[str]:
        """Split text into tokens (words)"""
        if self._tokens is None:
            # Remove punctuation and convert to lowercase
            text = self.text.lower()
            # Keep only letters and spaces
            text = re.sub(r'[^a-z\s]', '', text)
            self._tokens = text.split()
        return self._tokens
    
    def get_words(self) -> List[str]:
        """Get list of words (same as tokens for now, but could filter further)"""
        if self._words is None:
            self._words = self.tokenize()
        return self._words
    
    def word_count(self) -> int:
        """Total number of words in document"""
        return len(self.get_words())
    
    def unique_word_count(self) -> int:
        """Number of unique words (vocabulary size)"""
        return len(set(self.get_words()))
    
    def vocabulary_richness(self) -> float:
        """Ratio of unique words to total words (type-token ratio)"""
        total = self.word_count()
        if total == 0:
            return 0
        return self.unique_word_count() / total
    
    def average_word_length(self) -> float:
        """Average length of words in characters"""
        words = self.get_words()
        if not words:
            return 0
        return sum(len(word) for word in words) / len(words)
    
    def sentence_count(self) -> int:
        """Estimate number of sentences"""
        # Count sentence-ending punctuation
        return len(re.findall(r'[.!?]+', self.text))
    
    def average_sentence_length(self) -> float:
        """Average words per sentence"""
        sentences = self.sentence_count()
        if sentences == 0:
            return 0
        return self.word_count() / sentences
    
    def word_frequency(self, top_n: int = 10) -> List[tuple]:
        """Most common words with their frequencies"""
        words = self.get_words()
        counter = Counter(words)
        return counter.most_common(top_n)
    
    def non_stopword_frequency(self, top_n: int = 10) -> List[tuple]:
        """Most common non-stopwords"""
        words = [w for w in self.get_words() if w not in self.STOP_WORDS]
        counter = Counter(words)
        return counter.most_common(top_n)
    
    def long_words(self, min_length: int = 7) -> List[str]:
        """Find unique words with at least min_length characters, sorted alphabetically"""
        return sorted(set(w for w in self.get_words() if len(w) >= min_length))
    
    def contains_word(self, word: str) -> bool:
        """Check if document contains a specific word"""
        return word.lower() in self.get_words()
    
    def word_positions(self, word: str) -> List[int]:
        """Get all positions (indices) where a word appears"""
        word = word.lower()
        return [i for i, w in enumerate(self.get_words()) if w == word]
    
    def flesch_reading_ease(self) -> float:
        """
        Calculate Flesch Reading Ease score.
        90-100: Very Easy
        60-70: Standard
        0-30: Very Difficult
        """
        words = self.word_count()
        sentences = self.sentence_count()
        
        if words == 0 or sentences == 0:
            return 0
        
        # Count syllables (rough approximation)
        syllables = sum(self._count_syllables(word) for word in self.get_words())
        
        # Flesch formula
        score = 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
        return round(score, 2)
    
    def _count_syllables(self, word: str) -> int:
        """Rough syllable counter"""
        word = word.lower()
        count = 0
        vowels = 'aeiouy'
        previous_was_vowel = False
        
        for char in word:
            is_vowel = char in vowels
            if is_vowel and not previous_was_vowel:
                count += 1
            previous_was_vowel = is_vowel
        
        # Adjust for silent e
        if word.endswith('e'):
            count -= 1
        
        # At least one syllable
        if count == 0:
            count = 1
            
        return count
    
    def sentiment_score(self) -> Dict[str, int]:
        """
        Basic sentiment analysis using word lists.
        Returns counts of positive, negative, and death-related words.
        """
        positive_words = {
            'love', 'hope', 'joy', 'peace', 'happy', 'beauty', 'beautiful', 
            'good', 'comfort', 'freedom', 'light', 'life', 'salvation'
        }
        
        negative_words = {
            'death', 'fear', 'pain', 'suffering', 'despair', 'shame', 'guilt',
            'terrible', 'awful', 'horrible', 'bad', 'evil', 'dark', 'murder'
        }
        
        death_words = {
            'death', 'dead', 'die', 'died', 'dying', 'kill', 'killed', 'murder',
            'murdered', 'corpse', 'grave', 'funeral', 'coffin'
        }
        
        words = self.get_words()
        
        return {
            'positive': sum(1 for w in words if w in positive_words),
            'negative': sum(1 for w in words if w in negative_words),
            'death_related': sum(1 for w in words if w in death_words)
        }
    
    def __repr__(self):
        return f"Document('{self.title}' by {self.author}, {self.word_count()} words)"
    
    def summary(self) -> str:
        """Generate a summary of the document statistics"""
        return f"""
📄 {self.title}
✍️  Author: {self.author}
📅 Year: {self.year}
🏷️  Theme: {self.theme}

📊 Statistics:
   • Words: {self.word_count()}
   • Unique words: {self.unique_word_count()}
   • Vocabulary richness: {self.vocabulary_richness():.2%}
   • Average word length: {self.average_word_length():.2f} characters
   • Sentences: {self.sentence_count()}
   • Words per sentence: {self.average_sentence_length():.1f}
   • Readability (Flesch): {self.flesch_reading_ease():.1f}

🔝 Top 5 content words: {', '.join(w for w, _ in self.non_stopword_frequency(5))}
"""

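The readability score in `flesch_reading_ease` follows the Flesch formula, 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The sketch below repeats that calculation outside the class so the intermediate counts can be checked by hand. The function names are illustrative, and the syllable counter is the same rough vowel-group heuristic, so the score is only an approximation.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, subtract one for a trailing 'e'."""
    word = word.lower()
    count = 0
    previous_was_vowel = False
    for char in word:
        is_vowel = char in 'aeiouy'
        if is_vowel and not previous_was_vowel:
            count += 1
        previous_was_vowel = is_vowel
    if word.endswith('e'):
        count -= 1
    return max(count, 1)  # every word gets at least one syllable


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease using the same tokenization and sentence estimate."""
    words = re.sub(r'[^a-z\s]', '', text.lower()).split()
    sentences = len(re.findall(r'[.!?]+', text))
    if not words or sentences == 0:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))


text = "The cat sat on the mat. It was happy."
# 9 words, 2 sentences, 10 estimated syllables ("happy" counts as 2)
print(round(flesch_reading_ease(text), 1))  # 108.3
```

Very simple text can score above 100, so the bands in the docstring are a guide rather than hard limits.
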
## ✍️ Author Class

Represents an author and analyzes all their documents collectively

In [None]:
class Author:
    """Represents an author with multiple documents"""
    
    def __init__(self, name: str):
        self.name = name
        self.documents: List[Document] = []
    
    def add_document(self, document: Document):
        """Add a document to this author's collection"""
        if document.author == self.name:
            self.documents.append(document)
        else:
            raise ValueError(f"Document author '{document.author}' doesn't match Author '{self.name}'")
    
    def document_count(self) -> int:
        """Total number of documents by this author"""
        return len(self.documents)
    
    def total_words(self) -> int:
        """Total words across all documents"""
        return sum(doc.word_count() for doc in self.documents)
    
    def average_words_per_document(self) -> float:
        """Average document length"""
        if not self.documents:
            return 0
        return self.total_words() / len(self.documents)
    
    def vocabulary_size(self) -> int:
        """Total unique words used by author across all documents"""
        all_words = []
        for doc in self.documents:
            all_words.extend(doc.get_words())
        return len(set(all_words))
    
    def overall_vocabulary_richness(self) -> float:
        """Vocabulary richness across all documents"""
        total = self.total_words()
        if total == 0:
            return 0
        return self.vocabulary_size() / total
    
    def average_word_length(self) -> float:
        """Mean of the per-document average word lengths (each document counts equally)"""
        if not self.documents:
            return 0
        return sum(doc.average_word_length() for doc in self.documents) / len(self.documents)
    
    def average_sentence_length(self) -> float:
        """Mean of the per-document average sentence lengths (each document counts equally)"""
        if not self.documents:
            return 0
        return sum(doc.average_sentence_length() for doc in self.documents) / len(self.documents)
    
    def average_readability(self) -> float:
        """Average Flesch reading ease score"""
        if not self.documents:
            return 0
        return sum(doc.flesch_reading_ease() for doc in self.documents) / len(self.documents)
    
    def favorite_words(self, top_n: int = 10, exclude_stopwords: bool = True) -> List[tuple]:
        """Most frequently used words across all documents"""
        all_words = []
        for doc in self.documents:
            if exclude_stopwords:
                all_words.extend([w for w in doc.get_words() if w not in Document.STOP_WORDS])
            else:
                all_words.extend(doc.get_words())
        
        counter = Counter(all_words)
        return counter.most_common(top_n)
    
    def themes_distribution(self) -> Dict[str, int]:
        """Count documents by theme"""
        themes = [doc.theme for doc in self.documents if doc.theme]
        return dict(Counter(themes))
    
    def signature_words(self, min_frequency: int = 3) -> List[str]:
        """
        Words that appear frequently in this author's work.
        These could be considered 'signature' words for the author.
        """
        word_freq = self.favorite_words(top_n=50, exclude_stopwords=True)
        return [word for word, count in word_freq if count >= min_frequency]
    
    def sentiment_profile(self) -> Dict[str, float]:
        """Average sentiment scores across all documents"""
        if not self.documents:
            return {'positive': 0, 'negative': 0, 'death_related': 0}
        
        scores = [doc.sentiment_score() for doc in self.documents]  # score each document once
        total_positive = sum(s['positive'] for s in scores)
        total_negative = sum(s['negative'] for s in scores)
        total_death = sum(s['death_related'] for s in scores)
        total_words = self.total_words()
        
        if total_words == 0:
            return {'positive': 0, 'negative': 0, 'death_related': 0}
        
        return {
            'positive': (total_positive / total_words) * 100,
            'negative': (total_negative / total_words) * 100,
            'death_related': (total_death / total_words) * 100
        }
    
    def stylistic_fingerprint(self) -> Dict[str, float]:
        """
        Create a 'fingerprint' of the author's writing style
        using various metrics
        """
        sentiment = self.sentiment_profile()  # compute once and reuse below
        return {
            'avg_word_length': round(self.average_word_length(), 2),
            'avg_sentence_length': round(self.average_sentence_length(), 2),
            'vocabulary_richness': round(self.overall_vocabulary_richness(), 4),
            'readability': round(self.average_readability(), 2),
            'sentiment_positive_pct': round(sentiment['positive'], 3),
            'sentiment_negative_pct': round(sentiment['negative'], 3),
            'death_theme_pct': round(sentiment['death_related'], 3)
        }
    
    def find_documents_with_word(self, word: str) -> List[Document]:
        """Find all documents containing a specific word"""
        return [doc for doc in self.documents if doc.contains_word(word)]
    
    def __repr__(self):
        return f"Author('{self.name}', {len(self.documents)} documents, {self.total_words()} words)"
    
    def summary(self) -> str:
        """Detailed summary of the author's corpus"""
        fp = self.stylistic_fingerprint()
        themes = self.themes_distribution()
        top_words = self.favorite_words(5)
        
        return f"""
✍️  {self.name}
{'=' * (len(self.name) + 3)}

📚 Corpus Overview:
   • Documents: {self.document_count()}
   • Total words: {self.total_words():,}
   • Vocabulary size: {self.vocabulary_size():,} unique words
   • Average document length: {self.average_words_per_document():.0f} words

🎨 Writing Style:
   • Average word length: {fp['avg_word_length']} characters
   • Average sentence length: {fp['avg_sentence_length']:.1f} words
   • Vocabulary richness: {fp['vocabulary_richness']:.4f}
   • Readability score: {fp['readability']:.1f}

💭 Themes: {', '.join(f'{k} ({v})' for k, v in themes.items())}

🔝 Favorite words: {', '.join(w for w, _ in top_words)}

😊 Sentiment: {fp['sentiment_positive_pct']:.1f}% positive, {fp['sentiment_negative_pct']:.1f}% negative
💀 Death references: {fp['death_theme_pct']:.1f}% of words
"""

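`Author` aggregates in two different ways: `average_word_length` averages the per-document averages, so every document counts equally, while `overall_vocabulary_richness` pools all words before dividing. The two approaches can disagree when documents differ in length. The sketch below shows the gap on made-up word lists; the function names are hypothetical and exist only for this illustration.

```python
from typing import List


def mean_of_document_means(docs: List[List[str]]) -> float:
    """Average each document's mean word length; every document weighs the same."""
    if not docs:
        return 0.0
    per_doc = [sum(len(w) for w in doc) / len(doc) if doc else 0.0 for doc in docs]
    return sum(per_doc) / len(docs)


def pooled_mean(docs: List[List[str]]) -> float:
    """Mean word length over all words; longer documents weigh more."""
    words = [w for doc in docs for w in doc]
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


docs = [
    ["a", "b", "c", "d"],  # four 1-letter words, mean 1.0
    ["abcdefg"],           # one 7-letter word, mean 7.0
]
print(mean_of_document_means(docs))  # (1.0 + 7.0) / 2 = 4.0
print(pooled_mean(docs))             # (4 + 7) / 5 = 2.2
```

Neither choice is wrong; the mean of means keeps one long excerpt from dominating an author's profile, while the pooled figure describes the text as a whole.
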
## 📚 Corpus Class

Manages a collection of documents from multiple authors and provides comparative analysis
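
One comparative measure the class offers is TF-IDF, which weights a word by how often it occurs in a document and how rare it is across the corpus. The sketch below is a standalone illustration, not the class's own method. It uses the same inverse document frequency, log(N / df), and assumes that term frequency is the word's count divided by the document's length, which is one common convention. The `docs` mapping and the function name are made up for the example.

```python
import math
from typing import Dict, List


def tf_idf(word: str, docs: Dict[str, List[str]]) -> Dict[str, float]:
    """Score `word` in every document that contains it."""
    word = word.lower()
    # Document frequency: number of documents containing the word
    df = sum(1 for words in docs.values() if word in words)
    if df == 0:
        return {}
    idf = math.log(len(docs) / df)  # 0.0 when the word is in every document
    scores = {}
    for title, words in docs.items():
        if word in words:
            tf = words.count(word) / len(words)  # assumed TF definition
            scores[title] = tf * idf
    return scores


docs = {
    "A": ["war", "peace", "war"],
    "B": ["peace"],
    "C": ["love"],
}
print(tf_idf("war", docs))    # only "A" contains it: (2/3) * log(3)
print(tf_idf("peace", docs))  # found in "A" and "B": idf = log(3/2)
print(tf_idf("hope", docs))   # {} because no document contains it
```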

In [None]:
class Corpus:
    """A collection of documents from multiple authors with analysis capabilities"""
    
    def __init__(self, name: str = "Literary Corpus"):
        self.name = name
        self.documents: List[Document] = []
        self.authors: Dict[str, Author] = {}
    
    def add_document(self, document: Document):
        """Add a document to the corpus"""
        self.documents.append(document)
        
        # Add to author's collection
        if document.author not in self.authors:
            self.authors[document.author] = Author(document.author)
        self.authors[document.author].add_document(document)
    
    def load_from_json(self, filepath: str):
        """Load documents from a JSON file"""
        with open(filepath, 'r', encoding='utf-8') as f:
            data = json.load(f)
        
        for item in data:
            doc = Document(
                author=item['author'],
                title=item['title'],
                text=item['text'],
                theme=item.get('theme'),
                year=item.get('year')
            )
            self.add_document(doc)
    
    def get_author(self, author_name: str) -> Author:
        """Get Author object by name"""
        return self.authors.get(author_name)
    
    def author_names(self) -> List[str]:
        """List of all author names in corpus"""
        return sorted(self.authors.keys())
    
    def document_count(self) -> int:
        """Total number of documents"""
        return len(self.documents)
    
    def total_words(self) -> int:
        """Total words in entire corpus"""
        return sum(doc.word_count() for doc in self.documents)
    
    def compare_authors(self, metric: str = 'avg_word_length') -> Dict[str, float]:
        """
        Compare authors on a specific metric.
        Available metrics: avg_word_length, avg_sentence_length, 
                          vocabulary_richness, readability
        """
        result = {}
        for author_name, author in self.authors.items():
            fp = author.stylistic_fingerprint()
            if metric in fp:
                result[author_name] = fp[metric]
        return dict(sorted(result.items(), key=lambda x: x[1], reverse=True))
    
    def find_documents_by_theme(self, theme: str) -> List[Document]:
        """Find all documents with a specific theme"""
        return [doc for doc in self.documents if doc.theme == theme]
    
    def themes_in_corpus(self) -> List[str]:
        """Get all unique themes in the corpus"""
        themes = set(doc.theme for doc in self.documents if doc.theme)
        return sorted(themes)
    
    def search(self, query: str) -> List[Document]:
        """Search for documents whose text contains the query (case-insensitive substring match)"""
        query = query.lower()
        return [doc for doc in self.documents if query in doc.text.lower()]
    
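    def tf_idf(self, word: str) -> Dict[str, float]:
        """
        Calculate TF-IDF (Term Frequency-Inverse Document Frequency) for a word.
        Returns a dict mapping document title to TF-IDF score for each document
        containing the word, highest score first. A word found in every document
        scores 0 because log(N / N) = 0.
        """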
        word = word.lower()
        n_documents = len(self.documents)
        
        # Document frequency (how many documents contain the word)
        df = sum(1 for doc in self.documents if doc.contains_word(word))
        
        if df == 0:
            return {}
        
        # Inverse document frequency
        idf = math.log(n_documents / df)
        
        # Calculate TF-IDF for each document
        results = {}
        for doc in self.documents:
            if doc.contains_word(word):
                # Term frequency in this document
                tf = doc.get_words().count(word) / doc.word_count()
                tfidf = tf * idf
                results[doc.title] = round(tfidf, 4)
        
        return dict(sorted(results.items(), key=lambda x: x[1], reverse=True))
    
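    def most_distinctive_words_per_author(self, top_n: int = 5) -> Dict[str, List[str]]:
        """
        Find words that are distinctive to each author, using a simple
        document-frequency ratio in the spirit of TF-IDF.
        Candidates are the author's 30 most frequent words; a word scores high
        when many of the author's documents contain it and few documents by
        other authors do.
        """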
        result = {}
        
        for author_name, author in self.authors.items():
            # Get author's favorite words
            author_words = [w for w, _ in author.favorite_words(30)]
            
            # Calculate how distinctive each word is
            distinctiveness = {}
            for word in author_words:
                # How often does this author use it vs others?
                author_freq = sum(1 for doc in author.documents if doc.contains_word(word))
                other_freq = sum(1 for doc in self.documents 
                               if doc.author != author_name and doc.contains_word(word))
                
                # Simple distinctiveness score
                if author_freq > 0:
                    score = author_freq / (other_freq + 1)  # +1 to avoid division by zero
                    distinctiveness[word] = score
            
            # Get top N most distinctive
            top_words = sorted(distinctiveness.items(), key=lambda x: x[1], reverse=True)[:top_n]
            result[author_name] = [word for word, _ in top_words]
        
        return result
    
    def author_similarity(self, author1: str, author2: str) -> float:
        """
        Calculate similarity between two authors based on shared vocabulary.
        Returns a score between 0 and 1 (1 = identical vocabulary overlap).
        """
        if author1 not in self.authors or author2 not in self.authors:
            return 0.0
        
        # Get signature words for each author
        words1 = set(self.authors[author1].signature_words(min_frequency=2))
        words2 = set(self.authors[author2].signature_words(min_frequency=2))
        
        if not words1 or not words2:
            return 0.0
        
        # Jaccard similarity
        intersection = len(words1 & words2)
        union = len(words1 | words2)
        
        return round(intersection / union, 3) if union > 0 else 0.0
    
    def chronological_documents(self) -> List[Document]:
        """Return documents sorted by year (documents without a year are omitted)"""
        return sorted([doc for doc in self.documents if doc.year], 
                     key=lambda x: x.year)
    
    def __repr__(self):
        return f"Corpus('{self.name}', {len(self.authors)} authors, {len(self.documents)} documents)"
    
    def summary(self) -> str:
        """Overview of the entire corpus"""
        return f"""
📚 {self.name}
{'=' * (len(self.name) + 3)}

📊 Corpus Statistics:
   • Authors: {len(self.authors)}
   • Documents: {len(self.documents)}
   • Total words: {self.total_words():,}
   • Themes: {len(self.themes_in_corpus())}

✍️  Authors: {', '.join(self.author_names())}

🏷️  Themes: {', '.join(self.themes_in_corpus())}
"""

---

# 🔬 Analysis Examples

Let's put our classes to work!

## 1️⃣ Load the Data
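
`Corpus.load_from_json` expects each file to hold a JSON list of objects with `author`, `title`, and `text` keys, plus optional `theme` and `year`. A minimal sketch of that layout follows. The entry and the temporary file are placeholders invented for illustration and are not taken from the real data files.

```python
import json
import os
import tempfile

# Placeholder entry: the field names match what load_from_json reads,
# but every value here is invented for illustration.
sample = [
    {
        "author": "Example Author",
        "title": "Example Title",
        "text": "Placeholder text for the excerpt.",
        "theme": "example-theme",  # optional, read with item.get('theme')
        "year": 1900,              # optional, read with item.get('year')
    }
]

# Write the list to a temporary file, then read it back the same way
# load_from_json does.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=2)
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

# Check that every entry carries the keys load_from_json indexes directly.
required = {"author", "title", "text"}
missing = [required - set(item) for item in loaded]
print(missing)
```

An entry that lacks one of the three required keys would raise a `KeyError` in `load_from_json`, while a missing `theme` or `year` simply becomes `None`.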

In [None]:
# Create a corpus
corpus = Corpus("Great 19th and 20th Century Writers")

# Load all three authors
corpus.load_from_json('data/dostoevsky.json')
corpus.load_from_json('data/camus.json')
corpus.load_from_json('data/remarque.json')

print(corpus.summary())

## 2️⃣ Analyze a Single Document

In [None]:
# Get the first loaded document (Dostoevsky's file was loaded first)
doc = corpus.documents[0]

print(doc.summary())

## 3️⃣ Compare Writing Styles of Authors

In [None]:
# Print summary for each author
for author_name in corpus.author_names():
    author = corpus.get_author(author_name)
    print(author.summary())
    print("\n" + "="*80 + "\n")

## 4️⃣ Compare Specific Metrics

In [None]:
print("📊 AVERAGE WORD LENGTH (who uses longer words?)")
for author, length in corpus.compare_authors('avg_word_length').items():
    print(f"   {author:30} {length:.2f} characters")

print("\n📏 SENTENCE LENGTH (who writes longer sentences?)")
for author, length in corpus.compare_authors('avg_sentence_length').items():
    print(f"   {author:30} {length:.1f} words")

print("\n📖 VOCABULARY RICHNESS (who uses more varied vocabulary?)")
for author, richness in corpus.compare_authors('vocabulary_richness').items():
    print(f"   {author:30} {richness:.4f}")

print("\n📝 READABILITY (Flesch score: higher = easier to read)")
for author, score in corpus.compare_authors('readability').items():
    print(f"   {author:30} {score:.1f}")

## 5️⃣ Find Distinctive Words for Each Author
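
`most_distinctive_words_per_author` scores a word by dividing the number of the author's documents that contain it by one plus the number of other authors' documents that contain it. Below is a standalone sketch of that ratio on toy data. The word sets are invented placeholders, and `distinctiveness` is a hypothetical helper rather than part of the classes above.

```python
from typing import Dict, List, Set


def distinctiveness(word: str,
                    own_docs: List[Set[str]],
                    other_docs: List[Set[str]]) -> float:
    """Ratio of own-document hits to (other-document hits + 1)."""
    own = sum(1 for words in own_docs if word in words)
    other = sum(1 for words in other_docs if word in words)
    return own / (other + 1)  # +1 avoids division by zero


# Each set stands for the words of one document (toy data).
author_a = [{"guilt", "soul", "night"}, {"guilt", "city"}]
author_b = [{"sun", "sea", "night"}, {"sun", "city"}]

scores: Dict[str, float] = {
    w: distinctiveness(w, author_a, author_b)
    for w in ["guilt", "night", "city"]
}
print(scores)
```

A word found in all of an author's documents and in none of the others gets the highest score, while words shared with other authors are pushed down.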

In [None]:
print("🎯 DISTINCTIVE WORDS (words characteristic of each author):\n")

distinctive = corpus.most_distinctive_words_per_author(top_n=8)
for author, words in distinctive.items():
    print(f"{author}:")
    print(f"   {', '.join(words)}")
    print()

## 6️⃣ TF-IDF Analysis

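Rank documents by how characteristic a given word is of them: the TF-IDF score is high when the word is frequent in a document but appears in few documents overall.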
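
`Corpus.tf_idf` multiplies a word's term frequency in a document (count divided by total words) by the inverse document frequency log(N / df), where N is the number of documents and df is the number that contain the word. The sketch below repeats that arithmetic on three toy documents. The documents are invented for illustration, and `toy_tf_idf` is a hypothetical helper that bypasses the `Document` class.

```python
import math
from typing import Dict, List


def toy_tf_idf(word: str, docs: Dict[str, List[str]]) -> Dict[str, float]:
    """TF-IDF of `word` for every document that contains it."""
    n_documents = len(docs)
    df = sum(1 for words in docs.values() if word in words)
    if df == 0:
        return {}
    idf = math.log(n_documents / df)  # natural log, as in Corpus.tf_idf
    return {
        title: round(words.count(word) / len(words) * idf, 4)
        for title, words in docs.items()
        if word in words
    }


toy_docs = {
    "doc_a": ["death", "comes", "for", "everyone"],
    "doc_b": ["the", "sea", "was", "calm"],
    "doc_c": ["death", "and", "more", "death", "again", "today"],
}

print(toy_tf_idf("death", toy_docs))    # appears in two of three documents
print(toy_tf_idf("sea", toy_docs))      # appears in one document only
print(toy_tf_idf("missing", toy_docs))  # appears nowhere
```

Because the idf factor is log(N / df), a word that occurs in every document scores 0 everywhere, so TF-IDF only separates documents for words that are unevenly spread across the corpus.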

In [None]:
# Analyze the word "death"
print("💀 TF-IDF for 'death' (which documents focus most on death?):\n")
death_tfidf = corpus.tf_idf('death')
for doc_title, score in list(death_tfidf.items())[:5]:
    print(f"   {score:.4f} - {doc_title}")

print("\n" + "="*80)

# Analyze the word "freedom"
print("\n🕊️ TF-IDF for 'freedom' (which documents focus most on freedom?):\n")
freedom_tfidf = corpus.tf_idf('freedom')
for doc_title, score in list(freedom_tfidf.items())[:5]:
    print(f"   {score:.4f} - {doc_title}")

## 7️⃣ Author Similarity Analysis
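
`author_similarity` compares two authors' signature-word sets with the Jaccard index: the size of the intersection divided by the size of the union. Here is a small sketch on toy sets. The words are placeholders, and `jaccard` is a hypothetical standalone helper.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |a & b| / |a | b|, or 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return round(len(a & b) / len(a | b), 3)


words_x = {"war", "front", "night", "silence"}
words_y = {"sun", "night", "silence", "sea", "stone"}

print(jaccard(words_x, words_y))  # 2 shared words out of 7 distinct words
print(jaccard(words_x, words_x))  # identical sets
print(jaccard(words_x, set()))    # one empty set
```

The score depends only on which words appear in each set, not on how often they are used, so two authors with very different word frequencies can still look similar under this measure.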

In [None]:
print("🤝 AUTHOR SIMILARITY (based on vocabulary overlap):\n")

authors = corpus.author_names()
for i, author1 in enumerate(authors):
    for author2 in authors[i+1:]:
        similarity = corpus.author_similarity(author1, author2)
        print(f"   {author1} ↔️ {author2}: {similarity:.3f}")

## 8️⃣ Search for Themes

In [None]:
print("🏷️  ALL THEMES IN CORPUS:")
for theme in corpus.themes_in_corpus():
    docs = corpus.find_documents_by_theme(theme)
    print(f"\n   {theme} ({len(docs)} documents):")
    for doc in docs:
        print(f"      • {doc.title}")

## 9️⃣ Sentiment Analysis

In [None]:
print("😊😢 SENTIMENT PROFILES BY AUTHOR:\n")

for author_name in corpus.author_names():
    author = corpus.get_author(author_name)
    sentiment = author.sentiment_profile()
    
    print(f"{author_name}:")
    print(f"   Positive: {sentiment['positive']:.2f}%")
    print(f"   Negative: {sentiment['negative']:.2f}%")
    print(f"   Death-related: {sentiment['death_related']:.2f}%")
    print()

## 🔟 Custom Analysis: Find Long Words

In [None]:
# Who uses the longest, most complex words?
print("📏 LONGEST WORDS (9+ characters) BY AUTHOR:\n")

for author_name in corpus.author_names():
    author = corpus.get_author(author_name)
    all_long_words = set()
    
    for doc in author.documents:
        all_long_words.update(doc.long_words(min_length=9))
    
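    # Sort by length (longest first), then alphabetically, so the sample shows the longest words
    longest = sorted(all_long_words, key=lambda w: (-len(w), w))[:10]
    print(f"{author_name} ({len(all_long_words)} unique long words):")
    print(f"   {', '.join(longest)}")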
    print()

---

# 🎯 Exercises for Students

1. **Add a new author**: Create a JSON file for another author (e.g., Tolstoy, Kafka, Hemingway) and load it into the corpus

2. **Extend Document class**: Add a method to find the most common word pairs (bigrams)

3. **Create a ThemeAnalyzer class**: A class that focuses specifically on theme-based analysis

4. **Visualization**: Use matplotlib to create bar charts comparing authors on different metrics

5. **Advanced sentiment**: Expand the sentiment analysis with more emotion categories

6. **Word cloud**: Generate word clouds for each author showing their most frequent words

7. **Comparison method**: Add a method to the `Author` class that compares the author with another author

8. **Time analysis**: Analyze how writing style changed over time (using the year field)

9. **Export functionality**: Add methods to export analysis results to CSV or JSON

10. **Inheritance practice**: Create specialized subclasses like `NovelExcerpt`, `PhilosophicalText`, etc. that inherit from Document
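
As a possible starting point for exercise 2, bigrams can be counted by pairing each word with its successor. The sketch below works on a plain word list such as the one `Document.get_words()` returns. `top_bigrams` is a hypothetical helper name, and turning it into a `Document` method is left as the exercise.

```python
from collections import Counter
from typing import List, Tuple


def top_bigrams(words: List[str], n: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """Return the n most common adjacent word pairs with their counts."""
    pairs = zip(words, words[1:])  # (w0, w1), (w1, w2), ...
    return Counter(pairs).most_common(n)


# Toy word list, invented for illustration.
sample_words = "the sea and the sky and the sea".split()
print(top_bigrams(sample_words, n=2))
```

Whether stop words should be filtered out before pairing is a design choice worth experimenting with, since pairs such as ("and", "the") tend to dominate otherwise.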