# NLTK Complete Guide - Section 15: Corpus Management

## What is a Corpus?

A **corpus** (plural: corpora) is a large, structured collection of text used for linguistic research and NLP tasks. Think of it as a dataset specifically designed for text analysis.

### Why Corpora Matter

| Purpose | Description |
|---------|-------------|
| **Training Data** | Machine learning models need labeled text |
| **Linguistic Research** | Study language patterns, frequency, usage |
| **Benchmarking** | Standard datasets for comparing algorithms |
| **Vocabulary Building** | Create word lists, dictionaries |
| **Domain Adaptation** | Customize NLP tools for specific fields |

### Types of Corpora

| Type | Description | Example |
|------|-------------|---------|
| **Raw Text** | Plain text without annotations | Project Gutenberg books |
| **Categorized** | Text organized by topic/category | Brown Corpus (news, fiction, etc.) |
| **Tagged** | Words labeled with POS tags | Penn Treebank |
| **Annotated** | Multiple annotation layers | Named entities, syntax trees |
| **Parallel** | Same text in multiple languages | Europarl (EU proceedings) |

### What This Notebook Covers

| Topic | Description |
|-------|-------------|
| **Built-in Corpora** | Explore NLTK's rich corpus collection |
| **Corpus Readers** | Access methods for different corpus types |
| **Custom Corpora** | Create your own corpus from files |
| **Categorization** | Organize documents by category |
| **Statistics** | Analyze corpus properties |

In [1]:
import nltk
import os

nltk.download('gutenberg', quiet=True)
nltk.download('brown', quiet=True)
nltk.download('reuters', quiet=True)
nltk.download('inaugural', quiet=True)
nltk.download('webtext', quiet=True)
nltk.download('nps_chat', quiet=True)
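nltk.download('treebank', quiet=True)
# sents() on plain-text corpora (Gutenberg, Inaugural, custom readers) relies on the
# Punkt sentence tokenizer: 'punkt_tab' in recent NLTK releases, 'punkt' in older ones
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)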

from nltk.corpus import gutenberg, brown, reuters, inaugural, webtext
from nltk.corpus import PlaintextCorpusReader, TaggedCorpusReader
from nltk.tokenize import word_tokenize

### Understanding the Imports

| Import | Purpose |
|--------|---------|
| `gutenberg` | Classic literature from Project Gutenberg |
| `brown` | Categorized text from various genres |
| `reuters` | News articles with topic categories |
| `inaugural` | US Presidential inaugural addresses |
| `webtext` | Internet text (forums, reviews, etc.) |
| `PlaintextCorpusReader` | Reader for your own text files |
| `TaggedCorpusReader` | Reader for POS-tagged text |
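
`TaggedCorpusReader` is easiest to understand on a tiny file. The following is a minimal sketch, assuming a hypothetical `sample.pos` file that stores one sentence per line with tokens written as `word/TAG`; the file name and sentences are made up for illustration.

```python
import os
import tempfile

from nltk.corpus import TaggedCorpusReader

# Hypothetical sample data: one sentence per line, tokens written as word/TAG
sample = "The/AT dog/NN barked/VBD ./.\nIt/PPS ran/VBD away/RB ./.\n"

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "sample.pos"), "w", encoding="utf-8") as f:
        f.write(sample)

    reader = TaggedCorpusReader(root, r".*\.pos")
    tagged = list(reader.tagged_words())   # (word, tag) pairs
    plain = list(reader.words())           # same tokens with the tags stripped
    n_sents = len(reader.tagged_sents())   # one sentence per line

print(tagged[:3])
print(plain[:4])
print(n_sents)
```

The reader splits each token on the `/` separator by default, so the same file serves both tagged and untagged access.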

## 15.1 Built-in Corpora Overview

NLTK provides access to **over 100 corpora** covering various languages, domains, and annotation types, each downloaded on demand with `nltk.download()`. They are useful for:
- Learning NLP techniques
- Prototyping and testing
- Benchmarking algorithms

### Corpus Access Pattern

Most NLTK text corpora follow a consistent access pattern (lexical resources such as `wordnet` and `stopwords` expose their own interfaces):

```python
from nltk.corpus import corpus_name

# Common methods available:
corpus_name.fileids()      # List of files
corpus_name.raw()          # Raw text as string
corpus_name.words()        # List of words
corpus_name.sents()        # List of sentences
corpus_name.categories()   # Categories (if applicable)
```

In [2]:
# Survey of popular NLTK corpora and their uses
corpora_info = {
    'gutenberg': ('Classic literature (18 texts)', 'Language modeling, style analysis'),
    'brown': ('Categorized American English', 'POS tagging, genre classification'),
    'reuters': ('News articles (10,788 docs)', 'Multi-label classification'),
    'inaugural': ('US Presidential speeches', 'Historical language analysis'),
    'webtext': ('Web and chat text', 'Informal language processing'),
    'treebank': ('Parsed Wall Street Journal', 'Syntactic parsing'),
    'movie_reviews': ('2000 movie reviews', 'Sentiment analysis'),
    'stopwords': ('Stop words (multiple languages)', 'Text preprocessing'),
    'wordnet': ('Lexical database', 'Word meanings, synonyms'),
    'names': ('~8,000 male/female names', 'Name classification'),
}

print("POPULAR NLTK CORPORA")
print("=" * 75)
print(f"{'Corpus':<15} {'Description':<35} {'Common Use'}")
print("-" * 75)

for corpus, (description, use) in corpora_info.items():
    print(f"{corpus:<15} {description:<35} {use}")

print("\n💡 All corpora can be downloaded with: nltk.download('corpus_name')")

POPULAR NLTK CORPORA
Corpus          Description                         Common Use
---------------------------------------------------------------------------
gutenberg       Classic literature (18 texts)       Language modeling, style analysis
brown           Categorized American English        POS tagging, genre classification
reuters         News articles (10,788 docs)         Multi-label classification
inaugural       US Presidential speeches            Historical language analysis
webtext         Web and chat text                   Informal language processing
treebank        Parsed Wall Street Journal          Syntactic parsing
movie_reviews   2000 movie reviews                  Sentiment analysis
stopwords       Stop words (multiple languages)     Text preprocessing
wordnet         Lexical database                    Word meanings, synonyms
names           ~8,000 male/female names            Name classification

💡 All corpora can be downloaded with: nltk.download('corpus_name')

## 15.2 Gutenberg Corpus

The **Gutenberg Corpus** contains 18 classic literary works from Project Gutenberg, including:
- Jane Austen novels
- Shakespeare plays
- Milton's Paradise Lost
- The Bible (King James Version)

### Why Use Gutenberg?

- **Clean text** - Well-formatted literary prose
- **Public domain** - No copyright issues
- **Diverse styles** - Different authors and time periods
- **Long texts** - Substantial content for analysis

In [3]:
# Explore the Gutenberg corpus
print("GUTENBERG CORPUS FILES")
print("=" * 55)
print(f"{'File':<32} {'Words':>10} {'Chars':>12}")
print("-" * 55)

total_words = 0
for fileid in gutenberg.fileids():
    words = len(gutenberg.words(fileid))
    chars = len(gutenberg.raw(fileid))
    total_words += words
    print(f"{fileid:<32} {words:>10,} {chars:>12,}")

print("-" * 55)
print(f"{'TOTAL':<32} {total_words:>10,}")

print(f"""
💡 Corpus Statistics:
   • {len(gutenberg.fileids())} classic literary works
   • Over {total_words:,} words total
   • Spans multiple centuries of English
""")

GUTENBERG CORPUS FILES
File                                  Words        Chars
-------------------------------------------------------
austen-emma.txt                     192,427      887,071
austen-persuasion.txt                98,171      466,292
austen-sense.txt                    141,576      673,022
bible-kjv.txt                     1,010,654    4,332,554
blake-poems.txt                       8,354       38,153
bryant-stories.txt                   55,563      249,439
burgess-busterbrown.txt              18,963       84,663
carroll-alice.txt                    34,110      144,395
chesterton-ball.txt                  96,996      457,450
chesterton-brown.txt                 86,063      406,629
chesterton-thursday.txt              69,213      320,525
edgeworth-parents.txt               210,663      935,158
melville-moby_dick.txt              260,819    1,242,990
milton-paradise.txt                  96,825      468,220
shakespeare-caesar.txt               25,833      112,310
shakespea

In [4]:
# Demonstrating different access methods for a corpus
fileid = 'austen-emma.txt'

print(f"ACCESSING '{fileid}'")
print("=" * 60)

# METHOD 1: raw() - Get raw text as a single string
raw = gutenberg.raw(fileid)
print(f"\n📄 raw() - Returns raw text string:")
print(f"   Length: {len(raw):,} characters")
print(f"   Preview: \"{raw[:100]}...\"")

# METHOD 2: words() - Get list of words (tokens)
words = gutenberg.words(fileid)
print(f"\n📝 words() - Returns list of word tokens:")
print(f"   Count: {len(words):,} words")
print(f"   First 10: {list(words[:10])}")

# METHOD 3: sents() - Get list of sentences (each is a list of words)
sents = gutenberg.sents(fileid)
print(f"\n📖 sents() - Returns list of sentences:")
print(f"   Count: {len(sents):,} sentences")
print(f"   First sentence: {' '.join(sents[0])}")

print(f"""
💡 When to use which method:
   • raw()   → Need original formatting, regex searching
   • words() → Word frequency, vocabulary analysis
   • sents() → Sentence-level analysis, n-grams
""")

ACCESSING 'austen-emma.txt'

📄 raw() - Returns raw text string:
   Length: 887,071 characters
   Preview: "[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a..."

📝 words() - Returns list of word tokens:
   Count: 192,427 words
   First 10: ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER']

📖 sents() - Returns list of sentences:
   Count: 7,752 sentences
   First sentence: [ Emma by Jane Austen 1816 ]

💡 When to use which method:
   • raw()   → Need original formatting, regex searching
   • words() → Word frequency, vocabulary analysis
   • sents() → Sentence-level analysis, n-grams
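
The three access methods above suit different tasks. The sketch below illustrates one task per method using only the standard library; the `raw`, `words` and `sents` values are small made-up stand-ins in the shapes the reader methods return, not data taken from an NLTK corpus.

```python
import re
from collections import Counter

# Stand-in data in the shapes raw(), words() and sents() return (made-up sample)
raw = "Emma was happy. Emma was clever, and Emma was rich."
words = ["Emma", "was", "happy", ".", "Emma", "was", "clever", ",",
         "and", "Emma", "was", "rich", "."]
sents = [["Emma", "was", "happy", "."],
         ["Emma", "was", "clever", ",", "and", "Emma", "was", "rich", "."]]

# raw(): pattern matching over the untokenized string
after_was = re.findall(r"\bwas (\w+)", raw)

# words(): frequency counts over alphabetic tokens
freq = Counter(w.lower() for w in words if w.isalpha())

# sents(): bigrams built per sentence, so no pair spans a sentence boundary
bigrams = Counter(pair for sent in sents for pair in zip(sent, sent[1:]))

print(after_was)
print(freq.most_common(2))
print(bigrams[("Emma", "was")])
```

Building n-grams from `sents()` rather than `words()` keeps pairs such as `('.', 'Emma')` out of the counts.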



## 15.3 Brown Corpus (Categorized Text)

The **Brown Corpus**, compiled at Brown University in the 1960s, was the first million-word electronic corpus of English. It samples American English published in 1961.

### Key Features

| Feature | Description |
|---------|-------------|
| **Size** | ~1 million words |
| **Categories** | 15 genres (news, fiction, academic, etc.) |
| **POS Tagged** | Every word has a part-of-speech tag |
| **Balanced** | Systematic sampling across genres |

### Categories in Brown

| Category | Description | Examples |
|----------|-------------|----------|
| `adventure` | Adventure and western fiction | Action novels, westerns |
| `belles_lettres` | Literary nonfiction | Essays, biography, memoirs |
| `editorial` | Opinion pieces | Newspaper editorials |
| `fiction` | General fiction | Novels, short stories |
| `government` | Official documents | Reports, regulations |
| `hobbies` | Special interest | Crafts, collecting |
| `humor` | Comedic writing | Satire, jokes |
| `learned` | Academic writing | Science, humanities |
| `lore` | Popular lore | General-interest nonfiction |
| `mystery` | Detective fiction | Crime novels |
| `news` | Newspaper text | Press reporting |
| `religion` | Religious writing | Sermons, theology |
| `reviews` | Press reviews | Theatre, books, music, dance |
| `romance` | Love stories | Romantic fiction |
| `science_fiction` | Sci-fi stories | Space, future |

In [5]:
# Explore Brown corpus categories
print("BROWN CORPUS CATEGORIES")
print("=" * 60)
print(f"{'Category':<20} {'Files':>8} {'Words':>12} {'Avg Words/File':>15}")
print("-" * 60)

for cat in brown.categories():
    files = len(brown.fileids(categories=cat))
    words = len(brown.words(categories=cat))
    avg = words // files
    print(f"{cat:<20} {files:>8} {words:>12,} {avg:>15,}")

total_files = len(brown.fileids())
total_words = len(brown.words())
print("-" * 60)
print(f"{'TOTAL':<20} {total_files:>8} {total_words:>12,}")

print(f"""
💡 The Brown Corpus is perfect for:
   • Studying differences between genres
   • Training POS taggers
   • Building language models
""")

BROWN CORPUS CATEGORIES
Category                Files        Words  Avg Words/File
------------------------------------------------------------
adventure                  29       69,342           2,391
belles_lettres             75      173,096           2,307
editorial                  27       61,604           2,281
fiction                    29       68,488           2,361
government                 30       70,117           2,337
hobbies                    36       82,345           2,287
humor                       9       21,695           2,410
learned                    80      181,888           2,273
lore                       48      110,299           2,297
mystery                    24       57,169           2,382
news                       44      100,554           2,285
religion                   17       39,399           2,317
reviews                    17       40,704           2,394
romance                    29       70,022           2,414
science_fiction             6 

In [6]:
# Accessing text by category
print("ACCESSING TEXT BY CATEGORY")
print("=" * 55)

# Single category
news_words = brown.words(categories='news')
fiction_words = brown.words(categories='fiction')

print(f"\n📰 News category:")
print(f"   Words: {len(news_words):,}")
print(f"   Sample: {' '.join(news_words[:15])}...")

print(f"\n📚 Fiction category:")
print(f"   Words: {len(fiction_words):,}")
print(f"   Sample: {' '.join(fiction_words[:15])}...")

# Multiple categories combined
print(f"\n🔀 Combining categories:")
multi_words = brown.words(categories=['news', 'editorial', 'government'])
print(f"   News + Editorial + Government: {len(multi_words):,} words")

# By file ID
print(f"\n📁 By file ID:")
print(f"   All file IDs: {len(brown.fileids())} files")
print(f"   Sample: {brown.fileids()[:3]}")

ACCESSING TEXT BY CATEGORY

📰 News category:
   Words: 100,554
   Sample: The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced...

📚 Fiction category:
   Words: 68,488
   Sample: Thirty-three Scotty did not go back to school . His parents talked seriously and lengthily...

🔀 Combining categories:
   News + Editorial + Government: 232,275 words

📁 By file ID:
   All file IDs: 500 files
   Sample: ['ca01', 'ca02', 'ca03']


In [7]:
# Brown corpus is POS-tagged - each word has a part-of-speech tag
print("POS-TAGGED DATA IN BROWN CORPUS")
print("=" * 55)

# Get tagged words
tagged = brown.tagged_words(categories='news')[:15]

print(f"\nTagged words from 'news' category:")
print(f"{tagged}")

print(f"""
Format: (word, POS_tag)

Common Brown POS tags:
  NN  = Noun (singular)
  NNS = Noun (plural)
  VB  = Verb (base form)
  VBD = Verb (past tense)
  JJ  = Adjective
  RB  = Adverb
  IN  = Preposition
  AT  = Article (the, a)

💡 This pre-tagged data is valuable for:
   • Training your own POS tagger
   • Studying word usage by category
   • Grammar analysis
""")

POS-TAGGED DATA IN BROWN CORPUS

Tagged words from 'news' category:
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD')]

Format: (word, POS_tag)

Common Brown POS tags:
  NN  = Noun (singular)
  NNS = Noun (plural)
  VB  = Verb (base form)
  VBD = Verb (past tense)
  JJ  = Adjective
  RB  = Adverb
  IN  = Preposition
  AT  = Article (the, a)

💡 This pre-tagged data is valuable for:
   • Training your own POS tagger
   • Studying word usage by category
   • Grammar analysis
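
To make "training your own POS tagger" concrete, here is a minimal sketch of a most-frequent-tag baseline. It assumes training data in the `(word, tag)` shape that `brown.tagged_words()` yields; the sample pairs, the helper names `train_baseline` and `tag`, and the `NN` fallback are illustrative choices, not part of NLTK.

```python
from collections import Counter, defaultdict

# Hypothetical training pairs in the (word, tag) shape of brown.tagged_words();
# the tags follow the Brown tagset
tagged_words = [
    ("The", "AT"), ("jury", "NN"), ("said", "VBD"), ("the", "AT"),
    ("election", "NN"), ("was", "BEDZ"), ("fair", "JJ"), (".", "."),
    ("The", "AT"), ("fair", "NN"), ("opened", "VBD"), ("Friday", "NR"),
    ("and", "CC"), ("the", "AT"), ("fair", "JJ"), ("jury", "NN"),
    ("voted", "VBD"), (".", "."),
]

def train_baseline(pairs):
    """Map each lowercased word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for word, tag_ in pairs:
        counts[word.lower()][tag_] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(tokens, model, default="NN"):
    """Tag tokens by lookup, falling back to a default tag for unseen words."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]

model = train_baseline(tagged_words)
result = tag(["The", "fair", "election", "ended"], model)
print(result)
```

In this sample "fair" appears twice as `JJ` and once as `NN`, so the baseline picks `JJ`; the unseen word "ended" receives the fallback tag. Swapping the sample list for `brown.tagged_words(categories='news')` would train the same baseline on real data.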



## 15.4 Reuters Corpus (Multi-label Classification)

The **Reuters Corpus** is a collection of news articles from Reuters newswire, commonly used for text classification research.

### Key Features

| Feature | Description |
|---------|-------------|
| **Size** | 10,788 news documents |
| **Categories** | 90 topic categories |
| **Multi-label** | Documents can have multiple categories |
| **Pre-split** | Training and test sets defined |

### Multi-label vs Multi-class

| Type | Description | Example |
|------|-------------|---------|
| **Multi-class** | Each doc has ONE label | Email → spam OR not spam |
| **Multi-label** | Each doc can have MULTIPLE labels | News → "grain" AND "wheat" AND "trade" |

Reuters is multi-label: a news article about wheat exports might be tagged with both "grain" and "trade" categories.
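
Multi-label targets are commonly encoded as one 0/1 indicator per category. The sketch below shows that encoding in plain Python; the document IDs and label sets are hypothetical, written in the shape `reuters.categories(fileid)` returns, and `to_indicator` is an illustrative helper.

```python
# Hypothetical label sets in the shape reuters.categories(fileid) returns
doc_labels = {
    "doc_a": ["grain", "trade", "wheat"],
    "doc_b": ["earn"],
    "doc_c": ["grain", "corn"],
}

# Fixed label order shared by every document
label_index = sorted({label for labels in doc_labels.values() for label in labels})

def to_indicator(labels, label_index):
    """Encode a label set as a 0/1 vector, one position per known label."""
    present = set(labels)
    return [1 if label in present else 0 for label in label_index]

vectors = {doc: to_indicator(labels, label_index) for doc, labels in doc_labels.items()}

print(label_index)
for doc, vec in vectors.items():
    print(doc, vec)
```

A multi-class target would have exactly one `1` per row; here a row may contain several, which is why multi-label problems are often treated as one binary decision per category.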

In [8]:
# Explore Reuters corpus
print("REUTERS CORPUS OVERVIEW")
print("=" * 55)

print(f"""
📊 Corpus Statistics:
   Total documents: {len(reuters.fileids()):,}
   Total categories: {len(reuters.categories())}

📁 Sample categories (first 15):
   {reuters.categories()[:15]}

💡 Most popular categories deal with:
   • Commodities (grain, crude oil, coffee)
   • Finance (money, interest rates)
   • Trade (imports, exports)
""")

REUTERS CORPUS OVERVIEW

📊 Corpus Statistics:
   Total documents: 10,788
   Total categories: 90

📁 Sample categories (first 15):
   ['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil']

💡 Most popular categories deal with:
   • Commodities (grain, crude oil, coffee)
   • Finance (money, interest rates)
   • Trade (imports, exports)



In [9]:
# Multi-label: Documents can have multiple categories
sample_file = reuters.fileids()[100]  # Pick a document

print("MULTI-LABEL CLASSIFICATION EXAMPLE")
print("=" * 55)

print(f"\n📄 Document: {sample_file}")
print(f"📑 Categories: {reuters.categories(sample_file)}")
print(f"\n📝 Text preview:")
print(f"   {reuters.raw(sample_file)[:400]}...")

# Show some documents with three or more categories
print(f"\n🏷️ Examples of multi-label documents:")
multi_label_count = 0
for fileid in reuters.fileids()[:100]:
    cats = reuters.categories(fileid)
    if len(cats) >= 3:
        print(f"   {fileid}: {cats}")
        multi_label_count += 1
        if multi_label_count >= 5:
            break

MULTI-LABEL CLASSIFICATION EXAMPLE

📄 Document: test/15023
📑 Categories: ['earn']

📝 Text preview:
   CITYTRUST BANCORP INC &lt;CITR> 1ST QTR NET
  Shr 1.40 dlrs vs 1.16 dlrs
      Net 5,776,000 vs 4,429,000
      Avg shrs 4,132,828 vs 3,834,117
  

...

🏷️ Examples of multi-label documents:
   test/14832: ['corn', 'grain', 'rice', 'rubber', 'sugar', 'tin', 'trade']
   test/14840: ['coffee', 'lumber', 'palm-oil', 'rubber', 'veg-oil']
   test/14858: ['carcass', 'corn', 'grain', 'livestock', 'oilseed', 'rice', 'soybean', 'trade']
   test/14892: ['oilseed', 'palm-oil', 'soy-oil', 'soybean', 'veg-oil']
   test/14913: ['dlr', 'money-fx', 'yen']


In [10]:
# Reuters has a built-in train/test split
# File IDs starting with 'training/' are for training
# File IDs starting with 'test/' are for testing

train_files = [f for f in reuters.fileids() if f.startswith('training/')]
test_files = [f for f in reuters.fileids() if f.startswith('test/')]

print("BUILT-IN TRAIN/TEST SPLIT")
print("=" * 55)
print(f"""
📊 Split Statistics:
   Training documents: {len(train_files):,}
   Test documents:     {len(test_files):,}
   Split ratio:        {len(train_files)/len(reuters.fileids())*100:.1f}% / {len(test_files)/len(reuters.fileids())*100:.1f}%

💡 Why built-in splits matter:
   • Ensures fair comparison between algorithms
   • Standard benchmark for research papers
   • No need to create your own split

📁 Sample training file: {train_files[0]}
📁 Sample test file:     {test_files[0]}
""")

BUILT-IN TRAIN/TEST SPLIT

📊 Split Statistics:
   Training documents: 7,769
   Test documents:     3,019
   Split ratio:        72.0% / 28.0%

💡 Why built-in splits matter:
   • Ensures fair comparison between algorithms
   • Standard benchmark for research papers
   • No need to create your own split

📁 Sample training file: training/1
📁 Sample test file:     test/14826



## 15.5 Creating Your Own Custom Corpus

NLTK isn't limited to built-in corpora. You can create corpus readers for your own text collections!

### PlaintextCorpusReader

The simplest way to create a custom corpus:

```python
from nltk.corpus import PlaintextCorpusReader

# Point to a directory containing .txt files
corpus = PlaintextCorpusReader('./my_texts/', r'.*\.txt')

# Now use standard corpus methods
corpus.fileids()
corpus.words()
corpus.sents()
```

### Why Use Corpus Readers?

| Benefit | Description |
|---------|-------------|
| **Consistent API** | Same methods as built-in corpora |
| **Lazy loading** | Files loaded only when needed |
| **Automatic tokenization** | Words/sentences extracted automatically |
| **Scalable** | Handle large document collections |

In [11]:
# Create a sample corpus directory with text files

corpus_dir = './my_corpus'
os.makedirs(corpus_dir, exist_ok=True)

# Create sample documents on different topics
texts = {
    'doc1.txt': """Natural language processing is a field of computer science.
It deals with the interaction between computers and humans using natural language.
NLP combines computational linguistics with machine learning and deep learning.
Applications include translation, sentiment analysis, and chatbots.""",
    
    'doc2.txt': """Machine learning is transforming how we build software systems.
Deep learning models can understand complex patterns in data.
Neural networks have achieved remarkable results in image and speech recognition.
AI is becoming more accessible to developers through open-source libraries.""",
    
    'doc3.txt': """Python is a popular programming language for data science.
It is widely used in scientific computing and web development.
Python has a rich ecosystem of libraries like NumPy, Pandas, and scikit-learn.
Its simple syntax makes it ideal for beginners and experts alike.""",
}

# Write files
for filename, content in texts.items():
    filepath = os.path.join(corpus_dir, filename)
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(content)

print("CUSTOM CORPUS CREATED")
print("=" * 55)
print(f"📁 Directory: {corpus_dir}")
print(f"📄 Files created: {len(texts)}")
for filename in texts.keys():
    print(f"   • {filename}")

CUSTOM CORPUS CREATED
📁 Directory: ./my_corpus
📄 Files created: 3
   • doc1.txt
   • doc2.txt
   • doc3.txt


In [12]:
# Load the custom corpus with PlaintextCorpusReader
# Parameters:
#   - root: directory containing files
#   - fileids: regex pattern to match files

my_corpus = PlaintextCorpusReader(
    corpus_dir,           # Directory path
    r'.*\.txt'           # Match all .txt files
)

print("LOADING CUSTOM CORPUS")
print("=" * 55)
print(f"""
📚 Corpus loaded successfully!

📁 Files: {my_corpus.fileids()}

💡 Now you can use all standard corpus methods:
   • my_corpus.raw()    - Get raw text
   • my_corpus.words()  - Get word tokens
   • my_corpus.sents()  - Get sentences
""")

LOADING CUSTOM CORPUS

📚 Corpus loaded successfully!

📁 Files: ['doc1.txt', 'doc2.txt', 'doc3.txt']

💡 Now you can use all standard corpus methods:
   • my_corpus.raw()    - Get raw text
   • my_corpus.words()  - Get word tokens
   • my_corpus.sents()  - Get sentences



In [13]:
# Access methods work the same as built-in corpora!
print("USING CUSTOM CORPUS")
print("=" * 55)

# Corpus-wide statistics
print(f"\n📊 Corpus-wide statistics:")
print(f"   Total words:     {len(my_corpus.words()):,}")
print(f"   Total sentences: {len(my_corpus.sents())}")
print(f"   Unique words:    {len(set(w.lower() for w in my_corpus.words() if w.isalpha()))}")

# Per-file access
print(f"\n📄 Per-file access:")
for fileid in my_corpus.fileids():
    words = len(my_corpus.words(fileid))
    sents = len(my_corpus.sents(fileid))
    print(f"   {fileid}: {words} words, {sents} sentences")

# Get specific content
print(f"\n📝 Words from doc1.txt:")
print(f"   {list(my_corpus.words('doc1.txt'))[:15]}...")

print(f"\n📖 First sentence of doc2.txt:")
print(f"   {' '.join(my_corpus.sents('doc2.txt')[0])}")

USING CUSTOM CORPUS

📊 Corpus-wide statistics:
   Total words:     139
   Total sentences: 12
   Unique words:    90

📄 Per-file access:
   doc1.txt: 44 words, 4 sentences
   doc2.txt: 45 words, 4 sentences
   doc3.txt: 50 words, 4 sentences

📝 Words from doc1.txt:
   ['Natural', 'language', 'processing', 'is', 'a', 'field', 'of', 'computer', 'science', '.', 'It', 'deals', 'with', 'the', 'interaction']...

📖 First sentence of doc2.txt:
   Machine learning is transforming how we build software systems .


## 15.6 Categorized Corpus

For machine learning tasks, you often need documents organized by category. NLTK's `CategorizedPlaintextCorpusReader` handles this!

### Two Ways to Define Categories

1. **Directory-based**: Each subdirectory is a category
   ```
   corpus/
   ├── sports/
   │   ├── article1.txt
   │   └── article2.txt
   └── politics/
       ├── article3.txt
       └── article4.txt
   ```

2. **File-based**: Categories stored in a separate file
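
The file-based option can be sketched with the reader's `cat_file` argument. This example assumes a mapping file with one line per document: the file ID followed by its categories, separated by spaces. The document names, texts and the `cats.map` file name are hypothetical.

```python
import os
import tempfile

from nltk.corpus import CategorizedPlaintextCorpusReader

# Hypothetical documents kept in one flat directory
docs = {
    "doc1.txt": "The team won the match.",
    "doc2.txt": "Parliament passed the budget.",
    "doc3.txt": "The minister attended the cup final.",
}
# One line per file: the file ID followed by its categories, space-separated
mapping = "doc1.txt sports\ndoc2.txt politics\ndoc3.txt sports politics\n"

with tempfile.TemporaryDirectory() as root:
    for name, text in docs.items():
        with open(os.path.join(root, name), "w", encoding="utf-8") as f:
            f.write(text)
    with open(os.path.join(root, "cats.map"), "w", encoding="utf-8") as f:
        f.write(mapping)

    reader = CategorizedPlaintextCorpusReader(
        root,
        r"doc\d+\.txt",        # corpus files only; the mapping file is excluded
        cat_file="cats.map",   # path relative to the corpus root
    )
    categories = reader.categories()
    sports_files = reader.fileids(categories="sports")
    doc3_categories = reader.categories("doc3.txt")

print(categories)
print(sports_files)
print(doc3_categories)
```

Unlike the directory layout, a mapping file lets one document belong to several categories, as `doc3.txt` does here.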

### Use Cases

- **Document classification** - Train classifiers on labeled documents
- **Topic modeling** - Compare vocabulary across categories
- **Genre analysis** - Study writing styles by category

In [14]:
# Create a categorized corpus using directory structure
from nltk.corpus import CategorizedPlaintextCorpusReader

# Create directory structure: category/file.txt
cat_corpus_dir = './categorized_corpus'

categories_data = {
    'tech': {
        'software.txt': "Software development requires programming skills and creativity. Modern applications use agile methodologies and continuous integration.",
        'hardware.txt': "Computer hardware includes processors, memory, and storage devices. GPUs have become essential for machine learning workloads.",
    },
    'science': {
        'biology.txt': "Biology studies living organisms and their interactions with the environment. DNA sequencing has revolutionized genetic research.",
        'physics.txt': "Physics explains the fundamental laws of the universe through mathematics. Quantum mechanics describes behavior at atomic scales.",
    },
}

# Create directories and files
for category, files in categories_data.items():
    cat_path = os.path.join(cat_corpus_dir, category)
    os.makedirs(cat_path, exist_ok=True)
    
    for filename, content in files.items():
        with open(os.path.join(cat_path, filename), 'w', encoding='utf-8') as f:
            f.write(content)

print("CATEGORIZED CORPUS STRUCTURE")
print("=" * 55)
print(f"""
📁 {cat_corpus_dir}/
   ├── 📂 tech/
   │   ├── 📄 software.txt
   │   └── 📄 hardware.txt
   └── 📂 science/
       ├── 📄 biology.txt
       └── 📄 physics.txt
""")

CATEGORIZED CORPUS STRUCTURE

📁 ./categorized_corpus/
   ├── 📂 tech/
   │   ├── 📄 software.txt
   │   └── 📄 hardware.txt
   └── 📂 science/
       ├── 📄 biology.txt
       └── 📄 physics.txt



In [15]:
# Load categorized corpus
# cat_pattern extracts category from the path using regex groups

cat_corpus = CategorizedPlaintextCorpusReader(
    cat_corpus_dir,
    r'.*/.*\.txt',            # Match files in subdirectories
    cat_pattern=r'(\w+)/.*'   # First directory name = category
)

print("LOADING CATEGORIZED CORPUS")
print("=" * 55)
print(f"""
📑 Categories found: {cat_corpus.categories()}

📁 All files:
   {cat_corpus.fileids()}

💡 The cat_pattern regex extracts categories:
   'tech/software.txt' → category = 'tech'
   'science/biology.txt' → category = 'science'
""")

LOADING CATEGORIZED CORPUS

📑 Categories found: ['science', 'tech']

📁 All files:
   ['science/biology.txt', 'science/physics.txt', 'tech/hardware.txt', 'tech/software.txt']

💡 The cat_pattern regex extracts categories:
   'tech/software.txt' → category = 'tech'
   'science/biology.txt' → category = 'science'
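
The category extraction shown above comes down to a regex capture group. Here is a sketch with the standard `re` module, reusing the same `cat_pattern`; the helper `category_of` and the last two file IDs are hypothetical, and returning `None` for a non-match is a choice made here for illustration, not a description of how NLTK treats such files.

```python
import re

cat_pattern = r'(\w+)/.*'   # same pattern as passed to the reader above

file_ids = [
    'tech/software.txt',
    'science/biology.txt',
    'science/physics/quantum.txt',   # hypothetical nested path
    'notes.txt',                     # hypothetical file outside any category folder
]

def category_of(file_id, pattern=cat_pattern):
    """Return the first captured group, or None when the pattern does not match."""
    match = re.match(pattern, file_id)
    return match.group(1) if match else None

for fid in file_ids:
    print(f"{fid!r:35} -> {category_of(fid)}")
```

Because `(\w+)` stops at the first `/`, only the top-level directory name becomes the category, even for nested paths.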



In [16]:
# Access content by category - just like the Brown corpus!
print("ACCESSING BY CATEGORY")
print("=" * 55)

for category in cat_corpus.categories():
    files = cat_corpus.fileids(categories=category)
    words = cat_corpus.words(categories=category)
    
    print(f"\n📂 Category: {category.upper()}")
    print(f"   Files: {files}")
    print(f"   Word count: {len(list(words))}")
    print(f"   Sample words: {list(words)[:10]}...")

print(f"""

💡 This is the same pattern used in Brown corpus:
   brown.words(categories='news')
   
   Works exactly the same for custom corpora!
""")

ACCESSING BY CATEGORY

📂 Category: SCIENCE
   Files: ['science/biology.txt', 'science/physics.txt']
   Word count: 37
   Sample words: ['Biology', 'studies', 'living', 'organisms', 'and', 'their', 'interactions', 'with', 'the', 'environment']...

📂 Category: TECH
   Files: ['tech/hardware.txt', 'tech/software.txt']
   Word count: 37
   Sample words: ['Computer', 'hardware', 'includes', 'processors', ',', 'memory', ',', 'and', 'storage', 'devices']...


💡 This is the same pattern used in Brown corpus:
   brown.words(categories='news')

   Works exactly the same for custom corpora!



## 15.7 Corpus Statistics and Analysis

Understanding your corpus is crucial before analysis. Key statistics include:

### Basic Metrics

| Metric | Description | Why It Matters |
|--------|-------------|----------------|
| **Word count** | Total words | Corpus size |
| **Vocabulary size** | Unique words | Lexical richness |
| **Avg word length** | Mean characters per word | Text complexity |
| **Avg sentence length** | Mean words per sentence | Writing style |
| **Lexical diversity** | Unique/Total ratio | Vocabulary variety |

### Corpus Comparison

Comparing statistics across corpora reveals:
- Writing style differences (formal vs informal)
- Vocabulary complexity (academic vs casual)
- Document structure (short tweets vs long articles)
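
One caveat when comparing lexical diversity: the raw unique/total ratio shrinks as a corpus grows, because common words keep repeating while new words become rarer. A common workaround is to average the ratio over fixed-size windows. The sketch below uses toy data and illustrative helper names (`type_token_ratio`, `standardized_ttr`); the window size is an arbitrary choice.

```python
def type_token_ratio(tokens):
    """Unique tokens divided by total tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def standardized_ttr(tokens, window=1000):
    """Mean TTR over consecutive fixed-size windows; a trailing partial window is dropped."""
    chunks = [tokens[i:i + window] for i in range(0, len(tokens) - window + 1, window)]
    if not chunks:
        return type_token_ratio(tokens)
    return sum(type_token_ratio(c) for c in chunks) / len(chunks)

# Toy data: the same 8-word vocabulary at two corpus sizes
vocab = ["the", "cat", "sat", "on", "a", "mat", "and", "slept"]
small = vocab * 2      # 16 tokens
large = vocab * 200    # 1,600 tokens

print(type_token_ratio(small), type_token_ratio(large))
print(standardized_ttr(small, window=8), standardized_ttr(large, window=8))
```

The raw ratio differs by a factor of 100 between the two toy corpora even though their vocabulary is identical, while the windowed version gives the same value for both.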

In [17]:
def corpus_statistics(corpus, name="Corpus"):
    """
    Calculate comprehensive statistics for a corpus.
    
    Args:
        corpus: NLTK corpus object
        name: Display name for the corpus
    
    Returns:
        Dictionary of statistics
    """
    # Basic counts
    words = list(corpus.words())
    alpha_words = [w for w in words if w.isalpha()]
    
    stats = {
        'name': name,
        'files': len(corpus.fileids()),
        'total_tokens': len(words),
        'total_words': len(alpha_words),
        'unique_words': len(set(w.lower() for w in alpha_words)),
        'sentences': len(corpus.sents()),
        'characters': len(corpus.raw()),
    }
    
    # Derived statistics
    if stats['total_words'] > 0:
        stats['avg_word_length'] = sum(len(w) for w in alpha_words) / len(alpha_words)
    else:
        stats['avg_word_length'] = 0
        
    if stats['sentences'] > 0:
        stats['avg_sent_length'] = stats['total_words'] / stats['sentences']
    else:
        stats['avg_sent_length'] = 0
        
    if stats['total_words'] > 0:
        stats['lexical_diversity'] = stats['unique_words'] / stats['total_words']
    else:
        stats['lexical_diversity'] = 0
    
    return stats

print("✅ corpus_statistics() function defined!")

✅ corpus_statistics() function defined!


In [18]:
# Compare statistics across different NLTK corpora
corpora_to_analyze = [
    (gutenberg, 'Gutenberg'),
    (brown, 'Brown'),
    (inaugural, 'Inaugural'),
]

print("CORPUS STATISTICS COMPARISON")
print("=" * 75)

all_stats = []
for corpus, name in corpora_to_analyze:
    stats = corpus_statistics(corpus, name)
    all_stats.append(stats)

# Display comparison table
print(f"\n{'Metric':<25}", end='')
for s in all_stats:
    print(f"{s['name']:<18}", end='')
print()
print("-" * 75)

metrics = [
    ('Files', 'files', ','),
    ('Total Words', 'total_words', ','),
    ('Unique Words', 'unique_words', ','),
    ('Sentences', 'sentences', ','),
    ('Avg Word Length', 'avg_word_length', '.2f'),
    ('Avg Sentence Length', 'avg_sent_length', '.1f'),
    ('Lexical Diversity', 'lexical_diversity', '.4f'),
]

for label, key, fmt in metrics:
    print(f"{label:<25}", end='')
    for s in all_stats:
        value = s[key]
        if fmt == ',':
            print(f"{value:<18,}", end='')
        else:
            print(f"{value:<18{fmt}}", end='')
    print()

print(f"""

💡 Observations:
   • Inaugural has the highest lexical diversity (it is also the smallest corpus)
   • Gutenberg has the lowest lexical diversity (the ratio falls as a corpus grows)
   • Inaugural speeches have the longest sentences (formal rhetoric)
   • Brown is balanced across genres (by design)
""")

CORPUS STATISTICS COMPARISON

Metric                   Gutenberg         Brown             Inaugural         
---------------------------------------------------------------------------
Files                    18                500               60                
Total Words              2,135,400         981,716           141,230           
Unique Words             41,487            40,234            9,354             
Sentences                98,552            57,340            5,395             
Avg Word Length          4.18              4.68              4.71              
Avg Sentence Length      21.7              17.1              26.2              
Lexical Diversity        0.0194            0.0410            0.0662            


💡 Observations:
   • Inaugural has the highest lexical diversity (it is also the smallest corpus)
   • Gutenberg has the lowest lexical diversity (the ratio falls as a corpus grows)
   • Inaugural speeches have the longest sentences (formal rhetoric)
   • Brown is balanced across genres (by design)
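
Lexical diversity (the type-token ratio) depends heavily on corpus size, so raw values for corpora of very different lengths are hard to compare directly. One possible workaround is to average the ratio over fixed-size chunks, an approach sometimes called mean segmental TTR. The sketch below is only an illustration: the helper names `type_token_ratio` and `mean_chunk_ttr` and the `chunk_size` parameter are assumptions introduced here, not part of NLTK or of `corpus_statistics()`.

```python
def type_token_ratio(tokens):
    """Unique tokens divided by total tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_chunk_ttr(tokens, chunk_size=1000):
    """Average the type-token ratio over consecutive, non-overlapping chunks.

    A trailing chunk shorter than chunk_size is dropped. If the input is
    shorter than one chunk, fall back to the plain ratio.
    """
    chunks = [
        tokens[i:i + chunk_size]
        for i in range(0, len(tokens) - chunk_size + 1, chunk_size)
    ]
    if not chunks:
        return type_token_ratio(tokens)
    return sum(type_token_ratio(c) for c in chunks) / len(chunks)


# Toy data: the same four words repeated 50 times (hypothetical example).
short_text = ['data', 'science', 'uses', 'python']
long_text = short_text * 50

print(type_token_ratio(short_text))            # every token is unique
print(type_token_ratio(long_text))             # same vocabulary, far lower ratio
print(mean_chunk_ttr(long_text, chunk_size=4)) # chunked ratio ignores the length
```

With a real corpus you would pass a token list such as `[w.lower() for w in brown.words() if w.isalpha()]` and choose the same `chunk_size` for every corpus being compared.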



## 15.8 Complete Corpus Manager Class

Let's build a reusable class that wraps corpus management functionality.

### Features

| Method | Description |
|--------|-------------|
| `add_document()` | Add new documents to corpus |
| `get_statistics()` | Get corpus statistics |
| `search()` | Search for terms across corpus |
| `get_concordance()` | Show word in context |
| `vocabulary()` | Get word frequency dictionary |
| `top_words()` | Get the most frequent words, optionally without stopwords |

This class provides a convenient interface for working with custom corpora in real projects.

In [19]:
class CorpusManager:
    """
    Utility class for managing and analyzing text corpora.
    
    Provides convenient methods for:
    - Loading and extending corpora
    - Calculating statistics
    - Searching and concordance
    - Vocabulary analysis
    
    Example:
        manager = CorpusManager('./my_corpus')
        manager.add_document('new.txt', 'New document text')
        stats = manager.get_statistics()
        results = manager.search('python')
    """
    
    def __init__(self, corpus_path, pattern=r'.*\.txt'):
        """
        Initialize corpus manager.
        
        Args:
            corpus_path: Directory containing corpus files
            pattern: Regex pattern to match files
        """
        self.path = corpus_path
        self.pattern = pattern
        self._reload_corpus()
        print(f"CorpusManager initialized: {len(self.corpus.fileids())} files loaded")
    
    def _reload_corpus(self):
        """Reload corpus from disk (call after adding files)"""
        self.corpus = PlaintextCorpusReader(self.path, self.pattern)
    
    def add_document(self, filename, content):
        """
        Add a new document to the corpus.
        
        Args:
            filename: Name for the new file
            content: Text content to write
        """
        filepath = os.path.join(self.path, filename)
        with open(filepath, 'w', encoding='utf-8') as f:
            f.write(content)
        self._reload_corpus()  # Reload to include new file
        print(f"Added: {filename}")
    
    def get_statistics(self):
        """Get comprehensive corpus statistics."""
        return corpus_statistics(self.corpus, self.path)
    
    def search(self, term):
        """
        Count occurrences of a term in each document.
        
        Args:
            term: Search term (case-insensitive substring match,
                so 'python' also matches 'pythonic')
        
        Returns:
            List of (filename, count) tuples, sorted by count (descending)
        """
        results = []
        for fileid in self.corpus.fileids():
            text = self.corpus.raw(fileid).lower()
            count = text.count(term.lower())
            if count > 0:
                results.append((fileid, count))
        return sorted(results, key=lambda x: x[1], reverse=True)
    
    def get_concordance(self, word, width=40, lines=10):
        """
        Show a word in context (concordance view).
        
        Args:
            word: Word to look up
            width: Total width of each printed line, in characters
            lines: Max number of examples
        """
        from nltk import Text
        text = Text(self.corpus.words())
        text.concordance(word, width=width, lines=lines)
    
    def vocabulary(self, min_freq=1):
        """
        Get vocabulary with frequency filter.
        
        Args:
            min_freq: Minimum frequency threshold
        
        Returns:
            Dictionary {word: count}
        """
        from collections import Counter
        words = [w.lower() for w in self.corpus.words() if w.isalpha()]
        freq = Counter(words)
        return {w: c for w, c in freq.items() if c >= min_freq}
    
    def top_words(self, n=20, exclude_stopwords=False):
        """
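        Get the most frequent words.
        
        Args:
            n: Number of words to return
            exclude_stopwords: Whether to filter out common English words
        
        Returns:
            List of (word, count) tuples, most frequent first
        """
        # Needs the stopwords corpus: nltk.download('stopwords')
        from nltk.corpus import stopwords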
        vocab = self.vocabulary()
        
        if exclude_stopwords:
            stop_words = set(stopwords.words('english'))
            vocab = {w: c for w, c in vocab.items() if w not in stop_words}
        
        return sorted(vocab.items(), key=lambda x: x[1], reverse=True)[:n]

print("✅ CorpusManager class defined!")

✅ CorpusManager class defined!
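
Note that `search()` counts substrings with `str.count`, so a query for `python` also counts `pythonic` or `pythons`. If whole-word matches are what you need, one option is a regular expression with word boundaries. The function below is a minimal sketch using only the standard library; the name `count_whole_word` is hypothetical and is not a method of `CorpusManager`.

```python
import re


def count_whole_word(text, term):
    """Count case-insensitive occurrences of term as a whole word in text."""
    # re.escape keeps characters such as '+' or '.' in the term literal.
    pattern = r'\b' + re.escape(term) + r'\b'
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


sample = "Python is popular. Pythonic code is idiomatic Python."

substring_hits = sample.lower().count('python')    # how search() counts
whole_word_hits = count_whole_word(sample, 'python')

print(f"Substring matches:  {substring_hits}")
print(f"Whole-word matches: {whole_word_hits}")
```

The same idea could replace the `text.count(...)` line inside `search()` by calling the helper on `self.corpus.raw(fileid)`. Word boundaries (`\b`) assume the term starts and ends with a word character.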


In [20]:
# Demonstrate the CorpusManager class
print("CORPUS MANAGER DEMONSTRATION")
print("=" * 55)

manager = CorpusManager('./my_corpus')

# Add a new document dynamically
print("\n📝 Adding a new document...")
manager.add_document('doc4.txt', """
Data analysis is essential for business intelligence.
Visualization helps communicate insights effectively.
Python and R are popular tools for data analysis.
Statistical methods underpin modern data science.
""")

print(f"\n📁 Current files: {manager.corpus.fileids()}")

# Search across corpus
print(f"\n🔍 Search for 'Python':")
results = manager.search('python')
for filename, count in results:
    print(f"   {filename}: {count} occurrence(s)")

# Get statistics
print(f"\n📊 Corpus Statistics:")
stats = manager.get_statistics()
print(f"   Total words: {stats['total_words']:,}")
print(f"   Unique words: {stats['unique_words']}")
print(f"   Lexical diversity: {stats['lexical_diversity']:.4f}")

# Top words
print(f"\n🔤 Top 10 words (excluding stopwords):")
for word, count in manager.top_words(10, exclude_stopwords=True):
    print(f"   {word}: {count}")

CORPUS MANAGER DEMONSTRATION
CorpusManager initialized: 3 files loaded

📝 Adding a new document...
Added: doc4.txt

📁 Current files: ['doc1.txt', 'doc2.txt', 'doc3.txt', 'doc4.txt']

🔍 Search for 'Python':
   doc3.txt: 2 occurrence(s)
   doc4.txt: 1 occurrence(s)

📊 Corpus Statistics:
   Total words: 148
   Unique words: 105
   Lexical diversity: 0.7095

🔤 Top 10 words (excluding stopwords):
   data: 5
   learning: 4
   language: 3
   science: 3
   analysis: 3
   python: 3
   natural: 2
   machine: 2
   deep: 2
   libraries: 2


In [21]:
# Cleanup temporary directories
import shutil

print("CLEANUP")
print("=" * 55)

shutil.rmtree('./my_corpus', ignore_errors=True)
shutil.rmtree('./categorized_corpus', ignore_errors=True)

print("✅ Cleaned up temporary corpus directories")
print("""
💡 In real projects, you would keep your corpus!
   These temporary files were just for demonstration.
""")

CLEANUP
✅ Cleaned up temporary corpus directories

💡 In real projects, you would keep your corpus!
   These temporary files were just for demonstration.



## Summary & Quick Reference

### NLTK Corpus Readers

| Reader | Use Case | Example |
|--------|----------|---------|
| `PlaintextCorpusReader` | Plain text files | Blog posts, articles |
| `CategorizedPlaintextCorpusReader` | Categorized documents | Labeled training data |
| `TaggedCorpusReader` | POS-tagged text | Custom tagged data |
| `BracketParseCorpusReader` | Parsed tree structures | Syntax trees |
| `XMLCorpusReader` | XML documents | Structured data |
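
As a small illustration of a reader other than `PlaintextCorpusReader`, the sketch below loads a tiny POS-tagged file with `TaggedCorpusReader`. It assumes the reader's default `word/TAG` format with one sentence per line; the file name `sample.pos` and its contents are made up for the example.

```python
import os
import tempfile

from nltk.corpus import TaggedCorpusReader

# Hypothetical tagged text: tokens are written as word/TAG, one sentence per line.
tagged_text = "The/DT cat/NN sat/VBD ./.\nDogs/NNS bark/VBP ./.\n"

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, 'sample.pos'), 'w', encoding='utf-8') as f:
        f.write(tagged_text)

    reader = TaggedCorpusReader(root, r'.*\.pos')

    # Materialize the lazy views while the temporary directory still exists.
    words = list(reader.words())
    tagged_words = list(reader.tagged_words())
    tagged_sents = [list(sent) for sent in reader.tagged_sents()]

print(words)
print(tagged_words[:3])
print(f"Sentences: {len(tagged_sents)}")
```

No tokenizer models are needed here, because the default settings split tokens on whitespace and sentences on newlines.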

### Common Corpus Methods

```python
# Loading a built-in corpus (brown is one example; substitute any corpus)
from nltk.corpus import brown as corpus

# Access methods
corpus.fileids()                    # List all files
corpus.fileids(categories='news')   # Files in a category (categorized corpora only)
corpus.categories()                 # List categories (categorized corpora only)
corpus.categories('ca01')           # Categories for a file

# Content access
corpus.raw()                        # Raw text string
corpus.raw('ca01')                  # Raw text of a specific file
corpus.words()                      # List of words
corpus.words(categories='news')     # Words in a category
corpus.sents()                      # List of sentences
corpus.tagged_words()               # (word, tag) pairs (tagged corpora only)
```

### Creating Custom Corpora

```python
from nltk.corpus import PlaintextCorpusReader

# Basic corpus
corpus = PlaintextCorpusReader('./directory/', r'.*\.txt')

# Categorized corpus  
from nltk.corpus import CategorizedPlaintextCorpusReader
corpus = CategorizedPlaintextCorpusReader(
    './directory/',
    r'.*/.*\.txt',
    cat_pattern=r'(\w+)/.*'  # Category from directory name
)
```
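
The two regular expressions in the categorized reader do different jobs: the file pattern selects which files belong to the corpus, and `cat_pattern` captures the category from each file ID. The standard-library sketch below approximates that mapping so you can test your patterns before building a reader. It is an illustration only, not NLTK's implementation, and the file IDs are hypothetical.

```python
import re

file_pattern = r'.*/.*\.txt'   # same file pattern as in the reader above
cat_pattern = r'(\w+)/.*'      # first capture group is the category

# Hypothetical relative paths, written with forward slashes.
file_ids = ['sports/game1.txt', 'tech/ai.txt', 'tech/notes.md', 'readme.txt']

categories = {}
for file_id in file_ids:
    # Skip anything the file pattern would not select.
    if not re.fullmatch(file_pattern, file_id):
        continue
    match = re.match(cat_pattern, file_id)
    if match:
        categories[file_id] = match.group(1)

for file_id, category in categories.items():
    print(f"{file_id} -> {category}")
```

Here `tech/notes.md` is skipped because it does not end in `.txt`, and `readme.txt` is skipped because it is not inside a category directory.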

### Key Statistics to Track

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **Vocabulary Size** | Unique words | Lexical richness |
| **Lexical Diversity** | Unique / Total | Near 0 (repetitive) to 1.0 (every word distinct); falls as corpus size grows |
| **Avg Sentence Length** | Words / Sentences | Complexity indicator |
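
As a quick check of these formulas, the snippet below recomputes lexical diversity and average sentence length from the counts reported in the comparison table in this section. The dictionary layout is just a convenience for the example.

```python
# Counts copied from the corpus statistics comparison table.
table = {
    'Gutenberg': {'total_words': 2_135_400, 'unique_words': 41_487, 'sentences': 98_552},
    'Brown':     {'total_words': 981_716,   'unique_words': 40_234, 'sentences': 57_340},
    'Inaugural': {'total_words': 141_230,   'unique_words': 9_354,  'sentences': 5_395},
}

results = {}
for name, counts in table.items():
    diversity = counts['unique_words'] / counts['total_words']   # Unique / Total
    avg_sentence = counts['total_words'] / counts['sentences']   # Words / Sentences
    results[name] = (f"{diversity:.4f}", f"{avg_sentence:.1f}")
    print(f"{name:<10} diversity={results[name][0]}  avg_sentence_length={results[name][1]}")

# Gutenberg  diversity=0.0194  avg_sentence_length=21.7
# Brown      diversity=0.0410  avg_sentence_length=17.1
# Inaugural  diversity=0.0662  avg_sentence_length=26.2
```

The largest corpus (Gutenberg) has the lowest diversity and the smallest (Inaugural) the highest, which is why the ratio is best compared between corpora of similar size.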

### Best Practices

1. **Organize by category** when building classification datasets
2. **Use consistent naming** for files
3. **Document your corpus** by including a README that lists sources
4. **Version control** your corpus alongside code
5. **Calculate statistics** before analysis to understand your data

### Next Steps

- Build a corpus from your own data (web scraping, APIs)
- Compare vocabulary across categories
- Use your corpus for classification or language modeling
- Section 16: Advanced Topics for parsing and optimization