# Creating custom corpora

- Setting up a custom corpus
- Creating a wordlist corpus
- Creating a part-of-speech tagged word corpus
- Creating a chunked phrase corpus
- Creating a categorized text corpus
- Creating a categorized chunk corpus reader
- Lazy corpus loading
- Creating a custom corpus view
- Creating a MongoDB-backed corpus reader
- Corpus editing with file locking

# Inheritance diagrams

- `CorpusReader`
  - fileids  
- `PlaintextCorpusReader(CorpusReader)`
  - words, sents, paras  
- `WordListCorpusReader(CorpusReader)`
 - words  
- `TaggedCorpusReader(CorpusReader)`
 - words, sents, paras, tagged_words, tagged_sents, tagged_paras  
- `ChunkedCorpusReader(CorpusReader)`
 - words, sents, paras, tagged_words, tagged_sents, tagged_paras, chunked_words, chunked_sents, chunked_paras  
- `ConllCorpusReader(CorpusReader)`
 - words, sents, tagged_words, tagged_sents, iob_words, iob_sents
- `ConllChunkCorpusReader(ConllCorpusReader)`
 - words, sents, tagged_words, tagged_sents, iob_words, iob_sents, chunked_words, chunked_sents  
- `CategorizedCorpusReader`
 - categories, fileids  
- `CategorizedPlaintextCorpusReader(CategorizedCorpusReader, PlaintextCorpusReader)`
- Not provided by NLTK
 - `CategorizedChunkedCorpusReader(CategorizedCorpusReader, ChunkedCorpusReader)`
 - `CategorizedConllChunkCorpusReader(CategorizedCorpusReader, ConllChunkCorpusReader)`

## Setting up a custom corpus

- A `corpus` is a collection of text documents, and `corpora` is the plural of corpus.
- A `custom corpus` is a bunch of text files in a directory, often alongside many other directories of text files.

### How to do it?

- NLTK defines a list of data directories, or paths, in nltk.data.path. Custom corpora must be inside one of these paths so that NLTK can find them.
- In order to avoid conflict with the official data package, we'll create a custom nltk_data directory in our home directory.
- We can create a simple wordlist file and make sure it loads.
 - Create a subdirectory in corpora to hold our custom corpus
 - Create a wordlist file and put this file into the subdirectory
- nltk.data.load can also load pickle files and .yaml files
- For most corpora access, we won't actually need to use nltk.data.load, since that will be handled by the `CorpusReader` classes
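
The directory and file steps above can be sketched with the standard library alone. This is a minimal sketch and not code from the source: the helper name `make_wordlist_corpus` is hypothetical, and a temporary directory stands in for `~/nltk_data` so that running it does not touch the home directory.

```python
import os
import tempfile

def make_wordlist_corpus(data_dir, corpus_name, filename, words):
    """Create <data_dir>/corpora/<corpus_name>/<filename>, one word per line."""
    corpus_dir = os.path.join(data_dir, 'corpora', corpus_name)
    os.makedirs(corpus_dir, exist_ok=True)   # also creates 'corpora' if missing
    file_path = os.path.join(corpus_dir, filename)
    with open(file_path, 'w', encoding='utf-8') as fp:
        for word in words:
            fp.write(word + '\n')
    return file_path

# Assumption: a temporary directory replaces os.path.expanduser('~/nltk_data').
data_dir = tempfile.mkdtemp()
wordlist_path = make_wordlist_corpus(data_dir, 'cookbook', 'mywords.txt', ['nltk'])
with open(wordlist_path, encoding='utf-8') as fp:
    print(repr(fp.read()))
```

With the real `~/nltk_data` directory as `data_dir`, the resulting file sits where the `corpora/cookbook/mywords.txt` resource name is looked up.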

In [1]:
import os, os.path
path = os.path.expanduser('~/nltk_data')
if not os.path.exists(path):
    print('path does not exist')
    os.mkdir(path)
os.path.exists(path)

True

In [2]:
import nltk.data
path in nltk.data.path

True

In [3]:
import nltk.data
nltk.data.load('corpora/cookbook/mywords.txt', format='raw')

b'nltk\n'

### Loading a YAML file

import nltk.data
nltk.data.load('corpora/cookbook/synonyms.yaml')
=> {'bday': 'birthday'}

## Creating a wordlist corpus

The `WordListCorpusReader` class is one of the simplest `CorpusReader` classes.
- It provides access to a file containing a list of words, one word per line.
- Calling the words() function runs nltk.tokenize.line_tokenize() on the raw file data.

In [4]:
from nltk.corpus.reader import WordListCorpusReader

wordlist_dir = nltk.data.find('corpora/cookbook')
reader = WordListCorpusReader(wordlist_dir, r'.*\.txt')
print('reader words: ', reader.words())
print('reader file id: ', reader.fileids())

# words() is equivalent to line-tokenizing the raw text:
from nltk.tokenize import line_tokenize
print('step by step: ', line_tokenize(reader.raw()))

reader words:  ['nltk', 'nltk', 'corpus', 'corpora', 'wordnet']
reader file id:  ['mywords.txt', 'wordlist.txt']
step by step:  ['nltk', 'nltk', 'corpus', 'corpora', 'wordnet']


In [5]:
wordlist_dir = nltk.data.find('corpora/cookbook')
print(wordlist_dir)

/Users/liam/nltk_data/corpora/cookbook


### Names wordlist corpus

In [6]:
from nltk.corpus import names
for fileid in names.fileids():
    print('names wordlist corpus: {:>10}. This corpus contains {:4} names.'.format(fileid, len(names.words(fileid))))

names wordlist corpus: female.txt. This corpus contains 5001 names.
names wordlist corpus:   male.txt. This corpus contains 2943 names.


### English words corpus

In [7]:
from nltk.corpus import words
for fileid in words.fileids():
    print('words wordlist corpus: {:>10}. This corpus contains {:6} words.'.format(fileid, len(words.words(fileid))))

words wordlist corpus:         en. This corpus contains 235886 words.
words wordlist corpus:   en-basic. This corpus contains    850 words.


## Creating a part-of-speech tagged word corpus

- Part-of-speech tagging is the process of identifying the part-of-speech tag for a word.
 - Most of the time, a tagger must first be trained on a training corpus.

### The simplest format for a tagged corpus is of the form word/tag.
- An excerpt from the Brown corpus:
 - The/at-tl expense/nn and/cc time/nn involved/vbn are/ber astronomical/jj ./.  
- Different corpora can use different tags to mean the same thing.
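
To make the word/tag format concrete, here is a plain-Python sketch of how such a line can be split into (word, tag) pairs. It is only an illustration, not NLTK's implementation: the function name `parse_tagged` is hypothetical, and upper-casing the tag is an assumption made to match the upper-case tags that appear in the reader output in these notes.

```python
def parse_tagged(text, sep='/'):
    """Split whitespace-separated word<sep>tag tokens into (word, tag) tuples."""
    pairs = []
    for token in text.split():
        # Split on the LAST separator so words that contain sep stay intact.
        word, found, tag = token.rpartition(sep)
        if found:
            pairs.append((word, tag.upper()))
        else:
            pairs.append((token, None))   # token carries no tag
    return pairs

excerpt = 'The/at-tl expense/nn and/cc time/nn involved/vbn are/ber astronomical/jj ./.'
print(parse_tagged(excerpt))
print(parse_tagged('The|at-tl expense|nn', sep='|'))
```

The `sep` parameter plays the same role as the tag separator that a tagged corpus reader can be configured with.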

### What is the TaggedCorpusReader class?
- TaggedCorpusReader provides several methods for extracting text from a corpus.
 - words()
 - sents()
 - paras()
 - tagged_words()
 - tagged_sents()
 - tagged_paras()

### The tokenizers can be customized
#### Customizing the word tokenizer
- The default word tokenizer is an instance of nltk.tokenize.WhitespaceTokenizer
- We can pass a different tokenizer as word_tokenizer

#### Customizing the sentence tokenizer
- The default sentence tokenizer is an instance of nltk.tokenize.RegexpTokenizer that splits on '\n'
 - This assumes that each sentence is on a line by itself and that individual sentences contain no line breaks.
- We can pass a different tokenizer as sent_tokenizer

#### Customizing the paragraph block reader
- Paragraphs are assumed to be separated by blank lines.
 - This is done by the default para_block_reader, the read_blankline_block function in nltk.corpus.reader.util
- nltk.corpus.reader.util contains a number of other block reader functions, each of which reads a block of text from a stream
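
The idea of a paragraph block reader can be sketched in plain Python. This is a simplified illustration under stated assumptions, not one of NLTK's own block readers: the name `read_paragraph_block` is hypothetical, and an in-memory `io.StringIO` stands in for a corpus file. Each call consumes one paragraph from the stream and returns it as a one-element list, or an empty list at the end of the stream.

```python
import io

def read_paragraph_block(stream):
    """Read one paragraph: consecutive non-blank lines, ended by a blank line or EOF."""
    lines = []
    while True:
        line = stream.readline()
        if not line:                 # end of file
            break
        if line.strip():             # non-blank line belongs to the paragraph
            lines.append(line)
        elif lines:                  # blank line after content closes the block
            break
        # blank lines before any content are skipped
    return [''.join(lines)] if lines else []

stream = io.StringIO('First para, line 1\nline 2\n\n\nSecond para\n')
blocks = []
while True:
    block = read_paragraph_block(stream)
    if not block:
        break
    blocks.extend(block)
print(blocks)
```

Because the reader only advances the stream by one paragraph per call, a caller can stop early without reading the rest of the file.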

#### Customizing the tag separator
- The default is sep = '/'
- If we want to split words and tags with '|', we should pass in sep = '|'

#### Converting tags to a universal tagset
- NLTK provides a method for converting known tagsets to a universal tagset.
- A tagset is just a list of part-of-speech tags used by one or more corpora.
- To map corpus tags to the universal tagset, the corpus reader must be initialized with a known tagset name

In [8]:
from nltk.corpus.reader import TaggedCorpusReader

reader = TaggedCorpusReader(wordlist_dir, r'.*\.pos')
print('words: ', reader.words())
print('sents: ', reader.sents())
print('paras: ', reader.paras())
print('tagged_words: ', reader.tagged_words())
print('tagged_sents: ', reader.tagged_sents())
print('tagged_paras: ', reader.tagged_paras())

words:  ['The', 'expense', 'and', 'time', 'involved', 'are', ...]
sents:  [['The', 'expense', 'and', 'time', 'involved', 'are'], ['astronomical', '.']]
paras:  [[['The', 'expense', 'and', 'time', 'involved', 'are'], ['astronomical', '.']]]
tagged_words:  [('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ...]
tagged_sents:  [[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER')], [('astronomical', 'JJ'), ('.', '.')]]
tagged_paras:  [[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER')], [('astronomical', 'JJ'), ('.', '.')]]]


### Examples for converting tags to a universal tagset

In [9]:
reader = TaggedCorpusReader(wordlist_dir, r'.*\.pos', tagset='en-brown')
reader.tagged_words(tagset='universal')

[('The', 'DET'), ('expense', 'NOUN'), ('and', 'CONJ'), ...]

In [10]:
from nltk.corpus import treebank
treebank.tagged_words()

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]

In [11]:
treebank.tagged_words(tagset='universal')

[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ...]

In [12]:
treebank.tagged_words(tagset='brown')

[('Pierre', 'UNK'), ('Vinken', 'UNK'), (',', 'UNK'), ...]

## Creating a chunked phrase corpus

A chunk is a short phrase within a sentence.

### What is the ChunkedCorpusReader class?
- ChunkedCorpusReader provides several methods for extracting text from a corpus.
 - words()
 - sents()
 - paras()
 - tagged_words()
 - tagged_sents()
 - tagged_paras()
 - chunked_words()
 - chunked_sents()
 - chunked_paras()
- We can draw a tree by calling the draw() method.

### What is the ConllChunkCorpusReader class?
- ConllChunkCorpusReader provides several methods as follows:
 - words()
 - sents()
 - tagged_words()
 - tagged_sents()
 - chunked_words()
 - chunked_sents()
 - iob_words()
 - iob_sents()
- An alternative format for denoting chunks is called IOB tags.
 - IOB tags are similar to part-of-speech tags, but provide a way to denote the inside, outside, and beginning of a chunk.
 - To read a corpus using the IOB format, we must use the ConllChunkCorpusReader class.
  - Each sentence is separated by a blank line, but there is no separation for paragraphs.
   - This means that the para_* methods are not available
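
How IOB tags encode chunk boundaries can be shown with a small plain-Python sketch. It is an illustration only, not what the NLTK reader does internally: the function name `iob_to_chunks` is hypothetical, and chunks are represented as simple `(label, tokens)` tuples instead of `Tree` objects. The sample sentence is the one used in the IOB examples in these notes.

```python
def iob_to_chunks(iob_words):
    """Group (word, pos, iob) triples into a flat list of chunks and loose tokens."""
    result = []
    current = None                       # chunk being built: (label, [tokens])
    for word, pos, iob in iob_words:
        if iob.startswith('B-'):
            # B- always opens a new chunk, even right after one of the same type.
            current = (iob[2:], [(word, pos)])
            result.append(current)
        elif iob.startswith('I-') and current is not None and current[0] == iob[2:]:
            # I- continues the chunk that is currently open.
            current[1].append((word, pos))
        else:
            # 'O', or an I- tag with no matching open chunk: a loose token.
            current = None
            result.append((word, pos))
    return result

sentence = [
    ('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP'),
    ('had', 'VBD', 'B-VP'), ('been', 'VBN', 'I-VP'),
    ('executive', 'JJ', 'B-NP'), ('vice', 'NN', 'I-NP'), ('president', 'NN', 'I-NP'),
    ('of', 'IN', 'B-PP'), ('Balcor', 'NNP', 'B-NP'), ('.', '.', 'O'),
]
for item in iob_to_chunks(sentence):
    print(item)
```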
   
### Tree leaves
- When it comes to chunk trees, the leaves of a tree are the tagged tokens.

### Treebank chunk corpus
- The nltk.corpus.treebank_chunk corpus uses ChunkedCorpusReader to provide part-of-speech tagged words and noun phrase chunks of Wall Street Journal headlines.

### CoNLL2000 corpus
- CoNLL stands for the Conference on Computational Natural Language Learning.

In [13]:
from nltk.corpus.reader import ChunkedCorpusReader

reader = ChunkedCorpusReader(wordlist_dir, r'.*\.chunk')
print(reader.chunked_words())
print(reader.chunked_sents())
print(reader.chunked_paras())

[Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]), ('have', 'VBP'), ...]
[Tree('S', [Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', ''), ('IN', None), Tree('NP', [('300', 'CD'), ('jobs', 'NNS')]), (',', ','), Tree('NP', [('the', 'DT'), ('spokesman', 'NN')]), ('said', 'VBD'), ('.', '.')])]
[[Tree('S', [Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', ''), ('IN', None), Tree('NP', [('300', 'CD'), ('jobs', 'NNS')]), (',', ','), Tree('NP', [('the', 'DT'), ('spokesman', 'NN')]), ('said', 'VBD'), ('.', '.')])]]


In [22]:
print(reader.chunked_words()[0].leaves())
print(reader.chunked_sents()[0].leaves())
print(reader.chunked_paras()[0][0].leaves())

[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]
[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS'), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', ''), ('IN', None), ('300', 'CD'), ('jobs', 'NNS'), (',', ','), ('the', 'DT'), ('spokesman', 'NN'), ('said', 'VBD'), ('.', '.')]
[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS'), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', ''), ('IN', None), ('300', 'CD'), ('jobs', 'NNS'), (',', ','), ('the', 'DT'), ('spokesman', 'NN'), ('said', 'VBD'), ('.', '.')]


In [None]:
reader.chunked_sents()[0].draw()

In [14]:
from nltk.corpus.reader import ConllChunkCorpusReader

conllreader = ConllChunkCorpusReader(wordlist_dir, r'.*\.iob', ('NP', 'VP', 'PP'))

In [15]:
conllreader.chunked_words()

[Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]), Tree('VP', [('had', 'VBD'), ('been', 'VBN')]), ...]

In [16]:
conllreader.chunked_sents()

[Tree('S', [Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]), Tree('VP', [('had', 'VBD'), ('been', 'VBN')]), Tree('NP', [('executive', 'JJ'), ('vice', 'NN'), ('president', 'NN')]), Tree('PP', [('of', 'IN')]), Tree('NP', [('Balcor', 'NNP')]), ('.', '.')])]

In [18]:
for elem in conllreader.iob_words():
    print(elem)

('Mr.', 'NNP', 'B-NP')
('Meador', 'NNP', 'I-NP')
('had', 'VBD', 'B-VP')
('been', 'VBN', 'I-VP')
('executive', 'JJ', 'B-NP')
('vice', 'NN', 'I-NP')
('president', 'NN', 'I-NP')
('of', 'IN', 'B-PP')
('Balcor', 'NNP', 'B-NP')
('.', '.', 'O')


In [19]:
for elem in conllreader.iob_sents():
    print(elem)

[('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP'), ('had', 'VBD', 'B-VP'), ('been', 'VBN', 'I-VP'), ('executive', 'JJ', 'B-NP'), ('vice', 'NN', 'I-NP'), ('president', 'NN', 'I-NP'), ('of', 'IN', 'B-PP'), ('Balcor', 'NNP', 'B-NP'), ('.', '.', 'O')]


## Creating a categorized text corpus

If we have a large corpus of text, we might want to categorize it into separate sections. This can be helpful for text classification.

In [23]:
from nltk.corpus import brown
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [26]:
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# option 1
reader = CategorizedPlaintextCorpusReader('corpora', r'movie_.*\.txt', cat_pattern = r'movie_(\w+)\.txt')
for cat in reader.categories():
    print('category: {:4} files: {}'.format(cat, reader.fileids(cat)))

category: neg  files: ['movie_neg.txt']
category: pos  files: ['movie_pos.txt']


In [28]:
# option 2
reader = CategorizedPlaintextCorpusReader('corpora', r'movie_.*\.txt',
                                          cat_map = {'movie_pos.txt':['pos'], 'movie_neg.txt':['neg']})
reader.categories()

['neg', 'pos']

### What is the CategorizedPlaintextCorpusReader class and how does it work?
- The first two arguments to CategorizedPlaintextCorpusReader are the root directory and fileids
- (option 1) The `cat_pattern` keyword is passed to CategorizedCorpusReader, which overrides the common corpus reader functions such as fileids(), words(), sents(), and paras() to accept a categories keyword argument.
- The CategorizedCorpusReader class provides the categories() function, which returns a list of all the known categories in the corpus.
- (option 2) Instead of `cat_pattern`, we could pass in a `cat_map`, which is a dictionary mapping a fileid argument to a list of category labels
- (option 3) A third way of specifying categories is the `cat_file` keyword argument, which names a file that maps each `fileid` to its categories: one `fileid` per line, followed by its category labels. See the book for an example.
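
The three options boil down to building a fileid-to-categories mapping. The sketch below shows the idea in plain Python; it is not NLTK's code. The function names are hypothetical, and the `fileid cat1 cat2 ...` line layout used for the category file is an assumption about the format.

```python
import re

def categories_from_pattern(fileids, cat_pattern):
    """Map each fileid to the category captured by group 1 of cat_pattern."""
    mapping = {}
    for fileid in fileids:
        match = re.match(cat_pattern, fileid)
        if match:
            mapping[fileid] = [match.group(1)]
    return mapping

def categories_from_lines(lines, delimiter=' '):
    """Parse 'fileid cat1 cat2 ...' lines (assumed layout of a category file)."""
    mapping = {}
    for line in lines:
        if not line.strip():
            continue                     # ignore blank lines
        fileid, *cats = line.strip().split(delimiter)
        mapping[fileid] = cats
    return mapping

fileids = ['movie_neg.txt', 'movie_pos.txt']
by_pattern = categories_from_pattern(fileids, r'movie_(\w+)\.txt')
by_file = categories_from_lines(['movie_neg.txt neg', 'movie_pos.txt pos'])
print(by_pattern)
print(by_pattern == by_file)
```

A `cat_map` dictionary is this mapping written out by hand, which is why the options are interchangeable for a small corpus.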

### Categorized tagged corpus reader

The brown corpus reader is actually an instance of `CategorizedTaggedCorpusReader`, which inherits from `CategorizedCorpusReader` and `TaggedCorpusReader`.

### Categorized corpora
- The movie_reviews corpus reader is an instance of `CategorizedPlaintextCorpusReader`, as is the `reuters` corpus reader. 
 - reuters has 90 categories
 - movie_reviews has 2 categories
- These corpora are often used for training and evaluating classifiers

## Creating a categorized chunk corpus reader
### Why do we have to create a categorized chunk corpus reader?
- NLTK only provides the `CategorizedPlaintextCorpusReader` and `CategorizedTaggedCorpusReader` classes, so a categorized reader for chunked corpora has to be written by hand

In [31]:
from nltk.corpus.reader import CategorizedCorpusReader, ChunkedCorpusReader

class CategorizedChunkedCorpusReader(CategorizedCorpusReader, ChunkedCorpusReader):
    def __init__(self, *args, **kwargs):
        CategorizedCorpusReader.__init__(self, kwargs)
        ChunkedCorpusReader.__init__(self, *args, **kwargs)
    
    def _resolve(self, fileids, categories):
        if fileids is not None and categories is not None:
            raise ValueError('Specify fileids or categories, not both')
        if categories is not None:
            return self.fileids(categories)
        else:
            return fileids
    
    def raw(self, fileids = None, categories = None):
        return ChunkedCorpusReader.raw(self, self._resolve(fileids, categories))
    
    def words(self, fileids = None, categories = None):
        return ChunkedCorpusReader.words(self, self._resolve(fileids, categories))
    
    def sents(self, fileids = None, categories = None):
        return ChunkedCorpusReader.sents(self, self._resolve(fileids, categories))
    
    def paras(self, fileids = None, categories = None):
        return ChunkedCorpusReader.paras(self, self._resolve(fileids, categories))
    
    def tagged_words(self, fileids = None, categories = None):
        return ChunkedCorpusReader.tagged_words(self, self._resolve(fileids, categories))

    def tagged_sents(self, fileids = None, categories = None):
        return ChunkedCorpusReader.tagged_sents(self, self._resolve(fileids, categories))

    def tagged_paras(self, fileids = None, categories = None):
        return ChunkedCorpusReader.tagged_paras(self, self._resolve(fileids, categories))
    
    def chunked_words(self, fileids = None, categories = None):
        return ChunkedCorpusReader.chunked_words(self, self._resolve(fileids, categories))

    def chunked_sents(self, fileids = None, categories = None):
        return ChunkedCorpusReader.chunked_sents(self, self._resolve(fileids, categories))

    def chunked_paras(self, fileids = None, categories = None):
        return ChunkedCorpusReader.chunked_paras(self, self._resolve(fileids, categories))

In [33]:
path = nltk.data.find('corpora/treebank/tagged')
reader = CategorizedChunkedCorpusReader(path, r'wsj_.*\.pos', cat_pattern = r'wsj_(.*)\.pos')
print(len(reader.categories()) == len(reader.fileids()))
print(len(reader.chunked_sents(categories=['0001'])))

True
16


In [32]:
from nltk.corpus.reader import CategorizedCorpusReader, ConllCorpusReader, ConllChunkCorpusReader

class CategorizedConllChunkCorpusReader(CategorizedCorpusReader, ConllChunkCorpusReader):
    def __init__(self, *args, **kwargs):
        CategorizedCorpusReader.__init__(self, kwargs)
        ConllChunkCorpusReader.__init__(self, *args, **kwargs)
    
    def _resolve(self, fileids, categories):
        if fileids is not None and categories is not None:
            raise ValueError('Specify fileids or categories, not both')
        if categories is not None:
            return self.fileids(categories)
        else:
            return fileids
    
    def raw(self, fileids = None, categories = None):
        return ConllCorpusReader.raw(self, self._resolve(fileids, categories))
    
    def words(self, fileids = None, categories = None):
        return ConllCorpusReader.words(self, self._resolve(fileids, categories))
    
    def sents(self, fileids = None, categories = None):
        return ConllCorpusReader.sents(self, self._resolve(fileids, categories))
    
    def tagged_words(self, fileids = None, categories = None):
        return ConllCorpusReader.tagged_words(self, self._resolve(fileids, categories))
    
    def tagged_sents(self, fileids = None, categories = None):
        return ConllCorpusReader.tagged_sents(self, self._resolve(fileids, categories))
    
    def chunked_words(self, fileids = None, categories = None, chunk_types = None):
        return ConllCorpusReader.chunked_words(self, self._resolve(fileids, categories), chunk_types)
    
    def chunked_sents(self, fileids = None, categories = None, chunk_types = None):
        return ConllCorpusReader.chunked_sents(self, self._resolve(fileids, categories), chunk_types)
    
    def parsed_sents(self, fileids = None, categories = None, pos_in_tree = None):
        return ConllCorpusReader.parsed_sents(self, self._resolve(fileids, categories), pos_in_tree)
    
    def srl_spans(self, fileids = None, categories = None):
        return ConllCorpusReader.srl_spans(self, self._resolve(fileids, categories))
    
    def srl_instances(self, fileids = None, categories = None, pos_in_tree = None, flatten = True):
        return ConllCorpusReader.srl_instances(self, self._resolve(fileids, categories), pos_in_tree, flatten)
    
    def iob_words(self, fileids = None, categories = None):
        return ConllCorpusReader.iob_words(self, self._resolve(fileids, categories))
    
    def iob_sents(self, fileids = None, categories = None):
        return ConllCorpusReader.iob_sents(self, self._resolve(fileids, categories))

In [36]:
path = nltk.data.find('corpora/conll2000')
reader = CategorizedConllChunkCorpusReader(path, r'.*\.txt', ('NP', 'VP', 'PP'), cat_pattern = r'(.*)\.txt')
reader.categories()
reader.fileids()
len(reader.chunked_sents(categories=['test']))

2012

## Lazy corpus loading

- Loading a corpus reader can be an expensive operation due to the number of files, file sizes, and various initialization tasks.
- LazyCorpusLoader speeds up module import time by deferring the creation of a corpus reader until the reader is first used.

### How to use LazyCorpusLoader?
- It requires two arguments: the name of the corpus and the corpus reader class, plus any other arguments needed to initialize the corpus reader class
 - the `name` argument specifies the root directory name of the corpus, which must be within a corpora subdirectory of one of the paths in nltk.data.path
 - the corpus reader class `reader_cls` should be the name of a subclass of `CorpusReader`
  - We will also need to pass in any other arguments required by the `reader_cls` argument for initialization.
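
The lazy-loading behaviour can be illustrated with a plain-Python proxy. This is a sketch of the general technique, not NLTK's implementation, whose details may differ. The class names `LazyProxy` and `WordList` are hypothetical. The proxy stores the class and its arguments, builds the real object on first attribute access, and then takes over that object's state and class.

```python
class LazyProxy:
    """Build the real object on first attribute access, then become it."""

    def __init__(self, factory, *args, **kwargs):
        self._factory = factory
        self._args = args
        self._kwargs = kwargs

    def _load(self):
        real = self._factory(*self._args, **self._kwargs)
        # Take over the real object's state and class, so the proxy
        # behaves exactly like it from now on.
        self.__dict__ = real.__dict__
        self.__class__ = real.__class__

    def __getattr__(self, name):
        # Only reached when normal lookup fails, i.e. before loading.
        if name.startswith('__'):
            raise AttributeError(name)
        self._load()
        return getattr(self, name)


class WordList:
    """Stand-in for a corpus reader that is expensive to create."""

    def __init__(self, words):
        self._words = list(words)

    def words(self):
        return self._words


reader = LazyProxy(WordList, ['nltk', 'corpus'])
print(isinstance(reader, LazyProxy))   # nothing has been loaded yet
print(reader.words())                  # first access triggers loading
print(isinstance(reader, WordList))
```

Swapping the class after loading is what makes an `isinstance` check give a different answer before and after the first method call.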

#### An example

In [38]:
from nltk.corpus.util import LazyCorpusLoader
from nltk.corpus.reader import WordListCorpusReader
reader = LazyCorpusLoader('cookbook', WordListCorpusReader, ['wordlist'])

In [39]:
isinstance(reader, LazyCorpusLoader)

True

In [40]:
reader.fileids()

['wordlist']

In [41]:
isinstance(reader, LazyCorpusLoader)

False

In [42]:
isinstance(reader, WordListCorpusReader)

True

#### Other examples

In [47]:
from nltk.corpus.reader import BracketParseCorpusReader

treebank = LazyCorpusLoader('treebank/combined', BracketParseCorpusReader, r'wsj_.*\.mrg', tagset = 'wsj', encoding= 'ascii')

In [48]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus.reader import tagged_treebank_para_block_reader

treebank_chunk = LazyCorpusLoader('treebank/tagged',
                                  ChunkedCorpusReader,
                                  r'wsj_.*\.pos',
                                  sent_tokenizer = RegexpTokenizer(r'(?<=/\.)\s*(?![^\[]*\])', gaps=True),
                                  para_block_reader = tagged_treebank_para_block_reader,
                                  encoding = 'ascii')

In [49]:
from nltk.corpus.reader import PlaintextCorpusReader

treebank_raw = LazyCorpusLoader('treebank/raw',
                                PlaintextCorpusReader, r'wsj_.*',
                                encoding='ISO-8859-2')

## Creating a custom corpus view

### What are corpus views?
- A `corpus view` is a class wrapper around a corpus file that reads in blocks of tokens as needed.
- The purpose is to provide a view into a file without reading the whole file at once.

### Other view choices
#### Pickle corpus view
- The `PickleCorpusView` can be found in nltk.corpus.reader.util.
- It reads a corpus file that consists of a sequence of pickled Python objects (see page 79 of the book)

#### Concatenated corpus view
- The `ConcatenatedCorpusView` class can be found in nltk.corpus.reader.util.
- It's useful when we have multiple files that we want a corpus reader to treat as a single file.

### Block reader functions
- read_whitespace_block(): This will read 20 lines from the stream, splitting each line into tokens by whitespace.
- read_wordpunct_block(): This reads 20 lines from the stream, splitting each line using nltk.tokenize.wordpunct_tokenize()
- read_line_block(): This reads 20 lines from the stream and returns them as a list, with each line as a token.
- read_regexp_block(): This takes two additional arguments, which must be regular expressions that can be passed to re.match(): start_re and end_re.
 - The start_re variable matches the starting line of a block
 - The end_re variable matches the ending line of the block. 
  - default value is None
 - The return value is a single token of all lines in the block joined into a single string.
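
A block reader and the loop that drives it can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not NLTK's code: `whitespace_block` and `iterate_tokens` are hypothetical names, and an `io.StringIO` stands in for a corpus file. The reader consumes up to 20 lines per call, and the driver keeps asking for blocks until the stream position stops advancing.

```python
import io

def whitespace_block(stream, lines_per_block=20):
    """Read up to lines_per_block lines and split them into whitespace tokens."""
    tokens = []
    for _ in range(lines_per_block):
        tokens.extend(stream.readline().split())
    return tokens

def iterate_tokens(stream, block_reader):
    """Yield tokens block by block until the stream is exhausted."""
    while True:
        start = stream.tell()
        block = block_reader(stream)
        if stream.tell() == start:       # nothing was consumed: end of stream
            return
        yield from block

# 45 lines with two tokens each: blocks of 20, 20 and 5 lines.
text = ''.join('w{0} x{0}\n'.format(i) for i in range(45))
first_block = whitespace_block(io.StringIO(text))
all_tokens = list(iterate_tokens(io.StringIO(text), whitespace_block))
print(len(first_block), len(all_tokens))
```

Reading in fixed-size blocks is what lets a view serve the first tokens of a large file without loading the whole file.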

In [53]:
from nltk.corpus.reader import PlaintextCorpusReader
from nltk.corpus.reader.util import StreamBackedCorpusView

class IgnoreHeadingCorpusView(StreamBackedCorpusView):
    def __init__(self, *args, **kwargs):
        StreamBackedCorpusView.__init__(self, *args, **kwargs)
        # open self._stream
        self._open()
        #skip the heading block
        self.read_block(self._stream)
        # reset the start position to the current position in the stream
        self._filepos = [self._stream.tell()]

class IgnoreHeadingCorpusReader(PlaintextCorpusReader):
    CorpusView = IgnoreHeadingCorpusView
    
# example
plain = PlaintextCorpusReader('corpora', ['heading_text.txt'])
print('original len is', len(plain.paras()))
reader = IgnoreHeadingCorpusReader('corpora', ['heading_text.txt'])
print('len after using IgnoreHeadingCorpusReader is', len(reader.paras()))

original len is 4
len after using IgnoreHeadingCorpusReader is 3


## Creating a MongoDB-backed corpus reader

- Requires MongoDB and PyMongo

---
```
import pymongo
from nltk.data import LazyLoader
from nltk.tokenize import TreebankWordTokenizer
from nltk.util import AbstractLazySequence, LazyMap, LazyConcatenation

class MongoDBLazySequence(AbstractLazySequence):
    """Lazy sequence over one field of every document in a collection."""
    def __init__(self, host='localhost', port=27017, db='test',
                 collection='corpus', field='text'):
        self.conn = pymongo.MongoClient(host, port)
        self.collection = self.conn[db][collection]
        self.field = field

    def __len__(self):
        # Collection.count() was removed in PyMongo 4; count_documents() replaces it
        return self.collection.count_documents({})

    def iterate_from(self, start):
        f = lambda d: d.get(self.field, '')
        # PyMongo 3+ calls this argument `projection` (it was `fields` in PyMongo 2)
        return iter(LazyMap(f, self.collection.find(projection=[self.field], skip=start)))

class MongoDBCorpusReader(object):
    """Corpus reader whose documents live in a MongoDB collection."""
    def __init__(self,
                 word_tokenizer=TreebankWordTokenizer(),
                 sent_tokenizer=LazyLoader('tokenizers/punkt/PY3/english.pickle'),
                 **kwargs):
        # remaining keyword arguments (host, port, db, collection, field)
        # are passed through to MongoDBLazySequence
        self._seq = MongoDBLazySequence(**kwargs)
        self._word_tokenize = word_tokenizer.tokenize
        self._sent_tokenize = sent_tokenizer.tokenize

    def text(self):
        return self._seq

    def words(self):
        return LazyConcatenation(LazyMap(self._word_tokenize, self.text()))

    def sents(self):
        return LazyConcatenation(LazyMap(self._sent_tokenize, self.text()))
```
---

### How to use it?
```
reader = MongoDBCorpusReader(db = 'website',
                             collection = 'comments',
                             field = 'comment')
```

## Corpus editing with file locking

- Corpus readers and views are all read-only, but there will be times when we want to add to or edit the corpus files.
- To lock a file, we have to install the `lockfile` library
- See pages 82 to 84 of the book

---
```
import lockfile, tempfile, shutil

def append_line(fname, line):
    # hold the lock so concurrent writers cannot interleave their output
    with lockfile.FileLock(fname):
        with open(fname, 'a+') as fp:
            fp.write(line)
            fp.write('\n')

def remove_line(fname, line):
    with lockfile.FileLock(fname):
        # text-mode temp file, since the lines read from fp are str
        tmp = tempfile.TemporaryFile('w+')
        # 'r+' opens for reading and writing ('rw+' is not a valid mode in Python 3)
        fp = open(fname, 'r+')
        # write all lines from orig file, except if matches given line
        for l in fp:
            if l.strip() != line:
                tmp.write(l)
        # reset file pointers so entire files are copied
        fp.seek(0)
        tmp.seek(0)
        # copy tmp into fp, then truncate to remove trailing line(s)
        shutil.copyfileobj(tmp, fp)
        fp.truncate()
        fp.close()
        tmp.close()
```
---

### How to use it?
```
append_line('test.txt', 'foo')
remove_line('test.txt', 'foo')
```