The Python script seg_by_vowel.py segments the Brown Corpus into chunks based on a few different delimiters:
- space
- a, e, i, o and u
The space delimiter chunks the Brown Corpus into pieces equivalent to orthographic words. The vowel delimiters chunk the corpus into non-word sequences that can cross word boundaries, so a chunk may even include whitespace characters.
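
The difference between the two kinds of delimiter can be illustrated with a minimal sketch. This is not the code from seg_by_vowel.py: the `segment` helper and the sample sentence are hypothetical, the sketch runs on a short string instead of the Brown Corpus, and it assumes that each vowel is used on its own as a delimiter, that matching is case-sensitive, and that empty chunks are dropped.

```python
def segment(text, delimiter):
    """Split text on one delimiter character and drop empty chunks.

    Hypothetical helper for illustration; the real script may differ.
    """
    return [chunk for chunk in text.split(delimiter) if chunk]


# Hypothetical sample text standing in for the Brown Corpus.
sample = "the jury said it did"

# Space delimiter: chunks line up with orthographic words.
space_chunks = segment(sample, " ")
print(space_chunks)

# Vowel delimiter: chunks ignore word boundaries, so spaces stay inside them.
i_chunks = segment(sample, "i")
print(i_chunks)
```

With the vowel delimiter, chunks such as `"the jury sa"` and `"t d"` contain spaces because the split points fall inside words and not between them.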