Python package for converting xml and epubs to text files

epub conversion

Create text corpora from epubs and wiki dumps. This is a Python package with a Converter that turns epub and XML (wiki dump) files into text, lines, or Python generators.

Usage:

Epub usage

Book by book

To convert epubs to text files, usage is straightforward. First create a converter object:

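# Assumes Converter is exported at the top level of the package.
from epub_conversion import Converter

converter = Converter("my_ebooks_folder/")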

Then use this converter to concatenate all the text within the ebooks into a single large text file:

converter.convert("my_succinct_text_file.gz")
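
Once the combined file exists, it can be read back as a stream. The following is a minimal sketch, assuming the output is gzip-compressed UTF-8 text (inferred only from the `.gz` extension in the example above). The helper name `iter_text_lines` is hypothetical and not part of epub_conversion.

```python
import gzip


def iter_text_lines(path, encoding="utf-8"):
    """Yield lines from a gzip-compressed text file, one at a time.

    Hypothetical helper: assumes the file is gzip-compressed text.
    """
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            # Strip only the trailing newline so other whitespace is preserved.
            yield line.rstrip("\n")
```

For example, `for line in iter_text_lines("my_succinct_text_file.gz"): ...` would walk the corpus without loading it all into memory.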

Line by line

You can also proceed line by line:

from epub_conversion.utils import open_book, convert_epub_to_lines

book = open_book("twilight.epub")

lines = convert_epub_to_lines(book)
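
The extracted lines usually need some cleanup before they go into a corpus. Here is a rough sketch of one way to do that, assuming `convert_epub_to_lines` returns an iterable of strings (the README does not state the exact return type). `clean_lines` is a hypothetical helper, and `sample_lines` stands in for real output.

```python
def clean_lines(lines):
    """Collapse runs of whitespace and drop lines that end up empty."""
    for line in lines:
        text = " ".join(line.split())
        if text:
            yield text


# Stand-in for the result of convert_epub_to_lines(book).
sample_lines = ["  Chapter 1  ", "", "It was a   dark night.\n"]
cleaned = list(clean_lines(sample_lines))
```

Because `clean_lines` is a generator, it can be chained directly onto the converter output without holding the whole book in memory.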

Wikidump usage

Redirections

Suppose you are interested in all the redirections in a Wikipedia dump file that is still compressed. You can open the dump as follows:

import epub_conversion.wiki_decoder

wiki = epub_conversion.wiki_decoder.almost_smart_open("enwiki.bz2")

With this dump as input, a generator yields every pair of page title and redirection title, which can be collected into a dictionary:

redirections = {
    redirect_from: redirect_to
    for redirect_from, redirect_to in epub_conversion.wiki_decoder.get_redirection_list(wiki)
}
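
A redirect can itself point to another redirect, so a lookup in this dictionary may need several hops. The sketch below shows one possible way to follow such chains, assuming the dictionary maps each redirecting title to its target as built above. `resolve_redirect` and its `max_hops` parameter are hypothetical and not part of epub_conversion.

```python
def resolve_redirect(redirections, title, max_hops=10):
    """Follow redirects from `title` until a non-redirecting title is reached.

    Stops early on a cycle or after `max_hops` hops.
    """
    seen = {title}
    for _ in range(max_hops):
        target = redirections.get(title)
        if target is None:
            # `title` is not a redirect, so it is the final page.
            return title
        if target in seen:
            # Cycle detected: stop at the first repeated title.
            return target
        seen.add(target)
        title = target
    return title


# Illustrative data only, not taken from a real dump.
sample = {"USA": "United States", "United States": "United States of America"}
final_title = resolve_redirect(sample, "USA")
```

The hop limit and the `seen` set keep the lookup from looping forever if the dump contains circular redirects.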

Page text

If you are only interested in the lines within each page's text section, iterate over them like this:

for line in epub_conversion.wiki_decoder.convert_wiki_to_lines(wiki):
    process_line(line)
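
The loop above leaves `process_line` up to the caller. As one illustration, here is a hypothetical `process_line` that keeps a running word-frequency count. It assumes each line is a plain string, and the sample lines stand in for real dump output.

```python
from collections import Counter

# Running tally of lowercase, whitespace-separated tokens.
word_counts = Counter()


def process_line(line):
    """Add the tokens of one line to the running word count."""
    word_counts.update(line.lower().split())


# Stand-in for lines produced by convert_wiki_to_lines(wiki).
for line in ["The cat sat", "the dog sat"]:
    process_line(line)
```

Any callable that accepts a single string would fit in the same place, for example one that writes each line to an output file.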

See Also:

  • Wikipedia NER: a Python module that uses epub_conversion to process Wikipedia dumps and output only the lines that contain page-to-page links, with the link anchor texts extracted and all markup removed.