Skip to content

Python parsing library for for Wordpress XML export files

Notifications You must be signed in to change notification settings

RealGeeks/wp_export_parser

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

42 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

wp_export_parser

Build Status

Parsing XML sucks. This library provides a cleaner interface to get at the data in a Wordpress export XML file.

I'm using the built-in etree.ElementTree parser to parse the Wordpress XML file.

If you have a Wordpress export that breaks the parser I feel your pain. Try looking at the line that Expat is barfing on and manually fixing it.

Example Usage

from wp_export_parser import WPParser

with open('wp-export.xml') as export_file:
    parser = WPParser(export_file)
    print parser.get_domain() # outputs www.example.com
    for p in parser.get_items():
        categories = p['categories']
        comments = p['comments']
        post_title = p['title']
        post_type = p['post_type']
        post_body = p['body']
        print "post type: {}\nPost title: {}\nPost : {}\n".format(post_type,
                                                                  post_title,
                                                                  post_body)

Features

wp_export_parser can extract the following features from a Wordpress export file:

  • Posts
  • Pages
  • Comments (exposed as a generator returning dicts)
  • Categories (exposed as list of strings)
  • Postmeta (exposed as dict)

Shortcodes

Wordpress export files often include shortcodes, which the Wordpress rendering engine replaces with HTML. Since you probably aren't going to want to reimplement Wordpress's shortcodes in your own blogging engine, I have ripped out the shortcode parsing regular expressions and provided implementations of the most commonly-used shortcodes inside wp_export_parser.

  • [youtube]: wp_export_parser retrieves the correct embed code (using oEmbed) and replaces the shortcode transparently.
  • [caption]: wp_export_parser attempts to generate the same HTML Wordpress will generate (and assumes UTF-8 encoding)

Feel free to fork and contribute more shortcode support with a pull request

Wordpress oddities

  • wp_export_parser attempts to emulate the same behavior Wordpress uses to add <p> and <br> tags. I did this by attempting a 1-to-1 translation of the giant regular expression Wordpress uses to render posts.

Notes

  • wp_eport_parser will parse files iteratively so it should be able to handle really large exports. get_pages() returns a generator.
  • wp_export_parser sometimes will return unicode strings for the blog contents.
  • Tested with CPython 2.7 and 3.5

To run the tests in docker

# Spin up docker container
docker build -t wp_export . && docker run -ti -v `pwd`:/opt/wp_export_parser wp_export bash
# From within the running container, run the tests
tox

Changelog

  • Added Dockerfile for Test environment
  • Conditionally importing to support python 2.7 and 3.5

License

Copyright (c) 2012-2022 Kevin McCarthy. Released under the terms of the MIT license.

About

Python parsing library for for Wordpress XML export files

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •