+"""Beautiful Soup
+"The Screen-Scraper's Friend"
+http://www.crummy.com/software/BeautifulSoup/
+
+Beautiful Soup parses a (possibly invalid) XML or HTML document into a
+tree representation. It provides methods and Pythonic idioms that make
+it easy to navigate, search, and modify the tree.
+
+A well-formed XML/HTML document yields a well-formed data
+structure. An ill-formed XML/HTML document yields a correspondingly
+ill-formed data structure. If your document is only locally
+well-formed, you can use this library to find and process the
+well-formed part of it.
+
+Beautiful Soup works with Python 2.2 and up. It has no external
+dependencies, but you'll have more success at converting data to UTF-8
+if you also install these three packages:
+
+* chardet, for auto-detecting character encodings
+  http://chardet.feedparser.org/
+* cjkcodecs and iconv_codec, which add more encodings to the ones supported
+  by stock Python.
+  http://cjkpython.i18n.org/
+
+Beautiful Soup defines classes for two main parsing strategies:
+
+ * BeautifulStoneSoup, for parsing XML, SGML, or your domain-specific
+   language that kind of looks like XML.
+
+ * BeautifulSoup, for parsing run-of-the-mill HTML code, be it valid
+   or invalid. This class has web browser-like heuristics for
+   obtaining a sensible parse tree in the face of common HTML errors.
+
+Beautiful Soup also defines a class (UnicodeDammit) for autodetecting
+the encoding of an HTML or XML document, and converting it to
+Unicode. Much of this code is taken from Mark Pilgrim's Universal Feed Parser.
+
+For more than you ever wanted to know about Beautiful Soup, see the
+documentation:
+http://www.crummy.com/software/BeautifulSoup/documentation.html
+
+Here, have some legalese:
+
+Copyright (c) 2004-2007, Leonard Richardson
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are
+met:
+
+  * Redistributions of source code must retain the above copyright
+    notice, this list of conditions and the following disclaimer.
+
+  * Redistributions in binary form must reproduce the above
+    copyright notice, this list of conditions and the following
+    disclaimer in the documentation and/or other materials provided
+    with the distribution.
+
+  * Neither the name of the Beautiful Soup Consortium and All
+    Night Kosher Bakery nor the names of its contributors may be
+    used to endorse or promote products derived from this software
+    without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE, DAMMIT.
+
+"""
+from __future__ import generators
+
+__author__ = "Leonard Richardson (leonardr@segfault.org)"
+__copyright__ = "Copyright (c) 2004-2007 Leonard Richardson"
+__license__ = "New-style BSD"
+
+from sgmllib import SGMLParser, SGMLParseError
+import re
+import sgmllib
+try:
+    from htmlentitydefs import name2codepoint
+except ImportError:
+    # No entity table available; fall back to an empty mapping.
+    name2codepoint = {}
+
+# This hack makes Beautiful Soup able to parse XML with namespaces
+sgmllib.tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*')
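Review note: the effect of allowing `:` in the tag-name pattern can be checked in isolation with the standard `re` module. The sketch below compares the pattern assigned above with a colon-free variant. The colon-free pattern and the helper `match_tag_name` are assumptions made for illustration, not part of Beautiful Soup.

```python
import re

# Pattern from the hack above: ':' is allowed inside a tag name.
NAMESPACE_AWARE = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*')
# Assumed colon-free variant, used here only for comparison.
COLON_FREE = re.compile('[a-zA-Z][-_.a-zA-Z0-9]*')


def match_tag_name(pattern, text):
    """Return the tag name matched at the start of text, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None


name_with_ns = match_tag_name(NAMESPACE_AWARE, 'xsl:template match="/"')
name_without_ns = match_tag_name(COLON_FREE, 'xsl:template match="/"')
print(name_with_ns)     # xsl:template
print(name_without_ns)  # xsl
```

With the colon-free pattern the name stops at the namespace prefix, so a namespaced element would be read as a tag called `xsl`.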
+
+DEFAULT_OUTPUT_ENCODING = "utf-8"
+
+# First, the classes that represent markup elements.
+
+class PageElement:
+    """Contains the navigational information for some part of the page
+    (either a tag or a piece of text)"""
+
+    def setup(self, parent=None, previous=None):
+        """Sets up the initial relations between this element and
+        other elements."""
+        self.parent = parent
+        self.previous = previous
+        self.next = None
+        self.previousSibling = None
+        self.nextSibling = None
+        if self.parent and self.parent.contents:
+            self.previousSibling = self.parent.contents[-1]
+            self.previousSibling.nextSibling = self
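Review note: the sibling wiring in `setup` can be tried out on its own. The sketch below uses a hypothetical `Node` class (not Beautiful Soup's) and assumes, as the code above appears to, that an element is linked before it is appended to its parent's `contents`, so the parent's current last child becomes the previous sibling.

```python
class Node:
    """Hypothetical stand-in for a page element; not Beautiful Soup's class."""

    def __init__(self, name):
        self.name = name
        self.contents = []  # children, in document order
        self.parent = None
        self.previousSibling = None
        self.nextSibling = None

    def append(self, child):
        """Link child to the current last child, then store it."""
        child.parent = self
        child.previousSibling = None
        child.nextSibling = None
        if self.contents:
            child.previousSibling = self.contents[-1]
            child.previousSibling.nextSibling = child
        self.contents.append(child)


root = Node("root")
first, second, third = Node("a"), Node("b"), Node("c")
for node in (first, second, third):
    root.append(node)

# Walk the sibling chain forward from the first child.
names = []
current = root.contents[0]
while current is not None:
    names.append(current.name)
    current = current.nextSibling
print(names)  # ['a', 'b', 'c']
```

Because each new child only touches the previous last child, linking is constant-time per element, and the chain can be walked in either direction without consulting the parent's list.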