HyperionGray/pagelib

Pagelib

Pagelib is currently under development and is not ready for production or development environments.

Introduction

Pagelib turns nasty HTML strings into friendly HTML objects.

An HtmlPage object is constructed from an HTML string:

>>> from pagelib import HtmlPage
>>> page = HtmlPage('<html><head><title>Hello</title><meta name="description" content="Some page you\'ve downloaded from the web and now have to parse."></meta></head><body><p>Hello, world!</p></body></html>')
>>> page
HtmlPage(title=Hello, bytes=121)

Components of the page can be accessed through its properties:

>>> page.title
'Hello'
>>> page.description
"Some page you've downloaded from the web and now have to parse."
>>> page.language_code
'en'
>>> page.language
'English'
>>> page.text
'Hello, world!'
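
To make the idea of wrapping an HTML string in an object concrete, here is a minimal sketch built only on the standard library's html.parser. It is not Pagelib's implementation: the PageSummary class and its internals are hypothetical names chosen for illustration, and it covers only title, description, and text (no language detection).

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Hypothetical stand-in for illustration; NOT Pagelib's HtmlPage."""

    def __init__(self, html):
        super().__init__()
        self.title = ""
        self.description = None
        self._text_parts = []
        self._stack = []  # names of the tags that are currently open
        self.feed(html)
        self.close()

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name") == "description":
                self.description = attrs.get("content")
            return  # meta is a void element, so it is never pushed
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop up to and including the most recent matching open tag.
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        if "title" in self._stack:
            self.title += data
        elif "body" in self._stack and data.strip():
            self._text_parts.append(data.strip())

    @property
    def text(self):
        # Visible body text, joined with single spaces.
        return " ".join(self._text_parts)


html = (
    '<html><head><title>Hello</title>'
    '<meta name="description" content="Some page you\'ve downloaded '
    'from the web and now have to parse."></meta></head>'
    '<body><p>Hello, world!</p></body></html>'
)
page = PageSummary(html)
print(page.title)        # Hello
print(page.description)  # Some page you've downloaded from the web and now have to parse.
print(page.text)         # Hello, world!
```

The sketch parses once in the constructor and exposes the results as plain attributes, which mirrors the property-style access shown above. A real page would also need handling for script and style content, which this sketch does not attempt.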

Pagelib exposes a parsel selector that can be used to extract further elements from the page using XPath or CSS expressions:

>>> page.selector.xpath('//p/text()').extract()
['Hello, world!']

Installation

Installing from PyPI

$ pip install pagelib

Dependencies

Pagelib depends on the ICU development libraries (libicu-dev). On Debian or Ubuntu, install them with:

$ sudo apt install libicu-dev
