pywebber

Python Web Development Tools

Utilities

  1. Link and words harvester Ripper

     Parsers: html.parser, html5lib, lxml, lxml-xml

     $ pip install html5lib # needed for the html5lib parser

  2. Text generator LoremPysum

Installation

pip install pywebber --upgrade
pip install https://github.com/Parousiaic/pywebber/archive/master.zip

Usage

Ripper - harvest words and links on a static web page.

$ from pywebber import Ripper

Accessing words and links is easy

$ page = Ripper('http://python.org')
$ soup = page.soup # the raw BeautifulSoup4 object
$ uncleaned_links = page.raw_links # all raw <a> tags on page as bs4 objects
$ cleaned_links = page.links() # generator of all links in the form `http://www.domain.location`
$ words = page.words() # a generator of words between <p> tags
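
To make the idea of harvesting concrete, here is a minimal sketch that uses only the standard library. It is not how Ripper is implemented (Ripper builds on BeautifulSoup4); the class name `LinkAndWordCollector`, the sample HTML, and the `example.com` URLs are hypothetical and exist only for illustration. It collects every `<a href>` target and every word found inside `<p>` tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndWordCollector(HTMLParser):
    """Hypothetical helper: collects <a href> targets and words inside <p> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.words = []
        self._p_depth = 0  # greater than 0 while inside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "p":
            self._p_depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._p_depth > 0:
            self._p_depth -= 1

    def handle_data(self, data):
        # Only text that sits inside a <p> element counts as words.
        if self._p_depth > 0:
            self.words.extend(data.split())


# Made-up static page used instead of a live request.
SAMPLE_HTML = """
<html><body>
<p>Welcome to <a href="/about">the about page</a> today.</p>
<a href="http://example.com/docs">Docs</a>
</body></html>
"""

collector = LinkAndWordCollector("http://example.com")
collector.feed(SAMPLE_HTML)
links = collector.links
words = collector.words
```

Note that the link text "Docs" is not collected as a word because it sits outside any `<p>` tag, while the link itself is still harvested.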

The following object creation options are available

  1. url : Defaults to url="http://python.org"
  2. parser : Defaults to parser="html.parser". To see a complete list of parsers, use object_instance.parsers
  3. refresh : Defaults to refresh=False. The first time Ripper hits a page, it saves a prettified BeautifulSoup4 object of the scraped page in a text file, and subsequent instantiations of the class read from that file. If set to True, Ripper hits the site to get its data every time it is called to construct the page object.
  4. save_path : Defaults to save_path=None. In this case, Ripper creates a folder on your user desktop, named in the format domainName_extension. Every page scraped from that site is saved inside this folder. It is also possible to set save_path=/some/other/path. The saved file name has the format page_url.txt
  5. split_string : Defaults to the characters in string.punctuation plus ["n", " ", "://"]. You can supply a list to add to this set.
  6. stop_words : Defaults to ['', '#', '\n', 'the', 'to', "but", "and"]. These are words that are excluded when object_instance.words() is called. You can supply a list to add to this set.
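
The interplay between split_string and stop_words can be sketched as follows. This is a standard-library illustration under stated assumptions, not Ripper's actual code: the function name `split_words` is hypothetical, the separator list assumes a newline where the option description shows "n", and stop words are matched case-insensitively here, which the library may or may not do.

```python
import re
import string

# Assumed defaults, loosely mirroring the option descriptions above.
SPLIT_STRINGS = list(string.punctuation) + ["\n", " ", "://"]
STOP_WORDS = {"", "#", "\n", "the", "to", "but", "and"}


def split_words(text, split_strings=SPLIT_STRINGS, stop_words=STOP_WORDS):
    """Split text on every separator, then drop stop words."""
    # Longest separators first so "://" is tried before ":" or "/".
    ordered = sorted(split_strings, key=len, reverse=True)
    pattern = "|".join(re.escape(sep) for sep in ordered)
    pieces = re.split(pattern, text)
    # Empty strings produced by adjacent separators are removed because
    # "" is itself listed as a stop word.
    return [w for w in pieces if w.lower() not in stop_words]


result = split_words("Go to http://python.org, but read the docs!")
```

Listing the empty string among the stop words is what keeps adjacent separators, such as a comma followed by a space, from leaking empty tokens into the output.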

LoremPysum - Generate random texts

$ from pywebber import LoremPysum

Create a single LoremPysum instance with default Lorem Ipsum text

$ p = LoremPysum(*args, domains=None, lorem=True)

You can also include your own words alongside the standard lorem ipsum text. If you want only your own words, pass lorem=False like this:

$ p = LoremPysum(*args, domains=None, lorem=False)

*args is an optional list of files from which to get the words to be used. Pass any number of text files as shown below

$ p = LoremPysum("file1_path.txt", "file2_path.txt", domains=None, lorem=True)

The following methods are defined

$ p.email() # return a single email address. You can pass in a file with a list of domains. Defaults are `[".com", ".info", ".net", ".org"]`
$ p.name() # return a name in the form "firstname I. lastname".
$ p.sentence() # generate a single sentence.
$ p.paragraphs() # return a single paragraph of standard Lorem Ipsum text.
$ p.paragraphs(count=3) # return 3 paragraphs where the first paragraph is the standard text.
$ p.paragraphs(common=False) # return a single paragraph of random text instead of the standard text.
$ p.title() # generate a title-case string with 2 to n words. Default is 5. Good for article titles.
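
For readers curious how generators of this kind work, below is a minimal sketch built on the standard `random` module. It is not LoremPysum's implementation: the class `TinyLorem`, its `seed` parameter, the short word list, and the sentence length bounds are all hypothetical choices made for illustration. Only the default domain endings are taken from the description above.

```python
import random

# Hypothetical word list; the real library reads words from lorem ipsum text or your files.
WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur", "adipiscing", "elit"]
DOMAINS = [".com", ".info", ".net", ".org"]


class TinyLorem:
    """Hypothetical stand-in for illustration, not pywebber's LoremPysum."""

    def __init__(self, words=WORDS, domains=DOMAINS, seed=None):
        self.words = list(words)
        self.domains = list(domains)
        self.rng = random.Random(seed)  # private generator, optionally seeded

    def sentence(self, min_words=4, max_words=10):
        # Random words, first letter capitalised, closed with a full stop.
        count = self.rng.randint(min_words, max_words)
        chosen = [self.rng.choice(self.words) for _ in range(count)]
        return " ".join(chosen).capitalize() + "."

    def title(self, n=5):
        # Title-case string with 2 to n words.
        count = self.rng.randint(2, max(2, n))
        return " ".join(self.rng.choice(self.words) for _ in range(count)).title()

    def name(self):
        # Name in the form "Firstname I. Lastname".
        first, last = self.rng.choice(self.words), self.rng.choice(self.words)
        initial = self.rng.choice("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
        return f"{first.capitalize()} {initial}. {last.capitalize()}"

    def email(self):
        # user@host plus one of the domain endings.
        user, host = self.rng.choice(self.words), self.rng.choice(self.words)
        return f"{user}@{host}{self.rng.choice(self.domains)}"


gen = TinyLorem(seed=42)
sentence = gen.sentence()
title = gen.title()
name = gen.name()
email = gen.email()
```

Keeping a private `random.Random` instance lets a test fix the seed for repeatable output without touching the global random state.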

In case you want to look into the words used, the following instance attributes are defined:

$ p.common # A list of the first few words in the lorem ipsum text
$ p.words # A list of all the words in the lorem ipsum text.
$ p.standard # Standard lorem ipsum text. Usually the first third of a sample file.
$ p.domains # list of domain name endings

Credits

  1. Luca De Vitis for the inspiration and starter code for LoremPysum
  2. 'BeautifulSoup documentation'