bmuller/robostrippy

RoboStrippy

RoboStrippy lets you strip websites. Like a robot.

BeautifulSoup and other Python libraries make parsing easy, but they don't make it easy to encapsulate that logic so that you can treat web resources like objects. RoboStrippy does. The current version requires Python 3.2 or later.

Installation

pip install robostrippy

Usage

Let’s say you want to pull down information about a business from the Yellow Pages website.

Single Item

For the first example, let’s pretend you want a single business’s information and you already have the URL.

First, create a class and then describe one of the items on the page that you want to pull out (in this case, the business details):

#!/usr/bin/env python
from robostrippy.resource import attr, attrList, Resource


class YellowPage(Resource):
    phone = attr("p.phone strong")
    street = attr("span.street-address")
    city = attr("span.locality")
    state = attr("span.region")
    zip = attr("span.postal-code")

    @property
    def address(self):
        return " ".join([self.street, self.city, self.state, self.zip])

In the class definition, I’ve described the attributes I want and the query needed to extract each one. For each attribute, you give a CSS selector (or an array of selectors if the lookup takes multiple steps).

It is then trivial to pull information from the page:

url = "http://www.yellowpages.com/silver-spring-md/mip/quarry-house-tavern-3342829"
yp = YellowPage(url)
# print() is a function in Python 3, which this library targets
print(yp.phone)
print(yp.address)
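
To give a feel for how declarative attributes of this kind can work, here is a minimal sketch built on Python descriptors and the standard library's html.parser. It is not RoboStrippy's implementation: sketch_attr, SketchResource, and the sample HTML are all hypothetical, the resource takes an HTML string instead of fetching a URL, and only selectors of the form tag.class are understood.

```python
from html.parser import HTMLParser


class _TextFinder(HTMLParser):
    """Collects the text of the first element matching a `tag.class` selector."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # > 0 while inside the matching element
        self.done = False   # True once the first match has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Track nested elements with the same tag name so the
            # matching close tag is identified correctly.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


class sketch_attr:
    """Hypothetical stand-in for attr: a descriptor resolved on access."""

    def __init__(self, selector):
        # Assumption: the selector is always of the form "tag.class".
        self.tag, self.css_class = selector.split(".", 1)

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        finder = _TextFinder(self.tag, self.css_class)
        finder.feed(instance.html)
        finder.close()
        return "".join(finder.chunks).strip()


class SketchResource:
    """Hypothetical stand-in for Resource; holds HTML instead of a URL."""

    def __init__(self, html):
        self.html = html


class Business(SketchResource):
    street = sketch_attr("span.street-address")
    city = sketch_attr("span.locality")


# Invented sample markup, not taken from any real page.
SAMPLE_HTML = """
<div>
  <span class="street-address">1 Example St</span>
  <span class="locality">Exampleville</span>
</div>
"""

business = Business(SAMPLE_HTML)
print(business.street)
print(business.city)
```

Because the lookup happens in `__get__`, nothing is parsed until an attribute is read, which is one plausible reason to model page fields as class-level declarations.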

Pages with Lists

For the second example, let’s say you want to be able to search the Yellow Pages website and then get the details for each matching business.

First, let’s define our listing class.

class YellowPagesListItem(Resource):
    name = attr("h3.business-name a")
    url = attr("h3.business-name a", attribute = 'href')

    @property
    def details(self):
        return YellowPage(self.url)


class YellowPagesList(Resource):
    businesses = attrList("div.result", YellowPagesListItem)

    def __init__(self, city, name):
        city = city.replace(',', '').replace(' ', '-')
        name = name.replace(' ', '-')
        Resource.__init__(self, "http://www.yellowpages.com/%s/%s" % (city, name))

This allows us to create an object that represents the list of businesses based on city and business name, and then get the details per business.

ypl = YellowPagesList("Washington, DC", "Quarry House Tavern")
business = ypl.businesses[0]

# print name from list
print(business.name)

# now fetch details page
details = business.details
print("lives at %s" % details.address)
print("with phone # %s" % details.phone)
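
To make the request URL concrete, the string handling in YellowPagesList.__init__ can be pulled into a standalone function and run without any network access. The function name yellowpages_search_url is hypothetical; the transformations are copied from the constructor above.

```python
def yellowpages_search_url(city, name):
    """Build the search URL the same way YellowPagesList.__init__ does."""
    # Drop commas, then turn spaces into hyphens: "Washington, DC" -> "Washington-DC"
    city_slug = city.replace(",", "").replace(" ", "-")
    name_slug = name.replace(" ", "-")
    return "http://www.yellowpages.com/%s/%s" % (city_slug, name_slug)


print(yellowpages_search_url("Washington, DC", "Quarry House Tavern"))
# → http://www.yellowpages.com/Washington-DC/Quarry-House-Tavern
```

Note that characters other than commas and spaces pass through unchanged, so names containing characters such as "&" may need extra handling, for example with urllib.parse.quote.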

Missing Elements

What if the element you are looking for is missing? You can specify an alternative element to use thanks to the attrCoalesce class:

title = attrCoalesce(('meta[property="og:title"]', {'attribute': 'content'}),
                     ('meta[name="twitter:title"]', {'attribute': 'content'}),
                     'title')
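
The fallback idea can be illustrated in plain Python: try each lookup in order and keep the first one that yields a value. This is only a sketch of the general pattern, not attrCoalesce itself. The helper first_present and the sample data are hypothetical, and treating an empty string as missing is an assumption, since the behavior of attrCoalesce in that case is not described here.

```python
def first_present(lookups):
    """Return the first lookup result that is neither None nor empty."""
    for lookup in lookups:
        value = lookup()
        if value:
            return value
    return None


# Invented sample: metadata from a page that has no og:title and an
# empty twitter:title, so the plain <title> text is used.
page = {"og:title": None, "twitter:title": "", "title": "Quarry House Tavern"}

title = first_present([
    lambda: page.get("og:title"),
    lambda: page.get("twitter:title"),
    lambda: page.get("title"),
])
print(title)
```

Passing callables rather than values means later lookups are never evaluated once an earlier one succeeds.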

See the examples folder for other examples.
