Skip to content

searls/TagScraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

25 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TagScraper

TagScraper is a very lightweight means to facilitate Screen Scraping for iPhone applications. At the most basic level, this means performing an XPath query against an XML document, usually loaded from the Internet.

Adding TagScraper to your project

  1. Clone the TagScraper repository to a permanent location on your hard drive:
    $ git clone git://github.com/searls/TagScraper.git
  1. Find TagScraper.xcodeproj in Finder, then drag-and-drop it into your project's "Groups & Files" pane in Xcode underneath your Project.
    • Uncheck "Copy items"
    • Set "Reference Type" to "Relative to Project"
    • Check "Recursively create groups for any added folders"
  2. Click "TagScraper.xcodeproj" in your "Groups & Files" pane, and in the upper right, you should see a file named "libTagScraper.a". Check the checkbox to the far right with a bullseye above it.
  3. Set up the project's dependencies.
    1. From the menu bar, select "Project" -> "Edit Project Settings" and click the "General" tab.
    2. Click the + icon under "Direct Dependencies" and add TagScraper.
    3. Click the + icon under "Linked Libraries" and add libxml2.dylib.
  4. Set up the project settings
    1. From the menu bar, select "Project" -> "Edit Project Settings" and click the "Build" tab.
    2. Set Configuration to All Configurations
    3. Under "Other Linker Flags" add both: -all_load and -ObjC
    4. Under "Header Search Paths", add the relative path from your XCode project to the TagScraper source files. If your project and TagScraper shared the same root folder, this would be "../TagScraper/src"

Usage

Simply import the global header to access any of TagScraper's classes. For now, those are merely Tag and XPathQuery

`#import "TagScraper.h"`

Then to try it in your code, here's an example converting an NSString to an NSData, performing the XPath query //p, and produce a Tag object from it. It should log "Content" to console.

`NSString *html = @"

Content

";` `NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];` `Tag *tag = [XPathQuery firstResultForXPathQuery:@"//p" onDocument:data];` `NSLog([tag retrieveText]);`

I encourage you to explore the (brief) source code of the static library to see what your options are. The best usage examples will be in the Tag Scraper's "Tests" group, which can be executed by opening the TagScraper.xcodeproj and switching the Active Target to TagScraperTests.

What TagScraper Does

  • Performs XPath queries on NSData objects representing XML/HTML text
  • Returns XPath results as a custom Tag object or as NSDictionary objects
  • Is well suited to messily scrape data from web sites, typically for the purpose of redisplaying that data in a native iPhone application.
  • Converts Tag object hierarchy back to HTML for debugging purposes (i.e. NSLog([tag toHTML]))

What TagScraper Doesn't Do

  • Make any effort to understand the difference between XML, XHTML, HTML, or specific doctypes within.
  • Handle HTML entities or XML special characters (yet)

Limitations

  • Currently, will load and then release an XML document between every XPath query, making it terribly inefficient to perform multiple queries against one document.
  • XPath queries are currently only case sensitive. Adding some flags to control this and other contextual items would be nice.
  • Very few HTML entities are properly escaped.
  • In general, the Tag object model will strip whitespace aggressively without being clear in the API. This is because it was written to scrape pages that used abhorrent amounts of unnecessary whitespace within HTML content tags.

To-dos

  • Come up with a beginQueries/endQueries API in order to allow multiple queries to be performed on the same XML Document to save cycles.
  • As stated before, any weird entities can totally make this thing explode. If you'd like to help, checkout NSStringAdditions and flesh out the entities that are replaced. One just needs to take the effort to look up what's necessary and in what cases.

About

A straightforward way to perform XPath queries on HTML in a simple object model using the iPhone SDK

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published