TagScraper is a lightweight static library for screen scraping in iPhone applications. At the most basic level, this means performing an XPath query against an XML or HTML document, usually loaded from the Internet.
- Clone the TagScraper repository to a permanent location on your hard drive:
`$ git clone git://github.com/searls/TagScraper.git`
- Find `TagScraper.xcodeproj` in Finder, then drag-and-drop it into your project's "Groups & Files" pane in Xcode, underneath your project.
  - Uncheck "Copy items"
  - Set "Reference Type" to "Relative to Project"
  - Check "Recursively create groups for any added folders"
- Click "TagScraper.xcodeproj" in your "Groups & Files" pane, and in the upper right, you should see a file named "libTagScraper.a". Check the checkbox to the far right with a bullseye above it.
- Set up the project's dependencies.
  - From the menu bar, select "Project" -> "Edit Project Settings" and click the "General" tab.
  - Click the `+` icon under "Direct Dependencies" and add `TagScraper`.
  - Click the `+` icon under "Linked Libraries" and add `libxml2.dylib`.
- Set up the project settings.
  - From the menu bar, select "Project" -> "Edit Project Settings" and click the "Build" tab.
  - Set Configuration to `All Configurations`.
  - Under "Other Linker Flags", add both `-all_load` and `-ObjC`.
  - Under "Header Search Paths", add the relative path from your Xcode project to the TagScraper source files. If your project and TagScraper share the same parent folder, this would be "../TagScraper/src".
Simply import the global header to access any of TagScraper's classes. For now, those are merely `Tag` and `XPathQuery`.
`#import "TagScraper.h"`
Then, to try it in your code, here's an example that converts an NSString to NSData, performs the XPath query `//p`, and produces a Tag object from the result. It should log "Content" to the console.
`NSString *html = @"<p>Content</p>";`
`NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];`
`Tag *tag = [XPathQuery firstResultForXPathQuery:@"//p" onDocument:data];`
`NSLog(@"%@", [tag retrieveText]);`
I encourage you to explore the (brief) source code of the static library to see what your options are. The best usage examples are in TagScraper's "Tests" group, which can be run by opening `TagScraper.xcodeproj` and switching the Active Target to `TagScraperTests`.
- Performs XPath queries on NSData objects representing XML/HTML text
- Returns XPath results as a custom Tag object or as NSDictionary objects
- Is well suited to messily scrape data from web sites, typically for the purpose of redisplaying that data in a native iPhone application.
- Converts a Tag object hierarchy back to HTML for debugging purposes (e.g. `NSLog(@"%@", [tag toHTML])`)
- Does not make any effort to understand the difference between XML, XHTML, HTML, or specific doctypes within them.
- Does not handle HTML entities or XML special characters (yet).
- Currently loads and then releases the XML document for every XPath query, making it terribly inefficient to perform multiple queries against one document.
- XPath queries are currently always case sensitive. Adding some flags to control this and other contextual items would be nice.
- Very few HTML entities are properly escaped.
- In general, the Tag object model will strip whitespace aggressively without being clear in the API. This is because it was written to scrape pages that used abhorrent amounts of unnecessary whitespace within HTML content tags.
- Come up with a beginQueries/endQueries API in order to allow multiple queries to be performed on the same XML Document to save cycles.
- As stated before, any weird entities can totally make this thing explode. If you'd like to help, check out NSStringAdditions and flesh out the entities that are replaced. One just needs to take the effort to look up what's necessary and in what cases.
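
To make the entity problem above concrete, here is a minimal sketch of table-driven entity replacement. It is written in plain C so it compiles unchanged inside an Objective-C file, and it is not the code in NSStringAdditions: the `EntityMapping` type, the `kEntities` table, and the `decode_entities` function are hypothetical names invented for this illustration, and the table covers only a handful of named entities.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup entry: an entity as it appears in markup and the
 * UTF-8 text that should replace it. */
typedef struct {
    const char *name;
    const char *replacement;
} EntityMapping;

/* Deliberately tiny table; a real one would need many more entries. */
static const EntityMapping kEntities[] = {
    { "&amp;",  "&"  },
    { "&lt;",   "<"  },
    { "&gt;",   ">"  },
    { "&quot;", "\"" },
    { "&apos;", "'"  },
    { "&nbsp;", "\xC2\xA0" }, /* U+00A0 NO-BREAK SPACE encoded as UTF-8 */
};

/* Copies `in` to `out`, replacing every entity found in kEntities.
 * Unknown entities are copied through unchanged. `out_size` is the capacity
 * of `out` including the terminating NUL; output is truncated if it does
 * not fit. Returns the number of bytes written, excluding the NUL. */
size_t decode_entities(const char *in, char *out, size_t out_size)
{
    size_t written = 0;
    if (out_size == 0) {
        return 0;
    }

    while (*in != '\0' && written + 1 < out_size) {
        const EntityMapping *match = NULL;

        /* Only bother searching the table when we are sitting on an '&'. */
        if (*in == '&') {
            for (size_t i = 0; i < sizeof kEntities / sizeof kEntities[0]; i++) {
                if (strncmp(in, kEntities[i].name, strlen(kEntities[i].name)) == 0) {
                    match = &kEntities[i];
                    break;
                }
            }
        }

        if (match != NULL) {
            size_t len = strlen(match->replacement);
            if (written + len >= out_size) {
                break; /* replacement would not fit: stop before writing a partial one */
            }
            memcpy(out + written, match->replacement, len);
            written += len;
            in += strlen(match->name);
        } else {
            out[written++] = *in++;
        }
    }

    out[written] = '\0';
    return written;
}
```

With a buffer `char buf[64]`, calling `decode_entities("Fish &amp; Chips", buf, sizeof buf)` leaves `Fish & Chips` in `buf`. Input is consumed in a single left-to-right pass, so `&amp;lt;` becomes the literal text `&lt;` rather than being decoded twice. Numeric references such as `&#169;` are not handled by this sketch and pass through untouched.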