TagScraper

TagScraper is a lightweight library for screen scraping in iPhone applications. At its most basic, that means running an XPath query against an XML (or HTML) document, usually one loaded from the Internet.

Adding TagScraper to your project

Clone the TagScraper repository to a permanent location on your hard drive:

    $ git clone git://github.com/searls/TagScraper.git

Find TagScraper.xcodeproj in Finder, then drag-and-drop it into your project's "Groups & Files" pane in Xcode, underneath your Project. In the dialog that appears:
- Uncheck "Copy items"
- Set "Reference Type" to "Relative to Project"
- Check "Recursively create groups for any added folders"

Click "TagScraper.xcodeproj" in your "Groups & Files" pane. In the upper right, you should see a file named "libTagScraper.a". Check the checkbox to its far right, in the column with a bullseye above it.

Set up the project's dependencies:
1. From the menu bar, select "Project" -> "Edit Project Settings" and click the "General" tab.
2. Click the + icon under "Direct Dependencies" and add TagScraper.
3. Click the + icon under "Linked Libraries" and add libxml2.dylib.

Set up the project's build settings:
1. From the menu bar, select "Project" -> "Edit Project Settings" and click the "Build" tab.
2. Set "Configuration" to "All Configurations".
3. Under "Other Linker Flags", add both -all_load and -ObjC.
4. Under "Header Search Paths", add the relative path from your Xcode project to the TagScraper source files. If your project and TagScraper share the same parent folder, this would be "../TagScraper/src".

Usage

Import the global header to access any of TagScraper's classes. For now, those are just Tag and XPathQuery.

`#import "TagScraper.h"`

To try it in your code, here is an example that converts an NSString to NSData, performs the XPath query //p, and produces a Tag object from the result. It should log "Content" to the console.

`NSString *html = @"<p>Content</p>";`
`NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];`
`Tag *tag = [XPathQuery firstResultForXPathQuery:@"//p" onDocument:data];`
`NSLog(@"%@", [tag retrieveText]);`
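
The example above queries an in-memory string. Since documents are usually loaded from the Internet, a fetch-and-query helper might look roughly like the sketch below. TitleOfPageAtURL is a hypothetical name, not part of TagScraper, and the sketch assumes that retrieveText returns an NSString; only firstResultForXPathQuery:onDocument: and retrieveText come from the library.

```objc
#import <Foundation/Foundation.h>
#import "TagScraper.h"

// Hypothetical helper (not part of TagScraper): downloads the document at
// `url` and returns the text of its first <title> element.
static NSString *TitleOfPageAtURL(NSURL *url) {
    // Synchronous fetch: this blocks the calling thread, so a real app
    // should run it off the main thread.
    NSData *data = [NSData dataWithContentsOfURL:url];
    if (data == nil) {
        return nil; // the document could not be loaded
    }
    Tag *tag = [XPathQuery firstResultForXPathQuery:@"//title" onDocument:data];
    // Assumption: retrieveText returns an NSString, as in the example above.
    return [tag retrieveText];
}
```

It could then be called with something like `NSLog(@"%@", TitleOfPageAtURL([NSURL URLWithString:@"http://example.com/"]));`, where the URL is only a placeholder.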

I encourage you to explore the (brief) source code of the static library to see what your options are. The best usage examples are in TagScraper's "Tests" group, which you can run by opening TagScraper.xcodeproj and switching the Active Target to TagScraperTests.

What TagScraper Does

- Performs XPath queries on NSData objects representing XML/HTML text.
- Returns XPath results as custom Tag objects or as NSDictionary objects.
- Is well suited to messily scraping data from web sites, typically to redisplay that data in a native iPhone application.
- Converts a Tag object hierarchy back to HTML for debugging purposes (e.g. NSLog(@"%@", [tag toHTML])).

What TagScraper Doesn't Do

- Make any effort to understand the difference between XML, XHTML, HTML, or specific doctypes within them.
- Handle HTML entities or XML special characters (yet).

Limitations

- It currently loads and then releases the XML document for every XPath query, making it terribly inefficient to run multiple queries against one document.
- XPath queries are always case-sensitive. Flags to control this and other contextual settings would be nice.
- Very few HTML entities are properly escaped.
- In general, the Tag object model strips whitespace aggressively, and the API does not make this clear. This is because it was written to scrape pages that used abhorrent amounts of unnecessary whitespace within HTML content tags.
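
Until such flags exist, one possible workaround for the case-sensitivity limitation is to fold case inside the XPath expression itself with the XPath 1.0 functions translate() and local-name(). The helper below is a hypothetical sketch, not part of TagScraper, and it assumes the library hands the query string to the XPath engine unchanged.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of TagScraper): builds an XPath 1.0
// expression that matches every element whose name equals `name`,
// ignoring ASCII case. `name` must not contain quote characters.
static NSString *CaseInsensitiveElementQuery(NSString *name) {
    // translate() maps each uppercase ASCII letter to its lowercase
    // counterpart, so the comparison happens entirely in lowercase.
    return [NSString stringWithFormat:
            @"//*[translate(local-name(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
            @"'abcdefghijklmnopqrstuvwxyz') = '%@']",
            [name lowercaseString]];
}
```

For example, `[XPathQuery firstResultForXPathQuery:CaseInsensitiveElementQuery(@"P") onDocument:data]` would look for p elements regardless of how the element name is capitalized. The same translate() idiom can be applied to attribute values or text().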

To-dos

- Come up with a beginQueries/endQueries API so that multiple queries can be performed on the same XML document, saving cycles.
- As stated before, any weird entities can totally make this thing explode. If you'd like to help, check out NSStringAdditions and flesh out the set of entities that are replaced. Someone just needs to take the time to look up what's necessary and in which cases.
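
To illustrate the idea behind a beginQueries/endQueries API, here is a rough sketch that parses a document once and runs several XPath queries against it, calling libxml2 (the library linked in the setup steps) directly. FirstTextsForQueries is a hypothetical name, and this is not how TagScraper is implemented today; it also assumes the input should go through libxml2's HTML parser.

```objc
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/xpath.h>

// Hypothetical helper (not part of TagScraper): parses `data` once and
// returns the text of the first match for each XPath string in `queries`.
// A query with no match contributes an empty string.
static NSArray *FirstTextsForQueries(NSArray *queries, NSData *data) {
    NSMutableArray *texts = [NSMutableArray array];

    // "beginQueries": parse the document and create one shared context.
    htmlDocPtr doc = htmlReadMemory([data bytes], (int)[data length], NULL, NULL,
                                    HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) return texts;
    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    if (context == NULL) { xmlFreeDoc(doc); return texts; }

    for (NSString *query in queries) {
        NSString *text = @"";
        xmlXPathObjectPtr result =
            xmlXPathEvalExpression((const xmlChar *)[query UTF8String], context);
        if (result != NULL) {
            xmlNodeSetPtr nodes = result->nodesetval;
            if (nodes != NULL && nodes->nodeNr > 0) {
                // xmlNodeGetContent returns a copy that must be freed.
                xmlChar *content = xmlNodeGetContent(nodes->nodeTab[0]);
                if (content != NULL) {
                    NSString *decoded =
                        [NSString stringWithUTF8String:(const char *)content];
                    if (decoded != nil) text = decoded;
                    xmlFree(content);
                }
            }
            xmlXPathFreeObject(result);
        }
        [texts addObject:text];
    }

    // "endQueries": release the context and the document exactly once.
    xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    return texts;
}
```

Calling libxml2 from your own target typically also requires its header directory (for example `$(SDKROOT)/usr/include/libxml2`) in "Header Search Paths".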

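As a starting point for the entity work, the sketch below decodes the five predefined XML entities in an NSString category. The category name, the method name, and the choice of entities are assumptions made for illustration; this is not the code currently in NSStringAdditions.

```objc
#import <Foundation/Foundation.h>

// Hypothetical category for illustration only; it is not TagScraper's
// NSStringAdditions, and the method name is an assumption.
@interface NSString (EntitySketch)
- (NSString *)stringByDecodingBasicEntities;
@end

@implementation NSString (EntitySketch)
- (NSString *)stringByDecodingBasicEntities {
    // Keys are entities, values are the characters they stand for.
    NSDictionary *entities = [NSDictionary dictionaryWithObjectsAndKeys:
                              @"<", @"&lt;",
                              @">", @"&gt;",
                              @"\"", @"&quot;",
                              @"'", @"&apos;",
                              nil];
    NSString *result = self;
    for (NSString *entity in entities) {
        result = [result stringByReplacingOccurrencesOfString:entity
                                                   withString:[entities objectForKey:entity]];
    }
    // Decode "&amp;" last so that "&amp;lt;" becomes the literal text "&lt;"
    // instead of being decoded twice into "<".
    return [result stringByReplacingOccurrencesOfString:@"&amp;" withString:@"&"];
}
@end
```

Numeric references (such as `&#39;`) and the many named HTML entities are left untouched here; covering them is the lookup work the to-do item describes.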