jaimeiniesta / metainspector
MetaInspector
MetaInspector is a gem for web scraping. You give it a URL, and it returns metadata from that page.
Dependencies
MetaInspector uses the nokogiri gem to parse HTML. You can install it from GitHub.
Run the following if you haven’t already:
gem sources -a http://gems.github.com
Then install the gem:
sudo gem install tenderlove-nokogiri
If you’re on Ubuntu, you might need to install these packages before installing nokogiri:
sudo aptitude install libxslt-dev libxml2 libxml2-dev
Installation
Run the following if you haven’t already:
gem sources -a http://gems.github.com
Then install the gem:
sudo gem install jaimeiniesta-metainspector
Usage
Initialize a MetaInspector instance with a URL like this:
page = MetaInspector.new('http://pagerankalert.com')
Once the page has been scraped, you can read the scraped data like this:
page.address     # URL of the page
page.title       # title of the page, as string
page.description # meta description, as string
page.keywords    # meta keywords, as string
page.links       # array of strings, with every link found on the page
The full scraped document is accessible from:
page.document # Nokogiri document that you can use to get any element from the page
Examples
You can find some sample scripts in the samples folder, including a basic scraper and a spider that follows external links using a queue. What follows is an example of use from irb:
$ irb
>> require 'metainspector'
=> true
>> page = MetaInspector.new('http://pagerankalert.com')
=> #<MetaInspector:0x11330c0 @document=nil, @links=nil, @address="http://pagerankalert.com", @description=nil, @keywords=nil, @title=nil>
>> page.title
=> "PageRankAlert.com :: Track your pagerank changes"
>> page.description
=> "Track your PageRank(TM) changes and receive alert by email"
>> page.keywords
=> "pagerank, seo, optimization, google"
>> page.links.size
=> 31
>> page.links[30]
=> "http://www.nuvio.cz/"
>> page.document.class
=> Nokogiri::HTML::Document
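The queue-based spider mentioned above can be sketched roughly as follows. This is not the script from the samples folder; it is a minimal illustration that follows every link rather than only external ones. The crawl method, its parameters, and the GRAPH constant are hypothetical names, and a hard-coded link graph replaces the network so the example runs offline.

```ruby
require 'set'

# Hypothetical breadth-first spider. fetch_links is any callable that takes a
# URL and returns the links found there; with MetaInspector it could be
# ->(url) { MetaInspector.new(url).links }.
def crawl(start_url, max_pages, fetch_links)
  queue   = [start_url]
  visited = Set.new

  until queue.empty? || visited.size >= max_pages
    url = queue.shift
    next if visited.include?(url) # already scraped, skip duplicates

    visited << url
    fetch_links.call(url).each do |link|
      queue << link unless visited.include?(link)
    end
  end

  visited.to_a
end

# Offline stand-in for the network: a tiny hard-coded link graph.
GRAPH = {
  'http://a.example/' => ['http://b.example/', 'http://c.example/'],
  'http://b.example/' => ['http://c.example/', 'http://a.example/'],
  'http://c.example/' => ['http://d.example/']
}.freeze

order = crawl('http://a.example/', 3, ->(url) { GRAPH.fetch(url, []) })
puts order
```

Passing the link fetcher as an argument keeps the queue logic separate from the scraping, which also makes it easy to test without touching the network.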
To Do
- Get page.base_dir from the address
- Distinguish between external and internal links, returning page.links for all of them as found, page.external_links and page.internal_links converted to absolute URLs
- Return array of images in page as absolute URLs
- Return contents of meta robots tag
- Be able to set a timeout in seconds
- Detect charset
- If keywords seem to be separated by blank spaces, replace them with commas
- Mocks
- Check content type, process only HTML pages, don’t try to scrape TAR files like ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2 or video files like isabel.dit.upm.es/component/option,com_docman/task,doc_download/gid,831/Itemid,74/
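Some of the items above could be approached along these lines. This is only a sketch using Ruby's standard uri library, not code from the gem: classify_links, normalize_keywords, and html_content_type? are hypothetical helper names, and the rules they apply (same host means internal, no comma means space-separated keywords, only text/html counts as HTML) are assumptions about how the features might behave.

```ruby
require 'uri'

# Hypothetical helper: resolves every link against the page address and
# groups the resulting absolute URLs by whether they share the page's host.
def classify_links(address, links)
  base = URI.parse(address)
  internal = []
  external = []

  links.each do |link|
    begin
      absolute = base.merge(link)
    rescue URI::Error
      next # skip links that cannot be parsed as URIs
    end
    next unless absolute.is_a?(URI::HTTP) # ignore mailto: and similar schemes

    (absolute.host == base.host ? internal : external) << absolute.to_s
  end

  { internal: internal, external: external }
end

# Hypothetical helper: if the keywords contain no comma, assume they are
# separated by blank spaces and join them with commas instead.
def normalize_keywords(keywords)
  return keywords if keywords.include?(',')

  keywords.split.join(', ')
end

# Hypothetical helper: true only when a Content-Type header value names HTML.
def html_content_type?(content_type)
  content_type.to_s.split(';').first.to_s.strip.downcase == 'text/html'
end

result = classify_links(
  'http://pagerankalert.com/blog/',
  ['/about', 'post.html', 'http://www.nuvio.cz/', 'mailto:me@example.com']
)
puts result[:internal].inspect
puts result[:external].inspect
puts normalize_keywords('pagerank seo optimization')
puts html_content_type?('application/x-tar')
```

Comparing hosts is a simplification: it treats subdomains such as www.pagerankalert.com as external, which may or may not be the desired behavior.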
Copyright © 2009 Jaime Iniesta, released under the MIT license


