jaimeiniesta / metainspector

Ruby gem for web scraping purposes. It scrapes a given URL and returns its title, meta description, meta keywords, an array with all the links, all the images in it, etc.


Removed address setter 
jaimeiniesta (author)
Thu Jun 04 02:36:56 -0700 2009
commit  55abd8f1c0989d0eccec1ba7c74b151ef4bc4b8e
tree    6651cf76a07f17ea1303635b24b0e0f46904e13f
parent  d6d8ee8f3b310388bee119068d20c65de3cf6306
metainspector /
name age message
file CHANGELOG.rdoc Thu Jun 04 02:36:56 -0700 2009 Removed address setter [jaimeiniesta]
file MIT-LICENSE Wed May 13 08:32:28 -0700 2009 Updated readme and license [jaimeiniesta]
file README.rdoc Thu Jun 04 02:36:56 -0700 2009 Removed address setter [jaimeiniesta]
directory lib/
file metainspector.gemspec Thu Jun 04 02:36:56 -0700 2009 Removed address setter [jaimeiniesta]
directory samples/
directory test/ Thu Jun 04 02:36:56 -0700 2009 Removed address setter [jaimeiniesta]
README.rdoc

MetaInspector

MetaInspector is a gem for web scraping purposes. You give it a URL, and it returns metadata from that page.

Dependencies

MetaInspector uses the nokogiri gem to parse HTML. You can install it from GitHub.

Run the following if you haven’t already:

  gem sources -a http://gems.github.com

Then install the gem:

  sudo gem install tenderlove-nokogiri

If you’re on Ubuntu, you might need to install these packages before installing nokogiri:

  sudo aptitude install libxslt-dev libxml2 libxml2-dev

Installation

Run the following if you haven’t already:

  gem sources -a http://gems.github.com

Then install the gem:

  sudo gem install jaimeiniesta-metainspector

Usage

Initialize a MetaInspector instance with a URL like this:

  page = MetaInspector.new('http://pagerankalert.com')

You can then read the scraped data like this:

  page.address       # URL of the page
  page.title         # title of the page, as string
  page.description   # meta description, as string
  page.keywords      # meta keywords, as string
  page.links         # array of strings, with every link found on the page

The full scraped document is accessible from:

  page.document # Nokogiri document that you can use to get any element from the page

Examples

You can find some sample scripts in the samples folder, including a basic scraper and a spider that follows external links using a queue. What follows is an example of use from irb:

  $ irb
  >> require 'metainspector'
  => true

  >> page = MetaInspector.new('http://pagerankalert.com')
  => #<MetaInspector:0x11330c0 @document=nil, @links=nil, @address="http://pagerankalert.com", @description=nil, @keywords=nil, @title=nil>

  >> page.title
  => "PageRankAlert.com :: Track your pagerank changes"

  >> page.description
  => "Track your PageRank(TM) changes and receive alert by email"

  >> page.keywords
  => "pagerank, seo, optimization, google"

  >> page.links.size
  => 31

  >> page.links[30]
  => "http://www.nuvio.cz/"

  >> page.document.class
  => Nokogiri::HTML::Document
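
The spider sample mentioned above follows links using a queue. The following is a minimal sketch of that idea, not the code from the samples folder: FAKE_LINKS is a hypothetical in-memory link graph standing in for what page.links might return, and crawl is an illustrative name, so the example runs without network access.

```ruby
require 'set'

# Hypothetical link graph standing in for the web: each key is a page
# address, each value is what page.links might return for that page.
FAKE_LINKS = {
  'http://a.example/' => ['http://b.example/', 'http://c.example/'],
  'http://b.example/' => ['http://a.example/', 'http://d.example/'],
  'http://c.example/' => [],
  'http://d.example/' => ['http://c.example/']
}.freeze

# Breadth-first crawl: take an address from the front of the queue, record
# it, and enqueue every link that has not been seen before.
# The block returns the links for an address, so the way pages are fetched
# stays outside this method.
def crawl(start, limit: 10)
  queue   = [start]
  seen    = Set.new([start])
  visited = []

  until queue.empty? || visited.size >= limit
    address = queue.shift
    visited << address

    yield(address).each do |link|
      next if seen.include?(link)
      seen << link
      queue << link
    end
  end

  visited
end

order = crawl('http://a.example/') { |address| FAKE_LINKS.fetch(address, []) }
puts order
```

With the gem installed, the block could presumably be replaced by something like { |address| MetaInspector.new(address).links }, assuming the links it returns are absolute URLs, which the To Do list below suggests is not yet guaranteed.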

To Do

  • Get page.base_dir from the address
  • Distinguish between external and internal links, returning page.links for all of them as found, page.external_links and page.internal_links converted to absolute URLs
  • Return array of images in page as absolute URLs
  • Return contents of meta robots tag
  • Be able to set a timeout in seconds
  • Detect charset
  • If keywords seem to be separated by blank spaces, replace them with commas
  • Mocks
  • Check content type, process only HTML pages, don’t try to scrape TAR files like ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2 or video files like isabel.dit.upm.es/component/option,com_docman/task,doc_download/gid,831/Itemid,74/
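
Two of the items above (internal versus external links as absolute URLs, and blank-separated keywords) could be approached with Ruby's standard library alone. This is only a sketch under assumptions: absolute_url, classify_links and normalize_keywords are hypothetical helper names, not methods of MetaInspector, and links without a host (mailto: links, for example) are not treated specially.

```ruby
require 'uri'

# Hypothetical helper: resolve a link as found on the page against the
# page address. Returns nil for links that URI cannot parse.
def absolute_url(address, link)
  URI.join(address, link).to_s
rescue URI::Error
  nil
end

# Hypothetical helper: convert links to absolute URLs, then split them into
# internal and external by comparing each host with the page's host.
def classify_links(address, links)
  page_host = URI.parse(address).host
  absolute  = links.map { |link| absolute_url(address, link) }.compact
  internal, external = absolute.partition { |url| URI.parse(url).host == page_host }
  { internal: internal, external: external }
end

# Hypothetical helper: if the keywords contain no comma, assume they are
# separated by blank spaces and join them with commas instead.
def normalize_keywords(keywords)
  return keywords if keywords.include?(',')
  keywords.split.join(', ')
end

links = classify_links('http://pagerankalert.com', ['/about', 'http://www.nuvio.cz/'])
puts links[:internal].inspect
puts links[:external].inspect
puts normalize_keywords('pagerank seo optimization google')
```

Comparing hosts is the simplest possible rule; it treats subdomains such as www.pagerankalert.com as external, which may or may not be the behaviour wanted here.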

Copyright © 2009 Jaime Iniesta, released under the MIT license
