Scraper Library

Objectives

To provide a generic Ruby gem that makes it easy to scrape a variety of sites. This library targets the following types of web pages:

Youtube.com
Wikipedia.org
Vimeo.com
Flickr.com
Any blog, article, news, etc.

Extracting information from YouTube or Vimeo

For YouTube and Vimeo, the following sample code shows what you can expect:

@scraper = Scraper( :url => "http://www.youtube.com/watch?v=MDhMBxAHGYE" )
# => #<Scraper::Youtube>

@scraper.thumbnail
# => "http://i.ytimg.com/vi/MDhMBxAHGYE/2.jpg"   

@scraper.title
# => "Rick Roll [Geek Edition]"

@scraper.html
# => "<object width="425" height="344"><param name="movie" value="http://www.youtube.com/v/MDhMBxAHGYE&hl=en&fs=1&"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/MDhMBxAHGYE&hl=en&fs=1&" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"></embed></object>"
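The sample above shows `Scraper(...)` returning a site-specific object based on the URL. As a rough illustration of how that kind of dispatch could work, here is a minimal sketch. The module name `ScraperSketch`, the `HANDLERS` table, and the `for` method are all hypothetical and are not taken from this gem's source; the real implementation may differ.

```ruby
require "uri"

# Hypothetical sketch only: none of these names come from the gem itself.
module ScraperSketch
  # Minimal stand-ins for the site-specific scraper classes.
  Base = Struct.new(:url)
  class Youtube < Base; end
  class Vimeo   < Base; end
  class Article < Base; end

  # Maps a domain to the class assumed to handle it.
  HANDLERS = {
    "youtube.com" => Youtube,
    "vimeo.com"   => Vimeo
  }.freeze

  # Picks a handler by host name, falling back to Article for everything else.
  def self.for(url)
    host = URI.parse(url).host.to_s.downcase.sub(/\Awww\./, "")
    _, klass = HANDLERS.find do |domain, _|
      host == domain || host.end_with?(".#{domain}")
    end
    (klass || Article).new(url)
  end
end

video = ScraperSketch.for("http://www.youtube.com/watch?v=MDhMBxAHGYE")
puts video.class
```

Keeping the domain-to-class mapping in one table means that supporting another site only requires adding an entry, and any unrecognized URL falls through to the generic article handler.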

Extracting content from blogs, news articles, and beyond

When a URL points to a web page that isn't part of the special group (movies, photos, and other multimedia), the content portion of the page is extracted using a relevancy scoring algorithm.

Example:

@scraper = Scraper( :url => "http://www.alistapart.com/articles/unwebbable")
# => #<Scraper::Article>

@scraper.title 
# => "A List Apart: Articles: Unwebbable"

@scraper.text
# => "It's time we came to grips with the fact that not every "document" can be a web page." ...
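This README does not describe the relevancy scoring algorithm itself. To give a feel for the general idea, the sketch below scores candidate text blocks with a simple heuristic that favors long blocks with few links. This is an assumption for illustration, not the gem's actual algorithm; the `relevancy_score` and `main_content` helpers and the block hash layout (`:text`, `:link_text`) are hypothetical.

```ruby
# Hypothetical sketch: each candidate block is a hash holding its visible text
# and the portion of that text that sits inside links.
def relevancy_score(block)
  text = block[:text].to_s
  return 0.0 if text.empty?

  words        = text.split.size
  commas       = text.count(",")
  link_density = block[:link_text].to_s.length.to_f / text.length

  # Long, punctuation-rich blocks with little link text score highest.
  (words + commas) * (1.0 - link_density)
end

# Returns the block with the highest score.
def main_content(blocks)
  blocks.max_by { |block| relevancy_score(block) }
end

blocks = [
  { text: "Home About Articles Contact", link_text: "Home About Articles Contact" },
  { text: "It's time we came to grips with the fact that not every document can be a web page.",
    link_text: "" }
]

puts main_content(blocks)[:text]
```

In this sketch a navigation bar made up entirely of links scores zero, so the article paragraph is selected as the content.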
