# webhog

webhog is a package that downloads and stores a given URL (including JS, CSS, and images) for offline use and uploads it to a given AWS S3 account (more persistence options to come).

## Installation

```sh
go get github.com/johnernaut/webhog
```

## Usage

Make a POST request to http://localhost:3000/scrape with the X-API-KEY: SCRAPEAPI header. Pass in a JSON body containing the URL you'd like to fetch, for example: { "url": "http://facebook.com" }. You'll see an Ent dir: /blah/blah/blah line printed to the console - your assets are saved in that directory. To test, open the index.html file found there.
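For illustration, here's a minimal Go client sketch for that request, using the endpoint and API key from the sample above (the Content-Type header is an assumption; the docs only mention the API key header):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// JSON body naming the URL to fetch, per the usage notes above.
	body := bytes.NewBufferString(`{"url": "http://facebook.com"}`)

	req, err := http.NewRequest("POST", "http://localhost:3000/scrape", body)
	if err != nil {
		log.Fatal(err)
	}
	// The API key header expected by webhog (SCRAPEAPI in the sample config).
	req.Header.Set("X-API-KEY", "SCRAPEAPI")
	// Content-Type is assumed here; the docs above only specify the API key.
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
}
```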

## Configuration

Create a webhog.yml file in the running directory. The following options are supported:

```yaml
development:
  mongodb: mongodb://127.0.0.1:27017/webhog
  api_key: SCRAPEAPI
  aws_key: <env var name for s3 key>
  aws_secret: <env var name for s3 secret>
  bucket: mybucket
production:
staging:
```

The top-level key to use (development, production, or staging) is selected via the MARTINI_ENV environment variable, which you should set.
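For illustration, a small Go sketch of how these settings resolve at runtime (the environment-variable names are hypothetical placeholders, and the development fallback assumes Martini's usual default):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// MARTINI_ENV picks which top-level section of webhog.yml is used.
	env := os.Getenv("MARTINI_ENV")
	if env == "" {
		env = "development" // Martini's conventional default when unset
	}
	fmt.Println("active config section:", env)

	// aws_key / aws_secret in webhog.yml hold the *names* of environment
	// variables; the credentials themselves live in the environment.
	// The variable names below are hypothetical.
	awsKey := os.Getenv("MY_S3_KEY")       // whatever aws_key is set to
	awsSecret := os.Getenv("MY_S3_SECRET") // whatever aws_secret is set to
	fmt.Println("credentials present:", awsKey != "", awsSecret != "")
}
```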

## Usage Docs

Start with a POST request to Webhog, like so:

```
POST http://webcache.mybigcampus.com/api/scrape
```

Be sure to include X-API-KEY: SCRAPEAPI as a header and url=http://www.twitter.com as form data. The JSON response will look like this:

```json
{
    "id": "549adfd30a97ec9119d60bff",
    "uuid": "4b848173-78d0-4f9f-4e34-21e7652edec9",
    "url": "http://www.twitter.com",
    "aws_link": "",
    "status": "parsing",
    "created_at": "2014-12-24T15:43:39.392Z"
}
```
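If you're consuming this from Go, a struct along these lines could decode the response (the type and field names here are my own, not taken from the webhog source):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ScrapeResponse mirrors the JSON fields shown above; the Go names
// are illustrative.
type ScrapeResponse struct {
	ID        string `json:"id"`
	UUID      string `json:"uuid"`
	URL       string `json:"url"`
	AWSLink   string `json:"aws_link"`
	Status    string `json:"status"`
	CreatedAt string `json:"created_at"`
}

func main() {
	raw := `{"id":"549adfd30a97ec9119d60bff","uuid":"4b848173-78d0-4f9f-4e34-21e7652edec9","url":"http://www.twitter.com","aws_link":"","status":"parsing","created_at":"2014-12-24T15:43:39.392Z"}`

	var r ScrapeResponse
	if err := json.Unmarshal([]byte(raw), &r); err != nil {
		panic(err)
	}
	fmt.Println(r.UUID, r.Status)
}
```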

Note the status key. If the URL has been parsed in the past, the result will be cached with a status of complete and the aws_link will contain the tar.gz file to download. Otherwise, you can use the entity endpoint to check the parsing progress, like so:

```
GET http://webcache.mybigcampus.com/api/entity/4b848173-78d0-4f9f-4e34-21e7652edec9
```

Note that the param in the URL resource is the uuid returned from the initial POST request above. Feel free to query this route as many times as you like until the parsing is complete. Once finished, you should see a JSON response like so:

```json
{
    "id": "549adfd30a97ec9119d60bff",
    "uuid": "4b848173-78d0-4f9f-4e34-21e7652edec9",
    "url": "http://www.twitter.com",
    "aws_link": "https://s3.amazonaws.com/mbc-webcache-production/iDwUM6CnfI10401333421234110778778.tar.gz",
    "status": "complete",
    "created_at": "2014-12-24T15:43:39.392Z"
}
```

Use the aws_link to pull a tarred and gzipped archive of the webpage and its assets for local use in your stashed content.
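Putting the pieces together, a rough Go sketch that polls the entity endpoint and then downloads the archive might look like the following (the poll interval and output filename are arbitrary choices, and whether the entity endpoint also requires the X-API-KEY header isn't stated above):

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

// entity holds the fields we need from the entity endpoint response.
type entity struct {
	UUID    string `json:"uuid"`
	AWSLink string `json:"aws_link"`
	Status  string `json:"status"`
}

func main() {
	uuid := "4b848173-78d0-4f9f-4e34-21e7652edec9" // from the initial POST
	url := "http://webcache.mybigcampus.com/api/entity/" + uuid

	// Poll until parsing is complete. Note: it's assumed here that the
	// GET endpoint needs no API key header; add one if your setup does.
	var e entity
	for {
		resp, err := http.Get(url)
		if err != nil {
			log.Fatal(err)
		}
		err = json.NewDecoder(resp.Body).Decode(&e)
		resp.Body.Close()
		if err != nil {
			log.Fatal(err)
		}
		if e.Status == "complete" {
			break
		}
		time.Sleep(2 * time.Second) // arbitrary poll interval
	}

	// Download the tar.gz archive from the aws_link.
	resp, err := http.Get(e.AWSLink)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("page.tar.gz") // arbitrary output filename
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	fmt.Println("saved page.tar.gz")
}
```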
