INA-DLWeb/LiveArchivingProxy

LiveArchivingProxy

An HTTP proxy that archives all intercepted traffic.

The Live Archiving Proxy (LAP) is an HTTP proxy that captures the traffic flowing through it. The LAP delegates the handling of the captured data to one or more writers using a simple network protocol. Writers exist for the DAFF, WARC and ARC formats. Using an HTTP proxy for Web archiving enables the use of any HTTP client for crawling (Heritrix, PhantomJS, HTTrack, Scrapy, etc.) while keeping a unified and simple storage backend. The LAP is designed to be highly performant, easy to use and archive-format agnostic. It runs on any 64-bit Linux system.
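Because the LAP sits in the path as an ordinary HTTP proxy, any client that supports proxy configuration can be pointed at it. The sketch below shows one way this might look with Python's standard library. The proxy address `http://localhost:8080` and the helper names `build_lap_opener` and `fetch_via_lap` are illustrative assumptions, not part of the LAP distribution; the host and port the LAP actually listens on depend on how it is configured.

```python
import urllib.request

# Hypothetical address of a running LAP instance (assumption; adjust to your setup).
LAP_PROXY = "http://localhost:8080"


def build_lap_opener(proxy_url: str = LAP_PROXY) -> urllib.request.OpenerDirector:
    """Return an opener that routes plain-HTTP requests through the given proxy."""
    proxy_handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(proxy_handler)


def fetch_via_lap(url: str, proxy_url: str = LAP_PROXY) -> bytes:
    """Fetch a URL through the proxy and return the response body."""
    opener = build_lap_opener(proxy_url)
    with opener.open(url, timeout=30) as response:
        return response.read()


if __name__ == "__main__":
    # Requires a proxy listening at LAP_PROXY; the crawler itself needs no
    # archive-specific code, since storage is handled on the proxy side.
    body = fetch_via_lap("http://example.com/")
    print(len(body), "bytes fetched through the proxy")
```

The same idea applies to the crawlers listed above: each is configured with the proxy's host and port through its own proxy setting, and the archive format is chosen by the writers attached to the LAP rather than by the client.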

Ina has used the LAP in production since 2012 for 50% of its crawls and plans to use it for 100% of its crawls by 2014.

Getting started

Code resources

ChangeLog

Note: This changelog only lists major visible changes.

1.2.1 2014-08-26

  • better HTTP/1.1 handling
  • vortex log fix
  • bloom filter handshake timeout
  • bloom filter TCP tunneling removed
  • hostname fix
  • DNS caching for IPv6 fix
  • deflate fix
  • proxy setting revamped
  • PAR version fix

1.2.0 2014-05-06

  • pseudo HTTPS mode (see user manual)
  • compression-factor info for compressibility hint (LZ4)
  • bypass mode (lap-bypass header in request)
  • PUT web service
  • discard-when-no-writer option
  • allow-range-requests option
  • revamped screen log
  • various bug fixes

X.X.X 2013-07-10

  • initial public release
