Comparing changes

base fork: CodeforLeipzig/eris-scraper
base: 5924232d1c
...
head fork: CodeforLeipzig/eris-scraper
compare: cd06e0d394
  • 4 commits
  • 5 files changed
  • 0 commit comments
  • 1 contributor
1  .gitignore
@@ -37,3 +37,4 @@ build/
/output/
/web_cache/
/scraped_data/
+attachments/*
2  Gemfile
@@ -1,5 +1,7 @@
source "https://rubygems.org"
+gem 'addressable'
+gem 'typhoeus'
gem 'pupa'
gem 'nokogiri'
gem 'pry'
14 Gemfile.lock
@@ -7,14 +7,18 @@ GEM
multi_json (~> 1.3)
thread_safe (~> 0.1)
tzinfo (~> 0.3.37)
+ addressable (2.3.6)
bson (2.2.4)
coderay (1.1.0)
colored (1.2)
connection_pool (2.0.0)
+ ethon (0.7.0)
+ ffi (>= 1.3.0)
faraday (0.9.0)
multipart-post (>= 1.2, < 3)
faraday_middleware (0.9.1)
faraday (>= 0.7.4, < 0.10)
+ ffi (1.9.3)
i18n (0.6.9)
json-schema (2.1.9)
libv8 (3.16.14.3)
@@ -33,10 +37,10 @@ GEM
multipart-post (2.0.0)
nokogiri (1.6.2.1)
mini_portile (= 0.6.0)
- oj (2.9.3)
+ oj (2.9.4)
optionable (0.2.0)
pg (0.17.1)
- polyglot (0.3.4)
+ polyglot (0.3.5)
pry (0.9.12.6)
coderay (~> 1.0)
method_source (~> 0.8)
@@ -57,17 +61,21 @@ GEM
therubyracer (0.12.1)
libv8 (~> 3.16.14.0)
ref
- thread_safe (0.3.3)
+ thread_safe (0.3.4)
treetop (1.4.15)
polyglot
polyglot (>= 0.3.1)
+ typhoeus (0.6.8)
+ ethon (>= 0.7.0)
tzinfo (0.3.39)
PLATFORMS
ruby
DEPENDENCIES
+ addressable
nokogiri
pry
pupa
therubyracer
+ typhoeus
6 application.rb
@@ -5,7 +5,13 @@
require 'bundler/setup'
require 'pupa'
+require 'typhoeus'
+require 'typhoeus/adapters/faraday'
require 'nokogiri'
+# Use Addressable::URI to handle URIs with umlauts
+require 'addressable/uri'
+Faraday::Utils.default_uri_parser = Addressable::URI.method(:parse)
+
require 'models/resolution'
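
The parser swap in application.rb is easier to follow with the failure it avoids in view. The sketch below uses only Ruby's standard library to show that `URI.parse` rejects a URL containing an umlaut. The host and paths are made-up examples, and `stdlib_parses?` is a hypothetical helper that is not part of the scraper.

```ruby
require 'uri'

# Hypothetical helper: reports whether Ruby's stdlib URI parser accepts a URL.
def stdlib_parses?(url)
  URI.parse(url)
  true
rescue URI::InvalidURIError
  false
end

# Made-up URLs for illustration; only the second contains a non-ASCII character.
ascii_url  = 'http://example.org/anlagen/beschluss.pdf'
umlaut_url = 'http://example.org/anlagen/änderung.pdf'

puts stdlib_parses?(ascii_url)
puts stdlib_parses?(umlaut_url)
```

`Addressable::URI.parse` is more lenient with such input, which is presumably why the commit installs it as Faraday's default URI parser.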
32 resolution.rb
@@ -29,19 +29,43 @@ def scrape_objects
resolution.text = doc.css('table:contains("Beschlusstext") ~ table:first').text
resolution.einreicher = doc.css('td:contains("Einreicher:") ~ td:first').text
- resolution.anlagen_text = doc.css('table:contains("Download") ~ table:first font').text
+ resolution.anlagen_text = doc.css('table:contains("Download") ~ table:first td:first').text
script = doc.css('table:contains("Download") ~ table:first script').text
- if pdf_urls = extract_js_array(:URL, script)
- resolution.anlagen_urls = pdf_urls
- end
+
+ pdf_urls = extract_js_array(:URL, script).map! { |path| build_url(path.strip!) if path }
+ resolution.anlagen_urls = pdf_urls if pdf_urls.present?
dispatch(resolution)
+
+ download_attachments! resolution.anlagen_urls
end
+ end
+ # Download the attachment to the filesystem and cache it forever.
+ def download_attachments!(urls)
+ return if urls.blank?
+ begin
+ # Send HTTP requests in parallel. – See Pupa's README to learn more.
+ attachment_downloader.in_parallel(attachment_download_manager) do
+ urls.each do |url|
+ attachment_downloader.get(url)
+ end
+ end
+ rescue Faraday::Error::ClientError => e
+ error(e.response.inspect)
+ end
end
private
+ def attachment_download_manager
+ @attachment_download_manager ||= Typhoeus::Hydra.new(max_concurrency: 20)
+ end
+
+ def attachment_downloader
+ @attachment_downloader ||= Pupa::Processor::Client.new(cache_dir: File.expand_path('attachments', Dir.pwd), expires_in: nil)
+ end
+
require 'v8'
def extract_js_array(name, js_source)
context_shim = "document = { write: function() {} };"
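
One detail in the resolution.rb change deserves a closer look: `String#strip!` returns `nil` when the string has no surrounding whitespace, so chaining it into another call can pass `nil` along. A minimal sketch of a more defensive variant follows, assuming the extracted value may itself be `nil` or contain `nil` entries. `clean_paths` and `BASE_URL` are hypothetical names, and the lambda stands in for the scraper's `build_url`, whose definition is not shown in this diff.

```ruby
# Hypothetical base URL, for illustration only.
BASE_URL = 'http://example.org'

# Stand-in for the scraper's build_url (its real definition is not in the diff).
build_url = ->(path) { "#{BASE_URL}/#{path}" }

# Normalize a possibly-nil list of paths: drop nils, trim whitespace,
# and discard entries that are empty after trimming.
def clean_paths(paths)
  Array(paths).compact.map(&:strip).reject(&:empty?)
end

# strip! signals "nothing changed" with nil; strip always returns a String.
p 'a.pdf'.dup.strip!    # nil
p ' a.pdf '.dup.strip!  # "a.pdf"

urls = clean_paths([' docs/a.pdf ', 'docs/b.pdf', nil]).map(&build_url)
p urls
```

Using the non-bang `strip` keeps every element a String, and `Array(...)` turns a `nil` result into an empty list, so the later `present?` check still decides whether any URLs are assigned.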
