- Oracle Java 8
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default
- Maven
sudo apt-get update
sudo apt-get install maven
- Node.js 6.x and npm
curl -sL https://deb.nodesource.com/setup_6.x | sudo -E bash -
sudo apt-get install -y nodejs
- Go to the project folder
- Run
mvn clean install
- Find the new build at
target/scraper.jar
- Install Ruby
sudo apt-get update
sudo apt-get install ruby
- Install Rake
sudo gem install rake
- Install Ant
sudo apt-get update
sudo apt-get install ant
- Clone the framework_templates repository into the same parent folder
- Clone the gexcloud repository as vagrant into the same parent folder
- Clone the nutch-fork repository as nutch into the same parent folder
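The expected layout after the three clones above can be sketched as follows. The git remote URLs are not given in this README, so they are left as commented-out placeholders and the directories are simulated with mkdir:

```shell
# Sketch of the parent-folder layout; <remote> is a placeholder for your git host.
PARENT=$(mktemp -d)           # stands in for your actual parent folder
cd "$PARENT"
# git clone <remote>/framework_templates.git framework_templates
# git clone <remote>/gexcloud.git vagrant
# git clone <remote>/nutch-fork.git nutch
mkdir -p framework_templates/scraper vagrant nutch   # stand-ins for the clones
ls "$PARENT"
```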
- Go to parent_folder/framework_templates/scraper
- Run
rake build_nutch
- Run
rake build_scraper
- Run
sudo rake build
- Clone the appstore-apps repository into the same parent folder
- Go to parent_folder/appstore-apps
- Increment the version in /appstore-apps/apps/scraper/build_config.rb
- Run
gex_env=main rake deploy:upload['scraper']
Build project
The application needs Consul, Elasticsearch, and the Nutch REST API running.
- Consul (https://hub.docker.com/r/progrium/consul/)
docker run -p 8400:8400 -p 8500:8500 -p 8600:53/udp -h node1 progrium/consul -server -bootstrap
- Elasticsearch (https://hub.docker.com/r/nshou/elasticsearch-kibana/)
docker run -d -p 9200:9200 -p 9300:9300 -p 5601:5601 nshou/elasticsearch-kibana:kibana4
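The two docker run commands above can equivalently be expressed as a compose file, which may be more convenient for local development. This is a sketch; the service names are made up, while the images, ports, hostname, and arguments come from the commands above:

```yaml
# Hypothetical docker-compose.yml for the Consul and Elasticsearch dependencies.
services:
  consul:
    image: progrium/consul
    hostname: node1
    command: -server -bootstrap
    ports:
      - "8400:8400"
      - "8500:8500"
      - "8600:53/udp"
  elasticsearch:
    image: nshou/elasticsearch-kibana:kibana4
    ports:
      - "9200:9200"
      - "9300:9300"
      - "5601:5601"
```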
- For the Nutch REST API, run the scraper docker container.
- For logs, create a folder at
/usr/local/scraper
with write permissions for all users:
sudo mkdir -p /usr/local/scraper
sudo chmod a+w /usr/local/scraper
- Run the project from the main class (io.gex.scraper.api.Main) with two arguments: the path to the config file and -dev
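A launch command might look like the following. This is a sketch: the config path is an assumption, and `java -cp <jar> <main-class>` is one standard way to start a named main class from a jar; the command is printed rather than executed here, since it requires the built jar:

```shell
# Hypothetical launch command; adjust CONFIG to where your config file lives.
CONFIG=config.json
CMD="java -cp target/scraper.jar io.gex.scraper.api.Main $CONFIG -dev"
echo "$CMD"   # printed, not run, in this sketch
```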
- Config file example
{
  "appId": "1234",
  "webServerPort": 4567,
  "consulHost": "localhost",
  "consulPort": 8500,
  "nutchHost": "http://0.0.0.0",
  "nutchPort": 8081,
  "defScrapArchJob": {
    "urls": null,
    "crawlIndexesHost": "http://index.commoncrawl.org",
    "warcFilesHost": "http://commoncrawl.s3.amazonaws.com/",
    "crawlLinksLimit": null,
    "fromYear": 2017,
    "toYear": 2017,
    "fetchThreadsNum": 32,
    "elastic": {
      "host": "0.0.0.0",
      "port": 9300,
      "clusterName": null,
      "indexName": "scraper",
      "type": "scrap_old_data"
    }
  },
  "defScrapJob": {
    "urls": null,
    "depth": 2,
    "interval": 7200,
    "extractArticle": false,
    "elasticIndexName": "scraper"
  }
}
Go to http://0.0.0.0:3000. By default, debug mode starts two web servers: a Java web server on port 4567 and a Node.js web server on port 3000, which proxies the Java web server so assets can be added dynamically.