An implementation of a collaboration-based crawling approach aimed at improving overall crawling capability.
Project URL: https://github.com/SiddharthC/CCrawler
. - <project root>
├── ccrawler - collaborative crawler implementation directory
│ ├── settings.py - defines the default settings and constants
│ └── spiders - spider directory
│ └── base_spider.py - base spider implementation
├── items.json - a (temporary) json file containing crawled data (title, link, content)
├── fetch_crawl.py - crawls the urls listed in a url file and stores the output as a json file
├── merge_crawl.py
├── remote_crawl.py
├── update_solr.py - update Solr with json files
├── schema.xml - Solr schema file; copy it into the Solr conf directory
├── scrapy.cfg - project configuration
└── urls.txt - url list file containing an allowed domain and the start urls
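The exact layout of urls.txt is not documented above; a plausible example, assuming the allowed domain sits on the first line and each subsequent line is a start url, would be:

    example.com
    http://example.com/
    http://example.com/docs/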
To crawl the urls specified in urls.txt with a spider and store the generated data as a json file at a specified location:
$ ./fetch_crawl.py -t <spider_name> -d <location_to_store_data> -u <path_to_url_file> -c <collaborative_crawling_flag (0 or 1)>
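Per the items.json entry above, each crawled item holds a title, link, and content field, so a record in the generated json file should look roughly like this (the values are illustrative):

    [
      {
        "title": "Example Domain",
        "link": "http://example.com/",
        "content": "This domain is for use in illustrative examples in documents."
      }
    ]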
$ ./update_solr.py # Update Solr with the generated json files.
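The internals of update_solr.py are not shown here, but a minimal sketch of that step, assuming a local Solr instance with a core named ccrawler (the core name is an assumption) and Solr's standard JSON update handler, could look like:

    import json
    import requests

    # Assumed Solr location and core name; adjust to your installation.
    SOLR_UPDATE_URL = "http://localhost:8983/solr/ccrawler/update"

    def post_items(path):
        # Load the crawled items produced by fetch_crawl.py.
        with open(path) as f:
            docs = json.load(f)
        # POST the documents to Solr's JSON update handler and commit them.
        resp = requests.post(SOLR_UPDATE_URL, params={"commit": "true"}, json=docs)
        resp.raise_for_status()

    if __name__ == "__main__":
        post_items("items.json")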
To run the base spider directly with Scrapy and export the crawled items as JSON or XML:
$ scrapy crawl base -o items.json -t json --nolog
$ scrapy crawl base -o items.xml -t xml --nolog
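For context on what the base spider emits, a minimal sketch of a spider like ccrawler/spiders/base_spider.py, assuming a recent Scrapy release and the urls.txt layout shown earlier (an illustration, not the repository's actual implementation), might be:

    import scrapy

    class BaseSpider(scrapy.Spider):
        # The crawl commands above refer to this spider by name.
        name = "base"

        def __init__(self, *args, **kwargs):
            super(BaseSpider, self).__init__(*args, **kwargs)
            # Assumed urls.txt layout: allowed domain first, then start urls.
            with open("urls.txt") as f:
                lines = [line.strip() for line in f if line.strip()]
            self.allowed_domains = [lines[0]]
            self.start_urls = lines[1:]

        def parse(self, response):
            # Emit one item per page with the fields stored in items.json.
            yield {
                "title": response.xpath("//title/text()").get(),
                "link": response.url,
                "content": " ".join(response.xpath("//body//text()").getall()).strip(),
            }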