FCC/Crawler: Crawler is a bare-bones spider designed to quickly and effectively build an index of all files and pages on a given Web site, as well as the link relationships (both incoming and outgoing) between pages.

TO USE:

1. Edit config.php with the appropriate database and domain information
2. (for now) In phpMyAdmin, insert the seed URL into the urls table.
	* The URL should look something like: www.fcc.gov/
	* The URL should have a trailing slash
	* (for now) You may also want to set clicks to '0' to avoid problems
3. Open crawl.php
4. (optional) Open stats.php to watch progress

TIPS:
	Changes to php.ini
		1. Increase memory limit (1GB)
		2. Remove execution time limit
	Changes to the MySQL configuration file (my.ini or my.cnf)
		* Increase the maximum query (packet) size to avoid the "MySQL server has gone away" error
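
As a rough sketch, the tips above might translate into the configuration fragments below. The directive names are the standard PHP and MySQL ones, but the 64M packet size is only an illustrative assumption (the original does not give a value), and the location of each file depends on your installation.

```ini
; --- php.ini ---
; Raise the memory limit to 1GB
memory_limit = 1024M
; 0 removes the execution time limit
max_execution_time = 0

; --- my.ini (my.cnf on Unix-like systems) ---
[mysqld]
; Assumed value; raise it further if large queries still fail
max_allowed_packet = 64M
```

Restart the web server and MySQL after editing these files so the new values take effect.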

Additional documentation (source code) in (/source)