Skip to content
This repository has been archived by the owner on Jan 1, 2019. It is now read-only.

nghiaqh/pycrawler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PREREQUIREMENTS
1. Python 2.6
2. MySQL 5.0

INSTALLATION
1. Create a database named sitelinks
2. Update the database configuration in config.py (hostname, username, password)
3. Run setup.py: python setup.python
4. Update config.py:
   - crawl_domain: domain you need to crawl.
   - fileext: file types you don't want to crawl.

TO START CRAWLING
Run PyCrawler.py:
- the crawler will start with the first URL in queue (initialized when setup),
- parses the crawled html to get links & adds them to queue.
- It repeats above steps until there's no URLs in queue.

- URLs are saved into database with status: 
    200 OK
    302/301 Redirect
    404 Not found
    403 Not allowed
    0 Exception when request the URL


Releases

No releases published

Packages

No packages published

Languages