This is an open source, multi-threaded website crawler written in Python. There is still a lot of work to do, so feel free to help out with development.
Note: This is part of an open source search engine. The purpose of this tool is to gather links only. The analytics, data harvesting, and search algorithms are being created as separate programs.
To add proxies, a few modifications need to be made in main.py:
- In the initialization section, set USE_PROXY to True and NUMBER_OF_THREADS to the number of threads you want to run. The relevant defaults look like this:
```python
USE_PROXY = False
# DB_FILE = PROJECT_NAME + "_info.db"
NUMBER_OF_THREADS = 1
queue = Queue()
Spider(PROJECT_NAME, HOMEPAGE, DOMAIN_NAME, USE_PROXY)
```
- In the create_workers function, add each proxy's username, password, and URL to the proxys list. The format is shown below, and a filled-in example follows at the end of this section.
- P.S.: Currently, each proxy is assumed to be assigned to exactly one thread, so the number of entries in proxys should match NUMBER_OF_THREADS.
```python
def create_workers():
    # With proxy
    proxys = [
        # proxy_list1,
        # proxy_list2,
        # ...
    ]
    # Format: proxy_list = ["username", "password", "url(xxx.xxx.xxx.xxx:port)"]
    # This is all the info required for one proxy.
    # Each proxy is assigned to one thread, so len(proxys) should equal NUMBER_OF_THREADS.
```
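
For reference, a filled-in configuration for two proxy-backed threads might look like the sketch below. This is only an illustration: the usernames, passwords, and addresses are placeholders (203.0.113.x is a reserved documentation range), and it shows just the lines in main.py that the proxy setup touches.

```python
# Hypothetical example: two threads, each backed by its own proxy.
# All credentials and addresses below are placeholders, not real values.
USE_PROXY = True
NUMBER_OF_THREADS = 2

def create_workers():
    # With proxy
    proxys = [
        ["proxy_user_1", "proxy_pass_1", "203.0.113.10:8080"],
        ["proxy_user_2", "proxy_pass_2", "203.0.113.11:8080"],
    ]
    # Two proxies for two threads: len(proxys) == NUMBER_OF_THREADS
```

If you later increase NUMBER_OF_THREADS, add a matching entry to proxys for each new thread so that every worker still gets its own proxy.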