GitHub - Ksyula/Multithreading-Scraping-in-Python: Multithreading web scraping with real-time-changing proxy-servers

Parallel web scraping

The project is a training task for web scraping using Python multithreading and a list of available proxy servers that is updated in real time.

Goal

The script extracts the names and prices of the Top-100 crypto coins and stores them in a database.

Disclaimer

The task is quite contrived and serves mainly for study purposes. There are numerous mature sources that provide both real-time and historical cryptocurrency data.

Solved problems within the project

Multiple pages with single-level nesting have been scraped. Propagation is handled by collecting internal links from the main page and iterating through them.
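
The link-collection step described above can be sketched as follows. This is a minimal illustration using only the standard library, not the code from `main.py`; the names `LinkCollector` and `collect_internal_links` are hypothetical, and treating "internal" as "same host as the main page" is an assumption.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects unique same-host links from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the main page URL.
        absolute = urljoin(self.base_url, href)
        same_host = urlparse(absolute).netloc == urlparse(self.base_url).netloc
        if same_host and absolute not in self.links:
            self.links.append(absolute)


def collect_internal_links(html: str, base_url: str) -> List[str]:
    """Return the internal links found in the main page, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

The returned list can then be iterated to scrape each nested page once.
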
To prevent bans from the remote server, a proxy management mechanism was implemented.
Since free public proxy servers are often unreliable, the following approach was used to address this issue:
- A separate scraping script extracts a list of free public proxy servers from a website.
- With each script execution, the list of 10 proxy servers is updated with currently available ones.
- During execution, some proxies may become unavailable. Each scraping request cycles through the list to find an active proxy before proceeding.
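
The proxy fallback described in the list above might look roughly like this. It is a sketch under assumptions: the function names are hypothetical, the original project may use a different HTTP library, and a failed proxy is assumed to surface as an `OSError` (which covers `urllib.error.URLError` and socket timeouts).

```python
import urllib.request
from typing import Callable, Iterable, Optional


def fetch_with_urllib(url: str, proxy: str, timeout: float = 5.0) -> str:
    """Fetch a URL through a single proxy, e.g. proxy='http://1.2.3.4:8080'."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_via_first_working_proxy(
    url: str,
    proxies: Iterable[str],
    fetch: Callable[[str, str], str] = fetch_with_urllib,
) -> Optional[str]:
    """Try each proxy in order; return the first successful body, or None."""
    for proxy in proxies:
        try:
            return fetch(url, proxy)
        except OSError:
            # Dead or unreachable proxy: move on to the next one in the list.
            continue
    return None
```

Passing `fetch` as a parameter keeps the rotation logic independent of the HTTP client, which also makes it easy to exercise without network access.
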
To accelerate the scraping of all 101 web pages, multithreading was utilized. The workload is distributed across four threads running concurrently.
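
One way to spread the pages over four threads is a thread pool. The sketch below uses `concurrent.futures.ThreadPoolExecutor`; whether the project uses this class or the lower-level `threading` module is not stated, and `scrape_all` and `scrape_page` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def scrape_all(
    urls: Sequence[str],
    scrape_page: Callable[[str], T],
    workers: int = 4,
) -> List[T]:
    """Run scrape_page over all URLs using a pool of worker threads.

    Results come back in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scrape_page, urls))
```

Threads suit this workload because each request spends most of its time waiting on the network, so the interpreter lock is rarely the bottleneck.
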
The extracted data is written directly to a database.
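
Writing the results could be done with the standard `sqlite3` module, as sketched below. The presence of `Coins.db` suggests SQLite, but that is an assumption, as are the table name `coins`, its columns, and the function name `save_coins`.

```python
import sqlite3
from typing import Iterable, Tuple


def save_coins(conn: sqlite3.Connection, coins: Iterable[Tuple[str, float]]) -> None:
    """Insert (name, price) rows, replacing any existing row for the same coin."""
    # Hypothetical schema: one row per coin, keyed by name.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS coins (name TEXT PRIMARY KEY, price REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO coins (name, price) VALUES (?, ?)",
        coins,
    )
    conn.commit()
```

When several threads produce rows, a simple option is to collect the results first and call `save_coins` once from the main thread, since a `sqlite3` connection is by default bound to the thread that created it.
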

Folders and files

- Coins.db
- Crawling_vs_Scraping.png
- README.md
- clean_db.py
- main.py
- parse_and_write.py
- proxy_list.py
