This repository contains two web scraping setups, each capable of operating both locally and within a Docker container. Choose between the following options:
- Scrapy framework: a dynamic spider driven by JSON configurations for crawling the web. Visit Scrapy's official documentation to learn more.
- HTTPX/Playwright with HTML parser: a simpler alternative that fetches HTML through HTTPX or Playwright and stores it locally for later use. The saved pages are then parsed with Selectolax, and the extracted data is written to a database as well as to local JSON files. Minimal sketches of both options follow below.
Keep your CSS selectors in the JSON file; new elements can then be added to your database simply by defining additional item models (see the sketch below).
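A minimal sketch of the second option, assuming a `selectors.json` that maps field names to CSS selectors; the file names and paths here are illustrative, not the repository's actual ones:

```python
# Sketch: fetch a page with HTTPX, cache the raw HTML locally, then
# parse it with Selectolax using selectors loaded from a JSON file.
import json
import pathlib

import httpx
from selectolax.parser import HTMLParser


def fetch(url: str, out_file: str = "pages/page.html") -> str:
    """Download a page and store the raw HTML for later parsing."""
    response = httpx.get(url, follow_redirects=True, timeout=10.0)
    response.raise_for_status()
    path = pathlib.Path(out_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(response.text, encoding="utf-8")
    return response.text


def parse(html: str, selector_file: str = "selectors.json") -> dict:
    """Extract one item from saved HTML using the configured selectors."""
    with open(selector_file, encoding="utf-8") as f:
        selectors = json.load(f)  # e.g. {"title": "h1", "price": "span.price"}
    tree = HTMLParser(html)
    item = {}
    for field, css in selectors.items():
        node = tree.css_first(css)
        item[field] = node.text(strip=True) if node else None
    return item
```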
Persist the gathered data in an SQL database of your choice; connection strings (i.e., URIs) for PostgreSQL and SQLite are included, illustrated below. As a bonus, this setup ships with a reverse proxy in front of the pgAdmin4 dashboard.
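For illustration, the connection strings in SQLAlchemy URI form might look like the following; the host, credentials, and database names are placeholders, not this repository's actual values:

```python
# Illustrative SQLAlchemy connection URIs; all values are placeholders.
from sqlalchemy import create_engine

POSTGRES_URI = "postgresql+psycopg2://user:password@localhost:5432/scraper"
SQLITE_URI = "sqlite:///scraper.db"

# Swap in POSTGRES_URI to target PostgreSQL instead of SQLite.
engine = create_engine(SQLITE_URI)
```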
Note: records in the database are identified by a SHA1 hash derived from a designated field, as sketched below.
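A minimal sketch of that derivation; using `url` as the designated field is an assumption for illustration:

```python
# Sketch: derive a stable record ID as the SHA1 digest of one field.
# The choice of "url" as the key field is assumed, not prescribed.
import hashlib


def record_id(item: dict, key_field: str = "url") -> str:
    """Return a stable 40-character hex ID for an item."""
    return hashlib.sha1(str(item[key_field]).encode("utf-8")).hexdigest()
```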
Start by installing one of the required dependencies below:
- Python 3.x
- Docker
Then:
- Clone this repository.
- Generate a `.env` file from the provided `.env.example`, for example:
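Typically this is a straight copy of the template:

```bash
cp .env.example .env
```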
Additional guidance and support can be accessed via:
```bash
bash run.sh -h
```