WebScraper

Overview

This repository contains two web scraping setups, each capable of operating both locally and within a Docker container. Choose between the following options:

  • Scrapy Framework: A dynamic spider driven by JSON configurations for traversing the web (a minimal sketch follows this list). Visit Scrapy's official documentation to learn more.

  • Requests & Webdriver with HTML Parser: A simpler alternative that fetches HTML with HTTPX or Playwright and stores it locally for later use. The saved pages are then parsed with Selectolax, and the extracted data is written to a database as well as local JSON files (see the parsing sketch further below).
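As a rough illustration of the first approach, here is a config-driven Scrapy spider. The file name (config.json) and its keys (start_urls, row_selector, selectors) are assumptions for illustration, not the repository's actual schema:

```python
# Minimal sketch of a JSON-driven Scrapy spider. The config file name and
# key names below are hypothetical, not the repository's actual schema.
import json

import scrapy


class DynamicSpider(scrapy.Spider):
    name = "dynamic"

    def __init__(self, config_path: str = "config.json", *args, **kwargs):
        super().__init__(*args, **kwargs)
        with open(config_path) as f:
            self.config = json.load(f)
        # Scrapy uses start_urls to schedule the initial requests.
        self.start_urls = self.config["start_urls"]

    def parse(self, response):
        # Yield one item per matching row, extracting each field with its
        # configured CSS selector.
        for row in response.css(self.config["row_selector"]):
            yield {
                field: row.css(css).get()
                for field, css in self.config["selectors"].items()
            }
```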

Keep your CSS selectors in the JSON file; new elements can then be incorporated into your database simply by adding item models.
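For the second approach, the sketch below parses a saved page with Selectolax using selectors loaded from JSON. The file names (selectors.json, page.html) and the selector schema are hypothetical:

```python
# Sketch of the parsing step; file names and selector keys are illustrative.
import json

from selectolax.parser import HTMLParser

with open("selectors.json") as f:
    selectors = json.load(f)  # e.g. {"title": "h1", "price": "span.price"}

with open("page.html", encoding="utf-8") as f:
    tree = HTMLParser(f.read())

# Pull the first match for each configured selector.
item = {
    field: (node.text(strip=True) if (node := tree.css_first(css)) else None)
    for field, css in selectors.items()
}
print(item)
```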

Data Persistence

Persist the gathered data in an SQL database of your choice. Connection strings (i.e., URIs) are included for PostgreSQL and SQLite. The setup also bundles a reverse proxy in front of the pgAdmin4 dashboard.
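For reference, connection strings for the two supported databases typically look like the following (illustrative values; the real credentials and paths come from your .env file):

```python
# Illustrative connection strings; actual values belong in the .env file.
POSTGRES_URI = "postgresql://user:password@localhost:5432/scraper"
SQLITE_URI = "sqlite:///data/scraper.db"
```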

Note: Records in the database are identified by the SHA1 hash of a designated field.
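A minimal sketch of deriving such an identifier, assuming the designated field is a string such as the item's URL (the field actually used by the repository may differ):

```python
import hashlib


def record_id(value: str) -> str:
    # Stable 40-character hex digest used as the record's identifier.
    return hashlib.sha1(value.encode("utf-8")).hexdigest()


print(record_id("https://example.com/item/42"))
```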

Getting Started

Start by installing one of the following dependencies:

  • Python 3.x
  • Docker

Then:

  1. Clone this repository.
  2. Generate a .env file from the provided .env.example.

For additional guidance and usage options, run:

bash run.sh -h

Recommended Websites for Practice
