This project is a web scraper that uses Puppeteer for scraping web pages and OpenAI for analyzing the scraped content.
The scraper extracts HTML from target URLs, processes it with OpenAI to generate Puppeteer scripts, and saves the extracted data to structured JSON files based on the keywords provided.
- Web Scraping: Uses Puppeteer to scrape websites and extract HTML.
- Content Analysis: Integrates with OpenAI to analyze scraped HTML and generate scripts for specific content extraction.
- File Management: Saves generated scripts and extracted data in structured directories.
- CLI Menu: Includes a simple command-line interface for initiating scraping and clearing generated files.
- Modularized Code: The project is structured for maintainability and scalability, with services and utilities separated into distinct modules.
```
/project-root
├── /generated
│ ├── /extractedData # Stores JSON output of extracted data
│ ├── /html # Stores HTML files scraped from websites
│ └── /scripts # Stores generated Puppeteer scripts
├── /src
│ ├── /migrations # MongoDB migration scripts
│ │ └── default_migration.js # Default migration script to set up initial collections
│ ├── /utils
│ │ ├── fileUtils.js # File-related utilities (e.g., saving files, ensuring directories exist)
│ │ └── directoryUtils.js # Utilities for managing directories
│ ├── /services
│ │ ├── openaiService.js # Handles communication with OpenAI API
│ │ ├── puppeteerService.js # Manages Puppeteer browser sessions and scraping logic
│ │ ├── scraperService.js # Scraping logic and relevant content checking
│ │ ├── db.js # MongoDB connection and database management
│ │ ├── directoryService.js # Services related to clearing directories
│ │ └── migrationService.js # Handles applying and managing MongoDB migrations
│ ├── db.js # MongoDB connection and database management
│ ├── cli.js # Command-line interface for user interactions, including migrations
│ ├── config.js # Configuration for the project (API keys, URLs to process)
│ └── app.js # Main application logic, processing URLs and orchestrating services
├── package.json
└── .env # Environment variables (API keys, MongoDB connection string, URLs to process)
```
## Requirements

- Node.js: Make sure you have Node.js installed on your system.
- Puppeteer: Used for scraping web pages (installed via npm).
- OpenAI API: You'll need an OpenAI API key to use the content analysis feature.
## Installation

- Clone the repository:

  ```
  git clone https://github.com/your-username/web-scraper.git
  ```

- Navigate to the project directory:

  ```
  cd web-scraper
  ```

- Install the dependencies:

  ```
  npm install
  ```

- Set up the environment variables by creating a .env file in the project root:

  ```
  touch .env
  ```

- Add the following variables to your .env file:

  ```
  OPENAI_API_KEY=your-openai-api-key
  URLS_TO_PROCESS=[{"url":"https://example.com", "content":"blog articles"}]
  ```

## Usage
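Since URLS_TO_PROCESS holds a JSON array, config.js presumably parses it at startup. Below is a hedged sketch of how that parsing might look — `loadConfig` and its shape are assumptions for illustration, not the project's actual API — including a fallback to an empty list when the variable is unset or malformed.

```javascript
// Hypothetical sketch of config loading (the real config.js may differ).
// URLS_TO_PROCESS is parsed as JSON; unset or invalid values fall back
// to an empty list instead of crashing the app.
function loadConfig(env = process.env) {
  let urlsToProcess = [];
  try {
    urlsToProcess = JSON.parse(env.URLS_TO_PROCESS || '[]');
  } catch (err) {
    console.warn('URLS_TO_PROCESS is not valid JSON; ignoring it.');
  }
  return {
    openaiApiKey: env.OPENAI_API_KEY || '',
    urlsToProcess,
  };
}

const config = loadConfig({
  OPENAI_API_KEY: 'sk-test',
  URLS_TO_PROCESS: '[{"url":"https://example.com", "content":"blog articles"}]',
});
console.log(config.urlsToProcess[0].content); // prints "blog articles"
```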
You can run the project using the command-line interface (CLI) or programmatically from app.js.
### Start Scraping

To start scraping and processing URLs, run:

```
npm start
```

Follow the on-screen prompts to initiate scraping or clear generated files.
### Clear Generated Files
To clear the extractedData, scripts, and html directories, select the "Clear Generated Files" option in the CLI menu.
## Customization

- Add URLs to the URLS_TO_PROCESS environment variable in the .env file.
- Modify the scraping logic and output structure in the relevant service files (such as scraperService.js).
## Contribution
Feel free to fork the project and submit pull requests. Make sure to test any changes thoroughly.
## License
This project is licensed under the MIT License.