This project is a web scraper that uses Puppeteer for scraping web pages and OpenAI for analyzing the scraped content.
The scraper extracts HTML from target URLs, processes it with OpenAI to generate Puppeteer scripts, and saves the extracted data to structured JSON files based on the keywords provided.
- Web Scraping: Uses Puppeteer to scrape websites and extract HTML.
- Content Analysis: Integrates with OpenAI to analyze scraped HTML and generate scripts for specific content extraction.
- File Management: Saves generated scripts and extracted data in structured directories.
- CLI Menu: Includes a simple command-line interface for initiating scraping and clearing generated files.
- Modularized Code: The project is structured for maintainability and scalability, with services and utilities separated into distinct modules.
```
/project-root
├── /generated
│ ├── /extractedData # Stores JSON output of extracted data
│ ├── /html # Stores HTML files scraped from websites
│ └── /scripts # Stores generated Puppeteer scripts
├── /src
│ ├── /migrations # MongoDB migration scripts
│ │ └── default_migration.js # Default migration script to set up initial collections
│ ├── /utils
│ │ ├── fileUtils.js # File-related utilities (e.g., saving files, ensuring directories exist)
│ │ └── directoryUtils.js # Utilities for managing directories
│ ├── /services
│ │ ├── openaiService.js # Handles communication with OpenAI API
│ │ ├── puppeteerService.js # Manages Puppeteer browser sessions and scraping logic
│ │ ├── scraperService.js # Scraping logic and relevant content checking
│ │ ├── db.js # MongoDB connection and database management
│ │ ├── directoryService.js # Services related to clearing directories
│ │ └── migrationService.js # Handles applying and managing MongoDB migrations
│ ├── db.js # MongoDB connection and database management
│ ├── cli.js # Command-line interface for user interactions, including migrations
│ ├── config.js # Configuration for the project (API keys, URLs to process)
│ └── app.js # Main application logic, processing URLs and orchestrating services
├── package.json
└── .env # Environment variables (API keys, MongoDB connection string, URLs to process)
```
## Requirements

- Node.js: Make sure you have Node.js installed on your system.
- Puppeteer: Used for scraping web pages (installed via npm).
- OpenAI API: You'll need an OpenAI API key to use the content analysis feature.
## Installation

- Clone the repository:

  ```
  git clone https://github.com/your-username/web-scraper.git
  ```

- Navigate to the project directory:

  ```
  cd web-scraper
  ```

- Install the dependencies:

  ```
  npm install
  ```

- Set up the environment variables by creating a .env file in the project root:

  ```
  touch .env
  ```

- Add the following variables to your .env file:

  ```
  OPENAI_API_KEY=your-openai-api-key
  URLS_TO_PROCESS=[{"url":"https://example.com", "content":"blog articles"}]
  ```

## Usage
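Since URLS_TO_PROCESS holds a JSON array, config.js presumably parses it at startup. Below is a hedged sketch of how that parsing might look — `loadConfig` and its shape are assumptions for illustration, not the project's actual API — including a fallback to an empty list when the variable is unset or malformed.

```javascript
// Hypothetical sketch of config loading (the real config.js may differ).
// URLS_TO_PROCESS is parsed as JSON; unset or invalid values fall back
// to an empty list instead of crashing the app.
function loadConfig(env = process.env) {
  let urlsToProcess = [];
  try {
    urlsToProcess = JSON.parse(env.URLS_TO_PROCESS || '[]');
  } catch (err) {
    console.warn('URLS_TO_PROCESS is not valid JSON; ignoring it.');
  }
  return {
    openaiApiKey: env.OPENAI_API_KEY || '',
    urlsToProcess,
  };
}

const config = loadConfig({
  OPENAI_API_KEY: 'sk-test',
  URLS_TO_PROCESS: '[{"url":"https://example.com", "content":"blog articles"}]',
});
console.log(config.urlsToProcess[0].content); // prints "blog articles"
```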
You can run the project using the command-line interface (CLI) or programmatically from app.js.
### Start Scraping

To start scraping and processing URLs, run:

```
npm start
```

Follow the on-screen prompts to initiate scraping or clear generated files.
### Clear Generated Files
To clear the extractedData, scripts, and html directories, select the "Clear Generated Files" option in the CLI menu.
## Customization

- Add URLs to the URLS_TO_PROCESS environment variable in the .env file.
- Modify the scraping logic and output structure in the relevant service files (such as scraperService.js).
## Contribution
Feel free to fork the project and submit pull requests. Make sure to test any changes thoroughly.
## License
This project is licensed under the MIT License.