A Next.js project that demonstrates using Puppeteer within API routes for server-side web scraping, automation, and rendering content to a web page. This setup integrates headless-browser functionality directly into a Next.js application for dynamic content generation; a minimal sketch of the pattern follows the feature list.
- Advanced Web Scraping: Extract specific content from webpages using CSS selectors and text search
- Screenshot API: Capture screenshots of any webpage
- PDF Generation: Convert HTML to PDF documents
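For example, the Screenshot API could be implemented as a Next.js API route along these lines. This is a minimal sketch assuming the Pages Router and a `pages/api/screenshot.ts` file; the project's actual implementation may differ.

```typescript
// pages/api/screenshot.ts: a minimal sketch, not necessarily this
// project's exact implementation.
import type { NextApiRequest, NextApiResponse } from "next";
import puppeteer from "puppeteer";

export default async function handler(
  req: NextApiRequest,
  res: NextApiResponse
) {
  // Read the target URL from the query string, with a fallback.
  const url =
    typeof req.query.url === "string" ? req.query.url : "https://example.com";

  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so the page is fully rendered.
    await page.goto(url, { waitUntil: "networkidle0" });
    const png = await page.screenshot({ type: "png" });
    res.setHeader("Content-Type", "image/png");
    res.send(Buffer.from(png));
  } finally {
    // Always release the browser, even if rendering fails.
    await browser.close();
  }
}
```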
First, install dependencies:

```bash
npm install
```

Next, run the development server:

```bash
npm run dev
```
The advanced web scraping functionality (`/scraper-ext`) provides the following (see the implementation sketch after this list):
- CSS Selector Targeting: Extract specific elements from webpages
- Text Search: Filter elements containing specific text
- Sibling Analysis: Extract content from elements that follow your target elements
- Interactive UI: User-friendly interface for configuring scraping parameters
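One way the selector and text-search extraction could be implemented is with Puppeteer's `page.$$eval`, as sketched below. The helper name `scrapeElements` and its parameters are illustrative assumptions, not this project's actual code.

```typescript
import puppeteer from "puppeteer";

// Illustrative helper (not the project's actual code): return the HTML of
// every element matching `selector`, optionally filtered to elements whose
// text contains `searchText`, plus the sibling that follows each match.
async function scrapeElements(url: string, selector: string, searchText = "") {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded" });
    return await page.$$eval(
      selector,
      (elements, needle) =>
        elements
          .filter((el) => !needle || (el.textContent ?? "").includes(needle))
          .map((el) => ({
            html: el.outerHTML,
            // "Sibling analysis": capture the element that follows the match.
            nextSibling: el.nextElementSibling?.outerHTML ?? null,
          })),
      searchText
    );
  } finally {
    await browser.close();
  }
}
```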
To use the scraping UI:
- Navigate to http://localhost:3000/scraper-ext
- Enter the URL you want to scrape
- Specify a CSS selector (e.g., `h1`, `.article-content`, `#main`)
- Optionally, enter text to search for within those elements
- View the extracted content with HTML formatting
Access the following endpoints in your browser:
- Web Scraping UI: http://localhost:3000/scraper-ext (interactive UI for extracting content from websites)
- Screenshots: http://localhost:3000/api/screenshot?url=https://www.bbc.co.uk
- PDF Generation: http://localhost:3000/api/pdf
If you have the command-line utility curl installed, you can call the endpoints directly:

```bash
curl "http://localhost:3000/api/scrape-ext?url=https://example.com&selector=h1&text=Example"
curl -O --output-dir <path_to_local_directory> "http://localhost:3000/api/screenshot?url=https://www.bbc.co.uk"
curl -O --output-dir <path_to_local_directory> "http://localhost:3000/api/pdf"
```
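The same endpoints can also be called from a short Node script (Node 18+ ships a global fetch). This sketch and its file name are illustrative; run it as an ES module (e.g., with tsx) while the dev server is up.

```typescript
// save-screenshot.ts: fetch the screenshot endpoint and write the PNG to disk.
import { writeFile } from "node:fs/promises";

const res = await fetch(
  "http://localhost:3000/api/screenshot?url=https://www.bbc.co.uk"
);
if (!res.ok) throw new Error(`Request failed: ${res.status}`);
await writeFile("screenshot.png", Buffer.from(await res.arrayBuffer()));
```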
Notes:
- Replace `<path_to_local_directory>` with a local directory such as `~/Downloads`
- It may take a few seconds to generate screenshots or PDFs
- The PDF example pulls HTML from the Next.js default page at http://localhost:3000
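For reference, the PDF endpoint could be implemented along these lines. This is a minimal sketch assuming the Pages Router and a `pages/api/pdf.ts` file, not necessarily the project's exact code.

```typescript
// pages/api/pdf.ts: a minimal sketch that renders the app's own default
// page and returns it as an A4 PDF.
import type { NextApiRequest, NextApiResponse } from "next";
import puppeteer from "puppeteer";

export default async function handler(
  _req: NextApiRequest,
  res: NextApiResponse
) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // As noted above, the example pulls HTML from the Next.js default page.
    await page.goto("http://localhost:3000", { waitUntil: "networkidle0" });
    const pdf = await page.pdf({ format: "a4", printBackground: true });
    res.setHeader("Content-Type", "application/pdf");
    res.send(Buffer.from(pdf));
  } finally {
    await browser.close();
  }
}
```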