A comprehensive Node.js tool for scraping Webflow websites and creating fully functional offline versions. This project was developed to scrape FruitPunch AI and create a complete offline copy with all assets, dynamic content, and working internal links.
- ✅ Complete Website Scraping: Downloads all HTML pages, CSS, JavaScript, images, fonts, and other assets
- ✅ Dynamic Content Support: Handles Webflow CMS content loaded via JavaScript/API calls
- ✅ API Interception: Captures and caches API responses for offline functionality
- ✅ Cloudflare Bypass: Supports manual Cloudflare challenge completion in headful mode
- ✅ Link Rewriting: Converts absolute internal links to relative paths for offline navigation
- ✅ Asset Management: Organizes all assets in a centralized `_assets` directory
- ✅ Popup Removal: Automatically hides/removes newsletter popups and tracking scripts
- ✅ Incremental Scraping: Re-scrape only blank or failed pages without re-downloading everything
- ✅ Screenshot Capture: Takes full-page screenshots of each scraped page for verification
- Node.js (v16 or higher)
- npm or yarn
- Playwright (automatically installed via npm)
- Clone this repository:

  ```bash
  git clone https://github.com/BusterFranken/webflow-scraper.git
  cd webflow-scraper
  ```

- Install dependencies:

  ```bash
  npm install
  ```

Run the main scraper to download the entire website:

```bash
node scrape.mjs
```

This will:
- Fetch the sitemap from the target website
- Download all pages and assets
- Save everything to the `offline/` directory
- Generate a report in `offline/_report.json`
Note: The scraper runs in headful mode (visible browser) to allow manual completion of Cloudflare challenges if they appear.
If some pages come back blank (often due to dynamic content loading), use the re-scraper:
```bash
node rescrape-blank.mjs
```

This script:
- Identifies pages with minimal content
- Re-scrapes only those pages with extended timeouts
- Waits for CMS content to fully load
- Intercepts and caches API responses
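The README doesn't spell out how "pages with minimal content" are identified. A minimal sketch of one plausible heuristic (the `isBlankPage` helper and the 200-character threshold are illustrative assumptions, not the script's actual logic):

```javascript
// Heuristic check for pages whose body rendered with almost no visible
// text, which usually means CMS content never loaded before capture.
function isBlankPage(html, minTextLength = 200) {
  // Strip scripts, styles, and tags, then measure the remaining text.
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<style[\s\S]*?<\/style>/gi, '')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
  return text.length < minTextLength;
}
```

Pages flagged this way would then be queued for a second pass with longer timeouts.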
After scraping, fix internal links to work offline:
```bash
node fix-links.mjs
```

This converts all absolute internal links to relative paths.
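To illustrate the idea behind the link conversion, here is a minimal sketch. The helper name, the base URL, and the directory-per-page layout (each page saved as `<slug>/index.html`) are assumptions about this project's output, not its documented behavior:

```javascript
// Map an absolute link on the scraped domain to its local page file so
// navigation works when serving from the offline/ directory root.
// External links are left untouched.
function toOfflineHref(href, base = 'https://www.fruitpunch.ai') {
  if (!href.startsWith(base)) return href; // external link: leave as-is
  const path = new URL(href).pathname.replace(/\/$/, '');
  return path === '' ? '/index/index.html' : `${path}/index.html`;
}
```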
Generate a list of all localhost URLs for easy navigation:
```bash
node generate-links.mjs
```

This creates `offline/_localhost-links.txt` with all page URLs.
Start a local HTTP server to view the scraped website:
```bash
cd offline
python3 -m http.server 8000
```

Then open http://localhost:8000/index/index.html in your browser.
Edit scrape.mjs to customize:
- Target Website: Change the `BASE` constant (line ~27)
- Output Directory: Modify the `OUT` constant (line ~28)
- Concurrency: Adjust `CONCURRENCY` for parallel requests (line ~29)
- Timeout: Change `TIMEOUT` for page loading (line ~30)
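For orientation, the top of `scrape.mjs` might look roughly like this (the values shown are illustrative defaults, not necessarily the script's actual ones):

```javascript
// Tunable constants near the top of scrape.mjs (illustrative values).
const BASE = 'https://www.fruitpunch.ai'; // target website root
const OUT = 'offline';                    // output directory
const CONCURRENCY = 4;                    // parallel page scrapes
const TIMEOUT = 60_000;                   // per-page load timeout (ms)
```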
```
webflow-scraper/
├── scrape.mjs               # Main scraping script
├── rescrape-blank.mjs       # Re-scraper for blank pages
├── fix-links.mjs            # Link fixing utility
├── generate-links.mjs       # Localhost links generator
├── package.json             # Dependencies
├── .gitignore               # Git ignore rules
└── offline/                 # Scraped website content
    ├── _assets/             # All downloaded assets (CSS, JS, images, fonts)
    ├── _api_data/           # Cached API responses for offline use
    ├── _report.json         # Scraping report
    ├── _sitemap.xml         # Cached sitemap
    ├── _localhost-links.txt # List of localhost URLs
    └── [page directories]/  # HTML pages organized by URL structure
        ├── index.html       # Page content
        └── screenshot.png   # Full-page screenshot
```
The scraper fetches the website's sitemap.xml to discover all pages. If the sitemap is protected by Cloudflare, it uses Playwright to bypass the challenge.
For each page:
- Navigates to the URL using Playwright
- Waits for Cloudflare challenges (if any)
- Detects CMS pages and waits for dynamic content to load
- Scrolls the page to trigger lazy-loaded content
- Intercepts API calls and caches responses
- Captures the fully rendered HTML from the DOM
- Intercepts all network requests to capture assets
- Downloads CSS, JavaScript, images, fonts, and other resources
- Parses CSS files to extract referenced assets (fonts, images in `url()`)
- Rewrites asset paths to point to local files
- Removes tracking scripts and external dependencies
- Rewrites internal links to relative paths
- Injects API interceptor script for offline API calls
- Hides popups and overlays with CSS
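The path-rewriting step above implies a mapping from each remote asset URL to a file under `_assets/`. A minimal sketch of what that mapping might look like (the `localAssetPath` helper and the flattened host-plus-path naming scheme are assumptions for illustration, not the scraper's documented behavior):

```javascript
// Map a remote asset URL to a local path under _assets/. Flattening the
// URL into one directory keeps every rewritten reference pointing at a
// single centralized location. Query strings are ignored in this sketch.
function localAssetPath(assetUrl) {
  const u = new URL(assetUrl);
  const flat = (u.hostname + u.pathname).replace(/[^a-zA-Z0-9._-]+/g, '_');
  return `_assets/${flat}`;
}
```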
For Webflow CMS pages that load content dynamically:
- Intercepts `fetch()` and `XMLHttpRequest` calls to Webflow APIs
- Caches API responses as JSON files
- Injects client-side script to serve cached responses when offline
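The injected client-side script described above can be sketched as a thin wrapper around `fetch` that answers known API URLs from the cache and falls back to the network otherwise. The function names and the cache shape (URL mapped to parsed JSON) are assumptions for illustration, not the project's actual code:

```javascript
// Serve a cached JSON body for a known API URL, or null if uncached.
function serveCached(cache, url) {
  if (!(url in cache)) return null;
  return new Response(JSON.stringify(cache[url]), {
    status: 200,
    headers: { 'Content-Type': 'application/json' },
  });
}

// Replace the page's fetch with a cache-first version. Uncached URLs
// still go to the real fetch, so online behavior is unchanged.
function installOfflineFetch(cache) {
  const realFetch = globalThis.fetch;
  globalThis.fetch = (url, options) =>
    Promise.resolve(serveCached(cache, String(url)) ?? realFetch(url, options));
}
```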
- Run `node rescrape-blank.mjs` to re-scrape blank pages
- Check the browser console for JavaScript errors
- Verify API responses are cached in `offline/_api_data/`
- The scraper runs in headful mode - complete challenges manually in the browser window
- Reduce `CONCURRENCY` to 1 for easier manual intervention
- Check `offline/_report.json` for failed downloads
- Some assets may have very long filenames (ENAMETOOLONG errors); these are logged but don't break functionality
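If the ENAMETOOLONG failures matter for your archive, one way to avoid them is to cap filename length and append a short hash so truncated names stay distinct. This is a sketch of a possible workaround, not something the scraper currently does; the 200-byte cap and the djb2 hash are illustrative choices:

```javascript
// Truncate over-long filenames while keeping them unique: append a
// short deterministic hash (djb2) of the full original name.
function safeFilename(name, maxLength = 200) {
  if (name.length <= maxLength) return name;
  let h = 5381;
  for (const ch of name) h = ((h * 33) ^ ch.codePointAt(0)) >>> 0;
  const suffix = '_' + h.toString(16);
  return name.slice(0, maxLength - suffix.length) + suffix;
}
```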
- Run `node fix-links.mjs` to rewrite internal links
- Ensure you're viewing via an HTTP server, not the `file://` protocol
- Dynamic JavaScript: Some JavaScript that requires live API connections may not work offline
- External Services: Forms, search, and other features requiring backend services won't function offline
- Large Repositories: The scraped content can be quite large (hundreds of MB) due to all assets being downloaded
This project is provided as-is for educational and archival purposes.
Developed to create an offline archive of the FruitPunch AI website.