2 changes: 2 additions & 0 deletions .agent/README.md
@@ -212,6 +212,7 @@ Both SDKs support the following endpoints:
| SmartScraper | βœ… | βœ… | AI-powered data extraction |
| SearchScraper | βœ… | βœ… | Multi-website search extraction |
| Markdownify | βœ… | βœ… | HTML to Markdown conversion |
| Sitemap | ❌ | βœ… | Sitemap URL extraction |
| SmartCrawler | βœ… | βœ… | Sitemap generation & crawling |
| AgenticScraper | βœ… | βœ… | Browser automation |
| Scrape | βœ… | βœ… | Basic HTML extraction |
@@ -259,6 +260,7 @@ Both SDKs support the following endpoints:
- `searchScraper.js`
- `crawl.js`
- `markdownify.js`
- `sitemap.js`
- `agenticScraper.js`
- `scrape.js`
- `scheduledJobs.js`
36 changes: 36 additions & 0 deletions scrapegraph-js/README.md
@@ -451,6 +451,27 @@ const url = 'https://scrapegraphai.com/';
})();
```

### Sitemap

Extract all URLs from a website's sitemap. The sitemap is discovered automatically from robots.txt or common sitemap locations.

```javascript
import { sitemap } from 'scrapegraph-js';

const apiKey = 'your-api-key';
const websiteUrl = 'https://example.com';

(async () => {
try {
const response = await sitemap(apiKey, websiteUrl);
console.log('Total URLs found:', response.urls.length);
console.log('URLs:', response.urls);
} catch (error) {
console.error('Error:', error);
}
})();
```

### Checking API Credits

```javascript
@@ -688,6 +709,21 @@ Starts a crawl job to extract structured data from a website and its linked page

Converts a webpage into clean, well-structured markdown format.

### Sitemap

#### `sitemap(apiKey, websiteUrl, options)`

Extracts all URLs from a website's sitemap. The sitemap is discovered automatically from robots.txt or common sitemap locations.

**Parameters:**
- `apiKey` (string): Your ScrapeGraph AI API key
- `websiteUrl` (string): The URL of the website to extract the sitemap from
- `options` (object, optional): Additional options
- `mock` (boolean): Override mock mode for this request

**Returns:** Promise resolving to an object containing:
- `urls` (array): List of URLs extracted from the sitemap
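
A minimal usage sketch based on the signature above; the API key and URL are placeholders, and `mock` is the only option documented here:

```javascript
import { sitemap } from 'scrapegraph-js';

const apiKey = 'your-api-key';

// Pass options as the third argument; `mock: true` overrides mock mode
// for this single request (per the options list above)
const response = await sitemap(apiKey, 'https://example.com', { mock: true });

// The documented return value is an object with a `urls` array
console.log(`Found ${response.urls.length} URLs`);
```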

### Agentic Scraper

#### `agenticScraper(apiKey, url, steps, useSession, userPrompt, outputSchema, aiExtraction)`
128 changes: 128 additions & 0 deletions scrapegraph-js/examples/sitemap/README.md
@@ -0,0 +1,128 @@
# Sitemap Examples

This directory contains examples demonstrating how to use the `sitemap` endpoint to extract URLs from website sitemaps.

## πŸ“ Examples

### 1. Basic Sitemap Extraction (`sitemap_example.js`)

Demonstrates the basic usage of the sitemap endpoint:
- Extract all URLs from a website's sitemap
- Display the URLs
- Save URLs to a text file
- Save complete response as JSON

**Usage:**
```bash
node sitemap_example.js
```

**What it does:**
1. Calls the sitemap API with a target website URL
2. Retrieves all URLs from the sitemap
3. Displays the first 10 URLs in the console
4. Saves all URLs to `sitemap_urls.txt`
5. Saves the full response to `sitemap_urls.json` (shape sketched below)
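
Based on the documented return shape, the saved JSON looks roughly like this (the URLs are illustrative, and fields beyond `urls` are not guaranteed):

```
{
  "urls": [
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/products"
  ]
}
```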

### 2. Advanced: Sitemap + SmartScraper (`sitemap_with_smartscraper.js`)

Shows how to combine sitemap extraction with smartScraper for batch processing:
- Extract sitemap URLs
- Filter URLs based on patterns (e.g., blog posts)
- Scrape selected URLs with smartScraper
- Display results and summary

**Usage:**
```bash
node sitemap_with_smartscraper.js
```

**What it does:**
1. Extracts all URLs from a website's sitemap
2. Filters URLs (example: only blog posts or specific sections)
3. Scrapes each filtered URL using smartScraper
4. Extracts structured data from each page
5. Displays a summary of successful and failed scrapes

**Use Cases:**
- Bulk content extraction from blogs
- E-commerce product catalog scraping
- News article aggregation
- Content migration and archival

## πŸ”‘ Setup

Before running the examples, make sure you have:

1. **API Key**: Set your ScrapeGraph AI API key as an environment variable:
```bash
export SGAI_APIKEY="your-api-key-here"
```

Or create a `.env` file in the project root:
```
SGAI_APIKEY=your-api-key-here
```

2. **Dependencies**: Install required packages:
```bash
npm install
```

## πŸ“Š Expected Output

### Basic Sitemap Example Output:
```
πŸ—ΊοΈ Extracting sitemap from: https://example.com/
⏳ Please wait...

βœ… Sitemap extracted successfully!
πŸ“Š Total URLs found: 150

πŸ“„ First 10 URLs:
1. https://example.com/
2. https://example.com/about
3. https://example.com/products
...

πŸ’Ύ URLs saved to: sitemap_urls.txt
πŸ’Ύ JSON saved to: sitemap_urls.json
```

### Advanced Example Output:
```
πŸ—ΊοΈ Step 1: Extracting sitemap from: https://example.com/
⏳ Please wait...

βœ… Sitemap extracted successfully!
πŸ“Š Total URLs found: 150

🎯 Selected 3 URLs to scrape:
1. https://example.com/blog/post-1
2. https://example.com/blog/post-2
3. https://example.com/blog/post-3

πŸ€– Step 2: Scraping selected URLs...

πŸ“„ Scraping (1/3): https://example.com/blog/post-1
βœ… Success
...

πŸ“ˆ Summary:
βœ… Successful: 3
❌ Failed: 0
πŸ“Š Total: 3
```

## πŸ’‘ Tips

1. **Rate Limiting**: When scraping multiple URLs, add delays between requests to avoid rate limiting
2. **Error Handling**: Always use try/catch blocks to handle API errors gracefully
3. **Filtering**: Use URL patterns to select specific sections (e.g., `/blog/`, `/products/`)
4. **Batch Size**: Start with a small batch to test before processing hundreds of URLs (all four tips are combined in the sketch below)
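
A short sketch combining the tips above; the `/blog/` pattern, five-URL batch, and one-second delay are illustrative values:

```javascript
import { sitemap, smartScraper } from 'scrapegraph-js';
import 'dotenv/config';

const apiKey = process.env.SGAI_APIKEY;

// Tip 3: filter the sitemap down to one section of the site
const { urls } = await sitemap(apiKey, 'https://example.com');
const blogUrls = urls.filter((u) => u.includes('/blog/'));

// Tip 4: start with a small test batch
for (const url of blogUrls.slice(0, 5)) {
  // Tip 2: handle per-URL failures without aborting the whole run
  try {
    const response = await smartScraper(apiKey, url, 'Extract the page title');
    console.log(url, response.result);
  } catch (err) {
    console.error(url, err.message);
  }
  // Tip 1: pause between requests to avoid rate limiting
  await new Promise((resolve) => setTimeout(resolve, 1000));
}
```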

## πŸ”— Related Documentation

- [Sitemap API Documentation](../../README.md#sitemap)
- [SmartScraper Documentation](../../README.md#smart-scraper)
- [ScrapeGraph AI API Docs](https://docs.scrapegraphai.com)
72 changes: 72 additions & 0 deletions scrapegraph-js/examples/sitemap/sitemap_example.js
@@ -0,0 +1,72 @@
import { sitemap } from 'scrapegraph-js';
import fs from 'fs';
import 'dotenv/config';

/**
* Example: Extract sitemap URLs from a website
*
* This example demonstrates how to use the sitemap endpoint to extract
* all URLs from a website's sitemap.xml file.
*/

// Get API key from environment variable
const apiKey = process.env.SGAI_APIKEY;

// Target website URL
const url = 'https://scrapegraphai.com/';

console.log('πŸ—ΊοΈ Extracting sitemap from:', url);
console.log('⏳ Please wait...\n');

try {
// Call the sitemap endpoint
const response = await sitemap(apiKey, url);

console.log('βœ… Sitemap extracted successfully!');
console.log(`πŸ“Š Total URLs found: ${response.urls.length}\n`);

// Display first 10 URLs
console.log('πŸ“„ First 10 URLs:');
response.urls.slice(0, 10).forEach((url, index) => {
console.log(` ${index + 1}. ${url}`);
});

if (response.urls.length > 10) {
console.log(` ... and ${response.urls.length - 10} more URLs`);
}

// Save the complete list to a file
saveUrlsToFile(response.urls, 'sitemap_urls.txt');

// Save as JSON for programmatic use
saveUrlsToJson(response, 'sitemap_urls.json');

} catch (error) {
console.error('❌ Error:', error.message);
process.exit(1);
}

/**
* Helper function to save URLs to a text file
*/
function saveUrlsToFile(urls, filename) {
try {
const content = urls.join('\n');
fs.writeFileSync(filename, content);
console.log(`\nπŸ’Ύ URLs saved to: ${filename}`);
} catch (err) {
console.error('❌ Error saving file:', err.message);
}
}

/**
* Helper function to save complete response as JSON
*/
function saveUrlsToJson(response, filename) {
try {
fs.writeFileSync(filename, JSON.stringify(response, null, 2));
console.log(`πŸ’Ύ JSON saved to: ${filename}`);
} catch (err) {
console.error('❌ Error saving JSON:', err.message);
}
}
106 changes: 106 additions & 0 deletions scrapegraph-js/examples/sitemap/sitemap_with_smartscraper.js
@@ -0,0 +1,106 @@
import { sitemap, smartScraper } from 'scrapegraph-js';
import 'dotenv/config';

/**
* Advanced Example: Extract sitemap and scrape selected URLs
*
* This example demonstrates how to combine the sitemap endpoint
* with smartScraper to extract structured data from multiple pages.
*/

const apiKey = process.env.SGAI_APIKEY;

// Configuration
const websiteUrl = 'https://scrapegraphai.com/';
const maxPagesToScrape = 3; // Limit number of pages to scrape
const userPrompt = 'Extract the page title and main heading';

console.log('πŸ—ΊοΈ Step 1: Extracting sitemap from:', websiteUrl);
console.log('⏳ Please wait...\n');

try {
// Step 1: Get all URLs from sitemap
const sitemapResponse = await sitemap(apiKey, websiteUrl);

console.log('βœ… Sitemap extracted successfully!');
console.log(`πŸ“Š Total URLs found: ${sitemapResponse.urls.length}\n`);

// Step 2: Filter URLs (example: only blog posts)
const filteredUrls = sitemapResponse.urls
.filter(url => url.includes('/blog/') || url.includes('/post/'))
.slice(0, maxPagesToScrape);

if (filteredUrls.length === 0) {
console.log('ℹ️ No blog URLs found, using first 3 URLs instead');
filteredUrls.push(...sitemapResponse.urls.slice(0, maxPagesToScrape));
}

console.log(`🎯 Selected ${filteredUrls.length} URLs to scrape:`);
filteredUrls.forEach((url, index) => {
console.log(` ${index + 1}. ${url}`);
});

// Step 3: Scrape each selected URL
console.log('\nπŸ€– Step 2: Scraping selected URLs...\n');

const results = [];

for (let i = 0; i < filteredUrls.length; i++) {
const url = filteredUrls[i];
console.log(`πŸ“„ Scraping (${i + 1}/${filteredUrls.length}): ${url}`);

try {
const scrapeResponse = await smartScraper(
apiKey,
url,
userPrompt
);

results.push({
url: url,
data: scrapeResponse.result,
status: 'success'
});

console.log(' βœ… Success');

// Add a small delay between requests to avoid rate limiting
if (i < filteredUrls.length - 1) {
await new Promise(resolve => setTimeout(resolve, 1000));
}

} catch (error) {
console.log(` ❌ Failed: ${error.message}`);
results.push({
url: url,
error: error.message,
status: 'failed'
});
}
}

// Step 4: Display results
console.log('\nπŸ“Š Scraping Results:\n');
results.forEach((result, index) => {
console.log(`${index + 1}. ${result.url}`);
if (result.status === 'success') {
console.log(' Status: βœ… Success');
console.log(' Data:', JSON.stringify(result.data, null, 2));
} else {
console.log(' Status: ❌ Failed');
console.log(' Error:', result.error);
}
console.log('');
});

// Summary
const successCount = results.filter(r => r.status === 'success').length;
console.log('πŸ“ˆ Summary:');
console.log(` βœ… Successful: ${successCount}`);
console.log(` ❌ Failed: ${results.length - successCount}`);
console.log(` πŸ“Š Total: ${results.length}`);

} catch (error) {
console.error('❌ Error:', error.message);
process.exit(1);
}