A Python-based web scraper for collecting item data from Final Fantasy XI (FFXI) Auction House websites. This tool uses headless browsers to navigate and extract item information, prices, stock levels, and other relevant data for price comparison and analysis.
- Flexible Browser Support: Choose between Playwright or Selenium for web scraping
- Headless Operation: Run browsers in headless mode for efficient scraping
- HTML Parsing: Robust HTML parsing using BeautifulSoup
- Multiple Export Formats: Export data to JSON, CSV, or both
- Retry Logic: Automatic retry with exponential backoff for failed requests
- Configurable: YAML-based configuration for easy customization
- Logging: Comprehensive logging with Loguru
- Modular Design: Clean, extensible architecture
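The retry behavior listed above (exponential backoff on failed requests) can be sketched as follows; `fetch_with_retry` and its parameters are illustrative stand-ins for the actual implementation in `base_scraper.py`:

```python
import time

def fetch_with_retry(fetch, url, max_retries=3, base_delay=2.0):
    """Call fetch(url), retrying on failure with exponential backoff.

    The delay doubles after each failed attempt:
    base_delay, 2*base_delay, 4*base_delay, ...
    """
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; propagate the last error
            time.sleep(base_delay * (2 ** attempt))

# Example: a flaky fetcher that succeeds on the third attempt
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return f"<html>ok: {url}</html>"

print(fetch_with_retry(flaky, "https://example.com", base_delay=0.01))
```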
```
FFXI-python-AH-Scrapper/
├── src/
│   └── ffxi_ah_scraper/
│       ├── scrapers/          # Browser automation modules
│       │   ├── base_scraper.py
│       │   ├── playwright_scraper.py
│       │   └── selenium_scraper.py
│       ├── parsers/           # HTML parsing modules
│       │   ├── html_parser.py
│       │   └── item_parser.py
│       ├── exporters/         # Data export modules
│       │   ├── data_exporter.py
│       │   ├── json_exporter.py
│       │   └── csv_exporter.py
│       └── utils/             # Utility modules
│           ├── config_loader.py
│           └── logger_setup.py
├── data/
│   ├── raw/                   # Raw HTML files
│   └── processed/             # Exported data (JSON/CSV)
├── tests/                     # Test files
├── config.yaml                # Configuration file
├── main.py                    # Main entry point
└── requirements.txt           # Python dependencies
```
- Python 3.8 or higher
- pip package manager
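Before installing, you can confirm the interpreter meets the version requirement with a quick standard-library check:

```python
import sys

# The project requires Python 3.8 or newer; check before installing.
print(sys.version_info >= (3, 8))  # True on a supported interpreter
```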
- Clone the repository:

```bash
git clone https://github.com/yourusername/FFXI-python-AH-Scrapper.git
cd FFXI-python-AH-Scrapper
```

- Create a virtual environment (recommended):

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Install Playwright browsers (if using Playwright):

```bash
playwright install chromium
```

- Copy the environment template:

```bash
cp .env.example .env
```

Edit `config.yaml` to customize the scraper:
```yaml
# Scraper Settings
scraper:
  browser_type: "playwright"  # or "selenium"
  headless: true
  timeout: 30000
  request_delay: 2
  max_retries: 3

# URLs (update with actual FFXI AH endpoints)
urls:
  base_url: "https://www.ffxiah.com"
  item_search: "/browse"
  item_details: "/item/{item_id}"

# Export Settings
export:
  output_dir: "data/processed"
  format: "both"  # "json", "csv", or "both"
  save_raw_html: true
```

- Update the configuration file with actual FFXI AH URLs
- Modify `main.py` to uncomment the example code and add your scraping logic
- Run the scraper:

```bash
python main.py
```

Scrape a single item page:

```python
from ffxi_ah_scraper.scrapers.playwright_scraper import PlaywrightScraper
from ffxi_ah_scraper.parsers.item_parser import ItemParser
from ffxi_ah_scraper.utils.config_loader import load_config

# Load configuration
config = load_config("config.yaml")

# Initialize scraper
with PlaywrightScraper(config) as scraper:
    # Scrape item page
    html = scraper.scrape_with_retry("https://www.ffxiah.com/item/4096")

    # Parse the data
    parser = ItemParser(html)
    item_data = parser.extract_item_data()
    print(item_data)
```

Scrape search results and export them:

```python
from ffxi_ah_scraper.scrapers.playwright_scraper import PlaywrightScraper
from ffxi_ah_scraper.parsers.item_parser import ItemParser
from ffxi_ah_scraper.exporters.json_exporter import JSONExporter
from ffxi_ah_scraper.utils.config_loader import load_config

config = load_config("config.yaml")

with PlaywrightScraper(config) as scraper:
    # Get search results
    html = scraper.scrape_with_retry("https://www.ffxiah.com/browse?q=potion")
    parser = ItemParser(html)
    items = parser.extract_search_results()

    # Export to JSON
    exporter = JSONExporter("data/processed")
    exporter.export(items, "search_results")
```

Extend the `HTMLParser` class to create custom parsers:

```python
from ffxi_ah_scraper.parsers.html_parser import HTMLParser

class CustomParser(HTMLParser):
    def extract_custom_data(self):
        # Your custom extraction logic
        return self.get_text(".custom-selector")
```

Extend the `DataExporter` class for custom export formats:

```python
from ffxi_ah_scraper.exporters.data_exporter import DataExporter

class XMLExporter(DataExporter):
    def export(self, data, filename):
        # Your custom export logic
        pass
```

Example JSON output:

```json
{
  "item_id": "4096",
  "item_name": "Fire Crystal",
  "category": "Crystals",
  "price_data": [
    {
      "server": "Bahamut",
      "price": 1500,
      "stock": 100,
      "seller": "Merchant1"
    }
  ],
  "last_updated": "2025-12-29T12:00:00"
}
```

Example CSV output:

```csv
item_id,item_name,category,server,price,stock
4096,Fire Crystal,Crystals,Bahamut,1500,100
```

Run the tests:

```bash
pytest tests/
```

Format and type-check the source:

```bash
black src/
mypy src/
```

- Always respect the website's `robots.txt` file
- Implement appropriate delays between requests to avoid overloading servers
- Review and comply with the website's Terms of Service
- Consider using official APIs if available
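The `robots.txt` rule above can be checked programmatically with the standard library's `urllib.robotparser`; the rules below are made up for illustration (in practice you would point `set_url` at the live site's `robots.txt` and call `read`):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Normally: rp.set_url("https://www.ffxiah.com/robots.txt"); rp.read()
# Here we parse example rules directly to keep the snippet offline.
rp.parse("""
User-agent: *
Disallow: /private/
""".splitlines())

print(rp.can_fetch("*", "https://www.ffxiah.com/item/4096"))  # True
print(rp.can_fetch("*", "https://www.ffxiah.com/private/x"))  # False
```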
The current parser implementations use placeholder CSS selectors. You'll need to:

- Inspect the actual FFXI AH website's HTML structure
- Update the selectors in `item_parser.py` to match the actual elements
- Test the selectors to ensure accurate data extraction
- Playwright: Automatically downloads and manages browser binaries
- Selenium: May require manual ChromeDriver installation/configuration
- ImportError: Make sure you're in the virtual environment and dependencies are installed
- Browser not found: Run `playwright install chromium` for Playwright
- Timeout errors: Increase the timeout value in `config.yaml`
- Parsing errors: Check and update CSS selectors for the current website structure
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is provided as-is for educational purposes. Please ensure you comply with all applicable laws and website terms of service when using this tool.
This scraper is a tool for data collection and should be used responsibly. The authors are not responsible for misuse or any violations of terms of service. Always verify that your use case complies with the target website's policies and applicable laws.
- Database integration for storing scraped data
- Scheduler for periodic scraping
- Price trend analysis
- Multi-server comparison
- Rate limiting configuration
- Proxy support
- User authentication handling
- API endpoint creation for scraped data
For issues, questions, or contributions, please open an issue on GitHub.