Abdullah-Al-Raju/LeadScraper
LeadScraper 🤖

AI-Powered Business Contact Finder - Fully Automated

Find hundreds of business contacts with one command. No manual searching, no data entry, completely automated.




What Is This?

LeadScraper is an intelligent AI agent that automatically discovers businesses and extracts their complete contact information.

In simple terms: You tell it "Find me 100 restaurants in New York" and it does all the work:

  1. Searches the internet to find restaurant names
  2. Researches each restaurant across multiple sources (websites, Facebook, Instagram)
  3. Extracts phone numbers, emails, addresses, social media
  4. Verifies data quality and assigns confidence scores
  5. Saves everything to a Google Sheet

All you do is run one command and wait.


Why Use This?

Time Savings

  • Manual work: 40+ hours to find 100 business contacts
  • With LeadScraper: 1 hour, fully automated
  • You save: 39 hours of tedious work

Data Quality

  • 70% complete profiles (phone + email + address)
  • 30% partial profiles (some contact info)
  • 0% total failures (every business returns at least partial data)
  • Cross-verified across multiple sources

Use Cases

Sales Teams: Build prospect lists in any industry/location
Marketers: Gather leads for campaigns
Recruiters: Find companies in specific sectors
Event Planners: Discover venues and vendors
Researchers: Collect business data for analysis
Anyone who needs business contact information at scale


Key Features

🤖 AI-Powered Discovery

  • AI automatically finds businesses based on your criteria
  • Intelligent search strategy planning
  • Smart name extraction and validation
  • No manual input needed

🔄 Multi-Model AI System

  • Uses 4 different AI models simultaneously
  • Automatic rotation when rate limits hit
  • Never stops - always has a model ready
  • 4x higher throughput than single-model systems

📊 Multi-Source Extraction

Extracts data from:

  • Google Search results
  • Facebook business pages
  • Instagram business profiles
  • Official company websites

Then intelligently merges and verifies all data.

💎 Professional Terminal UI

  • Clean, color-coded progress messages
  • Real-time completion indicators
  • Easy to track which business is processing
  • Beautiful startup banner
  • No clutter or emojis

🛡️ Intelligent Data Fusion

  • Cross-verifies information across sources
  • Assigns confidence scores (0-100%)
  • Handles conflicting data intelligently
  • Prioritizes verified information

📈 Auto-Organization

  • Results saved to Google Sheets
  • Auto-deduplication
  • Auto-sorting by confidence
  • Real-time status updates

Complete Setup Guide

Prerequisites

You need:

  • A computer (Windows, Mac, or Linux)
  • Internet connection
  • 20 minutes for setup

No coding experience needed!


Step 1: Install Python

What is Python? The programming language that runs this tool.

How to install:

  1. Go to python.org/downloads
  2. Click the big yellow "Download Python" button
  3. Run the installer
  4. IMPORTANT: Check the box that says "Add Python to PATH"
  5. Click "Install Now"
  6. Verify installation:
    python --version
    You should see: Python 3.10 or higher

Step 2: Download This Project

Option A: Using Git (recommended)

  1. Install Git: git-scm.com/downloads
  2. Open Terminal/Command Prompt
  3. Navigate to where you want the project:
    cd Documents
  4. Clone the repository:
    git clone https://github.com/YOUR-USERNAME/LeadScraper.git
    cd LeadScraper

Option B: Download ZIP

  1. Click the green "Code" button on GitHub
  2. Click "Download ZIP"
  3. Extract the ZIP file
  4. Open Terminal/Command Prompt
  5. Navigate to the extracted folder:
    cd path/to/LeadScraper

Step 3: Install Dependencies

What are dependencies? Libraries this tool needs to work.

  1. Make sure you're in the LeadScraper folder
  2. Run this command:
    pip install -r requirements.txt
  3. Wait 1-2 minutes while it installs
  4. You'll see lots of text - that's normal!

Step 4: Get OpenRouter API Key

What is OpenRouter? The AI service that powers the intelligence.

Cost: FREE with the models we use!

Steps:

  1. Go to openrouter.ai
  2. Click "Sign Up" (top right)
  3. Sign up with Google/GitHub/Email
  4. Once logged in, click your profile icon
  5. Click "API Keys"
  6. Click "Create Key"
  7. Give it a name: "LeadScraper"
  8. Click "Create"
  9. COPY THE KEY - it looks like: sk-or-v1-abc123xyz...
  10. Save it somewhere safe - you'll need it in Step 6

Important: Don't share this key with anyone!


Step 5: Setup Google Sheets

Why Google Sheets? Your extracted data is saved here like a database.

5.1: Create Google Cloud Project

  1. Go to console.cloud.google.com
  2. Click "Select a project" (top left)
  3. Click "New Project"
  4. Project name: "LeadScraper"
  5. Click "Create"
  6. Wait 30 seconds for it to create

5.2: Enable Google Sheets API

  1. In the search bar, type: "Google Sheets API"
  2. Click on "Google Sheets API"
  3. Click "Enable"
  4. Wait for it to enable

5.3: Create Service Account

  1. Click "Credentials" (left sidebar)
  2. Click "Create Credentials" (top)
  3. Select "Service Account"
  4. Service account name: "leadscraper-bot"
  5. Click "Create and Continue"
  6. Role: Select "Editor"
  7. Click "Continue"
  8. Click "Done"

5.4: Create and Download Key

  1. Click on the service account you just created
  2. Click "Keys" tab
  3. Click "Add Key" → "Create new key"
  4. Select "JSON"
  5. Click "Create"
  6. A file will download: leadscraper-bot-xxxxx.json
  7. Rename it to: service_account.json

5.5: Place Credentials File

  1. In your LeadScraper folder, create a folder called credentials

  2. Move service_account.json into the credentials folder

    Your structure should look like:

    LeadScraper/
    ├── credentials/
    │   └── service_account.json
    ├── modules/
    ├── main.py
    └── ...
    

5.6: Create Google Sheet

  1. Go to sheets.google.com
  2. Click "Blank" to create new sheet
  3. Name it: "LeadScraper Database"
  4. Copy the Sheet ID from the URL:
    https://docs.google.com/spreadsheets/d/THIS_IS_THE_SHEET_ID/edit
    
  5. Save this ID - you'll need it in Step 6

5.7: Share Sheet with Service Account

  1. In your Google Sheet, click "Share" (top right)
  2. Open credentials/service_account.json in a text editor
  3. Find the line: "client_email": "leadscraper-bot@..."
  4. Copy that email address
  5. Back in Google Sheet, paste the email in "Add people"
  6. Make sure "Editor" is selected
  7. Uncheck "Notify people"
  8. Click "Share"

Step 6: Configure Environment File

What is .env? A file that stores your API keys securely.

  1. In the LeadScraper folder, create a file named .env

    • On Windows: Right-click → New → Text Document → Rename to .env
    • On Mac/Linux: touch .env
  2. Open .env in a text editor

  3. Paste this template:

    # OpenRouter API Key (from Step 4)
    OPENROUTER_API_KEY=sk-or-v1-your-key-here
    
    # Google Sheet ID (from Step 5.6)
    GOOGLE_SHEET_ID=your-sheet-id-here
    
    # Path to Google Sheets credentials
    SERVICE_ACCOUNT_FILE=credentials/service_account.json
  4. Replace the placeholders:

    • Replace sk-or-v1-your-key-here with your actual OpenRouter API key
    • Replace your-sheet-id-here with your actual Google Sheet ID
    • Keep SERVICE_ACCOUNT_FILE as is
  5. Save the file

Example:

OPENROUTER_API_KEY=sk-or-v1-abc123xyz789
GOOGLE_SHEET_ID=1a2b3c4d5e6f7g8h9i0j
SERVICE_ACCOUNT_FILE=credentials/service_account.json

Step 7: Test Setup

Let's make sure everything works!

  1. Run this command:

    python main.py --discover "New York" --quantity 3
  2. You should see:

    • Beautiful startup banner
    • Green "[SUCCESS]" messages
    • Cyan "[AI]" messages showing progress
    • Completion separators like:
      ────────────────────────────────────
      ────────────── 1/3 ────────────────
      ────────────────────────────────────
      
  3. Check your Google Sheet - you should see 3 businesses!

If it works: Setup complete! 🎉
If it doesn't: See Troubleshooting


How to Use

Basic Usage

Find businesses in one command:

python main.py --discover "LOCATION" --quantity NUMBER

Examples:

# Find 100 restaurants in New York
python main.py --discover "New York" --quantity 100

# Find 50 cafes in Los Angeles
python main.py --discover "Los Angeles" --quantity 50 --category cafe

# Find 200 gyms in Chicago
python main.py --discover "Chicago" --quantity 200 --category gym

# Quick test with 5 businesses
python main.py --discover "Miami" --quantity 5

Command Options

Option       Description                    Required  Default
--discover   Location to search             Yes       -
--quantity   Number of businesses to find   No        100
--category   Type of business               No        restaurant
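The options above map onto a standard argparse interface. A minimal sketch of how such a CLI is typically wired up (flag names and defaults come from the table; the actual main.py may differ):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="LeadScraper CLI")
    parser.add_argument("--discover", required=True, help="Location to search")
    parser.add_argument("--quantity", type=int, default=100,
                        help="Number of businesses to find")
    parser.add_argument("--category", default="restaurant",
                        help="Type of business")
    return parser

# Parse a sample command line instead of sys.argv, for demonstration.
args = build_parser().parse_args(
    ["--discover", "Chicago", "--quantity", "200", "--category", "gym"]
)
print(args.discover, args.quantity, args.category)  # Chicago 200 gym
```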

Categories You Can Use

Common categories:

  • restaurant - Restaurants, eateries
  • cafe - Coffee shops, cafes
  • gym - Gyms, fitness centers
  • salon - Hair salons, beauty
  • bar - Bars, pubs
  • hotel - Hotels, accommodations
  • shop - Retail stores
  • office - Business offices
  • clinic - Medical clinics
  • school - Schools, education

You can use ANY business type!


What Happens When You Run It?

Phase 1: Discovery (3-5 minutes for 100)

  1. AI plans search strategy
  2. AI generates smart search queries
  3. AI searches the web
  4. AI extracts business names
  5. AI validates and cleans results
  6. Populates Google Sheet Input tab

Phase 2: Extraction (~70 minutes for 100)

  1. For each business:
    • Searches Google for information
    • Searches Facebook for business page
    • Searches Instagram for profile
    • Crawls official website
    • Extracts all contact information
    • Fuses data from all sources
    • Assigns confidence score
    • Writes to Results sheet
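The "extracts all contact information" step above usually boils down to pattern matching over page text. An illustrative sketch with deliberately simplified patterns (the project's real extractor will be more thorough):

```python
import re

def extract_contacts(text: str) -> dict:
    """Pull US-style phone numbers and email addresses from raw page text."""
    # (212) 555-1234 style numbers only; real extractors handle many formats.
    phones = re.findall(r"\(\d{3}\)\s?\d{3}-\d{4}", text)
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
    return {"phones": phones, "emails": emails}

sample = "Call (212) 555-1234 or email info@pizzahut.com for reservations."
print(extract_contacts(sample))
```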

You see:

  • Beautiful color-coded progress
  • Completion separator after each business
  • Real-time updates in Google Sheet

Total time for 100 businesses: ~75 minutes


Understanding the Terminal Output

Colors mean:

  • Green [SUCCESS]: Operation completed successfully
  • Cyan [AI]: AI is working (planning, extracting, validating)
  • Yellow [WARNING]: Warning (rate limit, missing data)
  • Red [ERROR]: Error occurred
  • White [INFO]: General information

Progress separators:

────────────────────────────────────────────────────────────
────────────────────────── 47/100 ─────────────────────────
────────────────────────────────────────────────────────────

This means business #47 just finished!
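A separator like this is easy to render with `str.center` and a fill character. A small sketch of the idea (not the project's actual display.py):

```python
def progress_separator(done: int, total: int, width: int = 60) -> str:
    """Render the three-line completion separator shown after each business."""
    bar = "─" * width
    label = f" {done}/{total} ".center(width, "─")  # counter centered in dashes
    return f"{bar}\n{label}\n{bar}"

print(progress_separator(47, 100))
```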


Configuration Guide

All settings are in config.py. Here are the main ones:

AI Models

The tool rotates through 4 AI models automatically:

OPENROUTER_MODELS = [
    "arcee-ai/trinity-large-preview:free",      # Free, fast
    "stepfun/step-3.5-flash:free",              # Free, reliable
    "deepseek/deepseek-r1-0528:free",           # Free, accurate
    "openrouter/aurora-alpha"                    # Paid, premium
]

Want to change models? Browse available models at openrouter.ai/models

Speed Settings

DELAY_BETWEEN_LEADS = 1     # Seconds between businesses (1 = fast)
DELAY_JITTER = 1            # Random delay 0-1 seconds
MAX_RETRIES = 2             # How many times to retry on failure
SEARCH_DELAY = 1            # Seconds between searches

Want faster? Reduce delays (but may hit rate limits more)
Want more reliable? Increase delays and retries

Discovery Settings

DEFAULT_DISCOVERY_QUANTITY = 100     # Default number if not specified
MAX_DISCOVERY_QUANTITY = 500         # Maximum allowed
AUTO_CLEAR_INPUT_ON_DISCOVERY = True # Clear input before discovery

Source Settings

ENABLE_FACEBOOK = True      # Search Facebook pages
ENABLE_INSTAGRAM = True     # Search Instagram profiles
ENABLE_DIRECTORIES = True   # Extract from directory sites
ENABLE_MULTI_SEARCH = True  # Use multiple search engines

Want to skip a source? Set it to False

Extraction Limits

MAX_PHONES_PER_BUSINESS = 10    # Maximum phone numbers to extract
MAX_EMAILS_PER_BUSINESS = 10    # Maximum emails to extract
MAX_SOCIAL_LINKS = 20           # Maximum social media links

Understanding the Output

Google Sheet Structure

Your Google Sheet has 2 tabs:

Input Tab

Where discovered business names go before extraction.

Business Name  Category    City      Status
Pizza Hut      restaurant  New York  ✅ Done

Results Tab

Where extracted data is saved.

Columns: Business Name, Phone Numbers, Email Addresses, Address, City, State, Website, Facebook, Instagram, Confidence, Sources

Example row:

Business Name:    Pizza Hut
Phone Numbers:    (212) 555-1234, (212) 555-5678
Email Addresses:  info@pizzahut.com
Address:          123 Main St
City:             New York
State:            NY
Website:          pizzahut.com
Facebook:         facebook.com/pizzahut
Instagram:        instagram.com/pizzahut
Confidence:       95%
Sources:          website, facebook, instagram

Confidence Scores

What they mean:

  • 80-100%: Excellent - Multiple sources verified
  • 60-79%: Good - Some verification
  • 40-59%: Partial - Limited data
  • 0-39%: Low - Minimal information

90%+ means: Data is highly reliable, cross-verified across multiple sources.

Data Sources

Shows where the data came from:

  • search_results - Google search snippets
  • facebook_search - Facebook business page
  • instagram_search - Instagram profile
  • website - Official company website

More sources = Higher confidence!


Troubleshooting

"API key not found"

Problem: .env file not configured correctly

Solution:

  1. Make sure .env file exists in main folder
  2. Open .env and verify:
    OPENROUTER_API_KEY=sk-or-v1-your-actual-key
  3. No spaces around =
  4. No quotes around the key
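The two rules above (no spaces around `=`, no quotes around the key) can be checked in a few lines. A hypothetical helper, purely for illustration:

```python
def check_env_line(line: str) -> list[str]:
    """Flag the common .env mistakes described above for a single line."""
    problems = []
    if "=" not in line:
        problems.append("missing '='")
        return problems
    key, value = line.split("=", 1)
    # Spaces on either side of '=' break many .env parsers.
    if key != key.strip() or value != value.strip():
        problems.append("spaces around '='")
    # The value should be the bare key, not a quoted string.
    if value.strip().startswith(('"', "'")):
        problems.append("quotes around the value")
    return problems

print(check_env_line('OPENROUTER_API_KEY = "sk-or-v1-abc"'))  # flags both problems
```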

"Google Sheet ID not set"

Problem: Sheet ID missing from .env

Solution:

  1. Open .env
  2. Add your Sheet ID:
    GOOGLE_SHEET_ID=your-actual-sheet-id
  3. Get Sheet ID from URL:
    https://docs.google.com/spreadsheets/d/THIS_IS_IT/edit
    

"Permission denied" or "403 Error"

Problem: Google Sheet not shared with service account

Solution:

  1. Open credentials/service_account.json
  2. Find "client_email" - copy the email
  3. Open your Google Sheet
  4. Click "Share"
  5. Add that email as Editor
  6. Uncheck "Notify people"
  7. Click "Share"

"HTTP 429 Too Many Requests"

Problem: Hitting rate limits (normal, handled automatically!)

What happens: Tool automatically switches to another AI model

What you see: Yellow [WARNING] Rate limit on model X, rotating...

Action needed: None! It's working as designed.


"No results found"

Problem: AI couldn't find businesses in that location

Solutions:

  1. Try a larger city name
  2. Increase --quantity
  3. Try different category
  4. Check spelling of location

Script stops/crashes

Solutions:

  1. Check logs/scraper.log for error details
  2. Make sure internet connection is stable
  3. Restart and try again
  4. Try with smaller quantity first (--quantity 5)

Results look wrong

Check:

  1. Confidence score - Low scores mean uncertain data
  2. Sources - More sources = more reliable
  3. Google Sheet permissions - Make sure sheet is editable

FAQ

General Questions

Q: Do I need to know coding?
A: No! Just copy-paste the commands shown in this guide.

Q: Does this work on Mac/Windows/Linux?
A: Yes! Works on all platforms.

Q: How much does it cost?
A: $0 with the free AI models we use. OpenRouter offers generous free tiers.

Q: Is this legal?
A: Yes. It only searches publicly available information on the internet.

Q: Will I get banned?
A: Unlikely. The tool uses rate limiting and model rotation to stay within provider limits.


Data Questions

Q: How accurate is the data?
A: 70% of businesses get complete profiles. Data is cross-verified from multiple sources.

Q: Can I trust the confidence scores?
A: Yes. 80%+ confidence means data is verified across multiple sources.

Q: What if data is wrong?
A: Lower confidence scores indicate uncertain data. Always verify critical information.

Q: Can I export to CSV?
A: Yes! In Google Sheets: File → Download → CSV


Technical Questions

Q: How does multi-model rotation work?
A: When Model A hits a rate limit, the tool automatically switches to Model B, then C, then D. Each model gets a 60-second cooldown.

Q: Why use 4 models instead of 1?
A: 4x higher throughput. While one model is on cooldown, others keep working.

Q: Can I add more models?
A: Yes! Edit OPENROUTER_MODELS in config.py and add model names from openrouter.ai/models

Q: Can I use paid models?
A: Yes! Add paid model names to the list. They're faster and more accurate.

Q: Where are logs stored?
A: In logs/scraper.log. Check here if something goes wrong.


Usage Questions

Q: Can I run this on my server 24/7?
A: Yes! It's designed for long-running operations.

Q: Can I pause and resume?
A: Not yet. But you can stop and it will skip already processed items.

Q: Can I run multiple searches simultaneously?
A: Not recommended. Run them sequentially to avoid rate limits.

Q: What's the maximum quantity I can extract?
A: 500 businesses per run (configurable in config.py)

Q: How long does 100 businesses take?
A: About 75 minutes (5 min discovery + 70 min extraction)


Technical Architecture

System Overview

CLI Command → AI Discovery → Multi-Source Extraction → Data Fusion → Google Sheets

AI Manager (Multi-Model System)

AIManager
├── Model Pool [arcee, stepfun, deepseek, aurora]
├── Rate Limit Detector
├── Auto-Rotation Logic
└── Cooldown Manager (60s per model)

When a model hits HTTP 429:

  1. Mark model as "on cooldown"
  2. Select next available model
  3. Retry request with new model
  4. After 60s, original model available again
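The rotation described above can be modeled with a per-model cooldown timestamp. A simplified sketch (class and method names are illustrative, not the actual ai_manager.py):

```python
import time

class ModelRotator:
    """Pick the next model that is not on cooldown after a rate limit."""

    def __init__(self, models, cooldown=60):
        self.models = list(models)
        self.cooldown = cooldown
        # Timestamp until which each model is blocked (0 = available now).
        self.blocked_until = {m: 0.0 for m in self.models}

    def mark_rate_limited(self, model, now=None):
        # Put the model on cooldown for `cooldown` seconds.
        now = time.time() if now is None else now
        self.blocked_until[model] = now + self.cooldown

    def next_available(self, now=None):
        # Return the first model whose cooldown has expired, else None.
        now = time.time() if now is None else now
        for m in self.models:
            if self.blocked_until[m] <= now:
                return m
        return None

rotator = ModelRotator(["arcee", "stepfun", "deepseek", "aurora"])
rotator.mark_rate_limited("arcee", now=0)  # HTTP 429 on the first model
print(rotator.next_available(now=0))       # stepfun
print(rotator.next_available(now=61))      # arcee (cooldown expired)
```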

Data Fusion Algorithm

  1. Collect: Extract from all sources
  2. Normalize: Clean and standardize
  3. Verify: Cross-check across sources
  4. Score: Assign confidence based on:
    • Number of sources (more = higher)
    • Data consistency (matching = higher)
    • Source priority (website > social media)
  5. Merge: Combine verified data
  6. Output: Single profile with confidence score
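The scoring step (4) can be illustrated with a toy function that weights source count, consistency, and source priority. The weights here are made up for illustration; the real fusion.py will use its own scheme:

```python
def confidence_score(values_by_source: dict[str, str]) -> int:
    """Toy confidence score (0-100) for one field, e.g. a phone number."""
    if not values_by_source:
        return 0
    score = min(len(values_by_source), 4) * 10   # more sources -> up to 40 points
    if len(set(values_by_source.values())) == 1:  # all sources agree -> +40
        score += 40
    if "website" in values_by_source:             # highest-priority source -> +20
        score += 20
    return min(score, 100)

# Two agreeing sources, one of them the official website:
print(confidence_score({
    "website": "(212) 555-1234",
    "facebook_search": "(212) 555-1234",
}))  # 80
```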

Module Architecture

main.py                    # Orchestrator + CLI
├── ai_manager.py          # Multi-model AI rotation
├── display.py             # Terminal UI
├── restaurant_discovery.py # AI discovery
├── search.py              # Multi-engine search
├── sources/
│   ├── facebook_scraper.py
│   ├── instagram_scraper.py
│   └── search_result_extractor.py
├── fusion.py              # Data merging + scoring
├── extractor.py           # Website extraction
├── crawler.py             # Website crawling
├── sheets.py              # Google Sheets I/O
└── utils.py               # Logging, validation

Contributing

Want to improve LeadScraper? Contributions welcome!

How to Contribute

  1. Fork this repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes
  4. Test thoroughly
  5. Commit: git commit -m "Add feature"
  6. Push: git push origin feature-name
  7. Open a Pull Request

Ideas for Contributions

  • Add more data sources (LinkedIn, Yelp)
  • Improve AI prompts for better extraction
  • Add export formats (CSV, JSON)
  • Create web interface
  • Add caching to avoid re-searching
  • Implement parallel processing
  • Add resume/checkpoint support

License

MIT License - See LICENSE file for details


Support

Found a bug? Have a question?

  1. Check Troubleshooting
  2. Check FAQ
  3. Check logs/scraper.log for error details
  4. Open an issue on GitHub

Happy lead hunting! 🎯
