Abdullah-Al-Raju/LeadScraper
LeadScraper 🤖

AI-Powered Business Contact Finder - Fully Automated

Find hundreds of business contacts with one command. No manual searching, no data entry, completely automated.




What Is This?

LeadScraper is an intelligent AI agent that automatically discovers businesses and extracts their complete contact information.

In simple terms: You tell it "Find me 100 restaurants in New York" and it does all the work:

  1. Searches the internet to find restaurant names
  2. Researches each restaurant across multiple sources (websites, Facebook, Instagram)
  3. Extracts phone numbers, emails, addresses, social media
  4. Verifies data quality and assigns confidence scores
  5. Saves everything to a Google Sheet

All you do is run one command and wait.


Why Use This?

Time Savings

  • Manual work: 40+ hours to find 100 business contacts
  • With LeadScraper: 1 hour, fully automated
  • You save: 39 hours of tedious work

Data Quality

  • 70% complete profiles (phone + email + address)
  • 30% partial profiles (some contact info)
  • 0% total failures (every business returns at least partial data)
  • Cross-verified across multiple sources

Use Cases

Sales Teams: Build prospect lists in any industry/location
Marketers: Gather leads for campaigns
Recruiters: Find companies in specific sectors
Event Planners: Discover venues and vendors
Researchers: Collect business data for analysis
Anyone who needs business contact information at scale


Key Features

🤖 AI-Powered Discovery

  • AI automatically finds businesses based on your criteria
  • Intelligent search strategy planning
  • Smart name extraction and validation
  • No manual input needed

🔄 Multi-Model AI System

  • Uses 4 different AI models simultaneously
  • Automatic rotation when rate limits hit
  • Never stops - always has a model ready
  • 4x higher throughput than single-model systems

📊 Multi-Source Extraction

Extracts data from:

  • Google Search results
  • Facebook business pages
  • Instagram business profiles
  • Official company websites

Then intelligently merges and verifies all data.

💎 Professional Terminal UI

  • Clean, color-coded progress messages
  • Real-time completion indicators
  • Easy to track which business is processing
  • Beautiful startup banner
  • No clutter or emojis

🛡️ Intelligent Data Fusion

  • Cross-verifies information across sources
  • Assigns confidence scores (0-100%)
  • Handles conflicting data intelligently
  • Prioritizes verified information

📈 Auto-Organization

  • Results saved to Google Sheets
  • Auto-deduplication
  • Auto-sorting by confidence
  • Real-time status updates

Complete Setup Guide

Prerequisites

You need:

  • A computer (Windows, Mac, or Linux)
  • Internet connection
  • 20 minutes for setup

No coding experience needed!


Step 1: Install Python

What is Python? The programming language that runs this tool.

How to install:

  1. Go to python.org/downloads
  2. Click the big yellow "Download Python" button
  3. Run the installer
  4. IMPORTANT: Check the box that says "Add Python to PATH"
  5. Click "Install Now"
  6. Verify installation:
    python --version
    You should see: Python 3.10 or higher

Step 2: Download This Project

Option A: Using Git (recommended)

  1. Install Git: git-scm.com/downloads
  2. Open Terminal/Command Prompt
  3. Navigate to where you want the project:
    cd Documents
  4. Clone the repository:
    git clone https://github.com/YOUR-USERNAME/LeadScraper.git
    cd LeadScraper

Option B: Download ZIP

  1. Click the green "Code" button on GitHub
  2. Click "Download ZIP"
  3. Extract the ZIP file
  4. Open Terminal/Command Prompt
  5. Navigate to the extracted folder:
    cd path/to/LeadScraper

Step 3: Install Dependencies

What are dependencies? Libraries this tool needs to work.

  1. Make sure you're in the LeadScraper folder
  2. Run this command:
    pip install -r requirements.txt
  3. Wait 1-2 minutes while it installs
  4. You'll see lots of text - that's normal!

Step 4: Get OpenRouter API Key

What is OpenRouter? The AI service that powers the intelligence.

Cost: FREE with the models we use!

Steps:

  1. Go to openrouter.ai
  2. Click "Sign Up" (top right)
  3. Sign up with Google/GitHub/Email
  4. Once logged in, click your profile icon
  5. Click "API Keys"
  6. Click "Create Key"
  7. Give it a name: "LeadScraper"
  8. Click "Create"
  9. COPY THE KEY - it looks like: sk-or-v1-abc123xyz...
  10. Save it somewhere safe - you'll need it in Step 6

Important: Don't share this key with anyone!


Step 5: Setup Google Sheets

Why Google Sheets? Your extracted data is saved here like a database.

5.1: Create Google Cloud Project

  1. Go to console.cloud.google.com
  2. Click "Select a project" (top left)
  3. Click "New Project"
  4. Project name: "LeadScraper"
  5. Click "Create"
  6. Wait 30 seconds for it to create

5.2: Enable Google Sheets API

  1. In the search bar, type: "Google Sheets API"
  2. Click on "Google Sheets API"
  3. Click "Enable"
  4. Wait for it to enable

5.3: Create Service Account

  1. Click "Credentials" (left sidebar)
  2. Click "Create Credentials" (top)
  3. Select "Service Account"
  4. Service account name: "leadscraper-bot"
  5. Click "Create and Continue"
  6. Role: Select "Editor"
  7. Click "Continue"
  8. Click "Done"

5.4: Create and Download Key

  1. Click on the service account you just created
  2. Click "Keys" tab
  3. Click "Add Key" → "Create new key"
  4. Select "JSON"
  5. Click "Create"
  6. A file will download: leadscraper-bot-xxxxx.json
  7. Rename it to: service_account.json

5.5: Place Credentials File

  1. In your LeadScraper folder, create a folder called credentials

  2. Move service_account.json into the credentials folder

    Your structure should look like:

    LeadScraper/
    ├── credentials/
    │   └── service_account.json
    ├── modules/
    ├── main.py
    └── ...
    

5.6: Create Google Sheet

  1. Go to sheets.google.com
  2. Click "Blank" to create new sheet
  3. Name it: "LeadScraper Database"
  4. Copy the Sheet ID from the URL:
    https://docs.google.com/spreadsheets/d/THIS_IS_THE_SHEET_ID/edit
    
  5. Save this ID - you'll need it in Step 6

5.7: Share Sheet with Service Account

  1. In your Google Sheet, click "Share" (top right)
  2. Open credentials/service_account.json in a text editor
  3. Find the line: "client_email": "leadscraper-bot@..."
  4. Copy that email address
  5. Back in Google Sheet, paste the email in "Add people"
  6. Make sure "Editor" is selected
  7. Uncheck "Notify people"
  8. Click "Share"

Step 6: Configure Environment File

What is .env? A file that stores your API keys securely.

  1. In the LeadScraper folder, create a file named .env

    • On Windows: Right-click → New → Text Document → Rename to .env
    • On Mac/Linux: touch .env
  2. Open .env in a text editor

  3. Paste this template:

    # OpenRouter API Key (from Step 4)
    OPENROUTER_API_KEY=sk-or-v1-your-key-here
    
    # Google Sheet ID (from Step 5.6)
    GOOGLE_SHEET_ID=your-sheet-id-here
    
    # Path to Google Sheets credentials
    SERVICE_ACCOUNT_FILE=credentials/service_account.json
  4. Replace the placeholders:

    • Replace sk-or-v1-your-key-here with your actual OpenRouter API key
    • Replace your-sheet-id-here with your actual Google Sheet ID
    • Keep SERVICE_ACCOUNT_FILE as is
  5. Save the file

Example:

OPENROUTER_API_KEY=sk-or-v1-abc123xyz789
GOOGLE_SHEET_ID=1a2b3c4d5e6f7g8h9i0j
SERVICE_ACCOUNT_FILE=credentials/service_account.json

Step 7: Test Setup

Let's make sure everything works!

  1. Run this command:

    python main.py --discover "New York" --quantity 3
  2. You should see:

    • Beautiful startup banner
    • Green "[SUCCESS]" messages
    • Cyan "[AI]" messages showing progress
    • Completion separators like:
      ────────────────────────────────────
      ────────────── 1/3 ────────────────
      ────────────────────────────────────
      
  3. Check your Google Sheet - you should see 3 businesses!

If it works: Setup complete! 🎉
If it doesn't: See Troubleshooting


How to Use

Basic Usage

Find businesses in one command:

python main.py --discover "LOCATION" --quantity NUMBER

Examples:

# Find 100 restaurants in New York
python main.py --discover "New York" --quantity 100

# Find 50 cafes in Los Angeles
python main.py --discover "Los Angeles" --quantity 50 --category cafe

# Find 200 gyms in Chicago
python main.py --discover "Chicago" --quantity 200 --category gym

# Quick test with 5 businesses
python main.py --discover "Miami" --quantity 5

Command Options

Option       Description                    Required  Default
--discover   Location to search             Yes       -
--quantity   Number of businesses to find   No        100
--category   Type of business               No        restaurant
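The options above map onto a standard argparse interface. A minimal sketch of how such a CLI is typically wired up (flag names and defaults come from the table; the actual main.py may differ):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="LeadScraper CLI")
    parser.add_argument("--discover", required=True, help="Location to search")
    parser.add_argument("--quantity", type=int, default=100,
                        help="Number of businesses to find")
    parser.add_argument("--category", default="restaurant",
                        help="Type of business")
    return parser

# Parse a sample command line instead of sys.argv, for demonstration.
args = build_parser().parse_args(
    ["--discover", "Chicago", "--quantity", "200", "--category", "gym"]
)
print(args.discover, args.quantity, args.category)  # Chicago 200 gym
```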

Categories You Can Use

Common categories:

  • restaurant - Restaurants, eateries
  • cafe - Coffee shops, cafes
  • gym - Gyms, fitness centers
  • salon - Hair salons, beauty
  • bar - Bars, pubs
  • hotel - Hotels, accommodations
  • shop - Retail stores
  • office - Business offices
  • clinic - Medical clinics
  • school - Schools, education

You can use ANY business type!


What Happens When You Run It?

Phase 1: Discovery (3-5 minutes for 100)

  1. AI plans search strategy
  2. AI generates smart search queries
  3. AI searches the web
  4. AI extracts business names
  5. AI validates and cleans results
  6. Populates Google Sheet Input tab

Phase 2: Extraction (~70 minutes for 100)

  1. For each business:
    • Searches Google for information
    • Searches Facebook for business page
    • Searches Instagram for profile
    • Crawls official website
    • Extracts all contact information
    • Fuses data from all sources
    • Assigns confidence score
    • Writes to Results sheet
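The "extracts all contact information" step above usually boils down to pattern matching over page text. An illustrative sketch with deliberately simplified patterns (the project's real extractor will be more thorough):

```python
import re

def extract_contacts(text: str) -> dict:
    """Pull US-style phone numbers and email addresses from raw page text."""
    # (212) 555-1234 style numbers only; real extractors handle many formats.
    phones = re.findall(r"\(\d{3}\)\s?\d{3}-\d{4}", text)
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
    return {"phones": phones, "emails": emails}

sample = "Call (212) 555-1234 or email info@pizzahut.com for reservations."
print(extract_contacts(sample))
```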

You see:

  • Beautiful color-coded progress
  • Completion separator after each business
  • Real-time updates in Google Sheet

Total time for 100 businesses: ~75 minutes


Understanding the Terminal Output

Colors mean:

  • Green [SUCCESS]: Operation completed successfully
  • Cyan [AI]: AI is working (planning, extracting, validating)
  • Yellow [WARNING]: Warning (rate limit, missing data)
  • Red [ERROR]: Error occurred
  • White [INFO]: General information

Progress separators:

────────────────────────────────────────────────────────────
────────────────────────── 47/100 ─────────────────────────
────────────────────────────────────────────────────────────

This means business #47 just finished!
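A separator like this is easy to render with `str.center` and a fill character. A small sketch of the idea (not the project's actual display.py):

```python
def progress_separator(done: int, total: int, width: int = 60) -> str:
    """Render the three-line completion separator shown after each business."""
    bar = "─" * width
    label = f" {done}/{total} ".center(width, "─")  # counter centered in dashes
    return f"{bar}\n{label}\n{bar}"

print(progress_separator(47, 100))
```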


Configuration Guide

All settings are in config.py. Here are the main ones:

AI Models

The tool rotates through 4 AI models automatically:

OPENROUTER_MODELS = [
    "arcee-ai/trinity-large-preview:free",      # Free, fast
    "stepfun/step-3.5-flash:free",              # Free, reliable
    "deepseek/deepseek-r1-0528:free",           # Free, accurate
    "openrouter/aurora-alpha"                    # Paid, premium
]

Want to change models? Browse available models at openrouter.ai/models

Speed Settings

DELAY_BETWEEN_LEADS = 1     # Seconds between businesses (1 = fast)
DELAY_JITTER = 1            # Random delay 0-1 seconds
MAX_RETRIES = 2             # How many times to retry on failure
SEARCH_DELAY = 1            # Seconds between searches

Want faster? Reduce delays (but may hit rate limits more)
Want more reliable? Increase delays and retries

Discovery Settings

DEFAULT_DISCOVERY_QUANTITY = 100     # Default number if not specified
MAX_DISCOVERY_QUANTITY = 500         # Maximum allowed
AUTO_CLEAR_INPUT_ON_DISCOVERY = True # Clear input before discovery

Source Settings

ENABLE_FACEBOOK = True      # Search Facebook pages
ENABLE_INSTAGRAM = True     # Search Instagram profiles
ENABLE_DIRECTORIES = True   # Extract from directory sites
ENABLE_MULTI_SEARCH = True  # Use multiple search engines

Want to skip a source? Set it to False

Extraction Limits

MAX_PHONES_PER_BUSINESS = 10    # Maximum phone numbers to extract
MAX_EMAILS_PER_BUSINESS = 10    # Maximum emails to extract
MAX_SOCIAL_LINKS = 20           # Maximum social media links

Understanding the Output

Google Sheet Structure

Your Google Sheet has 2 tabs:

Input Tab

Where discovered business names go before extraction.

Business Name  Category    City      Status
Pizza Hut      restaurant  New York  ✅ Done

Results Tab

Where extracted data is saved.

Columns: Business Name, Phone Numbers, Email Addresses, Address, City, State, Website, Facebook, Instagram, Confidence, Sources

Example row:

Business Name:    Pizza Hut
Phone Numbers:    (212) 555-1234, (212) 555-5678
Email Addresses:  info@pizzahut.com
Address:          123 Main St
City:             New York
State:            NY
Website:          pizzahut.com
Facebook:         facebook.com/pizzahut
Instagram:        instagram.com/pizzahut
Confidence:       95%
Sources:          website, facebook, instagram

Confidence Scores

What they mean:

  • 80-100%: Excellent - Multiple sources verified
  • 60-79%: Good - Some verification
  • 40-59%: Partial - Limited data
  • 0-39%: Low - Minimal information

90%+ means: Data is highly reliable, cross-verified across multiple sources.

Data Sources

Shows where the data came from:

  • search_results - Google search snippets
  • facebook_search - Facebook business page
  • instagram_search - Instagram profile
  • website - Official company website

More sources = Higher confidence!


Troubleshooting

"API key not found"

Problem: .env file not configured correctly

Solution:

  1. Make sure .env file exists in main folder
  2. Open .env and verify:
    OPENROUTER_API_KEY=sk-or-v1-your-actual-key
  3. No spaces around =
  4. No quotes around the key
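The two rules above (no spaces around `=`, no quotes around the key) can be checked in a few lines. A hypothetical helper, purely for illustration:

```python
def check_env_line(line: str) -> list[str]:
    """Flag the common .env mistakes described above for a single line."""
    problems = []
    if "=" not in line:
        problems.append("missing '='")
        return problems
    key, value = line.split("=", 1)
    # Spaces on either side of '=' break many .env parsers.
    if key != key.strip() or value != value.strip():
        problems.append("spaces around '='")
    # The value should be the bare key, not a quoted string.
    if value.strip().startswith(('"', "'")):
        problems.append("quotes around the value")
    return problems

print(check_env_line('OPENROUTER_API_KEY = "sk-or-v1-abc"'))  # flags both problems
```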

"Google Sheet ID not set"

Problem: Sheet ID missing from .env

Solution:

  1. Open .env
  2. Add your Sheet ID:
    GOOGLE_SHEET_ID=your-actual-sheet-id
  3. Get Sheet ID from URL:
    https://docs.google.com/spreadsheets/d/THIS_IS_IT/edit
    

"Permission denied" or "403 Error"

Problem: Google Sheet not shared with service account

Solution:

  1. Open credentials/service_account.json
  2. Find "client_email" - copy the email
  3. Open your Google Sheet
  4. Click "Share"
  5. Add that email as Editor
  6. Uncheck "Notify people"
  7. Click "Share"

"HTTP 429 Too Many Requests"

Problem: Hitting rate limits (normal, handled automatically!)

What happens: Tool automatically switches to another AI model

What you see: Yellow [WARNING] Rate limit on model X, rotating...

Action needed: None! It's working as designed.


"No results found"

Problem: AI couldn't find businesses in that location

Solutions:

  1. Try a larger city name
  2. Increase --quantity
  3. Try different category
  4. Check spelling of location

Script stops/crashes

Solutions:

  1. Check logs/scraper.log for error details
  2. Make sure internet connection is stable
  3. Restart and try again
  4. Try with smaller quantity first (--quantity 5)

Results look wrong

Check:

  1. Confidence score - Low scores mean uncertain data
  2. Sources - More sources = more reliable
  3. Google Sheet permissions - Make sure sheet is editable

FAQ

General Questions

Q: Do I need to know coding?
A: No! Just copy-paste the commands shown in this guide.

Q: Does this work on Mac/Windows/Linux?
A: Yes! Works on all platforms.

Q: How much does it cost?
A: $0 with the free AI models we use. OpenRouter offers generous free tiers.

Q: Is this legal?
A: Yes. It only searches publicly available information on the internet.

Q: Will I get banned?
A: Unlikely. The tool uses rate limiting and model rotation to stay within provider limits.


Data Questions

Q: How accurate is the data?
A: 70% of businesses get complete profiles. Data is cross-verified from multiple sources.

Q: Can I trust the confidence scores?
A: Yes. 80%+ confidence means data is verified across multiple sources.

Q: What if data is wrong?
A: Lower confidence scores indicate uncertain data. Always verify critical information.

Q: Can I export to CSV?
A: Yes! In Google Sheets: File → Download → CSV


Technical Questions

Q: How does multi-model rotation work?
A: When Model A hits a rate limit, the tool automatically switches to Model B, then C, then D. Each model gets a 60-second cooldown.

Q: Why use 4 models instead of 1?
A: 4x higher throughput. While one model is on cooldown, others keep working.

Q: Can I add more models?
A: Yes! Edit OPENROUTER_MODELS in config.py and add model names from openrouter.ai/models

Q: Can I use paid models?
A: Yes! Add paid model names to the list. They're faster and more accurate.

Q: Where are logs stored?
A: In logs/scraper.log. Check here if something goes wrong.


Usage Questions

Q: Can I run this on my server 24/7?
A: Yes! It's designed for long-running operations.

Q: Can I pause and resume?
A: Not yet. But you can stop and it will skip already processed items.

Q: Can I run multiple searches simultaneously?
A: Not recommended. Run them sequentially to avoid rate limits.

Q: What's the maximum quantity I can extract?
A: 500 businesses per run (configurable in config.py)

Q: How long does 100 businesses take?
A: About 75 minutes (5 min discovery + 70 min extraction)


Technical Architecture

System Overview

CLI Command → AI Discovery → Multi-Source Extraction → Data Fusion → Google Sheets

AI Manager (Multi-Model System)

AIManager
├── Model Pool [arcee, stepfun, deepseek, aurora]
├── Rate Limit Detector
├── Auto-Rotation Logic
└── Cooldown Manager (60s per model)

When a model hits HTTP 429:

  1. Mark model as "on cooldown"
  2. Select next available model
  3. Retry request with new model
  4. After 60s, original model available again
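The rotation described above can be modeled with a per-model cooldown timestamp. A simplified sketch (class and method names are illustrative, not the actual ai_manager.py):

```python
import time

class ModelRotator:
    """Pick the next model that is not on cooldown after a rate limit."""

    def __init__(self, models, cooldown=60):
        self.models = list(models)
        self.cooldown = cooldown
        # Timestamp until which each model is blocked (0 = available now).
        self.blocked_until = {m: 0.0 for m in self.models}

    def mark_rate_limited(self, model, now=None):
        # Put the model on cooldown for `cooldown` seconds.
        now = time.time() if now is None else now
        self.blocked_until[model] = now + self.cooldown

    def next_available(self, now=None):
        # Return the first model whose cooldown has expired, else None.
        now = time.time() if now is None else now
        for m in self.models:
            if self.blocked_until[m] <= now:
                return m
        return None

rotator = ModelRotator(["arcee", "stepfun", "deepseek", "aurora"])
rotator.mark_rate_limited("arcee", now=0)  # HTTP 429 on the first model
print(rotator.next_available(now=0))       # stepfun
print(rotator.next_available(now=61))      # arcee (cooldown expired)
```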

Data Fusion Algorithm

  1. Collect: Extract from all sources
  2. Normalize: Clean and standardize
  3. Verify: Cross-check across sources
  4. Score: Assign confidence based on:
    • Number of sources (more = higher)
    • Data consistency (matching = higher)
    • Source priority (website > social media)
  5. Merge: Combine verified data
  6. Output: Single profile with confidence score
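The scoring step (4) can be illustrated with a toy function that weights source count, consistency, and source priority. The weights here are made up for illustration; the real fusion.py will use its own scheme:

```python
def confidence_score(values_by_source: dict[str, str]) -> int:
    """Toy confidence score (0-100) for one field, e.g. a phone number."""
    if not values_by_source:
        return 0
    score = min(len(values_by_source), 4) * 10   # more sources -> up to 40 points
    if len(set(values_by_source.values())) == 1:  # all sources agree -> +40
        score += 40
    if "website" in values_by_source:             # highest-priority source -> +20
        score += 20
    return min(score, 100)

# Two agreeing sources, one of them the official website:
print(confidence_score({
    "website": "(212) 555-1234",
    "facebook_search": "(212) 555-1234",
}))  # 80
```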

Module Architecture

main.py                    # Orchestrator + CLI
├── ai_manager.py          # Multi-model AI rotation
├── display.py             # Terminal UI
├── restaurant_discovery.py # AI discovery
├── search.py              # Multi-engine search
├── sources/
│   ├── facebook_scraper.py
│   ├── instagram_scraper.py
│   └── search_result_extractor.py
├── fusion.py              # Data merging + scoring
├── extractor.py           # Website extraction
├── crawler.py             # Website crawling
├── sheets.py              # Google Sheets I/O
└── utils.py               # Logging, validation

Contributing

Want to improve LeadScraper? Contributions welcome!

How to Contribute

  1. Fork this repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes
  4. Test thoroughly
  5. Commit: git commit -m "Add feature"
  6. Push: git push origin feature-name
  7. Open a Pull Request

Ideas for Contributions

  • Add more data sources (LinkedIn, Yelp)
  • Improve AI prompts for better extraction
  • Add export formats (CSV, JSON)
  • Create web interface
  • Add caching to avoid re-searching
  • Implement parallel processing
  • Add resume/checkpoint support

License

MIT License - See LICENSE file for details


Support

Found a bug? Have a question?

  1. Check Troubleshooting
  2. Check FAQ
  3. Check logs/scraper.log for error details
  4. Open an issue on GitHub

Happy lead hunting! 🎯
