Instagram Scraper Agent

An intelligent Instagram research assistant powered by OpenAI GPT and Apify. Send natural language requests to scrape and analyze Instagram profiles, posts, and hashtags with AI-driven insights.

Python 3.11+ · FastAPI · Poetry · License: MIT

⚠️ Disclaimer: This tool is for educational and research purposes only. Users must comply with Instagram's Terms of Service and applicable laws. Use responsibly and ethically.

Features

πŸ€– AI-Powered Intelligence

  • Natural Language Interface: Ask questions in plain English, get intelligent insights
  • Contextual Conversations: Multi-turn conversations with memory
  • Smart Tool Selection: LLM automatically chooses the right scraping tools
  • Analytical Insights: Get engagement analysis, trends, and recommendations

πŸ“Š Instagram Scraping Capabilities

  • Profile Scraping: Followers, bio, verification status, posts count
  • Post Scraping: Captions, likes, comments, hashtags, media URLs
  • Hashtag Analysis: Recent posts, trends, top performers
  • Async Job Management: Handle long-running scrapes efficiently

πŸ› οΈ Developer-Friendly

  • REST API: Full-featured FastAPI application
  • Bearer Token Auth: Secure API access
  • Type Safety: Pydantic models for all data
  • Comprehensive Logging: Track all operations
  • Production-Ready: Systemd service, deployment scripts included

Quick Start

Prerequisites

  • Python 3.11+ and Poetry
  • An Apify API token (https://console.apify.com/account/integrations)
  • An OpenAI API key (https://platform.openai.com/api-keys)

Installation

  1. Clone the repository

    git clone https://github.com/MeshCore-ai/instagram-scraper.git
    cd instagram-scraper
  2. Install dependencies

    poetry install
  3. Configure environment

    cp deploy/.env.template .env
    # Edit .env with your API keys and tokens
  4. Generate a bearer token

    openssl rand -hex 32
    # Add this to MESH_BEARER_SECRET in .env
  5. Run the development server

    make dev
    # Or: poetry run uvicorn instagram_scraper.app:app --reload

The API will be available at http://localhost:8000. Visit http://localhost:8000/docs for interactive API documentation.

Configuration

All configuration is managed through environment variables. See deploy/.env.template for all options.

Required Variables

Variable             Description                Where to Get
MESH_BEARER_SECRET   API authentication token   Generate with openssl rand -hex 32
APIFY_API_TOKEN      Apify API token            https://console.apify.com/account/integrations
OPENAI_API_KEY       OpenAI API key             https://platform.openai.com/api-keys

Optional Variables

Variable            Default       Description
OPENAI_MODEL        gpt-4-turbo   OpenAI model to use
LOG_LEVEL           INFO          Logging level (DEBUG, INFO, WARNING, ERROR)
MAX_RESULTS_LIMIT   200           Maximum number of results per scrape
REQUEST_TIMEOUT     300           Request timeout in seconds
PORT                8000          Server port

Usage Examples

πŸ“š Comprehensive Examples: See examples/ for complete code examples in curl, Python, and JavaScript covering all use cases.

Natural Language Chat (Recommended)

Ask the agent anything about Instagram in natural language:

Example: Profile Analysis

curl -X POST http://localhost:8000/agent/chat \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Analyze the Instagram profile @nike. How many followers do they have and what is their engagement strategy?"
  }'

Example: Hashtag Research

curl -X POST http://localhost:8000/agent/chat \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Show me the top posts for #fitness from the last week"
  }'

Example: Comparison

curl -X POST http://localhost:8000/agent/chat \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Compare the engagement rates of @nike and @adidas"
  }'

Direct Scraping Endpoints

For structured requests without LLM overhead:

Scrape a Profile

curl -X POST http://localhost:8000/scrape/profile \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"username": "instagram"}'

Scrape Hashtag Posts

curl -X POST http://localhost:8000/scrape/hashtag \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "hashtag": "travel",
    "limit": 50
  }'

Asynchronous Jobs

For long-running scrapes:

Start a Job

curl -X POST http://localhost:8000/scrape/async \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "scrape_type": "hashtag",
    "parameters": {
      "hashtag": "photography",
      "limit": 200
    }
  }'
# Returns: {"run_id": "abc123", "status_url": "/job/abc123/status"}

Check Status

curl http://localhost:8000/job/abc123/status \
  -H "Authorization: Bearer YOUR_TOKEN"

Get Results

curl http://localhost:8000/job/abc123/results \
  -H "Authorization: Bearer YOUR_TOKEN"

API Endpoints

Health & Status

  • GET /health - Health check (no auth required)

Agent Endpoints

  • POST /agent/chat - Chat with AI agent (natural language)

Direct Scraping

  • POST /scrape/profile - Scrape Instagram profile
  • POST /scrape/posts - Scrape specific posts by URLs
  • POST /scrape/hashtag - Scrape hashtag posts

Async Jobs

  • POST /scrape/async - Start async scraping job
  • GET /job/{run_id}/status - Check job status
  • GET /job/{run_id}/results - Get job results
  • DELETE /job/{run_id} - Cancel running job

Full API documentation is available at /docs while the server is running.

Development

Available Make Commands

make help       # Show all available commands
make install    # Install dependencies
make dev        # Run development server with auto-reload
make test       # Run tests
make test-cov   # Run tests with coverage
make format     # Format code with black
make lint       # Run linting (ruff + mypy)
make clean      # Remove cache and build artifacts

Project Structure

instagram-scraper/
β”œβ”€β”€ src/instagram_scraper/
β”‚   β”œβ”€β”€ app.py              # FastAPI application
β”‚   β”œβ”€β”€ config.py           # Configuration management
β”‚   β”œβ”€β”€ models.py           # Pydantic data models
β”‚   β”œβ”€β”€ auth.py             # Authentication
β”‚   β”œβ”€β”€ apify_client.py     # Apify scraper wrapper
β”‚   └── agent/
β”‚       β”œβ”€β”€ llm_agent.py    # LLM orchestrator
β”‚       β”œβ”€β”€ tools.py        # Function tool definitions
β”‚       └── prompts.py      # System prompts
β”œβ”€β”€ tests/                  # Test suite
β”œβ”€β”€ deploy/                 # Deployment scripts
β”‚   β”œβ”€β”€ .env.template
β”‚   β”œβ”€β”€ instagram-scraper.service
β”‚   β”œβ”€β”€ deploy.sh
β”‚   └── server-setup.sh
└── Makefile               # Development commands

Deployment

See DEPLOYMENT.md for detailed deployment instructions.

Quick Deployment to Ubuntu Server

  1. Prepare the server

    ssh ubuntu@your-server
    wget https://raw.githubusercontent.com/MeshCore-ai/instagram-scraper/main/deploy/server-setup.sh
    chmod +x server-setup.sh
    ./server-setup.sh
  2. Configure environment

    # Edit /home/ubuntu/.env with your API keys
    nano /home/ubuntu/.env
  3. Deploy from local machine

    ./deploy/deploy.sh production ubuntu@your-server

The service will be running on port 8000 and managed by systemd.

Architecture

The system consists of three main layers:

  1. API Layer (FastAPI)

    • REST endpoints with bearer token authentication
    • Request validation with Pydantic
    • Async job management
  2. Intelligence Layer (OpenAI GPT)

    • Natural language understanding
    • Smart tool selection and orchestration
    • Context management for conversations
    • Response formatting and insights
  3. Scraping Layer (Apify)

    • Instagram profile scraping
    • Post and hashtag scraping
    • Job status tracking
    • Error handling and retries

Security Considerations

  • Authentication: All endpoints (except /health) require bearer token
  • Secret Management: Never commit .env files; use environment-specific configs
  • Rate Limiting: Consider implementing rate limiting in production
  • HTTPS: Always use HTTPS in production to protect tokens
  • API Keys: Rotate Apify and OpenAI keys regularly
  • Monitoring: Log all authentication attempts and API usage

Limitations & Important Notes

  • Public Data Only: Can only access publicly available Instagram data
  • Instagram API Limits: Subject to Instagram's rate limits and restrictions
  • Private Accounts: Limited information available for private profiles
  • Cost: Usage incurs costs from both Apify and OpenAI APIs. Monitor your usage carefully.
  • No Warranty: Provided as-is without guarantees. Instagram may change their platform at any time.
  • Rate Limiting: Be respectful of rate limits to avoid being blocked by Instagram or Apify
  • Data Privacy: Handle scraped data responsibly and in compliance with privacy laws

Troubleshooting

Common Issues

Service won't start

# Check logs
sudo journalctl -u instagram-scraper -f

# Verify environment variables
cat /home/ubuntu/.env

# Check service status
sudo systemctl status instagram-scraper

Authentication errors

  • Verify MESH_BEARER_SECRET matches between server and requests
  • Check Authorization header format: Bearer YOUR_TOKEN

Scraping fails

  • Verify Apify API token is valid
  • Check Apify account has sufficient credits
  • Ensure Instagram username/URL is correct

LLM not responding

  • Verify OpenAI API key is valid
  • Check OpenAI account has credits
  • Review logs for detailed error messages

Contributing

Contributions are welcome! Please ensure:

  • Code passes all linting checks (make lint)
  • Tests pass (make test)
  • Follow existing code style
  • Update documentation as needed

License

MIT License

Copyright (c) 2025 MeshCore AI

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Legal & Compliance

Important: This tool is provided for educational and research purposes. Users are responsible for:

  • Terms of Service: Complying with Instagram's Terms of Service and Community Guidelines
  • Rate Limits: Respecting rate limits and API usage policies from Instagram and Apify
  • Data Rights: Ensuring they have the right to scrape and use data for their intended purpose
  • Privacy Laws: Following applicable data protection laws (GDPR, CCPA, etc.)
  • Ethical Use: Not using this tool for spam, harassment, or malicious purposes
  • Commercial Use: Understanding any commercial use restrictions from Instagram and third-party services

Disclaimer: The authors and contributors of this project are not responsible for misuse of this tool or any violations of third-party terms of service. Users assume all legal risks associated with using this software.

Support

For issues, questions, or contributions:

  • Open an issue on GitHub
  • Review the API documentation at /docs
  • Check logs with sudo journalctl -u instagram-scraper -f

Built with ❤️ by MeshCore AI using FastAPI, OpenAI, and Apify
