Investor Data Scraper & Feature Extractor

A comprehensive Python solution for scraping investor data from free, legal sources and extracting structured features using NLP and AI.

Overview

This project provides a scalable solution for:

  1. Web Scraping: Collecting investor data from multiple free legal sources
  2. News Aggregation: Scraping investment news from RSS feeds and news websites
  3. Feature Extraction: Using NLP and AI to extract structured investor features
  4. Data Export: Saving results to Excel files and JSON formats

Data Sources Used

Investor Data Sources

  • OpenVC: Community-maintained investor database
  • Growjo: Company and investor growth data, accessed via its API
  • Crunchbase (Free Tier): Public investor profiles
  • Alpha Vantage API: Financial data for investment firms

News Sources

  • TechCrunch: Venture capital and startup funding news
  • VentureBeat: Technology and venture funding news
  • Economic Times: Indian market and investment news
  • Mint: Indian financial markets news
  • Business Standard: Business and investment news
  • DealStreetAsia: Asian startup and VC news
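
Most of these outlets publish RSS feeds, which is what the news scraper consumes. A minimal sketch of reading one feed, assuming the feedparser library (the feed URL below is illustrative; the real source list lives in config.py):

import feedparser

# Illustrative feed URL; the actual sources are configured in config.py
FEED_URL = "https://techcrunch.com/feed/"

feed = feedparser.parse(FEED_URL)
for entry in feed.entries[:5]:
    # Each entry exposes a title, link, and published date
    print(entry.title, entry.link)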

Features Extracted

The system extracts these key investor features:

  • Ticket Size: Investment amount ranges
  • Stage Fit: Early/Mid/Late stage preferences
  • Sector Focus: Industry and technology focus areas
  • Past Investments: Investment history in similar sectors and business models
  • Bolt-on Potential: Portfolio company synergies
  • Fund Lifecycle & Dry Powder: Available capital status
  • Fund Mandate Constraints: Geographic, structural limitations
  • Partner Expertise: Team capabilities and relationships
  • Exit Track Record: Historical exit performance
  • Value-add Model: Operational vs financial support
  • Recent Activity: Fundraising, hiring, new deals
  • Surrogate Signals: Co-investors, awards, thought leadership
  • Reputation: Past experiences and market standing
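
To illustrate the NLP side, a feature like ticket size can often be recovered from free text with a simple pattern match. This is a hypothetical sketch, not the project's actual extractor:

import re

# Hypothetical pattern for ranges like "$10-50M" or "$10M - $50M"
TICKET_RE = re.compile(r"\$(\d+(?:\.\d+)?)M?\s*[-–]\s*\$?(\d+(?:\.\d+)?)\s*M", re.IGNORECASE)

def extract_ticket_size(text):
    """Return (low, high) in millions of dollars, or None if no range is found."""
    match = TICKET_RE.search(text)
    if match:
        return float(match.group(1)), float(match.group(2))
    return None

print(extract_ticket_size("Accel typically writes checks of $10-50M"))  # (10.0, 50.0)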

Installation & Setup

  1. Clone/Download the files

  2. Install dependencies:

    pip install -r requirements.txt
  3. Set up environment variables:

    • Copy the .env file and add your API keys (a loading sketch follows this list):
    ALPHA_VANTAGE_API_KEY=your_api_key_here
    GROQ_API_KEY=your_groq_api_key_here
  4. Run the complete pipeline:

    python main_pipeline.py
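
If you load the keys in your own code, the usual convention for a .env file is python-dotenv; a minimal sketch (the variable names match the .env example above, but this loading code is an assumption, not necessarily what main_pipeline.py does):

import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads KEY=value pairs from .env into the environment

alpha_vantage_key = os.getenv("ALPHA_VANTAGE_API_KEY")
groq_key = os.getenv("GROQ_API_KEY")  # optional; only used for AI extraction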

Individual Module Usage

1. Scrape Investor Data Only

python investor_scraper.py

2. Scrape News Data Only

python news_scraper.py

3. Extract Features Only

python feature_extractor.py

Output Files

  • investor_data.xlsx: Scraped investor information with multiple sheets
  • investor_news.xlsx: News articles about investors and funding
  • extracted_features.json: Structured features in JSON format
  • feature_summary.json: Feature extraction summary statistics
  • execution_summary.json: Pipeline execution details
  • investor_pipeline.log: Detailed execution logs
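
A convenient way to inspect the Excel output is pandas, since investor_data.xlsx holds multiple sheets; a quick sketch (sheet names are whatever the pipeline wrote, not fixed here):

import json

import pandas as pd

# sheet_name=None loads every sheet into a dict of DataFrames
sheets = pd.read_excel("investor_data.xlsx", sheet_name=None)
for name, df in sheets.items():
    print(name, df.shape)

with open("extracted_features.json") as f:
    features = json.load(f)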

Configuration

The config.py file contains:

  • Data source URLs and configurations
  • Feature extraction parameters
  • Scraping rate limits and headers
  • Supported investor features list
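
A hypothetical shape for config.py, showing the kinds of values listed above (names and values are illustrative, not the actual file contents):

# Illustrative config.py layout; the real names and values may differ
NEWS_RSS_FEEDS = {
    "techcrunch": "https://techcrunch.com/feed/",
    "venturebeat": "https://venturebeat.com/feed/",
}

REQUEST_HEADERS = {"User-Agent": "InvestorDataScraper/1.0"}
REQUEST_DELAY_SECONDS = 2  # pause between requests (rate limiting)

INVESTOR_FEATURES = [
    "ticket_size", "stage_fit", "sector_focus", "past_investments",
    "recent_activity", "reputation",
]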

Legal Compliance

This scraper only uses:

  • Publicly available data
  • Free tier APIs
  • RSS feeds
  • Robots.txt-compliant scraping (see the sketch below)
  • Rate limiting to avoid server overload

Important: Always review website terms of service before scraping.
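
That robots.txt check can be done with the standard library before each fetch; a minimal sketch:

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent="InvestorDataScraper"):
    """Check a site's robots.txt before scraping a URL."""
    parser = RobotFileParser()
    parser.set_url(urljoin(url, "/robots.txt"))
    parser.read()
    return parser.can_fetch(user_agent, url)

if is_allowed("https://techcrunch.com/example-article"):
    pass  # safe to fetch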

Scalability Features

  • Session management for efficient HTTP requests
  • Rate limiting to respect server resources
  • Error handling and retry mechanisms
  • Modular design for easy source addition
  • Logging for monitoring and debugging
  • Excel/JSON export for data analysis
  • AI integration for advanced feature extraction
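
The session management, rate limiting, and retry behavior listed above can be implemented with requests plus urllib3's Retry; a sketch of one common pattern (not necessarily the project's exact code):

import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# One shared session reuses connections; transient failures retry with backoff
session = requests.Session()
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503])
session.mount("https://", HTTPAdapter(max_retries=retries))

def polite_get(url, delay=2.0):
    """Fetch a URL through the shared session, then pause to rate-limit."""
    response = session.get(url, timeout=10)
    time.sleep(delay)  # simple rate limiting between requests
    return response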

API Keys Required

Required (Free):

  • Alpha Vantage: Free financial data (5 calls/minute)
  • User-Agent string: used to identify the scraper in HTTP requests

Optional (Paid):

  • Groq API: For AI-powered feature extraction

Example Usage

from main_pipeline import InvestorDataPipeline

# Initialize pipeline
pipeline = InvestorDataPipeline()

# Run complete pipeline
results = pipeline.run_full_pipeline()

# Print summary
pipeline.print_summary()

Sample Output

{
  "investor_name": "Accel Partners",
  "firm_name": "Accel",
  "ticket_size": "Mid ($10-50M)",
  "stage_fit": "early",
  "sector_focus": ["enterprise", "fintech", "consumer"],
  "geographic_focus": ["india", "us"],
  "recent_activity": ["new_deals", "hiring"],
  "confidence": 0.85
}

Contributing

This codebase is designed to be easily extensible:

  • Add new data sources in config.py
  • Implement new scrapers following the existing patterns
  • Enhance feature extraction with additional NLP techniques
  • Add new export formats as needed
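
As a hypothetical skeleton for a new scraper (the real pattern is whatever investor_scraper.py already does; this only shows the general fetch-parse-return shape):

import requests

def scrape_new_source(url):
    """Hypothetical scraper skeleton: fetch a page and return investor records."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    records = []
    # ... parse response.text (e.g. with BeautifulSoup) into dicts ...
    return records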

Error Handling

The system includes comprehensive error handling for:

  • Network connectivity issues
  • Rate limiting responses
  • HTML structure changes
  • API failures
  • Data parsing errors

Each module continues operation even if individual sources fail, ensuring maximum data collection.
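
That continue-on-failure behavior typically looks like a per-source try/except loop; a sketch of the pattern (function and logger names are illustrative):

import logging

logger = logging.getLogger("investor_pipeline")

def run_all_sources(scrapers):
    """Run each scraper independently so one failure doesn't stop the rest."""
    results = {}
    for name, scrape in scrapers.items():
        try:
            results[name] = scrape()
        except Exception:
            # Log the traceback and move on; remaining sources still run
            logger.exception("Source %s failed, continuing", name)
    return results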

Performance

  • Concurrent processing where possible
  • Smart caching to avoid redundant requests (see the sketch after this list)
  • Incremental data collection support
  • Memory efficient processing for large datasets
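
This README doesn't pin down the caching approach; one simple in-memory option is functools.lru_cache keyed on URL, sketched below (per-run only; a persistent cache would need more):

from functools import lru_cache

import requests

@lru_cache(maxsize=256)
def fetch_cached(url):
    """Fetch each URL at most once per run; repeat calls return the cached body."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text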

Support

For issues or questions:

  1. Check the generated log files for detailed error information
  2. Verify API keys and network connectivity
  3. Ensure all required packages are installed
  4. Review website terms of service for any changes

This solution provides a production-ready foundation for investor data collection and analysis that can be easily customized for specific research needs.
