A comprehensive Python solution for scraping investor data from free, legal sources and extracting structured features using NLP and AI.
This project provides a scalable solution for:
- Web Scraping: Collecting investor data from multiple free legal sources
- News Aggregation: Scraping investment news from RSS feeds and news websites
- Feature Extraction: Using NLP and AI to extract structured investor features
- Data Export: Saving results to Excel files and JSON formats
Investor data is collected from these free sources:
- OpenVC: Community-maintained investor database
- Growjo: Company and investor growth data with API
- Crunchbase (Free Tier): Public investor profiles
- Alpha Vantage API: Financial data for investment firms
Investment news is aggregated from these outlets (a polling sketch follows this list):
- TechCrunch: Venture capital and startup funding news
- VentureBeat: Technology and venture funding news
- Economic Times: Indian market and investment news
- Mint: Indian financial markets news
- Business Standard: Business and investment news
- DealStreetAsia: Asian startup and VC news
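As a rough illustration of how the news sources above can be polled, here is a minimal sketch using `feedparser`; the feed URL and field handling are assumptions, not code taken from this repository:

```python
# Minimal sketch: pull recent articles from one RSS feed with feedparser.
import feedparser

FEED_URL = "https://techcrunch.com/feed/"  # assumed public RSS endpoint

feed = feedparser.parse(FEED_URL)
for entry in feed.entries[:5]:
    # Standard RSS fields; not every feed populates all of them.
    print(entry.get("title", ""), entry.get("link", ""), entry.get("published", ""))
```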
The system extracts these key investor features (a keyword-matching sketch follows the list):
- Ticket Size: Investment amount ranges
- Stage Fit: Early/Mid/Late stage preferences
- Sector Focus: Industry and technology focus areas
- Past Investments: Similar sectors/models investment history
- Bolt-on Potential: Portfolio company synergies
- Fund Lifecycle & Dry Powder: Available capital status
- Fund Mandate Constraints: Geographic, structural limitations
- Partner Expertise: Team capabilities and relationships
- Exit Track Record: Historical exit performance
- Value-add Model: Operational vs financial support
- Recent Activity: Fundraising, hiring, new deals
- Surrogate Signals: Co-investors, awards, thought leadership
- Reputation: Past experiences and market standing
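As a hedged illustration of how a rule-based pass might tag a couple of these features from free text, here is a small sketch; the keyword lists, regex, and function names are illustrative assumptions, not the project's actual extraction logic:

```python
# Illustrative keyword matching for two features; the real pipeline layers
# NLP/AI on top of rules like these.
import re
from typing import Optional

STAGE_KEYWORDS = {
    "early": ["seed", "series a", "pre-seed"],
    "mid": ["series b", "series c", "growth"],
    "late": ["pre-ipo", "late stage", "buyout"],
}

def extract_stage_fit(text: str) -> Optional[str]:
    """Return the first stage whose keywords appear in the text."""
    lowered = text.lower()
    for stage, keywords in STAGE_KEYWORDS.items():
        if any(kw in lowered for kw in keywords):
            return stage
    return None

def extract_ticket_size(text: str) -> Optional[str]:
    """Pull a dollar range like '$10-50M' if one is mentioned."""
    match = re.search(r"\$\d+(?:-\d+)?\s?[MB]", text, flags=re.IGNORECASE)
    return match.group(0) if match else None

print(extract_stage_fit("Accel led the Series A round"))   # -> "early"
print(extract_ticket_size("typical cheques of $10-50M"))   # -> "$10-50M"
```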
To get started:
1. Clone/download the files.
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Set up environment variables: copy the `.env` file and add your API keys:
   ```
   ALPHA_VANTAGE_API_KEY=your_api_key_here
   GROQ_API_KEY=your_groq_api_key_here
   ```
4. Run the complete pipeline:
   ```bash
   python main_pipeline.py
   ```
   Or run individual modules:
   ```bash
   python investor_scraper.py
   python news_scraper.py
   python feature_extractor.py
   ```
The pipeline produces these output files (a loading sketch follows the list):
- investor_data.xlsx: Scraped investor information with multiple sheets
- investor_news.xlsx: News articles about investors and funding
- extracted_features.json: Structured features in JSON format
- feature_summary.json: Feature extraction summary statistics
- execution_summary.json: Pipeline execution details
- investor_pipeline.log: Detailed execution logs
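A quick way to inspect these outputs once the pipeline has run; this is a sketch that assumes `pandas` (with `openpyxl`) is installed and that the files sit in the working directory:

```python
import json

import pandas as pd

# Load every sheet of the investor workbook into a dict of DataFrames.
investor_sheets = pd.read_excel("investor_data.xlsx", sheet_name=None)
print({name: len(df) for name, df in investor_sheets.items()})

# Structured features and run statistics are plain JSON.
with open("extracted_features.json") as f:
    features = json.load(f)
print(f"Loaded {len(features)} feature records")
```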
The config.py file contains (an illustrative sketch follows this list):
- Data source URLs and configurations
- Feature extraction parameters
- Scraping rate limits and headers
- Supported investor features list
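A hypothetical shape for config.py covering the items above; all variable names and values below are assumptions added for illustration, not the actual module contents:

```python
# Hypothetical config.py layout; names and values are illustrative.
RSS_FEEDS = {
    "techcrunch": "https://techcrunch.com/feed/",
    "venturebeat": "https://venturebeat.com/feed/",
}

REQUEST_HEADERS = {
    "User-Agent": "InvestorResearchBot/1.0 (contact@example.com)",
}

RATE_LIMIT_SECONDS = 2           # pause between requests to the same host
ALPHA_VANTAGE_CALLS_PER_MIN = 5  # free-tier limit

INVESTOR_FEATURES = [
    "ticket_size", "stage_fit", "sector_focus", "past_investments",
    "fund_lifecycle", "partner_expertise", "exit_track_record",
]
```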
This scraper only uses:
- ✅ Publicly available data
- ✅ Free tier APIs
- ✅ RSS feeds
- ✅ Robots.txt compliant scraping (a minimal check is sketched below)
- ✅ Rate limiting to avoid server overload
Important: Always review website terms of service before scraping.
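A minimal sketch of the kind of robots.txt check and delay this policy implies, using the standard library; the URL, bot name, and delay value are illustrative assumptions:

```python
# Check a site's robots.txt before scraping, then apply a simple delay.
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(url: str, user_agent: str = "InvestorResearchBot") -> bool:
    """Return True if the site's robots.txt permits fetching this URL."""
    parsed = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

url = "https://example.com/investors"
if allowed_to_fetch(url):
    time.sleep(2)  # crude rate limit between requests
    # ... perform the actual request here ...
```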
Technical highlights:
- Session management for efficient HTTP requests (a session/retry sketch follows this list)
- Rate limiting to respect server resources
- Error handling and retry mechanisms
- Modular design for easy source addition
- Logging for monitoring and debugging
- Excel/JSON export for data analysis
- AI integration for advanced feature extraction
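One common way to get session reuse, automatic retries, and a descriptive User-Agent with `requests`; treat this as a sketch of the pattern rather than the project's exact implementation:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session() -> requests.Session:
    """Session with automatic retries and a descriptive User-Agent."""
    retry = Retry(
        total=3,
        backoff_factor=1,  # 1s, 2s, 4s between attempts
        status_forcelist=[429, 500, 502, 503, 504],
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.headers.update({"User-Agent": "InvestorResearchBot/1.0"})
    return session

session = build_session()
response = session.get("https://example.com", timeout=10)
```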
API keys and related settings:
- Alpha Vantage: Free financial data (5 calls/minute; a rate-limited call is sketched after this list)
- User Agent: For web scraping identification
- Groq API: For AI-powered feature extraction
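The Alpha Vantage free tier allows roughly 5 calls per minute, so spacing requests about 12 seconds apart stays inside the limit. A sketch, where the OVERVIEW endpoint and the example ticker symbols are illustrative choices:

```python
import os
import time

import requests

API_KEY = os.getenv("ALPHA_VANTAGE_API_KEY")
symbols = ["BX", "KKR", "APO"]  # example publicly traded investment firms

for symbol in symbols:
    resp = requests.get(
        "https://www.alphavantage.co/query",
        params={"function": "OVERVIEW", "symbol": symbol, "apikey": API_KEY},
        timeout=10,
    )
    data = resp.json()
    print(symbol, data.get("Name"))
    time.sleep(12)  # ~5 calls/minute on the free tier
```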
Example programmatic usage:
```python
from main_pipeline import InvestorDataPipeline

# Initialize pipeline
pipeline = InvestorDataPipeline()

# Run complete pipeline
results = pipeline.run_full_pipeline()

# Print summary
pipeline.print_summary()
```
Sample extracted feature record:
```json
{
  "investor_name": "Accel Partners",
  "firm_name": "Accel",
  "ticket_size": "Mid ($10-50M)",
  "stage_fit": "early",
  "sector_focus": ["enterprise", "fintech", "consumer"],
  "geographic_focus": ["india", "us"],
  "recent_activity": ["new_deals", "hiring"],
  "confidence": 0.85
}
```
This codebase is designed to be easily extensible:
- Add new data sources in config.py
- Implement new scrapers following the existing patterns (a skeleton sketch follows this list)
- Enhance feature extraction with additional NLP techniques
- Add new export formats as needed
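A hypothetical skeleton for a new scraper that follows the same session and rate-limit conventions; the class name, method names, and URL are assumptions, not the repository's actual base classes:

```python
# Hypothetical skeleton for adding a new source; names are illustrative.
import time

import requests

class NewSourceScraper:
    BASE_URL = "https://example.com/investors"  # placeholder source URL

    def __init__(self, session=None, delay=2.0):
        self.session = session or requests.Session()
        self.delay = delay

    def scrape(self):
        """Fetch the source page and return parsed investor records."""
        response = self.session.get(self.BASE_URL, timeout=10)
        response.raise_for_status()
        time.sleep(self.delay)  # respect the configured rate limit
        return self.parse(response.text)

    def parse(self, html):
        # Parse the page (e.g., with BeautifulSoup) into the shared record format.
        return []
```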
The system includes comprehensive error handling for:
- Network connectivity issues
- Rate limiting responses
- HTML structure changes
- API failures
- Data parsing errors
Each module continues operation even if individual sources fail, ensuring maximum data collection.
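The continue-on-failure behaviour can be pictured as a loop like the following; this is a sketch, and the scraper objects and logger name are assumptions:

```python
# Run every scraper, keep whatever succeeds, and log failures per source.
import logging

import requests

logger = logging.getLogger("investor_pipeline")

def collect_all(scrapers):
    """Run every scraper and keep whatever succeeds."""
    records = []
    for scraper in scrapers:
        try:
            records.extend(scraper.scrape())
        except requests.RequestException as exc:
            logger.warning("Network error in %s: %s", type(scraper).__name__, exc)
        except Exception as exc:
            logger.error("Source %s failed: %s", type(scraper).__name__, exc)
    return records
```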
Performance considerations:
- Concurrent processing where possible (a thread-pool sketch follows this list)
- Smart caching to avoid redundant requests
- Incremental data collection support
- Memory efficient processing for large datasets
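Concurrent collection can be sketched with a thread pool; the worker count, source list, and `fetch_source` placeholder are illustrative assumptions:

```python
# Fetch several sources in parallel and aggregate whatever succeeds.
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_source(source):
    # Placeholder: scrape one source and return its records.
    return []

sources = [{"name": "openvc"}, {"name": "growjo"}]  # illustrative
results = []

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(fetch_source, s): s for s in sources}
    for future in as_completed(futures):
        source = futures[future]
        try:
            results.extend(future.result())
        except Exception as exc:
            print(f"{source['name']} failed: {exc}")
```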
For issues or questions:
- Check the generated log files for detailed error information
- Verify API keys and network connectivity
- Ensure all required packages are installed
- Review website terms of service for any changes
This solution provides a production-ready foundation for investor data collection and analysis that can be easily customized for specific research needs.