A comprehensive Python solution for scraping investor data from free, legal sources and extracting structured features using NLP and AI.
This project provides a scalable solution for:
- Web Scraping: Collecting investor data from multiple free legal sources
- News Aggregation: Scraping investment news from RSS feeds and news websites
- Feature Extraction: Using NLP and AI to extract structured investor features
- Data Export: Saving results to Excel files and JSON formats
Investor data is collected from these free sources:
- OpenVC: Community-maintained investor database
- Growjo: Company and investor growth data with API
- Crunchbase (Free Tier): Public investor profiles
- Alpha Vantage API: Financial data for investment firms
Investment news is aggregated from these outlets (a polling sketch follows this list):
- TechCrunch: Venture capital and startup funding news
- VentureBeat: Technology and venture funding news
- Economic Times: Indian market and investment news
- Mint: Indian financial markets news
- Business Standard: Business and investment news
- DealStreetAsia: Asian startup and VC news
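As a rough illustration of how the news sources above can be polled, here is a minimal sketch using `feedparser`; the feed URL and field handling are assumptions, not code taken from this repository:

```python
# Minimal sketch: pull recent articles from one RSS feed with feedparser.
import feedparser

FEED_URL = "https://techcrunch.com/feed/"  # assumed public RSS endpoint

feed = feedparser.parse(FEED_URL)
for entry in feed.entries[:5]:
    # Standard RSS fields; not every feed populates all of them.
    print(entry.get("title", ""), entry.get("link", ""), entry.get("published", ""))
```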
The system extracts these key investor features (a keyword-matching sketch follows the list):
- Ticket Size: Investment amount ranges
- Stage Fit: Early/Mid/Late stage preferences
- Sector Focus: Industry and technology focus areas
- Past Investments: Similar sectors/models investment history
- Bolt-on Potential: Portfolio company synergies
- Fund Lifecycle & Dry Powder: Available capital status
- Fund Mandate Constraints: Geographic, structural limitations
- Partner Expertise: Team capabilities and relationships
- Exit Track Record: Historical exit performance
- Value-add Model: Operational vs financial support
- Recent Activity: Fundraising, hiring, new deals
- Surrogate Signals: Co-investors, awards, thought leadership
- Reputation: Past experiences and market standing
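As a hedged illustration of how a rule-based pass might tag a couple of these features from free text, here is a small sketch; the keyword lists, regex, and function names are illustrative assumptions, not the project's actual extraction logic:

```python
# Illustrative keyword matching for two features; the real pipeline layers
# NLP/AI on top of rules like these.
import re
from typing import Optional

STAGE_KEYWORDS = {
    "early": ["seed", "series a", "pre-seed"],
    "mid": ["series b", "series c", "growth"],
    "late": ["pre-ipo", "late stage", "buyout"],
}

def extract_stage_fit(text: str) -> Optional[str]:
    """Return the first stage whose keywords appear in the text."""
    lowered = text.lower()
    for stage, keywords in STAGE_KEYWORDS.items():
        if any(kw in lowered for kw in keywords):
            return stage
    return None

def extract_ticket_size(text: str) -> Optional[str]:
    """Pull a dollar range like '$10-50M' if one is mentioned."""
    match = re.search(r"\$\d+(?:-\d+)?\s?[MB]", text, flags=re.IGNORECASE)
    return match.group(0) if match else None

print(extract_stage_fit("Accel led the Series A round"))   # -> "early"
print(extract_ticket_size("typical cheques of $10-50M"))   # -> "$10-50M"
```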
To get started:
1. Clone/download the files.
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Set up environment variables: copy the `.env` file and add your API keys:
   ```
   ALPHA_VANTAGE_API_KEY=your_api_key_here
   GROQ_API_KEY=your_groq_api_key_here
   ```
4. Run the complete pipeline:
   ```bash
   python main_pipeline.py
   ```
   Or run individual modules:
   ```bash
   python investor_scraper.py
   python news_scraper.py
   python feature_extractor.py
   ```
The pipeline produces these output files (a loading sketch follows the list):
- investor_data.xlsx: Scraped investor information with multiple sheets
- investor_news.xlsx: News articles about investors and funding
- extracted_features.json: Structured features in JSON format
- feature_summary.json: Feature extraction summary statistics
- execution_summary.json: Pipeline execution details
- investor_pipeline.log: Detailed execution logs
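A quick way to inspect these outputs once the pipeline has run; this is a sketch that assumes `pandas` (with `openpyxl`) is installed and that the files sit in the working directory:

```python
import json

import pandas as pd

# Load every sheet of the investor workbook into a dict of DataFrames.
investor_sheets = pd.read_excel("investor_data.xlsx", sheet_name=None)
print({name: len(df) for name, df in investor_sheets.items()})

# Structured features and run statistics are plain JSON.
with open("extracted_features.json") as f:
    features = json.load(f)
print(f"Loaded {len(features)} feature records")
```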
The config.py file contains (an illustrative sketch follows this list):
- Data source URLs and configurations
- Feature extraction parameters
- Scraping rate limits and headers
- Supported investor features list
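A hypothetical shape for config.py covering the items above; all variable names and values below are assumptions added for illustration, not the actual module contents:

```python
# Hypothetical config.py layout; names and values are illustrative.
RSS_FEEDS = {
    "techcrunch": "https://techcrunch.com/feed/",
    "venturebeat": "https://venturebeat.com/feed/",
}

REQUEST_HEADERS = {
    "User-Agent": "InvestorResearchBot/1.0 (contact@example.com)",
}

RATE_LIMIT_SECONDS = 2           # pause between requests to the same host
ALPHA_VANTAGE_CALLS_PER_MIN = 5  # free-tier limit

INVESTOR_FEATURES = [
    "ticket_size", "stage_fit", "sector_focus", "past_investments",
    "fund_lifecycle", "partner_expertise", "exit_track_record",
]
```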
This scraper only uses:
- ✅ Publicly available data
- ✅ Free tier APIs
- ✅ RSS feeds
- ✅ Robots.txt compliant scraping (a minimal check is sketched below)
- ✅ Rate limiting to avoid server overload
Important: Always review website terms of service before scraping.
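A minimal sketch of the kind of robots.txt check and delay this policy implies, using the standard library; the URL, bot name, and delay value are illustrative assumptions:

```python
# Check a site's robots.txt before scraping, then apply a simple delay.
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(url: str, user_agent: str = "InvestorResearchBot") -> bool:
    """Return True if the site's robots.txt permits fetching this URL."""
    parsed = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

url = "https://example.com/investors"
if allowed_to_fetch(url):
    time.sleep(2)  # crude rate limit between requests
    # ... perform the actual request here ...
```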
Technical highlights:
- Session management for efficient HTTP requests (a session/retry sketch follows this list)
- Rate limiting to respect server resources
- Error handling and retry mechanisms
- Modular design for easy source addition
- Logging for monitoring and debugging
- Excel/JSON export for data analysis
- AI integration for advanced feature extraction
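One common way to get session reuse, automatic retries, and a descriptive User-Agent with `requests`; treat this as a sketch of the pattern rather than the project's exact implementation:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session() -> requests.Session:
    """Session with automatic retries and a descriptive User-Agent."""
    retry = Retry(
        total=3,
        backoff_factor=1,  # 1s, 2s, 4s between attempts
        status_forcelist=[429, 500, 502, 503, 504],
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.headers.update({"User-Agent": "InvestorResearchBot/1.0"})
    return session

session = build_session()
response = session.get("https://example.com", timeout=10)
```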
API keys and related settings:
- Alpha Vantage: Free financial data (5 calls/minute; a rate-limited call is sketched after this list)
- User Agent: For web scraping identification
- Groq API: For AI-powered feature extraction
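The Alpha Vantage free tier allows roughly 5 calls per minute, so spacing requests about 12 seconds apart stays inside the limit. A sketch, where the OVERVIEW endpoint and the example ticker symbols are illustrative choices:

```python
import os
import time

import requests

API_KEY = os.getenv("ALPHA_VANTAGE_API_KEY")
symbols = ["BX", "KKR", "APO"]  # example publicly traded investment firms

for symbol in symbols:
    resp = requests.get(
        "https://www.alphavantage.co/query",
        params={"function": "OVERVIEW", "symbol": symbol, "apikey": API_KEY},
        timeout=10,
    )
    data = resp.json()
    print(symbol, data.get("Name"))
    time.sleep(12)  # ~5 calls/minute on the free tier
```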
Example programmatic usage:
```python
from main_pipeline import InvestorDataPipeline

# Initialize pipeline
pipeline = InvestorDataPipeline()

# Run complete pipeline
results = pipeline.run_full_pipeline()

# Print summary
pipeline.print_summary()
```
Sample extracted feature record:
```json
{
  "investor_name": "Accel Partners",
  "firm_name": "Accel",
  "ticket_size": "Mid ($10-50M)",
  "stage_fit": "early",
  "sector_focus": ["enterprise", "fintech", "consumer"],
  "geographic_focus": ["india", "us"],
  "recent_activity": ["new_deals", "hiring"],
  "confidence": 0.85
}
```
This codebase is designed to be easily extensible:
- Add new data sources in config.py
- Implement new scrapers following the existing patterns (a skeleton sketch follows this list)
- Enhance feature extraction with additional NLP techniques
- Add new export formats as needed
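A hypothetical skeleton for a new scraper that follows the same session and rate-limit conventions; the class name, method names, and URL are assumptions, not the repository's actual base classes:

```python
# Hypothetical skeleton for adding a new source; names are illustrative.
import time

import requests

class NewSourceScraper:
    BASE_URL = "https://example.com/investors"  # placeholder source URL

    def __init__(self, session=None, delay=2.0):
        self.session = session or requests.Session()
        self.delay = delay

    def scrape(self):
        """Fetch the source page and return parsed investor records."""
        response = self.session.get(self.BASE_URL, timeout=10)
        response.raise_for_status()
        time.sleep(self.delay)  # respect the configured rate limit
        return self.parse(response.text)

    def parse(self, html):
        # Parse the page (e.g., with BeautifulSoup) into the shared record format.
        return []
```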
The system includes comprehensive error handling for:
- Network connectivity issues
- Rate limiting responses
- HTML structure changes
- API failures
- Data parsing errors
Each module continues operation even if individual sources fail, ensuring maximum data collection.
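The continue-on-failure behaviour can be pictured as a loop like the following; this is a sketch, and the scraper objects and logger name are assumptions:

```python
# Run every scraper, keep whatever succeeds, and log failures per source.
import logging

import requests

logger = logging.getLogger("investor_pipeline")

def collect_all(scrapers):
    """Run every scraper and keep whatever succeeds."""
    records = []
    for scraper in scrapers:
        try:
            records.extend(scraper.scrape())
        except requests.RequestException as exc:
            logger.warning("Network error in %s: %s", type(scraper).__name__, exc)
        except Exception as exc:
            logger.error("Source %s failed: %s", type(scraper).__name__, exc)
    return records
```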
Performance considerations:
- Concurrent processing where possible (a thread-pool sketch follows this list)
- Smart caching to avoid redundant requests
- Incremental data collection support
- Memory efficient processing for large datasets
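Concurrent collection can be sketched with a thread pool; the worker count, source list, and `fetch_source` placeholder are illustrative assumptions:

```python
# Fetch several sources in parallel and aggregate whatever succeeds.
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_source(source):
    # Placeholder: scrape one source and return its records.
    return []

sources = [{"name": "openvc"}, {"name": "growjo"}]  # illustrative
results = []

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(fetch_source, s): s for s in sources}
    for future in as_completed(futures):
        source = futures[future]
        try:
            results.extend(future.result())
        except Exception as exc:
            print(f"{source['name']} failed: {exc}")
```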
For issues or questions:
- Check the generated log files for detailed error information
- Verify API keys and network connectivity
- Ensure all required packages are installed
- Review website terms of service for any changes
This solution provides a production-ready foundation for investor data collection and analysis that can be easily customized for specific research needs.