Real-time Market Segmentation Engine using Unsupervised Learning.
MarketPulse AI is a high-performance market segmentation engine built on unsupervised learning. It automates market data collection (API & Web Scraping), performs advanced feature engineering, and applies a unified Standardization → PCA → K-Means pipeline to detect emerging market regimes and strategic asset clusters.
The system is designed for professional environments, featuring strict typing, automated model versioning, and real-time performance monitoring.
- Automated Ingestion: Dynamic ticker discovery (S&P 500) and multi-source financial news collection.
- Unified ML Pipeline: Encapsulated Scikit-Learn pipeline ensuring strict parity between training and inference.
- Dynamic Hyper-parameter Tuning: Automatic selection of the optimal number of clusters ($k$) based on Silhouette scores.
- Business Intelligence: Automated cluster profiling and strategic labeling for human-readable insights.
- Monitoring & Observability: Real-time tracking of ML metrics (explained variance, silhouette) persisted in MongoDB.
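
The silhouette-based choice of $k$ can be illustrated with a minimal sketch. It is not the project's actual code: the helper names `build_pipeline` and `select_k`, the candidate range, the two PCA components, and the synthetic data are all assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline(n_clusters: int) -> Pipeline:
    """Standardization -> PCA -> K-Means wrapped in one estimator."""
    return Pipeline(
        steps=[
            ("scaler", StandardScaler()),
            ("pca", PCA(n_components=2, random_state=42)),
            ("kmeans", KMeans(n_clusters=n_clusters, n_init=10, random_state=42)),
        ]
    )


def select_k(X, candidates: range) -> tuple[int, dict[int, float]]:
    """Fit one pipeline per candidate k and keep the best silhouette score."""
    scores: dict[int, float] = {}
    for k in candidates:
        pipeline = build_pipeline(k)
        labels = pipeline.fit_predict(X)
        # Score in the reduced space that K-Means actually clustered.
        X_reduced = pipeline[:-1].transform(X)
        scores[k] = float(silhouette_score(X_reduced, labels))
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for engineered asset features (three separated groups).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0, 0, 0], [8, 8, 0, 0], [0, 8, 8, 8]],
    cluster_std=0.5,
    random_state=0,
)
best_k, scores = select_k(X, range(2, 7))
print(best_k, scores)
```

Scoring in the PCA-reduced space keeps the metric consistent with what K-Means optimizes. Real market features are rarely this well separated, so the silhouette curve is usually flatter than on this toy data.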
- Runtime: Bun (Tooling) & Python 3.12+
- Package Manager: uv (Fast, reliable dependency management)
- Backend Framework: FastAPI (Pydantic v2)
- ML / Data Science:
  - scikit-learn (StandardScaler, PCA, KMeans)
  - pandas, numpy (Advanced data manipulation)
- Infrastructure:
  - Database: MongoDB Atlas (Storage of raw data, news, and ML metrics)
  - Configuration: pydantic-settings (Environment-based centralized config)
  - Scraping: BeautifulSoup4, httpx, yfinance
The codebase follows a modular architecture aligned with Senior Data Science standards.
MarketPulse_AI/
├── artifacts/ # Versioned ML models (.pkl)
├── logs/ # Application & Audit logs
├── src/
│ ├── api/ # FastAPI layer (Routes, Pydantic schemas)
│ ├── config.py # Centralized settings & environment management
│ ├── ingestion/ # Data acquisition (Yahoo Finance, RSS Scrapers)
│ ├── models/ # ML Logic (Unified Pipeline, Business Profiling)
│ ├── processing/ # Data cleaning & Feature engineering
│ └── utils/ # Core utilities (Logger, DB Client, Custom Exceptions)
├── pyproject.toml # uv-managed dependencies
└── README.md # This documentation

Unlike traditional prototypes, MarketPulse AI uses a single sklearn.pipeline.Pipeline object. This guards against data leakage and training/inference skew: the StandardScaler and PCA parameters fitted during training are exactly the ones applied during real-time inference.
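
As a rough illustration of that parity, the sketch below fits one pipeline, saves it as a versioned .pkl file, reloads it, and checks that both copies assign the same clusters. The timestamp-based file name, the component and cluster counts, and the random data are assumptions for the example, not the project's actual scheme.

```python
import pickle
import tempfile
from datetime import datetime, timezone
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 6))  # placeholder for engineered features
X_live = rng.normal(size=(5, 6))     # placeholder for fresh observations

pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=3, random_state=42)),
        ("kmeans", KMeans(n_clusters=4, n_init=10, random_state=42)),
    ]
)
pipeline.fit(X_train)

# Hypothetical versioning scheme: UTC timestamp embedded in the file name.
artifacts_dir = Path(tempfile.mkdtemp()) / "artifacts"
artifacts_dir.mkdir(parents=True)
version = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
model_path = artifacts_dir / f"pipeline_{version}.pkl"

with model_path.open("wb") as fh:
    pickle.dump(pipeline, fh)

with model_path.open("rb") as fh:
    restored = pickle.load(fh)

# The restored object carries the fitted scaler, PCA, and centroids together,
# so inference cannot silently use different preprocessing parameters.
labels_before = pipeline.predict(X_live)
labels_after = restored.predict(X_live)
print(labels_before, labels_after)
```

Because the scaler, PCA, and K-Means travel as one object, there is no separate preprocessing artifact that could drift out of sync with the model.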
- Python 3.12 or higher.
- uv installed (curl -LsSf https://astral.sh/uv/install.sh | sh).
- A running MongoDB instance (Local or Atlas).
git clone <repository_url>
cd MarketPulse_AI
uv sync

Create a .env file in the root directory:
MONGO_URI="mongodb+srv://user:pass@cluster.mongodb.net/"
DB_NAME="marketpulse"
CORS_ORIGINS='["http://localhost:3000"]'

uv run uvicorn src.api.main:app --reload

API Documentation available at: http://127.0.0.1:8000/docs
- GET /market-segments: Retrieves real-time asset clustering with business labels and PCA projections.
- POST /trigger-update: Forces a background retraining of the ML pipeline with new data.
- GET /monitoring/latest-metrics: Returns the health status of the latest model (Silhouette score, PCA variance).
- GET /market-news: Fetches the latest financial news feed (Yahoo Finance / Investing.com).
- v0.1: Initial prototype and API.
- v0.2: Unified Pipeline refactoring and Model Versioning.
- v0.3: ML Metrics persistence and Custom Error Handling.
- v0.4: Integration with a Next.js / TailwindCSS Dashboard.
- v0.5: Advanced Anomaly Detection using Isolation Forests.
- Typing: Strict type hints enforced throughout the project.
- Documentation: All docstrings are written in English and follow the Google style.
- Logging: Rotating file logs for production auditability.
- ROI Driven: Every technical decision is linked to data reliability and business scalability.
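
The rotating file logs mentioned above could be set up along these lines with the standard library. This is a sketch only: the `build_logger` helper, the file name, the size limit, and the backup count are assumed values, not the project's configuration.

```python
import logging
import tempfile
from logging.handlers import RotatingFileHandler
from pathlib import Path


def build_logger(log_dir: Path, name: str = "marketpulse") -> logging.Logger:
    """Return a logger that writes to a size-capped, rotating log file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if logger.handlers:
        # Already configured; avoid attaching duplicate handlers.
        return logger

    log_dir.mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(
        log_dir / "app.log",
        maxBytes=1_000_000,  # rotate after roughly 1 MB (assumed limit)
        backupCount=5,       # keep app.log.1 ... app.log.5
        encoding="utf-8",
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s | %(levelname)s | %(name)s | %(message)s")
    )
    logger.addHandler(handler)
    return logger


log_dir = Path(tempfile.mkdtemp()) / "logs"
logger = build_logger(log_dir)
logger.info("Model update triggered")
```

Capping both the file size and the number of backups keeps disk usage bounded while preserving a recent audit trail.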