A Python project for data ingestion and document parsing using modern AI/ML libraries.
- Data ingestion and processing
- Document parsing capabilities
- Integration with LangChain and vector databases
- Support for various document formats (PDF, DOCX)
- ChromaDB and FAISS for vector storage and retrieval
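To illustrate what "vector storage and retrieval" means in the feature list above, here is a minimal pure-Python sketch. It is not the ChromaDB or FAISS API: the `embed` function is a toy word-count stand-in for a real embedding model, and `TinyVectorStore` is a hypothetical in-memory class used only to show the add-then-search flow.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        # Store the raw text together with its vector
        self._items.append((text, embed(text)))

    def search(self, query: str, k: int = 1) -> list[str]:
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = TinyVectorStore()
store.add("invoices are parsed from pdf files")
store.add("meeting notes are stored as docx files")
print(store.search("how are pdf invoices parsed", k=1))
```

A real vector database replaces the linear scan in `search` with an index and the word counts with dense embeddings, but the overall shape (embed, store, rank by similarity) is the same.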
This project uses the following key libraries:
- LangChain: For AI/ML pipeline management
- ChromaDB & FAISS: Vector databases for semantic search
- Sentence Transformers: For text embeddings
- Document Processing: Support for PDF (PyPDF2) and DOCX files
- pandas: Data manipulation and analysis
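As a rough sketch of how the PDF and DOCX support listed above could be wired together, the snippet below routes a file to a reader based on its extension. The function names (`read_pdf`, `read_docx`, `load_text`) are hypothetical and not taken from this project, and the use of the python-docx package for DOCX files is an assumption, since the list above only names PyPDF2.

```python
from pathlib import Path


def read_pdf(path: Path) -> str:
    """Extract text from every page of a PDF using PyPDF2."""
    from PyPDF2 import PdfReader  # imported lazily so this module loads without PyPDF2

    reader = PdfReader(str(path))
    # extract_text() may yield nothing for image-only pages, so fall back to ""
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def read_docx(path: Path) -> str:
    """Extract paragraph text from a DOCX file (assumes python-docx is installed)."""
    from docx import Document

    return "\n".join(p.text for p in Document(str(path)).paragraphs)


# Map file extensions to reader functions
LOADERS = {".pdf": read_pdf, ".docx": read_docx}


def load_text(path: str) -> str:
    """Pick a reader by file extension and return the document's text."""
    p = Path(path)
    loader = LOADERS.get(p.suffix.lower())
    if loader is None:
        raise ValueError(f"Unsupported document format: {p.suffix!r}")
    return loader(p)
```

Keeping the readers in a dictionary means another format can be supported by adding one entry rather than editing `load_text`.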
This project uses uv for dependency management.
- Install uv (if not already installed):
  curl -LsSf https://astral.sh/uv/install.sh | sh
- Create and activate a virtual environment:
  uv venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
- Install dependencies:
  uv sync
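After installing, it can be useful to confirm that the key libraries are importable from the active environment. The following is a small sketch using only the standard library. The module names in `EXPECTED_MODULES` are assumptions about import names, which can differ from the package names declared in pyproject.toml.

```python
from importlib.util import find_spec

# Import names are assumptions; they can differ from package names on PyPI
# (for example, the faiss-cpu package is imported as "faiss").
EXPECTED_MODULES = [
    "langchain",
    "chromadb",
    "faiss",
    "sentence_transformers",
    "PyPDF2",
    "pandas",
    "ipykernel",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(EXPECTED_MODULES)
    print("All expected modules found." if not missing else f"Missing: {', '.join(missing)}")
```

Saved under a name of your choice (for example a hypothetical `check_env.py`), the script can be run with `python check_env.py` inside the activated virtual environment.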
Run the main application:
python main.py
The repository is organized as follows:
- main.py - Main application entry point
- 0-DataIngestParsing/ - Data ingestion and parsing notebooks
  - 1-dataingestion.ipynb - Data ingestion workflows
  - 3-dataparsingdoc.ipynb - Document parsing examples
- requirements.txt - Legacy requirements file
- pyproject.toml - Modern Python project configuration
For Jupyter notebook development, ipykernel is included in the dependencies.
[Add your license here]