A comprehensive RAG (Retrieval-Augmented Generation) pipeline for searching through code repositories using semantic similarity and pattern matching.
- Semantic Code Search: Find code using natural language queries
- Multiple Search Modes:
  - Chunks: Show best-matching code chunks
  - Function Definitions: Find function definitions
  - Function References: Find function calls
  - Variable References: Find variable usage
- Web Interface: Streamlit-based UI for easy interaction
- Command Line Interface: CLI for batch processing and automation
- Multi-language Support: Python, JavaScript, TypeScript, Java, Go, C++, and more
RAG-Experiments/
├── db_creation.py # Script to build FAISS index from code repository
├── db_retrieval.py # CLI script for code retrieval
├── streamlit_retrieval.py # Web interface for code search
├── requirements_streamlit.txt # Dependencies for Streamlit app
├── faiss_store/ # Directory containing FAISS index and metadata (gitignored)
└── README.md
- Clone the repository:
git clone <your-repo-url>
cd RAG-Experiments
- Install dependencies:
pip install -r requirements_streamlit.txt
First, create a FAISS index from your code repository:
python db_creation.py /path/to/your/repo /path/to/faiss_store --ext .py .ts .tsx .js .jsx .md
Options:
- --model: SentenceTransformer model (default: sentence-transformers/all-MiniLM-L6-v2)
- --chunk-size: Lines per chunk (default: 120)
- --chunk-overlap: Overlap between chunks (default: 20)
- --ext: File extensions to include
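The --chunk-size and --chunk-overlap options suggest a sliding window over each file's lines. The following is a minimal sketch of that idea, assuming line-based windows. The function name `chunk_lines` is hypothetical, and db_creation.py may split files differently.

```python
def chunk_lines(lines, chunk_size=120, chunk_overlap=20):
    """Split a list of lines into overlapping windows.

    Returns (first_line, last_line, lines) tuples with 1-based,
    inclusive line numbers. Hypothetical helper, not the project's API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(lines), step):
        end = min(start + chunk_size, len(lines))
        chunks.append((start + 1, end, lines[start:end]))
        if end == len(lines):
            break  # the last window already reached the end of the file
    return chunks
```

With the defaults, a 250-line file yields windows covering lines 1-120, 101-220, and 201-250, so each chunk shares 20 lines with its predecessor.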
# Semantic search
python db_retrieval.py /path/to/faiss_store "how to create a trpc router" --mode chunks
# Find function definitions (single quotes stop the shell from running the backticked name as a command)
python db_retrieval.py /path/to/faiss_store 'find `AuthService` definitions' --mode function-defs
# Find function calls
python db_retrieval.py /path/to/faiss_store 'where is `createUser` called' --mode function-refs
# Find variable references
python db_retrieval.py /path/to/faiss_store "DEBUG_MODE" --mode var-refs --identifier DEBUG_MODE
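For intuition about what the chunks mode does, the sketch below ranks stored chunk vectors by cosine similarity to a query vector. It is a brute-force, pure-Python stand-in for the FAISS lookup, using toy two-dimensional vectors instead of real SentenceTransformer embeddings. The names `cosine_similarity` and `top_k` are hypothetical and do not come from db_retrieval.py.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return (chunk_index, score) pairs for the k most similar chunks, best first."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Toy data: three "chunk embeddings" and one "query embedding".
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_k([1.0, 0.1], chunk_vecs, k=2)
```

A real index avoids comparing the query against every vector one by one in Python, but the ranking principle is the same.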
streamlit run streamlit_retrieval.py
Then open your browser to http://localhost:8501
- "how to implement authentication"
- "error handling patterns"
- "database connection setup"
- "find `AuthService` definitions"
- "where is `createUser` called"
- "`validateInput` function references"
- "DEBUG_MODE usage"
- "API_KEY configuration"
- "database connection string"
- Python: .py
- JavaScript/TypeScript: .js, .jsx, .ts, .tsx
- Java: .java
- Go: .go
- C/C++: .c, .cpp, .h, .hpp
- Ruby: .rb
- PHP: .php
- And more...
- chunks: Show best-matching code chunks (semantic search)
- function-defs: Find function definitions using regex patterns
- function-refs: Find function calls and references
- var-refs: Find variable usage and references
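The three pattern-based modes can be approximated with a few regular expressions. The sketch below is an illustration under assumptions: the patterns and the function names (`find_definitions`, `find_references`, `find_variable_uses`) are hypothetical, and the patterns db_retrieval.py actually uses may differ.

```python
import re


def _definition_pattern(name):
    # Matches lines such as "def name", "async def name", "function name",
    # "export async function name", "func name", and "class name".
    ident = re.escape(name)
    return re.compile(
        rf"^\s*(?:export\s+)?(?:async\s+)?(?:def|function|func|class)\s+{ident}\b"
    )


def _scan(text, predicate):
    """Return (line_number, stripped_line) for every line accepted by predicate."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if predicate(line)
    ]


def find_definitions(name, text):
    pattern = _definition_pattern(name)
    return _scan(text, lambda line: bool(pattern.search(line)))


def find_references(name, text):
    # A call looks like "name(", but the definition line itself is excluded.
    call = re.compile(rf"\b{re.escape(name)}\s*\(")
    definition = _definition_pattern(name)
    return _scan(text, lambda line: bool(call.search(line)) and not definition.search(line))


def find_variable_uses(name, text):
    # Word boundaries keep DEBUG_MODE from matching inside MY_DEBUG_MODE_FLAG.
    pattern = re.compile(rf"\b{re.escape(name)}\b")
    return _scan(text, lambda line: bool(pattern.search(line)))
```

Regexes of this kind are fast and language-agnostic but imprecise: they miss Go methods with receivers and arrow functions, and they also match inside comments and strings.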
You can use different SentenceTransformer models:
python db_creation.py /path/to/repo /path/to/faiss_store --model sentence-transformers/all-mpnet-base-v2
Filter results by programming language:
python db_retrieval.py /path/to/faiss_store "router" --mode function-refs --language typescript
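One plausible way to support a language filter is to infer each result's language from its file extension. The sketch below assumes that each result is a dict with a `path` key. That record shape, the mapping table, and the function names are illustrative assumptions, not the project's actual data model.

```python
from pathlib import PurePosixPath

# Extension-to-language table based on the extensions listed in this README.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".jsx": "javascript",
    ".ts": "typescript",
    ".tsx": "typescript",
    ".java": "java",
    ".go": "go",
    ".c": "c",
    ".h": "c",
    ".cpp": "cpp",
    ".hpp": "cpp",
    ".rb": "ruby",
    ".php": "php",
}


def language_of(path):
    """Return the language name for a file path, or None if the extension is unknown."""
    return EXTENSION_TO_LANGUAGE.get(PurePosixPath(path).suffix.lower())


def filter_by_language(results, language):
    """Keep only results whose file extension maps to the requested language."""
    wanted = language.lower()
    return [result for result in results if language_of(result["path"]) == wanted]
```

For example, filtering results from `src/router.ts` and `app/router.py` by `"typescript"` keeps only the first.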
Process multiple repositories:
for repo in repo1 repo2 repo3; do
  python db_creation.py "/path/to/$repo" "/path/to/faiss_store_$repo"
done
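The same batch run can be driven from Python, which makes it easier to stop on the first failure or to vary options per repository. This is a sketch that uses only the positional arguments and the --ext flag shown above. The helper names `build_index_command` and `index_all` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_index_command(repo, store_root, extensions=(".py",)):
    """Build the argv list for one db_creation.py run (one store per repository)."""
    repo = Path(repo)
    store = Path(store_root) / f"faiss_store_{repo.name}"
    return [sys.executable, "db_creation.py", str(repo), str(store), "--ext", *extensions]


def index_all(repos, store_root, extensions=(".py",)):
    """Index each repository in turn; check=True raises if a run exits with an error."""
    for repo in repos:
        subprocess.run(build_index_command(repo, store_root, extensions), check=True)
```

Calling `index_all(["/path/to/repo1", "/path/to/repo2"], "/path/to")` would write the indexes to `/path/to/faiss_store_repo1` and `/path/to/faiss_store_repo2`.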
- Indexing: ~1000 files per minute (depends on file size and model)
- Search: Sub-second response time for most queries
- Memory: ~500MB for 10k files (varies by model)
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is open source and available under the MIT License.
- SentenceTransformers for semantic embeddings
- FAISS for efficient similarity search
- Streamlit for the web interface