Merged
10 changes: 10 additions & 0 deletions .env.example
@@ -70,3 +70,13 @@ DEEPSEEK_MODEL_PRO_NAME=deepseek-v4-pro
# RAG_VECTOR_BACKEND=local
# RAG_CHROMA_PATH=logs/chroma
# RAG_CHROMA_COLLECTION=study_agent

# === RAG Embeddings (default: local_hash, no API key required) ===
# local_hash suits local development and testing; openai suits Chroma persistent vector retrieval.
# RAG_EMBEDDING_PROVIDER=local_hash
# RAG_EMBEDDING_PROVIDER=openai
# RAG_EMBEDDING_MODEL=text-embedding-3-small
# RAG_EMBEDDING_DIMENSIONS=1536
# RAG_EMBEDDING_API_KEY=
# RAG_EMBEDDING_BASE_URL=
# RAG_EMBEDDING_TIMEOUT_SECONDS=30
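
A minimal sketch of how the embedding variables above could be read into one typed object. The `EmbeddingConfig` class and `load_embedding_config` function are hypothetical names for illustration, not the project's actual API. The fallback values simply reuse the sample values from the template, and the rule that `openai` needs an API key is an assumption based on the "requires explicit configuration" wording in the docs.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EmbeddingConfig:
    """Hypothetical container; the project's real config type may differ."""

    provider: str
    model: str
    dimensions: int
    api_key: Optional[str]
    base_url: Optional[str]
    timeout_seconds: float


def load_embedding_config(env: Optional[Mapping[str, str]] = None) -> EmbeddingConfig:
    """Read RAG_EMBEDDING_* variables, falling back to the local_hash provider."""
    env = os.environ if env is None else env
    provider = env.get("RAG_EMBEDDING_PROVIDER", "local_hash").strip().lower()
    if provider not in {"local_hash", "openai"}:
        raise ValueError(f"Unsupported RAG_EMBEDDING_PROVIDER: {provider!r}")

    # Treat an empty string the same as an unset variable.
    api_key = env.get("RAG_EMBEDDING_API_KEY") or None
    if provider == "openai" and api_key is None:
        # Assumption: the remote provider is only usable with an explicit key.
        raise ValueError("RAG_EMBEDDING_API_KEY is required when provider is 'openai'")

    return EmbeddingConfig(
        provider=provider,
        model=env.get("RAG_EMBEDDING_MODEL", "text-embedding-3-small"),
        dimensions=int(env.get("RAG_EMBEDDING_DIMENSIONS", "1536")),
        api_key=api_key,
        base_url=env.get("RAG_EMBEDDING_BASE_URL") or None,
        timeout_seconds=float(env.get("RAG_EMBEDDING_TIMEOUT_SECONDS", "30")),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to test without touching the real environment.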
30 changes: 18 additions & 12 deletions README.md
@@ -3,7 +3,7 @@
<p>
<a href="https://github.com/2002yy/study-agent/actions/workflows/ci.yml"><img src="https://github.com/2002yy/study-agent/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
<img src="https://img.shields.io/badge/python-3.12-blue" alt="Python 3.12">
<img src="https://img.shields.io/badge/tests-273%20passed-green" alt="273 tests passed">
<img src="https://img.shields.io/badge/tests-277%20passed-green" alt="277 tests passed">
</p>

A local AI learning assistant with long-term memory, role-based group chat,
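@@ -17,7 +17,7 @@ Study Agent is a local-first AI learning assistant; the focus is not simply calling
- **Long-term memory**: Markdown memory + safe writer
- **Context tiers**: fast / light / deep / archive
- **Web search**: RSS / News fetch → article extraction → LLM digest → source tracing
- **RAG MVP**: local Markdown / TXT / DOCX / PDF indexing, keyword / local vector prototype / hybrid / backend-vector retrieval, citation context, source blocks, Streamlit retrieval/debug panel, chat injection, and FastAPI RAG endpoints
- **RAG MVP**: local Markdown / TXT / DOCX / PDF indexing, keyword / local vector prototype / hybrid / backend-vector retrieval, configurable embedding provider, optional Chroma persistence, citation context, source blocks, Streamlit retrieval/debug panel, chat injection, and FastAPI RAG endpoints
- **Engineering security**: SSRF protection, detect-secrets, configuration templates
- **Engineering quality**: pytest test suite, Ruff, GitHub Actions CI, packaging checks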

@@ -27,11 +27,11 @@ Study Agent is a local-first AI learning assistant; the focus is not simply calling
- **Model routing** with fast / light / deep / archive context tiers
- **Long-term memory** based on Markdown files and safe-writer persistence
- **Web search pipeline**: feed registry → URL safety checks → article extraction → LLM digest → auditable source trace
- **RAG MVP**: local Markdown / TXT / DOCX / PDF indexing, lexical / local vector prototype / hybrid / backend-vector retrieval, citation-first context formatting, source blocks, a Streamlit retrieval/debug panel, optional chat injection, and FastAPI RAG endpoints
- **RAG MVP**: local Markdown / TXT / DOCX / PDF indexing, lexical / local vector prototype / hybrid / backend-vector retrieval, configurable embedding providers, optional Chroma persistence, citation-first context formatting, source blocks, a Streamlit retrieval/debug panel, optional chat injection, and FastAPI RAG endpoints
- **SSRF protection** for article fetching, **detect-secrets** in CI
- **Batched session logging** and multi-layer caching for performance
- **Performance budget**: mode-based `max_tokens` bounds on the main chat, WeChat, and news LLM paths
- **273 pytest tests**, Ruff clean, GitHub Actions CI workflow
- **277 pytest tests**, Ruff clean, mypy clean, GitHub Actions CI workflow

For a detailed breakdown of the stack and engineering highlights, see [Technical Stack & Engineering Highlights](docs/TECH_STACK.md).

@@ -207,13 +207,19 @@ pip-compile requirements-dev.in # re-lock development dependencies

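Parameter precedence: explicit code arguments → task-level environment variables → task defaults → global environment variables → provider-level environment variables. For the full configuration, see [`.env.example`](.env.example) and the [User Guide](USER_GUIDE.md).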
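
The precedence chain above can be sketched as a first-match lookup. This is only an illustration: `resolve_setting` is a hypothetical helper, and the environment variable naming (`<TASK>_<NAME>`, `<NAME>`, `<PROVIDER>_<NAME>`) is an assumption chosen for the example, not the project's documented scheme.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    name: str,
    *,
    task: str,
    provider: str,
    explicit: Optional[str] = None,
    task_defaults: Optional[Mapping[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first configured value, walking the chain from highest priority."""
    env = os.environ if env is None else env
    candidates = (
        explicit,                                # 1. explicit code argument
        env.get(f"{task}_{name}".upper()),       # 2. task-level environment variable
        (task_defaults or {}).get(name),         # 3. task default
        env.get(name.upper()),                   # 4. global environment variable
        env.get(f"{provider}_{name}".upper()),   # 5. provider-level environment variable
    )
    for value in candidates:
        # Empty strings count as "not set" so a blank template line does not win.
        if value is not None and value != "":
            return value
    return None
```

For example, with `TEMPERATURE` and `DEEPSEEK_TEMPERATURE` both set in the environment, a task default still overrides them, and an explicit argument overrides everything.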

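The RAG vector backend defaults to `local` and needs no extra services; the optional `chroma` adapter requires users to install `chromadb` themselves:
The RAG vector backend defaults to `local` and needs no extra services; the optional `chroma` adapter requires users to install `chromadb` themselves. The embedding provider defaults to `local_hash`; production retrieval can switch explicitly to OpenAI-compatible embeddings: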

```bash
RAG_VECTOR_BACKEND=local
# RAG_VECTOR_BACKEND=chroma
# RAG_CHROMA_PATH=logs/chroma
# RAG_CHROMA_COLLECTION=study_agent

RAG_EMBEDDING_PROVIDER=local_hash
# RAG_EMBEDDING_PROVIDER=openai
# RAG_EMBEDDING_MODEL=text-embedding-3-small
# RAG_EMBEDDING_DIMENSIONS=1536
# RAG_EMBEDDING_API_KEY=...
```

---
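@@ -243,7 +249,7 @@ RAG_VECTOR_BACKEND=local
│ ├── config.py # Global configuration
│ ├── router.py # Routing configuration
│ ├── news/ # News aggregation pipeline
│ ├── rag/ # Local RAG MVP: loading, chunking, indexing, keyword / vector prototype / optional backend retrieval
│ ├── rag/ # Local RAG MVP: loading, chunking, indexing, keyword / vector prototype / embedding / optional backend retrieval
│ └── ui/ # Streamlit UI components
├── tests/ # pytest test suite
├── docs/ # Design documents and engineering notes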
@@ -264,13 +270,13 @@ RAG_VECTOR_BACKEND=local
## Testing

```bash
pytest tests/ -v # current local baseline: 273 passed
pytest tests/ -v # current local baseline: 277 passed
pytest tests/ --cov=src # coverage
ruff check src/ tests/ # linting
mypy --explicit-package-bases src/ # CI soft check; may report type debt
mypy --explicit-package-bases src/ # type check
```

CI runs via GitHub Actions on push / pull request and integrates `pytest`, `ruff`, packaging checks, a `detect-secrets` scan, and a non-blocking `mypy` soft check. For the current verification status, see [docs/TESTING.md](docs/TESTING.md).
CI runs via GitHub Actions on push / pull request and integrates `pytest`, `ruff`, packaging checks, a `detect-secrets` scan, and a `mypy` soft check. For the current verification status, see [docs/TESTING.md](docs/TESTING.md).

---

@@ -307,9 +313,9 @@ CI runs via GitHub Actions on push / pull request and integrates `pytest`, `
Job-search-oriented technical roadmap:

- [ ] FastAPI service layer (partial): `/health`, `/rag`, `/rag/index`, `/rag/query` implemented; `/chat` and `/memory` remain planned
- [x] RAG MVP: Markdown / TXT / DOCX / PDF loading, chunking, local keyword retrieval, local vector prototype, hybrid retrieval, citation context, source blocks, Streamlit retrieval panel, optional single-chat and WeChat interactive injection
- [ ] RAG document QA (partial): PDF parsing has file-size, page-count, extracted-text and encrypted-file guards; Chroma adapter scaffold exists; production embedding model retrieval remains planned
- [ ] Vector store: FAISS local prototype, pgvector engineering version
- [x] RAG MVP: Markdown / TXT / DOCX / PDF loading, chunking, local keyword retrieval, local vector prototype, hybrid retrieval, backend-vector retrieval, configurable embedding provider, optional Chroma adapter, citation context, source blocks, Streamlit retrieval panel, optional single-chat and WeChat interactive injection
- [ ] RAG document QA (partial): PDF parsing has file-size, page-count, extracted-text and encrypted-file guards; production embedding requires explicit API/env configuration and Chroma remains optional
- [ ] Vector store: Chroma optional adapter implemented; FAISS local prototype and pgvector engineering version remain planned
- [ ] Web UI: TypeScript + Vue3 / React, streaming chat, source panel
- [ ] Observability: trace_id, token usage, latency, provider fallback logs

4 changes: 2 additions & 2 deletions docs/INTERVIEW_NOTES.md
@@ -10,7 +10,7 @@ Study Agent is a local-first AI learning assistant, with a focus on multi-Provider mod…
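2. **Safe long-term memory writes**: safe writer + preview/confirm mechanism to prevent irreversible memory pollution
3. **Traceable web-search sources**: feed registry / multi-source RSS aggregation → URL safety matrix → three-layer article body extraction → LLM digest → pipeline trace, with sources traceable end to end
4. **Streamlit re-render performance optimization**: multi-layer caching strategy, mode-based batched flushing to disk, token budget control on the main path
5. **CI / Ruff / detect-secrets engineering checks**: 273 pytest tests, Ruff clean, GitHub Actions workflow, detect-secrets hard-blocks on non-allowlisted findings
5. **CI / Ruff / detect-secrets engineering checks**: 277 pytest tests, Ruff clean, mypy clean locally, GitHub Actions workflow, detect-secrets hard-blocks on non-allowlisted findings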

## Talking Points

@@ -23,7 +23,7 @@ Study Agent is a local-first AI learning assistant, with a focus on multi-Provider mod…

## Presentation Boundaries

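- `mypy` is wired into CI as a soft check, but type errors remain locally, so type checking cannot be described as clean.
- `mypy` is wired into CI as a soft check, and `python -m mypy --explicit-package-bases src` is currently clean locally; the CI configuration still treats it as a non-blocking check.
- `performance_budget.py` covers the main chat / WeChat / news LLM paths; auxiliary LLM calls still need to be brought under it.
- `article_fetcher.py` performs DNS/IP SSRF validation before real network reads; `link_resolver.py` handles network-independent URL pre-checks and redirect recording.
- `detect-secrets` is wired into CI and hard-blocks on non-allowlisted findings by parsing `results` in the scan JSON; the Basic Auth-style URL samples in tests are explicitly allowlisted.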
22 changes: 12 additions & 10 deletions docs/RAG.md
@@ -2,7 +2,7 @@

## Status

Current status: **MVP implemented with a local vector prototype, not a production vector-store RAG system yet**.
Current status: **MVP implemented with a local-first retrieval path, configurable embeddings and an optional Chroma adapter**.

Implemented:

@@ -21,12 +21,13 @@ Implemented:
- UI source blocks for retrieved file paths, line ranges, scores and matched terms
- FastAPI endpoints: `GET /health`, `POST /rag`, `POST /rag/index`, `POST /rag/query`
- Streamlit knowledge/debug panel with index summary, document rows, chunk preview and score breakdowns
- Optional vector backend interface with local fallback and Chroma adapter scaffold
- Optional vector backend interface with local fallback and Chroma adapter
- Configurable embedding providers: deterministic `local_hash` by default, OpenAI-compatible embeddings when explicitly configured

Not implemented yet:

- Production embedding model integration
- FAISS, pgvector or managed vector stores
- Production-grade embedding evaluation, relevance tuning and re-index migration tooling
- Automatic injection into every generation path; current injection covers single chat and WeChat interactive replies, but not news discussion or after-session feedback

## Module Map
@@ -36,9 +37,9 @@ Not implemented yet:
| `src/rag/loader.py` | Load supported local files into normalized `RagDocument` objects |
| `src/rag/chunker.py` | Split documents into line-traceable `RagChunk` objects |
| `src/rag/index.py` | Build, save, load and search a local JSON RAG index |
| `src/rag/embeddings.py` | Embedding provider contract and local hash embedding provider |
| `src/rag/embeddings.py` | Embedding provider contract, local hash provider and OpenAI-compatible provider |
| `src/rag/backends.py` | Vector backend contract, local backend and environment-driven backend selection |
| `src/rag/chroma_backend.py` | Optional Chroma persistent backend adapter scaffold |
| `src/rag/chroma_backend.py` | Optional Chroma persistent backend adapter |
| `src/rag/vector.py` | Deterministic local vector prototype and hybrid retrieval |
| `src/rag/eval.py` | LLM-free retrieval quality evaluation over gold query fixtures |
| `src/rag/service.py` | Application-facing helpers for indexing, querying and context formatting |
@@ -67,7 +68,7 @@ Supported retrieval modes:
- `lexical`: TF-IDF-style term scoring
- `vector`: deterministic local hash-vector cosine similarity
- `hybrid`: normalized lexical score plus vector similarity
- `backend_vector`: configured vector backend; defaults to local and can use the optional Chroma adapter
- `backend_vector`: configured vector backend; defaults to local and can use the optional Chroma adapter with configured embeddings
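
As an illustration of the `hybrid` mode described above, the sketch below combines a normalized lexical score with a vector similarity per chunk. It is a hedged example, not the code in `src/rag/vector.py`: the function names are hypothetical, and both the max-normalization and the equal weighting of the two signals are assumptions.

```python
from typing import Dict, List, Tuple


def hybrid_scores(lexical: Dict[str, float], vector: Dict[str, float]) -> Dict[str, float]:
    """Add max-normalized lexical scores to vector similarities, keyed by chunk id."""
    max_lexical = max(lexical.values(), default=0.0)
    combined: Dict[str, float] = {}
    for chunk_id in set(lexical) | set(vector):
        # Scale lexical scores into [0, 1] so they are comparable with cosine similarity.
        lexical_part = lexical.get(chunk_id, 0.0) / max_lexical if max_lexical > 0 else 0.0
        combined[chunk_id] = lexical_part + vector.get(chunk_id, 0.0)
    return combined


def rank(scores: Dict[str, float], top_k: int = 3) -> List[Tuple[str, float]]:
    """Sort by descending score, breaking ties by chunk id for deterministic output."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]


# Hypothetical per-chunk scores from the lexical and vector retrievers.
lexical_hits = {"notes.md#1": 4.0, "notes.md#2": 2.0}
vector_hits = {"notes.md#2": 0.9, "guide.md#1": 0.3}
ranked = rank(hybrid_scores(lexical_hits, vector_hits))
```

A chunk that appears in only one of the two result sets still receives a score, so a strong lexical-only or vector-only match is not dropped.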

Each result keeps:

@@ -139,11 +140,12 @@ P4-C / P6 adds Streamlit inspection controls:

P5 adds the first vector-backend abstraction:

- `EmbeddingProvider` protocol plus `LocalHashEmbeddingProvider`
- `EmbeddingProvider` protocol plus `LocalHashEmbeddingProvider` and `OpenAIEmbeddingProvider`
- `VectorBackend` protocol plus `LocalVectorBackend`
- `RAG_VECTOR_BACKEND=local|chroma`
- `RAG_EMBEDDING_PROVIDER=local_hash|openai`, `RAG_EMBEDDING_MODEL`, `RAG_EMBEDDING_DIMENSIONS`, `RAG_EMBEDDING_API_KEY`, `RAG_EMBEDDING_BASE_URL`
- Optional `ChromaVectorBackend` using lazy `chromadb` import, `PersistentClient`, collection `upsert` and vector query
- `tests/test_rag_backends.py` verifies local backend behavior, environment config and Chroma fake-client upsert/query behavior
- `tests/test_rag_backends.py` verifies local backend behavior, embedding environment config, OpenAI-compatible embedding batching and Chroma fake-client upsert/query behavior
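
To make the provider contract listed above concrete, here is a minimal sketch of a protocol plus a deterministic hash-based provider. The class names mirror the ones in the list, but the `embed` method, the `dimensions` attribute, and the token-hashing scheme are all assumptions for illustration; the real `src/rag/embeddings.py` may look different.

```python
import hashlib
import math
import re
from typing import List, Protocol, Sequence


class EmbeddingProvider(Protocol):
    """Assumed contract: turn a batch of texts into fixed-size vectors."""

    dimensions: int

    def embed(self, texts: Sequence[str]) -> List[List[float]]: ...


class LocalHashEmbeddingProvider:
    """Deterministic bag-of-tokens embedding that needs no API key or network."""

    def __init__(self, dimensions: int = 64) -> None:
        if dimensions <= 0:
            raise ValueError("dimensions must be positive")
        self.dimensions = dimensions

    def embed(self, texts: Sequence[str]) -> List[List[float]]:
        return [self._embed_one(text) for text in texts]

    def _embed_one(self, text: str) -> List[float]:
        vector = [0.0] * self.dimensions
        for token in re.findall(r"\w+", text.lower()):
            # A stable hash (unlike built-in hash()) keeps vectors identical across runs.
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vector[int.from_bytes(digest[:4], "big") % self.dimensions] += 1.0
        norm = math.sqrt(sum(value * value for value in vector))
        return [value / norm for value in vector] if norm else vector


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity that returns 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
```

Because the provider is a structural protocol, a remote OpenAI-compatible provider could satisfy the same interface without inheriting from anything, which is what lets a backend swap providers through configuration alone.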

## Next Steps

@@ -163,9 +165,9 @@ Goal: replace the local hash-vector prototype with optional real embeddings with

- [x] Extract an embedding-provider and vector-backend contract.
- [x] Keep JSON + lexical / hybrid retrieval as the zero-infrastructure fallback.
- [x] Add an optional Chroma adapter scaffold with lazy import and fake-client tests.
- [x] Add an optional Chroma adapter with lazy import and fake-client tests.
- [x] Make vector backend selection explicit through config.
- [ ] Add a production embedding provider; current Chroma adapter uses the local hash embedding provider by default.
- [x] Add a production embedding provider path; current default remains `local_hash`, while OpenAI-compatible embeddings require explicit env/API configuration.

### P6: Knowledge UI
