This folder contains two variants of an incremental indexer and a small FastAPI query service:
- `index_create_local.py`: index documents from the local `./data` directory into a Gemini File Search store.
- `index_create_gcp_cloud.py`: index documents stored in a Google Cloud Storage bucket into a Gemini File Search store; state and store_name are persisted in the GCS bucket under `config/`.
- `app.py`: FastAPI server that queries an existing File Search store and returns answers + citations.

Workflows supported:
- Local flow (running entirely locally)
  - Use `index_create_local.py` to create (if needed) and incrementally index files from `./data` into a Gemini File Search store.
  - The script maintains a local `.
  - After indexing, `app.py` reads `.store_name` and exposes a `/ask` endpoint that answers queries using the indexed documents.
- GCS flow (source files and state live in Google Cloud Storage)
  - Use `index_create_gcp_cloud.py` to index documents stored in a GCS bucket.
  - The script stores `config/store_name.txt` and `config/indexed_files.json` in the bucket to track state and the store resource.
  - `app.py` still reads a local `.store_name` file: you can either download `config/store_name.txt` to `.store_name` or modify `app.py` to read from GCS directly.

Key features:

- Incremental: both index scripts keep a record of which files were already indexed (local: `.indexed_files.json`; GCS: `config/indexed_files.json` in the bucket). On subsequent runs, only new or changed files are uploaded.
- Idempotent: unchanged files are skipped to save API calls and indexing costs.
- Citations: when the query endpoint returns an answer, it also returns citations (snippets and titles from the retrieved contexts used to construct the answer).

See `requirements.txt` in this folder. Minimum:
- Python 3.10+
- python-dotenv
- google-genai (Gemini Developer client)
- google-cloud-storage (for the GCS variant)
- fastapi, uvicorn, pydantic (for `app.py`)

Setup:

- Create and activate a virtualenv (macOS / zsh):

      python3 -m venv .venv
      source .venv/bin/activate

- Install dependencies:

      pip install -r requirements.txt

- Provide credentials:
  - Add a `.env` file in this folder with your Gemini API key and any other variables used by the scripts, for example:

        GOOGLE_API_KEY=your_api_key_here
        DOCS_BUCKET=your-gcs-bucket-name
        DOCS_PREFIX=PdfDocuments/
        CONFIG_PREFIX=config/

  - For the GCS workflow, ensure your environment has Google Cloud Application Default Credentials (ADC) available. This usually means setting `GOOGLE_APPLICATION_CREDENTIALS` to a service account JSON key:

        export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
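
Before running either script, it can help to confirm that the expected variables are actually set. The following is a minimal sketch and not code from this folder: the helper name `missing_settings` and the split into always-required and GCS-only variables are assumptions based on the `.env` example above.

```python
import os

# Assumption: GOOGLE_API_KEY is needed by every script, while the bucket
# variable only matters for the GCS indexer. Adjust to match your scripts.
REQUIRED = ("GOOGLE_API_KEY",)
REQUIRED_FOR_GCS = ("DOCS_BUCKET",)


def missing_settings(env, use_gcs=False):
    """Return the names of expected variables that are unset or empty."""
    names = REQUIRED + (REQUIRED_FOR_GCS if use_gcs else ())
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    # If you use python-dotenv, call load_dotenv() before this check.
    missing = missing_settings(os.environ, use_gcs=True)
    print("Missing settings:", ", ".join(missing) if missing else "none")
```

Passing the environment in as a mapping keeps the check easy to exercise with a plain dictionary instead of the real process environment.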

Local workflow:

- Put the documents you want to index into the `data/` directory (supported: .txt, .pdf, .docx, .doc).
- Run the local indexer:

      python3 index_create_local.py

Behavior:

- If `.store_name` does not exist, the script will attempt to create a new File Search store and write its resource name to `.store_name`.
- The script computes a SHA-256 hash for each file and stores a map of filename -> hash in `.indexed_files.json`.
- On subsequent runs, only new or changed files will be uploaded and indexed.
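
The hash-based change detection described above can be sketched as follows. This is an illustration under stated assumptions rather than the code in `index_create_local.py`: the function names `sha256_of` and `files_to_index` are hypothetical, and the state file is assumed to be a flat JSON object of filename -> hash.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large documents are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_index(data_dir, state_path):
    """Return (changed, current): names needing upload and the new hash map."""
    state_file = Path(state_path)
    previous = json.loads(state_file.read_text()) if state_file.exists() else {}
    current = {
        p.name: sha256_of(p)
        for p in sorted(Path(data_dir).iterdir())
        if p.is_file()
    }
    # A file is new or changed when its hash differs from the recorded one.
    changed = [name for name, digest in current.items() if previous.get(name) != digest]
    return changed, current
```

In this sketch the caller would write `current` back to the state file only after the uploads succeed, so that a failed run is retried on the next invocation.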

GCS workflow:

- Upload your documents to the configured GCS bucket under the `DOCS_PREFIX` (default `PdfDocuments/`).
- Run the GCS indexer:

      python3 index_create_gcp_cloud.py

Behavior:

- The script looks for `config/store_name.txt` in the bucket. If missing, it tries to create a store and writes the name to that blob.
- It keeps `config/indexed_files.json` in GCS containing a map of blob_name -> md5_hash and only uploads new/changed files.
- The script downloads new/changed blobs to `/tmp` before uploading them to the File Search store, then cleans up the temp files.
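
The download, upload, and clean-up step above could look roughly like this. It is a sketch, not the script's actual code: `process_via_temp_file` is a hypothetical helper, and `download` and `upload` are placeholder callables standing in for the GCS and File Search calls.

```python
import os
import tempfile


def process_via_temp_file(blob_name, download, upload):
    """Download one blob to a temp file, hand it to `upload`, then delete it.

    `download(blob_name, local_path)` and `upload(local_path)` are
    hypothetical callables standing in for the GCS and File Search calls.
    """
    suffix = os.path.splitext(blob_name)[1]  # keep the extension, e.g. ".pdf"
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the path is needed; the callables open it themselves
    try:
        download(blob_name, local_path)
        return upload(local_path)
    finally:
        # Runs on success and on failure, so temp files do not pile up.
        if os.path.exists(local_path):
            os.remove(local_path)
```

With `google-cloud-storage`, `download` could wrap `bucket.blob(name).download_to_filename(path)`. Using `try`/`finally` means the temp file is removed even when the upload raises.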

Querying with the FastAPI app:

- Ensure `.store_name` exists locally (you can copy it from GCS `config/store_name.txt` if using the GCS flow):

      # download the store name from GCS and save it as .store_name (example)
      python3 - <<'PY'
      from pathlib import Path
      from google.cloud import storage

      bucket = storage.Client().bucket('your-bucket-name')
      name = bucket.blob('config/store_name.txt').download_as_text().strip()
      Path('.store_name').write_text(name)
      print(name)
      PY

  or simply copy the file:

      gsutil cp gs://your-bucket-name/config/store_name.txt .store_name

- Start the FastAPI app:

      uvicorn app:app --reload

- POST a question to `/ask`:

      curl -X POST "http://127.0.0.1:8000/ask" -H "Content-Type: application/json" -d '{"query":"What is the main topic of lostinmiddle.pdf?"}'

The response will contain the generated answer and a list of citations (document title + snippet) that the model grounded on.
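
The same request can be sent from Python using only the standard library. This is a minimal sketch: the helper names `build_ask_request` and `ask` are hypothetical, and the response is returned as a plain dict because the exact field names produced by `app.py` are not specified here.

```python
import json
import urllib.request


def build_ask_request(query, base_url="http://127.0.0.1:8000"):
    """Build the same POST request as the curl example."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query, base_url="http://127.0.0.1:8000", timeout=60):
    """Send the question and return the decoded JSON response as a dict."""
    request = build_ask_request(query, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `print(ask("What is the main topic of lostinmiddle.pdf?"))` shows the full response, including whatever keys the service uses for the answer and citations.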

Notes and troubleshooting:

- The indexing scripts use the Gemini Developer API (vertexai=False). If your installed `google-genai` client doesn't support store creation, create the store manually and write the resource name into `.store_name` (or into the GCS config location).
- If you encounter authentication errors, check these:
  - `GOOGLE_API_KEY` in `.env` (for genai Developer client authentication).
  - `GOOGLE_APPLICATION_CREDENTIALS` for GCS access (service account JSON).
- Large files may take time to index; the scripts poll operations until indexing completes.
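
The polling mentioned in the last note can be pictured as a loop like the one below. It is a generic sketch, not the scripts' code: `wait_until_done` is a hypothetical helper, and it only assumes that the operation object exposes a `done` attribute and that `refresh(operation)` returns its latest state.

```python
import time


def wait_until_done(operation, refresh, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll `operation` until its `done` attribute is true or time runs out.

    `refresh(operation)` is a hypothetical callable that returns the latest
    state of the operation; `sleep` is injectable so the loop can be tested.
    """
    waited = 0.0
    while not operation.done:
        if waited >= timeout:
            raise TimeoutError(f"operation not finished after {timeout} seconds")
        sleep(interval)
        waited += interval
        operation = refresh(operation)
    return operation
```

With the `google-genai` client, `refresh` would most likely be `client.operations.get`; check the version you have installed before relying on that. The timeout is an addition in this sketch so that a stuck operation does not block the indexer forever.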

Possible next steps:

- Add a small wrapper script to sync `config/store_name.txt` from GCS to local `.store_name` automatically.
- Add unit tests that mock `genai.Client` and `google.cloud.storage`.
- Add retry/backoff logic for transient API failures.
- Add a Makefile or small helper scripts to automate the full flow (index -> copy store_name -> run app -> ask).
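
As a starting point for the retry/backoff item, one possible shape is exponential backoff with jitter. This is a sketch under assumptions: `with_retries` is a hypothetical helper, and the default `retry_on` exception types are placeholders, since the errors actually raised by the Gemini and GCS clients may differ.

```python
import random
import time


def with_retries(call, attempts=5, base_delay=1.0, max_delay=30.0,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Run `call()`, retrying with exponential backoff and jitter.

    `retry_on` lists the exception types treated as transient (an assumption;
    replace it with the errors your API clients actually raise).
    """
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads retries
```

A call such as `with_retries(lambda: upload_one(path))`, where `upload_one` is whatever function performs a single upload, would then retry only the listed error types and let everything else fail immediately.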