```mermaid
flowchart TD
    U([User]) -->|prompt + model choice| APP
    subgraph RW["Railway"]
        APP[["app.py<br/>Streamlit UI"]]
        DB[("Postgres<br/>llm_runs")]
        APP -->|log_run / log_comparison_runs| DB
    end
    subgraph Providers["Model Providers"]
        HF["HuggingFace Router<br/>gpt-oss-120b"]
        GH["GitHub Models<br/>dynamically picked"]
    end
    CAT["evaluate/<br/>prompt_evaluator.py"] -. scores + picks .-> GH
    APP -->|"single: HF_TOKEN"| HF
    APP -->|"single: GITHUB_TOKEN"| GH
    APP -->|"comparison: parallel"| HF
    APP -->|"comparison: parallel"| GH
    HF -->|output + prompt_tokens<br/>+ output_tokens| APP
    GH -->|output + prompt_tokens<br/>+ output_tokens| APP
    DB -->|SQL queries| MB[["Metabase<br/>model tracing dashboard"]]
    classDef store fill:#1f6feb,stroke:#fff,color:#fff
    classDef dash fill:#6e40c9,stroke:#fff,color:#fff
    class DB store
    class MB dash
```
Flow:
- User selects a model (OSS, Commercial, or both) and submits a prompt
- `app.py` calls the provider(s) → in parallel via `ThreadPoolExecutor` in comparison mode
- Token counts (`prompt_tokens`, `output_tokens`) are extracted from `response.usage` (sketched below)
- Every run is logged to Postgres → a single row for `oss`/`commercial`, two linked rows (same `run_group_id`) for `osscom`
- Metabase connects directly to the Railway Postgres and visualises model usage, latency, and token consumption
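A minimal sketch of that extraction step, assuming an OpenAI-compatible response object; the helper name and the `completion_tokens` fallback are illustrative, not the repo's actual code:

```python
def extract_usage(response):
    """Pull token counts from response.usage, tolerating providers that omit it."""
    usage = getattr(response, "usage", None)
    prompt_tokens = getattr(usage, "prompt_tokens", None) if usage else None
    output_tokens = None
    if usage:
        # OpenAI-style responses usually name this field completion_tokens;
        # it lands in the output_tokens column of llm_runs.
        output_tokens = getattr(usage, "output_tokens", None) or getattr(
            usage, "completion_tokens", None
        )
    return prompt_tokens, output_tokens
```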
A Streamlit app that runs the same prompt against an OSS model (gpt-oss-120b via HuggingFace) and a commercial GitHub Model (picked dynamically from the live catalog) side by side, then logs every call to Postgres on Railway.
```mermaid
flowchart LR
    U([User]) -->|prompt| APP
    APP[["app.py"]] -->|HF_TOKEN| HF["HuggingFace Router<br/>gpt-oss-120b"]
    HF -->|output + usage| APP
    APP -->|log_run<br/>model_type=oss| DB[("Postgres<br/>llm_runs")]
```
```mermaid
flowchart LR
    U([User]) -->|prompt| APP
    APP[["app.py"]] -->|GITHUB_TOKEN| CAT["evaluate/prompt_evaluator.py<br/>fetch + score catalog"]
    CAT -->|best model_id| APP
    APP -->|GITHUB_TOKEN| GH["GitHub Models<br/>dynamically picked"]
    GH -->|output + usage| APP
    APP -->|log_run<br/>model_type=commercial| DB[("Postgres<br/>llm_runs")]
```
```mermaid
flowchart LR
    U([User]) -->|prompt| APP
    APP[["app.py<br/>ThreadPoolExecutor"]] -->|parallel| HF["HuggingFace Router<br/>gpt-oss-120b"]
    APP -->|parallel| GH["GitHub Models"]
    HF -->|out_a + tokens| APP
    GH -->|out_b + tokens| APP
    APP -->|log_comparison_runs<br/>run_group_id shared| DB
    subgraph DB["Postgres · llm_runs"]
        RA["row A<br/>model_type=oss<br/>run_group_id=xyz"]
        RB["row B<br/>model_type=commercial<br/>run_group_id=xyz"]
    end
```
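Comparison mode, reduced to a minimal sketch; `call_oss`, `call_commercial`, and the `log_comparison_runs` signature are stand-ins for the real functions in `app.py` and `db.py`:

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

def call_oss(prompt: str) -> str: ...         # gpt-oss-120b via the HF router
def call_commercial(prompt: str) -> str: ...  # dynamically picked GitHub model
def log_comparison_runs(group_id: str, prompt: str, out_a: str, out_b: str): ...

def run_comparison(prompt: str):
    # Fire both providers at once so the slower one sets the wall-clock time.
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_a = pool.submit(call_oss, prompt)
        fut_b = pool.submit(call_commercial, prompt)
        out_a, out_b = fut_a.result(), fut_b.result()
    group_id = str(uuid.uuid4())  # the shared run_group_id linking both rows
    log_comparison_runs(group_id, prompt, out_a, out_b)
    return out_a, out_b
```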
```bash
cd prompt_process_trace_setup
pip install -r requirements.txt
cp .env.example .env   # fill in GITHUB_TOKEN, HF_TOKEN, DATABASE_URL
streamlit run app.py
```

- New project → deploy from this repo (root `prompt_process_trace_setup/`)
- Add a PostgreSQL plugin → Railway injects `DATABASE_URL` automatically
- Set `GITHUB_TOKEN` and `HF_TOKEN` as service variables
- First request auto-creates the `llm_runs` table
| File | Purpose |
|---|---|
| `app.py` | Streamlit UI, single or side-by-side mode |
| `db.py` | Postgres schema · `log_run()` · `log_comparison_runs()` |
| `evaluate/prompt_evaluator.py` | Fetches the live GitHub Models catalog and picks the best commercial model |
| `prompt/prompt.md` | The test prompt (movie review of Project Hail Mary) |
| Column | Type | Notes |
|---|---|---|
| `id` | serial | primary key |
| `run_at` | timestamptz | auto |
| `run_group_id` | text | shared UUID for `osscom` comparison rows |
| `model_id` | text | full model identifier |
| `model_type` | text | `oss` · `commercial` · `osscom` |
| `prompt` | text | |
| `output` | text | |
| `error` | text | null on success |
| `elapsed_sec` | float | wall-clock time |
| `prompt_tokens` | int | from `response.usage` |
| `output_tokens` | int | from `response.usage` |
| `mode` | text | `single` · `comparison` |
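For reference, the auto-created table could be sketched as below, assuming `db.py` uses psycopg2 and Railway's injected `DATABASE_URL`; the actual DDL may differ:

```python
import os
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS llm_runs (
    id            SERIAL PRIMARY KEY,
    run_at        TIMESTAMPTZ DEFAULT now(),
    run_group_id  TEXT,
    model_id      TEXT,
    model_type    TEXT,
    prompt        TEXT,
    output        TEXT,
    error         TEXT,
    elapsed_sec   DOUBLE PRECISION,
    prompt_tokens INT,
    output_tokens INT,
    mode          TEXT
);
"""

def ensure_table():
    # Runs on the first request; CREATE TABLE IF NOT EXISTS makes it idempotent.
    with psycopg2.connect(os.environ["DATABASE_URL"]) as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)
```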
- DataJourneyHQ/list-github-models
- Metabase: open-source BI connected to Railway Postgres
- OSS Discovery + Deployed via GitHub Action @sayantikabanik
- API-first asset librarian for local-first document and image similarity analysis. https://github.com/arcnem-ai/omnivec @Kthom1
A `workflow_dispatch` workflow that runs the CrewAI agent on demand directly from GitHub Actions. No local setup, no OpenAI account: just a GitHub PAT.
How it works:
- You trigger it manually from the Actions tab with 3 inputs: `criteria`, `programming_languages`, `project_types`
- The agent uses `gpt-4o-mini` routed through GitHub's model endpoint (https://models.inference.ai.azure.com), authenticated via your GitHub token (see the sketch after this list)
- It searches the web for open source projects, scrapes repos, and writes a discovery report
- The report is saved as both `.md` and `.html` and uploaded as a downloadable artifact
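Wiring CrewAI to that endpoint is roughly the following sketch; CrewAI's `LLM` class routes through LiteLLM, and the exact kwargs may vary by version:

```python
import os
from crewai import LLM

# base_url is the endpoint named above; the env var matches the secret below.
llm = LLM(
    model="openai/gpt-4o-mini",
    base_url="https://models.inference.ai.azure.com",
    api_key=os.environ["GH_MODELS_TOKEN"],  # GitHub PAT with Models read access
)
```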
Only 1 secret needed:

| Secret | What it is |
|---|---|
| `GH_MODELS_TOKEN` | Your GitHub PAT with Models read permission |
The workflow currently completes successfully even when the search tool fails.
Here's what actually happens at runtime:
```
Agent calls SerperDevTool to search the web
        │
        ▼  ❌ 403 Forbidden → SERPER_API_KEY missing or invalid
        │
        │  ERROR: 403 Client Error: Forbidden
        │  Tool: search_the_internet_with_serper
        │  Iteration: 26 (all 25 retries exhausted)
        │
        ▼
Agent falls back to LLM training knowledge
        │
        ▼  ✅ GitHub Actions reports SUCCESS
           Artifact uploaded; looks like a normal report
```

The output looks completely valid: proper markdown, real project names, GitHub URLs, star counts. But it is generated entirely from the LLM's training data (knowledge cutoff: early 2024), not from live web search. There are no guardrails in place to detect or reject this (see the sketch after the risk table below).
| Risk | Detail |
|---|---|
| Stale data | Projects may be archived, renamed, or no longer maintained |
| Fabricated URLs | Links may point to wrong or non-existent repos |
| False confidence | Report reads as authoritative with no warning it failed |
| CI shows green | Nothing in the pipeline signals that tools broke |
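One cheap mitigation, not currently in the workflow, is a post-run step that scans the agent log for tool failures and fails the job. A minimal sketch, assuming the run's output is captured to `crew.log` (both the path and the matched string are assumptions about this setup):

```python
import pathlib
import sys

def assert_search_tool_worked(log_path: str = "crew.log") -> None:
    """Exit non-zero so the Actions job goes red instead of silently green."""
    text = pathlib.Path(log_path).read_text(errors="ignore")
    if "403 Client Error: Forbidden" in text:
        sys.exit("Search tool failed: the report is training-data only.")

if __name__ == "__main__":
    assert_search_tool_worked()
```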