Questions about model options #52

CharlesWiltgen · 2025-09-15T17:31:15Z

CharlesWiltgen
Sep 15, 2025

Hey, I love ck so far and I've been recommending it. Presumably I'm using bge-small, but if the tool gives me a way to validate that, I've missed it. Some questions about the models:

I assume bge-small is the default because it's the fastest and generally "good enough". Is that a fair characterization? Are there scenarios in which bge-small would give a user better results than nomic or jina-code?
When you say that nomic is "better for large functions and classes", does this mean in terms of LoC or cyclomatic complexity? In either case, are there specific cut-off points where you would say that bge-small should be avoided in favor of nomic or jina-code?
When you say that jina-code is "specialized for code understanding", it's unclear to me why I wouldn't always want to use it. I'd love to hear the team's thoughts on that!

[EDIT]

I just updated to v0.4.5 and re-indexed (wasn't sure if I needed to), and see that my index is using nomic. So in case that's interesting, bge-small wasn't the default for me.

▸ Indexing Repository
ℹ Scanning files in .
ℹ 🤖 Model: nomic-embed-text-v1.5
ℹ 📏 FastEmbed Config: 8192 token limit
ℹ 📄 Chunk Config: 1024 tokens target, 200 token overlap (~20%)

runonthespot · 2025-09-19T22:32:00Z

runonthespot
Sep 19, 2025
Maintainer

Great question. I've refined the model selection a bit. The default is going to be bge, mainly because Jina and Nomic are still very heavyweight for local usage, even on a relatively beefy machine (I have an M2 Max with 64 GB of RAM, and my laptop starts to take off when indexing a moderately sized codebase on Nomic), but I assume we'll get there.

0.4.5 had a bug that reported the incorrect model. This has now been fixed, and there are some nice affordances for switching between models, plus some safeguards to handle various edge cases when doing so.

I think for now bge will remain the most sensible embedding option, but I'm thinking that being able to plug in your own choice of remote embedding backend might have to be the way forward (e.g. point at your own OpenAI ada, Cohere, or Hugging Face-served model), at least until we get a decent tiny embedder that has good code and language coverage.
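
For readers wondering what a "plug in your own backend" design could look like, here is a minimal sketch in Python. It is not ck's code or API: `Embedder`, `ToyLocalEmbedder`, `RemoteEmbedder`, `index_chunks`, and the `send` callable are all hypothetical names invented for illustration, and the "local model" is just a hash so the example runs without downloading anything.

```python
import hashlib
from typing import Callable, Protocol, Sequence


class Embedder(Protocol):
    """Hypothetical interface: anything that turns texts into vectors."""

    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class ToyLocalEmbedder:
    """Stand-in for a local model. Hashes text into a small deterministic vector."""

    def __init__(self, dim: int = 8) -> None:
        if not 1 <= dim <= 32:  # a SHA-256 digest has 32 bytes
            raise ValueError("dim must be between 1 and 32")
        self.dim = dim

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            vectors.append([byte / 255.0 for byte in digest[: self.dim]])
        return vectors


class RemoteEmbedder:
    """Wraps a user-supplied callable that talks to a remote service.

    `send` is an assumption: in practice it would wrap whatever HTTP client
    and endpoint you choose. No real provider API is modelled here.
    """

    def __init__(self, send: Callable[[Sequence[str]], list[list[float]]]) -> None:
        self._send = send

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        return self._send(list(texts))


def index_chunks(chunks: Sequence[str], embedder: Embedder) -> dict[str, list[float]]:
    """Embed each chunk with whichever backend was passed in."""
    return dict(zip(chunks, embedder.embed(chunks)))


if __name__ == "__main__":
    index = index_chunks(["fn main() {}", "struct Foo;"], ToyLocalEmbedder(dim=4))
    print({text: len(vector) for text, vector in index.items()})
```

The indexing function depends only on the `embed` method, so swapping a local model for a remote one would not touch the rest of the pipeline.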


runonthespot · 2025-09-19T22:33:23Z

runonthespot
Sep 19, 2025
Maintainer

Specifically answering your questions:
- Nomic/Jina have bigger context windows (bge is only 512 tokens), so in theory we should be able to have slightly larger chunks. I'm told 1024 is a good happy medium: relatively large, but not so large that it dilutes accuracy.
- bge is the default.
- Jina can't be the default until hardware is generally stronger, but we'll get there!
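
To make the chunk-size point concrete, here is a rough sketch of overlapping chunking in Python, using the numbers that appear in this thread (1024-token target, 200-token overlap, a 512-token limit for bge, and the 8192-token limit shown in the indexing log for nomic). It is not ck's chunker: `chunk_tokens`, `fits`, and `TOKEN_LIMITS` are hypothetical names, and the "tokens" are plain list items rather than the output of a real tokenizer.

```python
# Token limits taken from this thread; other models are omitted because
# their limits are not stated here.
TOKEN_LIMITS = {
    "bge-small": 512,
    "nomic-embed-text-v1.5": 8192,
}


def chunk_tokens(tokens: list, target: int = 1024, overlap: int = 200) -> list[list]:
    """Split tokens into windows of up to `target` items, each sharing
    `overlap` items with the previous window."""
    if target <= 0 or not 0 <= overlap < target:
        raise ValueError("need target > 0 and 0 <= overlap < target")
    step = target - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + target])
        if start + target >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


def fits(model: str, chunk_size: int) -> bool:
    """True if a chunk of `chunk_size` tokens fits the model's limit."""
    return chunk_size <= TOKEN_LIMITS[model]


if __name__ == "__main__":
    tokens = list(range(2500))  # stand-in for a tokenized source file
    chunks = chunk_tokens(tokens)
    print([len(chunk) for chunk in chunks])
    print(fits("bge-small", 1024), fits("nomic-embed-text-v1.5", 1024))
```

With these settings each window advances by 824 tokens, so consecutive chunks share 200 tokens (about 20% of the target). A 1024-token chunk exceeds a 512-token window, which is why the larger chunk size only makes sense with the bigger-context models.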


CharlesWiltgen · 2025-10-12T16:24:37Z

CharlesWiltgen
Oct 12, 2025
Author

Thank you, @runonthespot! Can I ask what you personally use, and what you would recommend for something like a 64GB M1 Max system? (The large-ish project I'm using it with is a TypeScript-centric monorepo with a good amount of internal documentation on data models, coding and style standards, architecture decision records, etc.)

