feat: use tree sitter to enhance context for code operations #51

Open
Robitx opened this issue Nov 9, 2023 · 3 comments
Assignees: Robitx
Labels: enhancement (New feature or request)

Robitx (Owner) commented Nov 9, 2023

https://neovim.io/doc/user/treesitter.html#lua-treesitter
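
A minimal sketch of the idea, using the built-in vim.treesitter Lua API (Neovim 0.9+): walk up from the node under the cursor to the nearest enclosing function/method/class and return its text, so it can be prepended to the prompt as extra context. The function name, node-type patterns, and fallback behaviour are illustrative assumptions, not existing gp.nvim code.

```lua
-- Sketch only: grab the enclosing function/method/class around the cursor
-- with the built-in treesitter API, to use as additional prompt context.
local function enclosing_scope_text(bufnr)
  bufnr = bufnr or vim.api.nvim_get_current_buf()
  local node = vim.treesitter.get_node({ bufnr = bufnr })
  while node do
    local t = node:type()
    -- node type names differ per grammar; these patterns cover common languages
    if t:match("function") or t:match("method") or t:match("class") then
      return vim.treesitter.get_node_text(node, bufnr)
    end
    node = node:parent()
  end
  return nil -- no enclosing scope found; caller falls back to selection/buffer
end
```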

Robitx changed the title from "use tree sitter to enhance context for code operations" to "feat: use tree sitter to enhance context for code operations" on Nov 9, 2023
Robitx self-assigned this on Nov 9, 2023
sirupsen commented

@Robitx even better might be to embed the codebase with a local model (or even OpenAI) and pull the relevant embeddings into the context, so it can be done cross-file.
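
Purely for illustration, a rough sketch of the retrieval step, assuming chunk embeddings have already been computed elsewhere (by a local model or OpenAI's embeddings endpoint); the chunk table shape and function names are made up:

```lua
-- Sketch only: rank pre-embedded code chunks by cosine similarity to the
-- query embedding and keep the top-k as cross-file context for the prompt.
local function cosine(a, b)
  local dot, na, nb = 0, 0, 0
  for i = 1, #a do
    dot = dot + a[i] * b[i]
    na = na + a[i] * a[i]
    nb = nb + b[i] * b[i]
  end
  return dot / (math.sqrt(na) * math.sqrt(nb))
end

-- chunks: { { file = "...", text = "...", embedding = {...} }, ... } (hypothetical shape)
local function top_k_chunks(query_embedding, chunks, k)
  table.sort(chunks, function(x, y)
    return cosine(query_embedding, x.embedding) > cosine(query_embedding, y.embedding)
  end)
  return vim.list_slice(chunks, 1, math.min(k, #chunks))
end
```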

teocns commented Dec 4, 2023

I recommend grounding the brainstorming in very large projects, such as Chromium (~38 million LoC).

Disclaimer: I am not a seasoned Vim user, so my knowledge might be limited.

Per my understanding, Treesitter doesn't understand semantics or context beyond the structure of the code.
In my typical development routine, most of the time the practical challenge is studying the workflow and lifecycle of components (symbols) across a [large] codebase.

Other tools (such as EasyCodeAI) have figured this out and implemented their own self-hosted codebase indexers, which come with their own limitations.

I think focusing on the LSP's symbol-referencing mechanism could be key; however, with large codebases there are additional challenges, such as overloading the context with dirty files that are not of interest. What sounds plausible to me at first glance is a multi-stage prompt operation (a rough sketch of the first stage follows below): in the first stage GPT is fed context for the referenced symbols and discards the ones that are not of interest (based on a score), and only then is it fed the cleaner context. The latest GitHub Copilot Chat VSCode extension works that way.
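
A rough sketch of that first stage, using only standard Neovim LSP APIs; the function name, snippet window, and the scoring/second stage are hypothetical:

```lua
-- Sketch only: stage 1 of the hypothetical multi-stage flow -- collect a few
-- lines of context around every reference of the symbol under the cursor.
-- A later stage would ask the model to score these snippets and keep only
-- the relevant ones for the final prompt.
local function gather_reference_snippets(on_done)
  local params = vim.lsp.util.make_position_params()
  params.context = { includeDeclaration = true }
  vim.lsp.buf_request(0, "textDocument/references", params, function(err, result)
    if err or not result then
      return on_done({})
    end
    local snippets = {}
    for _, ref in ipairs(result) do
      local path = vim.uri_to_fname(ref.uri)
      local lines = vim.fn.readfile(path) -- naive: re-reads the file per reference
      local row = ref.range.start.line + 1 -- LSP lines are 0-based
      table.insert(snippets, table.concat(
        vim.list_slice(lines, math.max(1, row - 2), math.min(#lines, row + 2)), "\n"))
    end
    on_done(snippets)
  end)
end
```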

To set expectations: we're limited by the context window current GPT models support; the largest model tops out at 128k tokens.

teocns commented Dec 5, 2023

To add more thoughts on the topic, I'd also like to highlight the indexing challenge.
So far I have been working with two kinds of indexing mechanisms: static and dynamic.

Looking at clangd as an example:

Static indexing

clangd can pre-index an entire codebase based on compile_commands.json; the index is stored on disk and gives instantaneous symbol resolution across the whole codebase.

The pro is instantaneous project-wide symbol resolution; the con is having to pre-index the entire codebase (around ~3 hours for a project like Chromium on the most powerful Mac) before it can be consumed.

Dynamic indexing

Symbols are indexed dynamically based on the current buffer (by analyzing imports/includes). This is fast and handy, especially for module/sub-module scoped work, but it does not reach outer contexts: a significant drawback given how important out-of-scope context is when GPT serves as a "codebase-wide assistant".
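
For reference, a minimal sketch of enabling clangd's static (background) index described above from Neovim; this assumes nvim-lspconfig is installed and uses standard clangd flags, with the compile-commands directory only as an example:

```lua
-- Sketch only: enable clangd's on-disk background index so project-wide
-- symbol resolution is available once indexing finishes.
require("lspconfig").clangd.setup({
  cmd = {
    "clangd",
    "--background-index",           -- build and persist the static index
    "--compile-commands-dir=build", -- where compile_commands.json lives (example)
  },
})
```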
