Optimization of the Unindexed Files [Staging Area] #294

Open
osopardo1 opened this issue Mar 26, 2024 · 5 comments · May be fixed by #440

Comments

@osopardo1
Member

osopardo1 commented Mar 26, 2024

Qbeast Spark supports reading files that were not indexed with Qbeast Metadata. Several situations can leave a table in a hybrid state, with both indexed and non-indexed files:

  • Mixed writers. Some write with Qbeast, others with Delta. All writers can commit new files to the transaction log, but files written as plain Delta do not contain any Qbeast Metadata.
  • An existing Delta or Parquet table converted to Qbeast. The Convert To Qbeast command only adds a single metadata commit to the table; it does not rewrite or analyze any of the existing files.
  • Deletes and Updates. If a table receives an Update or Delete operation and uses the default Copy on Write strategy, the operation creates new files that are not indexed.

The current behavior is to ignore the non-indexed files when reading and writing, which disables part of the Sampling capabilities and reduces the precision of index estimation. In addition, optimization does not select this "staging area" subset of files for any rearrangement operation.
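
To make the hybrid state concrete, here is a minimal sketch of how the files added by a commit could be split into indexed and non-indexed ones. It relies only on the fact that a Delta `add` action carries an optional `tags` map. The tag key `revision`, the helper name `split_by_index_state`, and the sample file paths are assumptions for illustration; they are not taken from the Qbeast Spark code.

```python
import json

# Hypothetical tag key: this issue does not say which tags Qbeast writes,
# so "revision" is only a placeholder for "this file has Qbeast Metadata".
QBEAST_TAG_KEY = "revision"


def split_by_index_state(commit_lines, tag_key=QBEAST_TAG_KEY):
    """Split the files added by one commit into (indexed, non_indexed) paths.

    commit_lines: iterable of JSON strings, one Delta log action per line.
    """
    indexed, non_indexed = [], []
    for line in commit_lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        add = action.get("add")
        if add is None:
            continue  # not an "add" action (commitInfo, metaData, remove, ...)
        tags = add.get("tags") or {}  # "tags" is optional and may be null
        if tag_key in tags:
            indexed.append(add["path"])
        else:
            non_indexed.append(add["path"])
    return indexed, non_indexed


# Made-up commit content: one file with the placeholder tag, two without.
commit = [
    '{"commitInfo": {"operation": "WRITE"}}',
    '{"add": {"path": "part-0001.parquet", "tags": {"revision": "1"}}}',
    '{"add": {"path": "part-0002.parquet"}}',
    '{"add": {"path": "part-0003.parquet", "tags": null}}',
]
indexed, non_indexed = split_by_index_state(commit)
print("indexed:", indexed)
print("non-indexed:", non_indexed)
```

The second list is the "staging area" this issue is about: files that are part of the table but carry no index metadata.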

This issue records and analyzes the best approach to follow when optimizing the non-indexed files.

@fpj
Contributor

fpj commented Sep 26, 2024

@osopardo1 Are these unindexed files the result of a transaction that did not use Qbeast to append?

@osopardo1 osopardo1 added enhancement type: enhancement Improvement of existing feature or code and removed type: bug Something isn't working enhancement labels Sep 30, 2024
@osopardo1
Member Author

I will update the issue with a better problem formulation.
We still need to analyze which approaches we want to take, since Compaction, for example, is no longer used, and different techniques such as Index Without Rewrite have been designed since this issue was opened.

@osopardo1 osopardo1 changed the title from "Optimization of the Staging Area should index the data" to "Optimization of the Non-Indexed Files [Staging Area]" Oct 1, 2024
@osopardo1 osopardo1 added type:proposal and removed type: enhancement Improvement of existing feature or code labels Oct 1, 2024
@fpj
Contributor

fpj commented Oct 1, 2024

Let's use "unindexed files"; the term "unindexed" exists in English: https://en.wiktionary.org/wiki/unindexed#English

@osopardo1
Member Author

Oh, at first it sounded weird to me. I agree with "unindexed".

@osopardo1 osopardo1 changed the title from "Optimization of the Non-Indexed Files [Staging Area]" to "Optimization of the Unindexed Files [Staging Area]" Oct 4, 2024
@osopardo1
Member Author

Opened PR: #440

@osopardo1 osopardo1 linked a pull request Oct 22, 2024 that will close this issue
6 tasks
2 participants