Refactor `src/index.rs` and Enhance Documentation #87

tareknaser · 2024-02-16T14:20:56Z

Description

This PR focuses on refactoring the src/index.rs file.

Changes

The changes primarily include:

Define constants at the beginning of the file
Add doc comments for structs, fields, and methods
Use serde::{Deserialize, Serialize} for loading and storing of the index file
Rename the discover_paths function to discover_files
Refactor certain parts of the code to improve readability, including reducing indentation where possible
Add a new test to ensure that only canonical paths are present in the index
Use of TempDir for testing
Updating platform conditions for improved cross-compatibility (#[cfg(target_family = "unix")])

Edit

Fixes #89 with 633335a

…Update Signed-off-by: Tarek <tareknaser360@gmail.com>

Signed-off-by: Tarek <tareknaser360@gmail.com>

github-actions · 2024-02-16T14:27:09Z

Benchmark for `c3d41c0`

Click to view benchmark

Test	Base	PR	%
compute_bytes_large/compute_bytes	138.9±1.32µs	140.1±2.81µs	+0.86%
compute_bytes_medium/compute_bytes	27.7±0.39µs	27.0±0.35µs	-2.53%
compute_bytes_small/compute_bytes	127.4±1.20ns	127.7±2.16ns	+0.24%
index_build/index_build/tests/	157.7±0.66µs	161.5±1.33µs	+2.41%
tests/lena.jpg/compute_bytes	13.3±0.09µs	13.5±0.28µs	+1.50%
tests/test.pdf/compute_bytes	111.1±0.69µs	136.8±0.54µs	+23.13%

src/index.rs

kirillt · 2024-02-19T14:03:24Z

src/index.rs

-        }
+        let file = File::open(index_path)?;
+        let reader = BufReader::new(file);
+        let index: ResourceIndex = serde_json::from_reader(reader)?;


Am I right that this changes the current file format to JSON?

In the main branch, index is stored in plain-text format:

1675087824710 58922-3276384608 Info/crdt_vs_ot.jpg 1675087824610 125308-3041567246 Info/Watson-Crick & Hoogsteen.png 1675087831180 386929-63288935 Screenshots/Screenshot_20210717_120856.jpg 1675087831440 945565-3484272718 Screenshots/Screenshot_20220111_040205_com.adobe.reader.jpg 1675087832200 57186-2804307164 Screenshots/lenovo-fedora-20210427-023231-183683953.png 1675087832990 107862-4031618865 Screenshots/lenovo-fedora-20210328-050520-182040364.png 1675087833830 343648-561346264 Screenshots/Screenshot_20210804_003308.jpg 1675087834130 257481-490338418 Info/Technologies.jpg 1675087846130 415620-3764250547 Gif/сitrus.gif

I have found that storing and loading the same index with the main branch doesn't produce the same object. Not a matter of sorting but timestamps might be different as well. The code in the main branch for storing and loading contains more logic that just writing the struct to a file.
The idea here was to make ResourceIndex implement Serialize. with this setup, we can serialize to various formats

but It's up to you, we can revert to the old file format if you prefer it

I like your idea, I just don't want to write huge JSON to disk.

Some compact format would be great: postcard, CBOR, what else could serve this purpose?

On other hand, if we use automatic serialization, then the format will be excessive most likely: 2 mappings means creating two separate structures. Right now, we write only one table and derive 2 mappings from it during reading.

On other hand, if we use automatic serialization, then the format will be excessive

Good point. I noticed that as well.

I now reverted ResourceIndex::store() and ResourceIndex::load() to the old code. I just kept the code to make them serializable to have this possibility later.

I just wanted to add some notes about the current implementation of the 2 methods:

we currently don't store or load information about collisions in ResourceIndex

I noticed some inconsistencies when trying to store an object of ResourceIndex and then load it. I noticed some differences not only in order but also in modified_time. I think this may be a problem with how we parse the time

I could move this conversation to an issue and work on a separate PR to solve it

@tareknaser this is interesting

I noticed some inconsistencies when trying to store an object of ResourceIndex and then load it. I noticed some differences not only in order but also in modified_time. I think this may be a problem with how we parse the time

Could you tell more about the timestamps discrepancies? Probably steps to reproduce with ark-cli? Yeah, we could create separate issue if this is a bug indeed.

we currently don't store or load information about collisions in ResourceIndex

Yes, but is it an issue? Collisions should be derivable from the table itself, right?

Btw test case fn index_entry_order() could help debug lines ordering.

Could you tell more about the timestamps discrepancies? Probably steps to reproduce with ark-cli? Yeah, we could create separate issue if this is a bug indeed.

I actually noticed this difference when I wrote a simple test
Here is the test to reproduce

#[test] fn resource_index_load_store() { let temp_dir = TempDir::new("arklib_test") .expect("Failed to create temporary directory"); let temp_dir = temp_dir.into_path(); create_file_at( temp_dir.to_owned(), Some(FILE_SIZE_1), Some(FILE_NAME_1), ); let index = ResourceIndex::build(temp_dir.to_owned()); index .store() .expect("Should store index successfully"); let loaded_index = ResourceIndex::load(temp_dir.to_owned()) .expect("Should load index successfully"); // Assert that the loaded index is equal to the original index assert_eq!(index, loaded_index); }

When I run this test, I get this output

thread 'index::tests::resource_index_load_store' panicked at src/index.rs:824:9: assertion `left == right` failed left: ResourceIndex { id2path: {ResourceId { data_size: 10, crc32: 3817498742 }: "/private/var/folders/fg/kw558kp56h574pchd9zylc500000gn/T/arklib_test.VcRhfirveNvX/test1.txt"}, path2id: {"/private/var/folders/fg/kw558kp56h574pchd9zylc500000gn/T/arklib_test.VcRhfirveNvX/test1.txt": IndexEntry { modified: SystemTime { tv_sec: 1709340915, tv_nsec: 542881544 }, id: ResourceId { data_size: 10, crc32: 3817498742 } }}, collisions: {}, root: "/private/var/folders/fg/kw558kp56h574pchd9zylc500000gn/T/arklib_test.VcRhfirveNvX" } right: ResourceIndex { id2path: {ResourceId { data_size: 10, crc32: 3817498742 }: "/private/var/folders/fg/kw558kp56h574pchd9zylc500000gn/T/arklib_test.VcRhfirveNvX/test1.txt"}, path2id: {"/private/var/folders/fg/kw558kp56h574pchd9zylc500000gn/T/arklib_test.VcRhfirveNvX/test1.txt": IndexEntry { modified: SystemTime { tv_sec: 1709340915, tv_nsec: 542000000 }, id: ResourceId { data_size: 10, crc32: 3817498742 } }}, collisions: {}, root: "/private/var/folders/fg/kw558kp56h574pchd9zylc500000gn/T/arklib_test.VcRhfirveNvX" }

Notice how tv_nsec is different

Looks like rounding.. If you could investigate why is it rounded, it would be great.

I think I understand now
We're currently storing and loading timestamps in milliseconds, but SystemTime relies on Timespec under the hood which is:

struct Timespec { tv_sec: i64, tv_nsec: Nanoseconds, }

I confirmed locally that if we store and parse the timestamp as nanoseconds, we don't run into this issue. I'll double-check before pushing the code.

src/index.rs

kirillt · 2024-02-19T14:16:26Z

src/index.rs

    pub fn index_new(&mut self, path: &dyn AsRef<Path>) -> Result<IndexUpdate> {
-        log::debug!("Indexing a new path");


Let's use name track_addition if it's same function as here https://github.com/ARK-Builders/arklib/blob/index/update-one/src/index.rs#L359

kirillt · 2024-02-19T14:21:57Z

src/index.rs

+    /// Updates a single entry in the index with a new resource located at the
+    /// specified path, replacing the old resource associated with the given
+    /// ID.
+    ///
+    /// # Restrictions
+    ///
+    /// The caller must ensure that:
+    /// * the index is up-to-date except for this single path
+    /// * the path has been indexed before
+    /// * the path maps into `old_id`
+    /// * the content by the path has been modified
+    ///
+    /// # Errors
+    ///
+    /// Returns an error if the path does not exist, if the path is a directory
+    /// or an empty file, if the index cannot find the specified path, or if
+    /// the content of the path has not been modified.
    pub fn update_one(


It seems to me, that we should separate this method into track_deletion and track_update as it was done in #72

https://github.com/ARK-Builders/arklib/blob/index/update-one/src/index.rs#L393
https://github.com/ARK-Builders/arklib/blob/index/update-one/src/index.rs#L416

The idea is that track methods serve as performance workaround when end-users know what they are doing. On other hand, update_all should be efficient enough to be called all the time. But from API point of view, it track methods are also simpler sometimes.

OPEN FOR DISCUSSION

Ok so the API will look like:

track_addition(), track_update() and track_deletion() when the user wants to be specific about the action

update_all() otherwise

This looks like a clean API but I have some questions about other public methods

what are the use cases for ResourceIndex::provide()?

we can now remove forget_id() in favor of track_deletion() or they do different tasks?

@tareknaser

ResourceIndex::provide() is the starting point for any user/app. It combines index loading and updating in single step. This could be done by separate calls, but it seems to me that users will always need updated index before they do something. I can imagine rare use-cases where they need more control, so we should expose separate steps, too.

Name forget_id is confusing now. In this PR, it is what is called track_deletion in the other PR. We have private function forget_path that is symmetric to insert_entry, both are called from track methods.

ResourceIndex::provide() is the starting point for any user/app. It combines index loading and updating in single step.

Got it
Thanks for clearing this out.

Name forget_id is confusing now.

The initial idea for this PR was to rewrite the file src/index.rs as it is and then implement the changes to replace #72
Do you prefer to include all the changes in the current PR?

Yeah... It's difficult to keep in mind several PRs because the scope is pretty big.

Let's incorporate useful parts of #72 and unit tests from there into this PR.

src/index.rs

kirillt · 2024-02-19T14:26:45Z

src/index.rs

+                    // canonicalize the path to avoid duplicates
+                    match fs::canonicalize(&path) {
+                        Ok(canonical_path) => {
+                            discovered_files.insert(canonical_path, entry);


After canonicalization, we should trim the common prefix ("root").

That's related to #86. Do you want me to implement it in this PR?

Ah yeah, we can do it in separate PR.

After spending some time modifying this PR based on #72, I've realized that it might not be a good idea to combine all the changes from both PRs into a single PR. #72 has significant API changes for ResourceIndex, and I think we should discuss these changes further in #91 before proceeding.

This PR focuses only on refactoring certain parts of the code to enhance readability. It doesn't alter any logic for arklib so it might be a good idea to keep it simple.
What do you think?

src/index.rs

Signed-off-by: Tarek <tareknaser360@gmail.com>

github-actions · 2024-02-27T22:36:08Z

Benchmark for `1c39010`

Click to view benchmark

Test	Base	PR	%
compute_bytes_large/compute_bytes	572.8±1.88µs	137.4±1.54µs	-76.01%
compute_bytes_medium/compute_bytes	27.5±0.45µs	26.8±0.43µs	-2.55%
compute_bytes_small/compute_bytes	128.6±1.53ns	129.2±3.06ns	+0.47%
index_build/index_build/tests/	157.8±0.82µs	171.0±1.28µs	+8.37%
tests/lena.jpg/compute_bytes	13.4±0.07µs	13.4±0.09µs	0.00%
tests/test.pdf/compute_bytes	339.3±16.35µs	108.6±1.37µs	-67.99%

src/index.rs

kirillt · 2024-03-01T12:58:56Z

src/index.rs

    fn forget_path(
        &mut self,
-        path: &CanonicalPath,
+        path: &Path,
        old_id: ResourceId,
    ) -> Result<IndexUpdate> {


I would return just deleted collection, because added is always empty here.

Maybe name it forget_entry since path and id have same importance?

I would return just deleted collection, because added is always empty here.

That's why I suggest finalizing the API schema in #91 before implementing changes in a separate PR. #72 modifies IndexUpdate, adds its Default implementation, and integrates it into multiple API methods.

Trying to merge #72 changes into this PR while making further modifications along the way could become complicated.

src/index.rs

kirillt · 2024-03-01T13:01:54Z

src/index.rs

+    /// Removes the given resource ID from the index and returns an update
+    /// containing the deleted entries
+    pub fn forget_id(&mut self, old_id: ResourceId) -> Result<IndexUpdate> {
+        log::debug!("Forgetting a single entry in the index");
+
+        // Collect all paths associated with the old ID
+        let mut old_paths = Vec::new();
+        for (path, entry) in &self.path2id {
+            if entry.id == old_id {
+                old_paths.push(path.clone());
+            }
+        }
+
+        // Remove entries from path2id and id2path
+        for path in &old_paths {
+            self.path2id.remove(path);
+        }
+        self.id2path.remove(&old_id);
+
+        let mut deleted = HashSet::new();
+        deleted.insert(old_id);
+
+        Ok(IndexUpdate {
+            added: HashMap::new(),
+            deleted,
+        })
+    }


From index/update-one branch:

/// The caller must ensure that: /// * the index is up-to-date except this single id /// * the resource with this id has been indexed before /// * the resource with this id doesn't exist anymore /// /// Should only be used if reactive updates are not possible. pub fn track_deletion(&mut self, id: ResourceId) -> Result<IndexUpdate> { log::debug!("Tracking a single deletion in the index"); let indexed_path = self.id2path.get(&id); if indexed_path.is_none() { return Err(ArklibError::Path( "The id cannot be found in the index".into(), )); } let indexed_path = indexed_path.unwrap().clone(); self.forget_entry(indexed_path.as_canonical_path(), id)?; Ok(IndexUpdate::default().deleted(id)) }

The changes in forget_id in #72 don't work on their own because ResourceIndex::forget_entry() and IndexUpdate were also modified.

Signed-off-by: Tarek <tareknaser360@gmail.com>

github-actions · 2024-03-02T00:58:38Z

Benchmark for `3ca26c4`

Click to view benchmark

Test	Base	PR	%
compute_bytes_large/compute_bytes	472.9±1.29µs	141.5±1.06µs	-70.08%
compute_bytes_medium/compute_bytes	26.8±0.22µs	27.5±0.20µs	+2.61%
compute_bytes_small/compute_bytes	129.4±2.68ns	129.1±0.89ns	-0.23%
index_build/index_build/tests/	158.8±5.51µs	173.6±1.02µs	+9.32%
tests/lena.jpg/compute_bytes	13.3±0.07µs	13.3±0.17µs	0.00%
tests/test.pdf/compute_bytes	111.1±4.27µs	110.1±0.47µs	-0.90%

Previously, the IndexEntry::modified field was parsed and loaded from miliseconds, which caused a rounding issue when comparing the loaded index with the stored one. This was due to SystemTime relying on Timespec, which includes nanoseconds. To address this issue, this commit updates the parsing and loading of IndexEntry::modified to use nanoseconds instead of miliseconds. Additionally, a new test resource_index_load_store has been added to verify that the stored index is equal to the loaded one. Signed-off-by: Tarek <tareknaser360@gmail.com>

…as broken invariants Signed-off-by: Tarek <tareknaser360@gmail.com>

The function load_raw_metadata() is only called in unit tests and is not used elsewhere in the codebase. Signed-off-by: Tarek <tareknaser360@gmail.com>

github-actions · 2024-03-08T23:08:48Z

Benchmark for `b6ce103`

Click to view benchmark

Test	Base	PR	%
compute_bytes_large/compute_bytes	141.5±1.56µs	141.5±2.31µs	0.00%
compute_bytes_medium/compute_bytes	27.8±0.22µs	27.9±0.57µs	+0.36%
compute_bytes_small/compute_bytes	134.5±1.54ns	134.9±4.61ns	+0.30%
index_build/index_build/tests/	160.4±0.91µs	172.3±0.68µs	+7.42%
tests/lena.jpg/compute_bytes	13.9±0.07µs	13.9±0.08µs	0.00%
tests/test.pdf/compute_bytes	114.7±1.19µs	117.5±0.39µs	+2.44%

kirillt · 2024-03-11T05:28:53Z

About change from milliseconds to nanoseconds in timestamps: we should double-check that we really can have nanoseconds on mobile platforms. I don't remember precisely, but it seems we had an issue with nanoseconds due to implicit rounding to milliseconds. Could be so even on Linux.

We should investigate why this threshold was introduced: 267b844#diff-701633ed548410624cfcae90d38a986adbd60deb5aeedd92355fe2e7702a9c18L34

src/index.rs

src/storage/meta.rs

load_raw_metadata was a method defined to read the metadata file and return its content. This commit removes it and its unit test because it is not used anywhere in the code and seems to be redundant Signed-off-by: Tarek <tareknaser360@gmail.com>

tareknaser · 2024-03-18T23:21:59Z

About change from milliseconds to nanoseconds in timestamps: we should double-check that we really can have nanoseconds on mobile platforms. I don't remember precisely, but it seems we had an issue with nanoseconds due to implicit rounding to milliseconds. Could be so even on Linux.

Before making any changes to arklib on how time is stored, we need to understand the impact of storing time in nanosecond on mobile systems.

If we must keep the millisecond precision, we have two options to consider:

We can keep the current implementation, which stores time in milliseconds. Only problem is that IndexEntry objects won't be equal when writing to ark file and then reading from it.
We can track only milliseconds when retrieving last_modified of each entry. This can be done by converting the SystemTime to milliseconds in scan_entry as follows:

let duration = modified
        .duration_since(UNIX_EPOCH)
        .expect("SystemTime before UNIX EPOCH!")
        .as_millis();
let modified =
	UNIX_EPOCH + std::time::Duration::from_millis(duration as u64);

kirillt · 2024-03-19T17:24:09Z

@tareknaser we could use nanoseconds, but:

Do we really need such a precision? Can several filesystem writes happen with time difference of e.g. 10ns? I hope milliseconds is enough.
We also should care about compatibility between different platforms. I'm not sure that Android provides developers with timestamps better than milliseconds.

I could imagine us needing microseconds to timestamp realistically filesystem events. Would we have the rounding issues in this case?

We can track only milliseconds when retrieving last_modified of each entry. This can be done by converting the SystemTime to milliseconds in scan_entry

This sounds like the way to go.

We can keep the current implementation, which stores time in milliseconds. Only problem is that IndexEntry objects won't be equal when writing to ark file and then reading from it.

Is this sill a problem if we go the way with converting to milliseconds in scan_entry?

tareknaser · 2024-03-20T03:28:24Z

Do we really need such a precision? Can several filesystem writes happen with time difference of e.g. 10ns? I hope milliseconds is enough.

Yes I think milliseconds is enough. Only issue is #87 (comment)

Is this sill a problem if we go the way with converting to milliseconds in scan_entry?

No. that should solve it

The code previously retrieved the last modified time with nanosecond precision. To ensure compatibility across different file systems, particularly on platforms like Android, where nanosecond precision might not be supported or could lead to compatibility issues, the code now adjusts the precision to milliseconds only. Signed-off-by: Tarek <tareknaser360@gmail.com>

tareknaser added 16 commits February 16, 2024 15:26

docs(index): add doc comments for ResourceIndex, IndexEntry and Index…

b8149f5

…Update Signed-off-by: Tarek <tareknaser360@gmail.com>

docs(index): add doc comment for ResourceIndex::build

2bc8a31

Signed-off-by: Tarek <tareknaser360@gmail.com>

refactor(index): use serde instead of custom methods

9c0d7fb

Signed-off-by: Tarek <tareknaser360@gmail.com>

refactor(index): rewrite discover_paths and remove is_hidden

7d72e93

Signed-off-by: Tarek <tareknaser360@gmail.com>

docs(index): add doc comments for scan_entry

0663292

Signed-off-by: Tarek <tareknaser360@gmail.com>

refactor(index): rewrite index_new

99b1316

Signed-off-by: Tarek <tareknaser360@gmail.com>

refactor(index): rewrite forget_id

56eed20

Signed-off-by: Tarek <tareknaser360@gmail.com>

refactor(index): rewrite forget_path

b9d30e2

Signed-off-by: Tarek <tareknaser360@gmail.com>

tests(index): use Tempdir for resource_index_load_store

ca0c7ac

Signed-off-by: Tarek <tareknaser360@gmail.com>

tests(index): use TempDir::new for tests

17e38ad

Signed-off-by: Tarek <tareknaser360@gmail.com>

test(index): add tests to ensure canonical paths

2377aa8

Signed-off-by: Tarek <tareknaser360@gmail.com>

refactor(index): rewrite provide

e6eddda

Signed-off-by: Tarek <tareknaser360@gmail.com>

refactor(index): rewrite update_one

47b6685

Signed-off-by: Tarek <tareknaser360@gmail.com>

refactor(index): rewrite update_all

218c651

Signed-off-by: Tarek <tareknaser360@gmail.com>

fix(index): fix clippy warning

f98eeec

Signed-off-by: Tarek <tareknaser360@gmail.com>

fix: update platform condition for cross-compatibility

1ca440f

Signed-off-by: Tarek <tareknaser360@gmail.com>