Skip to content

Commit 3d1565a

Browse files
committed
Move "loose object header" ser/de to git-object
The loose header manipulation currently lives in git-pack but it depends on nothing *of* git-pack. Move it to git-object (`::encode` and `::decode`), as well as update the way it's generated: a header should be no more than 28 bytes: 6 bytes for the kind, 20 for the size (log10(2**64) = 19.3), 1 for the space and 1 for the NUL, so return a `SmallVec<[u8, 28]>` instead of writing it out directly (this should slightly reduce the amount of IO when writing out the header to an unbuffered stream), this also avoids the need for `header_buf` in the traversal state. In order to generate the header without having to write out the entire content, add `WriteTo::size()`, the size is (relatively) easy to compute for all kinds. Ultimately this should avoid having to buffer or move around the object data before generating the header (though that's a bit TBD, I don't remember making those changes in git-pack). This also requires adding size computations to `git_actor::Signature` and `git_actor::Time`. For the latter the result should be reasonably efficient[^bithack]. If the time part gets moved to 64b, this should probably be updated to use a lookup table[^lookup] or even better `u64::log10` as hopefully it'll be stable by then[^70887]. Also add direct shortcuts to `WriteTo` (to generate a loose header) and `ObjectRef` (to directly parse a loose object). Others: * Lift conversion and header encoding at the start of `Sink::write_stream` as they don't seem to depend on the hash kind. * Alter `loose::Store` to inlin `write_header` but extract the creation of the output stream, this reveals that the dispatch on hash kind wasn't useful anymore as only the creation of the stream needs it (aside from `finalize_object`). One concern I found when making this change is things are kinda broken wrt 32/64b: the existing code is a bit mixed between usize and u64, but tends to store data in buffers which work by usize anyway. But of course using usize means broken / corrupted files > 4GB on 32b platforms, which is not great either. Then again git itself has the same issue except even worse: it uses `unsigned long` internally, which is not only 32b on 32b platforms (ILP32) but also on 64 bits windows (LLP64)... Final note: the `cargo fmt` for commits and tags is *really bad* as it puts one sub-expression per line, a better alternative might be to provide size-only and `[inline]` versions of the encoding helpers? [^bithack]: it's considered a good solution when the input is uniformly distributed (http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10Obvious), and for us the input is biased toward efficiency: timestamp 1e9 is September 2001, and timestamp 1e8 is March 1973, so the vast majority of legit commits will take the first branch, and only joke commits will fail the second really. [^lookup]: https://commaok.xyz/post/lookup_tables/ however these solve the problem for u32 [^70887]: rust-lang/rust#70887 and efficiency improvements could be contributed to the stdlib directly.
1 parent bf63529 commit 3d1565a

File tree

24 files changed

+488
-254
lines changed

24 files changed

+488
-254
lines changed

Cargo.lock

Lines changed: 2 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

git-actor/src/signature/mod.rs

Lines changed: 11 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -97,18 +97,25 @@ mod write {
9797
pub fn write_to(&self, out: impl io::Write) -> io::Result<()> {
9898
self.to_ref().write_to(out)
9999
}
100+
/// Computes the number of bytes necessary to serialize this signature
101+
pub fn size(&self) -> usize {
102+
self.to_ref().size()
103+
}
100104
}
101105

102106
impl<'a> SignatureRef<'a> {
103107
/// Serialize this instance to `out` in the git serialization format for actors.
104108
pub fn write_to(&self, mut out: impl io::Write) -> io::Result<()> {
105109
out.write_all(validated_token(self.name)?)?;
106110
out.write_all(SPACE)?;
107-
out.write_all(&b"<"[..])?;
111+
out.write_all(b"<")?;
108112
out.write_all(validated_token(self.email)?)?;
109-
out.write_all(&b"> "[..])?;
110-
self.time.write_to(out)?;
111-
Ok(())
113+
out.write_all(b"> ")?;
114+
self.time.write_to(out)
115+
}
116+
/// Computes the number of bytes necessary to serialize this signature
117+
pub fn size(&self) -> usize {
118+
self.name.len() + self.email.len() + self.time.size() + 4
112119
}
113120
}
114121

git-actor/src/time.rs

Lines changed: 31 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -17,10 +17,10 @@ impl Time {
1717
pub fn write_to(&self, mut out: impl io::Write) -> io::Result<()> {
1818
itoa::write(&mut out, self.time)?;
1919
out.write_all(SPACE)?;
20-
out.write_all(&[match self.sign {
21-
Sign::Plus => b'+',
22-
Sign::Minus => b'-',
23-
}])?;
20+
out.write_all(match self.sign {
21+
Sign::Plus => b"+",
22+
Sign::Minus => b"-",
23+
})?;
2424

2525
const ZERO: &[u8; 1] = b"0";
2626

@@ -40,4 +40,31 @@ impl Time {
4040
}
4141
itoa::write(&mut out, minutes).map(|_| ())
4242
}
43+
/// Computes the number of bytes necessary to render this time
44+
pub fn size(&self) -> usize {
45+
// TODO: this is not year 2038 safe...
46+
(if self.time >= 1_000_000_000 {
47+
10
48+
} else if self.time >= 100_000_000 {
49+
9
50+
} else if self.time >= 10_000_000 {
51+
8
52+
} else if self.time >= 1_000_000 {
53+
7
54+
} else if self.time >= 100_000 {
55+
6
56+
} else if self.time >= 10_000 {
57+
5
58+
} else if self.time >= 1_000 {
59+
4
60+
} else if self.time >= 100 {
61+
3
62+
} else if self.time >= 10 {
63+
2
64+
} else {
65+
1
66+
}) + 2
67+
+ 2
68+
+ 2
69+
}
4370
}

git-object/Cargo.toml

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -23,11 +23,13 @@ git-hash = { version ="^0.8.0", path = "../git-hash" }
2323
git-validate = { version ="^0.5.3", path = "../git-validate" }
2424
git-actor = { version ="^0.6.0", path = "../git-actor" }
2525

26+
btoi = "0.4.2"
27+
itoa = "0.4.6"
2628
quick-error = "2.0.0"
2729
hex = "0.4.2"
2830
bstr = { version = "0.2.13", default-features = false, features = ["std", "unicode"] }
2931
nom = { version = "7", default-features = false, features = ["std"]}
30-
smallvec = "1.4.0"
32+
smallvec = { version = "1.4.0", features = ["write"] }
3133
serde = { version = "1.0.114", optional = true, default-features = false, features = ["derive"]}
3234

3335
[dev-dependencies]

git-object/src/blob.rs

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,10 @@ impl<'a> crate::WriteTo for BlobRef<'a> {
88
out.write_all(self.data)
99
}
1010

11+
fn size(&self) -> usize {
12+
self.data.len()
13+
}
14+
1115
fn kind(&self) -> Kind {
1216
Kind::Blob
1317
}
@@ -19,6 +23,10 @@ impl crate::WriteTo for Blob {
1923
self.to_ref().write_to(out)
2024
}
2125

26+
fn size(&self) -> usize {
27+
self.to_ref().size()
28+
}
29+
2230
fn kind(&self) -> Kind {
2331
Kind::Blob
2432
}

git-object/src/commit/write.rs

Lines changed: 65 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,4 @@
1+
use bstr::ByteSlice;
12
use std::io;
23

34
use crate::{encode, encode::NL, Commit, CommitRef, Kind};
@@ -21,6 +22,38 @@ impl crate::WriteTo for Commit {
2122
out.write_all(&self.message)
2223
}
2324

25+
fn size(&self) -> usize {
26+
let hashsize = self.tree.kind().len_in_hex();
27+
b"tree".len()
28+
+ 1
29+
+ hashsize
30+
+ 1
31+
+ self.parents.iter().count() * (b"parent".len() + 1 + hashsize + 1)
32+
+ b"author".len()
33+
+ 1
34+
+ self.author.size()
35+
+ 1
36+
+ b"committer".len()
37+
+ 1
38+
+ self.committer.size()
39+
+ 1
40+
+ self
41+
.encoding
42+
.as_ref()
43+
.map(|e| b"encoding".len() + 1 + e.len() + 1)
44+
.unwrap_or(0)
45+
+ self
46+
.extra_headers
47+
.iter()
48+
.map(|(name, value)| {
49+
// each header *value* is preceded by a space and followed by a newline
50+
name.len() + value.split_str("\n").map(|s| s.len() + 2).sum::<usize>()
51+
})
52+
.sum::<usize>()
53+
+ 1
54+
+ self.message.len()
55+
}
56+
2457
fn kind(&self) -> Kind {
2558
Kind::Commit
2659
}
@@ -45,6 +78,38 @@ impl<'a> crate::WriteTo for CommitRef<'a> {
4578
out.write_all(self.message)
4679
}
4780

81+
fn size(&self) -> usize {
82+
let hashsize = self.tree().kind().len_in_hex();
83+
b"tree".len()
84+
+ 1
85+
+ hashsize
86+
+ 1
87+
+ self.parents.iter().count() * (b"parent".len() + 1 + hashsize + 1)
88+
+ b"author".len()
89+
+ 1
90+
+ self.author.size()
91+
+ 1
92+
+ b"committer".len()
93+
+ 1
94+
+ self.committer.size()
95+
+ 1
96+
+ self
97+
.encoding
98+
.as_ref()
99+
.map(|e| b"encoding".len() + 1 + e.len() + 1)
100+
.unwrap_or(0)
101+
+ self
102+
.extra_headers
103+
.iter()
104+
.map(|(name, value)| {
105+
// each header *value* is preceded by a space and followed by a newline
106+
name.len() + value.split_str("\n").map(|s| s.len() + 2).sum::<usize>()
107+
})
108+
.sum::<usize>()
109+
+ 1
110+
+ self.message.len()
111+
}
112+
48113
fn kind(&self) -> Kind {
49114
Kind::Commit
50115
}

git-object/src/encode.rs

Lines changed: 24 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
1-
use std::io;
1+
//! Encoding utilities
2+
use std::io::{self, Write};
23

34
use bstr::{BString, ByteSlice};
45
use quick_error::quick_error;
@@ -15,13 +16,28 @@ quick_error! {
1516
}
1617
}
1718

19+
macro_rules! check {
20+
($e: expr) => {
21+
$e.expect("Writing to a Vec should never fail.")
22+
};
23+
}
24+
/// Generates a loose header buffer
25+
pub fn loose_header(kind: crate::Kind, size: usize) -> smallvec::SmallVec<[u8; 28]> {
26+
let mut v = smallvec::SmallVec::new();
27+
check!(v.write_all(kind.as_bytes()));
28+
check!(v.write_all(SPACE));
29+
check!(itoa::write(&mut v, size));
30+
check!(v.write_all(b"\0"));
31+
v
32+
}
33+
1834
impl From<Error> for io::Error {
1935
fn from(other: Error) -> io::Error {
2036
io::Error::new(io::ErrorKind::Other, other)
2137
}
2238
}
2339

24-
pub fn header_field_multi_line(name: &[u8], value: &[u8], mut out: impl io::Write) -> io::Result<()> {
40+
pub(crate) fn header_field_multi_line(name: &[u8], value: &[u8], mut out: impl io::Write) -> io::Result<()> {
2541
let mut lines = value.as_bstr().split_str(b"\n");
2642
trusted_header_field(name, lines.next().ok_or(Error::EmptyValue)?, &mut out)?;
2743
for line in lines {
@@ -32,14 +48,14 @@ pub fn header_field_multi_line(name: &[u8], value: &[u8], mut out: impl io::Writ
3248
Ok(())
3349
}
3450

35-
pub fn trusted_header_field(name: &[u8], value: &[u8], mut out: impl io::Write) -> io::Result<()> {
51+
pub(crate) fn trusted_header_field(name: &[u8], value: &[u8], mut out: impl io::Write) -> io::Result<()> {
3652
out.write_all(name)?;
3753
out.write_all(SPACE)?;
3854
out.write_all(value)?;
3955
out.write_all(NL)
4056
}
4157

42-
pub fn trusted_header_signature(
58+
pub(crate) fn trusted_header_signature(
4359
name: &[u8],
4460
value: &git_actor::SignatureRef<'_>,
4561
mut out: impl io::Write,
@@ -50,14 +66,14 @@ pub fn trusted_header_signature(
5066
out.write_all(NL)
5167
}
5268

53-
pub fn trusted_header_id(name: &[u8], value: &git_hash::ObjectId, mut out: impl io::Write) -> io::Result<()> {
69+
pub(crate) fn trusted_header_id(name: &[u8], value: &git_hash::ObjectId, mut out: impl io::Write) -> io::Result<()> {
5470
out.write_all(name)?;
55-
out.write_all(&SPACE[..])?;
71+
out.write_all(SPACE)?;
5672
value.write_hex_to(&mut out)?;
57-
out.write_all(&NL[..])
73+
out.write_all(NL)
5874
}
5975

60-
pub fn header_field(name: &[u8], value: &[u8], out: impl io::Write) -> io::Result<()> {
76+
pub(crate) fn header_field(name: &[u8], value: &[u8], out: impl io::Write) -> io::Result<()> {
6177
if value.is_empty() {
6278
return Err(Error::EmptyValue.into());
6379
}

0 commit comments

Comments
 (0)