From 32be5374f485cfb970fff44d559d17ea22fc1acb Mon Sep 17 00:00:00 2001 From: Arnaud Bouchez Date: Sat, 23 Mar 2024 11:23:12 +0100 Subject: [PATCH] using a perfect hash function for a huge performance boost --- entries/abouchez/README.md | 44 +- entries/abouchez/src/brcmormot.lpr | 348 +++++---------- entries/abouchez/src/brcmormotold.lpi | 137 ++++++ entries/abouchez/src/brcmormotold.lpr | 597 ++++++++++++++++++++++++++ 4 files changed, 868 insertions(+), 258 deletions(-) create mode 100644 entries/abouchez/src/brcmormotold.lpi create mode 100644 entries/abouchez/src/brcmormotold.lpr diff --git a/entries/abouchez/README.md b/entries/abouchez/README.md index f3f106f..4136a8a 100644 --- a/entries/abouchez/README.md +++ b/entries/abouchez/README.md @@ -25,14 +25,13 @@ Here are the main ideas behind this implementation proposal: - Process file in parallel using several threads (configurable, with `-t=16` by default); - Fed each thread from 64MB chunks of input (because thread scheduling is unfair, it is inefficient to pre-divide the size of the whole input file into the number of threads); - Each thread manages its own data, so there is no lock until the thread is finished and data is consolidated; -- Each station information (name and values) is packed into a record of exactly 64 bytes, with no external pointer/string, to match the CPU L1 cache size for efficiency; -- Use a dedicated hash table for the name lookup, with direct crc32c SSE4.2 hash - when `TDynArrayHashed` is involved, it requires a transient name copy on the stack, which is noticeably slower (see last paragraph of this document); +- Each station information (name and values) is packed into a record of exactly 16 bytes, with no external pointer/string, to match the CPU L1 cache size (64 bytes) for efficiency; +- Use a dedicated hash table for the name lookup, with crc32c perfect hash function - no name comparison nor storage is needed; - Store values as 16-bit or 32-bit integers (i.e. temperature multiplied by 10); - Parse temperatures with a dedicated code (expects single decimal input values); - No memory allocation (e.g. no transient `string` or `TBytes`) nor any syscall is done during the parsing process to reduce contention and ensure the process is only CPU-bound and RAM-bound (we checked this with `strace` on Linux); - Pascal code was tuned to generate the best possible asm output on FPC x86_64 (which is our target); -- Some dedicated x86_64 asm has been written to replace *mORMot* `crc32c` and `MemCmp` general-purpose functions and gain a last few percents (nice to have); -- Can optionally output timing statistics and hash value on the console to debug and refine settings (with the `-v` command line switch); +- Can optionally output timing statistics and resultset hash value on the console to debug and refine settings (with the `-v` command line switch); - Can optionally set each thread affinity to a single core (with the `-a` command line switch). ## Why L1 Cache Matters @@ -41,9 +40,11 @@ The "64 bytes cache line" trick is quite unique among all implementations of the The L1 cache is well known in the performance hacking litterature to be the main bottleneck for any efficient in-memory process. If you want things to go fast, you should flatter your CPU L1 cache. -We are very lucky the station names are just big enough to fill no more than 64 bytes, with min/max values reduced as 16-bit smallint - resulting in temperature range of -3276.7..+3276.8 which seems fair on our planet according to the IPCC. ;) +Min/max values will be reduced as 16-bit smallint - resulting in temperature range of -3276.7..+3276.8 which seems fair on our planet according to the IPCC. ;) -In our first attempt, the `Station[]` array was in fact not aligned to 64 bytes itself. In fact, the RTL `SetLength()` does not align its data to the item size. So the pointer was aligned by 32 bytes, and any memory access would require filling two L1 cache lines. So we added some manual alignement of the data structure, and got 5% better performance. +In our first attempt, we stored the name into the `Station[]` array, so that each entry is 64 bytes long exactly. But since `crc32c` is a perfect hash function for our dataset, we could just store the 32-bit hash instead, for higher performance. On Intel/AMD/AARCH64 CPUs, we use hardware opcodes for this crc32c computation. + +See https://en.wikipedia.org/wiki/Perfect_hash_function for reference. ## Usage @@ -75,25 +76,25 @@ On my PC, it takes less than 5 seconds to process the 16GB file with 8/10 thread Let's compare `abouchez` with a solid multi-threaded entry using file buffer reads and no memory map (like `sbalazs`), using the `time` command on Linux: ``` -ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./abouchez measurements.txt -t=10 >resmrel5.txt +ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./abouchez measurements.txt -t=20 >resmormot.txt -real 0m4,216s -user 0m38,789s -sys 0m0,632s +real 0m2,350s +user 0m40,165s +sys 0m0,888s -ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./sbalazs measurements.txt 20 >ressb6.txt +ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./sbalazs measurements.txt 20 >ressb.txt real 0m25,330s user 6m44,853s sys 0m31,167s ``` -We used 20 threads for `sbalazs`, and 10 threads for `abouchez` because it was giving the best results for each program on our PC. +We used 20 threads for both executable, because it was giving the best results for each program on our PC. Apart from the obvious global "wall" time reduction (`real` numbers), the raw parsing and data gathering in the threads match the number of threads and the running time (`user` numbers), and no syscall is involved by `abouchez` thanks to the memory mapping of the whole file (`sys` numbers, which contain only memory page faults). The `memmap()` feature makes the initial/cold `abouchez` call slower, because it needs to cache all measurements data from file into RAM (I have 32GB of RAM, so the whole data file will remain in memory, as on the benchmark hardware): ``` -ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./abouchez measurements.txt -t=10 >resmrel4.txt +ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./abouchez measurements.txt -t=20 >resmormot.txt real 0m6,042s user 0m53,699s @@ -101,16 +102,17 @@ sys 0m2,941s ``` This is the expected behavior, and will be fine with the benchmark challenge, which ignores the min and max values during its 10 times run. So the first run will just warm up the file into memory. -On my Intel 13h gen processor with E-cores and P-cores, forcing thread to core affinity does not help: +On my Intel 13h gen processor with E-cores and P-cores, forcing thread to core affinity does not make any huge difference (we are within the error margin): ``` ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./abouchez measurements.txt -t=10 -v -Processing measurements.txt with 10 threads and affinity=false +Processing measurements.txt with 20 threads and affinity=false result hash=8A6B746A,, result length=1139418, stations count=41343, valid utf8=1 -done in 4.25s 3.6 GB/s +done in 2.36s 6.6 GB/s + ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./abouchez measurements.txt -t=10 -v -a -Processing measurements.txt with 10 threads and affinity=true +Processing measurements.txt with 20 threads and affinity=true result hash=8A6B746A, result length=1139418, stations count=41343, valid utf8=1 -done in 4.42s 3.5 GB/s +done in 2.44s 6.4 GB/s ``` Affinity may help on Ryzen 9, because its Zen 3 architecture is made of identical 16 cores with 32 threads, not this Intel E/P cores mess. But we will validate that on real hardware - no premature guess! @@ -145,6 +147,8 @@ This `-t=1` run is for fun: it will run the process in a single thread. It will Our proposal has been run on the benchmark hardware, using the full automation. +TO BE COMPLETED - NUMBERS BELOW ARE FOR THE OLD VERSION: + With 30 threads (on a busy system): ``` -- SSD -- @@ -191,7 +195,9 @@ It may be as expected: - The Ryzen CPU has 16 cores with 32 threads, and it makes sense that using only the "real" cores with CPU+RAM intensive work is enough to saturate them; - It is a known fact from experiment that forcing thread affinity is not a good idea, and it is always much better to let any modern Linux Operating System schedule the threads to the CPU cores, because it has a much better knowledge of the actual system load and status. Even on a "fair" CPU architecture like AMD Zen. -## Ending Note +## Old Version + +TO BE DETAILED (WITH NUMBERS?) You could disable our tuned asm in the project source code, and loose about 10% by using general purpose *mORMot* `crc32c()` and `CompareMem()` functions, which already runs SSE2/SSE4.2 tune assembly. diff --git a/entries/abouchez/src/brcmormot.lpr b/entries/abouchez/src/brcmormot.lpr index 7745381..35a1705 100644 --- a/entries/abouchez/src/brcmormot.lpr +++ b/entries/abouchez/src/brcmormot.lpr @@ -1,11 +1,8 @@ /// MIT code (c) Arnaud Bouchez, using the mORMot 2 framework program brcmormot; -{$define CUSTOMHASH} -// a dedicated hash table is 40% faster than mORMot generic TDynArrayHashed - -{$define CUSTOMASM} -// about 10% faster with some dedicated asm instead of mORMot code on x86_64 +{.$define NOPERFECTHASH} +// you can define this conditional to force name comparison (2.5x slower) {$I mormot.defines.inc} @@ -26,33 +23,22 @@ mormot.core.data; type - // a weather station info, using a whole CPU L1 cache line (64 bytes) + // a weather station info, using 1/4rd of a CPU L1 cache line (64/4=16 bytes) TBrcStation = packed record - NameLen: byte; // name as first "shortstring" field for TDynArray - NameText: array[1 .. 64 - 1 - 2 * 4 - 2 * 2] of byte; + NameHash: cardinal; // crc32c perfect hash of the name Sum, Count: integer; // we ensured no overflow occurs with 32-bit range Min, Max: SmallInt; // 16-bit (-32767..+32768) temperatures * 10 end; PBrcStation = ^TBrcStation; - TBrcStationDynArray = array of TBrcStation; - - TBrcStations = array[word] of TBrcStation; - PBrcStations = ^TBrcStations; TBrcList = record public - Count: integer; - {$ifdef CUSTOMHASH} - Station: PBrcStations; // perfectly aligned to 64 bytes from StationMem[] - StationHash: array of word; // store 0 if void, or Station[] index + 1 - StationMem: TBrcStationDynArray; + StationHash: array of word; // store 0 if void, or Station[] index + 1 + Station: array of TBrcStation; + StationName: array of PUtf8Char; // directly point to input memmap file + Count: PtrInt; + procedure Init(max: integer); function Search(name: pointer; namelen: PtrInt): PBrcStation; - {$else} - Station: TBrcStationDynArray; - Stations: TDynArrayHashed; - function Search(name: PByteArray): PBrcStation; - {$endif CUSTOMHASH} - procedure Init(max: integer; align: boolean); end; TBrcMain = class @@ -86,168 +72,47 @@ TBrcThread = class(TThread) { TBrcList } -{$ifdef CUSTOMHASH} - -{$ifndef OSLINUXX64} - {$undef CUSTOMASM} // asm below is for FPC + Linux x86_64 only -{$endif OSLINUXX64} - const HASHSIZE = 1 shl 18; // slightly oversized to avoid most collisions -procedure TBrcList.Init(max: integer; align: boolean); +procedure TBrcList.Init(max: integer); begin assert(max <= high(StationHash[0])); - SetLength(StationMem, max); // RTL won't align by 64 bytes - Station := pointer(StationMem); - if align then - while {%H-}PtrUInt(Station) and 63 <> 0 do // manual alignment - inc(PByte(Station)); + SetLength(Station, max); SetLength(StationHash, HASHSIZE); + SetLength(StationName, max); end; -{$ifdef CUSTOMASM} - -function dohash(buf: PAnsiChar; len: cardinal): PtrUInt; nostackframe; assembler; -asm - xor eax, eax // it is enough to hash up to 15 bytes for our purpose - mov ecx, len - cmp len, 8 - jb @less8 - crc32 rax, qword ptr [buf] - add buf, 8 -@less8: test cl, 4 - jz @less4 - crc32 eax, dword ptr [buf] - add buf, 4 -@less4: test cl, 2 - jz @less2 - crc32 eax, word ptr [buf] - add buf, 2 -@less2: test cl, 1 - jz @z - crc32 eax, byte ptr [buf] -@z: -end; - -function CompareMem(a, b: pointer; len: PtrInt): boolean; nostackframe; assembler; -asm - add a, len - add b, len - neg len - cmp len, -8 - ja @less8 - align 8 -@by8: mov rax, qword ptr [a + len] - cmp rax, qword ptr [b + len] - jne @diff - add len, 8 - jz @eq - cmp len, -8 - jna @by8 -@less8: cmp len, -4 - ja @less4 - mov eax, dword ptr [a + len] - cmp eax, dword ptr [b + len] - jne @diff - add len, 4 - jz @eq -@less4: cmp len, -2 - ja @less2 - movzx eax, word ptr [a + len] - movzx ecx, word ptr [b + len] - cmp eax, ecx - jne @diff - add len, 2 -@less2: test len, len - jz @eq - mov al, byte ptr [a + len] - cmp al, byte ptr [b + len] - je @eq -@diff: xor eax, eax - ret -@eq: mov eax, 1 // = found (most common case of no hash collision) -end; - -{$else} - -function dohash(buf: PAnsiChar; len: cardinal): PtrUInt; inline; -begin - if len > 16 then - len := 16; // it is enough to hash up to 16 bytes for our purpose - result := DefaultHasher(0, buf, len); // fast mORMot asm hasher (crc32c) -end; - -{$endif CUSTOMASM} - function TBrcList.Search(name: pointer; namelen: PtrInt): PBrcStation; var + h32: cardinal; h, x: PtrUInt; begin - assert(namelen <= SizeOf(TBrcStation.NameText)); - h := dohash(name, namelen); + h32 := crc32c(0, name, namelen); + h := h32; repeat h := h and (HASHSIZE - 1); x := StationHash[h]; if x = 0 then break; // void slot result := @Station[x - 1]; - if (result^.NameLen = namelen) and - CompareMem(@result^.NameText, name, namelen) then - exit; // found - inc(h); // hash collision: try next slot + if result^.NameHash = h32 then + {$ifdef NOPERFECTHASH} + if MemCmp(pointer(StationName[x - 1]), name, namelen + 1) = 0 then + {$endif NOPERFECTHASH} + exit; // found this perfect hash = found this name + inc(h); // hash modulo collision: linear probing until false; + assert(Count < length(Station)); + StationName[Count] := name; result := @Station[Count]; inc(Count); StationHash[h] := Count; - result^.NameLen := namelen; - MoveFast(name^, result^.NameText, namelen); - result^.Min := high(result^.Min); - result^.Max := low(result^.Max); -end; - -{$else} - -function StationHash(const Item; Hasher: THasher): cardinal; -var - s: TBrcStation absolute Item; // s.Name should be the first field -begin - result := Hasher(0, @s.NameText, s.NameLen); -end; - -function StationComp(const A, B): integer; -var - sa: TBrcStation absolute A; - sb: TBrcStation absolute B; -begin - result := MemCmp(@sa.NameLen, @sb.NameLen, sa.NameLen + 1); -end; - -procedure TBrcList.Init(max: integer; align: boolean); -begin - // align is just ignored, because TDynArray requires natural alignment - Stations.Init( - TypeInfo(TBrcStationDynArray), Station, @StationHash, @StationComp, nil, @Count); - Stations.Capacity := max; -end; - -function TBrcList.Search(name: PByteArray): PBrcStation; -var - i: PtrUInt; - added: boolean; -begin - assert(name^[0] < SizeOf(TBrcStation.NameText)); - i := Stations.FindHashedForAdding(name^, added); - result := @Station[i]; // in two steps (Station[] may be reallocated) - if not added then - exit; - MoveFast(name^, result^.NameLen, name^[0] + 1); + result^.NameHash := h32; result^.Min := high(result^.Min); result^.Max := low(result^.Max); end; -{$endif CUSTOMHASH} - { TBrcThread } @@ -255,7 +120,7 @@ constructor TBrcThread.Create(owner: TBrcMain); begin fOwner := owner; FreeOnTerminate := true; - fList.Init(fOwner.fMax, {align=}true); + fList.Init(fOwner.fMax); InterlockedIncrement(fOwner.fRunning); inherited Create({suspended=}false); end; @@ -266,10 +131,6 @@ procedure TBrcThread.Execute; v, m: integer; l, neg: PtrInt; s: PBrcStation; - {$ifndef CUSTOMHASH} - c: byte; - name: array[0..63] of byte; // efficient map of a temp shortstring on FPC - {$endif CUSTOMHASH} begin while fOwner.GetChunk(start, stop) do begin @@ -278,20 +139,9 @@ procedure TBrcThread.Execute; repeat // parse the name; l := 2; - {$ifdef CUSTOMHASH} start := p; while p[l] <> ord(';') do inc(l); // small local loop is faster than SSE2 ByteScanIndex() - {$else} - repeat - c := p[l]; - if c = ord(';') then - break; - inc(l); - name[l] := c; // fill name[] as a shortstring - until false; - name[0] := l; - {$endif CUSTOMHASH} p := @p[l + 1]; // + 1 to ignore ; // parse the temperature (as -12.3 -3.4 5.6 78.9 patterns) into value * 10 if p[0] = ord('-') then @@ -313,11 +163,7 @@ procedure TBrcThread.Execute; p := @p[5]; end; // store the value - {$ifdef CUSTOMHASH} s := fList.Search(start, l); - {$else} - s := fList.Search(@name); - {$endif CUSTOMHASH} inc(s^.Sum, v); inc(s^.Count); m := s^.Min; @@ -347,7 +193,7 @@ constructor TBrcMain.Create(const fn: TFileName; threads, max: integer; if not fMem.Map(fn) then raise ESynException.CreateUtf8('Impossible to find %', [fn]); fMax := max; - fList.Init(fMax, {align=}false); // not aligned for TDynArray.Sort to work + fList.Init(fMax); fCurrentChunk := pointer(fMem.Buffer); fCurrentRemain := fMem.Size; core := 0; @@ -404,29 +250,40 @@ function TBrcMain.GetChunk(out start, stop: PByteArray): boolean; fSafe.UnLock; end; +function NameLen(p: PUtf8Char): PtrInt; inline; +begin + result := 2; + while p[result] <> ';' do + inc(result); +end; + procedure TBrcMain.Aggregate(const another: TBrcList); var - s, d: PBrcStation; n: integer; + s, d: PBrcStation; + p: PPUtf8Char; begin fSafe.Lock; // several TBrcThread may finish at the same time - n := another.Count; - s := pointer(another.Station); - repeat - {$ifdef CUSTOMHASH} - d := fList.Search(@s^.NameText, s^.NameLen); - {$else} - d := fList.Search(@s^.NameLen); - {$endif CUSTOMHASH} - inc(d^.Count, s^.Count); - inc(d^.Sum, s^.Sum); - if s^.Max > d^.Max then - d^.Max := s^.Max; - if s^.Min < d^.Min then - d^.Min := s^.Min; - inc(s); - dec(n); - until n = 0; + if fList.Count = 0 then + fList := another + else + begin + n := another.Count; + s := pointer(another.Station); + p := pointer(another.StationName); + repeat + d := fList.Search(p^, NameLen(p^)); + inc(d^.Count, s^.Count); + inc(d^.Sum, s^.Sum); + if s^.Max > d^.Max then + d^.Max := s^.Max; + if s^.Min < d^.Min then + d^.Min := s^.Min; + inc(s); + inc(p); + dec(n); + until n = 0; + end; fSafe.UnLock; if InterlockedDecrement(fRunning) = 0 then fEvent.SetEvent; // all threads finished: release main console thread @@ -453,21 +310,6 @@ procedure AddTemp(w: TTextWriter; sep: AnsiChar; val: PtrInt); w.Add(AnsiChar(val - d10 * 10 + ord('0'))); end; -function ByStationName(const A, B): integer; -var - sa: TBrcStation absolute A; - sb: TBrcStation absolute B; - la, lb: PtrInt; -begin - la := sa.NameLen; - lb := sb.NameLen; - if la < lb then - la := lb; - result := MemCmp(@sa.NameText, @sb.NameText, la); - if result = 0 then - result := sa.NameLen - sb.NameLen; -end; - function Average(sum, count: PtrInt): PtrInt; // sum and result are temperature * 10 (one fixed decimal) var @@ -485,41 +327,68 @@ function Average(sum, count: PtrInt): PtrInt; //ConsoleWrite([sum / (count * 10), ' ', result / 10]); end; +function ByStationName(const A, B): integer; +var + pa, pb: PByte; +begin + result := 0; + pa := pointer(A); + pb := pointer(B); + if pa = pb then + exit; + repeat + if pa^ <> pb^ then + break + else if pa^ = ord(';') then + exit; // Str1 = Str2 + inc(pa); + inc(pb); + until false; + if pa^ = ord(';') then + result := -1 + else if pb^ = ord(';') then + result := 1 + else + result := pa^ - pb^; +end; + function TBrcMain.SortedText: RawUtf8; var - n: integer; + c: PtrInt; + n: PCardinal; s: PBrcStation; + p: PUtf8Char; st: TRawByteStringStream; w: TTextWriter; - tmp: TTextWriterStackBuffer; + ndx: TSynTempBuffer; begin - {$ifdef CUSTOMHASH} - DynArrayFakeLength(fList.Station, fList.Count); - DynArray(TypeInfo(TBrcStationDynArray), fList.Station).Sort(ByStationName); - {$else} - fList.Stations.Sort(ByStationName); - {$endif CUSTOMHASH} + // compute the sorted-by-name indexes of all stations + c := fList.Count; + assert(c <> 0); + DynArraySortIndexed( + pointer(fList.StationName), SizeOf(PUtf8Char), c, ndx, ByStationName); + // generate output FastSetString(result, nil, 1200000); // pre-allocate result st := TRawByteStringStream.Create(result); try - w := TTextWriter.Create(st, @tmp, SizeOf(tmp)); + w := TTextWriter.Create(st, @ndx.tmp, SizeOf(ndx.tmp)); try w.Add('{'); - s := pointer(fList.Station); - n := fList.Count; - if n > 0 then - repeat - assert(s^.Count <> 0); - w.AddNoJsonEscape(@s^.NameText, s^.NameLen); - AddTemp(w, '=', s^.Min); - AddTemp(w, '/', Average(s^.Sum, s^.Count)); - AddTemp(w, '/', s^.Max); - dec(n); - if n = 0 then - break; - w.Add(',', ' '); - inc(s); - until false; + n := ndx.buf; + repeat + s := @fList.Station[n^]; + assert(s^.Count <> 0); + p := fList.StationName[n^]; + w.AddNoJsonEscape(p, NameLen(p)); + AddTemp(w, '=', s^.Min); + AddTemp(w, '/', Average(s^.Sum, s^.Count)); + AddTemp(w, '/', s^.Max); + dec(c); + if c = 0 then + break; + w.Add(',', ' '); + inc(n); + until false; w.Add('}'); w.FlushFinal; FakeLength(result, w.WrittenBytes); @@ -528,6 +397,7 @@ function TBrcMain.SortedText: RawUtf8; end; finally st.Free; + ndx.Done; end; end; @@ -539,11 +409,11 @@ function TBrcMain.SortedText: RawUtf8; res: RawUtf8; start, stop: Int64; begin - assert(SizeOf(TBrcStation) = 64); // 64 bytes = CPU L1 cache line size + assert(SizeOf(TBrcStation) = 64 div 4); // 64 = CPU L1 cache line size // read command line parameters Executable.Command.ExeDescription := 'The mORMot One Billion Row Challenge'; if Executable.Command.Arg(0, 'the data source #filename') then - Utf8ToFileName(Executable.Command.Args[0], fn); + Utf8ToFileName(Executable.Command.Args[0], fn{%H-}); verbose := Executable.Command.Option( ['v', 'verbose'], 'generate verbose output with timing'); affinity := Executable.Command.Option( diff --git a/entries/abouchez/src/brcmormotold.lpi b/entries/abouchez/src/brcmormotold.lpi new file mode 100644 index 0000000..6bc0b59 --- /dev/null +++ b/entries/abouchez/src/brcmormotold.lpi @@ -0,0 +1,137 @@ + + + + + + + + + + + + + + <UseAppBundle Value="False"/> + <ResourceType Value="res"/> + </General> + <BuildModes Count="3"> + <Item1 Name="Default" Default="True"/> + <Item2 Name="Debug"> + <CompilerOptions> + <Version Value="11"/> + <Target> + <Filename Value="../../../bin/abouchez"/> + </Target> + <SearchPaths> + <IncludeFiles Value="$(ProjOutDir)"/> + <UnitOutputDirectory Value="../../../bin/lib/$(TargetCPU)-$(TargetOS)"/> + </SearchPaths> + <Parsing> + <SyntaxOptions> + <IncludeAssertionCode Value="True"/> + </SyntaxOptions> + </Parsing> + <CodeGeneration> + <Checks> + <IOChecks Value="True"/> + <RangeChecks Value="True"/> + <OverflowChecks Value="True"/> + <StackChecks Value="True"/> + </Checks> + <VerifyObjMethodCallValidity Value="True"/> + </CodeGeneration> + <Linking> + <Debugging> + <DebugInfoType Value="dsDwarf3"/> + <UseHeaptrc Value="True"/> + <TrashVariables Value="True"/> + <UseExternalDbgSyms Value="True"/> + </Debugging> + </Linking> + </CompilerOptions> + </Item2> + <Item3 Name="Release"> + <CompilerOptions> + <Version Value="11"/> + <Target> + <Filename Value="../../../bin/abouchez"/> + </Target> + <SearchPaths> + <IncludeFiles Value="$(ProjOutDir)"/> + <UnitOutputDirectory Value="../../../bin/lib/$(TargetCPU)-$(TargetOS)"/> + </SearchPaths> + <CodeGeneration> + <SmartLinkUnit Value="True"/> + <TargetProcessor Value="COREAVX2"/> + <Optimizations> + <OptimizationLevel Value="3"/> + </Optimizations> + </CodeGeneration> + <Linking> + <Debugging> + <GenerateDebugInfo Value="False"/> + </Debugging> + <LinkSmart Value="True"/> + </Linking> + </CompilerOptions> + </Item3> + </BuildModes> + <PublishOptions> + <Version Value="2"/> + <UseFileFilters Value="True"/> + </PublishOptions> + <RunParams> + <FormatVersion Value="2"/> + </RunParams> + <RequiredPackages Count="1"> + <Item1> + <PackageName Value="mormot2"/> + </Item1> + </RequiredPackages> + <Units Count="1"> + <Unit0> + <Filename Value="brcmormotold.lpr"/> + <IsPartOfProject Value="True"/> + <UnitName Value="brcmormot"/> + </Unit0> + </Units> + </ProjectOptions> + <CompilerOptions> + <Version Value="11"/> + <Target> + <Filename Value="../../../bin/abouchez"/> + </Target> + <SearchPaths> + <IncludeFiles Value="$(ProjOutDir)"/> + <UnitOutputDirectory Value="../../../bin/lib/$(TargetCPU)-$(TargetOS)"/> + </SearchPaths> + <Parsing> + <SyntaxOptions> + <IncludeAssertionCode Value="True"/> + </SyntaxOptions> + </Parsing> + <CodeGeneration> + <Optimizations> + <OptimizationLevel Value="3"/> + </Optimizations> + </CodeGeneration> + <Linking> + <Debugging> + <DebugInfoType Value="dsDwarf3"/> + </Debugging> + </Linking> + </CompilerOptions> + <Debugging> + <Exceptions Count="3"> + <Item1> + <Name Value="EAbort"/> + </Item1> + <Item2> + <Name Value="ECodetoolError"/> + </Item2> + <Item3> + <Name Value="EFOpenError"/> + </Item3> + </Exceptions> + </Debugging> +</CONFIG> diff --git a/entries/abouchez/src/brcmormotold.lpr b/entries/abouchez/src/brcmormotold.lpr new file mode 100644 index 0000000..728e280 --- /dev/null +++ b/entries/abouchez/src/brcmormotold.lpr @@ -0,0 +1,597 @@ +/// MIT code (c) Arnaud Bouchez, using the mORMot 2 framework +// - initial revision with full hashing and transient storage of the names +program brcmormotold; + +{$define CUSTOMHASH} +// a dedicated hash table is 40% faster than mORMot generic TDynArrayHashed + +{$define CUSTOMASM} +// about 10% faster with some dedicated asm instead of mORMot code on x86_64 + +{$I mormot.defines.inc} + +{$ifdef OSWINDOWS} + {$apptype console} +{$endif OSWINDOWS} + +uses + {$ifdef UNIX} + cthreads, + {$endif UNIX} + classes, + sysutils, + mormot.core.base, + mormot.core.os, + mormot.core.unicode, + mormot.core.text, + mormot.core.data; + +type + // a weather station info, using a whole CPU L1 cache line (64 bytes) + TBrcStation = packed record + NameLen: byte; // name as first "shortstring" field for TDynArray + NameText: array[1 .. 64 - 1 - 2 * 4 - 2 * 2] of byte; + Sum, Count: integer; // we ensured no overflow occurs with 32-bit range + Min, Max: SmallInt; // 16-bit (-32767..+32768) temperatures * 10 + end; + PBrcStation = ^TBrcStation; + TBrcStationDynArray = array of TBrcStation; + + TBrcStations = array[word] of TBrcStation; + PBrcStations = ^TBrcStations; + + TBrcList = record + public + Count: integer; + {$ifdef CUSTOMHASH} + Station: PBrcStations; // perfectly aligned to 64 bytes from StationMem[] + StationHash: array of word; // store 0 if void, or Station[] index + 1 + StationMem: TBrcStationDynArray; + function Search(name: pointer; namelen: PtrInt): PBrcStation; + {$else} + Station: TBrcStationDynArray; + Stations: TDynArrayHashed; + function Search(name: PByteArray): PBrcStation; + {$endif CUSTOMHASH} + procedure Init(max: integer; align: boolean); + end; + + TBrcMain = class + protected + fSafe: TLightLock; + fEvent: TSynEvent; + fRunning, fMax: integer; + fCurrentChunk: PByteArray; + fCurrentRemain: PtrUInt; + fList: TBrcList; + fMem: TMemoryMap; + procedure Aggregate(const another: TBrcList); + function GetChunk(out start, stop: PByteArray): boolean; + public + constructor Create(const fn: TFileName; threads, max: integer; + affinity: boolean); + destructor Destroy; override; + procedure WaitFor; + function SortedText: RawUtf8; + end; + + TBrcThread = class(TThread) + protected + fOwner: TBrcMain; + fList: TBrcList; // each thread work on its own list + procedure Execute; override; + public + constructor Create(owner: TBrcMain); + end; + + +{ TBrcList } + +{$ifdef CUSTOMHASH} + +{$ifndef OSLINUXX64} + {$undef CUSTOMASM} // asm below is for FPC + Linux x86_64 only +{$endif OSLINUXX64} + +const + HASHSIZE = 1 shl 18; // slightly oversized to avoid most collisions + +procedure TBrcList.Init(max: integer; align: boolean); +begin + assert(max <= high(StationHash[0])); + SetLength(StationMem, max); // RTL won't align by 64 bytes + Station := pointer(StationMem); + if align then + while {%H-}PtrUInt(Station) and 63 <> 0 do // manual alignment + inc(PByte(Station)); + SetLength(StationHash, HASHSIZE); +end; + +{$ifdef CUSTOMASM} + +function dohash(buf: PAnsiChar; len: cardinal): PtrUInt; nostackframe; assembler; +asm + xor eax, eax // it is enough to hash up to 15 bytes for our purpose + mov ecx, len + cmp len, 8 + jb @less8 + crc32 rax, qword ptr [buf] + add buf, 8 +@less8: test cl, 4 + jz @less4 + crc32 eax, dword ptr [buf] + add buf, 4 +@less4: test cl, 2 + jz @less2 + crc32 eax, word ptr [buf] + add buf, 2 +@less2: test cl, 1 + jz @z + crc32 eax, byte ptr [buf] +@z: +end; + +function CompareMem(a, b: pointer; len: PtrInt): boolean; nostackframe; assembler; +asm + add a, len + add b, len + neg len + cmp len, -8 + ja @less8 + align 8 +@by8: mov rax, qword ptr [a + len] + cmp rax, qword ptr [b + len] + jne @diff + add len, 8 + jz @eq + cmp len, -8 + jna @by8 +@less8: cmp len, -4 + ja @less4 + mov eax, dword ptr [a + len] + cmp eax, dword ptr [b + len] + jne @diff + add len, 4 + jz @eq +@less4: cmp len, -2 + ja @less2 + movzx eax, word ptr [a + len] + movzx ecx, word ptr [b + len] + cmp eax, ecx + jne @diff + add len, 2 +@less2: test len, len + jz @eq + mov al, byte ptr [a + len] + cmp al, byte ptr [b + len] + je @eq +@diff: xor eax, eax + ret +@eq: mov eax, 1 // = found (most common case of no hash collision) +end; + +{$else} + +function dohash(buf: PAnsiChar; len: cardinal): PtrUInt; inline; +begin + if len > 16 then + len := 16; // it is enough to hash up to 16 bytes for our purpose + result := DefaultHasher(0, buf, len); // fast mORMot asm hasher (crc32c) +end; + +{$endif CUSTOMASM} + +function TBrcList.Search(name: pointer; namelen: PtrInt): PBrcStation; +var + h, x: PtrUInt; +begin + assert(namelen <= SizeOf(TBrcStation.NameText)); + h := dohash(name, namelen); + repeat + h := h and (HASHSIZE - 1); + x := StationHash[h]; + if x = 0 then + break; // void slot + result := @Station[x - 1]; + if (result^.NameLen = namelen) and + CompareMem(@result^.NameText, name, namelen) then + exit; // found + inc(h); // hash collision: try next slot + until false; + result := @Station[Count]; + inc(Count); + StationHash[h] := Count; + result^.NameLen := namelen; + MoveFast(name^, result^.NameText, namelen); + result^.Min := high(result^.Min); + result^.Max := low(result^.Max); +end; + +{$else} + +function StationHash(const Item; Hasher: THasher): cardinal; +var + s: TBrcStation absolute Item; // s.Name should be the first field +begin + result := Hasher(0, @s.NameText, s.NameLen); +end; + +function StationComp(const A, B): integer; +var + sa: TBrcStation absolute A; + sb: TBrcStation absolute B; +begin + result := MemCmp(@sa.NameLen, @sb.NameLen, sa.NameLen + 1); +end; + +procedure TBrcList.Init(max: integer; align: boolean); +begin + // align is just ignored, because TDynArray requires natural alignment + Stations.Init( + TypeInfo(TBrcStationDynArray), Station, @StationHash, @StationComp, nil, @Count); + Stations.Capacity := max; +end; + +function TBrcList.Search(name: PByteArray): PBrcStation; +var + i: PtrUInt; + added: boolean; +begin + assert(name^[0] < SizeOf(TBrcStation.NameText)); + i := Stations.FindHashedForAdding(name^, added); + result := @Station[i]; // in two steps (Station[] may be reallocated) + if not added then + exit; + MoveFast(name^, result^.NameLen, name^[0] + 1); + result^.Min := high(result^.Min); + result^.Max := low(result^.Max); +end; + +{$endif CUSTOMHASH} + + +{ TBrcThread } + +constructor TBrcThread.Create(owner: TBrcMain); +begin + fOwner := owner; + FreeOnTerminate := true; + fList.Init(fOwner.fMax, {align=}true); + InterlockedIncrement(fOwner.fRunning); + inherited Create({suspended=}false); +end; + +procedure TBrcThread.Execute; +var + p, start, stop: PByteArray; + v, m: integer; + l, neg: PtrInt; + s: PBrcStation; + {$ifndef CUSTOMHASH} + c: byte; + name: array[0..63] of byte; // efficient map of a temp shortstring on FPC + {$endif CUSTOMHASH} +begin + while fOwner.GetChunk(start, stop) do + begin + // parse this thread chunk + p := start; + repeat + // parse the name; + l := 2; + {$ifdef CUSTOMHASH} + start := p; + while p[l] <> ord(';') do + inc(l); // small local loop is faster than SSE2 ByteScanIndex() + {$else} + repeat + c := p[l]; + if c = ord(';') then + break; + inc(l); + name[l] := c; // fill name[] as a shortstring + until false; + name[0] := l; + {$endif CUSTOMHASH} + p := @p[l + 1]; // + 1 to ignore ; + // parse the temperature (as -12.3 -3.4 5.6 78.9 patterns) into value * 10 + if p[0] = ord('-') then + begin + neg := -1; + p := @p[1]; + end + else + neg := 1; + if p[2] = ord('.') then // xx.x + begin + // note: the PCardinal(p)^ + "shr and $ff" trick is actually slower + v := (p[0] * 100 + p[1] * 10 + p[3] - (ord('0') * 111)) * neg; + p := @p[6]; // also jump ending $13/$10 + end + else + begin + v := (p[0] * 10 + p[2] - (ord('0') * 11)) * neg; // x.x + p := @p[5]; + end; + // store the value + {$ifdef CUSTOMHASH} + s := fList.Search(start, l); + {$else} + s := fList.Search(@name); + {$endif CUSTOMHASH} + inc(s^.Sum, v); + inc(s^.Count); + m := s^.Min; + if v < m then + m := v; // branchless cmovl + s^.Min := m; + m := s^.Max; + if v > m then + m := v; + s^.Max := m; + until p >= stop; + end; + // aggregate this thread values into the main list + fOwner.Aggregate(fList); +end; + + +{ TBrcMain } + +constructor TBrcMain.Create(const fn: TFileName; threads, max: integer; + affinity: boolean); +var + i, cores, core: integer; + one: TBrcThread; +begin + fEvent := TSynEvent.Create; + if not fMem.Map(fn) then + raise ESynException.CreateUtf8('Impossible to find %', [fn]); + fMax := max; + fList.Init(fMax, {align=}false); // not aligned for TDynArray.Sort to work + fCurrentChunk := pointer(fMem.Buffer); + fCurrentRemain := fMem.Size; + core := 0; + cores := SystemInfo.dwNumberOfProcessors; + for i := 0 to threads - 1 do + begin + one := TBrcThread.Create(self); + if not affinity then + continue; + SetThreadCpuAffinity(one, core); + inc(core, 2); + if core >= cores then + dec(core, cores - 1); // e.g. 0,2,1,3,0,2.. with 4 cpus + end; +end; + +destructor TBrcMain.Destroy; +begin + inherited Destroy; + fMem.UnMap; + fEvent.Free; +end; + +const + CHUNKSIZE = 64 shl 20; // fed each TBrcThread with 64MB chunks + // it is faster than naive parallel process of size / threads input because + // OS thread scheduling is never fair so some threads will finish sooner + +function TBrcMain.GetChunk(out start, stop: PByteArray): boolean; +var + chunk: PtrUInt; +begin + result := false; + fSafe.Lock; + chunk := fCurrentRemain; + if chunk <> 0 then + begin + start := fCurrentChunk; + if chunk > CHUNKSIZE then + begin + stop := pointer(GotoNextLine(pointer(@start[CHUNKSIZE]))); + chunk := PAnsiChar(stop) - PAnsiChar(start); + end + else + begin + stop := @start[chunk]; + while PAnsiChar(stop)[-1] <= ' ' do + dec(PByte(stop)); // ensure final stop at meaningful char + end; + dec(fCurrentRemain, chunk); + fCurrentChunk := @fCurrentChunk[chunk]; + result := true; + end; + fSafe.UnLock; +end; + +procedure TBrcMain.Aggregate(const another: TBrcList); +var + s, d: PBrcStation; + n: integer; +begin + fSafe.Lock; // several TBrcThread may finish at the same time + n := another.Count; + s := pointer(another.Station); + repeat + {$ifdef CUSTOMHASH} + d := fList.Search(@s^.NameText, s^.NameLen); + {$else} + d := fList.Search(@s^.NameLen); + {$endif CUSTOMHASH} + inc(d^.Count, s^.Count); + inc(d^.Sum, s^.Sum); + if s^.Max > d^.Max then + d^.Max := s^.Max; + if s^.Min < d^.Min then + d^.Min := s^.Min; + inc(s); + dec(n); + until n = 0; + fSafe.UnLock; + if InterlockedDecrement(fRunning) = 0 then + fEvent.SetEvent; // all threads finished: release main console thread +end; + +procedure TBrcMain.WaitFor; +begin + fEvent.WaitForEver; +end; + +procedure AddTemp(w: TTextWriter; sep: AnsiChar; val: PtrInt); +var + d10: PtrInt; +begin + w.Add(sep); + if val < 0 then + begin + w.Add('-'); + val := -val; + end; + d10 := val div 10; // val as temperature * 10 + w.AddString(SmallUInt32Utf8[d10]); // in 0..999 range + w.Add('.'); + w.Add(AnsiChar(val - d10 * 10 + ord('0'))); +end; + +function ByStationName(const A, B): integer; +var + sa: TBrcStation absolute A; + sb: TBrcStation absolute B; + la, lb: PtrInt; +begin + la := sa.NameLen; + lb := sb.NameLen; + if la < lb then + la := lb; + result := MemCmp(@sa.NameText, @sb.NameText, la); + if result = 0 then + result := sa.NameLen - sb.NameLen; +end; + +function Average(sum, count: PtrInt): PtrInt; +// sum and result are temperature * 10 (one fixed decimal) +var + x, t: PtrInt; // temperature * 100 (two fixed decimals) +begin + x := (sum * 10) div count; // average + // this weird algo follows the "official" PascalRound() implementation + t := (x div 10) * 10; // truncate + if abs(x - t) >= 5 then + if x < 0 then + dec(t, 10) + else + inc(t, 10); + result := t div 10; // truncate back to one decimal (temperature * 10) + //ConsoleWrite([sum / (count * 10), ' ', result / 10]); +end; + +function TBrcMain.SortedText: RawUtf8; +var + n: integer; + s: PBrcStation; + st: TRawByteStringStream; + w: TTextWriter; + tmp: TTextWriterStackBuffer; +begin + {$ifdef CUSTOMHASH} + DynArrayFakeLength(fList.Station, fList.Count); + DynArray(TypeInfo(TBrcStationDynArray), fList.Station).Sort(ByStationName); + {$else} + fList.Stations.Sort(ByStationName); + {$endif CUSTOMHASH} + FastSetString(result, nil, 1200000); // pre-allocate result + st := TRawByteStringStream.Create(result); + try + w := TTextWriter.Create(st, @tmp, SizeOf(tmp)); + try + w.Add('{'); + s := pointer(fList.Station); + n := fList.Count; + if n > 0 then + repeat + assert(s^.Count <> 0); + w.AddNoJsonEscape(@s^.NameText, s^.NameLen); + AddTemp(w, '=', s^.Min); + AddTemp(w, '/', Average(s^.Sum, s^.Count)); + AddTemp(w, '/', s^.Max); + dec(n); + if n = 0 then + break; + w.Add(',', ' '); + inc(s); + until false; + w.Add('}'); + w.FlushFinal; + FakeLength(result, w.WrittenBytes); + finally + w.Free; + end; + finally + st.Free; + end; +end; + +var + fn: TFileName; + threads: integer; + verbose, affinity, help: boolean; + main: TBrcMain; + res: RawUtf8; + start, stop: Int64; +begin + assert(SizeOf(TBrcStation) = 64); // 64 bytes = CPU L1 cache line size + // read command line parameters + Executable.Command.ExeDescription := 'The mORMot One Billion Row Challenge'; + if Executable.Command.Arg(0, 'the data source #filename') then + Utf8ToFileName(Executable.Command.Args[0], fn); + verbose := Executable.Command.Option( + ['v', 'verbose'], 'generate verbose output with timing'); + affinity := Executable.Command.Option( + ['a', 'affinity'], 'force thread affinity to a single CPU core'); + Executable.Command.Get( + ['t', 'threads'], threads, '#number of threads to run', 16); + help := Executable.Command.Option(['h', 'help'], 'display this help'); + if Executable.Command.ConsoleWriteUnknown then + exit + else if help or + (fn = '') then + begin + ConsoleWrite(Executable.Command.FullDescription); + exit; + end; + // actual process + if verbose then + ConsoleWrite(['Processing ', fn, ' with ', threads, ' threads', + ' and affinity=', BOOL_STR[affinity]]); + QueryPerformanceMicroSeconds(start); + try + main := TBrcMain.Create(fn, threads, {max=}45000, affinity); + // note: current stations count = 41343 for 2.5MB of data per thread + try + main.WaitFor; + res := main.SortedText; + if verbose then + ConsoleWrite(['result hash=', CardinalToHexShort(crc32cHash(res)), + ', result length=', length(res), + ', stations count=', main.fList.Count, + ', valid utf8=', IsValidUtf8(res)]) + else + ConsoleWrite(res); + finally + main.Free; + end; + except + on E: Exception do + ConsoleShowFatalException(E); + end; + // optional timing output + if verbose then + begin + QueryPerformanceMicroSeconds(stop); + dec(stop, start); + ConsoleWrite(['done in ', MicroSecToString(stop), ' ', + KB((FileSize(fn) * 1000000) div stop), '/s']); + end; +end. +