From 5d9273d7e26112281e1a7d9ed8f95a14058fd03c Mon Sep 17 00:00:00 2001 From: Arnaud Bouchez Date: Thu, 21 Mar 2024 17:19:27 +0100 Subject: [PATCH 1/5] introducing mORMot / abcz entry - please check the README.md information --- entries/abcz/README.md | 140 ++++++++ entries/abcz/src/brcmormot.lpi | 135 ++++++++ entries/abcz/src/brcmormot.lpr | 578 +++++++++++++++++++++++++++++++++ 3 files changed, 853 insertions(+) create mode 100644 entries/abcz/README.md create mode 100644 entries/abcz/src/brcmormot.lpi create mode 100644 entries/abcz/src/brcmormot.lpr diff --git a/entries/abcz/README.md b/entries/abcz/README.md new file mode 100644 index 0000000..0d96799 --- /dev/null +++ b/entries/abcz/README.md @@ -0,0 +1,140 @@ +# mORMot version of The One Billion Row Challenge + +## mORMot 2 is Required + +This entry requires the **mORMot 2** package to compile. + +Download it from https://github.com/synopse/mORMot2 + +It is better to fork the current state of the mORMot 2 repository, or get the latest release. + +## Licence Terms + +This code is licenced by its sole author (A. Bouchez) as MIT terms, to be used for pedagogical reasons. + +I am very happy to share decades of server-side performance coding techniques using FPC on x86_64. ;) + +## Presentation + +Here are the main ideas behind this implementation proposal: + +- **mORMot** makes cross-platform and cross-compiler support simple (e.g. 
`TMemMap`, `TDynArray.Sort`, `TTextWriter`, `SetThreadCpuAffinity`, `crc32c`, `ConsoleWrite` or command-line parsing);
+- Memory map the entire 16GB file at once (so it won't work on a 32-bit OS, but it reduces syscalls);
+- Process the file in parallel using several threads (configurable, with `-t=16` by default);
+- Each thread is fed from 64MB chunks of input (because thread scheduling is unfair, it is inefficient to pre-divide the whole input file into one slice per thread);
+- Each thread manages its own data, so there is no lock until the thread is finished and its data is consolidated;
+- Each station's information (name and values) is packed into a record of exactly 64 bytes, with no external pointer/string, to match the CPU L1 cache line size for efficiency;
+- Use a dedicated hash table for the name lookup, with a direct crc32c SSE4.2 hash - when `TDynArrayHashed` is involved, it requires a transient name copy on the stack, which is noticeably slower (see the last paragraph of this document);
+- Store values as 16-bit or 32-bit integers (temperature multiplied by 10);
+- Parse temperatures with dedicated code (expects input values with a single decimal);
+- No memory allocation (e.g. no transient `string` or `TBytes`) nor any syscall is done during the parsing process, to reduce contention and ensure the process is only CPU-bound and RAM-bound (we checked this with `strace` on Linux);
+- Pascal code was tuned to generate the best possible asm output on FPC x86_64 (which is our target) with no SIMD involved;
+- Some dedicated x86_64 asm has been written to replace the mORMot `crc32c` and `MemCmp` general-purpose functions and gain a last few percent;
+- Can optionally output timing statistics and the hash value on the console, to debug and refine settings (with the `-v` command line switch);
+- Can optionally set each thread's affinity to a single core (with the `-a` command line switch).
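To make the "dedicated temperature parsing" bullet concrete, here is a minimal Python sketch of the same digit trick used later in `brcmormot.lpr`: each ASCII digit is multiplied by its decimal weight and a pre-scaled `'0'` offset is subtracted once, so the value comes out directly as an integer number of tenths of a degree. The function name is illustrative, not part of the entry:

```python
def parse_temp_tenths(line: bytes) -> int:
    """Parse b"-12.3" / b"5.6" style input into an integer number of tenths."""
    neg = 1
    i = 0
    if line[0] == ord('-'):
        neg = -1
        i = 1
    if line[i + 2] == ord('.'):          # "xx.x" pattern
        # p0*100 + p1*10 + p3 - '0'*111 removes the ASCII offsets in one go
        v = line[i] * 100 + line[i + 1] * 10 + line[i + 3] - ord('0') * 111
    else:                                # "x.x" pattern
        v = line[i] * 10 + line[i + 2] - ord('0') * 11
    return v * neg

print(parse_temp_tenths(b"-12.3"), parse_temp_tenths(b"5.6"))  # -123 56
```

Working in tenths keeps every value inside a 16-bit `smallint`, which is what lets the whole per-station record fit in 64 bytes.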
The "64 bytes cache line" trick is quite unique among all implementations of the "1brc" I have seen in any language - and it does make a noticeable difference in performance. The L1 cache is well known to be the main bottleneck for any efficient in-memory process. We are very lucky the station names are just small enough to fit in 64 bytes, with min/max values stored as 16-bit smallint - resulting in a temperature range of -3276.8..+3276.7, which seems fair on our planet according to the IPCC. ;)
+
+## Usage
+
+If you execute the `mormot` executable without any parameter, it will give you some hints about its usage (using mORMot `TCommandLine` abilities):
+
+```
+ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./mormot
+The mORMot One Billion Row Challenge
+
+Usage: mormot <filename> [options] [params]
+
+   <filename>          the data source filename
+
+Options:
+  -v, --verbose       generate verbose output with timing
+  -a, --affinity      force thread affinity to a single CPU core
+  -h, --help          display this help
+
+Params:
+  -t, --threads <number> (default 16)
+                      number of threads to run
+```
+We will use these command-line switches for local (dev PC), and benchmark (challenge HW) analysis.
+
+## Local Analysis
+
+On my PC, it takes less than 5 seconds to process the 16GB file with 8 threads.
+
+If we use the `time` command on Linux, we can see that there is little time spent in kernel (sys) land.
+
+If we compare our `mormot` with a solid multi-threaded entry using file buffer reads and no memory map (like `sbalazs`):
+
+```
+ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./mormot measurements.txt -t=10 >resmrel5.txt
+
+real	0m4,216s
+user	0m38,789s
+sys	0m0,632s
+
+ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./sbalazs measurements.txt 20 >ressb6.txt
+
+real	0m25,330s
+user	6m44,853s
+sys	0m31,167s
+```
+We used 20 threads for `sbalazs`, and 10 threads for `mormot`, because those settings gave the best results for each entry on this particular PC.
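The 64MB chunking described in the bullets above (each thread repeatedly grabs the next slice of the mapped file, extended to a full line so no record is split between threads) can be sketched as follows; the function name and the tiny demo sizes are illustrative:

```python
CHUNK = 64 << 20  # 64MB, as in the entry; much smaller sizes are used below

def next_chunk(buf: bytes, pos: int, size: int = CHUNK):
    """Return the (start, stop) byte range of the next chunk, extended to the
    end of the current line so no temperature record is split in two."""
    if pos >= len(buf):
        return None                        # all input consumed
    stop = min(pos + size, len(buf))
    if stop < len(buf):
        stop = buf.index(b'\n', stop) + 1  # finish the current line
    return pos, stop

buf = b"a;1.2\nbb;3.4\nccc;-5.6\n"
print(next_chunk(buf, 0, 4))  # (0, 6): 4 bytes requested, grown to the newline
```

Each worker would take a lock, call something like `next_chunk`, advance the shared position and unlock, mirroring what `TBrcMain.GetChunk` does with `fCurrentChunk`/`fCurrentRemain` in the source.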
Apart from the obvious global "wall" time reduction (`real` numbers), the raw parsing and data gathering in the threads match the number of threads and the running time (`user` numbers), and no syscall is issued by `mormot` thanks to the memory mapping of the whole file (the `sys` numbers only reflect memory page faults).
+
+The `memmap` feature makes the initial `mormot` call slower, because it needs to cache all the measurements data from file into RAM (I have 32GB of RAM, so the whole data file will remain in memory, as on the benchmark hardware):
+```
+ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./mormot measurements.txt -t=10 >resmrel4.txt
+
+real	0m6,042s
+user	0m53,699s
+sys	0m2,941s
+```
+This is the expected behavior, and will be fine with the benchmark challenge, which ignores the min and max timings among its 10 runs. So the first run will just warm up the file into memory.
+
+On my Intel 13th gen processor with E-cores and P-cores, forcing thread-to-core affinity does not help:
+```
+ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./mormot measurements.txt -t=10 -v
+Processing measurements.txt with 10 threads and affinity=false
+result hash=8A6B746A, result length=1139418, stations count=41343, valid utf8=1
+done in 4.25s 3.6 GB/s
+ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./mormot measurements.txt -t=10 -v -a
+Processing measurements.txt with 10 threads and affinity=true
+result hash=8A6B746A, result length=1139418, stations count=41343, valid utf8=1
+done in 4.42s 3.5 GB/s
+```
+Affinity may help on the Ryzen 9, because its Zen 3 architecture is made of 16 identical cores with 32 threads, not this Intel E/P cores mess. But we will validate that on real hardware - no premature guess!
+
+The `-v` verbose mode makes such testing easy. The `hash` value quickly checks that the generated output is correct, and that it is valid `utf8` content (as expected).
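The warm-up effect described above is easy to reproduce with any memory-mapped read: the first pass pages the file into the OS cache, and later passes touch already-resident pages with no `read()` syscalls. A small Python stand-in for mORMot's `TMemMap` (the temporary file and its content are purely illustrative):

```python
import mmap
import os
import tempfile

# create a small stand-in for measurements.txt
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"Paris;12.3\nOslo;-3.4\n" * 1000)

with open(path, "rb") as f:
    # map the whole file once: after the first (cold) pass the pages sit in the
    # OS page cache, so reruns only pay soft page faults, not read() syscalls
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lines = 0
        pos = 0
        while pos < len(mm):
            pos = mm.find(b"\n", pos) + 1  # walk line by line over the mapping
            lines += 1

os.remove(path)
print(lines)  # 2000
```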
## Benchmark Integration
+
+Every system is quite unique, especially regarding its CPU multi-threading abilities. For instance, my Intel Core i5 has both P-cores and E-cores, so its threading model is pretty unfair. The Zen architecture should be more balanced.
+
+So we first need to find out which options best leverage the hardware the benchmark runs on.
+
+On the https://github.com/gcarreno/1brc-ObjectPascal challenge hardware - a Ryzen 9 5950x with 16 cores / 32 threads and 64MB of L3 cache, with each thread using around 2.5MB of its own data - we should try several options with 16, 24 and 32 threads, for instance:
+
+```
+./mormot measurements.txt -v -t=8
+./mormot measurements.txt -v -t=16
+./mormot measurements.txt -v -t=24
+./mormot measurements.txt -v -t=32
+./mormot measurements.txt -v -t=16 -a
+./mormot measurements.txt -v -t=24 -a
+./mormot measurements.txt -v -t=32 -a
+```
+Please run those command lines to find out which parameters give the best results on the actual benchmark PC with its Ryzen 9 CPU. We will see if core affinity makes a difference there.
+
+## Feedback Needed
+
+Here we will put some additional information, once our proposal has been run on the benchmark hardware.
+
+Stay tuned!
+
+## Ending Note
+
+There is a "pure mORMot" name lookup version available if you undefine the `CUSTOMHASH` conditional. It is around 40% slower, because it needs to copy each name onto the stack before using `TDynArrayHashed`, which adds a little overhead.
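For reference, the `CUSTOMHASH` lookup mentioned in the Ending Note boils down to classic open addressing with linear probing over a power-of-two table that stores `index + 1` (0 meaning an empty slot), so names are compared in place without any transient copy. A Python sketch, using `zlib.crc32` as a stand-in for the hardware SSE4.2 `crc32c` (the polynomials differ, but the table mechanics are the same):

```python
import zlib

HASHSIZE = 1 << 8          # power of two, oversized to keep collisions rare

def search(slots, names, name: bytes) -> int:
    """Return the index of name in names, appending it when first seen."""
    h = zlib.crc32(name)   # stand-in for the hardware crc32c used by the entry
    while True:
        h &= HASHSIZE - 1
        x = slots[h]
        if x == 0:                 # empty slot: register a new station
            names.append(name)
            slots[h] = len(names)  # store index + 1, since 0 means "empty"
            return len(names) - 1
        if names[x - 1] == name:   # found, no transient name copy needed
            return x - 1
        h += 1                     # collision: probe the next slot

slots = [0] * HASHSIZE
names = []
print(search(slots, names, b"Paris"),   # 0
      search(slots, names, b"Oslo"),    # 1
      search(slots, names, b"Paris"))   # 0
```

The entry's `TBrcList.Search` does the same, only with a `word` slot array, a fixed 64-byte record per station, and `crc32c` computed over at most the first bytes of the name.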
+ +Arnaud :D \ No newline at end of file diff --git a/entries/abcz/src/brcmormot.lpi b/entries/abcz/src/brcmormot.lpi new file mode 100644 index 0000000..b1e391d --- /dev/null +++ b/entries/abcz/src/brcmormot.lpi @@ -0,0 +1,135 @@ + + + + + + + + + + + + + <UseAppBundle Value="False"/> + <ResourceType Value="res"/> + </General> + <BuildModes> + <Item Name="Default" Default="True"/> + <Item Name="Debug"> + <CompilerOptions> + <Version Value="11"/> + <Target> + <Filename Value="../../../bin/mormot"/> + </Target> + <SearchPaths> + <IncludeFiles Value="$(ProjOutDir)"/> + <UnitOutputDirectory Value="../../../bin/lib/$(TargetCPU)-$(TargetOS)"/> + </SearchPaths> + <Parsing> + <SyntaxOptions> + <IncludeAssertionCode Value="True"/> + </SyntaxOptions> + </Parsing> + <CodeGeneration> + <Checks> + <IOChecks Value="True"/> + <RangeChecks Value="True"/> + <OverflowChecks Value="True"/> + <StackChecks Value="True"/> + </Checks> + <VerifyObjMethodCallValidity Value="True"/> + </CodeGeneration> + <Linking> + <Debugging> + <DebugInfoType Value="dsDwarf3"/> + <UseHeaptrc Value="True"/> + <TrashVariables Value="True"/> + <UseExternalDbgSyms Value="True"/> + </Debugging> + </Linking> + </CompilerOptions> + </Item> + <Item Name="Release"> + <CompilerOptions> + <Version Value="11"/> + <Target> + <Filename Value="../../../bin/mormot"/> + </Target> + <SearchPaths> + <IncludeFiles Value="$(ProjOutDir)"/> + <UnitOutputDirectory Value="../../../bin/lib/$(TargetCPU)-$(TargetOS)"/> + </SearchPaths> + <CodeGeneration> + <SmartLinkUnit Value="True"/> + <TargetProcessor Value="COREAVX2"/> + <Optimizations> + <OptimizationLevel Value="3"/> + </Optimizations> + </CodeGeneration> + <Linking> + <Debugging> + <GenerateDebugInfo Value="False"/> + </Debugging> + <LinkSmart Value="True"/> + </Linking> + </CompilerOptions> + </Item> + </BuildModes> + <PublishOptions> + <Version Value="2"/> + <UseFileFilters Value="True"/> + </PublishOptions> + <RunParams> + <FormatVersion Value="2"/> + </RunParams> + 
<RequiredPackages> + <Item> + <PackageName Value="mormot2"/> + </Item> + </RequiredPackages> + <Units> + <Unit> + <Filename Value="brcmormot.lpr"/> + <IsPartOfProject Value="True"/> + </Unit> + </Units> + </ProjectOptions> + <CompilerOptions> + <Version Value="11"/> + <Target> + <Filename Value="../../../bin/mormot"/> + </Target> + <SearchPaths> + <IncludeFiles Value="$(ProjOutDir)"/> + <UnitOutputDirectory Value="../../../bin/lib/$(TargetCPU)-$(TargetOS)"/> + </SearchPaths> + <Parsing> + <SyntaxOptions> + <IncludeAssertionCode Value="True"/> + </SyntaxOptions> + </Parsing> + <CodeGeneration> + <Optimizations> + <OptimizationLevel Value="3"/> + </Optimizations> + </CodeGeneration> + <Linking> + <Debugging> + <DebugInfoType Value="dsDwarf3"/> + </Debugging> + </Linking> + </CompilerOptions> + <Debugging> + <Exceptions> + <Item> + <Name Value="EAbort"/> + </Item> + <Item> + <Name Value="ECodetoolError"/> + </Item> + <Item> + <Name Value="EFOpenError"/> + </Item> + </Exceptions> + </Debugging> +</CONFIG> diff --git a/entries/abcz/src/brcmormot.lpr b/entries/abcz/src/brcmormot.lpr new file mode 100644 index 0000000..1fe151c --- /dev/null +++ b/entries/abcz/src/brcmormot.lpr @@ -0,0 +1,578 @@ +/// MIT code (c) Arnaud Bouchez, using the mORMot 2 framework +program brcmormot; + +{$define CUSTOMHASH} +// a dedicated hash table is 40% faster than mORMot generic TDynArrayHashed + +{$define CUSTOMASM} +// a few % faster with some dedicated asm instead of mORMot code on x86_64 + +{$I mormot.defines.inc} + +{$ifdef OSWINDOWS} + {$apptype console} +{$endif OSWINDOWS} + +uses + {$ifdef UNIX} + cthreads, + {$endif UNIX} + classes, + sysutils, + mormot.core.base, + mormot.core.os, + mormot.core.unicode, + mormot.core.text, + mormot.core.data; + +type + // a weather station info, using a whole CPU L1 cache line (64 bytes) + TBrcStation = packed record + NameLen: byte; // name as first "shortstring" field for TDynArray + NameText: array[1 .. 
64 - 1 - 2 * 4 - 2 * 2] of byte; + Sum, Count: integer; // we ensured no overflow occurs with 32-bit range + Min, Max: SmallInt; // 16-bit (-32767..+32768) temperatures * 10 + end; + PBrcStation = ^TBrcStation; + TBrcStations = array of TBrcStation; + + TBrcList = record + public + Station: TBrcStations; + Count: integer; + {$ifdef CUSTOMHASH} + StationHash: array of word; // store 0 if void, or Station[] index + 1 + function Search(name: pointer; namelen: PtrInt): PBrcStation; + {$else} + Stations: TDynArrayHashed; + function Search(name: PByteArray): PBrcStation; + {$endif CUSTOMHASH} + procedure Init(max: integer); + end; + + TBrcMain = class + protected + fSafe: TLightLock; + fEvent: TSynEvent; + fRunning: integer; + fCurrentChunk: PByteArray; + fCurrentRemain: PtrUInt; + fList: TBrcList; + fMem: TMemoryMap; + procedure Aggregate(const another: TBrcList); + function GetChunk(out start, stop: PByteArray): boolean; + public + constructor Create(const fn: TFileName; threads, max: integer; + affinity: boolean); + destructor Destroy; override; + procedure WaitFor; + function SortedText: RawUtf8; + end; + + TBrcThread = class(TThread) + protected + fOwner: TBrcMain; + fList: TBrcList; // each thread work on its own list + procedure Execute; override; + public + constructor Create(owner: TBrcMain); + end; + + +{ TBrcList } + +{$ifdef CUSTOMHASH} + +{$ifndef OSLINUXX64} + {$undef CUSTOMASM} // asm below is for FPC + Linux x86_64 only +{$endif OSLINUXX64} + +const + HASHSIZE = 1 shl 18; // slightly oversized to avoid most collisions + +procedure TBrcList.Init(max: integer); +begin + assert(max <= high(StationHash[0])); + SetLength(Station, max); + SetLength(StationHash, HASHSIZE); +end; + +{$ifdef CUSTOMASM} + +function crc32c(buf: PAnsiChar; len: cardinal): PtrUInt; nostackframe; assembler; +asm + xor eax, eax // it is enough to hash up to 15 bytes for our purpose + mov ecx, len + cmp len, 8 + jb @less8 + crc32 rax, qword ptr [buf] + add buf, 8 +@less8: test cl, 4 + jz 
@less4 + crc32 eax, dword ptr [buf] + add buf, 4 +@less4: test cl, 2 + jz @less2 + crc32 eax, word ptr [buf] + add buf, 2 +@less2: test cl, 1 + jz @z + crc32 eax, byte ptr [buf] +@z: +end; + +function MemEqual(a, b: pointer; len: PtrInt): integer; nostackframe; assembler; +asm + add a, len + add b, len + neg len + cmp len, -8 + ja @less8 + align 8 +@by8: mov rax, qword ptr [a + len] + cmp rax, qword ptr [b + len] + jne @diff + add len, 8 + jz @eq + cmp len, -8 + jna @by8 +@less8: cmp len, -4 + ja @less4 + mov eax, dword ptr [a + len] + cmp eax, dword ptr [b + len] + jne @diff + add len, 4 + jz @eq +@less4: cmp len, -2 + ja @less2 + movzx eax, word ptr [a + len] + movzx ecx, word ptr [b + len] + cmp eax, ecx + jne @diff + add len, 2 +@less2: test len, len + jz @eq + mov al, byte ptr [a + len] + cmp al, byte ptr [b + len] + je @eq +@diff: mov eax, 1 + ret +@eq: xor eax, eax // 0 = found (most common case of no hash collision) +end; + +{$endif CUSTOMASM} + +function TBrcList.Search(name: pointer; namelen: PtrInt): PBrcStation; +var + h, x: PtrUInt; +begin + assert(namelen <= SizeOf(TBrcStation.NameText)); + h := crc32c({$ifndef CUSTOMASM} 0, {$endif} name, namelen); + repeat + h := h and (HASHSIZE - 1); + x := StationHash[h]; + if x = 0 then + break; // void slot + result := @Station[x - 1]; + if (result^.NameLen = namelen) and + ({$ifdef CUSTOMASM}MemEqual{$else}MemCmp{$endif}( + @result^.NameText, name, namelen) = 0) then + exit; // found + inc(h); // hash collision: try next slot + until false; + result := @Station[Count]; + inc(Count); + StationHash[h] := Count; + result^.NameLen := namelen; + MoveFast(name^, result^.NameText, namelen); + result^.Min := high(result^.Min); + result^.Max := low(result^.Max); +end; + +{$else} + +function StationHash(const Item; Hasher: THasher): cardinal; +var + s: TBrcStation absolute Item; // s.Name should be the first field +begin + result := Hasher(0, @s.NameText, s.NameLen); +end; + +function StationComp(const A, B): integer; 
+var + sa: TBrcStation absolute A; + sb: TBrcStation absolute B; +begin + result := MemCmp(@sa.NameLen, @sb.NameLen, sa.NameLen + 1); +end; + +procedure TBrcList.Init(max: integer); +begin + Stations.Init( + TypeInfo(TBrcStations), Station, @StationHash, @StationComp, nil, @Count); + Stations.Capacity := max; +end; + +function TBrcList.Search(name: PByteArray): PBrcStation; +var + i: PtrUInt; + added: boolean; +begin + assert(name^[0] < SizeOf(TBrcStation.NameText)); + i := Stations.FindHashedForAdding(name^, added); + result := @Station[i]; // in two steps (Station[] may be reallocated if added) + if not added then + exit; + MoveFast(name^, result^.NameLen, name^[0] + 1); + result^.Min := high(result^.Min); + result^.Max := low(result^.Max); +end; + +{$endif CUSTOMHASH} + + +{ TBrcThread } + +constructor TBrcThread.Create(owner: TBrcMain); +begin + fOwner := owner; + FreeOnTerminate := true; + fList.Init(length(fOwner.fList.Station)); + InterlockedIncrement(fOwner.fRunning); + inherited Create({suspended=}false); +end; + +procedure TBrcThread.Execute; +var + p, start, stop: PByteArray; + v: integer; + l, neg: PtrInt; + s: PBrcStation; + {$ifndef CUSTOMHASH} + c: byte; + name: array[0..63] of byte; // efficient map of a temp shortstring on FPC + {$endif CUSTOMHASH} +begin + while fOwner.GetChunk(start, stop) do + begin + // parse this thread chunk + p := start; + repeat + // parse the name; + l := 2; + {$ifdef CUSTOMHASH} + start := p; + while p[l] <> ord(';') do + inc(l); // small local loop is faster than SSE2 ByteScanIndex() + {$else} + repeat + c := p[l]; + if c = ord(';') then + break; + inc(l); + name[l] := c; // fill name[] as a shortstring + until false; + name[0] := l; + {$endif CUSTOMHASH} + p := @p[l + 1]; // + 1 to ignore ; + // parse the temperature (as -12.3 -3.4 5.6 78.9 patterns) into value * 10 + if p[0] = ord('-') then + begin + neg := -1; + p := @p[1]; + end + else + neg := 1; + if p[2] = ord('.') then // xx.x + begin + // note: the PCardinal(p)^ 
+ "shr and $ff" trick is actually slower + v := (p[0] * 100 + p[1] * 10 + p[3] - (ord('0') * 111)) * neg; + p := @p[6]; // also jump ending $13/$10 + end + else + begin + v := (p[0] * 10 + p[2] - (ord('0') * 11)) * neg; // x.x + p := @p[5]; + end; + // store the value + {$ifdef CUSTOMHASH} + s := fList.Search(start, l); + {$else} + s := fList.Search(@name); + {$endif CUSTOMHASH} + inc(s^.Count); + if v < s^.Min then + s^.Min := v; + if v > s^.Max then + s^.Max := v; + inc(s^.Sum, v); + until p >= stop; + end; + // aggregate this thread values into the main list + fOwner.Aggregate(fList); +end; + + +{ TBrcMain } + +constructor TBrcMain.Create(const fn: TFileName; threads, max: integer; + affinity: boolean); +var + i, cores, core: integer; + one: TBrcThread; +begin + fEvent := TSynEvent.Create; + if not fMem.Map(fn) then + raise ESynException.CreateUtf8('Impossible to find %', [fn]); + fList.Init(max); + fCurrentChunk := pointer(fMem.Buffer); + fCurrentRemain := fMem.Size; + core := 0; + cores := SystemInfo.dwNumberOfProcessors; + for i := 0 to threads - 1 do + begin + one := TBrcThread.Create(self); + if not affinity then + continue; + SetThreadCpuAffinity(one, core); + inc(core, 2); + if core >= cores then + dec(core, cores - 1); // e.g. 0,2,1,3,0,2.. 
with 4 cpus + end; +end; + +destructor TBrcMain.Destroy; +begin + inherited Destroy; + fMem.UnMap; + fEvent.Free; +end; + +const + CHUNKSIZE = 64 shl 20; // fed each TBrcThread with 64MB chunks + // it is faster than naive parallel process of size / threads input because + // OS thread scheduling is never fair so some threads will finish sooner + +function TBrcMain.GetChunk(out start, stop: PByteArray): boolean; +var + chunk: PtrUInt; +begin + result := false; + fSafe.Lock; + chunk := fCurrentRemain; + if chunk <> 0 then + begin + start := fCurrentChunk; + if chunk > CHUNKSIZE then + begin + stop := pointer(GotoNextLine(pointer(@start[CHUNKSIZE]))); + chunk := PAnsiChar(stop) - PAnsiChar(start); + end + else + begin + stop := @start[chunk]; + while PAnsiChar(stop)[-1] <= ' ' do + dec(PByte(stop)); // ensure final stop at meaningful char + end; + dec(fCurrentRemain, chunk); + fCurrentChunk := @fCurrentChunk[chunk]; + result := true; + end; + fSafe.UnLock; +end; + +procedure TBrcMain.Aggregate(const another: TBrcList); +var + s, d: PBrcStation; + n: integer; +begin + fSafe.Lock; // several TBrcThread may finish at the same time + {$ifdef CUSTOMHASH} + if fList.Count = 0 then + fList := another // we can reuse the existing hash table + else + {$endif CUSTOMHASH} + begin + n := another.Count; + s := pointer(another.Station); + repeat + {$ifdef CUSTOMHASH} + d := fList.Search(@s^.NameText, s^.NameLen); + {$else} + d := fList.Search(@s^.NameLen); + {$endif CUSTOMHASH} + inc(d^.Count, s^.Count); + inc(d^.Sum, s^.Sum); + if s^.Max > d^.Max then + d^.Max := s^.Max; + if s^.Min < d^.Min then + d^.Min := s^.Min; + inc(s); + dec(n); + until n = 0; + end; + fSafe.UnLock; + if InterlockedDecrement(fRunning) = 0 then + fEvent.SetEvent; // all threads finished: release main console thread +end; + +procedure TBrcMain.WaitFor; +begin + fEvent.WaitForEver; +end; + +procedure AddTemp(w: TTextWriter; sep: AnsiChar; val: PtrInt); +var + d10: PtrInt; +begin + w.Add(sep); + if val < 0 
then + begin + w.Add('-'); + val := -val; + end; + d10 := val div 10; // val as temperature * 10 + w.AddString(SmallUInt32Utf8[d10]); // in 0..999 range + w.Add('.'); + w.Add(AnsiChar(val - d10 * 10 + ord('0'))); +end; + +function ByStationName(const A, B): integer; +var + sa: TBrcStation absolute A; + sb: TBrcStation absolute B; + la, lb: PtrInt; +begin + la := sa.NameLen; + lb := sb.NameLen; + if la < lb then + la := lb; + result := MemCmp(@sa.NameText, @sb.NameText, la); + if result = 0 then + result := sa.NameLen - sb.NameLen; +end; + +function Average(sum, count: PtrInt): integer; +// sum and result are temperature * 10 (one fixed decimal) +var + x, t: PtrInt; // temperature * 100 (two fixed decimals) +begin + x := (sum * 10) div count; // average + // this weird algo follows the "official" PascalRound() implementation + t := (x div 10) * 10; // truncate + if abs(x - t) >= 5 then + if x < 0 then + dec(t, 10) + else + inc(t, 10); + result := t div 10; // truncate back to one decimal (temperature * 10) + //ConsoleWrite([sum / (count * 10), ' ', result / 10]); +end; + +function TBrcMain.SortedText: RawUtf8; +var + n: integer; + s: PBrcStation; + st: TRawByteStringStream; + w: TTextWriter; + tmp: TTextWriterStackBuffer; +begin + {$ifdef CUSTOMHASH} + DynArrayFakeLength(pointer(fList.Station), fList.Count); + DynArray(TypeInfo(TBrcStations), fList.Station).Sort(ByStationName); + {$else} + fList.Stations.Sort(ByStationName); + {$endif CUSTOMHASH} + FastSetString(result, nil, 1200000); // pre-allocate result + st := TRawByteStringStream.Create(result); + try + w := TTextWriter.Create(st, @tmp, SizeOf(tmp)); + try + w.Add('{'); + s := pointer(fList.Station); + n := fList.Count; + if n > 0 then + repeat + assert(s^.Count <> 0); + w.AddNoJsonEscape(@s^.NameText, s^.NameLen); + AddTemp(w, '=', s^.Min); + AddTemp(w, '/', Average(s^.Sum, s^.Count)); + AddTemp(w, '/', s^.Max); + dec(n); + if n = 0 then + break; + w.Add(',', ' '); + inc(s); + until false; + w.Add('}'); + 
w.FlushFinal; + FakeLength(result, w.WrittenBytes); + finally + w.Free; + end; + finally + st.Free; + end; +end; + +var + fn: TFileName; + threads: integer; + verbose, affinity: boolean; + main: TBrcMain; + res: RawUtf8; + start, stop: Int64; +begin + assert(SizeOf(TBrcStation) = 64); // 64 bytes = CPU L1 cache line size + // read command line parameters + Executable.Command.ExeDescription := 'The mORMot One Billion Row Challenge'; + fn := Executable.Command.ArgString(0, 'the data source #filename'); + verbose := Executable.Command.Option( + ['v', 'verbose'], 'generate verbose output with timing'); + affinity := Executable.Command.Option( + ['a', 'affinity'], 'force thread affinity to a single CPU core'); + Executable.Command.Get( + ['t', 'threads'], threads, '#number of threads to run', 16); + if Executable.Command.ConsoleWriteUnknown then + exit + else if Executable.Command.Option(['h', 'help'], 'display this help') or + (fn = '') then + begin + ConsoleWrite(Executable.Command.FullDescription); + exit; + end; + // actual process + if verbose then + ConsoleWrite(['Processing ', fn, ' with ', threads, ' threads', + ' and affinity=', BOOL_STR[affinity]]); + QueryPerformanceMicroSeconds(start); + try + main := TBrcMain.Create(fn, threads, {max=}45000, affinity); + // note: current stations count = 41343 for 2.5MB of data per thread + try + main.WaitFor; + res := main.SortedText; + if verbose then + ConsoleWrite(['result hash=', CardinalToHexShort(crc32cHash(res)), + ', result length=', length(res), + ', stations count=', main.fList.Count, + ', valid utf8=', IsValidUtf8(res)]) + else + ConsoleWrite(res); + finally + main.Free; + end; + except + on E: Exception do + ConsoleShowFatalException(E); + end; + // optional timing output + if verbose then + begin + QueryPerformanceMicroSeconds(stop); + dec(stop, start); + ConsoleWrite(['done in ', MicroSecToString(stop), ' ', + KB((FileSize(fn) * 1000000) div stop), '/s']); + end; +end. 
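The `Average()` rounding in the program above (compute the mean in hundredths, truncate, then round half away from zero, matching `PascalRound`) can be sketched in Python. Note the helper `tdiv` mimics Pascal's `div`, which truncates toward zero, unlike Python's flooring `//` operator; both names are illustrative:

```python
def tdiv(a: int, b: int) -> int:
    """Truncating integer division, like Pascal's div operator."""
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q

def average_tenths(sum_tenths: int, count: int) -> int:
    """Mean of count temperatures stored as tenths, rounded half away from zero."""
    x = tdiv(sum_tenths * 10, count)   # average in hundredths of a degree
    t = tdiv(x, 10) * 10               # truncate to a whole number of tenths
    if abs(x - t) >= 5:                # a trailing .5 rounds away from zero
        t = t - 10 if x < 0 else t + 10
    return tdiv(t, 10)

print(average_tenths(25, 10), average_tenths(-25, 10))  # 3 -3
```

So a mean of 0.25 degrees reports as 0.3, and -0.25 as -0.3, which is the "official" PascalRound behavior the comment in `Average()` refers to.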
+ From 8fe86234f2937c2a4a9c61c0412e3816d00503a4 Mon Sep 17 00:00:00 2001 From: Arnaud Bouchez <ab@synopse.info> Date: Thu, 21 Mar 2024 19:20:00 +0100 Subject: [PATCH 2/5] minor fixes --- entries/abcz/README.md | 22 ++++++++++------------ entries/abcz/src/brcmormot.lpr | 5 +++-- 2 files changed, 13 insertions(+), 14 deletions(-) diff --git a/entries/abcz/README.md b/entries/abcz/README.md index 0d96799..04b00ad 100644 --- a/entries/abcz/README.md +++ b/entries/abcz/README.md @@ -19,17 +19,17 @@ I am very happy to share decades of server-side performance coding techniques us Here are the main ideas behind this implementation proposal: - **mORMot** makes cross-platform and cross-compiler support simple (e.g. `TMemMap`, `TDynArray.Sort`,`TTextWriter`, `SetThreadCpuAffinity`, `crc32c`, `ConsoleWrite` or command-line parsing); -- Memory map the entire 16GB file at once (so won't work on 32-bit OS, but reduce syscalls); +- Will memmap the entire 16GB file at once into memory (so won't work on 32-bit OS, but reduce syscalls); - Process file in parallel using several threads (configurable, with `-t=16` by default); -- Each thread is fed from 64MB chunks of input (because thread scheduling is unfair, it is inefficient to pre-divide the size of the whole input file into the number of threads); +- Fed each thread from 64MB chunks of input (because thread scheduling is unfair, it is inefficient to pre-divide the size of the whole input file into the number of threads); - Each thread manages its own data, so there is no lock until the thread is finished and data is consolidated; -- Each station information (name and values) is packed into a record of exactly 64 bytes, with no external pointer/string, so match the CPU L1 cache size for efficiency; +- Each station information (name and values) is packed into a record of exactly 64 bytes, with no external pointer/string, to match the CPU L1 cache size for efficiency; - Use a dedicated hash table for the name lookup, with direct 
crc32c SSE4.2 hash - when `TDynArrayHashed` is involved, it requires a transient name copy on the stack, which is noticeably slower (see last paragraph of this document); -- Store values as 16-bit or 32-bit integers (temperature multiplied by 10); +- Store values as 16-bit or 32-bit integers (i.e. temperature multiplied by 10); - Parse temperatures with a dedicated code (expects single decimal input values); - No memory allocation (e.g. no transient `string` or `TBytes`) nor any syscall is done during the parsing process to reduce contention and ensure the process is only CPU-bound and RAM-bound (we checked this with `strace` on Linux); -- Pascal code was tuned to generate the best possible asm output on FPC x86_64 (which is our target) with no SIMD involved; -- Some dedicated x86_64 asm has been written to replace mORMot `crc32c` and `MemCmp` general-purpose functions and gain a last few percents; +- Pascal code was tuned to generate the best possible asm output on FPC x86_64 (which is our target); +- Some dedicated x86_64 asm has been written to replace mORMot `crc32c` and `MemCmp` general-purpose functions and gain a last few percents (nice to have); - Can optionally output timing statistics and hash value on the console to debug and refine settings (with the `-v` command line switch); - Can optionally set each thread affinity to a single core (with the `-a` command line switch). @@ -60,11 +60,9 @@ We will use these command-line switches for local (dev PC), and benchmark (chall ## Local Analysis -On my PC, it takes less than 5 seconds to process the 16GB file with 8 threads. +On my PC, it takes less than 5 seconds to process the 16GB file with 8/10 threads. -If we use the `time` command on Linux, we can see that there is little time spend in kernel (sys) land. 
- -If we compare our `mormot` with a solid multi-threaded entry using file buffer reads and no memory map (like `sbalazs`): +Let's compare our `mormot` with a solid multi-threaded entry using file buffer reads and no memory map (like `sbalazs`), using the `time` command on Linux: ``` ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./mormot measurements.txt -t=10 >resmrel5.txt @@ -79,7 +77,7 @@ real 0m25,330s user 6m44,853s sys 0m31,167s ``` -We used 20 threads for `sbalazs`, and 10 threads for `mormot` because it was giving the best results on each entry on this particular PC. +We used 20 threads for `sbalazs`, and 10 threads for `mormot` because it was giving the best results for each program on our PC. Apart from the obvious global "wall" time reduction (`real` numbers), the raw parsing and data gathering in the threads match the number of threads and the running time (`user` numbers), and no syscall is involved by `mormot` thanks to the memory mapping of the whole file (`sys` numbers, which contain only memory page faults). @@ -125,7 +123,7 @@ On the https://github.com/gcarreno/1brc-ObjectPascal challenge hardware, which i ./mormot measurements.txt -v -t=24 -a ./mormot measurements.txt -v -t=32 -a ``` -Please run those command lines, to guess which parameters are to be run for the benchmark to give the best results on the actual benchmark PC with its Ryzen 9 CPU. We will see if core affinity makes a difference here. +Please run those command lines, to guess which parameters are to be run for the benchmark, and would give the best results on the actual benchmark PC with its Ryzen 9 CPU. We will see if core affinity makes a difference here. 
## Feedback Needed diff --git a/entries/abcz/src/brcmormot.lpr b/entries/abcz/src/brcmormot.lpr index 1fe151c..c06c727 100644 --- a/entries/abcz/src/brcmormot.lpr +++ b/entries/abcz/src/brcmormot.lpr @@ -518,7 +518,7 @@ function TBrcMain.SortedText: RawUtf8; var fn: TFileName; threads: integer; - verbose, affinity: boolean; + verbose, affinity, help: boolean; main: TBrcMain; res: RawUtf8; start, stop: Int64; @@ -533,9 +533,10 @@ function TBrcMain.SortedText: RawUtf8; ['a', 'affinity'], 'force thread affinity to a single CPU core'); Executable.Command.Get( ['t', 'threads'], threads, '#number of threads to run', 16); + help := Executable.Command.Option(['h', 'help'], 'display this help'); if Executable.Command.ConsoleWriteUnknown then exit - else if Executable.Command.Option(['h', 'help'], 'display this help') or + else if help or (fn = '') then begin ConsoleWrite(Executable.Command.FullDescription); From 9b5aeb71a0a0fafb7eac7250b439cff29f0b8095 Mon Sep 17 00:00:00 2001 From: Arnaud Bouchez <ab@synopse.info> Date: Fri, 22 Mar 2024 08:56:57 +0100 Subject: [PATCH 3/5] fixed mORMot / abouchez proposal as requested for proper integration --- entries/{abcz => abouchez}/README.md | 40 ++++++++++---------- entries/{abcz => abouchez}/src/brcmormot.lpi | 6 +-- entries/{abcz => abouchez}/src/brcmormot.lpr | 16 +++++--- 3 files changed, 33 insertions(+), 29 deletions(-) rename entries/{abcz => abouchez}/README.md (79%) rename entries/{abcz => abouchez}/src/brcmormot.lpi (95%) rename entries/{abcz => abouchez}/src/brcmormot.lpr (98%) diff --git a/entries/abcz/README.md b/entries/abouchez/README.md similarity index 79% rename from entries/abcz/README.md rename to entries/abouchez/README.md index 04b00ad..1ec8860 100644 --- a/entries/abcz/README.md +++ b/entries/abouchez/README.md @@ -1,4 +1,4 @@ -# mORMot version of The One Billion Row Challenge +# mORMot version of The One Billion Row Challenge by Arnaud Bouchez ## mORMot 2 is Required @@ -37,13 +37,13 @@ The "64 bytes cache 
line" trick is quite unique among all implementations of the ## Usage -If you execute the `mormot` executable without any parameter, it will give you some hints about its usage (using mORMot `TCommandLine` abilities): +If you execute the `abouchez` executable without any parameter, it will give you some hints about its usage (using mORMot `TCommandLine` abilities): ``` -ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./mormot +ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./abouchez The mORMot One Billion Row Challenge -Usage: mormot <filename> [options] [params] +Usage: abouchez <filename> [options] [params] <filename> the data source filename @@ -62,10 +62,10 @@ We will use these command-line switches for local (dev PC), and benchmark (chall On my PC, it takes less than 5 seconds to process the 16GB file with 8/10 threads. -Let's compare our `mormot` with a solid multi-threaded entry using file buffer reads and no memory map (like `sbalazs`), using the `time` command on Linux: +Let's compare `abouchez` with a solid multi-threaded entry using file buffer reads and no memory map (like `sbalazs`), using the `time` command on Linux: ``` -ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./mormot measurements.txt -t=10 >resmrel5.txt +ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./abouchez measurements.txt -t=10 >resmrel5.txt real 0m4,216s user 0m38,789s @@ -77,13 +77,13 @@ real 0m25,330s user 6m44,853s sys 0m31,167s ``` -We used 20 threads for `sbalazs`, and 10 threads for `mormot` because it was giving the best results for each program on our PC. +We used 20 threads for `sbalazs`, and 10 threads for `abouchez` because it was giving the best results for each program on our PC. 
-Apart from the obvious global "wall" time reduction (`real` numbers), the raw parsing and data gathering in the threads match the number of threads and the running time (`user` numbers), and no syscall is involved by `mormot` thanks to the memory mapping of the whole file (`sys` numbers, which contain only memory page faults). +Apart from the obvious global "wall" time reduction (`real` numbers), the raw parsing and data gathering in the threads match the number of threads and the running time (`user` numbers), and no syscall is involved by `abouchez` thanks to the memory mapping of the whole file (`sys` numbers, which contain only memory page faults). -The `memmap` feature makes the initial `mormot` call slower, because it needs to cache all measurements data from file into RAM (I have 32GB of RAM, so the whole data file will remain in memory, as on the benchmark hardware): +The `memmap()` feature makes the initial/cold `abouchez` call slower, because it needs to cache all measurements data from file into RAM (I have 32GB of RAM, so the whole data file will remain in memory, as on the benchmark hardware): ``` -ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./mormot measurements.txt -t=10 >resmrel4.txt +ab@dev:~/dev/github/1brc-ObjectPascal/bin$ time ./abouchez measurements.txt -t=10 >resmrel4.txt real 0m6,042s user 0m53,699s @@ -93,11 +93,11 @@ This is the expected behavior, and will be fine with the benchmark challenge, wh On my Intel 13th gen processor with E-cores and P-cores, forcing thread to core affinity does not help: ``` -ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./mormot measurements.txt -t=10 -v +ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./abouchez measurements.txt -t=10 -v Processing measurements.txt with 10 threads and affinity=false result hash=8A6B746A, result length=1139418, stations count=41343, valid utf8=1 done in 4.25s 3.6 GB/s -ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./mormot measurements.txt -t=10 -v -a
+ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./abouchez measurements.txt -t=10 -v -a Processing measurements.txt with 10 threads and affinity=true result hash=8A6B746A, result length=1139418, stations count=41343, valid utf8=1 done in 4.42s 3.5 GB/s @@ -115,13 +115,13 @@ So we first need to find out which options leverage at best the hardware it runs On the https://github.com/gcarreno/1brc-ObjectPascal challenge hardware, which is a Ryzen 9 5950x with 16 cores / 32 threads and 64MB of L3 cache, each thread using around 2.5MB of its own data, we should try several options with 16-24-32 threads, for instance: ``` -./mormot measurements.txt -v -t=8 -./mormot measurements.txt -v -t=16 -./mormot measurements.txt -v -t=24 -./mormot measurements.txt -v -t=32 -./mormot measurements.txt -v -t=16 -a -./mormot measurements.txt -v -t=24 -a -./mormot measurements.txt -v -t=32 -a +./abouchez measurements.txt -v -t=8 +./abouchez measurements.txt -v -t=16 +./abouchez measurements.txt -v -t=24 +./abouchez measurements.txt -v -t=32 +./abouchez measurements.txt -v -t=16 -a +./abouchez measurements.txt -v -t=24 -a +./abouchez measurements.txt -v -t=32 -a ``` Please run those command lines, to guess which parameters are to be run for the benchmark, and would give the best results on the actual benchmark PC with its Ryzen 9 CPU. We will see if core affinity makes a difference here. @@ -133,6 +133,6 @@ Stay tuned! ## Ending Note -There is a "pure mORMot" name lookup version available if you undefine the `CUSTOMHASH` conditional, which is around 40% slower, because it needs to copy the name into the stack before using `TDynArrayHashed`, and has a little bit overhead. +There is a "pure mORMot" name lookup version available if you undefine the `CUSTOMHASH` conditional, which is around 40% slower, because it needs to copy the name into the stack before using `TDynArrayHashed`, and has a little more overhead. 
Arnaud :D \ No newline at end of file diff --git a/entries/abcz/src/brcmormot.lpi b/entries/abouchez/src/brcmormot.lpi similarity index 95% rename from entries/abcz/src/brcmormot.lpi rename to entries/abouchez/src/brcmormot.lpi index b1e391d..edef79c 100644 --- a/entries/abcz/src/brcmormot.lpi +++ b/entries/abouchez/src/brcmormot.lpi @@ -19,7 +19,7 @@ <CompilerOptions> <Version Value="11"/> <Target> - <Filename Value="../../../bin/mormot"/> + <Filename Value="../../../bin/abouchez"/> </Target> <SearchPaths> <IncludeFiles Value="$(ProjOutDir)"/> @@ -53,7 +53,7 @@ <CompilerOptions> <Version Value="11"/> <Target> - <Filename Value="../../../bin/mormot"/> + <Filename Value="../../../bin/abouchez"/> </Target> <SearchPaths> <IncludeFiles Value="$(ProjOutDir)"/> @@ -97,7 +97,7 @@ <CompilerOptions> <Version Value="11"/> <Target> - <Filename Value="../../../bin/mormot"/> + <Filename Value="../../../bin/abouchez"/> </Target> <SearchPaths> <IncludeFiles Value="$(ProjOutDir)"/> diff --git a/entries/abcz/src/brcmormot.lpr b/entries/abouchez/src/brcmormot.lpr similarity index 98% rename from entries/abcz/src/brcmormot.lpr rename to entries/abouchez/src/brcmormot.lpr index c06c727..1f507e6 100644 --- a/entries/abcz/src/brcmormot.lpr +++ b/entries/abouchez/src/brcmormot.lpr @@ -245,7 +245,7 @@ constructor TBrcThread.Create(owner: TBrcMain); procedure TBrcThread.Execute; var p, start, stop: PByteArray; - v: integer; + v, m: integer; l, neg: PtrInt; s: PBrcStation; {$ifndef CUSTOMHASH} @@ -300,12 +300,16 @@ procedure TBrcThread.Execute; {$else} s := fList.Search(@name); {$endif CUSTOMHASH} - inc(s^.Count); - if v < s^.Min then - s^.Min := v; - if v > s^.Max then - s^.Max := v; inc(s^.Sum, v); + inc(s^.Count); + m := s^.Min; + if v < m then + m := v; // branchless cmovl + s^.Min := m; + m := s^.Max; + if v > m then + m := v; + s^.Max := m; until p >= stop; end; // aggregate this thread values into the main list From da737f461a7571a5fbcace3554dffdeae0b68b6e Mon Sep 17 00:00:00 2001 
From: Arnaud Bouchez <ab@synopse.info> Date: Fri, 22 Mar 2024 09:41:55 +0100 Subject: [PATCH 4/5] small README precisions --- entries/abouchez/README.md | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/entries/abouchez/README.md b/entries/abouchez/README.md index 1ec8860..dd469e5 100644 --- a/entries/abouchez/README.md +++ b/entries/abouchez/README.md @@ -115,16 +115,22 @@ So we first need to find out which options leverage at best the hardware it runs On the https://github.com/gcarreno/1brc-ObjectPascal challenge hardware, which is a Ryzen 9 5950x with 16 cores / 32 threads and 64MB of L3 cache, each thread using around 2.5MB of its own data, we should try several options with 16-24-32 threads, for instance: ``` -./abouchez measurements.txt -v -t=8 -./abouchez measurements.txt -v -t=16 -./abouchez measurements.txt -v -t=24 -./abouchez measurements.txt -v -t=32 -./abouchez measurements.txt -v -t=16 -a -./abouchez measurements.txt -v -t=24 -a -./abouchez measurements.txt -v -t=32 -a +time ./abouchez measurements.txt -v -t=8 +time ./abouchez measurements.txt -v -t=16 +time ./abouchez measurements.txt -v -t=24 +time ./abouchez measurements.txt -v -t=32 +time ./abouchez measurements.txt -v -t=16 -a +time ./abouchez measurements.txt -v -t=24 -a +time ./abouchez measurements.txt -v -t=32 -a ``` Please run those command lines, to guess which parameters are to be run for the benchmark, and would give the best results on the actual benchmark PC with its Ryzen 9 CPU. We will see if core affinity makes a difference here. +Then we could run: +``` +time ./abouchez measurements.txt -v -t=1 +``` +This `-t=1` run is for fun: it will run the process in a single thread. It will help to gauge how optimized (and lock-free) our parsing code is, and to validate the CPU multi-core abilities.
In a perfect world, the other `-t=##` runs should show the `real` time divided evenly by the number of working threads, and the `user` value reported by `time` should remain almost the same when we add threads, up to the number of CPU cores. + ## Feedback Needed Here we will put some additional information, once our proposal has been run on the benchmark hardware. From e6d36197eea2a8c118d7e3562464e1ec153681af Mon Sep 17 00:00:00 2001 From: Arnaud Bouchez <ab@synopse.info> Date: Fri, 22 Mar 2024 09:48:51 +0100 Subject: [PATCH 5/5] now I hope we are OK with the README format --- entries/abouchez/README.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/entries/abouchez/README.md b/entries/abouchez/README.md index dd469e5..2dbd5af 100644 --- a/entries/abouchez/README.md +++ b/entries/abouchez/README.md @@ -1,4 +1,6 @@ -# mORMot version of The One Billion Row Challenge by Arnaud Bouchez +# Arnaud Bouchez + +**mORMot entry to The One Billion Row Challenge in Object Pascal.** ## mORMot 2 is Required @@ -6,7 +8,7 @@ This entry requires the **mORMot 2** package to compile. Download it from https://github.com/synopse/mORMot2 -It is better to fork the current state of the mORMot 2 repository, or get the latest release. +It is better to fork the current state of the *mORMot 2* repository, or get the latest release. ## Licence Terms @@ -29,7 +31,7 @@ Here are the main ideas behind this implementation proposal: - Parse temperatures with a dedicated code (expects single decimal input values); - No memory allocation (e.g.
no transient `string` or `TBytes`) nor any syscall is done during the parsing process to reduce contention and ensure the process is only CPU-bound and RAM-bound (we checked this with `strace` on Linux); - Pascal code was tuned to generate the best possible asm output on FPC x86_64 (which is our target); -- Some dedicated x86_64 asm has been written to replace mORMot `crc32c` and `MemCmp` general-purpose functions and gain a last few percents (nice to have); +- Some dedicated x86_64 asm has been written to replace *mORMot* `crc32c` and `MemCmp` general-purpose functions and gain a last few percents (nice to have); - Can optionally output timing statistics and hash value on the console to debug and refine settings (with the `-v` command line switch); - Can optionally set each thread affinity to a single core (with the `-a` command line switch). @@ -37,7 +39,7 @@ The "64 bytes cache line" trick is quite unique among all implementations of the ## Usage -If you execute the `abouchez` executable without any parameter, it will give you some hints about its usage (using mORMot `TCommandLine` abilities): +If you execute the `abouchez` executable without any parameter, it will give you some hints about its usage (using *mORMot* `TCommandLine` abilities): ``` ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./abouchez @@ -139,6 +141,6 @@ Stay tuned! ## Ending Note -There is a "pure mORMot" name lookup version available if you undefine the `CUSTOMHASH` conditional, which is around 40% slower, because it needs to copy the name into the stack before using `TDynArrayHashed`, and has a little more overhead. +There is a "*pure mORMot*" name lookup version available if you undefine the `CUSTOMHASH` conditional, which is around 40% slower, because it needs to copy the name into the stack before using `TDynArrayHashed`, and has a little more overhead. Arnaud :D \ No newline at end of file
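Editor's note on the branchless min/max rewrite in PATCH 3: the Pascal change loads the current bound into a local, conditionally overwrites it, and stores it back unconditionally, so the compiler can emit a conditional move (`cmovl`/`cmovg`) instead of a forward branch. Here is a C sketch of the same shape, for illustration only — the `station` struct below is a simplified stand-in, not the entry's actual 64-byte record layout:

```c
/* Simplified stand-in for the entry's per-station record (the real one
   packs the name and values into exactly 64 bytes to match an L1 cache
   line; this sketch keeps only the aggregated values). */
typedef struct {
    long long sum;   /* sum of temperatures * 10 */
    int count;
    short min;       /* temperatures stored as value * 10 */
    short max;
} station;

/* Load the bound into a local, conditionally overwrite it, and store it
   back unconditionally: compilers typically turn this into cmovl/cmovg,
   mirroring the Pascal rewrite in PATCH 3. */
static void station_add(station *s, short v)
{
    short m;
    s->sum += v;
    s->count++;
    m = s->min;
    if (v < m)
        m = v;       /* branchless cmovl candidate */
    s->min = m;
    m = s->max;
    if (v > m)
        m = v;       /* branchless cmovg candidate */
    s->max = m;
}
```

The earlier form (`if v < s^.Min then s^.Min := v`) keeps the store inside the branch, which forces a conditional jump; hoisting the store out of the condition is what makes the conditional move possible.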
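The README bullets mention parsing temperatures with dedicated code that expects a single decimal digit and stores the value multiplied by 10 in an integer. A minimal C sketch of that idea — a hedged illustration, not the entry's actual parser, which walks the memory-mapped buffer directly:

```c
/* Parse a "-12.3" style temperature (exactly one decimal digit, as in
   the challenge input) into an integer number of tenths, e.g. -123.
   No floating point and no allocation, as in the entry's approach. */
static int parse_temp10(const char *p)
{
    int neg = 0, v = 0;
    if (*p == '-') {
        neg = 1;
        p++;
    }
    while (*p != '.')                /* integral digits */
        v = v * 10 + (*p++ - '0');
    v = v * 10 + (p[1] - '0');       /* the single digit after '.' */
    return neg ? -v : v;
}
```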
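The README also argues that pre-dividing the whole file by the thread count is inefficient because thread scheduling is unfair, so threads are fed 64MB chunks on demand instead. A single-threaded C sketch of such chunking, with chunk ends snapped forward to the next newline so every worker sees whole lines — the helper name and the plain cursor variable are illustrative assumptions (a real multi-threaded version would advance the shared cursor atomically):

```c
#include <string.h>

#define CHUNK (64 * 1024 * 1024)  /* 64MB feed size, as in the README */

/* Hand out the next chunk of the mapped buffer; returns its length, or 0
   when the input is exhausted. The end is snapped to the next '\n' so no
   line is split between two workers. */
static size_t next_chunk(const char *buf, size_t len, size_t *cursor,
                         const char **start, const char **stop)
{
    size_t begin = *cursor;
    size_t end;
    if (begin >= len)
        return 0;
    end = begin + CHUNK;
    if (end >= len)
        end = len;
    else {
        const char *nl = memchr(buf + end, '\n', len - end);
        end = nl ? (size_t)(nl - buf) + 1 : len;  /* include the '\n' */
    }
    *start = buf + begin;
    *stop = buf + end;
    *cursor = end;
    return end - begin;
}
```

Feeding from a shared cursor keeps fast threads busy instead of leaving them idle once their pre-assigned slice is done, which is the scheduling-fairness point the README makes.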