<?xml version="1.0" encoding="UTF-8"?>
<!--<?oxygen RNGSchema="http://www.oasis-open.org/docbook/xml/5.0/rng/docbook.rng" type="xml"?>-->
<!DOCTYPE article [
<!ENTITY % entity SYSTEM "entity-decl.ent">
%entity;
]>
<article role="sbp" xmlns="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0" xml:id="art-sbp-gcc10-sle15"
xml:lang="en">
<info>
<title>Advanced Optimization and New Capabilities of GCC 10</title>
<productname>Development Tools Module, SUSE Linux Enterprise</productname>
<productnumber>15 SP2</productnumber>
<dm:docmanager xmlns:dm="urn:x-suse:ns:docmanager">
<dm:bugtracker>
<dm:url>https://github.com/SUSE/suse-best-practices/issues/new</dm:url>
<dm:product>Advanced Optimization and New Capabilities of GCC 10</dm:product>
</dm:bugtracker>
<dm:editurl>https://github.com/SUSE/suse-best-practices/edit/main/xml/</dm:editurl>
</dm:docmanager>
<meta name="series">SUSE Best Practices</meta>
<!-- <meta name="type">Best Practices</meta>-->
<meta name="category">
<phrase>Tuning &amp; Performance</phrase>
<phrase>Developer Tools</phrase>
</meta>
<meta name="task">
<phrase>Configuration</phrase>
</meta>
<meta name="title">Advanced Optimization and New Capabilities of GCC 10</meta>
<meta name="description">Overview of GCC 10 and compilation optimization options for
applications</meta>
<meta name="productname">
<productname version="15 SP2">SLES</productname>
</meta>
<meta name="published">2021-03-12</meta>
<meta name="platform">SUSE Linux Enterprise Server 15 SP2</meta>
<meta name="platform">Development Tools Module</meta>
<authorgroup>
<author>
<personname>
<firstname>Martin</firstname>
<surname>Jambor</surname>
</personname>
<affiliation>
<jobtitle>Toolchain Developer</jobtitle>
<orgname>SUSE</orgname>
</affiliation>
</author>
<author>
<personname>
<firstname>Jan</firstname>
<surname>Hubička</surname>
</personname>
<affiliation>
<jobtitle>Toolchain Developer</jobtitle>
<orgname>SUSE</orgname>
</affiliation>
</author>
<author>
<personname>
<firstname>Richard</firstname>
<surname>Biener</surname>
</personname>
<affiliation>
<jobtitle>Toolchain Developer</jobtitle>
<orgname>SUSE</orgname>
</affiliation>
</author>
<author>
<personname>
<firstname>Martin</firstname>
<surname>Liška</surname>
</personname>
<affiliation>
<jobtitle>Toolchain Developer</jobtitle>
<orgname>SUSE</orgname>
</affiliation>
</author>
<author>
<personname>
<firstname>Michael</firstname>
<surname>Matz</surname>
</personname>
<affiliation>
<jobtitle>Toolchain Team Lead</jobtitle>
<orgname>SUSE</orgname>
</affiliation>
</author>
<author>
<personname>
<firstname>Brent</firstname>
<surname>Hollingsworth</surname>
</personname>
<affiliation>
<jobtitle>Engineering Manager</jobtitle>
<orgname>AMD</orgname>
</affiliation>
</author>
<!-- <editor>
<orgname></orgname>
</editor>
<othercredit>
<orgname></orgname>
</othercredit>-->
</authorgroup>
<cover role="logos">
<mediaobject>
<imageobject role="fo">
<imagedata fileref="suse.svg" width="5em" align="center" valign="bottom"/>
</imageobject>
<imageobject role="html">
<imagedata fileref="suse.svg" width="152px" align="center" valign="bottom"/>
</imageobject>
</mediaobject>
</cover>
<date>2021-03-12</date>
<abstract>
<para> The document at hand provides an overview of GCC 10 as the current Development Tools
Module compiler in SUSE Linux Enterprise 15 SP2. It focuses on the important optimization levels
and options <emphasis role="strong">Link Time Optimization (LTO)</emphasis> and <emphasis
role="strong">Profile Guided Optimization (PGO)</emphasis>. Their effects are demonstrated by
compiling the SPEC CPU benchmark suite for AMD EPYC 7002 Series Processors and building Mozilla
Firefox for a generic <literal>x86_64</literal> machine. </para>
<para>
<emphasis role="strong">Disclaimer: </emphasis>
Documents published as part of the SUSE Best Practices series have been contributed voluntarily
by SUSE employees and third parties. They are meant to serve as examples of how particular
actions can be performed. They have been compiled with utmost attention to detail. However,
this does not guarantee complete accuracy. SUSE cannot verify that actions described in these
documents do what is claimed or whether actions described have unintended consequences.
SUSE LLC, its affiliates, the authors, and the translators may not be held liable for possible errors
or the consequences thereof.
</para>
</abstract>
</info>
<sect1 xml:id="sec-gcc10-overview">
<title>Overview</title>
<para> The first release of the GNU Compiler Collection (GCC) with the major version 10, GCC 10.1,
was released in May 2020. GCC 10.2, with fixes for 94 bugs, followed in July of the same
year and subsequently replaced the compiler in the SUSE Linux Enterprise (SLE) Development
Tools Module. GCC 10 comes with many new features, such as implementing parts of the most recent
versions of various language specifications (especially <literal>C2X</literal>,
<literal>C++17</literal>, <literal>C++20</literal>) and their extensions (OpenMP, OpenACC),
supporting new capabilities of a wide range of computer architectures, and numerous generic
optimization improvements. </para>
<para> This document gives an overview of GCC 10. It focuses on how to select appropriate
optimization options for your application and stresses the benefits of advanced modes of
compilation. First, we describe the optimization levels the compiler offers and other important
options developers often use. We explain when and how you can benefit from using <emphasis
role="bold">Link Time Optimization (LTO)</emphasis> and <emphasis role="bold">Profile Guided
Optimization (PGO)</emphasis> builds. We also detail their effects when building a set of
well-known CPU-intensive benchmarks, looking at how these perform on the AMD Zen 2 based
EPYC 7002 Series Processor. Finally, we take a closer look at the effects they have on a big
software project: Mozilla Firefox. </para>
</sect1>
<sect1 xml:id="sec-gcc10-various-worlds-of-compilers">
<title>System compiler versus Developer Tools Module compiler</title>
<para> The major version of the system compiler in SUSE Linux Enterprise 15 remains GCC 7,
regardless of the service pack level. This minimizes the danger of any unintended changes
over the entire lifetime of the product. </para>
<screen>sles15: # gcc --version
gcc (SUSE Linux) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
</screen>
<para> That does not mean that, as a user of SUSE Linux Enterprise 15, you are forced to use a
compiler with features frozen in 2016. You can install an add-on module called <emphasis
role="strong">Development Tools Module</emphasis>. This module is included in the SUSE Linux
Enterprise Server 15 subscription and contains a much newer compiler. </para>
<para> At the time of writing this document, the compiler included in the Development Tools Module
is GCC 10.2. Nevertheless, it is important to stress that, unlike the system compiler, the major
version of the most recent GCC from the module will change shortly after the upstream release of
GCC 11.2 (most likely in summer 2021), GCC 12.2 (summer 2022) and so forth. Note that only the
most recent compiler in the Development Tools Module is supported at any time, except for a
six-month overlap period after an upgrade. Developers on a SUSE Linux Enterprise Server 15
system therefore always have access to two supported GCC versions: the almost unchanging system
compiler and the most recent compiler from the Development Tools Module. </para>
<para> Programs and libraries built with the compiler from the Development Tools Module can run on
computers running SUSE Linux Enterprise Server 15 which do not have the module installed. All
necessary runtime libraries are available from the main repositories of the operating system
itself, and new ones are added through the standard update mechanism. In the document at hand, we
use the term GCC 10 as a synonym for any minor version of the major version 10, and GCC 10.2 to
refer specifically to that version. In practice, they should be interchangeable. </para>
<sect2 xml:id="sec-gcc10-when-module-compiler">
<title>When to use compilers from the Development Tools Module</title>
<para> In many cases you will find that the system compiler perfectly satisfies your needs. After
all, it is the compiler used to build all packages and their updates in the system itself. On
the other hand, there are situations where a newer compiler is necessary, or where you want to
consider using a newer compiler to get some of the benefits of its ongoing development. </para>
<para> If the program or library you are building uses language features which are not supported
by GCC 7, you cannot use the system compiler. However, the compiler from the Development Tools
Module will usually be sufficiently new. The most obvious case is <literal>C++</literal>. GCC 10
has a mature implementation of <literal>C++17</literal> features, whereas the one in GCC 7 is
only experimental and incomplete. The <literal>GNU C++ Library</literal>, which accompanies GCC
10, is also almost <literal>C++17</literal> feature-complete. Only <emphasis role="italic"
>hardware interference sizes</emphasis>
<footnote>
<para> Proposal P0154R1</para>
</footnote> are not implemented and <emphasis role="italic">elementary string
conversions</emphasis>
<footnote>
<para> Proposal P0067R5</para>
</footnote> have extra limitations. Most <literal>C++20</literal> features are
implemented in GCC 10 as experimental features. Try them out with appropriate caution. Most
notably, <emphasis role="italic">Modules</emphasis>
<footnote>
<para> Proposals P1103R3, P1766R1, P1811R0, P1703R1, P1874R1, P1979R0, P1779R3, P1857R3,
P2115R0 and P1815R2</para>
</footnote> and <emphasis role="italic">Atomic Compare-and-Exchange with Padding Bits</emphasis>
<footnote>
<para> Proposal P0528R3</para>
</footnote> are not supported yet, while <emphasis role="italic">Coroutines</emphasis>
<footnote>
<para> Proposal P0912R5</para>
</footnote> are implemented but require that the source file is compiled with the
<literal>-fcoroutines</literal> switch. If you are interested in the implementation status of
any particular <literal>C++</literal> feature in the compiler, consult the following pages: </para>
<itemizedlist>
<listitem>
<para>
<link xlink:href="https://gcc.gnu.org/projects/cxx-status.html"><literal>C++</literal>
Standards Support in GCC</link>, and </para>
</listitem>
<listitem>
<para>
<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-10.2.0/libstdc++/manual/">The GNU
<literal>C++</literal> Library Manual</link>. </para>
</listitem>
</itemizedlist>
<para> Advances in supporting new language specifications are not limited to
<literal>C++</literal>. GCC 10 supports several new features from the ISO 202X
<literal>C</literal> standard draft, and the Fortran compiler has also seen many improvements.
And if you use <literal>OpenMP</literal> or <literal>OpenACC</literal> extensions for parallel
programming, you will realize that the compiler supports a lot of features of new versions of
these standards. For more details, visit the links at the end of this section. </para>
<para> In addition to new supported language constructs, GCC 10 offers improved diagnostics when
it reports errors and warnings to the user so that they are easier to understand and to be acted
upon. This is particularly useful when dealing with issues in templated <literal>C++
code</literal>. Furthermore, there are several new warnings which help to avoid common
programming mistakes. </para>
<para> Because GCC 10 is newer, it can generate code for many recent processors not supported by
GCC 7. The list of such processors is too long to include here. Nevertheless, in <xref
linkend="sec-gcc10-spec"/> we specifically look at optimizing code for an AMD EPYC 7002 Series
Processor which is based on AMD Zen 2 cores. At this point we should stress that the <emphasis
role="italic">system compiler</emphasis> does not know this kind of core and therefore cannot
optimize for it. GCC 10, on the other hand, is the second major release supporting AMD Zen 2
cores, and thus can often produce significantly faster code for it. </para>
<para> Finally, the general optimization pipeline of the compiler has also significantly improved
over the years, which we will demonstrate in the last sections of this document. To find out
more about improvements in versions of GCC 8, 9 and 10, visit their respective
<quote>changes</quote> pages: </para>
<itemizedlist>
<listitem>
<para>
<link xlink:href="https://gcc.gnu.org/gcc-8/changes.html">GCC 8 Release Series Changes, New
Features, and Fixes</link>, </para>
</listitem>
<listitem>
<para>
<link xlink:href="https://gcc.gnu.org/gcc-9/changes.html">GCC 9 Release Series Changes, New
Features, and Fixes</link>, and </para>
</listitem>
<listitem>
<para>
<link xlink:href="https://gcc.gnu.org/gcc-10/changes.html">GCC 10 Release Series Changes, New
Features, and Fixes</link>. </para>
</listitem>
</itemizedlist>
</sect2>
<sect2 xml:id="sec-gcc10-issues-with-module-compiler">
<title>Potential issues with the Development Tools Module</title>
<para> GCC 10 from the Development Tools Module can sometimes behave differently from the system
compiler in ways that cause issues which were not present before. Such problems encountered by
other users are listed in the following documents: </para>
<itemizedlist>
<listitem>
<para>
<link xlink:href="https://gcc.gnu.org/gcc-8/porting_to.html">Porting to GCC 8</link>, </para>
</listitem>
<listitem>
<para>
<link xlink:href="https://gcc.gnu.org/gcc-9/porting_to.html">Porting to GCC 9</link>, and
</para>
</listitem>
<listitem>
<para>
<link xlink:href="https://gcc.gnu.org/gcc-10/porting_to.html">Porting to GCC 10</link>.
</para>
</listitem>
</itemizedlist>
<para> We encourage you to read through these three short pages to get an understanding of the
problems. The document at hand briefly mentions two such potential pitfalls.</para>
<para>The first one is that, for performance reasons, GCC 10 defaults to
<literal>-fno-common</literal> which means that a linker error will now be reported if the same
variable is defined in two <literal>C</literal> compilation units. This can happen if two or
more <literal>.c</literal> files include the same header file which intends to declare a
variable but omits the <literal>extern</literal> keyword when doing so, inadvertently resulting
in multiple definitions. If you encounter such an error, you simply need to add the
<literal>extern</literal> keyword to the declaration in the header file and define the variable
in only a single compilation unit. Alternatively, you can compile your project with an explicit
<literal>-fcommon</literal> if you are willing to accept that this behavior is inconsistent
with <literal>C++</literal> and may incur speed and code size penalties. </para>
<para> The second issue highlighted here is that the <literal>C++</literal> compiler in GCC 8 and
later assumes that no execution path in a non-void function simply reaches the end of the
function without a return statement. Such code paths are assumed to be unreachable and are
eliminated. You should therefore pay special attention to warnings produced by
<literal>-Wreturn-type</literal>. This option is enabled by default and indicates which
functions might be affected. </para>
</sect2>
<sect2 xml:id="sec-gcc10-installing-module-compiler">
<title>Installing GCC 10 from the Development Tools Module</title>
<para> Similar to other modules and extensions for SUSE Linux Enterprise Server 15, you can
activate the Development Tools Module either using the command line tool
<command>SUSEConnect</command> or using the <command>YaST</command> setup and configuration
tool. To use the former, carry out the following steps: </para>
<procedure>
<step>
<para> As root, start by listing the available and activated modules and extensions: </para>
<screen>sles15: # SUSEConnect --list-extensions</screen>
</step>
<step>
<para> In the computer output, look for <quote>Development Tools Module</quote>: </para>
<screen>
Development Tools Module 15 SP2 x86_64
Activate with: SUSEConnect -p sle-module-development-tools/15.2/x86_64
</screen>
<para> If you see the text <literal>(Activated)</literal> next to the module name, the module
is already active. You can safely proceed to the installation of the compiler
packages. </para>
</step>
<step>
<para> Otherwise, issue the activation command that is shown in the computer output above: </para>
<screen>sles15: # SUSEConnect -p sle-module-development-tools/15.2/x86_64
Registering system to SUSE Customer Center
Updating system details on https://scc.suse.com ...
Activating sle-module-development-tools 15.2 x86_64 ...
-> Adding service to system ...
-> Installing release package ...
Successfully registered system
</screen>
</step>
</procedure>
<para> If you prefer to use <command>YaST</command>, the procedure is also straightforward. Run
YaST as root and go to the <emphasis role="strong">Add-On Products</emphasis> menu in the
<command>Software</command> section. If <quote>Development Tools Module</quote> is among the
listed installed modules, you already have the module activated and can proceed with installing
individual compiler packages. If not, click the <emphasis role="strong">Add</emphasis> button,
select <emphasis role="strong">Select Extensions and Modules from Registration
Server</emphasis>, and <command>YaST</command> will guide you through a simple procedure to add
the module. </para>
<!-- Too detailed YaST procedure removed, probably not necessary
<para>To use YaST to install the Development Tools Module on a SUSE Linux Enterprise
Server 15 system, carry out the following steps:</para>
<procedure>
<step>
<para>As root, run YaST and go to the <command>Add-On Products</command> menu in the
Software section.</para>
</step>
<step>
<para>If the list of installed modules already includes <quote>Development Tools
Module</quote>, you already have the module installed and can proceed to installing
individual compiler packages. Otherwise press the <command>Add</command> button.</para>
</step>
<step>
<para>Select Extensions and Modules from Registration Server and press the
<command>Next</command> button.</para>
</step>
<step>
<para>Select the <quote>Development Tools Module</quote>, check the checkbox next to it
and press the <command>Next</command> button.</para>
</step>
<step>
<para>YaST will present you with the list of changes to the system it is about to make.
Review them and press the <command>Accept</command> button.</para>
</step>
<step>
<para>Press the <command>OK</command> button to exit the Add-on Products menu and exit
YaST.</para>
</step>
</procedure>
-->
<para> When you have the Development Tools Module installed, you can verify that the GCC 10
packages are available to be installed on your system: </para>
<screen>sles15: # zypper search gcc10
Refreshing service 'Basesystem_Module_15_SP2_x86_64'.
Refreshing service 'Desktop_Applications_Module_15_SP2_x86_64'.
Refreshing service 'Development_Tools_Module_15_SP2_x86_64'.
Refreshing service 'SUSE_Linux_Enterprise_Server_15_SP2_x86_64'.
Refreshing service 'SUSE_Package_Hub_15_SP2_x86_64'.
Refreshing service 'Server_Applications_Module_15_SP2_x86_64'.
Loading repository data...
Reading installed packages...
S | Name | Summary
--+------------------------------+-------------------------------------------------------
| gcc10 | The GNU C Compiler and Support Files
| gcc10 | The GNU C Compiler and Support Files
| gcc10-32bit | The GNU C Compiler 32bit support
| gcc10-ada | GNU Ada Compiler Based on GCC (GNAT)
| gcc10-ada-32bit | GNU Ada Compiler Based on GCC (GNAT)
| gcc10-c++ | The GNU C++ Compiler
| gcc10-c++-32bit | The GNU C++ Compiler
| gcc10-fortran | The GNU Fortran Compiler and Support Files
| gcc10-fortran-32bit | The GNU Fortran Compiler and Support Files
| gcc10-go | GNU Go Compiler
| gcc10-go-32bit | GNU Go Compiler
| gcc10-info | Documentation for the GNU compiler collection
| gcc10-locale | Locale Data for the GNU Compiler Collection
| libstdc++6-devel-gcc10 | Include Files and Libraries mandatory for Development
| libstdc++6-devel-gcc10-32bit | Include Files and Libraries mandatory for Development
| libstdc++6-pp-gcc10 | GDB pretty printers for the C++ standard library
| libstdc++6-pp-gcc10-32bit | GDB pretty printers for the C++ standard library
</screen>
<para> Now you can simply install the compilers for the programming languages you use with
<command>zypper</command>: </para>
<screen>sles15: # zypper install gcc10 gcc10-c++ gcc10-fortran
</screen>
<para> The compilers are now installed on your system; the executables are called
<command>gcc-10</command>, <command>g++-10</command>, <command>gfortran-10</command> and so on.
It is also possible to install the packages in <command>YaST</command>. To do so, simply enter
the <quote>Software Management</quote> menu in the <emphasis role="strong">Software</emphasis>
section and search for <quote>gcc10</quote>. Then select the packages you want to install.
Finally, click the <emphasis role="strong">Accept</emphasis> button. </para>
<note>
<title>Newer compilers on openSUSE Leap 15.2</title>
<para> The community distribution openSUSE Leap 15.2 shares most of the base packages with SUSE
Linux Enterprise Server 15 SP2. The system compiler on systems running openSUSE Leap 15.2 is
also GCC 7.5. There is no Development Tools Module available for the community distribution,
but a newer compiler is provided. Simply install the packages <package>gcc10</package>,
<package>gcc10-c++</package>, <package>gcc10-fortran</package>, and the like. </para>
</note>
</sect2>
</sect1>
<sect1 xml:id="sec-gcc10-optimization-levels">
<title>Optimization levels and related options</title>
<para> GCC has a rich optimization pipeline that is controlled by approximately a hundred
command line options. It would be impractical to force users to decide, for each one of them,
whether they want it switched on when compiling their code. Like all other modern
compilers, GCC therefore introduces the concept of optimization levels, which allow the user to
pick one common configuration from a few options. Optionally, the user can tweak the selected
level, but that does not happen frequently. </para>
<para> The default is to not optimize at all. You can specify this optimization level on the
command line as <literal>-O0</literal>. It is often used when developing and debugging a project
and is therefore usually accompanied by the command line switch <literal>-g</literal> so that
debug information is emitted. As no optimizations take place, no information is lost to them:
no variables are optimized away, the compiler only inlines functions with special attributes
that require it, and so on. As a consequence, the debugger can almost always find everything it
searches for in the running program and report on its state very well. On the other hand, the
resulting code is big and slow, so this optimization level should not be used for release
builds. </para>
<para> The most common optimization level for release builds is <literal>-O2</literal> which
attempts to optimize the code aggressively but avoids large compile times and excessive code
growth. Optimization level <literal>-O3</literal> instructs GCC to simply optimize as much as
possible, even if the resulting code might be considerably bigger and the compilation can take
longer. Note that neither <literal>-O2</literal> nor <literal>-O3</literal> imply anything about
the precision and semantics of floating-point operations. Even at the optimization level
<literal>-O3</literal> GCC implements math functions so that they strictly follow the respective
IEEE and/or ISO rules. This often means that the compiled programs run markedly slower than
necessary if such strict adherence is not required. The command line switch
<literal>-ffast-math</literal> is a common way to relax rules governing floating-point
operations. It is out of scope of this document to provide a list of the fine-grained options it
enables and their meaning. However, if your software crunches floating-point numbers and its
runtime is a priority, you can look them up in the GCC manual and review what semantics of
floating-point operations you need. </para>
<para> The most aggressive optimization level is <literal>-Ofast</literal> which does imply
<literal>-ffast-math</literal> along with a few other options that disregard strict standard
compliance. In GCC 10 this level also means the optimizers may introduce data races when moving
memory stores which may not be safe for multithreaded applications. Additionally, the Fortran
compiler can take advantage of associativity of math operations even across parentheses and
convert big memory allocations on the heap to allocations on the stack. The latter
transformation may cause the code to exceed the maximum stack size allowed by
<command>ulimit</command>, which is then reported to the user as a segmentation fault. We often
use <literal>-Ofast</literal> to build benchmarks: it is a convenient shorthand for the options
on top of <literal>-O3</literal> that often make them run faster, and benchmarks are usually
written in a way that they still run correctly despite the relaxed rules. </para>
<para> If you feed the compiler with huge machine-generated input, especially if individual
functions happen to be extremely large, the compile time can become an issue even when using
<literal>-O2</literal>. In such cases, use the most lightweight optimization level
<literal>-O1</literal> that avoids running almost all optimizations with quadratic complexity.
Finally, the <literal>-Os</literal> level directs the compiler to aggressively optimize for the
size of the binary. </para>
<note>
<title>Optimization level recommendation</title>
<para> Usually we recommend using <literal>-O2</literal>. This is the optimization level we use
to build most SUSE and openSUSE packages, because at this level the compiler makes balanced size
and speed trade-offs when building a general-purpose operating system. However, we suggest using
<literal>-O3</literal> if you know that your project is compute-intensive and is either small
or an important part of your actual workload. Moreover, if the compiled code contains
performance-critical floating-point operations, we strongly advise that you investigate whether
<literal>-ffast-math</literal> or any of the fine-grained options it implies can be safely
used. </para>
</note>
<para> If your project and the techniques you use to debug or instrument it do not depend on
<emphasis role="italic">ELF symbol interposition</emphasis>, you may consider trying to speed it
up with <literal>-fno-semantic-interposition</literal>. This allows the compiler to inline
calls and propagate information that would be invalid if a symbol were replaced during dynamic
linking. Using this option to signal to the compiler that interposition is not going to happen is
known to significantly boost the performance of some projects, most notably the Python interpreter. </para>
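<para> The effect can be observed on a toy shared library. In the sketch below, with function
names invented for the example, the exported call can only be inlined once interposition is
ruled out; the disassembly check assumes an <literal>x86_64</literal> machine with
<command>objdump</command> installed: </para>

```shell
cat > lib.c <<'EOF'
int base(void)    { return 42; }
int wrapper(void) { return base(); }   /* candidate for inlining */
EOF
gcc -O2 -fPIC -shared -o interposable.so lib.c
gcc -O2 -fPIC -shared -fno-semantic-interposition -o inlined.so lib.c
# Without the option, wrapper must reach base through the PLT in case
# base() is interposed at run time; with it, the call is inlined away.
for so in interposable.so inlined.so; do
    refs=$(objdump -d "$so" | awk '/<wrapper>:/{f=1;next} f&&/^$/{exit} f' \
           | grep -c base)
    echo "$so: $refs reference(s) to base in wrapper"
done
```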
<para> Some projects use <literal>-fno-strict-aliasing</literal> to work around type punning
problems in the source code. This is not recommended except for very low-level, hand-optimized
code such as the Linux kernel. Type-based alias analysis is a powerful tool: it enables
transformations such as store-to-load propagation, which in turn enables aggressive inlining,
vectorization and other high-level optimizations. </para>
<para> With the <literal>-g</literal> switch GCC still tries hard to generate useful debug
information even when optimizing. However, a lot of information is irrecoverably lost in the
process. Debuggers also often struggle to present the user with a coherent view of the state of a
program whose statements are not necessarily executed in the original order. Debugging optimized
code can therefore be a challenging task, but it usually remains possible. </para>
<para> The complete list of optimization and other command line switches is available in the
compiler manual, provided in the info format in the package <package>gcc10-info</package> or
online at <link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-10.2.0/gcc/">the GCC project Web
site</link>. </para>
<para> Bear in mind that although almost all optimizing compilers have the concept of optimization
levels, and their optimization levels often have the same names as those in GCC, they do
not necessarily make the same trade-offs. Famously, GCC's <literal>-Os</literal>
optimizes for size much more aggressively than LLVM/Clang's level with the same name. Therefore,
it often produces slower code; the closer equivalent option in Clang is <literal>-Oz</literal>,
which GCC does not have. Similarly, <literal>-O2</literal> can have different meanings for
different compilers. For example, the difference between <literal>-O2</literal> and
<literal>-O3</literal> is much bigger in GCC than in LLVM/Clang. </para>
<note>
<title>Changing the optimization level with <command>cmake</command></title>
<para> If you use <command>cmake</command> to configure and set up builds of your application, be
aware that its <emphasis role="italic">release</emphasis> optimization level defaults to
<literal>-O3</literal> which might not be what you want. To change it, you must modify the
<literal>CMAKE_C_FLAGS_RELEASE</literal>, <literal>CMAKE_CXX_FLAGS_RELEASE</literal> and/or
<literal>CMAKE_Fortran_FLAGS_RELEASE</literal>, since these variables are appended at the end
of the compilation command lines, thus overwriting any level set in the variables
<literal>CMAKE_C_FLAGS</literal>, <literal>CMAKE_CXX_FLAGS</literal>, and the like. </para>
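For example, a release configuration pinned to -O2 could be set up at configure time as follows (a sketch; adapt the flags to your project):

```shell
# Configure a Release build that uses -O2 instead of cmake's default -O3.
cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_C_FLAGS_RELEASE="-O2 -DNDEBUG" \
      -DCMAKE_CXX_FLAGS_RELEASE="-O2 -DNDEBUG" \
      /path/to/source
```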
</note>
</sect1>
<sect1 xml:id="sec-gcc10-target-options">
<title>Taking advantage of newer processors</title>
<para> By default GCC assumes that you want to run the compiled program on a wide variety of CPUs,
including fairly old ones, regardless of the selected optimization level. On architectures like
<literal>x86_64</literal> and <literal>aarch64</literal> the generated code will only contain
instructions available on every CPU model of the architecture, including the earliest ones. On
<literal>x86_64</literal> in particular this means that the programs will use the
<literal>SSE</literal> and <literal>SSE2</literal> instruction sets for floating point and
vector operations but not any more recent ones. </para>
<para> If you know that the generated binary will run only on machines supporting newer
instruction set extensions, you can say so on the command line. The complete list of the relevant
options is available in the manual, but the most prominent one is <literal>-march</literal>, which
lets you select a CPU model to generate code for. For example, if you know that your program will only be executed on AMD
EPYC 7002 Series Processors which is based on AMD Zen 2 cores or processors that are compatible
with it, you can instruct GCC to take advantage of all the instructions the CPU supports with
option <literal>-march=znver2</literal>. Note that on SUSE Linux Enterprise Server 15, the system
compiler does not know this particular value of the switch; you need to use GCC 10 from the
Development Tools Module to optimize code for these processors. </para>
<para> To run the program on the machine on which you are compiling it, you can have the compiler
auto-detect the target CPU model for you with the option <literal>-march=native</literal>. This
only works well if the compiler is new enough. The system compiler of SUSE Linux Enterprise Server,
for example, misidentifies AMD EPYC 7002 Series Processors as being based on the AMD Zen 1 core.
Among other things, this means that it only emits 128-bit vector instructions, even though the CPU
has data paths wide enough to efficiently process 256-bit ones. Again, the easy solution is to use
the compiler from the Development Tools Module when targeting recent processors. </para>
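In a makefile, the target CPU can be kept in a single variable so that generic and tuned builds are one override apart (a sketch; the MARCH variable and its default are our choices, not a convention):

```make
# Default to code tuned for the build machine; override with e.g.
#   make MARCH=znver2
# when the binaries are deployed on AMD Zen 2 servers, or
#   make MARCH=x86-64
# for a generic build that runs on any x86_64 CPU.
MARCH ?= native
CFLAGS += -O2 -march=$(MARCH)
```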
</sect1>
<sect1 xml:id="sec-gcc10-lto">
<title>Link Time Optimization (LTO)</title>
<para>
<xref linkend="fig-gcc10-nonlto-build" xrefstyle="template:Figure %n"/> outlines the classic mode
of operation of a compiler and a linker. Pieces of a program are compiled and optimized in
user-defined chunks called compilation units to produce so-called object files, which already
contain binary machine instructions and which are combined together by a linker. Because the
linker works at such a low level, it cannot perform much optimization, and the division of the
program into compilation units thus presents a profound barrier to optimization. </para>
<figure xml:id="fig-gcc10-nonlto-build">
<title>Traditional program build</title>
<mediaobject>
<imageobject role="fo">
<imagedata fileref="gcc10-nonlto.svg" width="100%" format="SVG"/>
</imageobject>
<imageobject role="html">
<imagedata fileref="gcc10-nonlto.svg" width="100%" format="SVG"/>
</imageobject>
</mediaobject>
</figure>
<para> This limitation can be overcome by rearranging the process so that the linker does not
receive as its input the almost finished object files containing machine instructions, but is
invoked on files containing so-called <emphasis role="italic">intermediate language</emphasis>
(IL) which is a much richer representation of each original compilation unit (see figure <xref
linkend="fig-gcc10-lto-build" xrefstyle="template:figure %n"/>). The linker identifies the input
as not yet entirely compiled and invokes a linker plugin which in turn runs the compiler again.
But this time it has at its disposal the representation of the entire program or library that is
being built. The compiler makes decisions about what optimizations across function and
compilation unit boundaries will be carried out and then divides the program into a set of
partitions. Each of the partitions is further optimized independently, and machine code is
emitted for it, which is finally linked the traditional way. Processing of the partitions is
performed in parallel. </para>
<figure xml:id="fig-gcc10-lto-build">
<title>Building a program with GCC using Link Time Optimization (LTO)</title>
<mediaobject>
<imageobject role="fo">
<imagedata fileref="gcc10-lto.svg" width="100%" format="SVG"/>
</imageobject>
<imageobject role="html">
<imagedata fileref="gcc10-lto.svg" width="100%" format="SVG"/>
</imageobject>
</mediaobject>
</figure>
<para> To use <emphasis role="italic">Link Time Optimization</emphasis>, all you need to do is add
the <literal>-flto</literal> switch to the compilation command line. The vast majority of
packages in the Linux distribution openSUSE Tumbleweed have been built with LTO for over a year
without any major problems. A lot of work has recently been put into emitting good debug
information when building with LTO. Thus the debugging experience is no longer as limited as it
was a couple of years ago. </para>
<para> LTO in GCC always consists of a <emphasis role="italic">whole program analysis</emphasis>
(WHOPR) stage followed by the majority of the compilation process performed in parallel, which
greatly reduces the build times of most projects. To control the parallelism, you can explicitly
cap the number of parallel compilation processes at <emphasis role="italic">n</emphasis> by
specifying <literal>-flto=<replaceable>n</replaceable></literal> on the linker command line.
Alternatively, it is possible to use the GNU <command>make</command> jobserver with
<literal>-flto=jobserver</literal> while also prepending the <emphasis role="strong"
>makefile</emphasis> rule invoking the link step with the character <literal>+</literal> to
instruct GNU make to keep the jobserver available to the linker process. You can also use
<literal>-flto=auto</literal>, which instructs GCC to search for the jobserver and, if it is not
found, use all available CPU threads. </para>
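A GNU make link rule following these recommendations could look like this (a sketch with hypothetical target and variable names):

```make
OBJS = main.o util.o

# The leading `+` marks the recipe as a recursive make invocation, so
# GNU make passes the jobserver file descriptors down to the command
# and -flto=jobserver can schedule its LTO partitions within the
# global -j job limit.
myprog: $(OBJS)
	+$(CC) -flto=jobserver -o $@ $(OBJS)
```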
<!--
<para>The number of partitions a program is split into depends only on the linked program itself
because it affects the resulting binary which are required to be identical despite different
host CPU configurations. It is however possible to control the number using <literal>- -param
lto-partitions=<emphasis role="italic">n</emphasis></literal> parameter.</para>
-->
<para> Note that there is a technical difference in how GCC and LLVM/Clang approach LTO. Clang
provides two LTO mechanisms, so-called <emphasis role="italic">thin LTO</emphasis> and <emphasis
role="italic">full LTO</emphasis>. In full LTO, LLVM processes the whole program as if it was a
single translation unit which does not allow for any parallelism. GCC can be configured to
operate in this way with the option <literal>-flto-partition=one</literal>. LLVM in thin-LTO mode
can compile different compilation units in parallel and makes possible inlining across
compilation unit boundaries, but not most other types of cross-module optimizations. This
mechanism therefore has inherently higher code quality penalty than full LTO or the approach of
GCC. </para>
<sect2 xml:id="sec-gcc10-selected-lto-benefits">
<title>Most notable benefits of LTO</title>
<para> Applications built with LTO are often faster, mainly because the compiler can <emphasis
role="italic">inline</emphasis> calls to functions in another compilation unit. This
possibility also allows programmers to structure their code according to its logical division
because they are not forced to put function definitions into header files to enable their
inlining. Not all calls conveying information known at compilation time can be inlined, but GCC
can still track and propagate constants, value ranges and devirtualization contexts to the
callees, often even when they are passed in an aggregate or by reference, which can subsequently
save unnecessary computations. LTO allows such propagation across compilation unit boundaries, too. </para>
<para> Link Time Optimization with <emphasis role="italic">whole program analysis</emphasis> also offers many
opportunities to shrink the code size of the built project. Thanks to <emphasis role="italic"
>symbol promotion</emphasis> and inter-procedural <emphasis role="italic">unreachable code
elimination</emphasis>, functions and their parts which are not necessary in any particular
project can be removed even when they are not declared <literal>static</literal> and are not
defined in an anonymous namespace. Automatic <emphasis role="italic">attribute
discovery</emphasis> can identify <literal>C++</literal> functions that do not throw exceptions
which allows the compiler to avoid generating a lot of code in exception cleanup regions.
<emphasis role="italic">Identical code folding</emphasis> can find functions with the same
semantics and remove all but one of them. The code size savings are often very significant and a
compelling reason to use LTO even for applications which are not CPU-bound. </para>
<note>
<title>Building libraries with LTO</title>
<para> The symbol promotion is controlled by resolution information given to the linker and
depends on the type of DSO being built. When producing a dynamically loaded shared library, all
symbols with default visibility can be overridden by the dynamic linker. This blocks the
promotion of all functions not declared inline, so it is necessary to use hidden
visibility wherever possible to achieve best results. Similar problems occur even when
building static libraries with <literal>-rdynamic</literal>. </para>
</note>
</sect2>
<sect2 xml:id="sec-gcc10-lto-issues">
<title>Potential issues with LTO</title>
<para> As noted earlier, the vast majority of packages in the openSUSE Tumbleweed distribution
are built with LTO without any need to tweak them, and they work fine. Nevertheless, some
low-level constructs pose a problem for LTO. One typical issue is symbols defined in <emphasis
role="italic">inline assembly</emphasis>, which can happen to be placed in a different partition
from their uses and subsequently fail the final linking step. To build such projects with LTO,
the assembler snippets defining symbols must be placed into a separate assembler source file so
that they only participate in the final linking step. Global <literal>register</literal>
variables are not supported by LTO, and programs either must not use this feature or be built
the traditional way. </para>
<para> Another notable limitation of LTO is that it does not support <emphasis role="italic"
>symbol versioning</emphasis> implemented with special inline assembly snippets (as opposed to
a linker map file). Instead, you can define symbol versions in the source files with the new
<literal>symver</literal> function attribute. As an example, the following snippet will make
the function <literal>foo_v1</literal> implement <literal>foo</literal> in <emphasis
role="italic">node</emphasis>
<literal>VERS_1</literal> (which must be specified in the version script supplied to the
linker). Consult <link
xlink:href="https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-symver-function-attribute"
>the manual</link> for more details. </para>
<screen>__attribute__ ((__symver__ ("foo@VERS_1")))
int foo_v1 (void)
{
  return 0;
}
</screen>
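The VERS_1 node itself must come from a version script passed to the linker. A minimal sketch of such a script (the file name foo.map is illustrative):

```
/* foo.map */
VERS_1 {
  global:
    foo;
  local:
    *;
};
```

It would then be supplied at link time with, for example, <literal>-Wl,--version-script=foo.map</literal>.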
<para> Sometimes the extra power of LTO reveals pre-existing problems which do not manifest
themselves otherwise. Violations of (strict) <emphasis role="italic">aliasing</emphasis> rules
and <literal>C++</literal>
<emphasis role="italic">one definition rule</emphasis> tend to cause misbehavior significantly
more often; the latter is fortunately reported by the <literal>-Wodr</literal> warning which is
on by default and should not be ignored. We have also seen cases where the use of the
<literal>flatten</literal> function attribute led to an unsustainable amount of inlining with LTO.
Furthermore, LTO is not a good fit for code snippets compiled by <literal>configure</literal>
scripts (generated by <literal>autoconf</literal>) to discover the availability of various
features, especially when the script then searches for a string in the generated assembly. </para>
<para> Finally, we needed to configure the virtual machines building the biggest openSUSE
packages to have more memory than when not using LTO. Whereas in the traditional mode of
compilation 1 GB of RAM per core was enough to build Mozilla Firefox, the serial step of LTO
means the build bots need 16 GB even when they have fewer than 16 cores. </para>
</sect2>
</sect1>
<sect1 xml:id="sec-gcc10-pgo">
<title>Profile-Guided Optimization (PGO)</title>
<para> Optimizing compilers frequently make decisions according to which path through the code
they consider most likely to be executed, how many times a loop is expected to iterate, and
similar estimates. They also often face trade-offs between potential runtime benefits and code
size growth. Ideally, they would optimize only frequently executed (also called <emphasis
role="italic">hot</emphasis>) bits of a program for speed and everything else for size to reduce
strain on caches and make the distribution of the built software cheaper. Unfortunately, guessing
which parts of a program are the <emphasis role="italic">hot</emphasis> ones is difficult, and
even sophisticated estimation algorithms implemented in GCC are no match for an actual measurement. </para>
<para> If you do not mind adding an extra level of complexity to the build system of your project, you
can make such a measurement part of the process. The <emphasis role="strong">makefile</emphasis>
(or any other) build script needs to compile the project twice. The first build must use
the <literal>-fprofile-generate</literal> option, and the resulting binary must then be executed
in one or more <emphasis role="italic">train runs</emphasis> during which it saves information
about the behavior of the program to special files. Afterward, the project needs to be rebuilt
again, this time with the <literal>-fprofile-use</literal> option, which instructs the compiler to
look for the files with the measurements and use them when making optimization decisions, a
process called <emphasis role="italic">Profile-Guided Optimization (PGO)</emphasis>. </para>
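A makefile target automating the two builds might be sketched as follows (the target names, the training input, and the clean-objects step are hypothetical):

```make
pgo:
	# Pass 1: build instrumented; the train run writes *.gcda profile files.
	$(MAKE) CFLAGS="-O2 -fprofile-generate" myprog
	./myprog --train data/train.in
	# Pass 2: remove only the objects (keeping *.gcda) and rebuild
	# using the collected profile.
	$(MAKE) clean-objects
	$(MAKE) CFLAGS="-O2 -fprofile-use" myprog
```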
<para> It is important that the train run exhibits the same characteristics as the real workload.
Unless you use the option <literal>-fprofile-partial-training</literal> in the second build, it
needs to exercise the code that is also the most frequently executed in real use, otherwise that
code will be optimized for size and PGO will do more harm than good. With this option, GCC reverts
to guessing the properties of the portions of the project not exercised in the train run, as if they
were compiled without profile feedback. However, this also means that the code size will
typically not shrink as much as one would expect from a PGO build. </para>
<para> On the other hand, train runs do not need to be a perfect simulation of the real workload.
For example, even though in theory a test suite should not be a very good train run because it
disproportionately often tests various corner cases, in practice many projects use it as a train
run and achieve significant runtime improvements with real workloads, too. </para>
<para> Profiles collected using an instrumented binary for multithreaded programs may be
inconsistent because of missed counter updates. You can use
<literal>-fprofile-correction</literal> in addition to <literal>-fprofile-use</literal> so that
GCC uses heuristics to correct or smooth out such inconsistencies instead of emitting an error. </para>
<para> Profile-Guided Optimization can be combined with and is complementary to Link Time Optimization.
While LTO expands what the compiler can do, PGO informs it about which parts of the program are
the important ones and should be focused on. The following sections detail this by means of two
rather different case studies. </para>
</sect1>
<sect1 xml:id="sec-gcc10-spec">
<title>Performance evaluation: SPEC CPU 2017</title>
<para>
<emphasis role="italic">Standard Performance Evaluation Corporation</emphasis> (SPEC) is a
non-profit corporation that publishes a variety of industry standard benchmarks to evaluate
performance and other characteristics of computer systems. Its latest suite of CPU intensive
workloads, SPEC CPU 2017, is often used to compare compilers and how well they optimize code with
different settings because the included benchmarks are well known and represent a wide variety of
computation-heavy programs. This section highlights selected results of a GCC 10 evaluation using
the suite. </para>
<para> Note that when we use SPEC to perform compiler comparisons, we are lenient toward some
official SPEC rules which system manufacturers need to observe to claim an official score for
their system. We disregard the concepts of <emphasis role="italic">base</emphasis> and <emphasis
role="italic">peak</emphasis> metrics and simply focus on results of compilations using a
particular set of options. We even patched several benchmarks: </para>
<itemizedlist>
<listitem>
<para> Benchmarks <literal>502.gcc_r</literal>, <literal>505.mcf_r</literal>,
<literal>511.povray_r</literal>, and <literal>527.cam4_r</literal> contain an implementation
of quicksort that violates (strict) <literal>C/C++</literal> aliasing rules, which can lead to
erroneous behavior when optimizing at link time. SPEC decided not to change the released
benchmarks and simply suggests that these benchmarks are built with the
<literal>-fno-strict-aliasing</literal> option when they are built with GCC. That makes
evaluation of compilers using SPEC problematic, because gauging their ability to use aliasing
rules to facilitate optimizations is important. We have therefore disabled it only for the
problematic <literal>qsort</literal> functions with the following function attribute: </para>
<screen>__attribute__((optimize("-fno-strict-aliasing")))</screen>
<para> As a result, the only benchmark which we compile with
<literal>-fno-strict-aliasing</literal> is <literal>500.perlbench_r</literal>. </para>
</listitem>
<listitem>
<para> We have increased the tolerance of <literal>549.fotonik3d_r</literal> to rounding errors
after it became clear that the intention was to allow the compiler to use relaxed semantics of
floating-point operations in the benchmark (see <link
xlink:href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84201">GCC bug 84201</link>). </para>
</listitem>
</itemizedlist>
<para> For the mentioned reasons (and probably some more), all the results in this document are
<emphasis role="italic">non-reportable</emphasis>. Finally, SPEC 2017 CPU offers so-called
<emphasis role="italic">speed</emphasis> and <emphasis role="italic">rate</emphasis> metrics.
For our purposes, we mostly ignore the differences and simply run the benchmarks configured for
rate metrics (mainly because the runtimes are smaller) but we always run all benchmarks
single-threaded. </para>
<para> SPEC specifies a base runtime for each benchmark and defines a <emphasis role="italic"
>rate</emphasis> as the ratio of the base runtime and the median measured runtime (this rate is
a separate concept from the rate metrics). The overall suite score is then calculated as
geometric mean of these ratios. The bigger the rate or score, the better it is. In the remainder
of this section, we report runtimes using relative rates and their geometric means as they were
measured on an AMD EPYC 7502P Processor running SUSE Linux Enterprise Server 15 SP2. </para>
<sect2 xml:id="sec-gcc10-spec-lto-pgo">
<title>Benefits of LTO and PGO</title>
<para> In <xref linkend="sec-gcc10-optimization-levels"/> we recommend that HPC workloads be
compiled with <literal>-O3</literal> and benchmarks with <literal>-Ofast</literal>. But it is
still interesting to look at integer crunching benchmarks built with only <literal>-O2</literal>
because that is how Linux distributions often build the programs from which they were extracted.
We have already mentioned that almost the whole openSUSE Tumbleweed distribution is now built
with LTO, and selected packages with PGO, and the following paragraphs demonstrate why. </para>
<figure xml:id="fig-gcc10-specint-o2-pgolto-geomean">
<title>Overall performance (bigger is better) of SPEC INTrate 2017 built with GCC 10.2 and
-O2</title>
<mediaobject>
<imageobject role="fo">
<imagedata fileref="gcc10-specint-o2-pgolto-perf-geomean.svg" width="85%" format="SVG"/>
</imageobject>
<imageobject role="html">
<imagedata fileref="gcc10-specint-o2-pgolto-perf-geomean.svg" width="85%" format="SVG"/>
</imageobject>
</mediaobject>
</figure>
<!-- xrefstyle="select:label" in xref also works but puts Figure with capital F everywhere -->
<para>
<xref linkend="fig-gcc10-specint-o2-pgolto-geomean" xrefstyle="template:Figure %n"/> shows the
overall performance effect on the whole integer benchmark suite as captured by the geometric
mean of all individual benchmark rates. The remarkable uplift of performance when using PGO is
mostly down to much quicker <literal>525.x264_r</literal> (see <xref
linkend="fig-gcc10-specint-o2-pgolto-perf-x264" xrefstyle="template:figure %n"/>). The reason
is that, with profile feedback, GCC performs vectorization also at <literal>-O2</literal>, and
this benchmark benefits a great deal from vectorization; in practice it really should be
compiled with at least <literal>-O3</literal>. Nevertheless, several other benchmarks also
benefit from these advanced modes of operation, as can be seen on <xref
linkend="fig-gcc10-specint-o2-pgolto-perf-indiv" xrefstyle="template:figure %n"/>. </para>
<figure xml:id="fig-gcc10-specint-o2-pgolto-perf-x264">
<title>Performance (bigger is better) of <literal>525.x264_r</literal> built with GCC 10.2 and
-O2</title>
<mediaobject>
<imageobject role="fo">
<imagedata fileref="gcc10-specint-o2-pgolto-perf-x264.svg" width="100%" format="SVG"/>
</imageobject>
<imageobject role="html">
<imagedata fileref="gcc10-specint-o2-pgolto-perf-x264.svg" width="100%" format="SVG"/>
</imageobject>
</mediaobject>
</figure>
<figure xml:id="fig-gcc10-specint-o2-pgolto-perf-indiv">
<title>Runtime performance (bigger is better) of selected integer benchmarks built with GCC 10.2
and -O2</title>
<mediaobject>
<imageobject role="fo">
<imagedata fileref="gcc10-specint-o2-pgolto-perf-indiv.svg" width="100%" format="SVG"/>
</imageobject>
<imageobject role="html">
<imagedata fileref="gcc10-specint-o2-pgolto-perf-indiv.svg" width="100%" format="SVG"/>
</imageobject>
</mediaobject>
</figure>
<para>
<xref linkend="fig-gcc10-specint-o2-ltopgo-size" xrefstyle="template:Figure %n"/> shows another
important reason which is the reduction of the size of the binaries (measured without debug
info), which can be significant with LTO or a combination of LTO and PGO. Note that it does not
depict that the size of benchmark <literal>548.exchange2_r</literal> grew by 50% and almost 250%
when built with PGO or both PGO and LTO respectively, which looks huge but the growth is from a
particularly small base. </para>
<figure xml:id="fig-gcc10-specint-o2-ltopgo-size">
<title>Binary size (smaller is better) of selected integer benchmarks built with GCC 10.2 and
-O2</title>
<mediaobject>
<imageobject role="fo">
<imagedata fileref="gcc10-specint-o2-pgolto-size.svg" width="90%" format="SVG"/>
</imageobject>
<imageobject role="html">
<imagedata fileref="gcc10-specint-o2-pgolto-size.svg" width="90%" format="SVG"/>
</imageobject>
</mediaobject>
</figure>
<para> The runtime benefits and binary size savings can be even bigger when using the
optimization level <literal>-Ofast</literal> and option <literal>-march=native</literal> to
allow the compiler to take advantage of all instructions that the AMD EPYC 7502P Processor
supports. <xref linkend="fig-gcc10-specint-ofast-pgolto-geomean"
xrefstyle="template:Figure
%n"/> shows the respective geometric means and <xref
linkend="fig-gcc10-specint-ofast-pgolto-perf-indiv" xrefstyle="template:figure %n"/> shows the
benchmarks with the most profound effect. Even though optimization levels <literal>-O3</literal>
and <literal>-Ofast</literal> are permitted to be relaxed about the final binary size, PGO and
especially LTO can bring it nicely down at these levels, too. <xref
linkend="fig-gcc10-specint-ofast-pgolto-size" xrefstyle="template:Figure %n"/> depicts the
relative binary sizes of the most affected benchmarks. </para>
<figure xml:id="fig-gcc10-specint-ofast-pgolto-geomean">
<title>Overall performance (bigger is better) of SPEC INTrate 2017 built with GCC 10.2 and