Speed up build of thin data streams #11618

Honny1 · 2024-02-22T12:13:59Z

Description:

This PR speeds up the build of thin data streams using the --thin flag. Before the changes, the build takes quite a long time: 48m31.2s. After changes, the build takes a total of 4m40s.

The old build approach was to copy Benchmark instances with only one profile. Then the next build steps are performed with that copy. This leads to a slowdown due to deep copying of the Benchmark class and performing link steps for each thin DS.

The new approach changes the way thin DSs are built. The first step is to create a Benchmark for the thick DS. Then the XML generation from the Benchmark for one profile is performed. Before the generation, a dictionary of components that will not be included in the XML is created. And then it is linked to OVAL.

Review Hints:

To test the --thin flag, you can run this script:

#!/bin/bash

./build_product fedora --thin

for filename in ./build/thin_ds/*.xml; do
    echo "$filename"
    oscap xccdf eval  --profile "(all)" --results-arf "arf_results/arf_$(basename -- $filename)" "$filename"
done

The script generates a thin Datastream for each rule and then performs a scan using oscap.
This test takes more than an hour to run because there are approximately 1830 rules to process and some are memory intensive.

To test the speed up of the build, you can run this command:

./build_product fedora --thin -p

Flag -p enables profiling of the build.

openshift-ci · 2024-02-22T12:14:10Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

github-actions · 2024-02-22T12:15:18Z

Start a new ephemeral environment with changes proposed in this pull request:

Fedora Environment

Oracle Linux 8 Environment

github-actions · 2024-02-22T12:18:43Z

🤖 A k8s content image for this PR is available at:
ghcr.io/complianceascode/k8scontent:11618

Click here to see how to deploy it

If you alread have Compliance Operator deployed:
utils/build_ds_container.py -i ghcr.io/complianceascode/k8scontent:11618

Otherwise deploy the content and operator together by checking out ComplianceAsCode/compliance-operator and:
CONTENT_IMAGE=ghcr.io/complianceascode/k8scontent:11618 make deploy-local

jan-cerny

Unfortunately the generated data streams are broken because the rules don't reference the OVAL components correctly.

For example, the build/thin_ds/ssg-fedora-ds_selinux_state.xml contains this:

     <xccdf-1.2:check system="http://oval.mitre.org/XMLSchema/oval-definitions-5">
       <xccdf-1.2:check-content-ref href="oval-unlinked.xml" name="selinux_state" />
      </xccdf-1.2:check>

jan-cerny · 2024-02-23T08:12:18Z

build-scripts/build_xccdf.py

+    if args.thin_ds_components_dir != "off":
+        if not os.path.exists(args.thin_ds_components_dir):
+            os.makedirs(args.thin_ds_components_dir)
+        store_xccdf_per_profile(loader, args.thin_ds_components_dir)


When I build thin data streams for RHEL 9 product, it tracebacks here:

[ 30%] [rhel9-content] generating plain XCCDF, OVAL and OCIL files Traceback (most recent call last): File "/home/jcerny/work/git/content/build-scripts/build_xccdf.py", line 138, in <module> main() File "/home/jcerny/work/git/content/build-scripts/build_xccdf.py", line 132, in main store_xccdf_per_profile(loader, args.thin_ds_components_dir) File "/home/jcerny/work/git/content/build-scripts/build_xccdf.py", line 93, in store_xccdf_per_profile for id_, xccdftree in loader.get_benchmark_xml_by_profile(): File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 1603, in get_benchmark_xml_by_profile profile_id, benchmark = self.benchmark.get_benchmark_xml_for_profile(profile) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 518, in get_benchmark_xml_for_profile return profile.id_, self.to_xml_element( ^^^^^^^^^^^^^^^^^^^^ File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 469, in to_xml_element self._add_groups_xml(root, components_to_not_include, env_yaml) File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 439, in _add_groups_xml root.append(group.to_xml_element(env_yaml, components_to_not_include)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 723, in to_xml_element self._add_sub_groups(group, components_to_not_include, env_yaml) File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 693, in _add_sub_groups group.append(_group.to_xml_element(env_yaml, components_to_not_include)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 723, in to_xml_element self._add_sub_groups(group, components_to_not_include, env_yaml) File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 693, in _add_sub_groups group.append(_group.to_xml_element(env_yaml, components_to_not_include)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 723, in to_xml_element self._add_sub_groups(group, components_to_not_include, env_yaml) File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 693, in _add_sub_groups group.append(_group.to_xml_element(env_yaml, components_to_not_include)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 722, in to_xml_element self._add_rules_xml(group, rules_to_not_include, env_yaml) File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 648, in _add_rules_xml group.append(rule.to_xml_element(env_yaml)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 1132, in to_xml_element add_reference_elements(rule, self.references, ref_uri_dict) File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 120, in add_reference_elements raise ValueError(msg) ValueError: Error processing reference cis: ['5.6.1.4']. A reference type has been added that the project doesn't know about.

The problem is related to the references, specifically, how reference URIs are stored and processed throughout the built process. Unfortunately, we now have product-specific reference types and global reference types. Global reference types are defined in ssg/constants.py and the product-specific ones are set in each product's product.yml. The reference URIs are included into the infamous variable env_yaml. See the Rule.to_xml_element() method in build_yaml.py around line 1128. I think we probably need to pass env_yaml to this function and then pass it down to the called functions, as can we infer from the traceback.

fd product.yml products | xargs grep cis

I fixed traceback.

jan-cerny · 2024-02-23T08:15:45Z

build-scripts/build_xccdf.py

+    ocil = loader.export_ocil_to_xml()
+    link_ocil(xccdftree, checks, args.ocil, ocil)
+
+    ssg.xml.ElementTree.ElementTree(xccdftree).write(args.xccdf)


When I think of it, I realize that in case of building thin data streams we don't want to store this file and moreover we don't need to serialize it. But I can see that most of the code depends on the xccdftree. Therefore, I'm afraid that it would be difficult to rework this now. But I think it is a nice idea for future tickets. WDYT?

Yes, this xccdf file is created because it is used in the next steps of the build. Especially in the formatting steps. These steps might be removed when python2 is not supported.

jan-cerny · 2024-02-23T08:41:05Z

build-scripts/build_xccdf.py

+def store_xccdf_per_profile(loader, thin_ds_components_dir):
+    for id_, xccdftree in loader.get_benchmark_xml_by_profile():
+        xccdf_file_name = os.path.join(thin_ds_components_dir, "xccdf_{}.xml".format(id_))
+        ssg.xml.ElementTree.ElementTree(xccdftree).write(xccdf_file_name)


I think that the write method needs to have the encoding set to "utf-8" because it caused a problem recently, look #11614.

jan-cerny · 2024-02-23T08:41:24Z

build-scripts/build_xccdf.py

+    ocil = loader.export_ocil_to_xml()
+    link_ocil(xccdftree, checks, args.ocil, ocil)
+
+    ssg.xml.ElementTree.ElementTree(xccdftree).write(args.xccdf)


another write, check the encoding

jan-cerny · 2024-02-23T08:45:02Z

ssg/build_yaml.py

+        # This is where references should be put if there are any
+        # This is where rationale should be put if there are any


I left these comments in the method because they were present in the old version to_xml_element.

I see. I also notice that the method is part of the Group class. So these comments actually make sense, but our groups don't have references and rationales, they wouldn't fit our structure. I feel that these comments provide unrequested information. So If you're fine with this type of noise you can keep them. If not, I would prefer to remove them.

jan-cerny

This is a huge speed up. On my machine building thin DSs for all rules in rhel9 content took 2:15 which is comparable to the normal build which takes 0:53. This improvement makes the feature useful for various use cases.

jan-cerny · 2024-02-26T07:18:35Z

ssg/build_yaml.py

+        # This is where references should be put if there are any
+        # This is where rationale should be put if there are any


I see. I also notice that the method is part of the Group class. So these comments actually make sense, but our groups don't have references and rationales, they wouldn't fit our structure. I feel that these comments provide unrequested information. So If you're fine with this type of noise you can keep them. If not, I would prefer to remove them.

jan-cerny · 2024-02-26T08:05:50Z

ssg/build_yaml.py

+        groups = set()
+        for group in self.groups.values():
+            rules_, groups_ = group.get_not_included_components(rule_ids_list)
+            if len(rules) == len(self.rules) and len(rules_) == len(group.rules):


Here we have:

rules

rules_

self.rules

group.rules

Confusing!

Honny1 · 2024-02-27T09:58:57Z

/packit retest-failed

codeclimate · 2024-02-27T09:59:02Z

Code Climate has analyzed commit 7c7c823 and detected 1 issue on this pull request.

Here's the issue category breakdown:

Category	Count
Complexity	1

The test coverage on the diff in this pull request is 36.8% (50% is the threshold).

This pull request will bring the total coverage in the repository to 58.0% (-0.1% change).

View more on Code Climate.

jan-cerny · 2024-02-28T07:28:27Z

/packit retest-failed

jan-cerny

I have built the data stream using the -r option. I also have built all thin data streams using the -t option. I have checked their contents. I have used some of them in oscap scans. I have used them in automatus tests. I also compared the normal data streams of rhel9 before and after this change.

openshift-ci bot added the do-not-merge/work-in-progress Used by openshift-ci bot. label Feb 22, 2024

Honny1 force-pushed the speedup-thin-ds branch from d6d7163 to 68814df Compare February 22, 2024 12:29

jan-cerny reviewed Feb 23, 2024

View reviewed changes

Honny1 added 2 commits February 23, 2024 20:24

Batch generation of OVAL

09e4c21

Rework build of benchmark xml

287d5aa

Honny1 force-pushed the speedup-thin-ds branch 2 times, most recently from 63902da to a2acbc7 Compare February 23, 2024 19:38

Change abroach for generation of thin ds

ca99d04

Honny1 force-pushed the speedup-thin-ds branch from a2acbc7 to 2027550 Compare February 23, 2024 19:39

jan-cerny self-assigned this Feb 26, 2024

jan-cerny reviewed Feb 26, 2024

View reviewed changes

jan-cerny requested changes Feb 26, 2024

View reviewed changes

Honny1 added 4 commits February 26, 2024 11:41

Rework Group xml generation

ad04988

Add filtering of groups and rules to build of XCCDF benchamrk

fc3edb9

Add env_yaml to xccdf as xml

bebbc25

Fix search not included components

7c7c823

Honny1 force-pushed the speedup-thin-ds branch from 3d9a5a5 to 7c7c823 Compare February 26, 2024 13:28

Honny1 marked this pull request as ready for review February 27, 2024 09:57

openshift-ci bot removed the do-not-merge/work-in-progress Used by openshift-ci bot. label Feb 27, 2024

Honny1 added the Infrastructure Our content build system label Feb 27, 2024

Honny1 requested a review from jan-cerny February 27, 2024 10:02

jan-cerny added this to the 0.1.73 milestone Feb 28, 2024

jan-cerny approved these changes Feb 28, 2024

View reviewed changes

jan-cerny merged commit 2bb366b into ComplianceAsCode:master Feb 28, 2024
41 of 44 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up build of thin data streams #11618

Speed up build of thin data streams #11618

Honny1 commented Feb 22, 2024 •

edited

openshift-ci bot commented Feb 22, 2024

github-actions bot commented Feb 22, 2024

github-actions bot commented Feb 22, 2024

jan-cerny left a comment

jan-cerny Feb 23, 2024

jan-cerny Feb 23, 2024

Honny1 Feb 23, 2024

jan-cerny Feb 23, 2024

Honny1 Feb 23, 2024

jan-cerny Feb 23, 2024

jan-cerny Feb 23, 2024

jan-cerny Feb 23, 2024

Honny1 Feb 23, 2024

jan-cerny Feb 26, 2024

jan-cerny left a comment

jan-cerny Feb 26, 2024

jan-cerny Feb 26, 2024

Honny1 commented Feb 27, 2024

codeclimate bot commented Feb 27, 2024

jan-cerny commented Feb 28, 2024

jan-cerny left a comment

		# This is where references should be put if there are any
		# This is where rationale should be put if there are any

Speed up build of thin data streams #11618

Speed up build of thin data streams #11618

Conversation

Honny1 commented Feb 22, 2024 • edited

Description:

Review Hints:

openshift-ci bot commented Feb 22, 2024

github-actions bot commented Feb 22, 2024

github-actions bot commented Feb 22, 2024

jan-cerny left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jan-cerny left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Honny1 commented Feb 27, 2024

codeclimate bot commented Feb 27, 2024

jan-cerny commented Feb 28, 2024

jan-cerny left a comment

Choose a reason for hiding this comment

Honny1 commented Feb 22, 2024 •

edited