Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up build of thin data streams #11618

Merged
merged 7 commits into from
Feb 28, 2024

Conversation

Honny1
Copy link
Collaborator

@Honny1 Honny1 commented Feb 22, 2024

Description:

This PR speeds up the build of thin data streams using the --thin flag. Before the changes, the build takes quite a long time: 48m31.2s. After changes, the build takes a total of 4m40s.

The old build approach was to copy Benchmark instances with only one profile. Then the next build steps are performed with that copy. This leads to a slowdown due to deep copying of the Benchmark class and performing link steps for each thin DS.

The new approach changes the way thin DSs are built. The first step is to create a Benchmark for the thick DS. Then the XML generation from the Benchmark for one profile is performed. Before the generation, a dictionary of components that will not be included in the XML is created. And then it is linked to OVAL.

Review Hints:

To test the --thin flag, you can run this script:

#!/bin/bash

./build_product fedora --thin

for filename in ./build/thin_ds/*.xml; do
    echo "$filename"
    oscap xccdf eval  --profile "(all)" --results-arf "arf_results/arf_$(basename -- $filename)" "$filename"
done

The script generates a thin Datastream for each rule and then performs a scan using oscap.
This test takes more than an hour to run because there are approximately 1830 rules to process and some are memory intensive.

To test the speed up of the build, you can run this command:

./build_product fedora --thin -p

Flag -p enables profiling of the build.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Used by openshift-ci bot. label Feb 22, 2024
Copy link

openshift-ci bot commented Feb 22, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link

Start a new ephemeral environment with changes proposed in this pull request:

Fedora Environment
Open in Gitpod

Oracle Linux 8 Environment
Open in Gitpod

Copy link

🤖 A k8s content image for this PR is available at:
ghcr.io/complianceascode/k8scontent:11618

Click here to see how to deploy it

If you alread have Compliance Operator deployed:
utils/build_ds_container.py -i ghcr.io/complianceascode/k8scontent:11618

Otherwise deploy the content and operator together by checking out ComplianceAsCode/compliance-operator and:
CONTENT_IMAGE=ghcr.io/complianceascode/k8scontent:11618 make deploy-local

Copy link
Collaborator

@jan-cerny jan-cerny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately the generated data streams are broken because the rules don't reference the OVAL components correctly.

For example, the build/thin_ds/ssg-fedora-ds_selinux_state.xml contains this:

     <xccdf-1.2:check system="http://oval.mitre.org/XMLSchema/oval-definitions-5">
       <xccdf-1.2:check-content-ref href="oval-unlinked.xml" name="selinux_state" />
      </xccdf-1.2:check> 

if args.thin_ds_components_dir != "off":
if not os.path.exists(args.thin_ds_components_dir):
os.makedirs(args.thin_ds_components_dir)
store_xccdf_per_profile(loader, args.thin_ds_components_dir)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I build thin data streams for RHEL 9 product, it tracebacks here:

[ 30%] [rhel9-content] generating plain XCCDF, OVAL and OCIL files
Traceback (most recent call last):
  File "/home/jcerny/work/git/content/build-scripts/build_xccdf.py", line 138, in <module>
    main()
  File "/home/jcerny/work/git/content/build-scripts/build_xccdf.py", line 132, in main
    store_xccdf_per_profile(loader, args.thin_ds_components_dir)
  File "/home/jcerny/work/git/content/build-scripts/build_xccdf.py", line 93, in store_xccdf_per_profile
    for id_, xccdftree in loader.get_benchmark_xml_by_profile():
  File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 1603, in get_benchmark_xml_by_profile
    profile_id, benchmark = self.benchmark.get_benchmark_xml_for_profile(profile)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 518, in get_benchmark_xml_for_profile
    return profile.id_, self.to_xml_element(
                        ^^^^^^^^^^^^^^^^^^^^
  File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 469, in to_xml_element
    self._add_groups_xml(root, components_to_not_include, env_yaml)
  File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 439, in _add_groups_xml
    root.append(group.to_xml_element(env_yaml, components_to_not_include))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 723, in to_xml_element
    self._add_sub_groups(group, components_to_not_include, env_yaml)
  File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 693, in _add_sub_groups
    group.append(_group.to_xml_element(env_yaml, components_to_not_include))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 723, in to_xml_element
    self._add_sub_groups(group, components_to_not_include, env_yaml)
  File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 693, in _add_sub_groups
    group.append(_group.to_xml_element(env_yaml, components_to_not_include))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 723, in to_xml_element
    self._add_sub_groups(group, components_to_not_include, env_yaml)
  File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 693, in _add_sub_groups
    group.append(_group.to_xml_element(env_yaml, components_to_not_include))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 722, in to_xml_element
    self._add_rules_xml(group, rules_to_not_include, env_yaml)
  File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 648, in _add_rules_xml
    group.append(rule.to_xml_element(env_yaml))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 1132, in to_xml_element
    add_reference_elements(rule, self.references, ref_uri_dict)
  File "/home/jcerny/work/git/content/ssg/build_yaml.py", line 120, in add_reference_elements
    raise ValueError(msg)
ValueError: Error processing reference cis: ['5.6.1.4']. A reference type has been added that the project doesn't know about.

The problem is related to the references, specifically, how reference URIs are stored and processed throughout the built process. Unfortunately, we now have product-specific reference types and global reference types. Global reference types are defined in ssg/constants.py and the product-specific ones are set in each product's product.yml. The reference URIs are included into the infamous variable env_yaml. See the Rule.to_xml_element() method in build_yaml.py around line 1128. I think we probably need to pass env_yaml to this function and then pass it down to the called functions, as can we infer from the traceback.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fd product.yml products | xargs grep cis

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed traceback.

ocil = loader.export_ocil_to_xml()
link_ocil(xccdftree, checks, args.ocil, ocil)

ssg.xml.ElementTree.ElementTree(xccdftree).write(args.xccdf)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I think of it, I realize that in case of building thin data streams we don't want to store this file and moreover we don't need to serialize it. But I can see that most of the code depends on the xccdftree. Therefore, I'm afraid that it would be difficult to rework this now. But I think it is a nice idea for future tickets. WDYT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this xccdf file is created because it is used in the next steps of the build. Especially in the formatting steps. These steps might be removed when python2 is not supported.

def store_xccdf_per_profile(loader, thin_ds_components_dir):
for id_, xccdftree in loader.get_benchmark_xml_by_profile():
xccdf_file_name = os.path.join(thin_ds_components_dir, "xccdf_{}.xml".format(id_))
ssg.xml.ElementTree.ElementTree(xccdftree).write(xccdf_file_name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the write method needs to have the encoding set to "utf-8" because it caused a problem recently, look #11614.

ocil = loader.export_ocil_to_xml()
link_ocil(xccdftree, checks, args.ocil, ocil)

ssg.xml.ElementTree.ElementTree(xccdftree).write(args.xccdf)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another write, check the encoding

Comment on lines 707 to 708
# This is where references should be put if there are any
# This is where rationale should be put if there are any
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

???

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left these comments in the method because they were present in the old version to_xml_element.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I also notice that the method is part of the Group class. So these comments actually make sense, but our groups don't have references and rationales, they wouldn't fit our structure. I feel that these comments provide unrequested information. So If you're fine with this type of noise you can keep them. If not, I would prefer to remove them.

@Honny1 Honny1 force-pushed the speedup-thin-ds branch 2 times, most recently from 63902da to a2acbc7 Compare February 23, 2024 19:38
Copy link
Collaborator

@jan-cerny jan-cerny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a huge speed up. On my machine building thin DSs for all rules in rhel9 content took 2:15 which is comparable to the normal build which takes 0:53. This improvement makes the feature useful for various use cases.

Comment on lines 707 to 708
# This is where references should be put if there are any
# This is where rationale should be put if there are any
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I also notice that the method is part of the Group class. So these comments actually make sense, but our groups don't have references and rationales, they wouldn't fit our structure. I feel that these comments provide unrequested information. So If you're fine with this type of noise you can keep them. If not, I would prefer to remove them.

groups = set()
for group in self.groups.values():
rules_, groups_ = group.get_not_included_components(rule_ids_list)
if len(rules) == len(self.rules) and len(rules_) == len(group.rules):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we have:

  • rules
  • rules_
  • self.rules
  • group.rules

Confusing!

@Honny1 Honny1 marked this pull request as ready for review February 27, 2024 09:57
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Used by openshift-ci bot. label Feb 27, 2024
@Honny1 Honny1 added the Infrastructure Our content build system label Feb 27, 2024
@Honny1
Copy link
Collaborator Author

Honny1 commented Feb 27, 2024

/packit retest-failed

Copy link

codeclimate bot commented Feb 27, 2024

Code Climate has analyzed commit 7c7c823 and detected 1 issue on this pull request.

Here's the issue category breakdown:

Category Count
Complexity 1

The test coverage on the diff in this pull request is 36.8% (50% is the threshold).

This pull request will bring the total coverage in the repository to 58.0% (-0.1% change).

View more on Code Climate.

@jan-cerny
Copy link
Collaborator

/packit retest-failed

@jan-cerny jan-cerny added this to the 0.1.73 milestone Feb 28, 2024
Copy link
Collaborator

@jan-cerny jan-cerny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have built the data stream using the -r option. I also have built all thin data streams using the -t option. I have checked their contents. I have used some of them in oscap scans. I have used them in automatus tests. I also compared the normal data streams of rhel9 before and after this change.

@jan-cerny jan-cerny merged commit 2bb366b into ComplianceAsCode:master Feb 28, 2024
41 of 44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Infrastructure Our content build system
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants