Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

License header formatting #983

Draft
wants to merge 32 commits into
base: main
Choose a base branch
from
Draft

License header formatting #983

wants to merge 32 commits into from

Conversation

natibek
Copy link
Contributor

@natibek natibek commented Jul 8, 2024

There are inconsistencies in when license headers are added to source code (and whether the added ones are the same). This PR solves this with a checker that checks if the source code has a license header and if it does, whether it is the correct one. It accounts for shebang lines, comments at the beginning of files unrelated to license headers, and pylint and mypy disable lines. This check has been added to all_.py.

  • Like the other checks, it can be called with ./checks/license_header_format_.py from the root directory.
  • -i can be used to perform an incremental check. This is not enabled by default.
  • A specific file can be passed as an argument and just that will be checked. By default, it checks all the files matching *.py.
  • The --no-header flag is used to handle only cases where no header is found. By default, it is False.
  • The --bad-header flag is used to handle only cases where an incorrect header is found. By default, it is False.
  • The --apply flag is used to fix the header problems. By default, it only checks if the headers are correct and does not make any fixes.
    • It will remove bad headers and replace them with the correct ones.
    • If no license header is found, it will add one.

@natibek natibek changed the title License header checker and adder License header formatting Jul 11, 2024
@richrines1 richrines1 self-assigned this Aug 2, 2024
Copy link
Contributor

@richrines1 richrines1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for this!! added some inline suggestions. a few comments more generally:

  • there should be no references to "server" here. Any dynamic/repo-specific behavior should be handled via arguments or the config file - we would like checks in checks-superstaq to generalize outside of our repos as much as possible (within reason)
  • once we correct outdated headers once, it seems like we shouldn't need to keep checking them? in which case maybe the "outdated" functionality doesn't need to live in in this script - we can save the code you use to make these initial corrections somewhere internally, and then use this script to check headers from here on out
  • if we still need to explicitly check for the string "Infleqtion" in places, maybe we could add a "licensee" value to the config file in addition to the header, instead of hard coding it?
  • similarly, i can maybe see why it's unavoidable but i feel like the hard-coded "apache" checks somewhat defeat the purposes of saving the header in the config. do you think there's an easy way to check if the headers are ~the same, up to licensee/year? maybe we could allow the header in pyproject.toml to include {YEAR} and {LICENSEE} tags, which we could convert to wildcards when comparing against existing licenses

also fwiw it's also ok if this script doesn't handle every possible case perfectly - if it gets confused it can always just throw an error saying to fix the headers manually :)

"""
)
parser.add_argument(
"--apply", action="store_true", help="Add the license header to files.", default=False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to set default=False with action="store_true"

Suggested change
"--apply", action="store_true", help="Add the license header to files.", default=False
"--apply", action="store_true", help="Add the license header to files."

(ditto below)

license_header = ""
exceptions = ["# pylint:", "#!/", "# mypy:"]

with open(file, "r+") as f:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need the "+" here? if not:

Suggested change
with open(file, "r+") as f:
with open(file, "r") as f:

Comment on lines 34 to 37
try:
data: dict[str, Any] = tomlkit.parse(Path("pyproject.toml").read_text())
expected_license_header = str(data["tool"]["license_header_format"]["license_header"])
in_server = "Apache" not in expected_license_header
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should put this in a function instead executing it globally

Comment on lines 39 to 42
raise KeyError(
"Under [tool.license_header_format] add a license_header field with the license\
heder that should be added to source code files in the repository."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to raise an error in this case, we can just have an info message saying that no license header was found and then return as if it succeeded

Comment on lines 81 to 84
return f"""
Beginning at line: {self.start_line_num}
Ending at line : {self.end_line_num}\n
{self.license_header}\n"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
return f"""
Beginning at line: {self.start_line_num}
Ending at line : {self.end_line_num}\n
{self.license_header}\n"""
return (
f"Beginning at line: {self.start_line_num}\n"
f"Ending at line : {self.end_line_num}\n\n"
f"{self.license_header}\n"
)

@natibek
Copy link
Contributor Author

natibek commented Aug 5, 2024

  • once we correct outdated headers once, it seems like we shouldn't need to keep checking them? in which case maybe the "outdated" functionality doesn't need to live in in this script - we can save the code you use to make these initial corrections somewhere internally, and then use this script to check headers from here on out

That makes sense. We can also keep it but change the logic a bit. After the initial fix, instead of checking for ColdQuanta in the license header, we can check if it belongs to the licensee but is a different license. This can catch cases of changing the license provider.

  • similarly, i can maybe see why it's unavoidable but i feel like the hard-coded "apache" checks somewhat defeat the purposes of saving the header in the config. do you think there's an easy way to check if the headers are ~the same, up to licensee/year? maybe we could allow the header in pyproject.toml to include {YEAR} and {LICENSEE} tags, which we could convert to wildcards when comparing against existing licenses

I added a few more fields to replace the hard-coded variables. The cirq license header check pylint plugin does something similar. However, apache 2.0 licenses seem to have 2 different formattings from what I have seen in the license headers and that would mess with the matching if we use the wild card approach.

and license_header.start_line_num <= line_num + 1 < license_header.end_line_num
):
if line[-2] == ",":
prepend += line[:-1] + f" 2024 {licensee}.\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

silly legal q: will we need to update the year in every file on 1/1/2025? or do they stay the same until if/when we update the file?

if the former we might want to replace 2024 with e.g. datetime.datetime.now().year. if the latter maybe we want to make the last two digits wildcards?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will we need to update the year in every file on 1/1/2025?

No. Given that, we can pick whichever path is easier.

@natibek
Copy link
Contributor Author

natibek commented Aug 8, 2024

@richrines1 can you please take a look? The biggest change is that I am using difflib.SequenceMatcher to check if the body of the license header matches the header specified in the pyproject.toml file. Instead of checking if the license name (eg Apache) is in the header, we check if the header body matches the provided header body. If it matches above a threshold and the licensee is not included in the copyright line and the header is editable, the licensee is appended to the header.

Copy link
Contributor

@richrines1 richrines1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(partial review)

Comment on lines 179 to 184
if re.search(copyright_pattern, line):
copyright_line += line
body = "\n".join(header_as_lst[idx + 1 :]).strip("#")
break
else:
copyright_line += line
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if re.search(copyright_pattern, line):
copyright_line += line
body = "\n".join(header_as_lst[idx + 1 :]).strip("#")
break
else:
copyright_line += line
copyright_line += line
if re.search(copyright_pattern, line):
body = "\n".join(header_as_lst[idx + 1 :]).strip("#")
break


for license_header in license_header_lst:
similar_body = (
difflib.SequenceMatcher(None, body, license_header.license_header).ratio() > 0.94
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like this is comparing the existing header to the part of the existing header below the copyright line. should it be comparing to the expected header instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we also might want to generalize this a bit to determine similarity, e.g. by comparing

    "".join(line.lstrip("#").strip().lower() for line in header.splitlines()),
    "".join(line.lstrip("#").strip().lower() for line in expected_header.splitlines()),

so that licenses will always get marked as similar if they only differ by cases/whitespace/comment style/etc

target = (
expected_license_header.replace("{YEAR}", r"20\d{2}")
.replace("{LICENSEE}", licensee)
.replace("\n", "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we removing the newlines?

Comment on lines +192 to +196
.replace("(", r"\(")
.replace(")", r"\)")
.replace(".", r"\.")
.replace("'", r"\'")
.replace('"', r"\"")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use re.escape() to do this?

license_header.header_type = HeaderType.VALID
valid = True
elif similar_body and re.search(appended_pattern, license_header.license_header):
license_header.header_type = HeaderType.VALID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might want to treat this differently - if the licenses are similar but not exactly the same we should probably still rewrite them to match the expected formatting

"""
copyright_line = ""
body = ""
copyright_pattern = re.compile(r"Copyright .*")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we make this one case insensitive? e.g.

Suggested change
copyright_pattern = re.compile(r"Copyright .*")
copyright_pattern = re.compile(r"Copyright .*", flags=re.IGNORECASE)

@vtomole vtomole marked this pull request as draft August 29, 2024 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants