
Capture and Analyse Specific Keyword Sets on top of what Licenses might be found #17

Open
nigellh opened this issue Mar 21, 2023 · 1 comment

Comments

@nigellh

nigellh commented Mar 21, 2023

This is complex and may well take time to flesh out; I will keep tweaking it as I think of more things.

Our understanding is that LS can also be used to scan for specific keywords. This is a high-level set of requirements, so it will probably need to be treated as an Epic.

This will likely need to be split into two phases: Capture and Analysis.

Ignoring license identification for the time being, let's just focus on keywords and use a common case: code that might be identified as third-party but carries no license information.

Copyright: Many files contain a copyright notice but no license information. These need to be investigated and tracked down to see what license applies, to ensure that the code is being used according to that license, and to check that it does not trigger specific criteria from that license, such as copyleft obligations.


Capturing Information


Where a keyword is identified in a file, you will need to see more than just the line the keyword is on. It could be the only word on that line, and without further information you have no context in which to judge whether that keyword is interesting or not.

Where a keyword is found, the tool needs to capture a number of lines of code before and after the keyword line. We have found 20 lines to be the best compromise: not capturing too much information, but having enough to know whether the keyword needs to be taken forward for investigation.

This also means that if you are scanning code that you do not own (e.g. another company's), you are unlikely to infringe their IP with such a limited amount being captured.

Where keywords are close to the top or the bottom of the file, the capture is limited to the lines available up to that cutoff point.

Where keywords are closer together than 20 lines, it is possible that you will get more than 41 lines. For example, if you have two keywords 10 lines apart, you would get the 20 lines before the first, the 20 lines after the second, the 2 lines that contain the keywords, and the 10 lines between them, so it might be 52 lines in this case.

Where a file containing keywords also has long stretches of lines without any, just mark the elided lines outside the captured windows as "- nnn lines deleted", with a couple of blank lines before and after to make the marker stand out.
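The capture behaviour described above can be sketched roughly as follows. This is a hedged illustration, not LS's implementation; the function name, the match representation (0-based line indexes), and the gap-marker format are all assumptions.

```python
def capture_context(lines, match_indexes, context=20):
    """Merge +/-20-line windows around each match and mark elided gaps."""
    if not match_indexes:
        return []
    # Build a window around each match, clamped to the file bounds.
    windows = []
    for i in sorted(match_indexes):
        windows.append((max(0, i - context), min(len(lines) - 1, i + context)))
    # Merge windows that touch or overlap, so two matches 10 lines apart
    # yield one combined block rather than two separate captures.
    merged = [windows[0]]
    for start, end in windows[1:]:
        last_start, last_end = merged[-1]
        if start <= last_end + 1:
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    # Emit the captured lines, with a "- nnn lines deleted" marker
    # (padded by blank lines) wherever a gap was skipped.
    out = []
    prev_end = -1
    for start, end in merged:
        if prev_end >= 0:
            out += ["", f"- {start - prev_end - 1} lines deleted", ""]
        out += lines[start:end + 1]
        prev_end = end
    return out
```

Merging touching windows is what produces the "more than 41 lines" case: two keywords 10 lines apart collapse into a single block instead of two overlapping ones.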


Every keyword needs the discovery location of each file it is found in to be listed. This can be hundreds or perhaps thousands of files.


Keywords will have different contexts. With the exception of URLs, a keyword will be used in a number of different places, so you do not want to select one keyword and list every file it is in. Some smaller sub-context within the 41 captured lines will make the keyword distinct from the other places it is found. For example:

Copyright Apple
Copyright Microsoft
Copyright IBM

Each of these has the same keyword, but different context. So each of these will need to be captured as separate 'instances' of the same keyword.
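Grouping hits into per-context instances might look something like the sketch below. The function name and the shape of the input tuples are hypothetical; the point is only that the grouping key combines the keyword with its sub-context, not the keyword alone.

```python
from collections import defaultdict

def group_instances(matches):
    """Group (keyword, context_line, path) hits so that e.g.
    'Copyright Apple' and 'Copyright IBM' become separate instances
    of the same keyword, each with its own list of discovery locations."""
    instances = defaultdict(list)
    for keyword, context_line, path in matches:
        instances[(keyword, context_line.strip())].append(path)
    return instances
```

Keying on the context line (rather than the bare keyword) is what keeps the Apple, Microsoft, and IBM copyrights apart while still tying each instance back to every file it was discovered in.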

URLs should just be listed as the URL itself, e.g. www.microsoft.com

Note that a URL still needs the +/- 20 lines of context.

Note it needs the full URL to be captured, not just the base URL. For example:

www.microsoft.com
www.microsoft.com/docs/windows-setup
www.microsoft.com/docs/windows-setup#L132-L814
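A sketch of capturing the full URL rather than just the base host. The regex is an assumption, not the pattern LS uses; it simply extends the match through the path and fragment until whitespace or a closing delimiter.

```python
import re

# Match http(s) or bare www. URLs, keeping path, query, and fragment.
URL_RE = re.compile(r"(?:https?://|www\.)[^\s\"'<>)]+")

def find_urls(text):
    """Return every full URL found in the text."""
    return URL_RE.findall(text)
```

With this approach the three examples above are captured as three distinct URLs, since the path and `#L132-L814` fragment are part of the match.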


There is a need to ensure that false positives can be 'learnt' or 'specified' in some way. For example, in code that processes licenses, you will end up with hundreds or thousands of field names with 'license' in them, for example:

start.license.field

If we were to capture every one of these, it would significantly increase the number of irrelevant keywords found.

There probably need to be rules for what counts as a 'real' keyword. For example:

  • The word is at the beginning of a line and has a '.' or '. ' or ', ' (or other punctuation such as ':', ';', etc.) after it.
  • Space before and after it

False positive rules:

  • Punctuation character on either side of it.

(Above lists to be added to)
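The rules above might be approximated by a boundary check like the following. This is a rough sketch under stated assumptions: the exact boundary rules are still to be decided, and the heuristic here (reject a keyword glued to surrounding text, accept one bounded by whitespace or sentence punctuation) is one possible reading of the lists above.

```python
import re

def is_real_hit(line, keyword):
    """Return True if any occurrence of keyword in line passes the
    boundary heuristic sketched in the rules above."""
    for m in re.finditer(re.escape(keyword), line, re.IGNORECASE):
        before = line[m.start() - 1] if m.start() > 0 else ""
        tail = line[m.end():]
        # False-positive rule: keyword glued to surrounding text with
        # no boundary, e.g. start.license.field or license_key.
        if before and not before.isspace():
            continue
        if tail and not (tail[0].isspace() or re.match(r"[.,:;]($|\s)", tail)):
            continue
        # Real-keyword rule: whitespace (or line start) before, and
        # whitespace, line end, or sentence punctuation after.
        return True
    return False
```

This rejects `start.license.field` (glued on both sides) while still accepting `license.` at the end of a sentence, which is the distinction the two rule lists are trying to draw.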



Analysing Information


Each keyword needs to be filterable, so that you can show all of the different Copyright or License instances that have been found.


Each entry needs to be markable as 'Not Interesting' (a false positive) or 'Interesting' (needs further investigation).


We need to be able to list all the discovery locations, select one, and see the full context around the keyword that was discovered in that file.


We need to be able to capture information about the keyword and where it was found. Such as:

  • Keyword
  • Filename
  • Discovery locations
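One way the captured information above might be structured is a simple record per keyword instance. The field names and the status values are illustrative assumptions, not part of LS.

```python
from dataclasses import dataclass, field

@dataclass
class KeywordHit:
    """Hypothetical record for one keyword instance."""
    keyword: str                      # e.g. "Copyright"
    context: str                      # e.g. "Copyright Apple"
    status: str = "uninvestigated"    # later set to "Interesting" / "Not Interesting"
    discovery_locations: list = field(default_factory=list)  # (filename, line) pairs
```

Keeping the status on the record is what would let the Analysis phase filter on a keyword and mark each instance independently.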

We need to be able to copy information from any of the contexts for further investigation.


More to come.

@markstur
Contributor

Thank you @nigellh
