New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change parsing algorithm from download URL to package id+version #343
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense here to hit some other data source to resolve the conflict case multiple package definitions parsed, e.g. HEAD on packages container, gallery DB query etc. It looks like we have improved the parsing a bit but it doesn't resolve the root problem.
Also, do we have data on the number of packages the old algorithm vs. the new algorithm could parse successfully? We should run the two impls over the entire gallery DB. You could even weight the results by download count in DB.
|
||
requestUrl = HttpUtility.UrlDecode(requestUrl); | ||
|
||
var urlSegments = requestUrl.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is a flat container URL, you can use the n-3 and n-2 segments as well to avoid any ambiguity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Was surprised that this wasn't done up until now.
{ | ||
var normalizedVersion = parsedVersion.ToNormalizedString(); | ||
|
||
if (string.Compare(normalizedVersion, versionPart, ignoreCase: true) == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to use Compare here. Equals is a better fit. The == 0 is a bit more cryptic that a boolean returned by Equals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
var urlSegments = requestUrl.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries); | ||
var fileName = urlSegments.Last(); | ||
|
||
if (fileName.EndsWith(_nupkgExtension)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Redundant with the check above. Also, do we have a CDN rule making this case insensitive? I think no but it's worth checking if .NUPKG works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
private static bool IsNumeric(string segment) | ||
{ | ||
return int.TryParse(segment, out _); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this method used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was previously used but is no longer...should be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
_logger.LogWarning(LogEvents.MultiplePackageIDVersionParseOptions, message); | ||
} | ||
|
||
var packageDefinition = packageDefinitions.First(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we assume the first here? Could all of them be run against the translator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Translator could potentially de-dupe some PackageDefinition
s and then get rid of the need to warn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you look at the cases the translator is fixing, those are the cases when a package id ends with [dot][Number].
For example:
"incorrectpackageid": "xrmlibrary.extensionmethods",
"incorrectpackageversionpattern": "2016.(.*)",
"correctedpackageid": "xrmlibrary.extensionmethods.2016",
"correctedpackageversionpattern": "$1"
in order for this to work, the incorrect package id should be without a number. This is exactly what you get in the first result of the new algorithm, meaning, any other check will be redundant and impact perf.
if (packageDefinitions.Count > 1) | ||
{ | ||
var message = $"Found multiple parse options for URL {cdnLogEntry.RequestUrl}: {string.Join(", ", packageDefinitions)}"; | ||
_logger.LogWarning(LogEvents.MultiplePackageIDVersionParseOptions, message); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will we have a metric or Sev4 on this? Seems like a warning we will probably never remember to look for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This data is meant to be used to assess future solutions for resolving ambiguity (like DB check/blob HEAD request). There is no intention to monitor this, since it's not actionable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @joelverhagen --this is a step forward but it doesn't resolve the root problem. We should make a simple HEAD request on the feed or gallery DB to resolve conflicts. I don't think it would be too hard to implement or would add too much complexity to the job and it would have enormous benefit to us. Regardless, I would be ok with taking this change on its own and implementing checking the feed in a separate PR. I also agree with @joelverhagen that we should try and run some tests on how this would perform on our existing packages so that we don't break any existing download counts.
|
||
var nextDotIndex = fileName.IndexOf('.'); | ||
|
||
while (nextDotIndex != -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure you can write this as a do while
and remove the redundancy between
var nextDotIndex = fileName.IndexOf('.');
and
nextDotIndex = fileName.IndexOf('.', nextDotIndex + 1);
e.g.
var nextDotIndex = -1;
do
{
nextDotIndex = fileName.IndexOf('.', nextDotIndex + 1);
... rest of the code ...
} while (nextDotIndex != -1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think it makes a big difference. Will keep as is.
|
||
private static bool IsNumeric(string segment) | ||
{ | ||
return int.TryParse(segment, out _); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was previously used but is no longer...should be removed.
{ | ||
bool translateOccurred = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for this variable. Just return true where you set translateOccurred
to true and return false at the end of the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer a method to have a single return statement. Makes the code more readable.
[InlineData("nuget.core", "1.0.1-beta1", "http://localhost/packages/nuget.core.1.0.1-beta1.nupkg")] | ||
[InlineData("nuget.core", "1.0.1-beta1.1", "http://localhost/packages/nuget.core.1.0.1-beta1.1.nupkg")] | ||
[InlineData("nuget.core", "1.0.1+metadata", "http://localhost/packages/nuget.core.1.0.1%2Bmetadata.nupkg")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we removing tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a package URL contains only normalized strings. I removed all the tests cases that contain non-normalized strings, since the new algorithm uses this assumption.
_logger.LogWarning(LogEvents.MultiplePackageIDVersionParseOptions, message); | ||
} | ||
|
||
var packageDefinition = packageDefinitions.First(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Translator could potentially de-dupe some PackageDefinition
s and then get rid of the need to warn.
The (eventual) fix to reach out to the gallery (db or API) should be insanely fast, and cached in the job client. Also note that we have this job running in parallel. Data ingress is enormous, and there's no room to afford a mix-up really, as that would potentially mean to halt the job, retroactively correct parsed blobs and db records (and potentially rolled-up records!), and play catch-up with incoming data when re-enabling the job once a fix is deployed. In other words, this job should really be resilient against gallery or gallery db outages. Currently, it is, as it has no dependency on it. In essence, we'll need a circuit-breaker for the newly added dependency to gallery db/api when the root cause is being tackled. In addition, a fresh deployment will have no cache for these auto-corrections, which is a bit of a waste. Would be good if the final solution learned over time. There's already some mechanism built-in for corrective mappings, which could be used as a persisted cache. |
Here are some results from comparing the two algorithms. I ran them against all packages in prod.
The above numbers don't include packages that are fixed with the translation json.
|
if (packageDefinitions.Count > 1) | ||
{ | ||
var message = $"Found multiple parse options for URL {cdnLogEntry.RequestUrl}: {string.Join(", ", packageDefinitions)}"; | ||
_logger.LogWarning(LogEvents.MultiplePackageIDVersionParseOptions, message); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log formatting can be used here, right?
Same thing with line 52
@xavierdecoster , agree. This PR doesn't aim to achieve this eventual fix. |
The results from PROD DB are conclusive to me. |
These ones don't work anymore:
If we are regressing in some cases I feel less exciting about this fix. |
@joelverhagen , the new algorithm doesn't work well when there's a valid prerelease version as part of the id. In those cases it considers the rest of the string to be part of the prerelease label. |
* [Orchestrator] Add maximum retry count for missing packages (#342) * Change parsing algorithm from download URL to package id+version (#343) * Change parsing algorithm from download URL to package id+version * PR comments * bug fix * check failing cases * Fix id+version parsing issue for Microsoft.VisualStudio.Shell.15.0 (#347)
* Change parsing algorithm from download URL to package id+version * PR comments * bug fix * check failing cases
…size of the test package (#343)
…size of the test package (#343)
* Feed2Catalog: replace inner loops with one outer loop (#329) Resolve NuGet/Engineering#1526. * V3: rename asynchronous IStorage methods (#336) Progress on NuGet/NuGetGallery#6267. * Test: fix flaky basic search tests (#337) Progress on NuGet/NuGetGallery#6292. * Add signing to search service (#338) Progress on https://github.com/NuGet/Engineering/issues/1644 * Test: add registration tests (#341) Progress on NuGet/NuGetGallery#6317. * remove unused scripts (#340) * MonitoringProcessor: improve throughput (#342) Progress on NuGet/NuGetGallery#6327. * Catalog2Dnx: improve throughput (#335) Progress on NuGet/NuGetGallery#6267. * Functional tests should not be dependent on a specific hash and file size of the test package (#343) * Add MicroBuild dependency and signing of output DLLs (#345) Progress on https://github.com/NuGet/Engineering/issues/1644 * Rewrite SafeRoleEnvironment to not use Assembly.LoadWithPartialName (#339) * Test: improve and add registration tests (#346) More progress on NuGet/NuGetGallery#6317. * Use ServerCommon commit that is consistent with other repositories (#347) Progress on https://github.com/NuGet/Engineering/issues/1644 * [Monitoring] Ensure packages are signed (#348) This adds a validation to ensure indexed packages are signed by verifying that the package contains a [package signature file](https://github.com/NuGet/Home/wiki/Package-Signatures-Technical-Details#-the-package-signature-file). This validation is disabled for now. To enable this validation, the flag `-expectsSignature true` must be added to ng.exe's command line arguments. This validation will be enabled once all packages are repository signed. Part of NuGet/Engineering#1627 * V3: make stylistic consistency and cleanup changes (#349) Progress on NuGet/NuGetGallery#6411.
* Change parsing algorithm from download URL to package id+version * PR comments * bug fix * check failing cases
* [Orchestrator] Add maximum retry count for missing packages (#342) * Change parsing algorithm from download URL to package id+version (#343) * Change parsing algorithm from download URL to package id+version * PR comments * bug fix * check failing cases * Fix id+version parsing issue for Microsoft.VisualStudio.Shell.15.0 (#347)
…size of the test package (#343)
…size of the test package (#343)
* Feed2Catalog: replace inner loops with one outer loop (#329) Resolve NuGet/Engineering#1526. * V3: rename asynchronous IStorage methods (#336) Progress on NuGet/NuGetGallery#6267. * Test: fix flaky basic search tests (#337) Progress on NuGet/NuGetGallery#6292. * Add signing to search service (#338) Progress on https://github.com/NuGet/Engineering/issues/1644 * Test: add registration tests (#341) Progress on NuGet/NuGetGallery#6317. * remove unused scripts (#340) * MonitoringProcessor: improve throughput (#342) Progress on NuGet/NuGetGallery#6327. * Catalog2Dnx: improve throughput (#335) Progress on NuGet/NuGetGallery#6267. * Functional tests should not be dependent on a specific hash and file size of the test package (#343) * Add MicroBuild dependency and signing of output DLLs (#345) Progress on https://github.com/NuGet/Engineering/issues/1644 * Rewrite SafeRoleEnvironment to not use Assembly.LoadWithPartialName (#339) * Test: improve and add registration tests (#346) More progress on NuGet/NuGetGallery#6317. * Use ServerCommon commit that is consistent with other repositories (#347) Progress on https://github.com/NuGet/Engineering/issues/1644 * [Monitoring] Ensure packages are signed (#348) This adds a validation to ensure indexed packages are signed by verifying that the package contains a [package signature file](https://github.com/NuGet/Home/wiki/Package-Signatures-Technical-Details#-the-package-signature-file). This validation is disabled for now. To enable this validation, the flag `-expectsSignature true` must be added to ng.exe's command line arguments. This validation will be enabled once all packages are repository signed. Part of NuGet/Engineering#1627 * V3: make stylistic consistency and cleanup changes (#349) Progress on NuGet/NuGetGallery#6411.
A proposal to change the algorithm that parses download URL to Package id+version.
Comes as a bug fix for: NuGet/NuGetGallery#5498
The issue is that version 2.4.0-beta.1.build3958 is not parsed correctly, because the current alg counts the number of '.' in the file name.
Note, that this code doesn't solve all issues (like [.][Number] in the id) - only improved the current state a bit.
The proposal assumes that the version in the URL has to be normalized. It makes sense because this is how we save packages, and the same assumption is used in Gallery to generate URLs to packages.
It brute forces all the version options, and comes up with a list of possible combinations of [id, version] where the version is a legal normalized string.
In this PR, I only log the different combinations if there is more then one, and use the first one.
In the future we should consider a "smart" way to decide which combination is valid. Based on the logs we can determine how frequently we get multiple options, and which solution we need (i.e. how much DB load we should expect if we use Gallery DB).
I still got a bunch of tests to run, but would appreciate your feedback early on.