Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change parsing algorithm from download URL to package id+version #343

Merged
merged 5 commits into from Feb 23, 2018

Conversation

skofman1
Copy link
Contributor

A proposal to change the algorithm that parses download URL to Package id+version.
Comes as a bug fix for: NuGet/NuGetGallery#5498

The issue is that version 2.4.0-beta.1.build3958 is not parsed correctly, because the current alg counts the number of '.' in the file name.
Note, that this code doesn't solve all issues (like [.][Number] in the id) - only improved the current state a bit.
The proposal assumes that the version in the URL has to be normalized. It makes sense because this is how we save packages, and the same assumption is used in Gallery to generate URLs to packages.
It brute forces all the version options, and comes up with a list of possible combinations of [id, version] where the version is a legal normalized string.
In this PR, I only log the different combinations if there is more then one, and use the first one.
In the future we should consider a "smart" way to decide which combination is valid. Based on the logs we can determine how frequently we get multiple options, and which solution we need (i.e. how much DB load we should expect if we use Gallery DB).

I still got a bunch of tests to run, but would appreciate your feedback early on.

Copy link
Member

@joelverhagen joelverhagen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense here to hit some other data source to resolve the conflict case multiple package definitions parsed, e.g. HEAD on packages container, gallery DB query etc. It looks like we have improved the parsing a bit but it doesn't resolve the root problem.

Also, do we have data on the number of packages the old algorithm vs. the new algorithm could parse successfully? We should run the two impls over the entire gallery DB. You could even weight the results by download count in DB.


requestUrl = HttpUtility.UrlDecode(requestUrl);

var urlSegments = requestUrl.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a flat container URL, you can use the n-3 and n-2 segments as well to avoid any ambiguity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Was surprised that this wasn't done up until now.

{
var normalizedVersion = parsedVersion.ToNormalizedString();

if (string.Compare(normalizedVersion, versionPart, ignoreCase: true) == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to use Compare here. Equals is a better fit. The == 0 is a bit more cryptic that a boolean returned by Equals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

var urlSegments = requestUrl.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
var fileName = urlSegments.Last();

if (fileName.EndsWith(_nupkgExtension))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant with the check above. Also, do we have a CDN rule making this case insensitive? I think no but it's worth checking if .NUPKG works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


private static bool IsNumeric(string segment)
{
return int.TryParse(segment, out _);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this method used?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was previously used but is no longer...should be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

_logger.LogWarning(LogEvents.MultiplePackageIDVersionParseOptions, message);
}

var packageDefinition = packageDefinitions.First();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we assume the first here? Could all of them be run against the translator?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Translator could potentially de-dupe some PackageDefinitions and then get rid of the need to warn.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at the cases the translator is fixing, those are the cases when a package id ends with [dot][Number].
For example:
"incorrectpackageid": "xrmlibrary.extensionmethods",
"incorrectpackageversionpattern": "2016.(.*)",
"correctedpackageid": "xrmlibrary.extensionmethods.2016",
"correctedpackageversionpattern": "$1"

in order for this to work, the incorrect package id should be without a number. This is exactly what you get in the first result of the new algorithm, meaning, any other check will be redundant and impact perf.

if (packageDefinitions.Count > 1)
{
var message = $"Found multiple parse options for URL {cdnLogEntry.RequestUrl}: {string.Join(", ", packageDefinitions)}";
_logger.LogWarning(LogEvents.MultiplePackageIDVersionParseOptions, message);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we have a metric or Sev4 on this? Seems like a warning we will probably never remember to look for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This data is meant to be used to assess future solutions for resolving ambiguity (like DB check/blob HEAD request). There is no intention to monitor this, since it's not actionable.

Copy link

@scottbommarito scottbommarito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @joelverhagen --this is a step forward but it doesn't resolve the root problem. We should make a simple HEAD request on the feed or gallery DB to resolve conflicts. I don't think it would be too hard to implement or would add too much complexity to the job and it would have enormous benefit to us. Regardless, I would be ok with taking this change on its own and implementing checking the feed in a separate PR. I also agree with @joelverhagen that we should try and run some tests on how this would perform on our existing packages so that we don't break any existing download counts.


var nextDotIndex = fileName.IndexOf('.');

while (nextDotIndex != -1)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure you can write this as a do while and remove the redundancy between

var nextDotIndex = fileName.IndexOf('.');

and

nextDotIndex = fileName.IndexOf('.', nextDotIndex + 1);

e.g.

var nextDotIndex = -1;
do
{
    nextDotIndex = fileName.IndexOf('.', nextDotIndex + 1);
    ... rest of the code ...
} while (nextDotIndex != -1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think it makes a big difference. Will keep as is.


private static bool IsNumeric(string segment)
{
return int.TryParse(segment, out _);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was previously used but is no longer...should be removed.

{
bool translateOccurred = false;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for this variable. Just return true where you set translateOccurred to true and return false at the end of the function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer a method to have a single return statement. Makes the code more readable.

[InlineData("nuget.core", "1.0.1-beta1", "http://localhost/packages/nuget.core.1.0.1-beta1.nupkg")]
[InlineData("nuget.core", "1.0.1-beta1.1", "http://localhost/packages/nuget.core.1.0.1-beta1.1.nupkg")]
[InlineData("nuget.core", "1.0.1+metadata", "http://localhost/packages/nuget.core.1.0.1%2Bmetadata.nupkg")]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we removing tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a package URL contains only normalized strings. I removed all the tests cases that contain non-normalized strings, since the new algorithm uses this assumption.

_logger.LogWarning(LogEvents.MultiplePackageIDVersionParseOptions, message);
}

var packageDefinition = packageDefinitions.First();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Translator could potentially de-dupe some PackageDefinitions and then get rid of the need to warn.

@xavierdecoster
Copy link
Member

I think it makes sense here to hit some other data source to resolve the conflict case multiple package definitions parsed, e.g. HEAD on packages container, gallery DB query etc. It looks like we have improved the parsing a bit but it doesn't resolve the root problem.

The (eventual) fix to reach out to the gallery (db or API) should be insanely fast, and cached in the job client. Also note that we have this job running in parallel. Data ingress is enormous, and there's no room to afford a mix-up really, as that would potentially mean to halt the job, retroactively correct parsed blobs and db records (and potentially rolled-up records!), and play catch-up with incoming data when re-enabling the job once a fix is deployed. In other words, this job should really be resilient against gallery or gallery db outages. Currently, it is, as it has no dependency on it.

In essence, we'll need a circuit-breaker for the newly added dependency to gallery db/api when the root cause is being tackled. In addition, a fresh deployment will have no cache for these auto-corrections, which is a bit of a waste. Would be good if the final solution learned over time. There's already some mechanism built-in for corrective mappings, which could be used as a persisted cache.

@skofman1
Copy link
Contributor Author

Here are some results from comparing the two algorithms. I ran them against all packages in prod.

  1. The old algorithm had 4,442 wrong results.
    wrongresult.txt

  2. The new algorithm had 1,999 wrong results:
    wrongresult.txt

The above numbers don't include packages that are fixed with the translation json.

  1. Run time: running the new algorithm on ~1.4M packages took ~12 seconds vs. the old algorithm that took ~6 seconds. Although the perf is worse, I think the quality of results justifies this.

if (packageDefinitions.Count > 1)
{
var message = $"Found multiple parse options for URL {cdnLogEntry.RequestUrl}: {string.Join(", ", packageDefinitions)}";
_logger.LogWarning(LogEvents.MultiplePackageIDVersionParseOptions, message);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log formatting can be used here, right?

Same thing with line 52

@skofman1
Copy link
Contributor Author

@xavierdecoster , agree. This PR doesn't aim to achieve this eventual fix.

@joelverhagen
Copy link
Member

The results from PROD DB are conclusive to me. :shipit:

@joelverhagen
Copy link
Member

These ones don't work anymore:

[ImisComplexIpart20.2.1.235-EA][20.2.1.235-EA]
[ImisIboNet-20.2.1.235-EA][20.2.1.235-EA]
[ImisIboNet20.2.1.235-EA][20.2.1.235-EA]
[Inovout.Data.1.0.6-beta][1.0.6-beta]
[Inovout.Web.Mvc.1.0.6-beta][1.0.6-beta]
[Magick.NET-7.0.0.18-Q16-x64][1.0.0]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.App][2.0.0-preview1-002111-00]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.App][2.0.0-preview2-25407-01]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.App][2.0.0]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.DotNetAppHost][2.0.0-preview1-002111-00]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.DotNetAppHost][2.0.0-preview2-25407-01]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.DotNetAppHost][2.0.0]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.DotNetHost][2.0.0-preview1-002111-00]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.DotNetHost][2.0.0-preview2-25407-01]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.DotNetHost][2.0.0]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.DotNetHostPolicy][2.0.0-preview1-002111-00]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.DotNetHostPolicy][2.0.0-preview2-25407-01]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.DotNetHostPolicy][2.0.0]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.DotNetHostResolver][2.0.0-preview1-002111-00]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.DotNetHostResolver][2.0.0-preview2-25407-01]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.DotNetHostResolver][2.0.0]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.ILAsm][2.0.0-preview1-25301-02]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.ILAsm][2.0.0-preview2-25407-01]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.ILAsm][2.0.0]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.ILDAsm][2.0.0-preview1-25301-02]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.ILDAsm][2.0.0-preview2-25407-01]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.ILDAsm][2.0.0]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.Jit][2.0.0-preview1-25301-02]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.Jit][2.0.0-preview2-25407-01]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.Jit][2.0.0]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.Native][2.0.0-preview1-25301-02]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.Native][2.0.0-preview2-25407-01]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.Native][2.0.0]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.Runtime.CoreCLR][2.0.0-preview1-25301-02]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.Runtime.CoreCLR][2.0.0-preview2-25407-01]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.Runtime.CoreCLR][2.0.0]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.TestHost][2.0.0-preview1-25301-02]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.TestHost][2.0.0-preview2-25407-01]
[runtime.tizen.4.0.0-armel.Microsoft.NETCore.TestHost][2.0.0]
[test.System.DirectoryServices.4.5.0-preview1-2572][4.5.0-preview1-25726-06]
[test.System.DirectoryServices.AccountManagement.4.5.0-preview1-25726][4.5.0-preview1-25726-06]

If we are regressing in some cases I feel less exciting about this fix.

@skofman1
Copy link
Contributor Author

@joelverhagen , the new algorithm doesn't work well when there's a valid prerelease version as part of the id. In those cases it considers the rest of the string to be part of the prerelease label.
The fact that majority of downloads are from a V3 client (verified for tizen), will mitigate this.

@skofman1 skofman1 merged commit fe2af29 into dev Feb 23, 2018
skofman1 added a commit that referenced this pull request Feb 24, 2018
* [Orchestrator] Add maximum retry count for missing packages (#342)

* Change parsing algorithm from download URL to package id+version (#343)

* Change parsing algorithm from download URL to package id+version

* PR comments

* bug fix

* check failing cases

* Fix id+version parsing issue for Microsoft.VisualStudio.Shell.15.0 (#347)
loic-sharma pushed a commit that referenced this pull request Mar 1, 2018
* Change parsing algorithm from download URL to package id+version

* PR comments

* bug fix

* check failing cases
joelverhagen pushed a commit that referenced this pull request Sep 27, 2019
joelverhagen pushed a commit that referenced this pull request Sep 27, 2019
joelverhagen pushed a commit that referenced this pull request Sep 27, 2019
* Feed2Catalog:  replace inner loops with one outer loop (#329)
Resolve NuGet/Engineering#1526.
* V3:  rename asynchronous IStorage methods (#336)
Progress on NuGet/NuGetGallery#6267.
* Test:  fix flaky basic search tests (#337)
Progress on NuGet/NuGetGallery#6292.
* Add signing to search service (#338)
Progress on https://github.com/NuGet/Engineering/issues/1644
* Test:  add registration tests (#341)
Progress on NuGet/NuGetGallery#6317.
* remove unused scripts (#340)
* MonitoringProcessor:  improve throughput (#342)
Progress on NuGet/NuGetGallery#6327.
* Catalog2Dnx:  improve throughput (#335)
Progress on NuGet/NuGetGallery#6267.
* Functional tests should not be dependent on a specific hash and file size of the test package (#343)
* Add MicroBuild dependency and signing of output DLLs (#345)
Progress on https://github.com/NuGet/Engineering/issues/1644
* Rewrite SafeRoleEnvironment to not use Assembly.LoadWithPartialName (#339)
* Test:  improve and add registration tests (#346)
More progress on NuGet/NuGetGallery#6317.
* Use ServerCommon commit that is consistent with other repositories (#347)
Progress on https://github.com/NuGet/Engineering/issues/1644
* [Monitoring] Ensure packages are signed (#348)
This adds a validation to ensure indexed packages are signed by verifying that the package contains a [package signature file](https://github.com/NuGet/Home/wiki/Package-Signatures-Technical-Details#-the-package-signature-file).
This validation is disabled for now. To enable this validation, the flag `-expectsSignature true` must be added to ng.exe's command line arguments. This validation will be enabled once all packages are repository signed.
Part of NuGet/Engineering#1627
* V3:  make stylistic consistency and cleanup changes (#349)
Progress on NuGet/NuGetGallery#6411.
joelverhagen pushed a commit that referenced this pull request Oct 26, 2020
* Change parsing algorithm from download URL to package id+version

* PR comments

* bug fix

* check failing cases
joelverhagen pushed a commit that referenced this pull request Oct 26, 2020
* [Orchestrator] Add maximum retry count for missing packages (#342)

* Change parsing algorithm from download URL to package id+version (#343)

* Change parsing algorithm from download URL to package id+version

* PR comments

* bug fix

* check failing cases

* Fix id+version parsing issue for Microsoft.VisualStudio.Shell.15.0 (#347)
joelverhagen pushed a commit that referenced this pull request Oct 26, 2020
joelverhagen pushed a commit that referenced this pull request Oct 26, 2020
joelverhagen pushed a commit that referenced this pull request Oct 26, 2020
* Feed2Catalog:  replace inner loops with one outer loop (#329)
Resolve NuGet/Engineering#1526.
* V3:  rename asynchronous IStorage methods (#336)
Progress on NuGet/NuGetGallery#6267.
* Test:  fix flaky basic search tests (#337)
Progress on NuGet/NuGetGallery#6292.
* Add signing to search service (#338)
Progress on https://github.com/NuGet/Engineering/issues/1644
* Test:  add registration tests (#341)
Progress on NuGet/NuGetGallery#6317.
* remove unused scripts (#340)
* MonitoringProcessor:  improve throughput (#342)
Progress on NuGet/NuGetGallery#6327.
* Catalog2Dnx:  improve throughput (#335)
Progress on NuGet/NuGetGallery#6267.
* Functional tests should not be dependent on a specific hash and file size of the test package (#343)
* Add MicroBuild dependency and signing of output DLLs (#345)
Progress on https://github.com/NuGet/Engineering/issues/1644
* Rewrite SafeRoleEnvironment to not use Assembly.LoadWithPartialName (#339)
* Test:  improve and add registration tests (#346)
More progress on NuGet/NuGetGallery#6317.
* Use ServerCommon commit that is consistent with other repositories (#347)
Progress on https://github.com/NuGet/Engineering/issues/1644
* [Monitoring] Ensure packages are signed (#348)
This adds a validation to ensure indexed packages are signed by verifying that the package contains a [package signature file](https://github.com/NuGet/Home/wiki/Package-Signatures-Technical-Details#-the-package-signature-file).
This validation is disabled for now. To enable this validation, the flag `-expectsSignature true` must be added to ng.exe's command line arguments. This validation will be enabled once all packages are repository signed.
Part of NuGet/Engineering#1627
* V3:  make stylistic consistency and cleanup changes (#349)
Progress on NuGet/NuGetGallery#6411.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants