Skip to content
This repository has been archived by the owner on Mar 16, 2021. It is now read-only.

Catalog2Dnx: improve throughput #335

Merged
merged 1 commit into from Aug 22, 2018
Merged

Catalog2Dnx: improve throughput #335

merged 1 commit into from Aug 22, 2018

Conversation

dtivel
Copy link
Contributor

@dtivel dtivel commented Aug 9, 2018

Progress on NuGet/NuGetGallery#6267.

This is an optimization for Azure storage-based usage of Catalog2Dnx. This change:

  1. Downloads a package's .nuspec file from blob storage without downloading the entire package.
  2. Performs server-side package copies instead of client-side copies.

This results in huge savings in time and disk I/O.

Catalog2Dnx will fall back to downloading a package via HTTP if necessary.

On a local test machine with a China storage destination I observed a ~60% improvement in processing throughput with ~4,800 packages following production package size and package entry count distributions to the nearest percentile.

Notes:

  1. Previously, Catalog2Dnx downloaded a package, extracted the .nuspec, then uploaded both to flat container. The download and upload were atomic, especially around blob properties. To achieve the same eventual results, Catalog2Dnx's package sync/no-op check now includes a destination blob properties check.
  2. Microsoft.Azure.Storage.DataMovement 0.6.0 requires WindowsAzure.Storage 8.1.4 or later. All projects in the solution reference WindowsAzure.Storage 8.2.1, except NuGet.Services.Metadata.Catalog.Monitoring which referenced WindowsAzure.Storage 7.1.2. However, NuGet.Services.Metadata.Catalog.Monitoring references another project in the solution, so I updated NuGet.Services.Metadata.Catalog.Monitoring to use the same WindowsAzure.Storage 8.2.1. This updated some other dependencies.

@@ -36,6 +40,8 @@ public class DnxCatalogCollector : CommitCollector
: base(index, telemetryService, handlerFunc, httpClientTimeout)
{
_storageFactory = storageFactory ?? throw new ArgumentNullException(nameof(storageFactory));
_sourceStorage = preferredPackageSourceStorage;
Copy link
Contributor

@skofman1 skofman1 Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null checks, unless it'sm ok for it to be null

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's OK to be null.

await TransferManager.CopyAsync(copiableContent.SourceBlob, blob, isServiceCopy: true);

// Copying (above) did not apply these properties so apply them again.
blob.Properties.ContentType = content.ContentType;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to apply all properties? (future proofing)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the only properties we set today are ContentType and CacheControl.

var destinationStorage = _storageFactory.Create(packageId);
var destinationUri = destinationStorage.GetUri(DnxMaker.GetRelativeAddressNupkg(packageId, packageVersion));

var isNupkgSynchronized = await destinationStorage.AreSynchronized(sourceUri, destinationUri);
var isPackageInIndex = await _dnxMaker.HasPackageInIndexAsync(destinationStorage, packageId, packageVersion, cancellationToken);
var areRequiredPropertiesPresent = await AreRequiredPropertiesPresentAsync(destinationStorage, destinationUri);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AreRequiredPropertiesPresentAsync [](start = 65, length = 33)

Out of curiosity, why is this needed? Is this so allow us to fix packages that are missing properties by reflowing? Do we have cases of blobs in the flat container without these properties?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we set blob properties and upload blob content in one operation. As part of my changes I don't want to introduce scenarios where a blob could be uploaded but not have the correct blob properties (say, because of an exception).

The properties may be different or may have different values in flat container.

@@ -173,6 +195,70 @@ private async Task UpdatePackageVersionIndexAsync(IEnumerable<CatalogEntry> cata
string version,
Uri sourceUri,
CancellationToken cancellationToken)
{
var shouldUseSourceStorage = _sourceStorage != null;
var sourceBlob = await GetBlockBlobAsync(id, version) as AzureCloudBlockBlob;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetBlockBlobAsync [](start = 35, length = 17)

What do you think of renaming GetBlockBlobAsync to something like GetSourceBlobOrNullAsync to clearly indicate that:

  1. It gets the source blob
  2. It returns null if the blob couldn't be fetched

Also, it seems a little weird to call GetBlockBlobAsync after checking that _sourceStorage isn't null. What do you think of removing the _sourceStorage check from GetBlockBlobAsync and checking _sourceStorage here instead?

throw new InvalidOperationException("Only Azure storage support has been implemented.");
}

var normalizedVersion = NuGetVersionUtility.NormalizeVersion(packageVersion);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to normalized at this level? Can't we make this the caller's concern and renamed the parameter to normalizedVersion?

DnxConstants.ApplicationOctetStreamContentType,
DnxConstants.DefaultCacheControl);

await storage.Save(nupkgUri, content, cancellationToken);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this Save? The implementation must check types to make this work. Seems like a broken abstraction. Can't we have a Copy method that only knows how to copy? And let Save only upload?

// single GET request returns the whole blob in a consistent state, but we're reading the blob many
// different times. To detect the blob changing between reads, we check the ETag again later.
// If the ETag's differ, we'll fall back to using a single HTTP GET request.
var etag = sourceBlob.ETag;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only necessary in the off chance that we have to catalog2dnx instances running? Or is there some other way to have someone else updating the blob?

Copy link
Member

@joelverhagen joelverhagen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⌚️

Seems like the copy abstraction is wrong here. The as plus the InvalidOperationException in GetContentStream are smelly.


if (string.IsNullOrEmpty(nuspec))
{
_logger.LogWarning("No .nuspec available for {Id}/{Version}. Skipping.", id, version);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping [](start = 81, length = 8)

Is Skipping correct? It looks like this package will be processed by ProcessPackageDetailsViaHttpAsync. If so, how about Falling back to HTTP processing.?

@@ -13,35 +13,30 @@ namespace NuGet.Services.Metadata.Catalog.Persistence
{
public sealed class AzureCloudBlockBlob : ICloudBlockBlob
{
private readonly CloudBlockBlob _blob;
public CloudBlockBlob Blob { get; }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super nit: usually properties are placed after the constructors. Same for ETag and Uri.

@dtivel dtivel force-pushed the dev-dtivel-c2d branch 3 times, most recently from 10b9fef to 19625cc Compare August 15, 2018 16:43
Copy link
Member

@joelverhagen joelverhagen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@dtivel dtivel merged commit 606ca27 into dev Aug 22, 2018
@dtivel dtivel deleted the dev-dtivel-c2d branch August 22, 2018 19:04
ryuyu added a commit that referenced this pull request Sep 11, 2018
* Feed2Catalog:  replace inner loops with one outer loop (#329)
Resolve NuGet/Engineering#1526.
* V3:  rename asynchronous IStorage methods (#336)
Progress on NuGet/NuGetGallery#6267.
* Test:  fix flaky basic search tests (#337)
Progress on NuGet/NuGetGallery#6292.
* Add signing to search service (#338)
Progress on https://github.com/NuGet/Engineering/issues/1644
* Test:  add registration tests (#341)
Progress on NuGet/NuGetGallery#6317.
* remove unused scripts (#340)
* MonitoringProcessor:  improve throughput (#342)
Progress on NuGet/NuGetGallery#6327.
* Catalog2Dnx:  improve throughput (#335)
Progress on NuGet/NuGetGallery#6267.
* Functional tests should not be dependent on a specific hash and file size of the test package (#343)
* Add MicroBuild dependency and signing of output DLLs (#345)
Progress on https://github.com/NuGet/Engineering/issues/1644
* Rewrite SafeRoleEnvironment to not use Assembly.LoadWithPartialName (#339)
* Test:  improve and add registration tests (#346)
More progress on NuGet/NuGetGallery#6317.
* Use ServerCommon commit that is consistent with other repositories (#347)
Progress on https://github.com/NuGet/Engineering/issues/1644
* [Monitoring] Ensure packages are signed (#348)
This adds a validation to ensure indexed packages are signed by verifying that the package contains a [package signature file](https://github.com/NuGet/Home/wiki/Package-Signatures-Technical-Details#-the-package-signature-file).
This validation is disabled for now. To enable this validation, the flag `-expectsSignature true` must be added to ng.exe's command line arguments. This validation will be enabled once all packages are repository signed.
Part of NuGet/Engineering#1627
* V3:  make stylistic consistency and cleanup changes (#349)
Progress on NuGet/NuGetGallery#6411.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
4 participants