Azure Stack Support #5532

Open
wants to merge 8 commits into
base: main
from

Conversation

patrickdillon

What type of PR is this?
/kind feature

What this PR does / why we need it:
This PR adds infrastructure provisioning support on Azure Stack. Currently Azure Stack is completely unsupported, but with the changes in this PR I was able to fully provision an OpenShift cluster.

This PR adds a new field armEndpoint to the cluster spec, and extends azureEnvironment to accept a new value, HybridEnvironment, to indicate installation to Azure Stack:

armEndpoint: https://management.ashRegion.ashInstance.com
azureEnvironment: HybridEnvironment

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #
#5201

Special notes for your reviewer:

This is a large PR, which I know is not preferred, but I have laid out the commits logically and with messages so they should be easy to follow in that manner. I would be happy to break it up into smaller PRs if that would help.

Furthermore, there were some significant challenges in this implementation. In particular, I could not get tag reconciliation through the tagging service to work: it returned an inscrutable 500 error. Therefore, 88fc6ea skips adding the tagging service for Azure Stack. Perhaps I should do the same for MachinePool?

I was pretty satisfied with how other challenges were addressed, but definitely happy to discuss them. Thanks!

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

Adds support for installing to Azure Stack environments. Users can specify `azureEnvironment: HybridEnvironment` and specify `armEndpoint` in the cluster spec.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Mar 31, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 31, 2025
@k8s-ci-robot
Contributor

Welcome @patrickdillon!

It looks like this is your first PR to kubernetes-sigs/cluster-api-provider-azure 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-provider-azure has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @patrickdillon. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 31, 2025
@k8s-ci-robot k8s-ci-robot requested review from Jont828 and nojnhuh March 31, 2025 18:46
@damdo
Member

damdo commented Apr 1, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 1, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 5, 2025
@patrickdillon
Author

I don't think anybody has looked at this yet, so I went ahead and force-pushed to rebase, and fixed the unit test failure (due to a newly wrapped error).

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 7, 2025
@willie-yao
Contributor

/assign @willie-yao @jackfrancis

codecov bot commented Apr 10, 2025

Codecov Report

Attention: Patch coverage is 21.78218% with 79 lines in your changes missing coverage. Please review.

Project coverage is 52.81%. Comparing base (b18718c) to head (4605c9e).
Report is 9 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| azure/defaults.go | 6.25% | 15 Missing ⚠️ |
| azure/services/virtualmachines/client.go | 0.00% | 10 Missing ⚠️ |
| azure/scope/machine.go | 0.00% | 9 Missing ⚠️ |
| azure/services/availabilitysets/spec.go | 72.72% | 4 Missing and 2 partials ⚠️ |
| azure/services/disks/client.go | 0.00% | 4 Missing ⚠️ |
| azure/errors.go | 0.00% | 3 Missing ⚠️ |
| azure/scope/cluster.go | 40.00% | 3 Missing ⚠️ |
| controllers/azuremachine_reconciler.go | 0.00% | 3 Missing ⚠️ |
| azure/scope/clients.go | 60.00% | 1 Missing and 1 partial ⚠️ |
| azure/services/identities/client.go | 0.00% | 2 Missing ⚠️ |
... and 19 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5532      +/-   ##
==========================================
- Coverage   52.86%   52.81%   -0.05%     
==========================================
  Files         272      272              
  Lines       29474    29520      +46     
==========================================
+ Hits        15582    15592      +10     
- Misses      13080    13113      +33     
- Partials      812      815       +3     


Contributor

@willie-yao willie-yao left a comment

Thanks for your work on this! I just had a few comments that are mostly nitpicky and addressing a need for unit testing. Also, I think adding some documentation relating to this feature would be great!

@@ -48,6 +48,7 @@ type AzureClusterClassSpec struct {
// - GermanCloud: "AzureGermanCloud"
// - PublicCloud: "AzurePublicCloud"
// - USGovernmentCloud: "AzureUSGovernmentCloud"
// - StackCloud: "HybridEnvironment"
Contributor

Should this be renamed to AzureStackCloud to keep it consistent with the other clouds, or is it required to be called "HybridCloud"?

Author

"HybridCloud" is what is returned by the azure autorest package, but I believe it should be possible to allow users to set "AzureStackCloud" in the cluster spec, but we accept "HybridCloud" internally within the code.

Author

I have updated this to AzureStack. This means when users provide their environment json file, they will need to set the name to AzureStack.

@@ -186,6 +193,7 @@ type AzureManagedControlPlaneClassSpec struct {
// - PublicCloud: "AzurePublicCloud"
// - USGovernmentCloud: "AzureUSGovernmentCloud"
//
//
Contributor

Is it intended to add AzureStack to the comment here as well?

Author

I intentionally did not add it here, because I did not think there would be support for Azure Stack with managed control planes. Happy to discuss more, but I don't have a lot of background knowledge on this.

Comment on lines 383 to 397
case StackCloudName:
    cloudEnv, err := azure.EnvironmentFromURL(armEndpoint)
    if err != nil {
        return nil, fmt.Errorf("unable to get Azure Stack cloud environment: %w", err)
    }
    opts.APIVersion = StackAPIVersionProfile
    opts.Cloud = cloud.Configuration{
        ActiveDirectoryAuthorityHost: cloudEnv.ActiveDirectoryEndpoint,
        Services: map[cloud.ServiceName]cloud.ServiceConfiguration{
            cloud.ResourceManager: {
                Audience: cloudEnv.TokenAudience,
                Endpoint: cloudEnv.ResourceManagerEndpoint,
            },
        },
    }
Contributor

Can you add a unit test case for this in TestARMClientOptions?

Comment on lines +98 to +100
if strings.EqualFold(cloudEnvironment, azure.StackCloudName) {
    return ptr.To(int32(2)), nil
}
Contributor

Can you add a unit test case for when cloud environment is Azure Stack to TestParameters in spec_test.go?
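
The fault domain fallback in the snippet above can be sketched in isolation, which also suggests the shape of the requested test case. This is a minimal sketch using only the standard library: `stackCloudName`, `skuLookup`, and `faultDomainCount` are hypothetical stand-ins, and the constant's value is an assumption, since the thread uses several names for the Azure Stack environment.

```go
package main

import (
	"fmt"
	"strings"
)

// stackCloudName is a hypothetical stand-in for azure.StackCloudName.
// The actual value in the PR may differ.
const stackCloudName = "AzureStackCloud"

// defaultStackFaultDomains mirrors the hard-coded fallback of 2 in the diff.
const defaultStackFaultDomains int32 = 2

// skuLookup is a hypothetical stand-in for the Resource SKU cache query.
type skuLookup func() (int32, error)

// faultDomainCount returns the fixed default on Azure Stack, where the
// Resource SKU API may be unavailable, and defers to the lookup elsewhere.
func faultDomainCount(cloudEnvironment string, lookup skuLookup) (int32, error) {
	if strings.EqualFold(cloudEnvironment, stackCloudName) {
		return defaultStackFaultDomains, nil
	}
	return lookup()
}

func main() {
	lookup := func() (int32, error) { return 3, nil }

	n, _ := faultDomainCount("azurestackcloud", lookup)
	fmt.Println("Azure Stack fault domains:", n)

	n, _ = faultDomainCount("AzurePublicCloud", lookup)
	fmt.Println("Public cloud fault domains:", n)
}
```

A table-driven test would only need one extra row for the Azure Stack name, with a lookup that fails the test if it is ever called.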

}
if err != nil {
    return nil, err
}
Contributor

This looks like a good improvement, but is it related to Azure Stack support? What problem is it trying to solve?

Author

Yes, Azure Stack returns a 400 error saying that the force flag is not supported. Should I add a comment?

Author

oops I think I forgot to add the comment. This is specified in the commit message, but I will add a code comment in my next pass.
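
For readers following the thread, the fallback discussed here (retry the VM delete without the force flag when the forced attempt is rejected with a 400) could look roughly like the sketch below. It is not the PR's actual code: `responseError`, `deleteFunc`, and `deleteVM` are hypothetical names, and the real client would inspect the Azure SDK's own error type rather than this stand-in.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// responseError is a hypothetical stand-in for an SDK error that carries
// the HTTP status code of a failed request.
type responseError struct {
	StatusCode int
	Message    string
}

func (e *responseError) Error() string {
	return fmt.Sprintf("%d: %s", e.StatusCode, e.Message)
}

// deleteFunc is a hypothetical VM delete call; force selects forced deletion.
type deleteFunc func(force bool) error

// deleteVM tries a forced delete first. If the endpoint answers with
// HTTP 400, it retries once without the force flag. Any other outcome
// of the first attempt is returned unchanged.
func deleteVM(del deleteFunc) error {
	err := del(true)
	var respErr *responseError
	if errors.As(err, &respErr) && respErr.StatusCode == http.StatusBadRequest {
		return del(false)
	}
	return err
}

func main() {
	// Simulate an endpoint that rejects forced deletion, as described for Azure Stack.
	stack := func(force bool) error {
		if force {
			return &responseError{StatusCode: http.StatusBadRequest, Message: "force delete not supported"}
		}
		return nil
	}
	fmt.Println("delete succeeded:", deleteVM(stack) == nil)
}
```

Keying the retry on the status code alone is a simplification; matching on the error message or code as well would avoid retrying unrelated 400 responses.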

@patrickdillon
Author

@willie-yao thanks for the in-depth review and feedback.

I am just back today from vacation, & will incorporate the changes ASAP.

@nawazkh
Member

nawazkh commented May 7, 2025

Hey @patrickdillon, how is this PR coming along? How can we help you push this forward?

@patrickdillon
Author

Hey @patrickdillon, how is this PR coming along? How can we help you push this forward?

@nawazkh thanks for checking in. I've just started working on the feedback, and should have plenty of time tomorrow to look into it.

@patrickdillon
Author

patrickdillon commented May 9, 2025

Hey @patrickdillon, how is this PR coming along? How can we help you push this forward?

Progress has been slow as I've spent much of the day yak shaving and am still stuck. I have been working with a slightly older version of CAPZ and ASO, and after bumping the CAPZ version and bumping ASO to v2.11.0 I am hitting this error:

DEBUG E0509 18:35:48.735767 90 controller.go:316] "Reconciler error" err="failed to reconcile cluster services: failed to reconcile AzureCluster service group: failed to create resource openshift-cluster-api-guests/padillon05091431-zrk9r-rg (service: group): Internal error occurred: failed calling webhook "default.v1api20200601.resourcegroups.resources.azure.com": failed to call webhook: the server could not find the requested resource" controller="azurecluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster" AzureCluster="openshift-cluster-api-guests/padillon05091431-zrk9r" namespace="openshift-cluster-api-guests" name="padillon05091431-zrk9r" reconcileID="befe0200-5aba-4d8b-bdb2-650a372e172c"

This is happening on public Azure. Trying to figure out if I missed a dependency or something.

edit/update: still troubleshooting this. The best lead I have is perhaps the webhook/controller is not getting registered for default.v1api20200601.resourcegroups.resources.azure.com. I see resources.azure.com/v1api20200601storage get registered, so I am going to follow up on that discrepancy. In the openshift installer we run the manager binaries directly, so like I said this is yak shaving and a complete distraction from the task at hand. Hopefully I will be able to resolve it quickly to make progress.

OK, finally figured it out: I was just using the wrong ASO CRDs 🤦

@willie-yao
Contributor

No worries! Also looks like you need a rebase to fix the CI

Adds AzureStack as a valid cloud environment. The value
"AzureStackCloud" corresponds to the name expected by the azure
autorest package, which will expect an environment configuration
file to be found at a path specified by the environment variable
AZURE_ENVIRONMENT_FILEPATH. See:

https://github.com/Azure/go-autorest/blob/main/autorest/azure/environments.go#L300-L302

Sets ARM Client Options when using the Azure Stack environment.
Sets the APIVersion to a hybrid cloud profile to ensure compatibility
with hybrid environments.

Azure Stack Hub does not support private DNS zones, so skip them.

The Resource SKU API for availability sets may not be available in
an Azure Stack environment. The cache is used to determine the
fault domain count. For Azure Stack, we can default to 2. Future
work could potentially set this programmatically or expose the
fault domain count in the API.

The tag service using the V2 SDK is not available in Azure Stack.
Skip tag reconciliation in Azure Stack environments.

The standard 2020-06-01 API Version is not supported for disk
operations in Azure Stack, so change to the compatible 2018-06-01
profile.

Azure Stack returns a 400 error when trying to delete a VM with
the force flag and the error message suggests retrying without
the flag.

Trying to get tags at scope in Azure Stack results in an error,
so just assume that IPs are managed.
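
Several of the commit messages above follow the same pattern: check whether the cluster targets Azure Stack, then skip a service or pick an older API version. The sketch below illustrates that pattern only. The service names, `servicesFor`, `diskAPIVersion`, and the value of `stackCloudName` are assumptions made for illustration and are not taken from the PR's code.

```go
package main

import (
	"fmt"
	"strings"
)

// stackCloudName is an assumed value for the Azure Stack environment name.
const stackCloudName = "AzureStackCloud"

// isAzureStack reports whether the environment name selects Azure Stack,
// comparing case-insensitively.
func isAzureStack(cloudEnvironment string) bool {
	return strings.EqualFold(cloudEnvironment, stackCloudName)
}

// servicesFor returns hypothetical service names to reconcile, leaving out
// private DNS and tags on Azure Stack as the commit messages describe.
func servicesFor(cloudEnvironment string) []string {
	services := []string{"group", "virtualnetwork", "subnets", "loadbalancers"}
	if !isAzureStack(cloudEnvironment) {
		services = append(services, "privatedns", "tags")
	}
	return services
}

// diskAPIVersion picks the disk API version named in the commit message:
// 2018-06-01 on Azure Stack, 2020-06-01 otherwise.
func diskAPIVersion(cloudEnvironment string) string {
	if isAzureStack(cloudEnvironment) {
		return "2018-06-01"
	}
	return "2020-06-01"
}

func main() {
	for _, env := range []string{"AzurePublicCloud", "AzureStackCloud"} {
		fmt.Println(env, servicesFor(env), diskAPIVersion(env))
	}
}
```

Centralizing the environment check in one helper keeps the per-service conditionals short and makes it easier to cover each branch in unit tests.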
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jackfrancis. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 14, 2025
@patrickdillon
Author

I have now updated and tested the PR to get the environment using EnvironmentFromName. This greatly simplifies this PR (33->12 files changed). I still need to work on adding the tests, and also to update the description of this PR to indicate how the configuration is laid down.

@willie-yao
Contributor

/retest

@k8s-ci-robot
Contributor

k8s-ci-robot commented Jun 6, 2025

@patrickdillon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-cluster-api-provider-azure-test | 170b6f9 | link | true | /test pull-cluster-api-provider-azure-test |
| pull-cluster-api-provider-azure-apiversion-upgrade | 170b6f9 | link | true | /test pull-cluster-api-provider-azure-apiversion-upgrade |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@jackfrancis jackfrancis modified the milestones: v1.20, next Jun 9, 2025
@willie-yao
Contributor

@patrickdillon Is this ready for another round of review? Let me know when you're able to get to the tests! You might also need to rebase as there were quite a few things merged lately.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: Todo
Development

Successfully merging this pull request may close these issues.

7 participants