Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CEP] Supporting domains without privilege while GAing feature-flags to paid privileges #32482

Open
sravfeyn opened this issue Jan 12, 2023 · 26 comments
Labels
CEP CommCare Enhancement Proposal

Comments

@sravfeyn
Copy link
Member

sravfeyn commented Jan 12, 2023

Abstract

This proposes a solution on how to migrate feature-flags/toggles (when GA-ing) into paid privileges for domains that may not have a paid plan or are below the plan that the toggle is being upgraded into.

The solution proposes to migrate the StaticToggle document into a new FrozenToggle document (thereby disabling its Edits in the Edit Toggle UI for further edits) and patch domain_has_privilege with a decorator that checks the FrozenToggle for enablement for a given domain.

Motivation

Currently, we are in the process of GA-ing all the feature flags and deleting them after GA. It is often the case that some of the flags need to be converted into a new privilege under one of the paid subscription plans. The domains that have these flags enabled may not be subscribed to these plans and we might want to still enable the functionality provided via the new privilege to these domains.

There is an existing strategy that has been used to do one of migrations such as ERM migration. This involves creating custom software plans and roles to grandfather the domains under the new privilege. While this works well for one-off cases, it has few below drawbacks.

  1. Extra custom subscription (role, plan) model objects will need to be created for each of the domains and privilege to be migrated (potentially around the order of Number of Domains X Number of toggles). This could get more complicated as the toggles will be migrated one after another, hence creating custom plans/roles on top of each other to capture various combinations of the feature flags on various domains.
  2. The process involves multi stage migration.

Given that we want to migrate a large number of toggles, this becomes specially challenging, so a simpler method is desired.

Specification

  1. A new FrozenToggle class will be added with pretty much the same spec as Toggle class except that the 'enabled_users' attribute on FrozenToggle class (which holds enabled domains also) will be frozen for further modifications.
  2. The FrozenToggle document will have an additional attribute to hold the new privilege which the flag will be migrated to.
  3. domain_has_privilege function will be updated to check FrozenToggle document to lookup domains that have the toggle enabled historically.

The migration process involves

  1. removing the StaticToggle for the feature flag,
  2. creating a new FrozenToggle object with the new privilege and
  3. writing a Django migration that converts the Toggle document into FrozenToggle document

The migration can be run as part of the migration inside the deploy. After the deploy the toggle will not be available for editing in the UI. So this solution can be deployed in one go without a manual maintenance.

Impact on users

For the purpose of this CEP, this doesn't impact the users directly.

Impact on hosting

There is no manual migrations to be run, so this will be a simpler process.

Release Timeline

This will be implemented as soon as the it is deemed good enough and other the feedback is integrated.

Open questions and issues

  1. One of the goals of GAing features is to be able to delete all the toggle related code (?). While this solution still requires various Toggle related code in place for now, when we do want to remove the code it will still be possible by converting all the FrozenToggle documents to a simple static JSON documents to lookup the privilege for old domains with historical privilege for a toggle. (Think of FrozenToggle as a place to keep track of a new privilege and enabled-domains, it can exist without any of the toggle related code when/if we want to)
  2. This doesn't/can't address the user only feature-flags. As there are only 7 such feature-flags it will need to be dealt in a different way at a different time.
@sravfeyn sravfeyn added the CEP CommCare Enhancement Proposal label Jan 12, 2023
@sravfeyn
Copy link
Member Author

sravfeyn commented Jan 12, 2023

Please review this CEP, any feedback will be much appreciated. Thanks! @dimagi/team-commcare-hq

@sravfeyn
Copy link
Member Author

Pinging those who might care about this more @snopoke @dannyroberts @biyeun @gherceg @kaapstorm

@millerdev
Copy link
Contributor

It sounds less than ideal for long-term maintainability to use a new toggle type for grandfathered domains. It makes the process for determining if a feature is available more complicated since there would be two code paths to enabling the feature: 1) through the software plan system 2) through the toggle system.

Rather than creating a new toggle type, would it be possible to create a management command to automatically find the domains for which a given toggle is enabled and create custom software plans and roles to grandfather the domains under the new privilege?

The management command would take an argument (the toggle slug). It may need to generate code to be committed to commcare-hq, and it also may need to have a mode to assign the custom software plans in the database once the generated code has been deployed.

@sravfeyn
Copy link
Member Author

It sounds less than ideal for long-term maintainability to use a new toggle type for grandfathered domains.

This is really not a toggle, in that it can be not be enabled/disabled. Abstractly its just a record of domains with a specific privilege.

It makes the process for determining if a feature is available more complicated since there would be two code paths to enabling the feature: 1) through the software plan system 2) through the toggle system.

From a code path point of view, I was planning to patch domain_has_privilege itself and make the toggle part invisible from an abstraction point of view.

would it be possible to create a management command to automatically find the domains for which a given toggle is enabled and create custom software plans and roles to grandfather the domains under the new privilege?

We do have that currently, and its downsides are listed in the Motivation section. The main issue is that there will be too many custom roles and privileges on top of each other as we GA toggles one after another. Apart from that it will involve multistage migration for each GA (migration of toggles to plan and then killing the toggle).

It sounds less than ideal for long-term maintainability to use a new toggle type for grandfathered domains.

I agree, it does. It adds an extra model to replace the existing Toggle but the model is essentially a static record (it doesn't require any UI or CRUD components), so there wouldn't be much code around it. The main advantage of this is that it makes GAing feature flags easier while still keeping the option of migrating even this static data model to custom software plans open.

If we do want to remove this probono privilege for these domains say after an year for not upgrading, this will involve just deleting this data model without again having to edit the software plans for these domains.

@orangejenny
Copy link
Contributor

orangejenny commented Jan 16, 2023

when we do want to remove the code it will still be possible by converting all the FrozenToggle documents to a simple static JSON documents to lookup the privilege for old domains with historical privilege for a toggle

What would this entail? I do think that a major benefit of removing toggles - probably the biggest code benefit - is removing the complexity of toggle checks. I recognize (and applaud!) plans to remove a lot of flags, but losing the code simplification that usually comes with GAing a feature flag is a big loss. Misunderstanding, see further conversation.

@sravfeyn sravfeyn mentioned this issue Jan 17, 2023
4 tasks
@mkangia
Copy link
Contributor

mkangia commented Jan 17, 2023

Currently, we are in the process of GA-ing all the feature flags and deleting them after GA

@sravfeyn where can one read more about this?

@snopoke
Copy link
Contributor

snopoke commented Jan 17, 2023

writing a Django migration that converts the Toggle document into FrozenToggle document

I don't see the need for any kind of data migration. Changing the class type from StaticToggle to FrozenToggle should be enough. The actual data storage layer shouldn't need to change at all.

@biyeun
Copy link
Member

biyeun commented Jan 17, 2023

Agree with Simon's note that a migration from StaticToggle to FrozenToggle isn't needed, as this is data that lives in couch db and the structure itself is not changing.

I also agree with Jenny's comment that the benefit to GAing feature flags is a reduction in code complexity and I would prefer a process that creates custom SoftwarePlans for the grandfathered domains. @gherceg created a very good process with this kind of migration when working on the ERM code. It might be best to chat with him and, ideally, formalize this process in the documentation.

@sravfeyn
Copy link
Member Author

sravfeyn commented Jan 17, 2023

but losing the code simplification that usually comes with GAing a feature flag is a big loss

The toggle related code can all be deleted in this approach. I think I didn't do a good job of explaining the exact strategy particularly the part where it doesn't need any of the Toggle related code which can be cleaned up.

So, to make it easy for others to understand the implications of both I have coded out both the approaches.

The existing approach using custom subscription plans

Here is the code. What this does is below

  1. Adds a management command that updates/creates various subscription models (role, version and plan) to grant the privilege for all domains that have a particular toggle enabled.
  2. The exact details of what gets updated and created can be seen here in ERM migration.

This will end up creating multiple custom plans/versions for each combination of toggles. For example,

  • For domain d1 with flag/privilege a, b a custom plan such as pro_v0_ab may need to be created.
  • For domain d2 with flag/privilege b, c a custom plan such as pro_v0_bc may need to be created
  • For domain d3 with flag/privilege a, c a custom plan such as pro_v0_ac may need to be created.

So a plan needs to exist in every combination that the toggles are enabled for the domains. Further even if domains have same combination of toggles but are on a different plan that will end up creating other plans such as advanced_v0_ab etc. This will end up creating a lot of plans.

The proposed approach of using a new model

Here I have implemented the above described approach. Note that this does not need any Toggle related code at all. All of the toggle code can be cleaned up. This adds a single extra model and modifies domain_has_privilege with couple of lines.

I do agree that looking up the privilege is now ore complicated by having to look at this new model, but this seems much better compared to creating many combinations of custom plans.

Is there a better way?
################

It would be great if those with more accounting app knowledge know of better a way to implement this in a more neat way without the drawbacks of above two approaches.

Meanwhile, implementing the proposed approach would unblock most of the other work related to GAing feature flags. If we find a better method, it should be easy to transition from the proposed approach.

@orangejenny
Copy link
Contributor

@sravfeyn Thank you for the code example - I did misunderstand, disregard my earlier comment.

@sravfeyn
Copy link
Member Author

sravfeyn commented Jan 18, 2023

writing a Django migration that converts the Toggle document into FrozenToggle document

I don't see the need for any kind of data migration. Changing the class type from StaticToggle to FrozenToggle should be enough. The actual data storage layer shouldn't need to change at all.

We will need to store the new privilege on the doc for which a migration becomes necessary. Also, I think the doc_type will still be StaticToggle if not migrated and it will also not allow to cleanup the toggle related code. The other issue is we will need to create a couch view to lookup by privilege. Given all these drawbacks with the FrozenToggle, I have isolated all of this into a separate model that doesn't need any of the toggle code. Here is the approach.

@sravfeyn
Copy link
Member Author

I hope the comments are addressed and it's good to go ahead with this approach

@sravfeyn
Copy link
Member Author

sravfeyn commented Jan 18, 2023

Currently, we are in the process of GA-ing all the feature flags and deleting them after GA

@sravfeyn where can one read more about this?

I am not sure if there is a formal decision/goal around this, but at least SolTech is trying to GA some of feature flags. Some context here https://confluence.dimagi.com/display/GTD/Considerations+for+moving+features+from+Limited+to+General+Availability This is one of the places.

@snopoke
Copy link
Contributor

snopoke commented Jan 18, 2023

writing a Django migration that converts the Toggle document into FrozenToggle document

I don't see the need for any kind of data migration. Changing the class type from StaticToggle to FrozenToggle should be enough. The actual data storage layer shouldn't need to change at all.

We will need to store the new privilege on the doc for which a migration becomes necessary.

There's no need to have that stored in the DB, you could have associate it with the toggle in python.

Also, I think the doc_type will still be StaticToggle if not migrated and it will also not allow to cleanup the toggle related code. The other issue is we will need to create a couch view to lookup by privilege. Given all these drawbacks with the FrozenToggle, I have isolated all of this into a separate model that doesn't need any of the toggle code. Here is the approach.

StaticToggle isn't a couch model, almost all of our interaction with toggles is via these wrapper classes. The actual couch model is just called Toggle. Different wrapper classes offer slightly different functionality (and in some cases even store extra data on the toggle if that's necessary e.g. DynamicallyPredictablyRandomToggle stores the randomness in the DB so that it can be edited at runtime).

For FrozenToggle I expect you can just extend the StaticToggle class and add an attribute for the permission name. The lookup can be done in python since the list of toggles is static (the same way we do lookup by name / slug).

@sravfeyn
Copy link
Member Author

sravfeyn commented Jan 18, 2023

StaticToggle isn't a couch model, almost all of our interaction with toggles is via these wrapper classes.

Right, I meant to say Toggle instead of StaticToggle.

For FrozenToggle I expect you can just extend the StaticToggle class and add an attribute for the permission name. The lookup can be done in python since the list of toggles is static (the same way we do lookup by name / slug).

That's right, but we still need to have the corresponding Toggle doc in the couch and Toggle related code to figure out who has enabled it. The only thing it will accomplish is to disable the UI, whereas with the new model I have added here doesn't require any of the Toggle related code which allows its cleanup latter.

@snopoke
Copy link
Contributor

snopoke commented Jan 18, 2023

That's right, but we still need to have the corresponding Toggle doc in the couch and Toggle related code to figure out who has enabled it. The only thing it will accomplish is to disable the UI, whereas with the new model I have added here doesn't require any of the Toggle related code which allows its cleanup latter.

I don't see the need for that new model at all. You are essentially duplicating the toggle model in SQL. If we keep the underlying couch model as it is we can still do what you need:

class FrozenToggle(StaticToggle):
    privilege_slug: str

    def __init__(self, ..., privilege_slug, ...):
        ...
        self.privilege_slug = privilege_slug

def frozen_toggles_by_privilege():
    return {
        t.privilege_slug: t
        for t in all_toggles_by_name_in_scope(globals(), FrozenToggle).values()
    }


def domain_has_privilege_from_toggle(privilege_slug, domain):
    toggle = frozen_toggles_by_privilege.get(privilege_slug)
    return toggle and toggle.enabled(domain)


FORM_LINK_WORKFLOW = FrozenToggle(...., privilege_slug=privileges.FORM_LINK_WORKFLOW)

@millerdev
Copy link
Contributor

One thing that might improve the case for the FrozenToggle approach would be to give some estimates of the number of new plan versions that would be created if the custom subscription plans approach were adopted instead. This could be done by counting feature flags to be GA'd as well as the domains that currently have them enabled. If it's on the order of tens or even a few hundred then that might not be a big deal, but if it's something like millions then it might have scale implications for the plan version code?

Another small disadvantage to the proposed FrozenToggle is that it will add a new Couch doc type, which will eventually need to be migrated to SQL.

On the other hand, I do see the draw of a simple way to store a list of domains with "grandfathered" permissions. It would be nice if that data was stored in SQL from the beginning.

@biyeun
Copy link
Member

biyeun commented Jan 20, 2023

On the other hand, I do see the draw of a simple way to store a list of domains with "grandfathered" permissions. It would be nice if that data was stored in SQL from the beginning.

Hmmm...I'm intrigued by this thought @millerdev. Perhaps there is a set of grandfathered permissions separate from a SoftwarePlan that is applied to each domain in the migration. We should have some way of alerting accounting admins in the UI of what these grandfathered permissions are.

@gherceg
Copy link
Contributor

gherceg commented Jan 23, 2023

I'm about to go on a bit of a tangent to the current discussion, but it might have an impact on how we address this.

I think the complexity of adding grandfathered privileges to roles is inherent to how our accounting models are structured. However I don't want to rule out the possibility that the perceived complexity is due to my implementation of the ERM migration biasing the perception of how this more general purpose migration command would function, so if you see a flaw in the initial ERM migration logic, specifically with the new custom role names, please call it out.

I understand the desire to avoid separating grandfathered privileges from custom software plans since in theory a grandfathered privilege is the same as a custom privilege, but in practice I think sharing custom software plan roles across different software plans creates too much complexity. As Sravan noted, the concerning part of the ERM migration was the resulting names of these new custom software plan roles (e.g., "Standard Plan (With ERM)"). Obviously a naming scheme like this scales very poorly. As long as we aspire to share custom software plan roles across different software plans, we will aspire to have human readable names that attempt to encompass what is unique about that custom software plan role. In a world where custom software plan roles were not shared but instead tied directly to the software plan version, it would be straightforward to create a new software plan version with the updated custom privileges, and the human readable name would not need to attempt to encompass the privileges of that role.

With that in mind, and assuming what I've said is accurate, I'm in favor of storing grandfathered privileges separately from the role associated with the software plan. As long as we are diligent about only using domain_has_privilege to check that a domain has a privilege, adding the check for grandfathered privileges in that method adds less complexity than continuing to strive for human readable custom software plan role names.

@snopoke
Copy link
Contributor

snopoke commented Jan 24, 2023

Another small disadvantage to the proposed FrozenToggle is that it will add a new Couch doc type, which will eventually need to be migrated to SQL.

Just FYI

@millerdev this won't create a new couch doc type since it is just a wrapper on the same Toggle couch model

@sravfeyn
Copy link
Member Author

sravfeyn commented Jan 27, 2023

One thing that might improve the case for the FrozenToggle approach would be to give some estimates of the number of new plan versions that would be created if the custom subscription plans approach were adopted instead.

This will be in the order of the domains that have toggles enabled multiplied by the number toggles to be GAed. But the main concern here is not the scale, but the complexity of having so many custom plans which should make it difficult for future situations similar to the one we are dealing with. Someone trying to create a custom plan for another use case may have to deal with all these custom plans. This to me indicates that the current accounting modelling is not suited to store this many grandfathered privileges.

However I don't want to rule out the possibility that the perceived complexity is due to my implementation of the ERM migration biasing the perception of how this more general purpose migration command would function, so if you see a flaw in the initial ERM migration logic, specifically with the new custom role names, please call it out.

I don't think the complexity comes out of your way of implementation, regardless of the how the command is implemented, the custom plans and other accounting models do need to be created, I think it comes from the way we model the privileges.

Also, the issue is not merely with the naming scheme either. The plans do need to be created regardless of how we name them, I used custom names (such as custom_plan_ab etc) only to indicate the difference between them, even if we name them all the same thing, the complexity still remains.

I don't see the need for that new model at all. You are essentially duplicating the toggle model in SQL. If we keep the underlying couch model as it is we can still do what you need:

This is indeed the very first approach I tried in the beginning. But it doesn't allow cleaning up the toggle related code (which was also what some of the initial feedback on this CEP advised), so I came up with an independent SQL model that doesn't need any toggle related code.

@sravfeyn
Copy link
Member Author

sravfeyn commented Jan 27, 2023

@snopoke and I discussed the way forward on this over a voice call. We decided to go forward with the light FrozenToggle approach without adding any new data models. The updated code is here. The main argument against FrozenToggle was that we lose the code cleanup that often comes when GAing features. This is correct, however there is a slight clarification.

  1. The approach still cleans up all the code such as toggle.enabled(TOGGLE_TO_BE_GAed)
  2. It is true that this doesn't allow the cleanup of the Toggle framework code, however to cleanup that code requires GAing every toggle which is not the goal currently, and even if we were to GA them all, there will probably be always a use case for a Toggle framework to try out experiments and one-off features, so it doesn't make sense to clean up the Toggle framework code at this point.

@millerdev
Copy link
Contributor

@snopoke and I discussed the way forward on this over a voice call...

The chosen solution does not appear to incorporate some feedback provided in the comments on this CEP. Going to voice with a small subset of CEP respondents means the async participants can't actually fully participate in a consensus decision because part of the discussion wasn't visible to them.

The thing I care most about is the decision not to store this information in SQL with other software plan privilege data. Others may have similar concerns.

@sravfeyn
Copy link
Member Author

sravfeyn commented Feb 1, 2023

The thing I care most about is the decision not to store this information in SQL with other software plan privilege data. Others may have similar concerns.

We didn't create any extra Couch model or SQL model.

The chosen solution does not appear to incorporate some feedback provided in the comments on this CEP.

I believe I have addressed all of the comments, please do let me know if there is something outstanding that's not taken into consideration. Your concern was responded to by Simon in this comment, so I didn't add a redundant response for you.

@millerdev
Copy link
Contributor

millerdev commented Feb 1, 2023

Your response did not convincingly address my comment. I asked for an estimate and you didn't give one.

I also didn't see Simon's comment as a complete refutation of the suggestion to store the data in SQL (where other accounting data is stored), but rather one small point I made that turned out not to be correct.

Edit: while Simon is technically correct that it won't create a new couch doc type, the new field is very similar to a new doc type because it facilitates a new use for the existing doc type, which adds complexity that needs to be understood when that model is eventually migrated to SQL.

@sravfeyn
Copy link
Member Author

sravfeyn commented Feb 2, 2023

Your response did not convincingly address my comment. I asked for an estimate and you didn't give one.

I also didn't see Simon's comment as a complete refutation of the suggestion to store the data in SQL (where other accounting data is stored), but rather one small point I made that turned out not to be correct.

I am happy to discuss your comment over voice to make it more clear. An estimate was not provided because it did not have enough bearing on the decision since it was not about the scale implications as explained in this comment.

Edit: while Simon is technically correct that it won't create a new couch doc type, the new field is very similar to a new doc type because it facilitates a new use for the existing doc type, which adds complexity that needs to be understood when that model is eventually migrated to SQL.

Agreed. But even if I added a new SQL model, it should still be read through and be understood to do a SQL migration. This concern being the valid on either of the approaches, the approach that didn't create a new data model was picked.

Overall, I don't think the approach we took was free of any drawbacks, but in comparison to the other three this seemed better to both Simon and me. Nevertheless, if you feel there is a better approach or more things that we should address here I am happy to setup a call. Please let me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CEP CommCare Enhancement Proposal
Projects
None yet
Development

No branches or pull requests

7 participants