
Feature/delete service stats keys generator#74

Merged
eguzki merged 28 commits into feature/delete-service-stats-integration from feature/delete-service-stats-keys-generator
Feb 28, 2019
Conversation

@eguzki
Member

@eguzki eguzki commented Feb 6, 2019

Stats key generator used by Stats::PartitionEraserJob and Stats::PartitionGeneratorJob background jobs

@eguzki eguzki changed the base branch from master to feature/delete-service-stats-integration February 6, 2019 11:39
@eguzki eguzki requested review from davidor and unleashed and removed request for unleashed February 6, 2019 11:40
@eguzki eguzki mentioned this pull request Feb 6, 2019
3 tasks
Comment thread lib/3scale/backend/stats/commons.rb Outdated
Comment thread lib/3scale/backend/stats/commons.rb Outdated
@davidor
Contributor

davidor commented Feb 18, 2019

In general, I found the code a bit hard to follow. Here are a few reasons why I think the code might be more complex than it should be:

  • There are no tests.
  • In my opinion, the way Enumerators are used makes it more difficult to understand the code and to know which keys are generated where.
  • KeyTypesFactory is coupled with DeleteJobDef. I think that the generator of keys should only receive the params it cares about (service, apps, metrics, users, to, from).
  • The code generates invalid keys. It generates invalid response code "buckets" ("1XX", "9XX", etc.), and it also generates keys for specific codes like 200, 403, etc. which are not used.
  • PartitionGenerator receives an instance of KeyGenerator, but it only needs to know the total number of keys. I'd say this is an unnecessary coupling. Moreover, this is a class that only calls .step; I think we should not define a new class just for this.
  • There are a few things that can be simplified. For example, there's a ServiceKeyPartGenerator, but it feels unnecessary to me because, in the job, there's only one service ID. Same for ResponseCodeKeyPartGenerator: the list of "buckets" for response codes is fixed and defined in a constant of the Stats::Commons module.
  • I think that the KeyPartGenerators duplicate code already present in the Period module. For example, the .succ method of that module could be used to replace code in KeyPartGenerators.

@eguzki
Member Author

eguzki commented Feb 21, 2019

Some comments:

There are no tests.

👍 WIP

In my opinion, the way Enumerators are used makes the code more difficult to understand and know which keys are generated where.

👎 The algorithm is challenging. IMO, the Enumerator technique is an elegant way to implement key generators, especially when there are different key types and each key contains several parts, each of which is itself a generator. Let's put off this discussion until you see the tests and have an overview of all the requirements of the algorithm.
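To illustrate the idea (a simplified sketch with hypothetical names, not the PR's actual classes): chained Enumerators let each key part produce its elements on demand, so the full key list never has to live in memory.

```ruby
# Simplified sketch of the Enumerator-based approach (hypothetical names,
# not the PR's classes): key parts feed a lazy key stream.
metrics = %w[10 20]       # part data taken from the job
periods = %w[day month]   # part data taken from a fixed list

keys = Enumerator.new do |yielder|
  metrics.each do |metric_id|
    periods.each do |granularity|
      yielder << "stats/metric:#{metric_id}/#{granularity}"
    end
  end
end

# Keys are produced only as they are requested:
keys.first(3)
# => ["stats/metric:10/day", "stats/metric:10/month", "stats/metric:20/day"]
```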

KeyTypesFactory is coupled with DeleteJobDef. I think that the generator of keys should only receive the params it cares about (service, apps, metrics, users, to, from).

👎 My friend Rubocop would complain about functions with too many arguments. The params are all encapsulated in a job, which is a data container, and each generator takes only the parts it needs from that container.

The code generates invalid keys. It generates invalid response code "buckets" ("1XX", "9XX", etc.), and also, it generates keys for specific codes like 200, 403, etc. which are not used.

👍 Done

@eguzki
Member Author

eguzki commented Feb 21, 2019

more comments

PartitionGenerator receives an instance of KeyGenerator, but it only needs to know the total number of keys. I'd say this is an unnecessary coupling. Moreover, this is a class that only calls .step. I think we should not define a new class just for this.

👍 will be updated.

There are a few things that can be simplified. For example, there's a ServiceKeyPartGenerator but it feels unnecessary to me because, in the job, there's only one service ID. Same for ResponseCodeKeyPartGenerator, the list of "buckets" for response codes is fixed and defined in a constant of the Stats::Commons module.

👍 about ServiceKeyPartGenerator. The service ID could be passed to the key formatters (those building the key string out of all the parts) instead of being another part of the key with its own generator. Will update.

👎 Regarding ResponseCodeKeyPartGenerator, I disagree. This is all about generators. In this specific case, the generator just needs to iterate over a list of response codes, but it is still a generator. The only difference is the source of the list: other generators take their data from the Resque job, while ResponseCodeKeyPartGenerator takes its data from an array defined in Commons. But it still needs to generate items in the same way.

I think that the KeyPartGenerators duplicates code already present in the Period module. For example, the .succ method of that module could be used to replace code in KeyPartGenerators

👎 We need to be very careful. The granularity set applied to a generator depends on the key type. For instance, service-type keys do not have year as a period. We cannot use the .succ method; we need to define closed groups of granularities to be used by each key type. This is done in the Stats::Commons module.
Anyway, if there is another place where we can reuse code from the Period module, I would be happy to hear about it and avoid duplication.
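Those "closed groups of granularities" could be sketched like this (constant and method names are made up for illustration; the real definitions live in Stats::Commons):

```ruby
# Hypothetical granularity groups per key type; service keys omit :year,
# as described above. Names are illustrative, not the real constants.
SERVICE_GRANULARITIES = %i[eternity month week day hour].freeze
EXPANDED_GRANULARITIES = (SERVICE_GRANULARITIES + %i[year]).freeze

def granularities_for(key_type)
  key_type == :service ? SERVICE_GRANULARITIES : EXPANDED_GRANULARITIES
end
```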

@eguzki eguzki force-pushed the feature/delete-service-stats-keys-generator branch from b31d7e4 to f1f2cb5 on February 25, 2019 15:38
@eguzki
Member Author

eguzki commented Feb 26, 2019

That's all there is to ThreeScale::Backend::Stats::DeleteJobDef. A few attributes:

ATTRIBUTES = %i[service_id applications metrics users from to context_info].freeze

Some serialization methods, an internal private validation method, and a run_async method. Maybe run_async should not be part of ThreeScale::Backend::Stats::DeleteJobDef, but this is a very simple class.

I do not see coupling issues.
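As an illustration of how small such a data container can be, here is a sketch using a Struct with the attributes listed above (illustrative only; not the actual DeleteJobDef implementation):

```ruby
require 'json'

# Sketch of a job-definition data container with the attributes listed
# above (hypothetical; not the actual DeleteJobDef implementation).
DeleteJob = Struct.new(:service_id, :applications, :metrics, :users,
                       :from, :to, :context_info, keyword_init: true) do
  # Serialize the job to JSON for enqueueing.
  def to_json(*args)
    to_h.to_json(*args)
  end

  # Rebuild the job from a deserialized hash (string or symbol keys).
  def self.from_hash(hash)
    new(**hash.transform_keys(&:to_sym))
  end
end

job = DeleteJob.new(service_id: '7', applications: %w[1 2], metrics: %w[10],
                    users: [], from: 1, to: 2, context_info: {})
restored = DeleteJob.from_hash(JSON.parse(job.to_json))
```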

Contributor

@mikz mikz left a comment

For an outside contributor, this seems overcomplicated. I don't really see the need for enumerators or generators and factories.

This snippet does 80% of the work with 20% of the code. The only missing part is generating periods, which can still be done quite nicely in a few lines of code.

class SimpleKeyGenerator
  attr_reader :service_id, :applications, :metrics, :users

  def initialize(service_id:, applications: [], metrics: [], users: [], from:, to:)
    @service_id = service_id
    @applications = applications
    @metrics = metrics
    @users = users
    @from = from
    @until = to
  end

  def periods
    {
        hour: [ 'generate hours' ],
        day: [ 'generate days']
    }
  end

  def to_a
    periods.flat_map do |granularity, start_time|
      RESPONSE_CODES.flat_map do |response_code|
        # stats/{service:#{service_id}}/response_code:#{response_code}/#{period_granularity}[:#{period_start_time_compacted_to_seconds}]
        %W[
          stats/{service:#{service_id}}/response_code:#{response_code}/#{granularity}[:#{start_time}]
        ] +

          applications.flat_map do |application_id|
            # stats/{service:#{service_id}}/cinstance:#{application_id}/response_code:#{response_code}/#{period_granularity}[:#{period_start_time_compacted_to_seconds}]
            "stats/{service:#{service_id}}/cinstance:#{application_id}/response_code:#{response_code}/#{granularity}[:#{start_time}]"
          end +

          users.flat_map do |user_id|
            # stats/{service:#{service_id}}/uinstance:#{user_id}/response_code:#{response_code}/#{period_granularity}[:#{period_start_time_compacted_to_seconds}]
            "stats/{service:#{service_id}}/uinstance:#{user_id}/response_code:#{response_code}/#{granularity}[:#{start_time}]"
          end
      end +

        metrics.flat_map do |metric_id|
          # stats/{service:#{service_id}}/metric:#{metric_id}/#{period_granularity}[:#{period_start_time_compacted_to_seconds}]
          %W[
            stats/{service:#{service_id}}/metric:#{metric_id}/#{granularity}[:#{start_time}]
          ] +

            applications.flat_map do |application_id|
              # stats/{service:#{service_id}}/cinstance:#{application_id}/metric:#{metric_id}/#{period_granularity}[:#{period_start_time_compacted_to_seconds}]
              "stats/{service:#{service_id}}/cinstance:#{application_id}/metric:#{metric_id}/#{granularity}[:#{start_time}]"
            end +

            users.flat_map do |user_id|
              # stats/{service:#{service_id}}/uinstance:#{user_id}/metric:#{metric_id}/#{period_granularity}[:#{period_start_time_compacted_to_seconds}]
              "stats/{service:#{service_id}}/uinstance:#{user_id}/metric:#{metric_id}/#{granularity}[:#{start_time}]"
            end
        end
    end
  end
end

Comment thread lib/3scale/backend/stats/commons.rb Outdated
Comment thread lib/3scale/backend/stats/key_part_generators.rb Outdated
@eguzki
Member Author

eguzki commented Feb 27, 2019

@mikz fair enough. I guess this needs some explanation.

My first implementation was the same, using nested iterators and appending results.

applications.flat_map do |app|
  metrics.flat_map do |metric|
    periods.map do |period|
      key_formatter(metric, app, period)
    end
  end
end + ...

You can even use Array#product, although you would then have to load all the keys in memory:

applications.product(metrics, periods).map do |app, metric, period|
  key_formatter(metric, app, period)
end

This is a simple implementation and easy to understand. However, IMO, it is tightly coupled and has no separation of concerns at all: if you ever need to extend it, you have to modify it, which breaks the S, O and D of the SOLID principles. Does it work? Yeah, it does.

So I took a step back and analyzed the problem to give a more generalized solution. We need to generate keys. There are different key types. Each type is identified by a string formatter, and the formatter needs parts to compute the final key. Thus, a key type includes some Key Parts. Elements of the Key Parts are produced by Generators, and not only one: each part can be composed of several Generators. This is the case for the period. A single key part, period, includes part elements from several generators, i.e., year, month, day, etc. So the abstract diagram would be:

- keytype:
    name: A
    parts:
      - apps:
          generators:
            - Apps Generator
      - metrics:
          generators:
            - Metric Generator
      - period:
          generators:
            - Year Generator
            - Month Generator
            - Day Generator
- keytype:
    name: B
    parts:
      - metrics:
          generators:
            - Metric Generator
      - period:
          generators:
            - Month Generator
            - Day Generator

So, based on the concepts of KeyType, KeyPart and Generator, this PR is the implementation of that idea. It is the engine of a KeyGenerator whose scope need not be limited to just stats; we can use this generator for any kind of keys.

Defining new keys or updating existing keys is not a task of developing new nested iterators; it is a definition/configuration task, in a more declarative style. Let's see how we define a KeyType:

response_code_keypart = KeyPart.new(:response_code)           
response_code_keypart << ResponseCodeKeyPartGenerator.new(job)

application_keypart = KeyPart.new(:application)    
application_keypart << AppKeyPartGenerator.new(job)

key_type = KeyType.new(key_formatter)
key_type << response_code_keypart
key_type << application_keypart

This can be improved, no doubt. It should be very easy to implement a factory that generates this code out of a YAML object, for instance, but I thought there was no need for it. Anyway, it is easier to work with key parts and generators than with nested iterators and appended results, which is error-prone.
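For reference, the two classes used in the definition above could look roughly like this (a hypothetical minimal sketch, not the PR's implementation; here plain arrays stand in for generators, and part combinations are streamed through the formatter rather than collected up front):

```ruby
# Minimal sketch of the KeyPart/KeyType idea (hypothetical, not the PR's code).
class KeyPart
  def initialize(name)
    @name = name
    @generators = []
  end

  def <<(generator)
    @generators << generator
    self
  end

  # Yield every element from every generator of this part.
  def each
    return enum_for(:each) unless block_given?
    @generators.each { |g| g.each { |item| yield item } }
  end
end

class KeyType
  def initialize(formatter)
    @formatter = formatter
    @parts = []
  end

  def <<(part)
    @parts << part
    self
  end

  # Combine all parts and stream each combination through the formatter.
  def each
    return enum_for(:each) unless block_given?
    head, *rest = @parts.map { |p| p.each.to_a }
    head.product(*rest) { |combo| yield @formatter.call(*combo) }
  end
end

# Usage: anything that responds to #each can act as a generator.
formatter = ->(metric, granularity) { "stats/metric:#{metric}/#{granularity}" }
metric_part = KeyPart.new(:metric) << %w[10 20]
period_part = KeyPart.new(:period) << %w[day] << %w[month]
key_type = KeyType.new(formatter) << metric_part << period_part
```

Calling `key_type.each` then yields the four formatted keys, one combination at a time.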

Regarding an outside contributor who needs to add a new key type or update an existing one: they only need to work on configuration, not on the implementation details of nested iterators. I would be more than happy to hear the opinion of an outside contributor.

This is the value proposition of this PR. If you do not approve it, it has to be thrown away and reimplemented from scratch. This proposal works: it generates keys in a streaming way instead of loading them all in memory, and it tries to provide a flexible and easy way to define/update keys.

Finally, I propose to go on with this proposal. Once merged, feel free to open a PR with an alternative implementation that does 80% of the work with 20% of the code. I will be more than happy to discuss it.

@mikz
Contributor

mikz commented Feb 27, 2019

@eguzki No matter how you explain it, it is still overcomplicated. There are no new keys being added and there is no need for a generalized solution. The root of all evil is imaginary problems.

Using Array#product would be a nice improvement.

You are right that there is no need for generating this from YAML, just as there is no need for the generators and factories. It is a useless wall of code that is hard to understand and make sense of, when it could be implemented in 15 lines of code.

@eguzki
Member Author

eguzki commented Feb 27, 2019

Using Array#product loads all keys in memory.

300K keys of stats per metric, application, user and year.

@mikz
Contributor

mikz commented Feb 27, 2019

@eguzki from what I see, your enumerators are doing pretty much the same; eventually the array has to be constructed anyway, right?

@eguzki
Member Author

eguzki commented Feb 27, 2019

@mikz enumerators only generate keys as long as you keep requesting them.

@mikz
Contributor

mikz commented Feb 27, 2019

@eguzki sure, but why can't you do that with Array#product? ([ 1 ] * 300).enum_for(:product, [2] * 1000) gives you an enumerator that generates keys as you take them.
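That behavior is easy to check (a small illustration: Array#product with a block yields each combination to the block instead of building the full result array, so wrapping it with enum_for gives a lazy stream):

```ruby
apps    = %w[1 2 3]
metrics = %w[10 20]

# enum_for wraps product in an Enumerator, so combinations are yielded
# one by one as they are requested instead of being materialized at once.
pairs = apps.enum_for(:product, metrics)

pairs.first(3)
# => [["1", "10"], ["1", "20"], ["2", "10"]]
```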

@eguzki
Member Author

eguzki commented Feb 27, 2019

Finally, I propose to go on with this proposal. Once merged, feel free to open PR with alternative implementation that does 80% of the work with 20% of the code. I will be more than happy to discuss about it.

what do other think? @davidor @unleashed

@unleashed
Contributor

@eguzki I don't know whether you saw my comment above, but I don't think this is in a mergeable state and I think we can wait to land this with some degree of consensus. I agree with Michal this looks overly complex for the task and it could also reuse existing code.

@davidor
Contributor

davidor commented Feb 27, 2019

I prefer @mikz 's solution because it can be easily understood. Although I think we'd need to adapt it a bit, at least to use the Key helpers defined in https://github.com/3scale/apisonator/blob/master/lib/3scale/backend/stats/keys.rb or, alternatively, if that module cannot be easily used from the code we need in this PR, it should be adapted.

I think we should adopt @mikz 's solution unless there's a compelling reason against it.

@eguzki
Member Author

eguzki commented Feb 27, 2019

ready for review @mikz @davidor

Comment thread spec/unit/stats/key_generator_spec.rb Outdated
context 'responsecode_service keys' do
  let(:expected_keys) do
    %w[200 2XX 403 404 4XX 500 503 5XX].product(%i[hour day week month eternity]).map do |code, gr|
      ThreeScale::Backend::Stats::Keys.service_response_code_value_key(service_id, code, ThreeScale::Backend::Period[gr].new(from))
Contributor

Would be good to have a comment somewhere explaining that these tests just check that a subset of the keys that are supposed to be generated are there.

Member Author

Each context checks a subset of keys. Is that what you mean?

@davidor
Contributor

davidor commented Feb 28, 2019

@eguzki The code looks good to me 👍

I have just one comment. I saw that in the tests, if I'm not mistaken, you're not verifying that all the keys that should be generated actually were. Also, you're not checking that unnecessary keys were not generated. You're just checking a subset. Any reason to do that? It should probably be documented.

@eguzki
Member Author

eguzki commented Feb 28, 2019

I am verifying that all the keys that should be generated actually are. Am I missing something?

It's true that I am not verifying that unnecessary keys are not generated. Will do it.

@davidor
Contributor

davidor commented Feb 28, 2019

@eguzki Take, for example, context 'responsecode_service keys'. Any one of them passes because all of them follow the same pattern.

expected_keys uses the from parameter but not the to one. So, for example, if the period spans several months, only the first one will appear in the keys in expected_keys.

@eguzki
Member Author

eguzki commented Feb 28, 2019

For simplicity, the time frame spanned one unit: one day, one hour, one month, one year...

I will widen the time window to a month.

@eguzki eguzki requested a review from davidor February 28, 2019 10:56
@eguzki
Member Author

eguzki commented Feb 28, 2019

@davidor tests now check a time window of one month (do you want one year? :=) ) and check that extra keys are not generated.

is_expected.to include(*expected_keys_usage_user)
end
end
context 'usage user keys' do
Contributor

Member Author

@eguzki eguzki Feb 28, 2019


@davidor

RSpec.describe [1, 2, 3] do
  it { is_expected.to contain_exactly(1, 2) }
end

does not pass.

Why are we checking subsets? It is easier to detect errors when one subset is not generated appropriately.

Contributor

Ah right, of course 👍

@davidor
Contributor

davidor commented Feb 28, 2019

@eguzki I think that the contain_exactly matcher could help simplify some of the tests.

Everything else looks good to me. Can you please squash the old commits before merging?

@eguzki
Member Author

eguzki commented Feb 28, 2019

merging to integration branch

@eguzki eguzki merged commit 0e109f1 into feature/delete-service-stats-integration Feb 28, 2019
@bors bors Bot deleted the feature/delete-service-stats-keys-generator branch February 28, 2019 11:12
@eguzki
Member Author

eguzki commented Feb 28, 2019

@davidor forgot to squash the old commits; will do it before opening the PR to master.
