
Managing schema registry's schema references #49

Conversation

FrancescoPessina
Contributor

This PR enables Avro schema parsing with the Schema Registry's new schema references feature.

@Strech
Owner

Strech commented Nov 4, 2020

Hi @FrancescoPessina, thanks a lot for your PR, I appreciate it. I've checked the proposed solution and I think I discovered a bit more. First of all, to save space, references in the target schema will not be de-referenced by the registry; this means that if you apply the same approach as I did, you will inject the referenced schemas in, and the result can/will potentially be registered as a new schema version with a de-referenced schema.

I think, since schema references are being introduced and Avrora already has something similar, we should converge on a common behavior that works for both cases.

As a guess, we can keep the reference but internally store the schema as de-referenced, and in case we can't have a reference (old Confluent) we can inject it.

@FrancescoPessina
Contributor Author

@Strech thank you for your reply :) I'm sorry, but I don't understand what you are proposing. The approach in this PR takes the schema with the reference and de-references it only to parse messages read from Kafka.
If a schema does not contain a reference, the approach is the same as before.

How would you change this approach?

@Strech
Owner

Strech commented Nov 5, 2020

@FrancescoPessina The issue will happen with the following sequence:

  1. You load a schema with references from the registry
  2. You decode a message, so far so good
  3. You encode a message – auto-registration happens and the schema is different from the original (nested schemas were injected)
  4. A new ambiguous version will be registered, and that's the issue

My proposal is to keep track of references without de-referencing.

  1. We have the JSON schema and we also have the erlavro schema storage. To be able to decode a message you in fact need to have all types inside erlavro, so I suggest stuffing them in there instead of altering the original JSON schema
  2. We would need to slightly adjust registration to keep the original JSON schema and nested schemas in mind (should they also be registered?)
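
The first point above could be sketched roughly like this — a sketch of the idea, not the final design (`referenced_json`, `original_json`, and the overall flow are assumptions; only `:avro_schema_store.add_type/2` and `:avro_json_decoder.decode_schema/2` are the erlavro calls already used in this codebase):

```elixir
# Sketch: keep the original JSON untouched, but make referenced types
# decodable by adding them to the erlavro store directly.
store = :avro_schema_store.new()

# 1. Parse the referenced schema and add it, so its types become known
referenced = :avro_json_decoder.decode_schema(referenced_json, allow_bad_references: true)
store = :avro_schema_store.add_type(referenced, store)

# 2. Parse the referencing schema as-is — its JSON stays intact, and
#    decoding works because the store already knows the nested types
schema = :avro_json_decoder.decode_schema(original_json, allow_bad_references: true)
store = :avro_schema_store.add_type(schema, store)
```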

@FrancescoPessina
Contributor Author

FrancescoPessina commented Nov 5, 2020

Ok, I've understood a bit better :)
On the "Java" side, the problem of registration is "solved" by Confluent by setting the following properties of the Avro serializer:

  • auto.register.schemas=false
  • use.latest.version=true

This disables auto-registration and forces use of the latest version of the Avro schema. Could we implement something similar here in Avrora?

Still in Java, the Avro Maven plugin works the same way as this PR: to decode Avro messages it uses the de-referenced schema (via the import feature), but registration is managed by Confluent's Schema Registry Maven plugin, which registers schemas with references. This way schemas are registered during the CI pipeline and not automatically when a message is produced.

About your first point, how can I prevent this instruction from failing when a JSON with a reference is received?
:avro_json_decoder.decode_schema(payload, allow_bad_references: true)

The problem is that the decoder looks for a type (the referenced type) which is not a native type, so I think avro_json_decoder cannot properly decode the message.

On the registration side, yes, this should be fixed a bit. Both referenced and referencing schemas should be registered.

@Strech
Owner

Strech commented Nov 5, 2020

@FrancescoPessina I've read about references a bit more, and here is what I can confirm:

On the "Java" side, the problem of the registration is "solved" by Confluent setting the following properties of the Avro serializer:

auto.register.schemas=false
use.latest.version=true

This is already possible with the current setting registry_schemas_autoreg: false
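
For context, that switch lives in the application config; a minimal sketch of what disabling auto-registration might look like in `config/config.exs` (key names assumed to follow Avrora's documented configuration; the URL is a placeholder):

```elixir
# config/config.exs — a sketch, not a verified snippet
config :avrora,
  registry_url: "http://localhost:8081",
  # analogous to Confluent's auto.register.schemas=false on the Java side
  registry_schemas_autoreg: false
```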

About your first point, how can I prevent this instruction to fail when is received a Json with the reference?
:avro_json_decoder.decode_schema(payload, allow_bad_references: true)

You are right, and it's already used here:

avrora/lib/avrora/schema.ex

Lines 159 to 164 in 81e1831

defp do_parse(payload) do
  {:ok, :avro_json_decoder.decode_schema(payload, allow_bad_references: true)}
rescue
  error in ArgumentError -> {:error, error.message}
  error in ErlangError -> {:error, error.original}
end

The problem is that the decoder looks for a type (the type referenced) which is not a native type, so I think the avro_json_decoder cannot properly decode the message.

So, I refreshed my memory of how references are handled now. During parsing, the schema collects all references and then resolves them; I think this is a more seamless way to add the functionality you have built:

avrora/lib/avrora/schema.ex

Lines 125 to 133 in 81e1831

with {:ok, schema} <- do_parse(payload),
     {:ok, references} <- ReferenceCollector.collect(schema),
     lookup_table <- :avro_schema_store.add_type(schema, lookup_table) do
  payloads =
    references
    |> Enum.reject(&:avro_schema_store.lookup_type(&1, lookup_table))
    |> Enum.map(fn reference ->
      reference |> reference_lookup_fun.() |> unwrap!()
    end)

If we can collect references anyway, we can also resolve them via the registry, I guess. The solution is still not fully pieced together in my head, but I think we can leverage the existing mechanism or enhance it. Then it should work for both cases, maybe with some new (or existing) settings.


UPD1: Here Schema.parse is used, maybe a potential place for the reference lookup?

{:ok, schema} <- Schema.parse(schema) do

@FrancescoPessina
Contributor Author

@Strech ok, I dug a bit more into the code and understood how the reference lookup works.
There is one issue, which I'll try to explain to you, regarding this function:

defp do_collect({:avro_record_type, _, namespace, _, aliases, fields, fullname, _}) do

This collects the references contained in a schema by inspecting the schema itself. For example, from this schema:

{
  "type": "record",
  "name": "Account",
  "namespace": "io.confluent",
  "aliases": ["Profile"],
  "fields": [
    {
      "name": "payment_history",
      "type": "io.confluent.PaymentHistory"
    },
    {
      "name": "messenger",
      "type": "io.confluent.Messenger"
    },
    {
      "name": "emails",
      "type": {
        "type": "map",
        "values": "io.confluent.Email"
      }
    },
    {
      "name": "settings",
      "type": {
        "type": "map",
        "values": {
          "type": "record",
          "name": "Value",
          "fields": [
            {
              "name": "value",
              "type": "string"
            }
          ]
        }
      }
    }
  ]
}

the references extracted will be ["io.confluent.Email", "io.confluent.Messenger", "io.confluent.PaymentHistory"].
It seems fine, and with this approach we could ignore what the CSR passes us in the references attribute of the response; in the reference_lookup function we could call the schema registry and fetch the referenced schemas, like this:

def reference_lookup(r) do
  Logger.info("Called reference lookup with reference: " <> inspect(r))
  Avrora.Storage.Registry.get(r)
end

But there is one problem: the subject name could be slightly different from the Avro record name. For example, using the TopicRecordNameStrategy (see https://www.confluent.io/blog/put-several-event-types-kafka-topic/), the subject name would be something like <topic-name>-io.confluent.Email. In this case our reference_lookup function would fail, because the subject it searches for (io.confluent.Email) does not exist in the CSR (the actual subject name is <topic-name>-io.confluent.Email).

So, we have to pass down to Schema.parse, in some way, the data contained in the references attribute of the CSR response.
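
To make the naming mismatch concrete, a hypothetical sketch (the module, the `topic` argument, and the translation are illustrative, not Avrora's API) of how a reference fullname maps to a CSR subject under TopicRecordNameStrategy:

```elixir
# Hypothetical sketch: under TopicRecordNameStrategy the subject is
# <topic>-<record fullname>, so a bare fullname must be translated
# before querying the registry.
defmodule SubjectName do
  def for_reference(topic, fullname), do: "#{topic}-#{fullname}"
end

# SubjectName.for_reference("payments", "io.confluent.Email")
# yields "payments-io.confluent.Email" — the subject the registry
# actually knows, while the schema itself only says "io.confluent.Email"
```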

@Strech
Owner

Strech commented Nov 6, 2020

@RafaelCamarda I think I have an idea.

Since the reference lookup will be defined in the schema registry storage, we can use a closure to resolve the naming issue and not expose the references anywhere; the potential code might look like this:

def get(key) when is_binary(key) do
  with {:ok, schema_name} <- Name.parse(key),
       {name, version} <- {schema_name.name, schema_name.version || "latest"},
       {:ok, response} <- http_client_get("subjects/#{name}/versions/#{version}"),
       {:ok, id} <- Map.fetch(response, "id"),
       # Meta-code of extracting references
       {:ok, references} <- Map.fetch(response, "$ref"),
       {:ok, version} <- Map.fetch(response, "version"),
       {:ok, schema} <- Map.fetch(response, "schema") do

    # Meta-code of mapping a subject name to a reference name
    references = %{
      "io.confluent.Payment" => "topic-io.confluent.Payment"
    }

    lookup_function = fn r ->
      Logger.info("Called reference lookup with reference: " <> inspect(r))
      # Meta-code of getting the real reference subject name
      Avrora.Storage.Registry.get(Map.get(references, r))
    end

    {:ok, schema} = Schema.parse(schema)
    Logger.debug("obtaining schema `#{schema_name.name}` with version `#{version}`")

    {:ok, %{schema | id: id, version: version}}
  end
end

This approach allows us to keep the reference knowledge within Storage.SchemaRegistry (where it is supposed to be), and at the same time we re-use the existing referencing code.

@FrancescoPessina
Contributor Author

FrancescoPessina commented Nov 8, 2020

@Strech I pushed a new PR (#50) developing your idea :) I'll close this one.
