Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-dirty serialized columns are being saved anyway #8328

Closed
elado opened this issue Nov 26, 2012 · 63 comments · Fixed by #15674
Closed

Non-dirty serialized columns are being saved anyway #8328

elado opened this issue Nov 26, 2012 · 63 comments · Fixed by #15674

Comments

@elado
Copy link

elado commented Nov 26, 2012

New Rails 3.2.9 app with one model that has a serialized column:

rails new ar_serialized_dirty_test
cd ar_serialized_dirty_test

rails g model Product name:string data:text
rake db:migrate

# product.rb

class Product < ActiveRecord::Base
  serialize :data
end

On every save, this column is being updated, regardless if it was changed or not.

> p = Product.create(name: "A")
  SQL (0.6ms)  INSERT INTO "products" ("created_at", "data", "name", "updated_at") VALUES (?, ?, ?, ?)  [["created_at", Mon, 26 Nov 2012 23:08:26 UTC +00:00], ["data", nil], ["name", "A"], ["updated_at", Mon, 26 Nov 2012 23:08:26 UTC +00:00]]
 => #<Product id: 1, name: "A", data: nil, created_at: "2012-11-26 23:08:26", updated_at: "2012-11-26 23:08:26"> 
> p.save!
   (0.3ms)  UPDATE "products" SET "updated_at" = '2012-11-26 23:08:34.660484', "data" = '---\n...\n' WHERE "products"."id" = 1
 => true 
> p.save!
   (0.3ms)  UPDATE "products" SET "updated_at" = '2012-11-26 23:08:45.777419', "data" = '---\n...\n' WHERE "products"."id" = 1
 => true 
> p.data_changed?
 => false 
> p.save!
   (0.3ms)  UPDATE "products" SET "updated_at" = '2012-11-26 23:13:20.839794', "data" = '---\n...\n' WHERE "products"."id" = 1
@felipecsl
Copy link

+1 for that

@senny
Copy link
Member

senny commented Nov 27, 2012

This is the intended behavior. It was added with this commit 144e869 by @jonleighton . There is even an explantation why it behaves the way it does:

# Serialized attributes should always be written in case they've been
# changed in place.
def keys_for_partial_write
  changed | (attributes.keys & self.class.serialized_attributes.keys)
end

The problem is that serialized data-structures can be changed without using the ActiveRecord setters. Then there is no way to detect that a change actually happend. To work around this problem serialized attributes are written all the time.

@rafaelfranca
Copy link
Member

@senny right. Closing it.

@elado thank you for reporting.

@elado
Copy link
Author

elado commented Nov 28, 2012

Thanks.
Indeed, looks like _changed? method doesn't return true if I change a value within a value, p.data[:x] = 1

What about keeping the original serialized value after fetching and compare it to the current serialized value before saving?

@rafaelfranca
Copy link
Member

@elado I think this is a unnecessary overhead, much more than writing the serialized attribute every time.

@unorthodoxgeek
Copy link

@rafaelfranca think large scale, 1M+ writes per minute to the RDB. on the ruby side, that wouldn't take too much time, but it can bring a really strong SQL machine down on its knees.

there are less expensive ways of doing this than comparing, for instance, extend the serialized object to have a changed attribute accessor and when a setter method is called on it, it will also set that value to true.

@elado
Copy link
Author

elado commented Nov 28, 2012

@rafaelfranca, as @unorthodoxgeek said, it's definitely a performance issue.

Instead of an expensive DB access for no reason, Rails could either serialize the initial value after fetching and compare to it to the current serialized value for dirty check, or have #eql? method in the serializer class (the one that has #load/#dump) that'll compare things by content.

I believe usually the serialized attribute is a Hash or an Array, so #eql? shouldn't be that expensive.

Let's remember this code overhead (not performance overhead) only happens when there are serialized attributes, and current state is that a query is executed every time and this is far longer operation.

Thanks

@senny
Copy link
Member

senny commented Nov 28, 2012

@elado If you can come up with a good PR I think there is nothing that will prevent it from getting merged. As the current behavior is known and expected we don't treat this as a bug though. Feel free to use the Rails Core GoogleGroup to discuss the idea and the change you want to make. Of course you can also just hack up the PR and start a discussion with it.

@rafaelfranca
Copy link
Member

@elado @unorthodoxgeek I see. It make sense.

Mind to open a pull request?

I'm reopening this issue.

@rafaelfranca rafaelfranca reopened this Nov 30, 2012
@jonleighton
Copy link
Member

@tenderlove has talked to me about using Object#hash to determine which attributes have changed. That seems like to way to go to me. Then we don't need special handling for serialized attributes.

@al2o3cr
Copy link
Contributor

al2o3cr commented Dec 2, 2012

The challenge here is how to "save the original" for comparison purposes - just stashing it someplace isn't sufficient. For instance, if you've got a Hash with Array values:

irb: h = { a: [1,2,3], b: [4,5,6] }
===> {:a=>[1, 2, 3], :b=>[4, 5, 6]}
irb: h2 = h.deep_dup
===> {:a=>[1, 2, 3], :b=>[4, 5, 6]}
irb: h2[:a][0] = -1
===> -1
irb: h
===> {:a=>[-1, 2, 3], :b=>[4, 5, 6]}

Neither dup nor clone will do the right thing here either - there is one reliable way to do it, but it's (somewhat) expensive and fails if an object can't be Marshalled:

irb: h = { a: [1,2,3], b: [4,5,6] }
===> {:a=>[1, 2, 3], :b=>[4, 5, 6]}
irb: h_copy = Marshal.load(Marshal.dump(h))
===> {:a=>[1, 2, 3], :b=>[4, 5, 6]}
irb: h_copy[:a][0] = -1
===> -1
irb: h
===> {:a=>[1, 2, 3], :b=>[4, 5, 6]}

There's also the side-issue that such copying would have to be done pessimistically (every time an object is loaded from the DB) or users would likely encounter bugs if they mutate the serialized attribute in-place (since calling some_serialized_attribute_will_change! isn't currently required by the "save serialized things every time" strategy). In particular, users may wind up with silent data loss...

@unorthodoxgeek
Copy link

what about marshalling it and saving an md5 of the output string? should be
a fairly good way of comparing two hard-to-compare objects...

@ragalie
Copy link

ragalie commented Dec 3, 2012

This isn't the ideal solution, but why not note when the serialized attribute is read/written to and only save it in that case. If the attribute hasn't been read it couldn't have been modified.

For applications where the serialized attribute is rarely accessed this would cut down on extraneous SQL calls while avoiding some of complexities in the proposals that try to compare versions of the serialized attribute.

@ndbroadbent
Copy link
Contributor

I agree with @elado's initial suggestion of storing the original unserialized string, but maybe as a MD5 hash. If you have:

class Product < ActiveRecord::Base
  serialize :categories, Array
end

then the original string's MD5 hash could be stored somewhere like @product.unserialized_categories_hash or @product.unserialized_attribute_hashes['categories'], before being parsed.

That would make it easy to detect changes properly, even if the attribute is modified in place.

@darthdeus
Copy link

+1

@njakobsen
Copy link
Contributor

+1

@tenderlove via @jonleighton had the answer. Every object has a .hash method. Call that to see if the serialized hash is different. If so, mark as dirty.

@neerajsingh0101
Copy link

@jonleighton any thoughts on how to retrieve old value so that when p.data_changed? is invoked we can do the Object#hash matching. Looking at the code I could not find any clean way to retrieve the old value when the value is changed in-memory.

@venkatesanmurali
Copy link

So how do we handle the following scenario?

  1. I modify some data on a serialized data. (say, i delete some values in the data array).
  2. I would want to save the data only if its changed. Why should i unnecessarily update a row if there is no value changed. (i m on a loop and only some data will be changed based on condition)

@letronje
Copy link

Is there a way to disable this at a per-model level ?

@njakobsen
Copy link
Contributor

@senny Referring back to your code quote from @jonleighton, I'm not sure why serialized data is so special just because it can be changed in-place. A string can also be changed in place without the changes being detected.

user = User.first
user.first_name.replace('Changed')
user.first_name #=> "Changed"
user.changes #=> {}

@al2o3cr
Copy link
Contributor

al2o3cr commented Aug 22, 2013

@njakobsen - sure, but the situation with serialized fields is different (IMO) for two reasons:

  • in-place modifications are far more likely for serialized data; the first_name.replace example is far less common than something like user.some_serialized_field[:foo] = 'bar'
  • users who are performing in-place modifications on String columns are either already calling user.first_name_will_change! (in the example above) beforehand or are already experiencing bugs. Users who are modifying serialized data in-place are not, since that data is always saved.

@myitcv
Copy link

myitcv commented Aug 22, 2013

Supporting in-place changes feels like it is beyond the scope of this discussion, no? Not least because trying to support it (and being consistent) would require that a shadow .clone always be held for every attribute, serializable or not. There's a clear cost associated with that. Or the ambiguity that is created around saying 'you can modify some attributes in place but not others' if you only support it for serializable attributes - a recipe for more bugs perhaps?

As @jonleighton pointed out, using the .hash 'interface' at least gives a clean, standardised way for comparing equality when it comes to setting an attribute. It would also push the task of .clone (or similar) onto the caller, but this seems like a fair compromise if speed/efficiency is the goal.

So +1 for the use of .hash in determining whether an attribute has changed when it is set via = etc. This would then allow accurate tracking of whether the attribute is dirty or not and therefore short-circuiting of database writes if the attribute hasn't changed.

And then another +1 for spelling out in the docs that in-place changes won't be picked up (particularly here) - I got bitten by this because I (wrongly?) wrapped my obj.save with if obj.changed?

@njakobsen
Copy link
Contributor

@myitcv If the .hash implementation is used, there will be no need to document gotchas with in-place changes since they would be detected.

@jonastemplestein
Copy link

@kbrock storing a hash or even a clone of the serialized value on initialization is fine, if the developer has explicitly opted into the behavior and knows how to turn it off. Another performance implication of dirty tracking serialized columns is the comparison/hash calculation on each save, which you can't avoid anyway. This is likely another reason the current behavior was implemented.

I'll take some time tomorrow to gist this

@mrrooijen
Copy link

What's the status on this issue? This is a terrible issue. I see that @kbrock got #13799 pulled into master, but has anything else changed since then that remedies this? I'm running into race conditions where my worker processes keep reverting user configuration because it keeps saving the CLEAN state of that configuration (serialized attribute) in the background right after a user updated the configuration through the web interface. This bug basically causes data (i.e. user input) to be lost.

@sgrif
Copy link
Contributor

sgrif commented May 31, 2014

This is doable after the refactoring I've done recently. I'll be submitting a fix some time this week.

@sgrif
Copy link
Contributor

sgrif commented Jun 1, 2014

This issue is fixed in #15458

@kbrock
Copy link
Contributor

kbrock commented Jun 16, 2014

thanks @sgrif

@Fryguy
Copy link
Contributor

Fryguy commented Jun 16, 2014

Great work @sgrif ... Glad to see this got merged in.

EDIT: Should really say thanks as well to the many others who contributed to this long-standing issue.

@schmurfy
Copy link

Thanks !
long standing is quite an understatement ;)
glad to see it fixed.

@felixbuenemann
Copy link
Contributor

This is currently broken with serialized binary columns. The changes hash will be populated after load, so it will still always be saved. See #16701 for details.
Update: It's fixed now, thanks @sgrif.

@steakchaser
Copy link

Rails 4.2.0 still seeing this issue when parent -> child with accepts_nested_attributes_for and child record has serialize field with type specified as Array.

If I omit the class_name param from serialize, it works fine. If present, saving the parent causes a save on each child.

@sgrif
Copy link
Contributor

sgrif commented Jan 5, 2015

@steakchaser Can you please open a new issue, with a reproducible script to demonstrate the issue, using this template? Ping me on the issue, and I'll look into it.

@Ajaxy
Copy link

Ajaxy commented Feb 17, 2015

Which rails version should this be fixed in? Still facing in 4.1.6 and 4.2.0

@felixbuenemann
Copy link
Contributor

Should be in 4.2.0, if you're still having problems open a new issue and maybe add a reference to this ticket.

@sgrif
Copy link
Contributor

sgrif commented Feb 17, 2015

There's a few edge cases not handled in 4.2.0 which are fixed on 4.2.1. Please check on the 4-2-stable branch before opening a new issue, and provide a reproduction script using our template here: https://github.com/rails/rails/blob/master/guides/bug_report_templates/active_record_master.rb

@sgrif
Copy link
Contributor

sgrif commented Feb 17, 2015

(In a new issue, not as a comment on this one)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment