Permalink
Browse files

fix issues with encoding caused by using literals

  • Loading branch information...
1 parent ef3e164 commit 511c37fa4ef8a4ea653b9b6c99beec390fde9c78 @JonRowe committed Mar 14, 2013
Showing with 96 additions and 20 deletions.
  1. +39 −20 lib/diff/lcs/hunk.rb
  2. +57 −0 spec/hunk_spec.rb
View
@@ -120,19 +120,19 @@ def old_diff
# Calculate item number range. Old diff range is just like a context
# diff range, except the ranges are on one line with the action between
# them.
- s = "#{context_range(:old)}#{op_act[block.op]}#{context_range(:new)}\n"
+ s = encode("#{context_range(:old)}#{op_act[block.op]}#{context_range(:new)}\n")
# If removing anything, just print out all the remove lines in the hunk
# which is just all the remove lines in the block.
- @data_old[@start_old .. @end_old].each { |e| s << "< #{e}\n" } unless block.remove.empty?
- s << "---\n" if block.op == "!"
- @data_new[@start_new .. @end_new].each { |e| s << "> #{e}\n" } unless block.insert.empty?
+ @data_old[@start_old .. @end_old].each { |e| s << encode("< ")+e+encode("\n") } unless block.remove.empty?
+ s << encode("---\n") if block.op == "!"
+ @data_new[@start_new .. @end_new].each { |e| s << encode("> ")+e+encode("\n") } unless block.insert.empty?
s
end
private :old_diff
def unified_diff
# Calculate item number range.
- s = "@@ -#{unified_range(:old)} +#{unified_range(:new)} @@\n"
+ s = encode("@@ -#{unified_range(:old)} +#{unified_range(:new)} @@\n")
# Outlist starts containing the hunk of the old file. Removing an item
# just means putting a '-' in front of it. Inserting an item requires
@@ -145,54 +145,54 @@ def unified_diff
# file -- don't take removed items into account.
lo, hi, num_added, num_removed = @start_old, @end_old, 0, 0
- outlist = @data_old[lo .. hi].collect { |e| e.gsub(/^/, ' ') }
+ outlist = @data_old[lo .. hi].collect { |e| match_encoding_gsub(e,'^', ' ') }
@blocks.each do |block|
block.remove.each do |item|
op = item.action.to_s # -
offset = item.position - lo + num_added
- outlist[offset].gsub!(/^ /, op.to_s)
+ match_encoding_gsub!(outlist[offset],'^ ', op.to_s)
num_removed += 1
end
block.insert.each do |item|
op = item.action.to_s # +
offset = item.position - @start_new + num_removed
- outlist[offset, 0] = "#{op}#{@data_new[item.position]}"
+ outlist[offset, 0] = encode(op)+@data_new[item.position]
num_added += 1
end
end
- s << outlist.join("\n")
+ s << outlist.join(encode("\n"))
end
private :unified_diff
def context_diff
- s = "***************\n"
- s << "*** #{context_range(:old)} ****\n"
+ s = encode("***************\n")
+ s << encode("*** #{context_range(:old)} ****\n")
r = context_range(:new)
# Print out file 1 part for each block in context diff format if there
# are any blocks that remove items
lo, hi = @start_old, @end_old
removes = @blocks.select { |e| not e.remove.empty? }
if removes
- outlist = @data_old[lo .. hi].collect { |e| e.gsub(/^/, ' ') }
+ outlist = @data_old[lo .. hi].collect { |e| match_encoding_gsub(e,'^', ' ') }
removes.each do |block|
block.remove.each do |item|
- outlist[item.position - lo].gsub!(/^ /) { block.op } # - or !
+ match_encoding_gsub!( outlist[item.position - lo], '^ ') { block.op } # - or !
end
end
s << outlist.join("\n")
end
- s << "\n--- #{r} ----\n"
+ s << encode("\n--- #{r} ----\n")
lo, hi = @start_new, @end_new
inserts = @blocks.select { |e| not e.insert.empty? }
if inserts
- outlist = @data_new[lo .. hi].collect { |e| e.gsub(/^/, ' ') }
+ outlist = @data_new[lo .. hi].collect { |e| match_encoding_gsub(e,'^', ' ') }
inserts.each do |block|
block.insert.each do |item|
- outlist[item.position - lo].gsub!(/^ /) { block.op } # + or !
+ match_encoding_gsub!( outlist[item.position - lo], '^ ') { block.op } # + or !
end
end
s << outlist.join("\n")
@@ -206,14 +206,14 @@ def ed_diff(format)
warn "Expecting only one block in an old diff hunk!" if @blocks.size > 1
if format == :reverse_ed
- s = "#{op_act[@blocks[0].op]}#{context_range(:old)}\n"
+ s = encode("#{op_act[@blocks[0].op]}#{context_range(:old)}\n")
else
- s = "#{context_range(:old).gsub(/,/, ' ')}#{op_act[@blocks[0].op]}\n"
+ s = encode("#{match_encoding_gsub(context_range(:old), ',', ' ')}#{op_act[@blocks[0].op]}\n")
end
unless @blocks[0].insert.empty?
- @data_new[@start_new .. @end_new].each { |e| s << "#{e}\n" }
- s << ".\n"
+ @data_new[@start_new .. @end_new].each { |e| s << e+encode("\n") }
@halostatue
halostatue Mar 15, 2013

I prefer spaces around operators:

-      @data_new[@start_new .. @end_new].each { |e| s << e+encode("\n") }
+      @data_new[@start_new .. @end_new].each { |e| s << e + encode("\n") }
+ s << encode(".\n")
end
s
end
@@ -249,4 +249,23 @@ def unified_range(mode)
(length == 1) ? "#{first}" : "#{first},#{length}"
end
private :unified_range
+
+ def encode(literal)
@halostatue
halostatue Mar 15, 2013

Shouldn't this be private, too?

@JonRowe
JonRowe Mar 15, 2013 Owner

Yep, sorry, I normally stick them under one private declaration but was matching your style (and forgot this one)

+ literal.encode @data_old[0].encoding
+ end
+ def encode_to(string, args)
+ args.map { |arg| arg.encode(string.encoding) }
+ end
+ private :encode_to
+
+ def match_encoding_gsub(string, *args, &block)
+ string.gsub( *encode_to(string,args), &block )
+ end
+ private :match_encoding_gsub
+
+ def match_encoding_gsub!(string, *args, &block)
+ string.gsub!( *encode_to(string,args), &block )
@halostatue
halostatue Mar 15, 2013

Both here and 262, I prefer not having spaces inside parenthesis. I know, I know. It doesn't match my brace style, but 😼

-    string.gsub!( *encode_to(string,args), &block )
+    string.gsub!(*encode_to(string,args), &block)
+ end
+ private :match_encoding_gsub!
+
end
View
@@ -0,0 +1,57 @@
+# -*- ruby encoding: utf-8 -*-
+
+require 'spec_helper'
+
+describe "Diff::LCS::Hunk" do
+
+ let(:old_data) { ["Tu avec carté {count} itém has".encode('UTF-16LE')] }
+ let(:new_data) { ["Tu avec carte {count} item has".encode('UTF-16LE')] }
+ let(:peices) { Diff::LCS.diff old_data, new_data }
+ let(:hunk) { Diff::LCS::Hunk.new(old_data, new_data, peices[0], 3, 0) }
+
+ it 'should be able to produce a unified diff from the two peices' do
+ expected =
+(<<-EOD.encode('UTF-16LE').chomp)
+@@ -1,2 +1,2 @@
+Tu avec carté {count} itém has
++Tu avec carte {count} item has
+EOD
+ expect(hunk.diff(:unified).to_s == expected).to eql true
+ end
+
+ it 'should be able to produce a context diff from the two peices' do
+ expected =
+(<<-EOD.encode('UTF-16LE').chomp)
+***************
+*** 1,2 ****
+Tu avec carté {count} itém has
+--- 1,2 ----
+Tu avec carte {count} item has
+EOD
+ expect(hunk.diff(:context).to_s == expected).to eql true
+ end
+
+ it 'should be able to produce an old diff from the two peices' do
+ expected =
+(<<-EOD.encode('UTF-16LE').chomp)
+1,2c1,2
+< Tu avec carté {count} itém has
+---
+> Tu avec carte {count} item has
+
+EOD
+ expect(hunk.diff(:old).to_s == expected).to eql true
+ end
+
+ it 'should be able to produce a reverse ed diff from the two peices' do
+ expected =
+(<<-EOD.encode('UTF-16LE').chomp)
+c1,2
+Tu avec carte {count} item has
+.
+
+EOD
+ expect(hunk.diff(:reverse_ed).to_s == expected).to eql true
+ end
+
+end

2 comments on commit 511c37f

@halostatue

Good idea, and a generally good implementation. I look forward to being able to merge this…except that it breaks under Ruby 1.8.7. At this point, I'm not willing to drop 1.8 support until I'm ready to release diff-lcs 2.0, and I've got some other changes I want to make before that point (especially considering that the code for this hasn't yet been written).

I don't mind saying that this is a 1.9+ only feature, but until I am ready to drop support for Ruby 1.8, this has to be implemented in a version-compatible way.

A couple of other thoughts while looking over the changes:

  1. It seems to me that if there is an encoding change between set A and set B, we might want to report that. I'm not sure how/if we can do that given that we're using standard diff-style output—but it would be nice to have that information (and is probably something worth considering in the core diff-lcs library, not just the diff-style outputted).
  2. I'm not that comfortable with the 'magic' of always using @data_old[0].encoding in #encode. If we're going to do that, we should probably set something like @diff_output_encoding early in the process; this will also help in the future if I decide to move to an enumerator-style interface for this (not out of the question, since the current implementation requires that both left and right corpuses be completely in memory, whereas we might be able to do something more efficient with enumerators at the cost of accuracy).

I would prefer the signature for #encode either include the encoding explicitly or rely on @diff_output_encoding. The latter is potentially very interesting, as it would allow us to consider outputting the diff itself in an entirely different encoding (e.g., the data involved is UTF-16LE, but our terminal is only UTF-8).

I've made a couple of stylistic points in-line, but I like the idea behind this change, and I can see merging and releasing this with the fix for 1.8.7.

@JonRowe
Owner

I've made the suggested stylistic points. The approach I suggested to @myronmarston for RSpec was just to report that the encoding differed and is thus un-diffable, it's hard to do real comparison because of issues of conversion (not all charsets will convert) and what not.

I've changed the magic encoding to be setup early on so it could be overridden, the issue with converting the output to a 'desired' format is again that not all encoding is transferable, the work I've done merely prevents errors from using literals not consistent with the input data. You could transcode the output before returning it to the terminal if they were compatible charsets I suppose...

Totally mean to but totally forgot to include the 1.8.7 hack that we use on RSpec to avoid the encoding issue, I've copied the method used there across.

Please sign in to comment.