Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate length of multibyte characters better #24

Merged
merged 3 commits into from
Jul 29, 2019
Merged

Conversation

JaTochNietDan
Copy link
Member

If we length check on just the string when there's some foreign characters in there it can often come back as a length you wouldn't expect. This doesn't make much sense to me for validation purposes, you always want the length of a string to mean x characters, not bytes.

See my Go Playground: https://play.golang.org/p/GNDIG2Tzx4y

I would expect the behaviour of the runed string, not the latter when validating.

If we length check on just the string when there's some foreign characters in there it can often come back as a length you wouldn't expect. This doesn't make much sense to me for validation purposes, you always want the length of a string to mean x characters, not bytes.

See my Go Playground: https://play.golang.org/p/GNDIG2Tzx4y

I would expect the behaviour of the runed string, not the latter when validating.
@codecov-io
Copy link

codecov-io commented Jul 29, 2019

Codecov Report

Merging #24 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted Files Coverage Δ
validate.go 95.04% <100%> (+0.02%) ⬆️

validate.go Outdated
@@ -439,14 +439,16 @@ func (v *Validator) HexColor(key, value string, message ...string) {
func (v *Validator) Len(key, value string, min, max int, message ...string) {
msg := getMessage(message, "")

runeVal := []rune(value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small thing: if you're creating an intermediate variable then you might as well use length := len([]rune(value))? That's only what it's used for.

There is also utf8.RuneCountInString(), which seems a bit faster:

Original:
	BenchmarkLenASCII-4     300000000                4.04 ns/op
	BenchmarkLenMB-4        300000000                4.04 ns/op

Original PR:
	BenchmarkLenASCII-4     30000000                48.1 ns/op
	BenchmarkLenMB-4        20000000                66.3 ns/op

RuneCountInString:
	BenchmarkLenASCII-4     50000000                26.1 ns/op
	BenchmarkLenMB-4        50000000                32.7 ns/op

Not that it really matters all that much in this context, but don't see a reason not to use it?

Full patch with benchmark:

diff --git i/validate.go w/validate.go
index 64bb805..ec9846d 100644
--- i/validate.go
+++ w/validate.go
@@ -52,6 +52,7 @@ import (
 	"strconv"
 	"strings"
 	"time"
+	"unicode/utf8"
 
 	"github.com/teamwork/mailaddress"
 )
@@ -439,16 +440,16 @@ func (v *Validator) HexColor(key, value string, message ...string) {
 func (v *Validator) Len(key, value string, min, max int, message ...string) {
 	msg := getMessage(message, "")
 
-	runeVal := []rune(value)
+	l := utf8.RuneCountInString(value)
 
 	switch {
-	case len(runeVal) < min:
+	case l < min:
 		if msg != "" {
 			v.Append(key, msg)
 		} else {
 			v.Append(key, fmt.Sprintf(MessageLenLonger, min))
 		}
-	case max > 0 && len(runeVal) > max:
+	case max > 0 && l > max:
 		if msg != "" {
 			v.Append(key, msg)
 		} else {
diff --git i/validate_test.go w/validate_test.go
index 476c2be..1319d69 100644
--- i/validate_test.go
+++ w/validate_test.go
@@ -810,3 +810,17 @@ func TestErrorOrNil(t *testing.T) {
 		})
 	}
 }
+
+func BenchmarkLenASCII(b *testing.B) {
+	v := New()
+	for n := 0; n < b.N; n++ {
+		v.Len("x", "Hello, world, it's a sentence!", 0, 40)
+	}
+}
+
+func BenchmarkLenMB(b *testing.B) {
+	v := New()
+	for n := 0; n < b.N; n++ {
+		v.Len("x", "H€llo, world, it's a sentence¡", 0, 40)
+	}
+}

@@ -439,14 +439,16 @@ func (v *Validator) HexColor(key, value string, message ...string) {
func (v *Validator) Len(key, value string, min, max int, message ...string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify the comment that it counts characters, and not byte size? There are use cases for both, so seems to me that it's good to clarify in docs. e.g.

// Len sets the minimum and maximum length for a string.
//
// A maximum of 0 indicates there is no upper limit. The length is counted in
// characters (runes), not bytes.

Copy link
Contributor

@arp242 arp242 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@JaTochNietDan JaTochNietDan merged commit 08bcdb8 into master Jul 29, 2019
@JaTochNietDan JaTochNietDan deleted the validateutf8 branch July 29, 2019 14:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants