Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use VARCHAR( n char) instead of VARCHAR(n) so semantics are used and not... #222

Closed
wants to merge 1 commit into from

Conversation

davidkarlsen
Copy link
Contributor

@davidkarlsen davidkarlsen commented Sep 2, 2013

... number of bytes. This is better when using multibyte charsets like UTF8

https://jira.springsource.org/browse/BATCH-2091


This change is Reviewable

…not number of bytes. This is better when using multibyte charsets like UTF8
@davidkarlsen
Copy link
Contributor Author

ping?

@davidkarlsen
Copy link
Contributor Author

Why don't you comment on this or pick it up?

JOB_NAME VARCHAR2(100) NOT NULL,
JOB_KEY VARCHAR2(32) NOT NULL,
JOB_NAME VARCHAR2(100 char) NOT NULL,
JOB_KEY VARCHAR2(32 char) NOT NULL,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a string generated by batch so I'm not sure we would need this feature here.

@mminella
Copy link
Member

I've made a couple notes. Add to those, I'm currently looking to see if any of the other databases we support have a similar function. I'd rather make a change like this all at once.

@pivotal-issuemaster
Copy link

@davidkarlsen Please sign the Contributor License Agreement!

Click here to manually synchronize the status of this Pull Request.

See the FAQ for frequently asked questions.

@davidkarlsen
Copy link
Contributor Author

@pivotal-issuemaster done!

@pivotal-issuemaster
Copy link

@davidkarlsen Thank you for signing the Contributor License Agreement!

@fmbenhassine
Copy link
Contributor

There is an open issue for this: https://jira.spring.io/browse/BATCH-2750

We will consider merging this PR in the upcoming v4.1.

@fmbenhassine
Copy link
Contributor

I'm currently looking to see if any of the other databases we support have a similar function. I'd rather make a change like this all at once.

I checked the data types for each database provider we support (except the embedded ones) and most of them use characters by default:

However, some providers use bytes by default:

Moreover, this default can vary from one version to another for each database provider. For example, the default has changed for MySQL between version 4 and 5 as per: https://stackoverflow.com/questions/1997540/mysql-varchar-lengths-and-utf-8.

I think it is almost unmanageable to test the DDL against each version of each database provider. So in hindsight, I'm not sure if this kind of issues should be fixed in the DDL or left as an adjustment to the user, as said in Appendix A.9:

Many users find that simply changing the schema to double the length of the VARCHAR columns is enough

This appendix clearly mentions that some adjustments (column lengths, additional indexes, etc) might be necessary if the default DDL is not enough.

For some providers like DB2, it is not possible to specify if the size of the VARCHAR is in bytes or chars (like for Oracle with VARCHAR2(30 char), only bytes are accepted). So if this issue happens to a DB2 user, he has to increase the size of the column, we can do nothing to "fix" our DDL.

BTW, even the fix in this PR may not be sufficient: Passing the EXIT_MESSAGE from EXIT_MESSAGE VARCHAR2(2500) to EXIT_MESSAGE VARCHAR2(2500 char) may still fail if the stack trace exceeds 2500 characters. What do we do in that case? update the DDL again to increase the size in chars? That would not solve the problem. So I think this kind of adjustments should be left to the user.

@davidkarlsen @mminella Do you agree?

@mminella
Copy link
Member

mminella commented Oct 3, 2018

Spring Batch also has code in it that truncates values before they are inserted in specific columns. I agree that this kind of thing is better off left to users to customize on their own. We could add a note in the docs explicitly about the idea that they may need to increase field sizes when using multibyte characters.

@fmbenhassine
Copy link
Contributor

Thank you Michael. This section of the docs is explicit about multi-byte characters and gives some examples reported by users like doubling the column size or setting the maxVarcharLength on the JobrepositoryFactoryBean.

I agree that this kind of thing is better off left to users to customize on their own

Ok, we are on the same page.

@davidkarlsen What do you think?

@davidkarlsen
Copy link
Contributor Author

I think it makes sense to use characters where possible - people reason about characters - not bytes. We had a case where the status rolled back because it could fit to the database - and the truncating in code will reason around characters rather than bytes.

Make it the easiest for people to use - if the 1st thing you need to do when you pick up a framework is to alter some low-level DDLs to make it work it is not a good UX.

@gavenkoa
Copy link

gavenkoa commented Oct 4, 2018

@mminella

Spring Batch also has code in it that truncates values before they are inserted in specific columns

My report:

https://jira.spring.io/browse/BATCH-2731

shows that truncating is done by Java String.length() regardless considering final byte length.

DB string storage space requirement varies depending on version & locale settings. Doubling storage resolve issue in 99% cases unless you get exception with all characters in U+0800 - U+FFFF range for which each character has 3 bytes in UTF-8. What is about tripling DB column space? ))

I my case I am not interested in storing of exception trace in BATCH table. I have ElasticSearch logging solution for that.

Instead of fixing character length I'd like Batch framework ignores this kind of error and continue processing instead of breaking batch job in the middle.

Otherwise it is unreliable to pass exception handling to Spring Batch framework and it is necessary to implement custom wrappers around Batch Tasklet interface to catch all Exceptions and to handle them safely.

@fmbenhassine
Copy link
Contributor

fmbenhassine commented Oct 4, 2018

@davidkarlsen

I think it makes sense to use characters where possible - people reason about characters - not bytes.

I agree on that, but if you choose to use oracle, it is oracle that makes you think about bytes or characters and make the decision, not Spring Batch. Other providers have a clear decision about that.

the truncating in code will reason around characters rather than bytes.

This argument made me re-think my position about this PR. But note that for a db provider like DB2 where you can specify column size only in bytes, the truncating being based on characters would still be an issue. On the other hand, if we change the truncating code to be based on bytes, the issue could still happen with db providers where the column size is specified in chars..

Make it the easiest for people to use - if the 1st thing you need to do when you pick up a framework is to alter some low-level DDLs to make it work it is not a good UX.

Well, I don't agree, the oracle DDL works out-of-the-box and tweaking it is not the first thing you need to do. You might need to adjust it only if your requirements are different from the defaults (column size, indexes, sequences, etc).


@gavenkoa

Thank you for the details. Indeed, truncating is based on characters, not bytes.

Doubling storage resolve issue in 99% cases unless you get exception with all characters in U+0800 - U+FFFF range for which each character has 3 bytes in UTF-8. What is about tripling DB column space? ))

As said in my previous comment, the DDL provides a default value for the column size but this size can be adjusted if necessary. The precise issue with Oracle is that we truncate the message at 2500 characters while the column size is defined in bytes. The change in this PR should fix the issue of BATCH-2091 (and the one you reported in BATCH-2731 too).

I my case I am not interested in storing of exception trace in BATCH table.

This is another discussion. I would be happy to discuss it in another issue if needed.


To keep it simple, since the problem is about oracle and only oracle provides the choice to use VARCHAR2(n) or VARCHAR2(n char), I think there is no harm in changing the DDL for oracle to use char. That would make both truncating code and column size consistent.

@fmbenhassine
Copy link
Contributor

Hi,

After further investigations, I was able to reproduce the issue (See here) and the changes in this PR do fix the issue. However, we need to add a migration script for people already using the old DDL in production. I added this script in a separate PR #650 on top of these changes so all the credits go to @davidkarlsen for the fix ! Thank you @davidkarlsen for your contribution.

Br,
Mahmoud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants