use VARCHAR( n char) instead of VARCHAR(n) so semantics are used and not... #222

davidkarlsen · 2013-09-02T08:21:53Z

... number of bytes. This is better when using multibyte charsets like UTF8

https://jira.springsource.org/browse/BATCH-2091


davidkarlsen · 2013-09-09T11:31:51Z

ping?

davidkarlsen · 2013-10-15T16:09:53Z

Why don't you comment on this or pick it up?

mminella · 2013-10-15T22:01:03Z

spring-batch-core/src/main/resources/org/springframework/batch/core/schema-oracle10g.sql

-	JOB_NAME VARCHAR2(100) NOT NULL,
-	JOB_KEY VARCHAR2(32) NOT NULL,
+	JOB_NAME VARCHAR2(100 char) NOT NULL,
+	JOB_KEY VARCHAR2(32 char) NOT NULL,


This is a string generated by batch so I'm not sure we would need this feature here.

mminella · 2013-10-15T22:04:14Z

I've made a couple of notes. In addition to those, I'm currently looking to see if any of the other databases we support have a similar function. I'd rather make a change like this all at once.

pivotal-issuemaster · 2016-07-21T14:19:14Z

@davidkarlsen Please sign the Contributor License Agreement!

Click here to manually synchronize the status of this Pull Request.

See the FAQ for frequently asked questions.

davidkarlsen · 2016-07-21T21:58:51Z

@pivotal-issuemaster done!

pivotal-issuemaster · 2017-01-01T03:00:44Z

@davidkarlsen Thank you for signing the Contributor License Agreement!

fmbenhassine · 2018-09-05T09:40:12Z

There is an open issue for this: https://jira.spring.io/browse/BATCH-2750

We will consider merging this PR in the upcoming v4.1.

fmbenhassine · 2018-10-03T10:34:52Z

> I'm currently looking to see if any of the other databases we support have a similar function. I'd rather make a change like this all at once.

I checked the data types for each database provider we support (except the embedded ones) and most of them use characters by default:

However, some providers use bytes by default:

Sybase: http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc38151.1510/html/iqrefbb/X315931.htm
DB2: https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/intro/src/tpc/db2z_stringdatatypes.html
SQLFire: I could not find the info (discontinued? See: https://www.vmware.com/be/products/pivotal-sqlfire.html)

Moreover, this default can vary from one version to another for each database provider. For example, the default has changed for MySQL between version 4 and 5 as per: https://stackoverflow.com/questions/1997540/mysql-varchar-lengths-and-utf-8.

I think it is almost unmanageable to test the DDL against each version of each database provider. So in hindsight, I'm not sure if this kind of issue should be fixed in the DDL or left as an adjustment to the user, as said in Appendix A.9:

> Many users find that simply changing the schema to double the length of the VARCHAR columns is enough

This appendix clearly mentions that some adjustments (column lengths, additional indexes, etc) might be necessary if the default DDL is not enough.

For some providers like DB2, it is not possible to specify whether the size of the VARCHAR is in bytes or chars (unlike Oracle with VARCHAR2(30 char)): only bytes are accepted. So if a DB2 user runs into this issue, they have to increase the size of the column; we can do nothing to "fix" our DDL.

BTW, even the fix in this PR may not be sufficient: changing EXIT_MESSAGE from VARCHAR2(2500) to VARCHAR2(2500 char) may still fail if the stack trace exceeds 2500 characters. What do we do in that case? Update the DDL again to increase the size in chars? That would not solve the problem. So I think this kind of adjustment should be left to the user.

@davidkarlsen @mminella Do you agree?

mminella · 2018-10-03T14:31:37Z

Spring Batch also has code in it that truncates values before they are inserted in specific columns. I agree that this kind of thing is better off left to users to customize on their own. We could add a note in the docs explicitly about the idea that they may need to increase field sizes when using multibyte characters.

fmbenhassine · 2018-10-03T19:37:07Z

Thank you Michael. This section of the docs is explicit about multi-byte characters and gives some examples reported by users, like doubling the column size or setting the maxVarCharLength on the JobRepositoryFactoryBean.

> I agree that this kind of thing is better off left to users to customize on their own

Ok, we are on the same page.

@davidkarlsen What do you think?

davidkarlsen · 2018-10-03T20:05:05Z

I think it makes sense to use characters where possible - people reason about characters - not bytes. We had a case where the status rolled back because it could not fit in the database - and the truncating in code will reason around characters rather than bytes.

Make it the easiest for people to use - if the 1st thing you need to do when you pick up a framework is to alter some low-level DDLs to make it work it is not a good UX.

gavenkoa · 2018-10-04T12:24:37Z

@mminella

> Spring Batch also has code in it that truncates values before they are inserted in specific columns

My report:

https://jira.spring.io/browse/BATCH-2731

shows that truncation is based on Java String.length(), without considering the final byte length.

The DB string storage space requirement varies depending on version & locale settings. Doubling the storage resolves the issue in 99% of cases, unless you get an exception with all characters in the U+0800 - U+FFFF range, for which each character takes 3 bytes in UTF-8. What about tripling the DB column space? ))
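
To make the arithmetic above concrete, here is a minimal sketch that compares Java's String.length() with the UTF-8 encoded size for a few sample characters. The class and method names are made up for illustration, and it assumes the database stores text as UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {

    // Number of bytes the string occupies once encoded as UTF-8
    // (assumption: the database character set is UTF-8).
    public static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "a";            // U+0061, 1 byte in UTF-8
        String latin = "\u00e9";       // U+00E9, 2 bytes in UTF-8
        String cjk = "\u4e2d";         // U+4E2D, inside U+0800 - U+FFFF, 3 bytes
        String emoji = "\uD83D\uDE00"; // U+1F600, a surrogate pair in Java, 4 bytes

        for (String s : new String[] {ascii, latin, cjk, emoji}) {
            System.out.println(s.length() + " char(s) -> " + utf8Bytes(s) + " byte(s)");
        }
    }
}
```

Note that for characters outside the BMP, String.length() returns 2 while the UTF-8 encoding takes 4 bytes, so a factor of three between String.length() and the byte size is the worst case under this assumption.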

In my case I am not interested in storing the exception trace in the BATCH table. I have an ElasticSearch logging solution for that.

Instead of fixing the character length, I'd like the Batch framework to ignore this kind of error and continue processing instead of breaking the batch job in the middle.

Otherwise it is unreliable to pass exception handling to the Spring Batch framework, and it is necessary to implement custom wrappers around the Batch Tasklet interface to catch all Exceptions and handle them safely.

fmbenhassine · 2018-10-04T13:22:37Z

@davidkarlsen

> I think it makes sense to use characters where possible - people reason about characters - not bytes.

I agree on that, but if you choose to use Oracle, it is Oracle that makes you think about bytes or characters and make the decision, not Spring Batch. Other providers have a clear decision about that.

> the truncating in code will reason around characters rather than bytes.

This argument made me re-think my position about this PR. But note that for a db provider like DB2, where you can specify the column size only in bytes, the truncating being based on characters would still be an issue. On the other hand, if we change the truncating code to be based on bytes, the issue could still happen with db providers where the column size is specified in chars.
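
For reference, byte-based truncation could look roughly like the sketch below. This is a hypothetical helper, not the Spring Batch truncation code; it assumes UTF-8 storage and stops at code point boundaries so that no character is cut in half.

```java
import java.nio.charset.StandardCharsets;

public class ByteAwareTruncation {

    // Hypothetical helper, not part of Spring Batch: returns the longest prefix of s
    // whose UTF-8 encoding fits in maxBytes, never splitting a code point.
    public static String truncateToUtf8Bytes(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int codePoint = s.codePointAt(i);
            // UTF-8 size of this code point: 1, 2, 3 or 4 bytes.
            int size = codePoint < 0x80 ? 1 : codePoint < 0x800 ? 2 : codePoint < 0x10000 ? 3 : 4;
            if (bytes + size > maxBytes) {
                break;
            }
            bytes += size;
            i += Character.charCount(codePoint);
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        String message = "ab\u4e2d\u4e2d"; // 4 chars, 1 + 1 + 3 + 3 = 8 bytes in UTF-8
        String cut = truncateToUtf8Bytes(message, 6);
        // Prints "3 chars, 5 bytes": the second 3-byte character no longer fits.
        System.out.println(cut.length() + " chars, "
                + cut.getBytes(StandardCharsets.UTF_8).length + " bytes");
    }
}
```

As noted above, such a helper only matches columns sized in bytes; with a column sized in characters it would cut multi-byte messages earlier than necessary.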

> Make it the easiest for people to use - if the 1st thing you need to do when you pick up a framework is to alter some low-level DDLs to make it work it is not a good UX.

Well, I don't agree: the Oracle DDL works out-of-the-box and tweaking it is not the first thing you need to do. You might need to adjust it only if your requirements are different from the defaults (column size, indexes, sequences, etc).

@gavenkoa

Thank you for the details. Indeed, truncating is based on characters, not bytes.

> Doubling the storage resolves the issue in 99% of cases, unless you get an exception with all characters in the U+0800 - U+FFFF range, for which each character takes 3 bytes in UTF-8. What about tripling the DB column space? ))

As said in my previous comment, the DDL provides a default value for the column size but this size can be adjusted if necessary. The precise issue with Oracle is that we truncate the message at 2500 characters while the column size is defined in bytes. The change in this PR should fix the issue of BATCH-2091 (and the one you reported in BATCH-2731 too).
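
The mismatch can be illustrated with a simplified model: truncate at 2500 characters, then check the result against a limit counted in bytes and a limit counted in characters. The class below is a stand-in written for this illustration (not the actual Spring Batch code); it assumes UTF-8 storage and ignores any other limit the database may impose.

```java
import java.nio.charset.StandardCharsets;

public class ExitMessageFit {

    static final int MAX_LENGTH = 2500; // the limit discussed in this thread

    // Character-based truncation, a simplified stand-in for what the thread describes.
    public static String truncateChars(String s) {
        return s.length() > MAX_LENGTH ? s.substring(0, MAX_LENGTH) : s;
    }

    // True if the value fits a column limited to MAX_LENGTH bytes (assuming UTF-8).
    public static boolean fitsByteColumn(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length <= MAX_LENGTH;
    }

    // True if the value fits a column limited to MAX_LENGTH characters.
    public static boolean fitsCharColumn(String s) {
        return s.length() <= MAX_LENGTH;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 3000; i++) {
            sb.append('\u00e9'); // 2 bytes per character in UTF-8
        }
        String truncated = truncateChars(sb.toString()); // 2500 chars, 5000 bytes
        System.out.println("fits 2500-byte limit: " + fitsByteColumn(truncated));      // false
        System.out.println("fits 2500-character limit: " + fitsCharColumn(truncated)); // true
    }
}
```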

> In my case I am not interested in storing the exception trace in the BATCH table.

This is another discussion. I would be happy to discuss it in another issue if needed.

To keep it simple, since the problem is about Oracle and only Oracle provides the choice to use VARCHAR2(n) or VARCHAR2(n char), I think there is no harm in changing the DDL for Oracle to use char. That would make both the truncating code and the column size consistent.

fmbenhassine · 2018-10-09T20:27:45Z

Hi,

After further investigation, I was able to reproduce the issue (see here) and the changes in this PR do fix the issue. However, we need to add a migration script for people already using the old DDL in production. I added this script in a separate PR #650 on top of these changes so all the credit goes to @davidkarlsen for the fix! Thank you @davidkarlsen for your contribution.

Br,
Mahmoud

use VARCHAR( n char) instead of VARCHAR(n) so semantics are used and …

a36223f

…not number of bytes. This is better when using multibyte charsets like UTF8

mminella reviewed Oct 15, 2013

mminella force-pushed the master branch from 57ae2a1 to 04b339b Compare September 16, 2014 16:03

fmbenhassine added waiting-for-feedback and removed ready-to-review labels Oct 3, 2018

fmbenhassine added the pr-for: bug label Oct 6, 2018

fmbenhassine mentioned this pull request Oct 9, 2018

Use characters instead of bytes to support multi-bytes characters for oracle db #650

Closed

fmbenhassine closed this Oct 9, 2018

fmbenhassine mentioned this pull request Feb 20, 2023

SQLCODE=-302 truncateExitDescription in JdbcStepExecutionDao.java #4309

Closed
