Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change Sha1HashKey to CityHash128Key for generating PreparedStatement handle and metadata cache keys #717

Merged
merged 12 commits into from
Jul 9, 2018

Conversation

cheenamalhotra
Copy link
Member

Fixes issue #716

@codecov-io
Copy link

codecov-io commented Jun 6, 2018

Codecov Report

Merging #717 into dev will increase coverage by 0.09%.
The diff coverage is 69.77%.

Impacted file tree graph

@@             Coverage Diff              @@
##                dev     #717      +/-   ##
============================================
+ Coverage     48.52%   48.61%   +0.09%     
- Complexity     2747     2775      +28     
============================================
  Files           115      116       +1     
  Lines         27156    27342     +186     
  Branches       4547     4562      +15     
============================================
+ Hits          13177    13293     +116     
- Misses        11809    11879      +70     
  Partials       2170     2170
Flag Coverage Δ Complexity Δ
#JDBC42 48.13% <66.66%> (+0.13%) 2732 <34> (+28) ⬆️
#JDBC43 48.59% <69.77%> (+0.3%) 2775 <35> (+38) ⬆️
Impacted Files Coverage Δ Complexity Δ
...oft/sqlserver/jdbc/SQLServerParameterMetaData.java 24.14% <ø> (ø) 31 <0> (ø) ⬇️
...oft/sqlserver/jdbc/SQLServerPreparedStatement.java 54.64% <100%> (+0.03%) 211 <0> (ø) ⬇️
...m/microsoft/sqlserver/jdbc/SQLServerStatement.java 60.1% <100%> (ø) 136 <0> (ø) ⬇️
...om/microsoft/sqlserver/jdbc/ParsedSQLMetadata.java 100% <100%> (ø) 0 <0> (ø) ⬇️
...in/java/com/microsoft/sqlserver/jdbc/CityHash.java 66.66% <66.66%> (ø) 26 <26> (?)
...c/main/java/com/microsoft/sqlserver/jdbc/Util.java 62% <75%> (+0.66%) 89 <0> (+1) ⬆️
.../microsoft/sqlserver/jdbc/SQLServerConnection.java 48.6% <78.12%> (-0.1%) 333 <9> (ø)
...om/microsoft/sqlserver/jdbc/SQLServerBulkCopy.java 53.29% <0%> (-0.25%) 262% <0%> (-1%)
...in/java/com/microsoft/sqlserver/jdbc/IOBuffer.java 53.7% <0%> (-0.2%) 0% <0%> (ø)
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 49f95c9...b442832. Read the comment docs.

@@ -1096,3 +1096,364 @@ else if (databaseName.length() > 0)
return fullName.toString();
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a separate file for this?

Sha1HashKey(String s) {
bytes = getSha1Digest().digest(s.getBytes());
CityHash128Key(String s) {
segments = CityHash.cityHash128(s.getBytes(), 0, s.length());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance of this can be improved, especially in the case of large SQL strings. s.getBytes() ends up calling StringCoding.encode(...), which converts the internal String chars to bytes using the platform default encoding.

I recommend using s.getBytes(int srcBegin, int srcEnd, byte dst[], int dstBegin), which, while deprecated (but will never be removed), has substantially higher performance.

CityHash128Key(String s) {
   byte[] bytes = byte[s.length()];
   s.getBytes(0, s.length(), bytes, 0);
   segments = CityHash.cityHash128(bytes, 0, s.length());
}

Copy link

@RDearnaley RDearnaley Jun 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s.getBytes(int srcBegin, int srcEnd, byte dst[], int dstBegin) truncates the upper 8 bits of Java's internal Unicode-16 representation. Since this is SQL before inserting parameter values, it should be mostly 7-bit ASCII -- the only exception I can think of would be database names/table names/column names, at least some of which can be full Unicode. For European or other alphabetic languages, their Unicode-16 spaces tend to be 256 characters wide and the odds of this truncation causing a spurious collision seem very small -- but I'd be less confident if the language in question was, say, Chinese, that you couldn't ever have two SQLs that differed only by one ideogram in a table/column name, and the two ideograms in question unfortunately happened to have the same lower 8 bits. It's not very likely, but it's significantly less astronomically unlikely than a 128-bit hash collision.

Copy link

@RDearnaley RDearnaley Jun 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So yes, we need to avoid character reencoding, but I think we also need to avoid throwing away hash entropy while we do so. I was wondering about something like getBytes(Charset.UTF_16BE) -- if it matches the internal representation used by String, will it skip reencoding? So maybe something like:

   byte[] bytes = s.getBytes(Charset.UTF_16BE);  // avoid character reencoding
   segments = CityHash.cityHash128(bytes, 0, bytes.length);  // bytes.length = 2*s.length() + 2, due to Byte Order Mark?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RDearnaley Looking at the code of StringCoding.encode() and the various encoder subclasses, I don't find any optimization for UTF_16BE. 😞

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RDearnaley It appears only option would be a custom Charset and CharsetEncoder. It is possible, as there is a CharsetProvider SPI available. The "raw encoder" would simply "encode" chars as pairs of bytes (upper 8-bits, lower 8-bits). The cost is a 2x sized byte array due to the naive approach, but possibly worth it as the byte array is quickly garbage collected, or could potentially be cached as a WeakReference in a ThreadLocal inside of the CityHash128Key class.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How strange -- yes, taking a look in the Java code for them, it appears that none of UTF_16BE, UTF_16 etc make use of the obvious optimization that one of them has to be a no-op. I'm wondering if someone might have already created a raw double byte encoding to speed this up -- it seems a fairly obvious performance optimization for cases like this where you care about the entropy content rather than the specific values. If not, it might make a good addition to the Java version of CityHash

Copy link

@RDearnaley RDearnaley Jun 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is another possible approach whose speed would be worth testing:

private static byte[] toBytes( final String str )
{
    final int len = str.length();
    final char[] chars = str.getChars();
    final byte[] buff = new byte[2 * len];
    for (int i = 0, j = 0; i < len; i++) {
        buff[j++] = (byte) chars[i];  // Or str.charAt(i) and skipping str.getChars() might be faster?
        buff[j++] = (byte) (chars[i] >>> 8);  // Or >> might be faster?
    }
    return buff;
}

or just inline this since we only use it once. Old-school, I know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course it would work, but getChars() is going to create an additional copy, incurring a CPU hit as well as increasing GC pressure. Regarding inlining, probably unnecessary as the JIT will inline it when it gets hot.

I think a custom encoder would offer better performance, as the internal char array is passed without copy.

Copy link

@RDearnaley RDearnaley Jun 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the str.charAt(i) approach also be making a copy, or would that just go on the stack/in a register?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

str.charAt(i) is just a simple array access, with the value passed on the stack. If the compiler inlines the call, it could be a viable option. You'd have to find it in the output of -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:+PrintCompilation.


Sha1HashKey(String sql,
CityHash128Key(String sql,
String parametersDefinition) {
this(String.format("%s%s", sql, parametersDefinition));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String.format() is a terrible way to concatenate two Strings. If I wrote this in my original commit, I must have been high. Just change to this(sql + parametersDefinition).

Copy link

@RDearnaley RDearnaley Jun 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having taken another look, yes, this showed up in my YourKit runs -- in fact I saw it roughly as much as the character encoding issue we've been discussing above.

@cheenamalhotra cheenamalhotra added this to In progress in MSSQL JDBC Jun 26, 2018
1) Further speedups to prepared statement hashing

2) Caching of '?' chararacter positiobs in prepared statements to speed parameter substitution
@RDearnaley
Copy link

@cheenamalhotra I have sent you a pull request cheenamalhotra#11 with many of the additional speedups we've discussed above

cheenamalhotra and others added 7 commits June 30, 2018 17:02
Prepared statement performance fixes
Pull latest changes from Microsoft:dev branch
# Conflicts:
#	src/main/java/com/microsoft/sqlserver/jdbc/SQLServerConnection.java
#	src/main/java/com/microsoft/sqlserver/jdbc/SQLServerPreparedStatement.java
#	src/main/java/com/microsoft/sqlserver/jdbc/Util.java
added missing line for bulkcopy tests.
…by Rene)

Add String comparison with CityHashKey and fix test failures
@cheenamalhotra cheenamalhotra changed the title Change Sha1HashKey to CityHash128Key for Prepared Statement Caching Change Sha1HashKey to CityHash128Key for generating PreparedStatement handle and metadata cache keys Jul 4, 2018
@cheenamalhotra cheenamalhotra added this to the 7.0.0 milestone Jul 4, 2018

ParsedSQLCacheItem cacheItem = new ParsedSQLCacheItem(parsedSql, paramCount, procName, returnValueSyntax);
ParsedSQLCacheItem cacheItem = new ParsedSQLCacheItem (parsedSql, parameterPositions, procName, returnValueSyntax);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to have an extra space? Might need to apply formatter. Same line 267.

* SQL text to parse for positions of parameters to intialize.
*/
private static int[] locateParams(String sql) {
List<Integer> parameterPositions = new ArrayList<Integer>(0);
Copy link
Member

@rene-ye rene-ye Jul 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Template type inferred from LHS, not necessary to declare it a second time.

Proposed change:

LinkedList<Integer> parameterPositions = new LinkedList<>();

Copy link
Contributor

@peterbae peterbae Jul 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List<Integer> parameterPositions = new LinkedList<>();

would be better imo

int i = 0;
for (Integer parameterPosition : parameterPositions) {
result[i++] = parameterPosition;
}
Copy link
Member

@rene-ye rene-ye Jul 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a LinkedList to make adding/iterating faster OR use a parallel stream with ArrayList.

Edit: The stream option isn't very viable, and since order matters, wouldn't offer much performance gain. LinkedList is probably the way to go.

Edit2:

int[] result = parameterPosition.stream().mapToInt(Integer::valueOf).toArray()

is a good streaming alternative

}
else {
srcEnd = paramPositions[paramIndex];
}
Copy link
Member

@rene-ye rene-ye Jul 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5388 - 5393 can be replaced with

srcEnd = (paramIndex >= paramPositions.length) ? sqlSrc.length() : paramPositions[paramIndex];

RDearnaley
RDearnaley previously approved these changes Jul 7, 2018
Copy link

@RDearnaley RDearnaley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to test this and seem how the performance is with the (IMO unecessary for many customers, including us) new string comparison added is. If necessary we might ship with a version patched to remove the string comparison. But this is definitely an improvement, so I'm marking it 'Approve'.

cleaner code & logic
CityHash128Key(String s) {
unhashedString = s;
byte[] bytes = new byte[s.length()];
s.getBytes(0, s.length(), bytes, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getBytes is a deprecated method - suppress warning or replace?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suppress the warning -- there is an excellent performance reason why we're using it. And as Brett has pointed out, it's never going away, because it's sometimes the right thing to do.

@cheenamalhotra cheenamalhotra merged commit e2cf217 into microsoft:dev Jul 9, 2018
MSSQL JDBC automation moved this from In progress to Closed/Merged PRs Jul 9, 2018
cheenamalhotra added a commit that referenced this pull request Jul 31, 2018
* Update Snapshot for upcoming RTW release v7.0.0

* Change order of logic for checking the condition for using Bulk Copy API (#736)

Fix | Change order of logic for checking the condition for using Bulk Copy API (#736)

* Update CHANGELOG.md

* Merge pull request #732 from cheenamalhotra/module (Export driver in automatic module)

Introduce Automatic Module Name in POM.

* Update CHANGELOG.md

* Change Sha1HashKey to CityHash128Key for generating PreparedStatement handle and metadata cache keys (#717)

* Change Sha1HashKey to CityHash128Key

* Formatted code

* Prepared statement performance fixes

1) Further speedups to prepared statement hashing

2) Caching of '?' chararacter positiobs in prepared statements to speed parameter substitution

* String compare for hash keys
added missing line for bulkcopy tests.

* comment change

* Move CityHash class to a separate file

* spacings fixes
cleaner code & logic

* Add | Adding useBulkCopyForBatchInsert property to Request Boundary methods (#739)

* Apply the collation name change to UTF8SupportTest

* Package changes for CityHash with license information (#740)

* Reformatted Code + Updated formatter (#742)

* Reformatted Code + Updated formatter

* Fix policheck issue with 'Country' keyword (#745)

* Adding a new test for beginRequest()/endRequest() (#746)

* Add | Adding a new test to notify the developers to consider beginRequest()/endRequest() when adding a new Connection API

* Fix | Fixes for issues reported by static analysis tools (SonarQube + Fortify) (#747)

* handle buffer reading

for invalid buffer input by user

* Revert "handle buffer reading"

This reverts commit 11e2bf4.

* updated javadocs (#754)

* fixed some typos in javadocs (#760)

* API and JavaDoc changes for Spatial Datatypes (#752)

Add | API and JavaDoc changes for Spatial Datatypes (#752)

* Disallow non-parameterized queries for Bulk Copy API for batch insert (#756)

fix | Disallow non-parameterized queries for Bulk Copy API for batch insert (#756)

* Formatting | Change scope of unwanted Public APIs + Code Format (#757)

* Fix unwanted Public APIs.
* Updated formatter to add new line to EOF + formatted project

* Release | Release 7.0 changelog and version update (#748)

* Updated Changelog + release version changes

* Changelog entry updated as per review.

* Updated changelog for new changes

* Terminology update: request boundary declaration APIs

* Trigger Appveyor

* Update Samples and add new samples for new features (#761)

* Update Samples and add new Samples for new features

* Update samples from Peter

* Updated JavaDocs

* Switch to block comment

* Update License copyright (#767)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
MSSQL JDBC
  
Closed/Merged PRs
Development

Successfully merging this pull request may close these issues.

None yet

7 participants