Enable sse2 for CSV parsing. #2977

Merged 1 commit on Aug 29, 2018

Conversation

amosbird
Collaborator

Testing data

select 'aaaaaaaa,bbbbbbbb,cccccccc,dddddddd,eeeeeeee,ffffffff,gggg,hhh' from numbers(3000000) into outfile '/tmp/test.csv'

Testing command

echo "select count() from file('/tmp/test.csv', CSV, 'a String, b String, c String, d String, e String, f String, g String, h String') where not ignore(e)" | clickhouse-benchmark

Before

QPS: 1.317, RPS: 3949749.687, MiB/s: 478.380, result RPS: 1.317, result MiB/s: 0.000.
0.000%  0.704 sec.
10.000% 0.712 sec.
20.000% 0.718 sec.
30.000% 0.726 sec.
40.000% 0.739 sec.
50.000% 0.754 sec.
60.000% 0.770 sec.
70.000% 0.788 sec.
80.000% 0.798 sec.
90.000% 0.815 sec.
95.000% 0.826 sec.
99.000% 0.850 sec.
99.900% 0.857 sec.
99.990% 0.858 sec.

After

QPS: 1.533, RPS: 4598308.336, MiB/s: 556.932, result RPS: 1.533, result MiB/s: 0.000.
0.000%  0.626 sec.
10.000% 0.635 sec.
20.000% 0.639 sec.
30.000% 0.642 sec.
40.000% 0.643 sec.
50.000% 0.645 sec.
60.000% 0.649 sec.
70.000% 0.652 sec.
80.000% 0.658 sec.
90.000% 0.682 sec.
95.000% 0.710 sec.
99.000% 0.727 sec.
99.900% 0.733 sec.
99.990% 0.734 sec.
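
Read side by side, the two runs show roughly a 16% throughput gain (QPS 1.317 to 1.533) and a median latency drop from 0.754 sec to 0.645 sec. A small sketch of that arithmetic follows; the `ratio` helper and the constant names are hypothetical, and the inputs are copied from the outputs above.

```cpp
// Hypothetical helper for comparing the two benchmark runs quoted above.
constexpr double ratio(double after, double before)
{
    return after / before;
}

// Throughput: queries per second, "After" relative to "Before" (about 1.16).
constexpr double qps_gain = ratio(1.533, 1.317);

// Latency: 50th percentile in seconds, "After" relative to "Before" (about 0.86).
constexpr double median_ratio = ratio(0.645, 0.754);
```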

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

&& *next_pos != delimiter && *next_pos != '\r' && *next_pos != '\n') /// NOTE You can make a SIMD version.
++next_pos;

[&]() {
Member

Why a lambda? A plain code block {...} would be fine.

Collaborator Author

There is an early return.

Member

Ok
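
For readers following this thread: a plain `{...}` block cannot be left early without `goto` or a flag, whereas `return` inside an immediately invoked lambda exits only the lambda. A minimal sketch of that pattern, using a hypothetical `skip_to_separator` function that is not the code from this PR:

```cpp
#include <cstddef>

// Hypothetical helper, not the PR's code: advance `pos` to the first
// ',' or '\n' in [pos, end), or to `end` if neither occurs.
inline const char * skip_to_separator(const char * pos, const char * end)
{
    // Immediately invoked lambda: `return` leaves only the lambda,
    // so execution continues right after it in the enclosing function.
    [&]()
    {
        for (; pos != end; ++pos)
            if (*pos == ',' || *pos == '\n')
                return; // early exit from the search
    }();

    // Code placed here still runs after the early return above.
    return pos;
}
```

With a bare block, the same early exit would need a `break` out of every enclosing loop or an extra "found" flag.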


[&]() {
#if __SSE2__
auto rc = _mm_set1_epi8('\r');

Member

Shouldn't we include the corresponding ...intrin.h?

Collaborator Author

Yeah.
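
As a sketch of what the include could look like: `<emmintrin.h>` is the standard header that declares the SSE2 intrinsics used in this snippet (`__m128i`, `_mm_set1_epi8`, `_mm_cmpeq_epi8`). Which header the PR ended up including is not shown in this thread, and the `carriage_return_mask` helper below is hypothetical.

```cpp
#include <cstddef>

#if defined(__SSE2__)
    #include <emmintrin.h> // SSE2: __m128i, _mm_set1_epi8, _mm_cmpeq_epi8, _mm_movemask_epi8
#endif

// Hypothetical helper: bit i of the result is set when block[i] == '\r'.
// `block` must point to at least 16 readable bytes.
inline unsigned carriage_return_mask(const char * block)
{
#if defined(__SSE2__)
    const __m128i rc = _mm_set1_epi8('\r');
    const __m128i bytes = _mm_loadu_si128(reinterpret_cast<const __m128i *>(block));
    return static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(bytes, rc)));
#else
    // Scalar fallback for targets without SSE2.
    unsigned mask = 0;
    for (std::size_t i = 0; i < 16; ++i)
        if (block[i] == '\r')
            mask |= 1u << i;
    return mask;
#endif
}
```

Guarding the include with the same `__SSE2__` check as the code keeps non-x86 builds compiling.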

Member

@alexey-milovidov left a comment

That's really cool 👍

__v2du a = reinterpret_cast<__v2du>(_mm_cmpeq_epi8(bytes, rc));
__v2du b = reinterpret_cast<__v2du>(_mm_cmpeq_epi8(bytes, nc));
__v2du c = reinterpret_cast<__v2du>(_mm_cmpeq_epi8(bytes, dc));
__m128i eq = reinterpret_cast<__m128i>(a | b | c);

Member

Ok. But I don't understand how this is better than writing two _mm_or_si128 calls instead.
Isn't __v2du less portable or less documented?

Collaborator Author

Hmm, lemme check the output assembly. gcc folks haven't answered back yet.

Collaborator Author

@amosbird, Aug 28, 2018

They're exactly the same. I'll change to _mm_or_si128

https://la.wentropy.com/j6p8
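
A rough sketch of the `_mm_or_si128` variant discussed above, for illustration only. The function name `find_field_end`, the unaligned loads, and the use of `__builtin_ctz` (a GCC/Clang builtin) are assumptions made here, not details taken from the PR.

```cpp
#include <cstddef>

#if defined(__SSE2__)
    #include <emmintrin.h>
#endif

// Hypothetical helper: return a pointer to the first byte in [pos, end) equal to
// `delimiter`, '\r' or '\n', or `end` if there is none.
inline const char * find_field_end(const char * pos, const char * end, char delimiter)
{
#if defined(__SSE2__)
    const __m128i rc = _mm_set1_epi8('\r');
    const __m128i nc = _mm_set1_epi8('\n');
    const __m128i dc = _mm_set1_epi8(delimiter);

    // Process 16 bytes per iteration while a full block remains.
    for (; end - pos >= 16; pos += 16)
    {
        const __m128i bytes = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pos));
        // Two _mm_or_si128 calls combine the three byte-wise comparisons.
        const __m128i eq = _mm_or_si128(
            _mm_or_si128(_mm_cmpeq_epi8(bytes, rc), _mm_cmpeq_epi8(bytes, nc)),
            _mm_cmpeq_epi8(bytes, dc));
        const int mask = _mm_movemask_epi8(eq);
        if (mask != 0)
            return pos + __builtin_ctz(static_cast<unsigned>(mask)); // index of first match
    }
#endif
    // Scalar tail, and the whole search when SSE2 is unavailable.
    for (; pos != end; ++pos)
        if (*pos == delimiter || *pos == '\r' || *pos == '\n')
            break;
    return pos;
}
```

Sticking to the documented intrinsics avoids relying on compiler-specific vector types such as `__v2du`.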
