Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BCP based load to TPROC-H Schema for SQL Server #593

Closed
sm-shaw opened this issue Sep 11, 2023 · 6 comments · Fixed by #611
Closed

Add BCP based load to TPROC-H Schema for SQL Server #593

sm-shaw opened this issue Sep 11, 2023 · 6 comments · Fixed by #611
Assignees
Labels
build enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@sm-shaw
Copy link
Contributor

sm-shaw commented Sep 11, 2023

Pull Requests #587 and #592 have added a BCP based load to the TPROC-C schema that improves loading performance. The same approach could be added to the TPROC-H schema to also improve performance there, using these pull requests as a template for the changes that need to be added.

@JoshInnis
Copy link
Contributor

Hello,

I work on the same team as Krithika Satish and I am using those PRs as templates to alter the TPROC-H loading performance.

The template from those PRs are working well for this load process to with one issue. The address field for the supplier and customer tables is created with this command: set address [ V_STR 25 ]

This command will allow addresses to be created with commas in it. Which causes issues with BCP. The current solution we have is to run the command: regsub -all "," $address " " address Which replaces the commas with spaces. This logic currently only runs when the BCP logic is run.

@sm-shaw
Copy link
Contributor Author

sm-shaw commented Sep 12, 2023

Hi,

We shouldn't change the data as that is what is derived from the TPC-H specification. However, I would be really surprised if BCP didn't work with commas in it, as I would think this would be a fairly typical requirement. Looking at the documentation:

https://learn.microsoft.com/en-us/sql/tools/bcp-utility?view=sql-server-ver16&redirectedfrom=MSDN
https://learn.microsoft.com/en-us/sql/relational-databases/import-export/specify-field-and-row-terminators-sql-server?view=sql-server-ver16

It looks like the "-t" flag is used to set the delimiter and so far for TPROC-C we are setting a delimiter as a comma i.e. -t ","

exec bcp $tableName IN $filePath -b 500000 -a 16000 -T -S $server -c -t ","

Therefore, for data with commas in it be able to change the delimiter for something else.

As an example of what do see the commit here: a0e96c0

In the procedures creating the data you could set a variable called delimiter and then do upvar 2 in the bcp procedure to reference the setting for a particular table being loaded and then this delimiter would be used in the bcp procedure as follows:

exec bcp $tableName IN $filePath -b 500000 -a 16000 -T -S $server -c -t "$delimiter"

To be sure it is set you could also use the following so if delimiter is not set then a comma is used by default.

upvar 2 delimiter delimiter
if {![info exists $delimiter]} {
set delimiter ","
}

For manual bulk loads I have previously used a pipe delimiter "|" https://www.hammerdb.com/docs/ch13s04.html

@JoshInnis
Copy link
Contributor

Would it make sense instead to use pipes '|' for all temporary files in the TPCH load? That would simplify the codebase a little bit.

@JoshInnis
Copy link
Contributor

The bcp utility has some restrictions to date formatting that are causing an issue. The function mk_time in tpchcommon-1.0.tm formats dates with the three-letter abbreviation of the month, i.e.. 2023-SEP-14. However, in testing, BCP requires the numbered syntax: 2023-09-14, otherwise a parsing error is thrown.

Currently I made a copy on the function mk_time called mk_time_bcp that makes the necessary alteration for BCP (TPCC uses datetime in its tables, whereas TPCH uses date, which is why this issue didn't appear earlier).

If this is an adequate solution, does this new function belong in tpchcommon-1.0.tm or in mssqlsolap.tcl as this is the only place in the codebase where this function is needed?

@sm-shaw
Copy link
Contributor Author

sm-shaw commented Sep 15, 2023

Would it make sense instead to use pipes '|' for all temporary files in the TPCH load? That would simplify the codebase a little bit.

Yes, certainly if this is easier for you, then you can use a pipe delimiter for TPROC-H. A comma was only used as that is what was used for the TPROC-C BCP build.

@sm-shaw
Copy link
Contributor Author

sm-shaw commented Sep 15, 2023

Currently I made a copy on the function mk_time called mk_time_bcp that makes the necessary alteration for BCP (TPCC uses datetime in its tables, whereas TPCH uses date, which is why this issue didn't appear earlier).

If this is an adequate solution, does this new function belong in tpchcommon-1.0.tm or in mssqlsolap.tcl as this is the only place in the codebase where this function is needed?

Good question and solution. Let's keep it in tpchcommon-1.0.tm so it is next to the original function for reference and maybe just add a very brief comment above why it is needed for a reminder in future.

@sm-shaw sm-shaw linked a pull request Oct 3, 2023 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants