Bigquery upload #2

Open
wants to merge 17 commits into
base: tsv_export
22 changes: 19 additions & 3 deletions README.md
@@ -1,5 +1,5 @@
# CWTS ETL tooling
-Version: 8.0.0
+Version: 8.1.0

## Description

@@ -25,7 +25,7 @@ When writing new pipeline code or ETL-tooling functions, the `functions\variable
| `archive_pipeline` | v1.0.0 |
| `aws_download_folder` | v1.0.0 |
| `bcp_data` | v1.0.2 |
-| `check_errors` | v0.3.2 |
+| `check_errors` | v0.3.3 |
| `classification_create_classification` | v1.0.0 |
| `classification_create_labeling` | v1.0.0 |
| `classification_create_vosviewer_maps` | v1.0.0 |
@@ -36,14 +36,17 @@ When writing new pipeline code or ETL-tooling functions, the `functions\variable
| `credentials` | dev |
| `curl_download_file` | v1.3.0 |
| `executables` | v1.1.1 |
+| `export_database` | dev |
+| `export_table` | dev |
| `extract_noun_phrases` | v1.0.0 |
-| `folder` | v1.0.6 |
+| `folder` | v1.0.7 |
| `get_datetime` | v1.0.0 |
| `generate_database_documentation` | v0.1.0 |
| `grant_access_cwts_group` | v2.0.0 |
| `json_analyze_data` | v1.0.0 |
| `json_parse_data` | v1.1.1 |
| `load_database` | v1.0.0 |
+| `load_bigquery_table` | dev |
| `log_runtime` | v0.0.1 |
| `notify` | v1.0.0 |
| `notify_errors` | v0.1.0 |
@@ -90,6 +93,8 @@ When writing new pipeline code or ETL-tooling functions, the `functions\variable
- Add wait.bat :sleep_subprocess

### check_errors
+- v0.3.4
+- `%export_log_folder%` added for export_table function
- v0.3.3
- `%backup_log_folder%` added for backup-tooling
- v0.3.2
@@ -165,6 +170,10 @@ When writing new pipeline code or ETL-tooling functions, the `functions\variable
- v1.1.0
- rename `%read_data_exe%` to `%readdata_exe%`

+### export_database
+
+### export_table
+
### extract_noun_phrases

- v1.0.0
@@ -173,6 +182,11 @@ When writing new pipeline code or ETL-tooling functions, the `functions\variable

### folder

+- v1.0.8
+- add `%bigquery_log_folder%`
+- v1.0.7
+- add `%export_data_folder%`
+- add `%export_log_folder%`
- v1.0.6
- add `%publicationclassification_log_folder%`
- add `%publicationclassificationlabeling_log_folder%`
@@ -209,6 +223,8 @@ When writing new pipeline code or ETL-tooling functions, the `functions\variable
- v1.0.0
- The value of `%erase_previous%` should be set to `erase_previous` instead of `true`

+### load_bigquery_table
+
### load_database

- v1.0.0
2 changes: 2 additions & 0 deletions functions/check_errors.bat
@@ -35,9 +35,11 @@ set error_string=---------------------------------------------------------------

call :check_errors "Backup" "%backup_log_folder%" error
call :check_errors "BCP" "%bcp_log_folder%" error
+call :check_errors "Bigquery" "%bigquery_log_folder%" error
call :check_errors "Classification" "%classification_log_folder%" error
call :check_errors "Documentatie Generator" "%database_documentatie_generator_log_folder%" error
call :check_errors "Download" "%download_log_folder%" error
+call :check_errors "Export" "%export_log_folder%" error
call :check_errors "Json Parser" "%json_parser_log_folder%" error
call :check_errors "LargeFileSplitter" "%large_file_splitter_log_folder%" error
call :check_errors "NPExtractorDB" "%noun_phrase_extractor_log_folder%" error
68 changes: 68 additions & 0 deletions functions/csv_analyze_data.bat
@@ -0,0 +1,68 @@
@echo off
:: =======================================================================================
:: Main
::: Use csv_analyzer_exe to analyse csv files.
::: As this is usually called as an asynchronous process using `start` this script
::: sends a signal when the process has finished.

:: Global variables
::: csv_analyzer_sample_lines: Number of csv lines to use for type detection
::: csv_analyzer_output_columns: select string for types file output

:: Input variables
::: 1. input_file: csv file to analyze
::: 2. output_file: types file to write

:: Executables
::: csv_analyzer_exe
:: =======================================================================================
setlocal

set input_file=%~1
set output_file=%~2

call :check_variables 2 %*

echo %db_name% - analyze data
%csv_analyzer_exe% ^
--input_file "%input_file%" ^
--output_file "%output_file%" ^
%csv_analyzer_sample_lines_arg% ^
%csv_analyzer_output_columns_arg%

:: Send signal to waiting processes
call %functions_folder%\wait.bat :send %~f0

endlocal
goto:eof
:: =======================================================================================


:: =======================================================================================
:check_variables
:: =======================================================================================

:: Set functions_folder to location of this script
set functions_folder=%~dp0
:: Set programs_folder to relative location of this script
set programs_folder=%~dp0\..\programs

:: Get executable paths
call %programs_folder%\executables.bat

:: Check number of input variables
call %functions_folder%\variable.bat :check_parameters %*

:: Validate input variables
call %functions_folder%\variable.bat :check_file input_file
call %functions_folder%\variable.bat :check_variable output_file

if defined csv_analyzer_sample_lines (
set csv_analyzer_sample_lines_arg=--sample_size %csv_analyzer_sample_lines%
)
if defined csv_analyzer_output_columns (
set csv_analyzer_output_columns_arg=--output_columns "%csv_analyzer_output_columns%"
)

goto:eof
:: =======================================================================================
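Review note: the optional-argument handling in `:check_variables` is easy to get wrong in batch, so a small model may help when reviewing it. The sketch below is Python rather than batch and is only an illustration; the function name `build_analyzer_args` and the `env` mapping are hypothetical and not part of the tooling. It mirrors one rule from the script: `--sample_size` and `--output_columns` are passed only when the corresponding global variable is defined.

```python
from typing import List, Mapping


def build_analyzer_args(input_file: str, output_file: str,
                        env: Mapping[str, str]) -> List[str]:
    """Model of the argument list csv_analyze_data.bat hands to csv_analyzer_exe."""
    args = ["--input_file", input_file, "--output_file", output_file]

    # Batch `if defined` is false for an unset variable; an empty value
    # cannot exist in cmd, so "missing or empty" is treated as undefined here.
    sample_lines = env.get("csv_analyzer_sample_lines", "")
    if sample_lines:
        args += ["--sample_size", sample_lines]

    output_columns = env.get("csv_analyzer_output_columns", "")
    if output_columns:
        args += ["--output_columns", output_columns]

    return args


if __name__ == "__main__":
    print(build_analyzer_args("pub.tsv", "pub_types.tsv",
                              {"csv_analyzer_sample_lines": "1000"}))
```

Because the `*_arg` variables are only set when the global is defined, a value left over from a parent environment would still be passed along; clearing them before the `if defined` checks would avoid that.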
101 changes: 101 additions & 0 deletions functions/export_database.bat
@@ -0,0 +1,101 @@
:: =======================================================================================
:: Main
::: Export a database from sql server to tsv files

::: First run all scripts in export_sql_folder using the file name as table name
::: Then run the default export for all tables in the database for which there is
::: no exported tsv file yet.

:: global variables
::: server
::: export_table_include_header
::: export_table_include_types

:: input variables
::: 1. db_name: name of the database to export
::: 2. export_sql_folder: sql folder containing sql files corresponding to table names
::: which contain sql code to export the table
::: 3. output_folder: folder where the output files should be placed
::: 4. log_folder: log folder for this function
:: =======================================================================================
setlocal

set db_name=%~1
set export_sql_folder=%~2
set output_folder=%~3
set log_folder=%~4

call :check_variables 4 %*

set sqlcmd_exe=sqlcmd -S %server% -E -m 1 -y0

echo Export database %db_name%

set "table_query=select table_name from information_schema.tables order by table_name"
call %sqlcmd_exe% -d "%db_name%" -Q "set nocount on; %table_query%" -o "%output_folder%\table_export.conf"
if exist "%export_sql_folder%" (
for /f "delims=" %%f in ('dir /b /ON "%export_sql_folder%\*.sql"') do (
call :export_table "%export_sql_folder%\%%f"
)
)
for /f "usebackq" %%t in ("%output_folder%\table_export.conf") do (
call :export_table "%%t"
)

endlocal
goto:eof
:: =======================================================================================


:: =======================================================================================
:export_table
::: export the table if the exported table does not already exist
:: =======================================================================================
set table_or_file=%~1

if exist "%table_or_file%" (
for %%f in ("%table_or_file%") do set table_name=%%~nf
) else (
set table_name=%table_or_file%
)
set output_file=%output_folder%\%table_name%.tsv

if not exist "%output_file%" (
call %functions_folder%\export_table.bat ^
"%db_name%" ^
"%table_or_file%" ^
"%output_folder%" ^
"%log_folder%"
)
goto:eof
:: =======================================================================================


:: =======================================================================================
:check_variables
:: =======================================================================================

:: set functions_folder to location of this script
set functions_folder=%~dp0
:: set programs_folder to relative location of this script
set programs_folder=%~dp0\..\programs

:: get executable paths
call %programs_folder%\executables.bat

:: check number of input parameters
call %functions_folder%\variable.bat :check_parameters %*

:: validate global variables
call %functions_folder%\variable.bat :check_variable server

:: validate input variables
call %functions_folder%\variable.bat :check_variable db_name
call %functions_folder%\variable.bat :check_variable export_sql_folder
call %functions_folder%\variable.bat :create_folder output_folder
call %functions_folder%\variable.bat :create_folder log_folder
call %functions_folder%\variable.bat :default_variable export_table_include_header false
call %functions_folder%\variable.bat :default_variable export_table_include_types false

goto:eof
:: =======================================================================================
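Review note: the two-pass order described in the header (SQL scripts first, then a default export for every table that has no tsv yet) can be checked with a small model. This is a Python sketch, not the batch implementation; `plan_exports` and its parameters are hypothetical names, and the case-insensitive sort only approximates `dir /b /ON`.

```python
from pathlib import PureWindowsPath
from typing import Iterable, List, Set, Tuple


def plan_exports(sql_files: Iterable[str], tables: Iterable[str],
                 existing_tsv: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (table_name, source) pairs in the order they would be exported.

    `existing_tsv` holds the table names that already have a .tsv file.
    """
    # Windows file names are case-insensitive, so compare in lower case.
    done: Set[str] = {name.lower() for name in existing_tsv}
    plan: List[Tuple[str, str]] = []

    # Pass 1: custom SQL scripts; the file name (without .sql) is the table name.
    for sql_file in sorted(sql_files, key=str.lower):
        table_name = PureWindowsPath(sql_file).stem
        if table_name.lower() not in done:
            plan.append((table_name, sql_file))
            done.add(table_name.lower())

    # Pass 2: default export for every table that still has no output file.
    for table_name in tables:
        if table_name.lower() not in done:
            plan.append((table_name, table_name))
            done.add(table_name.lower())

    return plan


if __name__ == "__main__":
    for name, source in plan_exports([r"C:\sql\pub.sql"], ["author", "pub"], []):
        print(name, "<-", source)
```

One consequence of the skip-if-exists rule is that a rerun into a non-empty output folder exports nothing for tables that already have a tsv, even if the data has changed.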
120 changes: 120 additions & 0 deletions functions/export_table.bat
@@ -0,0 +1,120 @@
:: =======================================================================================
:: Main
::: Export a table from sql server to a tsv file

:: global variables
::: server
::: export_table_include_header
::: export_table_include_types

:: input variables
::: 1. db_name: name of the database to query
::: 2. table_or_file: name of the table to export
::: or a query file with the sql statement that outputs a single table
::: 3. output_folder: folder where the output files should be placed
::: 4. log_folder: log folder for this function
:: =======================================================================================
setlocal

set db_name=%~1
set table_or_file=%~2
set output_folder=%~3
set log_folder=%~4

call :check_variables 4 %*

echo Export table %db_name%..%table_name% (%table_query_file%)

call %powershell_7_exe% "& %functions_folder%\export_table\export_table.ps1" ^
"-server %server%" ^
"-db_name %db_name%" ^
"-table_name %table_name%" ^
"-input_file %table_query_file%" ^
"-output_file %output_file%" ^
"-log_folder %log_folder%" ^
"%no_header_arg%" ^
"%verbose_arg%"

if "%export_table_include_types%" == "true" (
call %powershell_7_exe% "& %functions_folder%\export_table\export_table.ps1" ^
"-server %server%" ^
"-db_name %db_name%" ^
"-table_name %table_name%" ^
"-input_file %functions_folder%\export_table\export_types.sql" ^
"-output_file %types_file%" ^
"-log_folder %log_folder%" ^
"%verbose_arg%"
)
if "%export_table_include_types%" == "analyze" (
set csv_analyzer_output_columns='%table_name%' as table_name, column_name, mssql_type as data_type, max_length, is_nullable
call %functions_folder%\csv_analyze_data.bat ^
"%output_file%" ^
"%types_file%"
)

endlocal
goto:eof
:: =======================================================================================


:: =======================================================================================
:check_variables
:: =======================================================================================

:: set functions_folder to location of this script
set functions_folder=%~dp0
:: set programs_folder to relative location of this script
set programs_folder=%~dp0\..\programs

:: get executable paths
call %programs_folder%\executables.bat

:: check number of input parameters
call %functions_folder%\variable.bat :check_parameters %*

:: validate global variables
call %functions_folder%\variable.bat :check_variable server

:: validate input variables
call %functions_folder%\variable.bat :check_variable db_name
call %functions_folder%\variable.bat :check_variable table_or_file
call %functions_folder%\variable.bat :create_folder output_folder
call %functions_folder%\variable.bat :create_folder log_folder
call %functions_folder%\variable.bat :default_variable export_table_include_header false
call %functions_folder%\variable.bat :default_variable export_table_include_types false

set types_output_folder=%output_folder%\types
if "%export_table_include_types%" == "true" (
call %functions_folder%\variable.bat :create_folder types_output_folder
)

if exist "%table_or_file%" (
rem Export by sql script mode
set table_query_file=%table_or_file%
for %%f in ("%table_or_file%") do set table_name=%%~nf
if "%export_table_include_types%" == "true" (
set export_table_include_types=analyze
)
) else (
rem Export table mode
set table_name=%table_or_file%
set table_query_file=%functions_folder%\export_table\export_table.sql
)
set output_file=%output_folder%\%table_name%.tsv
set types_file=%types_output_folder%\%table_name%_types.tsv

if "%verbose%" == "true" (
set verbose_arg=-Verbose
)
if "%export_table_include_header%" == "false" (
set no_header_arg=-NoHeader
)


call %functions_folder%\variable.bat :check_variable table_name
call %functions_folder%\variable.bat :check_file table_query_file
call %functions_folder%\variable.bat :check_variable output_file
call %functions_folder%\variable.bat :check_variable types_file

goto:eof
:: =======================================================================================
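Review note: `:check_variables` decides between "sql script mode" and "table mode" from a single argument, which is worth seeing in isolation. The Python sketch below models that decision; `resolve_export` and the `is_file` callback are hypothetical and exist only for illustration. It assumes the types file is meant to live directly in `%output_folder%\types`, the folder the script creates.

```python
from pathlib import PureWindowsPath
from typing import Callable, Dict


def resolve_export(table_or_file: str, output_folder: str, functions_folder: str,
                   include_types: str,
                   is_file: Callable[[str], bool]) -> Dict[str, str]:
    """Model of how export_table.bat derives its working variables."""
    out = PureWindowsPath(output_folder)

    if is_file(table_or_file):
        # SQL-script mode: the file name without extension is the table name,
        # and the script switches include_types from "true" to "analyze".
        table_name = PureWindowsPath(table_or_file).stem
        query_file = table_or_file
        if include_types == "true":
            include_types = "analyze"
    else:
        # Table mode: use the generic export query shipped with the function.
        table_name = table_or_file
        query_file = str(PureWindowsPath(functions_folder)
                         / "export_table" / "export_table.sql")

    return {
        "table_name": table_name,
        "table_query_file": query_file,
        "include_types": include_types,
        "output_file": str(out / f"{table_name}.tsv"),
        # Assumption: one `types` subfolder under the output folder.
        "types_file": str(out / "types" / f"{table_name}_types.tsv"),
    }


if __name__ == "__main__":
    result = resolve_export("pub", r"D:\out", r"D:\etl\functions",
                            "true", lambda path: False)
    for key, value in result.items():
        print(f"{key}: {value}")
```

Passing `is_file` as a callback keeps the model free of file-system access, so both modes can be exercised without creating files.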